ChatGPT解决这个技术问题 Extra ChatGPT

How to take column-slices of dataframe in pandas

I load some machine learning data from a CSV file. The first 2 columns are observations and the remaining columns are features.

Currently, I do the following:

data = pandas.read_csv('mydata.csv')

which gives something like:

data = pandas.DataFrame(np.random.rand(10,5), columns = list('abcde'))

I'd like to slice this dataframe in two dataframes: one containing the columns a and b and one containing the columns c, d and e.

It is not possible to write something like

observations = data[:'c']
features = data['c':]

I'm not sure what the best method is. Do I need a pd.Panel?

By the way, I find dataframe indexing pretty inconsistent: data['a'] is permitted, but data[0] is not. On the other side, data['a':] is not permitted but data[0:] is. Is there a practical reason for this? This is really confusing if columns are indexed by Int, given that data[0] != data[0:1]

DataFrame is inherently a dict-like object when you do df[...], however some conveniences, e.g. df[5:10] were added for selecting rows (pandas.pydata.org/pandas-docs/stable/…)
So what this inconsistency is a design decision in favor of convenience? Alright, but it definitely needs to be more explicit for beginners!
The design consideration of supporting convenience makes the learning curve much steep. I wish that there are better documentation for the beginning just presenting a consistent interface. For example, just focus on the ix interface.

T
Ted Petrou

2017 Answer - pandas 0.20: .ix is deprecated. Use .loc

See the deprecation in the docs

.loc uses label based indexing to select both rows and columns. The labels being the values of the index or the columns. Slicing with .loc includes the last element.

Let's assume we have a DataFrame with the following columns: foo, bar, quz, ant, cat, sat, dat.

# selects all rows and all columns beginning at 'foo' up to and including 'sat'
df.loc[:, 'foo':'sat']
# foo bar quz ant cat sat

.loc accepts the same slice notation that Python lists do for both row and columns. Slice notation being start:stop:step

# slice from 'foo' to 'cat' by every 2nd column
df.loc[:, 'foo':'cat':2]
# foo quz cat

# slice from the beginning to 'bar'
df.loc[:, :'bar']
# foo bar

# slice from 'quz' to the end by 3
df.loc[:, 'quz'::3]
# quz sat

# attempt from 'sat' to 'bar'
df.loc[:, 'sat':'bar']
# no columns returned

# slice from 'sat' to 'bar'
df.loc[:, 'sat':'bar':-1]
sat cat ant quz bar

# slice notation is syntatic sugar for the slice function
# slice from 'quz' to the end by 2 with slice function
df.loc[:, slice('quz',None, 2)]
# quz cat dat

# select specific columns with a list
# select columns foo, bar and dat
df.loc[:, ['foo','bar','dat']]
# foo bar dat

You can slice by rows and columns. For instance, if you have 5 rows with labels v, w, x, y, z

# slice from 'w' to 'y' and 'foo' to 'ant' by 3
df.loc['w':'y', 'foo':'ant':3]
#    foo ant
# w
# x
# y

if your using apply with lambda row, as in: df['newcol'] = df.apply(lambda row: myfunc(row), axis=1) then you can in myfunc(row){ ... use row['foo':'ant']. for example (according to this StackOverflow answer), inside the myfunc you can evalueate if any of these is non-numeric: row['foo':'ant'].apply(lambda x: isinstance(x, str)).any()
.iloc should be used now, instead of .loc. Fix that, and I'll upvote it.
@craned - that is not correct. From the Pandas documentation: .loc is primarily label based, but may also be used with a boolean array. .loc will raise KeyError when the items are not found. A similar statement is made about .iloc excep it specifically refers to index based slicing. In other words in this example, he used label based indexing and .loc is the correct choice (basically the only choice). If you want to slice by position -rows 5:10 for instance, then use .iloc
what if is we want, ['foo', 'ant', 'cat', 'sat', 'dat'], without bar and quz, it is possible to do something like: ['foo', 'ant': 'dat'], what would be the quickest way?
R
Rick supports Monica

Note: .ix has been deprecated since Pandas v0.20. You should instead use .loc or .iloc, as appropriate.

The DataFrame.ix index is what you want to be accessing. It's a little confusing (I agree that Pandas indexing is perplexing at times!), but the following seems to do what you want:

>>> df = DataFrame(np.random.rand(4,5), columns = list('abcde'))
>>> df.ix[:,'b':]
      b         c         d         e
0  0.418762  0.042369  0.869203  0.972314
1  0.991058  0.510228  0.594784  0.534366
2  0.407472  0.259811  0.396664  0.894202
3  0.726168  0.139531  0.324932  0.906575

where .ix[row slice, column slice] is what is being interpreted. More on Pandas indexing here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-advanced


Careful that ranges in pandas include both end points, ie >>>data.ix[:, 'a':'c'] a b c 0 0.859192 0.881433 0.843624 1 0.744979 0.427986 0.177159
Multiple columns cab be passed like this df.ix[:,[0,3,4]]
@Karmel: There looks like a copy/paste error in the output above. Perhaps you meant df.ix[:,'b':'e']?
It's better to use loc instead of ix: stackoverflow.com/a/31593712/4323
Old answers like this need to be deleted. .ix is deprecated and should never be used.
S
Shankar ARUL

Lets use the titanic dataset from the seaborn package as an example

# Load dataset (pip install seaborn)
>> import seaborn.apionly as sns
>> titanic = sns.load_dataset('titanic')

using the column names

>> titanic.loc[:,['sex','age','fare']]

using the column indices

>> titanic.iloc[:,[2,3,6]]

using ix (Older than Pandas <.20 version)

>> titanic.ix[:,[‘sex’,’age’,’fare’]]

or

>> titanic.ix[:,[2,3,6]]

using the reindex method

>> titanic.reindex(columns=['sex','age','fare'])

In pandas 0.20: .ix is deprecated.
deprecation warning: Passing list-likes to .loc or [] with any missing label will raise KeyError in the future, you can use .reindex() as an alternative. when you use df.loc[:, some_list_of_columns]
m
moldovean

Also, Given a DataFrame

data

as in your example, if you would like to extract column a and d only (e.i. the 1st and the 4th column), iloc mothod from the pandas dataframe is what you need and could be used very effectively. All you need to know is the index of the columns you would like to extract. For example:

>>> data.iloc[:,[0,3]]

will give you

          a         d
0  0.883283  0.100975
1  0.614313  0.221731
2  0.438963  0.224361
3  0.466078  0.703347
4  0.955285  0.114033
5  0.268443  0.416996
6  0.613241  0.327548
7  0.370784  0.359159
8  0.692708  0.659410
9  0.806624  0.875476

B
Brendan Wood

You can slice along the columns of a DataFrame by referring to the names of each column in a list, like so:

data = pandas.DataFrame(np.random.rand(10,5), columns = list('abcde'))
data_ab = data[list('ab')]
data_cde = data[list('cde')]

So if I want all the data starting from column 'b', I need to find the index of 'b' in data.columns and do data[data.columns[1:]] ? That's the canonical way to operate?
You mean you want to select all the columns from 'b' onwards?
Yes, or selecting all the columns in a given range.
I'm pretty new to pandas myself, so I can't speak as to what's considered canonical. I would do it like you said, but use the get_loc function on data.columns to determine the index of column 'b' or whatever.
u
user2023507

And if you came here looking for slicing two ranges of columns and combining them together (like me) you can do something like

op = df[list(df.columns[0:899]) + list(df.columns[3593:])]
print op

This will create a new dataframe with first 900 columns and (all) columns > 3593 (assuming you have some 4000 columns in your data set).


Great, somebody has tried this out... I was wondering, this 0:899 that gets the first 900 columns.. why did they do it like this? This does not feel like Python at all. When using ranges in python it is always 'until' not 'until and included'
S
Surya

Here's how you could use different methods to do selective column slicing, including selective label based, index based and the selective ranges based column slicing.

In [37]: import pandas as pd    
In [38]: import numpy as np
In [43]: df = pd.DataFrame(np.random.rand(4,7), columns = list('abcdefg'))

In [44]: df
Out[44]: 
          a         b         c         d         e         f         g
0  0.409038  0.745497  0.890767  0.945890  0.014655  0.458070  0.786633
1  0.570642  0.181552  0.794599  0.036340  0.907011  0.655237  0.735268
2  0.568440  0.501638  0.186635  0.441445  0.703312  0.187447  0.604305
3  0.679125  0.642817  0.697628  0.391686  0.698381  0.936899  0.101806

In [45]: df.loc[:, ["a", "b", "c"]] ## label based selective column slicing 
Out[45]: 
          a         b         c
0  0.409038  0.745497  0.890767
1  0.570642  0.181552  0.794599
2  0.568440  0.501638  0.186635
3  0.679125  0.642817  0.697628

In [46]: df.loc[:, "a":"c"] ## label based column ranges slicing 
Out[46]: 
          a         b         c
0  0.409038  0.745497  0.890767
1  0.570642  0.181552  0.794599
2  0.568440  0.501638  0.186635
3  0.679125  0.642817  0.697628

In [47]: df.iloc[:, 0:3] ## index based column ranges slicing 
Out[47]: 
          a         b         c
0  0.409038  0.745497  0.890767
1  0.570642  0.181552  0.794599
2  0.568440  0.501638  0.186635
3  0.679125  0.642817  0.697628

### with 2 different column ranges, index based slicing: 
In [49]: df[df.columns[0:1].tolist() + df.columns[1:3].tolist()]
Out[49]: 
          a         b         c
0  0.409038  0.745497  0.890767
1  0.570642  0.181552  0.794599
2  0.568440  0.501638  0.186635
3  0.679125  0.642817  0.697628

Please try to avoid just dumping code as an answer and try to explain what it does and why. Your code might not be obvious for people who do not have the relevant coding experience. Please edit your answer to include clarification, context and try to mention any limitations, assumptions or simplifications in your answer.
C
Camilo

Another way to get a subset of columns from your DataFrame, assuming you want all the rows, would be to do:
data[['a','b']] and data[['c','d','e']]
If you want to use numerical column indexes you can do:
data[data.columns[:2]] and data[data.columns[2:]]


M
Max Kleiner

Its equivalent

 >>> print(df2.loc[140:160,['Relevance','Title']])
 >>> print(df2.ix[140:160,[3,7]])

V
Vladimir Gavrysh

if Data frame look like that:

group         name      count
fruit         apple     90
fruit         banana    150
fruit         orange    130
vegetable     broccoli  80
vegetable     kale      70
vegetable     lettuce   125

and OUTPUT could be like

   group    name  count
0  fruit   apple     90
1  fruit  banana    150
2  fruit  orange    130

if you use logical operator np.logical_not

df[np.logical_not(df['group'] == 'vegetable')]

more about

https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.logic.html

other logical operators

logical_and(x1, x2, /[, out, where, ...]) Compute the truth value of x1 AND x2 element-wise. logical_or(x1, x2, /[, out, where, casting, ...]) Compute the truth value of x1 OR x2 element-wise. logical_not(x, /[, out, where, casting, ...]) Compute the truth value of NOT x element-wise. logical_xor(x1, x2, /[, out, where, ..]) Compute the truth value of x1 XOR x2, element-wise.


M
Mykola Zotko

You can use the method truncate

df = pd.DataFrame(np.random.rand(10, 5), columns = list('abcde'))

df_ab = df.truncate(before='a', after='b', axis=1)
df_cde = df.truncate(before='c', axis=1)