How can I strip the whitespace from Pandas DataFrame headers?

python pandas whitespace

I am parsing data from an Excel file that has extra white space in some of the column headings.

When I check the columns of the resulting dataframe, with df.columns, I see:

Index(['Year', 'Month ', 'Value'])
                     ^
#                    Note the unwanted trailing space on 'Month '

Consequently, I can't do:

df["Month"]

Because it will tell me the column is not found, as I asked for "Month", not "Month ".

My question, then, is how can I strip out the unwanted white space from the column headings?

This answer should be accepted, not the current one.

TomAugspurger

You can give functions to the rename method. The str.strip() method should do what you want:

In [5]: df
Out[5]: 
   Year  Month   Value
0     1       2      3

[1 rows x 3 columns]

In [6]: df.rename(columns=lambda x: x.strip())
Out[6]: 
   Year  Month  Value
0     1      2      3

[1 rows x 3 columns]

Note that this returns a new DataFrame, which is what you see printed as output, but the change is not applied to df itself. To keep the change, either use this in a method chain or reassign the df variable:

df = df.rename(columns=lambda x: x.strip())
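For anyone who wants to reproduce this outside an interactive session, here is a minimal sketch. The frame is a hypothetical stand-in for the Excel data in the question (same column names, one made-up row), and the variable names `unchanged` and `month` are only for illustration.

```python
import pandas as pd

# Hypothetical frame mirroring the question: 'Month ' has a trailing space.
df = pd.DataFrame({"Year": [1], "Month ": [2], "Value": [3]})

# rename returns a new DataFrame; without reassignment df is left as it was.
df.rename(columns=lambda x: x.strip())
unchanged = df.columns.tolist()

# Reassigning keeps the stripped headers, so df["Month"] now resolves.
df = df.rename(columns=lambda x: x.strip())
month = df["Month"]
print(df.columns.tolist())
```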

Henry Ecker

Since version 0.16.1 you can just call .str.strip on the columns:

df.columns = df.columns.str.strip()

Here is a small example:

In [5]:
df = pd.DataFrame(columns=['Year', 'Month ', 'Value'])
print(df.columns.tolist())
df.columns = df.columns.str.strip()
df.columns.tolist()

['Year', 'Month ', 'Value']
Out[5]:
['Year', 'Month', 'Value']

Timings

In[26]:
df = pd.DataFrame(columns=[' year', ' month ', ' day', ' asdas ', ' asdas', 'as ', '  sa', ' asdas '])
df
Out[26]: 
Empty DataFrame
Columns: [ year,  month ,  day,  asdas ,  asdas, as ,   sa,  asdas ]


%timeit df.rename(columns=lambda x: x.strip())
%timeit df.columns.str.strip()
1000 loops, best of 3: 293 µs per loop
10000 loops, best of 3: 143 µs per loop

So str.strip is roughly 2x faster here, and I expect it to scale better for larger DataFrames.
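If you are not in IPython, the same comparison can be sketched with the standard timeit module. The helper names `via_rename` and `via_str_strip` and the loop count are assumptions for illustration, and the numbers will differ by machine and pandas version. Keep in mind that the two statements do slightly different work: rename builds a whole new DataFrame, while columns.str.strip() only computes a new Index.

```python
import timeit

import pandas as pd

cols = [' year', ' month ', ' day', ' asdas ', ' asdas', 'as ', '  sa', ' asdas ']
df = pd.DataFrame(columns=cols)


def via_rename():
    # Builds a new DataFrame with stripped column labels.
    return df.rename(columns=lambda x: x.strip())


def via_str_strip():
    # Only computes the stripped Index; df itself is not modified.
    return df.columns.str.strip()


n = 200  # arbitrary loop count, chosen to keep the run short
t_rename = timeit.timeit(via_rename, number=n) / n
t_strip = timeit.timeit(via_str_strip, number=n) / n
print(f"rename:    {t_rename * 1e6:.1f} µs per call")
print(f"str.strip: {t_strip * 1e6:.1f} µs per call")
```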

Eric Duminil

If you export from Excel to CSV and read the file into a pandas DataFrame, you can specify:

skipinitialspace=True

when calling pd.read_csv.

From the documentation:

skipinitialspace : bool, default False
    Skip spaces after delimiter.

This doesn't skip trailing spaces per the OP's example. There doesn't seem to be a reasonable way to do this, particularly for multi-row headers which create MultiIndexes. It can be done, but it should be easier.

@TerryBrown: It doesn't help in the general case, that's true, and also why my answer begins with an "if". I've often seen whitespaces in Dataframes imported from CSV, that's why I mentioned it.
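To make the point from these comments concrete, here is a small sketch with made-up CSV text (the `csv_text` contents are an assumption, not the OP's file). skipinitialspace handles the space after each delimiter, and a follow-up strip on the columns takes care of the trailing space that remains.

```python
import io

import pandas as pd

# Made-up CSV with a space after each comma and a trailing space on 'Month '.
csv_text = "Year, Month , Value\n1, 2, 3\n"

df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)
after_read = df.columns.tolist()  # leading spaces are gone, trailing one is not

# Strip whatever whitespace is left on the headers.
df.columns = df.columns.str.strip()
print(after_read)
print(df.columns.tolist())
```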

Jervine Lovesu

Actually, you can do that with

df.rename(str.strip, axis='columns')

Passing a function together with axis='columns' is shown in the pandas documentation for DataFrame.rename.

loicgasser

If you are looking for an unbreakable way to do it, I would suggest:

data_frame.rename(columns=lambda x: x.strip() if isinstance(x, str) else x, inplace=True)

Upvoted! This is where my mind went, since I like to strip whitespace early in my process flow and handle incoming data with variable headers (NaNs, ints, etc.). Using the isinstance(var, type) check slows it down, sure, but how many headers are we talking about? Here I'd trade a little computation for the flexibility, since I don't foresee bringing in a header set of more than 25 columns, and definitely not more than 500.
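The same isinstance idea can be extended to multi-row headers, which another comment in this thread mentions. This is only a sketch: `strip_label` and `strip_headers` are hypothetical helper names, not pandas functions, and the two sample frames are made up.

```python
import pandas as pd


def strip_label(label):
    """Strip a label if it is a string; leave ints, NaN, etc. untouched."""
    return label.strip() if isinstance(label, str) else label


def strip_headers(df):
    """Return a copy of df with whitespace stripped from string column labels."""
    cols = df.columns
    if isinstance(cols, pd.MultiIndex):
        # Each column label is a tuple; clean every part of it.
        cleaned = [tuple(strip_label(part) for part in tup) for tup in cols]
        new_cols = pd.MultiIndex.from_tuples(cleaned, names=cols.names)
    else:
        new_cols = pd.Index([strip_label(c) for c in cols], name=cols.name)
    out = df.copy()
    out.columns = new_cols
    return out


# Made-up frames: one with mixed header types, one with a two-level header.
mixed = pd.DataFrame([[1, 2, 3]], columns=[" Year", 2020, "Value "])
multi = pd.DataFrame(
    [[1, 2]],
    columns=pd.MultiIndex.from_tuples([("A ", " x"), ("A ", "y ")]),
)

print(strip_headers(mixed).columns.tolist())
print(strip_headers(multi).columns.tolist())
```

Unlike the inplace=True version above, this returns a new frame and leaves the input alone, which tends to be easier to use in a method chain.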
