Pandas df.to_csv("file.csv" encode="utf-8") still gives trash characters for minus sign

python csv utf-8 pandas

I've read something about a Python 2 limitation with respect to Pandas' to_csv( ... etc ...). Have I hit it? I'm on Python 2.7.3

This turns out trash characters for ≥ and - when they appear in strings. Aside from that the export is perfect.

df.to_csv("file.csv", encoding="utf-8")

Is there any workaround?

df.head() is this:

demography  Adults ≥49 yrs  Adults 18−49 yrs at high risk||  \
state                                                           
Alabama                 32.7                             38.6   
Alaska                  31.2                             33.2   
Arizona                 22.9                             38.8   
Arkansas                31.2                             34.0   
California              29.8                             38.8

csv output is this

state,  Adults â‰¥49 yrs,   Adults 18âˆ’49 yrs at high risk||
0,  Alabama,    32.7,   38.6
1,  Alaska, 31.2,   33.2
2,  Arizona,    22.9,   38.8
3,  Arkansas,31.2,  34
4,  California,29.8, 38.8

the whole code is this:

import pandas
import xlrd
import csv
import json

df = pandas.DataFrame()
dy = pandas.DataFrame()
# first merge all this xls together


workbook = xlrd.open_workbook('csv_merger/vaccoverage.xls')
worksheets = workbook.sheet_names()


for i in range(3,len(worksheets)):
    dy = pandas.io.excel.read_excel(workbook, i, engine='xlrd', index=None)
    i = i+1
    df = df.append(dy)

df.index.name = "index"

df.columns = ['demography', 'area','state', 'month', 'rate', 'moe']

#Then just grab month = 'May'

may_mask = df['month'] == "May"
may_df = (df[may_mask])

#then delete some columns we dont need

may_df = may_df.drop('area', 1)
may_df = may_df.drop('month', 1)
may_df = may_df.drop('moe', 1)


print may_df.dtypes #uh oh, it sees 'rate' as type 'object', not 'float'.  Better change that.

may_df = may_df.convert_objects('rate', convert_numeric=True)

print may_df.dtypes #that's better

res = may_df.pivot_table('rate', 'state', 'demography')
print res.head()


#and this is going to spit out an array of Objects, each Object a state containing its demographics
res.reset_index().to_json("thejson.json", orient='records')
#and a .csv for good measure
res.reset_index().to_csv("thecsv.csv", orient='records', encoding="utf-8")

Give us an example of your data, becuase I can't reproduce "trash" characters.

Doesn't even have to be your data. A simple, complete example that reproduces the problem is what we want: df = pd.DataFrame({"A": ['a', '≥']}); df.to_csv('test.csv'), works fine for me. Post your python version as well.

Huh, I try @TomAugspurger 's simple test but I get "SyntaxError: Non-ASCII character '\xe2' in file test.py on line 5, but no encoding declared; see python.org/peps/pep-0263.html for details" Needless to say, I don't understand the page they point me to. I mean, I understand I need to edit my python install … but I'm on deadline elsewhere now, you know?

Either your python or your terminal encoding is set to expect only ascii characters. You can read here for a way to set your encoding that may work as a temporary solution.

Yes I think that will have to do. I am scared to update to Python 3 in the middle of a project anyway.

Mark Tolonen

Your "bad" output is UTF-8 displayed as CP1252.

On Windows, many editors assume the default ANSI encoding (CP1252 on US Windows) instead of UTF-8 if there is no byte order mark (BOM) character at the start of the file. While a BOM is meaningless to the UTF-8 encoding, its UTF-8-encoded presence serves as a signature for some programs. For example, Microsoft Office's Excel requires it even on non-Windows OSes. Try:

df.to_csv('file.csv',encoding='utf-8-sig')

That encoder will add the BOM.

this solution encoding='utf-8-sig' worked for me. also encoding='utf-16' should work

This issue was driving me crazy! Thank you very much for this awesome answer! Of interesting note, doing df.to_excel('file.csv') generates an excel file that has no issue with Excel. Seems this issue is only pertaining to CSV files...

@gaborous CSV files are text files and need the encoded BOM hint for Excel to open it correctly. Did you mean df.to_excel('file.xls')? I get an error using df.to_excel('file.csv'). XLS and XLSX files are in an Excel format already so Excel should definitely have no problem opening them.

@MarkTolonen Yes for the file extension, that was a typo on my part. Indeed for the BOM, I learnt the hard way, but this is not obvious (why then offer the option to save as 'utf-8' into CSV without a BOM?).

@xjcl For some narrow definition of reasonable. Windows dealt with encodings and then went Unicode before UTF-8 was invented, and backward compatibility was important.

germ

encoding='utf-8-sig does not work for me. Excel reads the special characters fine now, but the Tab separators are gone! However, encoding='utf-16 does work correctly: special characters OK and Tab separators work. This is the solution for me.

The ultimate lesson is that you need to understand the basics of encodings to be able to specify the correct one. Perhaps see the Stack Overflow character-encoding tag info page which contains a brief intro with pointers to more information.

utf-16 didn't work in my case. Hit AttributeError: Can only use .str accessor with string values! once opened at df[name].str.contains(regx, regex=True, na=False). Back to 'utf-8-sig' here on Windows.

Pandas df.to_csv("file.csv" encode="utf-8") still gives trash characters for minus sign

Follow WeChat

Want to stay one step ahead of the latest teleworks?

相似问题

Platform

Support

Contact US