
Write to UTF-8 file in Python

I'm really confused with the codecs.open function. When I do:

file = codecs.open("temp", "w", "utf-8")
file.write(codecs.BOM_UTF8)
file.close()

It gives me the error

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

If I do:

file = open("temp", "w")
file.write(codecs.BOM_UTF8)
file.close()

It works fine.

The question is: why does the first method fail? And how do I insert the BOM?

If the second method is the correct way of doing it, what is the point of using codecs.open(filename, "w", "utf-8")?

Don’t use a BOM in UTF-8. Please.
@tchrist Huh? Why not?
@SalmanPK BOM is not needed in UTF-8 and only adds complexity (e.g. you can't just concatenate BOM'd files and result with valid text). See this Q&A; don't miss the big comment under Q

Zanon

I believe the problem is that codecs.BOM_UTF8 is a byte string, not a Unicode string. I suspect the file handler is trying to guess what you really mean based on "I'm meant to be writing Unicode as UTF-8-encoded text, but you've given me a byte string!"

Try writing the Unicode string for the byte order mark (i.e. Unicode U+FEFF) directly, so that the file just encodes that as UTF-8:

import codecs

file = codecs.open("lol", "w", "utf-8")
file.write(u'\ufeff')
file.close()

(That seems to give the right answer - a file with bytes EF BB BF.)

EDIT: S. Lott's suggestion of using "utf-8-sig" as the encoding is a better one than explicitly writing the BOM yourself, but I'll leave this answer here as it explains what was going wrong before.
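For readers on Python 3, here is a minimal sketch of the same distinction, assuming Python 3 and a throwaway temp file (the file name bom_demo.txt is only for illustration). A text-mode file rejects the byte string codecs.BOM_UTF8 outright, while the Unicode character U+FEFF is accepted and encoded as the three BOM bytes.

```python
import codecs
import os
import tempfile

# Hypothetical scratch location, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "bom_demo.txt")

with open(path, "w", encoding="utf-8") as f:
    try:
        # codecs.BOM_UTF8 is bytes; a text-mode file only accepts str.
        f.write(codecs.BOM_UTF8)
        rejected = False
    except TypeError:
        rejected = True
    # The BOM as a Unicode character; the file encodes it as UTF-8.
    f.write("\ufeff")

# Read the raw bytes back to see what actually landed on disk.
with open(path, "rb") as f:
    raw = f.read()

print(rejected, raw)
```

The failed write leaves nothing in the file, so the only bytes on disk come from the U+FEFF character.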


Warning: codecs.open and the built-in open are not the same. If you do "from codecs import open", the name open will NOT behave the same as the built-in open you would get by simply typing "open".
you can also use codecs.open('test.txt', 'w', 'utf-8-sig') instead
I'm getting "TypeError: an integer is required (got type str)". I don't understand what we're doing here. Can someone please help? I need to append a string (paragraph) to a text file. Do I need to convert that into an integer first before writing?
@Mugen: The exact code I've written works fine as far as I can see. I suggest you ask a new question showing exactly what code you've got, and where the error occurs.
@Mugen you need to call codecs.open instead of just open
Eric O Lebigot

Read the following: http://docs.python.org/library/codecs.html#module-encodings.utf_8_sig

Do this

with codecs.open("test_output", "w", "utf-8-sig") as temp:
    temp.write("hi mom\n")
    temp.write(u"This has ♭")

The resulting file is UTF-8 with the expected BOM.
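As a rough check of what "utf-8-sig" does, here is a sketch assuming Python 3 and the built-in open (rather than codecs.open); the temp file name is made up for the example. It writes with "utf-8-sig", then reads the file back three ways: as raw bytes, decoded with "utf-8-sig", and decoded with plain "utf-8".

```python
import codecs
import os
import tempfile

# Hypothetical scratch file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "test_output")
text = "hi mom\nThis has \u266d"  # \u266d is the flat sign

# Writing with utf-8-sig prepends the BOM once, at the start of the file.
with open(path, "w", encoding="utf-8-sig") as f:
    f.write(text)

with open(path, "rb") as f:
    raw = f.read()

# Reading with utf-8-sig strips the BOM again.
with open(path, encoding="utf-8-sig") as f:
    decoded_sig = f.read()

# Reading with plain utf-8 keeps the BOM as a leading U+FEFF character.
with open(path, encoding="utf-8") as f:
    decoded_plain = f.read()

print(raw[:3], decoded_sig == text, decoded_plain[0] == "\ufeff")
```

The last read is the reason to decode BOM'd files with "utf-8-sig": otherwise the U+FEFF character ends up at the start of your text.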


Thanks. That worked (Windows 7 x64, Python 2.7.5 x64). This solution works well when you open the file in mode "a" (append).
This didn't work for me, Python 3 on Windows. I had to do this instead with open(file_name, 'wb') as bomfile: bomfile.write(codecs.BOM_UTF8) then re-open the file for append.
Maybe add temp.close() ?
@user2905353: not required; this is handled by context management of open.
Kamran Gasimov

In Python 3 this is very simple; no extra library is needed:

with open('text.txt', 'w', encoding='utf-8') as f:
    f.write(text)

tzot

@S-Lott gives the right procedure, but expanding on the Unicode issues, the Python interpreter can provide more insights.

Jon Skeet is right (unusual) about the codecs module - it contains byte strings:

>>> import codecs
>>> codecs.BOM
'\xff\xfe'
>>> codecs.BOM_UTF8
'\xef\xbb\xbf'
>>> 

Picking another nit, the BOM has a standard Unicode name, and it can be entered as:

>>> bom= u"\N{ZERO WIDTH NO-BREAK SPACE}"
>>> bom
u'\ufeff'

It is also accessible via unicodedata:

>>> import unicodedata
>>> unicodedata.lookup('ZERO WIDTH NO-BREAK SPACE')
u'\ufeff'
>>> 
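The interpreter sessions above are from Python 2. A small sketch of the same checks, assuming Python 3, where codecs.BOM_UTF8 is a bytes object and every str is already Unicode:

```python
import codecs
import unicodedata

# The BOM character, entered by its standard Unicode name.
bom_char = "\N{ZERO WIDTH NO-BREAK SPACE}"

# Look the name back up from the character.
bom_name = unicodedata.name(bom_char)

# Encoding the character as UTF-8 yields the bytes stored in codecs.BOM_UTF8.
bom_utf8 = bom_char.encode("utf-8")

print(repr(bom_char), bom_name, bom_utf8, type(codecs.BOM_UTF8).__name__)
```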

Ricardo

I use the *nix file command to detect the charset of an unknown file and convert it to a UTF-8 file:

# -*- encoding: utf-8 -*-

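# convert a file of unknown encoding to UTF-8

import codecs
import subprocess

file_location = "jumper.sub"
# Ask file(1) for the encoding. subprocess replaces the Python 2-only
# commands module and avoids passing the file name through a shell.
file_encoding = subprocess.check_output(
    ['file', '-b', '--mime-encoding', file_location]).decode('ascii').strip()

# The with blocks close both files even if decoding fails part-way.
with codecs.open(file_location, 'r', file_encoding) as file_stream:
    with codecs.open(file_location + "b", 'w', 'utf-8') as file_output:
        for line in file_stream:
            file_output.write(line)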

Use # coding: utf8 instead of # -*- coding: utf-8 -*-; it is far easier to remember.
I am really interested in seeing something like that working on Windows
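For Windows, where the file command is usually not available, one possible substitute is to try a short list of candidate encodings in order. This is a different technique from the answer above (a guess-and-fallback rather than real detection), and the function name convert_to_utf8, the candidate list, and the sample file names are all assumptions made for this sketch. It assumes Python 3.

```python
import os
import tempfile

# Assumed candidate list; adjust it to the encodings you expect to meet.
# latin-1 never fails to decode, so it only makes sense as the last entry.
CANDIDATE_ENCODINGS = ("utf-8-sig", "cp1252", "latin-1")


def convert_to_utf8(src, dst, candidates=CANDIDATE_ENCODINGS):
    """Decode src with the first candidate that works and write dst as UTF-8.

    Returns the encoding that was used. This is a guess, not detection:
    a wrong candidate can decode without error and still give wrong text.
    """
    with open(src, "rb") as f:
        raw = f.read()
    for enc in candidates:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue
        # newline="" keeps the original line endings untouched.
        with open(dst, "w", encoding="utf-8", newline="") as f:
            f.write(text)
        return enc
    raise ValueError("no candidate encoding could decode %r" % src)


# Small demonstration with a cp1252-encoded sample file.
folder = tempfile.mkdtemp()
src_path = os.path.join(folder, "sample.sub")
dst_path = os.path.join(folder, "sample.subb")
with open(src_path, "wb") as f:
    f.write("caf\u00e9".encode("cp1252"))

used = convert_to_utf8(src_path, dst_path)
with open(dst_path, "rb") as f:
    converted = f.read()
print(used, converted)
```

If you need actual detection instead of an ordered guess, a dedicated charset-detection library is probably a better fit than this sketch.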
celsowm

Python >= 3.5 using pathlib (pathlib itself arrived in 3.4, but Path.write_text was added in 3.5):

import pathlib
pathlib.Path("text.txt").write_text(text, encoding='utf-8') #or utf-8-sig for BOM
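To see the difference between the two encodings mentioned in the comment, here is a short sketch, assuming Python 3.5+ and two throwaway files in a temp directory (the file names are invented for the example). It writes the same text with "utf-8" and "utf-8-sig" and compares the raw bytes.

```python
import codecs
import pathlib
import tempfile

# Hypothetical scratch directory and file names.
folder = pathlib.Path(tempfile.mkdtemp())
text = "This has \u266d"

plain = folder / "plain.txt"
signed = folder / "with_bom.txt"

plain.write_text(text, encoding="utf-8")       # no BOM
signed.write_text(text, encoding="utf-8-sig")  # BOM prepended

plain_bytes = plain.read_bytes()
signed_bytes = signed.read_bytes()

print(plain_bytes)
print(signed_bytes)
```

The only difference between the two files should be the three leading BOM bytes.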

RogerZ

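If you are using pandas I/O methods like DataFrame.to_excel(), older pandas versions accept an encoding parameter (it was later deprecated, so check the documentation for your version), e.g.

df.to_excel("somefile.xlsx", sheet_name="export", encoding='utf-8')  # df is a pandas DataFrame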

This works for most international characters I believe.