
"for line in..." results in UnicodeDecodeError: 'utf-8' codec can't decode byte

Here is my code,

for line in open('u.item'):
    # Read each line

Whenever I run this code it gives the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2892: invalid continuation byte

I tried to solve this and add an extra parameter in open(). The code looks like:

for line in open('u.item', encoding='utf-8'):
    # Read each line

But again it gives the same error. What should I do then?

Badly encoded data I would assume.
Or just not UTF-8 data.
We had this error with msgpack when using Python 3 instead of Python 2.7. For us, the course of action was to stay with Python 2.7.

Peter Mortensen

As suggested by Mark Ransom, I found the right encoding for that problem. The encoding was ISO-8859-1, so replacing open('u.item', encoding='utf-8') with open('u.item', encoding='ISO-8859-1') will solve the problem.


Explicit is better than implicit (PEP 20).
The trick is that ISO-8859-1 (Latin-1) is an 8-bit character set in which every byte value maps to a valid character, so decoding can never fail. The result may not be usable, but it works if you just want to ignore the problem!
I had the same issue: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 32: invalid continuation byte. I used Python 3.6.5 to install the AWS CLI, and aws --version failed with this error. So I had to edit /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/configparser.py and change the signature to def read(self, filenames, encoding="ISO-8859-1"):
Is there an automatic way of detecting encoding?
@OrangeSherbet I implemented detection using chardet. Here's the one-liner (after import chardet): chardet.detect(open(in_file, 'rb').read())['encoding']. Check out this answer for details: stackoverflow.com/a/3323810/615422
mkrieger1

The following also worked for me. ISO-8859-1 will save you a lot of trouble, mainly if you are using speech recognition APIs.

Example:

file = open('../Resources/' + filename, 'r', encoding="ISO-8859-1")

You may be correct that the OP is reading ISO 8859-1, as can be deduced from the 0xe9 (é) in the error message, but you should explain why your solution works. The reference to speech recognition APIs does not help.

Your file doesn't actually contain UTF-8 encoded data; it contains some other encoding. Figure out what that encoding is and use it in the open call.

In Windows-1252 encoding, for example, the 0xe9 would be the character é.
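This claim is easy to check in the interpreter. A minimal sketch (the byte value comes straight from the error message above):

```python
raw = b'\xe9'

# In UTF-8, 0xe9 starts a three-byte sequence, so a lone 0xe9 is invalid
try:
    raw.decode('utf-8')
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# In Windows-1252 (and ISO-8859-1), 0xe9 is the single character 'é'
char = raw.decode('windows-1252')

print(valid_utf8)  # False
print(char)        # é
```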


So, how can I find out what encoding it is? I am using Linux.
There is no way to do that that always works, but see the answer to this question: stackoverflow.com/questions/436220/…
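In the absence of a reliable detector, one common heuristic is to try a list of candidate encodings in order and keep the first one that decodes without error. A sketch (the candidate list is an assumption; adapt it to the data you expect):

```python
def guess_encoding(data, candidates=('utf-8', 'windows-1252', 'iso-8859-1')):
    """Return the first candidate encoding that decodes `data` cleanly.

    This is only a heuristic: 8-bit encodings like ISO-8859-1 accept
    any byte at all, so they should come last in the candidate list.
    """
    for enc in candidates:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

# 0xe9 is 'é' in the 8-bit encodings but invalid as UTF-8 here,
# so the helper falls through to windows-1252
print(guess_encoding(b'caf\xe9'))  # windows-1252
```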

Try this to read using Pandas:

pd.read_csv('u.item', sep='|', names=m_cols, encoding='latin-1')

Not sure why you're suggesting Pandas. The solution is setting the correct encoding, which you've chanced upon here.
'latin-1' is the same as 'ISO-8859-1'?
@PeterMortensen yes it is, Wikipedia confirms it. They both produce the same output when used with decode in Python as well.
@AlastairMcCormack one more late comment, 'latin-1' will always read a file without error because there are no invalid bytes in that encoding, even if it produces the wrong characters. It is the only encoding in Python with that property.
@MarkRansom I'm not sure about that :) What's an invalid byte in any 8bit code page? Surely, all the iso-8859 code pages will accept any byte?
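This point is easy to verify: ISO-8859-1 assigns a character to all 256 byte values, while some other single-byte code pages (Windows-1252, for instance) leave a few bytes undefined, and Python's codecs raise on those. A minimal check:

```python
# Every possible byte value, 0x00 through 0xff
all_bytes = bytes(range(256))

# latin-1 maps each byte 1:1 to the code point with the same number,
# so decoding can never fail
text = all_bytes.decode('latin-1')
print(len(text))  # 256

# Windows-1252, by contrast, leaves bytes such as 0x81 undefined
try:
    b'\x81'.decode('windows-1252')
    cp1252_accepts_everything = True
except UnicodeDecodeError:
    cp1252_accepts_everything = False
print(cp1252_accepts_everything)  # False
```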

This works:

open('filename', encoding='latin-1')

Or:

open('filename', encoding="ISO-8859-1")

Depends on what you mean by "works". If you mean avoids exceptions that's true, because it's the only encoding that doesn't have invalid bytes or sequences. Doesn't mean you'll get the proper characters though.

If you are using Python 2, the following will be the solution:

import io
for line in io.open("u.item", encoding="ISO-8859-1"):
    # Do something

In Python 2, the built-in open() doesn't accept an encoding parameter; using one gives the following error:

TypeError: 'encoding' is an invalid keyword argument for this function


But this is version 3
Yeah I know. I thought it might be helpful for the people using Python 2
Worked for me in Python3 as well
In case you want something easier to remember, 'ISO-8859-1' is also known as 'latin-1' or 'latin1'.
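You can confirm that these names are aliases of the same codec with the standard codecs module:

```python
import codecs

# All three names resolve to the same canonical codec
for alias in ('ISO-8859-1', 'latin-1', 'latin1'):
    print(alias, '->', codecs.lookup(alias).name)  # iso8859-1 in every case
```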

You could resolve the problem with:

for line in open(your_file_path, 'rb'):

'rb' reads the file in binary mode: you get raw bytes back and no decoding happens, so no UnicodeDecodeError can be raised. Any decoding is then up to you.
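Since binary mode yields bytes objects rather than strings, you have to decode each line yourself. A minimal sketch with a per-line fallback (the sample file and the fallback encoding are assumptions for demonstration):

```python
import tempfile

# Create a sample file containing one line that is not valid UTF-8
# (0xe9 is 'é' in latin-1)
with tempfile.NamedTemporaryFile(delete=False, suffix='.txt') as f:
    f.write(b'plain ascii line\n')
    f.write(b'caf\xe9\n')
    path = f.name

lines = []
with open(path, 'rb') as f:          # binary mode: no implicit decoding
    for raw in f:
        try:
            lines.append(raw.decode('utf-8'))
        except UnicodeDecodeError:
            lines.append(raw.decode('latin-1'))  # fallback for odd lines

print(lines)  # ['plain ascii line\n', 'café\n']
```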



You can try this way:

open('u.item', encoding='utf8', errors='ignore')

This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post. - From Review
@MartenCatcher Yeah, but it helps future visitors to the question. More explanation would make the answer much better, but I believe it serves a better purpose as an answer than as a comment.
What is the intent? Ignoring errors? What are the consequences?
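The consequence of errors='ignore' is silent data loss: any byte sequence that is not valid UTF-8 is simply dropped. A small demonstration, with errors='replace' shown for comparison:

```python
data = b'caf\xe9 au lait'   # 0xe9 is 'é' in latin-1, invalid as UTF-8 here

# 'ignore' silently drops the undecodable byte
print(data.decode('utf-8', errors='ignore'))   # caf au lait

# 'replace' at least leaves a visible U+FFFD marker
print(data.decode('utf-8', errors='replace'))  # caf� au lait

# The correct encoding loses nothing
print(data.decode('latin-1'))                  # café au lait
```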
Alain Cherpin

Based on another Stack Overflow question and previous answers in this post, I would like to add some help in finding the right encoding.

If your script runs on a Linux OS, you can get the encoding with the file command:

file --mime-encoding <filename>

Here is a python script to do that for you:

import shutil
import subprocess
import sys

def find_encoding(fname):
    """Find the encoding of a file using the 'file' command.
    """

    # Find the full path of the 'file' command
    file_cmd = shutil.which('file')
    if file_cmd is None:
        print("Unable to find the 'file' command", file=sys.stderr)
        return None

    # Run 'file --mime-encoding' to get the MIME encoding
    file_run = subprocess.run([file_cmd, '--mime-encoding', fname],
                              stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE)
    if file_run.returncode != 0:
        print(file_run.stderr.decode(), file=sys.stderr)
        return None

    # The output looks like 'filename: encoding'; return the encoding only
    return file_run.stdout.decode().split()[-1]

# test
if __name__ == '__main__':
    if len(sys.argv) < 2:
        print("Usage: {} <filename>".format(sys.argv[0]))
        sys.exit(1)
    print("Encoding of {}: {}".format(sys.argv[1], find_encoding(sys.argv[1])))

I was looking for an answer, and interestingly you answered 7 hours ago a question asked 8 years ago. Interesting coincidence.
I don't get it, why would you use a 33-line program to avoid typing one line in the shell?
Vineet Singh

I was using a dataset downloaded from Kaggle, and while reading it, it threw this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 183: invalid continuation byte

So this is how I fixed it.

import pandas as pd

pd.read_csv('top50.csv', encoding='ISO-8859-1')



This is an example for converting a CSV file in Python 3:

import csv
from sys import argv

try:
    inputReader = csv.reader(open(argv[1], encoding='ISO-8859-1'),
                             delimiter=',', quotechar='"')
except IOError:
    pass


Sometimes when you call open(filepath), filepath is actually not a file and you get the same error, so first make sure the file you're trying to open exists:

import os
assert os.path.isfile(filepath)

How would opening a file that doesn't exist generate a UnicodeDecodeError? And in Python it's customary to use the EAFP principle over the LBYL that you're endorsing here.
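For what it's worth, the EAFP style mentioned above looks like this (a sketch; the path is hypothetical):

```python
path = 'definitely-not-a-real-file.item'

try:
    with open(path, encoding='ISO-8859-1') as f:
        contents = f.read()
except FileNotFoundError:
    # A missing file raises FileNotFoundError, not UnicodeDecodeError,
    # so an os.path.isfile() check would not catch decoding problems anyway
    contents = None

print(contents)  # None
```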

Open your file with Notepad++ and select the "Encoding" or "Encodage" menu to identify the encoding, or to convert from ANSI to UTF-8 or the ISO 8859-1 code page.


Notepad++ is Windows only. For example, it doesn't work on Linux.
What is "Encodage"? What language?
"Encodage" is "Encoding" if the menu is in French
Eric Aya

So that this page turns up faster in a Google search for a similar question (about an error with UTF-8), I leave my solution here for others.

I had a problem opening a .csv file, with this description:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 150: invalid continuation byte

I opened the file with Notepad and counted to position 150: it was a Cyrillic symbol. I re-saved the file with the 'Save as...' command, choosing encoding 'UTF-8', and my program started to work.
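The same re-save can be done programmatically: read with the original encoding and write back as UTF-8. A sketch that assumes the source file is Windows-1251 (a common Cyrillic encoding; verify this for your own data):

```python
import tempfile

# Create a sample file containing cp1251-encoded Cyrillic text
src = tempfile.NamedTemporaryFile(delete=False, suffix='.csv').name
with open(src, 'w', encoding='cp1251') as f:
    f.write('привет, мир\n')

# Re-encode: read with the original encoding, write back as UTF-8
dst = src + '.utf8'
with open(src, encoding='cp1251') as fin, \
     open(dst, 'w', encoding='utf-8') as fout:
    fout.write(fin.read())

# The converted file now opens cleanly as UTF-8
with open(dst, encoding='utf-8') as f:
    print(f.read())  # привет, мир
```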


Please note that questions and answers on SO must be in English only - even if the problem you encountered may bite mainly programmers using cyrillic alphabet.
@ThierryLathuille, is it a real problem? Could you please give me a link/reference to the community rule on that issue?
This is considered a real problem - and is probably what caused your answer to get downvoted. Non-English content is not allowed on SO (see for example meta.stackoverflow.com/questions/297673/… ), and the rule is really strictly respected. For questions in Russian, you have ru.stackoverflow.com , though ;)
@ThierryLathuille This applies to the English content, not problems with non-English symbols. And this doesn't necessarily have to be about other languages, it could be a different UTF-8 character (for example, a checkmark).
Anoop Ashware

Replace the encoding with encoding='ISO-8859-1':

for line in open('u.item', encoding='ISO-8859-1'):
    print(line)


SONY ANNEM

Use this if you are directly loading data from GitHub or Kaggle:

DF = pd.read_csv(file, encoding='ISO-8859-1')


Nobody said that the file in the question is a csv file.
Kalluri

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7044: invalid continuation byte

The above error occurs because the file is not actually UTF-8 encoded.

Solution: use encoding='latin-1'

Reference: https://pandas.pydata.org/docs/search.html?q=encoding


D.L

I keep coming across this error, and often the problem is not fixed by encoding='utf-8', but in fact by engine='python', like this:

import pandas as pd

file = "c:\\path\\to_my\\file.csv"
df = pd.read_csv(file, engine='python')
df

A link to the docs is here:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html