Here is my code,
for line in open('u.item'):
    # Read each line
Whenever I run this code it gives the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2892: invalid continuation byte
I tried to solve this by adding an extra parameter to open(). The code looks like:
for line in open('u.item', encoding='utf-8'):
    # Read each line
But again it gives the same error. What should I do then?
As suggested by Mark Ransom, I found the right encoding for that problem. The encoding was "ISO-8859-1", so replacing open('u.item', encoding='utf-8') with open('u.item', encoding='ISO-8859-1') will solve the problem.
The following also worked for me. ISO-8859-1 will save you a lot of trouble, especially if you are using Speech Recognition APIs.
Example:
file = open('../Resources/' + filename, 'r', encoding="ISO-8859-1")
Your file doesn't actually contain UTF-8 encoded data; it contains some other encoding. Figure out what that encoding is and use it in the open call.
In Windows-1252 encoding, for example, the byte 0xe9 would be the character é.
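To see that concretely, here is a minimal sketch (not tied to any particular file) showing how the same byte decodes differently, or not at all, depending on the codec:

```python
b = b'\xe9'  # the offending byte from the traceback

# Both Latin-1 and Windows-1252 map 0xe9 to 'é'
print(b.decode('latin-1'))   # é
print(b.decode('cp1252'))    # é

# But as UTF-8 it is an incomplete multi-byte sequence, hence the error
try:
    b.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)
```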
Try this to read using Pandas (here m_cols is assumed to be your list of column names):
pd.read_csv('u.item', sep='|', names=m_cols, encoding='latin-1')
'latin-1' will always read a file without error, because there are no invalid bytes in that encoding, even if it produces the wrong characters. It is the only encoding in Python where every byte value maps directly to the code point with the same value.
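That property is easy to verify with a quick sanity check (not part of the original answer): every one of the 256 possible byte values decodes under Latin-1, and the round-trip is lossless:

```python
data = bytes(range(256))       # every possible byte value
text = data.decode('latin-1')  # never raises
print(len(text))               # 256 - one character per byte

# Round-trips losslessly, too
assert text.encode('latin-1') == data
```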
This works:
open('filename', encoding='latin-1')
Or:
open('filename', encoding="ISO-8859-1")
If you are using Python 2, the following will be the solution:

import io
for line in io.open('u.item', encoding='ISO-8859-1'):
    # Do something

Because the encoding parameter doesn't work with Python 2's built-in open(), you would otherwise get the following error:

TypeError: 'encoding' is an invalid keyword argument for this function

'ISO-8859-1' is also known as 'latin-1' or 'latin1'.
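You can confirm that these names are aliases of the same codec with the standard codecs module:

```python
import codecs

# All three names resolve to the same canonical codec
for alias in ('ISO-8859-1', 'latin-1', 'latin1'):
    print(alias, '->', codecs.lookup(alias).name)
```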
You could resolve the problem with:

for line in open(your_file_path, 'rb'):

'rb' opens the file in binary mode, so each line is returned as raw bytes instead of being decoded as text.
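To make that concrete, here is a small self-contained sketch (it writes its own sample file, so the filename and contents are made up for the demo): in binary mode no decoding happens, so no UnicodeDecodeError can occur, and you decode explicitly once you know the encoding.

```python
import os
import tempfile

# Write a sample file containing a Latin-1 byte (0xe9 = 'é')
path = os.path.join(tempfile.mkdtemp(), 'u.item')
with open(path, 'wb') as f:
    f.write(b'caf\xe9|1999\n')

# 'rb' yields raw bytes: nothing is decoded, so nothing can fail
with open(path, 'rb') as f:
    for raw in f:
        print(raw)                    # b'caf\xe9|1999\n'
        print(raw.decode('latin-1'))  # decode yourself once the encoding is known
```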
You can try this way, which skips any bytes that can't be decoded:

open('u.item', encoding='utf8', errors='ignore')
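Note that errors='ignore' silently drops the undecodable bytes, which can lose data; errors='replace' at least leaves a visible marker. A small sketch of the difference on a Latin-1 byte sequence:

```python
raw = b'caf\xe9'  # 'café' encoded as Latin-1

print(raw.decode('utf-8', errors='ignore'))   # caf - the é is silently dropped
print(raw.decode('utf-8', errors='replace'))  # caf� - U+FFFD replacement character
```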
Based on another question on Stack Overflow and previous answers in this post, I would like to add some help in finding the right encoding.
If your script runs on a Linux OS, you can get the encoding with the file
command:
file --mime-encoding <filename>
Here is a Python script to do that for you:

import sys
import subprocess

if len(sys.argv) < 2:
    print("Usage: {} <filename>".format(sys.argv[0]))
    sys.exit(1)

def find_encoding(fname):
    """Find the encoding of a file using the file command."""
    # Find the full path of the file command
    which_run = subprocess.run(['which', 'file'], stdout=subprocess.PIPE)
    if which_run.returncode != 0:
        print("Unable to find 'file' command ({})".format(which_run.returncode))
        return None
    file_cmd = which_run.stdout.decode().strip()

    # Run the file command to get the MIME encoding
    file_run = subprocess.run([file_cmd, '--mime-encoding', fname],
                              stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE)
    if file_run.returncode != 0:
        print(file_run.stderr.decode(), file=sys.stderr)
        return None

    # Return the encoding name only
    return file_run.stdout.decode().split()[1]

# Test
print("Encoding of {}: {}".format(sys.argv[1], find_encoding(sys.argv[1])))
I was using a dataset downloaded from Kaggle, and while reading it I got this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 183: invalid continuation byte

So this is how I fixed it:
import pandas as pd
pd.read_csv('top50.csv', encoding='ISO-8859-1')
This is an example of converting a CSV file in Python 3:

import csv
from sys import argv

try:
    inputReader = csv.reader(open(argv[1], encoding='ISO-8859-1'),
                             delimiter=',', quotechar='"')
except IOError:
    pass
Sometimes when you use open(filepath) and filepath is not actually a file, you can get the same error, so first make sure the file you're trying to open exists:

import os
assert os.path.isfile(filepath)
But would a nonexistent file raise a UnicodeDecodeError? And in Python it's customary to use the EAFP principle over the LBYL that you're endorsing here.
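An EAFP version of that check might look like the following sketch (the filename is just the one from the question): instead of testing the path first, try the open and handle each failure mode separately.

```python
try:
    with open('u.item', encoding='utf-8') as f:
        lines = f.readlines()
except FileNotFoundError:
    print("No such file")  # a wrong path, not an encoding problem
except UnicodeDecodeError:
    print("File exists but is not valid UTF-8")
```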
Open your file with Notepad++ and use the "Encoding" (or "Encodage") menu to identify the current encoding, or to convert it from ANSI to UTF-8 or the ISO 8859-1 code page.
So that this page is found faster by a Google search on a similar question (about the error with UTF-8), I leave my solution here for others.

I had a problem opening a .csv file, with this description:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 150: invalid continuation byte

I opened the file with Notepad and counted to the 150th position: it was a Cyrillic symbol. I re-saved the file with the 'Save as...' command, choosing Encoding 'UTF-8', and my program started to work.
Replace the encoding with encoding='ISO-8859-1':

for line in open('u.item', encoding='ISO-8859-1'):
    print(line)
Use this if you are directly loading data from GitHub or Kaggle:

DF = pd.read_csv(file, encoding='ISO-8859-1')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7044: invalid continuation byte

The above error occurs because of an encoding mismatch.

Solution: use encoding='latin-1'

Reference: https://pandas.pydata.org/docs/search.html?q=encoding
I keep coming across this error, and often the solution is not encoding='utf-8' but in fact engine='python', like this:
import pandas as pd
file = "c:\\path\\to_my\\file.csv"
df = pd.read_csv(file, engine='python')
df
A link to the docs is here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
You can also detect the encoding with chardet. Here's the one-liner (after import chardet): chardet.detect(open(in_file, 'rb').read())['encoding']. Check out this answer for details: stackoverflow.com/a/3323810/615422