ChatGPT解决这个技术问题 Extra ChatGPT

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 13: ordinal not in range(128)

I'm using NLTK to perform kmeans clustering on my text file in which each line is considered as a document. So for example, my text file is something like this:

belong finger death punch <br>
hasty <br>
mike hasty walls jericho <br>
jägermeister rules <br>
rules bands follow performing jägermeister stage <br>
approach 

Now the demo code I'm trying to run is this:

import sys

import numpy
from nltk.cluster import KMeansClusterer, GAAClusterer, euclidean_distance
import nltk.corpus
from nltk import decorators
import nltk.stem

stemmer_func = nltk.stem.EnglishStemmer().stem
stopwords = set(nltk.corpus.stopwords.words('english'))

@decorators.memoize
def normalize_word(word):
    return stemmer_func(word.lower())

def get_words(titles):
    words = set()
    for title in job_titles:
        for word in title.split():
            words.add(normalize_word(word))
    return list(words)

@decorators.memoize
def vectorspaced(title):
    title_components = [normalize_word(word) for word in title.split()]
    return numpy.array([
        word in title_components and not word in stopwords
        for word in words], numpy.short)

if __name__ == '__main__':

    filename = 'example.txt'
    if len(sys.argv) == 2:
        filename = sys.argv[1]

    with open(filename) as title_file:

        job_titles = [line.strip() for line in title_file.readlines()]

        words = get_words(job_titles)

        # cluster = KMeansClusterer(5, euclidean_distance)
        cluster = GAAClusterer(5)
        cluster.cluster([vectorspaced(title) for title in job_titles if title])

        # NOTE: This is inefficient, cluster.classify should really just be
        # called when you are classifying previously unseen examples!
        classified_examples = [
                cluster.classify(vectorspaced(title)) for title in job_titles
            ]

        for cluster_id, title in sorted(zip(classified_examples, job_titles)):
            print cluster_id, title

(which can also be found here)

The error I receive is this:

Traceback (most recent call last):
File "cluster_example.py", line 40, in
words = get_words(job_titles)
File "cluster_example.py", line 20, in get_words
words.add(normalize_word(word))
File "", line 1, in
File "/usr/local/lib/python2.7/dist-packages/nltk/decorators.py", line 183, in memoize
result = func(*args)
File "cluster_example.py", line 14, in normalize_word
return stemmer_func(word.lower())
File "/usr/local/lib/python2.7/dist-packages/nltk/stem/snowball.py", line 694, in stem
word = (word.replace(u"\u2019", u"\x27")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 13: ordinal not in range(128)

What is happening here?


i
icktoofay

The file is being read as a bunch of strs, but it should be unicodes. Python tries to implicitly convert, but fails. Change:

job_titles = [line.strip() for line in title_file.readlines()]

to explicitly decode the strs to unicode (here assuming UTF-8):

job_titles = [line.decode('utf-8').strip() for line in title_file.readlines()]

It could also be solved by importing the codecs module and using codecs.open rather than the built-in open.


running this line.decode('utf-8').strip().lower().split() also gives me the same error. I have added the .deocode('utf-8')
@kathirraja: Can you provide a reference for that? As far as I know, even in Python 3, the decode method remains the preferred way to decode a byte string to a Unicode string. (Though, the types in my answer are not right for Python 3 -- for Python 3, we're trying to convert from bytes to str rather than from str to unicode.)
E
Erik Kjellgren

This works fine for me.

f = open(file_path, 'r+', encoding="utf-8")

You can add a third parameter encoding to ensure the encoding type is 'utf-8'

Note: this method works fine in Python3, I did not try it in Python2.7.


It doesn't work in Python 2.7.10: TypeError: 'encoding' is an invalid keyword argument for this function
It doesn't work in Python 2.7.10: TypeError: 'encoding' is an invalid keyword argument for this function This works fine: import io with io.open(file_path, 'r', encoding="utf-8") as f: for line in f: do_something(line)
Worked like a charm in python3.6 Thanks a lot!
G
Georgi Karadzhov

For me there was a problem with the terminal encoding. Adding UTF-8 to .bashrc solved the problem:

export LC_CTYPE=en_US.UTF-8

Don't forget to reload .bashrc afterwards:

source ~/.bashrc

I had to use export LC_ALL=C.UTF-8 on Ubuntu 18.04.3 and Python 3.6.8. Otherwise this solved my problem, thanks.
For me it was solved with set -x LANG en_US.UTF-8 (fish) on macOS 10.15.7 and Python 3.6.7
B
Billy

You can try this also:

import sys
reload(sys)
sys.setdefaultencoding('utf8')

What are the implications of this? It sounds like it's something global and not only applicable for this file.
Notice that the above is deprecated in Python 3.
m
mikemaccana

I got this error when trying to install a python package in a Docker container. For me, the issue was that the docker image did not have a locale configured. Adding the following code to the Dockerfile solved the problem for me.

# Avoid ascii errors when reading files in Python
RUN apt-get install -y locales && locale-gen en_US.UTF-8
ENV LANG='en_US.UTF-8' LANGUAGE='en_US:en' LC_ALL='en_US.UTF-8'

l
loretoparisi

When on Ubuntu 18.04 using Python3.6 I have solved the problem doing both:

with open(filename, encoding="utf-8") as lines:

and if you are running the tool as command line:

export LC_ALL=C.UTF-8

Note that if you are in Python2.7 you have do to handle this differently. First you have to set the default encoding:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

and then to load the file you must use io.open to set the encoding:

import io
with io.open(filename, 'r', encoding='utf-8') as lines:

You still need to export the env

export LC_ALL=C.UTF-8

h
hoijui

To find ANY and ALL unicode error related... Using the following command:

grep -r -P '[^\x00-\x7f]' /etc/apache2 /etc/letsencrypt /etc/nginx

Found mine in

/etc/letsencrypt/options-ssl-nginx.conf:        # The following CSP directives don't use default-src as 

Using shed, I found the offending sequence. It turned out to be an editor mistake.

00008099:     C2  194 302 11000010
00008100:     A0  160 240 10100000
00008101:  d  64  100 144 01100100
00008102:  e  65  101 145 01100101
00008103:  f  66  102 146 01100110
00008104:  a  61  097 141 01100001
00008105:  u  75  117 165 01110101
00008106:  l  6C  108 154 01101100
00008107:  t  74  116 164 01110100
00008108:  -  2D  045 055 00101101
00008109:  s  73  115 163 01110011
00008110:  r  72  114 162 01110010
00008111:  c  63  099 143 01100011
00008112:     C2  194 302 11000010
00008113:     A0  160 240 10100000

h
hoijui

Use open(fn, 'rb').read().decode('utf-8') instead of just open(fn).read()


A
Aminah Nuraini

You can try this before using job_titles string:

source = unicode(job_titles, 'utf-8')

For Python2 or Python3 as well?
i
iamigham

For python 3, the default encoding would be "utf-8". Following steps are suggested in the base documentation:https://docs.python.org/2/library/csv.html#csv-examples in case of any problem

Create a function def utf_8_encoder(unicode_csv_data): for line in unicode_csv_data: yield line.encode('utf-8') Then use the function inside the reader, for e.g. csv_reader = csv.reader(utf_8_encoder(unicode_csv_data))


V
Vishal Kumar Sahu

python3x or higher

load file in byte stream:

     body = ''
        for lines in open('website/index.html','rb'):
            decodedLine = lines.decode('utf-8')
            body = body+decodedLine.strip()
        return body

use global setting:

    import io
    import sys
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding='utf-8')