UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)

I want to parse my XML document. I have stored the document as follows:

from google.appengine.ext import db

class XMLdocs(db.Expando):
    id = db.IntegerProperty()
    name = db.StringProperty()
    content = db.BlobProperty()

Below is my code:

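import StringIO
from xml.sax import make_parser

parser = make_parser()
curHandler = BasketBallHandler()  # the asker's own ContentHandler subclass (not shown)
parser.setContentHandler(curHandler)
for q in XMLdocs.all():
    parser.parse(StringIO.StringIO(q.content))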

I am getting the error below:

'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)
Traceback (most recent call last):  
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/__init__.py", line 517, in __call__
    handler.post(*groups)   
  File "/base/data/home/apps/parsepython/1.348669006354245654/mapreduce/base_handler.py", line 59, in post
    self.handle()   
  File "/base/data/home/apps/parsepython/1.348669006354245654/mapreduce/handlers.py", line 168, in handle
    scan_aborted = not self.process_entity(entity, ctx)   
  File "/base/data/home/apps/parsepython/1.348669006354245654/mapreduce/handlers.py", line 233, in process_entity
    handler(entity)   
  File "/base/data/home/apps/parsepython/1.348669006354245654/parseXML.py", line 71, in process
    parser.parse(StringIO.StringIO(q.content))   
  File "/base/python_runtime/python_dist/lib/python2.5/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)   
  File "/base/python_runtime/python_dist/lib/python2.5/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)  
  File "/base/python_runtime/python_dist/lib/python2.5/xml/sax/expatreader.py", line 207, in feed
    self._parser.Parse(data, isFinal)   
  File "/base/data/home/apps/parsepython/1.348669006354245654/parseXML.py", line 136, in characters   
    print ch   
UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)   
Your stack trace shows that the code you are running is different from what you pasted, and that you're using print. Don't use print in a WSGI app!

Kenan Banks

The actual best answer for this problem depends on your environment, specifically what encoding your terminal expects.

The quickest one-line solution is to encode everything you print to ASCII, which your terminal is almost certain to accept, while discarding characters that you cannot print:

print ch #fails
print ch.encode('ascii', 'ignore')

The better solution is to change your terminal's encoding to utf-8, and encode everything as utf-8 before printing. You should get in the habit of thinking about your unicode encoding EVERY time you print or read a string.
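To make the difference between the two options concrete, here is a minimal sketch. The helper names are hypothetical (they are not from the question), and the code is written so that it should run on both Python 2.6+ and Python 3:

```python
# -*- coding: utf-8 -*-
# Sketch only: helper names are hypothetical, not part of the asker's code.

def to_ascii_lossy(text):
    """Encode to ASCII, silently dropping characters ASCII cannot represent."""
    return text.encode('ascii', 'ignore')


def to_utf8(text):
    """Encode to UTF-8, which can represent every Unicode character."""
    return text.encode('utf-8')


sample = u'\xefle'                    # u'\xef' is the character from the traceback
ascii_bytes = to_ascii_lossy(sample)  # the u'\xef' is dropped entirely
utf8_bytes = to_utf8(sample)          # the u'\xef' becomes the two bytes C3 AF
```

The ASCII route never fails but loses data; the UTF-8 route keeps everything but only displays correctly if whatever reads the output expects UTF-8.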


In my case, I was printing a Twitter stream to a terminal, and it was working fine. Then I wanted to redirect the program's output to a file, and I started getting 'ascii' codec can't encode characters in position 32-36. Later, as in this answer, I used print tweet.encode("utf-8", "ignore"), and it all worked.
Nicole

Just putting .encode('utf-8') at the end of the object will do the job in recent versions of Python.


What do you mean by "recent versions of Python"? Only 3.x, or also 2.7?
Python 2.7 is clearly recent since it's still in widespread use.
Works for me on Python 2.7
Morgan Wilde

It seems you are hitting a UTF-8 byte order mark (BOM). Try using this unicode string with the BOM stripped out:

import codecs

content = unicode(q.content.strip(codecs.BOM_UTF8), 'utf-8')
parser.parse(StringIO.StringIO(content))

I used strip instead of lstrip because in your case you had multiple occurrences of the BOM, possibly due to concatenated file contents.
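One caveat worth knowing: str.strip() treats its argument as a set of characters rather than a prefix, so strip(codecs.BOM_UTF8) removes any run of the bytes EF, BB and BF from both ends, not just whole BOMs. A possible alternative, sketched here under the assumption that the stored content really is UTF-8, is to decode with the standard 'utf-8-sig' codec and then drop any leftover U+FEFF characters. The function name is hypothetical, and this only addresses decoding; it does not by itself fix an encode error raised by print:

```python
import codecs


def decode_xml_bytes(raw_bytes):
    """Decode UTF-8 bytes, dropping a leading BOM and any stray BOMs inside."""
    # 'utf-8-sig' removes one BOM at the very start, if present.
    text = raw_bytes.decode('utf-8-sig')
    # BOMs left in the middle (e.g. from concatenated files) decode to U+FEFF.
    return text.replace(u'\ufeff', u'')


# Simulated content with a BOM at the start and another one in the middle.
raw = codecs.BOM_UTF8 + b'<team>' + codecs.BOM_UTF8 + b'Lakers</team>'
clean = decode_xml_bytes(raw)
```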


I have done exactly as mentioned in the answer but I am still getting the above error. First it was at position 0, as mentioned in the question, and now it is at position 5785, as mentioned in the previous comment.
I recommend converting any string s which produces the error with s = unicode(s.strip(codecs.BOM_UTF8), 'utf-8'). s refers to the name of your strings.
Try to replace lstrip with strip.
I understand what you are suggesting, and I had already done the same. The error in detail: 'ascii' codec can't encode character u'\xef' in position 5785: ordinal not in range(128)
It's an encode error during the conversion of a unicode object to a string during printing. It won't contain a UTF-8 BOM, it can't be decoded back to unicode, and the error is because it contains non-ASCII characters - removing them would break the content, and the BOM is only one of them.
Orlando Pozo

This worked for me:

from django.utils.encoding import smart_str
content = smart_str(content)

Duncan

The problem according to your traceback is the print statement on line 136 of parseXML.py. Unfortunately you didn't see fit to post that part of your code, but I'm going to guess it is just there for debugging. If you change it to:

print repr(ch)

then you should at least see what you are trying to print.
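If the print is only there for debugging, one option is to drop it and log an ASCII-safe form of the text instead. The sketch below is an assumption-laden illustration: LoggingHandler is a hypothetical stand-in for the handler class in the question (which was not posted), and it collects the text rather than printing it. It should run on Python 2.6+ and Python 3:

```python
import logging
import xml.sax


class LoggingHandler(xml.sax.ContentHandler):
    """Hypothetical handler: collects character data and logs it safely."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, ch):
        self.chunks.append(ch)
        # unicode_escape yields pure ASCII, so logging cannot hit an
        # encode error no matter what characters the document contains.
        logging.debug('characters: %s',
                      ch.encode('unicode_escape').decode('ascii'))


handler = LoggingHandler()
# Feed the parser UTF-8 bytes containing the character from the traceback.
xml.sax.parseString(u'<name>\xefle</name>'.encode('utf-8'), handler)
text = u''.join(handler.chunks)
```

Joining the chunks matters because a SAX parser may deliver one text node in several characters() calls.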


-1 for non-unicode solution to an obvious unicode encoding problem.
The unicode encoding problem is with the print statement. Yes, there may be other issues but fixing the print to not crash is the immediate issue.
Rosh Oxymoron

The problem is that you're trying to print a unicode character to a possibly non-unicode terminal. You need to encode it with the 'replace' option before printing it, e.g. print ch.encode(sys.stdout.encoding, 'replace').
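One thing to watch for: in Python 2, sys.stdout.encoding is None when output is redirected to a file or pipe, and passing None to encode() raises a TypeError. A minimal sketch of a more defensive version follows; the function name and the ASCII fallback are my own assumptions, not part of the answer:

```python
import sys


def encode_for_stream(text, stream=None):
    """Encode text for a stream, replacing characters it cannot represent."""
    if stream is None:
        stream = sys.stdout
    # Fall back to ASCII when the stream does not report an encoding
    # (assumed fallback; pick whatever suits your environment).
    encoding = getattr(stream, 'encoding', None) or 'ascii'
    return text.encode(encoding, 'replace')
```

With an ASCII fallback, unencodable characters come out as '?' instead of raising UnicodeEncodeError, which is usually acceptable for debug output but not for data you need to keep.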


Printing is not essential; the main statement where I am getting the error is the parse statement.
@Mahesh: It's YOUR code that's causing the problem, at line 136 of parseXML.py -- either fix it yourself, or show us that part of the code so we can help you.
Hafiz Muhammad Shafiq

An easy solution to overcome this problem is to set your default encoding to utf8. The following is an example:

import sys

reload(sys)
sys.setdefaultencoding('utf8')

Do not do this. See "why it breaks code".
Can you explain the reason?
There is a link in my comment that explains it. Essentially libraries expect the default of ascii to remain the default. It is why setdefaultencoding is not normally available without the reload trick.