ChatGPT解决这个技术问题 Extra ChatGPT

Setting the correct encoding when piping stdout in Python

When piping the output of a Python program, the Python interpreter gets confused about encoding and sets it to None. This means a program like this:

# -*- coding: utf-8 -*-
print u"åäö"

will work fine when run normally, but fail with:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)

when used in a pipe sequence.

What is the best way to make this work when piping? Can I just tell it to use whatever encoding the shell/filesystem/whatever is using?

The suggestions I have seen thus far is to modify your site.py directly, or hardcoding the defaultencoding using this hack:

# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
print u"åäö"

Is there a better way to make piping work?

If you have this problem on windows, you can also run chcp 65001 before executing your script. This can have issues, but it often helps, and doesn't require a lot of typing (less than set PYTHONIOENCODING=utf_8).
chcp command is not the same as setting PYTHONIOENCODING. I think chcp is just configuration for the terminal itself and has nothing to do with writing to a file (which is what you are doing when piping stdout). Try setx PYTHONENCODING utf-8 to make it permanent if you want to save typing.
I faced a somewhat related issue, and found a solution here --> stackoverflow.com/questions/48782529/…

P
Peter Mortensen

Your code works when run in an script because Python encodes the output to whatever encoding your terminal application is using. If you are piping you must encode it yourself.

A rule of thumb is: Always use Unicode internally. Decode what you receive, and encode what you send.

# -*- coding: utf-8 -*-
print u"åäö".encode('utf-8')

Another didactic example is a Python program to convert between ISO-8859-1 and UTF-8, making everything uppercase in between.

import sys
for line in sys.stdin:
    # Decode what you receive:
    line = line.decode('iso8859-1')

    # Work with Unicode internally:
    line = line.upper()

    # Encode what you send:
    line = line.encode('utf-8')
    sys.stdout.write(line)

Setting the system default encoding is a bad idea, because some modules and libraries you use can rely on the fact it is ASCII. Don't do it.


The problem is that the user doesn't want to specify encoding explicitly. He wants just use Unicode for IO. And the encoding he uses should be an encoding specified in locale settings, not in terminal application settings. AFAIK, Python 3 uses a locale encoding in this case. Changing sys.stdout seems like a more pleasant way.
This answer is wrong. You should not be manually converting on each input and output of your program; that's brittle and completely unmaintainable.
@Glenn Maynard : so what is IYO the right answer? It's more helpful to tell us than just say 'This answer is wrong'
@smci: the answer is don't modify your script, set PYTHONIOENCODING if you are redirecting script's stdout in Python 2.
@Glenn Maynard Actually decoding and encoding is a good practice, from the python doc: "Software should only work with Unicode strings internally, decoding the input data as soon as possible and encoding the output only at the end."
C
Craig McQueen

First, regarding this solution:

# -*- coding: utf-8 -*-
print u"åäö".encode('utf-8')

It's not practical to explicitly print with a given encoding every time. That would be repetitive and error-prone.

A better solution is to change sys.stdout at the start of your program, to encode with a selected encoding. Here is one solution I found on Python: How is sys.stdout.encoding chosen?, in particular a comment by "toka":

import sys
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)

unfortunately, changing sys.stdout to accept only unicode breaks a lot of libraries that expect it to accept encoded bytestrings.
nosklo: Then how can it work reliably and automaticly when output is a terminal?
@Rasmus Kaj: just define your own unicode printing function and use it every time you want to print unicode: def myprint(unicodeobj): print unicodeobj.encode('utf-8') -- you automatically detect terminal encoding by inspecting sys.stdout.encoding, but you should consider the case where it is None (i.e. when redirecting output to a file) so you need a separate function anyway.
@nosklo: This does not make sys.stdout accept only Unicode. You can pass both str and unicode to a StreamWriter.
I assume this answer was intended for python2. Be careful with this on code which is intended to support both python2 and python3. For me it's breaking stuff when ran under python3.
s
sophros

You may want to try changing the environment variable "PYTHONIOENCODING" to "utf_8". I have written a page on my ordeal with this problem.

Tl;dr of the blog post:

import sys, locale, os
print(sys.stdout.encoding)
print(sys.stdout.isatty())
print(locale.getpreferredencoding())
print(sys.getfilesystemencoding())
print(os.environ["PYTHONIOENCODING"])
print(chr(246), chr(9786), chr(9787))

gives you

utf_8
False
ANSI_X3.4-1968
ascii
utf_8
ö ☺ ☻

Changing sys.stdout.encoding maybe does not work, but changing sys.stdout does work: sys.stdout = codecs.getwriter(encoding)(sys.stdout). This can be done from within the python program, so the user is not forced to set an env variable.
@jeckyll2hide: PYTHONIOENCODING does work. How bytes are interpreted as a text is defined by user environment. Your script shouldn't be assuming and dictate the user environment what character encoding to use. If Python doesn't pick up the settings automatically then PYTHONIOENCODING can be set for your script. You shouldn't need it unless the output is redirected to a file/pipe.
+1. Honestly I think it's a Python bug. When I redirect output I want those same bytes that would be on the terminal, but in a file. Maybe it's not for everyone but it's a good default. Crashing hard with no explanation on a trivial operation that usually "just works" is a bad default.
@SnakE: the only way I can rationalize why Python's implementation intentionally would enforce an iron-clad and permanent choice of encoding on stdout at startup time, might be in order to prevent any badly encoded stuff coming out later on. Or changing it is just an unimplemented feature, in which case allowing the user to change it later on would be a reasonable Python feature request.
@daveagp My point is, behavior of my program should not depend on whether it is redirected or not---unless I really want it, in which case I implement it myself. Python behaves contrary to my experience with any other console tools. This violates the least surprise principle. I consider this a design flaw unless there is a very strong rationale.
S
Sérgio
export PYTHONIOENCODING=utf-8

do the job, but can't set it on python itself ...

what we can do is verify if isn't setting and tell the user to set it before call script with :

if __name__ == '__main__':
    if (sys.stdout.encoding is None):
        print >> sys.stderr, "please set python env PYTHONIOENCODING=UTF-8, example: export PYTHONIOENCODING=UTF-8, when write to stdout."
        exit(1)

Update to reply to the comment: the problem just exist when piping to stdout . I tested in Fedora 25 Python 2.7.13

python --version
Python 2.7.13

cat b.py

#!/usr/bin/env python
#-*- coding: utf-8 -*-
import sys

print sys.stdout.encoding

running ./b.py

UTF-8

running ./b.py | less

None

That check doesn't work in Python 2.7.13. sys.stdout.encoding is automatically set based on the LC_CTYPE locale value.
mail.python.org/pipermail/python-list/2011-June/605938.html the example there still work , i.e. when you use ./a.py > out.txt sys.stdout.encoding is None
I had a similar problem with a sync script from Backblaze B2 and export PYTHONIOENCODING=utf-8 solved my problem. Python 2.7 on Debian Stretch.
C
CLaFarge

I had a similar issue last week. It was easy to fix in my IDE (PyCharm).

Here was my fix:

Starting from PyCharm menu bar: File -> Settings... -> Editor -> File Encodings, then set: "IDE Encoding", "Project Encoding" and "Default encoding for properties files" ALL to UTF-8 and she now works like a charm.

Hope this helps!


T
Tompa

An arguable sanitized version of Craig McQueen's answer.

import sys, codecs
class EncodedOut:
    def __init__(self, enc):
        self.enc = enc
        self.stdout = sys.stdout
    def __enter__(self):
        if sys.stdout.encoding is None:
            w = codecs.getwriter(self.enc)
            sys.stdout = w(sys.stdout)
    def __exit__(self, exc_ty, exc_val, tb):
        sys.stdout = self.stdout

Usage:

with EncodedOut('utf-8'):
    print u'ÅÄÖåäö'

P
Peter Mortensen

I just thought I'd mention something here which I had to spent a long time experimenting with before I finally realised what was going on. This may be so obvious to everyone here that they haven't bothered mentioning it. But it would've helped me if they had, so on that principle...!

NB: I am using Jython specifically, v 2.7, so just possibly this may not apply to CPython...

NB2: the first two lines of my .py file here are:

# -*- coding: utf-8 -*-
from __future__ import print_function

The "%" (AKA "interpolation operator") string construction mechanism causes ADDITIONAL problems too... If the default encoding of the "environment" is ASCII and you try to do something like

print( "bonjour, %s" % "fréd" )  # Call this "print A"

You will have no difficulty running in Eclipse... In a Windows CLI (DOS window) you will find that the encoding is code page 850 (my Windows 7 OS) or something similar, which can handle European accented characters at least, so it'll work.

print( u"bonjour, %s" % "fréd" ) # Call this "print B"

will also work.

If, OTOH, you direct to a file from the CLI, the stdout encoding will be None, which will default to ASCII (on my OS anyway), which will not be able to handle either of the above prints... (dreaded encoding error).

So then you might think of redirecting your stdout by using

sys.stdout = codecs.getwriter('utf8')(sys.stdout)

and try running in the CLI piping to a file... Very oddly, print A above will work... But print B above will throw the encoding error! The following will however work OK:

print( u"bonjour, " + "fréd" ) # Call this "print C"

The conclusion I have come to (provisionally) is that if a string which is specified to be a Unicode string using the "u" prefix is submitted to the %-handling mechanism it appears to involve the use of the default environment encoding, regardless of whether you have set stdout to redirect!

How people deal with this is a matter of choice. I would welcome a Unicode expert to say why this happens, whether I've got it wrong in some way, what the preferred solution to this, whether it also applies to CPython, whether it happens in Python 3, etc., etc.


That's not odd, that's because "fréd" is a byte sequence and not a Unicode string, so the codecs.getwriter wrapper will leave it alone. You need a leading u, or from __future__ import unicode_literals.
@MatthiasUrlichs OK... thanks... But I just find encoding one of the most infuriating aspects of IT. Where do you get your understanding from? For example, I just posted another question about encoding here: stackoverflow.com/questions/44483067/…: this is about Java, Eclipse, Cygwin & Gradle. If your expertise goes this far, please help... above all I'd like to know where to learn more!
j
jno

I could "automate" it with a call to:

def __fix_io_encoding(last_resort_default='UTF-8'):
  import sys
  if [x for x in (sys.stdin,sys.stdout,sys.stderr) if x.encoding is None] :
      import os
      defEnc = None
      if defEnc is None :
        try:
          import locale
          defEnc = locale.getpreferredencoding()
        except: pass
      if defEnc is None :
        try: defEnc = sys.getfilesystemencoding()
        except: pass
      if defEnc is None :
        try: defEnc = sys.stdin.encoding
        except: pass
      if defEnc is None :
        defEnc = last_resort_default
      os.environ['PYTHONIOENCODING'] = os.environ.get("PYTHONIOENCODING",defEnc)
      os.execvpe(sys.argv[0],sys.argv,os.environ)
__fix_io_encoding() ; del __fix_io_encoding

Yes, it's possible to get an infinite loop here if this "setenv" fails.


interesting, but a pipe doesn't seem to be happy about this
c
cessor

I ran into this problem in a legacy application, and it was difficult to identify where what was printed. I helped myself with this hack:

# encoding_utf8.py
import codecs
import builtins


def print_utf8(text, **kwargs):
    print(str(text).encode('utf-8'), **kwargs)


def print_utf8(fn):
    def print_fn(*args, **kwargs):
        return fn(str(*args).encode('utf-8'), **kwargs)
    return print_fn


builtins.print = print_utf8(print)

On top of my script, test.py:

import encoding_utf8
string = 'Axwell Λ Ingrosso'
print(string)

Note that this changes ALL calls to print to use an encoding, so your console will print this:

$ python test.py
b'Axwell \xce\x9b Ingrosso'

B
Basj

On Windows, I had this problem very often when running a Python code from an editor (like Sublime Text), but not if running it from command-line.

In this case, check your editor's parameters. In the case of SublimeText, this Python.sublime-build solved it:

{
  "cmd": ["python", "-u", "$file"],
  "file_regex": "^[ ]*File \"(...*?)\", line ([0-9]*)",
  "selector": "source.python",
  "encoding": "utf8",
  "env": {"PYTHONIOENCODING": "utf-8", "LANG": "en_US.UTF-8"}
}

j
jacouh

Since Python 3.7, we can use Python UTF-8 Mode, by using command line option -X utf8:

 python -X utf8 testzh.py

The script testzh.py contains

print("Content-type: text/html; charset=UTF-8\n") 
print("地球你好!")

To set Windows 10 Internet Service IIS as CGI Script handler,

We set Executable as this:

"C:\Program Files\Python39\python.exe" -X utf8 %s

https://i.stack.imgur.com/9ezXd.png

This works for Chinese Ideograms as expected on Browser Microsoft.Edge like this screenshot: Otherwise, error occurs.

https://i.stack.imgur.com/dZrFs.png

Please see https://docs.python.org/3/library/os.html#utf8-mode


q
qz-

I'm surprised this answer has not been posted here yet

Since Python 3.7 you can change the encoding of standard streams with reconfigure(): sys.stdout.reconfigure(encoding='utf-8') You can also modify how encoding errors are handled by adding an errors parameter.

https://stackoverflow.com/a/52372390/15675011