
How to use unicode characters in Windows command line?

We have a project in Team Foundation Server (TFS) that has a non-English character (š) in it. When trying to script a few build-related things, we stumbled upon a problem: we can't pass the š letter to the command-line tools. The command prompt (or something else along the way) mangles it, and the tf.exe utility can't find the specified project.

I've tried different formats for the .bat file (ANSI, UTF-8 with and without BOM) as well as scripting it in JavaScript (which is Unicode inherently) - but no luck. How do I execute a program and pass it a Unicode command line?

@JohannesDewender - Copy-paste gone wrong?
Python 3.6: "the default console on Windows accept all Unicode characters with that version" (well, most of them for me), BUT you need to configure the console: right-click the title bar of the window (cmd or Python IDLE), and under Defaults > Font choose "Lucida Console".
@LưuVĩnhPhúc - No, this is about passing unicode command line arguments, rather than displaying text in the console. Console might not get involved at all.

kgiannakakis

Try:

chcp 65001

which will change the code page to UTF-8. Also, you need to use the Lucida Console font.


Do you know if there's a way to make this the default?
Note there are serious implementation bugs in Windows's code page 65001 support which will break many applications that rely on the C standard library IO methods, so this is very fragile. (Batch files also just stop working in 65001.) Unfortunately UTF-8 is a second-class citizen in Windows.
@bobince Do you have an example of a bug in the Windows code page 65001 support? I'm curious because I've never run into one, and googling didn't turn anything up either. (Batch files do stop working, of course, but UTF-8 is hardly a second-class citizen...)
@romkyns: My understanding is that calls that return a number-of-bytes (such as fread/fwrite/etc) actually return a number-of-characters. This causes a wide variety of symptoms, such as incomplete input-reading, hangs in fflush, the broken batch files and so on. Some background. The default code pages used for CJK "multibyte" locales have special handling built in to fix this, but 65001 doesn't - it is not supported.
Interesting question here though - is the bug because it should report bytes and instead reports characters - or because the applications using it have assumed bytes=characters incorrectly? In other words, is it an API fail or an API usage fail?
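
To make the bytes-versus-characters point from the comments above concrete, here is a minimal Python sketch. It is not taken from any answer in this thread, and the helper name truncate_by_reported_count is made up for illustration. It only shows what goes wrong when a caller is handed a character count where it expects a byte count.

```python
def truncate_by_reported_count(data: bytes, reported_count: int) -> bytes:
    """Keep only as many bytes as the (possibly wrong) count says were handled."""
    return data[:reported_count]


text = "Tešt"                    # 4 characters, one of them non-ASCII
encoded = text.encode("utf-8")   # 5 bytes: "š" needs two bytes in UTF-8

char_count = len(text)
byte_count = len(encoded)

# A caller that is told "4" but treats it as a byte count loses the last byte.
truncated = truncate_by_reported_count(encoded, char_count)
print(char_count, byte_count, truncated.decode("utf-8"))
```

With pure ASCII text the two counts coincide, which is probably why this kind of bug goes unnoticed until a non-English character shows up.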
Ilya Zakharevich

My background: I have used Unicode input/output in a console for years (and do it a lot daily; moreover, I develop support tools for exactly this task). There are very few problems, as long as you understand the following facts/limitations:

CMD and “console” are unrelated factors. CMD.exe is just one of the programs which are ready to “work inside” a console (“console applications”).

AFAIK, CMD has perfect support for Unicode; you can enter/output all Unicode chars when any codepage is active.

Windows’ console has A LOT of support for Unicode — but it is not perfect (just “good enough”; see below).

chcp 65001 is very dangerous. Unless a program was specially designed to work around defects in the Windows API (or uses a C runtime library which has these workarounds), it will not work reliably. Win8 fixes ½ of these problems with cp65001, but the rest still applies to Win10.

I work in cp1252. As I already said: To input/output Unicode in a console, one does not need to set the codepage.

The details

To read/write Unicode to a console, an application (or its C runtime library) should be smart enough to use not File-I/O API, but Console-I/O API. (For an example, see how Python does it.)

Likewise, to read Unicode command-line arguments, an application (or its C runtime library) should be smart enough to use the corresponding API.

Console font rendering supports only Unicode characters in the BMP (in other words: below U+10000). Only simple text rendering is supported, so European (and some East Asian) languages should work fine as long as one uses precomposed forms. [There is some minor fine print here for East Asian and for characters U+0000, U+0001, U+30FB.]
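
The two rendering limits just described (BMP only, precomposed forms) can be checked before text is sent to a console. The following is a minimal Python sketch, not part of the original answer; the function names to_precomposed and outside_bmp are hypothetical, and it assumes that "precomposed" corresponds to Unicode normalization form NFC.

```python
import unicodedata


def to_precomposed(text):
    """Return the text in NFC, which prefers precomposed characters."""
    return unicodedata.normalize("NFC", text)


def outside_bmp(text):
    """List the characters at or above U+10000, which lie outside the BMP."""
    return [ch for ch in text if ord(ch) >= 0x10000]


decomposed = "s\u030c"                 # "s" followed by a combining caron
composed = to_precomposed(decomposed)  # the single character U+0161 (š)

print(len(decomposed), len(composed))
print(outside_bmp("a\U0001F600š"))
```

NFC cannot precompose every sequence (some combinations have no single code point), so this helps only for characters that actually have a precomposed form.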

Practical considerations

The defaults on Windows are not very helpful. For the best experience, one should tune three pieces of configuration:

For output: a comprehensive console font. For best results, I recommend my builds. (The installation instructions are present there — and also listed in other answers on this page.)

For input: a capable keyboard layout. For best results, I recommend my layouts.

For input: allow HEX input of Unicode.

One more gotcha with “Pasting” into a console application (very technical):

HEX input delivers a character on KeyUp of Alt; all the other ways to deliver a character happen on KeyDown; so many applications are not ready to see a character on KeyUp. (Only applicable to applications using Console-I/O API.)

Conclusion: many applications will not react to HEX input events.

Moreover, what happens with a “Pasted” character depends on the current keyboard layout: if the character can be typed without using prefix keys (but with an arbitrarily complicated combination of modifiers, as in Ctrl-Alt-AltGr-Kana-Shift-Gray*), then it is delivered on an emulated keypress. This is what any application expects, so pasting anything which contains only such characters is fine.

However, the “other” characters are delivered by emulating HEX input.

Conclusion: unless your keyboard layout supports input of A LOT of characters without prefix keys, some buggy applications may skip characters when you Paste via Console’s UI: Alt-Space E P. (This is why I recommend using my keyboard layouts!)

One should also keep in mind that the “alternative, ‘more capable’ consoles” for Windows are not consoles at all. They do not support Console-I/O APIs, so the programs which rely on these APIs to work would not function. (The programs which use only “File-I/O APIs to the console filehandles” would work fine, though.)

One example of such a non-console is a part of Microsoft’s PowerShell. I do not use it; to experiment, press and release WinKey, then type powershell.

(On the other hand, there are programs such as ConEmu or ANSICON which try to do more: they “attempt” to intercept Console-I/O APIs to make “true console applications” work too. This definitely works for toy example programs; in real life, this may or may not solve your particular problems. Experiment.)

Summary

set font, keyboard layout (and optionally, allow HEX input).

use only programs which go through Console-I/O APIs, and accept Unicode command-line arguments. For example, any cygwin-compiled program should be fine. As I already said, CMD is fine too.

UPD: Initially, for a bug in cp65001, I was mixing up Kernel and CRTL layers (UPD²: and Windows user-mode API!). Also: Win8 fixes one half of this bug; I clarified the section about “better console” application, and added a reference to how Python does it.


OK, for something this thorough, you deserve to be the accepted answer! Awesome!
I am a newbie to C++ and can't understand this answer even after reading it carefully. Can somebody help me with this or give an easier explanation?
@Bachi Thanks to Bachi, I found out that v73 of my keyboard layout (mentioned above) was missing some support files. Now fixed! (Judging by my .log files, it is an intermittent bug in zip -ru [?!]. Have no clue how to debug it — or avoid in the future…)
@Rick: Right! I added a link to a workaround in Python (but I cannot find a direct link to the patch right now…).
Bugs in the console are not in the kernel. The APIs in kernel32.dll and kernelbase.dll typically interface to system calls exported by ntdll.dll. The console API ultimately makes either I/O calls (e.g. NtReadFile, NtDeviceIoControlFile) in Windows 8+ or LPC calls in older versions. These system calls go through the kernel (e.g. via the ConDrv device in Win 8+), but ultimately they're implemented in the user-mode console host process. This is either an instance of conhost.exe in Windows 7+ or, in older versions, the session subsystem process, csrss.exe. Console bugs are usually here.
Peter Mortensen

I had the same problem (I'm from the Czech Republic). I have an English installation of Windows, and I have to work with files on a shared drive. Paths to the files include Czech-specific characters.

The solution that works for me is:

In the batch file, change the code page.

My batch file:

chcp 1250
copy "O:\VEŘEJNÉ\ŽŽŽŽŽŽ\Ž.xls" c:\temp

The batch file has to be saved in CP 1250.

Note that the console will not show characters correctly, but it will understand them...
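
The requirement that the batch file be saved in CP 1250 can be illustrated with a short Python sketch. This is not from the original answer: the helper name write_batch_file and the file name copy_file.bat are hypothetical, and cp852 is used only as an example of a different code page. The sketch writes the two lines above in CP 1250 and then reads the same bytes back under both code pages.

```python
import os
import tempfile


def write_batch_file(path, lines, encoding="cp1250"):
    """Write the lines to a batch file in a legacy code page, with CRLF endings."""
    with open(path, "w", encoding=encoding, newline="\r\n") as handle:
        for line in lines:
            handle.write(line + "\n")


lines = [
    "chcp 1250",
    'copy "O:\\VEŘEJNÉ\\ŽŽŽŽŽŽ\\Ž.xls" c:\\temp',
]

with tempfile.TemporaryDirectory() as folder:
    batch_path = os.path.join(folder, "copy_file.bat")
    write_batch_file(batch_path, lines)
    with open(batch_path, "rb") as handle:
        raw = handle.read()

# Read under the code page the file was saved in, the path comes back intact.
round_trip = raw.decode("cp1250")
# Read under a different code page, the same bytes turn into other characters.
misread = raw.decode("cp852", errors="replace")

print(round_trip)
print(misread)
```

The point is that chcp tells cmd how to interpret the bytes of the script, so the number given to chcp and the encoding used when saving the file have to agree.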


Cheers! I needed this so that I could input the copyright character within my batch file.
This worked perfectly for me too in an almost identical situation to yours. Instead my path contained Irish Gaelic characters i.e. á, é, í, ó, and ú.
@vanna that solves my "Turkish characters and spaces in path on network problem". you are great.
You probably just needed to use different font to also display the characters correctly, Lucida Console worked for me.
"Windows-1250 is a code page used under Microsoft Windows to represent texts in Central European and Eastern European languages that use Latin script, such as Polish, Czech, Slovak, Hungarian, Slovene, Bosnian, Croatian, Serbian (Latin script), Romanian (before 1993 spelling reform) and Albanian."
Peter Mortensen

Check the language for non-Unicode programs. If you have problems with Russian in the Windows console, then you should set Russian here:

https://i.stack.imgur.com/45C5G.png


That doesn't enable support for Unicode in cmd, it only switches the default codepage to cp866 which is still an 8-bit character set. It even uses cp866 instead of cp1251 which adds its own shitload of trouble.
See also my answer below for a new option in newer Windows 10 versions.
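
The comment about cp866 still being an 8-bit character set can be checked with Python's built-in codecs. This is a minimal sketch, not something from the thread; the helper name can_encode is hypothetical.

```python
def can_encode(text, code_page):
    """Return True if every character of the text exists in the code page."""
    try:
        text.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


russian = "Привет"
oem_bytes = russian.encode("cp866")    # OEM (console) code page for Russian
ansi_bytes = russian.encode("cp1251")  # ANSI (GUI) code page for Russian

print(can_encode(russian, "cp866"))    # Russian text fits
print(can_encode("š", "cp866"))        # the letter from the question does not
print(oem_bytes == ansi_bytes)         # same text, different bytes
```

Both code pages use one byte per character, but they assign different bytes to the same letters, which is the source of the extra trouble the comment mentions.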
Wernfried Domscheit

It is quite difficult to change the default code page of the Windows console. When you search the web you find different proposals; however, some of them may break your Windows installation entirely, i.e. your PC does not boot anymore.

The safest solution is this one: go to the registry key HKEY_CURRENT_USER\Software\Microsoft\Command Processor and add the String value Autorun = chcp 65001.

Or you can use this small Batch-Script for the most common code pages.

@ECHO off

SET ROOT_KEY="HKEY_CURRENT_USER"


FOR /f "skip=2 tokens=3" %%i in ('reg query HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage /v OEMCP') do set OEMCP=%%i

ECHO System default values:

ECHO.
ECHO ...............................................
ECHO Select Codepage 
ECHO ...............................................
ECHO.
ECHO 1 - CP1252
ECHO 2 - UTF-8
ECHO 3 - CP850
ECHO 4 - ISO-8859-1
ECHO 5 - ISO-8859-15
ECHO 6 - US-ASCII
ECHO.
ECHO 9 - Reset to System Default (CP%OEMCP%)
ECHO 0 - EXIT
ECHO.


SET /P  CP="Select a Codepage: "

if %CP%==1 (
    echo Set default Codepage to CP1252
    reg add "%ROOT_KEY%\Software\Microsoft\Command Processor" /v Autorun /t REG_SZ /d "@chcp 1252>nul" /f
) else if %CP%==2 (
    echo Set default Codepage to UTF-8
    reg add "%ROOT_KEY%\Software\Microsoft\Command Processor" /v Autorun /t REG_SZ /d "@chcp 65001>nul" /f
) else if %CP%==3 (
    echo Set default Codepage to CP850
    reg add "%ROOT_KEY%\Software\Microsoft\Command Processor" /v Autorun /t REG_SZ /d "@chcp 850>nul" /f
) else if %CP%==4 (
    echo Set default Codepage to ISO-8859-1
    reg add "%ROOT_KEY%\Software\Microsoft\Command Processor" /v Autorun /t REG_SZ /d "@chcp 28591>nul" /f
) else if %CP%==5 (
    echo Set default Codepage to ISO-8859-15
    reg add "%ROOT_KEY%\Software\Microsoft\Command Processor" /v Autorun /t REG_SZ /d "@chcp 28605>nul" /f
) else if %CP%==6 (
    echo Set default Codepage to ASCII
    reg add "%ROOT_KEY%\Software\Microsoft\Command Processor" /v Autorun /t REG_SZ /d "@chcp 20127>nul" /f
) else if %CP%==9 (
    echo Reset Codepage to System Default
    reg delete "%ROOT_KEY%\Software\Microsoft\Command Processor" /v AutoRun /f
) else if %CP%==0 (
    echo Bye
) else (
    echo Invalid choice
    pause
)

Using @chcp 65001>nul instead of chcp 65001 suppresses the output "Active code page: 65001" that you would otherwise get every time you start a new command line window.

A full list of the available code page numbers can be found under Code Page Identifiers.

Note that the settings will apply only to the current user. If you would like to set it for all users, replace the line SET ROOT_KEY="HKEY_CURRENT_USER" with SET ROOT_KEY="HKEY_LOCAL_MACHINE".


nice idea and usable example too!
User

Actually, the trick is that the command prompt understands these non-English characters; it just can't display them correctly.

When I enter a path in the command prompt that contains some non-English characters, it is displayed as "?? ?????? ?????". When you submit your command (cd "??? ?????? ?????" in my case), everything works as expected.


This is probably a bit dangerous, as you could get a naming conflict. E.g., if you have two files which both render as "???", and you enter "cd ???", it wouldn't know which to use (or worse, would choose an arbitrary one).
You don't enter ???, you enter the real name it's just being displayed as ???. Think of it as of a password input box. Whatever you enter is displayed as ***, but submitted is the original text.
This does indeed work for commands run directly in the command prompt. However, when running a .cmd batch file, I still need to put chcp 65001 at the top of the batch file.
In your case, it is a font problem... the content is there, just no proper font to display it. But OP is different.
Peter Mortensen

On a Windows 10 x64 machine, I made the command prompt display non-English characters by:

Open an elevated command prompt (run CMD.EXE as administrator). Query the registry for the TrueType fonts available to the console:

    REG query "HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Console\TrueTypeFont"

You'll see an output like:

    0    REG_SZ    Lucida Console
    00    REG_SZ    Consolas
    936    REG_SZ    *新宋体
    932    REG_SZ    *MS ゴシック

Now we need to add a TrueType font that supports the characters you need like Courier New. We do this by adding zeros to the string name, so in this case the next one would be "000":

    REG ADD "HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Console\TrueTypeFont" /v 000 /t REG_SZ /d "Courier New"

Now we implement UTF-8 support:

    REG ADD HKCU\Console /v CodePage /t REG_DWORD /d 65001 /f

Set default font to "Courier New":

    REG ADD HKCU\Console /v FaceName /t REG_SZ /d "Courier New" /f

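Set font size to 20 (FontSize stores the height in the high word of the DWORD, so a height of 20 is written as 0x00140000):

    REG ADD HKCU\Console /v FontSize /t REG_DWORD /d 0x00140000 /f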

Enable quick edit if you like:

    REG ADD HKCU\Console /v QuickEdit /t REG_DWORD /d 1 /f
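
A note on the FontSize value: as far as I know, the console reads this DWORD as two 16-bit halves, with the font height in the high word and the width in the low word (0 lets a TrueType font pick its own width). Under that assumption, a plain decimal 20 would be read as height 0 and width 20. The Python sketch below only does the arithmetic; the function names are hypothetical and nothing here touches the registry.

```python
def pack_font_size(height, width=0):
    """Combine height (high word) and width (low word) into one DWORD value."""
    return (height << 16) | width


def unpack_font_size(value):
    """Split a FontSize DWORD back into (height, width)."""
    return value >> 16, value & 0xFFFF


wanted = pack_font_size(20)
print(hex(wanted), wanted)     # value to store for a height of 20
print(unpack_font_size(20))    # how a plain 20 would be interpreted
```

If the font comes out unexpectedly small after applying the steps above, checking the stored FontSize value against this layout is a reasonable first step.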

In general using codepage 65001 will only work without bugs in Windows 10 with the Creators update. In Windows 7 it will have both output and input bugs. In Windows 8 and older versions of Windows 10 it only has the input bug, which limits input to 7-bit ASCII.
I tried using this method, and now the font is super small and it seems to be permanent.
Peter Mortensen

One really simple option is to install a Windows bash shell such as MinGW and use that:

https://i.stack.imgur.com/o6phE.png

There is a little bit of a learning curve as you will need to use Unix command line functionality, but you will love the power of it and you can set the console character set to UTF-8.

https://i.stack.imgur.com/G51mw.png

Of course you also get all the usual *nix goodies like grep, find, less, etc.


In this (old) case, the issue was with a script rather than a console. Would using bash scripts solve this?
Yes, indeed they would. Bash scripts can be flagged as UTF-8 and just work, with a lot more power than Windows batch files. I know that it was an old case, but I thought the option was worth flagging for future reference, as MS don't seem to be getting much better at Unicode.
Outputting UTF-8 encoded characters is fine. But input is still encoded in the system code page.
Just to add that Windows users may already have a bash shell if you use Git: just open a Git > Git Bash window.
zvi

I found this method useful in new versions of Windows 10:

Turn on this feature: "Beta: Use Unicode UTF-8 for worldwide language support"

Control panel -> Regional settings -> Administrative tab-> Change system locale...

https://i.stack.imgur.com/6D4ut.png


How to achieve this by using powershell or cmd?
I'm trying to display Chinese characters in the console and doing this didn't work on Windows 10 64-bit (Installed in Turkish and later changed to English). Next, I'll try to install Chinese language and see if it works.
Just be careful with this, it broke the functionality of some old and crappy programs that were working fine in server 2019.
VonC

Starting June 2019, with Windows 10, you won't have to change the codepage.

See "Introducing Windows Terminal" (from Kayla Cinnamon) and the Microsoft/Terminal.
Through the use of the Consolas font, partial Unicode support will be provided.

As documented in Microsoft/Terminal issue 387:

There are 87,887 ideographs currently in Unicode. You need all of them too? We need a boundary, and characters beyond that boundary should be handled by font fallback / font linking / whatever. What Consolas should cover: Characters that used as symbols that used by modern OSS programs in CLI. These characters should follow Consolas' design and metrics, and properly aligned with existing Consolas characters. What Consolas should NOT cover: Characters and punctuation of scripts that beyond Latin, Greek and Cyrillic, especially characters need complex shaping (like Arabic). These characters should be handled with font fallback.


Community

As I haven't seen any full answers for Python 2.7, I'll outline the two important steps and an optional step that is quite useful.

You need a font with Unicode support. Windows comes with Lucida Console, which may be selected by right-clicking the title bar of the command prompt and clicking the Defaults option. This also gives access to colours. Note that you can also change settings for command windows invoked in certain ways (e.g., open here, Visual Studio) by choosing Properties instead.

You need to set the code page to cp65001, which appears to be Microsoft's attempt to offer UTF-8 support to the command prompt. Do this by running chcp 65001 in the command prompt. Once set, it remains this way until the window is closed. You'll need to redo this every time you launch cmd.exe.

For a more permanent solution, refer to this answer on Super User. In short, create a REG_SZ (String) entry using regedit at HKEY_LOCAL_MACHINE\Software\Microsoft\Command Processor and name it AutoRun. Change the value of it to chcp 65001. If you don't want to see the output message from the command, use @chcp 65001>nul instead.

Some programs have trouble interacting with this encoding, MinGW being a notable one that fails while compiling with a nonsensical error message. Nonetheless, this works very well and doesn't cause bugs with the majority of programs.


code4j

This problem is quite annoying. I usually have Chinese characters in my file names and file contents. Note that I am using Windows 10; here is my solution:

To display file names, such as with dir (or ls if you installed Ubuntu bash on Windows 10):

Set the region to support non-UTF-8 characters. After that, the console's font will be changed to the font of that locale, and the encoding of the console changes as well.

After you have done the previous steps, to display the content of a UTF-8 file using command-line tools:

Change the code page to UTF-8 with chcp 65001.

Change to a font that supports the characters, such as Lucida Console.

Use the type command to peek at the file content, or cat if you installed Ubuntu bash on Windows 10.

Please note that, after setting the encoding of the console to UTF-8, I can't type Chinese characters in cmd using a Chinese input method.

The laziest solution: Just use a console emulator such as http://cmder.net/


This didn't work for me. The Chinese characters in the output of point command are still garbled.
@SiqingYu I give up the crazy setting. Just use blog.miniasp.com/post/2015/09/27/Useful-tool-Cmder.aspx
I used Cmder before, but it cannot replace the developer console used by Visual Studio.
@SiqingYu Do you mean the c# interactive powershell?
Not the interactive power shell, but the developer console, used by Visual C++ too. It is the default debug console in Win32 Console Application projects.
Peter Mortensen

For a similar problem, (my problem was to show UTF-8 characters from MySQL on a command prompt),

I solved it like this:

I changed the font of the command prompt to Lucida Console. (This step is probably irrelevant for your situation. It only affects what you see on the screen, not what the character really is.)

I changed the code page to Windows-1253. You do this at the command prompt with "chcp 1253". It worked in my case, where I wanted to see UTF-8.


Windows-1253 isn't a Unicode code page. It's a standard 256-character code page. Apparently you only used characters that can be displayed in that code page, but it won't be universal.
S. Hristov

A quick solution for .bat files, if your computer displays your path/file name correctly when you type it in a DOS window:

Type copy con temp.txt [press Enter]

Type the path/file name [press Enter]

Press Ctrl-Z [press Enter]

This way you create a .txt file, temp.txt. Open it in Notepad, copy the text (don't worry, it will look unreadable) and paste it into your .bat file. Executing the .bat file created this way in a DOS window worked for me (Cyrillic, Bulgarian).


Robert Boehne

I see several answers here, but they don't seem to address the question: the user wants to get Unicode input from the command line.

Windows uses UTF-16 internally (strings of two-byte code units), so your program needs to get its arguments from the OS in that form. There are two ways to do this:

1) Microsoft has an extension that allows main to take a wide character array: int wmain(int argc, wchar_t *argv[]); https://msdn.microsoft.com/en-us/library/6wd819wh.aspx

2) Call the Windows API to get the Unicode version of the command line: wchar_t **win_argv = CommandLineToArgvW(GetCommandLineW(), &nargs); https://docs.microsoft.com/en-us/windows/desktop/api/shellapi/nf-shellapi-commandlinetoargvw

Read this: http://utf8everywhere.org for detailed info, particularly if you are supporting other operating systems.
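
To see what the wide-character arguments actually contain, here is a small Python sketch (not part of the original answer; the function name utf16_code_units is hypothetical). It lists the 16-bit code units of a string as they would appear in a wchar_t array on Windows.

```python
def utf16_code_units(text):
    """Return the 16-bit code units of the text in UTF-16 (little-endian)."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


# The letter from the question fits in a single code unit.
print([hex(unit) for unit in utf16_code_units("š")])

# A character outside the BMP needs two code units (a surrogate pair).
print([hex(unit) for unit in utf16_code_units("\U0001F600")])
```

So "two bytes per character" holds for characters in the BMP, while anything above U+FFFF occupies two wchar_t elements.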


Ahh, no, I'm sorry, but you missed the question. This is for when I'm writing a program that will receive the unicode characters. My question was about sending the unicode characters to another program (which hopefully supports receiving them, but I really have no way to know except disassembly).
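
For the sending side that the comment above asks about, one option is to launch the tool from a language whose process-spawning call passes a wide command line. The Python 3 sketch below is an illustration only, not a fix confirmed for tf.exe: on Windows, the subprocess module passes the command line as a wide (UTF-16) string, but whether the receiving program reads it correctly is outside the caller's control. The child here is just another Python process that reports the code points it received; the names CHILD_CODE and echo_argument_via_child are hypothetical.

```python
import subprocess
import sys

# The child prints an ASCII-only representation of its first argument,
# so the result does not depend on the console or pipe encoding.
CHILD_CODE = "import sys; print(ascii(sys.argv[1]))"


def echo_argument_via_child(argument):
    """Pass one argument to a child process and return what it reports."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, argument],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


print(echo_argument_via_child("š"))
```

Passing the arguments as a list avoids a batch file altogether, so the encoding of a script file and the active code page of cmd do not come into play.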
Peter Mortensen

A better, cleaner thing to do: just install the free Microsoft Japanese language pack. (Other East Asian language packs will also work, but I have only tested the Japanese one.)

This gives you fonts with larger sets of glyphs, makes them the default, and changes the behavior of various Windows tools such as cmd, WordPad, etc.


Peter Mortensen

Changing the code page to 1252 works for me. My problem was that the "double dollar" symbol § was being converted to another symbol by DOS on Windows Server 2008.

I used CHCP 1252 and put a caret before the symbol in my BCP statement: ^§.


Thanks, it works! I don't know why people voted this down; it is a valid alternative for some people. Code page 1252 also fixes the problem on Windows Server 2012, where the same code with CP 65001 did not work for me. I suppose it depends on which code page the batch script was saved in, or on the OS defaults. In this case it was created with Notepad on a German MUI machine with an en-US base OS.
Peter Mortensen

I got around a similar issue deleting Unicode-named files by referring to them in the batch file by their short (8 dot 3) names.

The short names can be viewed by doing dir /x. Obviously, this only works with Unicode file names that are already known.


New disks have 8.3 name generation disabled by default, so this won't work there.
afkjm

A note for those using WSL who do not want the extra packages from Cygwin or Git: wsltty is available, which provides just the terminal with UTF-8 support.