
How to remove non-UTF-8 characters from a text file

I have a bunch of Arabic, English, and Russian files which are encoded in UTF-8. When I try to process these files with a Perl script, I get this error:

Malformed UTF-8 character (fatal)

Manually checking the content of these files, I found some strange characters in them. Now I'm looking for a way to automatically remove these characters from the files.

Is there any way to do it?

Maybe it's the same as this: stackoverflow.com/questions/7656283/…
What are non-UTF-8 characters? All characters in a well-formed UTF-8 string are UTF-8 (actually Unicode) characters! Some of them are simply encoded as several consecutive bytes.
@BasileStarynkevitch: the error message clearly states that there is a malformed UTF-8 character. That means that a byte appeared that cannot appear as part of a valid UTF-8 file. That's not hard; it could be a 0xC0 or 0xC1 byte, or 0xF5..0xFF, or a sequencing problem with bytes that would otherwise be valid.
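A quick way to locate such lines before changing anything, assuming GNU grep and a UTF-8 locale (file.txt is a placeholder name):

# -a treats the file as text, -x matches whole lines, -v inverts the match;
# in a UTF-8 locale '.' does not match invalid bytes, so only lines
# containing malformed UTF-8 are printed.
grep -axv '.*' file.txt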

wberry

This command:

iconv -f utf-8 -t utf-8 -c file.txt

will clean up your UTF-8 file, skipping all the invalid characters.

-f is the source format
-t the target format
-c skips any invalid sequence
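If you have a whole directory of files to clean, a minimal sketch (assuming a bash shell with iconv on the PATH; the *.txt glob and the .clean suffix are placeholders for your own names):

# Clean every .txt file in the current directory.
# Write to a temporary copy first so an original is only replaced on success.
for f in *.txt; do
    iconv -f utf-8 -t utf-8 -c "$f" > "$f.clean" && mv "$f.clean" "$f"
done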

"iconv -f utf-8 -t utf-8 -c file.txt" on a Mac. hyphen between 'f' and '8'
Conveniently, you can transform the clipboard contents on a Mac like this: pbpaste | iconv -f utf-8 -t utf-8 -c | pbcopy. I also created an Alfred workflow with a global shortcut that strips all special characters by targeting ascii.
This produced a completely blank file for me. Just letting everyone know this is potentially destructive, so back up your file before running it.
iconv -f utf-8 -t ascii//TRANSLIT solved my problem. It converts curly quotes to straight quotes.
Use -o to write to a different output file.
Charles Knell

Whatever method you use has to read byte by byte and understand how characters are built up from bytes. The simplest approach is to use an editor that will read anything but only write out UTF-8 characters. TextPad is one choice.


iconv is not available in Cygwin. Is there any way to do this on Windows/Cygwin? I have a big (100,000+ line) XML file that needs invalid characters stripped out. I don't care about keeping valid UTF-8. I've set Notepad++ to UTF-8, but even after saving the file from there I still get errors in the XML parser.
On Windows, Ubuntu under WSL comes with iconv.
Zombo

iconv can do it

iconv -f cp1252 foo.txt
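As a hedged variant, if the stray bytes really are Windows-1252 and you want the result written as UTF-8 to a separate file (foo.utf8.txt is just a placeholder name):

# Re-encode from Windows-1252 to UTF-8 into a new file, leaving the original untouched.
iconv -f cp1252 -t utf-8 foo.txt > foo.utf8.txt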

Mythos

None of the methods here or in other similar questions worked for me. In the end, what worked was simply opening the file in Sublime Text 2, going to File > Reopen with Encoding > UTF-8, then copying the entire content of the file into a new file and saving it.

This may not be the expected solution, but I'm putting it out here in case it helps anyone, since I had been struggling with this for hours.


bensiu
cat foo.txt | strings -n 8 > bar.txt

will do the job.


No, this will also kill a lot of valid utf-8 characters.
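Since the files are being processed with a Perl script anyway, here is a sketch of a Perl alternative that keeps valid multi-byte characters and only drops the malformed bytes (it assumes line-oriented files and that U+FFFD does not already occur legitimately in them; file.txt is a placeholder name):

# Decode with replacement, drop the U+FFFD marks that stand in for
# malformed bytes, then re-encode; -i.bak keeps a backup of each original.
perl -i.bak -MEncode -pe '$_ = decode("UTF-8", $_, Encode::FB_DEFAULT); s/\x{FFFD}//g; $_ = encode("UTF-8", $_);' file.txt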