I have a bunch of Arabic, English, and Russian files which are encoded in UTF-8. When I try to process these files with a Perl script, I get this error:
Malformed UTF-8 character (fatal)
Manually checking the content of these files, I found some strange characters in them. Now I'm looking for a way to automatically remove these characters from the files.
Is there any way to do it?
This command:
iconv -f utf-8 -t utf-8 -c file.txt
will clean up your UTF-8 file, skipping all the invalid characters.
-f is the source encoding
-t is the target encoding
-c skips any invalid sequence
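Note that iconv writes to standard output, so redirect it to a new file rather than back onto the input file, which would truncate it before it is read. A minimal sketch (clean.txt is just an illustrative temporary name):
iconv -f utf-8 -t utf-8 -c file.txt > clean.txt
mv clean.txt file.txt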
Whatever method you use must read the files byte by byte and understand how UTF-8 builds characters from multi-byte sequences. The simplest approach is to use an editor that will read anything but only write out valid UTF-8 characters; TextPad is one choice.
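If you want to see which lines actually contain invalid sequences before removing anything, GNU grep can locate them (this assumes GNU grep running in a UTF-8 locale, where . matches only valid characters):
grep -axv '.*' file.txt
Here -a treats the file as text, -x anchors the pattern to whole lines, and -v inverts the match, so only lines that are not entirely valid UTF-8 are printed.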
None of the methods here or on other similar questions worked for me. In the end, what worked was simply opening the file in Sublime Text 2: go to File > Reopen with Encoding > UTF-8, then copy the entire content of the file into a new file and save it.
This may not be the expected solution, but I'm putting it out here in case it helps anyone, since I struggled with this for hours.
cat foo.txt | strings -n 8 > bar.txt
will do the job. Be aware that strings is lossy: -n 8 keeps only runs of at least 8 printable characters, and by default only ASCII counts as printable, so non-ASCII text (such as the Arabic and Russian content here) is stripped along with the invalid bytes.
pbpaste | iconv -f utf-8 -t utf-8 -c | pbcopy
cleans up the macOS clipboard in place. I also created an Alfred workflow with a global shortcut for stripping all special characters by targeting ascii.

iconv -f utf-8 -t ascii//TRANSLIT
solved my problem. It converts curly quotes to straight quotes.

Use -o to write to a different output file.
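For example, if your iconv supports -o (GNU iconv does), this writes the cleaned text to a separate file in one step (clean.txt is just an illustrative name):
iconv -f utf-8 -t utf-8 -c -o clean.txt file.txt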