
How do I determine file encoding in OS X?

I'm trying to enter some UTF-8 characters into a LaTeX file in TextMate (which says its default encoding is UTF-8), but LaTeX doesn't seem to understand them.

Running cat my_file.tex shows the characters properly in Terminal. Running ls -al shows something I've never seen before: an "@" by the file listing:

-rw-r--r--@  1 me      users      2021 Feb 11 18:05 my_file.tex

(And, yes, I'm using \usepackage[utf8]{inputenc} in the LaTeX.)

I've found iconv, but that doesn't seem to be able to tell me what the encoding is -- it'll only convert once I figure it out.

In my experience, the file(1) command has always been pretty good at guessing a file's encoding. I don't know if it's smart enough to use the file's com.apple.TextEncoding extended attribute or not.
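For what it's worth, checking both might look like this on the file from the question (illustrative output; the extended attribute is only present if the app that saved the file wrote it, as TextEdit does, and the number after the semicolon is a CFStringEncoding value that may differ):

$ file --mime-encoding my_file.tex
my_file.tex: utf-8
$ xattr -p com.apple.TextEncoding my_file.tex
utf-8;134217984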

Naman

Using the -I (that's a capital i) option on the file command seems to show the file encoding.

file -I {filename}

This command seemed unable to tell the difference between ASCII and UTF-8 (they are the same for most US characters, but not all; perhaps something that would detect the Unicode-specific bytes)
I'm with @BadPirate here; it doesn't differentiate ASCII from UTF-8 (tested on OS X)
ASCII and UTF-8 are the same unless there's a character beyond 0xFF in the file, or a BOM.
file -I * seems to work perfectly for me (on OSX). A system complained about the encoding of one of many files, without specifying which. All files were ascii, except for one, which was utf-8. Most likely the culprit.
@notJim That's incorrect. ASCII is only defined through 0x7F so anything beyond that point is clearly not ASCII. Unicode and Latin-1 have the same code points in 0x80-0xFF but there is no common encoding of Unicode which is identical to Latin-1 (because that would inherently be restricted to 8 bits, which is much too little for Unicode).
random_user_name

In Mac OS X the command file -I (capital i) will give you the proper character set so long as the file you are testing contains characters outside of the basic ASCII range.

For instance, if you go into Terminal and use vi to create a file, e.g. vi test.txt, then insert some characters, including an accented character (try ALT-e followed by e), then save the file.

Then type file -I test.txt and you should get a result like this:

test.txt: text/plain; charset=utf-8
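For contrast, a file that happens to contain only ASCII characters is reported as us-ascii rather than utf-8 (an illustrative transcript):

$ printf 'plain ascii\n' > ascii.txt
$ file -I ascii.txt
ascii.txt: text/plain; charset=us-ascii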


I can confirm the OS X case, charset=us-ascii or charset=utf-8 depending on the content of the file
But it only seems to look at the first few KB of the file. In my case, the vim command at stackoverflow.com/a/33644535/161022 correctly identified the file as utf-8, whereas the file command claims it's us-ascii.
Indeed, it appears that file cheats for performance reasons. I just created a 3MB ASCII file on Ubuntu and added a few UTF-8 characters to the end and it still reports ASCII not UTF-8. I tried the -k option (keep going) but then it reports "data" not "UTF-8" so still no good.
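A rough way to reproduce that limitation (file names and sizes are only examples; the exact amount of data file inspects depends on your file version):

$ perl -e 'print "a" x 3000000' > big.txt    # ~3 MB of pure ASCII
$ printf 'é\n' >> big.txt                    # one UTF-8 character at the very end
$ file --mime-encoding big.txt
big.txt: us-ascii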
Community

The @ means that the file has extended file attributes associated with it. You can query them using the getxattr() function.

There's no definitive way to detect the encoding of a file. Read this answer; it explains why.

There's a command line tool, enca, that attempts to guess the encoding. You might want to check it out.


I was assuming that OSX stored the encoding as meta-data. I understood the file contents were just a cluster of bits and had no inherent encoding.
@JamesA.Rosen OS X apps like TextEdit do store file encoding as an attribute (named "com.apple.TextEncoding"). It's quite likely that the attributes indicated by that @ include the file encoding attribute. You can use the command xattr -p com.apple.TextEncoding <filename> to look at the encoding attribute if it exists.
Can you please explain how to use getxattr? I am not able to use it.
That's a function call you would use if you want to write a program. From the command line, just type ls -l@ <filename> to see what attributes are set for the file. To see the actual attribute, type xattr -p com.apple.TextEncoding <filename>
To get enca, do brew install enca. You have to specify a language but none works, so: enca FILENAME -L __
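Putting that together, an enca session might look like this (the __ pseudo-language means "no particular language"; the output wording varies with the enca version):

$ brew install enca
$ enca -L __ my_file.tex
Universal transformation format 8 bits; UTF-8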
jmettraux
vim -c 'execute "silent !echo " . &fileencoding | q' {filename}

aliased somewhere in my bash configuration as

alias vic="vim -c 'execute \"silent \!echo \" . &fileencoding | q'"

so I just type

vic {filename}

On my vanilla OSX Yosemite, it yields more precise results than "file -I":

$ file -I pdfs/udocument0.pdf
pdfs/udocument0.pdf: application/pdf; charset=binary
$ vic pdfs/udocument0.pdf
latin1
$
$ file -I pdfs/t0.pdf
pdfs/t0.pdf: application/pdf; charset=us-ascii
$ vic pdfs/t0.pdf
utf-8

This is the only answer that gave me what I needed – "latin1", as opposed to "us-ascii". Although, I did have to remove the backslashes.
Thanks a lot, I removed the backslashes.
$ alias vic="vim -c 'execute \"silent !echo \" . &fileencoding | q'" -bash: !echo: event not found
@AntonTropashko alias vic="vim -c 'execute \"silent \!echo \" . &fileencoding | q'"
RPM

You can also convert from one encoding to another using the following command:

iconv -f original_charset -t new_charset originalfile > newfile

e.g.

iconv -f utf-16le -t utf-8 file1.txt > file2.txt
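Although iconv won't tell you the encoding, it exits with a non-zero status when the input isn't valid in the source encoding you give it, so you can use it as a crude validity check; a small sketch:

$ iconv -f utf-8 -t utf-8 my_file.tex > /dev/null && echo "looks like valid UTF-8"

This only tells you the bytes are consistent with that encoding, not which encoding the author actually intended.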

bx2

Just use:

file -I <filename>

That's it.


I can't be bothered to vote down, but that answer is completely wrong. Small -i says do not classify the contents if it is a regular file. -I is equivalent to --mime, which outputs MIME type strings. The OS X tools behave differently from standard Linux tools.
Well, for a Windows-1252 encoded file, file -I gets me text/plain; charset=unknown-8bit. It works better for a UTF-8 file: text/plain; charset=utf-8.
rstackhouse

Using the file command with the --mime-encoding option (e.g. file --mime-encoding some_file.txt) instead of the -I option works on OS X and has the added benefit of omitting the MIME type, "text/plain", which you probably don't care about.
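For example (hypothetical file name):

$ file --mime-encoding some_file.txt
some_file.txt: utf-8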


ls -l@a will show extended attributes. Looking at the man page for ls on Yosemite, I don't see a --mime-encoding option.
You were talking about the file command. Didn't know that one existed. Noob. Anyway. Sorry about the downvote. SO won't let me undo it unless someone edits this answer.
Will Robertson

Classic 8-bit LaTeX is very restricted in which UTF-8 characters it can use; it's highly dependent on the encoding of the font you're using and which glyphs that font has available.

Since you don't give a specific example, it's hard to know exactly where the problem is — whether you're attempting to use a glyph that your font doesn't have or whether you're not using the correct font encoding in the first place.

Here's a minimal example showing how a few UTF8 characters can be used in a LaTeX document:

\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage{lmodern}
\usepackage[utf8]{inputenc}
\begin{document}
‘Héllø—thêrè.’
\end{document}

You may have more luck with the [utf8x] encoding, but be slightly warned that it's no longer supported and has some idiosyncrasies compared with [utf8] (as far as I recall; it's been a while since I've looked at it). But if it does the trick, that's all that matters for you.


Jouni K. Seppänen

The @ sign means the file has extended attributes. xattr file shows what attributes it has, xattr -l file shows the attribute values too (which can be large sometimes — try e.g. xattr /System/Library/Fonts/HelveLTMM to see an old-style font that exists in the resource fork).
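An illustrative session on a file saved by a Cocoa text editor (the number after the semicolon is a CFStringEncoding value and may differ):

$ xattr my_file.tex
com.apple.TextEncoding
$ xattr -l my_file.tex
com.apple.TextEncoding: utf-8;134217984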


dreamlax

Typing file myfile.tex in a terminal can sometimes tell you the encoding and type of file using a series of algorithms and magic numbers. It's fairly useful but don't rely on it providing concrete or reliable information.

A Localizable.strings file (found in localised Mac OS X applications) is typically reported to be a UTF-16 C source file.


pi3

Synalyze It! allows you to compare text or bytes in all encodings the ICU library offers. Using that feature, you usually see immediately which code page makes sense for your data.


jmdeamer

You can try loading the file into a Firefox window, then go to View > Character Encoding. There should be a check mark next to the file's encoding type.


Joao Encarnacao

I implemented the bash script below; it works for me.

It first tries to iconv from the encoding returned by file --mime-encoding to utf-8.

If that fails, it goes through all encodings and shows the diff between the original and re-encoded file. It skips over encodings that produce a large diff output ("large" as defined by the MAX_DIFF_LINES variable or the second input argument), since those are most likely the wrong encoding.

If "bad things" happen as a result of using this script, don't blame me. There's a rm -f in there, so there be monsters. I tried to prevent adverse effects by using it on files with a random suffix, but I'm not making any promises.

Tested on Darwin 15.6.0.

#!/bin/bash

if [[ $# -lt 1 ]]
then
  echo "ERROR: need one input argument: file of which the enconding is to be detected."
  exit 3
fi

if [ ! -e "$1" ]
then
  echo "ERROR: cannot find file '$1'"
  exit 3
fi

if [[ $# -ge 2 ]]
then
  MAX_DIFF_LINES=$2
else
  MAX_DIFF_LINES=10
fi


#try the easy way
ENCOD=$(file --mime-encoding "$1" | awk '{print $2}')
#check if this encoding is valid
iconv -f "$ENCOD" -t utf-8 "$1" &> /dev/null
if [ $? -eq 0 ]
then
  echo $ENCOD
  exit 0
fi

#hard way, need the user to visually check the difference between the original and re-encoded files
for i in $(iconv -l | awk '{print $1}')
do
  SINK=$1.$i.$RANDOM
  iconv -f "$i" -t utf-8 "$1" 2> /dev/null > "$SINK"
  if [ $? -eq 0 ]
  then
    DIFF=$(diff "$1" "$SINK")
    if [ ! -z "$DIFF" ] && [ $(echo "$DIFF" | wc -l) -le $MAX_DIFF_LINES ]
    then
      echo "===== $i ====="
      echo "$DIFF"
      echo "Does that make sense [N/y]"
      read ANSWER
      if [ "$ANSWER" == "y" ] || [ "$ANSWER" == "Y" ]
      then
        echo $i
        exit 0
      fi
    fi
  fi
  #clean up re-encoded file
  rm -f "$SINK"
done

echo "None of the encondings worked. You're stuck."
exit 3
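If you save the script as, say, detect_encoding.sh (the name is arbitrary), a run might look like this:

$ chmod +x detect_encoding.sh
$ ./detect_encoding.sh my_file.tex 20
utf-8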

Keltia

Which LaTeX are you using? When I was using teTeX, I had to manually download the unicode package and add this to my .tex files:

% UTF-8 stuff
\usepackage[notipa]{ucs}
\usepackage[utf8x]{inputenc}
\usepackage[T1]{fontenc}

Now that I've switched over to XeTeX from the TeXLive 2008 package (here), it is even simpler:

% UTF-8 stuff
\usepackage{fontspec}
\usepackage{xunicode}

As for detecting a file's encoding, you could play with file(1) (but it is rather limited); as someone else said, it is difficult.


jalf

A brute-force way to check the encoding might be to examine the file in a hex editor or similar (or write a program to check). Look at the binary data in the file. The UTF-8 format is fairly easy to recognize: all ASCII characters are single bytes with values below 128 (0x80), and multibyte sequences follow the pattern shown in the Wikipedia article on UTF-8.

If you can find a simpler way to get a program to verify the encoding for you, that's obviously a shortcut, but if all else fails, this would do the trick.
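If you want to eyeball the bytes without a dedicated hex editor, something like this works (the file name is just an example):

$ hexdump -C my_file.tex | less

Lines whose bytes are all in the 00-7f range are plain ASCII; UTF-8 multibyte sequences appear as a lead byte in the c2-f4 range followed by one or more continuation bytes in the 80-bf range.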