
Detect encoding and make everything UTF-8

I'm reading out lots of texts from various RSS feeds and inserting them into my database.

Of course, there are several different character encodings used in the feeds, e.g. UTF-8 and ISO 8859-1.

Unfortunately, there are sometimes problems with the encodings of the texts. Example:

The "ß" in "Fußball" should look like this in my database: "ÃŸ". If it is a "ÃŸ", it is displayed correctly. Sometimes, the "ß" in "Fußball" looks like this in my database: "ÃƒÅ¸". Then it is displayed wrongly, of course. In other cases, the "ß" is saved as a "ß" - so without any change. Then it is also displayed wrongly.

What can I do to avoid the cases 2 and 3?

How can I make everything the same encoding, preferably UTF-8? When must I use utf8_encode(), when must I use utf8_decode() (it's clear what the effect is but when must I use the functions?) and when must I do nothing with the input?

How do I make everything the same encoding? Perhaps with the function mb_detect_encoding()? Can I write a function for this? So my problems are:

How do I find out what encoding the text uses? How do I convert it to UTF-8 - whatever the old encoding is?

Would a function like this work?

function correct_encoding($text) {
    $current_encoding = mb_detect_encoding($text, 'auto');
    $text = iconv($current_encoding, 'UTF-8', $text);
    return $text;
}

I've tested it, but it doesn't work. What's wrong with it?

"The "ß" in "Fußball" should look like this in my database: "ÃŸ"." No, it should look like ß. Make sure your collation and connection are set up correctly. Otherwise sorting and searching will be broken for you.
Your database is badly set up. If you want to store Unicode content, just configure it for that. So instead of trying to work around the issue in your PHP code, you should first fix the database.
USE: $from=mb_detect_encoding($text); $text=mb_convert_encoding($text,'UTF-8',$from);

Peter Mortensen

If you apply utf8_encode() to an already UTF-8 string, it will return garbled UTF-8 output.
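To make the double-encoding effect concrete, here is a minimal sketch. It uses mb_convert_encoding() with an explicit ISO-8859-1 source as a stand-in for utf8_encode() (which is deprecated as of PHP 8.2); the variable names are only illustrative.

```php
<?php
// "Fußball" encoded as UTF-8: ß is the two bytes C3 9F.
$utf8 = "Fu\xC3\x9Fball";

// Like utf8_encode(), this treats every input byte as one ISO-8859-1
// character, so the two bytes of ß are each encoded again.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), PHP_EOL;   // 4675c39f62616c6c
echo bin2hex($double), PHP_EOL; // 4675c383c29f62616c6c
```

The result is still valid UTF-8, which is why the damage is easy to miss: it just no longer spells "Fußball".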

I made a function that addresses all these issues. It's called Encoding::toUTF8().

You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.

I did it because a service was giving me a feed of data all messed up, mixing UTF-8 and Latin1 in the same string.

Usage:

require_once('Encoding.php');
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);

$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);

Download:

https://github.com/neitanod/forceutf8

I've included another function, Encoding::fixUTF8(), which will fix every UTF-8 string that looks garbled.

Usage:

require_once('Encoding.php');
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂédÃÂération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football

I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().


Well, if you look at the code, fixUTF8() simply calls forceUTF8() again and again until the string is returned unchanged. One call to fixUTF8() takes at least twice the time of a call to forceUTF8(), so it's a lot less performant. I made fixUTF8() just to create a command line program that would fix "encode-corrupted" files, but in a live environment it is rarely needed.
How does this convert non-UTF8 characters to UTF8, without knowing what encoding the invalid characters are in to begin with?
It assumes ISO-8859-1, the answer already says this. The only difference between forceUTF8() and utf8_encode() is that forceUTF8() recognizes UTF8 characters and keeps them unchanged.
"You don't need to know what the encoding of your strings is." - I very much disagree. Guessing and trying may work, but sooner or later you'll always encounter edge cases where it doesn't.
I totally agree. In fact, I didn't mean to state that as a general rule, just explain that this class might help you if that's the situation you happen to find yourself in.
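The behaviour described in the comments above (keep byte sequences that are already valid UTF-8, treat every other byte as ISO-8859-1) could be sketched roughly as follows. This is not the library's code; mixedToUtf8Sketch() is a hypothetical name, and the regular expression is adapted from the UTF-8 ranges used elsewhere in this thread.

```php
<?php
function mixedToUtf8Sketch(string $s): string
{
    // One well-formed UTF-8 character (ASCII included) ...
    $validChar = '[\x00-\x7F]'
        . '|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
        . '|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2}';
    // ... or, failing that, any single byte.
    $pattern = '/(' . $validChar . ')|(.)/s';

    $result = preg_replace_callback($pattern, function (array $m): string {
        if ($m[1] !== '') {
            return $m[1]; // already UTF-8: keep unchanged
        }
        // Stray byte: assume ISO-8859-1 (an assumption, as the comments note)
        return mb_convert_encoding($m[2], 'UTF-8', 'ISO-8859-1');
    }, $s);

    return $result ?? $s;
}

// "Fußball" in UTF-8 followed by "café" in ISO-8859-1, in one string
echo mixedToUtf8Sketch("Fu\xC3\x9Fball caf\xE9"), PHP_EOL;
```

The guess is inherently ambiguous: ISO-8859-1 text that happens to contain a byte pair such as C3 A9 ("Ã©") is indistinguishable from UTF-8 "é" and will be kept as the latter.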
Peter Mortensen

You first have to detect what encoding has been used. As you’re parsing RSS feeds (probably via HTTP), you should read the encoding from the charset parameter of the Content-Type HTTP header field. If it is not present, read the encoding from the encoding attribute of the XML processing instruction. If that’s missing too, use UTF-8 as defined in the specification.

Here is what I probably would do:

I’d use cURL to send and fetch the response. That allows you to set specific header fields and fetch the response header as well. After fetching the response, you have to parse the HTTP response and split it into header and body. The header should then contain the Content-Type header field that contains the MIME type and (hopefully) the charset parameter with the encoding/charset too. If not, we’ll analyse the XML PI for the presence of the encoding attribute and get the encoding from there. If that’s also missing, the XML specs define to use UTF-8 as encoding.

$url = 'http://www.lr-online.de/storage/rss/rss/sport.xml';

$accept = array(
    'type' => array('application/rss+xml', 'application/xml', 'application/rdf+xml', 'text/xml'),
    'charset' => array_diff(mb_list_encodings(), array('pass', 'auto', 'wchar', 'byte2be', 'byte2le', 'byte4be', 'byte4le', 'BASE64', 'UUENCODE', 'HTML-ENTITIES', 'Quoted-Printable', '7bit', '8bit'))
);
$header = array(
    'Accept: '.implode(', ', $accept['type']),
    'Accept-Charset: '.implode(', ', $accept['charset']),
);
$encoding = null;
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HEADER, true);
curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
$response = curl_exec($curl);
if (!$response) {
    // error fetching the response
} else {
    $offset = strpos($response, "\r\n\r\n");
    $header = substr($response, 0, $offset);
    // Extract the body up front: it is needed whether or not the header declares a charset
    $body = substr($response, $offset + 4);
    if (!$header || !preg_match('/^Content-Type:\s+([^;\r\n]+)(?:;\s*charset=(.*))?/im', $header, $match)) {
        // error parsing the response
    } else {
        if (!in_array(strtolower(trim($match[1])), array_map('strtolower', $accept['type']))) {
            // type not accepted
        }
        // The charset group is optional, so it may be missing from $match
        $encoding = isset($match[2]) ? strtolower(trim($match[2], "\"' \t\r")) : null;
    }
    if (!$encoding) {
        if (preg_match('/^<\?xml\s+version=(?:"[^"]*"|\'[^\']*\')\s+encoding=("[^"]*"|\'[^\']*\')/s', $body, $match)) {
            $encoding = strtolower(trim($match[1], '"\''));
        }
    }
    if (!$encoding) {
        $encoding = 'utf-8';
    } else {
        if (!in_array($encoding, array_map('strtolower', $accept['charset']))) {
            // encoding not accepted
        }
        if ($encoding != 'utf-8') {
            $body = mb_convert_encoding($body, 'utf-8', $encoding);
        }
    }
    $simpleXML = simplexml_load_string($body, null, LIBXML_NOERROR);
    if (!$simpleXML) {
        // parse error
    } else {
        echo $simpleXML->asXML();
    }
}

Thanks. This would be easy. But would it really work? There are often wrong encodings given in the HTTP headers or in the attributes of XML.
Again: That’s not your problem. Standards were established to avoid such troubles. If others don’t follow them, it’s their problem, not yours.
Ok, I think you've finally convinced me now. :)
Thanks for the code. But why not simply use this? paste.bradleygill.com/index.php?paste_id=9651 Your code is much more complex, what's better with it?
Well, firstly you’re making two requests, one for the HTTP header and one for the data. Secondly, you’re looking for any appearance of charset= and encoding= and not just at the appropriate positions. And thirdly, you’re not checking if the declared encoding is accepted.
troelskn

Detecting the encoding is hard.

mb_detect_encoding works by guessing, based on a number of candidates that you pass it. In some encodings, certain byte sequences are invalid, and therefore it can distinguish between various candidates. Unfortunately, there are a lot of encodings where the same bytes are valid (but mean something different). In these cases, there is no way to determine the encoding; you can implement your own logic to make guesses in these cases. For example, data coming from a Japanese site might be more likely to have a Japanese encoding.

As long as you only deal with Western European languages, the three major encodings to consider are UTF-8, ISO-8859-1 and CP-1252. Since these are the defaults on many platforms, they are also the ones most likely to be reported wrongly. E.g. if people use a different encoding, they are likely to be frank about it, since otherwise their software would break very often. Therefore, a good strategy is to trust the provider, unless the encoding is reported as one of those three. You should still double-check that the input is indeed valid, using mb_check_encoding (note that valid is not the same as correct: the same input may be valid for many encodings). If it is one of those three, you can then use mb_detect_encoding to distinguish between them. Luckily that is fairly deterministic; you just need to use the proper detect-sequence, which is UTF-8,ISO-8859-1,WINDOWS-1252.

Once you've detected the encoding, you need to convert it to your internal representation (UTF-8 is the only sane choice). The function utf8_encode transforms ISO-8859-1 to UTF-8, so it can only be used for that particular input type. For other encodings, use mb_convert_encoding.
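A minimal sketch of the "trust the declared encoding, but verify it" step might look like this. The function name convertDeclaredToUtf8() is hypothetical, and the lookup only accepts names exactly as mb_list_encodings() reports them (aliases such as "latin1" would need extra handling).

```php
<?php
function convertDeclaredToUtf8(string $text, string $declared): string
{
    // Reject labels mbstring does not know at all
    $known = array_map('strtolower', mb_list_encodings());
    if (!in_array(strtolower($declared), $known, true)) {
        throw new InvalidArgumentException("Unknown encoding: $declared");
    }
    // Valid is not the same as correct, but invalid input proves the label wrong
    if (!mb_check_encoding($text, $declared)) {
        throw new InvalidArgumentException("Input is not valid $declared");
    }
    return mb_convert_encoding($text, 'UTF-8', $declared);
}

// "Fußball" in ISO-8859-1, declared correctly
echo convertDeclaredToUtf8("Fu\xDFball", 'ISO-8859-1'), PHP_EOL;
```

Throwing on a mismatch is just one design choice; a feed aggregator might prefer to log the feed and fall back to a guess instead.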


Thank you very much! What's better: mb_convert_encoding() or iconv()? I don't know what the differences are. Yes, I will only have to parse Western European languages, especially English, German and French.
I've just seen: mb_detect_encoding() is useless. It only supports UTF-8, UTF-7, ASCII, EUC-JP, SJIS, eucJP-win, SJIS-win, JIS and ISO-2022-JP. The most important ones for me, ISO-8859-1 and WINDOWS-1252, aren't supported. So I can't use mb_detect_encoding().
My, you're right. It's been a while since I've used it. You'll have to write your own detection code then, or use an external utility. UTF-8 can be fairly reliably determined, because its escape sequences are quite characteristic. Windows-1252 and ISO-8859-1 can be distinguished because Windows-1252 may contain bytes that are illegal in ISO-8859-1. Use Wikipedia to get the details, or look in the comments section of php.net, under various charset-related functions.
I think you can distinguish the different encodings when you look at the forms in which the special characters emerge: The German "ß" emerges in different forms: Sometimes "ÃŸ", sometimes "ÃƒÅ¸" and sometimes "ß". Why?
Yes, but then you need to know the contents of the string before comparing it, and that kind of defeats the purpose in the first place. The German ß appears differently because it has different values in different encodings. Some characters happen to be represented in the same way in different encodings (e.g. all characters in the ASCII charset are encoded in the same way in UTF-8, ISO-8859-* and Windows-1252), so as long as you use just those characters, they all look the same. That's why they are sometimes called ASCII-compatible.
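The hand-rolled detection suggested in these comments could be sketched like this for the three Western European candidates. guessWesternEncoding() is a hypothetical helper, and the rule for bytes 0x80-0x9F is a heuristic: ISO-8859-1 assigns them to rarely used control characters, so their presence merely makes Windows-1252 the likelier reading.

```php
<?php
function guessWesternEncoding(string $bytes): string
{
    // UTF-8 has a strict structure, so valid input is very likely UTF-8
    // (pure ASCII also lands here, which is harmless).
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'UTF-8';
    }
    // Bytes 0x80-0x9F are printable in Windows-1252 but control codes in ISO-8859-1
    if (preg_match('/[\x80-\x9F]/', $bytes)) {
        return 'Windows-1252';
    }
    return 'ISO-8859-1';
}

echo guessWesternEncoding("Fu\xC3\x9Fball"), PHP_EOL; // UTF-8
echo guessWesternEncoding("Fu\xDFball"), PHP_EOL;     // ISO-8859-1
echo guessWesternEncoding("5 \x80"), PHP_EOL;         // Windows-1252
```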
miek

This cheatsheet lists some common caveats related to UTF-8 handling in PHP: http://developer.loftdigital.com/blog/php-utf-8-cheatsheet

This function detecting multibyte characters in a string might also prove helpful (source):


function detectUTF8($string)
{
    return preg_match('%(?:
        [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |\xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
        |\xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |\xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        |[\xF1-\xF3][\x80-\xBF]{3}         # planes 4-15
        |\xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
        )+%xs', 
    $string);
}


I think that doesn't work correctly: echo detectUTF8('3٣3'); # 1
Peter Mortensen

A little heads up. You said that the "ß" should be displayed as "ÃŸ" in your database.

This is probably because you're using a database with Latin-1 character encoding, or possibly your PHP-MySQL connection is set up wrong. That is, PHP believes your MySQL is set to use UTF-8, so it sends data as UTF-8, but your MySQL believes PHP is sending data encoded as ISO 8859-1, so it may once again try to encode the data you sent as UTF-8, causing this kind of trouble.

Take a look at mysql_set_charset. It may help you.


harpax

A really nice way to implement an isUTF8-function can be found on php.net:

function isUTF8($string) {
    return (utf8_encode(utf8_decode($string)) == $string);
}

Unfortunately, this only works when the string only consists of characters that are included in ISO-8859-1. But this could work: @iconv('utf-8', 'utf-8//IGNORE', $str) == $str
It doesn't work correctly: echo (int)isUTF8(' z'); # 1 echo (int)isUTF8(NULL); # 1
Though not perfect, I think this is a nice way to implement a sketchy UTF-8 check.
mb_check_encoding($string, 'UTF-8')
Just to put into context how badly this will work: there are exactly 191 printable characters in ISO 8859-1; Unicode 13 defines about 140000. So if you pick a random Unicode character, encode it correctly as UTF-8, and pass it to this function, there is a more than 99% chance of this function incorrectly returning false. In case you think those are obscure characters, note that ISO 8859-1 has no Euro symbol, so isUTF8('€') will be among that 99%.
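The Euro-sign failure mentioned in the last comment is easy to reproduce. The sketch below rewrites the round-trip check with mb_convert_encoding(), since utf8_encode() and utf8_decode() are deprecated as of PHP 8.2, and compares it with mb_check_encoding(); roundTripIsUtf8() is a hypothetical name.

```php
<?php
// Round trip UTF-8 -> ISO-8859-1 -> UTF-8, as in the isUTF8() idea above
function roundTripIsUtf8(string $s): bool
{
    $latin1 = mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
    return mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1') === $s;
}

$euro = "\xE2\x82\xAC"; // "€" in UTF-8, which has no ISO-8859-1 equivalent

var_dump(roundTripIsUtf8($euro));            // bool(false)
var_dump(mb_check_encoding($euro, 'UTF-8')); // bool(true)
```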
Peter Mortensen

Your encoding looks like you encoded into UTF-8 twice; that is, from some other encoding, into UTF-8, and again into UTF-8. As if you had ISO 8859-1, converted from ISO 8859-1 to UTF-8, and treated the new string as ISO 8859-1 for another conversion into UTF-8.

Here's some pseudocode of what you did:

$inputstring = getFromUser();
$utf8string = iconv($current_encoding, 'utf-8', $inputstring);
$flawedstring = iconv($current_encoding, 'utf-8', $utf8string);

You should try:

1. Detect the encoding using mb_detect_encoding() or whatever you like to use.
2. If it's UTF-8, convert into ISO 8859-1, and repeat step 1.
3. Finally, convert back into UTF-8.

That is presuming that in the "middle" conversion you used ISO 8859-1. If you used Windows-1252, then convert into Windows-1252 (latin1). The original source encoding is not important; the one you used in the flawed second conversion is.

This is my guess at what happened; there's very little else you could have done to get four bytes in place of one extended ASCII byte.
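As a rough sketch of the repair described above: convert the doubly encoded UTF-8 back into the "middle" encoding once, and keep the result only if it is itself valid UTF-8. The helper name undoDoubleEncodingOnce() is hypothetical, and the validity check is a heuristic rather than proof that the input really was double-encoded.

```php
<?php
function undoDoubleEncodingOnce(string $s, string $middle = 'ISO-8859-1'): string
{
    // Reverse the flawed second conversion: UTF-8 back into the middle encoding
    $candidate = mb_convert_encoding($s, $middle, 'UTF-8');

    // If the bytes we get back are valid UTF-8, assume they are the original text;
    // otherwise the input was not double-encoded and is returned untouched.
    return mb_check_encoding($candidate, 'UTF-8') ? $candidate : $s;
}

// "Fußball" double-encoded with ISO-8859-1 in the middle
echo undoDoubleEncodingOnce("Fu\xC3\x83\xC2\x9Fball"), PHP_EOL;
// The same with Windows-1252 in the middle
echo undoDoubleEncodingOnce("Fu\xC3\x83\xC5\xB8ball", 'Windows-1252'), PHP_EOL;
```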

The German language also uses ISO 8859-2 and Windows-1250 (Latin-2).


Halil Özgür

The interesting thing about mb_detect_encoding and mb_convert_encoding is that the order of the encodings you suggest does matter:

// $input is actually UTF-8

mb_detect_encoding($input, "ISO-8859-9, UTF-8");
// ISO-8859-9 (WRONG!)

mb_detect_encoding($input, "UTF-8, ISO-8859-9");
// UTF-8 (OK)

So you might want to use a specific order when specifying expected encodings. Still, keep in mind that this is not foolproof.


This happens because ISO-8859-9 will in practice accept any binary input. The same goes for Windows-1252 and friends. You have to first test for encodings that can fail to accept the input.
@MikkoRantalainen, yeah, I guess this part of the docs says something similar: php.net/manual/en/function.mb-detect-order.php#example-2985
Considering that WHATWG HTML spec defines Windows 1252 as the default encoding, it should be pretty safe to assume if ($input_is_not_UTF8) $input_is_windows1252 = true;. See also: html.spec.whatwg.org/multipage/…
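The point made in these comments, that a single-byte encoding accepts any input and therefore has to be tested last, can be illustrated with a short sketch. ISO-8859-1 is used for the check because every byte value is defined in it; utf8OrWindows1252() is a hypothetical helper implementing the "not UTF-8, so assume Windows-1252" fallback.

```php
<?php
$bytes = "\xC3\x28\xA0\xFF"; // arbitrary bytes, not valid UTF-8

var_dump(mb_check_encoding($bytes, 'UTF-8'));      // bool(false)
var_dump(mb_check_encoding($bytes, 'ISO-8859-1')); // bool(true)

function utf8OrWindows1252(string $input): string
{
    // Test the encoding that can fail first ...
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }
    // ... and only then fall back to the one that accepts almost anything
    return mb_convert_encoding($input, 'UTF-8', 'Windows-1252');
}

echo bin2hex(utf8OrWindows1252("\x80")), PHP_EOL; // e282ac, i.e. "€"
```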
Kevin ORourke

Working out the character encoding of RSS feeds seems to be complicated. Even normal web pages often omit, or lie about, their encoding.

So you could try to use the correct way to detect the encoding and then fall back to some form of auto-detection (guessing).


I don't want to read out the encoding from the feed information. So it doesn't matter if the feed information is wrong. I would like to detect the encoding from the text.
@marco92w: It’s not your problem if the declared encoding is wrong. Standards have not been established for fun.
@Gumbo: but if you're working in the real world you have to be able to deal with things like incorrect declared encodings. The problem is that it's very difficult to guess (correctly) the encoding just from some text. Standards are wonderful, but many (most?) of the pages/feeds out there don't comply with them.
@Kevin ORourke: Exactly, right. That's my problem. @Gumbo: Yes, it's my problem. I want to read out the feeds and aggregate them. So I must correct the wrong encodings.
@marco92w: But you cannot correct the encoding if you don’t know the correct encoding and the current encoding. And that’s what the charset/encoding declaration is for: to describe the encoding the data is encoded in.
Peter Mortensen

You need to test the character set on input since responses can come coded with different encodings.

I force all incoming content into UTF-8 by doing detection and translation using the following function:

function fixRequestCharset()
{
  $ref = array(&$_GET, &$_POST, &$_REQUEST);
  foreach ($ref as &$var)
  {
    foreach ($var as $key => $val)
    {
      $encoding = mb_detect_encoding($var[$key], mb_detect_order(), true);
      if (!$encoding)
        continue;
      if (strcasecmp($encoding, 'UTF-8') != 0)
      {
        $encoding = iconv($encoding, 'UTF-8', $var[$key]);
        if ($encoding === false)
          continue;
        $var[$key] = $encoding;
      }
    }
  }
}

That routine will turn all PHP variables that come from the remote host into UTF-8.

Or ignore the value if the encoding could not be detected or converted.

You can customize it to your needs.

Just invoke it before using the variables.


what is the purpose of using mb_detect_order() without a passed in encoding list?
The purpose is to return the system configured ordered array of encodings defined in php.ini used. This is required by mb_detect_encoding to fill third parameter.
Peter Mortensen

mb_detect_encoding:

echo mb_detect_encoding($str, "auto");

Or

echo mb_detect_encoding($str, "UTF-8, ASCII, ISO-8859-1");

I really don't know what the results are, but I'd suggest you just take some of your feeds with different encodings and try if mb_detect_encoding works or not.

auto is short for "ASCII,JIS,UTF-8,EUC-JP,SJIS". It returns the detected charset, which you can use to convert the string to UTF-8 with iconv.

<?php
function convertToUTF8($str) {
    $enc = mb_detect_encoding($str);

    if ($enc && $enc != 'UTF-8') {
        return iconv($enc, 'UTF-8', $str);
    } else {
        return $str;
    }
}
?>

I haven't tested it, so no guarantee. And maybe there's a simpler way.


Thank you. What's the difference between 'auto' and 'UTF-8, ASCII, ISO-8859-1' as the second argument? Does 'auto' feature more encodings? Then it would be better to use 'auto', wouldn't it? If it really works without any bugs then I must only change "ASCII" or "ISO-8859-1" to "UTF-8". How?
Your function doesn't work well in all cases. Sometimes I get an error: Notice: iconv(): Detected an illegal character in input string in ...
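The notice in the comment above typically means the detected source encoding did not match the actual bytes. A minimal sketch of handling that case explicitly, assuming the iconv extension is available; safeIconvToUtf8() is a hypothetical wrapper:

```php
<?php
function safeIconvToUtf8(string $s, string $from): ?string
{
    // iconv() returns false (and raises a notice) when the input contains
    // bytes that are illegal in $from; suppress the notice and report null.
    $out = @iconv($from, 'UTF-8', $s);
    return $out === false ? null : $out;
}

$latin1 = "Fu\xDFball"; // "Fußball" in ISO-8859-1

var_dump(safeIconvToUtf8($latin1, 'ISO-8859-1') !== null); // conversion succeeds
var_dump(safeIconvToUtf8($latin1, 'UTF-8'));               // wrong label: conversion fails
```

Returning null lets the caller decide whether to try another candidate encoding or to skip the item, instead of storing a half-converted string.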
jocull

I know this is an older question, but I figure a useful answer never hurts. I was having issues with my encoding between a desktop application, SQLite, and GET/POST variables. Some would be in UTF-8, some would be in ASCII, and basically everything would get screwed up when foreign characters got involved.

Here is my solution. It scrubs your GET/POST/REQUEST (I omitted cookies, but you could add them if desired) on each page load before processing. It works well in a header. PHP will throw warnings if it can't detect the source encoding automatically, so these warnings are suppressed with @'s.

// Convert everything in our vars to UTF-8 for playing nice with the database...
// Use some auto detection here to help us not double-encode...
// Suppress possible warnings with @'s for when encoding cannot be detected
$process = array(&$_GET, &$_POST, &$_REQUEST);
// each() was removed in PHP 8, so walk the queue by index instead;
// nested arrays are appended to the queue and processed in later iterations.
for ($key = 0; $key < count($process); $key++) {
    $val = $process[$key]; // copy, so $process[$key] can be modified while iterating
    foreach ($val as $k => $v) {
        unset($process[$key][$k]);
        $newKey = @mb_convert_encoding($k, 'UTF-8', 'auto');
        if (is_array($v)) {
            $process[$key][$newKey] = $v;
            $process[] = &$process[$key][$newKey];
        } else {
            $process[$key][$newKey] = @mb_convert_encoding($v, 'UTF-8', 'auto');
        }
    }
}
unset($process);

Thanks for the answer, jocull. The function mb_convert_encoding() is what we've already had here, right? ;) So the only new thing in your answer is the loops to change encoding in all variables.
Peter Mortensen

It's simple: when you get something that's not UTF-8, you must encode that into UTF-8.

So, when you're fetching a certain feed that's ISO 8859-1, pass it through utf8_encode.

However, if you're fetching an UTF-8 feed, you don't need to do anything.


Thanks! OK, I can find out how the feed is encoded by using mb_detect_encoding(), right? But what can I do if the feed is ASCII? utf8_encode() is just for ISO-8859-1 to UTF-8, isn't it?
ASCII is a subset of ISO-8859-1 AND UTF-8, so using utf8_encode() should not make a change - IF it's actually just ASCII
So I can always use utf8_encode if it's not UTF-8? This would be really easy. The text which was ASCII according to mb_detect_encoding() contained "&auml;". Is this an ASCII character? Or is it HTML?
That's HTML. Actually it's encoded that way so that it shows correctly when you print it in a given page. If you want, you can first utf8_encode() and then html_entity_decode().
The character ß is encoded in UTF-8 with the byte sequence 0xC39F. Interpreted with Windows-1252, that sequence represents the two characters Ã (0xC3) and Ÿ (0x9F). And if you encode this byte sequence again with UTF-8, you’ll get 0xC383 0xC29F, which represents ÃƒÂŸ in Windows-1252. So your mistake is to handle this UTF-8 encoded data as something with an encoding other than UTF-8. That this byte sequence is presented as the character you’re seeing is just a matter of interpretation. If you use another encoding/charset, you’ll probably see other characters.
Peter Mortensen

harpax' answer worked for me. In my case, this is good enough:

if (isUTF8($str)) {
    echo $str;
}
else
{
    echo iconv("ISO-8859-1", "UTF-8//TRANSLIT", $str);
}

Peter Mortensen

I have been looking for solutions to encoding problems for ages, and this page is probably the conclusion of years of searching! I tested some of the suggestions you mentioned and here are my notes:

This is my test string:

this is a "wròng wrìtten" string bùt I nèed to pù 'sòme' special chàrs to see thèm, convertèd by fùnctìon!! & that's it!

I do an INSERT to save this string in a database, in a field that is set as utf8_general_ci.

The character set of my page is UTF-8.

If I do an INSERT just like that, in my database, I have some characters probably coming from Mars...

So I need to convert them into some "sane" UTF-8. I tried utf8_encode(), but alien chars were still invading my database...

So I tried to use the function forceUTF8 posted on number 8, but in the database the string saved looks like this:

this is a "wròng wrìtten" string bùt I nèed to pù 'sòme' special chà rs to see thèm, convertèd by fùnctìon!! & that's it!

So collecting some more information on this page and merging them with other information on other pages I solved my problem with this solution:

$finallyIDidIt = mb_convert_encoding(
  $string,
  mysql_client_encoding($resourceID),
  mb_detect_encoding($string)
);

Now in my database I have my string with correct encoding.

NOTE:

Only note to take care of is in function mysql_client_encoding! You need to be connected to the database, because this function wants a resource ID as a parameter.

But well, I just do that re-encoding before my INSERT so for me it is not a problem.


Why do you not just use UTF-8 client encoding for mysql in the first place? Would not need manual conversion this way
Peter Mortensen

After sorting out your PHP scripts, don't forget to tell MySQL what charset you are passing and would like to receive.

Example: set the character set to UTF-8

Passing UTF-8 data to a Latin 1 table in a Latin 1 I/O session gives those nasty bird's feet. I see this every other day in osCommerce shops. Going back and forth, it might seem right, but phpMyAdmin will show the truth. By telling MySQL what charset you are passing, it will handle the conversion of MySQL data for you.

How to recover existing scrambled MySQL data is another question. :)


Peter Mortensen

Get the encoding from headers and convert it to UTF-8.

$post_url = 'http://website.domain';

/// Get headers ///////////////////////////////////////////////
function get_headers_curl($url)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_URL,            $url);
    curl_setopt($ch, CURLOPT_HEADER,         true);
    curl_setopt($ch, CURLOPT_NOBODY,         true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT,        15);

    $r = curl_exec($ch);
    return $r;
}

$the_header = get_headers_curl($post_url);

/// Check for redirect ////////////////////////////////////////
if (preg_match("/Location:/i", $the_header)) {
    $arr = explode('Location:', $the_header);
    $location = $arr[1];

    $location = explode(chr(10), $location);
    $location = $location[0];

    $the_header = get_headers_curl(trim($location));
}

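/// Get charset ///////////////////////////////////////////////
$charset = null;
if (preg_match("/charset=/i", $the_header)) {
    // Split case-insensitively, matching the check above
    $arr = preg_split('/charset=/i', $the_header);
    $charset = $arr[1];

    $charset = explode(chr(10), $charset);
    // Strip the trailing carriage return and any quotes around the value
    $charset = trim($charset[0], " \t\r\"'");
}

///////////////////////////////////////////////////////////////////
// echo $charset;

// Assumes $html already holds the page body, fetched separately
if ($charset && strcasecmp($charset, 'UTF-8') != 0) {
    $html = iconv($charset, "UTF-8", $html);
}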

Rick James

ÃŸ is Mojibake for ß. In your database, you may have one of the following hex values (use SELECT HEX(col) ... to find out):

DF if the column is "latin1",

C39F if the column is utf8 -- OR -- it is latin1, but "double-encoded"

C383C5B8 if double-encoded into a utf8 column
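Purely to show where these three hex values come from, they can be reproduced in PHP with mbstring. This is an illustrative sketch, not part of the recommended fix; the variable names are made up, and the double-encoded case assumes Windows-1252 was the wrongly assumed source encoding.

```php
<?php
$eszett = "\xC3\x9F"; // "ß" in UTF-8

// Stored in a latin1 column: a single byte
$latin1 = mb_convert_encoding($eszett, 'ISO-8859-1', 'UTF-8');

// Double-encoded: the UTF-8 bytes were treated as Windows-1252 and encoded again
$double = mb_convert_encoding($eszett, 'UTF-8', 'Windows-1252');

echo strtoupper(bin2hex($latin1)), PHP_EOL; // DF
echo strtoupper(bin2hex($eszett)), PHP_EOL; // C39F
echo strtoupper(bin2hex($double)), PHP_EOL; // C383C5B8
```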

You should not use any encoding/decoding functions in PHP; instead, you should set up the database and the connection to it correctly.

If MySQL is involved, see: Trouble with UTF-8 characters; what I see is not what I stored


What do you mean by "you may have hex"? Arbitrary binary data? Or something else? Please respond by editing (changing) your answer, not here in comments (without "Edit:", "Update:", or similar - the answer should appear as if it was written today).
@PeterMortensen - Yeah, my wording was rather cryptic. I hope my clarification helps. Do a SELECT HEX(col)... to see what is in the table.
YakovL

Try without 'auto'

That is:

mb_detect_encoding($text)

instead of:

mb_detect_encoding($text, 'auto')

More information can be found here: mb_detect_encoding


An explanation would be in order. E.g., what is the idea/gist? What kind of input was it tested on? From the Help Center: "...always explain why the solution you're presenting is appropriate and how it works". Please respond by editing (changing) your answer, not here in comments (without "Edit:", "Update:", or similar - the answer should appear as if it was written today).
MMJ

Try to use this... every text that is not UTF-8 will be translated.

function is_utf8($str) {
    return (bool) preg_match('//u', $str);
}

$myString = "Fußball";

if(!is_utf8($myString)){
    $myString = utf8_encode($myString);
}

// or 1 line version ;) 
$myString = !is_utf8($myString) ? utf8_encode($myString) : trim($myString);

Peter Mortensen

I found a solution at http://deer.org.ua/2009/10/06/1/:

class Encoding
{
    /**
     * http://deer.org.ua/2009/10/06/1/
     * @param $string
     * @return null
     */
    public static function detect_encoding($string)
    {
        static $list = ['utf-8', 'windows-1251'];

        foreach ($list as $item) {
            try {
                $sample = iconv($item, $item, $string);
            } catch (\Exception $e) {
                continue;
            }
            if (md5($sample) == md5($string)) {
                return $item;
            }
        }
        return null;
    }
}

$content = file_get_contents($file['tmp_name']);
$encoding = Encoding::detect_encoding($content);
if ($encoding != 'utf-8') {
    $result = iconv($encoding, 'utf-8', $content);
} else {
    $result = $content;
}

I think that @ is a bad decision and made some changes to the solution from deer.org.ua.


The link is broken: "Not Found. The requested URL /2009/10/06/1/ was not found on this server."
Peter Mortensen

When you try to handle multiple languages, like Japanese and Korean, you might get into trouble.

mb_convert_encoding with the 'auto' parameter doesn't work well. Setting mb_detect_order('ASCII,UTF-8,JIS,EUC-JP,SJIS,EUC-KR,UHC') doesn't help since it will detect EUC-* wrongly.

I concluded that as long as the input strings come from HTML, it should use the 'charset' in a meta element. I use Simple HTML DOM Parser because it supports invalid HTML.

The below snippet extracts the title element from a web page. If you would like to convert the entire page, then you may want to remove some lines.

<?php
require_once 'simple_html_dom.php';

echo convert_title_to_utf8(file_get_contents($argv[1])), PHP_EOL;

function convert_title_to_utf8($contents)
{
    $dom = str_get_html($contents);
    $title = $dom->find('title', 0);
    if (empty($title)) {
        return null;
    }
    $title = $title->plaintext;
    $metas = $dom->find('meta');
    $charset = 'auto';
    foreach ($metas as $meta) {
        if (!empty($meta->charset)) { // HTML5
            $charset = $meta->charset;
        } else if (preg_match('@charset=(.+)@', $meta->content, $match)) {
            $charset = $match[1];
        }
    }
    if (!in_array(strtolower($charset), array_map('strtolower', mb_list_encodings()))) {
        $charset = 'auto';
    }
    return mb_convert_encoding($title, 'UTF-8', $charset);
}

Peter Mortensen

This version is for the German language, but you can modify the $CHARSETS and the $TESTCHARS.

class CharsetDetector
{
    private static $CHARSETS = array(
        "ISO_8859-1",
        "ISO_8859-15",
        "CP850"
    );

    private static $TESTCHARS = array(
        "€",
        "ä",
        "Ä",
        "ö",
        "Ö",
        "ü",
        "Ü",
        "ß"
    );

    public static function convert($string)
    {
        return self::__iconv($string, self::getCharset($string));
    }

    public static function getCharset($string)
    {
        $normalized = self::__normalize($string);
        if(!strlen($normalized))
            return "UTF-8";
        $best = "UTF-8";
        $charcountbest = 0;
        foreach (self::$CHARSETS as $charset)
        {
            $str = self::__iconv($normalized, $charset);
            $charcount = 0;
            $stop = mb_strlen($str, "UTF-8");

            for($idx = 0; $idx < $stop; $idx++)
            {
                $char = mb_substr($str, $idx, 1, "UTF-8");
                foreach (self::$TESTCHARS as $testchar)
                {
                    if($char == $testchar)
                    {
                        $charcount++;
                        break;
                    }
                }
            }

            if($charcount > $charcountbest)
            {
                $charcountbest = $charcount;
                $best = $charset;
            }
            //echo $text . "<br />";
        }
        return $best;
    }

    private static function __normalize($str)
    {
        $len = strlen($str);
        $ret = "";
        for($i = 0; $i < $len; $i++)
        {
            $c = ord($str[$i]);
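            if ($c > 128) {
                $bytes = 1; // default, so $bytes is never undefined for bytes that cannot start a multi-byte sequence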
                if (($c > 247))
                    $ret .= $str[$i];
                elseif
                    ($c > 239) $bytes = 4;
                elseif
                    ($c > 223) $bytes = 3;
                elseif
                    ($c > 191) $bytes = 2;
                else
                    $ret .= $str[$i];

                if (($i + $bytes) > $len)
                    $ret .= $str[$i];
                $ret2 = $str[$i];
                while ($bytes > 1)
                {
                    $i++;
                    $b = ord($str[$i]);
                    if ($b < 128 || $b > 191)
                    {
                        $ret .= $ret2;
                        $ret2 = "";
                        $i += $bytes-1;
                        $bytes = 1;
                        break;
                    }
                    else
                        $ret2 .= $str[$i];
                    $bytes--;
                }
            }
        }
        return $ret;
    }

    private static function __iconv($string, $charset)
    {
        return iconv ($charset, "UTF-8", $string);
    }
}

Peter Mortensen

I had the same issue with phpQuery (ISO-8859-1 instead of UTF-8) and this hack helped me:

$html = '<?xml version="1.0" encoding="UTF-8" ?>' . $html;

mb_internal_encoding('UTF-8'), phpQuery::newDocumentHTML($html, 'utf-8'), mbstring.internal_encoding and other manipulations didn't take any effect.


Peter Mortensen

For Chinese characters, it is common to be encoded in the GBK encoding. In addition, when tested, the most voted answer doesn't work. Here is a simple fix that makes it work as well:

function toUTF8($raw) {
    try{
        return mb_convert_encoding($raw, "UTF-8", "auto"); 
    }catch(\Exception $e){
        return mb_convert_encoding($raw, "UTF-8", "GBK"); 
    }
}

Remark: This solution was written in 2017 and should fix problems for PHP in those days. I have not tested whether latest PHP already understands auto correctly.


Do you have any insight why, or how your files were different? What parts didn't work for you? For example: Uppercase German characters didn't convert correctly. Curious, what is "GBK" ?
In what way doesn't the most voted answer work?
An explanation would be in order. E.g., what is the idea/gist? From the Help Center: "...always explain why the solution you're presenting is appropriate and how it works". Please respond by editing (changing) your answer, not here in comments (without "Edit:", "Update:", or similar - the answer should appear as if it was written today).