
Force character vector encoding from "unknown" to "UTF-8" in R

I have a problem with inconsistent encoding of a character vector in R.

The text file from which I read the table is encoded in UTF-8 (checked via Notepad++; I also tried UTF-8 without BOM).

I want to read the table from this text file, convert it to a data.table, set a key, and make use of binary search. When I tried to do so, the following warning appeared:

Warning message: In [.data.table(poli.dt, "żżonymi", mult = "first") : A known encoding (latin1 or UTF-8) was detected in a join column. data.table compares the bytes currently, so doesn't support mixed encodings well; i.e., using both latin1 and UTF-8, or if any unknown encodings are non-ascii and some of those are marked known and others not. But if either latin1 or UTF-8 is used exclusively, and all unknown encodings are ascii, then the result should be ok. In future we will check for you and avoid this warning if everything is ok. The tricky part is doing this without impacting performance for ascii-only cases.

and binary search does not work.

I realised that my data.table key column contains a mix of "unknown" and "UTF-8" encoding marks:

> table(Encoding(poli.dt$word))
unknown   UTF-8 
2061312 2739122 

I tried to convert this column (before creating the data.table object) using:

Encoding(word) <- "UTF-8"

word<- enc2utf8(word)

but with no effect.
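(A side note on why this can look like a no-op: R never stores an encoding mark on pure-ASCII strings, so for ASCII elements both calls leave Encoding() at "unknown". A minimal sketch:)

x <- "abc"                    # pure ASCII
Encoding(x) <- "UTF-8"
Encoding(x)                   # still "unknown": ASCII strings are never marked

y <- "\u017c\u00f3\u0142w"    # "żółw", contains non-ASCII characters
Encoding(enc2utf8(y))         # "UTF-8": the mark sticks only for non-ASCII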

I also tried a few different ways of reading the file into R, setting all the helpful parameters (e.g. encoding = "UTF-8"); a sketch of these calls follows the list below:

data.table::fread

utils::read.table

base::scan

colbycol::cbc.read.table

but with no effect.
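For reference, the explicit-encoding variants looked roughly like this (the file name data.txt is a placeholder):

df <- utils::read.table("data.txt", header = TRUE, sep = "\t",
                        encoding = "UTF-8", stringsAsFactors = FALSE)
v  <- base::scan("data.txt", what = character(), encoding = "UTF-8")
dt <- data.table::fread("data.txt")   # newer data.table versions also accept encoding = "UTF-8"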

==================================================

My R.version:

> R.version
           _                           
platform       x86_64-w64-mingw32          
arch           x86_64                      
os             mingw32                     
system         x86_64, mingw32             
status                                     
major          3                           
minor          0.3                         
year           2014                        
month          03                          
day            06                          
svn rev        65126                       
language       R                           
version.string R version 3.0.3 (2014-03-06)
nickname       Warm Puppy  

My session info:

> sessionInfo()
R version 3.0.3 (2014-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250                LC_MONETARY=Polish_Poland.1250
[4] LC_NUMERIC=C                   LC_TIME=Polish_Poland.1250    

base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.2 colbycol_0.8     filehash_2.2-2   rJava_0.9-6     

loaded via a namespace (and not attached):
[1] plyr_1.8.1     Rcpp_0.11.1    reshape2_1.2.2 stringr_0.6.2  tools_3.0.3   

gagolews

The Encoding function returns unknown if a character string has a "native encoding" mark (CP-1250 in your case) or if it's in ASCII. To discriminate between these two cases, call:

library(stringi)
stri_enc_mark(poli.dt$word)
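For example (a tiny illustration, not your data):

library(stringi)
x <- c("abc", enc2utf8("\u017c\u00f3\u0142w"))  # an ASCII string and a UTF-8 "żółw"
Encoding(x)        # "unknown" "UTF-8"  -- base R cannot tell ASCII from native
stri_enc_mark(x)   # "ASCII"   "UTF-8"  -- stringi can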

To check whether each string is a valid UTF-8 byte sequence, call:

all(stri_enc_isutf8(poli.dt$word))

If that's not the case, your file is definitely not in UTF-8.
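In that case you can ask stringi to guess the actual encoding from the raw bytes, e.g. (the file name is a placeholder):

library(stringi)
bytes <- readBin("filename.txt", what = "raw", n = 100000)  # sample the file
stri_enc_detect(bytes)  # candidate encodings with confidence scores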

I suspect that you haven't forced UTF-8 mode in the data-read function (try inspecting the contents of poli.dt$word to verify this). If my guess is correct, try:

read.csv2(file("filename", encoding="UTF-8"))

or

poli.dt$word <- stri_encode(poli.dt$word, "", "UTF-8") # re-mark encodings

If data.table still complains about the "mixed" encodings, you may want to transliterate the non-ASCII characters, e.g.:

stri_trans_general("Zażółć gęślą jaźń", "Latin-ASCII")
## [1] "Zazolc gesla jazn"

Thank you! =) According to the result of all(stri_enc_isutf8(poli.dt$word)), it seems my file "is not in UTF-8 at all". However, I worked around the problem by using a hash table object instead of a data.table, which turns out to be faster for my particular problem and does not have such problems with encoding.
stri_encode(str, from = "", to = "UTF-8") does not seem to work for me; the object returns "unknown" with Encoding() or "ASCII" with stri_enc_mark(), not "UTF-8"
This isn't working for me. I have a character vector x such that all(stri_enc_isutf8(x)) returns TRUE and Encoding(x) returns "unknown", but after x <- stri_encode(x, "", "UTF-8"), Encoding(x) still returns "unknown". Similarly, after Encoding(x) <- "UTF-8", Encoding(x) returns "unknown"
Using iconv for the conversion, it says: "unsupported conversion from 'unknown' to 'UTF-8'"
For me poli.dt$word <- stri_encode(poli.dt$word, "", "UTF-8") didn't work, but stri_enc_toutf8(poli.dt$word) did. I know it's not a general solution, since it converts to UTF-8 only, but it may be useful for someone.
Elias EstatisticsEU

I could not find a solution to a similar problem myself: I could not translate characters of unknown encoding read from a txt file back into something more manageable in R.

As a result, the same character appeared more than once in the same dataset, because it was encoded differently ("X" in a Latin setting and "X" in a Greek setting), and saving to txt preserved that encoding difference (correctly, of course).

I tried some of the methods above, but nothing worked. The problem is well described elsewhere: R "cannot distinguish ASCII from UTF-8 and the bit will not stick even if you set it".

A good workaround is to export your data.frame to a temporary CSV file and re-import it with data.table::fread(), specifying Latin-1 as the source encoding.

Reproducing / copying the example from the source above:

library(data.table)
df <- your_data_frame_with_mixed_utf8_or_latin1_and_unknown_str_fields
fwrite(df, "temp.csv")   # write everything out as plain bytes
your_clean_data_table <- fread("temp.csv", encoding = "Latin-1")
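After the round trip it may be worth verifying that the key column now carries a single, consistent mark (the column name is a placeholder):

table(Encoding(your_clean_data_table$word))  # ideally one encoding class only
file.remove("temp.csv")                      # clean up the temporary file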

I hope it will help someone.


Even this didn't work for me
