Encode String to UTF-8

java utf-8

I have a String containing an "ñ" character, and it is causing me problems. I need to encode this String as UTF-8. I tried it this way, but it doesn't work:

byte ptext[] = myString.getBytes();
String value = new String(ptext, "UTF-8");

How do I encode that string to utf-8?

It's unclear what exactly you're trying to do. Does myString correctly contain the ñ character and you have problems converting it to a byte array (in that case see answers from Peter and Amir), or is myString corrupted and you're trying to fix it (in that case, see answers from Joachim and me)?

I need to send myString to a server with utf-8 encoding and I need to convert the "ñ" character to utf-8 encoding.

Well, if that server expects UTF-8 then what you need to send it are bytes, not a String. So as per Peter's answer, specify the encoding in the first line and drop the second line.
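To make that comment concrete, here is a minimal sketch of encoding the String to UTF-8 bytes and writing them to a stream. The class name SendUtf8, the helper sendAsUtf8, and the ByteArrayOutputStream standing in for the server connection are assumptions for illustration; the original question never shows how the data is sent.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class SendUtf8 {
    // Hypothetical helper: encodes the text as UTF-8 and writes the raw bytes.
    static void sendAsUtf8(String text, OutputStream out) throws IOException {
        byte[] payload = text.getBytes(StandardCharsets.UTF_8);
        out.write(payload);
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: an in-memory stream stands in for the real connection.
        ByteArrayOutputStream fakeServer = new ByteArrayOutputStream();
        // "\u00F1" is the letter ñ, escaped so the source file encoding does not matter.
        sendAsUtf8("\u00F1", fakeServer);
        for (byte b : fakeServer.toByteArray()) {
            System.out.printf("%02X ", b); // prints "C3 B1 "
        }
        System.out.println();
    }
}
```

The two bytes C3 B1 are the UTF-8 encoding of ñ. There is no second step that turns them back into a String.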

@Michael: I agree that it isn’t clear what the real intent is here. There seem to be a lot of questions where people are trying to do explicit conversions between Strings and bytes rather than letting the {In,Out}putStream{Read,Writ}ers do it for them. I wonder why?

@Michael: Thanks, I suppose that makes sense. But it also makes it harder than it needs to be, doesn’t it? I am not very fond of languages that work that way, and so try to avoid working with them. I think Java’s model of Strings of characters instead of bytes makes things a whole lot easier. Perl and Python also share the “everything is Unicode strings” model. Yes, in all three you can still get at bytes if you work at it, but in practice it seems rare that you truly need to: that’s quite low-level. Plus it feels kinda like brushing a cat the wrong direction, if you know what I mean. :)

leventov

How about using

ByteBuffer byteBuffer = StandardCharsets.UTF_8.encode(myString)

But how do I obtain an encoded String? It returns a ByteBuffer.

@Alex: it's not possible to have a UTF-8 encoded Java String. You want bytes, so either use the ByteBuffer directly (could even be the best solution if your goal is to send it via a network connection) or call array() on it to get a byte[].

Something else that may be helpful is to use Guava's Charsets.UTF_8 constant instead of a charset name passed as a String, which may throw an UnsupportedEncodingException. String -> bytes: myString.getBytes(Charsets.UTF_8), and bytes -> String: new String(myByteArray, Charsets.UTF_8).

Even better, use StandardCharsets.UTF_8. Available in Java 1.7+.

The array returned by array() will most likely be bigger than needed and padded, as it is the ByteBuffer's internal array. Better to use string.getBytes(StandardCharsets.UTF_8), which will return a new array with the correct size.
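A small sketch of the difference described in that comment, assuming you already hold a ByteBuffer from StandardCharsets.UTF_8.encode and want an exactly sized byte[]. The class name ByteBufferToArray and the helper toExactArray are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteBufferToArray {
    // Hypothetical helper: copies only the readable bytes, ignoring any
    // unused capacity in the buffer's backing array.
    static byte[] toExactArray(ByteBuffer buffer) {
        byte[] exact = new byte[buffer.remaining()];
        // duplicate() leaves the caller's buffer position untouched.
        buffer.duplicate().get(exact);
        return exact;
    }

    public static void main(String[] args) {
        String text = "\u00D1and\u00FA"; // "Ñandú", escaped to avoid source encoding issues
        ByteBuffer encoded = StandardCharsets.UTF_8.encode(text);

        System.out.println("readable bytes:       " + encoded.remaining());
        System.out.println("backing array length: " + encoded.array().length);

        byte[] viaBuffer = toExactArray(encoded);
        byte[] viaGetBytes = text.getBytes(StandardCharsets.UTF_8);
        System.out.println("same content: " + Arrays.equals(viaBuffer, viaGetBytes));
    }
}
```

If you have no other use for the ByteBuffer, calling getBytes(StandardCharsets.UTF_8) directly is the simpler route.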

Joachim Sauer

String objects in Java use the UTF-16 encoding that can't be modified*.

The only thing that can have a different encoding is a byte[]. So if you need UTF-8 data, then you need a byte[]. If you have a String that contains unexpected data, then the problem is at some earlier place that incorrectly converted some binary data to a String (i.e. it was using the wrong encoding).

* As a matter of implementation, String can internally use an ISO-8859-1 encoded byte[] when the range of characters fits it, but that is an implementation-specific optimization that isn't visible to users of String (i.e. you'll never notice unless you dig into the source code or use reflection to dig into a String object).
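One way to see that the encoding belongs to the byte[] and not to the String is to encode the same String several ways. This is only an illustrative sketch; the class name EncodingIsAboutBytes and the helper encodedLength are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingIsAboutBytes {
    // Hypothetical helper: number of bytes the text occupies in a given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "\u00F1"; // the single character ñ

        // The String itself is always one char, whatever we encode it to later.
        System.out.println("chars:      " + text.length());                                        // 1
        System.out.println("UTF-8:      " + encodedLength(text, StandardCharsets.UTF_8));      // 2
        System.out.println("ISO-8859-1: " + encodedLength(text, StandardCharsets.ISO_8859_1)); // 1
        System.out.println("UTF-16BE:   " + encodedLength(text, StandardCharsets.UTF_16BE));   // 2
    }
}
```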

Technically speaking, byte[] doesn't have any encoding. Byte array PLUS encoding can give you string though.

@Peter: true. But attaching an encoding to it only makes sense for byte[], it doesn't make sense for String (unless the encoding is UTF-16, in which case it makes sense but it is still unnecessary information).

String objects in Java use the UTF-16 encoding that can't be modified. Do you have an official source for this quote?

@AhmadHajjar docs.oracle.com/javase/10/docs/api/java/lang/… : "The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes."

Eduardo Cuomo

In Java 7 you can use:

import static java.nio.charset.StandardCharsets.*;

byte[] ptext = myString.getBytes(ISO_8859_1); 
String value = new String(ptext, UTF_8);

This has the advantage over getBytes(String) that it does not declare throws UnsupportedEncodingException.

If you're using an older Java version you can declare the charset constants yourself:

import java.nio.charset.Charset;

public class StandardCharsets {
    public static final Charset ISO_8859_1 = Charset.forName("ISO-8859-1");
    public static final Charset UTF_8 = Charset.forName("UTF-8");
    //....
}

This is the right answer. If someone wants to use a string datatype, he can use it in the right format. Rest of the answers are pointing to the byte formatted type.

Works in 6. Thanks.

Correct answer for me too. One thing though, when I used as above, German character changed to ?. So, I used this: byte[] ptext = myString.getBytes(UTF_8); String value = new String(ptext, UTF_8); This worked fine.

The code sample doesn't make sense. If you first convert to ISO-8859-1, then that array of byte is not UTF-8, so the next line is totally incorrect. It will work for ASCII strings, of course, but then you could as well make a simple copy: String value = new String(myString);.
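The disagreement in these comments can be reproduced with a short experiment. The sketch below is an illustration, not part of any answer: it applies the ISO-8859-1 to UTF-8 round trip to a correct String and to one that was mis-decoded earlier. The class name RoundTripDemo and the helper roundTrip are hypothetical.

```java
import static java.nio.charset.StandardCharsets.ISO_8859_1;
import static java.nio.charset.StandardCharsets.UTF_8;

public class RoundTripDemo {
    // The conversion from the answer: encode as ISO-8859-1, decode as UTF-8.
    static String roundTrip(String s) {
        return new String(s.getBytes(ISO_8859_1), UTF_8);
    }

    public static void main(String[] args) {
        String correct = "\u00F1"; // ñ, a correctly decoded String

        // Simulate the earlier bug: UTF-8 bytes decoded as ISO-8859-1.
        String misDecoded = new String(correct.getBytes(UTF_8), ISO_8859_1);

        // Undoes the earlier mistake, so the result equals the correct String.
        System.out.println(roundTrip(misDecoded).equals(correct)); // true

        // Applied to a correct String it corrupts the text: the single byte
        // 0xF1 is not valid UTF-8 and decodes to the replacement character.
        System.out.println(roundTrip(correct).equals(correct));    // false

        // Pure ASCII survives either way, which can hide the problem.
        System.out.println(roundTrip("abc").equals("abc"));        // true
    }
}
```

This would explain why the snippet "works" for some readers: their String was already corrupted by a wrong decode upstream, and the round trip happens to reverse it.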

Peter Štibraný

Use byte[] ptext = myString.getBytes("UTF-8"); instead of getBytes(). getBytes() uses the so-called "default encoding", which may not be UTF-8.
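To check which default encoding your JVM actually uses, something like the following sketch may help. The class name DefaultCharsetCheck and the helper matchesDefault are assumptions for the example, and the printed charset depends on your platform and Java version.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DefaultCharsetCheck {
    // Hypothetical helper: true if getBytes() gives the same result as
    // explicitly asking for the platform default charset.
    static boolean matchesDefault(String text) {
        return Arrays.equals(text.getBytes(), text.getBytes(Charset.defaultCharset()));
    }

    public static void main(String[] args) {
        String text = "\u00F1"; // ñ

        // Platform dependent, so no expected output is shown here.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("getBytes() uses the default: " + matchesDefault(text));

        // Passing the charset explicitly removes the platform dependency.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println("UTF-8 byte count: " + utf8.length); // 2
    }
}
```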

@Michael: he is clearly having trouble getting bytes from string. How is getBytes(encoding) missing the point? I think second line is there just to check if he can convert it back.

I interpret it as having a broken String and trying to "fix" it by converting to bytes and back (common misunderstanding). There's no actual indication that the second line is just checking the result.

@Michael, no there isn't, it's just my interpretation. Yours is simply different.

@Peter: you're right, we'd need clarification from Alex what he really means. Can't rescind the downvote though unless the answer is edited...

Michael Borgwardt

A Java String is internally always encoded in UTF-16 - but you really should think about it like this: an encoding is a way to translate between Strings and bytes.

So if you have an encoding problem, by the time you have String, it's too late to fix. You need to fix the place where you create that String from a file, DB or network connection.
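As a rough sketch of fixing the place where the String is created, the example below decodes the same bytes once with the right charset and once with the wrong one. The class name DecodeAtTheBoundary, the helper readAll, and the in-memory stream standing in for a file or network connection are all assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeAtTheBoundary {
    // Hypothetical helper: reads the whole stream, decoding with the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: these bytes arrived from a source that wrote ñ as UTF-8.
        byte[] wire = {(byte) 0xC3, (byte) 0xB1};

        String right = readAll(new ByteArrayInputStream(wire), StandardCharsets.UTF_8);
        String wrong = readAll(new ByteArrayInputStream(wire), StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // 1, the single character ñ
        System.out.println(wrong.length()); // 2, the mis-decoded pair "Ã±"
    }
}
```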

It's a common mistake to believe that strings are internally encoded as UTF-16. Usually they are, but if so, it is only an implementation-specific detail of the String class. Since the internal storage of the character data is not accessible through the public API, a specific String implementation may decide to use any other encoding.

@jarnbjo: The API explicitly states "A String represents a string in the UTF-16 format". Using anything else as internal format would be highly inefficient, and all actual implementations I know do use UTF-16 internally. So unless you can cite one that doesn't, you're engaging in pretty absurd hairsplitting.

Is it absurd to distinguish between public access and internal representation of data structures?

The JVM (as far as it is relevant to the VM at all) uses UTF-8 for string encoding, e.g. in the class files. The implementation of java.lang.String is decoupled from the JVM and I could easily implement the class for you using any other encoding for the internal representation if that is really necessary for you to realize that your answer is incorrect. Using UTF-16 as the internal format is in most cases highly inefficient as well when it comes to memory consumption and I don't see why e.g. Java implementations for embedded hardware wouldn't optimize for memory instead of performance.

@jarnbjo: And once more: as long as you cannot give a concrete example of a JVM whose standard API implementation does internally use something other than UTF-16 to implement Strings, my statement is correct. And no, the String class is not really decoupled from the JVM, due to things like intern() and the constant pool.

bstpierre

You can try this way.

byte ptext[] = myString.getBytes("ISO-8859-1"); 
String value = new String(ptext, "UTF-8");

I was going crazy. Thank you; getting the bytes in "ISO-8859-1" first was the solution.

This is wrong. If your string includes Unicode characters, converting it to 8859-1 is going to throw an exception or worse give you an invalid string (maybe the string without those characters with code point 0x100 and over).

works perfectly

Quimbo

I ran into this problem at one point and managed to solve it in the following way.

First I needed to import:

import java.nio.charset.Charset;

Then I had to declare constants for UTF-8 and ISO-8859-1:

private static final Charset UTF_8 = Charset.forName("UTF-8");
private static final Charset ISO = Charset.forName("ISO-8859-1");

Then I could use it in the following way:

String textwithaccent="Thís ís a text with accent";
String textwithletter="Ñandú";

String text1 = new String(textwithaccent.getBytes(ISO), UTF_8);
String text2 = new String(textwithletter.getBytes(ISO), UTF_8);

perfect solution.

fedesanp

String value = new String(myString.getBytes("UTF-8"));

And if you want to read from a text file encoded as "ISO-8859-1":

String line;
String f = "C:\\MyPath\\MyFile.txt";
try {
    BufferedReader br = Files.newBufferedReader(Paths.get(f), Charset.forName("ISO-8859-1"));
    while ((line = br.readLine()) != null) {
        System.out.println(new String(line.getBytes("UTF-8")));
    }
} catch (IOException ex) {
    //...
}
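If the real goal is to turn an ISO-8859-1 file into a UTF-8 file, a reader and a writer with explicit charsets can do the conversion without any getBytes calls. This is a sketch under that assumption; the class name TranscodeFile, the helper transcode, and the temporary files are hypothetical and not part of the answer above.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class TranscodeFile {
    // Hypothetical helper: decodes the source as ISO-8859-1 and re-encodes it as UTF-8.
    static void transcode(Path source, Path target) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.ISO_8859_1);
             BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                writer.write(buf, 0, n); // characters in, characters out
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: temporary files stand in for real input and output paths.
        Path source = Files.createTempFile("latin1-", ".txt");
        Path target = Files.createTempFile("utf8-", ".txt");
        Files.write(source, new byte[] {(byte) 0xF1}); // ñ in ISO-8859-1

        transcode(source, target);

        for (byte b : Files.readAllBytes(target)) {
            System.out.printf("%02X ", b); // prints "C3 B1 "
        }
        System.out.println();
    }
}
```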

laxman954

I used the code below to encode the special character by specifying the encoding.

String text = "This is an example é";
byte[] byteText = text.getBytes(Charset.forName("UTF-8"));
//To get original string from byte.
String originalString= new String(byteText , "UTF-8");

Community

A quick step-by-step guide on how to set the NetBeans default encoding to UTF-8. As a result, NetBeans will create all new files in UTF-8 encoding.

NetBeans default encoding UTF-8 step-by-step guide

Go to etc folder in NetBeans installation directory

Edit netbeans.conf file

Find netbeans_default_options line

Add -J-Dfile.encoding=UTF-8 inside quotation marks inside that line (example: netbeans_default_options="-J-Dfile.encoding=UTF-8")

Restart NetBeans

You have now set the NetBeans default encoding to UTF-8.

Your netbeans_default_options may contain additional parameters inside the quotation marks. In such case, add -J-Dfile.encoding=UTF-8 at the end of the string. Separate it with space from other parameters.

Example:

netbeans_default_options="-J-client -J-Xss128m -J-Xms256m -J-XX:PermSize=32m -J-Dapple.laf.useScreenMenuBar=true -J-Dapple.awt.graphics.UseQuartz=true -J-Dsun.java2d.noddraw=true -J-Dsun.java2d.dpiaware=true -J-Dsun.zip.disableMemoryMapping=true -J-Dfile.encoding=UTF-8"

here is link for Further Details

Prasanth RJ

This solved my problem

    String inputText = "some text with escaped chars";
    InputStream is = new ByteArrayInputStream(inputText.getBytes("UTF-8"));
