
Converting byte array to String (Java)

I'm writing a web application in Google App Engine. It basically allows people to edit HTML code that gets stored as an .html file in the blobstore.

I'm using fetchData to return a byte[] of all the characters in the file. I'm trying to print it to an HTML page so the user can edit the HTML code. Everything works great!

Here's my only problem now:

The byte array is having some issues when converting back to a string. Smart quotes and a couple of other characters are coming out looking funky (?'s, Japanese symbols, etc.). Specifically, I'm seeing several bytes with negative values, and those are causing the problem.

The smart quotes are coming back as -108 and -109 in the byte array. Why is this and how can I decode the negative bytes to show the correct character encoding?

Hi, I know this is a really old post, but I am facing a similar problem. I am making a man-in-the-middle proxy for SSL. I listen on the socket, read the data from an InputStream into a byte[], and then try to convert the byte[] into a String (I need the response body for attacks). I get really funny characters full of smart quotes, question marks and whatnot. I believe your problem is the same as mine, since we are both dealing with HTML in a byte[]. Can you please advise?
By the way, I went so far as to look up my system's encoding in the system properties and found it to be "Cp1252". I then used String str=new String(buffer, "Cp1252"); but it did not help.

nalply

The byte array contains characters in a specific encoding (which you need to know). The way to convert it to a String is:

String decoded = new String(bytes, "UTF-8");  // example for one encoding type

By the way, the raw bytes may appear as negative decimals simply because the Java datatype byte is signed; it covers the range from -128 to 127.
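To make the signed-byte point concrete, here is a minimal sketch that masks a byte with 0xFF to recover its unsigned value. The class and method names (SignedByteDemo, toUnsigned) are made up for illustration.

```java
public class SignedByteDemo {

    // Convert a signed Java byte (-128..127) to its unsigned value (0..255).
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte b = -109; // one of the values reported in the question

        System.out.println(toUnsigned(b));                      // 147
        System.out.println(Integer.toHexString(toUnsigned(b))); // 93
    }
}
```

So -109 and 0x93 are the same bit pattern; only the way the value is printed differs.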

-109 = 0x93: Control Code "Set Transmit State"

The value (-109) is a non-printable control character in Unicode, so UTF-8 is not the correct encoding for that character stream.

In "Windows-1252", 0x93 is the "smart quote" that you're looking for. The Java name of that encoding is "Cp1252". The next line provides test code:

System.out.println(new String(new byte[]{-109}, "Cp1252")); 
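As a slightly fuller sketch of the same test, the following decodes the two bytes from the question (-109 and -108) once as Windows-1252 and once as UTF-8, and prints the resulting code points in hex so that the console's own encoding cannot hide the result. The class and helper names are hypothetical, and it assumes the JDK in use ships the windows-1252 charset (standard desktop JDKs do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SmartQuoteDemo {

    // Decode the bytes with an explicit charset.
    static String decode(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    // Render each char of the string as a hex code, separated by spaces.
    static String hexCodes(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(Integer.toHexString(s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] quoted = {-109, 'h', 'i', -108}; // 0x93 h i 0x94

        Charset cp1252 = Charset.forName("windows-1252");
        System.out.println(hexCodes(decode(quoted, cp1252)));                 // 201c 68 69 201d
        System.out.println(hexCodes(decode(quoted, StandardCharsets.UTF_8))); // fffd 68 69 fffd
    }
}
```

U+201C and U+201D are the left and right double quotation marks; U+FFFD is what the UTF-8 decoder substitutes for bytes it cannot decode.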

I tried using UTF-8 and it still came out as ?'s. How come it isn't finding a mapping for those negative values?
0x93 is a valid continuation byte in UTF-8, though - the presence of that byte only rules out its being UTF-8 if it doesn't come after a byte with the first two bits set.
@Josh Andreas explains why - because Java's byte datatype is signed. The 'negative' values are just bytes with the most significant bit set. He also explains which character set you should most likely be using - Windows-1252. You should know what character set to use from context or convention, though, without having to guess.
davnicwil

Java 7 and above

You can also pass your desired encoding to the String constructor as a Charset constant from StandardCharsets. This may be safer than passing the encoding as a String, as suggested in the other answers.

For example, for UTF-8 encoding:

String bytesAsString = new String(bytes, StandardCharsets.UTF_8);
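A rough sketch of why the Charset overload can be considered safer: the String-name overload declares a checked UnsupportedEncodingException, and a misspelled name only fails at run time, while a Charset constant cannot be misspelled. StandardCharsets has no constant for Windows-1252, so the sketch looks that one up once with Charset.forName. The class, method, and constant names here are illustrative only.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetConstantDemo {

    // Not part of StandardCharsets, so resolve it once and reuse it.
    static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    // Older style: the charset name is a String, checked only at run time.
    static String decodeByName(byte[] bytes, String name) throws UnsupportedEncodingException {
        return new String(bytes, name);
    }

    // Java 7+ style: no checked exception, no chance of a typo in the name.
    static String decodeByCharset(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        byte[] bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        System.out.println(decodeByCharset(bytes, StandardCharsets.UTF_8).equals("caf\u00e9")); // true

        try {
            decodeByName(bytes, "no-such-charset");
        } catch (UnsupportedEncodingException e) {
            System.out.println("unsupported: " + e.getMessage());
        }
    }
}
```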

This is a repeat of an answer from 2011. -1
@james.garriss I don't think it is. I'm just mentioning a new constructor introduced in Java 7 that allows the encoding to be passed as a constant. In my opinion that is nicer and safer than the previous API mentioned in the earlier answers, where the encoding was passed as a String, if at all.
Flexo

You can try this.

String s = new String(bytearray);

You can try... but it will fail in almost all cases.
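One likely reason for that warning: the one-argument constructor decodes with the platform's default charset, which may differ from the encoding the bytes were written in and from machine to machine. A small sketch (names are hypothetical) that makes the hidden default visible:

```java
import java.nio.charset.Charset;

public class DefaultCharsetDemo {

    // Same as new String(bytes), but with the default charset spelled out.
    static String decodeWithDefault(byte[] bytes) {
        return new String(bytes, Charset.defaultCharset());
    }

    public static void main(String[] args) {
        byte[] bytes = {-109, 'h', 'i', -108};

        // The printed charset depends on the JVM and operating system.
        System.out.println("default charset: " + Charset.defaultCharset());

        System.out.println(new String(bytes).equals(decodeWithDefault(bytes))); // true
    }
}
```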
Adi Sembiring
public class Main {

    /**
     * Example method for converting a byte to a String.
     */
    public void convertByteToString() {

        byte b = 65;

        //Using the static toString method of the Byte class
        System.out.println(Byte.toString(b));

        //Using simple concatenation with an empty String
        System.out.println(b + "");

        //Creating a byte array and passing it to the String constructor
        System.out.println(new String(new byte[] {b}));

    }

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        new Main().convertByteToString();
    }
}

Output

65
65
A

craig
public static String readFile(String fn)   throws IOException 
{
    File f = new File(fn);

    byte[] buffer = new byte[(int)f.length()];
    FileInputStream is = new FileInputStream(fn);
    is.read(buffer);
    is.close();

    return  new String(buffer, "UTF-8"); // use desired encoding
}

This code will leak a resource if the read throws an exception.
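One way to avoid the leak, sketched here under the assumption that Java 7 or later is available, is to let Files.readAllBytes open, fully read, and close the file. This also sidesteps the fact that a single InputStream.read call is not guaranteed to fill the buffer. The class name and the temporary file in main are only for demonstration.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ReadFileDemo {

    // Read the whole file and decode it with the given charset.
    // Files.readAllBytes closes the file even if reading fails.
    static String readFile(String fn, Charset cs) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(fn));
        return new String(bytes, cs);
    }

    public static void main(String[] args) throws IOException {
        String html = "<p>\u201Chi\u201D</p>";
        Path tmp = Files.createTempFile("demo", ".html");
        try {
            Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
            System.out.println(readFile(tmp.toString(), StandardCharsets.UTF_8).equals(html)); // true
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```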
Questioner

I suggest Arrays.toString(byte_array);

It depends on your purpose. For example, I wanted to save a byte array in exactly the format you see in the debugger, something like [1, 2, 3]. If you want to save the byte values themselves, without converting them to characters, Arrays.toString(byte_array) does this. But if you want to save characters instead of bytes, you should use String s = new String(byte_array). In this case, s holds the characters that the bytes [1, 2, 3] decode to.
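The difference between the two calls can be sketched as follows. The helper names are made up, and the charset is passed explicitly (US-ASCII here) rather than relying on the platform default.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ArraysToStringDemo {

    // Textual listing of the numeric byte values; no character decoding happens.
    static String asNumbers(byte[] bytes) {
        return Arrays.toString(bytes);
    }

    // Decode the bytes as characters.
    static String asText(byte[] bytes) {
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] bytes = {72, 105};

        System.out.println(asNumbers(bytes)); // [72, 105]
        System.out.println(asText(bytes));    // Hi
    }
}
```

Arrays.toString is useful for inspecting which byte values are present, but it does not recover the original text.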


Can you give more information on why you're suggesting this? (Will it solve the problem? Can you say why it solves it?) Thanks!
@sas, you should add this information to your answer itself (by editing it) rather than as a comment. Generally on SO you should always keep in mind that comments may at any point be deleted - the really important information should be in the answer itself.
Simon G.

The previous answer from Andreas_D is good. I'm just going to add that wherever you are displaying the output there will be a font and a character encoding and it may not support some characters.

To work out whether it is Java or your display that is a problem, do this:

    for(int i=0;i<str.length();i++) {
        char ch = str.charAt(i);
        System.out.println(i+" : "+ch+" "+Integer.toHexString(ch)+((ch=='\ufffd') ? " Unknown character" : ""));
    }

Java will have mapped any bytes it cannot decode to U+FFFD (0xfffd), the official Unicode replacement character for unknown characters. If you see a '?' in the output but the character is not 0xfffd, the problem is your display font or encoding, not Java.
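If you would rather have decoding fail loudly than silently produce 0xfffd, one option is a CharsetDecoder configured to report errors. The following is a minimal sketch; the class and method names (StrictDecodeDemo, decodeStrict, isValid) are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecodeDemo {

    // Decode the bytes, throwing instead of substituting U+FFFD on bad input.
    static String decodeStrict(byte[] bytes, Charset cs) throws CharacterCodingException {
        return cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // True if the bytes decode cleanly in the given charset.
    static boolean isValid(byte[] bytes, Charset cs) {
        try {
            decodeStrict(bytes, cs);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] bytes = {-109, 'h', 'i', -108};

        System.out.println(isValid(bytes, StandardCharsets.UTF_8));            // false
        System.out.println(isValid(bytes, Charset.forName("windows-1252")));   // true
    }
}
```

A clean decode does not prove the charset is the right one, since single-byte encodings accept almost any input, but a failure does rule a charset out.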