
Reading InputStream as UTF-8

I'm trying to read from a text/plain file over the internet, line-by-line. The code I have right now is:

URL url = new URL("http://kuehldesign.net/test.txt");
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
LinkedList<String> lines = new LinkedList<>();
String readLine;

while ((readLine = in.readLine()) != null) {
    lines.add(readLine);
}

for (String line : lines) {
    out.println("> " + line);
}

The file, test.txt, contains ¡Hélló!, which I am using in order to test the encoding.

When I review the OutputStream (out), I see it as > ¬°H√©ll√≥!. I don't believe this is a problem with the OutputStream since I can do out.println("é"); without problems.

Any ideas for reading from the InputStream as UTF-8? Thanks!

The HTTP protocol specifies the encoding. Why aren’t you using a library API that handles that for you? You should never have to guess the encoding like this. I don’t mean to be negative: you’re doing great! I just wonder whether there isn’t an easier way.
I won't have access to the server which is serving the text/plain file, unfortunately, and it's not using a UTF-8 encoding. I wasn't aware of any good network libraries; any suggestions?
Looking at the docs, I wouldn’t think you would have to specify the encoding at all. I am surprised they give you a byte stream! You do have access to the underlying URLConnection, from which you can check the charset declared in the Content-Type header, then open an InputStreamReader with the correct argument. A quick check of the source doesn’t turn up anything that seems to do that for you, which seems pretty darned lame and error prone, so I probably missed something.
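
For what it's worth, here is a rough sketch of that header check. When a server declares a charset, it travels as a parameter of the Content-Type header, for example text/plain; charset=ISO-8859-1. The class and method names below (ContentTypeCharset, fromContentType) are made up for illustration and are not part of the JDK, and the parsing is deliberately simple rather than a full header parser.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not a JDK class: extracts the charset parameter
// from a Content-Type header value, falling back when none is usable.
class ContentTypeCharset {

    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback; // the server sent no Content-Type at all
        }
        // Parameters follow the media type, separated by semicolons.
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for a "charset=" prefix.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim();
                // The value may be quoted, e.g. charset="utf-8".
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback.
                    return fallback;
                }
            }
        }
        return fallback;
    }
}
```

With a real connection you would pass in url.openConnection().getContentType() and hand the resulting Charset to the InputStreamReader. If the server declares the wrong charset, as seems to be the case here, hard-coding the charset is still the only fix.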

tobijdc

Solved my own problem. This line:

BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));

needs to be:

BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));

or since Java 7:

BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), StandardCharsets.UTF_8));
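
For completeness, a self-contained version of the same fix, written as a small helper. This is only a sketch: the class name Utf8Lines and the method readLines are illustrative, and it assumes the bytes really are UTF-8. It uses try-with-resources so the reader (and the underlying stream) is closed even if reading fails.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reads every line of a stream, decoding as UTF-8.
class Utf8Lines {

    static List<String> readLines(InputStream in) throws IOException {
        List<String> lines = new ArrayList<>();
        // Name the charset explicitly instead of relying on the platform default.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

In the question's code this would be called as Utf8Lines.readLines(url.openStream()).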

I’m pretty sure that form of the constructor won’t raise an exception on invalid input. You need to use the constructor that takes a CharsetDecoder dec argument. This is the same Java design bug that the OutputStreamWriter constructors have: only one of the four actually condescends to tell you when something goes wrong. There, too, you have to use the fancy variant, which takes a CharsetEncoder enc argument. The only safe and sane thing to do is to consider all other constructors deprecated, because they cannot be trusted to behave.
Since Java 7 it is possible to provide the Charset as a constant, StandardCharsets.UTF_8, instead of as a String.
Ahmed Ashour
String file = "";

try {

    InputStream is = new FileInputStream(filename);
    String UTF8 = "utf8";
    int BUFFER_SIZE = 8192;

    BufferedReader br = new BufferedReader(new InputStreamReader(is,
            UTF8), BUFFER_SIZE);
    String str;
    while ((str = br.readLine()) != null) {
        file += str;
    }
} catch (Exception e) {
    e.printStackTrace(); // don't swallow I/O or encoding errors silently
}

Try this. :-)


Instead of file += str, create a StringBuilder and append to that. The compiler might be able to optimize the string appending, but it's likely creating a lot of garbage
If you want to convert a BufferedReader into a string, use Apache Commons, do not reinvent the wheel: String myStr = org.apache.commons.io.IOUtils.toString(myBufferedReaderInstance);
UTF8 = "utf8", nice variable ;)
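
A quick sketch of the StringBuilder suggestion from the comments, using only the JDK. The names ReaderText and readAll are made up for illustration. One deliberate difference from the answer's loop: readLine() strips line terminators, so this version appends '\n' after each line instead of gluing the lines together.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper: collects everything a Reader yields into one String.
class ReaderText {

    static String readAll(Reader source) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Appending to one builder avoids creating a new String per line.
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }
}
```

To read a UTF-8 file you could pass it new InputStreamReader(new FileInputStream(filename), StandardCharsets.UTF_8).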
Joshua Joel Cleveland

I ran into the same problem: every special character showed up as ��. To solve this, I tried the ISO-8859-1 encoding instead:

String line;
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("txtPath"), "ISO-8859-1"));

while ((line = br.readLine()) != null) {
    // process each decoded line here
}

I hope this can help anyone who sees this post.


Could you please tell us which characters are not supported in UTF-8?
grigouille

If you use the constructor InputStreamReader(InputStream in, Charset cs), malformed input is silently replaced (with U+FFFD, the replacement character). To change this behaviour, use a CharsetDecoder:

public static Reader newReader(InputStream is) {
  // REPORT makes the decoder throw instead of substituting U+FFFD
  return new InputStreamReader(is,
      StandardCharsets.UTF_8.newDecoder()
      .onMalformedInput(CodingErrorAction.REPORT)
      .onUnmappableCharacter(CodingErrorAction.REPORT)
  );
}

Then catch java.nio.charset.CharacterCodingException.
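
A small sketch of what catching it can look like. The class StrictUtf8 and the method isValidUtf8 are hypothetical names for this example. It builds the reader the same way as above and reports whether a byte array decodes cleanly as UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {

    // Reader whose decoder throws on bad input instead of replacing it.
    static Reader newStrictReader(InputStream is) {
        return new InputStreamReader(is,
                StandardCharsets.UTF_8.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT));
    }

    // Hypothetical helper: true if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try (Reader reader = newStrictReader(new ByteArrayInputStream(bytes))) {
            char[] buffer = new char[1024];
            while (reader.read(buffer) != -1) {
                // Decoded text is discarded; only a decoding error matters here.
            }
            return true;
        } catch (CharacterCodingException e) {
            // Thrown by read() when the decoder hits malformed input.
            return false;
        } catch (IOException e) {
            // Any other I/O failure is unexpected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
    }
}
```

CharacterCodingException is a subclass of IOException, so it has to be caught before any broader IOException handler.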