
How to GetBytes() in C# with UTF8 encoding with BOM?

I'm having a problem with UTF-8 encoding in my ASP.NET MVC 2 application in C#. I'm trying to let the user download a simple text file built from a string. I am trying to get the byte array with the following line:

var x = Encoding.UTF8.GetBytes(csvString);

but when I return it for download using:

return File(x, ..., ...);

I get a file without a BOM, so Croatian characters don't show up correctly. This is because my byte array does not include the BOM after encoding. I tried inserting those bytes manually, and then it shows up correctly, but that's not the best way to do it.
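Roughly, the manual workaround looked something like this (a sketch; Concat needs a using System.Linq; directive):

var bom = new byte[] { 0xEF, 0xBB, 0xBF }; // the three UTF-8 BOM bytes
var x = bom.Concat(Encoding.UTF8.GetBytes(csvString)).ToArray();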

I also tried creating a UTF8Encoding instance and passing a boolean value (true) to its constructor to include the BOM, but it doesn't work either.

Does anyone have a solution? Thanks!


Darin Dimitrov

Try like this:

public ActionResult Download()
{
    var data = Encoding.UTF8.GetBytes("some data");
    // GetPreamble() returns the UTF-8 BOM (0xEF 0xBB 0xBF);
    // Concat requires a using System.Linq; directive
    var result = Encoding.UTF8.GetPreamble().Concat(data).ToArray();
    return File(result, "application/csv", "foo.csv");
}

The reason is that the UTF8Encoding constructor that takes a boolean parameter doesn't do what you would expect:

byte[] bytes = new UTF8Encoding(true).GetBytes("a");

The resulting array would contain a single byte with the value of 97 (the character 'a'). There's no BOM because UTF-8 doesn't require one.
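To see where the BOM actually comes from (a minimal sketch):

byte[] preamble = new UTF8Encoding(true).GetPreamble();
// preamble is { 0xEF, 0xBB, 0xBF } -- the UTF-8 BOM
byte[] noBom = new UTF8Encoding(false).GetPreamble();
// noBom is empty: the constructor flag only affects GetPreamble(), never GetBytes()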


Thanks! I was going crazy with my special characters not working in Excel CSV :)
For clarity, Encoding.UTF8 is equivalent to new UTF8Encoding(true). The parameter controls whether GetPreamble() will emit a BOM.
There's no BOM because GetBytes can't assume we're writing to a file. Whoever writes to the file should do the preamble thing first (like a StreamWriter, for example).
Why is the content type set to "application/csv" instead of "text/csv" (as shown here)? In any case, neither way works here; Excel still opens it with unrecognizable characters.
If I use a contentType of application/csv it works fine, but if I replace it with text/csv it stops working. Maybe someone has a clue why that is?
Hovhannes Hakobyan

I created a simple extension method to convert any string, in any encoding, to the byte array it would produce when written to a file or stream:

using System.IO;
using System.Text;

public static class StreamExtensions
{
    public static byte[] ToBytes(this string value, Encoding encoding)
    {
        using (var stream = new MemoryStream())
        using (var sw = new StreamWriter(stream, encoding))
        {
            // StreamWriter writes the encoding's preamble (the BOM, if any)
            // before the first chunk of content
            sw.Write(value);
            sw.Flush();
            return stream.ToArray();
        }
    }
}

Usage:

stringValue.ToBytes(Encoding.UTF8)

This also works for other encodings, such as UTF-16, which do require a BOM.
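For example, it could be used in the original controller scenario like this (a hypothetical action; BuildCsvString stands in for however the CSV is produced):

public ActionResult Download()
{
    string csv = BuildCsvString(); // placeholder for your CSV generation
    // the StreamWriter inside ToBytes emits Encoding.UTF8's preamble (the BOM) before the content
    return File(csv.ToBytes(Encoding.UTF8), "text/csv", "foo.csv");
}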


This is actually a very useful workaround. The use of a StreamWriter, with encoding, solved my immediate problem and allowed my file to be opened with Excel 2013.
Thanks. It helped me save a .csv with Arabic characters. Using Encoding.GetBytes returned a bad file with unknown characters.
yfeldblum

UTF-8 does not require a BOM, because it is a sequence of 1-byte words. UTF-8 = UTF-8BE = UTF-8LE.

In contrast, UTF-16 requires a BOM at the beginning of the stream to identify whether the remainder of the stream is UTF-16BE or UTF-16LE, because UTF-16 is a sequence of 2-byte words and the BOM identifies whether the bytes in the words are BE or LE.
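The preambles of the built-in encodings show this directly (a minimal sketch):

byte[] utf16le = Encoding.Unicode.GetPreamble();          // { 0xFF, 0xFE }
byte[] utf16be = Encoding.BigEndianUnicode.GetPreamble(); // { 0xFE, 0xFF }
byte[] utf8    = Encoding.UTF8.GetPreamble();             // { 0xEF, 0xBB, 0xBF } -- a signature, not byte order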

The problem does not lie with the Encoding.UTF8 class. The problem lies with whatever program you are using to view the files.


UTF-8 is a variable width encoding. It only requires 1 byte to encode ASCII characters, but other code points will use multiple bytes.
The codepoints encoded with multiple bytes have a pre-defined order (based on the U+ big-endian representation). However, since UTF8 is represented as a stream of bytes (rather than as a stream of words or dwords which are themselves represented as a sequence of bytes), the concept of endianness doesn't apply. Endianness is applicable to the representation of 16-, 32-, 64-, 128-bit integers as bytes, not to the representation of codepoints as bytes.
Sorry, I thought you were referring to the storage of codepoints with the phrase "sequence of 1 byte words". Thanks for the clarification. +1 for your answer and comment.
Some programs use it to detect the encoding as being UTF-8. Programs that don't require it should ignore it as the character emitted is something that is to be ignored anyway. It's older programs that can't handle the BOM.
It does, if you wanna, say, open a UTF-8 file that has surrogate pairs in Visual Studio...
Daniel Peñalba

Remember that .NET strings are all Unicode while they stay in memory, so if you can see your csvString correctly in the debugger, the problem is in writing the file.

In my opinion, you should return a FileResult with the same encoding as the file. Try setting the encoding on the returned File result.
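For instance, one way to hint the encoding to the client (a sketch; whether a given program such as Excel honors the charset varies) is to include it in the content type:

return File(x, "text/csv; charset=utf-8", "foo.csv");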