Why do we use Base64?

algorithm character-encoding binary ascii base64

Base64 encoding schemes are commonly used when there is a need to encode binary data that needs to be stored and transferred over media that are designed to deal with textual data. This is to ensure that the data remains intact without modification during transport.

But isn't data always stored and transmitted in binary, because our machines' memory stores binary and it just depends on how you interpret it? So, whether you encode the bit pattern 010011010110000101101110 as Man in ASCII or as TWFu in Base64, you are eventually going to store the same bit pattern.

If the ultimate encoding is in terms of zeros and ones and every machine and media can deal with them, how does it matter if the data is represented as ASCII or Base64?

What does "media that are designed to deal with textual data" mean? They can deal with binary => they can deal with anything.

Thanks everyone, I think I understand now.

When we send data, we cannot be sure that it will be interpreted in the format we intended. So, we send the data coded in some format (like Base64) that both parties understand. That way, even if the sender and receiver interpret some things differently, they agree on the coded format, so the data will not be misinterpreted.

From Mark Byers' example:

If I want to send

Hello
world!

One way is to send it in ASCII like

72 101 108 108 111 10 119 111 114 108 100 33

But byte 10 might not be interpreted correctly as a newline at the other end. So, we use a subset of ASCII to encode it like this

83 71 86 115 98 71 56 75 100 50 57 121 98 71 81 104

which at the cost of more data transferred for the same amount of information ensures that the receiver can decode the data in the intended way, even if the receiver happens to have different interpretations for the rest of the character set.

Historical background: Email servers used to be 7-bit ASCII. Many of them would set the high bit to 0 so you had to send 7-bit values only. See en.wikipedia.org/wiki/Email#Content_encoding

You can (or historically could) only rely on the lower 7 bits of ASCII being the same between machines, or translatable between machines, especially when not all machines used ASCII.

@Martin, you are kidding. Perl is hard to read, but base64 is not readable at all.

@Lazer Your image is missing

@Lazer, "But byte 10 might not be interpreted correctly as a newline at the other end." Why? The two parties have agreed upon ASCII, so they must be interpreting it correctly!

Real Ambush

Your first mistake is thinking that ASCII encoding and Base64 encoding are interchangeable. They are not. They are used for different purposes.

When you encode text in ASCII, you start with a text string and convert it to a sequence of bytes.

When you encode data in Base64, you start with a sequence of bytes and convert it to a text string.
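The two directions can be checked with a small sketch (Python is used here only for illustration; the helper name `bits` is made up for this example). It also shows that "Man" stored as ASCII and "TWFu" stored as ASCII are different bit patterns, which is the point the question stumbles over:

```python
import base64

def bits(data: bytes) -> str:
    """Render bytes as a string of 0s and 1s (illustrative helper)."""
    return "".join(f"{byte:08b}" for byte in data)

text = "Man"

# ASCII encoding: text string -> bytes
ascii_bytes = text.encode("ascii")

# Base64 encoding: bytes -> text string (restricted to 64 safe characters)
b64_text = base64.b64encode(ascii_bytes).decode("ascii")

print(bits(ascii_bytes))               # 010011010110000101101110
print(b64_text)                        # TWFu
print(bits(b64_text.encode("ascii")))  # 32 bits, not the same 24 bits
```

So the Base64 form is not another view of the same stored bits; it is a new, longer byte sequence that uses only safe values.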

To understand why Base64 was necessary in the first place we need a little history of computing.

Computers communicate in binary - 0s and 1s - but people typically want to communicate with richer forms of data such as text or images. In order to transfer this data between computers it first has to be encoded into 0s and 1s, sent, then decoded again. To take text as an example - there are many different ways to perform this encoding. It would be much simpler if we could all agree on a single encoding, but sadly this is not the case.

Originally a lot of different encodings were created (e.g. Baudot code) which used a different number of bits per character until eventually ASCII became a standard with 7 bits per character. However, most computers store binary data in bytes consisting of 8 bits each, so ASCII is unsuitable for transferring this type of data. Some systems would even wipe the most significant bit. Furthermore, the difference in line ending encodings across systems means that the ASCII characters 10 and 13 were also sometimes modified.

To solve these problems Base64 encoding was introduced. This allows you to encode arbitrary bytes to bytes which are known to be safe to send without getting corrupted (ASCII alphanumeric characters and a couple of symbols). The disadvantage is that encoding the message using Base64 increases its length - every 3 bytes of data is encoded to 4 ASCII characters.
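As a rough sketch of the 3-bytes-to-4-characters step (ignoring padding, and using a hypothetical helper `encode_group` that is not part of any library), the 24 input bits are regrouped into four 6-bit indices into the standard Base64 alphabet:

```python
import base64
import string

# Standard Base64 alphabet: A-Z, a-z, 0-9, '+', '/'
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def encode_group(three_bytes: bytes) -> str:
    """Encode exactly 3 bytes as 4 Base64 characters (no padding handling)."""
    if len(three_bytes) != 3:
        raise ValueError("this sketch only handles full 3-byte groups")
    n = int.from_bytes(three_bytes, "big")  # the 24 bits as one integer
    # Take four 6-bit slices, most significant first
    indices = [(n >> shift) & 0x3F for shift in (18, 12, 6, 0)]
    return "".join(ALPHABET[i] for i in indices)

print(encode_group(b"Man"))  # TWFu
```

Real encoders also handle inputs whose length is not a multiple of 3 by padding with "="; in practice you would call `base64.b64encode` rather than write this by hand.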

To send text reliably you can first encode to bytes using a text encoding of your choice (for example UTF-8) and then afterwards Base64 encode the resulting binary data into a text string that is safe to send encoded as ASCII. The receiver will have to reverse this process to recover the original message. This of course requires that the receiver knows which encodings were used, and this information often needs to be sent separately.
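A minimal sketch of that two-step process, assuming both sides have agreed on UTF-8 followed by Base64 (the sample message is made up):

```python
import base64

message = "Grüße, 世界"  # example text containing non-ASCII characters

# Sender: text -> bytes (UTF-8) -> ASCII-only text (Base64)
wire_text = base64.b64encode(message.encode("utf-8")).decode("ascii")

# Receiver: reverse both steps, in the opposite order
recovered = base64.b64decode(wire_text).decode("utf-8")

print(wire_text.isascii())   # True
print(recovered == message)  # True
```

If the receiver assumed a different text encoding in the last step, the Base64 layer would still decode cleanly but the recovered text would be wrong, which is why the choice of encodings has to be communicated.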

Historically it has been used to encode binary data in email messages where the email server might modify line-endings. A more modern example is the use of Base64 encoding to embed image data directly in HTML source code. Here it is necessary to encode the data to avoid characters like '<' and '>' being interpreted as tags.

Here is a working example:

I wish to send a text message with two lines:

Hello
world!

If I send it as ASCII (or UTF-8) it will look like this:

72 101 108 108 111 10 119 111 114 108 100 33

The byte 10 is corrupted in some systems, so we can encode these bytes as a Base64 string:

SGVsbG8Kd29ybGQh

Which when encoded using ASCII looks like this:

83 71 86 115 98 71 56 75 100 50 57 121 98 71 81 104

All the bytes here are known safe bytes, so there is very little chance that any system will corrupt this message. I can send this instead of my original message and let the receiver reverse the process to recover the original message.
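The numbers in this example can be reproduced with a short script (a sketch using Python's standard `base64` module):

```python
import base64

message = "Hello\nworld!"

raw = message.encode("ascii")
print(list(raw))
# [72, 101, 108, 108, 111, 10, 119, 111, 114, 108, 100, 33]

encoded = base64.b64encode(raw)
print(encoded.decode("ascii"))
# SGVsbG8Kd29ybGQh
print(list(encoded))
# [83, 71, 86, 115, 98, 71, 56, 75, 100, 50, 57, 121, 98, 71, 81, 104]

# The receiver reverses the process
decoded = base64.b64decode(encoded).decode("ascii")
print(decoded == message)  # True
```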

"most modern communications protocols won't corrupt data" - although for example email might, with a delivery agent replacing the string of characters "\nFrom " with "\n>From " when it saves the message to a mailbox. Or HTTP headers are newline terminated with no reversible way to escape newlines in the data (line continuation conflates whitespace), so you can't just dump arbitrary ASCII into them either. base64 is better than just 7-bit safe, it's alpha-numeric-and-=+/ safe.

"The disadvantage is that encoding the message using Base64 increases its length - every 3 bytes of data is encoded to 4 bytes." How does it increase to 4 bytes? Won't it still be 3*8 = 24 bits only?

@Lazer: no. Look at your own example - "Man" is base-64 encoded as "TWFu". 3 bytes -> 4 bytes. It's because the input is allowed to be any of the 2^8 = 256 possible bytes, whereas the output only uses 2^6 = 64 of them (and =, to help indicate the length of the data). 8 bits per quartet of output are "wasted", in order to prevent the output from containing any "exciting" characters even though the input does.
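The size overhead can be checked numerically. This sketch assumes standard padded Base64 with no line breaks; `base64_length` is a name made up for the example:

```python
import base64

def base64_length(n_bytes: int) -> int:
    """Length of padded Base64 output for n_bytes of input (4 chars per 3-byte group)."""
    return 4 * ((n_bytes + 2) // 3)

for n in (1, 2, 3, 4, 300):
    actual = len(base64.b64encode(bytes(n)))  # bytes(n) is n zero bytes
    print(n, base64_length(n), actual)
```

For 300 input bytes this gives 400 output bytes, so the cost is roughly one third more data.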

It might be helpful to restate "When you encode data in Base64, you start with a sequence of bytes and convert it to a text string" as "When you encode data in Base64, you start with a sequence of bytes and convert it to a sequence of bytes consisting only of ASCII values". A sequence of bytes consisting only of ASCII characters is what's required by SMTP, which is why Base64 (and quoted-printable) are used as content-transfer-encodings. Excellent overview!

I found a linked post that talks about this: "If we don't do this, then there is a risk that certain characters may be improperly interpreted. For e.g. Newline chars such as 0x0A and 0x0D, Control characters like ^C, ^D, and ^Z that are interpreted as end-of-file on some platforms, NULL byte as the end of a text string, Bytes above 0x7F (non-ASCII), We use Base64 encoding in HTML/XML docs to avoid characters like '<' and '>' being interpreted as tags."

Sridhar Sarnobat

Encoding binary data in XML

Suppose you want to embed a couple of images within an XML document. The images are binary data, while the XML document is text. But XML cannot handle embedded binary data. So how do you do it?

One option is to encode the images in base64, turning the binary data into text that XML can handle.

Instead of:

<images>
  <image name="Sally">{binary gibberish that breaks XML parsers}</image>
  <image name="Bobby">{binary gibberish that breaks XML parsers}</image>
</images>

you do:

<images>
  <image name="Sally" encoding="base64">j23894uaiAJSD3234kljasjkSD...</image>
  <image name="Bobby" encoding="base64">Ja3k23JKasil3452AsdfjlksKsasKD...</image>
</images>

And the XML parser will be able to parse the XML document correctly and extract the image data.
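A small sketch of building and reading such a document with Python's standard `xml.etree.ElementTree`. The element and attribute names follow the example above; the image bytes are short placeholders, not real images:

```python
import base64
import xml.etree.ElementTree as ET

# Placeholder "image" data containing bytes that would be troublesome as raw XML text
images = {
    "Sally": bytes([0x00, 0x3C, 0xFF, 0x3E, 0x26]),
    "Bobby": bytes(range(8)),
}

# Writer: binary -> Base64 text inside each <image> element
root = ET.Element("images")
for name, data in images.items():
    element = ET.SubElement(root, "image", {"name": name, "encoding": "base64"})
    element.text = base64.b64encode(data).decode("ascii")

xml_text = ET.tostring(root, encoding="unicode")

# Reader: parse the XML, then decode each element's text back to bytes
parsed = ET.fromstring(xml_text)
recovered = {
    element.get("name"): base64.b64decode(element.text)
    for element in parsed.findall("image")
}

print(recovered == images)  # True
```

The `encoding="base64"` attribute is only a convention between writer and reader here; the XML parser itself does not act on it.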

This might be how Microsoft's old .mht format works (html file + images in a single file).

Community

Why not look to the RFC that currently defines Base64?

Base encoding of data is used in many situations to store or transfer data in environments that, perhaps for legacy reasons, are restricted to US-ASCII [1] data. Base encoding can also be used in new applications that do not have legacy restrictions, simply because it makes it possible to manipulate objects with text editors. In the past, different applications have had different requirements and thus sometimes implemented base encodings in slightly different ways. Today, protocol specifications sometimes use base encodings in general, and "base64" in particular, without a precise description or reference. Multipurpose Internet Mail Extensions (MIME) [4] is often used as a reference for base64 without considering the consequences for line-wrapping or non-alphabet characters. The purpose of this specification is to establish common alphabet and encoding considerations. This will hopefully reduce ambiguity in other documents, leading to better interoperability.

Base64 was originally devised as a way to allow binary data to be attached to emails as a part of the Multipurpose Internet Mail Extensions.

This is fair but begs the question of why we still use it today, when we're not restricted to US-ASCII

Håvard S

Media that is designed for textual data is of course eventually binary as well, but textual media often use certain binary values for control characters. Also, textual media may reject certain binary values as non-text.

Base64 encoding encodes binary data as values that can only be interpreted as text in textual media, and is free of any special characters and/or control characters, so that the data will be preserved across textual media as well.
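To make this concrete, here is a sketch with a hypothetical lossy channel, `text_mode_channel`, that rewrites line feeds the way a text-mode transfer might. The payload bytes are arbitrary, chosen to contain the value 10:

```python
import base64

def text_mode_channel(data: bytes) -> bytes:
    """Hypothetical textual medium: rewrites every LF (10) as CR LF (13, 10)."""
    return data.replace(b"\n", b"\r\n")

payload = bytes([0x89, 0x50, 0x0A, 0x00, 0xFF, 0x0A])  # binary data with two 0x0A bytes

# Sent raw, the payload is altered in transit
raw_survived = text_mode_channel(payload) == payload

# Sent as Base64, it contains no control characters for the channel to rewrite
received = text_mode_channel(base64.b64encode(payload))
b64_survived = base64.b64decode(received) == payload

print(raw_survived)  # False
print(b64_survived)  # True
```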

So with Base64, both the source and destination will mostly interpret the data the same way, because they will most probably interpret these 64 characters the same way, even if they interpret the control characters in different ways. Is that right?

The data may even be destroyed in transit. For example, many FTP programs rewrite line endings from 13,10 to 10 or vice versa if the operating systems of the server and client don't match and the transfer is flagged as text mode. FTP is just the first example that came to my mind; it is not a good one because FTP does support a binary mode.

@nhnb: I think FTP is a fine example since it shows that text-mode is unsuitable for things that want binary data.

What is a textual medium?

But this begs the question of what other protocols use, if not base64. Won't every protocol have the problem that it needs to reserve certain bytes to be control chars? And yet I only see base64 being used for email & form data.

Aiden Bell

It is more that the media validates the string encoding, so we want to ensure that the data is acceptable to a handling application (and doesn't contain a binary sequence representing EOL, for example).

Imagine you want to send binary data in an email with encoding UTF-8 -- The email may not display correctly if the stream of ones and zeros creates a sequence which isn't valid Unicode in UTF-8 encoding.
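As a quick illustration (the bytes below are arbitrary, picked because 0xFF can never appear in valid UTF-8), raw binary can fail to decode as UTF-8 while its Base64 form always decodes:

```python
import base64

binary = bytes([0x47, 0x49, 0x46, 0xFF, 0xFE, 0x80])  # arbitrary bytes, not valid UTF-8

try:
    binary.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# The Base64 form is plain ASCII, which is also valid UTF-8
encoded_text = base64.b64encode(binary).decode("utf-8")

print(decodes_as_utf8)                           # False
print(base64.b64decode(encoded_text) == binary)  # True
```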

The same type of thing happens in URLs when we want to encode characters not valid for a URL in the URL itself:

http://www.foo.com/hello my friend -> http://www.foo.com/hello%20my%20friend
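The URL case can be reproduced with Python's standard `urllib.parse` (a sketch; only the path part is escaped here):

```python
from urllib.parse import quote, unquote

base = "http://www.foo.com"
path = "/hello my friend"

# quote() percent-encodes characters that are not allowed in a URL path;
# "/" is left alone by default
url = base + quote(path)
print(url)  # http://www.foo.com/hello%20my%20friend

# The encoding is reversible
print(unquote(quote(path)) == path)  # True
```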

This is because we want to send a space over a system that will think the space is smelly.

All we are doing is ensuring there is a 1-to-1 mapping from an arbitrary sequence of bits to a known good, acceptable and non-detrimental sequence of bits, and that the handling application doesn't distinguish the encoding.

In your example, Man may be valid ASCII in its first form; but often you may want to transmit values that are random binary (i.e. sending an image in an email):

MIME-Version: 1.0
Content-Description: "Base64 encode of a.gif"
Content-Type: image/gif; name="a.gif"
Content-Transfer-Encoding: Base64
Content-Disposition: attachment; filename="a.gif"

Here we see that a GIF image is encoded in base64 as a chunk of an email. The email client reads the headers and decodes it. Because of the encoding, we can be sure the GIF doesn't contain anything that may be interpreted as protocol and we avoid inserting data that SMTP or POP may find significant.
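A sketch of how a mail library produces such a part, using Python's standard `email` package. The addresses and the "GIF" bytes are placeholders (the bytes are not a valid image), and the exact headers a given client emits may differ from the ones shown above:

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Placeholder attachment data containing CR, LF and a high byte
gif_bytes = b"GIF89a" + bytes([0x01, 0x00, 0x01, 0x00, 0x00, 0x0A, 0x0D, 0xFF, 0x3B])

msg = EmailMessage()
msg["Subject"] = "a.gif"
msg["From"] = "sender@example.com"
msg["To"] = "receiver@example.com"
msg.set_content("See attachment.")
# Binary attachments are Base64-encoded by default
msg.add_attachment(gif_bytes, maintype="image", subtype="gif", filename="a.gif")

wire = msg.as_bytes()
wire_is_ascii = all(byte < 128 for byte in wire)

# Receiving side: parse the message and decode the attachment
parsed = message_from_bytes(wire, policy=policy.default)
attachment = next(parsed.iter_attachments())
transfer_encoding = str(attachment["Content-Transfer-Encoding"])
recovered = attachment.get_content()

print(wire_is_ascii, transfer_encoding, recovered == gif_bytes)
```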

That's awesome--this explanation made it click. It's not to obfuscate or compress data, but simply to avoid using special sequences that can be interpreted as protocol.

Sridhar Sarnobat

Base64 instead of escaping special characters

I'll give you a very different but real example: I write javascript code to be run in a browser. HTML tags have ID values, but there are constraints on what characters are valid in an ID.

But I want my ID to losslessly refer to files in my file system. Files in reality can have all manner of weird and wonderful characters in them from exclamation marks, accented characters, tilde, even emoji! I cannot do this:

<div id="/path/to/my_strangely_named_file!@().jpg">
    <img src="http://myserver.com/path/to/my_strangely_named_file!@().jpg">
    Here's a pic I took in Moscow.
</div>

Suppose I want to run some code like this:

// ERROR
document.getElementById("/path/to/my_strangely_named_file!@().jpg");

I think this code will fail when executed.

With Base64 I can refer to something complicated without worrying about which language allows what special characters and which need escaping:

document.getElementById("18GerPD8fY4iTbNpC9hHNXNHyrDMampPLA");

Unlike using MD5 or some other hashing function, you can reverse the encoding to find out exactly what the original data was, which is actually useful.
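A minimal sketch of the idea, written in Python rather than the browser JavaScript used above. The helper names `path_to_id` and `id_to_path` are hypothetical, and the URL-safe Base64 variant with padding stripped is an assumption of this sketch, chosen so the result contains only letters, digits, "-" and "_":

```python
import base64

def path_to_id(path: str) -> str:
    """Encode a file path as an identifier made of letters, digits, '-' and '_'."""
    raw = path.encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")

def id_to_path(element_id: str) -> str:
    """Reverse path_to_id: restore the padding, then decode."""
    padded = element_id + "=" * (-len(element_id) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8")

path = "/path/to/my_strangely_named_file!@().jpg"
element_id = path_to_id(path)

print(element_id)
print(id_to_path(element_id) == path)  # True
```

Because the result can start with a digit, some contexts may still want a fixed letter prefix in front of it.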

I wish I had known about Base64 years ago. I would have avoided tearing my hair out with 'encodeURIComponent' and str.replace('\n','\\n')

SSH transfer of text:

If you're trying to pass complex data over ssh (e.g. a dotfile so you can get your shell personalizations), good luck doing it without Base64. This is how you would do it with Base64 (I know you can use SCP, but that would take multiple commands, which complicates key bindings for sshing into a server):

https://superuser.com/a/1376076/114723

Gilbert