Why did base64 win against uuencode?

Why did base64 win against uuencode?(retrocomputing.stackexchange.com)

225 points by egorpv 2 years ago | 103 comments

ekidd 2 years ago |

One reason that uuencode lost out to Base64 was that uuencode used spaces in its encoding. It was fairly common for Internet protocols in those days to mess with whitespace, so it was often necessary to patch up corrupted uuencode files by hand.

Base64, on the other hand, was carefully designed to survive everything from whitespace corruption to being passed through non-ASCII character sets. And then it became widely used as part of MIME.

jjeaff 2 years ago | |

and yet, Internet protocols (http, at least) don't play well with equal signs which are part of base64, sometimes. That little issue has caused lots of intermittent bugs for me over the years, either from forgetting to urlencode it or not urldecoding it at the right time.

eastbound 2 years ago | | |

So there are 7 base64 encodings, one with “+ / =“, one with “- _ =“, one with “+,” and no “=“… https://en.wikipedia.org/wiki/Base64#Variants_summary_table

OskarS 2 years ago | | |

And slashes as well, which is a magic character in both urls and file systems. Means you can't reliably use normal base64 for filenames, for instance. That might seem like a niche use-case, but it's really not, because you can use it for content-based addressing. Git does this, names all the blobs in the .git folder after their hash, but you can't encode the hash with regular base64.

JoshTriplett 2 years ago | | |

Ditto the obnoxious "quoted-printable" mail encoding, which turns every = into =3D.

Still more robust than uuencode though.

onetimeuse92304 2 years ago | | |

I am using base62 for data that can be included in URIs.

afiori 2 years ago | | |

all three symbols are some of the worst possible choices for compatibility with urls and many other things

.-_ would have been a better choice tha +/=

Vt71fcAqt7 2 years ago | | |

And now we can have whitespace in url queries but we are still using %20 everywhere because "that's standard"...

candiddevmike 2 years ago | |

...only for base64 to become fragmented into standard or URL due to character choice, and padded or not padded as a "cool trick to save some bytes".

ggm 2 years ago |

Having lived through the transition, I can say personally it comes down to "packaging" -if MIME had adopted UUENCODE format, I probably would have used it but as materials emerged to me which depended on base64 decode, it became compelling to use it. Once it was ubiquitously available in e.g ssl, it became trivial to decode a base64 encoded thing, no matter what. Not all systems had a functioning uudecode all the time. DOS for instance, you had to find one. If you're given base64 content, you install a base64 encode/decode package and then its what you have.

There was also an extended period of time where people did uux much as they did shar: both of which are inviting somebody else's hands into your execution state and filestore.

We were also obsessed with efficiency. base64 was "sold" as denser encoding. I can't say if it was true overall, but just as we discussed lempel-zif and gzip tuning on usenet news, we discussed uuencode/base64 and other text wrapping.

Ned Freed, Nathaniel Borenstein, Patrik Falstrom and Robert Elz amongst others come to mind as people who worked on the baseXX encoding and discussed this on the lists at the time. Other alphabets were discussed.

uu* was the product of Mike Lesk a decade before, who was a lot quieter on the lists: He'd moved into different circles, was doing other things and not really that interested in the chatter around line encoding issues.

eesmith 2 years ago | |

Here are Usenet comments from the the 1994 comp.mail.mime thread "Q: why base64 ,not UUencode?"

1) https://www.usenetarchives.com/view.php?id=comp.mail.mime&mi...

> Some of the characters used by uuencode cannot be represented in some of the mail systems used to carry rfc 822 (and therefore MIME) mail messages. Using uuencode in these environments causes corruption of encoded data. The working group that developed MIME felt that reliability of the encoding scheme was more important that compatibility with uuencode.

In a followup (same link):

> "The only character translation problem I have encountered is that the back-quote (`) does not make it through all mailers and becomes a space ( )."

A followup from that at https://www.usenetarchives.com/view.php?id=comp.mail.mime&mi... says:

> The back-quote problem is only one of many. Several of the characters used by uuencode are not present in (for example) the EBCDIC character set. So a message transmitted over BITNET could get mangled -- especially for traffic between two different countries where they use different versions of EBCDIC, and therefore different translate tables between EBCDIC and ASCII. There are other character sets used by 822-based mail systems that impose similar restrictions, but EBCDIC is the most obvious one.

> We didn't use uuencode because several members of our working group had experience with cases where uuencoded files were garbaged in transit. It works fine for some people, but not for "everybody" (or even "nearly everybody").

> The "no standards for uuencode" wasn't really a problem. If we had wanted to use uuencode, we would have documented the format in the MIME RFC.

That last comment was from Keith Moore, "the author and co-author of several IETF RFCs related to the MIME and SMTP protocols for electronic mail, among others" says https://en.wikipedia.org/wiki/Keith_Moore .

mjevans 2 years ago | |

After a given point usenet was nearly 8-bit clean, and thus https://en.wikipedia.org/wiki/YEnc was also developed to convolve all the octets (I + 42 (decimal)) and escape the results that happened to still match reserved characters (CR, LF, 0x0, = (yEnc escape)) - it seems that if the result character was among that set, then = was output and new output determined by O = (I+64) % 256 instead.

wkat4242 2 years ago | | |

Yenc is still used a lot actually, for the purpose of what Usenet has de facto become, a piracy network :)

u801e 2 years ago | | |

It's too bad yenc didn't take the place of base64 for email.

LukeShu 2 years ago | |

> We were also obsessed with efficiency. base64 was "sold" as denser encoding. I can't say if it was true overall

uuencode has file headers/footers, like MIME. But the actual content encoding is basically base64 with a different alphabet; both add precisely 1/3 overhead (plus up to 2 padding bytes at the end).

cratermoon 2 years ago | | |

uuencode has some additional overhead, namely 2 additional bytes per line, that means it varies from 60-70%, the latter being best case, while base64 is 75% efficient in all cases.

DaiPlusPlus 2 years ago |

On a related note, I'm getting flashbacks to being on the web in the late-1990s, back when "Downloads!" was a reason to visit a particular website; and noticing that Windows users like myself could just download-and-run an .exe file, while the same downloads for Mactintosh users would be a BinHex file that'd also be much larger than the Windows equivalent - and this wasn't over FTP or Telnet, but an in-browser HTTP download, just like today.

Can anyone explain why BinHex remained "popular" in online Mac communities through to the early 2000s? Why couldn't Macs download "real" binary files back then?

Aardwolf 2 years ago |

A thing I wonder: why is using = padding required in the most common base64 variant?

It's redundant since this info can be fully inferred from the length of the stream.

Even for concatenations it is not necessary to require it, since you must still know the length of each sub stream (and = does not always appear so is not a separator).

There's no way that using the = instead of per-byte length-checking gains any speed, since to prevent reading out of bounds you must check the per byte length anyway, you can't trust input to be a multiple of 4 length.

It could only make sense if it's somehow required to read 4 bytes at once, and you can't possibly read less, but what platform is such?

smudgy 2 years ago |

You know, when I first got into binary encoding into text I asked myself this very question but never put any effort into looking it up.

Now, 25+ years later, I have some answers - thanks!

Dwedit 2 years ago |

There is still one sort-of efficient way of embedding binary content in an HTML file. You must save the file as UTF-16. A Javscript string from a UTF-16 HTML file can contain anything except these: \0, \r, \n, \\, ", and unmatched surrogate pairs 0xD800-0xDFFF.

If you escape any disallowed character in the usual way for a string ("\0", "\r", "\n", "\\", "\"", "\uD800") then there is no decoding process, all the data in the string will be correct.

If you throw data that is compressed in there, you're unlikely to get very many zeroes, so you can just hope that there aren't too many unmatched surrogate pairs in your binary data, because those get inflated to 6 times their size.

Note that this operates on 16-bit values. In order to see a null, \r, \n, \\ and ", the most significant byte must also be zero, and in order for your data to contain a surrogate pair, you're looking at the two bytes taken together. When the data is compressed, the patterns are less likely.

38 2 years ago |

https://wikipedia.org/wiki/Binary-to-text_encoding

whoopdedo 2 years ago | |

Not listed was a clever encoding for MS-DOS files, XXBUG[1]. DOS had a rudimentary debugger and memory editor. (It even stuck around all the way to Windows XP but didn't survive the transition to 64-bit.) Because it had the ability to write to disk you could convert any file to hexadecimal bytes and sprinkle some control commands about to create a script for DEBUG.EXE. The text-encoded file could then be sent anywhere without needing to download a decoder program first.

[1] http://justsolve.archiveteam.org/wiki/XXBUG

bnjf 2 years ago |

but uuencode makes for fun: https://github.com/bnjf/compre.sh

nibbula 2 years ago |

Ascii85 survived by hiding in popular bloat.

quickthrower2 2 years ago |

I first met base64 in ASP.NET viewstate.

t-3 2 years ago |

Base64 is very bizarre in general. Why did they use such a weird pattern of symbols instead of a contiguous section, or at least segments ordered from low->high (on that note, ASCII is also quite strange, I'm guessing due to some backwards compatibility idiocy that seemed like it made sense at some point (or maybe changing case was super important to a lot of workloads or something, making a compelling reason to fuck over the future in favor of optimisation now))?

The padding character is not essential for decoding, since the number of missing bytes can be inferred from the length of the encoded text. In some implementations, the padding character is mandatory, while for others it is not used. An exception in which padding characters are required is when multiple Base64 encoded files have been concatenated.

This subset has the important property that it is represented identically in all versions of ISO 646, including US-ASCII, and all characters in the subset are also represented identically in all versions of EBCDIC. Other popular encodings, such as the encoding used by the uuencode utility, Macintosh binhex 4.0 [RFC-1741], and the base85 encoding specified as part of Level 2 PostScript, do not share these properties, and thus do not fulfill the portability requirements a binary transport encoding for mail must meet.