Why did base64 win against uuencode? (retrocomputing.stackexchange.com)
Base64, on the other hand, was carefully designed to survive everything from whitespace corruption to being passed through non-ASCII character sets. And then it became widely used as part of MIME.
Still more robust than uuencode though.
.-_ would have been a better choice than +/=
There was also an extended period of time where people did uux much as they did shar: both of which are inviting somebody else's hands into your execution state and filestore.
We were also obsessed with efficiency. base64 was "sold" as denser encoding. I can't say if it was true overall, but just as we discussed Lempel-Ziv and gzip tuning on usenet news, we discussed uuencode/base64 and other text wrapping.
Ned Freed, Nathaniel Borenstein, Patrik Fältström and Robert Elz amongst others come to mind as people who worked on the baseXX encoding and discussed this on the lists at the time. Other alphabets were discussed.
uu* was the product of Mike Lesk a decade before, who was a lot quieter on the lists: He'd moved into different circles, was doing other things and not really that interested in the chatter around line encoding issues.
1) https://www.usenetarchives.com/view.php?id=comp.mail.mime&mi...
> Some of the characters used by uuencode cannot be represented in some of the mail systems used to carry rfc 822 (and therefore MIME) mail messages. Using uuencode in these environments causes corruption of encoded data. The working group that developed MIME felt that reliability of the encoding scheme was more important that compatibility with uuencode.
In a followup (same link):
> "The only character translation problem I have encountered is that the back-quote (`) does not make it through all mailers and becomes a space ( )."
A followup from that at https://www.usenetarchives.com/view.php?id=comp.mail.mime&mi... says:
> The back-quote problem is only one of many. Several of the characters used by uuencode are not present in (for example) the EBCDIC character set. So a message transmitted over BITNET could get mangled -- especially for traffic between two different countries where they use different versions of EBCDIC, and therefore different translate tables between EBCDIC and ASCII. There are other character sets used by 822-based mail systems that impose similar restrictions, but EBCDIC is the most obvious one.
> We didn't use uuencode because several members of our working group had experience with cases where uuencoded files were garbaged in transit. It works fine for some people, but not for "everybody" (or even "nearly everybody").
> The "no standards for uuencode" wasn't really a problem. If we had wanted to use uuencode, we would have documented the format in the MIME RFC.
That last comment was from Keith Moore, "the author and co-author of several IETF RFCs related to the MIME and SMTP protocols for electronic mail, among others" says https://en.wikipedia.org/wiki/Keith_Moore .
uuencode has file headers/footers, like MIME. But the actual content encoding is basically base64 with a different alphabet; both add precisely 1/3 overhead (plus up to 2 padding bytes at the end).
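For anyone who wants to check the "same density, different alphabet" point, here is a minimal sketch using Python's standard `binascii` and `base64` modules. The 45-byte sample and the variable names are just illustrative choices; 45 bytes is the most that one uuencoded line carries.

```python
import base64
import binascii

data = bytes(range(45))  # one full uuencode line's worth of input

uu_line = binascii.b2a_uu(data)   # length character + encoded data + newline
b64 = base64.b64encode(data)      # encoded data only, no per-line framing

uu_payload = uu_line[1:-1]        # strip the length character and the newline

# uuencode maps each 6-bit value v to chr(32 + v), so its alphabet runs
# from space through underscore and is heavy on punctuation.
uu_alphabet = "".join(chr(32 + v) for v in range(64))
uu_punctuation = [c for c in uu_alphabet if not c.isalnum()]

print(len(uu_payload), len(b64))
print("".join(uu_punctuation))
```

Both payloads come out the same length; the difference is that the uuencode alphabet contains characters like `[`, `\`, `]` and `^`, while base64 sticks to letters, digits, `+`, `/` and `=`.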
Can anyone explain why BinHex remained "popular" in online Mac communities through to the early 2000s? Why couldn't Macs download "real" binary files back then?
It's redundant since this info can be fully inferred from the length of the stream.
Even for concatenations it is not necessary to require it, since you must still know the length of each sub stream (and = does not always appear so is not a separator).
There's no way that using the = instead of per-byte length-checking gains any speed, since to prevent reading out of bounds you must check the per byte length anyway, you can't trust input to be a multiple of 4 length.
It could only make sense if it's somehow required to read 4 bytes at once, and you can't possibly read less, but what platform is such?
Now, 25+ years later, I have some answers - thanks!
If you escape any disallowed character in the usual way for a string ("\0", "\r", "\n", "\\", "\"", "\uD800") then there is no decoding process, all the data in the string will be correct.
If you throw data that is compressed in there, you're unlikely to get very many zeroes, so you can just hope that there aren't too many unmatched surrogate pairs in your binary data, because those get inflated to 6 times their size.
Note that this operates on 16-bit values. In order to see a null, \r, \n, \\ and ", the most significant byte must also be zero, and in order for your data to contain a surrogate pair, you're looking at the two bytes taken together. When the data is compressed, the patterns are less likely.
So a common hack was to binhex the .sit file. Binhex was originally designed to make files 7-bit clean, but had the side effect that it bundled the resource fork and the data fork together.
Later versions of StuffIt could open .sit files which lacked the resource fork just fine, but by then .zip was starting to become more common.
I don't really understand why macOS users like this "simple" installation, because when you "uninstall" the app, it leaves all the trash in your system without a chance to clean up. And implying that macOS application somehow will not do "who-knows-what" to your system is just wrong. Docker Desktop is "simple", yet the first thing it does after launch is installing "who-knows-what".
Whereas on macOS, installation is trivial, but then the application sets up stuff upon first run and that is really opaque, with no way of properly uninstalling the app unless there is a dedicated uninstaller.
But yeah, the simple case is quite nice.
The padding character is not essential for decoding, since the number of missing bytes can be inferred from the length of the encoded text. In some implementations, the padding character is mandatory, while for others it is not used. An exception in which padding characters are required is when multiple Base64 encoded files have been concatenated.
This shows the binary, base64 without padding and base64 with padding:
NULL --> AA --> AA==
NULL NULL --> AAA --> AAA=
NULL NULL NULL --> AAAA --> AAAA
As you can see, all the padding does is make the base64 length a multiple of 4. You already get uniquely distinguishable symbols for the 3 cases (one, two or three NULL symbols) without the ='s, so they are unnecessary
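A minimal sketch of that inference in Python, assuming a hypothetical helper `decode_unpadded` (not a standard library function) that restores the padding from the length before handing the text to the stock decoder:

```python
import base64

def decode_unpadded(text: str) -> bytes:
    """Decode base64 text whose trailing '=' characters were dropped."""
    # The number of missing characters is fully determined by the length.
    missing = -len(text) % 4
    return base64.b64decode(text + "=" * missing)

one = decode_unpadded("AA")
two = decode_unpadded("AAA")
three = decode_unpadded("AAAA")
print(one, two, three)
```

The three unpadded inputs decode to one, two and three NULL bytes respectively, matching the table above.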
Refer to the "examples" section of the wikipedia page
But I think it's likely just poor design taste.
I'm not sure I understand this part. You can decode aGVsbG8=IHdvcmxk, what do you need to know?
I only mentioned the concatenation because Wikipedia claims this use case requires padding while in reality it doesn't.
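For what it's worth, here is a small sketch of decoding that exact string four characters at a time. The helper name `decode_quads` is made up for illustration, and it assumes every concatenated piece was padded, so that each group of four characters is a complete unit on its own. It deliberately avoids relying on how any particular library treats `=` in the middle of its input.

```python
import base64

def decode_quads(text: str) -> bytes:
    """Decode padded base64 one 4-character group at a time."""
    if len(text) % 4 != 0:
        raise ValueError("expected a whole number of 4-character groups")
    pieces = []
    for i in range(0, len(text), 4):
        # Each group decodes independently; '=' may only appear at its end.
        pieces.append(base64.b64decode(text[i:i + 4]))
    return b"".join(pieces)

joined = decode_quads("aGVsbG8=IHdvcmxk")
print(joined)
```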
Using the array-indexing method, the noncontiguity of the characters doesn’t matter, and the processing is also independent of the character encoding (e.g. works exactly the same way in EBCDIC).
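A rough sketch of the array-indexing idea in Python. `ALPHABET` and `encode_by_table` are illustrative names, not taken from any RFC or library. The point is that each 6-bit value is only ever used as an index into a table, so no arithmetic is done on character codes.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def encode_by_table(data: bytes) -> str:
    """Base64-encode by indexing into ALPHABET, three input bytes at a time."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        # Left-align the chunk in 24 bits, zero-filling a short final chunk.
        n = int.from_bytes(chunk + b"\x00" * (3 - len(chunk)), "big")
        quad = [ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        keep = len(chunk) + 1  # 1 byte -> 2 chars, 2 -> 3, 3 -> 4
        out.append("".join(quad[:keep]) + "=" * (4 - keep))
    return "".join(out)

matches_stdlib = (
    encode_by_table(b"hello world")
    == base64.b64encode(b"hello world").decode("ascii")
)
print(encode_by_table(b"hello"), matches_stdlib)
```

Since the table holds characters rather than code points, the same logic would work with the table expressed in any character set that has all 64 symbols.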
https://datatracker.ietf.org/doc/html/rfc2045#section-6.8 says:
This subset has the important property that it is represented
identically in all versions of ISO 646, including US-ASCII, and all
characters in the subset are also represented identically in all
versions of EBCDIC. Other popular encodings, such as the encoding
used by the uuencode utility, Macintosh binhex 4.0 [RFC-1741], and
the base85 encoding specified as part of Level 2 PostScript, do not
share these properties, and thus do not fulfill the portability
requirements a binary transport encoding for mail must meet.
If you want to learn why ASCII is the way it is, try "The Evolution of Character Codes, 1874-1968" at https://archive.org/details/enf-ascii/mode/2up by Eric Fischer (an HN'er). My reading is contiguous A-Z was meant for better compatibility with 6-bit use.
Considerably stranger in regard to contiguity was EBCDIC, but it too made sense in terms of its technological requirements, which centered around Hollerith punch cards. https://en.wikipedia.org/wiki/EBCDIC
There are numerous other examples where a lack of knowledge of the technological landscape of the past leads some people to project unwarranted assumptions of incompetence onto the engineers who lived under those constraints.
(Hmmm ... perhaps I should have read this person's profile before commenting.)
And the performance claims are absurd, e.g.,
"A simple and extremely common int->hex string conversion takes twice as many instructions as it would if ASCII was optimized for computability."
WHICH conversion, uppercase hex or lowercase hex? You can't have both. And it's ridiculous to think that the character set encoding should have been optimized for either one or that it would have made a measurable net difference if it had been. And instruction counts don't determine speed on modern hardware. And if this were such a big deal, the conversion could be microcoded. But it's not--there's no critical path with significant amounts of binary to ASCII hex conversion.
"There are also inconsistencies like front and back braces/(angle)brackets/parens not being convertible like the alphabet is."
That is not a usable conversion. Anyone who has actually written parsers knows that the encodings of these characters are not relevant ... nothing would have been saved in parsing "loops". Notably, programming language parsers consume tokens produced by the lexer, and the lexer processes each punctuation character separately. Anything that could be gained by grouping punctuation encodings can be done via the lexer's mapping from ASCII to token values. (I have actually done this to reduce the size of bit masks that determine whether any member of a set of tokens has been encountered. I've even, in my weaker moments, hacked the encodings so that <>, {}, [], and () are paired--but this is pointless premature optimization.)
Again, this fellow's profile is accurate.
Hardware has advanced, but software depends on standards and conventions formulated for far less capable hardware, and that's a problem.
The efficiency of string processing/generation is hugely important in terms of global energy consumption.
A simple and extremely common int->hex string conversion takes twice as many instructions as it would if ASCII was optimized for computability.
Bounds-checking for the English alphabet requires either an upfront normalization or twice the checking, so 50-100% more instructions for that.
There are also inconsistencies like front and back braces/(angle)brackets/parens not being convertible like the alphabet is.
[({< <-> >})] would have been just as or more useful than the alphabet being convertible and saved a few instructions in common parsing loops.
> I never questioned the competence of past engineers
False just based on your opening volley of toxic spew. Backwards compatibility is an engineering decision and it was made by very competent people to interoperate with a large number of systems. The future has never been fucked over.
You seem to not understand how ASCII is encoded. It is primarily based on bit-groups where the numeric ranges for character groupings can be easily determined using very simple (and fast) bit-wise operations. All of the basic C functions to test single-byte characters such as `isalpha()`, `isdigit()`, `islower()`, `isupper()`, etc. use this fact. You can then optimize these into grouped instructions and pipeline them. Pull up `man ascii` and pay attention to the hex encodings at the start of all the major symbol groups. This is still useful today!
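A few of those bit-group properties, sketched in Python for anyone who doesn't have `man ascii` handy. The helper names are made up for illustration, and the sketch assumes single ASCII characters as input.

```python
def toggle_case(ch: str) -> str:
    """Upper and lower case of a letter differ only in bit 0x20."""
    return chr(ord(ch) ^ 0x20)

def digit_value(ch: str) -> int:
    """'0'..'9' sit at 0x30..0x39, so the low nibble is the digit's value."""
    return ord(ch) & 0x0F

def control_code(ch: str) -> int:
    """Ctrl+<char> keeps the low five bits: Ctrl-G is BEL, Ctrl-[ is ESC."""
    return ord(ch) & 0x1F

def is_upper_ascii(ch: str) -> bool:
    """Upper case letters live in columns 4-5, positions 1 through 26."""
    code = ord(ch)
    return (code & 0xE0) == 0x40 and 1 <= (code & 0x1F) <= 26

print(toggle_case("a"), digit_value("7"), control_code("G"), is_upper_ascii("Q"))
```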
No, the biggest fuckage of the internet age has been Unicode which absolutely destroys this mapping. We no longer have any semblance of a 1:1 translation between any set of input bytes and any other set of character attributes. And this is just required to get simple language idioms correct. The best you can do is use bit-groupings to determine encoding errors (ala UTF-8) or stick with a larger translation table that includes surrogates (UTF-16, UTF-32, etc). They will all suffer the same "performance" problem called the "real world".
Given that, dragging a ready-to-run file (folder) to /Apps symlink is much more convenient than “setting up your system for preparation of initializing of downloading of the installation process starter manager, please wait and press next sometimes”.
I go back and forth between Windows/Mac/Linux on the daily (right tool for the right job) and each has some strengths. App packaging is far and away one of Mac's current strengths.
I maintained Nativefier (a now defunct open source project that would package web sites as Electron apps) and the ease of packaging an app was Mac > Windows > Linux.
What is your preferred system? How does it affect other needs, like collation, or testing if something is upper-case vs. lower-case, or ease of supporting case-insensitivity?
Have you measured the performance difference? https://johnnylee-sde.github.io/Fast-unsigned-integer-to-hex... shows a branchless UlongToHexString which is essentially as fast as a lookup table and faster than the "naive" implementation.
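To make the comparison concrete, here is a small Python sketch of the two usual ways to turn a nibble into a hex digit. This is not the code from the linked article. All names are illustrative, and Python says nothing about instruction counts; it only shows that the "gap" between '9' and 'a' costs either one branch or one table lookup.

```python
HEX_DIGITS = "0123456789abcdef"

def nibble_to_hex_branch(n: int) -> str:
    """The 'naive' version: one comparison to step over the 9 -> a gap."""
    return chr(n + 0x30) if n < 10 else chr(n - 10 + 0x61)

def nibble_to_hex_table(n: int) -> str:
    """Table lookup: the gap in the character set no longer matters."""
    return HEX_DIGITS[n]

def to_hex(value: int, width: int = 8) -> str:
    """Format the low 4*width bits of value as lowercase hex."""
    shifts = range(4 * (width - 1), -1, -4)
    return "".join(HEX_DIGITS[(value >> s) & 0xF] for s in shifts)

print(to_hex(0xDEADBEEF))
```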
> Bounds-checking for the English alphabet
In the following it goes from 2 assembly instructions to three:
    int is_letter(char c) {
        c |= 0x20; // normalize to lowercase
        return ('a' <= c) && (c <= 'z');
    }
Yes, that's 50% more assembly, to add a single bit-wise or, when testing a single character.
But, seriously, when is this useful? English words include an apostrophe, names like the English author Brontë use diacritics, and æ is still (rarely) used, like in the "Endowed Chair for Orthopædic Investigation" at https://orthop.washington.edu/research/ourlabs/collagen/peop... .
And when testing multiple characters at a time, there are clever optimizations like those used in UlongToHexString. SIMD within a register (SWAR) is quite powerful, eg, 8 characters could be or'ed at once in 64 bits, and of course the CPU can do a lot of work to pipeline things, so 50% more single-clock-tick instructions does not mean 50% more work.
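The "or 8 characters at once" idea looks roughly like this when modelled in Python, with an integer standing in for a 64-bit register. `LOWER_MASK` and `lowercase_8` are illustrative names, and this is only a sketch of the normalization step, not a full letter test.

```python
LOWER_MASK = 0x2020202020202020  # bit 0x20 replicated into all 8 byte lanes

def lowercase_8(chunk: bytes) -> bytes:
    """Set the case bit in 8 bytes with a single OR."""
    if len(chunk) != 8:
        raise ValueError("expected exactly 8 bytes")
    word = int.from_bytes(chunk, "little")
    return (word | LOWER_MASK).to_bytes(8, "little")

print(lowercase_8(b"HeLLoABC"))
```

As with the single-character version, the OR also changes non-letters such as `@`, so a range check is still needed afterwards to decide which lanes actually held letters.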
> like front and back braces/(angle)brackets/parens not being convertible
I have never needed that operation. Why do you need it?
Usually when I find a "(" I know I need a ")", and if I also allow a "[" then I need an if-statement anyway since A(8) and A[8] are different things, and both paths implicitly know what to expect.
> and saved a few instructions in common parsing loops.
Parsing needs to know what specific character comes next, and they are very rarely limited to only those characters. The ones I've looked at use a DFA, eg, via a switch statement or lookup table.
I can't figure out what advantage there is to that ordering, that is, I can't see why there would be any overall savings.
Especially in a language like C++ with > and >> and >>= and A<B<int>> and -> where only some of them are balanced.
GET /your/path-to/the.file HTTP/1.1
A very different world than today.
And Python uses RFC 4648
- the “default” encoder (“b64encode”) will pad the output
- although it will not linebreak (“encodebytes” does that)
- the default decoder will error if the input is not padded
- the default decoder will ignore all non-encoding characters by default
Also both b64encode and encodebytes actually use binascii.b2a_base64, which claims conformance to RFC 3548, which attempts to unify 1421 and 2045. Except RFC 3548 requires rejecting non-encoding data, whereas (again) Python accepts and ignores it by default, in 2045 fashion.
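The behaviours in that list can be checked with a short script against the standard `base64` module. The helper `decode_fails` and the sample inputs are only there for illustration.

```python
import base64
import binascii

def decode_fails(text, **kwargs) -> bool:
    """Return True if b64decode rejects the input with binascii.Error."""
    try:
        base64.b64decode(text, **kwargs)
    except binascii.Error:
        return True
    return False

padded = base64.b64encode(b"hello")              # output is padded with '='
unpadded_rejected = decode_fails("aGVsbG8")      # missing padding is an error
junk_ignored = base64.b64decode("aGVs!\nbG8=")   # '!' and newline are skipped
junk_rejected = decode_fails("aGVs!\nbG8=", validate=True)
unwrapped = base64.b64encode(b"x" * 100)         # one long line
wrapped = base64.encodebytes(b"x" * 100)         # broken into 76-character lines

print(padded, unpadded_rejected, junk_ignored, junk_rejected)
```

Passing `validate=True` is what switches the decoder from the lenient 2045-style behaviour to rejecting non-alphabet characters.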
> Lines of Quoted-Printable encoded data must not be longer than 76 characters. To satisfy this requirement without altering the encoded text, soft line breaks may be added as desired. A soft line break consists of an =
IIUC in Base64 you can throw whichever white space anywhere and it should be ignored. And in URL ("percent") encoding there is no insignificant white space possible (?) and encoding of white space depends on implementation (dreaded space `%20` vs ` ` vs `+` in application/x-www-form-urlencoded [2]).
[1] https://en.wikipedia.org/wiki/Quoted-printable [2] https://en.wikipedia.org/wiki/Percent-encoding
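A quick way to see both behaviours from Python's standard library, as a sketch: `urllib.parse.quote` and `quote_plus` show the two spellings of a space, and `quopri.encodestring` inserts soft line breaks into an over-long line. The sample inputs are arbitrary.

```python
import quopri
from urllib.parse import quote, quote_plus

space_as_percent = quote("a b")      # percent-encoding proper
space_as_plus = quote_plus("a b")    # application/x-www-form-urlencoded style

long_line = b"a" * 100
qp = quopri.encodestring(long_line)  # a soft break is '=' followed by a newline
round_trip = quopri.decodestring(qp)

print(space_as_percent, space_as_plus)
print(qp)
```

Decoding drops the soft breaks again, so the round trip gives back the original 100 bytes on a single line.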
From "Things Every Hacker Once Knew" (2017), has an entire section on ASCII and the clever bit-fiddling that occurs:
* http://www.catb.org/~esr/faqs/things-every-hacker-once-knew/...
* Discussion from ~2 months ago: https://news.ycombinator.com/item?id=37701117
Shifted numerals were nearly a bitwise operation as well, but we didn't end up using that keyboard layout.
The design considerations at https://ia800606.us.archive.org/17/items/enf-ascii-1972-1975... show that 6-bit support was more important than naive collation support:
> A6.4 It is expected that devices having the capability of printing only 64 graphic symbols will continue to be important. It may be desirable to arrange these devices to print one symbol for the bit pattern of both upper and lower case of a given alphabetic letter. To facilitate this, there should be a single-bit difference between the upper and lower case representations of any given letter. Combined with the requirement that a given case of the alphabet be contiguous, this dictated the assignment of the alphabet, as shown in columns 4 through 7.
I just found and skimmed Bob Bemer's "A Story of ASCII", which includes personal recollections of the history. It seems that the 6-bit subset was firmed up first. From https://archive.org/details/ascii-bemer/page/n17/mode/2up?q=... :
> This is reflected in the set I proposed to X3 on 1961 September 18 (Table 3, column 3), and these three characters remained in the set from that time on. The lower case alphabet was also shown, but for some time this was resisted, lest the communications people find a need for more than the two columns then allocated for control functions.
but serious discussion of lower case wasn't taken up until later. From https://archive.org/details/ascii-bemer/page/n25/mode/2up?q=... :
> ISO/TC97/SC2 held its next meeting in 1963 October, at which time it was decided to add the lower case alphabet.
and at https://archive.org/details/ascii-bemer/page/n27/mode/2up?q=... :
> At the 1963 May meeting in Geneva, CCITT endorsed the principle of the 7-bit code for any new telegraph alphabet, and expressed general but preliminary agreement with the ISO work. It further requested the placement of the lower case alphabet in the unassigned area.
Bemer did not like interleaving lower- and upper-case. From https://archive.org/details/ascii-bemer/page/n5/mode/2up?q=l... :
> I had a great opportunity to start on the standards road when invited by Dr. Werner Buchholz to do the main design of the 120-character set [9,24] for the Stretch computer (the IBM 7030). I had help, but the mistakes are all mine (such as the interspersal of the upper and lower case alphabets). ...
> he didn't make the same mistake I made for STRETCH by interspersing both cases of the alphabet!
Incidentally, the worst offender is Microsoft themselves: it all got worse with .nuget, .vs, .azcopy, .azdata, .azure, .azuredatastudio, .dotnet, etc. I just don't understand it.
My current sad-thing I’m unhappy about is how the “My Documents” folder ended up being a second AppData folder, with lots of software storing settings, templates, project files, etc in that dir instead of AppData.
Windows absolutely needs application-silos to protect users from lazy apps. I hate to say it, but Apple was 100% right to make iPhone OS a file-system-free OS - we can’t do that on desktop, but gosh-darn-it, why is software so terrible? :(
You used "that seemed like it made sense" when you could have written "that made sense." The additional "seemed like" implies the past engineers were unable to see something they should have.
You used "fuck over the future in favor of optimisation now" implying the engineers were overly short-sighted or used poor judgement when balancing the diverse needs of an interchange code.
I get that people here don't like profanity, but I don't see any slight in describing engineering decisions like optimizing for common workloads today over hypothetical loads tomorrow as 'fucking over the future'. Slightly hyperbolic, sure, but it's one of the most common decisions made in designing systems, and commonly causes lots of issues down the line. I don't see where saying something is a mistake that looks obvious in retrospect is a slight. Most things look obvious in retrospect.
"some backwards compatibility idiocy that seemed like it made sense at some point"
Is obviously an attack on their judgment.
"a compelling reason to fuck over the future in favor of optimisation now"
Talk about passive-aggressive! Of course the person who wrote this does not think that there was any such "compelling reason", which leaves us with the extremely hostile accusation.
And as I've noted, the arguments that these decisions were idiotic or effed over the future are simply incorrect.
0: https://specifications.freedesktop.org/basedir-spec/basedir-...
If you really meant your comment now, there was no reason to add "seemed like it" in your earlier text.
> I don't see any slight
You can see things however you want. The trick is to make others understand the difference between what you say and the utterances of an ignorant blowhard, "full of sound and fury, signifying nothing."
You don't seem to understand the historical context, your issues don't make sense, your improvements seem pointless at best, and you have very firm and hyperbolic viewpoints. That does not come across as 20/20 hindsight.