The Unlikely Story of UTF-8: The Text Encoding of the Web (lunduke.locals.com)
Naively, it seems like creating a scheme to pack these code points would be trivial: just represent each character as a series of bytes. But it's not so simple! As I understand it:
- they wanted backward compatibility with ASCII, which used only a single byte to represent each character
- they wanted to use memory efficiently: the most common characters should fit in a single byte rather than taking 2 or more
- they wanted to gracefully handle errors: a single corrupted byte shouldn't result in the rest of the string being parsed as garbage
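
To make those three points concrete, here's a quick Python sketch. It's my own illustration, not anything from the article: it only uses the built-in `str.encode` and `bytes.decode`, and the sample strings and variable names are arbitrary.

```python
# 1. ASCII compatibility: pure-ASCII text has identical bytes in both encodings.
ascii_text = "hello"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# 2. Memory efficiency: the encoded width grows with the code point,
#    so ASCII characters stay at one byte each.
widths = {ch: len(ch.encode("utf-8")) for ch in ["a", "é", "€", "😀"]}

# 3. Error handling: corrupt one byte in the middle of a string and decode,
#    asking the decoder to substitute U+FFFD for anything it can't parse.
data = bytearray("héllo wörld".encode("utf-8"))
data[1] = 0xFF  # overwrite the lead byte of "é" with a byte that is never valid in UTF-8
recovered = data.decode("utf-8", errors="replace")

print(same_bytes)
print(widths)
print(recovered)
```

The damage stays local: only the bytes belonging to the broken character turn into replacement characters, and the decoder picks up again at the next valid character, so the later "ö" still decodes correctly.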