Node's Unicode Dragon (cirw.in)
UTF-8 does not have this problem. That's the way we should be moving.
JS's treatment of strings is even more wacky than you might think -- it is neither really UCS-2 nor UTF-16. Engines are semi-required to use UTF-16 representations of strings internally, but the API surface that is exposed to the JS code makes them look like UCS-2 strings (i.e. no surrogate pairs). However, if you stick a JS string into something that is UTF-16 aware, such as a DOM node, then the surrogate pairs will display correctly.
See [1] for a very clear explanation of this muddy subject.
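I like the design of Python 3.3's string encoding. Latin-1 (which includes ASCII) takes 1 byte per character, the rest of the BMP takes 2 bytes, and everything else takes 4 bytes, with the width chosen per string by its widest character.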
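A rough way to see those three widths is to measure how much a string grows per added character. This is only a sketch: it relies on CPython 3.3 or later and on sys.getsizeof, so the numbers are implementation details rather than language guarantees, and the helper name bytes_per_char is made up for illustration.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate per-character storage by comparing two string lengths."""
    # The fixed object header cancels out in the subtraction.
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n


ascii_width = bytes_per_char("a")            # ASCII
latin1_width = bytes_per_char("\u00e9")      # Latin-1, outside ASCII
bmp_width = bytes_per_char("\u4e2d")         # BMP, outside Latin-1
astral_width = bytes_per_char("\U0001F409")  # outside the BMP (the dragon)
```

On CPython these come out as 1, 1, 2 and 4 bytes respectively.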
2. Interoperability with legacy systems that don't use UTF-8 (for example, JavaScript). For example, Rust needs support for the full range of string encodings, because we need that support for implementing a browser engine.
Also check out the bug report: https://code.google.com/p/v8/issues/detail?id=2875
I would argue that the UTF-8 corner cases are rarer because they are harder to produce accidentally, and also more serious because they have security implications.
If you need to accept arbitrary binary data, JSON is a profoundly bad choice. At a minimum, you would expect them to base64 encode the data and put that into a JSON string.
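For what that base64 approach looks like in practice, here is a minimal Python sketch. The field name "data" and the sample bytes are invented for the example; nothing here is taken from the service under discussion.

```python
import base64
import json

# Arbitrary bytes that are not valid UTF-8 (sample data for illustration).
raw = bytes([0x00, 0xFF, 0xED, 0xA0, 0xBD])

# Encode: bytes -> base64 ASCII text -> JSON string value.
payload = json.dumps({"data": base64.b64encode(raw).decode("ascii")})

# Decode: JSON string value -> base64 text -> original bytes.
restored = base64.b64decode(json.loads(payload)["data"])
```

The JSON document stays pure ASCII, so no Unicode validation step along the way can alter the bytes, at the cost of roughly a third more size.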
If you are looking at error reports, how is it even remotely acceptable to have them silently modified to include invalid unicode replacement characters?
The lesson here isn't some crappy hack workaround they found, it's a case study in the lengths you'll have to go to when you insist on making technology choices without considering the problem you want to solve.
I wonder... at some point, JavaScript could get a convenient literal syntax for creating pre-filled ArrayBuffers, which would basically be the format JSON would want to adopt. But would it? Are changes to JavaScript literal syntax folded into JSON, or is JSON now its own thing that doesn't track JS any more?
XML doesn't even allow escaped null bytes, so you're basically forced to use base64 or weird custom app-internal escapes.
JSON never tracked JavaScript. It has one version, period. But you could get people to adopt a superset with a new data type, if you kept it simple.
http://www.fileformat.info/info/unicode/char/1f409/index.htm
Also, since any ASCII dragon is also a valid Unicode dragon (in UTF-8, at least), the following might satisfy your needs:
To see this dragon, either:
1. Use Safari or Firefox on OS X.
2. Install custom fonts for Linux or Windows.
3. Install https://chrome.google.com/webstore/detail/chromoji-emoji-for... for Chrome
Also: didn't know that for every emoji there is https://en.wikipedia.org/wiki/🐉
If you want to check a string for valid encoding and/or replace bad bytes with replacement char on the _ruby_ end... it's not very obvious how you do that with the ruby stdlib api, and it takes a few tricks to do right.
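For comparison, the same two operations are short in Python. This is a sketch of the general idea only, not the Ruby stdlib API or the API of any gem; the function names are made up.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def scrub_utf8(data: bytes) -> str:
    """Decode as UTF-8, replacing undecodable bytes with U+FFFD."""
    return data.decode("utf-8", errors="replace")


good = "dragon \U0001F409".encode("utf-8")
bad = b"abc\xffdef"  # 0xFF can never appear in UTF-8
scrubbed = scrub_utf8(bad)
```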
So I wrote a gem for it: https://github.com/jrochkind/ensure_valid_encoding
So really, they were parsing the JSON as if it were UTF-16, but really it was UCS-2. How is that an error in Node?
On the other hand, if the offending bytes were blindly substituted into the JSON, then it's not surprising that there were decoding issues down the line...
> The exceptions that were crashing us were caused by people using String.prototype.substr. That function works perfectly on strings that only contain Unicode 1.0 data, but as soon as you're storing UTF-16 in your UCS-2 string there's a possibility that when you take a slice you'll split a valid surrogate pair into two invalid lonely surrogates.
To me, it seems like it'd be nearly impossible for somebody to trigger, but there's always Murphy's law...
Suppose you receive a long piece of text wrapped in JSON, unpack it into a JS String, then start processing it in fixed size chunks. If your source text contains any significant percentage of surrogate pair-represented characters, you'll eventually break one.
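The fixed-size chunk scenario can be simulated outside JavaScript by working on UTF-16 code units directly. The sketch below is in Python and all function names are invented for the example; safe_chunks shows one possible fix (letting a chunk grow by one unit), not what any particular library does.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text, as a JS string would hold them."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def is_high(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low(unit):
    return 0xDC00 <= unit <= 0xDFFF


def has_lone_surrogate(units):
    """True if any surrogate in units is missing its partner."""
    i = 0
    while i < len(units):
        if is_high(units[i]):
            if i + 1 < len(units) and is_low(units[i + 1]):
                i += 2
                continue
            return True
        if is_low(units[i]):
            return True
        i += 1
    return False


def naive_chunks(units, size):
    """Cut every `size` code units, like repeated substr calls."""
    return [units[i:i + size] for i in range(0, len(units), size)]


def safe_chunks(units, size):
    """Like naive_chunks, but extend a chunk by one unit rather than split a pair."""
    chunks, i = [], 0
    while i < len(units):
        end = min(i + size, len(units))
        if end < len(units) and is_high(units[end - 1]) and is_low(units[end]):
            end += 1
        chunks.append(units[i:end])
        i = end
    return chunks


units = utf16_units("ab\U0001F409cd")  # six code units: a, b, D83D, DC09, c, d
broken = [has_lone_surrogate(c) for c in naive_chunks(units, 3)]
intact = [has_lone_surrogate(c) for c in safe_chunks(units, 3)]
```

With a chunk size of 3 the naive split lands in the middle of the dragon's surrogate pair, so both halves contain a lone surrogate; the adjusted split does not.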
1. One of our customer's javascript apps sent a truncated string to their web-server in a JSON payload. This string ended with a leading surrogate (this is another instance of V8 bug discussed in the blog post).
2. Their ruby backend exploded when they tried to use a regular expression on the string (because ruby's regexp library is strict about valid utf-8).
3. The bugsnag exception notifier copied the bytes from the incoming parameter into the JSON exception notification payload (ruby didn't notice because its string library unconditionally believes you if you tell it a string is valid utf8; another bug :p).
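The chain above can be reproduced in miniature. The following Python sketch is an illustration under assumptions, not the actual bugsnag or Ruby code: the payload and the "msg" key are invented, and Python's json module stands in for the backend's parser.

```python
import json

# A JSON document whose string ends in an unpaired leading surrogate,
# as a truncated JavaScript string might produce.
wire = '{"msg": "truncated \\ud83d"}'

# Python's json module accepts the escape and yields a lone surrogate.
msg = json.loads(wire)["msg"]


def to_utf8_or_none(text):
    """Strict UTF-8 encoding; None if the text holds a lone surrogate."""
    try:
        return text.encode("utf-8")
    except UnicodeEncodeError:
        return None


def decodes_as_utf8(data):
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


strict = to_utf8_or_none(msg)                  # refused by a strict encoder
forced = msg.encode("utf-8", "surrogatepass")  # bytes a lenient layer would emit
```

The forced bytes end in ED A0 BD, which a strict UTF-8 decoder rejects. Copying such bytes into another payload unchecked passes the problem downstream.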
As for binary data in web services ... isn't it easier to just use Content-Type for that and use the appropriate type for the payload? That wouldn't require a textual data format that can contain arbitrary binary data.
(Still, you're right, I admit to having been able to avoid any work painful enough to teach me XML arcana. I was actually thinking of one of the many variants of "Binary XML" I had read about recently, and assumed the typing was bijective to XML's own types. In other news, BSON of all things has a raw-binary type.)
So far as I know, Haskell is the only other language that exposes, as its default-ish native interface, Unicode strings as a sequence or iterable of code points (by just using UTF-32). Java, C#, your-language-here all do code units. C++'s templates are powerful enough that someone could make unicode_str<encoding_to_store_as>, but I've not seen one.
See: http://www.unicode.org/glossary/#code_point http://www.unicode.org/glossary/#code_unit
Consider the problem of producing a valid substring from a Unicode string. It's important that you not split surrogate pairs, and it's true working with code points spares you from that particular problem. But it's also important that you not split combining marks, and zero width joiners, and Hangul syllables... (see http://www.unicode.org/reports/tr29/ for all the gory details).
An average programmer cannot correctly extract a substring from a Unicode string whether given the code units or the code points. These abstractions are inadequate: instead you want something like grapheme clusters.
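A small Python sketch of why code points alone are not enough. Python strings index by code point, yet a plain slice still separates a letter from its combining accent. The combining_clusters helper is a deliberately simplified stand-in written for this example: it only attaches combining marks to the preceding character and ignores zero width joiners, Hangul and the other TR29 rules, so it is not real grapheme cluster segmentation.

```python
import unicodedata


def combining_clusters(text):
    """Group each code point with the combining marks that follow it.

    Simplified illustration only; full grapheme segmentation needs TR29.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = "cafe\u0301"  # 'e' followed by COMBINING ACUTE ACCENT, 5 code points

naive = word[:4]                      # "cafe": the accent is silently dropped
clusters = combining_clusters(word)   # ["c", "a", "f", "e\u0301"]
safe = "".join(clusters[:4])          # keeps the accented letter whole
```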
As for the inflation issue, 50% is just the absolute worst case. Many kinds of textual data include large amounts of code units that fit in one byte in utf-8 and 2 bytes in utf-16. It tends to even out somewhat. And if you really want your data to be small, gzip will do a better job than either.
For latin alphabets, yes. For CJK, it's really bad. Things get worse if you dealt with non-BMP before, like iOS emoji, which force you to upgrade MySQL to support utf8mb4, which is totally bullshit. (why the hell do people even presume utf8 is max 3 bytes?)
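The numbers behind this exchange are easy to check. A short Python sketch with made-up sample strings, giving encoded sizes in bytes as (UTF-8, UTF-16):

```python
samples = {
    "latin": "hello",
    "cjk": "\u4e2d\u6587",    # two BMP characters
    "emoji": "\U0001F409",    # one character outside the BMP
}

# Byte length of each sample in UTF-8 and in UTF-16 (without a BOM).
sizes = {
    name: (len(text.encode("utf-8")), len(text.encode("utf-16-le")))
    for name, text in samples.items()
}
```

Here sizes is {"latin": (5, 10), "cjk": (6, 4), "emoji": (4, 4)}: ASCII halves in UTF-8, CJK text in the BMP grows by 50%, and anything outside the BMP needs 4 bytes in both, which is the case a 3-byte limit cannot store.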
One interesting conclusion from looking at the state of Twitter (http://blog.luminoso.com/2013/09/04/emoji-are-more-common-th...) is that CESU-8 is probably more common than real UTF-8.
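For readers who have not met CESU-8: it encodes a character outside the BMP as its two UTF-16 surrogates, each written as a 3-byte sequence, instead of one 4-byte UTF-8 sequence. A minimal Python sketch, with cesu8_encode being a hypothetical helper written for this example (Python ships no CESU-8 codec):

```python
def cesu8_encode(text):
    """Encode text as CESU-8: astral characters become two 3-byte surrogates."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            cp -= 0x10000
            high = chr(0xD800 + (cp >> 10))
            low = chr(0xDC00 + (cp & 0x3FF))
            # surrogatepass lets Python emit the 3-byte form of each surrogate.
            out += high.encode("utf-8", "surrogatepass")
            out += low.encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


dragon = "\U0001F409"
real_utf8 = dragon.encode("utf-8")   # F0 9F 90 89
cesu8 = cesu8_encode(dragon)         # ED A0 BD ED B0 89
```

The two encodings agree on everything in the BMP, which is why the mix-up tends to go unnoticed until emoji show up.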
Another fun thing I ran into today is that Python regular expressions allow astral characters, but you can't safely use them until 3.3 because narrow builds will quietly replace them with nonsense that doesn't run (https://github.com/LuminosoInsight/python-ftfy/commit/86aa65...). And the very reason this came up was in a workaround for a different bug in 3.3.
> ... MySQL ...
> why the hell do people even presume utf8 is max 3 bytes?
I think you answered your own question before you even asked.
Programmer time is at least two orders of magnitude more expensive than storage space or bandwidth for text.
At-rest storage is cheap. Memory is cheaper than it used to be, but CPU cache is not. At some point the text will have to cross the CPU where every byte still counts.
True, but
1. time is precious. For example, you waste 50% more time for a fulltext indexing scan because utf8 is longer.
2. Memory. If you can't hold text in a single machine, you have bigger issues (e.g. clustering algorithms, persistency, redundancy, etc.)
3. Network transfer. If you can save 50% in a db connection rtt, you save a lot.
It makes no sense to save BMP in 3 bytes anyway.
If you were assigned the task of indexing the UTF-8 worst case corpus, nothing would stop you from designing a custom internal encoding while enjoying the many technical advantages UTF-8 gives you in every other area. Internal details like compression are much easier to change than external interfaces, which must be coordinated (this is why JavaScript still has such painful Unicode support even though browsers handle almost everything well in markup).
How do I know GB is preferred? I'm going off of three things:
- According to Wikipedia (http://en.wikipedia.org/wiki/GB18030), software sold in China is legally required to support it.
- I was once given a Chinese ebook, which I had to figure out was in GB before I could read it. (And now, I know about chardet!)
- I worked with a Chinese programmer who accidentally committed files in GB, even though they were supposed to be in UTF-8.
And since the latest GB can in fact represent any Unicode code point, it's hard to see why it wouldn't be preferred indefinitely.
How so? AFAIK Shift-JIS is ASCII compatible just like UTF-8, and so are other double-byte encodings like BIG5 and GBK.
That's exactly how those UTF-X and UCS-Y encodings were invented, right?
The point is, this beast is called Unicode, how ironic.
But you know, there are other cases besides exchanging. Like I said, if your text data is mainly Latin you are good, but not so good if you are stuck with non-Latin BMP text.
For the classic double-byte languages, what's your total data size after compression? e.g. in the case of full-text search, enabling compression has been enough of a win that the 2/3-byte expansion hasn't been a challenge, particularly since the biggest UTF-8 drawback (inability to predict total byte string length) isn't an issue when working with a data structure which records the length of each record.