Node's Unicode Dragon

94 points by foobar2k 12 years ago | 63 comments

stormbrew 12 years ago |

Wish I'd known about this when I was pointing out in another HN thread how utf-16 is a terrible encoding for, among other reasons, pushing the corner case where you find out your encoding/decoding is broken to the very edge of likelihood. It's ridiculous that v8 doesn't properly support utf16, but it's to be expected I suppose.

UTF-8 does not have this problem. That's the way we should be moving.

ender7 12 years ago | |

This behavior is actually part of the ECMAScript standard [0], so it's unlikely that V8 (or any other conformant JS engine) would behave the way you (and many others) would want.

JS's treatment of strings is even more wacky than you might think -- it is neither really UCS-2 nor UTF-16. Engines are semi-required to use UTF-16 representations of strings internally, but the API surface that is exposed to the JS code makes them look like UCS-2 strings (i.e. no surrogate pairs). However, if you stick a JS string into something that is UTF-16 aware, such as a DOM node, then the surrogate pairs will display correctly.

See [1] for a very clear explanation of this muddy subject.

[0] http://www.ecma-international.org/ecma-262/5.1/#sec-8.4

[1] http://mathiasbynens.be/notes/javascript-encoding
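A minimal sketch of the UCS-2-looking API surface described above, assuming an engine with ES2015 support (which postdates this thread). The variable names are illustrative only.

```javascript
// U+1F409 DRAGON lies outside the BMP, so a JS string stores it as a surrogate pair.
const dragon = '\u{1F409}';

// The classic string API counts and indexes 16-bit code units.
const codeUnitLength = dragon.length;                   // two code units
const firstUnit = dragon.charCodeAt(0).toString(16);    // high (leading) surrogate
const secondUnit = dragon.charCodeAt(1).toString(16);   // low (trailing) surrogate

// The ES2015 additions are code-point aware.
const codePoint = dragon.codePointAt(0).toString(16);
const codePointLength = Array.from(dragon).length;      // iterates by code point

console.log({ codeUnitLength, firstUnit, secondUnit, codePoint, codePointLength });
```

The older methods (`length`, `charCodeAt`, `substr`) see two separate units, while the string iterator and `codePointAt` see one character.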

stormbrew 12 years ago | | |

That is all incredibly depressing.

sillysaurus2 12 years ago | |

This. Why doesn't everybody use UTF-8? Nobody seems to have any problems with UTF-8. It seems to work almost perfectly, and it's efficient.

est 12 years ago | | |

Because some of us are pissed that some BMP characters take 3 bytes in UTF-8; that's 50% more storage space wasted and 50% more time to read/write.

I like the design of Python 3.3 encoding. ASCII takes 1 byte, BMP takes 2 bytes, everything else 4 bytes.

http://www.python.org/dev/peps/pep-0393/
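For a rough sense of the size trade-off being argued here, a small sketch using Node's `Buffer.byteLength`. The three sample characters are my own picks, not from the thread.

```javascript
// Sample characters (chosen for illustration): ASCII, a BMP Han character, a non-BMP emoji.
const samples = ['A', '\u9f8d', '\u{1F409}'];

// Byte length of each sample under UTF-8 and UTF-16 (little endian, no BOM).
const sizes = samples.map((ch) => ({
  char: ch,
  utf8: Buffer.byteLength(ch, 'utf8'),
  utf16: Buffer.byteLength(ch, 'utf16le'),
}));

console.table(sizes);
```

ASCII is smaller in UTF-8, BMP characters above U+07FF are smaller in UTF-16, and non-BMP characters cost four bytes in both, so which encoding "wastes" space depends on the text.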

pcwalton 12 years ago | | |

1. Controversy over Han unification made Unicode adoption less universal than might have been hoped.

2. Interoperability with legacy systems that don't use UTF-8 (for example, JavaScript). For example, Rust needs support for the full range of string encodings, because we need that support for implementing a browser engine.

millstone 12 years ago | | |

Did you read the article? The problem occurs precisely because V8 mishandles UTF-8.

Also check out the bug report: https://code.google.com/p/v8/issues/detail?id=2875

ximeng 12 years ago | | |

A lot of Windows is UTF-16 or UCS-2, including Office, which forces their use for working with APIs or transferring data.

millstone 12 years ago | |

Why do you think that UTF-16's corner cases, by which you presumably mean surrogate pairs, are less likely than UTF-8's corner cases, like invalid code units and non-shortest forms?

I would argue that the UTF-8 corner cases are more rare because they are harder to produce accidentally, and also more serious because they have security implications.
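To make the UTF-8 corner cases concrete, here is a sketch that uses the standard `TextDecoder` in fatal mode as a validity check. `isValidUtf8` is a hypothetical helper name, and the byte sequences are my own examples.

```javascript
// Hypothetical helper: true if the bytes decode as strict UTF-8.
function isValidUtf8(bytes) {
  try {
    // fatal: true makes the decoder throw instead of emitting U+FFFD.
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch (err) {
    return false;
  }
}

const shortest = Uint8Array.of(0x2f);                     // '/' in its only valid form
const overlong = Uint8Array.of(0xc0, 0xaf);               // non-shortest form of '/'
const encodedSurrogate = Uint8Array.of(0xed, 0xa0, 0xbd); // U+D83D encoded as if it were a character

console.log(isValidUtf8(shortest), isValidUtf8(overlong), isValidUtf8(encodedSurrogate));
```

The overlong form is the security-relevant case: a decoder that accepts it can let a `/` slip past a filter that only looks for the byte 0x2F.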

justin_vanw 12 years ago |

Man, I'm starting to think there is a cult around JSON.

If you need to accept arbitrary binary data, JSON is a profoundly bad choice. At a minimum, you would expect them to base64 encode the data and put that into a JSON string.

If you are looking at error reports, how is it even remotely acceptable to have them silently modified to include invalid unicode replacement characters?

The lesson here isn't some crappy hack workaround they found, it's a case study in the lengths you'll have to go to when you insist on making technology choices without considering the problem you want to solve.
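The base64 suggestion above could look roughly like this in Node. The `data` field name and the sample bytes are assumptions for illustration, not anything from the article.

```javascript
// Arbitrary bytes that are not valid UTF-8 (includes a NUL and 0xFF).
const raw = Buffer.from([0xed, 0xa0, 0xbd, 0x00, 0xff]);

// Encode to base64 so the JSON document only ever contains ASCII.
const payload = JSON.stringify({ data: raw.toString('base64') });

// The receiver reverses the encoding and gets the exact bytes back.
const roundTripped = Buffer.from(JSON.parse(payload).data, 'base64');

console.log(payload, roundTripped.equals(raw));
```

The cost is roughly a one-third size increase, but nothing in transit can reinterpret or "repair" the bytes.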

derefr 12 years ago | |

Any wire-serialization format that wants to send arbitrary data should really have a "raw binary payload" type. XML has CDATA. ASN.1 has bitstrings. BERT has Binaries. But JSON doesn't really have anything like that.

I wonder... at some point, Javascript could get a convenient literal syntax for creating pre-filled ArrayBuffers, which would basically be the format JSON would want to adopt. But would it? Are changes to Javascript literal syntax folded into JSON, or is JSON now its own thing that doesn't track JS any more?

Dylan16807 12 years ago | | |

CDATA disallows null bytes, so it's even worse than non-support: illusory support

XML doesn't even allow escaped null bytes, so you're basically forced to use base64 or weird custom app-internal escapes.

JSON never tracked javascript. It has one version, period. But you could get people to adopt a superset with a new data type, if you kept it simple.

baddox 12 years ago |

Despite that being a rather interesting technical article, I am upset that my expectation of an actual Unicode depiction of a dragon was not met.

greenyoda 12 years ago | |

There is actually a Unicode dragon character at code point U+1F409:

http://www.fileformat.info/info/unicode/char/1f409/index.htm

Also, since any ASCII dragon is also a valid Unicode dragon (in UTF-8, at least), the following might satisfy your needs:

http://www.dougsartgallery.com/ascii-art-dragon.html

cirwin 12 years ago | |

🐉

To see this dragon, either:

1. Use Safari or Firefox on OS X.

2. Install custom fonts for Linux or Windows.

3. Install https://chrome.google.com/webstore/detail/chromoji-emoji-for... for Chrome

pavlov 12 years ago | | |

The dragon glyph is rendered correctly in IE10 on Windows 8 without any custom fonts. Hooray for the most underestimated browser ever ;)

Wilya 12 years ago | | |

Next time I have some "Here be dragons" code, I'm going to use this.

lelf 12 years ago | | |

There is also 🐲 U+1F432 DRAGON FACE

Also: didn't know that for every emoji there is https://en.wikipedia.org/wiki/🐉

nonchalance 12 years ago |

String encoding in general is a mess. Wait till you get to code pages. Incidentally, the largest JS script I've ever seen pertained to encoding and decoding characters under various codepages: https://raw.github.com/Niggler/js-codepage/master/cptable.js [github complains "(Sorry about that, but we can't show files that are this big right now.)"]

jrochkind1 12 years ago |

The OP describes an environment where data goes from node to Rails.

If you want to check a string for valid encoding and/or replace bad bytes with replacement char on the _ruby_ end... it's not very obvious how you do that with the ruby stdlib api, and it takes a few tricks to do right.

So I wrote a gem for it: https://github.com/jrochkind/ensure_valid_encoding

state 12 years ago |

Whew. This explains a bug from six months ago that drove me up the wall. I could never figure it out.

shawnz 12 years ago |

> Unfortunately for us, Javascript has never been updated to support UTF-16. Instead it continues to treat strings as UCS-2.

So really, they were parsing the JSON as if it were UTF-16, but really it was UCS-2. How is that an error in Node?

justincormack 12 years ago | |

JSON is defined as UTF-8, UTF-16, or UTF-32 [1]. The escaped characters are UTF-16, not UCS-2. It is unfortunate that JavaScript can't parse it correctly!

[1] http://www.ietf.org/rfc/rfc4627.txt
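A small sketch of what the `\u` escapes do in practice, as I understand the behavior of `JSON.parse` in current engines. Variable names are illustrative.

```javascript
// A non-BMP character written as two \u escapes (a UTF-16 surrogate pair).
const parsedPair = JSON.parse('"\\ud83d\\udc09"');

// A lone leading surrogate is still accepted by JSON.parse;
// the result is a one-unit string that is not well-formed UTF-16.
const parsedLone = JSON.parse('"\\ud83d"');

console.log(parsedPair === '\u{1F409}', parsedLone.length, parsedLone.charCodeAt(0).toString(16));
```

So the parser is permissive: it combines a valid pair into one character, but it does not reject half of a pair.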

kansface 12 years ago | | |

This is true of JSON, but it's not true of Javascript which gives no fucks about utf16 (or valid surrogate pairs). It's a very strange world where JSON and Javascript have incompatible interpretations of strings.

http://mathiasbynens.be/notes/javascript-encoding

kansface 12 years ago | |

They wanted to parse some bytes as utf-16, but are unable to do so because V8 only understands ucs2 (with invalid surrogate pairs). This is a major problem with node, i.e., it happily produces/consumes invalid unicode encoded strings.

dsj36 12 years ago |

how did the error JSON include the undecodable bytes? JSON strings are all unicode sequences, so there would have had to be some way that the raw bytes were mapped into codepoints.

on the other hand, if the offending bytes were blindly substituted into the JSON, then it's not surprising that there were decoding issues down the line...

jlarocco 12 years ago | |

From the article:

> The exceptions that were crashing us were caused by people using String.prototype.substr. That function works perfectly on strings that only contain Unicode 1.0 data, but as soon as you're storing UTF-16 in your UCS-2 string there's a possibility that when you take a slice you'll split a valid surrogate pair into two invalid lonely surrogates.

To me, it seems like it'd be nearly impossible for somebody to trigger, but there's always Murphy's law...

twoodfin 12 years ago | | |

These kinds of isolated surrogates are pretty easy to create if you're doing the right kind of processing on the right kind of data.

Suppose you receive a long piece of text wrapped in JSON, unpack it into a JS String, then start processing it in fixed size chunks. If your source text contains any significant percentage of surrogate pair-represented characters, you'll eventually break one.
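The chunking scenario can be reproduced in a few lines. This is a sketch with hypothetical helper names; the code-point-based variant is one possible fix, not necessarily what anyone in the article did.

```javascript
// Naive chunking: slices by 16-bit code units, so it can cut a surrogate pair in half.
function chunkByCodeUnits(text, size) {
  const chunks = [];
  for (let i = 0; i < text.length; i += size) chunks.push(text.slice(i, i + size));
  return chunks;
}

// Safer chunking: Array.from iterates by code point, so pairs stay together.
function chunkByCodePoints(text, size) {
  const points = Array.from(text);
  const chunks = [];
  for (let i = 0; i < points.length; i += size) chunks.push(points.slice(i, i + size).join(''));
  return chunks;
}

// True if the string contains a surrogate that is not part of a valid pair.
function hasLoneSurrogate(text) {
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      const next = text.charCodeAt(i + 1); // NaN at end of string
      if (!(next >= 0xdc00 && next <= 0xdfff)) return true;
      i++; // skip the low surrogate we just validated
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      return true; // low surrogate with no preceding high surrogate
    }
  }
  return false;
}

const text = 'ab\u{1F409}cd';
const naive = chunkByCodeUnits(text, 3);
const safe = chunkByCodePoints(text, 3);
console.log(naive.some(hasLoneSurrogate), safe.some(hasLoneSurrogate));
```

Note that the code-point version produces chunks of varying code-unit length, which matters if the chunk size was chosen to fit a byte or column limit.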

cirwin 12 years ago | |

In the example I looked at to debug this, the sequence of events was:

1. One of our customer's javascript apps sent a truncated string to their web-server in a JSON payload. This string ended with a leading surrogate (this is another instance of V8 bug discussed in the blog post).

2. Their ruby backend exploded when they tried to use a regular expression on the string (because ruby's regexp library is strict about valid utf-8).

3. The bugsnag exception notifier copied the bytes from the incoming parameter into the JSON exception notification payload (ruby didn't notice because its string library unconditionally believes you if you tell it a string is valid utf8 — another bug :p).
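Step 1 is easy to reproduce. The sketch below shows a truncation that ends on a leading surrogate, plus one hedged way to avoid it. `truncateWithoutSplitting` is a hypothetical helper, and the `TextEncoder` behavior shown is what the WHATWG Encoding standard specifies today; the 2013-era V8 behavior discussed in the post may have differed.

```javascript
// Hypothetical helper: truncate to at most maxUnits code units without splitting a pair.
function truncateWithoutSplitting(text, maxUnits) {
  if (text.length <= maxUnits) return text;
  let end = maxUnits;
  const last = text.charCodeAt(end - 1);
  const next = text.charCodeAt(end);
  const cutsPair = last >= 0xd800 && last <= 0xdbff && next >= 0xdc00 && next <= 0xdfff;
  if (cutsPair) end -= 1; // drop the orphaned leading surrogate
  return text.slice(0, end);
}

const message = 'here be \u{1F409}';  // 8 BMP characters plus one surrogate pair
const naive = message.slice(0, 9);    // ends with a lone leading surrogate
const careful = truncateWithoutSplitting(message, 9);

// Encoding the lone surrogate to UTF-8 cannot succeed; TextEncoder substitutes U+FFFD.
const naiveTailBytes = Array.from(new TextEncoder().encode(naive.slice(-1)));

console.log(JSON.stringify(careful), naiveTailBytes.map((b) => b.toString(16)));
```

Either outcome of encoding a lone surrogate loses information: the encoder has to substitute a replacement character or emit bytes that a strict consumer, like the Ruby regexp library in step 2, will reject.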

sujayakar 12 years ago | | |

ah yeah step 3 seems pretty bad -- cool that you found that bug!

scoopr 12 years ago |

This same problem manifests with Java as well, where some methods that claim to return UTF-8 on closer inspection actually return “modified UTF-8”, which is broken the same way. Notably I ran across this with the JNI function GetStringUTFChars, but you may also come across it in DataOutputStream's writeUTF etc.

bsaul 12 years ago |

Reminds me of a previous discussion about Go being more "mature" than node.js, where I said having someone like Pike on board gives you more than 30 years of "maturity". I'm pretty sure you wouldn't find that kind of leaky UTF encoding handling in Go.

ygra 12 years ago | |

Well, Node builds atop an established language, while Go is a new development. It's probably easier to build sane Unicode semantics into a new language than to change the JS spec.

pjscott 12 years ago | |

Since Rob Pike and Ken Thompson are the guys who came up with UTF-8, you'd expect them to write decent Unicode encoding for Go. It would be surprising if they didn't.

scott_karana 12 years ago |

Is it just me, or is the two-column layout a bit tricky for readability?

(1440x900)

oceanstone 12 years ago |

I can't believe NodeJS doesn't support Dragon symbols. This is a dealbreaker.