Emoji.length == 2

279 points by stanzheng 9 years ago | 159 comments

danbruc 9 years ago |

The Unicode standard describes in Annex 29 [1] how to properly split strings into grapheme clusters. And here [2] is a JavaScript implementation. This is a solved problem.

[1] http://www.unicode.org/reports/tr29/

[2] https://github.com/orling/grapheme-splitter

Joeri 9 years ago | |

This is most definitely not a solved problem, because graphemes (visual symbols) are a poor way to deal with unicode in the real world. Pretty much all systems either deal with the length in bytes (if they're old-style C), in code units / byte pairs (if they're UTF-16 based, like windows, java and javascript), or in unicode code points (if they're UTF-8 based, like every proper system should be). Dealing with the length in visual symbols is actually pretty much impossible in practice because databases won't let you define field lengths in graphemes.

The way things compose: bytes combine into code points (unicode numbers), and code points combine into graphemes (visual symbols). In UTF-16 for legacy compatibility reasons with UCS-2, code points decompose into code units (byte pairs), and high code points, which need a lot of bits to represent their number, need two code units (4 bytes) instead of one.

Java and JavaScript are UTF-16 based, so they measure length in code units and not code points. An emoji code point can be a low or high number depending on when it was added. Low numbers can be stored in two bytes, high numbers need four bytes. So an emoji can have length 1 or 2 in UTF-16. However, when moving to the database it will typically be stored in UTF-8, and the field length will be code points, not code units. So, that emoji will have a length of 1 regardless of whether it is low or high. You don't notice this as a problem because app-level field length checks will return a bigger number than what the database perceives, so no field length limits are exceeded.

There isn't any such thing as "characters" in code. In documentation when they say "characters" usually they mean bytes, code units or code points. Almost never do they mean graphemes, which is intuitively what people think they mean. The bottom line is two-fold: (A) always understand what is meant in documentation by "length in characters", because it almost never means the intuitive thing, and (B) don't try to use graphemes as your unit of length, it won't work in practice.
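To make those units concrete, here is a minimal JavaScript sketch (my own illustration, not from the article) that measures one string four ways. The sample string is an assumption, and `Intl.Segmenter` is only available in newer engines, so treat this as a sketch rather than something every runtime supports.

```javascript
// Sample string (an assumption for illustration): "e" + U+0301 combining acute, then U+1F4A9.
const sample = "e\u0301\u{1F4A9}";

// UTF-8 bytes: what a byte-oriented C API or a storage layer would count.
const utf8Bytes = new TextEncoder().encode(sample).length;

// UTF-16 code units: what String.prototype.length reports in JavaScript.
const utf16Units = sample.length;

// Code points: the string iterator walks code points, not code units.
const codePoints = [...sample].length;

// Grapheme clusters: Intl.Segmenter applies the UAX #29 segmentation rules.
const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
const graphemes = [...segmenter.segment(sample)].length;

console.log({ utf8Bytes, utf16Units, codePoints, graphemes });
```

All four numbers differ for this string, which is why "length in characters" needs a unit attached before it means anything.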

danbruc 9 years ago | | |

> This is most definitely not a solved problem, because graphemes (visual symbols) are a poor way to deal with unicode in the real world.

How do you think text editing controls work? Your cursor moves one grapheme cluster at a time, selections start and end at grapheme cluster boundaries, and pressing backspace once deletes the last grapheme cluster even if it took you several keystrokes to enter. Grapheme clusters are obviously useful and certainly not a poor way to deal with Unicode in the real world.

Sure, grapheme clusters are neither the most common way to talk about strings, nor are they the most useful one in all situations, but nobody claimed that. If you have to allocate storage, you of course use the size in bytes after encoding. If you translate between encodings, you may want to look at code points. The right tool for the job, and sometimes the right tool is grapheme clusters.
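The backspace behaviour described above can be sketched roughly like this in JavaScript. `deleteLastGrapheme` is a hypothetical helper name, and it assumes an engine that ships `Intl.Segmenter`; real text controls are more involved than this.

```javascript
// Hypothetical helper: remove the last user-perceived character from a string.
function deleteLastGrapheme(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  const clusters = Array.from(segmenter.segment(text), (s) => s.segment);
  clusters.pop(); // drop the final grapheme cluster, however many code units it spans
  return clusters.join("");
}

// "ab" followed by the flag built from regional indicators K + R (4 code units).
const withFlag = "ab\u{1F1F0}\u{1F1F7}";
console.log(deleteLastGrapheme(withFlag)); // "ab": the whole flag is removed

// Naive approach: removes one code unit and leaves a lone surrogate behind.
console.log(withFlag.slice(0, -1).length); // 5
```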

> There isn't any such thing as "characters" in code.

Sure, there is. Actually characters exist only in code, they are not used in any field dealing with written language besides computing. A character is the smallest unit of text a computer system can address.

piersadrian 9 years ago | | |

I would point you to Swift's implementation of its "Character" type. Swift string handling is a model for how programming languages should approach Unicode characters and their complex combinations. The standard interface into all Swift strings is its "Character" type, which works exclusively with grapheme clusters.

[1] https://developer.apple.com/library/content/documentation/Sw...

deathanatos 9 years ago | | |

A "code unit" exists in UTF-8 and UTF-32; they are not unique to UTF-16.[1] UTF-8's relationship with code points is approximately the same as UTF-16's, except that UTF-8 systems tend to understand code points better because if they didn't, things break a lot sooner, whereas they mostly work in UTF-16.

Your entire argument that graphemes are a poor way to deal with unicode seems to be that current programming languages don't use graphemes, instead dealing in a mix of code units or points. But the article here shows a number of cases where that does break down, and the person you're responding to clearly points out that, for the cases covered in the article, graphemes are the way to go (and he's correct).

Graphemes aren't always the correct method (and I don't think your parent was advocating that), just like code units or code points aren't always the right way to count. It's highly dependent on the problem at hand. The bigger issue is that programming languages make the default something that's often wrong, when they probably ought to force the programmer to choose, and so, most code ends up buggy. Worse, some languages, like JavaScript, provide no tooling within their standard library for some of the various common ways of needing to deal with Unicode, such as code points.

[1]: http://unicode.org/glossary/#code_unit

Manishearth 9 years ago | | |

> There isn't any such thing as "characters" in code. In documentation when they say "characters" usually they mean bytes, code units or code points. Almost never do they mean graphemes, which is intuitively what people think they mean. The bottom line is two-fold: (A) always understand what is meant in documentation by "length in characters",

This is because languages usually have a built in char type.

> don't try to use graphemes as your unit of length, it won't work in practice.

Swift does this and it's a really good thing. Everything is in graphemes by default -- char segmentation, indexing, length, etc.

There are way too many problems caused by programmers interpreting "code point" as a segmentable unit of text and breaking so many other scripts, not to mention emoji.

deathanatos 9 years ago | |

> This is a solved problem.

Not … really. Yes, we "know" the solution, but the terrible APIs that make up so many languages' standard string types goad the programmer into choosing the wrong method or type.

JavaScript has — to an extent — the excuse of age. But the language still really (to my knowledge) lacks an effective way to deal with text that doesn't involve dragging in third-party libraries. You are not a high-level language if your standard library struggles with Unicode. Even recent additions to the language, such as the recent inclusion of leftPad, ignore Unicode (and, in that particular example, render the function mostly useless).
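A small sketch of the padding problem, assuming the "leftPad" addition refers to `String.prototype.padStart` (the sample strings are mine). Both the target width and the pad string are measured in UTF-16 code units:

```javascript
// U+1F4A9 is one code point but two UTF-16 code units.
const poo = "\u{1F4A9}";

// Asking for a width of 3 adds a single pad character, because the emoji already counts as 2.
const padded = poo.padStart(3, "*");
console.log([...padded].length); // 2 code points, not 3

// A pad string made of astral characters can be cut in the middle of a surrogate pair.
const cut = "ab".padStart(3, poo);
console.log(cut.length, cut.charCodeAt(0).toString(16)); // 3 "d83d": a lone high surrogate
```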

paulddraper 9 years ago | | |

> You are not a high-level language if your standard library struggles with Unicode

So C++, Lisp, Java, Python, Ruby, PHP, and JS are not high-level languages.

HN teaches me something new every day.

danbruc 9 years ago | | |

That is what I meant: there is an existing algorithm to do this, and I mentioned it because the author tried to come up with one. That JavaScript fails to provide an implementation is too bad, but this is of course a problem one may have to solve in any language.

And while other languages provide the necessary support at the language or standard library level, I would guess there are quite a few developers out there that are not even aware that they are looking for enumerating grapheme clusters. But now some more know and if they made a good language choice, it is now a solved problem for them.

libeclipse 9 years ago | | |

I didn't know perfect unicode support in the stdlib was a requirement for being a high-level language.

newtang 9 years ago | |

I'm the author. Thank you for sharing! I will check it out shortly.

ggchappell 9 years ago | |

> And here [2] is a JavaScript implementation.

Is it up to date?

The last commit to this repo was on July 16, 2015, and the code says it conforms to the 8.0 standard. But Unicode 9.0 came out in June 2016. The document in your link [1] indicates that there were changes in the text-segmentation rules in the 9.0 release. However, I can't say whether any of these affect the correctness of the code.

darkengine 9 years ago |

The thing that frustrates me the most about Unicode emoji is the astounding number of combining characters. For combining characters in written languages, you can do an NFC normalization and, with moderate success, get a 1 codepoint = 1 grapheme mapping, but "Emoji 2.0" introduced some ridiculous emoji compositions with the ZWJ character.

To use the author's example:

woman - 1 codepoint

black woman - 2 codepoints, woman + dark Fitzpatrick modifier

woman kissing woman - 7 codepoints, woman + ZWJ + heart + ZWJ + kissy lips + ZWJ + woman

It's like composing Mayan pictographs, except you have to include an invisible character in between each component.

Here's another fun one: country flags. Unicode has special characters 🇱 🇮 🇰 🇪 🇹 🇭 🇮 🇸 that you can combine into country codes to create a flag. 🇰+🇷 = 🇰🇷

edit: looks like HN strips emoji? Changed the emoji in the example into English words. They are all supposed to render as a single "character".
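Since the emoji themselves don't survive here, this is a rough JavaScript sketch that builds the same sequences from escapes and counts their code points. The constant names and the `flag` helper are hypothetical, and the kiss sequence follows the seven components listed above.

```javascript
const WOMAN = "\u{1F469}";
const DARK_SKIN = "\u{1F3FF}"; // Fitzpatrick type-6 modifier
const ZWJ = "\u200D";          // zero width joiner
const HEART = "\u2764";
const KISS_MARK = "\u{1F48B}";

const countCodePoints = (s) => [...s].length;

const darkWoman = WOMAN + DARK_SKIN;
const kiss = [WOMAN, HEART, KISS_MARK, WOMAN].join(ZWJ);

// Hypothetical helper: map a two-letter country code to regional indicator symbols.
function flag(countryCode) {
  const base = 0x1F1E6 - "A".charCodeAt(0); // U+1F1E6 is REGIONAL INDICATOR SYMBOL LETTER A
  return String.fromCodePoint(
    ...[...countryCode.toUpperCase()].map((c) => base + c.charCodeAt(0))
  );
}

console.log(countCodePoints(WOMAN), countCodePoints(darkWoman), countCodePoints(kiss)); // 1 2 7
console.log(kiss.length);                           // 10 UTF-16 code units
console.log(flag("KR") === "\u{1F1F0}\u{1F1F7}");   // true
```

As far as I know, the fully qualified form of the kiss sequence also puts U+FE0F after the heart, which would make it 8 code points; I left it out to match the count above.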

Animats 9 years ago |

Before emoji, fonts and colors were independent. Combining the two creates a mess. Try using emoji in an editor with syntax coloring. We got into this because some people thought that single-color emoji were racist.[1] So now there are five skin tone options. The no-option case is usually rendered as bright yellow, which comes from the old AOL client. They got it from the happy-face icon of the 1970s.

Here's the current list of valid emoji, including upcoming ones being added in the next revision.[2]

A reasonable test for passwords is to run them through an IDNA checker, which checks whether a string is acceptable as a domain name component. This catches most weird stuff, such as mixed left-to-right and right-to-left symbols, zero-width markers, homoglyphs, and emoji.

[1] https://www.washingtonpost.com/news/the-intersect/wp/2015/02... [2] http://unicode.org/emoji/charts-beta/full-emoji-list.html

kmill 9 years ago |

There are multiple ways of counting "length" of a string. Number of UTF-8 bytes, number of UTF-16 code units, number of codepoints, number of grapheme clusters. These are all distinct yet valid concepts of "length."

For the purpose of allocating buffers, I can see the obvious use in knowing number of bytes, UTF-16 code units, or the number of codepoints. I also see the use in being able to iterate through grapheme clusters, for instance for rendering a fragment of text, or for parsing. Perhaps someone can shed light on a compelling use case for knowing the number of grapheme clusters in a particular string, because I haven't been able to think of one.

I'm not sure about calculating password lengths: if the point is entropy, the number of bytes seems good enough to me!

The password field bug is possibly compelling, but I don't think it's obvious what a password field should do. Should it represent keystrokes? Codepoints? Grapheme clusters? Ligatures? Replace all the glyphs with bullets during font rendering?

(Similarly, perhaps someone could explain why they think reversing a string should be a sensible operation. That this is hard to do is something I occasionally hear echoing around the internet. The best I've heard is that you can reuse the default forward lexicographic ordering on reversed strings for a use I've forgotten.)

TorKlingberg 9 years ago |

If you want to do Unicode correctly, you shouldn't ask for the "length" of a string. There is no true definition of length. If you want to know how many bytes it uses in storage, ask for that. If you want to know how wide it will be on the screen, ask for that. Do not iterate over strings character by character.

fryguy 9 years ago | |

How many dots/stars should one display for a password? That's a question that can't be answered by your two valid questions. Are you suggesting that dots/stars shouldn't be displayed for passwords, since you can't ask how many "characters" it is?

slededit 9 years ago | | |

You could divide the rendered width of the string by the width of the '*' character in a monospaced font. It doesn't really make sense for a combining or other invisible character to get its own asterisk.

toast0 9 years ago | | |

If you have an entry indicator, it should probably be about the same width as the entered text; or if you're concerned about leaking precise length information for fields that aren't monospaced, you could add a dot each time the rendered text would increase in width.

deathanatos 9 years ago | |

I'd avoid the term "character", but I'd argue there are valid reasons to consume a Unicode string grapheme by grapheme. For example, a regex engine trying to match "e + combining_acute_accent" wants to match both the pre-combined version and the version that uses combining characters.

The main thrust of your point — that "length" without clarification of what measure of length is meaningless — I agree with.
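A minimal sketch of the "e + combining acute" case in JavaScript. It sidesteps a grapheme-aware regex engine and just normalizes both sides to NFC before comparing; `canonicallyEqual` is a hypothetical helper name and the sample words are mine.

```javascript
const precomposed = "caf\u00E9"; // é as a single code point
const decomposed = "cafe\u0301"; // e followed by U+0301 COMBINING ACUTE ACCENT

console.log(precomposed === decomposed); // false: different code point sequences

// Hypothetical helper: compare under canonical equivalence by normalizing both sides to NFC.
const canonicallyEqual = (a, b) => a.normalize("NFC") === b.normalize("NFC");
console.log(canonicallyEqual(precomposed, decomposed)); // true

// A regex written against the precomposed form matches once the input is normalized.
console.log(/caf\u00E9/.test(decomposed));                  // false
console.log(/caf\u00E9/.test(decomposed.normalize("NFC"))); // true
```

Normalization only helps where a precomposed form exists, so it is not a general substitute for matching on grapheme clusters.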

chungy 9 years ago |

> The current largest codepoint? Why that would be a cheese wedge at U+1F9C0. How did we ever communicate before this?

Sounds cute, but inaccurate.

If we count the last two planes that are reserved for private use (aka, applications/users can use them for whatever domain problems they like), that would be U+10FFFD.

If we count the variation selector codepoints (used for things like changing skin tone, or the look of certain other characters), U+E01EF.

If we count the last honestly-for-real-written-language character assigned, it would be 𪘀 U+2FA1D CJK COMPATIBILITY IDEOGRAPH-2FA1D.

But I suppose none of that sounds as fun as an emoji (which are really a very small part of the Unicode standard).
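For reference, a quick JavaScript sketch of where those code points sit. `planeOf` and `MAX_CODE_POINT` are hypothetical names; the only assumption is that each plane spans 0x10000 code points.

```javascript
const MAX_CODE_POINT = 0x10FFFF; // upper bound of the Unicode code space

// Each plane holds 0x10000 code points, so the plane number is the value shifted right by 16 bits.
const planeOf = (codePoint) => codePoint >> 16;

console.log(planeOf(0x1F9C0));  // 1  (cheese wedge)
console.log(planeOf(0x2FA1D));  // 2
console.log(planeOf(0xE01EF));  // 14
console.log(planeOf(0x10FFFD)); // 16

// Values past the end of plane 16 are rejected outright.
try {
  String.fromCodePoint(MAX_CODE_POINT + 1);
} catch (err) {
  console.log(err instanceof RangeError); // true
}
```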

rspeer 9 years ago | |

I tried to look up what U+2FA1D, the highest-numbered printable character, means in context.

It is a Traditional Chinese character. It's a variant of U+2A600, 𪘀, which is pronounced "pián". It apparently is used in zero words. It's in Unicode because it's listed in the 7th section of TCA-CNS 11643-1992, a Taiwanese computing standard.

Searching for it gives lots of sites that acknowledge that it's a character that exists and then provide no definition for it.

My guess: it occurred in someone's name at some point. Pretty strange that it ended up requiring a compatibility mapping, though, when nobody seems to use the character or the character it's mapped to!

newtang 9 years ago | |

You're right, thank you! I'll add an edit.

zach417 9 years ago |

Tom Scott did a nice YouTube video related to this: https://www.youtube.com/watch?v=sTzp76JXsoY

teknologist 9 years ago |

This appears to be a rehash of what Mathias Bynens was talking about a few years ago.

http://vimeo.com/76597193

https://mathiasbynens.be/notes/javascript-unicode

ge0rg 9 years ago |

I've gone through exactly the same discovery process when implementing faux stamps (something between images and Emoji) in my xmpp app yesterday.

My idea was to increase the font size of a message that only consists of Emoji, depending on the number of Emoji in the message, like this:

https://xmpp.pix-art.de/imagehost/display/file/2017-03-09_09...

The code turned out more complex than first expected, mirroring the same problems OP encountered:

https://github.com/ge0rg/yaxim/blob/gradle/src/org/yaxim/and...

kalleboo 9 years ago | |

I'm working on a project that has to handle special rendering of emoji as well, and I simply ask the system "will this string render in the emoji font" and "how big of a rect do I need to render this string" to calculate the same thing, rather than trying to handle it myself and relying on assumptions about the sizing of the emoji. I figure this way I also future proof against whatever emoji they think up in the future.

mhils 9 years ago |

The Zero-Width-Joiner allows for some really strange things: https://blog.emojipedia.org/ninja-cat-the-windows-only-emoji....

One can basically achieve an unlimited number of emojis by concatenating the current ones.

joeblau 9 years ago |

I ran into this 2 years ago on Swift when I was creating an emojified version of Twitter. I wanted to ensure that each message sent had at least 1 emoji and I quickly realized that validating a string with 1 emoji was not as simple as:

  if (lastString.characters.count == 2) {
     // pseudo code to allow string and activate send button
  }

This was the app I was working on [1]; code is finished, but I'm not launching it (probably ever). The whole emoji length piece was quite frustrating because my assumption of character counting went right out of the window when I had people testing the app in Test Flight.

[1] - https://joeblau.com/emo/

Manishearth 9 years ago | |

Actually, this is just due to Swift not implementing Unicode 9's version of UAX 29 (which had just come out at the time). Swift should handle it correctly, but it's lagging behind in unicode 9 support. In general a "character" in a string is a grapheme cluster, and most visually-single emoji are single grapheme clusters. The exception is stuff like the sequence at [1]. That should render as a male judge (I don't think there's font support for it yet) according to the spec, and it should be a single grapheme cluster, but the spec has what I consider a mistake in it where it isn't considered to be one. I've filed a bug about this, since the emoji-zwj-sequences file lists it as a valid zwj sequence, but applying the spec to the sequence gives two grapheme clusters.

There's active work now for Unicode 9 support in Swift. Since string handling is heavily dependent on this algorithm (they have a unicode trie and all for optimization!) it's trickier than just rewriting the algorithm.

But, in general, you should be able to trust Swift to do the right thing here, barring bugs like "not up to date with the spec". Swift is great like that.

[1]: https://r12a.github.io/uniview/?charlist=%F0%9F%91%A8%F0%9F%...

hwc 9 years ago |

How can that entire article never mention the term UTF-16?

Retr0spectrum 9 years ago | |

Why should it? Other than for explaining why the abomination of surrogate pairs came into existence.

tantalor 9 years ago |

> I have no idea if there’s a good reason for the name “astral plane.” Sometimes, I think people come up with these names just to add excitement to their lives.

https://en.wikipedia.org/wiki/Plane_(esotericism)#The_Planes

openasocket 9 years ago |

The issue doesn't really seem to be the emojis, but rather the variation sequences, which seem to be really awkward to work with, but I can sort of see why they're necessary. But the fact that we need special libraries to answer fairly basic queries about unicode text doesn't bode well.

masklinn 9 years ago | |

> But the fact that we need special libraries to answer fairly basic queries about unicode text doesn't bode well.

That's always been needed to actually properly work with unicode, what do you think ICU is? Few if any languages have complete native Unicode support. And it's hardly new, Unicode has an annex (#29) dedicated to text segmentation: http://www.unicode.org/reports/tr29/

codezero 9 years ago |

I see your 2 and raise you 2:

"(this is a color-hued hand from Apple that doesn't render on HN)".length == 4

I ran into the length==2 bug when truncating some text, it led to errors trying to url encode a string :)

The author's `fancyCount2` still returns a size of 2 for these kinds of emoji, but I'm not too surprised.
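Roughly what that looks like in JavaScript. The exact emoji didn't render, so the waving hand plus a medium skin tone modifier is an assumed stand-in:

```javascript
// Assumed stand-in for the emoji that did not render: waving hand + medium skin tone modifier.
const hand = "\u{1F44B}\u{1F3FD}";
console.log(hand.length); // 4 UTF-16 code units

// Truncating by code units can split a surrogate pair...
const truncated = hand.slice(0, 3);

// ...and a lone surrogate cannot be UTF-8 encoded, so encodeURIComponent throws.
try {
  encodeURIComponent(truncated);
} catch (err) {
  console.log(err instanceof URIError); // true
}

// Truncating on code point boundaries keeps the result encodable.
const safe = [...hand].slice(0, 1).join("");
console.log(encodeURIComponent(safe)); // %F0%9F%91%8B
```

Cutting on code point boundaries avoids the exception, but it still drops the skin tone modifier, so a grapheme-aware cut would be needed to keep the emoji intact.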

sorenjan 9 years ago |

I think the article "A Programmer's Introduction to Unicode" that was shared here recently is a good read and explains Unicode well.

https://news.ycombinator.com/item?id=13790575

pc2g4d 9 years ago |

Just ran into this yesterday when I discovered that an emoji character wouldn't fit into Rust's `char` type. I just changed the type to `&'static str` but I still wish there was a single `grapheme` type or something like that.

gtrubetskoy 9 years ago |

In Go:

  func main() {
      shit := "\U0001f4a9"
      fmt.Printf("len of %s is %d\n", shit, utf8.RuneCountInString(shit))
  }

$ len of 💩 is 1

Though I can't say that this is all that intuitive either...

geocar 9 years ago | |

Codepoints still aren't the same as characters.

Consider the examples given about combining emoji; or consider two runes that make one character: e and ◌́

remx 9 years ago |

Just going to leave this link here: https://mathiasbynens.be/notes/javascript-unicode

Traubenfuchs 9 years ago |

If this interests you, read the source of Java's AbstractStringBuilder.reverse(). It's interesting and very short. I am not sure it can deal with multi-emoji-emoji though.
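For comparison, a rough JavaScript sketch of the same idea (my own code, not a port of the Java source). Both function names are hypothetical. Reversing by code point keeps surrogate pairs together, but combining marks still end up on the wrong base character:

```javascript
// Reverse by UTF-16 code unit: splits surrogate pairs apart.
const naiveReverse = (s) => s.split("").reverse().join("");

// Reverse by code point: the string iterator keeps each surrogate pair together.
const codePointReverse = (s) => [...s].reverse().join("");

const text = "a\u{1F4A9}";
console.log(naiveReverse(text) === "\uDCA9\uD83Da");   // true: the pair is now in the wrong order
console.log(codePointReverse(text) === "\u{1F4A9}a");  // true: the emoji survives

// Combining sequences still break: the accent moves from "e" onto "x".
console.log(codePointReverse("e\u0301x") === "x\u0301e"); // true
```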

xem 9 years ago |

Here are my 2 cents: you can decompose a Unicode string with the ES6 spread operator:

[..."(insert 5 poo emoji here)"].length === 5

[..."(insert 5 poo emoji here)"][1] === "(poo emoji)"
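Spelled out with escapes so it survives the emoji stripping, plus two cases where the spread result no longer matches what you see on screen (a sketch, the sample sequences are mine):

```javascript
const fivePoo = "\u{1F4A9}".repeat(5);
console.log(fivePoo.length);      // 10 code units
console.log([...fivePoo].length); // 5 code points

// Spread stops being "one element per visible symbol" once sequences are involved.
const family = ["\u{1F468}", "\u{1F469}", "\u{1F467}"].join("\u200D"); // man ZWJ woman ZWJ girl
console.log([...family].length);    // 5: three emoji plus two joiners
console.log([..."e\u0301"].length); // 2: base letter plus combining accent
```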

lsv1 9 years ago |

As a developer dealing with the encoding of user input made in UTF-8 into legacy systems that only support ASCII... I prefer this.

rsmets 9 years ago |

(U+200B), zero width space, should be outlawed... got me good a couple years ago! Had to do a hexdump to see what was going on.

beaugunderson 9 years ago |

lodash's toArray and split both support emoji, with good unit tests. I also wrote emoji-aware for this purpose:

https://www.npmjs.com/package/emoji-aware

nutbutter 9 years ago |

The golf course flag equals one, obviously, because it's a hole-in-one. :)

jtymann 9 years ago |

Makes me wonder whether or not that should be considered a bug.

Manishearth 9 years ago | |

I'm sure all browser designers out there would love it if we could switch JS over to UTF8, or in general have any system where JS uses a well formed encoding when it comes to unicode. We can't, because of backwards compatibility.

TheRealPomax 9 years ago |

but the real question is why he needed password length constraints instead of password strength constraints...

marichards 9 years ago |

create table twitter(tweet varchar(? ... that's it, I give up, time to become an Uber driver

wcummings 9 years ago |

> Sometimes, I think people come up with these names just to add excitement to their lives.

Let's get outta here guys, we've been rumbled!

phkahler 9 years ago |

Unicode is fucked. All these bullshit emojis remind me of the 1980s when ASCII was 7 bits but every computer manufacturer (Atari, Commodore, Apple, IBM, TI, etc...) made their own set of characters for the 128 values of a byte beyond ASCII. Of course Unicode is a global standard so your pile-of-poop emoji will still be a pile-of-poop on every device even if the amount of steam is different for some people.

It's beyond me why this is happening. Who decides which bullshit symbols get into the standard anyway?

masklinn 9 years ago | |

> Unicode is fucked. All these bullshit emojis

Ah yes, all those bloody emoji taking the place of better worthier characters, those dastardly pictures taking up all of one half of one 16th of one Unicode plane (and Unicode has only 17 of those, of which only 15 are public).

And the gall they have, actually being used and lighting up their section of plane 1 like a christmas tree while the rest of the plane lies in the darkness: http://reedbeta.com/blog/programmers-intro-to-unicode/heatma... what a disgrace, not only existing but being found useful, what has the world come to.

And then of course there's the technical side of things: emoji actually forced western developers — and especially anglo ones — to stop fucking up non-ASCII let alone non-BMP codepoints. I don't think it's a coincidence that MySQL finally added support for astral characters once emoji started getting prominent.

In fact, I have a pet theory that the rash of combining emoji in the latest revisions is in part a vehicle to teach developers to finally stop fucking up text segmentation and stop assuming every codepoint is a grapheme cluster.

raphlinus 9 years ago | |

Meet the shadowy overlords who approve emojis[0]

[0] http://www.latimes.com/business/technology/la-fi-tn-emoji-q-...

sdegutis 9 years ago | |

Language is inherently complex, there's no way to solve this in any "cleaner" way than what we already came up with. Unfortunately the best way forward is to build up what we already have and cover all the warts with wrapper functions/libraries.

PeterisP 9 years ago | | |

Well, there is one way, we can simplify and standardize format of language. Unfortunately that requires generations of "reeducation", so it's not a viable solution in the short term - but it does seem possible that this is where languages are going in the next few centuries, as globalization, easier travel and more interrelated communities are likely to result in slow, gradual convergence to fewer languages as many of the current 6000+ languages cease to be used in practice.

phkahler 9 years ago | | |

And where are the sex emoji? The dirtiest thing I've been able to text is a heart and a pair of handcuffs ;-)

carapace 9 years ago | |

Unicode is a conflation of two ideas, one good and the other impossible.

The good idea is to have a standard mapping from numbers to little pictures (glyphs, symbols, kanji, ideograms, cuneiform pokings in dried clay, scratches on a rock, whatever.) This is really all ASCII was.

The impossible idea is to encode human languages into bits. This can't be done and will only continue to cause heartache in those who try.

ASCII had English letters but wasn't an encoding for English, although you can and everyone did and does use it for that.

Manishearth 9 years ago | | |

I hate this argument every time I see it because it's invariably used in the wrong place.

Yes, the goal of encoding all human languages into bits is one that's near impossible. Unicode tries, and has broken half-solutions in many places. Lots of heartache everywhere.

This is completely irrelevant to the discussion here. The issue of code points not always mapping to graphemes is only an issue because programmers ignore it. It's a completely solved problem, theoretically speaking. It's necessary to be able to handle many scripts, but it's not something that "breaks" unicode.

XaspR8d 9 years ago | |

If anything, their adaptability gives me confidence. They have little power to stop vendors from creating new emojis that are morphologically distinct from existing ones, so they might as well wrangle them into a standard.

TAForObvReasons 9 years ago | |

There is a Unicode encoding "UTF-32" which has the advantage of being fixed width. This is not popular for the obvious reason that even ascii characters are expanded to 4 bytes. Additionally the windows APIs, among other interfaces, are not equipped to handle 4-byte codepages.

Manishearth 9 years ago | | |

Being fixed width is not an advantage. Code points aren't a very useful unit of text outside of the implementation of algorithms defined by unicode. All of these algorithms generally require iteration anyway. O(1) code point indexing is nearly useless.

http://manishearth.github.io/blog/2017/01/14/stop-ascribing-...

raphlinus 9 years ago | | |

It's fixed width with respect to code points, but not with respect to any of the other things mentioned in the linked article. For example, the black heart with emoji variation selector (which makes it render red) is two code points.
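A rough sketch of that in JavaScript. JS has no built-in UTF-32 string type, so `toUtf32` is a hypothetical helper that stores one 32-bit value per code point; the grapheme count assumes an engine with `Intl.Segmenter`.

```javascript
// Hypothetical helper: "encode" a string as UTF-32 by storing one 32-bit value per code point.
const toUtf32 = (s) => Uint32Array.from([...s], (ch) => ch.codePointAt(0));

const redHeart = "\u2764\uFE0F"; // HEAVY BLACK HEART + VARIATION SELECTOR-16
const utf32 = toUtf32(redHeart);
console.log(utf32.length, utf32.byteLength); // 2 code points, 8 bytes

// The same text is a single grapheme cluster under the UAX #29 rules.
const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
console.log([...segmenter.segment(redHeart)].length); // 1
```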

marcosdumay 9 years ago | | |

> "UTF-32" which has the advantage of being fixed width

It's fixed width for now. It can not hold all the current available code-points, so it will probably have the same fate as UTF-16 (but it will probably take a long time).

I'd stay away from it.

  0149          ; Deprecated # L&       LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
  0673          ; Deprecated # Lo       ARABIC LETTER ALEF WITH WAVY HAMZA BELOW
  0F77          ; Deprecated # Mn       TIBETAN VOWEL SIGN VOCALIC RR
  0F79          ; Deprecated # Mn       TIBETAN VOWEL SIGN VOCALIC LL
  17A3..17A4    ; Deprecated # Lo   [2] KHMER INDEPENDENT VOWEL QAQ..KHMER INDEPENDENT VOWEL QAA
  206A..206F    ; Deprecated # Cf   [6] INHIBIT SYMMETRIC SWAPPING..NOMINAL DIGIT SHAPES
  2329          ; Deprecated # Ps       LEFT-POINTING ANGLE BRACKET
  232A          ; Deprecated # Pe       RIGHT-POINTING ANGLE BRACKET
  E0001         ; Deprecated # Cf       LANGUAGE TAG