Emoji.length == 2(blog.jonnew.com) |
Emoji.length == 2(blog.jonnew.com) |
Let's get outta here guys, we've been rumbled!
It's beyond me why this is happening. Who decides which bullshit symbols get into the standard anyway?
Ah yes, all those bloody emoji taking the place of better worthier characters, those dastardly pictures taking up all of one half of one 16th of one Unicode plane (which has only 16 of those, and only 14 public).
And the gall they have, actually being used and lighting up their section of plane 1 like a christmas tree while the rest of the plane lies in the darkness: http://reedbeta.com/blog/programmers-intro-to-unicode/heatma... what a disgrace, not only existing but being found useful, what has the world come to.
And then of course there's the technical side of things: emoji actually forced western developers — and especially anglo ones — to stop fucking up non-ASCII let alone non-BMP codepoints. I don't think it's a coincidence that MySQL finally added support for astral characters once emoji started getting prominent.
In fact, I have a pet theory that the rash of combining emoji in the latest revisions is in part a vehicle to teach developers to finally stop fucking up text segmentation and stop assuming every codepoint is a grapheme cluster.
[0] http://www.latimes.com/business/technology/la-fi-tn-emoji-q-...
The good idea is to have a standard mapping from numbers to little pictures (glyphs, symbols, kanji, ideograms, cuneiform pokings in dried clay, scratches on a rock, whatever.) This is really all ASCII was.
The impossible idea is to encode human languages into bits. This can't be done and will only continue to cause heartache in those who try.
ASCII had English letters but wasn't an encoding for English, although you can and everyone did and does use it for that.
Yes, the goal of encoding all human languages into bits is one that's near impossible. Unicode tries, and has broken half-solutions in many places. Lots of heartache everywhere.
This is completely irrelevant to the discussion here. The issue of code points not always mapping to graphemes is only an issue because programmers ignore it. It's a completely solved problem, theoretically speaking. It's necessary to be able to handle many scripts, but it's not something that "breaks" unicode.
http://manishearth.github.io/blog/2017/01/14/stop-ascribing-...
It's fixed width for now. It can not hold all the current available code-points, so it will probably have the same fate as UTF-16 (but it will probably take a long time).
I'd stay away from it.
Note that, at present, only 4 of the 17 planes have defined characters (Planes 0, 1, 2, and 14), two are reserved for private use (15 and 16), and an additional is unused but is thought to be needed (Plane 3, the TIP for historic Chinese script predecessors). Four planes appear to be sufficient to support every script ever written on Earth, as it's doubtful there are unidentified scripts with an ideographic repertoire as massive as the Unified CJK ideographs database.
We are very unlikely to ever fill up the current space of Unicode, let alone the plausible maximum space permissible by UTF-8, let alone the plausible maximum space permissible by UTF-32.
lol.
Unicode was ambitious for its time, but naive. Today we know better. It "jumped the shark" when the pizza slice showed up and has only been getting stupider since. Eventually it will go the way of XML (yes, I know XML hasn't gone anywhere, shut up) and we will be using some JSON hottness (forgive the labored metaphor please!) that probably consist of a wad of per-language standards and ML/AI/NLP stuff, etc.. blah blah hand-wave.)
Unicode just sucks.
Yes, "it jumped the shark when the pizza slice showed up". However, that doesn't imply that it did everything wrong. The notion of multi-codepoint characters is necessary to handle other languages. that is a solved problem, it's just that programmers mess up when dealing with it. Emoji may be a mistake, but the underlying "problems" caused by emoji existed anyway, and they're not really problems, just programmers being stupid.
We had multiple per-language encodings. It sucked.
Whatever this mess is, it's a whole thing that isn't a byte-stream and it isn't "characters" and it isn't human language. Burn it with fire and let's do something else.
[1] http://stackoverflow.com/documentation/unicode/6485/characte...
(In reality I am slightly less hard-core, I see some value in Unicode. And I really like Z̨͖̱̟̺̈̒̌̿̔̐̚̕͟͡a̵̭͕͔̬̞̞͚̘͗̀̋̉̋̈̓̏͟͞l̸̛̬̝͎̖̏̊̈́̆̂̓̀̚͢͡ǵ̝̠̰̰̙̘̰̪̏̋̓̉͝o̲̺̹̮̞̓̄̈́͂͑͡ T̜̤͖̖̣̽̓͋̑̕͢͢e̻̝͎̳̖͓̤̎̂͊̀͋̓̽̕͞x̴̛̝͎͔̜͇̾̅͊́̔̀̕t̸̺̥̯͇̯̄͂͆̌̀͞ it is an obvious win.). Even when it doesn't quite work... (I think I'm back to "fuck Unicode" now.)
https://mobile.twitter.com/unicode/status/722133439726505984
The way things compose: bytes combine into code points (unicode numbers), and code points combine into graphemes (visual symbols). In UTF-16 for legacy compatibility reasons with UCS-2, code points decompose into code units (byte pairs), and high code points, which need a lot of bits to represent their number, need two code units (4 bytes) instead of one.
Java and JavaScript are UTF-16 based, so they measure length in code units and not code points. An emoji code point can be a low or high number depending on when it was added. Low numbers can be stored in two bytes, high numbers need four bytes. So an emoji can have length 1 or 2 in UTF-16. However, when moving to the database it will typically be stored in UTF-8, and the field length will be code points, not code units. So, that emoji will have a length of 1 regardless of whether it is low or high. You don't notice this as a problem because app-level field length checks will return a bigger number than what the database perceives, so no field length limits are exceeded.
There isn't any such thing as "characters" in code. In documentation when they say "characters" usually they mean bytes, code units or code points. Almost never do they mean graphemes, which is intuitively what people think they mean. The bottom line is two-fold: (A) always understand what is meant in documentation by "length in characters", because it almost never means the intuitive thing, and (B) don't try to use graphemes as your unit of length, it won't work in practice.
What do you think how text editing controls work? You cursor moves one grapheme cluster at a time, selections start and end at grapheme cluster boundaries, and pressing backspace once deletes the last grapheme cluster even if it took you several key strokes to enter. Grapheme cluster are obviously useful and certainly not a poor way to deal with Unicode in the real world.
Sure, grapheme clusters are neither the most common way to talk about strings, nor are they the most useful one in all situations, but nobody claimed that. If you have to allocate storage, you of course use the size in bytes after encoding. If you translate between encodings, you may want to look at code points. The right tool for the job, and sometimes the right tool is grapheme clusters.
There isn't any such thing as "characters" in code.
Sure, there is. Actually characters exist only in code, they are not used in any field dealing with written language besides computing. A character is the smallest unit of text a computer system can address.
[1] https://developer.apple.com/library/content/documentation/Sw...
Your entire argument that graphemes are a poor way to deal with unicode seems to be that current programming languages don't use graphemes, instead dealing in a mix of code units or points. But the article here shows a number of cases where that doesn't break down, and the person you're responding to clearly points out that, for the cases covered in the article, graphemes are the way to go (and he's correct).
Graphemes aren't always the correct method (and I don't think your parent was advocating that), just like code units or code points aren't always the right way to count. It's highly dependent on the problem at hand. The bigger issue is that programming languages make the default something that's often wrong, when they probably ought to force the programmer to choose, and so, most code ends up buggy. Worse, some languages, like JavaScript, provide no tooling within their standard library for some of the various common ways of needing to deal with Unicode, such as code points.
This is because languages usually have a built in char type.
> don't try to use graphemes as your unit of length, it won't work in practice.
Swift does this and it's a really good thing. Everything is in graphemes by default -- char segmentation, indexing, length, etc.
There are way too many problems caused by programmers interpreting "code point" as a segmentable unit of text and breaking so many other scripts, not to mention emoji.
Not … really. Yes, we "know" the solution, but the terrible APIs that compose so many language's standard string type goads the programmer into choosing the wrong method or type.
JavaScript has — to an extent — the excuse of age. But the language still really (to my knowledge) lacks an effective way to deal with text that doesn't involve dragging in third-party libraries. You are not a high-level language if your standard library struggles with Unicode. Even recent additions to the language, such as the recent inclusion of leftPad, ignore Unicode (and, in that particular example, render the function mostly useless).
So C++, Lisp, Java, Python, Ruby, PHP, and JS are not high-level languages.
HN teaches me something new every day.
And while other languages provide the necessary support at the language or standard library level, I would guess there are quite a few developers out there that are not even aware that they are looking for enumerating grapheme clusters. But now some more know and if they made a good language choice, it is now a solved problem for them.
Is it up to date?
The last commit to this repo was on July 16, 2015, and the code says it conforms to the 8.0 standard. But Unicode 9.0 came out in June 2016. The document in your link [1] indicates that there were changes in the text-segmentation rules in the 9.0 release. However, I can't say whether any of these affect the correctness of the code.
To use the author's example:
woman - 1 codepoint
black woman - 2 codepoints, woman + dark Fitzpatrick modifier
️woman kissing woman - 7 codepoints, woman + ZWJ + heart + ZWJ + kissy lips + ZWJ + woman
It's like composing Mayan pictographs, except you have to include an invisible character in between each component.
Here's another fun one: country flags. Unicode has special characters 🇱 🇮 🇰 🇪 🇹 🇭 🇮 🇸 that you can combine into country codes to create a flag. 🇰+🇷 = 🇰🇷
edit: looks like HN strips emoji? Changed the emoji in the example into English words. They are all supposed to render as a single "character".
Here's the current list of valid emoji, including upcoming ones being added in the next revision.[2]
A reasonable test for passwords is to run them through an IDNA checker, which checks whether a string is acceptable as a domain name component. This catches most weird stuff, such as mixed left-to-right and right-to-left symbols, zero-width markers, homoglyphs, and emoji.
[1] https://www.washingtonpost.com/news/the-intersect/wp/2015/02... [2] http://unicode.org/emoji/charts-beta/full-emoji-list.html
For the purpose of allocating buffers, I can see the obvious use in knowing number of bytes, UTF-16 code units, or the number of codepoints. I also see the use in being able to iterate through grapheme clusters, for instance for rendering a fragment of text, or for parsing. Perhaps someone can shed light on a compelling use case for knowing the number of grapheme clusters in a particular string, because I haven't been able to think of one.
I'm not sure about calculating password lengths: if the point is entropy, the number of bytes seems good enough to me!
The password field bug is possibly compelling, but I don't think it's obvious what a password field should do. Should it represent keystrokes? Codepoints? Grapheme clusters? Ligatures? Replace all the glyphs with bullets during font rendering?
(Similarly, perhaps someone could explain why they think reversing a string should be a sensible operation. That this is hard to do is something I occasionally hear echoing around the internet. The best I've heard is that you can reuse the default forward lexicographic ordering on reversed strings for a use I've forgotten.)
The main thrust of your point — that "length" without clarification of what measure of length is meaningless — I agree with.
Sounds cute, but inaccurate.
If we count the last two planes that are reserved for private use (aka, applications/users can use them for whatever domain problems they like), that would be U+10FFFD.
If we count the variation selector codepoints (used for things like changing skin tone, or the look of certain other characters), U+E01EF.
If we count the last honestly-for-real-written-language character assigned, it would be 𪘀 U+2FA1D CJK COMPATIBILITY IDEOGRAPH-2FA.
But I suppose none of that sounds as fun as an emoji (which are really a very small part of the Unicode standard).
It is a Traditional Chinese character. It's a variant of U+2F600, 𪘀, which is pronounced "pián". It apparently is used in zero words. It's in Unicode because it's listed in the 7th section of TCA-CNS 11643-1992, a Taiwanese computing standard.
Searching for it gives lots of sites that acknowledge that it's a character that exists and then provide no definition for it.
My guess: it occurred in someone's name at some point. Pretty strange that it ended up requiring a compatibility mapping, though, when nobody seems to use the character or the character it's mapped to!
My idea was to increase the font size of a message that only consists of Emoji, depending on the number of Emoji in the message, like this:
https://xmpp.pix-art.de/imagehost/display/file/2017-03-09_09...
The code turned out more complex than first expected, mirroring the same problems OP encountered:
https://github.com/ge0rg/yaxim/blob/gradle/src/org/yaxim/and...
One can basically achieve an unlimited number of emojis by concatenating the current ones.
if (lastString.characters.count == 2) {
// pseudo code to allow string and activate send button
}
This was the app I was working on [1]; code is finished, but I'm not launching it (probably ever). The whole emoji length piece was quite frustrating because my assumption of character counting went right out of the window when I had people testing the app in Test Flight.[1] - https://joeblau.com/emo/
There's active work now for Unicode 9 support in Swift. Since string handling is heavily dependent on this algorithm (they have a unicode trie and all for optimization!) it's trickier than just rewriting the algorithm.
But, in general, you should be able to trust Swift to do the right thing here, barring bugs like "not up to date with the spec". Swift is great like that.
[1]: https://r12a.github.io/uniview/?charlist=%F0%9F%91%A8%F0%9F%...
https://en.wikipedia.org/wiki/Plane_(esotericism)#The_Planes
That's always been needed to actually properly work with unicode, what do you think ICU is? Few if any languages have complete native Unicode support. And it's hardly new, Unicode has an annex (#29) dedicated to text segmentation: http://www.unicode.org/reports/tr29/
"(this is a color-hued hand from Apple that doesn't render on HN)".length == 4
I ran into the length==2 bug when truncating some text, it led to errors trying to url encode a string :)
The author's `fancyCount2` still returns a size of 2 for these kinds of emoji, but I'm not too surprised.
func main() {
shit := "\U0001f4a9"
fmt.Printf("len of %s is %d\n", shit, utf8.RuneCountInString(shit))
}
$ len of � is 1Though I can't say that this is all that intuitive either...
Consider the examples given about combining emoji; Consider two runes that make one character: e and ◌́
[..."(insert 5 poo emoji here)"].length === 5
[..."(insert 5 poo emoji here)"][1] === "(poo emoji)"
No, no, no, NO.
Please stop spreading this bit of misinformation. Emoji has not changed this situation.
Hangul is one where NFC won't work well. Yes, we actually already encode all possible modern hangul syllable blocks in NFC form as well, but this ignores characters with double choseongs or double jungseongs that can be found in older text. Which you sometimes see in modern text, actually.
All Indic scripts (well, all scripts derived from Brahmi, so this includes many scripts from Southeast Asia as well like Thai) would have trouble doing the NFC thing.
I am annoyed that the unicode spec introduced more complexity into their algorithms to support Unicode, but this is because they could have achieved mostly the same task by not introducing emoji-specific complexity and reusing features that existing scripts already have and have already been accounted for.
I'm not sure if that would be a good thing or a bad thing.
Unicode has a block for Hangul jamo, but they aren't used in typical text. Instead, Hangul are presented using a massive 11K-codepoint block of every possible precomposed syllable. ¯\_(ツ)_/¯
There is active discussion on actually being able to build up complex grapheme clusters in such a manner, because it's necessary for Egyptian and Mayan text to be displayed properly. U+13430 and U+13431 have been accepted for Unicode 10.0 already for some Egyptian quadrat construction.
This isn't at the textual level, and the components are not strictly radicals, but this may interest you.
https://en.wikipedia.org/wiki/Ideographic_Description_Charac...
LOL, I knew about the crazy flag characters, but I had no idea it was this bad. Does "woman + ZWJ + heart + ZWJ + hands pressed together + ZWJ + woman" become "2 women lovingly holding hands"? Unicode has become completely absurd, and I am grateful every day that I'm not one of the poor coders having to implement it.
Humorously enough, input methods have not advanced at all. To type any of these things, I need to open a character picker, then either type in the character's name if I know it, or scroll through pages of symbols until I find the one I want. Yet we still call this "text."
Ironic huh, that we're discussing this on a site which has "solved" this by just pretending it doesn't exist :)
We're still on that IPv4 unicode, that Python 2.7 world, where we use scanf() to read user input and goto wherever we please.
Which reminds me: We need a Jaguar emoji!
...if you want to destroy the user experience. Practically many implementations would be forced to support colors. (I'm personally okay with that, but just to be correct)
Emoji skin colour and emoji using colour are two unrelated concerns.
Fonts and colors are still, for the most part, independent. Color or the lack thereof is a property of your font and text rendering subsystems. For instance Noto Emoji provides B&W emoji, and Noto Color Emoji provides colored ones.
Why test this at all? It's not as if a website should ever need to render a user's password as text. Is there another use case for excluding this "weird stuff" that I'm not seeing?
You don't need to use IDNA for this, though. There are standards specifically for dealing with Unicode passwords, such as SASLprep (RFC 4013) and PRECIS (RFC 7564).
Perhaps predictably, this has backfired and will continue to backfire quite spectacularly; it turns out that when you force people to start thinking along racial lines, they might not end up with the exact same ideas about race that you have. I suspect this may be a large contributing factor behind the recent resurgence of ethno-nationalism (see the Alt Right et al.).
If you have a limit on the length of a field, it helps to tell the user what it is in a way they understand. For non-technical users, bytes (and the embedded issue of encoding) and code points are both pretty esoteric, but number of symbols is less so. OTOH, SMS has strict data and encoding limits, and people managed with that; also provisioning byte storage for grapheme limited fields is hard: some graphemes use a ton of code points, family emoji and zalgo text are clear examples.
So it can fit in a database, i.e. with a certain number of bytes?
Semantics matter a lot.
Speaking of equality: in a lecture about logic I once gave, I asked the students whether {1,2} and {1,2} were the same. In a very real sense, they are different because I drew them (or typed them) in different places and slightly differently -- I promise I typed the second {1,2} with different fingers. But, through the lens of same-means-same-elements, they are the same. That is a warmup for {1,2} vs {1,1,2}, and {1,2} vs {n : n is a natural number and 1 <= n <= 2}.
(There's also kind of a joke about how my set of natural numbers might be red and your set of natural numbers might be blue, but the theory of sets doesn't care about the difference.)
If I want to use äöüßÄÖÜẞ because I'm confident that I can properly type them on all devices I'll need to type then, then let me. It's not your concern what method of input I'm using.
And maybe, just maybe, using latin characters is actually more of a hassle for a user anyway. (I think the risk of that occoring is low, but still. At the moment, it's a self-fulfilling prophecy that all users have proper method to input atin script available. We simply force them to have one.)
Edit: And the confusion is also possible with just latin characters. U+0430 looks exactly like "a", but has a different code point and thus ruins the hash.
IDSes let you basically do arbitrary table layout with arbitrary CJK ideographs, which is very very different. With Hangul I can say "display these three jamos in a syllable block", and I have no control over how they get placed in the block -- I just rely on the fact that there's basically one way to do it (for modern korean, archaic text is a bit more complicated and idk how it's done) and the font will do it that way.
With IDS I can say "okay, display these two glyphs side-by-side, place them under this third glyphs, place this aggregate next to another aggregate made up of two side-by-side glyphs, and surround this resulting aggregate with this glyphs". Well, I can't, because I can't say the word display there; IDS is for describing chars that can't be encoded, but isn't supposed to really be rendered. But it could be, and that's a vastly different thing from what existing scripts like Hangul and Indic scripts let you do when it comes to glyph-combining.
Research on what to do vis à vis Mayan characters (including perhaps reusing Egyptian control characters for layout) is still ongoing, as is better handling of Egyptian.
Twitter is not a "basic case", it's a case where the length limit is arbitrary and the specifics don't matter much anymore. Usually when you want to segment text that is not the case.
Edit: basically, my problem with your original comment is that it helps spread the misinformation that a lot of these things are "just" emoji problems, so some folks tend to ignore them since they don't care about emoji that much, and real human languages suffer.
This, to me, is an argument for emoji being a good thing for string-handling code. The fact that they're common means that software creators are much more likely to notice when they've written bad string-handling code.
Whatever platform you're using should have an API call for counting grapheme clusters. It may be more complex behind the scenes, but as an ordinary programmer it should be no more difficult to do it correctly than it is to do it wrong.
Another contributor to this is that programmers love to hate Unicode. It has its flaws, but in general folks attribute any trouble they have to "oh that's just unicode being broken" even though these are often fundamental issues with international text. This leads to folks ignoring things because it's "unicode's fault", even though it's not.
I think you misunderstood the comment to which you replied. What it was saying is that emoji fitzpatrick scale and color fonts are orthogonal concerns, you can do skin tones in grayscale.
"These are characters from a country you've never been to. Each three-byte sequence (assuming UTF-8) corresponds to a square-shaped character." --> Easy for everyone to understand, and less chance of screwup (as long as the software supports any Unicode at all).
"These should be decomposed into sequences of two or three characters, each three bytes long, and then you need a special algorithm to combine them into a square block." --> This pretty much means the software must be developed with Korean users in mind (or someone must heroically go through every part of the code dealing with displaying text), otherwise we might as well assume that it's English-only.
Well, now the equation might be different, as more and more software are developed by global companies and there are more customers using scripts with complicated combining diacritics, but that wasn't the case when Hangul was added to Unicode.
For example: if NFD works properly, the first two characters below should look identical, and the third should show a "defective" character that looks like the first two except without the circle (ㅇ). It doesn't work in gvim (it fails to consider the second/third example as a single character), Chrome in Linux, or Firefox in Linux.
은 은 ᅟᅳᆫ
Of course, if it were the only method of encoding Korean, then the support would have been better, but it would've still required a lot of work by everyone.
It is worth noting that precomposed Hangul syllables decompose to the Jamo characters under NFD (and vice versa for NFC). However, most data is sent and used with NFC normalization.
So yeah, Unicode is not a problem here (the compatibility with existing character sets was essential for Unicode's success), it's a problem of legacy character sets :-)
[1] Only correct for South Koreans though :) but the pattern is now very regular and it's much more efficient than heavy table lookups.
Unexpectedly sinister.
Any time you define an upper limit, someone will come up with more emojis that will require larger number of code points per grapheme.
Technically yes. But they are only "exposed" in UTF-16. In UTF-32 code points and code units are the same size, so you only have to deal with code points. In UTF-8 you only have to deal with code points and bytes. UTF-16 is unique in having something that is neither code point nor byte but sits in between.
While different layers may use words of the same size, there are still differences, for example what is valid and what is not. While for example U+00D800 is a perfectly fine code point, the first high-surrogate, 0x0000D800 is not a valid UTF-32 code unit. 0xC0 0xA0 is a perfectly fine pair of bytes, both are valid UTF-8 code units, and they could become the code point U+000020 if only 0xC0 0xA0 were not an invalid code unit subsequence.
So yes, while I agree that UTF-16 is special in that sense that one has to deal with 8, 16 and 32 bit words, I don't think that one should dismiss the concept of code units for all encoding forms but UTF-16. There enough subtle details between the different layers so that the distinction is warranted. And that is actually something I really like about the Unicode standard, it is really precise and doesn't mix up things that are superficially the same.
Not at all. I've never seen people using UTF-8 deal with a code unit stage. They parse directly from bytes to code points.
> While for example U+00D800 is a perfectly fine code point, the first high-surrogate, 0x0000D800 is not a valid UTF-32 code unit.
I thought that was an invalid code point. Where would I look to see the difference? Nevertheless I would expect most code to make no distinction between the invalidity of 0x0000D800 and 0x44444444, except perhaps to give a better error message.
> 0xC0 0xA0 is a perfectly fine pair of bytes, both are valid UTF-8 code units, and they could become the code point U+000020 if only 0xC0 0xA0 were not an invalid code unit subsequence.
If you say that they're correct code units then at what point do you distinguish bytes and code units? In practice almost nobody decodes UTF-8 with an understanding of code units, neither by that name nor any other name. They simply see bytes that correctly encode code points, and bytes that don't.
Especially if you say that C0 is a valid code unit despite it not appearing in any valid UTF-8 sequences.
As are the interlinear ruby annotations.
I know that they appear on Unicode's shitlist in [UTR#20], a proposed tech report that contained a table of codepoints that should not be used in text meant for public consumption. UTR#20 suggested things you could do when you encounter these codepoints, but it was withdrawn, leaving the status of these codepoints rather confused.
[UTR#20]: http://www.unicode.org/reports/tr20/tr20-9.html#Interlinear
0149 ; Deprecated # L& LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
0673 ; Deprecated # Lo ARABIC LETTER ALEF WITH WAVY HAMZA BELOW
0F77 ; Deprecated # Mn TIBETAN VOWEL SIGN VOCALIC RR
0F79 ; Deprecated # Mn TIBETAN VOWEL SIGN VOCALIC LL
17A3..17A4 ; Deprecated # Lo [2] KHMER INDEPENDENT VOWEL QAQ..KHMER INDEPENDENT VOWEL QAA
206A..206F ; Deprecated # Cf [6] INHIBIT SYMMETRIC SWAPPING..NOMINAL DIGIT SHAPES
2329 ; Deprecated # Ps LEFT-POINTING ANGLE BRACKET
232A ; Deprecated # Pe RIGHT-POINTING ANGLE BRACKET
E0001 ; Deprecated # Cf LANGUAGE TAG
(note that the ruby annotation codepoints aren't on that list).The use in XML/HTML is no longer maintained by Unicode, it is maintained by the W3C instead: https://www.w3.org/TR/unicode-xml/.
Yes
> And what are you supposed to do when you encounter one?
Nothing. Don't display them, or display some symbolic representation. You probably shouldn't make ruby happen here; if your text is intended to be rendered correctly use a markup language.
----------
Unicode is ultimately a system for describing text. Not all stored text is intended to be rendered. This is why it has things like lacuna characters and other things.
So when you come across some text using ruby, or some text with an unencodable glyph, what do you do? You use ruby annotations or IDS respectively. It lets you preserve the nature of the text without losing info.
(Ruby is inside unicode instead of being completely deferred to markup since it is used often enough in Japanese text, especially whenever an irregular (not out of the "common" list) kanji is used. You're supposed to use markup if you actually want it rendered, but if you just wanted to store the text of a manuscript you can use ruby annotations)
It does have all the major Unicode annexes--normalization (java.text.Normalizer), grapheme clusters (java.text.BreakIterator), BIDI (java.text.Bidi), line breaking (java.text.BreakIterator), not to mention the Unicode script and character class tables (java.lang.Character). And, since Java 8, it does have a proper code point iterator over character sequences.
> "\xAA".split ''
That works on a platform where my platform is UTF-32, but not one where it is UTF-8.
No, this post is talking about having a minimum length on the password for safety reasons (i.e. a limit on the minimum entropy). You're right that a minimum byte length will ensure this, but what happens when your user types in n-1 "things" but their password gets accepted anyway. That's only a minor thing but (and I'm not entirely sure whether this is possible) what about when your user types in n "things" but the password doesn't get accepted because it's actually only n-1 bytes. Now the password won't be accepted and the user has no idea why.
I agree that these are relatively trivial things, but the point is that it's not as simple as "just use the byte length".
I kind of like the idea of minimum length in cm as a password requirement.
or, "Your username must not take more than 0.001ml of ink when printed at 12pt"
Oh, no. That doesn't make sense. You need to limit by Giga grapheme strings.
Note that I didn't actually ask you about rendering.
I care from the point of view of the base level of natural language processing. Some decisions that have nothing to do with rendering are:
- Do they count as graphemes?
- What do you do when you feed text containing ruby characters to a Japanese word segmenter (which is not going to be okay with crazy Unicode control characters, even those intended for Japanese)?
- Could they appear in the middle of a phrase you would reasonably search for? Should that phrase then be searchable without the ruby? Should the contents of the ruby also be searchable?
Seeing how ruby codepoints are actually used would help to decide how to process them. But as far as I can tell, they're not actually used (markup is used instead, quite reasonably). So I'm surprised that your answer is a flat "Yes".
Sadly, no :( You may have luck scraping Wikibooks or some other source of PDFs or plaintext. In general you won't find interlinear annotations on the web because HTML has a better way of dealing with ruby. This is also why they're in the "shitlist", that shitlist is for stuff that's expressly not supposed to be used in markup languages.
Another way to get a good answer here is by asking the unicode mailing list, they tend to be helpful here. I know that they're used because I've heard that they are, so no first-hand experience with them. This isn't a very satisfying answer, I know, but I can't give a better one.
> Do they count as graphemes?
The annotation characters themselves? By UAX 29 they probably do, since UAX 29 doesn't try to handle many of these corner-case things (it explicitly asks you to tailor the algorithm if you care about specifics like these). ICU might deal with them better. The same goes for word segmentation, e.g. UAX 29 will not correctly word-segment Thai text, but ICU will if you ask it to. I haven't tried any of this, but it should be easy enough.
I guess a lot of this depends on what kind of processing you're doing. Ignoring the annotation sounds like the way to go for NLP, since it's ultimately an _annotation_ (which is kinda a parallel channel of info that's not essential to the text). This certainly applies for when the annotations are used for ruby, though they can be used for other things too. Interlinear annotations were almost used for the Vedic samasvara letter combiners, though they ultimately went with creating new combiners since it was a very restricted set of annotations.
They're not used much so the best way forward is probably to ignore them, really. These are a rather niche thing that never really took off.
Well, that's probably because in UTF-8 code unit is byte :)
Quoting https://en.wikipedia.org/wiki/UTF-8:
> The encoding is variable-length and uses 8-bit code units.
By definition, code unit is a bit sequence of a fixed size which can form code points. In UTF-8 you form code points using 8-bit bytes, therefore in UTF-8 code unit is byte. In UTF-16 it is a sequence of two bytes. In UTF-32 it is a sequence of four bytes.
Code units may 'exist' on all three through the fiat of their definition, but they only have a visible function and require you to process an additional layer in UTF-16.
Surrogate codepoints are indeed valid codepoints. It's just that valid UTF-8 is not allowed to encode surrogate codepoints, so the space of codepoints supported by UTF-8 is actually a subset of all Unicode codepoints. This subset is known as the set of Unicode scalar values. ("All codepoints except for surrogates.")
My admittedly quite limited experience with Unicode is from trying to exploit implementation bugs. And with that focus it is quite natural to pay close attention to the different layers in the specification in order to see where optimized implementations might miss edge cases.
And I am generally a big fan of staying close to the word of standards, if it does not cause unacceptable performance issues, I would always prefer to stick with the terminology of the standard even if it means that there will be transformations that are just the identity.
The distinction between code points and scalar values will for example become relevant if you implement Unicode meta data. There you can query meta data about surrogate code points even if a parser should never produce those code points.