The ü/ü Conundrum

179 points by firstSpeaker 2 years ago | 275 comments

re 2 years ago |

> Can you spot any difference between “blöb” and “blöb”?

It's tricky to try to determine this because normalization can end up getting applied unexpectedly (for instance, on Mac, Firefox appears to normalize copied text as NFC while Chrome does not), but by downloading the page with cURL and checking the raw bytes I can confirm that there is no difference between those two words :) Something in the author's editing or publishing pipeline is applying normalization and not giving her the end result that she was going for.

  00009000: 0a3c 7020 6964 3d22 3066 3939 223e 4361  .<p id="0f99">Ca
  00009010: 6e20 796f 7520 7370 6f74 2061 6e79 2064  n you spot any d
  00009020: 6966 6665 7265 6e63 6520 6265 7477 6565  ifference betwee
  00009030: 6e20 e280 9c62 6cc3 b662 e280 9d20 616e  n ...bl..b... an
  00009040: 6420 e280 9c62 6cc3 b662 e280 9d3f 3c2f  d ...bl..b...?</

Let's see if I can get HN to preserve the different forms:

Composed: ü Decomposed: ü

Edit: Looks like that worked!
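
The byte-level check described above can be reproduced with a minimal Python sketch. It assumes only the standard library `unicodedata` module; the variable names are illustrative and not taken from the article.

```python
import unicodedata

composed = "\u00fc"      # ü as a single code point (the NFC form)
decomposed = "u\u0308"   # u followed by COMBINING DIAERESIS (the NFD form)

# The two strings render alike but are different code point sequences.
print(composed == decomposed)                 # False
print(composed.encode("utf-8").hex(" "))      # c3 bc
print(decomposed.encode("utf-8").hex(" "))    # 75 cc 88

# Normalizing both to the same form makes them compare equal.
same_after_nfc = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)
print(same_after_nfc)                         # True

# Both words in the hexdump contain c3 b6, which is the composed ö.
print(bytes.fromhex("c3b6").decode("utf-8") == "\u00f6")  # True
```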

mgaunard 2 years ago | |

I believe XML and HTML both require Unicode data to be in NFC.

fanf2 2 years ago | | |

I don’t think so?

https://www.w3.org/TR/2008/REC-xml-20081126/#charsets

XML 1.1 says documents should be normalized but they are still well-formed even if not normalized

https://www.w3.org/TR/2006/REC-xml11-20060816/#sec-normaliza...

But you should not use XML 1.1

https://www.ibiblio.org/xml/books/effectivexml/chapters/03.h...

mbrubeck 2 years ago | | |

HTML does not require NFC (or any other specific normalization form):

https://www.w3.org/International/questions/qa-html-css-norma...

Neither does XML (though XML 1.0 recommends that element names SHOULD be in NFC and XML 1.1 recommends that documents SHOULD be fully normalized):

https://www.w3.org/TR/2008/REC-xml-20081126/#sec-suggested-n...

https://www.w3.org/TR/xml11/#sec-normalization-checking

layer8 2 years ago | | |

You believe incorrectly. Not even Canonical XML requires normalization: https://www.w3.org/TR/xml-c14n/#NoCharModelNorm

Eisenstein 2 years ago | |

Perhaps the author used the same character twice for effect, not suspecting someone would use curl to examine the raw bytes?

mglz 2 years ago |

My last name contains an ü and it has been consistently horrible.

* When I try to preemptively replace ü with ue many institutions and companies refuse to accept it because it does not match my passport

* Especially in France, clerks try to emulate ü with the diacritics used for the trema e, ë. This makes it virtually impossible to find me in a system again

* Sometimes I can enter my name as-is and there seems to be no problem, only for some other system to mangle it to � or a box. This often triggers errors downstream that I have no way of fixing

* Sometimes, people print a u and add the diacritics by hand on the label. This is nice, but still somehow wrong.

I wonder what the solution is. Give up and ask people to consistently use an ASCII-only name? Allow everybody 1000+ Unicode characters as a name and go off that string? Officially change my name?
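
One common way a name ends up as � can be reproduced in a few lines of Python. This is a sketch of one plausible cause (bytes written in one encoding and read back as another), not a diagnosis of any particular system; the surname "Müller" is a made-up example.

```python
name = "M\u00fcller"  # hypothetical surname containing ü

# Written as Latin-1, then read back as UTF-8: the lone byte 0xFC is not
# valid UTF-8, so a lenient decoder substitutes U+FFFD (the � character).
latin1_bytes = name.encode("latin-1")
garbled = latin1_bytes.decode("utf-8", errors="replace")
print(garbled)

# Written as UTF-8, then read back as Latin-1: ü turns into two characters.
mojibake = name.encode("utf-8").decode("latin-1")
print(mojibake)
```

Once the replacement character has been stored, the original byte is gone, which may be why such errors are hard to fix downstream.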

weinzierl 2 years ago |

This article is about a failure to do normalization properly and is not really about an issue with Unicode. Regardless of what some comments seem to allude to, an Umlaut-ü should always render exactly the same, no matter how it is encoded.

There is, however, a real ü/ü conundrum, regarding ü-Umlaut and ü-diaeresis. The ü's in the words Müll and aigüe should render differently. The dots in the French word are too close to the letter. In printed French material this is usually not the case.

Unfortunately Unicode does not capture the nuance of the semantic difference between an Umlaut and a Tréma or Diaeresis.

The Umlaut is a letter in its own right with its own space in the alphabet. An ü-Umlaut can never be replaced by an u alone. This would be just as wrong as replacing a p by a q. Just because they look similar does not mean they are interchangeable. [1]

The Tréma on the other hand, is a modifier that helps with proper pronunciation of letter combinations. It is not a letter in its own right, just additional information. It can sometimes move over other adjacent letters (aiguë=aigüe, both are possible) too.

Some say this should be handled by the rendering system similar to Han-Unification, but I strongly disagree with this. French words are often used in German and vice versa. Currently there is no way to render a German loan word with Umlaut (e.g. führer) properly in French.

[1] The only acceptable replacement for ü-Umlaut is the combination ue.

noodlesUK 2 years ago |

One thing that is very unintuitive with normalization is that macOS is much more aggressive with normalizing Unicode than Windows or Linux distros. Even if you copy and paste non-normalized text into a text box in Safari on Mac, it will be normalized before it gets posted to the server. This leads to strange issues with string matching.

jesprenj 2 years ago |

Should you really change filenames of users' files and depend on the fact that they are valid utf8? Wouldn't it be better to keep the original filename and use that most of the time sans the searches and indexing?

Why don't you normalize latin alphabets filenames for indexing even further -- allow searching for "Führer" with queries like "Fuehrer" and "Fuhrer"?
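
A rough Python sketch of that idea: keep the stored file name untouched and derive several lowercase search keys from it. The function names and the tiny transliteration table are assumptions made for illustration; a real index would need a far more complete mapping.

```python
import unicodedata

# Assumed, deliberately tiny German transliteration table (illustrative only).
GERMAN_EXPANSIONS = {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue"}

def strip_marks(text):
    """Drop combining marks after canonical decomposition (ü becomes u)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def search_keys(filename):
    """Return the set of keys under which a file name could be indexed."""
    nfc = unicodedata.normalize("NFC", filename).casefold()
    expanded = "".join(GERMAN_EXPANSIONS.get(ch, ch) for ch in nfc)
    return {nfc, strip_marks(nfc), expanded}

print(sorted(search_keys("F\u00fchrer")))
```

A query is then normalized the same way and matched against any of the keys, so "Fuehrer" and "Fuhrer" both find the file while the name on disk is never rewritten.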

zeroCalories 2 years ago | |

I generally agree that you shouldn't change the file name, but in reality I bet OP stored it as another column in a database.

For more aggressive normalization like that, I think it makes more sense to implement something like a spell checker that suggests similar files.

josephcsible 2 years ago |

IMO, it was a mistake for Unicode to provide multiple ways to represent 100% identical-looking characters. After all, ASCII doesn't have separate "c"s for "hard c" and "soft c".

layer8 2 years ago |

The more general solution is specified here: https://unicode.org/reports/tr10/#Searching

bawolff 2 years ago | |

Collation and normal forms are totally different things with different purposes and goals.

Edit: reread the article. My comment is silly. UCA is the correct solution to the author's problem.

blablabla123 2 years ago |

As a German macOS user with a US keyboard I run into a related issue every now and then. What's nice about macOS is that I can easily combine Umlaute, and also other common letters from European languages, without any extra configuration. But some (web) applications stumble over it during input, because the keystrokes arrive in two steps: 1. ¨ (Option-u) 2. ü (u pressed)

kps 2 years ago | |

Early on, Netscape effectively exposed Windows keyboard events directly to Javascript, and browsers on other platforms were forced to try to emulate Windows events, which is necessarily imperfect given different underlying input systems. “These features were never formally specified and the current browser implementations vary in significant ways. The large amount of legacy content, including script libraries, that relies upon detecting the user agent and acting accordingly means that any attempt to formalize these legacy attributes and events would risk breaking as much content as it would fix or enable. Additionally, these attributes are not suitable for international usage, nor do they address accessibility concerns.”

The current method is much better designed to avoid such problems, and has been supported by all major browsers for quite a while now (the laggard Safari arriving 7 years from this Tuesday).

https://www.w3.org/TR/uievents

chuckadams 2 years ago |

Clearly the author already knows this, but it highlights the importance of always normalizing your input, and consistently using the same form instead of relying on the OS defaults.

makeitdouble 2 years ago | |

The larger point is probably that search and comparison are inherently hard as what humans understand as equivalent isn't the same for the machine. Next stop will be upper case and lower case. Then different transcriptions of the same words in CJK.
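
For the upper/lower case step, the Unicode Standard defines canonical caseless matching as NFD(toCasefold(NFD(X))). A minimal Python sketch of that definition follows, using only the standard library; the helper names are made up for this example.

```python
import unicodedata

def nfd(text):
    return unicodedata.normalize("NFD", text)

def canonical_caseless_key(text):
    """Comparison key following NFD(casefold(NFD(text)))."""
    return nfd(nfd(text).casefold())

def same_text(a, b):
    """True if a and b match regardless of case and composed/decomposed form."""
    return canonical_caseless_key(a) == canonical_caseless_key(b)

print(same_text("BL\u00d6B", "blo\u0308b"))  # True
print(same_text("blob", "bl\u00f6b"))        # False
```

This covers case and canonical equivalence only; it does nothing for transcription variants such as the CJK case mentioned above.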

mckn1ght 2 years ago | |

Also, never trust user input. File names are user inputs. You can execute XSS attacks via filenames on an unsecured site.
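
As a small illustration of the XSS point, this Python sketch escapes a file name before placing it into HTML. `render_file_item` is a hypothetical helper, and the hostile name is an invented example; only `html.escape` comes from the standard library.

```python
import html

def render_file_item(filename):
    """Return an HTML list item that displays an untrusted file name."""
    # html.escape converts &, <, > and (by default) quotes to entities,
    # so markup inside the name is shown as text instead of being parsed.
    return "<li>" + html.escape(filename) + "</li>"

hostile = '<img src=x onerror="alert(1)">.txt'
print(render_file_item(hostile))
```

Escaping at output time is only one layer; it does not replace validating names when they are uploaded.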

userbinator 2 years ago |

> its[sic] 2024, and we are still grappling with Unicode character encoding problems

More like "because it's 2024." This wouldn't be a problem before the complexity of Unicode became prevalent.

bornfreddy 2 years ago | |

You mean this wouldn't be a problem if we used the myriad different encodings like we did before Unicode, because we would probably not be able to even save the files anyway? So true.
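
A short Python sketch of what "myriad different encodings" meant in practice: the same letter had different byte values in different code pages. CP437 is chosen here only as a second example code page; it is not mentioned in the thread.

```python
u_umlaut = "\u00fc"

as_cp1252 = u_umlaut.encode("cp1252")  # Windows Western European
as_cp437 = u_umlaut.encode("cp437")    # original IBM PC code page
print(as_cp1252.hex(), as_cp437.hex())  # fc 81

# Reading the CP1252 byte as CP437 silently yields a different character,
# because nothing in the byte itself records which code page was used.
misread = as_cp1252.decode("cp437")
print(misread == u_umlaut)              # False
```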

userbinator 2 years ago | | |

Before Unicode, most systems were effectively "byte-transparent" and encoding was only a top-level concern. Those working in one language would use the appropriate encoding (likely CP1252 for most Latin languages) and there wouldn't be confusion about different bytes for same-looking characters.

n2d4 2 years ago | |

You make it sound like non-English languages were invented in 2024

mschuster91 2 years ago | |

> This wouldn't be a problem before the complexity of Unicode became prevalent.

It was a problem even before then. It worked fine as long as you had countries that were composed of one dominant ethnicity that shat upon how minorities and immigrants lived (they were just forced to use a transliterated name, which could be one hell of a lot of fun for multi-national or adopted people) - and even that wasn't enough to prevent issues. In Germany, for example, someone had to go up to the highest public-service courts in the late 70s [1] to have his name changed from Götz to Goetz because he was pissed off that computers were unable to store the ö and so he wanted to change his name rather than keep getting mis-named, but German bureaucracy does not like name changes outside of marriage and adoption.

[1] https://www.schweizer.eu//aktuelles/urteile/7304-bverwg-vom-...

bawolff 2 years ago | |

Combining characters go back to the 90s. The unicode normal forms were defined in the 90s. None of this is new at this point.

_nalply 2 years ago |

Sometimes it makes sense to reduce to Unicode confusables.

For example the Greek letter Big Alpha looks like uppercase A. Or some characters look very similar like the slash and the fraction slash. Yes, Unicode has separate scalar values for them.

There are Open Source tools to handle confusables.

This is in addition to the search specified by Unicode.
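
The two examples above can be inspected with Python's standard `unicodedata` module. This sketch only shows that the look-alikes are distinct code points that normalization leaves alone; it is not a confusables detector, which would need the Unicode confusables data.

```python
import unicodedata

pairs = [("A", "\u0391"), ("/", "\u2044")]
for ascii_char, lookalike in pairs:
    print(f"U+{ord(ascii_char):04X} {unicodedata.name(ascii_char)}")
    print(f"U+{ord(lookalike):04X} {unicodedata.name(lookalike)}")

# Even compatibility normalization (NFKC) keeps Greek Alpha as it is,
# so confusables need their own mapping on top of normalization.
alpha_after_nfkc = unicodedata.normalize("NFKC", "\u0391")
print(alpha_after_nfkc == "A")  # False
```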

Havoc 2 years ago |

For those intrigued by this sort of thing, check out the tech talk “plain text” by Dylan Beattie

Absolute gem. His other talks are entertaining too

hanche 2 years ago | |

He seems to have done that talk several times. I watched the 2022 one. Time well spent!

mawise 2 years ago |

I ran into this building search for a family tree project. I found out that Rails provides `ActiveSupport::Inflector.transliterate()` which I could use for normalization.

anewhnaccount2 2 years ago |

Reminded of this classic diveintomark post http://web.archive.org/web/20080209154953/http://diveintomar...

CoastalCoder 2 years ago |

Isn't ü/ü-encoding a solved problem on Unix systems?

</joke>

philkrylov 2 years ago |

The article suggests using NFC normalization as a simple solution, but fails to mention that HFS+ always does NFD normalization to file names, and APFS kinda does not but some layer above it actually does (https://eclecticlight.co/2021/05/08/explainer-unicode-normal...), and ZFS has this behavior controlled by a dataset-level option. I don't see how applying its suggestion literally (just normalize to NFC before saving) can work.
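
One way around relying on what the file system stores is to normalize only at comparison time. A minimal Python sketch under that assumption follows; the function names are hypothetical and the approach is an alternative to the article's suggestion, not a description of it.

```python
import os
import unicodedata

def find_matching_names(names, wanted):
    """Return the names that equal `wanted` once both are NFC-normalized."""
    target = unicodedata.normalize("NFC", wanted)
    return [n for n in names if unicodedata.normalize("NFC", n) == target]

def find_in_directory(directory, wanted):
    # os.listdir returns names as the file system reports them, which may
    # be decomposed on some setups; the stored name is left untouched.
    return find_matching_names(os.listdir(directory), wanted)

print(find_matching_names(["blo\u0308b.txt", "other.txt"], "bl\u00f6b.txt"))
```

The returned name is the one actually on disk, so it can be passed back to `open` unchanged whichever form the file system chose.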

jph 2 years ago |

Normalizing can help with search. For example for Ruby I maintain this gem: https://rubygems.org/gems/sixarm_ruby_unaccent

noname120 2 years ago | |

Wow the code[1] looks horrific!

Why not just do this: string → NFD → strip diacritics → NFC? See [2] for more.

[1] https://github.com/SixArm/sixarm_ruby_unaccent/blob/eb674a78...

[2] https://stackoverflow.com/a/74029319/3634271
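
The pipeline string → NFD → strip diacritics → NFC can be sketched in Python as follows. This uses the standard `unicodedata` module and is not the code from either link; `unaccent` is just a name chosen for the example.

```python
import unicodedata

def unaccent(text):
    """Decompose, drop nonspacing marks, then recompose what is left."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", stripped)

print(unaccent("M\u00f6tley Cr\u00fce"))  # Motley Crue
print(unaccent("S\u00f8ren"))             # Søren
```

The second call shows a limit of the approach: ø has no canonical decomposition, so letters like it pass through unchanged and would need an explicit mapping.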

jph 2 years ago | | |

Sure does look horrific. :-) That's because it's the same code from 2008, long before Ruby had the Unicode handlers. In fact it's the same code as for many other programming languages, all the way back to Perl in the mid-1990s. I didn't create it; I merely ported it from Perl to Ruby.

More important, the normalization does more than just diacritics. For example, it converts superscript 2 to ASCII 2. A better naming convention probably would have been "string normalize" or "searchable string" or some such, but the naming convention in 2012 was based on Perl.
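
For comparison, Unicode's compatibility normalization (NFKC) performs some of the same non-diacritic folds, such as superscript 2 to ASCII 2. The Python sketch below shows that behavior; it is not how the gem works, and `compat_fold` is a made-up name.

```python
import unicodedata

def compat_fold(text):
    """Apply compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", text)

print(compat_fold("x\u00b2"))        # x2
print(compat_fold("\ufb01le"))       # file (from the fi ligature)
print(compat_fold("\uff21\uff22"))   # AB (from fullwidth letters)
```

NFKC leaves diacritics in place, so a searchable-string function would still combine it with a mark-stripping step.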

kazinator 2 years ago |

Oh that Mötley Ünicöde.

lxgr 2 years ago | |

I'm aware of the "metal umlaut" meme, but as a German native speaker, I can't not read these in my head in a way that sounds much less Metal than probably intended :)

082349872349872 2 years ago | | |

> "When we finally went to Germany, the crowds were chanting, ‘Mutley Cruh! Mutley Cruh!’ We couldn’t figure out why the fuck they were doing that." —VNW

Symbiote 2 years ago | | |

Years ago, an American metalhead was added to a group chat before she came to visit.

She was called Daniela, but she'd written it "Däniëlä". When my Swedish friend met her in person, having seen her name in the group chat, he said something like "Hej, Dayne-ee-lair right? How was the flight?".

ooterness 2 years ago | | |

The best metal umlauts are placed on a consonant (e.g., Spın̈al Tap). This makes it completely clear when it's there for aesthetics and not pronunciation.

ginko 2 years ago | | |

I will always pronounce the umlaut in Motörhead. Lemmy brought that on himself.

yxhuvud 2 years ago | | |

Yes, those umlauts made it sound more like a fake french accent.

082349872349872 2 years ago | |

It can encode Spın̈al Tap, so it's all good.

chuckadams 2 years ago | | |

Oh sweet summer child, i̶̯͖̩̦̯͉͈͎͛̇͗̌͆̓̉̿̇̚͜͝͠ͅt̶̥̳͙̺̀͊͐͘ ̷̧͉̲̩̩̠̥̀̍̔͝c̸̢̛̙̦͙̠̱̖̠͆̆̄̈́͋͘ą̴̩̪̻̭̐́̒n̶̡̛̛̳̗̦͚̙̖͓̝̻̓̔̎̎̅̒͊ͅ ̵̰̞̰̺̠̲̯̤̠̹̯̩͚̥̗͌̓e̴̪̯̠͙̩̝͓̎́̋̈́̂̓̏̈͗͛̓̀̾͗͘n̶͕̗̣͙̺̰̠͐́͆̀́̌͑̔̊̚ĉ̴̗͔̼̦̟̰͐̌̂̅͋̄̄͘̕̚o̵̧͙̤͔̻̞̝̯̱̰̤̻̠̝̎͐̈́̈̐͆͑̃̀̏̂͝͠͝d̸͕̼̀̐̚ế̴̢̢̡̳͇̪̤͇͉̳̟̈̈̈́̎̀̋͆͊̃̓͛̈́͘ ̷̞̞̜̖͇̱̞͔̈́͋̈́̃̎̇̈͜͝ͅs̷̢̡͚͉͚̬̙̼̾̅̀̊̈́̏̇͘͜ö̸̥̠̲̞̪̦͚̞̝̦́̃̈́́̊͐̾̏̂͂̓̋͋̚͠ ̶̞̺̯̖͓̞͇̳͈̗͖̗̫̍̌̋̈͗̉͝͠m̶̳̥͔͔͚̈́̕̕̚͘͜͠u̵͚̓͗̔̐̽̍ċ̷̨̢̡̛̭͓̪͕̗̝̟͓̩͇͒̽͒͑̃́̇͌̊͊̄̈́͘͜h̶̳̮̟̃͂͛̑̚̚ ̵̢͉̣̲͇͕̈̈̍̕͘ͅm̴̱͙̜͔̋̐̅͗̋̈̀̌͛̈͘̕͠o̷̧̡̮̜͎͙̖̞͈̘̩̙͓̿̆̀̋͜r̶͙̗̯͎̎͛̌̈́̂̓̈̑̅̓͊̒̊̑̈ę̷͕͉̲̟̽̄͒̍͑̀̿̔̒̃̅̿́͘͝ͅ.̷̡̧̻̘̝̞̹̯̞͚̱̼͓̠͇̌̅͂.̷̧̫͙̮̞̳̼̤̪̖̦̟͕̏̐͑̾̈́̀̅͌̓.̵̧̛̛̖̥͔͍̲̲͉̺̩̪̭̋́̓̌͂̽̋̃̎͋͆͝͠ͅ

raffy 2 years ago |

I created a bunch of Unicode tools during development of ENSIP-15 for ENS (Ethereum Name Service)

ENSIP-15 Specification: https://docs.ens.domains/ensip/15

ENS Normalization Tool: https://adraffy.github.io/ens-normalize.js/test/resolver.htm...

Browser Tests: https://adraffy.github.io/ens-normalize.js/test/report-nf.ht...

0-dependency JS Unicode 15.1 NFC/NFD Implementation [10KB] https://github.com/adraffy/ens-normalize.js/blob/main/dist/n...

Unicode Character Browser: https://adraffy.github.io/ens-normalize.js/test/chars.html

Unicode Emoji Browser: https://adraffy.github.io/ens-normalize.js/test/emoji.html

Unicode Confusables: https://adraffy.github.io/ens-normalize.js/test/confused.htm...

WalterBright 2 years ago |

> Can you spot any difference between “blöb” and “blöb”?

That's where Unicode lost its way and went into a ditch. Identical glyphs should always have the same code point (or sequence of code points).

Imagine all the coding time spent trying to deal with this nonsense.

euroderf 2 years ago | |

A fine sentiment, but (FWIW) it goes into a ditch when dealing with CJK.

WalterBright 2 years ago | | |

One unique sequence per unique glyph takes care of all that.

ulrischa 2 years ago |

It is really so awful that we have to deal with encoding issues in 2024.

ComputerGuru 2 years ago |

ZFS can be configured to force the use of a particular normalized Unicode form for all filenames. Amazing filesystem.

NotYourLawyer 2 years ago |

ASCII should be enough for anyone.

zzo38computer 2 years ago | |

ASCII is good for a lot of stuff, but not for everything. Sometimes, other character sets/encodings will be better, but which one is better depends on the circumstances. (Unicode does have many problems, though. My opinion is that Unicode is no good.)

hanche 2 years ago | |

And who needs more than 640 kilobytes of memory anyhow?

mckn1ght 2 years ago | | |

Don’t forget butterflies in case you need to edit some text.

euroderf 2 years ago | |

Filling the upper 128 characters with box-drawing characters was all well & fine, but you'd think IBM might've given some thought instead to defining a character set that would have maximum applicability for the set of all (Roman alphabet -descended) Western languages. (Plus pinyin.)

earthboundkid 2 years ago |

This isn’t an encoding problem. It’s a search problem.

juujian 2 years ago |

I ran into encoding problems so many times, I just use ASCII aggressively now. There is still kanji, Hanzi, etc. but at least for Western alphabets, not worth the hassle.

zzo38computer 2 years ago | |

I also just use ASCII when possible; it is the most likely to work and to be portable. For some purposes, other character sets/encodings are better, but which ones are better depends on the specific case (not only what language of text but also the use of the text in the computer, etc).

arp242 2 years ago | |

This works fine as a personal choice, but doesn't really work if you're writing something other random people interact with.

Even for just English it doesn't work all that well because it lacks things like the Euro which is fairly common (certainly in Europe), there are names with diacritics (including "native" names, e.g. in Ireland it's common), there are too many loanwords with diacritics, and ASCII has a somewhat limited set of punctuation.

There are some languages where this can sort of work (e.g. Indonesian can be fairly reliably written in just ASCII), although even there you will run into some of these issues. It certainly doesn't work for English, and even less for other Latin-based European languages.

layer8 2 years ago | |

The article isn’t about non-Unicode encodings.

juujian 2 years ago | | |

Meant to write ASCII

keybored 2 years ago |

I try to avoid Unicode in filenames (I’m on Linux). It seems that a lot of normal users might have the same intuition as well? I get the sense that a lot will instinctually transcode to ASCII, like they do for URLs.

zzo38computer 2 years ago | |

I also try to avoid non-ASCII characters in file names (and I am also on Linux). I also like to avoid spaces and most punctuation in file names (if I need word separation I can use underscores or hyphens).

skissane 2 years ago | | |

Sometimes I wish they had disallowed spaces in file names.

Historically, many systems were very restrictive in what characters are allowed in file names. In part in reaction to that, Unix went to the other extreme, allowing any byte except NUL and slash.

I think that was a mistake - allowing C0 control characters in file names (bytes 0x01 thru 0x1F) serves no useful use case, it just creates the potential for bugs and security vulnerabilities. I wish they’d blocked them.

POSIX debated banning C0 controls, although it appears to have settled on just a recommendation (not a mandate) that implementations disallow newline: https://www.austingroupbugs.net/view.php?id=251
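
An application can apply the stricter rule itself even where the file system does not. A minimal Python sketch, with a hypothetical helper name, that flags C0 control characters (0x00 through 0x1F) in a proposed file name:

```python
def has_c0_control(filename):
    """True if the name contains any character in the range U+0000..U+001F."""
    return any(ord(ch) <= 0x1F for ch in filename)

print(has_c0_control("report\n2024.txt"))  # True
print(has_c0_control("report 2024.txt"))   # False
```

A newline inside a name is the classic case: tools that read one file name per line will see it as two entries.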

keybored 2 years ago | |

I argue for using more Unicode instead of ASCII: people disagree. I say that I use ASCII-only in filenames (because filenames suck between platforms, and in general) and people downvote. :)

  git clone https://github.com/ghurley/encodingtest
  Cloning into 'encodingtest'...
  remote: Enumerating objects: 9, done.
  remote: Counting objects: 100% (9/9), done.
  remote: Compressing objects: 100% (5/5), done.
  remote: Total 9 (delta 1), reused 0 (delta 0), pack-reused 0
  Receiving objects: 100% (9/9), done.
  Resolving deltas: 100% (1/1), done.
  warning: the following paths have collided (e.g. case-sensitive paths on a case-insensitive filesystem) and only one from the same colliding group is in the working tree:
    'ss'
    'ß'
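
The warning does not say which comparison rule the file system applied. Full Unicode case folding is one rule under which these two names coincide, and Python's `str.casefold` can illustrate it; treat this as a sketch of a plausible mechanism rather than an account of what the file system does.

```python
names = ["ss", "\u00df"]  # the two colliding paths from the warning

folded = [name.casefold() for name in names]
collide = folded[0] == folded[1]
print(folded, collide)

# Simple lowercasing is not enough to make them equal: ß stays ß.
print("\u00df".lower() == "ss")  # False
```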