The ü/ü Conundrum(unravelweb.dev) |
Historically, many systems were very restrictive in what characters are allowed in file names. In part in reaction to that, Unix went to the other extreme, allowing any byte except NUL and slash.
I think that was a mistake - allowing C0 control characters in file names (bytes 0x01 thru 0x1F) serves no useful use case, it just creates the potential for bugs and security vulnerabilities. I wish they’d blocked them.
POSIX debated banning C0 controls, although it appears to have settled on just a recommendation (not a mandate) that implementations disallow newline: https://www.austingroupbugs.net/view.php?id=251
But spaces in filenames are really just an inconvenience at most for heavy terminal users, and are a natural thing to use for basically everyone else. All my markdown files are word-word-word.md, but all my WYSIWYG documents are "Word word word.doc".
The hassle of constantly explaining to angry civilians "why won't it let me write this file" would be worse than the hassle of having to quote or backslash-escape the occasional path in the shell.
For non-technical WYSIWYG users, there is a simple solution: auto-replace space with underscore when the user enters a filename containing it; you could even convert the underscore back to a space on display. Some GUIs already do stuff like this anyway - e.g. macOS exchanging slash and colon in its GUI layer (primarily for backward compatibility with Classic Mac OS, where colon, not slash, was the path separator).
By allowing a character to do double duty in that way, you make necessary all the complexity of quoting/escaping.
If the set of file name characters, and the set of file name delimiters, are orthogonal, you reduce (possibly even eliminate) the need for that complexity.
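The "double duty" point can be sketched with Python's shlex module, which splits and quotes roughly the way a POSIX shell does. The file names are just the ones from upthread; this is an illustration, not a claim about any particular shell.

```python
import shlex

# A space inside a file name collides with the space that separates arguments.
naive_args = shlex.split("cp Word word word.doc backup/")

# Quoting restores the intended three arguments, at the cost of an escaping layer.
quoted_line = " ".join(["cp", shlex.quote("Word word word.doc"), "backup/"])
quoted_args = shlex.split(quoted_line)

# An underscore is not a delimiter, so nothing needs quoting.
underscore_args = shlex.split("cp Word_word_word.doc backup/")

print(naive_args)       # five tokens: the name was split apart
print(quoted_line)
print(quoted_args)      # three tokens
print(underscore_args)  # three tokens
```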
Also, allowing space in filenames creates other difficulties, such as file names with trailing spaces, double spaces, etc, which might not be noticed, even two files whose names differ only in the number of spaces.
A character like underscore does not have the same problem, since a trailing underscore or a double underscore is more readily recognised than a trailing or double space.
Your trailing/double space issue is also easy to solve (in the world of wishes) with highlighting or other mechanisms, so making the world much worse by banning spaces is not the appropriate remedy
Not really true - the “general world of computer use” uses that stuff very heavily, just “behind the scenes” so the average user isn’t aware of it. For example, it is very common for GUI apps to parse command line arguments at startup (since, e.g., one way the OS, and other apps which integrate with it, uses to get your word processor to open a particular document, is to pass the path to the document as a command line argument)
> and banning spaces does not remove the complexity of escaping (how do you escape _?)
You don’t need to escape _ unless it has some special meaning to the command processor/shell. On Unix it doesn’t. Nor for Windows cmd.exe
I take it you are talking about GUIs which do that, not filesystems.
That means they're not using it since they don't have to deal with spaces as spaces vs as separators
> You don’t need to escape
So how do you differentiate between a user inserting a space and a user inserting a literal _ in a file name?
The end-user isn't consciously using it. The software they are using is.
We are talking here about programmer-visible reality, not end-user-visible reality. Those two realities don't have to be the same, as in the "replace spaces with underscores and vice versa" idea.
> So how do you differentiate between a user inserting a space and a user inserting a literal _ in a file name?
Underscores are rarely used by non-technical users. It isn't a standard punctuation mark. Back when people used typewriters, the average person was familiar with using them to underline things, but nowadays, the majority of the population are too young to have ever used one. I doubt many non-technical users would even notice if underscores in file names were (from their perspective) automatically converted to spaces, since they probably wouldn't put one in to begin with.
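For what it's worth, the disagreement here is easy to make concrete. A minimal sketch, assuming a hypothetical GUI layer with two helper functions (both names are made up for illustration): names with only spaces round-trip fine, but a literal underscore typed by the user comes back as a space.

```python
def to_stored(display_name: str) -> str:
    """Hypothetical GUI layer: store spaces as underscores."""
    return display_name.replace(" ", "_")


def to_display(stored_name: str) -> str:
    """Hypothetical GUI layer: show underscores as spaces."""
    return stored_name.replace("_", " ")


typed = "my_notes v2.txt"             # the user typed a literal underscore
shown = to_display(to_stored(typed))  # the underscore is now indistinguishable

print(repr(typed))
print(repr(shown))
```

Whether that loss matters depends on how often non-technical users actually type underscores, which is exactly the point being argued.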
It's tricky to try to determine this because normalization can end up getting applied unexpectedly (for instance, on Mac, Firefox appears to normalize copied text as NFC while Chrome does not), but by downloading the page with cURL and checking the raw bytes I can confirm that there is no difference between those two words :) Something in the author's editing or publishing pipeline is applying normalization and not giving her the end result that she was going for.
00009000: 0a3c 7020 6964 3d22 3066 3939 223e 4361 .<p id="0f99">Ca
00009010: 6e20 796f 7520 7370 6f74 2061 6e79 2064 n you spot any d
00009020: 6966 6665 7265 6e63 6520 6265 7477 6565 ifference betwee
00009030: 6e20 e280 9c62 6cc3 b662 e280 9d20 616e n ...bl..b... an
00009040: 6420 e280 9c62 6cc3 b662 e280 9d3f 3c2f d ...bl..b...?</
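If you want to reproduce the check without cURL, here is a small sketch of what the two forms look like as UTF-8 bytes. The word is the one from the dump; the helper name is made up.

```python
import unicodedata


def hex_bytes(text: str) -> str:
    """Return the UTF-8 encoding of text as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))


word = "bl\u00f6b"  # precomposed o with diaeresis, as in the dump above
nfc_hex = hex_bytes(unicodedata.normalize("NFC", word))
nfd_hex = hex_bytes(unicodedata.normalize("NFD", word))

print(nfc_hex)  # 62 6c c3 b6 62
print(nfd_hex)  # 62 6c 6f cc 88 62
```

Both words in the dump show the c3 b6 sequence, which is the composed form.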
Let's see if I can get HN to preserve the different forms:Composed: ü Decomposed: ü
Edit: Looks like that worked!
https://www.w3.org/TR/2008/REC-xml-20081126/#charsets
XML 1.1 says documents should be normalized but they are still well-formed even if not normalized
https://www.w3.org/TR/2006/REC-xml11-20060816/#sec-normaliza...
But you should not use XML 1.1
https://www.ibiblio.org/xml/books/effectivexml/chapters/03.h...
https://www.w3.org/International/questions/qa-html-css-norma...
Neither does XML (though XML 1.0 recommends that element names SHOULD be in NFC and XML 1.1 recommends that documents SHOULD be fully normalized):
https://www.w3.org/TR/2008/REC-xml-20081126/#sec-suggested-n...
* When I try to preemptively replace ü with ue many institutions and companies refuse to accept it because it does not match my passport
* Especially in France, clerks try to emulate ü with the diacritics used for the trema e, ë. This makes it virtually impossible to find me in a system again
* Sometimes I can enter my name as-is and there seems to be no problem, only for some other system to mangle it to � or a box. This often triggers errors downstream that I have no way of fixing
* Sometimes, people print a u and add the diacritics by hand on the label. This is nice, but still somehow wrong.
I wonder what the solution is. Give up and ask people to consistently use an ASCII-only name? Allow everybody 1000+ Unicode characters as a name and go off that string? Officially change my name?
There is, however, a real ü/ü conundrum, regarding ü-Umlaut and ü-diaeresis. The ü's in the words Müll and aigüe should render differently. The dots in the French word are too close to the letter. In printed French material this is usually not the case.
Unfortunately Unicode does not capture the nuance of the semantic difference between an Umlaut and a Tréma or Diaeresis.
The Umlaut is a letter in its own right with its own space in the alphabet. An ü-Umlaut can never be replaced by an u alone. This would be just as wrong as replacing a p by a q. Just because they look similar does not mean they are interchangeable. [1]
The Tréma on the other hand, is a modifier that helps with proper pronunciation of letter combinations. It is not a letter in its own right, just additional information. It can sometimes move over other adjacent letters (aiguë=aigüe, both are possible) too.
Some say this should be handled by the rendering system similar to Han-Unification, but I strongly disagree with this. French words are often used in German and vice versa. Currently there is no way to render a German loan word with Umlaut (e.g. führer) properly in French.
[1] The only acceptable replacement for ü-Umlaut is the combination ue.
Why don't you normalize latin alphabets filenames for indexing even further -- allow searching for "Führer" with queries like "Fuehrer" and "Fuhrer"?
For more aggressive normalization like that, I think it makes more sense to implement something like a spell checker that suggests similar files.
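A rough sketch of what such index-time folding could look like. The expansion table and function names are assumptions for illustration; a real search would need per-language tables and probably the suggestion mechanism mentioned above.

```python
import unicodedata

# Hypothetical German-style expansions; other languages need other tables.
GERMAN_EXPANSIONS = {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue"}


def strip_marks(text: str) -> str:
    """Drop combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def search_keys(name: str) -> set:
    """Return every folded spelling under which a name should be findable."""
    folded = unicodedata.normalize("NFC", name).casefold()
    expanded = "".join(GERMAN_EXPANSIONS.get(ch, ch) for ch in folded)
    return {strip_marks(folded), strip_marks(expanded)}


def matches(query: str, name: str) -> bool:
    """True if the query and the name share at least one folded spelling."""
    return bool(search_keys(query) & search_keys(name))


stored = "F\u00fchrer"
print(sorted(search_keys(stored)))
print(matches("Fuehrer", stored), matches("Fuhrer", stored))
```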
Edit: reread the article. My comment is silly. UCA is the correct solution to the author's problem.
The current method is much better designed to avoid such problems, and has been supported by all major browsers for quite a while now (the laggard Safari arriving 7 years from this Tuesday).
More like "because it's 2024." This wouldn't be a problem before the complexity of Unicode became prevalent.
It was a problem even before then. It worked fine as long as you had countries that were composed of one dominant ethnicity that shat upon how minorities and immigrants lived (they were just forced to use a transliterated name, which could be one hell of a lot of fun for multi-national or adopted people) - and even that wasn't enough to prevent issues. In Germany, for example, someone had to go up to the highest public-service courts in the late 70s [1] to have his name changed from Götz to Goetz because he was pissed off that computers were unable to store the ö, so he wanted to change his name rather than keep getting misnamed, but German bureaucracy does not like name changes outside of marriage and adoption.
[1] https://www.schweizer.eu//aktuelles/urteile/7304-bverwg-vom-...
For example the Greek letter Big Alpha looks like uppercase A. Or some characters look very similar like the slash and the fraction slash. Yes, Unicode has separate scalar values for them.
There are Open Source tools to handle confusables.
This is in addition to the search specified by Unicode.
Absolute gem. His other talks are entertaining too
</joke>
Why not just do this: string → NFD → strip diacritics → NFC? See [2] for more.
[1] https://github.com/SixArm/sixarm_ruby_unaccent/blob/eb674a78...
More important, the normalization does more than just diacritics. For example, it converts superscript 2 to ASCII 2. A better naming convention probably would have been "string normalize" or "searchable string" or some such, but the naming convention in 2012 was based on Perl.
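For comparison, a minimal sketch of the NFD, strip, NFC pipeline in Python (not the Ruby gem's implementation). It also shows the superscript point: canonical decomposition leaves the superscript two alone, while compatibility decomposition (NFKD) folds it to a plain digit. The sample string is made up.

```python
import unicodedata


def strip_diacritics(text: str, form: str = "NFD") -> str:
    """Decompose, drop combining marks, then recompose what is left."""
    decomposed = unicodedata.normalize(form, text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


sample = "D\u00e4ni\u00ebl\u00e4 m\u00b2"   # name with diaereses, then m and superscript two
canonical = strip_diacritics(sample)             # keeps the superscript two
compatibility = strip_diacritics(sample, "NFKD")  # folds it to "2"

print(canonical)
print(compatibility)
```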
She was called Daniela, but she'd written it "Däniëlä". When my Swedish friend met her in person, having seen her name in the group chat, he said something like "Hej, Dayne-ee-lair right? How was the flight?".
ENSIP-15 Specification: https://docs.ens.domains/ensip/15
ENS Normalization Tool: https://adraffy.github.io/ens-normalize.js/test/resolver.htm...
Browser Tests: https://adraffy.github.io/ens-normalize.js/test/report-nf.ht...
0-dependency JS Unicode 15.1 NFC/NFD Implementation [10KB] https://github.com/adraffy/ens-normalize.js/blob/main/dist/n...
Unicode Character Browser: https://adraffy.github.io/ens-normalize.js/test/chars.html
Unicode Emoji Browser: https://adraffy.github.io/ens-normalize.js/test/emoji.html
Unicode Confusables: https://adraffy.github.io/ens-normalize.js/test/confused.htm...
That's where Unicode lost its way and went into a ditch. Identical glyphs should always have the same code point (or sequence of code points).
Imagine all the coding time spent trying to deal with this nonsense.
Even for just English it doesn't work all that well because it lacks things like the Euro which is fairly common (certainly in Europe), there are names with diacritics (including "native" names, e.g. in Ireland it's common), there are too many loanwords with diacritics, and ASCII has a somewhat limited set of punctuation.
There are some languages where this can sort of work (e.g. Indonesian can be fairly reliably written in just ASCII), although even there you will run into some of these issues. It certainly doesn't work for English, and even less for other Latin-based European languages.
Same for names that don't fit field lengths, addresses that require street numbers etc. It's a real pain to deal with all of it and each system will fail in its own way to make your life a mess, but people will embrace the mess and won't blink an eye when you bring papers that just don't match.
In Unicode, umlaut and diaeresis are both represented by the same code point, U+0308 COMBINING DIAERESIS.
Everyone should be storing strings as UTF-8, and any time strings are being compared they should undergo some form of normalization. Doesn't matter which, as long as it's consistent. There's no reason to store string data in any other format, and any comparison code which isn't normalizing is buggy.
But thanks to institutional inertia, it will be a very long time before everything works that way.
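In Python, "normalize before comparing" is about this much code. A minimal sketch; the function name is made up, and the form is a parameter since the claim is that the choice doesn't matter as long as it's consistent.

```python
import unicodedata


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = "M\u00fcll"      # u with diaeresis as one code point
decomposed = "Mu\u0308ll"   # u followed by U+0308 COMBINING DIAERESIS

print(composed == decomposed)                  # raw comparison
print(same_text(composed, decomposed))         # NFC on both sides
print(same_text(composed, decomposed, "NFD"))  # NFD on both sides
```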
This will result in misprinting Japanese names (or misprinting Chinese names depending on the rest of your system).
> a normative subset of Unicode Latin characters, sequences of base characters and diacritic signs, and special characters for use in names of persons, legal entities, products, addresses etc
My German last name also contains an ü, so when we emigrated to an English-speaking country and obtained dual citizenship we used 'ue' for that passport and I now use 'ue' on a day-to-day basis. This also means I have two slightly different legal surnames depending on which passport I go by.
[0] https://en.wikipedia.org/wiki/Wikipedia:Romanization_of_Russ...
> Officially change my name?
Yes. That's the only one that's going to actually work. You can go on about how these systems ought to work until the cows come home, and I'm sure plenty of people on HN will, but if you actually want to get on with your life and avoid problems, legally change your name to one that's short and ASCII-only.
in the meantime he was unable to own the company he founded (instead made his wife the owner), had a national ID card with a different character, and i am not sure if he had a bank account, but i think the bank didn't care because laws that enforced the names to match the passport/ID only came later. i don't know how the ID didn't automatically imply a name change, but the IDs were issued automatically and maybe he filed a complaint about his name being wrong.
Some form of biometrics to pull up an ID in a globally agreed-upon system is certainly the way forward. Whether or not it is close to what a final solution should be, World ID is making some effort into solving global identification problems https://worldcoin.org/world-id
[1] https://www.icao.int/publications/Documents/9303_p3_cons_en....
https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-...
https://learn.microsoft.com/en-us/windows/win32/fileio/namin...
In something like a code review, people will think you're insane for pointing out that this type of assumption might not hold. Actually, come to think of it, explaining localization bugs at all is a tough task in general.
NFC just means never use combining characters if possible, and NFD means always use combining characters if possible. It has nothing to do with whether something is a "real" letter in a specific language or not.
Whether something is a "real" letter or a letter with a modifier comes into play more in the Unicode Collation Algorithm, which is a separate thing.
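A quick way to see the "if possible" part, sketched with the standard unicodedata module (the helper name is made up). A u with diaeresis has a precomposed code point, so NFC and NFD differ; a q with diaeresis has none, so NFC has to keep the combining mark.

```python
import unicodedata


def code_points(text: str) -> list:
    """List each code point of text with its official Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


u_nfc = unicodedata.normalize("NFC", "u\u0308")
u_nfd = unicodedata.normalize("NFD", "\u00fc")
q_nfc = unicodedata.normalize("NFC", "q\u0308")  # no precomposed form exists

print(code_points(u_nfc))
print(code_points(u_nfd))
print(code_points(q_nfc))
```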
> Characters may not combine well on some computers.
It was easy to detect people typing or editing text on Apple devices because “their” characters appeared broken, unlike usual single codepoints.
So now on macOS you can have a very mixed bag, with some programs normalizing, some not (it's a bug), and many expecting normalized file names.
So it's kinda like Linux now, except a lot of devs assume normalization is happening (and in some cases it still is, when the string passes through certain APIs).
Worse, because normalization is now somewhat application/framework dependent and often goes beyond basic Unicode normalization, it can lead to some not-so-funny bugs.
But luckily most users will never run into any of these bugs, even if they use characters which might need normalization.
And, of course, the Apple fanboys will just shrug and suggest you also convert the rest of the organization to Apple devices, after all, if Apple made a choice, it can't be wrong.
If it's a user choice then CMSs have to be able to deal with all normalisation forms anyway and shouldn't care one bit whether macOS sends NFD or NFC. Mac users could of course complain about their choice not being honoured by macOS but that's of no concern to CMSs.
This rabbit hole goes very, very deep. In Dutch, the digraph IJ is a single letter. In Swedish, V and W are considered the same letter for most purposes (watch out, people who are using the MySQL default utf8_swedish_ci collation). The Turkish dotless i (ı) in its lowercase form uppercases to a normal I, which then does _not_ lowercase back to a dotless i if you're just lowercasing naively without locale info. In Danish, the digraph aa is an alternate way of writing å (which sorts near the end of the alphabet). Hungarian has a whole bunch of bizarre di- and trigraphs IIRC. Try looking up the standard Unicode algorithm for doing case insensitive equality comparison by the way; it's one heck of a thing.
People somehow think that issues like these are only an issue with Han unification or something, but it's all over European languages as well. Comparing strings for equality is a deeply political issue.
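The Turkish case is easy to reproduce with Python's locale-independent str methods. A small sketch of the round-trip loss described above:

```python
dotless = "\u0131"                 # LATIN SMALL LETTER DOTLESS I
upper = dotless.upper()            # plain ASCII "I"
round_trip = upper.lower()         # plain ASCII "i", not the original letter

dotted_capital = "\u0130"          # LATIN CAPITAL LETTER I WITH DOT ABOVE
lowered = dotted_capital.lower()   # "i" followed by U+0307 COMBINING DOT ABOVE

print(repr(upper), repr(round_trip), round_trip == dotless)
print([f"U+{ord(ch):04X}" for ch in lowered])
```

Without knowing the text is Turkish there is no way for lower() to pick the dotless form, which is the locale point made elsewhere in the thread.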
Actually, there is only one trigraph: "dzs", almost exclusively used for representing "j" from English and other alphabets. For example, "Jennifer" is "Dzsennifer" in Hungarian, and "jam" is "dzsem" in the same way.
Trigraphs and digraphs actually make sense, at least to a native speaker, as they really mark sounds similar to what you would expect from combining the given graphs. These letters don't cause too many issues in search in my opinion, but hyphenation is a form of art (see "magyar.ldf" for LaTeX as an example).
To complicate the situation even further we have a/á, e/é, i/í and o/ó/ö/ő and u/ú/ü/ű letters, all of which are considered separate letters, and you can easily type them on a Hungarian desktop keyboard. On the other hand, mobile virtual keyboards usually show a QWERTY/QWERTZ layout where you can only find "long vowels" by long pressing their "short" counterparts, so when you are targeting mobile users you may want to differentiate between "o" and "ö", but not between "o" and "ó" nor between "ö" and "ő".
Unicode shouldn't be responsible for making such searches work, just like it's not responsible for making searches for "analyze" match text that says "analyse".
So now we are saddled with an encoding that has to be bug compatible with any encoding ever designed before.
There's history here, with Unicode originally having just 65k characters, and hindsight is always 20/20, but I do wish there was a move towards deprecating all of this in favour of always using pre-composed.
Also: what you linked isn't "ASCII" and "extended ASCII" doesn't really mean anything. ASCII is a 7-bit character set with 128 characters, and there are dozens, if not hundreds, of 8-bit character sets with 256 characters. Both CP-1252 and ISO-8859-1 saw wide use for Latin alphabet text, but others saw wide use for text in other scripts. So if you give me a document and tell me "this is extended ASCII" then I still don't know how to read it and will have to trial-and-error it.
I don't think Unicode after U+007F is compatible with any specific character set? To be honest I never checked, and I don't see in what case that would be convenient. UTF-8 is only compatible with ASCII, not any specific "extended ASCII".
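Both points can be checked from a Python prompt. A sketch using codec names from the standard library: one byte decodes differently depending on which "extended ASCII" you assume, and the same byte alone is not valid UTF-8. As a side observation, the first 256 Unicode code points do coincide with ISO-8859-1 as code points, though not as UTF-8 bytes.

```python
raw = bytes([0xFC])

# The same byte under four 8-bit character sets.
codecs = ("latin-1", "cp1252", "iso8859_7", "iso8859_5")
decodings = {name: raw.decode(name) for name in codecs}

# The lone byte 0xFC is not a valid UTF-8 sequence.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Code points U+0000..U+00FF line up with ISO-8859-1 byte values.
latin1_matches_unicode = (
    bytes(range(256)).decode("latin-1") == "".join(map(chr, range(256)))
)

print(decodings)
print(valid_utf8, latin1_matches_unicode)
```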
But the English letter "c" and the Russian letter "с" are completely different characters, even if at a glance they look the same - they make completely different sounds, and are different letters. It would be ludicrous to suggest that they should share a single symbol.
Or the whole eh/ye flip En/UK/Ru Eh/е/э Ye/є/е
г/е are unified and that's probably as it should be but there are downsides.
My use case was to thwart spammers in our company’s channels, but I suppose it could be used to also normalize accent encoding issues.
Basically converts a phrase into a regular expression matching confusables.
E.g. "ℍ℮1೦" would match "Hello"
What would you think about this approach: reduce each character to a standard form which is the same for all characters in the same confusable group? Then match all search input to this standard form.
This means "ℍ℮1l೦" is converted to "Hello" before searching, for example.
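A toy sketch of that "standard form" idea. The mapping table below is hand-made for this one example and is not Unicode's confusables data; a real implementation would load the published table instead.

```python
# Tiny hand-made table for illustration only; NOT Unicode's confusables data.
CONFUSABLE_TO_STANDARD = {
    "\u210d": "h",   # DOUBLE-STRUCK CAPITAL H
    "\u212e": "e",   # ESTIMATED SYMBOL
    "1": "l",        # digit one standing in for lowercase L
    "\u0ce6": "o",   # KANNADA DIGIT ZERO
}


def skeleton(text: str) -> str:
    """Map each character to its group's standard form, then case-fold."""
    mapped = "".join(CONFUSABLE_TO_STANDARD.get(ch, ch) for ch in text)
    return mapped.casefold()


def looks_like(a: str, b: str) -> bool:
    """True if two strings reduce to the same standard form."""
    return skeleton(a) == skeleton(b)


spoofed = "\u210d\u212e1l\u0ce6"
print(skeleton(spoofed), looks_like(spoofed, "Hello"))
```

The obvious cost is false positives: with this table "h1" and "hl" are the same string, which is fine for spam matching and less fine for general search.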
If they're truly drawn the same (are they?) then why have a distinct encoding?
For example in Python
>>> "Ᾰ̓ΡΕΤΉ".lower()
'ᾰ̓ρετή'
>>> "AWESOME".lower()
'awesome'
The Greek Α has lowercase form α, whereas the Roman A has lowercase form a.
Another argument would be that you want a distinct encoding in order to be able to sort properly. Suppose we used the same codepoint (U+0050) for everything that looked like P. Then Greek Ρόδος would sort before Greek Δήλος because Roman P is numerically prior to Greek Δ in Unicode, even though Ρ comes later than Δ in the Greek alphabet.
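The sorting argument can be sketched directly with plain code point ordering (which is what Python's sorted() uses for strings). The "unified" spelling is hypothetical: it is what the word would look like if Greek Rho shared Latin P's code point.

```python
rhodes = "\u03a1\u03cc\u03b4\u03bf\u03c2"   # Greek: Rho omicron-tonos delta omicron final-sigma
delos = "\u0394\u03ae\u03bb\u03bf\u03c2"    # Greek: Delta eta-tonos lambda omicron final-sigma

# With separate code points, Delta (U+0394) sorts before Rho (U+03A1).
separate_order = sorted([rhodes, delos])

# Hypothetical unification: Rho replaced by Latin P (U+0050).
unified_rhodes = "P" + rhodes[1:]
unified_order = sorted([unified_rhodes, delos])

print(separate_order)
print(unified_order)
```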
Let’s consider the opposite approach, that any letters that render the same should collapse to the same code point. What about Cherokee letter “go” (Ꭺ) versus the Latin A? What if they’re not precisely the same? Should lowercase l and capital I have the same encoding? What about the Roman numeral for 1 versus the letter I? Doesn’t it depend on the font too? How exactly do you draw the line?
If Unicode sets out to say “no two letters that render the same shall ever have different encodings”, all it takes is one counterexample to break software. And I don’t think we’d ever get everyone to agree on whether certain letters should be distinct or not. Look at Han unification (and how poorly it was received) for examples of this.
To me it’s much more sane to say that some written languages have visual overlap in their glyphs, and that’s to be expected, and if you want to prevent two similar looking strings from being confused with one another, you’re going to have to deploy an algorithm to de-dupe them. (Unicode even has an official list of this called “confusables”, devoted to helping you solve this.)
There are more reasons:
– As a basic principle, Unicode uses separate encodings when the lower/upper case mappings differ. (The one exception, as far as I know, being the Turkish “I”.)
– Unicode was designed for round-trip compatibility with legacy encodings (which weren’t legacy yet at the time). To that effect, a given script would often be added as whole, in a contiguous block, to simplify transcoding.
– Unifying characters in that way would cause additional complications when sorting.
Unicode wants to be able to represent any legacy encoding in a lossless manner. ISO8859-7 encodes Α and A to different code-points, and ISO8859-5 has А at yet another code point, so Unicode needs to give them different encodings too.
And, indeed, they are different letters -- as sibling comments point out, if you want to lowercase them then you wind up with α, a, and а, and that's not going to work very well if the capitals have the same encoding.
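A small sketch of both halves of that argument, using codec names from Python's standard library: the three capitals are three code points with three different lowercase forms, and the Greek and Cyrillic ones survive a round trip through their legacy encodings.

```python
import unicodedata

letters = {"Greek": "\u0391", "Latin": "A", "Cyrillic": "\u0410"}

for script, ch in letters.items():
    lower = ch.lower()
    print(script, f"U+{ord(ch):04X}", unicodedata.name(ch), "->", f"U+{ord(lower):04X}")

# Round trip through the legacy 8-bit encodings mentioned above.
greek_bytes = letters["Greek"].encode("iso8859_7")
cyrillic_bytes = letters["Cyrillic"].encode("iso8859_5")
greek_back = greek_bytes.decode("iso8859_7")
cyrillic_back = cyrillic_bytes.decode("iso8859_5")
```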
It turns out this is complex and controversial enough that the wikipedia page is pretty gigantic.
Consider broadcasting of text in Morse code. The Morse for the Cyrillic letter В is International Morse W.
In the early years of Unicode, conversion from disparate encodings to Unicode was an urgent priority. Insofar as possible, they wanted to preserve the collation properties of those encodings, so the characters were in the same order as the original encoding whenever they could be.
But it's more that Unicode encodes scripts, which have characters, it doesn't encode shapes. With 10,000 caveats to go with that, Unicode is messy and will preserve every mistake until the end of time. But encoding Α and A and А as three different letters, that they did on purpose, because they are three different letters, because they're a part of three different scripts.
They may be drawn the same or similar in some typefaces but not all.
U+2012 FIGURE DASH, U+2013 EN DASH and U+2212 MINUS SIGN all look exactly the same, as far as I can tell. But they have different semantics.
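The different semantics show up in the character database even when the glyphs don't. A quick sketch with unicodedata, adding the ASCII hyphen-minus for comparison:

```python
import unicodedata

dashes = ["\u2012", "\u2013", "\u2212", "-"]
properties = {
    f"U+{ord(ch):04X}": (unicodedata.name(ch), unicodedata.category(ch))
    for ch in dashes
}

for code_point, (name, category) in properties.items():
    print(code_point, name, category)
```

The two dashes come out as dash punctuation (Pd) and the minus sign as a math symbol (Sm).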
For example in Czech, Валерий would be transliterated as Valerij because "j" is pronounced in Czech as English "y" in "you".
So it isn't per se normalization, but it's not not normalization either. In any case (heh) it's a weird thing that probably shouldn't happen. Worth noting that APFS doesn't normalize file names, but normalization happens higher up in the toolchain; this has made some things better and others worse.
The "proper" way of sorting and comparing Unicode strings is part of the standard; it's called the Unicode Collation Algorithm (https://unicode.org/reports/tr10/). It is unwieldy to say the least, but it is tuneable (see the "Tailoring" part) and can be used to implement o/ö equivalence if desired. I think it's great that this algorithm (and its accompanying Common Locale Data Repository) is in the standard and maintained by the consortium, because I definitely wouldn't want to maintain those myself.
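To give a feel for what language-specific ordering changes, here is a toy sort key. To be clear, this is not the UCA and uses no CLDR data; it is a hand-rolled alphabet table (an assumption for illustration) that only handles single lowercase letters, which is roughly why nobody wants to maintain the real thing themselves.

```python
import unicodedata

# Hand-rolled Swedish alphabet order: a-ring, a-diaeresis, o-diaeresis come last.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"
RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def swedish_key(word: str) -> list:
    """Toy sort key: rank each letter by its position in the alphabet table."""
    folded = unicodedata.normalize("NFC", word).casefold()
    return [RANK.get(ch, len(RANK)) for ch in folded]


words = ["\u00e4rta", "\u00e5l", "zebra", "\u00f6l"]
by_code_point = sorted(words)
by_alphabet = sorted(words, key=swedish_key)

print(by_code_point)  # a-diaeresis (U+00E4) lands before a-ring (U+00E5)
print(by_alphabet)    # a-ring first, as in the Swedish alphabet
```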
Now, 'I'.lower() depends on your locale.
A cause for a number of security exploits and lots of pain in regular expression engines.
edit: Well, apparently 'I'.lower() doesn't depend on locale (so it's incorrect for Turkic languages); in JS you have to do 'I'.toLocaleLowerCase('tr-TR'). Regexps don't support it either.
Goethe is so famous that in Heidelberg, Germany, there is a building with a placard that says, "Goethe almost slept here."
It was an inn and he was supposed to spend the night but was unable to.
Potentially OP is talking about a set of requirements he imposed on himself?
Edit: or maybe France? Either way, it's free choice still theoretically. https://en.wikipedia.org/wiki/Naming_law#:~:text=Since%20199....
we did something comparable to make sure our kids had names that transliterated nicely into chinese so that they could use the same or at least a similar name in english and chinese, instead of having two names like it is common for many expats and locals in china.
It's been a while since I last saw it, but it wasn't because of the font since it was published on a Swedish newspaper's website and other texts worked fine.
The font you’re using can (and probably will) rewrite it as 2 glyphs using the GSUB table. This makes sense because it’s a more efficient way to store the drawing operations. The GPOS table is then responsible for handling the offset to put things in their right place.
Main point is that it’s up to the font to move things about.
Now, that may not be what was going on in your case at all but it’s possible.
Because MacOS always uses it, regardless of the user's intention, so it decomposes umlauts into diaereses (despite them having different meanings and pronunciations) and mangles cyrillic, and probably more problems I haven't yet run into.
U+00FC LATIN SMALL LETTER U WITH DIAERESIS
and Unicode Normalization Form D: U+0075 LATIN SMALL LETTER U
U+0308 COMBINING DIAERESIS
Unicode calls these two forms ‘canonically equivalent’.
I guess the official answer is "attempt to distinguish everything that any language is known to distinguish, and then use locales to implement different collation orders by language", or something like that?
But it's still not totally obvious how one could make a principled decision about, say, whether the encoding of Persian and Urdu writing (obviously including their extensions) should be unified with the encoding of Arabic writing. One could argue that Nastaliq is like a "font"... or not...
Many things we might want to do with strings require a locale property, which Unicode tried allowing as an inline representation; this was later deprecated. I'm not convinced that was the correct decision, but it is what it is. If you want to properly handle Turkish casing or Swedish collation, you have to know that the text you're working with is Turkish or Swedish, no way around it.
Figure dash is defined to have the same width as a digit (for use in tabular output). Minus sign is defined to have the same width and vertical position as the plus sign. They may all three differ for typographic reasons.
I've worked on a system that … well, didn't predate Unicode, but was sort of near the leading edge of it and was multi-system.
The database columns containing text were all byte arrays. And because the client (a Windows tool, but honestly Linux isn't any better off here) just took an LPCSTR or whatever, the bytes were just in whatever locale the client was in. But that was recorded nowhere, and of course, all the rows were in different locales.
I think that would be far more common, today, if Unicode had never come along.
ASCII also allowed the characters @[\]^{|}~ to be replaced by others in ‘national character allocations’, and this was commonly used in the 7-bit ASCII era.
In the 8-bit days, for alphabetic scripts, typically the range 0xA0–0xFF would represent a block of characters (e.g. an ISO 8859¹ range) selected by convention or explicitly by ISO 2022². (There were also pre-standard similar methods like DEC NRCS and IBM's EBCDIC code pages.)
I suppose in the 60s/70s it would be in the era of teletypewriters where maybe over striking would more naturally be a thing.
I also found references to less supporting this sort of thing, but seems to be about bold and underline, not accents.
Sort of but not really. The post-2012 residence cards do not display a registered alias anywhere, and since those cards are what banks are required to KYC you on, a lot of banks won't allow you to use a registered alias which in turn means it's hard to use it for anything else (credit cards, phone, pension...). It's very non-joined-up government.
This is a weird formation; "ji" means text. It's half of the half of "emoji" that means text: 絵文字, 絵 [e, "picture"] 文字 [moji, "character", from 文 "text" + 字 "character"].
For example, there's an apartment and office building complex on a site near a historic canal and dam. The building development was named after this site. Then in one of the apartments (CORRECTION: offices), a scandalous political event happened. The complex was called Watergate, the scandal was called Watergate too, and now the suffix -gate is used for scandals.
It was one of the offices, not one of the apartments (specifically, it was series of break-ins to and the wiretapping of the headquarters of the Democratic National Committee by people working for President Nixon’s re-election committee.)
git clone https://github.com/ghurley/encodingtest
Cloning into 'encodingtest'...
remote: Enumerating objects: 9, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 9 (delta 1), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (9/9), done.
Resolving deltas: 100% (1/1), done.
warning: the following paths have collided (e.g. case-sensitive paths
on a case-insensitive filesystem) and only one from the same
colliding group is in the working tree:
'ss'
'ß'
But even more than that, I just don't get how C++ turns into 'C' at all. It seems actively misleading.
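On the 'ss' / 'ß' collision in the clone output: I can't say exactly what the filesystem does internally, but one plausible way to see why a case-insensitive comparison could treat the two names as equal is full Unicode case folding, sketched here with Python's str.casefold().

```python
names = ["ss", "\u00df"]   # the two colliding paths from the clone output

folded = {name: name.casefold() for name in names}
collide = len(set(folded.values())) < len(names)

print(folded)
print(collide)
print(repr("\u00df".lower()), repr("\u00df".upper()))  # lower() keeps it, upper() gives "SS"
```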
APL did use overstriking extensively, and there were video terminals that knew how to compose overstruck APL characters.
I don't see every institution come up with a fix anytime soon, but having it clear that they're breaking the law is such a huge step. That will also have a huge impact on bank system development, and I wonder how they'll do it (extend the current system to have the customer facing bits rewritten, or just redo it all from top to bottom)
There is the tale of Mizuho bank [0], botching their system upgrade project so hard they were still seeing widespread failures after a decade into it.
[0] https://www.japantimes.co.jp/news/2022/02/11/business/mizuho...
It's excellent, but also sad that it takes legislation to motivate companies to fix their crappy legacy systems, and they will likely fight tooth and nail rather than comply.
All the coreutils still can not find strings, just buffers. Zero terminated buffers are NOT strings, strings are unicode.
https://perl11.github.io/blog/foldcase.html
This is not just convenience, it also has spoofing security implications for all names. C and C++11 are insecure since C11. https://github.com/rurban/libu8ident/blob/master/doc/c11.md Most other programming languages and OS kernels also.
I wonder if this also means one can require a European bank to keep a name on file in Kanji, Thai script, or some other alphabet that is not well known in Europe.
As far as the passports go, ICAO 9303-3 allows for latin characters, additional latin characters, such as Þ and ß, and "diacritics", so something not too crazy, i.e. Z̷̪͘a̵͈͘l̷̹̃g̷̣̈́ő̶͍ would still be plausible.
It's not a myth, as anyone living in Japan knows, and the "just use Unicode, all you need is Unicode" dogma is really harmful; a lot of "international" software has become significantly worse for Japanese users since it took hold.
> The problem here is exactly the lack of unification in Roman alphabets!
Problems caused by failing to unify characters that look the same do not mean it was a good idea to unify characters that look different!
The alternative would be that the software used Shift_JIS with a Japanese font. If the software used a Japanese font for Japanese it wouldn't need metadata anyway.
There really isn't a problem with Han unification as long as you always switch to a font appropriate for your language; you don't need to configure metadata. If you don't you are always going to run into missing codepoint problems.
In cases where the system or user configures the font, properly using Unicode is still easier than configuring alternate encodings for multiple languages.
I lived in Japan. It is a myth. :-¥
To a Swede or a Finn, o and ö are different letters, as distinct as a and b (ö sorts at the very end of the alphabet). A search function that mixes them up would be very frustrating. On the other hand, to an American, a search function that doesn't find "coöperation" when you search for "cooperation" is also very frustrating. Back in Sweden, v and w are basically the same letter, especially when it comes to people's last names, and should probably be treated the same. Further south, if you try to lowercase an I and the text is in Turkish (or in certain other Turkic languages), you want a dotless i (ı), not a regular lowercase i. This is extremely spooky if you try to do case-insensitive equality comparisons and aren't paying attention, because if you do it wrong and end up with a regular lowercase i, you've lost information and uppercasing again will not restore the original string.
There are tons and tons of problems like this in European languages. The root cause is exactly the same as the Han unification gripes: Unicode without locale information is not enough to handle natural languages in the way users expect.
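To make the Turkish case concrete, here is a small sketch. Python's built-in `str.lower()` and `str.upper()` are locale-independent, so they show exactly the information loss described above. `lower_turkic` is a hypothetical helper, not a standard function, and it only covers the two I/İ pairs.

```python
DOTLESS_I = "\u0131"   # ı
DOTTED_CAP_I = "\u0130"  # İ


def lower_turkic(text: str) -> str:
    """Hypothetical helper: lowercase with Turkish-style handling of I and İ."""
    # Map the two Turkic-specific pairs first, then use the default rules
    # for every other character.
    return text.replace("I", DOTLESS_I).replace(DOTTED_CAP_I, "i").lower()


# Locale-unaware round trip: ı -> I -> i, so the original letter is gone.
round_trip = DOTLESS_I.upper().lower()
print(round_trip == DOTLESS_I)        # False

# With the Turkic-aware helper the dotless i survives.
print(lower_turkic("I") == DOTLESS_I)  # True
```

A real application would take the language from locale data or from the user rather than hard-coding one rule, which is the point being made above.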
Why not as data tagged with the appropriate language?
Look, we can just disregard The New Yorker entirely and the UX will improve.
Anyway, my point is that perhaps ideally (and maybe search engines do this) the results should be determined by the locale of the searcher. So someone in the English speaking world can find Łódź by searching for Lodz, but a Pole may need to type Łódź. My brother could find Shunin by typing Wyhnh, but a Russian could not…
https://en.wikipedia.org/wiki/Informal_romanizations_of_Cyri...
Maybe we should start modifying the search behavior of English words to make them more convenient for non-native speakers as well. We could start by making "bed aidia" match "bad idea", since both sound similar to my foreign ears.
At the same time, sometimes words containing those letters might show up in context where the user is not familiar with that language. Such users might not know how to enter those letters. They might not even have the capability to type those letters with their installed keyboard layouts. If they are searching for content that contains such letters (e.g. a first name), normalizing them to the visually-closest ASCII is a sensible choice, even if it makes no sense to the speakers of the language.
It's important to understand a situation from different perspectives.
It's not about coming up with a single correct interpretation that makes logical sense. It is about making a system work in the least surprising way for all classes of users.
Diacritics exacerbate this so much as they can be shared between two languages yet have different rules/handling. French typically has a decent amount and they're meaningful, but it traditionally ignores them for comparison (in the dictionary for instance). That makes it more difficult for a dev to have an intuitive feeling of where it matters and where it doesn't.
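For what "normalizing to the visually closest ASCII" could look like, here is a rough sketch: decompose, drop the combining marks, then apply a small hand-written table for letters that have no decomposition. The `EXTRA_FOLDS` table and the function names are assumptions made up for illustration, and the table is nowhere near complete.

```python
import unicodedata

# Hypothetical, deliberately tiny table for letters that NFD does not split.
EXTRA_FOLDS = str.maketrans({"\u0141": "L", "\u0142": "l", "\u00d8": "O", "\u00f8": "o"})


def strip_marks(text: str) -> str:
    """Decompose to NFD and drop combining marks (accents, diaereses, ...)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def ascii_search_key(text: str) -> str:
    """Hypothetical search key for users who cannot type the original letters."""
    return strip_marks(text).translate(EXTRA_FOLDS).casefold()


lodz = "\u0141\u00f3d\u017a"  # Łódź
print(strip_marks("co\u00f6peration"))  # cooperation
print(strip_marks(lodz))                # Ł survives: it has no decomposition
print(ascii_search_key(lodz))           # lodz
```

Whether to apply a key like this at all is exactly the locale question: it helps the searcher typing "Lodz" and would annoy the Swede who expects o and ö to stay distinct.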
The pre-composed characters are necessary only for backwards compatibility.
It is completely unrealistic to expect that Unicode will ever provide all the pre-composed characters that have ever been used in the past or which will ever be desired in the future.
There are pre-composed characters that do not exist in Unicode because they have been very seldom used. Some of them may even be unused in any language right now, but they have been used in some languages in the past, e.g. in the 19th century, but then they have been replaced by orthographic reforms. Nevertheless, when you digitize and OCR some old book, you may want to keep its text as it was written originally, so you want the missing composed characters.
Another case that I have encountered where I needed composed characters not existing in Unicode was when choosing a more consistent transliteration for languages that do not use the Latin alphabet. Many such languages use quite bad transliteration systems, precisely because whoever designed them has attempted to use only whatever restricted character set was available at that time. By choosing appropriate composing characters it is possible to design improved transliterations.
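A short sketch of the two spellings of ü and of a combination that, as far as I know, has no pre-composed code point (q with a diaeresis is my own pick for illustration). It only uses `unicodedata.normalize` from the standard library; `same_text` is a made-up name.

```python
import unicodedata


def same_text(a: str, b: str) -> bool:
    """Compare two strings after bringing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "\u00fc"  # ü as a single code point
combining = "u\u0308"   # u followed by COMBINING DIAERESIS

print(precomposed == combining)           # False: different code point sequences
print(same_text(precomposed, combining))  # True once both are normalized

# NFC composes only where a pre-composed character exists; here it cannot.
q_diaeresis = "q\u0308"
print(len(unicodedata.normalize("NFC", q_diaeresis)))  # 2
```

So normalization gives one canonical form for comparison, while the combining mark keeps working for letters that were never given a code point of their own.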
I agree it's unlikely this will ever happen, but as far as I know there aren't really any serious technical barriers, and from purely a technical point of view it could be done if there was a desire to do so. There are plenty of rarely used codepoints in Unicode already, and while adding more is certainly an inconvenience, the status quo is also inconvenient, which is why we have one of those "wow, I just discovered Unicode normalisation!" (and variants thereof) posts on the front-page here every few months.
Your last paragraph can be summarized as "it makes it easier to innovate with new diacritics". This is actually an interesting point – in the past anyone could "just" write a new character and it may or may not get any uptake, just as anyone can "just" coin a new word. I've bemoaned this inability to innovate before. That is not inherent to Unicode but to computerized alphabets in general, and that composing characters alleviate at least some of it is probably the best reason I've heard for favouring them.
I'm actually also okay with just using composing characters and deprecating the pre-composed forms. Overall I feel that pre-composed is probably better, partly because that's what most text currently uses and partly because it's simpler, but that's the lesser issue – the more important one is that it would be nice to move towards "one obviously canonical" form that everything uses.
Many of the existing typefaces, even some that are quite expensive, do not contain all the pre-composed characters defined by Unicode, especially when those characters have been added in more recent Unicode versions or when they are used only in languages that are not Western European.
The missing characters can be synthesized with composing characters. The alternatives, which are to use a font editor to add characters to the typeface or to buy another more complete and more expensive version of the typeface, are not acceptable or even possible for most users.
Therefore the fact that Unicode has defined composing characters is quite useful in such cases.
The ‘early’ Unicode alphabetic code blocks came from ISO 8859 encodings¹, e.g. the Unicode Cyrillic block follows ISO 8859-5, the Greek and Coptic block follows ISO 8859-7, etc.
But it does, IIRC, for both Bengali and Telugu.
1. Unicode isn't a method of storing pixel or graphic representations of writing systems; it's meant to store text, regardless of how similar certain characters look.
2. What do you do about screen readers & the like? If it encounters something that looks like a little half-moon glyph that's in the middle of a sentence about foreign alphabets that reads "Por ejemplo, la letra 'c'", should it pronounce it as the English "see" or as Russian "ess"?
I'm not sure that that is really possible without something way bigger or more complicated than Unicode. Consider the string "fart". In English that means to emit gas from the anus. In Swedish it means speed. Does that mean Unicode should have separate "f", "a", "r", and "t" for English and Swedish?
> 2. What do you do about screen readers & the like? If it encounters something that looks like a little half-moon glyph that's in the middle of a sentence about foreign alphabets that reads "Por ejemplo, la letra 'c'", should it pronounce it as the English "see" or as Russian "ess"?
What would a human do if that was in a book and they were reading it aloud for a blind friend?
(IIRC, she learned the language entirely from books so has no idea of the correct pronunciation and thinks she's fluent)
2. I think the pronunciation should not be encoded into the text representation on a general scale. You would need different encodings for "though" and "through" in English alone. Your example leaves the meaning open, even if being read as text. If I was the editor, and the distinction was important, I'd change it to "For example, the Cyrillic letter 'c'".
I understand that Unicode provides different code points for same-looking characters, mostly because of history, where these characters came from different code sheets in language-specific encodings.
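To see the separate code points behind same-looking letters, the standard `unicodedata.name` lookup is enough. `describe` is just a throwaway helper name for this sketch.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """List each character as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


latin_c = "c"
cyrillic_es = "\u0441"  # looks like a Latin c in most fonts

print(describe(latin_c))      # ['U+0063 LATIN SMALL LETTER C']
print(describe(cyrillic_es))  # ['U+0441 CYRILLIC SMALL LETTER ES']

# Normalization does not merge them either, even in compatibility form.
print(unicodedata.normalize("NFKC", cyrillic_es) == latin_c)  # False
```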
“Cyrillic” isn't the same everywhere. Bulgarian fonts differ from Russian fonts, some letters are “latinized”, some borrow from handwritten forms:
https://bg.wikipedia.org/wiki/Българска_кирилица
Colored example has the third alternative for Serbian cursive.
So without some external lang metadata we don't even know how your message should look.
However, Russian “Кк” traditionally is different from Latin “Kk” in most recognized families. In the '90s, font designers regularly thrashed ad-hoc font localization attempts that ignored the legacy of the pre-digital era and blindly copied the Latin capital into capital and minuscule forms.
is Incremented C
which is Big C
which is Capital C
It was a joke, by the way.
Similarly to how I'd expect to still get reasonable results if I type "beleive" instead of "believe".
That said, this is obviously pretty context-dependent: in some settings it will make more sense to do an exact-match search, in which case you'd want to differentiate n and ñ (while still handling different possible Unicode variants of ñ if those exist).
This is simply not true. As I've pointed out in a sibling comment, Unicode has a lot of surprising and frustrating behaviors with many European languages as well if you use it without locale data. The characters will look right, but e.g. searching, sorting and case-insensitive comparisons will not work as expected if the application is not locale aware.
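Sorting is an easy one to demonstrate. Below is a sketch comparing plain code point order with a toy Swedish ordering, where å, ä, ö come after z in that order. The alphabet string and `swedish_key` are my own simplification, not a real collation implementation; a real application would use locale-aware collation such as `locale.strxfrm` or ICU, whose availability depends on the system.

```python
# Assumption: a simplified Swedish alphabet, ignoring accents and other rules.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"
RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def swedish_key(word: str) -> list[int]:
    """Toy sort key: rank each letter by its position in the Swedish alphabet."""
    # Characters outside the table sort after everything else.
    return [RANK.get(ch, len(RANK)) for ch in word.casefold()]


words = ["\u00f6l", "\u00e4ng", "\u00e5r", "zebra"]  # öl, äng, år, zebra

print(sorted(words))                   # code point order puts ä before å
print(sorted(words, key=swedish_key))  # Swedish order puts å before ä
```

The characters display correctly either way; only the order is wrong, which is why this kind of bug tends to go unnoticed by developers who don't read the language.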
This is quite a different situation from Japan. A lot of applications don't do searching, sorting, or case-insensitive comparisons, but virtually every application displays text.
As far as I know all Shift_JIS fonts are Japanese; you would have to be wilfully perverse to make one that wasn't.
> If the software used a Japanese font for Japanese it wouldn't need metadata anyway.
If it just uses the system default font for that encoding, as almost all software does, then it will also behave correctly.
> There really isn't a problem with Han unification as long as you always switch to a font appropriate for your language
Right. But approximately no software does that, because if you don't do it then your software will work fine everywhere other than Japan, and even in Japan it will kind-of-sort-of work to the point that a non-native probably won't notice a problem.
> In cases where the system or user configures the font, properly using Unicode is still easier than configuring alternate encodings for multiple languages.
I'm not convinced it is. Configuring your software to use the right font on a Unicode system is, as far as I can see, at least as hard as configuring your software to use the right encoding on a non-Unicode system. It just fails less obviously when you don't, particularly outside Japan.
Most games that I know of that target CJK + English (and are either CJK-developed, or have a local publisher based in East Asia) do indeed switch fonts depending on language (and on TC vs. SC).
> I'm not convinced it is. Configuring your software to use the right font on a Unicode system is, as far as I can see, at least as hard as configuring your software to use the right encoding on a non-Unicode system. It just fails less obviously when you don't, particularly outside Japan.
I'm considering 3 scenarios:
1. You are configuring for the Japanese-speaking market. In which case, fix a font, or fonts.
2. You are localizing into multiple languages and care about localization quality. In which case, yes, you need to know that localization in Unicode is more than just replacing content strings, but this is comparable to dealing with multiple encodings.
3. You are localizing into multiple languages and do not care about localization quality, or Japanese is not a localization target. In which case Japanese (user input / replaced strings) in your app / website will appear childish and shoddy, but it is still a better experience than mojibake.
In any case, it seems to me that it is not a worse experience than pre-Unicode. It's just that people who have no experience in localization expect Unicode systems to do things they cannot do by just replacing strings. You indeed frequently run into issues even in European languages if you just think it's a matter of replacing strings.
Right, because Unicode-based systems don't work well in Japan. E.g. a Unicode-based application framework that ships its own font and expects to use it will display ok everywhere that's not Japan. So Japan is increasingly cut off from the paradigms that the rest of the world is using.
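For scenario 2, a rough sketch of what "switch fonts depending on language" can amount to in code: the text carries a language tag as a separate data field, and the renderer picks a font family from it. The `TaggedText` type, the `font_for` function, and the choice of Noto families in the table are all assumptions for illustration, not how any particular framework does it.

```python
from dataclasses import dataclass

# Assumed mapping from language tag to font family; adjust to the fonts you ship.
FONT_BY_LANG = {
    "ja": "Noto Sans CJK JP",
    "ko": "Noto Sans CJK KR",
    "zh-Hans": "Noto Sans CJK SC",
    "zh-Hant": "Noto Sans CJK TC",
}
DEFAULT_FONT = "Noto Sans"


@dataclass(frozen=True)
class TaggedText:
    """Hypothetical container: the string plus out-of-band language metadata."""
    text: str
    lang: str  # e.g. "ja", "zh-Hant-TW"


def font_for(item: TaggedText) -> str:
    """Pick a font by trying the full tag, then progressively shorter prefixes."""
    tag = item.lang
    while tag:
        if tag in FONT_BY_LANG:
            return FONT_BY_LANG[tag]
        tag = tag.rpartition("-")[0]  # "zh-Hant-TW" -> "zh-Hant" -> "zh" -> ""
    return DEFAULT_FONT


same_code_point = "\u76f4"
print(font_for(TaggedText(same_code_point, "ja-JP")))
print(font_for(TaggedText(same_code_point, "zh-Hant-TW")))
```

The string is identical in both calls; only the metadata differs, which is why software that throws the language tag away has nothing left to choose a font with.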
> Users who need to tag text with the language identity should be using standard markup mechanisms, such as those provided by HTML, XML, or other rich text mechanisms. In other contexts, such as databases or internet protocols, language should generally be indicated by appropriate data fields, rather than by embedded language tags or markup.
(and that emojis have had their positive impact in forcing apps into better Unicode support would be a + for the use of a tag)
As I said though, if you're in full control and only need to be compatible with yourself, you can do whatever you want.
Out-of-band metadata has plenty of other problems besides the fact that it doesn't exist in a lot of cases
Be that as it may, the overwhelming majority of Unicode fonts are dramatically wrong for Japanese and not dramatically wrong for other languages.
> Due to this browsers often have an option to force usage of system fonts and set minimum size to improve readability.
Such options are shrinking IME. E.g. Electron is built on browser internals, but does it offer that option?