Watch out: ɢoogle.com isn’t the same as Google.com (thenextweb.com)
TR36 bidi spoofs are usually worse than TR39 confusables. Hover over it with your cursor. http://www.unicode.org/reports/tr36/#Bidirectional_Text_Spoo...
That's why browsers and DNS tools use libidn; programming languages just don't.
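To make the bidi concern concrete, here is a minimal sketch that flags the Unicode bidirectional control characters in a string. The helper names and the sample file name are made up for illustration, and this is only a crude check, not what libidn or TR36 actually specify.

```ruby
# Explicit bidirectional formatting characters (embeddings, overrides,
# isolates) plus the two implicit directional marks.
BIDI_CONTROLS = /[\u200E\u200F\u202A-\u202E\u2066-\u2069]/

# True if the string contains any bidi control character.
def bidi_controls?(text)
  text.match?(BIDI_CONTROLS)
end

# Replace the invisible characters with visible U+XXXX markers for logging.
def reveal_bidi(text)
  text.gsub(BIDI_CONTROLS) { |ch| format("<U+%04X>", ch.ord) }
end

# Hypothetical sample: U+202E (RIGHT-TO-LEFT OVERRIDE) reverses the display
# order of what follows, so this may render like "invoice_exe.pdf".
spoofed = "invoice_\u202Efdp.exe"
puts bidi_controls?(spoofed)   # true
puts reveal_bidi(spoofed)      # invoice_<U+202E>fdp.exe
```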
Unicode maybe should have been three dimensional, with "concept of G" in the 2D space, and "ways of representing G" behind G, along the third axis. All ways of representing G, whether little capital, capital, lower case, would or at least could equate to conceptual G in the 2D space.
It brings up interesting, long-standing problems. Which of these count as the same letters?
* Letters in two languages with the same appearance and making the same phonetic sound
* Letters in two languages with the same appearance but making slightly different phonetic sounds. E.g., R in English and French
* Letters in two languages that are otherwise the same, but one has an accent. Is the accent part of the letter? Separate? Are they really the same letter?
* Letters in two languages with the same appearance but making completely different phonetic sounds.
* Similar (by any property) letters in two related languages; e.g., both Indo-European
* Similar (by any property) letters in two unrelated languages; e.g., French and Vietnamese.
* Letters with the same phonetic sound but different appearances.
* Letters with the same appearance, one is phonetic and one an ideograph
* Letters that are otherwise identical, but alphabetize differently in their respective languages
* EDIT: Forgot a key one; Letters that are otherwise identical, but follow different rules of how they combine with the letters around them (a common issue, though not familiar to English speakers).
* Letters that are in all ways identical but belong to different languages. In which language's code group does the letter belong? One? Both? What if the subset of Unicode supported by an application includes one language but not the other?
etc. etc.
It gets worse than this. Example: the letters Ä and Ö exist in both Swedish and German (as an example).
In German they are actually counted as the letters A and O with diaereses above them, and they alphabetize together with other instances of the letters A and O, because that's what they are.
In Swedish those are their own letters, which are completely separate from the letters A and O. They get their own place in the alphabet (second-to-last and last, respectively), and replacing them with AE and OE is technically not acceptable in Swedish like it is in German (though it's often done anyway, e.g. on airline tickets).
And in Unicode they are represented by the same code-point even though in one language it is a letter, and in the other language it's only a variation on another letter. What a mess.
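A rough sketch of what that means for sorting: the same code points need different sort keys depending on the language. The two key functions below are simplified assumptions (real collation is handled by something like CLDR/ICU, which Ruby's standard library does not include), and the word list is just an example.

```ruby
# Swedish: Å, Ä, Ö are separate letters placed after Z.
SWEDISH_ORDER = "abcdefghijklmnopqrstuvwxyzåäö".chars.each_with_index.to_h

# German dictionary-style sorting (assumed here): Ä, Ö, Ü sort as A, O, U.
GERMAN_FOLD = { "ä" => "a", "ö" => "o", "ü" => "u", "ß" => "ss" }.freeze

# Sort key: position of each letter in the Swedish alphabet
# (unknown characters sort last).
def swedish_key(word)
  word.downcase.chars.map { |ch| SWEDISH_ORDER.fetch(ch, SWEDISH_ORDER.size) }
end

# Sort key: the word with umlauts folded onto their base letters.
def german_key(word)
  word.downcase.gsub(/[äöüß]/, GERMAN_FOLD)
end

words = %w[Zebra Ärger Apfel Öl Ofen]
p words.sort_by { |w| swedish_key(w) }  # ["Apfel", "Ofen", "Zebra", "Ärger", "Öl"]
p words.sort_by { |w| german_key(w) }   # ["Apfel", "Ärger", "Ofen", "Öl", "Zebra"]
```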
The cases where Unicode has taken similar-looking characters and combined them into one have not been successful. Han unification, for example, was widely viewed as a misstep and has caused many problems, such as making it impossible to embed certain Japanese characters in Chinese text without higher-level markup.
https://en.wikipedia.org/wiki/Unicode_equivalence
As mentioned by others on this thread, the real issue is not with Unicode per se, but rather with the ways that web browsers handle it (or fail to handle it, as the case may be).
Any ideas for how to accomplish this in practice?
But it would be wrong to use them in this case because an IPA G and the letter G are semantically different things and should not be unified into a single character just because they look similar.
It's not necessarily the case that any given symbol has a bunch of different Unicode representations; unfortunately G has at least two, though.
https://en.m.wikipedia.org/wiki/IPA_Extensions
http://www.fileformat.info/info/unicode/block/ipa_extensions...
This can of course be used in a malicious way. I thought about rebuilding the homepage of the bank Credit Suisse on www.credit-siusse.ch, but that's probably illegal.
the screenshot on your kmap repo[1] was dead as well, until i actually opened it. i'm guessing the jpg isn't generated until somebody clicks on it.
enough cyberstalking for me this evening :p
Why are there multiple representations of alphabet characters in Unicode? It seems reasonable to include accent marks, but what's the benefit in having a Cyrillic 'o' alongside a standard 'o' or the 2 or 3 other ASCII-lookalike sets of characters?
http://money.get.away.get.a.good.job.with.jack.ilovevitaly.com
The actual domain is http://xn--oogle-wmc.com/ which is an internationalized domain name[1] in Punycode transcription
[1] https://en.wikipedia.org/wiki/Internationalized_domain_name
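For anyone curious where the "oogle-wmc" part comes from, here is a sketch of the Punycode encoding step from RFC 3492. The constants are the ones given in the RFC; the function names are my own, overflow checks are left out, and real IDNA processing also maps and validates the label before encoding, which this skips.

```ruby
# Punycode parameters from RFC 3492, section 5.
BASE = 36
TMIN = 1
TMAX = 26
SKEW = 38
DAMP = 700
INITIAL_BIAS = 72
INITIAL_N = 128

# Map a digit value 0..35 to its character: a-z, then 0-9.
def digit_char(d)
  (d < 26 ? 97 + d : 22 + d).chr
end

# Bias adaptation, RFC 3492 section 6.1.
def adapt(delta, num_points, first_time)
  delta = first_time ? delta / DAMP : delta / 2
  delta += delta / num_points
  k = 0
  while delta > ((BASE - TMIN) * TMAX) / 2
    delta /= BASE - TMIN
    k += BASE
  end
  k + ((BASE - TMIN + 1) * delta) / (delta + SKEW)
end

# Encode one label (no dots) into its Punycode form, without the xn-- prefix.
def punycode_encode(label)
  input = label.codepoints
  basic = input.select { |c| c < 0x80 }
  output = basic.map(&:chr).join          # ASCII characters are copied first
  output << "-" unless basic.empty?       # delimiter before the encoded part
  n = INITIAL_N
  delta = 0
  bias = INITIAL_BIAS
  h = b = basic.length
  while h < input.length
    m = input.select { |c| c >= n }.min   # next code point to insert
    delta += (m - n) * (h + 1)
    n = m
    input.each do |c|
      delta += 1 if c < n
      next unless c == n
      q = delta
      k = BASE
      loop do                             # emit delta as a variable-length integer
        t = if k <= bias then TMIN
            elsif k >= bias + TMAX then TMAX
            else k - bias
            end
        break if q < t
        output << digit_char(t + (q - t) % (BASE - t))
        q = (q - t) / (BASE - t)
        k += BASE
      end
      output << digit_char(q)
      bias = adapt(delta, h + 1, h == b)
      delta = 0
      h += 1
    end
    delta += 1
    n += 1
  end
  output
end

# Hypothetical wrapper: add the ACE prefix only when the label needs it.
def to_ascii(label)
  label.ascii_only? ? label : "xn--" + punycode_encode(label)
end

puts to_ascii("\u0262oogle")   # xn--oogle-wmc
puts to_ascii("\u0262")        # xn--1na
```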
The G in question here is
https://en.wiktionary.org/wiki/%C9%A2
OR
This Vitaly guy…
I got tons of referral header spam (that shows up in e.g. Google Analytics) for all sorts of social media buttons and EU cookie law scare tactic sites. And then there was Vitaly who just spammed me with ilovevitaly.com, which if I recall correctly actually was a site about himself at the time.
http://money.get.away.get.a.good.job.with.more.pay.and.you.are.okay.money.it.is.a.gas.grab.that.cash.with.both.hands.and.make.a.stash.new.car.caviar.four.star.daydream.think.i.ll.buy.me.a.football.team.money.get.back.i.am.alright.jack.ilovevitaly.com/#.keep.off.my.stack.money.it.is.a.hit.do.not.give.me.that.do.goody.good.bullshit.i.am.in.the.hi.fidelity.first.class.travelling.set.and.i.think.i.need.a.lear.jet.money.it.is.a.secret.%C9%A2oogle.com/#.share.it.fairly.but.dont.take.a.slice.of.my.pie.money.so.they.say.is.the.root.of.all.evil.today.but.if.you.ask.for.a.rise.it%27s.no.surprise.that.they.are.giving.none.and.secret.%C9%A2oogle.com
Or, in Bruce Schneier's words: "Unicode is just too complex to ever be secure."
You really need to support this 'sub café {} café()' => Undefined subroutine café in your friendly and social programming language, otherwise you will be accused of discrimination. Of course the two é are not normalized.
Which Unicode-friendly language really checks for mixed-script confusables? Only Java, is my guess. Even Perl 6 falls into this trap.
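The café case is easy to reproduce. A small sketch in Ruby, where `identifier_key` is a hypothetical helper and not something any language runtime is claimed to do for you:

```ruby
composed   = "caf\u00E9"    # é as a single code point (U+00E9)
decomposed = "cafe\u0301"   # e followed by COMBINING ACUTE ACCENT (U+0301)

puts composed == decomposed              # false: different code point sequences
puts composed.codepoints.length          # 4
puts decomposed.codepoints.length        # 5

# Hypothetical helper: normalize to NFC before using a name as a lookup key.
def identifier_key(name)
  name.unicode_normalize(:nfc)
end

puts identifier_key(composed) == identifier_key(decomposed)   # true
```

Note that normalization only handles canonically equivalent spellings of the same text. It does nothing about confusables from different scripts.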
Or how about the word "gullible" isn't in the dictionary?
See https://www.w3.org/International/articles/idn-and-iri/
and https://wiki.mozilla.org/IDN_Display_Algorithm
plus http://www.chromium.org/developers/design-documents/idn-in-g...
The whole point of getting unicode into domain names is so we can have 新浪首页.com so that it's no longer a latin alphabet centric system.
It seems that putting the allowed character set into the tld would be a pretty user-friendly way of doing that.
Edit: As an added bonus, tlds are centrally managed, and are already western/latin encoded. So why not customize it with a localized abbreviation for the language or tld type?
Actually, some of these would probably be nice aliases for some math / science oriented sites.
E.g. - .com
https://www.𝙿𝙰𝚈𝙿𝙰𝙻.com/
And yet when I paste this into the latest Firefox it redirects to https://www.paypal.com/
No 301 redirects or anything; the browser just treats it like ASCII, which it clearly is not. It actually happens to be Fullwidth:
https://en.wikipedia.org/wiki/Fullwidth_form
Serious phishing opportunity if you ask me!
ICANN requires that registries follow RFC 3491 and related RFCs for name prep before allowing a name to be registered: https://www.icann.org/resources/unthemed-pages/idn-guideline... What that one does is (among other things) NFKC normalization and case-folding:
irb(main):016:0> "\ufeff\uff30\uff21\uff39\uff30\uff21\uff2c"
=> "PAYPAL"
irb(main):017:0> "\ufeff\uff30\uff21\uff39\uff30\uff21\uff2c".unicode_normalize(:nfkc).downcase
=> "paypal"
This:
www.ｐａｙｐａｌ.com
or this:
www.ＰＡＹＰＡＬ.com
would be fullwidth.
What you actually posted are characters in the Mathematical Alphanumeric Symbols block. Specifically:
𝙿 — U+1D67F MATHEMATICAL MONOSPACE CAPITAL P
𝙰 — U+1D670 MATHEMATICAL MONOSPACE CAPITAL A
𝚈 — U+1D688 MATHEMATICAL MONOSPACE CAPITAL Y
𝙻 — U+1D67B MATHEMATICAL MONOSPACE CAPITAL L
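That also explains the redirect: these characters carry compatibility decompositions to plain ASCII letters, so NFKC plus lowercasing turns the label into "paypal". A quick sketch using Ruby's built-in normalization, which only illustrates the mapping and is not the browser's actual code path:

```ruby
# P, A, Y, P, A, L from the Mathematical Alphanumeric Symbols block.
lookalike = "\u{1D67F}\u{1D670}\u{1D688}\u{1D67F}\u{1D670}\u{1D67B}"

# Show what each code point normalizes to under NFKC.
lookalike.each_char do |ch|
  puts format("U+%04X -> %s", ch.ord, ch.unicode_normalize(:nfkc))
end

folded = lookalike.unicode_normalize(:nfkc).downcase
puts folded               # paypal
puts folded.ascii_only?   # true
```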
That said, I haven't done anything with it, and I'm not a domain squatter, so if anyone wants it I can hook you up!
Showing non-ascii in red would be an easy solution for everybody.
I've yet to see a useful site with a Cyrillic domain. Theoretically it sounds good, but practically everyone still uses Latin domains. Maybe it'll change with time.
- Average User
Would be annoying if [name].me or whatever is red!
Give me a popup warning explaining the problem when I try to visit the site, same as I get for certificate problems.
'ɢ' is obviously an exception since (I imagine) it's considered to be in your locale, but maybe it shouldn't be.
The characters are from the Latin character set, but non-ASCII. Highlighting the Å in red would look pretty confusing. And in many countries you want the entire domain name written in non-ASCII characters, depending on the language. E.g. websites in Russia, China, India, etc...
Here are some contexts in which this semantic difference is important: search (compare search results for "cop" and "сор"), alphabetical sorting, text-to-speech, spellchecking, case conversion ("ATOM" -> "atom", but "АТОМ" -> "атом", note the difference between t-т and m-м).
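The case-conversion point can be checked directly. A small sketch, assuming Ruby 2.4 or later (where `downcase` does full Unicode case mapping by default):

```ruby
latin    = "ATOM"                       # all ASCII
cyrillic = "\u0410\u0422\u041E\u041C"   # АТОМ: Cyrillic A, Te, O, Em

puts latin == cyrillic    # false, even though the two may render identically
puts latin.downcase       # atom
puts cyrillic.downcase    # атом

# The code points make the difference visible.
p latin.codepoints.map { |c| format("U+%04X", c) }
p cyrillic.codepoints.map { |c| format("U+%04X", c) }
```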
I never rely on Unicode for computation. When receiving Unicode I always make sure it's in the ASCII range. It could be argued that there should never have been Unicode domain names but I guess Western people are very lucky that ASCII includes most of their characters...
Please don't spread the myth of Western languages being encodable in ASCII, and don't pretend to support Unicode (or anything else than English) if you filter everything to ASCII.
The _only_ Western language that is encodable in ASCII is English.
Corollary: English is the only language that can be encoded in ASCII.
The other western languages have endless issues with text being encoded/stripped down to ASCII. e.g. French, Spanish, Portuguese, German...
If you're transcribing a conversation at the UN and there is a mix of different languages, the fact that "Het" is transcribed in the Latin character set is information. Het may be a southern American group of people, or it could just be a Russian dude saying "no", even if it looks the same.
I understand that we're still burdened by intralanguage homonyms, but I appreciate the fact that it isn't complicated further.
how many languages even check for mixed script confusables? for dynamic languages this check is much too expensive, but they are leading the "good cause", allowing everything, and checking nothing.
I use http://unicode-table.com to help figure out what's what. The official Unicode specification[1] is impenetrable, and it's really hard to deal with.
I have heard proposals that mixed-script IDNs get converted to punycode in URL display, but I don't know if any browser has fully implemented that yet.
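A crude version of that mixed-script test can be built on regex script properties. This is a sketch with only four scripts and made-up helper names. Real browser policies (see the Mozilla and Chromium links in this thread) are considerably more involved.

```ruby
# A few Unicode script properties, as supported by Ruby's regex engine.
SCRIPTS = {
  latin:    /\p{Latin}/,
  cyrillic: /\p{Cyrillic}/,
  greek:    /\p{Greek}/,
  han:      /\p{Han}/
}.freeze

# Names of the listed scripts that occur in the label.
def scripts_in(label)
  SCRIPTS.select { |_name, pattern| label.match?(pattern) }.keys
end

# True if the label mixes two or more of the listed scripts.
def mixed_script?(label)
  scripts_in(label).length > 1
end

p scripts_in("google")              # [:latin]
p scripts_in("g\u043E\u043Egle")    # [:latin, :cyrillic] (Cyrillic о twice)
p mixed_script?("\u0262oogle")      # false: U+0262 is itself a Latin-script letter
```

The last line is the catch: ɢoogle is all Latin script, so a mixed-script rule alone would not flag it.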
Maybe as part of the locale configuration, in addition to number and date format, people should pick a friendly and an offensive color! :)
But that was not my point. The point was about identifiers, such as DNS names.
H4xx0r j0k3s is all it is.
I've never seen anyone go out of their way to defend or use leet speak all day long.
e.g. a site selling them in the UK is promoting "JO66 ERX", which is probably supposed to be read as "Jogger X". Current bid £750, for some reason.
xn--1na
[0] https://www.punycoder.com/
Something like www.paypal.com --> www.n--pal-n76secrc.com
Source: I'm colorblind (protanope) and red would definitely be an issue. Android Studio, for example, is really annoying for me because the particular red they use for errors is very hard to distinguish from black.
Especially in this case, where there is unlikely to be a specialized class of scammers who go phishing only for people with red-green colorblindness. So long as browsers implement a feature that stops the phishing in 99% of cases, the scammers will try something else.
Compare to Chrome's HTTPS indicator: it turns the "https://" part of the URL green (which I can barely distinguish as different, so it is useless to me) and adds a padlock icon.
Colorblind-friendly graphs might use both color and symbols to distinguish elements.
Significantly less common than red/green deficiency, though - I only know of one more on the island I live on (pop. 15,000 or so)
Non-latin alphabet domain names do have legitimate uses, although they are very rarely used.
https://newrepublic.com/article/117608/chinese-number-websit...
I am not claiming that everyone speaks a language that is representable in the Latin alphabet.
We as (technical) humans can recognize (hence this discussion) that the use of this uncommon G is meant to mislead you into thinking you're going to Google, when in fact you're going to Hell. I'd like to be warned of that possibility.
In this case, the extremely oversimplified algorithm might be "does the domain, as filtered down to canonical characters, represent one of the top five destination domains, yet go somewhere else if not canonicalized?"
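That oversimplified algorithm might look roughly like the sketch below. The lookalike table and the list of popular domains are tiny hand-picked placeholders (a real check would use something like the TR39 confusables data), and `impersonates` is a made-up name.

```ruby
# Tiny hand-picked lookalike table (an assumption for illustration only).
LOOKALIKES = {
  "\u0262" => "g",  # LATIN LETTER SMALL CAPITAL G
  "\u043E" => "o",  # CYRILLIC SMALL LETTER O
  "\u0430" => "a",  # CYRILLIC SMALL LETTER A
  "\u0435" => "e",  # CYRILLIC SMALL LETTER IE
  "\u0440" => "p"   # CYRILLIC SMALL LETTER ER
}.freeze

# Placeholder list standing in for "the top five destination domains".
POPULAR = %w[google.com paypal.com facebook.com youtube.com amazon.com].freeze

# Returns the popular domain that `domain` imitates, or nil if none.
def impersonates(domain)
  # What name prep would register anyway: NFKC plus lowercasing.
  plain = domain.unicode_normalize(:nfkc).downcase
  # The same name with known lookalikes folded to their ASCII twins.
  target = plain.chars.map { |ch| LOOKALIKES.fetch(ch, ch) }.join
  return nil if plain == target           # nothing was folded: no imitation
  POPULAR.include?(target) ? target : nil
end

p impersonates("\u0262oogle.com")   # "google.com"
p impersonates("Google.com")        # nil
p impersonates("example.org")       # nil
```

A browser could use a non-nil result to trigger the kind of warning asked for above, instead of colouring every non-ASCII character.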
Russians would definitely be pissed though.
Kudos to my employer, though - after some discussion, I was given a small budget and our SCADA GUI frontends now sport colour palettes optimized for deuteranopes, protanopes and tritanopes.
We've received some very grateful feedback - and, unsurprisingly, quite a bunch of 'Gee, did you have some colorblind sod do your GUIs? My display looks like a Grateful Dead cover!' from people who've inadvertently messed with accessibility settings...