Watch out: ɢoogle.com isn’t the same as Google.com

Watch out: ɢoogle.com isn’t the same as Google.com (thenextweb.com)

206 points by lucodibidil 9 years ago | 133 comments

rurban 9 years ago |

What about ‮goog‬le.com which is really <U+202E>goog<U+202C>le.com :)

TR36 bidi spoofs are usually worse than TR39 confusables. Move your cursor over it. http://www.unicode.org/reports/tr36/#Bidirectional_Text_Spoo...

That's why browsers and DNS tools use libidn; programming languages just don't.
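A minimal sketch of how such a spoof could be detected, assuming you only want to flag the explicit bidi control characters (the helper name `find_bidi_controls` is made up for illustration; this is not what libidn does internally):

```python
import unicodedata

# Bidi classes of the explicit embedding, override and isolate controls.
BIDI_CONTROL_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}

def find_bidi_controls(text):
    """Return (index, 'U+XXXX', bidi class) for each explicit bidi control in text."""
    hits = []
    for i, ch in enumerate(text):
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in BIDI_CONTROL_CLASSES:
            hits.append((i, f"U+{ord(ch):04X}", bidi_class))
    return hits

# The string from the comment above: <U+202E>goog<U+202C>le.com
spoofed = "\u202egoog\u202cle.com"
print(find_bidi_controls(spoofed))
print(find_bidi_controls("google.com"))
```

Anything this returns for a hostname is a good reason to refuse or escape the name before showing it to a user.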

a3n 9 years ago |

This is strange to me. This is clearly meant, in unicode, to be 'G' that we all know and love. It has uselessly expanded "the alphabet" (to be western-centric) in a confusable way.

Unicode maybe should have been three dimensional, with "concept of G" in the 2D space, and "ways of representing G" behind G, along the third axis. All ways of representing G, whether little capital, capital, lower case, would or at least could equate to conceptual G in the 2D space.

hackuser 9 years ago | |

> Unicode maybe should have been three dimensional, with "concept of G" in the 2D space, and "ways of representing G" behind G, along the third axis. All ways of representing G, whether little capital, capital, lower case, would or at least could equate to conceptual G in the 2D space.

It brings up interesting, long-standing problems. Which of these count as the same letters?

* Letters in two languages with the same appearance and making the same phonetic sound

* Letters in two languages with the same appearance but making slightly different phonetic sounds. E.g., R in English and French

* Letters in two languages that are otherwise the same, but one has an accent. Is the accent part of the letter? Separate? Are they really the same letter?

* Letters in two languages with the same appearance but making completely different phonetic sounds.

* Similar (by any property) letters in two related languages; e.g., both Indo-European

* Similar (by any property) letters in two unrelated languages; e.g., French and Vietnamese.

* Letters with the same phonetic sound but different appearances.

* Letters with the same appearance, one is phonetic and one an ideograph

* Letters that are otherwise identical, but alphabetize differently in their respective languages

* EDIT: Forgot a key one; Letters that are otherwise identical, but follow different rules of how they combine with the letters around them (a common issue, though not familiar to English speakers).

* Letters that are in all ways identical but belong in different languages. In which language's code group does the letter belong? One? Both? What if the subset of Unicode supported by an application includes one language but not the other?

etc. etc.

vurpo 9 years ago | | |

> Letters in two languages that are otherwise the same, but one has an accent. Is the accent part of the letter? Separate? Are they really the same letter?

It gets worse than this. Example: the letters Ä and Ö exist in both Swedish and German (as an example).

In German they are actually counted as the letters A and O with diaereses above them, and they alphabetize together with other instances of the letters A and O, because that's what they are.

In Swedish those are their own letters, which are completely separate from the letters A and O. They get their own place in the alphabet (second-to-last and last, respectively), and replacing them with AE and OE is technically not acceptable in Swedish like it is in German (though it's often done anyway, e.g. on airline tickets).

And in Unicode they are represented by the same code-point even though in one language it is a letter, and in the other language it's only a variation on another letter. What a mess.
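To make the difference concrete, here is a rough sketch of the two orderings applied to the same word list. Both key functions are simplified assumptions (real collation is locale data, not a few lines of code): `german_key` just strips the diaeresis, and `swedish_key` assumes every word uses only the 29 letters of the Swedish alphabet.

```python
import unicodedata

# a-z followed by å, ä, ö, which sort after z in Swedish.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"

def german_key(word):
    """Treat Ä/Ö as A/O with a mark: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def swedish_key(word):
    """Treat å, ä, ö as separate letters placed after z."""
    composed = unicodedata.normalize("NFC", word.lower())
    return [SWEDISH_ALPHABET.index(ch) for ch in composed]

# Same code points, two different orders.
words = ["Zebra", "\u00c4rger", "Apfel", "\u00d6l", "Ofen"]
print(sorted(words, key=german_key))   # Ä next to A, Ö next to O
print(sorted(words, key=swedish_key))  # Ä and Ö after Z
```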

jahewson 9 years ago | |

That character is from the phonetic alphabet so it's not the "concept of G", it's the concept of a "voiced uvular stop", which happens to look visually like G. So what Unicode is doing is separating two conceptually different ideas, exactly as intended.

The cases where Unicode has taken similar-looking characters and combined them into one have not been successful. Han Unification, for example, was widely viewed as a misstep and has caused many problems, such as making it impossible to embed certain Japanese characters in Chinese text without higher-level markup.

stevenbedrick 9 years ago | |

It actually does do something along those lines, with the "canonical" and "compatible" equivalence rules:

https://en.wikipedia.org/wiki/Unicode_equivalence

As mentioned by others on this thread, the real issue is not with Unicode per se, but rather with the ways that web browsers handle it (or fail to handle it, as the case may be).
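A small illustration of the two equivalence levels, using Python's standard `unicodedata` module. The helper names are mine, and the sample strings are only examples:

```python
import unicodedata

def canonically_equal(a, b):
    """Canonical equivalence: same text after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def compatibility_equal(a, b):
    """Compatibility equivalence: same text after NFKC normalization."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

precomposed = "caf\u00e9"   # é as a single code point
combining = "cafe\u0301"    # e followed by COMBINING ACUTE ACCENT
fullwidth = "\uff30\uff21\uff39\uff30\uff21\uff2c"  # fullwidth PAYPAL

print(precomposed == combining)                   # raw comparison differs
print(canonically_equal(precomposed, combining))  # canonical forms match
print(canonically_equal(fullwidth, "PAYPAL"))     # not canonically equivalent
print(compatibility_equal(fullwidth, "PAYPAL"))   # compatibility forms match
```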

zokier 9 years ago | | |

I think it is very much an issue in Unicode that they did not define the NFKD of ɢ to be G. As far as I can tell, the rationale is that ɢ is semantically different because it is used in IPA. I find that pretty weak, considering the ubiquity of smallcaps. Asking browsers to diverge (as far as equivalence goes) from Unicode standards sounds a lot like a failure of Unicode.
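This is easy to check with Python's standard `unicodedata` module (a quick sketch; results reflect whatever Unicode version your interpreter bundles, and the helper name is made up):

```python
import unicodedata

SMALL_CAP_G = "\u0262"  # the character in the spoofed domain

def survives_normalization(ch):
    """True if none of the four normalization forms changes the character."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return all(unicodedata.normalize(form, ch) == ch for form in forms)

print(unicodedata.name(SMALL_CAP_G))
print(repr(unicodedata.decomposition(SMALL_CAP_G)))  # empty: no mapping defined
print(survives_normalization(SMALL_CAP_G))

# So normalizing and case-folding the spoofed name never yields the real one.
spoofed = unicodedata.normalize("NFKD", SMALL_CAP_G + "oogle.com").casefold()
print(spoofed == "google.com")
```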

spullara 9 years ago | | |

The web browser or DNS?

drewmate 9 years ago | |

That's a really interesting proposal, but I'm afraid it would be difficult to implement in practice. If this third dimension were actually encoded into the number that represents each character, you'd end up with a lot of wasted bits (since most characters probably wouldn't even need the 3rd dimension, or at least as much of it as the heaviest users.) Another option would be to supplement the metadata that already accompanies Unicode characters (which block it is in, the name of the character/block, etc...) This could be done in practice now, but the information would almost certainly just be ignored if it needed to be looked up in a supplemental table. Furthermore, it's difficult to agree on just about anything in Unicode, and classifying all the characters based on concept seems like a Herculean task for a slow-moving body.

Any ideas for how to accomplish this in practice?

a3n 9 years ago | | |

I'll get to that as soon as I make email secure by design.

jahewson 9 years ago | | |

This already exists in Unicode: it's called "Variation Selectors". They have their own block and are used to choose between the text and emoji presentation of a character, amongst other things.

But it would be wrong to use them in this case because an IPA G and the letter G are semantically different things and should not be unified into a single character just because they look similar.

Lagged2Death 9 years ago | |

The G is part of a block called "IPA extensions." Most of its content is more obviously specialized. This G is a phonetic G.

It's not necessarily the case that any given symbol has a bunch of different Unicode representations; unfortunately G has at least two, though.

https://en.m.wikipedia.org/wiki/IPA_Extensions

http://www.fileformat.info/info/unicode/block/ipa_extensions...

donquichotte 9 years ago |

Some time ago I registered http://www.goolge.io/. Still haven't done anything with it, I guess at some point I'll just redirect it to duckduckgo. [EDIT: now it's redirected to duckduckgo.]

This can of course be used in a malicious way. I thought about rebuilding the homepage of the bank Credit Suisse on www.credit-siusse.ch, but that's probably illegal.

Entangled 9 years ago |

Web browsers should have an option to show non-ascii chars in urls in red.
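Roughly what that could look like in a terminal, as a sketch. The function name is hypothetical, and the ANSI escape codes stand in for whatever styling a browser would actually use:

```python
RED = "\033[31m"    # ANSI escape: switch to red text
RESET = "\033[0m"   # ANSI escape: back to the default style

def highlight_non_ascii(url):
    """Wrap every non-ASCII character of url in red ANSI escape codes."""
    parts = []
    for ch in url:
        if ord(ch) < 128:
            parts.append(ch)
        else:
            parts.append(f"{RED}{ch}{RESET}")
    return "".join(parts)

# Only the first letter of the spoofed name gets flagged.
print(highlight_non_ascii("\u0262oogle.com"))
print(highlight_non_ascii("google.com"))
```

This flags every legitimate non-ASCII domain too, so on its own it is a blunt instrument.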

cjrd 9 years ago |

Proud owner of http://gïthub.com checking in...

y4mi 9 years ago | |

the visiblend screenshot on your projects page is dead because of an unresolvable DNS href.

the screenshot on your kmap repo[1] was dead as well, until i actually opened it. i'm guessing the jpg isn't generated until somebody clicks on it.

enough cyberstalking for me this evening :p

[1] https://github.com/cjrd/kmap

yamaneko 9 years ago | |

Awesome site, by the way. I'm just checking out your tutorial on LDA.

TazeTSchnitzel 9 years ago |

https://en.wikipedia.org/wiki/IDN_homograph_attack

talideon 9 years ago | |

Most registries did a better job on constructing their IDN tables than Verisign did. :-(

orbitur 9 years ago |

This is something that's been bugging me for years.

Why are there multiple representations of alphabet characters in Unicode? It seems reasonable to include accent marks, but what's the benefit in having a Cyrillic 'o' alongside a standard 'o' or the 2 or 3 other ASCII-lookalike sets of characters?

ergot 9 years ago |

For me it just redirects to

    http://money.get.away.get.a.good.job.with.jack.ilovevitaly.com

The actual domain is http://xn--oogle-wmc.com/

Which is an Internationalized domain name[1] in punycode transcription

[1] https://en.wikipedia.org/wiki/Internationalized_domain_name

The G in question here is

https://en.wiktionary.org/wiki/%C9%A2

http://charcod.es/#%C9%A2/610
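For anyone who wants to reproduce those values, here is a sketch using Python's built-in `punycode` codec. `to_ace` and `to_ace_domain` are hypothetical helpers that only do the raw Punycode step; they skip the mapping and validation that a real IDNA implementation performs.

```python
from urllib.parse import quote

def to_ace(label):
    """Encode one DNS label as xn--<punycode>; pure-ASCII labels pass through."""
    if all(ord(ch) < 128 for ch in label):
        return label
    return "xn--" + label.encode("punycode").decode("ascii")

def to_ace_domain(domain):
    """Apply to_ace to each dot-separated label of a domain name."""
    return ".".join(to_ace(label) for label in domain.split("."))

small_cap_g = "\u0262"
print(ord(small_cap_g))    # decimal code point, as in the charcod.es link
print(quote(small_cap_g))  # percent-encoded UTF-8, as in the wiktionary link
print(to_ace_domain(small_cap_g + "oogle.com"))
```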

underyx 9 years ago | |

>ilovevitaly.com

This Vitaly guy…

I got tons of referral header spam (that shows up in e.g. Google Analytics) for all sorts of social media buttons and EU cookie law scare tactic sites. And then there was Vitaly who just spammed me with ilovevitaly.com, which if I recall correctly actually was a site about himself at the time.

ergot 9 years ago | | |

Wow what an odd site

cdubzzz 9 years ago | |

Interesting, this domain now redirects to:

    http://money.get.away.get.a.good.job.with.more.pay.and.you.are.okay.money.it.is.a.gas.grab.that.cash.with.both.hands.and.make.a.stash.new.car.caviar.four.star.daydream.think.i.ll.buy.me.a.football.team.money.get.back.i.am.alright.jack.ilovevitaly.com/#.keep.off.my.stack.money.it.is.a.hit.do.not.give.me.that.do.goody.good.bullshit.i.am.in.the.hi.fidelity.first.class.travelling.set.and.i.think.i.need.a.lear.jet.money.it.is.a.secret.%C9%A2oogle.com/#.share.it.fairly.but.dont.take.a.slice.of.my.pie.money.so.they.say.is.the.root.of.all.evil.today.but.if.you.ask.for.a.rise.it%27s.no.surprise.that.they.are.giving.none.and.secret.%C9%A2oogle.com

Kenji 9 years ago |

Unicode URLs are the devil. Too many indistinguishable characters. URLs should stay full ASCII imho. And I say that as someone whose language requires non-ASCII symbols.

Or, in Bruce Schneier's words: "Unicode is just too complex to ever be secure."

rurban 9 years ago | |

But think about the poor underrepresented folks using foreign character sets?

You really need to support this 'sub café {} café()' => Undefined subroutine café in your friendly and social programming language, otherwise you will be accused of discrimination. Of course the two é are not normalized.

Which Unicode-friendly language actually checks for mixed-script confusables? Only Java is my guess. Even perl6 falls into this trap.

http://unicode.org/reports/tr39/#Mixed_Script_Confusables
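A crude sketch of a mixed-script check in Python. The standard library does not expose the Unicode Script property, so this guesses the script from the first word of each character's name; that is my shortcut for illustration, not the TR39 algorithm, and both function names are made up.

```python
import unicodedata

def scripts_used(label):
    """Approximate the scripts in label via the first word of each letter's name."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def is_mixed_script(label):
    """True if the letters of label appear to come from more than one script."""
    return len(scripts_used(label)) > 1

print(is_mixed_script("google"))         # all Latin
print(is_mixed_script("go\u043egle"))    # Latin plus a Cyrillic o
print(is_mixed_script("\u0262oogle"))    # the small capital G is Latin too
```

Note the last case: the IPA small capital G is itself a Latin letter, so a script check alone would not catch the domain this thread is about.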

palunon 9 years ago | | |

When it is just accents, it's ok. But when your users have a language that uses a radically different alphabet, sometimes they can't even read Latin transliterations.

underyx 9 years ago |

It was a pretty nice surprise that when sending this URL in Slack it was automatically converted to `xn--oogle-wmc.com`.

Fiahil 9 years ago | |

Slack is not doing anything. It's Google chrome filling up your clipboard with the "extended" version of the url.

underyx 9 years ago | | |

But when I paste it in the Slack message box it shows the ɢoogle.com version.

seagreen 9 years ago | |

The fact that we need application-specific security measures against this just emphasizes the problem. There are a lot of applications.

SamWhited 9 years ago |

There has been talk at the IETF of redefining IDNA2008 (the current way you prevent issues like this) in terms of the PRECIS framework (RFC 7564). This wouldn't exactly "solve" the problem, but it would mean that IDNA could be more agile with respect to Unicode versions and would make it easier to react to new problems, new confusable characters, etc. as they happen.

vbezhenar 9 years ago |

What about Googlé.com and the infinite number of other variations?

StavrosK 9 years ago | |

Why is everyone thinking so small? What about https://www.goоgle.com?

Or how about the word "gullible" isn't in the dictionary?

http://www.dictionary.com/browse/gulliblе

tlrobinson 9 years ago | | |

Not sure why you're getting downvoted, people seem to have missed your clever use of the Cyrillic "o".

koliber 9 years ago | | |

Would it be possible to register a .xn--cm-fmc TLD and have a .cоm registry all of your own?

bmmayer1 9 years ago | | |

Stupid question, how did you do that? What characters are you using?

vbezhenar 9 years ago | | |

I think, it's impossible to register this domain.

joncrocks 9 years ago |

I believe now that browsers have support for non-ASCII URLs, each of them has schemes for anti-phishing.

See https://www.w3.org/International/articles/idn-and-iri/

and https://wiki.mozilla.org/IDN_Display_Algorithm

plus http://www.chromium.org/developers/design-documents/idn-in-g...

77pt77 9 years ago | |

Browsers have supported this for almost a decade.

hannele 9 years ago |

Ahh, the old classic, PayPaI: https://en.wikipedia.org/wiki/PayPaI (uppercase 'i')

alessioalex 9 years ago |

This just redirects me to http://xn--oogle-wmc.com/ so I know it's not the real google (using Chrome).

cesis 9 years ago |

Why isn't Google Analytics filtering out this referral spam?

akerro 9 years ago | |

It's literally not their job to filter referrals... they do the opposite, they collect referrals.

jahewson 9 years ago |

Browsers already blacklist many visually similar characters, it seems that the IPA characters need to be added to that list.
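A toy version of that idea, as a sketch: fold each name to a "skeleton" and compare. The `LOOKALIKES` table here is a hypothetical, hand-picked three-entry subset for illustration only; real implementations would use the much larger confusables data published with TR39, and I am not claiming any browser works exactly this way.

```python
import unicodedata

# Hypothetical, hand-picked lookalike table (illustration only).
LOOKALIKES = {
    "\u0262": "g",  # LATIN LETTER SMALL CAPITAL G
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
}

def skeleton(name):
    """Normalize, case-fold, then replace known lookalikes with their ASCII twin."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    return "".join(LOOKALIKES.get(ch, ch) for ch in folded)

def looks_like(candidate, protected):
    """True if candidate differs from protected but has the same skeleton."""
    return candidate != protected and skeleton(candidate) == skeleton(protected)

print(looks_like("\u0262oogle.com", "google.com"))
print(looks_like("go\u043egle.com", "google.com"))
print(looks_like("example.com", "google.com"))
```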

chaz6 9 years ago |

I thought there were supposed to be registry rules preventing similar-looking names from being registered as IDNs. I guess not.

shshhdhs 9 years ago | |

I believe they aren't preventative measures, but responsive. So if Google contacts ICANN, then they may do something about it.

darkr 9 years ago | | |

Some registries do this automatically. Some don't.

talideon 9 years ago | |

Yes and no. One of the problems is that Verisign's handling of IDNs wasn't exactly the best conceived, which left them with silly IDN codepoint tables like this: https://www.iana.org/domains/idn-tables/tables/com_latn_1.2....

Programmatic 9 years ago |

I'm not sure how feasible this is, but wouldn't it make sense for .com/.net/etc to be latin alphabet only and allow other domains to be localized with unicode? I wouldn't really have a problem with 新浪首页.cn, and I doubt I would confuse ɢoogle.ru or whatever with google.com

barkingcat 9 years ago | |

That defeats the purpose of an internationalized dns system.

The whole point of getting unicode into domain names is so we can have 新浪首页.com so that it's no longer a latin alphabet centric system.

Programmatic 9 years ago | | |

Doesn't that yield a whole class of problems though that we're trying to solve with obtuse solutions such as "let's make that character set in red so people don't get phished"? How is that any more international and/or easy to use?

It seems that putting the allowed character set into the tld would be a pretty user-friendly way of doing that.

Edit: As an added bonus, tlds are centrally managed, and are already western/latin encoded. So why not customize it with a localized abbreviation for the language or tld type?

Roboprog 9 years ago |

Cool! I want a cool non-alpha unicode domain. I guess "square-root" is already taken, but there must be some cool domains left (even though nobody can actually type them in).

Actually, some of these would probably be nice aliases for some math / science oriented sites.

E.g. - .com

Roboprog 9 years ago | |

Meh. Markup ate my "radioactive pie" (9762 dec / 2622 hex) symbol :-(

hannele 9 years ago |

I'm curious, why is it allowed to register domain names with mixed character sets? I am behind allowing Unicode characters in domain names for the obvious reasons, but are there compelling use cases for allowing them to be mixed?

klodolph 9 years ago | |

Technically, Unicode is only one character set. If you want to disallow mixing, you have to disallow it on some other basis, like script. There are many edge cases to consider, though, and many legitimate reasons to mix scripts.

reacweb 9 years ago |

Maybe browsers should have a security option to whitelist characters in URLs. When a URL uses another character, there would be popups with explanations and choices.

transfire 9 years ago |

Oh, you mean Unicode Sucks(TM)? Yes. Yes it does.

    irb(main):016:0> "\ufeff\uff30\uff21\uff39\uff30\uff21\uff2c"
    => "ＰＡＹＰＡＬ"
    irb(main):017:0> "\ufeff\uff30\uff21\uff39\uff30\uff21\uff2c".unicode_normalize(:nfkc).downcase
    => "paypal"