Google T5 scores 88.9 on SuperGLUE Benchmark, approaching human baseline (super.gluebenchmark.com)

303 points by alexwg 6 years ago | 235 comments

I didn't know anything about SuperGLUE before (turns out it's a benchmark for language understanding tasks), so I clicked around their site where they show different examples of the tasks.

One "word in context" task is to look at 2 different sentences that have a common word and decide if that word means the same thing in both sentences or different things (more details here: https://pilehvar.github.io/wic/)

One of their examples, though, didn't make any sense to me:

1. The pilot managed to land the airplane safely

2. The enemy landed several of our aircrafts

It says that the word "land" does NOT mean the same thing in those sentences. I am a native English speaker, and I honestly don't understand what they are thinking the second sentence means. Shot them down? If so, I have never heard "landed" used in that context, and it appears neither has Merriam-Webster. Also, the plural of aircraft is just "aircraft", without the s.

rladd 6 years ago | |

My mother got a perfect 800 score on the GRE English test many years ago when she wanted to go back to graduate school after her children were grown up enough (high school/college age).

She told me that the way she got her perfect score was by realizing when the questions were wrong and thinking of what answer the test creators believed to be correct.

She had to outguess the test creators and answer the questions wrong -- in the "right" way.

This seems like a similar situation.

Huppie 6 years ago | | |

I've had the 'pleasure' of taking some 'Microsoft certifications' at various companies I worked at in the past and this sounds extremely familiar.

"I probably won't ever do it like that and/or there's a syntax error in all four of the answers... but this is the answer you want to hear. It's wrong, mind you, but it's what you want to hear."

qubex 6 years ago | | |

I have achieved similar results by similar means in both English and certain other subjects wherein one would assume a “true academic” would “know better” (picking out Sin[x]=2 as being “evidence of error in prior working” when x could merely be Complex, or marking “f[f[n]]=-n” as “unsolvable” when it just requires a bit of lateral thinking). This always depresses me, like when (as a Brit) I hear Americans say “I could care less” as an indicator of disregard, when actually that indicates they are somewhere above the point of minimal regard.
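To make the two puzzles in the comment above concrete, here is a minimal Python sketch. It assumes the first puzzle is sin(x) = 2 over the complex numbers and the second asks for a function on the integers with f(f(n)) = -n. The name `f` and this particular construction are illustrative assumptions; other solutions exist.

```python
import cmath
import math

# sin(z) = 2 has no real solution, but it does have complex ones.
# One of them is z = pi/2 - i*ln(2 + sqrt(3)), because cosh(ln(2 + sqrt(3))) = 2.
z = complex(math.pi / 2, -math.log(2 + math.sqrt(3)))
residual = abs(cmath.sin(z) - 2)  # distance from 2, up to floating-point error


def f(n: int) -> int:
    """One possible integer function with f(f(n)) == -n (illustrative)."""
    if n == 0:
        return 0
    if n > 0:
        # positive odd -> next even; positive even -> negated previous odd
        return n + 1 if n % 2 == 1 else -(n - 1)
    # negative odd -> next more-negative even; negative even -> positive odd
    return n - 1 if n % 2 != 0 else -(n + 1)
```

The trick for `f` is to pair each odd number with the next even number and rotate through the cycle n, n+1, -n, -(n+1), so applying `f` twice always lands on the negation.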

conjectures 6 years ago | | |

This does seem like the meta solution to most tests, particularly standardised tests :)

saagarjha 6 years ago | | |

That's how I got through the SAT…

throwaway936482 6 years ago | |

I think this is really interesting, because "the enemy landed several of our aircraft(s)" is the sort of sentence I'd have hauled a student up for using as a teacher, because 1) it's a none standard, arguably incorrect usage they've used either because they're a none native speaker or because they're trying to be clever and failing, and 2) because the plural of aircraft is aircraft. Nevertheless the author of this sentence almost certainly meant land to mean something different (shot down) than the author of the first, and we can infer the author's intended meaning despite the none standard usage. This poorly written sentence is the sort of thing you see all the time in the real world, especially from none native speakers, children, and people writing about a topic outside their expertise. If a program can spot the difference in the usage of the word land between these two sentences and infer what the intended meaning in the second sentence is, then it's doing pretty well. Just inferring that land is used to mean something different in the two sentences is less impressive but still pretty cool and I'm not sure which claim is being made.

HereBeBeasties 6 years ago | | |

If you teach others English, please learn the difference between "none" and "non". You mean "non-standard" in all your examples here (if British) or perhaps "nonstandard" (if American).

saalweachter 6 years ago | | |

As someone who spends a lot of time puzzling out intent, I would infer they are using "landed" to mean "grounded" in that context.

_II__II_ 6 years ago | |

The example directly below that: "Justify the margins" and "The end justifies the means" is the one I find dubious. Obviously the former could mean to format a document, but those exact words in that structure could be a demand for someone to justify a financial margin for example. It is both true and false depending on the context.

andrewstuart2 6 years ago | | |

One of my favorite examples that I heard in a David Rock talk which I can no longer find on youtube: "Time flies like an arrow":

Time moves swiftly and in one direction.

Record the speed of flies in the same way you would an arrow.

Time flies, which are a kind of fly, are fond of an arrow. (e.g. Time flies like an arrow, fruit flies like a banana).

nebulous1 6 years ago | | |

I'm guessing this is intentional. To a human, although this could be somebody being asked to justify their financial margins that's not a very likely answer. The human can easily see that, while it's possible they're the same meaning, given the lack of any other context the answer is that they're not.

The enemy could have landed several of our aircraft on one of their runways. Agassi may have beaten Becker over the head with his tennis racket. I suspect part of the test is that there can be other meanings that do technically work.

hn_throwaway_99 6 years ago | | |

This is a good point I hadn't thought of. Honestly, I'm really not surprised anymore that the humans only scored 89%.

sjg007 6 years ago | | |

The ends justify the means.

eindiran 6 years ago | |

The second one means "the enemy successfully got several of our aircrafts".

Specifically, definition 3a or 3b for the verb form here: https://www.merriam-webster.com/dictionary/land

So potentially the enemy captured the aircraft (3a) or destroyed them (3b).

topspin 6 years ago | | |

Would a native English speaker use the word "landed" in this way? In the context of aircraft? "Landed" is badly ambiguous here and several distinct meanings are plausible. Captured is the most natural word given your interpretation.

Honestly that sentence -- the use of landed and that awful plural -- approaches engrish. Is that deliberate or is the use of English here just badly flawed? I can't see any other possibilities.

LaMarseillaise 6 years ago | | |

If taking the "captured" interpretation, I think it could be reasonably inferred that they successfully landed the aircraft at an airfield afterwards (same meaning). This was my initial read of it and it does not seem strange to me on reflection.

I would like also to point out that even if we do interpret the second as meaning "destroyed", the first could then be interpreted as a combat aviator shooting down an opposing aircraft, bringing us back to the same meaning. Or perhaps both of my interpretations are correct and the meanings are different...

What this tells me is that the benchmark is not very useful.

steve19 6 years ago | | |

My immediate thought was captured, i.e. "Iran successfully landed our UAV by transmitting false GPS data".

This language is used on the Wikipedia page about that incident.

https://en.m.wikipedia.org/wiki/Iran–U.S._RQ-170_incident

Someone 6 years ago | | |

Aircraft typically get captured on the ground, or get forced to land by threat of being shot down. “Landed”, for me, would require the enemy to actively land the plane, just as “landing a fish” requires both the fisherman’s action and moving the fish from water to land.

I also wouldn’t use “landed” for destroying an enemy plane (neither by shooting it down nor by destroying it on the ground)

That, realistically, leaves hacking the plane’s electronics and then directing it to one’s own airfield.

natch 6 years ago | |

So many options for sentence number two.

- The enemy stole the aircrafts, and after some drama in flight managed to land several of them.

- The enemy used remote control to force them to land.

- The enemy used coercive force to force our pilots to land them.

- The enemy captured them.

- The enemy shot them down.

- During a friendly event while we set our differences with our enemy aside and agreed to fly each other's aircraft at an airshow for some reason, we landed several of theirs, and they landed several of ours.

- There was a hearing mistake and "energy" (as in energy beam beamed by a UFO) was accidentally transcribed as "enemy."

- The writer is just screwing with us.

- The writer is not a native speaker of English, and they made a mistake and actually meant that the enemy boarded several of our (parked) aircrafts.

- The writer is creative with language and believes that it would be cute to say that when an enemy projectile struck one of our aircrafts, then the enemy has "landed" that aircraft as one would land men on the moon or land rovers (no pun intended) on Mars.

ethbro 6 years ago | | |

- An ML algorithm from the future traveled back in time, writing specific SuperGLUE examples to poison AI research, thereby preventing the emergence of a competitive AI which would also master the secrets of closed timelike curves

mcphage 6 years ago | | |

> then the enemy has "landed" that aircraft as one would land men on the moon or land rovers (no pun intended) on Mars

Or perhaps as one would land a punch.

matthewowen 6 years ago | |

I think it's landed in the same sense as "landed a deal": got, or achieved, in this case achieving shooting them down.

For me, my first read of the sentence would definitely be that it means shot down.

hn_throwaway_99 6 years ago | | |

Ahh, just found an example where that's taken from: https://glosbe.com/en/en/land. If you search on that page you'll see the exact sentence "the enemy landed several of our aircraft" (without the s after aircraft) which it says means "shoot down".

I have still never heard landed used in that way, and again in other dictionaries I searched I couldn't find that definition either. Thus, this is a case where the "AI" may get it "right", and I, the human, would get it "wrong", but that still feels like it's missing a huge point. It feels like you could get a number of errors by the human which the AI gets "right", when in fact the human is better able to detect what is rare, uncommon or at least ambiguous.

otterpop 6 years ago | | |

so this is why the human score is 89.8 :)

theaeolist 6 years ago | | |

> I think it's landed in the same sense as "landed a deal": got, or achieved, in this case achieving shooting them down.

My buddy is a pilot and they always say "I landed the takeoff pretty good. PRETTY GOOD!"

bobbyi_settv 6 years ago | |

I've never seen "landed" used as in the second sentence, but I was definitely able to understand from context that it was not being used to mean the same thing as in the first sentence.

amingilani 6 years ago | | |

Have you ever "landed" a deal? Or "landed" first strike in a game?

archontes 6 years ago | | |

You've never landed a fish?

brainless 6 years ago | |

I think the difference in these sentences is about the way to land. In sentence 1, the pilot of the aircraft is in control. In sentence 2, the pilots are not in control, the enemy forced them to land (whatever the means).

If I read these two sentences in context of some news, they would evoke very different "landing" scenes in my head.

mohaine 6 years ago | | |

#2 is the same as landing a fish. i.e. to place on land what doesn't belong on land.

im3w1l 6 years ago | |

The only possible explanation I can think of is this.

3. a : to catch and bring in

// land a fish

b : gain, secure

// land a job landed the leading role

imagine enemy soldiers capturing a base or hangar ship including the aircraft.

It's kind of a stretch though.

mooman219 6 years ago | |

This is definitely where the 10.2% of human failures are.

jcims 6 years ago | |

In looking through many of the replies to this downstream, it appears that the system is actually correct in that there's an obscure use of 'land' at play in the second sentence.

It makes me think that there's going to be many adversarial examples of text that humans parse one way because of common usage while machines parse another way because of details like this.

devin 6 years ago | | |

Colorless green ideas sleep furiously!

Search for it if you’re interested in its origin.

rahimnathwani 6 years ago | |

For #2, my immediate read was that the planes had been shot down. If the context were to suggest that the enemy had somehow hijacked the planes, then of course the word land would mean the same in both sentences.

I have never used or heard 'land a plane' in this context, but the sentence didn't immediately strike me as unnatural, incorrect or unclear.

taneq 6 years ago | | |

> I have never used or heard 'land a plane' in this context, but the sentence didn't immediately strike me as unnatural, incorrect or unclear.

It struck me as pretty awkward and very ambiguous. It probably means 'obtained' but 'captured' would be a far better word in that case. The suggestions that it means 'hit/shot' don't work because in that case it's not the aircraft that is landed but the shot, which is landed on the aircraft.

Also the use of the incorrect plural "aircrafts" when 'aircraft' is both singular and plural makes me think it's just a poor question.

The very fact that there's so much discussion about it is evidence that it's not straightforward even among native English speaking humans.

tanilama 6 years ago | |

It means 'succeeded in shooting down' right? Seems pretty contextual, but understandable.

vonseel 6 years ago | | |

Seems like a really odd way of saying it but that’s what I’d think too, as in “landed their shots”.

This is either a poor question, or a really great question, if the goal of the test is to confuse computers where a human would normally say “huh, weird way of saying that but I guess they mean...”.

mannykannot 6 years ago | |

From the abstract of the associated paper: "performance on the benchmark has recently surpassed the level of non-expert humans, suggesting limited headroom for further research."

It occurred to me that hn_throwaway_99's question, and the responses to it, are the sort of dialog in which one could find additional headroom for further research into natural language understanding. We can understand, for example, that while the two uses of 'landed' are different, they are not completely unrelated, and we can explain how they are related, for example by introducing a third construct, 'landed a fish', as a couple of replies have done.

briga 6 years ago | | |

Limited headroom? Seems like they're assuming greater-than-human language ability is just impossible and will never be surpassed.

jasonlotito 6 years ago | |

So, here is the thing. ML shouldn't just be about learning rules. It should be about actually learning, and understanding.

Even though you've never heard the word used that way, you were able to infer it meant something different. Even with the use of aircrafts.

We all make mistakes when writing or speaking. We don't let that get in the way of interpreting the information being passed. Even if we post comments that contain errors.

SiVal 6 years ago | |

Yes, the second should be, "The enemy downed several of our aircraft." Landed can be used to mean "bagged," as in, "We finally landed the Smith account," (it's a fishing term), but it should not be used in this figurative sense when referring to aircraft, because of the obvious confusion with the common, concrete sense of the word. And, yes, it should be aircraft.

throwaway1777 6 years ago | |

The fact this comment sparked so much discussion with some agreeing and some disagreeing says to me that Google did about as well as a human.

alex_young 6 years ago | |

The plural of aircraft is aircraft. Not aircrafts. https://www.grammar-monster.com/plurals/plural_of_aircraft.h...

Maybe the _examples_ for a language test should be grammatically correct?

Al-Khwarizmi 6 years ago | | |

It depends on what your goal is. But in most cases, I'd say no. If the goal has anything to do with understanding real language written by real humans, it's better for the system to be able to handle texts with errors.

BiasRegularizer 6 years ago | | |

True, but having some noise in the label is actually good for generalization. If it's only learned on perfectly correct sentences then its tolerance for mistakes will be very low.

jasonlotito 6 years ago | | |

Maybe the examples for a language test should use language that people actually use every day.

didibus 6 years ago | |

It's weird, because I understood the second one as meaning shoot down, yet to me that's the same definition of landed. You just assume the enemy didn't land them gracefully without a scratch, because they are well, enemies.

So I would have answered that the word meant the same thing.

seanwilson 6 years ago | |

> One "word in context" task is to look at 2 different sentences that have a common word and decide if that word means the same thing in both sentences or different things (more details here: https://pilehvar.github.io/wic/)

Can anyone explain what makes this difficult for a machine? What existing knowledge does the machine start with? At a glance, it doesn't feel like it should be difficult if the machine had a large corpus to train on that showed many examples of each word in different contexts.

randcraw 6 years ago | |

1) The pilot [voluntarily] brought down his aircraft.

2) The pilots [involuntarily] brought down their aircraft [because some authority figure(s) forced them down.]

The active verb 'land' can be performed by different actors: pilot vs a more powerful agent (usually who flies an armed aircraft). The voluntary/involuntary agency is a subtle difference that only those familiar with this military practice are likely to grok.

ebg13 6 years ago | |

> I am a native English speaker, and I honestly don't understand what they are thinking the second sentence means

Clearly the enemy conferred lesser nobility and commensurate landownership unto said aircrafts. https://en.wikipedia.org/wiki/Landed_gentry

mrosett 6 years ago | |

I believe it’s “land” in the sense of “land a fish” (or a prize in general) which is a less common but legitimate usage.

gradstudent 6 years ago | |

Perhaps the enemy obtained several of our aircraft. In the same sense as one might land a new car in a contest.

6gvONxR4sf7o 6 years ago |

One thing to always point out in these cases is that the human baseline isn't "how well people do at this task," like it's often hyped to be. It's "how well does a person quickly and repetitively doing this do, on average." The 'quickly and repetitively' part is important because we all make more boneheaded errors in this scenario. The 'on average' part is important because the errors the algo makes aren't just fewer than people, they're different. The algos often still get certain things wrong that humans almost never would.

This is really really super great, let's be clear. It's just not up to the hype "omg super human" usually gets.

TheOtherHobbes 6 years ago | |

It seems to mean "How well does Mechanical Turk do the task?" which is a separate thing again. And yes - error type is at least as revealing as error frequency.

I have no idea where the real human baseline is, or how to find it.

Also, consider this discussion. GLUE winners may be able to make informed parsing guesses about single text blocks, but they're years away from being able to make a useful contribution to a discussion like this one.

IshKebab 6 years ago | |

Regarding the type of errors, it seems like the benchmark should be able to take that into account. That is, get a load of humans to do the task on the same specific examples, then for each example you know how hard it is, and what acceptable answers are (I bet a lot of the ground truth is wrong or ambiguous).

Then you can benchmark your AI but penalise it more heavily for getting things wrong that are obvious to a human.
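The idea in the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions: each example carries several human labels, the majority label is treated as the reference answer, and the share of annotators who agree with it is used as the weight. The function names `human_agreement` and `weighted_score` are hypothetical and not part of SuperGLUE.

```python
from collections import Counter


def human_agreement(labels):
    """Fraction of annotators who chose the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def weighted_score(examples, predictions):
    """Accuracy in which each example is weighted by human agreement.

    `examples` is a list of per-example lists of human labels.
    Examples that humans find obvious (high agreement) count for more,
    so a model error on them costs more than an error on a contested one.
    """
    total = 0.0
    earned = 0.0
    for labels, pred in zip(examples, predictions):
        majority = Counter(labels).most_common(1)[0][0]
        weight = human_agreement(labels)
        total += weight
        if pred == majority:
            earned += weight
    return earned / total


# One unanimous example and one contested example (3 votes to 2).
examples = [["same"] * 5, ["different"] * 3 + ["same"] * 2]
missed_contested = weighted_score(examples, ["same", "same"])
missed_obvious = weighted_score(examples, ["different", "different"])
```

With these toy labels, missing the contested example still earns 1.0 of a possible 1.6, while missing the unanimous one earns only 0.6, which is the asymmetry the comment asks for.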

6gvONxR4sf7o 6 years ago | | |

That would be ideal, if money weren't a factor. Since money is a factor, I wonder what the tradeoff is between labelling each instance N more times versus just getting N times more instances labeled.

Pahr3yah 6 years ago | |

In the context of GPT2 someone coined the expression "Humans Who Are Not Concentrating Are Not General Intelligences"

The_Amp_Walrus 6 years ago | | |

I think it was this blogger: https://www.google.com/amp/s/srconstantin.wordpress.com/2019...

jcims 6 years ago | |

Great point! It makes sense in the context of what these algorithms would generally be tasked with.

pmoriarty 6 years ago |

There was an article[1] posted to HN recently about these benchmarks, and it was pretty skeptical.

Regarding SuperGLUE specifically, it asked:

"Indeed, Bowman and his collaborators recently introduced a test called SuperGLUE that's specifically designed to be hard for BERT-based systems. So far, no neural network can beat human performance on it. But even if (or when) it happens, does it mean that machines can really understand language any better than before? Or does just it mean that science has gotten better at teaching machines to the test?"

[1] - https://www.quantamagazine.org/machines-beat-humans-on-a-rea...

RcouF1uZ4gsC 6 years ago |

I think classifying this as human level is misleading.

Look at the sub-scores on the page. One score that looks very different from humans is AX-b.

The SuperGLUE paper provides more context about AX-b

https://arxiv.org/pdf/1905.00537.pdf

AX-b "is the broad-coverage diagnostic task, scored using Matthews’ correlation (MCC). "

This is how the paper describes this test

" Analyzing Linguistic and World Knowledge in Models GLUE includes an expert-constructed, diagnostic dataset that automatically tests models for a broad range of linguistic, commonsense, and world knowledge. Each example in this broad-coverage diagnostic is a sentence pair labeled with a three-way entailment relation (entailment, neutral, or contradiction) and tagged with labels that indicate the phenomena that characterize the relationship between the two sentences. Submissions to the GLUE leaderboard are required to include predictions from the submission’s MultiNLI classifier on the diagnostic dataset, and analyses of the results were shown alongside the main leaderboard. Since this broad-coverage diagnostic task has proved difficult for top models, we retain it in SuperGLUE. However, since MultiNLI is not part of SuperGLUE, we collapse contradiction and neutral into a single not_entailment label, and request that submissions include predictions on the resulting set from the model used for the RTE task. We collect non-expert annotations to estimate human performance, following the same procedure we use for the main benchmark tasks (Section 5.2). We estimate an accuracy of 88% and a Matthew’s correlation coefficient (MCC, the two-class variant of the R3 metric used in GLUE) of 0.77. "

If you look at the scores, humans are estimated to score 0.77. Google T5 scores -0.4 on the test.

How did T5 get such a high score if it scored so abysmally on the AX-b test?

The AX scores are not included in the total score.

From the paper: "The Avg column is the overall benchmark score on non-AX∗ tasks."

If the AX scores were included, the gap between humans and machines would be bigger than the current score indicates.
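For readers unfamiliar with the metric, here is a small sketch of the two-class Matthews correlation coefficient together with the label collapse described in the quoted passage. It is not the benchmark's scoring code; the function names are hypothetical, and returning 0.0 when the denominator is zero is a common convention assumed here.

```python
import math


def collapse(label: str) -> str:
    """Map a three-way label onto the two-way scheme quoted above."""
    return "entailment" if label == "entailment" else "not_entailment"


def mcc(gold, pred, positive="entailment"):
    """Two-class Matthews correlation coefficient, in the range [-1, 1]."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    tn = sum(g != positive and p != positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: an undefined MCC (zero denominator) is reported as 0.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


gold = [collapse(x) for x in ["entailment", "neutral", "contradiction", "entailment"]]
inverted = ["not_entailment", "entailment", "entailment", "not_entailment"]
constant = ["entailment"] * 4
```

A perfect prediction scores 1, a fully inverted one scores -1, and always predicting one class scores 0, so a value near zero or below says the predictions carry essentially no information about the labels.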

craffel 6 years ago | |

Hi, one of the paper's authors here. We didn't submit our model's predictions for the AX-b task yet, we just copied over the predictions from the example submission. We will submit predictions for AX-b in the next few days.

throwaway_bad 6 years ago |

Possibly dumb question: How do you ensure there's no data leakage when benchmarking transfer learning techniques? Is that even a problem anymore when the whole point is to learn "common sense" knowledge?

For example, their “Colossal Clean Crawled Corpus” (C4), a dataset consisting of hundreds of gigabytes of clean English text scraped from the web, might contain much of the same information as the benchmark datasets, which I presume are also scraped from the web.

craffel 6 years ago | |

Hi, one of the paper authors here. Indeed this is a good question. A couple of comments:

- Common Crawl overall is a sparse web dump; it is unlikely that the month we used includes any of the data that are in any of the test sets.

- In order for the data to be useful to our model, it would have to be in the correct preprocessed format. ("mnli: hypothesis: ... premise: ...") with the label in a format our model could extract meaning from. We introduced this preprocessing format so I don't believe this would ever happen.

- Further, most of these datasets live in .zip files. The Common Crawl dump doesn't unzip zip files.

- C4 is so large that our model sees each example (corresponding to a block of text from a website) roughly once ever over the entire course of training. Big neural nets trained with SGD are unlikely to memorize something if they only see it once over the course of one million training steps.
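The second point above can be illustrated with a rough sketch. It assumes only the prefix pattern quoted in the comment ("mnli: hypothesis: ... premise: ..."); the helper names and the exact spacing are hypothetical and are not taken from the actual T5 preprocessing code.

```python
def to_text_to_text(task: str, fields: dict) -> str:
    """Serialise one example as '<task>: <name>: <value> ...' (illustrative)."""
    parts = [f"{name}: {value}" for name, value in fields.items()]
    return f"{task}: " + " ".join(parts)


def contains_formatted_example(page_text: str, formatted: str) -> bool:
    """Naive leakage check: exact substring match against a crawled page."""
    return formatted in page_text


example = to_text_to_text(
    "mnli",
    {"hypothesis": "The plane landed.", "premise": "The pilot landed the plane."},
)
# A page that happens to contain both raw sentences, but not in the task format.
page = "The pilot landed the plane. Later reports confirmed: The plane landed."
leaked = contains_formatted_example(page, example)
```

The sketch shows why raw benchmark sentences appearing on a web page would not match the formatted training input verbatim. It says nothing about whether the underlying information could still help the model, which is the concern raised in the reply below.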

throwaway_bad 6 years ago | | |

> Big neural nets trained with SGD are unlikely to memorize something if they only see it once over the course of one million training steps

I am not so sure about that. Have you seen this thread: https://www.reddit.com/r/MachineLearning/comments/dfky70/dis...

Apparently lots of sentence fragments were memorized in GPT-2 (including real world URLs, entire conversations, username/emails and other PII).

calabin 6 years ago | |

I think that this is a good question that I would also like to know the answer to. Additionally, are there other benchmarks or tests where this issue (possibly) presents itself?

taneq 6 years ago | |

You don't. Even humans frequently leak information like this. It's just a consequence of having incomplete or incompletely analyzed information.

jcims 6 years ago | |

Not dumb at all and probably a major challenge when developing benchmarks.

Al-Khwarizmi 6 years ago |

This surprised me a bit, on the creation of the corpus they use for training:

"We removed any page that contained any word on the “List of Dirty, Naughty, Obscene or Otherwise Bad Words”."

I don't understand this decision. This list contains words that can be used in a perfectly objective sense, like "anus", "bastard", "erotic", "eunuch", "fecal", etc.

I can understand that they want to avoid websites full of expletives and with no useful content, but outright excluding any website with even one occurrence of such words sounds too radical. If we ask this model a text comprehension question about a legitimized bastard that inherited the throne, or about fecal transplants, I suppose it would easily fail. Strange way of limiting such a powerful model.
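The false-positive concern above can be shown with a toy filter. This is a sketch under loud assumptions: `BLOCKLIST` is a hypothetical two-word stand-in using words mentioned in the comment, not the real list, and whole-word matching is assumed because the quoted sentence does not say how matching was done.

```python
import re

# Hypothetical stand-in for the real word list; both entries come from the comment.
BLOCKLIST = {"bastard", "fecal"}


def keep_page(text: str, blocklist=BLOCKLIST) -> bool:
    """Return False if any blocklisted word appears anywhere on the page."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return words.isdisjoint(blocklist)


pages = [
    "A review article about fecal transplants.",
    "William the Bastard later became King of England.",
    "The pilot managed to land the airplane safely.",
]
kept = [p for p in pages if keep_page(p)]
```

Only the third page survives, even though the first two use the words in a neutral sense, which is the behaviour the comment objects to.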

Veedrac 6 years ago | |

They say they removed pages, not websites. Having false positives isn't a problem when you're still left with 750GB of data—quality matters more than slightly higher quantity at that point.

Al-Khwarizmi 6 years ago | | |

Sorry, I was thinking about pages even though I said websites. Native language interference (typically, we use the same term for pages and websites in my language).

Anyway, my point is not a matter of quantity. The way they're doing it, they have 750 GB of data, but they have exactly zero data that talks about bastards, fecal transplants, etc. So they may have a hard time answering questions about those specific subjects.

nopinsight 6 years ago |

As someone working in the field, I congratulate the excellent accomplishment but agree with the authors that we shouldn't get too excited yet (their quote below after the four reasons). Here are some reasons:

1) Most likely, the model is still susceptible to adversarial triggers as demonstrated on other systems here: http://www.ericswallace.com/triggers

2) T5 was trained with ~750GB of texts or ~150 billion words, which is > 100 times the number of words native English speakers acquire by the age of 20.

3) Most or all of the tests are multiple-choice. Learning complex correlations from sufficient data should help solve most of them. This is useful but human-level understanding is more than correlations.

4) The performance on datasets that require commonsense knowledge, COPA and WSC, is the weakest relative to humans (who score 100.0 on both).

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, p.32 https://arxiv.org/pdf/1910.10683.pdf

"Interestingly, on the reading comprehension tasks (MultiRC and ReCoRD) we exceed human performance by a large margin, suggesting the evaluation metrics used for these tasks may be biased towards machine-made predictions. On the other hand, humans achieve 100% accuracy on both COPA and WSC, which is significantly better than our model’s performance. This suggests that there remain linguistic tasks that are hard for our model to perfect, particularly in the low-resource setting."

I’d like to emphasize that the work and the paper are excellent. Still, we are quite far from human-level language understanding.
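The arithmetic behind reason 2 above can be checked in a few lines. The sketch takes the three figures from the comment at face value (~750GB, ~150 billion words, more than 100 times) and derives what they imply; the variable names are illustrative, and the human figure is an implied upper bound, not a measured value.

```python
corpus_bytes = 750e9   # "~750GB of texts"
corpus_words = 150e9   # "~150 billion words"
ratio = 100            # "> 100 times"

# Average size of a word in the corpus implied by the two corpus figures.
bytes_per_word = corpus_bytes / corpus_words

# Upper bound on a speaker's exposure by age 20 implied by the "> 100 times" claim.
implied_human_words = corpus_words / ratio

# The same bound expressed per day over 20 years (ignoring leap days).
implied_words_per_day = implied_human_words / (20 * 365)
```

The two corpus figures are consistent with about 5 bytes per word, and the claim implies a speaker encounters fewer than 1.5 billion words by age 20, or a little over 200,000 words per day as a ceiling.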

---

We may need more advanced tests to probe the actual language understanding ability of AI systems. Here are some ideas:

* Test for conceptual understanding in a non-multiple-choice format. Example: Write a summary for a New Yorker article, rather than standard news pieces (which tend to follow repeated patterns).

* Commonsense test with longer chains of inference than those needed for solving Winograd Schema and set in non-standard situations (e.g. fantasy world). This should greatly reduce the chance that an approach can simply detect correlations from huge datasets.

* Understanding novel, creative metaphors like those used in some essays by professional writers or some of the Economist's title articles.

martincmartin 6 years ago |

"The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems."

"We take into account the lessons learnt from original GLUE benchmark and present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard."

YeGoblynQueenne 6 years ago |

Assuming that the baseline human score was set according to the performance of adult humans, then according to these results T5 has a language understanding ability at least as accurate as a human child.

In fact it's not just T5 that should be able to understand language as well as a human child, but also BERT++, BERT-mtl and RoBERTa, each of which has a score of 70 or more. There really shouldn't be anything else on the planet that has 70% of human language understanding, other than humans.

So if the benchmarks mean what they think they mean, there are currently fully-fledged strongly artificially intelligent systems. That must mean that, in a very short time we should see strong evidence of having created human-like intelligence.

Because make no mistake: language understanding is not like image recognition, say, or speech processing. Understanding anything is an AI-complete task, to use a colloquial term.

Let's wait and see then. It shouldn't take more than five or six years to figure out what all this means.

YeGoblynQueenne 6 years ago | |

To clarify, I meant this comment as an expression of skepticism: I don't believe that the SuperGLUE benchmark really evaluates language understanding, or that BERT and friends are within a few percent of human language understanding. I think SuperGLUE is just another benchmark that is measuring something other than what it's supposed to be measuring (machine learning benchmarks usually do).

It seems that the teams behind the attempts to beat such benchmarks are aware of the weaknesses of the benchmarks though, so that's encouraging.

enisberk 6 years ago |

I attended one of Sam Bowman's talks (1). It was about "Task-Independent Language Understanding", and he also talked about GLUE and SuperGLUE; he mentioned that some models are surpassing the average person in experiments. They ran some experiments to understand BERT's performance (2), similar to the article 'NLP's Clever Hans Moment', but they found a different answer to the question "what does BERT really know", so he was skeptical about all conclusions. Check these out if you are interested.

(1)[https://www.nyu.edu/projects/bowman/TILU-talk-19-09.pdf]

(2)[https://arxiv.org/abs/1905.06316]

ilaksh 6 years ago |

The AIs in the benchmark are all trained exclusively on text, correct?

My assumption has always been that to get human-level understanding, the AI systems need to be trained on things like visual data in addition to text. This is because there is a fair amount of information that is not encoded at all in text, or at least is not described in enough detail.

I mean, humans can't learn to understand language properly without using their other senses. You need something visual or auditory to associate with the words, which are really supposed to represent full systems that are complex and detailed.

I think it would be much more obvious if there were questions that involved things like spatial reasoning, or combining image recognition with that and comprehension.

alexwg 6 years ago |

Paper: https://arxiv.org/abs/1910.10683

throwaway_bad 6 years ago | |

Twitter summary: https://threadreaderapp.com/thread/1187161460033458177.html

ArtWomb 6 years ago |

"Attention is all you need", indeed. Of course, our instinct tells us there is more to language inference than word proximity. And so results approaching or exceeding expert-level human baseline raise more questions than they provide cause for popping champagne corks.

Question Answering is also advancing rapidly with insights from transformers and denoising auto-encoders, but is still far from human baseline. The ease with which these models can answer a sample question such as "Who was the first human in space?" demonstrates both their efficacy and their limitations. When a model is pre-trained on a large corpus of text, almost every document that contains the name "Yuri Gagarin" will in its near vicinity describe him in relation to his pioneering accomplishment, for which he became a cultural icon.

And for even more generalizable scenarios, such as "what might you find on a Mayan monument?", it becomes imperative that an agent explain its reasoning in natural language as well, to enable self-correcting backpropagation of error.

Language may be considered low-dimensional, relatively speaking, and sentence prediction across quotidian tasks is manageable in current state-of-the-art architectures. But looking at how difficult it is to predict the next N frames of video given a short input example demonstrates the intractability of the problem in higher-dimensional spaces.

Neural Models for Speech and Language: Successes, Challenges, and the Relationship to Computational Models of the Brain - Michael Collins

https://www.youtube.com/watch?v=HVnFKmPaU8c

skybrian 6 years ago |

They came up with the SuperGLUE benchmark because they found that the GLUE benchmark was flawed and too easy to game. There were correlations in the dataset that made it possible to get questions right without real understanding, and so the results didn't generalize.

Could the same thing happen again with the better benchmark due to more subtle correlations? These things are tough to judge, so I'd say wait and see if it turns out to be a real result.

lettergram 6 years ago |

Although those are some great results, I wish I could try it out locally...

https://github.com/google-research/text-to-text-transfer-tra...

It drives me nuts that most of these papers / publications don't have code where I can just run:

> python evaluate_model.py

Still exciting, just annoying that I'd have to set up google cloud to try this out.

ehsankia 6 years ago | |

They often do set up Python notebooks / Colabs you can simply run, especially with the data hosted on GCloud. Unfortunately not this time.

femto113 6 years ago |

My experience with image classification benchmarks was that they approached human levels only because the scoring only counts how much they get “right” and doesn’t penalize completely whack answers as much as they should (like getting full credit for being pretty sure a picture of a dog was either a dog or an alligator). I suspect there’s something similar going on in these language benchmarks.
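
A minimal sketch of the effect described above, assuming the scoring in question is a top-k style metric (the comment does not name one). The function name, the ranked guesses, and the labels are all made up for illustration:

```python
def top_k_accuracy(ranked_predictions, labels, k):
    """Fraction of examples whose true label appears among the first k guesses."""
    hits = sum(
        1
        for ranked, label in zip(ranked_predictions, labels)
        if label in ranked[:k]
    )
    return hits / len(labels)


# Hypothetical ranked guesses for three dog pictures, most confident first.
ranked_predictions = [
    ["dog", "alligator"],
    ["alligator", "dog"],
    ["cat", "dog"],
]
labels = ["dog", "dog", "dog"]

top1 = top_k_accuracy(ranked_predictions, labels, k=1)
top2 = top_k_accuracy(ranked_predictions, labels, k=2)

print(f"top-1 accuracy: {top1:.2f}")
print(f"top-2 accuracy: {top2:.2f}")
```

Under top-2 scoring every example counts as correct, even though two of the three first guesses are wrong and one of them is "alligator".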

riku_iki 6 years ago |

> T5-11B (11 billion parameters)

So, this is the largest language model so far?

lucidrains 6 years ago | |

Yes. The last one was 8.3B https://arxiv.org/pdf/1909.08053.pdf

pauljurczak 6 years ago |

Use of the term Natural Language Understanding in the context of this benchmark is preposterous. No understanding takes place there. Please stick to the term NLP (Natural Language Processing) for the next couple of decades. Thank you.

nightnight 6 years ago |

This clearly demonstrates once again that Google is miles ahead of the competition in AI. I mean, they just have the best data.

If you want an everyday example of Google's AI skills: switch your phone's keyboard to GBoard, especially all iOS users, and you will see a night-and-day difference from any other keyboard, especially the stock one. When using multiple languages at the same time, the gap to other keyboards gets even bigger.

GBoard is my phone's killer app and if Google dropped it for iOS I'd leave for Android the same day.

htfu 6 years ago | |

That's how I used to feel, but it's turning into a nuisance.

It used to stick to single words, or sometimes split one if a space was missing, but now it will sometimes attempt to "correct" the sum of two perfectly valid standalone words after the fact, 97% of the time resulting in nonsense.

I cannot for the life of me understand why.

tremon 6 years ago | |

I have the opposite experience. Yes, some of the suggestions from GBoard are useful, but I feel there's an equal number of times where I've typed a complete word, only to hit space and have the word auto-corrected to what GBoard was expecting. As a typing aid, it's almost unusable because of that.

dingle_thunk 6 years ago | |

Have you tried the iOS 13 keyboard's built-in swipe feature?

Are you aware of SwiftKey?

occamrazor 6 years ago | | |

I think GP is talking about predictive text, rather than keyboard ergonomics.

rrival 6 years ago |

Where do I take the SuperGLUE test?

woodgrainz 6 years ago |

Several of the systems in this leaderboard utilize the BERT model, a clever approach devised by Google for natural language processing. A nice layman's guide to BERT:

https://towardsdatascience.com/bert-explained-state-of-the-a...

vagab0nd 6 years ago |

This is cool. Since they released an 11B pre-trained model, can we finally reproduce "unicorn-level" text generation now?

penagwin 6 years ago | |

My understanding is that a lot of these really high performance models that reach for every percentage-point possible require an absurd amount of hardware - specifically an absurd amount of GPU memory.

For example I have what I consider a fairly "high end" rig for being a hobbyist individual, with 32GB of RAM, i7 8700k, 1080ti - there's 0 chance their model would fit on my system.

So I mean maybe if you have a ton of money? Usually what happens is a slimmer model with not "quite" as high of a score gets released that actually fits on consumer hardware.
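
A rough back-of-the-envelope check of the claim above. It counts only the weights (no activations or framework overhead) and assumes they are stored as 32-bit or 16-bit floats, which the thread does not state. The 11 GB figure is the 1080 Ti's VRAM:

```python
PARAMS = 11_000_000_000  # T5-11B parameter count, as quoted in the thread
GPU_MEMORY_GB = 11       # GTX 1080 Ti VRAM
SYSTEM_RAM_GB = 32       # system RAM mentioned in the comment above

# Assumed storage formats; the actual checkpoint format is not given here.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weights_gb(params, bytes_per_param):
    """Size of the raw weights in decimal gigabytes."""
    return params * bytes_per_param / 1e9


for dtype, size in BYTES_PER_PARAM.items():
    gb = weights_gb(PARAMS, size)
    print(
        f"{dtype}: {gb:.0f} GB of weights, "
        f"fits in GPU: {gb <= GPU_MEMORY_GB}, "
        f"fits in system RAM: {gb <= SYSTEM_RAM_GB}"
    )
```

Under these assumptions the weights alone exceed the GPU's memory in both formats, and in float32 they exceed the system RAM as well.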

vagab0nd 6 years ago | | |

Maybe I'm oversimplifying, but it seems to me that once you have the model trained, it should be possible to partition it somehow when inferencing, to fit smaller machines. At least for a proof of concept it should be possible.
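
A toy sketch of that partitioning idea, not the actual T5 tooling. It assumes the model is a plain sequence of layers that can be applied one after another; the `partition` and `run_partitioned` helpers and the stand-in "layers" are hypothetical:

```python
def partition(layers, num_parts):
    """Split a list of layers into contiguous chunks of nearly equal size."""
    size, extra = divmod(len(layers), num_parts)
    chunks, start = [], 0
    for i in range(num_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(layers[start:end])
        start = end
    return chunks


def run_partitioned(chunks, x):
    """Apply the chunks in order, keeping only one chunk 'active' at a time."""
    for chunk in chunks:
        # A real system would load this chunk's weights here and free the
        # previous chunk's weights; this toy version only applies the layers.
        for layer in chunk:
            x = layer(x)
    return x


# Stand-in "layers": each one just adds its index to the input.
layers = [lambda x, i=i: x + i for i in range(6)]

chunks = partition(layers, 4)
result = run_partitioned(chunks, 0)

print([len(c) for c in chunks])
print(result)
```

The trade-off is time for memory: only one chunk needs to be resident at once, but every forward pass pays the cost of moving each chunk in and out.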

vonseel 6 years ago |

I wonder what I would score on this test. Are these things correlated to standardized test scores at all for humans?

LukeB42 6 years ago |

http://www.irc.org/history_docs/tao.html