Google T5 scores 88.9 on SuperGLUE Benchmark, approaching human baseline (super.gluebenchmark.com)
One "word in context" task is to look at 2 different sentences that have a common word and decide if that word means the same thing in both sentences or different things (more details here: https://pilehvar.github.io/wic/)
One of their examples, though, didn't make any sense to me:
1. The pilot managed to land the airplane safely
2. The enemy landed several of our aircrafts
It says that the word "land" does NOT mean the same thing in those sentences. I am a native English speaker, and I honestly don't understand what they are thinking the second sentence means. Shot them down? If so, I have never heard "landed" used in that context, and it appears neither has Merriam-Webster. Also, the plural of aircraft is just "aircraft", without the s.
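For anyone who hasn't opened the dataset, a single "word in context" item boils down to roughly the following. This is only a minimal sketch: the class and field names are my own assumptions, not the dataset's actual schema, and the two sentences are copied as-is from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WiCExample:
    """One word-in-context item (hypothetical field names)."""
    word: str         # the target word shared by both sentences
    sentence1: str
    sentence2: str
    same_sense: bool  # True if the word means the same thing in both


example = WiCExample(
    word="land",
    sentence1="The pilot managed to land the airplane safely",
    sentence2="The enemy landed several of our aircrafts",
    same_sense=False,  # the label reported for this pair in the comment above
)

# The task is binary: given the word and both sentences, predict same_sense.
print(example.word, "->", "same" if example.same_sense else "different")
```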
She told me that the way she got her perfect score was by realizing when the questions were wrong and thinking of what answer the test creators believed to be correct.
She had to outguess the test creators and answer the questions wrong -- in the "right" way.
This seems like a similar situation.
"I probably won't ever do it like that and/or there's a syntax error in all four of the answers... but this is the answer you want to hear. It's wrong, mind you, but it's what you want to hear."
Time moves swiftly and in one direction.
Record the speed of flies in the same way you would an arrow.
Time flies, which are a kind of fly, are fond of an arrow. (e.g. Time flies like an arrow, fruit flies like a banana).
The enemy could have landed several of our aircraft on one of their runways. Agassi may have beaten Becker over the head with his tennis racket. I suspect part of the test is that there can be other meanings that do technically work.
Specifically, definition 3a or 3b for the verb form here: https://www.merriam-webster.com/dictionary/land
So potentially the enemy captured the aircraft (3a) or destroyed them (3b).
Honestly that sentence -- the use of landed and that awful plural -- approaches engrish. Is that deliberate or is the use of English here just badly flawed? I can't see any other possibilities.
I would like also to point out that even if we do interpret the second as meaning "destroyed", the first could then be interpreted as a combat aviator shooting down an opposing aircraft, bringing us back to the same meaning. Or perhaps both of my interpretations are correct and the meanings are different...
What this tells me is that the benchmark is not very useful.
This language is used on the Wikipedia page about that incident.
I also wouldn’t use “landed” for destroying an enemy plane (neither by shooting it down nor by destroying it on the ground)
That, realistically, leaves hacking the plane’s electronics and then directing it to one’s own airfield.
- The enemy stole the aircrafts, and after some drama in flight managed to land several of them.
- The enemy used remote control to force them to land.
- The enemy used coercive force to force our pilots to land them.
- The enemy captured them.
- The enemy shot them down.
- During a friendly event while we set our differences with our enemy aside and agreed to fly each other's aircraft at an airshow for some reason, we landed several of theirs, and they landed several of ours.
- There was a hearing mistake and "energy" (as in energy beam beamed by a UFO) was accidentally transcribed as "enemy."
- The writer is just screwing with us.
- The writer is not a native speaker of English, and they made a mistake and actually meant that the enemy boarded several of our (parked) aircrafts.
- The writer is creative with language and believes that it would be cute to say that when an enemy projectile struck one of our aircrafts, then the enemy has "landed" that aircraft as one would land men on the moon or land rovers (no pun intended) on Mars.
Or perhaps as one would land a punch.
For me, my first read of the sentence would definitely be that it means shot down.
I have still never heard landed used in that way, and again in other dictionaries I searched I couldn't find that definition either. Thus, this is a case where the "AI" may get it "right", and I, the human, would get it "wrong", but that still feels like it's missing a huge point. It feels like you could rack up a number of "errors" by the human on items the AI gets "right", when in fact the human is better able to detect what is rare, uncommon or at least ambiguous.
My buddy is a pilot and they always say "I landed the takeoff pretty good. PRETTY GOOD!"
If I read these two sentences in context of some news, they would evoke very different "landing" scenes in my head.
3. a : to catch and bring in
// land a fish
b : gain, secure
// land a job landed the leading role
imagine enemy soldiers capturing a base or hangar ship including the aircraft.
It's kind of a stretch though.
It makes me think that there's going to be many adversarial examples of text that humans parse one way because of common usage while machines parse another way because of details like this.
Search for it if you’re interested in its origin.
I have never used or heard 'land a plane' in this context, but the sentence didn't immediately strike me as unnatural, incorrect or unclear.
It struck me as pretty awkward and very ambiguous. It probably means 'obtained' but 'captured' would be a far better word in that case. The suggestions that it means 'hit/shot' don't work because in that case it's not the aircraft that is landed but the shot, which is landed on the aircraft.
Also the use of the incorrect plural "aircrafts" when 'aircraft' is both singular and plural makes me think it's just a poor question.
The very fact that there's so much discussion about it is evidence that it's not straightforward even among native English speaking humans.
This is either a poor question, or a really great question, if the goal of the test is to confuse computers where a human would normally say “huh, weird way of saying that but I guess they mean...”.
It occurred to me that hn_throwaway_99's question, and the responses to it, are the sort of dialog in which one could find additional headroom for further research into natural language understanding. We can understand, for example, that while the two uses of 'landed' are different, they are not completely unrelated, and we can explain how they are related, for example by introducing a third construct, 'landed a fish', as a couple of replies have done.
Even though you've never heard the word used that way, you were able to infer it meant something different. Even with the use of "aircrafts".
We all make mistakes when writing or speaking. We don't let that get in the way of interpreting the information being passed. Even if we post comments that contain errors.
Maybe the _examples_ for a language test should be grammatically correct?
So I would have answered that the word meant the same thing.
Can anyone explain what makes this difficult for a machine? What existing knowledge does the machine start with? At a glance, it doesn't feel like it should be difficult if the machine had a large corpus to train on that showed many examples of each word in different contexts.
2) The pilots [involuntarily] brought down their aircraft [because some authority figure(s) forced them down.]
The active verb 'land' can be performed by different actors: pilot vs a more powerful agent (usually who flies an armed aircraft). The voluntary/involuntary agency is a subtle difference that only those familiar with this military practice are likely to grok.
Clearly the enemy conferred lesser nobility and commensurate landownership unto said aircrafts. https://en.wikipedia.org/wiki/Landed_gentry
This is really really super great, let's be clear. It's just not up to the hype "omg super human" usually gets.
I have no idea where the real human baseline is, or how to find it.
Also, consider this discussion. GLUE winners may be able to make informed parsing guesses about single text blocks, but they're years away from being able to make a useful contribution to a discussion like this one.
Then you can benchmark your AI but penalise it more heavily for getting things wrong that are obvious to a human.
Regarding SuperGLUE specifically, it asked:
"Indeed, Bowman and his collaborators recently introduced a test called SuperGLUE that's specifically designed to be hard for BERT-based systems. So far, no neural network can beat human performance on it. But even if (or when) it happens, does it mean that machines can really understand language any better than before? Or does it just mean that science has gotten better at teaching machines to the test?"
[1] - https://www.quantamagazine.org/machines-beat-humans-on-a-rea...
Look at the sub-scores on the page. One score that looks very different from humans is AX-b.
The SuperGLUE paper provides more context about AX-b:
https://arxiv.org/pdf/1905.00537.pdf
AX-b "is the broad-coverage diagnostic task, scored using Matthews’ correlation (MCC). "
This is how the paper describes this test
" Analyzing Linguistic and World Knowledge in Models GLUE includes an expert-constructed, diagnostic dataset that automatically tests models for a broad range of linguistic, commonsense, and world knowledge. Each example in this broad-coverage diagnostic is a sentence pair labeled with a three-way entailment relation (entailment, neutral, or contradiction) and tagged with labels that indicate the phenomena that characterize the relationship between the two sentences. Submissions to the GLUE leaderboard are required to include predictions from the submission’s MultiNLI classifier on the diagnostic dataset, and analyses of the results were shown alongside the main leaderboard. Since this broad-coverage diagnostic task has proved difficult for top models, we retain it in SuperGLUE. However, since MultiNLI is not part of SuperGLUE, we collapse contradiction and neutral into a single not_entailment label, and request that submissions include predictions on the resulting set from the model used for the RTE task. We collect non-expert annotations to estimate human performance, following the same procedure we use for the main benchmark tasks (Section 5.2). We estimate an accuracy of 88% and a Matthew’s correlation coefficient (MCC, the two-class variant of the R3 metric used in GLUE) of 0.77. "
If you look at the scores, humans are estimated to score 0.77. Google T5 scores -0.4 on the test.
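For readers who haven't met the metric: in the two-class case, Matthews' correlation is computed from the confusion-matrix counts, ranges from -1 to 1, and sits near 0 when predictions are no better than chance. Below is a minimal sketch of the standard formula; the function name and the toy labels are mine, and this is not the benchmark's scoring code.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Two-class Matthews correlation coefficient for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention: undefined cases (e.g. constant predictions) score 0.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy labels, purely illustrative (1 = entailment, 0 = not_entailment).
gold = [1, 0, 1, 0]
print(matthews_corrcoef(gold, [1, 0, 1, 0]))  # perfect agreement
print(matthews_corrcoef(gold, [0, 1, 0, 1]))  # every prediction inverted
print(matthews_corrcoef(gold, [1, 1, 1, 1]))  # always predicts one class
```

A model can have decent accuracy on an imbalanced set and still land near zero here, which is why the metric is used for a diagnostic set like this one.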
How did T5 get such a high score if it scored so abysmally on the AX-b test?
The AX scores are not included in the total score.
From the paper: "The Avg column is the overall benchmark score on non-AX∗ tasks."
If the AX scores were included, the gap between humans and machines would be bigger than the current score indicates.
For example their “Colossal Clean Crawled Corpus” (C4), a dataset consisting of hundreds of gigabytes of clean English text scraped from the web, might contain much of the same information as the benchmark datasets, which I presume is also scraped from the web.
- Common Crawl overall is a sparse web dump, it is unlikely that the month we used includes any of the data that are in any of the test sets.
- In order for the data to be useful to our model, it would have to be in the correct preprocessed format. ("mnli: hypothesis: ... premise: ...") with the label in a format our model could extract meaning from. We introduced this preprocessing format so I don't believe this would ever happen.
- Further, most of these datasets live in .zip files. The Common Crawl dump doesn't unzip zip files.
- C4 is so large that our model sees each example (corresponding to a block of text from a website) roughly once ever over the entire course of training. Big neural nets trained with SGD are unlikely to memorize something if they only see it once over the course of one million training steps.
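To make the second point concrete, here is a minimal sketch of building an input string in that prefix layout. Only the "mnli: hypothesis: ... premise: ..." shape comes from the comment above; the function name and the example sentences are made up for illustration, and this is not the actual preprocessing code.

```python
def to_text_to_text(task, fields):
    """Join a task prefix and named fields into a single input string."""
    body = " ".join(f"{name}: {value}" for name, value in fields)
    return f"{task}: {body}"


# Hypothetical example pair, not taken from any dataset.
example_input = to_text_to_text(
    "mnli",
    [
        ("hypothesis", "The plane is on the ground."),
        ("premise", "The pilot managed to land the airplane safely."),
    ],
)
print(example_input)
```

A page scraped from the web would have to contain a test example in exactly this shape, together with its label, for the model to benefit from it directly.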
I am not so sure about that. Have you seen this thread: https://www.reddit.com/r/MachineLearning/comments/dfky70/dis...
Apparently lots of sentence fragments were memorized in GPT-2 (including real world URLs, entire conversations, username/emails and other PII).
"We removed any page that contained any word on the “List of Dirty, Naughty, Obscene or Otherwise Bad Words”."
I don't understand this decision. This list contains words that can be used in a perfectly objective sense, like "anus", "bastard", "erotic", "eunuch", "fecal", etc.
I can understand that they want to avoid websites full of expletives and with no useful content, but outright excluding any website with even one occurrence of such words sounds too radical. If we ask this model a text comprehension question about a legitimized bastard that inherited the throne, or about fecal transplants, I suppose it would easily fail. Strange way of limiting such a powerful model.
Anyway, my point is not a matter of quantity. The way they're doing it, they have 750 GB of data, but they have exactly zero data that talks about bastards, fecal transplants, etc. So they may have a hard time answering questions about those specific subjects.
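A minimal sketch of the kind of page-level filter being discussed: drop the whole page if any word on the list appears even once. The tokenization rule, the function names and the two-word blocklist are assumptions for illustration (the words are ones mentioned above); how the real pipeline matches words is not something I know.

```python
import re


def contains_blocked_word(page_text, blocklist):
    """True if any blocklisted word occurs as a whole word in the page."""
    tokens = set(re.findall(r"[a-z']+", page_text.lower()))
    return not tokens.isdisjoint(blocklist)


def filter_pages(pages, blocklist):
    """Keep only the pages that contain no blocklisted word at all."""
    return [page for page in pages if not contains_blocked_word(page, blocklist)]


# Tiny stand-in blocklist; the real list is much longer.
blocklist = {"bastard", "fecal"}
pages = [
    "Fecal transplants can treat some recurrent infections.",
    "Time flies like an arrow; fruit flies like a banana.",
]
kept = filter_pages(pages, blocklist)
print(kept)
```

The medical page is dropped because of a single word, which is exactly the over-filtering concern: the surviving corpus contains no text at all on those topics.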
1) Most likely, the model is still susceptible to adversarial triggers as demonstrated on other systems here: http://www.ericswallace.com/triggers
2) T5 was trained with ~750GB of texts or ~150 billion words, which is > 100 times the number of words native English speakers acquire by the age of 20.
3) Most or all of the tests are multiple-choice. Learning complex correlations from sufficient data should help solve most of them. This is useful but human-level understanding is more than correlations.
4) The performance on datasets that require commonsense knowledge, COPA and WSC, is the weakest relative to humans (who score 100.0 on both).
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, p.32 https://arxiv.org/pdf/1910.10683.pdf
"Interestingly, on the reading comprehension tasks (MultiRC and ReCoRD) we exceed human performance by a large margin, suggesting the evaluation metrics used for these tasks may be biased towards machine-made predictions. On the other hand, humans achieve 100% accuracy on both COPA and WSC, which is significantly better than our model’s performance. This suggests that there remain linguistic tasks that are hard for our model to perfect, particularly in the low-resource setting."
I’d like to emphasize that the work and the paper are excellent. Still, we are quite far from human-level language understanding.
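The back-of-the-envelope figures in point 2 can be checked with a few lines. This sketch uses only the rounded numbers quoted there (~750 GB, ~150 billion words, a factor of more than 100); the variable names are mine, and the "implied" human figure is just what those numbers entail, not a measured value.

```python
corpus_bytes = 750 * 10**9   # ~750 GB of text, as quoted above
corpus_words = 150 * 10**9   # ~150 billion words, as quoted above
ratio_to_human = 100         # "> 100 times" a native speaker's exposure

# Average bytes per word implied by the two corpus figures.
bytes_per_word = corpus_bytes / corpus_words

# Upper bound on words by age 20 implied by the "> 100 times" claim.
implied_human_words = corpus_words // ratio_to_human

print(bytes_per_word)        # bytes per word, including spaces and punctuation
print(implied_human_words)   # i.e. at most about 1.5 billion words
```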
---
We may need more advanced tests to probe the actual language understanding ability of AI systems. Here are some ideas:
* Test for conceptual understanding in a non-multiple-choice format. Example: Write a summary for a New Yorker article, rather than standard news pieces (which tend to follow repeated patterns).
* Commonsense test with longer chains of inference than those needed for solving Winograd Schema and set in non-standard situations (e.g. fantasy world). This should greatly reduce the chance that an approach can simply detect correlations from huge datasets.
* Understanding novel, creative metaphors like those used in some essays by professional writers or some of the Economist's title articles.
"We take into account the lessons learnt from original GLUE benchmark and present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard."
In fact it's not just T5 that should be able to understand language as well as a human child, but also BERT++, BERT-mtl and RoBERTa, each of which has a score of 70 or more. There really shouldn't be anything else on the planet that has 70% of human language understanding, other than humans.
So if the benchmarks mean what they think they mean, there are currently fully-fledged strongly artificially intelligent systems. That must mean that, in a very short time we should see strong evidence of having created human-like intelligence.
Because make no mistake: language understanding is not like image recognition, say, or speech processing. Understanding anything is an AI-complete task, to use a colloquial term.
Let's wait and see then. It shouldn't take more than five or six years to figure out what all this means.
It seems that the teams behind the attempts to beat such benchmarks are aware of the weaknesses of the benchmarks though, so that's encouraging.
(1)[https://www.nyu.edu/projects/bowman/TILU-talk-19-09.pdf]
My assumption has always been that to get human-level understanding, the AI systems need to be trained on things like visual data in addition to text. This is because there is a fair amount of information that is not encoded at all in text, or at least is not described in enough detail.
I mean, humans can't learn to understand language properly without using their other senses. You need something visual or auditory to associate with the words, which are really supposed to represent full systems that are complex and detailed.
I think it would be much more obvious if there were questions that involved things like spatial reasoning, or combining image recognition with that and comprehension.
Question Answering is also advancing rapidly with insights from transformers and denoising auto-encoders, but it is still far from the human baseline. The ease with which these models can answer a sample question such as "Who was the first human in space?" demonstrates both their efficacy and their limitations. In the large text corpus they are pre-trained on, almost every document that contains the name "Yuri Gagarin" will, in its near vicinity, describe him in relation to the pioneering accomplishment for which he became a cultural icon.
For even more generalizable questions, such as "What might you find on a Mayan monument?", it becomes imperative that an agent also explain its reasoning in natural language, so that its errors can be identified and backpropagated as corrections.
Language may be considered low-dimensional, relatively speaking, and sentence prediction across quotidian tasks is manageable in current state-of-the-art architectures. But looking at how difficult it is to predict the next N frames of video given a short input example demonstrates the intractability of the problem in higher-dimensional spaces.
Neural Models for Speech and Language: Successes, Challenges, and the Relationship to Computational Models of the Brain - Michael Collins
Could the same thing happen again with the better benchmark due to more subtle correlations? These things are tough to judge, so I'd say wait and see if it turns out to be a real result.
https://github.com/google-research/text-to-text-transfer-tra...
It drives me nuts that most of these papers / publications don't have code where I can just run:
> python evaluate_model.py
Still exciting, just annoying that I'd have to set up google cloud to try this out.
So, this is the largest language model so far?
If you want an everyday example of Google's AI skills: switch your phone's keyboard to GBoard, especially if you are an iOS user, and you will see a night-and-day difference from any other keyboard, especially the stock one. When using multiple languages at the same time, the gap to other keyboards gets even bigger.
GBoard is my phone's killer app, and if Google dropped it for iOS I'd leave for Android the same day.
It used to stick to single words or sometimes splitting one if missing a space, but now will sometimes attempt to "correct" the sum of two perfectly valid standalone words after the fact, 97% of the time resulting in nonsense.
I cannot for the life of me understand why.
Are you aware of Swiftkey?
https://towardsdatascience.com/bert-explained-state-of-the-a...
For example I have what I consider a fairly "high end" rig for being a hobbyist individual, with 32GB of RAM, i7 8700k, 1080ti - there's 0 chance their model would fit on my system.
So I mean maybe if you have a ton of money? Usually what happens is a slimmer model with not "quite" as high of a score gets released that actually fits on consumer hardware.
In option 2, the aircraft met the ground violently and lethally.
For me taking an airborne object and making it touch the ground is pretty much the same meaning whether it's from the inside or remotely or shooting it down.
Landing a deal (or a fish) is like landing a plane. A human acts to cause a desired outcome. Unlike forcing a pilot to involuntarily land a plane, the perspective of the fish as involuntarily being forced to land is not a necessary inference for this use of 'land'.
Geez, language can be subtle.
I don't think anyone in the field thinks that once we match human performance on benchmark X, we're officially done. It just means it's time for more interesting benchmarks.
Over time, if it starts to become difficult to design benchmarks that humans can outperform machines on, then that will prompt interesting conceptual work about what exactly the difference between human and machine language competency is. And then that will lead either to more sophisticated benchmarks or alternatively gradually more sophisticated and persuasive arguments that machines really have surpassed us in language competence.
I don't think we're yet at a point where we don't know how to make harder benchmarks, and if and when we do hit such a point, I'd definitely bet the result will be a conceptual advance in benchmark design rather than declaring machine superiority once and for all. At least for the first few rounds of this cycle.
"But instead of concluding that BERT could apparently imbue neural networks with near-Aristotelian reasoning skills, they suspected a simpler explanation: that BERT was picking up on superficial patterns in the way the warrants were phrased. Indeed, after re-analyzing their training data, the authors found ample evidence of these so-called spurious cues. For example, simply choosing a warrant with the word “not” in it led to correct answers 61% of the time. After these patterns were scrubbed from the data, BERT’s score dropped from 77 to 53 — equivalent to random guessing."
Maybe it should be? The "dieselgate" talk[1] at 32c3 suggests engineering has gotten very good[2] at "teaching machines to the test".
[1] https://media.ccc.de/v/32c3-7331-the_exhaust_emissions_scand... (good text summary: https://lwn.net/Articles/670488/ )
There is no fundamental way to overcome this problem, except by not using metrics as goals.
Systems like GPT-2, incredibly (I used to be a skeptic of a pure statistical approach) manage to extract meaning, keep a theme, and understand the intent behind a sentence. They are amazing.
When you have a system that displays all the characteristics of understanding something, it is irrelevant whether or not it "fakes" it. No one ever proved that humans are not "faking" intelligence either.
These rankings, if real, should be in constant flux.
I think that this article makes a good point, and correctly identifies weaknesses.
However, I also think that humans often take very similar shortcuts. There are good reasons why "bag of words" approaches work much of the time. Additionally there's lots of evidence showing that very rapid reading by humans does not imply deep understanding.
I think it's very important that people are aware of the weaknesses of these types of models. However, I think it's interesting that these weaknesses are becoming harder and harder to find.
Structuring a problem as a multiple choice task is basically turning it into a classification problem, but it doesn't really answer the question everyone wants answered: is it really possible to reduce the problem of language understanding to classification? i.e. is it really possible to understand human language with no other ability than the ability to identify the classes of objects?
But that is a question that has to be answered before any performance on benchmarks that reduce language understanding to classification can be appraised correctly. If accurate classification is not sufficient for language understanding, then beating benchmarks like SuperGLUE tells us nothing new (we already know we have good classifiers).
The problem here is that we have no good measures of language understanding, of humans or machines- because we have a poor, er, understanding of our own language ability. Until we know more about what it means to understand language it won't be possible to evaluate automated language understanding systems very well.
Hopefully though, the skepticism I've observed around results like the one above, will lead to a renewed effort to research our language ability, and perhaps our intelligence in general.
...but, humans evolved the ability to use language over hundreds of generations... So... Maybe that's not such a bad thing?
If we wish to use a model in critical situations, such as a medical setting or commanding a self-driving car, 1) and 4) above cannot be ignored.
Humans are susceptible to adversarial triggers too, so this doesn't necessarily make the model less impressive. It is a big problem in practical use though.
Off the top of my head, I can think of:
* garden path sentences
* highly recursive sentences
Could you or anyone provide some other classes?
The two classes above however can generally be understood by a large number of educated native speakers with time to think carefully.
Humans also do not get derailed so badly as in the examples in this link. http://www.ericswallace.com/triggers
Specifically, remember that deaf-blind people exist, so if you're sure that you "need something visual or auditory" then those people are not, according to your beliefs, able to understand language. I think they'll disagree with you quite strongly.
I got curious if/how deafblind people learn to communicate in the first place, if they are completely deafblind from birth. If humans can learn not just communication but language without either vision or hearing, that seems to suggest either extreme adaptability or language learning being quite decoupled from vision and hearing. From an evolutionary standpoint, I imagine that both deafness and blindness are probably uncommon enough that language learning could have explicit dependencies on both hearing and vision.
I found an old-looking video about communication with deafblind people. At the linked timestamp is a woman who has been deafblind since age 2.
Keep in mind that most of the current ML systems have diverged from biology. A majority of the recent breakthroughs come from mathematics; the rationale is that just because the human brain does something in a certain way does not necessarily mean it is the only way to do it.
I was just trying to explain why text input alone isn't going to be adequate and that was an example.
Thanks for the link, that is one example of the type of thing I was talking about I think.
One other point: I’ve never heard the term “landed” to mean “grounded”, which is maybe the actual intent of #2, but maybe the AI sentence generation is off.
But in the future that sentence might mean hacking and theft of the actual planes, an actual landing.
This is something that actually does happen. Less than 10 or 20 years ago, China did it to a US Air Force reconnaissance aircraft.
Just brainstorming here, but a vanilla network partition strategy might be to load each layer's weight into memory and perform the forward pass sequentially. I think that would be prohibitively slow - some of these models (e.g. BERT) can already take up to 3-4 seconds to perform a single forward pass on a CPU, and that's with all model weights already loaded into main memory. I suspect fetching/loading each layer separately would blow this out by an order of magnitude.
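To illustrate the idea, here is a toy sketch of that sequential strategy: each layer's weights live in their own file and are loaded only for the moment they are needed. Everything here is an assumption for illustration (JSON files, plain Python lists, a two-layer ReLU network); a real model would use a framework's tensor format, and this says nothing about how slow it would be in practice.

```python
import json
import os
import tempfile


def save_layers(layers, directory):
    """Write each layer's weight matrix to its own file; return the paths."""
    paths = []
    for index, weights in enumerate(layers):
        path = os.path.join(directory, f"layer_{index}.json")
        with open(path, "w") as handle:
            json.dump(weights, handle)
        paths.append(path)
    return paths


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def relu(vector):
    return [max(0.0, v) for v in vector]


def forward_streaming(paths, vector):
    """Forward pass that holds only one layer's weights in memory at a time."""
    for path in paths:
        with open(path) as handle:
            weights = json.load(handle)  # disk read on every layer, every pass
        vector = relu(matvec(weights, vector))
        del weights                      # drop this layer before loading the next
    return vector


# Hypothetical two-layer network: 2 inputs -> 2 hidden units -> 1 output.
toy_layers = [
    [[1.0, 0.0], [0.0, -1.0]],
    [[2.0, 2.0]],
]

with tempfile.TemporaryDirectory() as tmp:
    layer_paths = save_layers(toy_layers, tmp)
    output = forward_streaming(layer_paths, [3.0, 4.0])

print(output)
```

Peak memory is bounded by the largest single layer rather than the whole model, but every forward pass pays the full cost of reading all the weights from disk, which is where the slowdown suspected above would come from.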
The thing is that when you're going for leaderboards you're reaching for every last percentage point, so the efficiency of the model size/performance isn't a concern; you want to ramp up the resource usage to as much as you have access to.
TL;DR - Yeah basically most people will run a "slimmed down" version of the model that isn't "as" performant, but is still an improvement over previous models and actually fits on your machine.
However note that the dataset used to train GPT-2 is about 20x smaller than C4. I'm not 100% sure how many times the training set was repeated over the course of training for GPT-2, but it was likely many times. I stand by my statement (that memorization is unlikely with SGD and no repetition of training data) but I would be happy to be proven otherwise.
However they don't publish how well a human performs on the dataset without "not" in it.
They do initially note that "Even human beings don’t do particularly well on this task without practice."
I've looked at the warrant task. It's pretty tricky! I'd bet real money that untrained humans perform much, much lower than the 80% correct rate they get on the full set on ones without "not". I don't think it would be as low as the 53% BERT gets, but it would drop significantly.
I find the HANS analysis[1] much more compelling, but again I'd note that humans suffer on this dataset too (although again - not as badly as models do).
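The "not" cue from the warrant task (picking the warrant that contains "not" was reportedly right 61% of the time) is easy to turn into a baseline. A minimal sketch follows; the function names and the three toy items are made up for illustration, and this is not the analysis the paper's authors ran.

```python
def pick_warrant_with_not(warrant0, warrant1):
    """Return the index (0 or 1) of the warrant containing 'not'.

    Falls back to 0 when neither or both warrants contain the cue.
    """
    has0 = "not" in warrant0.lower().split()
    has1 = "not" in warrant1.lower().split()
    if has1 and not has0:
        return 1
    return 0


def cue_accuracy(examples):
    """Fraction of (warrant0, warrant1, correct_index) items the cue gets right."""
    correct = sum(
        1 for w0, w1, label in examples if pick_warrant_with_not(w0, w1) == label
    )
    return correct / len(examples)


# Toy items, purely hypothetical: (warrant0, warrant1, index of correct warrant).
toy_examples = [
    ("it is not raining", "it is raining", 0),
    ("cats are pets", "cats are not pets", 1),
    ("dogs bark", "dogs do not bark", 0),
]
print(cue_accuracy(toy_examples))  # the cue is right on 2 of these 3 toy items
```

If a baseline this shallow scores well above 50% on a two-way task, a high model score on the same data says little about reasoning, which is the point of re-testing after the cue is removed.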
Let's imagine that in the brain everything goes through a series of models: first tokenization into words, then we build something like an abstract syntax tree, then we analyse meaning in context, etc.; and each time one of these steps reaches a nonsensical result we start over with additional parsing time allocated. It's probably not true, but close enough to be a useful model.
Now what you consider an adversarial example depends on how far down the stack it has to go until it's caught:
- "The old man the boat." fails in the early parsing steps. We reliably miscategorize old as adjective when it's a noun.
- "More people have been to Russia than I have, said Escher" goes a step further, it parses just fine but makes no sense. The tricky thing is that you might initially not notice that it makes no sense. This is about the level where AI is today.
- "Time flies like an arrow; fruit flies like a banana" makes perfect sense, but you could notice that the straightforward way to parse it leads to a non sequitur, and parsing it as "time-flies love eating arrows; fruit-flies love eating bananas" is probably a better way to parse it.
Of course that's just the parsing steps. You can trick human "sentiment analysis" by swapping words without changing the meaning. Compare "this bag is made from fake leather" to "this bag is made from vegan leather". PR and marketing have made a science out of how to make bad things sound good. Similarly PR is great at finding adversarial examples for reading comprehension, where they say one thing that's nearly universally understood to mean something different (or to mean nothing at all; or where something that seems to mean nothing at all actually means something very significant).
Of course we assume all text to be targeted to humans; so if something is widely misunderstood by humans we blame the sender for writing such a bad message; when it's widely misunderstood by AI we blame the AI for being so bad at reading.
The propensity to make mistakes in comprehension is unavoidable: humans only approach 90% accuracy, and computers are getting close to the same level of accuracy on the same base materials as humans.
The other way of testing would be to devise a test where there is only a single interpretation, where the context is clear, and there is no ambiguity in meaning. In that case a competent human and computer algorithm could be expected to answer all questions perfectly.
The purpose of this benchmark on the other hand is to test comprehension when meaning is not explicit and context clues are implied, something humans have had the advantage at over computers until quite recently. The computer won't be 100% accurate, but that's not the purpose of this test.
As a lifelong native speaker (PNW English), I've also never heard "landed" used to refer to shooting down or capturing enemy airplanes. I could understand it from context, which is what I suppose the software is also going for, but I'd mark it with a red pen if someone showed me that sentence, just for clarity's sake (i.e. understandable from context but should be replaced).
These uses of 'land' and 'down' are military euphemisms for the use of force to compel a reluctant pilot to land. The difference is the degree of violence used.
Involuntary 'landing' implies the aircraft is forced to land by a party other than the pilot because if the pilot did not comply the plane would be shot down or collide or crash. It usually implies survival of the pilot. 'Downing' also means involuntary removal of the aircraft from the sky, but does not denote that a violent landing did occur, only that the likelihood of violence is much greater because a (more abrupt) landing was forced upon the pilot. From what I've read, 'downing' usually implies the plane crashed.
I'll admit that, as a non-native speaker, this fills me with glee.
> For SuperGLUE, we improved upon the state-of-the-art by a large margin (from an average score of 84.6 [Liu et al., 2019c] to 88.9). SuperGLUE was designed to comprise of tasks that were “beyond the scope of current state-of-the-art systems, but solvable by most college-educated English speakers” [Wang et al., 2019b]. We nearly match the human performance of 89.8 [Wang et al., 2019b]. Interestingly, on the reading comprehension tasks (MultiRC and ReCoRD) we exceed human performance by a large margin, suggesting the evaluation metrics used for these tasks may be biased towards machine-made predictions. On the other hand, humans achieve 100% accuracy on both COPA and WSC, which is significantly better than our model’s performance. This suggests that there remain linguistic tasks that are hard for our model to perfect, particularly in the low-resource setting.
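For a rough sense of scale, the three average scores quoted above can be compared directly. This is only back-of-the-envelope arithmetic on the quoted numbers; the variable names are mine, not the paper's.

```python
# Average SuperGLUE scores as quoted in the passage above.
previous_sota = 84.6  # Liu et al., 2019c
t5_score = 88.9       # T5
human_baseline = 89.8 # Wang et al., 2019b

# Absolute improvement over the previous state of the art.
gain = round(t5_score - previous_sota, 1)

# Remaining distance to the human baseline.
remaining_gap = round(human_baseline - t5_score, 1)

# Share of the previous gap to human performance that was closed.
fraction_closed = (t5_score - previous_sota) / (human_baseline - previous_sota)

print(f"gain over previous best: {gain} points")
print(f"remaining gap to humans: {remaining_gap} points")
print(f"share of the gap closed: {fraction_closed:.0%}")
```

So the model gained 4.3 points and sits 0.9 points below the human average, which means roughly 83% of the previous gap was closed. As the quote notes, that average hides large per-task differences in both directions.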
I'm not sure why the SuperGLUE/GLUE benchmark was designed to omit the AX-* scores from the benchmark score. It may be that they have no corresponding training set.
Regardless of the status of the AX-* tests, I am very impressed by your results on the SuperGLUE benchmark.
Language is specifically a human communication tool; there's no value in surpassing the language skill that humans have, if indeed such a thing is even meaningful (what does it mean to be better than the best* French person at French?)
* By whatever language-related metric
So there are two ways in which a language automaton must be better than human: it cannot rely on non-verbal hints nor can it easily ask for clarification, and it must be able to interpret many different dialects and idioms correctly -- many more than an average human would need to.
[0] https://en.wikipedia.org/wiki/Garden-path_sentence
[1] https://en.wikipedia.org/wiki/Time_flies_like_an_arrow;_frui...
Obvious in hindsight...
(Noun-adjective is a rare formation, but amusingly more common in the same situations where the author uses rare and archaic definitions like the adjective "fell".)
I think "I eat my rice with chicken" vs "I eat my rice with children" vs "I eat my rice with chopsticks" is the canonical example here.
There's a whole field in NLP devoted to showing what changes happen to entities mentioned in a sentence as a side effect of the sentence, and this example shows it pretty well.
The question was "How do you delete all files in the current directory?". Using DOS 6.22 (I think, it's from memory).
My answer "del." was marked incorrect, because the teacher didn't know enough about DOS to understand that's the standard shortcut for "del .". And the teacher refused to even try out the command, let alone fix the incorrect mark. Sigh.
In my TAFE class, I was asked to list two examples of operating systems.
I listed Linux and eComStation. The teacher had never heard of eComStation and marked me wrong.
Refused to correct my mark even when I proved him wrong. I'm still bitter about it a decade later.
They're getting better at it though. More recently I've done their devops certification and it looks like they're recommending somewhat more sane practices now...
There were still questions where even after three or four tries at certification / reading up on whatever Microsoft thinks is 'good' we didn't find 'the correct answer' according to Microsoft though... ¯\_(ツ)_/¯
I tried the dispute process but it's basically impossible to dispute / report broken questions unless you have a photographic memory.
It's like if people were discussing where to have a conference, and one of them proposed a hotel. Then another person suggested a resort. Then a third person floated a cruise ship. Cruise ships do float, but it has nothing to do with anything. They are floating the idea of the ship as a venue.
I haven't worked in aviation so my understanding of terminology could be wrong, but either way it is definitely an unusual example.
For example, I might say that "they landed 4 aircraft with their daring" if they forced us to abandon an aircraft carrier (e.g. by sinking it) and then managed to steal 4 of the planes (before it sank). Or I might say "they landed 4 aircraft with that bomb" if they dropped a bomb on an airfield and it destroyed 4 aircraft.
So this would be particularly apt wording if the enemy had thrown a net over the plane as it sank in the ocean.
But I prefer to think the enemy gifted British country estates to the planes.
disclaimer: beyond pedantic, but 100% appropriate given the topic is NLP and idioms
I suppose in some cases it could score better than humans on the SuperGLUE benchmark, but eventually it will have to come back down to a near-human score as it gets more accurate.
I think humans are already behind at the face recognition task for example.
They're not shy about illustrating a military application up front!
So landing=catching=scoring.
Depending on the type of fishing, you can still be an underdog to land the fish after hooking it.