A lot of YT videos already have autogenerated English subtitles, which are actually available as a VTT download, so you don't even need to run Whisper on a video to obtain them!
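For anyone who wants to turn such a subtitle file into plain text, here is a minimal sketch. It assumes the file is ordinary WebVTT (a `WEBVTT` header, then cues made of a timing line and text lines) and that consecutive cues may repeat the previous line; `vtt_to_text` and the sample captions are made up for illustration, not taken from any real video.

```python
import re

# Matches WebVTT cue timing lines such as "00:00:01.000 --> 00:00:04.000 align:start".
TIMING = re.compile(r"^(?:\d+:)?\d{2}:\d{2}\.\d{3}\s+-->\s+(?:\d+:)?\d{2}:\d{2}\.\d{3}")
# Matches inline markup such as <c>, </c> or <00:00:01.500> word-timing tags.
TAG = re.compile(r"<[^>]+>")


def vtt_to_text(vtt: str) -> str:
    """Return the caption text of a WebVTT document, one line per distinct cue line."""
    lines = []
    in_cue = False
    for raw in vtt.splitlines():
        line = raw.strip()
        if not line:
            in_cue = False          # a blank line ends the current block
            continue
        if TIMING.match(line):
            in_cue = True           # the following lines are the cue text
            continue
        if not in_cue:
            continue                # header, metadata, cue identifiers
        text = TAG.sub("", line).strip()
        # Assumption: rolling captions repeat the previous line, so drop repeats.
        if text and (not lines or lines[-1] != text):
            lines.append(text)
    return "\n".join(lines)


# Hypothetical sample, not taken from a real video.
sample = """WEBVTT
Kind: captions
Language: en

00:00:00.000 --> 00:00:02.000
hello<00:00:00.500><c> world</c>

00:00:02.000 --> 00:00:04.000
hello world
this is a test
"""

print(vtt_to_text(sample))
```

This only handles the simple case; a real pipeline would probably want a proper WebVTT parser.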
Keep in mind (pun) that the only real intelligence here is us, and we are pretty good at figuring out when a tool has exhausted its utility.
Somewhat counterintuitively, scaling datasets is the lazy and economical approach. If you have the compute already, might as well dig an OOM more text tokens.
But there are other sources of data, and slightly different ways to utilize them. Multimodality, in very large training runs, will almost inevitably increase sample efficiency (for obvious reasons of context richness); synthetic data is already very effective [1]; and there are, and will continue to be discovered, other ways to do more as raw text resources diminish. But a thorough abandonment of the scaling strategy is very unlikely.
Sutton's Bitter Lesson [2] points at a very powerful rule of thumb: we shouldn't turn AI engineering into a contest of smartness, we should allow complex smartness to emerge from generic low-level algorithms. What will be seen as laughable in decades to come is not the scaling strategy, but the Godlike conceit of people who thought they could devise generally applicable rules of reasoning from first principles.
[1] https://arxiv.org/abs/2304.08466
[2] http://www.incompleteideas.net/IncIdeas/BitterLesson.html
Training ResNet-50 on real ImageNet gives 73.09% top-1 accuracy, while training it on synthetic data (same resolution, same number of images) generated by this work gives 64.96%, which is SOTA compared to previous work's 63.02%. Therefore, synthetic data is worse than real data for now.
But synthetic data is not useless, because training on real plus synthetic data is a bit better than training on either alone. (Accuracy here is different due to different methodology.) Using a 1:1 mix of real and synthetic data improves accuracy from 76.39% to 77.61%. But 1:2 is worse than 1:1 (77.16%), even though the dataset is 50% larger. With 1:4, the result is worse than not using synthetic data at all. So synthetic data can at best enlarge the dataset by 5x, more likely just 2x.
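The size and gain figures above can be re-derived with a few lines of arithmetic. This is only a sketch that restates the numbers quoted in this comment; no accuracy is given for the 1:4 mix, so it is left out of the table.

```python
# Top-1 accuracies quoted above for real:synthetic mixes (baseline = real data only).
baseline = 76.39
mixes = {
    (1, 1): 77.61,
    (1, 2): 77.16,
}

for (real, synth), acc in mixes.items():
    size = (real + synth) / real          # total dataset size relative to real-only
    gain = acc - baseline                 # percentage points over real-only
    print(f"{real}:{synth}  size x{size:.0f}  gain {gain:+.2f} pp")

# Going from 1:1 (2x) to 1:2 (3x) makes the dataset 50% larger.
growth = ((1 + 2) / (1 + 1) - 1) * 100
print(f"1:1 -> 1:2 grows the dataset by {growth:.0f}%")

# A 1:4 mix corresponds to a 5x dataset, the upper figure mentioned above.
size_1_to_4 = (1 + 4) / 1
print(f"1:4 size x{size_1_to_4:.0f}")
```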
In any case, with your last point "we should allow complex smartness to emerge" you essentially agree with my point that new levels will emerge from orthogonal (new) directions.
The good thing about brute force is that it summons so many resources it primes the way for smarter approaches.
For those not conceited the objective is not some deus-ex-machina but "algorithms that work".
It is possible to form opinions by knowing the domain, rather than drawing an exponential curve of newspaper headlines which trails off "..."
* Train on all of television history, and streaming content.
* Train on YouTube.
* I suspect at some point we'll have a recording of most of people's lives, e.g. live-streaming: https://en.wikipedia.org/wiki/Lifestreaming#Lifecasting
It won't roll like that. AI will empower people to be more productive but won't free people up because it makes mistakes, can't help itself, and cannot function autonomously. There is no LLM application that is safe for autonomous usage today. How can we go from 0 to 1? I don't see a path. Self-driving cars still can't reach L5 to completely remove the need for a driver.
But maybe this is a blessing in disguise. It will make AI more like a new ability of humans than of the companies. Companies need people to unlock AI efficiencies. And AI tends to become open sourced so everyone has access to the same. AI is not a moat for companies, and the human ability to hand-hold it is tied to individuals. That would make the transition easier. Solving that last 1% of accuracy might encounter exponential friction and last for a while.
If your department gets a bunch of entry-level hires or interns, that frees up people in your organization even if they make mistakes, require supervision and can't function autonomously. Similarly, if an AI system can do half of a particular job under human supervision, it can free up (or make redundant) half of the people doing that job.
This will probably be (or already has been) solved by large transformer models or their successor architectures.
What was missing was common sense reasoning about what they see. We now have that.
you mean boilerplate and spam right?
As opposed to what? Being "captive" in jobs for paying bills?
What would you suggest as the alternative?
If we're running out of training data while hallucinations and performance remain so inadequate (per OpenAI's whitepaper), is an autoregressive transformer the right architecture?
Perhaps ongoing work in finetuning will take these models to the next level but ignoring the LLM hype it really does seem like things have plateaued for a while now (with expected gains from scaling).
We've been trying to speed run neural networks science for the past decade but we still don't fully understand how they work. It's like being a bad programmer who doesn't understand algorithms so you compensate by spending money on hardware to make your programs run faster. At some point we will reach a limit where you can't buy your way out of the problem with more data or money and we'll all be forced to return to studying the foundations of the science rather than just trying to scale the existing models up.
I am certain when we get to that point everyone will realize we've been trying to feed these models too much data. It makes more sense that our current architectures are just not effective at assimilating the data they have.
It seems "good enough" (for now), but synthetic data makes up a very small proportion of the training set in the current models that have been trained on it. If that proportion ends up being mostly synthetic, we'll likely see whatever weird hallucinations and biases exist in the dominant backend (GPT4 or whatever) become amplified.
It's been shown repeatedly that garbage in = garbage out for training data.
https://open.substack.com/pub/echoesofid/p/why-llms-struggle...
What encoding is this??
Just to add to this, the human brain also encodes quite a lot of evolutionary lessons. We didn't have to learn edge detectors.
AI can generate as much synthetic data as we need, on demand.
Many SOTA models, in fact, are already being trained with synthetic AI-generated data.
See https://en.wikipedia.org/wiki/Betteridge's_law_of_headlines
After all, your skin is a pretty big organ. And your sense of balance and proprioception in all the joints is quite a few different channels with pretty high temporal resolution.
No, I don't think "orthogonal" directions will be fruitful.
I also disagree on evaluations. What you call brute search is not brute search at all, nor a deus ex machina; it is a lawful and honest method of algorithmic discovery of true regularities. "Smarter approaches", meanwhile, usually amount to stilted expressions of narcissism of researchers overly proud of having come up with shallow tricks aping some aspect of explicit human reasoning. They're not actually smart, nor do they work far outside of the toy distribution for which they were developed.
I think they already don't blindly feed it just all the garbage raw data they can find, but prefer high quality, well-prepared sources.
And aside from spam, we're not just blindly posting AI content either. We're putting in meaningful prompts, rejecting answers we don't like, and editing answers we do.
In fact, there's no reason to think that academic papers won't start using language models to write better.
Tainting your text with AI can be as simple as pasting a paragraph in and asking if there's anything to improve.
Purity, accuracy and relevance of data collected from the internet are going to be a very hard problem.
If by that you mean Common Crawl, Wikipedia etc, that's hardly "high quality, well prepared", and very subject to the biases and flaws of the creators who will vary widely in expertise, intelligence and ability.
If we build a system where we feed the exhaust of an AI to another one at each step, should we call it the AI Centipede, like in the movies? https://m.imdb.com/list/ls064583741/
Posted some thoughts previously here --
We had computers spit out text (especially spam) for ages now. You'd have to filter those out, too, if tainting actually was a problem.
One "simple" application would be to build a full index of facts in the whole training corpus. Just pass each document to GPT and ask it to extract the facts. Then create an inverted index, with each fact and its references. This will allow us to generate a wikipedia-like corpus of exhaustive fact research. We can say if a fact is known or not, we can tell if it is settled or controversial, and if it is a preference we can tell what is the distribution. This has got to help with factuality and generate lots of text to feed the model. Basically only costs electricity and GPU. It nicely side-steps the problem of truth by simply modelling the empirical distribution in an explicit way. At least the model won't hallucinate outside the known facts.
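A rough sketch of the inverted index described above. The `extract_facts` function is a hypothetical stand-in for the "pass each document to GPT" step (here it naively treats each sentence as one fact), and the three-document corpus is made up for illustration.

```python
from collections import defaultdict


def extract_facts(document: str) -> list[str]:
    """Hypothetical stand-in for asking an LLM to extract the facts.

    Here every sentence is naively treated as one normalized fact.
    """
    return [s.strip().lower() for s in document.split(".") if s.strip()]


def build_fact_index(corpus: dict[str, str]) -> dict[str, set[str]]:
    """Map each extracted fact to the set of document ids that state it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for fact in extract_facts(text):
            index[fact].add(doc_id)
    return dict(index)


# Made-up corpus for illustration only.
corpus = {
    "doc1": "Water boils at 100 C at sea level. Paris is the capital of France.",
    "doc2": "Paris is the capital of France.",
    "doc3": "Water boils at 90 C at sea level.",
}

index = build_fact_index(corpus)
for fact, refs in sorted(index.items()):
    print(f"{len(refs)} source(s): {fact} -> {sorted(refs)}")
```

Note that exact string matching treats the two boiling-point statements as unrelated facts. Deciding that they conflict, or that one of them is settled, is the hard part, and this sketch does not attempt it.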
How would you "self-validate" against hallucinated facts?
What makes self-validation possible are hard external rules that can be evaluated independently and automatically. Like the rules of Chess or Go.
We don't have anything like that for LLMs and what people want to use them for.
Posing this as a thought experiment; I agree we still have more data to go. That we are wondering about this suggests that the current approach may be inadequate, i.e. it should not take petabytes of data for an LLM to match the performance of a high school student (for the LLM = AGI folks).
> One "simple" application would be to build a full index of facts in the whole training corpus. Just pass each document to GPT and ask it to extract the facts.
Agree, KG+LLM is a good next step to explore and should address some hallucination issues (see DRAGON from Leskovec and Liang groups). But we're already now talking about architectural changes as I posited.
In any case, where do we get such knowledge graphs (or index of facts)? Some already exist (e.g. Wiki, UMLS) and were created by humans but are clearly inadequate in coverage.
The proposition of using GPT-like models to generate these (i.e. GraphGPT) seems conceptually flawed as GPT does not itself know if a statement is factual or not which is problematic even for humans.
Settled vs controversial is orders of magnitude more complex, how on earth do we do this without human annotation? You can't rely on frequency (i.e. some things were facts for 100 years but all of a sudden they're not anymore and this is not controversial by definition).
The only reason LLMs work as well as they do now is that the sheer volume of data (and next-token prediction) keeps the noise hidden, and by definition an autoregressive model should be somewhat impervious to singular factoids (vs a model being grounded by the garbage dump that is CommonCrawl/the internet).
> At least the model won't hallucinate outside the known facts.
Not sure this is a given, even if a model acts as a natural language database of factoids it is probable that it will hallucinate links unless you're strictly grounding output in which case we've just built a colossally over-engineered IR/STS tool.
> One "simple" application
I think what you've posited is actually harder to build than anything that's been achieved thus far with LLMs.
But at some point you can just give the model access to tools, tell it to solve some problems, build plans, generate logs of each approach and train on those outputs. Programming is ripe for this - all the tools are easily accessible to a digital actor, everything is suited to text based model, there's plenty of tooling to provide feedback and explanations for errors geared towards humans.
No need to fumble with robotics and physical world - you can create a superhuman programmer. Then make it build out the infrastructure for physical world learning. AGI apocalypse here we come !
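To make the "tools, feedback, logs" loop concrete, here is a toy sketch. The candidate programs are hard-coded strings standing in for model outputs, `run_candidate` is a made-up helper, and running untrusted code with `exec` like this would need sandboxing in any real setup.

```python
import traceback


def run_candidate(source, tests):
    """Execute a candidate program and return a log entry with pass/fail feedback."""
    namespace = {}
    try:
        exec(source, namespace)                 # define solve() from the candidate
        for args, expected in tests:
            result = namespace["solve"](*args)
            if result != expected:
                feedback = f"solve{args} returned {result!r}, expected {expected!r}"
                return {"source": source, "passed": False, "feedback": feedback}
    except Exception:
        # Syntax errors and runtime errors become textual feedback as well.
        return {"source": source, "passed": False,
                "feedback": traceback.format_exc(limit=1)}
    return {"source": source, "passed": True, "feedback": "all tests passed"}


# Hypothetical model outputs for the task "add two numbers".
candidates = [
    "def solve(a, b):\n    return a - b\n",     # wrong answer
    "def solve(a, b):\n    return a + b\n",     # correct
    "def solve(a, b)\n    return a + b\n",      # syntax error
]
tests = [((1, 2), 3), ((0, 0), 0)]

logs = [run_candidate(src, tests) for src in candidates]
training_examples = [log["source"] for log in logs if log["passed"]]

for log in logs:
    print(log["passed"], "|", log["feedback"].strip().splitlines()[-1])
```

The logs (including the failures and their error text) are what one would feed back into training under this idea; whether that actually scales is the open question.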
Can't we keep doing that again?
We had techniques like drop-out and data augmentation to help.
The network is most likely trained with something like a categorical cross entropy loss function. Those totally punish being wrong a lot more than saying "I don't know". See https://www.v7labs.com/blog/cross-entropy-loss-guide
It's just that saying "I don't know" means that your model is spreading the probability of what the next token in the text stream might be over many different outcomes. A very 'uniform' probability distribution, instead of sharp prediction.
That looks very different to GPT literally outputting the words "I don't know".
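A small numeric sketch of that point, using a toy 4-token vocabulary and made-up probabilities: the cross-entropy loss for a spread-out prediction sits between being confidently right and being confidently wrong.

```python
import math


def cross_entropy(probs, target):
    """Cross-entropy loss for one prediction: -log of the probability of the true token."""
    return -math.log(probs[target])


vocab_size = 4          # toy vocabulary, for illustration only
target = 0              # index of the token that actually comes next

confident_right = [0.97, 0.01, 0.01, 0.01]
confident_wrong = [0.01, 0.97, 0.01, 0.01]
uniform = [1 / vocab_size] * vocab_size     # the distributional "I don't know"

for name, probs in [("confident, right", confident_right),
                    ("uniform", uniform),
                    ("confident, wrong", confident_wrong)]:
    print(f"{name:17s} loss = {cross_entropy(probs, target):.3f}")
```

So hedging across tokens is cheaper than a confident miss, but this happens at the level of the probability distribution, not as generated text.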
I don't think this is right.
Can I take an untrained LLM (a neural network with random parameters), and have it start generating garbage, and then train the network to produce more of the same and then have it bootstrap itself to intelligence? Of course not.
What if I train it just a little bit first? What if I train it until it produces gibberish, but does occasionally string together two words that are spelled correctly? Can I have it produce petabytes of gibberish and then train on that to reach GPT4's level?
You seem to argue that at some point, the AI is able to improve by training on its own output. At what point does that arrive? Because so far we've never seen an AI improve based on its own output. (As far as I know?)
Maybe it's because AI is such an overloaded term, but this is pretty commonplace for (semi-)supervised learning algorithms.
Pseudo-labeling [1,2] is an example of this that has been around for decades. When done properly it does improve the performance of the original model, up to a certain limit (far from the singularity).
Moreover, it is apparently possible to improve a model's performance by augmenting its training set with synthetic examples generated by a second model [3].
Finally, boosting [4] can also be seen as iteratively leveraging the output of a model to train a slightly better model. In fact, a specific type of boosting often yields state of the art performance on tabular data.
[1] https://arxiv.org/abs/2101.06329
[2] https://stats.stackexchange.com/questions/364584/why-does-us...
[3] https://arxiv.org/abs/2304.08466
[4] https://en.m.wikipedia.org/wiki/Boosting_(machine_learning)
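For readers who haven't seen pseudo-labeling, here is a toy sketch of the idea on synthetic 1-D data. It is not the method from any of the references above: the nearest-centroid classifier, the confidence threshold of 1.0 and the three rounds are all arbitrary choices for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable


def sample(label, n):
    """Draw n 1-D points from a Gaussian centred at -2 (class 0) or +2 (class 1)."""
    center = -2.0 if label == 0 else 2.0
    return [(random.gauss(center, 1.0), label) for _ in range(n)]


def fit(points):
    """Nearest-centroid classifier: store the mean of each class."""
    centroids = {}
    for c in (0, 1):
        xs = [x for x, y in points if y == c]
        centroids[c] = sum(xs) / len(xs)
    return centroids


def predict(model, x):
    """Return the class whose centroid is closest to x."""
    return min(model, key=lambda c: abs(x - model[c]))


def margin(model, x):
    """How much closer x is to one centroid than to the other."""
    return abs(abs(x - model[0]) - abs(x - model[1]))


def accuracy(model, data):
    return sum(predict(model, x) == y for x, y in data) / len(data)


labeled = sample(0, 3) + sample(1, 3)                        # tiny labeled set
unlabeled = [x for x, _ in sample(0, 200) + sample(1, 200)]  # labels discarded
test = sample(0, 500) + sample(1, 500)

model = fit(labeled)
base_acc = accuracy(model, test)

for _ in range(3):
    # Keep only the model's confident predictions as pseudo-labels, then refit.
    pseudo = [(x, predict(model, x)) for x in unlabeled if margin(model, x) > 1.0]
    model = fit(labeled + pseudo)

final_acc = accuracy(model, test)
print(f"labeled only: {base_acc:.3f}, after pseudo-labeling: {final_acc:.3f}")
```

The model's own confident outputs become extra training data, which is the sense in which a model "improves on its own output" here. The improvement is bounded by what the unlabeled data can reveal.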
This is not the same thing. There will still be value for fine tuning, but it's no substitute.
There are instances of things that happened (history, what Paris Hilton did say on 22nd of April etc, big database of mostly irrelevant facts) and truths (math, physics, chemistry etc) where AI can enhance discoveries by helping us to see what we have not yet realised.
Both seem endless tbh but personally I'm more interested in latter.
The best examples I know of are instruction tuning sets but that is a minute amount of data compared to the unsupervised training data.
The easiest counterexample is training LLMs: how are you going to synthesize useful language examples if you want more? Some version of this is true for most applications.
Doesn't work in the majority of domains. You need to know the generating process (e.g. game rules) and build a realistic simulation environment that emulates that, in order to generate data that is useful. Both of these things are out of reach for most applications.
I believe the next large step will be multi-modal, where text is contextualized by video so the LLM will be able to concretize what "sitting on a chair" actually means with a single example, without needing to see thousands of textual associations to infer the meaning from the text.
Not even mathematicians think in terms of logic when trying to solve problems.
You only do (formal) logic as an afterthought when communicating your proofs to other people or writing them down. Otherwise it's mostly intuition.
Identity is what provides the irreducible basis, in the sense that we cannot enter into the consideration of specific facts that are placed under this identity, and it is this identity that becomes for us the true concrete fact, beyond which there is nothing more.
...
For example, for a musical composition, compared to a painting. Where does a musical composition exist? It is the same question as to know where 'aka' exists. In reality, this composition only exists when it is performed; but to consider this performance as its existence is false. Its existence is the identity of the performances.
...
For each of the things we have considered as a truth, we have arrived through so many different paths that we confess we do not know which one should be preferred. To properly present the entirety of our propositions, it would be necessary to adopt a fixed and defined starting point. But what we are trying to establish is that it is false to admit in linguistics a single fact as defined in itself. There is, therefore, a necessary absence of any starting point, and if some reader is willing to follow our thoughts carefully from one end to the other of this volume, they will recognize, we are convinced, that it was, so to speak, impossible to follow a very rigorous order. We will allow ourselves to present, up to three or four times in different forms, the same idea to the reader because there really is no starting point more appropriate than another on which to base the demonstration.
...
As language offers no substance under any of its manifestations, but only combined or isolated actions of physiological, physical, and mental forces, and as nevertheless all our distinctions, our terminology, and all our ways of speaking are based on this involuntary assumption of a substance, we cannot refuse, first and foremost, to recognize that the most essential task of the theory of language will be to untangle what our primary distinctions are all about.
...
There are different types of identity. This is what creates different orders of linguistic facts. Outside of any identity relationship, a linguistic fact does not exist. However, the identity relationship depends on a variable point of view that one decides to adopt; therefore, there is no rudiment of a linguistic fact outside the defined point of view that presides over distinctions.
Source: http://www.revue-texto.net/docannexe/file/116/saussure255_6....
TL;DR: identity is equivalent to equivalence
When talking about identity/equivalence of types in the context of homotopy type theory, yes. This is literally what the univalence axiom states.
Auggierose, I'm curious about your thoughts on how we can provide more rigor to LLMs when it comes to large-scale program transformations and proof synthesis. Given the complexity and versatility of these systems, what kind of foundational framework do you believe would enable GPT and similar models to synthesize and execute proofs rigorously? How can we ensure that they are both reliable and adaptable while dealing with various mathematical and logical domains?
More importantly, how would this relate to NLP tasks such as: alright, the story is good, but can you rewrite it in the style of Auggierose?
I think we could reasonably say that if an optical nerve has 1 million neurons on average, and they can fire at 250 Hz at the most, that’s 250 Mbit/s or ~31 MB/s per eye of uncompressed data as an upper bound.
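Spelling out that back-of-the-envelope estimate, assuming one bit per neuron per firing slot (that assumption is what makes this an upper bound, and it is mine, not a measured figure):

```python
neurons = 1_000_000      # neurons per optical nerve, as assumed in the comment
max_rate_hz = 250        # peak firing rate, as assumed in the comment
bits_per_spike = 1       # assumption: each firing slot carries at most one bit

bits_per_second = neurons * max_rate_hz * bits_per_spike
megabits = bits_per_second / 1e6
megabytes = bits_per_second / 8 / 1e6

print(f"{megabits:.0f} Mbit/s, about {megabytes:.2f} MB/s per eye")
```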
Also, there's no reason to use data from optical nerves as input, as it is already precompressed. You should be counting optical receptors instead (120 000 000).
I don’t believe it is precompressed as it hasn’t been processed by the visual cortex yet, no? Aren’t the optical receptors simply an artifact of the “sensor design”? E.g. if the refractory period of an optical receptor is 100x that of the neuron (or you simply need to cover a certain area, as you probably have tons of receptors attached to a single neuron outside the fovea and a small number per neuron inside the fovea), you’d hook up 100 optical receptors per neuron to use its full capacity. I think this is less compression and more combining a bunch of low information channels into a higher information channel.
All we really care about here is the amount of information reaching the brain not what your physical eye is capable of receiving, so I think using the nerve makes the most sense. There’s an interesting direct analogy: we don’t really care about the number of CCD sensors in the camera that took the image, we only care about how much information is in the video coming from the camera.
But it matters little as even with 100x reduction the estimate blows GPT out of the water in the first year, making it very sample inefficient in comparison.
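A rough way to see the scale involved, under stated assumptions: start from the 1 million neurons at 250 Hz upper bound, apply the 100x reduction, and assume 16 waking hours a day (the waking-hours figure is my assumption, not from the thread).

```python
neurons = 1_000_000          # per eye, from the estimate in this thread
max_rate_hz = 250            # peak firing rate, from the estimate in this thread
reduction = 100              # the "100x reduction" discussed above
waking_hours_per_day = 16    # assumption for illustration

upper_bound_bytes_per_s = neurons * max_rate_hz / 8
reduced_bytes_per_s = upper_bound_bytes_per_s / reduction
seconds_per_year = 365 * waking_hours_per_day * 3600

terabytes = reduced_bytes_per_s * seconds_per_year / 1e12
print(f"~{terabytes:.2f} TB per eye in the first year")
```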
As for signal, I am a layman in its most extreme here (only a mist-like idea about information theory and frequency relationship), but don't the bandwidth limits only apply to fixed-rate measurements? E.g. there's basically an infinite (sans Planck limits) number of values between 4ms and 5ms, and as long as the receiver can separate them, they can encode information?
To put it in other words, if the neurons can control the impulse peak delay down to a nanosecond, then shouldn't the limit be measured based on 10^9Hz of that control vs 250Hz of max firing rate?
Regarding the nanosecond point — I don’t believe that’s how information works, and there should be many obvious problems with the idea of an infinite information channel not to mention the obvious practical ones (propagation variability, lack of a reference point, etc.). There may be some optimizations, but generally the frequency (or frequency bandwidth, which is where the generic computing term comes from) determines the information capacity, and phase modulation doesn’t magically change this (it is actually what is used in many radio systems).
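One way to make the "no infinite channel" point concrete is the Shannon-Hartley formula, C = B * log2(1 + S/N), for a band-limited channel with Gaussian noise. Applying it to a neuron is only a loose analogy, and treating 250 Hz as the bandwidth is my assumption for illustration.

```python
import math


def shannon_capacity(bandwidth_hz, snr):
    """Shannon-Hartley capacity in bits/s for a band-limited Gaussian-noise channel."""
    return bandwidth_hz * math.log2(1 + snr)


bandwidth_hz = 250   # assumption: treat the max firing rate as the channel bandwidth

for snr in (1, 10, 100, 1000, 10**6):
    capacity = shannon_capacity(bandwidth_hz, snr)
    print(f"SNR {snr:>7}: {capacity:8.1f} bits/s")
```

Capacity grows linearly with bandwidth but only logarithmically with signal-to-noise ratio, so being able to resolve finer timing buys extra bits slowly, and noise keeps the total finite.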
By the way, I wonder how much you could get from "history" data: wikipedia history pages, talk pages, commits diffs on github, pull request discussions, etc.
AFAIK so far we've only been using the finished code "artifacts", but if we're desperate for more tokens to train on, we might get a lot of mileage from just "all different versions of this dataset over time".
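As a sketch of what "all different versions over time" could look like as training text, the snippet below turns consecutive revisions into unified diffs with Python's standard `difflib`. The `revision_diffs` helper and the three-revision history are made up for illustration.

```python
import difflib


def revision_diffs(revisions):
    """Yield one unified diff per consecutive pair of revisions."""
    for i in range(1, len(revisions)):
        old = revisions[i - 1].splitlines(keepends=True)
        new = revisions[i].splitlines(keepends=True)
        yield "".join(
            difflib.unified_diff(old, new, fromfile=f"rev{i - 1}", tofile=f"rev{i}")
        )


# Made-up edit history of one small function.
history = [
    "def add(a, b):\n    return a - b\n",
    "def add(a, b):\n    return a + b\n",
    'def add(a, b):\n    """Return the sum."""\n    return a + b\n',
]

diffs = list(revision_diffs(history))
for diff in diffs:
    print(diff)
```

Each diff captures a change (a bug fix, then a documentation edit) rather than just the finished artifact, which is the extra signal the comment is pointing at.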
What I wanted to emphasize is that the training _does_ actually incentivize the model to say "I don't know" but on a lower level.
I've tried augmentation for LLM domain adaptation and it yields very modest gains in the best of situations, and even then the augmented corpus is a very tiny fraction of the underlying training corpus.
I believe OP's question was getting at whether synthetic data is useful as a substantial corpus for unsupervised training of a language model (given the topic it's reasonable to disregard other areas of 'AI') and that answer appears to be no or at least unproven and non-intuitive.
See https://blog.google/products/translate/found-translation-mor... and https://en.wikipedia.org/wiki/Google_Neural_Machine_Translat...
It doesn't look like they are still using any rule-based reasoning?
The blog post says:
> With this update, Google Translate is improving more in a single leap than we’ve seen in the last ten years combined. [...]
Which seems pretty strong evidence to me that moving away from rule-based reasoning or even a hybrid approach that includes rule-based reasoning, was a clear win?
Anyway, your question is very interesting! :-)
Forgive me if I'm misreading, but I'm having trouble with your line of reasoning. Your first reply to me scarequoting "captive" strongly implies an argument that the imperative to seek employment for survival is not a limiting factor on how people spend their time, and therefore that my suggestion that giving people more choice over how they apply their talents could be a good thing is irrelevant; but your child reply implies a concern that AI taking over some human labor will cause mass unemployment and explicitly states choice is declining.
I'm advocating that, since the genie is out of the bottle, AI could be used to free people from toil, just as other labor innovations like machinery and the 40-hour work week have done. Why the dismissive snark? In the abstract, do we not want the same thing?
That horse is already dead. Large models can learn everything, there's nothing that can be done to stop them from learning. It's too easy for them to do it. We can't hold any meaningful IP when models can generate 100 variations only different enough to pass the test. IP is dead. But on its corpse there will grow a new world of applications. We all got new skills, depends on us if we use them or not.
Literally no one said that, you are being ridiculous
[0] https://blog.kaichristensen.com/p/generative-ai-is-the-final...
I see two possible reasons, but neither seems to be worth the purity concern. The first is that AI can be wrong, make stuff up, be confidently incorrect. Anyone who has been on the internet knows this isn’t exactly a game changer.
Second is that we won’t be training AI to be like humans, but like humans + AI. Also doesn’t seem like a big deal. We’re already humans + writing + computers + internet and so on. This cutoff matters for anthropology, but I don’t see how it matters for trying to make a bot that can do my taxes.
In the same way, AI is trying to generate text that looks like its training data, but if its training data is AI generated text then it's simply being taught to be more like itself. It slowly starts to work less like a human and more like whatever its own idiosyncrasies are. It's a larger sort of version of the hallucinations it has today. If 50% of all the text on the internet becomes some part AI generated, then a huge part of the training for the next generation of AI will be the shortcomings of the current iteration of AI. And this will get worse as non-AI content moves to exclude itself from training.
LLMs weren't training AI to be like humans. They were training AI to be able to predict what humans (and other sources of common crawl data) will write next in their texts. This might seem like a small difference but it's not. Consider for example someone whose career is to research ant behavior. Their job in some sense is to be able to predict what an ant will do. Does this mean that in the course of their academic training and scientific research, this researcher is being trained to be like an ant?
If they act out these predictions and are rewarded based on their accuracy, then yes. They're being trained to be like ants. Not entirely like ants in every way, but like them in specific ways.
There's a big difference with your analogy. Predicting tokens is essentially the same as generating tokens. There's no meaningful objective difference between the activities (I'm ignoring philosophy and focusing on observables). They both lead to a stream of tokens.
For contrast, consider any sport, maybe baseball. I could predict the winner of a game but not be able to win it myself. I could predict the next pitch but not be able to throw it or hit it. There's an execution aspect you can fail at. Being like an ant would also have this aspect. Token prediction doesn't have this, or if it does (maybe turning a vector into an API response?) it's a trivial part of the whole problem.
Maybe it'd be clearer to say "write like humans" instead of "be like humans", though.
There was also remarkably little uproar or violence when the Czech Republic escaped the Iron Curtain and embraced capitalism.
The violence would be between billionaires with infinite automation making infinite money, and common folk with no way to eat.
Gains from increases in productivity in the last hundred years[0] seem to be spread between more consumption and shorter working hours[1].
Some people expected that most gains would go towards decreased working hours instead of the spread we have actually seen. Not sure there's much significance behind that?
[0] Or any span of time you might want to pick.
[1] And bigger bureaucratic overheads, but you can count that either as a weird form of consumption or as just productivity not having increased quite as fast.