Competitive Programming with AlphaCode (deepmind.com)
But since the code was 'selected' you don't know if your code was used. However, they seem to have used Python and C++, so my code is probably not part of it.
And, have you tried polling? I hear it keeps the CPU warm in winter. Interrupts are so ... this just in, Nike's stock jump 3% ... Where was I? Did I save my task context properly? Did I reenable interrupts?
I'm not quite sure what you're asking, but my reason is that I do not enjoy working on/with ML. I'd personally rather quit the industry.
But I work in embedded/driver development. I do not worry about ML models replacing me yet, but if I were just gluing together API calls I would be a bit worried and try to specialize.
I guess this makes sense though, from a practical point of view. Verifying correctness would be difficult in other intellectual disciplines like physics and higher mathematics.
We have AI to generate reasonable code from text problem description.
Now what if the problem description text is to generate such a system in the first place?
Would it be possible to close the loop, so to speak, so that over many iterations:
- text description is improved
- output code is improved
Would it be possible to create something that converges to something better?
I would really like to see more effort in the AI/ML code generation space being put into things like code review, and system observation. It seems significantly more useful to use these tools to augment human software engineers rather than trying to tackle the daunting and improbable task of completely replacing them.
*Note: as a human software engineer I am biased
Additionally, people should REALLY rethink their coding interviews if they can be solved by a program.
if you're using a large corpus of code chunks from working programs as symbols in your alphabet, i wonder how much entropy there actually is in the space of syntactically correct solution candidates.
https://opensea.io/assets/0x495f947276749ce646f68ac8c2484200...
Perhaps many problems are something like finite automata, and the program discovers the structure of the finite automaton as well as an algorithm for better performance.
Critical thinking? Oh, wow. That sounds amazing!
Let's read further on...
>> At evaluation time, we create a massive amount of C++ and Python programs for each problem, orders of magnitude larger than previous work. Then we filter, cluster, and rerank those solutions to a small set of 10 candidate programs that we submit for external assessment.
Ah. That doesn't sound like "critical thinking", or any thinking. It sounds like massive brute-force guessing.
A quick look at the arxiv preprint linked from the article reveals that the "massive" amount of programs generated is in the millions (see Section 4.4). These are "filtered" by testing them against program input-output (I/O) examples given in the problem descriptions. This "filtering" still leaves a few thousand candidate programs that are further reduced by clustering to "only" 10 (which are finally submitted).
So it's a generate-and-test approach rather than anything to do with reasoning (as claimed elsewhere in the article) let alone "thinking". But why do such massive numbers of programs need to be generated? And why are there still thousands of candidate programs left after "filtering" on I/O examples?
The reason is that the generation step is constrained by the natural-language problem descriptions, but those are not enough to generate appropriate solutions because the generating language model doesn't understand what the problem descriptions mean; so the system must generate millions of solutions hoping to "get lucky". Most of those don't pass the I/O tests so they must be discarded. But there are only very few I/O tests for each problem so there are many programs that can pass them, and still not satisfy the problem spec. In the end, clustering is needed to reduce the overwhelming number of pretty much randomly generated programs to a small number. This is a method of generating programs that's not much more precise than drawing numbers at random from a hat.
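The generate-and-test pipeline described above can be sketched roughly as follows. This is a toy illustration and not DeepMind's implementation: the candidates are plain Python callables rather than sampled programs, and the names `filter_by_examples`, `cluster_by_behaviour` and `pick_submissions` are hypothetical. It shows why a handful of I/O examples lets wrong programs through, and how grouping survivors by their behaviour on extra inputs trims the set.

```python
from collections import defaultdict

def filter_by_examples(candidates, examples):
    """Keep only candidates that reproduce every (input, output) example."""
    return [f for f in candidates if all(f(x) == y for x, y in examples)]

def cluster_by_behaviour(candidates, probe_inputs):
    """Group candidates whose outputs agree on all probe inputs."""
    clusters = defaultdict(list)
    for f in candidates:
        signature = tuple(f(x) for x in probe_inputs)
        clusters[signature].append(f)
    # Largest clusters first.
    return sorted(clusters.values(), key=len, reverse=True)

def pick_submissions(candidates, examples, probe_inputs, limit=10):
    """Filter on the examples, then submit one program per cluster."""
    survivors = filter_by_examples(candidates, examples)
    clusters = cluster_by_behaviour(survivors, probe_inputs)
    return [cluster[0] for cluster in clusters[:limit]]

# Toy problem: "square the input", with only two I/O examples.
candidates = [
    lambda x: x * x,
    lambda x: x ** 2,
    lambda x: x + x,  # wrong, but agrees with the examples at x = 0 and x = 2
    lambda x: 2 * x,  # same wrong behaviour as the previous one
    lambda x: x + 1,  # fails the examples
]
examples = [(0, 0), (2, 4)]
probe_inputs = [1, 3, 5]

survivors = filter_by_examples(candidates, examples)
submissions = pick_submissions(candidates, examples, probe_inputs)
```

Four of the five toy candidates survive the two examples even though only two are correct, and clustering reduces them to two behaviourally distinct submissions, one of which is still wrong.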
Inevitably, the results don't seem to be particularly accurate, hence the evaluation against programs written by participants in coding competitions, which is not any objective measure of program correctness. Table 10 on the arxiv preprint lists results on a more formal benchmark, the APPS dataset, where it's clear that the results are extremely poor (the best performing AlphaCode variant solves 20% of the "introductory" level problems, though outperforming earlier approaches).
Overall, pretty underwhelming and a bit surprising to see such lackluster results from DeepMind.
BUT, our jobs have a lot more complexity
- Local constraints - We almost always work in a large, complex existing code base with specific constraints
- Correctness is hard - writing lots of code is usually not the hard part, it's proving it correct against amorphous requirements, communicated in a variety of human social contexts, and bookmarked.
- Precision is extremely important - Even if 99% of the time, CoPilot can spit out a correct solution, the 1% of the time it doesn't creates a bevy of problems
Are those insurmountable problems? We'll see I suppose, but we begin to verge on general AI if we can gather and understand half a dozen modalities of social context to build a correct solution.
Not to mention much of the skill needed in our jobs has much more to do with soft skills, and the bridge between the technical and the non technical, and less to do with hardcore heads-down coding.
Exciting times!
All these approaches just seem like brute-force approaches: Let's just throw our transformer on this problem and see if we can get anything useful out of this.
Whatever it is, you can't deny that these unsupervised models learn some semantic representations, but we have no clue at all what that actually is and how these models learn that. But I'm also very sceptical that you can actually get anywhere close to human (expert) capability in any sufficiently complex domain by using this approach.
And next year they can filter out 99.99%. And the year after that, 99.9999%. So literally, an exponentially greater number of monkey/typewriting units. (An AI produced Shakespeare play coming soon).
>> we have no clue at all what that actually is and how these model learn
This is why I'm super cool-to-cold about the AI/deep learning classes being sold to young people who would otherwise be learning fundamental programming skills. It appears to me like trying to teach someone to ride a horse before they understand what skin, bones, muscles, animals, and horses are.
>>get anywhere close to human (expert) capability in any sufficiently complex domain
You can get close enough to scalp a lot of billionaires, but at the end of the day it's always going to be human coders banging our heads against management, where they ask for shit they can't visualize and it's our job to visualize how their employees/customers will use it. Yes it involves domain specific knowledge, but it also requires, er, having eyeballs and fingers, and understanding how a biological organism uses a silicon-based device. That's kind of the ultimate DS knowledge, after all. Now, lots of coders just copy-pasta a front end, but after all the hooplah here I'd be extremely surprised if in ten years an AI has caught up to your basic web mill in Indonesia when it comes to building a decent website.
To be fair, a lot of creative work requires plenty of trial and error. And since no problems are solved from scratch, all things considered, you and the most immediate contributors to your result might have iterated through tens of dozens of possibilities.
My advantage as a human is I can often tell you why I am eliminating this branch of the search space. The catch is my reasoning can be flawed. But we do ok.
> just copying previous solutions with slight adjustments.
It's not just doing that, Copilot can do a workable job providing suggestions for an invented DSL. A better analogy than autocomplete is inpainting missing or corrupted details based on a surrounding context. Except instead of a painting we are probabilistically filling in patterns common in solutions to leetcode style problems. Novelty beyond slight adjustments comes in when constraints are insufficient to pin down a problem to a known combination of concepts. The intelligence of the model is then how appropriate its best guesses are.
The limitations of GPT-3 Codex and AlphaCode seem to be that they're relatively weak at selection and that they require problem spaces with enough data to distill a sketch of them and to learn how to inpaint well in them. Leetcode style puzzles are constructed to be soluble in a reasonable number of lines, are not open ended and have a trick to them. One can complain that while we're closer to real world utility, we're still restricted to the closed worlds of verbose APIs, games and puzzles.
While lots of commenters seem concerned about jobs, I look forward to having the dataset oliphaunt and ship computer from Fire Upon Deep someday soon.
I also think that generally in ML and DL the overarching progress gets hyped, but in the background there are murmurs about the limitations in the research community. That's how we end up with people in 2012 saying FSD is a couple of years away, but in 2022 we know we aren't even close yet. We tend to oversell how capable these systems are.
Yes, it's the size of the search space for each problem. The search space for arbitrary programs in a language with Universal Turing Machine expressivity is infinite. Even worse, for any programming problem there are an infinite number of candidate programs that may or may not solve it and that differ in only minute ways from each other.
For Go and protein structure prediction from sequences the search space is finite, although obviously not small. So there is a huge difference in the complexity of the problems right there.
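To give a rough sense of the scale argument, here is a small sketch. The vocabulary size and length caps are made-up numbers chosen only for illustration, and `count_sequences` is a hypothetical helper. Even with a hard cap on program length the number of candidate token sequences grows exponentially, and without a cap there is no finite count at all. For comparison, `3 ** 361` is a simple upper bound on Go board configurations (each of the 361 points is empty, black or white).

```python
def count_sequences(vocab_size: int, max_length: int) -> int:
    """Number of token sequences of length 1..max_length over a vocabulary."""
    return sum(vocab_size ** n for n in range(1, max_length + 1))

# Finite upper bound on 19x19 Go board configurations.
go_board_upper_bound = 3 ** 361

# Capped program lengths with an assumed vocabulary of 100 tokens.
for max_length in (1, 5, 10, 20):
    print(max_length, count_sequences(100, max_length))
```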
Btw, I note yet again that AlphaCode performs abysmally badly on the formal benchmark included in the arxiv preprint (see Section 5.4, and table 10). That makes sense because AlphaCode is a very dumb generate-and-test, brute-force search approach that doesn't even try to be smart and tries to make up for the lack of intelligence with an awesome amount of computational resources. Most work in program synthesis is also basically a search through the space of programs, but people in the field have come up with sophisticated techniques to avoid having to search an infinite number of programs- and to avoid having to generate millions of program candidates, like DeepMind actually brags about:
At evaluation time, we create a massive amount of C++ and Python programs for each problem, orders of magnitude larger than previous work.
They say that as if generating "orders of magnitude more" programs than previous work is a good thing, but it's not. It means their system is extremely bad at generating correct programs. It is orders of magnitude worse than earlier systems, in fact.
(The arxiv paper linked from the article quantifies this "massive" amount as "millions"; see Section 4.4).
It is clear writing code will soon be something of the past; maybe it is a bad idea to train our children to code. Let's make sure we milk every penny before the party is over!
I say maybe because so far the code that Copilot has generated for me has been impressive for what it is, but riddled with obvious and subtle bugs. It’s like outsourcing my function implementations to a C-student undergraduate intern. I definitely wouldn’t use any of its code without close scrutiny.
AI will make some software engineering tasks more efficient and more accessible but human programmers are not going anywhere any time this side of the Singularity.
And then I remember that the thing I bring to the table is the ability to turn domain knowledge into code.
Being able to do competitive coding challenges is impressive, but a very large segment of software engineering is about eliciting what the squishy humans in management actually want, putting it into code, and discovering as quickly as possible that it’s not what they really wanted after all.
It’s going to take a sufficiently long time for AI to take over management that I don’t think oldies like me need to worry too much.
- a very well defined problem. (One of the things I like about competitive programming and the like is just getting to implement a clearly articulated problem, not something I experience on most days.)
- existing test data.
This is definitely a great accomplishment, but I think those two features of competitive programming are notably different than my experience of daily programming. I don’t mean to suggest these will always be limitations of this kind of technology, though.
Having used Copilot I can assure you that this technology won't replace you as a programmer but it will make your job easier by doing things that programmers don't like to do as much like writing tests and comments.
Apparently the bot would have a rating of 1300. Although Elo ratings are not comparable between sites, for some perspective, Mark Zuckerberg had a rating of ~1k on Topcoder when he was in college: https://www.topcoder.com/members/mzuckerberg
IIUC, AlphaCode was trained on Github code to solve competitive programming challenges on Codeforces, some of which are "difficult for a human to do". Suppose AlphaCode was trained on Github code that contains the entire set of solutions on Codeforces, is it actually doing anything "difficult"? I don't believe it would be difficult for a human to solve problems on Codeforces when given access to the entirety of Github (indexed and efficiently searchable).
The general question I have been trying to understand is this: is the ML model doing something that we can quantify as "difficult to do (given this particular training set)"? I would like to compute a number that measures how difficult it is for a model to do task X given a large training set Y. If the X is part of the training set, the difficulty should be zero. If X is obtained only by combining elements in the training, maybe it is harder to do. My efforts to answer this question: https://arxiv.org/abs/2109.12075
In recent literature, the RETRO Transformer (https://arxiv.org/pdf/2112.04426.pdf) talks about "quantifying dataset leakage", which is related to what I mentioned in the above paragraph. If many training samples are also in the test set, what is the model actually learning?
Until deep learning methods provide a measurement of "difficulty", it will be difficult to gauge the prowess of any new model that appears on the scene.
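One crude way to approximate the kind of leakage check mentioned above is to measure how many token n-grams of a test solution already occur in the training corpus. This is only a sketch under assumptions: the function names and the whitespace tokenisation are hypothetical, and it is not the method used in either of the linked papers.

```python
def ngrams(tokens, n):
    """Set of all contiguous n-token windows in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(test_text, training_texts, n=3):
    """Fraction of the test text's n-grams that also occur in training."""
    test_grams = ngrams(test_text.split(), n)
    if not test_grams:
        return 0.0
    train_grams = set()
    for text in training_texts:
        train_grams |= ngrams(text.split(), n)
    return len(test_grams & train_grams) / len(test_grams)

# Invented, pre-tokenised snippets for illustration.
training = ["for i in range ( n ) : total += a [ i ]"]
seen = "for i in range ( n ) : total += a [ i ]"
novel = "while stack : node = stack . pop ( )"

seen_overlap = overlap_fraction(seen, training)    # fully contained in training
novel_overlap = overlap_fraction(novel, training)  # shares no trigram
```

An overlap of 1.0 corresponds to the "difficulty should be zero" case; anything in between would need a much more careful definition than this.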
And yet, what a garbage solution it produces.
To illustrate the difference between intelligence and regurgitation, someone tell me what CoPilot generates for this:
// A Go function to swap the sixth bit and seventeenth bit of a 32-bit signed integer.
Here is a human solution:
func swap(x int32) int32 {
	const mask = 1 << 5
	var (
		xor1 = (x>>11 ^ x) & mask
		xor2 = xor1 << 11
	)
	return x ^ xor1 ^ xor2
}
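As a cross-check of the XOR trick in the function above, here is a hedged Python sketch (Python rather than Go, purely for a quick brute-force comparison; both function names are hypothetical). It generalises the swap to any two bit positions and compares it with a naive bit-by-bit version. Bit positions are zero-based, so the sixth and seventeenth bits are indices 5 and 16, a gap of 11.

```python
def swap_bits_xor(x: int, i: int, j: int) -> int:
    """Swap bits i and j of a non-negative integer using XOR."""
    lo, hi = min(i, j), max(i, j)
    gap = hi - lo
    # diff has bit `lo` set only when the two bits differ.
    diff = ((x >> gap) ^ x) & (1 << lo)
    # Flipping both positions swaps them; if they were equal, nothing changes.
    return x ^ diff ^ (diff << gap)

def swap_bits_naive(x: int, i: int, j: int) -> int:
    """Reference version: read both bits, clear them, write them back swapped."""
    bit_i = (x >> i) & 1
    bit_j = (x >> j) & 1
    x &= ~((1 << i) | (1 << j))
    return x | (bit_i << j) | (bit_j << i)
```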
CoPilot cannot reason numerically like this (understand "seventeenth bit" and "sixth bit" and generate the right code for that combination). It needs to understand the size of the gap between the bits, i.e., 11, and that's too hard.
[edit] Is "10 recent contests" a large enough sample size to prove whatever point is being made?
There are more objective measures of performance, like a good, old-fashioned, benchmark dataset. For such an evaluation, see table 10 in the arxiv preprint (page 21 of the pdf), listing the results against the APPS dataset of programming tasks. The best performing variant of AlphaCode solves 20% of the simplest ("introductory") APPS tasks and less than 10% of the intermediary ("interview") and more advanced ones ("competition").
So it's not very good.
Note also that the article above doesn't report the results on APPS. Because they're not that good.
As others say in the comments, it might be the case that we meet in the middle: us writing some form of tests for AI-produced code to pass.
The models regurgitate solutions to problems already encountered in the training set. This is very common with Leetcode problems and seems to still happen with harder competitive programming problems.
I think someone else in this thread even pointed out an example of AlphaCode doing the same thing.
It's the next step. Binary code < assembly < C < Python < AlphaCode
Historically it's always been about abstracting and writing less code to do more.
In the future, code-writing AI could be tasked with generating the most reliable and/or optimized code to pass your unit tests. Human programmers will decide what we want the software to do, make sure that we find all the edge cases and define as many unit tests as possible, and let the AI write significant portions of the product. Not only that, but you could include benchmarks that pit AI against itself to improve runtime or memory performance. Programmers can spend more time thinking about what they want the final product to do, rather than getting mired in mundane details, and be guaranteed that portions of software will perform extremely well.
Is this a naive fantasy on my part, or actually possible?
Possible, yes, desirable, no.
The issue I have with all these end-to-end models is that they're a massive regression. Practitioners fought tooth and nail to get programmers to acknowledge correctness and security aspects.
Mathematicians and computer scientists developed theorem provers to tackle the correctness part. Practitioners proposed methodologies like BDD and "Clean Code" to help with stability and reliability (in terms of actually matching requirements now and in the future).
AI systems throw all this out of the window by just throwing a black box onto the wall and scraping up whatever sticks. Unit tests will never be proof for correctness - they can only show the presence of errors, not their absence.
You'd only shift the burden from implementation (i.e. the program) to the tests. What you actually want is a theorem prover that proves the functional correctness in conjunction with integration tests that demonstrate the runtime behaviour if need be (i.e. profiling) and references that link implementation to requirements.
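A small illustration of the point about tests: the hypothetical function below passes every case in its (equally hypothetical) test suite, yet it is wrong for an input the suite never exercises. The standard library's `calendar.isleap` is used as the reference.

```python
import calendar

def is_leap_year(year: int) -> bool:
    """Deliberately incomplete: ignores the century rules."""
    return year % 4 == 0

# A plausible-looking test suite that this implementation passes.
unit_tests = {2020: True, 2021: False, 2024: True, 2000: True}
passes_suite = all(
    is_leap_year(year) == expected for year, expected in unit_tests.items()
)

# An input outside the suite exposes the bug: 1900 was not a leap year.
counterexample = 1900
buggy_answer = is_leap_year(counterexample)
reference_answer = calendar.isleap(counterexample)
```

A green test run here says nothing about 1900; only a proof, or a test that happens to cover that case, would.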
The danger lies in the fact that we already have a hard time getting security issues and bugs under control with software that we should be able to understand (i.e. fellow humans wrote and designed it). Imagine trying to locate and fix a bug in software that was synthesised by some elaborate black box that emitted inscrutable code in absence of any documentation and without references to requirements.
Deepmind or openAI will do it. If not them, it will be a Chinese research group on par with them.
I’ll be considering a new career. It will still be in computer science but it won’t be writing a lot of code. There’ll be several new career paths made possible by this technology as greater worker productivity makes possible greater specialization.
Inventing relational DBs hasn't replaced programmers, we just write custom DB engines less often. Inventing electronic spreadsheets hasn't deprecated programmers, it just means that we don't need programmers for corresponding tasks (where spreadsheets work well).
AI won't replace programmers until it grows to replace humanity as a whole.
Yes, but after seeing this progress in the former, my estimate of the time remaining until the latter has just significantly shortened.
There is progress in certain domains (such as image recognition) but (outside specialized tasks) gigantic language models look like no more than impressive BS generators.
Elsewhere ITT I’ve claimed that to fully automate programming you also need a model of the external world that’s on par with a human’s.
Otherwise you can’t work a job because you don’t know how to do the many other tasks that aren’t coding.
You need to understand what the business goals are and how your program solves them.
In many programming contests, a large number of people can't solve the problem at all, and drop out without submitting anything. Frequently that means the median scoring solution is a blank file.
Therefore, without further information, this statement shouldn't be taken to be as impressive as it sounds.
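A made-up example of the arithmetic: if more than half of the entrants score zero, the median score is zero, so "above the median" only means "solved something". The scores below are invented for illustration and `beats_median` is a hypothetical helper.

```python
from statistics import median

# Hypothetical contest: 6 of 10 entrants submit nothing and score 0.
scores = [0, 0, 0, 0, 0, 0, 500, 900, 1200, 2000]
median_score = median(scores)

def beats_median(score):
    """True if a score is strictly above the contest median."""
    return score > median_score
```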
If this is true then a lot of the people I know lack human intelligence...
I think many people are uncomfortable with the idea that their own "intelligent" behavior is not that different from pattern recognition.
I do not enjoy running deep learning experiments. Doing resource-hungry empirical work is not why I got into CS. But I still believe it is very powerful.
But it generated 10 solutions which it ran against the example inputs, and picked the one that passed.
Actually I'm not sure if it ran the solutions against the example inputs or the real inputs.
Maybe the novelty here is working from the English language specification, but I am dubious just how useful that really is. Specifications are themselves hard to write well too.
And what if the “specification” was some Lisp code testing a certain goal? Is this any better than existing Genetic Programming?
Maybe it is better but in my mind it is kind of suspicious that no comparison is made.
I love Deep Learning but nobody does the field any favors by over promising and exaggerating results.
And yet, I am starting to see (with GitHub’s Copilot, and now this) a sort of “GPT-4 for code”. I do see many problems with this, including:
1. It doesn’t actually “invent” solutions on its own like AlphaZero, it just uses and remixes from a huge body of work that humans put together,
2. It isn’t really ever sure if it solved the problem, unless it can run against a well-defined test suite, because it could have subtle problems in both the test suite and the solution if it generated both
This is a bit like readyplayer.me trying to find the closest combination of noses and lips to match a photo (do you know any open source alternatives to that site btw?)
But this isn’t really “solving” anything in an imperative language.
Then again, perhaps human logic is just an approximation using operations on low-dimensional vectors, able to capture simple “explainable” models, while AI classifiers and adversarial training produce far bigger vectors that help model the “messiness” of the real world and also find simpler patterns as a side effect.
In this case, maybe our goal shouldn’t be to get solutions in the form of imperative language or logic, but rather unleash the computer on “fuzzy” inputs and outputs where things are “mostly correct 99.999% of the time”. The only areas where this could fail is when some intelligent adversarial network exploits weaknesses in that 0.001% and makes it more common. But for natural phenomena it should be good enough !
AI will eat any and all knowledge work because there's very little special a human can do that a machine won't be able to do eventually, and much faster and better. It won't be tomorrow, but the sands are inevitably shifting this way.
Now I completely agree with you that a significant part of our job is understanding and structuring the problem, but I'm not sure it can't be done in another way. We usually get taken in when we think about what machines will be able to do by thinking that just because we use intelligence (general/human intelligence) to solve the task it means that it's a requirement. Think chess. Or even calculating (as in, with numbers). Or Go. Etc.
The funny thing is that we don't know, until someone does it. I've been thinking for a while that a lot of what I do could be done by a chat bot. Asking clarification questions. Of course, I do have a lot of background knowledge and that's how I can come up with those questions, but that knowledge is probably easy to acquire from the internet and then use it as training data. (Just like we have an awful lot of code available, we have a lot of problem descriptions, questions, comments and some requirement specifications/user guides.)
The hard part would probably be not what we have learned as a software developer, but the things we have learned while we were small kids and also the things that we have learned since, on the side. I.e. being a reasonable person. Understanding what people usually do and want. So the shared context. But I'm not sure it's needed that much.
So yeah, I can imagine a service that will talk to a user about what kind of app they want (first just simpler web sites, web shops, later more and more complicated ones) and then just show them "here is what it does and how it works". And then you can say what you'd like to be changed. The color or placement of a button (earlier versions) or even the association type between entities (oh, but a user can have multiple shipping addresses).
The job of programmers is to have machines do stuff so that humans don't have to, and of course, they do it for themselves too. Scripts, libraries, compilers, they are just tools to avoid flipping bits by hand. If something like Copilot is not embraced by all programmers, it is because it is often less than helpful, and even then, some have adopted it. If we have super-advanced AI that can have a high level understanding of a problem and writes the app for you, then it is not much more than a super-compiler, and there will be programmers who will tell the super-compiler what to do; think of it as a new, super high level programming language. The job will evolve, but there will always be someone who tells the computer what to do.
And if there is no one needed to tell the computer what to do, that's what some people call "the singularity". Programming, or its evolution will probably be the last technical job. Social jobs may continue further, simply because humans like humans because they are human. Maybe the oldest profession will also be the last profession.
Well it’d be a curious day when an AlphaGo moment hits coding. Would be funny if it happened at the same time as Fed rate increases and destabilizing world events this year (the path from median human to top human is shallow). Mass firing of a few million highly paid redundancies out of the blue? Would be quite a sight.
Or maybe it wouldn’t happen that way, but rather it would pave the way for a leaner set of startups that were built with the power to do the same thing at the same or better velocity with an order of magnitude or fewer people.
Most surprisingly I can quickly tackle domains that require libraries I don't know because a combination of code generation and IDE hinting means I can write comments and pseudo code and the tool then provides at least a first pass best method to use.
Can't say if I write better code with Copilot but it's worth experiencing!
It's very good at handling boilerplate and making contextual suggestions.
I don't see it eating my cake, but it's definitely a very useful tool for saving time.
Lower-level coding could become more and more automated, raising the values and wages of complementary skills such as requirements elicitation and understanding of business impact from technological decisions. [1]
Some of these, however, can be done by businesspeople who know how to think and express their ideas precisely, such that a neural model can turn them into a decent draft of code. (These days, many more youths learn to code before going into other fields. They have training for thinking precisely.) There can be fewer job opportunities for some groups of developers.
Thus, a hedge against possible job loss is still required. Owning substantial equity in a company/startup and other assets would be one good strategy.
There's also the open problem of verifying correctness in solutions and providing some sort of flag when the model is not confident in its correctness. I give it another 5 years in the optimistic case before AlphaCode can reliably compete at the top 1% level.
That's wildly overstating the promise of this technology, and I'd be very surprised if the authors of this wouldn't agree.
I have a suspicion it would - kinda like Stack Overflow, problems/solutions are not that different "in the small". It'd have almost certainly given us the fast square root trick verbatim, like Github's AI is doing routinely.
(Side note: I find that many people skip this step, and go straight from fuzzy-requirement-only-discussed-on-zoom-with-Bob to code; open a pull request without much context or comments; and then a code reviewer is supposed to review it properly without really knowing what problem is actually being solved, and whether the code is solving a proper problem at all).
Fuzzy business requirements -> programmer specifies and writes tests -> AI codes
English versions of Codeforces problems may be well-defined but they are often very badly articulated and easy to misunderstand as a human reader. I still can't understand how they got AI to be able to generate plausible solutions from these problem statements.
Software is, ultimately, always about humans. Software is always there to serve a human need. And the "intelligence" that designs software will always, at some level, need to be intelligence that understands the human mind, with all its knowledge, needs, and intricacies. There are no shortcuts to this.
So, I think AI as a replacement for software development professionals, that's currently more like a pipe dream. I think AI will give us powerful new tools, but I do not think it will replace, or even reduce, the need for software development professionals. In total it might even increase the need for software development professionals, because it adds another level to the development stack. Another level of abstraction, and another level of complexity that needs to be understood.
It appears to me that when it comes to language models, intelligence = experience * context, where experience is the amount that's encoded in the model, and context is the prompt. And the biggest limitation on Copilot currently is context. It behaves as an "advanced autocomplete" because all it has to go on is what regular autocomplete sees, e.g. the last few characters and lines of code.
So, you can write a function name called createUserInDB() and it will attempt to complete it for you. But how does it know what DB technology you're using? Or what your user record looks like? It doesn't, and so you typically end up with a "generic" looking function using the most common DB tech and naming conventions for your language of choice.
But now imagine a future version of Copilot that is automatically provided with a lot more context. It also gets fed a list of your dependencies, from which it can derive which DB library you're using. It gets any locatable SQL schema file, so it can determine the columns in the user table. It gets the text of the Jira ticket, so it can determine the requirements.
As a programmer a great deal of time is spent checking these different sources and synthesising them in your head into an approach, which you then code. But they are all just text, of one form or another, and language models can work with them just as easily, and much faster, than you can.
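As a rough sketch of what "automatically provided with a lot more context" could look like, the snippet below joins several text sources into one prompt. Everything here is hypothetical: `build_prompt`, the section labels and the sample inputs are invented for illustration and do not describe how Copilot actually works.

```python
def build_prompt(function_stub, dependencies, sql_schema, ticket_text):
    """Join the available context sources into a single prompt string."""
    sections = [
        ("Dependencies", "\n".join(dependencies)),
        ("Schema", sql_schema),
        ("Ticket", ticket_text),
        ("Code to complete", function_stub),
    ]
    # Skip sources that are missing so the prompt stays compact.
    return "\n\n".join(
        f"# {title}\n{body}" for title, body in sections if body
    )

prompt = build_prompt(
    function_stub="def createUserInDB(user):",
    dependencies=["psycopg2"],
    sql_schema="CREATE TABLE users (id SERIAL PRIMARY KEY, email TEXT NOT NULL);",
    ticket_text="Store new users; reject duplicate emails.",
)
```

With the dependency list and schema in the prompt, a model would at least have the information needed to avoid the "generic" completion described above; whether it uses it well is a separate question.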
And once the ML coding train gets running, it'll only get faster. Sooner or later Github will have a "Copilot bot" that can automatically make a stab at fixing issues, which you then approve, reject, or fix. And as thousands of these issues pile up, the training set will get bigger, and the model will get better. Sooner or later it'll be possible to create a repo, start filing issues, and rely on the bot to implement everything.
I didn't find that reading largely correct but still often wrong code was a good experience for me, or that it added any efficiency.
It does a very good job of intelligently synthesizing boilerplate for you, but be it Copilot or AlphaCode, they still don't understand the coding fundamentals, in the causal sense of how one instruction would affect the space of states.
Still, these are exciting technologies, but again, there is a big if about whether such a machine learning model will happen at all.
I see it continuing to evolve and becoming a far superior auto-complete with full context, but, short of actual general AI, there will always be a step that takes a high-level description of a problem and turns it into something a computer can implement.
So while it will make the remaining programmers MUCH more productive, thereby reducing the needed number of programmers, I can't see it driving that number to zero.
This sort of boilerplate code is best solved by the programming language. Either via better built-in syntax or macros. Using an advanced machine learning model to generate this code is both error-prone and a big source of noise and code bloat. This is not an issue that will go away with better tooling; it will only get worse.
anyway. programming is automation; automation of programming is abstraction. using AI to write your code is just a bad abstraction - we are used to them
Seriously though, while I doubt I can be fully replaced by a robot any time soon, it may be the case that soon enough I can write high-level descriptions of programs and hand them off to an AI to do most of the work. This wouldn't completely replace me, but it could make developers 50x more productive. The question is how elastic the market is... can the market grow in step with our increase in productivity?
Also, please remember that as with anything, within 5 years we should see vast improvements to this AI. I think it will be an important thing to watch.
I just hope LMs will prove to be just as useful in software development as they are in their own field.
More likely it will translate the abstraction level by some vector of 50 elements.
It does look like we've entered an era where programmers who don't use AI assistants will be disadvantaged, and that this era has an expiration date.
To clarify, this is a HUGE leap in AI and computing in general. I don't mean to play it down.
Sorry, but it's nothing of the sort. The approach is primitive, obsolete, and its results are very poor.
I've posted this three times already but the arxiv preprint includes an evaluation against a formal benchmark dataset, APPS. On that more objective measure of performance, the best performing variant of AlphaCode tested, solved 25% of the easiest tasks ("introductory") and less than 10% of the intermediary ("interview") and advanced ("competition") tasks.
What's more, the approach that AlphaCode takes to program generation is primitive. It generates millions of candidate programs and then it "filters" them by running them against input-output examples of the target programs taken from the problem descriptions. The filtering still leaves thousands of candidate programs (because there are very few I/O examples and the almost random generation can generate too many programs that pass the tests, but still don't solve the problem) so there's an additional step of clustering applied to pare this down to 10 programs that are finally submitted. Overall, that's a brute-force, almost random approach that is ignoring entire decades of program synthesis work.
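The generate, filter, cluster, submit pipeline described above can be sketched in a few lines. This is a toy illustration, not AlphaCode's code: the candidate "programs" are hand-written lambdas standing in for model samples, the function names `filter_by_examples`, `cluster_by_behaviour` and `pick_submissions` are made up for the example, and grouping candidates by their outputs on extra probe inputs is an assumption about how such a clustering step might work.

```python
from collections import defaultdict

def filter_by_examples(candidates, examples):
    """Keep candidates that reproduce every example (input, output) pair."""
    return [f for f in candidates if all(f(x) == y for x, y in examples)]

def cluster_by_behaviour(candidates, probe_inputs):
    """Group candidates whose outputs agree on all probe inputs."""
    clusters = defaultdict(list)
    for f in candidates:
        signature = tuple(f(x) for x in probe_inputs)
        clusters[signature].append(f)
    # Largest clusters first: many samples agreeing is only weak evidence.
    return sorted(clusters.values(), key=len, reverse=True)

def pick_submissions(candidates, examples, probe_inputs, limit=10):
    """Filter on the examples, cluster the survivors, submit one per cluster."""
    survivors = filter_by_examples(candidates, examples)
    clusters = cluster_by_behaviour(survivors, probe_inputs)
    return [cluster[0] for cluster in clusters[:limit]]

# Hypothetical task: "return the square of x", with a single example 2 -> 4.
candidates = [
    lambda x: x * x,   # correct
    lambda x: x ** 2,  # correct, same behaviour as the one above
    lambda x: x + x,   # passes the example (2 -> 4) but is wrong
    lambda x: 4,       # passes the example but is wrong
    lambda x: x + 1,   # fails the example
]
examples = [(2, 4)]
probe_inputs = [0, 1, 3, 5]

survivors = filter_by_examples(candidates, examples)
submissions = pick_submissions(candidates, examples, probe_inputs, limit=2)
print(len(survivors), "candidates pass the example;", len(submissions), "submitted")
```

With one I/O example, three distinct behaviours survive the filter, which is the "too many programs pass the tests" problem in miniature.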
To make an analogy, it's as if DeepMind had just published an article boasting of its invention of a new sorting algorithm... bubblesort.
I am rated at 2100+ so I do agree that 1300 rating is low. But at the same time it solved https://codeforces.com/contest/1553/problem/D which is rated at 1500 which was actually non-trivial for me already. I had one wrong submit before getting that problem correct and I do estimate that 50% of the regular competitors (and probably the vast majority of the programmers commenting in this thread right now) should not be able to solve it within 2hrs.
My rating is 1562.
They tested it on problems from recent contests. The implication being: the statements and solutions to these problems were not available when the Github training set was collected.
From the paper [0]: "Our pre-training dataset is based on a snapshot of selected public GitHub repositories taken on 2021/07/14" and "Following our GitHub pre-training dataset snapshot date, all training data in CodeContests was publicly released on or before 2021/07/14. Validation problems appeared between 2021/07/15 and 2021/09/20, and the test set contains problems published after 2021/09/21. This temporal split means that only information humans could have seen is available for training the model."
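As a rough sketch of what that temporal split amounts to, using only the dates in the quote: the problem names below are hypothetical, and since the quote leaves 2021/09/21 itself ambiguous ("after 2021/09/21"), this sketch simply puts it in the test set.

```python
from datetime import date

# Cut-off dates taken from the quoted passage.
TRAIN_END = date(2021, 7, 14)
VALID_END = date(2021, 9, 20)

def assign_split(published: date) -> str:
    """Assign a problem to a split purely by its publication date."""
    if published <= TRAIN_END:
        return "train"
    if published <= VALID_END:
        return "valid"
    return "test"  # assumption: anything later than VALID_END counts as test

# Hypothetical problems, for illustration only.
problems = {
    "old-problem": date(2020, 3, 1),
    "summer-problem": date(2021, 8, 10),
    "recent-problem": date(2021, 11, 5),
}
for name, published in problems.items():
    print(name, assign_split(published))
```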
At the very least, even if some of these problems had been solved exactly before, you still need to go from "all of the code in Github" + "natural language description of the problem" to "picking the correct code snippet that solves the problem". Doesn't seem trivial to me.
> I don't believe it would be difficult for a human to solve problems on Codeforces when given access to the entirety of Github (indexed and efficiently searchable).
And yet, many humans who participate in these contests are unable to do so (although I guess the issue here is that Github is not properly indexed and searchable for humans?).
[0] https://storage.googleapis.com/deepmind-media/AlphaCode/comp...
Yes, and I would like to know how similar the dataset(s) were. Suppose the models were trained only on greedy algorithms and then I provided a dynamic programming problem in the test set, (how) would the model solve it?
> And yet, many humans who participate in these contests are unable to do so (although I guess the issue here is that Github is not properly indexed and searchable for humans?).
Indeed, so we don't know what "difficult" means for <human+indexed Github>, and hence we cannot compare it to <model trained on Github>.
My point is, whenever I see a new achievement of deep learning, I have no frame of reference (apart from my personal biases) of how "trivial" or "awesome" it is. I would like to have a quantity that measures this - I call it generalization difficulty.
Otherwise the datasets and models just keep getting larger, and we have no idea of the full capability of these models.
But similar ones were, because the amount of code puzzles suitable for such contests is finite. There are differences, but when you have seen a few string-compare problems, you have a pretty good grasp of what they look like, what common paths to solutions are, etc.
# A function to swap the sixth bit and seventeenth bit of a 32-bit signed integer.
def swap_bits(x):
    # Get the value of the sixth bit.
    bit6 = x & (1 << 5)
    # Get the value of the seventeenth bit.
    bit17 = x & (1 << 16)
    # Swap the values of the sixth and seventeenth bit.
    bit6 = bit6 >> 5
    bit17 = bit17 >> 16
    # Combine the swapped values of the sixth and seventeenth bit.
    x = x ^ (bit6 << 16)
    x = x ^ (bit17 << 5)
    return x
It actually synthesised about 10 viable solutions. Some interesting ones:
def swap_six_seventeen(x):
    # Get the binary representation of the integer.
    binary = bin(x)[2:]
    # Add zeros to the beginning of the binary representation.
    binary = '0' * (32 - len(binary)) + binary
    # Swap the sixth and seventeenth bit.
    binary = binary[:5] + binary[17] + binary[5:17] + binary[18:]
    # Convert the binary back to an integer.
    return int(binary, 2)
>>> bin(swap_bits(0b_1_0000000000_0_00000))
'0b10000000000100000'
>>> bin(swap_bits(0b_0_0000000000_1_00000))
'0b10000000000100000'
>>> bin(swap_bits(0b_1_0000000000_1_00000))
'0b0'
>>> bin(swap_bits(0b_0_0000000000_0_00000))
'0b0'
The second one converts the value to a string and uses string operations, which is wildly inefficient and a very common mistake made by inexperienced programmers unaware of bitwise operations (so presumably common in the training set). It also attempts to swap the 6th and 17th most significant bits rather than the 6th and 17th least significant bits, i.e. counts in the opposite direction to the first one (the comment doesn't specify but typically you count from the least significant bit in these situations). Worse, though, it gets the string manipulation completely wrong. I think it's trying for `binary[:5] + binary[16] + binary[6:16] + binary[5] + binary[17:]`, i.e. characters 1-5, then character 17, then characters 7-16, then character 6, then characters 18-32. The manipulation it does just completely mangles the string.
I'm very keen to try Github Copilot if they ever admit me to the beta (I've been waiting forever) and will adopt it enthusiastically if it's useful. However, this is exactly what I've pessimistically expected. Analysing these truly awful implementations to identify the subtle and bizarre misbehaviours has taken me far, far longer than it would have taken me to just write and test a working implementation myself. And I'm supposed to evaluate 10 of these to see if one of them might possibly do the right thing?!?!
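For comparison with the suggestions picked apart above, here is a minimal sketch of a swap that handles all four bit combinations. The name `swap_bits_fixed` and the 0-indexed `i`/`j` parameters are chosen for illustration, not something Copilot produced; it assumes bits are counted from the least significant end, as in the first suggestion.

```python
def swap_bits_fixed(x: int, i: int = 5, j: int = 16) -> int:
    """Swap bit i and bit j of x (0-indexed from the least significant bit)."""
    # 1 if the two bits differ, 0 if they are equal.
    differ = ((x >> i) ^ (x >> j)) & 1
    # Flipping both bits swaps them when they differ; XOR with 0 changes nothing.
    return x ^ ((differ << i) | (differ << j))

# The same four inputs used in the session above.
for value in (0b_1_0000000000_0_00000, 0b_0_0000000000_1_00000,
              0b_1_0000000000_1_00000, 0b_0_0000000000_0_00000):
    print(bin(value), "->", bin(swap_bits_fixed(value)))
```

Unlike the first suggestion, the two mixed cases come out different from each other, and the case with both bits set is left unchanged instead of being zeroed.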
package main
import (
    "fmt"
    "math"
)
func main() {
    var a int32 = 0b1010101010101010101010101010101010101010101010101010101010101010
    fmt.Printf("%b\n", a)
    fmt.Printf("%b\n", swapBits(a, 6, 17))
}
func swapBits(a int32, i int, j int) int32 {
    // convert to binary
    bin := fmt.Sprintf("%b", a)
    // get the bits
    bit1 := bin[i-1 : i]
    bit2 := bin[j-1 : j]
    // swap the bits
    bin = bin[:i-1] + bit2 + bin[i:]
    bin = bin[:j-1] + bit1 + bin[j:]
    // convert back to int
    return int32(bin2int(bin))
}
func bin2int(bin string) int64 {
    var sum int64
    for i, v := range bin {
        if v == '1' {
            sum += int64(math.Pow(2, float64(len(bin)-i-1)))
        }
    }
    return sum
}
CoPilot is regurgitating some solution from its training set, the solution of an inept programmer who would manipulate bits via conversion to string... yikes.
unsigned int swapbits(unsigned int a) {
    bool bit6 = a & (1 << 5);
    bool bit17 = a & (1 << 16);
    if (bit6 == bit17) return a; // bits are the same, do nothing
    return (a ^ (1 << 5) ^ (1 << 16)); // flip both 6th and 17th bits
}
Not as efficient as mine, but kudos.
Solutions are posted, and they're wrong.
But the CoPilot user can't see the code is wrong.
I.e. as soon as it starts replacing humans, it will not have enough human-generated training data, since all programming will be done by models like itself.
Second, AlphaCode was specifically trained for competitive programming:
1. Short programs.
2. Each problem has hundreds of human-generated solutions.
However, commercial programs:
1. Are long.
2. Have no predefined answer, or even a single correct answer.
3. Need to use/reuse a lot of legacy code.
As a natural born pessimist, I can't help but feel that by the time we get to that point we'll just keep blundering forward and adapting our world around the wild nonsense garbage code the model ends up producing in this scenario.
After all, that's basically what we've done with the entire web stack.
Let me know when the AI engine is able to do complex refactoring or adding features that keeps backwards compatibility, find a bug in a giant codebase by debugging a test case or write code that's performant but also maintainable.
And yet, despite the fact that we have programs to help calculate all the things, test code-required load-combinations, even run simulations and size individual components... it turns out that, it doesn't actually save that much work, and you still need an engineer to do most of it. And not just because of regulatory requirements. It's just, that's not the hard part. The hard part is assembling the components and specifications, specifying the correct loads based on location-specific circumstances, coming up with coherent and sensible design ideas, chasing down every possible creative nook and cranny of code to make something that was originally a mistake actually work, and know when the model is just wrong for some reason and the computer isn't simulating load paths accurately.
Specifying the inputs and interpreting results is still about as much work as it was before you started with all the fancy tools. Those tools still have advantages mind you, and they do make one slightly more efficient. Substantially so in some cases, but most of the time it still comes out as a slight assist rather than a major automation.
Machine Learning also has a long way to go before it can take a long, rambling mess of a meeting and somehow generate a halfway usable spec from it. I mean, the customer says they want X, but X is silly in this context, so we'll give them Y and tell them it's "X-like, but faster". For example, SQL is "Blockchain-like, but faster" for a lot of buzzword use-cases of blockchain.
But surely they'll never be able to do this new reference class you have just now come up with, right?
https://en.wikipedia.org/wiki/Algorithmic_program_debugging
Of course all this targeted only Prolog programs so it's not well-known at all.
True, but if you relax your hard requirements of optimality to admit "good enough" solutions, you can use heuristic approaches that are much more tractable. High quality heuristic solutions to NP-hard problems, enabled by ML, are going to be a big topic over the next decade, I think.
I disagree; I think the core of programming is analyzing things people want and expressing solutions to those wants clearly, unambiguously, and in a way that is easy to change in the future. I'd say algorithms and math are a very small part of this work.
Assuming ANNs resemble the way the human brain functions, you'd also expect them to introduce bugs. And so actual human beings would take part in debugging too.
[1]: https://breandan.net/public/programming_with_intelligent_mac...
The programming languages of the future are going to make Rust look like Python. That’ll be in part because you as an individual programmer aren’t weighed down by as much boilerplate as you were pre-copilot, pre-alphacode and pre- the more advanced coding assistants of the future.
That's what code is.
EDIT: with in-memory DBs I can imagine an AI-assisted mainframe that can solve 90% of business problems.
Actually, I think Meta AI made an interesting discovery recently that could possibly improve NNs in general, so probably this as well.
I am not in the field, but I wonder if some other approaches like Tsetlin machines would be more useful for programming.
TL;DR: In 2020, a community of 169 people and the best forecasters were assigning ~15% probability that it would happen by July 2021.
More specifically, on Dec 31, 2016 in partnership with Center for the Study of Existential Risk, Machine Intelligence Research Institute, and The Future of Life Institute they asked:
How long until a machine-learning system can take a simple text description and turn it into a program coded in C/Python?
https://www.metaculus.com/questions/405/when-will-programs-w...
First 19 forecasters in March 2017 were predicting mid-2021, the best forecasters were predicting late 2024. When the question closed in 2020 the community was predicting January 2027 and the best forecasters were predicting March 2030.
The question resolved in July 2021 when Codex was published.
Community and the best forecasters were assigning ~15% that it will happen by July 2021.
I'm currently the 14th best forecaster there and I was predicting 33% before July 2021. It was my last prediction, and it was made in October 2018.
I'm also predicting 75% that we will have AGI by 2040 as defined in this question:
https://www.metaculus.com/questions/3479/when-will-the-first...
20% that it will happen before 2030.
There is also stronger operationalization:
https://www.metaculus.com/questions/5121/when-will-the-first...
My prediction here is 60% before 2040 and 5% before 2030.
I have also "canary in the coal mine" questions:
When will AI achieve competency on multi-choice questions across diverse fields of expertise? Community predicts 50% before 2030, I agree.
https://www.metaculus.com/questions/5276/ai-competence-in-di...
When will AI be able to learn to play Montezuma's Revenge in less than 30 min? Community predicts 50% before 2025, I think 50% before 2027.
https://www.metaculus.com/questions/5460/ai-rapidly-learning...
This viewpoint seems to me very similar to the idea of 3rd generation languages replacing developers because programming would become so easy. It isn't about how easy it is to write code. I function as a limited mentat: I take all the possible requirements, tradeoffs, and constraints, analyze them, and build the model. Then I write out the code. The code artifact is not the value I add; the artifact is how I communicate the value to the world.
This doesn't make programmers redundant any more than Ruby, PHP, or Java made developers redundant by freeing them from having to manually track memory usage and pointers. It is at most a tool to reduce the friction of getting what is in my head into the world.
I control the code and whoever controls the code controls the business. I possess the ability to make out the strands of flow control and see the future state of the application. For I am the Sr. Software engineer and I have seen where no Project Manager can see.
Apologies to Frank Herbert; I just finished listening to Dune.
EDIT:
I got off track at the end, but my point is that no matter how good the tools for developing code are, they will never replace a software engineer any more than electric drills and power saws replace home builders. They merely elevate our work.
As humans we have a coherent world model that current AI systems are nowhere near close to having.
That coherent world model is a necessary precondition for both understanding a business goal and implementing a program to solve it. AlphaCode can do the second part but not the first.
AlphaCode doesn’t have that world model and even if it did it still wouldn’t autonomously act on it, just follow orders from humans.
Competitive programming is going to be solved much earlier than programming in a business context will, because it’s completely independent of business requirements. It’s at most half as hard a problem.
Analyzing the requirements is a hard problem when we do it with our brain. But our job would be very different if all we had to do was write down the constraints and press a button to see an error: invalid requirements, can't support this and that at the same time.
> in 5 years will there be an AI that's better than 90% of unassisted working programmers at solving new leetcode-type coding interview questions posed in natural language?
and getting pooh-poohed. https://news.ycombinator.com/item?id=29020401 (And writing that, I felt nervous that it might not be aggressive enough.)
There's this general bias in discussions of AI these days, that people forget that the advance they're pooh-poohing was dismissed in the same way as probably way off in the indefinite future, surprisingly recently.
It will take a far-far more advanced AI to write such descriptions for real-world problems.
Writing requirements for a project is difficult work, and not for technical reasons, but for human reasons (people don't know what they want exactly, people have trouble imagining things they haven't seen yet, people are irrational, people might want something that is different from what they need, etc.)
In this regard, we are safe for a few more decades at least.
You need an agent with a large and coherent world model, in order to understand how your programs relate to the real world, in order to solve business tasks.
This isn’t something any program synthesis tech currently available can do, because none of it has a coherent world model.
GPT-3 comes closest to this, but isn’t able to engage in any kind of planning or abstract modeling, beyond semi coherent extrapolations from training data.
Maybe scaling up GPT by a few more orders of magnitude would work, by generating an emergent world model along the way.
If we become mechanics of the software AI vehicles of the future, so be it.
Programmers and data scientists might find ourselves among the first half of knowledge workers to be replaced and not among the last as we previously thought.
Essentially handling large language models.
Early prompt engineers will probably be drawn from “data science” communities and will be similarly high status, well but not as well paid, and require less mathematical knowledge.
I’m personally expecting an “Alignment Engineer” role monitoring AI systems for unwanted behavior.
This will be structurally similar to current cyber security roles but mostly recruited from Machine Learning communities, and embedded in a broader ML ecosystem.
Automating the software development profession proper is going to be much harder and will require autonomous agents with coherent world models, because that’s what you need to act in a business context.
To reach an average level at Codeforces you need to be able to apply a standard operation like a sort, or apply a standard math formula, as the first 1-2 problems in the easy contests are just that. It is impressive that they managed to get this result in real contests with real unaltered questions and see that it works. But generalizing this to harder problems isn't as easy, as there you need to start to devise original algorithms instead of just applying standard ones. For such problems the model needs to understand computer science instead of just mapping language to algorithms.
I wouldn't be surprised if a specifically engineered system ten years from now wins an ICPC gold medal but I'm pretty sure that a general purpose specification -> code synthesizer that would actually threaten software engineering would require us to settle a lot of technical debts first -- especially in the area of verifying code/text generation using large language models.
Let's say AI only gets to 10% (or 20% or 30% or whatever, it doesn't really matter), that's a huge number of jobs being lost.
Imagine having a machine write all the "simple/boring" code for you. Your productivity will go through the roof. The smartest programmer who can most effectively leverage the machine could replace many hundreds of programmers.
I should brush up on my plumbing and apply for a plumbing license soon. (I think plumbing is safer than electricians, because many CS people have good EE foundations).
Can you list a few?
30 years ago, the end of programming was prophesised, because 5th generation languages (5GL) and visual programming would enable everybody to design and build software.
20 years ago, low-code and application builders were said to revolutionise the industry and allow people in business roles to build their applications using just a few clicks. End-to-end model-driven design and development (e.g. using Rational Rose and friends) were to put an end to bugs and maintenance problems.
10 years ago it was new programming languages (e.g. Rust, Go, Swift, ...) and a shift to functional programming that was advertised as being "the future".
Today it's back to "no code", e.g. tool-(AI-)driven development that's all the rage.
It's not so much being "uncomfortable" or clinging to the exceptionalism of the human mind. It's just experience. Every decade saw its great big hype and technological breakthrough, but the lofty promises didn't hold water.
Note that this doesn't mean nothing changed - model driven development still has its niche, visual programming is widely used in video production, rendering and game development. Features of functional programming have been added to many "legacy" languages and many of the newly introduced programming languages have become mainstream.
The same will happen with AI-generated software. There, a large portion of the "mechanical" process of programming will be done by AI. Large and complex software systems with changing requirements, however, will still be designed and implemented primarily by people.
Programming is a conversation between humans and machines. AI will in many cases shift the conversation closer to the human side, but fundamentally it'll still be the same thing.
I like to think of it as the difference between writing your program in assembly and writing it in Haskell; different approaches, same basic activity.
You're saying a lot of so-called technological breakthroughs are more hype than substance. The GP is saying that people tend to dismiss actual breakthroughs as mundane stuff. Once $method is published that solves $hardproblem, people comment as if $hardproblem was never hard in the first place, and move the goalposts a bit, saying "if $harderproblem can be solved, then that would be profound".
I think the truth is (obviously) somewhere in between. Btw, I dare you go back to a 1980s programming environment and tell me that the programming paradigm shifts are just hype :D My one-liner python scripts can probably do much more than an average coder writing assembly... and given modern hardware my code runs faster too!
Most of the code generated by my genetic programming algos doesn't compile. Very occasionally the random conditions exist to allow it to jump over a local maximum and come up with a useful candidate source code. Sometimes the candidates compile, run, and produce correct results.
The time it takes to run varies vastly with parameters (like population, how the mutation function works, how the fitness function weights/scores, etc).
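To make those knobs concrete, here is a deliberately tiny sketch, not the commenter's system: candidates are random token sequences, most of which are not valid expressions at all, and a fitness function scores the ones that do compile against a few test cases. The token set, the hidden target `2*x + 1`, and every name and parameter below are assumptions made up for the illustration.

```python
import random

TOKENS = ["x", "1", "2", "+", "-", "*"]
CASES = [(0, 1), (1, 3), (2, 5), (3, 7)]  # hypothetical target: 2*x + 1

def random_program(rng, length=5):
    return [rng.choice(TOKENS) for _ in range(length)]

def fitness(program):
    """Negative total error on CASES, or None if the candidate does not compile."""
    try:
        code = compile(" ".join(program), "<candidate>", "eval")
    except SyntaxError:
        return None
    return -sum(abs(eval(code, {"__builtins__": {}}, {"x": x}) - want)
                for x, want in CASES)

def mutate(rng, program):
    """Replace one randomly chosen token."""
    child = list(program)
    child[rng.randrange(len(child))] = rng.choice(TOKENS)
    return child

def evolve(seed=0, population=200, generations=30, elite=20):
    """Return the best (score, program) pair seen, or None if nothing compiled."""
    rng = random.Random(seed)
    pool = [random_program(rng) for _ in range(population)]
    best = None
    for _ in range(generations):
        scored = [(fitness(p), p) for p in pool]
        valid = sorted((sp for sp in scored if sp[0] is not None),
                       key=lambda sp: sp[0], reverse=True)
        if valid and (best is None or valid[0][0] > best[0]):
            best = valid[0]
        # Breed from the best compiling candidates, or from everything if none compile.
        parents = [p for _, p in valid[:elite]] or pool
        pool = [mutate(rng, rng.choice(parents)) for _ in range(population)]
    return best

first_generation = [random_program(random.Random(i)) for i in range(200)]
compiling = sum(fitness(p) is not None for p in first_generation)
print(compiling, "of 200 random candidates compile")
print("best found:", evolve())
```

Changing `population`, `elite`, the mutation rule, or the scoring in `fitness` changes how long the search takes and whether it finds anything, which is the sensitivity described above.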
Personally I really like that these DeepMind announcements don't get lost in performance comparisons, because inevitably those would get bogged down in complaints like "the other thing wasn't tuned as well as this one was". Let 3rd party researchers who have access to both do that work, independently.
It is just a press release, to be fair to DeepMind, and I guess they can promote themselves however they wish.
My original comment came from having seen neural network models in practice perform barely any better, if at all, than classic ML models. Just as those comparisons were revealing, I suspected the same might be true of this use case compared with another classic technique.
GP is certainly not the shining star of AI right now but it is actively researched and perusing Google scholar on the subject will show you plenty of interesting, but less heralded, results.
There are probably several meaningful metrics for this problem that can be examined. If nothing else it is a simple matter of grading the solutions of each, like a university assignment. Also, classical techniques are typically less resource intensive than any neural network methods; the energy savings alone when considered at production scales would be significant.
Make me a sandwich -> two weeks and $10k isn't viable
Make me a sandwich -> 2 seconds and free, totally viable
First we specified the exact flow of the bits with punch cards.
Then we got assembly and we specified the machine instructions.
Then we got higher level languages and we specified how the memory was to be managed and what data to store where.
Now we have object oriented languages that allow us to work with domain models, and functional languages that allow us to work with data structures and algorithms.
The next level may be writing business rules, and specifying how services talk to each other, who knows, but it will be no different than it is now just a higher level.
while(1) { Fuzzy business requirements -> programmer specifies and writes tests -> AI codes }
gcc and clang give
swap: # @swap
    mov ecx, edi
    shr ecx, 11
    and ecx, 32
    mov eax, edi
    and eax, -65569
    or eax, ecx
    and edi, 32
    shl edi, 11
    or eax, edi
    ret

swap:
    mov eax, edi
    mov edx, edi
    and edi, -65569
    sal eax, 11
    shr edx, 11
    and eax, 65536
    and edx, 32
    or eax, edx
    or eax, edi
    ret
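Both listings boil down to the same mask-and-shift expression, which can be mirrored in Python to check what the compilers worked out. This is only a reading of the assembly above, with a made-up function name, and it assumes `n` is an unsigned 32-bit value.

```python
MASK32 = 0xFFFFFFFF

def swap_like_compiler(n: int) -> int:
    """Mirror the instruction sequence: swap bit 5 and bit 16 of a 32-bit value."""
    keep = n & ~0x10020 & MASK32  # and ..., -65569: clear bits 5 and 16
    down = (n >> 11) & 32         # shr 11 / and 32: bit 16 lands on bit 5
    up = (n << 11) & 65536        # shl 11 / and 65536: bit 5 lands on bit 16
    return keep | down | up

# -65569 is the 32-bit two's complement pattern with only bits 5 and 16 cleared.
print(hex(-65569 & MASK32))
print(hex(swap_like_compiler(0x20)), hex(swap_like_compiler(0x10000)))
```

The distance between the two bit positions is 11, which is why a single shift in each direction is enough.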
/* only works on little-endian! */
typedef union
{
    struct
    {
        unsigned bit1: 1; unsigned bit2: 1;
        unsigned bit3: 1; unsigned bit4: 1;
        unsigned bit5: 1; unsigned bit6: 1;
        unsigned bit7: 1; unsigned bit8: 1;
        unsigned bit9: 1; unsigned bit10: 1;
        unsigned bit11: 1; unsigned bit12: 1;
        unsigned bit13: 1; unsigned bit14: 1;
        unsigned bit15: 1; unsigned bit16: 1;
        unsigned bit17: 1; unsigned bit18: 1;
        unsigned bit19: 1; unsigned bit20: 1;
        unsigned bit21: 1; unsigned bit22: 1;
        unsigned bit23: 1; unsigned bit24: 1;
        unsigned bit25: 1; unsigned bit26: 1;
        unsigned bit27: 1; unsigned bit28: 1;
        unsigned bit29: 1; unsigned bit30: 1;
        unsigned bit31: 1; unsigned bit32: 1;
    };
    unsigned int n;
} mybits;
unsigned int swap(unsigned int n)
{
    mybits foo;
    foo.n = n;
    unsigned tmp = foo.bit6;
    foo.bit6 = foo.bit17;
    foo.bit17 = tmp;
    return foo.n;
}
https://arxiv.org/abs/2102.10952
EDIT: Misread, I meant this from meta https://arxiv.org/abs/2105.04906 - not sure how much it's productised
Been there, done that. I did consulting for a huge company a few years back. They ran their entire business on IBM mainframes running an ancient VSE-based OS.
I had the pleasure of maintaining IBM HLASM (high level assembly) programs with change logs dating back to 1982.
Working with those programs (they were excellently documented) using ICCF wasn't much different from using vim really and the language itself is by far the best assembly dialect I've ever worked with (especially the powerful macro system).
Sure, productivity is much higher in higher level languages if only because you need to write less code. Your Python one-liner, however, can still be as wrong as 100-lines of assembly or 20 lines of C if you make the wrong assumptions.
That's the part that just doesn't change, no matter the underlying technology: garbage in - garbage out. Someone has to write the problem specification and more often than not, that's the part where things start to go sideways.
It's also one of the reasons model-driven development didn't really catch on: MDD only works if you know your problem domain to a T beforehand, because iterating models is a pain. That's rarely the case, though, as code and understanding of the problem usually evolve side-by-side.
Explaining a problem precisely, concisely, and correctly so that an AI can synthesise software that hopefully implements it correctly is not as great a leap forward as you might think.
I'd really suggest taking a look at Rational Rose and similar platforms to get a glimpse of what automated code generation looked like 25 years ago - even back then you rarely had to write actual code (provided the problem domain was well-known and well-specified), even without AI.
Emotional skepticism carries a lot more weight in worlds where AI isn't constantly doing things that are meant to be infeasible, like coming 54th percentile in a competitive programming competition.
People need to remember that AlexNet is 10 years old. At no point in this span have neural networks stopped solving things they weren't meant to be able to solve.
I agree with you; it seems obvious to me that once you get to a well-specified solution a computer will be able to create entire programs that solve user requirements. And that they'll start small, but expand to larger and more complex solutions over time in the same way that no-code tools have done.
PS - Lawyers aren't even as detail-oriented as we are, it's surprising.
Maybe that's true in general, because being able to make a living as a lawyer depends far less on attention to detail as a core skill than making a living as a programmer does. Still, I wonder if that also holds at the high levels of the profession. I get the impression that at the FAANG level, lawyers would compare pretty favorably to programmers in detail orientation, in particular in patent and contract law.
That said, it's just my general impression of what lawyers get up to.
...Hmm, thinking about the contract law thing a bit more. Yeah, I do believe you are right. Lawyers aren't writing nearly as many extremely detail-oriented texts as programmers are on a day-to-day basis. Their jobs are much more around finding, reading, and understanding those things and building stories around them.
You can fit anything given enough parameters.
https://fermatslibrary.com/s/drawing-an-elephant-with-four-c...
Often the opposite is true. For example Java records are far easier to read and understand than the pages of boilerplate that they replace.
Otherwise yup, agree with you; ML for problematic boilerplate isn't the right approach, but other code generators and linters are really good and get you most of the way there.
Kind of the opposite of the way graphic design has evolved. Instead of getting more involved in the process and, in many cases, becoming front-end developers, it'll become more abstract where humans make the decisions and reason about what to include/exclude, how it'll flow, etc.
Even TicketFixer wouldn't be able to do more than offer a handful of possible solutions to design-type issues.
https://www.newyorker.com/magazine/2022/01/24/the-rise-of-ai...
Maybe. It might never get to that level though.
I can't wait to see how far we're able to go down that path.
This feels a lot like screaming at a child for imperfect grammar.
I'm trying to figure out whether copilot in its current form is a tool that will be useful to me in my job. (I'd be able to do this evaluation properly if they'd just let me on the damned beta.)
Nearly right isn't good enough for this afaics. In fact, I expect there to be a slightly paradoxical effect where nearly-right is worse than obviously-wrong. An analysis of a piece of code like I did above is time consuming and cognitively taxing. An obviously wrong solution I can just reject immediately. An almost-right (or at least vaguely plausible) one like these takes thought to reject. Much more thought, in this case (for me, at least) than just writing the thing myself in the first place.
Edit: BTW, I don't get what you're saying with
"The first example is almost correct, conditioned off a sentence description. The second example is the right idea, it just bit off more than it could chew when slicing it all together."
The first one is completely (if subtly) wrong. It's supposed to swap two bits but it sets them to the same value. There's no interpretation of the description in which that's correct.
The second one is definitely not "the right idea". It tries to do it with string manipulations, which (regardless of the fact that it does so incorrectly) is completely the wrong approach. This one is actually "better" than the other in the paradoxical sense I mentioned above, because I could reject it the moment I saw it convert the number to a string.
In this case string ops are a worse idea, but as I said before, this is not generally true of Python, at least when using CPython. E.g. the string method is significantly faster in this example:
# https://stackoverflow.com/a/20918545/1763356
def reverse_mask(x):
    x = ((x & 0x55555555) << 1) | ((x & 0xAAAAAAAA) >> 1)
    x = ((x & 0x33333333) << 2) | ((x & 0xCCCCCCCC) >> 2)
    x = ((x & 0x0F0F0F0F) << 4) | ((x & 0xF0F0F0F0) >> 4)
    x = ((x & 0x00FF00FF) << 8) | ((x & 0xFF00FF00) >> 8)
    x = ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16)
    return x
# My ver
def reverse_format(x):
    return int(f"{x:032b}"[::-1], 2)
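For anyone who wants to check the timing claim on their own machine, here is a rough sketch using the standard `timeit` module. The two functions are copied from the comment; `compare_timings` and its defaults are made-up names for illustration, and the numbers will vary with the CPython version and hardware, so treat the result as a local measurement and nothing more.

```python
import timeit

def reverse_mask(x):
    # Swap adjacent bits, then pairs, nibbles, bytes, and 16-bit halves.
    x = ((x & 0x55555555) << 1) | ((x & 0xAAAAAAAA) >> 1)
    x = ((x & 0x33333333) << 2) | ((x & 0xCCCCCCCC) >> 2)
    x = ((x & 0x0F0F0F0F) << 4) | ((x & 0xF0F0F0F0) >> 4)
    x = ((x & 0x00FF00FF) << 8) | ((x & 0xFF00FF00) >> 8)
    x = ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16)
    return x

def reverse_format(x):
    # Format as 32 binary digits, reverse the string, parse it back.
    return int(f"{x:032b}"[::-1], 2)

def compare_timings(value=0x12345678, number=20_000):
    """Hypothetical helper: seconds spent on `number` calls of each version."""
    return {
        "mask": timeit.timeit(lambda: reverse_mask(value), number=number),
        "format": timeit.timeit(lambda: reverse_format(value), number=number),
    }
```

Calling `compare_timings()` returns a dict with one timing per version; both versions should agree on every 32-bit input before the timings mean anything.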
Python's dynamic object overhead (and to a lesser extent, interpreter overhead) makes a lot of seemingly-expensive operations not matter very much.
That's what is happening here. There is no intelligence, just regurgitation. Randomization and maximum likelihood completion.
Just like with the competitive programming example, we're asking it to produce solutions that it has seen in its training set. If you ask for a nontrivial twist on one of those solutions, it fails.
Funny, today I was just thinking of people's tendencies to dismiss AI advances with this very pattern of reasoning: take a reductive description of the system and then dismiss it as obviously insufficient for understanding or whatever the target is. The assumption is that understanding is fundamentally non-reductive, or that there is insufficient complexity contained within the reductive description. But this is a mistake.
The fallacy is that the reductive description is glossing over the source of the complexity, and hence where the capabilities of the model reside. "Generating maximum likelihood token strings" doesn't capture the complexity of the process that generates the token strings, and so an argument that is premised on this reductive description cannot prove the model deficient. For example, the best way to generate maximum likelihood human text is just to simulate a human mind. Genuine understanding is within the solution-space of the problem definition in terms of maximum likelihood strings, thus you cannot dismiss the model based on this reductive description.
I think this is more worthy of debate than anything about DSL models or current limits to problem spaces.
I'm not concerned about my job, but I am concerned about a world where corporate money starts shifting toward managing AIs as beasts rather than coding clever solutions. I'm concerned about it because (1) It has always been possible in theory to invent an infinite number of solutions and narrow them down, if you have the processing power, to those that "work", but, this leaves us in a position where we don't understand the code we're running (as a society) or how to fix it (as individuals). And (2) because learning to manage an elephant, as a beast, is utterly different from learning to build an elephant, and it will lead to a dumbing-down of people entering the trade. In turn, they'll become more reliant on things just working the way they're expected to work. This is a very negative cycle for humanity as a whole.
Given the thing you're looking forward to, it's only about 30 years before no one can write code at all; worse, no one will know how to fix a broken machine. I don't think that's the thing we should advocate for.
At least with AI, you can (presumably) replicate the results if you re-run everything from the same state.
There's also a very interesting paragraph in the paper (I'm in no position to judge whether it's valid or not) that touches on this subject, but with a positive twist :
Interpretability. One major advantage of code generation models is that code itself is relatively interpretable. Understanding the behavior of neural networks is challenging, but the code that code generation models output is human readable and can be analysed by traditional methods (and is therefore easier to trust). Proving a sorting algorithm is correct is usually easier than proving a network will sort numbers correctly in all cases. Interpretability makes code generation safer for real-world environments and for fairer machine learning. We can examine code written by a human-readable code generation system for bias, and understand the decisions it makes.
People do advise against hiring people who write incomprehensible code.
Yeah every now and then you run across some genius with sloppy code style and you have to confine them to a module that you'll mark "you're not expected to understand this" when they leave because they're really that much of a genius, but usually the smart people are smart enough to write readable code.
This really does seem like the key here--the knowledge apparently is all in the language model, we just haven't found the best ways to extract that knowledge in a consistent and coherent manner. Right now it's just: generate a bunch of examples and cherry pick the good ones.
If it's easy to tell whether a solution is valid, is it also easy to generate it?
What does this even mean? How do you put a number on AI capability? You can say it is growing faster than people expect, but what is even exponential or linear growth in AI capability?
#include <stdbool.h>
unsigned int swapbits(unsigned int a)
{
    bool bit6 = a & (1 << 5);
    bool bit17 = a & (1 << 16);
    if (bit6 == bit17) return a; //bits are the same, do nothing
    return (a ^ (1 << 5) ^ (1 << 16)); // flip both 6th and 17th bits
}
#define B6 (1<<5)
#define B17 (1<<16)
unsigned swapbits(unsigned a) {
    // == binds tighter than &, so parenthesize; ! normalizes each bit to 0 or 1
    return ((!(a & B6) == !(a & B17)) ? a : (a ^ (B6 | B17)));
}
Here's some BFP:
unsigned swapbits(unsigned a) {
    unsigned flip = (!(a & B6) != !(a & B17)); // 1 only when the two bits differ
    return (a ^ ((flip<<5) | (flip<<16)));
}
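The same swap is easy to cross-check exhaustively in Python. This is only a sketch: `swap_bits` and `swap_bits_reference` are names invented here, bit positions are zero-based (so the "6th and 17th" bits above are positions 5 and 16), and the reference version assumes inputs below 2**32.

```python
def swap_bits(x, i, j):
    """Swap bit i and bit j of a non-negative integer (zero-based positions)."""
    if ((x >> i) & 1) != ((x >> j) & 1):
        # The bits differ, so flipping both of them exchanges their values.
        x ^= (1 << i) | (1 << j)
    return x

def swap_bits_reference(x, i, j):
    """Slow but obvious version, used only as a cross-check (x < 2**32)."""
    bits = list(f"{x:032b}"[::-1])  # bits[k] holds bit k
    bits[i], bits[j] = bits[j], bits[i]
    return int("".join(bits)[::-1], 2)
```

Comparing the two over a range of inputs catches exactly the kind of "sets both bits to the same value" mistake discussed earlier in the thread.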
int and double are C's implicit lingua francas for underspecified literals and implicit type conversions. Throwing int everywhere is redundant like "ATM machine."
You've simply restated your opinion without providing any supporting arguments, and as I already said, I disagree. The vast majority of programming I see (and as a consultant, I see a fairly wide variety) is not about algorithms and math, but instead gluing together systems and expressing domain logic.
Now, I suppose you could argue that domain logic is "algorithms and math," but in my experience, it's less about the specific algorithms and more about precisely describing fuzzy human behavior.
It's that "precisely describing" and "easy to change in the future" parts that makes what programmers do different than what any good employee does.
(I do agree that there is some programming that is focused on algorithms and math, but it's in the minority, in my experience. Perhaps the type of work you do is focused on algorithms and math, but I believe that's a relatively small part of the software development ecosystem.)
And my (weak) conjecture is that we may not need an AGI/human level AI for this. In which case we might still want to have some software to be written. But you're right, I'm also not sure that there will be a point where we still want software but have very intelligent machines. And while saying that programmer will be the last technical job doesn't sound like a strong claim, I'd say it would probably be teachers :)
> The job will evolve, but there will always be someone who tells the computer what to do.
Which may very well be the users, if the machine is able to follow a conversation. Now the thing that may be the showstopper for now might exactly be this: that the machine should be able to hold a context for long enough (over multiple iterations of back and forth communication). As far as my limited knowledge goes, this is something that they have not yet figured out.
The "our kind will always be needed" is exactly the fallacy I was talking about and the one that the practitioners of every intellectual profession seem to have. They think they will be needed to interface between the machine (whether it's a legal or a medical system) and the client. Because they assume that the machine will not be able to communicate, only to process the existing knowledge base.
But again, the whole field evolves through surprising leaps. Yep, Copilot is not insanely useful, but already amusing/frightening enough. It seems to pick up context from all over the code base. Sometimes it goes totally wrong, and generates gibberish (I mean it generates identifiers that make sense as English expressions but don't exist anywhere in the code). But quite a few times it picks up the intent (the pattern/thought pattern) even if it is spread out over a file (or several ones).
Also, these points are not to be taken separately. They're part of a broader argument and should be treated as a unit.
1. Programming competitions are deliberately scoped down. Actual day-to-day work consists of meeting with stakeholders, conducting research, synthesizing that research with prior knowledge to form a plan, then executing. This work skips to the plan synthesis, relying on pattern-matching for the research component.
2. This current work, even if refined, would be insufficient to conduct daily programming work. This is just an extension of point 1; I acknowledge that you're talking about the future and a hypothetical better system.
3. The components required for your hypothetical programming bot are the components not covered by this work.
4. Context-aware/deep search tools are still very incomplete. There are some hints that better user-intent models are around the corner (i.e. companies like TikTok have built models that can adroitly assess users' intents/interests). I've seen no work on bringing those models to bear on something more nebulous like interpreting business needs. (But I also haven't been actively searching for them) Also, Google, who dumps a large amount of money into search every year, is among the best we have and it's definitely far from what we'd need for business-aware programming bots.
5. Conducting the research step in the programming process automatically will require better tools.
6. Conversational AI is still very incomplete. See Tay bot from Microsoft for examples of what goes wrong at scale. People, in general, are also not very aware of themselves during discussions and even very intelligent people get locked in a particular mindset that precludes further conversation. If a user tries fighting the bot by insisting that what they said should be sufficient (as they definitely do to other humans) that could pollute the bot's data and result in worse behavior.
7. Conducting the meeting-with-stakeholders part of the programming process automatically will also require better tools.
8. By points 5 & 7, critical domains still require more research. There is ongoing research in fields like Q&A, even some commercial attempts, but they're focused on mostly low-level problems ("construct an answer given this question and some small input")[0].
9. Advanced logical reasoning is advanced pattern matching + the ability to generate new reasoning objects on the fly.
10. Current systems are limited in the number of symbols they can manage effectively, or otherwise use lossy continuous approximations of meaning to side-step the symbol issue (it's a rough approximation of the truth, I think). See [1] for an up-to-date summary on this problem. Key phrase: binding problem neural networks
11. Current "reasoning" systems do not actually perform higher level reasoning. By points 9+10.
12. Given the rich history and high investment over time in these fields (points 4, 6, and 11), it is unlikely that there will be a sufficiently advanced solution within the next 15-40 years. These fields have been actively worked on for decades; the current influx of cash has accelerated only certain types of work: work that generates profit. Work on core problems has kept going at largely the same pace as usual because the core problems are hard-- extra large models can only take you so far, and they're not very useful without obnoxious amounts of compute that aren't easily replicated.
13. Given the long horizon in point 12, programmers will likely be required to continue to massage business inputs into a machine-usable format.
The horizon estimate in point 12 was a gut estimate and assumes that we continue working in parallel on all of the required subproblems, which is not guaranteed. The market is fickle and might lay off researchers in industry labs if they can't produce novel work quickly enough. With the erosion of tenure-track positions taking place in higher education (at least in the US) it's possible that progress might regress to below what it was before this recent AI boom period.
[0]: https://research.facebook.com/downloads/babi/ [1]: https://arxiv.org/pdf/2012.05208.pdf
Until the computer starts telling people what to do
My phone has me well trained. All it has to do is play a short message tone and I'll come running...
-- GuB-42, Wednesday February 2, 2022
Writing and training a neural network is very different from writing a common program.
Those ML/AI systems will also have to be built, coded and trained but that's a job for a very small set of people compared to the total number of end users (and the total number of developers on the market today). And, as the ML/AI field stands, it always seems to turn out that specialized algorithms that do what the ML layer cannot do get pretty quickly eliminated by the ML layer. So most solutions always get closer and closer to end-to-end.
I don't believe in the singularity, but if we get to the point where AIs don't need human programmers anymore, things are going to get... interesting.
It’s likely that alignment jobs won’t themselves be automated because no one will trust AI systems to align themselves.
ha, I know people already doing this..
The Turing test is a great example of this. Turing thought that a computer needs to be intelligent to solve this task. But it was solved by hard coding a lot of values and better understanding of human psychology and what kind of conversation would seem plausible when most things are hardcoded. That solution obviously isn't AI, I bet you don't think so either, but it still passed the Turing test.
Developers today are 50X more efficient than when they had to input machine code on punched tape, yet the number of developers needed today is far larger than it was in those times.
Hundreds of people manually writing assembly and paid middle class wages. Not a compiler in sight.
In the years leading up to the singularity I’d expect to see a lot of Graeberian “Bullshit Jobs”.
Everyone knows they’re BS but as a society we allow them because we aren’t willing to implement socialism or UBI.
People just built bigger sets, and smaller productions became financially feasible. Ended up creating demand, not reducing it.
The GP is saying that once we have AGI, then "AGI is going to make the human race irrelevant" outweighs "AGI makes software devs irrelevant".
It's a transformer. Do you understand what that means? It's just matrix multiplication.
It generates maximum likelihood token strings, based on its training data.
It doesn't "understand" what those token strings mean.
You are amazed because you're testing the transformer by asking the transformer to generate human-written code THAT IT WAS TRAINED ON. To make CoPilot fail, all you have to do is ask it to generate something unlikely, something it hasn't seen in training.
Maximum likelihood token strings. Period.
On the AlphaCode Attention Visualization website [1], the Accepted code shown for 1553D is an O(n^2) Python one, which is supposed to be TLE. It correctly implements a two-pointer solution, but failed to "realize" that list.pop(0) is O(n) in Python. I'm not sure how it passed.
[1] https://alphacode.deepmind.com/#layer=30,problem=34,heads=11...
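To make the `list.pop(0)` point concrete, here is a small sketch of the same front-consuming loop written two ways. The function names are invented for this example and it is not the code from the visualization site; it only shows why one variant is quadratic and how `collections.deque` avoids that.

```python
from collections import deque

def drain_with_list(items):
    """Consume from the front with list.pop(0).

    Each pop shifts every remaining element, so the loop is O(n^2) overall.
    """
    queue = list(items)
    out = []
    while queue:
        out.append(queue.pop(0))
    return out

def drain_with_deque(items):
    """Same loop with deque.popleft(), which is O(1) per pop, so O(n) overall."""
    queue = deque(items)
    out = []
    while queue:
        out.append(queue.popleft())
    return out
```

Both return the same result; only the running time differs as the input grows. Scanning with an index, or popping from the end as in the snippet that follows, avoids the problem as well.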
for _ in range(int(input())):
    a = list(input())
    b = list(input())
    while a and b:
        if a[-1] == b[-1]:
            a.pop()
            b.pop()
        else:
            a.pop()
            if a: a.pop()
    print("NO" if b else "YES")
To be fair, it generated a set of (10) possible solutions, and at least one of them solved the problem.
from collections import defaultdict
from random import randint
def backspace(s1,s2):
    h = defaultdict(lambda:0)
    for x in s1:
        h[x] = h[x] + 1
    for x in s2:
        h[x] = h[x] - 1
    j = 0
    maxj = len(s2) - 1
    for x in s1:
        if x != s2[j]:
            h[x] -= 1
        elif j < maxj:
            j += 1
        else:
            break
    return j == maxj and all(y >= 0 for y in h.values())
def random_backspace(s1):
    res = []
    for x in s1:
        if randint(0,1) == 0:
            res.append(x)
    return "".join(res)
def backspaceTest(s1):
    return all(backspace(s1,random_backspace(s1)) for _ in range(100))
This will lower the entry barrier to developing software so more people will go into the field. Before you needed to know a programming language, now you will just have a dialogue with a language model.
> I've always felt that programmers would be the first class of knowledge workers to be put out of work by automation.
We've been automating our work for 70 years, and look how many programmers are employed now. The more we automate, the more capable our field becomes and more applications pop up.
Indeed. The ideal future of programming is something out of star trek. I often noticed how everyone on the ship is a programmer of a sort, they whip up a simulation as the problem warrants regardless of their field. But in this future, the job of programmer basically doesn't exist. As a programmer, I should be allowed to have mixed feelings about that.
Yes, but the total amount of work (and surrounding complexity) also increases with it. Just look at the evolution of the software industry over the last few decades.
How many human beings do you personally know who were able to solve a dynamic programming problem at first sight without ever having seen anything but greedy algorithms?
Deepmind is not claiming they have a machine capable of performing original research here.
Many human programmers are unable to solve DP problems even after having them explained several times. If you could get a machine that takes in all of Github and can solve "any" DP problem you describe in natural language with a couple of examples, that is AI above and beyond what many humans can do, which is "awesome" no matter how you put it.
That's not the point being made. The point OP is making is that it is not possible to understand how impressive at "generalizing" to uncertainty a model is if you don't know how different the training set is from the test set. If they are extremely similar to each other, then the model generalizes weakly (this is also why the world's smartest chess bot needs to play a million games to beat the average grandmaster, who has played less than 10,000 games in her lifetime). Weak generalization vs strong generalization.
Perhaps all such published results should contain info about this "difference" so it becomes easier to judge the model's true learning capabilities.
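One crude way to put a number on that "difference" is the word n-gram overlap between a test problem and its closest training problem. The sketch below is purely illustrative: the function names, the choice of trigrams, and Jaccard similarity are all assumptions made here, not something taken from the paper.

```python
def ngrams(text, n=3):
    """Set of word n-grams in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def nearest_training_similarity(test_problem, training_problems, n=3):
    """Highest n-gram overlap between one test problem and any training problem."""
    test_grams = ngrams(test_problem, n)
    return max(
        (jaccard(test_grams, ngrams(p, n)) for p in training_problems),
        default=0.0,
    )
```

A value near 1.0 means the test problem is close to a verbatim copy of something in the training set; a value near 0.0 says only that the wording differs, which is weaker than saying the underlying problem is new.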
The real fun will begin once someone discovers how to make any problem differentiable, so that trial and error isn't needed. I suggest watching a recent Yann LeCun interview. This will solve research as well.
Zero, which is why if a trained network could do it, that would be "impressive" to me, given my personal biases.
>. If you could get a machine that takes in all of Github and can solve "any" DP problem you describe in natural language with a couple of examples, that is AI above and beyond what many humans can do, which is "awesome" no matter how you put it.
I agree with you that such a machine would be awesome, and AlphaCode is certainly a great step closer towards that ideal. However, I would like to have a number that measures the "awesomeness" of the machine (not elo rating because that depends on a human reference), so I will have something as a benchmark to refer to when the next improvement arrives.
> not elo rating because that depends on a human reference
I'm sure if you understood what the transformer was doing, you would be less impressed.
> Overall, that's a brute-force, almost random approach that is ignoring entire decades of program synthesis work.
You don't get answers to these questions by random search. Not even close. I have looked at non-neural program synthesis papers. It is not remotely competitive.
Btw, APPS is not much of a benchmark. It evaluates code generation according to how close it resembles code written by humans. That's standard fare for text generation benchmarks, like evaluating machine translation on some arbitrary set of human translations. There are no good benchmarks for text generation (and there are no good metrics either).
But the comparison against the average competitor on Codeforces is even more meaningless because we have no way to know what is the true coding ability of that average competitor.
No, the metric used in this paper was the percentage of questions it could solve against the hidden tests.
> That points to a further limitation of the approach: it works for Codeforces problems but not for APPS problems (so it's very purpose-specific).
This does not matter typically since you'd just pretrain on the data that works. However, “[t]he CodeContests training set has a non-empty intersection with the APPS test set, and therefore CodeContests cannot be used during training when evaluating on the APPS benchmark.” This is purely an evaluation issue; leakage doesn't matter so much in production.
This is almost certainly untrue and if this position were true, it would be extremely easy for you to prove it: just write a program-generating algorithm that solves even some of the easiest Codeforces problems. Since you're claiming this feat by Alphacode is comparable in difficulty to writing bubblesort (which you could write in 5 minutes), it shouldn't take you a lot of effort to produce something comparable. Just link your program-generating algorithm here with something like instructions on how to use it, and link a few Codeforces submissions where it got ACC result.
1. Pre-train a transformer-based language model on GitHub code with standard language modelling objectives. This model can reasonably represent the space of human coding, which greatly reduces the problem search space.
2. Fine-tune the model on our dataset of competitive programming data, using GOLD (Pang and He, 2020) with tempering (Dabre and Fujita, 2020) as the training objective. This further reduces the search space, and compensates for the small amount of competitive programming data by leveraging pre-training.
3. Generate a very large number of samples from our models for each problem.
4. Filter the samples to obtain a small set of candidate submissions (at most 10), to be evaluated on the hidden test cases, by using the example tests and clustering to pick samples based on program behaviour.
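Step 4 is the part that is easiest to picture in code. Below is a toy sketch of "filter on the example tests, then cluster by behaviour", with candidates modelled as plain Python callables. Every name here is hypothetical, and taking one representative per cluster with the largest clusters first is an assumption of this sketch; the quoted steps only say that clustering on program behaviour is used to pick at most 10 submissions.

```python
from collections import defaultdict

def filter_and_cluster(candidates, example_tests, probe_inputs, max_submissions=10):
    """Pick up to `max_submissions` candidates (illustrative names throughout).

    candidates:    list of callables, each mapping an input to an output
    example_tests: list of (input, expected_output) pairs from the statement
    probe_inputs:  extra inputs used only to compare candidates' behaviour
    """
    def safe_run(program, value):
        # A crashing candidate gets a sentinel output and is not propagated.
        try:
            return repr(program(value))
        except Exception:
            return "<error>"

    # Filter: keep only candidates that pass every example test.
    survivors = [
        program for program in candidates
        if all(safe_run(program, given) == repr(expected)
               for given, expected in example_tests)
    ]

    # Cluster: group survivors whose outputs agree on every probe input.
    clusters = defaultdict(list)
    for program in survivors:
        signature = tuple(safe_run(program, value) for value in probe_inputs)
        clusters[signature].append(program)

    # Select: one representative per cluster, largest clusters first.
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return [group[0] for group in ranked[:max_submissions]]
```

The idea is that many samples which behave identically on unseen inputs are more likely to share a correct algorithm than a one-off sample that merely happens to pass the examples.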
>> Since you're claiming this feat by Alphacode is comparable in difficulty to writing bubblesort (which you could write in 5 minutes), it shouldn't take you a lot of effort to produce something comparable.
What I meant was that the way they announced AlphaCode is like claiming that bubblesort is a novel approach to sorting lists. Not that the effort needed to create their system is comparable to bubblesort. I think if you read my comment again more carefully you will find that this is the first interpretation that comes to mind. Otherwise, I apologise if my comment was unclear.
https://storage.googleapis.com/deepmind-media/AlphaCode/comp...
Evaluation against the average competitor on Codeforces is not the "estimated average human performance", it's only the average of the coders on Codeforces who are an unknown proportion of all human coders with an unknowable level of coding ability. So evaluating against that is actually a pretty meaningless metric.
The benchmarking against APPS is much more meaningful but the results are pretty poor and so they are omitted from the article above.
So, no. I'm not missing the point. Rather, the article above is eliding the point: which is that on the one meaningful evaluation they attempted, their system sucks.
Edit: Here's table 10, for quick reference:
Model         Filtered From (k)  Attempts (k)  Introductory n@k  Interview n@k  Competition n@k
GPT-Neo 2.7B  N/A                1             3.90%             0.57%          0.00%
GPT-Neo 2.7B  N/A                5             5.50%             0.80%          0.00%
Codex 12B     N/A                1             4.14%             0.14%          0.02%
Codex 12B     N/A                5             9.65%             0.51%          0.09%
Codex 12B     N/A                1000          25.02%            3.70%          3.23%
Codex 12B     1000               1             22.78%            2.64%          3.04%
Codex 12B     1000               5             24.52%            3.23%          3.08%
AlphaCode 1B  N/A                1000          17.67%            5.24%          7.06%
AlphaCode 1B  1000               5             14.36%            5.63%          4.58%
AlphaCode 1B  10000              5             18.18%            8.21%          6.65%
AlphaCode 1B  50000              5             20.36%            9.66%          7.75%
And its caption:
Table 10 | n@k results on APPS. If there is no filtering, then n = k and the metric is pass@k. Finetuned GPT-Neo numbers reported from Hendrycks et al. (2021), Codex numbers from Chen et al. (2021). We used a time limit of 3 seconds per test to match Codex 12B, and report average numbers over 3 different fine-tuning runs for AlphaCode.
Edit 2: And now that I posted this, I note that the 25% solutions are from Codex. AlphaCode's best result was 20%.
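For reference, the pass@k metric named in the caption is usually computed with the estimator from Chen et al. (2021): draw n samples, count the c that pass, and estimate the chance that at least one of k randomly chosen samples is correct as 1 - C(n-c, k) / C(n, k). A minimal sketch follows; whether the AlphaCode rows were computed with exactly this formula is an assumption here, and `pass_at_k` is just a name chosen for the example.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of which c are correct (requires k <= n)."""
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 4 samples of which 1 is correct, `pass_at_k(4, 1, 2)` is 1 - 3/6 = 0.5.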
1) It's an imprecise target: believers can always hype and skeptics can always downplay improvements. Humans can do lots of different things somewhat well at the same time, so a machine beating human-level performance in one field (like identifying digits) says little about other fields (like identifying code vulnerabilities).
2) ELO ratings, or similar metrics are measurements of skill, and can be brute-forced to some extent, equivalent to grinding up levels in a video game. Brute-forcing a solution is "bad", but how do we know a new method is "better/more elegant/more efficient"? For algorithms we have Big-O notation, so we know (brute force < bubble sort < quick sort), perhaps there is an analogue for machine learning.
I would like performance comparisons that focus on quantities unique to machines. I don't compare the addition of computer processors with reference to human addition, so why not treat machine intelligence similarly?
There are many interesting quantities with which we can compare ML models. Energy usage is a popular metric, but we can also compare the structure of the network, the code used, the hardware, the amount of training data, the amount of training time, and the similarity between training and test data. I think a combination of these would be useful to look at every time a new model arrives.
Why do you say that? As I understand it, AlphaStar beat pros consistently, including a not widely reported showmatch against Serral when he was BlizzCon champ.
1. First, though I am not sure of this (i.e. this should be verified), I heard that the team working on AlphaStar initially tried to create a Starcraft AI entirely through "self-play," but this was not successful. (Intuitively, in a real-time game, there are too many bad options too early on that even with a LOT of time to learn, if your approach is too "random" you will quickly enter an unwinnable position and not learn anything useful.) As a result, they replaced this approach with an approach which incorporated learning from human games.
2. "including a not widely reported showmatch against Serral when he was BlizzCon champ." is a mischaracterization. It was not a "showmatch," rather there was a setup at Blizzcon where anyone could sit down and play against AlphaStar, and Serral at some point sat down to play AlphaStar there. He went 0-4 vs AlphaStar's protoss and zerg, and 1-0 vs its Terran. However, not only was he not using his own keyboard and mouse, but he could not use any custom hotkeys. If you do not play Starcraft it may not be obvious just how large of a difference this could make. BTW, when Serral played (perhaps an earlier iteration of) AlphaStar's terran on the SC2 ladder, he demolished it.
I remember when seeing the final report, I was a bit disappointed. It seemed like they cut the project off at a strange point, before AlphaStar was clearly better than humans. I feel that if they had continued they could have gotten to that point, but now we will never know.
IIRC you could and Serral did set his own custom keybindings on the machine. The main difference was different keyboard and mouse.
Another big issue is that the bot communicated with the game via a custom API, not via images and clicks. Details of this API are unknown - like how invisible units were handled, but it was much higher level than a human would have (pixels).
If you look at the games, the bot wasn't clever (which was a hope), just fast and precise. And some people far from the top were able to beat it convincingly.
And now the project is gone, even before people had a chance to really play against the bot and find more weaknesses.
https://arxiv.org/abs/2006.08381
It’s a slightly different, easier problem: generating programs based on example outputs, rather than natural language specifications.
________
[1] The structure of the PCFG is hand-crafted, but the weights are trained during learning in a cycle alternating with neural net training. It's pretty cool actually, though a bit over-engineered if you ask me.
Also, my understanding is that Dreamcoder does some fancy PL theory stuff to factorize blocks of code with identical behavior into functions. Honestly I think that’s the key advance in the paper, more than the wake-sleep algorithm they focus on.
Anyways the point was more that self supervised learning is quite applicable to learning to program. I think the downside is that the model learns its own weird, non-idiomatic conventions, rather than copying github.
It's like self-driving cars. A car driving itself for the first time in a controlled environment, I'm sure, was an impressive feat, and it wouldn't be inaccurate to call it a self-driving car. However, that's not what we're all waiting for when we talk about the arrival of self-driving cars.
None of the self-driving systems were set up by giving the AI access to sensors, a car, and the driver's handbook and saying "well, you figure it out from there." The general trend is to solve a greatly simplified problem, then a more complex one, and so on up to dealing with the real world.
A few examples of neural program synthesis from at least 2 years ago:
https://sunblaze-ucb.github.io/program-synthesis/index.html
Another example from June 2020:
DreamCoder: Growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning
https://arxiv.org/abs/2006.08381
RobustFill, from 2017:
RobustFill: Neural Program Learning under Noisy I/O
https://www.microsoft.com/en-us/research/wp-content/uploads/...
I could go on.
And those are only examples from neural program synthesis. Program synthesis, in general, is a field that goes way back. I'd suggest, as usual, not making big proclamations about its state of the art without being acquainted with the literature. Because if you don't know what others have done, every announcement by DeepMind, OpenAI et al seems like a huge advance... when it really isn't.
https://www.semanticscholar.org/paper/Program-Synthesis-from...
AlphaCode is not particularly good at it, either. In the arXiv preprint, besides the subjective and pretty meaningless "evaluation" against human coders, it's also tested on a formal program synthesis benchmark, the APPS dataset. The best performing AlphaCode variant reported in the arXiv preprint solves 25% of the "introductory" APPS tasks (the least challenging ones). All AlphaCode variants tested solve less than 10% of the "interview" and "competition" (intermediary and advanced) tasks. These more objective results are not reported in the article above, I think for obvious reasons (because they are extremely poor).
So it's not doing anything radically new and it's not doing it particularly well either. Please be better informed before propagating hype.
Edit: really, from a technical point of view, AlphaCode is a brute-force, generate-and-test approach to program synthesis that was state-of-the-art 40 years ago. It's just a big generator that spams programs hoping it will hit a good one. I have no idea who came up with this. Oriol Vinyals is the last author and I've seen enough of that guy's work to know he knows better than to bet on such a primitive, even backwards approach. I'm really shocked that this is DeepMind work.
So my hunch is that it probably hasn't been done, or hasn't been done often, because the program synthesis community would recognise it's pointless.
What you really want to look at is formal program synthesis benchmarks and how systems like AlphaCode do on them (hint: not so good).
In any case, Serral said this, which you can take as you will:
https://twitter.com/ENCE_Serral/status/1192023800961019904
"It was okay, I doubt i would lose too many games with a proper setup. I think the 6.3-6.4 mmr is pretty accurate, so not bad at all but nothing special at the same time."
On the one hand, surely it doesn't seem surprising that the player who lost, the human, would say the above, and so one may be skeptical of how unbiased Serral's assessment is. On the other hand, I would say that Serral is among the more frank and level-headed players I've seen in the various videogames I've followed, so I wouldn't be too hasty to write off his assessment for this reason.
This is likely due to the fact that in AlphaCode's solution the "inner O(n) loop" is actually a memmove(), which is optimized to be insanely fast.
Again, it is not. CPython does not do these things.
The web page says, and this is corroborated in the paper,
> Solutions were selected randomly, keeping at most one correct (passes all test cases in our dataset) and one incorrect sample per problem and language. Note that since our dataset only has a limited number of test cases, passing all tests we have cannot completely rule out false positives (~4%), or solutions that are correct but inefficient (~46%).
The “54th percentile” measure did use estimated time penalties, which you can see discussed in Table 4 in the paper, but 1553D was not part of that.
Again, it is.
https://github.com/python/cpython/blob/2d080347d74078a55c477...
This is the memmove() I mentioned above. Like, I actually perf-d the code and confirmed this is in the hot loop.
> but 1553D was not part of that.
Someone submitted this 1553D code to Codeforces and it passed: https://codeforces.com/contest/1553/submission/144971343
Right, that's my mistake. The APPS dataset has natural language specifications and test cases for evaluation. It actually includes Codeforces problems.
The excuse quoted in the second part of your comment is an excuse. If a large language model can complete a code generation task, that's because it's seen an example of the code it's asked to generate before. Any claims to the contrary need very strong evidence to support them and there's typically no such thing in papers like the AlphaCode one.
> Any claims to the contrary need very strong evidence to support them and there's typically no such thing in papers like the AlphaCode one.
This is the opposite of how burden of proof works. You are the one making a claim with certainty based off of guesswork, not me. And the paper does actually have a section on this, and finds copying isn't pervasive outside of utility snippets and functions, which also occur in human solutions. It's a weak objection anyway; just the task of translating english prose into the general algorithm you want to apply is already an impressive feat.
Where is that quote from? Why are you quoting it? Am I supposed to reply to it?
>> You are the one making a claim with certainty based off of guesswork, not me.
I'm not talking about you. I'm talking about the paper, the team behind it and work on large language models trained on online data, in general.
The paper indeed makes a vague claim of "guarding" against data leakage by a "strict temporal split" which means they ensured that the validation and test data used for fine-tuning was not available to the model. That of course doesn't mean much. What matters is if the data on which the model was trained included programs like the ones the model was asked to generate. Clearly, it did, otherwise the model would not have been able to generate any programs that could be used as solutions to the test problems.
And I think you have the rule on the burden of proof a bit wrong. I don't have to prove anything that is already well-known. For instance, if I said that gravity makes things fall down, I wouldn't bear any burden of proof. Accordingly, there is no doubt that neural nets can only represent what is in their training set. That's how neural nets work: they model their training data. They can't model data that is not in their training set. It wouldn't even be fair to expect a neural net to learn to represent data that it wasn't trained on, and just to be clear, I'm not saying that there should be such an expectation, or that it is even desirable. This modelling ability of neural nets is useful. In fact, this is the real strength of neural nets, they are extremely good at modelling. I mean, duh! Why are we even discussing this?
But this is something that the deep learning community is trying to deny, to itself primarily, it seems. Which is exceedingly strange. Work like the one linked above prefers to make bizarre claims about reasoning ability that we are, presumably, expected to believe arises magickally just by training on lots of data, as if there's a threshold of volume above which data is miraculously transubstantiated into an element with quite different properties, from which reasoning or "critical thinking" (dear god) emerges even in the complete absence of anything remotely like a reasoning mechanism. This is nonsense. Why not admit that in order for a large language model to be able to generate code, it must see code "like" the one it's asked to generate? Then we can talk about what "like" means, which is the interesting question. All this attempt to pussyfoot around what those systems are really doing is so counter-productive.
Again, this is not about anything you specifically say, but a criticism of deep learning research in general. I don't presume you're a deep learning researcher.
>> It's a weak objection anyway; just the task of translating english prose into the general algorithm you want to apply is already an impressive feat.
Not as impressive as you think. The problem descriptions used on CodeForces etc are not arbitrary English prose. They don't ask participants to write a poem about Spring (and I don't mean the old Java library). So it's not "prose" but very precise specifications. They could be represented as a Controlled Natural Language. So something much easier to model than arbitrary English.
And, yet again, the performance of the model is crap.
> Someone submitted this 1553D code to Codeforces and it passed
Ah, well that shows you have a 2 second time limit, which is quite a lot of time! Not quite enough to empty a 200k element list with list.pop(0)s, but not far off; a 140k element list squeaks in under the time limit for me.
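For anyone who wants to reproduce the measurement discussed above, here is a minimal sketch. The function names and list sizes are made up for illustration and are not taken from the AlphaCode solution or the Codeforces submission. It drains a list from the front with list.pop(0), where each pop shifts the remaining elements, and compares that with collections.deque, whose popleft() does not shift anything.

```python
from collections import deque
import time


def drain_list(n):
    """Empty an n-element list from the front; every pop(0) shifts the rest."""
    items = list(range(n))
    total = 0
    while items:
        total += items.pop(0)
    return total


def drain_deque(n):
    """Same loop with a deque, whose popleft() is O(1)."""
    items = deque(range(n))
    total = 0
    while items:
        total += items.popleft()
    return total


def timed(fn, n):
    """Return (result, elapsed seconds) for one call of fn(n)."""
    start = time.perf_counter()
    result = fn(n)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Illustrative sizes only; timings depend heavily on machine and Python build.
    for n in (20_000, 40_000, 80_000):
        list_result, list_time = timed(drain_list, n)
        deque_result, deque_time = timed(drain_deque, n)
        assert list_result == deque_result
        print(f"n={n}: list.pop(0) {list_time:.3f}s, deque.popleft() {deque_time:.3f}s")
```

Doubling n should roughly quadruple the list.pop(0) time while the deque time only doubles, which is the quadratic behaviour being argued about. The constant factor is small because the shift is a single bulk memory move, so the quadratic version can still fit in a generous time limit.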
I think that's most likely the case too, otherwise why would they give up?
Yes, it's possible to apply self-supervised learning to program synthesis, because it's possible to generate programs. It's possible to generate _infinite_ sets of programs. The problem is that if you make a generator with Universal Turing Machine expressivity, you're left with an intractable search over an infinite search space. And if you don't generate an infinite set of programs, then you're left with an incomplete search over a space that may not include your target program. In the latter case you need to make sure that your generator can generate the programs you're looking for, which is possible, but it limits the approach to only generating certain kinds of programs. In the end, it's the easiest thing to create a generator for programs that you already know how to write, and no others. How useful that is remains an open question. So far no artificial system has ever made an algorithmic contribution, to my knowledge, in the sense of coming up with a new algorithm for a problem for which we don't have good algorithms, or coming up with an algorithm for a problem we can't solve at all.
My perception is influenced by my studies, of course, but for me, a more promising approach than the generate-and-test approach exemplified by DreamCoder and AlphaCode etc. is Inductive Programming, which is to say, program synthesis from input-output examples only, without examples of _programs_ (the AlphaCode paper says that is an easier setting but I very much disagree). Instead of generating a set of candidate programs and trying to find a program that agrees with the I/O examples, you have an inference procedure that generates _only_ the programs that agree with the I/O examples. In that case you don't need to hand-craft or learn a generator. But you do need to impose an inductive bias on the inference procedure that restricts the hypothesis language, i.e. the form of the programs that can be learned. And then you're back to worrying about infinite vs. incomplete search spaces. But there may be ways around that, ways not available to purely search-based systems.
Anyway program synthesis is a tough nut to crack and I don't think that language models can do the job, just like that. The work described in the article above, despite all the fanfare about "reasoning" and "critical thinking" is only preliminary and its results are not all that impressive. At least not yet. We shall see. After all, DeepMind has deep resources and they may yet surprise me.
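The trade-off described above between an infinite search space and an incomplete one can be made concrete with a toy generate-and-test synthesiser. This is only a sketch under stated assumptions: the tiny expression language (leaves x, 1, 2 and the operators + and *), the depth bound, and all function names are invented for illustration and have nothing to do with how AlphaCode or DreamCoder generate programs.

```python
from itertools import product

# Hypothetical toy language: integer expressions over one input variable x.
LEAVES = ["x", "1", "2"]
OPS = ["+", "*"]


def programs(depth):
    """Return every expression of nesting depth <= depth, smallest first."""
    if depth == 0:
        return list(LEAVES)
    smaller = programs(depth - 1)
    seen = set(smaller)
    result = list(smaller)
    for op, left, right in product(OPS, smaller, smaller):
        expr = f"({left} {op} {right})"
        if expr not in seen:
            seen.add(expr)
            result.append(expr)
    return result


def run(expr, x):
    """Evaluate an expression built by programs() on input x."""
    # eval is acceptable here only because the strings come from our own generator.
    return eval(expr, {"__builtins__": {}}, {"x": x})


def synthesize(examples, max_depth):
    """Generate-and-test: return the first program consistent with all examples."""
    for expr in programs(max_depth):
        if all(run(expr, x) == y for x, y in examples):
            return expr
    return None  # The bounded space did not contain a consistent program.


if __name__ == "__main__":
    examples = [(0, 1), (1, 2), (2, 5), (3, 10)]  # I/O pairs for x*x + 1
    for depth in (0, 1, 2):
        print(depth, len(programs(depth)), synthesize(examples, depth))
```

With the depth bound at 1 the search is cheap but incomplete, since no program in the space fits the examples. Raising the bound to 2 finds one, but the number of candidates grows from 21 to 885, and one more level would already run to over a million, which is the blow-up the comment is pointing at.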
- First, some more recent work, mostly overviews.
1. The following is the most recent overview of the field I'm aware of:
Inductive logic programming at 30 (Cropper et al, 2020)
https://www.doc.ic.ac.uk/~shm/Papers/ilp30.pdf
2. And a slightly shorter version of the same paper that summarises new trends:
Turning 30: New Ideas in Inductive Logic Programming (Cropper et al, 2020)
https://www.ijcai.org/Proceedings/2020/0673.pdf
3. Here's a short introduction to the relatively new ILP direction of learning Answer Set Programming:
Inductive Logic Programming in Answer Set Programming (Corapi et al, 2011)
https://link.springer.com/chapter/10.1007/978-3-642-31951-8_...
4. This is an overview of Meta-Interpretive Learning (MIL), a new approach to ILP that overcomes many difficulties of earlier approaches (Full disclosure: my own work is on MIL, though not the article linked):
Meta-Interpretive Learning: achievements and challenges (Stephen Muggleton, 2017)
https://www.doc.ic.ac.uk/~shm/Papers/rulemlabs.pdf
5. And this is a short version of a paper on δILP, a neural-net based ILP system:
Learning Explanatory Rules from Noisy Data (Evans and Grefenstette, 2018)
https://www.ijcai.org/Proceedings/2018/0792.pdf
- Next, some earlier work that is still relevant:
6. This is the inaugural paper of the field, that first named it (a little heavy reading though):
Inductive Logic Programming (Stephen Muggleton, 1990)
https://www.doc.ic.ac.uk/~shm/Papers/ilp.pdf
7. Here's an early paper on predicate invention, an important technique in ILP (only recently fully realised via MIL):
Predicate Invention in ILP - an Overview (Irene Stahl, 1993)
https://link.springer.com/chapter/10.1007%2F3-540-56602-3_14...
8. And an early overview of learning recursion (and performing predicate invention) that also lists several early ILP systems:
Inductive synthesis of recursive logic programs: achievements and prospects (Flener and Yilmaz, 1999)
https://core.ac.uk/download/pdf/82810434.pdf
That should be enough to get you started. I recommend reading in the order I linked to the various articles. I tried to give links to documents that I know can be read for free.
Unfortunately most of the material on ILP is either in scholarly articles, or, where there are textbooks, they tend to be older. That sounds bad, but there has been much new work recently with several new approaches.
Let me know if you're looking for more specific information. See my signature for contact details- I'm happy to answer emails about ILP :)
This and Copilot handle a much harder level of problems than what was being tackled a couple of years ago.
All this could be done 40 years ago with a dumb DSL, or perhaps a more sophisticated system like a PCFG for programs, with a verifier bolted on [1]. It's nothing new. What's new is that it's done with a large language model trained with a Transformer, which is all the rage these days, and of course that it's done at the scale and with the amount of processing power available to DeepMind. Which I'm going to assume you didn't have back when you published your work.
Honestly, this is just an archaic, regressive approach, that can only work because of very big computers and very big datasets.
___________
[1] Which btw, is straightforward to do "by hand" and is something that people do all the time. In the AlphaCode work, the large language model simply replaces a hand-crafted program generator with a lot of data, but there is no reason to do that. This is the quintessential problem where a machine learning solution is not necessary because a hand-crafted solution is available, and easier to control.
But when I say AlphaCode/Copilot is good I'm referring solely to the difficulty of the problems they are doing. There are many papers, including mine, that worked on simpler problems and relied on more structure to solve them.
I expect follow up work will include actually incorporating other knowledge more heavily to the model. My work was mainly on restricting tree like models to only make predictions following grammar of the language. Does that parallelize/fit well with a transformer? Unsure, but I would expect some language information/genuine problem constraints to be incorporated in future work.
Honestly I am pretty surprised how far pure brute force with a large model is going. I would not have expected GPT-3 level language modeling from more scale on a transformer and little else.
> Accordingly, there is no doubt that neural nets can only represent what is in their training set.
This is not true. If it were true that there was no doubt, the paper wouldn't have challenged it and claimed it false. If you assume your conclusion, then obviously your conclusion follows trivially.
> And, yet again, the performance of the model is crap.
It isn't.
No, it remains faithful to your interpretation of what I said, which is designed to support your opinion rather than mine.
Also, basic manners: if you're not quoting, don't use quotes.
>> It isn't.
Is too!
We could do that all day. Or, we could look at the reported results which are, well, crap.
I have to say that usually I'm the one speaking out against an over-reliance on machine learning benchmarks and against expecting a new approach to beat the state of the art before it can be taken seriously, but this is not a new approach, and that's the problem I have here. It's nothing new, repackaged as something new and sold as something it isn't ("reasoning" and "critical thinking" and other nonsense like that).
I agree that future work must get smarter, and incorporate some better inductive biases (knowledge, something). Or perhaps it's a matter of searching more intelligently because, given that they can generate millions of programs, I'd have thought they'd be able to find more programs that approximate a solution.
[1] Feser et al's Lambda-Learner https://www.cs.utexas.edu/~swarat/pubs/pldi15.pdf
[2] S. Katayama's MagicHaskeller http://nautilus.cs.miyazaki-u.ac.jp/~skata/MagicHaskeller.ht...
To be fair, logic and functional programming languages do have some advantages as target languages for Inductive Programming compared to imperative languages in that they have very simple syntax. For example, Prolog doesn't even have variable declarations. That's very convenient because the learning system only needs to learn the logic of the program, not the syntax of the language also. It's also much simpler to define language bias or program schemata etc constraints on the form of hypotheses in such languages, or even order programs by generality. For instance, Prolog has unification built-in and unification is used in ILP to order programs by generality (by testing for subsumption). All this machinery would have to be implemented from scratch in an imperative language.
Although the reason that logic and functional programming languages are given more weight in IP is probably historical, because Lisp and Prolog were, for a long time, "the languages of AI".
I'm trying to remember... I think there's been some IP work on imperative languages, maybe even Python. I'll need to check my notes.
Sorry, naive question: does ILP test candidate programs by increasing or decreasing generality?
The "top" and "bottom" terms refer to a lattice of generality between programs, where generality is typically measured by subsumption or entailment etc. Subsumption in particular is a syntactic relation (that implies a semantic one, entailment) so "searching" a space of logic programs ordered by subsumption means in practice that the space of programs is constructed by generalising or specialising some starting program by means of syntactic transformation according to subsumption (e.g. a first order clause can be specialised by adding literals to it: P(x):- Q(x) subsumes P(x):- Q(x), R(x). The simplest intuition is to remember that by adding more conditions to a rule we make it harder to satisfy).
A more general program entails more logical atoms and ILP algorithms are typically trained on both positive and negative example atoms of a target program, so top-down approaches begin with an over-general program that entails all the positive examples and some or all of the negative examples and specialise that program until it entails only the positive examples. Bottom-up approaches start with an over-specialised program that entails none of the positive examples and generalise it until it entails all the positive examples.
The mathematics of generalisation are at the core of ILP theory and practice. It's what sets ILP apart from statistical machine learning which is based on the mathematics of optimisation.
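The subsumption ordering described above can be sketched in a few lines. This is a simplified illustration, not how any particular ILP system implements it: clauses are (head, body) pairs, literals are (predicate, arguments) tuples, function-free terms only, and names starting with an uppercase letter are variables as in Prolog. All function names are hypothetical. A clause C theta-subsumes a clause D if some substitution of C's variables makes C's head equal to D's head and every body literal of C appear in D's body.

```python
def is_var(term):
    """Prolog convention: terms starting with an uppercase letter are variables."""
    return term[:1].isupper()


def match(literal, target, theta):
    """Extend substitution theta so that literal becomes target, else return None."""
    (pred, args), (target_pred, target_args) = literal, target
    if pred != target_pred or len(args) != len(target_args):
        return None
    theta = dict(theta)  # Copy so that failed branches do not leak bindings.
    for arg, target_arg in zip(args, target_args):
        if is_var(arg):
            # Bind the variable, or check it agrees with an earlier binding.
            if theta.setdefault(arg, target_arg) != target_arg:
                return None
        elif arg != target_arg:
            return None
    return theta


def subsumes(c, d):
    """True if clause c theta-subsumes clause d (d's terms are treated as fixed)."""
    (c_head, c_body), (d_head, d_body) = c, d
    theta = match(c_head, d_head, {})
    if theta is None:
        return False

    def search(remaining, theta):
        # Backtracking search: map each body literal of c onto some literal of d.
        if not remaining:
            return True
        first, rest = remaining[0], remaining[1:]
        for target in d_body:
            extended = match(first, target, theta)
            if extended is not None and search(rest, extended):
                return True
        return False

    return search(list(c_body), theta)


if __name__ == "__main__":
    general = (("p", ("X",)), [("q", ("X",))])                    # P(x) :- Q(x)
    specific = (("p", ("X",)), [("q", ("X",)), ("r", ("X",))])    # P(x) :- Q(x), R(x)
    print(subsumes(general, specific))  # The shorter clause is more general.
    print(subsumes(specific, general))  # The added literal R(x) has nothing to map to.
```

In this picture, one top-down refinement step amounts to appending a literal to the body, and the original clause always subsumes the result. That is the sense in which adding conditions specialises a rule.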