What we learned in 6 months of working on an AI Developer (blog.pythagora.ai)

63 points by magden 2 years ago | 49 comments

Lerc 2 years ago |

Even though I don't think GPT-4 is up to the task, it does seem like now is the right time to be working on these things. Pretty soon GPT-4 will not be the best in the field. The next generation will perform much better.

Possibly the most frustrating thing I find about GPT-4 is how close it gets with its wrong answers. It's easy to dismiss a lesser answer when it responds with a laughably out-of-band idea. GPT-4 often shows that it has a general idea of what you want but misses a small but critical aspect, which results in a solution to something else that is similar but not what you wanted.

I have had mixed results getting it to iterate on its own mistakes. It will too often try to change the world to match its answer, rather than fixing the answer. The best approach I have found to stop this is getting it to create unit tests. I imagine there is a lot of training data for it to understand the intention behind fixing a failing test. It's a very specific problem for it to look at, and changing the test is generally not considered the correct solution.

withinboredom 2 years ago | |

Oh man. When it’s so close but wrong it’s amazing for creative endeavors! For technical ones, it is quite a bad thing. It’s like being a Star Wars fan but the AI just wants to talk about Star Trek.

I think this is why the non-tech people see AI as so amazing. For anything human and non-technical, the “almost but not quite” nature is a good thing.

I was using an AI to help me debug a weird thing (mainly summarizing log splats hundreds of lines long) and I eventually got pretty close to identifying the issue when I asked “wtaf is this message. Never seen anything like it.” It then went on about how it was offended that I used vulgar language. I had to apologize for saying “wtaf!” Anyway, I found a bug in a linker, so that was fun; thanks AI.

layer8 2 years ago | | |

What’s frustrating is that the one reason I ever wanted AI is to have a Lt. Cdr. Data or ship computer equivalent that is logical and correct to a fault and that helps me reason through things. What we have now is almost exactly the opposite: we have to help it reason through things and have to double-check everything for correctness.

__loam 2 years ago | | |

I think it's equally shit at creative work too, that's just harder to dismiss as "wrong". It's still wrong, it's just harder to see.

berkes 2 years ago | |

> Pretty soon GPT-4 will not be the best in the field. The next generation will perform much better.

What makes you believe that progress is linear, or at least a line forever going up?

I keep seeing people predict rapidly improving AI, based on how rapidly it improved over the last x months.

But why is that not an outlier? How do we know we haven't hit a ceiling and started stagnating? Isn't progress typically very bumpy and sudden?

Lerc 2 years ago | | |

> What makes you believe that progress is linear, or at least a line forever going up?

I assume neither of those things. I have however read a lot of the papers published since GPT-4 was trained. There have been a lot of advances since then, so much so that simply saying "a lot" seems to be a massive understatement.

I think it is a reasonable assumption that at least a portion of those advancements would be able to build upon the existing technology of GPT-4 to produce something greater.

I am not assuming discoveries yet to be made. I am considering existing discoveries that have not yet made it into the top level of production.

teaearlgraycold 2 years ago | | |

Technological innovation is not truly exponential. It is instead a series of logistic curves. If we do not have further breakthroughs, then AI technology will plateau. I do think we'll have those breakthroughs, but it's impossible to say when they will happen.
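To make the "series of logistics" picture concrete, here is a minimal sketch that models progress as a sum of S-curves, one per breakthrough. The breakthrough times, the unit ceiling, and the growth rate are all made-up illustrative values, not measurements of anything.

```python
import math

def logistic(t, midpoint, ceiling=1.0, rate=1.0):
    """Single S-curve: slow start, rapid growth, then saturation at `ceiling`."""
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))

def stacked_progress(t, midpoints):
    """Total progress as a sum of S-curves, one per (hypothetical) breakthrough."""
    return sum(logistic(t, m) for m in midpoints)

# Hypothetical breakthroughs at t=0 and t=20, in arbitrary time units.
breakthroughs = [0.0, 20.0]

# Progress gained over a 4-unit window between breakthroughs (the plateau)...
plateau_gain = stacked_progress(12, breakthroughs) - stacked_progress(8, breakthroughs)
# ...versus the same-length window centred on the second breakthrough.
burst_gain = stacked_progress(22, breakthroughs) - stacked_progress(18, breakthroughs)

print(f"gain during plateau: {plateau_gain:.4f}")
print(f"gain during burst:   {burst_gain:.4f}")
```

Under these assumptions almost all of the improvement lands in the windows around a breakthrough, and the stretch in between is nearly flat, which is why extrapolating from a recent burst can mislead.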

bugglebeetle 2 years ago | | |

I would say that Microsoft/OpenAI’s attacks on open source, whether it be through “AGI safety” BS as a front for regulating their way to a monopoly, or attempting Embrace, Extend, Extinguish on companies like Mistral, and Cold War-style fear mongering about China, are the greatest near-term risks to linear progress. And it’s worth noting on that latter point that China is not similarly constrained and so could end up outcompeting the U.S., regardless.

amelius 2 years ago |

Until I see an AI sysadmin that can help with basic configure/make problems, I don't have high hopes for an AI developer.

samus 2 years ago | |

That should be quite easy compared to software development, which is much more open-ended since the requirements are usually more nebulous, potentially contradictory, and at times simply wrong.

c0balt 2 years ago | | |

> That should be quite easy

SysAdmin stuff is quite easy in terms of complexity compared to some software work. The problems, similar to traditional engineering, tend to come from the rather high cost of failure.

To expand further, it's easy to set up a system but hard to set up one that's reliable and/or resilient. It's hard to maintain systems that are not documented and/or wrongly documented (outdated, inaccurate). It's even harder to always make sure everything's consistent and you don't lose or damage data.

amelius 2 years ago | | |

Yet, I haven't seen an AI that solves the software distribution problem. Tools like ChatGPT are often plain wrong when answering questions about basic sysadmin problems, and even make up commands.

fragmede 2 years ago | |

https://chat.openai.com/share/c424f444-c7ac-476f-bd10-02234e...

I picked a random GitHub issue that was some issue with ./configure. Seems like it helps to me.

kbar13 2 years ago | |

need AI for ffmpeg flags

stavros 2 years ago | | |

I made a program for all your Unix admin needs:

https://github.com/skorokithakis/sysaidmin

whiterknight 2 years ago | | |

That sounds like a job AI would actually be good at.

Cilvic 2 years ago | | |

I have great success for my simple use cases with sgpt -s "cut the 40 seconds of the video starting at 1:30"
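For reference, a request like that would plausibly map to an ffmpeg invocation along these lines. This is a sketch of one reasonable answer, not the command sgpt actually produced; the file names and the helper function are hypothetical.

```python
def ffmpeg_cut_args(src, dst, start="00:01:30", duration_s=40):
    """Build an ffmpeg argument list that extracts `duration_s` seconds from `start`."""
    return [
        "ffmpeg",
        "-ss", start,            # seek to the start position (before -i: input seeking)
        "-i", src,               # input file (hypothetical name supplied by caller)
        "-t", str(duration_s),   # limit the output to this many seconds
        "-c", "copy",            # copy streams instead of re-encoding
        dst,
    ]

args = ffmpeg_cut_args("input.mp4", "clip.mp4")
print(" ".join(args))
# To actually run it: subprocess.run(args, check=True)
```

Because `-c copy` does not re-encode, the cut tends to snap to the nearest keyframe; dropping that option trades speed for a more exact cut.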

symbolicAGI 2 years ago | |

ChatGPT-4 is wonderful for composing regular expressions. Saves so much time when transforming various Java strings that arise in my work.

Today's big time savings came from this prompt: "Write a Java method that uses the Eclipse AST parser to create a simple markdown file showing the commented method signatures of a given Java class text file."

65 2 years ago |

Maybe AI developers can make landing pages and basic APIs. But, taking front end as an example, I just don't see how an AI can reproduce exact design specifications and interactivity to the point where it wouldn't just be faster to write the code yourself or search for some human verified snippet that does what you want.

And programmers who do know how to actually write efficient code without AI seem like they'd be even more in demand than those that rely on AI. Skill + knowledge + ability to use existing resources (e.g. StackOverflow, packages, templates), as we do now, are much more predictable and faster than trying to wrangle AI to do exactly what the designer or PM wants.

When the dishwasher was invented, everyone thought the human dish washer would be obsolete. And yet, restaurants still employ dish washers because they are much more efficient and thorough than a dishwashing machine.

nine_zeros 2 years ago | |

> When the dishwasher was invented, everyone thought the human dish washer would be obsolete. And yet, restaurants still employ dish washers because they are much more efficient and thorough than a dishwashing machine.

This is a good example of both job destruction and job retention by technology.

Job destruction - the total number of potential hand dishwasher jobs has reduced because the vast majority of commodity dishwashing is machine driven.

Job enhancement - machine dishwashers just can't produce the quality/dexterity of hand dishwashers.

I feel like generative AI will do the same. It will replace a large number of commodity jobs (editors, translators, copy producers, website designers, app prototypers, paper pushers), but it will also reveal the value of skilled producers.

It is too risky to let ChatGPT write code for your backend that destroys your production database and crashes your company forever.

ctoth 2 years ago |

One of the things they seem to have figured out is the requirement to at least model a sort of actor-critic architecture with their agents. It helps quite a bit.

They seem to badmouth Aider a tad (not cool) but I do wonder how a full-stack of this + Aider might work? There needs to also be some sort of good test generator involved.

All that said, any time someone actually demonstrates progress on the automated Software Engineer problem and it makes it to HN, I am deeply reminded of the old quote:

"It is difficult to get a man to understand something, when his salary depends on his not understanding it."

Just read through this comments section and check out the pure copium. Yes, ChatGPT can do basic sysadmin tasks with ./configure and make.

Yes it does make sense to work on this now, assuming LLMs will get better, because LLMs have continued to get better on any metric you can imagine.

Finally, yes, AI devs will make landing pages and basic APIs. I didn't realize we were all hardcore world-class 0.01% programmers? I have certainly written a landing page and basic API before, in fact I do that sort of thing a lot more than I write uber1337 hax0r code. You probably do too!

singularity2001 2 years ago | |

Speaking of Aider, which other tools attempt to act as complete coding copilots? GitHub Copilot Pro seems to have ended some of their experiments, or did I just not get access to the right beta?

stevage 2 years ago |

The focus on upfront specs feels a bit off. Since it's apparently cheap to generate running code, as a user, I'd much rather be able to just iterate really fast and use output to refine my requirements rather than having to laboriously state them all up front. Agile rather than waterfall if you will.

samus 2 years ago | |

In that case, it might be easier to start over with fixed specs. That might not be as much work as it sounds, since most of the existing code would have been produced by the LLM, and only the human feedback and interventions would have to be redone. It would be almost like backtracking to an earlier point in a chat history and changing path there. The article mentions that another LLM could provide insight into where to change what.

It might also be possible to change an existing history without abandoning everything that has happened afterwards. Of course, this could lead to conflicts, sort of like when rebasing a branch, and it would be useful to have another LLM look for them.

GPT Pilot might or might not be able to start from existing code as well, in which case one would approach it as one would a legacy codebase that has to be adapted to new requirements.

bsenftner 2 years ago | |

It's the LLM that needs the upfront specs, regardless of what you'd like, at this point. If that is the case, implement a nondestructive composable system, like node-based visual editors (ComfyUI, for example). Change the upfront specification "node" and let the LLM cascade through that and any attached nodes, creating the code (or whatever) fresh each time.
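A minimal sketch of that cascade, assuming a simple dependency graph: when one node changes, only it and everything downstream of it gets regenerated, in dependency order. The node names are invented for illustration, and `generate` is a stand-in for whatever LLM call would produce each node's output; none of this reflects how ComfyUI or GPT Pilot is actually implemented.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each node maps to the nodes it depends on (hypothetical example graph).
deps = {
    "spec": [],
    "schema": ["spec"],
    "api": ["spec", "schema"],
    "readme": ["api"],
    "logo": [],  # unrelated to the spec, so it should never be regenerated
}

def downstream(changed, deps):
    """Return `changed` plus every node that depends on it, directly or transitively."""
    dirty = {changed}
    grew = True
    while grew:
        grew = False
        for node, parents in deps.items():
            if node not in dirty and dirty.intersection(parents):
                dirty.add(node)
                grew = True
    return dirty

def regenerate(changed, deps, generate):
    """Re-run `generate` for every affected node, dependencies first."""
    dirty = downstream(changed, deps)
    order = [n for n in TopologicalSorter(deps).static_order() if n in dirty]
    return {n: generate(n) for n in order}, order

# Placeholder generator standing in for an LLM call.
outputs, order = regenerate("spec", deps, lambda n: f"<fresh output for {n}>")
print(order)
```

Changing "api" instead of "spec" would regenerate only "api" and "readme", which is the nondestructive part: upstream nodes and unrelated ones keep their previous output.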

gumby 2 years ago |

CMake was invented to guarantee that at least some humans would have software jobs.

somewhereoutth 2 years ago |

> Our approach is to focus on building the application layer instead of working on getting LLMs to output better results. The reasoning is that LLMs will get better,...

So more jam tomorrow then. Building the framework around the magic is the easy bit.

romafirst3 2 years ago | |

It is a very important bit and might be how we all code in the future.

romafirst3 2 years ago | | |

It’s definitely not even close to being solved either. I haven’t seen a single code generator that works (100% of the time) for anything more than a very simple one or two liner.

wokwokwok 2 years ago |

Hm.

It’s easy to look at https://github.com/Pythagora-io/gpt-pilot-db-analysis-tool/b... and go… so, this new tool means you took two days to write this?

long stare

Why did you bother?

…but, this both hits the nail on the head and misses the point at the same time.

On the one hand, this is foundational tech, prototyping on a new way of doing things. It’s not going to be faster than doing it yourself at first. It won’t run locally at first.

On the other hand, we already know that GPT-4-level models can do trivial tasks.

Over and over and over, people claim coding tools can massively improve productivity, and then try to demo that by building a trivial system.

…but building trivial systems is not the problem that needs solving.

The problem that needs solving is building large complex systems with dynamically adjusting requirements.

The examples and blog post seem to miss this even as an idea.

While I applaud, in general, efforts to explore this space, tackling the easy problems seems like it doesn’t significantly advance the state of play.

Here are some concrete things that would be more valuable, but are significantly technically harder:

- Use tests. Make it write tests. Make humans write tests. Do not accept generated code that fails the tests.

- Focus on refactoring; it’s a known issue that models struggle to refactor code. Breaking your existing code base into tiny files isn’t the answer.

- Focus on documenting the behaviour of existing code and incrementally migrating to new behaviour.

- Bad developers write new code instead of reading the existing code and using existing functionality and utilities. AI generators are notoriously rubbish at this, and will almost always generate a function rather than use an existing one.

Refining and understanding existing code is significantly more valuable than generating code “from scratch”; so much so that I would argue that without the ability to refine existing code, such tools will forever remain in the “scaffold generator” category of “useful but ultimately no better than the current status quo”.

The tool as shown is, I believe, broadly speaking interesting, but the approach described in the blog (upfront decisions about everything) is a dead end.