It's this kind of thing that makes me think tackling big feature requests is still an AGI-complete problem. Perhaps if it gets good enough at pure coding you can iterate your way to success.
Basically you go from programmer to product manager, except you also get to micromanage a non-sentient programmer
Even with small requests to AI, I find myself accidentally including words or phrases that seem to tell it, "Oh, he wants this as a function that does all the things very manually."
So I get some fairly capable, but very verbose and often inflexible code.
Yet, that's not what I was asking for, but something in the context set the AI off in that direction. In reality I'm not sure what I want and I'm open to anything.
Often I suddenly realize, "Wait, there's gotta be something built into this language that does this, or part of this..." and often there is, and it's far more reliable and a better way to do it. Somehow the AI skipped that and gave me a different answer.
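As a hypothetical illustration of that gap (the word-counting task and the function names are assumptions, not anything from the thread), here is the "very manual" version next to the standard-library one in Python:

```python
from collections import Counter


def count_words_manual(text: str) -> dict[str, int]:
    """Hand-rolled word counting: it works, but it is verbose."""
    counts: dict[str, int] = {}
    for word in text.lower().split():
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    return counts


def count_words_builtin(text: str) -> Counter:
    """Same result using collections.Counter from the standard library."""
    return Counter(text.lower().split())
```

Both return the same mapping; the second is the kind of answer that is easy to miss when the prompt nudges toward "do it by hand".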
It strikes me as similar to customers who come to me with "I want an email that's sent on Tuesdays that are single-digit calendar dates when this field contains the letter Q and ..." But when I ask them what they're trying to accomplish, I find all that specificity isn't needed: they really mean that they order all their grapes on Tuesdays at the beginning of the month, and they just want a list of their grape orders every few weeks.
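A rough sketch of the difference, in Python. Everything here is made up for illustration: the record layout (`item`, `ordered_on`), the function names, and the three-week window are assumptions, not a real system.

```python
from datetime import date, timedelta


def matches_literal_request(day: date, field: str) -> bool:
    """The rule exactly as the customer first phrased it."""
    is_tuesday = day.weekday() == 1  # Monday is 0, so Tuesday is 1
    single_digit_date = day.day < 10
    return is_tuesday and single_digit_date and "Q" in field


def recent_grape_orders(orders: list[dict], today: date, weeks: int = 3) -> list[dict]:
    """What they actually wanted: grape orders from the last few weeks."""
    cutoff = today - timedelta(weeks=weeks)
    return [
        order for order in orders
        if order["item"] == "grapes" and order["ordered_on"] >= cutoff
    ]
```

The second function never mentions Tuesdays or the letter Q, because the real requirement never depended on them.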
I think part of the problem is that instruction fine tuning is not done on full codebases, just shorter problems that fit into reasonable (8K, 32K) context windows. By nature these problems are more specific, so they are biased in that direction from the start.
I talked about this the last time Copilot Workspaces reached the front page, two days ago: I don't think the value is in the code generation, but rather in the ability to capture our thought process. CW is currently a bottleneck in my opinion, and I think the code generation will have to get pretty good before we can see the value in writing everything down vs. just coding as we have always done.
The most compelling part of the demo showcased in this post is the way that the tool built the bulleted list of success criteria -- that's so often a tedious and overlooked part of writing user stories, but its importance shouldn't be understated -- the fact that it bakes that step into the workflow feels like the most valuable piece of the puzzle here.
Part of the fun of software development is exploring the solution space by implementing, and gaining a deeper understanding in the process, as well as coming up with the corresponding design decisions.
It seems that with current AI, in order to steer it and evaluate its output, you would have to build that deeper understanding up front without doing the work, which seems difficult.
Programming is the task of finding the real requirements!
I think you’ve just invented product managers. This used to be part of a software engineer’s job. Back when inputting code into a computer was so labor intensive that you’d write your program then hand it off to another human to translate into machine code.
Then we invented compilers and now programming can take up a whole person’s day so programmers stopped having time to do product management. That became a full-time job supplying 4+ programmers with enough work to stay busy.
If we can replace those 4 programmers with AI, software engineers will once more turn back into product managers.
The best product managers I’ve worked with have some combination of a comp sci and business background. The CS background helps a lot.
And some of the best software engineers I’ve worked with are basically their product manager’s right hand. Partnering smoothly in developing requirements, communicating technical feasibility, and deeply understanding their customers. They could be product managers but choose not to.
TDD is a great way to show exactly how much you understand what you're about to build. You make all the decisions about edge cases and various conditions ahead of time, before even getting to the code.
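A minimal sketch of what that looks like in Python. `parse_quantity` is a hypothetical function invented for this example; in a test-first workflow the test would be written before it, and the implementation is included only so the snippet runs.

```python
def parse_quantity(raw: str) -> int:
    """Hypothetical function under test: parse a non-negative order quantity."""
    value = int(raw.strip())  # int() raises ValueError for non-numeric input
    if value < 0:
        raise ValueError("quantity must not be negative")
    return value


def test_parse_quantity() -> None:
    # Each assertion records a decision made before the implementation existed.
    assert parse_quantity("3") == 3
    assert parse_quantity("  7 ") == 7  # surrounding whitespace is tolerated
    assert parse_quantity("0") == 0     # zero is allowed
    for bad in ("", "abc", "-1"):       # empty, non-numeric, negative
        try:
            parse_quantity(bad)
        except ValueError:
            continue
        raise AssertionError(f"expected ValueError for {bad!r}")
```

Deciding whether `"-1"` or an empty string is an error is exactly the kind of requirement that otherwise gets discovered halfway through the implementation.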
It sounds like using an LLM to write code requires such careful preparation and wording ahead of time that it's basically like writing in a very high-level programming language.
A couple questions:
* Will the codebase turn into a mess over time by having the AI apply changes over changes over changes? Do we even care? Or do we want a human to still be able to follow what is going on?
* Will you just be able to ask the AI to refactor it all and clean it up? Then it wouldn't be a problem, I presume.
* Are product-based tech companies/startups still defensible if anyone can basically recreate the product with some English?
* I don't know Copilot Workspaces - are the prompts that generate and change the code kept somewhere? Imo they're part of the codebase now.
It's not clear if we're even near a point where it can independently and meaningfully contribute to an existing codebase rather than these greenfield demos. Feels similar to the self-driving AI hype where level 5 is still pretty far from realized (Waymo is closest but AIUI still uses a lot of remote human intervention).
Look at the lessons that the author has learned here:
* More specificity == better
* The importance of clear bulleted delivery items / criteria-for-success
* Unspecified details around a general goal are a ripe area for disappointment
All of these are things that a product owner / team leader learns in their first few projects (and so often must re-learn as the years go by).
AI is lowering barriers and promoting more developers to this role earlier. But everything that we learned about good Agile development in the past will still apply to the future.
Moving on to the complex task, the author simply hand-waves "this isn't good yet but surely it will be". No evidence is given as to _why_ there should be any expectation of LLMs getting there.
And the perceived benefit of discovering that their idea of the more complex task was not thought out enough did not come from the LLM, it came from the author themselves. They may as well have spoken to ELIZA or a rubber duck.
What am I missing?
As I type the code I get a feeling for whether I like it. I also pretend to use it even when it's unfinished, kind of like playing a game. Even if I spend a lot of time thinking about what I am going to write, until it exists and I play with the code, I don't know if it's good.
Now Copilot writes so much code that, even if it is exactly what I was going to type, I have kind of lost the intuition, and I hate it.
So I just enable it when I do things that I don't consider programming anymore.
I still think it is absolutely amazing tech though, and I know it will get better and better, and at some point it will be hard to not use it, but I really enjoy playing with the code as I write it.
Aider is more of a collaborative chat, where you work with the LLM interactively asking for a sequence of changes to your git repo. The changes can be non-trivial, modifying a group of files in a coordinated way.
Workspaces seems more agentic. You need to do a bunch of up-front work to (fully) specify the requirements. Even with a perfectly formulated request, agents often go down wrong paths and waste a lot of time and token costs doing the wrong thing.
That's also not how I code personally. My process is usually more iterative.
Another big difference compared to Workspaces is that aider is primarily a CLI tool. Although I just released an experimental browser UI [1] yesterday, making it more approachable for folks who are not fully comfortable on the command line.
And not joking, I think there should be engineering classes taught with slide rule, to get students to learn old school ability to work with orders of magnitude in their head.
Of course students have to learn new things too. But I do think we are really losing some of the basic skills and methods of thinking that you get with the old methods.
Like tracking down some pointer errors, it takes time, it's a difficult struggle, but you do learn a lot about how things work.
Have classes with 'new' tech, then have classes that require 'old' tech. Exams without calculators, or make an Assembly language class mandatory.
[0] https://tbeseda.com/blog/previewing-github-copilot-workspace...
It's a huge legal liability to have statements about how data won't be used and then use it, when you're a company that might compete in similar spaces, and Microsoft competes almost everywhere.
While I trusted GitHub when they were independent, I trust this feature from MS-owned GitHub more than I would have trusted them then, because the liability that misuse opens them up to is so much greater. If I were building a product and could prove that some MS department used my info in an unauthorized way to build a product, I could sue that product out of existence. Someone always talks, so MS can't assume it will never be known, and they know that.
Almost everywhere in tech, but almost nowhere outside of tech. I work for a large non-tech conglomerate, and as far as I'm aware, we don't compete with any MS products/services.
Microsoft will sell "Copilot enterprise" to companies that can afford to negotiate. But every individual out there on a normal subscription gets data mined.
OpenAI is similar - you can't negotiate a "no-logs" deal with them unless you are a player the size of say, Epic (the health industry giant).
I could see "AI workspace driven development" being the future of at the very least cutting through the smaller tickets of work and generally improving developer workflows.
That feels like the right way to go -- almost baking an "agile done right" workflow into its engine.
To me the effect seems similar to going from assembly language to C or from C to Java or Visual Basic. It's a new level of abstraction that saves massive amounts of time.
I think the amount of work for software developers will increase just like it did back then. Many software projects are never started because they will be too expensive. If they can be done by half the number of people in half the time using AI tools, they might get a "go" instead.
This is a tool for product owners, it’s just too early for them to use it by itself.
Talk about high expectations!
This guy ships code.
I ran the hello world for a similar system, CrewAI, and it cost $4 against GPT-4.
There is a trade-off between my time and the cost of the feature against me just coding it up with LLM assistance which has a fixed cost of $20 per month.
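A back-of-the-envelope sketch of that trade-off in Python, under the loud assumption that a typical agent run costs about as much as the $4 hello world (real runs could cost more or less). The function name is made up for illustration.

```python
def breakeven_runs(cost_per_run: float, flat_monthly_fee: float) -> float:
    """Agent runs per month at which pay-per-run spend equals the flat fee."""
    if cost_per_run <= 0:
        raise ValueError("cost_per_run must be positive")
    return flat_monthly_fee / cost_per_run


# Assumed figures from the comment above: $4 per run vs. $20 per month.
runs = breakeven_runs(cost_per_run=4.0, flat_monthly_fee=20.0)
```

With those numbers, five agent runs a month already cost as much as the flat subscription. The value of my own time is not modeled here.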
Hmmm, wonder if there's cheaply sourced labour of the human variety in that loop then?
The main problem was context. It didn’t seem to know what files to use for our discussion, didn’t listen when I told it, didn’t remember when I told it, had no effective way that I could bring files in and out of the discussion.
All this led to a deeply frustrating session of interaction and frankly I hated it. Easier to use ChatGPT web ui and copy and paste in and out.
I found GitHub Copilot better in JetBrains IDEs. It seemed mostly to know what I was asking about, though it’s a very long way from being good at managing context.
It’s surprising that after the amount of development they’ve put into copilot it still is so bad at what I’d consider to be barest minimum functionality to integrate into an IDE.
We're working on something similar to workspaces: https://www.bismuthos.com
We provide a workspace to build Python backends. Chat on the left, code and visual editors on the right. However, we also handle deployments, data storage (we have a blob store), serving (we built a home grown function runtime) and logging.
The experience is tightly integrated with our copilot and the idea is to get ideas off the ground as quickly as possible with as little devops hassle. Right now the focus is on building something new, but we're in the process of making it easier for existing projects to integrate with us too.
Feel free to drop by our (very) new discord too: https://discord.gg/E5Yn3vaM
If your idea of high-quality code is "follows all the standard clean coding practices, uses design patterns, doesn't do anything Sonarqube would complain about, etc.", then it does a great job.
In terms of more abstract, design-level aspects of code quality, though, I have been less impressed. So, things like limiting statefulness and avoiding unnecessary temporal coupling, good high-level abstractions that obey regular and predictable - ideally algebraic - rules, preservation of well-defined bounded contexts, things like that. Left unchecked, Copilot will happily help you turn a large monolithic codebase into architectural spaghetti.
But then, most humans will do that, too.
Will it work? Also yes.
This is usually enough for most cases. Despite HN skewing to the fancier side of programming, the vast majority of day to day programming is just slapping together API glue.
For those cases LLMs like Copilot are excellent. It's a lot faster to ask Copilot about some specific C# thing than start searching through Microsoft's documentation for it. In most cases it can just insert whatever you want at the cursor.
Like just today I pasted a SQL CREATE statement to Copilot and asked it to create a FooModel class of it. Took me 3 seconds of typing, about 5-10 seconds of waiting and clicking "insert at cursor" and I had a 15 property model class.
Repeat a few more times and I've cut down stupid tedious writing by at least 30 minutes and I can go do the more fun bits of attaching some actual logic to those models.
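For anyone who hasn't done this, a sketch of the kind of output I mean, shown in Python rather than C# and with a made-up four-column table instead of the real 15-property one. The table, column names, and type mapping are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from typing import Optional

# Hypothetical input pasted into the prompt:
#   CREATE TABLE foo (
#       id INT PRIMARY KEY,
#       name VARCHAR(100) NOT NULL,
#       price DECIMAL(10, 2),
#       created_at TIMESTAMP NOT NULL
#   );


@dataclass
class FooModel:
    """One field per column; nullable columns become Optional."""
    id: int
    name: str
    price: Optional[Decimal]
    created_at: datetime
```

It is pure transcription, which is why it is tedious to type and easy to hand off.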
However I think this is also the hard bit for humans to do. It’s one of the most frequent stumbling blocks I see for more junior engineers, and one of the things I notice most when working with code from people who are really good programmers.
The trick is to know how to program already, and avoid checking in LLM-generated code unless you completely understand every line.
If you don't do that you'll run into the same problems as you would if you hire a contractor to build your codebase without understanding what they did for you.
I often (simplistically) explain LLMs to people by explaining that it's essentially running a statistical average of language. Next-token-prediction (generally) aims to predict the next-least-surprising word that would occur in a sequence. It aims to "make sense" and be unsurprising.
If you want creative writing and innovative research papers and novel ideas, this isn't going to get you very far.
But if the things you want are "unsurprising" or "predictable" (great attributes of good, maintainable source code), then using this to write code feels like a pretty darn good fit.
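To make "least surprising next word" concrete, here is a toy sketch in Python: bigram counts over a tiny made-up corpus, always picking the most frequent follower. This is only an illustration of the idea. Real LLMs are neural networks over subword tokens and usually sample rather than always taking the top choice; the function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Optional


def train_bigrams(corpus: str) -> dict[str, Counter]:
    """Count which word follows which in the corpus."""
    followers: dict[str, Counter] = defaultdict(Counter)
    words = corpus.split()
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers


def predict_next(followers: dict[str, Counter], word: str) -> Optional[str]:
    """Return the most frequent (least surprising) follower, if any."""
    options = followers.get(word)
    if not options:
        return None
    return options.most_common(1)[0][0]


model = train_bigrams("the cat sat on the mat and the cat slept")
```

Here `predict_next(model, "the")` gives `"cat"`, because "cat" follows "the" twice and "mat" only once. It is predictable by construction, which is the point.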
I guess the difference is now that the contractor is cheap or free (because it’s a LLM), whereas in the old days you’d either hire a person to do the work and not understand or pick up a book and figure it out yourself (or go to school, or whatever). Figuring it out yourself was often cheaper and then you could understand.
(Not that humans can be replaced by LLM devs yet, or that LLM generated code is necessarily unreadable. It’s usually fine as you say.)
I really like this way of thinking about using LLMs, I think that's a great analogy in many ways.
The code is not the asset. It never has been. Deeply understanding your customer, their problem, and how to solve it is the asset. The code is just the current manifestation of that understanding.
Problem is that for many companies the code is also the only manifestation of that understanding.
I think it'll be hard enough to reason about what you really want that most customers won't care enough to roll their own. And personally, I'd happily pay someone to keep the product maintained. A product is usually not one and done.
That's a very bad thing, but this sounds like just more of that. Which most developers seem totally fine with.
For smaller contexts, LLMs tend to be really good at reviewing, suggesting changes, and refactoring. I haven't seen this applied successfully at larger contexts, though.
TFA didn't show a screenshot of it but the per file plans and the diffs are side by side on a single screen so you can update the per file plan (adding and removing files as needed) and then "re-roll" the code changes as you go. With the Codespaces feature you can even launch the project and get access to a terminal to run stuff and presumably feed the output back into the plan.
It makes it really easy to spot deficiencies in the code, add comments in the plan, and instantly regenerate the code (well, not instantly, there was a queue when I used it). It was a lot smoother than my experience with Copilot Chat, Aider, and Plandex.
I don't see an AI agent doing a good job of avoiding that.
OpenAI's API license states that they won't use your data to train models, if that's any consolation. Unlike ChatGPT.
Don't use your personal account for work, and don't assume that any service your employer provides for work isn't giving data on you to your employer. If at all possible, try to work for a company that cares what you deliver and not how you do it (meaning they don't micromanage, not that they want you to skirt laws). Some of those are obviously easier than others to control.