I'm going back to writing code by hand(blog.k10s.dev) |
Time to become a "product engineer" and watch the hyper-agile agents putting up digital post-it notes on digital pin-boards discussing how much each post-it is worth in digital scrum meetings. Meanwhile the agents keep wasting more and more time so that their owners make less and less of a loss, until eventually a profit is made.
Until the costs become prohibitive and humans become cheaper than the agents that replaced them. Once the agents are replaced by the humans, the next hype bubble awaits around the bend.
/s
I really do think this whole thing is a wash.
AI was also able to help me create my first subscription payment workflow.
It is like farming without Roundup: fewer crops, more energy, fewer toxic chemical risks.
But again, if you just guide the AI on architecture and review the code, you should be fine. The code that you write and the code that an AI writes are two different things; they will never be the same.
The AI is very helpful for generating code, and that is exactly how you should use it: as a code generator.
Also 1600 lines... didn't any agent reviewing the diffs point that out?
You're also adding a lot to claude.md. I dunno how much that file has grown, but with a big claude.md file full of instructions, I don't think the AI will be able to remember all those rules.
In my experience, no. These tools suck at refactoring, mostly choosing to add more code instead.
Actually I am curious to try something like that myself. Is there an existing orchestrating engine (or single agent) which can spawn multiple subagents and keep passing their feedback/output between each other until all of them agree that the assignment overall is complete?
I see this in Claude too, but I also see this in junior engineers. In the case with Claude, I simply ask it to refactor immediately after each feature is done. The human is still responsible for what the AI writes, so if the AI writes code that’s gross, I would never push that lest it sully my name and my reputation for my own code quality.
Do they write empty functions and let AI fill them in?
Or do they use some kind of specification language?
Are people designing those languages?
If there's any hope for reliability, auditability, predictability to be had, it lies in constraining an LLM's grammar whilst delegating freeform behavior to a more passive substrate.
Looking at the code, paying attention to the structure is part of the skill
The skills required to wield an LLM are not exactly those required to write code, but are very close.
"Vibecoding" is not a way for idiots to blindly produce software artifacts that anyone would want
The problem with this dev's approach is not AI, it's their use of it. They didn't ensure that the architecture made sense. They didn't look at the code and get a "feel" for it. They didn't do the whole build stuff, step back, refactor, rinse and repeat dance. The need for that hasn't gone away; if anything, it's even more important now. Because you can spit out code 100x faster than you could before, your tech debt compounds 100x faster. The earlier you refactor, the less work it is.
I usually give the agent a solid idea of what I want, often down to the API interfaces. Then every now and then, I'll go through the code and ensure that everything makes sense, and that I'm not just spitting out code that works, but building a codebase that scales.
Inb4 “you’re gonna be replaced” god damn it I hope so, I do not want to spend the rest of my life behind a computer screen…
The ones who are “AI pilled” and the contagious lepers.
Some states, for example, are meant to be assumed from the data shape rather than from actual state fields, but damn, they like adding a state field.
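A minimal sketch of what "state assumed from the data shape" can look like. The `Download` record and its fields are hypothetical, made up purely for illustration; the point is only that no separate `state` field is stored, so it cannot drift out of sync with the data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Download:
    """Hypothetical record: note there is no stored `state` field."""
    bytes_total: int
    bytes_received: int = 0
    error: Optional[str] = None

    @property
    def state(self) -> str:
        # The state is computed from the shape of the data on every access,
        # so there is nothing extra to keep consistent.
        if self.error is not None:
            return "failed"
        if self.bytes_received == 0:
            return "pending"
        if self.bytes_received < self.bytes_total:
            return "in_progress"
        return "done"
```

With a stored state field, every code path that touches `bytes_received` or `error` would also have to remember to update it, which is the kind of duplication the comment is complaining about.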
Yea, that's why engineers are still very important for now (until models can do this type of longer term designs and stick to them).
Attempting anything comprehensive with AI is the software development analogue to the Gell-Mann Amnesia effect..
I'm definitely thinking deeply now about how I'm approaching these tools going forward.. Yes, GPT5 is better at spitting out a fairly acceptable skeleton to a class when prompted hard enough, than I am, in one go.. but.. It will happily do things like write decent looking protobuf schemas and then go ahead and hide everything that takes the least amount of reasoning behind some binary blob nested deep enough that it'll get past even the most dedicated reviewer..
It's fairly good at a lot of the things that I don't find interesting to deal with, but it's also amazingly incompetent when it comes to even the most mundane kind of common sense.. It's so strongly steering towards text-book examples that it will happily put in three times the amount of code and handle multiple classes of actually impossible edge-cases and even use-cases that it was specifically asked NOT to add.. And it will defend it by "well, I added this because I can't know if someone is going to use the thing I just added".. well, if you hadn't added it, chances are indeed slimmer..
It's so good at answering questions and explaining what's there, and diving through call-paths, and yet, it drops the ball the moment it's going to actually do something beyond saving me from looking up how to write some really annoying and uninteresting boilerplate..
The worst thing is how good it is at making things LOOK right, it will cover every single edge-case you throw at it, but not because of the design, not because it correctly argues why the architecture is inherently allowing such and such, or because the design and spec fleshes out that A goes to B and never the other way around, and as soon as it's time to make something, it will make sure B can go to A, especially, it seems, if allowing so prevents it from doing the right thing which is WHY those edge-cases were trivial, instead it will endlessly hack around them.. I've worked with people like that too, so I don't know if I am really blaming the models or the training data..
But damn it's a tough spot..
I've had multiple situations where, after wasting hours of work, which I should have just spent doing it myself, the only thing I really wished was for the model to be sentient, and able to feel pain, and have a corporeal body so I could drag it outside and beat it to a pulp. (I've never reached that level of frustration with an actual person, so that's something new they bring to the table..)
If you understand good software architecture, architect it. Create a markdown document just as you would if you had a team of engineers working with you and would hand off to them. Be specific.
Let the AI do the implementation of your architecture.
AI may default to mediocre and often somewhat buggy code unless you iterate because that is just what the vast majority of human written code that it has seen looks like. But the fact that he got away with not reviewing the code for so long to me proves the opposite of his conclusion.
1690 lines of code in one file is a walk in the park for SOTA models.
He can just say something like:
"Please review and create a refactoring plan and test suite. I found atrocious architectural decisions like numerous special cases and if statements rather than using abstractions properly. Make a few notes in comments and architecture.md to never do this again."
One could also argue that it was a better decision each time by the AI to just never do a refactor unless prompted because that increases the likelihood of something breaking and you want to do that after you verify the minimum code change actually functionally does what you want.
Also I bet you the headline is a lie. He basically admits it by saying he is writing the core structure of the next version by hand ahead of time, implying that he will generate the rest. So the title is a half-truth at best.
He's already 5k+ LOC into the rust rewrite...
That trial and error process is still happening with an LLM, but much faster, and with instantaneous cross-references to various forms of documentation that I would be looking up myself otherwise. It produces code of a quality that is dependent on the engineer knowing what they want in the first place and prompting for it and refining its output correctly.
It's the exact same process of sculpting code that the majority of the industry was doing "by hand" prior to the release of LLMs, but faster, and the harnesses are only getting better. To "vibe code" is to prompt vaguely and ignore the quality of the output. You're coming to a forum full of professionals and essentially telling us that you're getting really frustrated with your Scratch project.
I don't know if you're trying to lead a charge or whatever but good luck with that. As a senior SWE, it is clear to me that this is the new paradigm until something better than LLMs comes along. My workflows and efficiency have been vastly improved. I will admit that I have never really been a "I made a SMTP server in 3k of Rust" kind of guy, though.
> For 7 months I'd been prompting and shipping without ever sitting down and actually reading the code Claude wrote.
But every time I read something like this, I seriously wonder about the mental state of the person that wrote it.
How do you get to this point?
7 months ago was early November. Coding assistants were getting very good back then, but they were still significantly poorer at making good architectural decisions in my experience. They tended to just force features into the existing code base without much thought or care.
Today I've noticed assistants tend to spot architectural smells while working and will ask you whether they should try to address it, but even then they're probably never going to suggest a full refactor of the codebase (which probably is generally the correct heuristic).
My guess is that if you built this today with AI you wouldn't run into so many of these problems. That's not to say you should build blind, but the first thing that stood out to me was that you started building 7 months ago, when coding assistants were only just becoming decent and, undirected, would still generally generate total slop.
BASIC at that time was heralded as a much simpler and faster way to program. Rings a bell?
But in my main work, reverse engineering, LLMs are godsend, for years now.
You can basically bruteforce binary obfuscation thanks to them. And thanks to eager Chinese LLM providers, basically for free.
But I always use LLM only for boring work and rest is for me to do manually, or with scripts of course, but made by me. Because I want to learn.
Yes, there are a lot of people using LLMs for full RE automation since they're selling exploits for profit. No problem with me.
I see funny future for huge corporations like Adobe, etc.
Imagine the prompt, "Hey Claude, re-implement Adobe Photoshop with clean-room design". One agent will open a decompiler and output complete low-level technical details of how everything is implemented.
Second agent implements new Photoshop based on that.
They will be mad and I like this.
You will own nothing, and you will be happy, corpos.
>They will be mad and I like this.
I suspect through some convoluted legal mechanism this kind of thing is going to end up applying only to copyleft laundering and not against players like Adobe.
I feel the same way about coding, it's a source of pride for me, and when I hear people say I should resign myself to being an "ideas guy" while chatgpt actually creates things, I find the very concept to be distasteful regardless of whether or not it can outperform me.
It would have been easy to run a few ai agents to review the code and find these issues as well and architect it clearly
clickbait title
But here's the thing, you almost never know what the architecture is up front. If you do you probably aren't the one writing the actual code anymore. Writing the code, with or without an AI is part of the design process. For most people it isn't until they've tried several times, fucked it up a bunch, and refactored or rewrote even more that you actually know what the architecture needs to be.
This. I definitely agree with this statement at this point in AI-assisted development. This gets at the "taste" factor that is still intrinsically human, especially in software engineering. If you can construct and guide the overall architecture of an application or system, AI can conceivably fill in the smaller feature bits, and do so well. But it must have a strong architecture and opinionated field in which to play.
Another note was for me e2e tests; while AI can write them it never comes up with just basic organization or abstraction required to manage a large e2e test suite with hundreds of tests. It immediately starts to produce spaghetti code.
Now I do feel lucky that I started learning coding about four years before the LLM revolution, but these things are really just natural language compilers, aren’t they? We’re just in that period - the 1980s, the greybeards tell me - where companies charged thousands of dollars per compiler instance, right? And now, I myself have never paid for a compiler.
This whole investor bubble will blow up in the face of the rentier-finance capitalists and I’ll be laughing my head off while it happens.
I don't go as fast as with other agents, but this works for me, and I enjoy the process.
This is what I was doing right from the beginning. AI just fills out methods and does other low-intelligence work. Both are happy. My architectures and code are really mine, easy to read and reason about. AI gets paid and does not get a chance to fuck me in the process. At no point did I feel any temptation to leave the "serious" work to AI.
Getting a plan isn't a panacea but is a better way to limit downstream slop than just vibing without one.
But your prompts are not the only inputs. Among other things, there is a random seed injected by the vendor.
That is a primary source of non-determinism.
Then, of course, is the fact that you don't personally have an old copy of the model, and the vendor isn't going to keep the model forever, and there are no unit tests to make sure that, faced with prompts like you gave it before, the newer models won't suffer major regressions in the functionality you were using.
And even if there were no non-determinism, the models suffer greatly (much more so than traditional compilers) from the butterfly effect.
It is literally impossible to pin down part of your prompt in such a way that it always will contribute to good outcomes, and such that you can simply vary a tiny bit of the prompt to logically correlate with tiny variations in the output.
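A toy illustration of the seed point, not a description of any vendor's actual pipeline (those are not public). The `sample_token` function, the candidate tokens, and the logit values are all made up for the example; it only shows that with identical "prompt" inputs, the seed alone can change which token comes out.

```python
import math
import random


def sample_token(logits: dict, temperature: float, seed: int) -> str:
    """Sample one token from softmax(logits / temperature) using a seeded RNG."""
    rng = random.Random(seed)
    scaled = {tok: value / temperature for tok, value in logits.items()}
    top = max(scaled.values())
    # Subtract the max before exponentiating for numerical stability.
    weights = [math.exp(value - top) for value in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]


# Hypothetical next-token scores for the same prompt.
logits = {"struct": 1.0, "interface": 0.9, "map": 0.5}

same_seed = {sample_token(logits, 1.0, seed=7) for _ in range(5)}
many_seeds = {sample_token(logits, 1.0, seed=s) for s in range(100)}
```

Here `same_seed` contains a single token because the seed is fixed, while `many_seeds` contains several: same inputs, different outputs, and the only thing that varied is a value the user never sees.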
I have found small iterations to have the best results. I'm not giving AI any chance to one shot it. For example, I won't tell it to "create a fleet view" but something more like "extract key binding to a service" so that I can reuse it in another view before adding another view. Basically, talk to the AI as an engineer talking to another engineer at the nitty gritty level that we need to deal with everyday, not a product person wishing for a business selling point to magically happen.
Hey I don't want to over simplify, I'm sure it was complicated, but did the author have functional tests for these broken views? As long as there are functional tests passing on the previous commit I'd have thought that claude could look at the end situation and work out how to get the desired feature without breaking the other stuff.
TUIs aren't an exception, it's still essential to have a way to end-to-end test each view.
You can't test every permutation of app usage. You actually need good architecture so you can trust your tests and changes to be local with minimal side-effects.
In that situation you have two choices:
1. Tell claude to iterate until the tests for the new view and the old views are all passing.
2. git reset --hard back to the previous commit at which all tests are passing and tell claude to try again, making sure not to break any tests.
It's essential to use tests when vibecoding anything non trivial. Almost certainly in a TDD style.
What has generally worked for me is paraphrasing the old adage "Write the data structures and the code will follow" over to AI. Design your data, consider the design immutable and let the AI try to fill in the necessary code (well, with some guidance). If it finds the data structures aren't enough, have it prompt you instead of making changes on its own. AI can do a lot of the low-hanging fruit and often the harder ones as well, as long as it's bound to something.
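One way the "data first, code follows" split could look, as a rough sketch. The `Pod` and `Fleet` types and the `ready_ratio` function are hypothetical names invented for this example; the frozen dataclasses stand in for the hand-written design, and the function stands in for the kind of code that gets delegated.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Tuple


# Hand-designed data: written by the human and treated as fixed.
@dataclass(frozen=True)
class Pod:
    name: str
    ready: bool


@dataclass(frozen=True)
class Fleet:
    pods: Tuple[Pod, ...]


# Delegated code: only reads the structures above, never reshapes them.
def ready_ratio(fleet: Fleet) -> float:
    """Return the fraction of pods that are ready (0.0 for an empty fleet)."""
    if not fleet.pods:
        return 0.0
    return sum(pod.ready for pod in fleet.pods) / len(fleet.pods)


fleet = Fleet(pods=(Pod("api", True), Pod("worker", False)))
```

Freezing the instances is only a loose mirror of "consider the design immutable": it stops generated code from mutating records in place, but keeping the type definitions themselves unchanged still depends on review and on telling the agent to ask first.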
Yet, for now, AI at best has been something that relieves me from having to write a long string of boring code: it's not sustainable to keep developing stuff relying on AI alone. It's also great when quality is not an issue; for any serious work AI has not sped me up noticeably. I still need to think through the hard parts, and whatever I gain in generating code I lose in managing the agents. But I can parallelise code generation, trying new approaches, and exploring because AI is cheap. AI is also pretty good for going through the codebase and reasoning about dependencies, whether in the context of adding a new feature or fixing a bug: I often let AI create a proof-of-concept change that does it, then I extract the important bits out of that and usually trim the diffs down to 1/3 or less.
AI further helps with non-work, i.e. tasks that you have to do in order to fulfill external demands and requirements, and not strictly create anything solid and new. I can imagine AI creating various reports and summaries and documentation, perhaps mostly to be consumed and condensed by another AI at the receiving end. Sadly, all of this is mostly things not worth doing anyway.
Overall, I cringe under all the hype that's been laid on AI: it's a new tool that's still looking for its box or niche carveout, not a revolution.
Personally, I've taken the time it's freed up to spend more time on mathacademy and reading more theory-oriented books on data structures and algorithms. AI coding systems are at their best when paired with someone with broad knowledge. Knowing what to ask for and knowing the vocabulary to be specific about what you want built is going to be a much more valuable job skill going forward.
One example is a small AI-based learning system I have been developing in my free time to help me learn. The MVP stored an entire knowledge graph and progress in markdown files. Being an engineer, I knew this wouldn't scale, so once I proved the concept viable, I moved everything into sqlite with a graphdb. Then I decided to wrap some parts of the functionality in Rust and put everything behind a small Rust layer, with the progress tracking logic still being in Python.
Someone with no knowledge of graph databases or dependency graphs or heuristics would not be able to build this even if they had AI. They simply don't know what they don't know, and AI won't save you there.
That said, I think it's important to also spend time in the dirt. I've recently started picking up zig as my NO AI language just to keep those skills sharp.
I'm really curious if we'll seesaw once AI costs go up 10x.
I've only been using kimi 2.5 and deepseek pro for reviewing PRs for security issues. Less than 10% of my workflow requires a full-powered frontier model.
I think the issue is overblown by people who think claude code is a good harness and use opus for everything. opencode is objectively better. It's much more verbose about what it's doing, you have more control when it comes to offloading to subagents with targeted context (crucial for running through larger jobs), and I can swap between codex and open weight models.
The quality gates are up to you, and if you are smart you will make a lot of them and review them closely
For example, if I'm new to programming today and I'm not part of any community that necessarily approves agentic coding or disapproves of vibe coding and I heard that C programs run fast as heck and I heard that I can automate jobs 1,2 and 3 with such a program, I generate said program and it works as expected per my limited experience then what's the issue?
Perhaps in a couple of weeks I notice I'm missing 1/4 of my HD space and I figure out probably via an agent that my cool C program is creating bloat through caching or creating hidden dot files, so I agentically/vibe-ally generate a patch. Maybe this encourages me to join a community of other amateurs or a pro-am community where I learn specifics - eg. the exact bug(s) in my code -- as well as metas -- eg. testing.
There will probably be millions and millions of people generating code for their own purposes thanks to LLMs, and the number grows as the technology develops and becomes more trivial. So I wonder how much value there is in the "how to think about this" discussion vs the "how to use this" discussion. It almost feels like religious encampments are forming over false -- possibly manufactured -- lines of division.
With that said, this caught my eye:
> AI gravitates toward single-struct-holds-everything because it satisfies the immediate prompt with minimal ceremony.
This is too general. "AI" is used here as a catch-all, but in fact, it was the specific model under the specific conditions you ran your prompt, including harness, markdowns, PRDs, etc. So it's not fair to say "AI does X!" in this case.
It's also very much up to you. It's very common to have a frontier model plan an architecture before you have another model implement code. If you're just one-shotting an LLM to do everything you get mediocre, more brittle code.
This stuff is still being figured out by a lot of people. But I feel the core of the issue is not using AI well. Scoping, task alignment, validation, are crucial.
Can someone with more experience with it (or similar tools) chime in and confirm that this isn't just more AI snake oil? :)
Matt Pocock talks about specs and Openspec after 23:00 minute mark and again after 33:00 minute mark here: https://www.youtube.com/watch?v=-QFHIoCo-Ko. He doesn't believe in simply translating specs-to-code. He emphasizes tracer bullets, TDD, setting up quick feedback loops.
And in a couple of months we might be doing things completely differently because of some new model or new framework.
That's really cool.
I still do, but I used to, too.
The framework could be an isolation layer against vibe rot, but I'm not sure it's necessary for the small project I always wanted to do and never did anything with.
For another tool, I will try another approach: start with a deep investigation and spec written together with AI, then start with the core architecture layout, and then add features.
So instead of just prompting "write a golang project with a http server serving xy, and these top 3 features" I will prompt "create a basic golang scaffold for build and test" -> "create a basic http server with a basic library doing xy" -> "define api spec" -> "write feature x"
There is kind of a skill and depth to vibe coding though.
But I will say... you have to know Golang. You have to have at least tried to make a BubbleTea app yourself and try to understand ELM architecture. You have to look at the code and increment with it.
It makes total sense for OP to switch to Rust and Ratatui if they don't know Golang well. But I don't think it's a better language for it. [Ratatui has brought me great inspiration though!]
Independent of framework, the LLMs get the spatial relationships. I say things like "the upper right panel's content is not wrapping inside and the panel's right edge should extend to the terminal edge" and the LLM will fix it. They can see the resultant text, I'm copy-pasting all the time.
TUI code is finicky; one mis-rendered component mucks everything up. The LLMs will decide by themselves to make little, temporary BubbleTea fixtures to help them understand when things aren't right.
The only real problem with LLMs and BubbleTea is that upon first prompt, they insist on using BubbleTea v1 versus BubbleTea v2, released in December 2025. But then you just point it to the V2_UPGRADE.md and it gets back on track. That will improve as training cutoffs expand.
I vibe-coded this TUI for Mom's last night. I actually started with Grok (who started with v1) and then moved into Claude Code after some iteration:
https://gist.github.com/neomantra/1008e7f2ad5119d3dd5716d52e...
Will that improve or get worse? One would argue that LLMs in general are drastically more competent now than they were a couple years ago, they’re also much better at coding. We’re likely just now entering the era where they can code but are still not what you’d fully expect, or at least not what someone with absolutely no coding knowledge could use to code at the same level as someone who does know how to code.
Maybe that changes as the models improve, maybe it doesn’t, only time will tell.
I stopped reading after this, because this is the dumbest way to vibe code anything larger than a single-use tool.
Claude is a collaborator, and honestly a decent voice of dissent, but it will never offer that unprompted. "Make this thing" - "OK".
You need to review the code. You need to say "I want this, AND HERE IS THE LONG-TERM VISION. Now offer critique and the trade-offs for various implementations."
Or just realize that in every hand-written project you learn the contours of the problem space as you go along and if the tool is big enough you'll feel the urge to do a green-field rewrite of hand-rolled code after a few years. You get there quicker with the robot's help. This is not a new lesson.
There's a massive difference between good human "writing" and a dozen paragraphs of "it's not x, it's a y".
But unfortunately everyone "reads" English. So, at least devs have mysterious computer languages that have strings of numbers that most of us look at and immediately get a migraine from attempting to comprehend what it means.
keep up the good work and the craft of building things one keystroke at a time.
Software engineering is not that. You absolutely can and often will hand off work to humans. It's not inherently that creative in the actual coding part.
However everything in this field is cargo culting. We have absolutely no way to quantify productivity in the real world.
We've had advanced programming languages backed by advanced programming language theory for decades, and the most used/run programming languages in the world are C, PHP and JavaScript, languages held together by duct tape or, in the case of C, programming language theory from the 60s.
We have a super minimal JavaScript runtime in the browser to avoid a bloated standard library and then people invent things like leftpad. At the same time basically every major website in the world serves megabytes of tracking and ad serving libraries.
We all "know" AI makes coders more productive but nobody can do the equivalent of a clinical trial for a major new drug.
The problem is that the mitigations offered in the article also don't work for long. When designing a system or a component we have ideas that form invariants. Sometimes the invariant is big, like a certain grand architecture, and sometimes it’s small, like the selection of a data structure. You can tell the agent what the constraints are with something like "Views do NOT access other views' state" as the post does.
Except, eventually, you'll want to add a feature that clashes with that invariant. At that point there are usually three choices:
- Don’t add the feature. The invariant is a useful simplifying principle and it’s more important than the feature; it will pay dividends in other ways.
- Add the feature inelegantly or inefficiently on top of the invariant. Hey, not every feature has to be elegant or efficient.
- Go back and change the invariant. You’ve just learnt something new that you hadn’t considered and puts things in a new light, and it turns out there’s a better approach.
Often, only one of these is right. Often, at least one of these is very, very wrong, and with bad consequences.
Picking among them isn’t a matter of context. It’s a matter of judgment, and the models - not the harnesses - get this judgment wrong far too often. I would say no better than random chance.
Even if you have an architecture in mind, and even if the agent follows it, sooner or later it will need to be reconsidered. What I've seen is that if you define the architectural constraints, the agent writes complex, unmaintainable code that contorts itself to fit them when they need to change. If you don't read what the agent does very carefully - more carefully than human-written code because the agent doesn't complain about contorted code - you will end up with the same "code that devours itself", only you won't know it until it's too late.
So yes, you might get good results in one round, but not over time. What does work is to carefully review the AI's output, although the review needs to be more careful than review of human-written code because the agents are very good at hiding the time bombs they leave behind.
No, if you have to do all of the stuff you have listed to kind-of-make-it-work...You are not in charge.
You can of course use PreToolUse hooks to block particularly damaging actions of the "rm -rf" variety, but this is also not 100% guaranteed unless you're able to block _all_ ways of performing that damaging action (and you would be surprised: agents will happily write custom python / bash / etc. scripts to do actions you tried to block them from doing!)
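A rough sketch of the kind of check such a hook might run, which also shows why it is leaky. It assumes the hook receives a JSON payload whose `tool_input.command` holds the shell command and that a return value of 2 is used as the "block" exit code; treat both as assumptions to verify against the harness documentation. `is_destructive` and `decide` are hypothetical helper names.

```python
import shlex


def is_destructive(command: str) -> bool:
    """Naive deny-list: flag `rm` invoked with both recursive and force flags."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return True  # unparseable input: err on the side of blocking
    for index, token in enumerate(tokens):
        if token == "rm":
            # Collect short flags such as -rf, -r -f, or -Rf after the rm token.
            flags = "".join(
                t.lstrip("-")
                for t in tokens[index + 1:]
                if t.startswith("-") and not t.startswith("--")
            )
            if "r" in flags.lower() and "f" in flags:
                return True
    return False


def decide(payload: dict) -> int:
    """Return 2 to block the tool call, 0 to allow it (assumed convention)."""
    command = payload.get("tool_input", {}).get("command", "")
    return 2 if is_destructive(command) else 0


# The same damage routed through a script sails past the check.
bypass = "python -c \"import shutil; shutil.rmtree('build')\""
```

A real hook script would read the payload from stdin and exit with the value `decide` returns. The `bypass` string is the commenter's point in miniature: the deny-list sees no `rm` token, so `is_destructive(bypass)` is `False` even though the effect is the same.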
Tools help instruct the agent to redo work e.g. to pass linter / formatter checks or relevant tests. But I've also seen them ignore those, often enough to be noticeable: e.g. "17 of 18 tests pass, the other 1 wasn't introduced by this feature" - regardless of whether that's actually true or not, regardless of whether I put "ALWAYS make sure ALL affected tests pass" in an instruction file somewhere.
This isn't to refute your main point: yes, you can improve your chances that AI will write good code. But there is no magic bullet that will force it, 100% of the time, to write good code; this is where vibe coders without requisite coding + engineering skills hit a wall. A multi-layered approach of guidelines + progressive disclosure + tools + hooks indeed reduces the probability of bad code enough to be useful for many engineering tasks.
[1] https://straion.com/blog/1m-tokens-wont-save-your-engineerin...
Sure. That's how I work with AI, and the way I believe that AI is meant to be used -- as a companion tool.
But it's a lot of work. It saves me time for certain tasks, but not others. I haven't measured my productivity gains, but they're at most 2x.
But that's not "vibe coding" (which was the point of the article) or the (false) promise of "10x productivity" and "code that writes itself" that companies are being told is going to reduce their engineering headcount tenfold.
I had strong principles at the outset of the project and migrated a few consumers by hand, which gave me confidence that it would work. The overall migration is large and expensive enough that it has been deferred for nearly a decade. Bringing down the cost of that migration made me turn to AI to accelerate it.
I found that it was OK at the more mechanical and straightforward cases, which are 80% of the use cases, to be fair. The remaining 20% need changes to the framework. Most of them need very small changes, such as an extra field in an API, but one or two require a partial conceptual redesign.
To oversimplify the problem, the backend for one system can generate certain data in 99% of cases. In a few critical cases, it logically cannot, and that data must be reported to it. Some important optimizations were made with the assumption that this would be impossible.
The AI tooling didn't (yet) detect this scenario and happily added migration logic assuming it would work properly.
Now, because of how this is being rolled out, this wasn't a production bug or anything (yet). However, asking the right questions to partner teams revealed it and unearthed that some others were going to need it as well.
Ultimately, it isn't a big problem to solve in a way that will mostly satisfy everyone, but it would have been a big problem without a human deeper in the weeds.
Over time, this may change. Validation tooling I built may make a future migration of this kind easier to vibe code even if AI functionality doesn't continue to improve. Smarter models with more context will eventually learn these problems in more and more cases.
The code it generates still oscillates between beautiful and broken (or both!) so for now my artistic sensibilities make me keep a close eye on it. I think of the depressed robot from the Hitchhiker's Guide to the Galaxy as the intelligence behind it. Maybe one day it'll be trustworthy.
To be fair, there are many people like this as well. One of my personal favorite examples was way back in the 80s when I inherited the code for a protocol converter that let ASCII terminals communicate with IBM mainframes via the 3270 protocol.
One of the pieces of code in there, for managing indicator lights, was simply wrong. It was ca. 150 lines of Z80 assembly language that was trying to faithfully follow the copious IBM documentation of how things worked, but it had subtle issues and didn't always work.
My approach was to accept the documentation as accurate (the IBM documentation was always verbose and almost never wrong), but to reason that the original 3270 had these functions implemented in TTL logic gates, and there was no way in heck that they were wasting enough gates on indicator lights to require the logical equivalent of 150 instructions.
So in my mind, it had to be a really simple circuit that had emergent properties that required the reams of documentation. With that mindset, I was able to craft correct code for this in 12 instructions.
Many systems are likewise fractal in nature. You want to figure out the generating equations, rather than all the rules that derive from those. And, in many cases, writing down the generating equations is at least as easy to do in code as it would be to do in English for someone or something else to implement.
I find this to be a big problem with spec driven development: no spec survives the real world, some invariant that was in the spec will inevitably turn out to be wrong, no matter how much time you spend researching and designing the spec.
When I as a human hit this during development, I can take a step back and think it through, and decide oh yes, the invariant is wrong and needs to be thought through again, and the impact of changing it needs to be assessed. Then I can design around it. Sometimes that means a substantial change in design, sometimes not, but in all times the resulting software is better for it: an unknown has been uncovered, something new has been learned.
When this happens to AI, it keeps churning on it until it manages to hack a solution together, under the potentially wrong assumptions, design, or invariant. It doesn’t have the insight to step back and holistically reevaluate.
At least, that’s been my experience working with AI. I think we can improve its ability to handle these situations, through good workflows and verification, but it’s not something that comes naturally to AI, it’s not something Claude Code or whatever supports out of the box, and it has its limits.
But in all seriousness it depends on what you’re doing with it. Writing a quick tool using an LLM is much easier than context switching to write it yourself. If you need the tool, that’s very valuable.
Been building a new app with lots of policies and whatnot and instructing an LLM is just much faster than doing the same repetitive shit over and over myself.
Even if you could state it in a precise formal language the LLM under the agent doesn’t have the capability to understand what the invariant is for and why it’s important. You’ll still get oddly generated code. You might get an LLM that can associate certain tokens with those in the formal language specification which can hold invariants and perhaps even write the proofs… but you’ll still get a whole bunch of other code generated from the informal parts of the prompt.
I agree that simply adding constraints and prompts to your skills and specs isn’t going to prevent these things. Worse, even if you could invent a better mousetrap, the creature would still escape.
The problem is “elongation”: the addition of code for the sake of the prompt/task/etc. Often less is better. This takes a human with the ability to anticipate what other humans would want/expect. When you need a generator, they’re great, but it’s a firehose whose use should be restrained a little more.
That depends on the invariant. Some are behavioural, like "variable x must be even if y is positive", but some are architectural, such as "a new view requires a new class".
But that's only one side of the problem because maintaining the invariant can be just as bad as breaking it. You ask the agent to add a feature and it may well maintain the invariant - only it shouldn't have, because the feature uncovers the fact that the invariant is architecturally wrong.
The problem is that evolving software requires exercising judgment about when you need to follow the existing strategy and when you need to rethink it. If there is any mechanical rule that could state what the right judgment is, I don't know what it is.
You can try telling the agent to stop and ask when a constraint proves problematic, except it doesn't have as good a judgment as humans to know when that's the case. I often find myself saying, "why did you write that insane code instead of raising the alarm about a problem?" and the answer is always, "you're absolutely right; I continued when I should have stopped." Of course, you can only tell when that happens if you carefully review the code.
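For what it's worth, the behavioural kind of invariant mentioned above ("variable x must be even if y is positive") is the easy kind to pin down mechanically. A minimal sketch, with a hypothetical helper name `check_invariant`:

```python
def check_invariant(x, y):
    """Raise if the rule "x must be even if y is positive" is violated."""
    if y > 0 and x % 2 != 0:
        raise ValueError(
            f"invariant violated: x={x} is odd while y={y} is positive"
        )


check_invariant(4, 1)   # fine: y is positive and x is even
check_invariant(3, 0)   # fine: the rule only applies when y is positive

try:
    check_invariant(3, 1)
except ValueError as exc:
    print(exc)
```

A check like this can tell you the rule was broken, but not whether the rule is still the right one. An architectural invariant such as "a new view requires a new class" has no equivalent one-liner, and neither does the judgment call about when to abandon it.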
Ancillary parts I don't mind generating, but for core features I still need to be actively writing most of the time.
If you already have a mature code base, then it's very easy to get AI to write excellent code. It has a ton of documentation on what you already do, how you do things, functions to use etc.
I read all the changes AI does. I work in small chunks.
>Even if you have an architecture in mind, and even if the agent follows it, sooner or later it will need to be reconsidered
The agent can modify the structure you want to change 100x faster than you can. That's the beauty of it. We all know how hard it is manually to make architectural changes once you've started to lock into something.
These comments just show me you must not be using AI in the right way, or haven't used it enough to learn "how" to use it. I've been using Claude Code for months now at full speed. You are simply wrong that it doesn't generate good code.
I'm surprised this still needs to be said. I'm convinced that posts like these are from people that let the LLM run wild. Small-chunk PRs are the key, whether it's a human or an LLM.
Indeed for the task of “jump into an unfamiliar codebase and make a requested change that aligns with existing styles and patterns, and uses existing functionality” I would say something like opus 4.7 exceeds the capabilities of most developers.
But agents generate code much faster, and to not slow them down, some people want to skip the only thing that can currently ensure you get good results, which is to carefully review the output. Once that happens, there is simply no way for them to know how good or bad what they're getting is.
Yeah, I’ve been working for several months on a harness that wraps Claude Code, Codex, etc. to ensure that these types of invariants are captured and enforced (after the first few harness attempts failed), and, while it’s possible, it slows down the workflow significantly and burns a lot more tokens. In addition to requiring more human involvement, of course.
I suspect this is the right direction, though, as the alternatives inevitably lead any software project to delve into a spaghetti mess maintenance nightmare.
that's when I stopped.
I would use Stripe, curl, and ffmpeg without audits, because I trust them to provide good code and to respect their API. I wouldn’t trust AI to write a Fibonacci series implementation.
The AI has no reputation to wager for my trust.
For the record, I definitely don't immediately read the majority of code Claude writes these days. I just check on it periodically. In terms of code quality it's as good as any human I know of.
Can be a bumbler at times. So can people.
I review every line of code I generate with AI. I mainly use an MR-based approach:
1) Provide a tightly scoped technical spec to Codex as a task, and ask for 3x solutions. Usually at least one of them is on the right track, and it is better to ditch a solution that went in the wrong direction than to try to fix it.
2) Review the explanation and diff of the proposed changes line by line, file by file. If I find minor deviations from what I asked, or violations of the codebase architecture/conventions, I write comments in the diff and/or global comments, and ask again for 3x adjusted solutions.
3) Usually, by this point, the solution is ready for me to merge locally and either run local tests or do some manual fine-tuning.
4) Finally, I generate unit tests. I leave them to this stage because I can repeat the same process with the sole intent of generating case-specific unit tests. This way, I can generate/review tests against the final version of the implementation.
This has been working very well for me since our repos are reasonably organized and have a well-defined architecture. In the technical spec, I include the major architectural requirements and code conventions, and I also add a catch-all like "follow the codebase's existing conventions and style", which works reasonably well.
This simple process has enabled me to deliver most minor/medium tasks and bug fixes really quickly while maintaining control over the changes and without lowering the quality bar. For larger and more challenging tasks, I find myself "driving the wheel" (i.e. coding by hand) more often, and using AI code generation in a much more scoped and specific way. So that becomes a different process altogether.
I'm using a personal license and Codex. What does this cost to generate 3x solutions as a starting point?
Even in simple coding I have been doing, I notice Codex will burn through my OpenAI subscription rather fast.
I'm sure you agree broadly with Gabe Newell, "people who don't know how to program who use AI to scaffold their programming abilities will become more effective developers of value than people who've been programming, y'know, for a decade." Look, he's talking about you and me. Programming for a while is quickly becoming worthless. It is of course the journey of programming that gives some people insight to real problems - business, creative, whatever - so it is extra important that the people with the best programming skills use the chatbots to write a lot of code that you and I will absolutely never read.
And anyway, you, as consumer, are constantly using code you have never read. Lots of code is shipped that we never read. There is nothing special about reading code. Even if you and I learned everything by reading code, it doesn't mean that generated code isn't going to create value. It's going to generate tons and tons of value.
Yet another POV is, if you are making code for customers who need to read the code, you are making a mistake, in the long term. It is a very, very interesting way to think about efforts around SBOM and various security companies - a far more informative lens to look at Wiz or Cloudflare, and what value they actually provide, because it's not code - and how relatively little enterprise value the "we read everything" teams at high frequency trading startups really deliver. You know this, you know exactly what I am talking about, it's your experience, so it is surprising to hear from you, talking in generalities against a trend that is obviously coming for all the best programmers.
Well, that is problematic. I have to either assume you are disinterested or lying and neither is great for any discourse.
Worse, the disclaimer is buried under a bunch of "did X, did Y on line Z of file a/b/c", as if it's just a minor inconvenience. To the extent the plan was inaccurate, you're left in an undefined state where you might as well undo what it just did.
1. If I use a coding agent to generate code, it should be something I am absolutely confident I can code correctly myself given the time (gun to my head test).
2. If it isn't, I can't move on until I completely understand what it is that has been generated, such that I would be able to recreate it myself.
3. I can create debt (I believe this is being called Cognitive Debt) by breaking rule 2, but it must be paid in full for me to declare a project complete.
Accumulating debt increases the chances that code I generate afterwards is of lower quality, and it also feels like the debt is compounding.
I'm also not really sure how these rules scale to serious projects. So far I've only been applying these to my personal projects. It's been a real joy to use agents this way though. I've been learning a lot, and I end up with a codebase that I understand to a comfortable level.
Then when it was completing functions, people would say, "yeah, but you still have to make sure you're the one writing the logic around the functions"
Then when it was completing the logic around the functions, people would say, "yeah, but you still have to make sure you're the one writing the features"
Now it's completing features and people say, "yeah, but you still have to make sure you're the one writing the architecture"
I don't know if architecture is a solvable problem for these models, but it is interesting watching the expectations moving over time.
That’s the hard part of coding. If you have an architecture then writing the code is dead simple. If you aren’t writing the code you aren’t going to notice when you architected an API that allows nulls but then your database doesn’t. Or that it does allow that but you realize some other small issue you never accounted for.
I do not know how you can write this article and not realize the problem is the AI. Not that you let it architect, but that you weren’t paying attention to every single thing it does. It’s a glorified code generator. You need to be checking every thing it does.
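The null mismatch described above is easy to reproduce in miniature. A minimal sketch, assuming a hypothetical `CreateUserRequest` type on the API side and a hypothetical `users` table on the database side, using an in-memory SQLite database:

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class CreateUserRequest:
    """Hypothetical API payload: the email is allowed to be missing."""
    name: str
    email: Optional[str] = None


def save_user(conn, request):
    """Persist the request as-is; nothing here reconciles the two layers."""
    conn.execute(
        "INSERT INTO users (name, email) VALUES (?, ?)",
        (request.name, request.email),
    )


conn = sqlite3.connect(":memory:")
# The schema disagrees with the API type: email is NOT NULL here.
conn.execute("CREATE TABLE users (name TEXT NOT NULL, email TEXT NOT NULL)")

save_user(conn, CreateUserRequest("ada", "ada@example.com"))  # accepted

try:
    save_user(conn, CreateUserRequest("bob"))  # valid for the API layer
except sqlite3.IntegrityError as exc:
    print("rejected by the database:", exc)
```

Each half looks reasonable on its own in a diff. The contradiction only shows up when someone holds both in their head at once, or when the second insert fails at runtime.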
The hard part of software engineering was never writing code. Junior devs know how to write code. The hard part is everything else.
The swindle goes like this: AI on a good codebase can build a lot of features. You think it’s faster; it even seems safer and more accurate at times, especially in domains you don’t know everything about.
This goes on for a while, whilst the codebase gets bigger, exploration takes longer, and the failure rate increases. You don’t want it to be true and try harder, so you only stop after it has become practically impossible to make any changes.
You look at the code again and there is so much code that spaghetti is an understatement; it’s the Chinese wall.
You start working…, and you realize what was going on.
I deleted 75,000 of 140,000 lines of code, and I honestly feel like I wasted the 3 months I went hard into agentic coding. I failed my users by building useless features, increasing bugs, losing the mental model of my code, and not finding the problems I didn’t know about: the kind of hard decisions you only see when you’re in the code, the stuff that wanders in your mind for days.
Coincidentally I've been working on a project for about 7 months now: it's a 3D MMO. Currently it's playable, and people are having fun with it - it has decent (but needs work) graphics, and you can cram a few hundred people into the server easily currently. The architecture is pretty nice, and it's easy to extend and add features onto. Overall, I'm very happy with the progress, and it's on track to launch after probably a year's worth of development.
In 7 months vibe coding, OP failed to produce a basic TUI. Maybe the feature velocity feels high, but this seems unbelievably slow for building a basic piece of UI like this - this is the kind of thing you could knock out in a few weeks by hand. There are tonnes of TUI libraries that are high quality at this point, and all you need to do is populate some tables with whatever data you're looking for. It's surprising that it's taking so long.
There seems to be a strong bias where using AI feels like you're making a lot of progress very quickly, but compared to manual coding it often seems to be significantly slower in practice. This seems to be backed up by the available productivity data, where AI users feel faster but produce less
> back to writing code by hand
But what they are doing is
> doing the __design work__ myself, by hand, before any code gets written.
So... Claude still is generating the code I guess?
And seriously, I can't understand that they thought their vibe coded project works fine and even bought a domain for the project without ever looking at source code it generated, FOR 7 MONTHS??
And the goal of the article is to draw attention to their project.
Additionally, they couldn't even bother to write their own blog post, so it's a little hard to take them seriously when they say they're going to write their own code...
> Claude (c) by Anthropic (R) is the best thing since sliced bread and I'm Lovin' It(tm)! Here's a breakdown of how you too can live a code free life for 10 easy payments of $99.99 a month if you subscribe now!
> Step one in your journey to code free life: code the whole damn project and put it together yourself
It's so much fluff and baloney and every single article is identical. And every single one is just over the top praise of Claude that doesn't come off as remotely authentic. There's always mentions of Claude "one shotting"(tm) something.
I don’t think it’s that weird to not look at the code if it’s a side project and you follow along incrementally via diffs. It’s definitely a different way of working but it’s not that crazy.
It's not weird to not look at the code, as long as you're looking at the code? (diffs?)
Uh, ok
We’ve moved to seeing that specs are useful and that having someone write lots of wrong code doesn’t make the project move faster (lots of times devs get annoyed at meetings and discussions because it hinders the code writing, but often those are there to stop everyone writing more of the wrong thing)
We’ve seen people find out that task management is useful.
Now I’m seeing more talk of fully doing the design work upfront. And we head towards waterfall-style dev.
Then we’ll see someone start naming the process of prototyping, then I’m sure something about incremental features where you have to manage old vs new requirements. Then talk of how really the customer needs to be involved more.
Genuinely, look at what projects and product managers do. They have been guiding projects where the product is code yet they are not expected to read the code and are required to use only natural language to achieve this.
This is a special case of a general fundamental point I'm struggling with.
Let's assume AI has reduced the marginal cost of code to zero. So our supply of code is now infinite.
Meanwhile, other critical factors continue to be finite: time in a day, attention, interest, goodwill, paying customers, money, energy.
So how do you choose what to build?
Like a genie, the tools give us the power to ask for whatever we want. And like a genie, it turns out we often don't really know what we want.
Now it is different in a way where now I don’t have time to use those apps.
That’s a joke.
But I do believe it answers the question of “what to build?”. If you didn’t have time before LLM assisted coding you still don’t have time for it. You most likely know what is used and what not already by heart or by some measurements.
When you ask for a new major feature, despite hard guidelines and context (that eat half your context window), it quickly ships bloat. The foundations are not very well organized, and this is where you acknowledge it is all about random prediction of the next word-thing.
Overall, I've wasted more time reviewing the PR and trying to steer it properly than I expected. So multi-layer agent vibe coding is no longer the way to go *for me*. Maybe with unlimited tokens and a better prompt, to be investigated...
Then it spent more time appending comments to its own comments rather than writing code ^^
The rewrite is me sitting down with a blank doc and drawing the boxes before any code exists. Then the CLAUDE.md enforces what I already decided. Whether that actually holds up as the project grows, I genuinely don't know yet.
Prompt for what you want. Get your feature working, then cut: reduce SLOC, refactor to remove duplication, update things to match existing patterns. You might do these instinctively, or maybe as-you-go, but that's just style. Having a dedicated pass works just as well.
The same thing goes for my code now that did when I wrote every line by hand: make it work, then make it good, then make it manageable. Manually that meant breaking things down into small blocks of individual diffs inside a PR (or splitting PRs), checking for repetitive code and refactoring, or even stashing what I got to and doing it again with the knowledge of how things went wrong.
Agents can do the same. It's WAY easier mentally and works out better if you treat them the same way and go working -> better -> done.
The very worst things you can do in a codebase are (a) not deeply understand how it works (have it be magic) and (b) be lazy and mess up the structure.
How do you fix a problem which happens at 2:00am and takes your system down if you don't have an excellent understanding of how it works?
Over time we're already bad at (a) because most developers hate writing documentation so that knowledge is invariably lost over time.
Even I think that after a few iterations of producing the code there should be a change in strategy.
I sometimes also wonder if I should add the software engineering textbooks that tried teaching us to code but contained the frameworks that are better applied along with principles like SOLID, DRY, etc.
But then again, I do not have the right answer now. Maybe the reformation must come in the models too but as I see it, going back to hand coding is not the solution.
Just like we came up with different paradigms of coding, the different principles of coding, different frameworks in short, we need to and will come up with some frameworks (& maybe some newer models as mentioned above) that can and will make us call AI coding “The Standard”.
What are off the table (I think)
1. Hand coding, or maybe even reading AI’s code line by line. That’d rather be counterproductive. At least for me it takes more time to read its code and understand it. But I evaluate its code not just by writing tests but by other means too, depending on the situation, and that’s for another time too.
2. Vibe coding
3. Thinking software engineering is automated (it definitely is more essential than ever)
4. The same goes for software development - even that’s not going to go extinct
5. Software jobs are going to go extinct. (In fact, if a company is shedding people and claiming it doesn’t need so many of them, that means to me that either they do not see much of a future for themselves or they’re just playing the stock price and investor satisfaction game for the short run - but that’s for a different topic)
I now add a long list of instructions on how to work with the type system and some do’s and don’ts. I don’t see myself as a vibe coder. I actually read the damn code and instruct the AI to get to my level of taste.
Eventually, like every hype wave, the dust will settle, and let's see where we stand.
By now all the AI companies have consumed all human knowledge so they either learn to actually think for themselves, or that is it.
Either way, that won't change the ongoing layoffs while trying to pursue the AI dream from management point of view.
I think most companies doing layoffs are bloated to begin with, AI is just the scapegoat to do the layoffs.
Translation and asset generation teams for enterprise CMS, whose role has now been taken by AI.
Likewise traditional backend development, that was already reduced via SaaS products, serverless, iPaaS low code/no code tooling, that now is further reduced via agents workflow tooling, doing orchestration via tools (serverless endpoints).
Claude is super good at making it seem like it’s an expert in Kubernetes, but then, on uncovering certain decisions, it’s basically optimizing to try to make things look like they work.
An example is, I wanted to develop a feature to easily fork a managed Postgres database with a k8s cluster. The thing it did was to copy the entirety of the source db to localhost, then copy it back out to the cluster, rather than just running the job within the cluster.
Now I’m pretty stressed after a 1 hour vibe coding session, having to now review and digest and think through the code that it wrote. Implementations like that scare me — if I accidentally missed it and merged it — since there are real people who rely on canine.
I wouldn’t go as far as to say I’m writing everything by hand, but I now always map out how I would do something before asking ai to approach it
Yes I agree for sure llms write terrible code when left to their own devices, but so do most engineers. Which is why we have so many tools to help keep a certain level of quality. Duplication checks, tests, linters, other engineers.
I find whenever you make an llm repo without these checks, and more, it will write like an enthusiastic junior engineer, wrong and strong. However a junior engineer would be hard pressed to get 95% coverage on a codebase, the ai is more than willing and does it in a few minutes. We can use things like this to our advantage, how many people have ever seen a repo with 100% test coverage? With ai this is very possible, with people not so much.
LLMs write terrible code, we know this, but when dealing with humans that write terrible code we have many techniques. We should be using those same techniques to keep the LLMs honest, but more importantly verifiable.
Then you're right back on track.
In a way it's not that different from a human-made project. Plenty of teams have to crunch, ignoring the architecture and incurring tech debt, and then come back and fix it later.
I have to periodically get it to do a bunch of refactoring
Isn't Golang relatively easier to read than Rust? I was under the impression that Rust is a more complex language syntactically.
> The other change is simpler: I'm doing the design work myself, by hand, before any code gets written. Not a vague doc. Concrete interfaces, message types, ownership rules. The architecture decisions that the AI kept making wrong are now made in writing before the first prompt.
This post is good to grasp the difference between "vibe-coding" and using the AI to help with design and architectural choices done by a competent programmer (I am not saying you are not one). Lately I feel that Opus 4.7 involves the user a lot more, even when given a prompt to one-shot a particular piece of software.
I'm building an orchestrator (who isn't). Haven't looked at the code yet, but it appears to work. But man have I spent hours in loops between Claude, Codex and myself all on the highest thinking levels to figure out what interface portability means for the employee, how best to handle "remote" sessions and the appropriate semantics for pipelines/recipes.
I've also been very opinionated about who does what. I'll let the agent write a script to sync with GitHub and reload workers, but I decided to "waste" the 5 minutes to manually do all of the config steps on Render for my server when Claude told me that I couldn't just give it read only scope to pull the logs. Bad news, I'm cutting and pasting for my computer overlord. Good news? Claude can't blow away the prod db if it happens to get in the way of whatever interpretation it makes of the instructions I give it.
A chainsaw requires very different skills than an axe. It has different failure modes. Some experience as a lumberjack probably helps using either/both.
No difference (at least now) with agents.
Unfortunately, it is not, and many of its attempts at mathematical proofs have major flaws. You shouldn't trust its proofs unless you are already able to evaluate them--which I think is pretty much all the OP is saying.
There is one exception to this: If the AI also delivers the proof of why the math is correct, in a machine-checked format, and I understand the correctness theorem (not necessarily its proof). Then I would use it without hesitation.
I’m going to guess that this is Gell-Mann amnesia more than anything, and it’s going to get a lot of organizations into a lot of weird places.
Normally with mathematical problems you have to prove the solution correct. Testing is not sufficient, unless you can test all possible inputs exhaustively.
If it’s beyond our ability to review and we blindly trust it’s correct based on a limited set of tests… we’re asking for trouble.
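As a small illustration of the "unless you can test all possible inputs exhaustively" case: when the input domain is tiny, a test really can stand in for a proof. A minimal sketch, assuming a hypothetical bit-trick function restricted to 16-bit inputs and a deliberately boring reference to compare against:

```python
def is_power_of_two(n):
    """Bit trick under test: a power of two has exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


# Boring reference: enumerate every power of two that fits in 16 bits.
POWERS_OF_TWO_16 = {1 << k for k in range(16)}


def reference(n):
    return n in POWERS_OF_TWO_16


# Exhaustive check over the whole 16-bit domain (65,536 inputs).
mismatches = [n for n in range(2 ** 16) if is_power_of_two(n) != reference(n)]
print(len(mismatches))  # 0
```

This only works because 65,536 inputs is nothing. Widen the domain to 64-bit integers, or to two arguments, and enumeration stops being an option, which leaves you back at needing either a proof or someone who can actually review the reasoning.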
"PhD level" just means you finished a bachelor and masters degree and are now doing a bit of original research as an employed research assistant.
Claude isn't "PhD level" anything. This shows a complete lack of understanding here. Claude has read every single textbook in existence, so it can surface knowledge locked away in book chapters that people haven't read in years (nobody really reads those dense books on niche topics from start to finish).
Since Claude has infinite patience, you can just keep asking until you get it.
... that can't even count.
Comprehension debt just sounds like there are things you don’t (yet) understand.
Cognition debt means your lack of understanding compounds and the cognition “space” required to clear it increases accordingly.
An increasing comprehension debt that can be paid off one bit at a time within reasonable cognition space takes linear time to clear.
Cognition debt takes exponential time to clear the more of it you have. If it reaches a point where you simply don’t have the space for the cognition overhead required to understand the problem, you probably need to start over from your specifications.
This all works pretty great. Where it starts going off the rails is if I let it use a library I'm not >=90% comfortable with. That's a good use of these tools, but if I let it plow through feature requests, I end up accumulating debt, as you pointed out.
For my uses, I'm still finding the right balance. I'm not terribly sure it makes me faster. What I do think it helps with is longer focused sections because my cognitive load is being reduced. So I can get more done but not necessarily faster in the traditional sense. It's more that I can keep up momentum easier, which does deliver more over time.
I'm interested in multi agent systems, but I'm still not sure of the right orchestration pattern. These AI tools still can go off the rails real quick.
But we don’t follow the same things for dependencies, work of colleagues, external services, all the layers down to the silicon when trying to work.
Why is AI suddenly different?
We just have to do this by risk and reward. What’s the downside if it’s wrong, and how likely is an error to be found in testing and review? What is the benefit gained if it’s all fine? This is the same for libraries and external services.
A complex financial set of rules in a non-updatable crypto contract with no testing?
A viewer for your internal log data to visualise something?
There are some programmers who treat the job as just plumbing together what are to them completely incomprehensible black boxes, who treat the computer as a mystery machine that just does things "somehow", but these programmers will almost always be hacks that spend their entire career producing mediocre code.
There are things such a programmer can build, but they are very limited by their lack of in depth understanding, and it is only a tiny fraction of what a more competent programmer can put together.
To get beyond being a hack, you need to understand the entire stack, including the code that you didn't write, including both libraries, frameworks and the OS, and including the hardware, the networking layers, and so forth. You don't have to be an expert at these things by any means, but you do need to understand them and be comfortable treating them as transparent boxes that you may have to go in and fiddle with at some point to get where you need to go. Sometimes you need to vendor a dependency and change it. Sometimes you need to drop it entirely and replace it with something more fit for purpose you built yourself.
An outsourced developer isn't a "tool". They're a human being, and responsible for their actions. They're being paid, and they either act responsibly or they get replaced.
A vibe coder is a human using a tool. The human is responsible for code quality, and if it's not good enough, they need to keep using the tool to make it better. That means understanding the tool's output.
If an artist used Photoshop to create a billboard ad that was ugly, they don't get to blame Photoshop. They have to keep using the tool until their output is good.
Your manager is unknowingly helping you create a form of job security for yourself, with all the technical debt and bugs being accumulated.
He might not understand it, and it might not be the type of work you want to do, but someone is going to have to fix those issues. And the longer they wait, the bigger the task gets.
If your manager is reasonable, you can explain to them that there isn't time to check the work of the AI, and that it frequently makes obscure mistakes that need to be properly checked, and that takes time.
At this point, if they still insist you just hand over the AI's work, they've made a decision that is their fault. You've done what you can.
And when the shit hits the fan, we're back to whether they're reasonable or not. If they are, you explained what could happen and it did. If they force responsibility on you, they aren't reasonable and were never going to listen to you. That time bomb was always going to go off.
Had a project idea which I coded with the help of AI, and it became large enough that I was starting to have uncharted areas in the code. Mostly because I reviewed it too shallowly or moved too fast.
It was a good thing as that project never floated but if I were to do such a thing on my breadwinning project I would lose the joy.
I am not very disciplined, and find it too convenient to reach for an agent these days.
This may sound ridiculous, but I am addicted to nicotine. I used to have some sort of rule around how I am allowed to use nicotine pouches to manage my addiction. For example after I finish writing a feature, I could have one pouch. It was obviously a dumb idea that didn't last very long.. But in that specific aspect, coding agents feel similar. I tried setting up rules on how I should use them, but it's not easy to follow them.
Maybe the biggest problem is just guilt?
There ought to be entire sets of teams of devs replaced by one guy if this were the case.
There ought to be popular open source projects suddenly being improved at 10x previous speed if it were true.
I'm seeing plenty of evidence of slop being churned out fast, often creating work for others in the process.
I keep reading all over social media about these "hypercharged 10x AI devs".
But, I see literally zero evidence of their existence beyond a series of comments on internet forums of "trust me bro".
When AI can complete lines, you still have to read and understand the code.
When AI can complete whole functions, you still have to read and understand the code.
When AI can complete features and tickets, you still have to read and understand the code.
Which is a very similar approach to any serious code. If you just hired a very clever, enormously knowledgeable intern, and they wrote a bunch of code for you overnight, you would probably review it.
Yes, in some cases, either hobby projects or throwaway code, you could just take it and use it as is, and I surely do, for the code no one cares about. But at work, I would rather review it.
I think the solution is between the lines of this article. The author states the steps leading to this, but doesn't arrive at it explicitly. It has been obvious (With 50/50 hindsight) to me since LLMs started getting popular, and holds:
LLMs are fantastic for software dev, if you don't let them write the architecture. Create the modules, structs, and enums yourself. Add as many of the struct fields and enum variants as possible. Add doc comments to each struct, enum, field, and module. Point the LLM to the modules and data structures, and have it complete the function bodies etc. as required.
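As a rough illustration of that workflow (sketched in Python, since the comment doesn't name a language; the invoice domain and every name below are hypothetical), the human writes the types, fields, and doc comments, and leaves only the function body for the LLM:

```python
from dataclasses import dataclass
from enum import Enum


class InvoiceStatus(Enum):
    """Lifecycle of an invoice. Hypothetical example domain."""

    DRAFT = "draft"  # created but not yet sent to the customer
    SENT = "sent"    # delivered, awaiting payment
    PAID = "paid"    # fully settled


@dataclass(frozen=True)
class Invoice:
    """A single invoice. Amounts are integer cents to avoid float rounding."""

    id: str
    customer_id: str
    amount_cents: int
    status: InvoiceStatus


def total_outstanding_cents(invoices: list[Invoice]) -> int:
    """Sum amount_cents over invoices whose status is SENT.

    DRAFT and PAID invoices are ignored. An empty list returns 0.
    The signature and contract are written by hand; only the body
    is left for the LLM to complete.
    """
    raise NotImplementedError("body intentionally left for the LLM")
```

The point of the stub is that the data model and the contract are fixed before any generated code exists, so the review only has to check one function body against a docstring you wrote yourself.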
Have people's standards for quality just completely vanished in the pursuit of the shiny new thing? Is that guy doing something wrong?
That has also been my experience with this sort of thing fwiw, which is why I gave up and do more of a class-by-class pairing with an LLM as a workable middle ground.
Considering how fast we can poop out code now, I think this issue is just more visible than before, but it's been an issue for as long as I've been a developer. Almost no one knows what they actually want, and half the job is trying to coax out what they want to be able to do, so you can properly architect it.
At least with current languages, I think the primary problem is they are globally complex, and it's not scalable for them to verify (and certainly not for you, reviewing a codebase they've mainly or completely generated) that the invariants you want are being upheld.
No matter how many times you tell them - there is ZERO blocking allowed on the critical path, they will add blocking on the critical path.
No matter how many times you tell them any time they do X, they need Y type of test, they will do X without Y type of test.
They cannot follow directions 100%. Neither can people.
But they are more random. The mistakes people make are less likely to do the exact polar opposite of what you wanted to do.
People are less likely to see a critical invariant in the code, build themselves a loophole to get through it, write a test that the code fails successfully, and then tell you they did exactly what you asked for, and bury it in a 5k line commit, where 1000 lines are them changing comments that shouldn't be there in the first place.
LLMs are great. I'm convinced they're the future. I'm building a language specifically for them: https://GitHub.com/Cuzzo/clear - and to make it easier for YOU to work with them.
I think once we get around this language problem, that they need global context for things where they shouldn't, it will be a challenge to work with them.
I've had success with them, but it's been so frustrating, that I question how much it's been worth my sanity.
So it's not much of a surprise that this is the situation folks find themselves in with the current models.
They can keep internal consistency so the more you let it write the more it can write with internal consistency. It still fails at all of these levels as soon as you are looking at each level of detail.
"it takes too much effort to get the output production ready"
turning into
"maybe long term the maintenance will be more expensive"
I give it three months until people realize that you rarely need to review every single line and fully understand the code, like so many comments are claiming.
Maybe on projects with no users you can yolo things.
While the salary stays stagnant or even reduced if you adjust for inflation.
This blob of people criticizing AI is just that, a blob. A gaggle of discrete people that your brain makes up a narrative about being some goalpost shifting entity.
Of course there could be individuals who have moved the goalposts. Which would need a pointed critique to address, not an offhand “people are saying” remark.
It's completing shit. Even if it does not implement some lazy stuff with empty catch blocks (i.e. happy path from programming 101 tutorials), it will either expose your secrets in a sensible place or do some other stupidity.
Also, you've set up a huge strawman here. Who are these people saying these things in this order, and why is that the argument and not "You need to be reviewing every line of code that gets written and understand it"?
Your argument is nonsense.
The developers that think coding is hard are the ones that absolutely love AI coding. It's changed their world because things they used to find hard are now easy.
Those that think coding is easy don't have such an easy time because coding to them is all about the abstractions, the maintainability and extensibility. They want to lay sensible foundations to allow the software to scale. This is the hard part. When you discover the right abstractions everything becomes relatively easy. But getting there is the hard part. These people find AI coding a useful tool but not the crazy amazing magical tool the people who struggle with coding do.
The OP is definitely in the second camp since they could spot and realise the shortcomings of the AI. They spotted the problem, and that problem is that the AI can't do the hard bit.
I like coding, I just don't particularly enjoy figuring out the framework du jour. The task at hand is interesting, but the part where I need to figure out what are the incantations to have a Qt list with images in it is not. I need a working UI to get the thing done, but the framework stands in my way, requiring me to step away from my intended task and spend a few hours on understanding QTreeView.
That's where I really enjoy AI currently, because I can get the GUI stuff out of the way much faster and get back to the thing the GUI is for.
Now within the specific problem I'm trying to solve, sure, I enjoy thinking about the abstractions, maintainability and extensibility. That's the part that actually matters. But the Qt UI on top, that's just a visual layer with a structure that was already set in stone, there's no big decisions of interest to make there. Just to figure out how to make it do the thing.
PMs can now cross reference and organize tickets with just a few keystrokes. Organisational knowledge, business knowledge, design systems and patterns, etc all of it is encoded in LLM consumable artefacts. For PMs it is the same switch - instead of having to do it by hand you direct lower level employees to handle the details and inconsistencies and you just do vibe and vision.
When all of the pieces successfully connect and execute reliably, what is left for humans to do? Just direct and consume?
And AI companies with their huge swaths of data are soon gonna be in the situation of being able to do the directing themselves
The first group are still thinking fairly deeply about design and interfaces and data structures, and are doing fairly heavy review in those areas. The second group are not, and those are the ones that I find a bit more worrisome.
I can't speak for others, but I'd go further and say that LLMs allow me to go deeper on the design side. I can survey alternative data structures, brainstorm conversationally, play design golf, work out a consistent domain taxonomy and from there function, data structure and field names, draft and redraft code, and then rewrite or edit the code myself when the AI cost/benefit trade off breaks down.
It's the same thing here. AI has dropped the cost of software development, so developers are now fooling themselves into producing low or zero value software. Since the value of the software is zero or near zero, it doesn't really matter whether you get it right or not. This freedom from external constraints lets you crank up development velocity, which makes you feel super productive, while effectively accomplishing less than if you had to actually pay a meaningful cost to develop something.
Like, what is the purpose of Gas Town? It looks to me like the purpose of Gas Town is to build Gas Town.
I worry about the first group too, because interfaces and data structures are the map, not the territory. When you create a glossary, it is to compose a message that transmits a specific idea. I find invariably that people who focus on code that much often forget the main purpose of the program in favor of small features (the ticket). And that has accelerated with LLM tooling.
I believe most of us that are not so keen on AI tooling are always thinking about the program first, then the various parts, then the code. If you focus on a specific part, you make sure that you have well-defined contracts to the other parts that guarantee the correctness of the whole. If you need to change the contract, you change it with regard to the whole thing, not the specific part.
The issue with most LLM tools is that they’re linear. They can follow patterns well, and agents can have feedback loops that correct them. But contracts are multi-dimensional forces that shape a solution. That solution appears more like a collapsing wave function than a linear prediction.
I’m not making a judgement call about which is better, but it was widely accepted in tech before the advent of LLMs that you just fundamentally lack a sense of understanding as a reviewer vs an author. It was a meme that engineers would rather just rewrite a complicated feature than fix a bug, because understanding someone else’s code was too much effort.
I find it useful to not listen to people who just talk.
You need to be checking every thing it does.
This is what seems to be lost on so many. As someone with relatively little code experience, I find myself learning more than ever by checking the results and what went right/wrong. This is also why I don't see it getting better anytime soon. So many people ask me "how do you get your claude to have such good output?" and the answer is always "I paid attention and spotted problems and asked claude to fix them." And it's literally that simple but I can see their eyes already glazing over.
Just as google made finding information easier, it didn't fix the human element of deciphering quality information from poor information.
I follow the plan -> red/green/refactor approach and it is surprisingly good, and the plans it produces all look super well reasoned and grounded, because the agent will slurp all the docs and forums with discussions and the like.
Trouble is once it starts working there would inevitably be a point where the docs and the implementation actually differ - either some combination of tools that have not been used in that way, some outdated docs, or just plain old bugs.
But if the goals of the project/feature are stated clearly enough it is quite capable of iterating itself out of an architectural dead end, that is if it can run and test itself locally.
It goes as deep as inspecting the code of dependencies and libraries and suggesting upstream fixes etc. all things that I would personally do in a deep debugging session.
And I’m super happy with that approach as I’m more directing and supervising rather than doing the drudgery of it.
Trouble is a lot of my team mates _don't_ actually go this deep when addressing architectural problems, their usual modus operandi is “escalate to the architect”.
This will not end up good for them in the long run I feel, but not sure what they can do themselves - the window of being able to run and understand everything seems to be rapidly closing.
Maybe that’s not super bad - I don’t know exactly what the compiler is doing to translate things to machine code, and I definitely don’t get how the assembly itself is executed to produce the results I want at scale - that is a level of magic and wizardry I can only admire (look-ahead branching strategies and caching on modern CPUs are super impressive - like how is all of this even producing correct responses reliably at such a scale …)
Anyway - maybe all of this is ok - we will build new tools and frameworks to deal with all of this, human ingenuity and desire for improvement, measured in likes, references or money will still be there.
You can skip that and go directly to writing code. But that meant you replaced a few hours of planning with a few weeks of coding.
They seem to be different for LLMs, because would anyone be surprised if they handed summary feature descriptions to some random "developer" they'd only ever met online, and got back an absolute dung pile of a half-broken implementation?
For some reason, people seem to expect miracles from some machine that they would not expect of other humans, especially not ones with a proven penchant for rambling hallucinations every once in a while.
I'd like to know, ideally from people who've been there, why they think that is. Where does the trust come from?
They can assimilate 100s of thousands of tokens of context in a few seconds/minutes and do exceptional pattern matching beyond what any human can do, that's a main factor in why it looks like "miracles" to us. When a model actually solves a long-standing issue that was never addressed due to a lack of funding/time/knowledge, it does feel miraculous, and when you are exposed to this a couple of times it's easy to give them more trust, just like you would trust someone who provided you a helping hand a couple of times more than a total stranger.
It is surprising how bad it is at taking the lead given how effective it is with a much more limited prompt, particularly if you buy in to all the hype that it can take the place of human intelligence. It is capable of applying an incredible amount of knowledge while having virtually no real understanding of the problem.
LLMs also don't solve the much bigger problem of most software engineers having no ability to work with others to clarify requests or offer alternatives. So now bad and/or misunderstood requests can be implemented faster.
Or really the same reason people fall for get rich quick schemes.
I don't understand this. A large codebase should be a collection of small codebases, just like a large city is a collection of small cities. There is a map and you zoom into your local area and work within that scope. You don't need to know every detail of NYC to get a cup of coffee.
It's your responsibility to build a sane architecture that is maintainable. AI doesn't prevent you from doing that, and in fact it can help you do so if you hold the tool correctly.
To speak more directly - every codebase has local reasoning and global reasoning. When looking at a single piece of code that's well-isolated, you can fully understand its behavior "locally" without knowing anything about any other part of the code. But when a piece of code is tightly coupled to many other parts of the codebase, you have to reason globally - you have to understand the whole system to even understand what that one piece of code is doing, because it has tendrils touching the whole system. That's typically what we call spaghetti code.
If you leave an AI to its own devices, it will happily "punch holes", and create shortcuts, through your architecture to implement a specific feature, not caring about what that does to the comprehensibility of the system.
There is software that works like this (e.g. a website's unrelated pages and their logic), but in general composing simple functions can result in vastly non-proportional complexity. (The usual example is having a simple loop, and a simple conditional, where you can easily encode Goldbach or Collatz)
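For reference, the Collatz case really is just one loop and one conditional. A minimal Python sketch (the function name is mine, not from the comment):

```python
def collatz_steps(n: int) -> int:
    """Count the steps for n to reach 1 under the Collatz rule."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    steps = 0
    while n != 1:           # one simple loop
        if n % 2 == 0:      # one simple conditional
            n //= 2
        else:
            n = 3 * n + 1
        steps += 1
    return steps
```

`collatz_steps(6)` returns 8 (6, 3, 10, 5, 16, 8, 4, 2, 1). Whether this loop terminates for every positive `n` is the open Collatz conjecture, which is the point: two trivially simple pieces compose into something nobody can currently reason about in full.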
E.g. you write a runtime with a garbage collector and a JIT compiler. What is your map? You can't really zoom in on the district for the GC, because on every other street there you have a portal opening to another street in the JIT district, which has portals to the ISS where you don't even have gravity.
And if you think this might be a contrived example and not everyone is writing JIT-ted runtimes, something like a banking app with special logging requirements (cross cutting concerns) sits somewhere between these two extremes.
Oh, great analogy there.
Just like there's almost nothing in common between a large city and a collection of small cities, a large codebase is completely different from a collection of small codebases too.
Mostly because of the same kinds of effects.
Rather than arguing about the specifics, it's easier to point to numerous concrete examples, such as a fairly simple system - which should be easy to implement in 8-15k lines of code, depending on certain choices (I've been writing code long enough to estimate this relatively accurately) - being still-incomplete while approaching 150k lines. These kinds of atrocities are usually economically infeasible in hand-written code, for 2 reasons: 1) the cost to produce that much code is very high, and 2) the cost of maintaining that much code is insurmountable.
I guess you could say that AI is great at generating code that only AI can understand and maintain.
E.g., treating AI-generated code as immediately legacy, with tight encapsulation boundaries, well-defined interfaces etc. And integrating it in a more manual workflow.
There's a range from single-shot prompts to inline code generation, that will make more sense depending on the problem and where in the code base it is.
Single-shot stuff is going to make more sense for a prototyping phase with extensive spec iteration. Once that prototype is in place, you then prob want to drop down into per-module/per-file generation, and be more systematic -- always maintaining a reasonably good mental model at this layer.
I could see value in using it during the prototyping phase, but wouldn’t like to work like you described for a serious project for end users.
This is good advice regardless whether you're using AI or not, yet in real life "let's have well-defined boundaries and interfaces" always loses against "let's keep having meetings for years and then ducttape whatever works once the situation gets urgent".
This seems to me like it requires an impossible level of discipline, judgement and foresight
There comes a realization, to many engineers’ horror, that AI won’t be able to save them and they will have to manually comprehend and possibly write a ton of code by hand to fix major issues, all while upper management is breathing down their necks, furious as to why the product has become a piece of shit and customers are leaving to competitors.
The engineers who sink further into denial thrash around with AI, hoping they are a few prompts or orchestrations away from everything being fixed again.
But the solution doesn’t come. They realize there is nothing they can do. It’s over.
You can have very good diffs and then find that the whole codebase is a collection of slightly disjointed parts.
AI doesn't necessarily have to increase your throughput, it can also serve as a flexible exploration and refactoring tool that will support either later hand-crafted code or agentic implementation.
I still have a lot of uses for AI: exploration, double-checking me, teaching me. But having it write code became very tough for me to accept. Next-edit autocomplete, mainly.
I'm ready to give up on having it even review my code at this point. It's been so frustrating. It hallucinates bugs, especially in places where "best practices" are at odds with reality.
Recently it informed me of a bug where it suggested the line of code in question couldn't possibly do anything because on Linux the specific stdlib behaved in X ways, but it was obvious from the line of code that it was running on Windows which doesn't have this problem at all. Of course, it doesn't actually mention that this is an issue on Linux, just that there is a bug here. It vomits up a paragraph of $WORDS explaining why this was a high-priority bug that absolutely needed to be fixed because it was failing in subtle ways. Yet the line of code in question has been running in production, producing exactly the results it is expected to, for ~3 years.
And this is just one simple example, of the many dozens+ of times it has failed this task this year. In that same review run, the agent suggested 3 additional "bugs" or other issues that should be addressed that were all flatly wrong or subjective. I'm at a point of absolute exhaustion with this sort of shit. It's worse than a junior half of the time because of how strongly opinionated it is. And the solution to this sort of problem is an endless amount of configuration and customization that will be forgotten about by all of us over time, leading to who knows what sort of knock-on effects (especially as we migrate from one model to the next). We have a guy on our team who has ~17,000 words in his agent and instructions files, yet he sees nothing wrong with this. I guess he just really loves YAML and Markdown.
This metric highly depends on who uses the AI to do what, where strong emphasis is on "who" and "what".
In my line of work (software developer) the biggest time sinks are meetings where people need to align proposed solutions with the expectations of stakeholders. From that aspect AI won't help much, or at all, so measuring the difference of man hours spent from solution proposal to when it ends up in the test loops with and without AI would yield... very disappointing results.
But for troubleshooting and fixing bugs, or actually implementing solutions once they have been approved? For me, I'm at least 10x'ing myself compared to before I was using AI. Not only in pure time, but also in my ability to reason around observed behaviors and investigating what those observations mean when troubleshooting.
But I also work with people who simply cannot make the AI produce valuable (correct) results. I think if you know exactly what you want and how you want it, AI is a great help. You just tell it to do what you would have done anyway, and it does it quicker than you could. But if you don't know exactly what you want, AI will be outright harmful to your progress.
Which is still cool. I just wish people were more honest about this.
Another thing I don’t see mentioned is code quality.
Vibe-coded code bases are an excellent example of why LLMs aren’t very good at writing code. They will often correct their own mistakes only to make them again immediately after, and they use patterns inconsistently.
Recently Claude has been making some “interesting” code style choices, not in line with the code base it’s currently supposed to be working on.
It's got a fun Zelda-inspired mechanic (I won't say which one), and you'll have to unlock abilities and parts of the world over several quests and modes to "win".
It's also multiplayer.
In the past, I was trying to reproduce vibe coding results when I managed to get all the information from Youtube videos (model version, ide version, same input data and prompt) and never was I able to reproduce something impressive even after multiple runs of the same thing.
AI, and especially agentic AI, can make you lose situational awareness over a codebase, and when you're doing deep work that SUUUUCKS, but it's not useless, you just have to play to its strengths. Though my favorite hill to die on is telling people not to underestimate its value as autocomplete. Turns out 40 gigabytes of autocomplete makes for a fucking amazing autocomplete. Try it with llama.vim + qwen coder 30b, it feels like the editor is reading your mind sometimes and the latency is so low.
And I'm sure the rewrite is going to teach me a whole different set of lessons...
Not sure why good coverage wouldn't mitigate risk in a refactor...
My mantra whenever I'm working with AI is that I want it to know what "point b" looks like and be able to tell by itself whether it's gotten there...
If you have a working implementation, it sounds like you have a basis for automated tests to be written... once you have that (assuming that the tests are written to test the interface rather than the implementation), then it should be fairly direct to have an agent extract and decompose...
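A minimal sketch of what "test the interface rather than the implementation" could look like, in Python. Both functions are hypothetical stand-ins: one for the existing working code, one for whatever an agent produces after extracting and decomposing. The recorded cases are the "point b" the agent can check itself against.

```python
def legacy_normalize(tags):
    """Stand-in for the existing, working implementation."""
    result = []
    for tag in tags:
        cleaned = tag.strip().lower()
        if cleaned and cleaned not in result:
            result.append(cleaned)
    result.sort()
    return result


def refactored_normalize(tags):
    """Stand-in for what an agent might produce after decomposing."""
    return sorted({t.strip().lower() for t in tags if t.strip()})


# Inputs and expected outputs recorded from the working implementation.
CASES = [
    ([], []),
    (["  Go ", "go", "Rust"], ["go", "rust"]),
    (["b", "", "A", " "], ["a", "b"]),
]


def check(normalize):
    """Exercise only the public contract: input list in, output list out."""
    for given, expected in CASES:
        assert normalize(given) == expected, (given, normalize(given))
    return True
```

Because `check` never looks inside either function, the same cases pass before and after the refactor, whatever the internals become. It only covers the behavior you thought to record, though, which is where the trust problem in the replies below comes in.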
For example, consider a lint rule that bans Kysely queries on certain tables from existing outside of a specific folder. You'd write a rule like this in an effort to pull reads and writes on a certain domain into one place, hoping you can just hand the lint violations to your AI agent and it would split your queries into service calls as needed.
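To make the shape of such a rule concrete, here is a rough Python sketch of the check itself. It is an approximation under stated assumptions: a real rule would more likely be an ESLint rule walking the TypeScript AST, the table names and folder are invented, and the regex only catches Kysely's `selectFrom`, `insertInto`, `updateTable`, and `deleteFrom` when they are called with a string-literal table name.

```python
import re
from pathlib import PurePosixPath

# Hypothetical configuration: table names and folder are invented for illustration.
RESTRICTED_TABLES = {"invoices", "payments"}
ALLOWED_DIR = PurePosixPath("src/billing")

# Matches Kysely query-builder entry points called with a string-literal table name.
QUERY_CALL = re.compile(
    r"\.(selectFrom|insertInto|updateTable|deleteFrom)\(\s*['\"]([A-Za-z_]\w*)"
)


def find_violations(files: dict[str, str]) -> list[tuple[str, int, str]]:
    """Return (path, line_number, table) for restricted queries outside ALLOWED_DIR."""
    violations = []
    for path, source in sorted(files.items()):
        if ALLOWED_DIR in PurePosixPath(path).parents:
            continue  # queries inside the allowed folder are fine
        for lineno, line in enumerate(source.splitlines(), start=1):
            for match in QUERY_CALL.finditer(line):
                table = match.group(2)
                if table in RESTRICTED_TABLES:
                    violations.append((path, lineno, table))
    return violations
```

Finding the violations is the mechanical, checkable half. Rewriting each flagged query into an equivalent service call is the half where the discrepancies described next creep in.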
And at first, it will appear to have Just Worked™. You are feeling the AGI. Right up until you start to review the output carefully. Because there are now little discrepancies in the new queries written (like not distinguishing between calls to the primary vs. the replica, missing the point of a certain LIMIT or ORDER BY clause, failing to appropriately rewrite a condition or SELECT, etc.) You run a few more reviewer agent passes over it, but realize your efforts are entirely in vain... because even if the reviewer agent fixes 10 or 20 or 30 of the issues, you can still never fully trust the output.
As someone with experience in doing this kind of thing before AI, I went back to doing it the old way: using a codemod to rewrite the code automatically using a series of rules. AI can write the codemod, AI can help me evaluate the results, but actually having it apply all of the few hundred changes automatically led to a lack of my ability to trust the output. And I suspect that will continue to be true for some time.
This industry needs a "verification layer" that, as far as I know, it does not have yet. Some part of me hopes that someone will reply to this comment with a counterexample, because I could sorely use one.
+1 on Open 4.7 involving the user a lot more. Rn I'm trying to get to a state where I can codify my design + decision preferences as agent personas and push myself out of the dev loop.
> Go reads fine whether the architecture is good or bad
Were you reading the Golang code all along and got fooled or did you review it after it failed? Sorry I admit I didn't read the whole article.
Good architecture in any language is obvious to someone who is experienced and cares.
Go is actually great for bots to write if you’re actually thinking.
It sounds like the author knows Rust, and might not be as familiar with Go.
A language that you are proficient in is always going to be easier to read than one you aren’t, even if the other is an objectively easier language to read in general.
I’ve used AI tools to do i18n translations to Spanish and Portuguese (somewhat ashamed to admit this). I’ve grown more familiar with the structure of these languages, and come to recognize some of the common vocabulary for our agtech domain. If anything, I feel more clueless about both languages now than I did before, when it comes to any sort of proficiency.
For example, I had Claude generate a language server for TLA+ so I could have nice keystrokes in Neovim. For things like this, I really do think there is such a thing as "good enough"; a language server doesn't have to be perfect, and the stakes are pretty low, where I think the worst case scenario is that it screws up my code, but that should be relatively easy to catch in Git.
I have been trying to mostly have Claude generate code from specifications; either a Mermaid diagram for simpler stuff, or TLA+ for more complicated stuff. I usually supply a lot of surrounding context about how I want these specs to be implemented, and it will usually get me about 90% of the way there, but I've found that I still need to hack against it to get over the hump.
It makes me feel a little valuable; I finally have an excuse to use formal methods for things.
With a skilled operator, it could be possible to drive an agent to handle these kinds of changes. I would be concerned that spoken language wouldn't be precise enough to handle the refactoring and changes necessary to make to a code base when an invariant changes... regardless of whether it was a property, architectural, or procedural change. It already can take several prompts and burn quite a few tokens doing large-scale rewrites and code changes. Maybe the parameters and weights can be tuned for this kind of work but I remain skeptical that what we have at present is "efficient" at this kind of work.
So, about five years later is the right time for refactoring.
P.S. It takes about five years to forget what you thought you were doing with that code, and see the reality of what you wrote.
On my most active days, I integrate around a dozen fully reviewed and adjusted MRs into my codebase.
It's all very simple. "Use x library, data model should be xyz, do m, not n."
They're obviously not at the point of replacing an experienced programmer as far as knowing the start-to-finish way of accomplishing every detail, that's what the human is for.
Obviously technical and design choices have risks beyond just initial implementation, and those have to be considered too (do we trust the dependency, will it still be there in a year, can we get fixes merged upstream), but I think there's significant value in driving down the cost of code sketches involving unfamiliar libraries and tools.
I suppose it's difficult to account for the inconsistency of something able to perform up to standard (and fast!) at one time, but then lose the plot in subtle or not-so-subtle ways the next.
We're wired to see and treat this machine as a human and therefore are tempted to trust it as if it were a human who demonstrated proficiency. Then we're surprised when the machine fails to behave like one.
I have to say, I'm still flabbergasted by the willingness to check out completely and not even keep on top of, and a mental model of, what gets produced. But the mind is easily tempted into laziness, I presume, especially when the fun part of thinking gets outsourced, and only the less fun work of checking is left. At least that's what makes the difference for me between coding and reviewing. One is considerably more interesting than the other, much less similar than they should be, given that they both should require gaining a similar understanding of the code.
Was it strongdm talking about the dark factories? They were working on some integration software so needed to use google drive and slack and lots of other things. They fully reimplemented those to the level they needed for their tests - outside of the biggest firms this would probably have been an enormous time and money sink. Now it’s reasonable.
On a personal project, my wife and I wanted a tracker for holiday planning. Five minutes after giving it a barely thought-through request we had a working prototype, fixed bugs in seconds, and then talked through with a model what we needed and how it did or didn’t fit (and we needed that first version to figure that out). It helped drive out actual requirements from us, prioritise them, choose a stack, and add tickets, and then went ahead and implemented it pretty far. We have a mostly working v2 which has highlighted some details about what we really wanted. Total invested cost was one day of a $20 subscription and maybe half an hour of talking to a bot and checking results.
I struggle to remember even relatively simple maths like working out "what percentage of X is Y" so if I write a formula like that I'll put in some simple values like 12 and 6 or 10,000 and 2,456 just to confirm I haven't got the values backwards or something. I've been shown sheets where someone put a formula in that they don't understand, checked it with numbers they can't easily eyeball and just assumed it was right as it's roughly in their ball park / they had no idea what the end result should be.
Then again I've also seen sheets where a 10% discount column always had a larger number than the standard price so even obviously wrong things aren't always checked.
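A minimal sketch of that sanity check, written in Python rather than as a spreadsheet formula. The function names `percent_of` and `discounted_price` are hypothetical, and the inputs are just the easy-to-eyeball values mentioned above.

```python
def percent_of(whole: float, part: float) -> float:
    """Return what percentage of `whole` the value `part` is."""
    if whole == 0:
        raise ValueError("whole must be non-zero")
    return part / whole * 100


def discounted_price(standard_price: float, discount_percent: float) -> float:
    """Return the price after taking `discount_percent` off `standard_price`."""
    return standard_price * (1 - discount_percent / 100)


# Eyeball check with easy numbers first: 6 is half of 12.
assert percent_of(12, 6) == 50.0

# Only then apply the same formula to numbers that are hard to eyeball.
awkward = percent_of(10_000, 2_456)

# A discounted price should never be larger than the standard price.
assert discounted_price(200, 10) <= 200
```

If the arguments were swapped, the first assertion would fail immediately (200% instead of 50%), which is the whole point of checking with numbers you can verify in your head.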
I've reached solutions by trial and error too, and tried to rationalize them later, quite a few times. And it's easier to rationalize a working solution, however adversarial you claim to be in your rationalization.
I don't see using gen AI for the (not so) “brute force” exploration of the solution space as that different from trial and error and post fact rationalization.
In a mobile app, do you think it's more important to test that your drag gesture works as expected on the phone, or to understand every line of the implementation?
Here’s what’s working for me right now:
1. The basics: use best model available, have skills and rules that specify project guidelines, etc.
2. Always use plan mode. It works much better to iterate on the concept of what we’re going to do, then do the implementation. The models will adhere to the plan at very high rates in my experience.
3. Don’t give chunks of work that are too large in scope. This is just art, and I’m constantly experimenting with how ambitious I can be.
4. I review all code to some extent, but I have a strong mental model of what areas of the app are more critical, where hidden bugs might accumulate, etc, and I review both tests and impl more strenuously in those areas. Whereas like a widget for my admin panel probably gets a 2 second glance.
5. Have the discipline to go through periodically and clean up tech debt, refactor things that you’d do differently now, etc. I find the AI a huge help here, because I can clean up cruft in an hour that would have once taken me days, and thus probably wouldn’t have gotten done.
6. I’m experimenting with shifting my architecture to make it easier to review AI code, make it less likely it’ll make mistakes, etc. Honestly mostly things I should have always been doing, but the level of formalism and abstraction on my solo projects is usually different than on a bigger team.
To each their own, but I’ve grown this from nothing to about $350k in ARR over the last ten months, and I’m very confident I never could have built this product without AI help in triple that time.
I'm not sure that there's really a "bomb" hiding in here anywhere. The issue is that it IS "reasonable" now to expect big features to be done within a week.
There's plenty of times where I don't know what code to write because I've never used a library before. But it's just a page of documentation away. It's not hard, it's just slow and tedious.
Yeah, because they believe (sometimes wrongly) their subordinates read it.
> Understanding of code never existed from the business perspective.
It does, it's called organizational wisdom and domain knowledge, because you need those witty names to sell books to aspiring managers.
>The leader of a product/company does not have to read code.
That's because he's paid a bunch of people 300k to read it and make sure it aligns with the company's objectives and interests. Part of the reason why devs are paid so much is because they're literal business administrators for some narrow slice of the company's operations. The devs are the leaders that you're referring to.
Even in multi-hundred-billion-dollar companies there are so many mission critical things that are owned by just 2 SWEs.
The value of the junior dev is the hope that they'd be a senior dev someday, and I don't have a solution to that problem. But in my current capacity, where the only devs I'm "managing" these days are via open source contributions, they're already gone - 100% of what I get is through their own work with LLMs (which I have to spend more time correcting than if I used the LLM myself).
https://x.com/DannyLimanseta/status/2052040017007251946
Here are the direct links:
https://github.com/dannylimanseta/tinyskies
https://github.com/dannylimanseta/tinyskies/blob/cursor/glob...
https://github.com/dannylimanseta/tinyskies/blob/cursor/glob...
A really screwed code base blows out your context window and just starts burning tokens as the AI works out a way to kill -9 itself to escape the hell you're subjecting it to.
So you figure out that someone you paid is at fault, instead of someone they hired. Your contract is with them so what really changes? What process or anything else is really different between it being a company with a manager who asks a team of devs and a company which asks an AI agent, to you as a customer?
Maybe it changes who gets fired or sued or whether one insurance or another pays out- but broadly I think none of what I said about project work really changes.
Product owners, and hell, even customers, have long been able to get software built purely through natural language, without understanding all the details of it or, in the customers' case, ever seeing the code.
I'd think that depends on the model of responsibility at play.
For example, suppose I hire a building contractor to build a house, and the electrician he subcontracts makes a mistake.
From my perspective, the prime contractor is equally responsible for that mistake regardless of whether he used a subcontractor, or did the work himself but used a broken tool.
This doesn't make the electrician any less of a "person" in the deeply important ways, but it's not a distinction that's relevant to my handling of the problem.
jk :D your points make a lot of sense though!
A little update: upon viewing the page on phone, for me the "comitter" field in the demo is going out of bounds... Really not speaking for their product.
Right, so depending on an LLM makes perfect sense in that case, thanks for clarifying :)
After reading a bunch of other comments, it sounds like people are referring to letting agents go wild and code whatever off a limited prompt. I'm not using LLMs like that; I'm generally interacting only via conversations with pretty detailed initial prompts. My interactions with the chat after that are corrections/guiding prompts to keep it on point and edit the prompt output from time to time.
When on a tiny project it doesn't matter. However when you have millions of lines of code you have to trust that your code works in isolation without knowing the details.
> have millions of lines of code you have to trust that your code works in isolation without knowing the details.
More like hope. This is where good design and architecture helps, as well as strong invariants held up by the language. But given that most applications can't really escape global state (not even internally, let alone external state like the file system), you can never really know that your code will work the way you expect it to - that is, it's not trivially composable to any depth.
In their mind they’ve already done the “architectural heavy lifting” and accelerated the team. More often than not it just adds cognitive overhead where you spend more time deciphering and cleaning up garbage than actually building the thing properly from scratch.
This will not happen. Nobody desires to give that up. Also, AI does not deliver even remotely that much of a true value multiplier.
But we still hold good cards in hand.
Do they want their pile of steaming slop fixed, or not? Because no amount of complaints about the deadline being "yesterday" is going to change the fact that time will be needed to fix the accrued technical debt, whether they like it or not. And if AI dug you in that deep to start with, the solution is not to dig deeper.
I suspect some companies are going to find that out the hard (costly) way.
The developers in this scenario are there to absorb the blame when things go wrong. "Human crumple-zones" to protect the company.
As humans, we need another human to blame when things go wrong.
Especially in situations that are catastrophic when things go wrong.
Will be interesting to see if / when enforcement happens given management is currently being pushed to encourage AI use
I would say that counts as "not having that policy". Based on what management tells us, we are dead if we don't operate this way.
But that still doesn’t mean I review all the code. I tend to review defensively, based on the potential for harm if this piece of code is broken. And I rely a lot on tests, static analysis, canaries, analytics, health checks, etc. to reduce risk for when I’m wrong. So far it’s working.
That said most of the world's most useful code has strict quality requirements. Even before AI 90% of SLOC would be tossed away without much if any use, 9% was used infrequently while 1% runs half the world's software.
Reviewers aren't perfect, far from it. And we just gave them ~20x more code to review. Incentives mean that taking 20x longer to review is unacceptable. So where do we go from here?
> To get beyond being a hack, you need to understand the entire stack, including the code that you didn't write, including both libraries, frameworks and the OS, and including the hardware, the networking layers, and so forth.
I think maybe you overestimate your own knowledge here. It's one thing to understand general principles and design or to understand a contextually-relevant vertical or whatever. It's another to demand comprehensive (even if not expert) familiarity in non-trivial projects, especially those created by many developers over long time spans. It's not just a question of intelligence or dedication or even just time spent working on a project.
The amount of software even your typical piece of code relies on is staggering and shifting, and it's only getting more complicated. A good chunk of software engineering and programming language research has been focused on making it practical to operate in such a complex environment - an environment that nobody fully understands - which is a major part of why modularity exists. Making software like "plumbing together [...] black boxes" is exactly what such research aspired to accomplish, because it allows different developers to focus on different scopes and focus on the domain they're working on. Software engineering is a practical field, and any system that requires full knowledge to operate, modify, and extend is either relatively small (maybe greenfield and written by a sole developer) or impractical to work with.
So I would say there's a wide gap between "lazy guy who doesn't give a shit" and "guy who thinks he can understand everything". Both lack the humility and wisdom needed to know the limits of their knowledge, to circumscribe what he needs to understand, and to operate within the space these afford. (Both extremes remind me of cocky junior devs. On the one hand, you have the junior dev who carelessly churns out "hot shit" garbage code by plumbing things together with no grasp or appreciation of sound design; on the other you have the dev who makes a big show about "rigor" completely detached from the actual realities and needs of the project. In each case, the dev is failing to engage intelligently with the subject matter.)
I'm far from the best at anything and make no claims toward knowing everything, but I do think I have reasonable breadth in my experience and work, and I don't think I could have built something like this otherwise.
[1] ... which is something that does not decompose neatly into black boxes and must to a large degree be built from first principles, as goddamn nothing off the shelf scales well enough to deal with multi-terabyte workloads at even a fraction of the speed a bespoke solution can.
You call it a hackathon. You tell the human to stay up the whole night. In exchange for the extra hours worked you provide some pizza.
I care more about code quality now, because typing no longer limits whether I feel it's worth refactoring something or not.
After the initial coding is complete, will you need to use AI to fix bugs? Presumably that is both slower and more expensive than doing it by hand when you know exactly where to look?
I can only presume you work with talented people somewhere that is not representative of most companies. You're definitely overestimating the average programmer's abilities.
As soon as requirements change the abstractions fall apart and everything gets shoehorned.
Humans can be held accountable for their own slop
> The kind of mess me and everyone I've worked with produces by hand is the inverse of that
Yes, it's frustrating to work with isn't it? So why are you so excited to make higher volumes of this low quality slop using AI?
> Humans can be held accountable for their own slop
If you're a human and AI writes code for you you are ultimately responsible.
It is surprising to see how many are still in utter denial around here, though. Maybe we should all go back to punch cards.
It’s a valid direction to look in, it just doesn’t address the root issue of throwing slop across the wall and also having unrealistic expectations due to not knowing any better.
If there’s one thing that’s disturbing about AI proponents, it’s how trusting they are of code. One change in the business domain and most of the code may turn from useful to actively harmful, and then you have to rewrite it. Good luck doing that well if you’re not really familiar with the code.
Understanding your domain, setting clear expectations and understanding limitations and how much ambiguity your people/robots can handle are all good management techniques, they translate.
But the nature of working with an always-on flattery machine vs humans that can exceed your expectations while also being sources of infinite drama and frustration is still fundamentally different. The blind spot is being susceptible to the flattery machine and forgetting how much you relied on good people challenging you. The benefit is, of course, not having to deal with humans.
That's more or less the opposite of what I'd want.
If I instruct the AI to make small modules where I can verify they work, have tests and no side effects - then it is good enough code for me. It works, is readable and can be extended - and will turn into bad code if this is not done with care.
The main thing that helps me in my workflow is to develop documentation around the code. If the code drifts from the docs, the model will notice and you can decide which was correct: the plan, the maintainer manual, the code, or the comments in the code. Notice that there are 3 separate things written about the code, plus the code itself. Keeping all of that correct, coherent, and consistent (with a separate, invariant document that describes your design principles) keeps the model from going off the rails and gives ample opportunity to sense bad smells before they get set in stone.
It’s a token fire and you need a minimum 250k context model… but I still get as much work done in an hour as I used to do in a day, and the code I coauthor is better documented, more maintainable, and more tested than any code I have ever written before.
Some of these checks have caught thousands of the same error, even with the latest Opus 4.7 writing the original code.
You can’t play whack-a-mole with GenAI. You have to start from well-known principles and watch everything it produces. Every module or bounded context has to have its own invariants.
You can’t fully automate software engineering with GenAI. It seems the vast majority of GenAI users think they can and end up in the same place as the OP.
Maybe learn Domain-Driven Design, Event Sourcing, and then try again. The results will be dramatically improved.
The harder the code is to understand, the badder it is (and the more likely it is infested with bugs).
Code that will not be able to evolve for more than one-two years is terrible code. Agents write terrible code while doing a truly impressive job hiding it (including in the tests they write) unless, of course, you keep them under very close supervision.
"Do x" - for baseline, assume this generally does X fine.
"Do X, don't use javascript". - even if X already didn't use javascript, this will often perform worse. It will perform even _more_ worse if X is difficult or unusual to do without javascript even if there is some perfectly serviceable way to do it.
Also, despite "don't use javascript", sometimes it just still uses a little bit of javascript anyway, and usually in a spot where it would actually be extremely annoying/inconvenient to then remove that js yourself (when you would've otherwise reconsidered your approach at a higher level, to either use js, or to just want something different that is easier to do without js).
I do observe the same thing. There are a limited number of constraints you can add and once you exceed that, you'll play whack-a-mole if you insist on all of them.
This is why I tend toward a more wu-wei attitude to constraints.
For example:
- Do I really need this constraint?
- How does the agent tend to behave in this scenario if it is unconstrained? Is this behavior/result an acceptable pattern for this solution?
- Is the constraint implicitly followed often enough that I can enforce it with a deterministic test and spend tokens recovering from the occasional failure, rather than preemptively stating it in the prompt?
If I get into the situation where I need more constraints than can fit in context/attention without the need to regularly play whack-a-mole, then I break the module down into sub-modules with fewer, more specific constraints.
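As a rough illustration of the "deterministic test that enforces the constraint" idea, here is a sketch for the "don't use javascript" example from earlier in the thread. It assumes the agent's output is an HTML string; `ScriptDetector` and `find_javascript` are hypothetical names, and the `on*` attribute check is a crude heuristic, not a complete JavaScript detector.

```python
from html.parser import HTMLParser


class ScriptDetector(HTMLParser):
    """Collects evidence of JavaScript in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.violations: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.violations.append("<script> tag")
        for name, value in attrs:
            # Crude heuristic: inline handlers such as onclick= count as JavaScript.
            if name.startswith("on"):
                self.violations.append(f"{name} attribute on <{tag}>")
            # javascript: URLs in links count too.
            if name == "href" and (value or "").strip().lower().startswith("javascript:"):
                self.violations.append(f"javascript: URL on <{tag}>")


def find_javascript(html: str) -> list[str]:
    """Return a list of violations; an empty list means the constraint holds."""
    detector = ScriptDetector()
    detector.feed(html)
    detector.close()
    return detector.violations


clean = "<button type='submit'>Save</button>"
dirty = "<button onclick='save()'>Save</button><script>save()</script>"

assert find_javascript(clean) == []
assert find_javascript(dirty) == ["onclick attribute on <button>", "<script> tag"]
```

Run as part of the test suite, a check like this lets the constraint stay out of the prompt: the agent only hears about it when it actually violates it.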
Same way teaching your child to do something is much harder than to just do it yourself.
Except the child learns.
Countless individuals/companies unfortunately do believe it is a replacement still and are unlikely to change their minds.
Tired point, but it really is true. People will throw money at you if you do.
I guess "build something you want", the Temu-bought knockoff of the previous advice. It's not quite as bad advice as it sounds, as it's at least some validation of an idea, and much easier than playing 17D chess trying to predict the zeitgeist.
The luck surface area[1] also isn't talked about quite as much as it should be, but it's a good mental model that I can get behind if you're seeking serendipitous life-changing outcomes.
[1] https://www.codusoperandi.com/posts/increasing-your-luck-sur...
Before: One person writes the code (and likely understands it thoroughly), another person reviews the code to spot obvious mistakes or shortcomings. Now: AI writes the code, a person reviews it to spot obvious mistakes or shortcomings.
In the before case, you have a person who has a deeper understanding of the code and in the AI case, you don’t, instead you have even more code to review.
When a competent programmer is writing the code, the human written code tends to be higher quality too. So it’s not just about review quantity but the quality of code being reviewed. Some people claim the AI writes great code, but that just hasn’t been my experience yet (at least with the models I’ve tried, including Opus). They still make ridiculously bad decisions regularly.
This is a great idea, but on average is deeply untrue. Far and away most programmers today write significantly worse code than LLMs. Also LLMs are fantastic at generating high level summaries and comments in code
Your experience with LLMs does not match my own. Not to say that I haven’t experienced terrible human written code where I’ve wondered what the author could possibly have been thinking, but overall, I still find LLM written code to be on the poor side.
Like, the code itself is ok, but the wider picture reasoning and abstractions are bad. It also makes really dumb decisions far too often. Or doggedly shoehorns its first idea in no matter how badly it fits.
I find the more good practices I add to my envision/scope/spec/build/test/deploy loops the happier I am with the outcomes.
I will say that I am finding the actual code to be somewhat ephemeral for me - the more precise the specifications are and generally the tighter and more elegant the design is, the less the code matters as a long term artifact.
I'm not at the "code is assembler" point yet - but I could see that with more, richer specs I could end up there. Of course the specs are then substantial, but declarative specs can be robust and unambiguous (with sufficient red teaming review) and - like domain specific languages - reduce the accidental complexity of the syntax when compared to an implementation in a given language.
There are exceptions to all of this, but it's fascinating to see how it's evolving!
I have ADHD and for whatever reason telling the LLM what to do instead of doing it myself bypasses the task avoidance patterns and/or focus problems I tend to suffer from. I do not find it fun, but I am thankful for it.
Rarely do I use what the tool actually spits out. I just use it as a sounding board, like I’m chatting with a (very noob) writer. It doesn’t make me much faster but it helps me break through when I just can’t get words down.
I have gained a paranoid suspicion that our capacity to decrease immediate distress with technology has become so great that we are creating a world where people with certain temperaments can have their personalities become more and more extreme through the assistance of technologies which, for example, decrease the amount of interpersonal interaction required or prevent the need for deep focus.
But the difference with LLMs currently - I guess? - is that non-engineers are pushing the idea that it’s universally indispensable at scale. I think it leads to a lot of emotion bleeding into the debate.
For me personally, it's a tradeoff of generating the first pass code 10x more quickly, but then deeply knowing and validating the code is then 10-20x more work than it would have been if I'd written it myself (and if time is of the essence, then there's the option of shallow validation/understanding in exchange for speed - which is a compromise in rigor and path towards tech debt). In the end, none of this seems like a net win (unless you don't care about quality), and it is much less enjoyable.
TL;DR: While LLMs are faster to spit out first pass code, by the time I've validated and fixed the LLM's first-pass work, I could've had my "by-hand" implementation done correctly, and had much deeper understanding out of the box. Net loss.
It depends on language and infra, but some/many require lots of boilerplate and memorizing thousands of APIs; automating this is an easy 10x gain from LLMs.
I for example write SQL myself, because boilerplate is super-minimal, and core SQL is very minimal itself, there are like 20 constructs to memorize.
It is significantly easier to micro-manage an AI than a suite of junior developers. The AI doesn't replace a principal engineer, it's replacing junior and weaker senior developers who need stories broken down extremely concisely to be able to get anything done. The time it takes to break down a story such that a junior through weak senior developers can pick it up and execute it well would have the AI already done with testing built around it.
Micromanaging LLMs is like having Dory from Finding Nemo as your colleague. You find ways to communicate, but there is no learning going on.
Of course if you don't provide that feedback loop, no learning happens. I guess the same could be said of a junior, though.
I can also switch between codebases with different frameworks and languages and make changes without spending all day reading docs.
It's also pretty good at tracing code, and it's fairly straightforward to verify the results manually. It can build a flow diagram in 10-30 minutes (depending on what tool calls need to be allowed and how many prompts it needs) versus me taking a couple hours to do the same.
Every project should have a custom linter for their tech stack. It would check for not just syntax errors, but architectural choices as well as taste guidelines.
Whenever the LLM writes bad code, I add it to my linter to check against in the future.
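A small sketch of what such a project-specific linter could look like, using Python's standard `ast` module. The rules here are made up for illustration: `FORBIDDEN_IMPORTS` stands in for an architectural rule and the bare `except:` check for a taste rule. `lint_source` is a hypothetical name, not part of any existing tool.

```python
import ast

# Hypothetical architectural rule: packages this layer must not import.
FORBIDDEN_IMPORTS = {"requests", "sqlalchemy"}


def lint_source(source: str, filename: str = "<string>") -> list[str]:
    """Return human-readable rule violations for one module's source code."""
    problems = []
    tree = ast.parse(source, filename=filename)
    for node in ast.walk(tree):
        # Collect imported module names from both import forms.
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            names = []
        for name in names:
            if name.split(".")[0] in FORBIDDEN_IMPORTS:
                problems.append(f"{filename}:{node.lineno}: forbidden import '{name}'")
        # Hypothetical taste rule: a bare `except:` hides errors.
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            problems.append(f"{filename}:{node.lineno}: bare except")
    return problems


sample = "import requests\ntry:\n    pass\nexcept:\n    pass\n"
for problem in lint_source(sample, "domain/orders.py"):
    print(problem)
```

Each time the model produces a pattern you dislike, the idea is to add another check of this shape, so the rule is enforced by the tool rather than by remembering to mention it in the prompt.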
With LLM driven code you can generate code once, and then if anything is shitty about it you can always manually update it yourself without the need of an LLM. It's a dependency of convenience, not an app-dependency.
https://jsonforms.io/img/architecture.svg
You can add your custom renderer but you still need their library for bindings and such.
The recommended tool can't even produce mobile-friendly output, so why would I ever use it?
The other is using Agents as critical reviewers. I've let Opus 4.7 review PRs by very senior people. Most of the suggestions are meh, but usually there's at least 1 or 2 that improve the code base unequivocally.
Some of the worst programmers I have ever worked with had 30+ years of experience. They basically spent all of their time fixing bug after bug in a never ending cycle because the software they produced was so fragile that it would crash if you just looked at it wrong or the temperature in the room wasn't perfect.
While others with the same number of years of experience had massive systems in production for years with not a single bug reported by the happy users.
I mean for real, is the idea here, that all programmers are or were some kind of semi gods?
Because this is not what I remember from the pre LLM time, rather this:
I know I got into such development hell myself. Fixing a bug here results in breaking something there. Experience surely helps in avoiding it... but even senior devs can make a mess. Otherwise there wouldn't be so many projects canceled.
So sure, agents can multiply a mess in an amazingly short time, but... that is up to the humans guiding them.
And it's really someone's fault for hiring a bad junior. Someone did interview them, right? Maybe the person that hired them is the problem. And maybe the person that decided to go all-in on AI is also the same problem.
(Also I stick to the original definition of "vibe coding = not looking at generated code", "LLM assisted coding = verify generated code", I do both, depending on the task)
You don't actually think I look at the code, do you?
Agents can help a lot when you carefully review everything they output and find all the time bombs they like hiding in your code and your tests. If not, then they're fine for codebases that don't need to last more than a year or two.
Not at this time. Even if you could somehow get their success rate to 90%, it's still far too low because the mistakes can be (and are occasionally) catastrophic. It's only when you review everything that you find mistakes that will bite you down the line. If you don't review everything, you just don't know, but the rate of bad mistakes introduced by the agents is too high to trust, no matter how much prompting and orchestration you do. Maybe future models will address that, but we're not there yet.
> The main thing that helps me in my workflow is to develop documentation around the code. If the code drifts from the docs, the model will notice and you can decide which was correct, the plan, the maintainer manual, or the code, or the comments in the code.
That's helpful but it doesn't solve the problem, which is that the agents are happy to introduce horrendous workarounds, and they don't tell you that the code they've written is a horrendous workaround. The docs are fine and reflect the code and the code reflects the strategy, but you just don't know that the strategy is wrong.
My workflow also requires a discussion of the architecture and methodology of each addition or change, but honestly because we define the interfaces first, and each concern is given its own .c and .h file, it’s very hard to sneak something in without me noticing and calling it out. (Which does happen occasionally)
I suspect that file level granularity may be one of the keys. It never is actually working on more than a couple hundred lines of code at a time, plus interfaces of related files. I end up with a hundred files where I might have had 30 coding by hand, but it is actually easier to reason about the code for me as well, and the number of files is not an issue because of the automation. Total LOC is about the same as I would produce by hand for the same work, which means it’s actually writing less, due to the interface overhead, so I’m pretty stoked about that. The only real nightmare for humans is the long includes.
OTOH if I don’t do all of this it will definitely go off the rails and produce garbage.
I’ve been writing C (and C++) for almost 40 years, and although that doesn’t mean I’m any good, it does mean I have developed a keen sense of smell and highly sensitive olfactory PTSD.
With the right structured environment, a SOTA model with a suspicious seasoned dev holding its hand can be easier to manage and much more productive than a small team. Or, maybe I’ve just sucked so bad my whole life that I can’t tell the difference, but at any rate it works well enough to ship without nightmares, and with fewer bugs and less patching than I had before.
Edit:
I should mention that if bugs get tricky, like hardware idiosyncrasies and things like that, the model just goes nuts. If I handle it very, very carefully so that it does not try to understand the problem, and I just have it poke the firmware with a stick from a distance enough times and from enough angles, it will usually nail it, as long as I have successfully prevented it from trying to figure out the problem (which is not as easy as it seems like it would be). If it starts to guess, it’s usually best just to roll back the context and start over with the poking. (I have a harness so it does direct hardware probes.)
There seems to be an analog of this for non-hardware-related issues, but it’s harder to suss out when you should be telling it that you specifically do not want it to attempt to understand or solve the problem until you’ve rigged and tested all of the debug messaging.
I can feel it sometimes, as my brain shuts down and I gamble instead of thinking. It's a reversion to what I call "monkey mind" where you just keep pressing buttons to "make it work". I took a decade training my mind away from this, and too much AI is bringing it back.
I truly believe that people claiming huge productivity gains from AI are either terribly slow programmers or are skipping their due diligence. Many "vibe coders" are incapable of checking the output of the code.
Compilers are mechanical and engineered to produce a correct output. A compiler emitting incorrect machine code is exceedingly rare, and considered a bug. They have heuristics and probabilities in them, but those are to pick between a set of known-good outputs.
An AI is a bag of weights outputting a probability distribution over the most plausible next token [1]. It is inherently probabilistic in nature and its output is organic (they’re designed to mimic human speech), as opposed to mechanical like a compiler.
A compiler follows hard rules. An AI does its best.
And to be fair, AIs are no better than humans in this regard: humans are pretty bad at generating correct code without mechanical tools to keep them in line (compilers, linters, formatters). It’s no wonder we use the same tools to keep LLM output in line as we do humans. (And, to be fair, LLMs are better than humans at one-shotting valid code.)
[1]: to those that tell me this vision of an LLM is outdated: nope. The heavy lifting is done in the probability generation. Debates about understanding are not relevant here, and the net output of an LLM is a probability vector over raw tokens. This basic description can be contrasted to a compiler whose output is a glorified Jinja template.
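To make the footnote concrete, here is a minimal Python sketch of the last step of generation: raw scores become a probability vector, and a token is drawn from it. The vocabulary, the scores, and the function names are all made up for illustration; real models work over far larger vocabularies and use their own sampling code.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability vector that sums to 1."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(tokens, logits):
    """Deterministic choice: always return the highest-scoring token."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return tokens[best]


def sample_pick(tokens, logits, temperature, rng):
    """Probabilistic choice: draw one token according to the softmax weights."""
    probs = softmax(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Hypothetical three-token vocabulary and scores, invented for this example.
TOKENS = ["return", "raise", "pass"]
LOGITS = [2.0, 1.5, 0.1]

if __name__ == "__main__":
    rng = random.Random(0)  # seeded only so the demo can be repeated
    print("greedy: ", greedy_pick(TOKENS, LOGITS))
    print("sampled:", [sample_pick(TOKENS, LOGITS, 1.0, rng) for _ in range(5)])
```

The greedy path gives the same token every time, while the sampled path can return any token with nonzero probability, which is the contrast with a compiler being drawn here.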
They constantly say they did a thing they didn’t, say they know how to solve something when they don’t, etc. Regardless of guard rails or tests - AI forces a constant vigilance of a new kind.
Not just “what might have gone wrong” but also “what do I think is working but isn’t actually”.
And we’re not even talking about how it chooses substandard solutions, is happy to muddy code/architectures, add spaghetti on top of spaghetti etc.
Agentic coding often feels like an army of inexperienced developers who are also incredibly eager to please.
Human languages are mostly very bad at this, and in particular bad at mapping low-level abstraction to the human written word unambiguously in a way that is as expressive as programming languages.
Inference closes that specific gap significantly (which is why anyone at all sees LLMs as a useful option to explore), but it will never be as good as a purpose-built language designed to map to a reasonable corresponding assembly language implementation.
Yes, because wrong assembly blows up really loudly, with everything from wrong behavior to invalid instruction errors. Moreover, compilers have been battle-tested over the years, with extremely detailed test suites and extreme testing (every day, hundreds of thousands of users test and verify them).
Also, as people said, assembly generation is deterministic. For a given source file and set of flags, you get the same thing out. Byte by byte, bit by bit. This is what we call "reproducible builds".
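As a rough sketch of what "same thing out, byte by byte" means in practice, the check is just hashing the artifact from repeated builds and comparing digests. The `toy_build` and `stamped_build` functions below are hypothetical stand-ins, not real compilers; with a real toolchain you would hash the actual output files, and things like embedded timestamps or build paths can break the property unless the build is set up to avoid them.

```python
import hashlib
import itertools


def digest(artifact: bytes) -> str:
    """SHA-256 of a build artifact, as a hex string."""
    return hashlib.sha256(artifact).hexdigest()


def toy_build(source: str, flags: tuple) -> bytes:
    """Hypothetical compiler stand-in: a pure function of source and flags."""
    header = ("flags=" + ",".join(sorted(flags)) + "\n").encode("utf-8")
    return header + source.encode("utf-8")


_build_counter = itertools.count()


def stamped_build(source: str, flags: tuple) -> bytes:
    """Same toy build, but it embeds a changing value (think: a timestamp)."""
    stamp = str(next(_build_counter)).encode("utf-8")
    return toy_build(source, flags) + b"\nbuild-id=" + stamp


def is_reproducible(build, source: str, flags: tuple, runs: int = 3) -> bool:
    """Build several times and check that every digest is identical."""
    digests = {digest(build(source, flags)) for _ in range(runs)}
    return len(digests) == 1


if __name__ == "__main__":
    src = "int main(void) { return 0; }"
    print(is_reproducible(toy_build, src, ("-O2",)))      # pure function of its inputs
    print(is_reproducible(stamped_build, src, ("-O2",)))  # embeds a changing stamp
```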
AI is not like that. It's randomized on purpose, and it pulls from a training set which contains imperfect, non-ideal code. "Yeah, it works, whatever" doesn't cut it when you pull a whole function out of its connections, formed by the training data. It can and will make errors, because it's randomized from a non-ideal pool.
Next, sometimes you need tight code. Fitting into caches, running at absolute performance limit of the processor or system you have. AI is not a good fit here. Sometimes you go so far that you optimize for the architecture at hand, and it works slower on newer systems, so you need to re-optimize that thing.
For anyone who reads and murmurs "but AI can optimize", yes, by calling specific optimization routines written by real talented people for some cases; by removing their name, licenses, and context around them. This is called plagiarism in its mildest form and will get you in hot water in academia, for example. Writing closed source software doesn't make you immune from cheating and doing unethical things.
Lastly, this still rings in my ears, and I understood it over and over as I worked with more high performance, correctness critical code:
I was taking an exam, there's this tracing question. I raise my head and ask my professor: "Why do I need to trace this? Compiler is made to do this for me". The answer was simple yet deep: "If you can't trace that code, the compiler can't trace it either".
As I said, I just said "huh" at the time, but the saying came back and when I understood it fully, it was like being shocked by a Tesla coil.
Get your sleep, eat your veggies and understand your code. That's the four essential things you need to do.
For anything other than greenfield work (new code projects without dependencies, conventions, or connections to other proprietary code), it has to be reviewed. Even in that case, it's not good to skip code review.
I am so tired of hearing about this false equivalency. Compilers are deterministic, their outputs are well understood and they’re transparent.
LLMs are not.
When people compare LLMs to juniors it's "can I have it do something pretty brain numbing, and when it makes mistakes can I invest time into preventing that from happening again, either systemically or via training?"
IME this is true for LLMs, at least in how my team has been utilizing them. This doesn't make juniors worthless, as they can be useful for things that LLMs aren't good at.
I guess the problem is the blind assumption of competence?
I just think of AI as being a lot like my late friend Henry. Henry had several PhDs, was an accomplished polymath in a bunch of other subjects, and spoke more than 20 languages with reasonable fluency. He was for sure one of the smartest people I ever met.
He was also prone to drinking, and when he was on a tear, you could barely tell except he would confidently say some of the most outrageous shit, or start speaking some other language without noticing. So you always took Henry with a grain of salt, and if it was important you’d double check. Even so, he was still an amazing resource to bounce things off of.
Think about it from an information theory standpoint:
A compiler takes at least the exact amount of information it needs to produce a result, and produces exactly that result every time (unless it's bad at its job or has a bug).
An LLM always takes far less information than would actually be needed to fully describe the desired output, and extrapolates from that. It fetches contexts and such to give itself a glut of assumedly relevant information, but the prompt always contains less information than necessary to produce the code it generates. If it did fully contain enough information, then you've just written a far more verbose version of the program in human language.
No one said that
If you meant they’re now better at mimicking compilers, sure, but they’re only mimics.
We do not have "perfect" probabilistic transformation, and we probably never will (in part because it's hard to know what exactly that even means), but the gap between the two is shrinking every day.
Ergo:
> they're becoming (effectively) more and more similar every day.