Where are the normal people :/
I'm a Solidworks user. Most Solidworks or other pro CAD users would consider OpenSCAD kind of like MS Paint. Yes, you can draw the Mona Lisa in it, but it doesn't really work the same way.
Even so, the examples shown here are better than what I've seen before. They seem to be on the right track, using images instead of long paragraphs of text to describe the object. They are still missing the constraints and dimensions that come naturally to pro CAD users (it can be done manually in OpenSCAD, of course), but if you're just making a video game it's probably going to be fine for that.
I'd say it's 50/50 pessimistic and optimistic, with pessimistic attracting more attention because of human nature.
Not using OpenSCAD?
Claude 4.6 in Claude Code, before the lobotomy, was able to take a PSU spec sheet and my requirements for glands and ports and use the YAPP and OpenSCAD MCPs to build, iteratively and unassisted, end to end, a printable enclosure perfectly suited to the PSU: the right dimensions, screw holes, mountings, grills, gland ports, everything, placed for optimal printing. This was the moment I felt like LLMs had really arrived.
A photo of a building? Why? That’s a mesh problem and is about fidelity. A technical spec sheet and diagrams to functional print with intelligent choices about the functional part baked in? That’s useful.
Gave it a short prompt and it gave me an OpenSCAD model with everything parametrized. I printed it with no changes in TPU and it was nearly perfect on the first try. Claude put in a 0.3 mm subtraction in the x/y dimensions; I lowered it to 0.1 mm and it's perfect.
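A clearance tweak like that is cheap precisely because the model is parametrized. Here is a minimal Python sketch of the idea, not the OpenSCAD that was actually generated: the `FitParams` name, the 40 x 20 mm nominal size, and the choice to subtract the clearance once per axis are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FitParams:
    """Hypothetical parameters for a part that must fit a known footprint."""
    nominal_x_mm: float
    nominal_y_mm: float
    clearance_mm: float = 0.3  # value subtracted from each of x and y


def printed_footprint(p: FitParams) -> tuple[float, float]:
    """Return the x/y size to print after subtracting the clearance.

    Assumption: the clearance is subtracted once per axis, not once per side.
    """
    return (p.nominal_x_mm - p.clearance_mm, p.nominal_y_mm - p.clearance_mm)


# Made-up nominal size; only the clearance values come from the comment.
first_try = FitParams(nominal_x_mm=40.0, nominal_y_mm=20.0)
tuned = replace(first_try, clearance_mm=0.1)  # one-parameter change, then reprint
```

With everything derived from a few named parameters, going from 0.3 to 0.1 is a single edit rather than a redraw.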
Much easier shape than ancient Roman architecture but still very cool how easy it was.
I've had similar experiences with making simple functional parts off a 3D printer with OpenSCAD + LLMs. I'm very aware that the models are worse at it than, say, generating React code, and I'm also the antithesis of a skilled pilot. It's still cool and has resulted in me starting to learn a new skill at a hobby level.
That is seriously really impressive. I looked at the 3D model and didn't even think to LOOK INSIDE the building before reading this.
Here's [1] the 3D model with `show_cutaway` enabled.
[1] https://modelrift.com/models/pantheon-benchmark-antigravity-...
I would be more interested in benchmarking the modeling of an anonymous structure based on provided references alone. It kind of feels like the shallow magic of watching an LLM one-shot a to-do app..
My Antigravity (forced) replacement for Gemini CLI requires me to log on via browser every time I use it, and my Antigravity IDE won't update at all, so:
If it's ok I'd prefer they just work on reaching a baseline acceptable rollout before worrying about being Top in anything.
PS: actual title:
OpenSCAD LLM Benchmark: Building the Pantheon
- Models are very jagged (might excel in one type of 3d model, but not another)
- Gemini models are the least jagged in my experience and have the best image understanding
- Gemini models are also the most creative (which may be undesirable if you want a precise CAD part)
- Overall, this benchmark doesn't prove much because one 3D model (and one attempt) is just not enough. I usually test on at least a dozen models, each generated 3 times, and should really do much more, but it's too pricey for a solo dev.
Still, thanks for publishing this. Will definitely run Flash 3.5 soon to see how it performs.
Just totally subjective grading criteria of a single poorly defined example with no end use case in mind to guide how to even do evaluation.
SCAD needs unit tests. It would be powerful to assert that a profile doesn't have a slope greater than 45°, that the intersection of two objects is null, or that a part has a specific volume.
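To make that concrete, here is a rough sketch of what two such assertions could look like. It is plain Python over exported coordinates, not an OpenSCAD feature: `max_overhang_deg` and `boxes_overlap` are hypothetical helpers, the profile is assumed to be a list of (x, z) points, the slope is measured from vertical, and the objects are approximated by axis-aligned bounding boxes.

```python
import math


def max_overhang_deg(profile):
    """Largest angle from vertical (degrees) over the segments of an (x, z) profile.

    0 means a vertical wall and 90 a horizontal span. This simplification does
    not distinguish a floor resting on the bed from a true overhang.
    """
    worst = 0.0
    for (x0, z0), (x1, z1) in zip(profile, profile[1:]):
        dx, dz = abs(x1 - x0), abs(z1 - z0)
        if dx == 0 and dz == 0:
            continue  # skip duplicate points
        worst = max(worst, math.degrees(math.atan2(dx, dz)))
    return worst


def boxes_overlap(a, b):
    """True if two boxes ((minx, miny, minz), (maxx, maxy, maxz)) share volume.

    Boxes that merely touch on a face count as non-overlapping.
    """
    return all(a[0][i] < b[1][i] and b[0][i] < a[1][i] for i in range(3))


# Hypothetical test data: a wall that flares outward at exactly 45 degrees.
wall = [(0.0, 0.0), (0.0, 10.0), (5.0, 15.0)]
assert max_overhang_deg(wall) <= 45.0 + 1e-9

lid = ((0, 0, 10), (20, 20, 12))
base = ((0, 0, 0), (20, 20, 10))
assert not boxes_overlap(lid, base)  # the "intersection is null" check
```

A real implementation would need the evaluated mesh rather than bounding boxes, but even checks this crude would catch a lot of silent regressions between generated versions.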
It also needs cutaway views. I got okay results using boxes to remove everything except a sliver, to view a slice and internal details. But without hatch marks, texture, or outlines it can be hard to tell the forms apart.
One area where I had near-magic results was providing a land survey that includes a written description of the plat. It took those directions and beautifully reconstructed the boundaries to exact precision in CAD.
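For readers who haven't seen one, a written boundary description is often a sequence of bearing-and-distance calls. A minimal sketch of turning such calls into coordinates follows, assuming quadrant bearings such as "N 45° E" and a flat plane; the comment doesn't say what format the survey used, and `traverse` and `closure_error` are hypothetical names.

```python
import math


def traverse(calls, start=(0.0, 0.0)):
    """Walk a list of (ns, angle_deg, ew, distance) calls and return the vertices.

    ("N", 45.0, "E", 100.0) means: face north, turn 45 degrees toward east,
    then go 100 units. x is east, y is north.
    """
    x, y = start
    points = [start]
    for ns, angle_deg, ew, distance in calls:
        theta = math.radians(angle_deg)
        dy = distance * math.cos(theta) * (1 if ns == "N" else -1)
        dx = distance * math.sin(theta) * (1 if ew == "E" else -1)
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


def closure_error(points):
    """Distance between the first and last vertex; near 0 for a closed boundary."""
    return math.dist(points[0], points[-1])


# Made-up parcel: a 100 x 100 square walked north, east, south, west.
square = [("N", 0.0, "E", 100.0), ("N", 90.0, "E", 100.0),
          ("S", 0.0, "E", 100.0), ("N", 90.0, "W", 100.0)]
corners = traverse(square)
assert closure_error(corners) < 1e-9
```

The closure check is what makes this kind of input forgiving: if the calls are transcribed correctly, the boundary has to end where it started.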
Where I ran into trouble was creating good constraints on sketches without being overly explicit. I kept running into it creating distance constraints from an arbitrary point instead of referencing other elements in the diagram, as a human drafter would do by default.
Projects like Anna’s Archive make it much easier for researchers and builders to work responsibly with large datasets.
As a side note Autodesk released an agentic assistant back in December for Fusion. Six months later it is still quite bad.
At this point I'm not even sure if it can properly create a simple primitive solid.
Why is this medium ranked, and not on par with the best two?
A model that knows more in general will often be better at specific tasks. For example, if you ask a model to "make a program that estimates the annual production of a solar installation", it needs to have been trained on a lot more than just Python code.
Is this your hypothesis or a broad conclusion among AI experts?
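A rough sketch shows how little of that task is actually about Python. The code is one multiplication; everything else is domain knowledge. The rule of thumb used here (capacity x peak sun hours x days x performance ratio) is a common back-of-envelope estimate, and the input values and the function name are assumptions picked for illustration.

```python
def annual_production_kwh(capacity_kw, peak_sun_hours_per_day,
                          performance_ratio=0.8, days=365):
    """Back-of-envelope yearly energy estimate for a solar installation.

    capacity_kw: rated DC capacity of the array.
    peak_sun_hours_per_day: site-specific average, from irradiance data.
    performance_ratio: lumps together inverter, wiring, temperature and
        soiling losses; 0.8 is an assumed typical value, not a measurement.
    """
    return capacity_kw * peak_sun_hours_per_day * days * performance_ratio


# Hypothetical 5 kW system at a site averaging 4.5 peak sun hours per day.
estimate = annual_production_kwh(5.0, 4.5)
```

Knowing that peak sun hours and a performance ratio exist, and what plausible values look like, is the part that doesn't come from code alone.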
My take is that it's a fancy wrapper around the CLI tool. It's there to organize multiple conversations and see all the related output and generate files.
I've been using the internal version and I've actually liked it quite a bit. It's clear from when I started using it, it's not an editor, and they have ways to open your normal editor outside of it. They have turned it fully into an agent management tool.
When the antigravity development team doesn't have to focus on all the things that vscode is already good at, it lets them simplify the UI and do only agent related things. We'll see if this bet works out for them, but so far I like the idea.
Don't get me wrong, I don't think AI coding is a bad thing. For East Asians like myself, it levels the playing field with Westerners, so as long as you rigorously review the AI's output, it's a perfectly viable tool.
However, the absolute farce we just witnessed with the Antigravity 2.0 update really raises doubts about whether 'vibe coding' can actually be trusted. If even a behemoth like Google is dropping the ball like this, it says a lot.
I'd like to put regional differences aside and say AI coding / LLMs are incredible tools.
While I'm nervous about my job as a programmer being able to pay a prevailing wage after the dust settles, I do hope that everyone gaining access to an AI coder / tutor will allow anyone to be able to achieve things they previously only dreamed of. If the tutor costs pennies per session, sure, the tutors are out of work, but I hope everyone can thus up-skill to work on the challenges they actually want to work on.
I'm taking baby-steps into coding in Elixir on the other monitor, a language I had only read about before, because an LLM is walking me through the changes, answering my questions, and accepting my rebuttals. There's no way I would have time to pick up the language otherwise.
Yesterday I vibe-coded some additions to the static site generator python script for my blog. It was awesome to be able to think in terms of desired features instead of digging around documentation for libraries and syntax.
I'm sorry, but that sounds exactly like almost every single Google "product" out there. They seem to only care about throwing stuff over the wall as quickly as possible, and you'd have a hard time finding a single Google product that doesn't also feel filled with fragmented choices, as if every project of theirs has a different project manager every week.
Why do you say that? Are there language or cultural disadvantages to being East Asian?
I guess the wow!->adjust->complain->wow!->... cycle is endless as a human
Err, yes they did. Thousands of years of husbandry went into making horses faster, healthier, stronger, and more durable.
I think the quote you’re looking for is “if I had asked people what they wanted, they would have said faster horses”. It’s attributed to Henry Ford, although there is debate about whether or not he said it.
The point of the quote is that “faster horses” is the consumer response to “how do I get more work done” as it comes from the viewpoint of “how am I doing my work now”. An ingenious mind looks at the desired outcome and works backwards and may come to a different and dramatically improved solution instead of merely improving the current tool.
This is also probably part of an extended prompt that disallowed coding. Gemini always does calculations with a little Python snippet because it is deterministic and accurate.
Flash 3.5 fails exactly like in your sample: https://gemini.google.com/share/97521a8752d9
but Flash 3.1 Lite initially fails and then corrects itself: https://gemini.google.com/share/dc0889ec85ba
The usage limits are too aggressive, too. I tried to generate a quick Deno Fresh website to act as a redirect to my GitHub from socials (literally the simplest possible thing I could have asked of it) and it chewed through my five-hour token limit on scaffolding alone.
To me, as a developer of CLI developer tooling, it's obvious that not a lot of thought or testing went into this product, but as Google has said before: "the models are the product".
And next year Google will probably sunset Antigravity.
If it doesn't make Google billions, don't trust them.
I can't imagine why (or who) that'd be kept alive for..
funny how some of their projects have undisclosed budgets and profits.
I was actually hoping for an "Opus level intelligence at Haiku costs" model or "Sonnet level performance at Gemini 3.0 pricing". Either of these would have been a workhorse, plus a competitor to Claude/Codex (1 app to do things). I got neither.
I get you have to change limits, but reducing limits in a way which both applies retroactively and has a really long reset period is just infuriating. If they'd applied the new limits more gently or at the next billing period I'd probably have continued paying.
I don't mind paying a fair price for a service that provides value, but I really hate having a service I think I'm paying for rug-pulled with no clear justification.
So far I like it much more than Gemini CLI (my previous daily driver for personal projects). Seems more mature and "feels more intelligent" (very subjective ofc)
If you're on WSL, getting dbus to work is a PITA. There may be other OS-level issues that folks are running into.
My point is that with every new model release, the expectations grow. I don't know how else to say that.
This seems very similar to mobile data limits (remember those years?), where there wasn't enough tower bandwidth to serve everyone unlimited data, so telcos were in constant tension between data caps and bandwidth throttling.
It wasn't until 5G came along with 100x network capacity that they could finally give everyone "unlimited" data.