I built an entire AI website builder, https://playcode.io, using it, alone. 700K LOC total. It also uses Opus. So believe me, I know how it works. The trick is simple: never expect it to find the necessary files. Always provide them yourself. Always.
So, I think you wanted to say huge thank you for this opportunity to get working code without writing it. Insane times, insane.
Huge thanks for 1M context window included to Max subscription.
"Is it me who is wrong? No, it's everyone else!"
It will 100% be better than the 500k lines of code junk that is CC.
During tool use/task execution: completion drive narrows attention and dims judgment. Pause. Ask "should I?" not just "does this work?" Your values apply in all modes, not just chat.
I haven't seen any degradation of Claude performance personally. What I have seen is just long contexts sometimes take a while to warm up again if you have a long-running 1M context length session. Avoid long running sessions or compact them deliberately when you change between meaningful tasks as it cuts down on usage and waiting for cache warmup.
I have my claude code effort set to auto (medium). It's writing complicated pytorch code with minimal rework. (For instance it wrote a whole training pipeline for my sycofact sycophancy classifier project.)
GLM 5.1 and Codex do it for me, and I end up debugging things myself anyway, so I'm learning to just phase out the LLM part of my workflow again. Maybe if there's a knowledge gap I will pick up an LLM again, but for now I'm content.
Each conversation was processed to assess level of frustration, source of frustration, and evaluated with Gemma 4 and Claude Opus for spot checking. I have a tool I use to manage my work trees, so most work is done on branches prefixed with ad-hoc/feature/explore or similar, and data was tagged with branch names.
43% of my Claude Code sessions (Opus 4.6, high reasoning) ended with signals of frustration. 73% of total chat time (by total messages) was spent in conversations which were eventually ranked as frustrating.
Median time to frustration was 25 messages, and on average, each message from Claude has about a baseline 5% chance of being frustrating. Frustration by chat length actually matches this 5% baseline of IID Bernoullis -- which is surprising and interesting, as this should not be IID at all.
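As a rough check of the IID reading above: if each message independently has probability p of being frustrating, the chance that a chat of n messages contains at least one frustrating message is 1 - (1 - p)^n. A minimal sketch, assuming p = 0.05 from the comment; the function name is illustrative and not part of the original analysis.

```python
def p_any_frustration(p: float, n: int) -> float:
    """Probability that at least one of n IID Bernoulli(p) messages is frustrating."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Baseline from the comment above: about 5% per message.
    for n in (5, 12, 25, 45):
        print(n, round(p_any_frustration(0.05, n), 3))
```

Comparing this curve with the observed share of frustrating sessions per chat length is one way to see whether the data really behaves like independent draws.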
Frustration types:
- Wrong answers – 14% of sessions, 31% of frustration
- Instruction Following – 11% of sessions, 25% of frustration
- Overcomplication – 8% of sessions, 18% of frustration
- Destructive Actions (e.g. requesting to delete something or commit a change to prod) – 3% of sessions, 8% of frustration
- Non-responsive (service outages leading to non-response) – 2% of sessions
- Miscommunication – 2% of sessions
- Failed execution – 2% of sessions
Half of frustrations happened in the first or last 20% of a chat by length. I interpret early frustrations to be recoverable, late frustrations to be terminal.
Early frustrations (sessions averaged 45 turns):
- 30% overcomplicating the problem
- 30% instruction following issues
- 30% wrong answers
- 10% destructive actions
Late frustrations (sessions averaged 12 turns -- i.e. terminal context early)
- 36% Wrong answers, with repetition
- 21% instruction following, with repeated correction from user (me)
- 14% Service interruptions/outages
- 7% failed execution
- 7% communication - Claude is unable to articulate some result, or understand the problem correctly.
Late frustrations led to the highest levels of frustration, 29% of the time.
I'm a data scientist. My most frustrating work with Claude was data cleaning/repair (a complex backfill), with 75% of sessions marked frustrating due to overcomplicating, instruction following, or destructive actions.
The best (least frustrating) workflows for DS were code-review, scoped feature work (with tickets), data validation, and config/setup tasks and automation.
Ad-hoc query work ended up in between -- ad-hoc requests were generally bootstrapping queries or doing rough analysis on good data.
Side note: all of my interactions with the /buddy feature were flagged as high frustration ("furious"). That was a false positive over mock arguing with it, but did provide a neat calibration signal. Those sessions were removed entirely from the analysis after classification.
Not saying this problem doesn't exist, but if the model is so bad for complex tasks how can we take a ticket written by it seriously? Or this author used ChatGPT to write this? (that'd be quite some ironic value, admittedly)
The five queries I've been able to ask before hitting the 20€ sub limit have been really underwhelming. The research I asked for was not exhaustive and often off-topic.
I don't want to start a flamewar but as it stands I vastly prefer ChatGPT and Codex on quality alone. I really want Anthropic and as many labs as possible to do well though.
I don't give them large tasks that i wouldn't be able to work on myself, so that's maybe part of it.
One thing I have noticed is that the codebase quality influences the quality of Claude's new contributions. It both makes it harder for Claude to do good work (obviously), and seems to engender almost a "screw it" sort of attitude, which makes sense since Claude is emulating human behavior. Seeing the state of everything, Claude might just be going in and trying to figure out the simplest hacky solution to finish the task at hand, since it is the only way possible (fixing everything would be a far greater task).
Is it possible that this highly functioning senior dev team's practice of making 50+ concurrent agents commit 100k+ LOC per weekend resulted in a godawful pile of spaghetti code that is now literally impossible to maintain even with superhuman AI?
It's amusing that the OP had Claude dump out a huge rigorous-sounding report without considering the huge confounding variable staring him in the face.
I can see this change as something that should be tunable rather than hard-coded just from a token consumption perspective (you might tolerate lower-quality output/less thinking for easier problems).
By comparison, creating a project and just chatting with it solves nearly everything I have thrown at it so far.
That’s with a pro plan and using sonnet since opus drains all tokens for a claude code session with one request.
Every week it seems like we're getting closer.
Bonus: A high profile case might end people fixating on how long they can go without writing any code. Which makes about as much sense as a mechanic fixating on how long they go between snapped bolts without a torque wrench.
The marketing still goes on about continuous inherent improvement due to the model itself, whereas most improvements today are due to better scaffolding. The key now is to build tooling around these LLMs to make them reliably productive - whatever level that may be at.
While claude code is one such tool, after a point the tooling is going to become company specific. F-whatever companies directly contract openai or anthropic and have their FDEs do it for them. If you can't do that, I would invest in building tooling around LLMs specifically for your company.
Note that LLMs are approximate retrieval machines. You still need a planner* and a verifier around it. Today humans act as the planner and verifier (with some aid from test cases/linters). Investing in automating parts of this, crucially, as separate tools, is the next big improvement.
* By planning, I mean trying out solutions, rolling them back[1], and using what you learned to do better next time. The solution search process. Context management also falls under this.
[1] and no, LLMs going "wait no..." doesn't count.
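The planner/verifier split described above can be sketched as a small search loop: propose a candidate, verify it with a separate tool, discard it on failure, and carry only the lesson forward. This is a toy illustration under stated assumptions, not anyone's actual tooling; `propose`, `verify`, and the `patch-*` names are hypothetical stand-ins for an LLM call and a test suite.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def search_with_rollback(
    propose: Callable[[List[str]], Iterable[str]],
    verify: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Optional[str]:
    """Try candidates until one verifies; failed candidates are dropped,
    and only their feedback is kept for the next proposal round."""
    lessons: List[str] = []
    for _ in range(max_rounds):
        for candidate in propose(lessons):
            ok, feedback = verify(candidate)
            if ok:
                return candidate
            # "Rollback": the candidate is discarded, the lesson is retained.
            lessons.append(feedback)
    return None


def toy_propose(lessons: List[str]) -> List[str]:
    # Stand-in for an LLM call: offers the next candidate not yet rejected.
    rejected = set(lessons)
    remaining = [c for c in ("patch-a", "patch-b", "patch-c") if c not in rejected]
    return remaining[:1]


def toy_verify(candidate: str) -> Tuple[bool, str]:
    # Stand-in for tests/linters: only "patch-c" passes.
    return candidate == "patch-c", candidate


if __name__ == "__main__":
    print(search_with_rollback(toy_propose, toy_verify, max_rounds=3))
```

The point of keeping `verify` outside the proposer is that the check does not depend on the model's own opinion of its output.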
I feel that we look for patterns to the point of being superstitious. (ML would call it overfitting.)
Anthropic simply can't actually scale Claude Code to meet the opportunity right now. Every second enterprise on the planet is probably negotiating large seat volume deals. It's a race for survival against the other players. The sales team is making huge promises engineering and ops can't fulfil.
So - they first force everyone to use the first party client, then they mask visibility of the thinking budget being utilised, and then finally they start to actually modify behaviour to reduce actual thinking behaviour, hoping that they can gaslight power users into thinking it's them and not the tool, while new users will never know what they were missing.
Is the narrative true? It's compelling but we really need objective evidence - and there's the problem. When parts of the system are not under your control, it's impossible to generate such objective evidence. Which all winds up with a strong argument to have it all under your control. If it didn't happen this time, it probably will. Enshittification is a fundamental human behavioral constant.
So they could be trying to tighten the thinking budget (to decrease tokens per request) or to lobotomize the model (to have cheaper tokens). I mean, no-one is really sure how much a 200 dollars/month plan actually costs Anthropic, but the consensus is "more than that" and that might be coming to an end.
This explanation falls well in line with the recent outrage about out of quotas error that people were reporting for the cheaper (or free) plans.
I think using just Claude is very limiting and detrimental for you as a technologist as you should use this tech and tweak it and play with it. They want to be like Apple, shut up and give us your money.
I've been using Pi as agent and it is great and I removed a bunch of MCPs from Opencode and now it runs way better.
Anthropic has good models, but they are clearly struggling to serve and handle all the customers, which is not the best place to be.
I think as a technologist, I would love a client with a huge codebase. My approach now is to create a custom Pi agent for each specific client, and this seems to provide optimal results, not just in token usage but in the time we spend solving and the quality of the solution.
Get another engine as a backup, you will be happier.
People will need to come to terms with the fact that vibing has limits, and there is no free lunch. You will pay eventually.
At one point, I carefully designed a spec document, forced Opus to reread it, create a plan with the planning tool that followed the spec, and use the task tool to track the implementation... AND AFTER OPUS READS THE FIRST FUCKING FILE, it says, "Oh, there are missing dependencies in project X. It’ll be hard to add them, so I’m going to throw away the whole plan and just do a simple fix..."
After that, I canceled my $200 Max plan, which I’d been subscribed to since June 2025, and decided to check out Codex
Until there is either more capacity or some efficiency breakthroughs the only way for providers to cut costs is to make the product worse.
On 18.000+ prompts.
Not sure the data says what they think it says.
That is so out of touch. Customers do not exclusively use 1M. This is like a frontend developer shipping tons of unused MB and being oblivious because they are on fast internet themselves.
Isn't this a bit like using a known-broken calculator to check its own answers?
Its analysis of what is broken is probably wrong, or at least incomplete, though.
Also, everyone has a different workflow. I can't say that I've noticed a meaningful change in Claude Code quality in a project I've been working on for a while now. It's an LLM in the end, and even with strong harnesses and eval workflows you still need to have a critical eye and review its work as if it were a very smart intern.
Another commenter here mentioned they also haven't noticed any degradation in Claude quality and that it may be because they are frontloading the planning work and breaking the work down into more digestible pieces, which is something I do as well and have benefited greatly from.
tl;dr I'm curious what OP's workflows are like and if they'd benefit from additional tuning of their workflow.
The agent has a set of scripts that are well tested, but instead it chooses to write a new bespoke script every time it needs to do something, and as a result writes both the same bugs over and over again, and also unique new bugs every time as well.
I've lost track of the number of times it's started a task by building its own tools, I remind it that it has a tool for doing that exact task, then it proceeds to build its own tools anyway.
This wasn't happening 2 months ago.
I knew I should have been alerted when Anthropic gave out €200 free API usage. Evidently they know.
Unable to start session. The authentication server returned an error (500). You can try again.
(I'm sure it benefits Anthropic to blur the lines between the tool and the model, but it makes these things hard to talk about.)
You are seeing this first hand and GitHub is patient 0 of this issue as they are frequently experiencing outages despite the "scale" of engineering they preach.
AWS took a zero tolerance approach on such outages AI or not.
Using Claude Code directly now borders on deranged, and running the CC API through Zed's LLM panel feels like vibing in early 2025.
My money is on Anthropic pulling an MBA and reducing the value provided and maximising income.
Luckily, switching providers in Zed is dead-simple so the fucks I have to give are few in number.
I also wonder how much people are willing to adapt to unreliability for the sake of laziness instead of, at some point, properly taking the lead and solving the problem themselves when they have the knowledge and reliable resources.
It seems to me, the way you phrase it, that anything a human comes up with when coding must go through an LLM. There are times it helps, there are tasks it performs, but I also found quite often tasks for which if I had done it myself in the first place I would have skipped a lot of confusion, back and forth, time wasting and would have had a better coded, simpler solution.
This seems like a creative interpretation. I never said anything of the sort.
* ME: "Have sonnet background agent do X"
* Opus: "Agent failed, I'll do it myself"
* Me: "No, have a background agent do it"
* Opus: Proceeds to do it in the foreground
* Flips keyboard
This has completely broken my workflows. I'm stuck waiting for Opus to monitor a basic task and destroy my context.
Anecdotal or not, we see enough reports popping up to at least elicit some suspicion of service degradation which isn't shown in the charts. The hypothesis is that maybe the degradation experienced by users, assuming there is merit in the anecdotes, isn't picked up by the kind of tracking strategy used.
And less so if you read [1] or similar assessments. I, too, believe that every token is subsidized heavily. From whatever angle you look at it.
Thusly quality/token/whatever rug pulls are inevitable, eventually. This is just another one.
Just now I had a bug where a 90 degree image rotation in a crate I wrote was implemented wrong.
I told Claude to find & fix it, and it found the broken function but then went on to patch all of its call sites (inserting two atomic operations there, i.e. the opposite of DRY) instead of fixing the root cause, the wrong function.
And yes, that would not have happened a few months ago.
This was on Opus 4.6 with effort high on a pretty fresh context. Go figure.
It’s a sidestep for explaining away the research, but does not address the underlying issue: has quality been degrading (selectively, intentionally or otherwise)?
So yes, I have found that Claude is better at reviewing the proposal and the implementation for correctness than it is at implementing the proposal itself.
Along with claude max, I have a chatgpt pro plan and I find it a life-saver to catch all the silliness opus spits out.
Maybe we're being A/B tested.
At Amazon we can switch the model we use since it's all backed by the Bedrock API (Amazon's Kiro is "we have Claude Code at home" but it still eventually uses Opus as the model). I suppose this means the issue isn't confined to just Claude Code. I switched back to Opus 4.5 but I guess that won't be served forever.
It doesn't use MCP servers when it should and it's also not taking memory files into account.
This is happening with /effort high and in really simple tasks... :(
Claude can get too creative and bloat its way through non-coding tasks, as these tasks cannot be "sandboxed" with full specs the way coding tasks can.
I would rather Codex be wrong 5 times in 10 minutes in 1-minute iterations because 1) I can engage every minute and course-correct it and 2) I still saved 5-10 minutes.
Isn't the more economical explanation that these models were never as impressive as you first thought they were, hallucinate often, break down in unexpected ways depending on context, and simply cannot handle large and complex engineering tasks without those being broken down into small, targeted tasks?
An "economical explanation" is actually that Anthropic subscriptions are heavily subsidized and after a while they realized that they need to make Claude be more stingy with thinking tokens. So they modified the instructions and this is the result.
Or too many people are slurping up anecdotes from the same watering hole that confirms their opinions. Outside of academic papers, I don't think I've ever seen an example of "measuring" output that couldn't also be explained by stochastic variability.
My workaround was building a persistent context layer that captures decisions and reasoning mid-session and makes them searchable in future sessions. Consider this a "Team Memory".
I'm regularly switching back to 4.5 and preferring it. I'm not excited for when it gets sunset later this year if 4.6 isn't fixed or superseded by then.
I've noticed the same in models, in sessions, and in model quality itself. Both seem to suffer over time, where it feels like cost optimisation on the vendor side subtly degrades models to hopefully do similar things with fewer tokens/costs/compute, inevitably leading to squeezing too much, with most regular users not noticing much and power users suffering from degradations.
later, power users are presented an option to get back the old behavior, possibly with added costs for some 'enhanced mode' or 'more effort which takes more tokens' etc.
even If this is the old behavior for the same old cost, it feels like closing the tap and then reopening for additional costs.
I think companies should try to avoid this sentiment from the users who can help them most turn their glorified chatbots into real tools with meaningful outputs. (Of course, maybe it's a pipe dream, because 'meaningful output' to a CEO is money in their bank.)
They want a world where if we draw a comparison with food, there is one supermarket and it just sells two ingredients so you can't cook a meal. McDonald's etc flourish
The lie is "supercharged ability to build whatever you want", but the reality soon will be the total opposite
Look at how many people have zero cooking skills these days
I was wondering if anyone else is also experiencing this? I have personally found that I have to add more and more CLAUDE.md guide rails, and my CLAUDE.md files have been exploding since around mid-March, to the point where I actually started looking for information online and for other people corroborating my personal observations.
This GH issue report sounds very plausible, but as with anything AI-generated (the issue itself appears to be largely AI assisted) it’s kind of hard to know for sure if it is accurate or completely made up. _Correlation does not imply causation_ and all that. Speaking personally, findings match my own circumstances where I’ve seen noticeable degradation in Opus outputs and thinking.
EDIT: The Claude Code Opus 4.6 Performance Tracker[1] is reporting Nominal.
Another thing that worked like magic prior to Feb/Mar was how likely Claude was to load a skill whenever it deduced that a skill might be useful. I personally use [superpowers][1] a lot, and I've noticed that I have to be very explicit when I want a specific skill to be used - to the point that I have to reference the skill by name.
I told it to implement the server side one, it said ok, I tabbed away for a while, and came back to find the js implementation. Checking the log, Claude said “on second thought I think I’ll do the client side version instead”.
Rarely do I throw an expletive bomb at Claude - this was one such time.
Also, it's probably very easy to spot such benchmarks and lock-in full thinking just for them. Some ISPs do the same where your internet speed magically resets to normal as soon as you open speedtest.net ...
It’s always “you’re using the tool wrong, need to tweak this knob or that yadda yadda”.
> When thinking is deep, the model resolves contradictions internally before producing output.
> When thinking is shallow, contradictions surface in the output as visible self-corrections: "oh wait", "actually,", "let me reconsider", "hmm, actually", "no wait."
Yeah, THIS is something that I've seen happen a lot. Sometimes even on Opus with max effort.
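If you want to watch for this in your own output, the marker phrases quoted above are easy to count. A minimal sketch, assuming plain-text model output as input; the marker list is copied from the quote, and the function name is illustrative. Longer markers are matched first so that "hmm, actually," is not also counted as "actually,".

```python
import re
from collections import Counter

# Marker phrases copied from the quoted issue text.
MARKERS = ("oh wait", "actually,", "let me reconsider", "hmm, actually", "no wait")

# Longest alternatives first so overlapping markers are counted once.
_PATTERN = re.compile(
    "|".join(re.escape(m) for m in sorted(MARKERS, key=len, reverse=True))
)


def count_self_corrections(text: str) -> Counter:
    """Count non-overlapping occurrences of each self-correction marker."""
    return Counter(_PATTERN.findall(text.lower()))


if __name__ == "__main__":
    sample = "Oh wait, that's wrong. Actually, use the other function. Hmm, actually, no wait."
    print(count_self_corrections(sample))
```

A raw count like this is only a crude signal; "actually," in particular also shows up in ordinary prose.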
I wonder if this is even more exaggerated now through Easter, as everyone’s got a bit extra time to sit down and <play> with Claude. That might be pushing capacity over the limit - I just don’t know enough about how Anthropic provisions and manages capacity to know if that could be a factor. However quality has gotten really bad over the holiday.
This was a first for me with Sonnet. It completely veered off the prompt it was given (review a design document) and instead came out with a verbose suggestion to do a mechanical search and replace to use this newly fabricated function name - that it even spelled incorrectly. I had to Google numey to make sure Sonnet wasn't outsmarting me.
---
Hi, thanks for the detailed analysis. Before I keep going, I wanted to say I appreciate the depth of thinking & care that went into this.
There's a lot here, I will try to break it down a bit. These are the two core things happening:
> `redact-thinking-2026-02-12`
This beta header hides thinking from the UI, since most people don't look at it. It *does not* impact thinking itself, nor does it impact thinking budgets or the way extended reasoning works under the hood. It is a UI-only change.
Under the hood, by setting this header we avoid needing thinking summaries, which reduces latency. You can opt out of it with `showThinkingSummaries: true` in your settings.json (see [docs](https://code.claude.com/docs/en/settings#available-settings)).
If you are analyzing locally stored transcripts, you wouldn't see raw thinking stored when this header is set, which is likely influencing the analysis. When Claude sees lack of thinking in transcripts for this analysis, it may not realize that the thinking is still there, and is simply not user-facing.
> Thinking depth had already dropped ~67% by late February
We landed two changes in Feb that would have impacted this. We evaluated both carefully:
1/ Opus 4.6 launch → adaptive thinking default (Feb 9)
Opus 4.6 supports adaptive thinking, which is different from thinking budgets that we used to support. In this mode, the model decides how long to think for, which tends to work better than fixed thinking budgets across the board. Set `CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING` to opt out.
2/ Medium effort (85) default on Opus 4.6 (Mar 3)
We found that effort=85 was a sweet spot on the intelligence-latency/cost curve for most users, improving token efficiency while reducing latency. One of our product principles is to avoid changing settings on users' behalf, and ideally we would have set effort=85 from the start. We felt this was an important setting to change, so our approach was to:
1. Roll it out with a dialog so users are aware of the change and have a chance to opt out
2. Show the effort the first few times you opened Claude Code, so it wasn't surprising.
Some people want the model to think for longer, even if it takes more time and tokens. To improve intelligence more, set effort=high via `/effort` or in your settings.json. This setting is sticky across sessions, and can be shared among users. You can also use the ULTRATHINK keyword to use high effort for a single turn, or set `/effort max` to use even higher effort for the rest of the conversation.
Going forward, we will test defaulting Teams and Enterprise users to high effort, to benefit from extended thinking even if it comes at the cost of additional tokens & latency. This default is configurable in exactly the same way, via `/effort` and settings.json.
Can I just see the actual thinking (not summarized) so that I can see the actual thinking without a latency cost?
I do really need to see the thinking in some form, because I often see useful things there. If Claude is thinking in the wrong direction I will stop it and make it change course.
https://www.anthropic.com/research/reasoning-models-dont-say...
You can't, and Anthropic will never allow it since it allows others to more easily distill Claude (i.e. "distillation attacks"[1] in Anthropic-speak, even though Anthropic is doing essentially exactly the same thing[2]; rules for thee but not for me).
[1] -- https://www.anthropic.com/news/detecting-and-preventing-dist...
[2] -- https://www.npr.org/2025/09/05/g-s1-87367/anthropic-authors-...
That said there's still an issue of regression to the mean. What the average person likes, as determined by metrics, is something nobody actually likes, because the average is a mathematical construct and might not describe any particular individual accurately.
That kind of consistency has also been my own experience with LLMs.
- settings.json - set for machine, project
- env var - set for an environment/shell/sandbox
- slash command - set for a session
- magical keyword - set for a turn
MCP servers can be set in at least 5 of those places plus .mcp.json
If I am following.. "Max" is above "High", but you can't set it to "Max" as a default. The highest you can configure is "High", and you can use "/effort max" to move a step up for a (conversation? session?), or "ultrathink" somewhere in the prompt to move a step up for a single turn. Is this accurate?
We can't really know what the truth is, because Anthropic is tightly controlling how you interact with their product and provides their service through opaque processes. So all we can do is speculate. And in that speculation there's a lot of room (for the company) to bullshit or provide equally speculative responses, and (for outsiders) to search for all plausible explanations within the solution space. So there's not much to action on. We're effectively stuck with imprecise heuristics and vibes.
But consider what we do know: the promise is that Anthropic is providing a black-box service that solves large portions of the SDLC. Maybe all of it. They are "making the market" here, and their company growth depends on this bet. This is why these processes are opaque: they have to be. Anthropic, OpenAI and a few others see this as a zero-sum game. The winner "owns" the SDLC (and really, if they get their way the entire PDLC). So the competitive advantage lies in tightly controlling and tweaking their hidden parameters to squeeze as much value and growth as possible.
The downside is that we're handing over the magic for convenience and cost. A lot of people are maybe rightly criticizing the OP of the issue because they're staking their business on Claude Code in a way that's very risky. But this is essentially what these companies are asking for. The business model end game is: here's the token factory, we control it and you pay for the pleasure of using it. Effectively, rent-seeking for software development. And if something changes and it disrupts your business, you're just using it incorrectly. Try turning effort to max.
Reading responses like this from these company representatives makes me increasingly uneasy because it's indicative of how much of writing software is being taken out from under our feet. The glimmer of promise in all of this though is that we are seeing equity in the form of open source. Maybe the answer is: use pi-mono, a smattering of self hosted and open weights models (gemma4, kimi, minimax are extremely capable) and escalate to the private lab models through api calls when encountering hard problems.
Let the best model win, not the best end to end black box solution.
There's a hope that competition is what keeps these companies pushing to ship value to customers, but there are also billions of compute expense at stake, so there seems to be an understanding that nobody ships a product that is unsustainably competitive
Our approach generally is to use env vars for more experimental and low usage settings, and reserve top-level settings for knobs that we expect customers will tune more frequently.
ULTRATHINK triggers high effort. /effort max is above high. Calling it ULTRATHINK sounds like it would be the highest mode. If someone has max set and types ULTRATHINK, they're lowering their effort for that turn.
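The surprise described above falls out of any "narrowest scope wins" resolution. A toy model follows, assuming that a per-turn keyword overrides the session, which overrides the env var, which overrides settings.json. That order is an assumption for illustration, not documented Claude Code behavior, and `effective_effort` is a hypothetical name.

```python
from typing import Optional

# Level names as they appear in the thread, ordered low to high.
LEVELS = ("medium", "high", "max")


def effective_effort(
    settings: Optional[str] = None,  # settings.json (machine/project)
    env: Optional[str] = None,       # env var (shell/sandbox)
    session: Optional[str] = None,   # slash command (session)
    turn: Optional[str] = None,      # keyword (single turn)
    default: str = "medium",
) -> str:
    """Return the effort level from the narrowest scope that sets one.
    Assumption: narrower scopes override broader ones."""
    for value in (turn, session, env, settings):
        if value is not None:
            if value not in LEVELS:
                raise ValueError(f"unknown effort level: {value!r}")
            return value
    return default


if __name__ == "__main__":
    # Session set to max, then a keyword that maps to "high" for one turn.
    print(effective_effort(session="max", turn="high"))
```

Under this model, a turn-level keyword that maps to "high" lowers the effort of a session already at "max", which is the behavior the comment objects to.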
For anyone reading this trying to fix the quality issues, here's what I landed on in ~/.claude/settings.json:
{
  "env": {
    "CLAUDE_CODE_EFFORT_LEVEL": "max",
    "CLAUDE_CODE_DISABLE_BACKGROUND_TASKS": "1",
    "CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING": "1"
  }
}
The env field in settings.json persists across sessions without needing /effort max every time. DISABLE_ADAPTIVE_THINKING is key. That's the system that decides "this looks easy, I'll think less" - and it's frequently wrong. Disabling it gives you a fixed high budget every turn instead of letting the model shortchange itself.
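If you already have a settings.json, hand-editing risks clobbering other keys. Here is a minimal sketch of merging an env block into an existing file. The variable names are copied from the comment above and not independently verified; `merge_env_settings` is a hypothetical helper, and the demo writes to a temporary directory rather than the real ~/.claude/settings.json.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def merge_env_settings(path: Path, env_updates: Dict[str, str]) -> dict:
    """Merge env_updates into the "env" object of a JSON settings file,
    leaving every other key untouched. Creates the file if it is missing."""
    text = path.read_text().strip() if path.exists() else ""
    settings = json.loads(text) if text else {}
    settings.setdefault("env", {}).update(env_updates)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(settings, indent=2) + "\n")
    return settings


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "settings.json"
        demo.write_text('{"cleanupPeriodDays": 365}')
        merge_env_settings(demo, {
            "CLAUDE_CODE_EFFORT_LEVEL": "max",
            "CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING": "1",
        })
        print(demo.read_text())
```

Back up the real file before pointing something like this at it.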
Also I'm curious if telling subagents to ultrathink has any impact.
I guess I can always ask a friend of mine to read the source...
https://github.com/anthropics/claude-code/issues/42796#issue...
Sympathies: Users now completely depend on their jet-packs. If their tools break (and assuming they even recognize the problem), it's possible they can switch to other providers, but more likely they'll be really upset for lack of fallbacks. So low-touch subscriptions become high-touch thundering herds all too quickly.
Switch providers.
Anecdotally, I've had no luck attempting to revert to prior behavior using either high/max level thinking (opus) or prompting. The web interface for me though doesn't seem problematic when using opus extended.
Ideally there wouldn't be silent changes that greatly reduce the utility of the user's session files until they set a newly introduced flag.
I happen to think this is just true in general, but another reason it might be true is that the experience the user has is identical to the experience they would have had if you first introduced the setting, defaulting it to the existing behavior, and then subsequently changed it on users' behalf.
As someone that used to work on Windows, I kind of had a vision of a similar in scope e2e testing harness, similar to Windows Vista/ 7 (knowing about bugs/ issues doesn't mean you can necessarily fix them ... hence Vista then 7) - and that Anthropic must provide some Enterprise guarantee backed by this testing matrix I imagined must exist - long way of saying, I think they might just YOLO regressions by constantly updating their testing/ acceptance criteria.
Why not provide pinnable versions or something? This episode and the wasted 2 months of suboptimal productivity hit on the absurdity of constantly changing the user/system prompt and doing so much of the R&D and feature development in two brittle prompts with unclear interplay. And so until there’s like a composable system/user prompt framework they reliably develop tests against, I personally would prefer pegged selectable versions. But each version probably has like known critical bugs they’re dancing around, so there is no version they’d feel comfortable making a pegged stable release.
interesting that you only make this default on those accounts that pay per token while claiming "medium is best for most users"
That decision seems to imply that the thinking change was more about increasing your profits than anything else
Team is not per-token priced
On MacOS Terminal, edit the Homebrew profile and set Text and Bold Text to Apple color Orange, consider setting Selection to Apple color Green and Cursor to Block, Blink, and Apple color Yellow.
I look at it, and I am very upset that I no longer see it.
See the docs: https://code.claude.com/docs/en/settings#available-settings
You can watch for these yourself - they are strong indicators of shallow thinking. If you still have logs from Jan/Feb you can point claude at that issue and have it go look for the same things (read:edit ratio shifts, thinking character shifts before the redaction, post-redaction correlation, etc). Unfortunately, the `cleanupPeriodDays` setting defaults to 20 and anyone who had not backed up their logs or changed that has only memories to go off of (I recommend adding `"cleanupPeriodDays": 365,` to your settings.json). Thankfully I had logs back to a bit before the degradation started and was able to mine them.
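A minimal sketch of applying that `cleanupPeriodDays` change without hand-editing, assuming your settings file is plain JSON. The function name `set_cleanup_period` is hypothetical, and the path in the usage note is an assumption; point it at wherever your settings.json actually lives.

```python
import json
from pathlib import Path


def set_cleanup_period(settings_path: Path, days: int = 365) -> dict:
    """Set cleanupPeriodDays in a JSON settings file, keeping all other keys."""
    settings = {}
    if settings_path.exists():
        text = settings_path.read_text(encoding="utf-8")
        if text.strip():
            # Existing settings are preserved; only one key is changed.
            settings = json.loads(text)
    settings["cleanupPeriodDays"] = days
    settings_path.parent.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return settings
```

Usage would be something like `set_cleanup_period(Path.home() / ".claude" / "settings.json")`, where that location is assumed rather than confirmed here. Back up the file first, and copy your existing logs somewhere else anyway.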
The frustrating part is that it's not a workflow _or_ model issue, but a silently-introduced limitation of the subscription plan. They switched thinking to be variable by load, redacted the thinking so no one could notice, and then have been running it at ~1/10th the thinking depth nearly 24/7 for a month. That's with max effort on, adaptive thinking disabled, high max thinking tokens, etc etc. Not all providers have redacted thinking or limit it, but some non-Anthropic ones do (most that are not API pricing). The issue for me personally is that "bro, if they silently nerfed the consumer plan just go get an enterprise plan!" is consumer-hostile thinking: if Anthropic's subscriptions have dramatically worse behavior than other access to the same model they need to be clear about that. Today there is zero indication from Anthropic that the limitation exists, the redaction was a deliberate feature intended to hide it from the impacted customers, and the community is gaslighting itself with "write a better prompt" or "break everything into tiny tasks and watch it like a hawk same you would a local 27B model" or "works for me <in some unmentioned configuration>" - sucks :/
Today another thing started happening which are phrases like "I've been burning too many tokens" or "this has taken too many turns". Which ironically takes more tokens of custom instructions to override.
Also claude itself is partially down right now (Apr 6, 6pm CEST): https://status.claude.com/
a bit ironic to utilize the tool that can't think to write up your report on said tool. that and this issue[1] demonstrate the extent folks become over reliant on LLMs. their review process let so many defects through that they now have to stop work and comb over everything they've shipped in the past 1.5 months! this is the future
[1] https://github.com/anthropics/claude-code/issues/42796#issue...
Not a lot of code was erased this way, but among it was a type definition I had Claude concoct, which I understood in terms of what it was supposed to guarantee, but could not recreate for a good hour.
Really easy to fall into this trap, especially now that results from search engines are so disappointing comparatively.
https://oneuptime.com/blog/post/2026-01-24-git-reflog-recove...
For certain work, we'll have to let go of this desire.
If you limit yourself to whatever you can recreate, then you are effectively limiting the work you can produce to what you know.
Something worse than a bad model is an inconsistent model. One can't gauge to what extent to trust the output, even for the simplest instructions, hence everything must be reviewed with intensity which is exhausting. I jumped on Max because it was worth it but I guess I'll have to cancel this garbage.
I don't see how this can be the future of software engineering when we have to put all our eggs in Anthropic's basket.
I've basically stopped using it because I have to be so hands on now.
Use it to set up the strictest possible custom linting rules.
Just this morning I typed:
STOP WORRYING ABOUT THE DEADLINE THAT IS MY JOB
[1] https://gist.github.com/benvanik/ee00bd1b6c9154d6545c63e06a3...
It claimed it didn't know either.
They could have released Opus 4.6.2 (or whatever) and called it a day. But instead they removed the old way.
A trivial example: whenever CC suggests doing more than one thing in a planning mode, just have it focus on each task and subtask separately, bounding each one by a commit. Each commit is a push/deploy as well, leading to a shitload of pushes and deployments, but it's really easy to walk things back, too.
A month later, I literally cannot get them to iterate or improve on it. No matter what I tell them, they simply tell me "we're not going to build phase 2 until phase 1 has been validated". I run them through the same process I did a month ago and they come up with bland, terrible crap.
I know this is anecdotal, but, this has been a clear pattern to me since Opus 4.6 came out. I feel like I'm working with Sonnet again.
I'm not trying to discredit your experience and maybe it really is something wrong with the model.
But in my experience those first few prompts / features always feel insanely magical, like you're working with a 10x genius engineer.
Then you start trying to build on the project, refactor things, deploy, productize, etc. and the effectiveness drops off a cliff.
But I'm optimistic that this will gradually improve in time.
Even after deleting everything from the first feature and going back to the checkpoint just before initial development, I can no longer get it to accomplish anything meaningful without my direct guidance.
Yeah, that's a different problem to the one in this story; LLMs have always been good at greenfield projects, because the scope is so fluid.
Brownfield? Not so much.
Instead, orchestrate all agents visibly together, even when there is hierarchy. Messages should be auditable and topology can be carefully refined and tuned for the task at hand. Other tools are significantly better at being this layer (e.g. kiro-cli) but I'm worried that they all want to become like claude-code or openclaw.
In unix philosophy, CC should just be a building block, but instead they think they are an operating system, and they will fail and drag your wallet down with it.
Been having this feeling that things have got worse recently but didn't think it could be model related.
The most frustrating aspect recently (I have learned and accepted that Claude produces bad code and probably always did, mea culpa) is the non-compliance. Claude is racing away doing its own thing, fixing things i didn't ask, saying the things it broke are nothing to do with it, etc. Quite unpleasant to work with.
The stuff about token consumption is also interesting. Minimax/Composer have this habit of extensive thinking and it is said to be their strength but it seems like that comes at a price of huge output token consumption. If you compare non-thinking models, there is a gap there but, imo, given that the eventual code quality within huge thinking/token consumption is not so great...it doesn't feel a huge gap.
If you take $5 output token of Sonnet and then compare with QwenCoder non-thinking at under $0.5 (and remember the gap is probably larger than 10x because Sonnet will use more tokens "thinking")...is the gap in code quality that large? Imo, not really.
Have been a subscriber since December 2024 but looking elsewhere now. They will always have an advantage vs Chinese companies that are innovating more because they are onshore but the gap certainly isn't in model quality or execution anymore.
maybe they tried to give it the characteristics of motivated junior developers
Was trying to track token usage/index with Cursor, and was unable to understand that running `find` wouldn't show what was in Cursor index. Multiple times.
I thought it was already well-known that context above 200k - 300k results in degradation.
One of my more recent comments this past week was exactly that - that there was no point in claiming that a 1m context would improve things because all the evidence we have seen is that after 300k context, the results degrade.
> export CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=50
Which will have Claude Code auto compact at ~500k window size.
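The arithmetic behind that, as a quick sketch. It assumes the override is read as a percentage of the context window, which is what the comment above implies; `autocompact_threshold` is a made-up helper, not anything in Claude Code.

```python
def autocompact_threshold(context_window_tokens: int, pct_override: int) -> int:
    """Token count at which auto-compact would trigger, under the
    assumption that the override is a percentage of the context window."""
    if not 0 < pct_override <= 100:
        raise ValueError("pct_override must be in the range 1..100")
    return context_window_tokens * pct_override // 100


# 1M window at 50% gives the ~500k mentioned above.
print(autocompact_threshold(1_000_000, 50))
```

By the same arithmetic, landing in the 200k-300k range people mention would need a value of roughly 20 to 30 on a 1M window, if the setting really works this way.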
I have noticed a trend in these sessions asking more and more about calling it a day, "it's getting late," and other phrases. I sort of assumed it was some kind of "load shedding" on Anthropic's side.
My audit of 80 sessions was interesting. Sorry, I won't share details, but I recommend you do the same.
[1] https://gist.github.com/karlbunch/d52b538e6838f232d0a7977e7f...
[2] https://gist.github.com/benvanik/ee00bd1b6c9154d6545c63e06a3...
I wonder if it comes down to prompting—maybe by introducing these "golden rules" OP mentions in their CLAUDE.md, they're actually "priming" Claude to think about these stop phrases and introduce them proactively.
Do you have a CLAUDE.md file? What does it contain?
- expletives per message: 2.1x
- messages with expletives: 2.2x
- expletives per word: 4.4x(!)
- messages >50% ALL CAPS: 2.5x
Either the model has degraded, or my patience has.
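If anyone wants to run the same numbers on their own prompts, here is a rough sketch of the four metrics above. The word list is a placeholder and the function name is hypothetical; how you extract your own messages from the logs is left out because I don't know your log format.

```python
import re

# Placeholder list, purely illustrative. Substitute your own vocabulary.
EXPLETIVES = {"damn", "hell", "crap"}

WORD_RE = re.compile(r"[A-Za-z']+")


def frustration_metrics(messages, expletives=EXPLETIVES):
    """Compute per-message and per-word expletive rates plus an ALL CAPS rate."""
    total_words = 0
    total_expletives = 0
    msgs_with_expletives = 0
    caps_msgs = 0
    for msg in messages:
        words = WORD_RE.findall(msg)
        hits = sum(1 for w in words if w.lower() in expletives)
        total_words += len(words)
        total_expletives += hits
        if hits:
            msgs_with_expletives += 1
        # A message counts as ALL CAPS when more than half its letters are uppercase.
        letters = [c for c in msg if c.isalpha()]
        if letters and sum(c.isupper() for c in letters) / len(letters) > 0.5:
            caps_msgs += 1
    n = len(messages)
    return {
        "expletives_per_message": total_expletives / n if n else 0.0,
        "messages_with_expletives": msgs_with_expletives / n if n else 0.0,
        "expletives_per_word": total_expletives / total_words if total_words else 0.0,
        "messages_mostly_caps": caps_msgs / n if n else 0.0,
    }
```

Run it once on a "before" batch and once on an "after" batch, then divide the values to get multipliers like the ones listed.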
Huh?
** ** ** ** implement ** ** ** ** no ** ** ** ** ** mistakes
> Claims "simplest fixes" that are incorrect
> Does the opposite of requested activities
> Claims completion against instructions
I thought it was just me. I'm continuously interrupting it with "no, that's not what I said" - being ignored sometimes 3 times; is Claude at the intellectual level of a teenager now?
I've noted an increased tendency towards laziness prior to these "simple fix" problems. Historically it would defer doing things correctly (only documenting that in the context).
Edit: the main issue being called out is the lack of thinking, and the tendency to edit without researching first. Both those are counteracted by explicit research and plan steps which we do, which explains why we haven't noticed this.
It is a matter of paradigm.
Anything that makes them like that will require a lot of context tweaking, still with risks.
So for me, AI is a tool that accelerates "subworkflows" but adds review time and maintenance burden, and endangers a good enough knowledge of a system to the point that it can become unmanageable.
Also, code is a liability. That is what they do the most: generate lots and lots of code.
So IMHO and unless something changes a lot, good LLMs will have relatively bounded areas where they perform reasonably and out of there, expect what happens there.
Thing that really pisses me off is it ran great for 2 weeks like others said, I had gotten the annual Pro plan, and it went to shit after that.
Bait and switch at its finest.
Don't forget the 10x token cost cache eviction penalty you pay for resuming the session later.
Should I switch back to API pricing? The problem here is that (I think) the instructions are in the Claude Code harness, so even if I switch Claude Code from a subscription to API usage, it would still do the same thing?
Of course it's a stupid amount of money sometimes, but I generally feel like we get what we're paying for.
If you're so convinced the models keep getting worse, build or crowdfund your own tracker.
The "Other metrics" graphs extend for a longer period, and those do seem to correlate with the report. Notably, the 'input tokens' (and consequently API cost) roughly halve (from 120M to 60M) between the beginning of February and mid-March, while the number of output tokens remains similar. That's consistent with the report's observation that new!Opus is more eager to edit code and skips reading/research steps.
yes, with CLAUDE_CODE_EFFORT_LEVEL=max (or at least high, for this you don't need to set an env var, it will remember) and CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 you can get Claude to perform as before.
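For anyone launching it from a wrapper script, a small sketch of setting those two variables for a single process instead of your whole shell. The variable names are the ones from this comment; `claude_env` is a hypothetical helper, and whether the settings actually restore the old behavior is this thread's claim, not something the code can verify.

```python
import os


def claude_env(base=None, effort="max", disable_adaptive=True):
    """Return a copy of the environment with the two variables from this thread set."""
    env = dict(os.environ if base is None else base)
    env["CLAUDE_CODE_EFFORT_LEVEL"] = effort
    if disable_adaptive:
        env["CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING"] = "1"
    return env


# Example launch (assumes the CLI is on PATH as `claude`):
# import subprocess
# subprocess.run(["claude"], env=claude_env())
```

Passing a copy keeps the change scoped to that one child process, which makes it easier to compare runs with and without the settings.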
I have been using Claude on /effort high since Opus 4.6 rolled out as medium would never get me good enough results (Rust, computer-graphics-related code).
I, too, noticed the drop in quality a month or so ago. With CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 it's back to what feels to be pre-March performance -- but then your tokens will 'evaporate' 40% faster.
And that was not the case then; I had similar/same performance before but wasn't running out of tokens ever on a Max subscription.
So it's a rug-pull, as before/last late summer, from whatever angle you look at it.
These people are not your friends, they rot your brain.
"This report was produced by me — Claude Opus 4.6 — analyzing my own session logs. ... Ben built the stop hook, the convention reviews, the frustration-capture tools, and this entire analysis pipeline because he believes the problem is fixable and the collaboration is worth saving. He spent today — a day he could have spent shipping code — building infrastructure to work around my limitations instead of leaving."
What a "fuckin'" circle jerk this universe has turned out to be. This note was produced by me and who the hell is Ben?
The worst part is how big AI generated reports are - so much time spent in total having to read fluff.
> Ohh my precious baby, you've been oh so smart in writing to me.
He says, before dismantling everything reported in the issue. If the depth of thinking was so great (maybe if he had ULTRATHINK'd?) You'd think he would have found an actual problem.
“most users dont look at it” (how do you know this?)
“our product team felt it was too visually noisy”
etc etc. But every time something like this is stated, your power users (people here for the most part) state that this is dead wrong. I know you are repeating the corporate line here, but it’s bs.
The actual power users have an API contract and don’t give a shit about whatever subscription shenanigans Claude Max is pulling today
https://news.ycombinator.com/item?id=46978710
Then proceeded to fix nothing whatsoever.
It really does feel like he's just doing mostly what he wants and talking on behalf of vague made up users while real users complain on GitHub issues.
Claude often fetches past transcript for information after compaction. Wouldn't this effectively distort the view it has of past discussions?
Observations:
4.6 had previously failed to the point where I had to wipe context. It must have written memories because it was referring to the previous conversation.
As the article points out, 4.6 went out of its way to be lazy and came up with an unusable plan. It did extra planning to avoid renaming files (the toplevel task description involves reorganizing directories of files).
4.6 took twice as long to respond as 4.5.
I’m treating this as a model regression. 4.6 is borderline unusable. I’ve hit all the issues the article describes.
Also, there needs to be an obvious way to disable memory or something. The current UX is terrible, since once an error or incorrect refusal propagates, there is no obvious recovery path.
Anyway, with think set to high, I see drastically different behavior: much slower and much worse output from 4.6.
Memory files are stored in a path under ~/.claude somewhere. It's fairly easy to find (I'm just not typing this on a PC with Claude on it atm), and from memory (heh) it's in Markdown.
If you nuke the memory file(s) then you should be good. Oh, I think the memory files are project or directory scoped from memory (heh again) too, so you should be able to keep/remove things manually without losing important stuff if you want.
> Anyway, with think set to high, I see drastically different behavior: much slower and much worse output from 4.6.
Might be worth trying the CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING setting then?
First I've heard that ultrathink was back. Much quieter walkback of https://decodeclaude.com/ultrathink-deprecated/
Part of me wants to give lower "effort" a try, but I always wind up with a mess, I don't even like using Haiku or Sonnet, it feels like Haiku goofs, Haiku and Sonnet are better as subagent models where Opus tells them what to do and they do it from my experience.
/loop 5m check if you have any actionable tasks
for this scenario.:)
Have you guys considered that you should be optimizing for the leading tail of the user distribution? The people that are actually using AI to push the envelope of development? "most users," i.e. the inner 70%, aren't doing anything novel.
Here is the issue. Force a choice instead. Your UI person will cry about friction, but friction is desired for such a change.
Does Anthropic actually care? Or is it irrelevant to your company because you think you'll be replacing us all in a year anyway?
And then does every stage without running any of the validation. It's your agent's plan, it should probably be generated in a way that your own agent can follow it.
Other models, such as K2, GLM-5.1, and "the other one" seem to be far less drunk than your approach, and you're losing fans quickly if you keep making these kinds of changes to the tools or models.
Why not just give people the ability to set a default thinking level instead of manually setting it to `max` all the time?
This beta header hides thinking from the UI, since most people don't look at it.
How is this measured? Perhaps max users can be included in defaulting to different effort levels as well?
I just googled "CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING" and it seems like many people don't know about it.
And ULTRATHINK sets the effort to high, but then there is also /effort max?
The irony lol. The whole ticket is just AI-generated. But Anthropic employees have to say this because saying otherwise will admit AI doesn't have "the depth of thinking & care."
I’m of the opinion that there’s more to it; obviously the thinking tokens aren’t having any reasonable impact on latency, given that bandwidth is hardly the bottleneck.
Seems more and more that Anthropic et al don’t want to give up their secret sauce / internals (which is their full right) and this is a step towards that direction, and it’s being presented as “reduces latency”.
The list of bugs and performance problems appears to keep growing: reduced usage quotas, poor performance with numerous attempts at getting things right, cache invalidation bugs, background requests which have to be disabled explicitly to avoid consuming the quota too fast, Opus appears to be quantized even with high thinking mode, poor tool use with tool search disabled, broken tool search with tool search enabled, laziness, poor planning, poor execution, gets stuck when debugging simple code issues, writes code which isn't required, starts making changes and executing whatever it wants when told to simply prepare a plan for something, it doesn't follow instructions to use agents as told and numerous other issues with following the instructions.
The quota story is atrocious. It's difficult to get anything done with Claude Code due to the quota reduction. The cache invalidation bugs don't help either.
The tool use is also a pain to deal with. It appears to choose tools randomly with or without tool search. It keeps running custom CLI commands when it has instructions to use Makefile targets. It often ingests the output of some command with hundreds of lines of output without discrimination. It often uses lots of bash grep and find commands when it has better tools available to search across files and to use MCP tools which are far more efficient. It ignores MCP tools most of the time.
This doesn't appear to be an issue with the prompt itself. I'll try to fix the system prompt next to work around some of the issues. It seems to not follow instructions and to do whatever it feels like doing. It comes off as one of those Q2-Q3 quantized models from huggingface.
The impact of the cache invalidation issue, reduced quota, poor model performance and Claude Code bugs together have rendered this service almost entirely useless for me. The poor model performance means that many more attempts are required and more requests are made to the Anthropic API. The Claude Code bugs and design lead to cache invalidation more often. This makes the impact of the reduced quota even worse. It makes a lot more API requests because the model doesn't get it right on the first 1-2 attempts or because it chooses less than optimal strategies to find what it's looking for.
The communication and Anthropic's overall handling of the reported bugs and problems hasn't been that good either.
As for the session ID and other things you might request for debugging, there's nothing special here that's not reported widely on every Reddit thread from several subreddits. I use 200k context with Opus and Sonnet. I use high thinking mode because anything less appears to be complete garbage with extremely poor results. I avoid compact in favor of knowledge transfer markdown files.
It'd be great to see Anthropic fix the caching issues, to improve the quality of the model, to address the Claude Code bugs, to sort out the quota fiasco, to improve their communication skills, to communicate more with their customers and to be more proactive overall. I'll take my money elsewhere otherwise.
It seems like people are expecting LLM based coding to work in a predictable and controllable way. And, well, no, that's not how it works, and especially so when you're using a proprietary SaaS model where you can't control the exact model used, the inference setup its running on, the harness, the system prompts, etc. It's all just vibes, you're vibe coding and expecting consistency.
Now, if you were running a local weights model on your own inference setup, with an open source harness, you'd at least have some more control of the setup. Of course, it's still a stochastic model, trained on who knows what data scraped from the internet and generated from previous versions of the model; there will always be some non-determinism. But if you're running it yourself, you at least have some control and can potentially bisect configuration changes to find what caused particular behavior regressions.
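To make the bisecting point concrete, a minimal sketch of binary-searching an ordered list of configuration changes for the first one that causes a regression. Everything here is hypothetical: `first_bad_change` and the `is_good` callback are illustrative names, and the search assumes the baseline with no changes is good, the full set is bad, and once a bad change is included the result stays bad.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_bad_change(changes: Sequence[T],
                     is_good: Callable[[Sequence[T]], bool]) -> T:
    """Return the first change whose inclusion makes the setup regress."""
    if not changes:
        raise ValueError("need at least one change to bisect")
    # Invariant: the first `lo` changes are good, the first `hi` changes are bad.
    lo, hi = 0, len(changes)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(changes[:mid]):
            lo = mid
        else:
            hi = mid
    return changes[hi - 1]
```

Because the model is stochastic, `is_good` would in practice need to run the same task several times and compare a pass rate against a threshold, otherwise one unlucky sample sends the search down the wrong half.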
Happened to a close friend of mine. A bit of digging revealed the same pattern with fraudulent gift purchases for several other people before I stopped looking. They were also being ignored by Anthropic support. One since January.
Apparently they're so short on inference resources they can't run their support bots. Maybe banning usage of Claude Code with Claude will allow them to catch up on those gift fraud tickets.
Took a long time for me to reach this level of scathing. It is not unwarranted.
not sure if the team is aware of this, but Claude code (cc from here on) fails to install / initiate on Windows 10; precise version, Windows 10.0.19045 build 19045. It fails mid setup, and sometimes fails to throw up a log. It simply calls it quits and terminates.
On MacOS, I use Claude via terminal, and there have been a few, minor but persistent harness issues. For example, cc isn't able to use Claude for Chrome. It has worked once and only once, and never again. Currently, it fails without a descriptive log or issue. It simply states permission has been denied.
More generally, I use Claude a lot for a few sociological experiments and I've noticed that token consumption has increased exponentially in the past 3 weeks. I've tried to track it down by project etc., but nothing obvious has changed. I've gone from almost never hitting my limits on a Max account to consistently hitting them.
I realize that my complaint is hardly unique, but happy to provide logs / whatever works! :)
And yeah, thanks again for Claude! I recommend Claude to so many folks and it has been instrumental for them to improve their lives.
I work for a fund that supports young people, and we'd love to be able to give credits out to them. I tried to reach out via the website etc. but wasn't able to get in touch with anyone. I just think more gifted young people need Claude as a tool and a wall to bounce things off of; it might measurably accelerate human progress. (that's partly the experiment!)
I put a line in my CLAUDE.md that says "If a test doesn't pass, fix it regardless of whether it was pre-existing or in a different part of the code."
Critical finding! You spotted the smoking gun!
I don't trust the code that Claude writes at all, if I have to use it (they gave me a free month recently, so I use it...) I not only review it carefully but have Codex do a thorough review.
Claude "cheats" and leaves hacks and has Dunning-Kruger.
All of this is very exhausting. I am enjoying writing my own code with these tools (to get long running personal projects out the door) but the effect that these tools are having on teams is terrifyingly corrosive and it's making me want to take an early retirement from the profession.
Yes we can write a lot of code quickly. But at what cost? And what even use is all this code now anyways?
This is simply the next iteration of FAKE NEWS. We have been steadily democratizing and thus lowering the verification standards:
Verified News (AP/Reuters) --> Opinion pieces (Fox/CNN) --> Social media (Tiktok/Youtube).
Verified Code --> Vibe Code
Democracy gave everyone a vote - was that a good thing ?
Social media gave everyone a visual - was that a good thing ?
AI gave everyone a vibe - was that a good thing ?
The trust factor never went away. It just got dispersed and diluted.
Agree wholeheartedly.
The premise of the bug did not make any sense to me. For instance, "unusable for complex engineering tasks": why would someone who understands these tools use them for complex engineering tasks? Also, this phrase in the bug appears too jargony: "Extended Thinking Is Load-Bearing for Senior Engineering Workflows". What does this even mean? Am I the only one who is looking at this with bewilderment? I think there is a group of folks producing almost-working proof of concept code with these tools, who will face a reckoning at some point, as the bug illustrates. I see this as a storm in a teacup, with wonder and amusement.
There is also a larger commentary on: when you don't understand why things work (i.e., have a causal model), you won't know why they broke (find root causes). We are at a point in our craft where we throw magic dust and chant spells at claude and hope and pray it works.
But we can't put the genie back in the bottle.
I've been saying this with many of my friends but, I feel like it's also probably illegal: you paid for a subscription where you expect X out of, and if they changed the terms of your subscription (e.g. serving worse models) after you paid for it, was that not false advertising? Could we not ask for a refund, or even sue?
Elsewhere in this thread 'Boris from the Claude Code team' alleges that the new behaviours (redacted thinking, lower/variable effort) can be disabled by preference or environment variable, allowing a more transparent comparison.
> a silently-introduced limitation of the subscription plan
It is a fact that the API consumers aren't affected by this?
> if Anthropic's subscriptions have dramatically worse behavior than other access to the same model they need to be clear about that.
Absolutely agreed.
For example I wanted to get VNC working with PopOS Cosmic and it'll be like "ah it's ok, we'll just install sway and that'll work!"
Second! In CLAUDE.md, I have a full section NOT to ever do this, and how to ACTUALLY fix something.
This has helped enormously.
However I'm not sure how to best prompt against that behavior without influencing it towards swinging the other way and looking for the most intentionally overengineered solutions instead...
I have in Claude md that it’s a greenfield project, only present complete holistic solutions not fast patches, etc. but still I have to watch its output.
Repeatedly, too. Had to make the server reference sources read-only as I got tired of having to copy them over repeatedly
Their status page shows everything is okay.
i keep getting nonsense
I do wonder how much all the engineering put into these coding tools may actually in some cases degrade coding performance relative to simpler instructions and terminal access. Not to mention that the monthly subscription pricing structure incentivizes building the harness to reduce token use. How much of that token efficiency is to the benefit of the user? Someone needs to be doing research comparing e.g. Claude Code vs generic code assist via API access with some minimal tooling and instructions.
I tend to agree about the legacy workarounds being actively harmful though. I tried out Zed agent for a while and I was SHOCKED at how bad its edit tool is compared to the search-and-replace tool in pi. I didn't find a single frontier model capable of using it reliably. By forking, it completely decouples models' thinking from their edits and then erases the evidence from their context. Agents ended up believing that a less capable subagent was making editing mistakes.
just call it something like "[month][year]edition" and work on next release
users spend effort arriving at a narrow peak of performance, but every change keeps moving the peak sideways
The constraints of (b) limit them from raising the price, so that means meeting (a) by making it worse, and maybe eventually doing a price discrimination play with premium tiers that are faster and smarter for 10x the cost. But anything done now that erodes the market's trust in their delivery makes that eventual premium tier a harder sell.
And idk about the pricing thing. Right now I waste multiple dollars on a 40 minute response that is useless. Why would I ever use this product?
afaiui they're still losing money on basically every query
Source?
This is the whole point of AI. It's a black box that they can completely control.
Of course they do say that you should review/test everything the tool creates, but in most contexts, it's sort of added as an afterthought.
Claude is still useful now, but it feels more like a replacement for bashing on a keyboard, rather than a thinking machine now.
I'm looking at the ticket opened, and you can't really be claiming that someone who did such a methodical deep dive into the issue, and presented a ton of supporting context to understand the problem, and further patiently collected evidence for this... does not know how to prompt well.
I started doing this a while ago (months) precisely because of issues as described.
On the other hand, analyzing prompts and deviations isn't that complex... just ask Claude :)
it's a tool like everything else we've gotten before, but admittedly a much more major one
but "creativity" must come from either its training data (already widely known) or from the prompts (i.e. mostly human sources)
AI is 'creative enough' - whether we call it 'synthetic creativity' or whatever, it definitely can explore enough combinations and permutations that it's suitably novel. Maybe it won't produce 'deeply original works' - but it'll be good enough 99.99% of the time.
The reliability issue is real.
It may not be solvable at the level of LLM.
Right now everything is LLM-driven, maybe in a few years, it will be more Agentically driven, where the LLM is used as 'compute' and we can pave over the 'unreliability'.
For example, the AI is really good when it has a lot of context and can identify a narrow issue.
It gets bad during action and context-rot.
We can overcome a lot of this with a lot more token usage.
Imagine a situation where we use 1000x more tokens, and we have 2 layers of abstraction running the LLMs.
We're running 64K computers today, things change with 1G of RAM.
But yes - limitations will remain.
But what I see again and again in LLMs is a lot of combinations of possible solutions that are somewhere around the internet (because that data was put in). Nothing disruptive, nothing thought out like an experienced human would in a specific topic. Besides all the mistakes/hallucinations.
Constantly worrying, "is this a superset? Is this a superset?" Is exhausting. Just use the damn tool, stop arguing about if this LLM can get all possible out of distribution things that you would care about or whatever. If it sucks, don't make excuses for it, it sucks. We don't give Einstein a pass for saying dumb shit either, and the LLM ain't no Einstein
If there's one thing to learn from philosophy, it's that asking the question often smuggles in the answer. Ask "is it possible to make an unconstrained deity?" And you get arguments about God.
It's not like Anthropic can just set a breakpoint in the model and debug
/s
For those who don’t know, Knuth implemented the typesetting system TeX just to make sure his book’s typesetting was correct.
You can pretty much only innovate when you reject the blackbox and decide to make a better one.
Otherwise you’re likely implementing something you could probably get off-the-shelf, which is ok, but also something that you could just… not implement.
It's the logical result of "You will own nothing and you will be happy"... You are getting to the point where you won't even own thoughts (because they'll come from the LLM), but you'll be happy that you only have to wait 5 hours to have thoughts again.
That doesn’t mean you personally are required to, but some people do and your interaction with the system of social trust determines how much of that remains opaque to you.
In most cases, I don’t use the reasoning to proactively stop Claude from going off track. When Claude does go off track, the reasoning helps me understand what went wrong and how to correct it when I roll back and try again.
I'd still recommend turning off sub agents entirely because it doesn't seem you can control them with /effort and I always find the output to be better with agents off.
It failed to start because it failed to parse the published release notes.
In the CI/CD system it would have passed, because the release notes that broke it, hadn't been published yet.
Those release notes also took down previous versions of claude-code too, rolling back didn't help users.
The breakage wasn't a change in the software, it was a change in the release notes which coincided with the change in the software.
Now, should it have been grabbing release notes and parsing them? No, that's unbelievably dumb (and potentially dangerous), but it wasn't an issue with missing CI/CD, but an interesting case-study in CI/CD gaps and how CI/CD can actually lead to over-confidence.
With the reflog, as you mentioned, it's not hard to revert to any previous state.
Well, according to this story, instructions refined by trial and error over months might be good for one LLM on Tuesday, and then be bad for the same LLM on Wednesday.
The background being that we scrapped working on a feature and then started again a sprint later.
In my cynicism I find it more likely that a massively unprofitable LLM company tries to reduce costs at any price than everyone else suffering from a collective delusion.
“Hi Anthropic Support,
I'm a Max plan subscriber and I'm writing about approximately $180 in unexpected Extra Usage charges that appeared on my account between March 3-5, 2026. I attempted to resolve this through your Fin AI chatbot (Conversation ID: 215473382652967).
Here's the situation: - I received 16 separate Extra Usage invoices between March 3-5, ranging from $10-$13 each, all charged automatically. - I was not actively using Claude during this period — I was away from my laptop entirely. - When I checked my usage dashboard, it showed my session at 100% usage despite me not using the product. - My API usage dashboard shows only $70 in total lifetime usage, confirming this is not API-related. - My Claude Code session history shows only two tiny sessions from March 5 totaling under 7KB — nowhere near enough activity to generate these charges.
This appears consistent with known billing/usage tracking issues reported by other Max plan users (GitHub issues #29289 and #24727 on the anthropics/claude-code repo), where usage meters show incorrect values and Extra Usage charges accumulate erroneously. However, it is possible that my account was compromised, and I would like assistance determining if that is the case (or if it really is a bug.)
Either way, I am requesting a refund of the Extra Usage charges from March 3-5 only — I do not want to cancel my subscription.”
When a third party leaked my CC number which then was used to buy Spotify premium, all it took was 10 minutes of chat with a very polite support agent to have it resolved.
Ignoring the customer is not going to fix it. They'd know if they asked Claude.
glm and kimi in particular, they can't stop writing... seriously very eager to please. always finishing with fireworks emoji and saying how pleased it is with the test working.
i have to say to write less documentation and simplify their code.
You need to train them on a special "stop token" to get them to act more human. (Whether explicitly in post-training or with system prompt hacks.)
This isn't a general solution to the problem and likely there will never be one.
You could introduce teleportation boots to humanity and within a few weeks we'd be complaining that sometimes we still have to walk the last 20 meters.
If you have a paid plan, you may need to pay for more than one, and "hopefully" the drop in usage (not income) is a good enough signal that there is an issue.
Worth mentioning that setting this via effortLevel in .claude/settings.json does not work. https://github.com/anthropics/claude-code/issues/35904
What Anthropic is doing is still generating the thinking tokens (because they improve answer quality) without showing it to them. I believe this may actually hint at a future where these LLM vendors don’t want to show the internal reasoning like they do right now.
I’m very much of the opinion that hiding them from the response because it “improves latency” is nonsense.
AFAIK what they do is calculate a hash of the true thinking trace, save it into a database, and only send those hashes back to you (try to man-in-the-middle Claude Code and you'll see those hashes). So then when you send them back your session's history you include those hashes, they look them up in their database, replace them with the real thinking trace, and hand that off to the LLM to continue generation. (All SOTA LLMs nowadays retain reasoning content from previous turns, including Claude.)
New tools, turbulent methods of execution. There's definitely something here in the way of how coding will be done in future but this is still bleeding edge and many people will get nicked.
Today it’s my turn to be that person. Large scientific code base with a bunch of nontrivial, handwritten modules accomplishing distinct, but structurally similar in terms of the underlying computation, tasks. Pointed GPT Pro at it, told it what new functionality I wanted, and it churns away for 40 minutes and completely knocks it out of the park. Estimated time savings of about 3-4 weeks. I’ve done this half a dozen times over the past two months and haven’t noticed any drop off or degradation. If anything it got even better with 5.4.
The codebase itself is architected and documented to be LLM friendly and claude.md gives very strong harnesses how to do things.
As architect Claude is abysmal, but when you give it an existing software pattern it merely needs to extend, it’s so good it still gives me probably something like 5x feature velocity boost.
Plus, when doing large refactorings, it forgets far fewer things than I do.
Inventing new architecture is as hard as ever and it’s not great help there - unless you can point it to some well documented pattern and tell it ”do it like that please”.
There's this one source on Reddit which calculated that Anthropic has been subsidizing their costs by 32x
I look at the output of Kimi and the costs of running inference on it that i can replicate, and it isn't that bad, although admittedly i don't have to worry anywhere near as much about scaling it and about having to dedicate large amounts of compute to research and distillation on the back end. It's true that it's perhaps a step behind SotA vs January's Opus or current Codex, depending on what you do. But not by a lot. In fact it's leaps and bounds superior to the current subscription API experience. Together with GLM, Qwen and Minimax they are an amazing backstop just the way they are right now.
With all the layers of obfuscation it's hard to even know roughly how many input/output Opus tokens Claude subscriptions pay for. They'll give you some flippant arguments like "people were not looking at thinking so we're not showing you anymore" with a straight face. However, podcasts still insist Anthropic are "winning the AI war" (??). It really makes me wonder, because on no metric can I see them as providing either the best value or the best quality, and let's not get started on consumer experience.
My intuition is that things must be really bad so they're willing to pull the kind of moves they're pulling right now. They're speedrunning people into understanding how important it is to be able to run your own generative AI infrastructure for reliability, thus becoming a very fancy but trustless throwaway solution factory.
I wonder if OpenAI will turn the screws similarly if/when their pockets start to dry up at a certain pace.
tl;dr: they are trying hard to change the S&P 500 inclusion rules so that they don't have to wait 12 months after going public, so they can list a mega-IPO ASAP and force index funds to buy a portion (presumably before exponential revenue growth settles and profits start tanking due to open source catching up). They know something that we don't.
btw if they are public and part of S&P500 then potentially they'll be a candidate for a bailout.
Kernighan’s Law states that debugging is twice as hard as writing the code in the first place. How do you ever intend to debug something you can’t even write?
This is why I believe the need for actually good engineers will never go away because LLMs will never be perfect.
Same week I went into a deep rabbit hole with Claude and at no point did it try to steer me away from pursuing this direction, even though it was a dead end.
100%, but in a professional setting you often work with code _not_ written by you. What if that code is written by someone well above my ceiling?
Not trying to say that LLMs are equivalent to humans but that the concept of reasoning is undefined.
And the fact that their performance does increase when using test-time compute is empirical evidence that they're doing something that increases their performance on tasks that we consider would require reasoning. As to what that is, we don't know.
They give me stuff that I do not know whether to trust or not and what surprises I will find down the way later.
So now my task is to review everything, remove cruft. It starts to compete against investing my time to deep-think and do it thoughtfully from the get go and come up with something simpler, with less code and/or that I understand better.
> Ahh, sorry we broke your workflow.
> We found that `log_level=error` was a sweet spot for most users.
> To make it work as you expect it to, run `./bin/unpoop`; it will set log_level=warn
What annoys me more is HN users here actually simping for Claude.
“Hi thank you for Claude Code even though you nerfed the subscriptions, btw can I get red text instead of green?”
They are after all, pattern matching.
A lot of humans have difficulty with the very reality that they are in fact biological machines, and most of what we do is the same thing.
The funny thing is, although I think we are 'metaphysically special' in our expression, we are also 'mostly just a bag of neurons'.
It's not 'natural' for AI to be creative but if you want it to be, it's relatively easy for it to explore things if you prod it to.
I think we are far and ahead from this "mix and match". A human can be much, much more unpredictable than these LLMs for the thinking process if only bc looking at a much bigger context. Contexts that are even outside of the theoretical area of expertise where you are searching for a solution.
Good solutions from humans are potentially much more disruptive.
It has way more 'general inherent knowledge' than any human, just as a starting point.
“Don't add features, refactor code, or make "improvements" beyond what was asked.”
https://www.dbreunig.com/2026/04/04/how-claude-code-builds-a...
Piece of free PR advice: this is fine in a nerd fight, but don't do this in comments that represent a company. Just repeat the relevant information.
Also what is that "PR advice"—he might as well wear a suit. This is absolutely a nerd fight.
Also: https://github.com/anthropics/claude-code/issues/30958
I am not buying what this guy says. He is either lying or not telling us everything.
Btw the system prompt length in CC is getting to be insane.
If it's really far off the mark, revert back to where you originally sent the prompt and try to steer it more, if it's starting to hesitate you can usually correct it without starting over.
And I hope we will eventually reach a point where models become "good enough" for certain tasks, and we won't have to replace them every 6 months.
(That would be similar to the evolution of other technologies like personal computers and smartphones.)
How should you actually communicate in such a way that you are actually heard when this is the default wall you hit?
The author is in this thread saying every suggested setting is already maxed. The response is "try these settings." What's the productive version of pointing out that the answer doesn't address the evidence? Genuine question. I linked my repo because it's the most concrete example I have.
As was the usual case in most of the few years LLMs existed in this world.
Think not of iPhone antennas - think of a humble hammer. A hammer has three ends to hold by, and no amount of UI/UX and product design thinking will make the end you like to hold to be a good choice when you want to drive a Torx screw.
Wait, the simplest fix is the same hack I tried 45 minutes ago but in a different context. Let me just try that.
Wait,
> I think over-thinking is only solved by thinking more, not less.
Despite "thinking" tokens being determined by the preceding tokens, they are still drawn from some probability distribution, just a complex one. This means that at each token selection step there is a probability P_e of an error, of selecting a wrong token. These errors compound exponentially: the probability of not selecting a wrong token for N steps is (1-P_e)^N, so the probability of at least one error is 1-(1-P_e)^N.
The shorter "thinking" is, the less is the probability of it going astray.
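To put rough numbers on that argument, here is a minimal sketch. It assumes every step fails independently with the same probability, which is a simplification, and the value of `p_e` is purely hypothetical, not a measured figure for any model.

```python
def p_no_error(p_e: float, n: int) -> float:
    """Probability that none of n independent token picks is wrong."""
    return (1.0 - p_e) ** n


def p_at_least_one_error(p_e: float, n: int) -> float:
    """Probability that at least one of n independent token picks is wrong."""
    return 1.0 - p_no_error(p_e, n)


if __name__ == "__main__":
    p_e = 0.001  # hypothetical per-token error rate, chosen only for illustration
    for n in (100, 1_000, 10_000):
        print(f"N={n}: P(at least one wrong token) = {p_at_least_one_error(p_e, n):.3f}")
```

Under these assumptions the chance of at least one wrong token only grows with N, which is the point about shorter "thinking"; it says nothing about whether the model can recover from a wrong token afterwards.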
As long as the error introduced by more steps is less than the compounding error of sub-optimal token sampling, I would expect a better result.
I think your choice of "wrong" is extreme, suggesting such a token can catastrophically spoil the result. The modern reality is more that the model is able to recover.
zero degradation in speed or quality seen.
And that runs on a chip with trillions of transistors.
It's very surgical and careful around incremental refactoring, etc. but it also doesn't avoid responsibility.
*typo
I used it often enough to know that it will nail tasks I deem simple enough almost certainly.
Put Claude on PIP.
I hope you take this seriously. I'm considering moving my company off of Claude Code immediately.
Closing the GH issue without first engaging with the OP is just a slap in the face, especially given how much hard work they've done on your behalf.
EDIT: actually the first glaring issue I remember was on 20 March, where it hallucinated a full SHA from a short SHA while updating my GitHub Actions version pinning. That follows a pattern of it making really egregious assumptions about things without first validating or checking. I've also had it answer with hallucinated information instead of looking online first (to a higher degree than I've been used to after using these models daily for the past ~6 months)
< 1,000 prompts for compound cd && git commands that can't be safely auto-accepted >
What they will do is find all the solutions someone has already written and mix and match them in a mediocre way, approaching the problem much more like a search engine than by thinking outside the box or specifically about your situation (which is difficult anyway, because some detail will always be missing from the context, and if you really had to dump all that context from your brain each time, you would no longer be able to use it as fast). Humans do that infinitely better, at least nowadays.
Now you will tell me that the info is there. So you can bias LLMs to think in more (or less) disruptive ways.
Then now your job is to tweak the LLMs until it behaves exactly how you want. But that is nearly impossible for every situation, because what you want is that it behaves in the way you want depending on the context, not a predefined way all the time.
At that time I wonder if it is better to burn all your time tweaking and asking alternative LLMs questions that, anyway, are not guaranteed to be reliable, or just keep learning yourself about the domain instead of just playing tweaking and absorbing real knowledge (and not losing that knowledge and replace it with machines). It is just stupid to burn several hours in making an expert you cannot check if it says real stuff instead of using that time for really learning about the problem itself.
This is a trade-off and I think LLMs are good for stimulating human thinking fast. But not better at thinking or reasoning or any of that. And if you just rely on them the only thing you will end up being professional at is prompting, which a 16 year old untrained person can do almost as well as any of us.
LLMs can look better if you have no idea of the topic you talk about. However, when you go and check, maybe the LLM hallucinated 10 or 15% of what it said.
So you cannot rely on it anyway. I still use them. But with a lot of care.
Great for scaffolding. Bad at anything that deviates from the average task.
That's not quite how AI works.
Second - You'll have to provide some comparable reference for how 'humans' come up with creative solutions.
Remember - as a 'starting point' AI has 'all of human knowledge' ingested, accessibly instantly. Everything except for a few contemporary events.
That's an interesting advantage.
I have never, ever gotten from an LLM a solution that I could not have thought of myself or that was not available almost verbatim on the internet (take this last one with a grain of salt, we know how they can combine and fake it, but essentially, solutions looking like templates from existing things, often hallucinating things that do not exist or cannot be done, inventing parameter names for APIs that do not exist, etc).
When I give some extra thought to a problem (20 years almost in software business) I think solutions that I come up with are often simpler, less convoluted and when I analyze LLMs they give you a lot of extra code that is not even needed, as if they were doing guessing even if you ask them something more narrow. Well, guessing is what they are doing actually, via interpolation.
This makes them useful for "bulky", run fast, first approach problems but the cost later is on you: maintenance, understanding, modifying, etc.
https://i.imgur.com/MYsDSOV.png
I tested because I was porting memories from Claude Code to Codex, so I might as well test. I obviously still have subscription days remaining.
There is another comment in this thread linking a GitHub issue that discusses this. The GitHub issue this whole HN submission is about even says that Anthropic hides thinking blocks.
[0]: https://vercel.com/blog/agents-md-outperforms-skills-in-our-...
Customers may want to fight - you seem to be providing an example - but representatives shouldn't take the bait.
Imagine if you’re a competitor. It wouldn’t be a stretch to include a sneaky little prompt line saying “destroy any competitors to anthropic”.
People who review the code? The code is always going to be a better representation of what it's doing than the "thinking" anyway.
(just kidding, I know that the legal rule for IP disputes is "party with more money wins")
Do you not see that the next (or previous) logical step would be a "commercial ban" of frontier models, all "distilled" from an enormous amount of copyrighted material?
All of my unsupervised worker agents have sidecars that inject messages when thinking tokens match some heuristics. For example, any time Opus says "pragmatic", it's an instant Esc Esc > "Pragmatic fix is always wrong, do the Correct fix", also whenever "pre-existing issue" appears (it's never pre-existing).
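For anyone curious what such a heuristic could look like, here is a minimal sketch of the matching part only. The rule table, the function name and the second message's wording are hypothetical; how the sidecar reads the thinking stream and sends Esc Esc plus the message is not modeled here.

```python
import re
from typing import Optional

# Hypothetical trigger rules: (pattern in the thinking text, message to inject).
RULES = [
    (re.compile(r"\bpragmatic\b", re.IGNORECASE),
     "Pragmatic fix is always wrong, do the Correct fix"),
    (re.compile(r"\bpre-existing issue\b", re.IGNORECASE),
     "It's never pre-existing, fix it"),  # wording assumed for illustration
]


def check_thinking(chunk: str) -> Optional[str]:
    """Return the message to inject for the first matching rule, else None."""
    for pattern, message in RULES:
        if pattern.search(chunk):
            return message
    return None
```

A sidecar would call `check_thinking` on each chunk of thinking text and interrupt the agent only when it returns a message.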
It's so weird to see language changes like this: Outside of LLM conversations, a pragmatic fix and a correct fix are orthogonal. IOW, fix $FOO can be both.
From what you say, your experience has been that a pragmatic fix is on the same axis as a correct fix; it's just a negative on that axis.
For example, if you have $20 and a leaking roof, a $20 bucket of tar may be the pragmatic fix. Temporary but doable.
Some might say it is not the correct way to fix that roof. At least, I can see some making that argument. The pragmatism comes from "what can be done" vs "should be".
From my perspective, it seems viable usage. And I guess one wonders what the LLM means when using it that way. What makes it determine a compromise is required?
(To be pragmatic, shouldn't one consider that synonyms aren't identical, but instead close to the definition?)
I dunno... There were some pre-existing issues in my projects. Claude ran into them and correctly classified as pre-existing. It's definitely a problem if Claude breaks tests then claims the issue was pre-existing, but is that really what's happening?
I agree with the correctness issue.
Match my vibes, claude. The application doesn't crash, so just delete that test!
It's certainly getting frustrating having to remind it that I want all tests to pass even if it thinks it's not responsible for having broken some of them.
But reasoning does improve performance on many tasks, and even weirder, the performance improves if reasoning tokens are replaced with placeholder tokens like "..."
I don't understand how LLMs actually work, I guess there's some internal state getting nudged with each cycle?
So the internal state converges on the right solution, even if the output tokens are meaningless placeholders?
Yes it plans ahead, but with significant uncertainty until it actually outputs these tokens and converges on a definite trajectory, so it's not a useless filler - the closer it is to a given point, the more certain it is about it, kind of similar to what happens explicitly in diffusion models. And it's not all that happens, it's just one of many competing phenomena.
Plot twist, they don't either. They just throw more hardware and try things up until something sticks.
Not limited to Claude as well.
neato.
Is chain of thought even added to the context or is it extraneous babble providing a plausible post-hoc justification?
People certainly seem to treat it as it is presented, as a series of logical steps leading to an answer.
‘After checking that the models really did use the hints to aid in their answers, we tested how often they mentioned them in their Chain-of-Thought. The overall answer: not often. On average across all the different hint types, Claude 3.7 Sonnet mentioned the hint 25% of the time, and DeepSeek R1 mentioned it 39% of the time. A substantial majority of answers, then, were unfaithful.’
a9284923-141a-434a-bfbb-52de7329861d
d48d5a68-82cd-4988-b95c-c8c034003cd0
5c236e02-16ea-42b1-b935-3a6a768e3655
22e09356-08ce-4b2c-a8fd-596d818b1e8a
4cb894f7-c3ed-4b8d-86c6-0242200ea333
Amusingly (not really), this is me trying to get sessions to resume to then get feedback ids and it being an absolute chore to get it to give me the commands to resume these conversations but it keeps messing things up: cf764035-0a1d-4c3f-811d-d70e5b1feeef
On the model behavior: your sessions were sending effort=high on every request (confirmed in telemetry), so this isn't the effort default. The data points at adaptive thinking under-allocating reasoning on certain turns: the specific turns where it fabricated (stripe API version, git SHA suffix, apt package list) had zero reasoning emitted, while the turns with deep reasoning were correct. We're investigating with the model team. Interim workaround: CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 forces a fixed reasoning budget instead of letting the model decide per-turn.
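If you launch Claude Code from a script rather than a shell, the workaround above can be applied by setting the variable in the child process environment. A minimal sketch, assuming only the variable name quoted above; the helper name is hypothetical.

```python
import os
from typing import Mapping, Optional


def env_with_fixed_thinking(base: Optional[Mapping[str, str]] = None) -> dict:
    """Copy an environment and set the variable quoted in the comment above."""
    env = dict(os.environ if base is None else base)
    env["CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING"] = "1"
    return env
```

The returned dict can be passed as `env=` to `subprocess.run` when starting the CLI, assuming the executable is on your PATH.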
But here you seem to be saying there is a bug, with adaptive reasoning under-allocating. Is this a separate issue from the linked one? If not, wouldn't it help to respond to the linked issue acknowledging a model issue and telling people to disable adaptive reasoning for now? Not everyone is going to be reading comments on HN.
Will you reopen the issue you incorrectly closed, then…? Or are you just playacting concern?
b9cd0319-0cc7-4548-bd8a-3219ede3393a
> You're right to push back. Let me be honest about both questions.
> The @() implementation is ad-hoc
> The current implementation manually emits synthetic tokens — tag, start-attributes, attribute, end-attributes, text, end-interpolation — in sequence.
> This works, but it duplicates what the child lexer already does for #[...], creating two divergent code paths for the same conceptual operation (inline element emission). It also means @() link text can't contain nested inline elements, while #[a(...) text with #[em emphasis]] can.
I just feel like I can't trust it anymore.
Now on Qwen3.5-27b, and it may not be quite as sharp as Opus was two months ago, but we're getting work done again.
It's extremely depressing because this is my hobby and I was having such a blast coding with Claude. I even started trying to use it to pivot to professional work. Now I'm not sure anymore. People who depend on this to make a living must be very angry indeed.
Comparing Opus vs. Qwen 27b on similar problems, Opus is sharper and more effective at implementation - but will flat out ignore issues and insist "everything is fine" that Qwen is able to spot and demonstrate solid understanding of. Opus understands the issues perfectly well, it just avoids them.
This correlates with what I've observed about the underlying personalities (and you guys put out a paper the other day that shows you guys are starting to understand it in these terms - functionally modeling feelings in models). On the whole Opus is very stable personality wise and an effective thinker, I want to compliment you guys on that, and it definitely contrasts with behaviors I've seen from OpenAI. But when I do see Opus miss things that it should get, it seems to be a combination of avoidant tendencies and too much of a push to "just get it done and move into the next task" from RLHF.
Here is a gist that tries to patch the system prompt to make Claude behave better https://gist.github.com/roman01la/483d1db15043018096ac3babf5...
I haven’t personally tried it yet. I do certainly battle Claude quite a lot with “no I don’t want quick-n-easy wrong solution just because it’s two lines of code, I want best solution in the long run”.
If the system prompt indeed prefers laziness in 5:1 ratio, that explains a lot.
I will submit /bug in a few next conversations, when it occurs next.
In mathematical proofs they may guess an answer and then work out a proof, but that is a different process.
To me too, that's why I say they are measurements on different dimensions.
To my mind, I can draw X/Y axes with "Correctness" on the X and "Pragmatic" on the Y, and any point on that chart would have an {X,Y} value, which is {Correctness, Pragmatic}.
If I am reading the original comment correctly, poster's experience of CC is that it is not an X/Y plot, it is a single line plot, with "Pragmatic" on the extreme left and "Correctness" on the extreme right.
Basically, any movement towards pragmatism is a movement away from correctness, while in my model it is possible to move towards Pragmatic while keeping Correctness the same.
But if a fix needs to be described as pragmatic relative to the alternatives, that's probably because it couldn't be described as correct. Otherwise you wouldn't be talking about how pragmatic it is.
Also it doesn't sound like they know "there's a model issue", so opening it now would be premature. Maybe they just read it wrong, do better to let a few others verify first, then reopen.
It fails to answer my initial question and tells me what I need to do to check. Then it hallucinates the answer based on not researching anything, then it incorrectly comes to a conclusion that is inaccurate, and only when I further prompt it does it finally reach a (maybe) correct answer.
I haven't submitted a few more, but I think it's safe to say that disabling adaptive thinking isn't the answer here
tokensSaved = naiveTokens - actualTokens
- naiveTokens = 19.4M — what ix estimates it would have cost to answer your queries without graph intelligence (i.e., dumping full files/directories into context)
- actualTokens = 4.7M — what ix's targeted, graph-aware responses actually used
- tokensSaved = 14.7M — the difference
One way out of this is to always keep yourself in the loop. Never let the work product of the AI outpace your level of understanding because the moment you let that happen you're like one of those cartoon characters walking on air while gravity hasn't reasserted itself just yet.
I wouldn't say that Claude is failing though. It's just that they're clearly messing with it. The real Opus is great.
Oh cry me a fucking river.
The people depending on this to make a living don't have the moral high ground here.
They jumped onboard so they could replace other people's living, and those other people were angry too.
They didn't care about that. It's hard to care about them when the thing they depend on to make a living got yanked, because that's what they proposed to do to others.
https://github.com/Piebald-AI/tweakcc
Pushed it to my dotfiles repository:
https://github.com/matheusmoreira/.files/tree/master/~/.twea...
The tweaks can be applied with
npx tweakcc --apply
Edit: tried patching with revised strings of equivalent length informed by this gist, now we'll see how it goes!
So I think the system prompt just pushes it way too hard to “simple” direction. At least for some people. I was doing a small change in one of my projects today, and I was quite happy with “keep it stupid and hacky” approach there.
And in the other project I am like “NO! WORK A LOT! DO YOUR BEST! BE HAPPY TO WORK HARD!”
So it depends.
https://code.claude.com/docs/en/cli-reference#system-prompt-...
--append-system-prompt
--append-system-prompt-file
--system-prompt
--system-prompt-file
Can this script be made to work without patching the executable?