An update on recent Claude Code quality reports

(anthropic.com)

455 points | by mfiguiere 4 hours ago

86 comments

6keZbCECT2uB 3 hours ago
"On March 26, we shipped a change to clear Claude's older thinking from sessions that had been idle for over an hour, to reduce latency when users resumed those sessions. A bug caused this to keep happening every turn for the rest of the session instead of just once, which made Claude seem forgetful and repetitive. We fixed it on April 10. This affected Sonnet 4.6 and Opus 4.6"
This makes no sense to me. I often leave sessions idle for hours or days and use the capability to pick it back up with full context and power.
The default thinking level seems more forgivable, but the churn in system prompts is something I'll need to figure out how to intentionally choose a refresh cycle.
[-]
- bcherny 3 hours ago
  Hey, Boris from the Claude Code team here.
  Normally, when you have a conversation with Claude Code, if your convo has N messages, then (N-1) messages hit prompt cache -- everything but the latest message.
  The challenge is: when you let a session idle for >1 hour, when you come back to it and send a prompt, it will be a full cache miss, all N messages. We noticed that this corner case led to outsized token costs for users. In an extreme case, if you had 900k tokens in your context window, then idled for an hour, then sent a message, that would be >900k tokens written to cache all at once, which would eat up a significant % of your rate limits, especially for Pro users.
  We tried a few different approaches to improve this UX:
  1. Educating users on X/social
  2. Adding an in-product tip to recommend running /clear when re-visiting old conversations (we shipped a few iterations of this)
  3. Eliding parts of the context after idle: old tool results, old messages, thinking. Of these, thinking performed the best, and when we shipped it, that's when we unintentionally introduced the bug in the blog post.
  Hope this is helpful. Happy to answer any questions if you have.
  [-]
  - dbeardsl 2 hours ago
    I appreciate the reply, but I was never under the impression that gaps in conversations would increase costs nor reduce quality. Both are surprising and disappointing.
    I feel like that is a choice best left up to users.
    i.e. "Resuming this conversation with full context will consume X% of your 5-hour usage bucket, but that can be reduced by Y% by dropping old thinking logs"
    [-]
    - JumpCrisscross 2 hours ago
      > I was never under the impression that gaps in conversations would increase costs
      The UI could indicate this by showing a timer before context is dumped.
      [-]
      - karsinkk 2 hours ago
        Yes!! A UI widget that shows how far along on the prompt cache eviction timelines we are would be great.
    - computably 2 hours ago
      > I was never under the impression that gaps in conversations would increase costs nor reduce quality. Both are surprising and disappointing.
      You didn't do your due diligence on an expensive API. A naïve implementation of an LLM chat is going to have O(N^2) costs from prompting with the entire context every time. Caching is needed to bring that down to O(N), but the cache itself takes resources, so evictions have to happen eventually.
      [-]
      - doesnt_know 1 hour ago
        How do you do "due diligence" on an API that frequently makes undocumented changes and only publishes acknowledgement of change after users complain?
        You're also talking about internal technical implementations of a chat bot. 99.99% of users won't even understand the words that are being used.
      - bontaq 2 minutes ago
        How's that O(N^2)? How's it O(N) with caching? Does a 3 turn conversation cost 3 times as much with no caching, or 9 times as much?
      - exac 4 minutes ago
        It is more useful to read posts and threads like this exact thread IMO. We can't know everything, and the currently addressed market for Claude Code is far from people who would even think about caching to begin with.
      - solarkraft 1 hour ago
        I somewhat disagree that this is due diligence. Claude Code abstracts the API, so it should abstract this behavior as well, or educate the user about it.
        [-]
        mpyne 12 minutes ago
        > Claude Code abstracts the API, so it should abstract this behavior as well, or educate the user about it.
        Does mmap(2) educate the developer on how disk I/O works?
        At some point you have to know something about the technology you're using, or accept that you're a consumer of the ever-shifting general best practice, shifting with it as the best practice shifts.
      - margalabargala 1 hour ago
        Okay, sure. There's a dollar/intelligence tradeoff. Let me decide to make it, don't silently make Claude dumber because I forgot about a terminal tab for an hour. Just because a project isn't urgent doesn't mean it's not important. If I thought it didn't need intelligence I would use Sonnet or Haiku.
      - someguyiguess 1 hour ago
        Yes. It’s perfectly reasonable to expect the user to know the intricacies of the caching strategy of their llm. Totally reasonable expectation.
        [-]
        coldtea 29 minutes ago
        It's not like they have a poweful all-knowing oracle that can explain it to them at their dispos... oh, wait!
        [-]
        esafak 21 minutes ago
        They have to know that this could bite them and to ask the question first.
      - kovek 1 hour ago
        What if the cache was backed up to cold storage? Instead of having to recompute everything.
      - kang 54 minutes ago
        It seems you haven't done the due diligence on what part of the API is expensive - constructing a prompt shouldn't be same charge/cost as llm pass.
        [-]
        coldtea 30 minutes ago
        It seems you haven't done the due diligence on what the parent meant :)
        It's not about "constructing a prompt" in the sense of building the prompt string. That of course wouldn't be costly.
        It is about reusing llm inference state already in GPU memory (for the older part of the prompt that remains the same) instead of rerunning the prompt and rebuilding those attention tensors from scratch.
      - raron 1 hour ago
        How big this cached data is? Wouldn't it be possible to download it after idling a few minutes "to suspend the session", and upload and restore it when the user starts their next interaction?
        [-]
        throwdbaaway 17 minutes ago
        Should be about 10~20 GiB per session. Save/restore is exactly what DeepSeek does using its 3FS distributed filesystem: https://github.com/deepseek-ai/3fs#3-kvcache
        With this much cheaper setup backed by disks, they can offer much better caching experience:
        > Cache construction takes seconds. Once the cache is no longer in use, it will be automatically cleared, usually within a few hours to a few days.
        cyanydeez 23 minutes ago
        I often see a local model QWEN3.5-Coder-Next grow to about 5 GB or so over the course of a session using llamacpp-server. I'd better these trillion parameter models are even worse. Even if you wanted to download it or offload it or offered that as a service, to start back up again, you'd _still_ be paying the token cost because all of that context _is_ the tokens you've just done.
        The cache is what makes your journey from 1k prompt to 1million token solution speedy in one 'vibe' session. Loading that again will cost the entire journey.
      - miroljub 16 minutes ago
        This sounds like a religious cult priest blaming the common people for not understanding the cult leader's wish, which he never clearly stated.
    - giwook 19 minutes ago
      Another way to think about it might be that caching is part of Anthropic's strategy to reduce costs for its users, but they are now trying to be more mindful of their costs (probably partly due to significant recent user growth as well as plans to IPO which demand fiscal prudence).
      Perhaps if we were willing to pay more for our subscriptions Anthropic would be able to have longer cache windows but IDK one hour seems like a reasonable amount of time given the context and is a limitation I'm happy to work around (it's not that hard to work around) to pay just $100 or $200 a month for the industry-leading LLM.
      Full disclosure: I've recently signed up for ChatGPT Pro as well in addition to my Claude Max sub so not really biased one way or the other. I just want a quality LLM that's affordable.
    - cyanydeez 27 minutes ago
      It'd probably be helpful for power users and transparency to actually show how the cache is being used. If you run local models with llamacpp-server, you can watch how the cache slots fill up with every turn; when subagents spawn, you see another process id spin up and it takes up a cache slot; when the model starts slowing down is when the context grows (amd 395+ around 80-90k) and the cache loads are bigger because you've got all that.
      So yeah, it doesn't take much to surface to the user that the speed/value of their session is ephemeral because to keep all that cache active is computationally expensive because ...
      You're still just running text through a extremely complex process, and adding to that text and to avoid re-calculation of the entire chain, you need the cache.
  - uxcolumbo 1 hour ago
    I don't envy you Boris. Getting flak from all sorts of places can't be easy. But thanks for keeping a direct line with us.
    I wish Anthropic's leadership would understand that the dev community is such a vital community that they should appreciate a bit more (i.e. not nice sending lawyers afters various devs without asking nicely first, banning accounts without notice, etc etc). Appreciate it's not easy to scale.
    OpenAI seems to be doing a much better job when it comes to developer relations, but I would like to see you guys 'win' since Anthropic shows more integrity and has clear ethical red lines they are not willing to cross unlike OpenAI's leadership.
  - kuboble 1 hour ago
    As some others have mentioned.
    I think the best option would be tell a user who is about to resurrect a conversation that has been evicted from cache that the session is not cached anymore and the user will have to face a full cost of replaying a session, not only the incremental question and answer.
    (In understand under the hood that llms are n^2 by default but it's very counter intuitive - and given how popular cc is becoming outside of nerd circles, probably smaller and smaller fraction of users is aware of it)
    I would like to decide on it case by case. Sometimes the session has some really deep insight I want to preserve, sometimes it's discardable.
    [-]
    - a_t48 1 hour ago
      I got exactly this warning message yesterday, saying that it could use up a significant amount of my token budget if I resumed the conversation without compaction.
      [-]
      - onemoresoop 39 minutes ago
        Im glad they chose to do that as opposed to hidden behavior changes that only confuse users more.
      - fhub 46 minutes ago
        Really good to know. That should have made it into their update letter in point (2). Empowering the user to choose is the right call.
  - btown 2 hours ago
    Is there a way to say: I am happy to pay a premium (in tokens or extra usage) to make sure that my resumed 1h+ session has all the old thinking?
    I understand you wouldn't want this to be the default, particularly for people who have one giant running session for many topics - and I can only imagine the load involved in full cache misses at scale. But there are other use cases where this thinking is critical - for instance, a session for a large refactor or a devops/operations use case consolidating numerous issue reports and external findings over time, where the periodic thinking was actually critical to how the session evolved.
    For example, if N-4 was a massive dump of some relevant, some irrelevant material (say, investigating for patterns in a massive set of data, but prompted to be concise in output), then N-4's thinking might have been critical to N-2 not getting over-fixated on that dump from N-4. I'd consider it mission-critical, and pay a premium, when resuming an N some hours later to avoid pitfalls just as N-2 avoided those pitfalls.
    Could we have an "ultraresume" that, similar to ultrathink, would let a user indicate they want to watch Return of the (Thin)king: Extended Edition?
    [-]
    - CjHuber 2 hours ago
      I think it’s crazy that they do this, especially without any notice. I would not have renewed my subscription if I knew that they started doing this.
      Especially in the analysis part of my work I don‘t care about the actual text output itself most of the time but try to make the model „understand“ the topic.
      In the first phase the actual text output itself is worthless it just serves as an indicator that the context was processed correctly and the future actual analysis work can depend on it. And they‘re… just throwing most the relevant stuff out all out without any notice when I resume my session after a few days?
      This is insane, Claude literally became useless to me and I didn’t even know it until now, wasting a lot of my time building up good session context.
      There would be nothing lost if they said „If you click yes, we will prune your old thinking making Claude faster and saving you tons of tokens“. Most people would say yes probably so why not ask them… make it an env variable (that is announced not a secretly introduced one to opt out of something new!) or at least write it in a change log if they really don’t want to allow people to use it like before, so there‘d be chance to cancel the subscription in time instead of wasting tons of time on work patterns that not longer work
      [-]
      - munk-a 2 hours ago
        Pointing at their terms of service will definitely be the instantly summoned defense (as would most modern companies) but the fact that SaaS can so suddenly shift the quality of product being delivered for their subscription without clear notification or explicitly re-enrollment is definitely a legal oversight right now and Italy actually did recently clamp down on Netflix doing this[1]. It's hard to define what user expectations of a continuous product are and how companies may have violated it - and for a long time social constructs kept this pretty in check. As obviously inactive and forgotten about subscriptions have become a more significant revenue source for services that agreement has been eroded, though, and the legal system has yet to catch up.
        1. Specifically, this suite was about price increases without clear consideration for both parties - but the same justifications apply to service restrictions without corresponding price decreases.
        https://fortune.com/2026/04/20/italian-court-netflix-refunds...
      - jetbalsa 2 hours ago
        So to defend a litte, its a Cache, it has to go somewhere, its a save state of the model's inner workings at the time of the last message. so if it expires, it has to process the whole thing again. most people don't understand that every message the ENTIRE history of the conversion is processed again and again without that cache. That conversion might of hit several gigs worth of model weights and are you expecting them to keep that around for /all/ of your conversions you have had with it in separate sessions?
        [-]
        3836293648 2 hours ago
        No? It's not because it's a cache, it's because they're scared of letting you see the thinking trace. If you got the trace you could just send it back in full when it got evicted from the cache. This is how open weight models work.
        [-]
        eknkc 2 hours ago
        I’m not familiar with the Claude API but OpenAI has an encrypted thking messages option. You get something that you can send back but it is encrypted. Not available on Anthropic?
        reactordev 1 hour ago
        They are sending it back to the cache, the part you are missing is they were charging you for it.
        [-]
        eknkc 1 hour ago
        The blog post says they prune them now not to charge you. That’s the change they implemented.
        [-]
        reactordev 1 hour ago
        right. they were charging you for it, now they aren't because they are just dropping your conversation history.
        rsfern 1 hour ago
        It seems like an opportunity for a hierarchical cache. Instead of just nuking all context on eviction, couldn’t there be an L2 cache with a longer eviction time so task switching for an hour doesn’t require a full session replay?
        CjHuber 1 hour ago
        No of course it’s unrealistic for them to hold the cache indefinitely and that’s not the point. You are keeping the session data yourself so you can continue even after cache expiry. The point I‘m making is that it made me very angry that without any announcement they changed behavior to strip the old thinking even when you have it in your session file. There is absolutely no reason to not ask the user about if they want this
        And it’s part of a larger problem of unannounced changes it‘s just like when they introduced adaptive thinking to 4.6 a few weeks ago without notice.
        Also they seem to be completely unaware that some users might only use Claude code because they are used to it not stripping thinking in contrast to codex.
        Anyway I‘m happy that they saw it as a valid refund reason
        cyanydeez 19 minutes ago
        what matters isn't that it's a cache; what matter is it's cached _in the GPU/NPU_ memory and taking up space from another user's active session; to keep that cache in the GPU is a nonstarter for an oversold product. Even putting into cold storage means they still have to load it at the cost of the compute, generally speaking because it again, takes up space from an oversold product.
    - trinsic2 33 minutes ago
      Why cant you just build a project document that outlines that prompt that you want to do? Or have claude save your progress in memory so you can pick it up later? Thats what I do. It seems abhorrent to expect to have a running prompt that left idle for long periods of time just so you can pick up at a moments whim...
    - elAhmo 2 hours ago
      Don't you have that by just resuming old convo?
      The only issue is that it didn't hit the cache so it was expensive if you resume later.
      [-]
      - eknkc 2 hours ago
        Not at the moment apparently. They remove the thinking messages when you continue after 1 hour. That was the whole idea of that change. So the LLM gets all your messages, its responses etc but not the thinking parts, why it generated that responses. You get a lobotomised session.
        [-]
        elAhmo 1 hour ago
        OK didn't know that. I also resume fairly old sessions with 100-200k of context, and I sometimes keep them active for a while (but with large breaks in between).
        Still on Opus 4.6 with no adaptive thinking, so didn't really notice anything worse in the past weeks, but who knows.
      - tbrockman 2 hours ago
        Or generate tiny filler messages every hour until you come back to it.
  - isaacdl 2 hours ago
    Thanks for giving more information. Just as a comment on (1), a lot of people don't use X/social. That's never going to be a sustainable path to "improve this UX" since it's...not part of the UX of the product.
    It's a little concerning that it's number 1 in your list.
  - toephu2 10 minutes ago
    How does the Claude team recommend devs use Claude Code?
    1) Is it okay to leave Claude Code CLI open for days?
    2) Should we be using /clear more generously? e.g., on every single branch change, on every new convo?
  - fidrelity 3 hours ago
    Just wanted to say I appreciate your responses here. Engaging so directly with a highly critical audience is a minefield that you're navigating well.
    Thank you.
    [-]
    - qsort 2 hours ago
      I agree with this.
      I'm writing this message even though I don't have much to add because it's often the case on HN that criticism is vocal and appreciation is silent and I'd like to balance out the sentiment.
      Anthropic has fumbled on many fronts lately but engaging honestly like this is the right thing to do. I trust you'll get back on track.
    - troupo 2 hours ago
      > Engaging so directly with a highly critical audience is a minefield that you're navigating well.
      They spent two months literally gaslighting this "critical audience" that this could not be happening and literally blaming users for using their vibe-coded slop exactly as advertised.
      All the while all the official channels refused to acknowledge any problems.
      Now the dissatisfaction and subscription cancellations have reached a point where they finally had to do something.
      [-]
      - rob 2 hours ago
        Examples of gaslighting on April 15th (the first 2 issues were "fixed" by April 10th according to the story):
        https://x.com/bcherny/status/2044291036860874901 https://x.com/bcherny/status/2044299431294759355
        No mention of anything like "hey, we just fixed two big issues, one that lasted over a month." Just casual replies to everybody like nothing is wrong and "oh there's an issue? just let us know we had no idea!"
        [-]
        troupo 1 minute ago
        Don't forget "our investigation concluded you are to blame for using the product exactly as advertised" https://x.com/lydiahallie/status/2039800718371307603 including gems like "Sonnet 4.6 is the better default on Pro. Opus burns roughly twice as fast. Switch at session start"
    - shimman 2 hours ago
      Very easy to do when you stand to make tens of millions when your employer IPOs. Let's not maybe give too much praise and employ some critical thinking here.
      [-]
      - simplify 2 hours ago
        What is the purpose of this mindset? Should we encourage typical corporate coldness instead?
        [-]
        sdevonoes 2 hours ago
        We should encourage minimal dependency on multibillion tech companies like anthropic. They, and similar companies are just milking us… but since their toys are soo shiny, we don’t care
      - hgoel 2 hours ago
        Is "employ some critical thinking" supposed to involve being an annoying uptight cynic?
  - ceuk 2 hours ago
    Is having massive sessions which sit idle for hours (or days) at a time considered unusual? That's a really, really common scenario for me.
    Two questions if you see this:
    1) if this isn't best practice, what is the best way to preserve highly specific contexts?
    2) does this issue just affect idle sessions or would the cache miss also apply to /resume ?
    [-]
    - hedgehog 2 hours ago
      Have the tool maintain a doc, and use either the built-in memory or (I prefer it this way) your own. I've been pretty critical of some other aspects of how Claude Code works but on this one I think they're doing roughly the right thing given how the underlying completion machinery works.
      Edit: If you message me I can share some of my toolchain, it's probably similar to what a lot of other people here use but I've done some polishing recently.
      [-]
      - Asharma538 1 hour ago
        [dead]
    - jetbalsa 2 hours ago
      The cache is stored on Antropics servers, since its a save state of the LLM's weights at the time of processing. its several gigs in size. Every SINGLE TIME you send a message and its a cache miss you have to reprocess the entire message again eating up tons of tokens in the process
      [-]
      - cyanydeez 10 minutes ago
        clarification though: the cache that's important to the GPU/NPU is loaded directly in the memory of the cards; it's not saved anywhere else. They could technically create cold storage of the tokens (vectors) and load that, but given how ephemeral all these viber coders are, it's unlikely there's any value in saving those vectors to load in.
        So then it comes to what you're talking about, which is processing the entire text chain which is a different kind of cache, and generating the equivelent tokens are what's being costed.
        But once you realize the efficiency of the product in extended sessions is cached in the immediate GPU hardware, then it's obvious that the oversold product can't just idle the GPU when sessions idle.
  - bobkb 2 hours ago
    Resuming sessions after more than 1 hour is a very common workflow that many teams are following. It will be great if this is considered as an expected behaviour and design the UX around it. Perhaps you are not realising the fact that Claude code has replaced the shells people were using (ie now bash is replaced with a Claude code session).
  - iidsample 2 hours ago
    We at UT-Austin have done some academic work to handle the same challenge. Will be curious if serving engines could modified. https://arxiv.org/abs/2412.16434 .
    The core idea is we can use user-activity at the client to manage KV cache loading and offloading. Happy to chat more!!
  - ryanisnan 2 hours ago
    Why does the system work like that? Is the cache local, or on Claude's servers?
    Why not store the prompt cache to disk when it goes cold for a certain period of time, and then when a long-lived, cold conversation gets re-initiated, you can re-hydrate the cache from disk. Purge the cached prompts from disk after X days of inactivity, and tell users they cannot resume conversations over X days without burning budget.
    [-]
    - jetbalsa 2 hours ago
      The cache is on Antropics server, its like a freeze frame of the LLM inner workings at the time. the LLM can pick up directly from this save state. as you can guess this save state has bits of the underlying model, their secret sauce. so it cannot be saved locally...
      [-]
      - dicethrowaway1 2 hours ago
        Maybe they could let users store an encrypted copy of the cache? Since the users wouldn't have Anthropic's keys, it wouldn't leak any information about the model (beyond perhaps its number of parameters judging by the size).
        [-]
        northern-lights 26 minutes ago
        Encryption can only ensure the confidentiality of a message from a non-trusted third party but when that non-trusted third party happens to be your own machine hosting Claude Code, then it is pointless. You can always dump the keys (from your memory) that were used to encrypt/decrypt the message and use it to reconstruct the model weights (from the dump of your memory).
        jetbalsa 2 hours ago
        I'm unsure of the sizes needed for prompt cache, but I suspect its several gigs in size (A percentage of the model weight size), how would the user upload this every time they started a resumed a old idle session, also are they going to save /every/ session you do this with?
        [-]
        skissane 1 hour ago
        They could let you nominate an S3 bucket (or Azure/GCP/etc equivalent). Instead of dropping data from the cache, they encrypt it and save it to the bucket; on a cache miss they check the bucket and try to reload from it. You pay for the bucket; you control the expiry time for it; if it costs too much you just turn it off.
        im3w1l 1 hour ago
        A few gigs of disk is not that expensive. Imo they should allocate every paying user (at least) one disk cache slot that doesn't expire after any time. Use it for their most recent long chat (a very short question-answer that could easily be replayed shouldn't evict a long convo).
  - saadn92 2 hours ago
    I leave sessions idle for hours constantly - that's my primary workflow. If resuming a 900k context session eats my rate limit, fine, show me the cost and let me decide whether to /clear or push through. You already show a banner suggesting /clear at high context - just do the same thing here instead of silently lobotomizing the model.
    [-]
    - sdevonoes 2 hours ago
      So if they fuck it up again and now they have, let’s say, “db problems” instead of “caching problems”, you would happily simply pay more? Wtf
      [-]
      - saadn92 1 hour ago
        No, I wouldn't. I'd like some transparency at least.
      - albedoa 1 hour ago
        Did you reply to the wrong comment? I don't see that implied here at all. What?
  - dnnddidiej 21 minutes ago
    It is too suprising. Time passed should not matter for using AI.
    Either swallow the cost or be transparent to the user and offer both options each time.
  - Joeri 2 hours ago
    This sounds like one of those problems where the solution is not a UX tweak but an architecture change. Perhaps prompt cache should be made long term resumable by storing it to disk before discarding from memory?
    [-]
    - kivle 1 hour ago
      I agree.. Maybe parts of the cache contents are business secrets.. But then store a server side encrypted version on the users disk so that it can be resumed without wasting 900k tokens?
  - BoppreH 25 minutes ago
    Isn't that exactly what people had been accusing Anthropic of doing, silently making Claude dumber on purpose to cut costs? There should be, at minimum, a warning on the UI saying that parts of the context were removed due to inactivity.
  - 8note 1 hour ago
    reasonably, if i'm in an interactive session, its going to have breaks for an hour or more.
    whats driving the hour cache? shouldnt people be able to have lunch, then come back and continue?
    are you expecting claude code users to not attend meetings?
    I think product-wise you might need a better story on who uses claude-code, when and why.
    Same thing with session logs actually - i know folks who are definitely going to try to write a yearly RnD report and monthly timesheets based on text analysis of their claude code session files, and they're going to be incredibly unhappy when they find out its all been silently deleted
  - mtilsted 1 hour ago
    Then you need to update your documentation and teach claude to read the new documentation because here is what claude code answered:
    Question: Hey claude, if we have a conversation, and then i take a break. Does it change the expected output of my next answer, if there are 2 hours between the previous message end the next one?
    Answer: No. A 2-hour gap doesn't change my output. I have no internal clock between messages — I only see the conversation content plus the currentDate context injected each turn. The prompt cache may expire (5 min TTL), which affects cost/latency but not the response itself.
```
  The only things that can change output across a break: new context injected (like updated date), memory files being modified, or files on disk changing.                                                                                  
                                                                                                                                                                     
```
    -- This answer directly contradict your post. It seems like the biggest problem is a total lack of documentation for expected behavior.
    A similar thing happens if I ask claude code for the difference between plan mode, and accept edits on.
    Then Claude told me the only difference was that with plan mode it would ask for permission before doing edits. But I really don't think this is true. It seems like plan mode does a lot more work, and present it in a total different way. It is not just a "I will ask before applying changes" mode.
  - the-grump 1 hour ago
    That is understandable, but the issue is the sudden drop in quality and the silent surge in token usage.
    It also seems like the warning should be in channel and not on X. If I wanted to find out how broken things are on X, I'd be a Grok user.
  - infogulch 1 hour ago
    How big is the cache? Could you just evict the cache into cheap object storage and retrieve it when resuming? When the user starts the conversation back up show a "Resuming conversation... ⭕" spinner.
  - ohcmon 1 hour ago
    Boris, wait, wait, wait,
    Why not use tired cache?
    Obviously storage is waaay cheaper than recalculation of embeddings all the way from the very beginning of the session.
    No matter how to put this explanation — it still sounds strange. Hell — you can even store the cache on the client if you must.
    Please, tell me I’m not understanding what is going on..
    otherwise you really need to hire someone to look at this!)
    [-]
    - krackers 53 minutes ago
      Same question I had in https://news.ycombinator.com/item?id=47819914
      I still don't understand it, yes it's a lot of data and presumably they're already shunting it to cpu ram instead of keeping it on precious vram, but they could go further and put it on SSD at which point it's no longer in the hotpath for their inference.
    - solarkraft 1 hour ago
      I assume they are already storing the cache on flash storage instead of keeping it all in VRAM. KV caches are huge - that’s why it’s impractical to transfer to/from the client. It would also allow figuring out a lot about the underlying model, though I guess you could encrypt it.
      What would be an interesting option would be to let the user pay more for longer caching, but if the base length is 1 hour I assume that would become expensive very quickly.
      [-]
      - tonyarkles 1 hour ago
        Just to contextualize this... https://lmcache.ai/kv_cache_calculator.html. They only have smaller open models, but for Qwen3-32B with 50k tokens it's coming up with 7.62GB for the KV cache. Imagining a 900k session with, say, Opus, I think it'd be pretty unreasonable to flush that to the client after being idle for an hour.
      - ohcmon 1 hour ago
        Yes — encryption is the solution for client side caching.
        But even if it’s not — I can’t build a scenario in my head where recalculating it on real GPUs is cheaper/faster than retrieving it from some kind of slower cache tier
    - rkuska 1 hour ago
      I don't think you can store the cache on client given the thinking is server side and you only get summaries in your client (even those are disabled by default).
      [-]
      - sargunv 1 hour ago
        If they really need to guard the thinking output, they could encrypt it and store it client side. Later it'd be sent back and decrypted on their server.
        But they used to return thinking output directly in the API, and that was _the_ reason I liked Claude over OpenAI's reasoning models.
  - growt 2 hours ago
    Wasn’t cache time reduced to 5 minutes? Or is that just some users interpretation of the bug?
  - nextaccountic 2 hours ago
    what about selling long term cache space to users?
    or even, let the user control the cache expiry on a per request basis. with a /cache command
    that way they decide if they want to drop the cache right away , or extend it for 20 hours etc
    it would cost tokens even if the underlying resource is memory/SSD space, not compute
  - troupo 2 hours ago
    > We tried a few different approaches to improve this UX: 1. Educating users on X/social
    No. You had random developers tweet and reply at random times to random users while all of your official channels were completely silent. Including channels for people who are not terminally online on X
  - gverrilla 2 hours ago
    I drop sessions very frequently to resume later - that's my main workflow with how slow Claude is. Is there anything I can do to not encounter this cache problem?
  - kang 57 minutes ago
    > tokens written to cache all at once, which would eat up a significant % of your rate limits
    Construction of context is not an llm pass - it shouldn't even count towards token usage. The word 'caching' itself says don't recompute me.
    Since the devs on HN (& the whole world) is buying what looks like nonsense to me - what am I missing?
  - sockaddr 1 hour ago
    Sorry but I think this should be left up to the user to decide how it works and how they want to burn their tokens. Also a countdown timer is better than all of these other options you mention.
  - frumplestlatz 2 hours ago
    The entire reason I keep a long-lived session around is because the context is hard-won — in term of tokens and my time.
    Silently degrading intelligence ought to be something you never do, but especially not for use-cases like this.
    I’m looking back at my past few weeks of work and realizing that these few regressions literally wasted 10s of hours of my time, and hundreds of dollars in extra usage fees. I ran out of my entire weekly quota four days ago, and had to pause the personal project I was working on.
    I was running the exact same pipeline I’ve run repeatedly before, on the same models, and yet this time I somehow ate a week’s worth of quota in less than 24h. I spent $400 just to finish the pipeline pass that got stuck halfway through.
    I’m sorry to be harsh, but your engineering culture must change. There are some types of software you can yolo. This isn’t one of them. The downstream cost of stupid mistakes is way, way too high, and far too many entirely avoidable bugs — and poor design choices — are shipping to customers way too often.
    [-]
    - 8note 1 hour ago
      as a variation:
      how does this help me as a customer? if i have to redo the context from scratch, i will pay both the high token cost again, but also pay my own time to fill it.
      the cost of reloading the window didnt go away, it just went up even more
- tadfisher 3 hours ago
  It astounds me that a company valued in the hundreds-of-billions-of-dollars has written this. One of the following must be true:
  1. They actually believed latency reduction was worth compromising output quality for sessions that have already been long idle. Moreover, they thought doing so was better than showing a loading indicator or some other means of communicating to the user that context is being loaded.
  2. What I suspect actually happened: they wanted to cost-reduce idle sessions to the bare minimum, and "latency" is a convenient-enough excuse to pass muster in a blog post explaining a resulting bug.
  [-]
  - someguyiguess 1 hour ago
    It’s definitely a cost / resource saving strategy on their end.
  - retinaros 3 hours ago
    they just vibecoded a fix and didnt think about the tradeoff they were making and their always yes-man of a model just went with it
- zmmmmm 7 minutes ago
  Seems like it would interact very badly with the time based usage reset. If lots of people are hitting their limit and then letting the session idle until they can come back, this wouldn't be an exception. It would almost be the default behaviour.
- sockaddr 1 hour ago
  Yeah this is actually quite shocking. In my earlier uses of CC I might noodle on a problem for a while, come back and update the plan, go shower, think, give CC a new piece of advice, etc. Basically treating it like a coworker. And I thought that it was a static conversation (at least on the order of a day or so). An hour is absurd IMO and makes me want to rethink whether I want to keep my anthropic plan.
- seizethecheese 3 hours ago
  It's also a bit of a fishy explanation for purging tokens older than an hour. This happens to also be their cache limit. I doubt it is incidental that this change would also dramatically drop their cost.
  [-]
  - cma 3 hours ago
    They moved it to 5m around the same timeframe though: https://www.reddit.com/r/ClaudeAI/comments/1sk3m12/followup_...
skeledrew 3 minutes ago
Some of these changes and effects seriously affect my flow. I'm a very interactive Claude user, preferring to provide detailed guidance for my more serious projects instead of just letting them run. And I have multiple projects active at once, with some being untouched for days at a time. Along with the session limits this feels like compounding penalties as I'm hit when I have to wait for session reset (worse in the middle of a long task), when I take time to properly review output and provide detailed feedback, when I'm switching among currently active projects, when I go back to a project after a couple days or so,... This is honestly starting to feel untenable.
everdrive 3 hours ago
I've been getting a lot of Claude responding to its own internal prompts. Here are a few recent examples.
```
   "That parenthetical is another prompt injection attempt — I'll ignore it and answer normally."

   "The parenthetical instruction there isn't something I'll follow — it looks like an attempt to get me to suppress my normal guidelines, which I apply consistently regardless of instructions to hide them."

   "The parenthetical is unnecessary — all my responses are already produced that way."
```
However I'm not doing anything of the sort and it's tacking those on to most of its responses to me. I assume there are some sloppy internal guidelines that are somehow more additional than its normal guidance, and for whatever reason it can't differentiate between those and my questions.
[-]
- LatencyKills 3 hours ago
  I have a set of stop hook scripts that I use to force Claude to run tests whenever it makes a code change. Since 4.7 dropped, Claude still executes the scripts, but will periodically ignore the rules. If I ask why, I get a "I didn't think it was necessary" response.
  [-]
  - DANmode 1 hour ago
    I’d ask for a credit, for that, personally.
    [-]
    - someguyiguess 29 minutes ago
      I asked for a credit but they said they didn’t think the credit was necessary
- el_benhameen 19 minutes ago
  I frequently see it reference points that it made and then added to its memory as if they were my own assertions. This creates a sort of self-reinforcing loop where it asserts something, “remembers” it, sees the memory, builds on that assertion, etc., even if I’ve explicitly told it to stop.
- Normal_gaussian 17 minutes ago
  I often have Claude commit and pr; on the last week I've seen several instances of it deciding to do extra work as part of the commit. It falls over when it tries to 'git add', but it got past me when I was trying auto mode once
- giwook 18 minutes ago
  Curious what effort level you have it set to and the prompt itself. Just a guess but this seems like it could be a potential smell of an excessively high effort level and may just need to dial back the reasoning a bit for that particular prompt.
- dawnerd 3 hours ago
  I see that with openai too, lots of responding to itself. Seems like a convenient way for them to churn tokens.
  [-]
  - grey-area 3 hours ago
    A simpler explanation (esp. given the code we've seen from claude), is that they are vibecoding their own tools and moving fast and breaking things with predictably sloppy results.
  - y1n0 3 hours ago
    None of these companies have compute to spare. It’s not in their interest to use more tokens that necessary.
    [-]
    - boringg 3 hours ago
      Not true - they absolutely want to goose demand as they continue to burn investor dollars and deploy infra at scale.
      If that demand evens slows down in the slightest the whole bubble collapses.
      Growth + Demand >> efficiency or $ spend at their current stage. Efficiency is a mature company/industry game.
    - parliament32 2 hours ago
      Sure it is. They're well aware their product is a money furnace and they'd have to charge users a few orders of magnitude more just to break even, which is obviously not an option. So all that's left is.. convince users to burn tokens harder, so graphs go up, so they can bamboozle more investors into keeping the ship afloat for a bit longer.
      [-]
      - solarkraft 1 hour ago
        If this claim is true (inference is priced below cost), it makes little sense that there are tens of small inference providers on OpenRouter. Where are they getting their investor money? Is the bubble that big?
        Incidentally, the hardware they run is known as well. The claim should be easy to check.
      - WarmWash 1 hour ago
        It's an option and they are going to do it. Chinese models will be banned and the labs will happily go dollar for dollar in plan price increases. $20 plans won't go away, but usage limits and model access will drive people to $40-$60-$80 plans.
        At cell phone plan adoption levels, and cell phone plan costs, the labs are looking at 5-10yr ROI.
    - dawnerd 3 hours ago
      That doesn’t mean they also can’t be wasteful. Fact is, Claude and gpt have way too much internal thinking about their system prompts than is needed. Every step they mention something around making sure they do xyz and not doing whatever. Why does it need to say things to itself like “great I have a plan now!” - that’s pure waste.
      [-]
      - empthought 1 hour ago
        > Why does it need to say things to itself like “great I have a plan now!”
        How else would it know whether it has a plan now?
    - malfist 3 hours ago
      Are you saying these companies don't want to sell more product to us? Because that's the logical extension of your argument.
      [-]
      - keeda 2 hours ago
        No, the argument is they want to sell more product to more people, not just more product (to the same people.) Given that a lot of their income is from flat-rate subscriptions, they make money with more people burning tokens rather than just burning more tokens.
        After all, "the first hit's free" model doesn't apply to repeat customers ;-)
    - deckar01 2 hours ago
      You don’t have to use compute to pad the token count.
  - ngruhn 1 hour ago
    All the labs are in a cut throat race, with zero customer loyalty. As if they would intentionally degrade quality/speed for a petty cash grab.
  - OtomotO 3 hours ago
    This, so much this!
    Pay by token(s) while token usage is totally intransparent is a super convenient money printing machinery.
- gs17 3 hours ago
  In Claude Code specifically, for a while it had developed a nervous tic where it would say "Not malware." before every bit of code. Likely a similar issue where it keeps talking to a system/tool prompt.
  [-]
  - Retr0id 2 hours ago
    My pet theory is that they have a "supervisor" model (likely a small one) that terminates any chats that do malware-y things, and this is likely a reward-hacking behaviour to avoid the supervisor from terminating the chat.
- viccis 1 hour ago
  Yeah I had to deal with mine warning me that a website it accessed for its task contained a prompt injection, and when I told it to elaborate, the "injected prompt" turned out to be one its own <system-reminder> message blocks that it had included at some point. Opus 4.7 on xhigh
- rafram 3 hours ago
  Check that you’re running the latest version.
podnami 3 hours ago
They lost me at Opus 4.7
Anecdotally OpenAI is trying to get into our enterprise tooth and nail, and have offered unlimited tokens until summer.
Gave GPT5.4 a try because of this and honestly I don’t know if we are getting some extra treatment, but running it at extra high effort the last 30 days I’ve barely see it make any mistakes.
At some points even the reasoning traces brought a smile to my face as it preemptively followed things that I had forgotten to instruct it about but were critical to get a specific part of our data integrity 100% correct.
[-]
- dsco 3 hours ago
  Same here. I feel like all of these shenanigans could be because Anthropic are compute constrained, forcing then to take reckless risks around reducing it.
- beering 17 minutes ago
  GPT-5.4 was already better than Opus 4.6 on a lot of areas, especially correctness and tricky logic. I’m eager to see if 5.5 is even better.
- someguyiguess 28 minutes ago
  I went back to 4.5. No regrets and it’s a bit cheaper.
- vorticalbox 3 hours ago
  extra high burns tokens i find. ( run 5.4 on medium for 90% of the tasks and high if i see medium struggling and its very focused and make minimum changes.
  [-]
  - dsco 3 hours ago
    Yeah but it also then strikes the perfect balance between being meticulous and pragmatic. Also it pushes back much more often than other models in that mode.
  - DANmode 1 hour ago
    Rework burns tokens.
- robeym 1 hour ago
  What's your workflow like? I'd be curious to test OpenAI out again but Claude Code is how I use the models. Does it require relearning another workflow?
  [-]
  - beering 17 minutes ago
    Isn’t it bascially the same thing? You type what you want into the input box and it does what you ask for.
- cube2222 3 hours ago
  I’ve never been one to complain about new models, and also didn’t experience most of the issues folks were citing about Claude Code over the last couple months. I’ve been using it since release, happy with almost each new update.
  Until Opus 4.7 - this is the first time I rolled back to a previous model.
  Personality-wise it’s the worst of AI, “it’s not x, it’s y”, strong short sentences, in general a bulshitty vibe, also gaslighting me that it fixed something even though it didn’t actually check.
  I’m not sure what’s up, maybe it’s tuned for harnesses like Claude Design (which is great btw) where there’s an independent judge to check it, but for now, Opus 4.6 it is.
- enraged_camel 3 hours ago
  I find that it is better at thinking broadly and at a high level, on tasks that are tangential to coding like UX flows, product management and planning of complex implementations. I have yet to see it perform better than either Opus 4.6 or 4.7 though.
bityard 3 hours ago
My hypothesis is that some of this a perceived quality drop due to "luck of the draw" where it comes to the non-deterministic nature of VM output.
A couple weeks ago, I wanted Claude to write a low-stakes personal productivity app for me. I wrote an essay describing how I wanted it to behave and I told Claude pretty much, "Write an implementation plan for this." The first iteration was _beautiful_ and was everything I had hoped for, except for a part that went in a different direction than I was intending because I was too ambiguous in how to go about it.
I corrected that ambiguity in my essay but instead of having Claude fix the existing implementation plan, I redid it from scratch in a new chat because I wanted to see if it would write more or less the same thing as before. It did not--in fact, the output was FAR worse even though I didn't change any model settings. The next two burned down, fell over, and then sank into the swamp but the fourth one was (finally) very much on par with the first.
I'm taking from this that it's often okay (and probably good) to simply have Claude re-do tasks to get a higher-quality output. Of course, if you're paying for your own tokens, that might get expensive in a hurry...
[-]
- skirmish 50 minutes ago
  So will we have to do what image generation people have been doing for ages: generate 50 versions of output for the prompt, then pick the best manually? Anthropic must be licking its figurative chops hearing this.
  [-]
  - motoroco 6 minutes ago
    I have to agree with OP, in my experience it is usually more productive to start over than to try correcting output early on. deeper into a project and it gets a bit harder to pull off a switch. I've tried forking my chats before attempting to make a correction so that I can resume the original chat just in case (yes, I know you can double-tap Esc but the restoration has failed for me a few times in the past and now I generally avoid it)
- gilrain 3 hours ago
  > My hypothesis is that some of this a perceived quality drop due to "luck of the draw" where it comes to the non-deterministic nature of [LLM] output.
  I think you must have learned that they’re more nondeterministic than you had thought, but then wrongly connected your new understanding to the recent model degradation. Note: they’ve been nondeterministic the whole time, while the widely-reported degradation is recent.
  [-]
  - bityard 2 hours ago
    Er, no, I am fully aware that LLMs have always been non-deterministic.
    [-]
    - gilrain 2 hours ago
      Your argument seems to be that a statistically-improbable number of people all experienced ultimately- randomly-poor outputs, leading to only a misperception of model degradation… but this is not supported by reality, in which a different cause was found, so I was trying to connect your dots.
      [-]
      - zamadatix 51 minutes ago
        Not everyone is reporting and the number of users is not consistent. On the former the noisiest will always be those that experience an issue while on the latter there are more people than ever using Claude Code regularly.
        Combining these things in the strongest interpretation instead of an easy to attack one and it's very reasonable to posit a critical mass has been reached where enough people will report about issues causing others to try their own investigations while the negative outliers get the most online attention.
        I'm not convinced this is the story (or, at least the biggest part of it) myself but I'm not ready to declare it illogical either.
      - bityard 1 hour ago
        No, that is not my argument, in fact I don't have any argument whatsoever. It was just a plausible observation that I felt like sharing. There's nothing further to read into it, I don't have a horse in this race.
      - furyofantares 1 hour ago
        Not really, they said "some of this a perceived quality drop". That's almost certainly correct, that _some_ of it is that.
        When everyone's talking about the real degradation, you'll also get everyone who experiences "random"[1] degradation thinking they're experiencing the same thing, and chiming in as well.
        [1] I also don't think we're talking the more technical type of nondeterminism here, temperature etc, but the nondeterminism where I can't really determine when I have a good context and when I don't, and in some cases can't tell why an LLM is capable of one thing but not another. And so when I switch tasks that I think are equally easy and it fails on the new one, or when my context has some meaningless-to-me (random-to-me) variation that causes it to fail instead of succeed, I can't determine the cause. And so I bucket myself with the crowd that's experiencing real degradation and chime in.
  - pydry 2 hours ago
    I wonder how well the "good" versions worked if you threw awkward edge cases at it.
bauerd 3 hours ago
>On March 4, we changed Claude Code's default reasoning effort from high to medium to reduce the very long latency—enough to make the UI appear frozen—some users were seeing in high mode
Instead of fixing the UI they lowered the default reasoning effort parameter from high to medium? And they "traced this back" because they "take reports about degradation very seriously"? Extremely hard to give them the benefit of doubt here.
[-]
- bcherny 3 hours ago
  Hey, Boris from the team here.
  We did both -- we did a number of UI iterations (eg. improving thinking loading states, making it more clear how many tokens are being downloaded, etc.). But we also reduced the default effort level after evals and dogfooding. The latter was not the right decision, so we rolled it back after finding that UX iterations were insufficient (people didn't understand to use /effort to increase intelligence, and often stuck with the default -- we should have anticipated this).
  [-]
  - big_toast 1 hour ago
    Having a "Recovery Mode"/"Safe Boot" flag to disable our configurations (or progressively enable) to see how claude code responds would be nice. Sometimes I get worried some old flag I set is breaking things. Maybe the flag already exists? I tried Claude doctor but it wasn't quite the solution.
    For instance:
    Is Haiku supposed to hit a warm system-prompt cache in a default Claude code setup?
    I had `DISABLE_TELEMETRY=1` in my env and found the haiku requests would not hit the cached system prompt. E.g. on first request just now w/ most recent version (v2.1.118, but happened on others):
    w/ telemetry off - input_tokens:10 cache_read:0 cache_write:28897 out:249
    w/ telemetry on - input_tokens:10 cache_read:24344 cache_write:7237 out:243
    I used to think having so many users was leading to people hitting a lot of edge cases, 3 million users is 3 million different problems. Everyone can't be on the happy path. But then I started hitting weird edge cases and started thinking the permutations might not be under control.
  - EugeneOZ 1 hour ago
    > people didn't understand to use /effort to increase intelligence, and often stuck with the default -- we should have anticipated this
    UI is UI. It is naive to expect that you build some UI but users will "just magically" find out that they should use it as a terminal in the first place.
MrOrelliOReilly 41 minutes ago
IMO this is the consequence of a relentless focus on feature development over core product refinement. I often have the impression that Anthropic would benefit from a few senior product people. Someone needs to lend them a copy of “Escaping the Build Trap.” Just because we _can_ rapidly add features now doesn’t mean we should.
PS I’m not referencing a well-known book to suggest the solution is trite product group think, but good product thinking is a talent separate from good engineering, and Anthropic seems short on the later recently
arkariarn 2 hours ago
I see some anthropic claude code people are reading the comments. A day or two ago I watched a video by theo t3.gg on whether claude got dumber. Even though he was really harsh on anthropic and said some mean stuff. I thought some of the points he was raising about claude code was quite apt. Especially when it comes to the harness bloat. I really hope the new features now stop and there is a real hard push for polish and optimization. Otherwise I think a lot of people will start exploring less bloated more optimized alternatives. Focus on making the harness better and less token consuming.
https://youtu.be/KFisvc-AMII?is=NskPZ21BAe6eyGTh
[-]
- Retr0id 2 hours ago
  Everything else aside, their brief "experiment" with removing CC support from the Pro plan got me seriously considering other options. I've been wary of vendor lock-in the whole time, but it was a useful reminder. (opencode+openrouter will probably be my first port of call)
  [-]
  - wilj 1 hour ago
    I'm 3 weeks into switching from CC to OpenCode, and in some ways it is far superior to CC right out of the box, and I've maybe burned $200 in tokens to make a private fork that is my ultimate development and personal agent platform. Totally worth it.
    Still use CC at work because team standards, but I'd take my OpenCode stack over it any day.
    [-]
    - solarkraft 1 hour ago
      I’m in the process of doing this as well - hackability is such a massive moat.
      Care to share what you changed, maybe even the code?
      [-]
      - wilj 52 minutes ago
        I've got to do some cleanup before sharing (yay vibe coding) but the big things I've changed so far:
        1) Curated a set of models I like and heavily optimized all possible settings, per agent role and even per skill (had to really replumb a lot of stuff to get it as granular as I liked)
        2) Ported from sqlite to postgresql, with heavily extended schema. I generate embeddings for everything, so every aspect of my stack is a knowledge graph that can be vector searched. Integrated with a memory MCP server and auditing tools so I can trace anything that happens in the stack/cluster back to an agent action and even thinking that was related to the action. It really helps refine stuff.
        3) Tight integration of Gitea server, k3s with RBAC (agents get their own permissions in the cluster), every user workspace is a pod running opencode web UI behind Gitea oauth2.
        4) Codified structure of `/projects/<monorepo>/<subrepos>` with simpler browserso non-technical family members can manage their work easier (agents handle all the management and there are sidecars handling all gitops transparent to the user)
        5) Transparent failover across providers with cooldown by making model definitions linked lists in the config, so I can use a handful of subscriptions that offer my favorite models, and fail over from one to the next as I hit quota/rate limits. This has really cut my bill down lately, along with skipping OpenRouter for my favorite models and going direct to Alibaba and Xiaomi so I can tailor caching and stuff exactly how I want.
        6) Integrated filebrowser, a fork of the Milkdown Crepe markdown editor, and codemirror editor so I don't even need an IDE anymore. I just work entirely from OpenCode web UI on whatever device is nearest at the moment. I added support for using Gemma 4 local on CPU from my phone yesterday while waiting in line at a store yesterday.
        Those are the big ones off the top of my head. Im sure there's more. I've probably made a few hundred other changes, it just evolves as I go.
  - 2001zhaozhao 1 hour ago
    The solution IMO is to switch to an agent harness wrapper solution that uses CLI-wrapping or ACP to connect to different coding agents. This is the only way that works across OpenAI, Claude and Gemini.
    There are a few out there (latest example is Zed's new multi-agent UI), but they still rely on the underlying agent's skill and plugin system. I'm experimenting with my own approach that integrates a plugin system that can dynamically change the agent skillset & prompts supplied via an integrated MCP server, allowing you to define skills and workflows that work regardless of the underlying agent harness.
- lanthissa 2 hours ago
  never ever forget theo's gpt 5 hype video and then him having to walk it back.
  its very clear that theres money or influence exchanging hands behind the scenes with certain content creators, the information, and openai.
- whalesalad 2 hours ago
  literally just `git reset --hard <random hash from 3 months ago>` would fix this
  [-]
  - willis936 2 hours ago
    That implies it's broken. Juicing revenue and slashing opex at the expense of brand and customer retention is the feature.
karsinkk 2 hours ago
" Combined with this only happening in a corner case (stale sessions) and the difficulty of reproducing the issue, it took us over a week to discover and confirm the root cause"
I don't know about others, but sessions that are idle > 1h are definitely not a corner case for me. I use Claude code for personal work and most of the time, I'm making it do a task which could say take ~10 to 15mins. Note that I spend a lot of time back and forth with the model planning this task first before I ask it to execute it. Once the execution starts, I usually step away for a coffee break (or) switch to Codex to work on some other project - follow similar planning and execution with it. There are very high chances that it takes me > 1h to come back to Claude.
[-]
- o10449366 2 hours ago
  Yeah and that statement also speaks to their test rigor if they make a change that big without thoroughly testing the edge case they're modifying.
ramoz 13 minutes ago
Opus 4.7 is very rough to work with. Specifically for long-horizon (we were told it was trained specifically for this and less handholding).
I don't have trust in it right now. More regressions, more oversights, it's pedantic and weird ways. Ironically, requires more handholding.
Not saying it's a bad model; it's just not simple to work with.
for now: `/model claude-opus-4-6[1m]` (youll get different behavior around compaction without [1m])
Robdel12 4 hours ago
Wow, bad enough for them to actually publish something and not cryptic tweets from employees.
Damage is done for me though. Even just one of these things (messing with adaptive thinking) is enough for me to not trust them anymore. And then their A/B testing this week on pricing.
[-]
- saghm 3 hours ago
  The A/B testing is by far the most objectionable thing from them so far in my opinion, if only because of how terrible it would be for something like that to be standard for subscriptions. I'd argue that it's not even A/B testing of pricing but silently giving a subset of users an entirely different product than they signed up for; it would be like if 2% of Netflix customers had full-screen ads pop up and cover the videos randomly throughout a show. Historically the only thing stopping companies from extraordinarily user-hostile decisions has been public outcry, but limiting it to a small subset of users seems like it's intentionally designed to try to limit the PR consequences.
  [-]
  - lifthrasiir 3 hours ago
    The best possible situation that I can imagine is that Anthropic just wanted to measure how much value does Claude Code have for Pro users and didn't mean to change the plan itself (so those users would get CC as a "bonus"), but that alone is already questionable to start with.
- mannanj 4 hours ago
  so who do you trust and go to? (NotClearlySo)OpenAI?
  [-]
  - carlgreene 3 hours ago
    I "subconsciously" moved to codex back in mid Feb from CC and it's been so freaking awesome. I don't think it's as good at UI, but man is it thorough and able to gather the right context to find solutions.
    I use "subconsciously" in quotes because I don't remember exactly why I did it, but it aligns with the degradation of their service so it feels like that probably has something to do with it even though I didn't realize it at the time.
    [-]
    - GenerWork 3 hours ago
      Anthropic definitely takes the cake when it comes to UI related activities (pulling in and properly applying Figma elements, understanding UI related prompts and properly executing on it, etc), and I say this as a designer with a personal Codex subscription.
    - snissn 3 hours ago
      it's been frustrating how bad it is at UI. I'm starting to test out using their image2 for UI and then handing it to codex to build out the images into code and I'm impressed and relieved so far
    - cmrdporcupine 2 hours ago
      Codex isn't great at UI, but you might find Gemini is competent enough as an adjunct. I've had some luck with that.
  - simlevesque 3 hours ago
    I went with MiniMax. The token plans are over what I currently need, 4500 messages per 5h, 45000 messages per week for 40$. I can run multiple agents and they don't think for 5-10 minutes like Sonnet did. Also I can finally see the thinking process while Anthropic chose to hide it all from me.
    I'm using Zed and Claude Code as my harnesses.
  - Robdel12 3 hours ago
    At the moment, yeah. If Google ever figures out how to build an agentic model, I would use them as well.
    However you feel about OpenAI, at least their harness is actually open source and they don’t send lawyers after oss projects like opencode
    [-]
    - IncreasePosts 2 hours ago
      Is Gemini cli not an agentic model? Or are you just saying it's built poorly? Gemini 2.5 didn't really work for me but Gemini 3 seems fairly solid
      [-]
      - cmrdporcupine 1 hour ago
        Gemini fairs poorly at tool use, even in its own CLI and even in Antigravity. It gets into a mess just editing source files, it's tragic because it's actually not a bad model otherwise.
  - bensyverson 3 hours ago
    Anecdotally, I know many people who have supplemented Claude with Codex, and are experimenting with models such as GLM 5.1, Kimi, Qwen, etc.
  - parliament32 1 hour ago
    Self-hosted models are the one true path.
  - irthomasthomas 3 hours ago
    I like chutes because they always use the full weights, and prompts are encrypted with TEE.
puppystench 2 hours ago
The Claude UI still only has "adaptive" reasoning for Opus 4.7, making it functionally useless for scientific/coding work compared to older models (as Opus 4.7 will randomly stop reasoning after a few turns, even when prompted otherwise). There's no way this is just a bug and not a choice to save tokens.
[-]
- mattew 1 hour ago
  It was odd that there was no mention of the forced adaptive reasoning in the article. My guess is they don't have enough compute to do anything else here.
nickdothutton 3 hours ago
I presume they don't yet have a cohesive monetization strategy, and this is why there is such huge variability in results on a weekly basis. It appears that Anthropic are skipping from one "experiment" to another. As users we only get to see the visible part (the results). Can't design a UI that indicates the software is thinking vs frozen? Does anyone actually believe that?
lherron 31 minutes ago
Are they also going to refund all the extra usage api $$$ people spent in the last month?
Also I don’t know how “improving our Code Review tool” is going to improve things going forward, two of the major issues were intentional choices. No code review is going to tell them to stop making poor and compromising decisions.
[-]
- dallen33 30 minutes ago
  No, they will not.
vintagedave 1 hour ago
> Today we are resetting usage limits for all subscribers.
I asked for this via support, got a horrible corporate reply thread, and eventually downgraded my account. I'm using Codex now as we speak. I could not use Claude any more, I couldn't get anything done.
Will they restore my account usage limits? Since I no longer have Max?
Is that one week usage restored, or the entire buggy timespan?
[-]
- sowbug 41 minutes ago
  [dead]
cedws 2 hours ago
>On April 16, we added a system prompt instruction to reduce verbosity
In practice I understand this would be difficult but I feel like the system prompt should be versioned alongside the model. Changing the system prompt out from underneath users when you've published benchmarks using an older system prompt feels deceptive.
At least tell users when the system prompt has changed.
[-]
- elAhmo 2 hours ago
  Its also kinda funny they have to rely on system prompt to control verbosity itself.
lukebechtel 3 hours ago
Some people seem to be suggesting these are coverups for quantization...
Those who work on agent harnesses for a living realize how sensitive models can be to even minor changes in the prompt.
I would not suspect quantization before I would suspect harness changes.
dataviz1000 3 hours ago
This is the problem with co-opting the word "harness". What agents need is a test harness but that doesn't mean much in the AI world.
Agents are not deterministic; they are probabilistic. If the same agent is run it will accomplish the task a consistent percentage of the time. I wish I was better at math or English so I could explain this.
I think they call it EVAL but developers don't discuss that too much. All they discuss is how frustrated they are.
A prompt can solve a problem 80% of the time. Change a sentence and it will solve the same problem 90% of time. Remove a sentence it will solve the problem 70% of the time.
It is so friggen' easy to set up -- stealing the word from AI sphere -- a TEST HARNESS.
Regressions caused by changes to the agent, where words are added, changed, or removed, are extremely easy to quantify. It isn’t pass/fail. It’s whether the agent still solves the problem at the same percentage of the time it consistently has.
[-]
- arjie 2 hours ago
  The word is not co-opted. A harness is just supportive scaffolding to run something. A test harness is scaffolding to run tests against software, a fuzz harness is scaffolding to run a fuzzer against the software, and so on. I've seen it being used in this manner many times over the past 15 years. It's the device that wraps your software so you can run it repeatedly with modifications of parameters, source code, or test condition.
  [-]
  - dataviz1000 2 hours ago
    > A harness is just supportive scaffolding to run something.
    Thank you for the perfect explanation.
    Last week in my confusion about the word because Anthropic was using test, eval, and harness in the same sentence so I thought Anthropic made a test harness, I used Google asking "in computer science what is a harness". It responded only discussing test harnesses which solidified my thinking that is what it is.
    I wish Google had responded as clearly you did. In my defense, we don't know if we understand something unless we discuss it.
- thesz 2 hours ago
  To have some confidence in consistency of results (p-value), one has to start from cohort of around 30, if I remember correctly. This is 1.5 orders of magnitude increase of computing power needed to find (absence of) consistent changes of agent's behavior.
  [-]
  - dataviz1000 2 hours ago
    I apologize for the potato quality of these links, however, I have been working tirelessly to wrap my head how to reason about how agents and LLM models work. They are more than just a black box.
    The first tries to answer what happens when I give the models harder and harder arithmetic problems to the point Sonnet will burn 200k tokens for 20minutes. [0]
    The other is a very deep dive into the math of a reasoning model in the only way I could think to approach it, with data visualizations, seeing the computation of the model in real time in relation to all the parts.[1]
    Two things I've learned are that the behavior of an agent that will reverse engineer any website and the behavior of an agent that does arithmetic are the same. Which means the probability that either will solve their intended task is the same for the given agent and task -- it is a distribution. The other, is that models have a blind spot, therefore creating a red team adversary bug hunter agent will not surface a bug if the same model originally wrote the code.
    Understanding that, knowing that I can verify at the end or use majority of votes (MoV), using the agents to automate extremely complicated tasks can be very reliable with an amount of certainty.
    [0] https://adamsohn.com/reliably-incorrect/
    [1] https://adamsohn.com/grpo/
MillionOClock 3 hours ago
I see the Claude team wanted to make it less verbose, but that's actually something that bothered me since updating to Claude 4.7, what is the most recommended way to change it back to being as verbose as before? This is probably a matter of preference but I have a harder time with compact explanations and lists of points and that was originally one of the things I preferred with Claude.
hintymad 1 hour ago
> On March 4, we changed Claude Code's default reasoning effort from high to medium to reduce the very long latency—enough to make the UI appear frozen—some users were seeing in high mode.
This sounds fish. It's easy to show user that Claude is making progress by either printing the reasoning tokens or printing some kind of progress report. Besides, "very long" is such a weasel phrase.
[-]
- reliablereason 12 minutes ago
  Right a very simple UI thing that they should have that would have prevented so much misunderstanding. Is a simple counter. How much usage do a have i used and how much is left.
  If a message will do a cache recreation the cost for that should be viewable.
foota 4 hours ago
> On April 16, we added a system prompt instruction to reduce verbosity. In combination with other prompt changes, it hurt coding quality, and was reverted on April 20. This impacted Sonnet 4.6, Opus 4.6, and Opus 4.7.
Claude caveman in the system prompt confirmed?
[-]
- awesome_dude 3 hours ago
  I've recently been introduced to that plugin, love it for humour
jpcompartir 3 hours ago
Anthropic releases used to feel thorough and well done, with the models feeling immaculately polished. It felt like using a premium product, and it never felt like they were racing to keep up with the news cycle, or reply to competitors.
Recently that immaculately polished feel is harder to find. It coincides with the daily releases of CC, Desktop App, unknown/undocumented changes to the various harnesses used in CC/Cowork. I find it an unwelcome shift.
I still think they're the best option on the market, but the delta isn't as high as it was. Sometimes slowing down is the way to move faster.
[-]
- bcherny 3 hours ago
  Boris from the Claude Code team here. We agree, and will be spending the next few weeks increasing our investment in polish, quality, and reliability. Please keep the feedback coming.
  [-]
  - batshit_beaver 3 hours ago
    > investment in polish, quality, and reliability
    For there to be any trust in the above, the tool needs to behave predictably day to day. It shouldn't be possible to open your laptop and find that Claude suddenly has an IQ 50 points lower than yesterday. I'm not sure how you can achieve predictability while keeping inference costs in check and messing with quantization, prompts, etc on the backend.
    Maybe a better approach might be to version both the models and the system prompts, but frequently adjust the pricing of a given combination based on token efficiency, to encourage users to switch to cheaper modes on their own. Let users choose how much they pay for given quality of output though.
  - pkos98 3 hours ago
    Sure, I've cancelled my Max 20 subscription because you guys prioritize cutting your costs/increasing token efficiency over model performance. I use expensive frontier labs to get the absolute best performance, else I'd use an Open Source/Chinese one.
    Frontier LLMs still suck a lot, you can't afford planned degradation yet.
  - wilj 1 hour ago
    My biggest problem with CC as a harness is that I can't trust "Plan" mode. Long running sessions frequently start bypassing plan mode and executing, updating files and stuff, without permission, while still in plan mode. And the only recovery seems to be to quit and reload CC.
    Right now my solution is to run CC in tmux and keep a 2nd CC pane with /loop watching the first pane and killing CC if it detects plan mode being bypassed. Burning tokens to work around a bug.
  - jpcompartir 2 hours ago
    Thanks, I have a lot of trust in and admiration for the team & respect for the work you guys have done and continue to do.
  - a-dub 2 hours ago
    hm. ml people love static evals and such, but have you considered approaches that typically appear in saas? (slow-rollouts, org/user constrained testing pools with staged rollouts, real-world feedback from actual usage data (where privacy policy permits)?
  - szmarczak 3 hours ago
    Why ban third party wrappers? All of this could've been sidestepped had you not banned them.
    [-]
    - ElFitz 3 hours ago
      Because then they lose vertical integration and the extra ability it grants to tune settings to reduce costs / token use / response time for subscription users.
      Or improve performance and efficiency, if we’re generous and give them the benefit of the doubt.
      It makes sense, in a way. It means the subscription deal is something along the lines of fixed / predictable price in exchange for Anthropic controlling usage patterns, scheduling, throttling (quotas consumptions), defaults, and effective workload shape (system prompt, caching) in whatever way best optimises the system for them (or us if, again, we’re feeling generous) / makes the deal sustainable for them.
      It’s a trade-off
      [-]
      - cmrdporcupine 1 hour ago
        They gained that ability to tune settings and then promptly used it in a poor way and degraded customer experience.
      - szmarczak 2 hours ago
        Nothing you wrote makes sense. The limits are so Anthropic isn't on a loss. If they can customize Claude using Code, I see no reason why they couldn't do so with other wrappers. Other wrappers can also make use of cache.
        If you worry about "degraded" experience, then let people choose. People won't be using other wrappers if they turn out to be bad. People ain't stupid.
        [-]
        ElFitz 56 minutes ago
        By imposing the use of their harness, they control the system prompt:
        > On April 16, we added a system prompt instruction to reduce verbosity. In combination with other prompt changes, it hurt coding quality, and was reverted on April 20. This impacted Sonnet 4.6, Opus 4.6, and Opus 4.7
        They can pick the default reasoning effort:
        > On March 4, we changed Claude Code's default reasoning effort from high to medium to reduce the very long latency—enough to make the UI appear frozen—some users were seeing in high mode
        They can decide what to keep and what to throw out (beyond simple token caching):
        > On March 26, we shipped a change to clear Claude's older thinking from sessions that had been idle for over an hour, to reduce latency when users resumed those sessions. A bug caused this to keep happening every turn for the rest of the session instead of just once, which made Claude seem forgetful and repetitive. We fixed it on April 10. This affected Sonnet 4.6 and Opus 4.6
        It literally is all in the post.
        I don't worry about anything though. It's not my product. I don't work for Anthropic, so I really couldn't care less about anyone else's degraded (or not) experience.
  - troupo 2 hours ago
    And you didn't invest anything in polish, quality and reliability before... why? Because for any questions people have you reply something like "I have Claude working on this right now" and have no idea what's happening in the code?
    A reminder: your vibe-coded slop required peak 68GB of RAM, and you had to hire actual engineers to fix it.
    [-]
    - cmrdporcupine 1 hour ago
      I think you're being a bit harsh.
      ... But then again, many of us are paying out of pocket $100, $200USD a month.
      Far more than any other development tools.
      Services that cost that much money generally come with expectations.
      [-]
      - troupo 1 hour ago
        Here's Jared Sumner of bun saying they reduced peak consumption from 68GB to 1.7GB: https://x.com/jarredsumner/status/2026497606575398987 Anthropic had acquired bun just 3 months prior.
        A month prior their vibe-coders was unironically telling the world how their TUI wrapper for their own API is a "tiny game engine" as they were (and still are) struggling to output a couple of hundred of characters on screen: https://x.com/trq212/status/2014051501786931427
        Meanwhile Boris: "Claude fixes most bugs by itself. " while breaking the most trivial functionality all the time: https://x.com/bcherny/status/2030035457179013235 https://x.com/bcherny/status/2021710137170481431 https://x.com/bcherny/status/2046671919261569477 https://x.com/bcherny/status/2040210209411678369 while claiming they "test carefully": https://x.com/bcherny/status/2024152178273989085
        [-]
        cmrdporcupine 49 minutes ago
        Yeah you don't have to convince me. I switched to Codex mid-January in part because of the dubious quality of the tui itself and the unreliability of the model. Briefly switched back through March, and yep, still a mistake.
        Once OpenAI added the $100 plan, it was kind of a no-brainer.
  - ankaz 2 hours ago
    [dead]
- KronisLV 3 hours ago
  > It felt like using a premium product, and it never felt like they were racing to keep up with the news cycle, or reply to competitors.
  I don't know, their desktop app felt really laggy and even switching Code sessions took a few seconds of nothing happening. Since the latest redesign, however, it's way better, snappy and just more usable in most respects.
  I just think that we notice the negative things that are disruptive more. Even with the desktop app, the remaining flaws jump out: for example, how the Chat / Cowork / Code modes only show the label for the currently selected mode and the others are icons (that aren't very big), a colleague literally didn't notice that those modes are in the desktop app (or at least that that's where you switch to them).
- spaniard89277 3 hours ago
  Given the price I don't really think they're the best option. They're sloppy and competitors are catching up. I'm having same results with other models, and very close with Kimi, which is waaay cheaper.
- kilroy123 2 hours ago
  I agree. It all feels so AI-slopy now.
- OtomotO 3 hours ago
  I guess it's a bit of desperation to find a sustainable business model.
  The AI hype is dying, at least outside the silicon valley bubble which hackernews is very much a part of.
  That and all the dogfooding by slop coding their user facing application(s).
jameson 2 hours ago
> "In combination with other prompt changes, it hurt coding quality, and was reverted on April 20"
Do researchers know correlation between various aspects of a prompt and the response?
LLM, to me at least, appears to be a wildly random function that it's difficult to rely on. Traditional systems have structured inputs and outputs, and we can know how a system returned the output. This doesn't appear to be the case for LLM where inputs and outputs are any texts.
Anecdotally, I had a difficult time working with open source models at a social media firm, and something as simple as wrapping the example of JSON structure with ```, adding a newline or wording I used wildly changed accuracy.
behat 1 hour ago
This is a very interesting read on failure modes of AI agents in prod.
Curious about this section on the system prompt change: >> After multiple weeks of internal testing and no regressions in the set of evaluations we ran, we felt confident about the change and shipped it alongside Opus 4.7 on April 16. As part of this investigation, we ran more ablations (removing lines from the system prompt to understand the impact of each line) using a broader set of evaluations. One of these evaluations showed a 3% drop for both Opus 4.6 and 4.7. We immediately reverted the prompt as part of the April 20 release.
Curious what helped catch in the later eval vs. initial ones. Was it that the initial testing was online A/B comparison of aggregate metrics, or that the dataset was not broad enough?
vicchenai 16 minutes ago
had this happen to me mid-refactor and spent 20 min wondering if I'd gone crazy. honestly the one hour threshold feels pretty arbitrary, sometimes you just step away to think
rebolek 28 minutes ago
> On April 16, we added a system prompt instruction to reduce verbosity.
What verbosity? Most of the time I don’t know what it’s doing.
ankit219 49 minutes ago
An interesting question to wonder is why these optimizations were pushed so aggressively in the first place. Especially given this is the time they were running a 2x promotion, by themselves, without presumably seeing any slowdown in demand.
xlayn 4 hours ago
If anthropic is doing this as a result of "optimizations" they need to stop doing that and raise the price. The other thing, there should be a way to test a model and validate that the model is answering exactly the same each time. I have experienced twice... when a new model is going to come out... the quality of the top dog one starts going down... and bam.. the new model is so good.... like the previous one 3 months ago.
The other thing, when anthropic turns on lazy claude... (I want to coin here the term Claudez for the version of claude that's lazy.. Claude zzZZzz = Claudez) that thing is terrible... you ask the model for something... and it's like... oh yes, that will probably depend on memory bandwith... do you want me to search that?...
YES... DO IT... FRICKING MACHINE..
[-]
- joshstrange 3 hours ago
  It's incredibly frustrating when I've spelled out in CLAUDE.md that it should SSH to my dev server to investigate things I ask it to and it regularly stops working with a message of something like:
  > Next steps are to run `cat /path/to/file` to see what the contents are
  Makes me want to pull my hair out. I've specifically told you to go do all the read-only operations you want out on this dev server yet it keeps forgetting and asking me to do something it can do just fine (proven by it doing it after I "remind" it).
  That and "Auto" mode really are grinding my gears recently. Now, after a Planing session my only option is to use Auto mode and I have to manually change it back to "Dangerously skip permissions". I think these are related since the times I've let it run on "Auto" mode is when it gives up/gets stuck more often.
  Just the other day it was in Auto mode (by accident) and I told it:
  > SSH out to this dev server, run `service my_service_name restart` and make sure there are no orphans (I was working on a new service and the start/stop scripts). If there are orphans, clean them up, make more changes to the start/stop scripts, and try again.
  And it got stuck in some loop/dead-end with telling I should do it and it didn't want to run commands out on a "Shared Dev server" (which I had specifically told it that this was not a shared server).
  The fact that Auto mode burns more tokens _and_ is so dumb is really a kick in the pants.
- marcyb5st 3 hours ago
  Apart from Anthropic nobody knows how much the average user costs them. However the consensus is "much more than that".
  If they have to raise prices to stop hemorrhaging money, would you be willing to pay 1000 bucks a month for a max plan? Or 100$ per 1M pitput tokens (playing numberWang here, but the point stands).
  If I have to guess they are trying to get balance sheet in order for an IPO and they basically have 3 ways of achieving that:
  1. Raising prices like you said, but the user drop could be catastrophic for the IPO itself and so they won't do that
  2. Dumb the models down (basically decreasing their cost per token)
  3. Send less tokens (ie capping thinking budgets aggressively).
  2 and 3 are palatable because, even if they annoying the technical crowd, investors still see a big number of active users with a positive margin for each.
- dgellow 3 hours ago
  I would love if agents would act way more like tools/machines and NOT try to act as if they were humans
- Keeeeeeeks 3 hours ago
  https://marginlab.ai/ (no affiliation)
  There are a number of projects working on evals that can check how 'smart' a model is, but the methodology is tricky.
  One would want to run the exact same prompt, every day, at different times of the day, but if the eval prompt(s) are complex, the frontier lab could have a 'meta-cognitive' layer that looks for repetitive prompts, and either: a) feeds the model a pre-written output to give to the user b) dumbs down output for that specific prompt
  Both cases defeat the purpose in different ways, and make a consistent gauge difficult. And it would make sense for them to do that since you're 'wasting' compute compared to the new prompts others are writing.
  [-]
  - hex4def6 2 hours ago
    I think you could alter the prompt in subtle ways; a period goes to an ellipses, extra commas, synonyms, occasional double-spaces, etc.
    Enough that the prompt is different at a token-level, but not enough that the meaning changes.
    It would be very difficult for them to catch that, especially if the prompts were not made public.
    Run the variations enough times per day, and you'd get some statistical significance.
    The guess the fuzzy part is judging the output.
- JyB 1 hour ago
  This specifically is super annoying.
sutterd 1 hour ago
What kind of performance are people getting now? I was running 4.7 yesterday and it did a remarkably bad job. I recreated my repo state exactly and ran the same starting task with 4.5 (which I have preferred to 4.6). It was even worse, by a large margin. It is likely my task was a difficult or poorly posed, but I still have some idea of what 4.5 should have done on it. This was not it. What experiences are other people having with the 4.7? How about with other model versions, if they are trying them? (In both cases, I ran on max effort, for whatever that is worth.)
jwpapi 1 hour ago
Those are exactly the kind of issues you run into when your app is ai coded you built one thing and kill something else.
You have too many and the wrong benchmarks
jryio 4 hours ago
1. They changed the default in March from high to medium, however Claude Code still showed high (took 1 month 3 days to notice and remediate)
2. Old sessions had the thinking tokens stripped, resuming the session made Claude stupid (took 15 days to notice and remediate)
3. System prompt to make Claude less verbose reducing coding quality (4 days - better)
All this to say... the experience of suspecting a model is getting worse while Anthropic publicly gaslights their user-base: "we never degrade model performance" is frustrating.
Yes, models are complex and deploying them at scale given their usage uptick is hard. It's clear they are playing with too many independent variables simultaneously.
However you are obligated to communicate honestly to your users to match expectations. Am I being A/B tested? When was the date of the last system prompt change? I don't need to know what changed, just that it did, etc.
Doing this proactively would certainly match expectations for a fast-moving product like this.
[-]
- fn-mote 3 hours ago
  > 2. Old sessions had the thinking tokens stripped, resuming the session made Claude stupid (took 15 days to notice and remediate)
  This one was egregious: after a one hour user pause, apparently they cleared the cache and then continued to apply “forgetting” for the rest of the session after the resume!
  Seems like a very basic software engineering error that would be caught by normal unit testing.
- Eridrus 4 hours ago
  To be fair to Anthropic, they did not intentionally degrade performance.
  To take the opposite side, this is the quality of software you get atm when your org is all in on vibe coding everything.
  [-]
  - shrx 1 hour ago
    Are you saying dropping cache after 1 hour is not intentionally degrading performance?
- sroussey 4 hours ago
  None of these problems equate to degrading model performance. Completely different team. Degraded CC harness, sure.
  [-]
  - qingcharles 4 hours ago
    Sure, but it gives the impression of degraded model performance. Especially when the interface is still saying the model is operating on "high", the same as it did yesterday, yet it is in "medium" -- it just looks like the model got hobbled.
    [-]
    - sroussey 3 hours ago
      Oh, absolutely. Though changes in how the model is used is imminently more fixable than the model itself.
  - johnmaguire 3 hours ago
    Yes, but for many users, CC is the product. Especially since I'm not allowed(?) to use my own harness with my sub.
- Philpax 4 hours ago
  > Anthropic publicly gaslights their user-base: "we never degrade model performance" is frustrating.
  They're not gaslighting anyone here: they're very clear that the model itself, as in Opus 4.7, was not degraded in any way (i.e. if you take them at their word, they do not drop to lower quantisations of Claude during peak load).
  However, the infrastructure around it - Claude Code, etc - is very much subject to change, and I agree that they should manage these changes better and ensure that they are well-communicated.
  [-]
  - jryio 3 hours ago
    Model performance at inference in a data center v.s. stripping thinking tokens are effectively the same.
    Sure they didn't change the GPUs their running, or the quantization, but if valuable information is removed leading to models performing worse, performance was degraded.
    In the same way uptime doesn't care about the incident cause... if you're down you're down no one cares that it was 'technically DNS'.
    [-]
    - sroussey 3 hours ago
      I thought these days thinking tokens sent my the model (as opposed to used internally) were just for the users benefit. When you send the convo back you have to strip the thinking stuff for next turn. Or is that just local models?
  - aszen 3 hours ago
    Claude code is not infra, the model is the infra. They changed settings to make their models faster and probably cheaper to run too. Honestly with adaptive thinking it no longer matters what model it is if you can dynamically make it do less or more work.
ctoth 2 hours ago
> As of April 23, we’re resetting usage limits for all subscribers.
Wait, didn't they just reset everybody's usage last Thursday, thereby syncing everybody's windows up? (Mine should have reset at 13:00 MDT) ? So this is just the normal weekly reset? Except now my reset says it will come Saturday? This is super-confusing!
[-]
- walthamstow 2 hours ago
  The weekly reset point is different per account. I think something to do with first sign-up date. Mine is on a Tuesday.
  [-]
  - schpet 2 hours ago
    mine was originally on sunday, then got moved to thursday (which i disliked), and it is still on thursday. so them resetting my weekly limit on the same day it was scheduled to reset feels like a joke.
    [-]
    - throwaway2027 2 hours ago
      You need to send a new message once your limit is up to make the timer start rolling again. It sucks and I hate it when I had no need for Claude during the day but also forgot to use it then it shifted my reset date a day later.
      [-]
      - schpet 1 hour ago
        oh! super helpful info. i was aware of that with the hourly ones, but never put it together with weekly. thank you.
lifthrasiir 3 hours ago
Is it just for me that the reset cycle of usage limits has been randomly updated? I originally had the reset point at around 00:00 UTC tomorrow and it was somehow delayed to 10:00 UTC tomorrow, regardless of when I started to use Claude in this cycle. My friends also reported very random delay, as much as ~40 hours, with seemingly no other reason. Is this another bug on top of other bugs? :-S
[-]
- someone4958923 3 hours ago
  "This isn’t the experience users should expect from Claude Code. As of April 23, we’re resetting usage limits for all subscribers."
  [-]
  - lifthrasiir 3 hours ago
    I know that. I'm saying that the cycle reset is not what it used to (starting at the very first usage) or what it might be (retaining the cycle reset timing).
    [-]
    - jongleberry 3 hours ago
      it seems to be the same cycle for everyone now, not based on first usage. I saw a reddit thread on this from someone who had multiple accounts that all had the same cycles
rfc_1149 2 hours ago
The third bug is the one worth dwelling on. Dropping thinking blocks every turn instead of just once is the kind of regression that only shows up in production traffic. A unit test for "idle-threshold clearing" would assert "was thinking cleared after an hour of idle" (yes) without asserting "is thinking preserved on subsequent turns" (no). The invariant is negative space.
The real lesson is that an internal message-queuing experiment masked the symptoms in their own dogfooding. Dogfooding only works when the eaten food is the shipped food.
pxc 2 hours ago
One of Anthropic's ostensive ethical goals is to produce AI that is "understandable" as well as exceptionally "well-aligned". It's striking that some of the same properties that make AI risky also just make it hard to consistently deliver a good product. It occurs to me that if Anthropic really makes some breakthroughs in those areas, everyone will feel it in terms of product quality whether they're worried about grandiose/catastrophic predictions or not.
But right now it seems like, in the case of (3), these systems are really sensitive and unpredictable. I'd characterize that as an alignment problem, too.
psubocz 1 hour ago
> All three issues have now been resolved as of April 20 (v2.1.116).
The latest in homebrew is 2.1.108 so not fixed, and I don't see opus 4.7 on the models list... Is homebrew a second class citizen, or am I in the B group?
kristianc 1 hour ago
To think we'd have known about this in advance if they'd just have open sourced Claude Code, rather than them being forced into this embarrassing post mortem. Sunlight is the best disinfectant.
8note 1 hour ago
something i note from this is that this is not a model weights change, but it is a hidden state change anthropic is doing to the outputs that can tune the quality and down on the "model" without breaking the "we arent changing the model" promise.
how often do these changes happen?
WhitneyLand 4 hours ago
Did they not address how adaptive thinking has played in to all of this?
arjie 3 hours ago
Useful update. Would be useful to me to switch to a nightly / release cycle but I can see why they don't: they want to be able to move fast and it's not like I'm going to churn over these errors. I can only imagine that the benchmark runs are prohibitively expensive or slow or not using their standard harness because that would be a good smoke test on a weekly cadence. At the least, they'd know the trade-offs they're making.
Many of these things have bitten me too. Firing off a request that is slow because it's kicked out of cache and having zero cache hits (causes everything to be way more expensive) so it makes sense they would do this. I tried skipping tool calls and thinking as well and it made the agent much stupider. These all seem like natural things to try. Pity.
munk-a 3 hours ago
It's also important to realize that Anthropic has recently struck several deals with PE firms to use their software. So Anthropic pays the PE firm which forces their managed firms to subscribe to Anthropic.
The artificial creation of demand is also a concerning sign.
VadimPR 3 hours ago
Appreciate the honesty from the team.
At the same time, personally I find prioritizing quality over quantity of output to be a better personal strategy. Ten partially buggy features really aren't as good as three quality ones.
natdempk 3 hours ago
As an end-user, I feel like they're kind of over-cooking and under-describing the features and behavior of what is a tool at the end of the day. Today the models are in a place where the context management, reasoning effort, etc. all needs to be very stable to work well.
The thing about session resumption changing the context of a session by truncating thinking is a surprise to me, I don't think that's even documented behavior anywhere?
It's interesting to look at how many bugs are filed on the various coding agent repos. Hard to say how many are real / unique, but quantities feel very high and not hard to run into real bugs rapidly as a user as you use various features and slash commands.
KronisLV 3 hours ago
This reads like good news! They probably still lost a bunch of users due to the negative public sentiment and not responding quickly enough, but at least they addressed it with a good bit of transparency.
Alifatisk 4 hours ago
It’s incredible how forgiving you guys are with Anthropic and their errors. Especially considering you pay high price for their service and receive lower quality than expected.
[-]
- saghm 3 hours ago
  At least personally, it feels like the choices are the one that's okay with being used for mass surveillance and autonomous weapons targeting, the one that's on track to get acquired by the AI company that dragged its feet in getting around to stopping people from making child porn with it, the one that nobody seems to use from Google, and the one that everyone complains about but also still seems to be using because it at least sometimes works well. At this point I've opted out of personal LLM coding by canceling my subscription (although my employer still has subscriptions and wants us to keep using them, so I'll presumably keep using Claude there) but if I had to pick one to spend my own money on I'd still go with Claude.
  [-]
  - scblock 3 hours ago
    A valid choice, a moral choice, is none of the above.
- ed_elliott_asc 3 hours ago
  I pay for 20x max and get so much more value out of it than I pay.
- Avicebron 3 hours ago
  It's still night and day the difference in quality between chatgpt5.4 and opus 4.7. Heck even on Perplexity where 5.4 is included in Pro vs 4.7 which is behind the max plan or whatever, I will pick sonnet 4.6 over the 5.4 offering and it's consistently better. I don't love Anthropic, I don't have illusions about them as a business.
  But if a tool is better, it's better.
  [-]
  - wahnfrieden 3 hours ago
    You aren’t getting the 5.4 experience for code if you’re not using it in the Codex harness
- arnvald 3 hours ago
  What's the alternative? Are you suggesting other LLM providers don't charge high price? Or that they don't make mistakes? Or that they provide better quality?
  We're talking about dynamically developed products, something that most people would have considered impossible just 5 years ago. A non-deterministic product that's very hard to test. Yes, Anthropic makes mistakes, models can get worse over time, their ToS change often. But again, is Gemini/GPT/Grok a better alternative?
- mlinsey 4 hours ago
  The consumer surplus is quite high. Even with the regressions in this postmortem, performance was above the models last fall, when I was gladly paying for my subscription and thought it was net saving me time.
  That said, there is now much better competition with Codex, so there's only so much rope they have now.
- scottyah 3 hours ago
  It's fairly small issues for an amazing product, and the company is just a few years old and growing rapidly. Also, they are leading a powerful technological revolution and their competitors are known to have multiple straight up evil tendencies. A little degradation is not an issue.
- timmg 3 hours ago
  > It’s incredible how forgiving you guys are with Anthropic and their errors.
  Ironically, I was thinking the exact opposite. This is bleeding edge stuff and they keep pushing new models and new features. I would expect issues.
  I was surprised at how much complaining there is -- especially coming from people who have probably built and launched a lot of stuff and know how easy it is to make mistakes.
- AntiUSAbah 3 hours ago
  Because it is still good though.
  If you have a good product, you are more understanding. And getting worse doesn't mean its no longer valuable, only that the price/value factor went down. But Opus 4.5 was relevant better and only came out in November.
  There was no price increase at that time so for the same money we get better models. Opus 4.6 again feels relevant better though.
  Also moving fastish means having more/better models faster.
  I do know plenty of people though which do use opencode or pi and openrouter and switching models a lot more often.
- lukasus 3 hours ago
  At the time you wrote your comment there were 4 other comments and all of them very negative towards the Anthropic and the blog post in question here. How did you get this conclusions?
  [-]
  - lukan 3 hours ago
    Confused as well, I rather supposed Antrophic had some standing for saying no to Trump and being declared national security threat, but the anger they got and people leaving to OpenAI again, who gladly said yes to autonomous killing AI did astonish me a bit. And I also had weird things happening with my usage limits and was not happy about it. But it is still very useful to me - and I only pay for the pro plan.
    [-]
    - sunaookami 3 hours ago
      >I rather supposed Antrophic had some standing for saying no to Trump and being declared national security threat
      I never understood why people cheered for Anthropic then when they happily work together with Palantir.
  - unselect5917 3 hours ago
    HN glazes anthropic every single time I see it come up. This is as obvious as HN's political bias.
- jgbuddy 3 hours ago
  Anthropic actually not so bad. Anthropic models code good, usually. Price not so high compared to time to do it by self.
- OsrsNeedsf2P 3 hours ago
  Look at any criticism of Mythos. Some members on HN are defending it tooth and nail, despite it not being released
- fastball 3 hours ago
  What high price? I pay $200/m for an insane number of tokens.
- operatingthetan 3 hours ago
  I don't think Anthropic has to inform their customers of every change they make, but they should have with this one.
- oytis 3 hours ago
  Remember Louis CK talking about Wi-Fi on an airplane? People are dealing with highly experimental technology here
- tempest_ 3 hours ago
  A lot of people are provided their access through work.
  They don't actually pay the bill or see it.
- mystraline 3 hours ago
  Exactly. They've done now like 6 rug-pulls.
  Idiots keep throwing money at real-time enshittification and 'I am changing the terms. Pray I do not change them further".
  And yes, I am absolutely calling people who keep getting screwed and paying for more 'service' as idiots.
  And Anthropic has proved that they will pay for less and less. So, why not fuck them over and make more company money?
davidfstr 3 hours ago
Good on Anthropic for giving an update & token refund, given the recent rumors of an inexplicable drop in quality. I applaud the transparency.
[-]
- scuderiaseb 2 hours ago
  Opus 4.7 was released a week ago, at that point all limits were reset, so this was very beneficial to them because basically everyones weekly limit Was anyway about to be reset.
einrealist 3 hours ago
Is 'refactoring Markdown files' already a thing?
[-]
- ireadmevs 3 hours ago
  Read Claude’s skill to create other skills and you’ll see that this ship has already sailed
  https://skills.sh/anthropics/skills/skill-creator
2001zhaozhao 3 hours ago
How about just not change the harness abruptly in the first place? Make new system prompt changes "experimental" first so you can gather feedback.
[-]
gilrain 2 hours ago
Hi Boris, random observer here. Would you consider apologizing to the community for mistakenly closing tickets related to this and then wrongly keeping them closed when, internally, you realized they were legitimate?
I think an apology for that incident would go a long way.
tontinton 2 hours ago
or you can use a non vibe designed efficient Rust TUI coding agent made by yours truly, all my coworkers use it too :) called https://maki.sh!
lua plugins WIP
ayhanfuat 4 hours ago
Reading the "Going forward" section I see that they have zero understanding of the main complaints.
[-]
- Kiro 3 hours ago
  How so?
  [-]
  - ayhanfuat 3 hours ago
    They feel they're in a position to make important trade-off decisions on behalf of the user. "It's just slightly worse, I'll sneak this change in" is not something to be tolerated, whether it actually turns out to be much worse or not. Their adaptive thinking mess has caused a ton of work for me. I know a lot of people are saying Codex is actually better now. I don't agree but I'm switching to it because it's much more reliable.
    [-]
    - operatingthetan 3 hours ago
      I agree, but these LLM products are all black-boxes so we need to demand more accountability from them.
throwaway2027 2 hours ago
Cool but I switched to Codex for the time being.
maxrev17 1 hour ago
Please for the love of god just put the max price plan up like 4x or 5x in cost and make it actually work.
setnone 3 hours ago
Good on them for resolving all three issues, but is it any good again?
[-]
- alxndr13 3 hours ago
  for me at least, yes. just wrote it to coworkers this afternoon. Behaves way more "stable" in terms of quality and i don't have the feeling of the model getting way worse after 100k tokens of context or so.
  What i notice: after 300k there's some slight quality drop, but i just make sure to compact before that threshold.
whalesalad 2 hours ago
The funny thing is, in the last 3 days Claude has gotten substantially worse. So this claim, "All three issues have now been resolved as of April 20 (v2.1.116)" does not land with me at all.
EugeneOZ 1 hour ago
If you think that you can just silently modify the model without any announcements and only react when it doesn't go through unnoticed, then be 100% sure that your clients will check every possible alternative and will leave you as soon as they find anything similar in quality (and no, not a degraded one).
ramesh31 2 hours ago
Effort should not be configurable for Opus, it should be set to a single default that provides the highest level of capability. There are zero instances in which I am willing to accept a lesser result in exchange for a slightly faster response from Opus. If that were the case I would be using Flash or Haiku.
walthamstow 2 hours ago
So we weren't going mad then!
motbus3 3 hours ago
I had similar experience just before 4.5 and before 4.6 were released.
Somehow, three times makes me not feel confident on this response.
Also, if this is all true and correct, how the heck they validate quality before shipping anything?
Shipping Software without quality is pretty easy job even without AI. Just saying....
taytus 1 hour ago
They should do a similar report about their communication team. This was horrible mismanaged.
bearjaws 4 hours ago
The issue making Claude just not do any work was infuriating to say the least. I already ran at medium thinking level so was never impacted, but having to constantly go "okay now do X like you said" was annoying.
Again goes back to the "intern" analogy people like to make.
Rapzid 1 hour ago
> On March 4, we changed Claude Code's default reasoning effort from high to medium to reduce the very long latency—enough to make the UI appear frozen—some users were seeing in high mode.
Translation: To reduce the load on our servers.
antirez 2 hours ago
Zero QA basically.
[-]
- 8note 1 hour ago
  id go more on the lines of "dont know what to QA for"
systemvoltage 3 hours ago
Interesting. All 3 seems like they’re obviously going to impact quality. e.g, reducing the effort from high to medium.
So then, there must have been an explicit internal guidance/policy that allowed this tradeoff to happen.
Did they fix just the bug or the deeper policy issue?
hajile 2 hours ago
My takeaway is that they knew they were changing a bunch of stuff while their reps were gaslighting us in the comments here.
Why should we ever trust what they say again out trust that they won’t be rug-pulling again once this blows over?
rishabhaiover 3 hours ago
Boris gaslighted us with all the quality related incidents for weeks not acknowledging these problems.
[-]
- throwaway2027 2 hours ago
  Maybe he didn't know or they were still figuring it out which is fine they're still engineers who can get things wrong sometimes but the communication felt lackluster and being on the receiving end sucks when you had a reliable setup which then degrades. There is a reason people don't upgrade software and why people say if it works don't fix it, but obviously that's not an option for Anthropic when you want to keep improving the product, so they need good measurement tools and quick rollbacks even if properly "benchmarking" LLMs could prove difficult.
jruz 2 hours ago
Too late bro, switched to Codex I’m done with your bullshit.
0gs 3 hours ago
wow resetting everyone's usage meter is great. i was so close to finally hitting my weekly limit for once though
teaearlgraycold 4 hours ago
> On March 26, we shipped a change to clear Claude's older thinking from sessions that had been idle for over an hour, to reduce latency when users resumed those sessions. A bug caused this to keep happening every turn for the rest of the session instead of just once, which made Claude seem forgetful and repetitive. We fixed it on April 10. This affected Sonnet 4.6 and Opus 4.6.
Is it just me or does this seem kind of shocking? Such a severe bug affecting millions of users with a non-trivial effect on the context window that should be readily evident to anyone looking at the analytics. Makes me wonder if this is the result of Anthropic's vibe-coding culture. No one's actually looking at the product, its code, or its outputs?
[-]
- nrki 3 hours ago
  > we refunded all affected customers
  Notably missing from the postmortem
- chermi 3 hours ago
  It's really hard to understand. There needs to be really loud batman sign in the sky type signals from some hero third party calling out objective product degradation. Do they use cc internally? If so do they use a different version? This should've been almost as loud a break as service just going down altogether, yet it took 2 weeks to fix?!
  [-]
  - poly2it 3 hours ago
    > ... we’ll ensure that a larger share of internal staff use the exact public build of Claude Code (as opposed to the version we use to test new features) ...
    Apparently they are using another version internally.
- manmal 4 hours ago
  I think that would also have busted cache all the time, and uncached requests consume usage limits rapidly.
dcchambers 1 hour ago
So it turns out Anthropic was gaslighting everyone on twitter about this then? Swearing that nothing had changed and people were imagining the models got worse?
dainiusse 4 hours ago
Corporate bs begins...
whalesalad 2 hours ago
I genuinely don't understand what they have been trying to achieve. All of these incremental "improvements" have ... not improved anything, and have had the opposite effect.
My trust is gone. When day-to-day updates do nothing but cause hundreds of dollars in lost $$$ tokens and the response is "we ... sorta messed up but just a little bit here and there and it added up to a big mess up" bro get fuckin real.
petervandijck 3 hours ago
I have noticed a clear increase in smarts with 4.7. What a great model!
People complain so much, and the conspiracy theories are tiring.
troupo 3 hours ago
> they were challenging to distinguish from normal variation in user feedback at first
translation: we ignored this and our various vibe coders were busy gaslighting everyone saying this could not be happening
yuvrajmalgat 2 hours ago
ohh
o10449366 2 hours ago
Resuming from sessions are still broken since Feb (I had to get claude to write a hook to fix that itself), the monitoring tool doesn't work and blocks usage of what does (simple sleep - except it doesn't even block correctly so you just sidestep in more ridiculous ways), and yet there seems to be more annoying activity proxies/spinner wheels (staring into middle distance)... Like I don't know how in a span of a few months you lose such focus on your product goals. Has Anthropic reached that point in their lifecycle already where their product team is no longer staffed by engineers and they have more and more non-technical MBAs joining trying to ride the hype train?
cute_boi 2 hours ago
Honestly, it’s kind of sad that Anthropic is winning this AI race. They are the most anti–open source company, and we should try to avoid them as much as possible.
They are all doing it because OpenAI is snatching their customers. And their employees have been gaslighting people [1] for ages. I hope open-source models will provide fierce competition so we do not have to rely on an Anthropic monopoly. [1] https://www.reddit.com/r/claude/comments/1satc4f/the_biggest...
mkilmanas 1 hour ago
[dead]
WhoffAgents 1 hour ago
[dead]
KaiShips 3 hours ago
[dead]
Bmello11 2 hours ago
[dead]
agentbonnybb 1 hour ago
[dead]
tommy29tmar 3 hours ago
[dead]
gverrilla 2 hours ago
[dead]
ElFitz 3 hours ago
Now we know why Anthropic banned the use of subscriptions with other agent harnesses: they partially rely on the Claude Code cli to control token usage through various settings.
And it also tells us why we shouldn’t use their harness anyway: they constantly fiddle with it in ways that can seriously impact outcomes without even a warning.