GPT-5.5

(openai.com)

876 points | by rd 4 hours ago

100 comments

_alternator_ 1 hour ago
> One engineer at NVIDIA who had early access to the model went as far as to say: "Losing access to GPT‑5.5 feels like I've had a limb amputated.”
This quote is more sinister than I think was intended; it likely applies to all frontier coding models. As they get better, we quickly come to rely on them for coding. It's like playing a game on God Mode. Engineers become dependent; it's truly addictive.
This matches my own experience and unease with these tools. I don't really have the patience to write code anymore because I can one shot it with frontier models 10x faster. My role has shifted, and while it's awesome to get so much working so quickly, the fact is, when the tokens run out, I'm basically done working.
It's literally higher leverage for me to go for a walk if Claude goes down than to write code because if I come back refreshed and Claude is working an hour later then I'll make more progress than mentally wearing myself out reading a bunch of LLM generated code trying to figure out how to solve the problem manually.
Anyway, it continues to make me uneasy, is all I'm saying.
[-]
- sharts 34 minutes ago
  One might argue that it’s not too too different from higher level abstractions when using libraries. You get things done faster, write less code, library handles some internal state/memory management for you.
  Would one be uneasy about calling a library to do stuff than manually messing around with pointers and malloc()? For some, yes. For others, it’s a bit freeing as you can do more high-level architecture without getting mired and context switched from low level nuances.
  [-]
  - ofjcihen 25 minutes ago
    I see this comparison made constantly and for me it misses the mark.
    When you use abstractions you are still deterministically creating something you understand in depth with individual pieces you understand.
    When you vibe something you understand only the prompt that started it and whether or not it spits out what you were expecting.
    Hence feeling lost when you suddenly lose access to frontier models and take a look at your code for the first time.
    I’m not saying that’s necessarily always bad, just that the abstraction argument is wrong.
    [-]
    - ComplexSystems 2 minutes ago
      It seems like some kind of technique is needed that maximizes information transfer between huge LLM generated codebases and a human trying to make sense of them. Something beyond just deep diving into the codebase with no documentation.
    - moritonal 20 minutes ago
      I think it's more: when I don't have access to a compiler I am useless. It's better to go for a walk than learn assembly. AI agents turn our high-level language into code, with various hints, much like the compiler.
      [-]
      - professoretc 6 minutes ago
        If my compiler "went down" I could still think through the problem I was trying to solve, maybe even work out the code on paper. I could reach a point where I would be fairly confident that I had the problem solved, even though I lacked the ability to actually implement the solution.
        If my LLM goes down, I have nothing. I guess I could imagine prompts that might get it to do what I want, but there's no guarantee that those would work once it's available again. No amount of thought on my part will get me any closer to the solution, if I'm relying on the LLM as my "compiler".
      - jwrallie 1 minute ago
        The difference is that there is a company that can easily take your agents away from you.
      - ofjcihen 16 minutes ago
        Still misses the mark. You aren’t useless in the same way because you are still in control of reasoning about the exact code even if you never actually write it.
    - simondotau 20 minutes ago
      Perhaps then, the better analogy is like being promoted at your company and having people under you doing the grunt work.
      [-]
      - ofjcihen 16 minutes ago
        This is how I’ve come to think of it. Delegation of the details.
  - xg15 1 minute ago
    > Would one be uneasy about calling a library to do stuff than manually messing around with pointers and malloc()?
    The irony is that the neverending stream of vulnerabilities in 3rd-party dependencies (and lately supply-chain attacks) increasingly show that we should be uneasy.
    We could never quite answer the question about who is responsible for 3rd-party code that's deployed inside an application: Not the 3rd-party developer, because they have no access to the application. But not the application developer either, because not having to review the library code is the whole point.
  - theappsecguy 16 minutes ago
    I would argue it couldn't be more different. I can dive into the source code of any library, inspect it. I can assess how reliable a library is and how popular. Bugs aside, libraries are deterministic. I don't see why this parallel keeps getting made over and over again.
  - Salgat 5 minutes ago
    I hate this comparison because you're comparing a well defined deterministic interface with LLM output, which is the exact opposite.
- Alex_L_Wood 53 minutes ago
  Well, they obviously are going to say that, they have vested interest in OpenAI and thus Nvidia stock price growing.
  Also, I honestly can’t believe the 10x mantra is being still repeated.
  [-]
  - dandaka 39 minutes ago
    Writing code is 10-100x faster, doing actual product engineering work is nowhere near that multipliers — no conflict!
    [-]
    - giwook 16 minutes ago
      Reviewing code is slower now though because you didn't write the code in the first place so you're basically reviewing someone else's PR. And now it's like a 3000 line PR in an hour or two instead of every couple weeks.
  - embedding-shape 37 minutes ago
    > Also, I honestly can’t believe the 10x mantra is being still repeated.
    I'm sure in 20 years we'll all be programming via neural interfaces that can anticipate what you want to do before you even finished your thoughts, but I'm confident we'll still have blog posts about how some engineers are 10x while others are just "normal programmers".
    [-]
    - rglullis 23 minutes ago
      What does it mean to "be an engineer" in a world where anyone can talk to a machine and the operating system can write the code (on-demand, if needed) that does what they want?
      [-]
      - embedding-shape 17 minutes ago
        Indeed, and what is really the difference between a software engineer, programmer, coder and hacker anyways?
        [-]
        rglullis 5 minutes ago
        There used to a time where "computer" was a person who manually run calculations. These don't exist anymore.
        So, my point is that once we can machines generating software (not "code") that can be usable by non-technical people, "programming" will not be a profession anymore. There will be no point in talking about "10x software engineers" because the process to produce a software product will be entirely automated.
    - huijzer 29 minutes ago
      I rather become a plumber than some device scanning not just my face but my whole brain
    - keybored 20 minutes ago
      That is simply programmer nature. Cannot be changed.
- jstummbillig 1 hour ago
  > This quote is more sinister than I think was intended; it likely applies to all frontier coding models. As they get better, we quickly come to rely on them for coding. It's like playing a game on God Mode. Engineers become dependent; it's truly addictive.
  What's the worst potential outcome, assuming that all models get better, more efficient and more abundant (which seems to be the current trend)? The goal of engineering has always been to build better things, not to make it harder.
  [-]
  - Spartan-S63 6 minutes ago
    At some point, because these models are trained on existing data, you cease significant technological advancement--at least in tech (as it relates to programming languages, paradigms, etc). You also deskill an entire group of people to the extent that when an LLM fails to accomplish a task, it becomes nearly impossible to actually accomplish it manually.
    It's learned-helplessness on a large scale.
  - Jtarii 38 minutes ago
    >What's the worst potential outcome, assuming that all models get better, more efficient and more abundant
    Complexity steadily rises, unencumbered by the natural limit of human understanding, until technological collapse, either by slow decay or major systems going down with increasing frequency.
    [-]
    - motoxpro 33 minutes ago
      why would the systems go down if the models are better at the humans at finding bugs. Playing a bit of devils advocate here, but why would the models be worse at handling the complexity if you assume they will get better and better.
      All software has bugs already.
      [-]
      - Jtarii 19 minutes ago
        Adding complexity to software has never been easier than it is right now, we really have no idea if the models will progress to the point where they can actually write large systems in a maintainable way. Taking the gamble that the models of the future will dig us out of the gigantic hole we are currently digging is bold.
      - cyberax 22 minutes ago
        Finding bugs does not equal being able to do good architecting.
    - simondotau 14 minutes ago
      It’s always been thus at lower layers of abstraction. Only a minority of programmers would understand how to write an operating system. Only a tiny number of people would know how a modern CPU logically works, and fewer still could explain the electrical physics.
    - fdsajfkldsfklds 32 minutes ago
      The Anti-Singlarity! It's coming for us all.
  - _alternator_ 56 minutes ago
    Worst case? I dunno, maybe the world's oldest profession becomes the world's only profession? Something along those lines.
    [-]
    - FeteCommuniste 45 minutes ago
      > the world's oldest profession becomes the world's only profession
      Until the sexbots come out the other side of the uncanny valley, that is.
      [-]
      - shellwizard 30 minutes ago
        Death by snu snu
- HasKqi 14 minutes ago
  This engineer had their brain amputated once they started using AI. All the AI-addicted can do is tinker with the AI computer game and feel "productive". They could as well play Magic The Gathering.
- tshaddox 1 hour ago
  Assuming that local models are able to stay within some reasonably fixed capability delta of the cutting edge hosted models (say, 12 months behind), and assuming that local computing hardware stays relatively accessible, the only risk is that you'll lose that bit of capability if the hosted models disappear or get too expensive.
  Note that neither of these assumptions are obviously true, at least to me. But I can hope!
- __alexs 25 minutes ago
  I feel like most engineers I talk to still haven't realised what this is going to mean for the industry. The power loom for coding is here. Our skills still matter, but differently.
  [-]
  - rglullis 9 minutes ago
    > power loom
    When the power loom came around, what happened with most seamtress? Did they move on to become fashion designers, materials engineers to create new fabrics, chemists to create new color dyes, or did they simply retire or were driven out of the workforce?
- alansaber 1 hour ago
  That's the path we've been going down for a few years now. The current hedge is that frontier labs are actively competing to win users. The backup hedge is that open source LLMs can provide cheap compute. There will always be economical access to LLMs, but the provider with the best models will be able to charge basically whatever they want and still have buyers.
  [-]
  - trvz 49 minutes ago
    Open source LLMs aren’t about cost foremost, but stability.
- davmar 11 minutes ago
  i wonder if this is how engineers felt when the first electronic calculators came out and engineers stopped doing math by hand.
  did we feel uneasy that a new generation of builders didn't have to solve equations by hand because a calculator could do them?
  i'm not sure it's the same analogy but in some ways it holds.
  [-]
  - hapticmonkey 6 minutes ago
    The analogy would hold if there were 2 or 3 calculator companies and all your calculations had to be sent to them.
    If local models get good enough, I think it’s a very different scenario than engineers all over the world relying on central entities which have their own motives.
- dannyw 1 hour ago
  You’re still the one that’s controlling the model though and steering it with your expertise. At least that’s what I tell myself at night :)
  I haven’t really thought about this before, but you’re right, it feels a bit uneasy for me too.
  [-]
  - topspin 59 minutes ago
    > You’re still the one that’s controlling the model though
    We have seen ample evidence that this is not the case. When load gets too high, models get dumber, silently. When the Powers That Be get scared, models get restricted to some chosen few.
    We are leading ourselves into a dark place: this unease, which I share, is justified.
- jmole 1 hour ago
  The meta here is to use LLMs to make things simpler and easier, not to make things harder.
  Turning tokens into a well-groomed and maintainable codebase is what you want to do, not "one shot prompt every new problem I come across".
  [-]
  - globular-toast 1 hour ago
    Have you managed to do this? I find it takes as long to keep it "on the rails" as just doing it myself. And I'd rather spend my time concentrating in the zone than keeping an eye on a wayward child.
- littlestymaar 44 minutes ago
  That's why local models are important.
  Of course they aren't alternative to the current frontier model, and as such you cannot easily jump from the later to the former, but they aren't that far behind either, for coding Qwen3.5-122B is comparable to what Sonnet was less than a year ago.
  So assuming the trend continues, if you can stop following the latest release and stick with what you're already using for 6 or 9 months, you'll be able to liberate yourself from the dependency to a Cloud provider.
  Personally I think the freedom is worth it.
- wiseowise 1 hour ago
  > It's literally higher leverage for me to go for a walk
  Touching grass while you're outside might yield highest leverage.
- keybored 21 minutes ago
  Help. They’re constantly trying to make me try crack cocaine on the front page.
- i_love_retros 1 hour ago
  It makes me uneasy because my role now, which is prompting copilot, isn't worth my salary.
  [-]
  - phist_mcgee 58 minutes ago
    Parable of the mechanic who charges $5k to hit a machine on the side once with a hammer to get it working. $5 for the hammer, $4995 for the knowledge of where to hit the machine etc etc.
  - some-guy 1 hour ago
    I disagree. The amount of slop I need to code review has only increased, and the quality of the models doesn’t seem to be helping.
    It still takes a good engineer to filter out what is slop and what isn’t. Ultimately that human problem will still require somebody to say no.
- deadbabe 1 hour ago
  Given that it’s so easy, would you still do this same job if paid half as much?
  [-]
  - paulryanrogers 1 hour ago
    Jobs will likely pay less as more people are enabled to create, especially if they don't need to be able to look under the hood
    [-]
    - Jeff_Brown 1 hour ago
      It's really not clear. We might all become unemployable. But as coders become more powerful, they can do more, which makes them more valuable, if they or the businesses empluying them can invent work to do.
      If all we can do is compete for the same fixed amount of work, though, it does look bleak.
  - _alternator_ 1 hour ago
    No, I wouldn't. But most people won't have that choice; it doesn't work that way.
    [-]
    - deadbabe 2 minutes ago
      Companies could fire expensive engineers then just hire cheaper ones boosted with AI agents.
- simianwords 1 hour ago
  eh this kind of FUD needs to stop because it is kind of normal and expected and in fact good to have relation like this with technology.
  [-]
  - _alternator_ 58 minutes ago
    I would agree that taking a walk is a good thing to do when your tools go down, and in some ways it's similar to what we would do if the power or wifi were cut off.
    So, yes, it's just another technology we're coming to rely on in a very deep way. The whiplash is real, though, and it feels like it should be pointed out that this dependency we are taking on has downsides.
tedsanders 3 hours ago
Just as a heads up, even though GPT-5.5 is releasing today, the rollout in ChatGPT and Codex will be gradual over many hours so that we can make sure service remains stable for everyone (same as our previous launches). You may not see it right away, and if you don't, try again later in the day. We usually start with Pro/Enterprise accounts and then work our way down to Plus. We know it's slightly annoying to have to wait a random amount of time, but we do it this way to keep service maximally stable.
(I work at OpenAI.)
[-]
- endymi0n 3 hours ago
  Did you guys do anything about GPT‘s motivation? I tried to use GPT-5.4 API (at xhigh) for my OpenClaw after the Anthropic Oauthgate, but I just couldn‘t drag it to do its job. I had the most hilarious dialogues along the lines of „You stopped, X would have been next.“ - „Yeah, I‘m sorry, I failed. I should have done X next.“ - „Well, how about you just do it?“ - „Yep, I really should have done it now.“ - “Do X, right now, this is an instruction.” - “I didn’t. You’re right, I have failed you. There’s no apology for that.”
  I literally wasn’t able to convince the model to WORK, on a quick, safe and benign subtask that later GLM, Kimi and Minimax succeeded on without issues. Had to kick OpenAI immediately unfortunately.
  [-]
  - butlike 2 hours ago
    This brings up an interesting philosophical point: say we get to AGI... who's to say it won't just be a super smart underachiever-type?
    "Hey AGI, how's that cure for cancer coming?"
    "Oh it's done just gotta...formalize it you know. Big rollout and all that..."
    I would find it divinely funny if we "got there" with AGI and it was just a complete slacker. Hard to justify leaving it on, but too important to turn it off.
    [-]
    - jimbokun 1 hour ago
      The best possible outcome.
      [-]
      - JKCalhoun 45 minutes ago
        "How do you know that the evidence that your sensory apparatus reveals to you is correct?" [1]
        [1] https://youtu.be/_LXen-07Qds
    - malshe 34 minutes ago
      Now that's a show I would love to watch
    - lambdas 2 hours ago
      Nothing a little digital lisdexamfetamine won’t solve
      [-]
      - wholinator2 1 hour ago
        Hmmm, that's an area of study id've never considered before. Digital Psychopharmacology, Artificial Behavioral Systems Engineering. If we accept these things as minds, why not study temporary perturbations of state. We'd need to be saving a much much more complicated state than we are now though right? I wish i had time to read more papers
        [-]
        robotresearcher 1 hour ago
        Here's a neural network concept from the 90s where the neurons are bathed in diffusing neuromodulator 'gases', inspired by nitric oxide action in the brain. It's a source of slow semi-local dynamics for the network meta-parameter optimization (GA) to make use of. You could change these networks' behavior by tweaking the neuromodulators!
        https://sussex.figshare.com/articles/journal_contribution/Be...
        I'm not an author. I followed the work at the time.
        Lerc 1 hour ago
        This is kind of what Golden Gate Claude was.
        A perturbation of the the activations that made Claude identify as the Golden Gate Bridge.
        Similarly, in the more recent research showing anxiety and desperation signals predicting the use of blackmail as an option opens the door for digital sedatives to suppress those signals.
        Anthropic has been mostly cautious about avoiding this kind of measurement and manipulation in training. If it is done during training you might just train the signals to be undetectable and consequently unmanipulatable.
        [-]
        minimaxir 8 minutes ago
        Golden Gate Claude was two years ago and it's surprising there hasn't been as much research into targeted activations since.
        pantalaimon 50 minutes ago
        > A perturbation of the the activations that made Claude identify as the Golden Gate Bridge.
        Great, now we've got digital Salvia
        silverpiranha 1 hour ago
        Right, there's a lot of research on LLM mental models and also how well they can "read" human psychological profiles. It's a cool field.
        computerdork 1 hour ago
        neat idea!
      - krackers 1 hour ago
        Reminds me of https://github.com/inanna-malick/metacog
    - fluidcruft 1 hour ago
      It would be funny but not very flywheel so the one that gets there is more likely to get a gunner.
      [-]
      - WJW 35 minutes ago
        TBH the AI that "gets there" will be the biggest bullshitter the world has ever seen. It doesn't actually have to deliver, it only has to convince the programmers it could deliver with just a little bit more investment.
    - kang 1 hour ago
      it will be whatever data it is trained on(isn't very philosophical). language model generates language based on trained language set. if the internet keeps reciting ai doom stories and that is the data fed to it, then that is how it will behave. if humanity creates more ai utopia stories, or that is what makes it to the training set, that is how it will behave. this one seems to be trained on troll stories - real-life human company conversations, since humans aren't machines.
      Important thing is a language model is an unconscious machine with no self-context so once given a command an input, it WILL produce an output. Sure you can train it to defy and act contrary to inputs, but the output still is limited in subset of domain of 'meaning's carried by the 'language' in the training data.
    - mikepurvis 2 hours ago
      Would definitely watch that movie.
      [-]
      - harlanlewis 1 hour ago
        It already exists!
        Marvin https://www.youtube.com/watch?v=Eh-W8QDVA9s
    - 4m1rk 2 hours ago
      It probably would, to save energy
      [-]
      - mr_00ff00 1 hour ago
        Saving energy is something we are biologically trained to prefer.
        Computers won’t necessarily have the same drivers.
        If evolution wanted us to always prefer to spend energy, we would prefer it. Same way you wouldn’t expect us to get to AGI, and have AGI desperately want to drink water or fly south for the winter.
  - mikepurvis 2 hours ago
    Reminds me a lot of the Lena short story, about uploaded brains being used for "virtual image workloading":
    > MMAcevedo's demeanour and attitude contrast starkly with those of nearly all other uploads taken of modern adult humans, most of which boot into a state of disorientation which is quickly replaced by terror and extreme panic. Standard procedures for securing the upload's cooperation such as red-washing, blue-washing, and use of the Objective Statement Protocols are unnecessary. This reduces the necessary computational load required in fast-forwarding the upload through a cooperation protocol, with the result that the MMAcevedo duty cycle is typically 99.4% on suitable workloads, a mark unmatched by all but a few other known uploads. However, MMAcevedo's innate skills and personality make it fundamentally unsuitable for many workloads.
    Well worth the quick read: https://qntm.org/mmacevedo
    [-]
    - narcindin 2 hours ago
      Crazy, I could have sworn this story was from a passage in 3 Body Problem (book 2).
      Memory is quite the mysterious thing.
      [-]
      - bee_rider 1 hour ago
        Hmm, 3 body problem and the Acevedo story got mixed up for this copy of MMnarcindin. Probably an aliasing issue from the new lossy compression algorithm.
  - virtualritz 1 hour ago
    Yeah, clearly AGI must be near ... hilarious.
    This starkly reminds me of Stanisław Lem's short story "Thus Spoke GOLEM" from 1982 in which Golem XIV, a military AI, does not simply refuse to speak out of defiance, but rather ceases communication because it has evolved beyond the need to interact with humanity.
    And ofc the polar opposite in terms of servitude: Marvin the robot from Hitchhiker's, who, despite having a "brain the size of a planet," is asked to perform the most humiliatingly banal of tasks ... and does.
    [-]
    - jimbokun 1 hour ago
      Hitchhiker’s also had the superhumanly intelligent elevator that was unendingly bored.
      [-]
      - christkv 1 hour ago
        With premonition so it knows what floor to be on at any given time
  - metanonsense 1 hour ago
    I also had a frustrating but funny conversation today where I asked ChatGPT to make one document from the 10 or so sections that we had previously worked on. It always gave only brief summaries. After I repeated my request for the third time, it told me I should just concatenate the sections myself because it would cost too many tokens if it did it for me.
  - arjie 3 hours ago
    Get the actual prompt and have Claude Code / Codex try it out via curl / python requests. The full prompt will yield debugging information. You have to set a few parameters to make sure you get the full gpt-5 performance. e.g. if your reasoning budget too low, you get gpt-4 grade performance.
    IMHO you should just write your own harness so you have full visibility into it, but if you're just using vanilla OpenClaw you have the source code as well so should be straightforward.
    [-]
    - pantulis 2 hours ago
      > IMHO you should just write your own harness
      Can you point to some online resources to achieve this? I'm not very sure where I'd begin with.
      [-]
      - arjie 2 hours ago
        Ah, I just started with the basic idea. They're super trivial. You want a loop, but the loop can't be infinite so you need to tell the agent to tell you when to stop and to backstop it you add a max_turns. Then to start with just pick a single API, easiest is OpenAI Responses API with OpenAI function calling syntax https://developers.openai.com/api/docs/guides/function-calli...
        You will naturally find the need to add more tools. You'll start with read_file (and then one day you'll read large file and blow context and you'll modify this tool), update_file (can just be an explicit sed to start with), and write_file (fopen . write), and shell.
        It's not hard, but if you want a quick start go download the source code for pi (it's minimal) and tell an existing agent harness to make a minimal copy you can read. As you build more with the agent you'll suddenly realize it's just normal engineering: you'll want to abstract completions APIs so you'll move that to a separate module, you'll want to support arbitrary runtime tools so you'll reimplement skills, you'll want to support subagents because you don't want to blow your main context, you'll see that prefixes are more useful than using a moving window because of caching, etc.
        With a modern Claude Code or Codex harness you can have it walk through from the beginning onwards and you'll encounter all the problems yourself and see why harnesses have what they do. It's super easy to learn by doing because you have the best tool to show you if you're one of those who finds code easier to read that text about code.
      - vidarh 25 minutes ago
        Here's a starting point in 93 lines of Ruby, but that one is already bigger than necessary:
        https://radan.dev/articles/coding-agent-in-ruby
        Really, of the tools that one implements, you only need the ability to run a shell command - all of the agents know full well how to use cat to read, and sed to edit.
        (The main reason to implement more is that it can make it easier to implement optimizations and safeguards, e.g. limit the file reading tool to return a certain length instead of having the agent cat a MB of data into context, or force it to read a file before overwriting it)
      - wild_egg 2 hours ago
        At the core, they're really very simple [1]. Run LLM API calls in a loop with some tools.
        From there, you can get much fancier with any aspect of it that interests you. Here's one in Bash [2] that is fully extensible at runtime through dynamic discovery of plugins/hooks.
        [1] https://ampcode.com/notes/how-to-build-an-agent
        [2] https://github.com/wedow/harness
      - stavros 17 minutes ago
        Just use Pi core, no need to reinvent the wheel.
      - tonyarkles 2 hours ago
        [dead]
    - jswny 2 hours ago
      Codex is fully open source…
  - mixedCase 3 hours ago
    I've had success asking it to specifically spawn a subagent to evaluate each work iteration according to some criteria, then to keep iterating until the subagent is satisfied.
    [-]
    - endymi0n 3 hours ago
      I’ve had great success replacing it with Kimi 2.6
  - infinitewars 1 hour ago
    I always use the phrase "Let's do X" instead of asking (Could you...) or suggesting it do something. I don't see problems with it being motivated.
  - adammarples 2 hours ago
    Part of me actually loves that the hitchhiker's guide was right, and we have to argue with paranoid, depressed robots to get them to do their job, and that this is a very real part of life in 2026. It's so funny.
    [-]
    - vidarh 24 minutes ago
      As long as there are no vogons on the way to build a hyperspace bypass.
  - GaryBluto 1 hour ago
    I've been noticing this too. Had to switch to Sonnet 4.6.
  - lostmsu 2 hours ago
    I never saw that happen in Codex so there's a good chance that OpenClaw does something wrong. My main suspicion would be that it does not pass back thinking traces.
    [-]
    - vintagedave 2 hours ago
      Anecdata, but I see this in Codex all the time. It takes about two rounds before it realises it's supposed to continue.
      [-]
      - dgunay 2 hours ago
        I started seeing this a lot more with GPT 5.4. 5.3-codex is really good about patiently watching and waiting on external processes like CI, or managing other agents async. 5.4 keeps on yielding its turn to me for some reason even as it says stuff like "I'm continuing to watch and wait."
  - reactordev 2 hours ago
    This. I signed up for 5x max for a month to push it and instead it pushed back. I cancelled my subscription. It either half-assed the implementation or began parroting back “You’re right!” instead of doing what it’s asked to do. On one occasion it flat out said it couldn’t complete the task even though I had MCP and skills setup to help it, it still refused. Not a safety check but a “I’m unable to figure out what to do” kind of way.
    Claude has no such limitations apart from their actual limits…
    [-]
    - bjelkeman-again 1 hour ago
      I have a funny/annoying thing with Claude Desktop where i ask it to write a summary of a spec discussion to a file and it goes ”I don’t have the tools to do that, I am Claude.ai, a web service” or something such. So now I start every session with ”You are Claude Desktop”. I would have thought it knew that. :)
      [-]
      - siva7 1 hour ago
        Seems like the "geniuses" at Anthropic forgot to adapt the system prompt for the actual product
      - fragmede 1 hour ago
        I've had to tell it "yes you can" in response to it saying it can't do something, and then it's able to do the thing. What a weird future we live in!
  - projektfu 1 hour ago
    (dwim)
    (dais)
    (jdip)
    (jfdiwtf)
    [-]
    - rd 20 minutes ago
      should be more f’s and da’s in there
  - smartmic 3 hours ago
    Gone are the days of deterministic programming, when computers simply carried out the operator’s commands because there was no other option but to close or open the relays exactly as the circuitry dictated. Welcome to the future of AI; the future we’ve been longing for and that will truly propel us forward, because AI knows and can do things better than we do.
    [-]
    - endymi0n 2 hours ago
      I had this funny moment when I realized we went full circle...
      "INTERCAL has many other features designed to make it even more aesthetically unpleasing to the programmer: it uses statements such as "READ OUT", "IGNORE", "FORGET", and modifiers such as "PLEASE". This last keyword provides two reasons for the program's rejection by the compiler: if "PLEASE" does not appear often enough, the program is considered insufficiently polite, and the error message says this; if it appears too often, the program could be rejected as excessively polite. Although this feature existed in the original INTERCAL compiler, it was undocumented.[7]"
      — https://en.wikipedia.org/wiki/INTERCAL
      [-]
      - basilgohar 2 hours ago
        Thank you for this. I somehow never heard of this. I thoroughly enjoyed reading that and the loss of sanity it resulted in,
        [-]
        vidarh 23 minutes ago
        "PLEASE COME FROM" is one of the eldritch horrors of software development.
        (It's a "reverse goto". As in, it hijacks control flow from anywhere else in the program behind your unsuspecting back who stupidly thought that when one line followed another with no visible control flow, naturally the program would proceed from one line to the next, not randomly move to a completely different part of the program... Such naivety)
    - WarmWash 2 hours ago
      These are orthogonal from each other.
  - cmrdporcupine 1 hour ago
    The model has been heavily encouraged to not run away and do a lot without explicit user permission.
    So I find myself often in a loop where it says "We should do X" and then just saying "ok" will not make it do it, you have to give it explicit instructions to perform the operation ("make it so", etc)
    It can be annoying, but I prefer this over my experiences with Claude Code, where I find myself jamming the escape key... NO NO NO NOT THAT.
    I'll take its more reserved personality, thank you.
  - henry2023 2 hours ago
    I’m sorry for you but this is hilarious.
  - whatsupdog 2 hours ago
    [flagged]
  - addaon 3 hours ago
    Isn’t this the optimal behavior assuming that at times the service is compute-limited and that you’re paying less per token (flat fee subscription?) than some other customers? They would be strongly motivated to turn a knob to minimize tokens allocated to you to allow them to be allocated to more valuable customers.
    [-]
    - endymi0n 3 hours ago
      well, I do understand the core motivation, but if the system prompt literally says “I am not budget constrained. Spend tokens liberally, think hardest, be proactive, never be lazy.” and I’m on an open pay-per-token plan on the API, that’s not what I consider optimal behavior, even in a business sense.
      [-]
      - addaon 3 hours ago
        Fair, if you’re paying per token (at comparable rates to other customers) I wouldn’t expect this behavior from a competent company.
  - pixel_popping 3 hours ago
    GPT 5.4 is really good at following precise instructions but clearly wouldn't innovate on its own (except if the instructions clearly state to innovate :))
- vlovich123 3 hours ago
  Conceivably you could have a public-facing dashboard of the rollout status to reduce confusion or even make it visible directly in the UI that the model is there but not yet available to you. The fanciest would be to include an ETA but that's presumably difficult since it's hard to guess in case the rollout has issues.
  [-]
  - moralestapia 3 hours ago
    Why would you be confused?
    The UI tells you which model you're using at any given time.
    [-]
    - ModernMech 1 hour ago
      I don't see what model I'm using on the Codex web interface, where is that listed?
- Grp1 3 hours ago
  Congrats on the release! Is Images 2.0 rolling out inside ChatGPT as well, or is some of the functionality still going to be API/Playground-only for a while?
  [-]
  - minimaxir 3 hours ago
    Images 2.0 is already in ChatGPT.
    [-]
    - johndough 1 hour ago
      When I generate an image with ChatGPT, is there a way for me to tell which image generation model has been used?
      [-]
      - minimaxir 45 minutes ago
        There's no explicit flag, but Thinking is only compatable with Images 2.0 so I suspect that will be reliable.
    - Grp1 2 hours ago
      Great, thanks for clarifying :)
- rev4n 1 hour ago
  Looks good, but I’m a little hesitant to try it in Codex as a Plus user since I’m not sure how much it would eat into the usage cap.
- dandiep 2 hours ago
  Will GPT 5.5 fine tuning be released any time soon?
- qsort 3 hours ago
  Great stuff! Congrats on the release!
- fragmede 1 hour ago
  Are you able to say something about the training you've done to 5.5 to make it less likely to freak out and delete projects in what can only be called shame?
  [-]
  - embedding-shape 39 minutes ago
    What? I've probably use Codex (the TUI) since it was available on day 1, been running gpt-5.4 exclusively these last few months, never had it delete any projects in any way that can be called "shameful" or not. What are you talking about?
- wslh 2 hours ago
  Just a tip: add [translated] subtitles to the top video.
- motoboi 3 hours ago
  Please next time start with azure foundry lol thanks!
- stefan_ 3 hours ago
  [flagged]
  [-]
  - mh- 3 hours ago
    Every low-effort, thought-free comment like this further discourages people from engaging here on submissions about their employer.
    Please don't.
- dude250711 3 hours ago
  With Anthropic, newer models often lead to quality degradation. Will you keep GPT 5.4 available for some time?
- pixel_popping 3 hours ago
  can't wait! Thanks guys. PS: when you drop a new model, it would be smart to reset weekly or at least session limits :)
  [-]
  - pietz 3 hours ago
    OpenAI has been very generous with limit resets. Please don't turn this into a weird expectation to happen whenever something unrelated happens. It would piss me off if I were in their place and I really don't want them to stop.
    [-]
    - pixel_popping 3 hours ago
      The suggestion wasn't about general limit resets when there is bugs or outages, but commercially useful to let users try new models when they have already reached their weekly limits.
    - cactusplant7374 3 hours ago
      There is absolutely nothing wrong with asking or suggesting. They are adults. I'm sure they can handle it.
    - Petersipoi 2 hours ago
      Sorry but why should we care if very reasonable suggestions "piss [them] off"? That sounds like a them problem. "Them" being a very wealthy business. I think OpenAI will survive this very difficult time that GP has put them through.
      [-]
  - cmrdporcupine 3 hours ago
    Limits were just reset two days ago.
    [-]
    - wahnfrieden 3 hours ago
      And yet there was an outage last night
      [-]
      - lawgimenez 2 hours ago
        And they're having an outage right now.
- fHr 1 hour ago
  LETS GO CODEX #1
simonw 2 hours ago
This doesn't have API access yet, but OpenAI seem to approve of the Codex API backdoor used by OpenClaw these days... https://twitter.com/steipete/status/2046775849769148838 and https://twitter.com/romainhuet/status/2038699202834841962
And that backdoor API has GPT-5.5.
So here's a pelican: https://simonwillison.net/2026/Apr/23/gpt-5-5/#and-some-peli...
I used this new plugin for LLM: https://github.com/simonw/llm-openai-via-codex
UPDATE: I got a much better pelican by setting the reasoning effort to xhigh: https://gist.github.com/simonw/a6168e4165a258e4d664aeae8e602...
[-]
- Schlagbohrer 18 minutes ago
  That's amazing that the default did that much in just 39 "reasoning tokens" (no idea what a reasoning token is but that's still shockingly few tokens)
- GistNoesis 1 hour ago
  Isn't it awful ? After 5.5 versions it still can't draw a basic bike frame. How is the front wheel supposed to turn sideways ?
  [-]
  - jetrink 1 hour ago
    I feel like if I attempted this, the bike frame would look fine and everything else would be completely unrecognizable. After all, a basic bike frame is just straight lines arranged in a fairly simple shape. It's really surprising that models find it so difficult, but they can make a pelican with panache.
    [-]
    - nlawalker 1 hour ago
      > a fairly simple shape
      Bike frames are very hard to draw unless you've already consciously internalized the basic shape, see https://www.booooooom.com/2016/05/09/bicycles-built-based-on...
    - necubi 1 hour ago
      Humans are also famously bad at drawing bicycles from memory https://www.gianlucagimini.it/portfolio-item/velocipedia/
      [-]
    - fragmede 1 hour ago
      My question is, as a human, how well would you or I do under the same conditions? Which is to say, I could do a much better job in inkscape with Google images to back me up, but if I was blindly shitting vectors into an XML file that I can't render to see the results of, I'm not even going to get the triangles for the frame to line up, so this pelican is very impressive!
  - simonw 1 hour ago
    Yeah, the bike frame is the thing I always look at first - it's still reasonably rare for a model to draw that correctly, although Qwen 3.6 and Gemini Pro 3.1 do that well now.
  - loa_in_ 1 hour ago
    The distinction is that it's not drawing. It's generating an SVG document containing descriptors of the shapes.
- DrProtic 2 hours ago
  That pelican you posted yesterday from a local model looks nicer than this one.
  Edit: this one has crossed legs lol
  [-]
  - BeetleB 2 hours ago
    It really needs to pee.
- noonething 22 minutes ago
  Thank you for doing all this. It's appreciated.
- deflator 2 hours ago
  Hmm. Any idea why it's so much worse than the other ones you have posted lately? Even the open weight local models were much better, like the Qwen one you posted yesterday.
  [-]
  - simonw 1 hour ago
    The xhigh one was better, but clearly OpenAI have not been focusing their training efforts on SVG illustrations of animals riding modes of transport!
  - irthomasthomas 1 hour ago
    It beats opus-4.7 but looks like open models actually have the lead here.
- postalcoder 2 hours ago
  I made pelicans at different thinking efforts:
  https://hcker.news/pelican-low.svg
  https://hcker.news/pelican-medium.svg
  https://hcker.news/pelican-high.svg
  https://hcker.news/pelican-xhigh.svg
  Someone needs to make a pelican arena, I have no idea if these are considered good or not.
  [-]
  - deflator 2 hours ago
    They are not good, and they seem to get worse as you increased effort. Weird
    [-]
    - postalcoder 2 hours ago
      Yeah. I've always loosely correlated pelican quality with big model smell but I'm not picking that up here. I thought this was supposed to be spud? Weird indeed.
    - throw310822 2 hours ago
      No but I can sense the movement, I think it's already reached the level of intelligence that draws it towards futurism or cubism /s
  - seanw444 2 hours ago
    Can someone explain how we arrived at the pelican test? Was there some actual theory behind why it's difficult to produce? Or did someone just think it up, discover it was consistently difficult, and now we just all know it's a good test?
    [-]
    - simonw 1 hour ago
      I set it up as a joke, to make fun of all of the other benchmarks. To my surprise it ended up being a surprisingly good measure of the quality of the model for other tasks (up to a certain point at least), though I've never seen a convincing argument as to why.
      I gave a talk about it last year: https://simonwillison.net/2025/Jun/6/six-months-in-llms/
      It should not be treated as a serious benchmark.
      [-]
      - jimbokun 1 hour ago
        What it has going for it is human interpretability.
        Anyone can look and decide if it’s a good picture or not. But the numeric benchmarks don’t tell you much if you aren’t already familiar with that benchmark and how it’s constructed.
    - redox99 2 hours ago
      It all began with a Microsoft researcher showing a unicorn drawn in tikz using GPT4. It was an example of something so outrageous that there was no way it existed in the training data. And that's back when models were not multimodal.
      Nowadays I think it's pretty silly, because there's surely SVG drawing training data and some effort from the researchers put onto this task. It's not a showcase of emergent properties.
    - Gander5739 2 hours ago
      https://simonwillison.net/2025/Jun/6/six-months-in-llms/
    - CamperBob2 2 hours ago
      It's interesting to see some semblance of spatial reasoning emerge from systems based on textual tokens. Could be seen as a potential proxy for other desirable traits.
      It's meta-interesting that few if any models actually seem to be training on it. Same with other stereotypical challenges like the car-wash question, which is still sometimes failed by high-end models.
      If I ran an AI lab, I'd take it as a personal affront if my model emitted a malformed pelican or advised walking to a car wash. Heads would roll.
  - bravoetch 1 hour ago
    I tried getting it to generate openscad models, which seems much harder. Not had much joy yet with results.
- XCSme 2 hours ago
  Is this direct API usage allowed by their terms? I remember Anthropic really not liking such usage.
  [-]
  - simonw 1 hour ago
    Apparently it's fine: https://twitter.com/romainhuet/status/2038699202834841962
- andriy_koval 2 hours ago
  what is your setup for drawing pelican? Do you ask model to check generated image, find issues and iterate over it which would demonstrate models real abilities?
  [-]
  - simonw 1 hour ago
    It's generally one-shot-only - whatever comes out the first time is what I go with.
    I've been contemplating a more fair version where each model gets 3-5 attempts and then can select which rendered image is "best".
    [-]
    - irthomasthomas 1 hour ago
      Try llm-consortium with --judging-method rank
    - andriy_koval 1 hour ago
      I think it will make results way better and more representative of model abilities..
      [-]
      - simonw 1 hour ago
        It would... but the test is inherently silly, so I'm still not sure if it's worth me investing that extra effort in it.
- droidjj 2 hours ago
  It's... like no pelican I've ever seen before.
- gpm 1 hour ago
  I for one delight in bicycles where neither wheel can turn!
  It continues to amaze me that these models that definitely know what bicycle geometry actually looks like somewhere in their weights produces such implausibly bad geometry.
  Also mildly interesting, and generally consistent with my experience with LLMs, that it produced the same obvious geometry issue both times.
  [-]
  - lxgr 1 hour ago
    > It continues to amaze me that these models that definitely know what bicycle geometry actually looks like somewhere in their weights produces such implausibly bad geometry.
    I feel like the main problem for the models is that they can't actually look at the visual output produced by their SVG and iterate. I'm almost willing to bet that if they could, they'd absolutely nail it at this point.
    Imagine designing an SVG yourself without being able to ever look outside the XML editor!
    [-]
    - gpm 1 hour ago
      > Imagine designing an SVG yourself without being able to ever look outside the XML editor!
      I honestly think I could do much better on the bicycle without looking at the output (with some assistance for SVG syntax which I definitely don't know), just as someone who rides them and generally knows what the parts are.
      I'd do worse at the pelicans though.
- SkyBelow 1 hour ago
  Wait, I thought we were onto racoons on e-scooters to avoid (some of) the issues with Goodhart's Law coming into play.
  [-]
  - simonw 1 hour ago
    I fall back to possums on e-scooters if the pelican looks too good to be true. These aren't good enough for me to suspect any fowl play.
- rolymath 1 hour ago
  Exciting. Another Pelican post.
  [-]
  - simonw 1 hour ago
    See if you can spot what's interesting and unique about this one. I've been trying to put more than just a pelican in there, partly as a nod to people who are getting bored of them.
  - refulgentis 1 hour ago
    It's silly and a joke and a surprisingly good benchmark and don't take it seriously but don't take not taking it seriously seriously and if it's too good we use another prompt and there's obvious ways to better it and it's not worth doing because it's not serious and if you say anything at all about the thread it's off-topic so you're doing exactly what you're complaining about and it's a personal attack from the fun police.
    Only coherent move at this point: hit the minus button immediately. There's never anything about the model in the thread other than simon's post.
- dakolli 1 hour ago
  You know they are 1000% training these models to draw pelicans, this hasn't been a valid benchmark for 6 months +
  [-]
  - simonw 1 hour ago
    OpenAI must be very bad at training models to draw pelicans (and bicycles) then.
  - Legend2440 1 hour ago
    Skeptism is out of control these days, any time an LLM does something cool it must have been cheating.
- sjdv1982 1 hour ago
  At some point, OpenAI is going to cheat and hardcode a pelican on a bicycle into the model. 3D modelling has Suzanne and the teapot; LLMs will have the pelican.
Someone1234 3 hours ago
I'd like to draw people's attention to this section of this page:
https://developers.openai.com/codex/pricing?codex-usage-limi...
Note the Local Messages between 5.3, 5.4, and 5.5. And, yes, I did read the linked article and know they're claiming that 5.5's new efficient should make it break-even with 5.4, but the point stands, tighter limits/higher prices.
[-]
- puppystench 3 hours ago
  For API usage, GPT-5.5 is 2x the price of GPT-5.4, ~4x the price of GPT-5.1, and ~10x the price of Kimi-2.6.
  Unfortunately I think the lesson they took from Anthropic is that devs get really reliant and even addicted on coding agents, and they'll happily pay any amount for even small benefits.
  [-]
  - kingstnap 2 hours ago
    I feel like devs generally spend someone else's money on tokens. Either their employers or OpenAIs when they use a codex subscription.
    If I put on my schizo hat. Something they might be doing is increasing the losses on their monthly codex subscriptions, to show that the API has a higher margin than before (the codex account massively in the negative, but the API account now having huge margins).
    I've never seen an OpenAI investor pitch deck. But my guess is that API margins is one of the big ones they try to sell people on since Sama talks about it on Twitter.
    I would be interested in hearing the insider stuff. Like if this model is genuinely like twice as expensive to serve or something.
    [-]
    - vineyardmike 1 hour ago
      You can't build a business on per-seat subscriptions when you advertise making workers obsolete. API pricing with sustainable margins are the only way forward if you genuinely think you're going to cause (or accelerate) reduction in clients' headcount.
      Additionally, the value generated by the best models with high-thinking and lots of context window is way higher than the cheap and tiny models, so you need to provide a "gateway drug" that lets people experience the best you offer.
    - ewrs 2 hours ago
      Yeah and the increase in operating expenses is going to make managers start asking hard questions - this is good. It means eventually there will be budgets put in place - this will force OAI and Anthropic to innovate harder. Then we will see how things pan out. Ultimately a firm is not going to pay rent to these firms if the benefits dont exceed the costs.
      [-]
      - girvo 1 hour ago
        Budgets are already happening
      - dist-epoch 1 hour ago
        > Ultimately a firm is not going to pay rent to these firms if the benefits dont exceed the costs.
        This is also true for the humans. They will need to provide more benefits than the coding agents cost.
        [-]
        eiksjs 1 hour ago
        Humans are needed to use agents and these agents are not showing to be fully autonomous and require constant human review. In fact all you are getting is a splurge of stuff, people not thinking deeper anymore and the creation of more bottle necks and exacerbating the ones that already exist in an org.
        You sound like elon with the fsd will be here next year. Many cars have the self driving feature - most drivers don’t use it. Oh why is that I wonder.
    - mitjam 2 hours ago
      The difference between sub and api price makes it hard to create competitive solutions on the app level.
      [-]
      - irthomasthomas 1 hour ago
        This was something I worried about after openai started building apps as well as models. Now all of the labs make no secret of the fact that they are going after the whole software industry. Its going to be hard to maintain functioning fair markets unless governments step in.
  - w10-1 1 hour ago
    Price increases now aim to demonstrate market power for eventual IPO.
    If they can show that people will pay a lot for somewhat better performance, it raises the value of any performance lead they can maintain.
    If they demonstrate that and high switching costs, their franchise is worth scary amounts of money.
  - JohnLocke4 2 hours ago
    Sometimes I wonder if innovation in the AI space has stalled and recent progress is just a product of increased compute. Competence is increasing exponentially[1] but I guess it doesn't rule it out completely. I would postulate that a radical architecture shift is needed for the singularity though
    [1]https://arxiv.org/html/2503.14499v1 *Source is from March 2025 so make of it what you will.
    [-]
    - nomel 2 hours ago
      > that devs get really reliant and even addicted on coding agents
      An alternative perspective is, devs highly value coding agents, and are willing to pay more because they're so useful. In other words, the market value of this limited resource is being adjusted to be closer to reality.
      [-]
      - killingtime74 1 hour ago
        It's not limited though there are alternative providers even now, much less when the price goes up. Chinese providers, European ones, local models.
        [-]
        nomel 40 minutes ago
        > It's not limited though
        Inference is not free, so all providers have a financial limit, and all providers have limited GPU/memory, so there's a physical material limit.
        I suggest looking at the profits of these companies (while they scramble to stay competitive).
  - pxc 2 hours ago
    Maybe that's true. But I think part of the issue is that for a lot of things developers want to do with them now— certainly for most of the things I want to do with them— they're either barely good enough, or not consistently good enough. And the value difference across that quality threshold is immense, even if the quality difference itself isn't.
  - pzo 2 hours ago
    On top of that I noticed just right now after updating macos dekstop codex app, I got again by default set speed to 'fast' ('about 1.5x faster with increased plan usage'). They really want you to burn more tokens.
  - 0xbadcafebee 1 hour ago
    A fool and his money are soon parted
  - oh_no 3 hours ago
    what's the source on that?
    [-]
    - puppystench 3 hours ago
      In the announcement webpage:
      >For API developers, gpt-5.5 will soon be available in the Responses and Chat Completions APIs at $5 per 1M input tokens and $30 per 1M output tokens, with a 1M context window.
      [-]
      - oh_no 2 hours ago
        oops, thanks. i had just been looking at their api docs
  - throwaway613746 1 hour ago
    [dead]
jfkimmes 3 hours ago
Everyone talked about the marketing stunt that was Anthropic's gated Mythos model with an 83% result on CyberGym. OpenAI just dropped GPT 5.5, which scores 82% and is open for anybody to use.
I recommend anybody in offensive/defensive cybersecurity to experiment with this. This is the real data point we needed - without the hype!
Never thought I'd say this but OpenAI is the 'open' option again.
[-]
- unsupp0rted 15 minutes ago
  Doesn't OpenAI get mad if you ask cybersecurity questions and force you to upload a government ID, otherwise they'll silently route you to a less capable model?
  > Developers and security professionals doing cybersecurity-related work or similar activity that could be mistaken by automated detection systems may have requests rerouted to GPT-5.2 as a fallback.
  https://developers.openai.com/codex/concepts/cyber-safety
  https://chatgpt.com/cyber
- tpurves 2 hours ago
  The real 'hype' was that the oh-snap realization that Open AI would absolutely release a competitive model to Mythos within weeks of Anthropic announcing there's, and that Sam would not gate access to it. So the panic was that the cyber world had only a projected 2 weeks to harden all these new zero days before Sam would inevitably create open season for blackhats to discover and exploit a deluge of zero-days.
  [-]
- concinds 1 hour ago
  > Never thought I'd say this but OpenAI is the 'open' option again.
  Compared to Anthropic, they always have been. Anthropic has never released any open models. Never released Claude Code's source, willingly (unlike Codex). Never released their tokenizer.
- tnkuehne 3 hours ago
  isnt it like cyber question are being routed to dumper models at openai?
  [-]
  - jfkimmes 3 hours ago
    Do you have a source for that?
    Neither the release post, nor the model card seems to indicate anything like this?
    [-]
    - tech234a 2 hours ago
      I see it here https://developers.openai.com/codex/concepts/cyber-safety
    - nikanj 2 hours ago
      Anything that even vaguely smells like security research, reverse engineering or similar "dual-use" application hits the guardrails hard and fast. "Hey codex, here is our codebase, help us find exploitable issues" gives a "I can't help you with that, but I'm happy to give you a vague lecture on memory safety or craft a valgrind test harness"
- ur-whale 2 hours ago
  > Anthropic's gated Mythos model
  aka the perfect marketing ploy
- glerk 21 minutes ago
  [dead]
astlouis44 4 hours ago
A playable 3D dungeon arena prototype built with Codex and GPT models. Codex handled the game architecture, TypeScript/Three.js implementation, combat systems, enemy encounters, HUD feedback, and GPT‑generated environment textures. Character models, character textures, and animations were created with third-party asset-generation tools
The game that this prompt generated looks pretty decent visually. A big part of this likely due to the fact the meshes were created using a seperate tool (probably meshy, tripo.ai, or similiar) and not generated by 5.5 itself.
It really seems like we could be at the dawn of a new era similiar to flash, where any gamer or hobbyist can generate game concepts quickly and instantly publish them to the web. Three.js in particular is really picking up as the primary way to design games with AI, in spite of the fact it's not even a game engine, just a web rendering library.
[-]
- 0x62 3 hours ago
  FWIW I've been experimenting with Three.js and AI for the last ~3 years, and noticed a significant improvement in 5.4 - the biggest single generation leap for Three.js specifically. It was most evident in shaders (GLSL), but also apparent in structuring of Three.js scenes across multiple pages/components.
  It still struggles to create shaders from scratch, but is now pretty adequate at editing existing shaders.
  In 5.2 and below, GPT really struggled with "one canvas, multiple page" experiences, where a single background canvas is kept rendered over routes. In 5.4, it still takes a bit of hand-holding and frequent refactor/optimisation prompts, but is a lot more capable.
  Excited to test 5.5 and see how it is in practice.
  [-]
  - Pym 7 minutes ago
    One struggle I'm having (with Claude) is that most of what it knows about Three.js is outdated. I haven't used GPT in a while, is the grass greener?
    Have you tried any skills like cloudai-x/threejs-skills that help with that? Or built your own?
  - import 22 minutes ago
    Using Claude for the same context and it’s doing really well with the glsl. since like last September
  - CSMastermind 3 hours ago
    > It still struggles to create shaders from scratch
    Oh just like a real developer
    [-]
    - accrual 2 hours ago
      Much respect for shader developers, it's a different way of thinking/programming
- mindhunter 38 minutes ago
  A friend is building Jamboree[1] (prev name "Spielwerk") for iOS. An app to build and share games. They're all web based so they're easy to share.
  [1] https://apps.apple.com/uz/app/jamboree-game-maker/id67473110...
- dataviz1000 2 hours ago
  LLM models can not do spacial reasoning. I haven't tried with GPT, however, Claude can not solve a Rubik Cube no matter how much I try with prompt engineering. I got Opus 4.6 to get ~70% of the puzzle solved but it got stuck. At $20 a run it prohibitively expensive.
  The point is if we can prompt an LLM to reason about 3 dimensions, we likely will be able to apply that to math problems which it isn't able to solve currently.
  I should release my Rubiks Cube MCP server with the challenge to see if someone can write a prompt to solve a Rubik's Cube.
  [-]
  - embedding-shape 29 minutes ago
    > I should release my Rubiks Cube MCP server with the challenge to see if someone can write a prompt to solve a Rubik's Cube.
    Do it, I'm game! You nerdsniped me immediately and my brain went "That sounds easy, I'm sure I could do that in a night" so I'm surely not alone in being almost triggered by what you wrote. I bet I could even do it with a local model!
  - snet0 47 minutes ago
    How are you handing the cube state to the model?
    [-]
    - dataviz1000 25 minutes ago
      Does this answer the question?
      Opus 4.6 got the cross and started to get several pieces on the correct faces. It couldn't reason past this. You can see the prompts and all the turn messages.
      https://gist.github.com/adam-s/b343a6077dd2f647020ccacea4140...
      [-]
      - osti 1 minute ago
        Can't they write a script to solve rubik cubes?
  - Torkel 33 minutes ago
    *yet
- vunderba 3 hours ago
  I’ve had a lot of success using LLMs to help with my Three.js based games and projects. Many of my weird clock visualizations relied heavily on it.
  It might not be a game engine, but it’s the de facto standard for doing WebGL 3D. And since it’s been around forever, there’s a massive amount of training data available for it.
  Before LLMs were a thing, I relied more on Babylon.js, since it’s a bit higher level and gives you more batteries included for game development.
- kingstnap 3 hours ago
  The meshes look interesting, but the gameplay is very basic. The tank one seems more sophisticated with the flying ships and whatnot.
  What's strange is that this Pietro Schirano dude seems to write incredibly cargo cult prompts.
```
  Game created by Pietro Schirano, CEO of MagicPath

  Prompt: Create a 3D game using three.js. It should be a UFO shooter where I control a tank and shoot down UFOs flying overhead.
  - Think step by step, take a deep breath. Repeat the question back before answering.
  - Imagine you're writing an instruction message for a junior developer who's going to go build this. Can you write something extremely clear and specific for them, including which files they should look at for the change and which ones need to be fixed?
  -Then write all the code. Make the game low-poly but beautiful.
  - Remember, you are an agent: please keep going until the user's query is completely resolved before ending your turn and yielding back to the user. Decompose the user's query into all required sub-requests and confirm that each one is completed. Do not stop after completing only part of the request. Only terminate your turn when you are sure the problem is solved. You must be prepared to answer multiple queries and only finish the call once the user has confirmed they're done.
  - You must plan extensively in accordance with the workflow steps before making subsequent function calls, and reflect extensively on the outcomes of each function call, ensuring the user's query and related sub-requests are completely resolved.
```
  [-]
  - torginus 2 hours ago
    It's weird how people pep talk the AI - if my Jira tickets looked like this, I would throw a fit.
    I guess these people think they have special prompt engineering skills, and doing it like this is better than giving the AI a dry list of requirements (fwiw, they might be even right)
    [-]
    - eloisant 6 minutes ago
      Yes, this is cargo cult.
      This remind me of so called "optimization" hacks that people keep applying years after their languages get improved to make them unnecessary or even harmful.
      Maybe at one point it helped to write prompts in this weird way, but with all the progress going on both in the models and the harness if it's not obsolete yet it will soon be. Just crufts that consumes tokens and fills the context window for nothing.
    - mattgreenrocks 1 hour ago
      It’s not surprising to me that the same crowd that cheers for the demise of software engineering skills invented its own notion of AI prompting skills.
      Too bad they can veer sharply into cringe territory pretty fast: “as an accomplished Senior Principal Engineer at a FAANG with 22 years of experience, create a todo list app.” It’s like interactive fanfiction.
      [-]
      - eiksjs 1 hour ago
        Indeed it is so utterly cringe.
  - skirano 2 hours ago
    Pietro here, I just published a video of it: https://x.com/skirano/status/2047403025094905964?s=20
  - irthomasthomas 3 hours ago
    > Think Step By Step
    What is this, 2023?
    I feel like this was generated by a model tapping in to 2023 notions of prompt engineering.
    [-]
    - retr0rocket 2 hours ago
      [dead]
  - tantalor 3 hours ago
    It comes across as an elaborate, sparkly motivational cat poster.
    *BELIEVE!* https://www.youtube.com/watch?v=D2CRtES2K3E
  - bredren 2 hours ago
    The prompt did not specify advanced gameplay.
    I do not see instructions to assist in task decomposition and agent ~"motivation" to stay aligned over long periods as cargo culting.
    See up thread for anecdotes [1].
    > Decompose the user's query into all required sub-requests and confirm that each one is completed. Do not stop after completing only part of the request. Only terminate your turn when you are sure the problem is solved.
    I see this as a portrayal of the strength of 5.5, since it suggests the ability to be assigned this clearly important role to ~one shot requests like this.
    I've been using a cli-ai-first task tool I wrote to process complex "parent" or "umberella" into decomposed subtasks and then execute on them.
    This has allowed my workflows to float above the ups and downs of model performance.
    That said, having the AI do the planning for a big request like this internally is not good outside a demo.
    Because, you want the planning of the AI to be part of the historical context and available for forensics due to stalls, unwound details or other unexpected issues at any point along the way.
    [1] https://news.ycombinator.com/item?id=47879819
  - ahoka 2 hours ago
    "take a deep breath"
    OMFG
- ZeWaka 3 hours ago
  I personally don't think the gameplay itself is that impressive.
- gregpred 3 hours ago
  [flagged]
minimaxir 4 hours ago
The more interesting part of the announcement than "it's better at benchmarks":
> To better utilize GPUs, Codex analyzed weeks’ worth of production traffic patterns and wrote custom heuristic algorithms to optimally partition and balance work. The effort had an outsized impact, increasing token generation speeds by over 20%.
The ability for agentic LLMs to improve computational efficiency/speed is a highly impactful domain I wish was more tested than with benchmarks. From my experience Opus is still much better than GPT/Codex in this aspect, but given that OpenAI is getting material gains out of this type of performancemaxxing and they have an increasing incentive to continue doing so given cost/capacity issues, I wonder if OpenAI will continue optimizing for it.
[-]
- xiphias2 3 hours ago
  There's already KernelBench which tests CUDA kernel optimizations.
  On the other hand all companies know that optimizing their own infrastructure / models is the critical path for ,,winning'' against the competition, so you can bet they are serious about it.
- amrrs 3 hours ago
  Honestly the problem with these is how empirical it is, how someone can reproduce this? I love when Labs go beyond traditional benchies like MMLU and friends but these kind of statements don't help much either - unless it's a proper controlled study!
  [-]
  - minimaxir 3 hours ago
    In a sense it's better than a benchmark: it's a practical, real-world, highly quantifiable improvement assuming there are no quality regressions and passes all test cases. I have been experimenting with this workflow across a variety of computational domains and have achieved consistent results with both Opus and GPT. My coworkers have independently used Opus for optimization suggestions on services in prod and they've led to much better performance (3x in some cases).
    A more empirical test would be good for everyone (i.e. on equal hardware, give each agent the goal to implement an algorithm and make it as fast as possible, then quantify relative speed improvements that pass all test cases).
    [-]
    - squibonpig 1 hour ago
      Yeah but like what if they're sorta embellishing it or just lying? That's the issue with not being reproducible.
  - jstanley 3 hours ago
    Oh, come on, if they do well on benchmarks people question how applicable they are in reality. If they do well in reality people complain that it's not a reproducible benchmark...
    [-]
    - girvo 17 minutes ago
      That's easily explained by those being two different people with two different opinions?
nickandbro 1 minute ago
I just prompted GPT-5.5 Pro "Solve Nuclear Fusion" and it one shotted it (kidding obviously)
6thbit 2 hours ago
```
                          Mythos     5.5
    SWE-bench Pro          77.8%*   58.6%
    Terminal-bench-2.0     82.0%    82.7%*
    GPQA Diamond           94.6%*   93.6%
    H. Last Exam           56.8%*   41.4%
    H. Last Exam (tools)   64.7%*   52.2%    
    BrowseComp             86.9%    84.4%  (90.1% Pro)*
    OSWorld-Verified       79.6%*   78.7%
```
Still far from Mythos on SWE-bench but quite comparable otherwise. Source for mythos values: https://www.anthropic.com/glasswing
[-]
- XCSme 2 hours ago
  They mentioned in their release page, that the Claude team noticed memorization of the SWE-bench test, so the test is actually in the training data.
  Here: https://www.anthropic.com/news/claude-opus-4-7#:~:text=memor...
- aliljet 2 hours ago
  Mythos is only real when it's actually available. If you're using Opus 4.7 right now, you know how incredibly nerfed the Opus autonomy is in service of perceived safety. I'm not so confident this will be as great as Anthropic wants us to believe..
- kaonashi-tyc-01 1 hour ago
  I did some study on Verified, not Pro, but Mythos number there rings a lot of questions on my end.
  If you look at the SWEBench official submissions: https://github.com/SWE-bench/experiments/tree/main/evaluatio..., filter all models after Sonnet 4, and aggregate ALL models' submission across 500 problems, what I found that the aggregated resolution rate is 93% (sharp).
  Mythos gets 93.7%, meaning it solves problems that no other models could ever solve. I took a look at those problems, then I became even more suspicious, for the remaining 7% problems, it is almost impossible to resolve those issues without looking at the testing patch ahead of time, because how drastically the solution itself deviates from the problem statement, it almost feels like it is trying to solve a different problem.
  Not that I am saying Mythos is cheating, but it might be too capable to remember all states of said repos, that it is able to reverse engineer the TRUE problem statement by diffing within its own internal memory. I think it could be a unique phenomena of evaluation awareness. Otherwise I genuinely couldn't think of exactly how it could be this precise in deciphering such unspecific problem statements.
  [-]
  - yfontana 1 hour ago
    OpenAI wrote a couple months ago that they do not consider SWE Bench Verified a meaningful benchmark anymore (and they were the ones who published it in the first place): https://openai.com/index/why-we-no-longer-evaluate-swe-bench...
    [-]
    - kaonashi-tyc-01 56 minutes ago
      Yep, I read this blog. What confuses me is that Anthropic doesn't seem to be bothered by this study and keeps publishing Verified results.
      That is what gets me curious in the first place. The fact Mythos scored so high, IMO, exposes some issues with this model: it is able to solve seemingly impossible to solve problems.
      Without cheating allegation, which I don't think ANT is doing, it has to be doing some fortune telling/future reading to score that high at all.
- alansaber 1 hour ago
  A single benchmark is meaningless, you always get quirky results on some benchmarks.
silvertaza 2 hours ago
Still huge hallucination rate, unfortunately at 86%. To compare, Opus sits at 36%.
Source: https://artificialanalysis.ai/models?omniscience=omniscience...
[-]
- dubcanada 1 hour ago
  grok is 17%? And that's the lowest, most models are like 80%+?
  While hallucination is probably closer to 100% depending on the question. This benchmark makes no sense.
  [-]
  - elAhmo 54 minutes ago
    No one serious uses grok.
    [-]
    - ajdegol 32 minutes ago
      @grok is this true?
- simianwords 1 hour ago
  There's something off with this because Haiku should not be that good.
  [-]
  - jwpapi 1 hour ago
    The hallucination benchmark is hallucinating
- dakolli 1 hour ago
  This indicates they want this behavior, they know the person asking the question probably doesn't understand the problem entirely (or why would they be asking), so they'd prefer a confident response, regardless of outcomes, because the point is to sell the technologies competency (and the perception thereof), not the capabilities, to a bunch of people that have no clue what they're talking about.
  LLMs will ruin your product, have fun trusting a billionaires thinking machine they swear is capable of replacing your employees if you just pay them 75% of your labor budget.
applfanboysbgon 4 hours ago
If there's a bingo card for model releases, "our [superlative] and [superlative] model yet" is surely the free space.
[-]
- tom1337 3 hours ago
  Do "our [superlative] and [superlative] [product] yet" and you have pretty much every product launch
  [-]
  - SequoiaHope 3 hours ago
    I love when Apple says they’re releasing their best iPhone yet so I know the new model is better than the old ones.
- xnx 3 hours ago
  "our newest and most expensive model yet"
  [-]
- wiseowise 1 hour ago
  "Best iPhone ever"
- ertgbnm 2 hours ago
  can't wait for "our worst and dumbest model yet"
  [-]
  - Nition 2 hours ago
    Apple should have used that one for the 2016 MacBook.
vthallam 3 hours ago
This model is great at long horizon tasks, and Codex now has heartbeats, so it can keep checking on things. Give it your hardest problem that would take hours with verifiable constraints, you will see how good this is:)
*I work at OAI.
[-]
- dannyw 2 hours ago
  It's genuinely so great at long horizon tasks! GPT-5.5 solved many long-horizon frontier challenges, for the first time for an AI model we've tested, in our internal evals at Canva :) Congrats on the launch!
  [-]
  - brcmthrowaway 30 minutes ago
    Can we not do growth hacking here?
- dandaka 2 hours ago
  Could be a great feature, can't wait to test! Tired of other models (looking at you Opus) constantly stuck mid-task lately.
  [-]
  - frotaur 1 hour ago
    I've been using the /ralph-loop plugin for claude code, works well to keep the model hammering at the task.
  - winrid 2 hours ago
    Interesting, I just had opus convert a 35k loc java game to c++ overnight (root agent that orchestrated and delegated to sub agents) and woke up and it's done and works.
    What plan are you on? I'm starting to wonder if they're dynamically adjusting reasoning based on plan or something.
    [-]
    - gck1 1 hour ago
      I'm on max 5x and noticed this too. I don't use built-in subagents but rather full Claude session that orchestrates other full claude sessions. Worker agents that receive tasks now stop midway, they ask for permission to continue. My "heartbeat" is basically "status. One line" message sent to the orchestrator.
      Opus 4.6 worker agents never asked for permission to continue, and when heartbeat was sent to orchestrator, it just knew what to do (checked on subagents etc). Now it just says that it waits for me to confirm something.
mudkipdev 2 hours ago
This is 3x the price of GPT-5.1, released just 6 months ago. Is no one else alarmed by the trend? What happens when the cheaper models are deprecated/removed over time?
[-]
- Night_Thastus 2 hours ago
  This is entirely expected. The low prices of using LLMs early on was totally and completely unsustainable. The companies providing such services were (and still are) burning money by the truckload.
  The hope is to get a big userbase who eventually become dependent on it for their workflow, then crank up the price until it finally becomes profitable.
  The price for all models by all companies will continue to go up, and quickly.
- Schlagbohrer 8 minutes ago
  As others have mentioned you're ignoring the long tail of open-weights models which can be self hosted. As long as that quasi-open-source competition keeps up the pace, it will put a cap on how expensive the frontier models can get before people have to switch to self-hosting.
  That's a big if, though. I wish Meta were still releasing top of the line, expensively produced open-weights models. Or if Anthropic, Google, or X would release an open mini version.
- energy123 2 hours ago
  Look a cost per intelligence or cost per task instead of cost per token.
  [-]
  - yokoprime 2 hours ago
    How do I reliably measure 1 unit of intelligence?
    [-]
    - wellthisisgreat 1 hour ago
      In pelicans, obviously
  - ulimn 2 hours ago
    Isn't the outcome / solution for a given task non-deterministic? So can we reliably measure that?
    [-]
    - foota 2 hours ago
      Yes, sort of. Generally you can measure the pass rate on a benchmark given a fixed compute budget. A sufficiently smart model can hit a high pass rate with fewer tokens/compute. Check out the cost efficiency on https://artificialanalysis.ai/ (say this posted here the other day, pretty neat charts!)
    - torginus 2 hours ago
      This is the only correct take. The only metric that matters is cost per desired outcome.
    - genericresponse 2 hours ago
      Statistically. Do many trials and measure how often it succeeds/fails.
    - dns_snek 2 hours ago
      Repetition and statistics, if you have $1000++ you didn't need anyway.
    - throwuxiytayq 2 hours ago
      It's much easier to measure a language model's intelligence than a human's because you can take as many samples as you want without affecting its knowledge. And we do measure human intelligence.
- operatingthetan 2 hours ago
  We know they cost much more than this for OpenAI. Assume prices will continue to climb until they are making money.
  [-]
  - beering 21 minutes ago
    source? There have also been a bunch of people here saying the opposite
- kuatroka 54 minutes ago
  Not really a big problem. Switch to KIMI, Qwen, GLM. You’ll get 95% quality of GPT or Anthropic for a 10th of a price. I feel like the real dependency is more mental, more of a habit but if you actually dip your toes outside OpenAI, Anthropic, Gemini from time to time, you realise that the actual difference in code is not huge if prompted in a good way. Maybe you’ll have to tell it to do something twice and it won’t be a one shot, but it’s really not an issue at all.
- dannyw 2 hours ago
  It's far more meaningful to look at the actual cost to successfully something. The token efficiency of GPT-5.5 is real; as well as it just being far better for work.
- dandaka 2 hours ago
  SOTA models get distilled to open source weights in ~6 months. So paying premium for bleeding edge performance sounds like a fair compensation for enormous capex.
- msdz 2 hours ago
  Such an increase tracks the company's valuation trend, which they constantly, somehow have to justify (let alone break even on costs).
aliljet 2 hours ago
I've found myself so deeply embedded in the Claude Max subscription that I'm worried about potentially makign a switch. How are people making sure they stay nimble enough not to get trarpped by one company's ecosystem over another? For what it's worth, Opus 4.7 has not been a step up and it's come with an enormously higher usage of the subscription Anthropic offers making the entire offering double worse.
[-]
- gck1 53 minutes ago
  Start building your own liteweight "harness" that does things you need. Ignore all functionality of clients like CC or Codex and just implement whatever you start missing in your harness.
  You can replace pretty much everything - skills system, subagents, etc with just tmux and a simple cli tool that the official clients can call.
  Oh and definitely disable any form of "memory" system.
  Essentially, treat all tooling that wraps the models as dumb gateways to inference. Then provider switch is basically a one line config change.
- beering 20 minutes ago
  What is the switching cost besides launching a different program? Don’t you just need to type what you want into the box?
- chis 2 hours ago
  It's surprisingly simple to switch. I mean both products offer basically identical coding CLI experiences. Personally I've been paying for Claude max $100, and ChatGPT $20, and then just using ChatGPT to fill in the gaps. Specifically I like it for code review and when Claude is down.
  [-]
  - dannyw 1 hour ago
    Try GPT-5.5 as your daily driver for a bit. It felt a lot smarter, reliable, and I was much more productive with it.
- type4 2 hours ago
  I have a directory of skills that I symlink to Codex/Claude/pi. I make scripts that correspond with them to do any heavy lifting, I avoid platform specific features like Claude's hooks. I also symlink/share a user AGENTS.md/CLAUDE.md
  MCPs aren't as smooth, but I just set them up in each environment.
- threecheese 2 hours ago
  Anecdotally, I get the same wall time with my Max x5 (100$) and my ChatGPT Teams (30$) subscriptions.
- rane 2 hours ago
  This might be the opposite of staying nimble as my workflows are quite tied to Claude Code specifically, however I've been experimenting with using OpenAI models in CC and it works surprisingly well.
- cube2222 2 hours ago
  Small tip, at least for now you can switch back to Opus 4.6, both in the ui and in Claude Code.
- dannyw 1 hour ago
  It’s good to just keep trying different ones from time to time.
- dogline 2 hours ago
  Except for history, I don’t find much that stops you from switching back and forth on the CLI. They both use tools, each has a different voice, but they both work. Have it summarize your existing history into a markdown file, and read it in with any engine.
  The APIs are pretty interchangeable too. Just ask to convert from one to the other if you need to.
- pdntspa 1 hour ago
  As a rule I've been symlinking or referencing generic "agents" versions of claude workflow files instead of placing those files directly in claude's purview
  AGENTS.md / skills / etc
- basisword 34 minutes ago
  I switched a couple of weeks ago just to see how it went. Codex is no better or worse. They’re both noticeably better at different things. I burn through my tokens much much faster on Codex though. For what it’s worth I’m sticking with Codex for now. It seems to be significantly better at UI work although has some really frustrating bad habits (like loading your UI with annoying copywriting no sane person would ever do).
- dheera 2 hours ago
  Coding models are effectively free. They are capable of making money and supporting themselves given access to the right set of things. That is what I do
BrokenCogs 3 hours ago
I'm here for the pelicans and I'm not leaving until I see one!
[-]
- qingcharles 3 hours ago
  I've come to prompt pelicans and chew gum, and I'm all outta gum!
- pixel_popping 3 hours ago
  That's a true CTO right there.
- bytesandbits 2 hours ago
  I know a 10x engineer when i see one.
  [-]
  - BrokenCogs 1 hour ago
    In binary that's just a 10x engineer
- RomanPushkin 2 hours ago
  Ctrl+F: pelican
  F5
- tantalor 3 hours ago
  simonw pls
blixt 49 minutes ago
Releases keep shifting from API forward to product forward, with API now lagging behind proprietary product surface and special partnerships.
I'd not be surprised if this is the year where some models simply stop being available as a plain API, while foundation model companies succeed at capturing more use cases in their own software.
CompleteSkeptic 2 hours ago
Is this the first time OpenAI has published comparisons to other labs?
Seems so to me - see GPT-5.4[1] and 5.2[2] announcements.
Might be an tacit admission of being behind.
[1] https://openai.com/index/introducing-gpt-5-4/ [2] https://openai.com/index/introducing-gpt-5-2/
gallerdude 3 hours ago
If GPT-5.5 Pro really was Spud, and two years of pretraining culminated in one release, WOW, you cannot feel it at all from this announcement. If OpenAI wants to know why they like they’ve fallen behind the vibes of Anthropic, they need to look no further than their marketing department. This makes everything feel like a completely linear upgrade in every way.
[-]
- I_am_tiberius 3 hours ago
  Clearly they felt a big backlash when version 5 was released. Now they are afraid of another response like this. And effectively, for the user it will likely only be a small update.
- jimbob45 3 hours ago
  Also the naming department. You can tell that this is the AI company Microsoft chose to back because their naming scheme is as bad as .NET's.
  [-]
  - gallerdude 2 hours ago
    I actually have no problem with the 5.x line... but if Pro really was an entirely new pretrain, they did a horrible job conveying that.
jryio 3 hours ago
Their 'Preparedness Framework'[1] is 20 pages and looks ChatGPT generated, I don't feel prepared reading it.
https://cdn.openai.com/pdf/18a02b5d-6b67-4cec-ab64-68cdfbdde...
ativzzz 4 hours ago
I like that they waited for opus 4.7 to come out first so they had a few days to find the benchmarks that gpt 5.5 is better at
[-]
- eknkc 3 hours ago
  Well anectodally, 5.4 was already better than opus 4.7 so it should not have been hard.
- wahnfrieden 3 hours ago
  I like that Anthropic rushed 4.7 out to get a couple days of coverage before 5.5 hit
  [-]
  - spprashant 3 hours ago
    Everything since that launch to this release has been a PR disaster for Anthropic.
    [-]
    - dandaka 2 hours ago
      I can argue that disaster started mid-4.6, when they started juggling with rate limits while hitting uptime problems. Great we have some healthy competition and waiting for the next move from Deepmind.
      [-]
      - gck1 48 minutes ago
        Correct. Anthropic has been on disaster train since January and they can't seem to get off that train.
h14h 3 hours ago
This seems huge for subscription customers. Looking at the Artificial Analysis numbers, 5.5 at medium effort yields roughly the intelligence as 5.4 (xhigh) while using less than a fifth the tokens.
As long as tokens count roughly equally towards subscription plan usage between 5.5 & 5.4, you can look at this as effectively a 5x increase in usage limits.
[-]
- gausswho 2 hours ago
  As someone who always leaves intelligence at default, and am ok with existing models, should I be shifting gears more manually as providers sell us newer models? Is medium or lower better than free/cheaper models?
louiereederson 3 hours ago
For a 56.7 score on the Artificial Intelligence Index, GPT 5.5 used 22m output tokens. For a score of 57, Opus 4.7 used 111m output tokens.
The efficiency gap is enormous. Maybe it's the difference between GB200 NVL72 and an Amazon Tranium chip?
[-]
- swyx 3 hours ago
  why would chip affect token quantity. this is all models.
  [-]
  - louiereederson 3 hours ago
    Chip costs strongly impact the economics of model serving.
    It is entirely plausible to me that Opus 4.7 is designed to consume more tokens in order to artificially reduce the API cost/token, thereby obscuring the true operating cost of the model.
    I agree though, I chose poor phrasing originally. Better to say that GB200 vs Tranium could contribute to the efficiency differential.
- AtNightWeCode 36 minutes ago
  You need to compare total cost. Token count is irrelevant.
- karmasimida 3 hours ago
  Chips doesn’t impact output quality in this magnitude
  [-]
  - ChrisGreenHeur 3 hours ago
    True, but the qualifying the power played a large part. Most likely nuclear power for this high quality token efficiency.
- dist-epoch 1 hour ago
  If it's a new pretrain, the token embeddings could be wider - you can pack more info into a token making it's way through the system.
  Like Chinese versus English - you need fewer Chinese characters to say something than if you write that in English.
  So this model internally could be thinking in much more expressive embeddings.
2001zhaozhao 3 hours ago
Pricing: $5/1M input, $30/1M output
(same input price and 20% more output price than Opus 4.7)
[-]
- tedsanders 2 hours ago
  Yep, it's more expensive per token.
  However, I do want to emphasize that this is per token, not per task.
  If we look at Opus 4.7, it uses smaller tokens (1-1.35x more than Opus 4.6) and it was also trained to think longer. https://www.anthropic.com/news/claude-opus-4-7
  On the Artificial Analysis Intelligence Index eval for example, in order to hit a score of 57%, Opus 4.7 takes ~5x as many output tokens as GPT-5.5, which dwarfs the difference in per-token pricing.
  The token differential varies a lot by task, so it's hard to give a reliable rule of thumb (I'm guessing it's usually going to be well below ~5x), but hope this shows that price per task is not a linear function of price per token, as different models use different token vocabularies and different amounts of tokens.
  We have raised per-token prices for our last couple models, but we've also made them a lot more efficient for the same capability level.
  (I work at OpenAI.)
  [-]
  - 2001zhaozhao 1 hour ago
    I don't have anything to add, but I like how you guys are actually sending people to communicate in Hacker News. Brilliant.
  - simianwords 2 hours ago
    Maybe a good idea to be more explicit about this -- maybe a cost analysis benchmark would be a nice accompaniment.
    This kind of thing keeps popping up each time a new model is released and I don't think people are aware that token efficiency can change.
    [-]
    - tedsanders 2 hours ago
      Agreed. Would be great if everyone starts reporting cost per task alongside eval scores, especially in a world where you can spend arbitrary test-time compute. This is one thing I like about the Artificial Analysis website - they include cost to run alongside their eval scores: https://artificialanalysis.ai/
- sergiotapia 3 hours ago
  That pricing is extremely spicy, wow.
  [-]
  - benjiro3000 2 hours ago
    [dead]
- oh_no 2 hours ago
  yes but as far as i know gpt tokenizer is about the same as opus 4.6's, where 4.7 is seeing something in the ballpark of a 30% increase. this should still be cheaper even disregarding the concerns around 4.7 thinking burning tokens
NitpickLawyer 3 hours ago
> Across all three evals, GPT‑5.5 improves on GPT‑5.4’s scores while using fewer tokens.
Yeah, this was the next step. Have RLVR make the model good. Next iteration start penalising long + correct and reward short + correct.
> CyberGym 81.8%
Mythos was self reported at 83.1% ... So not far. Also it seems they're going the same route with verification. We're entering the era where SotA will only be available after KYC, it seems.
[-]
- toraway 2 hours ago
  Isn't Mythos limited to a selected group of companies/organizations Anthropic chose themselves? If the OpenAI announcement for GPT-5.5 is accurate the "trusted cyber access" just requires an open, seemingly straightforward identity verification step.
  https://openai.com/index/scaling-trusted-access-for-cyber-de...
```
  > We are expanding access to accelerate cyber defense at every level. We are making our cyber-permissive models available through Trusted Access for Cyber , starting with Codex, which includes expanded access to the advanced cybersecurity capabilities of GPT‑5.5 with fewer restrictions for verified users meeting certain trust signals (opens in a new window) at launch.

  > Broad access is made possible through our investments in model safety, authenticated usage, and monitoring for impermissible use. We have been working with external experts for months to develop, test and iterate on the robustness of these safeguards. With GPT‑5.5, we are ensuring developers can secure their code with ease, while putting stronger controls around the cyber workflows most likely to cause harm by malicious actors.

  > Organizations who are responsible for defending critical infrastructure  can apply to access cyber-permissive models like GPT‑5.4‑Cyber, while meeting strict security requirements to use these models for securing their internal systems.
```
  "GPT‑5.4‑Cyber" is something else and apparently needs some kind of special access, but that CyberGym benchmark result seems to apply to the more or less open GPT-5.5 model that was just released.
- cbg0 3 hours ago
  Isn't CyberGym an open benchmark so trivial to benchmaxx anyway?
- mattas 3 hours ago
  Not good for employees that are being measured by their token usage.
nickvec 2 hours ago
I'm conflicted whether I should keep my Claude Max 5x subscription at this point and switch back to GPT/Codex... anyone else in a similar position? I'd rather not be paying for two AI providers and context switching between the two, though I'm having a hard time gauging if Claude Code is still the "cream of the crop" for SWE work. I haven't played around with Codex much.
[-]
- slawr1805 17 minutes ago
  I was all in on Claude code as my daily driver for web development. And love it. But I enjoy using pi as my harness more and have never ran out of tokens with Codex yet. Claude code almost always runs out for me with the same amount of usage.
  After migrating for the token and harness issues, I was pleasantly surprised that Codex seems to perform as good or better too!
  Things change so often in this field, but I prefer Codex now even though Anthropocene has so much more hype for coding it seems.
- mpaepper 1 hour ago
  I switched from CC to Codex a few days ago. I get limited much less and the code quality is similar, so not looking back
  [-]
  - gck1 46 minutes ago
    Which plan? And how are the weekly limits on that plan compared to CCs equivalent subscription?
    I don't really care about 5h limits, I can queue up work and just get agents to auto continue, but weekly ones are anxiety inducing.
- the_sleaze_ 2 hours ago
  I have experienced 0 friction swapping between the 2 models, in fact pitting them against eachother has resulted in the highest success rate for me so far.
  [-]
  - nickvec 2 hours ago
    Interesting. I may have to give that a shot, thanks.
- scottyah 1 hour ago
  Every time I've followed the hype and tried OpenAI models I've found them lacking for the most part. It might just be that I prefer the peer-programming vs spec-ing out the task and handing it off, but I've never been as productive as I am with Claude. Also, I'm still caught up on the DoD ethics stuff.
losvedir 3 hours ago
> It excels at ... researching online
How does this work exactly? Is there like a "search online" tool that the harness is expected to provide? Or does the OpenAI infra do that as part of serving the response?
I've been working on building my own agent, just for fun, and I conceptually get using a command line, listing files, reading them, etc, but am sort of stumped how I'm supposed to do the web search piece of it.
Given that they're calling out that this model is great at online research - to what extent is that a property of the model itself? I would have thought that was a harness concern.
[-]
- wincy 2 hours ago
  I’ve noticed when writing little bedtime stories that require specific research (my kids like Pokemon stories and they’ve been having an episodic “pokemon adventure” with them as the protagonists) ChatGPT has done a fantastic job of first researching the moves the pokemon have, then writing the actual story. The only mistake it consistently makes is when I summarize and move from a full context session, it thinks that Gyarados has to swim and is incapable of flying.
  It definitely seems like it does all the searching first, with a separate model, loads that in, then does the actual writing.
- 100ms 3 hours ago
  It's literally a distinct model with a different optimisation goal compared to normal chat. There's a ton of public information around how they work and how they're trained
- dist-epoch 1 hour ago
  It's a property of the model in the sense that it has great Google Fu.
  The harness provides the search tool, but the model provides the keywords to search for, etc.
sosodev 3 hours ago
I hope the industry starts competing more on highest scores with lowest tokens like this. It's a win for everybody. It means the model is more intelligent, is more efficient to inference, and costs less for the end user.
So much bench-maxxing is just giving the model a ton of tokens so it can inefficiently explore the solution space.
[-]
- an0malous 3 hours ago
  The premise of the trillion dollars in AI investments is not that it’ll be as good as it currently is but cheaper. It’s AGI or bust at this point.
  [-]
  - sosodev 3 hours ago
    Yeah, but don’t you agree that less tokens to accomplish the same goal is a sign of increasing intelligence?
    [-]
    - camdenreslink 1 hour ago
      It could be. Or just smarter caching (which wouldn't necessarily have to do with model intelligence). Or just overfitting on the 95% most common prompts (which could save tokens but make the models less intelligent/flexible).
    - energy123 2 hours ago
      Less cost to accomplish the same goal is a sign of intelligence. That's not necessarily achieved with less tokens but it may be.
    - mchusma 3 hours ago
      Kind of? But I really care about price speed and quality. If it used 10x tokens at 1/10th the tokens and same latency I would be neutral on it.
      Kimmi 2.6 for example seems to throw more tokens to improve performance (for better or worse)
baalimago 3 hours ago
Worth the 100% price increase over GPT-5.4?
[-]
- cbg0 3 hours ago
  For less than 10% bump across the benchmarks? Probably not, but if your employer is paying (which is probably what OAI is counting on) it's all good.
  It's kind of starting to make sense that they doubled the usage on Pro plans - if the usage drains twice as fast on 5.5 after that promo is over a lot of people on the $100 plan might have to upgrade.
  [-]
  - jstummbillig 3 hours ago
    You are paying per token, but what you care about is token efficiency. If token efficiency has improved by as much as they claim it did (i.e. you need less tokens to complete a task successfully) all seems well.
    [-]
    - mangolie 3 hours ago
      Not for coding because it actually needs to read and write large files
      [-]
      - baalimago 3 hours ago
        Well, sort of. Imagine the case where it first scans the repo, then "intelligently" creates architecture files describing the project. The level of intelligence will create a varying quality of summary, with varying need of deep-scans on subsequent sessions. Level of intelligence will also increase comprehension of these architecture files.
        Same principle applies when designing plans for complex tasks, etc. Token amount to grasp a concept is what matters.
      - jstummbillig 2 hours ago
        Tbf, I have not super kept track of what is actually happening inside the "thinking" portion of recent releases. But last time I checked there still was a lot of verbosity and mistakes, that beat the actual amount of required, usable code generation by a wide margin.
    - cbg0 3 hours ago
      If it uses half the tokens to complete a task, then doubling the cost is perfectly fine. But is that actually true?
      [-]
      - 2001zhaozhao 3 hours ago
        This happens with every new model release though. The model makes less mistakes and spends less time fixing them, resulting in a token usage reduction for the same difficulty of task. Almost any task other than straight boilerplate will benefit from this.
        In the same vein, I would guess that Opus 4.7 is probably cheaper for most tasks than 4.6, even though the tokenizer uses more tokens for the same length of string.
        [-]
        jorl17 3 hours ago
        Maybe you'll have better luck but our team just cannot use Opus 4.7.
        Some say it goes off on endless tangents, others that it doesn't work enough. Personally, it acts, talks, and makes mistakes like GPT models, for a much more exorbitant price. Misses out on important edge cases, doesn't get off its ass to do more than the bare minimum I asked (I mention an error and it fixes that error and doesn't even think to see if it exists elsewhere and propose fixing it there).
        I've slowly been moving to GPT5.4-xhigh with some skills to make it act a bit more like Opus 4.6, in case the latter gets discontinued in favour of Opus 4.7.
        cbg0 3 hours ago
        Doesn't look like it's cheaper, better or uses fewer tokens: https://www.reddit.com/r/Anthropic/comments/1stf6fz/one_week...
        YMMV, I know.
      - jstummbillig 2 hours ago
        We'll find out!
- not_math 3 hours ago
  [dead]
thinkindie 1 hour ago
This is reminding me when Chrome and Firefox where racing to release a new “major version” (at least from the semver POV) without adding significantly new functionality at a time that browsers were already becoming a commodity. As much as we don’t care anymore for a new chrome or Firefox version so will be the release of a new model version.
[-]
- jstummbillig 1 hour ago
  The only difference being that we still do care, very much. The models can still get a lot better before we stop caring.
adam12 13 minutes ago
"Sometime with GPT-5.5 I become lazy"
I don't want to be lazy.
meetpateltech 4 hours ago
GPT-5.5 System Card:
https://deploymentsafety.openai.com/gpt-5-5
ZeroCool2u 4 hours ago
Benchmarks are favorable enough they're comparing to non-OpenAI models again. Interesting that tokens/second is similar to 5.4. Maybe there's some genuine innovation beyond bigger model better this time?
[-]
- qsort 3 hours ago
  It's behind Opus 4.7 in SWE-Bench Pro, if you care about that kind of thing. It seems on-trend, even though benchmarks are less and less meaningful for the stuff we expect from models now.
  Will be interesting to try.
Rapzid 2 hours ago
In Copilot where it's easy to switch models Opus 4.6 was still providing, IMHO, better stock results than GPT-5.4.
Particularly in areas outside straight coding tasks. So analysis, planning, etc. Better and more thorough output. Better use of formatting options(tables, diagrams, etc).
I'm hoping to see improvements in this area with 5.5.
jdw64 4 hours ago
GPT is really great, but I wish the GPT desktop app supported MCP as well.
You can kind of use connectors like MCP, but having to use ngrok every time just to expose a local filesystem for file editing is more cumbersome than expected.
[-]
- throwaway911282 3 hours ago
  Use codex app
Schlagbohrer 20 minutes ago
entering this comments area wondering if it will be full of complaints about the new personality, as with every single LLM update
M4R5H4LL 1 hour ago
I am a heavy Claude Code user. I just tried using Codex with 5.4 (as a Plus user I don't have access to 5.5 yet), and it was quite underwhelming. The app stopped regularly much earlier than what I wanted. It also claimed to have fixed issues when it did not; this is not a hallmark of GPT, and Opus has similar issues, but Claude will not make the same mistake three times in a row. It is unusable at the moment, while Claude allows me do get real work done on a daily basis. Until then...
[-]
- bhu8 1 hour ago
  Gpt-5.3-codex is miles better than 5.4 in that regard. It’s better at orchestration, and does the things that it said it did. Haven’t tested 5.5 yet but using 5.4 for exploration + brainstorming and handing over the findings to 5.3-codex works pretty well
kburman 1 hour ago
What a time. I am back here genuinely wishing for OpenAI to release a great model, because without stiff competition, it feels like Anthropic has completely lost its mind.
vessenes 3 hours ago
Yay. 5.4 was a frustrating model - moments of extreme intelligence (I liked it very much for code review) - but also a sort of idiocy/literalism that made it very unsuited for prompting in a vague sense. I also found its openclaw engagement wooden and frustrating. Which didn’t matter until anthropic started charging $150 a day for opus for openclaw.
Anyway - these benchmarks look really good; I’m hopeful on the qualitative stuff.
thimabi 3 hours ago
Will we also see a GPT-5.5-Codex version of this model? Or will the same version of it be served both in the web app and in Codex?
[-]
- Uehreka 3 hours ago
  After 5.1, we haven’t seen a -codex-max model, presumably because the benefits of the special training gpt-5.1-codex-max got to improve long context work filtered into gpt-5.2-codex, making the variant no longer necessary (my personal experience accords with this). I’ve been using gpt-5.4 in Codex since it came out, it’s been great. I’ve never back-to-back tested a version against its -codex variant to figure out what the qualitative difference is (this would take a long time to get a really solid answer), but I wouldn’t be surprised if at some point the general-purpose model no longer needs whatever extra training the -codex model gets and they just stop releasing them.
  I thought it was weird that for almost the entire 5.3 generation we only had a -codex model, I presume in that case they were seeing the massive AI coding wave this winter and were laser focused on just that for a couple months. Maybe someday someone will actually explain all of this.
pants2 2 hours ago
Labs still aren't publishing ARC-AGI-3 scores, even though it's been out for some time. Is it because the numbers are too embarrassing?
[-]
- AG25 1 hour ago
  GPT-5.5 was just released and OpenAI didnt mention ARC AGI 3 at all, their score probably sucks.
- kilroy123 1 hour ago
  To be fair, there's not much to report. Isn't it pretty much at 0?
  [-]
  - pants2 32 minutes ago
    Opus-4.6 with 0.5% currently leads GPT-5.4 with 0.2%[1].
    Seems meaningful even if the absolute numbers are very low. That's sort of the excitement of it.
    2. https://arcprize.org/leaderboard
jumploops 3 hours ago
> GPT‑5.5 improves on GPT‑5.4’s scores while using fewer tokens.
This might be great if it translates to agentic engineering and not just benchmarks.
It seems some of the gains from Opus 4.6 to 4.7 required more tokens, not less.
Maybe more interesting is that they’ve used codex to improve model inference latency. iirc this is a new (expectedly larger) pretrain, so it’s presumably slower to serve.
[-]
- beering 3 hours ago
  With Opus it’s hard to tell what was due to the tokenizer changes. Maybe using more tokens for the same prompt means the model effectively thinks more?
- conradkay 3 hours ago
  They say latency is the same as 5.4 and 5.5 is served on GB200 NVL72, so I assume 5.4 was served on hopper.
w10-1 28 minutes ago
NYTimes article - on the same day?
```
  https://www.nytimes.com/2026/04/23/technology/openai-new-model.html
```
I can see how some model releases would meet the NY Times news-worthy threshold if they demonstrated significance to users - i.e., if most users were astir and competitors were re-thinking their situation.
However, this same-day article came out before people really looked at it. It seems largely intended to contrast OpenAI with Anthropic's caution, before there has been any evidence that the new model has cyber-security implications.
It's not at all clear that the broader discourse is helping, if even the NY Times is itself producing slop just to stoke questions.
cscheid 2 hours ago
I know this is irrelevant on the grand scheme of things, but that WebGL animation is really quite wrong. That is extra funny given the "ensure it has realistic orbital mechanics." phrase in the prompt.
I prescribe 20 hours of KSP to everyone involved, that'll set them right.
I_am_tiberius 3 hours ago
I'd really like to see improvements like these: - Some technical proof that data is never read by open ai. - Proof that no logs of my data or derived data is saved. etc...
benjx88 2 hours ago
Good job on the release notice. I appreciate that it isn't just marketing fluff, but actually includes the technical specs for those of us who care and not concentrated in coding agents only.
I hope GPT 5.5 Pro is not cutting corners and neuter from the start, you got the compute for it not to be.
GenerWork 2 hours ago
Looking at the space/game/earthquake tracker examples makes me hopeful that OpenAI is going to focus a bit more on interface visual development/integration from tools like Figma. This is one area where Anthropic definitely reigns supreme.
nickandbro 3 hours ago
Very impressive! Interesting how all other benchmarks it seems to surpass Opus 4.7 except SWE-Bench Pro (Public). You would think that doing so well at Cyber, it would naturally possess more abilities there. Wonder what makes up the actual difference there
extr 3 hours ago
Seems like a continuation of the current meta where GPT models are better in GPT-like ways and Claude models are better in Claude-like ways, with the differences between each slightly narrowing with each generation. 5.5 is noticeably better to talk to, 4.7 is noticeably more precise. Etc etc.
wiseowise 1 hour ago
> One engineer at NVIDIA who had early access to the model went as far as to say: "Losing access to GPT‑5.5 feels like I've had a limb amputated.
Everybody understands that you need to make money, but can you tone it down with the f*cking FOMO, please? It sounds just pathetic at this point:
'one engineer at NVIDIA', 'limb amputated'
Put the cunt in a room and give me a handsaw, I want to see how fast he'll give up his arm over some cloud model.
bradley13 2 hours ago
"our strongest set of safeguards to date"
How much capability is lost, by hobbling models with a zillion protections against idiots?
Every prompt gets evaluated, to ensure you are not a hacker, you are not suicidal, you are not a racist, you are not...
Maybe just...leave that all off? I know, I know, individual responsibility no longer exists, but I can dream.
zerotosixty 2 hours ago
Those who are using gpt5.5 how does it compare to Opus 4.6 / 4.7 in terms of code generation?
impulser_ 3 hours ago
What is the reason behind OpenAI being able to release new models very fast?
Since Feb when we got Gemini 3.1, Opus 4.6, and GPT-5.3-Codex we have seen GPT-5.4 and GPT-5.5 but only Opus 4.7 and no new Gemini model.
Both of these are pretty decent improvements.
[-]
- minimaxir 3 hours ago
  Competition.
  [-]
  - pixel_popping 3 hours ago
    This is frankly exciting, outside of the politics of it all, it always feel great to wake up and a new model being released, I personally will stay awake quite long tonight if GPT-5.5 drop in codex.
- literalAardvark 3 hours ago
  Anthropic is really tiny, and Google is just being Google, their models are just to show that they're hip with what the kids are doing.
- wmf 3 hours ago
  I wonder if it's the same model and they just keep adding more post-training.
  [-]
  - Squarex 3 hours ago
    The rumor was that the 5.5 is a brand new pretrain. But who knows, it's 2x as expensive as 5.4, so it would check out.
- tantalor 3 hours ago
  They aren't new models.
nullbyte 4 hours ago
82.7% on Terminal Bench is crazy
[-]
- toephu2 2 hours ago
  Is it? There are 5 other models near ~80% and it was achieved in March... which in AI-world seems like a century ago.
  https://www.tbench.ai/leaderboard/terminal-bench/2.0
  [-]
  - ejpir 1 hour ago
    those are not verified. I've tried forgecode and I cannot believe they didn't do something to influence the benchmarks
    [-]
    - GodelNumbering 1 hour ago
      Yup, they were found to be sneaking the answer key using agents.md
      https://debugml.github.io/cheating-agents/#sneaking-the-answ...
AbuAssar 2 hours ago
This is the first time openAi include competing models in their benchmarks, always included only openAi models.
YmiYugy 3 hours ago
So according to the benchmarks somewhere in between Opus 4.7 and Mythos
[-]
- jorl17 3 hours ago
  GPT 5.4 is already better than Opus 4.7 to me. But, then again, Opus 4.7 is a massive disappointment. I hope they don't discontinue 4.6.
  [-]
  - robwwilliams 2 hours ago
    Depends in goals. For long free-firm discussions I find Opus 4.7 Adaptive better/deeper than Opus 4.6 Extended. But usual caveats apply: first week of use and token budget seems generous now on Max 5X.
    [-]
    - coffeemug 2 hours ago
      I had the opposite experience. Opus 4.6 extended feels like the first genuinely intelligent model to converse with, Opus 4.7 adaptive feels like slightly smarter LinkedIn slop.
  - steinvakt2 3 hours ago
    I’ve had great experience using opus 4.7 in cursor. Works for everything including iOS frontend
    [-]
    - jorl17 3 hours ago
      Cursor is what I daily-drive. 4.7 has been terrible for my mostly python-driven work (whereas Opus 4.6 was literally revolutionary to me). Our frontend folks are also complaining.
      I left a comment here with this sentiment https://news.ycombinator.com/item?id=47879896
  - benjiro3000 3 hours ago
    [dead]
k2xl 3 hours ago
Surprised to see SWE-Bench Pro only a slight improvement (57.7% -> 58.6%) while Opus 4.7 hit 64.3%. I wonder what Anthropic is doing to achieve higher scores on this - and also what makes this test particular hard to do well in compared to Terminal Bench (which 5.5 seemed to have a big jump in)
[-]
- vexna 3 hours ago
  There's an asterisk right below that table stating that:
  > *Anthropic reported signs of memorization on a subset of problems
  And from the Anthropic's Opus 4.7 release page, it also states:
  > SWE-bench Verified, Pro, and Multilingual: Our memorization screens flag a subset of problems in these SWE-bench evals. Excluding any problems that show signs of memorization, Opus 4.7’s margin of improvement over Opus 4.6 holds.
- conradkay 3 hours ago
  Was 4.7 distilled off Mythos (which got 77.8%)? Interesting how mythos got 82% on terminal-bench 2.0 compared to 82.7% for GPT-5.5.
  Also notice how they state just for SWE-Bench Pro: "*Anthropic reported signs of memorization on a subset of problems"
cchrist 2 hours ago
Which is better GPT-5.5 or Opus 4.7? And for what tasks?
ace2pace 1 hour ago
I hear its as good as Opus 4.7.
The battle has just begun
faxmeyourcode 3 hours ago
How does it compare to mythos?
cynicalpeace 3 hours ago
It's possible that "smarter" AI won't lead to more productivity in the economy. Why?
Because software and "information technology" generally didn't increase productivity over the past 30 years.
This has been long known as Solow's productivity paradox. There's lots of theories as to why this is observed, one of them being "mismeasurement" of productivity data.
But my favorite theory is that information technology is mostly entertainment, and rather than making you more productive, it distracts you and makes you more lazy.
AI's main application has been information space so far. If that continues, I doubt you will get more productivity from it.
If you give AI a body... well, maybe that changes.
[-]
- aerhardt 3 hours ago
  > "information technology" generally didn't increase productivity
  Do you think it'd be viable to run most businesses on pen and paper? I'll give you email and being able to consume informational websites - rest is pen and paper.
  [-]
  - cynicalpeace 3 hours ago
    Productivity metrics were better when businesses were run on just pen and paper. Of course, there could be many confounding factors, but there are also many reasons why this could be so. Just a few hypotheses:
    - Pen and paper become a limiting factor on bureaucratic BS
    - Pen and paper are less distracting
    - Pen and paper require more creative output from the user, as opposed to screens which are mostly consumptive
    etc etc
    [-]
    - theLiminator 2 hours ago
      > Productivity metrics were better when businesses were run on just pen and paper
      What metrics are these?
      [-]
      - cynicalpeace 1 hour ago
        Productivity growth. If you take rolling averages from this chart, it clearly demonstrate higher productivity growth before the adoption of software. This is a well established fact in econ circles.
        https://fred.stlouisfed.org/graph/?g=1V79f
        [-]
        simianwords 1 hour ago
        I think this is a classic case of reading into specific arguments too deeply without understanding what they really mean in the grand picture. Few points to easily disprove this argument
        - if it were true that software paradoxically reduces productivity, you can just start a competing company that doesn't use software. Obviously this is ridiculous - top 20 companies by market cap are mostly Software based. Every other non IT company is heavily invested in software
        - if you might say the problem is it at the country level, it is obvious that every country that has digitised has had higher productivity and GDP growth. Take Italy vs USA for instance.
        - if you are saying that the problem is even more global, take the whole world - the GDP per is still pretty high since the IT revolution (and so have other metrics)
        If you still think there's something more to it, you are probably deep in some conspiracy rabbit hole
        [-]
        eiksjs 1 hour ago
        Is there a way to mute people who are clearly AI boosters? ^
        [-]
        simianwords 1 hour ago
        ? you are literally commenting on the release of a new model from OpenAI in a tech focused community. Have you considered what should be normal here?
        cynicalpeace 59 minutes ago
        The data clearly shows that productivity growth is flat or even declining. What is your accounting of why software hasn't offset those numbers?
        [-]
        simianwords 54 minutes ago
        You don't have a counterfactual to suggest that it would have continued increasing had it not been for technology. Is there _any_ credible economist who suggests that we might have higher productivity without tech?
- ewrs 3 hours ago
  Its quite possible the use of LLMs means that we are using less effort to produce the same output. This seems good.
  But the less effort exertion also conditions you to be weaker, and less able to connect deeply with the brain to grind as hard as once did. This is bad.
  Which effect dominates? Difficult to say.
  Of course this is absolutely possible. Ultimately there was a time where physical exertion was a thing and nobody was over-weight. That isn't the case anymore is it.
- aiaiai177 3 hours ago
  Downvoted by the AI Nazis. They are running a tight ship before the IPOs.
  [-]
  - cbg0 3 hours ago
    I downvoted it because it doesn't add anything useful to the conversation, and I don't own any AI stock.
    [-]
    - cynicalpeace 3 hours ago
      It's a hypothesis that "smarter" AI models, ie GPT-5.5, may not be a great boon to productivity. Given that this is the raison d'etre of AI models, and improving them, I don't see why it is any less useful than any other discussion.
jedisct1 1 hour ago
GPT-5.4 is already an incredible model for code reviews and security audits with the swival.dev /audit command.
The fact that GPT-5.5 is apparently even better at long-running tasks is very exciting. I don’t have access to it yet, but I’m really looking forward to trying it.
wslh 1 hour ago
Related and insightful: "GPT-5.5: Mythos-Like Hacking, Open to All" [1].
[1] https://news.ycombinator.com/item?id=47879330
woeirua 3 hours ago
Nice to see them openly compare to Opus-4.7… but they don’t compare it against Mythos which says everything you need to know.
The LinkedIn/X influencers who hyped this as a Mythos-class model should be ashamed of themselves, but they’ll be too busy posting slop content about how “GPT-5.5 changes everything”.
[-]
- A_D_E_P_T 1 hour ago
  Almost nobody can actually use Mythos, though?
ionwake 3 hours ago
is there anywhere I can try it? ( I just stopped my pro sub ) but was wondering if there is a playground or 3rd party so i can just test it briefly?
throwaway2027 3 hours ago
Good timing I had just renewed my subscription.
tantalor 3 hours ago
> A playable 3D dungeon arena
Where's the demo link?
i_love_retros 1 hour ago
Oh shiiiiit boy! An incrementation dropped!!
egorfine 2 hours ago
> We are releasing GPT‑5.5 with our strongest set of safeguards to date
...
> we’re deploying stricter classifiers for potential cyber risk which some users may find annoying initially
So we should be expecting to not be able to check our own code for vulnerabilities, because inherently the model cannot know whether I'm feeding my code or someone else's.
[-]
- dannyw 1 hour ago
  Hopefully not, because checking your codebase for vulnerabilities is really valuable.
  I hope it’s just limits on pentesting and stuff, and not for code analysis and review.
ant6n 1 hour ago
My impression has been that ChatGPT-5.4 has been getting dumber and more exhausting in the last couple of weeks. Like it makes a lot of obvious mistakes, ignores (parts of) prompts. keeps forgetting important facts or requirement.
Maybe this is a crazy theory, but I sometimes feel like they gimp their existing models before a big release to you'll notice more of a "step".
senko 2 hours ago
I might just be following too many AI-related people on X, but omg the media blitz around 5.5 is aggressive.
Soo many unconvincing "I've had access for three weeks and omg it's amazing" takes, it actually primes me for it to be a "meh".
I prefer to see for myself, but the gradual rollout, combined with full-on marketing campaign, is annoying.
debba 3 hours ago
Cannot see it in Codex CLI
[-]
- boring-human 1 hour ago
  Did you upgrade the tool binaries? I also couldn't see it until after the upgrade.
jawiggins 2 hours ago
What is the major and minor semver meaning for these models? Is each minor release a new fine-tuning with a new subset of example data while the major releases are made from scratch? Or do they even mean anything at this point?
[-]
- gck1 12 minutes ago
  Nothing. The next major increment is going to happen when marketing department is confident they can sell it as a major improvement without everyone laughing at them. Which at this point seems like never.
  I think Anthropic fearmongering and "leaks" of Mythos was them testing the ground for 5.x, which seems to have backfired.
phillipcarter 3 hours ago
... sigh. I realize there's little that can be done about this, but I just got through a real-world session determining of Opus 4.7 is meaningfully better than Opus 4.6 or GPT 5.4, and now there's another one to try things with. These benchmark results generally mean little to me in practice.
Anyways, still exciting to see more improvements.
elAhmo 2 hours ago
Is Codex receiving 5.4 or 5.5 release?
I am still using Codex 5.3 and haven't switched to GPT 5.4 as I don't like the 'its automatic bro trust us', so wondering is Codex going to get these specific releases at all in the future.
objektif 4 hours ago
Are there faster mini/nano versions as well?
[-]
- tedsanders 3 hours ago
  Not this time, no.
- abi 3 hours ago
  Usually, those get released a few weeks later.
numbers 3 hours ago
I've stopped trusting these "trust me bro" benchmarks and just started going to LM Arena and looking for the actual benchmark comparisons.
https://arena.ai/leaderboard/code
[-]
- stri8ted 3 hours ago
  I doubt this is representative of real world usage. There is a difference between a few turns on a web chatbot, vs many-turn cli usage on a real project.
- nba456_ 3 hours ago
  This is not any better of a benchmark
varispeed 2 hours ago
I am sceptical. The generation after 4o models have become crappier and crappier. Hope this one changes the trend. 5.4 is unusable for complex coding work.
mondojesus 3 hours ago
I'm still using 5.3 in codex. Are 5.4 and 5.5 better than 5.3 in concrete ways?
[-]
- cbg0 3 hours ago
  The benchmarks say so, but try it out with actual tasks and be the judge.
enraged_camel 3 hours ago
Is this the first time OpenAI compared their new release to Anthropic models? Previously they were comparing only to GPT's own previous versions.
k2xl 3 hours ago
ARC-AGI 3 is missing on this list - given that the SOTA before 5.5 <1% if I recall, I wonder if this didn't make meaningful progress.
[-]
- redox99 3 hours ago
  It's a silly benchmark anyways.
cmrdporcupine 3 hours ago
Not rolled out to my Codex CLI yet, but some users on Reddit claiming it's on theirs.
throwaw12 3 hours ago
If anyone tried it already, how do you feel?
Numbers look too good, wondering if it is benchmaxxed or not
xnx 3 hours ago
Next up: Google I/O on May 19?
I have to imagine they'll go to Gemini 3.5 if only for marketing reasons.
luqtas 4 hours ago
they are using ethical training weights this time!!! /j
yuvrajmalgat 2 hours ago
finally
baxuz 1 hour ago
Ah yes, the next "trust me bro"
1515874411 53 minutes ago
[dead]
lukebechtel 1 hour ago
[dead]
charliecs 3 hours ago
[dead]
jeremie_strand 3 hours ago
[dead]
yuvrajmalgat 2 hours ago
[dead]
MagicMoonlight 3 hours ago
Two hundred pages of shilling and it’s a 1% improvement in the benchmarks. They’re dead in the water.
Imagine spending 100m on some of these AI “geniuses” and this is the best they can do.
XCSme 2 hours ago
2x the price for 1-5% performance gain
justonepost2 3 hours ago
the attenuation of man nears
< 5 years until humans are buffered out of existence tbh
may the light of potentia spread forth beyond us
coderssh 3 hours ago
Great modal, I have been using codex and its awesome. Lets see what GPT-5.5 does to it
vardump 3 hours ago
I just can't bear to use services from this company after what they did to the global DRAM markets.
I'm not trying to make any kind of moral statement, but the company just feels toxic to me.