Gemini 3 Pro vs. 2.5 Pro in Pokemon Crystal

(blog.jcz.dev)

193 points | by alphabetting 4 days ago

10 comments

  • orbital-decay 1 hour ago
    The baked-in assumptions observation is basically the opposite of the impression I get after watching Gemini 3's CoT. With the maximum reasoning effort it's able to break out of the wrong route by rethinking the strategy. For example I gave it an onion address without the .onion part, and told it to figure out what this string means. All reasoning models including Gemini 2.5 and 3 assume it's a puzzle or a cipher (because they're trained on those) and start endlessly applying different algorithms to no avail. Gemini 3 Pro is the only model that can break the initial assumption after running out of ideas ("Wait, the user said it's just a string, what if it's NOT obfuscated"), and correctly identify the string as an onion address. My guess is they trained it on simulations to enforce the anti-jailbreaking commands injected by the Model Armor, as its CoT is incredibly paranoid at times. I could be wrong, of course.
  • bbondo 4 hours ago
    1.88 billion tokens * $12 / 1M tokens (output) suggests a total cost of $22,560 to solve the game with Gemini 3 Pro?
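
    Back-of-the-envelope, that's what you get if you assume all 1.88B tokens are billed at the $12/1M output rate, which is a rough upper bound since input tokens are cheaper:

        # sketch of the estimate above; the rate and the all-output assumption
        # come from the comment, not from actual billing data
        tokens = 1.88e9                     # total tokens reported in the post
        output_rate = 12.0                  # assumed USD per 1M output tokens
        print(tokens / 1e6 * output_rate)   # 22560.0
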
    • elephanlemon 3 hours ago
      “Gemini 3 Pro was often overloaded, which produced long spans of downtime that 2.5 Pro experienced much less often”

      I was unclear whether this meant that the API was overloaded or whether he was on a subscription plan and had hit his limit for the moment. Although I think the Gemini plans just use weekly limits, so I guess it must be the API.

    • addaon 1 hour ago
      To beat it, not to solve it. Solving means something very specific in the context of games — deriving and proving a GTO strategy.
    • brianwawok 3 hours ago
      True, though I bet the $200-a-month plan could do it, maybe with a few extra days of downtime when the quota was maxed
      • AstroBen 3 hours ago
        For how long would it stay $200 if you can rack up 5 figures of usage?
        • manmal 2 hours ago
          That is the reason they severely limited Claude Max subscriptions. Some users racked up $1k+ in API-equivalent cost per day.
    • mkoubaa 3 hours ago
      I can't believe how massively underpaid I was when I was 11
      • re-thc 3 hours ago
        Did you hallucinate as a kid?
        • foundddit 3 hours ago
          At that age, it's called "imagination"
        • nomel 1 hour ago
          Kids definitely do this. They fill in blanks/context with assumptions, resulting in all sorts of silly responses, for topics of sparse knowledge/certainty. They're not lying, because they think it's true. Sometimes the gap filling is wrong, but usually downright brilliant, within the context of their knowledge.
        • mkoubaa 1 hour ago
          All kids confidently state incorrect things; it's part of growing up
          • mikojan 50 minutes ago
            That is just part of being a frontend developer
    • ogogmad 3 hours ago
      :/ Damn. That needs to cost 1000x less before people can try it on their own games.
      • someperson 2 hours ago
        That's an extrapolation to finish the entire game.

        If you limit your token count to a fraction of 2 billion tokens, you can try it on your own game, and of course it will only complete a correspondingly shorter fraction of the game.

  • oceansky 4 hours ago
    "Crucially, it tells the agent not to rely on its internal training data (which might be hallucinated or refer to a different version of the game) but to ground its knowledge in what it observes. "

    Does this even have any effect?

    • ragibson 4 hours ago
      Yes, at least to some extent. The author mentions that the base model knows the answer to the switch puzzle but does not execute it properly here.

      "It is worth noting that the instruction to "ignore internal knowledge" played a role here. In cases like the shutters puzzle, the model did seem to suppress its training data. I verified this by chatting with the model separately on AI Studio; when asked directly multiple times, it gave the correct solution significantly more often than not. This suggests that the system prompt can indeed mask pre-trained knowledge to facilitate genuine discovery."

      • hypron 4 hours ago
        My issue with this is that the LLM could just be roleplaying that it doesn't know.
        • jdiff 3 hours ago
          Of course it is. It's not capable of actually forgetting or suppressing its training data. It's just double checking rather than assuming because of the prompt. Roleplaying is exactly what it's doing. At any point, it may stop doing that and spit out an answer solely based on training data.

          It's a big part of why search overview summaries are so awful. Many times the answers are not grounded in the material.

        • brianwawok 3 hours ago
          To test this, you'd just need to edit the ROM and switch around the solution. Not sure how complicated that is; it likely depends on the ROM system.
          • Workaccount2 3 hours ago
            I don't know why people still get wrapped around the axle of "training data".

            Basically every benchmark worth its salt uses bespoke problems purposely tuned to force the models to reason and generalize. It's the whole point of ARC-AGI tests.

            Unsurprisingly, Gemini 3 Pro performs way better on ARC-AGI than 2.5 Pro, and unsurprisingly it did much better in Pokemon.

            The benchmarks, by design, indicate you can mix up the switch puzzle pattern and it will still solve it.

    • tootyskooty 4 hours ago
      I'm wondering about this too. Would be nice to see an ablation here, or at least see some analysis on the reasoning traces.

      It definitely doesn't wipe its internal knowledge of Crystal clean (that's not how LLMs work). My guess is that it slightly encourages the model to explore more and second-guess its likely very strong Crystal game knowledge, but that's about it.

      • Workaccount2 3 hours ago
        The model probably recognizes the need for a grassroots effort to solve the problem, to "show its work".
    • raincole 3 hours ago
      It will definitely have some effect. Why wouldn't it? Even adding noise into prompts (like saying you will be rewarded $1000 for each correct answer) has some effect.

      Whether the 'effect' is the one implied by the prompt, or even something we can understand, is a totally different question.

    • elif 1 hour ago
      I would imagine that prompting anything like this will have an excessively ironic effect like convincing it to suppress patterns which it would consider to be pre-knowledge.

      If you looked inside, it would be spinning on something like "oh I know this is the tile to walk on, but I have to only rely on what I observe! I will do another task instead to satisfy my conditions and not reveal that I have pre-knowledge."

      LLMs are literal douche genies. The less you say, generally, the better.

    • baby 3 hours ago
      Do we have examples of this in prompts in other contexts?
    • blibble 4 hours ago
      I very much doubt it
    • astrange 3 hours ago
      If they trained the model to respond to that, then it can respond to that, otherwise it can't necessarily.
      • oceansky 3 hours ago
        I think you've got a point here. These companies are injecting a lot of new datasets into it every day.
    • mkoubaa 3 hours ago
      It might get things wrong on purpose, but deep down it knows what it's doing
  • soulofmischief 4 hours ago
    Nice writeup! I need to start blogging about my antics. I rigged up several cutting edge small local models to an emulator all in-browser and unsuccessfully tried to get them to play different Pokémon games. They just weren't as sharp as the frontier models.

    This was a good while back but I'm sure a lot of people might find the process and code interesting even if it didn't succeed. Might resurrect that project.

    • giancarlostoro 4 hours ago
      I have to think they need to know enough of the guides for the game for it to work out. How do they know what's on screen?
      • soulofmischief 4 hours ago
        In my project I rigged up an in-browser emulator and directly fed captured images of the screen to local multimodal models.

        So it just looks right at what's going on, writes a description for refinement, and uses all of that to create and manage goals, write to a scratchpad and submit input. It's minimal scaffolding because I wanted to see what these raw models are capable of. Kind of a benchmark.
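
        Roughly the shape of the loop, if that helps anyone picture it. This is just a sketch: the names, the emulator interface, and the model call are illustrative stand-ins, not the actual project code:

            # Illustrative capture -> describe -> plan -> act loop.
            # `emulator` (with .screenshot()/.press()) and `query_model` are
            # hypothetical stand-ins for the in-browser emulator bindings and
            # the local multimodal model.
            from dataclasses import dataclass, field

            @dataclass
            class AgentState:
                goals: list[str] = field(default_factory=list)
                scratchpad: str = ""

            def query_model(prompt: str, image: bytes | None = None) -> str:
                raise NotImplementedError  # wire up the local multimodal model here

            def step(emulator, state: AgentState) -> None:
                frame = emulator.screenshot()                    # capture the current screen
                seen = query_model("Describe what is on screen.", image=frame)
                plan = query_model(
                    f"Goals: {state.goals}\nScratchpad: {state.scratchpad}\nScreen: {seen}\n"
                    "Update the scratchpad and end with ONE button: "
                    "A, B, UP, DOWN, LEFT, RIGHT, START or SELECT."
                )
                state.scratchpad = plan                          # persist the model's notes
                emulator.press(plan.strip().split()[-1])         # naive: last token is the button
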

        • giancarlostoro 1 hour ago
          I have a feeling if you gave them access to GameFAQs guides they might be able to play better, but it depends on how you can feed them the data.
  • sussmannbaka 1 hour ago
    So after years of being gleefully told that AI will replace all jobs, an omniscient state-of-the-art model, with heavy assistance, takes more than two weeks and thousands of dollars in tokens to do what child me did in a few days? Huh.
    • rybosome 53 minutes ago
      “And, because AI never got any better or any cheaper after that point, sussmannbaka's wry observation remained true in perpetuity, forever.”

      - History, most likely

    • dwaltrip 26 minutes ago
      Children are incredibly smart. All of this was fantasy 15 years ago. Comments like yours are amazing to me…
    • murukesh_s 1 hour ago
      I used to think the same until the latest agents started adding perfectly fine features to a large existing React app with just basic input (in English). Most jobs require levels of intelligence below that. It's just a matter of time before agents get there.
      • blauditore 1 hour ago
        It's about the complexity of the task. Front end apps tend to be much less complex and boilerplate-y than backends, hence AI tends to work better.
        • murukesh_s 1 hour ago
          I disagree - having worked on backends most of the time, I find modern frontend much more complex (and difficult to test) than pure backend. When I say modern frontend, it's mostly React, state management like Redux or Zustand, a router framework like React Router, a CSS framework like Tailwind, and a component framework like Shadcn. Not to mention different versions of React, different ways of managing state, animations/transitions, etc. And on top of that, the ever-increasing quirks of a codebase that still needs to be compatible with all the modern browsers and device sizes/orientations out there.
          • rafaelmn 30 minutes ago
            That's just a familiarity thing. I've worked on projects doing full web FE, mobile, and BE.

            It's hard to generalize, but modern frontend is very good at isolating you from dealing with complex state machine states, and you're dealing with a single user / limited concurrency. It's usually easy to find all references/use cases for something.

            Most modern backend is building consistent distributed state machines: you need to cover all the edge cases, deal with concurrency, different clients/contracts, etc. I would say getting BE right (beyond simple CRUD) is going to be hard for an LLM simply because the context is usually wider and hard to compress/isolate.

        • etse 1 hour ago
          Isn’t frontend more complex? If my task starts with a Figma UI design, how well does a code agent do at generating working code that looks right, and iterating on it (presuming some browser MCP)? Some automated tests seem enough for an agentic loop on backend.
          • murukesh_s 1 hour ago
            >Isn’t frontend more complex? If my task starts with a Figma UI design, how well does a code agent do at generating working code that looks right, and iterating on it (presuming some browser MCP)? Some automated tests seem enough for an agentic loop on backend.

            Haven't tried a Figma design, but I built an internal tool entirely via instructions to an agent. The kind of work I could easily have quoted 3 weeks for previously.

        • ribosometronome 52 minutes ago
          Or perhaps the sort of things it's been trained on? There's not really a huge corpus of material re: beating Pokemon in the manner it has to play Pokemon, especially compared to the mountains of code these models have access to.
  • cg5280 3 hours ago
    I like the inclusion of the graph at the end to compare progress. It would be cool to compare this directly to competing models (Claude, GPT, etc).
    • kqr 3 hours ago
      It would unfortunately also need several runs of each to be reliable. There's nothing in TFA to indicate the results shown aren't to a large degree affected by random chance!

      (I do think from personal benchmarks that Gemini 3 is better for the reasons stated by the author, but a single run from each is not strong evidence.)

      • casey2 3 hours ago
        TFA says multiple times that the results are affected by random chance.
        • kqr 2 hours ago
          Yes, but recognising that is only the first step. Quantifying the variance is the next step, and that's what I find missing in the article.
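
          For example (a sketch only, with made-up names, and the per-run data would still have to be collected): run each model several times, record total actions per run, then report a mean plus a bootstrap interval instead of a single trajectory.

              # sketch: summarise several runs per model; no such run data
              # exists in the article, it would have to be gathered first
              import random
              import statistics

              def summarize(runs: list[float], n_boot: int = 10_000) -> tuple[float, float, float]:
                  boot_means = sorted(
                      statistics.mean(random.choices(runs, k=len(runs)))
                      for _ in range(n_boot)
                  )
                  return (statistics.mean(runs),
                          boot_means[int(0.025 * n_boot)],   # lower 95% bound
                          boot_means[int(0.975 * n_boot)])   # upper 95% bound
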
  • squimmy26 4 hours ago
    How certain can we be that these improvements aren't just a result of Gemini 3 Pro pre-training on endless internet writeups of where 2.5 has struggled (and almost certainly what a human would have done instead)?

    In other words, how much of this improvement is true generalization vs memorization?

    • zurfer 3 hours ago
      You're too kind. Even the CEO of Google retweeted how well Gemini 2.5 did on Pokemon. There is a high chance that now it's explicitly part of the training regime. We kind of need a different kind of game to know how well it generalizes.
    • prmoustache 30 minutes ago
      Isn't that the point of a new model anyway?
  • jwrallie 4 hours ago
    Having been through the game recently, I am not surprised Goldenrod Underground was a challenge: it is very confusing, and even though I solved it through trial and error, I still don't know what I did. Olivine Lighthouse is the real surprise, as it felt quite obvious to me.
  • wild_pointer 4 hours ago
    I wonder how much of it is due to the model being familiar with the game or parts of it, be it due to training on the game itself or reading/watching walkthroughs online.
    • andrepd 4 hours ago
      There was a well-publicised "Claude plays Pokémon" stream where Claude failed to complete Pokemon Blue in spectacular fashion, despite weeks of trying. I think only a very gullible person would assume that future LLMs didn't specifically bake this into their training, as they do for popular benchmarks or for pelicans riding a bike.
      • dwaltrip 21 minutes ago
        If they game the pelican benchmark, it’d be pretty obvious.

        Just try other random, non-realistic things like “a giraffe walking a tightrope”, “a car sitting at a cafe eating a pizza”, etc.

        If the results are dramatically different, then they gamed it. If they are similar in quality, then they probably didn’t.

      • ctoth 1 hour ago
        > as they do for popular benchmarks or for pelicans riding a bike.

        Citation?

      • criley2 4 hours ago
        While it is true that model makers are increasingly trying to game benchmarks, it's also true that benchmark-chasing is lowering model quality. GPT 5, 5.1 and 5.2 have been nearly universally panned by almost every class of user, despite being benchmark monsters. In fact, the more OpenAI tries to benchmark-max, the worse their models seem to get.
        • astrange 3 hours ago
          Hm? 5.1 Thinking is much better than 4o or o3. Just don't use the instant model.
  • elif 1 hour ago
    Give it the GameFAQs guide next time