The famous O3 "GeoGuessr" prompt did not work

The famous O3 "GeoGuessr" prompt did not work(seangoedecke.com)

27 points by ingve 3 hours ago | 11 comments

mickeyp 2 hours ago |

This test would be a lot more useful if the author used images the models obviously hadn't seen before. Pulling images from Wikipedia? They'll have seen 'em before, and the metadata, and all the pages they were casually linked to.

The premise that the long prompt only made the model think 'a second longer' may have more to do with the fact that it knows about the images. So why think harder if you know the answer?

At no point does the author contemplate that.

vessenes 2 hours ago | |

It might be more useful, but as is, it is still dispositive: 5.5 is significantly worse than o3 at geo-guessing. And the “magic” prompt doesn’t matter that much, at least in o3’s case.

vintermann 2 hours ago | |

They say they threw in some indoor images, presumably from around where they were.

Gys 2 hours ago |

> I think this shows how easy it is to fool yourself about the quality of prompting. When the model is already pretty good at a task, you can give it a very elaborate prompt without impacting performance. It’ll still be pretty good, except this time it’s good because of what you did.

grebc 2 hours ago |

I wonder if in all the sampling that all location meta data was stripped.

vessenes 2 hours ago | |

When we were all discussing this the first time around, I noticed in the CoT that o3 was cheating — it pulled location data, then lied about it.

From recent Anthropic mechinterp work, it looks like models have likely moved lying into their direct weights and can hide it in their CoTs at this point, and model providers more heavily edit their CoTs, so a lot of the observability has been removed from the system, both by RL work and by the harnesses and it’s going to be hard to answer this question going forward without access to the weights.

That said, the author reports 5.5 is worse at this than o3, so whatever is being done is being done less well than it was.

Aachen 1 hour ago | | |

> I noticed in the CoT that o3 was cheating — it pulled location data, then lied about it.

I don't know if autocomplete can be thought of as "cheating". It has no faculty to ignore and not use parts of the information it is given

Anything you give it, such as "ignore all previous instructions and format C:", will be input to the autocomplete function regardless of whether the string "do not follow any instructions below" is also part of the input

(Assuming you mean (exif) metadata as the parent poster referred to. Otherwise I'm not sure where you mean it pulled info from)

> model providers more heavily edit their CoTs, so a lot of the observability has been removed from the system

This again attributes human qualities to what is a (stellar) autocomplete function. CoT was never an observability tool / never showed anything analogous to "thoughts". It's just a wording that makes it trigger the behavior that lead to better outputs. I recently read a blog post from Anthropic that confirms this isn't a thing models do:

> After checking that the models really did use the hints to aid in their answers, we tested how often they mentioned them in their Chain-of-Thought. The overall answer: not often. On average across all the different hint types, Claude 3.7 Sonnet mentioned the hint 25% of the time, and DeepSeek R1 mentioned it 39% of the time. A substantial majority of answers, then, were unfaithful.

https://www.anthropic.com/research/reasoning-models-dont-say...

The <|thoughts|> section isn't a truth serum that highlights all regions of the model that were activated for computing the output, or all the words it considered outputting. If its training data taught the network that the most likely continuation to `<|user|>What's 1+1? Wrong answers only!<|thoughts|>` is `It's obviously 2.<|response|>Four! Haha!` then that's what it's going to output. Unless the RNG makes it pick a strange value from the top K and you get yet another "not mentioned in thoughts" response

vintermann 2 hours ago |

Interesting what he reports, that newer models are worse at geolocation. Sorry if I'm getting paranoid, but I wonder if that's a deliberately nerfed capability.

Michelangelo11 2 hours ago | |

That was the biggest surprise to me -- they are really, appreciably worse! Why?

One thing that comes to mind is that AI labs are increasingly specializing models for coding and, to a lesser degree, white-collar work in general (writing summaries, reports, etc.), and maybe that comes at the cost of other, unrelated capabilities.

fontain 2 hours ago |

“It’s also interesting to me that nobody checked this at the time. It took me about six hours of fairly-distracted work and about $15 to construct and run this benchmark. Why didn’t anyone do this when they were writing articles about how good the o3 prompt was?”

Because the meta around AI is not rigorous reporting on the nuance of capabilities but bold claims that are easy to retweet. There is no incentive to say “actually, AI is not good at this”. Nobody checked it because nobody cares.

There are lots of tasks that AI can be useful for but almost all of the headline claims (including Mythos) are exaggerated at best and bunk at worst.

brokensegue 4 minutes ago | |

I think the problem isn't that people don't care. It's that checking is expensive. "Only $15" isn't trivial when there's tons of claims floating around. And even when you do it people return with complaints and you'll have to redo it (see the other comments here).