Wow, this is pretty vindicating for those who believe in the scaling of dense models over Goodhart's-law-following sparse MoE models. I do think quantifying this would be a good idea, even though you have a distaste for it. A shame we can't see GPT-4.5... Do you think you could try o3, o1, and GPT-5? A quick heads up: under Settings -> Data Controls -> Sharing you should find a feature, "Share inputs and outputs with OpenAI"; enabling it gives you 1 million free credits per day for the best models, although OpenAI might go and train on that, thus defeating your goal! haha!
Interesting. I wonder what would happen if you tried this in polar coordinates and displayed the result on a sphere. Are the results reproducible? My guess would be: no generalisation beyond the longitude/latitude training data.
yeah, this is indeed interesting. I'll add it to the list of follow-up experiments. I suspect performance will be degraded, but not lost entirely.
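For what it's worth, the sphere-display half of this is just a projection step; a minimal sketch of the lat/lon-to-Cartesian conversion (the function name and conventions here are mine, not from the post):

```python
import math

def latlon_to_xyz(lat_deg, lon_deg, radius=1.0):
    """Map latitude/longitude in degrees to a point on a sphere
    (x points toward lat=0, lon=0; z points toward the north pole)."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (radius * math.cos(lat) * math.cos(lon),
            radius * math.cos(lat) * math.sin(lon),
            radius * math.sin(lat))
```

Feeding these points into any 3D scatter plot (e.g. matplotlib's `Axes3D`) would wrap the land/water map onto a globe. The harder question (whether the model generalises when *queried* in a different coordinate convention) would still need a separate experiment.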
Super interesting, I didn't expect Llama 3 405b to be so good. Do you think there could be some training set pollution there too?
I get what you mean about seeing the planet from orbit. At least there are plenty more where ours came from, not to mention the deep oceans.
probably something like that, yeah. compared to gpt-4 base, the llama 405b base model isn't even particularly intelligent in my experience
Good point on training set bias. One key question is whether LLMs are actually reasoning about geography or just memorizing coordinate patterns from training data.
One could hypothesize that their "geographic knowledge" simply reflects coordinate density in text (more populated areas → more mentions → better land prediction). If true, plotting all Wikipedia coordinates should correlate with the "LLM world map" here.
Indeed, this is exactly what you see if you plot all GCS co-ordinates extracted from Wikipedia:
https://github.com/Magnushhoie/Wikipedia_Coordinates_Visualization
Note the presence of "predicted land" for commonly mentioned sea co-ordinates. This implies that larger models aren't necessarily "smarter" at geography - they just have bigger memorized datasets.
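A rough sketch of how such a density map can be binned, assuming you already have a list of (lat, lon) pairs extracted from Wikipedia (the grid resolution is arbitrary, and this helper is mine, not from the linked repo):

```python
def density_map(coords, lat_bins=90, lon_bins=180):
    """Bin (lat, lon) pairs into a coarse grid; returns per-cell mention
    counts, which can be rendered as a heatmap and compared to the
    LLM-generated land/water maps."""
    grid = [[0] * lon_bins for _ in range(lat_bins)]
    for lat, lon in coords:
        # Shift lat from [-90, 90] and lon from [-180, 180] into bin indices.
        i = min(int((lat + 90.0) / 180.0 * lat_bins), lat_bins - 1)
        j = min(int((lon + 180.0) / 360.0 * lon_bins), lon_bins - 1)
        grid[i][j] += 1
    return grid
```

Thresholding the counts (cell mentioned at least N times -> "land") would give a directly comparable binary map.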
Reminds me of DeepMind's recently released AlphaEarth. https://deepmind.google/discover/blog/alphaearth-foundations-helps-map-our-planet-in-unprecedented-detail/
I wonder what this experiment would look like if it were repeated with choices other than (land, water). What if it could choose between biomes instead of just land? What if it were asked to provide an elevation - would mountain ranges be visible? What if it were asked to provide the country - how accurate would the borders look?
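These variations would mostly just need different prompt templates; a hypothetical sketch (none of these strings are from the actual experiment):

```python
# Hypothetical prompt templates for the suggested follow-up experiments.
PROMPTS = {
    "land_water": ("Is the point at latitude {lat}, longitude {lon} "
                   "on land or in water? Answer with one word."),
    "biome": ("What biome is found at latitude {lat}, longitude {lon}? "
              "Answer with one word (e.g. desert, tundra, rainforest)."),
    "elevation": ("What is the approximate elevation in meters at latitude "
                  "{lat}, longitude {lon}? Answer with a single number."),
    "country": ("Which country, if any, contains the point at latitude "
                "{lat}, longitude {lon}? Answer with the country name or 'none'."),
}

def build_prompt(kind, lat, lon):
    """Fill one of the templates with a coordinate pair."""
    return PROMPTS[kind].format(lat=lat, lon=lon)
```

The biome and country variants would also need a small answer-normalisation step (mapping free-text replies onto a fixed colour palette) before they could be plotted the same way.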
Do you happen to have an accuracy statistic (say what fraction of all coordinates you tried the model got correct)? This could make for an interesting eval to optimize for. It might be even more interesting if you divided the globe into (say) quadrants and also calculated each model's accuracy over each quadrant.
Alternatively (this would drive up costs a *lot*, but it might increase accuracy too) you could allow each model to discuss the location and/or use a chain of thought before pinning down the answer.
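The per-quadrant accuracy idea is cheap to compute once the predictions exist; a minimal sketch, assuming each trial is recorded as (lat, lon, predicted_is_land, actual_is_land):

```python
def quadrant_accuracy(results):
    """results: list of (lat, lon, predicted_is_land, actual_is_land).
    Returns overall accuracy and accuracy per hemisphere quadrant
    (NE, NW, SE, SW)."""
    buckets = {}
    correct_total = 0
    for lat, lon, pred, actual in results:
        q = ("N" if lat >= 0 else "S") + ("E" if lon >= 0 else "W")
        hits, n = buckets.get(q, (0, 0))
        buckets[q] = (hits + (pred == actual), n + 1)
        correct_total += (pred == actual)
    overall = correct_total / len(results)
    per_quadrant = {q: hits / n for q, (hits, n) in buckets.items()}
    return overall, per_quadrant
```

Finer grids (say 10-degree cells) would fall out of the same bucketing logic, at the cost of noisier per-cell estimates.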
Yeah, I addressed this. It could definitely be done, but I have a bit of a distaste for purely numerical and goodhartable benchmarks.
> By the way, I'm also going to avoid letting things become too benchmark-ey. Yes, I could grade these generated maps, computing the mean squared error relative to some ground truth and ranking the models, but I think it'll soon become apparent how much we'd lose by doing so. Instead, let's just look at them, and see what we can notice.
... hmm, you're right, I think benchmarkification might be a bad idea
DUCKY!! IT'S YOU!! QUACK!!!
Good stuff! Loved it.
https://substackcdn.com/image/fetch/$s_!4shR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe93df604-8a92-48e5-b002-e1ed90d0c1da_1532x940.png really is a nice flag of the world huh.
ACTUALLY THIS REMINDS ME MORE OF TY ROBINSON PAPERS ON HOW TO VIEW THE EARTH FROM SUPER FAR AS AN EXOPLANET
This feels SO much like the https://dspace.mit.edu/handle/1721.1/52333 paper
And Dorian Abbot and Ray Pierrehumbert papers!!!
Where be dragons?
I've long had an intuition that multimodality could benefit the spatial reasoning ability and world models these systems may develop, and this post is an elegant way of looking into that!
I recommend looking at known early-fusion multimodal models (Chameleon) and the rumored early-fusion aka "natively multimodal" Gemini 2.5 Pro next - I expect they should blow anything of similar size out of the water, if my intuition is directionally correct.
What does "hermes-ification" mean?
hermes 405b is a model trained from llama 405b. I mean "the process of turning a llama model into a hermes model"
Why is North America a Mandelbrot set in Qwen2.5 7B?