Wow, this is pretty vindicating for those who believe in the scaling of dense models over Goodhart's-law-following sparse MoE models. I do think quantifying this would be a good idea, even though you have a distaste for it. A shame we can't see GPT-4.5... Do you think you could try o3, o1, and GPT-5? A quick heads up: under Settings -> Data Controls -> Sharing you should find a feature, "Share inputs and outputs with OpenAI"; enabling it gives you 1 million free credits per day for the best models, although OpenAI might go and train on that, thus defeating your goal! haha!
Interesting. I wonder what would happen if you tried this in polar coordinates and displayed the result on a sphere. Are the results reproducible? My guess would be: no generalisation beyond the longitude/latitude training data.
yeah, this is indeed interesting. I'll add it to the list of follow-up experiments. I suspect performance will be degraded, but not lost entirely.
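For what it's worth, the sphere-display half of this is just a projection step; a minimal sketch of the lat/lon-to-Cartesian conversion (the function name and conventions here are mine, not from the post):

```python
import math

def latlon_to_xyz(lat_deg, lon_deg, radius=1.0):
    """Map latitude/longitude in degrees to a point on a sphere
    (x points toward lat=0, lon=0; z points toward the north pole)."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (radius * math.cos(lat) * math.cos(lon),
            radius * math.cos(lat) * math.sin(lon),
            radius * math.sin(lat))
```

Feeding these points into any 3D scatter plot (e.g. matplotlib's `Axes3D`) would wrap the land/water map onto a globe. The harder question (whether the model generalises when *queried* in a different coordinate convention) would still need a separate experiment.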
Super interesting, I didn't expect Llama 3 405b to be so good. Do you think there could be some training set pollution there too?
I get what you mean about seeing the planet from orbit. At least there are plenty more where ours came from, not to mention the deep oceans.
probably something like that, yeah. compared to gpt-4 base, the llama 405b base model isn't even particularly intelligent in my experience
Good point on training set bias. One key question is whether LLMs are actually reasoning about geography or just memorizing coordinate patterns from training data.
One could hypothesize that their "geographic knowledge" simply reflects coordinate density in text (more populated areas → more mentions → better land prediction). If true, plotting all Wikipedia coordinates should correlate with the "LLM world map" here.
Indeed, this is exactly what you see if you plot all GCS co-ordinates extracted from Wikipedia:
https://github.com/Magnushhoie/Wikipedia_Coordinates_Visualization
Note the presence of "predicted land" for commonly mentioned sea co-ordinates. This implies that larger models aren't necessarily "smarter" at geography - they just have bigger memorized datasets.
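A rough sketch of how such a density map can be binned, assuming you already have a list of (lat, lon) pairs extracted from Wikipedia (the grid resolution is arbitrary, and this helper is mine, not from the linked repo):

```python
def density_map(coords, lat_bins=90, lon_bins=180):
    """Bin (lat, lon) pairs into a coarse grid; returns per-cell mention
    counts, which can be rendered as a heatmap and compared to the
    LLM-generated land/water maps."""
    grid = [[0] * lon_bins for _ in range(lat_bins)]
    for lat, lon in coords:
        # Shift lat from [-90, 90] and lon from [-180, 180] into bin indices.
        i = min(int((lat + 90.0) / 180.0 * lat_bins), lat_bins - 1)
        j = min(int((lon + 180.0) / 360.0 * lon_bins), lon_bins - 1)
        grid[i][j] += 1
    return grid
```

Thresholding the counts (cell mentioned at least N times -> "land") would give a directly comparable binary map.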
Reminds me of DeepMind's recently released AlphaEarth. https://deepmind.google/discover/blog/alphaearth-foundations-helps-map-our-planet-in-unprecedented-detail/
I wonder what this experiment would look like if it were repeated with choices other than (land, water). What if it could choose between biomes instead of just land? What if it were asked to provide an elevation - would mountain ranges be visible? What if it were asked to provide the country - how accurate would the borders look?
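These variations would mostly just need different prompt templates; a hypothetical sketch (none of these strings are from the actual experiment):

```python
# Hypothetical prompt templates for the suggested follow-up experiments.
PROMPTS = {
    "land_water": ("Is the point at latitude {lat}, longitude {lon} "
                   "on land or in water? Answer with one word."),
    "biome": ("What biome is found at latitude {lat}, longitude {lon}? "
              "Answer with one word (e.g. desert, tundra, rainforest)."),
    "elevation": ("What is the approximate elevation in meters at latitude "
                  "{lat}, longitude {lon}? Answer with a single number."),
    "country": ("Which country, if any, contains the point at latitude "
                "{lat}, longitude {lon}? Answer with the country name or 'none'."),
}

def build_prompt(kind, lat, lon):
    """Fill one of the templates with a coordinate pair."""
    return PROMPTS[kind].format(lat=lat, lon=lon)
```

The biome and country variants would also need a small answer-normalisation step (mapping free-text replies onto a fixed colour palette) before they could be plotted the same way.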
Do you happen to have an accuracy statistic (say what fraction of all coordinates you tried the model got correct)? This could make for an interesting eval to optimize for. It might be even more interesting if you divided the globe into (say) quadrants and also calculated each model's accuracy over each quadrant.
Alternatively (this would drive up costs a *lot*, but it might increase accuracy too) you could allow each model to discuss the location and/or use a chain of thought before pinning down the answer.
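The per-quadrant accuracy idea is cheap to compute once the predictions exist; a minimal sketch, assuming each trial is recorded as (lat, lon, predicted_is_land, actual_is_land):

```python
def quadrant_accuracy(results):
    """results: list of (lat, lon, predicted_is_land, actual_is_land).
    Returns overall accuracy and accuracy per hemisphere quadrant
    (NE, NW, SE, SW)."""
    buckets = {}
    correct_total = 0
    for lat, lon, pred, actual in results:
        q = ("N" if lat >= 0 else "S") + ("E" if lon >= 0 else "W")
        hits, n = buckets.get(q, (0, 0))
        buckets[q] = (hits + (pred == actual), n + 1)
        correct_total += (pred == actual)
    overall = correct_total / len(results)
    per_quadrant = {q: hits / n for q, (hits, n) in buckets.items()}
    return overall, per_quadrant
```

Finer grids (say 10-degree cells) would fall out of the same bucketing logic, at the cost of noisier per-cell estimates.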
Yeah, I addressed this. It could definitely be done, but I have a bit of a distaste for purely numerical and goodhartable benchmarks.
> By the way, I'm also going to avoid letting things become too benchmark-ey. Yes, I could grade these generated maps, computing the mean squared error relative to some ground truth and ranking the models, but I think it'll soon become apparent how much we'd lose by doing so. Instead, let's just look at them, and see what we can notice.
... hmm, you're right, I think benchmarkification might be a bad idea
DUCKY!! IT'S YOU!! QUACK!!!
Good stuff! Loved it.
https://substackcdn.com/image/fetch/$s_!4shR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe93df604-8a92-48e5-b002-e1ed90d0c1da_1532x940.png really is a nice flag of the world huh.
ACTUALLY THIS REMINDS ME MORE OF TY ROBINSON PAPERS ON HOW TO VIEW THE EARTH FROM SUPER FAR AS AN EXOPLANET
This feels SO much like the https://dspace.mit.edu/handle/1721.1/52333 paper
And Dorian Abbot and Ray Pierrehumbert papers!!!
Where be dragons?
I've long had an intuition that multimodality could benefit the spatial reasoning ability and world models these systems may develop, and this post is an elegant way of looking into that!
I recommend looking at known early-fusion multimodal models (Chameleon) and the rumored early-fusion aka "natively multimodal" Gemini 2.5 Pro next - I expect they should blow anything of similar size out of the water, if my intuition is directionally correct.
What does "hermes-ification" mean?
hermes 405b is a model trained from llama 405b. I mean "the process of turning a llama model into a hermes model"
Why is North America a Mandelbrot set in Qwen2.5 7B?