UniGeoCLIP: Unified Geospatial Contrastive Learning

Guillaume Astruc, Eduard Trulls, Jan Hosang, Loïc Landrieu, Paul-Edouard Sarlin

EarthVision (CVPRW) 2026

Overview

Five Modalities, One Unified Space

Geospatial understanding requires reasoning across fundamentally different kinds of data -- a satellite view from above, a street photo at ground level, a 3D elevation map, a text description of a neighborhood, and a pair of GPS coordinates. These modalities are complementary: each one captures something the others miss.

UniGeoCLIP is the first contrastive framework to jointly align all five modalities into a single unified embedding space, enabling seamless retrieval and reasoning across any combination of inputs, without relying on a privileged "pivot" modality.

Aerial imagery

Street-level imagery

Elevation (DSM)

Text descriptions

GPS coordinates

Figure 1: Unified contrastive learning of geospatial modalities (all-to-all).

Key Innovation

All-to-All, Not Pivot-Centric

Prior multimodal contrastive models such as ImageBind and UniBind align every modality to a central pivot. This creates an asymmetry: cross-modal retrieval between two non-image inputs has to pass through the pivot.

UniGeoCLIP takes a fundamentally different approach: every modality is directly contrasted against every other modality via a multi-way InfoNCE objective summed over all ordered modality pairs. Text, DSM, aerial imagery, street views, and coordinates are all first-class citizens in the same latent space.
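As a rough illustration of what an all-to-all objective can look like, the NumPy sketch below sums a one-directional InfoNCE term over ordered modality pairs. It is not the authors' implementation: the function names, the temperature of 0.07, and the choice to skip self-pairs (a modality contrasted against itself) are assumptions made for this example.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """One-directional InfoNCE: row i of `a` should match row i of `b`."""
    logits = (a @ b.T) / temperature                      # (B, B) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                    # positives on the diagonal

def all_to_all_loss(embeddings, temperature=0.07):
    """Sum InfoNCE over ordered pairs of modalities.

    embeddings: dict mapping modality name -> (B, D) array, where row i of
    every array describes the same location i.
    """
    # L2-normalise so dot products are cosine similarities.
    z = {k: v / np.linalg.norm(v, axis=1, keepdims=True)
         for k, v in embeddings.items()}
    total = 0.0
    for src in z:
        for dst in z:
            if src != dst:  # assumption: self-pairs are skipped in this sketch
                total += info_nce(z[src], z[dst], temperature)
    return total

# Toy check with synthetic embeddings (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
names = ["aerial", "street", "dsm", "text", "gps"]
shared = rng.normal(size=(8, 16))
aligned = {n: shared + 0.01 * rng.normal(size=shared.shape) for n in names}
unaligned = {n: rng.normal(size=shared.shape) for n in names}

aligned_loss = all_to_all_loss(aligned)
random_loss = all_to_all_loss(unaligned)
print(aligned_loss, random_loss)
```

Because no modality is singled out in the double loop, each encoder receives gradient signal from every other modality rather than only from a pivot.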

🔗

All-to-All Alignment

Contrastive loss over all M^2 ordered pairs.

🌍

Modality-Invariant Geography

Same location, any modality -> nearby embeddings.

🔄

Zero-Shot Cross-Modal Retrieval

Query with one modality, retrieve another without task-specific heads.
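Zero-shot cross-modal retrieval in a shared space reduces to a nearest-neighbor search. The sketch below assumes cosine similarity and uses random arrays in place of real encoder outputs; the `retrieve` helper is a hypothetical name for this example.

```python
import numpy as np

def retrieve(query, gallery, k=3):
    """Return indices and cosine similarities of the top-k gallery items.

    query:   (Q, D) embeddings from one modality (e.g. street-level images).
    gallery: (N, D) embeddings from another modality (e.g. aerial tiles).
    """
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = q @ g.T                                   # (Q, N) cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]          # best matches first
    return topk, np.take_along_axis(sims, topk, axis=1)

# Placeholder embeddings: queries are noisy copies of gallery items 5 and 42.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 32))
queries = gallery[[5, 42]] + 0.05 * rng.normal(size=(2, 32))

indices, scores = retrieve(queries, gallery, k=3)
print(indices[:, 0])
```

No task-specific head is involved: the only learned components are the two encoders that produced `query` and `gallery`.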

Training Data

Co-located Five-Modality Data

Training covers metropolitan centers in the continental USA with uniform spatial coverage, using ~800k S2 cells at level L=16. To prevent temporal leakage, all data from 2023 is held out for evaluation.
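To give a sense of scale for level-16 cells: S2 subdivides each of the six cube faces as a quadtree, so level L contains 6 * 4^L cells. The back-of-the-envelope sketch below derives the mean cell area from that count; the Earth surface area constant is an approximation, and individual cells deviate from the mean.

```python
import math

EARTH_SURFACE_KM2 = 510.1e6  # approximate total surface area of the Earth

def s2_num_cells(level):
    """Number of S2 cells at a given level: 6 faces, each split 4-ways per level."""
    return 6 * 4 ** level

def s2_mean_cell_area_m2(level):
    """Mean cell area in square meters (actual cells vary around this value)."""
    return EARTH_SURFACE_KM2 * 1e6 / s2_num_cells(level)

level = 16
area_m2 = s2_mean_cell_area_m2(level)
side_m = math.sqrt(area_m2)  # side length of a square with the same area
print(f"level {level}: ~{area_m2:,.0f} m^2 per cell, ~{side_m:.0f} m across")
```

A level-16 cell is therefore on the order of 140 m across, which is comparable to the 100 m threshold used in the retrieval metric.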

Figure 2: Example location represented through aerial, street-level, DSM, text, and GPS.

Architecture

A Scaled Coordinate Encoder That Actually Understands Space

Raw latitude/longitude coordinates contain rich geographic structure, but only if the encoder can capture dependencies across multiple spatial scales simultaneously. Prior approaches such as GeoCLIP process each Fourier frequency independently through a separate MLP before averaging, which leaves cross-scale interactions unexploited and ties the parameter count to the number of frequencies.

Our Scaled Latitude–Longitude Encoder rethinks this design. Each frequency projection is treated as a token and processed jointly through self-attention blocks, letting every scale communicate with every other. Register tokens act as persistent memory banks, further increasing representational capacity. Crucially, the parameter count is now independent of the number of frequency scales, which makes the encoder both more expressive and more efficient to scale.

Figure 3: Multi-scale coordinate encoder pipeline (Fourier scales -> tokens -> attention -> pooling).

The depth of the encoder matters significantly. At depth 0, the model reduces to plain fixed Fourier features with no learned interactions. As self-attention blocks are added, all retrieval metrics improve consistently, and the gains extend beyond coordinate retrieval to aerial localization and multimodal ensembling.
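The pipeline described above (Fourier scales, tokens, attention, pooling) can be sketched in a few lines of NumPy. This is only a toy forward pass with random weights, not the paper's architecture: the power-of-two frequencies, the shared 4-to-D input projection, single-head attention without an MLP, and mean pooling over frequency tokens only are all assumptions made for illustration.

```python
import numpy as np

def fourier_tokens(lat_deg, lon_deg, num_scales):
    """One 4-dim feature per frequency scale: sin/cos of latitude and longitude."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    freqs = 2.0 ** np.arange(num_scales)               # assumed frequency schedule
    return np.stack([np.sin(freqs * lat), np.cos(freqs * lat),
                     np.sin(freqs * lon), np.cos(freqs * lon)], axis=1)  # (K, 4)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def init_params(dim, depth, num_registers, rng):
    s = 1.0 / np.sqrt(dim)
    return {
        "w_in": rng.normal(0.0, 0.5, (4, dim)),         # shared across all scales
        "registers": rng.normal(0.0, s, (num_registers, dim)),
        "blocks": [{n: rng.normal(0.0, s, (dim, dim)) for n in ("wq", "wk", "wv", "wo")}
                   for _ in range(depth)],
    }

def num_params(params):
    return (params["w_in"].size + params["registers"].size
            + sum(w.size for blk in params["blocks"] for w in blk.values()))

def encode(lat_deg, lon_deg, params, num_scales):
    tokens = fourier_tokens(lat_deg, lon_deg, num_scales) @ params["w_in"]  # (K, D)
    x = np.concatenate([params["registers"], tokens], axis=0)              # (R+K, D)
    dim = x.shape[1]
    for blk in params["blocks"]:                        # depth 0 skips this loop
        h = layer_norm(x)
        attn = softmax((h @ blk["wq"]) @ (h @ blk["wk"]).T / np.sqrt(dim))
        x = x + attn @ (h @ blk["wv"]) @ blk["wo"]      # residual self-attention
    n_reg = params["registers"].shape[0]
    return x[n_reg:].mean(axis=0)                       # pool frequency tokens only

rng = np.random.default_rng(0)
params = init_params(dim=32, depth=2, num_registers=4, rng=rng)
emb_8 = encode(40.78, -73.97, params, num_scales=8)     # a point in Manhattan
emb_16 = encode(40.78, -73.97, params, num_scales=16)   # same weights, more scales
print(emb_8.shape, emb_16.shape, num_params(params))
```

The same `params` serve both calls, which illustrates why treating scales as tokens decouples the parameter count from the number of frequencies: adding scales only lengthens the token sequence.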

Table 5: Ablation on the number of blocks in the location (GPS) encoder.

Results

State-of-the-Art Across Every Geospatial Task

We evaluate UniGeoCLIP on cross-modal street-view retrieval, satellite image encoding (solar panels, land cover), spatial coordinate regression (health/socio-economic/environment indicators), and DSM understanding.

88.2% Acc@100m, SV -> Aerial (full model)

97.0% overall accuracy, m-pv4ger solar panels

57.0 mean R^2, 27 regression tasks

72.0% accuracy, DSM land-cover (MDAS)

Table 1: Cross-modal street-view retrieval (Acc@100m).

Table 2: Aerial/satellite encoder transfer (solar panels + land cover).

Table 3: Location encoder downstream regression (mean R^2).

Table 4: DSM encoder evaluation on MDAS (DSM-only segmentation).
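The Acc@100m figure reported for retrieval can be read as the fraction of queries whose retrieved location lies within 100 m of the ground truth. The sketch below assumes top-1 retrieval and great-circle (haversine) distance on a spherical Earth; the exact evaluation protocol is defined in the paper, and the coordinates here are made up.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius (spherical approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

def acc_at_radius(pred_latlon, true_latlon, radius_m=100.0):
    """Fraction of predictions within `radius_m` of the true location."""
    d = haversine_m(pred_latlon[:, 0], pred_latlon[:, 1],
                    true_latlon[:, 0], true_latlon[:, 1])
    return float(np.mean(d <= radius_m))

# Hypothetical example: four queries at the same true location.
true = np.tile([40.7580, -73.9855], (4, 1))
pred = true + np.array([[0.0005, 0.0],    # ~56 m north: hit
                        [0.0020, 0.0],    # ~222 m north: miss
                        [0.0,    0.0],    # exact match: hit
                        [0.0,    0.0005]])  # ~42 m east: hit
acc = acc_at_radius(pred, true, radius_m=100.0)
print(acc)
```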

Visualizations

Semantically Structured Geographic Representations

The learned embedding space is not only accurate; it is also structured. When PCA is applied to coordinate embeddings over a dense grid in Manhattan, UniGeoCLIP produces spatial patterns that reflect the underlying urban structure (for example, Central Park stands out).

In contrast, SatCLIP and GeoCLIP exhibit smoother, predominantly position-driven gradients. UniGeoCLIP learns what a place means, not just where it is.

Figure 5: PCA projection of coordinate embeddings over Manhattan.
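A visualization of this kind can be produced by projecting grid embeddings onto their top three principal components and mapping them to RGB. The sketch below is one plausible recipe, not necessarily the one used for the figure: the bounding box is approximate, the min-max color scaling is an assumption, and random vectors stand in for the output of a coordinate encoder.

```python
import numpy as np

def pca_rgb(embeddings, n_components=3):
    """Project (N, D) embeddings onto the top principal components, scaled to [0, 1]."""
    x = embeddings - embeddings.mean(axis=0, keepdims=True)   # center the data
    _, _, vt = np.linalg.svd(x, full_matrices=False)          # rows of vt = components
    proj = x @ vt[:n_components].T                            # (N, n_components)
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    return (proj - lo) / (hi - lo + 1e-12)                    # per-channel min-max

# Dense lat/lon grid over an approximate Manhattan bounding box.
H, W = 20, 30
lats = np.linspace(40.70, 40.88, H)
lons = np.linspace(-74.02, -73.91, W)
grid = np.stack(np.meshgrid(lats, lons, indexing="ij"), axis=-1).reshape(-1, 2)

# Placeholder: replace with real coordinate-encoder outputs for each grid point.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(grid.shape[0], 64))

rgb = pca_rgb(embeddings).reshape(H, W, 3)   # image-like array, one color per cell
print(rgb.shape)
```

With real embeddings, smooth color gradients indicate position-driven features, while sharp regions (such as a park outline) indicate that the encoder has picked up semantic structure.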

The t-SNE visualization (Figure 6) provides a qualitative check of whether embeddings from the five modalities co-localize in the shared space when they describe the same geographic location. Each cluster of points corresponds to a single location.

Figure 6: t-SNE visualization showing modality embeddings cluster by location.

Generalization

Trained in the USA. Works in Amsterdam.

UniGeoCLIP is trained exclusively on US metropolitan areas, yet achieves 41.2% Acc@100m on an out-of-distribution Amsterdam evaluation set under a substantial domain shift, using an image-to-image retrieval protocol. The paper reports consistent performance trends: adding modalities helps, and multimodal ensembling remains beneficial. Note that for the location/coordinate encoder regression benchmark, the paper evaluates only on locations that overlap with the training area.

Citation

Cite Our Work

@inproceedings{astruc2026unigeoclip,
  title     = {UniGeoCLIP: Unified Geospatial Contrastive Learning},
  author    = {Astruc, Guillaume and Trulls, Eduard and Hosang, Jan
               and Landrieu, Lo{\"i}c and Sarlin, Paul-Edouard},
  booktitle = {EarthVision Workshop, CVPR},
  year      = {2026}
}