UniverSat: Resolution- and Modality-Agnostic Transformers for Earth Observation

Yohann Perron*, Guillaume Astruc*, Nicolas Gonthier, Clement Mallet, Loic Landrieu

arXiv preprint, 2026


UniverSat

A resolution- and modality-agnostic transformer backbone for Earth Observation: one set of weights for any sensor, any spatial/spectral/temporal resolution and any scale.


Overview

One Model. Any Sensor. Any Resolution.

ViTs assume a fixed input format. Earth Observation doesn't play by that rule:

Modalities — optical, radar, hyperspectral, elevation
Spatial resolution — centimetres to hundreds of metres
Image size — tiny patches to multi-kilometre tiles, no two images share the same shape
Temporal depth — single snapshot up to 150+ revisits
Spectral width — from one band to 396 channels

UniverSat handles all of this with a single set of weights — no resampling, no channel selection, no per-sensor encoder. It is a ViT-style backbone built around a Universal Patch Encoder (UPE) that maps patches of arbitrary spatial, spectral, and temporal shape into a shared embedding space. One model is trained jointly on 13 sensors from 7 datasets, generalises to unseen sensors within this gamut without input resampling, and stays competitive on standard benchmarks.

Figure 1: A single UniverSat trained jointly on 13 sensors from 7 datasets, spanning ~3 orders of magnitude in spatial resolution, channel count, and revisit frequency.

Quick Start

Use UniverSat in three lines.

UniverSat is designed to be a drop-in backbone for any EO pipeline. Load the pretrained weights in one line via torch.hub — no clone, no config — feed a dict of whichever sensors you have, and read out dense embeddings at any output resolution. No modality-specific preprocessing, no channel filtering, no input resampling.

1 Load a pretrained model

The model is published on the Hugging Face Hub with PyTorchModelHubMixin. The simplest path is Torch Hub — no local checkout needed:

import torch

model = torch.hub.load('gastruc/UniverSat', 'from_pretrained').eval()

Or, equivalently, with the repo on your path and huggingface_hub installed:

from hubconf import UniverSat

model = UniverSat.from_pretrained('g-astruc/UniverSat').eval()

Loading requires huggingface_hub (and safetensors). The released checkpoint is a Base UniverSat (~201 M params).

2 Encode any sensor combination

# Snapshot modalities: (B, C, H, W). Time series: (B, T, C, H, W) + <mod>_dates.
data = {
    'spot':      torch.randn(2, 3, 360, 360),      # 1 m VHR, RGB snapshot
    's2':        torch.randn(2, 20, 10, 36, 36),  # 10 m Sentinel-2 time series
    's2_dates':  torch.randint(0, 365, (2, 20)),
    's1':        torch.randn(2, 12, 3, 36, 36),   # 10 m Sentinel-1 (SAR) time series
    's1_dates':  torch.randint(0, 365, (2, 12)),
    'dsm':       torch.randn(2, 1, 12, 12),        # 30 m elevation snapshot
}

features, _ = model.encode(data, patch_size=40, output_grid=36)
# -> (2, 1296, 768): a 36x36 dense feature grid (register tokens stripped for you)

model.encode(...) looks up per-modality wavelengths, physical resolution, and sub-patch factors automatically from a built-in registry — s2, s1, spot, aerial, naip, l7/l8, modis, alos, enmap, dsm, neon, hls, and more.
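As a sanity check on the example above, the four inputs appear to cover the same ground footprint once pixel counts are multiplied by the ground sampling distances quoted in the comments. The sketch below does that arithmetic in plain Python. Reading patch_size as a length in metres is our assumption (it is what makes a 360 m tile yield a 9×9 patch grid), and tile_extent_m is a hypothetical helper, not part of the UniverSat API.

```python
# Hypothetical helper, not part of UniverSat: pixels along one side times
# ground sampling distance (metres per pixel) gives the tile extent in metres.
def tile_extent_m(side_pixels: int, gsd_m: float) -> float:
    return side_pixels * gsd_m

# (side length in pixels, GSD in metres), read off the comments in the example.
inputs = {
    "spot": (360, 1.0),
    "s2": (36, 10.0),
    "s1": (36, 10.0),
    "dsm": (12, 30.0),
}

extents = {name: tile_extent_m(px, gsd) for name, (px, gsd) in inputs.items()}

# Assumption: patch_size=40 is expressed in metres, so the number of patches
# per side is the tile extent divided by 40.
patch_size_m = 40
patches_per_side = {name: ext / patch_size_m for name, ext in extents.items()}

for name in inputs:
    print(f"{name}: {extents[name]:.0f} m tile, "
          f"{patches_per_side[name]:.0f} patches per side")
```

Under that reading, every modality lands on the same patch grid even though the pixel grids differ by a factor of 30.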

3 Control input and output resolutions

With UniverSat, the output resolution is decoupled from the input patch size.

First, choose the patch_size used to partition the input data. Smaller patches better capture local, fine-grained processes, while larger patches are more efficient.

Then, choose the output_grid, i.e. the number of output tokens along each side. The model returns output_grid × output_grid tokens of dimension D, shaped (B, output_grid², D) as in the example above, and each token covers tile_extent / output_grid on the ground. Same model, same inputs; only the requested grid changes.

# Same model, same inputs — only the requested output grid changes.
patch, _   = model.encode(data, patch_size=40, output_grid=9)      #   9x9   patch-level
dense, _   = model.encode(data, patch_size=40, output_grid=36)    #  36x36  dense
highres, _ = model.encode(data, patch_size=40, output_grid=180)  # 180x180 high-res

Under the hood: the patch-level transformer runs over a coarse spatial grid, then a sub-patch skip cross-attention recovers fine spatial detail at the requested grid — one bilinear resample plus one CA pass.
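To get a feel for what each output_grid setting costs, the sketch below tabulates the token count, the ground footprint per token, and the size of the returned tensor. It assumes the 360 m tile of the example inputs and the 768-wide embedding from the shape comment shown earlier. output_summary is a hypothetical helper written for this illustration, and it only accounts for the output tensor, not for intermediate activations.

```python
# Assumptions: the tile is 360 m on a side (as in the example inputs) and the
# embedding width is 768 (as in the shape comment shown earlier).
TILE_EXTENT_M = 360
EMBED_DIM = 768

def output_summary(output_grid: int, batch_size: int = 2) -> dict:
    """Token count, per-token ground footprint, and float32 size of the output."""
    tokens = output_grid * output_grid
    return {
        "shape": (batch_size, tokens, EMBED_DIM),
        "footprint_m": TILE_EXTENT_M / output_grid,
        "float32_mib": batch_size * tokens * EMBED_DIM * 4 / 2**20,
    }

for grid in (9, 36, 180):
    print(grid, output_summary(grid))
```

Because the token count grows with the square of output_grid, going from 36 to 180 multiplies the output size by 25.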

See our demo notebook.

⚠️ Note: small input patches or very fine output grids can significantly increase memory usage.

Unseen sensors? → Just pass the sensor's wavelengths (optical / hyperspectral), polarization (SAR), or revisit (time series) as wavelengths={...}, input_res={...}, subpatches={...} overrides to encode(...). The Universal Patch Encoder uses these as positional encodings — no retraining needed. See the generalization experiments in the paper for results on HLS and EnMAP, which the model never saw during pretraining.

🧊

Frozen-backbone friendly

Strong results with linear probes and just 9K probe parameters — perfect for low-label regimes.

🪶

Lightweight integrations

The forward pass returns standard dense features — plug them into any segmentation / classification head you already use.

🧰

Reference recipes

The GitHub repo ships with fine-tuning, kNN, and linear-probe scripts for GeoBench, PangaeaBench, and SpectralEarth.

Key Properties

One model for all your EO needs.

Integrated into a transformer that operates over spatialized tokens, the Universal Patch Encoder gives UniverSat three key advantages over rigid ViT-style EO foundation models:

🌐

Sensor-agnostic

A single set of weights processes any modality combination and arbitrary resolutions — no resampling, no channel filtering.

🔍

Resolution-flexible

The output spatial resolution is specified at inference time, decoupled from the input patch size.

🧬

Granular

A sub-patch skip connection preserves fine spatial details well beyond the patch-level embedding.

Table 1: Flexible multimodal EO foundation models. UniverSat supports the broadest modality mix, handles unseen spatial / temporal / spectral configurations, and offers flexible output resolution — all with a single set of weights.

Architecture

The Universal Patch Encoder

Different sensors yield patches of fundamentally different shapes: C channels × T timestamps × H × W pixels. Naively projecting every shape with an MLP is impractical; applying full self-attention over all atomic tokens is prohibitive.

The UPE instead lifts each scalar into a learnable embedding using Fourier features, and progressively collapses the spectral, temporal, spatial-within-patch, and sub-patch axes one at a time using linear-complexity Axial Cross-Attention. Each axis receives dedicated positional encodings (wavelength, polarization, time-of-year, etc.), so the encoder is intrinsically aware of what each input is — not just where it sits.
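The two ingredients named above can be illustrated with a toy example in plain Python. This is a minimal sketch under our own assumptions, not the authors' UPE: the frequency schedule, the fixed query vector, and the function names are all hypothetical, and the real encoder presumably uses learned projections. It shows only how a Fourier embedding of each scalar plus a single-query cross-attention can collapse an axis of arbitrary length into a fixed-size vector at linear cost.

```python
import math

def fourier_features(x: float, num_freqs: int = 4) -> list:
    """Encode a scalar as [sin(2^k*pi*x), cos(2^k*pi*x)] for k = 0..num_freqs-1."""
    feats = []
    for k in range(num_freqs):
        angle = (2.0 ** k) * math.pi * x
        feats += [math.sin(angle), math.cos(angle)]
    return feats

def cross_attention_pool(query, keys, values):
    """One query attends over n tokens, so the cost grows linearly with n."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / norm for i in range(dim)]

def collapse_spectral_axis(reflectances, wavelengths_um, query):
    """Pool any number of bands into one fixed-size vector."""
    keys = [fourier_features(w) for w in wavelengths_um]   # what each band is
    values = [fourier_features(r) for r in reflectances]   # what it measured
    return cross_attention_pool(query, keys, values)

query = [0.1] * 8  # stand-in for a learned query vector (hypothetical values)
rgb = collapse_spectral_axis([0.2, 0.3, 0.1], [0.66, 0.56, 0.49], query)
ten_band = collapse_spectral_axis([0.1 * i for i in range(10)],
                                  [0.4 + 0.2 * i for i in range(10)], query)
print(len(rgb), len(ten_band))  # both vectors have the same length
```

The same pooling step could be repeated over the temporal and spatial axes in turn, which is the sense in which the attention is axial in this sketch.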

Figure 5: A tile is observed by multiple sensors of arbitrary modality and resolution. The shared UPE patchifies and embeds inputs; tokens are fused via Axial Cross-Attention (ACA), processed by self-attention blocks, resampled to the target resolution, and attend to high-resolution sub-patch embeddings via cross-attention (CA) to recover fine spatial details.

Why this matters → Where prior EO foundation models retrain or adapt encoders for each new sensor configuration, UniverSat treats resolution, channel count and time as first-class metadata of the input — not as a fixed property of the architecture.

Training

Self-supervised on 13 sensors at once

UniverSat is trained with a self-supervised objective that combines (i) cross-modal contrastive learning at the patch level and (ii) latent multimodal masked modeling (LM₃), an extension of latent masked image modeling to heterogeneous, multimodal, multitemporal EO data.

Aggressive input dropping — modalities, patches, channels, and timestamps — removes ≈90% of input atoms at training time, drastically improving robustness across scales and sensor configurations.
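A rough illustration of how independent drops along several axes compound to roughly 90%: the keep ratios below (50% of timestamps, 50% of channels, 40% of patches) are hypothetical values picked for this sketch, not the ratios used in the paper, and modality dropping is left out for brevity. drop_atoms is likewise a hypothetical helper.

```python
import random

def drop_atoms(num_timestamps, num_channels, num_patches,
               keep_t=0.5, keep_c=0.5, keep_p=0.4, seed=0):
    """Sample which timestamps, channels, and patches survive, and return them
    together with the fraction of atoms that remains."""
    rng = random.Random(seed)

    def keep(n, ratio):
        # Always keep at least one index so the input is never empty.
        return sorted(rng.sample(range(n), max(1, round(n * ratio))))

    kept_t = keep(num_timestamps, keep_t)
    kept_c = keep(num_channels, keep_c)
    kept_p = keep(num_patches, keep_p)
    # Every patch holds the same number of pixels, so the kept fraction of
    # atoms equals the product of the kept fractions along each axis.
    total = num_timestamps * num_channels * num_patches
    kept = len(kept_t) * len(kept_c) * len(kept_p)
    return kept_t, kept_c, kept_p, kept / total

# A Sentinel-2-like series: 20 dates, 10 bands, a 9x9 patch grid.
kept_t, kept_c, kept_p, kept_fraction = drop_atoms(20, 10, 81)
print(f"dropped {1 - kept_fraction:.1%} of atoms")
```

Moderate drop rates on each axis are enough to reach the 90% regime because the kept fractions multiply.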

Figure 6: We feed UniverSat a heavily masked version of the input patches, apply a cross-modal contrastive loss on the UPE outputs, and predict random-projection targets of the masked patches via a batch-wise InfoNCE loss per modality.
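For readers unfamiliar with InfoNCE, the sketch below gives a minimal batch-wise version in plain Python: each prediction should be closer to its own target than to any other target in the batch. The temperature value and the toy one-hot targets are assumptions made for illustration. The paper's targets are random projections of the masked patches, which this sketch does not reproduce.

```python
import math

def info_nce(predictions, targets, temperature=0.1):
    """Batch-wise InfoNCE: prediction i should match target i and no other."""
    def normalise(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    preds = [normalise(p) for p in predictions]
    targs = [normalise(t) for t in targets]
    loss = 0.0
    for i, p in enumerate(preds):
        # Cosine similarity to every target in the batch, scaled by temperature.
        logits = [sum(a * b for a, b in zip(p, t)) / temperature for t in targs]
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        loss += log_denominator - logits[i]  # -log softmax at the positive pair
    return loss / len(preds)

targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = info_nce(targets, targets)                       # perfect predictions
shuffled = info_nce(targets[1:] + targets[:1], targets)    # every pair mismatched
print(aligned < shuffled)
```

The loss is near zero when predictions line up with their targets and grows when they match another sample in the batch instead.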

13 sensors · 7 datasets · 0.1–300 m ground sampling distance · 1–150 timestamps · 1–396 spectral channels

Figure 7: Training datasets — distribution of atoms (one pixel × one band × one timestamp) across modalities and datasets, and the 13 supported sensors with their typical spatial resolution (S, m), temporal depth (T, images per year), number of channels (C), and total atom count.

Results

Competitive with, and broader than, the state of the art

Despite its flexibility and ability to ingest unseen sensor configurations, UniverSat remains highly competitive on standard benchmarks. We evaluate on 16 datasets spanning GeoBench, PangaeaBench, and the hyperspectral SpectralEarth benchmark using strict probing protocols (kNN and linear probing).
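As a reminder of what a kNN probe involves, here is a minimal sketch in plain Python: the backbone stays frozen, and a query embedding is labelled by majority vote among its nearest training embeddings. The 2-D features, the labels, and the choice of k=3 are toy assumptions, and the evaluation scripts in the repository may differ in distance metric and voting scheme.

```python
from collections import Counter
import math

def knn_predict(train_feats, train_labels, query, k=3):
    """Label a query embedding by majority vote among its k nearest neighbours."""
    dists = sorted(
        (math.dist(feat, query), label)
        for feat, label in zip(train_feats, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D "embeddings" standing in for frozen backbone features.
train_feats = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1),
               (1.0, 0.9), (0.9, 1.0), (1.1, 1.0)]
train_labels = ["crop", "crop", "crop", "forest", "forest", "forest"]

print(knn_predict(train_feats, train_labels, (0.05, 0.05)))
print(knn_predict(train_feats, train_labels, (0.95, 0.95)))
```

No parameters are trained at all in this protocol, so the score reflects the quality of the embeddings alone.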

Table 2: Linear-probe / kNN classification and segmentation across GeoBench and PangaeaBench tasks (brick-kiln, pv4ger, forestnet, PASTIS-R, Sen1Floods11, chesapeake, NeonTree). UniverSat-B is competitive with or exceeds specialist baselines — despite being significantly more general than competing modality-specific approaches.

PangaeaBench — linear probes match heavyweight decoders

On PangaeaBench, competing models attach 33M–47M-parameter UPerNet decoders. UniverSat uses a single 9K-parameter linear probe on top of its dense embeddings (3700–5000× fewer supervised parameters) and still reaches or exceeds the state of the art on PASTIS-R and AI4Farms, including configurations the model never saw at pretraining (mono-temporal Sentinel inputs, the synthetic HLS sensor).

Table 3: Probing with decoders on PangaeaBench. UniverSat uses a 9K-parameter linear probe versus 33M–47M-parameter UPerNet decoders, yet reaches state-of-the-art results on PASTIS-R and AI4Farms.
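The parameter gap quoted above can be checked with a few lines of arithmetic. The 768-wide embedding comes from the shape comment in the Quick Start. The class count of 12 is a hypothetical value, chosen only because it lands near 9K, and is not taken from any benchmark.

```python
def linear_probe_params(embed_dim: int, num_classes: int) -> int:
    """Weights plus biases of a single linear layer applied to each token."""
    return embed_dim * num_classes + num_classes

probe_params = 9_000                         # probe size quoted in the text
decoder_params = (33_000_000, 47_000_000)    # decoder range quoted in the text
ratios = [d / probe_params for d in decoder_params]
print([round(r) for r in ratios])

# Hypothetical: 768-wide embeddings and 12 output classes land near 9K.
print(linear_probe_params(768, 12))
```

The ratios come out at a few thousand, in line with the range quoted in the text.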

Hyperspectral — competitive without ever seeing EnMAP

On the SpectralEarth benchmark (EnMAP, up to 500 bands), UniverSat outperforms DOFA-L — a foundation model trained on EnMAP — across every task, and approaches SpectralEarth-L, a model specifically designed for EnMAP and trained with self-supervision on the evaluation data itself. UniverSat was never trained on EnMAP.

Table 4: Hyperspectral evaluation on the SpectralEarth / EnMAP benchmark — a sensor unseen at pretraining. UniverSat-B outperforms DOFA-L across every task and approaches the EnMAP-specialised SpectralEarth-L.

Embedding Quality

Sharper, modality-agnostic spatial features

Thanks to its controllable output resolution and sub-patch skip connection, UniverSat produces higher-resolution embeddings that preserve fine spatial structures — field boundaries, roads, parcel edges — compared to fixed-resolution backbones. PCA projections on a PASTIS test tile reveal markedly less positional collapse than other multimodal foundation models.

Figure 8: Embedding visualization. PCA projections of features extracted from a multimodal PASTIS test tile (1.6 km²). UniverSat preserves field boundaries and fine spatial structures at higher granularity than competing multimodal models.

Contributions

What this paper delivers

A unified ViT-like architecture for EO that processes heterogeneous sensors without modality-specific projectors or preprocessing.
A multimodal self-supervised training framework combining cross-modal contrastive learning and latent multimodal masked modeling (LM₃).
Competitive performance across 16 datasets — from VHR RGB to radar time series to 500-band hyperspectral imagery.
Demonstrated generalisation to unseen sensors and modality combinations without input resampling.

Citation

Cite Our Work

@article{perron2026universat,
  title   = {UniverSat: Resolution- and Modality-Agnostic Transformers for Earth Observation},
  author  = {Perron, Yohann and Astruc, Guillaume and Gonthier, Nicolas
             and Mallet, Clement and Landrieu, Loic},
  journal = {arXiv preprint},
  year    = {2026}
}