Street View Europe — Geocell Classification Heads

Trained linear classification heads for European Street View geolocation, sitting on top of a frozen geolocal/StreetCLIP vision encoder (1024-dim pooled features).

Each head predicts one of N discrete geocells (a partition of Europe into bounded regions). The cell centroids ship alongside each head as geocell_info.json, so a predicted cell index can be converted to a (lat, lon) estimate by looking up that cell's centroid in the file.

These checkpoints power the live demo at lebfla11/streetview-eu-geocell-demo.

Quick start

import json, torch, torch.nn as nn
from huggingface_hub import hf_hub_download
from transformers import CLIPImageProcessor, CLIPModel
from PIL import Image

REPO = "lebfla11/streetview-eu-geocell-heads"

# 1. Pull a head + its partition (★ headline: v13 ce_haversine λ=0.05)
head_path = hf_hub_download(REPO, "ablation_v13_weighted/lam_0.05/best.pt")
info_path = hf_hub_download(REPO,
    "data_collection/outputs/geocells_h3_res4_eu/geocell_info.json")

# Map cell index -> (lat, lon) centroid; the context manager closes the file handle
with open(info_path) as f:
    centroids = {int(k): (v["centroid_lat"], v["centroid_lon"])
                 for k, v in json.load(f).items()}
n_cells = len(centroids)

# 2. Build the head (matches HeadOnlyModel(feat_dim=1024, fusion='feature'))
class Head(nn.Module):
    def __init__(self, n_cells, dropout=0.3):
        super().__init__()
        self.head = nn.Sequential(
            nn.LayerNorm(1024), nn.Dropout(dropout), nn.Linear(1024, n_cells))
    def forward(self, feats):
        # feats: (B, V, 1024) for V views, mean-pool then head
        if feats.dim() == 3:
            feats = feats.mean(dim=1)
        return self.head(feats)

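head = Head(n_cells).eval()
# weights_only=False unpickles arbitrary objects; only use it for checkpoints you trust
ck = torch.load(head_path, map_location="cpu", weights_only=False)
state = ck["model"] if isinstance(ck, dict) and "model" in ck else ck
# strict=False tolerates extra keys, so check that no head weight stayed uninitialised
result = head.load_state_dict(state, strict=False)
assert not result.missing_keys, f"missing head weights: {result.missing_keys}"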

# 3. Encode an image with frozen StreetCLIP
clip = CLIPModel.from_pretrained("geolocal/StreetCLIP").vision_model.eval()
proc = CLIPImageProcessor.from_pretrained("geolocal/StreetCLIP")
img  = Image.open("street_view.jpg").convert("RGB")
px   = proc(images=img, return_tensors="pt")["pixel_values"]
with torch.no_grad():
    feat = clip(pixel_values=px).pooler_output.float()  # (1, 1024)
    logits = head(feat)
    probs = logits.softmax(-1)

top5 = probs[0].topk(5)
for prob, cell in zip(top5.values, top5.indices):
    lat, lon = centroids[int(cell)]
    print(f"cell {int(cell):4d}  {prob.item()*100:5.1f}%  ({lat:.3f}, {lon:.3f})")

For a head trained with fusion='logit' (the v12 logit_lr5e-3 family — not shipped here, see the demo Space for context), apply the linear head per view and average the logits instead of the features.
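
The difference between the two fusion modes can be sketched as follows. This is a minimal illustration with random tensors standing in for StreetCLIP features and toy sizes; the layer layout mirrors the Quick start head, and the function names `fuse_features` and `fuse_logits` are hypothetical, not taken from the training code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_cells, n_views = 8, 4   # toy sizes, chosen only for illustration

# Same layer layout as the Quick start head (LayerNorm -> Dropout -> Linear)
head = nn.Sequential(
    nn.LayerNorm(1024), nn.Dropout(0.3), nn.Linear(1024, n_cells)).eval()

def fuse_features(feats: torch.Tensor) -> torch.Tensor:
    """fusion='feature': mean-pool the views, then apply the head once."""
    return head(feats.mean(dim=1))        # (B, V, 1024) -> (B, n_cells)

def fuse_logits(feats: torch.Tensor) -> torch.Tensor:
    """fusion='logit': apply the head to every view, then average the logits."""
    return head(feats).mean(dim=1)        # (B, V, n_cells) -> (B, n_cells)

feats = torch.randn(2, n_views, 1024)     # stand-in for per-view features
with torch.no_grad():
    by_feature = fuse_features(feats)
    by_logit = fuse_logits(feats)

print(by_feature.shape, by_logit.shape)
```

Because LayerNorm is non-linear, the two modes generally produce different logits for the same views, which is why a head should be run with the fusion mode it was trained with.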

Model details


Architecture	Linear classification head: `LayerNorm(1024) → Dropout → Linear(1024, N_cells)` on top of frozen StreetCLIP vision encoder
Backbone	`geolocal/StreetCLIP` (frozen, 1024-d pooled output, image size 336×336)
Multi-view fusion	Mean-pool of per-view features before the head (`fusion='feature'`)
Training framework	PyTorch 2.6, AMP, AdamW, cosine LR schedule, label smoothing 0.1, ~80 epochs (v12 family) with patience 15. The headline v13 head is a 10-epoch warm-started fine-tune from v12 seed_44.
Loss	v12 family: class-weighted cross-entropy (inverse cell frequency). v13 (headline): weighted CE + λ · mean haversine on the posterior-expected lat/lon (probability-weighted spherical mean over all cell centroids). λ = 0.05 for the headline checkpoint.
Selection metric	`balscore` = ∛(within_25 × within_200 × within_750), the geometric mean of three distance-band accuracies
Trainable params	1025 × N_cells + 2048 per head (Linear weights and bias, plus LayerNorm scale and shift); under 6 M for the largest partition
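
As a sanity check on the parameter count, the head size follows directly from the layer shapes. The sketch below assumes PyTorch's default `nn.Linear` bias (as in the Quick start head), which contributes N_cells bias terms in addition to the weight matrix; the helper name `head_param_count` is hypothetical.

```python
def head_param_count(n_cells: int, feat_dim: int = 1024) -> int:
    """Trainable parameters of LayerNorm(feat_dim) -> Dropout -> Linear(feat_dim, n_cells)."""
    layer_norm = 2 * feat_dim              # scale + shift
    linear = feat_dim * n_cells + n_cells  # weight matrix + bias (assumes bias=True)
    return layer_norm + linear             # Dropout has no parameters

# Cell counts taken from the "Available heads" table
for name, n_cells in [("h3_res4_eu", 1898), ("kdtree19_eu", 5444), ("k500_eu", 487)]:
    print(f"{name:12s} {n_cells:5d} cells -> {head_param_count(n_cells):,} params")
```

For the largest partition (5 444 cells) this gives 5 582 148 parameters, consistent with the "under 6 M" figure.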

Available heads

All checkpoints were trained on the leakage-free train/val/test split (no shared lat,lon key between splits). Earlier runs under runs/clusters/* that used a leaky split are not included here.

Path in repo	Partition	# cells	Top-1	balscore
★ `ablation_v13_weighted/lam_0.05/best.pt`	H3 res=4 EU	1 898	29.7 %	63.1
`ablation_v13_weighted/lam_0.01/best.pt`	H3 res=4 EU	1 898	29.3 %	62.8
`ablation_v12/seed_44/best.pt`	H3 res=4 EU	1 898	28.8 %	62.1
`sweep_v5/streetclip/h3_res4_eu/best.pt`	H3 res=4 EU	1 898	28.3 %	61.5
`sweep_v5/streetclip/k2000_eu/best.pt`	K-Means k=2000 EU	1 792	27.5 %	60.0
`sweep_v5/streetclip/k4000_eu/best.pt`	K-Means k=4000 EU	3 180	22.9 %	60.0
`sweep_v5/streetclip/kdtree19_eu/best.pt`	K-d tree t=19 EU	5 444	15.7 %	58.6
`sweep_v5/streetclip/kdtree39_eu/best.pt`	K-d tree t=39 EU	2 059	21.8 %	58.4
`sweep_v5/streetclip/k1000_eu/best.pt`	K-Means k=1000 EU	957	36.2 %	57.9
`sweep_v5/streetclip/nuts3_eu/best.pt`	NUTS-3 EU	1 194	37.6 %	56.8
`sweep_v5/streetclip/kdtree78_eu/best.pt`	K-d tree t=78 EU	1 029	27.5 %	55.3
`sweep_v5/streetclip/k500_eu/best.pt`	K-Means k=500 EU	487	43.4 %	51.7
`sweep_v5/streetclip/kdtree155_eu/best.pt`	K-d tree t=155 EU	515	36.1 %	50.7

For each head, the matching partition lives at data_collection/outputs/geocells_<name>/geocell_info.json (centroids only — the training-time geocell_map.json and splits.json are not shipped).
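
One way to resolve the partition file for a given head is sketched below. It assumes that `<name>` equals the partition directory in the head's repo path for the `sweep_v5` heads (as with `h3_res4_eu` in the Quick start), and that the `ablation_v12` and `ablation_v13_weighted` heads use the H3 res=4 EU partition, as listed in the table. The helper `partition_path` is hypothetical and not part of the repo.

```python
from pathlib import PurePosixPath

def partition_path(head_path: str) -> str:
    """Map a head checkpoint path to its geocell_info.json path (assumed naming)."""
    parts = PurePosixPath(head_path).parts
    if parts[0] == "sweep_v5":
        name = parts[2]            # sweep_v5/<backbone>/<name>/best.pt
    elif parts[0].startswith("ablation_v1"):
        name = "h3_res4_eu"        # ablation heads are H3 res=4 EU in the table
    else:
        raise ValueError(f"unknown head family: {head_path}")
    return f"data_collection/outputs/geocells_{name}/geocell_info.json"

print(partition_path("sweep_v5/streetclip/k2000_eu/best.pt"))
print(partition_path("ablation_v13_weighted/lam_0.05/best.pt"))
```

The returned string can be passed to `hf_hub_download(REPO, ...)` in the same way as the hard-coded path in the Quick start.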

Training data

Source: Internal Google Street View collection
Size: 79 144 European locations × ~4 headings = 316 498 images
Coverage: 41 European countries
Split: 80 / 10 / 10 train/val/test, stratified to be leakage-free (no shared lat,lon key between splits)
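
The leakage constraint can be illustrated with a location-grouped split: all headings of one location share a (lat, lon) key and must land in the same split. This is a minimal sketch, not the thesis pipeline. The record layout, the hash-based assignment, and the function names are assumptions for illustration, and the sketch does not attempt the stratification mentioned above.

```python
import hashlib
from collections import defaultdict

def split_of(lat: float, lon: float) -> str:
    """Assign a location to train/val/test (~80/10/10) from a hash of its key."""
    key = f"{lat:.6f},{lon:.6f}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 10
    return "train" if bucket < 8 else ("val" if bucket == 8 else "test")

def assert_leakage_free(records):
    """Raise if any (lat, lon) key appears in more than one split."""
    seen = defaultdict(set)
    for lat, lon, _heading, split in records:
        seen[(lat, lon)].add(split)
    leaks = {k: v for k, v in seen.items() if len(v) > 1}
    if leaks:
        raise ValueError(f"{len(leaks)} locations shared between splits")

# Toy data: 50 locations x 4 headings each
locations = [(47.0 + i * 0.01, 15.4 + i * 0.02) for i in range(50)]
records = [(lat, lon, h, split_of(lat, lon))
           for lat, lon in locations for h in (0, 90, 180, 270)]
assert_leakage_free(records)
print(len(records), "images,", len({r[:2] for r in records}), "locations")
```

Assigning the split per location rather than per image is what keeps near-duplicate views of the same spot from appearing on both sides of the train/test boundary.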

Evaluation (test split, headline head: `ablation_v13_weighted/lam_0.05`)

Metric	v13 λ=0.05 (★)	v12 seed_44 (prior best)
Top-1 cell accuracy	29.7 %	28.8 %
Top-5 cell accuracy	58.6 %	57.8 %
Median haversine error	51.9 km	54.6 km
Mean haversine error	122.6 km	130.3 km
within 1 km	0.1 %	0.1 %
within 25 km	31.4 %	30.5 %
within 200 km	81.0 %	80.0 %
within 750 km	98.6 %	98.1 %
within 2500 km	100.0 %	100.0 %
balscore	63.1	62.1
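
The distance-band rows and `balscore` can be recomputed from per-image errors. The sketch below assumes `balscore` is taken over the within-25/200/750 km accuracies expressed in percent, and that "within" means an error of at most the threshold; the function names are hypothetical.

```python
def within(errors_km, threshold_km):
    """Percentage of predictions whose haversine error is at most threshold_km."""
    return 100.0 * sum(e <= threshold_km for e in errors_km) / len(errors_km)

def balscore(w25, w200, w750):
    """Geometric mean of the three distance-band accuracies (in percent)."""
    return (w25 * w200 * w750) ** (1.0 / 3.0)

# Band accuracies copied from the evaluation table
print(round(balscore(31.4, 81.0, 98.6), 1))  # v13 lambda=0.05 -> 63.1
print(round(balscore(30.5, 80.0, 98.1), 1))  # v12 seed_44     -> 62.1

# Toy per-image errors in km, only to show the full path from errors to score
errors = [3.0, 18.0, 40.0, 150.0, 900.0]
print(round(balscore(within(errors, 25), within(errors, 200), within(errors, 750)), 1))
```

Because the score is a geometric mean, a head that is weak in any one band (for example fine-grained 25 km accuracy) is penalised more than it would be under an arithmetic mean.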

The v13 head is a 10-epoch warm-started fine-tune from ablation_v12/seed_44/best.pt with the new ce_haversine_weighted loss, L = CE + λ · mean_haversine(posterior-expected lat/lon, true GPS), where CE is the class-weighted cross-entropy and λ = 0.05. The fine-tune also improves results on out-of-distribution test sets: im2gps Europe top-1 rises from 14.4 to 20.0, and the OSM-Europe-1k 4-view mean error drops from 716 km to 629 km.
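
The haversine term of this loss can be sketched in plain Python for a single sample. This is an illustration under assumptions, not the training code: it interprets the probability-weighted spherical mean as the direction of the probability-weighted sum of centroid unit vectors, uses an Earth radius of 6371 km, omits the class weights on the cross-entropy, and all function names are hypothetical. The scale at which the distance enters the loss is not stated here, so the sketch works in kilometres and leaves any rescaling open.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption for this sketch)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi, dlmb = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def expected_latlon(probs, centroids):
    """Probability-weighted spherical mean of the cell centroids."""
    x = y = z = 0.0
    for p, (lat, lon) in zip(probs, centroids):
        phi, lmb = math.radians(lat), math.radians(lon)
        x += p * math.cos(phi) * math.cos(lmb)
        y += p * math.cos(phi) * math.sin(lmb)
        z += p * math.sin(phi)
    # atan2 is scale-invariant, so the summed vector needs no explicit normalisation
    lat = math.degrees(math.atan2(z, math.hypot(x, y)))
    lon = math.degrees(math.atan2(y, x))
    return lat, lon

def ce_haversine(probs, target_idx, true_latlon, centroids, lam=0.05):
    """Unweighted CE + lam * haversine(expected lat/lon, true GPS) for one sample."""
    ce = -math.log(probs[target_idx])
    lat, lon = expected_latlon(probs, centroids)
    return ce + lam * haversine_km(lat, lon, *true_latlon)

centroids = [(47.07, 15.44), (48.21, 16.37), (52.52, 13.40)]  # toy cells
probs = [0.7, 0.2, 0.1]
print(expected_latlon(probs, centroids))
print(ce_haversine(probs, 0, (47.05, 15.45), centroids))
```

Unlike the cross-entropy term, the distance term depends on where the remaining probability mass sits, so confusing a cell with its neighbour costs less than confusing it with a distant one. A training implementation would express the same computation with differentiable tensor operations over a batch.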

Intended use

Educational / demo: explore CLIP-based classifier behaviour under different geocell granularities.
Reference implementation: starting point for further geolocation research using the StreetCLIP vision backbone.
Thesis demonstrator: companion artefact for the master's thesis at FH JOANNEUM University of Applied Sciences.

Out-of-scope / limitations

Geographic scope: Europe only. Every head can only output European cells, so predictions for images taken outside Europe are meaningless.
Visual scope: Street View-style imagery (driver perspective, outdoor, daylight). Aerial photos, indoor shots, screenshots, and composite imagery are likely to fail.
Resolution ceiling: predictions are cell centroids; per-cell median errors range from ~16 km (k=4000 EU) to ~120 km (k=500 EU).
Class imbalance: dataset is skewed toward Western/Central Europe. Predictions for under-represented regions carry higher uncertainty.
Privacy: do not use this model to identify or track individuals.

Citation

@misc{streetclip2023,
  title  = {StreetCLIP: A Robust Image-Language Model for Generalizable Geolocation},
  author = {Lukas Haas and Silas Alberti and Michal Skreta},
  year   = {2023},
  url    = {https://huggingface.co/geolocal/StreetCLIP}
}

@mastersthesis{leber2026streetview,
  title  = {Street View Image Geolocation in Europe via Geocell Classification},
  author = {Florian Leber},
  school = {FH JOANNEUM University of Applied Sciences},
  year   = {2026}
}

Contact

Florian Leber, florian.leber@edu.fh-joanneum.at
Master's thesis at FH JOANNEUM University of Applied Sciences.
Source: https://git-iit.fh-joanneum.at/leberflo19/europestreetviewgeolocator

License

MIT. Released for research / educational use.
