Street View Europe: Geocell Classification Heads
Trained linear classification heads for European Street View geolocation,
sitting on top of a frozen geolocal/StreetCLIP
vision encoder (1024-dim pooled features).
Each head predicts one of N discrete geocells (a partition of Europe
into bounded regions). The cell centroids ship alongside each head as
geocell_info.json, so a prediction can be turned into a (lat, lon) by
indexing into the file.
These checkpoints power the live demo at lebfla11/streetview-eu-geocell-demo.
Quick start
import json

import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download
from PIL import Image
from transformers import CLIPImageProcessor, CLIPModel

REPO = "lebfla11/streetview-eu-geocell-heads"

# 1. Pull a head + its partition (★ headline: v13 ce_haversine, λ=0.05)
head_path = hf_hub_download(REPO, "ablation_v13_weighted/lam_0.05/best.pt")
info_path = hf_hub_download(
    REPO, "data_collection/outputs/geocells_h3_res4_eu/geocell_info.json")
with open(info_path) as f:
    centroids = {int(k): (v["centroid_lat"], v["centroid_lon"])
                 for k, v in json.load(f).items()}
n_cells = len(centroids)

# 2. Build the head (matches HeadOnlyModel(feat_dim=1024, fusion='feature'))
class Head(nn.Module):
    def __init__(self, n_cells, dropout=0.3):
        super().__init__()
        self.head = nn.Sequential(
            nn.LayerNorm(1024), nn.Dropout(dropout), nn.Linear(1024, n_cells))

    def forward(self, feats):
        # feats: (B, 1024), or (B, V, 1024) for V views, which are mean-pooled first
        if feats.dim() == 3:
            feats = feats.mean(dim=1)
        return self.head(feats)

head = Head(n_cells).eval()
# weights_only=False unpickles arbitrary objects: only load checkpoints you trust.
ck = torch.load(head_path, map_location="cpu", weights_only=False)
state = ck["model"] if isinstance(ck, dict) and "model" in ck else ck
# strict=False tolerates missing or unexpected keys instead of raising.
head.load_state_dict(state, strict=False)

# 3. Encode an image with frozen StreetCLIP
clip = CLIPModel.from_pretrained("geolocal/StreetCLIP").vision_model.eval()
proc = CLIPImageProcessor.from_pretrained("geolocal/StreetCLIP")
img = Image.open("street_view.jpg").convert("RGB")
px = proc(images=img, return_tensors="pt")["pixel_values"]
with torch.no_grad():
    feat = clip(pixel_values=px).pooler_output.float()  # (1, 1024)
    logits = head(feat)

# 4. Print the top-5 cells with their probabilities and centroids
probs = logits.softmax(-1)
top5 = probs[0].topk(5)
for prob, cell in zip(top5.values, top5.indices):
    lat, lon = centroids[int(cell)]
    print(f"cell {int(cell):4d} {prob.item()*100:5.1f}% ({lat:.3f}, {lon:.3f})")
For a head trained with fusion='logit' (the v12 logit_lr5e-3 family, not
shipped here; see the demo Space for context), apply the linear head per
view and average the logits instead of the features.
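The difference between the two fusion modes can be sketched as follows. This is a minimal illustration, not the training code: random tensors stand in for StreetCLIP features, the head uses only 8 cells, and `make_head`, `fuse_features`, and `fuse_logits` are hypothetical names introduced here.

```python
import torch
import torch.nn as nn

def make_head(n_cells: int, feat_dim: int = 1024, dropout: float = 0.3) -> nn.Module:
    # Same layer stack as the quick-start Head: LayerNorm -> Dropout -> Linear.
    return nn.Sequential(
        nn.LayerNorm(feat_dim), nn.Dropout(dropout), nn.Linear(feat_dim, n_cells)
    ).eval()

def fuse_features(head: nn.Module, feats: torch.Tensor) -> torch.Tensor:
    # fusion='feature': mean-pool the V view features, then apply the head once.
    return head(feats.mean(dim=1))

def fuse_logits(head: nn.Module, feats: torch.Tensor) -> torch.Tensor:
    # fusion='logit': apply the head to every view, then average the logits.
    return head(feats).mean(dim=1)

torch.manual_seed(0)
head = make_head(n_cells=8)
feats = torch.randn(2, 4, 1024)  # (batch, views, feat_dim), random stand-in features

with torch.no_grad():
    feature_logits = fuse_features(head, feats)
    logit_logits = fuse_logits(head, feats)

print(feature_logits.shape, logit_logits.shape)
```

Because LayerNorm is not a linear operation, the two variants generally produce different logits for the same views, so a head should be paired with the fusion mode it was trained with.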
Model details
| Property | Details |
|---|---|
| Architecture | Linear classification head: LayerNorm(1024) → Dropout → Linear(1024, N_cells) on top of frozen StreetCLIP vision encoder |
| Backbone | geolocal/StreetCLIP (frozen, 1024-d pooled output, image size 336×336) |
| Multi-view fusion | Mean-pool of per-view features before the head (fusion='feature') |
| Training framework | PyTorch 2.6, AMP, AdamW, cosine LR schedule, label smoothing 0.1, ~80 epochs (v12 family) with patience 15. The headline v13 head is a 10-epoch warm-started fine-tune from v12 seed_44. |
| Loss | v12 family: class-weighted cross-entropy (inverse cell frequency). v13 (headline): weighted CE + λ · mean haversine on the posterior-expected lat/lon (probability-weighted spherical mean over all cell centroids). λ = 0.05 for the headline checkpoint. |
| Selection metric | balscore = ∛(within_25 × within_200 × within_750), the geometric mean of three distance-band accuracies |
| Trainable params | 1024 × N_cells weights + N_cells biases (Linear) + 2048 (LayerNorm) per head; under 6 M for the largest partition |
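As a quick check on the parameter budget, the count can be worked out by hand. The sketch below assumes the Linear layer keeps PyTorch's default bias term, as in the quick-start code; `head_param_count` is a hypothetical helper, and the cell counts are taken from the "Available heads" table.

```python
def head_param_count(n_cells: int, feat_dim: int = 1024) -> int:
    layer_norm = 2 * feat_dim              # LayerNorm weight + bias
    linear = feat_dim * n_cells + n_cells  # Linear weight + bias (assumed default bias=True)
    return layer_norm + linear

# Cell counts from the "Available heads" table: headline partition and largest partition.
for name, n_cells in [("H3 res=4 EU", 1898), ("K-d tree t=19 EU", 5444)]:
    print(f"{name}: {head_param_count(n_cells):,} trainable parameters")
```

Under these assumptions the largest partition (5 444 cells) comes to roughly 5.58 M parameters, consistent with the "under 6 M" figure.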
Available heads
All checkpoints were trained on the leakage-free train/val/test split
(no shared lat,lon key between splits). Earlier runs under runs/clusters/*,
which used a leaky split, are not included here.
| Path in repo | Partition | # cells | Top-1 | balscore |
|---|---|---|---|---|
| ★ ablation_v13_weighted/lam_0.05/best.pt | H3 res=4 EU | 1 898 | 29.7 % | 63.1 |
| ablation_v13_weighted/lam_0.01/best.pt | H3 res=4 EU | 1 898 | 29.3 % | 62.8 |
| ablation_v12/seed_44/best.pt | H3 res=4 EU | 1 898 | 28.8 % | 62.1 |
| sweep_v5/streetclip/h3_res4_eu/best.pt | H3 res=4 EU | 1 898 | 28.3 % | 61.5 |
| sweep_v5/streetclip/k2000_eu/best.pt | K-Means k=2000 EU | 1 792 | 27.5 % | 60.0 |
| sweep_v5/streetclip/k4000_eu/best.pt | K-Means k=4000 EU | 3 180 | 22.9 % | 60.0 |
| sweep_v5/streetclip/kdtree19_eu/best.pt | K-d tree t=19 EU | 5 444 | 15.7 % | 58.6 |
| sweep_v5/streetclip/kdtree39_eu/best.pt | K-d tree t=39 EU | 2 059 | 21.8 % | 58.4 |
| sweep_v5/streetclip/k1000_eu/best.pt | K-Means k=1000 EU | 957 | 36.2 % | 57.9 |
| sweep_v5/streetclip/nuts3_eu/best.pt | NUTS-3 EU | 1 194 | 37.6 % | 56.8 |
| sweep_v5/streetclip/kdtree78_eu/best.pt | K-d tree t=78 EU | 1 029 | 27.5 % | 55.3 |
| sweep_v5/streetclip/k500_eu/best.pt | K-Means k=500 EU | 487 | 43.4 % | 51.7 |
| sweep_v5/streetclip/kdtree155_eu/best.pt | K-d tree t=155 EU | 515 | 36.1 % | 50.7 |
For each head, the matching partition lives at
data_collection/outputs/geocells_<name>/geocell_info.json (centroids only;
the training-time geocell_map.json and splits.json are not shipped).
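One way to pair a head with its partition file is sketched below. The mapping is an assumption rather than a documented convention: it supposes that `<name>` equals the directory component of a `sweep_v5/streetclip/<name>/best.pt` path, and that the ablation heads use `h3_res4_eu` (they are listed as H3 res=4 EU, and the quick start pairs the headline head with that file). `partition_info_path` is a hypothetical helper; check the result against the repository file listing before relying on it.

```python
from pathlib import PurePosixPath

# Ablation runs do not carry the partition name in their path; the
# "Available heads" table lists them all as H3 res=4 EU.
ABLATION_PARTITION = "h3_res4_eu"

def partition_info_path(head_path: str) -> str:
    """Guess the geocell_info.json path for a head path (hypothetical helper)."""
    parts = PurePosixPath(head_path).parts
    if parts[0] == "sweep_v5":
        name = parts[2]  # assumed layout: sweep_v5/<backbone>/<name>/best.pt
    else:
        name = ABLATION_PARTITION
    return f"data_collection/outputs/geocells_{name}/geocell_info.json"

print(partition_info_path("sweep_v5/streetclip/k2000_eu/best.pt"))
print(partition_info_path("ablation_v13_weighted/lam_0.05/best.pt"))
```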
Training data
- Source: Internal Google Street View collection
- Size: 79 144 European locations × ~4 headings = 316 498 images
- Coverage: 41 European countries
- Split: 80 / 10 / 10 train/val/test, stratified to be leakage-free
  (no shared lat,lon key between splits)
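The leakage-free property can be illustrated with a group-wise split: every image of a location shares one (lat, lon) key, and whole keys are assigned to a single split. The sketch below is an assumption about how such a split could be built, not the thesis code; it does not attempt any stratification, and `split_by_location` and the toy data are hypothetical.

```python
import random
from collections import defaultdict

def split_by_location(samples, seed=0, fractions=(0.8, 0.1, 0.1)):
    """Assign whole (lat, lon) groups to train/val/test so no key is shared."""
    groups = defaultdict(list)
    for sample in samples:
        groups[(sample["lat"], sample["lon"])].append(sample)

    keys = sorted(groups)              # deterministic order before shuffling
    random.Random(seed).shuffle(keys)

    n_train = int(fractions[0] * len(keys))
    n_val = int(fractions[1] * len(keys))
    key_splits = {
        "train": keys[:n_train],
        "val": keys[n_train:n_train + n_val],
        "test": keys[n_train + n_val:],
    }
    return {name: [s for k in ks for s in groups[k]] for name, ks in key_splits.items()}

# Toy data: 50 locations with 4 headings each (synthetic coordinates).
samples = [
    {"lat": 45.0 + i * 0.1, "lon": 10.0 + i * 0.1, "heading": h}
    for i in range(50) for h in (0, 90, 180, 270)
]
splits = split_by_location(samples)
print({name: len(items) for name, items in splits.items()})
```

Splitting by key rather than by image keeps all headings of one location together, which is what prevents near-duplicate views from landing in both train and test.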
Evaluation (test split, headline head: ablation_v13_weighted/lam_0.05)
| Metric | v13 λ=0.05 (★) | v12 seed_44 (prior best) |
|---|---|---|
| Top-1 cell accuracy | 29.7 % | 28.8 % |
| Top-5 cell accuracy | 58.6 % | 57.8 % |
| Median haversine error | 51.9 km | 54.6 km |
| Mean haversine error | 122.6 km | 130.3 km |
| within 1 km | 0.1 % | 0.1 % |
| within 25 km | 31.4 % | 30.5 % |
| within 200 km | 81.0 % | 80.0 % |
| within 750 km | 98.6 % | 98.1 % |
| within 2500 km | 100.0 % | 100.0 % |
| balscore | 63.1 | 62.1 |
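The balscore rows can be recomputed from the three distance-band accuracies. The sketch below reads "geometric mean of three accuracies" as the cube root of their product, with accuracies in percent; the function name is chosen here for illustration.

```python
def balscore(within_25: float, within_200: float, within_750: float) -> float:
    # Geometric mean of three distance-band accuracies (in percent).
    return (within_25 * within_200 * within_750) ** (1.0 / 3.0)

# Test-split values from the evaluation table.
v13 = balscore(31.4, 81.0, 98.6)
v12 = balscore(30.5, 80.0, 98.1)
print(f"v13: {v13:.1f}  v12: {v12:.1f}")
```

Rounded to one decimal, this reproduces the reported 63.1 and 62.1.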
The v13 head is a 10-epoch warm-started fine-tune from ablation_v12/seed_44/best.pt
with the new ce_haversine_weighted loss
(L = CE + λ · mean_haversine(posterior-expected lat/lon, true GPS),
λ = 0.05). The fine-tune additionally improves cross-domain generalization on
out-of-distribution test sets (im2gps Europe top-1 14.4 → 20.0; OSM-Europe-1k
4-view mean km 716 → 629).
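The structure of this loss can be illustrated for a single sample in plain Python. This is a sketch under several assumptions, not the training implementation: the distance is measured in kilometres with a 6371 km Earth radius, the class weights on the CE term are omitted, the spherical mean is taken by averaging unit vectors, and the three centroids are made-up toy values. A real training loss would be written with differentiable tensor operations and averaged over the batch.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed convention)

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points given in degrees.
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi, dlmb = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def expected_latlon(probs, centroids):
    # Probability-weighted mean of centroid unit vectors, projected back to the sphere.
    x = y = z = 0.0
    for p, (lat, lon) in zip(probs, centroids):
        phi, lmb = math.radians(lat), math.radians(lon)
        x += p * math.cos(phi) * math.cos(lmb)
        y += p * math.cos(phi) * math.sin(lmb)
        z += p * math.sin(phi)
    return math.degrees(math.atan2(z, math.hypot(x, y))), math.degrees(math.atan2(y, x))

def ce_haversine(probs, target_cell, true_latlon, centroids, lam=0.05):
    ce = -math.log(probs[target_cell])  # unweighted cross-entropy for one sample
    exp_lat, exp_lon = expected_latlon(probs, centroids)
    return ce + lam * haversine_km(exp_lat, exp_lon, *true_latlon)

# Toy example: three hypothetical cell centroids and a softmax output.
centroids = [(48.2, 16.4), (47.1, 15.4), (48.1, 11.6)]
probs = [0.7, 0.2, 0.1]
print(ce_haversine(probs, target_cell=0, true_latlon=(48.21, 16.37), centroids=centroids))
```

The distance term penalises probability mass placed on far-away cells even when the top-1 cell is correct, which a plain cross-entropy term cannot do.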
Intended use
- Educational / demo: explore CLIP-based classifier behaviour under different geocell granularities.
- Reference implementation: starting point for further geolocation research using the StreetCLIP vision backbone.
- Thesis demonstrator: companion artefact for the master's thesis at FH JOANNEUM University of Applied Sciences.
Out-of-scope / limitations
- Geographic scope: Europe only. Predictions outside Europe are meaningless extrapolations.
- Visual scope: Street View-style imagery (driver perspective, outdoor, daylight). Aerial photos, indoor shots, screenshots, and composite imagery are likely to fail.
- Resolution ceiling: predictions are cell centroids; per-cell median errors range from ~16 km (k=4000 EU) to ~120 km (k=500 EU).
- Class imbalance: dataset is skewed toward Western/Central Europe. Predictions for under-represented regions carry higher uncertainty.
- Privacy: do not use this model to identify or track individuals.
Citation
@misc{streetclip2023,
title = {StreetCLIP: A Robust Image-Language Model for Generalizable Geolocation},
author = {Lukas Haas and Silas Alberti and Michal Skreta},
year = {2023},
url = {https://huggingface.co/geolocal/StreetCLIP}
}
@mastersthesis{leber2026streetview,
title = {Street View Image Geolocation in Europe via Geocell Classification},
author = {Florian Leber},
school = {FH JOANNEUM University of Applied Sciences},
year = {2026}
}
Contact
Florian Leber, florian.leber@edu.fh-joanneum.at
Master's thesis at FH JOANNEUM University of Applied Sciences.
Source: https://git-iit.fh-joanneum.at/leberflo19/europestreetviewgeolocator
License
MIT. Released for research / educational use.