Street View Europe: Geocell Classification Heads

Trained linear classification heads for European Street View geolocation, sitting on top of a frozen geolocal/StreetCLIP vision encoder (1024-dim pooled features).

Each head predicts one of N discrete geocells (a partition of Europe into bounded regions). The cell centroids ship alongside each head as geocell_info.json, so a prediction can be turned into a (lat, lon) by indexing into the file.

These checkpoints power the live demo at lebfla11/streetview-eu-geocell-demo.


Quick start

import json, torch, torch.nn as nn
from huggingface_hub import hf_hub_download
from transformers import CLIPImageProcessor, CLIPModel
from PIL import Image

REPO = "lebfla11/streetview-eu-geocell-heads"

# 1. Pull a head + its partition (★ headline: v13 ce_haversine λ=0.05)
head_path = hf_hub_download(REPO, "ablation_v13_weighted/lam_0.05/best.pt")
info_path = hf_hub_download(REPO,
    "data_collection/outputs/geocells_h3_res4_eu/geocell_info.json")

centroids = {int(k): (v["centroid_lat"], v["centroid_lon"])
             for k, v in json.load(open(info_path)).items()}
n_cells = len(centroids)

# 2. Build the head (matches HeadOnlyModel(feat_dim=1024, fusion='feature'))
class Head(nn.Module):
    def __init__(self, n_cells, dropout=0.3):
        super().__init__()
        self.head = nn.Sequential(
            nn.LayerNorm(1024), nn.Dropout(dropout), nn.Linear(1024, n_cells))
    def forward(self, feats):
        # feats: (B, V, 1024) for V views, mean-pool then head
        if feats.dim() == 3:
            feats = feats.mean(dim=1)
        return self.head(feats)

head = Head(n_cells).eval()
ck = torch.load(head_path, map_location="cpu", weights_only=False)
head.load_state_dict(ck["model"] if isinstance(ck, dict) and "model" in ck else ck,
                     strict=False)

# 3. Encode an image with frozen StreetCLIP
clip = CLIPModel.from_pretrained("geolocal/StreetCLIP").vision_model.eval()
proc = CLIPImageProcessor.from_pretrained("geolocal/StreetCLIP")
img  = Image.open("street_view.jpg").convert("RGB")
px   = proc(images=img, return_tensors="pt")["pixel_values"]
with torch.no_grad():
    feat = clip(pixel_values=px).pooler_output.float()  # (1, 1024)
    logits = head(feat)
    probs = logits.softmax(-1)

top5 = probs[0].topk(5)
for prob, cell in zip(top5.values, top5.indices):
    lat, lon = centroids[int(cell)]
    print(f"cell {int(cell):4d}  {prob.item()*100:5.1f}%  ({lat:.3f}, {lon:.3f})")

For a head trained with fusion='logit' (the v12 logit_lr5e-3 family, not shipped here; see the demo Space for context), apply the linear head per view and average the logits instead of the features.
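The logit-fusion variant can be sketched as below. This is a minimal illustration, not the training code: `LogitFusionHead` is a hypothetical name, and the layer layout is assumed to mirror the `Head` class from the quick start.

```python
import torch
import torch.nn as nn


class LogitFusionHead(nn.Module):
    """Hypothetical logit-fusion variant of the quick-start Head."""

    def __init__(self, n_cells, feat_dim=1024, dropout=0.3):
        super().__init__()
        self.head = nn.Sequential(
            nn.LayerNorm(feat_dim), nn.Dropout(dropout), nn.Linear(feat_dim, n_cells))

    def forward(self, feats):
        # feats: (B, V, feat_dim) for V views, or (B, feat_dim) for a single view
        if feats.dim() == 2:
            return self.head(feats)
        per_view_logits = self.head(feats)   # (B, V, n_cells); layers act on the last dim
        return per_view_logits.mean(dim=1)   # average the logits, not the features


torch.manual_seed(0)
model = LogitFusionHead(n_cells=8, feat_dim=16).eval()
views = torch.randn(2, 4, 16)                # 2 locations, 4 headings each
with torch.no_grad():
    print(model(views).shape)                # torch.Size([2, 8])
```

Because LayerNorm is applied per view before the linear layer, averaging logits generally gives a different result from averaging features first, so a checkpoint should be used with the fusion mode it was trained with.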


Model details

Architecture | Linear classification head: LayerNorm(1024) → Dropout → Linear(1024, N_cells) on top of the frozen StreetCLIP vision encoder
Backbone | geolocal/StreetCLIP (frozen, 1024-d pooled output, image size 336×336)
Multi-view fusion | Mean-pool of per-view features before the head (fusion='feature')
Training framework | PyTorch 2.6, AMP, AdamW, cosine LR schedule, label smoothing 0.1, ~80 epochs (v12 family) with patience 15. The headline v13 head is a 10-epoch warm-started fine-tune from v12 seed_44.
Loss | v12 family: class-weighted cross-entropy (inverse cell frequency). v13 (headline): weighted CE + λ · mean haversine on the posterior-expected lat/lon (probability-weighted spherical mean over all cell centroids). λ = 0.05 for the headline checkpoint.
Selection metric | balscore = ∛(within_25 × within_200 × within_750), the geometric mean of three distance-band accuracies
Trainable params | 1024 × N_cells + N_cells (Linear weight and bias) + 2048 (LayerNorm) per head, under 6 M for the largest partition
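The selection metric is simple enough to check by hand. The sketch below uses hypothetical helper names (`band_accuracies`, `balscore`) and assumes that "within k km" means error ≤ k and that accuracies are given in percent; it is not taken from the training code.

```python
def band_accuracies(errors_km, thresholds=(25, 200, 750)):
    """Percentage of predictions whose error is within each distance threshold."""
    n = len(errors_km)
    return [100.0 * sum(e <= t for e in errors_km) / n for t in thresholds]


def balscore(within_25, within_200, within_750):
    """Geometric mean (cube root of the product) of the three band accuracies."""
    return (within_25 * within_200 * within_750) ** (1.0 / 3.0)


# Headline test-split accuracies reported in this card, in percent.
print(round(balscore(31.4, 81.0, 98.6), 1))  # 63.1
```

Plugging in the headline within-25/200/750 km accuracies reproduces the reported balscore of 63.1. Because it is a geometric mean, a head that is weak in any one band is penalized more than it would be by an arithmetic mean.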

Available heads

All checkpoints were trained on the leakage-free train/val/test split (no (lat, lon) key is shared between splits). Earlier runs under runs/clusters/* used a leaky split and are not included here.

Path in repo | Partition | # cells | Top-1 | balscore
★ ablation_v13_weighted/lam_0.05/best.pt | H3 res=4 EU | 1 898 | 29.7 % | 63.1
ablation_v13_weighted/lam_0.01/best.pt | H3 res=4 EU | 1 898 | 29.3 % | 62.8
ablation_v12/seed_44/best.pt | H3 res=4 EU | 1 898 | 28.8 % | 62.1
sweep_v5/streetclip/h3_res4_eu/best.pt | H3 res=4 EU | 1 898 | 28.3 % | 61.5
sweep_v5/streetclip/k2000_eu/best.pt | K-Means k=2000 EU | 1 792 | 27.5 % | 60.0
sweep_v5/streetclip/k4000_eu/best.pt | K-Means k=4000 EU | 3 180 | 22.9 % | 60.0
sweep_v5/streetclip/kdtree19_eu/best.pt | K-d tree t=19 EU | 5 444 | 15.7 % | 58.6
sweep_v5/streetclip/kdtree39_eu/best.pt | K-d tree t=39 EU | 2 059 | 21.8 % | 58.4
sweep_v5/streetclip/k1000_eu/best.pt | K-Means k=1000 EU | 957 | 36.2 % | 57.9
sweep_v5/streetclip/nuts3_eu/best.pt | NUTS-3 EU | 1 194 | 37.6 % | 56.8
sweep_v5/streetclip/kdtree78_eu/best.pt | K-d tree t=78 EU | 1 029 | 27.5 % | 55.3
sweep_v5/streetclip/k500_eu/best.pt | K-Means k=500 EU | 487 | 43.4 % | 51.7
sweep_v5/streetclip/kdtree155_eu/best.pt | K-d tree t=155 EU | 515 | 36.1 % | 50.7

For each head, the matching partition lives at data_collection/outputs/geocells_<name>/geocell_info.json (centroids only; the training-time geocell_map.json and splits.json are not shipped).
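When switching between heads it is easy to pair a checkpoint with the wrong partition. The helpers below are a hedged sketch with hypothetical names; they assume the JSON layout used in the quick start (cell id mapped to `centroid_lat` / `centroid_lon`) and that the partition name for a `sweep_v5` head equals its directory name, which this card does not state explicitly.

```python
import json


def geocell_info_path(partition_name):
    """Repo-relative path of the centroid file for a partition name."""
    return f"data_collection/outputs/geocells_{partition_name}/geocell_info.json"


def parse_centroids(json_text):
    """Map cell id -> (lat, lon) from the contents of geocell_info.json."""
    return {int(k): (v["centroid_lat"], v["centroid_lon"])
            for k, v in json.loads(json_text).items()}


def check_head_matches(centroids, n_outputs):
    """Fail early if a head's output size disagrees with its partition."""
    if len(centroids) != n_outputs:
        raise ValueError(
            f"head has {n_outputs} outputs but partition has {len(centroids)} cells")


print(geocell_info_path("h3_res4_eu"))
```

Calling `check_head_matches` with the output size of the loaded linear layer, before running inference, turns a mismatched partition into an immediate error rather than silently wrong coordinates.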

Training data

  • Source: Internal Google Street View collection
  • Size: 79 144 European locations × ~4 headings = 316 498 images
  • Coverage: 41 European countries
  • Split: 80 / 10 / 10 train/val/test, stratified to be leakage-free (no shared lat,lon key between splits)
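The leakage-free property can be obtained by splitting at the level of locations rather than images, so that all headings of one (lat, lon) end up in the same split. The following is only a sketch of that idea under assumed inputs (tuples of `lat, lon, image_id`); `split_by_location` is a hypothetical helper, it does not reproduce any stratification, and the actual split code is not part of this repo.

```python
import random
from collections import defaultdict


def split_by_location(samples, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole locations to train/val/test so no (lat, lon) key is shared.

    samples: iterable of (lat, lon, image_id) tuples.
    """
    by_key = defaultdict(list)
    for lat, lon, image_id in samples:
        by_key[(lat, lon)].append(image_id)

    keys = sorted(by_key)                      # deterministic order before shuffling
    random.Random(seed).shuffle(keys)

    n_train = int(fractions[0] * len(keys))
    n_val = int(fractions[1] * len(keys))
    parts = {
        "train": keys[:n_train],
        "val": keys[n_train:n_train + n_val],
        "test": keys[n_train + n_val:],        # remainder goes to test
    }
    return {name: {k: by_key[k] for k in ks} for name, ks in parts.items()}
```

Splitting images independently would place different headings of the same panorama location in train and test, which inflates test accuracy.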

Evaluation (test split, headline head: ablation_v13_weighted/lam_0.05)

Metric | v13 λ=0.05 (★) | v12 seed_44 (prior best)
Top-1 cell accuracy | 29.7 % | 28.8 %
Top-5 cell accuracy | 58.6 % | 57.8 %
Median haversine error | 51.9 km | 54.6 km
Mean haversine error | 122.6 km | 130.3 km
within 1 km | 0.1 % | 0.1 %
within 25 km | 31.4 % | 30.5 %
within 200 km | 81.0 % | 80.0 %
within 750 km | 98.6 % | 98.1 %
within 2500 km | 100.0 % | 100.0 %
balscore | 63.1 | 62.1

The v13 head is a 10-epoch warm-started fine-tune from ablation_v12/seed_44/best.pt with the new ce_haversine_weighted loss (L = CE + λ · mean_haversine(posterior-expected lat/lon, true GPS), λ=0.05). The fine-tune additionally improves cross-domain generalization on out-of-distribution test sets (im2gps Europe top-1 14.4 → 20.0; OSM-Europe-1k 4-view mean km 716 → 629).
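To make the loss concrete, here is a per-sample sketch in plain Python. It is an illustration under stated assumptions, not the repo's `ce_haversine_weighted` implementation: the function names are hypothetical, the distance is assumed to be in kilometres with an Earth radius of 6371 km, label smoothing is omitted, and the batch mean is left out.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def expected_latlon(probs, centroids):
    """Probability-weighted spherical mean of cell centroids, in degrees."""
    x = y = z = 0.0
    for p, (lat, lon) in zip(probs, centroids):
        phi, lmb = math.radians(lat), math.radians(lon)
        x += p * math.cos(phi) * math.cos(lmb)   # accumulate weighted unit vectors
        y += p * math.cos(phi) * math.sin(lmb)
        z += p * math.sin(phi)
    lat = math.degrees(math.atan2(z, math.hypot(x, y)))
    lon = math.degrees(math.atan2(y, x))
    return lat, lon


def ce_haversine(probs, centroids, true_cell, true_latlon, lam=0.05, class_weight=1.0):
    """Per-sample sketch of L = weighted CE + lam * haversine(expected, true)."""
    ce = -class_weight * math.log(probs[true_cell])
    lat, lon = expected_latlon(probs, centroids)
    return ce + lam * haversine_km(lat, lon, *true_latlon)
```

The distance term depends on the whole posterior rather than only on the true cell, so probability mass placed on a neighbouring cell costs less than the same mass placed on a distant one. Plain cross-entropy cannot express that distinction.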

Intended use

  • Educational / demo: explore CLIP-based classifier behaviour under different geocell granularities.
  • Reference implementation: starting point for further geolocation research using the StreetCLIP vision backbone.
  • Thesis demonstrator: companion artefact for the master's thesis at FH JOANNEUM University of Applied Sciences.

Out-of-scope / limitations

  • Geographic scope: Europe only. Predictions outside Europe are meaningless extrapolations.
  • Visual scope: Street View-style imagery (driver perspective, outdoor, daylight). Aerial photos, indoor shots, screenshots, and composite imagery are likely to fail.
  • Resolution ceiling: predictions are cell centroids; per-cell median errors range from ~16 km (k=4000 EU) to ~120 km (k=500 EU).
  • Class imbalance: dataset is skewed toward Western/Central Europe. Predictions for under-represented regions carry higher uncertainty.
  • Privacy: do not use this model to identify or track individuals.

Citation

@misc{streetclip2023,
  title  = {StreetCLIP: A Robust Image-Language Model for Generalizable Geolocation},
  author = {Lukas Haas and Silas Alberti and Michal Skreta},
  year   = {2023},
  url    = {https://huggingface.co/geolocal/StreetCLIP}
}

@mastersthesis{leber2026streetview,
  title  = {Street View Image Geolocation in Europe via Geocell Classification},
  author = {Florian Leber},
  school = {FH JOANNEUM University of Applied Sciences},
  year   = {2026}
}

Contact

Florian Leber (florian.leber@edu.fh-joanneum.at). Master's thesis at FH JOANNEUM University of Applied Sciences. Source: https://git-iit.fh-joanneum.at/leberflo19/europestreetviewgeolocator

License

MIT. Released for research / educational use.
