---
license: mit
library_name: pytorch
pipeline_tag: image-classification
tags:
- geolocation
- street-view
- clip
- streetclip
- geocell
- europe
- thesis
base_model: geolocal/StreetCLIP
language:
- en
datasets:
- private-streetview-europe
metrics:
- accuracy
- haversine-distance
---

# Street View Europe: Geocell Classification Heads

Trained linear classification heads for European Street View geolocation, sitting on top of a frozen [`geolocal/StreetCLIP`](https://huggingface.co/geolocal/StreetCLIP) vision encoder (1024-dim pooled features). Each head predicts one of `N` discrete **geocells** (a partition of Europe into bounded regions). The cell centroids ship alongside each head as `geocell_info.json`, so a predicted cell index can be turned into a (lat, lon) pair by looking it up in that file.

These checkpoints power the live demo at [**lebfla11/streetview-eu-geocell-demo**](https://huggingface.co/spaces/lebfla11/streetview-eu-geocell-demo).

---

## Quick start

```python
import json

import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download
from PIL import Image
from transformers import CLIPImageProcessor, CLIPModel

REPO = "lebfla11/streetview-eu-geocell-heads"

# 1. Pull a head + its partition (★ headline: v13 ce_haversine λ=0.05)
head_path = hf_hub_download(REPO, "ablation_v13_weighted/lam_0.05/best.pt")
info_path = hf_hub_download(
    REPO, "data_collection/outputs/geocells_h3_res4_eu/geocell_info.json")
with open(info_path) as f:
    centroids = {int(k): (v["centroid_lat"], v["centroid_lon"])
                 for k, v in json.load(f).items()}
n_cells = len(centroids)


# 2. Build the head (matches HeadOnlyModel(feat_dim=1024, fusion='feature'))
class Head(nn.Module):
    def __init__(self, n_cells, dropout=0.3):
        super().__init__()
        self.head = nn.Sequential(
            nn.LayerNorm(1024), nn.Dropout(dropout), nn.Linear(1024, n_cells))

    def forward(self, feats):
        # feats: (B, 1024), or (B, V, 1024) for V views (mean-pooled before the head)
        if feats.dim() == 3:
            feats = feats.mean(dim=1)
        return self.head(feats)


head = Head(n_cells).eval()
# weights_only=False unpickles arbitrary objects: only load checkpoints you trust.
ck = torch.load(head_path, map_location="cpu", weights_only=False)
state = ck["model"] if isinstance(ck, dict) and "model" in ck else ck
# strict=False tolerates missing or extra keys, so a key mismatch will not raise.
head.load_state_dict(state, strict=False)

# 3. Encode an image with frozen StreetCLIP
clip = CLIPModel.from_pretrained("geolocal/StreetCLIP").vision_model.eval()
proc = CLIPImageProcessor.from_pretrained("geolocal/StreetCLIP")
img = Image.open("street_view.jpg").convert("RGB")
px = proc(images=img, return_tensors="pt")["pixel_values"]
with torch.no_grad():
    feat = clip(pixel_values=px).pooler_output.float()  # (1, 1024)
    logits = head(feat)
probs = logits.softmax(-1)

# 4. Print the five most likely cells with their centroids
top5 = probs[0].topk(5)
for prob, cell in zip(top5.values, top5.indices):
    lat, lon = centroids[int(cell)]
    print(f"cell {int(cell):4d} {prob.item()*100:5.1f}% ({lat:.3f}, {lon:.3f})")
```

For a head trained with `fusion='logit'` (the v12 logit_lr5e-3 family, which is not shipped here; see the demo Space for context), apply the linear head to each view and average the logits instead of the features.

---

## Model details

| | |
|---|---|
| **Architecture** | Linear classification head: `LayerNorm(1024) → Dropout → Linear(1024, N_cells)` on top of frozen StreetCLIP vision encoder |
| **Backbone** | [`geolocal/StreetCLIP`](https://huggingface.co/geolocal/StreetCLIP) (frozen, 1024-d pooled output, image size 336×336) |
| **Multi-view fusion** | Mean-pool of per-view features before the head (`fusion='feature'`) |
| **Training framework** | PyTorch 2.6, AMP, AdamW, cosine LR schedule, label smoothing 0.1, ~80 epochs (v12 family) with patience 15. The headline v13 head is a 10-epoch warm-started fine-tune from v12 seed_44. |
| **Loss** | v12 family: class-weighted cross-entropy (inverse cell frequency). **v13 (headline)**: weighted CE + λ · mean haversine on the posterior-expected lat/lon (probability-weighted spherical mean over all cell centroids). λ = 0.05 for the headline checkpoint. |
| **Selection metric** | `balscore` = ∛(within_25 × within_200 × within_750), the geometric mean of three distance-band accuracies |
| **Trainable params** | 1024 × N_cells + 2048 (LayerNorm) per head, plus N_cells bias terms if the `Linear` layer keeps its default bias; under 6 M for the largest partition |

### Available heads

All checkpoints were trained on the **leakage-free** train/val/test split (no shared `lat,lon` key between splits). Earlier `runs/clusters/*` runs that used a leaky split are not included here.

| Path in repo | Partition | # cells | Top-1 | balscore |
|---|---|---:|---:|---:|
| **★ `ablation_v13_weighted/lam_0.05/best.pt`** | H3 res=4 EU | 1 898 | **29.7 %** | **63.1** |
| `ablation_v13_weighted/lam_0.01/best.pt` | H3 res=4 EU | 1 898 | 29.3 % | 62.8 |
| `ablation_v12/seed_44/best.pt` | H3 res=4 EU | 1 898 | 28.8 % | 62.1 |
| `sweep_v5/streetclip/h3_res4_eu/best.pt` | H3 res=4 EU | 1 898 | 28.3 % | 61.5 |
| `sweep_v5/streetclip/k2000_eu/best.pt` | K-Means k=2000 EU | 1 792 | 27.5 % | 60.0 |
| `sweep_v5/streetclip/k4000_eu/best.pt` | K-Means k=4000 EU | 3 180 | 22.9 % | 60.0 |
| `sweep_v5/streetclip/kdtree19_eu/best.pt` | K-d tree t=19 EU | 5 444 | 15.7 % | 58.6 |
| `sweep_v5/streetclip/kdtree39_eu/best.pt` | K-d tree t=39 EU | 2 059 | 21.8 % | 58.4 |
| `sweep_v5/streetclip/k1000_eu/best.pt` | K-Means k=1000 EU | 957 | 36.2 % | 57.9 |
| `sweep_v5/streetclip/nuts3_eu/best.pt` | NUTS-3 EU | 1 194 | 37.6 % | 56.8 |
| `sweep_v5/streetclip/kdtree78_eu/best.pt` | K-d tree t=78 EU | 1 029 | 27.5 % | 55.3 |
| `sweep_v5/streetclip/k500_eu/best.pt` | K-Means k=500 EU | 487 | 43.4 % | 51.7 |
| `sweep_v5/streetclip/kdtree155_eu/best.pt` | K-d tree t=155 EU | 515 | 36.1 % | 50.7 |

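
The `balscore` selection metric from the model details can be sketched in a few lines. This is a minimal illustration rather than the evaluation code used for these checkpoints: the helper names (`haversine_km`, `within_pct`, `balscore`), the 6371 km Earth radius, and the inclusive `<=` threshold are all assumptions. The example values are the test-split band accuracies reported for the headline and v12 heads.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; the card does not state the value used


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def within_pct(errors_km, threshold_km):
    """Percentage of predictions whose error is at most threshold_km (assumed inclusive)."""
    return 100.0 * sum(e <= threshold_km for e in errors_km) / len(errors_km)


def balscore(within_25, within_200, within_750):
    """Geometric mean (cube root of the product) of the three band accuracies, in percent."""
    return (within_25 * within_200 * within_750) ** (1.0 / 3.0)


# Band accuracies of the headline v13 head and the v12 seed_44 head on the test split.
print(round(balscore(31.4, 81.0, 98.6), 1))  # 63.1
print(round(balscore(30.5, 80.0, 98.1), 1))  # 62.1
```

Because the score is a geometric mean, a head cannot make up for a weak 25 km band with a strong 750 km band alone, which is presumably why it is used to compare partitions of different granularity.
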

For each head, the matching partition lives at `data_collection/outputs/geocells_/geocell_info.json` (centroids only; the training-time `geocell_map.json` and `splits.json` are not shipped).

### Training data

* **Source**: Internal Google Street View collection
* **Size**: 79 144 European locations × ~4 headings = 316 498 images
* **Coverage**: 41 European countries
* **Split**: 80 / 10 / 10 train/val/test, stratified to be leakage-free (no shared `lat,lon` key between splits)

### Evaluation (test split, headline head: `ablation_v13_weighted/lam_0.05`)

| Metric | v13 λ=0.05 (★) | v12 seed_44 (prior best) |
|---|---:|---:|
| Top-1 cell accuracy | **29.7 %** | 28.8 % |
| Top-5 cell accuracy | **58.6 %** | 57.8 % |
| Median haversine error | **51.9 km** | 54.6 km |
| Mean haversine error | **122.6 km** | 130.3 km |
| within 1 km | 0.1 % | 0.1 % |
| within 25 km | **31.4 %** | 30.5 % |
| within 200 km | **81.0 %** | 80.0 % |
| within 750 km | **98.6 %** | 98.1 % |
| within 2500 km | 100.0 % | 100.0 % |
| **balscore** | **63.1** | 62.1 |

The v13 head is a 10-epoch warm-started fine-tune from `ablation_v12/seed_44/best.pt` with the new `ce_haversine_weighted` loss (`L = CE + λ · mean_haversine(posterior-expected lat/lon, true GPS)`, λ=0.05). The fine-tune additionally improves cross-domain generalization on out-of-distribution test sets (im2gps Europe top-1 14.4 → 20.0; OSM-Europe-1k 4-view mean km 716 → 629).

### Intended use

* **Educational / demo**: explore CLIP-based classifier behaviour under different geocell granularities.
* **Reference implementation**: starting point for further geolocation research using the StreetCLIP vision backbone.
* **Thesis demonstrator**: companion artefact for the master's thesis at FH JOANNEUM University of Applied Sciences.

### Out-of-scope / limitations

* **Geographic scope**: Europe only. Predictions outside Europe are meaningless extrapolations.
* **Visual scope**: Street View-style imagery (driver perspective, outdoor, daylight). Aerial photos, indoor shots, screenshots, and composite imagery are likely to fail.
* **Resolution ceiling**: predictions are cell centroids; per-cell median errors range from ~16 km (k=4000 EU) to ~120 km (k=500 EU).
* **Class imbalance**: dataset is skewed toward Western/Central Europe. Predictions for under-represented regions carry higher uncertainty.
* **Privacy**: do **not** use this model to identify or track individuals.

### Citation

```bibtex
@misc{streetclip2023,
  title  = {StreetCLIP: A Robust Image-Language Model for Generalizable Geolocation},
  author = {Lukas Haas and Silas Alberti and Michal Skreta},
  year   = {2023},
  url    = {https://huggingface.co/geolocal/StreetCLIP}
}

@mastersthesis{leber2026streetview,
  title  = {Street View Image Geolocation in Europe via Geocell Classification},
  author = {Florian Leber},
  school = {FH JOANNEUM University of Applied Sciences},
  year   = {2026}
}
```

### Contact

Florian Leber, [`florian.leber@edu.fh-joanneum.at`](mailto:florian.leber@edu.fh-joanneum.at)

Master's thesis at FH JOANNEUM University of Applied Sciences.

Source:

### License

MIT. Released for research / educational use.