---
license: mit
library_name: pytorch
pipeline_tag: image-classification
tags:
  - geolocation
  - street-view
  - clip
  - streetclip
  - geocell
  - europe
  - thesis
base_model: geolocal/StreetCLIP
language:
  - en
datasets:
  - private-streetview-europe
metrics:
  - accuracy
  - haversine-distance
---

# Street View Europe — Geocell Classification Heads

Trained linear classification heads for European Street View geolocation,
sitting on top of a frozen [`geolocal/StreetCLIP`](https://huggingface.co/geolocal/StreetCLIP)
vision encoder (1024-dim pooled features).

Each head predicts one of `N` discrete **geocells** (a partition of Europe
into bounded regions). The cell centroids ship alongside each head as
`geocell_info.json`, so a prediction can be turned into a (lat, lon) by
indexing into the file.

These checkpoints power the live demo at
[**lebfla11/streetview-eu-geocell-demo**](https://huggingface.co/spaces/lebfla11/streetview-eu-geocell-demo).

---

## Quick start

```python
import json, torch, torch.nn as nn
from huggingface_hub import hf_hub_download
from transformers import CLIPImageProcessor, CLIPModel
from PIL import Image

REPO = "lebfla11/streetview-eu-geocell-heads"

# 1. Pull a head + its partition (★ headline: v13 ce_haversine λ=0.05)
head_path = hf_hub_download(REPO, "ablation_v13_weighted/lam_0.05/best.pt")
info_path = hf_hub_download(REPO,
    "data_collection/outputs/geocells_h3_res4_eu/geocell_info.json")

centroids = {int(k): (v["centroid_lat"], v["centroid_lon"])
             for k, v in json.load(open(info_path)).items()}
n_cells = len(centroids)

# 2. Build the head (matches HeadOnlyModel(feat_dim=1024, fusion='feature'))
class Head(nn.Module):
    def __init__(self, n_cells, dropout=0.3):
        super().__init__()
        self.head = nn.Sequential(
            nn.LayerNorm(1024), nn.Dropout(dropout), nn.Linear(1024, n_cells))
    def forward(self, feats):
        # feats: (B, V, 1024) for V views, mean-pool then head
        if feats.dim() == 3:
            feats = feats.mean(dim=1)
        return self.head(feats)

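head = Head(n_cells).eval()
# weights_only=False unpickles arbitrary Python objects: only load checkpoints you trust.
ck = torch.load(head_path, map_location="cpu", weights_only=False)
state = ck["model"] if isinstance(ck, dict) and "model" in ck else ck
# strict=False tolerates extra keys, so report any head weights the checkpoint did not provide.
result = head.load_state_dict(state, strict=False)
if result.missing_keys:
    print("warning: head parameters missing from checkpoint:", result.missing_keys)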

# 3. Encode an image with frozen StreetCLIP
clip = CLIPModel.from_pretrained("geolocal/StreetCLIP").vision_model.eval()
proc = CLIPImageProcessor.from_pretrained("geolocal/StreetCLIP")
img  = Image.open("street_view.jpg").convert("RGB")
px   = proc(images=img, return_tensors="pt")["pixel_values"]
with torch.no_grad():
    feat = clip(pixel_values=px).pooler_output.float()  # (1, 1024)
    logits = head(feat)
    probs = logits.softmax(-1)

top5 = probs[0].topk(5)
for prob, cell in zip(top5.values, top5.indices):
    lat, lon = centroids[int(cell)]
    print(f"cell {int(cell):4d}  {prob.item()*100:5.1f}%  ({lat:.3f}, {lon:.3f})")
```

For a head trained with `fusion='logit'` (the v12 logit_lr5e-3 family — not
shipped here, see the demo Space for context), apply the linear head per
view and average the logits instead of the features.
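
The difference between the two fusion modes can be sketched as follows. This is a minimal illustration with random weights, not the training code: `fuse_features` and `fuse_logits` are hypothetical helper names, the 8-cell output size is arbitrary, and the layer stack simply mirrors the `Head` from the quick start.

```python
import torch
import torch.nn as nn

# Same layer stack as the quick-start Head; weights are random, 8 cells is arbitrary.
layers = nn.Sequential(nn.LayerNorm(1024), nn.Dropout(0.3), nn.Linear(1024, 8)).eval()

def fuse_features(feats: torch.Tensor) -> torch.Tensor:
    """fusion='feature': average the V view features, then apply the head once."""
    return layers(feats.mean(dim=1))      # (B, V, 1024) -> (B, 1024) -> (B, N)

def fuse_logits(feats: torch.Tensor) -> torch.Tensor:
    """fusion='logit': apply the head to every view, then average the logits."""
    return layers(feats).mean(dim=1)      # (B, V, 1024) -> (B, V, N) -> (B, N)

feats = torch.randn(2, 4, 1024)           # 2 locations, 4 headings each
with torch.no_grad():
    print(fuse_features(feats).shape, fuse_logits(feats).shape)
```

Both variants return one logit vector per location. They are not interchangeable, though: the `LayerNorm` in front of the linear layer is non-linear, so averaging before or after the head gives different logits.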

---

## Model details

| | |
|---|---|
| **Architecture** | Linear classification head: `LayerNorm(1024) → Dropout → Linear(1024, N_cells)` on top of frozen StreetCLIP vision encoder |
| **Backbone** | [`geolocal/StreetCLIP`](https://huggingface.co/geolocal/StreetCLIP) (frozen, 1024-d pooled output, image size 336×336) |
| **Multi-view fusion** | Mean-pool of per-view features before the head (`fusion='feature'`) |
| **Training framework** | PyTorch 2.6, AMP, AdamW, cosine LR schedule, label smoothing 0.1, ~80 epochs (v12 family) with early-stopping patience 15. The headline v13 head is a 10-epoch warm-started fine-tune from v12 seed_44. |
| **Loss** | v12 family: class-weighted cross-entropy (inverse cell frequency). **v13 (headline)**: weighted CE + λ · mean haversine on the posterior-expected lat/lon (probability-weighted spherical mean over all cell centroids). λ = 0.05 for the headline checkpoint. |
| **Selection metric** | `balscore` = ∛(within_25 × within_200 × within_750), the geometric mean of the three distance-band accuracies, each expressed in percent (0 to 100) |
| **Trainable params** | (1024 + 1) × N_cells + 2048 per head (Linear weight and bias, plus LayerNorm weight and bias): under 6 M for the largest partition |
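
As a quick sanity check on the parameter budget, the count can be reproduced by hand. The sketch below assumes the head is exactly the quick-start stack (`LayerNorm(1024)` followed by `Linear(1024, N_cells)` with its default bias); `head_params` is a hypothetical helper, and the cell counts are taken from the table of available heads.

```python
def head_params(n_cells: int, feat_dim: int = 1024) -> int:
    """Trainable parameters of LayerNorm(feat_dim) -> Dropout -> Linear(feat_dim, n_cells)."""
    layer_norm = 2 * feat_dim                 # LayerNorm weight + bias
    linear = feat_dim * n_cells + n_cells     # Linear weight + bias
    return layer_norm + linear                # Dropout has no parameters

for name, n_cells in [("H3 res=4 EU", 1898), ("K-d tree t=19 EU", 5444)]:
    print(f"{name}: {head_params(n_cells):,} trainable parameters")
# H3 res=4 EU: 1,947,498 trainable parameters
# K-d tree t=19 EU: 5,582,148 trainable parameters
```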

### Available heads

All checkpoints were trained on the **leakage-free** train/val/test split
(no shared `lat,lon` key between splits). Earlier `runs/clusters/*` runs
that used a leaky split are not included here.

| Path in repo | Partition | # cells | Top-1 | balscore |
|---|---|---:|---:|---:|
| **★ `ablation_v13_weighted/lam_0.05/best.pt`** | H3 res=4 EU | 1 898 | **29.7 %** | **63.1** |
| `ablation_v13_weighted/lam_0.01/best.pt` | H3 res=4 EU | 1 898 | 29.3 % | 62.8 |
| `ablation_v12/seed_44/best.pt` | H3 res=4 EU | 1 898 | 28.8 % | 62.1 |
| `sweep_v5/streetclip/h3_res4_eu/best.pt` | H3 res=4 EU | 1 898 | 28.3 % | 61.5 |
| `sweep_v5/streetclip/k2000_eu/best.pt` | K-Means k=2000 EU | 1 792 | 27.5 % | 60.0 |
| `sweep_v5/streetclip/k4000_eu/best.pt` | K-Means k=4000 EU | 3 180 | 22.9 % | 60.0 |
| `sweep_v5/streetclip/kdtree19_eu/best.pt` | K-d tree t=19 EU | 5 444 | 15.7 % | 58.6 |
| `sweep_v5/streetclip/kdtree39_eu/best.pt` | K-d tree t=39 EU | 2 059 | 21.8 % | 58.4 |
| `sweep_v5/streetclip/k1000_eu/best.pt` | K-Means k=1000 EU | 957 | 36.2 % | 57.9 |
| `sweep_v5/streetclip/nuts3_eu/best.pt` | NUTS-3 EU | 1 194 | 37.6 % | 56.8 |
| `sweep_v5/streetclip/kdtree78_eu/best.pt` | K-d tree t=78 EU | 1 029 | 27.5 % | 55.3 |
| `sweep_v5/streetclip/k500_eu/best.pt` | K-Means k=500 EU | 487 | 43.4 % | 51.7 |
| `sweep_v5/streetclip/kdtree155_eu/best.pt` | K-d tree t=155 EU | 515 | 36.1 % | 50.7 |

For each head, the matching partition lives at
`data_collection/outputs/geocells_<name>/geocell_info.json` (centroids only —
the training-time `geocell_map.json` and `splits.json` are not shipped).

### Training data

* **Source**: Internal Google Street View collection
* **Size**: 79 144 European locations × ~4 headings = 316 498 images
* **Coverage**: 41 European countries
* **Split**: 80 / 10 / 10 train/val/test, assigned per location so that the
  splits are leakage-free (no shared `lat,lon` key between splits)
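
One way to obtain a split with this property is to assign whole locations, rather than individual images, to a split. The sketch below is an illustration only and not the split code used for these checkpoints: `split_by_location`, the `f"{lat},{lon}"` key format, and the toy sample list are all assumptions.

```python
import random
from collections import defaultdict

def split_by_location(samples, seed=0, fractions=(0.8, 0.1)):
    """samples: iterable of (lat, lon, image_id). Returns {split: [(key, image_id), ...]}."""
    groups = defaultdict(list)
    for lat, lon, image_id in samples:
        groups[f"{lat},{lon}"].append(image_id)   # all headings of a location share one key

    keys = sorted(groups)
    random.Random(seed).shuffle(keys)             # shuffle locations, not images
    n_train = int(fractions[0] * len(keys))
    n_val = int(fractions[1] * len(keys))
    parts = {
        "train": keys[:n_train],
        "val": keys[n_train:n_train + n_val],
        "test": keys[n_train + n_val:],
    }
    return {name: [(k, img) for k in ks for img in groups[k]] for name, ks in parts.items()}

# Toy data: 10 locations with 4 headings each.
samples = [(48.0 + i, 16.0 + i, f"loc{i}_h{h}") for i in range(10) for h in (0, 90, 180, 270)]
splits = split_by_location(samples)
print({name: len(items) for name, items in splits.items()})
```

Because every image of a location carries the same key, no `lat,lon` key can end up in more than one split.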

### Evaluation (test split, headline head: `ablation_v13_weighted/lam_0.05`)

| Metric | v13 λ=0.05 (★) | v12 seed_44 (prior best) |
|---|---:|---:|
| Top-1 cell accuracy | **29.7 %** | 28.8 % |
| Top-5 cell accuracy | **58.6 %** | 57.8 % |
| Median haversine error | **51.9 km** | 54.6 km |
| Mean haversine error | **122.6 km** | 130.3 km |
| within 1 km | 0.1 % | 0.1 % |
| within 25 km | **31.4 %** | 30.5 % |
| within 200 km | **81.0 %** | 80.0 % |
| within 750 km | **98.6 %** | 98.1 % |
| within 2500 km | 100.0 % | 100.0 % |
| **balscore** | **63.1** | 62.1 |
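
The `balscore` row can be reproduced from the three distance-band rows of this table using the definition given under model details. A minimal check, with the band accuracies entered in percent (`balscore` is written here as a plain helper function, not the project's evaluation code):

```python
def balscore(within_25: float, within_200: float, within_750: float) -> float:
    """Geometric mean of the three distance-band accuracies (all in percent)."""
    return (within_25 * within_200 * within_750) ** (1 / 3)

print(round(balscore(31.4, 81.0, 98.6), 1))   # v13 λ=0.05  -> 63.1
print(round(balscore(30.5, 80.0, 98.1), 1))   # v12 seed_44 -> 62.1
```

Because it is a geometric mean, a head cannot reach a high `balscore` by doing well in only one band: a collapse at 25 km pulls the score down even if the 750 km accuracy is close to 100 %.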

The v13 head is a 10-epoch warm-started fine-tune from `ablation_v12/seed_44/best.pt`
with the new `ce_haversine_weighted` loss
(`L = CE + λ · mean_haversine(posterior-expected lat/lon, true GPS)`,
λ=0.05). The fine-tune also improves cross-domain generalization on
out-of-distribution test sets (im2gps Europe top-1 accuracy 14.4 % → 20.0 %;
OSM-Europe-1k 4-view mean error 716 km → 629 km).
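
The loss described above can be sketched in PyTorch as follows. This is a reconstruction from the description in this card, not the training code: the function names, the use of kilometres as the distance unit, the numerical clamps, and the toy centroids are all assumptions. `class_weights` stands in for the inverse-cell-frequency weights.

```python
import torch
import torch.nn.functional as F

EARTH_RADIUS_KM = 6371.0

def to_unit_vectors(lat_deg, lon_deg):
    """Degrees -> 3-D unit vectors on the sphere, shape (..., 3)."""
    lat, lon = torch.deg2rad(lat_deg), torch.deg2rad(lon_deg)
    return torch.stack([lat.cos() * lon.cos(), lat.cos() * lon.sin(), lat.sin()], dim=-1)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = (torch.deg2rad(x) for x in (lat1, lon1, lat2, lon2))
    a = torch.sin((lat2 - lat1) / 2) ** 2 + lat1.cos() * lat2.cos() * torch.sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * torch.asin(a.clamp(1e-12, 1.0).sqrt())  # clamp keeps gradients finite

def ce_haversine_loss(logits, target_cell, true_lat, true_lon, cent_lat, cent_lon,
                      lam=0.05, class_weights=None, label_smoothing=0.1):
    ce = F.cross_entropy(logits, target_cell, weight=class_weights, label_smoothing=label_smoothing)
    probs = logits.softmax(-1)                                    # (B, N) posterior over cells
    mean_vec = F.normalize(probs @ to_unit_vectors(cent_lat, cent_lon), dim=-1)  # spherical mean
    exp_lat = torch.rad2deg(torch.asin(mean_vec[:, 2].clamp(-1.0, 1.0)))
    exp_lon = torch.rad2deg(torch.atan2(mean_vec[:, 1], mean_vec[:, 0]))
    return ce + lam * haversine_km(exp_lat, exp_lon, true_lat, true_lon).mean()

# Toy example: 3 cells, 4 samples whose true GPS equals their cell centroid.
cent_lat = torch.tensor([48.2, 52.5, 40.4])
cent_lon = torch.tensor([16.4, 13.4, -3.7])
logits = torch.randn(4, 3, requires_grad=True)
target = torch.tensor([0, 1, 2, 0])
loss = ce_haversine_loss(logits, target, cent_lat[target], cent_lon[target], cent_lat, cent_lon)
loss.backward()
print(float(loss))
```

The distance term is differentiable with respect to the logits because the expected position is a smooth function of the softmax probabilities. It penalises probability mass placed on far-away cells more than mass placed on neighbouring cells, which plain cross-entropy does not distinguish.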

### Intended use

* **Educational / demo**: explore CLIP-based classifier behaviour under
  different geocell granularities.
* **Reference implementation**: starting point for further geolocation
  research using the StreetCLIP vision backbone.
* **Thesis demonstrator**: companion artefact for the master's thesis at
  FH JOANNEUM University of Applied Sciences.

### Out-of-scope / limitations

* **Geographic scope**: Europe only. Every head can only output European
  cells, so predictions for images taken outside Europe are meaningless.
* **Visual scope**: Street View-style imagery (driver perspective,
  outdoor, daylight). Aerial photos, indoor shots, screenshots, and
  composite imagery are likely to fail.
* **Resolution ceiling**: predictions are cell centroids; per-cell median
  errors range from ~16 km (k=4000 EU) to ~120 km (k=500 EU).
* **Class imbalance**: dataset is skewed toward Western/Central Europe.
  Predictions for under-represented regions carry higher uncertainty.
* **Privacy**: do **not** use this model to identify or track individuals.

### Citation

```bibtex
@misc{streetclip2023,
  title  = {StreetCLIP: A Robust Image-Language Model for Generalizable Geolocation},
  author = {Lukas Haas and Silas Alberti and Michal Skreta},
  year   = {2023},
  url    = {https://huggingface.co/geolocal/StreetCLIP}
}

@mastersthesis{leber2026streetview,
  title  = {Street View Image Geolocation in Europe via Geocell Classification},
  author = {Florian Leber},
  school = {FH JOANNEUM University of Applied Sciences},
  year   = {2026}
}
```

### Contact

* Florian Leber: [`florian.leber@edu.fh-joanneum.at`](mailto:florian.leber@edu.fh-joanneum.at)
* Master's thesis at FH JOANNEUM University of Applied Sciences
* Source: <https://git-iit.fh-joanneum.at/leberflo19/europestreetviewgeolocator>

### License

MIT. Released for research / educational use.