# BirdCLEF 2026: Top-Scoring Solution

Multi-model ensemble approach based on state-of-the-art (SOTA) bioacoustic research.
## Architecture

| Model | Backbone | Params | Pre-training | Weight in Ensemble |
|---|---|---|---|---|
| Bird-MAE-Large | ViT-L/16 | 302M | SSL on XCL-1.7M (Xeno-Canto) | 60% |
| EfficientNet-B1 | EfficientNet-B1 | 19M | BirdSet-XCL (9,735 species) | 40% |
## Key Techniques

1. Bird-MAE (arXiv:2504.12880) – primary model
   - Domain-specific SSL pre-training on 1.7M Xeno-Canto bird recordings
   - Fine-tuning with Asymmetric Loss (handles noisy multi-label annotations)
   - Layer-wise LR decay (0.75) for stable ViT fine-tuning
   - 2-stage schedule: 2 epochs with frozen backbone → 28 epochs of full fine-tuning
2. Domain adaptation (focal → soundscape)
   - Waveform mixup (p=0.9, up to 3 sources) – simulates co-occurring species in soundscapes
   - Cyclic rolling (p=1.0) – removes position bias
   - Background noise injection (p=0.5, SNR 3–30 dB)
   - Colored noise (p=0.2, spectral slope -2 to +2)
   - SpecAugment: freq-mask (50, p=0.3) + time-mask (100, p=0.3)
   - Energy-based window selection (from Perch 2.0)
3. Ensemble + post-processing
   - 5-fold CV × 2 architectures = 10 models total
   - Logit averaging + no-call detection + TTA (time-reversal + gain)
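The waveform mixup listed above can be sketched in a few lines of NumPy. Only p=0.9 and the 3-source cap come from this README; the gain range and uniform source sampling are illustrative assumptions, not the solution's exact recipe:

```python
import numpy as np

def waveform_mixup(waves, labels, p=0.9, max_sources=3, rng=None):
    """Mix up to `max_sources` clips to simulate co-occurring species.

    waves: (N, T) float array of raw audio; labels: (N, C) multi-hot.
    Returns one mixed waveform and the union of the mixed clips' labels.
    """
    rng = np.random.default_rng(rng)
    base = rng.integers(len(waves))
    mixed_wave = waves[base].copy()
    mixed_label = labels[base].copy()
    if rng.random() < p:
        # 1 or 2 extra sources, so at most max_sources clips in total
        n_extra = rng.integers(1, max_sources)
        for idx in rng.choice(len(waves), size=n_extra, replace=False):
            gain = rng.uniform(0.3, 1.0)  # assumed gain range (illustrative)
            mixed_wave += gain * waves[idx]
            mixed_label = np.maximum(mixed_label, labels[idx])  # label union
    return mixed_wave, mixed_label
```

Because the labels are combined with a max (union) rather than interpolated, the loss still treats every mixed-in species as fully present, which is why this pairs naturally with a multi-label loss such as ASL.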
## Quick Start

```bash
# Download the competition data
kaggle competitions download -c birdclef-2026

# Train Bird-MAE-Large (primary model, 5 folds)
for fold in 0 1 2 3 4; do
  python train_birdclef.py \
    --data_dir ./data/train_audio \
    --metadata ./data/train_metadata.csv \
    --output_dir ./outputs/birdmae \
    --hub_model_id YOUR_USERNAME/birdclef2026-birdmae \
    --epochs 30 --batch_size 32 --lr 3e-4 --fold $fold
done

# Train EfficientNet-B1 (ensemble member, 5 folds)
for fold in 0 1 2 3 4; do
  python train_effnet.py \
    --data_dir ./data/train_audio \
    --metadata ./data/train_metadata.csv \
    --output_dir ./outputs/effnet \
    --hub_model_id YOUR_USERNAME/birdclef2026-effnet \
    --epochs 50 --batch_size 64 --lr 5e-4 --fold $fold
done

# Run inference (TTA for the Bird-MAE folds only), then ensemble with 60/40 weights
python inference_birdclef.py --test_dir ./data/test_soundscapes --model_dir ./outputs/birdmae --output sub_birdmae.csv --tta
python inference_birdclef.py --test_dir ./data/test_soundscapes --model_dir ./outputs/effnet --output sub_effnet.csv
python ensemble_submit.py --submissions sub_birdmae.csv sub_effnet.csv --weights 0.6 0.4 --output final_submission.csv
```
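For reference, a minimal sketch of the weighted averaging step that `ensemble_submit.py` performs. The column layout (a `row_id` column followed by one score column per species, identical row order across files) is an assumption about the submission format, and the real script additionally applies post-processing such as no-call detection that is not shown here:

```python
import numpy as np
import pandas as pd

def ensemble_submissions(paths, weights):
    """Weighted average of per-species scores across submission CSVs.

    Assumes each CSV has a `row_id` column followed by score columns,
    with all files sharing the same rows and columns in the same order.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize, e.g. 0.6 / 0.4
    dfs = [pd.read_csv(p) for p in paths]
    out = dfs[0].copy()
    score_cols = out.columns[1:]  # everything except row_id
    stacked = np.stack([df[score_cols].to_numpy() for df in dfs])
    # tensordot over the model axis -> (rows, species) weighted average
    out[score_cols] = np.tensordot(weights, stacked, axes=1)
    return out
```

Averaging is done on whatever scores the CSVs contain; averaging raw logits before the sigmoid (as the logit-averaging note above suggests) would require the per-fold model outputs rather than the finished submissions.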
## Files

| File | Description |
|---|---|
| `train_birdclef.py` | Bird-MAE-Large training (primary model) |
| `train_effnet.py` | EfficientNet-B1 training (ensemble member) |
| `inference_birdclef.py` | Inference with multi-fold ensemble + TTA |
| `ensemble_submit.py` | Combine predictions + post-processing |
## Hardware Requirements

| Model | GPU | VRAM | Time/fold |
|---|---|---|---|
| Bird-MAE-Large | A100 80GB | ~40 GB | ~6–8 h |
| EfficientNet-B1 | A10G 24GB | ~8 GB | ~3–4 h |
## Dependencies

```text
torch>=2.0
torchaudio>=2.0
transformers==4.48.0
librosa
scikit-learn
pandas
numpy
soundfile
trackio
huggingface_hub
```
## Hyperparameters (from published papers)

### Bird-MAE-Large Fine-tuning

| Parameter | Value | Source |
|---|---|---|
| Learning rate | 3e-4 | Bird-MAE, Table 10 |
| Weight decay | 3e-4 | Bird-MAE, Table 10 |
| Layer decay | 0.75 | Bird-MAE, Table 10 |
| Batch size | 32 | Adjusted for A100 |
| Epochs | 30 | Bird-MAE |
| Freeze epochs | 2 | sl-BEATs recipe |
| Loss | Asymmetric (γ_neg=4, γ_pos=0, clip=0.05) | ASL paper |
| Gradient clip | 2.0 | Bird-MAE |
| Sample rate | 32,000 Hz | Bird-MAE |
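The Asymmetric Loss row above fully specifies the loss. As a sketch of the math with those exact parameters (written in NumPy for compactness; the training code itself uses PyTorch):

```python
import numpy as np

def asymmetric_loss(logits, targets, gamma_neg=4.0, gamma_pos=0.0,
                    clip=0.05, eps=1e-8):
    """Asymmetric Loss (Ridnik et al., 2021) for noisy multi-label targets.

    Negative probabilities are shifted down by `clip` and focused with
    gamma_neg=4, so easy or mislabeled negatives contribute almost nothing;
    gamma_pos=0 leaves the positive term as plain BCE.
    """
    p = 1.0 / (1.0 + np.exp(-logits))      # sigmoid
    p_neg = np.clip(p - clip, 0.0, None)   # probability shifting (margin)
    loss_pos = targets * (1.0 - p) ** gamma_pos * np.log(np.clip(p, eps, None))
    loss_neg = (1.0 - targets) * p_neg ** gamma_neg \
        * np.log(np.clip(1.0 - p_neg, eps, None))
    return -(loss_pos + loss_neg).mean()
```

With clip=0.05, any negative predicted below probability 0.05 incurs exactly zero loss, which is what makes the loss robust to the unlabeled co-occurring species common in Xeno-Canto recordings.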
### EfficientNet-B1

| Parameter | Value | Source |
|---|---|---|
| Learning rate | 5e-4 | BirdSet + EffNetB0-all |
| Weight decay | 0.01 | sl-BEATs recipe |
| Batch size | 64 | – |
| Epochs | 50 | EffNetB0-all recipe |
| Loss | BCE | Standard |
## References

- Bird-MAE: Rauch et al., "Can Masked Autoencoders Also Listen to Birds?", 2025 (arXiv:2504.12880)
- sl-BEATs-all: "What Matters for Bioacoustic Encoding", ICLR 2026 (arXiv:2508.11845)
- Perch 2.0: "The Bittern Lesson for Bioacoustics", 2025 (arXiv:2508.04665)
- FINCH: "Adaptive Evidence Weighting for Audio-Spatiotemporal Fusion", 2026 (arXiv:2602.03817)
- BirdSet: Rauch et al., "BirdSet: A Multi-Task Benchmark", 2024 (arXiv:2403.10380)
- Asymmetric Loss: Ridnik et al., 2021 (arXiv:2009.14119)