NightOwl-CodeEmbedding 🦉

NightOwl-CodeEmbedding is a compact 768-dimensional dense embedding model specialized for code retrieval, code-edit retrieval, and technical question answering.

The model is fine-tuned from Shuu12121/NightOwl, a ModernBERT-based code model. It uses CLS pooling with cosine similarity and does not require query: / passage: style prefixes.

Highlights

Compact (150.8M parameters) yet competitive on CoIR-style code retrieval benchmarks
Ranks 18th of 241 models overall on the MTEB(Code, v1) leaderboard and is the top-scoring single-vector model under 300M parameters among scored entries on the official board, ahead of many models an order of magnitude larger (see Leaderboard Standing)
Covers eight programming languages, including Rust and TypeScript in addition to the six CodeSearchNet languages
Handles a wide range of code retrieval scenarios: NL-to-code search, code-to-code retrieval, code-edit retrieval, and technical QA
Trained with hard negatives mined by Qwen/Qwen3-Embedding-0.6B (15 hard negatives per anchor)
Decontaminated against CodeSearchNet test splits and the CodeEditSearchRetrieval benchmark (see Data Decontamination)
Drop-in compatible with sentence-transformers; released under the Apache-2.0 license

Supported Languages

The training data covers the six CodeSearchNet languages plus two additional languages:

Go, Java, JavaScript, PHP, Python, Ruby (CodeSearchNet languages)
Rust, TypeScript (additional)

Performance on languages outside this set is not guaranteed and may vary.

Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Shuu12121/NightOwl-CodeEmbedding")

queries = ["Python function that sorts a list in descending order"]
documents = [
    "def sort_desc(values): return sorted(values, reverse=True)",
    "def average(values): return sum(values) / len(values)",
]

query_embeddings = model.encode(queries)
document_embeddings = model.encode(documents)

# Cosine similarity (embeddings are normalized internally by similarity())
scores = model.similarity(query_embeddings, document_embeddings)
print(scores)

Model Details

Property	Value
Base model	`Shuu12121/NightOwl`
Architecture	ModernBERT
Parameters	150,779,136
Embedding dimension	768
Pooling	CLS pooling
Maximum sequence length	1,024 tokens
Similarity	Cosine similarity
Query/document prefixes	Not required
Weight dtype	FP32
Weight memory	575 MiB
License	Apache-2.0
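
The weight-memory figure follows from the parameter count and the FP32 dtype. A quick check, assuming 4 bytes per parameter and counting only the parameters listed in the table:

```python
# Back-of-the-envelope weight memory for FP32 weights (4 bytes per parameter).
# Assumption: only the parameters from the table are counted; buffers and
# activations are ignored.
PARAMETERS = 150_779_136
BYTES_PER_FP32 = 4

weight_bytes = PARAMETERS * BYTES_PER_FP32
weight_mib = weight_bytes / (1024 ** 2)

print(f"{weight_mib:.1f} MiB")  # rounds to the 575 MiB shown in the table
```

Casting the weights to a 16-bit dtype would roughly halve this number, although the card only reports FP32 weights.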

MTEB Results

The model was evaluated with MTEB on code-related retrieval and technical QA tasks.

Evaluation setup:

Model revision: c7c8a57b9539297e192d5cf39b9aecf1fb376edd
MTEB version: 2.15.1
Metric: NDCG@10
Hardware: NVIDIA GeForce RTX 5090
Batch size: 64

Multi-subset task scores are reported as macro averages.

Task	Split	NDCG@10
AppsRetrieval	test	0.39177
COIRCodeSearchNetRetrieval	test	0.84264
CodeEditSearchRetrieval	train¹	0.74808
CodeFeedbackMT	test	0.76690
CodeFeedbackST	test	0.85207
CodeSearchNetCCRetrieval	test	0.91805
CodeSearchNetRetrieval	test	0.89239
CodeTransOceanContest	test	0.75953
CodeTransOceanDL	test	0.36057
CosQA	test	0.42810
StackOverflowQA	test	0.86608
SyntheticText2SQL	test	0.68266
Macro average, all 12 tasks		0.70907
CoIR macro average, 10 tasks		0.68684

¹ CodeEditSearchRetrieval does not provide a standard test split in MTEB, so the official train split is used for evaluation. These examples were not used for fine-tuning. See Data Decontamination for details.
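
The two macro averages can be reproduced from the per-task rows. The sketch below copies the scores from the table; treating the 10-task CoIR subset as "everything except CodeEditSearchRetrieval and CodeSearchNetRetrieval" is an assumption, chosen because it reproduces the reported value.

```python
# NDCG@10 values copied from the table above.
scores = {
    "AppsRetrieval": 0.39177,
    "COIRCodeSearchNetRetrieval": 0.84264,
    "CodeEditSearchRetrieval": 0.74808,
    "CodeFeedbackMT": 0.76690,
    "CodeFeedbackST": 0.85207,
    "CodeSearchNetCCRetrieval": 0.91805,
    "CodeSearchNetRetrieval": 0.89239,
    "CodeTransOceanContest": 0.75953,
    "CodeTransOceanDL": 0.36057,
    "CosQA": 0.42810,
    "StackOverflowQA": 0.86608,
    "SyntheticText2SQL": 0.68266,
}

# Assumption: these two tasks are the ones outside the 10-task CoIR subset.
NON_COIR = {"CodeEditSearchRetrieval", "CodeSearchNetRetrieval"}

macro_all = sum(scores.values()) / len(scores)
coir_scores = [value for task, value in scores.items() if task not in NON_COIR]
macro_coir = sum(coir_scores) / len(coir_scores)

print(f"all 12 tasks: {macro_all:.5f}")
print(f"CoIR 10 tasks: {macro_coir:.5f}")
```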

Leaderboard Standing

On the public MTEB(Code, v1) leaderboard, the model's task-mean score (70.91, i.e. the 12-task macro average above ×100) places it as follows:

#18 of 241 models overall, ahead of many models that are an order of magnitude larger
#6 of 155 among sub-1B-parameter dense single-vector models, and the smallest model in that top six. The five models ranked above it are all ≈0.33–0.6B parameters (F2LLM-v2-0.6B, F2LLM-v2-330M, pplx-embed-v1-0.6b, C2LLM-0.5B, Qwen3-Embedding-0.6B), i.e. roughly 2–4× larger.
#1 among ranked dense single-vector models under 300M parameters (the leaderboard's small-model view)
#2 in that under-300M view once late-interaction / multi-vector models are included, behind only lightonai/LateOn-Code (a multi-vector late-interaction model; see the head-to-head below)

Reading the numbers fairly. MTEB(Code, v1) reports a zero-shot % for each model — the fraction of leaderboard tasks the model was not trained on. NightOwl-CodeEmbedding is 8% zero-shot: it was trained on most of these task families, so its score reflects strong in-domain retrieval rather than zero-shot transfer. Models marked 100% (e.g. embeddinggemma-300m, the granite-embedding r2 family, Qwen3-Embedding) are evaluated fully out-of-domain, so a raw score comparison across rows with different zero-shot % is not apples-to-apples. The fairest direct comparisons are to other code-specialized models at similar zero-shot levels (e.g. LateOn-Code at 8%, the F2LLM / C2LLM families at 8–58%).
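
The 8% figure is consistent with the task families listed under Training Data: eleven of the twelve benchmark tasks appear there. The sketch below assumes zero-shot % is simply the share of benchmark tasks absent from that list, which matches the definition given above but may not be exactly how the leaderboard computes it.

```python
# The twelve MTEB(Code, v1) tasks reported in this card.
BENCHMARK_TASKS = [
    "AppsRetrieval", "COIRCodeSearchNetRetrieval", "CodeEditSearchRetrieval",
    "CodeFeedbackMT", "CodeFeedbackST", "CodeSearchNetCCRetrieval",
    "CodeSearchNetRetrieval", "CodeTransOceanContest", "CodeTransOceanDL",
    "CosQA", "StackOverflowQA", "SyntheticText2SQL",
]

# Task families named in the Training Data section of this card.
TRAINED_FAMILIES = {
    "AppsRetrieval", "COIRCodeSearchNetRetrieval", "CodeFeedbackMT",
    "CodeFeedbackST", "CodeSearchNetCCRetrieval", "CodeSearchNetRetrieval",
    "CodeTransOceanContest", "CodeTransOceanDL", "CosQA",
    "StackOverflowQA", "SyntheticText2SQL",
}

# Assumption: zero-shot % = tasks not seen in training / all benchmark tasks.
unseen = [task for task in BENCHMARK_TASKS if task not in TRAINED_FAMILIES]
zero_shot_pct = 100 * len(unseen) / len(BENCHMARK_TASKS)

print(unseen, f"{zero_shot_pct:.0f}%")
```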

Comparison with similar-sized models

The table below compares NightOwl-CodeEmbedding with other compact code / general embedding models on MTEB(Code, v1), with a size ladder of larger models for reference. Score is the leaderboard task mean (higher is better); the Zero-shot column is the share of tasks the model did not train on.

Model	Params	Type	Emb. dim	Max tokens	Zero-shot	MTEB(Code, v1) ↑
`NightOwl-CodeEmbedding` (this model)	150.8M	single-vector	768	1,024	8%	70.91
`codefuse-ai/F2LLM-v2-160M`	159M	single-vector	640	40,960	58%	70.38
`google/embeddinggemma-300m`	308M	single-vector	768	2,048	100%	68.76
`codefuse-ai/F2LLM-v2-80M`	80M	single-vector	320	40,960	58%	67.97
`ibm-granite/granite-embedding-311m-multilingual-r2`	312M	single-vector	768	8,192	100%	63.84
Late-interaction (multi-vector) reference
`lightonai/LateOn-Code`	149M	multi-vector	128 (per-tok)	2,048	8%	74.12
Larger single-vector reference (size ladder)
`codefuse-ai/F2LLM-v2-0.6B` (#1 sub-1B)	596M	single-vector	1,024	40,960	58%	77.41
`Qwen/Qwen3-Embedding-0.6B`	596M	single-vector	1,024	32,768	100%	75.42
`codefuse-ai/F2LLM-v2-14B` (#1 overall)	13.99B	single-vector	5,120	40,960	58%	80.75

Takeaways:

Among compact single-vector dense models, NightOwl-CodeEmbedding is the strongest entry in the leaderboard's small-model view while also being one of the smallest, edging out F2LLM-v2-160M and clearly ahead of embeddinggemma-300m.
The sub-1B leaders (F2LLM-v2-0.6B, Qwen3-Embedding-0.6B) score ~4.5–6.5 points higher but are ~4× the parameter count and use larger embedding dimensions, which directly increases index size and inference cost.
The 14B model at the top of the overall board is ~10 points higher but ~93× larger, sitting in a different deployment cost regime entirely.
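
To make the index-size point concrete, the sketch below estimates the raw size of a flat float32 index for the embedding dimensions in the table. The corpus size of one million documents is hypothetical, and real indexes often apply compression that this ignores.

```python
def flat_index_gib(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw size of an uncompressed vector index in GiB (float32 by default)."""
    return num_vectors * dim * bytes_per_value / (1024 ** 3)

NUM_DOCS = 1_000_000  # hypothetical corpus size, one vector per document

# Embedding dimensions taken from the comparison table.
EMBEDDING_DIMS = {
    "NightOwl-CodeEmbedding": 768,
    "F2LLM-v2-0.6B": 1024,
    "F2LLM-v2-14B": 5120,
}

for name, dim in EMBEDDING_DIMS.items():
    print(f"{name}: {flat_index_gib(NUM_DOCS, dim):.2f} GiB")
```

Under these assumptions index size scales linearly with the embedding dimension, so a 1,024-d model needs about a third more storage than a 768-d one for the same corpus.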

Head-to-head vs LateOn-Code

lightonai/LateOn-Code is the only model under 300M parameters that outranks NightOwl-CodeEmbedding once multi-vector models are included, so it is worth a closer look. It is a ColBERT-style late-interaction model (built with PyLate on ModernBERT-base): it stores one 128-dimensional vector per token and scores with the MaxSim operator, rather than a single 768-d vector per text. That buys accuracy at the cost of a larger index and a different retrieval path (PyLate + a PLAID index), whereas NightOwl is a drop-in single-vector sentence-transformers model.
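
The difference between the two scoring schemes can be sketched in a few lines. This is a toy illustration with 2-d vectors, not either model's actual implementation: single-vector scoring compares one vector per text, while MaxSim sums, over query tokens, the best match among the document's token vectors.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Single-vector scoring: one comparison per (query, document) pair."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def maxsim(query_tokens, doc_tokens):
    """Late-interaction scoring: for each query token vector, take its best
    dot product among the document token vectors, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Toy unit vectors; real models use 768-d text vectors or 128-d token vectors.
query_vec, doc_vec = [1.0, 0.0], [0.6, 0.8]
print(cosine(query_vec, doc_vec))

query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[0.6, 0.8], [1.0, 0.0], [0.0, 1.0]]
print(maxsim(query_tokens, doc_tokens))
```

For a hypothetical 300-token snippet, the multi-vector representation holds 300 × 128 = 38,400 values against 768 for a single vector, before any index compression, which is where the larger index comes from.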

Per-task NDCG@10 (×100) on MTEB(Code, v1); both models are code-specialized and in-domain (8% zero-shot), so this is a like-for-like comparison.

Task	NightOwl-CodeEmbedding	LateOn-Code (multi-vec)
AppsRetrieval	39.18	54.76
COIRCodeSearchNetRetrieval	84.26	86.57
CodeEditSearchRetrieval	74.81	64.99
CodeFeedbackMT	76.69	82.22
CodeFeedbackST	85.21	90.40
CodeSearchNetCCRetrieval	91.81	89.32
CodeSearchNetRetrieval	89.24	90.40
CodeTransOceanContest	75.95	87.44
CodeTransOceanDL	36.06	41.00
CosQA	42.81	45.23
StackOverflowQA	86.61	93.43
SyntheticText2SQL	68.27	63.67
Average	70.91	74.12

LateOn-Code wins on average, driven mostly by AppsRetrieval and the feedback/translation/QA tasks. However, NightOwl-CodeEmbedding wins on three tasks that map directly to its design focus:

CodeEditSearchRetrieval (+9.8): matching edit intents to code changes — NightOwl's dedicated code-edit training shows here.
CodeSearchNetCCRetrieval (+2.5): code-to-code / similar-function retrieval.
SyntheticText2SQL (+4.6): NL-to-SQL retrieval.

So for single-vector code-edit and code-to-code retrieval specifically, NightOwl is competitive with or ahead of a higher-average multi-vector model, while keeping a standard dense-vector index. (LateOn-Code scores sourced from the model's MTEB(Code, v1) table.)

Because the benchmark suite consists of in-domain code retrieval tasks related to the model's training distribution, these results should not be interpreted as strictly zero-shot performance.

Base Model: the NightOwl Backbone

NightOwl-CodeEmbedding is fine-tuned from Shuu12121/NightOwl, a ModernBERT-style code encoder that was pre-trained from scratch — including its own tokenizer — rather than adapted from a general-purpose checkpoint. The whole stack, from tokenization to the pre-training objective, is controlled for code.

Code-aware tokenizer. NightOwl uses a custom 50,368-token BPE tokenizer in which whitespace is tokenized independently of adjacent words, so indentation is represented by its own tokens instead of being merged into "leading-whitespace + word" pieces. In code the same identifier recurs at many indentation depths; folding whitespace into those pieces would spend large parts of the vocabulary on near-duplicate "indent + token" variants. Keeping whitespace separate avoids that waste and lets the fixed vocabulary budget cover more genuinely distinct subwords, while still representing indentation faithfully — which matters for whitespace-significant languages such as Python.
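
The effect of keeping whitespace separate can be illustrated with two regex pre-tokenizers. This is an illustration of the idea only, not NightOwl's actual BPE tokenizer; both patterns are hypothetical.

```python
import re

# Illustration only: regex pre-tokenizers, not NightOwl's real tokenizer.
# Whitespace runs become their own pieces.
SEPARATE_WS = re.compile(r"\s+|\w+|[^\w\s]")
# Hypothetical contrast: leading whitespace is glued onto the next piece.
MERGED_WS = re.compile(r"\s*\w+|\s*[^\w\s]|\s+")

line_shallow = "    return x"      # 4-space indent
line_deep = "        return x"     # 8-space indent

print(SEPARATE_WS.findall(line_shallow))
print(SEPARATE_WS.findall(line_deep))
print(MERGED_WS.findall(line_shallow))
print(MERGED_WS.findall(line_deep))
```

With the separate pattern, `return` is the same piece at both depths and only the indentation piece changes. With the merged pattern, each depth yields a different "indent + return" piece, which is the kind of near-duplicate the paragraph above describes.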

Two-phase pre-training with line-level masking. NightOwl is trained with masked-language modeling (mlm_probability = 0.3) in two phases:

Phase 1 — mixed pre-training: standard random-token MLM over code, natural language, and technical documentation (produces NightOwl-Pre).
Phase 2 — code-only continuation: line-level MLM, where entire source-code lines are masked instead of random tokens. This aligns the pre-training objective with code search and retrieval, where the unit of meaning is closer to a line or statement than an isolated token. The recommended NightOwl checkpoint is this Phase-2 result.
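
A minimal sketch of line-level masking, assuming whole lines are selected at random until roughly `mlm_probability` of the tokens are masked. The selection rule, the `[MASK]` string, and the helper name `mask_lines` are assumptions for illustration; the card does not describe the exact procedure.

```python
import random

MASK = "[MASK]"  # placeholder mask token (assumption)

def mask_lines(lines, mlm_probability=0.3, seed=0):
    """Mask whole lines until roughly `mlm_probability` of all tokens are masked.

    `lines` is a list of token lists, one per source line.
    """
    rng = random.Random(seed)
    total_tokens = sum(len(line) for line in lines)
    budget = mlm_probability * total_tokens

    order = list(range(len(lines)))
    rng.shuffle(order)

    masked_lines, masked_tokens = set(), 0
    for idx in order:
        if masked_tokens >= budget:
            break
        if not lines[idx]:
            continue  # nothing to mask on an empty line
        masked_lines.add(idx)
        masked_tokens += len(lines[idx])

    # Every token of a selected line is replaced; other lines are untouched.
    return [[MASK] * len(line) if i in masked_lines else list(line)
            for i, line in enumerate(lines)]

source = [
    ["def", "add", "(", "a", ",", "b", ")", ":"],
    ["total", "=", "a", "+", "b"],
    ["return", "total"],
]
for line in mask_lines(source):
    print(" ".join(line))
```

In contrast to random-token MLM, the model never sees part of a masked line, so it has to reconstruct the whole statement from the surrounding code.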

Backbone architecture (base):

Property	Value
Architecture	ModernBERT (alternating local/global attention, RoPE)
Parameters	≈150M
`hidden_size` / layers / heads	768 / 19 / 12
Vocabulary	50,368 (custom code BPE)
Max sequence length	1,024 (Phase 1) → 2,048 (Phase 2)

Pre-training data mixes bigcode/starcoder2data-extras (Kaggle notebooks, StackOverflow threads, GitHub issues, technical documentation, …) with whole-file source from Shuu12121/github-file-programs-dataset across the eight supported languages (Python, JavaScript, TypeScript, Java, Go, Rust, Ruby, PHP). Long examples are split into chunks so all tokens are used rather than truncated.
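
Chunking instead of truncating can be sketched as below. Non-overlapping fixed-size windows are an assumption; the card does not say whether chunks overlap or how special tokens are handled.

```python
def chunk_token_ids(token_ids, max_length):
    """Split one long token sequence into consecutive chunks of at most `max_length` tokens."""
    return [token_ids[i:i + max_length] for i in range(0, len(token_ids), max_length)]

token_ids = list(range(2500))  # stand-in for a tokenized source file
chunks = chunk_token_ids(token_ids, max_length=1024)

print([len(chunk) for chunk in chunks])  # [1024, 1024, 452]
```

Truncation at 1,024 tokens would discard the last 1,476 tokens of this example, whereas chunking keeps all of them as training input.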

As a backbone, before the retrieval fine-tuning described in this card, NightOwl reaches 0.8436 average MRR on MTEB CodeSearchNetRetrieval when fine-tuned under a fixed SentenceTransformer protocol, ahead of CodeBERT-base (0.7944), GraphCodeBERT-base (0.8325), UniXCoder-base (0.8003), and ModernBERT-base (0.8182) evaluated under the same protocol. NightOwl-CodeEmbedding builds the retrieval model described in this card on top of that backbone.

Training

The model was trained with CachedMultipleNegativesRankingLoss using bidirectional query-to-document and document-to-query objectives.

Property	Value
Training samples	2,534,400
Positives per anchor	1
Negatives per anchor	15
Loss	`CachedMultipleNegativesRankingLoss`
Objective	Bidirectional retrieval training
Hard-negative mining model	`Qwen/Qwen3-Embedding-0.6B`
Epochs	1
Learning rate	6e-5
Batch size	1024
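
The objective can be sketched in plain Python as a softmax cross-entropy over scaled cosine similarities, applied in both directions. This is a conceptual sketch, not the sentence-transformers implementation: the scale of 20 and the exact candidate layout are assumptions, and the "Cached" variant is understood to add gradient caching so that large batches fit in memory without changing the loss value.

```python
import math

def cos(u, v):
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(query, candidates, positive_index, scale=20.0):
    """Cross-entropy over scaled cosine similarities: the candidate at
    `positive_index` should outscore every other candidate."""
    logits = [scale * cos(query, c) for c in candidates]
    log_norm = math.log(sum(math.exp(x) for x in logits))
    return log_norm - logits[positive_index]

def bidirectional_loss(anchors, positives, hard_negatives, scale=20.0):
    """Average of query-to-document and document-to-query losses for one batch.

    hard_negatives[i] holds the mined negatives for anchors[i].
    """
    n = len(anchors)
    all_negatives = [neg for negs in hard_negatives for neg in negs]
    # Query -> document: every in-batch positive plus all hard negatives compete.
    q2d = sum(info_nce(anchors[i], positives + all_negatives, i, scale) for i in range(n)) / n
    # Document -> query: each positive must pick out its own anchor in the batch.
    d2q = sum(info_nce(positives[i], anchors, i, scale) for i in range(n)) / n
    return (q2d + d2q) / 2

anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
hard_negatives = [[[0.7, 0.7]], [[0.7, 0.7]]]  # one negative per anchor here; the card uses 15

print(bidirectional_loss(anchors, positives, hard_negatives))
```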

Training Data

The training data is a mixture of:

Public code-retrieval datasets covering the following MTEB(Code, v1) task families: AppsRetrieval, COIRCodeSearchNetRetrieval, CodeFeedbackMT, CodeFeedbackST, CodeSearchNetCCRetrieval, CodeSearchNetRetrieval, CodeTransOceanContest, CodeTransOceanDL, CosQA, StackOverflowQA, and SyntheticText2SQL.
Custom code-comment pair data consisting of code snippets paired with natural-language description comments across the eight supported languages (the six CodeSearchNet languages plus Rust and TypeScript).
Code-edit data derived from commitpackft, pairing edit intents with code changes.

All datasets were constructed as hard-negative retrieval datasets: for each anchor, one positive and fifteen hard negatives were used. Hard negatives were mined with Qwen/Qwen3-Embedding-0.6B, which retrieves semantically similar but non-matching candidates, producing negatives that are more difficult than random negatives. The mining model is used only during dataset construction and is not required at inference time.
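
The mining step can be sketched as a filtered top-k selection. The similarity values below stand in for scores from the mining model; the helper name `mine_hard_negatives` is hypothetical, and the sketch assumes plain top-k without any score thresholds or false-negative filtering, which the card does not describe.

```python
def mine_hard_negatives(similarities, positive_ids, num_negatives=15):
    """Pick the most similar candidates that are not known positives.

    similarities: dict of candidate id -> similarity to the anchor
    positive_ids: ids that must never be used as negatives
    """
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    return [cid for cid in ranked if cid not in positive_ids][:num_negatives]

# Made-up similarity scores for one anchor.
sims = {"doc_a": 0.91, "doc_b": 0.88, "doc_c": 0.35, "doc_d": 0.80}
print(mine_hard_negatives(sims, positive_ids={"doc_a"}, num_negatives=2))
# ['doc_b', 'doc_d']
```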

This setup is intended to improve discrimination between code snippets, programming questions, edit examples, and technically similar retrieval candidates.

Data Decontamination

To reduce benchmark contamination, the following overlaps were removed from the training data before training:

Overlaps between the custom code-comment pair data and the CodeSearchNet test split
Overlaps between the commitpackft-derived code-edit data and the CodeEditSearchRetrieval benchmark evaluation data
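
The card does not state how overlaps were detected. One common approach, shown here purely as an assumption, is exact matching on whitespace-normalized text; the helper names are hypothetical.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Hash of whitespace-normalized text, so formatting differences do not hide overlaps."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def decontaminate(train_texts, benchmark_texts):
    """Drop training examples whose normalized text also appears in the benchmark."""
    blocked = {fingerprint(text) for text in benchmark_texts}
    return [text for text in train_texts if fingerprint(text) not in blocked]

train = ["def add(a, b):\n    return a + b", "def sub(a, b): return a - b"]
benchmark = ["def add(a, b): return a + b"]

print(decontaminate(train, benchmark))
```

Exact matching misses near-duplicates such as renamed variables, so a stricter pipeline might add fuzzy matching on top.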

For CodeEditSearchRetrieval, note that MTEB labels the evaluation split train. This refers only to the official split name available for the task; the evaluated examples were not included in this model's fine-tuning data. The reported score should therefore be interpreted as in-domain generalization on held-out benchmark examples, not as training-set performance — though, given the in-domain training distribution, also not as strictly zero-shot performance.

Intended Use

This model is intended for code-related retrieval tasks such as:

Natural language to code search
Code-to-code retrieval and similar function search
Code-edit retrieval (matching edit intents to code changes)
Retrieval over programming Q&A and technical questions
Local semantic code search systems
RAG systems over codebases and developer documentation

Example use cases include indexing functions, snippets, programming solutions, StackOverflow-style answers, code review examples, and edit-related code examples.

Limitations

The model is specialized for code-related retrieval and may underperform general-purpose text embedding models on unrelated natural language tasks.
Inputs longer than 1,024 tokens are truncated. This is a shorter context window than several models it competes with (e.g. the 8K+ token F2LLM and granite models), so very long files must be chunked.
MTEB(Code, v1) is largely in-domain for this model (8% zero-shot). On code domains, query styles, or languages far from the training distribution, expect lower performance than the leaderboard numbers suggest.
Performance may vary by programming language, query style, and the granularity of indexed code chunks; languages outside the eight supported languages are untested.
The model uses dense single-vector embeddings. For very fine-grained matching, rerankers or late-interaction models (such as LateOn-Code) may provide a higher average at the cost of a larger index and a non-standard retrieval path — though, as the head-to-head shows, single-vector NightOwl still leads on code-edit and code-to-code retrieval.

Recommended Indexing Settings

Encode both queries and documents with normalized embeddings:

embeddings = model.encode(texts, normalize_embeddings=True)

With normalized embeddings, dot product is equivalent to cosine similarity.
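
This equivalence can be checked on toy vectors with nothing but the standard library:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

a, b = [3.0, 4.0], [4.0, 3.0]

# Both expressions give the same value once the vectors have unit length.
print(cosine(a, b))
print(dot(l2_normalize(a), l2_normalize(b)))
```

In practice this means a vector store configured for inner-product search can serve cosine-similarity queries, provided every stored and query vector is normalized.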

For codebase search, indexing function-level or class-level chunks is usually recommended. Very long files may exceed the 1,024-token context limit and should be split into smaller semantic chunks.
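
For Python sources, function-level and class-level chunks can be extracted with the standard `ast` module. This is a minimal sketch for top-level definitions only; other languages would need a different parser, and any chunk that still exceeds 1,024 tokens would need further splitting.

```python
import ast

def function_and_class_chunks(source: str):
    """Return (name, source_text) for each top-level function or class in a Python file."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append((node.name, ast.get_source_segment(source, node)))
    return chunks

example = '''
import math

def area(r):
    return math.pi * r * r

class Circle:
    def __init__(self, r):
        self.r = r
'''

for name, text in function_and_class_chunks(example):
    print(name, len(text.splitlines()))
```

Each returned `source_text` can then be passed to `model.encode` as one document.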

Citation

If you use this model, please cite it together with the base model and Sentence Transformers.

@misc{nightowl_codeembedding,
  title = {NightOwl-CodeEmbedding},
  author = {Shuu12121},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Shuu12121/NightOwl-CodeEmbedding}
}
