NightOwl-CodeEmbedding 🦉
NightOwl-CodeEmbedding is a compact 768-dimensional dense embedding model
specialized for code retrieval, code-edit retrieval, and technical question
answering.
The model is fine-tuned from
Shuu12121/NightOwl, a
ModernBERT-based code model. It uses CLS pooling with cosine similarity and
does not require query: / passage: style prefixes.
Highlights
- Compact (150.8M parameters) yet competitive on CoIR-style code retrieval benchmarks
- On the MTEB(Code, v1) leaderboard it ranks 18th out of 241 models overall and is the top-scoring single-vector model under 300M parameters among scored entries on the official board, ahead of many models an order of magnitude larger (see Leaderboard Standing)
- Covers eight programming languages, including Rust and TypeScript in addition to the six CodeSearchNet languages
- Handles a wide range of code retrieval scenarios: NL-to-code search, code-to-code retrieval, code-edit retrieval, and technical QA
- Trained with hard negatives mined by Qwen/Qwen3-Embedding-0.6B (15 hard negatives per anchor)
- Decontaminated against CodeSearchNet test splits and the CodeEditSearchRetrieval benchmark (see Data Decontamination)
- Drop-in compatible with sentence-transformers, Apache-2.0 license
Supported Languages
The training data covers the six CodeSearchNet languages plus two additional languages:
- Go, Java, JavaScript, PHP, Python, Ruby (CodeSearchNet languages)
- Rust, TypeScript (additional)
Performance on languages outside this set is not guaranteed and may vary.
Usage
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Shuu12121/NightOwl-CodeEmbedding")
queries = ["Python function that sorts a list in descending order"]
documents = [
"def sort_desc(values): return sorted(values, reverse=True)",
"def average(values): return sum(values) / len(values)",
]
query_embeddings = model.encode(queries)
document_embeddings = model.encode(documents)
# Cosine similarity (embeddings are normalized internally by similarity())
scores = model.similarity(query_embeddings, document_embeddings)
print(scores)
Model Details
| Property | Value |
|---|---|
| Base model | Shuu12121/NightOwl |
| Architecture | ModernBERT |
| Parameters | 150,779,136 |
| Embedding dimension | 768 |
| Pooling | CLS pooling |
| Maximum sequence length | 1,024 tokens |
| Similarity | Cosine similarity |
| Query/document prefixes | Not required |
| Weight dtype | FP32 |
| Weight memory | 575 MiB |
| License | Apache-2.0 |
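The weight-memory figure follows directly from the parameter count and the FP32 dtype. The short check below reproduces it; the FP16 line is only a hypothetical illustration, since the card lists FP32 weights only.

```python
# Figures taken from the Model Details table above.
NUM_PARAMETERS = 150_779_136
BYTES_PER_FP32 = 4


def weight_memory_mib(num_parameters: int, bytes_per_param: int) -> float:
    """Raw weight size in MiB (1 MiB = 2**20 bytes); activations are not included."""
    return num_parameters * bytes_per_param / 2**20


fp32_mib = weight_memory_mib(NUM_PARAMETERS, BYTES_PER_FP32)
# Hypothetical: what a half-precision copy would occupy (not a published artifact).
fp16_mib = weight_memory_mib(NUM_PARAMETERS, 2)

print(f"FP32: {fp32_mib:.1f} MiB, FP16 (hypothetical): {fp16_mib:.1f} MiB")
```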
MTEB Results
The model was evaluated with MTEB on code-related retrieval and technical QA tasks.
Evaluation setup:
- Model revision: c7c8a57b9539297e192d5cf39b9aecf1fb376edd
- MTEB version: 2.15.1
- Metric: NDCG@10
- Hardware: NVIDIA GeForce RTX 5090
- Batch size: 64
Multi-subset task scores are reported as macro averages.
| Task | Split | NDCG@10 |
|---|---|---|
| AppsRetrieval | test | 0.39177 |
| COIRCodeSearchNetRetrieval | test | 0.84264 |
| CodeEditSearchRetrieval | train¹ | 0.74808 |
| CodeFeedbackMT | test | 0.76690 |
| CodeFeedbackST | test | 0.85207 |
| CodeSearchNetCCRetrieval | test | 0.91805 |
| CodeSearchNetRetrieval | test | 0.89239 |
| CodeTransOceanContest | test | 0.75953 |
| CodeTransOceanDL | test | 0.36057 |
| CosQA | test | 0.42810 |
| StackOverflowQA | test | 0.86608 |
| SyntheticText2SQL | test | 0.68266 |
| Macro average, all 12 tasks | | 0.70907 |
| CoIR macro average, 10 tasks | | 0.68684 |
¹ CodeEditSearchRetrieval does not provide a standard test split in MTEB,
so the official train split is used for evaluation. These examples were
not used for fine-tuning. See
Data Decontamination for details.
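As a rough illustration of how the numbers in this table fit together, the sketch below computes NDCG@10 for a single query with binary relevance labels and then reproduces the two macro averages from the per-task scores. The `ndcg_at_k` helper is a simplified textbook formulation, not MTEB's own implementation. The 10-task CoIR subset is an assumption: it is taken to be the 12 tasks minus CodeEditSearchRetrieval and CodeSearchNetRetrieval, which reproduces the reported figure.

```python
import math


def ndcg_at_k(ranked_relevances: list[int], num_relevant: int, k: int = 10) -> float:
    """NDCG@k for one query, given 0/1 relevance labels in ranked order."""
    dcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ranked_relevances[:k]))
    ideal = sum(1 / math.log2(rank + 2) for rank in range(min(num_relevant, k)))
    return dcg / ideal if ideal > 0 else 0.0


# One relevant document, retrieved at rank 3.
example_ndcg = ndcg_at_k([0, 0, 1, 0, 0], num_relevant=1)

# Per-task NDCG@10 values copied from the table above.
TASK_SCORES = {
    "AppsRetrieval": 0.39177,
    "COIRCodeSearchNetRetrieval": 0.84264,
    "CodeEditSearchRetrieval": 0.74808,
    "CodeFeedbackMT": 0.76690,
    "CodeFeedbackST": 0.85207,
    "CodeSearchNetCCRetrieval": 0.91805,
    "CodeSearchNetRetrieval": 0.89239,
    "CodeTransOceanContest": 0.75953,
    "CodeTransOceanDL": 0.36057,
    "CosQA": 0.42810,
    "StackOverflowQA": 0.86608,
    "SyntheticText2SQL": 0.68266,
}

# Assumption: the two tasks outside the 10-task CoIR subset.
NON_COIR_TASKS = {"CodeEditSearchRetrieval", "CodeSearchNetRetrieval"}

macro_all = sum(TASK_SCORES.values()) / len(TASK_SCORES)
coir_scores = [score for task, score in TASK_SCORES.items() if task not in NON_COIR_TASKS]
macro_coir = sum(coir_scores) / len(coir_scores)

print(f"all 12 tasks: {macro_all:.5f}, CoIR 10 tasks: {macro_coir:.5f}")
```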
Leaderboard Standing
On the public MTEB(Code, v1) leaderboard, the model's task-mean score (70.91, i.e. the 12-task macro average above ×100) places it as follows:
- #18 of 241 models overall, ahead of many models that are an order of magnitude larger
- #6 of 155 among sub-1B-parameter dense single-vector models, and the smallest model in that top six. The five models ranked above it are all ≈0.33–0.6B parameters (F2LLM-v2-0.6B/330M, pplx-embed-v1-0.6b, C2LLM-0.5B, Qwen3-Embedding-0.6B), i.e. 2–4× larger.
- #1 among ranked dense single-vector models under 300M parameters (the leaderboard's small-model view)
- #2 once late-interaction / multi-vector models are included, behind only lightonai/LateOn-Code (a multi-vector late-interaction model; see the head-to-head below)
Reading the numbers fairly. MTEB(Code, v1) reports a zero-shot % for each model, i.e. the fraction of leaderboard tasks the model was not trained on. NightOwl-CodeEmbedding is 8% zero-shot: it was trained on most of these task families, so its score reflects strong in-domain retrieval rather than zero-shot transfer. Models marked 100% (e.g. embeddinggemma-300m, the granite-embedding r2 family, Qwen3-Embedding) are evaluated fully out-of-domain, so a raw score comparison across rows with different zero-shot % is not apples-to-apples. The fairest direct comparisons are to other code-specialized models at similar zero-shot levels (e.g. LateOn-Code at 8%, the F2LLM / C2LLM families at 8–58%).
Comparison with similar-sized models
The table below compares NightOwl-CodeEmbedding with other compact code /
general embedding models on MTEB(Code, v1), with a size ladder of larger models
for reference. Score is the leaderboard task mean (higher is better); the
Zero-shot column is the share of tasks the model did not train on.
| Model | Params | Type | Emb. dim | Max tokens | Zero-shot | MTEB(Code, v1) ↑ |
|---|---|---|---|---|---|---|
| NightOwl-CodeEmbedding (this model) | 150.8M | single-vector | 768 | 1,024 | 8% | 70.91 |
| codefuse-ai/F2LLM-v2-160M | 159M | single-vector | 640 | 40,960 | 58% | 70.38 |
| google/embeddinggemma-300m | 308M | single-vector | 768 | 2,048 | 100% | 68.76 |
| codefuse-ai/F2LLM-v2-80M | 80M | single-vector | 320 | 40,960 | 58% | 67.97 |
| ibm-granite/granite-embedding-311m-multilingual-r2 | 312M | single-vector | 768 | 8,192 | 100% | 63.84 |
| Late-interaction (multi-vector) reference | | | | | | |
| lightonai/LateOn-Code | 149M | multi-vector | 128 (per-tok) | 2,048 | 8% | 74.12 |
| Larger single-vector reference (size ladder) | | | | | | |
| codefuse-ai/F2LLM-v2-0.6B (#1 sub-1B) | 596M | single-vector | 1,024 | 40,960 | 58% | 77.41 |
| Qwen/Qwen3-Embedding-0.6B | 596M | single-vector | 1,024 | 32,768 | 100% | 75.42 |
| codefuse-ai/F2LLM-v2-14B (#1 overall) | 13.99B | single-vector | 5,120 | 40,960 | 58% | 80.75 |
Takeaways:
- Among compact single-vector dense models, NightOwl-CodeEmbedding is the strongest entry in the leaderboard's small-model view while also being one of the smallest, edging out F2LLM-v2-160M and clearly ahead of embeddinggemma-300m.
- The sub-1B leaders (F2LLM-v2-0.6B, Qwen3-Embedding-0.6B) score ~4–6.5 points higher but are ~4× the parameter count and use larger embedding dimensions, which directly increases index size and inference cost.
- The 14B model at the top of the overall board is ~10 points higher but ~93× larger, sitting in a different deployment cost regime entirely.
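To make the index-size point concrete, here is a back-of-the-envelope estimate for a flat, uncompressed FP32 vector index. The corpus size of one million chunks is a hypothetical figure chosen for illustration, and real vector stores add metadata and may quantize vectors, so treat these as rough lower bounds rather than measurements.

```python
def index_size_gib(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Size of a flat index holding one dense vector per document, in GiB."""
    return num_vectors * dim * bytes_per_value / 2**30


NUM_CHUNKS = 1_000_000  # hypothetical corpus size

# Embedding dimensions taken from the comparison table above.
EMBEDDING_DIMS = {
    "NightOwl-CodeEmbedding": 768,
    "F2LLM-v2-0.6B / Qwen3-Embedding-0.6B": 1024,
    "F2LLM-v2-14B": 5120,
}

index_sizes = {name: index_size_gib(NUM_CHUNKS, dim) for name, dim in EMBEDDING_DIMS.items()}

for name, size in index_sizes.items():
    print(f"{name}: {size:.2f} GiB")
```

Under these assumptions the index grows linearly with the embedding dimension, so a 1,024-dimensional model needs about a third more storage than a 768-dimensional one for the same corpus.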
Head-to-head vs LateOn-Code
lightonai/LateOn-Code is the only sub-0.5B model that outranks
NightOwl-CodeEmbedding once multi-vector models are included, so it is worth a
closer look. It is a ColBERT-style late-interaction model (built with PyLate
on ModernBERT-base): it stores one 128-dimensional vector per token and
scores with the MaxSim operator, rather than a single 768-d vector per text.
That buys accuracy at the cost of a larger index and a different retrieval path
(PyLate + a PLAID index), whereas NightOwl is a drop-in single-vector
sentence-transformers model.
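The difference between the two scoring schemes can be sketched in a few lines. The toy 2-dimensional vectors below are invented for illustration, and the helpers are plain-Python stand-ins rather than the PyLate or sentence-transformers APIs: single-vector scoring compares one vector per text, while MaxSim sums, over the query tokens, each token's best match among the document tokens.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def single_vector_score(query_vec: list[float], doc_vec: list[float]) -> float:
    """Dense retrieval: one vector per text, one similarity per (query, document) pair."""
    return cosine(query_vec, doc_vec)


def maxsim_score(query_tokens: list[list[float]], doc_tokens: list[list[float]]) -> float:
    """Late interaction: sum over query tokens of the best-matching document token."""
    return sum(max(cosine(q, d) for d in doc_tokens) for q in query_tokens)


# Toy token embeddings (hypothetical values).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

late_interaction = maxsim_score(query_tokens, doc_tokens)
dense = single_vector_score([1.0, 1.0], [2.0, 2.0])
print(late_interaction, dense)
```

The storage trade-off follows from the signatures: the dense scorer needs one vector per document, whereas MaxSim needs one vector per document token.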
Per-task NDCG@10 (×100) on MTEB(Code, v1); both models are code-specialized and in-domain (8% zero-shot), so this is a like-for-like comparison. Bold marks the higher score on each task.
| Task | NightOwl-CodeEmbedding | LateOn-Code (multi-vec) |
|---|---|---|
| AppsRetrieval | 39.18 | **54.76** |
| COIRCodeSearchNetRetrieval | 84.26 | **86.57** |
| CodeEditSearchRetrieval | **74.81** | 64.99 |
| CodeFeedbackMT | 76.69 | **82.22** |
| CodeFeedbackST | 85.21 | **90.40** |
| CodeSearchNetCCRetrieval | **91.81** | 89.32 |
| CodeSearchNetRetrieval | 89.24 | **90.40** |
| CodeTransOceanContest | 75.95 | **87.44** |
| CodeTransOceanDL | 36.06 | **41.00** |
| CosQA | 42.81 | **45.23** |
| StackOverflowQA | 86.61 | **93.43** |
| SyntheticText2SQL | **68.27** | 63.67 |
| Average | 70.91 | **74.12** |
LateOn-Code wins on average, driven mostly by AppsRetrieval and the
feedback/translation/QA tasks. However, NightOwl-CodeEmbedding wins on three
tasks that map directly to its design focus:
- CodeEditSearchRetrieval (+9.8): matching edit intents to code changes; NightOwl's dedicated code-edit training shows here.
- CodeSearchNetCCRetrieval (+2.5): code-to-code / similar-function retrieval.
- SyntheticText2SQL (+4.6): NL-to-SQL retrieval.
So for single-vector code-edit and code-to-code retrieval specifically,
NightOwl is competitive with or ahead of a higher-average multi-vector model,
while keeping a standard dense-vector index. (LateOn-Code scores sourced from
the model's
MTEB(Code, v1) table.)
Because the benchmark suite consists of in-domain code retrieval tasks related to the model's training distribution, these results should not be interpreted as strictly zero-shot performance.
Base Model: the NightOwl Backbone
NightOwl-CodeEmbedding is fine-tuned from
Shuu12121/NightOwl, a
ModernBERT-style code encoder that was pre-trained from scratch — including
its own tokenizer — rather than adapted from a general-purpose checkpoint. The
whole stack, from tokenization to the pre-training objective, is controlled for
code.
Code-aware tokenizer. NightOwl uses a custom 50,368-token BPE tokenizer in which whitespace is tokenized independently of adjacent words, so indentation is represented by its own tokens instead of being merged into "leading-whitespace + word" pieces. In code the same identifier recurs at many indentation depths; folding whitespace into those pieces would spend large parts of the vocabulary on near-duplicate "indent + token" variants. Keeping whitespace separate avoids that waste and lets the fixed vocabulary budget cover more genuinely distinct subwords, while still representing indentation faithfully — which matters for whitespace-significant languages such as Python.
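The effect described above can be illustrated with a toy pre-tokenizer. This is not the NightOwl tokenizer: the two regular expressions are hypothetical stand-ins that only model the pre-tokenization split, with no BPE merges. The same `return` statement at two indentation depths yields two different "whitespace + word" pieces under the merged scheme but a single shared `return` piece when whitespace is split off.

```python
import re

# Hypothetical pre-tokenization rules, for illustration only.
SEPARATE_WHITESPACE = re.compile(r"\n|[ \t]+|\w+|[^\w\s]")
MERGED_WHITESPACE = re.compile(r"\s*\w+|\s*[^\w\s]|\s+")

code = "    return x\n        return x\n"

separate_pieces = SEPARATE_WHITESPACE.findall(code)
merged_pieces = MERGED_WHITESPACE.findall(code)

# Distinct pieces that contain the identifier "return" under each scheme.
separate_variants = {piece for piece in separate_pieces if "return" in piece}
merged_variants = {piece for piece in merged_pieces if "return" in piece}

print("separate:", separate_pieces)
print("merged:  ", merged_pieces)
print(len(separate_variants), "vs", len(merged_variants), "variants of 'return'")
```

Both schemes are lossless here (joining the pieces restores the input); they differ only in how many distinct pieces a vocabulary would need to cover.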
Two-phase pre-training with line-level masking. NightOwl is trained with
masked-language modeling (mlm_probability = 0.3) in two phases:
- Phase 1, mixed pre-training: standard random-token MLM over code, natural language, and technical documentation (produces NightOwl-Pre).
- Phase 2, code-only continuation: line-level MLM, where entire source-code lines are masked instead of random tokens. This aligns the pre-training objective with code search and retrieval, where the unit of meaning is closer to a line or statement than an isolated token. The recommended NightOwl checkpoint is this Phase-2 result.
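A minimal sketch of what line-level masking could look like is shown below. The card only states that whole lines are masked and that `mlm_probability = 0.3`; the selection procedure here (shuffle the lines, then mask whole lines until roughly that share of tokens is covered), the whitespace "tokenizer", and the `mask_lines` helper are all assumptions made for illustration.

```python
import random

MASK = "[MASK]"


def mask_lines(lines: list[list[str]], mlm_probability: float = 0.3, seed: int = 0) -> list[list[str]]:
    """Mask entire lines until about `mlm_probability` of all tokens are masked."""
    rng = random.Random(seed)
    budget = mlm_probability * sum(len(line) for line in lines)

    order = list(range(len(lines)))
    rng.shuffle(order)

    masked_line_ids: set[int] = set()
    masked_tokens = 0
    for idx in order:
        if masked_tokens >= budget:
            break
        if not lines[idx]:
            continue  # nothing to mask on blank lines
        masked_line_ids.add(idx)
        masked_tokens += len(lines[idx])

    return [
        [MASK] * len(line) if idx in masked_line_ids else list(line)
        for idx, line in enumerate(lines)
    ]


source = "def add(a, b):\n    total = a + b\n    return total\n"
# Toy tokenization: split each line on whitespace (the real model uses BPE).
token_lines = [line.split() for line in source.splitlines()]
masked_lines = mask_lines(token_lines)

for original, masked in zip(token_lines, masked_lines):
    print(original, "->", masked)
```

Because masking happens per line, the masked share can overshoot the target on short inputs; a production data collator would presumably handle this differently.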
Backbone architecture (base):
| Property | Value |
|---|---|
| Architecture | ModernBERT (alternating local/global attention, RoPE) |
| Parameters | ≈150M |
| hidden_size / layers / heads | 768 / 19 / 12 |
| Vocabulary | 50,368 (custom code BPE) |
| Max sequence length | 1,024 (Phase 1) → 2,048 (Phase 2) |
Pre-training data mixes bigcode/starcoder2data-extras (Kaggle notebooks,
StackOverflow threads, GitHub issues, technical documentation, …) with
whole-file source from Shuu12121/github-file-programs-dataset across the eight
supported languages (Python, JavaScript, TypeScript, Java, Go, Rust, Ruby, PHP).
Long examples are split into chunks so all tokens are used rather than truncated.
As a backbone, before the retrieval training described in this card, NightOwl
reaches 0.8436 average MRR on MTEB CodeSearchNetRetrieval when fine-tuned under
a fixed SentenceTransformer protocol, ahead of CodeBERT-base (0.7944),
GraphCodeBERT-base (0.8325), UniXCoder-base (0.8003), and ModernBERT-base
(0.8182) evaluated the same way. NightOwl-CodeEmbedding builds the retrieval
model described in this card on top of that backbone.
Training
The model was trained with CachedMultipleNegativesRankingLoss using
bidirectional query-to-document and document-to-query objectives.
| Property | Value |
|---|---|
| Training samples | 2,534,400 |
| Positives per anchor | 1 |
| Negatives per anchor | 15 |
| Loss | CachedMultipleNegativesRankingLoss |
| Objective | Bidirectional retrieval training |
| Hard-negative mining model | Qwen/Qwen3-Embedding-0.6B |
| Epochs | 1 |
| Learning rate | 6e-5 |
| Batch size | 1024 |
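The objective can be sketched as a symmetric in-batch ranking loss. The plain-Python version below is an illustration, not the sentence-transformers implementation: it omits the gradient caching that gives the "Cached" loss its name, the `scale` of 20 and the equal averaging of the two directions are assumptions, and the toy vectors are invented.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def cross_entropy(logits: list[float], target: int) -> float:
    """Numerically stable -log softmax(logits)[target]."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]


def bidirectional_ranking_loss(queries, positives, hard_negatives, scale: float = 20.0) -> float:
    """queries[i] matches positives[i]; hard_negatives[i] lists negatives mined for query i."""
    # Query-to-document: rank the true positive above all in-batch positives and negatives.
    candidates = positives + [neg for negs in hard_negatives for neg in negs]
    q2d = [
        cross_entropy([scale * cosine(q, c) for c in candidates], i)
        for i, q in enumerate(queries)
    ]
    # Document-to-query: rank the true query above the other in-batch queries.
    d2q = [
        cross_entropy([scale * cosine(d, q) for q in queries], i)
        for i, d in enumerate(positives)
    ]
    return (sum(q2d) / len(q2d) + sum(d2q) / len(d2q)) / 2


queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
hard_negatives = [[[0.6, 0.4]], [[0.4, 0.6]]]

aligned_loss = bidirectional_ranking_loss(queries, positives, hard_negatives)
mismatched_loss = bidirectional_ranking_loss(queries, positives[::-1], hard_negatives)
print(f"aligned: {aligned_loss:.4f}, mismatched: {mismatched_loss:.4f}")
```

With correctly paired examples the loss is close to zero; swapping the positives makes it large, which is the signal that training minimizes.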
Training Data
The training data is a mixture of:
- Public code-retrieval datasets covering the following CoIR task families: AppsRetrieval, COIRCodeSearchNetRetrieval, CodeFeedbackMT, CodeFeedbackST, CodeSearchNetCCRetrieval, CodeSearchNetRetrieval, CodeTransOceanContest, CodeTransOceanDL, CosQA, StackOverflowQA, and SyntheticText2SQL.
- Custom code-comment pair data consisting of code snippets paired with natural-language description comments across the eight supported languages (the six CodeSearchNet languages plus Rust and TypeScript).
- Code-edit data derived from
commitpackft, pairing edit intents with code changes.
All datasets were constructed as hard-negative retrieval datasets: for each
anchor, one positive and fifteen hard negatives were used. Hard negatives were
mined with
Qwen/Qwen3-Embedding-0.6B,
which retrieves semantically similar but non-matching candidates, producing
negatives that are more difficult than random negatives. The mining model is
used only during dataset construction and is not required at inference time.
This setup is intended to improve discrimination between code snippets, programming questions, edit examples, and technically similar retrieval candidates.
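The mining step can be sketched as a top-k selection over similarity scores. The `mine_hard_negatives` helper and the score values are hypothetical, and the scores are assumed to come from the mining model's embeddings. The card does not describe any additional filtering (for example, a margin against false negatives), so none is shown.

```python
def mine_hard_negatives(similarities: list[float], positive_index: int, num_negatives: int = 15) -> list[int]:
    """Return indices of the most similar candidates that are not the labeled positive."""
    ranked = sorted(range(len(similarities)), key=lambda i: similarities[i], reverse=True)
    return [i for i in ranked if i != positive_index][:num_negatives]


# Hypothetical anchor-to-candidate similarity scores from the mining model.
scores = [0.91, 0.20, 0.84, 0.55, 0.77]
negatives = mine_hard_negatives(scores, positive_index=0, num_negatives=3)
print(negatives)  # -> [2, 4, 3]
```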
Data Decontamination
To reduce benchmark contamination, the following overlaps were removed from the training data before training:
- Overlaps between the custom code-comment pair data and the CodeSearchNet test split
- Overlaps between the
commitpackft-derived code-edit data and the CodeEditSearchRetrieval benchmark evaluation data
For CodeEditSearchRetrieval, note that MTEB labels the evaluation split
train. This refers only to the official split name available for the task;
the evaluated examples were not included in this model's fine-tuning data.
The reported score should therefore be interpreted as in-domain
generalization on held-out benchmark examples, not as training-set
performance — though, given the in-domain training distribution, also not as
strictly zero-shot performance.
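The card does not say how overlaps were detected. One simple possibility, shown here purely as an assumption, is exact matching after whitespace normalization: fingerprint every benchmark example and drop any training example with the same fingerprint. The `fingerprint` and `decontaminate` helpers are hypothetical names.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash of the text with all whitespace runs collapsed to single spaces."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def decontaminate(train_examples: list[str], benchmark_examples: list[str]) -> list[str]:
    """Keep only training examples whose fingerprint is absent from the benchmark."""
    blocked = {fingerprint(example) for example in benchmark_examples}
    return [example for example in train_examples if fingerprint(example) not in blocked]


benchmark = ["def add(a, b):\n    return a + b"]
train = [
    "def add(a, b):\n        return a + b",  # same code, different indentation
    "def sub(a, b):\n    return a - b",
]
clean_train = decontaminate(train, benchmark)
print(len(train), "->", len(clean_train))
```

Exact matching of this kind misses near-duplicates such as renamed variables; fuzzier schemes (n-gram or MinHash overlap) would catch more, at the cost of removing more training data.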
Intended Use
This model is intended for code-related retrieval tasks such as:
- Natural language to code search
- Code-to-code retrieval and similar function search
- Code-edit retrieval (matching edit intents to code changes)
- Retrieval over programming Q&A and technical questions
- Local semantic code search systems
- RAG systems over codebases and developer documentation
Example use cases include indexing functions, snippets, programming solutions, StackOverflow-style answers, code review examples, and edit-related code examples.
Limitations
- The model is specialized for code-related retrieval and may underperform general-purpose text embedding models on unrelated natural language tasks.
- Inputs longer than 1,024 tokens are truncated. This is a shorter context window than several models it competes with (e.g. the 8K+ token F2LLM and granite models), so very long files must be chunked.
- MTEB(Code, v1) is largely in-domain for this model (8% zero-shot). On code domains, query styles, or languages far from the training distribution, expect lower performance than the leaderboard numbers suggest.
- Performance may vary by programming language, query style, and the granularity of indexed code chunks; languages outside the eight supported languages are untested.
- The model uses dense single-vector embeddings. For very fine-grained matching, rerankers or late-interaction models (such as LateOn-Code) may provide a higher average at the cost of a larger index and a non-standard retrieval path. As the head-to-head shows, though, single-vector NightOwl still leads on code-edit and code-to-code retrieval.
Recommended Indexing Settings
Encode both queries and documents with normalized embeddings:
embeddings = model.encode(texts, normalize_embeddings=True)
With normalized embeddings, dot product is equivalent to cosine similarity.
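This equivalence is easy to check numerically. The sketch below uses plain Python and made-up vectors rather than real model outputs: after L2 normalization, the dot product of two vectors equals the cosine similarity of the originals.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up vectors standing in for a query and a document embedding.
query_vec = [0.3, -1.2, 2.0]
doc_vec = [1.5, 0.4, 0.9]

dot_of_normalized = dot(l2_normalize(query_vec), l2_normalize(doc_vec))
cosine_of_raw = cosine(query_vec, doc_vec)
print(dot_of_normalized, cosine_of_raw)
```

In practice this means a vector index configured for inner-product search gives cosine rankings as long as the stored embeddings are normalized.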
For codebase search, indexing function-level or class-level chunks is usually recommended. Very long files may exceed the 1,024-token context limit and should be split into smaller semantic chunks.
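For Python sources, function-level and class-level chunks can be extracted with the standard-library `ast` module, as in the sketch below. The `top_level_chunks` helper is a hypothetical name; it handles only top-level definitions in Python files and does not check the 1,024-token limit, which would require the model's tokenizer. The other supported languages would need their own parsers.

```python
import ast


def top_level_chunks(source: str) -> list[tuple[str, str]]:
    """Return (name, source text) for each top-level function or class in a Python file."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                chunks.append((node.name, segment))
    return chunks


SAMPLE = '''
import math

def sort_desc(values):
    return sorted(values, reverse=True)

class Stats:
    def average(self, values):
        return sum(values) / len(values)
'''

chunks = top_level_chunks(SAMPLE)
for name, text in chunks:
    print(f"--- {name} ({len(text.splitlines())} lines)")
```

Each chunk's text can then be passed to `model.encode` as a separate document, with oversized classes split further by method if needed.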
Citation
If you use this model, please cite it together with the base model and Sentence Transformers.
@misc{nightowl_codeembedding,
title = {NightOwl-CodeEmbedding},
author = {Shuu12121},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/Shuu12121/NightOwl-CodeEmbedding}
}