CUDA support added - Pre-generation knowledge-boundary estimator

sadeezy81 · June 6, 2026, 4:05pm

I built a sidecar that takes the user’s prompt and does one forward pass, predicting whether the model’s parametric knowledge is sufficient to answer the question correctly (no answer tokens are generated).

So far I’m getting good results, but I’m wondering if there is a better way to do this. Thoughts?

GitHub

John6666 · June 9, 2026, 1:13am

For now, this looks like a promising direction:

I would frame this less as a generic “hallucination detector” and more as a pre-generation selective predictor / router for the model’s own closed-book QA behavior.

That framing matters, because it makes the target much sharper:

Given only the prompt and one prompt-only forward pass, can we estimate whether this exact model instance is likely to answer correctly without retrieval, tool use, or extra generation?

Under that interpretation, the idea is useful. It could decide whether to:

- answer from parametric memory,
- invoke RAG,
- use a stronger/slower model,
- trigger a “deep thinking” path,
- ask for clarification,
- or abstain.

That is a very practical deployment problem.

Direct answer

Yes, I think this direction is promising. I would probably not change the architecture first. I would first tighten the evaluation and compare against a few very close baselines.

The closest framing I know is:

Area	Relevant work	Why it is close
Pre-generation knowledge estimation	KEEN: Estimating Knowledge in LLMs Without Generating a Single Token	Uses internal representations to estimate what a model knows before generation.
Query-level uncertainty	Query-Level Uncertainty in Large Language Models / Internal Confidence and code	Single forward pass, no answer generation, intended for RAG, cascading, deep thinking, and abstention.
Knowledge-boundary perception	Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception and code	Uses internal states for pre/post-generation confidence and knowledge-boundary perception.
Selective prediction	Selective Classification for Deep Neural Networks	Gives the risk–coverage framing: answer only when confidence is high enough.
Adaptive retrieval	When Not to Trust Language Models, Adaptive Retrieval, When to Retrieve, Self-RAG	The natural downstream use case is deciding when parametric memory is enough and when retrieval is needed.

So I would say: the idea is not isolated; it sits in a real research thread. But your implementation has a nice practical flavor: a small sidecar, prompt-only features, logit-lens trajectory / crystallization signals, MLP-write features, and a usable GUI.

What I think the method is really measuring

I would be careful with the word knows.

For this setup, P(knows) is probably best read as:

P(this exact model instance answers correctly under this prompt format, decoding mode, and answer-matching rule)

not as:

P(the model metaphysically knows the truth)

That is not a criticism. In fact, I think your behavioral definition is a good one. It is deployment-relevant. If the sidecar predicts the model’s own closed-book correctness, that is exactly the signal a router needs.

But I would make the definition explicit everywhere, because otherwise readers may interpret this as a general truthfulness estimator or hallucination detector.

A possible wording:

In this prototype, “knows” means “the base model’s greedy closed-book answer matches a gold answer or alias under the current evaluator.” It is therefore a model-behavior label, not a direct claim about world truth.
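
A minimal sketch of such a behavioral label, assuming a simple normalize-then-match evaluator. The function names and the normalization steps (lowercasing, accent stripping, punctuation and article removal) are illustrative assumptions, not the repo's actual matcher:

```python
import re
import string
import unicodedata

_ARTICLES = re.compile(r"\b(a|an|the)\b")
_PUNCT = str.maketrans("", "", string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, strip accents, ASCII punctuation, articles, and extra spaces."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower().translate(_PUNCT)
    text = _ARTICLES.sub(" ", text)
    return " ".join(text.split())


def label_knows(greedy_answer: str, gold_aliases: list[str]) -> int:
    """Return 1 if any normalized alias appears as whole words in the answer.

    This is a model-behavior label under one specific matcher (an assumption),
    not a claim about world truth.
    """
    padded = f" {normalize_answer(greedy_answer)} "
    for alias in gold_aliases:
        norm = normalize_answer(alias)
        if norm and f" {norm} " in padded:
            return 1
    return 0
```

Matching on whole words rather than raw substrings avoids counting an alias such as "No" as present in "know". Any change to this matcher changes the label, which is why the definition needs to be stated alongside the results.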

What looks strong

A few parts look especially good.

Component	Why it is interesting
Prompt-only forward pass	Very attractive for routing because it avoids paying for answer generation before deciding whether to retrieve/escalate.
Small sidecar	Easier to deploy than a second LLM judge or multi-sample uncertainty method.
Layer trajectory	This is more informative than only taking the last hidden state. It lines up with the intuition behind logit lens / tuned lens style analyses.
MLP-write features	This has a plausible mechanistic connection to factual recall, given the “FFN as key-value memory” literature.
Risk–coverage plots	This is the right kind of deployment-facing visualization.
CUDA/MPS portability	Very useful for people actually trying to reproduce or run the system.

The MLP-write direction is particularly interesting. There is prior work arguing that transformer feed-forward layers behave like key-value memories, and work on knowledge neurons also points toward factual information being localized in internal activations. I would not overclaim that the sidecar has found “the knowledge neurons,” but the feature design is plausible.

For the logit-lens side, I would also mention Tuned Lens. Raw logit lens can be noisy or brittle, especially in middle layers, so I like that your README already treats the MLP-write lens carefully and emphasizes magnitudes / trajectories rather than taking every decoded token literally.

Closest baselines I would add

If the goal is to convince people that LRD adds value, I would compare it against these.

Baseline	Why it matters
Internal Confidence	Very close: training-free, query-level, single forward pass, no answer tokens. This is probably the most important external baseline.
KEEN-style probe	Very close: internal representation → knowledge estimate before generation. It is entity-oriented, but the probing idea is highly relevant.
Last-token / last-layer hidden-state probe	Strong minimal white-box baseline. If this is close to LRD, the extra feature engineering needs justification.
Entropy / max-prob / margin	Cheap uncertainty baselines. These are hard to ignore because they are simple and often surprisingly strong.
Global scalar logistic regression	Helps show whether the GRU-over-depth is doing more than using a few scalar features.
Semantic Entropy Probes	Not exactly the same because they use hidden states from a generation, but useful as a cheap learned uncertainty baseline.
LM-Polygraph methods	Useful if you want to compare against a broader UE toolkit rather than hand-picking baselines.
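
For the entropy / max-prob / margin row, a minimal sketch of the three scalar scores. It assumes the logits come from the last prompt position of the prompt-only forward pass; how they are extracted from the model is not shown, and the function name is hypothetical:

```python
import math


def next_token_uncertainty(logits: list[float]) -> dict[str, float]:
    """Cheap confidence scores from the next-token logits at the last prompt position.

    Higher values mean "more confident" for all three scores.
    """
    if len(logits) < 2:
        raise ValueError("need at least two logits to compute a margin")
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    top = sorted(probs, reverse=True)
    return {
        "neg_entropy": -entropy,      # negated so that higher = more confident
        "max_prob": top[0],           # probability of the most likely token
        "margin": top[0] - top[1],    # gap between the top two tokens
    }
```

These scores need no training, so they make a useful floor: any learned sidecar should clear them on the same split.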

Evaluation: I would tighten this before changing the architecture

This is the biggest thing I would look at.

From the README, the reported numbers are already good:

Split	Model	AUROC	AUPRC	ECE
held-out test	LRD	0.8983	0.7937	0.0925
held-out test	last-layer probe	0.8189	0.6571	0.1807
held-out test	global logreg	0.8665	0.7477	0.1161
held-out test	neg-entropy	0.8529	0.7269	—
cross PopQA	LRD	0.9411	0.7629	0.0679
cross PopQA	last-layer probe	0.9612	0.8875	0.0327
cross PopQA	global logreg	0.8748	0.5913	0.1494
cross PopQA	neg-entropy	0.8611	0.5903	—

The held-out test result is encouraging. But I would be cautious about calling the current cross result “cross-dataset generalization” unless the sidecar is actually trained only on the source dataset used for the cross split.

If I am reading the shipped split files correctly, cross.test is all PopQA, but the sidecar checkpoint appears to be trained on the main.train split. Since main.train is stratified over both PopQA and TriviaQA, that means many PopQA examples are already part of the sidecar’s training distribution. So the current cross result may be more like “evaluate the already-trained sidecar on all PopQA” than “train on TriviaQA, test on PopQA.”

I may be misreading the intended workflow, but if not, I would make the cross protocol stricter:

Claim	Suggested protocol
Held-out in-distribution	Deduplicate first, then stratified train/val/test over the mixed dataset.
True dataset transfer	Train sidecar on TriviaQA only, tune/calibrate on TriviaQA only, test once on PopQA only.
Reverse transfer	Train on PopQA only, test on TriviaQA only.
Popularity transfer	Train on high-popularity PopQA, test on low-popularity PopQA, and vice versa.
Dataset identity control	Report whether a dataset-only or popularity-only classifier already predicts `knows`.

I would also deduplicate questions before splitting. Exact duplicate questions across train/val/test can make held-out metrics optimistic, especially for a small 3k-example dataset.
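
A minimal sketch of the stricter protocol: deduplicate first, then build a true source-to-target transfer split. The `question` and `dataset` field names and the dataset labels are assumptions about the data layout, not the repo's schema:

```python
import re


def normalize_question(question: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", question.lower()).split())


def deduplicate(examples: list[dict]) -> list[dict]:
    """Keep the first example for each normalized question."""
    seen = set()
    kept = []
    for ex in examples:
        key = normalize_question(ex["question"])
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept


def transfer_split(examples: list[dict], source: str, target: str):
    """Train only on `source`, test only on `target`, after deduplication."""
    deduped = deduplicate(examples)
    train = [ex for ex in deduped if ex["dataset"] == source]
    test = [ex for ex in deduped if ex["dataset"] == target]
    return train, test
```

Because deduplication runs over the mixed pool before the split, a question that appears in both datasets survives in only one of them, so it cannot leak from train to test. This handles exact and near-exact question text only; entity-level leakage needs a separate entity key.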

This does not invalidate the idea. It just means the evaluation should be made leak-resistant before making strong claims.

Metrics I would emphasize

AUROC/AUPRC/ECE are useful, but for this application I would make risk–coverage the primary metric.

Why?

Because the product question is not only:

“Can the sidecar rank known vs unknown questions?”

It is:

“At a chosen risk tolerance, how many questions can the system safely answer without retrieval or escalation?”

So I would report:

Metric	Interpretation
`risk@coverage`	If the system answers the top X% most confident questions, what is the error rate?
`coverage@risk≤5%`	How many questions can be answered while keeping error below 5%?
`coverage@risk≤10%`	Same, but with a looser risk target.
`AURC` / area under risk–coverage	Overall selective-prediction quality.
`RAG calls saved at fixed accuracy`	Practical cost-saving metric.
`accuracy at fixed RAG budget`	Practical quality metric.
`ECE / Brier score`	Calibration quality, secondary but still useful.
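
A minimal sketch of the risk–coverage metrics in the table, assuming one confidence score and one 0/1 correctness label per question. The function names are illustrative, and AURC is computed here as the mean risk over all coverage levels, which is one common empirical definition:

```python
def risk_coverage_curve(scores: list[float], correct: list[int]):
    """Return (coverage, risk) pairs, answering the most confident questions first."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    curve = []
    errors = 0
    for answered, i in enumerate(order, start=1):
        errors += 1 - correct[i]
        curve.append((answered / len(order), errors / answered))
    return curve


def coverage_at_risk(curve, max_risk: float) -> float:
    """Largest coverage whose error rate stays at or below `max_risk`."""
    best = 0.0
    for coverage, risk in curve:
        if risk <= max_risk:
            best = max(best, coverage)
    return best


def aurc(curve) -> float:
    """Area under the risk-coverage curve (mean risk across coverage levels)."""
    return sum(risk for _, risk in curve) / len(curve)
```

Usage: `coverage_at_risk(risk_coverage_curve(scores, correct), 0.05)` gives the `coverage@risk≤5%` number. Risk is not monotone in coverage, so taking the largest qualifying coverage is a design choice; a stricter variant would stop at the first violation.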

This recent paper is also relevant: Entropy Alone is Insufficient for Safe Selective Prediction in LLMs. It argues that entropy-only uncertainty can fail for selective prediction, and that combining entropy with a correctness probe can improve risk–coverage and calibration. That is a useful argument in favor of systems like yours: the goal should not be “beat entropy everywhere,” but “show where learned internal signals improve deployment-facing selective behavior beyond entropy.”

Suggested RAG-gating experiment

I think the most convincing next experiment would be a simple RAG router.

For each question:

1. Compute the LRD score from the prompt-only forward pass.
2. If score ≥ threshold, answer closed-book.
3. If score < threshold, retrieve context and answer with RAG.
4. Compare against:
   - always closed-book,
   - always retrieve,
   - entropy gate,
   - Internal Confidence gate,
   - popularity threshold,
   - random gate.

Then report:

System	Accuracy	Retrieval rate	Cost proxy	Error rate on non-retrieved answers
Always closed-book	—	0%	low	—
Always retrieve	—	100%	high	—
Popularity threshold	—	—	—	—
Entropy gate	—	—	—	—
Internal Confidence gate	—	—	—	—
LRD gate	—	—	—	—
Oracle gate	—	minimal	lower bound	0%
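
A minimal sketch of how one row of that table could be filled in offline, assuming the closed-book and RAG answers have already been scored as 0/1 for every question. The function and argument names are hypothetical:

```python
def evaluate_gate(scores, closed_book_correct, rag_correct, threshold):
    """Simulate a threshold gate: retrieve when the score is below `threshold`.

    scores:              gate score per question (higher = trust parametric memory)
    closed_book_correct: 0/1 correctness of the closed-book answer
    rag_correct:         0/1 correctness of the RAG answer
    """
    n = len(scores)
    retrieved = [s < threshold for s in scores]
    final = [
        rag if use_rag else closed
        for use_rag, closed, rag in zip(retrieved, closed_book_correct, rag_correct)
    ]
    kept = [c for use_rag, c in zip(retrieved, closed_book_correct) if not use_rag]
    return {
        "accuracy": sum(final) / n,
        "retrieval_rate": sum(retrieved) / n,
        # Error rate on answers given without retrieval; None if nothing was kept.
        "closed_book_error": (1 - sum(kept) / len(kept)) if kept else None,
    }
```

The fixed baselines fall out of the same function: `threshold=float("-inf")` is always closed-book and `threshold=float("inf")` is always retrieve. Swapping in entropy, Internal Confidence, or popularity as `scores` gives the other gates on identical data.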

This would connect directly to the PopQA / adaptive retrieval literature.

PopQA is a good dataset for this because the known/unknown boundary is strongly related to long-tail knowledge. The original PopQA paper found that LMs are much weaker on less popular factual knowledge, while retrieval helps long-tail cases and closed-book answering can still be competitive for high-popularity facts. That is almost exactly the deployment niche for a pre-generation router.

One detail: dataset identity may be doing work

Because the dataset is a balanced PopQA + TriviaQA mix, I would check whether the model is partly learning “this looks like TriviaQA” vs “this looks like PopQA.”

That matters because PopQA and TriviaQA differ in more than just whether Gemma knows the answer. They differ in question style, entity popularity, answer distribution, and probably surface form.

A quick control:

Control	Purpose
Dataset-only classifier	Does `dataset ∈ {PopQA, TriviaQA}` predict `knows` too well?
Question-length / lexical features	Are shallow features enough?
Popularity-only PopQA baseline	Does `s_pop` explain most PopQA performance?
Same-dataset OOD split	Avoids dataset identity as a shortcut.
Entity-disjoint split	Prevents the same subject/entity from appearing in train and test.

If these controls are weak and LRD remains strong, the result becomes much more convincing.
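
A minimal sketch of the dataset-only control: score every question with nothing but its source dataset and measure AUROC against the `knows` label. The dataset names and function names are assumptions for illustration:

```python
def auroc(scores: list[float], labels: list[int]) -> float:
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def dataset_only_auroc(datasets, knows, positive_dataset="triviaqa"):
    """AUROC of a 'classifier' that only looks at which dataset a question came from."""
    scores = [1.0 if d == positive_dataset else 0.0 for d in datasets]
    return auroc(scores, knows)
```

If this number is already high on the mixed split, part of the sidecar's reported AUROC can be explained by dataset identity alone, and the same-dataset and entity-disjoint splits become the more trustworthy ones. The pairwise loop is quadratic, which is fine for a few thousand examples.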

Architecture comments

I would keep the architecture simple until the evaluation is locked down.

Current LRD:

per-layer features [L, F] → GRU over depth → mean/last pooling → global scalars → MLP → calibrated probability

That is reasonable. I would not jump to a larger network yet.

I would test ablations first:

Ablation	Question answered
Signal A only	Are crystallization / logit-lens trajectory features enough?
Signal B only	Are MLP-write features enough?
Global scalars only	Are entropy/margin-like features doing most of the work?
Last hidden state only	Does the engineered feature stack beat a simple probe?
No GRU, pooled layers only	Is cross-layer sequence modeling actually helping?
Layer-shuffled GRU	Is the depth order meaningful?
Random-label sanity check	Confirms no pipeline leakage.
Train on one dataset, calibrate on same dataset, test on another	Tests actual transfer.

Only after that would I add Signal C/D.

Signal C/D ideas

Your README mentions perturbation robustness and Mahalanobis familiarity as scaffolded but off by default. Those are plausible additions.

Potential additions:

Signal	Intuition	Caveat
Perturbation stability	Known answers should have more stable internal representations under small perturbations.	Extra forward passes may weaken the “single forward” selling point.
Mahalanobis / representation familiarity	Unknown queries may be farther from familiar training-like regions.	Can overfit to dataset/source identity.
Cross-layer update magnitude	Hallucination/unknown cases may show different residual-stream update patterns.	Needs careful normalization across models.
Tuned-lens trajectory	More robust than raw logit lens.	Requires training lens probes per model.
Entity popularity / retrieval prior	Useful for PopQA-like settings.	Not available for arbitrary user questions.

For a production router, I would preserve the one-forward property if possible. It is a big advantage.

Calibration

Because the output is used as a decision score, calibration matters.

I would report calibration separately for:

Slice	Why
PopQA	Long-tail distribution with a low base rate of known answers (unknown-heavy).
TriviaQA	Easier / higher known-rate distribution.
Popularity quartiles	Tests whether calibration breaks on rare facts.
Short vs long questions	Surface complexity may affect confidence.
High-confidence positives	Deployment-critical: these are the questions the system will answer without retrieval.
High-confidence negatives	Deployment-critical: these decide retrieval/escalation cost.

Temperature scaling is a good start, but I would be careful to fit the calibration temperature only on a clean validation set that does not overlap with the evaluation split or target OOD dataset.
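
A minimal sketch of temperature scaling for a binary sidecar, assuming the sidecar emits one raw logit per question and that `val_logits` / `val_labels` come from a validation split disjoint from every test set. A grid search stands in for a proper optimizer, and the grid range is an arbitrary assumption:

```python
import math


def nll(logits, labels, temperature):
    """Mean binary negative log-likelihood of sigmoid(logit / temperature)."""
    total = 0.0
    for z, y in zip(logits, labels):
        s = z / temperature
        x = -s if y == 1 else s
        # Numerically stable softplus(x) = log(1 + exp(x)).
        total += max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


def fit_temperature(val_logits, val_labels, grid=None):
    """Pick the temperature with the lowest validation NLL from a fixed grid."""
    grid = grid or [0.05 * k for k in range(1, 201)]  # 0.05 .. 10.0
    return min(grid, key=lambda t: nll(val_logits, val_labels, t))


def calibrated_probability(logit, temperature):
    """Apply the fitted temperature to a new logit."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))
```

A fitted temperature above 1 means the raw sidecar was overconfident. Since temperature scaling only rescales logits, it leaves the ranking (and therefore AUROC and the risk–coverage curve) unchanged; it affects ECE, Brier score, and any fixed probability threshold.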

How I would phrase the contribution

I would avoid:

“This detects hallucinations before generation.”

That is too broad.

I would prefer:

“This estimates whether a specific model is likely to answer a prompt correctly from parametric memory, before generating answer tokens.”

Or:

“This is a prompt-only, white-box selective-prediction sidecar for closed-book QA.”

Or:

“This is a one-forward-pass router for deciding whether to trust parametric memory or invoke retrieval/escalation.”

Those are narrower and easier to defend.

A possible paper-style claim

If the tightened experiments hold, the claim could be:

We train a lightweight sidecar on prompt-only internal features to predict a model’s closed-book answerability before generation. On held-out QA examples, the sidecar improves risk–coverage and calibration over entropy and simple hidden-state probes. In a RAG-gating setting, it reduces retrieval calls at fixed accuracy compared with always-retrieve and entropy-gated baselines.

That would be a strong, practical claim.

Suggested next checklist

If I were iterating on this repo, I would work in this order:

1. Deduplicate before splitting
   - exact question duplicates,
   - possibly entity-level duplicates,
   - possibly normalized answer duplicates if they imply same fact.
2. Rebuild splits
   - mixed held-out,
   - true TriviaQA→PopQA,
   - true PopQA→TriviaQA,
   - popularity-heldout PopQA.
3. Retrain sidecar separately per protocol
   - do not reuse a mixed-trained checkpoint for a claimed cross-dataset result.
4. Add close baselines
   - Internal Confidence,
   - KEEN-style probe,
   - last-hidden probe,
   - entropy / max-prob / margin,
   - global scalar logreg.
5. Make risk–coverage primary
   - coverage@risk≤5%,
   - coverage@risk≤10%,
   - AURC,
   - RAG calls saved at fixed accuracy.
6. Run a RAG-gating demo
   - always closed-book,
   - always retrieve,
   - entropy gate,
   - LRD gate,
   - Internal Confidence gate,
   - oracle.
7. Keep the “knows” definition narrow
   - model-behavior label,
   - same prompt format,
   - same decoding mode,
   - same answer matcher.

Minor naming thought

P(knows) is intuitive for the GUI, but in a paper or README I might also expose a more precise name:

P(correct_closed_book)
P(answerable_by_base_model)
P(parametric_success)
P(self_correct)
P(closed_book_success)

Maybe keep P(knows) as the user-facing short label, but define it formally as one of the above.

Bottom line

I think this is a promising direction, especially if framed as a pre-generation router / selective predictor rather than a broad hallucination detector.

The most valuable next step is probably not a more complex sidecar. It is a stricter evaluation:

- deduplicated splits,
- true dataset-transfer training,
- close baselines like Internal Confidence and KEEN-style probes,
- deployment-facing risk–coverage metrics,
- and a RAG-gating experiment showing retrieval calls saved at fixed accuracy.

If those results still hold, the project would have a much stronger story: not just “the sidecar gets high AUROC,” but “the sidecar makes useful routing decisions before generation.”

KnackAU · June 9, 2026, 3:37am

Strong +1 to John’s core point - tighten the evaluation before touching the architecture. I’ll add four confounds we ran into doing internal-state probing on this exact model family (Gemma), roughly ordered by how much they’d move your numbers. They’re not yet on the list and the first one is, in our experience, the silent killer.

0. Certify the weight artifact before trusting any internal-state feature

Your logit-lens “crystallization” trajectory and MLP-write magnitudes are only as faithful as the weights they’re read from - and Gemma’s quantized artifacts are a minefield right now. In our own work on this family we found several GGUF/quantized conversions of Gemma silently broken: a clean bf16 forward pass gave wikitext PPL ≈ 4.7, while multiple GGUF artifacts of the same weights gave ≈ 190–500 (roughly 40–100× worse). On an artifact like that, the “knowledge crystallizes across depth” signal you’re keying on is partly measuring quantization damage, not parametric knowledge - and it’ll look structured, because the damage is structured. We were using Gemma-4 12b. The E-series models are probably OK, but they are worth checking too.

Concrete: print the probed model’s wikitext PPL next to a known reference and sanity-check it before you trust a single feature. Prefer bf16 safetensors-direct for the model you probe. If you have to probe the quantized deployment target, see #2.
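
A minimal sketch of that sanity check, assuming the per-token log-probabilities (natural log) on the reference text have already been collected from the probed artifact. The function names and the `max_ratio=1.5` tolerance are illustrative assumptions, not a recommended standard:

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp of the mean negative log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def certify_artifact(token_logprobs, reference_ppl, max_ratio=1.5):
    """Return (ppl, ok), where ok is False if PPL exceeds reference_ppl * max_ratio.

    `reference_ppl` should come from a trusted run of the same weights,
    for example a bf16 forward pass on the same text and tokenization.
    """
    ppl = perplexity(token_logprobs)
    return ppl, ppl <= reference_ppl * max_ratio
```

With the numbers quoted above, a reference of 4.7 and a measured 190 fails by a wide margin under any reasonable tolerance, so the exact ratio matters less than running the check at all before extracting features.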

1. The current LRD-vs-baseline gaps look like they’re inside split + seed noise

Single split, ~3k examples: LRD 0.898 vs global-logreg 0.867 is a 0.03 AUROC gap.
On the cross-PopQA split the last-layer probe (0.961) already beats LRD (0.941). That’s the simplest baseline outscoring the full architecture on the transfer split - which, combined with John’s leakage observation, strongly suggests the held-out margin isn’t robust yet.

We got burned by exactly this on a different internal-state probe: a held-out score of 0.20 collapsed to ~0.13 the moment we made the train/val split context-disjoint, and single-seed runs swung ±0.04 between adjacent checkpoints - enough to flip a “win” into a “loss.” We now treat any number as inconclusive until ≥3 seeds with a bootstrap CI over the test set.

Concrete: 3–5 seeds, report mean ± CI (or a paired bootstrap of LRD − baseline, which is what actually tells you if the gap is real). And dedup at the entity level, not just exact-question - PopQA and TriviaQA share entities, so question-level dedup leaves entity leakage intact, and an entity that appears in train teaches the probe the entity, not the knowledge boundary.
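
A minimal sketch of the paired bootstrap, assuming both systems were scored on the same test examples. It resamples test indices with replacement and recomputes the AUROC gap each time; the function names, the 1000-resample default, and the percentile interval are illustrative choices:

```python
import random


def auroc(scores, labels):
    """Pairwise AUROC; ties between a positive and a negative count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def paired_bootstrap_gap(scores_a, scores_b, labels, n_resamples=1000, seed=0):
    """Mean and 95% percentile interval of AUROC(a) - AUROC(b) over resamples."""
    n = len(labels)
    if sum(labels) in (0, n):
        raise ValueError("labels must contain both classes")
    rng = random.Random(seed)
    gaps = []
    while len(gaps) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if sum(y) in (0, n):
            continue  # skip degenerate resamples with a single class
        gap = auroc([scores_a[i] for i in idx], y) - auroc([scores_b[i] for i in idx], y)
        gaps.append(gap)
    gaps.sort()
    low = gaps[(25 * n_resamples) // 1000]
    high = gaps[min(len(gaps) - 1, (975 * n_resamples) // 1000)]
    return sum(gaps) / len(gaps), low, high
```

If the interval for LRD minus a baseline includes zero, the gap is not distinguishable from test-set noise. This covers sampling noise in the test set only; seed-to-seed training variance still needs the multi-seed runs.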

2. P(knows) is quantization-instance specific, not just model specific

Quantization changes which facts are recoverable - we have direct evidence that coarse quantization degrades exact factual recall on this family, not just average loss. So a sidecar trained on fp16 Gemma will mis-route a Q4 deployment: the boundary literally moves. Train and calibrate the router on the exact deployed (quantized) instance, and re-fit the temperature per instance - calibration in particular will not transfer across quant levels.

3. (smaller) Layer-class structure on Gemma

Gemma interleaves global and sliding-window attention layers - different effective ranges (global carries long-range, SWA fades past its window) and different RoPE. A GRU-over-depth that treats all layers as one uniform sequence is averaging over two different signal regimes. A cheap ablation: tag each layer with its class (or pool the two classes separately) and see whether the depth signal actually lives in the global layers. If it does, that’s both a smaller, more honest feature set and a more transferable one.
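
A minimal sketch of the "pool the two classes separately" ablation, assuming per-layer feature vectors and a boolean mask marking the global-attention layers. The mask must come from the model's own configuration; the names here are hypothetical:

```python
def pool_by_layer_class(layer_features, is_global):
    """Mean-pool per-layer features separately for global and sliding-window layers.

    layer_features: list of L feature vectors (each a list of F floats)
    is_global:      list of L booleans, True where the layer uses global attention
    """
    def mean_pool(rows):
        return [sum(column) / len(rows) for column in zip(*rows)]

    global_rows = [f for f, g in zip(layer_features, is_global) if g]
    sliding_rows = [f for f, g in zip(layer_features, is_global) if not g]
    if not global_rows or not sliding_rows:
        raise ValueError("expected at least one layer of each class")
    return {"global": mean_pool(global_rows), "sliding": mean_pool(sliding_rows)}
```

Training one small classifier on each pooled vector, and one on their concatenation, would show whether the depth signal lives mostly in the global layers. This discards depth order within each class, so it is a diagnostic rather than a replacement for the GRU.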

Bottom line

John’s eval-tightening is the right first move. I’d prepend step 0: certify the weight artifact - a broken quant silently poisons every downstream internal feature, and it’s the cheapest thing to rule out. Then treat the current LRD-vs-baseline gaps as within-noise until you have multi-seed + entity-disjoint + true train→test transfer, and bind the probe to the exact quantized instance you’ll deploy. If the margin survives all of that, you’ve got a genuinely strong, deployment-relevant result - and the prompt-only, one-forward property is worth protecting the whole way.
