CUDA support added - Pre-generation knowledge-boundary estimator

I built a sidecar that takes the user’s prompt and does one forward pass, predicting whether the model’s parametric knowledge is sufficient to answer the question correctly (no answer tokens are generated).

So far I’m getting good results, but I’m wondering if there is a better way to do this. Thoughts?

GitHub

For now, this looks like a promising direction:


I would frame this less as a generic “hallucination detector” and more as a pre-generation selective predictor / router for the model’s own closed-book QA behavior.

That framing matters, because it makes the target much sharper:

Given only the prompt and one prompt-only forward pass, can we estimate whether this exact model instance is likely to answer correctly without retrieval, tool use, or extra generation?

Under that interpretation, the idea is useful. It could decide whether to:

  • answer from parametric memory,
  • invoke RAG,
  • use a stronger/slower model,
  • trigger a “deep thinking” path,
  • ask for clarification,
  • or abstain.

That is a very practical deployment problem.

Direct answer

Yes, I think this direction is promising. I would probably not change the architecture first. I would first tighten the evaluation and compare against a few very close baselines.

The closest framing I know is:

| Area | Relevant work | Why it is close |
| --- | --- | --- |
| Pre-generation knowledge estimation | KEEN: Estimating Knowledge in LLMs Without Generating a Single Token | Uses internal representations to estimate what a model knows before generation. |
| Query-level uncertainty | Query-Level Uncertainty in Large Language Models / Internal Confidence and code | Single forward pass, no answer generation, intended for RAG, cascading, deep thinking, and abstention. |
| Knowledge-boundary perception | Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception and code | Uses internal states for pre/post-generation confidence and knowledge-boundary perception. |
| Selective prediction | Selective Classification for Deep Neural Networks | Gives the risk–coverage framing: answer only when confidence is high enough. |
| Adaptive retrieval | When Not to Trust Language Models, Adaptive Retrieval, When to Retrieve, Self-RAG | The natural downstream use case is deciding when parametric memory is enough and when retrieval is needed. |

So I would say: the idea is not isolated; it sits in a real research thread. But your implementation has a nice practical flavor: a small sidecar, prompt-only features, logit-lens trajectory / crystallization signals, MLP-write features, and a usable GUI.

What I think the method is really measuring

I would be careful with the word “knows.”

For this setup, P(knows) is probably best read as:

P(this exact model instance answers correctly under this prompt format, decoding mode, and answer-matching rule)

not as:

P(the model metaphysically knows the truth)

That is not a criticism. In fact, I think your behavioral definition is a good one. It is deployment-relevant. If the sidecar predicts the model’s own closed-book correctness, that is exactly the signal a router needs.

But I would make the definition explicit everywhere, because otherwise readers may interpret this as a general truthfulness estimator or hallucination detector.

A possible wording:

In this prototype, “knows” means “the base model’s greedy closed-book answer matches a gold answer or alias under the current evaluator.” It is therefore a model-behavior label, not a direct claim about world truth.
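To make that behavioral definition concrete, here is a minimal sketch of how such a label could be computed. The normalization rules and the names `normalize_answer` and `knows_label` are my own assumptions for illustration, not the repo's actual evaluator.

```python
import re
import string
import unicodedata


def normalize_answer(text: str) -> str:
    """Lowercase, strip accents, punctuation, English articles, and extra whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower()
    text = "".join(" " if ch in string.punctuation else ch for ch in text)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def knows_label(greedy_answer: str, gold_aliases: list[str]) -> int:
    """Return 1 if any normalized alias occurs as a whole-token span in the answer.

    This is a model-behavior label: it depends on the prompt format, the
    decoding mode that produced `greedy_answer`, and this matching rule.
    """
    pred = f" {normalize_answer(greedy_answer)} "
    for alias in gold_aliases:
        norm = normalize_answer(alias)
        if norm and f" {norm} " in pred:
            return 1
    return 0
```

Changing any of the three ingredients (prompt, decoding, matcher) changes the label, which is why it helps to state them next to every reported number.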

What looks strong

A few parts look especially good.

| Component | Why it is interesting |
| --- | --- |
| Prompt-only forward pass | Very attractive for routing because it avoids paying for answer generation before deciding whether to retrieve/escalate. |
| Small sidecar | Easier to deploy than a second LLM judge or multi-sample uncertainty method. |
| Layer trajectory | This is more informative than only taking the last hidden state. It lines up with the intuition behind logit lens / tuned lens style analyses. |
| MLP-write features | This has a plausible mechanistic connection to factual recall, given the “FFN as key-value memory” literature. |
| Risk–coverage plots | This is the right kind of deployment-facing visualization. |
| CUDA/MPS portability | Very useful for people actually trying to reproduce or run the system. |

The MLP-write direction is particularly interesting. There is prior work arguing that transformer feed-forward layers behave like key-value memories, and work on knowledge neurons also points toward factual information being localized in internal activations. I would not overclaim that the sidecar has found “the knowledge neurons,” but the feature design is plausible.

For the logit-lens side, I would also mention Tuned Lens. Raw logit lens can be noisy or brittle, especially in middle layers, so I like that your README already treats the MLP-write lens carefully and emphasizes magnitudes / trajectories rather than taking every decoded token literally.

Closest baselines I would add

If the goal is to convince people that LRD adds value, I would compare it against these.

| Baseline | Why it matters |
| --- | --- |
| Internal Confidence | Very close: training-free, query-level, single forward pass, no answer tokens. This is probably the most important external baseline. |
| KEEN-style probe | Very close: internal representation → knowledge estimate before generation. It is entity-oriented, but the probing idea is highly relevant. |
| Last-token / last-layer hidden-state probe | Strong minimal white-box baseline. If this is close to LRD, the extra feature engineering needs justification. |
| Entropy / max-prob / margin | Cheap uncertainty baselines. These are hard to ignore because they are simple and often surprisingly strong. |
| Global scalar logistic regression | Helps show whether the GRU-over-depth is doing more than using a few scalar features. |
| Semantic Entropy Probes | Not exactly the same because they use hidden states from a generation, but useful as a cheap learned uncertainty baseline. |
| LM-Polygraph methods | Useful if you want to compare against a broader UE toolkit rather than hand-picking baselines. |
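For the cheap scalar baselines in that list, a minimal sketch could look like the following. It assumes you already have the next-token logits at the last prompt position as a plain list of floats; the function names are hypothetical and this is not the repo's implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_uncertainty_baselines(logits):
    """Return confidence scores where higher means 'more likely known'.

    Assumes at least two logits (needed for the top-1 vs top-2 margin).
    """
    probs = softmax(logits)
    ranked = sorted(probs, reverse=True)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return {
        "neg_entropy": -entropy,          # negated so that higher = more confident
        "max_prob": ranked[0],            # probability of the top token
        "margin": ranked[0] - ranked[1],  # gap between top-1 and top-2
    }
```

All three scores can be fed directly into the same AUROC and risk–coverage code as the sidecar output, which keeps the comparison fair.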


Evaluation: I would tighten this before changing the architecture

This is the biggest thing I would look at.

From the README, the reported numbers are already good:

| Split | Model | AUROC | AUPRC | ECE |
| --- | --- | --- | --- | --- |
| held-out test | LRD | 0.8983 | 0.7937 | 0.0925 |
| held-out test | last-layer probe | 0.8189 | 0.6571 | 0.1807 |
| held-out test | global logreg | 0.8665 | 0.7477 | 0.1161 |
| held-out test | neg-entropy | 0.8529 | 0.7269 | |
| cross PopQA | LRD | 0.9411 | 0.7629 | 0.0679 |
| cross PopQA | last-layer probe | 0.9612 | 0.8875 | 0.0327 |
| cross PopQA | global logreg | 0.8748 | 0.5913 | 0.1494 |
| cross PopQA | neg-entropy | 0.8611 | 0.5903 | |

The held-out test result is encouraging. But I would be cautious about calling the current cross result “cross-dataset generalization” unless the sidecar is actually trained only on the source dataset used for the cross split.

If I am reading the shipped split files correctly, cross.test is all PopQA, but the sidecar checkpoint appears to be trained on the main.train split. Since main.train is stratified over both PopQA and TriviaQA, that means many PopQA examples are already part of the sidecar’s training distribution. So the current cross result may be more like “evaluate the already-trained sidecar on all PopQA” than “train on TriviaQA, test on PopQA.”

I may be misreading the intended workflow, but if not, I would make the cross protocol stricter:

| Claim | Suggested protocol |
| --- | --- |
| Held-out in-distribution | Deduplicate first, then stratified train/val/test over the mixed dataset. |
| True dataset transfer | Train sidecar on TriviaQA only, tune/calibrate on TriviaQA only, test once on PopQA only. |
| Reverse transfer | Train on PopQA only, test on TriviaQA only. |
| Popularity transfer | Train on high-popularity PopQA, test on low-popularity PopQA, and vice versa. |
| Dataset identity control | Report whether a dataset-only or popularity-only classifier already predicts knows. |

I would also deduplicate questions before splitting. Exact duplicate questions across train/val/test can make held-out metrics optimistic, especially for a small 3k-example dataset.

This does not invalidate the idea. It just means the evaluation should be made leak-resistant before making strong claims.
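A minimal sketch of the dedup-then-split step, under stated assumptions: each example is a dict with hypothetical `question` and `entity` fields, and the split is made entity-disjoint by hashing the entity name so that every example about the same entity lands on the same side.

```python
import hashlib
import re


def question_key(question: str) -> str:
    """Normalize a question for exact-duplicate detection (case, punctuation, spacing)."""
    return " ".join(re.sub(r"[^\w\s]", " ", question.lower()).split())


def dedup_questions(examples):
    """Keep the first occurrence of each normalized question."""
    seen, kept = set(), []
    for ex in examples:
        key = question_key(ex["question"])
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept


def entity_disjoint_split(examples, test_fraction=0.2, salt="split-v1"):
    """Assign all examples of one entity to the same side via a stable hash.

    `salt` is a hypothetical knob: changing it yields a different but still
    reproducible split. The realized test share is only approximately
    `test_fraction` because whole entities move together.
    """
    train, test = [], []
    for ex in examples:
        digest = hashlib.sha256(f"{salt}:{ex['entity'].lower()}".encode()).hexdigest()
        bucket = int(digest[:8], 16) / 0xFFFFFFFF  # stable value in [0, 1]
        (test if bucket < test_fraction else train).append(ex)
    return train, test
```

Run `dedup_questions` before any split is built, otherwise duplicates can still straddle train and test.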

Metrics I would emphasize

AUROC/AUPRC/ECE are useful, but for this application I would make risk–coverage the primary metric.

Why?

Because the product question is not only:

“Can the sidecar rank known vs unknown questions?”

It is:

“At a chosen risk tolerance, how many questions can the system safely answer without retrieval or escalation?”

So I would report:

| Metric | Interpretation |
| --- | --- |
| risk@coverage | If the system answers the top X% most confident questions, what is the error rate? |
| coverage@risk≤5% | How many questions can be answered while keeping error below 5%? |
| coverage@risk≤10% | Same, but with a looser risk target. |
| AURC / area under risk–coverage | Overall selective-prediction quality. |
| RAG calls saved at fixed accuracy | Practical cost-saving metric. |
| accuracy at fixed RAG budget | Practical quality metric. |
| ECE / Brier score | Calibration quality, secondary but still useful. |

This recent paper is also relevant: Entropy Alone is Insufficient for Safe Selective Prediction in LLMs. It argues that entropy-only uncertainty can fail for selective prediction, and that combining entropy with a correctness probe can improve risk–coverage and calibration. That is a useful argument in favor of systems like yours: the goal should not be “beat entropy everywhere,” but “show where learned internal signals improve deployment-facing selective behavior beyond entropy.”
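The risk–coverage metrics above are cheap to compute once you have one score and one correctness label per question. A minimal sketch, assuming `scores` are sidecar outputs (higher = more confident) and `correct` holds 0/1 closed-book correctness; the function names are my own:

```python
def risk_coverage_curve(scores, correct):
    """Return (coverage, risk) points for answering the k most confident questions, k = 1..n.

    Ties in `scores` are broken by input order, so report results over
    several shuffles if many scores are identical.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    points, errors = [], 0
    for k, i in enumerate(order, start=1):
        errors += 1 - correct[i]
        points.append((k / len(order), errors / k))
    return points


def coverage_at_risk(points, max_risk):
    """Largest coverage whose selective risk is at or below `max_risk` (0.0 if none)."""
    return max((cov for cov, risk in points if risk <= max_risk), default=0.0)


def aurc(points):
    """Area under the risk-coverage curve, taken as the mean risk over all coverage levels."""
    return sum(risk for _, risk in points) / len(points)
```

For example, `coverage_at_risk(points, 0.05)` gives the coverage@risk≤5% number directly, and the same `points` list can be plotted as the risk–coverage curve.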

Suggested RAG-gating experiment

I think the most convincing next experiment would be a simple RAG router.

For each question:

  1. Compute LRD score from the prompt-only forward pass.

  2. If score ≥ threshold, answer closed-book.

  3. If score < threshold, retrieve context and answer with RAG.

  4. Compare against:

    • always closed-book,
    • always retrieve,
    • entropy gate,
    • Internal Confidence gate,
    • popularity threshold,
    • random gate.

Then report:

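| System | Accuracy | Retrieval rate | Cost proxy | Error rate on non-retrieved answers |
| --- | --- | --- | --- | --- |
| Always closed-book | | 0% | low | |
| Always retrieve | | 100% | high | |
| Popularity threshold | | | | |
| Entropy gate | | | | |
| Internal Confidence gate | | | | |
| LRD gate | | | | |
| Oracle gate | | minimal | lower bound | 0% |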
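To fill in a table like this, the gate itself is only a threshold on the score. Here is a minimal offline sketch, assuming you have already recorded, per question, whether the closed-book answer and the RAG answer were correct (`closed_book_correct`, `rag_correct` as 0/1 lists). These names are hypothetical and nothing here calls a real retriever.

```python
def route(score: float, threshold: float) -> str:
    """Answer from parametric memory when the score clears the threshold, else retrieve."""
    return "closed_book" if score >= threshold else "retrieve"


def evaluate_gate(scores, closed_book_correct, rag_correct, threshold):
    """Replay a gate over logged outcomes and report the table columns."""
    n = len(scores)
    retrieved = answered_cb = cb_errors = total_correct = 0
    for s, cb, rag in zip(scores, closed_book_correct, rag_correct):
        if route(s, threshold) == "retrieve":
            retrieved += 1
            total_correct += rag
        else:
            answered_cb += 1
            cb_errors += 1 - cb
            total_correct += cb
    return {
        "accuracy": total_correct / n,
        "retrieval_rate": retrieved / n,
        # Undefined when nothing is answered closed-book; reported as 0.0 here.
        "error_on_non_retrieved": cb_errors / answered_cb if answered_cb else 0.0,
    }
```

The same function covers every row: swap in entropy, Internal Confidence, popularity, or random scores for `scores`, and use a threshold below or above all scores for the always-closed-book and always-retrieve rows.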

This would connect directly to the PopQA / adaptive retrieval literature.

PopQA is a good dataset for this because the known/unknown boundary is strongly related to long-tail knowledge. The original PopQA paper found that LMs are much weaker on less popular factual knowledge, while retrieval helps long-tail cases and closed-book answering can still be competitive for high-popularity facts. That is almost exactly the deployment niche for a pre-generation router.

One detail: dataset identity may be doing work

Because the dataset is a balanced PopQA + TriviaQA mix, I would check whether the model is partly learning “this looks like TriviaQA” vs “this looks like PopQA.”

That matters because PopQA and TriviaQA differ in more than just whether Gemma knows the answer. They differ in question style, entity popularity, answer distribution, and probably surface form.

A quick control:

| Control | Purpose |
| --- | --- |
| Dataset-only classifier | Does dataset ∈ {PopQA, TriviaQA} predict knows too well? |
| Question-length / lexical features | Are shallow features enough? |
| Popularity-only PopQA baseline | Does s_pop explain most PopQA performance? |
| Same-dataset OOD split | Avoids dataset identity as a shortcut. |
| Entity-disjoint split | Prevents the same subject/entity from appearing in train and test. |

If these controls are weak and LRD remains strong, the result becomes much more convincing.
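The dataset-only control is nearly free to run. A minimal sketch, assuming one dataset tag and one 0/1 label per example; it scores each example by its dataset's known-rate and measures AUROC. The rates are computed in-sample here, which is an assumption made for brevity; fitting them on the training split would be cleaner.

```python
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both classes to compute AUROC")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def dataset_only_auroc(datasets, labels):
    """AUROC of a 'classifier' that only sees which dataset the question came from."""
    totals = {}
    for d, y in zip(datasets, labels):
        n, k = totals.get(d, (0, 0))
        totals[d] = (n + 1, k + y)
    rates = {d: k / n for d, (n, k) in totals.items()}
    return auroc([rates[d] for d in datasets], labels)
```

If this number is already well above 0.5 on the mixed split, part of any mixed-split AUROC is attributable to dataset identity rather than to the knowledge boundary.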

Architecture comments

I would keep the architecture simple until the evaluation is locked down.

Current LRD:

per-layer features [L, F] → GRU over depth → mean/last pooling → global scalars → MLP → calibrated probability

That is reasonable. I would not jump to a larger network yet.

I would test ablations first:

| Ablation | Question answered |
| --- | --- |
| Signal A only | Are crystallization / logit-lens trajectory features enough? |
| Signal B only | Are MLP-write features enough? |
| Global scalars only | Are entropy/margin-like features doing most of the work? |
| Last hidden state only | Does the engineered feature stack beat a simple probe? |
| No GRU, pooled layers only | Is cross-layer sequence modeling actually helping? |
| Layer-shuffled GRU | Is the depth order meaningful? |
| Random-label sanity check | Confirms no pipeline leakage. |
| Train on one dataset, calibrate on same dataset, test on another | Tests actual transfer. |

Only after that would I add Signal C/D.

Signal C/D ideas

Your README mentions perturbation robustness and Mahalanobis familiarity as scaffolded but off by default. Those are plausible additions.


Potential additions:

| Signal | Intuition | Caveat |
| --- | --- | --- |
| Perturbation stability | Known answers should have more stable internal representations under small perturbations. | Extra forward passes may weaken the “single forward” selling point. |
| Mahalanobis / representation familiarity | Unknown queries may be farther from familiar training-like regions. | Can overfit to dataset/source identity. |
| Cross-layer update magnitude | Hallucination/unknown cases may show different residual-stream update patterns. | Needs careful normalization across models. |
| Tuned-lens trajectory | More robust than raw logit lens. | Requires training lens probes per model. |
| Entity popularity / retrieval prior | Useful for PopQA-like settings. | Not available for arbitrary user questions. |

For a production router, I would preserve the one-forward property if possible. It is a big advantage.

Calibration

Because the output is used as a decision score, calibration matters.

I would report calibration separately for:

| Slice | Why |
| --- | --- |
| PopQA | Long-tail / low base-rate unknown-heavy distribution. |
| TriviaQA | Easier / higher known-rate distribution. |
| Popularity quartiles | Tests whether calibration breaks on rare facts. |
| Short vs long questions | Surface complexity may affect confidence. |
| High-confidence positives | Deployment-critical: these are the questions the system will answer without retrieval. |
| High-confidence negatives | Deployment-critical: these decide retrieval/escalation cost. |

Temperature scaling is a good start, but I would be careful to fit the calibration temperature only on a clean validation set that does not overlap with the evaluation split or target OOD dataset.
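As a rough sketch of that calibration workflow: fit a single temperature on validation logits only, then measure ECE on a separate split. This assumes the sidecar exposes a pre-sigmoid logit per question. The grid search and all function names are my own simplifications, not the repo's code.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def nll(logits, labels, temperature):
    """Mean binary negative log-likelihood after dividing logits by the temperature."""
    eps, total = 1e-12, 0.0
    for z, y in zip(logits, labels):
        p = min(max(sigmoid(z / temperature), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)


def fit_temperature(val_logits, val_labels, lo=0.05, hi=20.0, steps=400):
    """Pick the temperature minimizing validation NLL on a log-spaced grid."""
    grid = [lo * (hi / lo) ** (i / (steps - 1)) for i in range(steps)]
    return min(grid, key=lambda t: nll(val_logits, val_labels, t))


def ece(probs, labels, n_bins=10):
    """Expected calibration error with equal-width bins over P(knows)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    return sum(
        len(b) / len(probs)
        * abs(sum(p for p, _ in b) / len(b) - sum(y for _, y in b) / len(b))
        for b in bins
        if b
    )
```

Calling `ece` once per slice (PopQA, TriviaQA, popularity quartiles, and so on) with the same fitted temperature gives the per-slice view described above.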

How I would phrase the contribution

I would avoid:

“This detects hallucinations before generation.”

That is too broad.

I would prefer:

“This estimates whether a specific model is likely to answer a prompt correctly from parametric memory, before generating answer tokens.”

Or:

“This is a prompt-only, white-box selective-prediction sidecar for closed-book QA.”

Or:

“This is a one-forward-pass router for deciding whether to trust parametric memory or invoke retrieval/escalation.”

Those are narrower and easier to defend.

A possible paper-style claim

If the tightened experiments hold, the claim could be:

We train a lightweight sidecar on prompt-only internal features to predict a model’s closed-book answerability before generation. On held-out QA examples, the sidecar improves risk–coverage and calibration over entropy and simple hidden-state probes. In a RAG-gating setting, it reduces retrieval calls at fixed accuracy compared with always-retrieve and entropy-gated baselines.

That would be a strong, practical claim.

Suggested next checklist

If I were iterating on this repo, I would work in this order:

  1. Deduplicate before splitting

    • exact question duplicates,
    • possibly entity-level duplicates,
    • possibly normalized answer duplicates if they imply same fact.
  2. Rebuild splits

    • mixed held-out,
    • true TriviaQA→PopQA,
    • true PopQA→TriviaQA,
    • popularity-heldout PopQA.
  3. Retrain sidecar separately per protocol

    • do not reuse a mixed-trained checkpoint for a claimed cross-dataset result.
  4. Add close baselines

    • Internal Confidence,
    • KEEN-style probe,
    • last-hidden probe,
    • entropy / max-prob / margin,
    • global scalar logreg.
  5. Make risk–coverage primary

    • coverage@risk≤5%,
    • coverage@risk≤10%,
    • AURC,
    • RAG calls saved at fixed accuracy.
  6. Run a RAG-gating demo

    • always closed-book,
    • always retrieve,
    • entropy gate,
    • LRD gate,
    • Internal Confidence gate,
    • oracle.
  7. Keep the “knows” definition narrow

    • model-behavior label,
    • same prompt format,
    • same decoding mode,
    • same answer matcher.

Minor naming thought

P(knows) is intuitive for the GUI, but in a paper or README I might also expose a more precise name:

  • P(correct_closed_book)
  • P(answerable_by_base_model)
  • P(parametric_success)
  • P(self_correct)
  • P(closed_book_success)

Maybe keep P(knows) as the user-facing short label, but define it formally as one of the above.

Bottom line

I think this is a promising direction, especially if framed as a pre-generation router / selective predictor rather than a broad hallucination detector.

The most valuable next step is probably not a more complex sidecar. It is a stricter evaluation:

  • deduplicated splits,
  • true dataset-transfer training,
  • close baselines like Internal Confidence and KEEN-style probes,
  • deployment-facing risk–coverage metrics,
  • and a RAG-gating experiment showing retrieval calls saved at fixed accuracy.

If those results still hold, the project would have a much stronger story: not just “the sidecar gets high AUROC,” but “the sidecar makes useful routing decisions before generation.”

Strong +1 to John’s core point - tighten the evaluation before touching the architecture. I’ll just add four confounds we ran into doing internal-state probing on this exact model family (Gemma), roughly ordered by how much they’d move your numbers. They’re not yet on the list and the first one is, in our experience, the silent killer.

0. Certify the weight artifact before trusting any internal-state feature

Your logit-lens “crystallization” trajectory and MLP-write magnitudes are only as faithful as the weights they’re read from - and Gemma’s quantized artifacts are a minefield right now. In our own work on this family we found several GGUF/quantized conversions of Gemma silently broken: a clean bf16 forward gave wikitext PPL ≈ 4.7, while multiple GGUF artifacts of the same weights gave ≈ 190–500 (roughly 40–100× worse). On an artifact like that, the “knowledge crystallizes across depth” signal you’re keying on is partly measuring quantization damage, not parametric knowledge - and it’ll look structured, because the damage is structured. We were using Gemma-4 12b. The E series is probably OK, but it is worth checking.

Concrete: print the probed model’s wikitext PPL next to a known reference and sanity-check it before you trust a single feature. Prefer probing a model loaded directly from bf16 safetensors. If you have to probe the quantized deployment target, see point 2.
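A minimal sketch of that sanity check, assuming you can already extract per-token natural-log probabilities for a reference text from the probed model. The `max_ratio` tolerance is a hypothetical knob, not a recommended value.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood, given natural-log token log-probs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-prob")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def artifact_looks_sane(ppl, reference_ppl, max_ratio=1.5):
    """Flag artifacts whose perplexity is far above a trusted reference run."""
    return ppl <= reference_ppl * max_ratio
```

With the numbers above, a reference of 4.7 and an artifact at 190 fails the check by a wide margin, so even a loose tolerance would catch that class of breakage.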

1. The current LRD-vs-baseline gaps look like they’re inside split + seed noise

  • Single split, ~3k examples: LRD 0.898 vs global-logreg 0.867 is a 0.03 AUROC gap.
  • On the cross-PopQA split the last-layer probe (0.961) already beats LRD (0.941). That’s the simplest baseline outscoring the full architecture on the transfer split - which, combined with John’s leakage observation, strongly suggests the held-out margin isn’t robust yet.

We got burned by exactly this on a different internal-state probe: a held-out score of 0.20 collapsed to ~0.13 the moment we made the train/val split context-disjoint, and single-seed runs swung ±0.04 between adjacent checkpoints - enough to flip a “win” into a “loss.” We now treat any number as inconclusive until ≥3 seeds with a bootstrap CI over the test set.

Concrete: 3–5 seeds, report mean ± CI (or a paired bootstrap of LRD − baseline, which is what actually tells you if the gap is real). And dedup at the entity level, not just exact-question - PopQA and TriviaQA share entities, so question-level dedup leaves entity leakage intact, and an entity that appears in train teaches the probe the entity, not the knowledge boundary.
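For the paired bootstrap, here is a minimal sketch. It assumes both systems were scored on the same test examples and resamples those examples with replacement; the function names are my own, and this covers test-set noise only, so seed variance still needs separate runs.

```python
import random


def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney U) using average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def paired_bootstrap_auroc_gap(scores_a, scores_b, labels, n_boot=1000, seed=0):
    """Bootstrap the AUROC difference (A minus B) over shared test examples."""
    if len(set(labels)) < 2:
        raise ValueError("labels must contain both classes")
    rng = random.Random(seed)
    n, gaps = len(labels), []
    while len(gaps) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # AUROC is undefined when a resample has a single class
        a = auroc([scores_a[i] for i in idx], ys)
        b = auroc([scores_b[i] for i in idx], ys)
        gaps.append(a - b)
    gaps.sort()
    lo, hi = gaps[int(0.025 * n_boot)], gaps[int(0.975 * n_boot) - 1]
    return {"mean_gap": sum(gaps) / n_boot, "ci95": (lo, hi)}
```

If the 95% interval for LRD minus global-logreg includes zero, the 0.03 gap should be read as inconclusive on that split.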

2. P(knows) is quantization-instance specific, not just model specific

Quantization changes which facts are recoverable - we have direct evidence that coarse quantization degrades exact factual recall on this family, not just average loss. So a sidecar trained on fp16 Gemma will mis-route a Q4 deployment: the boundary literally moves. Train and calibrate the router on the exact deployed (quantized) instance, and re-fit the temperature per instance - calibration in particular will not transfer across quant levels.

3. (smaller) Layer-class structure on Gemma

Gemma interleaves global and sliding-window attention (SWA) layers - different effective ranges (global carries long-range context, SWA fades past its window) and different RoPE configurations. A GRU-over-depth that treats all layers as one uniform sequence is averaging over two different signal regimes. A cheap ablation: tag each layer with its class (or pool the two classes separately) and see whether the depth signal actually lives in the global layers. If it does, that’s both a smaller, more honest feature set and a more transferable one.
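A minimal sketch of the pool-by-class ablation, assuming per-layer features arrive as an `[L][F]` list of lists. The "every sixth layer is global" pattern in `layer_classes` is a placeholder assumption; read the real pattern from the config of the exact model you probe.

```python
def layer_classes(num_layers, global_every=6):
    """Placeholder pattern: every `global_every`-th layer is global, the rest sliding-window."""
    return [
        "global" if (i + 1) % global_every == 0 else "sliding"
        for i in range(num_layers)
    ]


def pool_by_class(per_layer_features, classes):
    """Mean-pool [L][F] features separately for each layer class."""
    pooled = {}
    for name in sorted(set(classes)):
        rows = [f for f, c in zip(per_layer_features, classes) if c == name]
        pooled[name] = [sum(col) / len(rows) for col in zip(*rows)]
    return pooled
```

Training the sidecar on `pooled["global"]` alone versus `pooled["sliding"]` alone then shows which layer class carries the depth signal.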

Bottom line

John’s eval-tightening is the right first move. I’d prepend step 0: certify the weight artifact - a broken quant silently poisons every downstream internal feature, and it’s the cheapest thing to rule out. Then treat the current LRD-vs-baseline gaps as within-noise until you have multi-seed + entity-disjoint + true train→test transfer, and bind the probe to the exact quantized instance you’ll deploy. If the margin survives all of that, you’ve got a genuinely strong, deployment-relevant result - and the prompt-only, one-forward property is worth protecting the whole way.