For now, this looks like a promising direction:
I would frame this less as a generic “hallucination detector” and more as a pre-generation selective predictor / router for the model’s own closed-book QA behavior.
That framing matters, because it makes the target much sharper:
Given only the prompt and one prompt-only forward pass, can we estimate whether this exact model instance is likely to answer correctly without retrieval, tool use, or extra generation?
Under that interpretation, the idea is useful. It could decide whether to:
- answer from parametric memory,
- invoke RAG,
- use a stronger/slower model,
- trigger a “deep thinking” path,
- ask for clarification,
- or abstain.
That is a very practical deployment problem.
Direct answer
Yes, I think this direction is promising. I would probably not change the architecture first. I would first tighten the evaluation and compare against a few very close baselines.
The closest framing I know is:
So I would say: the idea is not isolated; it sits in a real research thread. But your implementation has a nice practical flavor: a small sidecar, prompt-only features, logit-lens trajectory / crystallization signals, MLP-write features, and a usable GUI.
What I think the method is really measuring
I would be careful with the word “knows.”
For this setup, P(knows) is probably best read as:
P(this exact model instance answers correctly under this prompt format, decoding mode, and answer-matching rule)
not as:
P(the model metaphysically knows the truth)
That is not a criticism. In fact, I think your behavioral definition is a good one. It is deployment-relevant. If the sidecar predicts the model’s own closed-book correctness, that is exactly the signal a router needs.
But I would make the definition explicit everywhere, because otherwise readers may interpret this as a general truthfulness estimator or hallucination detector.
A possible wording:
In this prototype, “knows” means “the base model’s greedy closed-book answer matches a gold answer or alias under the current evaluator.” It is therefore a model-behavior label, not a direct claim about world truth.
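To make that behavioral definition concrete, here is a minimal sketch of how such a label could be computed. The normalization (lowercase, strip punctuation and articles, SQuAD-style) and the whole-token containment rule are my assumptions for illustration; the repo's actual answer matcher may differ, and `knows_label` is a hypothetical name.

```python
import re
import string

def normalize(text: str) -> list[str]:
    """Lowercase, drop ASCII punctuation and English articles, split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def contains_alias(pred_tokens: list[str], alias_tokens: list[str]) -> bool:
    """True if the alias appears as a contiguous run of whole tokens in the prediction."""
    n = len(alias_tokens)
    if n == 0:
        return False
    return any(pred_tokens[i:i + n] == alias_tokens
               for i in range(len(pred_tokens) - n + 1))

def knows_label(greedy_answer: str, gold_aliases: list[str]) -> int:
    """Behavioral label (assumed rule): 1 if the greedy closed-book answer
    contains any gold alias after normalization, else 0."""
    pred = normalize(greedy_answer)
    return int(any(contains_alias(pred, normalize(a)) for a in gold_aliases))
```

Matching on whole tokens rather than raw substrings avoids counting "US" as present in "Russia", which is one way a loose matcher can inflate the positive rate.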
What looks strong
A few parts look especially good.
| Component | Why it is interesting |
| --- | --- |
| Prompt-only forward pass | Very attractive for routing because it avoids paying for answer generation before deciding whether to retrieve/escalate. |
| Small sidecar | Easier to deploy than a second LLM judge or multi-sample uncertainty method. |
| Layer trajectory | This is more informative than only taking the last hidden state. It lines up with the intuition behind logit lens / tuned lens style analyses. |
| MLP-write features | This has a plausible mechanistic connection to factual recall, given the “FFN as key-value memory” literature. |
| Risk–coverage plots | This is the right kind of deployment-facing visualization. |
| CUDA/MPS portability | Very useful for people actually trying to reproduce or run the system. |
The MLP-write direction is particularly interesting. There is prior work arguing that transformer feed-forward layers behave like key-value memories, and work on knowledge neurons also points toward factual information being localized in internal activations. I would not overclaim that the sidecar has found “the knowledge neurons,” but the feature design is plausible.
For the logit-lens side, I would also mention Tuned Lens. Raw logit lens can be noisy or brittle, especially in middle layers, so I like that your README already treats the MLP-write lens carefully and emphasizes magnitudes / trajectories rather than taking every decoded token literally.
Closest baselines I would add
If the goal is to convince people that LRD adds value, I would compare it against these.
| Baseline | Why it matters |
| --- | --- |
| Internal Confidence | Very close: training-free, query-level, single forward pass, no answer tokens. This is probably the most important external baseline. |
| KEEN-style probe | Very close: internal representation → knowledge estimate before generation. It is entity-oriented, but the probing idea is highly relevant. |
| Last-token / last-layer hidden-state probe | Strong minimal white-box baseline. If this is close to LRD, the extra feature engineering needs justification. |
| Entropy / max-prob / margin | Cheap uncertainty baselines. These are hard to ignore because they are simple and often surprisingly strong. |
| Global scalar logistic regression | Helps show whether the GRU-over-depth is doing more than using a few scalar features. |
| Semantic Entropy Probes | Not exactly the same because they use hidden states from a generation, but useful as a cheap learned uncertainty baseline. |
| LM-Polygraph methods | Useful if you want to compare against a broader UE toolkit rather than hand-picking baselines. |
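For the entropy / max-prob / margin row, a minimal sketch of the three scalar scores follows. It assumes you already have the next-token logits at the last prompt position as a plain list of floats; how those logits are extracted from the model is outside this sketch, and `uncertainty_baselines` is a hypothetical helper name.

```python
import math

def uncertainty_baselines(logits: list[float]) -> dict[str, float]:
    """Compute cheap confidence scores from one vector of next-token logits.

    Higher values mean "more confident" for all three scores.
    Requires at least two logits (the margin needs a runner-up).
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max for numerical stability
    z = sum(exps)
    probs = [e / z for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    top = sorted(probs, reverse=True)
    return {
        "neg_entropy": -entropy,      # negated so that higher = more confident
        "max_prob": top[0],
        "margin": top[0] - top[1],    # gap between top-1 and top-2 probability
    }
```

Each score can be fed directly into the same AUROC and risk–coverage evaluation as the sidecar output, which keeps the comparison apples-to-apples.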
Evaluation: I would tighten this before changing the architecture
This is the biggest thing I would look at.
From the README, the reported numbers are already good:
| Split | Model | AUROC | AUPRC | ECE |
| --- | --- | --- | --- | --- |
| held-out test | LRD | 0.8983 | 0.7937 | 0.0925 |
| held-out test | last-layer probe | 0.8189 | 0.6571 | 0.1807 |
| held-out test | global logreg | 0.8665 | 0.7477 | 0.1161 |
| held-out test | neg-entropy | 0.8529 | 0.7269 | — |
| cross PopQA | LRD | 0.9411 | 0.7629 | 0.0679 |
| cross PopQA | last-layer probe | 0.9612 | 0.8875 | 0.0327 |
| cross PopQA | global logreg | 0.8748 | 0.5913 | 0.1494 |
| cross PopQA | neg-entropy | 0.8611 | 0.5903 | — |
The held-out test result is encouraging. But I would be cautious about calling the current cross result “cross-dataset generalization” unless the sidecar is actually trained only on the source dataset used for the cross split.
If I am reading the shipped split files correctly, cross.test is all PopQA, but the sidecar checkpoint appears to be trained on the main.train split. Since main.train is stratified over both PopQA and TriviaQA, that means many PopQA examples are already part of the sidecar’s training distribution. So the current cross result may be more like “evaluate the already-trained sidecar on all PopQA” than “train on TriviaQA, test on PopQA.”
I may be misreading the intended workflow, but if not, I would make the cross protocol stricter:
| Claim | Suggested protocol |
| --- | --- |
| Held-out in-distribution | Deduplicate first, then stratified train/val/test over the mixed dataset. |
| True dataset transfer | Train sidecar on TriviaQA only, tune/calibrate on TriviaQA only, test once on PopQA only. |
| Reverse transfer | Train on PopQA only, test on TriviaQA only. |
| Popularity transfer | Train on high-popularity PopQA, test on low-popularity PopQA, and vice versa. |
| Dataset identity control | Report whether a dataset-only or popularity-only classifier already predicts knows. |
I would also deduplicate questions before splitting. Exact duplicate questions across train/val/test can make held-out metrics optimistic, especially for a small 3k-example dataset.
This does not invalidate the idea. It just means the evaluation should be made leak-resistant before making strong claims.
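A minimal sketch of "deduplicate first, then split" might look like the following. The record fields (`question`, `dataset`), the 80/10/10 fractions, and stratifying by source dataset only are assumptions for illustration; the repo's split files may use different fields or also stratify by label.

```python
import random
from collections import defaultdict

def normalize_question(q: str) -> str:
    """Case- and whitespace-insensitive key for exact-duplicate detection."""
    return " ".join(q.lower().split())

def dedup_and_split(examples, seed=0, frac=(0.8, 0.1, 0.1)):
    """Drop exact duplicate questions, then split each source dataset separately.

    `examples` is a list of dicts with (assumed) keys "question" and "dataset".
    """
    seen, unique = set(), []
    for ex in examples:
        key = normalize_question(ex["question"])
        if key not in seen:          # keep only the first occurrence
            seen.add(key)
            unique.append(ex)

    by_source = defaultdict(list)
    for ex in unique:
        by_source[ex["dataset"]].append(ex)

    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for source in sorted(by_source):  # sorted for reproducibility
        items = by_source[source]
        rng.shuffle(items)
        n_train = int(frac[0] * len(items))
        n_val = int(frac[1] * len(items))
        splits["train"] += items[:n_train]
        splits["val"] += items[n_train:n_train + n_val]
        splits["test"] += items[n_train + n_val:]
    return splits
```

Because duplicates are removed before any example is assigned to a split, no normalized question can appear on both sides of a train/test boundary. Entity-level deduplication would need a subject or entity field, which this sketch does not assume.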
Metrics I would emphasize
AUROC/AUPRC/ECE are useful, but for this application I would make risk–coverage the primary metric.
Why?
Because the product question is not only:
“Can the sidecar rank known vs unknown questions?”
It is:
“At a chosen risk tolerance, how many questions can the system safely answer without retrieval or escalation?”
So I would report:
| Metric | Interpretation |
| --- | --- |
| risk@coverage | If the system answers the top X% most confident questions, what is the error rate? |
| coverage@risk≤5% | How many questions can be answered while keeping error below 5%? |
| coverage@risk≤10% | Same, but with a looser risk target. |
| AURC / area under risk–coverage | Overall selective-prediction quality. |
| RAG calls saved at fixed accuracy | Practical cost-saving metric. |
| accuracy at fixed RAG budget | Practical quality metric. |
| ECE / Brier score | Calibration quality, secondary but still useful. |
This recent paper is also relevant: Entropy Alone is Insufficient for Safe Selective Prediction in LLMs. It argues that entropy-only uncertainty can fail for selective prediction, and that combining entropy with a correctness probe can improve risk–coverage and calibration. That is a useful argument in favor of systems like yours: the goal should not be “beat entropy everywhere,” but “show where learned internal signals improve deployment-facing selective behavior beyond entropy.”
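The risk–coverage quantities in the table above can be computed from just two arrays: a confidence score per question and a 0/1 correctness label. The sketch below is one common way to define them, not necessarily what the repo's plotting code does; the function names are hypothetical, and ties in scores are broken by input order.

```python
def risk_coverage_curve(scores, correct):
    """Return [(coverage, risk), ...] when answering the k most confident questions.

    scores:  confidence per question (higher = more confident)
    correct: 1 if the closed-book answer was right, else 0
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    curve, errors = [], 0
    for k, i in enumerate(order, start=1):
        errors += 1 - correct[i]
        curve.append((k / len(order), errors / k))  # risk = error rate among answered
    return curve

def coverage_at_risk(curve, max_risk):
    """Largest coverage whose risk stays at or below max_risk (0.0 if none)."""
    return max((cov for cov, risk in curve if risk <= max_risk), default=0.0)

def aurc(curve):
    """Area under the risk-coverage curve, as the mean risk over all coverage levels."""
    return sum(risk for _, risk in curve) / len(curve)
```

For example, `coverage_at_risk(curve, 0.05)` gives coverage@risk≤5%. Note that this picks the largest feasible coverage even if the curve briefly exceeds the target at a smaller coverage; a stricter variant would stop at the first violation.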
Suggested RAG-gating experiment
I think the most convincing next experiment would be a simple RAG router.
For each question:
- Compute LRD score from the prompt-only forward pass.
- If score ≥ threshold, answer closed-book.
- If score < threshold, retrieve context and answer with RAG.
- Compare against:
  - always closed-book,
  - always retrieve,
  - entropy gate,
  - Internal Confidence gate,
  - popularity threshold,
  - random gate.
Then report:
| System | Accuracy | Retrieval rate | Cost proxy | Error rate on non-retrieved answers |
| --- | --- | --- | --- | --- |
| Always closed-book | — | 0% | low | — |
| Always retrieve | — | 100% | high | — |
| Popularity threshold | — | — | — | — |
| Entropy gate | — | — | — | — |
| Internal Confidence gate | — | — | — | — |
| LRD gate | — | — | — | — |
| Oracle gate | — | minimal | lower bound | 0% |
This would connect directly to the PopQA / adaptive retrieval literature.
PopQA is a good dataset for this because the known/unknown boundary is strongly related to long-tail knowledge. The original PopQA paper found that LMs are much weaker on less popular factual knowledge, while retrieval helps long-tail cases and closed-book answering can still be competitive for high-popularity facts. That is almost exactly the deployment niche for a pre-generation router.
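The gating experiment described above can be evaluated offline if you precompute, per question, whether the closed-book answer and the RAG answer are each correct. The sketch below assumes exactly that; `evaluate_gate` and its argument names are hypothetical, and the gate score can be LRD, negative entropy, or any other baseline.

```python
def evaluate_gate(scores, closed_book_correct, rag_correct, threshold):
    """Simulate a threshold router: answer closed-book if score >= threshold, else use RAG.

    scores:              gate confidence per question (higher = trust parametric memory)
    closed_book_correct: 1/0 correctness of the closed-book answer (precomputed)
    rag_correct:         1/0 correctness of the RAG answer (precomputed)
    """
    n = len(scores)
    retrieved = [s < threshold for s in scores]
    n_correct = sum(r if use_rag else c
                    for use_rag, c, r in zip(retrieved, closed_book_correct, rag_correct))
    kept = [c for use_rag, c in zip(retrieved, closed_book_correct) if not use_rag]
    return {
        "accuracy": n_correct / n,
        "retrieval_rate": sum(retrieved) / n,
        # error rate among answers given without retrieval; None if nothing was kept
        "closed_book_error": (1 - sum(kept) / len(kept)) if kept else None,
    }
```

Setting `threshold=float("-inf")` reproduces the always-closed-book row and `threshold=float("inf")` the always-retrieve row, so the same function fills most of the table by swapping the score source and sweeping the threshold.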
One detail: dataset identity may be doing work
Because the dataset is a balanced PopQA + TriviaQA mix, I would check whether the model is partly learning “this looks like TriviaQA” vs “this looks like PopQA.”
That matters because PopQA and TriviaQA differ in more than just whether Gemma knows the answer. They differ in question style, entity popularity, answer distribution, and probably surface form.
A quick control:
| Control | Purpose |
| --- | --- |
| Dataset-only classifier | Does dataset ∈ {PopQA, TriviaQA} predict knows too well? |
| Question-length / lexical features | Are shallow features enough? |
| Popularity-only PopQA baseline | Does s_pop explain most PopQA performance? |
| Same-dataset OOD split | Avoids dataset identity as a shortcut. |
| Entity-disjoint split | Prevents the same subject/entity from appearing in train and test. |
If these controls are weak and LRD remains strong, the result becomes much more convincing.
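The dataset-only control needs no training at all: with a single binary feature, the AUROC of "score = 1 if TriviaQA else 0" is the number to beat. A minimal sketch follows, using the pairwise (Mann–Whitney) form of AUROC with ties counted as one half. The function names and the string labels for the datasets are assumptions.

```python
def auroc(scores, labels):
    """Pairwise AUROC: P(score_pos > score_neg) + 0.5 * P(tie). O(n_pos * n_neg)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def dataset_only_auroc(datasets, labels, positive_source="TriviaQA"):
    """AUROC of a 'classifier' that only looks at which dataset a question came from.

    Assumes positive_source is the dataset with the higher known-rate.
    """
    scores = [1.0 if d == positive_source else 0.0 for d in datasets]
    return auroc(scores, labels)
```

If this number is already well above 0.5 on the mixed test split, part of any probe's AUROC there may come from recognizing the source rather than the model's knowledge, which is why per-dataset metrics are worth reporting alongside the mixed ones.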
Architecture comments
I would keep the architecture simple until the evaluation is locked down.
Current LRD:
per-layer features [L, F] → GRU over depth → mean/last pooling → global scalars → MLP → calibrated probability
That is reasonable. I would not jump to a larger network yet.
I would test ablations first:
| Ablation | Question answered |
| --- | --- |
| Signal A only | Are crystallization / logit-lens trajectory features enough? |
| Signal B only | Are MLP-write features enough? |
| Global scalars only | Are entropy/margin-like features doing most of the work? |
| Last hidden state only | Does the engineered feature stack beat a simple probe? |
| No GRU, pooled layers only | Is cross-layer sequence modeling actually helping? |
| Layer-shuffled GRU | Is the depth order meaningful? |
| Random-label sanity check | Confirms no pipeline leakage. |
| Train on one dataset, calibrate on same dataset, test on another | Tests actual transfer. |
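Two of the ablations above, the layer-shuffled GRU and the random-label check, only require transforming the training data before the usual training run. A minimal sketch, assuming features are stored as nested lists in the [L, F] layout; applying one fixed permutation to every example (rather than a fresh one per example) is my choice here, and both helper names are hypothetical.

```python
import random

def shuffle_layers(features, seed=0):
    """Apply one fixed random permutation to the layer axis of every example.

    features: list of examples, each a list of L per-layer feature vectors.
    Returns the permuted features and the permutation used.
    """
    n_layers = len(features[0])
    perm = list(range(n_layers))
    random.Random(seed).shuffle(perm)
    return [[example[j] for j in perm] for example in features], perm

def shuffle_labels(labels, seed=0):
    """Return a shuffled copy of the labels; the input list is left untouched."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

After training on shuffled labels, test AUROC should sit near 0.5; anything clearly above that points to leakage somewhere in the pipeline. If the layer-shuffled model matches the original, the GRU is probably not exploiting depth order.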
Only after that would I add Signal C/D.
Signal C/D ideas
Your README mentions perturbation robustness and Mahalanobis familiarity as scaffolded but off by default. Those are plausible additions.
Potential additions:
| Signal | Intuition | Caveat |
| --- | --- | --- |
| Perturbation stability | Known answers should have more stable internal representations under small perturbations. | Extra forward passes may weaken the “single forward” selling point. |
| Mahalanobis / representation familiarity | Unknown queries may be farther from familiar training-like regions. | Can overfit to dataset/source identity. |
| Cross-layer update magnitude | Hallucination/unknown cases may show different residual-stream update patterns. | Needs careful normalization across models. |
| Tuned-lens trajectory | More robust than raw logit lens. | Requires training lens probes per model. |
| Entity popularity / retrieval prior | Useful for PopQA-like settings. | Not available for arbitrary user questions. |
For a production router, I would preserve the one-forward property if possible. It is a big advantage.
Calibration
Because the output is used as a decision score, calibration matters.
I would report calibration separately for:
| Slice | Why |
| --- | --- |
| PopQA | Long-tail / low base-rate unknown-heavy distribution. |
| TriviaQA | Easier / higher known-rate distribution. |
| Popularity quartiles | Tests whether calibration breaks on rare facts. |
| Short vs long questions | Surface complexity may affect confidence. |
| High-confidence positives | Deployment-critical: these are the questions the system will answer without retrieval. |
| High-confidence negatives | Deployment-critical: these decide retrieval/escalation cost. |
Temperature scaling is a good start, but I would be careful to fit the calibration temperature only on a clean validation set that does not overlap with the evaluation split or target OOD dataset.
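As a sketch of that discipline, the temperature can be fit on validation logits only and then frozen before ECE is computed on the test or OOD split. The code below assumes the sidecar exposes a single pre-sigmoid logit per question and uses a simple grid search in place of a gradient-based fit; the function names and the grid range are my own choices.

```python
import math

def nll(logits, labels, temperature):
    """Mean binary cross-entropy of sigmoid(logit / T), in a numerically stable form."""
    total = 0.0
    for z, y in zip(logits, labels):
        s = z / temperature
        total += max(s, 0.0) - y * s + math.log1p(math.exp(-abs(s)))
    return total / len(logits)

def fit_temperature(val_logits, val_labels, grid=None):
    """Pick the temperature minimizing validation NLL (grid search over 0.1 .. 10.0)."""
    grid = grid or [0.05 * k for k in range(2, 201)]
    return min(grid, key=lambda t: nll(val_logits, val_labels, t))

def ece(probs, labels, n_bins=10):
    """Expected calibration error of P(label = 1) with equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total = len(probs)
    return sum(
        len(b) / total * abs(sum(p for p, _ in b) / len(b) - sum(y for _, y in b) / len(b))
        for b in bins if b
    )
```

Usage would be `t = fit_temperature(val_logits, val_labels)` followed by `ece([1 / (1 + math.exp(-z / t)) for z in test_logits], test_labels)`, run once per slice in the table above with the same frozen `t`.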
How I would phrase the contribution
I would avoid:
“This detects hallucinations before generation.”
That is too broad.
I would prefer:
“This estimates whether a specific model is likely to answer a prompt correctly from parametric memory, before generating answer tokens.”
Or:
“This is a prompt-only, white-box selective-prediction sidecar for closed-book QA.”
Or:
“This is a one-forward-pass router for deciding whether to trust parametric memory or invoke retrieval/escalation.”
Those are narrower and easier to defend.
A possible paper-style claim
If the tightened experiments hold, the claim could be:
We train a lightweight sidecar on prompt-only internal features to predict a model’s closed-book answerability before generation. On held-out QA examples, the sidecar improves risk–coverage and calibration over entropy and simple hidden-state probes. In a RAG-gating setting, it reduces retrieval calls at fixed accuracy compared with always-retrieve and entropy-gated baselines.
That would be a strong, practical claim.
Suggested next checklist
If I were iterating on this repo, I would work in this order:
- Deduplicate before splitting
  - exact question duplicates,
  - possibly entity-level duplicates,
  - possibly normalized answer duplicates if they imply the same fact.
- Rebuild splits
  - mixed held-out,
  - true TriviaQA→PopQA,
  - true PopQA→TriviaQA,
  - popularity-heldout PopQA.
- Retrain sidecar separately per protocol
  - do not reuse a mixed-trained checkpoint for a claimed cross-dataset result.
- Add close baselines
  - Internal Confidence,
  - KEEN-style probe,
  - last-hidden probe,
  - entropy / max-prob / margin,
  - global scalar logreg.
- Make risk–coverage primary
  - coverage@risk≤5%,
  - coverage@risk≤10%,
  - AURC,
  - RAG calls saved at fixed accuracy.
- Run a RAG-gating demo
  - always closed-book,
  - always retrieve,
  - entropy gate,
  - LRD gate,
  - Internal Confidence gate,
  - oracle.
- Keep the “knows” definition narrow
  - model-behavior label,
  - same prompt format,
  - same decoding mode,
  - same answer matcher.
Minor naming thought
P(knows) is intuitive for the GUI, but in a paper or README I might also expose a more precise name:
P(correct_closed_book)
P(answerable_by_base_model)
P(parametric_success)
P(self_correct)
P(closed_book_success)
Maybe keep P(knows) as the user-facing short label, but define it formally as one of the above.
Bottom line
I think this is a promising direction, especially if framed as a pre-generation router / selective predictor rather than a broad hallucination detector.
The most valuable next step is probably not a more complex sidecar. It is a stricter evaluation:
- deduplicated splits,
- true dataset-transfer training,
- close baselines like Internal Confidence and KEEN-style probes,
- deployment-facing risk–coverage metrics,
- and a RAG-gating experiment showing retrieval calls saved at fixed accuracy.
If those results still hold, the project would have a much stronger story: not just “the sidecar gets high AUROC,” but “the sidecar makes useful routing decisions before generation.”