reFlow: A Feature-Decoupled Transformer with Native Interpretability

TL;DR: We decompose the embedding matrix E ∈ R^{V×d} into W_recipe × W_basis, forcing every token to be a readable “recipe” over a shared signal basis. Without any sparsity constraint, the signal space spontaneously develops semantic structure (three↔four cos=0.76, king+woman−man=queen rank #1), 11% natural sparsity, and single-signal causal traceability. Full training code, 12 interpretability experiments, and pretrained weights are MIT-licensed.


Motivation

Standard Transformer embeddings are unstructured lookup tables — every token gets an independent d-dimensional vector with no compositional constraint. This makes the latent space a semantic tangle: you can probe it after the fact (SAE, probing classifiers), but the model was never designed to be interpretable. reFlow flips this: by factoring the embedding into a recipe matrix (how to mix signals) and a signal basis (what the signals mean), the architecture forces all computation onto a signal manifold. Interpretability isn’t bolted on — it’s load-bearing structure.

Key Results

  • Convergence: reFlow-1 (32 layers, 464M params) trains to a loss ~3% above GPT-2-New (36 layers, 514M), consistent with its 4 fewer layers and 9% fewer parameters. Matched for depth (reFlow-1-Big, 36 layers, 515M), the gap narrows to ~1%. Three-point scaling (params, final loss): Small (46M, 3.55) → reFlow-1 (464M, 3.01) → Big (515M, 2.92).
  • Semantic organization in recipe space: Top-20 nearest-neighbor pairs are all semantically valid (three↔four 0.7551, king↔queen 0.54, France↔Germany 0.53). PCA silhouette score = 0.1052 (positive → real clusters).
  • Semantic algebra: 3/3 hit — king + woman − man → queen (#1), walked + running − walking → ran (#1), Paris + China − France → Beijing (#2).
  • Emergent sparsity: Mean 116.6/1024 signals active per token (11.38% activation rate), with no L0/L1 penalty. Gini coefficient only 0.085 → all signals utilized evenly.
  • Causal traceability: Ablating 1 signal on “The capital of France is” drops target probability from 8.31% → 0.03%. That signal’s codebook = {the, a, in, to, an, at} — a pure function-word channel.
  • Behavioral steering: Emotion surgery flips “terrible” → “great” (L0–L12 injection). Concept inception: critical α ≈ 18.4. Gene tampering: modifying W_recipe globally flips sentiment while maintaining grammatical coherence.
  • Hard sparsity destroys semantics: Top-64 constraint collapses recipe structure (cos 0.76 → 0.30, algebra 3/3 → 0/3, silhouette drops to −0.02). Sparsity ≠ interpretability.
  • Information crystallization boundary: Semantic decisions solidify around L12–L18; interventions after this layer range have no effect.
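The recipe-space probes above (nearest neighbors and semantic algebra) reduce to cosine geometry over rows of W_recipe. A minimal sketch of the analogy test; the `vocab` mapping and function name are illustrative stand-ins, not the released tooling:

```python
import numpy as np

def analogy_rank(R, vocab, a, b, c, target):
    """Rank of `target` among cosine neighbors of recipe(a) + recipe(b) - recipe(c),
    excluding the three query tokens. R: (V, S) recipe matrix; vocab: token -> row."""
    q = R[vocab[a]] + R[vocab[b]] - R[vocab[c]]
    Rn = R / np.linalg.norm(R, axis=1, keepdims=True)
    sims = Rn @ (q / np.linalg.norm(q))
    for t in (a, b, c):
        sims[vocab[t]] = -np.inf          # standard analogy-eval convention
    order = np.argsort(-sims)             # token rows, best-first
    return int(np.argmax(order == vocab[target])) + 1  # 1-indexed rank
```

With the released W_recipe and tokenizer, `analogy_rank(W_recipe, vocab, "king", "woman", "man", "queen")` is the computation behind the #1 hit reported above.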

Architecture in brief

Input token i → W_recipe[i, :] (S-dim recipe vector)
                    ↓
            e_i = W_recipe[i] × W_basis    (S×d shared signal basis)
                    ↓
            36-layer Transformer (RMSNorm, RoPE, SwiGLU)
                    ↓
            Logits = H_out × (W_recipe × W_basis)^T    (dynamic vocab matrix, no separate LM head)

The same factored product is used for both input embedding and output projection — a closed loop that forces the backbone to operate entirely on the signal manifold.
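In code, the closed loop is just two shared matrices used at both ends of the network. A numpy sketch of the forward geometry, with toy dimensions; in the real model both factors are learned and a 36-layer backbone sits between the two ends:

```python
import numpy as np

rng = np.random.default_rng(0)
V, S, d = 50, 128, 16                                # toy vocab / signal / model dims

W_recipe = rng.standard_normal((V, S)) / np.sqrt(S)  # token -> recipe over signals
W_basis  = rng.standard_normal((S, d)) / np.sqrt(d)  # signal -> direction in R^d

# Input side: a token's embedding is its recipe mixed through the shared basis.
e_i = W_recipe[3] @ W_basis                          # (d,)

# Output side: logits reuse the SAME factored product -- no separate LM head.
h_out = rng.standard_normal(d)                       # backbone output (stand-in)
logits = h_out @ (W_recipe @ W_basis).T              # (V,)
```

Because `W_recipe @ W_basis` appears on both sides, any gradient signal through the logits also shapes the signal basis, which is what keeps the backbone on the signal manifold.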

Links

Built on nanoGPT. Trained on OpenWebText (9B tokens), 4×T4 GPUs, 50k steps.


Happy to answer questions! Especially interested in discussion around:

  • The tension between hard sparsity and semantic fidelity (Section 6)
  • Signal distillation prospects (teacher/student sharing W_basis)
  • How this compares to SAE-based post-hoc interpretability

Interesting work — the “interpretability as load-bearing structure” framing resonates.

I’m working on a related but orthogonal problem at Prooftrail: instead of making the representation space readable, we’re trying to extract a real-time feedback signal from hidden states during generation — a non-learned coherence metric (cosine similarity over time at a fixed layer) that detects when the model is looping or stagnating, without any trained probe or labels.
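In its simplest form the metric looks like this (a sketch of the idea only; `tau = 0.95` is our looping threshold from the working paper, the rest is minimal scaffolding):

```python
import numpy as np

def loop_flags(H, tau=0.95):
    """Non-learned coherence monitor (sketch). H: (T, d) hidden states at one
    fixed layer across T generation steps. Flags step t when cos(h_t, h_{t-1})
    exceeds tau, i.e. the trajectory has effectively stopped moving."""
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    sims = np.sum(Hn[1:] * Hn[:-1], axis=1)   # consecutive-step cosines
    return sims > tau
```

No probe, no labels: the only free parameter is the threshold, which is exactly the brittleness discussed below.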

Your crystallization boundary finding (L12–L18) is directly relevant to us. We monitor at Layer 27 (Qwen 7B, 32 layers) because the signal is clearest there — but your result suggests that if we ever want to intervene (not just monitor), we’d need to act earlier, in the zone where semantic decisions are still fluid. That’s a concrete design constraint we hadn’t formalized.

Two questions:

  1. Did you measure whether the crystallization boundary shifts with task type (e.g., factual recall vs. multi-step reasoning), or is it stable across your evaluation suite?

  2. The hard-sparsity result (top-64 destroying semantics) is striking. Have you looked at whether soft gating (learned attention over signals rather than hard top-k) preserves structure while still giving you a compact active set?

Our working paper is on Zenodo (DOI: 10.5281/zenodo.18941566) and the interim data is on HuggingFace (airVen/missing-value-function-interim-report) — different angle, but the shared conviction that architectural constraints beat post-hoc analysis seems worth connecting.


Thank you for the thoughtful feedback! I'm glad the concept of “interpretability as load-bearing structure” resonated with you.

Your work at Prooftrail sounds very interesting. Using non-learned cosine similarity at a deep layer (such as L27) to detect generation loops or stagnation in real time is an elegant approach. Your point about the divergence between monitoring and intervention is exactly right: by Layer 27 the information is already highly “crystallized,” which makes it an excellent vantage point for observing final trajectories. To steer, however, as we do in “emotion surgery,” signals must be injected before the L12–L18 range, while semantic routing is still fluid.
On your two questions:

  1. Does the crystallization boundary shift with task type or context length?
    Yes; your intuition is correct. Prompted by your question, I ran a targeted causal-intervention sweep on the 0.5B model, comparing short, direct contexts against deep contexts containing long clauses and tracking the “point of no return” layer by layer: the layer at which intervention can no longer override the native prediction.
    Result: short contexts typically crystallize around Layer 18, while complex syntax and long-range dependencies delay the boundary substantially (in some cases to Layer 28). Complex contexts force the network to keep its deeper representations “fluid” in order to integrate distant information, which widens the viable intervention window.
  2. Soft gating or hard Top-K sparsity?
    This is exactly the right question. In the baseline reFlow model, soft sparsity emerges spontaneously (roughly 11.38% activation) with no explicit constraint, yet rigid Top-64 truncation collapses the semantic geometry.
    We are in fact exploring several soft-sparsity mechanisms. One is “Learned Signal Routing,” which closely matches your attention-based idea; we are also testing “Relative Mean Gating,” a dynamic filter that sets the truncation threshold from each signal’s magnitude relative to the mean of the global signal pool.
    I fully share the conviction that architectural constraints beat post-hoc analysis, and I will definitely read your Zenodo paper and the HuggingFace data.
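As a footnote, here is the contrast in miniature: a numpy sketch of the two gates. The exact form of Relative Mean Gating (the `alpha` scaling in particular) is still being tuned, so treat this as the idea rather than the final mechanism:

```python
import numpy as np

def topk_gate(r, k=64):
    """Hard Top-K: keep the k largest-magnitude signals, zero the rest.
    This is the constraint that collapsed recipe geometry in Section 6."""
    out = np.zeros_like(r)
    keep = np.argsort(-np.abs(r))[:k]
    out[keep] = r[keep]
    return out

def relative_mean_gate(r, alpha=1.0):
    """Relative Mean Gating (sketch): threshold each signal against
    alpha * mean(|r|), so the cutoff adapts to the token's own pool."""
    return np.where(np.abs(r) >= alpha * np.abs(r).mean(), r, 0.0)
```

The key difference: Top-K fixes the active-set size regardless of how the mass is distributed, while the relative gate lets tokens with flat recipes keep more signals active.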

Thank you again for the exchange, and I look forward to following your progress!


This is fantastic — thank you for actually running the experiment.

The task-dependent crystallization result is more useful to us than you might realize. Our benchmark is a multi-bug coding repair task (LRU Cache with 5 interdependent bugs, iterative fix loop). That’s exactly the kind of long-range, complex-dependency context where your result predicts a late crystallization boundary. If the boundary shifts to L28 in complex contexts, then our monitoring layer at L27 on Qwen 7B (32 layers) might sit right at the edge of the fluid zone — meaning we could potentially both monitor and intervene at the same depth, at least for complex tasks.

That reframes something we’d assumed was a hard constraint. We had accepted “monitor deep, intervene early” as two separate operations at two separate layers. Your result suggests it might be one operation at one layer, task-contingent.

The soft sparsity directions sound right to me. “Relative Mean Gating” is interesting — we’ve been struggling with the same problem from a different angle: our coherence threshold (cosine similarity > 0.95 = looping) is too binary. The trajectory shape carries more information than the peak value, but we haven’t found a good way to formalize that. A gating mechanism that’s relative to the local signal distribution rather than an absolute cutoff might be exactly the framing we need.

I’ll keep you posted on the Zenodo paper progress — we’re preparing the arXiv submission for cs.LG now. And I’d be curious to hear your take on the paper once you’ve had a look, especially whether the biological motivation (Damasio’s somatic markers as design prior for non-learned architectural signals) seems like a productive framing or an unnecessary detour.

Good exchange. Rare to find someone else arguing for architecture-first over post-hoc.


This is exactly the kind of research exchange I value. I'm glad the context-dependent boundary experiment gives you a concrete reference point for your Qwen 7B setup.

Your observation about Layer 27 is sharp. If complex tasks like the LRU Cache multi-bug repair push the semantic “fluid zone” that deep, then unifying monitoring and intervention into a single operation at one layer is more than a convenience: it is a principled, task-contingent architectural choice.

Your point about trajectory shape versus peak value is also well taken. A rigid > 0.95 cosine-similarity threshold is structurally brittle in much the same way that Top-64 hard sparsity broke our semantic geometry, and a threshold set relative to the local signal distribution should sidestep that binary cliff. Relatedly, the model variants using “Learned Signal Routing” and “Relative Mean Gating” are now in training; I look forward to comparing the two.

Following our discussion, I have also updated the reFlow paper and repository: I tightened the text and formally integrated the new findings, charts, and visualizations on the shifting information-crystallization boundary and the soft-sparsity mechanisms.

Additionally, I recently deployed an interactive web demo for our experiments:

It allows real-time interaction with the various experimental setups, alongside chart visualizations. I hope it proves useful for your own work.

Finally, on your Zenodo paper and Damasio’s “somatic markers” framework: I find it a genuinely interesting perspective. Somatic markers as a design prior give non-learned architectural signals both a strong engineering metaphor and a clear intuitive motivation. I think it is a productive framing rather than a detour, and I look forward to your progress.

Thank you again for the discussion!
