bryła — a structured semantic representation for small language models
A bryła carries pre-computed meaning, so the model recomputes less from scratch.
This card is a research log. It shows what I tried, what held up under testing, and what I had to walk back — in order, with dates. I correct my own earlier claims out loud rather than quietly editing them, because the corrections are the most useful part. If a number here looks weaker than an earlier post of mine, that is on purpose: it survived a control the earlier one didn't.
All recent numbers are mean ± 95% CI over multiple seeds unless noted.
Who I am
Krzysiek. Self-taught. Not a computer scientist, not a programmer — I learn this after night shifts at work, on my own, with help from AI. I started in December 2025 after watching a YouTube video where someone built their own Jarvis. I thought: "I want that." And so it began.
It quickly turned out that my hardware (RTX 2060 12GB, Ryzen 5 3600, 32GB RAM) can't handle big models. LLMs kept recomputing everything from scratch, context got lost, and the GPU ground away. Instead of waiting for better hardware, I looked for another way. That's how bryłas came to be. (Bryła = Polish for solid / block / geometric body; the name was chosen because the structure is many-faceted.)
What is a bryła
A bryła is a many-faceted semantic object — a packet carrying not just text but pre-computed meaning: emotion, importance, relations, context, state. Instead of forcing the model to guess these every time, a parser decomposes a sentence into bryłas, and each one tells the model: "here is what I know about myself."
The idea: rich input + small model should be cheaper to train than big model + raw text.
Example
Sentence: "WATCH OUT! This is dangerous!"
A regular LLM gets the raw text and has to infer it's an alarm. My parser decomposes it into a bryła with metadata tokens:
[ENTITY] WATCHOUT [POL:neutral] [SCOPE:session] [COLOR:red] [URG:immediate]
[INT:high] [ATT:high] [CORE] [SRC:parsed] [COMPL:high] [NEG:none]
[INTENT:demand] [TOPIC:new] [DENSE:low] [DEP:deeply_dependent]
[EXPECT:description] [READY]
The model doesn't have to guess: [COLOR:red] — alarm, [URG:immediate] — now,
[INTENT:demand] — a demand, [CORE] — the heart of the sentence.
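A minimal sketch of how a prefix like the one above could be serialized from parser output. The function name `serialize_bryla` and the list-of-pairs layout are hypothetical and not taken from the actual parser; only the tag names and values come from the example.

```python
# Hypothetical sketch: render parsed walls as the bracketed metadata tokens shown above.
# Tag names and values are from the example; the function and data layout are assumptions.

def serialize_bryla(kind, surface, walls):
    """Render one bryła as a flat, space-separated string of metadata tokens."""
    tokens = [f"[{kind}]", surface]
    for name, value in walls:
        # A value of None marks a bare flag such as [CORE] or [READY].
        tokens.append(f"[{name}]" if value is None else f"[{name}:{value}]")
    return " ".join(tokens)

walls = [
    ("POL", "neutral"), ("SCOPE", "session"), ("COLOR", "red"), ("URG", "immediate"),
    ("INT", "high"), ("ATT", "high"), ("CORE", None), ("SRC", "parsed"),
    ("COMPL", "high"), ("NEG", "none"), ("INTENT", "demand"), ("TOPIC", "new"),
    ("DENSE", "low"), ("DEP", "deeply_dependent"), ("EXPECT", "description"),
    ("READY", None),
]

prefix = serialize_bryla("ENTITY", "WATCHOUT", walls)
print(prefix)
```

Keeping the walls as an ordered list rather than a dict preserves the token order of the example, including flags that sit between key:value tokens.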
How it evolved (Dec 2025 – Apr 2026)
| Stage | When | What |
|---|---|---|
| Field of 36 points | Dec 2025 | Letters as points on a flat field. Naive, didn't work — but gave the intuition that geometric representation ≠ tokenization. |
| Cube (6 faces) | Jan 2026 | Structure instead of a flat string. Too few dimensions. |
| Bryłas v2–v5 | Feb–Mar 2026 | Cube → polyhedron. More faces, validators, pipelines. Five versions, each better. |
| v6 + GRU seq2seq | Mar 2026 | First model that actually talked. 9.4M params, val_ppl 7.3. |
| v7 + Transformer | Apr 2026 | Encoder-decoder 6+6, 53.5M params, val_ppl 31.06. |
| v8 + AFFECT | Apr 2026 | Added affective wall. val_ppl 31 → 24.12 (−22%), same model, same corpus. |
| v9 + Platform | Apr 2026 | Schema-driven (JSON, zero hardcode), pragmatics, weighted loss. val_ppl 24.02. |
At the time, I read v8's 22% drop as "this proves bryłas help." The chapters below are the story of me testing that claim properly — and finding it needed serious qualification.
The honest part: stress-testing the claims
This is where most of the real work went. Five chapters, each one: a claim, a control, and what it did to the claim.
Chapter 1 — May 2026: ablations (after @John6666's feedback on the HF Forum)
I posted v9. John6666 suggested rigorous follow-up: ablations, more seeds, randomized controls, token economy. I spent two days on it. Honest results, including the failures:
- Are all 20 fields needed? No. On 695 Q/A pairs, 3 seeds: 7 informative fields (val_ppl 810.85) beat all 20 (820.14). The extra 13 default fields are mostly noise. Small effect (~1%), consistent direction.
- A metric I had to invent to stay honest. Standard val_ppl was misleading because bryła tags are deterministic and easy to predict: they artificially lower perplexity. On Wikipedia, FULL bryła scored val_ppl_std = 2.03 but val_ppl_clean = 3.10 (text-only). The standard metric was hiding ~35% of the real perplexity. Lesson: any ablation that adds prefix tokens should report target-only perplexity.
- Three leakages I caught in my own pipeline (and retrained each time): surface text duplicated inside the bryła and after the separator (the model was copying, not generating); a [FACTS] block leaking previous text; anchors still carrying 80-char text snippets. These are exactly the artifacts that can fake a "success."
- Token economy is poor at this scale. FULL costs ~6× the tokens of RAW for ~3% perplexity gain.
- Conditional generation works (qualitatively). With masked loss (bryła as context only, loss on text), I fed hand-crafted prefixes differing only in polarity: neutral → geography/astronomy; positive → place descriptions; negative → sports/competition. Three polarity values, three topical distributions. The numbers don't show it; generation does. The model reads the bryła as conditioning.
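The target-only perplexity from the list above can be illustrated with a toy calculation. This is a sketch under assumptions: the per-token losses are made up, and `perplexity` is a hypothetical helper, not the metric code from this repo. It shows how near-deterministic prefix tokens pull the all-token average down, and it recomputes the gap implied by the two Wikipedia figures.

```python
import math

def perplexity(nlls, mask=None):
    """exp(mean NLL) over positions where mask is True (all positions if mask is None)."""
    if mask is None:
        mask = [True] * len(nlls)
    kept = [nll for nll, keep in zip(nlls, mask) if keep]
    return math.exp(sum(kept) / len(kept))

# Toy sequence (made-up NLLs in nats): 6 easy tag tokens, then 4 text tokens.
nlls = [0.05] * 6 + [1.2, 0.9, 1.4, 1.0]
is_target = [False] * 6 + [True] * 4

ppl_std = perplexity(nlls)               # averaged over tags + text
ppl_clean = perplexity(nlls, is_target)  # text tokens only
print(f"std={ppl_std:.2f} clean={ppl_clean:.2f}")

# Share of the text-only perplexity hidden by the standard metric,
# using the two Wikipedia figures quoted above.
hidden = 1 - 2.03 / 3.10
print(f"hidden fraction = {hidden:.1%}")
```

With these toy numbers the all-token perplexity comes out well below the text-only one, the same direction as the 2.03 vs 3.10 gap.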
What Chapter 1 changed: "bryłas help" weakened to "the effect is real but small (3–10%, comparable to seed variance), and the model does use the bryła as conditioning."
Chapter 2 — June 2026: the diverse dataset, and the "24/27" claim
The May corpus, I discovered, had handicapped the bryła: 9080 unique texts (87%) but only 483 unique bryłas (5%). The bryła saw 20× less variety than the text — and still tied. So I built a balanced generator where I know the ground truth (the sentence is generated from the wall values, so text and wall carry the same information). Full grid: groups {1,3,8} × model size {32,64,128} × parser noise {0,10,20%} = 27 configs.
| parser noise | bryła wins | quality (bryła vs text) | margin |
|---|---|---|---|
| 0% | 9/9 | 100.0% vs 95.0% | +5.0 pp |
| 10% | 9/9 | 98.3% vs 95.5% | +2.8 pp |
| 20% | 6/9 | 97.2% vs 96.3% | +0.8 pp |
Bryła won 24 of 27. I posted that. It was true — but, as the next chapter shows, not for the reason I thought.
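For reference, the grid and the win tally can be written out explicitly. A small sketch: the variable names are illustrative, and the win counts are copied from the results above, not recomputed.

```python
from itertools import product

# The three axes of the grid described above.
groups = [1, 3, 8]
model_sizes = [32, 64, 128]
parser_noise = [0.0, 0.10, 0.20]

# Every combination of the three axes is one configuration.
configs = list(product(groups, model_sizes, parser_noise))
print(len(configs))

# Wins per noise level, copied from the reported results (9 configs per level).
wins_by_noise = {0.0: 9, 0.10: 9, 0.20: 6}
total_wins = sum(wins_by_noise.values())
print(f"{total_wins}/{len(configs)}")
```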
(Separately measured: bryła trains up to ~2.96× faster at d=512 and is smaller parametrically; when the model grows 10×, base cost grows ~6.3× while bryła cost grows ~2.2×.)
Chapter 3 — June 2026: the control ladder, and correcting 24/27
After sharing 24/27, I ran the control ladder John suggested. It changed my conclusion, so here is the full path — including where I was too strong.
In that grid, the model predicted the same walls it received as input. So a large part of the score was the model copying the value set, not understanding structure. The control made it visible:
| RAW | DOMAIN | BRYLA | SHUFFLED | RANDOM |
|---|---|---|---|---|
| 96.8% | 96.3% | 100.0% | 99.9% | 96.2% |
SHUFFLED randomly scrambles which value sits in which wall. It barely hurt —
meaning the wall assignment wasn't being used, only the bag of values. The
diagnosis: my values were self-identifying (e.g. "formal" only ever appeared in
the register wall), so a value revealed its own wall and the labels were
redundant.
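The diagnosis can be sketched in a few lines. Assumption: the wall names and vocabularies below are invented for illustration (only "formal" belonging to the register wall comes from the text), and `shuffled` is a stand-in for the SHUFFLED control, not the code in kontrole.py. When every value belongs to exactly one wall, the original assignment can be rebuilt from the scrambled bag, so scrambling destroys nothing.

```python
import random

# Hypothetical walls with disjoint vocabularies: each value identifies its own wall.
VOCAB = {
    "register": {"formal", "casual"},
    "urgency": {"immediate", "later"},
    "polarity": {"positive", "negative"},
}
# Reverse index from value to the single wall it can belong to.
WALL_OF = {value: wall for wall, values in VOCAB.items() for value in values}

def shuffled(bryla, rng):
    """SHUFFLED-style control: keep the bag of values, scramble which wall holds which."""
    walls = list(bryla)
    values = list(bryla.values())
    rng.shuffle(values)
    return dict(zip(walls, values))

def recover(scrambled):
    """Rebuild the wall assignment from the values alone, ignoring the wall labels."""
    return {WALL_OF[value]: value for value in scrambled.values()}

original = {"register": "formal", "urgency": "immediate", "polarity": "negative"}
control = shuffled(original, random.Random(0))
print(control)
print(recover(control) == original)
```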
So I no longer claim "24/27 proves structure helps." It proves the model can reconstruct a value set it was handed. That's a real but much smaller thing.
Chapter 4 — June 2026: the redesigned test (inference, not copying)
Two fixes: (1) predict a hidden wall absent from the input, so the model must infer; (2) make all walls share the same value space (A/B/C), so a value no longer reveals its wall — the model must use the assignment. 2700 balanced pairs, with and without text (near-identical):
| | chance | RAW | BRYLA | SHUFFLED | RANDOM |
|---|---|---|---|---|---|
| with text | 66.7% | 68.6% | 100% | 79.0% | 66.4% |
| no text | 66.7% | 68.6% | 100% | 81.0% | 68.6% |
Now SHUFFLED collapses ~20 pp below BRYLA. The gain comes from real values in their correct wall positions — not from the prefix format, not from text. Putting both tests together:
unique values per wall -> SHUFFLED = BRYLA (structure redundant)
shared values per wall -> SHUFFLED < BRYLA (structure necessary, +~20pp)
Wall structure carries information exactly when values are not self-identifying. Overlapping values are the natural-language case ("high" can be certainty, urgency, or intensity). My first synthetic set was too clean, which hid the value of structure. Still synthetic, though — it shows structure CAN matter and WHEN, not that my parser on real Polish produces enough overlap for it to help in practice.
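Why a shared value space forces the model to use the assignment can be checked by enumeration. A sketch under assumptions: three input walls named w1..w3 and a toy rule where the hidden wall equals w1's value; neither is taken from test_wspolne_wartosci.py. The code groups all assignments by their bag of values and counts the bags that are compatible with more than one target.

```python
from collections import defaultdict
from itertools import product

VALUES = "ABC"              # every wall draws from the same value space
WALLS = ("w1", "w2", "w3")  # hypothetical wall names

def hidden_wall(assignment):
    """Toy rule (an assumption): the hidden wall equals w1's value."""
    return assignment["w1"]

targets_by_bag = defaultdict(set)
for combo in product(VALUES, repeat=len(WALLS)):
    assignment = dict(zip(WALLS, combo))
    bag = tuple(sorted(combo))  # what is left once positions are scrambled
    targets_by_bag[bag].add(hidden_wall(assignment))

# Bags that map to more than one possible target cannot be solved without positions.
ambiguous = [bag for bag, targets in targets_by_bag.items() if len(targets) > 1]
print(len(targets_by_bag), len(ambiguous))
```

Only the three bags made of a single repeated value pin down the target; for every other bag, the position of the values is the only way to tell the targets apart.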
Chapter 5 — June 2026: text is a tie → the pivot to multimodality
Here's the hard truth I stopped dancing around. On a binding control on real text (bryła vs the same tags shuffled), the difference is ≈ 0. A generating decoder scores bryła-tags-only 2.8% vs raw text 10.2%. The reason is structural:
The walls are built FROM the text, so they are a lossy summary of it. They cannot carry a fact the text doesn't already contain.
So on pure text, bryła is a tie, not a win, and I'm done claiming otherwise. That points to the only place "bryła carries more" can be literally true: information physically absent from the text — pixels, sound. There a sensory wall is an independent source, not a summary. The card had listed sensory walls as a placeholder, ready to be plugged in. So I plugged them in and measured.
1. Carrier passes a non-text fact. Answer depends on a real image (CLIP vector), text neutral: 93.5 ± 1.0% vs ~10% text alone (5 seeds).
2. Fusion adds over either channel alone. image=object, text=feature, target=both: +66 pp over the best single channel.
3. Fusion is adaptive — it weights channels by reliability. When text is unreliable (50% correct) the model ignores it and stays at image-only level; as text reliability rises, accuracy climbs monotonically (94 → 100%). A lie on one axis doesn't corrupt the other.
4. Three modalities. text + image + audio, three different encoders (embedding, CLIP, MFCC), one answer: full 48.4 ± 3.1% (feature 100% from text, object 93% from image, word 52% from audio); single channels score 1–3%.
5. A hypothesis that did NOT hold. I expected fusion to need slot-ordered concatenation to preserve channel binding. It doesn't: summing the channel vectors ≈ concatenation (−0.8 pp). Simpler than I thought. Binding only matters for the overlapping-value tasks of Chapter 4, not for fusing independent axes.
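A quick arithmetic check on item 4, using only the three per-channel accuracies quoted there. Assumption: the three parts are answered independently, so the chance of getting all of them right is the product of the per-part accuracies. Whether the errors are actually independent was not measured here.

```python
# Per-part accuracies quoted in item 4.
feature_acc = 1.00  # from text
object_acc = 0.93   # from image
word_acc = 0.52     # from audio

# Under the independence assumption, the full answer is right only if all parts are.
joint_if_independent = feature_acc * object_acc * word_acc
print(f"{joint_if_independent:.1%}")  # prints 48.4%
```

The product is in line with the measured full-answer accuracy of 48.4 ± 3.1%, which suggests the audio channel is the bottleneck.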
What Chapter 5 establishes: on text, bryła is a lossy summary (tie); its measured value is as a shared space for heterogeneous channels — it passes, fuses, and adaptively weights text, image, and sound.
What I claim now — and what I don't
I claim (measured, with CI):
- Bryła trains faster at scale and is smaller.
- On text, bryła ties raw text (it's a lossy summary — this is a tie, not a win).
- Wall structure carries information a bag of values doesn't — but only when values overlap across walls (synthetic, +~20 pp).
- As a multimodal carrier it passes non-text facts (image 93.5%), fuses channels (+66 pp), and weights them adaptively. Confirmed across image + audio.
I do NOT claim:
- That bryła beats text on text tasks. (It doesn't — tie.)
- That any of this is shown on natural data. All the strong results are on controlled/synthetic inputs. This is a proof of concept, not a deployed win.
- That memory anchors work. The original motivation of this whole project (longer context via anchor-bryłas against lost-in-the-middle) has never been measured. It's a separate pillar and my honest next target.
- That bryłas can be generated on the output side, or that any modality is generated. Everything here is understanding: multimodal input → text output.
Walls of bryła v9
Semantics · Affect (emotion color, urgency, intensity, attention/learning weight) · Pragmatics (negation, intent strength, topic continuity, density, expected format) · Relations (graph of links between bryłas) · State (parser confidence, completeness, source: parsed/verified/stated/retrieved) · Anchors (every N bryłas, a compression anchor — schema-ready, but its effect on long context is untested, see above) · Temporal (schema, v10) · Speaker (schema, v11) · Sensory (audio/image slots — now measured, see Chapter 5).
Problems bryłas are designed to attack
These are design intentions, not all measured wins — I list them so the goal is
clear, not to claim they're solved: hallucinations ([SRC]), "I don't know"
([COMPL]), important-vs-not ([CORE]/[ATT]/anchors), lost-in-the-middle
(anchors), tone/irony (affect), compute cost (pre-computed meaning), catastrophic
forgetting (schema swap, not model change), generation control ([HINT]/[EXPECT]).
Of these, only compute cost and conditioning are measured so far.
Strategic goal
I am not building another chatbot. I'm building a learning architecture to cheaply train a specialist in any domain — medicine, cooking, law, robotics — as long as the corpus is good. Changing domain = changing one JSON schema, zero code changes.
Original hypothesis: small model + rich bryłas ≈ big model + raw text, but cheaper. Status after testing: on text, unconfirmed (tie). Where it holds is multimodality — bryła as a carrier that fuses information no single channel has.
Hardware
RTX 2060 12GB · Ryzen 5 3600 · 32GB RAM · Windows 11 Pro · ~20–30 min per run. No datacenter. No A100. A regular PC and persistence.
Reproduce it yourself
Don't take my word — the control scripts are in this repo:
pip install torch
python kontrole.py --pairs dataset_zroznicowany.json --epoki 30 --dmodel 64
python test_wspolne_wartosci.py --epoki 30 --dmodel 64 --na-kombinacje 100
# multimodal carrier:
python test_obraz.py # image via CLIP
python test_fuzja.py # text + image fusion
python runda_kontrolna.py --seedy 5 # seeds, CI, sum-vs-concat, conflict
python test_audio.py --seedy 5 # text + image + audio
Raw results: kontrole_wynikow.csv, wspolne_wynikow.csv.
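Per-seed results can be summarized as mean ± 95% CI along these lines. A sketch, assuming a Student's t interval over seeds; the card does not state which interval the scripts use, and the accuracies below are made up for illustration.

```python
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom (n - 1).
T_975 = {2: 4.303, 3: 3.182, 4: 2.776, 9: 2.262}

def mean_ci95(samples):
    """Return (mean, half-width) of a 95% t-interval over per-seed results."""
    n = len(samples)
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / n ** 0.5  # standard error of the mean
    return mean, T_975[n - 1] * sem

# Made-up per-seed accuracies (%), for illustration only; not results from this card.
seed_acc = [70.0, 72.0, 71.0, 69.0, 73.0]
mean, half_width = mean_ci95(seed_acc)
print(f"{mean:.1f} ± {half_width:.1f}%")
```

With only 3 to 5 seeds the t critical value is much larger than the normal 1.96, so the interval is noticeably wider than a naive ±1.96 × SEM.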
What this is not
Not a GPT-4 competitor — an experiment on a small model, custom input architecture, home hardware. Not finished — work in progress. Not perfect — the corpus is small, the model sometimes repeats words. Not statistically conclusive on text — text-side effects are within seed variance. The multimodal results are stronger (clear gaps, CIs) but on controlled data.
Roadmap
- v10 — bryła as attention unit (hierarchical attention between bryłas).
- v11 — generating bryłas instead of text (would open output-side multimodality).
- v12 — schema as a platform for many domains.
- Near-term, honest priorities: (1) a real memory-anchor test (needle-in-a-haystack), the untested pillar; (2) a stronger audio encoder (wav2vec) to lift the 52%; (3) multimodal fusion on natural data, not synthetic.
Where this came from
I was at home, my RTX 2060 grinding Qwen, context getting lost, frustration rising. Instead of complaining I started experimenting: what if, instead of making the model recompute everything, I pack part of the meaning into the message itself? Field of points → cube → bryłas → affect walls → platform. I didn't know this was called "neurosymbolic AI" or that people had studied "structured inputs" for years. I got there on my own, because it bothered me that the LLM kept computing the same thing from scratch.
A bryła doesn't generate meaning — it carries it, from parser to model. Like a lamp over the table: it lights up once, not every time.
License & contact
CC BY-NC-SA 4.0 — use, modify, share; non-commercial, with attribution, same license. If you're doing something similar, or fighting the same problems on weak hardware — write. I'm looking for people to work with, not to compete.
HuggingFace: krzysiekpl · Developed since December 2025. Alone. By night. With persistence.