bryła — a structured semantic representation for small language models

A bryła carries pre-computed meaning, so the model recomputes less from scratch.

This card is a research log. It shows what I tried, what held up under testing, and what I had to walk back — in order, with dates. I correct my own earlier claims out loud rather than quietly editing them, because the corrections are the most useful part. If a number here looks weaker than an earlier post of mine, that is on purpose: it survived a control the earlier one didn't.

All recent numbers are mean ± 95% CI over multiple seeds unless noted.
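For readers who want to check an interval themselves, here is a minimal sketch of one common way to get "mean ± 95% CI" from a handful of seeds: a Student-t interval. Which interval the scripts in this repo actually compute is not stated here, so treat the method, the function name `mean_ci95`, and the example scores as assumptions for illustration only.

```python
import math
import statistics

# Two-sided 95% Student-t critical values, indexed by degrees of freedom (n - 1).
T_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
        6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}

def mean_ci95(values):
    """Return (mean, half_width) of a 95% t-interval over per-seed results."""
    n = len(values)
    if n - 1 not in T_95:
        raise ValueError("this sketch covers 2 to 10 seeds")
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(n)  # standard error of the mean
    return mean, T_95[n - 1] * sem

# Illustrative per-seed scores, NOT results from this card.
scores = [0.70, 0.72, 0.68, 0.71, 0.69]
mean, half = mean_ci95(scores)
print(f"{mean:.3f} ± {half:.3f}")
```

With only 3 to 5 seeds the t critical value is much larger than the familiar 1.96, so intervals computed this way are wider than a normal approximation would give.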

Who I am

Krzysiek. Self-taught. Not a computer scientist, not a programmer — I learn this after night shifts at work, on my own, with help from AI. I started in December 2025 after watching a YouTube video where someone built their own Jarvis. I thought: "I want that." And so it began.

It quickly turned out my hardware (RTX 2060 12GB, Ryzen 5 3600, 32GB RAM) can't handle big models. LLMs kept recomputing everything from scratch, context got lost, the GPU ground away. Instead of waiting for better hardware, I looked for another way. That's how bryłas came to be. (Bryła is Polish for solid, block, or geometric body; I chose it because the structure is many-faceted.)

What is a bryła

A bryła is a many-faceted semantic object — a packet carrying not just text but pre-computed meaning: emotion, importance, relations, context, state. Instead of forcing the model to guess these every time, a parser decomposes a sentence into bryłas, and each one tells the model: "here is what I know about myself."

The idea: rich input + small model should be cheaper to train than big model + raw text.

Example

Sentence: "WATCH OUT! This is dangerous!"

A regular LLM gets the raw text and has to infer it's an alarm. My parser decomposes it into a bryła with metadata tokens:

[ENTITY] WATCHOUT [POL:neutral] [SCOPE:session] [COLOR:red] [URG:immediate]
[INT:high] [ATT:high] [CORE] [SRC:parsed] [COMPL:high] [NEG:none]
[INTENT:demand] [TOPIC:new] [DENSE:low] [DEP:deeply_dependent]
[EXPECT:description] [READY]

The model doesn't have to guess: [COLOR:red] — alarm, [URG:immediate] — now, [INTENT:demand] — a demand, [CORE] — the heart of the sentence.
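To make the token format concrete, here is a minimal sketch that renders the example above from an ordered list of tags. The function names `render_tag` and `serialize_bryla`, and the choice to treat `[READY]` as a closing token, are my assumptions for illustration. The real parser is schema-driven and may be organized quite differently.

```python
from typing import List, Tuple, Union

# A tag is either a bare flag such as "CORE" or a (key, value) pair.
Tag = Union[str, Tuple[str, str]]

def render_tag(tag: Tag) -> str:
    """Render one tag as a bracketed token."""
    if isinstance(tag, tuple):
        key, value = tag
        return f"[{key}:{value}]"
    return f"[{tag}]"

def serialize_bryla(kind: str, surface: str, tags: List[Tag]) -> str:
    """Render one bryła as a flat, space-separated token string.
    Assumption: [READY] closes every bryła."""
    parts = [f"[{kind}]", surface] + [render_tag(t) for t in tags] + ["[READY]"]
    return " ".join(parts)

watch_out = serialize_bryla("ENTITY", "WATCHOUT", [
    ("POL", "neutral"), ("SCOPE", "session"), ("COLOR", "red"),
    ("URG", "immediate"), ("INT", "high"), ("ATT", "high"), "CORE",
    ("SRC", "parsed"), ("COMPL", "high"), ("NEG", "none"),
    ("INTENT", "demand"), ("TOPIC", "new"), ("DENSE", "low"),
    ("DEP", "deeply_dependent"), ("EXPECT", "description"),
])
print(watch_out)
```

Keeping the tags as an ordered list rather than a dictionary preserves the position of bare flags such as `[CORE]` between the key-value fields.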

How it evolved (Dec 2025 – Apr 2026)

Stage	When	What
Field of 36 points	Dec 2025	Letters as points on a flat field. Naive, didn't work — but gave the intuition that geometric representation ≠ tokenization.
Cube (6 faces)	Jan 2026	Structure instead of a flat string. Too few dimensions.
Bryłas v2–v5	Feb–Mar 2026	Cube → polyhedron. More faces, validators, pipelines. Four versions (v2 through v5), each better than the last.
v6 + GRU seq2seq	Mar 2026	First model that actually talked. 9.4M params, val_ppl 7.3.
v7 + Transformer	Apr 2026	Encoder-decoder 6+6, 53.5M params, val_ppl 31.06.
v8 + AFFECT	Apr 2026	Added affective wall. val_ppl 31 → 24.12 (−22%), same model, same corpus.
v9 + Platform	Apr 2026	Schema-driven (JSON, nothing hard-coded), pragmatics, weighted loss. val_ppl 24.02.

At the time, I read v8's 22% drop as "this proves bryłas help." The chapters below are the story of me testing that claim properly — and finding it needed serious qualification.

The honest part: stress-testing the claims

This is where most of the real work went. Five chapters, each one: a claim, a control, and what it did to the claim.

Chapter 1 — May 2026: ablations (after @John6666's feedback on the HF Forum)

I posted v9. John6666 suggested rigorous follow-up: ablations, more seeds, randomized controls, token economy. I spent two days on it. Honest results, including the failures:

Are all 20 fields needed? No. On 695 Q/A pairs with 3 seeds, the 7 informative fields (val_ppl 810.85) beat all 20 (820.14). The extra 13 default fields are mostly noise. The effect is small (about 1%, or 9.3 perplexity points) but consistent in direction.
A metric I had to invent to stay honest. Standard val_ppl was misleading because bryła tags are deterministic and easy to predict — they artificially lower perplexity. On Wikipedia, FULL bryła scored val_ppl_std = 2.03 but val_ppl_clean = 3.10 (text-only). The standard metric was hiding ~35% of the real perplexity. Lesson: any ablation that adds prefix tokens should report target-only perplexity.
Three leakages I caught in my own pipeline (and retrained each time): surface text duplicated inside the bryła and after the separator (the model was copying, not generating); a [FACTS] block leaking previous text; anchors still carrying 80-char text snippets. These are exactly the artifacts that can fake a "success."
Token economy is poor at this scale. FULL costs ~6× the tokens of RAW for ~3% perplexity gain.
Conditional generation works (qualitatively). With masked loss (bryła as context only, loss on text), I fed hand-crafted prefixes differing only in polarity: neutral → geography/astronomy; positive → place descriptions; negative → sports/competition. Three polarity values, three topical distributions. The numbers don't show it; generation does. The model reads the bryła as conditioning.
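The gap between val_ppl_std and val_ppl_clean comes down to which tokens enter the average. Here is a minimal sketch of that calculation on plain lists of per-token negative log-likelihoods. The function names and the example values are illustrative assumptions, not taken from the training scripts.

```python
import math

def perplexity(nlls):
    """exp of the mean negative log-likelihood (natural log) over the given tokens."""
    return math.exp(sum(nlls) / len(nlls))

def ppl_std_and_clean(token_nlls, is_target):
    """Return (perplexity over all tokens, perplexity over target text tokens only)."""
    target_nlls = [nll for nll, keep in zip(token_nlls, is_target) if keep]
    if not target_nlls:
        raise ValueError("no target tokens to score")
    return perplexity(token_nlls), perplexity(target_nlls)

# Illustrative values: 4 near-deterministic tag tokens, then 4 text tokens.
nlls = [0.05, 0.05, 0.05, 0.05, 1.2, 1.0, 1.1, 1.3]
is_target = [False] * 4 + [True] * 4

ppl_std, ppl_clean = ppl_std_and_clean(nlls, is_target)
print(f"std={ppl_std:.2f} clean={ppl_clean:.2f}")
```

The easy tag tokens pull the all-token average down, so the standard figure looks better than the text-only one. The same boolean mask is what a masked-loss setup applies during training. In PyTorch this is commonly done with the `ignore_index` argument of `torch.nn.CrossEntropyLoss`, though whether the scripts here do it that way is not stated.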

What Chapter 1 changed: "bryłas help" weakened to "the effect is real but small (3–10%, comparable to seed variance), and the model does use the bryła as conditioning."

Chapter 2 — June 2026: the diverse dataset, and the "24/27" claim

The May corpus, I discovered, had handicapped the bryła: 9080 unique texts (87%) but only 483 unique bryłas (5%). The bryła saw 20× less variety than the text — and still tied. So I built a balanced generator where I know the ground truth (the sentence is generated from the wall values, so text and wall carry the same information). Full grid: groups {1,3,8} × model size {32,64,128} × parser noise {0,10,20%} = 27 configs.

parser noise   bryła wins   quality (bryła vs text)   margin
0%             9/9          100.0% vs 95.0%            +5.0 pp
10%            9/9           98.3% vs 95.5%            +2.8 pp
20%            6/9           97.2% vs 96.3%            +0.8 pp

Bryła won 24 of 27. I posted that. It was true — but, as the next chapter shows, not for the reason I thought.
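For anyone re-running the grid, the 27 configurations and the win tally from the table can be enumerated in a few lines. This is a bookkeeping sketch only. The dictionary keys are names I made up for illustration, and the training loop itself is not shown.

```python
from itertools import product

groups = [1, 3, 8]
model_sizes = [32, 64, 128]
parser_noise = [0.0, 0.10, 0.20]

# One dict per run; key names are hypothetical.
configs = [
    {"groups": g, "model_size": m, "noise": p}
    for g, m, p in product(groups, model_sizes, parser_noise)
]
print(len(configs))

# Wins per noise level, copied from the table above.
wins = {0.0: 9, 0.10: 9, 0.20: 6}
runs_per_level = len(configs) // len(parser_noise)
print(f"{sum(wins.values())} of {len(configs)} ({runs_per_level} runs per noise level)")
```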

(Separately measured: bryła trains up to ~2.96× faster at d=512 and is smaller parametrically; when the model grows 10×, base cost grows ~6.3× while bryła cost grows ~2.2×.)

Chapter 3 — June 2026: the control ladder, and correcting 24/27

After sharing 24/27, I ran the control ladder John suggested. It changed my conclusion, so here is the full path — including where I was too strong.

In that grid, the model predicted the same walls it received as input. So a large part of the score was the model copying the value set, not understanding structure. The control made it visible:

RAW 96.8%   DOMAIN 96.3%   BRYLA 100.0%   SHUFFLED 99.9%   RANDOM 96.2%

SHUFFLED randomly scrambles which value sits in which wall. It barely hurt — meaning the wall assignment wasn't being used, only the bag of values. The diagnosis: my values were self-identifying (e.g. "formal" only ever appeared in the register wall), so a value revealed its own wall and the labels were redundant.

So I no longer claim "24/27 proves structure helps." It proves the model can reconstruct a value set it was handed. That's a real but much smaller thing.
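A small sketch of why SHUFFLED could not hurt in that grid. It builds three of the control inputs from one bryła and then shows that, when each value belongs to exactly one wall, the original assignment can be restored from the values alone. The wall names other than "register", the vocabularies, and the definition of RANDOM (values drawn at random) are assumptions for illustration. DOMAIN is left out because its construction is not described here.

```python
import random

def make_controls(bryla, vocab, rng):
    """Build BRYLA / SHUFFLED / RANDOM inputs from one wall -> value mapping."""
    walls = list(bryla)
    shuffled = [bryla[w] for w in walls]
    rng.shuffle(shuffled)  # same bag of values, scrambled across walls
    all_values = [v for values in vocab.values() for v in values]
    return {
        "BRYLA": dict(bryla),
        "SHUFFLED": dict(zip(walls, shuffled)),
        "RANDOM": {w: rng.choice(all_values) for w in walls},  # assumed definition
    }

def restore_walls(assignment, vocab):
    """Undo a shuffle using only the values.
    Works only when every value belongs to exactly one wall."""
    owner = {v: w for w, values in vocab.items() for v in values}
    return {owner[v]: v for v in assignment.values()}

# Self-identifying vocabularies: no value appears in two walls.
vocab = {
    "register": ["formal", "casual"],
    "urgency": ["now", "later"],
    "polarity": ["pos", "neg"],
}
bryla = {"register": "formal", "urgency": "now", "polarity": "neg"}

controls = make_controls(bryla, vocab, random.Random(0))
print(controls["SHUFFLED"])
print(restore_walls(controls["SHUFFLED"], vocab) == bryla)  # True
```

If a lookup table can undo the shuffle, so can a trained model, which is why SHUFFLED and BRYLA land within 0.1 pp of each other.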

Chapter 4 — June 2026: the redesigned test (inference, not copying)

Two fixes: (1) predict a hidden wall absent from the input, so the model must infer; (2) make all walls share the same value space (A/B/C), so a value no longer reveals its wall and the model must use the assignment. 2700 balanced pairs, run with and without the text; the two runs are near-identical:

            chance   RAW    BRYLA   SHUFFLED   RANDOM
with text   66.7%    68.6%  100%    79.0%      66.4%
no text     66.7%    68.6%  100%    81.0%      68.6%

Now SHUFFLED collapses ~20 pp below BRYLA. The gain comes from real values in their correct wall positions — not from the prefix format, not from text. Putting both tests together:

unique values per wall   ->  SHUFFLED = BRYLA   (structure redundant)
shared values per wall   ->  SHUFFLED < BRYLA   (structure necessary, +~20pp)

Wall structure carries information exactly when values are not self-identifying. Overlapping values are the natural-language case ("high" can be certainty, urgency, or intensity). My first synthetic set was too clean, which hid the value of structure. Still synthetic, though — it shows structure CAN matter and WHEN, not that my parser on real Polish produces enough overlap for it to help in practice.
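The two-line summary above can be made concrete by counting how many wall assignments are compatible with a given bag of values. With unique vocabularies there is exactly one, so the labels add nothing. With a shared A/B/C space there are several, so the labels carry the missing information. The wall names (other than "register") and the vocabularies below are illustrative assumptions.

```python
from itertools import permutations

def consistent_assignments(bag, vocab):
    """All wall -> value mappings that use exactly the values in `bag`
    and respect each wall's allowed values."""
    walls = list(vocab)
    found = set()
    for perm in permutations(bag):
        if all(value in vocab[wall] for wall, value in zip(walls, perm)):
            found.add(perm)  # the set removes duplicates from repeated values
    return [dict(zip(walls, perm)) for perm in sorted(found)]

unique_vocab = {
    "register": {"formal", "casual"},
    "urgency": {"now", "later"},
    "polarity": {"pos", "neg"},
}
shared_vocab = {wall: {"A", "B", "C"} for wall in unique_vocab}

print(len(consistent_assignments(["now", "formal", "neg"], unique_vocab)))  # 1
print(len(consistent_assignments(["A", "B", "C"], shared_vocab)))           # 6
print(len(consistent_assignments(["A", "A", "B"], shared_vocab)))           # 3
```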

Chapter 5 — June 2026: text is a tie → the pivot to multimodality

Here's the hard truth I stopped dancing around. On a binding control on real text (bryła vs the same tags shuffled), the difference is ≈ 0. With a generating decoder, bryła tags alone score 2.8% against 10.2% for raw text. The reason is structural:

The walls are built FROM the text, so they are a lossy summary of it. They cannot carry a fact the text doesn't already contain.
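This is the data processing inequality in plain words. As a sketch, write T for the text, W = f(T) for the walls the parser computes from it, and Y for whatever the model must predict (the symbols are mine, introduced only for this note). Because W depends on Y only through T:

```latex
\[
Y \longrightarrow T \longrightarrow W
\quad\Longrightarrow\quad
I(Y; W) \;\le\; I(Y; T),
\qquad
I(Y; T, W) \;=\; I(Y; T).
\]
```

The inequality bounds the information available, not how easily a small model can extract it, so it rules out "bryła carries more than the text" without ruling out "bryła is cheaper to learn from".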

So on pure text, bryła is a tie, not a win, and I'm done claiming otherwise. That points to the only place "bryła carries more" can be literally true: information physically absent from the text — pixels, sound. There a sensory wall is an independent source, not a summary. The card had listed sensory walls as a placeholder, ready to be plugged in. So I plugged them in and measured.

1. Carrier passes a non-text fact. Answer depends on a real image (CLIP vector), text neutral: 93.5 ± 1.0% vs ~10% text alone (5 seeds).

2. Fusion adds over either channel alone. image=object, text=feature, target=both: +66 pp over the best single channel.

3. Fusion is adaptive — it weights channels by reliability. When text is unreliable (50% correct) the model ignores it and stays at image-only level; as text reliability rises, accuracy climbs monotonically (94 → 100%). A lie on one axis doesn't corrupt the other.

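4. Three modalities. text + image + audio, three different encoders (embedding, CLIP, MFCC), one answer: full 48.4 ± 3.1% (feature 100% from text, object 93% from image, word 52% from audio); single channels score 1–3%. The full score is close to the product of the three per-part accuracies (1.00 × 0.93 × 0.52 ≈ 0.48), which is what you would see if "full" means all three parts correct and the parts fail independently. Audio is the bottleneck.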

5. A hypothesis that did NOT hold. I expected fusion to need slot-ordered concatenation to preserve channel binding. It doesn't: summing the channel vectors ≈ concatenation (−0.8 pp). Simpler than I thought. Binding only matters for the overlapping-value tasks of Chapter 4, not for fusing independent axes.
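To show what is being compared in point 5, here is a minimal sketch of the two fusion styles on plain lists. The function names and the toy vectors are illustrative assumptions. In the real tests the channel vectors come from the encoders and feed a trained model.

```python
from typing import Dict, List

Vector = List[float]

def fuse_concat(channels: Dict[str, Vector], order: List[str]) -> Vector:
    """Slot-ordered concatenation: position in the output identifies the channel."""
    fused: Vector = []
    for name in order:
        fused.extend(channels[name])
    return fused

def fuse_sum(channels: Dict[str, Vector]) -> Vector:
    """Element-wise sum: output width stays fixed, channel identity is not kept by position."""
    vectors = list(channels.values())
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("summation needs every channel projected to the same width")
    return [sum(column) for column in zip(*vectors)]

# Toy vectors standing in for projected encoder outputs.
channels = {
    "text": [1.0, 0.0, 0.0, 0.0],
    "image": [0.0, 0.5, 0.5, 0.0],
    "audio": [0.0, 0.0, 0.0, 2.0],
}
concat = fuse_concat(channels, ["text", "image", "audio"])
summed = fuse_sum(channels)
print(len(concat), len(summed))  # 12 4
```

Summation keeps the input width constant as channels are added, which is the practical reason the near-zero gap (−0.8 pp) is good news on small hardware.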

What Chapter 5 establishes: on text, bryła is a lossy summary (tie); its measured value is as a shared space for heterogeneous channels — it passes, fuses, and adaptively weights text, image, and sound.

What I claim now — and what I don't

I claim (measured, with CI):

Bryła trains faster (up to ~2.96× at d=512) and has fewer parameters.
On text, bryła ties raw text (it's a lossy summary — this is a tie, not a win).
Wall structure carries information a bag of values doesn't — but only when values overlap across walls (synthetic, +~20 pp).
As a multimodal carrier it passes non-text facts (image 93.5%), fuses channels (+66 pp), and weights them adaptively. Confirmed across image + audio.

I do NOT claim:

That bryła beats text on text tasks. (It doesn't — tie.)
That any of this is shown on natural data. All the strong results are on controlled/synthetic inputs. This is a proof of concept, not a deployed win.
That memory anchors work. The original motivation of this whole project (longer context via anchor-bryłas, against lost-in-the-middle) has not been measured at all. It's a separate pillar and my honest next target.
That bryłas can be generated on the output side, or that any modality is generated. Everything here is understanding: multimodal input → text output.

Walls of bryła v9

Semantics · Affect (emotion color, urgency, intensity, attention/learning weight) · Pragmatics (negation, intent strength, topic continuity, density, expected format) · Relations (graph of links between bryłas) · State (parser confidence, completeness, source: parsed/verified/stated/retrieved) · Anchors (every N bryłas, a compression anchor — schema-ready, but its effect on long context is untested, see above) · Temporal (schema, v10) · Speaker (schema, v11) · Sensory (audio/image slots — now measured, see Chapter 5).

Problems bryłas are designed to attack

These are design intentions, not all measured wins — I list them so the goal is clear, not to claim they're solved: hallucinations ([SRC]), "I don't know" ([COMPL]), important-vs-not ([CORE]/[ATT]/anchors), lost-in-the-middle (anchors), tone/irony (affect), compute cost (pre-computed meaning), catastrophic forgetting (schema swap, not model change), generation control ([HINT]/[EXPECT]). Of these, only compute cost and conditioning are measured so far.

Strategic goal

I am not building another chatbot. I'm building a learning architecture to cheaply train a specialist in any domain — medicine, cooking, law, robotics — as long as the corpus is good. Changing domain = changing one JSON schema, zero code changes.
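As a rough picture of what "changing one JSON schema" could look like, here is a sketch in which the allowed walls and values live in JSON and the checking code never mentions a specific wall. The schema layout, the function name `validate`, and most of the listed values are hypothetical. Only tags such as COLOR:red, URG:immediate and INTENT:demand appear in the example earlier in this card.

```python
import json

# Hypothetical schema: wall name -> allowed values.
SCHEMA_JSON = """
{
  "walls": {
    "COLOR": ["red", "green", "neutral"],
    "URG": ["immediate", "later", "none"],
    "INTENT": ["demand", "question", "statement"]
  }
}
"""

def validate(bryla, schema):
    """Return a list of problems; an empty list means the bryła fits the schema."""
    problems = []
    for wall, value in bryla.items():
        allowed = schema["walls"].get(wall)
        if allowed is None:
            problems.append(f"unknown wall: {wall}")
        elif value not in allowed:
            problems.append(f"{wall}: {value!r} not in {allowed}")
    return problems

schema = json.loads(SCHEMA_JSON)
print(validate({"COLOR": "red", "URG": "immediate"}, schema))  # []
print(validate({"COLOR": "purple", "MOOD": "calm"}, schema))
```

Swapping domains then means loading a different JSON document. The function itself stays untouched.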

Original hypothesis: small model + rich bryłas ≈ big model + raw text, but cheaper. Status after testing: on text, unconfirmed (tie). Where it holds is multimodality — bryła as a carrier that fuses information no single channel has.

Hardware

RTX 2060 12GB · Ryzen 5 3600 · 32GB RAM · Windows 11 Pro · ~20–30 min per run. No datacenter. No A100. A regular PC and persistence.

Reproduce it yourself

Don't take my word — the control scripts are in this repo:

pip install torch
# Flag names are Polish: --epoki = epochs, --seedy = seeds.
# controls on the diverse dataset (Chapters 2-3):
python kontrole.py --pairs dataset_zroznicowany.json --epoki 30 --dmodel 64
# shared-values test (Chapter 4):
python test_wspolne_wartosci.py --epoki 30 --dmodel 64 --na-kombinacje 100
# multimodal carrier:
python test_obraz.py        # image via CLIP
python test_fuzja.py        # text + image fusion
python runda_kontrolna.py --seedy 5   # seeds, CI, sum-vs-concat, conflict
python test_audio.py --seedy 5        # text + image + audio

Raw results: kontrole_wynikow.csv, wspolne_wynikow.csv.

What this is not

Not a GPT-4 competitor — an experiment on a small model, custom input architecture, home hardware. Not finished — work in progress. Not perfect — the corpus is small, the model sometimes repeats words. Not statistically conclusive on text — text-side effects are within seed variance. The multimodal results are stronger (clear gaps, CIs) but on controlled data.

Roadmap

v10 — bryła as attention unit (hierarchical attention between bryłas).
v11 — generating bryłas instead of text (would open output-side multimodality).
v12 — schema as a platform for many domains.
Near-term, honest priorities: (1) a real memory-anchor test (needle-in-a-haystack), the untested pillar; (2) a stronger audio encoder (wav2vec) to lift the 52%; (3) multimodal fusion on natural data, not synthetic.

Where this came from

I was at home, my RTX 2060 grinding Qwen, context getting lost, frustration rising. Instead of complaining I started experimenting: what if, instead of making the model recompute everything, I pack part of the meaning into the message itself? Field of points → cube → bryłas → affect walls → platform. I didn't know this was called "neurosymbolic AI" or that people had studied "structured inputs" for years. I got there on my own, because it bothered me that the LLM kept computing the same thing from scratch.

A bryła doesn't generate meaning — it carries it, from parser to model. Like a lamp over the table: it lights up once, not every time.

License & contact

CC BY-NC-SA 4.0 — use, modify, share; non-commercial, with attribution, same license. If you're doing something similar, or fighting the same problems on weak hardware — write. I'm looking for people to work with, not to compete.

HuggingFace: krzysiekpl · Developed since December 2025. Alone. By night. With persistence.
