# Qwopus-GLM-18B Merge — Full Technical Workflow **Date:** April 17, 2026 **Author:** Kyle Hessling ([@KyleHessling1](https://x.com/KyleHessling1)) **Hardware:** NVIDIA RTX 5090 (32 GB VRAM), Intel Core Ultra 7 265K, 125 GB RAM **Software:** llama.cpp (CUDA 12.8), Python 3.12, Unsloth 2026.4.6, mergekit 0.1.4 --- ## Table of Contents 1. [Motivation](#1-motivation) 2. [Source Models](#2-source-models) 3. [Merge Strategy](#3-merge-strategy) 4. [Merge Execution](#4-merge-execution) 5. [GGUF Conversion & Quantization](#5-gguf-conversion--quantization) 6. [Benchmark Suite](#6-benchmark-suite) 7. [Results](#7-results) 8. [Heal Fine-Tune (Post-Merge Training)](#8-heal-fine-tune-post-merge-training) 9. [Lessons Learned](#9-lessons-learned) 10. [Reproducing This Work](#10-reproducing-this-work) --- ## 1. Motivation The community around Jackrong's Qwen3.5 finetunes had a clear gap: the 27B models delivered excellent quality but required 16+ GB VRAM even at Q4, while the 9B models fit on consumer GPUs but left performance on the table. We wanted to explore whether frankenmerging two complementary 9B finetunes could create a viable ~18B "middle ground" — something that fits on 12-16 GB GPUs while delivering improved capability. The two source models were chosen because they were trained on fundamentally different reasoning data: - **Qwopus v3.5**: Opus-style reasoning distillation (agentic, tool-use focused) - **GLM-5.1 Distill**: GLM-style structured problem decomposition The hypothesis was that stacking layers from these differently-trained models would produce a deeper network with more diverse reasoning capabilities than either source alone. --- ## 2. Source Models ### Jackrong/Qwopus3.5-9B-v3.5 - **Base:** Qwen3.5-9B - **Training:** SFT with ~2x more data than v3, focused on reasoning, coding, and agentic tasks - **Data sources:** Mathematics, programming, puzzle-solving, multilingual dialogue, instruction-following, STEM - **Framework:** Unsloth - **Key strength:** Agentic tool use, code generation, structured reasoning ### Jackrong/Qwen3.5-9B-GLM5.1-Distill-v1 - **Base:** Qwen3.5-9B - **Training:** SFT via distillation from GLM-5.1 teacher model using LoRA - **Data sources:** `Jackrong/GLM-5.1-Reasoning-1M-Cleaned` (~700x scale), `Jackrong/Qwen3.5-reasoning-700x` - **Framework:** Unsloth - **Key strength:** Structured problem decomposition, instruction adherence, reasoning scaffolds ### Architecture (shared by both) ``` Architecture: Qwen3_5ForConditionalGeneration (multimodal) Layers: 32 Hidden size: 4096 Attention heads: 16 (4 KV heads, GQA) Intermediate size: 12288 Attention type: Hybrid (linear + full, every 4th layer) Vocab size: 248,320 Context length: 262,144 tokens Parameters: ~10B ``` --- ## 3. Merge Strategy ### Method: Passthrough Frankenmerge (Layer Stacking) We used a **passthrough** merge — the simplest form of frankenmerging — where all layers from both models are concatenated sequentially without any weight interpolation: ``` Output model (64 layers): Layers 0-31: All transformer layers from Qwopus3.5-9B-v3.5 Layers 32-63: All transformer layers from Qwen3.5-9B-GLM5.1-Distill-v1 Embeddings: From model A (Qwopus) LM head: From model A (Qwopus) MTP layers: From model A (Qwopus) Vision encoder: From model A (Qwopus) ``` ### Why Passthrough (Not SLERP/TIES/DARE)? - Both models share the same base (Qwen3.5-9B), so their weight spaces are already aligned - SLERP/TIES/DARE interpolate weights at the same layer positions — they produce a 9B model, not an 18B model - Passthrough is the only method that increases depth (and therefore parameter count) - The goal was specifically an 18B model to fill the 9B-27B gap ### Why mergekit Failed We initially attempted to use `mergekit-yaml` with the standard passthrough config. It failed for two reasons: 1. **Multi-module architecture:** Qwen3.5 is a multimodal model (`Qwen3_5ForConditionalGeneration`) with 4 distinct modules: - `model.language_model.layers` (32 transformer layers) - `default` (16 loose weights — embeddings, etc.) - `mtp.layers` (1 multi-token prediction layer) - `model.visual.blocks` (27 vision encoder blocks) mergekit required `modules:` syntax for multi-module models. 2. **Hybrid attention:** Qwen3.5 uses mixed `linear_attn` (SSM-style) and `self_attn` (standard) layers. mergekit's layer renumbering logic confused the tensor mapping between these different layer types, producing errors like: ``` RuntimeError: Tensor model.language_model.layers.3.linear_attn.out_proj.weight required but not present in model ``` ### The Custom Merge Script We wrote a custom Python script (`frankenmerge.py`) that: 1. Downloads both models from HuggingFace (using cached snapshots from the mergekit attempt) 2. Loads all safetensor shards from both models into memory 3. Copies all tensors from model A as-is (includes embeddings, visual encoder, MTP, and layers 0-31) 4. Iterates over model B's tensors, identifies language model layer tensors via regex, renumbers them by +32, and adds them to the merged dict 5. Saves the merged tensors as sharded safetensors (5 GB per shard, 7 shards total) 6. Updates `config.json` to reflect 64 layers (doubles the `layer_types` list, updates `num_hidden_layers`) 7. Copies tokenizer and template files from model A Key implementation detail — the layer renumbering regex: ```python def renumber_layer(key: str, offset: int, prefix: str) -> str | None: pattern = rf'^({re.escape(prefix)}\.)(\d+)(\..*)' m = re.match(pattern, key) if m: new_idx = int(m.group(2)) + offset return f"{m.group(1)}{new_idx}{m.group(3)}" return None ``` This handles all tensor naming patterns (linear_attn, self_attn, MLP, norms) uniformly without needing to know the specific layer structure. --- ## 4. Merge Execution ```bash python3 frankenmerge.py ``` ### Output ``` [3/5] Merging tensors... Model A: 775 tensors (all kept) Model B: 424 layer tensors renumbered and added Total merged: 1199 tensors [4/5] Saving to ~/models/Qwopus-GLM-18B-merged... 7 shards, 33.15 GB total [5/5] Updating config... layer_types: 32 -> 64 num_hidden_layers: 64 ``` ### Resource Usage - **RAM:** ~40 GB peak (both models loaded simultaneously) - **Disk:** 33.15 GB output (BF16 safetensors) - **Time:** ~15 minutes (mostly I/O) - **GPU:** Not used for merge --- ## 5. GGUF Conversion & Quantization ### Step 1: Convert safetensors to GGUF (BF16) ```bash python3 llama-cpp-latest/convert_hf_to_gguf.py \ ~/models/Qwopus-GLM-18B-merged \ --outfile ~/models/Qwopus-GLM-18B-merged-f16.gguf \ --outtype bf16 ``` - Output: 30 GB GGUF (851 tensors) - Time: ~40 seconds ### Step 2: Quantize to Q4_K_M ```bash llama-quantize \ ~/models/Qwopus-GLM-18B-merged-f16.gguf \ ~/models/Qwopus-GLM-18B-merged-Q4_K_M.gguf \ Q4_K_M ``` - Output: **9.2 GB** (from 30 GB BF16) - Time: ~44 seconds - Compression ratio: 3.2x ### Serving ```bash llama-server \ -m Qwopus-GLM-18B-merged-Q4_K_M.gguf \ --alias "Qwopus-GLM-18B" \ --chat-template-file ~/.hermes/qwen35-fixed.jinja \ --host 127.0.0.1 --port 8001 \ --ctx-size 65536 --flash-attn on --n-gpu-layers 99 \ --cache-type-k q8_0 --cache-type-v q8_0 \ --batch-size 8192 --ubatch-size 4096 --parallel 1 --mlock \ --threads 20 --threads-batch 20 ``` Note: We used a custom Jinja template (`qwen35-fixed.jinja`) that patches a `|items` bug in the standard Qwen3.5 template when handling string arguments. --- ## 6. Benchmark Suite We developed a 44-test capability suite (`tests/test_qwopus_v35.py`) covering 9 categories: | Category | Tests | What's Tested | |---|---|---| | **Basic** | 6 | Health, instruction following, factual QA, system prompt, clean EOS, determinism | | **Reasoning** | 4 | Math word problems, logic puzzles, code generation, code debugging | | **Tool Calling** | 6 | Single call, optional params, no-call-when-unneeded, tool selection, complex params (arrays/enums), tool response handling | | **Agentic** | 4 | Plan generation, multi-step tool chain (read→diagnose→fix), error recovery, self-correction | | **Structured** | 2 | JSON output, markdown table generation | | **Context** | 3 | Multi-turn memory, 4K-token needle-in-haystack, instruction hierarchy (prompt injection resistance) | | **Multilingual** | 2 | Chinese generation, English→French translation | | **Programming** | 15 | Quicksort, two-sum, balanced parens, longest substring, DP climbing stairs, LRU cache class, binary tree inorder, graph BFS, subtle bug detection, recursive→iterative refactor, JavaScript execution, SQL query writing, pytest generation, complexity analysis, anagram grouping | | **Performance** | 2 | SSE streaming, throughput consistency | ### Key Design Decisions - **All programming tests execute the generated code** against multiple test cases (not just syntactic checks) - **Tool calling tests** check both the OpenAI-structured `tool_calls` API response and XML fallback parsing - **Agentic tests** simulate multi-turn tool workflows with realistic tool responses - Configurable via environment variables: `TEST_MODEL`, `THINKING_BUDGET`, `DISABLE_THINKING` --- ## 7. Results ### Full Scoreboard | Category | Qwopus 9B | GLM-18B Merge | Qwen 3.6-35B MoE | |---|---|---|---| | Basic | 6/6 | 6/6 | 5/6 | | Reasoning | 4/4 | 4/4 | 4/4 | | Tool Calling | 6/6 | 6/6 | 6/6 | | Agentic | 4/4 | 4/4 | 4/4 | | Structured | 2/2 | 2/2 | 2/2 | | Context | 2/3 | 2/3 | 2/3 | | Multilingual | 2/2 | 2/2 | 2/2 | | Programming | **13/15** | 11/15 | 12/15 | | Performance | 2/2 | 2/2 | 1/2 | | **TOTAL** | **41/44 (93.2%)** | **39/44 (88.6%)** | **38/44 (86.4%)** | | Throughput | 126.0 tok/s | 66.6 tok/s | 174.2 tok/s | | GGUF Size | 5.3 GB | 9.2 GB | 22 GB | | Wall Time | 55s | 127s | 442s | ### Analysis **Where the merge excels:** - Perfect tool calling (6/6) — matches the 9B source and Qwen 3.6 - Perfect agentic reasoning (4/4) — correctly diagnoses a timezone parsing bug from file contents and proposes a fix - Highest Chinese output density (138 CJK chars) of any model tested - Zero throughput variance (66.6-66.6 tok/s) — perfectly stable - Beats Qwen 3.6 MoE overall (39/44 vs 38/44) at less than half the VRAM **Where the merge struggles:** - Programming (11/15) — 4 failures: - `longest_substring`: returned no fenced code block - `subtle_bug`: `NameError: name 'remove_evens' is not defined` (function defined with wrong name) - `javascript`: missing closing parenthesis in generated JS - `write_unit_tests`: returned no fenced code block - These are code *formatting* issues, not reasoning issues — the model typically reasons correctly about the problem but garbles the structured output **Root cause of programming regressions:** The layer boundary at position 32 creates a representational discontinuity. Structured output (code blocks, indentation, bracket matching) requires tight token-to-token coordination across layers, which is exactly what breaks at the merge seam. The model's high-level reasoning remains intact because semantic representations are more robust to layer stacking than fine-grained formatting. --- ## 8. Heal Fine-Tune (Post-Merge Training) To address the code formatting regressions, we developed a "heal" fine-tune script (`heal-frankenmerge.py`): ### Configuration ``` Method: QLoRA (4-bit NF4, double quantization) LoRA rank: 64 LoRA alpha: 32 Trainable params: 346M / 9.5B quantized (3.62%) Learning rate: 2e-5 (cosine schedule) Warmup: 50 steps Batch size: 8 (2 per device × 4 gradient accumulation) Max steps: 1000 Max sequence length: 4096 VRAM usage: ~13 GB allocated, ~17.5 GB reserved ``` ### LoRA Target Modules All attention and MLP projections across both linear_attn and self_attn layer types: ```python TARGET_MODULES = [ "q_proj", "k_proj", "v_proj", "o_proj", "in_proj_a", "in_proj_b", "in_proj_z", "in_proj_qkv", "out_proj", "gate_proj", "up_proj", "down_proj", ] ``` ### Training Data Blended from three of Jackrong's datasets: - **Jackrong/Qwen3.5-reasoning-700x** (70% weight) — math, code, science, instruction following - **Jackrong/Competitive-Programming-python-blend** (15% weight) — code-heavy to address formatting - **Jackrong/MultiReason-ChatAlpaca** (15% weight) — multi-turn instruction following Total: ~1383 samples after filtering, trained for 6 epochs (1000 steps). ### Technical Challenges Solved 1. **Multimodal processor:** Qwen3.5's `from_pretrained()` returns a `Qwen3VLProcessor` (multimodal), not a text tokenizer. The processor tried to parse `<|im_start|>` tokens as image data. Fix: extract `tokenizer.tokenizer` for text-only training. 2. **Pickle errors:** TRL's `SFTTrainer` tried to serialize `ConfigModuleInstance` objects from the Unsloth-patched model for multiprocessing. Fix: switched to vanilla HuggingFace `Trainer` with pre-tokenized data and `dataloader_num_workers=0`. 3. **mergekit incompatibility:** mergekit's architecture detection didn't handle Qwen3.5's hybrid linear/full attention types during layer renumbering. Fix: wrote custom merge script that operates on raw tensor names without needing architecture awareness. ### Training Results The heal fine-tune ran for approximately 14 hours on the RTX 5090. **Loss curve:** ``` Step 10: 1.0175 ← initial loss (high — layer boundary causing confusion) Step 50: 0.8758 ← sharp early drop as boundary begins healing Step 100: 0.7215 Step 250: 0.6700 ← checkpoint 1 Step 500: 0.6154 ← checkpoint 2 (loss stabilizing) Step 750: 0.6435 ← checkpoint 3 (cosine schedule tapering) Step 1000: 0.6396 ← final (39% total reduction from start) ``` The sharp drop in the first 100 steps confirms the layer boundary was a real source of error — the model rapidly learned to produce coherent outputs across the seam. The remaining 900 steps refined overall quality with diminishing but meaningful returns. ### Post-Heal Benchmark Results | Category | Raw Merge | **Healed Merge** | Delta | |---|---|---|---| | Basic | 6/6 | 6/6 | — | | Reasoning | 4/4 | 4/4 | — | | Tool Calling | 6/6 | 6/6 | — | | Agentic | 4/4 | 4/4 | — | | Structured | 2/2 | 2/2 | — | | Context | 2/3 | 2/3 | — | | Multilingual | 2/2 | 2/2 | — | | Programming | 11/15 | **12/15** | **+1** | | Performance | 2/2 | 2/2 | — | | **TOTAL** | **39/44 (88.6%)** | **40/44 (90.9%)** | **+1 test** | The `longest_substring` test was recovered — the model now produces a clean fenced Python code block that passes all 8 sliding-window test cases. Three programming tests remain failing (function naming issue, missing JS paren, no pytest code block). ### Frontend Code Generation — The Real Proof While the benchmark suite showed a modest +1 improvement, the real transformation became apparent when we stress-tested HTML/CSS/JS generation — the exact category of structured output that was garbled before healing. We ran 6 increasingly complex frontend tasks: | Test | Description | Checks | Score | Output | |---|---|---|---|---| | Weather Dashboard | Responsive layout, CSS vars, dark mode, 5-day forecast grid | 9 | **9/9** | 14.5K chars | | E-Commerce Product Page | Image gallery, color swatches, quantity +/-, tabbed content, sticky mobile bar | 12 | **12/12** | 16.7K chars | | Animated SaaS Landing | Moving CSS gradient, typing animation, IntersectionObserver scroll reveals, auto-rotating testimonial carousel, 3 pricing tiers, scroll-based navbar | 13 | **13/13** | 24.1K chars | | Analytics Dashboard | SVG bar chart with hover tooltips, SVG donut chart, sortable data table with JS sort, collapsible sidebar, dark theme, CSS Grid layout | 13 | **13/13** | 22.3K chars | | Multi-Step Registration | 3-step form wizard, real-time inline validation, password strength meter (weak/medium/strong), all 50 US states dropdown, animated step transitions, success modal | 12 | **12/12** | 23.3K chars | | Snake Game | Canvas rendering, requestAnimationFrame game loop, arrow key controls, collision detection, localStorage high score, increasing difficulty | 12 | **11/12** | 11.2K chars | **Total: 62/63 checks passed (98.4%)** Critical structural integrity metrics across all 6 outputs: - **CSS braces: perfectly balanced in every file** (0 imbalance) - **JS parentheses: perfectly balanced in every file** (0 imbalance) - **Zero garbled or hallucinated text** in any output - **5 of 6 files end with proper ``** (Snake game had `html>` — minor typo) The sophistication of the generated code is notable for an 18B frankenmerge: - `IntersectionObserver` for scroll-triggered animations - `requestAnimationFrame` with delta-time game loops - SVG chart generation with computed coordinates - Real-time password strength calculation with regex - CSS `@keyframes` animations (3+ per file) - Proper responsive design with `@media` breakpoints **Before healing:** The model would produce garbled code blocks, missing brackets, hallucinated syntax, and incomplete HTML structures. **After healing:** Production-quality frontend code with perfect structural integrity across outputs up to 24K characters. All 6 HTML samples are included in the `samples/` directory of the repository. ### Why the Heal Works The heal fine-tune addresses the core problem: at layer 32, the model transitions from weights trained on Opus-style reasoning to weights trained on GLM-style reasoning. Without additional training, the internal representations at this boundary are discontinuous — the output from layer 31 is not what layers 32+ expect as input. By training with QLoRA across all attention and MLP projections, we allow the model to: 1. **Adapt the boundary layers** (28-35) to bridge the representational gap 2. **Learn to route information** coherently through the full 64-layer stack 3. **Restore structured output capability** by re-establishing the tight token-to-token coordination that code generation requires The 39% loss reduction (1.02 → 0.62) in just 1000 steps confirms that the boundary was a significant source of prediction error that the model could quickly learn to compensate for. --- ## 9. Lessons Learned ### What Worked 1. **Frankenmerging two differently-distilled models produces a viable model.** The 18B merge beat Qwen 3.6 MoE on our benchmark at less than half the VRAM — a surprising and encouraging result. 2. **The heal fine-tune dramatically improved structured output.** Before healing: garbled code blocks, missing brackets, hallucinated syntax. After healing: 62/63 frontend stress test checks passed with perfectly balanced CSS/JS across outputs up to 24K characters. 1000 steps of QLoRA was enough. 3. **Tool calling and agentic reasoning are robust to layer stacking.** Perfect 6/6 and 4/4 scores in both raw and healed versions — these capabilities rely on high-level semantic representations that survive the merge even without healing. 4. **Custom merge scripts are more reliable than mergekit for novel architectures.** The regex-based tensor renumbering approach is architecture-agnostic and handles hybrid attention types that mergekit doesn't support. 5. **A comprehensive test suite is essential.** Without executable programming tests and frontend stress tests, we would have missed the core regression and wouldn't have been able to measure the heal's effectiveness. 6. **Loss curve tells the story.** The sharp initial drop (1.02 → 0.72 in 100 steps) confirmed the layer boundary was a real source of error, not just noise. This gives us confidence the heal is addressing the root cause. ### What Didn't Work (Initially) 1. **Code formatting degrades significantly at merge boundaries.** Structured output requiring precise token sequences is the weakest point of a naive frankenmerge — but the heal fine-tune largely fixed this. 2. **mergekit doesn't support Qwen3.5's multimodal + hybrid attention architecture** as of v0.1.4. Required a custom solution. 3. **The 9B source model outscored the raw 18B merge (41/44 vs 39/44) on short benchmarks.** On short, single-turn test prompts, a well-tuned smaller model can beat a larger but unrefined merge. After healing, the gap narrowed to 41 vs 40 — and the 18B excels on longer, more complex outputs where the extra depth pays off (see frontend stress tests). 4. **Three programming tests still fail after healing.** Function naming issues and missing brackets persist in some code generation tasks, suggesting QLoRA with 1383 samples isn't enough to fully resolve all formatting edge cases. ### What We'd Do Differently 1. **More code-heavy training data** — 750 competitive programming samples helped but wasn't enough to fix every code formatting edge case. A larger code-focused dataset would likely close the remaining gap. 2. **Test with longer, multi-turn conversations** — our 44-test suite uses short prompts. The frontend stress tests revealed the 18B's real strength: long, complex, structured outputs. A suite designed for those would show the merge's advantages more clearly. 3. **Try interleaved layer stacking** (A[0], B[0], A[1], B[1], ...) instead of sequential — this distributes the merge boundary across all layers instead of concentrating it at one point, potentially reducing the need for healing. 4. **Consider a full fine-tune** on a multi-GPU setup instead of QLoRA — training 100% of parameters instead of 2% would give maximum healing capacity, though the QLoRA results are already quite good. --- ## 10. Reproducing This Work ### Prerequisites ```bash pip install torch safetensors huggingface_hub datasets # For quantization: llama.cpp with CUDA support # For heal training: pip install unsloth bitsandbytes trl peft accelerate ``` ### Step 1: Merge ```bash python3 frankenmerge.py # Outputs: ~/models/Qwopus-GLM-18B-merged/ (33 GB safetensors) ``` ### Step 2: Convert to GGUF ```bash python3 llama-cpp-latest/convert_hf_to_gguf.py \ ~/models/Qwopus-GLM-18B-merged \ --outfile ~/models/Qwopus-GLM-18B-merged-f16.gguf \ --outtype bf16 ``` ### Step 3: Quantize ```bash llama-quantize \ ~/models/Qwopus-GLM-18B-merged-f16.gguf \ ~/models/Qwopus-GLM-18B-merged-Q4_K_M.gguf \ Q4_K_M ``` ### Step 4: Benchmark ```bash # Start server llama-server -m ~/models/Qwopus-GLM-18B-merged-Q4_K_M.gguf \ --alias "Qwopus-GLM-18B" --host 127.0.0.1 --port 8001 \ --ctx-size 65536 --flash-attn on --n-gpu-layers 99 --jinja # Run suite TEST_MODEL="Qwopus-GLM-18B" python3 tests/test_qwopus_v35.py ``` ### Step 5: Heal Fine-Tune (recommended) ```bash # Dry run first to verify everything loads python3 heal-frankenmerge.py --dry-run # Full heal — ~14 hours on RTX 5090, checkpoints every 250 steps python3 heal-frankenmerge.py --max-steps 1000 --num-samples 5000 # Outputs: ~/models/Qwopus-GLM-18B-healed/ (merged 16-bit safetensors) ``` ### Step 6: Convert & Quantize Healed Model ```bash python3 llama-cpp-latest/convert_hf_to_gguf.py \ ~/models/Qwopus-GLM-18B-healed \ --outfile ~/models/Qwopus-GLM-18B-healed-f16.gguf \ --outtype bf16 llama-quantize \ ~/models/Qwopus-GLM-18B-healed-f16.gguf \ ~/models/Qwopus-GLM-18B-healed-Q4_K_M.gguf \ Q4_K_M ``` ### Step 7: Benchmark Healed Model ```bash llama-server -m ~/models/Qwopus-GLM-18B-healed-Q4_K_M.gguf \ --alias "Qwopus-GLM-18B-healed" --host 127.0.0.1 --port 8001 \ --ctx-size 65536 --flash-attn on --n-gpu-layers 99 --jinja TEST_MODEL="Qwopus-GLM-18B-healed" python3 tests/test_qwopus_v35.py ``` --- ## File Index | File | Purpose | |---|---| | `frankenmerge.py` | Custom merge script (passthrough layer stacking) | | `heal-frankenmerge.py` | QLoRA heal fine-tune script | | `tests/test_qwopus_v35.py` | 44-test benchmark suite | | `merge-config.yaml` | mergekit config (didn't work, kept for reference) | | `qwen35-fixed.jinja` | Patched Qwen3.5 chat template | --- *This document was created as part of an experimental model merging project. The work is exploratory and the merged model has known limitations. For questions or collaboration, reach out on X: [@KyleHessling1](https://x.com/KyleHessling1)*