Update merge docs with full heal results, loss curve, frontend stress tests, and updated lessons learned
Browse files- MERGE_PROCESS.md +121 -14
MERGE_PROCESS.md
CHANGED
|
@@ -313,8 +313,85 @@ Total: ~1383 samples after filtering, trained for 6 epochs (1000 steps).
|
|
| 313 |
|
| 314 |
3. **mergekit incompatibility:** mergekit's architecture detection didn't handle Qwen3.5's hybrid linear/full attention types during layer renumbering. Fix: wrote custom merge script that operates on raw tensor names without needing architecture awareness.
|
| 315 |
|
| 316 |
-
###
|
| 317 |
-
|
| 318 |
|
| 319 |
---
|
| 320 |
|
|
@@ -322,20 +399,23 @@ The heal training allows gradients to flow across the layer-32 boundary, teachin
|
|
| 322 |
|
| 323 |
### What Worked
|
| 324 |
1. **Frankenmerging two differently-distilled models produces a viable model.** The 18B merge beat Qwen 3.6 MoE on our benchmark at less than half the VRAM, a surprising and encouraging result.
|
| 325 |
-
2. **
|
| 326 |
-
3. **
|
| 327 |
-
4. **
|
| 328 |
-
|
| 329 |
-
|
| 330 |
-
|
| 331 |
2. **mergekit doesn't support Qwen3.5's multimodal + hybrid attention architecture** as of v0.1.4. Required a custom solution.
|
| 332 |
-
3. **The 9B source model
|
| 333 |
|
| 334 |
### What We'd Do Differently
|
| 335 |
-
1. **More code-heavy training data**
|
| 336 |
-
2. **Test with longer, multi-turn conversations**
|
| 337 |
-
3. **Try interleaved layer stacking** (A[0], B[0], A[1], B[1], ...) instead of sequential; this distributes the merge boundary across all layers instead of concentrating it at one point
|
| 338 |
-
4. **Consider a full fine-tune** on a multi-GPU setup instead of QLoRA
|
| 339 |
|
| 340 |
---
|
| 341 |
|
|
@@ -381,9 +461,36 @@ llama-server -m ~/models/Qwopus-GLM-18B-merged-Q4_K_M.gguf \
|
|
| 381 |
TEST_MODEL="Qwopus-GLM-18B" python3 tests/test_qwopus_v35.py
|
| 382 |
```
|
| 383 |
|
| 384 |
-
### Step 5: Heal Fine-Tune (
|
| 385 |
```bash
|
| 386 |
python3 heal-frankenmerge.py --max-steps 1000 --num-samples 5000
|
| 387 |
```
|
| 388 |
|
| 389 |
---
|
|
|
|
| 313 |
|
| 314 |
3. **mergekit incompatibility:** mergekit's architecture detection didn't handle Qwen3.5's hybrid linear/full attention types during layer renumbering. Fix: wrote custom merge script that operates on raw tensor names without needing architecture awareness.
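The idea of renumbering layers directly on raw tensor names can be sketched as follows. This is only an illustration, not the actual merge script: the `layers.<n>.` naming pattern, the helper names `renumber` and `stack`, and the choice to keep non-layer tensors from model A are all assumptions made for the example.

```python
import re

# Assumed naming pattern: "...layers.<index>.<rest>" (hypothetical, for illustration).
LAYER_RE = re.compile(r"(\blayers\.)(\d+)(\.)")


def renumber(name: str, offset: int) -> str:
    """Shift the layer index in a tensor name by `offset`; other names pass through."""
    return LAYER_RE.sub(
        lambda m: f"{m.group(1)}{int(m.group(2)) + offset}{m.group(3)}",
        name,
        count=1,
    )


def stack(model_a: dict, model_b: dict, layers_in_a: int) -> dict:
    """Stack B's layers after A's. Non-layer tensors are kept from A for simplicity."""
    merged = dict(model_a)
    for name, tensor in model_b.items():
        if LAYER_RE.search(name):
            merged[renumber(name, layers_in_a)] = tensor
    return merged
```

Because the function only pattern-matches on names, it never needs to know whether a given layer uses linear or full attention, which is the property the fix above relies on.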
|
| 315 |
|
| 316 |
+
### Training Results
|
| 317 |
+
|
| 318 |
+
The heal fine-tune ran for approximately 14 hours on the RTX 5090.
|
| 319 |
+
|
| 320 |
+
**Loss curve:**
|
| 321 |
+
```
|
| 322 |
+
Step 10: 1.0175 ← initial loss (high: layer boundary causing confusion)
|
| 323 |
+
Step 50: 0.8758 ← sharp early drop as boundary begins healing
|
| 324 |
+
Step 100: 0.7215
|
| 325 |
+
Step 250: 0.6700 ← checkpoint 1
|
| 326 |
+
Step 500: 0.6154 ← checkpoint 2 (loss stabilizing)
|
| 327 |
+
Step 750: 0.6435 ← checkpoint 3 (cosine schedule tapering)
|
| 328 |
+
Step 1000: 0.6396 ← final (about 37% total reduction from step 10)
|
| 329 |
+
```
|
| 330 |
+
|
| 331 |
+
The sharp drop in the first 100 steps confirms the layer boundary was a real source of error: the model rapidly learned to produce coherent outputs across the seam. The remaining 900 steps refined overall quality with diminishing but meaningful returns.
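As a quick check on the curve, the relative reduction at each logged step can be computed directly from the values listed above. This is a small sketch: the loss values are copied from the log, and the helper name `relative_reduction` is made up for the example.

```python
# Loss values copied from the logged curve above (step -> training loss).
loss_by_step = {
    10: 1.0175,
    50: 0.8758,
    100: 0.7215,
    250: 0.6700,
    500: 0.6154,
    750: 0.6435,
    1000: 0.6396,
}


def relative_reduction(start: float, end: float) -> float:
    """Percentage drop from `start` to `end`."""
    return (start - end) / start * 100.0


first = loss_by_step[10]
for step, loss in loss_by_step.items():
    pct = relative_reduction(first, loss)
    print(f"step {step:>4}: loss {loss:.4f}, reduction {pct:5.1f}%")
```

Note that the lowest logged loss is at step 500 rather than at the final step, so the figure depends on which checkpoint is compared against step 10.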
|
| 332 |
+
|
| 333 |
+
### Post-Heal Benchmark Results
|
| 334 |
+
|
| 335 |
+
| Category | Raw Merge | **Healed Merge** | Delta |
|
| 336 |
+
|---|---|---|---|
|
| 337 |
+
| Basic | 6/6 | 6/6 | 0 |
|
| 338 |
+
| Reasoning | 4/4 | 4/4 | 0 |
|
| 339 |
+
| Tool Calling | 6/6 | 6/6 | 0 |
|
| 340 |
+
| Agentic | 4/4 | 4/4 | 0 |
|
| 341 |
+
| Structured | 2/2 | 2/2 | 0 |
|
| 342 |
+
| Context | 2/3 | 2/3 | 0 |
|
| 343 |
+
| Multilingual | 2/2 | 2/2 | 0 |
|
| 344 |
+
| Programming | 11/15 | **12/15** | **+1** |
|
| 345 |
+
| Performance | 2/2 | 2/2 | 0 |
|
| 346 |
+
| **TOTAL** | **39/44 (88.6%)** | **40/44 (90.9%)** | **+1 test** |
|
| 347 |
+
|
| 348 |
+
The `longest_substring` test was recovered: the model now produces a clean fenced Python code block that passes all 8 sliding-window test cases. Three programming tests still fail (a function naming issue, a missing JS parenthesis, and a missing pytest code block).
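The totals in the table can be re-derived from the per-category rows. The sketch below copies the scores from the table; the dictionary layout and variable names are just one convenient way to hold them.

```python
# (raw_passed, healed_passed, total_tests) per category, copied from the table above.
results = {
    "Basic": (6, 6, 6),
    "Reasoning": (4, 4, 4),
    "Tool Calling": (6, 6, 6),
    "Agentic": (4, 4, 4),
    "Structured": (2, 2, 2),
    "Context": (2, 2, 3),
    "Multilingual": (2, 2, 2),
    "Programming": (11, 12, 15),
    "Performance": (2, 2, 2),
}

raw_total = sum(raw for raw, _, _ in results.values())
healed_total = sum(healed for _, healed, _ in results.values())
test_count = sum(total for _, _, total in results.values())

print(f"raw:    {raw_total}/{test_count} ({100 * raw_total / test_count:.1f}%)")
print(f"healed: {healed_total}/{test_count} ({100 * healed_total / test_count:.1f}%)")

# Categories where healing changed the score.
changed = {name: h - r for name, (r, h, _) in results.items() if h != r}
print("changed:", changed)
```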
|
| 349 |
+
|
| 350 |
+
### Frontend Code Generation: The Real Proof
|
| 351 |
+
|
| 352 |
+
While the benchmark suite showed a modest +1 improvement, the real transformation became apparent when we stress-tested HTML/CSS/JS generation, the exact category of structured output that was garbled before healing.
|
| 353 |
+
|
| 354 |
+
We ran 6 increasingly complex frontend tasks:
|
| 355 |
+
|
| 356 |
+
| Test | Description | Checks | Score | Output |
|
| 357 |
+
|---|---|---|---|---|
|
| 358 |
+
| Weather Dashboard | Responsive layout, CSS vars, dark mode, 5-day forecast grid | 9 | **9/9** | 14.5K chars |
|
| 359 |
+
| E-Commerce Product Page | Image gallery, color swatches, quantity +/-, tabbed content, sticky mobile bar | 12 | **12/12** | 16.7K chars |
|
| 360 |
+
| Animated SaaS Landing | Moving CSS gradient, typing animation, IntersectionObserver scroll reveals, auto-rotating testimonial carousel, 3 pricing tiers, scroll-based navbar | 13 | **13/13** | 24.1K chars |
|
| 361 |
+
| Analytics Dashboard | SVG bar chart with hover tooltips, SVG donut chart, sortable data table with JS sort, collapsible sidebar, dark theme, CSS Grid layout | 13 | **13/13** | 22.3K chars |
|
| 362 |
+
| Multi-Step Registration | 3-step form wizard, real-time inline validation, password strength meter (weak/medium/strong), all 50 US states dropdown, animated step transitions, success modal | 12 | **12/12** | 23.3K chars |
|
| 363 |
+
| Snake Game | Canvas rendering, requestAnimationFrame game loop, arrow key controls, collision detection, localStorage high score, increasing difficulty | 12 | **11/12** | 11.2K chars |
|
| 364 |
+
|
| 365 |
+
**Total: 62/63 checks passed (98.4%)**
|
| 366 |
+
|
| 367 |
+
Critical structural integrity metrics across all 6 outputs:
|
| 368 |
+
- **CSS braces: perfectly balanced in every file** (0 imbalance)
|
| 369 |
+
- **JS parentheses: perfectly balanced in every file** (0 imbalance)
|
| 370 |
+
- **Zero garbled or hallucinated text** in any output
|
| 371 |
+
- **5 of 6 files end with proper `</html>`** (the Snake game ended with `html>`, a minor typo)
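A structural check of this kind can be approximated with simple character counts. The sketch below is an assumption about how such a check might be written, not the script used for these results; `imbalance` and `structural_report` are hypothetical names, and counting characters ignores nesting order as well as braces inside strings or comments.

```python
def imbalance(text: str, open_ch: str, close_ch: str) -> int:
    """Return count(open) - count(close); 0 means the pair is balanced by count."""
    return text.count(open_ch) - text.count(close_ch)


def structural_report(html: str) -> dict:
    """Summarize count-based balance and whether the file closes with </html>."""
    return {
        "braces": imbalance(html, "{", "}"),
        "parens": imbalance(html, "(", ")"),
        "brackets": imbalance(html, "[", "]"),
        "closes_html": html.rstrip().endswith("</html>"),
    }


sample = "<html><style>a{color:red}</style><script>f([1])</script></html>\n"
print(structural_report(sample))
```

A stricter version would tokenize the CSS and JS so that brackets inside string literals are not counted, at the cost of a much longer script.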
|
| 372 |
+
|
| 373 |
+
The sophistication of the generated code is notable for an 18B frankenmerge:
|
| 374 |
+
- `IntersectionObserver` for scroll-triggered animations
|
| 375 |
+
- `requestAnimationFrame` with delta-time game loops
|
| 376 |
+
- SVG chart generation with computed coordinates
|
| 377 |
+
- Real-time password strength calculation with regex
|
| 378 |
+
- CSS `@keyframes` animations (3+ per file)
|
| 379 |
+
- Proper responsive design with `@media` breakpoints
|
| 380 |
+
|
| 381 |
+
**Before healing:** The model would produce garbled code blocks, missing brackets, hallucinated syntax, and incomplete HTML structures. **After healing:** Structurally sound frontend code, with balanced braces and parentheses in every sample, across outputs up to 24K characters.
|
| 382 |
+
|
| 383 |
+
All 6 HTML samples are included in the `samples/` directory of the repository.
|
| 384 |
+
|
| 385 |
+
### Why the Heal Works
|
| 386 |
+
|
| 387 |
+
The heal fine-tune addresses the core problem: at layer 32, the model transitions from weights trained on Opus-style reasoning to weights trained on GLM-style reasoning. Without additional training, the internal representations at this boundary are discontinuous: the output from layer 31 is not what layers 32+ expect as input.
|
| 388 |
+
|
| 389 |
+
By training with QLoRA across all attention and MLP projections, we allow the model to:
|
| 390 |
+
1. **Adapt the boundary layers** (28-35) to bridge the representational gap
|
| 391 |
+
2. **Learn to route information** coherently through the full 64-layer stack
|
| 392 |
+
3. **Restore structured output capability** by re-establishing the tight token-to-token coordination that code generation requires
|
| 393 |
+
|
| 394 |
+
The roughly 37% loss reduction (1.02 → 0.64) in just 1000 steps, most of it within the first 100, indicates that the boundary was a significant source of prediction error that the model could quickly learn to compensate for.
|
| 395 |
|
| 396 |
---
|
| 397 |
|
|
|
|
| 399 |
|
| 400 |
### What Worked
|
| 401 |
1. **Frankenmerging two differently-distilled models produces a viable model.** The 18B merge beat Qwen 3.6 MoE on our benchmark at less than half the VRAM, a surprising and encouraging result.
|
| 402 |
+
2. **The heal fine-tune dramatically improved structured output.** Before healing: garbled code blocks, missing brackets, hallucinated syntax. After healing: 62/63 frontend stress test checks passed with perfectly balanced CSS/JS across outputs up to 24K characters. 1000 steps of QLoRA was enough.
|
| 403 |
+
3. **Tool calling and agentic reasoning are robust to layer stacking.** Both scored a perfect 6/6 and 4/4 in the raw and healed versions, which suggests these capabilities rely on high-level semantic representations that survive the merge even without healing.
|
| 404 |
+
4. **Custom merge scripts are more reliable than mergekit for novel architectures.** The regex-based tensor renumbering approach is architecture-agnostic and handles hybrid attention types that mergekit doesn't support.
|
| 405 |
+
5. **A comprehensive test suite is essential.** Without executable programming tests and frontend stress tests, we would have missed the core regression and wouldn't have been able to measure the heal's effectiveness.
|
| 406 |
+
6. **The loss curve tells the story.** The sharp initial drop (1.02 → 0.72 in 100 steps) confirmed the layer boundary was a real source of error, not just noise. This gives us confidence the heal is addressing the root cause.
|
| 407 |
+
|
| 408 |
+
### What Didn't Work (Initially)
|
| 409 |
+
1. **Code formatting degrades significantly at merge boundaries.** Structured output requiring precise token sequences is the weakest point of a naive frankenmerge, but the heal fine-tune largely fixed this.
|
| 410 |
2. **mergekit doesn't support Qwen3.5's multimodal + hybrid attention architecture** as of v0.1.4. Required a custom solution.
|
| 411 |
+
3. **The 9B source model outscored the raw 18B merge (41/44 vs 39/44) on short benchmarks.** On short, single-turn test prompts, a well-tuned smaller model can beat a larger but unrefined merge. After healing, the gap narrowed to 41 vs 40, and the 18B excels on longer, more complex outputs where the extra depth pays off (see frontend stress tests).
|
| 412 |
+
4. **Three programming tests still fail after healing.** Function naming issues and missing brackets persist in some code generation tasks, suggesting QLoRA with 1383 samples isn't enough to fully resolve all formatting edge cases.
|
| 413 |
|
| 414 |
### What We'd Do Differently
|
| 415 |
+
1. **More code-heavy training data:** 750 competitive programming samples helped but weren't enough to fix every code formatting edge case. A larger code-focused dataset would likely close the remaining gap.
|
| 416 |
+
2. **Test with longer, multi-turn conversations:** our 44-test suite uses short prompts. The frontend stress tests revealed the 18B's real strength: long, complex, structured outputs. A suite designed for those would show the merge's advantages more clearly.
|
| 417 |
+
3. **Try interleaved layer stacking** (A[0], B[0], A[1], B[1], ...) instead of sequential; this distributes the merge boundary across all layers instead of concentrating it at one point, potentially reducing the need for healing.
|
| 418 |
+
4. **Consider a full fine-tune** on a multi-GPU setup instead of QLoRA: training 100% of parameters instead of 2% would give maximum healing capacity, though the QLoRA results are already quite good.
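To make the difference between sequential and interleaved stacking (item 3) concrete, here is a small sketch that builds both layer orders and counts the places where adjacent layers come from different source models. It assumes 32 layers per source model, which is consistent with the 64-layer stack and layer-32 boundary described earlier; the function names are made up for the example.

```python
def sequential(n_a: int, n_b: int) -> list:
    """All of A's layers followed by all of B's layers."""
    return [("A", i) for i in range(n_a)] + [("B", i) for i in range(n_b)]


def interleaved(n_a: int, n_b: int) -> list:
    """A[0], B[0], A[1], B[1], ... until both sources are exhausted."""
    order = []
    for i in range(max(n_a, n_b)):
        if i < n_a:
            order.append(("A", i))
        if i < n_b:
            order.append(("B", i))
    return order


def seams(order: list) -> int:
    """Number of adjacent layer pairs that come from different source models."""
    return sum(1 for prev, cur in zip(order, order[1:]) if prev[0] != cur[0])


print("sequential seams: ", seams(sequential(32, 32)))
print("interleaved seams:", seams(interleaved(32, 32)))
```

Interleaving replaces one concentrated seam with a seam at every layer transition, so whether it actually reduces the need for healing would have to be measured rather than assumed.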
|
| 419 |
|
| 420 |
---
|
| 421 |
|
|
|
|
| 461 |
TEST_MODEL="Qwopus-GLM-18B" python3 tests/test_qwopus_v35.py
|
| 462 |
```
|
| 463 |
|
| 464 |
+
### Step 5: Heal Fine-Tune (recommended)
|
| 465 |
```bash
|
| 466 |
+
# Dry run first to verify everything loads
|
| 467 |
+
python3 heal-frankenmerge.py --dry-run
|
| 468 |
+
|
| 469 |
+
# Full heal: ~14 hours on RTX 5090, checkpoints every 250 steps
|
| 470 |
python3 heal-frankenmerge.py --max-steps 1000 --num-samples 5000
|
| 471 |
+
# Outputs: ~/models/Qwopus-GLM-18B-healed/ (merged 16-bit safetensors)
|
| 472 |
+
```
|
| 473 |
+
|
| 474 |
+
### Step 6: Convert & Quantize Healed Model
|
| 475 |
+
```bash
|
| 476 |
+
python3 llama-cpp-latest/convert_hf_to_gguf.py \
|
| 477 |
+
~/models/Qwopus-GLM-18B-healed \
|
| 478 |
+
--outfile ~/models/Qwopus-GLM-18B-healed-f16.gguf \
|
| 479 |
+
--outtype bf16
|
| 480 |
+
|
| 481 |
+
llama-quantize \
|
| 482 |
+
~/models/Qwopus-GLM-18B-healed-f16.gguf \
|
| 483 |
+
~/models/Qwopus-GLM-18B-healed-Q4_K_M.gguf \
|
| 484 |
+
Q4_K_M
|
| 485 |
+
```
|
| 486 |
+
|
| 487 |
+
### Step 7: Benchmark Healed Model
|
| 488 |
+
```bash
|
| 489 |
+
llama-server -m ~/models/Qwopus-GLM-18B-healed-Q4_K_M.gguf \
|
| 490 |
+
--alias "Qwopus-GLM-18B-healed" --host 127.0.0.1 --port 8001 \
|
| 491 |
+
--ctx-size 65536 --flash-attn on --n-gpu-layers 99 --jinja
|
| 492 |
+
|
| 493 |
+
TEST_MODEL="Qwopus-GLM-18B-healed" python3 tests/test_qwopus_v35.py
|
| 494 |
```
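Before running the full suite, a single request against the server started above can serve as a smoke test. The sketch below uses only the Python standard library and targets the OpenAI-compatible `/v1/chat/completions` route; the alias and port are taken from the `llama-server` command, while the helper names `build_chat_request` and `ask` and the `max_tokens` value are illustrative assumptions.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    model: str = "Qwopus-GLM-18B-healed",
    base_url: str = "http://127.0.0.1:8001/v1",
) -> urllib.request.Request:
    """Build (but do not send) a chat completion request for the local server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,  # illustrative cap, not a value from the test suite
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=300) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, calling `ask("Write a Python function that reverses a string.")` should return the model's reply as a string.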
|
| 495 |
|
| 496 |
---
|