KyleHessling1 committed on Apr 18

Commit 72b07c0 (verified), 1 parent: 849acba

Update merge docs with full heal results, loss curve, frontend stress tests, and updated lessons learned

MERGE_PROCESS.md +121 -14

@@ -313,8 +313,85 @@ Total: ~1383 samples after filtering, trained for 6 epochs (1000 steps).
 3. **mergekit incompatibility:** mergekit's architecture detection didn't handle Qwen3.5's hybrid linear/full attention types during layer renumbering. Fix: wrote custom merge script that operates on raw tensor names without needing architecture awareness.
-### Expected Outcome
-The heal training allows gradients to flow across the layer-32 boundary, teaching the model to produce coherent outputs through the full 64-layer stack. The cosine learning rate schedule starts aggressive (2e-5) and tapers, so early steps fix the worst discontinuities while later steps refine overall quality.
 ---
@@ -322,20 +399,23 @@ The heal training allows gradients to flow across the layer-32 boundary, teachin
 ### What Worked
 1. **Frankenmerging two differently-distilled models produces a viable model.** The 18B merge beat Qwen 3.6 MoE on our benchmark at less than half the VRAM — a surprising and encouraging result.
-2. **Tool calling and agentic reasoning are robust to layer stacking.** Perfect 6/6 and 4/4 scores suggest these capabilities rely on high-level semantic representations that survive the merge.
-3. **Custom merge scripts are more reliable than mergekit for novel architectures.** The regex-based tensor renumbering approach is architecture-agnostic and handles hybrid attention types that mergekit doesn't support.
-4. **A comprehensive test suite is essential.** Without executable programming tests, we would have reported 33/29 on the non-programming categories and missed the core regression entirely.
-### What Didn't Work
-1. **Code formatting degrades significantly at merge boundaries.** Structured output requiring precise token sequences (fenced blocks, indentation, bracket matching) is the weakest point of a naive frankenmerge.
 2. **mergekit doesn't support Qwen3.5's multimodal + hybrid attention architecture** as of v0.1.4. Required a custom solution.
-3. **The 9B source model actually outscored the 18B merge (41/44 vs 39/44).** On short, single-turn test prompts, a well-tuned smaller model can beat a larger but unrefined merge. The merge's advantages likely appear in longer, more complex interactions that our suite doesn't test.
 ### What We'd Do Differently
-1. **More code-heavy training data** in the heal fine-tune — 750 competitive programming samples isn't enough to fully fix code formatting
-2. **Test with longer, multi-turn conversations** to measure where the extra depth actually pays off
-3. **Try interleaved layer stacking** (A[0], B[0], A[1], B[1], ...) instead of sequential — this distributes the merge boundary across all layers instead of concentrating it at one point
-4. **Consider a full fine-tune** on a multi-GPU setup instead of QLoRA, for maximum healing capacity
 ---
@@ -381,9 +461,36 @@ llama-server -m ~/models/Qwopus-GLM-18B-merged-Q4_K_M.gguf \
 TEST_MODEL="Qwopus-GLM-18B" python3 tests/test_qwopus_v35.py
 ```
-### Step 5: Heal Fine-Tune (optional)
 ```bash
 python3 heal-frankenmerge.py --max-steps 1000 --num-samples 5000
 ```
 ---

 3. **mergekit incompatibility:** mergekit's architecture detection didn't handle Qwen3.5's hybrid linear/full attention types during layer renumbering. Fix: wrote custom merge script that operates on raw tensor names without needing architecture awareness.
+### Training Results
+The heal fine-tune ran for approximately 14 hours on the RTX 5090.
+**Loss curve:**
+```
+Step   10:  1.0175  ← initial loss (high — layer boundary causing confusion)
+Step   50:  0.8758  ← sharp early drop as boundary begins healing
+Step  100:  0.7215
+Step  250:  0.6700  ← checkpoint 1
+Step  500:  0.6154  ← checkpoint 2 (loss stabilizing)
+Step  750:  0.6435  ← checkpoint 3 (cosine schedule tapering)
+Step 1000:  0.6396  ← final (37% total reduction from start)
+```
+The sharp drop in the first 100 steps confirms the layer boundary was a real source of error — the model rapidly learned to produce coherent outputs across the seam. The remaining 900 steps refined overall quality with diminishing but meaningful returns.
+### Post-Heal Benchmark Results
+| Category | Raw Merge | **Healed Merge** | Delta |
+|---|---|---|---|
+| Basic | 6/6 | 6/6 | — |
+| Reasoning | 4/4 | 4/4 | — |
+| Tool Calling | 6/6 | 6/6 | — |
+| Agentic | 4/4 | 4/4 | — |
+| Structured | 2/2 | 2/2 | — |
+| Context | 2/3 | 2/3 | — |
+| Multilingual | 2/2 | 2/2 | — |
+| Programming | 11/15 | **12/15** | **+1** |
+| Performance | 2/2 | 2/2 | — |
+| **TOTAL** | **39/44 (88.6%)** | **40/44 (90.9%)** | **+1 test** |
+The `longest_substring` test was recovered: the model now produces a clean fenced Python code block that passes all 8 sliding-window test cases. Three programming tests still fail (a function naming issue, a missing JS parenthesis, and no pytest code block).
+### Frontend Code Generation — The Real Proof
+While the benchmark suite showed a modest +1 improvement, the real transformation became apparent when we stress-tested HTML/CSS/JS generation — the exact category of structured output that was garbled before healing.
+We ran 6 increasingly complex frontend tasks:
+| Test | Description | Checks | Score | Output |
+|---|---|---|---|---|
+| Weather Dashboard | Responsive layout, CSS vars, dark mode, 5-day forecast grid | 9 | **9/9** | 14.5K chars |
+| E-Commerce Product Page | Image gallery, color swatches, quantity +/-, tabbed content, sticky mobile bar | 12 | **12/12** | 16.7K chars |
+| Animated SaaS Landing | Moving CSS gradient, typing animation, IntersectionObserver scroll reveals, auto-rotating testimonial carousel, 3 pricing tiers, scroll-based navbar | 13 | **13/13** | 24.1K chars |
+| Analytics Dashboard | SVG bar chart with hover tooltips, SVG donut chart, sortable data table with JS sort, collapsible sidebar, dark theme, CSS Grid layout | 13 | **13/13** | 22.3K chars |
+| Multi-Step Registration | 3-step form wizard, real-time inline validation, password strength meter (weak/medium/strong), all 50 US states dropdown, animated step transitions, success modal | 12 | **12/12** | 23.3K chars |
+| Snake Game | Canvas rendering, requestAnimationFrame game loop, arrow key controls, collision detection, localStorage high score, increasing difficulty | 12 | **11/12** | 11.2K chars |
+**Total: 62/63 checks passed (98.4%)**
+Critical structural integrity metrics across all 6 outputs:
+- **CSS braces: perfectly balanced in every file** (0 imbalance)
+- **JS parentheses: perfectly balanced in every file** (0 imbalance)
+- **Zero garbled or hallucinated text** in any output
+- **5 of 6 files end with proper `</html>`** (Snake game had `html>` — minor typo)
+The sophistication of the generated code is notable for an 18B frankenmerge:
+- `IntersectionObserver` for scroll-triggered animations
+- `requestAnimationFrame` with delta-time game loops
+- SVG chart generation with computed coordinates
+- Real-time password strength calculation with regex
+- CSS `@keyframes` animations (3+ per file)
+- Proper responsive design with `@media` breakpoints
+**Before healing:** The model would produce garbled code blocks, missing brackets, hallucinated syntax, and incomplete HTML structures. **After healing:** Frontend code with balanced CSS braces and JS parentheses across outputs up to 24K characters, with one malformed closing tag in the Snake game as the only structural slip.
+All 6 HTML samples are included in the `samples/` directory of the repository.
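
The structural integrity metrics above (brace and parenthesis imbalance, closing `</html>` tag) can be approximated with a few string counts. The sketch below is an assumption about how such a check could look, not the script used for this commit; the function name `structural_report` is hypothetical, and plain counting ignores delimiters inside strings or comments.

```python
def structural_report(html: str) -> dict:
    """Naive structural checks on a generated HTML file.

    Counts are raw character counts, so a brace inside a string literal
    or comment is counted like any other. Good enough as a smoke test.
    """
    return {
        # 0 means every "{" has a matching "}" by count
        "brace_imbalance": html.count("{") - html.count("}"),
        # 0 means every "(" has a matching ")" by count
        "paren_imbalance": html.count("(") - html.count(")"),
        # True only if the document ends with a well-formed closing tag
        "ends_with_html_close": html.rstrip().endswith("</html>"),
    }


good = "<html><style>a{color:red}</style><script>f(1)</script></html>"
bad = "<html><script>f(1</script>html>"

print(structural_report(good))
print(structural_report(bad))
```

A positive imbalance means an opener was never closed; a negative one means a stray closer. A stack-based parser would be needed to catch delimiters that balance by count but are wrongly ordered.
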
+### Why the Heal Works
+The heal fine-tune addresses the core problem: at layer 32, the model transitions from weights trained on Opus-style reasoning to weights trained on GLM-style reasoning. Without additional training, the internal representations at this boundary are discontinuous — the output from layer 31 is not what layers 32+ expect as input.
+By training with QLoRA across all attention and MLP projections, we allow the model to:
+1. **Adapt the boundary layers** (28-35) to bridge the representational gap
+2. **Learn to route information** coherently through the full 64-layer stack
+3. **Restore structured output capability** by re-establishing the tight token-to-token coordination that code generation requires
+The loss fell about 37% over 1000 steps (1.02 → 0.64) and bottomed out near 0.62 at step 500, about 39% below the start. This indicates that the boundary was a significant source of prediction error that the model could quickly learn to compensate for.
 ---
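
As a quick check on the loss figures above, the relative reduction at each logged step can be computed directly from the loss curve. This is a minimal sketch using only the values listed in the curve; the helper name `percent_reduction` is illustrative.

```python
# Loss values copied from the loss curve (step -> training loss).
loss_by_step = {
    10: 1.0175,
    50: 0.8758,
    100: 0.7215,
    250: 0.6700,
    500: 0.6154,
    750: 0.6435,
    1000: 0.6396,
}


def percent_reduction(start: float, end: float) -> float:
    """Relative drop from start to end, as a percentage of start."""
    return (start - end) / start * 100.0


start = loss_by_step[10]
for step, loss in loss_by_step.items():
    drop = percent_reduction(start, loss)
    print(f"step {step:>4}: loss {loss:.4f} ({drop:5.1f}% below step 10)")
```

With these values the drop is roughly 29% by step 100, about 39.5% at the step-500 low, and about 37% at step 1000, so most of the improvement arrives in the first tenth of training.
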
 ### What Worked
 1. **Frankenmerging two differently-distilled models produces a viable model.** The 18B merge beat Qwen 3.6 MoE on our benchmark at less than half the VRAM — a surprising and encouraging result.
+2. **The heal fine-tune dramatically improved structured output.** Before healing: garbled code blocks, missing brackets, hallucinated syntax. After healing: 62/63 frontend stress test checks passed with perfectly balanced CSS/JS across outputs up to 24K characters. 1000 steps of QLoRA was enough.
+3. **Tool calling and agentic reasoning are robust to layer stacking.** Perfect 6/6 and 4/4 scores in both raw and healed versions — these capabilities rely on high-level semantic representations that survive the merge even without healing.
+4. **Custom merge scripts are more reliable than mergekit for novel architectures.** The regex-based tensor renumbering approach is architecture-agnostic and handles hybrid attention types that mergekit doesn't support.
+5. **A comprehensive test suite is essential.** Without executable programming tests and frontend stress tests, we would have missed the core regression and wouldn't have been able to measure the heal's effectiveness.
+6. **The loss curve tells the story.** The sharp initial drop (1.02 → 0.72 in 100 steps) confirmed the layer boundary was a real source of error, not just noise. This gives us confidence the heal is addressing the root cause.
+### What Didn't Work (Initially)
+1. **Code formatting degrades significantly at merge boundaries.** Structured output requiring precise token sequences is the weakest point of a naive frankenmerge — but the heal fine-tune largely fixed this.
 2. **mergekit doesn't support Qwen3.5's multimodal + hybrid attention architecture** as of v0.1.4. Required a custom solution.
+3. **The 9B source model outscored the raw 18B merge (41/44 vs 39/44) on short benchmarks.** On short, single-turn test prompts, a well-tuned smaller model can beat a larger but unrefined merge. After healing, the gap narrowed to 41 vs 40 — and the 18B excels on longer, more complex outputs where the extra depth pays off (see frontend stress tests).
+4. **Three programming tests still fail after healing.** Function naming issues and missing brackets persist in some code generation tasks, suggesting QLoRA with 1383 samples isn't enough to fully resolve all formatting edge cases.
 ### What We'd Do Differently
+1. **More code-heavy training data.** The 750 competitive programming samples helped but weren't enough to fix every code formatting edge case. A larger code-focused dataset would likely close the remaining gap.
+2. **Test with longer, multi-turn conversations** — our 44-test suite uses short prompts. The frontend stress tests revealed the 18B's real strength: long, complex, structured outputs. A suite designed for those would show the merge's advantages more clearly.
+3. **Try interleaved layer stacking** (A[0], B[0], A[1], B[1], ...) instead of sequential — this distributes the merge boundary across all layers instead of concentrating it at one point, potentially reducing the need for healing.
+4. **Consider a full fine-tune** on a multi-GPU setup instead of QLoRA — training 100% of parameters instead of 2% would give maximum healing capacity, though the QLoRA results are already quite good.
 ---
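
The difference between sequential and interleaved stacking, and the regex-based renumbering idea, can be sketched as follows. This is an illustration under stated assumptions, not the merge script from this repository: the tensor naming scheme `model.layers.<N>.` and the labels `A` and `B` for the two 32-layer source models are assumed.

```python
import re

# Assumed naming scheme: "model.layers.<index>.<rest>".
LAYER_RE = re.compile(r"^(model\.layers\.)(\d+)(\..*)$")


def renumber(name: str, new_index: int) -> str:
    """Rewrite the layer index in a tensor name; leave other names alone."""
    m = LAYER_RE.match(name)
    if m is None:
        return name  # e.g. embeddings or final norm, which carry no layer index
    return f"{m.group(1)}{new_index}{m.group(3)}"


def sequential_order(n_layers: int) -> list:
    """All of A's layers, then all of B's layers."""
    return [("A", i) for i in range(n_layers)] + [("B", i) for i in range(n_layers)]


def interleaved_order(n_layers: int) -> list:
    """A[0], B[0], A[1], B[1], ..."""
    order = []
    for i in range(n_layers):
        order.append(("A", i))
        order.append(("B", i))
    return order


def count_seams(order: list) -> int:
    """Number of adjacent positions where the source model changes."""
    return sum(1 for prev, cur in zip(order, order[1:]) if prev[0] != cur[0])


print(count_seams(sequential_order(32)))   # one seam, at the layer-32 boundary
print(count_seams(interleaved_order(32)))  # a seam between every pair of layers
print(renumber("model.layers.3.mlp.up_proj.weight", 35))
```

Sequential stacking concentrates the mismatch at a single seam, while interleaving produces one between every adjacent pair of layers. Whether many small seams heal more easily than one large one is the open question that item 3 proposes to test.
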
 TEST_MODEL="Qwopus-GLM-18B" python3 tests/test_qwopus_v35.py
 ```
+### Step 5: Heal Fine-Tune (recommended)
 ```bash
+# Dry run first to verify everything loads
+python3 heal-frankenmerge.py --dry-run
+# Full heal — ~14 hours on RTX 5090, checkpoints every 250 steps
 python3 heal-frankenmerge.py --max-steps 1000 --num-samples 5000
+# Outputs: ~/models/Qwopus-GLM-18B-healed/ (merged 16-bit safetensors)
+```
+### Step 6: Convert & Quantize Healed Model
+```bash
+python3 llama-cpp-latest/convert_hf_to_gguf.py \
+  ~/models/Qwopus-GLM-18B-healed \
+  --outfile ~/models/Qwopus-GLM-18B-healed-f16.gguf \
+  --outtype bf16
+llama-quantize \
+  ~/models/Qwopus-GLM-18B-healed-f16.gguf \
+  ~/models/Qwopus-GLM-18B-healed-Q4_K_M.gguf \
+  Q4_K_M
+```
+### Step 7: Benchmark Healed Model
+```bash
+llama-server -m ~/models/Qwopus-GLM-18B-healed-Q4_K_M.gguf \
+  --alias "Qwopus-GLM-18B-healed" --host 127.0.0.1 --port 8001 \
+  --ctx-size 65536 --flash-attn on --n-gpu-layers 99 --jinja
+TEST_MODEL="Qwopus-GLM-18B-healed" python3 tests/test_qwopus_v35.py
 ```
 ---
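
Once the Step 7 server is running, a single prompt can be sent to its OpenAI-compatible endpoint as a sanity check before running the full suite. The sketch below uses only the standard library and the host, port, and alias from the `llama-server` command above; it is not the code in `tests/test_qwopus_v35.py`, and the helper names `build_chat_request` and `ask` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8001"     # --host and --port from Step 7
MODEL_ALIAS = "Qwopus-GLM-18B-healed"  # --alias from Step 7


def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": MODEL_ALIAS,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout: float = 120.0) -> str:
    """Send one prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(ask("Write a Python function that reverses a string."))` should return a fenced code block if the heal took effect. Nothing is sent until `ask` is called, so the module can be imported without a running server.
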