Instructions for using namgyu-youn/Qwen3-8B-W8A8-INT with libraries, inference providers, notebooks, and local apps.

Libraries

How to use namgyu-youn/Qwen3-8B-W8A8-INT with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="namgyu-youn/Qwen3-8B-W8A8-INT")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("namgyu-youn/Qwen3-8B-W8A8-INT")
model = AutoModelForCausalLM.from_pretrained("namgyu-youn/Qwen3-8B-W8A8-INT")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use namgyu-youn/Qwen3-8B-W8A8-INT with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "namgyu-youn/Qwen3-8B-W8A8-INT"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "namgyu-youn/Qwen3-8B-W8A8-INT",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/namgyu-youn/Qwen3-8B-W8A8-INT

SGLang

How to use namgyu-youn/Qwen3-8B-W8A8-INT with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "namgyu-youn/Qwen3-8B-W8A8-INT" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "namgyu-youn/Qwen3-8B-W8A8-INT",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "namgyu-youn/Qwen3-8B-W8A8-INT" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "namgyu-youn/Qwen3-8B-W8A8-INT",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use namgyu-youn/Qwen3-8B-W8A8-INT with Docker Model Runner:
```
docker model run hf.co/namgyu-youn/Qwen3-8B-W8A8-INT
```

W8A8-INT Qwen/Qwen3-8B model

Developed by: namgyu-youn
License: apache-2.0
Quantized from Model: Qwen/Qwen3-8B
Quantization Method: W8A8-INT
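
As a rough illustration of what the W8A8-INT label refers to (8-bit integer weights and 8-bit integer activations), the sketch below quantizes both operands of a matrix-vector product to int8 range, accumulates in integers, and rescales the result. It is a toy in pure Python, not the kernel used for this checkpoint. The per-row symmetric scaling and the helper names `quantize_row` and `w8a8_matvec` are assumptions made for the example.

```python
# Illustrative W8A8 sketch: int8 weights, int8 activations, integer accumulation.
# This is NOT the kernel used by the checkpoint; the per-row symmetric scaling
# and both helper names are assumptions made for illustration.

def quantize_row(row):
    """Symmetric quantization of one row to the int8 range: (ints, scale)."""
    max_abs = max(abs(v) for v in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in row]
    return q, scale


def w8a8_matvec(weight_rows, activation):
    """Compute y = W @ x with quantized operands and a float rescale at the end."""
    x_q, x_scale = quantize_row(activation)           # A8: quantized at run time
    out = []
    for row in weight_rows:
        w_q, w_scale = quantize_row(row)               # W8: usually done offline
        acc = sum(a * b for a, b in zip(w_q, x_q))     # integer accumulation
        out.append(acc * w_scale * x_scale)            # dequantize the result
    return out


weights = [[0.5, -1.0, 0.25], [2.0, 0.0, -0.5]]
x = [1.0, 0.5, -2.0]
approx = w8a8_matvec(weights, x)
exact = [sum(w * v for w, v in zip(row, x)) for row in weights]
print("quantized:", approx)
print("float:    ", exact)
```

The quantized result should track the float result closely but not exactly; that gap is what the accuracy comparison in the next section is meant to measure.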

Model Performance

A. Accuracy (lm-eval, MMLU)

Original Model

# MMLU accuracy command (the mmlu task reports accuracy, not perplexity)
lm_eval --model hf --model_args pretrained=Qwen/Qwen3-8B --tasks mmlu --device cuda:0 --batch_size 8 --limit 100

Quantized Model

# MMLU accuracy command (the mmlu task reports accuracy, not perplexity)
lm_eval --model hf --model_args pretrained=namgyu-youn/Qwen3-8B-W8A8-INT --tasks mmlu --device cuda:0 --batch_size 8 --limit 100
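
The two commands above differ only in the model ID. One way to keep the runs identical otherwise is to build both argument lists from a single function, as in this minimal sketch. The function name `lm_eval_command` is hypothetical; the flags and values are copied from the commands above, and nothing is executed here.

```python
import shlex


def lm_eval_command(model_id, tasks="mmlu", device="cuda:0", batch_size=8, limit=100):
    """Build the lm_eval argument list shown above for a given model ID."""
    return [
        "lm_eval", "--model", "hf",
        "--model_args", f"pretrained={model_id}",
        "--tasks", tasks,
        "--device", device,
        "--batch_size", str(batch_size),
        "--limit", str(limit),
    ]


# Print both commands; pass a list to subprocess.run(...) to actually run it.
for model_id in ("Qwen/Qwen3-8B", "namgyu-youn/Qwen3-8B-W8A8-INT"):
    print(shlex.join(lm_eval_command(model_id)))
```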

Summary

Benchmark	Qwen/Qwen3-8B	namgyu-youn/Qwen3-8B-W8A8-INT
mmlu	-	-

B. Throughput (vLLM)

Original Model

vllm bench throughput --model Qwen/Qwen3-8B --input-len 256 --output-len 256 --num-prompts 100

Quantized Model

> vllm bench throughput --model namgyu-youn/Qwen3-8B-W8A8-INT --input-len 256 --output-len 256 --num-prompts 100
/home/elicer/ao/.venv/lib/python3.10/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
When dataset path is not set, it will default to random dataset
INFO 12-18 16:07:49 [datasets.py:613] Sampling input_len from [256, 256] and output_len from [256, 256]
INFO 12-18 16:07:49 [utils.py:253] non-default args: {'tokenizer': 'namgyu-youn/Qwen3-8B-W8A8-INT', 'enable_lora': None, 'reasoning_parser_plugin': '', 'model': 'namgyu-youn/Qwen3-8B-W8A8-INT'}
INFO 12-18 16:07:51 [model.py:637] Resolved architecture: Qwen3ForCausalLM
INFO 12-18 16:07:51 [model.py:1750] Using max model len 40960
INFO 12-18 16:07:51 [scheduler.py:228] Chunked prefill is enabled with max_num_batched_tokens=8192.
/home/elicer/ao/.venv/lib/python3.10/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
(EngineCore_DP0 pid=40856) INFO 12-18 16:08:01 [core.py:93] Initializing a V1 LLM engine (v0.12.0) with config: model='namgyu-youn/Qwen3-8B-W8A8-INT', speculative_config=None, tokenizer='namgyu-youn/Qwen3-8B-W8A8-INT', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=torchao, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01), seed=0, served_model_name=namgyu-youn/Qwen3-8B-W8A8-INT, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 
'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>}, 'local_cache_dir': None}
(EngineCore_DP0 pid=40856) INFO 12-18 16:08:01 [parallel_state.py:1200] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.0.2.100:41745 backend=nccl
(EngineCore_DP0 pid=40856) INFO 12-18 16:08:01 [parallel_state.py:1408] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=40856) INFO 12-18 16:08:01 [gpu_model_runner.py:3467] Starting to load model namgyu-youn/Qwen3-8B-W8A8-INT...
(EngineCore_DP0 pid=40856) /home/elicer/ao/.venv/lib/python3.10/site-packages/torch/__init__.py:1617: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.)
(EngineCore_DP0 pid=40856)   _C._set_float32_matmul_precision(precision)
(EngineCore_DP0 pid=40856) INFO 12-18 16:08:02 [cuda.py:411] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
Loading pt checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading pt checkpoint shards:  50% Completed | 1/2 [00:03<00:03,  3.57s/it]
Loading pt checkpoint shards: 100% Completed | 2/2 [00:06<00:00,  3.44s/it]
Loading pt checkpoint shards: 100% Completed | 2/2 [00:06<00:00,  3.46s/it]
(EngineCore_DP0 pid=40856) 
(EngineCore_DP0 pid=40856) INFO 12-18 16:08:11 [default_loader.py:308] Loading weights took 6.91 seconds
(EngineCore_DP0 pid=40856) INFO 12-18 16:08:11 [gpu_model_runner.py:3549] Model loading took 8.8021 GiB memory and 9.104030 seconds
(EngineCore_DP0 pid=40856) INFO 12-18 16:08:25 [backends.py:655] Using cache directory: /home/elicer/.cache/vllm/torch_compile_cache/d343df497c/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=40856) INFO 12-18 16:08:25 [backends.py:715] Dynamo bytecode transform time: 13.34 s
(EngineCore_DP0 pid=40856) INFO 12-18 16:08:52 [backends.py:216] Directly load the compiled graph(s) for dynamic shape from the cache, took 26.647 s
(EngineCore_DP0 pid=40856) INFO 12-18 16:09:29 [monitor.py:34] torch.compile takes 39.98 s in total
(EngineCore_DP0 pid=40856) INFO 12-18 16:09:30 [gpu_worker.py:359] Available KV cache memory: 7.24 GiB
(EngineCore_DP0 pid=40856) INFO 12-18 16:09:31 [kv_cache_utils.py:1286] GPU KV cache size: 52,736 tokens
(EngineCore_DP0 pid=40856) INFO 12-18 16:09:31 [kv_cache_utils.py:1291] Maximum concurrency for 40,960 tokens per request: 1.29x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  90%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎            | 46/51 [00:03<00:00, 12.78it/s]
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] EngineCore failed to start.
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] Traceback (most recent call last):
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 834, in run_engine_core
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 610, in __init__
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     super().__init__(
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 109, in __init__
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 251, in _initialize_kv_caches
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 116, in initialize_from_config
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     self.collective_rpc("compile_or_warm_up_model")
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 75, in collective_rpc
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/serial_utils.py", line 479, in run_method
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     return func(*args, **kwargs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 427, in compile_or_warm_up_model
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     cuda_graph_memory_bytes = self.model_runner.capture_model()
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4423, in capture_model
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     self._capture_cudagraphs(
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4520, in _capture_cudagraphs
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     self._dummy_run(
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     return func(*args, **kwargs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4071, in _dummy_run
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     outputs = self.model(
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/compilation/cuda_graph.py", line 126, in __call__
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/qwen3.py", line 315, in forward
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     hidden_states = self.model(
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/compilation/decorators.py", line 433, in __call__
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     return TorchCompileWithNoGuardsWrapper.__call__(self, *args, **kwargs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/compilation/wrapper.py", line 174, in __call__
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     return self.forward(*args, **kwargs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 389, in forward
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     def forward(
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     return fn(*args, **kwargs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/compilation/caching.py", line 54, in __call__
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     return self.optimized_call(*args, **kwargs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/fx/graph_module.py", line 413, in __call__
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     raise e
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/fx/graph_module.py", line 400, in __call__
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "<eval_with_key>.74", line 298, in forward
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     submod_0 = self.submod_0(l_input_ids_, s72, l_self_modules_embed_tokens_parameters_weight_, l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_q_norm_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_k_norm_parameters_weight_, l_positions_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_);  l_input_ids_ = l_self_modules_embed_tokens_parameters_weight_ = l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_q_norm_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_k_norm_parameters_weight_ = None
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/compilation/cuda_graph.py", line 126, in __call__
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/compilation/piecewise_backend.py", line 99, in __call__
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     return self.compiled_graph_for_general_shape(*args)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/compilation/compiler_interface.py", line 268, in compiled_graph_wrapper
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     graph_output = inductor_compiled_graph(*args)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/_inductor/standalone_compile.py", line 63, in __call__
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     return self._compiled_fn(*args)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/_inductor/standalone_compile.py", line 184, in <lambda>
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     return CompiledArtifact(lambda *args: compiled_fn(list(args)), None)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 353, in runtime_wrapper
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     all_outs = call_func_at_runtime_with_args(
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/utils.py", line 129, in call_func_at_runtime_with_args
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     out = normalize_as_list(f(args))
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 526, in wrapper
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     return compiled_fn(runtime_args)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 690, in inner_fn
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     unwrapped_outs = compiled_fn(unwrapped_args)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/_inductor/output_code.py", line 613, in __call__
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     return self.current_callable(inputs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/_inductor/utils.py", line 2962, in run
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     out = model(new_inputs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]   File "/tmp/torchinductor_elicer/ky/cky2jjvg4btm7jgfrxinkkypay46uehobqgihwloy7ibr7dpw2kp.py", line 1385, in call
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843]     extern_kernels._int_mm(buf5, reinterpret_tensor(arg4_1, (4096, 6144), (1, 4096), 0), out=buf6)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] RuntimeError: self.size(0) needs to be greater than 16, but got 16
(EngineCore_DP0 pid=40856) Process EngineCore_DP0:
(EngineCore_DP0 pid=40856) Traceback (most recent call last):
(EngineCore_DP0 pid=40856)   File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=40856)     self.run()
(EngineCore_DP0 pid=40856)   File "/usr/local/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=40856)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 847, in run_engine_core
(EngineCore_DP0 pid=40856)     raise e
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 834, in run_engine_core
(EngineCore_DP0 pid=40856)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 610, in __init__
(EngineCore_DP0 pid=40856)     super().__init__(
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 109, in __init__
(EngineCore_DP0 pid=40856)     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 251, in _initialize_kv_caches
(EngineCore_DP0 pid=40856)     self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 116, in initialize_from_config
(EngineCore_DP0 pid=40856)     self.collective_rpc("compile_or_warm_up_model")
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 75, in collective_rpc
(EngineCore_DP0 pid=40856)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/serial_utils.py", line 479, in run_method
(EngineCore_DP0 pid=40856)     return func(*args, **kwargs)
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 427, in compile_or_warm_up_model
(EngineCore_DP0 pid=40856)     cuda_graph_memory_bytes = self.model_runner.capture_model()
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4423, in capture_model
(EngineCore_DP0 pid=40856)     self._capture_cudagraphs(
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4520, in _capture_cudagraphs
(EngineCore_DP0 pid=40856)     self._dummy_run(
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=40856)     return func(*args, **kwargs)
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4071, in _dummy_run
(EngineCore_DP0 pid=40856)     outputs = self.model(
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/compilation/cuda_graph.py", line 126, in __call__
(EngineCore_DP0 pid=40856)     return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=40856)     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=40856)     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/qwen3.py", line 315, in forward
(EngineCore_DP0 pid=40856)     hidden_states = self.model(
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/compilation/decorators.py", line 433, in __call__
(EngineCore_DP0 pid=40856)     return TorchCompileWithNoGuardsWrapper.__call__(self, *args, **kwargs)
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/compilation/wrapper.py", line 174, in __call__
(EngineCore_DP0 pid=40856)     return self.forward(*args, **kwargs)
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 389, in forward
(EngineCore_DP0 pid=40856)     def forward(
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore_DP0 pid=40856)     return fn(*args, **kwargs)
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/compilation/caching.py", line 54, in __call__
(EngineCore_DP0 pid=40856)     return self.optimized_call(*args, **kwargs)
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(EngineCore_DP0 pid=40856)     return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/fx/graph_module.py", line 413, in __call__
(EngineCore_DP0 pid=40856)     raise e
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/fx/graph_module.py", line 400, in __call__
(EngineCore_DP0 pid=40856)     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=40856)     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=40856)     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=40856)   File "<eval_with_key>.74", line 298, in forward
(EngineCore_DP0 pid=40856)     submod_0 = self.submod_0(l_input_ids_, s72, l_self_modules_embed_tokens_parameters_weight_, l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_q_norm_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_k_norm_parameters_weight_, l_positions_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_);  l_input_ids_ = l_self_modules_embed_tokens_parameters_weight_ = l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_q_norm_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_k_norm_parameters_weight_ = None
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/compilation/cuda_graph.py", line 126, in __call__
(EngineCore_DP0 pid=40856)     return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/compilation/piecewise_backend.py", line 99, in __call__
(EngineCore_DP0 pid=40856)     return self.compiled_graph_for_general_shape(*args)
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/compilation/compiler_interface.py", line 268, in compiled_graph_wrapper
(EngineCore_DP0 pid=40856)     graph_output = inductor_compiled_graph(*args)
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/_inductor/standalone_compile.py", line 63, in __call__
(EngineCore_DP0 pid=40856)     return self._compiled_fn(*args)
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/_inductor/standalone_compile.py", line 184, in <lambda>
(EngineCore_DP0 pid=40856)     return CompiledArtifact(lambda *args: compiled_fn(list(args)), None)
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 353, in runtime_wrapper
(EngineCore_DP0 pid=40856)     all_outs = call_func_at_runtime_with_args(
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/utils.py", line 129, in call_func_at_runtime_with_args
(EngineCore_DP0 pid=40856)     out = normalize_as_list(f(args))
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 526, in wrapper
(EngineCore_DP0 pid=40856)     return compiled_fn(runtime_args)
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 690, in inner_fn
(EngineCore_DP0 pid=40856)     unwrapped_outs = compiled_fn(unwrapped_args)
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/_inductor/output_code.py", line 613, in __call__
(EngineCore_DP0 pid=40856)     return self.current_callable(inputs)
(EngineCore_DP0 pid=40856)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/_inductor/utils.py", line 2962, in run
(EngineCore_DP0 pid=40856)     out = model(new_inputs)
(EngineCore_DP0 pid=40856)   File "/tmp/torchinductor_elicer/ky/cky2jjvg4btm7jgfrxinkkypay46uehobqgihwloy7ibr7dpw2kp.py", line 1385, in call
(EngineCore_DP0 pid=40856)     extern_kernels._int_mm(buf5, reinterpret_tensor(arg4_1, (4096, 6144), (1, 4096), 0), out=buf6)
(EngineCore_DP0 pid=40856) RuntimeError: self.size(0) needs to be greater than 16, but got 16
[rank0]:[W1218 16:09:36.180598790 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
  File "/home/elicer/ao/.venv/bin/vllm", line 10, in <module>
    sys.exit(main())
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/entrypoints/cli/main.py", line 73, in main
    args.dispatch_function(args)
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/entrypoints/cli/benchmark/throughput.py", line 21, in cmd
    main(args)
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/benchmarks/throughput.py", line 721, in main
    elapsed_time, request_outputs = run_vllm(
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/benchmarks/throughput.py", line 50, in run_vllm
    llm = LLM(**dataclasses.asdict(engine_args))
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 334, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/llm_engine.py", line 183, in from_engine_args
    return cls(
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/llm_engine.py", line 109, in __init__
    self.engine_core = EngineCoreClient.make_client(
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 93, in make_client
    return SyncMPClient(vllm_config, executor_class, log_stats)
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 642, in __init__
    super().__init__(
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 471, in __init__
    with launch_core_engines(vllm_config, executor_class, log_stats) as (
  File "/usr/local/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 903, in launch_core_engines
    wait_for_engine_startup(
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 960, in wait_for_engine_startup
    raise RuntimeError(
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
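
The root cause in the trace above is the shape check inside `_int_mm`: the call passes an activation buffer with 16 rows against a (4096, 6144) weight, and the kernel rejects a first dimension that is not strictly greater than 16. The sketch below reproduces only that check in plain Python so the failing and passing row counts are easy to see. `int_mm_rows_ok` and `rows_after_padding` are hypothetical helper names, and padding the activation is an assumed workaround idea, not something vLLM or TorchAO is confirmed to do.

```python
# Threshold taken from the error message:
# "self.size(0) needs to be greater than 16, but got 16".
MIN_ROWS_EXCLUSIVE = 16


def int_mm_rows_ok(rows: int) -> bool:
    """Return True if `rows` passes the size(0) check reported in the trace."""
    return rows > MIN_ROWS_EXCLUSIVE


def rows_after_padding(rows: int) -> int:
    """Smallest row count >= `rows` that passes the check (hypothetical workaround)."""
    return max(rows, MIN_ROWS_EXCLUSIVE + 1)


# Row count of the failing call in the trace.
failing_rows = 16
print(int_mm_rows_ok(failing_rows))      # False
print(rows_after_padding(failing_rows))  # 17
```

Under this reading, any batch of 16 or fewer tokens that reaches the int8 matmul would trip the same check, which is consistent with the failure appearing during engine initialization.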

Summary

Benchmark	Qwen/Qwen3-8B	namgyu-youn/Qwen3-8B-W8A8-INT
Throughput (tok/s)	-	-

C. Latency (vLLM)

Original Model

vllm bench latency --model Qwen/Qwen3-8B --input-len 256 --output-len 256 --batch-size 1

Quantized Model

vllm bench latency --model namgyu-youn/Qwen3-8B-W8A8-INT --input-len 256 --output-len 256 --batch-size 1
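
Both latency runs must use identical settings for the comparison to be meaningful. As a minimal sketch, the two commands above can be generated from one helper so that only the model name varies. `latency_command` is a hypothetical helper name; the flags are the ones shown in the commands above.

```python
import shlex
from typing import List

# The two models compared in this card.
MODELS = ["Qwen/Qwen3-8B", "namgyu-youn/Qwen3-8B-W8A8-INT"]


def latency_command(
    model: str,
    input_len: int = 256,
    output_len: int = 256,
    batch_size: int = 1,
) -> List[str]:
    """Build the argv list for one `vllm bench latency` run."""
    return [
        "vllm", "bench", "latency",
        "--model", model,
        "--input-len", str(input_len),
        "--output-len", str(output_len),
        "--batch-size", str(batch_size),
    ]


# Print the shell form of each command without running it.
for model in MODELS:
    print(shlex.join(latency_command(model)))
```

To execute a run rather than print it, the list can be passed to `subprocess.run(cmd, check=True)` in an environment where vLLM is installed.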

Summary

Benchmark	Qwen/Qwen3-8B	namgyu-youn/Qwen3-8B-W8A8-INT
Latency (ms)	-	-
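
Once both latency numbers are available, the row can be filled in and a speedup ratio derived from it. The sketch below shows one way to do that arithmetic. `speedup` and `summary_row` are hypothetical helpers, and the values 100.0 and 80.0 are placeholders for illustration only, not measurements of either model.

```python
def speedup(baseline_ms: float, quantized_ms: float) -> float:
    """Baseline latency divided by quantized latency; above 1 means quantized is faster."""
    if baseline_ms <= 0 or quantized_ms <= 0:
        raise ValueError("latencies must be positive")
    return baseline_ms / quantized_ms


def summary_row(baseline_ms: float, quantized_ms: float) -> str:
    """Format a tab-separated summary row with the speedup appended."""
    ratio = speedup(baseline_ms, quantized_ms)
    return f"Latency (ms)\t{baseline_ms:.1f}\t{quantized_ms:.1f}\t{ratio:.2f}x"


# Placeholder inputs, NOT benchmark results.
print(summary_row(100.0, 80.0))
```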

Resources

TorchAO GitHub: https://github.com/pytorch/ao
TorchAO Documentation: https://docs.pytorch.org/ao/stable/index.html

Model tree for namgyu-youn/Qwen3-8B-W8A8-INT

Base model: Qwen/Qwen3-8B-Base
Finetuned: Qwen/Qwen3-8B
Quantized: this model
