W8A8-INT Qwen/Qwen3-8B model
- Developed by: namgyu-youn
- License: apache-2.0
- Quantized from Model: Qwen/Qwen3-8B
- Quantization Method: W8A8-INT
Model Performance
A. Accuracy (lm-eval, MMLU)
Original Model
lm_eval --model hf --model_args pretrained=Qwen/Qwen3-8B --tasks mmlu --device cuda:0 --batch_size 8 --limit 100
Quantized Model
lm_eval --model hf --model_args pretrained=namgyu-youn/Qwen3-8B-W8A8-INT --tasks mmlu --device cuda:0 --batch_size 8 --limit 100
Summary
| Benchmark | Qwen/Qwen3-8B | namgyu-youn/Qwen3-8B-W8A8-INT |
| --- | --- | --- |
| mmlu | - | - |
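Both commands pass `--limit 100`, so each MMLU subtask is scored on at most 100 questions, and a small accuracy gap between the two models may be sampling noise. A minimal sketch of the binomial standard error, assuming independent questions; the accuracy values below are hypothetical placeholders, not results from this card:

```python
import math


def accuracy_stderr(acc: float, n: int) -> float:
    """Binomial standard error of an accuracy estimated from n questions."""
    if not 0.0 <= acc <= 1.0 or n <= 0:
        raise ValueError("acc must be in [0, 1] and n must be positive")
    return math.sqrt(acc * (1.0 - acc) / n)


# Hypothetical accuracies, for illustration only (not measured here).
baseline_acc, quantized_acc = 0.75, 0.74
n_questions = 100  # one subtask under --limit 100

se = accuracy_stderr(baseline_acc, n_questions)
gap = abs(baseline_acc - quantized_acc)
print(f"stderr={se:.4f}, gap={gap:.4f}, within one stderr: {gap <= se}")
```

With numbers like these, a one-point gap on a single subtask would sit well inside one standard error, which is one reason to compare the aggregate score or rerun without `--limit`.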
B. Throughput (vLLM)
Original Model
vllm bench throughput --model Qwen/Qwen3-8B --input-len 256 --output-len 256 --num-prompts 100
Quantized Model
vllm bench throughput --model namgyu-youn/Qwen3-8B-W8A8-INT --input-len 256 --output-len 256 --num-prompts 100
/home/elicer/ao/.venv/lib/python3.10/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
warnings.warn(
When dataset path is not set, it will default to random dataset
INFO 12-18 16:07:49 [datasets.py:613] Sampling input_len from [256, 256] and output_len from [256, 256]
INFO 12-18 16:07:49 [utils.py:253] non-default args: {'tokenizer': 'namgyu-youn/Qwen3-8B-W8A8-INT', 'enable_lora': None, 'reasoning_parser_plugin': '', 'model': 'namgyu-youn/Qwen3-8B-W8A8-INT'}
INFO 12-18 16:07:51 [model.py:637] Resolved architecture: Qwen3ForCausalLM
INFO 12-18 16:07:51 [model.py:1750] Using max model len 40960
INFO 12-18 16:07:51 [scheduler.py:228] Chunked prefill is enabled with max_num_batched_tokens=8192.
(EngineCore_DP0 pid=40856) INFO 12-18 16:08:01 [core.py:93] Initializing a V1 LLM engine (v0.12.0) with config: model='namgyu-youn/Qwen3-8B-W8A8-INT', speculative_config=None, tokenizer='namgyu-youn/Qwen3-8B-W8A8-INT', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=torchao, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01), seed=0, served_model_name=namgyu-youn/Qwen3-8B-W8A8-INT, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 
'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>}, 'local_cache_dir': None}
(EngineCore_DP0 pid=40856) INFO 12-18 16:08:01 [parallel_state.py:1200] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.0.2.100:41745 backend=nccl
(EngineCore_DP0 pid=40856) INFO 12-18 16:08:01 [parallel_state.py:1408] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=40856) INFO 12-18 16:08:01 [gpu_model_runner.py:3467] Starting to load model namgyu-youn/Qwen3-8B-W8A8-INT...
(EngineCore_DP0 pid=40856) /home/elicer/ao/.venv/lib/python3.10/site-packages/torch/__init__.py:1617: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html
(EngineCore_DP0 pid=40856) _C._set_float32_matmul_precision(precision)
(EngineCore_DP0 pid=40856) INFO 12-18 16:08:02 [cuda.py:411] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
Loading pt checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading pt checkpoint shards: 50% Completed | 1/2 [00:03<00:03, 3.57s/it]
Loading pt checkpoint shards: 100% Completed | 2/2 [00:06<00:00, 3.44s/it]
Loading pt checkpoint shards: 100% Completed | 2/2 [00:06<00:00, 3.46s/it]
(EngineCore_DP0 pid=40856)
(EngineCore_DP0 pid=40856) INFO 12-18 16:08:11 [default_loader.py:308] Loading weights took 6.91 seconds
(EngineCore_DP0 pid=40856) INFO 12-18 16:08:11 [gpu_model_runner.py:3549] Model loading took 8.8021 GiB memory and 9.104030 seconds
(EngineCore_DP0 pid=40856) INFO 12-18 16:08:25 [backends.py:655] Using cache directory: /home/elicer/.cache/vllm/torch_compile_cache/d343df497c/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=40856) INFO 12-18 16:08:25 [backends.py:715] Dynamo bytecode transform time: 13.34 s
(EngineCore_DP0 pid=40856) INFO 12-18 16:08:52 [backends.py:216] Directly load the compiled graph(s) for dynamic shape from the cache, took 26.647 s
(EngineCore_DP0 pid=40856) INFO 12-18 16:09:29 [monitor.py:34] torch.compile takes 39.98 s in total
(EngineCore_DP0 pid=40856) INFO 12-18 16:09:30 [gpu_worker.py:359] Available KV cache memory: 7.24 GiB
(EngineCore_DP0 pid=40856) INFO 12-18 16:09:31 [kv_cache_utils.py:1286] GPU KV cache size: 52,736 tokens
(EngineCore_DP0 pid=40856) INFO 12-18 16:09:31 [kv_cache_utils.py:1291] Maximum concurrency for 40,960 tokens per request: 1.29x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 46/51 [00:03<00:00, 12.78it/s]
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] EngineCore failed to start.
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] Traceback (most recent call last):
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 834, in run_engine_core
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 610, in __init__
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] super().__init__(
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 109, in __init__
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 251, in _initialize_kv_caches
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 116, in initialize_from_config
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] self.collective_rpc("compile_or_warm_up_model")
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 75, in collective_rpc
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/serial_utils.py", line 479, in run_method
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] return func(*args, **kwargs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 427, in compile_or_warm_up_model
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] cuda_graph_memory_bytes = self.model_runner.capture_model()
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4423, in capture_model
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] self._capture_cudagraphs(
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4520, in _capture_cudagraphs
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] self._dummy_run(
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] return func(*args, **kwargs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4071, in _dummy_run
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] outputs = self.model(
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/compilation/cuda_graph.py", line 126, in __call__
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/qwen3.py", line 315, in forward
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] hidden_states = self.model(
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/compilation/decorators.py", line 433, in __call__
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] return TorchCompileWithNoGuardsWrapper.__call__(self, *args, **kwargs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/compilation/wrapper.py", line 174, in __call__
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] return self.forward(*args, **kwargs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 389, in forward
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] def forward(
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] return fn(*args, **kwargs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/compilation/caching.py", line 54, in __call__
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] return self.optimized_call(*args, **kwargs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/fx/graph_module.py", line 413, in __call__
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] raise e
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/fx/graph_module.py", line 400, in __call__
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "<eval_with_key>.74", line 298, in forward
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] submod_0 = self.submod_0(l_input_ids_, s72, l_self_modules_embed_tokens_parameters_weight_, l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_q_norm_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_k_norm_parameters_weight_, l_positions_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_); l_input_ids_ = l_self_modules_embed_tokens_parameters_weight_ = l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_q_norm_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_k_norm_parameters_weight_ = None
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/compilation/cuda_graph.py", line 126, in __call__
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/compilation/piecewise_backend.py", line 99, in __call__
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] return self.compiled_graph_for_general_shape(*args)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/compilation/compiler_interface.py", line 268, in compiled_graph_wrapper
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] graph_output = inductor_compiled_graph(*args)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/_inductor/standalone_compile.py", line 63, in __call__
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] return self._compiled_fn(*args)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/_inductor/standalone_compile.py", line 184, in <lambda>
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] return CompiledArtifact(lambda *args: compiled_fn(list(args)), None)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 353, in runtime_wrapper
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] all_outs = call_func_at_runtime_with_args(
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/utils.py", line 129, in call_func_at_runtime_with_args
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] out = normalize_as_list(f(args))
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 526, in wrapper
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] return compiled_fn(runtime_args)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 690, in inner_fn
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] unwrapped_outs = compiled_fn(unwrapped_args)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/_inductor/output_code.py", line 613, in __call__
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] return self.current_callable(inputs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/_inductor/utils.py", line 2962, in run
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] out = model(new_inputs)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] File "/tmp/torchinductor_elicer/ky/cky2jjvg4btm7jgfrxinkkypay46uehobqgihwloy7ibr7dpw2kp.py", line 1385, in call
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] extern_kernels._int_mm(buf5, reinterpret_tensor(arg4_1, (4096, 6144), (1, 4096), 0), out=buf6)
(EngineCore_DP0 pid=40856) ERROR 12-18 16:09:35 [core.py:843] RuntimeError: self.size(0) needs to be greater than 16, but got 16
[rank0]:[W1218 16:09:36.180598790 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
File "/home/elicer/ao/.venv/bin/vllm", line 10, in <module>
sys.exit(main())
File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/entrypoints/cli/main.py", line 73, in main
args.dispatch_function(args)
File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/entrypoints/cli/benchmark/throughput.py", line 21, in cmd
main(args)
File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/benchmarks/throughput.py", line 721, in main
elapsed_time, request_outputs = run_vllm(
File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/benchmarks/throughput.py", line 50, in run_vllm
llm = LLM(**dataclasses.asdict(engine_args))
File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 334, in __init__
self.llm_engine = LLMEngine.from_engine_args(
File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/llm_engine.py", line 183, in from_engine_args
return cls(
File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/llm_engine.py", line 109, in __init__
self.engine_core = EngineCoreClient.make_client(
File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 93, in make_client
return SyncMPClient(vllm_config, executor_class, log_stats)
File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 642, in __init__
super().__init__(
File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 471, in __init__
with launch_core_engines(vllm_config, executor_class, log_stats) as (
File "/usr/local/lib/python3.10/contextlib.py", line 142, in __exit__
next(self.gen)
File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 903, in launch_core_engines
wait_for_engine_startup(
File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 960, in wait_for_engine_startup
raise RuntimeError(
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
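The quantized run did not complete: CUDA graph capture stopped at 46/51 and `extern_kernels._int_mm` raised `self.size(0) needs to be greater than 16, but got 16`. The sketch below checks which of the logged `cudagraph_capture_sizes` fall under that bound. It assumes that capture proceeds from the largest size to the smallest, which the log does not state explicitly; the helper name `rejected_sizes` is hypothetical.

```python
# Capture sizes copied from 'cudagraph_capture_sizes' in the log above:
# 1, 2, 4, 8, 16, then 24..256 in steps of 8, then 272..512 in steps of 16.
CAPTURE_SIZES = [1, 2, 4, 8, 16] + list(range(24, 257, 8)) + list(range(272, 513, 16))

# Bound taken from the error message: size(0) must be greater than 16.
INT_MM_MIN_ROWS_EXCLUSIVE = 16


def rejected_sizes(sizes, min_rows_exclusive):
    """Return the capture sizes that do not satisfy size(0) > min_rows_exclusive."""
    return [s for s in sizes if s <= min_rows_exclusive]


rejected = rejected_sizes(CAPTURE_SIZES, INT_MM_MIN_ROWS_EXCLUSIVE)

# Assumption: graphs are captured in descending size order, so the
# 47th capture (index 46) is the one that fails after 46 successes.
descending = sorted(CAPTURE_SIZES, reverse=True)

print(len(CAPTURE_SIZES))  # number of capture sizes in the log
print(rejected)            # sizes that violate the _int_mm bound
print(descending[46])      # size reached after 46 completed captures
```

Under that assumption the arithmetic matches the log: 51 sizes, 46 captured, and the next size is 16. The four smaller sizes would presumably hit the same check. Whether limiting capture to sizes above 16, or running with `--enforce-eager`, avoids the failure was not tested here.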
Summary
| Benchmark | Qwen/Qwen3-8B | namgyu-youn/Qwen3-8B-W8A8-INT |
| --- | --- | --- |
| Throughput (tok/s) | - | - |
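A throughput figure in tok/s follows from the benchmark arguments and the measured wall-clock time. A rough sketch of that arithmetic, assuming every request generates exactly `--output-len` tokens; the elapsed time is a hypothetical placeholder, and the table header does not say whether it counts total or output tokens, so both are computed:

```python
def throughput(num_prompts: int, input_len: int, output_len: int, elapsed_s: float) -> dict:
    """Compute request and token rates for a fixed-length benchmark run."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    total_tokens = num_prompts * (input_len + output_len)
    output_tokens = num_prompts * output_len
    return {
        "requests_per_s": num_prompts / elapsed_s,
        "total_tokens_per_s": total_tokens / elapsed_s,
        "output_tokens_per_s": output_tokens / elapsed_s,
    }


# Arguments from the commands above; 20.0 s is a made-up elapsed time.
stats = throughput(num_prompts=100, input_len=256, output_len=256, elapsed_s=20.0)
print(stats)
```

For these arguments the run processes 51,200 tokens in total, of which 25,600 are generated.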
C. Latency (vLLM)
Original Model
vllm bench latency --model Qwen/Qwen3-8B --input-len 256 --output-len 256 --batch-size 1
Quantized Model
vllm bench latency --model namgyu-youn/Qwen3-8B-W8A8-INT --input-len 256 --output-len 256 --batch-size 1
Summary
| Benchmark | Qwen/Qwen3-8B | namgyu-youn/Qwen3-8B-W8A8-INT |
| --- | --- | --- |
| Latency (ms) | - | - |
Resources