---
tags:
- quantization
- awq
- gptq
- 4-bit
- fp8-kv-cache
base_model: LGAI-EXAONE/EXAONE-4.0-1.2B
license: apache-2.0
language:
- en
---

# EXAONE-4.0-1.2B AWQ (MLP) + GPTQ (Attention) W4A16

## 01. Quick Start

```python
from vllm import LLM

# Load the 4-bit checkpoint with an FP8 KV cache
llm = LLM(
    model="namgyu-youn/EXAONE-4.0-1.2B-LLMC-AWQ-W4",
    dtype="bfloat16",
    kv_cache_dtype="fp8",
)
```

## 02. Benchmark

```bash
Accuracy (GSM8K, exact match):
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.0098|±  |0.0044|
|     |       |strict-match    |     5|exact_match|↑  |0.0000|±  |0.0000|

Throughput: 5.01 requests/s, 5774.56 total tokens/s, 641.62 output tokens/s
```

To reproduce:

```bash
lm_eval --model vllm \
  --model_args pretrained=namgyu-youn/EXAONE-4.0-1.2B-LLMC-AWQ-W4,dtype=float16,gpu_memory_utilization=0.85,enable_thinking=False,max_gen_toks=2048,max_model_len=8192,enforce_eager=True \
  --tasks gsm8k \
  --limit 512 \
  --output_path results \
  --apply_chat_template \
  --batch_size auto

# For gptqmodel checkpoints (native GPTQ format), omit --quantization
# (vLLM auto-selects gptq_marlin)
vllm bench throughput \
  --input-len 256 \
  --output-len 256 \
  --model namgyu-youn/EXAONE-4.0-1.2B-LLMC-AWQ-W4 \
  --num-prompts 100 \
  --max-model-len 4096 \
  --enforce-eager
```
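
The three figures on the `Throughput:` line can be broken down further. The sketch below is only an illustration: the helper name `summarize_throughput` is hypothetical, and it assumes that "total tokens/s" counts prompt tokens plus generated tokens. It derives the prompt-token rate and the average tokens per request.

```python
def summarize_throughput(requests_per_s, total_tokens_per_s, output_tokens_per_s):
    """Derive per-request averages from a vLLM throughput line (hypothetical helper)."""
    # Assumption: total tokens = prompt tokens + generated tokens,
    # so the prompt-token rate is the difference of the two reported rates.
    input_tokens_per_s = total_tokens_per_s - output_tokens_per_s
    return {
        "input_tokens_per_s": input_tokens_per_s,
        "avg_input_tokens_per_request": input_tokens_per_s / requests_per_s,
        "avg_output_tokens_per_request": output_tokens_per_s / requests_per_s,
    }


# Figures copied from the Throughput line in the Benchmark section
summary = summarize_throughput(5.01, 5774.56, 641.62)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Comparing the per-request averages against the `--input-len` and `--output-len` values passed to `vllm bench throughput` is a quick way to check that a reported line matches the command that produced it.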