## openPangu-R-7B-2512 Deployment Guide on [vllm-ascend](https://github.com/vllm-project/vllm-ascend)

### Deployment Environment

openPangu-R-7B-2512 can be deployed on Atlas 800T A2 (64 GB).

### Building and Starting the A2 Image

Pull the base image:

```
docker pull quay.io/ascend/cann:8.3.rc1.alpha003-910b-ubuntu22.04-py3.11
```

Build the image with the Dockerfile:

```
IMAGE=quay.io/ascend/cann:8.3.rc1.alpha003-910b-ubuntu22.04-py3.11-vllm0.11
docker build -t $IMAGE -f ./Dockerfile .
```

Start the container:

```
export IMAGE=quay.io/ascend/cann:8.3.rc1.alpha003-910b-ubuntu22.04-py3.11-vllm0.11  # Use correct image id
export NAME=XXX  # Custom docker name

# Run the container using the defined variables
# Note: if you are running a bridge network with docker, expose the ports needed for multi-node communication in advance
# To prevent device interference from other docker containers, add the argument "--privileged"
docker run -itd \
  --privileged \
  --ipc=host \
  --name $NAME \
  --network host \
  --device /dev/davinci0 \
  --device /dev/davinci1 \
  --device /dev/davinci2 \
  --device /dev/davinci3 \
  --device /dev/davinci4 \
  --device /dev/davinci5 \
  --device /dev/davinci6 \
  --device /dev/davinci7 \
  --device /dev/davinci_manager \
  --device /dev/devmm_svm \
  --device /dev/hisi_hdc \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
  -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
  -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
  -v /etc/ascend_install.info:/etc/ascend_install.info \
  -v /mnt/:/mnt/ \
  -v /data:/data \
  -v /home/work:/home/work \
  --entrypoint /bin/bash \
  $IMAGE
```

The model weights and this project's code must be accessible inside the container. If you are not already inside the container, enter it as the root user, then install the dependencies and the custom operators:

```
docker exec -itu root $NAME /bin/bash
cd inference
pip install -r requirements.txt
bash ./cann910B-omni_inference_custom_ops-0.7.0-8.3.RC1-linux-aarch64.run --install-path=/usr/local/Ascend/ascend-toolkit/latest/opp
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/omni_custom_ops/bin/set_env.bash
pip install omni_inference_ascendc_custom_ops-0.7.0+8.3.rc1.pta2.7.1-cp311-cp311-linux_aarch64.whl --force-reinstall
```

### openPangu-R-7B-2512 Inference

Launch script: inference/launch.sh

Run:

```
export LOAD_CKPT_DIR=XXX/checkpoint/  # The pangu_7b bf16 weight
bash inference/launch.sh
```

Note that the sample script below passes `$LOCAL_CKPT_DIR` to `--model`; make sure the variable you export matches the name that `inference/launch.sh` actually reads.

Launch script example:

```
# HOST=127.0.0.1 (localhost) means the server can only be accessed from the host machine itself.
# HOST=0.0.0.0 allows the vLLM server to be accessed from other devices on the same network, or even from the internet, provided the network is configured accordingly (e.g. firewall rules, port forwarding).
HOST=xxx.xxx.xxx.xxx

python $SCRIPT_DIR/vllm_register.py \
    --model $LOCAL_CKPT_DIR \
    --served-model-name ${SERVED_MODEL_NAME:=pangu_7b} \
    --tensor-parallel-size ${TENSOR_PARALLEL_SIZE:=8} \
    --trust-remote-code \
    --host $HOST \
    --port ${PORT:=8000} \
    --max-num-seqs ${MAX_NUM_SEQS:=256} \
    --max-model-len ${MAX_MODEL_LEN:=40960} \
    --tokenizer-mode "slow" \
    --dtype bfloat16 \
    --enable-log-requests \
    --distributed-executor-backend mp \
    --gpu-memory-utilization 0.9 \
    --max-num-batched-tokens ${MAX_NUM_BATCHED_TOKENS:=4096} \
    --no-enable-prefix-caching \
    --enforce-eager \
    --reasoning-parser pangu
```

### Sending a Test Request

Once the service is up, send a test request:

```
MASTER_NODE_IP=xxx.xxx.xxx.xxx  # server node ip
PORT=8000                       # must match the port used by launch.sh (default shown)
SERVED_MODEL_NAME=pangu_7b      # must match --served-model-name (default shown)

curl http://${MASTER_NODE_IP}:${PORT}/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "'$SERVED_MODEL_NAME'",
        "messages": [
            {
                "role": "user",
                "content": "Who are you?"
            }
        ],
        "max_tokens": 512,
        "temperature": 0
    }'
```
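
The same test request can also be sent from Python. The following is a minimal sketch using only the standard library, not part of this project: the helper names `build_chat_request`, `extract_reply`, and `chat` are hypothetical, and the `reasoning_content` field is an assumption about how the server returns reasoning text when `--reasoning-parser` is enabled. If that field is absent, the helper simply returns `None` for it.

```python
import json
import urllib.request


def build_chat_request(host, port, model, prompt, max_tokens=512):
    """Build a POST request for the /v1/chat/completions endpoint.

    The payload mirrors the curl example: one user message,
    max_tokens=512, temperature=0.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    return urllib.request.Request(
        url=f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response):
    """Return (reasoning, answer) from a parsed chat-completion response.

    Assumption: reasoning text, if any, is in message["reasoning_content"].
    A missing field yields None rather than raising.
    """
    message = response["choices"][0]["message"]
    return message.get("reasoning_content"), message.get("content")


def chat(host, port, model, prompt, timeout=300):
    """Send one prompt to a running server and return (reasoning, answer)."""
    request = build_chat_request(host, port, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_reply(json.load(resp))
```

With the server running on the defaults from the launch script, a call such as `reasoning, answer = chat("127.0.0.1", 8000, "pangu_7b", "Who are you?")` should return the model's reply; adjust the host, port, and model name to match your deployment.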