Deploying and Testing Qwen Large Models on a Huawei Cloud Bare-Metal Ascend 910B3 NPU Server

1. Deploying and Testing a Qwen Large Model on a Huawei Cloud Bare-Metal Ascend 910B3 NPU Server

1.1 Reference documentation:

https://support.huaweicloud.com/bestpractice-modelarts/modelarts_llm_infer_5905023.html

1.2 Check the environment:

npu-smi info                    # run on each instance node to see the NPU card status
npu-smi info -l | grep Total    # run on each instance node to see the total card count and confirm all cards are mounted
npu-smi info -t board -i 1 | egrep -i "software|firmware"    # check the driver and firmware versions
cat /usr/local/Ascend/ascend-toolkit/latest/arm64-linux/ascend_toolkit_install.info    # check the CANN toolkit version

(Screenshot: Ascend 910B3 server environment)
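To make this check repeatable, the total card count can be asserted in a small script. A minimal sketch, parsing the same npu-smi info -l | grep Total output used above; EXPECTED=8 is an assumption for an 8-card 910B3 node, adjust to your flavor:

#!/usr/bin/env bash
# Sanity check: confirm the expected number of Ascend NPUs is visible on this node.
EXPECTED=8    # assumed 8-card node; adjust as needed
FOUND=$(npu-smi info -l | grep -i Total | grep -o '[0-9]\+' | head -n 1)
if [ "${FOUND}" = "${EXPECTED}" ]; then
    echo "OK: ${FOUND} NPUs visible"
else
    echo "WARNING: expected ${EXPECTED} NPUs, found ${FOUND:-none}" >&2
    exit 1
fi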

1.3 Image used:

docker pull swr.cn-east-4.myhuaweicloud.com/ascendcloud/llm_inference:905_20250624

1.4 Inference container launch script:


# 34608b73e3c4 is the local image ID of the llm_inference image pulled in 1.3 (see docker images)
docker run -itd \
    --device=/dev/davinci0 \
    --device=/dev/davinci1 \
    --device=/dev/davinci2 \
    --device=/dev/davinci3 \
    --device=/dev/davinci4 \
    --device=/dev/davinci5 \
    --device=/dev/davinci6 \
    --device=/dev/davinci7 \
    -v /etc/localtime:/etc/localtime \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    --device=/dev/davinci_manager \
    --device=/dev/devmm_svm \
    --device=/dev/hisi_hdc \
    -v /var/log/npu/:/usr/slog \
    -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
    -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
    -v /root/models_dir:/models_dir \
    --net=host \
    --name llm_inference \
    34608b73e3c4 \
    /bin/bash
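Once the container is up, a quick check confirms that it is running and that the NPUs are visible from inside it (a sketch; names match the run command above, and npu-smi is available inside because of the bind mount):

docker ps --filter name=llm_inference       # the container should show as Up
docker exec llm_inference npu-smi info      # all eight davinci devices should be listed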

1.5 Enter the container:

docker exec -it -u ma-user ${container_name} /bin/bash
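Here ${container_name} is the name assigned via --name in the run command above, so in this walkthrough the concrete command is:

docker exec -it -u ma-user llm_inference /bin/bash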

1.6 Start online inference:

export ASCEND_TURBO_TASK_QUEUE=0
export CPU_AFFINITY_CONF=1
export VLLM_USE_V1=0
export HCCL_OP_EXPANSION_MODE=AIV
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
# Select the visible cards; adjust as needed. The card count must match --tensor-parallel-size below!
export ASCEND_RT_VISIBLE_DEVICES=0,1
# Not needed for single-card deployments
export USE_MM_ALL_REDUCE_OP=1

python -m vllm.entrypoints.openai.api_server --model ${container_model_path} \
    --max-num-seqs=256 \
    --max-model-len=4096 \
    --max-num-batched-tokens=4096 \
    --tensor-parallel-size=2 \
    --block-size=128 \
    --host=${docker_ip} \
    --port=8080 \
    --gpu-memory-utilization=0.9 \
    --trust-remote-code \
    --enforce-eager
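Weight loading takes a while, so it helps to wait for the server before sending requests. A minimal sketch, assuming the server is reachable at 127.0.0.1:8080:

# Poll the OpenAI-compatible /v1/models endpoint until the server responds.
until curl -sf http://127.0.0.1:8080/v1/models > /dev/null; do
    echo "waiting for vLLM server..."
    sleep 5
done
echo "server is up"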

1.7 Test the inference endpoint:

curl -X POST http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/models_dir/Qwen3-32B",
        "messages": [
            {
                "role": "user",
                "content": "你好"
            }
        ],
        "max_tokens": 2000,
        "top_k": -1,
        "top_p": 1,
        "temperature": 0,
        "ignore_eos": false,
        "stream": false
    }'
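To extract just the generated text from the JSON response, the output can be piped through jq (a sketch; assumes jq is installed on the host):

curl -s -X POST http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/models_dir/Qwen3-32B", "messages": [{"role": "user", "content": "你好"}], "max_tokens": 200}' \
    | jq -r '.choices[0].message.content'    # print only the assistant reply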

2. Deploying a Fine-Tuned Qwen2-7B Model

2.1 Reference documentation:

https://github.com/hiyouga/LLaMA-Factory

2.2 Location of the fine-tuned model:

# Location of the fine-tuned model
(base) [zhaoxiaokang@sf-k8s-03 LLaMA-Factory]$ cat examples/inference/qwen2_7b_pq_lora_sft.yaml
model_name_or_path: /data1/zhaoxiaokang/pretrain_model/Qwen2-7B-Instruct
adapter_name_or_path: /data1/zhaoxiaokang/workspace/llm-funtuning/LLaMA-Factory/saves/Qwen2-7B-Chat/lora/train_qwen2_7b_policy_20240903/checkpoint-best
template: qwen
finetuning_type: lora
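Before merging, the adapter can be smoke-tested interactively; LLaMA-Factory can chat directly from this same YAML (run from the LLaMA-Factory repository root):

llamafactory-cli chat examples/inference/qwen2_7b_pq_lora_sft.yaml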

2.3 Enter the container:


docker exec -it -u ma-user llm_inference /bin/bash

Download LLaMA-Factory:

git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory

Install LLaMA-Factory:

pip install -e ".[torch-npu,metrics,vllm]"

Merge the weights (LoRA adapter into a full model):

export ASCEND_RT_VISIBLE_DEVICES=2,3    # select cards
llamafactory-cli export \
    --model_name_or_path /models_dir/Qwen2-7B-Instruct \
    --adapter_name_or_path /models_dir/checkpoint-best \
    --template qwen \
    --finetuning_type lora \
    --export_dir /home/ma-user/qwen2-7b-lora-merged \
    --export_size 2
# --device npu
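After the export finishes, the merged directory should contain the full weights split into roughly 2 GB shards (per --export_size 2). A quick check:

ls -lh /home/ma-user/qwen2-7b-lora-merged    # expect sharded *.safetensors files plus config and tokenizer files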

2.4 Deploy:

export ASCEND_TURBO_TASK_QUEUE=0
export CPU_AFFINITY_CONF=1
export VLLM_USE_V1=0
export HCCL_OP_EXPANSION_MODE=AIV
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
# Select the visible cards; adjust as needed. The card count must match --tensor-parallel-size below!
export ASCEND_RT_VISIBLE_DEVICES=2,3
# Not needed for single-card deployments
export USE_MM_ALL_REDUCE_OP=1

python -m vllm.entrypoints.openai.api_server --model /home/ma-user/qwen2-7b-lora-merged \
    --max-num-seqs=256 \
    --max-model-len=4096 \
    --max-num-batched-tokens=4096 \
    --tensor-parallel-size=2 \
    --block-size=128 \
    --host=0.0.0.0 \
    --port=8081 \
    --gpu-memory-utilization=0.9 \
    --trust-remote-code \
    --enforce-eager
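As in 1.6, wait for the server to come up, then confirm the merged model is registered (a sketch; assumes jq on the host):

curl -s http://127.0.0.1:8081/v1/models | jq -r '.data[].id'    # should print /home/ma-user/qwen2-7b-lora-merged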

2.5 Test the inference endpoint:

curl -X POST http://127.0.0.1:8081/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/home/ma-user/qwen2-7b-lora-merged",
        "messages": [
            {
                "role": "system",
                "content": "你是一个可以对问题做‘意图识别’和‘实体识别’的模型,请按以下步骤执行信息识别和抽取,要抽取问句中有的数据,不要凭空捏造:\n1、意图识别,意图分类为【申报项目,政策法规, 政策依据,政策汇编,政策匹配,政策对比,奖补立项,其他】其中一个。\n2、实体类别,实体类型为【地区,主管部门类型,主管部门,政策主题,政策级别,扶持对象,申报状态,项目名称,政策类别,政策名称,日期时间,企业名称,项目主题,产业领域,关键词】其中一个或多个。\n3、意图识别和实体识别的结果用json格式输出,json格式为{\"intention_type\":`意图分类`,\"china_region\":`地区`,\"competent_department_type\":`主管部门类型`,\"competent_department\":`主管部门`,\"policy_category\":`政策主题`,\"policy_level\":`政策级别`,\"policy_type\":`扶持对象`,\"apply_status\":`申报状态`,\"project_name\":`项目名称`,\"policy_classify\":`政策类别`,\"policy_name\":`政策名称`,\"date_time\":`日期时间`,\"company_name\":`企业名称`,\"project_topic\":`项目主题`,\"publish_year\":`发布年份`,\"domain_field\":`产业领域`,\"key_word\":`关键词`}\n4、请给只我步骤3的json数据"
            },
            {
                "role": "user",
                "content": "问题:广州市人民政府办公厅关于促进汽车产业加快发展的意见"
            }
        ],
        "temperature": 0.9,
        "top_p": 0.5,
        "n": 1,
        "max_tokens": 500,
        "stream": false,
        "cut_max_length": 500
    }'
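The same endpoint also supports token streaming for interactive use; a streaming variant of the request (a sketch) flips the stream flag and disables curl output buffering:

curl -N -X POST http://127.0.0.1:8081/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/home/ma-user/qwen2-7b-lora-merged", "messages": [{"role": "user", "content": "你好"}], "max_tokens": 100, "stream": true}'
# -N disables buffering so the server-sent "data:" chunks print as they arrive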

2.6 vLLM-Ascend image:

docker pull m.daocloud.io/quay.io/ascend/vllm-ascend:v0.10.0rc1-310p
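Note that the -310p tag targets Ascend 310P (Atlas 300I-series) devices; for 910B cards, pick the matching tag for your hardware from the same registry. The container can then be launched with the same device and volume mappings as in 1.4; a minimal single-card sketch (the container name vllm_ascend is an assumption):

docker run -itd \
    --device=/dev/davinci0 \
    --device=/dev/davinci_manager \
    --device=/dev/devmm_svm \
    --device=/dev/hisi_hdc \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
    -v /root/models_dir:/models_dir \
    --net=host \
    --name vllm_ascend \
    m.daocloud.io/quay.io/ascend/vllm-ascend:v0.10.0rc1-310p \
    /bin/bash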