Qwen3.6-35B-A3B-AWQ 模型本地部署

发布时间:2026/6/26 18:22:22
Qwen3.6-35B-A3B-AWQ 模型本地部署 1.模型介绍这是 Qwen3.6-35B-A3B 的社区 AWQ 4-bit 无校准量化版25GB 左右单/多卡消费级或中端数据中心卡就能跑适合想本地或私有化部署 Qwen3.6 MoE、又扛不动 FP8 原版显存的同学。代价是>2.模型部署工具vllmdockerpull vllm/vllm-openai:latest3.模型下载魔搭国内社区https://www.modelscope.cn/models/tclf90/Qwen3.6-35B-A3B-AWQ使用vllm docker容器下载模型dockerrun--rm-it\--gpusall\--networkhost\--entrypoint/bin/bash\--pids-limit-1\--security-optseccompunconfined\-v/root/lipengcheng/qwen36_35b:/models\-eOMP_NUM_THREADS8\vllm/vllm-openai:latest\-cpip install modelscope python3 -c\from modelscope import snapshot_download; snapshot_download(tclf90/Qwen3.6-35B-A3B-AWQ, cache_dir/models)\4.模型部署version:3.8services: vllm-qwen36-moe: image: vllm/vllm-openai:latest container_name: vllm-qwen3.6-35b-a3b-awq privileged:truenetwork_mode:hostdeploy: resources: reservations: devices: - driver: nvidia count:2# 吃满两张 T4capabilities:[gpu]volumes:# 左边已经精准替换为你刚才 pwd 出来的绝对路径- /root/lipengcheng/qwen36_35b/tclf90/Qwen3___6-35B-A3B-AWQ:/models environment: -NVIDIA_VISIBLE_DEVICESall -OMP_NUM_THREADS8-VLLM_USE_MODELSCOPEtrue command:/models--host0.0.0.0--port23333--dtypehalf --served-model-name Qwen3.6-35B-A3B-AWQ --tensor-parallel-size2--quantizationawq_marlin --trust-remote-code --gpu-memory-utilization0.96--max-model-len65536--max-num-seqs32--enable-prefix-caching --enable-chunked-prefill --enable-expert-parallel --reasoning-parser qwen3 --tool-call-parser qwen3_coder --enable-auto-tool-choice restart: unless-stopped5.模型测试基础测试curlhttp://localhost:23333/v1/chat/completions\-HContent-Type: application/json\-d{ model: Qwen3.6-35B-A3B-AWQ, messages: [ {role: user, content: 你是谁} ], max_tokens: 128, temperature: 0.7 }流式测试curlhttp://localhost:23333/v1/chat/completions\-HContent-Type: application/json\-d{ model: Qwen3.6-35B-A3B-AWQ, messages: [ {role: user, content: 用 Python 写一个快速排序} ], stream: true }长上下文 Thinking 模式测试curlhttp://localhost:23333/v1/chat/completions\-HContent-Type: application/json\-d{ model: Qwen3.6-35B-A3B-AWQ, messages: [ {role: system, content: /think}, {role: user, content: 一步步推导如果 A 比 B 快B 比 C 快A 和 C 谁快} ] }Tool Calling / Agent 测试Qwen3 Codercurlhttp://localhost:23333/v1/chat/completions\-HContent-Type: application/json\-d{ model: Qwen3.6-35B-A3B-AWQ, messages: [ {role: user, content: 帮我查一下当前目录下的文件} ], tools: [ { type: function, function: { name: list_files, description: 列出目录文件, parameters: { type: object, properties: {} } } } ] }6.总结模型最大可以跑到 57 tokens/s速度还是不错的日常的文案输出普通编码的生成等还是够用的

月新闻