OP | Posted on 2025-8-14 21:04
 
 
 
 
Changing the CUDA version helped.
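(For anyone hitting the same thing, a quick way to confirm which CUDA build vLLM's PyTorch actually uses: python -c "import torch; print(torch.version.cuda)" — torch.version.cuda is the standard PyTorch attribute; the rest of my setup is as shown below.)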
 
FP8 works now.
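For reference, FP8 loads with essentially the same vllm serve command as the FP4 one below, just pointed at the FP8 checkpoint (the path is my guess, assuming it mirrors the FP4 one):

vllm serve /home/Qwen3-235B-A22B-Thinking-2507-FP8 --served-model-name Qwen3-235B-A22B-Thinking-2507-FP8 --max-model-len 201000 --tensor-parallel-size 2 --gpu-memory-utilization 0.9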
 
 
But FP4 still errors out:

vllm serve /home/Qwen3-235B-A22B-Thinking-2507-FP4 --served-model-name Qwen3-235B-A22B-Thinking-2507-FP4 --max-model-len 201000 --tensor-parallel-size 2 --gpu-memory-utilization 0.9
 
(APIServer pid=15764) INFO 08-14 20:58:03 [api_server.py:1805] vLLM API server version 0.10.1.dev628+g00e3f9da4.d20250814 
 
(APIServer pid=15764)   Value error, Unknown quantization method: . Must be one of ['aqlm', 'awq', 'deepspeedfp', 'tpu_int8', 'fp8', 'ptpc_fp8', 'fbgemm_fp8', 'modelopt', 'modelopt_fp4', 'marlin', 'bitblas', 'gguf', 'gptq_marlin_24', 'gptq_marlin', 'gptq_bitblas', 'awq_marlin', 'gptq', 'compressed-tensors', 'bitsandbytes', 'qqq', 'hqq', 'experts_int8', 'neuron_quant', 'ipex', 'quark', 'moe_wna16', 'torchao', 'auto-round', 'rtn', 'inc', 'mxfp4']. [type=value_error, input_value=ArgsKwargs((), {'model': ...attention_dtype': None}), input_type=ArgsKwargs]
(APIServer pid=15764)     For further information visit https://errors.pydantic.dev/2.11/v/value_error 
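The telling part is "Unknown quantization method: ." — the method string vLLM parsed is empty, not an unsupported one, and modelopt_fp4 is right there in the accepted list. Two things that may be worth trying (guesses from the error, not a confirmed fix): check that the checkpoint's config.json actually carries a quantization_config with quant_method set, and/or pass the method explicitly:

vllm serve /home/Qwen3-235B-A22B-Thinking-2507-FP4 --served-model-name Qwen3-235B-A22B-Thinking-2507-FP4 --quantization modelopt_fp4 --max-model-len 201000 --tensor-parallel-size 2 --gpu-memory-utilization 0.9

--quantization is a standard vllm serve flag, and modelopt_fp4 is the NVIDIA ModelOpt NVFP4 backend from the list above, which is what an FP4 export of this model should map to.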
 
 