vllm - ✅(Solved) Fix [Feature]: Support non-standard GGUF quant type prefixes (e.g. Unsloth Dynamic UD-IQ1_S) [1 pull request, 1 comment, 2 participants]

GitHub stats
vllm-project/vllm#39469 · Fetched 2026-04-11 06:13:28
Comments: 1
Participants: 2
Timeline: 4
Reactions: 0
Timeline (top): closed ×1, commented ×1, cross-referenced ×1, labeled ×1

Error Message

vllm serve unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S --tokenizer Qwen/Qwen3-0.6B

huggingface_hub.errors.HFValidationError: Repo id must use alphanumeric chars, '-', '_' or '.'. The name cannot start or end with '-' or '.' and the maximum length is 96: 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S'.

Fix Action

Fixed

PR fix notes

PR #39471: [GGUF] Support non-standard quant types with prefix (e.g. UD-IQ1_S)

Description (problem / solution / changelog)

Purpose

Support non-standard quant types with a prefix (e.g. UD-IQ1_S).

Fixes: #39469

Test Plan

vllm serve unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S --tokenizer Qwen/Qwen3-0.6B

Test Result

<details> <summary>before</summary>
vllm serve unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S --tokenizer Qwen/Qwen3-0.6B
(APIServer pid=2603210) INFO 04-10 09:33:46 [utils.py:299]
(APIServer pid=2603210) INFO 04-10 09:33:46 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=2603210) INFO 04-10 09:33:46 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.1rc1.dev122+g83aea2147
(APIServer pid=2603210) INFO 04-10 09:33:46 [utils.py:299]   █▄█▀ █     █     █     █  model   unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S
(APIServer pid=2603210) INFO 04-10 09:33:46 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=2603210) INFO 04-10 09:33:46 [utils.py:299]
(APIServer pid=2603210) INFO 04-10 09:33:46 [utils.py:233] non-default args: {'model_tag': 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S', 'model': 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S', 'tokenizer': 'Qwen/Qwen3-0.6B'}
(APIServer pid=2603210) Traceback (most recent call last):
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/transformers/utils/hub.py", line 479, in cached_files
(APIServer pid=2603210)     hf_hub_download(
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
(APIServer pid=2603210)     validate_repo_id(arg_value)
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 160, in validate_repo_id
(APIServer pid=2603210)     raise HFValidationError(
(APIServer pid=2603210) huggingface_hub.errors.HFValidationError: Repo id must use alphanumeric chars, '-', '_' or '.'. The name cannot start or end with '-' or '.' and the maximum length is 96: 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S'.
(APIServer pid=2603210)
(APIServer pid=2603210) During handling of the above exception, another exception occurred:
(APIServer pid=2603210)
(APIServer pid=2603210) Traceback (most recent call last):
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/transformers/configuration_utils.py", line 721, in _get_config_dict
(APIServer pid=2603210)     resolved_config_file = cached_file(
(APIServer pid=2603210)                            ^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/transformers/utils/hub.py", line 322, in cached_file
(APIServer pid=2603210)     file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
(APIServer pid=2603210)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/transformers/utils/hub.py", line 532, in cached_files
(APIServer pid=2603210)     _get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision, repo_type)
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/transformers/utils/hub.py", line 143, in _get_cache_file_to_return
(APIServer pid=2603210)     resolved_file = try_to_load_from_cache(
(APIServer pid=2603210)                     ^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
(APIServer pid=2603210)     validate_repo_id(arg_value)
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 160, in validate_repo_id
(APIServer pid=2603210)     raise HFValidationError(
(APIServer pid=2603210) huggingface_hub.errors.HFValidationError: Repo id must use alphanumeric chars, '-', '_' or '.'. The name cannot start or end with '-' or '.' and the maximum length is 96: 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S'.
(APIServer pid=2603210)
(APIServer pid=2603210) During handling of the above exception, another exception occurred:
(APIServer pid=2603210)
(APIServer pid=2603210) Traceback (most recent call last):
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/bin/vllm", line 10, in <module>
(APIServer pid=2603210)     sys.exit(main())
(APIServer pid=2603210)              ^^^^^^
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=2603210)     args.dispatch_function(args)
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=2603210)     uvloop.run(run_server(args))
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=2603210)     return __asyncio.run(
(APIServer pid=2603210)            ^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=2603210)     return runner.run(main)
(APIServer pid=2603210)            ^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=2603210)     return self._loop.run_until_complete(task)
(APIServer pid=2603210)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=2603210)     return await main
(APIServer pid=2603210)            ^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/vllm/entrypoints/openai/api_server.py", line 686, in run_server
(APIServer pid=2603210)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/vllm/entrypoints/openai/api_server.py", line 700, in run_server_worker
(APIServer pid=2603210)     async with build_async_engine_client(
(APIServer pid=2603210)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=2603210)     return await anext(self.gen)
(APIServer pid=2603210)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=2603210)     async with build_async_engine_client_from_engine_args(
(APIServer pid=2603210)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=2603210)     return await anext(self.gen)
(APIServer pid=2603210)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/vllm/entrypoints/openai/api_server.py", line 124, in build_async_engine_client_from_engine_args
(APIServer pid=2603210)     vllm_config = engine_args.create_engine_config(usage_context=usage_context)
(APIServer pid=2603210)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/vllm/engine/arg_utils.py", line 1574, in create_engine_config
(APIServer pid=2603210)     maybe_override_with_speculators(
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/vllm/transformers_utils/config.py", line 584, in maybe_override_with_speculators
(APIServer pid=2603210)     config_dict, _ = PretrainedConfig.get_config_dict(
(APIServer pid=2603210)                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/transformers/configuration_utils.py", line 662, in get_config_dict
(APIServer pid=2603210)     config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
(APIServer pid=2603210)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/transformers/configuration_utils.py", line 744, in _get_config_dict
(APIServer pid=2603210)     raise OSError(
(APIServer pid=2603210) OSError: Can't load the configuration of 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S' is the correct path to a directory containing a config.json file
</details> <details> <summary>after</summary>
vllm serve unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S --tokenizer Qwen/Qwen3-0.6B
(APIServer pid=2598019) INFO 04-10 09:31:38 [utils.py:299]
(APIServer pid=2598019) INFO 04-10 09:31:38 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=2598019) INFO 04-10 09:31:38 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.1rc1.dev122+g83aea2147
(APIServer pid=2598019) INFO 04-10 09:31:38 [utils.py:299]   █▄█▀ █     █     █     █  model   unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S
(APIServer pid=2598019) INFO 04-10 09:31:38 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=2598019) INFO 04-10 09:31:38 [utils.py:299]
(APIServer pid=2598019) INFO 04-10 09:31:38 [utils.py:233] non-default args: {'model_tag': 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S', 'model': 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S', 'tokenizer': 'Qwen/Qwen3-0.6B'}
(APIServer pid=2598019) WARNING 04-10 09:31:38 [gguf_utils.py:60] Non-standard GGUF quant type 'UD-IQ1_S' detected.
(APIServer pid=2598019) INFO 04-10 09:31:39 [model.py:554] Resolved architecture: Qwen3ForCausalLM
(APIServer pid=2598019) INFO 04-10 09:31:39 [model.py:1684] Using max model len 40960
(APIServer pid=2598019) INFO 04-10 09:31:39 [vllm.py:799] Asynchronous scheduling is enabled.
(APIServer pid=2598019) INFO 04-10 09:31:39 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(EngineCore pid=2598461) INFO 04-10 09:31:44 [core.py:107] Initializing a V1 LLM engine (v0.19.1rc1.dev122+g83aea2147) with config: model='unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S', speculative_config=None, tokenizer='Qwen/Qwen3-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=gguf, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=gguf, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 
'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto')
(EngineCore pid=2598461) INFO 04-10 09:31:44 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.16.1.10:52857 backend=nccl
(EngineCore pid=2598461) INFO 04-10 09:31:44 [parallel_state.py:1713] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=2598461) INFO 04-10 09:31:45 [gpu_model_runner.py:4735] Starting to load model unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S...
Qwen3-0.6B-UD-IQ1_S.gguf: 100%|████████████████████████████████████████████████████████████████████████████████| 215M/215M [00:21<00:00, 9.93MB/s]
(EngineCore pid=2598461) INFO 04-10 09:32:07 [weight_utils.py:615] Time spent downloading weights for unsloth/Qwen3-0.6B-GGUF: 22.251029 seconds
(EngineCore pid=2598461) INFO 04-10 09:32:16 [cuda.py:362] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=2598461) INFO 04-10 09:32:16 [flash_attn.py:636] Using FlashAttention version 2
(EngineCore pid=2598461) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore pid=2598461) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(EngineCore pid=2598461) INFO 04-10 09:32:21 [gpu_model_runner.py:4820] Model loading took 0.22 GiB memory and 35.669191 seconds
(EngineCore pid=2598461) INFO 04-10 09:32:24 [backends.py:1055] Using cache directory: /home/name/.cache/vllm/torch_compile_cache/564aa12500/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=2598461) INFO 04-10 09:32:24 [backends.py:1115] Dynamo bytecode transform time: 3.07 s
(EngineCore pid=2598461) INFO 04-10 09:32:26 [backends.py:373] Cache the graph of compile range (1, 2048) for later use
(EngineCore pid=2598461) INFO 04-10 09:32:33 [backends.py:391] Compiling a graph for compile range (1, 2048) takes 8.53 s
(EngineCore pid=2598461) INFO 04-10 09:32:35 [decorators.py:655] saved AOT compiled function to /home/name/.cache/vllm/torch_compile_cache/torch_aot_compile/d5db8a5d1bc2f897526bb947908032d2f1ae13b65f8af58e817018da7e2e59ce/rank_0_0/model
(EngineCore pid=2598461) INFO 04-10 09:32:35 [monitor.py:48] torch.compile took 13.63 s in total
(EngineCore pid=2598461) INFO 04-10 09:32:35 [monitor.py:76] Initial profiling/warmup run took 0.24 s
(EngineCore pid=2598461) INFO 04-10 09:32:35 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512
(EngineCore pid=2598461) INFO 04-10 09:32:35 [gpu_model_runner.py:5893] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=35 (largest=256)
(EngineCore pid=2598461) INFO 04-10 09:32:36 [gpu_model_runner.py:5972] Estimated CUDA graph memory: 0.64 GiB total
(EngineCore pid=2598461) INFO 04-10 09:32:36 [gpu_worker.py:436] Available KV cache memory: 20.23 GiB
(EngineCore pid=2598461) INFO 04-10 09:32:36 [gpu_worker.py:470] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.9000 to 0.9274 to maintain the same effective KV cache size.
(EngineCore pid=2598461) INFO 04-10 09:32:36 [kv_cache_utils.py:1319] GPU KV cache size: 189,408 tokens
(EngineCore pid=2598461) INFO 04-10 09:32:36 [kv_cache_utils.py:1324] Maximum concurrency for 40,960 tokens per request: 4.62x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|████████████████████████████████████████████████████| 51/51 [00:01<00:00, 44.49it/s]
Capturing CUDA graphs (decode, FULL): 100%|███████████████████████████████████████████████████████████████████████| 35/35 [00:00<00:00, 52.28it/s]
(EngineCore pid=2598461) INFO 04-10 09:32:39 [gpu_model_runner.py:6063] Graph capturing finished in 2 secs, took 0.72 GiB
(EngineCore pid=2598461) INFO 04-10 09:32:39 [gpu_worker.py:597] CUDA graph pool memory: 0.72 GiB (actual), 0.64 GiB (estimated), difference: 0.07 GiB (10.1%).
(EngineCore pid=2598461) INFO 04-10 09:32:39 [core.py:285] init engine (profile, create kv cache, warmup model) took 18.07 seconds
(EngineCore pid=2598461) INFO 04-10 09:32:41 [vllm.py:799] Asynchronous scheduling is enabled.
(EngineCore pid=2598461) INFO 04-10 09:32:41 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=2598019) INFO 04-10 09:32:41 [api_server.py:606] Supported tasks: ['generate']
(APIServer pid=2598019) INFO 04-10 09:32:43 [hf.py:314] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=2598019) INFO 04-10 09:32:43 [api_server.py:610] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:37] Available routes are:
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /generative_scoring, Methods: POST
(APIServer pid=2598019) INFO:     Started server process [2598019]
(APIServer pid=2598019) INFO:     Waiting for application startup.
(APIServer pid=2598019) INFO:     Application startup complete.
</details>

Changed files

  • tests/transformers_utils/test_utils.py (modified, +27/-0)
  • vllm/transformers_utils/gguf_utils.py (modified, +38/-3)


🚀 The feature, motivation and pitch

GGUF models whose quant type carries a non-standard prefix, such as Unsloth Dynamic 2.0 (UD-), cannot be loaded via the repo_id:quant_type format.

vllm serve unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S --tokenizer Qwen/Qwen3-0.6B

huggingface_hub.errors.HFValidationError: Repo id must use alphanumeric chars, '-', '_' or '.'. The name cannot start or end with '-' or '.' and the maximum length is 96: 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S'.

Currently, is_remote_gguf() validates quant types against GGMLQuantizationType members and a hardcoded suffix list (_M, _S, _L, etc.). Prefixed types like UD-IQ1_S are rejected, and the model string falls through to HuggingFace Hub as a plain repo ID.
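
The rejection described above can be illustrated with a small stand-alone sketch. It is not vLLM's implementation: KNOWN_TYPES is a hypothetical subset standing in for the GGMLQuantizationType members, SIZE_SUFFIXES stands in for the hardcoded suffix list, and the helper names split_model and is_known_quant are made up for this example.

```python
# Hypothetical stand-ins: a small subset of names, not the real gguf enum
# and not the actual suffix list used by vLLM.
KNOWN_TYPES = {"Q4_0", "Q4_K", "Q8_0", "IQ1_S"}
SIZE_SUFFIXES = ("_M", "_S", "_L")


def split_model(model: str):
    """Split "<repo_id>:<quant_type>" on the last colon."""
    repo_id, sep, quant_type = model.rpartition(":")
    return (repo_id, quant_type) if sep else (model, None)


def is_known_quant(quant_type: str) -> bool:
    """Strict check: a known type, or a known type plus a size suffix."""
    if quant_type in KNOWN_TYPES:
        return True
    for suffix in SIZE_SUFFIXES:
        if quant_type.endswith(suffix) and quant_type[: -len(suffix)] in KNOWN_TYPES:
            return True
    return False


repo_id, quant_type = split_model("unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S")
print(repo_id, quant_type, is_known_quant(quant_type))
```

Under these assumptions, Q4_K_M and IQ1_S pass the strict check while UD-IQ1_S does not, so the whole string, colon included, would be treated as a plain repo ID.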

Since quant_type is only used for glob file matching (*-{quant_type}.gguf) and not for the actual quantization logic, which is read from the GGUF binary headers, accepting non-standard prefixed names is safe.
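
The glob matching mentioned above can be sketched with the standard library. This is a minimal illustration rather than vLLM's code: the helper name matching_gguf_files and the second and third file names are hypothetical, while Qwen3-0.6B-UD-IQ1_S.gguf is the file name that appears in the PR's download log.

```python
from fnmatch import fnmatchcase


def matching_gguf_files(filenames, quant_type):
    """Return the file names that match the *-{quant_type}.gguf glob."""
    pattern = f"*-{quant_type}.gguf"
    return [name for name in filenames if fnmatchcase(name, pattern)]


# Hypothetical repository listing; only the first name comes from the log.
files = [
    "Qwen3-0.6B-UD-IQ1_S.gguf",
    "Qwen3-0.6B-Q4_K_M.gguf",
    "README.md",
]

print(matching_gguf_files(files, "UD-IQ1_S"))
```

Note that under this glob a plain IQ1_S would also match the UD- file, because * absorbs everything before the final "-IQ1_S.gguf". Whether vLLM disambiguates that case is not stated in the issue.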

Alternatives

No response

Additional context

No response


extent analysis

TL;DR

Relax the validation in is_remote_gguf() so that quant types with a non-standard prefix, such as UD-, are accepted.

Guidance

  • Relax the validation in is_remote_gguf() so that quant types with a non-standard prefix are accepted.
  • Extend the check beyond the hardcoded suffix list (_M, _S, _L, etc.) so that a prefix in front of the base type is also handled.
  • Consider adding a configuration option that lets users specify custom quant type prefixes.
  • Verify that the updated function handles both standard quant types and prefixed ones.
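
The verification step in the list above could look roughly like the following. This is a sketch under stated assumptions, not the logic merged in the PR: STANDARD_TYPES is a hypothetical subset of standard names, and accepts_quant_type is a made-up helper that allows one alphanumeric prefix followed by a hyphen.

```python
# Hypothetical subset of standard quant type names, for illustration only.
STANDARD_TYPES = {"Q4_0", "Q4_K_M", "Q8_0", "IQ1_S"}


def accepts_quant_type(quant_type: str) -> bool:
    """Accept a standard name, optionally preceded by "<alnum-prefix>-"."""
    if quant_type in STANDARD_TYPES:
        return True
    prefix, sep, base = quant_type.partition("-")
    return bool(sep) and prefix.isalnum() and base in STANDARD_TYPES


# Standard names, a prefixed name, and a few malformed inputs.
cases = {
    "Q4_K_M": True,
    "IQ1_S": True,
    "UD-IQ1_S": True,
    "UD-": False,
    "-IQ1_S": False,
    "main": False,
}
for quant_type, expected in cases.items():
    assert accepts_quant_type(quant_type) is expected, quant_type
```

Checking malformed inputs alongside the valid ones guards against a relaxed rule that accepts any string containing a hyphen.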

Example

# Illustrative sketch of a relaxed is_remote_gguf(); not the code merged in the PR.
def is_remote_gguf(model: str) -> bool:
    # Expect "<repo_id>:<quant_type>" and split on the last colon.
    repo_id, sep, quant_type = model.rpartition(":")
    if not sep or not repo_id or not quant_type:
        return False
    # Accept a quant type that starts with the custom "UD-" prefix.
    if quant_type.startswith("UD-"):
        return True
    # Fall back to original logic
    # ...
    # Placeholder so the sketch always returns a bool.
    return False
Notes

The solution assumes that the is_remote_gguf() function is the only place where the validation logic needs to be updated. Additional changes may be required if the validation logic is used elsewhere in the codebase.

Recommendation

Update is_remote_gguf() to accept non-standard quant type prefixes, so that custom quantization names resolve to GGUF files instead of being passed to the Hugging Face Hub as a plain repo ID. PR #39471 is the fix linked to this issue.
