vllm - [RFC]: Add Configuration API [1 pull request, 1 participant]

GitHub stats
vllm-project/vllm#38147 (fetched 2026-04-08 01:32:03)
Comments: 0 · Participants: 1 · Timeline events: 6 · Reactions: 0
Timeline (top): subscribed ×2, cross-referenced ×1, labeled ×1, mentioned ×1

Fix Action

Fixed

PR fix notes

PR #38149: [WIP][HMA] Add configuration API

Description (problem / solution / changelog)

Purpose

Add an endpoint to the REST API server that returns static inference configuration properties, for use by prefix-cache-aware routing techniques.

Properties like the following:

  • KV-cache capacity, block sizes, etc., across tiers
  • DP ranks and port mappings
  • HMA settings

Closes #38147

Test Plan

Test Result



Changed files

  • vllm/engine/protocol.py (modified, +8/-0)
  • vllm/entrypoints/openai/api_server.py (modified, +4/-0)
  • vllm/entrypoints/serve/__init__.py (modified, +6/-0)
  • vllm/entrypoints/serve/config/__init__.py (added, +2/-0)
  • vllm/entrypoints/serve/config/api_router.py (added, +219/-0)
  • vllm/entrypoints/serve/config/protocol.py (added, +199/-0)
  • vllm/v1/engine/async_llm.py (modified, +10/-0)
  • vllm/v1/engine/core.py (modified, +111/-1)
  • vllm/v1/engine/core_client.py (modified, +12/-0)
  • vllm/v1/worker/gpu_worker.py (modified, +16/-0)


🚀 The feature, motivation and pitch

Motivation

This RFC proposes a new REST API endpoint that exposes structured inference configuration for a running vLLM server. The data is intended to be consumed by distributed serving platforms such as llm-d for prefix-cache-aware routing. It differs from metrics in that it is static once the server has started.

Endpoint

GET /v1/config or GET /inference/v1/config. The choice depends on whether the vLLM community wants to keep this endpoint separate from the OpenAI-compatible paths.
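Because the data is static, a client such as a router could read the endpoint once at startup. The sketch below is a minimal illustration, assuming the /v1/config path is the one chosen and that the server listens on http://localhost:8000. The helper name fetch_config and its open_fn parameter are hypothetical; open_fn exists only so the example can run without a live server.

```python
import io
import json
import urllib.request


def fetch_config(base_url, path="/v1/config",
                 open_fn=urllib.request.urlopen, timeout=5.0):
    """Fetch the static inference configuration and decode it into a dict.

    `path` is an assumption: the RFC leaves the final route open.
    """
    url = base_url.rstrip("/") + path
    with open_fn(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


requested = []


def fake_open(url, timeout):
    # Stand-in for a running server: records the URL and returns a tiny
    # JSON body shaped like one section of the proposed response.
    requested.append(url)
    return io.BytesIO(b'{"hma": {"enabled": true}}')


config = fetch_config("http://localhost:8000/", open_fn=fake_open)
print(requested[0], config["hma"]["enabled"])
```

Against a real server, the call would omit open_fn so that urllib.request.urlopen performs the request.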

Response Schema

{
  "model": {
    "served_model_names": ["my-model"],
    "dtype": "bfloat16",
    "quantization": null,
    "max_model_len": 32768,
    "max_logprobs": 20
  },
  "kv_cache": {
    "num_gpu_blocks": 1024,
    "num_cpu_blocks": 256,
    "gpu_memory_utilization": 0.9,
    "cache_dtype": "bfloat16",
    "enable_prefix_caching": true,
    "prefix_caching_hash_algo": "sha256",
    "kv_offloading_enabled": false,
    "kv_offloading_backend": null,
    "kv_offloading_size_gib": null,
    "groups": [
      {
        "group_id": 0,
        "layer_names": ["model.layers.0.self_attn", "model.layers.1.self_attn"],
        "spec_type": "FullAttentionSpec",
        "block_size": 16,
        "page_size_bytes": 131072,
        "num_kv_heads": 8,
        "head_size": 128,
        "head_size_v": 128,
        "dtype": "bfloat16",
        "sliding_window": null,
        "attention_chunk_size": null
      },
      {
        "group_id": 1,
        "layer_names": ["model.layers.0.mamba"],
        "spec_type": "MambaSpec",
        "block_size": 1,
        "page_size_bytes": 4096,
        "shapes": [[16, 128]],
        "dtypes": ["float32"],
        "mamba_type": "mamba2",
        "mamba_cache_mode": "none"
      }
    ]
  },
  "scheduler": {
    "max_num_batched_tokens": 2048,
    "max_num_seqs": 128,
    "max_num_partial_prefills": 1,
    "enable_chunked_prefill": true,
    "policy": "fcfs"
  },
  "parallelism": {
    "tensor_parallel_size": 4,
    "pipeline_parallel_size": 1,
    "data_parallel_size": 2,
    "data_parallel_size_local": 1,
    "data_parallel_rank": 0,
    "data_parallel_rank_local": null,
    "data_parallel_master_ip": "127.0.0.1",
    "data_parallel_master_port": 29500,
    "data_parallel_rpc_port": 29550,
    "expert_parallel_enabled": false,
    "prefill_context_parallel_size": 1
  },
  "devices": [
    {
      "rank": 0,
      "name": "A100-PCIE-40GB",
      "total_memory_bytes": 42949672960,
      "compute_capability": {"major": 8, "minor": 0},
      "num_compute_units": 108
    },
    {
      "rank": 1,
      "name": "A100-PCIE-40GB",
      "total_memory_bytes": 42949672960,
      "compute_capability": {"major": 8, "minor": 0},
      "num_compute_units": 108
    }
  ],
  "speculative_decoding": {
    "enabled": false,
    "method": null,
    "num_speculative_tokens": null,
    "draft_model": null
  },
  "lora": {
    "enabled": false,
    "max_lora_rank": null,
    "max_loras": null,
    "max_cpu_loras": null,
    "lora_dtype": null
  },
  "hma": {
    "enabled": true
  }
}
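To illustrate how a router might consume this response, the sketch below pulls out the fields that seem most relevant to prefix-cache-aware routing: whether prefix caching is enabled, the hash algorithm, and the block size of each KV-cache group. The function name routing_params and the choice of fields are assumptions made for illustration, not part of the RFC. The sample is a trimmed copy of the response above.

```python
import json
from typing import Optional

# Trimmed copy of the sample response; only the fields used below are kept.
SAMPLE = json.loads("""
{
  "kv_cache": {
    "num_gpu_blocks": 1024,
    "enable_prefix_caching": true,
    "prefix_caching_hash_algo": "sha256",
    "groups": [
      {"group_id": 0, "spec_type": "FullAttentionSpec", "block_size": 16},
      {"group_id": 1, "spec_type": "MambaSpec", "block_size": 1}
    ]
  }
}
""")


def routing_params(config: dict) -> Optional[dict]:
    """Return the prefix-cache settings a router might need, or None when
    prefix caching is disabled on this server (hypothetical helper)."""
    kv = config["kv_cache"]
    if not kv["enable_prefix_caching"]:
        return None
    return {
        "hash_algo": kv["prefix_caching_hash_algo"],
        "num_gpu_blocks": kv["num_gpu_blocks"],
        # Map each KV-cache group to its block size in tokens.
        "block_size_by_group": {
            g["group_id"]: g["block_size"] for g in kv["groups"]
        },
    }


params = routing_params(SAMPLE)
print(params)
```

Returning None for a server without prefix caching lets the caller fall back to a routing policy that ignores cache locality.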

Pydantic Model Hierarchy

InferenceConfigResponse
├── model: ModelInfo
│   ├── served_model_names: list[str]
│   ├── dtype: str
│   ├── quantization: str | None
│   ├── max_model_len: int
│   └── max_logprobs: int
├── kv_cache: KVCacheInfo
│   ├── num_gpu_blocks: int | None
│   ├── num_cpu_blocks: int | None
│   ├── gpu_memory_utilization: float
│   ├── cache_dtype: str
│   ├── enable_prefix_caching: bool
│   ├── prefix_caching_hash_algo: str
│   ├── kv_offloading_enabled: bool
│   ├── kv_offloading_backend: str | None
│   ├── kv_offloading_size_gib: float | None
│   └── groups: list[KVCacheGroupInfo]
│         KVCacheGroupInfo = Annotated[
│           FullAttentionGroupSpec
│           | MLAAttentionGroupSpec
│           | SlidingWindowGroupSpec
│           | ChunkedLocalAttentionGroupSpec
│           | MambaGroupSpec
│           | CrossAttentionGroupSpec
│           | SinkFullAttentionGroupSpec,
│           Field(discriminator="spec_type")
│         ]
│         Shared base fields (all variants):
│           group_id: int
│           layer_names: list[str]
│           block_size: int
│           page_size_bytes: int
├── scheduler: SchedulerInfo
│   ├── max_num_batched_tokens: int
│   ├── max_num_seqs: int
│   ├── max_num_partial_prefills: int
│   ├── enable_chunked_prefill: bool
│   └── policy: str
├── parallelism: ParallelismInfo
│   ├── tensor_parallel_size: int
│   ├── pipeline_parallel_size: int
│   ├── data_parallel_size: int
│   ├── data_parallel_size_local: int
│   ├── data_parallel_rank: int
│   ├── data_parallel_rank_local: int | None
│   ├── data_parallel_master_ip: str
│   ├── data_parallel_master_port: int
│   ├── data_parallel_rpc_port: int
│   ├── expert_parallel_enabled: bool
│   └── prefill_context_parallel_size: int
├── devices: list[DeviceInfo]
│     DeviceInfo:
│       rank: int
│       name: str
│       total_memory_bytes: int
│       compute_capability: ComputeCapability
│         major: int
│         minor: int
│       num_compute_units: int
├── speculative_decoding: SpeculativeDecodingInfo
│   ├── enabled: bool
│   ├── method: str | None
│   ├── num_speculative_tokens: int | None
│   └── draft_model: str | None
├── lora: LoRAInfo
│   ├── enabled: bool
│   ├── max_lora_rank: int | None
│   ├── max_loras: int | None
│   ├── max_cpu_loras: int | None
│   └── lora_dtype: str | None
└── hma: HMAInfo
    └── enabled: bool
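The groups field is a union discriminated by spec_type, so a consumer has to pick the right variant for each entry. Pydantic does this automatically through Field(discriminator="spec_type"). The sketch below shows the same dispatch using only standard-library dataclasses, so it is a stand-in for the idea rather than the PR's code. The class names and the subset of fields are assumptions, and only the two spec_type values that appear in the sample response are registered.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass
class GroupBase:
    # Shared base fields listed in the hierarchy above.
    group_id: int
    layer_names: list
    block_size: int
    page_size_bytes: int


@dataclass
class FullAttentionGroup(GroupBase):
    num_kv_heads: int
    head_size: int
    dtype: str


@dataclass
class MambaGroup(GroupBase):
    shapes: list
    dtypes: list
    mamba_type: str


# Discriminator value -> variant class (hypothetical registry).
GROUP_TYPES = {
    "FullAttentionSpec": FullAttentionGroup,
    "MambaSpec": MambaGroup,
}


def parse_group(raw: dict) -> Any:
    """Build the variant selected by raw["spec_type"]."""
    tag = raw.get("spec_type")
    cls = GROUP_TYPES.get(tag)
    if cls is None:
        raise ValueError(f"unknown spec_type: {tag!r}")
    # Keep only the keys this sketch models; extra keys are dropped.
    known = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in raw.items() if k in known})


mamba = parse_group({
    "group_id": 1,
    "layer_names": ["model.layers.0.mamba"],
    "spec_type": "MambaSpec",
    "block_size": 1,
    "page_size_bytes": 4096,
    "shapes": [[16, 128]],
    "dtypes": ["float32"],
    "mamba_type": "mamba2",
    "mamba_cache_mode": "none",
})
print(type(mamba).__name__, mamba.block_size)
```

Rejecting an unknown spec_type with an error, rather than guessing a variant, mirrors how a discriminated union fails validation on an unrecognized tag.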

Alternatives

No response

Additional context

No response


@tlrmchlsmth

extent analysis

Fix Plan

To implement the new REST API endpoint, follow these steps:

  1. Create a new endpoint: Define a new route for the GET /v1/config or GET /inference/v1/config endpoint, depending on the desired path.
  2. Define the response model: Use the provided Pydantic model hierarchy to define the response structure.
  3. Implement the endpoint logic: Create a function to handle the GET request and return the inference configuration data in the defined response structure.

Example code:

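from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class ModelInfo(BaseModel):
    served_model_names: list[str]
    dtype: str
    quantization: str | None
    max_model_len: int
    max_logprobs: int


class KVCacheInfo(BaseModel):
    num_gpu_blocks: int | None
    num_cpu_blocks: int | None
    gpu_memory_utilization: float
    cache_dtype: str
    enable_prefix_caching: bool
    prefix_caching_hash_algo: str
    kv_offloading_enabled: bool
    kv_offloading_backend: str | None
    kv_offloading_size_gib: float | None
    # Plain dicts in this sketch; the RFC models each group as a typed
    # variant discriminated by "spec_type".
    groups: list[dict]


class InferenceConfigResponse(BaseModel):
    model: ModelInfo
    kv_cache: KVCacheInfo
    # The remaining sections are left untyped in this sketch.
    scheduler: dict
    parallelism: dict
    devices: list[dict]
    speculative_decoding: dict
    lora: dict
    hma: dict


@app.get("/v1/config", response_model=InferenceConfigResponse)
async def get_inference_config() -> InferenceConfigResponse:
    # Placeholder values taken from the RFC's sample response. A real handler
    # would read them from the running engine instead of hard-coding them.
    config_data = {
        "model": {
            "served_model_names": ["my-model"],
            "dtype": "bfloat16",
            "quantization": None,
            "max_model_len": 32768,
            "max_logprobs": 20,
        },
        "kv_cache": {
            "num_gpu_blocks": 1024,
            "num_cpu_blocks": 256,
            "gpu_memory_utilization": 0.9,
            "cache_dtype": "bfloat16",
            "enable_prefix_caching": True,
            "prefix_caching_hash_algo": "sha256",
            "kv_offloading_enabled": False,
            "kv_offloading_backend": None,
            "kv_offloading_size_gib": None,
            "groups": [
                {
                    "group_id": 0,
                    "layer_names": ["model.layers.0.self_attn", "model.layers.1.self_attn"],
                    "spec_type": "FullAttentionSpec",
                    "block_size": 16,
                    "page_size_bytes": 131072,
                    "num_kv_heads": 8,
                    "head_size": 128,
                    "head_size_v": 128,
                    "dtype": "bfloat16",
                    "sliding_window": None,
                    "attention_chunk_size": None,
                },
            ],
        },
        "scheduler": {
            "max_num_batched_tokens": 2048,
            "max_num_seqs": 128,
            "max_num_partial_prefills": 1,
            "enable_chunked_prefill": True,
            "policy": "fcfs",
        },
        "parallelism": {
            "tensor_parallel_size": 4,
            "pipeline_parallel_size": 1,
            "data_parallel_size": 2,
            "data_parallel_size_local": 1,
            "data_parallel_rank": 0,
            "data_parallel_rank_local": None,
            "data_parallel_master_ip": "127.0.0.1",
            "data_parallel_master_port": 29500,
            "data_parallel_rpc_port": 29550,
            "expert_parallel_enabled": False,
            "prefill_context_parallel_size": 1,
        },
        # Two identical devices, ranks 0 and 1, as in the sample response.
        "devices": [
            {"rank": rank, "name": "A100-PCIE-40GB", "total_memory_bytes": 42949672960,
             "compute_capability": {"major": 8, "minor": 0}, "num_compute_units": 108}
            for rank in (0, 1)
        ],
        "speculative_decoding": {
            "enabled": False,
            "method": None,
            "num_speculative_tokens": None,
            "draft_model": None,
        },
        "lora": {
            "enabled": False,
            "max_lora_rank": None,
            "max_loras": None,
            "max_cpu_loras": None,
            "lora_dtype": None,
        },
        "hma": {"enabled": True},
    }
    return InferenceConfigResponse(**config_data)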
