vllm - ✅(Solved) Fix [RFC]: Add Configuration API [1 pull requests, 1 participants]

vllm, 2026-03-25 21:22:33

GitHub stats

vllm-project/vllm#38147•Fetched 2026-04-08 01:32:03

Author

hickeyma

Participants

hickeyma

Timeline (top)

subscribed ×2, cross-referenced ×1, labeled ×1, mentioned ×1

Fix Action

Fix proposed (PR open, not yet merged)

Proposed fix in PR: [WIP][HMA] Add configuration API (https://github.com/vllm-project/vllm/pull/38149)

PR fix notes

PR #38149: [WIP][HMA] Add configuration API

Repository: vllm-project/vllm
Author: hickeyma
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/38149

Description (problem / solution / changelog)

Purpose

Add an endpoint to the REST API server that returns static inference configuration properties for use in prefix-cache-aware routing.

The properties include:

KV-cache capacity, block sizes, etc., across tiers
DP ranks and port mappings
HMA settings

Closes #38147

Test Plan

Test Result

Changed files

vllm/engine/protocol.py (modified, +8/-0)
vllm/entrypoints/openai/api_server.py (modified, +4/-0)
vllm/entrypoints/serve/__init__.py (modified, +6/-0)
vllm/entrypoints/serve/config/__init__.py (added, +2/-0)
vllm/entrypoints/serve/config/api_router.py (added, +219/-0)
vllm/entrypoints/serve/config/protocol.py (added, +199/-0)
vllm/v1/engine/async_llm.py (modified, +10/-0)
vllm/v1/engine/core.py (modified, +111/-1)
vllm/v1/engine/core_client.py (modified, +12/-0)
vllm/v1/worker/gpu_worker.py (modified, +16/-0)

🚀 The feature, motivation and pitch

Motivation

This RFC proposes a new REST API endpoint that exposes structured inference configuration for a running vLLM server. The data is intended to be consumed by distributed serving platforms such as llm-d for prefix-cache-aware routing. It is distinct from metrics because it is static once the server has started.

Endpoint

GET /v1/config or GET /inference/v1/config. The choice depends on whether the vLLM community wants to distinguish this endpoint from the OpenAI-compatible paths.
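
A minimal client sketch for the two candidate paths, using only the Python standard library. The endpoint is only proposed at this point, so the helper names (`config_url`, `fetch_inference_config`), the default path, and the base URL in the usage note are assumptions for illustration rather than part of vLLM.

```python
import json
from urllib.request import urlopen

# The two candidate paths named in the RFC; which one ships is undecided.
CANDIDATE_PATHS = ("/v1/config", "/inference/v1/config")


def config_url(base_url: str, path: str = CANDIDATE_PATHS[0]) -> str:
    """Join a server base URL and a config path without doubling slashes."""
    return base_url.rstrip("/") + path


def fetch_inference_config(
    base_url: str, path: str = CANDIDATE_PATHS[0], timeout: float = 5.0
) -> dict:
    """GET the (proposed) config endpoint and return the parsed JSON body."""
    with urlopen(config_url(base_url, path), timeout=timeout) as resp:
        return json.load(resp)
```

Because the data is static once the server has started, a router could call something like `fetch_inference_config("http://localhost:8000")` once per server at discovery time and cache the result instead of polling.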

Response Schema

{
  "model": {
    "served_model_names": ["my-model"],
    "dtype": "bfloat16",
    "quantization": null,
    "max_model_len": 32768,
    "max_logprobs": 20
  },
  "kv_cache": {
    "num_gpu_blocks": 1024,
    "num_cpu_blocks": 256,
    "gpu_memory_utilization": 0.9,
    "cache_dtype": "bfloat16",
    "enable_prefix_caching": true,
    "prefix_caching_hash_algo": "sha256",
    "kv_offloading_enabled": false,
    "kv_offloading_backend": null,
    "kv_offloading_size_gib": null,
    "groups": [
      {
        "group_id": 0,
        "layer_names": ["model.layers.0.self_attn", "model.layers.1.self_attn"],
        "spec_type": "FullAttentionSpec",
        "block_size": 16,
        "page_size_bytes": 131072,
        "num_kv_heads": 8,
        "head_size": 128,
        "head_size_v": 128,
        "dtype": "bfloat16",
        "sliding_window": null,
        "attention_chunk_size": null
      },
      {
        "group_id": 1,
        "layer_names": ["model.layers.0.mamba"],
        "spec_type": "MambaSpec",
        "block_size": 1,
        "page_size_bytes": 4096,
        "shapes": [[16, 128]],
        "dtypes": ["float32"],
        "mamba_type": "mamba2",
        "mamba_cache_mode": "none"
      }
    ]
  },
  "scheduler": {
    "max_num_batched_tokens": 2048,
    "max_num_seqs": 128,
    "max_num_partial_prefills": 1,
    "enable_chunked_prefill": true,
    "policy": "fcfs"
  },
  "parallelism": {
    "tensor_parallel_size": 4,
    "pipeline_parallel_size": 1,
    "data_parallel_size": 2,
    "data_parallel_size_local": 1,
    "data_parallel_rank": 0,
    "data_parallel_rank_local": null,
    "data_parallel_master_ip": "127.0.0.1",
    "data_parallel_master_port": 29500,
    "data_parallel_rpc_port": 29550,
    "expert_parallel_enabled": false,
    "prefill_context_parallel_size": 1
  },
  "devices": [
    {
      "rank": 0,
      "name": "A100-PCIE-40GB",
      "total_memory_bytes": 42949672960,
      "compute_capability": {"major": 8, "minor": 0},
      "num_compute_units": 108
    },
    {
      "rank": 1,
      "name": "A100-PCIE-40GB",
      "total_memory_bytes": 42949672960,
      "compute_capability": {"major": 8, "minor": 0},
      "num_compute_units": 108
    }
  ],
  "speculative_decoding": {
    "enabled": false,
    "method": null,
    "num_speculative_tokens": null,
    "draft_model": null
  },
  "lora": {
    "enabled": false,
    "max_lora_rank": null,
    "max_loras": null,
    "max_cpu_loras": null,
    "lora_dtype": null
  },
  "hma": {
    "enabled": true
  }
}
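
As a rough illustration of how a router might consume this response, the sketch below pulls out the fields relevant to prefix-cache-aware routing. `routing_view` and `cacheable_prefix_tokens` are hypothetical helpers, the config is a trimmed copy of the example above, and treating only whole blocks as cacheable is an assumption about block-level prefix caching, not something the RFC states.

```python
# Trimmed copy of the example response above (only the fields used here).
EXAMPLE_CONFIG = {
    "kv_cache": {
        "num_gpu_blocks": 1024,
        "enable_prefix_caching": True,
        "prefix_caching_hash_algo": "sha256",
        "groups": [
            {"group_id": 0, "spec_type": "FullAttentionSpec", "block_size": 16},
            {"group_id": 1, "spec_type": "MambaSpec", "block_size": 1},
        ],
    },
    "parallelism": {"data_parallel_size": 2, "data_parallel_rank": 0},
}


def routing_view(config: dict) -> dict:
    """Collect the static fields a prefix-aware router would likely need."""
    kv = config["kv_cache"]
    return {
        "prefix_caching": kv["enable_prefix_caching"],
        "hash_algo": kv["prefix_caching_hash_algo"],
        "num_gpu_blocks": kv["num_gpu_blocks"],
        # Block size per KV cache group, keyed by group_id.
        "block_sizes": {g["group_id"]: g["block_size"] for g in kv["groups"]},
        "dp_rank": config["parallelism"]["data_parallel_rank"],
    }


def cacheable_prefix_tokens(prompt_len: int, block_size: int) -> int:
    """Tokens covered by whole blocks (assumes partial blocks are not cached)."""
    return (prompt_len // block_size) * block_size


view = routing_view(EXAMPLE_CONFIG)
# A 100-token prompt with the group 0 block size of 16 spans 6 whole blocks.
print(view["block_sizes"], cacheable_prefix_tokens(100, view["block_sizes"][0]))
```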

Pydantic Model Hierarchy

InferenceConfigResponse
├── model: ModelInfo
│   ├── served_model_names: list[str]
│   ├── dtype: str
│   ├── quantization: str | None
│   ├── max_model_len: int
│   └── max_logprobs: int
│
├── kv_cache: KVCacheInfo
│   ├── num_gpu_blocks: int | None
│   ├── num_cpu_blocks: int | None
│   ├── gpu_memory_utilization: float
│   ├── cache_dtype: str
│   ├── enable_prefix_caching: bool
│   ├── prefix_caching_hash_algo: str
│   ├── kv_offloading_enabled: bool
│   ├── kv_offloading_backend: str | None
│   ├── kv_offloading_size_gib: float | None
│   └── groups: list[KVCacheGroupInfo]
│         KVCacheGroupInfo = Annotated[
│           FullAttentionGroupSpec
│           | MLAAttentionGroupSpec
│           | SlidingWindowGroupSpec
│           | ChunkedLocalAttentionGroupSpec
│           | MambaGroupSpec
│           | CrossAttentionGroupSpec
│           | SinkFullAttentionGroupSpec,
│           Field(discriminator="spec_type")
│         ]
│         Shared base fields (all variants):
│           group_id: int
│           layer_names: list[str]
│           block_size: int
│           page_size_bytes: int
│
├── scheduler: SchedulerInfo
│   ├── max_num_batched_tokens: int
│   ├── max_num_seqs: int
│   ├── max_num_partial_prefills: int
│   ├── enable_chunked_prefill: bool
│   └── policy: str
│
├── parallelism: ParallelismInfo
│   ├── tensor_parallel_size: int
│   ├── pipeline_parallel_size: int
│   ├── data_parallel_size: int
│   ├── data_parallel_size_local: int
│   ├── data_parallel_rank: int
│   ├── data_parallel_rank_local: int | None
│   ├── data_parallel_master_ip: str
│   ├── data_parallel_master_port: int
│   ├── data_parallel_rpc_port: int
│   ├── expert_parallel_enabled: bool
│   └── prefill_context_parallel_size: int
│
│
│     DeviceInfo:
│       rank: int
│       name: str
│       total_memory_bytes: int
│       compute_capability: ComputeCapability
│         major: int
│         minor: int
│       num_compute_units: int
│
├── speculative_decoding: SpeculativeDecodingInfo
│   ├── enabled: bool
│   ├── method: str | None
│   ├── num_speculative_tokens: int | None
│   └── draft_model: str | None
│
├── lora: LoRAInfo
│   ├── enabled: bool
│   ├── max_lora_rank: int | None
│   ├── max_loras: int | None
│   ├── max_cpu_loras: int | None
│   └── lora_dtype: str | None
│
└── hma: HMAInfo
    └── enabled: bool

Alternatives

No response

Additional context

No response

@tlrmchlsmth

extent analysis

Fix Plan

To implement the new REST API endpoint, follow these steps:

Create a new endpoint: Define a new route for the GET /v1/config or GET /inference/v1/config endpoint, depending on the desired path.
Define the response model: Use the provided Pydantic model hierarchy to define the response structure.
Implement the endpoint logic: Create a function to handle the GET request and return the inference configuration data in the defined response structure.

Example code:

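from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ModelInfo(BaseModel):
    served_model_names: list[str]
    dtype: str
    quantization: str | None
    max_model_len: int
    max_logprobs: int

class KVCacheInfo(BaseModel):
    num_gpu_blocks: int | None
    num_cpu_blocks: int | None
    gpu_memory_utilization: float
    cache_dtype: str
    enable_prefix_caching: bool
    prefix_caching_hash_algo: str
    kv_offloading_enabled: bool
    kv_offloading_backend: str | None
    kv_offloading_size_gib: float | None
    # Groups stay as plain dicts in this sketch instead of the
    # discriminated KVCacheGroupInfo union from the hierarchy.
    groups: list[dict]

class InferenceConfigResponse(BaseModel):
    model: ModelInfo
    kv_cache: KVCacheInfo
    # The remaining sections are left untyped here for brevity.
    scheduler: dict
    parallelism: dict
    devices: list[dict]
    speculative_decoding: dict
    lora: dict
    hma: dict

@app.get("/v1/config", response_model=InferenceConfigResponse)
async def get_inference_config() -> InferenceConfigResponse:
    # Placeholder values copied from the example response in the RFC; a real
    # handler would read them from the running engine instead.
    config_data = {
        "model": {
            "served_model_names": ["my-model"],
            "dtype": "bfloat16",
            "quantization": None,
            "max_model_len": 32768,
            "max_logprobs": 20,
        },
        "kv_cache": {
            "num_gpu_blocks": 1024,
            "num_cpu_blocks": 256,
            "gpu_memory_utilization": 0.9,
            "cache_dtype": "bfloat16",
            "enable_prefix_caching": True,
            "prefix_caching_hash_algo": "sha256",
            "kv_offloading_enabled": False,
            "kv_offloading_backend": None,
            "kv_offloading_size_gib": None,
            # The Mamba group from the RFC example is omitted for brevity.
            "groups": [
                {
                    "group_id": 0,
                    "layer_names": ["model.layers.0.self_attn", "model.layers.1.self_attn"],
                    "spec_type": "FullAttentionSpec",
                    "block_size": 16,
                    "page_size_bytes": 131072,
                    "num_kv_heads": 8,
                    "head_size": 128,
                    "head_size_v": 128,
                    "dtype": "bfloat16",
                    "sliding_window": None,
                    "attention_chunk_size": None,
                },
            ],
        },
        "scheduler": {
            "max_num_batched_tokens": 2048,
            "max_num_seqs": 128,
            "max_num_partial_prefills": 1,
            "enable_chunked_prefill": True,
            "policy": "fcfs",
        },
        "parallelism": {
            "tensor_parallel_size": 4,
            "pipeline_parallel_size": 1,
            "data_parallel_size": 2,
            "data_parallel_size_local": 1,
            "data_parallel_rank": 0,
            "data_parallel_rank_local": None,
            "data_parallel_master_ip": "127.0.0.1",
            "data_parallel_master_port": 29500,
            "data_parallel_rpc_port": 29550,
            "expert_parallel_enabled": False,
            "prefill_context_parallel_size": 1,
        },
        # Two identical devices (ranks 0 and 1), as in the RFC example.
        "devices": [
            {"rank": rank, "name": "A100-PCIE-40GB", "total_memory_bytes": 42949672960,
             "compute_capability": {"major": 8, "minor": 0}, "num_compute_units": 108}
            for rank in (0, 1)
        ],
        "speculative_decoding": {
            "enabled": False,
            "method": None,
            "num_speculative_tokens": None,
            "draft_model": None,
        },
        "lora": {
            "enabled": False,
            "max_lora_rank": None,
            "max_loras": None,
            "max_cpu_loras": None,
            "lora_dtype": None,
        },
        "hma": {"enabled": True},
    }
    # Validate the payload against the response model before returning it.
    return InferenceConfigResponse(**config_data)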
