PR fix notes

PR #36687: [PD][Nixl] Add support for hybrid SSM-FA models

Repository: vllm-project/vllm
Author: NickLucche
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/36687

Description (problem / solution / changelog)

For a comprehensive description of the changes proposed here, check out the corresponding RFC https://github.com/vllm-project/vllm/issues/36780.

This PR adds support for hybrid SSM-based models such as nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 with NixlConnector, enabling KV cache transfer of both FA and Mamba states in disaggregated setups. Currently it supports only homogeneous TP sizes across P and D.

Note that we're only transferring actual mamba states and skipping the padding that may be present, as that might have non-trivial size.

UPDATE: regarding this change:

- curr_tensor_size_bytes = cache.numel() * cache.element_size()
+ curr_tensor_size_bytes = num_blocks * physical_page_size

In this PR I am trying to move further away from relying on tensor views and to unify the code around kv_cache_config as the single source of truth. This is also necessary for Mamba-like models, where the tensor (cache above) reports the unpadded size, which does not match num_blocks * physical_page_size; with the old expression one would have to account for padding manually.
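The difference between the two size expressions can be sketched with plain arithmetic. This is a minimal illustration, not the connector's code: the per-block state shape, element size, and padded page size below are hypothetical numbers chosen only to show how an unpadded view undercounts the registered region.

```python
from math import prod

def unpadded_size_bytes(shape, element_size):
    """Size reported by a tensor view: numel() * element_size()."""
    return prod(shape) * element_size

def region_size_bytes(num_blocks, physical_page_size):
    """Size of the allocated region: num_blocks * physical_page_size."""
    return num_blocks * physical_page_size

# Hypothetical values, for illustration only.
num_blocks = 4
state_shape_per_block = (3, 100)   # unpadded Mamba state per block
element_size = 2                   # bytes per element, e.g. float16
physical_page_size = 1024          # padded bytes per block

view_bytes = unpadded_size_bytes((num_blocks, *state_shape_per_block), element_size)
region_bytes = region_size_bytes(num_blocks, physical_page_size)
padding_per_block = physical_page_size - prod(state_shape_per_block) * element_size

print(f"view: {view_bytes} B, region: {region_bytes} B, padding/block: {padding_per_block} B")
```

With padding present, the view-based figure is smaller than the region actually laid out in memory, which is why deriving the size from the block count and page size is the safer choice here.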

Important notes

TP > 1 currently requires --no-async-scheduling to run correctly. @ZhanqiuHu and I identified a synchronization issue where states may be transferred in a corrupted form, leading to high variance in evaluations. This will be addressed separately, as it is likely unrelated to SSMs.
@ZhanqiuHu has identified an issue with the current PD workflow: the first token is recomputed on D, which burns that extra step into the SSM state in place.

Test with

Enable HMA experimental support with --no-disable-hybrid-kv-cache-manager:

# usual P/D command
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
--trust-remote-code \
--block-size 64 \
--no-enable-prefix-caching \
--no-disable-hybrid-kv-cache-manager \
--mamba-ssm-cache-dtype float16 \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'

# usual toy_proxy_server.py command

HYBRID_SSM=1 bash v1/kv_connector/nixl_integration/config_sweep_accuracy_test.sh

or check out unit tests added with this PR.
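When launching the server from a script, shell quoting of the JSON passed to --kv-transfer-config is easy to get wrong. The sketch below builds the same argument list in Python and serializes the config with json.dumps. It only uses the flags and values shown in the command above; wrapping the launch in Python is an assumption of this example, not something the PR prescribes.

```python
import json
import shlex

# Same connector settings as in the serve command above.
kv_transfer_config = {"kv_connector": "NixlConnector", "kv_role": "kv_both"}

cmd = [
    "vllm", "serve", "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    "--trust-remote-code",
    "--block-size", "64",
    "--no-enable-prefix-caching",
    "--no-disable-hybrid-kv-cache-manager",
    "--mamba-ssm-cache-dtype", "float16",
    # Compact separators reproduce the JSON exactly as written in the command.
    "--kv-transfer-config", json.dumps(kv_transfer_config, separators=(",", ":")),
]

# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

The list form can be handed to subprocess.run(cmd) directly, in which case no shell quoting is involved at all.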

Results from running consecutive full lm-eval runs with no prefix caching:

local-completions ({'model': 'nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8', 'base_url': 'http://127.0.0.1:55483/v1/completions', 'num_concurrent': 10, 'max_retries': 3, 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.5444|±  |0.0137|
|     |       |strict-match    |     5|exact_match|↑  |0.8355|±  |0.0102|

local-completions ({'model': 'nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8', 'base_url': 'http://127.0.0.1:55483/v1/completions', 'num_concurrent': 10, 'max_retries': 3, 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.5345|±  |0.0137|
|     |       |strict-match    |     5|exact_match|↑  |0.8340|±  |0.0102|

local-completions ({'model': 'nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8', 'base_url': 'http://127.0.0.1:55483/v1/completions', 'num_concurrent': 10, 'max_retries': 3, 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.5398|±  |0.0137|
|     |       |strict-match    |     5|exact_match|↑  |0.8355|±  |0.0102|

local-completions ({'model': 'nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8', 'base_url': 'http://127.0.0.1:55483/v1/completions', 'num_concurrent': 10, 'max_retries': 3, 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.5428|±  |0.0137|
|     |       |strict-match    |     5|exact_match|↑  |0.8332|±  |0.0103|

local-completions ({'model': 'nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8', 'base_url': 'http://127.0.0.1:55483/v1/completions', 'num_concurrent': 10, 'max_retries': 3, 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.5557|±  |0.0137|
|     |       |strict-match    |     5|exact_match|↑  |0.8506|±  |0.0098|
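To judge run-to-run stability, the five runs can be summarized numerically. The sketch below copies the gsm8k values from the tables above and computes the mean, sample standard deviation, and range per filter; the choice of summary statistics is this example's, not part of the PR.

```python
from statistics import mean, stdev

# exact_match values copied from the five lm-eval runs above.
strict_match = [0.8355, 0.8340, 0.8355, 0.8332, 0.8506]
flexible_extract = [0.5444, 0.5345, 0.5398, 0.5428, 0.5557]

def summarize(scores):
    """Return mean, sample standard deviation, and max-min range."""
    return {
        "mean": mean(scores),
        "stdev": stdev(scores),
        "range": max(scores) - min(scores),
    }

strict_summary = summarize(strict_match)
flexible_summary = summarize(flexible_extract)

for name, summary in (("strict-match", strict_summary),
                      ("flexible-extract", flexible_summary)):
    print(f"{name}: mean={summary['mean']:.4f} "
          f"stdev={summary['stdev']:.4f} range={summary['range']:.4f}")
```

Comparing the range against the per-run Stderr column gives a rough sense of whether the spread between runs is within the noise of a single evaluation.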

TODO

Address kernel<>logical block size mismatch
Benchmark

Changed files

tests/v1/kv_connector/nixl_integration/config_sweep_accuracy_test.sh (modified, +8/-0)
tests/v1/kv_connector/nixl_integration/test_accuracy.py (modified, +1/-0)
tests/v1/kv_connector/unit/test_nixl_connector.py (modified, +76/-45)
tests/v1/kv_connector/unit/test_nixl_connector_hma.py (modified, +112/-0)
vllm/distributed/kv_transfer/kv_connector/utils.py (modified, +38/-8)
vllm/distributed/kv_transfer/kv_connector/v1/mooncake/mooncake_connector.py (modified, +1/-1)
vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py (modified, +351/-112)

Your current environment

latest main

🐛 Describe the bug

As reported in https://github.com/vllm-project/vllm/pull/36687/, PD for SSM models (e.g. NemotronH, Qwen3.5, ...) with TP > 1 currently requires --no-async-scheduling to run without accuracy drops. @ZhanqiuHu and I identified a synchronization issue where states may be transferred in a corrupted form, leading to high variance in evaluations.

Current diagnostics:

DTP1-PTP1 works great (also tested with --distributed-executor-backend=mp)
DTP2-PTP2 accuracy jitters a lot
- higher TP, e.g. TP4, appears to jitter more
DTP2-PTP2 with --no-async-scheduling works great
Adding a torch.cuda.synchronize() call after model execution in the runner also works, so my current guess is that we're moving bytes whose content has not been fully written yet.
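The hypothesis in the last item, reading a buffer before its writer has finished, can be illustrated with a CPU-only analogy. The sketch below is not vLLM or NIXL code: a Python thread stands in for the asynchronous GPU work, and threading.Event objects stand in for the synchronization point. All names are hypothetical.

```python
import threading

def run_transfer(wait_for_writer: bool) -> bytes:
    """Snapshot a buffer while a writer thread fills it in two halves."""
    state = bytearray(8)                 # stands in for the SSM state buffer
    half_written = threading.Event()
    may_finish = threading.Event()
    done = threading.Event()

    def writer():
        state[:4] = b"\x01" * 4          # first half of the update
        half_written.set()
        may_finish.wait()                # pause so the race is reproducible
        state[4:] = b"\x01" * 4          # second half of the update
        done.set()

    thread = threading.Thread(target=writer)
    thread.start()
    half_written.wait()

    if wait_for_writer:
        may_finish.set()
        done.wait()                      # synchronize, then read
        snapshot = bytes(state)
    else:
        snapshot = bytes(state)          # read while the write is in flight
        may_finish.set()

    thread.join()
    return snapshot

print("synchronized:  ", run_transfer(wait_for_writer=True).hex())
print("unsynchronized:", run_transfer(wait_for_writer=False).hex())
```

The unsynchronized snapshot contains only the first half of the update, which is the kind of partially written state that would show up as accuracy jitter after a transfer.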

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

To work around the synchronization issue, model execution must have finished writing its outputs before states are transferred. The issue reports two options that work:

Add a torch.cuda.synchronize() call after model execution in the runner, so that all pending GPU operations complete before states are transferred.
Alternatively, pass the --no-async-scheduling flag to disable asynchronous scheduling.

Example code snippet:

import torch

# Model execution (model and input_data are placeholders for the runner's forward pass)
output = model(input_data)

# Block the host until all queued CUDA work has finished writing its outputs
torch.cuda.synchronize()

# Transfer states (transfer_states is a placeholder for the connector's send path)
transfer_states(output)

With the torch.cuda.synchronize() call in place, the transfer starts only after model execution has completed, so the states are read in a consistent form.

Verification

To verify that the fix worked, test the model with different tensor-parallel (TP) sizes and check that the accuracy no longer jitters. Specifically, test the following scenarios:

DTP1-PTP1 with TP >


vllm - ✅(Solved) Fix [Bug]: PD disaggregation for SSM models requires `--no-async-scheduling` when TP>1 [1 pull requests, 1 comments, 1 participants]
