vllm - 💡(How to fix) Fix [RFC]: Per-Layer Parallelism Policy

🚀 The feature, motivation and pitch

1. Motivation

PR #36287 introduced TPA (tensor-parallel-size-attention) to vLLM, and review feedback from Lucas Wilkinson — "the design shouldn't be attention-specific" — pointed to a more general need: a per-layer parallelism policy that lets layers declare what they are and have the framework resolve their plan.

2. The idea

A single function, owned by the framework, that turns a layer's identity into its parallelism plan:

                                ┌─────────────────────────────┐
   (layer instance, prefix) ──▶ │  framework-owned resolver   │──▶  plan
                                │  (consults config + layer)  │
                                └─────────────────────────────┘

Layers consult the resolver in __init__ instead of reading global TP accessors. A plan is whatever the layer needs to make local decisions — its TP width, its rank within that group, its communication backend, etc. The plan schema is open-ended; the call shape stays the same as new fields are added.

The plan being a function of the layer is what makes the design non-attention-specific. The resolver doesn't know it's an attention layer — it knows it's this layer at this prefix and consults config to decide. New strategies (Q-replication, hybrid attention, MTP, future ideas) become resolver-and-config changes, not N-model changes.
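The call shape described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ParallelPlan`, `PolicyConfig`, `resolve_plan`, and `ToyLinear` are hypothetical names invented here, not vLLM APIs, and the prefixes are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParallelPlan:
    """Hypothetical plan record; the RFC leaves the real schema open-ended."""
    tp_size: int
    dcp_size: int = 1


@dataclass(frozen=True)
class PolicyConfig:
    """Hypothetical stand-in for the config the resolver consults."""
    global_tp: int
    attn_tp: Optional[int] = None  # None means no per-layer override


def resolve_plan(layer: object, prefix: str, config: PolicyConfig) -> ParallelPlan:
    """Framework-owned resolver: (layer instance, prefix) -> plan."""
    kind = getattr(layer, "kind", "unknown")
    if kind == "attention" and config.attn_tp is not None:
        return ParallelPlan(tp_size=config.attn_tp,
                            dcp_size=config.global_tp // config.attn_tp)
    # Every other layer keeps the full TP width in this sketch.
    return ParallelPlan(tp_size=config.global_tp)


class ToyLinear:
    """Toy layer that asks the resolver for its plan in __init__."""

    def __init__(self, kind: str, prefix: str, config: PolicyConfig) -> None:
        self.kind = kind
        self.prefix = prefix
        self.plan = resolve_plan(self, prefix, config)


config = PolicyConfig(global_tp=16, attn_tp=4)
attn = ToyLinear("attention", "model.layers.0.self_attn.qkv_proj", config)
mlp = ToyLinear("mlp", "model.layers.0.mlp.gate_up_proj", config)
print(attn.plan)
print(mlp.plan)
```

The layer code never branches on strategy. Adding a field to the plan changes the dataclass and the resolver, while the `__init__` call stays the same.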

3. How the policy expresses different models

The mechanism in §2 is one resolver and one plan dataclass. The breadth comes from what the policy declares for which layers. To make the schema concrete, here are illustrative configurations for different model architectures. The exact field names are not part of this RFC's ask — the question is whether the shape (named axes, rules with matchers, per-layer-kind plans) covers the architectural patterns we want to support.

3.1 Llama family — uniform GQA

parallelism_rules:
  axes:
    tp: 16
    attn_tp: 4      # TPA: attention layers narrower than full TP
    dcp: 4          # required by TPA invariant: dcp = tp / attn_tp
  rules:
    - match: { kind: attention }
      plan: { tp_size: attn_tp, dcp_size: dcp }
    - match: { kind: mlp }
      plan: { tp_size: tp }

One attention rule, one MLP rule. Every block matches the same two rules. Every derivative (Mistral, Phi3, NemotronNAS, TeleChat2, GLM) reuses this verbatim.
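To make the axes-and-rules shape concrete, here is a rough Python sketch that resolves the configuration above into per-kind plans and checks the `dcp = tp / attn_tp` invariant from the YAML comment. The helper names (`check_tpa_invariant`, `plan_for`) and the first-match-wins ordering are assumptions made for illustration; the RFC does not specify them.

```python
from typing import Any

# The 3.1 policy, written as a plain dict instead of YAML.
policy: dict[str, Any] = {
    "axes": {"tp": 16, "attn_tp": 4, "dcp": 4},
    "rules": [
        {"match": {"kind": "attention"},
         "plan": {"tp_size": "attn_tp", "dcp_size": "dcp"}},
        {"match": {"kind": "mlp"}, "plan": {"tp_size": "tp"}},
    ],
}


def check_tpa_invariant(axes: dict[str, int]) -> None:
    """Raise if dcp != tp / attn_tp."""
    tp, attn_tp, dcp = axes["tp"], axes["attn_tp"], axes["dcp"]
    if tp % attn_tp != 0 or dcp != tp // attn_tp:
        raise ValueError(
            f"TPA invariant violated: tp={tp}, attn_tp={attn_tp}, dcp={dcp}")


def plan_for(kind: str, policy: dict[str, Any]) -> dict[str, int]:
    """Return the first matching rule's plan with axis names resolved to sizes."""
    axes = policy["axes"]
    for rule in policy["rules"]:
        if rule["match"].get("kind") == kind:
            # A plan value is either an axis name or a literal integer.
            return {field: axes[value] if isinstance(value, str) else value
                    for field, value in rule["plan"].items()}
    raise KeyError(f"no rule matches kind={kind!r}")


check_tpa_invariant(policy["axes"])
print(plan_for("attention", policy))  # {'tp_size': 4, 'dcp_size': 4}
print(plan_for("mlp", policy))        # {'tp_size': 16}
```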

3.2 Nemotron-H, Jamba — hybrid attention + SSM

parallelism_rules:
  axes:
    tp: 16
    attn_tp: 4
    dcp: 4
  rules:
    - match: { kind: attention }
      plan: { tp_size: attn_tp, dcp_size: dcp }
    - match: { kind: mamba }
      plan: { tp_size: tp }     # SSM has no attention heads to TPA-shard
    - match: { kind: mlp }
      plan: { tp_size: tp }

The resolver dispatches by layer kind, which the HF config already declares (for Nemotron-H, via hybrid_override_pattern). No new metadata source is needed: the heterogeneity signal already exists in the model config.
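As a sketch of how a layer-kind signal could be read from such a pattern string, the snippet below maps each character to a kind. The symbol mapping (`M` for Mamba, `*` for attention, `-` for MLP) is an assumption about Nemotron-H style patterns and should be checked against the actual model config; `layer_kinds` is a hypothetical helper and the pattern is made up.

```python
from collections import Counter

# Assumed symbol mapping; verify against the real model config before relying on it.
SYMBOL_TO_KIND = {"M": "mamba", "*": "attention", "-": "mlp"}


def layer_kinds(hybrid_override_pattern: str) -> list[str]:
    """Map each pattern character to a layer kind, one entry per layer."""
    try:
        return [SYMBOL_TO_KIND[symbol] for symbol in hybrid_override_pattern]
    except KeyError as exc:
        raise ValueError(f"unknown layer symbol {exc.args[0]!r}") from None


pattern = "M-M-M*-M-"  # made-up short pattern, not taken from a real checkpoint
kinds = layer_kinds(pattern)
print(kinds)
print(Counter(kinds))
```

Each resulting kind can then be fed to the rule matcher, so the policy file never needs to list layer indices.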

3.3 DSV4 — MLA + MoE + sparse indexer

parallelism_rules:
  axes:
    tp: 16
    attn_tp: 1      # MLA has 1 effective KV head → TPA collapses to DCP-only
    dcp: 16
    ep: 8           # expert parallel for MoE layers
  rules:
    - match: { kind: attention, attention_flavor: mla }
      plan: { tp_size: tp, dcp_size: dcp }
    - match: { kind: indexer }
      plan: { tp_size: 1 }                  # replicate across TP
    - match: { kind: moe }
      plan: { tp_size: tp, ep_size: ep }
    - match: { kind: mlp }
      plan: { tp_size: tp }

Three distinct layer-kind rules within one model, alongside the plain MLP rule. MLA attention, the sparse indexer, and MoE experts each get a different plan, all expressed through the same schema.

3.4 MTP heads — speculative decoding

parallelism_rules:
  rules:
    - match: { prefix_pattern: "mtp_*" }
      plan: { tp_size: 1 }      # replicate each MTP head

The matcher can also target layers by prefix, which suits architectures that attach auxiliary sub-modules such as MTP heads. The rule applies regardless of the main model's parallelism choices, because the resolver dispatches on (layer, prefix) independently for each layer.
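One possible reading of `prefix_pattern` is a shell-style glob tested against each dotted component of the layer prefix. The sketch below assumes that semantics; the RFC does not define it, `matches_prefix` is a hypothetical helper, and the prefixes are made up.

```python
from fnmatch import fnmatchcase


def matches_prefix(prefix: str, pattern: str) -> bool:
    """True if any dotted component of the layer prefix matches the glob."""
    return any(fnmatchcase(part, pattern) for part in prefix.split("."))


print(matches_prefix("model.mtp_0.self_attn.qkv_proj", "mtp_*"))  # True
print(matches_prefix("model.layers.3.mlp.down_proj", "mtp_*"))    # False
```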

4. Backward compatibility

Three properties of the design enforce backward compatibility by construction:

  • Sentinel-fallback at every consult site. The resolver returns a "no policy active" plan by default. Layers fall back to today's global TP accessors and behave bit-identically. Adding the resolver call to a parallel layer is safe in isolation.
  • Whitelist gate per architecture. A model opts into per-layer policy only when its migration is verified. Non-whitelisted models keep their current global-TP wiring; per-layer CLI flags are rejected at config-validation time with a clear error.
  • Per-strategy delivery. Each new strategy lands resolver-native for the model families it targets, independently. Existing global parallelism settings (TP, PP, EP, DCP) stay where they are and become the resolver's fallbacks for those fields if and when a per-layer override is genuinely needed.

A user who upgrades vLLM and changes no CLI flags sees no behavior change. A user who opts into a per-layer flag on a whitelisted model gets the new behavior. There is no third state.
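The first two properties can be illustrated with a small Python sketch. All names here (`Plan`, `NO_POLICY`, `PER_LAYER_WHITELIST`, `validate_config`, `resolve`, `effective_tp`) are hypothetical, and the whitelist entry is only an example; this is not how vLLM implements it.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical whitelist; the architecture name is only an example entry.
PER_LAYER_WHITELIST = frozenset({"LlamaForCausalLM"})


@dataclass(frozen=True)
class Plan:
    tp_size: Optional[int] = None  # None is the "no policy active" sentinel


NO_POLICY = Plan()


def validate_config(architecture: str, attn_tp: Optional[int]) -> None:
    """Reject per-layer flags for non-whitelisted architectures up front."""
    if attn_tp is not None and architecture not in PER_LAYER_WHITELIST:
        raise ValueError(
            f"{architecture} is not whitelisted for per-layer parallelism")


def resolve(kind: str, attn_tp: Optional[int]) -> Plan:
    """Return an override only for attention layers when the flag is set."""
    if kind == "attention" and attn_tp is not None:
        return Plan(tp_size=attn_tp)
    return NO_POLICY


def effective_tp(plan: Plan, global_tp: int) -> int:
    """Consult site: fall back to the global TP size on the sentinel."""
    return global_tp if plan.tp_size is None else plan.tp_size


validate_config("LlamaForCausalLM", attn_tp=4)
print(effective_tp(resolve("attention", attn_tp=4), global_tp=16))     # 4
print(effective_tp(resolve("attention", attn_tp=None), global_tp=16))  # 16
```

With no flag set, every consult site receives the sentinel and uses the global value, which is the "no behavior change" case. A flag on a non-whitelisted architecture fails at validation, before any layer is built.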

5. Roadmap

What we commit to deliver, with TPA-GQA as the concrete reference:

  • TPA + FlashAttention + Llama family (rebased [PR #36287]). Resolver scaffold, QKVParallelLinear consult site, --tensor-parallel-size-attention flag, Llama-family migration, FA backend integration, whitelist gate. Enables TPA-GQA on Llama family and its derivatives (Mistral, Phi3, NemotronNAS, TeleChat2, GLM). GSM8K parity vs pure-TP baseline validates both the per-rank semantics and the backward-compat property.
  • TPA + FlashInfer + Llama family. Backend-specific DCP-wrapper audit. The FA path already validates the policy mechanism; this enables the FlashInfer backend for already-whitelisted models.
  • DSV4 with Q-rep (MLA family). Q-rep ships as the second concrete policy feature; MLA models (DSV3, DSV4 and inheritors like GLM4-MoE-Lite, MistralLarge3) join the whitelist. Demonstrates the resolver carrying a non-TPA plan field.
  • Nemotron Super (hybrid attention + SSM). Per-layer-kind policy via HF hybrid_override_pattern. Demonstrates the resolver dispatching on layer kind (attention vs SSM), the central case from the §3.2 example.

Additional model families are not an immediate priority, but we're happy to extend coverage later, including through community contributions, since adding a new architecture is a small declaration in the schema rather than a deep change.
