vllm - ✅(Solved) Fix [Bug]: CUDA assert in triton attention for MolmoWeb models (Molmo2 architecture with different max_position_embeddings) [2 pull requests, 2 comments, 1 participant]

GitHub stats
vllm-project/vllm#38660 (fetched 2026-04-08 01:58:40)
Comments: 2 · Participants: 1 · Timeline events: 6 · Reactions: 0
Timeline (top): commented ×2, cross-referenced ×2, referenced ×2

Error Message

from vllm import LLM, SamplingParams
from PIL import Image
import urllib.request, io

url = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png"
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(req) as resp:
    image = Image.open(io.BytesIO(resp.read())).convert("RGB")

# This WORKS:
llm = LLM(model="allenai/Molmo2-8B", trust_remote_code=True, max_model_len=4096, max_num_batched_tokens=4096)
outputs = llm.generate(
    {"prompt": "<|image|> Describe this image.", "multi_modal_data": {"image": image}},
    sampling_params=SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)  # Works fine

# This CRASHES:
llm2 = LLM(model="allenai/MolmoWeb-8B", trust_remote_code=True, max_model_len=4096, max_num_batched_tokens=4096)
outputs2 = llm2.generate(
    {"prompt": "<|image|> Describe this image.", "multi_modal_data": {"image": image}},
    sampling_params=SamplingParams(temperature=0.0, max_tokens=64),
)

torch.AcceleratorError: CUDA error: device-side assert triggered

Root Cause

The only config difference between allenai/MolmoWeb-8B and allenai/Molmo2-8B is:

  • max_position_embeddings: 10240 (MolmoWeb) vs 36864 (Molmo2)

Everything else — architecture, text config dimensions, vision config, processor code — is identical (diff of processing_molmo2.py shows zero functional differences).

The lower max_position_embeddings likely causes the multimodal prefix range computation to produce invalid index ranges that trigger a CUDA bounds check in the triton attention kernel.

Fix Action

Fixed

PR fix notes

PR #6: Add vLLM inference backend

Description (problem / solution / changelog)

Summary

Add vLLM as an inference backend for MolmoWeb, enabling high-throughput serving.

Changes

  • agent/model_backends.py: New VLLMActionPredictor class
  • agent/fastapi_model_server.py: Add "vllm" predictor type
  • pyproject.toml: Add vllm>=0.15.0 as optional dep
  • README.md: Add vLLM backend docs

Usage

pip install -e ".[vllm]"
PREDICTOR_TYPE=vllm CKPT=allenai/MolmoWeb-8B python -m agent.fastapi_model_server

Note

There is an open issue (vllm-project/vllm#38660) where MolmoWeb models hit a CUDA assert on Blackwell GPUs (RTX 5090). Likely works on A100/H100 — testing appreciated.

Changed files

  • README.md (modified, +14/-2)
  • agent/fastapi_model_server.py (modified, +7/-1)
  • agent/model_backends.py (modified, +77/-0)
  • pyproject.toml (modified, +3/-0)

PR #7: Add vLLM inference backend

Description (problem / solution / changelog)

Summary

Add vLLM as an inference backend for MolmoWeb.

Changes

  • agent/model_backends.py: New VLLMActionPredictor class
  • agent/fastapi_model_server.py: Add "vllm" predictor type
  • pyproject.toml: Add vllm>=0.15.0 as optional dep
  • README.md: Add vLLM backend docs

Usage

pip install -e ".[vllm]"
PREDICTOR_TYPE=vllm CKPT=allenai/MolmoWeb-8B python -m agent.fastapi_model_server

Note

There is an open issue (vllm-project/vllm#38660) where MolmoWeb models hit a CUDA assert on Blackwell GPUs (RTX 5090). Likely works on A100/H100 — testing appreciated.

Changed files

  • README.md (modified, +14/-2)
  • agent/fastapi_model_server.py (modified, +7/-1)
  • agent/model_backends.py (modified, +77/-0)
  • pyproject.toml (modified, +3/-0)


Your current environment

  • vLLM version: 0.18.1 (Docker vllm/vllm-openai:latest)
  • GPU: NVIDIA RTX 5090 (24GB)
  • CUDA: 13.2
  • PyTorch: 2.7

Model

allenai/MolmoWeb-8B and allenai/MolmoWeb-4B — both use Molmo2ForConditionalGeneration architecture (model_type: "molmo2"), identical to allenai/Molmo2-8B.

Bug description

MolmoWeb models crash with a CUDA device-side assert in the triton attention kernel during inference. allenai/Molmo2-8B works fine with the same code.

The crash is reported in triton_attn.py, in the mm_prefix_range_tensor property, at this line:

torch.tensor(r, dtype=torch.int32, device=device).view(-1, 2)

This is in the multimodal bidirectional attention path (is_mm_prefix_lm = True), which is enabled for all model_type: "molmo2" models.
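One way to test whether the range data itself is invalid is to check it on the CPU before it is copied to the device. The sketch below is illustrative only: `validate_prefix_ranges` is a hypothetical helper, not part of vLLM, and it assumes each range is a `(start, end)` pair of token positions with `0 <= start <= end <= seq_len`. The actual invariants expected by the Triton kernel are not stated in the issue.

```python
from typing import Iterable, List, Tuple


def validate_prefix_ranges(
    ranges: Iterable[Tuple[int, int]], seq_len: int
) -> List[str]:
    """Return a list of problems found; an empty list means all ranges pass.

    Assumption (not confirmed by the issue): a valid range satisfies
    0 <= start <= end <= seq_len.
    """
    problems: List[str] = []
    for i, (start, end) in enumerate(ranges):
        if start < 0 or end < 0:
            problems.append(f"range {i}: negative bound ({start}, {end})")
        elif start > end:
            problems.append(f"range {i}: start {start} is after end {end}")
        elif end > seq_len:
            problems.append(f"range {i}: end {end} exceeds seq_len {seq_len}")
    return problems


if __name__ == "__main__":
    # Hypothetical ranges for a 4096-token sequence.
    for line in validate_prefix_ranges([(0, 10), (5, 3), (0, 5000)], seq_len=4096):
        print(line)
```

Calling a check like this just before the tensor is built would show whether the ranges are already wrong on the host, or whether the assert originates from a different, earlier kernel.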

Root cause analysis

The only config difference between allenai/MolmoWeb-8B and allenai/Molmo2-8B is:

  • max_position_embeddings: 10240 (MolmoWeb) vs 36864 (Molmo2)

Everything else — architecture, text config dimensions, vision config, processor code — is identical (diff of processing_molmo2.py shows zero functional differences).

The lower max_position_embeddings likely causes the multimodal prefix range computation to produce invalid index ranges that trigger a CUDA bounds check in the triton attention kernel.
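The config comparison described above can be reproduced with a small recursive diff. This is a minimal sketch, not the method the issue author used: `diff_configs` is a hypothetical helper, and the two dictionaries are trimmed stand-ins that contain only fields named in this issue. Placing `max_position_embeddings` at the top level is an assumption; the recursion also handles the case where it sits in a nested section.

```python
_MISSING = object()  # sentinel for keys present in only one config


def diff_configs(a: dict, b: dict, prefix: str = "") -> dict:
    """Recursively collect dotted key paths whose values differ."""
    diffs = {}
    for key in sorted(set(a) | set(b)):
        path = f"{prefix}{key}"
        va, vb = a.get(key, _MISSING), b.get(key, _MISSING)
        if isinstance(va, dict) and isinstance(vb, dict):
            diffs.update(diff_configs(va, vb, prefix=path + "."))
        elif va != vb:
            diffs[path] = (va, vb)
    return diffs


# Trimmed stand-ins; real config.json files contain many more fields.
molmoweb = {
    "model_type": "molmo2",
    "architectures": ["Molmo2ForConditionalGeneration"],
    "max_position_embeddings": 10240,  # nesting level is an assumption
}
molmo2 = {
    "model_type": "molmo2",
    "architectures": ["Molmo2ForConditionalGeneration"],
    "max_position_embeddings": 36864,
}

if __name__ == "__main__":
    print(diff_configs(molmoweb, molmo2))
```

To compare the real checkpoints, load each model's `config.json` with `json.load` and pass the resulting dictionaries to the same function.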

How to reproduce

from vllm import LLM, SamplingParams
from PIL import Image
import urllib.request, io

url = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png"
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(req) as resp:
    image = Image.open(io.BytesIO(resp.read())).convert("RGB")

# This WORKS:
llm = LLM(model="allenai/Molmo2-8B", trust_remote_code=True, max_model_len=4096, max_num_batched_tokens=4096)
outputs = llm.generate(
    {"prompt": "<|image|> Describe this image.", "multi_modal_data": {"image": image}},
    sampling_params=SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)  # Works fine

# This CRASHES:
llm2 = LLM(model="allenai/MolmoWeb-8B", trust_remote_code=True, max_model_len=4096, max_num_batched_tokens=4096)
outputs2 = llm2.generate(
    {"prompt": "<|image|> Describe this image.", "multi_modal_data": {"image": image}},
    sampling_params=SamplingParams(temperature=0.0, max_tokens=64),
)
# torch.AcceleratorError: CUDA error: device-side assert triggered

Error traceback

File "vllm/v1/attention/backends/triton_attn.py", line 496, in forward
    mm_prefix_range_tensor = attn_metadata.mm_prefix_range_tensor
File "vllm/v1/attention/backends/triton_attn.py", line 108, in mm_prefix_range_tensor
    torch.tensor(r, dtype=torch.int32, device=device).view(-1, 2)
torch.AcceleratorError: CUDA error: device-side assert triggered
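CUDA kernels launch asynchronously, so a device-side assert is often reported at a later host call, such as the `torch.tensor(...)` line in this traceback, rather than at the kernel that failed. Setting `CUDA_LAUNCH_BLOCKING=1` before CUDA is initialized makes launches synchronous, which usually yields a more accurate traceback. The sketch below wraps that in a hypothetical helper intended for the top of the reproduction script, before `torch` or `vllm` is imported.

```python
import os
from typing import MutableMapping


def enable_sync_cuda_errors(
    env: MutableMapping[str, str] = os.environ,
) -> MutableMapping[str, str]:
    """Request synchronous CUDA kernel launches for debugging.

    Must run before torch initializes CUDA. Child processes started
    afterwards inherit the setting through the environment.
    """
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


if __name__ == "__main__":
    enable_sync_cuda_errors()
    print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

Synchronous launches slow execution considerably, so this is suited to debugging runs only.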

Additional notes

  • Also tested with enforce_eager=True — same crash
  • Also tested via chat API (llm.chat(...)) — same crash
  • MolmoWeb-4B also crashes with the same error
  • The prompt placeholder <|image|> is required. Without it, a separate AssertionError: Failed to apply prompt replacement occurs; that error also affects Molmo2, so it is not specific to MolmoWeb
  • This likely affects any Molmo2 fine-tune with max_position_embeddings different from the original 36864
  • MolmoWeb repo (https://github.com/allenai/molmoweb) lists "vLLM support coming soon" — this bug may be what's blocking them

Before submitting a new issue...

  • I have searched existing issues for similar problems
  • I have verified this is reproducible with vllm/vllm-openai:latest Docker image
  • I have confirmed the baseline model (allenai/Molmo2-8B) works correctly

AI assistance disclosure: This issue was prepared with AI assistance (Claude). All testing and analysis was reviewed by the human submitter.

Extent analysis

TL;DR

The issue attributes the crash to MolmoWeb's lower max_position_embeddings (10240, versus 36864 for Molmo2). Raising the value to 36864 is a way to test that hypothesis; it has not been confirmed as a fix.

Guidance

  • Verify that the crash occurs due to the max_position_embeddings difference by testing with a MolmoWeb model that has its max_position_embeddings set to 36864.
  • Update the max_position_embeddings value in the MolmoWeb model configuration to 36864 to potentially resolve the crash.
  • If updating the model configuration is not feasible, consider using the allenai/Molmo2-8B model as a workaround, as it does not exhibit the same issue.
  • Be cautious when updating model configurations, as changes can have unintended effects on model performance or behavior.

Example

No code snippet is provided, as the issue is related to model configuration rather than code.
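That said, the verification step in the guidance can be scripted. A minimal sketch, assuming a local copy of the checkpoint directory whose `config.json` may hold `max_position_embeddings` at the top level or in a nested section: `override_config_value` is a hypothetical helper that rewrites every occurrence of a key in a JSON config and reports how many it changed.

```python
import json
from pathlib import Path
from typing import Any


def override_config_value(config_path: str, key: str, value: Any) -> int:
    """Set every occurrence of `key` in a JSON config file to `value`.

    Searches nested dictionaries (lists are not traversed) and returns
    the number of occurrences that were overwritten.
    """
    path = Path(config_path)
    config = json.loads(path.read_text())

    def _apply(node: Any) -> int:
        changed = 0
        if isinstance(node, dict):
            for k in node:
                if k == key:
                    node[k] = value
                    changed += 1
                else:
                    changed += _apply(node[k])
        return changed

    count = _apply(config)
    path.write_text(json.dumps(config, indent=2))
    return count


if __name__ == "__main__":
    # Hypothetical path to a local copy of the MolmoWeb checkpoint.
    n = override_config_value(
        "MolmoWeb-8B-local/config.json", "max_position_embeddings", 36864
    )
    print(f"updated {n} occurrence(s)")
```

Run it only on a copy, then pass that directory as `model=` in the reproduction script. If the crash disappears, that supports the hypothesis; if it persists, the cause lies elsewhere.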

Notes

The max_position_embeddings difference is only the suspected cause. Confirm it by testing with the updated configuration, and evaluate any effect the change has on the model's performance or behavior.

Recommendation

Until the max_position_embeddings issue is resolved for the MolmoWeb models, allenai/Molmo2-8B is the known working configuration and can serve as a workaround where the MolmoWeb fine-tune is not strictly required.
