vllm - ✅(Solved) Fix [Bug]: CUDA assert in triton attention for MolmoWeb models (Molmo2 architecture with different max_position_embeddings) [2 pull requests, 2 comments, 1 participant]

vllm • 2026-03-31 21:35:47

GitHub stats

vllm-project/vllm#38660 • Fetched 2026-04-08 01:58:40

Author

2imi9

Participants

2imi9

Timeline (top)

commented ×2, cross-referenced ×2, referenced ×2

Error Message

from vllm import LLM, SamplingParams
from PIL import Image
import urllib.request, io

url = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png"
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(req) as resp:
    image = Image.open(io.BytesIO(resp.read())).convert("RGB")

# This WORKS:
llm = LLM(model="allenai/Molmo2-8B", trust_remote_code=True, max_model_len=4096, max_num_batched_tokens=4096)
outputs = llm.generate(
    {"prompt": "<|image|> Describe this image.", "multi_modal_data": {"image": image}},
    sampling_params=SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)  # Works fine

# This CRASHES:
llm2 = LLM(model="allenai/MolmoWeb-8B", trust_remote_code=True, max_model_len=4096, max_num_batched_tokens=4096)
outputs2 = llm2.generate(
    {"prompt": "<|image|> Describe this image.", "multi_modal_data": {"image": image}},
    sampling_params=SamplingParams(temperature=0.0, max_tokens=64),
)
# torch.AcceleratorError: CUDA error: device-side assert triggered

Root Cause

The only config difference between allenai/MolmoWeb-8B and allenai/Molmo2-8B is:

max_position_embeddings: 10240 (MolmoWeb) vs 36864 (Molmo2)

Everything else — architecture, text config dimensions, vision config, processor code — is identical (diff of processing_molmo2.py shows zero functional differences).

The lower max_position_embeddings likely causes the multimodal prefix range computation to produce invalid index ranges that trigger a CUDA bounds check in the triton attention kernel.
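
The config comparison described above can be reproduced with a small recursive diff. The following is a minimal sketch, not the method the issue author used: `diff_configs` is a hypothetical helper, and the two dictionaries are illustrative stand-ins that carry only the values quoted in this report (the real config.json files hold many more keys, and the position of `max_position_embeddings` inside them is an assumption here).

```python
def diff_configs(a, b, prefix=""):
    """Return {dotted.key.path: (value_in_a, value_in_b)} for every differing leaf."""
    diffs = {}
    for key in sorted(set(a) | set(b)):
        path = f"{prefix}{key}"
        va, vb = a.get(key), b.get(key)
        if isinstance(va, dict) and isinstance(vb, dict):
            # Recurse into nested sections such as a text or vision sub-config.
            diffs.update(diff_configs(va, vb, path + "."))
        elif va != vb:
            # A key missing on one side is reported with None for that side.
            diffs[path] = (va, vb)
    return diffs


# Illustrative stand-ins only: values taken from the issue text, not full configs.
molmoweb_cfg = {"model_type": "molmo2", "max_position_embeddings": 10240}
molmo2_cfg = {"model_type": "molmo2", "max_position_embeddings": 36864}

print(diff_configs(molmoweb_cfg, molmo2_cfg))
```

In practice the two dictionaries would be loaded with `json.load` from each model's downloaded config.json; a single-entry result would support the claim that only `max_position_embeddings` differs.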

Fix Action

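Not confirmed as fixed. The two linked pull requests add a vLLM backend to allenai/molmoweb rather than changing vLLM itself; neither is merged, and both describe vllm-project/vllm#38660 as still open.

Linked PR: Add vLLM inference backend (https://github.com/allenai/molmoweb/pull/6)
Linked PR: Add vLLM inference backend (https://github.com/allenai/molmoweb/pull/7)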

PR fix notes

PR #6: Add vLLM inference backend

Repository: allenai/molmoweb
Author: 2imi9
State: closed | merged: False
Link: https://github.com/allenai/molmoweb/pull/6

Description (problem / solution / changelog)

Summary

Add vLLM as an inference backend for MolmoWeb, enabling high-throughput serving.

Changes

agent/model_backends.py: New VLLMActionPredictor class
agent/fastapi_model_server.py: Add "vllm" predictor type
pyproject.toml: Add vllm>=0.15.0 as optional dep
README.md: Add vLLM backend docs

Usage

pip install -e ".[vllm]"
PREDICTOR_TYPE=vllm CKPT=allenai/MolmoWeb-8B python -m agent.fastapi_model_server

Note

There is an open issue (vllm-project/vllm#38660) where MolmoWeb models hit a CUDA assert on Blackwell GPUs (RTX 5090). Likely works on A100/H100 — testing appreciated.

Changed files

README.md (modified, +14/-2)
agent/fastapi_model_server.py (modified, +7/-1)
agent/model_backends.py (modified, +77/-0)
pyproject.toml (modified, +3/-0)

PR #7: Add vLLM inference backend

Repository: allenai/molmoweb
Author: 2imi9
State: open | merged: False
Link: https://github.com/allenai/molmoweb/pull/7

Description (problem / solution / changelog)

Summary

Add vLLM as an inference backend for MolmoWeb.

Changes

agent/model_backends.py: New VLLMActionPredictor class
agent/fastapi_model_server.py: Add "vllm" predictor type
pyproject.toml: Add vllm>=0.15.0 as optional dep
README.md: Add vLLM backend docs

Usage

pip install -e ".[vllm]"
PREDICTOR_TYPE=vllm CKPT=allenai/MolmoWeb-8B python -m agent.fastapi_model_server

Note

There is an open issue (vllm-project/vllm#38660) where MolmoWeb models hit a CUDA assert on Blackwell GPUs (RTX 5090). Likely works on A100/H100 — testing appreciated.

Changed files

README.md (modified, +14/-2)
agent/fastapi_model_server.py (modified, +7/-1)
agent/model_backends.py (modified, +77/-0)
pyproject.toml (modified, +3/-0)

Original issue text

Your current environment

vLLM version: 0.18.1 (Docker vllm/vllm-openai:latest)
GPU: NVIDIA RTX 5090 (24GB)
CUDA: 13.2
PyTorch: 2.7

Model

allenai/MolmoWeb-8B and allenai/MolmoWeb-4B — both use Molmo2ForConditionalGeneration architecture (model_type: "molmo2"), identical to allenai/Molmo2-8B.

Bug description

MolmoWeb models crash with a CUDA device-side assert in the triton attention kernel during inference. allenai/Molmo2-8B works fine with the same code.

The crash occurs in triton_attn.py at mm_prefix_range_tensor property, specifically at:

torch.tensor(r, dtype=torch.int32, device=device).view(-1, 2)

This is in the multimodal bidirectional attention path (is_mm_prefix_lm = True), which is enabled for all model_type: "molmo2" models.
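
To make the term concrete, here is a conceptual sketch of a prefix-LM attention mask: attention is causal everywhere except inside each multimodal range, where tokens attend to each other in both directions. This is not vLLM's Triton kernel; `prefix_lm_mask` is a hypothetical helper, and treating each range as an inclusive `(start, end)` pair is an assumption made for illustration.

```python
def prefix_lm_mask(seq_len, mm_ranges):
    """Return mask[q][k], True when query position q may attend to key position k.

    mm_ranges is a list of (start, end) pairs, assumed inclusive, marking
    spans (for example image tokens) that attend bidirectionally.
    """
    # Start from a plain causal mask: each position sees itself and the past.
    mask = [[k <= q for k in range(seq_len)] for q in range(seq_len)]
    for start, end in mm_ranges:
        # Inside a multimodal span, every token may also see later tokens
        # of the same span.
        for q in range(start, end + 1):
            for k in range(start, end + 1):
                mask[q][k] = True
    return mask


mask = prefix_lm_mask(5, [(1, 3)])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Positions 1 to 3 form a fully connected block, while positions 0 and 4 keep ordinary causal visibility.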

Root cause analysis

The only config difference between allenai/MolmoWeb-8B and allenai/Molmo2-8B is:

max_position_embeddings: 10240 (MolmoWeb) vs 36864 (Molmo2)

Everything else — architecture, text config dimensions, vision config, processor code — is identical (diff of processing_molmo2.py shows zero functional differences).

The lower max_position_embeddings likely causes the multimodal prefix range computation to produce invalid index ranges that trigger a CUDA bounds check in the triton attention kernel.
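
One way to test the "invalid index ranges" hypothesis is to check the ranges on the CPU before they reach the GPU. The sketch below is a hypothetical checker, not part of vLLM: the function name, the inclusive `(start, end)` convention, and the idea of logging the ranges from the attention metadata are all assumptions.

```python
def check_mm_prefix_ranges(ranges, seq_len):
    """Return a list of (range_index, reason) for every suspicious range.

    ranges: iterable of (start, end) pairs, assumed inclusive.
    seq_len: number of tokens in the sequence the ranges index into.
    """
    problems = []
    for i, (start, end) in enumerate(ranges):
        if start < 0 or end < 0:
            problems.append((i, "negative index"))
        if start > end:
            problems.append((i, "start after end"))
        if end >= seq_len:
            problems.append((i, "end beyond sequence length"))
    return problems


# A well-formed range passes; an inverted one and an overlong one are flagged.
print(check_mm_prefix_ranges([(0, 3), (5, 2), (8, 12)], seq_len=10))
```

An empty result for the crashing request would suggest that the ranges themselves are valid and the assert comes from elsewhere.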

How to reproduce

from vllm import LLM, SamplingParams
from PIL import Image
import urllib.request, io

url = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png"
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(req) as resp:
    image = Image.open(io.BytesIO(resp.read())).convert("RGB")

# This WORKS:
llm = LLM(model="allenai/Molmo2-8B", trust_remote_code=True, max_model_len=4096, max_num_batched_tokens=4096)
outputs = llm.generate(
    {"prompt": "<|image|> Describe this image.", "multi_modal_data": {"image": image}},
    sampling_params=SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)  # Works fine

# This CRASHES:
llm2 = LLM(model="allenai/MolmoWeb-8B", trust_remote_code=True, max_model_len=4096, max_num_batched_tokens=4096)
outputs2 = llm2.generate(
    {"prompt": "<|image|> Describe this image.", "multi_modal_data": {"image": image}},
    sampling_params=SamplingParams(temperature=0.0, max_tokens=64),
)
# torch.AcceleratorError: CUDA error: device-side assert triggered

Error traceback

File "vllm/v1/attention/backends/triton_attn.py", line 496, in forward
    mm_prefix_range_tensor = attn_metadata.mm_prefix_range_tensor
File "vllm/v1/attention/backends/triton_attn.py", line 108, in mm_prefix_range_tensor
    torch.tensor(r, dtype=torch.int32, device=device).view(-1, 2)
torch.AcceleratorError: CUDA error: device-side assert triggered
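
Because CUDA kernels launch asynchronously, a device-side assert is often reported at a later call than the one that caused it, so the `torch.tensor(...)` line in this traceback may only be where the error surfaced. A common way to narrow this down is to rerun with `CUDA_LAUNCH_BLOCKING=1`. The sketch below shows one way to set it from Python; `enable_blocking_cuda_launches` is a hypothetical helper name, and whether the setting changes the traceback for this particular crash is untested.

```python
import os


def enable_blocking_cuda_launches(env=None):
    """Set CUDA_LAUNCH_BLOCKING=1 so kernel launches run synchronously."""
    env = os.environ if env is None else env
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Call this before importing torch or vllm so the variable is in place
# before any CUDA context is created; child processes inherit it.
enable_blocking_cuda_launches()
# from vllm import LLM  # then rerun the reproduction script from the issue
```

Synchronous launches are much slower, so this is meant for a single debugging run rather than for serving.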

Additional notes

Also tested with enforce_eager=True — same crash
Also tested via chat API (llm.chat(...)) — same crash
MolmoWeb-4B also crashes with the same error
The prompt placeholder <|image|> is required (without it, a separate AssertionError: Failed to apply prompt replacement occurs, but that affects Molmo2 too when placeholder is missing)
This likely affects any Molmo2 fine-tune with max_position_embeddings different from the original 36864
MolmoWeb repo (https://github.com/allenai/molmoweb) lists "vLLM support coming soon" — this bug may be what's blocking them

Before submitting a new issue...

I have searched existing issues for similar problems
I have verified this is reproducible with vllm/vllm-openai:latest Docker image
I have confirmed the baseline model (allenai/Molmo2-8B) works correctly

AI assistance disclosure: This issue was prepared with AI assistance (Claude). All testing and analysis was reviewed by the human submitter.

extent analysis

TL;DR

A possible fix, not yet verified, is to raise max_position_embeddings in the MolmoWeb models to 36864, the value used by the Molmo2 models.

Guidance

Verify that the crash occurs due to the max_position_embeddings difference by testing with a MolmoWeb model that has its max_position_embeddings set to 36864.
Update the max_position_embeddings value in the MolmoWeb model configuration to 36864 to potentially resolve the crash.
If updating the model configuration is not feasible, consider using the allenai/Molmo2-8B model as a workaround, as it does not exhibit the same issue.
Be cautious when updating model configurations, as changes can have unintended effects on model performance or behavior.

Notes

The provided analysis suggests that the max_position_embeddings difference is the likely cause of the crash. However, it is essential to verify this by testing with the updated configuration. Additionally, updating the model configuration may have other effects on the model's performance or behavior, which should be carefully evaluated.
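
The verification suggested above could be run against a patched local copy of the model's config.json. The following is a minimal sketch under stated assumptions: `override_key` and `patch_config` are hypothetical helpers, the file paths are placeholders, and because the sketch does not know where `max_position_embeddings` sits in the file, it rewrites every occurrence of the key. It changes only the declared limit, not the trained weights, so it is a diagnostic step rather than a fix.

```python
import json
from pathlib import Path


def override_key(node, key, value):
    """Set every occurrence of key in a nested dict/list; return how many were set."""
    count = 0
    if isinstance(node, dict):
        for k in node:
            if k == key:
                node[k] = value
                count += 1
            else:
                count += override_key(node[k], key, value)
    elif isinstance(node, list):
        for item in node:
            count += override_key(item, key, value)
    return count


def patch_config(src, dst, key="max_position_embeddings", value=36864):
    """Copy the JSON config at src to dst with key overridden; return the change count."""
    config = json.loads(Path(src).read_text())
    changed = override_key(config, key, value)
    Path(dst).parent.mkdir(parents=True, exist_ok=True)
    Path(dst).write_text(json.dumps(config, indent=2))
    return changed
```

If the crash disappears when vLLM loads a local model directory containing the patched file, that would support the max_position_embeddings hypothesis; a return value of 0 means the key was not found and nothing was changed.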

Recommendation

Apply the workaround by using the allenai/Molmo2-8B model until the max_position_embeddings issue is resolved in the MolmoWeb models, as it is a known working configuration.
