vllm - 💡(How to fix) Fix [Bug]: R1 NVFP4 gsm8k drop in lm_eval [3 comments, 2 participants]

vllm · 2026-03-17 14:27:51


GitHub stats

vllm-project/vllm#37302 • Fetched 2026-04-08 00:53:38


Author

elvircrn

Participants

elvircrn

robertgshaw2-redhat

Timeline (top)

commented ×3, closed ×1, labeled ×1, mentioned ×1

Fix Action

Fix / Workaround

Use NVFP4 dispatch; masked_gemm is then required to avoid crashes (settings taken from the decode environment):

          - name: VLLM_DEEPEPLL_NVFP4_DISPATCH
            value: "1"
          - name: VLLM_FLASHINFER_MOE_BACKEND
            value: "masked_gemm"

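Code Example

gsm8k lm_eval results: the first table is from vLLM 106ff69c4eb and prior, the second from 9c7cab5ebb0f8a15e632e7ea2cfeebcca1d3628f.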

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|    10|exact_match|↑  |0.9545|±  |0.0057|
|     |       |strict-match    |    10|exact_match|↑  |0.9515|±  |0.0059|

---

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|    10|exact_match|↑  |0.9409|±  |0.0065|
|     |       |strict-match    |    10|exact_match|↑  |0.9393|±  |0.0066|


Your current environment

Prefill yaml:

<details>
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: wide-ep-prefill
  labels:
    llm-d.ai/inferenceServing: "true"
    llm-d.ai/model: DeepSeek-R1-0528-FP4-v2
    llm-d.ai/role: prefill
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    workerTemplate:
      metadata:
        labels:
          llm-d.ai/inferenceServing: "true"
          llm-d.ai/model: DeepSeek-R1-0528-FP4-v2
          llm-d.ai/role: prefill
      spec:
        serviceAccountName: wide-ep
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 2Gi
          - name: lustre
            persistentVolumeClaim:
              claimName: lustre-pvc-vllm
        resourceClaims:
          - name: compute-domain-channel
            resourceClaimTemplateName: llm-d-dev-claim
        containers:
          - name: vllm
            image: quay.io/rh-ee-ecrncevi/llm-dev-cuda13:v0.5.0-arm64-upstream-9c7cab5ebb0f8a15e632e7ea2cfeebcca1d3628f
            securityContext:
              capabilities:
                add:
                  - IPC_LOCK
                  - SYS_RAWIO
              runAsGroup: 0
              runAsUser: 0
            imagePullPolicy: Always
            command:
              - /bin/bash
              - -c
            args:
              - |-
                #################
                # RUN vLLM prefill with external DP load balancing
                # TP_SIZE controls tensor parallelism (GPUs per DP rank)
                # DP_SIZE_LOCAL is derived: 4 GPUs / TP_SIZE
                # No routing sidecar - serves directly on ports 8000+
                #################
                # DEV mode: use persistent Lustre venv; otherwise use baked-in image venv
                if [ -n "${VLLM_DEV_VENV}" ] && [ -d "${VLLM_DEV_VENV}" ]; then
                  echo "Using dev venv at ${VLLM_DEV_VENV}"
                  source "${VLLM_DEV_VENV}/bin/activate"
                else
                  source /opt/vllm/bin/activate
                fi

                cd /opt/vllm-source

                uv pip uninstall pplx-kernels

                DP_SIZE_LOCAL=$((4 / TP_SIZE))
                DP_SIZE=$((LWS_GROUP_SIZE * DP_SIZE_LOCAL))
                START_RANK=$(( ${LWS_WORKER_INDEX:-0} * DP_SIZE_LOCAL ))

                # Build DP flags only when there are multiple DP ranks
                DP_FLAGS=""
                if [ $DP_SIZE -gt 1 ]; then
                  DP_FLAGS="--data-parallel-size $DP_SIZE \
                    --data-parallel-size-local 1 \
                    --data-parallel-address ${LWS_LEADER_ADDRESS} \
                    --data-parallel-rpc-port 5555"
                fi

                for R in $(seq 0 $((DP_SIZE_LOCAL - 1))); do
                  GPU_START=$((R * TP_SIZE))
                  GPUS=$(seq -s, $GPU_START $((GPU_START + TP_SIZE - 1)))

                  RANK_FLAGS=""
                  if [ $DP_SIZE -gt 1 ]; then
                    RANK_FLAGS="--data-parallel-rank $((START_RANK + R))"
                  fi

                  VLLM_CACHE_ROOT=${VLLM_CACHE_ROOT}/rank$R \
                  FLASHINFER_CACHE_DIR=${FLASHINFER_CACHE_DIR}/rank$R \
                  CUDA_VISIBLE_DEVICES=$GPUS vllm serve \
                    nvidia/DeepSeek-R1-0528-FP4-v2 \
                    --port $((8000 + R)) \
                    --tensor-parallel-size $TP_SIZE \
                    --disable-uvicorn-access-log \
                    --enable-expert-parallel \
                    $DP_FLAGS $RANK_FLAGS \
                    --trust-remote-code \
                    --kv_transfer_config "$KV_TRANSFER_CONFIG" \
                    --async-scheduling \
                    --gpu-memory-utilization 0.75 \
                    --kv-cache-dtype fp8 \
                    --enforce-eager \
                    --max-cudagraph-capture-size 8192 \
                    --max-num-batched-tokens 8192 \
                    --all2all-backend allgather_reducescatter \
                    --enable-force-include-usage &
                done

                    #--profiler-config '{"profiler": "torch",
                    #                    "torch_profiler_dir": "/traces",
                    #                    "ignore_frontend": "true",
                    #                    "delay_iterations": 5,
                    #                    "max_iterations": 3}' \

                # If any process exits, kill all and restart pod
                wait -n
                kill $(jobs -p) 2>/dev/null
                exit 1

            env:
              # Set to a Lustre venv path to override baked-in vLLM (e.g. /mnt/lustre/tms/vllm-venv)
              - name: VLLM_DEV_VENV
                value: ""

              # Enable on-demand profiling via /start_profile and /stop_profile
              - name: VLLM_TORCH_PROFILER_DIR
                value: "/traces"

              # HuggingFace token and cache location
              - name: HF_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: HF_TOKEN
                    optional: true
              - name: HF_HOME
                value: /mnt/lustre/vllm-vlm-elvircrn
              - name: TRANSFORMERS_CACHE
                value: /mnt/lustre/vllm-vlm-elvircrn
              - name: HF_HUB_CACHE
                value: /mnt/lustre/vllm-vlm-elvircrn

              # Compile caches - persist across restarts for faster startup
              - name: VLLM_CACHE_ROOT
                value: /mnt/lustre/vllm-vlm-elvircrn/vllm_cache
              - name: FLASHINFER_CACHE_DIR
                value: /mnt/lustre/vllm-vlm-elvircrn/flashinfer_cache

              # Tensor parallelism - GPUs per DP rank (DP_SIZE_LOCAL = 4 / TP_SIZE)
              - name: TP_SIZE
                value: "1"
              - name: MAX_TOKENS
                value: "1024"

              # Common vLLM settings
              - name: TRITON_LIBCUDA_PATH
                value: /usr/lib64
              - name: VLLM_RANDOMIZE_DP_DUMMY_INPUTS
                value: "1"
              - name: NVIDIA_GDRCOPY
                value: enabled

              # Override container default that disables VMM
              - name: NVSHMEM_DISABLE_CUDA_VMM
                value: "0"

              # UCX configuration for NIXL - use only working InfiniBand devices
              # Excluded: mlx5_2/5 are Ethernet/RoCE with no GID configured, mlx5_6 is DOWN
              - name: UCX_NET_DEVICES
              #  value: "mlx5_0:1,mlx5_1:1,mlx5_3:1,mlx5_4:1"
              # Some ports are down on a couple of nodes.
                value: "mlx5_0:1,mlx5_1:1"

              # Engine startup timeout (default 600s, increase for large models over network storage)
              - name: VLLM_ENGINE_READY_TIMEOUT_S
                value: "1800"

              - name: VLLM_FLASHINFER_MOE_BACKEND
                value: "masked_gemm"

              # Debug logging
              - name: VLLM_LOGGING_LEVEL
                value: INFO
              #- name: NCCL_DEBUG
              #  value: "TRACE"
              #- name: NVSHMEM_DEBUG
              #  value: INFO

              # KV_TRANSFER_CONFIG and VLLM_NIXL_SIDE_CHANNEL_HOST are
              # added by overlays (pd vs decode-bench use different connectors)
            ports:
              - containerPort: 8000
                name: vllm-rank0
                protocol: TCP
              - containerPort: 8001
                name: vllm-rank1
                protocol: TCP
              - containerPort: 8002
                name: vllm-rank2
                protocol: TCP
              - containerPort: 8003
                name: vllm-rank3
                protocol: TCP
            readinessProbe:
              httpGet:
                path: /health
                port: 8000
              initialDelaySeconds: 60
              periodSeconds: 10
            resources:
              claims:
                - name: compute-domain-channel
              limits:
                ephemeral-storage: 128Gi
                memory: 512Gi
                nvidia.com/gpu: "4"
              requests:
                cpu: 32
                ephemeral-storage: 128Gi
                memory: 512Gi
                nvidia.com/gpu: "4"
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - mountPath: /mnt/lustre
                name: lustre
            workingDir: /code

</details>
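The rank arithmetic in the prefill script can be checked without starting vLLM. The sketch below is a dry run only: `rank_layout` is a hypothetical helper, not part of the manifest, and it assumes `LWS_GROUP_SIZE` equals `leaderWorkerTemplate.size` (2 for prefill) and that each pod has 4 GPUs, as the script's `4 / TP_SIZE` implies.

```shell
#!/usr/bin/env bash
# Dry run of the per-pod rank arithmetic from the prefill script.
# Prints the DP rank, GPU list and port each local process would get.
# Usage: rank_layout TP_SIZE LWS_GROUP_SIZE LWS_WORKER_INDEX
# (hypothetical helper; the body runs in a subshell so variables stay local)
rank_layout() (
  tp_size=$1
  group_size=$2
  worker_index=$3
  dp_size_local=$((4 / tp_size))              # 4 GPUs per pod
  dp_size=$((group_size * dp_size_local))     # DP ranks across the whole group
  start_rank=$((worker_index * dp_size_local))
  r=0
  while [ "$r" -lt "$dp_size_local" ]; do
    gpu_start=$((r * tp_size))
    gpus=$gpu_start
    g=$((gpu_start + 1))
    while [ "$g" -lt $((gpu_start + tp_size)) ]; do
      gpus="$gpus,$g"
      g=$((g + 1))
    done
    echo "dp_rank=$((start_rank + r))/$dp_size gpus=$gpus port=$((8000 + r))"
    r=$((r + 1))
  done
)

# TP_SIZE=1 and group size 2, as in the manifest; worker index 1 is the second pod.
rank_layout 1 2 1
```

With these values the second pod should host DP ranks 4 to 7 of 8, one GPU each, on ports 8000 to 8003. Note that the real script only passes the rank flags when `DP_SIZE` is greater than 1.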

Decode yaml:

<details>
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: wide-ep-decode
  labels:
    llm-d.ai/inferenceServing: "true"
    llm-d.ai/model: DeepSeek-R1-0528-FP4-v2
    llm-d.ai/role: decode
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 4
    workerTemplate:
      metadata:
        labels:
          llm-d.ai/inferenceServing: "true"
          llm-d.ai/model: DeepSeek-R1-0528-FP4-v2
          llm-d.ai/role: decode
      spec:
        serviceAccountName: wide-ep
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 2Gi
          - name: lustre
            persistentVolumeClaim:
              claimName: lustre-pvc-vllm
          # - name: jit-build-scripts
          #   configMap:
          #     name: jit-build-scripts
        resourceClaims:
          - name: compute-domain-channel
            resourceClaimTemplateName: llm-d-dev-claim
        initContainers:
          - name: routing-proxy-rank0
            image: ghcr.io/llm-d/llm-d-routing-sidecar:v0.5.0
            imagePullPolicy: Always
            args:
              - --port=8000
              - --vllm-port=8200
              - --secure-proxy=false
              - --connector=nixlv2
            ports:
              - containerPort: 8000
                name: proxy-rank0
                protocol: TCP
            restartPolicy: Always
            resources: {}
            securityContext:
              allowPrivilegeEscalation: false
          - name: routing-proxy-rank1
            image: ghcr.io/llm-d/llm-d-routing-sidecar:v0.5.0
            imagePullPolicy: Always
            args:
              - --port=8001
              - --vllm-port=8201
              - --secure-proxy=false
              - --connector=nixlv2
            ports:
              - containerPort: 8001
                name: proxy-rank1
                protocol: TCP
            restartPolicy: Always
            resources: {}
            securityContext:
              allowPrivilegeEscalation: false
          - name: routing-proxy-rank2
            image: ghcr.io/llm-d/llm-d-routing-sidecar:v0.5.0
            imagePullPolicy: Always
            args:
              - --port=8002
              - --vllm-port=8202
              - --secure-proxy=false
              - --connector=nixlv2
            ports:
              - containerPort: 8002
                name: proxy-rank2
                protocol: TCP
            restartPolicy: Always
            resources: {}
            securityContext:
              allowPrivilegeEscalation: false
          - name: routing-proxy-rank3
            image: ghcr.io/llm-d/llm-d-routing-sidecar:v0.5.0
            imagePullPolicy: Always
            args:
              - --port=8003
              - --vllm-port=8203
              - --secure-proxy=false
              - --connector=nixlv2
            ports:
              - containerPort: 8003
                name: proxy-rank3
                protocol: TCP
            restartPolicy: Always
            resources: {}
            securityContext:
              allowPrivilegeEscalation: false
        containers:
          - name: vllm
            image: quay.io/rh-ee-ecrncevi/llm-dev-cuda13:v0.5.0-arm64-upstream-9c7cab5ebb0f8a15e632e7ea2cfeebcca1d3628f
            securityContext:
              capabilities:
                add:
                  - IPC_LOCK
                  - SYS_RAWIO
              runAsGroup: 0
              runAsUser: 0
            imagePullPolicy: Always
            command:
              - /bin/bash
              - -c
            args:
              - |-
                set -e
                #################
                # RUN vLLM decode-bench with external DP load balancing
                # Launches 4 independent vLLM processes (one per GPU)
                # Routing sidecar maps port 800X -> localhost:820X
                #################
                START_RANK=$(( ${LWS_WORKER_INDEX:-0} * DP_SIZE_LOCAL ))

                # Forward logs to Lustre for crash debugging
                LOG_DIR="/mnt/lustre/vllm-vlm-elvircrn/logs/decode-bench"
                mkdir -p $LOG_DIR
                LOG_FILE="$LOG_DIR/${HOSTNAME}_$(date +%Y%m%d_%H%M%S).log"
                exec > >(tee -a "$LOG_FILE") 2>&1
                echo "=== Decode-bench worker starting at $(date) ==="

                # Derive batch-size-dependent env vars from MAX_TOKENS
                export VLLM_MOE_DP_CHUNK_SIZE=$MAX_TOKENS
                export NVSHMEM_QP_DEPTH=$((MAX_TOKENS * 2))  # should be >= 2 * VLLM_MOE_DP_CHUNK_SIZE

                DP_SIZE_LOCAL=$((4 / TP_SIZE))
                START_RANK=$(( ${LWS_WORKER_INDEX:-0} * DP_SIZE_LOCAL ))

                cd /opt/vllm-source

                # Revert EPLB NVFP4 PR #37217 to test without it
                # Image has a shallow clone with wrong remote; fetch full history from upstream
                # git remote set-url origin https://github.com/vllm-project/vllm
                # git fetch --unshallow origin || true
                # git revert --no-commit fd4d96302a

                uv pip uninstall flashinfer-jit-cache
                uv pip install flashinfer-python==0.6.4 flashinfer-cubin==0.6.4

                for R in $(seq 0 $((DP_SIZE_LOCAL - 1))); do
                  VLLM_CACHE_ROOT=${VLLM_CACHE_ROOT}/rank$R \
                  FLASHINFER_CACHE_DIR=${FLASHINFER_CACHE_DIR}/rank$R \
                  CUDA_VISIBLE_DEVICES=$R $NSYS_PREFIX vllm serve $PROFILER_ARGS \
                    nvidia/DeepSeek-R1-0528-FP4-v2 \
                    --host :: \
                    --port $((8200 + R)) \
                    --disable-uvicorn-access-log \
                    --enable-expert-parallel \
                    --data-parallel-size $((LWS_GROUP_SIZE * DP_SIZE_LOCAL)) \
                    --data-parallel-rank $((START_RANK + R)) \
                    --data-parallel-size-local 1 \
                    --data-parallel-address ${LWS_LEADER_ADDRESS} \
                    --data-parallel-rpc-port 5555 \
                    --trust-remote-code \
                    --gpu-memory-utilization 0.75 \
                    --kv-cache-dtype fp8 \
                    --kv_transfer_config "$KV_TRANSFER_CONFIG" \
                    --async-scheduling \
                    --compilation_config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
                    --max-cudagraph-capture-size 8192 \
                    --max-num-batched-tokens 8192 \
                    --all2all-backend "flashinfer_nvlink_one_sided" \
                    --enable-force-include-usage &
                done

                # --all2all-backend "flashinfer_nvlink_one_sided" \

                # If any process exits, kill all and restart pod
                wait -n
                kill $(jobs -p) 2>/dev/null
                exit 1

            env:
              # Set to a Lustre venv path to override baked-in vLLM (e.g. /mnt/lustre/tms/vllm-venv)
              - name: VLLM_DEV_VENV
                value: ""

              # Use NVFP4 dispatch, and then masked_gemm is required to avoid crashes
              - name: VLLM_DEEPEPLL_NVFP4_DISPATCH
                value: "1"
              - name: VLLM_FLASHINFER_MOE_BACKEND
                value: "masked_gemm"

              - name: MAX_TOKENS
                value: "1024"

              # Enable on-demand profiling via /start_profile and /stop_profile
              - name: VLLM_TORCH_PROFILER_DIR
                value: "/traces"

              # JIT build parallelism
              - name: MAX_JOBS
                value: "64"
              - name: CMAKE_BUILD_PARALLEL_LEVEL
                value: "64"
              - name: MAKEFLAGS
                value: "-j64"
              - name: NVCC_THREADS
                value: "64"
              - name: CUDA_HOME
                value: "/usr/local/cuda-13"

              # HuggingFace token and cache location
              - name: HF_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: HF_TOKEN
                    optional: true
              - name: HF_HOME
                value: /mnt/lustre/vllm-vlm-elvircrn
              - name: TRANSFORMERS_CACHE
                value: /mnt/lustre/vllm-vlm-elvircrn
              - name: HF_HUB_CACHE
                value: /mnt/lustre/vllm-vlm-elvircrn

              # Compile caches - persist across restarts for faster startup
              - name: VLLM_CACHE_ROOT
                value: /mnt/lustre/vllm-vlm-elvircrn/vllm_cache
              - name: FLASHINFER_CACHE_DIR
                value: /mnt/lustre/vllm-vlm-elvircrn/flashinfer_cache

              # Tensor parallelism - GPUs per DP rank (DP_SIZE_LOCAL = 4 / TP_SIZE)
              - name: TP_SIZE
                value: "1"

              - name: VLLM_ATTENTION_BACKEND
                value: CUTLASS_MLA

              - name: DP_SIZE_LOCAL
                value: "4"

              - name: VLLM_DEEPEP_LOW_LATENCY_USE_MNNVL
                value: "1"

              - name: VLLM_MLA_FUSED_ROPE_CACHE
                value: "1"

              - name: VLLM_MLA_FUSED_ABSORPTION
                value: "1"

              - name: VLLM_DEEPEP_COMBINE_GEMM2_OVERLAP
                value: "1"

              - name: VLLM_DEEPEP_COMBINE_COMM_SMS
                value: "32"

              # Common vLLM settings
              - name: TRITON_LIBCUDA_PATH
                value: /usr/lib64
              - name: VLLM_RANDOMIZE_DP_DUMMY_INPUTS
                value: "1"
              - name: VLLM_USE_DEEP_GEMM
                value: "1"
              - name: NVIDIA_GDRCOPY
                value: enabled

              # Use fabric handles for NVSHMEM symmetric heap (MNNVL)
              # Without this, NVSHMEM defaults to FILE_DESCRIPTOR (IPC handles)
              # and falls back to ibrc transport instead of NVLink fabric
              - name: NVSHMEM_CUMEM_HANDLE_TYPE
                value: FABRIC

              # Override container default that disables VMM
              # VMM is required for fabric memory / MNNVL cross-pod communication
              - name: NVSHMEM_DISABLE_CUDA_VMM
                value: "0"

              # UCX configuration for NIXL - use only working InfiniBand devices
              # Excluded: mlx5_2/5 are Ethernet/RoCE with no GID configured, mlx5_6 is DOWN
              - name: UCX_NET_DEVICES
                value: "mlx5_0:1,mlx5_1:1"

              # Engine startup timeout (default 600s, increase for large models over network storage)
              - name: VLLM_ENGINE_READY_TIMEOUT_S
                value: "1800"

              # Debug logging
              - name: VLLM_LOGGING_LEVEL
                value: INFO
              #- name: NCCL_DEBUG
              #  value: "TRACE"
              #- name: NVSHMEM_DEBUG
              #  value: INFO

              # KV_TRANSFER_CONFIG and VLLM_NIXL_SIDE_CHANNEL_HOST are
              # added by overlays (pd vs decode-bench use different connectors)
            ports:
              - containerPort: 8200
                name: vllm-rank0
                protocol: TCP
              - containerPort: 8201
                name: vllm-rank1
                protocol: TCP
              - containerPort: 8202
                name: vllm-rank2
                protocol: TCP
              - containerPort: 8203
                name: vllm-rank3
                protocol: TCP
            readinessProbe:
              httpGet:
                path: /health
                port: 8000
              initialDelaySeconds: 60
              periodSeconds: 10
            resources:
              claims:
                - name: compute-domain-channel
              limits:
                ephemeral-storage: 128Gi
                memory: 512Gi
                nvidia.com/gpu: "4"
              requests:
                cpu: 32
                ephemeral-storage: 128Gi
                memory: 512Gi
                nvidia.com/gpu: "4"
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - mountPath: /mnt/lustre/vllm-vlm-elvircrn
                name: lustre
              # - name: jit-build-scripts
              #   mountPath: /opt/vllm-source/tools/jit_mla_rope_quant.py
              #   subPath: jit_mla_rope_quant.py
              #   readOnly: true
              # - name: jit-build-scripts
              #   mountPath: /opt/vllm-source/tools/jit_mla_absorption_bmm.py
              #   subPath: jit_mla_absorption_bmm.py
              #   readOnly: true
            workingDir: /code

</details>
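The decode script derives two environment variables from `MAX_TOKENS` and pairs each routing sidecar port with a vLLM port. A small dry run makes those derived values visible; `decode_layout` is a hypothetical helper written for this illustration and only mirrors the arithmetic in the script.

```shell
#!/usr/bin/env bash
# Dry run of the values the decode script derives before launching vLLM.
# Usage: decode_layout MAX_TOKENS DP_SIZE_LOCAL
# (hypothetical helper; the body runs in a subshell so variables stay local)
decode_layout() (
  max_tokens=$1
  dp_size_local=$2
  echo "VLLM_MOE_DP_CHUNK_SIZE=$max_tokens"
  echo "NVSHMEM_QP_DEPTH=$((max_tokens * 2))"   # script comment: >= 2 * chunk size
  r=0
  while [ "$r" -lt "$dp_size_local" ]; do
    # Sidecar listens on 800X and forwards to the vLLM process on 820X.
    echo "rank$r: sidecar $((8000 + r)) -> vllm $((8200 + r)), CUDA_VISIBLE_DEVICES=$r"
    r=$((r + 1))
  done
)

# MAX_TOKENS=1024 and DP_SIZE_LOCAL=4, as set in the decode env block.
decode_layout 1024 4
```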

🐛 Describe the bug

On NVL72, at vLLM commit 106ff69c4eb and prior, DeepSeek-R1-0528-FP4-v2 produces the following gsm8k lm_eval results:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|    10|exact_match|↑  |0.9545|±  |0.0057|
|     |       |strict-match    |    10|exact_match|↑  |0.9515|±  |0.0059|

These results were obtained with FlashInfer (FI) 0.6.4, the new FI all-to-all backend (--all2all-backend "flashinfer_nvlink_one_sided"), and DeepEP low-latency (LL) from this DeepEP fork (https://github.com/elvircrn/DeepEP/commits/gb200_blog/).

On 9c7cab5ebb0f8a15e632e7ea2cfeebcca1d3628f I noticed a degradation in lm_eval gsm8k:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|    10|exact_match|↑  |0.9409|±  |0.0065|
|     |       |strict-match    |    10|exact_match|↑  |0.9393|±  |0.0066|

I ruled out the FI 0.6.4 -> 0.6.6 upgrade on both the decode and prefill paths, so it does not explain the drop.
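To put the size of the drop in context, the two tables can be compared in units of their reported standard errors. The sketch below is a rough check, not part of the original report: `z_score` is a hypothetical helper, and it treats the two runs as independent, which is only an approximation because both use the same gsm8k questions.

```shell
#!/usr/bin/env bash
# Difference between two accuracies in units of the combined standard error.
# Usage: z_score ACC_A STDERR_A ACC_B STDERR_B   (hypothetical helper)
z_score() {
  LC_ALL=C awk -v a="$1" -v sa="$2" -v b="$3" -v sb="$4" \
    'BEGIN { printf "%.2f\n", (a - b) / sqrt(sa * sa + sb * sb) }'
}

z_score 0.9545 0.0057 0.9409 0.0065   # flexible-extract, prints 1.57
z_score 0.9515 0.0059 0.9393 0.0066   # strict-match, prints 1.38
```

Under that independence assumption the drop is roughly 1.4 to 1.6 combined standard errors on each filter. Because the runs share the same test questions, a per-question comparison of the two outputs would be a more sensitive test.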


extent analysis

Fix Plan

To address the degradation in lm_eval gsm8k, the commits between the two tested hashes need to be investigated and, where necessary, reverted.

Revert recent commits: Revert the commits after 106ff69c4eb (the last hash reported as good) up to and including 9c7cab5ebb0f8a15e632e7ea2cfeebcca1d3628f (the hash where the drop was observed), then re-run the evaluation to see whether the issue persists.
Check FI version: Although the FI upgrade from 0.6.4 to 0.6.6 was ruled out, double-check which FI version is actually installed in each pod. The decode script pins flashinfer-python==0.6.4 and flashinfer-cubin==0.6.4 at startup.
Verify DeepEP fork: Ensure that the DeepEP fork is built from the intended commit and that changes in the fork are not causing the issue.

Example code to revert commits:

git revert --no-commit 106ff69c4eb..9c7cab5ebb0f8a15e632e7ea2cfeebcca1d3628f
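Reverting the whole range shows whether the regression lies inside it, but not which commit introduced it. One option is `git bisect run` with a script whose exit code classifies each commit. The sketch below is illustrative only: `accuracy_ok`, `bisect_step.sh` and `run_gsm8k.sh` are hypothetical names, the 0.9477 threshold is simply the midpoint of the two reported flexible-extract scores, and the rebuild, redeploy and evaluation steps depend on your cluster.

```shell
#!/usr/bin/env bash
# Exit-code helper for `git bisect run`: status 0 marks a commit good,
# status 1 marks it bad.
# Usage: accuracy_ok MEASURED THRESHOLD   (hypothetical helper)
accuracy_ok() {
  LC_ALL=C awk -v m="$1" -v t="$2" 'BEGIN { exit !(m + 0 >= t + 0) }'
}

# Hypothetical driver script (bisect_step.sh), shown as comments because
# every step in it is environment-specific:
#   acc=$(./run_gsm8k.sh)          # hypothetical: rebuilds, redeploys, prints the score
#   accuracy_ok "$acc" 0.9477      # assumed threshold, midpoint of 0.9409 and 0.9545
#
# Then, in the vLLM checkout (bad commit first, good commit second):
#   git bisect start 9c7cab5ebb0f8a15e632e7ea2cfeebcca1d3628f 106ff69c4eb
#   git bisect run ./bisect_step.sh
#   git bisect reset

# Classify the two scores from the report.
accuracy_ok 0.9545 0.9477 && echo "0.9545 -> good"
accuracy_ok 0.9409 0.9477 || echo "0.9409 -> bad"
```

Since the gap between the two scores is under two combined standard errors, a single threshold may misclassify a commit if results vary between runs; repeating the evaluation at ambiguous commits is a reasonable precaution.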

Verification

To verify the fix, re-run the lm_eval gsm8k test and compare the results against the baseline from 106ff69c4eb:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|    10|exact_match|↑  |0.9545|±  |0.0057|
|     |       |strict-match    |    10|exact_match|↑  |0.9515|±  |0.0059|

If the results return to this baseline within the reported stderr, the fix is successful.

Extra Tips

Always test changes in a non-production environment before deploying to production.
Use version control to track changes and revert if necessary.
Double-check dependencies and configurations to ensure they are correct.
