vllm - 💡(How to fix) Fix Benchmark: Radix vs. PagedAttention Scaling (SGLang / vLLM) [1 participants]

vllm · 2026-03-21 01:44:06

GitHub stats

vllm-project/vllm#37730•Fetched 2026-04-08 01:08:31

Author

glaziermag

Participants

glaziermag

Timeline (top)

closed ×1

Root Cause

SGLang hit a bottleneck under the 150-concurrency load. Its Python router pipeline was constrained by the GIL, which capped CPU usage at roughly one saturated core (127.09% host CPU utilization), while vLLM used about 2.5 cores (251.12%) of the 32 available vCPUs. As a result, SGLang's average latency was about 2.4x that of vLLM (1.0134s vs. 0.4140s).

Benchmark: Radix vs. PagedAttention Scaling (SGLang / vLLM)

Problem Description

This issue documents a benchmark comparing SGLang (RadixAttention) and vLLM (PagedAttention) to observe the "Scaling Zero-Sum" trade-off. SGLang’s Radix tree optimizes prefix-sharing across requests but uses Python-based routing, which can be vulnerable to Python Global Interpreter Lock (GIL) contention under high concurrency. vLLM mitigates the Python GIL bottleneck by offloading its PagedAttention implementation into C++ CUDA extensions, allowing better multi-threading scaling.
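The GIL effect described above can be observed in isolation with a small sketch. It runs a pure-Python busy loop on several threads and reports process CPU time divided by wall time, in the same "percent of one core" sense as the utilization figures in this issue. The function names are illustrative and not taken from SGLang or vLLM. On a standard GIL-enabled CPython build the result typically stays near 100% regardless of thread count, while free-threaded builds may behave differently.

```python
import threading
import time


def spin(iterations):
    # Pure-Python CPU-bound loop: the running thread holds the GIL.
    total = 0
    for i in range(iterations):
        total += i * i
    return total


def measure_cpu_utilization(num_threads, iterations):
    """Run `spin` on several threads; return process CPU time / wall time in percent."""
    threads = [
        threading.Thread(target=spin, args=(iterations,))
        for _ in range(num_threads)
    ]
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return 100.0 * cpu / wall


if __name__ == "__main__":
    for n in (1, 4, 8):
        print(f"{n} threads: {measure_cpu_utilization(n, 2_000_000):.0f}% CPU")
```

Work that runs in native code and releases the GIL is not subject to this ceiling, which is the distinction the benchmark is trying to expose.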

To observe this behavior, we ran a 150-concurrency load test against identical Qwen/Qwen2.5-0.5B deployments on a g2-standard-32 Google Compute Engine instance (1x NVIDIA L4, 32 vCPUs).

Reproduction / Context

Hardware Context:

Machine Type: g2-standard-32 (Google Cloud)
GPU: 1x NVIDIA L4 (24GB VRAM)
Host Cores: 32 vCPUs
Model: Qwen/Qwen2.5-0.5B

The Load Test (barrage.py): We sent concurrent requests querying a shared payload pool to saturate the backend event loops:

Concurrency: 150 simultaneous asynchronous workers
Duration: 45 seconds
Endpoint: /v1/completions
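The source of barrage.py is not included in this issue, so the following is only a minimal sketch of a closed-loop load generator with the same shape: a fixed number of asynchronous workers, each sending requests back to back until a deadline. `run_barrage` and `fake_send` are hypothetical names, and `fake_send` sleeps instead of issuing a real HTTP request to /v1/completions.

```python
import asyncio
import time


async def run_barrage(send, concurrency=150, duration=45.0):
    """Closed-loop load test: each worker sends requests back to back until the deadline."""
    latencies = []
    deadline = time.perf_counter() + duration

    async def worker():
        while time.perf_counter() < deadline:
            start = time.perf_counter()
            await send()
            latencies.append(time.perf_counter() - start)

    await asyncio.gather(*(worker() for _ in range(concurrency)))
    completed = len(latencies)
    return {
        "completed": completed,
        "throughput": completed / duration,
        "avg_latency": sum(latencies) / completed if completed else 0.0,
    }


async def fake_send():
    # Stand-in for an HTTP POST to /v1/completions; sleeps instead of doing network I/O.
    await asyncio.sleep(0.01)


if __name__ == "__main__":
    print(asyncio.run(run_barrage(fake_send, concurrency=10, duration=0.5)))
```

To drive a real server, `send` would be replaced by a coroutine that posts a payload from the shared pool using an async HTTP client of your choice.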

Execution Diagnostics (Side-by-Side)

The Docker hardware monitoring metrics showed the following results:

1. vLLM (PagedAttention)

Total Requests Completed: 16,369
Throughput: 363.76 Requests/sec
Average Latency: 0.4140s
Host CPU Utilization: 251.12%

vLLM avoided the Python GIL limit, pooling ~2.5 native CPU cores across its C++ extensions to process the queue.

2. SGLang (RadixAttention)

Total Requests Completed: 6,750
Throughput: 150.00 Requests/sec (<50% of vLLM)
Average Latency: 1.0134s
Host CPU Utilization: 127.09%
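The reported figures can be cross-checked with a few lines of arithmetic. The sketch below recomputes throughput from the completed-request counts and the 45-second duration, and applies Little's law (in-flight requests ≈ throughput × average latency), which for a closed loop of 150 workers should land near 150. Only numbers stated in this issue are used.

```python
DURATION_S = 45
CONCURRENCY = 150

runs = {
    "vLLM": {"completed": 16_369, "avg_latency_s": 0.4140},
    "SGLang": {"completed": 6_750, "avg_latency_s": 1.0134},
}


def summarize(run):
    throughput = run["completed"] / DURATION_S
    # Little's law for a closed loop: in-flight requests ~= throughput * latency
    in_flight = throughput * run["avg_latency_s"]
    return throughput, in_flight


for name, run in runs.items():
    throughput, in_flight = summarize(run)
    print(f"{name}: {throughput:.2f} req/s, ~{in_flight:.1f} requests in flight")

latency_ratio = runs["SGLang"]["avg_latency_s"] / runs["vLLM"]["avg_latency_s"]
throughput_ratio = runs["SGLang"]["completed"] / runs["vLLM"]["completed"]
print(f"latency ratio: {latency_ratio:.2f}x, throughput ratio: {throughput_ratio:.1%}")
```

Both runs come out close to 150 requests in flight (about 150.6 and 152.0), so the throughput and latency numbers are consistent with the stated concurrency. The latency ratio is about 2.45x and SGLang's throughput is about 41.2% of vLLM's.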

Conclusion

The telemetry supports the "Scaling Zero-Sum" hypothesis. For high concurrent loads where processing parallelism is required to saturate the GPU, vLLM's C++ PagedAttention scales better by bypassing Python contention. SGLang's Radix tree is effective for prefix matching at lower concurrency, but may require C++/Rust bindings to scale the router architecture effectively.

extent analysis

Fix Plan

To address the Python GIL contention issue in SGLang's RadixAttention, we will:

Implement C++ bindings for the router pipeline to bypass the GIL
Utilize a thread pool to manage concurrent requests
Optimize the Radix tree data structure for better performance

Example Code

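import ctypes
import threading
from queue import Queue

# Load the C++ extension module. ctypes.CDLL releases the GIL during each foreign call.
radix_cpp = ctypes.CDLL('./radix_cpp.so')
radix_cpp.process_request.argtypes = [ctypes.c_char_p]
radix_cpp.process_request.restype = None

# Define a thread pool class
class ThreadPool:
    def __init__(self, num_threads):
        self.queue = Queue()
        self.threads = []
        for _ in range(num_threads):
            # Daemon threads let the interpreter exit once the main thread finishes
            t = threading.Thread(target=self.worker, daemon=True)
            t.start()
            self.threads.append(t)

    def worker(self):
        while True:
            func, args = self.queue.get()
            try:
                func(*args)
            finally:
                # Mark the task done even if func raises, so join() cannot hang
                self.queue.task_done()

    def submit(self, func, *args):
        self.queue.put((func, args))

    def join(self):
        # Block until every submitted task has been processed
        self.queue.join()

# Define a function to process requests using the C++ extension
def process_request(request):
    # const char* maps to bytes in ctypes, so encode str payloads first
    radix_cpp.process_request(request.encode('utf-8'))

# Placeholder payloads; replace with the real request stream
requests = ['example request 1', 'example request 2']

# Create a thread pool with 32 threads
pool = ThreadPool(32)

# Submit requests to the thread pool, then wait for them to finish
for request in requests:
    pool.submit(process_request, request)
pool.join()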

C++ Extension Code

// radix_cpp.cpp
extern "C" {
    void process_request(const char* request) {
        // Implement the Radix tree processing logic here
    }
}

Compile the C++ code into a shared library:

g++ -O2 -shared -fPIC -o radix_cpp.so radix_cpp.cpp

Verification

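To verify the fix, rerun the same load test (150 concurrent workers, 45 seconds, /v1/completions) and compare against the SGLang baseline above: throughput should rise above 150.00 requests/sec, average latency should fall below 1.0134s, and host CPU utilization should exceed 127.09% as work spreads across multiple cores.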
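A re-run can be checked mechanically against the baseline. The sketch below is one hypothetical way to do it; `check_improvement` and the dictionary keys are illustrative names. No post-fix measurements exist in this issue, so the vLLM run is used as a stand-in for the re-run purely to show the comparison working.

```python
def check_improvement(before, after):
    """Return per-metric pass/fail flags for a re-run compared with a baseline."""
    return {
        "throughput_up": after["throughput_rps"] > before["throughput_rps"],
        "latency_down": after["avg_latency_s"] < before["avg_latency_s"],
        "uses_more_cores": after["cpu_percent"] > before["cpu_percent"],
    }


# Baseline: the SGLang run reported in this issue.
sglang_baseline = {"throughput_rps": 150.00, "avg_latency_s": 1.0134, "cpu_percent": 127.09}
# Stand-in for a re-run: the vLLM figures from this issue, not post-fix SGLang data.
vllm_reference = {"throughput_rps": 363.76, "avg_latency_s": 0.4140, "cpu_percent": 251.12}

result = check_improvement(sglang_baseline, vllm_reference)
print(result)
```

After an actual fix, the second dictionary would be filled with the metrics from the new SGLang run.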

Extra Tips

Use a profiling tool to identify performance bottlenecks in the C++ extension code
Consider using a Rust binding instead of C++ for stronger memory safety
