vllm - 💡(How to fix) Fix Benchmark: Radix vs. PagedAttention Scaling (SGLang / vLLM) [1 participants]

GitHub stats
vllm-project/vllm#37730 · Fetched 2026-04-08 01:08:31
Comments: 0 · Participants: 1 · Timeline: 1 (closed ×1) · Reactions: 3

Root Cause

SGLang became bottlenecked under the 150-concurrency load. According to the issue, its Python router pipeline was constrained by the GIL, which held host CPU utilization to 127%, or little more than one saturated core out of the 32 available vCPUs. vLLM reached about 2.5 cores (251%), and SGLang's average latency was roughly 2.4 times vLLM's (1.0134s vs. 0.4140s).

Benchmark: Radix vs. PagedAttention Scaling (SGLang / vLLM)

Problem Description

This issue documents a benchmark comparing SGLang (RadixAttention) and vLLM (PagedAttention) to observe the "Scaling Zero-Sum" trade-off. SGLang’s Radix tree optimizes prefix-sharing across requests but uses Python-based routing, which can be vulnerable to Python Global Interpreter Lock (GIL) contention under high concurrency. vLLM mitigates the Python GIL bottleneck by offloading its PagedAttention implementation into C++ CUDA extensions, allowing better multi-threading scaling.
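
To make the prefix-sharing idea concrete, here is a minimal sketch of token-level prefix matching. It is an uncompressed trie rather than a true radix tree, and it is not SGLang's implementation; the names `PrefixNode`, `PrefixTree`, and `match_prefix_len`, as well as the token IDs, are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


class PrefixNode:
    """One node of a token-level prefix tree (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.children: Dict[int, "PrefixNode"] = {}


class PrefixTree:
    """Records token sequences already seen so shared prefixes can be detected."""

    def __init__(self) -> None:
        self.root = PrefixNode()

    def insert(self, tokens: List[int]) -> None:
        # Walk down the tree, creating a child node for each unseen token.
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixNode())

    def match_prefix_len(self, tokens: List[int]) -> int:
        """Return how many leading tokens are already present in the tree."""
        node = self.root
        matched = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node = child
            matched += 1
        return matched


tree = PrefixTree()
system_prompt = [101, 7, 7, 42]          # made-up token IDs
tree.insert(system_prompt + [5, 6])      # first request populates the tree
reused = tree.match_prefix_len(system_prompt + [9, 9, 9])
print(f"tokens reusable from cache: {reused}")
```

In a serving system, the matched length would tell the scheduler how much previously computed state a new request could reuse. The lookup itself runs in Python here, which is the kind of per-request work the issue argues can contend for the GIL.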

To observe this behavior, we ran a 150-concurrency load test against identical Qwen/Qwen2.5-0.5B deployments on a g2-standard-32 Google Compute instance (1x NVIDIA L4, 32 vCPUs).

Reproduction / Context

Hardware Context:

  • Machine Type: g2-standard-32 (Google Cloud)
  • GPU: 1x NVIDIA L4 (24GB VRAM)
  • Host Cores: 32 vCPUs
  • Model: Qwen/Qwen2.5-0.5B

The Load Test (barrage.py): We sent concurrent requests querying a shared payload pool to saturate the backend event loops:

  • Concurrency: 150 simultaneous asynchronous workers
  • Duration: 45 seconds
  • Endpoint: /v1/completions
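
The issue does not include the contents of barrage.py, so the following is only a sketch of how a closed-loop asynchronous load generator with these parameters might be structured. The names `fake_send`, `worker`, and `run_barrage` are assumptions, and `fake_send` replaces the real HTTP POST to /v1/completions with a short sleep so the example runs without a server.

```python
import asyncio
import random
import time
from typing import Awaitable, Callable, List


async def fake_send(payload: str) -> None:
    """Stand-in for an HTTP POST to /v1/completions (assumption, no network I/O)."""
    await asyncio.sleep(0.001)


async def worker(send: Callable[[str], Awaitable[None]], payloads: List[str],
                 deadline: float, latencies: List[float]) -> None:
    # Each worker issues requests back to back until the deadline passes.
    while time.monotonic() < deadline:
        start = time.monotonic()
        await send(random.choice(payloads))
        latencies.append(time.monotonic() - start)


async def run_barrage(send: Callable[[str], Awaitable[None]], payloads: List[str],
                      concurrency: int, duration_s: float) -> dict:
    latencies: List[float] = []
    deadline = time.monotonic() + duration_s
    await asyncio.gather(*(worker(send, payloads, deadline, latencies)
                           for _ in range(concurrency)))
    completed = len(latencies)
    return {
        "completed": completed,
        "throughput_rps": completed / duration_s,
        "avg_latency_s": sum(latencies) / completed if completed else 0.0,
    }


# Small values so the sketch finishes quickly; the issue used 150 workers for 45 s.
stats = asyncio.run(run_barrage(fake_send, ["prompt A", "prompt B"],
                                concurrency=5, duration_s=0.2))
print(stats)
```

Because every worker waits for its previous request before sending the next, the number of in-flight requests never exceeds the concurrency setting. Throughput is computed as completed requests divided by the test duration, which matches how the reported figures relate to each other.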

Execution Diagnostics (Side-by-Side)

The Docker hardware monitoring metrics showed the following results:

1. vLLM (PagedAttention)

  • Total Requests Completed: 16,369
  • Throughput: 363.76 Requests/sec
  • Average Latency: 0.4140s
  • Host CPU Utilization: 251.12%

vLLM avoided the Python GIL limit, pooling ~2.5 native CPU cores across its C++ extensions to process the queue.

2. SGLang (RadixAttention)

  • Total Requests Completed: 6,750
  • Throughput: 150.00 Requests/sec (<50% of vLLM)
  • Average Latency: 1.0134s
  • Host CPU Utilization: 127.09%

SGLang became bottlenecked under the 150-concurrency load. The Python router pipeline was constrained by the GIL, which held host CPU utilization to 127%, or little more than one saturated core. vLLM, by comparison, kept about 2.5 of the 32 available vCPUs busy. As a result, SGLang's average latency was roughly 2.4 times vLLM's (1.0134s vs. 0.4140s).
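
As a sanity check, the reported figures can be recomputed from the raw counts. The sketch below uses only numbers stated above; the closed-loop form of Little's law (throughput ≈ concurrency / average latency) is an assumption about how the load generator behaves, not something the issue states.

```python
DURATION_S = 45
CONCURRENCY = 150

# Figures copied from the results above.
results = {
    "vLLM":   {"completed": 16_369, "avg_latency_s": 0.4140, "cpu_pct": 251.12},
    "SGLang": {"completed": 6_750,  "avg_latency_s": 1.0134, "cpu_pct": 127.09},
}

for name, r in results.items():
    r["throughput_rps"] = r["completed"] / DURATION_S
    # Little's law for a closed loop: throughput ≈ concurrency / latency
    r["littles_law_rps"] = CONCURRENCY / r["avg_latency_s"]
    # Docker-style CPU percentages count 100% per fully busy core.
    r["busy_cores"] = r["cpu_pct"] / 100
    print(f"{name}: {r['throughput_rps']:.2f} req/s measured, "
          f"{r['littles_law_rps']:.2f} req/s implied, "
          f"{r['busy_cores']:.2f} cores busy")

latency_ratio = results["SGLang"]["avg_latency_s"] / results["vLLM"]["avg_latency_s"]
throughput_ratio = results["SGLang"]["throughput_rps"] / results["vLLM"]["throughput_rps"]
print(f"latency ratio: {latency_ratio:.2f}x, throughput ratio: {throughput_ratio:.1%}")
```

The measured and implied throughputs agree to within about 2% for both servers, so the reported numbers are internally consistent. That consistency says nothing on its own about where the time is spent; attributing the gap to the GIL still rests on the CPU utilization reading.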

Conclusion

The telemetry supports the "Scaling Zero-Sum" hypothesis. For high concurrent loads where processing parallelism is required to saturate the GPU, vLLM's C++ PagedAttention scales better by bypassing Python contention. SGLang's Radix tree is effective for prefix matching at lower concurrency, but may require C++/Rust bindings to scale the router architecture effectively.

extent analysis

Fix Plan

To address the Python GIL contention issue in SGLang's RadixAttention, we will:

  • Implement C++ bindings for the router pipeline to bypass the GIL
  • Utilize a thread pool to manage concurrent requests
  • Optimize the Radix tree data structure for better performance

Example Code

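import ctypes
import threading
from queue import Queue

# Load the C++ extension module. ctypes.CDLL releases the GIL for the
# duration of each foreign call, so worker threads can run in parallel.
radix_cpp = ctypes.CDLL('./radix_cpp.so')
radix_cpp.process_request.argtypes = [ctypes.c_char_p]
radix_cpp.process_request.restype = None

# Define a thread pool class
class ThreadPool:
    def __init__(self, num_threads):
        self.queue = Queue()
        self.threads = []
        for _ in range(num_threads):
            # Daemon threads do not keep the interpreter alive at exit
            t = threading.Thread(target=self.worker, daemon=True)
            t.start()
            self.threads.append(t)

    def worker(self):
        while True:
            func, args = self.queue.get()
            try:
                func(*args)
            finally:
                # Always mark the task done so wait() cannot hang on an error
                self.queue.task_done()

    def submit(self, func, *args):
        self.queue.put((func, args))

    def wait(self):
        # Block until every submitted task has been processed
        self.queue.join()

# Define a function to process requests using the C++ extension
def process_request(request):
    # const char* expects bytes, so encode the str before the call
    radix_cpp.process_request(request.encode('utf-8'))

# Placeholder payloads; replace with the real request stream
requests = ['example prompt 1', 'example prompt 2']

# Create a thread pool with 32 threads (one per vCPU on the test host)
pool = ThreadPool(32)

# Submit requests to the thread pool, then wait for completion
for request in requests:
    pool.submit(process_request, request)
pool.wait()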

C++ Extension Code

// radix_cpp.cpp
extern "C" {
    void process_request(const char* request) {
        // Implement the Radix tree processing logic here
    }
}

Compile the C++ code into a shared library:

g++ -O2 -shared -fPIC -o radix_cpp.so radix_cpp.cpp

Verification

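To verify the fix, rerun the same load test (150 workers, 45 seconds, /v1/completions) and compare the results against the SGLang baseline above: 150.00 requests/sec, 1.0134s average latency, and 127.09% host CPU utilization. If the router was the bottleneck, throughput should rise, average latency should fall, and CPU utilization should climb well past one core.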
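
A small helper can make the before/after comparison explicit. This is a sketch: `compare` and `looks_improved` are hypothetical names, and because no rerun data exists in the issue, vLLM's reported figures stand in for the candidate run purely to exercise the helper.

```python
# SGLang figures reported in the issue, used as the baseline.
BASELINE = {"throughput_rps": 150.00, "avg_latency_s": 1.0134, "cpu_pct": 127.09}


def compare(baseline: dict, candidate: dict) -> dict:
    """Return candidate/baseline ratios for each metric present in both."""
    return {k: candidate[k] / baseline[k] for k in baseline if k in candidate}


def looks_improved(ratios: dict) -> bool:
    # Improvement here means more throughput AND lower latency than the baseline.
    return ratios["throughput_rps"] > 1.0 and ratios["avg_latency_s"] < 1.0


# Stand-in candidate: vLLM's reported figures, NOT a real SGLang rerun.
candidate = {"throughput_rps": 363.76, "avg_latency_s": 0.4140, "cpu_pct": 251.12}
ratios = compare(BASELINE, candidate)
for metric, ratio in ratios.items():
    print(f"{metric}: {ratio:.2f}x baseline")
print("improved:", looks_improved(ratios))
```

Requiring both conditions guards against a change that raises throughput only by letting latency grow, for example through deeper queueing.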

Extra Tips

  • Use a profiling tool to identify performance bottlenecks in the C++ extension code
  • Consider using a Rust binding instead of C++ for stronger memory-safety guarantees at comparable performance.
