vllm - ✅(Solved) Fix [Feature]: Support lightweight import of vllm protocol types without torch dependency [1 pull requests, 2 comments, 2 participants]

GitHub stats
vllm-project/vllm#38925 (fetched 2026-04-08 02:44:56)
Comments: 2 · Participants: 2 · Timeline events: 6 · Reactions: 0
Timeline (top): commented ×2, closed ×1, cross-referenced ×1, reopened ×1

PR fix notes

PR #17: fix: replace manylinux_2_35 CPU wheels with torch stub for UBI9 compat

Description (problem / solution / changelog)

Replace manylinux_2_35 CPU wheels (incompatible with UBI9 glibc 2.34) with a torch stub that allows vllm to import without PyTorch. The tokenizer only uses vllm protocol types, config, and rendering modules, with no GPU code. A post-install script registers the stub and strips ~1.3 GB of unused native extensions from the image.

workaround for: https://github.com/vllm-project/vllm/issues/38925
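
The contents of torch_stub.py are not shown on this page, so the following is only a minimal sketch of the general "stub module" technique the description refers to, not the PR's implementation. The module name `heavy_dep_placeholder` and the helper `register_stub` are hypothetical; a real torch stub would need far more attributes than this (the PR's file is 175 lines).

```python
import importlib.machinery
import importlib.util
import sys
import types


def register_stub(name: str, version: str = "0.0.0") -> types.ModuleType:
    """Register an empty placeholder module under `name` in sys.modules.

    If a module of that name is already loaded, it is returned unchanged.
    """
    if name in sys.modules:
        return sys.modules[name]
    stub = types.ModuleType(name)
    stub.__version__ = version
    # importlib.util.find_spec() raises ValueError for a loaded module whose
    # __spec__ is None, so give the stub a minimal spec.
    stub.__spec__ = importlib.machinery.ModuleSpec(name, loader=None)
    sys.modules[name] = stub
    return stub


# "heavy_dep_placeholder" is a made-up name used only for this demonstration.
stub = register_stub("heavy_dep_placeholder")
import heavy_dep_placeholder  # resolved from sys.modules; nothing is installed

print(heavy_dep_placeholder is stub)
print(importlib.util.find_spec("heavy_dep_placeholder") is not None)
```

Any code that later touches an attribute the stub does not define fails with AttributeError, which is why this approach only suits consumers that never reach the real library's functionality.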

Disabling PyTorch because PyTorch >= 2.1 is required but found 0.0.0
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
INFO 04-07 10:54:44 [importing.py:44] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 04-07 10:54:44 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
2026-04-07 10:54:47,247 [INFO] [root] TokenizationServiceServicer initialized
2026-04-07 10:54:47,247 [INFO] [root] gRPC reflection disabled (set `ENABLE_GRPC_REFLECTION=1` to enable)
2026-04-07 10:54:47,248 [INFO] [root] gRPC server configured to listen on /tmp/tokenizer/tokenizer-uds.socket
2026-04-07 10:54:47,248 [INFO] [root] gRPC server started on /tmp/tokenizer/tokenizer-uds.socket
2026-04-07 10:54:47,249 [INFO] [root] Probe server started on port 8082
2026-04-07 10:54:47,249 [INFO] [root] Server started.

Step 1: Initialize the tokenizer

oc exec qwen-test-epp-with-tokenizer -n redhat-ods-applications -c tokenizer -- python3 -c "
import grpc, sys 
sys.path.insert(0, '/app')
from tokenizerpb import tokenizer_pb2, tokenizer_pb2_grpc
channel = grpc.insecure_channel('unix:///tmp/tokenizer/tokenizer-uds.socket')
stub = tokenizer_pb2_grpc.TokenizationServiceStub(channel)
init_req = tokenizer_pb2.InitializeTokenizerRequest(model_name='Qwen/Qwen2.5-1.5B-Instruct')
init_resp = stub.InitializeTokenizer(init_req)
print(f'Initialize: success={init_resp.success}')
"

Expected:

Initialize: success=True

Step 2: Tokenize

oc exec qwen-test-epp-with-tokenizer -n redhat-ods-applications -c tokenizer -- python3 -c "
import grpc, sys
sys.path.insert(0, '/app')
from tokenizerpb import tokenizer_pb2, tokenizer_pb2_grpc
channel = grpc.insecure_channel('unix:///tmp/tokenizer/tokenizer-uds.socket')
stub = tokenizer_pb2_grpc.TokenizationServiceStub(channel)
req = tokenizer_pb2.TokenizeRequest(
    input='The history of artificial intelligence',
    model_name='Qwen/Qwen2.5-1.5B-Instruct'
)
resp = stub.Tokenize(req)
print(f'input_ids: {list(resp.input_ids)}')
print(f'success: {resp.success}')
"

Expected:

input_ids: [785, 3840, 315, 20443, 11229]
success: True

Pod Setup

The pod runs the EPP scheduler with the UDS tokenizer as a sidecar; the two share the UDS socket through an emptyDir volume. An init container downloads the model's tokenizer files into the expected path.

apiVersion: v1
kind: Pod
metadata:
  name: qwen-test-epp-with-tokenizer
  namespace: redhat-ods-applications
  labels:
    app.kubernetes.io/name: qwen-test-router-scheduler
    app.kubernetes.io/component: llminferenceservice-router-scheduler
spec:
  serviceAccountName: qwen-test-epp-sa
  initContainers:
  - name: fetch-tokenizer
    image: quay.io/rhoai/pull-request-pipelines:odh-llm-d-kv-cache-a63d65dbe1bdae59453ec2b7e398239b97a9e308
    command: ["python3", "-c"]
    args:
    - |
      import os; os.environ['HF_HUB_ENABLE_HF_TRANSFER'] = '0'
      from huggingface_hub import snapshot_download
      snapshot_download('Qwen/Qwen2.5-1.5B-Instruct',
        local_dir='/mnt/models/Qwen/Qwen2.5-1.5B-Instruct',
        allow_patterns=['tokenizer*', 'special_tokens*', 'vocab*', 'merges*', '*.json'])
    volumeMounts:
    - name: model-cache
      mountPath: /mnt/models
    resources:
      requests: { cpu: 100m, memory: 256Mi }
  containers:
  - name: epp
    image: quay.io/rhoai/odh-llm-d-inference-scheduler-rhel9@sha256:3c099f1079e2f2111f3f4a85505ac8b95eb789f69ea3062a53d40f8e7c401646
    args:
    - "--pool-name=qwen-test-inference-pool"
    - "--pool-namespace=redhat-ods-applications"
    - "--pool-group=inference.networking.x-k8s.io"
    - "--zap-encoder=json"
    - "--grpc-port=9002"
    - "--grpc-health-port=9003"
    - "--secure-serving"
    - "--model-server-metrics-scheme=https"
    - "--cert-path=/var/run/kserve/tls"
    - "--config-text=apiVersion: inference.networking.x-k8s.io/v1alpha1\nkind: EndpointPickerConfig\nplugins:\n- type: single-profile-handler\n- type: prefix-cache-scorer\n- type: load-aware-scorer\n- type: max-score-picker\nschedulingProfiles:\n- name: default\n  plugins:\n  - pluginRef: prefix-cache-scorer\n    weight: 2.0\n  - pluginRef: load-aware-scorer\n    weight: 1.0\n  - pluginRef: max-score-picker\n"
    ports:
    - { name: grpc, containerPort: 9002 }
    - { name: grpc-health, containerPort: 9003 }
    - { name: metrics, containerPort: 9090 }
    env:
    - { name: SSL_CERT_DIR, value: "/var/run/kserve/tls:/var/run/secrets/kubernetes.io/serviceaccount:/etc/pki/tls/certs" }
    resources:
      requests: { cpu: 256m, memory: 500Mi }
    volumeMounts:
    - { name: tls-certs, mountPath: /var/run/kserve/tls, readOnly: true }
    - { name: tokenizer-uds, mountPath: /tmp/tokenizer }
  - name: tokenizer
    image: quay.io/rhoai/pull-request-pipelines:odh-llm-d-kv-cache-a63d65dbe1bdae59453ec2b7e398239b97a9e308
    imagePullPolicy: Always
    env:
    - { name: TOKENIZERS_DIR, value: /mnt/models }
    - { name: HF_HOME, value: /tmp/hf }
    ports:
    - { containerPort: 8082, name: health }
    resources:
      requests: { cpu: 100m, memory: 512Mi }
    volumeMounts:
    - { mountPath: /tmp/tokenizer, name: tokenizer-uds }
    - { mountPath: /mnt/models, name: model-cache, readOnly: true }
    - { mountPath: /tmp/hf, name: hf-cache }
    livenessProbe:  { httpGet: { path: /healthz, port: 8082 }, periodSeconds: 15, failureThreshold: 5, initialDelaySeconds: 60 }
    readinessProbe: { httpGet: { path: /healthz, port: 8082 }, periodSeconds: 10, failureThreshold: 10, initialDelaySeconds: 15 }
    startupProbe:   { httpGet: { path: /healthz, port: 8082 }, periodSeconds: 10, failureThreshold: 30 }
  volumes:
  - { name: tls-certs, secret: { secretName: qwen-test-kserve-self-signed-certs } }
  - { name: tokenizer-uds, emptyDir: {} }
  - { name: model-cache, emptyDir: {} }
  - { name: hf-cache, emptyDir: {} }

Changed files

  • services/uds_tokenizer/Dockerfile.konflux (modified, +5/-4)
  • services/uds_tokenizer/pyproject.toml (modified, +3/-3)
  • services/uds_tokenizer/strip_unused_deps.sh (added, +34/-0)
  • services/uds_tokenizer/tokenizer_grpc_service.py (modified, +14/-13)
  • services/uds_tokenizer/tokenizer_service/tokenizer.py (modified, +15/-13)
  • services/uds_tokenizer/torch_stub.py (added, +175/-0)
  • services/uds_tokenizer/uv.lock (modified, +1320/-750)

Motivation

Projects in the llm-d ecosystem (and likely others) need to import vllm protocol types, config dataclasses, and rendering utilities but do not run inference. For example, the llm-d-kv-cache UDS tokenizer service is a lightweight gRPC sidecar that only imports:

from vllm.config import VllmConfig
from vllm.config.device import DeviceConfig
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.entrypoints.openai.chat_completion.protocol import ChatCompletionRequest
from vllm.entrypoints.openai.completion.protocol import CompletionRequest
from vllm.entrypoints.openai.engine.protocol import ErrorResponse
from vllm.entrypoints.openai.models.protocol import BaseModelPath
from vllm.entrypoints.openai.models.serving import OpenAIModelRegistry
from vllm.entrypoints.openai.engine.serve.render.serving import OpenAIServingRender
from vllm.plugins.io_processors import get_io_processor
from vllm.renderers import renderer_from_config

These are pure Python pydantic models, dataclasses, and rendering logic. No GPU, no CUDA, no inference engine.

Problem

Any from vllm.<anything> import ... triggers this import chain:

vllm/__init__.py (line 14)
  -> import vllm.env_override
    -> env_override.py (line 87): import torch  # unconditional, top-level
    -> env_override.py (line 89): from vllm.utils.torch_utils import is_torch_equal
    -> env_override.py (line 106): torch._inductor.config.compile_threads = 1
    -> ... (torch inductor monkeypatches, lines 116-484)

This means:

  1. torch is a hard runtime requirement for any vllm import, even protocol-only usage
  2. pip install vllm --no-deps is not viable -- imports crash without torch
  3. The full vllm dep tree (torch, CUDA libs, triton, flashinfer, nvidia-*, etc.) must be installed even in lightweight sidecars that never touch a GPU
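
One way to check a claim like the one above is to run the import in a fresh interpreter and inspect sys.modules afterwards. The helper below is a sketch; `import_pulls_in` is a hypothetical name, and the demonstration uses stdlib modules so it runs without vllm installed.

```python
import subprocess
import sys


def import_pulls_in(statement: str, module: str) -> bool:
    """Run `statement` in a fresh interpreter and report whether `module`
    ended up in sys.modules as a side effect."""
    probe = (
        f"{statement}\n"
        "import sys\n"
        f"print({module!r} in sys.modules)\n"
    )
    # check=True raises CalledProcessError if the statement itself fails,
    # for example because a required dependency is missing.
    result = subprocess.run(
        [sys.executable, "-c", probe],
        capture_output=True,
        text=True,
        check=True,
    )
    # The probe prints its answer last, after any output from the import.
    return result.stdout.strip().splitlines()[-1] == "True"


# Stand-in demonstration using only the standard library.
print(import_pulls_in("import json", "json.decoder"))
print(import_pulls_in("import json", "wave"))
```

In an environment with vllm installed, a call such as `import_pulls_in("from vllm.config import VllmConfig", "torch")` would test the chain described in this issue; without torch available, the subprocess would be expected to fail and raise instead of returning.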

Additionally, even beyond env_override.py, there are unconditional import torch statements deeper in the chain:

  • vllm/config/device.py (line 7)
  • vllm/config/model.py (line 10) -- module-level _STR_DTYPE_TO_TORCH_DTYPE dict initialization
  • vllm/config/utils.py (line 18)
  • vllm/utils/__init__.py (line 6)

These would also need lazy guards for the full protocol-only import path to work.
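
As an illustration of what a lazy guard for a module-level table could look like, the sketch below defers both the import and the table construction to first use. It is not vllm code: `lazy_attr_table` is a hypothetical helper, and the demonstration uses the stdlib `math` module as a stand-in for torch.

```python
import functools
import importlib
from typing import Any


@functools.cache
def lazy_attr_table(module_name: str, attr_names: tuple[str, ...]) -> dict[str, Any]:
    """Import `module_name` on first call and map each name to its attribute.

    Nothing is imported when this function is defined, so a module that
    only declares the table stays importable without the dependency.
    """
    module = importlib.import_module(module_name)
    return {name: getattr(module, name) for name in attr_names}


# A missing dependency only fails when the table is actually requested.
try:
    lazy_attr_table("module_that_is_not_installed", ("x",))
except ModuleNotFoundError as exc:
    print(f"deferred failure: {exc.name}")

# Stand-in demonstration; a torch-backed table would pass "torch" and
# dtype names such as ("float16", "bfloat16") instead.
table = lazy_attr_table("math", ("pi", "tau"))
print(sorted(table))
```

The trade-off is that callers must go through a function call rather than a module-level constant, so every existing reference to the constant would need updating.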

Concrete impact: container image size

The UDS tokenizer is a ~50 MB Python service. Adding vllm==0.18.0 from PyPI pulls in ~5-6 GB of transitive dependencies (torch with CUDA, nvidia-cublas, nvidia-cudnn, nvidia-nccl, triton, flashinfer, etc.).

Previously, the project used CPU-only manylinux_2_35 wheels from wheels.vllm.ai to avoid this, but those wheels require glibc >= 2.35 and are incompatible with RHEL 9 / UBI9 (glibc 2.34), the standard base image for Red Hat's downstream builds. See also #38908 for the same glibc constraint on nightly wheels.

There is currently no way to get vllm protocol types into a UBI9-based container without pulling the full torch+CUDA dependency tree.

Proposed solution

Guard the torch-dependent code in env_override.py behind an availability check as the first step. The module already uses importlib.util (line 4) and _get_torch_cuda_version() (line 8) already checks for torch without importing it. Extending that pattern:

# env_override.py, line 85 onward
_torch_available = importlib.util.find_spec("torch") is not None

if _torch_available:
    import torch

    from vllm.logger import init_logger
    from vllm.utils.torch_utils import is_torch_equal

    logger = init_logger(__name__)

    os.environ["PYTORCH_NVML_BASED_CUDA_CHECK"] = "1"
    os.environ["TORCHINDUCTOR_COMPILE_THREADS"] = "1"
    torch._inductor.config.compile_threads = 1

    # ... rest of monkeypatches

This preserves all existing behavior when torch is installed (100% of inference users) while unblocking the first gate for protocol-only consumers.
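
The guard pattern can be tried in isolation with a small, self-contained sketch. The helper `import_if_available` and the `DEMO_FLAG` variable are hypothetical and only mirror the shape of the proposal (check with find_spec, then import and apply overrides); the stdlib `json` module stands in for torch.

```python
import importlib
import importlib.util
import os
from collections.abc import MutableMapping
from types import ModuleType
from typing import Optional


def import_if_available(
    name: str,
    env_overrides: dict[str, str],
    environ: MutableMapping[str, str] = os.environ,
) -> Optional[ModuleType]:
    """Import `name` and apply `env_overrides` only if `name` can be found.

    find_spec() locates a top-level module without executing it, so the
    check itself does not trigger the import.
    """
    if importlib.util.find_spec(name) is None:
        return None
    module = importlib.import_module(name)
    environ.update(env_overrides)
    return module


# Use a private dict instead of os.environ for the demonstration.
env: dict[str, str] = {}
present = import_if_available("json", {"DEMO_FLAG": "1"}, environ=env)
missing = import_if_available("module_that_is_not_installed", {"DEMO_FLAG": "2"}, environ=env)
print(present is not None, missing is None, env)
```

When the dependency is absent, the overrides are skipped entirely, which matches the intent of leaving behavior unchanged for installations that do have torch.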

The deeper import torch statements in vllm/config/*.py and vllm/utils/__init__.py could be addressed incrementally -- there is strong precedent in the codebase for lazy import patterns (PRs #34343, #38649, #34651, #36024).

Alternative approaches

  • Separate vllm-types or vllm-protocol package: cleaner long-term but higher maintenance burden and coordination cost.
  • Move import vllm.env_override to engine entrypoints instead of __init__.py: more invasive, touches more files, higher risk of regressions.
  • VLLM_NO_TORCH environment variable: explicit opt-out, but adds a knob users have to discover.

The guard approach is minimal, self-contained, and doesn't change behavior for any existing user.

Related issues

  • #38908 -- glibc 2.35 requirement on nightly wheels blocks RHEL 9 / UBI9 users
  • #33741 -- --help performance due to unnecessary torch import (same root cause in the import chain)
  • #30985 -- RFC for DRY dependency management across hardware targets
  • #28071 -- RFC to pin all dependencies

Use cases that benefit

  • llm-d-kv-cache UDS tokenizer service (protocol types + chat rendering)
  • Routing sidecars / ext-proc plugins that parse vllm request/response types
  • Monitoring and observability tools that deserialize vllm protocol objects
  • Test harnesses and CI that validate request schemas without GPU hardware
  • Documentation tooling that introspects vllm's API types

extent analysis

TL;DR

Guarding torch-dependent code in env_override.py behind an availability check can resolve the issue of torch being a hard runtime requirement for any vllm import.

Guidance

  • Identify and guard all torch-dependent code in the vllm package to allow for protocol-only imports without requiring torch.
  • Start by modifying env_override.py to check for torch availability before importing it, as proposed in the issue.
  • Incrementally address deeper import torch statements in vllm/config/*.py and vllm/utils/__init__.py using lazy import patterns.
  • Consider alternative approaches, such as creating a separate vllm-types or vllm-protocol package, but weigh the benefits against the added maintenance burden and coordination cost.

Example

# env_override.py, line 85 onward
_torch_available = importlib.util.find_spec("torch") is not None

if _torch_available:
    import torch
    # ... rest of the code

Notes

  • The proposed solution focuses on modifying the env_override.py file, but other files may also require changes to fully resolve the issue.
  • The use of lazy import patterns can help minimize the impact of the changes on existing users.

Recommendation

Apply the proposed workaround by guarding torch-dependent code in env_override.py behind an availability check, as it is a minimal and self-contained solution that doesn't change behavior for existing users.
