vllm - ✅ (Solved) Fix [Feature]: Support lightweight import of vllm protocol types without torch dependency [1 pull request, 2 comments, 2 participants]

vllm, 2026-04-03 16:38:55

GitHub stats

vllm-project/vllm#38925•Fetched 2026-04-08 02:44:56

Author

hexfusion

Participants

hexfusion

robertgshaw2-redhat

Timeline (top)

commented ×2, closed ×1, cross-referenced ×1, reopened ×1


PR fix notes

PR #17: fix: replace manylinux_2_35 CPU wheels with torch stub for UBI9 compat

Repository: red-hat-data-services/llm-d-kv-cache
Author: hexfusion
State: closed | merged: False
Link: https://github.com/red-hat-data-services/llm-d-kv-cache/pull/17

Description (problem / solution / changelog)

Replace the manylinux_2_35 CPU wheels (incompatible with UBI9's glibc 2.34) with a torch stub that allows vllm to import without PyTorch. The tokenizer only uses vllm protocol types, config, and rendering modules, with no GPU code. A post-install script registers the stub and strips ~1.3 GB of unused native extensions from the image.

workaround for: https://github.com/vllm-project/vllm/issues/38925

Disabling PyTorch because PyTorch >= 2.1 is required but found 0.0.0
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
INFO 04-07 10:54:44 [importing.py:44] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 04-07 10:54:44 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
2026-04-07 10:54:47,247 [INFO] [root] TokenizationServiceServicer initialized
2026-04-07 10:54:47,247 [INFO] [root] gRPC reflection disabled (set `ENABLE_GRPC_REFLECTION=1` to enable)
2026-04-07 10:54:47,248 [INFO] [root] gRPC server configured to listen on /tmp/tokenizer/tokenizer-uds.socket
2026-04-07 10:54:47,248 [INFO] [root] gRPC server started on /tmp/tokenizer/tokenizer-uds.socket
2026-04-07 10:54:47,249 [INFO] [root] Probe server started on port 8082
2026-04-07 10:54:47,249 [INFO] [root] Server started.

Step 1: Initialize the tokenizer

oc exec qwen-test-epp-with-tokenizer -n redhat-ods-applications -c tokenizer -- python3 -c "
import grpc, sys 
sys.path.insert(0, '/app')
from tokenizerpb import tokenizer_pb2, tokenizer_pb2_grpc
channel = grpc.insecure_channel('unix:///tmp/tokenizer/tokenizer-uds.socket')
stub = tokenizer_pb2_grpc.TokenizationServiceStub(channel)
init_req = tokenizer_pb2.InitializeTokenizerRequest(model_name='Qwen/Qwen2.5-1.5B-Instruct')
init_resp = stub.InitializeTokenizer(init_req)
print(f'Initialize: success={init_resp.success}')
"

Expected:

Initialize: success=True

Step 2: Tokenize

oc exec qwen-test-epp-with-tokenizer -n redhat-ods-applications -c tokenizer -- python3 -c "
import grpc, sys
sys.path.insert(0, '/app')
from tokenizerpb import tokenizer_pb2, tokenizer_pb2_grpc
channel = grpc.insecure_channel('unix:///tmp/tokenizer/tokenizer-uds.socket')
stub = tokenizer_pb2_grpc.TokenizationServiceStub(channel)
req = tokenizer_pb2.TokenizeRequest(
    input='The history of artificial intelligence',
    model_name='Qwen/Qwen2.5-1.5B-Instruct'
)
resp = stub.Tokenize(req)
print(f'input_ids: {list(resp.input_ids)}')
print(f'success: {resp.success}')
"

Expected:

input_ids: [785, 3840, 315, 20443, 11229]
success: True

Pod Setup

The EPP scheduler runs with the UDS tokenizer as a sidecar, and the two containers share the UDS socket through an emptyDir volume. An init container downloads the model's tokenizer files into the expected path.

apiVersion: v1
kind: Pod
metadata:
  name: qwen-test-epp-with-tokenizer
  namespace: redhat-ods-applications
  labels:
    app.kubernetes.io/name: qwen-test-router-scheduler
    app.kubernetes.io/component: llminferenceservice-router-scheduler
spec:
  serviceAccountName: qwen-test-epp-sa
  initContainers:
  - name: fetch-tokenizer
    image: quay.io/rhoai/pull-request-pipelines:odh-llm-d-kv-cache-a63d65dbe1bdae59453ec2b7e398239b97a9e308
    command: ["python3", "-c"]
    args:
    - |
      import os; os.environ['HF_HUB_ENABLE_HF_TRANSFER'] = '0'
      from huggingface_hub import snapshot_download
      snapshot_download('Qwen/Qwen2.5-1.5B-Instruct',
        local_dir='/mnt/models/Qwen/Qwen2.5-1.5B-Instruct',
        allow_patterns=['tokenizer*', 'special_tokens*', 'vocab*', 'merges*', '*.json'])
    volumeMounts:
    - name: model-cache
      mountPath: /mnt/models
    resources:
      requests: { cpu: 100m, memory: 256Mi }
  containers:
  - name: epp
    image: quay.io/rhoai/odh-llm-d-inference-scheduler-rhel9@sha256:3c099f1079e2f2111f3f4a85505ac8b95eb789f69ea3062a53d40f8e7c401646
    args:
    - "--pool-name=qwen-test-inference-pool"
    - "--pool-namespace=redhat-ods-applications"
    - "--pool-group=inference.networking.x-k8s.io"
    - "--zap-encoder=json"
    - "--grpc-port=9002"
    - "--grpc-health-port=9003"
    - "--secure-serving"
    - "--model-server-metrics-scheme=https"
    - "--cert-path=/var/run/kserve/tls"
    - "--config-text=apiVersion: inference.networking.x-k8s.io/v1alpha1\nkind: EndpointPickerConfig\nplugins:\n- type: single-profile-handler\n- type: prefix-cache-scorer\n- type: load-aware-scorer\n- type: max-score-picker\nschedulingProfiles:\n- name: default\n  plugins:\n  - pluginRef: prefix-cache-scorer\n    weight: 2.0\n  - pluginRef: load-aware-scorer\n    weight: 1.0\n  - pluginRef: max-score-picker\n"
    ports:
    - { name: grpc, containerPort: 9002 }
    - { name: grpc-health, containerPort: 9003 }
    - { name: metrics, containerPort: 9090 }
    env:
    - { name: SSL_CERT_DIR, value: "/var/run/kserve/tls:/var/run/secrets/kubernetes.io/serviceaccount:/etc/pki/tls/certs" }
    resources:
      requests: { cpu: 256m, memory: 500Mi }
    volumeMounts:
    - { name: tls-certs, mountPath: /var/run/kserve/tls, readOnly: true }
    - { name: tokenizer-uds, mountPath: /tmp/tokenizer }
  - name: tokenizer
    image: quay.io/rhoai/pull-request-pipelines:odh-llm-d-kv-cache-a63d65dbe1bdae59453ec2b7e398239b97a9e308
    imagePullPolicy: Always
    env:
    - { name: TOKENIZERS_DIR, value: /mnt/models }
    - { name: HF_HOME, value: /tmp/hf }
    ports:
    - { containerPort: 8082, name: health }
    resources:
      requests: { cpu: 100m, memory: 512Mi }
    volumeMounts:
    - { mountPath: /tmp/tokenizer, name: tokenizer-uds }
    - { mountPath: /mnt/models, name: model-cache, readOnly: true }
    - { mountPath: /tmp/hf, name: hf-cache }
    livenessProbe:  { httpGet: { path: /healthz, port: 8082 }, periodSeconds: 15, failureThreshold: 5, initialDelaySeconds: 60 }
    readinessProbe: { httpGet: { path: /healthz, port: 8082 }, periodSeconds: 10, failureThreshold: 10, initialDelaySeconds: 15 }
    startupProbe:   { httpGet: { path: /healthz, port: 8082 }, periodSeconds: 10, failureThreshold: 30 }
  volumes:
  - { name: tls-certs, secret: { secretName: qwen-test-kserve-self-signed-certs } }
  - { name: tokenizer-uds, emptyDir: {} }
  - { name: model-cache, emptyDir: {} }
  - { name: hf-cache, emptyDir: {} }

Changed files

services/uds_tokenizer/Dockerfile.konflux (modified, +5/-4)
services/uds_tokenizer/pyproject.toml (modified, +3/-3)
services/uds_tokenizer/strip_unused_deps.sh (added, +34/-0)
services/uds_tokenizer/tokenizer_grpc_service.py (modified, +14/-13)
services/uds_tokenizer/tokenizer_service/tokenizer.py (modified, +15/-13)
services/uds_tokenizer/torch_stub.py (added, +175/-0)
services/uds_tokenizer/uv.lock (modified, +1320/-750)

Issue body

Motivation

Projects in the llm-d ecosystem (and likely others) need to import vllm protocol types, config dataclasses, and rendering utilities but do not run inference. For example, the llm-d-kv-cache UDS tokenizer service is a lightweight gRPC sidecar that only imports:

from vllm.config import VllmConfig
from vllm.config.device import DeviceConfig
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.entrypoints.openai.chat_completion.protocol import ChatCompletionRequest
from vllm.entrypoints.openai.completion.protocol import CompletionRequest
from vllm.entrypoints.openai.engine.protocol import ErrorResponse
from vllm.entrypoints.openai.models.protocol import BaseModelPath
from vllm.entrypoints.openai.models.serving import OpenAIModelRegistry
from vllm.entrypoints.openai.engine.serve.render.serving import OpenAIServingRender
from vllm.plugins.io_processors import get_io_processor
from vllm.renderers import renderer_from_config

These are pure Python pydantic models, dataclasses, and rendering logic. No GPU, no CUDA, no inference engine.

Problem

Any from vllm.<anything> import ... triggers this import chain:

vllm/__init__.py (line 14)
  -> import vllm.env_override
    -> env_override.py (line 87): import torch  # unconditional, top-level
    -> env_override.py (line 89): from vllm.utils.torch_utils import is_torch_equal
    -> env_override.py (line 106): torch._inductor.config.compile_threads = 1
    -> ... (torch inductor monkeypatches, lines 116-484)

This means:

torch is a hard runtime requirement for any vllm import, even protocol-only usage
pip install vllm --no-deps is not viable -- imports crash without torch
The full vllm dep tree (torch, CUDA libs, triton, flashinfer, nvidia-*, etc.) must be installed even in lightweight sidecars that never touch a GPU
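
One way to check a claim like this is to import the package in a fresh interpreter and inspect sys.modules afterwards. The sketch below is a hypothetical helper (`pulls_in` is not part of vllm); the demo call uses standard-library modules so it runs anywhere, and the vllm/torch pairing in the usage note is the case this issue is about.

```python
import subprocess
import sys


def pulls_in(package: str, heavy_module: str) -> bool:
    """Return True if importing `package` in a fresh interpreter also loads `heavy_module`."""
    probe = (
        "import importlib, sys\n"
        f"importlib.import_module({package!r})\n"
        f"print({heavy_module!r} in sys.modules)\n"
    )
    # A separate process keeps the result independent of what this process has imported.
    result = subprocess.run(
        [sys.executable, "-c", probe],
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError if the import itself fails
    )
    # The package may print during import, so only the last line is the answer.
    return result.stdout.strip().splitlines()[-1] == "True"


# Demo with stdlib modules: importing json loads json.decoder as a side effect.
print(pulls_in("json", "json.decoder"))
```

In an environment with vllm installed, `pulls_in("vllm.config", "torch")` would be the corresponding check; a CI job could assert it is False once the import path has been made lazy.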

Additionally, even beyond env_override.py, there are unconditional import torch statements deeper in the chain:

vllm/config/device.py (line 7)
vllm/config/model.py (line 10) -- module-level _STR_DTYPE_TO_TORCH_DTYPE dict initialization
vllm/config/utils.py (line 18)
vllm/utils/__init__.py (line 6)

These would also need lazy guards for the full protocol-only import path to work.
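
A lazy guard of the kind mentioned here usually means moving the import from module level into the function that needs it. The following is a generic sketch, not vllm code: `optional_module` and `resolve_attr` are hypothetical helpers, and the demo uses `math` as a stand-in for the heavy dependency.

```python
import functools
import importlib
from types import ModuleType
from typing import Any, Optional


@functools.lru_cache(maxsize=None)
def optional_module(name: str) -> Optional[ModuleType]:
    """Import `name` on first use; return None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def resolve_attr(module_name: str, attr: str) -> Any:
    """Look up `attr` on a lazily imported module, failing only when it is really needed."""
    module = optional_module(module_name)
    if module is None:
        raise RuntimeError(
            f"{module_name} is required for this code path but is not installed"
        )
    return getattr(module, attr)


# Importing this file never touches the heavy dependency; the lookup does.
print(resolve_attr("math", "pi"))
```

With this shape, a module-level table such as a string-to-dtype mapping could be built inside a cached function on first access instead of at import time, so protocol-only consumers never trigger the import.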

Concrete impact: container image size

The UDS tokenizer is a ~50 MB Python service. Adding vllm==0.18.0 from PyPI pulls in ~5-6 GB of transitive dependencies (torch with CUDA, nvidia-cublas, nvidia-cudnn, nvidia-nccl, triton, flashinfer, etc.).

Previously, the project used CPU-only manylinux_2_35 wheels from wheels.vllm.ai to avoid this, but those wheels require glibc >= 2.35 and are incompatible with RHEL 9 / UBI9 (glibc 2.34), the standard base image for Red Hat's downstream builds. See also #38908 for the same glibc constraint on nightly wheels.

There is currently no way to get vllm protocol types into a UBI9-based container without pulling the full torch+CUDA dependency tree.
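
The glibc constraint above comes down to a version comparison: a manylinux_X_Y wheel needs glibc X.Y or newer on the host. A small sketch of that check follows; `manylinux_compatible` and `detected_glibc` are hypothetical helper names, and `platform.libc_ver()` is the standard-library call used to read the local version.

```python
import platform
from typing import Optional


def manylinux_compatible(tag: str, glibc_version: str) -> bool:
    """Return True if a host with `glibc_version` can load wheels built for `tag`.

    `tag` looks like 'manylinux_2_35' or 'manylinux_2_35_x86_64'.
    """
    parts = tag.split("_")
    required = (int(parts[1]), int(parts[2]))
    found = tuple(int(piece) for piece in glibc_version.split(".")[:2])
    return found >= required


def detected_glibc() -> Optional[str]:
    """Return the local glibc version string, or None on non-glibc platforms."""
    lib, version = platform.libc_ver()
    return version if lib == "glibc" and version else None


# UBI9 ships glibc 2.34, so manylinux_2_35 wheels are rejected.
print(manylinux_compatible("manylinux_2_35", "2.34"))
```

Running `detected_glibc()` inside the target base image is a quick way to confirm which manylinux tags it can accept before choosing a wheel source.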

Proposed solution

Guard the torch-dependent code in env_override.py behind an availability check as the first step. The module already uses importlib.util (line 4) and _get_torch_cuda_version() (line 8) already checks for torch without importing it. Extending that pattern:

# env_override.py, line 85 onward
_torch_available = importlib.util.find_spec("torch") is not None

if _torch_available:
    import torch

    from vllm.logger import init_logger
    from vllm.utils.torch_utils import is_torch_equal

    logger = init_logger(__name__)

    os.environ["PYTORCH_NVML_BASED_CUDA_CHECK"] = "1"
    os.environ["TORCHINDUCTOR_COMPILE_THREADS"] = "1"
    torch._inductor.config.compile_threads = 1

    # ... rest of monkeypatches

This preserves all existing behavior when torch is installed (100% of inference users) while unblocking the first gate for protocol-only consumers.
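
The availability check in the proposal relies on `importlib.util.find_spec`, which locates a top-level module without executing it. A standalone sketch of that check is below; `module_available` is a hypothetical wrapper, and the demo uses standard-library names so it runs without torch.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be located, without importing a top-level module."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Dotted names import their parent first and raise if it is missing;
        # ValueError is raised for a loaded module whose __spec__ is None.
        return False


overrides_applied = False

if module_available("json"):  # stand-in for the torch check in the proposal
    # Code that needs the optional dependency lives inside the guard.
    overrides_applied = True

print(overrides_applied, module_available("no_such_module_for_demo"))
```

One caveat: `find_spec` only confirms that the module can be found. An installed but unloadable build (for example, a wheel compiled against a newer glibc) would still fail at `import` time, so the guard addresses the "not installed" case rather than the "installed but broken" case.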

The deeper import torch statements in vllm/config/*.py and vllm/utils/__init__.py could be addressed incrementally -- there is strong precedent in the codebase for lazy import patterns (PRs #34343, #38649, #34651, #36024).

Alternative approaches

Separate vllm-types or vllm-protocol package: cleaner long-term but higher maintenance burden and coordination cost.
Move import vllm.env_override to engine entrypoints instead of __init__.py: more invasive, touches more files, higher risk of regressions.
VLLM_NO_TORCH environment variable: explicit opt-out, but adds a knob users have to discover.

The guard approach is minimal, self-contained, and doesn't change behavior for any existing user.

Related issues

#38908 -- glibc 2.35 requirement on nightly wheels blocks RHEL 9 / UBI9 users
#33741 -- --help performance due to unnecessary torch import (same root cause in the import chain)
#30985 -- RFC for DRY dependency management across hardware targets
#28071 -- RFC to pin all dependencies

Use cases that benefit

llm-d-kv-cache UDS tokenizer service (protocol types + chat rendering)
Routing sidecars / ext-proc plugins that parse vllm request/response types
Monitoring and observability tools that deserialize vllm protocol objects
Test harnesses and CI that validate request schemas without GPU hardware
Documentation tooling that introspects vllm's API types


extent analysis

TL;DR

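Guarding torch-dependent code in env_override.py behind an availability check removes the first place where any vllm import hard-requires torch. The unconditional import torch statements deeper in vllm/config/*.py and vllm/utils/__init__.py need lazy guards as well before protocol-only imports work end to end.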

Guidance

Identify and guard all torch-dependent code in the vllm package to allow for protocol-only imports without requiring torch.
Start by modifying env_override.py to check for torch availability before importing it, as proposed in the issue.
Incrementally address deeper import torch statements in vllm/config/*.py and vllm/utils/__init__.py using lazy import patterns.
Consider alternative approaches, such as creating a separate vllm-types or vllm-protocol package, but weigh the benefits against the added maintenance burden and coordination cost.

Example

# env_override.py, line 85 onward
_torch_available = importlib.util.find_spec("torch") is not None

if _torch_available:
    import torch
    # ... rest of the code

Notes

The proposed solution focuses on modifying the env_override.py file, but other files may also require changes to fully resolve the issue.
The use of lazy import patterns can help minimize the impact of the changes on existing users.

Recommendation

Apply the proposed workaround by guarding torch-dependent code in env_override.py behind an availability check, as it is a minimal and self-contained solution that doesn't change behavior for existing users.
