hermes - ✅ (Solved) Fix [Bug]: Local STT should reuse CPU fallback after cuBLAS runtime failure [1 pull request, 3 comments, 2 participants]

GitHub stats
NousResearch/hermes-agent#13568 (fetched 2026-04-22 08:05:47)
Comments: 3
Participants: 2
Timeline events: 11
Reactions: 0
Timeline (top): commented ×3, labeled ×3, mentioned ×2, subscribed ×2

Error Message

RuntimeError: Library libcublas.so.12 is not found or cannot be loaded

Root Cause

The local STT code updates the in-memory model instance after the CPU retry, but the cache identity for the requested model does not preserve the active runtime mode. That means the next _transcribe_local(..., model_name) call thinks it needs to rebuild the device="auto" model instead of reusing the already-working CPU fallback.
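
The mismatch described above can be illustrated with a minimal sketch. Everything in it is hypothetical: FakeWhisper, _model, _model_identity, and transcribe_local are stand-ins rather than the Hermes source, and the exact way the identity goes stale is an assumption made for illustration. The only behavior taken from the issue is that the instance is swapped to CPU while the lookup for the requested model no longer matches.

```python
# Hypothetical stand-ins; none of these names come from the Hermes source.
CUBLAS_ERROR = "Library libcublas.so.12 is not found or cannot be loaded"
builds = []  # records every (device, compute_type) model construction


class FakeWhisper:
    """Stand-in model whose device="auto" variant fails at transcribe time."""

    def __init__(self, name, device, compute_type):
        self.device = device
        builds.append((device, compute_type))

    def transcribe(self, path):
        if self.device == "auto":
            raise RuntimeError(CUBLAS_ERROR)
        return f"text from {path}"


_model = None
_model_identity = None  # compared against the requested model name


def transcribe_local(path, model_name):
    global _model, _model_identity
    if _model is None or _model_identity != model_name:
        _model = FakeWhisper(model_name, "auto", "default")
        _model_identity = model_name
    try:
        return _model.transcribe(path)
    except RuntimeError:
        # The instance is swapped to CPU, but the identity is stored in a
        # form the next lookup does not match (assumed for illustration).
        _model = FakeWhisper(model_name, "cpu", "int8")
        _model_identity = f"{model_name}@cpu"
        return _model.transcribe(path)


transcribe_local("a.ogg", "base")
transcribe_local("b.ogg", "base")
print(builds.count(("auto", "default")))  # 2: the failing build runs per call
```

Both calls succeed, so the problem is invisible to the user except as wasted work: the failing auto model is constructed and exercised on every transcription.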

Fix Action

Fixed

PR fix notes

PR #13571: fix(stt): reuse CPU fallback after CUDA runtime failure

Description (problem / solution / changelog)

What does this PR do?

This PR makes local faster-whisper STT more robust when device="auto" reaches a cuBLAS/CUDA runtime failure during transcription. Hermes already retries on CPU int8 for the current transcription, but the successful CPU fallback was not being reused on later calls with the same requested model.

This change preserves the active local runtime mode in the in-memory cache key so a working CPU fallback can be reused instead of rebuilding the failing device="auto" model on the next transcription.

Related Issue

Fixes #13568

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✅ Tests (adding or improving test coverage)

Changes Made

  • Added _local_model_cache_key in tools/transcription_tools.py to track the requested local whisper model plus active runtime mode.
  • Preserved the successful CPU int8 fallback in _transcribe_local() after cuBLAS/CUDA runtime failures.
  • Added a regression test that verifies the first CUDA runtime failure retries on CPU and succeeds.
  • Added a second regression test that verifies the CPU fallback is reused on the next transcription call for the same model instead of re-instantiating the failing device="auto" model.
  • Updated the fallback-related tests to reset the new cache key between runs.

How to Test

  1. Run source venv/bin/activate.
  2. Run pytest tests/tools/test_transcription_tools.py tests/tools/test_transcription.py tests/gateway/test_stt_config.py -q.
  3. Optionally reproduce locally by sending a voice message in an environment where faster-whisper hits RuntimeError: Library libcublas.so.12 is not found or cannot be loaded; confirm the first transcription falls back to CPU and later calls reuse the CPU fallback instead of repeatedly rebuilding the failing auto/CUDA path.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: WSL2 / Linux

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Screenshots / Logs

104 passed in 2.56s

Changed files

  • tests/tools/test_transcription_tools.py (modified, +70/-0)
  • tools/transcription_tools.py (modified, +58/-6)

Bug Description

When local STT is initialized with faster-whisper on device="auto", a cuBLAS/CUDA runtime failure can be recovered by retrying on CPU int8. However, the CPU fallback is not reused on subsequent calls, so Hermes needlessly re-enters the failing GPU/auto path each time.

Steps to Reproduce

  1. Configure stt.provider: local with faster-whisper available.
  2. Trigger a runtime failure such as RuntimeError: Library libcublas.so.12 is not found or cannot be loaded during _local_model.transcribe(...).
  3. Observe Hermes retry successfully on CPU int8.
  4. Send another voice message using the same local model.

Expected Behavior

After the first cuBLAS/CUDA runtime failure, Hermes should reuse the working CPU int8 local whisper model for subsequent transcriptions with the same requested model.

Actual Behavior

Hermes retries successfully on CPU for the current transcription, but the cache state does not preserve that fallback for the next call. The next transcription rebuilds the device="auto" model again and re-enters the same failing runtime path before falling back again.

Affected Component

Gateway (Telegram/Discord/Slack/WhatsApp), Other

Messaging Platform (if gateway-related)

Telegram

Operating System

WSL2 / Linux (reproduced locally on my setup)

Relevant Logs / Traceback

RuntimeError: Library libcublas.so.12 is not found or cannot be loaded

Root Cause Analysis (optional)

The local STT code updates the in-memory model instance after the CPU retry, but the cache identity for the requested model does not preserve the active runtime mode. That means the next _transcribe_local(..., model_name) call thinks it needs to rebuild the device="auto" model instead of reusing the already-working CPU fallback.

Proposed Fix (optional)

Track the cached local whisper model by both requested model name and active runtime mode/device (for example (model_name, device, compute_type)) so a successful CPU fallback can be reused. Add a regression test that confirms the second call does not instantiate a fresh device="auto" model after the first fallback succeeds.
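
A regression test along these lines could look like the sketch below. It is an assumption-laden illustration, not the test from the PR: LocalSTT and its injected factory are hypothetical, and the mocks return a plain string rather than faster-whisper's real return value. The point it shows is the assertion style: the mocked constructor has exactly two results queued, so any extra device="auto" build on the second call fails the test.

```python
from unittest.mock import MagicMock, call

CUBLAS_ERROR = "Library libcublas.so.12 is not found or cannot be loaded"


class LocalSTT:
    """Hypothetical wrapper; `factory` stands in for the model constructor."""

    def __init__(self, factory):
        self._factory = factory
        self._key = None    # (model_name, device, compute_type)
        self._model = None

    def transcribe(self, path, model_name):
        if self._key is None or self._key[0] != model_name:
            self._key = (model_name, "auto", "default")
            self._model = self._factory(*self._key)
        try:
            return self._model.transcribe(path)
        except RuntimeError:
            if self._key[1] == "cpu":
                raise  # already on CPU, nothing left to fall back to
            self._key = (model_name, "cpu", "int8")
            self._model = self._factory(*self._key)
            return self._model.transcribe(path)


def test_cpu_fallback_is_reused():
    auto_model, cpu_model = MagicMock(), MagicMock()
    auto_model.transcribe.side_effect = RuntimeError(CUBLAS_ERROR)
    cpu_model.transcribe.return_value = "hello"
    # Only two models are available; a third build raises StopIteration.
    factory = MagicMock(side_effect=[auto_model, cpu_model])

    stt = LocalSTT(factory)
    assert stt.transcribe("a.ogg", "base") == "hello"
    assert stt.transcribe("b.ogg", "base") == "hello"

    # One auto build and one CPU build across both calls.
    assert factory.call_args_list == [
        call("base", "auto", "default"),
        call("base", "cpu", "int8"),
    ]
    assert auto_model.transcribe.call_count == 1
    assert cpu_model.transcribe.call_count == 2


test_cpu_fallback_is_reused()
```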

Are you willing to submit a PR for this?

  • I'd like to fix this myself and submit a PR

Extent Analysis

TL;DR

Update the cache identity for the local whisper model to include the active runtime mode and device, allowing the successful CPU fallback to be reused on subsequent calls.

Guidance

  • Identify the cache management logic in the local STT code and modify it to track models by a composite key including model_name, device, and compute_type.
  • Verify that the updated cache logic correctly reuses the CPU fallback model on subsequent transcription requests after a successful fallback.
  • Consider adding a regression test to ensure the fix is stable and effective.
  • Review the _transcribe_local function to ensure it correctly utilizes the updated cache logic and reuses the CPU fallback model when available.

Example

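# Sketch of the updated cache logic; get_model and _cache are illustrative names
from faster_whisper import WhisperModel

_cache = {}

def get_model(model_name, device="auto", compute_type="default"):
    # Key on the runtime mode as well as the model name
    cache_key = (model_name, device, compute_type)
    if cache_key not in _cache:
        # Build and cache the model for this exact runtime mode
        _cache[cache_key] = WhisperModel(model_name, device=device, compute_type=compute_type)
    return _cache[cache_key]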

Notes

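The fix touches both the cache bookkeeping and _transcribe_local: after a fallback, the cache key must record the CPU int8 runtime so that the next call for the same requested model matches the cached instance instead of rebuilding device="auto". The regression tests in PR #13571 cover both the first-call fallback and the reuse on the next call.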

Recommendation

Apply the fix by updating the cache logic to track models by a composite key of model_name, device, and compute_type, so that the successful CPU fallback is reused on subsequent calls.
