hermes - ✅(Solved) Fix [Bug]: Local STT should reuse CPU fallback after cuBLAS runtime failure [1 pull requests, 3 comments, 2 participants]

hermes · 2026-04-21 14:44:18

NousResearch/hermes-agent#13568•Fetched 2026-04-22 08:05:47

Author

Jonhvmp

Participants

alt-glitch

Jonhvmp

Timeline (top)

commented ×3, labeled ×3, mentioned ×2, subscribed ×2

Error Message

RuntimeError: Library libcublas.so.12 is not found or cannot be loaded

Root Cause

The local STT code updates the in-memory model instance after the CPU retry, but the cache identity for the requested model does not preserve the active runtime mode. That means the next _transcribe_local(..., model_name) call thinks it needs to rebuild the device="auto" model instead of reusing the already-working CPU fallback.

Fix Action

Fixed

Fixed by PR: fix(stt): reuse CPU fallback after CUDA runtime failure (https://github.com/NousResearch/hermes-agent/pull/13571)

PR fix notes

PR #13571: fix(stt): reuse CPU fallback after CUDA runtime failure

Repository: NousResearch/hermes-agent
Author: Jonhvmp
State: open | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/13571

Description (problem / solution / changelog)

What does this PR do?

This PR makes local faster-whisper STT more robust when device="auto" reaches a cuBLAS/CUDA runtime failure during transcription. Hermes already retries on CPU int8 for the current transcription, but the successful CPU fallback was not being reused on later calls with the same requested model.

This change preserves the active local runtime mode in the in-memory cache key so a working CPU fallback can be reused instead of rebuilding the failing device="auto" model on the next transcription.

Related Issue

Fixes #13568

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✅ Tests (adding or improving test coverage)

Changes Made

Added _local_model_cache_key in tools/transcription_tools.py to track the requested local whisper model plus active runtime mode.
Preserved the successful CPU int8 fallback in _transcribe_local() after cuBLAS/CUDA runtime failures.
Added a regression test that verifies the first CUDA runtime failure retries on CPU and succeeds.
Added a second regression test that verifies the CPU fallback is reused on the next transcription call for the same model instead of re-instantiating the failing device="auto" model.
Updated the fallback-related tests to reset the new cache key between runs.
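The two regression tests listed above might look roughly like the following. The real tests in tests/tools/test_transcription_tools.py are not shown on this page, so `LocalSTT`, `make_factory`, and the test names are hypothetical; the sketch only relies on the standard library's `unittest.mock`.

```python
from unittest.mock import MagicMock

CUBLAS_ERROR = "Library libcublas.so.12 is not found or cannot be loaded"


class LocalSTT:
    """Hypothetical minimal transcriber with a fallback-aware cache key."""

    def __init__(self, model_factory):
        self._factory = model_factory
        self._model = None
        self._cache_key = None  # (model_name, device, compute_type)

    def transcribe(self, audio_path, model_name):
        if self._model is None or self._cache_key[0] != model_name:
            self._model = self._factory(model_name, device="auto", compute_type="default")
            self._cache_key = (model_name, "auto", "default")
        try:
            return self._model.transcribe(audio_path)
        except RuntimeError:
            self._model = self._factory(model_name, device="cpu", compute_type="int8")
            self._cache_key = (model_name, "cpu", "int8")
            return self._model.transcribe(audio_path)


def make_factory():
    """Factory that yields a failing auto model first, then a working CPU model."""
    auto_model = MagicMock(name="auto_model")
    auto_model.transcribe.side_effect = RuntimeError(CUBLAS_ERROR)
    cpu_model = MagicMock(name="cpu_model")
    cpu_model.transcribe.return_value = "hello"
    # A third construction would exhaust the list and fail the test.
    return MagicMock(side_effect=[auto_model, cpu_model])


def test_first_failure_retries_on_cpu():
    factory = make_factory()
    stt = LocalSTT(factory)
    assert stt.transcribe("a.ogg", "base") == "hello"
    assert factory.call_count == 2
    assert factory.call_args.kwargs == {"device": "cpu", "compute_type": "int8"}


def test_cpu_fallback_is_reused():
    factory = make_factory()
    stt = LocalSTT(factory)
    stt.transcribe("a.ogg", "base")
    assert stt.transcribe("b.ogg", "base") == "hello"
    assert factory.call_count == 2  # no fresh device="auto" model on the second call
```

Giving the factory exactly two models makes the reuse test strict: any extra instantiation on the second call fails instead of silently passing.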

How to Test

Run source venv/bin/activate.
Run pytest tests/tools/test_transcription_tools.py tests/tools/test_transcription.py tests/gateway/test_stt_config.py -q.
Optionally reproduce locally by sending a voice message in an environment where faster-whisper hits RuntimeError: Library libcublas.so.12 is not found or cannot be loaded; confirm the first transcription falls back to CPU and later calls reuse the CPU fallback instead of repeatedly rebuilding the failing auto/CUDA path.

Checklist

Code

I've read the Contributing Guide
My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
I searched for existing PRs to make sure this isn't a duplicate
My PR contains only changes related to this fix/feature (no unrelated commits)
I've run pytest tests/ -q and all tests pass
I've added tests for my changes (required for bug fixes, strongly encouraged for features)
I've tested on my platform: WSL2 / Linux

Documentation & Housekeeping

I've updated relevant documentation (README, docs/, docstrings) — or N/A
I've updated cli-config.yaml.example if I added/changed config keys — or N/A
I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Screenshots / Logs

104 passed in 2.56s

Changed files

tests/tools/test_transcription_tools.py (modified, +70/-0)
tools/transcription_tools.py (modified, +58/-6)


Bug Description

When local STT is initialized with faster-whisper on device="auto", a cuBLAS/CUDA runtime failure can be recovered by retrying on CPU int8. However, the CPU fallback is not reused on subsequent calls, so Hermes needlessly re-enters the failing GPU/auto path each time.

Steps to Reproduce

Configure stt.provider: local with faster-whisper available.
Trigger a runtime failure such as RuntimeError: Library libcublas.so.12 is not found or cannot be loaded during _local_model.transcribe(...).
Observe Hermes retry successfully on CPU int8.
Send another voice message using the same local model.

Expected Behavior

After the first cuBLAS/CUDA runtime failure, Hermes should reuse the working CPU int8 local whisper model for subsequent transcriptions with the same requested model.

Actual Behavior

Hermes retries successfully on CPU for the current transcription, but the cache state does not preserve that fallback for the next call. The next transcription rebuilds the device="auto" model again and re-enters the same failing runtime path before falling back again.
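The behavior described above can be reproduced in isolation with a small simulation. The Hermes source is not shown on this page, so the exact mechanism is assumed: here the cache identity is simply cleared after the fallback, and `FakeWhisperModel` and `transcribe_buggy` are hypothetical names.

```python
CUBLAS_ERROR = "Library libcublas.so.12 is not found or cannot be loaded"
build_log = []  # records every (device, compute_type) model construction


class FakeWhisperModel:
    """Hypothetical stand-in for a local whisper model; device="auto" always fails."""

    def __init__(self, name, device, compute_type):
        self.device = device
        build_log.append((device, compute_type))

    def transcribe(self, audio_path):
        if self.device == "auto":
            raise RuntimeError(CUBLAS_ERROR)
        return f"transcript of {audio_path}"


_model = None
_model_name = None


def transcribe_buggy(audio_path, model_name):
    """Retries on CPU, but loses the fallback's identity before the next call."""
    global _model, _model_name
    if _model is None or _model_name != model_name:
        _model = FakeWhisperModel(model_name, "auto", "default")
        _model_name = model_name
    try:
        return _model.transcribe(audio_path)
    except RuntimeError:
        _model = FakeWhisperModel(model_name, "cpu", "int8")
        _model_name = None  # assumed bug: cache identity no longer matches
        return _model.transcribe(audio_path)


transcribe_buggy("a.ogg", "base")
transcribe_buggy("b.ogg", "base")
print(build_log)
```

Both calls succeed, but the log shows the auto model being built and failing twice, which is the wasted work the issue reports.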

Affected Component

Gateway (Telegram/Discord/Slack/WhatsApp), Other

Messaging Platform (if gateway-related)

Operating System

WSL2 / Linux (reproduced locally on my setup)

Relevant Logs / Traceback

RuntimeError: Library libcublas.so.12 is not found or cannot be loaded

Root Cause Analysis (optional)

Proposed Fix (optional)

Track the cached local whisper model by both requested model name and active runtime mode/device (for example (model_name, device, compute_type)) so a successful CPU fallback can be reused. Add a regression test that confirms the second call does not instantiate a fresh device="auto" model after the first fallback succeeds.

Are you willing to submit a PR for this?

I'd like to fix this myself and submit a PR

extent analysis

TL;DR

Update the cache identity for the local whisper model to include the active runtime mode and device, allowing the successful CPU fallback to be reused on subsequent calls.

Guidance

Identify the cache management logic in the local STT code and modify it to track models by a composite key including model_name, device, and compute_type.
Verify that the updated cache logic correctly reuses the CPU fallback model on subsequent transcription requests after a successful fallback.
Consider adding a regression test to ensure the fix is stable and effective.
Review the _transcribe_local function to ensure it correctly utilizes the updated cache logic and reuses the CPU fallback model when available.

Example

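# Sketch of the updated cache logic (names are illustrative, not the PR's code)
_model_cache = {}

def get_model(model_name, device, compute_type, build_model):
    """Return the cached model for this (name, device, compute type) triple."""
    cache_key = (model_name, device, compute_type)
    if cache_key not in _model_cache:
        # Build the model once and cache it under the composite key
        _model_cache[cache_key] = build_model(model_name, device, compute_type)
    # Reuse the cached model on every later call
    return _model_cache[cache_key]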

Notes

A composite key alone is not enough. Callers keep requesting device="auto", so _transcribe_local must record the fallback's runtime mode when the CPU retry succeeds and consult it on the next lookup. A regression test that covers the second call guards against the fallback being dropped again.
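One way to connect the requested model to the runtime mode that actually works is a small lookup table consulted before the cache key is built. This is a sketch under assumptions: `_active_runtime`, `resolve_cache_key`, and `record_fallback` are hypothetical helpers, and `("auto", "default")` is a placeholder for whatever the configured default is.

```python
# requested model name -> (device, compute_type) known to work at runtime
_active_runtime = {}


def resolve_cache_key(model_name, default=("auto", "default")):
    """Build the cache key from the recorded runtime mode, if there is one."""
    device, compute_type = _active_runtime.get(model_name, default)
    return (model_name, device, compute_type)


def record_fallback(model_name, device="cpu", compute_type="int8"):
    """Remember that this model only works with the given runtime mode."""
    _active_runtime[model_name] = (device, compute_type)


before = resolve_cache_key("base")
record_fallback("base")  # called after the CPU retry succeeds
after = resolve_cache_key("base")
other = resolve_cache_key("small")  # other models are unaffected
print(before, after, other)
```

Keeping the table separate from the model cache means the composite key stays a plain tuple, while the decision about which runtime mode to request lives in one place.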

Recommendation

Update the cache logic to track models by a composite key of model_name, device, and compute_type, and record the active runtime mode after a fallback, so that the successful CPU fallback is reused on subsequent calls. This is a fix to the cache identity rather than a workaround.
