gemini-cli - 💡(How to fix) Fix Bug: Local Gemma model (gemma3-1b-gpu-custom) hangs and fails on NVIDIA GTX 1050 (Linux)

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

Error Message

Checking gemma.log shows the GPU is successfully detected and initialized (Selected adapter: NVIDIA GeForce GTX 1050), but the server throws a Request cancelled by client: context canceled error almost immediately.

Code Example

CLI Version                                         0.41.2Git Commit                                          b0c7a1722                                                                                        │
Model                                               Auto (Gemini 3)Sandbox                                             no sandbox                                                                                       │
OS                                                  linux                                                                                            │
Auth Method                                         Signed in with Google (enricferrera@gmail.com)Tier                                                Gemini Code Assist in Google One AI Pro
RAW_BUFFERClick to expand / collapse

What happened?

When attempting to use the local Gemma model (gemma3-1b-gpu-custom) on an NVIDIA GTX 1050 running Ubuntu 24.04 LTS, the gemini ask command hangs for a few seconds and then silently falls back to the cloud model.

Checking gemma.log shows the GPU is successfully detected and initialized (Selected adapter: NVIDIA GeForce GTX 1050), but the server throws a Request cancelled by client: context canceled error almost immediately.

I verified that the GPU can run the model by bypassing the CLI's internal timeout. When I send a direct payload using: echo '{"contents": [{"role": "user", "parts": [{"text": "Write a short poem about a laptop."}]}]}' > request.json curl -d @request.json -H "Content-Type: application/json" http://localhost:9379/v1beta/models/gemma3-1b-gpu-custom:generateContent ...the server successfully generates the response, but it takes an exceptionally long time. The CLI appears to be timing out during the model's prefill phase due to the slower hardware.

What did you expect to happen?

I expected the CLI to wait for the local GPU to finish generating the response instead of immediately timing out and falling back to the cloud. Ideally, there should be a configurable timeout setting in settings.json for the local model router to accommodate older/slower hardware

Client information

<details> <summary>Client Information</summary>

Run gemini to enter the interactive CLI, then run the /about command.

CLI Version                                         0.41.2                                                                                           │
│ Git Commit                                          b0c7a1722                                                                                        │
│ Model                                               Auto (Gemini 3)                                                                                  │
│ Sandbox                                             no sandbox                                                                                       │
│ OS                                                  linux                                                                                            │
│ Auth Method                                         Signed in with Google ([email protected])                                                   │
│ Tier                                                Gemini Code Assist in Google One AI Pro
</details>

Login information

Not relevant

Anything else we need to know?

Logs showing the crash from the CLI's hardcoded timeout:

1 I0000 00:00:1778196558.764659 4732 environment.cc:554] Selected adapter: NVIDIA GeForce GTX 1050, arch=pascal, vendor=nvidia, backend=Vulkan, adapterType=Discrete GPU 2 2026/05/08 01:29:21 Engine initialized 3 2026/05/08 01:29:22 Request cancelled by client: context canceled 4 E0000 00:00:1778196756.215127 4711 process_state.cc:771] RAW: Raising signal 15 with default behavior It seems the VRAM limits or Pascal architecture compute speed is just too slow for the CLI's default "impatience" threshold.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING