Feature Request: Hybrid Local/Cloud “Instant” Models for Codex and ChatGPT using Apple/Intel/AMD NPUs [1 comment, 1 participant]

GitHub stats: openai/codex#22041, fetched 2026-05-11 03:20:20
Comments: 1 · Participants: 1 · Timeline events: 4 (labeled ×3, commented ×1) · Reactions: 0


What variant of Codex are you using?

App

What feature would you like to see?

Feature Request: Hybrid Local/Cloud “Instant” Models for Codex and ChatGPT using Apple/Intel/AMD NPUs

Summary

I would like to propose a new class of lightweight “Instant” models for Codex and ChatGPT, designed to run locally on modern consumer hardware using available NPUs, CPU cores, GPU cores, and system memory.

The idea is not to replace frontier cloud models, but to create a hybrid inference system:

  • small local model for fast, low-cost, high-frequency tasks
  • cloud fallback for complex reasoning, large-context understanding, safety-critical tasks, and tasks the local model cannot solve confidently

This could reduce cloud infrastructure load, improve responsiveness, make each Codex session's cloud token budget go further, and make AI coding workflows more scalable.

Proposed Feature

Develop a family of local “Instant” models, for example:

  • GPT-5.5 Instant Local
  • GPT Codex Instant Local
  • GPT Mini Coding Local
  • GPT Local Router / Assistant Router

These models could run inside:

  • Codex CLI
  • Codex IDE extension
  • Codex app
  • ChatGPT desktop/mobile app, where supported

They would use local hardware acceleration where available:

  • Apple Neural Engine / Apple Silicon GPU / CPU
  • Intel NPU
  • AMD NPU
  • Qualcomm NPU
  • local CPU and RAM fallback
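
The fallback order in the list above could be sketched as follows. This is a minimal illustration, not how Codex detects hardware: the backend names and the preference order are assumptions, and real detection would go through each vendor's runtime.

```python
from typing import Iterable

# Hypothetical backend labels, ordered from most to least preferred.
# The ordering is an assumption for illustration, not a benchmark result.
PREFERRED_BACKENDS = ["npu", "gpu", "cpu"]


def pick_backend(available: Iterable[str]) -> str:
    """Return the most preferred backend that is present, defaulting to CPU."""
    present = {name.lower() for name in available}
    for backend in PREFERRED_BACKENDS:
        if backend in present:
            return backend
    # CPU and RAM are the fallback named in the proposal.
    return "cpu"
```

A client would call `pick_backend` once at startup with whatever accelerators it detected and load the matching model build.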

Hybrid Routing Architecture

The client could decide whether to answer locally or escalate to the cloud.

Example local tasks:

  • autocomplete
  • simple refactoring
  • file summarization
  • syntax explanations
  • small code transformations
  • grep/search explanation
  • commit message generation
  • simple test generation
  • local project indexing
  • short-context Q&A
  • boilerplate generation
  • repeated agent planning steps

Example cloud tasks:

  • complex architectural decisions
  • large repository-wide reasoning
  • security-sensitive changes
  • difficult debugging
  • multi-file agent execution
  • tasks requiring long context
  • uncertain local model responses
  • final review before applying patches
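
One way to picture the split between the two lists above is a small rule-based router. The sketch below is only an illustration under stated assumptions: the task labels, the `local_context_limit` value, and the rule that unknown tasks go to the cloud are all hypothetical.

```python
from enum import Enum


class Route(Enum):
    LOCAL = "local"
    CLOUD = "cloud"


# Hypothetical task labels that mirror a few entries from the two lists.
LOCAL_TASKS = {"autocomplete", "simple_refactor", "file_summary",
               "commit_message", "boilerplate"}
CLOUD_TASKS = {"architecture", "repo_wide_reasoning", "security_change",
               "hard_debugging", "final_review"}


def route_task(task: str, context_tokens: int,
               local_context_limit: int = 4096) -> Route:
    """Pick a route from the task label and the size of its context."""
    if task in CLOUD_TASKS:
        return Route.CLOUD
    if context_tokens > local_context_limit:
        # Long-context work is on the cloud list regardless of task type.
        return Route.CLOUD
    if task in LOCAL_TASKS:
        return Route.LOCAL
    # Unknown task types are escalated; this conservative default is an assumption.
    return Route.CLOUD
```

A static table like this covers only the obvious cases; the "uncertain local model responses" entry needs a runtime signal, which the next section describes.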

Confidence-Based Cloud Fallback

The local model could include a confidence/routing mechanism:

  1. Try local model first.
  2. Estimate confidence.
  3. If confidence is low, summarize local context.
  4. Send only the necessary compressed context to the stronger cloud model.
  5. Receive cloud answer.
  6. Continue locally when possible.

This would make Codex sessions more efficient and reduce unnecessary cloud token usage.
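
The steps above can be sketched as a single function. Everything here is hypothetical: `local_model`, `cloud_model`, and `summarize` are placeholder callables supplied by the caller, and the `0.7` threshold is an arbitrary example value. The proposal does not say how confidence would be estimated.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class LocalResult:
    text: str
    confidence: float  # 0.0 to 1.0, however the local model estimates it


def answer(prompt: str,
           context: str,
           local_model: Callable[[str, str], LocalResult],
           cloud_model: Callable[[str, str], str],
           summarize: Callable[[str], str],
           threshold: float = 0.7) -> Tuple[str, str]:
    """Return (answer_text, source), where source is 'local' or 'cloud'."""
    # Steps 1 and 2: try the local model and read its confidence estimate.
    result = local_model(prompt, context)
    if result.confidence >= threshold:
        return result.text, "local"
    # Steps 3 and 4: compress the context and send only that to the cloud.
    compressed = summarize(context)
    # Step 5: receive the cloud answer.
    return cloud_model(prompt, compressed), "cloud"
```

Step 6 falls out of calling `answer` once per request: each new request starts with the local model again, so a single escalation does not pin the session to the cloud.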

Benefits

1. Lower infrastructure load

Many Codex and ChatGPT interactions are simple enough to be handled locally. Offloading these to user hardware could reduce cloud compute demand.

2. More effective tokens per Codex session

If local models handle repetitive or low-complexity steps, cloud tokens can be reserved for genuinely difficult reasoning.

3. Faster user experience

Local inference could provide near-instant responses for common coding tasks.

4. Better privacy options

Some operations, such as local codebase indexing, simple file summaries, or project navigation, could happen without sending every detail to the cloud.

5. Better offline or low-connectivity mode

Codex and ChatGPT could remain useful even when internet access is limited, with cloud escalation when connectivity returns.
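
"Cloud escalation when connectivity returns" suggests a queue of deferred requests. The sketch below is one possible shape for that, assuming a hypothetical `send` callable that performs the cloud call; how the client learns it is online is left to the caller.

```python
from collections import deque
from typing import Any, Callable, Deque, List, Optional


class EscalationQueue:
    """Holds cloud requests made while offline and replays them later."""

    def __init__(self, send: Callable[[Any], Any]) -> None:
        self._send = send  # hypothetical function that calls the cloud model
        self._pending: Deque[Any] = deque()

    def escalate(self, request: Any, online: bool) -> Optional[Any]:
        """Send now if online; otherwise queue the request and return None."""
        if online:
            return self._send(request)
        self._pending.append(request)
        return None

    def flush(self) -> List[Any]:
        """Send every queued request in arrival order and return the results."""
        results = []
        while self._pending:
            results.append(self._send(self._pending.popleft()))
        return results
```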

6. Better use of modern hardware

Modern laptops and desktops increasingly include NPUs. Codex could become one of the first developer tools to make practical use of this hardware.

Possible Implementation Ideas

  • Local model package downloaded per platform
  • Quantized models optimized for Apple Silicon, Intel, AMD, and Qualcomm NPUs
  • Configurable local/cloud routing
  • User setting: “Prefer local model when possible”
  • User setting: “Always ask before sending code to cloud”
  • Repo-local semantic index generated locally
  • Local embeddings for project search
  • Cloud model receives compressed summaries instead of full raw context where appropriate
  • Safety and correctness checks before applying local-generated code
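
To make the "repo-local semantic index" and "local embeddings for project search" items concrete, here is a toy version. It swaps a real embedding model for bag-of-words token counts with cosine similarity, which is a deliberate simplification so the example runs without any model; the function names are illustrative.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Stand-in 'embedding': token counts. A real index would use a local model."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def build_index(files: Dict[str, str]) -> Dict[str, Counter]:
    """Map each file path to its vector; everything stays on the local machine."""
    return {path: embed(text) for path, text in files.items()}


def search(index: Dict[str, Counter], query: str, top_k: int = 3) -> List[str]:
    """Return the paths of the top_k files most similar to the query."""
    q = embed(query)
    ranked = sorted(index.items(), key=lambda kv: cosine(q, kv[1]), reverse=True)
    return [path for path, _ in ranked[:top_k]]
```

With an index like this, only the few files that match a query would need to be summarized and sent to the cloud, rather than the whole repository.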

Suggested UX

In Codex settings:

Inference mode:
[ ] Cloud only
[ ] Local first, cloud fallback
[ ] Local only
[ ] Ask before cloud escalation
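
The four options above map naturally onto an enum plus two permission checks. This sketch assumes the options are mutually exclusive, which the mock-up does not state, and the names are illustrative rather than actual Codex settings.

```python
from enum import Enum


class InferenceMode(Enum):
    CLOUD_ONLY = "cloud_only"
    LOCAL_FIRST = "local_first"
    LOCAL_ONLY = "local_only"
    ASK_BEFORE_CLOUD = "ask_before_cloud"


def may_use_local(mode: InferenceMode) -> bool:
    """Local inference is allowed in every mode except cloud-only."""
    return mode is not InferenceMode.CLOUD_ONLY


def may_use_cloud(mode: InferenceMode, user_approved: bool = False) -> bool:
    """Cloud calls are blocked in local-only mode and gated on consent in ask mode."""
    if mode is InferenceMode.LOCAL_ONLY:
        return False
    if mode is InferenceMode.ASK_BEFORE_CLOUD:
        return user_approved
    return True
```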

### Additional information

_No response_

## Clarification

The goal of this proposal is not to replace OpenAI cloud models or avoid using the cloud.

The main idea is to let Codex use idle local compute that already exists on user machines.

During many Codex sessions, the developer’s CPU, GPU, memory, and NPU are mostly underused. Even while compiling, many machines still have unused compute capacity. At the same time, many Codex tasks are repetitive and lightweight:

- indexing the repository
- summarizing files
- analyzing compiler output
- summarizing test logs
- preparing local embeddings
- compressing context before sending it to the cloud
- simple refactors
- autocomplete
- duplicate detection
- dependency graph extraction

These tasks could be handled locally when possible.
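
As a rough illustration of "summarizing test logs" and "compressing context before sending it to the cloud", the sketch below uses a plain keyword filter as a stand-in for the local model. The keyword list and the `max_lines` cap are assumptions chosen for the example.

```python
import re


def compress_log(log: str, max_lines: int = 20) -> str:
    """Keep lines that mention errors, warnings, or failures; collapse exact repeats."""
    counts = {}  # dicts keep insertion order, so first-seen order is preserved
    for raw in log.splitlines():
        line = raw.strip()
        if re.search(r"\b(error|warning|failed)\b", line, re.IGNORECASE):
            counts[line] = counts.get(line, 0) + 1
    kept = [f"{line} (x{n})" if n > 1 else line for line, n in counts.items()]
    return "\n".join(kept[:max_lines])
```

For a long build log, the cloud model would then receive a handful of distinct diagnostics instead of thousands of progress lines.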

The cloud model would still remain the premium intelligence layer for:

- complex reasoning
- architecture decisions
- difficult debugging
- large multi-file changes
- final review
- low-confidence local results

So the user’s machine becomes a local co-processor for Codex, while OpenAI cloud remains the main reasoning brain.

This could benefit both users and OpenAI:

- less unnecessary cloud workload
- lower infrastructure pressure
- faster local interactions
- longer and more productive Codex sessions
- better privacy story for enterprise users
- better use of modern Apple/Intel/AMD/Qualcomm NPU hardware
- cloud models focused on high-value reasoning instead of repetitive lightweight tasks

In short: this is not “local instead of cloud”; it is “local compute assisting the cloud.”


When laptops reach hundreds of TOPS and 64 GB or more of RAM, Codex should treat the user machine as a local AI worker node, not just as a client.
