codex: Keep GPT-5.4-class long-context Codex models available for operational/GitOps workflows [2 comments, 2 participants]

GitHub stats
openai/codex#21229 · Fetched 2026-05-06 06:24:31
Comments: 2 · Participants: 2 · Timeline events: 5 · Reactions: 0
Timeline (top): labeled ×3, commented ×2


Summary

For my main Codex use case, gpt-5.4 is still strongly preferable to gpt-5.5.

This is not because gpt-5.5 is worse overall. For normal application programming, I do find gpt-5.5 to be a real improvement over gpt-5.4, and in that context the higher cost can be justified.

However, my primary Codex workflow is not general SWE. I use Codex for long-running, operationally sensitive GitOps/Kubernetes/AI infrastructure work. In that context, gpt-5.4 currently has the better tradeoff: lower cost, larger context window, much lower compaction frequency, and more literal instruction-following over long sessions.

My request is: please either keep a gpt-5.4-class model/profile available in Codex, or provide a newer model/profile that preserves the same properties at a comparable price point.

I would also like to hear from other Codex users running long-lived infrastructure, GitOps, deployment, incident-response, platform engineering, or ops-heavy sessions. If you depend on long context, low compaction frequency, lower-cost high/xhigh reasoning, or strict instruction-following over many turns, please comment with your use case.

My use case

I use Codex to orchestrate changes to GitOps-style Kubernetes-managed clusters running AI/ML workflows.

These sessions are usually long-lived and operationally sensitive. They often involve:

  • multiple commits;
  • verification of changes against real deployments;
  • observing ArgoCD state;
  • inspecting Kubernetes state;
  • carefully manipulating Git state;
  • waiting for ArgoCD deployments to converge;
  • waiting for vLLM models to load;
  • running evals and using the results to decide the next step.
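
The waiting steps in the list above (ArgoCD convergence, vLLM model load) boil down to polling a read-only probe until it succeeds or a deadline passes. The following is a minimal sketch of that pattern, not something taken from the issue: `wait_until` is a hypothetical helper, and the probe itself (for example, a check of ArgoCD application health or of a model server's readiness) is left to the caller.

```python
import time
from typing import Callable


def wait_until(
    probe: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll a read-only probe until it returns True or the deadline passes.

    Returns True if the probe succeeded and False on timeout. The probe is
    expected to only observe state; this helper never mutates anything.
    `clock` and `sleep` are injectable so the loop can be tested without
    real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Keeping the probe observation-only matches the workflow described here: waiting is never a reason to touch cluster state.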

In this workflow, Git is the normal mutation path. argocd and kubectl are mostly used for observation.

There are occasions where direct mutation through ArgoCD or Kubernetes is warranted, and I do explicitly request that when needed. But those should be treated as operational exceptions, not as standing permission for the model to continue mutating state directly afterward.

That distinction matters a lot. If the model overgeneralizes from one exception, forgets an earlier constraint, or decides to be “helpful” in the wrong direction, the result can be an outage. This is not a workflow where extra initiative is always welcome. Sometimes the best possible model behavior is boring, literal, and obedient.
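
The rule described above (Git as the mutation path, `argocd` and `kubectl` for observation, direct mutation only as an explicit one-time exception) can be sketched as a small command gate. This is a hypothetical illustration, not part of Codex: `CommandGate`, `is_observation`, and the verb lists are assumptions made for the example, and the lists are deliberately incomplete so that anything unrecognized is denied.

```python
import shlex

# Illustrative, deliberately incomplete verb lists; anything unknown is denied.
KUBECTL_READ_ONLY = {"get", "describe", "logs", "top"}
ARGOCD_APP_READ_ONLY = {"get", "list", "diff", "history", "wait"}


def is_observation(command: str) -> bool:
    """Return True only for commands recognized as read-only."""
    argv = shlex.split(command)
    if len(argv) >= 2 and argv[0] == "kubectl":
        return argv[1] in KUBECTL_READ_ONLY
    if len(argv) >= 3 and argv[:2] == ["argocd", "app"]:
        return argv[2] in ARGOCD_APP_READ_ONLY
    return False


class CommandGate:
    """Allow observation freely; allow a mutation only via a one-shot grant."""

    def __init__(self) -> None:
        self._grants: set[str] = set()

    def grant_once(self, command: str) -> None:
        """Record an explicit operator exception for exactly this command."""
        self._grants.add(command)

    def allow(self, command: str) -> bool:
        if is_observation(command):
            return True
        if command in self._grants:
            # Consume the grant so the exception does not become policy.
            self._grants.remove(command)
            return True
        return False
```

For example, after `gate.grant_once("kubectl delete pod x")`, the first `gate.allow("kubectl delete pod x")` returns `True` and the second returns `False`. The sketch fails closed: a command with flags before the verb, such as `kubectl -n ml get pods`, is also denied rather than guessed at.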

Why gpt-5.4 works better for this

For this specific operational workflow, gpt-5.4 has several properties that matter more than the SWE improvements in gpt-5.5.

1. Lower cost makes the useful settings feasible

gpt-5.4 is cheaper.

Because of that, it is practical to run it in fast mode with high or xhigh thinking without burning through quotas aggressively.

That matters for long operational sessions. These are not short “implement this function” interactions. They can span many steps, several rounds of verification, deployment observation, and follow-up changes. The cost profile of the model directly affects whether the workflow is practical.

With gpt-5.5, the higher cost makes the same style of usage much harder to justify for this class of task.

2. The 1M context window is a better fit than the SWE improvements

The 1M context window in gpt-5.4 is extremely important for my use case.

My sessions depend on the model remembering details from much earlier in the conversation:

  • workflow boundaries;
  • constraints I gave earlier;
  • which tools are allowed for observation versus mutation;
  • previous deployment states;
  • earlier warnings;
  • prior decisions;
  • specific instructions about sequencing;
  • what has already been verified;
  • what must not be repeated.

With the 1M context window, I basically do not run into compaction. It is a once-in-a-blue-moon event. For this workflow, the current size is close to a perfect fit.

By contrast, the shorter context window in gpt-5.5 means compaction becomes a much more realistic concern. For SWE tasks, that may be acceptable. For long operational sessions where old constraints still matter, it is a major regression.
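
As a rough illustration of why window size drives compaction frequency, the sketch below assumes a deliberately simple model: context grows by a roughly constant number of tokens per turn, and compaction triggers once a fixed share of the window is used. How Codex actually decides to compact is not described in this issue, and every number other than the 1M window is a hypothetical placeholder.

```python
def turns_before_compaction(
    window_tokens: int,
    tokens_per_turn: int,
    usable_percent: int = 90,
) -> int:
    """Estimate how many turns fit before compaction under a linear model.

    Assumptions (not from the issue): each turn adds `tokens_per_turn`
    tokens, and compaction starts once `usable_percent` of the window
    is filled.
    """
    if tokens_per_turn <= 0:
        raise ValueError("tokens_per_turn must be positive")
    if not 0 < usable_percent <= 100:
        raise ValueError("usable_percent must be in (0, 100]")
    budget = window_tokens * usable_percent // 100
    return budget // tokens_per_turn


# 1M window from the text; 4,000 tokens per turn is a made-up figure.
long_window_turns = turns_before_compaction(1_000_000, 4_000)
# A hypothetical window four times smaller, same per-turn growth.
short_window_turns = turns_before_compaction(250_000, 4_000)
```

Under this model the number of turns before compaction scales linearly with the window, so a window four times smaller hits compaction about four times sooner at the same per-turn growth.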

3. Literal instruction-following is more valuable than creativity

For my use case, gpt-5.4 follows what I say to the letter with a level of reliability that is very valuable.

This may not generalize to all SWE workflows. I am not claiming that gpt-5.4 is universally better, or that it is better for all coding tasks. But for my operational GitOps/Kubernetes workflow, the model’s tendency to stay close to instructions is a feature.

The extra creativity and initiative in gpt-5.5 can be useful when programming applications. But in operational infrastructure work, that same behavior can become a liability.

I do not want the model to infer that because I allowed one direct kubectl or argocd mutation in a specific situation, it now has permission to treat that as the new default workflow. I do not want it to reinterpret guardrails halfway through a long session. I do not want it to “improve” the operational procedure unless explicitly asked.

For this workflow, paying more for a model that requires me to be more vigilant and repeat hard constraints more often is not a good tradeoff.

This is also why I refrain from using Anthropic models for this kind of workflow. They do not show the discipline and strict adherence to a rule hierarchy that it needs. Here, gpt-5.4 shines.

Where gpt-5.5 is better

I do want to be clear that gpt-5.5 is a meaningful improvement for application programming.

When the task is normal SWE work, I find gpt-5.5 pleasant to use. It is better suited to implementation-heavy issues, and the improvement over gpt-5.4 is small but noticeable. In that setting, the higher cost can make sense.

So this is not feedback that gpt-5.5 should not exist, or that it is not valuable.

The point is that the two models serve different high-value profiles:

  • gpt-5.5 is better for many programming/application-development workflows.
  • gpt-5.4 is better for my long-context, operationally sensitive GitOps/Kubernetes workflows.

Those are not the same workload.

Why this matters

Codex is not only used for isolated SWE tasks.

In my case, Codex is part of an operational workflow around AI/ML infrastructure. The model needs to preserve context over long sessions, follow instructions very literally, and distinguish between normal operating procedure and explicitly requested exceptions.

For this class of work, the priority is not maximum coding creativity. The priority is operational reliability.

The model needs to remember what was said much earlier, avoid reinterpreting exceptions as policy changes, and preserve a careful distinction between observation and mutation. A larger context window and lower compaction frequency are not just conveniences here. They are part of what makes the workflow safe and useful.

Requested outcome

Please either:

  1. keep a gpt-5.4-class Codex model/profile available for users who need long context, lower cost, low compaction frequency, and literal instruction-following; or
  2. provide a newer model/profile that preserves those properties at a comparable price point.

The main concern is that optimizing Codex only around SWE improvements risks regressing an important operational use case.

For application programming, gpt-5.5 is a welcome improvement.

For long-running operational GitOps/Kubernetes/AI infrastructure sessions, gpt-5.4 remains the better tool.

extent analysis

TL;DR

The user requests that a gpt-5.4-class model or a newer model with similar properties (lower cost, larger context window, lower compaction frequency, and literal instruction-following) be made available for long-running, operationally sensitive GitOps/Kubernetes/AI infrastructure work.

Guidance

  • Evaluate the trade-offs between gpt-5.4 and gpt-5.5 for your specific use case, considering factors such as cost, context window, and instruction-following.
  • If you rely on long context, low compaction frequency, and literal instruction-following, consider reaching out to the Codex team to express your needs and potentially influence the development of future models.
  • Consider splitting work by model: the issue author finds gpt-5.5 better for application programming and gpt-5.4 better for long-context, operationally sensitive GitOps/Kubernetes sessions.
  • Provide feedback to the Codex team on the importance of preserving operational reliability and safety in their models, particularly for use cases that involve long-running infrastructure sessions.

Example

No specific code snippet is applicable in this case, as the issue is related to model selection and configuration rather than code implementation.

Notes

The user's request highlights the importance of considering diverse use cases when developing and optimizing AI models. While gpt-5.5 may be an improvement for application programming, it may not be suitable for all types of workflows, such as long-running operational infrastructure sessions.

Recommendation

Workaround: while a gpt-5.4-class model remains available, keep using it for long-running operational sessions, and ask the Codex team either to retain it or to provide a newer model with the same properties (long context, lower cost, low compaction frequency, literal instruction-following) at a comparable price point. Commenting on the issue with your own use case gives the team concrete evidence for that request.
