openclaw - 💡(How to fix) Fix [Bug]: Kubernetes gateway regression: `2026.3.28` OOM-crashes on startup while `2026.3.8` passes same canary [2 comments, 2 participants]

openclaw · 2026-03-29 21:25:29


openclaw/openclaw#57303 · Fetched 2026-04-08 01:51:19

Author: matthewprotti

Participants: matthewprotti, NathanMartinNZ

Timeline: subscribed ×3, commented ×2, labeled ×2, cross-referenced ×1

We are seeing a reproducible regression in our Kubernetes OpenClaw gateway deployment. In the same alpha cluster, with the same deployment shape and same gateway config:

2026.3.8 works and passes our canary.
2026.3.28 starts, then crash-loops with a Node/V8 heap OOM.

We are not reporting a generic “it feels flaky” issue. We built a canary path to compare old and new digests in the same cluster and deployment shape, and the newer version fails while the older one passes.


Bug type

Regression (worked before, now fails)

Beta release blocker


Steps to reproduce

Deploy OpenClaw in Kubernetes as a dedicated in-cluster gateway Deployment.
Use a ClusterIP Service only, with no public ingress on port 18789.
Configure the gateway with: gateway.mode=local, gateway.bind=lan, gateway.port=18789, gateway.auth.mode=token, gateway.http.endpoints.chatCompletions.enabled=true.
Mount a writable config volume at /config and a writable state dir at /tmp/openclaw-state.
Run the container with: openclaw gateway --bind lan --port 18789
Set resources to requests of 250m CPU and 512Mi memory, and limits of 1 CPU and 2Gi memory.
Deploy first with digest sha256:7b1294f6aa2eb05b2070cc614743f79212313fc294e5de221ada8a2969ea52f6 and verify it passes local and cross-pod health checks.
Replace only the image digest with sha256:5900559f795ef15ea2f0b1fc488726d9b27bb2de398424c14d13c9b1f1ff0d66.
Observe repeated restarts and OOM failure.

Expected behavior

The gateway should start successfully and remain stable under the same Kubernetes deployment shape that works with 2026.3.8.

Actual behavior

The 2026.3.28 gateway pod starts, then repeatedly restarts and fails with a Node/V8 heap OOM.

OpenClaw version

Working: 2026.3.8 (ghcr.io/openclaw/openclaw@sha256:7b1294f6aa2eb05b2070cc614743f79212313fc294e5de221ada8a2969ea52f6)
Failing: 2026.3.28 (ghcr.io/openclaw/openclaw@sha256:5900559f795ef15ea2f0b1fc488726d9b27bb2de398424c14d13c9b1f1ff0d66)

Operating system

Linux containers on managed Kubernetes

Install method

Kubernetes Deployment using the GHCR container image

Model

openai/gpt-5.2

Provider / routing chain

openclaw gateway -> openai/gpt-5.2

Additional provider/model setup details

token-authenticated local gateway
ClusterIP only
writable config at /config/openclaw.json
writable state dir at /tmp/openclaw-state
OPENAI_API_KEY via secret
OPENCLAW_GATEWAY_TOKEN via secret

Logs, screenshots, and evidence

We did a same-environment canary comparison.

2026.3.8 passed the same canary with local and cross-pod health succeeding.

2026.3.28 failed with repeated restarts and the following previous-container log:

<--- Last few GCs --->

[14:0x3bb47000]    27802 ms: Scavenge (interleaved) 1012.8 (1023.9) -> 1012.2 (1028.4) MB, pooled: 0 MB, 82.22 / 0.00 ms  (average mu = 0.228, current mu = 0.194) allocation failure;
[14:0x3bb47000]    30311 ms: Mark-Compact 1015.7 (1028.4) -> 1013.7 (1030.4) MB, pooled: 1 MB, 2426.60 / 0.00 ms  (average mu = 0.161, current mu = 0.094) allocation failure; scavenge might not succeed

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
----- Native stack trace -----

 1: 0x735eec node::OOMErrorHandler(char const*, v8::OOMDetails const&) [openclaw-gateway]
 2: 0xbafc40  [openclaw-gateway]
 3: 0xbafd2f  [openclaw-gateway]
 4: 0xe48825  [openclaw-gateway]
 5: 0xe48852  [openclaw-gateway]
 6: 0xe48b4a  [openclaw-gateway]
 7: 0xe5906a  [openclaw-gateway]
 8: 0xe5d410  [openclaw-gateway]
 9: 0x18efec1  [openclaw-gateway]
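The two GC lines above follow the format V8 prints in its `Last few GCs` block: collector type, then heap used before and after the collection in MB, with the committed size in parentheses. A minimal Python sketch that pulls those numbers out makes the failure pattern explicit. The regex is written against the two lines in this log rather than a documented format, so treat it as an assumption that may need adjusting for other V8 versions.

```python
import re
from typing import NamedTuple

# Written against the two "Last few GCs" lines in this log, for example:
# "[14:0x3bb47000]    30311 ms: Mark-Compact 1015.7 (1028.4) -> 1013.7 (1030.4) MB, ..."
# Groups: elapsed ms, collector name, heap used before and after (committed size in parentheses).
GC_LINE = re.compile(
    r"\]\s+(?P<ms>\d+)\s+ms:\s+(?P<kind>[A-Za-z-]+)(?:\s+\([^)]*\))?\s+"
    r"(?P<before>[\d.]+)\s+\([\d.]+\)\s+->\s+(?P<after>[\d.]+)\s+\([\d.]+\)\s+MB"
)


class GcEvent(NamedTuple):
    time_ms: int
    kind: str
    before_mb: float
    after_mb: float

    @property
    def reclaimed_mb(self) -> float:
        return self.before_mb - self.after_mb


def parse_gc_events(log: str) -> list[GcEvent]:
    """Return one GcEvent per recognizable GC line; other lines are ignored."""
    events = []
    for line in log.splitlines():
        m = GC_LINE.search(line)
        if m:
            events.append(GcEvent(int(m["ms"]), m["kind"], float(m["before"]), float(m["after"])))
    return events


LOG = """\
[14:0x3bb47000]    27802 ms: Scavenge (interleaved) 1012.8 (1023.9) -> 1012.2 (1028.4) MB, pooled: 0 MB, 82.22 / 0.00 ms  (average mu = 0.228, current mu = 0.194) allocation failure;
[14:0x3bb47000]    30311 ms: Mark-Compact 1015.7 (1028.4) -> 1013.7 (1030.4) MB, pooled: 1 MB, 2426.60 / 0.00 ms  (average mu = 0.161, current mu = 0.094) allocation failure; scavenge might not succeed
"""

for e in parse_gc_events(LOG):
    print(f"{e.time_ms} ms {e.kind}: {e.before_mb} -> {e.after_mb} MB, reclaimed {e.reclaimed_mb:.1f} MB")
```

For this log it reports about 0.6 MB reclaimed by the Scavenge and about 2.0 MB by a Mark-Compact that took roughly 2.4 s, with roughly 1 GB in use, about half of the pod's 2Gi limit. Full collections that free almost nothing near the ceiling are what V8's "Ineffective mark-compacts near heap limit" message describes.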

Kubernetes observations:

repeated restart/backoff behavior
last exit code 1
older digest remains stable in the same deployment shape
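The reported last exit code can be read with the usual shell convention: values above 128 mean 128 plus a signal number, so 137 is SIGKILL, which is what the kernel OOM killer sends (Kubernetes then reports the reason OOMKilled). A minimal sketch of that interpretation in Python; it is a reading aid, not an OpenClaw tool:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a container exit code using the 128 + signal-number convention."""
    if code == 0:
        return "exited normally"
    if code > 128:
        sig = code - 128
        try:
            name = signal.Signals(sig).name
        except ValueError:
            name = f"signal {sig}"
        if sig == signal.SIGKILL:
            # SIGKILL is what the kernel OOM killer sends; confirm via reason OOMKilled.
            return f"killed by {name} (possibly the OOM killer; look for reason OOMKilled)"
        return f"killed by {name}"
    return f"exited with error status {code} (not a signal)"


for code in (1, 137):
    print(code, "->", describe_exit_code(code))
```

The reported code 1 falls in the plain error-status branch, which is consistent with the process aborting on its own V8 heap limit rather than being killed for exceeding the container limit.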

Impact and severity

High for this deployment pattern.

This blocks us from upgrading OpenClaw in our production-like alpha environment while keeping the same Kubernetes deployment model that currently works on 2026.3.8.

Additional information

This is not the earlier docs/bind mismatch around --bind 0.0.0.0. It is also not just a readiness-probe false negative: the older version passes in the same environment, while the newer version crash-loops with an actual OOM.

Extended analysis

Fix Plan

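To address the Node/V8 heap OOM in the OpenClaw gateway deployment, try the following mitigations. They are workarounds rather than a root-cause fix: the same configuration passes on 2026.3.8, so the regression itself is in the newer release.

Raise the V8 heap ceiling: the GC log shows the heap at roughly 1 GB when V8 aborted, about half of the 2Gi container limit, and the last exit code was 1 rather than 137 (OOMKilled). That points at the V8 heap limit rather than the kernel OOM killer, so raising the container memory limit alone may not be enough. Setting --max-old-space-size explicitly (for example via NODE_OPTIONS, assuming the gateway process honors it) is the more direct lever, with the container limit raised to leave headroom above it.
Keep the gateway configuration unchanged while testing: gateway.mode, gateway.bind and gateway.port are identical in the passing and failing canaries, so changing them is unlikely to reduce memory use and would muddy the comparison.
Update the Deployment YAML: apply the new memory settings to the pod template and keep the 2026.3.8 digest at hand for rollback.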
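When raising memory limits, the V8 heap ceiling can be set explicitly from the container limit instead of being left to the runtime's defaults; Node's `--max-old-space-size` flag takes a value in megabytes. A minimal Python sketch, assuming a hypothetical 75% heap fraction (not an OpenClaw recommendation) and handling only plain, binary (`Ki`, `Mi`, `Gi`, `Ti`) and decimal (`k`, `M`, `G`, `T`) Kubernetes quantity suffixes:

```python
import re

# Only the Kubernetes quantity suffixes listed here are handled; exponent forms
# such as "1e9", milli-units and larger suffixes are rejected with ValueError.
_MULTIPLIERS = {
    "": 1,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}
_QUANTITY = re.compile(r"(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti|k|M|G|T)?")


def quantity_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as "512Mi" or "4Gi" to bytes."""
    m = _QUANTITY.fullmatch(quantity.strip())
    if m is None:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    return int(float(m.group(1)) * _MULTIPLIERS[m.group(2) or ""])


def suggest_max_old_space_mb(memory_limit: str, heap_fraction: float = 0.75) -> int:
    """Hypothetical heuristic: give the V8 old space a fraction of the container limit,
    leaving the remainder for native allocations, code space and buffers."""
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be between 0 and 1")
    limit_mb = quantity_to_bytes(memory_limit) / 2**20
    return int(limit_mb * heap_fraction)


print(f"NODE_OPTIONS=--max-old-space-size={suggest_max_old_space_mb('4Gi')}")
```

For a 4Gi limit this suggests 3072, and for the original 2Gi limit 1536. Whether the gateway picks the flag up through NODE_OPTIONS is an assumption to confirm against the image.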

Example code snippets:

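# Updated deployment YAML (only memory-related fields shown; keep your existing
# command, volumes and env. The app label is a placeholder: match your existing selector.)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openclaw-gateway
spec:
  selector:
    matchLabels:
      app: openclaw-gateway
  template:
    metadata:
      labels:
        app: openclaw-gateway
    spec:
      containers:
      - name: openclaw-gateway
        image: ghcr.io/openclaw/openclaw@sha256:5900559f795ef15ea2f0b1fc488726d9b27bb2de398424c14d13c9b1f1ff0d66
        env:
        # Assumption: the gateway's Node runtime honors NODE_OPTIONS.
        # 3072 MB leaves roughly 1Gi under the 4Gi limit for non-heap memory.
        - name: NODE_OPTIONS
          value: "--max-old-space-size=3072"
        resources:
          requests:
            cpu: 250m
            memory: 1024Mi
          limits:
            cpu: 1
            memory: 4Gi

# Example command to update deployment
kubectl apply -f updated-deployment.yaml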

Verification

To verify the fix, watch the gateway pods for a stable restart count and confirm that the previous-container log no longer shows the V8 heap OOM. kubectl get, kubectl logs --previous and kubectl describe are enough for this.

Example commands:

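# Check deployment status and pod restart counts
kubectl get deployments
kubectl get pods -l app=openclaw-gateway   # assumes the pods carry the label app=openclaw-gateway

# Check the log of the previous (crashed) container; pods get generated names,
# so address the Deployment rather than a pod literally named openclaw-gateway
kubectl logs deployment/openclaw-gateway --previous

# Describe deployment
kubectl describe deployment openclaw-gateway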
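Restart counts and the last termination reason can also be read from `kubectl get pods -o json`, using the standard `status.containerStatuses[].restartCount` and `lastState.terminated` fields of the Pod API. A minimal Python sketch; the sample pod below is made up for illustration and is not taken from the issue:

```python
import json


def summarize_pods(pods_json: str) -> list[str]:
    """Summarize restarts and last termination per container from `kubectl get pods -o json`."""
    doc = json.loads(pods_json)
    lines = []
    for pod in doc.get("items", []):
        name = pod["metadata"]["name"]
        for cs in pod.get("status", {}).get("containerStatuses", []):
            # lastState.terminated is only present after a container has exited at least once.
            term = cs.get("lastState", {}).get("terminated")
            last = f"{term.get('reason')} (exit {term.get('exitCode')})" if term else "none"
            lines.append(f"{name}/{cs['name']}: restarts={cs.get('restartCount', 0)}, last termination={last}")
    return lines


# Made-up sample shaped like kubectl output; not data from the issue.
SAMPLE = json.dumps({
    "items": [{
        "metadata": {"name": "openclaw-gateway-example"},
        "status": {"containerStatuses": [{
            "name": "openclaw-gateway",
            "restartCount": 3,
            "lastState": {"terminated": {"exitCode": 1, "reason": "Error"}},
        }]},
    }]
})

for line in summarize_pods(SAMPLE):
    print(line)
```

In practice the input would come from something like `kubectl get pods -l app=openclaw-gateway -o json`, where the label is an assumption about how the pods are labeled. A fix that holds should show the restart count staying flat.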

Extra Tips

Regularly monitor the deployment's memory usage and adjust the limits as needed to prevent OOM errors.
Do not expect a HorizontalPodAutoscaler to fix this: it adds replicas, but each replica exhausts its own heap during startup.
Review the OpenClaw gateway documentation for any recommendations on optimizing memory usage and configuration.
