openclaw - [Feature]: Pluggable subagent execution backends and resource profiles (Kubernetes, containers, remote workers)

Summary

Add a pluggable remote execution backend for spawned subagents/ACP sessions so OpenClaw can run agent workers in Kubernetes or other resource-isolated compute environments, with explicit resource profiles selected per spawn, binding, or agent config.

This is broader than "Sandboxing + ACP". Kubernetes is one strong implementation target, but the product-level feature is placement and resource selection for spawned agents and sessions.

Problem

Today, sessions_spawn and ACP-backed agent runs are mostly tied to the local gateway host/runtime. This makes it hard to:

isolate risky or heavy coding-agent work from the gateway process
choose CPU/memory/GPU resources per task
run many long-running agents without exhausting the host
route different agent types to different execution environments
observe and clean up remote worker lifecycle consistently
support teams where the gateway is lightweight but workers should run on dedicated infrastructure

A user may want to say, effectively:

Spawn Codex in a 4 CPU / 8 GiB worker with network egress disabled.

or:

Spawn Gemini/OpenCode in a cheap small worker unless the task asks for build/test, then use a larger profile.

or:

Run this ACP session in the team's Kubernetes namespace, not on the gateway host.

Proposed capability

Introduce a generic spawn execution backend/profile layer for subagents and ACP sessions.

Conceptually:

{
  "agents": {
    "executionBackends": {
      "local": { "type": "process" },
      "docker": { "type": "container" },
      "k8s": {
        "type": "kubernetes",
        "namespace": "openclaw-agents",
        "profiles": {
          "small": {
            "image": "ghcr.io/openclaw/agent-worker:VERSION",
            "resources": {
              "requests": { "cpu": "500m", "memory": "1Gi" },
              "limits": { "memory": "2Gi" }
            }
          },
          "large-build": {
            "image": "ghcr.io/openclaw/agent-worker:VERSION",
            "resources": {
              "requests": { "cpu": "4", "memory": "8Gi" },
              "limits": { "cpu": "8", "memory": "16Gi" }
            }
          }
        }
      }
    }
  }
}

Then sessions_spawn / bindings / agent defaults could select a backend and profile:

{
  "runtime": "acp",
  "agentId": "codex",
  "execution": {
    "backend": "k8s",
    "profile": "large-build"
  },
  "message": "Run the full test suite and fix failures."
}
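A minimal sketch of how such a request might be resolved against the config above. The function name resolve_execution, the fallback to the "local" backend when no execution block is given, and the error messages are assumptions made for illustration, not existing OpenClaw behavior.

```python
from typing import Any, Optional

# Trimmed copy of the proposed config shape from this issue.
CONFIG: dict[str, Any] = {
    "agents": {
        "executionBackends": {
            "local": {"type": "process"},
            "k8s": {
                "type": "kubernetes",
                "namespace": "openclaw-agents",
                "profiles": {
                    "large-build": {
                        "image": "ghcr.io/openclaw/agent-worker:VERSION",
                        "resources": {
                            "requests": {"cpu": "4", "memory": "8Gi"},
                            "limits": {"cpu": "8", "memory": "16Gi"},
                        },
                    }
                },
            },
        }
    }
}


def resolve_execution(config: dict, request: dict) -> tuple[dict, Optional[dict]]:
    """Return (backend_config, profile_config) for a spawn request."""
    backends = config["agents"]["executionBackends"]
    # Assumption: a request without an execution block runs locally.
    execution = request.get("execution") or {"backend": "local"}
    name = execution["backend"]
    if name not in backends:
        raise ValueError(f"unknown execution backend: {name!r}")
    backend = backends[name]
    profile_name = execution.get("profile")
    if profile_name is None:
        return backend, None
    profiles = backend.get("profiles", {})
    if profile_name not in profiles:
        raise ValueError(f"backend {name!r} has no profile {profile_name!r}")
    return backend, profiles[profile_name]
```

Failing early on an unknown backend or profile keeps a bad selection from reaching the point where a worker is created.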

Kubernetes implementation sketch

A Kubernetes backend would:

resolve the requested profile and policy
build a worker Pod/Job manifest
validate it with a server-side dry run (Kubernetes API dryRun=All)
create the worker
wait for readiness
stream agent/ACP events back over a defined worker protocol
expose status as namespace/podName, phase, container status, and logs/events
delete workers, or let a TTL or sweeper reclaim them, after close, timeout, or parent reset
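The manifest-building step could look roughly like the sketch below. It only constructs a plain dict and does not talk to a cluster. The label keys under openclaw.example/, the container name, and port 8080 are hypothetical choices. The resulting object could then be submitted with the dryRun=All query parameter before the real create.

```python
def build_worker_pod(namespace: str, session_id: str,
                     profile_name: str, profile: dict) -> dict:
    """Build a Pod manifest (as a dict) for one agent worker."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            # Let the API server append a unique suffix to the name.
            "generateName": "openclaw-worker-",
            "namespace": namespace,
            "labels": {
                "app.kubernetes.io/managed-by": "openclaw",
                # Hypothetical label keys; a sweeper could select on them.
                "openclaw.example/session": session_id,
                "openclaw.example/profile": profile_name,
            },
        },
        "spec": {
            # A worker that exits should stay exited, not restart.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "agent-worker",
                    "image": profile["image"],
                    "resources": profile["resources"],
                    "ports": [{"containerPort": 8080}],
                    # Readiness is tied to the worker contract's /healthz.
                    "readinessProbe": {
                        "httpGet": {"path": "/healthz", "port": 8080}
                    },
                }
            ],
        },
    }


small_profile = {
    "image": "ghcr.io/openclaw/agent-worker:VERSION",
    "resources": {
        "requests": {"cpu": "500m", "memory": "1Gi"},
        "limits": {"memory": "2Gi"},
    },
}
pod = build_worker_pod("openclaw-agents", "sess-123", "small", small_profile)
```

Labeling each Pod with its parent session is what would let status reporting and cleanup map a backend resource back to the session that spawned it.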

The worker image should implement a narrow OpenClaw worker contract such as:

GET /healthz
POST /turn, which streams NDJSON events back
POST /cancel
optionally, GET /status and GET /logs for metadata
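On the gateway side, consuming the /turn stream mostly comes down to splitting the response body into lines and decoding each one as JSON. The sketch below shows that step in isolation, under the assumption that the body arrives as arbitrary byte chunks. The event fields ("type", "data") are made up for the example, since the issue does not define an event schema.

```python
import json
from typing import Iterable, Iterator


def iter_ndjson(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield one decoded JSON object per newline-terminated line."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # A chunk may hold zero, one, or several complete lines.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():
                yield json.loads(line)
    # Tolerate a final line that is missing its trailing newline.
    if buffer.strip():
        yield json.loads(buffer)


# Chunk boundaries deliberately fall in the middle of an event.
chunks = [b'{"type":"text","da', b'ta":"hi"}\n{"type":', b'"done"}\n']
events = list(iter_ndjson(chunks))
```

Buffering until a newline is seen matters because network reads do not respect event boundaries.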

Resource and policy selection

Profiles should support at least:

CPU/memory requests and limits
optional GPU/runtime class/node selector/tolerations
image and command/args
env/secret references
network policy class or egress mode
workspace volume strategy
timeout/TTL
allowed agent IDs
max concurrent workers per profile/backend/channel/user
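Two of these fields, allowed agent IDs and the concurrency cap, amount to an admission check performed before a worker is created. A minimal sketch, assuming a hypothetical ProfilePolicy record and a caller that already knows how many workers are running for the profile:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfilePolicy:
    """Hypothetical policy subset attached to one profile."""
    allowed_agent_ids: frozenset
    max_concurrent: int


def admit(policy: ProfilePolicy, agent_id: str, running: int) -> tuple[bool, str]:
    """Decide whether one more worker may start under this policy."""
    if agent_id not in policy.allowed_agent_ids:
        return False, f"agent {agent_id!r} is not allowed on this profile"
    if running >= policy.max_concurrent:
        return False, f"profile is at its limit of {policy.max_concurrent} workers"
    return True, "ok"


policy = ProfilePolicy(allowed_agent_ids=frozenset({"codex"}), max_concurrent=2)
```

Returning a reason alongside the decision gives the gateway something concrete to show the user when a spawn is refused.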

Selection could happen from:

explicit sessions_spawn.execution
agent defaults
ACP binding defaults
channel/user policy
task classification heuristics (a later addition)
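The issue lists these sources without fixing a precedence. One plausible reading, sketched below, is that the first source in the listed order that provides a value wins, with a local fallback. Both the ordering and the source names used as keys are assumptions for illustration.

```python
from typing import Optional

# Assumed precedence: most specific source first.
PRECEDENCE = ("spawn", "agent_default", "binding_default", "channel_policy")


def select_execution(sources: dict[str, Optional[dict]]) -> tuple[str, dict]:
    """Return (source_name, execution) from the first source that is set."""
    for name in PRECEDENCE:
        choice = sources.get(name)
        if choice:
            return name, choice
    # Assumption: with nothing configured, keep today's local behavior.
    return "fallback", {"backend": "local"}


source, execution = select_execution({
    "spawn": None,
    "agent_default": {"backend": "k8s", "profile": "small"},
    "channel_policy": {"backend": "docker"},
})
```

Returning the winning source name as well makes it possible to log why a given backend was chosen.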

Alternatives beyond Kubernetes

This should not be Kubernetes-only. Possible backend types:

local process: current behavior; simplest and fastest for dev
Docker/Podman container: good single-host isolation without a cluster
Kubernetes Pod/Job: best for teams, quotas, autoscaling, node pools, GPU, namespace isolation
Nomad task: simpler ops for some infra teams; good resource scheduling without full Kubernetes
ECS/Fargate / Cloud Run Jobs / Azure Container Apps Jobs: managed container workers without operating a cluster
Firecracker/microVM: stronger isolation for untrusted code execution
remote SSH worker pool: pragmatic for existing build machines
CI runner backend: GitHub Actions/GitLab runners for bursty repo tasks, though latency and interactivity are weaker
devcontainer/Codespaces-like backend: good for repo-aware coding agents with prebuilt environments

The OpenClaw API should expose a common backend/profile abstraction so Kubernetes can be one implementation, not the only design.
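As a rough illustration of what a common abstraction could mean in code, the sketch below defines a small backend interface and an in-memory stand-in that implements it. All names here (ExecutionBackend, WorkerHandle, InMemoryBackend, and the three methods) are hypothetical, and the actual project may use a different language and shape.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkerHandle:
    backend: str    # backend name, e.g. "k8s" or "local"
    worker_id: str  # backend-specific ID, e.g. "namespace/podName"
    phase: str      # coarse lifecycle state, e.g. "running" or "closed"


class ExecutionBackend(ABC):
    """Hypothetical contract that every backend type would implement."""

    @abstractmethod
    def spawn(self, session_id: str, profile: Optional[dict]) -> WorkerHandle: ...

    @abstractmethod
    def status(self, worker_id: str) -> WorkerHandle: ...

    @abstractmethod
    def close(self, worker_id: str) -> None: ...


class InMemoryBackend(ExecutionBackend):
    """Stand-in backend for tests; it tracks handles but runs nothing."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._workers: dict[str, WorkerHandle] = {}

    def spawn(self, session_id: str, profile: Optional[dict] = None) -> WorkerHandle:
        handle = WorkerHandle(self.name, f"{self.name}/{session_id}", "running")
        self._workers[handle.worker_id] = handle
        return handle

    def status(self, worker_id: str) -> WorkerHandle:
        return self._workers[worker_id]

    def close(self, worker_id: str) -> None:
        self._workers[worker_id].phase = "closed"
```

With an interface like this, a Kubernetes, Docker, or SSH backend would differ only in how spawn, status, and close are carried out, while callers keep working with the same handle.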

Why this matters

This would make subagent/ACP execution more production-grade:

isolate heavy or risky agent work from the gateway
make resource usage explicit and enforceable
reduce orphan process problems by moving lifecycle into backend-managed workers
improve observability with backend-specific IDs, logs, events, and traces
enable team/shared deployments where many users spawn agents concurrently
provide a foundation for GPU/build/test/large-repo profiles

Relationship to existing issues

This is related to, but distinct from, sandboxing and ACP lifecycle issues:

#45841 discusses Sandboxing + ACP, but this proposal is broader than sandbox compatibility.
#68916 and #74684 show the need for better spawned-agent lifecycle cleanup and observability.
#68204 would be important for tracing parent session → spawned worker → backend resource.
#79560 suggests policy/rate limits are needed when spawning agents from non-interactive channels.

Acceptance criteria / possible first milestone

A first milestone could be Kubernetes-only but shaped as a generic backend abstraction:

config schema for execution backends and profiles
sessions_spawn can select backend/profile explicitly
Kubernetes backend creates a worker Pod from a profile
dry-run validation and RBAC/doctor checks exist
status exposes namespace/podName and readiness/failure reason
close/reset deletes the worker and sweeper handles orphans
minimal worker image supports /healthz, /turn, /cancel
targeted tests cover profile resolution, manifest generation, lifecycle, and cleanup
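The orphan-sweeping criterion can be tested without a cluster if the decision logic is kept separate from the backend calls. A minimal sketch, assuming a hypothetical WorkerRecord that carries the parent session ID and creation time, and treating a worker as an orphan when its session is gone or its TTL has expired:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerRecord:
    name: str          # backend-specific worker name, e.g. a Pod name
    session_id: str    # parent session that spawned the worker
    created_at: float  # creation time in seconds since the epoch


def find_orphans(workers: list[WorkerRecord], live_sessions: set[str],
                 now: float, ttl_seconds: float) -> list[str]:
    """Return names of workers that the sweeper should delete."""
    orphans = []
    for worker in workers:
        parent_gone = worker.session_id not in live_sessions
        expired = now - worker.created_at > ttl_seconds
        if parent_gone or expired:
            orphans.append(worker.name)
    return orphans


workers = [
    WorkerRecord("worker-a", "sess-1", created_at=1000.0),
    WorkerRecord("worker-b", "sess-2", created_at=1000.0),
    WorkerRecord("worker-c", "sess-1", created_at=100.0),
]
to_delete = find_orphans(workers, live_sessions={"sess-1"}, now=1500.0, ttl_seconds=600.0)
```

Passing the current time in as an argument keeps the function deterministic, which is what makes it straightforward to cover in targeted tests.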
