claude-code - 💡(How to fix) Fix [Bug] HTTP/2 connection pool missing SO_KEEPALIVE, causing hangs on CGNAT networks [1 comments, 2 participants]

adigourdi · 2026-04-12T17:54:23Z




GitHub stats

anthropics/claude-code#47059•Fetched 2026-04-13 05:42:33


Author

adigourdi

Participants

adigourdi

github-actions[bot]

Timeline (top)

labeled ×4, commented ×1, cross-referenced ×1

Claude Code's pooled HTTPS connections to api.anthropic.com do not have the SO_KEEPALIVE socket option enabled. On networks behind carrier-grade NAT (CGNAT), which is common on mobile data and tethered connections, this causes the following failure pattern:

1. The HTTP/2 connection pool opens several long-lived sockets to Anthropic's edge.
2. Because SO_KEEPALIVE is off, no TCP keepalive probes are sent during idle periods.
3. The carrier's NAT translation table silently evicts the idle flow (often well below the 2h4min minimum that RFC 5382 requires for the established-state timeout; in practice after a few minutes).
4. From the kernel's perspective, the sockets remain ESTABLISHED.
5. The next user message is written into one of these zombie sockets. The peer never sees it.
6. Linux TCP retransmits into the void. With the default tcp_retries2=15, this means up to ~15 minutes of hang before the socket dies.
7. The client's retry logic then walks through the rest of the pooled zombies serially, each costing another ~15 min (or ~15s with tuned tcp_retries2=6).

User-visible effect: the first message after any idle period (even on the same network — no switch required) hangs for tens of seconds to many minutes. Cancelling and resending typically succeeds immediately because a fresh connection gets established.

Bug Description

Claude Code pool connections lack SO_KEEPALIVE, causing silent hangs on CGNAT networks

Evidence

Captured with ss -tnoe 'dst :443' on Linux while Claude Code was running on a mobile tethered connection.

Other applications' sockets (Firefox, VS Code, gnome-terminal child processes) show a keepalive timer:

ESTAB 0 0 10.134.87.203:56958 150.171.109.83:443
  timer:(keepalive,6.145ms,0) ... cgroup:...app-gnome-code...
ESTAB 0 0 10.134.87.203:53086 34.107.243.93:443
  timer:(keepalive,3min18sec,0) ... cgroup:...app-gnome-firefox...
ESTAB 0 0 10.134.87.203:59916 31.13.83.51:443
  timer:(keepalive,56sec,0) ... cgroup:...app-gnome-firefox...

Claude Code's sockets to api.anthropic.com (160.79.104.10) show no timer field, confirming SO_KEEPALIVE is not set:

ESTAB 0 0 10.134.87.203:49994 160.79.104.10:443
  uid:1000 ino:2447964 sk:2003 cgroup:...vte-spawn-...  <->
ESTAB 0 0 10.134.87.203:60090 160.79.104.10:443
  uid:1000 ino:2452395 sk:2004 cgroup:...vte-spawn-...  <->

When a hang was actively reproduced, the stuck socket's send queue held ~96KB of request data with an active retransmit timer deep into the exponential-backoff schedule:

ESTAB 0 95895 10.134.87.203:53072 160.79.104.10:443
  timer:(on,1min2sec,8)
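
The comparison above can be automated. The following is a minimal sketch, assuming `ss -tnoe` output in the layout shown in this report (state, Recv-Q, Send-Q, local address, peer address, then optional fields, possibly wrapped onto an indented continuation line). The function name `classifySockets` and the returned field names are hypothetical and not part of any tool mentioned here.

```javascript
// Label each ESTAB row from `ss -tnoe` by its timer type ("keepalive", "on", or "none").
function classifySockets(ssOutput) {
  const rows = [];
  for (const line of ssOutput.split('\n')) {
    if (/^\s/.test(line) && rows.length > 0) {
      // Indented line: continuation of the previous socket row.
      rows[rows.length - 1] += ' ' + line.trim();
    } else if (line.trim() !== '') {
      rows.push(line.trim());
    }
  }
  return rows
    .filter((row) => row.startsWith('ESTAB'))
    .map((row) => {
      const fields = row.split(/\s+/);
      const timer = row.match(/timer:\((\w+),/);
      return {
        local: fields[3],
        peer: fields[4],
        sendQueue: Number(fields[2]), // bytes written but not yet acknowledged
        timer: timer ? timer[1] : 'none',
      };
    });
}

// Sample rows copied from the captures in this report.
const sample = [
  'ESTAB 0 0 10.134.87.203:53086 34.107.243.93:443',
  '  timer:(keepalive,3min18sec,0) ... cgroup:...app-gnome-firefox...',
  'ESTAB 0 0 10.134.87.203:49994 160.79.104.10:443',
  '  uid:1000 ino:2447964 sk:2003 cgroup:...vte-spawn-...  <->',
  'ESTAB 0 95895 10.134.87.203:53072 160.79.104.10:443',
  '  timer:(on,1min2sec,8)',
].join('\n');

console.log(classifySockets(sample));
```

A row with timer "none" is a socket that sends no keepalive probes while idle; a row with timer "on" and a non-zero send queue is a socket that is currently retransmitting.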

Reproduction

1. Use Claude Code on a network behind CGNAT (most mobile carriers / tethered hotspots).
2. Send a message so the pool warms up.
3. Leave the session idle long enough for the carrier NAT to evict flows (varies; often 2–5 minutes).
4. Send another message.

Expected: message sends promptly. Actual: message hangs until either TCP tcp_retries2 gives up (default: up to ~15 min) or the user cancels and retries.

Diagnostic confirmation that kernel tuning alone is insufficient

On the affected host:

net.ipv4.tcp_retries2 = 6              # lowered from default 15
net.ipv4.tcp_keepalive_time = 60       # lowered from default 7200
net.ipv4.tcp_keepalive_intvl = 10      # lowered from default 75
net.ipv4.tcp_keepalive_probes = 3      # lowered from default 9

- tcp_retries2=6 does help: zombie sockets now die in ~15s instead of ~15 min. This was confirmed via watch ss -tno: the retry counter climbs to 5–6, then the socket is reaped.
- The tcp_keepalive_* settings have no effect on Claude Code's sockets because they apply only to sockets that have SO_KEEPALIVE set, which the client does not do.

Because the pool holds multiple zombies, a single user message can still stall for N × 15s while the client serially retries through each dead connection.
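
To put rough numbers on these stalls, here is a small model of the retransmission give-up time. It is a sketch that follows the estimate described for tcp_retries2 in the Linux kernel's sysctl documentation, assuming the retransmission timeout (RTO) starts at the 200 ms minimum, doubles on each retry, and is capped at 120 s. Real RTO values depend on the measured round-trip time, so the observed figures quoted in this report (~15s and ~15 min) are the reporter's measurements and need not match the model exactly. The function names are hypothetical.

```javascript
const TCP_RTO_MAX_MS = 120000; // assumed RTO cap (120 s)

// Hypothetical give-up time for one dead socket, in milliseconds.
function retransmitTimeoutMs(retries2, rtoBaseMs = 200) {
  // Number of doublings before the RTO reaches its cap.
  const linearThresh = Math.floor(Math.log2(TCP_RTO_MAX_MS / rtoBaseMs));
  if (retries2 <= linearThresh) {
    return (2 ** (retries2 + 1) - 1) * rtoBaseMs;
  }
  return (2 ** (linearThresh + 1) - 1) * rtoBaseMs
    + (retries2 - linearThresh) * TCP_RTO_MAX_MS;
}

// Worst case when the client retries serially through every dead pooled socket.
function worstCaseStallMs(poolSize, retries2, rtoBaseMs = 200) {
  return poolSize * retransmitTimeoutMs(retries2, rtoBaseMs);
}

console.log(retransmitTimeoutMs(15) / 1000, 'seconds for tcp_retries2=15');
console.log(worstCaseStallMs(4, 15) / 60000, 'minutes for a pool of 4 (pool size is an assumption)');
```

The point of the model is the multiplication in `worstCaseStallMs`: lowering tcp_retries2 shrinks each term, but the stall still scales with the number of dead sockets in the pool.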

Proposed fix

Enable SO_KEEPALIVE on the HTTP/2 pool connections used by Claude Code's HTTP client (presumably undici / the Anthropic SDK's HTTP agent). Node.js exposes this via socket.setKeepAlive(true, initialDelayMs).
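
As an illustration only, the sketch below shows one way to get SO_KEEPALIVE onto the socket behind a Node.js HTTP/2 session, using the `createConnection` option of `http2.connect`. It assumes the client uses Node's built-in `http2` module, which this report does not confirm (the reporter guesses at undici or the SDK's agent), so the wiring in Claude Code may look different. `enableKeepAlive` and `connectWithKeepAlive` are hypothetical names.

```javascript
const http2 = require('node:http2');
const tls = require('node:tls');

// Enable SO_KEEPALIVE on a socket, waiting for the TCP connection if it is
// still being established. Returns the same socket for chaining.
function enableKeepAlive(socket, initialDelayMs = 30000) {
  if (socket.connecting) {
    socket.once('connect', () => socket.setKeepAlive(true, initialDelayMs));
  } else {
    socket.setKeepAlive(true, initialDelayMs);
  }
  return socket;
}

// Hypothetical wiring: create the TLS socket ourselves so keepalive can be
// switched on, then hand it to the HTTP/2 session.
function connectWithKeepAlive(authority, initialDelayMs = 30000) {
  return http2.connect(authority, {
    createConnection(url) {
      const socket = tls.connect({
        host: url.hostname,
        port: Number(url.port) || 443,
        servername: url.hostname,
        ALPNProtocols: ['h2'], // HTTP/2 must be negotiated via ALPN
      });
      return enableKeepAlive(socket, initialDelayMs);
    },
  });
}

// Usage (not executed here):
//   const session = connectWithKeepAlive('https://api.anthropic.com');
```

After a change like this, the session's socket should show a keepalive timer in the ss output, like the Firefox and VS Code sockets in the Evidence section.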

A reasonable default initial delay would be 30–60 seconds, which is short enough to beat most CGNAT idle timeouts while being light on traffic. The existing Linux sysctl defaults (tcp_keepalive_time=7200) are far too long for modern carrier environments and should be overridden at the application level.
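
The timing argument can be made concrete with a little arithmetic. In this sketch, the first probe goes out after the idle time, and a dead peer is declared roughly after idle time plus interval times probe count. The sysctl values are the ones quoted in this report; the 120-second NAT idle timeout is an assumption taken from the low end of the "2–5 minutes" estimate, and the function names are hypothetical.

```javascript
// Rough keepalive timeline for one idle socket, in seconds.
function keepaliveTimeline({ idleSec, intervalSec, probes }) {
  return {
    firstProbeSec: idleSec, // when the first probe refreshes the NAT mapping
    deadPeerDetectedSec: idleSec + intervalSec * probes, // when a dead peer is given up on
  };
}

// True if the first probe is sent before the NAT would evict the idle flow.
function refreshesNatMapping(firstProbeSec, natIdleTimeoutSec) {
  return firstProbeSec < natIdleTimeoutSec;
}

const linuxDefaults = keepaliveTimeline({ idleSec: 7200, intervalSec: 75, probes: 9 });
const tunedHost = keepaliveTimeline({ idleSec: 60, intervalSec: 10, probes: 3 });
const assumedNatTimeoutSec = 120; // assumption: low end of the 2–5 minute range

console.log(linuxDefaults, refreshesNatMapping(linuxDefaults.firstProbeSec, assumedNatTimeoutSec));
console.log(tunedHost, refreshesNatMapping(tunedHost.firstProbeSec, assumedNatTimeoutSec));
```

With the defaults, the first probe arrives long after a few-minute NAT timeout has already evicted the flow; with a 30–60 second initial delay, it arrives in time to keep the mapping alive.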

Alternatively or additionally: send HTTP/2 PING frames on idle connections at a similar cadence. PING frames have the advantage of being an application-layer health check, so they detect not just dead TCP but also broken proxies/load-balancers along the path.
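
A minimal sketch of the PING-based alternative, assuming the client holds a Node.js `ClientHttp2Session`; whether Claude Code's HTTP client exposes one is not confirmed by this report. `probeOnce`, `monitorSession`, and `connectWithMonitoring` are hypothetical helper names, and the 30-second cadence and 5-second ACK timeout are illustrative values.

```javascript
const http2 = require('node:http2');

// Send one HTTP/2 PING; call onDead if it errors or no ACK arrives in time.
function probeOnce(session, timeoutMs, onDead) {
  let settled = false;
  const timer = setTimeout(() => {
    if (!settled) {
      settled = true;
      onDead(new Error('HTTP/2 PING timed out'));
    }
  }, timeoutMs);
  session.ping((err) => {
    if (settled) return; // the timeout already fired
    settled = true;
    clearTimeout(timer);
    if (err) onDead(err);
  });
}

// Probe a session on a fixed cadence and destroy it when a probe fails,
// so the next request opens a fresh connection instead of reusing a zombie.
function monitorSession(session, intervalMs = 30000, timeoutMs = 5000) {
  const interval = setInterval(() => {
    probeOnce(session, timeoutMs, (err) => {
      clearInterval(interval);
      session.destroy(err);
    });
  }, intervalMs);
  interval.unref(); // health checks alone should not keep the process alive
  session.once('close', () => clearInterval(interval));
  return interval;
}

// Hypothetical wiring for a new session (not executed here).
function connectWithMonitoring(authority) {
  const session = http2.connect(authority);
  monitorSession(session);
  return session;
}
```

Besides detecting dead connections, each PING is real traffic on the flow, so a regular cadence also keeps the NAT mapping from going idle in the first place.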

Impact

This affects any Claude Code user on:

- Mobile data / tethered hotspots (very common when traveling).
- Residential ISPs that deploy CGNAT (increasingly common globally as IPv4 exhaustion progresses).
- Corporate networks with aggressive stateful firewalls.

For these users, Claude Code is currently unrel… Note: Content was truncated.

extent analysis

TL;DR

Enable SO_KEEPALIVE on the HTTP/2 pool connections used by Claude Code's HTTP client to prevent silent hangs on CGNAT networks.

Guidance

- Enable SO_KEEPALIVE on the HTTP/2 pool connections using socket.setKeepAlive(true, initialDelayMs) with a reasonable initial delay (e.g., 30–60 seconds).
- Consider sending HTTP/2 PING frames on idle connections as an alternative or additional solution to detect broken proxies/load-balancers.
- Verify the fix by checking the socket options using ss -tnoe 'dst :443' and confirming that the keepalive timer is set.
- Test the fix on a network behind CGNAT to ensure that the issue is resolved.

Example

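const net = require('node:net');
// Illustrative socket only; in Claude Code the socket would come from its HTTP client.
const socket = net.connect({ host: 'api.anthropic.com', port: 443 });
socket.setKeepAlive(true, 30000); // enable SO_KEEPALIVE; first probe after 30 s of idle time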

Notes

- The proposed fix requires changes to Claude Code's HTTP client implementation.
- The initial delay value may need to be adjusted based on the specific network environment and requirements.
- This fix is not relevant to every user, but it should resolve the issue for those affected by CGNAT networks.

Recommendation

Apply the workaround by enabling SO_KEEPALIVE on the HTTP/2 pool connections, as it is a targeted solution that addresses the root cause of the issue.
