Files
coder/cli
Ethan e91bec8574 fix(cli): close aibridge daemon before WebSocket shutdown wait (#25719)
> [!WARNING]
> The investigation and solution in this PR were done with
[Mux](https://mux.coder.com/). I've reviewed the investigation
methodology, evidence and solution, and it all appears sound.

## Summary

PR #25570 (`refactor: move aibridged out of enterprise to AGPL`, merged
2026-05-22) added an in-memory aibridge DRPC server in
`coderd/aibridged.go` that does `api.WebsocketWaitGroup.Add(1)` and only
releases `Done()` when its client session is closed. PR #25575 then
flipped `CODER_AI_GATEWAY_ENABLED` to default to `true`, so every
`cli.Server()` invocation now spins up that goroutine.

In `cli/server.go`, the only call to `aibridgeDaemon.Close()` was a
`defer` scheduled at function return. During graceful shutdown the code
first calls `coderAPICloser.Close()`, which waits on
`api.WebsocketWaitGroup`. That wait sits for the full 10s timeout in
`coderd/coderd.go` (`websocket shutdown timed out after 10 seconds`),
then returns, then the function unwinds, and only then does the deferred
`aibridgeDaemon.Close()` fire and let the goroutine call `Done()`.

The 10s tax was previously latent (aibridged was enterprise-only and
opt-in). After the two May 22 PRs it hit every `cli.Server()` test. On
Linux/macOS CI it just makes the suite slower; on the Depot Windows
runner, the ramdisk reservation leaves only ~17 GiB of headroom and the
~10s shutdown tails of multiple concurrent package binaries overlap into
an OOM, presenting as `test-go-pg (windows-2022)` jobs that die silently
at the ~600s watchdog with an empty `steps` array.

See Slack:
https://codercom.slack.com/archives/C05AE94121Z/p1779807717764189

## Fix

Close `aibridgeDaemon` explicitly during graceful shutdown, **before**
`coderAPICloser.Close()` waits on the WebSocket wait group. This matches
the existing ordered-shutdown pattern used for `tunnel` and
`notificationsManager`. The deferred `aibridgeDaemon.Close()` is
retained as a safety net for early-return paths, and is safe to
double-call because `aibridged.Server.Close()` is already idempotent via
`shutdownOnce` in `coderd/aibridged/aibridged.go`.

## Regression test

`TestServer_AIGatewayShutdownOrdering` boots a real `coder server` with
`--ai-gateway-enabled=true`, cancels its context, and asserts graceful
shutdown finishes in under 8s. With the fix the test runs in ~0.1s;
without the fix it fails deterministically at ~10.0s. The flag is passed
explicitly so the test continues to guard the ordering even if the
deployment default is ever flipped back.

## Evidence this fixes the OOM

On Linux the patched `cli` test package drops from 114 s back to its
pre-regression 30 s wall time at the same single-process peak RSS (~7.6
GiB), and the `websocket shutdown timed out after 10 seconds` log line
disappears from every server-test run. Since the Windows OOM is the sum
of multiple concurrent 10 s shutdown tails overlapping past the runner's
~17 GiB headroom, removing those tails returns the concurrent-RSS budget
to its pre-regression level. The Windows OOM was intermittent (a handful
of hits across many runs since May 22), so a single green `test-go-pg
(windows-2022)` job on this PR is not by itself proof; confirmation will
come from watching Windows runs on `main` over the next several days and
seeing the ~600 s silent-kill fingerprint stop recurring.

Relates to ENG-2771
2026-05-27 17:33:14 +10:00
..
2023-09-08 18:21:33 +00:00