mirror of
https://github.com/coder/coder.git
synced 2026-06-03 04:58:23 +00:00
e91bec8574
> [!WARNING] > The investigation and solution in this PR were done with [Mux](https://mux.coder.com/). I've reviewed the investigation methodology, evidence and solution, and it all appears sound. ## Summary PR #25570 (`refactor: move aibridged out of enterprise to AGPL`, merged 2026-05-22) added an in-memory aibridge DRPC server in `coderd/aibridged.go` that does `api.WebsocketWaitGroup.Add(1)` and only releases `Done()` when its client session is closed. PR #25575 then flipped `CODER_AI_GATEWAY_ENABLED` to default to `true`, so every `cli.Server()` invocation now spins up that goroutine. In `cli/server.go`, the only call to `aibridgeDaemon.Close()` was a `defer` scheduled at function return. During graceful shutdown the code first calls `coderAPICloser.Close()`, which waits on `api.WebsocketWaitGroup`. That wait sits for the full 10s timeout in `coderd/coderd.go` (`websocket shutdown timed out after 10 seconds`), then returns, then the function unwinds, and only then does the deferred `aibridgeDaemon.Close()` fire and let the goroutine call `Done()`. The 10s tax was previously latent (aibridged was enterprise-only and opt-in). After the two May 22 PRs it hit every `cli.Server()` test. On Linux/macOS CI it just makes the suite slower; on the Depot Windows runner, the ramdisk reservation leaves only ~17 GiB of headroom and the ~10s shutdown tails of multiple concurrent package binaries overlap into an OOM, presenting as `test-go-pg (windows-2022)` jobs that die silently at the ~600s watchdog with an empty `steps` array. See Slack: https://codercom.slack.com/archives/C05AE94121Z/p1779807717764189 ## Fix Close `aibridgeDaemon` explicitly during graceful shutdown, **before** `coderAPICloser.Close()` waits on the WebSocket wait group. This matches the existing ordered-shutdown pattern used for `tunnel` and `notificationsManager`. The deferred `aibridgeDaemon.Close()` is retained as a safety net for early-return paths, and is safe to double-call because `aibridged.Server.Close()` is already idempotent via `shutdownOnce` in `coderd/aibridged/aibridged.go`. ## Regression test `TestServer_AIGatewayShutdownOrdering` boots a real `coder server` with `--ai-gateway-enabled=true`, cancels its context, and asserts graceful shutdown finishes in under 8s. With the fix the test runs in ~0.1s; without the fix it fails deterministically at ~10.0s. The flag is passed explicitly so the test continues to guard the ordering even if the deployment default is ever flipped back. ## Evidence this fixes the OOM On Linux the patched `cli` test package drops from 114 s back to its pre-regression 30 s wall time at the same single-process peak RSS (~7.6 GiB), and the `websocket shutdown timed out after 10 seconds` log line disappears from every server-test run. Since the Windows OOM is the sum of multiple concurrent 10 s shutdown tails overlapping past the runner's ~17 GiB headroom, removing those tails returns the concurrent-RSS budget to its pre-regression level. The Windows OOM was intermittent (a handful of hits across many runs since May 22), so a single green `test-go-pg (windows-2022)` job on this PR is not by itself proof; confirmation will come from watching Windows runs on `main` over the next several days and seeing the ~600 s silent-kill fingerprint stop recurring. Relates to ENG-2771