Ethan e91bec8574 fix(cli): close aibridge daemon before WebSocket shutdown wait (#25719)
> [!WARNING]
> The investigation and solution in this PR were done with
[Mux](https://mux.coder.com/). I've reviewed the investigation
methodology, evidence and solution, and it all appears sound.

## Summary

PR #25570 (`refactor: move aibridged out of enterprise to AGPL`, merged
2026-05-22) added an in-memory aibridge DRPC server in
`coderd/aibridged.go` that does `api.WebsocketWaitGroup.Add(1)` and only
releases `Done()` when its client session is closed. PR #25575 then
flipped `CODER_AI_GATEWAY_ENABLED` to default to `true`, so every
`cli.Server()` invocation now spins up that goroutine.

In `cli/server.go`, the only call to `aibridgeDaemon.Close()` was a
`defer` scheduled at function return. During graceful shutdown the code
first calls `coderAPICloser.Close()`, which waits on
`api.WebsocketWaitGroup`. That wait sits for the full 10s timeout in
`coderd/coderd.go` (`websocket shutdown timed out after 10 seconds`),
then returns, then the function unwinds, and only then does the deferred
`aibridgeDaemon.Close()` fire and let the goroutine call `Done()`.

The 10s tax was previously latent (aibridged was enterprise-only and
opt-in). After the two May 22 PRs it hit every `cli.Server()` test. On
Linux/macOS CI it just makes the suite slower; on the Depot Windows
runner, the ramdisk reservation leaves only ~17 GiB of headroom and the
~10s shutdown tails of multiple concurrent package binaries overlap into
an OOM, presenting as `test-go-pg (windows-2022)` jobs that die silently
at the ~600s watchdog with an empty `steps` array.

See Slack:
https://codercom.slack.com/archives/C05AE94121Z/p1779807717764189

## Fix

Close `aibridgeDaemon` explicitly during graceful shutdown, **before**
`coderAPICloser.Close()` waits on the WebSocket wait group. This matches
the existing ordered-shutdown pattern used for `tunnel` and
`notificationsManager`. The deferred `aibridgeDaemon.Close()` is
retained as a safety net for early-return paths, and is safe to
double-call because `aibridged.Server.Close()` is already idempotent via
`shutdownOnce` in `coderd/aibridged/aibridged.go`.

## Regression test

`TestServer_AIGatewayShutdownOrdering` boots a real `coder server` with
`--ai-gateway-enabled=true`, cancels its context, and asserts graceful
shutdown finishes in under 8s. With the fix the test runs in ~0.1s;
without the fix it fails deterministically at ~10.0s. The flag is passed
explicitly so the test continues to guard the ordering even if the
deployment default is ever flipped back.

## Evidence this fixes the OOM

On Linux the patched `cli` test package drops from 114 s back to its
pre-regression 30 s wall time at the same single-process peak RSS (~7.6
GiB), and the `websocket shutdown timed out after 10 seconds` log line
disappears from every server-test run. Since the Windows OOM is the sum
of multiple concurrent 10 s shutdown tails overlapping past the runner's
~17 GiB headroom, removing those tails returns the concurrent-RSS budget
to its pre-regression level. The Windows OOM was intermittent (a handful
of hits across many runs since May 22), so a single green `test-go-pg
(windows-2022)` job on this PR is not by itself proof; confirmation will
come from watching Windows runs on `main` over the next several days and
seeing the ~600 s silent-kill fingerprint stop recurring.

Relates to ENG-2771
2026-05-27 17:33:14 +10:00
2022-04-04 11:55:06 -05:00

Coder Logo Light Coder Logo Dark

Self-Hosted Cloud Development Environments and AI Agents

Coder Banner Light Coder Banner Dark

Quickstart | Docs | Why Coder | Premium

discord release godoc Go Report Card OpenSSF Best Practices OpenSSF Scorecard license

Coder is a self-hosted platform for cloud development environments and AI coding agents. Workspaces are defined with Terraform, connected through a secure Wireguard® tunnel, and automatically shut down when not used. Coder Agents runs a native AI coding agent whose loop executes in the control plane on your infrastructure, with no API keys in workspaces.

  • Define cloud development environments in Terraform
    • EC2 VMs, Kubernetes Pods, Docker Containers, etc.
  • Automatically shutdown idle resources to save on costs
  • Onboard developers in seconds instead of days
  • Delegate coding work to AI agents on your infrastructure
    • Bring any model (Anthropic, OpenAI, Google, Bedrock, self-hosted)
    • No LLM credentials in workspaces, user identity on every action
    • Centralized model governance, cost tracking, and audit logging

Coder platform showing templates and a running workspace

Quickstart

The most convenient way to try Coder is to install it on your local machine and experiment with provisioning cloud development environments using Docker (works on Linux, macOS, and Windows).

# First, install Coder
curl -L https://coder.com/install.sh | sh

# Start the Coder server (caches data in ~/.cache/coder)
coder server

# Navigate to http://localhost:3000 to create your initial user,
# create a Docker template and provision a workspace

Install

The easiest way to install Coder is to use the install script for Linux and macOS. For Windows, use the latest ..._installer.exe file from GitHub Releases.

curl -L https://coder.com/install.sh | sh

You can run the install script with --dry-run to see the commands that will be used to install without executing them. Run the install script with --help for additional flags.

See install for additional methods.

Once installed, you can start a production deployment with a single command:

# Automatically sets up an external access URL on *.try.coder.app
coder server

# Requires a PostgreSQL instance (version 13 or higher) and external access URL
coder server --postgres-url <url> --access-url <url>

Use coder --help to get a list of flags and environment variables. See the install guides for a complete tutorial.

Documentation

Browse the documentation or visit a specific section below:

  • Workspaces: Workspaces contain the IDEs, dependencies, and configuration information needed for software development
  • Templates: Templates are written in Terraform and describe the infrastructure for workspaces
  • Coder Agents: Delegate coding work to AI agents running on your self-hosted infrastructure
  • Administration: Learn how to operate Coder
  • Premium: Learn about paid features built for large teams
  • IDEs: Connect your existing editor to a workspace

Support

Feel free to open an issue if you have questions, run into bugs, or have a feature request.

Join our Discord to provide feedback on in-progress features and chat with the community using Coder!

Integrations

New integrations are always in progress. Open an issue to request one. Contributions are welcome in any official or community repository.

Official

Community

Contributing

New contributors are always welcome. If you are new to the Coder codebase, see the contribution guide to get started.

Hiring

Apply on the careers page if you are interested in joining the team.

Languages
Go 74.4%
TypeScript 23.5%
Shell 0.8%
HCL 0.4%
PLpgSQL 0.3%
Other 0.2%