Kyle Carberry 30d534b36b fix(chatd): fix relay race conditions, extract enterprise relay logic, move pubsub to OSS (#22589)
## Summary

Fixes a bug where interrupting a streaming chat and sending a new
message
left the relay connected to the wrong replica. Expanded into a broader
refactor that cleanly separates concerns:

- **OSS** owns pubsub subscription, message catch-up, queue updates,
  status forwarding, and local parts merging.
- **Enterprise** (`enterprise/coderd/chatd`) only manages relay dialing,
  reconnection, and stale-dial discarding for cross-replica streaming.

## Architecture

### OSS `coderd/chatd/chatd.go`

`Subscribe()` builds the initial snapshot then runs a single merge
goroutine that handles:

- Pubsub subscription for durable events (status, messages, queue,
errors)
- Message catch-up via `AfterMessageID`
- Local `message_part` forwarding
- Relay events from enterprise (when `SubscribeFn` is set)
- Sends `StatusNotification` to enterprise so it can manage relay
lifecycle

Key types:

- `SubscribeFn` — enterprise hook, returns relay-only events channel
- `SubscribeFnParams` — `ChatID`, `Chat`, `WorkerID`,
`StatusNotifications`, `RequestHeader`, `DB`, `Logger`
- `StatusNotification` — `Status` + `WorkerID`, sent to enterprise on
pubsub status changes

### Enterprise `enterprise/coderd/chatd/chatd.go`

`NewMultiReplicaSubscribeFn(cfg MultiReplicaSubscribeConfig)` returns a
`SubscribeFn` that:

- Opens an initial synchronous relay if the chat is running on a remote
worker
- Reads `StatusNotifications` from OSS to open/close relay connections
- Handles async dial, reconnect timers, stale-dial discarding
- Returns only relay `message_part` events

## Bug fixes

### Original bug: stale relay dial after interrupt

`openRelayAsync` goroutines used `mergedCtx` (subscription-level), not a
per-dial context. `closeRelay()` could not cancel in-flight dials. When
the user interrupts and a new replica picks up the chat, the old dial
goroutine could complete after the new one and deliver a stale
`relayResult`.

**Fix**: per-dial `dialCtx`/`dialCancel`, `expectedWorkerID` tracking,
`workerID` on `relayResult`. `closeRelay()` cancels the dial context and
drains `relayReadyCh`. Merge loop rejects mismatched worker IDs.

### Additional fixes

- `statusNotifications` send-on-closed-channel race — goroutine now owns
  `close()` via defer
- Enterprise spin-loop on `StatusNotifications` close — two-value
receive
  with nil-out
- `hasPubsub` set from `p.pubsub != nil` instead of subscription success
  — now tracks actual subscription result
- `lastMessageID` not initialized from `afterMessageID` — caused
  duplicate messages on catch-up
- `wrappedParts` goroutine leaked remote connection on `dialCtx` cancel
- `closeRelay()` did not drain `relayReadyCh`
- `setChatWaiting` race with `SendMessage(Interrupt)` — wrapped in
`InTx`
- `processChat` post-TX side effects fired when chat was taken by
another
  worker — added `errChatTakenByOtherWorker` sentinel
- Cancel closure data race on `reconnectTimer`
- Bare blocking send on pubsub error path
- `localParts` hot-spin after channel close
- No-pubsub branch dropped relay events and initial snapshot
- Failed relay dial caused permanent stall (no reconnect retry)
- DB error during reconnect timer caused permanent stall
- `time.NewTimer` replaced with `quartz.Clock` for testable timing

## Tests

9 enterprise tests covering:

- Relay reconnect on drop (mock clock)
- Async dial does not block merge loop
- Relay snapshot delivery
- Stale dial discarded after interrupt
- Cancel during in-flight dial
- Running-to-running worker switch
- Failed dial retries (mock clock)
- Local worker closes relay
- Multiple consecutive reconnects (mock clock)

All pass with `-race`.
2026-03-04 18:42:28 -05:00
2022-04-04 11:55:06 -05:00

Coder Logo Light Coder Logo Dark

Self-Hosted Cloud Development Environments

Coder Banner Light Coder Banner Dark

Quickstart | Docs | Why Coder | Premium

discord release godoc Go Report Card OpenSSF Best Practices OpenSSF Scorecard license

Coder enables organizations to set up development environments in their public or private cloud infrastructure. Cloud development environments are defined with Terraform, connected through a secure high-speed Wireguard® tunnel, and automatically shut down when not used to save on costs. Coder gives engineering teams the flexibility to use the cloud for workloads most beneficial to them.

  • Define cloud development environments in Terraform
    • EC2 VMs, Kubernetes Pods, Docker Containers, etc.
  • Automatically shutdown idle resources to save on costs
  • Onboard developers in seconds instead of days

Coder Hero Image

Quickstart

The most convenient way to try Coder is to install it on your local machine and experiment with provisioning cloud development environments using Docker (works on Linux, macOS, and Windows).

# First, install Coder
curl -L https://coder.com/install.sh | sh

# Start the Coder server (caches data in ~/.cache/coder)
coder server

# Navigate to http://localhost:3000 to create your initial user,
# create a Docker template and provision a workspace

Install

The easiest way to install Coder is to use our install script for Linux and macOS. For Windows, use the latest ..._installer.exe file from GitHub Releases.

curl -L https://coder.com/install.sh | sh

You can run the install script with --dry-run to see the commands that will be used to install without executing them. Run the install script with --help for additional flags.

See install for additional methods.

Once installed, you can start a production deployment with a single command:

# Automatically sets up an external access URL on *.try.coder.app
coder server

# Requires a PostgreSQL instance (version 13 or higher) and external access URL
coder server --postgres-url <url> --access-url <url>

Use coder --help to get a list of flags and environment variables. Use our install guides for a complete walkthrough.

Documentation

Browse our docs here or visit a specific section below:

  • Templates: Templates are written in Terraform and describe the infrastructure for workspaces
  • Workspaces: Workspaces contain the IDEs, dependencies, and configuration information needed for software development
  • IDEs: Connect your existing editor to a workspace
  • Administration: Learn how to operate Coder
  • Premium: Learn about our paid features built for large teams

Support

Feel free to open an issue if you have questions, run into bugs, or have a feature request.

Join our Discord to provide feedback on in-progress features and chat with the community using Coder!

Integrations

We are always working on new integrations. Please feel free to open an issue and ask for an integration. Contributions are welcome in any official or community repositories.

Official

Community

Contributing

We are always happy to see new contributors to Coder. If you are new to the Coder codebase, we have a guide on how to get started. We'd love to see your contributions!

Hiring

Apply here if you're interested in joining our team.

Languages
Go 74.4%
TypeScript 23.5%
Shell 0.8%
HCL 0.4%
PLpgSQL 0.3%
Other 0.2%