fix(chatd): fix relay race conditions, extract enterprise relay logic, move pubsub to OSS (#22589)

## Summary

Fixes a bug where interrupting a streaming chat and sending a new
message
left the relay connected to the wrong replica. Expanded into a broader
refactor that cleanly separates concerns:

- **OSS** owns pubsub subscription, message catch-up, queue updates,
  status forwarding, and local parts merging.
- **Enterprise** (`enterprise/coderd/chatd`) only manages relay dialing,
  reconnection, and stale-dial discarding for cross-replica streaming.

## Architecture

### OSS `coderd/chatd/chatd.go`

`Subscribe()` builds the initial snapshot then runs a single merge
goroutine that handles:

- Pubsub subscription for durable events (status, messages, queue,
errors)
- Message catch-up via `AfterMessageID`
- Local `message_part` forwarding
- Relay events from enterprise (when `SubscribeFn` is set)
- Sends `StatusNotification` to enterprise so it can manage relay
lifecycle

Key types:

- `SubscribeFn` — enterprise hook, returns relay-only events channel
- `SubscribeFnParams` — `ChatID`, `Chat`, `WorkerID`,
`StatusNotifications`, `RequestHeader`, `DB`, `Logger`
- `StatusNotification` — `Status` + `WorkerID`, sent to enterprise on
pubsub status changes

### Enterprise `enterprise/coderd/chatd/chatd.go`

`NewMultiReplicaSubscribeFn(cfg MultiReplicaSubscribeConfig)` returns a
`SubscribeFn` that:

- Opens an initial synchronous relay if the chat is running on a remote
worker
- Reads `StatusNotifications` from OSS to open/close relay connections
- Handles async dial, reconnect timers, stale-dial discarding
- Returns only relay `message_part` events

## Bug fixes

### Original bug: stale relay dial after interrupt

`openRelayAsync` goroutines used `mergedCtx` (subscription-level), not a
per-dial context. `closeRelay()` could not cancel in-flight dials. When
the user interrupts and a new replica picks up the chat, the old dial
goroutine could complete after the new one and deliver a stale
`relayResult`.

**Fix**: per-dial `dialCtx`/`dialCancel`, `expectedWorkerID` tracking,
`workerID` on `relayResult`. `closeRelay()` cancels the dial context and
drains `relayReadyCh`. Merge loop rejects mismatched worker IDs.

### Additional fixes

- `statusNotifications` send-on-closed-channel race — goroutine now owns
  `close()` via defer
- Enterprise spin-loop on `StatusNotifications` close — two-value
receive
  with nil-out
- `hasPubsub` set from `p.pubsub != nil` instead of subscription success
  — now tracks actual subscription result
- `lastMessageID` not initialized from `afterMessageID` — caused
  duplicate messages on catch-up
- `wrappedParts` goroutine leaked remote connection on `dialCtx` cancel
- `closeRelay()` did not drain `relayReadyCh`
- `setChatWaiting` race with `SendMessage(Interrupt)` — wrapped in
`InTx`
- `processChat` post-TX side effects fired when chat was taken by
another
  worker — added `errChatTakenByOtherWorker` sentinel
- Cancel closure data race on `reconnectTimer`
- Bare blocking send on pubsub error path
- `localParts` hot-spin after channel close
- No-pubsub branch dropped relay events and initial snapshot
- Failed relay dial caused permanent stall (no reconnect retry)
- DB error during reconnect timer caused permanent stall
- `time.NewTimer` replaced with `quartz.Clock` for testable timing

## Tests

9 enterprise tests covering:

- Relay reconnect on drop (mock clock)
- Async dial does not block merge loop
- Relay snapshot delivery
- Stale dial discarded after interrupt
- Cancel during in-flight dial
- Running-to-running worker switch
- Failed dial retries (mock clock)
- Local worker closes relay
- Multiple consecutive reconnects (mock clock)

All pass with `-race`.
This commit is contained in:
Kyle Carberry
2026-03-04 18:42:28 -05:00
committed by GitHub
parent 0ccfc4da06
commit 30d534b36b
11 changed files with 2049 additions and 838 deletions
+18 -29
View File
@@ -45,6 +45,7 @@ import (
agplusage "github.com/coder/coder/v2/coderd/usage"
"github.com/coder/coder/v2/coderd/wsbuilder"
"github.com/coder/coder/v2/codersdk"
entchatd "github.com/coder/coder/v2/enterprise/coderd/chatd"
"github.com/coder/coder/v2/enterprise/coderd/connectionlog"
"github.com/coder/coder/v2/enterprise/coderd/dbauthz"
"github.com/coder/coder/v2/enterprise/coderd/enidpsync"
@@ -191,8 +192,9 @@ func New(ctx context.Context, options *Options) (_ *API, err error) {
// This must happen before coderd initialization!
options.PostAuthAdditionalHeadersFunc = api.writeEntitlementWarningsHeader
// Wire up enterprise chat relay for cross-replica message_part streaming.
// Must be set before coderd.New so the chat processor gets it.
// Wire up enterprise chat subscription with cross-replica relay
// and pubsub coordination. Must be set before coderd.New so the
// chat processor receives it.
replicaHTTPClient := replicaRelayHTTPClient(options.HTTPClient, meshTLSConfig)
if replicaHTTPClient == nil {
replicaHTTPClient = options.Options.HTTPClient
@@ -200,33 +202,20 @@ func New(ctx context.Context, options *Options) (_ *API, err error) {
if replicaHTTPClient == nil {
replicaHTTPClient = http.DefaultClient
}
// Use a closure that captures api by reference so it can access api.AGPL.ID
// after coderd.New is called. The provider is only invoked when Subscribe
// is called, which happens after initialization, so api.AGPL will be set.
options.Options.ChatRemotePartsProvider = func(
ctx context.Context,
chatID uuid.UUID,
workerID uuid.UUID,
requestHeader http.Header,
) (
[]codersdk.ChatStreamEvent,
<-chan codersdk.ChatStreamEvent,
func(),
error,
) {
// Get the replica ID from the API (will be set after coderd.New)
replicaID := api.AGPL.ID
if replicaID == uuid.Nil {
// Fallback if somehow called before initialization
replicaID = uuid.New()
}
provider := newRemotePartsProvider(
resolveReplicaAddress,
replicaHTTPClient,
replicaID,
)
return provider(ctx, chatID, workerID, requestHeader)
}
// Use a closure that captures api by reference so it can access
// api.AGPL.ID after coderd.New is called. The SubscribeFn is
// only invoked from Subscribe, which happens after init.
options.Options.ChatSubscribeFn = entchatd.NewMultiReplicaSubscribeFn(entchatd.MultiReplicaSubscribeConfig{
ResolveReplicaAddress: resolveReplicaAddress,
ReplicaHTTPClient: replicaHTTPClient,
ReplicaIDFn: func() uuid.UUID {
id := api.AGPL.ID
if id == uuid.Nil {
return uuid.New()
}
return id
},
})
api.AGPL = coderd.New(options.Options)
defer func() {