mirror of
https://github.com/coder/coder.git
synced 2026-06-02 20:48:20 +00:00
fix(chatd): fix relay race conditions, extract enterprise relay logic, move pubsub to OSS (#22589)
## Summary Fixes a bug where interrupting a streaming chat and sending a new message left the relay connected to the wrong replica. Expanded into a broader refactor that cleanly separates concerns: - **OSS** owns pubsub subscription, message catch-up, queue updates, status forwarding, and local parts merging. - **Enterprise** (`enterprise/coderd/chatd`) only manages relay dialing, reconnection, and stale-dial discarding for cross-replica streaming. ## Architecture ### OSS `coderd/chatd/chatd.go` `Subscribe()` builds the initial snapshot then runs a single merge goroutine that handles: - Pubsub subscription for durable events (status, messages, queue, errors) - Message catch-up via `AfterMessageID` - Local `message_part` forwarding - Relay events from enterprise (when `SubscribeFn` is set) - Sends `StatusNotification` to enterprise so it can manage relay lifecycle Key types: - `SubscribeFn` — enterprise hook, returns relay-only events channel - `SubscribeFnParams` — `ChatID`, `Chat`, `WorkerID`, `StatusNotifications`, `RequestHeader`, `DB`, `Logger` - `StatusNotification` — `Status` + `WorkerID`, sent to enterprise on pubsub status changes ### Enterprise `enterprise/coderd/chatd/chatd.go` `NewMultiReplicaSubscribeFn(cfg MultiReplicaSubscribeConfig)` returns a `SubscribeFn` that: - Opens an initial synchronous relay if the chat is running on a remote worker - Reads `StatusNotifications` from OSS to open/close relay connections - Handles async dial, reconnect timers, stale-dial discarding - Returns only relay `message_part` events ## Bug fixes ### Original bug: stale relay dial after interrupt `openRelayAsync` goroutines used `mergedCtx` (subscription-level), not a per-dial context. `closeRelay()` could not cancel in-flight dials. When the user interrupts and a new replica picks up the chat, the old dial goroutine could complete after the new one and deliver a stale `relayResult`. **Fix**: per-dial `dialCtx`/`dialCancel`, `expectedWorkerID` tracking, `workerID` on `relayResult`. `closeRelay()` cancels the dial context and drains `relayReadyCh`. Merge loop rejects mismatched worker IDs. ### Additional fixes - `statusNotifications` send-on-closed-channel race — goroutine now owns `close()` via defer - Enterprise spin-loop on `StatusNotifications` close — two-value receive with nil-out - `hasPubsub` set from `p.pubsub != nil` instead of subscription success — now tracks actual subscription result - `lastMessageID` not initialized from `afterMessageID` — caused duplicate messages on catch-up - `wrappedParts` goroutine leaked remote connection on `dialCtx` cancel - `closeRelay()` did not drain `relayReadyCh` - `setChatWaiting` race with `SendMessage(Interrupt)` — wrapped in `InTx` - `processChat` post-TX side effects fired when chat was taken by another worker — added `errChatTakenByOtherWorker` sentinel - Cancel closure data race on `reconnectTimer` - Bare blocking send on pubsub error path - `localParts` hot-spin after channel close - No-pubsub branch dropped relay events and initial snapshot - Failed relay dial caused permanent stall (no reconnect retry) - DB error during reconnect timer caused permanent stall - `time.NewTimer` replaced with `quartz.Clock` for testable timing ## Tests 9 enterprise tests covering: - Relay reconnect on drop (mock clock) - Async dial does not block merge loop - Relay snapshot delivery - Stale dial discarded after interrupt - Cancel during in-flight dial - Running-to-running worker switch - Failed dial retries (mock clock) - Local worker closes relay - Multiple consecutive reconnects (mock clock) All pass with `-race`.
This commit is contained in:
+18
-29
@@ -45,6 +45,7 @@ import (
|
||||
agplusage "github.com/coder/coder/v2/coderd/usage"
|
||||
"github.com/coder/coder/v2/coderd/wsbuilder"
|
||||
"github.com/coder/coder/v2/codersdk"
|
||||
entchatd "github.com/coder/coder/v2/enterprise/coderd/chatd"
|
||||
"github.com/coder/coder/v2/enterprise/coderd/connectionlog"
|
||||
"github.com/coder/coder/v2/enterprise/coderd/dbauthz"
|
||||
"github.com/coder/coder/v2/enterprise/coderd/enidpsync"
|
||||
@@ -191,8 +192,9 @@ func New(ctx context.Context, options *Options) (_ *API, err error) {
|
||||
// This must happen before coderd initialization!
|
||||
options.PostAuthAdditionalHeadersFunc = api.writeEntitlementWarningsHeader
|
||||
|
||||
// Wire up enterprise chat relay for cross-replica message_part streaming.
|
||||
// Must be set before coderd.New so the chat processor gets it.
|
||||
// Wire up enterprise chat subscription with cross-replica relay
|
||||
// and pubsub coordination. Must be set before coderd.New so the
|
||||
// chat processor receives it.
|
||||
replicaHTTPClient := replicaRelayHTTPClient(options.HTTPClient, meshTLSConfig)
|
||||
if replicaHTTPClient == nil {
|
||||
replicaHTTPClient = options.Options.HTTPClient
|
||||
@@ -200,33 +202,20 @@ func New(ctx context.Context, options *Options) (_ *API, err error) {
|
||||
if replicaHTTPClient == nil {
|
||||
replicaHTTPClient = http.DefaultClient
|
||||
}
|
||||
// Use a closure that captures api by reference so it can access api.AGPL.ID
|
||||
// after coderd.New is called. The provider is only invoked when Subscribe
|
||||
// is called, which happens after initialization, so api.AGPL will be set.
|
||||
options.Options.ChatRemotePartsProvider = func(
|
||||
ctx context.Context,
|
||||
chatID uuid.UUID,
|
||||
workerID uuid.UUID,
|
||||
requestHeader http.Header,
|
||||
) (
|
||||
[]codersdk.ChatStreamEvent,
|
||||
<-chan codersdk.ChatStreamEvent,
|
||||
func(),
|
||||
error,
|
||||
) {
|
||||
// Get the replica ID from the API (will be set after coderd.New)
|
||||
replicaID := api.AGPL.ID
|
||||
if replicaID == uuid.Nil {
|
||||
// Fallback if somehow called before initialization
|
||||
replicaID = uuid.New()
|
||||
}
|
||||
provider := newRemotePartsProvider(
|
||||
resolveReplicaAddress,
|
||||
replicaHTTPClient,
|
||||
replicaID,
|
||||
)
|
||||
return provider(ctx, chatID, workerID, requestHeader)
|
||||
}
|
||||
// Use a closure that captures api by reference so it can access
|
||||
// api.AGPL.ID after coderd.New is called. The SubscribeFn is
|
||||
// only invoked from Subscribe, which happens after init.
|
||||
options.Options.ChatSubscribeFn = entchatd.NewMultiReplicaSubscribeFn(entchatd.MultiReplicaSubscribeConfig{
|
||||
ResolveReplicaAddress: resolveReplicaAddress,
|
||||
ReplicaHTTPClient: replicaHTTPClient,
|
||||
ReplicaIDFn: func() uuid.UUID {
|
||||
id := api.AGPL.ID
|
||||
if id == uuid.Nil {
|
||||
return uuid.New()
|
||||
}
|
||||
return id
|
||||
},
|
||||
})
|
||||
|
||||
api.AGPL = coderd.New(options.Options)
|
||||
defer func() {
|
||||
|
||||
Reference in New Issue
Block a user