mirror of
https://github.com/coder/coder.git
synced 2026-06-03 04:58:23 +00:00
b7a7683ac0
## Problem Subscribers connecting to a different replica than the one running the chat see full messages appear but no streaming partials (`message_part` events). The relay mechanism that forwards ephemeral parts across replicas had several bugs. ## Root Causes 1. **`openRelay()` blocked the event loop** — The WebSocket dial (TCP + TLS + HTTP upgrade) to the worker replica ran synchronously inside the select loop. While dialing, no events could be processed, channels filled up, and parts were silently dropped. 2. **Relay drops were permanent** — When the relay WebSocket closed mid-stream, `relayParts` was set to nil and never reopened. No status notification would re-trigger it since the chat was still running on the same worker. 3. **`drainInitial` snapshot race** — The `default` case in the initial drain loop caused the snapshot to be empty if the remote hadn't flushed data yet (common immediately after WebSocket connect). 4. **Duplicate event delivery** — The `preloaded` slice caused snapshot events to be sent both in the return value and re-sent through the channel goroutine. ## Fixes ### `coderd/chatd/chatd.go` (Subscribe method) - **Async relay dial**: `openRelayAsync()` spawns a goroutine to dial the remote replica. The result (channel + cancel func) is delivered on a `relayReadyCh` channel that the select loop reads without blocking. - **Relay reconnection**: When the relay channel closes, a 500ms timer fires. The handler re-checks chat status from the DB and reopens the relay if the chat is still running on a remote worker. - **Snapshot parts via channel**: Relay snapshot + live parts are wrapped into a single channel so they flow through the same path, avoiding races with the select loop. ### `enterprise/coderd/chats.go` (newRemotePartsProvider) - **Timer-based drain**: Replaced `default` with a 1-second timer. After the first event, `Reset(0)` switches to non-blocking drain for remaining buffered events. - **Remove preloaded duplication**: The goroutine now only forwards new events; snapshot events are returned to the caller directly. ## Testing All existing tests pass: - `TestInterruptChatBroadcastsStatusAcrossInstances` - `TestSubscribeSnapshotIncludesStatusEvent` - `TestSubscribeNoPubsubNoDuplicateMessageParts` - `TestSubscribeAfterMessageID` - `TestChatStreamRelay/RelayMessagePartsAcrossReplicas`