mirror of
https://github.com/coder/coder.git
synced 2026-06-04 13:38:21 +00:00
e94de0bdabcaeed735d468bb6d279d06f8b8b2d1
6 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
bdbcd3428b |
feat(coderd/chatd): unify chat storage on SDK parts and fix file-reference rendering (#22958)
File-reference parts in user messages were flattened to `TextContent` at write time because fantasy has no file-reference content type. The frontend never saw them as structured parts.

This moves all write paths (user, assistant, tool) from fantasy envelope format to `codersdk.ChatMessagePart`. The streaming layer (`chatloop`) is untouched, the conversion happens at the serialization boundary in `persistStep`.

Old rows are still readable. `ParseContent` uses a structural heuristic (`isFantasyEnvelopeFormat`) to distinguish legacy envelopes from SDK parts. We chose this over try/fallback because fantasy envelopes partially unmarshal into `ChatMessagePart` (the `type` field matches) while silently losing content. A guard test enforces that no SDK part can produce the envelope shape.

This is forward-only: new rows are unreadable by old code. Chat is behind a feature flag so rollback risk is contained.

Also adds a typed `ChatMessageRole` to replace raw strings and `fantasy.MessageRole*` casts at the persistence boundary. The type covers `ChatMessage.Role`, `ChatStreamMessagePart.Role`, the `PublishMessagePart` callback chain, and all DB write sites. `fantasy.MessageRole*` remains only where we build `fantasy.Message` structs for LLM dispatch.

Separately, `ProviderMetadata` was leaking to SSE clients via `publishMessagePart`. `StripInternal` now runs on both the SSE and REST paths, covering this.

Other cleanup:

- Old `db2sdk.contentBlockToPart` silently dropped metadata on text/reasoning/tool-call content. New code preserves it.
- `providerMetadataToOptions` now logs warnings instead of silently returning nil.
- `db2sdk` shrinks from ~250 lines of parallel conversion to ~15 lines delegating to `chatprompt.ParseContent()`, removing the `fantasy` import entirely.

Refs #22821 |
||
|
|
94a2e440a8 |
fix(chatd): extract session token from cookie for relay header (#22649)
## Problem

When a browser connects to the chat stream via WebSocket, it authenticates using cookies only, because the native WebSocket API cannot set custom headers like `Coder-Session-Token`. The relay between replicas copies the original request's `Cookie` header but did **not** set the `Coder-Session-Token` header as a fallback.

This causes a **401 on the worker replica** when `EnableHostPrefix` is enabled, because the `HTTPCookies.Middleware` strips bare `coder_session_token` cookies (expecting the `__Host-` prefix). Without a `Coder-Session-Token` header fallback, `apiKeyMiddleware` finds no valid credentials.

### Root Cause

The data flow:

1. Browser → subscriber replica: `Cookie: __Host-coder_session_token=xxx` (browser sends prefixed cookie)
2. Subscriber's `HTTPCookies.Middleware` normalizes: `Cookie: coder_session_token=xxx` (strips prefix)
3. `relayHeaders()` copies `Cookie: coder_session_token=xxx` to relay request
4. Worker replica's `HTTPCookies.Middleware` sees bare `coder_session_token` → **strips it** (expects `__Host-` prefix)
5. `apiKeyMiddleware` → `APITokenFromRequest`: no cookie, no header → **401**

## Fix

Modified `relayHeaders()` to extract the session token value from the `Cookie` header and set it as the `Coder-Session-Token` header when no explicit session token header is already present. The header is never stripped by middleware, so the worker replica can always authenticate.

## Testing

- **`TestRelayHeaders`**: Unit tests for the updated `relayHeaders()` function covering all scenarios (cookie-only, header+cookie, no auth, nil source)
- **`TestExtractSessionTokenFromCookieHeader`**: Unit tests for the helper function
- **`TestChatStreamRelay/RelayCookieOnlyAuth`**: Integration test with plain HTTP, cookie-only WebSocket auth
- **`TestChatStreamRelay/RelayCookieOnlyAuthWithHostPrefix`**: Integration test with `EnableHostPrefix=true`, confirming the 401 is fixed
- **`cookieOnlySessionTokenProvider`**: Test helper that simulates browser WebSocket behavior (sets Cookie header only on WebSocket dials, no custom headers)

## Files Changed

- `enterprise/coderd/chatd/chatd.go`: `relayHeaders()` fix + `extractSessionTokenFromCookieHeader()` helper
- `enterprise/coderd/chatd/relay_headers_internal_test.go`: unit tests (new file)
- `enterprise/coderd/chats_test.go`: integration tests + test helper type |
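
The header fallback described in this commit could look roughly like the sketch below. It is not the code from `enterprise/coderd/chatd/chatd.go`: the function names `extractSessionToken` and `relayHeadersSketch` are hypothetical, and the sketch assumes only the cookie name `coder_session_token` and the header name `Coder-Session-Token` quoted in the message.

```go
package main

import "net/http"

const (
	sessionTokenCookie = "coder_session_token" // cookie name quoted in the commit message
	sessionTokenHeader = "Coder-Session-Token" // header name quoted in the commit message
)

// extractSessionToken (hypothetical name) returns the session token value
// found in a raw Cookie header, or "" when the cookie is absent.
func extractSessionToken(cookieHeader string) string {
	if cookieHeader == "" {
		return ""
	}
	// Reuse net/http's cookie parser by attaching the header to a throwaway request.
	req := http.Request{Header: http.Header{"Cookie": []string{cookieHeader}}}
	c, err := req.Cookie(sessionTokenCookie)
	if err != nil {
		return ""
	}
	return c.Value
}

// relayHeadersSketch (hypothetical name) copies the auth-related headers for a
// relay request and adds the token header as a fallback when only a cookie is present.
func relayHeadersSketch(src http.Header) http.Header {
	dst := http.Header{}
	if src == nil {
		return dst
	}
	if cookie := src.Get("Cookie"); cookie != "" {
		dst.Set("Cookie", cookie)
	}
	if tok := src.Get(sessionTokenHeader); tok != "" {
		// An explicit header always wins over the cookie value.
		dst.Set(sessionTokenHeader, tok)
	} else if tok := extractSessionToken(src.Get("Cookie")); tok != "" {
		dst.Set(sessionTokenHeader, tok)
	}
	return dst
}
```

With a source header containing only `Cookie: coder_session_token=abc123`, the result carries both the original cookie and `Coder-Session-Token: abc123`, so the receiving side still has a credential if its cookie middleware drops the bare cookie.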
||
|
|
63b6868113 |
fix(codersdk): propagate HTTPClient to websocket.Dial for TLS relay (#22642)
## Problem

In multi-replica Coder deployments, the chat relay WebSocket between replicas fails with HTTP 401 (or TLS handshake errors). The subscriber replica cannot relay `message_part` events from the worker replica.

**Root cause:** `codersdk.Client.Dial()` does not pass `c.HTTPClient` to `websocket.DialOptions.HTTPClient`. The websocket library (`github.com/coder/websocket`) falls back to `http.DefaultClient`, which lacks the mesh TLS configuration needed for inter-replica communication.

The relay code in `enterprise/coderd/chatd/chatd.go` correctly sets `sdkClient.HTTPClient = cfg.ReplicaHTTPClient` (which has mesh TLS certs), but that client was never used for the actual WebSocket handshake.

## Fix

One-line fix in `codersdk/client.go`: propagate `c.HTTPClient` to `opts.HTTPClient` when the caller hasn't already set one.

## Test

Added `TestChatStreamRelay/RelayWithTLSAndCookieAuth` which:

- Sets up two replicas with TLS certificates (simulating mesh TLS in production)
- Authenticates via cookies (simulating browser WebSocket behavior)
- Verifies `message_part` events relay across replicas over TLS

This test times out without the fix because the WebSocket handshake fails with `x509: certificate signed by unknown authority` (`http.DefaultClient` rejects self-signed certs).

## Related

Follow-up to #22635 which fixed the `redirectToAccessURL` middleware bypassing 307 redirects for relay requests. That fix changed the error from HTTP 200 to HTTP 401, exposing this deeper issue. |
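
The propagation rule from this commit can be illustrated without pulling in the websocket library. In the sketch below, `dialOptions` and `client` are simplified stand-ins for `websocket.DialOptions` and `codersdk.Client`, and `newMeshHTTPClient` and `dialOptionsFor` are hypothetical helpers. Only the rule itself (use the client's `HTTPClient` unless the caller already set one) comes from the commit message.

```go
package main

import (
	"crypto/tls"
	"net/http"
)

// dialOptions is a stand-in for websocket.DialOptions, reduced to the one
// field this sketch cares about.
type dialOptions struct {
	HTTPClient *http.Client
}

// client is a stand-in for codersdk.Client.
type client struct {
	HTTPClient *http.Client
}

// newMeshHTTPClient (hypothetical) builds a client with its own TLS
// configuration, the way a replica-to-replica client might.
func newMeshHTTPClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{MinVersion: tls.VersionTLS12},
		},
	}
}

// dialOptionsFor (hypothetical) applies the propagation rule: if the caller
// did not choose an HTTP client for the handshake, use the SDK client's own,
// so the handshake does not fall back to http.DefaultClient.
func (c *client) dialOptionsFor(opts *dialOptions) *dialOptions {
	if opts == nil {
		opts = &dialOptions{}
	}
	if opts.HTTPClient == nil {
		opts.HTTPClient = c.HTTPClient
	}
	return opts
}
```

Checking for `nil` rather than overwriting unconditionally keeps any client the caller supplied explicitly, which matches the "when the caller hasn't already set one" wording above.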
||
|
|
30d534b36b |
fix(chatd): fix relay race conditions, extract enterprise relay logic, move pubsub to OSS (#22589)
## Summary

Fixes a bug where interrupting a streaming chat and sending a new message left the relay connected to the wrong replica. Expanded into a broader refactor that cleanly separates concerns:

- **OSS** owns pubsub subscription, message catch-up, queue updates, status forwarding, and local parts merging.
- **Enterprise** (`enterprise/coderd/chatd`) only manages relay dialing, reconnection, and stale-dial discarding for cross-replica streaming.

## Architecture

### OSS `coderd/chatd/chatd.go`

`Subscribe()` builds the initial snapshot then runs a single merge goroutine that handles:

- Pubsub subscription for durable events (status, messages, queue, errors)
- Message catch-up via `AfterMessageID`
- Local `message_part` forwarding
- Relay events from enterprise (when `SubscribeFn` is set)
- Sends `StatusNotification` to enterprise so it can manage relay lifecycle

Key types:

- `SubscribeFn`: enterprise hook, returns relay-only events channel
- `SubscribeFnParams`: `ChatID`, `Chat`, `WorkerID`, `StatusNotifications`, `RequestHeader`, `DB`, `Logger`
- `StatusNotification`: `Status` + `WorkerID`, sent to enterprise on pubsub status changes

### Enterprise `enterprise/coderd/chatd/chatd.go`

`NewMultiReplicaSubscribeFn(cfg MultiReplicaSubscribeConfig)` returns a `SubscribeFn` that:

- Opens an initial synchronous relay if the chat is running on a remote worker
- Reads `StatusNotifications` from OSS to open/close relay connections
- Handles async dial, reconnect timers, stale-dial discarding
- Returns only relay `message_part` events

## Bug fixes

### Original bug: stale relay dial after interrupt

`openRelayAsync` goroutines used `mergedCtx` (subscription-level), not a per-dial context. `closeRelay()` could not cancel in-flight dials. When the user interrupts and a new replica picks up the chat, the old dial goroutine could complete after the new one and deliver a stale `relayResult`.

**Fix**: per-dial `dialCtx`/`dialCancel`, `expectedWorkerID` tracking, `workerID` on `relayResult`. `closeRelay()` cancels the dial context and drains `relayReadyCh`. Merge loop rejects mismatched worker IDs.

### Additional fixes

- `statusNotifications` send-on-closed-channel race: goroutine now owns `close()` via defer
- Enterprise spin-loop on `StatusNotifications` close: two-value receive with nil-out
- `hasPubsub` set from `p.pubsub != nil` instead of subscription success: now tracks actual subscription result
- `lastMessageID` not initialized from `afterMessageID`: caused duplicate messages on catch-up
- `wrappedParts` goroutine leaked remote connection on `dialCtx` cancel
- `closeRelay()` did not drain `relayReadyCh`
- `setChatWaiting` race with `SendMessage(Interrupt)`: wrapped in `InTx`
- `processChat` post-TX side effects fired when chat was taken by another worker: added `errChatTakenByOtherWorker` sentinel
- Cancel closure data race on `reconnectTimer`
- Bare blocking send on pubsub error path
- `localParts` hot-spin after channel close
- No-pubsub branch dropped relay events and initial snapshot
- Failed relay dial caused permanent stall (no reconnect retry)
- DB error during reconnect timer caused permanent stall
- `time.NewTimer` replaced with `quartz.Clock` for testable timing

## Tests

9 enterprise tests covering:

- Relay reconnect on drop (mock clock)
- Async dial does not block merge loop
- Relay snapshot delivery
- Stale dial discarded after interrupt
- Cancel during in-flight dial
- Running-to-running worker switch
- Failed dial retries (mock clock)
- Local worker closes relay
- Multiple consecutive reconnects (mock clock)

All pass with `-race`. |
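
The stale-dial fix in this commit combines three ideas: a context per dial, an expected worker ID, and a drain of the ready channel on close. The sketch below shows one way these could fit together. The `relayState` type and its methods are hypothetical and far simpler than the enterprise code. Only the names `relayResult`, `expectedWorkerID`, `dialCancel`, and `closeRelay` echo the commit message, and `context.Background()` stands in for the subscription-level context.

```go
package main

import "context"

// relayResult is what a finished dial delivers. Carrying workerID lets the
// merge loop tell which dial produced it.
type relayResult struct {
	workerID string
	events   <-chan string
}

// relayState (hypothetical) tracks the single dial that is currently wanted.
type relayState struct {
	parent           context.Context // subscription-level context
	expectedWorkerID string
	dialCancel       context.CancelFunc
	readyCh          chan relayResult
}

func newRelayState() *relayState {
	return &relayState{
		parent:  context.Background(), // assumption: a real caller passes its own context
		readyCh: make(chan relayResult, 1),
	}
}

// closeRelay cancels any in-flight dial and drains results that were already
// delivered, so nothing stale is left for the merge loop to pick up.
func (s *relayState) closeRelay() {
	if s.dialCancel != nil {
		s.dialCancel()
		s.dialCancel = nil
	}
	s.expectedWorkerID = ""
	for {
		select {
		case <-s.readyCh:
		default:
			return
		}
	}
}

// startDial replaces the current dial with one for workerID and returns the
// per-dial context that the dialing goroutine should use.
func (s *relayState) startDial(workerID string) context.Context {
	s.closeRelay()
	dialCtx, cancel := context.WithCancel(s.parent)
	s.dialCancel = cancel
	s.expectedWorkerID = workerID
	return dialCtx
}

// accept reports whether a delivered result belongs to the dial that is
// currently expected; mismatches are treated as stale and dropped.
func (s *relayState) accept(r relayResult) bool {
	return r.workerID != "" && r.workerID == s.expectedWorkerID
}
```

The worker-ID check is a second line of defence: a goroutine that already passed its own cancellation check can still deliver a result, and comparing IDs in the merge loop catches that case.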
||
|
|
b7a7683ac0 |
fix(chatd): harden cross-replica relay for chat stream parts (#22533)
## Problem

Subscribers connecting to a different replica than the one running the chat see full messages appear but no streaming partials (`message_part` events). The relay mechanism that forwards ephemeral parts across replicas had several bugs.

## Root Causes

1. **`openRelay()` blocked the event loop**: The WebSocket dial (TCP + TLS + HTTP upgrade) to the worker replica ran synchronously inside the select loop. While dialing, no events could be processed, channels filled up, and parts were silently dropped.
2. **Relay drops were permanent**: When the relay WebSocket closed mid-stream, `relayParts` was set to nil and never reopened. No status notification would re-trigger it since the chat was still running on the same worker.
3. **`drainInitial` snapshot race**: The `default` case in the initial drain loop caused the snapshot to be empty if the remote hadn't flushed data yet (common immediately after WebSocket connect).
4. **Duplicate event delivery**: The `preloaded` slice caused snapshot events to be sent both in the return value and re-sent through the channel goroutine.

## Fixes

### `coderd/chatd/chatd.go` (Subscribe method)

- **Async relay dial**: `openRelayAsync()` spawns a goroutine to dial the remote replica. The result (channel + cancel func) is delivered on a `relayReadyCh` channel that the select loop reads without blocking.
- **Relay reconnection**: When the relay channel closes, a 500ms timer fires. The handler re-checks chat status from the DB and reopens the relay if the chat is still running on a remote worker.
- **Snapshot parts via channel**: Relay snapshot + live parts are wrapped into a single channel so they flow through the same path, avoiding races with the select loop.

### `enterprise/coderd/chats.go` (newRemotePartsProvider)

- **Timer-based drain**: Replaced `default` with a 1-second timer. After the first event, `Reset(0)` switches to non-blocking drain for remaining buffered events.
- **Remove preloaded duplication**: The goroutine now only forwards new events; snapshot events are returned to the caller directly.

## Testing

All existing tests pass:

- `TestInterruptChatBroadcastsStatusAcrossInstances`
- `TestSubscribeSnapshotIncludesStatusEvent`
- `TestSubscribeNoPubsubNoDuplicateMessageParts`
- `TestSubscribeAfterMessageID`
- `TestChatStreamRelay/RelayMessagePartsAcrossReplicas` |
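
The snapshot race and its timer-based fix can be sketched as a two-phase drain: wait a bounded time for the first event, then collect whatever else is already buffered without blocking. The version below is an illustration only. The commit describes switching phases with `Reset(0)` on a single timer, whereas this sketch uses an explicit non-blocking loop for the second phase so that its behaviour is deterministic. The function name reuses `drainInitial` from the message; its signature and the `string` event type are assumptions.

```go
package main

import "time"

// drainInitial collects an initial snapshot from events. It waits up to
// firstWait for the first event, so a slow remote still yields a non-empty
// snapshot, and then drains only what is already buffered.
func drainInitial(events <-chan string, firstWait time.Duration) []string {
	var snapshot []string

	timer := time.NewTimer(firstWait)
	defer timer.Stop()

	// Phase 1: block until the first event arrives, the channel closes,
	// or the timer expires. A bare `default` here would return an empty
	// snapshot whenever the remote had not flushed yet.
	select {
	case ev, ok := <-events:
		if !ok {
			return snapshot
		}
		snapshot = append(snapshot, ev)
	case <-timer.C:
		return snapshot
	}

	// Phase 2: non-blocking drain of the remaining buffered events.
	for {
		select {
		case ev, ok := <-events:
			if !ok {
				return snapshot
			}
			snapshot = append(snapshot, ev)
		default:
			return snapshot
		}
	}
}
```

Events that arrive after the second phase returns are not lost; they stay on the channel for the live forwarding path, which is consistent with the commit's change to stop re-sending snapshot events through the goroutine.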
||
|
|
edee917d88 |
feat: add experimental agents support (#22290)
feat: add AI chat system with agent tools and chat UI

Introduce the chatd subsystem and Agents UI for AI-powered chat within Coder workspaces.

- Add chatd package with chat loop, message compaction, prompt management, and LLM provider integration (OpenAI, Anthropic)
- Add agent tools: create workspace, list/read templates, read/write/edit files, execute commands
- Add chat API endpoints with streaming, message editing, and durable reconnection
- Add database schema and migrations for chats, chat messages, chat providers, and chat model configs
- Add RBAC policies and dbauthz enforcement for chat resources
- Add Agents UI pages with conversation timeline, queued messages list, diff viewer, and model configuration panel
- Add comprehensive test coverage including coderd integration tests, chatd unit tests, and Storybook stories
- Gate feature behind experiments flag

---------

Co-authored-by: Cian Johnston <cian@coder.com>
Co-authored-by: Danielle Maywood <danielle@themaywoods.com>
Co-authored-by: Jeremy Ruppel <jeremy@coder.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> |