coder

mirror of https://github.com/coder/coder.git synced 2026-06-04 05:28:20 +00:00

Author	SHA1	Message	Date
Kyle Carberry	e388a88592	feat(coderd/chatd): connect to external MCP servers for chat tool invocation (#23333 ) ## Summary Adds a new `coderd/chatd/mcpclient` package that connects to admin-configured MCP servers and wraps their tools as `fantasy.AgentTool` values that the chat loop can invoke. ## What changed ### New: `coderd/chatd/mcpclient/mcpclient.go` The core package with a single entry point: ```go func ConnectAll( ctx context.Context, logger slog.Logger, configs []database.MCPServerConfig, tokens []database.MCPServerUserToken, ) (tools []fantasy.AgentTool, cleanup func(), err error) ``` This: 1. Connects to each enabled MCP server using `mark3labs/mcp-go` (streamable HTTP or SSE transport) 2. Discovers tools via the MCP `tools/list` method 3. Wraps each tool as a `fantasy.AgentTool` with namespaced name (`serverslug__toolname`) 4. Applies tool allow/deny list filtering from the server config 5. Handles auth: OAuth2 bearer tokens, API keys, and custom headers 6. Skips broken servers with a warning (10s connect timeout per server) 7. Returns a cleanup function to close all MCP connections ### Modified: `coderd/chatd/chatd.go` In `runChat()`, after loading the model/messages but before assembling the tool list: - Reads `chat.MCPServerIDs` from the chat record - Loads the MCP server configs from the database - Resolves the user's auth tokens - Calls `mcpclient.ConnectAll()` to connect and discover tools - Appends the MCP tools to the chat's tool set - Defers cleanup to close connections when the chat turn ends The chat loop (`chatloop.Run`) already handles tools generically — MCP-backed tools are invoked identically to built-in workspace tools. No changes needed in `chatloop/`. 
### New: `coderd/chatd/mcpclient/mcpclient_test.go` 10 tests covering: - Tool discovery and namespacing - Tool call forwarding and result conversion - Allow/deny list filtering - Connection failure handling (graceful skip) - Multi-server support with correct prefixes - OAuth2 auth header injection - Disabled server skipping - Invalid input handling - Tool info parameter propagation ## Design decisions - Tool namespacing: `slug__toolname` with double underscore separator. Avoids collisions with tools containing single underscores. Stripped when forwarding to `tools/call`. - Connection lifecycle: Fresh connections per chat turn, closed via `defer`. Matches the `turnWorkspaceContext` pattern. - Failure isolation: Each server connects independently. A broken server doesn't fail the chat — its tools are simply unavailable. - No chatloop changes: The existing `[]fantasy.AgentTool` interface is already fully generic. ## What's NOT in this PR (follow-ups) - Frontend MCP server picker UI (selecting servers for a chat) - System prompt additions describing available MCP tools - Token refresh on expiry mid-chat - The deprecated `aibridged` MCP proxy cleanup	2026-03-20 16:49:55 +00:00
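The `slug__toolname` namespacing rule from the design decisions above can be sketched as follows. This is a minimal illustration, not the code from the `mcpclient` package: the helper names `namespacedToolName` and `splitNamespacedToolName` are hypothetical, and the sketch assumes a server slug never contains a double underscore.

```go
package main

import (
	"fmt"
	"strings"
)

// toolNameSeparator is the double-underscore separator described in the
// commit message.
const toolNameSeparator = "__"

// namespacedToolName is a hypothetical helper that prefixes a tool name with
// the slug of the MCP server that provides it.
func namespacedToolName(slug, tool string) string {
	return slug + toolNameSeparator + tool
}

// splitNamespacedToolName is a hypothetical helper that recovers the slug and
// the original tool name, for example before forwarding to `tools/call`.
// It splits on the first separator, so it assumes slugs contain no "__".
func splitNamespacedToolName(name string) (slug, tool string, ok bool) {
	slug, tool, ok = strings.Cut(name, toolNameSeparator)
	if !ok || slug == "" || tool == "" {
		return "", "", false
	}
	return slug, tool, true
}

func main() {
	name := namespacedToolName("github", "create_issue")
	fmt.Println(name)

	slug, tool, ok := splitNamespacedToolName(name)
	fmt.Println(slug, tool, ok)
}
```

Splitting on the first separator keeps tool names with single underscores (such as `create_issue`) intact.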
Mathias Fredriksson	41e15ae440	feat: make process output blocking-capable (#23312 ) Replace the 200ms polling loop in chatd's execute and process_output tools with server-side blocking via sync.Cond on HeadTailBuffer. The agent's GET /{id}/output endpoint accepts ?wait=true to block until the process exits or a 5-minute server cap expires. The process_output tool blocks by default for 10s (overridable via wait_timeout), and falls back to a non-blocking snapshot on timeout. The execute tool's foreground path makes a single blocking call instead of polling. Related #23316	2026-03-20 14:33:55 +02:00
Cian Johnston	2f50e89afd	fix(coderd): bump workspace autostop deadline on chat heartbeat (#23314 ) - Wire `workspacestats.ActivityBumpWorkspace` into `trackWorkspaceUsage` so the workspace build deadline is extended each time the chat heartbeat fires - Prevents mid-conversation autostop for chat workspaces - Updates `TestHeartbeatBumpsWorkspaceUsage` verifying the deadline bump > This PR was created with the help of Coder Agents, and was reviewed by two humans and their pet robots 🧑‍💻🤝🤖	2026-03-19 22:07:20 +00:00
Mathias Fredriksson	0a0c976a1a	test(coderd/chatd): add P0 coverage tests for subagent auth and panic recovery (#23309 ) The processChat defer at line 2464 catches panics on its main goroutine and transitions the chat to error status. This was previously untested. The test wraps the database Store to panic during PersistStep's InTx call, which runs synchronously on the processChat goroutine. A tool-level panic wouldn't work because executeTools has its own recover that converts panics into tool error results.	2026-03-19 17:54:03 +00:00
Michael Suchacz	6d214644f6	fix: make TestInterruptAutoPromotionIgnoresLaterUsageLimitIncrease deterministic (#23279 ) Eliminates the timing flake in `TestInterruptAutoPromotionIgnoresLaterUsageLimitIncrease` by making the chatd worker loop clock-controllable. ## Changes `coderd/chatd/chatd.go` - Replace `time.NewTicker` calls in `Server.start()` with `p.clock.NewTicker` using named quartz tags `("chatd", "acquire")` and `("chatd", "stale-recovery")`. `coderd/chatd/chatd_test.go` - Inject `quartz.NewMock(t)` into the test via `newActiveTestServer` config override. - Trap the acquire ticker so the test controls exactly when pending chats are reacquired. - Rewrite the test flow as explicit clock-advance steps instead of wall-clock polling. `AGENTS.md` - Document the PR title scope rule (scope must be a real path containing all changed files). ## Validation - `go test ./coderd/chatd -run TestInterruptAutoPromotionIgnoresLaterUsageLimitIncrease -count=100` ✅ - `go test ./coderd/chatd` ✅ - `make lint` ✅	2026-03-19 15:14:00 +00:00
Hugo Dutka	d285a3e74e	fix: handle null bytes in chat messages (#22946 ) This PR fixes a bug where if a tool result contained binary data it wouldn't be persisted to the database. `jsonb` in Postgres is unable to store null bytes which are sometimes output by tool results. This change makes it so that we encode them with a special escape sequence before saving them to the database, and decode them on read.	2026-03-18 21:19:25 +01:00
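The commit message does not name the escape sequence, so the following is only a sketch of one reversible scheme under stated assumptions. The marker rune `U+E000` and the helpers `encodeNullBytes` and `decodeNullBytes` are hypothetical, not the identifiers used in the PR. The marker itself is also escaped so that decoding stays unambiguous.

```go
package main

import (
	"fmt"
	"strings"
)

// escapeMarker is a hypothetical marker taken from the Unicode private use
// area. The actual sequence used by the PR is not stated in the commit message.
const escapeMarker = "\uE000"

var (
	// encoder rewrites literal markers first-class so they cannot be confused
	// with an escaped null byte, and replaces NUL with marker + "0".
	encoder = strings.NewReplacer(
		escapeMarker, escapeMarker+"m",
		"\x00", escapeMarker+"0",
	)
	// decoder reverses both rewrites in a single left-to-right pass.
	decoder = strings.NewReplacer(
		escapeMarker+"m", escapeMarker,
		escapeMarker+"0", "\x00",
	)
)

// encodeNullBytes returns a copy of s that contains no NUL bytes.
func encodeNullBytes(s string) string { return encoder.Replace(s) }

// decodeNullBytes restores the original string produced by encodeNullBytes.
func decodeNullBytes(s string) string { return decoder.Replace(s) }

func main() {
	raw := "binary\x00output"
	stored := encodeNullBytes(raw)
	fmt.Printf("%q\n", stored)
	fmt.Println(decodeNullBytes(stored) == raw)
}
```

In a real write path the encoding would be applied before the value is serialized for the `jsonb` column, and the decoding after it is read back.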
Cian Johnston	14ed3e3644	feat: bump workspace last_used_at on chat heartbeat (#23205 ) - coderd: Wires `options.WorkspaceUsageTracker` into the chatd config. - chatd: Adds `UsageTracker` and calls `UsageTracker.Add(workspaceID)` on each heartbeat tick - chatd: adds tests to verify `last_used_at` bump behaviour > 🤖 This PR was created with the help of Coder Agents, and will be reviewed by my human. 🧑‍💻	2026-03-18 19:07:21 +00:00
Kyle Carberry	1f0d896fc9	feat: add deleted flag to chat messages for soft-delete (#23223 ) Adds a `deleted` boolean column to the `chat_messages` table. Messages are never physically deleted from the database — instead they are marked as deleted so that usage and cost data is preserved. ## Changes ### Migration - New migration (000444) adds `deleted boolean NOT NULL DEFAULT false` to `chat_messages` ### SQL queries - `DeleteChatMessagesAfterID` → `SoftDeleteChatMessagesAfterID` (UPDATE SET deleted=true instead of DELETE) - New `SoftDeleteChatMessageByID` query for single-message soft-delete - All read queries now filter `deleted = false`: - `GetChatMessageByID` - `GetChatMessagesByChatID` - `GetChatMessagesByChatIDDescPaginated` - `GetChatMessagesForPromptByChatID` (both CTE and main query) - `GetLastChatMessageByRole` - Cost/usage queries (`GetChatCostSummary`, `GetChatCostPerModel`, etc.) intentionally still include deleted messages to preserve accurate spend tracking ### EditMessage behavior - Previously: updated the message content in-place + hard-deleted subsequent messages - Now: soft-deletes the original message + soft-deletes subsequent messages + inserts a new message with the updated content - This preserves the original message data (tokens, cost, content) in the database	2026-03-18 14:37:09 -04:00
Kyle Carberry	483adc59fe	feat: replace InsertChatMessage with batch InsertChatMessages (#23220 ) Replaces the singular `InsertChatMessage` query with `InsertChatMessages` that uses PostgreSQL's `unnest()` for batch inserts. This reduces the number of database round-trips when inserting multiple messages in a single transaction. ## Changes - SQL: New `InsertChatMessages :many` query using `unnest()` arrays following the existing codebase pattern (e.g., `InsertWorkspaceAgentStats`). Preserves the CTE that updates `chats.last_model_config_id` using the last non-null model config from the batch. Uses `NULLIF` for UUID columns to handle NULL foreign keys. - Go layers: Updated `querier.go`, `dbauthz.go`, `dbmetrics/querymetrics.go`, `dbmock/dbmock.go`, and `queries.sql.go` to use the new batch signature (`[]ChatMessage` return type, array params). - chatd.go: All call sites converted to batch inserts: - CreateChat: System prompt + user message batched into one call - persistStep: Assistant message + tool messages batched into one call - persistSummary: Hidden summary + assistant + tool messages batched into one call - Single-message sites use the same API with single-element arrays - Helper: New `appendChatMessage` function simplifies building batch params at each call site. - Tests: All test files updated to use the new API. Builds on top of #23213.	2026-03-18 16:27:07 +00:00
Kyle Carberry	4dd8531f37	feat: track step runtime_ms on chat messages (#23219 ) ## Summary Adds a `runtime_ms` column to `chat_messages` that records the wall-clock duration (in milliseconds) of each LLM step. This covers LLM streaming, tool execution, and retries — the full time the agent is "alive" for a step. This is the foundation for billing by agent alive time. The column follows the same pattern as `total_cost_micros`: stored per assistant message, aggregatable with `SUM()` over time periods by user. ## Changes - Migration: adds nullable `runtime_ms bigint` to `chat_messages`. - chatloop: adds `Runtime time.Duration` field to `PersistedStep`, measures `time.Since(stepStart)` at the beginning of each step (covering stream + tool execution + retries). - chatd: passes `step.Runtime.Milliseconds()` to the assistant message `InsertChatMessage` call; all other message types (system, user, tool) get `NULL`. - Tests: adds `runtime > 0` assertion in chatloop tests. ## Billing query pattern Once ready, aggregation mirrors the existing cost queries: ```sql SELECT COALESCE(SUM(cm.runtime_ms), 0)::bigint AS total_runtime_ms FROM chat_messages cm JOIN chats c ON c.id = cm.chat_id WHERE c.owner_id = @user_id AND cm.created_at >= @start_time AND cm.created_at < @end_time AND cm.runtime_ms IS NOT NULL; ```	2026-03-18 10:57:35 -04:00
Kyle Carberry	b83b93ea5c	feat: add workspace awareness system message on chat creation (#23213 ) When a chat is created via `chatd`, a system message is now inserted informing the model whether the chat was created with or without a workspace. With workspace: > This chat is attached to a workspace. You can use workspace tools like execute, read_file, write_file, etc. Without workspace: > There is no workspace associated with this chat yet. Create one using the create_workspace tool before using workspace tools like execute, read_file, write_file, etc. This is a model-only visibility system message (not shown to users) that helps the model understand its available capabilities upfront — particularly important for subagents spawned without a workspace, which previously would attempt to use workspace tools and fail. Changes: - `coderd/chatd/chatd.go`: Added workspace awareness constants and inserted the system message in `CreateChat` after the system prompt, before the initial user message. - `coderd/chatd/chatd_test.go`: Added `TestCreateChatInsertsWorkspaceAwarenessMessage` with sub-tests for both with-workspace and without-workspace cases.	2026-03-18 14:01:46 +00:00
Kyle Carberry	d42008e93d	fix: persist partial assistant response when chat is interrupted mid-stream (#23193 ) ## Problem When a user cancels a streaming chat response mid-stream, the partial content disappears entirely, both from the UI and the database. The streamed text vanishes as if the response never happened. ## Root Causes Three issues combine to prevent partial message persistence on interrupt: ### 1. StreamPartTypeError only matched `context.Canceled` (`chatloop.go`) The interrupt detection in `processStepStream` checked:
```go
errors.Is(part.Error, context.Canceled) && errors.Is(context.Cause(ctx), ErrInterrupted)
```
But some providers propagate `ErrInterrupted` directly as the stream error rather than wrapping it in `context.Canceled`. This caused the condition to fail, so `flushActiveState` was never called and partial text accumulated in `activeTextContent` was lost. ### 2. No post-loop interrupt check (`chatloop.go`) If the stream iterator stops yielding parts without producing a `StreamPartTypeError` (e.g., a provider that silently closes the response body on cancel), there was no check after the `for part := range stream` loop to detect the interrupt and flush active state. ### 3. Worker ownership check blocked interrupted persists (`chatd.go`) `InterruptChat` → `setChatWaiting` clears `worker_id` in the DB before the chatloop detects the interrupt. When `persistInterruptedStep` (using `context.WithoutCancel`) tried to write the partial message, the ownership check:
```go
if !lockedChat.WorkerID.Valid || lockedChat.WorkerID.UUID != p.workerID {
	return chatloop.ErrInterrupted // always blocks!
}
```
unconditionally rejected the write. The error was silently logged as a warning. ## Fix - Broaden the `StreamPartTypeError` interrupt detection to match both `context.Canceled` and `ErrInterrupted` as the stream error. 
- Add a post-loop interrupt check in `processStepStream` that flushes active state when the context was canceled with `ErrInterrupted`. - Allow `persistStep` to write when the chat is in `waiting` status (interrupt) even if `worker_id` was cleared. The `pending` status (from `EditMessage`, where history is truncated) still correctly blocks stale writes. ## Testing Added `TestInterruptChatPersistsPartialResponse` — an end-to-end integration test that: 1. Streams partial text chunks from a mock LLM 2. Waits for the chatloop to publish `message_part` events (confirming chunks were processed) 3. Interrupts the chat mid-stream 4. Verifies the partial assistant message is persisted in the database with the expected text content	2026-03-18 11:48:28 +00:00
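The broadened interrupt check from the first fix can be sketched as below. This is an assumption about the shape of the condition, not the code in `chatloop.go`: `errInterrupted` stands in for `chatloop.ErrInterrupted`, and `isInterrupt` and `newInterruptedContext` are hypothetical helpers.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errInterrupted stands in for chatloop.ErrInterrupted in this sketch.
var errInterrupted = errors.New("chat interrupted")

// isInterrupt reports whether a stream error should be treated as a user
// interrupt. The context must have been canceled with the interrupt sentinel
// as its cause, and the stream error may be either context.Canceled or the
// sentinel itself (some providers propagate it directly).
func isInterrupt(ctx context.Context, streamErr error) bool {
	if !errors.Is(context.Cause(ctx), errInterrupted) {
		return false
	}
	return errors.Is(streamErr, context.Canceled) ||
		errors.Is(streamErr, errInterrupted)
}

// newInterruptedContext returns a context that was canceled with the
// interrupt sentinel as its cause.
func newInterruptedContext() context.Context {
	ctx, cancel := context.WithCancelCause(context.Background())
	cancel(errInterrupted)
	return ctx
}

func main() {
	ctx := newInterruptedContext()
	fmt.Println(isInterrupt(ctx, context.Canceled))
	fmt.Println(isInterrupt(ctx, fmt.Errorf("provider: %w", errInterrupted)))
	fmt.Println(isInterrupt(ctx, errors.New("unrelated failure")))
}
```

The post-loop check described in the second fix would reuse the same cause test after the `for part := range stream` loop ends without an error part.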
Hugo Dutka	2cf47ec384	feat: virtual desktop settings toggle backend (#23171 ) Adds a new `site_config` entry that controls whether the virtual desktop feature for Coder Agents is enabled. It can be set via a new `/api/experimental/chats/config/desktop-enabled` endpoint, which will be used by the frontend.	2026-03-18 09:35:13 +01:00
Kyle Carberry	b779c9ee33	fix: use SQL-level auth filtering for chat listing (#23159 ) ## Problem The chat listing endpoint (`GetChatsByOwnerID`) was using `fetchWithPostFilter`, which fetches N rows from the database and then filters them in Go memory using RBAC checks. This causes a pagination bug: if the user requests `limit=25` but some rows fail the auth check, fewer than 25 rows are returned even though more authorized rows exist in the database. The client may incorrectly assume it has reached the end of the list. ## Solution Switch to the same pattern used by `GetWorkspaces`, `GetTemplates`, and `GetUsers`: `prepareSQLFilter` + `GetAuthorized*` variant. The RBAC filter is compiled to a SQL WHERE clause and injected into the query before `ORDER BY`/`LIMIT`, so the database returns exactly the requested number of authorized rows. Additionally, `GetChatsByOwnerID` is renamed to `GetChats` with `OwnerID` as an optional (nullable) filter parameter, matching the `GetWorkspaces` naming convention. 
## Changes
| File | Change |
|------|--------|
| `queries/chats.sql` | Renamed to `GetChats`, `owner_id` now optional via CASE/NULL, added `-- @authorize_filter` |
| `queries.sql.go` | Renamed constant, params struct (`GetChatsParams`), and method |
| `querier.go` | Interface method renamed |
| `modelqueries.go` | Added `chatQuerier` interface + `GetAuthorizedChats` impl |
| `dbauthz/dbauthz.go` | `GetChats` now uses `prepareSQLFilter` instead of `fetchWithPostFilter` |
| `dbauthz/dbauthz_test.go` | Updated tests for SQL filter pattern |
| `dbmock/dbmock.go` | Renamed + added mock for `GetAuthorizedChats` |
| `dbmetrics/querymetrics.go` | Renamed + added metrics wrapper |
| `rbac/regosql/configs.go` | Added `ChatConverter` (maps `org_owner` to empty string literal since `chats` has no `organization_id` column) |
| `rbac/authz.go` | Added `ConfigChats()` |
| `chats.go` | Handler uses renamed method with `uuid.NullUUID` |
| `searchquery/search.go` | Updated return type |
| `gitsync/worker.go` | Updated interface and call site |
| Various test files | Updated for renamed types |	2026-03-17 12:46:24 -04:00
Michael Suchacz	5d0eb772da	fix(cored): fix flaky TestInterruptAutoPromotionIgnoresLaterUsageLimitIncrease (#23147 )	2026-03-17 19:08:22 +11:00
Ethan	04fca84872	perf(coderd): reduce duplicated reads in push and webpush paths (#23115 ) ## Background A 5000-chat scaletest (~50k turns, ~2m45s wall time) completed successfully, but the main bottleneck was DB pool starvation from repeated reads, not individually expensive SQL. The push/webpush path showed a few especially noisy reads: - `GetLastChatMessageByRole` for push body generation - `GetEnabledChatProviders` + `GetChatModelConfigByID` for push summary model resolution - `GetWebpushSubscriptionsByUserID` for every webpush dispatch This PR keeps the optimizations that remove those duplicate reads while leaving stream behavior unchanged. ## What changes in this PR ### 1. Reuse resolved chat state for push notifications `maybeSendPushNotification` used to re-read the last assistant message and re-resolve the chat model/provider after `runChat` had already done that work. Now `runChat` returns the final assistant text plus the already-resolved model and provider keys, and the push goroutine uses that state directly. That removes the extra push-path reads for: - `GetLastChatMessageByRole` - the second `resolveChatModel` path - the provider/model lookups that came with that second resolution ### 2. Cache webpush subscriptions during dispatch `Dispatch()` previously hit `GetWebpushSubscriptionsByUserID` on every push. A small per-user in-memory cache now avoids those repeated reads. The follow-up fix keeps that optimization correct: `InvalidateUser()` bumps a per-user generation so an older in-flight fetch cannot repopulate the cache with pre-mutation data after subscribe/unsubscribe. That preserves the cache win without letting local subscription changes be silently overwritten by stale fetch results. ## Why this is safe - The push change only reuses data already produced during the same chat run. It does not change notification semantics; if there is no assistant text to summarize, the existing fallback body still applies. 
- The webpush change keeps the existing TTL and `410 Gone` cleanup behavior. The generation guard only prevents stale in-flight fetches from poisoning the shared cache after invalidation. - The final PR does not change stream setup, pubsub/relay behavior, or chat status snapshot timing. ## Deliberately not included - No stream-path optimization in `Subscribe`. - No inline pubsub message payloads. - No distributed cross-replica webpush cache invalidation.	2026-03-17 13:50:47 +11:00
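The generation guard described for the webpush subscription cache can be sketched as follows. This is a simplified illustration under stated assumptions: the type `subscriptionCache` and its methods are hypothetical, subscriptions are reduced to plain strings, and the TTL and `410 Gone` cleanup mentioned above are left out.

```go
package main

import (
	"fmt"
	"sync"
)

// subscriptionCache is a hypothetical per-user cache with a generation
// counter that is bumped on every invalidation.
type subscriptionCache struct {
	mu          sync.Mutex
	entries     map[string][]string
	generations map[string]uint64
}

func newSubscriptionCache() *subscriptionCache {
	return &subscriptionCache{
		entries:     make(map[string][]string),
		generations: make(map[string]uint64),
	}
}

// get returns the cached subscriptions or calls fetch. The fetched result is
// stored only if the user was not invalidated while the fetch was in flight.
func (c *subscriptionCache) get(userID string, fetch func() []string) []string {
	c.mu.Lock()
	if subs, ok := c.entries[userID]; ok {
		c.mu.Unlock()
		return subs
	}
	gen := c.generations[userID]
	c.mu.Unlock()

	subs := fetch() // slow read, performed without holding the lock

	c.mu.Lock()
	defer c.mu.Unlock()
	if c.generations[userID] == gen {
		c.entries[userID] = subs
	}
	return subs
}

// invalidateUser drops the cached entry and bumps the generation so that an
// older in-flight fetch cannot repopulate the cache with stale data.
func (c *subscriptionCache) invalidateUser(userID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.generations[userID]++
	delete(c.entries, userID)
}

func main() {
	c := newSubscriptionCache()
	// Simulate a subscribe/unsubscribe that lands while a fetch is in flight.
	c.get("user-1", func() []string {
		c.invalidateUser("user-1")
		return []string{"stale-endpoint"}
	})
	fmt.Println(c.get("user-1", func() []string { return []string{"fresh-endpoint"} }))
}
```

The stale result is still returned to its own caller, but it never lands in the shared cache, which is the property the commit message describes.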
Michael Suchacz	1031da9738	feat: add agent chat spend limiting (backend) (#23071 ) Introduces deployment-scoped spend limiting for Coder Agents, enabling administrators to control LLM costs at global, group, and individual user levels. ## Changes - Database migration (000437): `chat_usage_limit_config` (singleton), `chat_usage_limit_overrides` (per-user), `chat_usage_limit_group_overrides` (per-group) - Single-query limit resolution: individual override > min(group) > global default via `ResolveUserChatSpendLimit` - Fail-open enforcement in chatd with documented TOCTOU trade-off - Experimental API under `/api/experimental/chats/usage-limits` for CRUD on limits - `AsChatd` RBAC subject for narrowly-scoped daemon access (replaces `AsSystemRestricted`) - Generated TypeScript types for the frontend SDK ## Hierarchy 1. Individual user override (highest) 2. Minimum of group limits 3. Global default 4. Disabled / unlimited Currency stored as micro-dollars (`1,000,000` = $1.00). Frontend PR: #23072	2026-03-17 01:24:03 +01:00
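The limit hierarchy above can be expressed as a small function. The PR resolves it in a single SQL query (`ResolveUserChatSpendLimit`); the Go version below is only an illustrative sketch, and the function name `resolveSpendLimit` and its parameters are hypothetical. Amounts are in micro-dollars, as in the commit message.

```go
package main

import "fmt"

// resolveSpendLimit is a hypothetical in-memory version of the hierarchy:
// individual override > minimum of group limits > global default > unlimited.
// It returns the effective limit in micro-dollars and whether a limit applies.
func resolveSpendLimit(userOverride *int64, groupLimits []int64, globalDefault *int64) (limit int64, limited bool) {
	// 1. An individual user override always wins.
	if userOverride != nil {
		return *userOverride, true
	}
	// 2. Otherwise the strictest (minimum) group limit applies.
	if len(groupLimits) > 0 {
		minLimit := groupLimits[0]
		for _, g := range groupLimits[1:] {
			if g < minLimit {
				minLimit = g
			}
		}
		return minLimit, true
	}
	// 3. Otherwise fall back to the global default, if configured.
	if globalDefault != nil {
		return *globalDefault, true
	}
	// 4. No limit configured: spending is unlimited.
	return 0, false
}

func main() {
	global := int64(10_000_000) // $10.00
	limit, limited := resolveSpendLimit(nil, []int64{5_000_000, 2_000_000}, &global)
	fmt.Println(limit, limited)
}
```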
Kyle Carberry	741af057dc	feat: paginate chat messages endpoint with cursor-based infinite scroll (#23083 ) Adds cursor-based pagination to the chat messages endpoint. ## Backend - New `GetChatMessagesByChatIDPaginated` SQL query: returns messages in `id DESC` order with a `before_id` keyset cursor and configurable `limit` - Handler parses `?before_id=N&limit=N` query params, uses the `LIMIT N+1` trick to set `has_more` without a separate COUNT query - Queued messages only returned on the first page (no cursor) since they're always the most recent - SDK client updated with `ChatMessagesPaginationOptions` - Fully backward compatible: omitting params returns the 50 newest messages ## Frontend - Switches `getChatMessages` from `useQuery` to `useInfiniteQuery` with cursor chaining via `getNextPageParam` - Pages flattened and sorted by `id` ascending for chronological display - `MessagesPaginationSentinel` component uses `IntersectionObserver` (200px rootMargin prefetch) inside the existing `flex-col-reverse` scroll container - `flex-col-reverse` handles scroll anchoring natively when older messages are prepended — no manual `scrollTop` adjustment needed (same pattern as coder/blink) ## Why cursor-based instead of offset/limit Offset-based pagination breaks when new messages arrive while paginating backward (offsets shift, causing duplicates or missed messages). The `before_id` cursor is stable regardless of inserts — each page is deterministic.	2026-03-16 16:40:59 +00:00
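The `before_id` keyset cursor and the `LIMIT N+1` trick can be sketched in memory as follows. This is not the handler or the SQL from the PR: the `message` type and the `pageBefore` function are hypothetical, and a slice stands in for the `chat_messages` table.

```go
package main

import (
	"fmt"
	"sort"
)

// message is a hypothetical stand-in for a chat message row.
type message struct {
	ID   int64
	Text string
}

// pageBefore returns up to limit messages with ID < beforeID, newest first.
// A beforeID of 0 means "no cursor" (first page). It collects limit+1 rows so
// that hasMore can be derived without a separate COUNT query.
func pageBefore(all []message, beforeID int64, limit int) (page []message, hasMore bool) {
	sorted := append([]message(nil), all...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].ID > sorted[j].ID })

	for _, m := range sorted {
		if beforeID != 0 && m.ID >= beforeID {
			continue
		}
		page = append(page, m)
		if len(page) == limit+1 {
			break
		}
	}
	if len(page) > limit {
		// The extra row only signals that another page exists.
		return page[:limit], true
	}
	return page, false
}

func main() {
	var all []message
	for id := int64(1); id <= 5; id++ {
		all = append(all, message{ID: id, Text: fmt.Sprintf("message %d", id)})
	}

	var cursor int64
	for {
		page, hasMore := pageBefore(all, cursor, 2)
		fmt.Println(page, hasMore)
		if !hasMore {
			break
		}
		cursor = page[len(page)-1].ID // oldest ID on this page
	}
}
```

Because each page is defined by an ID boundary rather than an offset, rows inserted while the client pages backward do not shift earlier pages.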
Hugo Dutka	84527390c6	feat: chat desktop backend (#23005 ) Implement the backend for the desktop feature for agents. - Adds a new `/api/experimental/chats/$id/desktop` endpoint to coderd which exposes a VNC stream from a [portabledesktop](https://github.com/coder/portabledesktop) process running inside the workspace - Adds a new `spawn_computer_use_agent` tool to chatd, which spawns a subagent that has access to the `computer` tool which lets it interact with the `portabledesktop` process running inside the workspace - Adds the plumbing to make the above possible There's a follow up frontend PR here: https://github.com/coder/coder/pull/23006	2026-03-13 19:49:34 +01:00
Mathias Fredriksson	4a79af1a0d	refactor: add chat_message_role enum and content_version column (#23042 ) Migration 000434 converts chat_messages.role from text to a Postgres enum, rebuilds the partial index, and adds content_version smallint. The column is backfilled with DEFAULT 0, then the default is dropped so future inserts must set it explicitly. Version 0 uses the role-aware heuristic from #22958. Version 1 (all new inserts) stores []ChatMessagePart JSON for all roles, including system messages. ParseContent takes database.ChatMessage directly and dispatches on version internally. Unknown versions error. All string(codersdk.ChatMessageRole) casts at DB write sites are replaced with database.ChatMessageRole constants from sqlc. Refs #22958	2026-03-13 16:47:36 +00:00
Mathias Fredriksson	bdbcd3428b	feat(coderd/chatd): unify chat storage on SDK parts and fix file-reference rendering (#22958 ) File-reference parts in user messages were flattened to `TextContent` at write time because fantasy has no file-reference content type. The frontend never saw them as structured parts. This moves all write paths (user, assistant, tool) from fantasy envelope format to `codersdk.ChatMessagePart`. The streaming layer (`chatloop`) is untouched, the conversion happens at the serialization boundary in `persistStep`. Old rows are still readable. `ParseContent` uses a structural heuristic (`isFantasyEnvelopeFormat`) to distinguish legacy envelopes from SDK parts. We chose this over try/fallback because fantasy envelopes partially unmarshal into `ChatMessagePart` (the `type` field matches) while silently losing content. A guard test enforces that no SDK part can produce the envelope shape. This is forward-only: new rows are unreadable by old code. Chat is behind a feature flag so rollback risk is contained. Also adds a typed `ChatMessageRole` to replace raw strings and `fantasy.MessageRole` casts at the persistence boundary. The type covers `ChatMessage.Role`, `ChatStreamMessagePart.Role`, the `PublishMessagePart` callback chain, and all DB write sites. `fantasy.MessageRole` remains only where we build `fantasy.Message` structs for LLM dispatch. Separately, `ProviderMetadata` was leaking to SSE clients via `publishMessagePart`. `StripInternal` now runs on both the SSE and REST paths, covering this. Other cleanup: - Old `db2sdk.contentBlockToPart` silently dropped metadata on text/reasoning/tool-call content. New code preserves it. - `providerMetadataToOptions` now logs warnings instead of silently returning nil. - `db2sdk` shrinks from ~250 lines of parallel conversion to ~15 lines delegating to `chatprompt.ParseContent()`, removing the `fantasy` import entirely. Refs #22821	2026-03-13 17:53:26 +02:00
Kyle Carberry	690e3a87d8	feat: move chat messages to dedicated /chats/{id}/messages endpoint (#23021 ) ## Summary Moves the messages response out of `GET /chats/{id}` and into a dedicated `GET /chats/{id}/messages` endpoint. ### Backend - `GET /chats/{id}` now returns just the `Chat` object (no messages) - `GET /chats/{id}/messages` is a new endpoint returning `ChatMessagesResponse` with `messages` and `queued_messages` - Added `ChatMessagesResponse` SDK type and `GetChatMessages` client method ### Frontend - `getChat()` API method returns `Chat` instead of `ChatWithMessages` - Added `getChatMessages()` API method for the new endpoint - Split `chatQuery` into two: `chatQuery` (metadata) and `chatMessagesQuery` (messages) - Updated all cache mutations, optimistic updates, and websocket handlers - Updated tests and stories ### Files changed
| File | Change |
|---|---|
| `coderd/coderd.go` | Register `GET /messages` route |
| `coderd/chats.go` | Simplify `getChat`, add `getChatMessages` handler |
| `codersdk/chats.go` | New type + method, update `GetChat` return |
| `site/src/api/api.ts` | New method, update `getChat` |
| `site/src/api/queries/chats.ts` | New query, update cache mutations |
| `site/src/pages/AgentsPage/AgentDetail.tsx` | Use separate queries |
| `site/src/pages/AgentsPage/AgentDetail/ChatContext.ts` | Update types and cache writes |
| `site/src/pages/AgentsPage/AgentsPage.tsx` | Update websocket cache handler |	2026-03-13 08:35:46 -04:00
Danielle Maywood	6489d6f714	feat(chatd): use last assistant message as push notification summary (#22671 ) Instead of the static 'Agent has finished running.' text, extract a summary from the last assistant message to give users meaningful context about what the agent accomplished. Falls back to the static text if no suitable message is found. Co-authored-by: Kyle Carberry <kyle@carberry.com>	2026-03-10 15:14:15 +00:00
Kyle Carberry	fee5cc5e5b	fix(chatd): fix flaky TestCloseDuringShutdownContextCanceledShouldRetryOnNewReplica (#22893 ) Fixes https://github.com/coder/internal/issues/1371 ## Root causes Two independent races cause this test to flake at ~2–3/1000: ### 1. Title-generation requests racing with the streaming request counter `maybeGenerateChatTitle` fires in a `context.WithoutCancel` goroutine (line 2130) and makes a non-streaming request to the mock OpenAI handler. The test handler was not filtering by request type, so these title requests incremented the `requestCount` atomic — throwing off the coordination logic that uses `requestCount == 1` to identify the first streaming request and hold it open until shutdown. Fix: Guard the test handler to return a canned response for non-streaming requests before touching `requestCount`. ### 2. Phantom acquire: `AcquireChat` commits in Postgres but Go sees `context.Canceled` During `Close()`, the main loop's `select` can randomly pick `acquireTicker.C` over `ctx.Done()` (Go spec: when multiple cases are ready, one is chosen uniformly at random). This calls `processOnce(ctx)` with an already-canceled context. In the pq driver, `QueryContext` does not check `ctx.Err()` up front. Instead it calls `watchCancel(ctx)` which spawns a goroutine monitoring `ctx.Done()`, then sends the query on the existing connection. When `ctx` is already canceled, a race ensues: - pq's watchCancel goroutine immediately sees `<-done`, opens a new TCP connection to Postgres, and sends a cancel request. - The query is sent concurrently on the existing connection. Because the `AcquireChat` UPDATE is fast (sub-millisecond, single row with `SKIP LOCKED`), it often commits before the cancel arrives via the second connection. Meanwhile in `database/sql`, `initContextClose` spawns an `awaitDone` goroutine that fires immediately (context is already canceled), stores `contextDone`, and calls `rs.close(ctx.Err())` — which races with `Row.Scan` → `rows.Next()`. 
If `awaitDone` wins, `Next()` sees `contextDone` is set and returns false, causing Scan to return `context.Canceled` (or `ErrNoRows`). Result: Postgres committed the UPDATE (chat is now `running` with serverA's worker ID), but Go sees an error and never spawns a goroutine to process it. The chat is stuck as `running` with no worker. If the previous `processChat` cleanup already set the chat back to `pending`, this phantom acquire flips it back to `running` — which is exactly what the debug logs showed: after `Close()` returns, the DB shows `status=running` with serverA's worker ID. Fix: Three guards in `processOnce`: 1. Early `ctx.Err()` check — catches the common case where `select` picked the ticker after cancellation. 2. `context.WithoutCancel(ctx)` for `AcquireChat` — prevents the pq `watchCancel` race entirely, ensuring the driver sees the query result if Postgres executed it. 3. Post-acquire `ctx.Err()` check — if the context was canceled while `AcquireChat` ran (or between the early check and the call), immediately release the chat back to `pending`. ## Verification Passes 2000/2000 iterations (previously flaked at ~2–3/1000): ``` go test -run "TestCloseDuringShutdownContextCanceledShouldRetryOnNewReplica" \ -count=2000 -timeout 1800s -failfast ./coderd/chatd/ ```	2026-03-10 14:22:39 +00:00
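The three guards in `processOnce` can be sketched as below. This is a simplified model rather than the chatd code: the `chatStore` interface, `fakeStore`, and `runScenario` are hypothetical, and the fake store cancels the context during acquisition to imitate a shutdown that races with `AcquireChat`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// chatStore is a hypothetical stand-in for the database calls involved.
type chatStore interface {
	acquire(ctx context.Context) (id string, ok bool, err error)
	release(ctx context.Context, id string) error
}

func processOnce(ctx context.Context, store chatStore, process func(id string)) error {
	// Guard 1: do not start if the context is already canceled.
	if err := ctx.Err(); err != nil {
		return err
	}
	// Guard 2: acquire without cancellation so the caller always observes
	// the result of a query that the database actually executed.
	id, ok, err := store.acquire(context.WithoutCancel(ctx))
	if err != nil || !ok {
		return err
	}
	// Guard 3: if cancellation happened meanwhile, hand the chat back.
	if err := ctx.Err(); err != nil {
		if relErr := store.release(context.WithoutCancel(ctx), id); relErr != nil {
			return errors.Join(err, relErr)
		}
		return err
	}
	process(id)
	return nil
}

// fakeStore optionally cancels the caller's context while acquiring.
type fakeStore struct {
	cancel   context.CancelFunc
	released []string
}

func (s *fakeStore) acquire(ctx context.Context) (string, bool, error) {
	if s.cancel != nil {
		s.cancel() // simulate shutdown racing with the acquire query
	}
	return "chat-1", true, ctx.Err()
}

func (s *fakeStore) release(_ context.Context, id string) error {
	s.released = append(s.released, id)
	return nil
}

// runScenario runs processOnce once and reports what happened.
func runScenario(cancelDuringAcquire bool) (processed, released []string, err error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	store := &fakeStore{}
	if cancelDuringAcquire {
		store.cancel = cancel
	}
	err = processOnce(ctx, store, func(id string) { processed = append(processed, id) })
	return processed, store.released, err
}

func main() {
	fmt.Println(runScenario(false))
	fmt.Println(runScenario(true))
}
```

In the canceled scenario the chat is acquired and then immediately released instead of being left as `running` with no worker.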
Kyle Carberry	b9c729457b	fix(chatd): queue interrupt messages to preserve conversation order (#22736 ) ## Problem When `message_agent` is called with `interrupt=true`, two independent code paths race to persist messages: 1. `SendMessage` inserts the user message into `chat_messages` at time T1 2. `persistInterruptedStep` saves the partial assistant response at time T2 (T2 > T1) Since `chat_messages` are ordered by `(created_at, id)`, the assistant message ends up after the user message that triggered the interrupt. On reload, this produces a broken conversation where the interrupted response appears below the new user message — and Anthropic rejects the trailing assistant message as unsupported prefill. The root cause is that two independent writers can't guarantee ordering. Any solution involving timestamp manipulation or signal-then-wait coordination leaves race windows. ## Fix Route interrupt behavior through the existing queued message mechanism: 1. `SendMessage` with `BusyBehaviorInterrupt` now inserts into `chat_queued_messages` (not `chat_messages`) when the chat is busy 2. After queuing, `setChatWaiting` signals the running loop to stop 3. The deferred cleanup in `processChat` persists the partial assistant response first, then auto-promotes the queued user message This eliminates the race entirely: the assistant partial response and user message are written by the same serialized cleanup flow, so ordering is guaranteed by the DB's auto-incrementing `id` sequence. No timestamp hacks, no reordering at send time. Supersedes #22728 — fixes the root cause instead of reordering at prompt construction time.	2026-03-06 18:15:40 -05:00
Kyle Carberry	eecb7d0b66	fix: resolve bugs in chatd streaming system (#22720 ) Split from #22693 per review feedback. Fixes multiple bugs in coderd/chatd and sub-packages including race conditions, transaction safety, stream buffer bounds, retry limits, and enterprise relay improvements. See commit message for full list.	2026-03-06 21:02:25 +00:00
Danielle Maywood	ffb47cea19	feat(chatd): add tag-based dedup to push notifications (#22669 )	2026-03-06 10:48:58 +00:00
Danielle Maywood	d91d9712f7	fix: use Eventually for web push dispatch assertion in chatd test (#22700 )	2026-03-06 09:52:28 +00:00
Hugo Dutka	48ab492f49	feat: agents git watch backend (#22565 ) Adds real-time git status watching for workspace agents, so the frontend can subscribe over WebSocket and show git file changes in near real-time. 1. Subscription is scoped to a chat via `GET /api/experimental/chats/{chat}/git/watch`. 2. The workspace agent automatically determines which paths to watch based on tool calls made by the chat (and its ancestor chats). 3. Workspace agent polls subscribed repo working trees on a 30s interval, on tool calls, and on explicit `refresh` from the client. 4. Scans are rate-limited to at most once per second. 5. Edited paths are tracked in-memory inside the workspace agent. There is no database persistence — state is lost on agent restart. This will be addressed in a future PR. 6. Messages sent over WebSocket include a full-repo snapshot (unified diff, branch, origin). A new message is emitted only when the snapshot changes. This PR was implemented with AI with me closely controlling what it's doing. The code follows a plan file that was updated continuously during implementation. Here's the file if you'd like to see it: [project.md](https://gist.github.com/hugodutka/8722cf80c92f8a56555f7bc595b770e2). It reflects the current state of the PR.	2026-03-06 10:47:55 +01:00
Danielle Maywood	0ec27e3d48	feat(chatd): navigate to specific chat on push notification click (#22668 )	2026-03-05 16:40:17 +00:00
Kyle Carberry	6520159045	feat(chatd): add start_workspace tool to agent flow (#22646 ) ## Summary When a chat's workspace is stopped, the LLM previously had no way to start it — `create_workspace` would either create a duplicate workspace or fail. This adds a dedicated `start_workspace` tool to the agent flow. ## Changes ### New: `start_workspace` tool (`coderd/chatd/chattool/startworkspace.go`) - Detects if the chat's workspace is stopped and starts it via a new build with `transition=start` - Reuses the existing `waitForBuild` and `waitForAgent` helpers (shared logic) - Shares the workspace mutex with `create_workspace` to prevent races - Idempotent: returns immediately if the workspace is already running or building - Returns a `no_agent` / `not_ready` status if the agent isn't available yet (non-fatal) ### Updated: `create_workspace` stopped-workspace hint - `checkExistingWorkspace` now returns a `stopped` status with message `"use start_workspace to start it"` when it detects the chat's workspace is stopped, instead of falling through to create a new workspace ### Wiring - `chatd.Config` / `chatd.Server`: new `StartWorkspace` / `startWorkspaceFn` field - `coderd/chats.go`: new `chatStartWorkspace` method that calls `postWorkspaceBuildsInternal` with proper RBAC context - `coderd/coderd.go`: passes `chatStartWorkspace` into chatd config - Tool registered alongside `create_workspace` for root chats only (not subagents) ### Tests (`startworkspace_test.go`) - `NoWorkspace`: error when chat has no workspace - `AlreadyRunning`: idempotent return for workspace with successful start build - `StoppedWorkspace`: verifies StartFn is called, build is waited on, and success response returned	2026-03-05 15:34:24 +00:00
Cian Johnston	d0a51e1752	fix: use testutil.Eventually in chatd interrupt test (#22653 ) Follow-up to #22630. Addresses [review feedback](https://github.com/coder/coder/pull/22630#pullrequestreview-2953419963) that was missed due to auto-merge. ## Changes Replaces three `require.Eventually` calls with `testutil.Eventually` in `TestInterruptChatDoesNotSendWebPushNotification`, linking the condition to the existing test context (`ctx`) created on line 1194. This ensures the test respects context cancellation instead of using a standalone timeout/tick pattern.	2026-03-05 09:42:34 +00:00
Cian Johnston	4d0d187806	fix(chatd): wait for startup scripts before returning from create_workspace (#22498 ) The `create_workspace` tool waited for the workspace build to succeed and the agent to become connectable, but did not wait for the agent's startup scripts (e.g. git clone) to finish. This caused agents to attempt file operations on repositories that hadn't been cloned yet. Add a waitForStartupScripts step that polls the agent's lifecycle_state via GetWorkspaceAgentLifecycleStateByID until it transitions out of created/starting into a terminal state (ready, start_error, or start_timeout). The tool now only returns success once the workspace is fully initialized. If the scripts fail or time out, the tool still returns (non-fatal) with an appropriate agent_status so the model knows something went wrong. Created using thingies (Opus 4.6 Max)	2026-03-05 09:42:12 +00:00
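The polling step described in the commit above might look roughly like this. It is a sketch under assumptions: `getStateFunc` is a hypothetical stand-in for `GetWorkspaceAgentLifecycleStateByID`, the poll interval is arbitrary, and only the state names come from the message.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// getStateFunc is a hypothetical stand-in for the lifecycle-state lookup.
type getStateFunc func(ctx context.Context) (string, error)

// isTerminal reports whether startup scripts have finished, successfully or not.
func isTerminal(state string) bool {
	switch state {
	case "ready", "start_error", "start_timeout":
		return true
	}
	return false
}

// waitForStartupScripts polls until the agent leaves created/starting or the
// context ends. It returns the last observed state.
func waitForStartupScripts(ctx context.Context, get getStateFunc, interval time.Duration) (string, error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		state, err := get(ctx)
		if err != nil {
			return "", err
		}
		if isTerminal(state) {
			return state, nil
		}
		select {
		case <-ctx.Done():
			return state, ctx.Err()
		case <-ticker.C:
		}
	}
}

// demoWait feeds a fixed sequence of states to the poller.
func demoWait() string {
	states := []string{"created", "starting", "starting", "ready"}
	i := 0
	get := func(context.Context) (string, error) {
		s := states[i]
		if i < len(states)-1 {
			i++
		}
		return s, nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	state, err := waitForStartupScripts(ctx, get, time.Millisecond)
	if err != nil {
		return "error: " + err.Error()
	}
	return state
}

func main() {
	fmt.Println(demoWait())
}
```

A caller would treat `start_error` and `start_timeout` as non-fatal and surface them in the tool result, as the commit message describes.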
Kyle Carberry	7bcd9f6de8	fix: skip web push notification when chat is interrupted (#22630 ) When a user interrupts a chat, the status transitions to `waiting` which previously triggered an "Agent has finished running." web push notification. This is incorrect — the user interrupted it themselves, so no notification is needed. ## Changes ### `coderd/chatd/chatd.go` - Added `wasInterrupted` flag alongside the existing `status` variable - Set the flag when `ErrInterrupted` is detected in the error handler - Added `!wasInterrupted` to the web push dispatch condition ### `coderd/chatd/chatd_test.go` - Added `TestInterruptChatDoesNotSendWebPushNotification` that creates a chat with a mock webpush dispatcher, processes it, interrupts it, and verifies no push notification was dispatched - Added `mockWebpushDispatcher` implementing the `webpush.Dispatcher` interface	2026-03-05 09:08:17 +00:00
Kyle Carberry	30d534b36b	fix(chatd): fix relay race conditions, extract enterprise relay logic, move pubsub to OSS (#22589 ) ## Summary Fixes a bug where interrupting a streaming chat and sending a new message left the relay connected to the wrong replica. Expanded into a broader refactor that cleanly separates concerns: - OSS owns pubsub subscription, message catch-up, queue updates, status forwarding, and local parts merging. - Enterprise (`enterprise/coderd/chatd`) only manages relay dialing, reconnection, and stale-dial discarding for cross-replica streaming. ## Architecture ### OSS `coderd/chatd/chatd.go` `Subscribe()` builds the initial snapshot then runs a single merge goroutine that handles: - Pubsub subscription for durable events (status, messages, queue, errors) - Message catch-up via `AfterMessageID` - Local `message_part` forwarding - Relay events from enterprise (when `SubscribeFn` is set) - Sends `StatusNotification` to enterprise so it can manage relay lifecycle Key types: - `SubscribeFn` — enterprise hook, returns relay-only events channel - `SubscribeFnParams` — `ChatID`, `Chat`, `WorkerID`, `StatusNotifications`, `RequestHeader`, `DB`, `Logger` - `StatusNotification` — `Status` + `WorkerID`, sent to enterprise on pubsub status changes ### Enterprise `enterprise/coderd/chatd/chatd.go` `NewMultiReplicaSubscribeFn(cfg MultiReplicaSubscribeConfig)` returns a `SubscribeFn` that: - Opens an initial synchronous relay if the chat is running on a remote worker - Reads `StatusNotifications` from OSS to open/close relay connections - Handles async dial, reconnect timers, stale-dial discarding - Returns only relay `message_part` events ## Bug fixes ### Original bug: stale relay dial after interrupt `openRelayAsync` goroutines used `mergedCtx` (subscription-level), not a per-dial context. `closeRelay()` could not cancel in-flight dials. 
When the user interrupts and a new replica picks up the chat, the old dial goroutine could complete after the new one and deliver a stale `relayResult`. Fix: per-dial `dialCtx`/`dialCancel`, `expectedWorkerID` tracking, `workerID` on `relayResult`. `closeRelay()` cancels the dial context and drains `relayReadyCh`. Merge loop rejects mismatched worker IDs. ### Additional fixes - `statusNotifications` send-on-closed-channel race — goroutine now owns `close()` via defer - Enterprise spin-loop on `StatusNotifications` close — two-value receive with nil-out - `hasPubsub` set from `p.pubsub != nil` instead of subscription success — now tracks actual subscription result - `lastMessageID` not initialized from `afterMessageID` — caused duplicate messages on catch-up - `wrappedParts` goroutine leaked remote connection on `dialCtx` cancel - `closeRelay()` did not drain `relayReadyCh` - `setChatWaiting` race with `SendMessage(Interrupt)` — wrapped in `InTx` - `processChat` post-TX side effects fired when chat was taken by another worker — added `errChatTakenByOtherWorker` sentinel - Cancel closure data race on `reconnectTimer` - Bare blocking send on pubsub error path - `localParts` hot-spin after channel close - No-pubsub branch dropped relay events and initial snapshot - Failed relay dial caused permanent stall (no reconnect retry) - DB error during reconnect timer caused permanent stall - `time.NewTimer` replaced with `quartz.Clock` for testable timing ## Tests 9 enterprise tests covering: - Relay reconnect on drop (mock clock) - Async dial does not block merge loop - Relay snapshot delivery - Stale dial discarded after interrupt - Cancel during in-flight dial - Running-to-running worker switch - Failed dial retries (mock clock) - Local worker closes relay - Multiple consecutive reconnects (mock clock) All pass with `-race`.	2026-03-04 18:42:28 -05:00
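The stale-dial fix in the commit above combines a per-dial context with a worker ID check. The sketch below models only that bookkeeping. `relayState`, `beginDial`, and `accept` are hypothetical names, and the real code delivers results over `relayReadyCh` rather than through direct calls.

```go
package main

import (
	"context"
	"fmt"
)

// relayResult carries the worker the dial targeted, so late results can be
// recognized. The parts field is simplified to a slice for the sketch.
type relayResult struct {
	workerID string
	parts    []string
}

// relayState is a hypothetical model of the merge loop's relay bookkeeping.
type relayState struct {
	expectedWorkerID string
	dialCancel       context.CancelFunc
	active           *relayResult
}

// beginDial cancels any in-flight dial and returns a fresh per-dial context.
func (s *relayState) beginDial(parent context.Context, workerID string) context.Context {
	s.closeRelay()
	dialCtx, cancel := context.WithCancel(parent)
	s.expectedWorkerID = workerID
	s.dialCancel = cancel
	return dialCtx
}

// closeRelay cancels the current dial and forgets the expected worker.
func (s *relayState) closeRelay() {
	if s.dialCancel != nil {
		s.dialCancel()
		s.dialCancel = nil
	}
	s.expectedWorkerID = ""
	s.active = nil
}

// accept adopts a dial result only if it came from the expected worker.
func (s *relayState) accept(res relayResult) bool {
	if s.expectedWorkerID == "" || res.workerID != s.expectedWorkerID {
		return false
	}
	s.active = &res
	return true
}

// demoStaleDial switches workers mid-dial and delivers both results.
func demoStaleDial() (oldCanceled, staleAccepted, freshAccepted bool) {
	var s relayState
	defer s.closeRelay()
	oldCtx := s.beginDial(context.Background(), "replica-a")
	_ = s.beginDial(context.Background(), "replica-b")
	oldCanceled = oldCtx.Err() != nil
	staleAccepted = s.accept(relayResult{workerID: "replica-a"})
	freshAccepted = s.accept(relayResult{workerID: "replica-b"})
	return
}

func main() {
	fmt.Println(demoStaleDial())
}
```

The cancellation stops the old dial early in the common case, and the ID check covers the window where the old dial finishes anyway.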
Kyle Carberry	f4a7fa5b95	fix(chatd): block subagents from spawning workspaces (#22603 ) ## Summary Subagent (child) chats were previously given access to workspace provisioning tools (`list_templates`, `read_template`, `create_workspace`), which could lead to uncontrolled resource consumption. This PR moves those tools behind the same `!chat.ParentChatID.Valid` gate that already protects the subagent tools (`spawn_agent`, `wait_agent`, etc.). ## Changes - `coderd/chatd/chatd.go`: Moved `list_templates`, `read_template`, and `create_workspace` tool registration into the root-chat-only block alongside subagent tools. - `coderd/chatd/chatd_test.go`: Added `TestSubagentChatExcludesWorkspaceProvisioningTools` — an E2E test that spawns a subagent via a root chat and verifies the subagent's LLM call does not include workspace provisioning or subagent tools. - `coderd/chatd/chattest/openai.go`: Added `Tools` field to `OpenAIRequest` and supporting `OpenAITool`/`OpenAIToolFunction` types so tests can inspect which tools are sent to the model.	2026-03-04 15:49:14 +00:00
Kyle Carberry	b7a7683ac0	fix(chatd): harden cross-replica relay for chat stream parts (#22533 ) ## Problem Subscribers connecting to a different replica than the one running the chat see full messages appear but no streaming partials (`message_part` events). The relay mechanism that forwards ephemeral parts across replicas had several bugs. ## Root Causes 1. `openRelay()` blocked the event loop — The WebSocket dial (TCP + TLS + HTTP upgrade) to the worker replica ran synchronously inside the select loop. While dialing, no events could be processed, channels filled up, and parts were silently dropped. 2. Relay drops were permanent — When the relay WebSocket closed mid-stream, `relayParts` was set to nil and never reopened. No status notification would re-trigger it since the chat was still running on the same worker. 3. `drainInitial` snapshot race — The `default` case in the initial drain loop caused the snapshot to be empty if the remote hadn't flushed data yet (common immediately after WebSocket connect). 4. Duplicate event delivery — The `preloaded` slice caused snapshot events to be sent both in the return value and re-sent through the channel goroutine. ## Fixes ### `coderd/chatd/chatd.go` (Subscribe method) - Async relay dial: `openRelayAsync()` spawns a goroutine to dial the remote replica. The result (channel + cancel func) is delivered on a `relayReadyCh` channel that the select loop reads without blocking. - Relay reconnection: When the relay channel closes, a 500ms timer fires. The handler re-checks chat status from the DB and reopens the relay if the chat is still running on a remote worker. - Snapshot parts via channel: Relay snapshot + live parts are wrapped into a single channel so they flow through the same path, avoiding races with the select loop. ### `enterprise/coderd/chats.go` (newRemotePartsProvider) - Timer-based drain: Replaced `default` with a 1-second timer. 
After the first event, `Reset(0)` switches to non-blocking drain for remaining buffered events. - Remove preloaded duplication: The goroutine now only forwards new events; snapshot events are returned to the caller directly. ## Testing All existing tests pass: - `TestInterruptChatBroadcastsStatusAcrossInstances` - `TestSubscribeSnapshotIncludesStatusEvent` - `TestSubscribeNoPubsubNoDuplicateMessageParts` - `TestSubscribeAfterMessageID` - `TestChatStreamRelay/RelayMessagePartsAcrossReplicas`	2026-03-02 19:57:13 -05:00
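The drain fix above waits for the first event instead of giving up immediately. The sketch below is a two-phase variant rather than the commit's `Reset(0)` approach: it waits on a timer for the first event, then drains with a `default` case. The function name matches the message, but the signature and string events are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// drainInitial collects the initial snapshot from events. It waits up to
// firstWait for the first event, then takes whatever is already buffered.
func drainInitial(events <-chan string, firstWait time.Duration) []string {
	var snapshot []string
	timer := time.NewTimer(firstWait)
	defer timer.Stop()

	// Phase 1: block until the remote has flushed something, or time out.
	select {
	case ev, ok := <-events:
		if !ok {
			return snapshot
		}
		snapshot = append(snapshot, ev)
	case <-timer.C:
		return snapshot
	}

	// Phase 2: non-blocking drain of the remaining buffered events.
	for {
		select {
		case ev, ok := <-events:
			if !ok {
				return snapshot
			}
			snapshot = append(snapshot, ev)
		default:
			return snapshot
		}
	}
}

// demoBuffered drains a channel that already holds three events.
func demoBuffered() int {
	ch := make(chan string, 8)
	ch <- "part-1"
	ch <- "part-2"
	ch <- "part-3"
	return len(drainInitial(ch, time.Second))
}

// demoEmpty drains a channel that never receives anything.
func demoEmpty() int {
	ch := make(chan string)
	return len(drainInitial(ch, 10*time.Millisecond))
}

func main() {
	fmt.Println(demoBuffered(), demoEmpty())
}
```

A bare `default` in phase 1 would reproduce the original bug: the snapshot comes back empty whenever the remote has not flushed yet.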
Kyle Carberry	5eebd3829f	fix: use cursor-based query for chat stream notifications (#22510 ) ## Problem The pubsub notification handler in `chatd` re-fetched all messages from the DB on every new message notification, then filtered in Go with `msg.ID > lastMessageID`. This grows linearly with conversation length — every new message triggers a full table scan of that chat's history. The `AfterMessageID` field in the pubsub notification payload was clearly designed for cursor-based fetching, but no matching query existed. ## Fix - Add `GetChatMessagesByChatIDAfter` SQL query with `WHERE id > @after_id`, so the database does the filtering instead of Go. - Use it in the pubsub notification handler in `chatd.go`, passing `lastMessageID` as the cursor. - Implement the dbauthz wrapper (was a `panic("not implemented")` stub from codegen) with the same read-check-on-parent-chat pattern as adjacent methods. - Add dbauthz test coverage for the new method. Not changed: The initial snapshot in `Subscribe()` still loads all messages — that's correct, since a newly-connecting client needs the full conversation state. The waste was only in the ongoing notification path.	2026-03-02 16:31:04 -05:00
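The cursor pattern from the commit above can be shown with an in-memory model. The `store` and `subscriber` types are hypothetical. `messagesAfter` stands in for `GetChatMessagesByChatIDAfter`, with a binary search over an ID-ordered slice in place of the database's `WHERE id > @after_id`.

```go
package main

import (
	"fmt"
	"sort"
)

type message struct {
	ID   int64
	Text string
}

// store is a hypothetical in-memory table, kept in ascending ID order.
type store struct {
	messages []message
}

// messagesAfter returns only rows with ID greater than afterID.
func (s *store) messagesAfter(afterID int64) []message {
	i := sort.Search(len(s.messages), func(i int) bool {
		return s.messages[i].ID > afterID
	})
	return s.messages[i:]
}

// subscriber tracks the cursor and what it has delivered so far.
type subscriber struct {
	lastMessageID int64
	delivered     []int64
}

// onNotify fetches past the cursor and advances it, so repeated
// notifications never re-deliver a message.
func (sub *subscriber) onNotify(s *store) {
	for _, m := range s.messagesAfter(sub.lastMessageID) {
		sub.delivered = append(sub.delivered, m.ID)
		sub.lastMessageID = m.ID
	}
}

// demoCursor delivers three messages across three notifications.
func demoCursor() []int64 {
	s := &store{}
	sub := &subscriber{}
	s.messages = append(s.messages, message{ID: 1, Text: "hi"}, message{ID: 2, Text: "hello"})
	sub.onNotify(s)
	s.messages = append(s.messages, message{ID: 3, Text: "more"})
	sub.onNotify(s)
	sub.onNotify(s) // a duplicate notification fetches nothing new
	return sub.delivered
}

func main() {
	fmt.Println(demoCursor())
}
```

Each notification now costs work proportional to the number of new messages instead of the length of the whole conversation.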
Kyle Carberry	1c71fd69f6	fix: workspace auto-refresh during the chat flow (#22447 )	2026-02-28 19:07:17 -05:00
Kyle Carberry	2abe55549c	fix: return in-flight chats to pending on server shutdown (#22443 ) When a chatd server shuts down (`Close()`), the server context is canceled. Previously, in-flight chats would be marked as `error` because the `context.Canceled` error was not distinguished from actual processing failures. This adds `isShutdownCancellation()` to detect when the error is caused by the server context being canceled (as opposed to a chat-specific cancellation like `ErrInterrupted`). When detected, the chat status is set to `pending` with no `last_error`, allowing another replica to pick it up and retry. Extracted from #22440 — only the context cancellation bug fix, no chattest changes.	2026-02-28 17:14:11 -05:00
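One plausible shape for the shutdown check described above is shown below. The body of `isShutdownCancellation` is an assumption; the commit message gives only its name and purpose. `errInterrupted` and `statusAfter` are hypothetical stand-ins used to show how the result maps to a chat status.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errInterrupted is a hypothetical stand-in for the package's ErrInterrupted.
var errInterrupted = errors.New("chat interrupted")

// isShutdownCancellation reports whether err is a cancellation caused by the
// server context ending, rather than a chat-specific cancellation or failure.
func isShutdownCancellation(serverCtx context.Context, err error) bool {
	return errors.Is(err, context.Canceled) && serverCtx.Err() != nil
}

// statusAfter maps a processing result to the chat status to persist.
func statusAfter(serverCtx context.Context, err error) string {
	switch {
	case err == nil, errors.Is(err, errInterrupted):
		return "waiting"
	case isShutdownCancellation(serverCtx, err):
		return "pending" // another replica can pick the chat up
	default:
		return "error"
	}
}

// demoShutdown processes a wrapped context.Canceled after server shutdown.
func demoShutdown() string {
	serverCtx, cancel := context.WithCancel(context.Background())
	cancel()
	return statusAfter(serverCtx, fmt.Errorf("stream: %w", context.Canceled))
}

// demoCanceledWhileServerAlive shows that cancellation alone is not enough.
func demoCanceledWhileServerAlive() string {
	return statusAfter(context.Background(), context.Canceled)
}

// demoFailure processes an ordinary error on a live server.
func demoFailure() string {
	return statusAfter(context.Background(), errors.New("provider returned 500"))
}

func main() {
	fmt.Println(demoShutdown(), demoCanceledWhileServerAlive(), demoFailure())
}
```

Checking the server context as well as the error keeps a canceled request on a healthy server from being silently retried.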
Kyle Carberry	c5619746d1	fix(chat): fix stream state discrepancies between frontend and backend (#22437 ) ## Summary Fixes four frontend↔backend discrepancies in chat stream state management that could cause duplicate content, UI flicker, and stale stream state. ### Backend fixes (`coderd/chatd/chatd.go`) 1. No-pubsub path double-replayed message_part events `Subscribe()` built an `initialSnapshot` containing `message_part` events from `localSnapshot`, then the no-pubsub goroutine replayed the same `localSnapshot` into the `mergedEvents` channel. Since `streamChat` sends the snapshot first then reads the channel, the frontend received every `message_part` twice. `applyMessagePartToStreamState` doesn't deduplicate — text gets concatenated, so content appeared doubled. Fix: Only forward live `localParts` in the no-pubsub goroutine; the snapshot already contains the historical events. 2. Snapshot missing status event The initial snapshot never included a `status` event. The frontend's `shouldApplyMessagePart()` gates on status (`pending`/`waiting`), but the initial status came from a separate REST query via `useEffect`. During the race window between snapshot arrival and REST resolution, `message_part` events could be incorrectly accepted or rejected. Fix: Prepend a `status` event to the snapshot after loading the chat from DB, so the frontend has the authoritative status from the very first batch. ### Frontend fixes (`ChatContext.ts`) 3. Scheduled stream reset not canceled by subsequent message_parts When a `message` event arrived, `scheduleStreamReset()` queued `clearStreamState` via `requestAnimationFrame`. If new `message_part` events arrived in the next WebSocket frame before the rAF fired, they were pushed to `pendingMessageParts` without canceling the scheduled reset. The rAF would fire between frames, clearing stream state, then the next flush would re-populate it — causing a visible flash. 
Fix: Call `cancelScheduledStreamReset()` when accumulating `message_part` events. 4. startTransition race with synchronous clearStreamState `flushMessageParts` wrapped `applyMessageParts` in `startTransition`, which React can defer. If a `status: "waiting"` event arrived in the same batch after `message_part` events, the status handler cleared stream state synchronously, but the deferred `applyMessageParts` callback could fire afterward and re-populate it. Fix: Re-check `shouldApplyMessagePart()` inside the `startTransition` callback at execution time. ### Tests added - Go: `TestSubscribeSnapshotIncludesStatusEvent` — asserts the first snapshot event is a status event - Go: `TestSubscribeNoPubsubNoDuplicateMessageParts` — asserts the events channel doesn't replay snapshot events - TS: `cancels scheduled stream reset when message_part arrives after message` — verifies stream state survives a [message, message_part] batch - TS: `does not apply message parts after status changes to waiting` — verifies deferred applyMessageParts respects status transitions	2026-02-28 13:35:23 -05:00
Kyle Carberry	0ad2f9ecd7	feat(chatd): persist last_error on chats table (#22436 ) Adds a nullable `last_error` column to the `chats` table so error reasons survive page reloads. Backend: - Migration adds `last_error TEXT` (nullable) to chats - `UpdateChatStatus` writes the error reason when status transitions to `error`, clears it (NULL) on recovery - `convertChat` maps `sql.NullString` to `string` in the SDK Frontend: - Sidebar falls back to `chat.last_error` when no stream error reason is cached - Chat detail page does the same for `persistedErrorReason` - Fixtures updated for new required field	2026-02-28 12:27:26 -05:00
Kyle Carberry	f509c841cf	fix(chatd): recover stale chats after coderd redeployment (#22405 ) ## Problem When coderd instances are redeployed (e.g. rolling deployment on dogfood), in-flight chats get stuck in `running` status permanently. The UI shows them as "thinking" with a spinning indicator, but no worker is actually processing them. They never error or resume. ## Root Cause Two bugs combine to cause this: ### Bug 1: Shutdown cleanup uses a canceled context The `processChat` defer block updates the chat status in the DB when processing completes. But it uses `ctx`, which `Close()` cancels before the defer runs. The DB transaction silently fails with `context.Canceled`, leaving the chat in `status=running` with a dead `worker_id`. ```go // Close() calls p.cancel() which cancels ctx // Then the defer tries to use the now-canceled ctx: defer func() { err := p.db.InTx(func(tx database.Store) error { tx.GetChatByIDForUpdate(ctx, chat.ID) // FAILS tx.UpdateChatStatus(ctx, ...) // FAILS }, nil) }() ``` ### Bug 2: Stale recovery runs only once at startup `recoverStaleChats()` was called only once in `start()`, not periodically. During a rolling deployment, the new instance starts while the old one is still alive (fresh heartbeat). By the time the old instance crashes, no one checks again. ## Fix 1. Use `context.WithoutCancel(ctx)` in the processChat defer — the cleanup transaction now completes even during graceful shutdown. 2. Run `recoverStaleChats` periodically — a second ticker in the `start()` loop checks for stale chats at `inFlightChatStaleAfter / 5` intervals (default: every 1 minute). This catches orphaned chats even when the instance that owns them crashes without clean shutdown. ## Tests - `TestRecoverStaleChatsPeriodically` — Verifies chats orphaned after startup are recovered by the periodic loop (not just the startup check). - `TestNewReplicaRecoversStaleChatFromDeadReplica` — Verifies a new replica recovers stale chats on startup. 
- `TestWaitingChatsAreNotRecoveredAsStale` — Negative test: `waiting` chats are not incorrectly modified by recovery.	2026-02-27 15:25:40 -05:00
Kyle Carberry	edee917d88	feat: add experimental agents support (#22290 ) feat: add AI chat system with agent tools and chat UI Introduce the chatd subsystem and Agents UI for AI-powered chat within Coder workspaces. - Add chatd package with chat loop, message compaction, prompt management, and LLM provider integration (OpenAI, Anthropic) - Add agent tools: create workspace, list/read templates, read/write/ edit files, execute commands - Add chat API endpoints with streaming, message editing, and durable reconnection - Add database schema and migrations for chats, chat messages, chat providers, and chat model configs - Add RBAC policies and dbauthz enforcement for chat resources - Add Agents UI pages with conversation timeline, queued messages list, diff viewer, and model configuration panel - Add comprehensive test coverage including coderd integration tests, chatd unit tests, and Storybook stories - Gate feature behind experiments flag --------- Co-authored-by: Cian Johnston <cian@coder.com> Co-authored-by: Danielle Maywood <danielle@themaywoods.com> Co-authored-by: Jeremy Ruppel <jeremy@coder.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-27 16:50:56 +00:00

44 Commits