coder

mirror of https://github.com/coder/coder.git synced 2026-06-04 13:38:21 +00:00

Author	SHA1	Message	Date
Kyle Carberry	5b1cf4a6a3	fix(chatd): start stream buffering before publishing running status (#22571 ) ## Problem There is a race condition in the chat stream reconnect path. When a client connects (or reconnects) to `/stream`, sometimes they only see a `status: running` event but never receive any `message_part` events — the stream appears stuck. ## Root Cause In `processChat`, the sequence is: 1. `publishStatus(running)` — broadcasts `status: running` to all subscribers and via pubsub. 2. `runChat()` is called. 3. Inside `runChat`, there's significant setup work (model resolution, DB queries, title generation, prompt building, instruction resolution). 4. Only after all that setup does `runChat` set `buffering = true` on the stream state. If a client connects to `/stream` between steps 1 and 4: - `Subscribe()` reads `chat.Status == running` from the DB, so it includes `status: running` in the snapshot. - But `buffering` is still `false`, so `subscribeToStream` returns an empty local snapshot (no message_parts). - `publishToStream` drops all `message_part` events when `buffering` is false. - Result: client sees `running` but never gets any streaming content. ## Fix Move the `buffering = true` setup (and its deferred cleanup) from `runChat` into `processChat`, right before `publishStatus(running)`. This guarantees the buffer is active before any subscriber can observe `status: running`, so: - The snapshot always includes any in-flight `message_part` events. - `publishToStream` never drops parts because buffering is already on.	2026-03-03 21:27:59 +00:00
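The ordering fix described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the chatd code: the `streamState` type and its `startBuffering`, `stopBuffering`, `publishPart`, and `snapshot` methods are hypothetical names chosen for this sketch. It only shows why parts published before buffering is enabled are lost, and why enabling buffering before announcing `status: running` closes that window.

```go
package main

import (
	"fmt"
	"sync"
)

// streamState is a hypothetical stand-in for the per-chat stream state.
type streamState struct {
	mu        sync.Mutex
	buffering bool
	parts     []string
}

// startBuffering enables retention of message parts.
func (s *streamState) startBuffering() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.buffering = true
}

// stopBuffering disables retention and discards buffered parts.
func (s *streamState) stopBuffering() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.buffering = false
	s.parts = nil
}

// publishPart stores a part only while buffering is on.
// It reports whether the part was kept.
func (s *streamState) publishPart(p string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.buffering {
		return false // dropped, as in the race described above
	}
	s.parts = append(s.parts, p)
	return true
}

// snapshot returns a copy of the buffered parts for a new subscriber.
func (s *streamState) snapshot() []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return append([]string(nil), s.parts...)
}

func main() {
	s := &streamState{}

	// Old order: a part arriving before buffering is enabled is dropped.
	fmt.Println("kept before buffering:", s.publishPart("early"))

	// New order: enable buffering first, then announce "running".
	s.startBuffering()
	defer s.stopBuffering()
	fmt.Println("status: running")
	fmt.Println("kept after buffering:", s.publishPart("hello"))
	fmt.Println("snapshot:", s.snapshot())
}
```

In this sketch a subscriber that takes a snapshot after seeing `status: running` always finds buffering already enabled, which is the guarantee the commit message describes.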
Kyle Carberry	059ed7ab5c	fix(chatd): return chat to pending when server shuts down during successful completion (#22559 ) ## Problem Flaky test: `TestCloseDuringShutdownContextCanceledShouldRetryOnNewReplica` (coder/internal#1371) The test intermittently fails because the chat ends up in `waiting` status instead of `pending` after server shutdown. ## Root Cause There is a race condition in `processChat` where `runChat` completes successfully just as the server context is being canceled during `Close()`. The sequence: 1. Server calls `Close()`, canceling the server context. 2. The LLM HTTP response has already been fully written by the mock server (the stream closes normally before context cancellation propagates to the HTTP client). 3. `runChat` returns `nil` (success) instead of `context.Canceled`. 4. The existing `isShutdownCancellation` check only runs when `runChat` returns an error, so the shutdown is not detected. 5. `processChat`'s deferred cleanup marks the chat as `waiting` instead of `pending`. 6. The test's assertion that the chat is `pending` never becomes true. This race is timing-dependent — it only triggers when the mock server's HTTP response completes in the narrow window between context cancellation being initiated and it propagating through the HTTP transport layer. ## Fix Add a server context check after `runChat` returns successfully. If the server is shutting down (`ctx.Err() != nil`), override the status to `pending` so another replica can pick up the chat. This is the same pattern already used for the error path (`isShutdownCancellation`), extended to cover the success path.	2026-03-03 11:34:08 -05:00
Kyle Carberry	56f95a3e6d	fix: scope git askpass diff status updates to initiating chat (#22534 ) ## Problem When the git askpass flow triggered diff status refreshes, it updated every chat connected to the workspace. This was wasteful and could cause confusing status updates on unrelated chats. ## Solution Thread the chat ID through the entire git askpass flow so only the chat that initiated the git operation gets updated: 1. `coderd/chatd/chattool/execute.go` — Sets `CODER_CHAT_ID` env var on spawned processes (alongside the existing `CODER_CHAT_AGENT`) 2. `cli/gitaskpass.go` — Reads `CODER_CHAT_ID` from the environment and sends it as a `chat_id` query parameter in the `ExternalAuthRequest` 3. `codersdk/agentsdk/agentsdk.go` — Adds `ChatID` field to `ExternalAuthRequest` and encodes it as a query param 4. `coderd/workspaceagents.go` — Parses `chat_id` query param and passes it through to `storeChatGitRef` and `triggerWorkspaceChatDiffStatusRefresh` 5. `coderd/chats.go` — `storeChatGitRef` and `refreshWorkspaceChatDiffStatuses` now scope updates to just the initiating chat when a chat ID is provided, falling back to all-workspace-chats behavior for backwards compatibility (non-chat git operations)	2026-03-02 22:52:39 -05:00
Kyle Carberry	b7a7683ac0	fix(chatd): harden cross-replica relay for chat stream parts (#22533 ) ## Problem Subscribers connecting to a different replica than the one running the chat see full messages appear but no streaming partials (`message_part` events). The relay mechanism that forwards ephemeral parts across replicas had several bugs. ## Root Causes 1. `openRelay()` blocked the event loop — The WebSocket dial (TCP + TLS + HTTP upgrade) to the worker replica ran synchronously inside the select loop. While dialing, no events could be processed, channels filled up, and parts were silently dropped. 2. Relay drops were permanent — When the relay WebSocket closed mid-stream, `relayParts` was set to nil and never reopened. No status notification would re-trigger it since the chat was still running on the same worker. 3. `drainInitial` snapshot race — The `default` case in the initial drain loop caused the snapshot to be empty if the remote hadn't flushed data yet (common immediately after WebSocket connect). 4. Duplicate event delivery — The `preloaded` slice caused snapshot events to be sent both in the return value and re-sent through the channel goroutine. ## Fixes ### `coderd/chatd/chatd.go` (Subscribe method) - Async relay dial: `openRelayAsync()` spawns a goroutine to dial the remote replica. The result (channel + cancel func) is delivered on a `relayReadyCh` channel that the select loop reads without blocking. - Relay reconnection: When the relay channel closes, a 500ms timer fires. The handler re-checks chat status from the DB and reopens the relay if the chat is still running on a remote worker. - Snapshot parts via channel: Relay snapshot + live parts are wrapped into a single channel so they flow through the same path, avoiding races with the select loop. ### `enterprise/coderd/chats.go` (newRemotePartsProvider) - Timer-based drain: Replaced `default` with a 1-second timer. 
After the first event, `Reset(0)` switches to non-blocking drain for remaining buffered events. - Remove preloaded duplication: The goroutine now only forwards new events; snapshot events are returned to the caller directly. ## Testing All existing tests pass: - `TestInterruptChatBroadcastsStatusAcrossInstances` - `TestSubscribeSnapshotIncludesStatusEvent` - `TestSubscribeNoPubsubNoDuplicateMessageParts` - `TestSubscribeAfterMessageID` - `TestChatStreamRelay/RelayMessagePartsAcrossReplicas`	2026-03-02 19:57:13 -05:00
Kyle Carberry	ddfe630757	refactor(chatd): replace fantasy.Agent with custom agent loop (#22507 ) ## Summary Replaces fantasy's `Agent` abstraction with a direct step loop calling `LanguageModel.Stream()`. Fantasy is retained as the provider abstraction layer (streaming parsers, types, tool schema) but we no longer use `fantasy.Agent`, `AgentStreamCall`, `AgentResult`, or `StepResult`. ## Problems solved \| Problem \| Before \| After \| \|---\|---\|---\| \| Sentinel prompt hack \| fantasy.Agent requires non-empty Prompt → UUID sentinel generated and stripped in PrepareStep \| Messages passed directly to `model.Stream()` \| \| Discarded PersistStep errors \| `_ = opts.OnStepFinish(result)` silently swallows errors \| Errors propagate directly from `PersistStep()` \| \| Shadow draft state \| ~160 LOC tracking content in parallel because fantasy doesn't expose in-progress content on interruption \| `stepResult` owns content directly; `flushActiveState()` is trivial \| \| Nested retry layers \| fantasy's 2-attempt retry nested inside chatretry's indefinite retry \| Single `chatretry.Retry` layer \| \| Callback-mediated compaction \| Mutex + boolean flag + coordination between OnStepFinish/PrepareStep callbacks \| Inline `if` statement between steps \| \| Duplicate compaction paths \| `compactStep()` + `maybeCompact()` sharing ~80% logic \| Single `tryCompact()` function \| ## Changes ### `coderd/chatd/chatloop/chatloop.go` — Rewritten - Removed: `fantasy.NewAgent()`, `AgentStreamCall`, sentinel prompt, shadow draft state (~160 LOC of closures), `compactedMu`/`compacted` flag, `PrepareStepResult` - Added: `stepResult` struct, `processStepStream()` (stream consumer), `executeTools()` (sequential tool execution), `flushActiveState()` (interrupt handling), `buildToolDefinitions()`, `toResponseMessages()` - Changed: `Run()` return type from `(fantasy.AgentResult, error)` to `error` (callers already discarded the result) - Preserved: Anthropic prompt caching, reasoning title extraction, `extractContextLimit()`, `ErrInterrupted` semantics ### `coderd/chatd/chatloop/compaction.go` — Simplified - Merged `compactStep()` + `maybeCompact()` → single `tryCompact()` - Removed `[]StepResult` parameter from `generateCompactionSummary()` (caller provides complete message list) - Kept helper functions: `normalizedCompactionConfig`, `contextTokensFromUsage`, `resolveContextLimit`, `shouldCompact` ### `coderd/chatd/chatd.go` — Caller updates - Removed `AgentStreamCall` construction - Changed `_, err = chatloop.Run(...)` to `err = chatloop.Run(...)` - Model parameters moved from `AgentStreamCall` fields to `RunOptions` fields ### Tests — 4 new tests - `MidLoopCompactionReloadsMessages` — compaction fires mid-loop, messages reloaded - `PostRunCompactionSkippedAfterMidLoop` — no double compaction - `MultiStepToolExecution` — tools execute between steps, results feed next step - `PersistStepErrorPropagates` — persistence errors propagate (was silently discarded)	2026-03-02 18:51:57 -05:00
Kyle Carberry	5eebd3829f	fix: use cursor-based query for chat stream notifications (#22510 ) ## Problem The pubsub notification handler in `chatd` re-fetched all messages from the DB on every new message notification, then filtered in Go with `msg.ID > lastMessageID`. This grows linearly with conversation length — every new message triggers a full table scan of that chat's history. The `AfterMessageID` field in the pubsub notification payload was clearly designed for cursor-based fetching, but no matching query existed. ## Fix - Add `GetChatMessagesByChatIDAfter` SQL query with `WHERE id > @after_id`, so the database does the filtering instead of Go. - Use it in the pubsub notification handler in `chatd.go`, passing `lastMessageID` as the cursor. - Implement the dbauthz wrapper (was a `panic("not implemented")` stub from codegen) with the same read-check-on-parent-chat pattern as adjacent methods. - Add dbauthz test coverage for the new method. Not changed: The initial snapshot in `Subscribe()` still loads all messages — that's correct, since a newly-connecting client needs the full conversation state. The waste was only in the ongoing notification path.	2026-03-02 16:31:04 -05:00
Kyle Carberry	7aef0bf25e	fix(chatd): increase title generation timeout from 10s to 30s (#22501 ) ## Problem Production logs frequently show: ``` [debu] coderd.chats.chat-processor: failed to generate chat title error= generate title text: context deadline exceeded ``` ## Root Cause The title generation timeout in `maybeGenerateChatTitle` is 10 seconds. Many LLM providers routinely exceed this under load (cold starts, rate limits, large models). Since `chatretry` classifies `context deadline exceeded` as non-retryable, the first timeout kills the entire attempt with no retry. ## Fix Increase the timeout from 10s to 30s. Title generation is async and best-effort — it runs in a background goroutine and doesn't block the chat response — so a longer timeout has no user-facing impact.	2026-03-02 14:11:25 -05:00
Kyle Carberry	a33ca95df2	fix(chatd): prevent chat re-acquisition during server shutdown (#22497 ) Fixes https://github.com/coder/internal/issues/1371 ## Problem `TestCloseDuringShutdownContextCanceledShouldRetryOnNewReplica` flakes intermittently in CI. The observed failure is that the chat never reaches `pending` status after `serverA.Close()`. ## Root cause Race between context cancellation and the mock OpenAI server's stream completion marker. When `Close()` cancels the server context, the in-flight HTTP streaming request is canceled. The mock server's handler detects this via `req.Context().Done()` and closes its chunks channel. The mock's `writeChatCompletionsStreaming` then writes `data: [DONE]` — the SSE completion marker. On a loopback connection, this marker can reach the client before the client's HTTP transport honors the context cancellation. When this happens: 1. The client sees a successful stream completion (not an error) 2. `chatloop.Run` returns `nil` 3. `processChat` falls through without error → status stays `waiting` (the default) 4. The test expects `pending` → flake ## Fix Skip writing the `[DONE]` marker when the request context is already canceled, in both `writeChatCompletionsStreaming` and `writeResponsesAPIStreaming`.	2026-03-02 18:00:21 +00:00
Kyle Carberry	0908505348	fix(chats): archive chat tree with single query instead of loop (#22496 ) ## Problem When archiving an agent with subagents, the children briefly flash in the sidebar as root-level items before disappearing. Two issues: 1. Backend: Archive used N+1 queries — a recursive DFS (`archiveChatTree`, no transaction) or BFS loop (`chatd.ArchiveChat`, N+1 queries in a tx) to walk the tree and archive each chat individually. 2. Frontend: The SSE `deleted` event handler only filtered out the parent chat from the cache. Children remained briefly, got promoted to root-level by `buildChatTree`, then disappeared on the next re-fetch. ## Fix Backend: Replace both tree-walk implementations with a single SQL query: ```sql UPDATE chats SET archived = true, updated_at = NOW() WHERE id = @id OR root_chat_id = @id; ``` This leverages the existing `root_chat_id` column (already indexed) to archive the entire tree atomically. Frontend: When a `deleted` event arrives, also filter out any chats whose `root_chat_id` matches the deleted chat, so children vanish from the sidebar immediately with the parent. ## Changes - `coderd/database/queries/chats.sql` — Added `ArchiveChatTreeByID` query - `coderd/chats.go` — Use single query, delete `archiveChatTree` function - `coderd/chatd/chatd.go` — Simplify `ArchiveChat` to use single query - `coderd/database/dbauthz/dbauthz.go` — Auth wrapper for new query - `coderd/chats_test.go` — Added `TestArchiveChat/ArchivesChildren` subtest - `site/src/pages/AgentsPage/AgentsPage.tsx` — Filter children in SSE handler - Generated files updated via `make gen`	2026-03-02 12:00:00 -05:00
Cian Johnston	a62f2fbfc4	feat(rbac): add AsChatd subject to replace AsSystemRestricted in chatd (#22487 ) Add a new SubjectTypeChatd RBAC subject with minimal permissions: - Chat: CRUD - Workspace: Read - DeploymentConfig: Read Replace all 10 AsSystemRestricted calls in coderd/chatd/chatd.go: - Line 890: Use AsChatd instead of AsSystemRestricted for the background processor context. - Subscribe() path (5 calls): Remove system escalation entirely; these run under the authenticated user's context from the HTTP handler. - processChat path (4 calls): Remove redundant per-call wraps; the context already carries AsChatd from the processor start. Add TestAsChatd verifying allowed and denied actions. Created using Mux (Opus 4.6)	2026-03-02 15:57:04 +00:00
Kyle Carberry	c9ed1e17fc	feat(agents): add desktop notifications via VAPID web push (#22454 ) ## Summary Wire VAPID web push notifications into the Agents (chat) system so users get desktop notifications when an agent finishes running. ### Backend - Add `webpush.Dispatcher` to `chatd.Server` and pass it through from `coderd.Options.WebPushDispatcher` - In `processChat()`'s deferred cleanup, dispatch a web push notification when the chat reaches a terminal state: - `waiting` (success): "Agent has finished running." - `error` (failure): the error message, or "Agent encountered an error." - Sub-agent chats (`ParentChatID.Valid`) are skipped to avoid notification spam from internal delegation - Gracefully no-ops when the dispatcher is nil (web push disabled) ### Frontend - New `WebPushButton` component — a bell icon that uses the existing `useWebpushNotifications` hook - Returns `null` when the `web-push` experiment is off - Three states: loading spinner, green bell (subscribed), muted bell-off (unsubscribed) - Tooltip + toast feedback on toggle - Added to both the Agents page empty state top bar and the AgentDetail top bar - The Agents page has its own layout (no standard Navbar), so it needs its own subscribe button ### End-to-end flow 1. User clicks the bell icon on `/agents` → browser subscribes via VAPID 2. User starts an agent chat → chat enters `running` status 3. Agent finishes → `processChat` defer sets status to `waiting`/`error` → dispatches web push 4. Browser service worker shows a desktop notification with the chat title and status --------- Co-authored-by: Coder <coder@users.noreply.github.com>	2026-02-28 23:40:17 -05:00
Kyle Carberry	533b90a3a4	fix: resolve chat title update race conditions and improve resilience (#22450 ) ## Problem Chat titles sometimes don't update in the UI. The generated AI title gets stuck as the fallback (first 6 words of the message) even though the backend successfully generates a proper title. ## Root Causes ### 1. Cancelable context used during cleanup DB read (P0) In `processChat`, the deferred cleanup re-reads the chat from the DB to pick up the AI-generated title for the `status_change` pubsub event. But it used the cancelable `ctx` instead of `cleanupCtx`: ```go // Before — ctx may already be canceled here if freshChat, readErr := p.db.GetChatByID(ctx, chat.ID); readErr == nil { ``` When the context is canceled, the DB read fails silently and the `status_change` event carries the stale fallback title. ### 2. Title goroutine not tracked by inflight WaitGroup (P2) The `maybeGenerateChatTitle` goroutine was fire-and-forget — not tracked by `p.inflight`. During graceful shutdown, the server could exit before the goroutine completes its DB write or pubsub publish. ### 3. No recovery when watchChats() WebSocket misses events The frontend relies entirely on the `watchChats()` SSE connection for title updates. If the connection drops or misses events, titles never recover — the only fix was a full page reload. ## Fixes 1. Use `cleanupCtx` for the `GetChatByID` call and logger in the deferred cleanup block. 2. Track the title goroutine with `p.inflight.Add(1)` / `defer p.inflight.Done()` so shutdown waits for it. 3. Invalidate chats query on WebSocket open/close/error events so missed updates are recovered via refetch. Also enable `refetchOnWindowFocus` for the chats query. Co-authored-by: Coder <coder@users.noreply.github.com>	2026-02-28 21:38:16 -05:00
Kyle Carberry	1c71fd69f6	fix: workspace auto-refresh during the chat flow (#22447 )	2026-02-28 19:07:17 -05:00
Kyle Carberry	2abe55549c	fix: return in-flight chats to pending on server shutdown (#22443 ) When a chatd server shuts down (`Close()`), the server context is canceled. Previously, in-flight chats would be marked as `error` because the `context.Canceled` error was not distinguished from actual processing failures. This adds `isShutdownCancellation()` to detect when the error is caused by the server context being canceled (as opposed to a chat-specific cancellation like `ErrInterrupted`). When detected, the chat status is set to `pending` with no `last_error`, allowing another replica to pick it up and retry. Extracted from #22440 — only the context cancellation bug fix, no chattest changes.	2026-02-28 17:14:11 -05:00
Kyle Carberry	22d4539a7a	fix(chatd): clear stream buffer after each step is persisted (#22445 ) The in-memory stream buffer accumulated message-part events for the entire duration of a chat run. Late-joining subscribers received all buffered parts even though the backing messages had already been committed to the database, wasting memory and potentially duplicating content. Clear the buffer at the end of each `persistStep` call so that only in-flight (uncommitted) parts remain in the buffer.	2026-02-28 16:51:04 -05:00
Kyle Carberry	34d9392e37	chore(db): remove workspace_agent_id from chats table (#22442 ) ## Summary Remove the `workspace_agent_id` column from the `chats` table and dynamically look up the first workspace agent instead. ## Problem When a workspace is stopped and restarted, the workspace agent gets a new ID. The `workspace_agent_id` stored on the chat at creation time becomes stale, making the agent unreachable. This caused chats to break after workspace restarts. ## Solution Instead of persisting the agent ID, dynamically look up the first agent from the workspace's latest build via `GetWorkspaceAgentsInLatestBuildByWorkspaceID` whenever an agent connection is needed. The `workspace_id` on the chat remains stable across restarts. This behavior may be refined later (e.g., agent selection heuristics), but picking the first agent resolves the immediate breakage. ## Changes - Migration 000425: Drop `workspace_agent_id` column from `chats` - SQL queries: Remove `workspace_agent_id` from `InsertChat` and `UpdateChatWorkspace` - chatd.go: `getWorkspaceConn` and `resolveInstructions` now look up agents dynamically from workspace ID - chatd.go: Remove `refreshChatWorkspaceSnapshot` (no longer needed) - createworkspace.go: Stop persisting agent ID when associating workspace with chat - subagent.go: Stop passing agent ID to child chats - SDK/frontend: Remove `WorkspaceAgentID` / `workspace_agent_id` from Chat type --------- Co-authored-by: Kyle Carberry <kylecarbs@gmail.com>	2026-02-28 16:46:51 -05:00
Kyle Carberry	c316d0a3e7	fix(chatd): improve subagent tool descriptions and strip tools from child agents (#22441 ) Two changes: 1. Gate subagent tools behind `!chat.ParentChatID.Valid` so child agents never receive `spawn_agent`, `wait_agent`, `message_agent`, or `close_agent`. Previously all 4 tools were given to every chat. `spawn_agent` would fail at runtime ("delegated chats cannot create child subagents") but the other 3 had no guard at all — meaning a child could theoretically operate on sibling chats. Removing the tools entirely is cleaner and saves context window. 2. Rewrite tool descriptions to explain when to use them, not just what they do. `spawn_agent` now says to use it for clearly scoped, independent, self-contained tasks (e.g. fixing a specific bug, writing a single module, running a migration) and explicitly says not to use it for simple operations you can handle with `execute`/`read_file`/`write_file`. It also states that child agents cannot spawn their own subagents. The other 3 tools get similar guidance-oriented descriptions. Co-authored-by: Coder <coder@users.noreply.github.com>	2026-02-28 16:30:25 -05:00
Kyle Carberry	c5619746d1	fix(chat): fix stream state discrepancies between frontend and backend (#22437 ) ## Summary Fixes four frontend↔backend discrepancies in chat stream state management that could cause duplicate content, UI flicker, and stale stream state. ### Backend fixes (`coderd/chatd/chatd.go`) 1. No-pubsub path double-replayed message_part events `Subscribe()` built an `initialSnapshot` containing `message_part` events from `localSnapshot`, then the no-pubsub goroutine replayed the same `localSnapshot` into the `mergedEvents` channel. Since `streamChat` sends the snapshot first then reads the channel, the frontend received every `message_part` twice. `applyMessagePartToStreamState` doesn't deduplicate — text gets concatenated, so content appeared doubled. Fix: Only forward live `localParts` in the no-pubsub goroutine; the snapshot already contains the historical events. 2. Snapshot missing status event The initial snapshot never included a `status` event. The frontend's `shouldApplyMessagePart()` gates on status (`pending`/`waiting`), but the initial status came from a separate REST query via `useEffect`. During the race window between snapshot arrival and REST resolution, `message_part` events could be incorrectly accepted or rejected. Fix: Prepend a `status` event to the snapshot after loading the chat from DB, so the frontend has the authoritative status from the very first batch. ### Frontend fixes (`ChatContext.ts`) 3. Scheduled stream reset not canceled by subsequent message_parts When a `message` event arrived, `scheduleStreamReset()` queued `clearStreamState` via `requestAnimationFrame`. If new `message_part` events arrived in the next WebSocket frame before the rAF fired, they were pushed to `pendingMessageParts` without canceling the scheduled reset. The rAF would fire between frames, clearing stream state, then the next flush would re-populate it — causing a visible flash. 
Fix: Call `cancelScheduledStreamReset()` when accumulating `message_part` events. 4. startTransition race with synchronous clearStreamState `flushMessageParts` wrapped `applyMessageParts` in `startTransition`, which React can defer. If a `status: "waiting"` event arrived in the same batch after `message_part` events, the status handler cleared stream state synchronously, but the deferred `applyMessageParts` callback could fire afterward and re-populate it. Fix: Re-check `shouldApplyMessagePart()` inside the `startTransition` callback at execution time. ### Tests added - Go: `TestSubscribeSnapshotIncludesStatusEvent` — asserts the first snapshot event is a status event - Go: `TestSubscribeNoPubsubNoDuplicateMessageParts` — asserts the events channel doesn't replay snapshot events - TS: `cancels scheduled stream reset when message_part arrives after message` — verifies stream state survives a [message, message_part] batch - TS: `does not apply message parts after status changes to waiting` — verifies deferred applyMessageParts respects status transitions	2026-02-28 13:35:23 -05:00
Kyle Carberry	a621c3cb13	feat(agent): add process execution API and rewrite execute tool (#22416 ) ## Summary Adds a new agent-side process management HTTP API and rewrites the chat execute tool to use it instead of SSH sessions. ## What changed ### New agent/agentproc/ package - headtail.go — Thread-safe io.Writer with bounded memory (16KB head + 16KB tail ring buffer). Provides LLM-ready output with truncation metadata and long-line truncation at 2048 bytes. - headtail_test.go — 16 tests including race detector coverage for concurrent writes. - process.go — Manager + Process types for lifecycle management using agentexec.Execer for proper OOM/nice scores. - api.go — HTTP API following the agentfiles chi router pattern. 4 endpoints: start, list, output, signal. ### Agent wiring (agent/agent.go, agent/api.go) Mounts the process API at /api/v0/processes, mirroring how agentfiles is mounted. ### SDK (codersdk/workspacesdk/agentconn.go) 4 new AgentConn interface methods + 7 request/response types: - StartProcess, ListProcesses, ProcessOutput, SignalProcess ### Execute tool rewrite (coderd/chatd/chattool/execute.go) - SSH to Agent API: conn.StartProcess() + conn.ProcessOutput() polling - New parameters: workdir, run_in_background - Structured response: success, exit_code, wall_duration_ms, error, truncated, note, background_process_id - Non-interactive env vars: GIT_EDITOR=true, TERM=dumb, NO_COLOR=1, PAGER=cat, etc. - Output truncation: HeadTailBuffer caps at 32KB for LLM consumption - File-dump detection with advisory notes suggesting read_file - Default timeout: 60s to 10s - Foreground polling: 200ms intervals until exit or timeout ## Architecture State lives on the agent, surviving coderd failover and instance changes. Any coderd replica can query any agent via HTTP over tailnet.	2026-02-28 12:33:52 -05:00
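The bounded head-and-tail capture described in the commit above can be illustrated with a small sketch. This is not the `agentproc` implementation: the `headTail` type, its constructor, and its methods are hypothetical, the tail is a plain trimmed slice rather than a ring buffer, and long-line truncation is omitted. It only shows the idea of keeping the first and last N bytes of output while counting what was dropped.

```go
package main

import (
	"fmt"
	"sync"
)

// headTail keeps the first maxHead and the last maxTail bytes written.
// Hypothetical type for illustration only.
type headTail struct {
	mu      sync.Mutex
	maxHead int
	maxTail int
	head    []byte
	tail    []byte
	total   int // total bytes ever written
}

func newHeadTail(maxHead, maxTail int) *headTail {
	return &headTail{maxHead: maxHead, maxTail: maxTail}
}

// Write implements io.Writer. It always accepts the full input.
func (b *headTail) Write(p []byte) (int, error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	n := len(p)
	b.total += n
	// Fill the head first.
	if room := b.maxHead - len(b.head); room > 0 {
		take := room
		if take > len(p) {
			take = len(p)
		}
		b.head = append(b.head, p[:take]...)
		p = p[take:]
	}
	// The rest goes to the tail, trimmed to the newest maxTail bytes.
	b.tail = append(b.tail, p...)
	if over := len(b.tail) - b.maxTail; over > 0 {
		b.tail = append(b.tail[:0], b.tail[over:]...)
	}
	return n, nil
}

// omitted returns the number of bytes dropped from the middle.
func (b *headTail) omitted() int {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.total - len(b.head) - len(b.tail)
}

// String renders the retained output with a marker where bytes were dropped.
func (b *headTail) String() string {
	b.mu.Lock()
	defer b.mu.Unlock()
	dropped := b.total - len(b.head) - len(b.tail)
	if dropped == 0 {
		return string(b.head) + string(b.tail)
	}
	return fmt.Sprintf("%s\n... [%d bytes omitted] ...\n%s", b.head, dropped, b.tail)
}

func main() {
	b := newHeadTail(4, 4)
	fmt.Fprint(b, "abcdefghijkl")
	fmt.Println(b.String())
}
```

Because the type satisfies `io.Writer`, a sketch like this could be attached to a process's stdout and stderr, so memory use stays bounded regardless of how much the command prints.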
Kyle Carberry	0ad2f9ecd7	feat(chatd): persist last_error on chats table (#22436 ) Adds a nullable `last_error` column to the `chats` table so error reasons survive page reloads. Backend: - Migration adds `last_error TEXT` (nullable) to chats - `UpdateChatStatus` writes the error reason when status transitions to `error`, clears it (NULL) on recovery - `convertChat` maps `sql.NullString` to `string` in the SDK Frontend: - Sidebar falls back to `chat.last_error` when no stream error reason is cached - Chat detail page does the same for `persistedErrorReason` - Fixtures updated for new required field	2026-02-28 12:27:26 -05:00
Kyle Carberry	2bdacae5f5	feat(chatd): add LLM stream retry with exponential backoff (#22418 ) ## Summary Adds automatic retry with exponential backoff for transient LLM errors during chat streaming and title generation. Inspired by [coder/mux](https://github.com/coder/mux)'s retry mechanism. ## Key Behaviors - Infinite retries with exponential backoff: 1s → 2s → 4s → ... → 60s cap - Deterministic delays (no jitter) - Error classification: retryable (429, 5xx, overloaded, rate limit, network errors) vs non-retryable (auth, quota, context exceeded, model not found, canceled) - Retry status published to SSE stream so frontend can show "Retrying in Xs..." UI - Title generation retries silently (best-effort, nil onRetry callback) ## New Package: `coderd/chatd/chatretry/` \| File \| Purpose \| \|------\|---------\| \| `classify.go` \| `IsRetryable(err)` and `StatusCodeRetryable(code)` \| \| `backoff.go` \| `Delay(attempt)` — exponential doubling with 60s cap \| \| `retry.go` \| `Retry(ctx, fn, onRetry)` — infinite loop with context-aware timer \| ## Test Helpers: `coderd/chatd/chattest/errors.go` Anthropic and OpenAI error response builders for use in chattest providers: - `AnthropicErrorResponse()`, `AnthropicOverloadedResponse()`, `AnthropicRateLimitResponse()` - `OpenAIErrorResponse()`, `OpenAIRateLimitResponse()`, `OpenAIServerErrorResponse()` ## SDK Changes: `codersdk/chats.go` - New `ChatStreamEventType: "retry"` - New `ChatStreamRetry` struct with `Attempt`, `DelayMs`, `Error`, `RetryingAt` fields - TypeScript types auto-generated ## Changed Files - `coderd/chatd/chatloop/chatloop.go` — wraps `agent.Stream()` in `chatretry.Retry()` - `coderd/chatd/chatd.go` — publishes retry events to SSE stream with logging - `coderd/chatd/title.go` — wraps `model.Generate()` in silent retry - `coderd/chatd/chattest/anthropic.go` / `openai.go` — error injection support ## Tests 42 tests covering classification (33), backoff (9), and retry scenarios (8).	2026-02-27 18:34:33 -05:00
Kyle Carberry	4b5ec8a9a4	feat: add diff_status_change event to /chats/watch pubsub stream (#22419 ) ## Summary Adds a new `diff_status_change` event kind to the `/chats/watch` pubsub stream so the sidebar can update diff status (PR created, files changed, branch info) without a full page reload. ### Problem When a chat's diff status changes (e.g. PR created via GitHub, git branch pushed), the sidebar didn't update because: 1. The backend `publishChatPubsubEvent` didn't include diff status data 2. The frontend watch handler only merged `status`, `title`, and `updated_at` from events ### Solution A notify-only approach: a new `ChatEventKindDiffStatusChange` event kind tells the frontend "diff status changed for chat X" — the frontend then invalidates the relevant React Query cache entries to re-fetch. ### Backend changes - `coderd/pubsub/chatevent.go`: New `ChatEventKindDiffStatusChange = "diff_status_change"` constant - `coderd/chatd/chatd.go`: New `PublishDiffStatusChange(ctx, chatID)` method on `Server` - `coderd/chats.go`: New `publishChatDiffStatusEvent` helper. Published from: - `refreshWorkspaceChatDiffStatuses` — after each chat's diff status is refreshed via GitHub API - `storeChatGitRef` — after persisting git branch/origin info from workspace agent ### Frontend changes - `AgentsPage.tsx`: Handle `diff_status_change` event by invalidating `chatDiffStatusKey` and `chatDiffContentsKey` queries - `ChatContext.ts`: Remove redundant diff status invalidation that fired on every chat status change (the new event kind handles this properly)	2026-02-27 18:06:54 -05:00
Kyle Carberry	12083441e0	feat(chats): archive chats instead of hard-deleting them (#22406 ) ## Summary The UI has always labeled the action as "Archive agent" but the backend was performing a hard `DELETE`, permanently destroying chats and all their messages. This change replaces the hard delete with a soft archive, consistent with the pattern used by template versions. ## Changes ### Database - Migration 000423: Add `archived boolean DEFAULT false NOT NULL` column to `chats` table - Replace `DeleteChatByID` query with `ArchiveChatByID` (`UPDATE SET archived = true`) - Add `UnarchiveChatByID` query (`UPDATE SET archived = false`) - Filter archived chats from `GetChatsByOwnerID` (`WHERE archived = false`) ### API - Remove `DELETE /api/experimental/chats/{chat}` - Add `POST /api/experimental/chats/{chat}/archive` — archives a chat and all its descendants - Add `POST /api/experimental/chats/{chat}/unarchive` — unarchives a single chat (API only, no UI yet) ### Backend - `archiveChatTree()` recursively archives child chats (replaces `deleteChatTree()` which hard-deleted) - Chat daemon's `ArchiveChat()` archives the full chat tree in a transaction - Authorization uses `ActionUpdate` instead of `ActionDelete` ### SDK - Replace `DeleteChat()` with `ArchiveChat()` and `UnarchiveChat()` - Add `Archived` field to `Chat` struct ### Frontend - `archiveChat` API call uses `POST .../archive` instead of `DELETE` - No UI changes — the "Archive agent" button now actually archives instead of deleting ## Design Decision This follows the template version archive pattern (Pattern B in the codebase): - `archived boolean` column (not `deleted boolean`) - Dedicated `POST .../archive` and `POST .../unarchive` routes (not repurposing `DELETE`) - Reversible — users can unarchive via the API (UI for this will come later)	2026-02-27 16:46:19 -05:00
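The soft-archive behavior (archive a chat and all of its descendants, then hide archived chats from listings) can be illustrated with an in-memory sketch. The `Chat` struct, the map-backed store, and the `visibleChatIDs` helper are hypothetical stand-ins for the real tables and queries; only the name `archiveChatTree` comes from the commit message, and its actual signature is not shown there.

```go
package main

import (
	"fmt"
	"sort"
)

// Chat is a simplified, hypothetical view of a chats row.
type Chat struct {
	ID       string
	ParentID string // empty for a root chat
	Archived bool
}

// archiveChatTree marks the chat and every descendant as archived and
// returns how many chats changed. It assumes parent links form a tree.
func archiveChatTree(chats map[string]*Chat, id string) int {
	chat, ok := chats[id]
	if !ok {
		return 0
	}
	changed := 0
	if !chat.Archived {
		chat.Archived = true
		changed++
	}
	for _, c := range chats {
		if c.ParentID == id {
			changed += archiveChatTree(chats, c.ID)
		}
	}
	return changed
}

// visibleChatIDs mimics a listing query filtered by archived = false.
func visibleChatIDs(chats map[string]*Chat) []string {
	var ids []string
	for _, c := range chats {
		if !c.Archived {
			ids = append(ids, c.ID)
		}
	}
	sort.Strings(ids)
	return ids
}

func main() {
	chats := map[string]*Chat{
		"root":  {ID: "root"},
		"child": {ID: "child", ParentID: "root"},
		"leaf":  {ID: "leaf", ParentID: "child"},
		"other": {ID: "other"},
	}
	fmt.Println(archiveChatTree(chats, "root"), visibleChatIDs(chats))
}
```

Because nothing is deleted, unarchiving is just flipping the flag back, which is what makes the operation reversible.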
Kyle Carberry	360df1d84f	fix(chatd): publish streaming message_part events during compaction (#22410 ) ## Problem Context compaction in chatd persisted durable messages for the `chat_summarized` tool call and result via `publishMessage`, but never published `message_part` streaming events via `publishMessagePart`. This meant connected clients had no streaming representation of the compaction. The client's `streamState` (built entirely from `message_part` events in `streamState.ts`) never saw the compaction tool call, so: - No "Summarizing..." running state was shown to the user during summary generation (which can take up to 90s). - The durable `message` events arrived after or interleaved with the `status: waiting` event, causing the tool to appear as "Summarized" with the chat appearing to just stop. ## Fix ### 1. `CompactionOptions.OnStart` callback (chatloop) Added an `OnStart` callback to `CompactionOptions`, called in `maybeCompact` right before `generateCompactionSummary` (the slow LLM call). This gives `chatd` a hook to publish the tool-call `message_part` immediately when compaction begins. ### 2. Tool-result streaming part (chatd) `persistChatContextSummary` now publishes a tool-result `message_part` before the durable `message` events, so clients transition from "Summarizing..." to "Summarized" before the status change arrives. ### Event ordering is now: 1. `message_part` (tool call via `OnStart`) — client shows "Summarizing..." 2. LLM generates summary (up to 90s) 3. `message_part` (tool result) — client shows "Summarized" in stream state 4. `message` (assistant) — durable message persisted, stream state resets 5. `message` (tool) — durable tool result persisted 6. `status: waiting` — chat transitions to idle ## Tests - `OnStartFiresBeforePersist`: Verifies callback ordering is `on_start` → `generate` → `persist`. - `OnStartNotCalledBelowThreshold`: Verifies `OnStart` is not called when context usage is below the compaction threshold.	2026-02-27 16:33:39 -05:00
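The callback ordering in this commit (`on_start`, then `generate`, then `persist`, and nothing at all below the threshold) can be sketched as below. The `Threshold`, `Generate`, and `Persist` fields and the `maybeCompact` signature are assumptions for illustration; the commit only confirms that `CompactionOptions` gained an `OnStart` callback.

```go
package main

import "fmt"

// CompactionOptions is a simplified stand-in; only OnStart is named in the
// commit message, the other fields are hypothetical.
type CompactionOptions struct {
	Threshold float64              // fraction of the context window that triggers compaction
	OnStart   func()               // fired right before the slow summary call
	Generate  func() string        // stands in for the LLM summary call
	Persist   func(summary string) // stands in for writing durable messages
}

// maybeCompact runs compaction only when usage reaches the threshold and
// reports whether it ran.
func maybeCompact(usage float64, opts CompactionOptions) bool {
	if usage < opts.Threshold {
		return false
	}
	if opts.OnStart != nil {
		opts.OnStart() // lets the caller publish a "Summarizing..." stream part
	}
	summary := opts.Generate()
	opts.Persist(summary)
	return true
}

func main() {
	var events []string
	opts := CompactionOptions{
		Threshold: 0.8,
		OnStart:   func() { events = append(events, "on_start") },
		Generate: func() string {
			events = append(events, "generate")
			return "summary"
		},
		Persist: func(string) { events = append(events, "persist") },
	}
	fmt.Println(maybeCompact(0.5, opts), events)
	fmt.Println(maybeCompact(0.9, opts), events)
}
```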
Kyle Carberry	f509c841cf	fix(chatd): recover stale chats after coderd redeployment (#22405 ) ## Problem When coderd instances are redeployed (e.g. rolling deployment on dogfood), in-flight chats get stuck in `running` status permanently. The UI shows them as "thinking" with a spinning indicator, but no worker is actually processing them. They never error or resume. ## Root Cause Two bugs combine to cause this: ### Bug 1: Shutdown cleanup uses a canceled context The `processChat` defer block updates the chat status in the DB when processing completes. But it uses `ctx`, which `Close()` cancels before the defer runs. The DB transaction silently fails with `context.Canceled`, leaving the chat in `status=running` with a dead `worker_id`. ```go // Close() calls p.cancel() which cancels ctx // Then the defer tries to use the now-canceled ctx: defer func() { err := p.db.InTx(func(tx database.Store) error { tx.GetChatByIDForUpdate(ctx, chat.ID) // FAILS tx.UpdateChatStatus(ctx, ...) // FAILS }, nil) }() ``` ### Bug 2: Stale recovery runs only once at startup `recoverStaleChats()` was called only once in `start()`, not periodically. During a rolling deployment, the new instance starts while the old one is still alive (fresh heartbeat). By the time the old instance crashes, no one checks again. ## Fix 1. Use `context.WithoutCancel(ctx)` in the processChat defer — the cleanup transaction now completes even during graceful shutdown. 2. Run `recoverStaleChats` periodically — a second ticker in the `start()` loop checks for stale chats at `inFlightChatStaleAfter / 5` intervals (default: every 1 minute). This catches orphaned chats even when the instance that owns them crashes without clean shutdown. ## Tests - `TestRecoverStaleChatsPeriodically` — Verifies chats orphaned after startup are recovered by the periodic loop (not just the startup check). - `TestNewReplicaRecoversStaleChatFromDeadReplica` — Verifies a new replica recovers stale chats on startup. - `TestWaitingChatsAreNotRecoveredAsStale` — Negative test: `waiting` chats are not incorrectly modified by recovery.	2026-02-27 15:25:40 -05:00
Kyle Carberry	b65c0766d2	feat: add line-based read_file tool with safety limits (#22400 ) ## Summary Adds a new line-based file reading endpoint to the workspace agent, replacing the unbounded byte-based approach for the `read_file` chat tool and `coder_workspace_read_file` MCP tool. Problem: The current `read_file` tool returns the entire file contents with no limits, which can blow up LLM context windows and cause OOM issues with large files. Solution: Inspired by [`coder/mux`](https://github.com/coder/mux) and [`openai/codex`](https://github.com/openai/codex), implement a line-based reader with safety limits. ## Changes ### Agent (`agent/agentfiles/`) - New `/read-file-lines` endpoint with `HandleReadFileLines` handler - Line-based `offset` (1-based line number, default: 1) and `limit` (line count, default: 2000) - Safety constants: \| Constant \| Value \| Purpose \| \|---\|---\|---\| \| `MaxFileSize` \| 1 MB \| Reject files larger than this at stat \| \| `MaxLineBytes` \| 1,024 \| Per-line truncation with `... [truncated]` marker \| \| `MaxResponseLines` \| 2,000 \| Max lines per response \| \| `MaxResponseBytes` \| 32 KB \| Max total response size \| \| `DefaultLineLimit` \| 2,000 \| Default when no limit specified \| - Line numbering format: `1\tcontent` (tab-separated) - Structured JSON response: `{ success, file_size, total_lines, lines_read, content, error }` - Hard errors when limits exceeded — tells the LLM to use `offset`/`limit` - Existing byte-based `/read-file` endpoint preserved (used by `instruction.go`) ### SDK (`codersdk/workspacesdk/`) - `ReadFileLinesResponse` type added - `ReadFileLines` method added to `AgentConn` interface - Mock regenerated ### Chat tool (`coderd/chatd/chattool/`) - `read_file` tool now uses `conn.ReadFileLines()` instead of `conn.ReadFile()` - Updated tool description to document line-based parameters - Response includes `file_size`, `total_lines`, `lines_read` metadata ### MCP tool (`codersdk/toolsdk/`) - `coder_workspace_read_file` updated to use line-based reading - Schema descriptions updated for line-based offset/limit - Removed `maxFileLimit` constant (agent handles limits now) ### Tests - 13 new test cases for `TestReadFileLines`: - Path validation (empty, relative, non-existent, directory, no permissions) - Empty file handling - Basic read, offset, limit, offset+limit combinations - Offset beyond file length - Long line truncation (>1024 bytes) - Large file rejection (>1MB) - All existing tests pass unchanged ## Design decisions \| Decision \| Rationale \| \|---\|---\| \| Line-based, not byte-based \| Both coder/mux and openai/codex use line-based — matches how LLMs reason about code \| \| Default limit of 2000 \| Matches codex; prevents accidental full-file dumps while being generous \| \| 32 KB response cap \| Compromise between mux (16 KB) and codex (no cap) \| \| 1024 byte/line truncation with marker \| More generous than codex (500), marker helps LLM know data is missing \| \| Hard errors on overflow \| Matches mux; forces LLM to paginate rather than getting partial data \| \| Preserve byte-based endpoint \| `instruction.go` needs raw byte access for AGENTS.md \|	2026-02-27 15:12:56 -05:00
Kyle Carberry	ff687aa780	fix: re-read chat before publishing status event to preserve AI title (#22402 ) ## Problem Chat titles revert to the fallback truncated title after briefly showing the AI-generated title. Even reloading the page doesn't help — the correct title flashes then gets overwritten. ## Root Cause Single bug, two symptoms. In `processChat` (`coderd/chatd/chatd.go`), the `chat` variable is passed by value. The flow: 1. `processChat(ctx, chat)` receives `chat` with the initial fallback title (truncated first message). 2. Inside `runChat`, `maybeGenerateChatTitle` generates an AI title, writes it to the DB via `UpdateChatByID`, and publishes a `title_change` event. The DB has the correct title. The client briefly displays it. 3. `runChat` returns. The deferred cleanup in `processChat` publishes `publishChatPubsubEvent(chat, StatusChange)` — but `chat` here is the original value copy that still has the old fallback title. 4. The frontend receives the `status_change` SSE event and unconditionally applies `title` from every event kind (see `AgentsPage.tsx` line ~305: `title: updatedChat.title`). This overwrites the correct AI title with the stale fallback. Why reload doesn't help: If the chat is still processing when the page reloads, `listChats` loads the correct title from the DB, but then the deferred `status_change` event arrives moments later and clobbers it. The title was always in the DB — it was the pubsub event that kept overwriting it. ## Fix Re-read the chat from the database in the deferred cleanup before publishing the final `status_change` event, so it carries the current (AI-generated) title.	2026-02-27 15:06:36 -05:00
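The stale-copy problem comes down to Go's value semantics: a struct passed by value does not see later writes made through the store. A minimal sketch, using a hypothetical `Chat` struct and a map standing in for the database, contrasts the stale event with the re-read one. The function names here are illustrative, not taken from `chatd.go`.

```go
package main

import "fmt"

// Chat is a simplified, hypothetical stand-in for the chat row.
type Chat struct {
	ID     int
	Title  string
	Status string
}

// Store stands in for the database.
type Store map[int]Chat

// runChat receives chat by value and writes an AI title to the store only;
// the caller's copy is not updated.
func runChat(db Store, chat Chat) {
	chat.Title = "AI-generated title"
	db[chat.ID] = chat
}

// finalEventStale builds the final event from the caller's original copy,
// which still carries the fallback title.
func finalEventStale(db Store, chat Chat) Chat {
	runChat(db, chat)
	chat.Status = "waiting"
	return chat
}

// finalEventFresh re-reads the chat before building the final event.
func finalEventFresh(db Store, chat Chat) Chat {
	runChat(db, chat)
	latest := db[chat.ID]
	latest.Status = "waiting"
	return latest
}

func main() {
	initial := Chat{ID: 1, Title: "fallback title", Status: "running"}
	fmt.Println(finalEventStale(Store{1: initial}, initial).Title)
	fmt.Println(finalEventFresh(Store{1: initial}, initial).Title)
}
```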
Kyle Carberry	344d11fa22	feat: include OS and working directory in workspace agent prompt injection (#22399 ) When injecting system instructions into the chat prompt, include: 1. Operating system and working directory from the `workspace_agents` table 2. Home-level instructions from `~/.coder/AGENTS.md` (existing behavior) 3. Project-level instructions from `<pwd>/AGENTS.md` (new) The XML tag is renamed from `<coder-home-instructions>` to `<system-instructions>` since it now carries more than just the home instruction file. ### Example output (both files present) ```xml <system-instructions> Operating System: linux Working Directory: /home/coder/coder Source: /home/coder/.coder/AGENTS.md ... home instructions ... Source: /home/coder/coder/AGENTS.md ... project instructions ... </system-instructions> ``` ### Example output (no AGENTS.md files) ```xml <system-instructions> Operating System: linux Working Directory: /home/coder/coder </system-instructions> ``` ### Changes - `coderd/chatd/instruction.go`: - Renamed types: `homeInstructionContext` → `agentContext`, added `instructionFile` struct - Extracted `readInstructionFileAtPath` shared helper - Added `readWorkingDirectoryInstructionFile` to read `<pwd>/AGENTS.md` - Replaced `formatHomeInstruction` with `formatInstructions` that renders both files under `<system-instructions>` - `coderd/chatd/chatd.go`: - Renamed `resolveHomeInstruction` → `resolveInstructions`; now reads both home and pwd instruction files - `resolveAgentContext` returns `agentContext` (renamed from `homeInstructionContext`) - pwd file read is skipped gracefully if directory is empty or file doesn't exist - `coderd/chatd/instruction_test.go`: - Added `TestReadWorkingDirectoryInstructionFile` (success, not-found, empty-directory) - Replaced `TestFormatHomeInstruction` with `TestFormatInstructions` covering all combinations - Added ordering test (`AgentContextBeforeFiles`) to verify OS/pwd appear before file sources	2026-02-27 14:21:23 -05:00
Kyle Carberry	59cec5be65	feat: add pagination and popularity sorting to chattool list_templates (#22398 ) ## Summary The `chattool` `list_templates` tool previously returned all templates in a single response with no popularity signal. On deployments with many templates (e.g. 71 on dogfood), this wastes tokens and makes it hard for the AI to pick the right template for broad user questions. ## Changes Single file: `coderd/chatd/chattool/listtemplates.go` - `page` parameter — optional, 1-indexed, 10 results per page - Popularity sort — queries `GetWorkspaceUniqueOwnerCountByTemplateIDs` to get active developer counts, then sorts descending (most popular first). The DB query returns templates alphabetically, so this explicit sort is needed. - `active_developers` — included on each template item so the agent can see the signal - Pagination metadata — `page`, `total_pages`, `total_count` in the response so the agent knows there are more results - Updated tool description — tells the agent that results are ordered by popularity and paginated ## Frontend No frontend changes needed. The renderer already reads `rec.templates` and `rec.count` from the response — the new fields (`page`, `total_pages`, `total_count`) are additive and safely ignored.	2026-02-27 14:06:22 -05:00
Kyle Carberry	edee917d88	feat: add experimental agents support (#22290 ) feat: add AI chat system with agent tools and chat UI Introduce the chatd subsystem and Agents UI for AI-powered chat within Coder workspaces. - Add chatd package with chat loop, message compaction, prompt management, and LLM provider integration (OpenAI, Anthropic) - Add agent tools: create workspace, list/read templates, read/write/ edit files, execute commands - Add chat API endpoints with streaming, message editing, and durable reconnection - Add database schema and migrations for chats, chat messages, chat providers, and chat model configs - Add RBAC policies and dbauthz enforcement for chat resources - Add Agents UI pages with conversation timeline, queued messages list, diff viewer, and model configuration panel - Add comprehensive test coverage including coderd integration tests, chatd unit tests, and Storybook stories - Gate feature behind experiments flag --------- Co-authored-by: Cian Johnston <cian@coder.com> Co-authored-by: Danielle Maywood <danielle@themaywoods.com> Co-authored-by: Jeremy Ruppel <jeremy@coder.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-27 16:50:56 +00:00

80 Commits