coder

mirror of https://github.com/coder/coder.git synced 2026-06-07 15:08:20 +00:00

Author	SHA1	Message	Date
Ethan	9444eddf4e	feat(coderd/x/chatd): allow attach_file in root plan-mode chats (#25388) `attach_file` was registered for plan-mode turns but never added to `builtinPlanToolAllowed`, so the per-turn `ActiveTools` allowlist filtered it out and calls failed with `Tool not active in this turn: attach_file`. This was an omission rather than a deliberate block: the tool (#24280) landed shortly after plan mode (#24236) and no subsequent edit to the allowlist picked it up. Add `attach_file` under the `isRootChat` case, matching how other artifact-producing tools (`propose_plan`, `write_file`, `edit_files`) are gated. The tool only reads from the workspace and writes to chat-attachment storage, so it preserves plan mode's invariant of not making implementation changes to the workspace. Subagents in plan mode remain restricted to the minimal read-only surface.	2026-05-19 17:01:23 +10:00
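The allowlist gating described in 9444eddf4e can be illustrated with a minimal sketch. This is not the repository's `builtinPlanToolAllowed`: the function name `planToolAllowed`, its signature, and the read-only tool names `read_file` and `list_files` are assumptions made for illustration. Only `attach_file`, `propose_plan`, `write_file`, and `edit_files` are taken from the commit message.

```go
package main

import "fmt"

// planToolAllowed is a hypothetical stand-in for the plan-mode allowlist
// check. It reports whether a tool may run during a plan-mode turn.
func planToolAllowed(name string, isRootChat bool) bool {
	switch name {
	case "read_file", "list_files":
		// Assumed names for the minimal read-only surface that both root
		// chats and subagents keep in plan mode.
		return true
	case "propose_plan", "write_file", "edit_files", "attach_file":
		// Artifact-producing tools are only active in the root chat.
		return isRootChat
	default:
		// Anything not listed is filtered out of the turn.
		return false
	}
}

func main() {
	fmt.Println("root chat:", planToolAllowed("attach_file", true))
	fmt.Println("subagent:", planToolAllowed("attach_file", false))
}
```

Keeping the decision in one switch means a newly registered tool stays inactive until someone adds it explicitly, which is the kind of omission this commit corrects.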
Danielle Maywood	170a6e1fe9	feat: add chat sharing foundation (#25041)	2026-05-18 22:32:05 +01:00
Kyle Carberry	385146000b	feat: record created_at/completed_at on reasoning ChatMessageParts (#24789 ) Records reasoning start and end times on persisted reasoning `ChatMessagePart`s so reasoning duration can be computed for stored chats. Backend-only: no SSE changes and no frontend rendering ship in this PR. The `created_at` field on `ChatMessagePart` is extended to also be present on `reasoning` parts (it previously appeared only on `tool-call` and `tool-result`), and a new `completed_at` field is added for `reasoning` parts. ### How timestamps are recorded - `StreamPartTypeReasoningStart`: stamp `startedAt = dbtime.Now()` on the active reasoning state. - `StreamPartTypeReasoningEnd`: stamp `completedAt = dbtime.Now()` and append both into parallel `[]time.Time` slices on `stepResult`. - Persistence reads the slices in occurrence order (reasoning has no provider-side ID) and applies them to the matching `ChatMessagePart` via `buildAssistantPartsForPersist`. The first reasoning block's stamps go onto the first reasoning part, and so on. - `flushActiveState` flushes partial reasoning interrupted before `StreamPartTypeReasoningEnd` with `startedAt` from the active state and `completedAt = dbtime.Now()` at the interruption. ### Why two fields, not one? Tool calls and results are point events. The frontend computes their duration by subtracting the call's `created_at` from the result's `created_at`. Reasoning is one assistant part that brackets a span, so we record both endpoints on the part itself. ### Why not stamp in `PartFromContent`? Same rationale as #24101: `PartFromContent` is called during both SSE publishing and persistence. Stamping there would yield incorrect persistence-time timestamps for reasoning blocks that finished much earlier in the step. Instead we capture in the chatloop and apply during persistence. 
<details><summary>Implementation plan</summary> - `codersdk/chats.go`: extend `CreatedAt`'s `variants` to include `reasoning?`; add `CompletedAt *time.Time` with `variants:"reasoning?"`. - `coderd/x/chatd/chatloop/chatloop.go`: extend `reasoningState` with `startedAt`; extend `stepResult` and `PersistedStep` with parallel `[]time.Time` reasoning slices; stamp on `ReasoningStart`/`ReasoningEnd`; thread the slices through all `PersistStep` call sites including the interrupt-safe path; record partial reasoning in `flushActiveState`. - `coderd/x/chatd/attachments.go`: walk reasoning parts in occurrence order and apply `step.ReasoningStartedAt[i]` to `part.CreatedAt` and `step.ReasoningCompletedAt[i]` to `part.CompletedAt`. ### Tests - `codersdk/chats_test.go` round-trips `created_at` + `completed_at` on reasoning parts and verifies omission when absent and partial interrupted parts. - `coderd/x/chatd/chatprompt/chatprompt_test.go` asserts `PartFromContent(ReasoningContent{})` does NOT stamp timestamps. - `coderd/x/chatd/chatloop/chatloop_test.go` `TestRun_ReasoningTimestamps` drives a stream with two reasoning blocks and verifies parallel slices, monotonicity, ordering, non-zero values, and content-block ordering. `TestRun_InterruptedReasoningFlushesTimestamps` cancels mid-reasoning and verifies `flushActiveState` records a non-zero pair. - `coderd/x/chatd/attachments_test.go` covers `buildAssistantPartsForPersist` for normal interleaved reasoning, partial (zero `completed_at`), and missing slices. </details> > Generated by Coder Agents. Co-authored-by: Coder Agent <agent@coder.com>	2026-05-18 12:30:30 -04:00
Kyle Carberry	159089686a	fix(coderd/x/chatd): prime workspace MCP cache after create/start (#25298 ) ## Problem Mid-turn workspace MCP discovery was broken when an agent was still cold-starting. `PrepareTools` in `chatd.go` flipped `workspaceMCPDiscovered = true` before calling `discoverWorkspaceMCPTools`, so a failed discovery attempt permanently blocked retries within the turn. Customer-reported repro: - New chat with no pre-selected workspace. - LLM calls `create_workspace` mid-turn at `23:35:05`. - `PrepareTools` fires, dials the agent with a 30s timeout, dial times out at `23:38:15`, `discoverWorkspaceMCPTools` returns empty. - Agent connects at `23:38:29`, 14 seconds later. - `workspaceMCPDiscovered` was already true, so `PrepareTools` never retried for the rest of the turn. MCP tools only appeared on the next user message. A naive retry loop in `PrepareTools` would also miss the bigger picture: a workspace boot can take several minutes (EC2 cold start, 10 min startup scripts), and the chatloop only gets a chance to call `PrepareTools` between LLM steps. ## Fix Do the workspace MCP discovery from inside the tool that already waits for the agent. `chattool.CreateWorkspace` and `chattool.StartWorkspace` call `waitForAgentReady`, which has a 2 min agent-online budget plus a 10 min startup-script budget. By the time they fire `OnChatUpdated`, the agent is `Ready`. The chatd `onChatUpdated` callback now launches an async `primeWorkspaceMCPCache` goroutine on every bind that has a valid workspace ID: - The primer calls `discoverWorkspaceMCPTools` until it returns a non-empty list or `workspaceMCPPrimeMaxWait` (30s) elapses, with a 2s backoff between attempts. The bounded wait handles the short race between agent-online and the agent's MCP `Connect` settling. - The primer runs asynchronously so the tool itself never blocks. Some templates simply do not advertise MCP tools, in which case the primer would otherwise spend its full budget for nothing. 
- The primer shares the chat `ctx` (not a detached one) so it is canceled together with the chat. A dangling primer would re-dial the workspace conn after `runChat`'s deferred `workspaceCtx.close()` and leak that conn. - `inflight.Add(1)` ensures server shutdown still waits for any in-progress primer. - `PrepareTools` is simplified back to a single discovery call. It now only sets `workspaceMCPDiscovered = true` on success, so an empty result no longer permanently blocks discovery within the turn. The cache hit warmed by the primer makes that call cheap in the common case; the dial fallback handles the rare cache miss. ## Tests All in `coderd/x/chatd/chatd_internal_test.go`: - `TestPrimeWorkspaceMCPCache_SuccessOnFirstAttempt` — single `ListMCPTools` call returning tools populates the cache. - `TestPrimeWorkspaceMCPCache_RetriesUntilToolsAppear` — first call empty, second returns tools; primer retries past the backoff and writes the cache. Uses `quartz.Mock.Trap` on `NewTimer`. - `TestPrimeWorkspaceMCPCache_GivesUpAfterDeadline` — `ListMCPTools` always empty; primer stops at `workspaceMCPPrimeMaxWait` and refuses to cache the empty result so PrepareTools can retry on the next step. The existing integration test `TestRunChat_WorkspaceMCPDiscoveryAfterMidTurnCreateWorkspace` continues to pass and now also exercises the async-primer path end-to-end via the create_workspace tool. ``` go test ./coderd/x/chatd/... -count=1 go test ./coderd/x/chatd/ -race -count=1 make pre-commit ``` <details> <summary>Design notes</summary> - The first iteration of this PR added retry+cooldown+failure-cap logic inside `PrepareTools`. It worked for the customer's ~30s race window but did not help workspaces that take several minutes to boot, because `PrepareTools` only fires between LLM steps. Reviewer pointed out the right place to handle this is the tool itself; the current implementation does that. 
- Why async: a primer that ran synchronously inside the `OnChatUpdated` callback blocked the create_workspace tool from returning for up to `workspaceMCPPrimeMaxWait`, which broke `TestCreateWorkspaceTool_EndToEnd` and would hurt any template that does not expose MCP tools. Decoupling lets the tool return immediately and lets the primer warm the cache concurrently with the next LLM step. - Why share the chat `ctx` rather than `context.WithoutCancel(ctx)` (the title-generation pattern): the primer touches `workspaceCtx.getWorkspaceConn`, which `runChat`'s deferred `workspaceCtx.close()` invalidates. A detached primer outliving the chat would dial a fresh conn and leak it. - The constant naming distinguishes `workspaceMCPDiscoveryTimeout` (35s per-call dial budget, unchanged from #25169) from `workspaceMCPPrimeMaxWait` (30s total budget for the post-ready primer loop) and `workspaceMCPPrimeRetryInterval` (2s between empty-result retries). </details> Follow-up to #25169. --- _This pull request was generated by Coder Agents._	2026-05-18 07:55:56 -04:00
Ethan	e75bd3aca4	fix: preserve Anthropic replay fidelity (#25377) Anthropic is strict about replaying the latest assistant turn once it contains signed or redacted reasoning. We were still mutating that turn in a few Coder-owned places: dropping empty reasoning blocks on replay, rewriting provider-tool history during sanitization, and in the worst case sending a prompt we already knew Anthropic would reject. This patch keeps the latest signed assistant immutable through Coder's replay and sanitization paths, preserves empty signed or redacted reasoning anywhere Coder owns the ledger, and fails before the provider call if the prompt is still unsafe. It also bumps the existing `coder/fantasy` `coder_2_33` fork that `main` already uses to the commit containing coder/fantasy#35. These fixes have also been upstreamed to charmbracelet/fantasy. Closes CODAGT-409.	2026-05-18 15:20:33 +10:00
Michael Suchacz	792f0b4902	feat: add personal skill resolver (#25362 ) > Mux updated this PR on behalf of Mike. ## Stack Context This stack splits experimental personal skills into smaller reviewable PRs. Personal skills are user-owned `SKILL.md` files stored by Coder and injected into chatd alongside workspace skills. Stack order: 1. #25362 personal skill resolver 2. #25363 storage, permissions, API, and SDK 3. #25365 API test coverage 4. #25366 chattool and chatd integration 5. #25066 settings UI and docs 6. #25386 personal skills slash menu ## What? Adds the shared personal skill parser and resolver package, plus reusable skill-name validation exported from `workspacesdk`. The parser enforces the full personal skill contract: max raw size, kebab-case name, max name length, and non-empty body. ## Why? The rest of the stack needs one source-aware resolver for personal and workspace skills, including collision handling and qualified aliases. Keeping personal skill constraints in the parser prevents callers from accidentally parsing invalid personal skills. ## Validation - `go test ./coderd/x/skills ./codersdk/workspacesdk` - pre-commit hooks on this branch	2026-05-16 15:33:43 +00:00
Ethan	a59b951565	test: skip stale notification chatd flakes (#25376) These chatd tests are flaking for the same stale control-notification race tracked by CODAGT-353, so this change skips the newly reflaking advisor-chain and `TestPatchChatMessage/ChangesModel` tests and rewrites the older `TODO(hugodutka)` skips to point at the same root cause. This keeps the known flakes documented consistently until the chatd notification-flow refactor lands. Closes CODAGT-427 Closes https://github.com/coder/internal/issues/1510	2026-05-15 17:36:48 +10:00
Ethan	a35f71cd8a	fix(coderd/x/chatd): retry HTTP/2 stream resets (#25170) Mid-stream HTTP/2 peer resets from LLM providers can arrive after a 200 streaming response has already emitted provisional parts. Previously those resets fell through as generic non-retryable errors because `stream ID` messages did not match retryable transport signals, and stream IDs could be misread as HTTP statuses. Classify retryable HTTP/2 RST_STREAM codes as transient timeout failures, ignore stream IDs during status extraction, and keep the existing `retry` event as the rollback boundary for provisional message parts so replacement attempts do not replay failed-attempt output. Closes CODAGT-382	2026-05-14 11:40:43 +10:00
Michael Suchacz	d1a471e29e	fix(coderd/x/chatd): retune subagent selection guidance (#25311) > Mux working on behalf of Mike. ## Summary - retune chatd subagent guidance to prefer `general` for substantial delegated work, including read-only synthesis and planning support - narrow `explore` guidance to repository-local code lookup and bounded tracing - add regression tests for planning, spawn tool, and Plan Mode guidance text ## Tests - `go test ./coderd/x/chatd -run 'Test(DefaultSystemPromptPlanningGuidance_SteersSubagentSelection|SpawnAgent_DescriptionSteersGeneralForSubstantialResearch|SpawnAgent_PlanModeDescriptionOmitsComputerUse|PlanningOverlaySubagentGuidance_UsesPlanModeSafeDescriptions|ExploreSubagentIsReadOnly)$'` - `make lint` - `make test TEST_PACKAGES=./coderd/x/chatd RUN=Guidance && make test TEST_PACKAGES=./coderd/x/chatd RUN=Description` - pre-commit hook during `git commit`	2026-05-13 23:10:21 +02:00
Kyle Carberry	b0b07536fc	feat: add opt-in Coder identity headers for MCP servers (#25153)	2026-05-12 08:54:53 -04:00
Michael Suchacz	f1d160c7f4	fix: allow changing model when editing earlier chat message (#25084 ) Editing a previous user message and selecting a different model in the picker silently kept using the original model: the selection was dropped on the frontend, in the SDK, and in the backend, so both the replacement user message and the assistant turn that followed ran against the old model. Plumb the selected model through all three layers (`AgentChatPage`, `codersdk.EditChatMessageRequest`, `chatd.EditMessageOptions` / `Server.EditMessage`), defaulting to the original message's model when the client does not specify one. The existing `InsertChatMessages` CTE already advances `chats.last_model_config_id` when the inserted message's model differs, so the assistant turn picks up the new selection without further changes. The new model is validated inside the transaction, so an unknown ID rolls the edit back and returns a 400 `Invalid model config ID.`, mirroring the `SendMessage` path. Refs: CODAGT-345 This change was generated by a Coder agent. <details> <summary>Implementation plan</summary> # CODAGT-345: Editing an earlier message cannot change model ## Problem When editing a previous user message in a chat, the user can change the model in the model picker, but the backend keeps using the original message's model. The model selection is dropped at three layers: 1. Frontend: `AgentChatPage.tsx`'s edit branch builds an `EditChatMessageRequest` that omits `model_config_id`. The new-message branch (a few lines below) does include it. 2. SDK: `codersdk.EditChatMessageRequest` has no `ModelConfigID` field at all. 3. Backend: `chatd.EditMessageOptions` has no model field, and `Server.EditMessage` always copies the original message's `ModelConfigID` into the replacement message. 
Once the replacement user message is inserted with the original model, the `InsertChatMessages` CTE leaves `chats.last_model_config_id` unchanged, so the assistant turn that follows runs against the old model. ## Fix Plumb the selected model through all three layers, defaulting to the original message's model when the client doesn't override it. This mirrors the `SendMessage` path, which already accepts a `model_config_id` and validates it via `resolveSendMessageModelConfigID`. ### Backend - `codersdk/chats.go`: add `ModelConfigID *uuid.UUID` to `EditChatMessageRequest`. - `coderd/x/chatd/chatd.go`: - Add `ModelConfigID uuid.UUID` to `EditMessageOptions`. - In `EditMessage`, after fetching the edited message, resolve the model: if `opts.ModelConfigID != uuid.Nil`, validate it exists with `tx.GetChatModelConfigByID` (using `chatdModelConfigLookupContext`), otherwise keep `editedMsg.ModelConfigID.UUID`. Pass the resolved ID into `newChatMessage(...)`. - Reuse the existing `ErrInvalidModelConfigID` sentinel. - `coderd/exp_chats.go` (`patchChatMessage`): - Read `req.ModelConfigID` (nil-safe), pass into `chatd.EditMessageOptions`. - Add a `case xerrors.Is(editErr, chatd.ErrInvalidModelConfigID)` arm returning 400 `Invalid model config ID.`, matching the `postChatMessages` handler. ### Frontend - `site/src/pages/AgentsPage/AgentChatPage.tsx`: - In the edit branch, set `model_config_id: effectiveSelectedModel || undefined` on the `EditChatMessageRequest`. - On success, persist the chosen model to `lastModelConfigIDStorageKey` so the next chat from this browser keeps the same default. Mirrors the new-message branch. ### Generated - `make site/src/api/typesGenerated.ts` and `make coderd/apidoc/swagger.json` produce the updated `EditChatMessageRequest` schema in `typesGenerated.ts`, `coderd/apidoc/{docs.go,swagger.json}`, and `docs/reference/api/{chats.md,schemas.md}`. 
## Tests - `coderd/x/chatd/chatd_test.go`: - `TestEditMessageWithModelConfigOverride`: edit with a different model -> replacement message and `chats.LastModelConfigID` use the new model. - `TestEditMessagePreservesModelConfigByDefault`: edit without `ModelConfigID` -> original model preserved. - `TestEditMessageRejectsUnknownModelConfig`: passes a random UUID -> `ErrInvalidModelConfigID`, original message still present, `LastModelConfigID` unchanged (rollback). - `coderd/exp_chats_test.go` (under `TestPatchChatMessage`): - `ChangesModel`: end-to-end via SDK; `edited.Message.ModelConfigID` and `chat.LastModelConfigID` both match the new model. - `InvalidModelConfigID`: random UUID -> 400 `Invalid model config ID.`. </details>	2026-05-12 14:51:55 +02:00
Michael Suchacz	f847ff3731	test(coderd/x/chatd): skip stale notification flakes (#25177 ) Skip the chatd tests that currently flake because the control notification flow cannot distinguish stale wake/status NOTIFY payloads from real interrupt requests. Each skipped test includes a TODO to re-enable it after the chatd notification flow refactor handles stale notifications correctly. Supersedes #25133, #25134, #25135, and #25139. Refs [CODAGT-353](https://linear.app/coder/issue/CODAGT-353), [CODAGT-356](https://linear.app/coder/issue/CODAGT-356), [CODAGT-360](https://linear.app/coder/issue/CODAGT-360), and [CODAGT-361](https://linear.app/coder/issue/CODAGT-361). > Mux working on behalf of Mike.	2026-05-12 14:50:30 +02:00
Ethan	4e08543ace	test(coderd): centralize chat test harness and stabilize flakes (#25171 ) Chat tests previously constructed a real `openai` provider with a fake API key and no `BaseURL`, so background title generation hit `api.openai.com` and timed out under `-race`. The same root cause produced several distinct flakes: title regeneration races with synchronous `UpdateChat`/`ProposeChatTitle`, and pagination races against `updated_at` bumps from real-network processing. This moves the fake OpenAI-compatible provider and the chat-settle wait into first-class `coderdtest` capabilities. `coderd.Options.ChatProviderAPIKeys` is the new seam tests use to redirect chat traffic to a local `httptest.Server`. `coderdtest.WaitForChatSettled` replaces per-test waiters and drains tracked chat-daemon work after the chat row leaves `pending`/`running`. The `newChatClient*` constructors funnel through one options builder that installs the fake provider before the coderd test server so cleanup ordering is deterministic. Closes https://github.com/coder/internal/issues/1528 & Closes ENG-2659 Closes https://github.com/coder/internal/issues/1480 & Closes CODAGT-359 Closes https://github.com/coder/internal/issues/1507 & Closes CODAGT-368 Relates to https://github.com/coder/internal/issues/1397 & Relates to CODAGT-374	2026-05-12 22:13:55 +10:00
Kyle Carberry	376fc80451	fix(coderd/x/chatd): discover workspace MCP tools mid-turn after create_workspace (#25169 ) ## Problem In `coderd/x/chatd/chatd.go` `runChat`, workspace MCP discovery is gated on `chat.WorkspaceID.Valid` at the start of each turn. New chats that bind their workspace mid-turn (via `create_workspace` or `start_workspace`) get an empty workspace tool list on the first step, and the model falls back to `execute` (bash) because no workspace MCP tools are advertised. Repro: new chat → "create a workspace and use MCP tools". No `/api/v0/mcp/tools` request hits the agent on turn 1; turn 2 in the same chat works fine. ## Fix - Add a `PrepareTools` callback to `chatloop.RunOptions`, analogous to `PrepareMessages`. It is invoked once before each LLM step with the current tool list. When it returns non-nil, the chatloop replaces `opts.Tools`, rebuilds the per-step tool definitions, and appends new tool names to `opts.ActiveTools` so newly injected tools are callable immediately. - Wire `PrepareTools` in `runChat` to trigger workspace MCP discovery the first time the chat snapshot reports a valid `WorkspaceID`. The previous top-of-turn discovery path is unchanged for chats that start with a workspace. - Extract the discovery logic into `Server.discoverWorkspaceMCPTools` so the top-of-turn and mid-turn paths share identical behavior (cache, agent resolution, `ListMCPTools` timeout, invalidation). Mid-turn discovery stays disabled in plan-mode turns and Explore subagents, matching the existing top-of-turn gate. The `workspaceMCPDiscovered` flag prevents redundant dials after the first successful discovery. ## Tests - `coderd/x/chatd/chatloop/chatloop_test.go`: two new `TestRun_PrepareTools*` cases covering injection on the next step and active-set merging when `ActiveTools` is non-empty. 
- `coderd/x/chatd/chatd_test.go`: `TestRunChat_WorkspaceMCPDiscoveryAfterMidTurnCreateWorkspace` drives `runChat` through a `create_workspace` tool call against a real Postgres + mocked agent conn and asserts the second streamed LLM request advertises the workspace MCP tool. Verified that the test fails (and pinpoints the missing tool) when the `PrepareTools` wiring is disabled. ## Validation ``` go test ./coderd/x/chatd/chatloop/... -count=1 go test ./coderd/x/chatd/... -count=1 make lint/emdash ``` <details> <summary>Decision log</summary> - Chose a per-step `PrepareTools` callback over mutating `opts.Tools` in place because `chatloop.Run` builds the `fantasy.Tool` definitions once at start; a hook is required to let the LLM see new tools on the next step. - Returned `[]fantasy.AgentTool` (not also active-tool-names) and let the chatloop derive name merges via `mergeNewToolNames`. This avoids leaking plan-mode gating decisions into the callback contract. - Kept the existing top-of-turn discovery path so chats that already have a workspace at turn start pay no extra latency. - Skipped reusing `ReloadMessages` (history reload) since this is purely a tool-availability concern; coupling it to a history reload would defeat the chatloop cache prefix optimizations. </details> --- _This pull request was generated by Coder Agents._	2026-05-12 00:30:56 -04:00
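The active-set merging that 376fc80451 attributes to `mergeNewToolNames` might look roughly like the sketch below. The signature and the handling of an empty allowlist are assumptions: the commit message only says that names of newly injected tools are appended to `opts.ActiveTools`, and the sketch assumes an empty list means no allowlist is in force.

```go
package main

import "fmt"

// mergeNewToolNames returns active extended with every name in next that is
// neither in prev nor already in active. Tools that existed before but were
// deliberately left out of active (for example by plan-mode gating) stay
// inactive. An empty active list is returned unchanged, on the assumption
// that it means "no allowlist".
func mergeNewToolNames(active, prev, next []string) []string {
	if len(active) == 0 {
		return active
	}
	seen := make(map[string]struct{}, len(active)+len(prev))
	for _, name := range active {
		seen[name] = struct{}{}
	}
	for _, name := range prev {
		seen[name] = struct{}{}
	}
	merged := append([]string(nil), active...)
	for _, name := range next {
		if _, ok := seen[name]; ok {
			continue
		}
		seen[name] = struct{}{}
		merged = append(merged, name)
	}
	return merged
}

func main() {
	active := []string{"read_file"}
	prev := []string{"read_file", "execute"}
	next := []string{"read_file", "execute", "workspace_mcp_tool"}
	fmt.Println(mergeNewToolNames(active, prev, next))
}
```

Deriving the merge inside the loop, rather than having the callback return active names, matches the decision-log note about keeping plan-mode gating out of the callback contract.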
Kyle Carberry	5a5cd79c4c	fix: drop buffered chat parts after their durable message commits (#25164)	2026-05-12 00:30:38 -04:00
Kyle Carberry	0ed57ee343	fix(coderd/x/chatd): checkpoint buffered message_parts to avoid stale replay (#25145)	2026-05-11 17:27:03 -04:00
Thomas Kosiewski	e56381eb61	feat: stream advisor tool output (#25032 ) Stream advisor output into the advisor tool card while the nested advisor call is still running. This keeps the advisor implementation intentionally advisor-specific: the parent model still receives the same final structured tool result, while the frontend receives transient `tool-result.result_delta` parts to render partial advisor text in the expanded card. The final persisted chat history remains unchanged. Refs CODAGT-322. Generated by Coder Agents. <details> <summary>Implementation plan</summary> - Publish advisor text deltas from the nested `chatloop.Run` via `RunAdvisorOptions.OnAdviceDelta`. - Forward those deltas through `chatadvisor.Tool` with the parent advisor tool call ID. - Emit transient `ChatMessagePartTypeToolResult` websocket parts with `ResultDelta` from `chatd`. - Add `result_delta` to the generated tool-result TypeScript variant. - Accumulate tool result deltas in frontend stream state and keep the tool running until the final result arrives. - Render streamed advisor advice in the existing advisor card using streaming markdown mode, while retaining the updated advisor UI. </details>	2026-05-11 20:18:49 +02:00
Michael Suchacz	6bb88775ab	test(coderd/x/chatd): pin TestGetWorkspaceConn_StatusCheck to mock clock (#25130 ) The `TimedOutAgentCacheHit`, `CacheHitHealthyAgent`, and `CacheHitDBError` subtests of `TestGetWorkspaceConn_StatusCheck` built their `WorkspaceAgent` timestamps with `time.Now()` in the parent test's slice literal and then ran the actual check against the server's real wall clock (`quartz.NewReal()`). On slow Windows CI runners, more than `agentInactiveDisconnectTimeout` (30s) of wall time can elapse between slice construction and the parallel subtest body. In that window, the cached "healthy" agent gets reclassified as disconnected by `agentDisconnectedFor`, and `CacheHitHealthyAgent` fails with `errChatAgentDisconnected` instead of returning the cached connection. Build each agent inside the subtest with `quartz.NewMock(t)` and feed the same clock into the `Server` so the agent timestamps and the status math share a single frozen `now`. This matches the pattern already used by `TestGetWorkspaceConn_DialTimeoutDisconnectedRecoveryThreshold` in the same file. Closes https://github.com/coder/internal/issues/1522 <details> <summary>Verification</summary> Inserting `time.Sleep(35 * time.Second)` at the top of each subtest's body reliably reproduces the original failure (`errChatAgentDisconnected` on `CacheHitHealthyAgent`) on the parent commit and passes with this change. After removing the synthetic sleep, `go test ./coderd/x/chatd -run TestGetWorkspaceConn_StatusCheck -count=50` passes cleanly. </details> > Generated by Coder Agents on behalf of the assignee. Co-authored-by: Coder Agents <noreply@coder.com>	2026-05-11 19:53:58 +02:00
Michael Suchacz	60779ad2ec	test(coderd/x/chatd): stop waking acquireLoop in TestResolveExploreToolSnapshot (#25129 ) Fixes [CODAGT-367](https://linear.app/codercom/issue/CODAGT-367). `TestResolveExploreToolSnapshot/` flaked on CI (Linux and Windows) with `context deadline exceeded` on the `GetMCPServerConfigsByIDs` call inside `resolveExploreToolSnapshot`. Each test setup called `server.CreateChat` twice with `MCPServerIDs` set to fake `.example.com` URLs. `CreateChat` marks the chat pending and calls `signalWake`, which causes the chatd background `acquireLoop` to pick the chat up. That goroutine then dialed the fake MCP URLs (NXDOMAIN, slower on Windows) and made an OpenAI request with the dbgen default test key (401). Under CI load, that activity racing the 4 parallel subtests' `GetMCPServerConfigsByIDs` calls was enough to exceed the 25s test context deadline. The failure logs in the issue showed both side effects firing in the same job. `resolveExploreToolSnapshot` only reads `ID`, `MCPServerIDs`, `PlanMode`, `ParentChatID`, and `Mode` off the parent argument, so the chats do not need to be persisted. Build them as in-memory `database.Chat` values instead. The MCP server configs remain in the DB because the function still queries them via `GetMCPServerConfigsByIDs`. Verified locally with `go test ./coderd/x/chatd -run TestResolveExploreToolSnapshot -count=100 -race` (passes, ~5s total) and the surrounding `TestResolve` / `TestCreateChildSubagentChat` / `TestSpawnAgent_Explore` tests. --- _Made by Coder Agents on behalf of @ibetitsmike. [Linear session](https://linear.app/codercom/issue/CODAGT-367/flake-testresolveexploretoolsnapshot#agent-session-0730f3fe)._	2026-05-11 19:46:59 +02:00
Michael Suchacz	645b8cc63d	fix(coderd/x/chatd/chaterror): deflake TestClassify_ParsesRetryAfterHTTPDate (#25128 ) The test built a `Retry-After` HTTP-date with `time.Now().Add(3*time.Second).UTC().Format(http.TimeFormat)`, then asserted that the parsed `RetryAfter` was `>= 2s`. `http.TimeFormat` has second precision, so `Format()` truncates up to ~1s. Combined with the small elapsed time between formatting in the test and `time.Until()` in production, the value could land just under `offset-1s` (1.997s observed in CI), failing the lower bound. Round the formatted target up to the next whole second so the parsed deadline is never earlier than `now+offset`, and assert against a symmetric `[offset-1s, offset+1s]` window. Closes [CODAGT-365](https://linear.app/codercom/issue/CODAGT-365/flake-testclassify-parsesretryafterhttpdate) Refs https://github.com/coder/internal/issues/1512 <sub>Created by [Coder Agents](https://coder.com/docs/agent).</sub> Co-authored-by: Coder Agents <coderagents@coder.com>	2026-05-11 19:09:51 +02:00
Cian Johnston	e8508b2d90	fix: recover chatd from poisoned chain anchor on retry (#25097) When OpenAI's Responses API returns `Previous response with id ... not found` for a chained turn, classify it as a `ChainBroken` retry, clear `previous_response_id`, exit chain mode, reload full history, and let `chatretry` retry. Self-heals chats whose anchor was poisoned before #25074 stopped truncated streams from being persisted as a successful turn with a stored response id. The new state is exposed via the existing `coderd_chatd_stream_retries_total` counter as a `chain_broken="true"|"false"` label. Aggregating queries (`sum`, `rate` over `provider`/`model`/`kind`) keep working without changes; raw-series matchers without aggregation will now see two series per `(provider, model, kind)` where they previously saw one. The metric is internal-only so the blast radius should be small, but if you have dashboards that index by exact label matchers without aggregation they will need an extra `sum` or an explicit `chain_broken` selector. > 🤖 This PR was created with the help of Coder Agents, and was reviewed by a human 🧑‍💻	2026-05-11 17:43:40 +01:00
Michael Suchacz	915956460a	feat(coderd/x/chatd): add compact turn status labels (#25043) > Mux is acting on Mike's behalf. Changes chat turn-end summaries into compact status labels for the cached `last_turn_summary` and successful web push body. Uses a structured-output model call for successful turns, requiring a 2-5 word `label` and validating it to reject agent-centric phrasing. Pending and requires-action states keep deterministic status labels. Removes the earlier deterministic tool-signal pipeline in favor of the smaller structured-output path.	2026-05-11 17:09:42 +02:00
Mathias Fredriksson	fb60bb0c08	chore(coderd/x/chatd): instrument PromoteQueued + stream subscriber for ENG-2645 (#25085 ) TestPromoteQueuedWhileRequiresActionMixedTools has flaked three times across Windows and Ubuntu CI runners since 2026-05-06; local repro on the dev workspace has not surfaced it. The May 8 Ubuntu log shows all four PromoteQueued post-TX pubsub publishes reaching pg_notify, yet the test still times out 25s later, so the failure is downstream between the subscriber's listener and the test's events channel. Adds three Debug-level markers in chatd.go (no logic change) plus two t.Logf markers in the test's reader so the next CI occurrence pins down exactly which step failed. Closes ENG-2645 Closes coder/internal#1523	2026-05-11 08:33:46 +00:00
Ethan	063c06ca5f	test: prevent expired contexts in chatd parallel subtests (#25107 ) Parallel subtests in `coderd/x/chatd` reused a parent test context with a `testutil.WaitLong` deadline, so the context could expire before a subtest was scheduled under load. That made the subagent lifecycle tools return plain-text context errors instead of the expected JSON payload, causing flaky JSON unmarshal failures. Create fresh `chatdTestContext` values inside the affected parallel subtests and add `chatdTestContext` to the `paralleltestctx` custom function list so this pattern is caught by `make lint`. Closes https://github.com/coder/internal/issues/1494	2026-05-11 17:48:27 +10:00
Ethan	bd6cc1aaf2	feat(coderd): add stop_workspace chatd tool and recovery classification (#24997 ) ## Summary Adds a `stop_workspace` tool to chatd so the model can recover from the "workspace running but agent dead" failure mode (e.g. an OOM that leaves the workspace running but the agent unreachable) by stopping and then starting the workspace. ## What changed New `stop_workspace` chatd tool (`coderd/x/chatd/chattool/stopworkspace.go`). Mirrors `start_workspace`: shares `WorkspaceMu` to serialize with create/start, waits for any in-progress build before issuing a stop, and is idempotent only after a successful Stop transition. Failed stop builds re-attempt rather than reporting success. New `chatStopWorkspace` coderd hook (`coderd/exp_chats.go`). Mirrors `chatStartWorkspace` minus the `RequireActiveVersion` gate. Stop should not be blocked by template version policy. Differentiated recovery sentinels (`coderd/x/chatd/chatd.go`). `errChatAgentDisconnected` instructs the model to call `stop_workspace` then `start_workspace`. `errChatDialTimeout` instructs a single retry, then user escalation if it repeats. The previous single message conflated transient and persistent failures. Two-signal recovery gate. Recovery is only surfaced when a tool call times out and a fresh DB read of the latest workspace agent says `Disconnected`. The previous draft escalated on the DB read alone, which would fire on a 30-second heartbeat blip (e.g. agent respawn) and prompt a destructive stop/start unnecessarily. Cache-hit disconnected handling now clears the cache and retries a fresh dial before escalating, rather than returning the recovery sentinel immediately. Latest-agent classification uses `GetWorkspaceAgentsInLatestBuildByWorkspaceID` instead of the chat's bound `AgentID`, so stale bindings after a rebuild don't misclassify. 
Shared chattool helpers in `coderd/x/chatd/chattool/chattool.go`: `latestWorkspaceBuildAndJob`, `publishBuildBinding`, `provisionerJobTerminal`. Applied to both `start_workspace` and `stop_workspace`. ## Notes - Reverts an earlier draft that widened `ask_user_question` to root standard turns. Plan-mode-only behavior is restored. - The `stop_workspace` tool currently renders via the generic chat tool-call UI. A follow-up frontend PR will prettify the `stop_workspace` tool and style it like the `start_workspace` tool. - Never-connected (`Timeout` status) agents are intentionally excluded from recovery. They indicate template or startup failure, not the running-but-dead case this PR targets. Closes CODAGT-315	2026-05-11 16:23:07 +10:00
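The two-signal recovery gate described in this commit reduces to a conjunction of two observations. A minimal sketch, assuming hypothetical names (`agentStatus`, `shouldSurfaceRecovery`) and string status values chosen only for illustration:

```go
package main

import "fmt"

// agentStatus is a hypothetical stand-in for the workspace agent status.
type agentStatus string

const (
	statusConnected    agentStatus = "connected"
	statusConnecting   agentStatus = "connecting"
	statusDisconnected agentStatus = "disconnected"
	statusTimeout      agentStatus = "timeout" // agent never connected
)

// shouldSurfaceRecovery requires both signals: the tool call timed out AND
// a fresh read of the latest agent says Disconnected. Never-connected
// (Timeout) agents are excluded, as the commit notes.
func shouldSurfaceRecovery(toolCallTimedOut bool, latest agentStatus) bool {
	return toolCallTimedOut && latest == statusDisconnected
}

func main() {
	fmt.Println(shouldSurfaceRecovery(true, statusDisconnected))  // both signals
	fmt.Println(shouldSurfaceRecovery(false, statusDisconnected)) // heartbeat blip only
	fmt.Println(shouldSurfaceRecovery(true, statusTimeout))       // never connected
}
```

Requiring the timeout as well as the status read is what keeps a brief heartbeat gap from triggering a destructive stop/start.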
Mathias Fredriksson	3925d3941b	fix(coderd/x/chatd): wait long enough for cold-start workspace MCP discovery (#25035 ) The 5s timeout cancelled cold-start ListMCPTools calls before the agent's 30s connectTimeout could settle, so workspace MCP tools never reached the LLM. Bump to 35s and scope to ListMCPTools only.	2026-05-08 17:49:10 +03:00
Ethan	b6dbc5614c	fix(coderd/x/chatd): handle truncated provider streams (#25074 ) coder/fantasy now fails closed when Anthropic or OpenAI Responses streams close before their provider terminal events instead of yielding a successful finish. This bumps the fantasy replacement to coder/fantasy#33 and teaches chat error classification to treat those failures as retryable timeout errors with explicit stream-closed messages.	2026-05-08 15:52:42 +10:00
Ethan	de9cdca77e	fix(coderd): handle external-agent workspaces honestly in chat (#24969 ) ## Summary Make Coder's chat agent honest about workspaces that use `coder_external_agent`. Three behaviors change so the chat stops pretending it can drive an external workspace through to a usable state on its own. ## Problem External agents are not started by Coder. The user has to run `coder agent` on their own host with a token Coder generates. Before this change, the chat agent treated those workspaces like any other: - `create_workspace` would enqueue a build for an external-agent template and then wait minutes (~22 worst case) for an agent that was never going to come up. - When mid-turn tool calls dialed an external agent that was not connected, the chat burned the full 30-second dial timeout and returned generic "the workspace may need to be restarted from the Coder dashboard" guidance, which is not the action the user can take. - Nothing told the chat (or the user, through the chat) that the next action lives outside Coder. ## Fix Three changes scoped to `coderd/x/chatd/`: 1. `create_workspace` blocks templates with external agents. The tool reads `template_versions.has_external_agent` for the template's active version and refuses external-agent templates with a message instructing the chat to pick a different template, or to have the user create and start the workspace themselves and then attach it. 2. Attaching an existing external workspace stays open. No selection-time gate on attachment; users can still bind a working external workspace to a chat. 3. External-agent-aware error handling on connection. Two complementary changes both predicated on proven connectivity failures rather than every dial error: - `getWorkspaceConn` preflight and timeout handling. 
Before opening a connection, the cache-miss path reads the agent's status from the already-loaded row. If the selected agent is external and clearly offline according to the existing `isAgentUnreachable` helper (`Disconnected` or `Timeout`, never `Connecting`), it returns an external-agent-specific error immediately instead of waiting out the 30-second dial timeout. `Connecting` external agents fall through to the dial so a user who just started the agent on their host can still succeed in the same turn. The preflight only fires when the agent is still the latest selected agent for the workspace, so stale-binding recovery via `dialWithLazyValidation` is unaffected. The post-dial rewrite is limited to the dial timeout sentinel; stale/no-agent bindings and non-timeout dial failures preserve their original errors. - `waitForAgentReady` timeout-branch rewrite. The 2-minute retry loop used by `create_workspace` and `start_workspace` runs unchanged for all agents. When the loop's outer deadline elapses, the timeout branch substitutes the external-agent message in place of the raw dial error if the agent belongs to an external resource. This applies the same pattern that the cache-hit path of `getWorkspaceConn` already used (`isAgentUnreachable` returning `errChatAgentDisconnected`), extended to the cache-miss path and to the readiness helper, with the external-agent-aware error rewrite layered only on confirmed offline or timeout paths. Closes CODAGT-314	2026-05-08 13:51:13 +10:00
Ethan	3a9080fff6	feat: tag chat-originating agent logs with chat_id (#25019 ) Workspace-agent logs emitted while serving chatd-driven requests were not correlated with the originating chat, making agent logs hard to attribute to the corresponding/originating chat. This adds agent-side chat context middleware that parses `Coder-Chat-Id` once, enriches agent access logs and structured handler/background logs, and adds a chatd bridge log when chat headers are attached to an agent connection. Closes CODAGT-324	2026-05-08 13:25:30 +10:00
Dean Sheather	e1b1c7ec5b	feat: resize chat image attachments client-side for provider budgets (#24533 ) Anthropic rejects inline images over 5,242,880 bytes, but our upload endpoint accepts images up to 10 MiB — so 5–10 MiB images were reaching the provider and failing. This adds two layers of protection: the browser resizes oversized images before upload, and the server rejects any that still slip through before an upstream request is issued. Client-side resizing uses `createImageBitmap` with `resizeWidth`/`resizeHeight` to clamp the decoded bitmap at decode time, then iteratively shrinks on an `OffscreenCanvas` (falling back to `HTMLCanvasElement`) until the output fits the applicable budget. Anthropic (and Bedrock-hosted Claude — fantasy's bedrock provider is a thin wrapper around the Anthropic client) uses a ~5 MiB budget; other providers use a ~10 MiB budget to stay under the server cap. Doing the resize in the browser avoids decoding attacker-controlled image bytes in `coderd` (image-bomb DoS surface). Server-side, `chatFileResolver` now takes a provider string and looks up the inline-image cap via a new `chatprovider.InlineImageByteCap` helper; oversized `image/*` files for capped providers are rejected with a pre-classified `chaterror` before the SDK call. The backstop fires for older clients, direct API callers, or any image that was committed to the composer before the user switched to a stricter provider. Attachments commit to composer state synchronously with a new `"processing"` `UploadState` so paste+Enter can't dispatch before the resize finishes; the `"uploading"` send gate now covers both states. Dismissed-while-resizing attachments are tracked in a `WeakSet` so a late swap can't resurrect a removed file. Closes CODAGT-215	2026-05-08 02:07:33 +10:00
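The server-side backstop from this commit can be illustrated with a small lookup plus a size check. This is a sketch under stated assumptions: `inlineImageByteCap` and `checkInlineImage` are hypothetical names (the real helper is `chatprovider.InlineImageByteCap`, whose signature is not shown here), and the provider keys `"anthropic"` and `"bedrock"` are assumed. Only the 5,242,880-byte figure comes from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// anthropicInlineImageCap is the limit quoted in the commit message.
const anthropicInlineImageCap int64 = 5_242_880

// inlineImageByteCap returns the inline-image cap for a provider and
// whether the provider is capped at all. Provider keys are assumptions.
func inlineImageByteCap(provider string) (int64, bool) {
	switch provider {
	case "anthropic", "bedrock":
		return anthropicInlineImageCap, true
	default:
		return 0, false
	}
}

// checkInlineImage rejects oversized image/* files for capped providers
// before any upstream request would be issued. Non-image files pass.
func checkInlineImage(provider, mediaType string, size int64) error {
	if !strings.HasPrefix(mediaType, "image/") {
		return nil
	}
	limit, capped := inlineImageByteCap(provider)
	if capped && size > limit {
		return fmt.Errorf("image is %d bytes; %s accepts at most %d", size, provider, limit)
	}
	return nil
}

func main() {
	fmt.Println(checkInlineImage("anthropic", "image/png", 6<<20))
	fmt.Println(checkInlineImage("openai", "image/png", 6<<20))
}
```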
Thomas Kosiewski	273e828442	fix: remove advisor reasoning configuration (#25030 )	2026-05-07 15:19:19 +02:00
Mathias Fredriksson	8c08aa1f6c	fix(coderd/x/chatd): wake after async chat-row UPDATEs commit (#25036 ) The async title-generation and turn-summary goroutines launched from processChat run autocommit UPDATEs on the chat row after finishActiveChat has set the chat to pending and signalWake has fired. If the row lock from one of those UPDATEs is held while acquireLoop's processOnce runs, AcquireChats's FOR UPDATE SKIP LOCKED skips the freshly-pending chat and returns no rows. The wake is then consumed with no acquisition, and the chat sits in pending until the next acquireTicker (default 1s). Wake again after each UPDATE commits. The second wake covers the race window without changing the transaction semantics. Closes coder/internal#1500	2026-05-07 15:53:11 +03:00
Ethan	6fa7e84761	test(coderd/x/chatd): skip TestExploreChatSendMessageCannotMutateMCPSnapshot (#25023 ) Skips `TestExploreChatSendMessageCannotMutateMCPSnapshot` while the chatd redesign is in flight. The test exposes a self-interrupt race in `processChat`'s control-pubsub subscriber that is structurally fixed by the redesign in #24444; skipping until then matches the existing `TestSubscribeRelayEstablishedMidStream` skip in `enterprise/coderd/x/chatd/chatd_test.go`. Relates to https://github.com/coder/internal/issues/1493.	2026-05-07 07:56:17 +00:00
Ethan	2ff05608d2	test: stabilize chatdebug heartbeat threshold test (#25022 ) `launchHeartbeat` could miss a stale-threshold update during startup if `SetStaleAfter` ran after the heartbeat ticker was created but before the goroutine subscribed to `thresholdChan`. In that case, the heartbeat kept the old interval until a future tick, and the mock-clock test could time out waiting for `Ticker.Reset` without advancing time. Subscribe to `thresholdChan` before reading the heartbeat interval so the channel consistently invalidates the interval. The regression test now changes the threshold while ticker creation is trapped, making the startup race deterministic. Closes https://github.com/coder/internal/issues/1513	2026-05-07 17:12:14 +10:00
Ethan	100ebd9f3b	test(coderd/x/chatd): deflake advisor chain mode snapshot (#25021 ) `TestAdvisorChainMode_SnapshotKeepsFullHistory` was using the generic active chatd test server, which leaves periodic pending-chat polling enabled. That made the test inconsistent with the other OpenAI Responses API tests and allowed stale pending pubsub notifications to interrupt the second turn before the advisor request was observed. Use the existing OpenAI Responses test server helper so pending-chat acquisition is delayed and the test only starts processing after the SendMessage pending notification has been published. Closes https://github.com/coder/internal/issues/1510	2026-05-07 17:12:12 +10:00
Ethan	ef0151601e	feat: report insufficient quota build failures in chat tools (#24956 ) ## Summary When a workspace build fails because the user is over their group quota, the chat tools currently surface the failure as a bare `"workspace build failed: insufficient quota"` string with no machine-readable error code and no visibility into the user's current usage. Agents and the UI cannot distinguish quota failures from any other Terraform error, so users see an opaque message and have no clear path to recovery. This PR tags quota failures with a typed error code at the source and propagates it through the chat tool layer so callers can react to it explicitly. Relates to CODAGT-20 ## Changes Provisioner runner - Add `InsufficientQuotaErrorCode = "INSUFFICIENT_QUOTA"` and set it explicitly at the `commitQuota` failure site via a new `failedWorkspaceBuildfCode` helper, so `provisioner_jobs.error_code` is populated only on the genuine quota path. The substring matcher used for externally produced sentinels (e.g. `"missing parameter"`, `"required template variables"`) is intentionally not extended; provider errors that happen to mention "insufficient quota" stay classified as generic build failures. SDK and API contract - Add `JobErrorCodeInsufficientQuota` and a `JobIsInsufficientQuotaErrorCode` helper to `codersdk`. - Extend the swagger `enums` tag on `ProvisionerJob.ErrorCode` to include `INSUFFICIENT_QUOTA`. - Regenerate `coderd/apidoc`, `docs/reference/api/`, and `site/src/api/typesGenerated.ts`. chattool create_workspace / start_workspace* - `waitForBuild` now returns a typed `*workspaceBuildError` carrying both the message and the `JobErrorCode`, instead of a bare error string. 
- New `quotaerror.go` introduces a structured `quotaErrorResult` (with `error_code`, `title`, `message`, `build_id`, and optional `quota`) and a best-effort `workspaceQuotaDetails` lookup that wraps owner authorization internally and fetches `credits_consumed` and `budget` from the database. Quota lookup failures (including authorization failures) never block the failure payload. - On quota-coded build failures, both `create_workspace` and `start_workspace` now return the structured response (with the recovery guidance inlined into `message`) instead of the bare `"insufficient quota"` string. This applies to all three failure paths: post-creation, an in-progress existing build, and a freshly triggered start build. Non-quota build failures continue to use the existing `buildToolResponse` / `newBuildError` path. - Owner authorization is wrapped only on the call sites that need it (the `CreateFn` and `StartFn` invocations and the quota-detail lookup), so idempotent fast paths (already running, already in progress, existing-workspace early returns) do not pay for an extra RBAC round-trip or fail when role lookup is transient. ## Out of scope - No changes to quota math, allowances, or bypass behavior. - No automatic retries. - No new quota-inspection tools and no changes to MCP `coder_create_workspace` (which returns immediately and never observed the build outcome here). - No frontend UI changes; those will land in a follow-up PR that consumes the new `INSUFFICIENT_QUOTA` code.	2026-05-07 15:01:58 +10:00
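The typed-error flow described in this commit might look roughly like the following. It is a sketch, not the code from the PR: the type and field names follow the commit message (`workspaceBuildError`, `quotaErrorResult`, `error_code`, `title`, `message`, `build_id`, `quota`), but the struct layouts, the `toolResponse` helper, and the `title` text are assumptions.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

const insufficientQuotaErrorCode = "INSUFFICIENT_QUOTA"

// workspaceBuildError carries the failure message and the job error code.
type workspaceBuildError struct {
	Message string
	Code    string
}

func (e *workspaceBuildError) Error() string { return "workspace build failed: " + e.Message }

// quotaDetails is the optional, best-effort usage lookup.
type quotaDetails struct {
	CreditsConsumed int `json:"credits_consumed"`
	Budget          int `json:"budget"`
}

// quotaErrorResult is the structured payload returned to the model.
type quotaErrorResult struct {
	ErrorCode string        `json:"error_code"`
	Title     string        `json:"title"`
	Message   string        `json:"message"`
	BuildID   string        `json:"build_id"`
	Quota     *quotaDetails `json:"quota,omitempty"`
}

// toolResponse returns the structured JSON for quota-coded failures and
// ok=false for everything else, which keeps using the generic path.
// A nil quota (lookup failed) never blocks the payload.
func toolResponse(err error, buildID string, quota *quotaDetails) (string, bool) {
	var buildErr *workspaceBuildError
	if !errors.As(err, &buildErr) || buildErr.Code != insufficientQuotaErrorCode {
		return "", false
	}
	out, marshalErr := json.Marshal(quotaErrorResult{
		ErrorCode: buildErr.Code,
		Title:     "Insufficient quota", // hypothetical title text
		Message:   buildErr.Message,
		BuildID:   buildID,
		Quota:     quota,
	})
	if marshalErr != nil {
		return "", false
	}
	return string(out), true
}

func main() {
	err := fmt.Errorf("wait for build: %w", &workspaceBuildError{
		Message: "insufficient quota",
		Code:    insufficientQuotaErrorCode,
	})
	payload, ok := toolResponse(err, "b1", &quotaDetails{CreditsConsumed: 10, Budget: 10})
	fmt.Println(ok, payload)
}
```

Matching on the error code rather than the message text mirrors the commit's decision not to extend the substring matcher.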
Mathias Fredriksson	6b0518d051	fix: state-aware queued message promotion (#24819 ) PromoteQueued now branches on chat status: synth tool results before the user message on requires_action, deferred reorder + Waiting on running so the worker's persist+auto-promote keeps partial output. Stale heartbeat falls through to the synchronous path; GetStaleChats picks up Waiting+queue to recover post-cleanup-crash. Endpoint returns 202. Closes CODAGT-119	2026-05-06 19:11:56 +03:00
Michael Suchacz	0bfb9f6f13	feat: show agent turn summary in agents sidebar (#24942 ) Persists the agent-generated turn-end summary on `chats` and shows it as the Agents sidebar subtitle when present, falling back to the model name. Errors still take precedence. > Mux is acting on Mike's behalf. ## What changes Storage. New nullable `last_turn_summary` column on `chats` (migration `000486`). New `UpdateChatLastTurnSummary` query normalizes blank/whitespace input to `NULL`, preserves `updated_at` (so the chat does not jump to the top of the sidebar on summary writes), and uses an `expected_updated_at` stale-write guard so an older async summary cannot overwrite a newer turn. Backend. `coderd/x/chatd/chatd.go` decouples summary generation from webpush. Generated summaries persist for completed parent turns even when webpush is unconfigured or has no subscriptions. The same generated text is reused as the webpush body when webpush is configured, so the summary model is not called twice. Generic fallback push text is no longer persisted; it clears any stale summary instead. Error/interrupt/pending-action terminal paths clear `last_turn_summary` for the latest turn. Frontend. `AgentsSidebar.tsx` subtitle priority is now `errorReason || lastTurnSummary || modelName`, normalized via the existing `asNonEmptyString` helper from `blockUtils.ts`. ## Tests - `TestUpdateChatLastTurnSummary` (database): success, whitespace-to-NULL, stale guard rejects, `updated_at` preserved. - `TestUpdateLastTurnSummaryRejectsStaleWrites` (chatd internal): direct stale-`expected_updated_at` test. - `TestSuccessfulChatPersistsTurnSummaryWithoutWebPush`: persistence works without webpush subscriptions. - `TestSuccessfulChatSendsWebPushWithSummary`: same generated text drives both DB and push body. - `TestSuccessfulChatSendsWebPushFallbackWithoutSummaryForEmptyAssistantText`: fallback text is not persisted. - `TestErroredChatClearsLastTurnSummaryAndSendsWebPush`: error path clears the field. 
- `TestInterruptChatDoesNotSendWebPushNotification`: interrupt path clears the field, no push fires. - `AgentsSidebar.test.tsx`: subtitle priority for summary-present, error-wins, no-summary fallback, whitespace fallback. - `AgentsSidebar.stories.tsx`: `ChatWithTurnSummary` and `ChatWithTurnSummaryAndError`. ## Notes - No backfill. Existing chats keep showing the model name until their next turn completes. - Parent chats only in this iteration; the field is rendered on any `Chat` if a future change extends generation to children. - Decoupling generation from webpush adds quickgen model calls for completed parent turns that previously skipped generation when no subscriptions existed. Existing parent-only, assistant-text-present, `PushSummaryModel` configured, and bounded-timeout gates keep this behavior bounded.	2026-05-06 16:43:35 +02:00
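The storage semantics of `UpdateChatLastTurnSummary` (blank input becomes NULL, `updated_at` is preserved, and a stale `expected_updated_at` is rejected) can be modelled in memory. This is an illustrative sketch only; the real logic lives in a SQL query, and `chatRow`, `normalizeSummary`, `updateLastTurnSummary`, and `demo` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// chatRow models the two columns the guard cares about.
type chatRow struct {
	UpdatedAt       time.Time
	LastTurnSummary *string // nil stands in for SQL NULL
}

// normalizeSummary maps blank or whitespace-only input to nil (NULL).
func normalizeSummary(s string) *string {
	if strings.TrimSpace(s) == "" {
		return nil
	}
	return &s
}

// updateLastTurnSummary writes the summary only if the row has not moved
// on since the caller read it. UpdatedAt is deliberately left untouched.
func updateLastTurnSummary(row *chatRow, summary string, expectedUpdatedAt time.Time) bool {
	if !row.UpdatedAt.Equal(expectedUpdatedAt) {
		return false // stale write: a newer turn already changed the row
	}
	row.LastTurnSummary = normalizeSummary(summary)
	return true
}

// demo exercises the three behaviours named in the commit's tests.
func demo() (staleRejected, freshApplied, blankCleared bool) {
	t0 := time.Unix(1_700_000_000, 0)
	row := &chatRow{UpdatedAt: t0.Add(time.Minute)} // a newer turn bumped updated_at
	staleRejected = !updateLastTurnSummary(row, "Old summary", t0)
	freshApplied = updateLastTurnSummary(row, "Fixed login bug", row.UpdatedAt) &&
		row.LastTurnSummary != nil && *row.LastTurnSummary == "Fixed login bug"
	blankCleared = updateLastTurnSummary(row, "   ", row.UpdatedAt) && row.LastTurnSummary == nil
	return
}

func main() {
	fmt.Println(demo())
}
```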
Cian Johnston	a74015fc85	refactor: make store and chatID explicit parameter arguments in chattools (#24850 ) Fixes CODAGT-175 Addresses a review finding in https://github.com/coder/coder/pull/23827 that the nil-guards for `database.Store` and `chatID` in the `chattool` package are dead code in practice. - Modifies the signatures to require passing both `database.Store` and `chatID` explicitly as positional arguments instead of as parameter struct keys. - Drops the nil-guards for `database.Store` and `chatID`.	2026-05-06 11:05:16 +01:00
Ethan	e5c7fdff86	fix(coderd/x/chatd): refresh chat status and bound subscriber reads on Subscribe (#24095 ) Tightens the chat stream subscription path on a few related axes. None of these changes touch the steady-state event flow; they all concern the subscribe handshake. ## Motivation `Server.Subscribe` carries three responsibilities that were entangled: 1. Authorize the caller against the chat row. 2. Arm local + pubsub subscriptions before any DB reads (subscribe-first-then-query). 3. Build the initial snapshot from a fresh chat row, message history, and queue. When all three live in one function and share the request context, a few unfortunate behaviors fall out: - The HTTP handler's middleware already loaded and authorized the chat row, but `Subscribe(chatID)` discarded it and re-fetched on every WebSocket connection. - The chat row used to populate the initial `status` event was loaded before the pubsub subscription was armed, so a status transition that happened in that window was silently lost. - Control-path DB reads inherited whatever context the caller passed in. A caller without a deadline could wedge a subscriber goroutine indefinitely on a stalled DB. - A transient failure of the chat re-read collapsed the entire subscription instead of degrading gracefully. ## What changes Split the auth boundary out into the type signature. A new `SubscribeAuthorized(ctx, chat, ...)` takes the already-authorized row directly. The HTTP handler in `coderd/exp_chats.go` calls it with the chat row from `httpmw.ChatParam`, eliminating the redundant `GetChatByID`. `Subscribe(chatID)` is preserved as a thin wrapper for callers that don't have a chat row in hand (tests, internal callers); it does the auth lookup and delegates. Re-read the chat after arming subscriptions. Inside `SubscribeAuthorized`, after the local stream and pubsub subscriptions are active, we reload the chat row to populate the initial `status` event and any enterprise relay setup. 
Combined with the existing subscribe-first-then-query pattern, this closes the gap where a status transition between the middleware's load and the subscription arming would not appear in either the initial snapshot or a live notification. Fall back to the middleware row on refresh failure. If the post-subscription refresh fails (transient DB blip, brief pool exhaustion), we log a warning and reuse the row that proved authorization in the first place. Messages, queue, and pubsub are all independent of this row, so the stream still works; the initial `status` is just slightly stale and self-corrects via the next pubsub event. Bound subscriber control-path DB reads. A new `streamSubscriberControlFetchContext` helper applies a 5-second fallback timeout only when the caller has no deadline of their own. Used at the chat refresh, the initial queue load, and the queue-update goroutine following pubsub notifications. HTTP-driven callers pass through unchanged; background callers can no longer hang forever on a stalled DB and leak subscriber goroutines, pubsub subscriptions, and `chatStreams` entries.	2026-05-06 14:29:53 +10:00
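The bounded-read helper described above applies a fallback timeout only when the caller has no deadline. A minimal sketch of that pattern, assuming a hypothetical name `controlFetchContext` (the commit's helper is `streamSubscriberControlFetchContext`, whose exact signature is not shown here):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// controlFetchFallback is the 5-second fallback named in the commit.
const controlFetchFallback = 5 * time.Second

// controlFetchContext leaves contexts that already carry a deadline
// unchanged and bounds deadline-free contexts with the fallback timeout.
func controlFetchContext(ctx context.Context) (context.Context, context.CancelFunc) {
	if _, ok := ctx.Deadline(); ok {
		return ctx, func() {} // caller's deadline wins; nothing to cancel
	}
	return context.WithTimeout(ctx, controlFetchFallback)
}

// demo checks both branches without touching a database.
func demo() (fallbackApplied, callerDeadlineKept bool) {
	bg, cancelBg := controlFetchContext(context.Background())
	defer cancelBg()
	_, fallbackApplied = bg.Deadline()

	parent, cancelParent := context.WithTimeout(context.Background(), time.Minute)
	defer cancelParent()
	want, _ := parent.Deadline()
	derived, cancelDerived := controlFetchContext(parent)
	defer cancelDerived()
	got, _ := derived.Deadline()
	callerDeadlineKept = got.Equal(want)
	return
}

func main() {
	fmt.Println(demo())
}
```

Returning a no-op cancel function in the pass-through branch lets call sites always `defer cancel()` without caring which branch ran.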
Ethan	46a60e6d5d	refactor: move chat error kinds into codersdk (#24955 ) Moves the chat error kind taxonomy from `coderd/x/chatd/chaterror` into `codersdk.ChatErrorKind` and types `ChatError.Kind` / `ChatStreamRetry.Kind` so generated TypeScript exposes an SDK-owned union, including `usage_limit`. Backend chat classification now references the SDK constants directly while preserving the existing JSON string values. Keeps chat usage-limit admission failures on their existing 409 response shape. The frontend maps structured usage-limit responses to the SDK-owned `usage_limit` kind, uses generated `TypesGen.ChatErrorKind` directly, and removes the local string union and alias.	2026-05-06 11:57:48 +10:00
Ethan	4751416b29	fix!: persist structured chat errors (#24919 ) Breaking change for changelog: > `codersdk.Chat.last_error` now returns a structured `ChatError` object (`{message, kind, provider, retryable, status_code, detail}`) instead of a plain string. The chats API is experimental (`/api/experimental/chats`), so this ships without a deprecation cycle; consumers reading `chat.last_error` as a string must update to read `chat.last_error.message`. SDK/generated TypeScript terminal error payloads now use the single `ChatError` type; the live stream error payload type is renamed from `ChatStreamError` to `ChatError`. Persisted chat errors now carry the same provider-specific detail (kind, provider, retryable, HTTP status, optional detail) as the live stream, so refreshing a failed chat rehydrates with the full structured error instead of a one-line headline. Existing rows are migrated in place: legacy text errors are wrapped into `{message, kind: "generic"}` so already-errored chats still render, and rows with `last_error IS NULL` stay NULL. Internally, persisted fallback decoding now reuses the existing `chaterror.KindGeneric` constant, with no JSON value change. Closes CODAGT-239	2026-05-05 12:56:06 +10:00
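The in-place migration wraps legacy text into `{message, kind: "generic"}` and leaves NULL rows as NULL. The same rule can be sketched as a tolerant decoder. This is an assumption-laden illustration: `chatError`, `decodeLastError`, and the `omitempty` tags are hypothetical, and only the field names come from the commit message.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// kindGeneric mirrors the "generic" kind named in the commit.
const kindGeneric = "generic"

// chatError follows the field list in the commit message.
type chatError struct {
	Message    string `json:"message"`
	Kind       string `json:"kind"`
	Provider   string `json:"provider,omitempty"`
	Retryable  bool   `json:"retryable,omitempty"`
	StatusCode int    `json:"status_code,omitempty"`
	Detail     string `json:"detail,omitempty"`
}

// decodeLastError keeps NULL as nil, decodes structured rows, and wraps
// anything else (legacy plain text) as a generic error.
func decodeLastError(raw *string) *chatError {
	if raw == nil {
		return nil
	}
	var ce chatError
	if err := json.Unmarshal([]byte(*raw), &ce); err == nil && ce.Message != "" {
		return &ce
	}
	return &chatError{Message: *raw, Kind: kindGeneric}
}

func main() {
	legacy := "provider timed out"
	structured := `{"message":"rate limited","kind":"rate_limit","status_code":429}`
	fmt.Printf("%+v\n", *decodeLastError(&legacy))
	fmt.Printf("%+v\n", *decodeLastError(&structured))
	fmt.Println(decodeLastError(nil) == nil)
}
```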
Ethan	7e01edeb8e	fix: align chat attachment picker with allowed file types (#24917 ) The agent chat composer only advertised image uploads to the OS file picker and filtered drag-and-drop and paste events to `image/*`, even though the backend accepts text, CSV, JSON, PDF, and a narrower set of image types. Move the allowed chat attachment media types into `codersdk` so the frontend picker and backend enforcement share one source of truth. Use the generated TypeScript list to drive the file input `accept` attribute and the drag-and-drop and paste filters, while adding common text extensions so platforms without MIME registrations still surface those files in the picker.	2026-05-05 12:25:13 +10:00
Michael Suchacz	632dcdb63a	feat: add personal chat model overrides (#24715 )	2026-05-05 00:57:51 +02:00
Michael Suchacz	0bb09935bc	feat: add computer-use provider selection for AI agents (#24772 ) Adds a deployment-wide setting to select the computer-use provider (Anthropic or OpenAI) for AI agents, plus the OpenAI computer-use runner needed to honor that selection. The setting is stored in `site_configs` under `agents_computer_use_provider`, defaults to Anthropic when unset, and is exposed via experimental GET/PUT endpoints under `/api/experimental/chats/config/computer-use-provider`. The chatd computer-use tool now dispatches to either `runAnthropicComputerUse` or `runOpenAIComputerUse` based on the resolved provider, with provider-specific result metadata for OpenAI screenshots. Frontend adds a provider dropdown to the Agents Experiments settings page nested under the virtual desktop toggle, with disabled state handling while virtual desktop is off and skeleton loaders while config queries are in flight. Hugo and Codex review follow-up: - Uses shared provider validation and clearer computer-use constant names. - Removes stale OpenAI pending-safety-checks commentary. - Documents why provider result metadata is needed for OpenAI screenshots. - Keeps the computer-use subagent visible when provider credentials are missing, then returns a clear spawn-time configuration error. - Uses OpenAI's recommended 1600x900 screenshot geometry to preserve the native 16:9 aspect ratio. - Moves OpenAI-specific computer-use helpers into `coderd/x/chatd/chatopenai/computeruse` after rebasing onto the provider package refactor in `main`. - Converts OpenAI pixel scroll deltas to Coder desktop wheel-click amounts. - Preserves OpenAI pointer modifiers with key down/up desktop actions and rejects unsupported non-left double-click buttons explicitly. - Maps OpenAI back/forward side-button clicks to browser navigation key actions. - Defaults omitted OpenAI click buttons to left-click. - Retries mouse release cleanup if the final OpenAI drag release fails. 
- Keeps computer-use subagent availability messages stable when provider config cannot be loaded, while logging the backend error. - Releases remaining OpenAI modifier keys if a synthetic key-up cleanup action fails. - Updates Storybook interaction stories so provider snapshots show the selected final provider. > Mux updated this PR description on behalf of Mike.	2026-05-04 20:30:50 +02:00
Michael Suchacz	033ed0bb82	feat: add admin-configurable chat title generation model (#24838 ) Adds an admin-configurable deployment-wide setting that controls which model is used for chat title generation. Admins can pick any enabled chat model config from the Agents settings page, or leave the setting unset to keep the existing fast-models-then-chat-model fallback algorithm. When a model is selected, both automatic and manual title generation use only that model, with no silent fallback. When the configured model is disabled, missing credentials, or otherwise unusable, automatic title generation skips entirely (best-effort) and manual title regeneration returns a clear error, so admins notice the misconfiguration instead of silently routing title traffic through another provider. ## Surface - New deployment-wide setting stored as a `site_configs` row (`agents_chat_title_generation_model_override`). - New experimental endpoint `GET/PUT /api/experimental/chats/config/model-override/{context}`. - Frontend: title generation now appears as a third dropdown on the Agents admin settings page alongside the existing general and explore context overrides. ## DRY refactors folded in Title generation is integrated as a third value of the existing `ChatModelOverrideContext` type alongside `general` and `explore`, sharing the parameterized HTTP route, SDK methods, generated types, and frontend API plumbing rather than introducing a parallel surface. The `Agent` prefix was dropped from the type and route since title generation is not a delegated agent. The chatd model-override resolver is also shared. `resolveConfiguredModelOverride` now takes a `failureMode` parameter: - Subagent overrides use soft failure: misconfigured overrides are logged and the parent model is used. 
- Title generation uses hard failure: misconfigured overrides return an explicit error so manual title regeneration surfaces the misconfiguration and automatic title generation skips instead of silently falling back. > Mux is acting on Mike's behalf.	2026-05-04 13:13:00 +02:00
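The soft versus hard failure modes of the shared resolver can be sketched as below. The function name `resolveOverride`, its parameters, and the `enabled` map are hypothetical simplifications of `resolveConfiguredModelOverride`; only the two-mode behaviour is taken from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// failureMode selects what happens when a configured override is unusable.
type failureMode int

const (
	softFailure failureMode = iota // log and use the fallback model
	hardFailure                    // return an explicit error
)

var errOverrideUnusable = errors.New("configured model override is unusable")

// resolveOverride returns the model to use. An unset override always
// resolves to the fallback; a misconfigured one depends on the mode.
func resolveOverride(configured, fallback string, enabled map[string]bool, mode failureMode) (string, error) {
	if configured == "" {
		return fallback, nil
	}
	if enabled[configured] {
		return configured, nil
	}
	if mode == hardFailure {
		return "", fmt.Errorf("%w: %q", errOverrideUnusable, configured)
	}
	// Soft failure: the real resolver logs here; the sketch just falls back.
	return fallback, nil
}

func main() {
	enabled := map[string]bool{"fast-model": true}
	fmt.Println(resolveOverride("disabled-model", "parent-model", enabled, softFailure))
	fmt.Println(resolveOverride("disabled-model", "parent-model", enabled, hardFailure))
}
```

With the hard mode, a caller such as manual title regeneration can surface the error, while automatic generation can simply skip.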
Michael Suchacz	203b0a9df8	refactor(coderd/x/chatd): extract OpenAI logic into chatopenai package (#24788 ) Extracts OpenAI-specific logic from `coderd/x/chatd` into `coderd/x/chatd/chatopenai` so the main chat path no longer references `fantasyopenai` directly for chain mode info, response IDs, web search tooling, or option mapping. Structural refactor. The only deliberate behavioral narrowing is consolidating Responses store checks and related keyed option or metadata access on `opts[fantasyopenai.Name]`. That is documented by `TestIsResponsesStoreEnabledIgnoresMalformedNonOpenAIKey` and is unreachable in production where Responses options always live under `fantasyopenai.Name`. Summary: - Moves OpenAI Responses chain mode info, response ID helpers, web search tool construction, and provider option conversion into `chatopenai`. - Keeps Anthropic, Google, OpenRouter, and Vercel provider branches as thin, existing code paths. - `chatopenai` only imports `chatprompt` from chatd subpackages. It does not import `chatd`, `chatloop`, `chatprovider`, or `chaterror`. - Follow-up review fixes align helper names, keyed provider option access, map cloning behavior, and PR documentation with the extracted package boundary. - Final sweep trims unused chain-mode state, removes a duplicate store-check test case, drops an unused provider-tool parameter, and shares the chat-message test helper through `chattest`. > Mux is updating this PR on Mike's behalf.	2026-05-04 11:17:19 +02:00
Ethan	3a153ebb15	fix(coderd/x/chatd): replay retry phase on subscribe (#24569 ) Retry events were previously fire-and-forget, so subscribers that connected after a retry started only saw durable history plus `status=running` and could not tell the stream was backing off. Keep the current retry phase in `chatStreamState`, capture it atomically with subscriber registration, replay it in the initial snapshot for same-chat late joiners, and clear it when streaming resumes or ends so reconnects get consistent retry state without duplicate delivery at the subscription boundary. Relates to CODAGT-139	2026-05-04 11:48:39 +10:00
Kyle Carberry	d889ba1842	feat: add user_oidc auth type for MCP servers (#24793) Adds a 5th MCP server authentication mode, `user_oidc` ("User OIDC Identity"), that forwards the calling user's OIDC access token from `user_links.oauth_access_token` to the upstream MCP server as `Authorization: Bearer <token>`. The token is read from `user_links` and refreshed transparently via `oauth2.TokenSource` before each MCP request. No new per-MCP-server secret storage and no per-user connect/disconnect step. Limitation: only users who logged in via OIDC have a forwardable token. Users authenticated via password or GitHub will see requests sent without an `Authorization` header, and the upstream MCP server is expected to respond with 401. A pluggable token source (e.g. CLI-minted E2E tokens) is left as future work. <details> <summary>Implementation notes</summary> - Schema: new `coderd/database/migrations/000481_mcp_user_oidc_auth.{up,down}.sql` relaxes the `mcp_server_configs.auth_type` CHECK constraint to include `user_oidc`. Down migration deletes affected rows before restoring the old constraint. - SDK validation: `codersdk/mcp.go` extends `oneof` for `CreateMCPServerConfigRequest` and `UpdateMCPServerConfigRequest`. - Handler: `coderd/mcp.go` adds `case "user_oidc":` to the field-clearing switch on update. The existing list and detail handlers already report `auth_connected = true` for any non-`oauth2` auth type. - Header construction: `coderd/x/chatd/mcpclient/mcpclient.go` introduces a `UserOIDCTokenSource` interface and adds the `user_oidc` case to `buildAuthHeaders`. `ConnectAll` / `connectOne` / `buildAuthHeaders` gain `userID uuid.UUID, oidcSrc UserOIDCTokenSource` parameters. - Wiring: `coderd/x/chatd/chatd.go` adds `OIDCTokenSource` to `Config` / `Server` and passes `chat.OwnerID` plus the source through `ConnectAll`. `coderd/coderd.go` constructs the source next to the `chatd.New` call when `options.OIDCConfig` is non-nil. 
- Token source: `oidcMCPTokenSource` lives in `coderd/mcp.go`. It reads the user's OIDC link, refreshes via `oauth2.TokenSource`, and writes the refreshed token back to `user_links`. Logic is duplicated from `provisionerdserver.ObtainOIDCAccessToken` to avoid an MCP -> provisionerdserver dependency. The two copies must be kept in sync; a comment on `oidcMCPTokenSource` records this. - Frontend: `MCPServerAdminPanel.tsx` adds the new dropdown option, an explanatory helper block (no admin-configurable fields), and a Storybook story (`CreateServerUserOIDC`). - Tests: - `mcpclient_test.go`: `TestConnectAll_UserOIDCAuth`, `TestConnectAll_UserOIDCAuth_NoLink`, `TestConnectAll_UserOIDCAuth_NilSource`. All existing tests updated for the new signature. - `mcp_test.go`: extends `TestMCPServerConfigsAuthConnected` to assert `auth_connected=true` for `user_oidc`; adds `TestMCPServerConfigsUserOIDCClearsFields` and `TestMCPServerConfigsUserOIDCDirect`. - Docs: `docs/ai-coder/agents/platform-controls/mcp-servers.md` describes the new mode and its OIDC-only limitation. </details> This PR was created by Coder Agents. --------- Co-authored-by: Coder Agents <agents@coder.com>	2026-05-03 11:31:48 -04:00
Jaayden Halko	efda5c2c12	feat: disable Git controls when Git is not active (#24673) closes CODAGT-148 In chats with no Git context (no repositories known to the watcher, no PR tab, no remote diff), the refresh button fires an "Unable to refresh git status" toast because the watcher WebSocket never opens. Derive `isGitActive = repositories.size > 0 || showRemoteTab` in `GitPanel` and use it to: - Disable the refresh button, unified-diff toggle, and split-diff toggle with a "Git is not set up for this chat" tooltip. - Show a dedicated empty state explaining how to enable Git, replacing the generic "No pushed changes yet" copy. Chats with at least one repository or a PR tab are unaffected; all controls remain enabled and behave as before. Adds a `GitNotActive` Storybook story with play-function assertions covering the disabled controls and empty-state copy.	2026-05-01 14:46:46 +01:00
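
As a rough illustration of the `isGitActive` derivation quoted in the last commit above, here is a minimal TypeScript sketch. Only the expression `repositories.size > 0 || showRemoteTab` and the tooltip string come from the commit message; the `GitPanelState` interface and the `gitControlTooltip` helper are hypothetical names introduced for this example and are not taken from the actual `GitPanel` component.

```typescript
// Hypothetical input shape: the real GitPanel props are not shown in the
// commit message, so this interface is an assumption for illustration.
interface GitPanelState {
  repositories: Map<string, unknown>; // repositories known to the watcher
  showRemoteTab: boolean; // true when a PR tab / remote diff is available
}

// Mirrors the expression quoted in the commit message.
function isGitActive(state: GitPanelState): boolean {
  return state.repositories.size > 0 || state.showRemoteTab;
}

// Hypothetical helper: returns the tooltip for disabled controls, or
// undefined when Git is active and the controls stay enabled.
function gitControlTooltip(state: GitPanelState): string | undefined {
  return isGitActive(state) ? undefined : "Git is not set up for this chat";
}
```

With an empty repository map and no remote tab, the controls would be disabled and carry the tooltip; either a single repository or a PR tab alone is enough to keep them enabled.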


206 Commits