coder

mirror of https://github.com/coder/coder.git synced 2026-06-06 06:28:20 +00:00

Author	SHA1	Message	Date
Michael Suchacz	ca1f6b19a2	feat: remove legacy chat provider tables (#25416 )	2026-05-22 09:50:01 +02:00
Michael Suchacz	5968c3dac7	feat: use AI provider keys at runtime (#25414 )	2026-05-22 02:17:09 +02:00
Mathias Fredriksson	ec1e861152	fix(coderd/x/chatd): deliver out-of-order durable messages on subscribe (#25433 ) The subscriber advanced a single delivery cursor on each notify and trusted it for both lookups. Concurrent publishMessage calls and PG NOTIFY commit ordering let cache appends and notifies arrive out of ID order, after which a late notify would scan above its own message and drop it. The DB fallback was also skipped whenever the cache delivered anything, hiding cross-replica messages that only the DB held. The cursor becomes a high-water mark, not the lookup key. Notifies trigger a rescan over the gap they describe and dedupe per subscription, and the DB pass runs every time so cross-replica messages can't get eaten by a local cache hit. Closes coder/internal#1525 Closes CODAGT-357	2026-05-21 10:35:41 +03:00
Michael Suchacz	63900d212d	feat: support personal skills in chats (#25366 ) > Mux updated this PR on behalf of Mike. ## Stack Context This PR builds on #25365 in the experimental personal skills stack and completes the chat integration. Stack order: 1. #25362 personal skill resolver 2. #25363 storage, permissions, API, and SDK 3. #25365 API test coverage 4. #25366 chattool and chatd integration 5. #25066 settings UI and docs 6. #25386 personal skills slash menu ## What? Updates chattool skill formatting and `read_skill` resolution so tools can read personal skills from the database, then injects personal skill metadata into chatd prompts and registers the skill-reading tools when skills are available. This branch has also been merged with current `origin/main` to resolve merge conflicts. ## Why? The chattool and chatd changes need to land together so the intermediate stack state stays buildable. This completes personal skill availability in chats without syncing personal skills into workspace filesystems. ## Validation - `go test -count=1 ./coderd/x/chatd/chattool -run 'TestFormatResolvedSkillIndex\|TestReadSkillTool\|TestReadSkillFileTool'` - `go test -count=1 ./coderd/x/chatd -run 'TestPersonalSkillsInSystemPrompt\|TestPersonalAndWorkspaceSkillCollisionInSystemPrompt\|TestSkillIndexRefreshReplacesStaleAliases\|TestFetchPersonalSkillMetadata\|TestLoadPersonalSkillBody'` - `go test -count=1 ./coderd -run 'Test.*UserSkill'` - `git diff --cached --check` - `make lint` - pre-commit hook	2026-05-20 19:50:50 +02:00
Kyle Carberry	159089686a	fix(coderd/x/chatd): prime workspace MCP cache after create/start (#25298 ) ## Problem Mid-turn workspace MCP discovery was broken when an agent was still cold-starting. `PrepareTools` in `chatd.go` flipped `workspaceMCPDiscovered = true` before calling `discoverWorkspaceMCPTools`, so a failed discovery attempt permanently blocked retries within the turn. Customer-reported repro: - New chat with no pre-selected workspace. - LLM calls `create_workspace` mid-turn at `23:35:05`. - `PrepareTools` fires, dials the agent with a 30s timeout, dial times out at `23:38:15`, `discoverWorkspaceMCPTools` returns empty. - Agent connects at `23:38:29`, 14 seconds later. - `workspaceMCPDiscovered` was already true, so `PrepareTools` never retried for the rest of the turn. MCP tools only appeared on the next user message. A naive retry loop in `PrepareTools` would also miss the bigger picture: a workspace boot can take several minutes (EC2 cold start, 10 min startup scripts), and the chatloop only gets a chance to call `PrepareTools` between LLM steps. ## Fix Do the workspace MCP discovery from inside the tool that already waits for the agent. `chattool.CreateWorkspace` and `chattool.StartWorkspace` call `waitForAgentReady`, which has a 2 min agent-online budget plus a 10 min startup-script budget. By the time they fire `OnChatUpdated`, the agent is `Ready`. The chatd `onChatUpdated` callback now launches an async `primeWorkspaceMCPCache` goroutine on every bind that has a valid workspace ID: - The primer calls `discoverWorkspaceMCPTools` until it returns a non-empty list or `workspaceMCPPrimeMaxWait` (30s) elapses, with a 2s backoff between attempts. The bounded wait handles the short race between agent-online and the agent's MCP `Connect` settling. - The primer runs asynchronously so the tool itself never blocks. Some templates simply do not advertise MCP tools, in which case the primer would otherwise spend its full budget for nothing. - The primer shares the chat `ctx` (not a detached one) so it is canceled together with the chat. A dangling primer would re-dial the workspace conn after `runChat`'s deferred `workspaceCtx.close()` and leak that conn. - `inflight.Add(1)` ensures server shutdown still waits for any in-progress primer. - `PrepareTools` is simplified back to a single discovery call. It now only sets `workspaceMCPDiscovered = true` on success, so an empty result no longer permanently blocks discovery within the turn. The cache hit warmed by the primer makes that call cheap in the common case; the dial fallback handles the rare cache miss. ## Tests All in `coderd/x/chatd/chatd_internal_test.go`: - `TestPrimeWorkspaceMCPCache_SuccessOnFirstAttempt` — single `ListMCPTools` call returning tools populates the cache. - `TestPrimeWorkspaceMCPCache_RetriesUntilToolsAppear` — first call empty, second returns tools; primer retries past the backoff and writes the cache. Uses `quartz.Mock.Trap` on `NewTimer`. - `TestPrimeWorkspaceMCPCache_GivesUpAfterDeadline` — `ListMCPTools` always empty; primer stops at `workspaceMCPPrimeMaxWait` and refuses to cache the empty result so PrepareTools can retry on the next step. The existing integration test `TestRunChat_WorkspaceMCPDiscoveryAfterMidTurnCreateWorkspace` continues to pass and now also exercises the async-primer path end-to-end via the create_workspace tool. ``` go test ./coderd/x/chatd/... -count=1 go test ./coderd/x/chatd/ -race -count=1 make pre-commit ``` <details> <summary>Design notes</summary> - The first iteration of this PR added retry+cooldown+failure-cap logic inside `PrepareTools`. It worked for the customer's ~30s race window but did not help workspaces that take several minutes to boot, because `PrepareTools` only fires between LLM steps. Reviewer pointed out the right place to handle this is the tool itself; the current implementation does that. - Why async: a primer that ran synchronously inside the `OnChatUpdated` callback blocked the create_workspace tool from returning for up to `workspaceMCPPrimeMaxWait`, which broke `TestCreateWorkspaceTool_EndToEnd` and would hurt any template that does not expose MCP tools. Decoupling lets the tool return immediately and lets the primer warm the cache concurrently with the next LLM step. - Why share the chat `ctx` rather than `context.WithoutCancel(ctx)` (the title-generation pattern): the primer touches `workspaceCtx.getWorkspaceConn`, which `runChat`'s deferred `workspaceCtx.close()` invalidates. A detached primer outliving the chat would dial a fresh conn and leak it. - The constant naming distinguishes `workspaceMCPDiscoveryTimeout` (35s per-call dial budget, unchanged from #25169) from `workspaceMCPPrimeMaxWait` (30s total budget for the post-ready primer loop) and `workspaceMCPPrimeRetryInterval` (2s between empty-result retries). </details> Follow-up to #25169. --- _This pull request was generated by Coder Agents._	2026-05-18 07:55:56 -04:00
Ethan	a35f71cd8a	fix(coderd/x/chatd): retry HTTP/2 stream resets (#25170 ) Mid-stream HTTP/2 peer resets from LLM providers can arrive after a 200 streaming response has already emitted provisional parts. Previously those resets fell through as generic non-retryable errors because `stream ID` messages did not match retryable transport signals, and stream IDs could be misread as HTTP statuses. Classify retryable HTTP/2 RST_STREAM codes as transient timeout failures, ignore stream IDs during status extraction, and keep the existing `retry` event as the rollback boundary for provisional message parts so replacement attempts do not replay failed-attempt output. Closes CODAGT-382	2026-05-14 11:40:43 +10:00
Kyle Carberry	5a5cd79c4c	fix: drop buffered chat parts after their durable message commits (#25164 )	2026-05-12 00:30:38 -04:00
Kyle Carberry	0ed57ee343	fix(coderd/x/chatd): checkpoint buffered message_parts to avoid stale replay (#25145 )	2026-05-11 17:27:03 -04:00
Michael Suchacz	6bb88775ab	test(coderd/x/chatd): pin TestGetWorkspaceConn_StatusCheck to mock clock (#25130 ) The `TimedOutAgentCacheHit`, `CacheHitHealthyAgent`, and `CacheHitDBError` subtests of `TestGetWorkspaceConn_StatusCheck` built their `WorkspaceAgent` timestamps with `time.Now()` in the parent test's slice literal and then ran the actual check against the server's real wall clock (`quartz.NewReal()`). On slow Windows CI runners, more than `agentInactiveDisconnectTimeout` (30s) of wall time can elapse between slice construction and the parallel subtest body. In that window, the cached "healthy" agent gets reclassified as disconnected by `agentDisconnectedFor`, and `CacheHitHealthyAgent` fails with `errChatAgentDisconnected` instead of returning the cached connection. Build each agent inside the subtest with `quartz.NewMock(t)` and feed the same clock into the `Server` so the agent timestamps and the status math share a single frozen `now`. This matches the pattern already used by `TestGetWorkspaceConn_DialTimeoutDisconnectedRecoveryThreshold` in the same file. Closes https://github.com/coder/internal/issues/1522 <details> <summary>Verification</summary> Inserting `time.Sleep(35 * time.Second)` at the top of each subtest's body reliably reproduces the original failure (`errChatAgentDisconnected` on `CacheHitHealthyAgent`) on the parent commit and passes with this change. After removing the synthetic sleep, `go test ./coderd/x/chatd -run TestGetWorkspaceConn_StatusCheck -count=50` passes cleanly. </details> > Generated by Coder Agents on behalf of the assignee. Co-authored-by: Coder Agents <noreply@coder.com>	2026-05-11 19:53:58 +02:00
Ethan	bd6cc1aaf2	feat(coderd): add stop_workspace chatd tool and recovery classification (#24997 ) ## Summary Adds a `stop_workspace` tool to chatd so the model can recover from the "workspace running but agent dead" failure mode (e.g. an OOM that leaves the workspace running but the agent unreachable) by stopping and then starting the workspace. <img width="924" height="742" alt="image" src="https://github.com/user-attachments/assets/279dedb6-6e29-4fe1-8754-3a1f01e538bf" /> ## What changed New `stop_workspace` chatd tool (`coderd/x/chatd/chattool/stopworkspace.go`). Mirrors `start_workspace`: shares `WorkspaceMu` to serialize with create/start, waits for any in-progress build before issuing a stop, and is idempotent only after a successful Stop transition. Failed stop builds re-attempt rather than reporting success. New `chatStopWorkspace` coderd hook (`coderd/exp_chats.go`). Mirrors `chatStartWorkspace` minus the `RequireActiveVersion` gate. Stop should not be blocked by template version policy. Differentiated recovery sentinels (`coderd/x/chatd/chatd.go`). `errChatAgentDisconnected` instructs the model to call `stop_workspace` then `start_workspace`. `errChatDialTimeout` instructs a single retry, then user escalation if it repeats. The previous single message conflated transient and persistent failures. Two-signal recovery gate. Recovery is only surfaced when a tool call times out and a fresh DB read of the latest workspace agent says `Disconnected`. The previous draft escalated on the DB read alone, which would fire on a 30-second heartbeat blip (e.g. agent respawn) and prompt a destructive stop/start unnecessarily. Cache-hit disconnected handling now clears the cache and retries a fresh dial before escalating, rather than returning the recovery sentinel immediately. Latest-agent classification uses `GetWorkspaceAgentsInLatestBuildByWorkspaceID` instead of the chat's bound `AgentID`, so stale bindings after a rebuild don't misclassify. Shared chattool helpers in `coderd/x/chatd/chattool/chattool.go`: `latestWorkspaceBuildAndJob`, `publishBuildBinding`, `provisionerJobTerminal`. Applied to both `start_workspace` and `stop_workspace`. ## Notes - Reverts an earlier draft that widened `ask_user_question` to root standard turns. Plan-mode-only behavior is restored. - The `stop_workspace` tool currently renders via the generic chat tool-call UI. A follow-up frontend PR will prettify the `stop_workspace` tool and style it like the `start_workspace` tool. - Never-connected (`Timeout` status) agents are intentionally excluded from recovery. They indicate template or startup failure, not the running-but-dead case this PR targets. Closes CODAGT-315	2026-05-11 16:23:07 +10:00
Ethan	de9cdca77e	fix(coderd): handle external-agent workspaces honestly in chat (#24969 ) ## Summary Make Coder's chat agent honest about workspaces that use `coder_external_agent`. Three behaviors change so the chat stops pretending it can drive an external workspace through to a usable state on its own. <img width="859" height="537" alt="image" src="https://github.com/user-attachments/assets/0561442b-95f1-4a2d-853c-7e3776114680" /> ## Problem External agents are not started by Coder. The user has to run `coder agent` on their own host with a token Coder generates. Before this change, the chat agent treated those workspaces like any other: - `create_workspace` would enqueue a build for an external-agent template and then wait minutes (~22 worst case) for an agent that was never going to come up. - When mid-turn tool calls dialed an external agent that was not connected, the chat burned the full 30-second dial timeout and returned generic "the workspace may need to be restarted from the Coder dashboard" guidance, which is not the action the user can take. - Nothing told the chat (or the user, through the chat) that the next action lives outside Coder. ## Fix Three changes scoped to `coderd/x/chatd/`: 1. `create_workspace` blocks templates with external agents. The tool reads `template_versions.has_external_agent` for the template's active version and refuses external-agent templates with a message instructing the chat to pick a different template, or to have the user create and start the workspace themselves and then attach it. 2. Attaching an existing external workspace stays open. No selection-time gate on attachment; users can still bind a working external workspace to a chat. 3. External-agent-aware error handling on connection. Two complementary changes both predicated on proven connectivity failures rather than every dial error: - `getWorkspaceConn` preflight and timeout handling. Before opening a connection, the cache-miss path reads the agent's status from the already-loaded row. If the selected agent is external and clearly offline according to the existing `isAgentUnreachable` helper (`Disconnected` or `Timeout`, never `Connecting`), it returns an external-agent-specific error immediately instead of waiting out the 30-second dial timeout. `Connecting` external agents fall through to the dial so a user who just started the agent on their host can still succeed in the same turn. The preflight only fires when the agent is still the latest selected agent for the workspace, so stale-binding recovery via `dialWithLazyValidation` is unaffected. The post-dial rewrite is limited to the dial timeout sentinel; stale/no-agent bindings and non-timeout dial failures preserve their original errors. - `waitForAgentReady` timeout-branch rewrite. The 2-minute retry loop used by `create_workspace` and `start_workspace` runs unchanged for all agents. When the loop's outer deadline elapses, the timeout branch substitutes the external-agent message in place of the raw dial error if the agent belongs to an external resource. This applies the same pattern that the cache-hit path of `getWorkspaceConn` already used (`isAgentUnreachable` returning `errChatAgentDisconnected`), extended to the cache-miss path and to the readiness helper, with the external-agent-aware error rewrite layered only on confirmed offline or timeout paths. Closes CODAGT-314	2026-05-08 13:51:13 +10:00
Michael Suchacz	0bfb9f6f13	feat: show agent turn summary in agents sidebar (#24942 ) Persists the agent-generated turn-end summary on `chats` and shows it as the Agents sidebar subtitle when present, falling back to the model name. Errors still take precedence. > Mux is acting on Mike's behalf. ## What changes Storage. New nullable `last_turn_summary` column on `chats` (migration `000486`). New `UpdateChatLastTurnSummary` query normalizes blank/whitespace input to `NULL`, preserves `updated_at` (so the chat does not jump to the top of the sidebar on summary writes), and uses an `expected_updated_at` stale-write guard so an older async summary cannot overwrite a newer turn. Backend. `coderd/x/chatd/chatd.go` decouples summary generation from webpush. Generated summaries persist for completed parent turns even when webpush is unconfigured or has no subscriptions. The same generated text is reused as the webpush body when webpush is configured, so the summary model is not called twice. Generic fallback push text is no longer persisted; it clears any stale summary instead. Error/interrupt/pending-action terminal paths clear `last_turn_summary` for the latest turn. Frontend. `AgentsSidebar.tsx` subtitle priority is now `errorReason \|\| lastTurnSummary \|\| modelName`, normalized via the existing `asNonEmptyString` helper from `blockUtils.ts`. ## Tests - `TestUpdateChatLastTurnSummary` (database): success, whitespace-to-NULL, stale guard rejects, `updated_at` preserved. - `TestUpdateLastTurnSummaryRejectsStaleWrites` (chatd internal): direct stale-`expected_updated_at` test. - `TestSuccessfulChatPersistsTurnSummaryWithoutWebPush`: persistence works without webpush subscriptions. - `TestSuccessfulChatSendsWebPushWithSummary`: same generated text drives both DB and push body. - `TestSuccessfulChatSendsWebPushFallbackWithoutSummaryForEmptyAssistantText`: fallback text is not persisted. - `TestErroredChatClearsLastTurnSummaryAndSendsWebPush`: error path clears the field. - `TestInterruptChatDoesNotSendWebPushNotification`: interrupt path clears the field, no push fires. - `AgentsSidebar.test.tsx`: subtitle priority for summary-present, error-wins, no-summary fallback, whitespace fallback. - `AgentsSidebar.stories.tsx`: `ChatWithTurnSummary` and `ChatWithTurnSummaryAndError`. ## Notes - No backfill. Existing chats keep showing the model name until their next turn completes. - Parent chats only in this iteration; the field is rendered on any `Chat` if a future change extends generation to children. - Decoupling generation from webpush adds quickgen model calls for completed parent turns that previously skipped generation when no subscriptions existed. Existing parent-only, assistant-text-present, `PushSummaryModel` configured, and bounded-timeout gates keep this behavior bounded.	2026-05-06 16:43:35 +02:00
Ethan	e5c7fdff86	fix(coderd/x/chatd): refresh chat status and bound subscriber reads on Subscribe (#24095 ) Tightens the chat stream subscription path on a few related axes. None of these changes touch the steady-state event flow; they all concern the subscribe handshake. ## Motivation `Server.Subscribe` carries three responsibilities that were entangled: 1. Authorize the caller against the chat row. 2. Arm local + pubsub subscriptions before any DB reads (subscribe-first-then-query). 3. Build the initial snapshot from a fresh chat row, message history, and queue. When all three live in one function and share the request context, a few unfortunate behaviors fall out: - The HTTP handler's middleware already loaded and authorized the chat row, but `Subscribe(chatID)` discarded it and re-fetched on every WebSocket connection. - The chat row used to populate the initial `status` event was loaded before the pubsub subscription was armed, so a status transition that happened in that window was silently lost. - Control-path DB reads inherited whatever context the caller passed in. A caller without a deadline could wedge a subscriber goroutine indefinitely on a stalled DB. - A transient failure of the chat re-read collapsed the entire subscription instead of degrading gracefully. ## What changes Split the auth boundary out into the type signature. A new `SubscribeAuthorized(ctx, chat, ...)` takes the already-authorized row directly. The HTTP handler in `coderd/exp_chats.go` calls it with the chat row from `httpmw.ChatParam`, eliminating the redundant `GetChatByID`. `Subscribe(chatID)` is preserved as a thin wrapper for callers that don't have a chat row in hand (tests, internal callers); it does the auth lookup and delegates. Re-read the chat after arming subscriptions. Inside `SubscribeAuthorized`, after the local stream and pubsub subscriptions are active, we reload the chat row to populate the initial `status` event and any enterprise relay setup. Combined with the existing subscribe-first-then-query pattern, this closes the gap where a status transition between the middleware's load and the subscription arming would not appear in either the initial snapshot or a live notification. Fall back to the middleware row on refresh failure. If the post-subscription refresh fails (transient DB blip, brief pool exhaustion), we log a warning and reuse the row that proved authorization in the first place. Messages, queue, and pubsub are all independent of this row, so the stream still works; the initial `status` is just slightly stale and self-corrects via the next pubsub event. Bound subscriber control-path DB reads. A new `streamSubscriberControlFetchContext` helper applies a 5-second fallback timeout only when the caller has no deadline of their own. Used at the chat refresh, the initial queue load, and the queue-update goroutine following pubsub notifications. HTTP-driven callers pass through unchanged; background callers can no longer hang forever on a stalled DB and leak subscriber goroutines, pubsub subscriptions, and `chatStreams` entries.	2026-05-06 14:29:53 +10:00
Ethan	46a60e6d5d	refactor: move chat error kinds into codersdk (#24955 ) Moves the chat error kind taxonomy from `coderd/x/chatd/chaterror` into `codersdk.ChatErrorKind` and types `ChatError.Kind` / `ChatStreamRetry.Kind` so generated TypeScript exposes an SDK-owned union, including `usage_limit`. Backend chat classification now references the SDK constants directly while preserving the existing JSON string values. Keeps chat usage-limit admission failures on their existing 409 response shape. The frontend maps structured usage-limit responses to the SDK-owned `usage_limit` kind, uses generated `TypesGen.ChatErrorKind` directly, and removes the local string union and alias.	2026-05-06 11:57:48 +10:00
Ethan	4751416b29	fix!: persist structured chat errors (#24919 ) Breaking change for changelog: > `codersdk.Chat.last_error` now returns a structured `ChatError` object (`{message, kind, provider, retryable, status_code, detail}`) instead of a plain string. The chats API is experimental (`/api/experimental/chats`), so this ships without a deprecation cycle; consumers reading `chat.last_error` as a string must update to read `chat.last_error.message`. SDK/generated TypeScript terminal error payloads now use the single `ChatError` type; the live stream error payload type is renamed from `ChatStreamError` to `ChatError`. Persisted chat errors now carry the same provider-specific detail (kind, provider, retryable, HTTP status, optional detail) as the live stream, so refreshing a failed chat rehydrates with the full structured error instead of a one-line headline. Existing rows are migrated in place: legacy text errors are wrapped into `{message, kind: "generic"}` so already-errored chats still render, and rows with `last_error IS NULL` stay NULL. Internally, persisted fallback decoding now reuses the existing `chaterror.KindGeneric` constant, with no JSON value change. Closes CODAGT-239	2026-05-05 12:56:06 +10:00
Michael Suchacz	0bb09935bc	feat: add computer-use provider selection for AI agents (#24772 ) Adds a deployment-wide setting to select the computer-use provider (Anthropic or OpenAI) for AI agents, plus the OpenAI computer-use runner needed to honor that selection. The setting is stored in `site_configs` under `agents_computer_use_provider`, defaults to Anthropic when unset, and is exposed via experimental GET/PUT endpoints under `/api/experimental/chats/config/computer-use-provider`. The chatd computer-use tool now dispatches to either `runAnthropicComputerUse` or `runOpenAIComputerUse` based on the resolved provider, with provider-specific result metadata for OpenAI screenshots. Frontend adds a provider dropdown to the Agents Experiments settings page nested under the virtual desktop toggle, with disabled state handling while virtual desktop is off and skeleton loaders while config queries are in flight. Hugo and Codex review follow-up: - Uses shared provider validation and clearer computer-use constant names. - Removes stale OpenAI pending-safety-checks commentary. - Documents why provider result metadata is needed for OpenAI screenshots. - Keeps the computer-use subagent visible when provider credentials are missing, then returns a clear spawn-time configuration error. - Uses OpenAI's recommended 1600x900 screenshot geometry to preserve the native 16:9 aspect ratio. - Moves OpenAI-specific computer-use helpers into `coderd/x/chatd/chatopenai/computeruse` after rebasing onto the provider package refactor in `main`. - Converts OpenAI pixel scroll deltas to Coder desktop wheel-click amounts. - Preserves OpenAI pointer modifiers with key down/up desktop actions and rejects unsupported non-left double-click buttons explicitly. - Maps OpenAI back/forward side-button clicks to browser navigation key actions. - Defaults omitted OpenAI click buttons to left-click. - Retries mouse release cleanup if the final OpenAI drag release fails. - Keeps computer-use subagent availability messages stable when provider config cannot be loaded, while logging the backend error. - Releases remaining OpenAI modifier keys if a synthetic key-up cleanup action fails. - Updates Storybook interaction stories so provider snapshots show the selected final provider. > Mux updated this PR description on behalf of Mike.	2026-05-04 20:30:50 +02:00
Michael Suchacz	033ed0bb82	feat: add admin-configurable chat title generation model (#24838 ) Adds an admin-configurable deployment-wide setting that controls which model is used for chat title generation. Admins can pick any enabled chat model config from the Agents settings page, or leave the setting unset to keep the existing fast-models-then-chat-model fallback algorithm. When a model is selected, both automatic and manual title generation use only that model, with no silent fallback. When the configured model is disabled, missing credentials, or otherwise unusable, automatic title generation skips entirely (best-effort) and manual title regeneration returns a clear error, so admins notice the misconfiguration instead of silently routing title traffic through another provider. ## Surface - New deployment-wide setting stored as a `site_configs` row (`agents_chat_title_generation_model_override`). - New experimental endpoint `GET/PUT /api/experimental/chats/config/model-override/{context}`. - Frontend: title generation now appears as a third dropdown on the Agents admin settings page alongside the existing general and explore context overrides. ## DRY refactors folded in Title generation is integrated as a third value of the existing `ChatModelOverrideContext` type alongside `general` and `explore`, sharing the parameterized HTTP route, SDK methods, generated types, and frontend API plumbing rather than introducing a parallel surface. The `Agent` prefix was dropped from the type and route since title generation is not a delegated agent. The chatd model-override resolver is also shared. `resolveConfiguredModelOverride` now takes a `failureMode` parameter: - Subagent overrides use soft failure: misconfigured overrides are logged and the parent model is used. - Title generation uses hard failure: misconfigured overrides return an explicit error so manual title regeneration surfaces the misconfiguration and automatic title generation skips instead of silently falling back. > Mux is acting on Mike's behalf.	2026-05-04 13:13:00 +02:00
Michael Suchacz	203b0a9df8	refactor(coderd/x/chatd): extract OpenAI logic into chatopenai package (#24788 ) Extracts OpenAI-specific logic from `coderd/x/chatd` into `coderd/x/chatd/chatopenai` so the main chat path no longer references `fantasyopenai` directly for chain mode info, response IDs, web search tooling, or option mapping. Structural refactor. The only deliberate behavioral narrowing is consolidating Responses store checks and related keyed option or metadata access on `opts[fantasyopenai.Name]`. That is documented by `TestIsResponsesStoreEnabledIgnoresMalformedNonOpenAIKey` and is unreachable in production where Responses options always live under `fantasyopenai.Name`. Summary: - Moves OpenAI Responses chain mode info, response ID helpers, web search tool construction, and provider option conversion into `chatopenai`. - Keeps Anthropic, Google, OpenRouter, and Vercel provider branches as thin, existing code paths. - `chatopenai` only imports `chatprompt` from chatd subpackages. It does not import `chatd`, `chatloop`, `chatprovider`, or `chaterror`. - Follow-up review fixes align helper names, keyed provider option access, map cloning behavior, and PR documentation with the extracted package boundary. - Final sweep trims unused chain-mode state, removes a duplicate store-check test case, drops an unused provider-tool parameter, and shares the chat-message test helper through `chattest`. > Mux is updating this PR on Mike's behalf.	2026-05-04 11:17:19 +02:00
Ethan	3a153ebb15	fix(coderd/x/chatd): replay retry phase on subscribe (#24569 ) Retry events were previously fire-and-forget, so subscribers that connected after a retry started only saw durable history plus `status=running` and could not tell the stream was backing off. Keep the current retry phase in `chatStreamState`, capture it atomically with subscriber registration, replay it in the initial snapshot for same-chat late joiners, and clear it when streaming resumes or ends so reconnects get consistent retry state without duplicate delivery at the subscription boundary. Relates to CODAGT-139	2026-05-04 11:48:39 +10:00
Cian Johnston	2f855904be	refactor: add dbgen chat generators and migrate test boilerplate (#24497 ) - Adds chat-related dbgen generators covering defaults, overrides, and message field mapping. - Replaces raw single-row chat, message, provider, and model-config setup in tests with dbgen helpers. - Simplifies chat seed helpers after moving fixture setup into dbgen. > Generated with [Coder Agents](https://coder.com/agents).	2026-05-01 13:29:33 +01:00
Thomas Kosiewski	ab75e46f1d	fix: record debug runs for proposed chat titles (#24820 )	2026-04-29 16:45:48 +02:00
Mathias Fredriksson	782b7166a4	fix: preserve stream state on interrupt, fix auto-promote error handling (#24314 ) When tryAutoPromoteQueuedMessage's insert fails, return the error instead of swallowing it so the transaction rolls back and the queued message survives. Previously the POP DELETE committed while the INSERT silently failed, permanently losing the message. Remove clearStreamState() from the pending/waiting status handler in the frontend. The durable message event clears stream state via the existing needsStreamReset path, eliminating the visual gap where content vanishes before the persisted message arrives. Fixes CODAGT-61	2026-04-29 14:08:35 +03:00
Cian Johnston	1666bff1f9	fix(coderd/x/chatd): block chain mode when provider missing tool results (#24782 ) When `StopAfterTool` fires (e.g., `propose_plan`), the LLM response containing a `function_call` is stored at OpenAI via `store=true`, but the tool result is only persisted locally. On the next user message, `resolveChainMode` sees the tool result in the local DB and concludes all calls are resolved. Chain mode activates with `previous_response_id`, but OpenAI rejects because its stored chain has an unresolved `function_call`. This adds a `providerMissingToolResults` check to `resolveChainMode` that detects the `assistant(tool-call) → tool(result) → user` pattern with no follow-up assistant message. The absence of a follow-up assistant proves the tool results were never round-tripped to the provider. When detected, chain mode is blocked and the system falls back to full history replay, which includes both the tool call and its result. Deploying this fix un-bricks existing affected chats with no DB migration needed. > Generated by Coder Agents.	2026-04-28 15:30:04 +01:00
Michael Suchacz	62e9752acd	fix: prevent malformed OpenAI Responses continuations (#24725 ) > Worked on by Mux on Mike's behalf. ## Summary - Disable OpenAI Responses `previous_response_id` chain mode when the prior assistant response has unresolved local tool calls, so the next request can include paired tool outputs instead of sending an incomplete continuation. - Update the fantasy pin to a Responses replay fix that preserves stored reasoning references, only replays web search references when paired with reasoning, and validates local function-call output pairing before send. - Add fake OpenAI Responses input validation for the two production 400 shapes and integration coverage for full-history reasoning plus web search replay. - Add sanitized diagnostics for the OpenAI Responses continuity errors. ## Tests - `go test ./providers/openai -run 'TestResponsesToPrompt_(ReasoningWithStore\|ReasoningWithWebSearchCombined\|WebSearchRequiresReasoningReference\|ReasoningWithFunctionCallCombined\|WebSearchProviderExecutedToolResults)\|TestPrepareParams_(SkipsProviderExecutedToolReferences\|ValidatesFunctionCallOutputPairing)\|TestValidateResponsesInput_WebSearchReferenceRequiresReasoning' -count=1` - `go test ./providers/openai -count=1` - `GOWORK=off go test ./coderd/x/chatd/chattest -run TestValidateResponsesAPIInput -count=1` - `GOWORK=off go test ./coderd/x/chatd -run 'TestOpenAIResponses(NoStaleWebSearchReplay\|FullReplayPairsReasoningAndWebSearch\|ChainModeSkipsWhenLocalCallPending\|ChainModeStillFiresForProviderExecutedOnly)$\|TestResolveChainMode_' -count=1` - `GOWORK=off go test ./coderd/x/chatd/chatprompt -run 'TestInjectMissingToolResults_' -count=1` - `GOWORK=off go test ./coderd/x/chatd/chaterror -run TestClassify_OpenAIResponsesAPIDiagnostics -count=1` - `GOWORK=off go test ./coderd/x/chatd/... -count=1` - `git diff --check` - `git commit` pre-commit hook	2026-04-26 21:23:06 +02:00
Michael Suchacz	dbcc654d28	feat: snapshot explore subagent tool entitlements (#24638 ) Explore sub-agents previously could not use `web_search` or external MCP tools. `runChat` hard-skipped both for Explore. Lifting those guards naively would over-grant tools, because a child chat could outlive the spawning turn's plan-mode filter. This change persists the spawning parent turn's filtered external MCP server IDs onto the child Explore chat, and simplifies the Explore provider-tool filter in `runChat`: - New `resolveExploreToolSnapshot` helper: computes the child's inherited external MCP subset by running the parent's configs through `filterExternalMCPConfigsForTurn` (plan-mode policy) and, if the parent is itself an Explore child, further narrowing to the parent's own persisted `MCPServerIDs`. The result is written to the child's `MCPServerIDs` column at spawn time. - The existing `mcp_server_ids` column is the sole durable snapshot. No new chat column is added. - `runChat` for Explore children: loads MCP tools from the persisted snapshot, and keeps only `web_search` from provider-native tools (to block computer-use and other write-style tools, since Explore is read-only). Whether `web_search` is actually available is a per-model decision, determined by the current model config, just like a main chat. - Built-in Explore allowlist is unchanged. Workspace-local MCP remains excluded for Explore. Verification: `go build ./...`, `go test ./coderd/x/chatd/... -count=1`, `make gen` (clean tree), `make lint/emdash`, `go vet`. Deep-review ran 12 reviewers on the feature and 5 on the clarity refactor; CAR reviewed and approved; a subsequent scope reduction dropped a temporary `allow_web_search` column in favor of per-model handling. > Mux is acting on Mike's behalf.	2026-04-23 19:07:38 +02:00
Mathias Fredriksson	1ace519c6e	fix(coderd/x/chatd): remove cache-miss check blocking agent recovery (#24634 ) The cache-miss isAgentUnreachable check added in #24336 runs before dialWithLazyValidation, preventing the existing switch mechanism from discovering the new agent after a workspace rebuild. The chat's stale agent binding is never repaired, causing an infinite loop of 'agent is disconnected' errors. Remove the cache-miss check. The cache-hit check remains (it verifies the agent behind an established connection). The dial timeout and dialWithLazyValidation already bound the cache-miss failure path. Closes CODAGT-248	2026-04-22 21:49:10 +03:00
Mathias Fredriksson	78d9a220cf	fix(coderd/x/chatd): detect disconnected agents in getWorkspaceConn (#24336 ) Add agent status check and dial timeout to getWorkspaceConn to prevent tool calls from hanging when a workspace agent disconnects. Status check: call isAgentUnreachable on every getWorkspaceConn call. On cache miss, check the freshly fetched agent row. On cache hit, re-fetch the agent row by PK for a fresh heartbeat timestamp. Disconnected and timed-out agents return a sentinel immediately; connecting agents proceed to dial. Dial timeout: wrap dialWithLazyValidation in a 30s context.WithTimeoutCause (matching 8 other server-side AgentConn callers). Parent context cancellation propagates unchanged so the chatloop can detect ErrInterrupted. Both sentinels tell the LLM the agent is unreachable and the workspace may need restarting from the dashboard. Closes CODAGT-149	2026-04-22 12:10:32 +00:00
Ethan	c1421b4ead	test(coderd/x/chatd): deflake stale control notification test (#24545 ) Previously, `TestProcessChat_IgnoresStaleControlNotification` could return as soon as `UpdateChatStatus` ran, even though `processChat` still re-read chat state and finished deferred cleanup afterward. That let gomock and quartz teardown race the tail of cleanup and intermittently fail the test. Wait for `processChat` itself to return before asserting the final status, while keeping the existing strict mock expectations intact. Closes https://github.com/coder/internal/issues/1479	2026-04-22 00:08:34 +10:00
Michael Suchacz	f073323c89	refactor: unify subagent spawn behind spawn_subagent (#24535 ) Unify the three subagent spawn tools (`spawn_agent`, `spawn_explore_agent`, `spawn_computer_use_agent`) behind a single `spawn_subagent` tool keyed by a `subagent_type` discriminant (`general`, `explore`, `computer_use`). Mirrors the single-entry-point pattern already used by `task` in mux while keeping `wait_agent`, `message_agent`, and `close_agent` as separate lifecycle tools. A new backend subagent definition catalog (`coderd/x/chatd/subagent_catalog.go`) is the source of truth for tool description, prompt guidance, availability rules (plan mode, desktop/Anthropic gating), and child-chat option building. `spawn_subagent` advertises only the types available in the current context and validates `subagent_type` server-side; context inheritance still flows through the existing `createChildSubagentChatWithOptions` path. `wait_agent`, `message_agent`, and `close_agent` responses now include a server-derived `subagent_type` so the UI stops inferring lifecycle state from tool names. The frontend gets a shared normalization helper (`site/src/pages/AgentsPage/components/ChatElements/tools/subagentDescriptor.ts`) that maps either legacy tool names or new `spawn_subagent` args into a common descriptor (action, variant, icon, fallback copy). Legacy transcripts still render identically; `Tool.tsx`, `SubagentTool.tsx`, `ToolLabel.tsx`, `ToolIcon.tsx`, and `messageParsing.ts` now key off the descriptor instead of hard-coded names. Existing UI copy is preserved (`Spawning Explore agent...`, `Using the computer...`, computer-use monitor icon and Open Desktop affordance). > This PR was opened by Mux working on Mike's behalf.	2026-04-21 14:01:32 +02:00
Michael Suchacz	9d0469fc4c	feat: allow approved external MCP tools in root plan mode (#24509 ) ## Summary Allow root plan-mode chats to use MCP tools from external servers that an admin has explicitly approved for plan mode. Workspace MCP and plan-mode subagents remain blocked. ## Problem `chatd.go` excluded every MCP tool when `isPlanModeTurn` was true, so planning had no access to tools like docs search, ticketing, etc. Lifting that guard wholesale was unsafe: `mcp_server_configs` already has centralized admin governance, but workspace-local MCP (discovered from agent `.mcp.json`) does not, and subagents use a narrower trust boundary. ## Fix Add an admin-controlled per-server `allow_in_plan_mode` flag (default `false`) and gate plan-mode MCP access on it. ### Backend / schema - New migration `000472_mcp_server_allow_in_plan_mode.{up,down}.sql` and matching fixture update. - `mcpserverconfigs.sql` + generated code: persist and read the new column. - `codersdk/mcp.go`: thread the field through `MCPServerConfig`, `Create`, and `Update` request types. - `coderd/mcp.go`: validate, persist, and return the flag in get/list/create/update handlers. ### chatd - `coderd/x/chatd/chatd.go`: pre-filter selected external MCP configs by `AllowInPlanMode` before calling `mcpclient.ConnectAll` on plan-mode root turns. Workspace MCP discovery is skipped entirely on plan-mode turns. - Single helper decides whether a tool is available in plan mode, used both at construction and for active-tool filtering (defense in depth). Plan-mode subagents, dynamic tools, provider-native tools, computer-use, and workspace MCP stay unchanged. - `coderd/x/chatd/prompt.go`: update the root plan-mode overlay text to match the new boundary. ### UI - `MCPServerAdminPanel.tsx`: add an explicit toggle ("Allow all tools from this MCP server in root plan mode") next to the existing governance controls. - Regenerated `site/src/api/typesGenerated.ts`. ### Docs - `docs/ai-coder/agents/architecture.md`: replace the blanket "MCP is unavailable in plan mode" note with the new root-only, external-only, admin-approved policy. Explicitly call out that workspace MCP and plan-mode subagents are still excluded. ### Tests - Plan-mode visibility (approved vs non-approved external server). - Plan-mode invocation of an approved external MCP tool. - End-to-end plan-mode workflow that uses an approved MCP tool and then reaches `propose_plan`. - Regressions: workspace MCP still excluded in plan mode; plan-mode subagents still on the restricted tool boundary; existing tool allow/deny list filtering still applies. ## Policy precedence `allow_in_plan_mode` is an additional requirement on top of existing `enabled`, availability, chat-selected / forced server IDs, and tool allow/deny lists. It approves all tools on that server for root plan mode; a per-tool plan allowlist is deliberately deferred. ## Follow-ups (explicitly out of scope) - Whether plan-mode subagents should inherit approved external MCP tools. - Workspace-local MCP safety model (agent-side `.mcp.json` schema vs. a coderd-managed workspace MCP config). ## Validation - `go vet ./coderd/x/chatd/...` - `go test ./coderd/x/chatd -run 'TestPlan.\|TestMCP.' -count=1` - `go test ./coderd/x/chatd -count=1 -timeout 5m` (full chatd suite) - `make fmt` (no diff) > Mux opened this PR on Mike's behalf.	2026-04-21 12:26:12 +02:00
Jaayden Halko	410f9a5e19	feat: allow renaming of agent chat title (#24489 ) Co-authored-by: Coder Agents <noreply@coder.com>	2026-04-20 14:00:46 +01:00
Thomas Kosiewski	df7e838c21	feat(coderd): wire debug logging into chat lifecycle (#23917 )	2026-04-20 12:27:16 +02:00
Cian Johnston	3f6b40a833	fix: reap idle chatd stream states on a timer (#24476 ) * Adds `streamJanitorLoop` to clean up stale streams every 30s * zeroes dropped slots to aid in gc-eligibliity * Adds regression tests in coderd/x/chatd and enterprise/coderd/x/chatd > 🤖	2026-04-17 19:22:00 +01:00
Michael Suchacz	73b5058923	feat: add Explore mode as subagent-only modality (#24448 ) > This PR was authored by Mux on behalf of Mike. Introduce Explore mode, a read-only subagent modality for delegated discovery and code investigation. ## What Adds a `spawn_explore_agent` tool that creates child chats restricted to read-only operations. An admin can optionally configure a deployment-wide model override so Explore subagents use a model optimized for large context or reasoning without changing the root chat's model. ### Backend - New `ChatModeExplore` enum value (migration 000471). - `spawn_explore_agent` tool definition with read-only allowlist: `read_file`, `execute`, `process_output`, `read_skill`, `read_skill_file`. Write tools, file editors, and nested subagent spawning are blocked. - Deployment config storage for the Explore model override (`agents_chat_explore_model_override` in `site_configs`). - Model resolution hierarchy: configured override, then current turn model, then global default. Silent fallback with warning log when the override becomes unavailable. - RBAC: `AsChatd` for daemon reads, `ActionRead` and `ActionUpdate` on `ResourceDeploymentConfig` for admin API calls. - Plan mode root chats can use `spawn_explore_agent` for read-only research, matching the planning prompt guidance. - The Explore override config API now reports malformed saved overrides as "treated as unset" so admins can clear them explicitly. ### Frontend - `ExploreModelOverrideSettings` component in admin agent behavior settings. Uses `ModelSelector`, handles unavailable model warnings, and supports explicit Save and Clear actions. - Malformed saved overrides show a warning and require an explicit Save to clear, instead of Clear auto-submitting behind the scenes. ### Tests - Integration: `TestExploreSubagentIsReadOnly` (full spawn flow, tool verification, prompt overlay, DB state). - Unit: tool allowlist tests for explore, plan, and default modes. - Internal: model override resolution with valid, invalid UUID, disabled, and unconfigured override scenarios. - RBAC: `dbauthz_test.go` for `GetChatExploreModelOverride` and `UpsertChatExploreModelOverride`. - API: admin set and clear, malformed stored override reporting, disabled model rejection, non-admin denial.	2026-04-17 13:40:17 +02:00
Michael Suchacz	1cf0354f72	feat: add plan mode with restricted tool boundary (#24236 ) > This PR was authored by Mux on behalf of Mike. ## Summary - add persistent plan mode for chats and the chat-specific plan file flow - add structured planning tools such as `ask_user_question` and `propose_plan` - keep `write_file` and `edit_files` constrained to the chat-specific plan file during plan turns - allow shell exploration in plan mode, including subagents, via `execute` and `process_output` - block implementation-oriented, provider-native, MCP, dynamic, and computer-use tools during plan turns - update the chat UI, tests, and docs for the new planning flow	2026-04-16 11:12:01 +02:00
Cian Johnston	d7439a9de0	feat: add Prometheus metrics for chatd subsystem (#24371 ) Adds 7 Prometheus metrics to the chatd subsystem and introduces typed `ActivityBumpReason` for deadline bump attribution. \| Metric \| Type \| Labels \| \|--------\|------\|--------\| \| `coderd_chatd_chats` \| Gauge \| `state` (streaming, waiting) \| \| `coderd_chatd_message_count` \| Histogram \| `provider` \| \| `coderd_chatd_prompt_size_bytes` \| Histogram \| `provider` \| \| `coderd_chatd_tool_result_size_bytes` \| Histogram \| `provider`, `tool_name` \| \| `coderd_chatd_ttft_seconds` \| Histogram \| `provider` \| \| `coderd_chatd_compaction_total` \| Counter \| `provider`, `result` \| \| `coderd_chatd_steps_total` \| Counter \| `provider` \| > 🤖	2026-04-15 19:53:10 +01:00
Danielle Maywood	38d4da82b9	refactor: send raw typed payloads over chat WebSockets (#24148 )	2026-04-10 10:47:30 +01:00
Kyle Carberry	391b22aef7	feat: add CLI commands for managing chat context from workspaces (#24105 ) Adds `coder exp chat context add` and `coder exp chat context clear` commands that run inside a workspace to manage chat context files via the agent token. `add` reads instruction and skill files from a directory (defaulting to cwd) and inserts them as context-file messages into an active chat. Multiple calls are additive — `instructionFromContextFiles` already accumulates all context-file parts across messages. `clear` soft-deletes all context-file messages, causing `contextFileAgentID()` to return `!found` on the next turn, which triggers `needsInstructionPersist=true` and re-fetches defaults from the agent. Both commands auto-detect the target chat via `CODER_CHAT_ID` (already set by `agentproc` on chat-spawned processes), or fall back to single-active-chat resolution for the agent. The `--chat` flag overrides both. Also adds sub-agent context inheritance: `createChildSubagentChat` now copies parent context-file messages to child chats at spawn time, so delegated sub-agents share the same instruction context without independently re-fetching from the workspace agent. <details><summary>Implementation details</summary> New files: - `cli/exp_chat.go` — CLI command tree under `coder exp chat context` Modified files: - `agent/agentcontextconfig/api.go` — `ConfigFromDir()` reads context from an arbitrary directory without env vars - `codersdk/agentsdk/agentsdk.go` — `AddChatContext`/`ClearChatContext` SDK methods - `coderd/workspaceagents.go` — POST/DELETE handlers on `/workspaceagents/me/chat-context` - `coderd/coderd.go` — Route registration - `coderd/database/queries/chats.sql` — `GetActiveChatsByAgentID`, `SoftDeleteContextFileMessages` - `coderd/database/dbauthz/dbauthz.go` — RBAC implementations for new queries - `coderd/x/chatd/subagent.go` — `copyParentContextFiles` for sub-agent inheritance - `cli/root.go` — Register `chatCommand()` in `AGPLExperimental()` Auth pattern: Uses `AgentAuth` (same as `coder external-auth`) — agent token via `CODER_AGENT_TOKEN` + `CODER_AGENT_URL` env vars. </details> > 🤖 Generated by Coder Agents --------- Co-authored-by: Michael Suchacz <203725896+ibetitsmike@users.noreply.github.com>	2026-04-09 16:33:00 +02:00
Kyle Carberry	684f21740d	perf(coderd): batch chat heartbeat queries into single UPDATE per interval (#24037 ) ## Summary Replaces N per-chat heartbeat goroutines with a single centralized heartbeat loop that issues one `UPDATE` per 30s interval for all running chats on a worker. ## Problem Each running chat spawned a dedicated goroutine that issued an individual `UPDATE chats SET heartbeat_at = NOW() WHERE id = $1 AND worker_id = $2 AND status = 'running'` query every 30 seconds. At 10,000 concurrent chats this produces ~333 DB queries/second just for heartbeats, plus ~333 `ActivityBumpWorkspace` CTE queries/second from `trackWorkspaceUsage`. ## Solution New `UpdateChatHeartbeats` (plural) SQL query replaces the old singular `UpdateChatHeartbeat`: ```sql UPDATE chats SET heartbeat_at = @now::timestamptz WHERE worker_id = @worker_id::uuid AND status = 'running'::chat_status RETURNING id; ``` A single `heartbeatLoop` goroutine on the `Server`: 1. Ticks every `chatHeartbeatInterval` (30s) 2. Issues one batch UPDATE for all registered chats 3. Detects stolen/completed chats via set-difference (equivalent of old `rows == 0`) 4. Calls `trackWorkspaceUsage` for surviving chats `processChat` registers an entry in the heartbeat registry instead of spawning a goroutine. ## Impact \| Metric \| Before (10K chats) \| After (10K chats) \| \|---\|---\|---\| \| Heartbeat queries/sec \| ~333 \| ~0.03 (1 per 30s per replica) \| \| Heartbeat goroutines \| 10,000 \| 1 \| \| Self-interrupt detection \| Per-chat `rows==0` \| Batch set-difference \| --- > 🤖 Generated by Coder Agents <details><summary>Implementation notes</summary> - Uses `@now` parameter instead of `NOW()` so tests with `quartz.Mock` can control timestamps. - `heartbeatEntry` stores `context.CancelCauseFunc` + workspace state for the centralized loop. - `recoverStaleChats` is unaffected — it reads `heartbeat_at` which is still updated. - The old singular `UpdateChatHeartbeat` is removed entirely. - `dbauthz` wrapper uses system-level `rbac.ResourceChat` authorization (same pattern as `AcquireChats`). </details>	2026-04-07 10:25:46 -04:00
Kyle Carberry	919dc299fc	feat: agent reads context files and discovers skills locally (#23935 ) Piggybacks on #23878. Moves instruction file reading and skill discovery from `chatd` (server-side, via multiple `LS`/`ReadFile` round-trips through the agent connection) to the agent itself (local filesystem access). This intentionally drops backward compatibility with older agents that don't support the context-config endpoint. Agents and server are deployed together; there is no rolling-update contract to maintain here. ## What changed The agent's `GET /api/v0/context-config` response now returns `[]ChatMessagePart` directly — the same types chatd persists. This eliminates intermediate type conversions and makes the protocol extensible. \| Field \| Type \| Description \| \|---\|---\|---\| \| `parts` \| `[]ChatMessagePart` \| Context-file and skill parts, ready to persist \| \| `working_dir` \| `string` \| Agent's resolved working directory \| Removed from the response: `instructions_dirs`, `instructions_file`, `skills_dirs`, `skill_meta_file`, `mcp_config_files` — the agent reads files locally and returns their content as parts. Removed from chatd: all legacy `LS`/`ReadFile` fallback code (`readHomeInstructionFile`, `readInstructionDirFile`, `DiscoverSkills` via LS, etc). ## Why The previous architecture had the agent resolve paths, serve them over HTTP, then `chatd` make N+1 round-trips back through the agent connection to read files. The agent has direct filesystem access and should just read the files. ## Key design decisions - Agent returns `ChatMessagePart` directly — same types chatd persists. No intermediate `InstructionFileEntry`/`SkillEntry` types needed. - `SkillMeta.MetaFile` — persisted via `ContextFileSkillMetaFile` on the skill part, so custom meta file names (`CODER_AGENT_EXP_SKILL_META_FILE`) survive across chat turns. - No pre-read body — `read_skill` always dials the workspace to fetch the skill body on demand. Simpler than caching the body in the response. - MCP config paths kept agent-internal — `MCPConfigFiles()` getter, not sent over the wire. - No backward compat fallback — old agents that don't support context-config get no instruction files. This is acceptable since agent and server deploy together.	2026-04-04 12:45:46 -04:00
Michael Suchacz	7d0a0c6495	feat: provider key policies and user provider settings (#23751 )	2026-04-02 19:46:42 +02:00
Ethan	fc1e0beb3b	fix(coderd/x/chatd): use structured output for chat title generation (#23909 ) Chat title generation used free-form text completion, which let models respond conversationally instead of producing a title. Review chats started with GitHub URLs were especially affected — models would say "I don't have the ability to browse external links" and that string became the persisted title. Replace the raw-text `generateShortText` path with structured output via `object.Generate[generatedTitle]`. Both auto-title and manual retitle now go through the same typed contract: the model must return a JSON object with a `title` field, validated and normalized before persistence. Invalid outputs (empty, too long) are rejected and retried through the existing candidate-model fallback loop.	2026-04-02 14:13:27 +11:00
Kyle Carberry	ee855f9618	feat: make agent context paths configurable via env vars (#23878 ) Replace hardcoded paths for instruction files, skills, and MCP config with values read from `CODER_AGENT_EXP_` environment variables. Template authors configure paths via the existing `coder_agent` `env` block. The agent resolves `~`, relative, and absolute paths locally, then serves the resolved config over `GET /api/v0/context-config`. `chatd` fetches this once per workspace attach and falls back to today's defaults for older agents. All path env vars are comma-separated, allowing multiple directories: \| Env Var \| Default \| Controls \| \|---\|---\|---\| \| `CODER_AGENT_EXP_INSTRUCTIONS_DIRS` \| `~/.coder` \| Dirs containing the instruction file \| \| `CODER_AGENT_EXP_INSTRUCTIONS_FILE` \| `AGENTS.md` \| Instruction file name \| \| `CODER_AGENT_EXP_SKILLS_DIRS` \| `.agents/skills` \| Skills directories \| \| `CODER_AGENT_EXP_SKILL_META_FILE` \| `SKILL.md` \| Skill metadata file name \| \| `CODER_AGENT_EXP_MCP_CONFIG_FILES` \| `.mcp.json` \| MCP config files \| ### Example ```hcl resource "coder_agent" "main" { os = "linux" arch = "amd64" env = { CODER_AGENT_EXP_INSTRUCTIONS_DIRS = "/opt/company/agent-config,~/.coder" CODER_AGENT_EXP_INSTRUCTIONS_FILE = "CLAUDE.md" CODER_AGENT_EXP_SKILLS_DIRS = "/opt/company/ai-skills,.agents/skills" CODER_AGENT_EXP_MCP_CONFIG_FILES = "/opt/company/mcp.json,.mcp.json" } } ``` <details> <summary>Implementation Details</summary> ### Architecture Follows the same pattern as MCP tool discovery: agent resolves locally → exposes via HTTP → chatd consumes. Agent-side* (`agent/agentcontextconfig/`): - `ResolvePath` / `ResolvePaths` handle `~`, relative, and absolute path forms; returns `""` for relative paths when baseDir is empty - `Config` reads env vars, falls back to defaults, resolves all paths - `GET /api/v0/context-config` serves the resolved config as JSON chatd-side (`coderd/x/chatd/`): - Calls `conn.ContextConfig()` once on first workspace attach - Falls back to hardcoded defaults on 404 (older agents) - Iterates instruction dirs, skills dirs using resolved absolute paths - `LSRelativityRoot` everywhere — no more home/root juggling ### Key design decisions - `EXP_` prefix: env vars use `CODER_AGENT_EXP_` to indicate experimental status - Plural names: comma-separated vars use plural names (`DIRS`, `FILES`); single-value vars use singular (`FILE`) - Defaults in `workspacesdk`: default constants live in `codersdk/workspacesdk/` so both agent and server reference them without cross-layer imports - `skillMetaFile` persistence: stored on context-file parts via `ContextFileSkillMetaFile` and restored on subsequent chat turns so custom values survive across turns - Working dir dedup: `slices.Contains` guard prevents reading the same instruction file from both `InstructionsDirs` and the working directory - MCP server dedup: first-occurrence-wins dedup prevents leaking duplicate connections from overlapping config files - ResolvePath safety*: returns `""` for relative paths when `baseDir` is empty, so `ResolvePaths` filters them out ### Files changed \| File \| Change \| \|---\|---\| \| `agent/agentcontextconfig/` \| New package — path resolution + HTTP endpoint \| \| `codersdk/workspacesdk/agentconn.go` \| `ContextConfigResponse` type, default constants, client method \| \| `agent/agent.go` + `agent/api.go` \| Wire up endpoint, pass config to MCP \| \| `agent/x/agentmcp/manager.go` \| Accept `[]string` MCP config paths, dedup by name \| \| `coderd/x/chatd/chatd.go` \| Fetch config, thread through, named returns \| \| `coderd/x/chatd/instruction.go` \| Accept configurable dir + file name, `skillMetaFileFromParts` \| \| `coderd/x/chatd/chattool/skill.go` \| Accept configurable dirs + meta file \| \| `codersdk/chats.go` \| `ContextFileSkillMetaFile` field for persistence \| ### Test coverage - `TestConfig` (4 cases): defaults, custom env vars, whitespace trimming, comma-separated dirs - `TestResolvePath` / `TestResolvePaths`: including empty baseDir edge case - `TestPersistInstructionFilesFallbackOnOlderAgent`: backward-compat path when `ContextConfig` returns 404 - `TestChatMessagePartVariantTags`: updated exclusion list for new internal field ### Backward compatibility Older agents return 404 for the new endpoint. `chatd` catches this and falls back to today's defaults via `readHomeInstructionFile` (using `LSRelativityHome`). Existing workspaces work with no changes. </details>	2026-04-01 12:28:47 -04:00
Cian Johnston	a164d508cf	fix(coderd/x/chatd): gate control subscriber to ignore stale pubsub notifications (#23865 ) Fixes flaky `TestOpenAIReasoningWithWebSearchRoundTripStoreFalse` and `TestOpenAIReasoningWithWebSearchRoundTrip`. ## Changes - Gate the `processChat` control subscriber's cancel callback behind a `chan struct{}` that is closed after publishing `"running"` status - Add `TestGatedControlCancel` with 4 subtests exercising the gate logic <details> <summary>Root cause analysis</summary> `SendMessage` publishes a `"pending"` notification on `chat:stream:<chatID>` via PostgreSQL `NOTIFY`. `processChat` subscribes to the same channel for control signals. Due to async NOTIFY delivery, the `"pending"` notification can arrive at the control subscriber after it registers its queue — even though it was published before. `shouldCancelChatFromControlNotification("pending")` returns `true`, immediately self-interrupting the processor before it does any work. The fix gates the cancel callback behind a closed channel. The channel is closed after `processChat` publishes `"running"` status, so stale notifications from before initialization are harmlessly ignored. `close()` provides a happens-before guarantee in the Go memory model. </details> > 🤖 Written by a Coder Agent. Reviewed by a human.	2026-03-31 22:55:20 +01:00
Kyle Carberry	a5cc579453	feat: add last_injected_context column to chats table (#23798 ) Adds a nullable JSONB column `last_injected_context` to the `chats` table that stores the most recently persisted injected context parts (AGENTS.md context-file and skill message parts). The column is updated only when `persistInstructionFiles()` runs — on first workspace attach or when the agent changes — so there are no redundant writes on subsequent turns. Internal fields (`ContextFileContent`, `ContextFileOS`, `ContextFileDirectory`, `SkillDir`) are stripped at write time so the column only holds small metadata. No stripping needed on the read path. <details> <summary>Implementation notes</summary> - New migration `000456` adds nullable `last_injected_context JSONB` column. - New SQL query `UpdateChatLastInjectedContext` writes the column without touching `updated_at`. - `persistInstructionFiles()` strips internal fields from parts via `StripInternal()` before persisting. - Sentinel path (no AGENTS.md) persists skill-only parts when skills exist. - `codersdk.Chat` exposes `LastInjectedContext []ChatMessagePart` (omitempty). - `db2sdk.Chat()` passes through the already-clean data. </details>	2026-03-30 14:11:30 -04:00
Kyle Carberry	4d2b0a2f82	feat: persist skills as message parts like AGENTS.md (#23748 ) ## Summary Skills are now discovered once on the first turn (or when the workspace agent changes) and persisted as `skill` message parts alongside `context-file` parts. On subsequent turns, the skill index is reconstructed from persisted parts instead of re-dialing the workspace agent. This makes skills consistent with the AGENTS.md pattern and is groundwork for a future `/context` endpoint that surfaces loaded workspace context to the frontend. ## Changes - Add `skill` `ChatMessagePartType` with `SkillName` and `SkillDescription` fields - Extend `persistInstructionFiles` to also discover and persist skills as parts - Add `skillsFromParts()` to reconstruct skill index from persisted parts on subsequent turns - Update `runChat()` to use `skillsFromParts` instead of re-dialing workspace for skills - Frontend: handle new `skill` part type (skip rendering, hide metadata-only messages) ## Before / After \| \| AGENTS.md \| Skills \| \|---\|---\|---\| \| Before \| Persist as `context-file` parts, reconstruct from parts \| In-memory `skillsCache` only, re-dial workspace on cache miss \| \| After \| Persist as `context-file` parts, reconstruct from parts \| Persist as `skill` parts, reconstruct from parts \| The in-memory `skillsCache` remains for `read_skill`/`read_skill_file` tool calls that need full skill bodies on demand. <details><summary>Design context</summary> This is the first step toward a unified workspace context representation. Currently: - Context files are persisted as message parts (works) - Skills were only in-memory (inconsistent) - Workspace MCP servers are cached in-memory (future work) Persisting skills as parts means a future `/context` endpoint can query both context files and skills from the same message parts in the DB, without depending on ephemeral server-side caches. </details>	2026-03-29 21:48:17 -04:00
Michael Suchacz	bfeb91d9cd	fix: scope title regeneration per chat (#23729 ) Previously, generating a new agent title used a page-global pending state, so one in-flight regeneration disabled the action for every chat in the Agents UI. This change tracks regenerations by chat ID, updates the Agents page contracts to use `regeneratingTitleChatIds`, and adds sidebar story coverage that proves only the active chat is disabled.	2026-03-29 00:01:53 +01:00
Ethan	c4ef94aacf	fix(coderd/x/chatd): prevent chat hang when workspace agent is unavailable (#23707 ) ## Problem Chats with a persisted `agent_id` binding hang indefinitely when the workspace is stopped. The stale agent row still exists in the DB, so `ensureWorkspaceAgent` succeeds, but the dial blocks forever in `AwaitReachable`. The MCP discovery goroutine used an unbounded context, so `g2.Wait()` never returned and the LLM never started. ## Fix Three targeted changes restore the pre-binding behavior where stopped workspaces degrade gracefully instead of blocking: 1. `dialWithLazyValidation`: "no agents in latest build" is now a terminal fast-fail — the hanging dial is canceled and `errChatHasNoWorkspaceAgent` returned immediately, instead of falling through to `waitForOriginalDial`. 2. Pre-LLM workspace setup: MCP discovery and instruction persistence gate on `workspaceAgentIDForConn` before attempting any dial. MCP discovery is bounded by a 5s timeout and checks the in-memory tool cache first (using the cheap cached agent from `ensureWorkspaceAgent`), so the common subsequent-turn path has zero DB queries. 3. `persistInstructionFiles`: tracks whether the workspace connection succeeded and skips sentinel persistence on failure, so the next turn retries if the workspace is restarted. ## Scenarios Running workspace, subsequent turn (hot path): MCP cache hit via in-memory cached agent. Zero DB queries, zero dials. Unchanged from #23274. Stopped workspace, persisted binding (the bug): MCP cache hit (stale descriptors, fine — they fail at invocation). Pre-LLM setup completes instantly. Tool invocation enters `dialWithLazyValidation`, dial fails or hangs, validation discovers no agents, returns `errChatHasNoWorkspaceAgent`. Model sees the error and can call `start_workspace`. New chat, running workspace: `ensureWorkspaceAgent` resolves via latest-build, persists binding. MCP discovery dials and caches tools. New chat, stopped workspace: `ensureWorkspaceAgent` finds no agents, returns `errChatHasNoWorkspaceAgent`. Pre-LLM setup skips. LLM starts with built-in tools only. Rebuilt workspace (agent switched): MCP cache hit with stale agent (harmless for one turn). Tool invocation dials stale agent, fails fast, `dialWithLazyValidation` switches to new agent, persists updated binding. Workspace restarted after stop: No sentinel was persisted during the stopped turn, so instruction persistence retries. Agent binding switches to the new agent via `workspaceAgentIDForConn`. Transient DB error during validation: Not `errChatHasNoWorkspaceAgent`, so `dialWithLazyValidation` falls through to `waitForOriginalDial` (cannot prove stale). No false positive. Tool invocation on stopped workspace: `getWorkspaceConn` calls `ensureWorkspaceAgent` (returns stale row), then `dialWithLazyValidation` validation discovers no agents, returns `errChatHasNoWorkspaceAgent`, cached state cleared, error returned to model.	2026-03-27 18:47:39 +11:00
Michael Suchacz	2312e5c428	feat: add manual chat title regeneration (#23633 ) ## Summary Adds a "Generate new title" action that lets users manually regenerate a chat's title using richer conversation context than the automatic first-message title path. ## Changes ### Backend - New endpoint: `POST /api/experimental/chats/{chatID}/title/regenerate` returns the updated Chat with a regenerated title - Manual title algorithm: Extracts useful user/assistant text turns → selects first user turn + last 3 turns → builds context with gap markers → renders prompt with anti-recency guidance → calls lightweight model → normalizes output - Helpers: `extractManualTitleTurns`, `selectManualTitleTurnIndexes`, `buildManualTitleContext`, `renderManualTitlePrompt`, `generateManualTitle` — all private, with the public `Server.RegenerateChatTitle` method - SDK: `ExperimentalClient.RegenerateChatTitle(ctx, chatID) (Chat, error)` - Persists title via existing `UpdateChatByID` and broadcasts `ChatEventKindTitleChange` ### Frontend - API client method + React Query mutation with cache invalidation - "Generate new title" menu item (with wand icon) in both TopBar and Sidebar dropdown menus - Loading/disabled state while regeneration is in-flight - Error toast on failure - Stories updated for both menus ### Tests - `quickgen_test.go`: Table-driven tests for all 4 helper functions (turn extraction, index selection, context building, prompt rendering) - `exp_chats_test.go`: Handler tests (ChatNotFound, NotFoundForDifferentUser, NoDaemon) ## Design notes - The existing auto-title path (`maybeGenerateChatTitle`, `titleInput`) is completely unchanged - Manual regeneration uses richer context (first user turn + last 3 turns + gap markers) vs the auto path's single first message - Endpoint is experimental and marked with `@x-apidocgen {"skip": true}`	2026-03-27 01:47:19 +01:00
Ethan	21c2acbad5	fix: refine chat retry status UX (#23651 ) Follow-up to #23282. The retry and terminal error callouts had a few UX oddities: - Auto-retrying states reused backend error text that said "Please try again" even while the UI was already retrying on behalf of the user. - Terminal error states also said "Please try again" with no action the user could take. - `startup_timeout` had no specific title or retry copy — it fell through to the generic "Retrying request" heading. - The kind pill showed raw enum values like `startup_timeout` and `rate_limit`. - Terminal error metadata showed a "Retryable" / "Not retryable" label that does not help users. - A separate "Provider anthropic" metadata row duplicated information already present in the message body. - The `usage-limit` error kind used a hyphen while every backend kind uses underscores. Changes: Backend (`chaterror/message.go`) - Split message generation into `terminalMessage()` and `retryMessage()`, replacing the old `userFacingMessage()`. - Terminal messages include HTTP status codes and actionable guidance (e.g. "Check the API key, permissions, and billing settings."). - Retry messages are clean factual statements without status codes or remediation, suitable for the retry countdown UI (e.g. "Anthropic is temporarily overloaded."). - Removed "Please try again" / "Please try again later" from all paths. - `StreamRetryPayload` calls `retryMessage()` instead of forwarding `classified.Message`. Frontend - Removed the parallel frontend message-generation system: `getRetryMessage()`, `getProviderDisplayName()`, `getRetryProviderSubject()`, and the `PROVIDER_DISPLAY_NAMES` map are all deleted from `chatStatusHelpers.ts`. - `liveStatusModel.ts` passes `retryState.error` through directly — the backend owns the copy. - Added specific title and retry copy for `startup_timeout`, and extended the title mapping to cover `auth` and `config`. - Kind pills now show humanized labels ("Startup timeout", "Rate limit", etc.) instead of raw enum strings. - Removed the redundant "Provider anthropic" metadata row. - Removed the terminal "Retryable" / "Not retryable" badge. - Normalized `"usage-limit"` → `"usage_limit"` and added it to `ChatProviderFailureKind` so all error kinds follow the same underscore convention and live in one enum. Refs #23282.	2026-03-26 17:37:27 +11:00

1 2

55 Commits