coder

mirror of https://github.com/coder/coder.git synced 2026-06-04 13:38:21 +00:00

Author	SHA1	Message	Date
Michael Suchacz	99a83a2702	fix: clean Bedrock headers (#24718 ) Bedrock chat provider requests can inherit Anthropic public API headers from the process environment, which causes mixed Anthropic and Bedrock auth headers on signed requests. Update the Anthropic SDK fork so its Bedrock middleware strips Anthropic-only headers before signing requests, and keep a chatprovider regression test for the production request shape. > Mux is acting on Mike's behalf.	2026-04-26 21:50:29 +02:00
Michael Suchacz	62e9752acd	fix: prevent malformed OpenAI Responses continuations (#24725 ) > Worked on by Mux on Mike's behalf. ## Summary - Disable OpenAI Responses `previous_response_id` chain mode when the prior assistant response has unresolved local tool calls, so the next request can include paired tool outputs instead of sending an incomplete continuation. - Update the fantasy pin to a Responses replay fix that preserves stored reasoning references, only replays web search references when paired with reasoning, and validates local function-call output pairing before send. - Add fake OpenAI Responses input validation for the two production 400 shapes and integration coverage for full-history reasoning plus web search replay. - Add sanitized diagnostics for the OpenAI Responses continuity errors. ## Tests - `go test ./providers/openai -run 'TestResponsesToPrompt_(ReasoningWithStore\|ReasoningWithWebSearchCombined\|WebSearchRequiresReasoningReference\|ReasoningWithFunctionCallCombined\|WebSearchProviderExecutedToolResults)\|TestPrepareParams_(SkipsProviderExecutedToolReferences\|ValidatesFunctionCallOutputPairing)\|TestValidateResponsesInput_WebSearchReferenceRequiresReasoning' -count=1` - `go test ./providers/openai -count=1` - `GOWORK=off go test ./coderd/x/chatd/chattest -run TestValidateResponsesAPIInput -count=1` - `GOWORK=off go test ./coderd/x/chatd -run 'TestOpenAIResponses(NoStaleWebSearchReplay\|FullReplayPairsReasoningAndWebSearch\|ChainModeSkipsWhenLocalCallPending\|ChainModeStillFiresForProviderExecutedOnly)$\|TestResolveChainMode_' -count=1` - `GOWORK=off go test ./coderd/x/chatd/chatprompt -run 'TestInjectMissingToolResults_' -count=1` - `GOWORK=off go test ./coderd/x/chatd/chaterror -run TestClassify_OpenAIResponsesAPIDiagnostics -count=1` - `GOWORK=off go test ./coderd/x/chatd/... -count=1` - `git diff --check` - `git commit` pre-commit hook	2026-04-26 21:23:06 +02:00
Michael Suchacz	ed33e28b13	fix(coderd/x/chatd): wake after auto-promoting queued message (#24714 ) `tryAutoPromoteQueuedMessage` in `processChat`'s deferred cleanup could set a chat back to `pending` without waking the processor. The processor only noticed on the next 10ms poll, so under load tests like `TestAutoPromoteQueuedMessageFallsBackForInvalidQueuedModelConfigID` could time out waiting for the second streaming request (#1500). Call `p.signalWake()` after the promoted-message publishes when `promotedMessage != nil`, matching the pattern used by `CreateChat`, `SendMessage`, `EditMessage`, `PromoteQueued`, and `InterruptChat`. Make the regression helper `testAutoPromoteQueuedMessageFallback` deterministic by setting `PendingChatAcquireInterval = time.Hour` and synchronizing on a `secondRunStarted` channel instead of polling `requestCount`, so the test fails without the wake instead of relying on the 10ms ticker. Closes https://github.com/coder/internal/issues/1500 > Mux is acting on Mike's behalf.	2026-04-26 11:08:32 +02:00
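The wake-versus-poll behaviour described in the commit above can be sketched as follows. This is a minimal illustration only: the `processor` type, `newProcessor`, and `waitForWork` are hypothetical names, and only `signalWake` is named in the commit message. It shows why an explicit wake removes the dependency on the poll ticker.

```go
package main

import "time"

// processor is a hypothetical stand-in for the chat processor; only the
// wake channel and the poll interval matter for this sketch.
type processor struct {
	wake         chan struct{}
	pollInterval time.Duration
}

// newProcessor builds a processor with a one-slot wake channel.
func newProcessor(pollInterval time.Duration) *processor {
	return &processor{wake: make(chan struct{}, 1), pollInterval: pollInterval}
}

// signalWake requests an immediate pass. The send is non-blocking: if a
// wake is already pending, the new request is coalesced with it.
func (p *processor) signalWake() {
	select {
	case p.wake <- struct{}{}:
	default:
	}
}

// waitForWork blocks until a wake is signaled or the poll interval
// elapses, and reports whether it was woken explicitly.
func (p *processor) waitForWork() bool {
	timer := time.NewTimer(p.pollInterval)
	defer timer.Stop()
	select {
	case <-p.wake:
		return true
	case <-timer.C:
		return false
	}
}
```

With a very long poll interval, as the regression helper does with `PendingChatAcquireInterval = time.Hour`, `waitForWork` only returns promptly if something called `signalWake`, which is what makes a missing wake visible in a test.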
Michael Suchacz	0211448d09	fix(coderd): sanitize Anthropic provider tool history (#24706 ) Anthropic can reject replayed chat histories when a provider-executed tool call, such as `web_search`, is present without its matching provider result block. This sanitizes unpaired Anthropic provider-executed tool calls during prompt reconstruction, before Anthropic requests, and before persistence so existing poisoned histories can continue and new malformed turns are not stored. Resolves: CODAGT-259 > Mux is acting on Mike's behalf.	2026-04-24 23:57:30 +02:00
Michael Suchacz	c7cac9debe	fix: persist per-turn model on chats and queued messages (#24688 ) Previously, `chats.last_model_config_id` was not updated when a user sent a mid-chat message with a different model, and queued messages did not store their own per-turn model, so promotion ran against whatever the chat row said at promote time. Chat watch events also did not merge `last_model_config_id` into the site's root, child, and per-chat caches, so sidebar labels stayed stale after direct sends and queued promotions. - Add nullable `chat_queued_messages.model_config_id`, backfilled from `chats.last_model_config_id`. Queued inserts round-trip the effective model id at enqueue time. - In `coderd/x/chatd`, direct sends update `chats.last_model_config_id` inside the same transaction that inserts the admitted user message. Manual promotion and auto-promotion use the queued row's stored `model_config_id`, with a fallback to `chats.last_model_config_id` for legacy NULL rows during rollout. `PromoteQueuedOptions.ModelConfigID` is now ignored. - On the site, extract `mergeWatchedChatSummary` and `mergeWatchedChatIntoCaches` in `site/src/api/queries/chats.ts` so status-change watch events merge `last_model_config_id` into the root infinite chat list, the parent-embedded child entry, and the per-chat `chatKey(chatId)` cache. `updated_at` guards against stale watch payloads clobbering newer cached state, while diff status events still merge their PR metadata because they are timestamped outside the chat row. Watch timestamps are compared as instants so variable fractional precision does not make fresh events look stale. - Queued promotion validates stored model config IDs before admission. Invalid legacy queued IDs fall back to the chat's current model config instead of dropping the queued message during auto-promotion. 
- Backend and frontend regression coverage added for admission, queue promotion (including FIFO across mixed models, legacy NULL fallback, and invalid queued model IDs), and chat watch cache merging. > Mux is acting on Mike's behalf.	2026-04-24 15:36:08 +02:00
Michael Suchacz	3d90546aae	feat: add general subagent model override (#24610 ) Adds a deployment-wide admin override for general delegated subagents. ## What changed - store the general override in `site_configs` and expose it through the shared `agent-model-override/{context}` API - apply the general override when spawning delegated general subagents, while preserving the existing Explore override behavior - reuse a shared Agents settings form for the general and Explore override sections ## Validation - `make gen` - `go test ./coderd -run 'TestChatModelOverrides'` - `go test ./coderd/x/chatd -run 'TestSpawnAgent_(GeneralUsesConfiguredModelOverride\|GeneralOverrideLogsAndFallsBackWhenCredentialsUnavailable\|GeneralOverrideLogsAndFallsBackWhenProviderDisabled)'` - `pnpm -C site lint:types` - `pnpm -C site test:storybook -- AgentSettingsAgentsPageView.stories.tsx` - `make lint` - `make pre-commit` > Mux is acting on Mike's behalf.	2026-04-24 12:37:20 +02:00
Cian Johnston	a02339c66a	fix(coderd/x/chatd): prevent invalid tool results from poisoning chat history (#24663 ) - computeruse.go: Decode base64 screenshot data before storing in `ToolResponse.Data` (was casting base64 string to bytes without decoding) - chatloop.go: Re-encode `ToolResponse.Data` to base64 via `base64.StdEncoding.EncodeToString` instead of `string()` cast - mcpclient.go: UTF-8 validate all text from MCP responses in `convertCallResult()` using `strings.ToValidUTF8` - chatprompt.go (persist): Defense-in-depth UTF-8 sanitization of text and media Text fields before database storage - chatprompt.go (replay): Antivenom layer that validates base64 and UTF-8 at read time, auto-healing already-poisoned chats without requiring a migration - `TestToolResultAntivenom`: 4 subtests covering poisoned text, poisoned media, valid media round-trip, and media with invalid UTF-8 text - Adds `TestConvertCallResult_UTF8Sanitization`: 4 subtests covering invalid UTF-8 in TextContent, EmbeddedResource, valid passthrough, and multi-part - Adds `TestComputerUseTool_Run_ScreenshotDataIsDecodedBinary`: Verifies no double-encode in the computer-use path - Updated existing computer-use tests for the new decoded-binary contract > 🤖	2026-04-23 19:58:38 +01:00
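The encoding fixes listed in the commit above can be illustrated with a small sketch. The function names here are hypothetical and the choice of U+FFFD as the replacement is an assumption; only `base64.StdEncoding.EncodeToString` and `strings.ToValidUTF8` are named in the commit message.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// decodeScreenshot turns a base64 payload into raw bytes before storage.
// Casting the base64 string to []byte instead would store the encoded
// text, which later gets encoded a second time.
func decodeScreenshot(b64 string) ([]byte, error) {
	data, err := base64.StdEncoding.DecodeString(b64)
	if err != nil {
		return nil, fmt.Errorf("decode screenshot: %w", err)
	}
	return data, nil
}

// encodeForProvider re-encodes raw bytes for transport. string(data)
// would pass arbitrary binary through as if it were text.
func encodeForProvider(data []byte) string {
	return base64.StdEncoding.EncodeToString(data)
}

// sanitizeText replaces each run of invalid UTF-8 bytes with U+FFFD.
// The replacement string is an assumption made for this sketch.
func sanitizeText(s string) string {
	return strings.ToValidUTF8(s, "\uFFFD")
}
```

Decoding once at the write boundary and encoding once at the read boundary keeps the stored value in a single, unambiguous representation.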
Michael Suchacz	dbcc654d28	feat: snapshot explore subagent tool entitlements (#24638 ) Explore sub-agents previously could not use `web_search` or external MCP tools. `runChat` hard-skipped both for Explore. Lifting those guards naively would over-grant tools, because a child chat could outlive the spawning turn's plan-mode filter. This change persists the spawning parent turn's filtered external MCP server IDs onto the child Explore chat, and simplifies the Explore provider-tool filter in `runChat`: - New `resolveExploreToolSnapshot` helper: computes the child's inherited external MCP subset by running the parent's configs through `filterExternalMCPConfigsForTurn` (plan-mode policy) and, if the parent is itself an Explore child, further narrowing to the parent's own persisted `MCPServerIDs`. The result is written to the child's `MCPServerIDs` column at spawn time. - The existing `mcp_server_ids` column is the sole durable snapshot. No new chat column is added. - `runChat` for Explore children: loads MCP tools from the persisted snapshot, and keeps only `web_search` from provider-native tools (to block computer-use and other write-style tools, since Explore is read-only). Whether `web_search` is actually available is a per-model decision, determined by the current model config, just like a main chat. - Built-in Explore allowlist is unchanged. Workspace-local MCP remains excluded for Explore. Verification: `go build ./...`, `go test ./coderd/x/chatd/... -count=1`, `make gen` (clean tree), `make lint/emdash`, `go vet`. Deep-review ran 12 reviewers on the feature and 5 on the clarity refactor; CAR reviewed and approved; a subsequent scope reduction dropped a temporary `allow_web_search` column in favor of per-model handling. > Mux is acting on Mike's behalf.	2026-04-23 19:07:38 +02:00
Mathias Fredriksson	f8fe5d680b	fix(coderd): reject API operations on archived chats (#24633 ) Archived chats accept mutations (messages, edits, queued-message promotions, tool-result submissions) via the API, causing them to re-enter the processing pipeline. This violates the hard-stop design intent from PR #23758. Add archived checks at three layers: - HTTP handlers (postChatMessages, patchChatMessage, promoteChatQueuedMessage, postChatToolResults): return 400 after auth so callers get a clear error. - Daemon functions (SendMessage, EditMessage, PromoteQueued, SubmitToolResults): return ErrChatArchived after row lock, guarding against future callers that bypass the handler. - AcquireChats SQL: filter out archived chats so they are never acquired for processing. Fixes CODAGT-245	2026-04-23 19:03:33 +03:00
Cian Johnston	2e5c7d99c2	fix(coderd/x/chatd): fix flaky TestSpawnComputerUseAgentInheritsContext (#24666 ) Fixes flaky `TestSpawnComputerUseAgentInheritsContext`. - The test inserts an Anthropic provider directly into the DB after `CreateChat` has already been called - The server's background goroutine may have already cached the provider list (OpenAI only) via `configCache.EnabledProviders()` with a 10s TTL - The direct DB insert bypasses the pubsub event that production uses to invalidate the cache - `isAnthropicConfigured()` returns the stale cached result, making `computer_use` appear unavailable - Fix: call `server.configCache.InvalidateProviders()` after the insert, mirroring what production does via pubsub CI failure: https://github.com/coder/coder/actions/runs/24829197096/job/72673070101?pr=24648 > 🤖	2026-04-23 13:18:18 +01:00
Mathias Fredriksson	1ace519c6e	fix(coderd/x/chatd): remove cache-miss check blocking agent recovery (#24634 ) The cache-miss isAgentUnreachable check added in #24336 runs before dialWithLazyValidation, preventing the existing switch mechanism from discovering the new agent after a workspace rebuild. The chat's stale agent binding is never repaired, causing an infinite loop of 'agent is disconnected' errors. Remove the cache-miss check. The cache-hit check remains (it verifies the agent behind an established connection). The dial timeout and dialWithLazyValidation already bound the cache-miss failure path. Closes CODAGT-248	2026-04-22 21:49:10 +03:00
Cian Johnston	72e3ae9c5f	feat: add chatd tool call error metrics and logging (#24559 ) - Add `coderd_chatd_tool_errors_total` prometheus counter (labels: provider, model, tool_name) - Log tool call errors at warn level with correlation fields: chat_id, owner_id, organization_id, workspace_id, agent_id, parent_chat_id, trigger_message_id, tool_name, tool_call_id, provider, model - Thread enriched logger from chatd.go into chatloop via `RunOptions.Logger` - Remove squashing of all MCP tool calls to the `mcp` bucket > 🤖	2026-04-22 16:19:56 +00:00
Michael Suchacz	9b5d09ebdc	test(coderd/x/chatd): seed anthropic provider for computer_use tests (#24611 ) `TestSubagentLifecycleToolsIncludePersistedSubagentTypeAcrossVariants/ComputerUse` and two adjacent positive tests passed a static Anthropic key into `newInternalTestServer`, but `seedInternalChatDeps` only inserts an OpenAI provider. At runtime, `Server.resolveUserProviderAPIKeys` calls `chatprovider.PruneDisabledProviderKeys`, which clears `keys.Anthropic` because Anthropic is not in the enabled DB provider set, so the `computer_use` execution path loses its key. Add a focused test helper `seedEnabledAnthropicProvider` and use it only in the positive tests that actually drive a `computer_use` spawn through the runtime key-resolution path (the `computer_use` branch of `TestSubagentLifecycleToolsIncludePersistedSubagentTypeAcrossVariants`, `TestSpawnAgent_ComputerUseUsesComputerUseModelNotParent`, and `TestSpawnAgent_ComputerUseInheritsMCPServerIDs`). `seedInternalChatDeps` stays unchanged, so the negative availability tests continue to model the "Anthropic unavailable" fixture. No production code is modified. Closes https://github.com/coder/internal/issues/1486 > This PR was opened by Mux working on Mike's behalf.	2026-04-22 15:54:17 +02:00
Thomas Kosiewski	b7c2c59931	fix(coderd/x/chatd/chatdebug): allow Anthropic per-modality ratelimit headers (#24592 ) Previously, Anthropic's per-modality, Priority Tier, and fast-mode rate-limit headers (`Anthropic-Ratelimit-Input-Tokens-`, `Anthropic-Ratelimit-Output-Tokens-`, `Anthropic-Priority-Input-Tokens-`, `Anthropic-Priority-Output-Tokens-`, `Anthropic-Fast-Input-Tokens-`, and `Anthropic-Fast-Output-Tokens-`) were shown as `[REDACTED]` in the Debug panel because they contain `"token"` in the name and fell through the generic credential filter. Add them to the allowlist in `coderd/x/chatd/chatdebug/redaction.go` alongside the existing `Anthropic-Ratelimit-Tokens-*` entries so the limits/remaining/reset values surface in the raw response view.	2026-04-22 15:14:31 +02:00
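A rough sketch of the allowlist-before-generic-filter ordering described in the commit above. This is not the contents of `redaction.go`: the function name, the lowercase comparison, and the reduction of the credential filter to a single `"token"` substring check are simplifying assumptions. The prefixes are taken from the commit message.

```go
package main

import "strings"

// allowedPrefixes holds lowercase header-name prefixes that carry
// rate-limit numbers rather than credentials.
var allowedPrefixes = []string{
	"anthropic-ratelimit-tokens-",
	"anthropic-ratelimit-input-tokens-",
	"anthropic-ratelimit-output-tokens-",
	"anthropic-priority-input-tokens-",
	"anthropic-priority-output-tokens-",
	"anthropic-fast-input-tokens-",
	"anthropic-fast-output-tokens-",
}

// redactHeader returns the value to display for a header. Allowlisted
// prefixes are checked first; any other name containing "token" is
// treated as a credential and redacted.
func redactHeader(name, value string) string {
	lower := strings.ToLower(name)
	for _, prefix := range allowedPrefixes {
		if strings.HasPrefix(lower, prefix) {
			return value
		}
	}
	if strings.Contains(lower, "token") {
		return "[REDACTED]"
	}
	return value
}
```

Checking the allowlist before the substring match is what lets `Anthropic-Ratelimit-Input-Tokens-Remaining` through while a header such as `X-Auth-Token` stays hidden.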
Thomas Kosiewski	26b64fa523	fix(coderd/x/chatd/chatdebug): record SSE attempts on EOF (#24565 ) `chat_turn` debug steps persist with `attempts: []` even when the streaming call to Anthropic completes successfully. Fantasy's Anthropic SSE adapter iterates the response to EOF via `for stream.Next()` and abandons the body without calling `Close()`, so `RecordingTransport`'s Close-only recording path never fires and the attempt is lost. Non-streaming runs (`quickgen`, `title_generation`) go through `model.Generate(...)` and are unaffected. Record on `io.EOF` for `text/event-stream` bodies specifically. Non-SSE responses stay on the Close-only path so JSON integrity, content-length validation, and inner-`Close()` error semantics are preserved. `record()` is already `sync.Once`-guarded, so a later `Close()` is a no-op for recording.	2026-04-22 15:02:02 +02:00
Michael Suchacz	9634739aed	fix: support Bedrock ambient AWS credentials for Agents providers (#24397 ) > This PR was authored by Mux on behalf of Mike. Adds AWS Bedrock ambient credential support to the Agents provider path. Bedrock providers can now be saved without a stored API key and authenticated via the standard AWS SDK credential chain on the Coder server (IAM roles, `AWS_ACCESS_KEY_ID`, etc.). Also fixes missing `Base URL` forwarding for Bedrock. ## Changes Backend runtime (`coderd/x/chatd/chatprovider/chatprovider.go`): - New `ProviderAllowsAmbientCredentials(provider)` helper. Currently returns true only for Bedrock. - `ModelFromConfig` no longer errors on an empty API key when the provider is in the ambient-allowed set AND was explicitly resolved via `ByProvider`. This preserves the policy gate: unresolvable providers (disabled central key, user-key-required without a user key) still error. - `setResolvedProviderAPIKey` internalizes the ambient-credentials contract via `ProviderAllowsAmbientCredentials`, so a resolved-but-keyless Bedrock provider is represented as an empty `ByProvider` entry rather than a post-hoc sentinel patch in the caller. - `WithAPIKey` is only appended when a token is present. - `WithBaseURL(baseURL)` is now forwarded for Bedrock (was previously missing). Backend admin API (`coderd/exp_chats.go`): - `validateChatProviderCentralAPIKey` exempts Bedrock from requiring a stored API key when central credentials are enabled. - AI Gateway separation (`ChatProviderAPIKeysFromDeploymentValues`) is unchanged. No silent reuse of `CODER_AIBRIDGE_BEDROCK_` flags. Frontend* (`site/src/pages/AgentsPage/components/ChatModelAdminPanel/`): - API Key field is optional for Bedrock when central credentials are enabled. - Bedrock-specific descriptions on API Key and Base URL fields (bearer-token vs ambient modes, `AWS_REGION` guidance). - Right-aligned "Clear stored token" action switches an existing Bedrock provider back to ambient mode. 
- `hasEffectiveAPIKey` treats Bedrock with central credentials enabled as configured, so the provider list shows the correct status icon. - Three new stories: `ProviderFormBedrockAmbientCredentials`, `ProviderFormBedrockBearerToken`, `ProviderFormBedrockClearBearerToken`. Docs* (`docs/ai-coder/agents/models.md`, `docs/ai-coder/ai-gateway/setup.md`): - New "Configuring AWS Bedrock" section covering both credential modes, region resolution, and the Base URL override. - Explicit note that the `us-east-1` region fallback only applies to bearer-token mode; ambient credentials require a region from the standard AWS SDK chain. - Cross-reference in AI Gateway docs clarifying that `CODER_AIBRIDGE_BEDROCK_*` flags are a separate configuration path from Agents. ## Not in scope - Reusing AI Gateway Bedrock flags as an implicit Agents fallback. - Per-provider AWS access key, secret, or region fields (would need a migration and audit-table review). - IMDS or network-backed credential probes in admin/listing request paths. ## Related Dogfood deployment integration: https://github.com/coder/dogfood/pull/324	2026-04-22 14:20:23 +02:00
Mathias Fredriksson	78d9a220cf	fix(coderd/x/chatd): detect disconnected agents in getWorkspaceConn (#24336 ) Add agent status check and dial timeout to getWorkspaceConn to prevent tool calls from hanging when a workspace agent disconnects. Status check: call isAgentUnreachable on every getWorkspaceConn call. On cache miss, check the freshly fetched agent row. On cache hit, re-fetch the agent row by PK for a fresh heartbeat timestamp. Disconnected and timed-out agents return a sentinel immediately; connecting agents proceed to dial. Dial timeout: wrap dialWithLazyValidation in a 30s context.WithTimeoutCause (matching 8 other server-side AgentConn callers). Parent context cancellation propagates unchanged so the chatloop can detect ErrInterrupted. Both sentinels tell the LLM the agent is unreachable and the workspace may need restarting from the dashboard. Closes CODAGT-149	2026-04-22 12:10:32 +00:00
Ethan	cc4e04afde	feat(site): display file attachments in chat UI (#24281 ) Renders the durable file attachments introduced in #24280 in the chat interface. Without this, attachments were stored and served correctly but the UI showed raw file parts with no previews or download UX. Every attachment gets a download affordance, split into three rendering tiers: - Images — thumbnail with a hover/focus overlay containing a download link. `onFocusCapture`/`onBlurCapture` with `contains(relatedTarget)` keeps the overlay open while tabbing between the image and its download link. - Text-like files (`text/`, `application/json`) — expandable preview button with loading + error-with-retry states and the same download overlay. Preview fetches throw a typed `FetchTextAttachmentError` with a `.status` field instead of a stringly-typed error. - Everything else* — compact `FileCard` with extension badge, filename, and download link. User-side and assistant-side rendering now share `AttachmentBlocks.tsx` (`AttachmentPreviewFrame`, `TextAttachmentButton`, `ImageAttachmentButton`, `FileCard`, plus `getAttachmentHref`/`getAttachmentName`) instead of two near-duplicate implementations. The text-attachment overlay anchors to the preview surface so the download button stays pinned even when a loading/error status line widens the row below. `ComputerRenderer` detects when a screenshot was stored as a durable attachment (`attachment_file_id`) and suppresses the stale base64 rendering — the screenshot appears as a proper file part instead. `ToolLabel` shows the attached filename for `attach_file` tool calls. Storybook coverage in `ConversationTimeline.stories.tsx` was expanded to cover every tier (single/multiple images, inline + file-id text, JSON, download-only files, fetch-failure retry, mixed attachments + file references) with play-function assertions. 
## Cleanup Per Mathias' post-merge suggestion on #24280, this PR also relocates `coderd/chatfiles` → `coderd/x/chatfiles` so the durable-attachment helpers live beside the rest of the `chatd` experimental surface. Closes CODAGT-91	2026-04-22 20:11:53 +10:00
Ethan	353e522614	fix: handle expired chat file attachments in replay and UI (#24518 ) Closes CODAGT-216 ## Problem `dbpurge` deletes `chat_files` rows after the deployment's configured retention window, but `chat_messages.content` can still contain `file_id` references to those files. On replay, that left the Anthropic provider with an empty file payload and a `400 image cannot be empty` error. In the UI, the same missing file showed up as a broken image. ## Fix - Backend: when replay hits a `file_id` whose bytes are gone, replace it with a short text placeholder instead of emitting an empty file part. We could also drop the missing attachment entirely, but that would silently remove context from the replay and make the conversation harder for the model to interpret. The placeholder keeps the request valid while still telling the model that a file used to be there and is no longer available. - Frontend: classify chat image failures instead of treating every broken image the same. - `404` file fetches render `Image expired`, with a tooltip explaining that chat attachments are deleted after the retention window set for the deployment. - Other remote failures render `Image failed to load`, with a tooltip that surfaces server/network detail when available. - Invalid inline image data still renders `Image failed to load` without a probe.	2026-04-22 14:10:51 +10:00
blinkagent[bot]	79a9f437d7	feat(coderd/x/chatd/chattool): add description tags to tool parameter structs (#24394)	2026-04-21 11:37:29 -07:00
Ethan	c1421b4ead	test(coderd/x/chatd): deflake stale control notification test (#24545 ) Previously, `TestProcessChat_IgnoresStaleControlNotification` could return as soon as `UpdateChatStatus` ran, even though `processChat` still re-read chat state and finished deferred cleanup afterward. That let gomock and quartz teardown race the tail of cleanup and intermittently fail the test. Wait for `processChat` itself to return before asserting the final status, while keeping the existing strict mock expectations intact. Closes https://github.com/coder/internal/issues/1479	2026-04-22 00:08:34 +10:00
Ethan	2295e9d5be	feat: surface upstream provider error details in chat callout (#24546 ) Anthropic HTTP 400 responses (e.g. "image exceeds 5 MB maximum") were collapsed in the chat UI to the generic headline "Anthropic returned an unexpected error (HTTP 400)." with no actionable detail — the upstream message survived to the processor log but was dropped before reaching the client. Add a new optional `Detail` field on `codersdk.ChatStreamError` that carries the upstream provider message alongside the existing normalized headline. The backend extracts `error.message` from `fantasy.ProviderError.ResponseBody` (the JSON envelope shared by Anthropic and OpenAI), falls back to the trimmed provider message when the body is absent or unparseable, and caps the result at 500 runes. The frontend threads `Detail` through `useChatStore`, `liveStatusModel`, and `ChatStatusCallout`, rendering it as a muted secondary line inside the existing `AlertDescription`. Before: <img width="1552" height="185" alt="image" src="https://github.com/user-attachments/assets/524b588e-3cee-4fad-bc15-6bf3aec0899d" /> After: <img width="814" height="173" alt="image" src="https://github.com/user-attachments/assets/eae82a89-3ac1-4a33-8d18-ef9f77263d89" /> ## Persistence `Detail` is not persisted — it disappears on refresh. Persisting it would require a DB change (today `chats.last_error` is a single nullable `TEXT` column), and the shape of persisted chat errors is worth a more deliberate rethink — e.g. promoting `last_error` to `JSONB` so we can also retain structured fields like `kind`, `statusCode`, `provider`, and `retryable` instead of only the normalized headline string. That's a bigger design discussion than this PR should carry. In the meantime, seeing the upstream error reason immediately on failure is already a large UX improvement over the status quo, and this PR gets us there without prejudicing the eventual persistence design. Tracking persistence in CODAGT-239. 
Closes CODAGT-235	2026-04-22 00:05:27 +10:00
Michael Suchacz	f073323c89	refactor: unify subagent spawn behind spawn_subagent (#24535 ) Unify the three subagent spawn tools (`spawn_agent`, `spawn_explore_agent`, `spawn_computer_use_agent`) behind a single `spawn_subagent` tool keyed by a `subagent_type` discriminant (`general`, `explore`, `computer_use`). Mirrors the single-entry-point pattern already used by `task` in mux while keeping `wait_agent`, `message_agent`, and `close_agent` as separate lifecycle tools. A new backend subagent definition catalog (`coderd/x/chatd/subagent_catalog.go`) is the source of truth for tool description, prompt guidance, availability rules (plan mode, desktop/Anthropic gating), and child-chat option building. `spawn_subagent` advertises only the types available in the current context and validates `subagent_type` server-side; context inheritance still flows through the existing `createChildSubagentChatWithOptions` path. `wait_agent`, `message_agent`, and `close_agent` responses now include a server-derived `subagent_type` so the UI stops inferring lifecycle state from tool names. The frontend gets a shared normalization helper (`site/src/pages/AgentsPage/components/ChatElements/tools/subagentDescriptor.ts`) that maps either legacy tool names or new `spawn_subagent` args into a common descriptor (action, variant, icon, fallback copy). Legacy transcripts still render identically; `Tool.tsx`, `SubagentTool.tsx`, `ToolLabel.tsx`, `ToolIcon.tsx`, and `messageParsing.ts` now key off the descriptor instead of hard-coded names. Existing UI copy is preserved (`Spawning Explore agent...`, `Using the computer...`, computer-use monitor icon and Open Desktop affordance). > This PR was opened by Mux working on Mike's behalf.	2026-04-21 14:01:32 +02:00
Michael Suchacz	9d0469fc4c	feat: allow approved external MCP tools in root plan mode (#24509 ) ## Summary Allow root plan-mode chats to use MCP tools from external servers that an admin has explicitly approved for plan mode. Workspace MCP and plan-mode subagents remain blocked. ## Problem `chatd.go` excluded every MCP tool when `isPlanModeTurn` was true, so planning had no access to tools like docs search, ticketing, etc. Lifting that guard wholesale was unsafe: `mcp_server_configs` already has centralized admin governance, but workspace-local MCP (discovered from agent `.mcp.json`) does not, and subagents use a narrower trust boundary. ## Fix Add an admin-controlled per-server `allow_in_plan_mode` flag (default `false`) and gate plan-mode MCP access on it. ### Backend / schema - New migration `000472_mcp_server_allow_in_plan_mode.{up,down}.sql` and matching fixture update. - `mcpserverconfigs.sql` + generated code: persist and read the new column. - `codersdk/mcp.go`: thread the field through `MCPServerConfig`, `Create`, and `Update` request types. - `coderd/mcp.go`: validate, persist, and return the flag in get/list/create/update handlers. ### chatd - `coderd/x/chatd/chatd.go`: pre-filter selected external MCP configs by `AllowInPlanMode` before calling `mcpclient.ConnectAll` on plan-mode root turns. Workspace MCP discovery is skipped entirely on plan-mode turns. - Single helper decides whether a tool is available in plan mode, used both at construction and for active-tool filtering (defense in depth). Plan-mode subagents, dynamic tools, provider-native tools, computer-use, and workspace MCP stay unchanged. - `coderd/x/chatd/prompt.go`: update the root plan-mode overlay text to match the new boundary. ### UI - `MCPServerAdminPanel.tsx`: add an explicit toggle ("Allow all tools from this MCP server in root plan mode") next to the existing governance controls. - Regenerated `site/src/api/typesGenerated.ts`. 
### Docs - `docs/ai-coder/agents/architecture.md`: replace the blanket "MCP is unavailable in plan mode" note with the new root-only, external-only, admin-approved policy. Explicitly call out that workspace MCP and plan-mode subagents are still excluded. ### Tests - Plan-mode visibility (approved vs non-approved external server). - Plan-mode invocation of an approved external MCP tool. - End-to-end plan-mode workflow that uses an approved MCP tool and then reaches `propose_plan`. - Regressions: workspace MCP still excluded in plan mode; plan-mode subagents still on the restricted tool boundary; existing tool allow/deny list filtering still applies. ## Policy precedence `allow_in_plan_mode` is an additional requirement on top of existing `enabled`, availability, chat-selected / forced server IDs, and tool allow/deny lists. It approves all tools on that server for root plan mode; a per-tool plan allowlist is deliberately deferred. ## Follow-ups (explicitly out of scope) - Whether plan-mode subagents should inherit approved external MCP tools. - Workspace-local MCP safety model (agent-side `.mcp.json` schema vs. a coderd-managed workspace MCP config). ## Validation - `go vet ./coderd/x/chatd/...` - `go test ./coderd/x/chatd -run 'TestPlan.\|TestMCP.' -count=1` - `go test ./coderd/x/chatd -count=1 -timeout 5m` (full chatd suite) - `make fmt` (no diff) > Mux opened this PR on Mike's behalf.	2026-04-21 12:26:12 +02:00
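The policy precedence described in the commit above, where `allow_in_plan_mode` is an extra requirement on top of `enabled`, can be sketched as a pre-filter. The struct and function below are hypothetical simplifications; availability checks, chat-selected server IDs, and tool allow/deny lists are left out.

```go
package main

// mcpServerConfig is a hypothetical, trimmed-down view of an external
// MCP server config for this sketch.
type mcpServerConfig struct {
	ID              string
	Enabled         bool
	AllowInPlanMode bool
}

// filterExternalMCPConfigs keeps the configs a root turn may connect
// to. Disabled servers are always dropped; on plan-mode turns a server
// must additionally be approved for plan mode.
func filterExternalMCPConfigs(configs []mcpServerConfig, planMode bool) []mcpServerConfig {
	out := make([]mcpServerConfig, 0, len(configs))
	for _, cfg := range configs {
		if !cfg.Enabled {
			continue
		}
		if planMode && !cfg.AllowInPlanMode {
			continue
		}
		out = append(out, cfg)
	}
	return out
}
```

Filtering before any connection is attempted means an unapproved server is never dialed on a plan-mode turn, rather than being connected and hidden afterwards.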
Cian Johnston	5f3effd839	fix(coderd/x/chatd): add chattest.OpenAI() default fake server (#24540 ) - Add `chattest.OpenAI(t)` convenience wrapper around `NewOpenAI` with sensible defaults (JSON title response for non-streaming, text chunk for streaming) - Update `seedChatDependencies` to use it instead of an empty base URL, preventing title generation from hitting real `api.openai.com` with a fake key: ``` t.go:111: 2026-04-20 19:23:31.885 [debu] coderd.chatd.processor: title model candidate failed chat_id=edb43454-f23d-4163-9974-d101b8091de6 chat_id=edb43454-f23d-4163-9974-d101b8091de6 ... error= generate structured title: github.com/coder/coder/v2/coderd/x/chatd.generateStructuredTitleWithUsage /home/coder/src/coder/coder/coderd/x/chatd/quickgen.go:443 - unauthorized: Incorrect API key provided: test-api-key. You can find your API key at https://platform.openai.com/account/api-keys. ``` > 🤖	2026-04-21 10:26:20 +01:00
Ethan	1203f625b7	feat(coderd): accept parameters in start_workspace tool (#24434 ) When the chat `start_workspace` tool triggers an active-version upgrade that introduces new required parameters, the build fails with a parameter validation error. Previously this returned a message telling the user to update from the UI — a dead end for the model. This PR lets the model recover inside the chat by: 1. Accepting an optional `parameters` map on `start_workspace` (same schema as `create_workspace`), forwarded as `RichParameterValues`. 2. Returning structured JSON error responses that preserve validation details and the workspace's `template_id`, so the model can call `read_template` to discover what changed. 3. Replacing the UI-only guidance in `exp_chats.go` with model-actionable retry instructions. The expected model flow on an active-version parameter failure is now: ``` start_workspace → fails (structured error with template_id + validations) read_template → discovers new required parameters start_workspace → retries with parameters map → workspace starts ``` <img width="846" height="511" alt="image" src="https://github.com/user-attachments/assets/d18b6864-5970-4225-8da0-0f2ab134ccb4" />	2026-04-21 11:36:20 +10:00
Jaayden Halko	410f9a5e19	feat: allow renaming of agent chat title (#24489 ) Co-authored-by: Coder Agents <noreply@coder.com>	2026-04-20 14:00:46 +01:00
Thomas Kosiewski	df7e838c21	feat(coderd): wire debug logging into chat lifecycle (#23917 )	2026-04-20 12:27:16 +02:00
Mathias Fredriksson	fc2493780f	fix: exclude subagent chats from sidebar pagination (#24404 ) GetChats now returns only root chats (parent_chat_id IS NULL). A new GetChildChatsByParentIDs query fetches children for visible roots and embeds them in each parent's Children field. The singular getChat endpoint does the same. Archive invariant is one-way: parent archived implies child archived. Parent archive/unarchive cascades via root_chat_id. Individual child archive is permitted; child unarchive while the parent is archived is rejected atomically (row lock on child, re-read parent inside the transaction). Embedded children are filtered by the caller's archive state so individually-archived children stay hidden from active-parent views. Gitsync MarkStale uses GetChatsByWorkspaceIDs directly; MarkStaleParams.OwnerID removed (dead after the switch). Frontend: buildChatTree reads from the embedded children field, WebSocket handlers route child events into the parent's children array, and archiving a child strips it from the parent cache.	2026-04-20 13:19:59 +03:00
Cian Johnston	df429b7f60	fix: classify HTTP/2 transport failures as retryable timeouts (#24502 ) Modifies chatloop error classification behaviour to treat the following as retryable: * HTTP/2 `force closed` * GOAWAY * use of closed network connection Also modifies the user-facing retry banner to show "<provider> is temporarily unavailable." Relates to CODAGT-212. > 🤖	2026-04-20 11:09:47 +01:00
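One way to picture the classification above is a substring check over the error text. This is only a sketch under assumptions: the marker strings are taken from the commit message, while `isRetryableTransportMessage`, `isRetryableTransportError`, and the matching strategy are hypothetical and may differ from what chatloop actually does.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// retryableMarkers lists the transport failures named in the commit message,
// lowercased for case-insensitive matching.
var retryableMarkers = []string{
	"force closed",
	"goaway",
	"use of closed network connection",
}

// isRetryableTransportMessage reports whether an error message looks like one
// of the transient transport failures above.
func isRetryableTransportMessage(msg string) bool {
	msg = strings.ToLower(msg)
	for _, m := range retryableMarkers {
		if strings.Contains(msg, m) {
			return true
		}
	}
	return false
}

// isRetryableTransportError is a convenience wrapper for error values.
func isRetryableTransportError(err error) bool {
	return err != nil && isRetryableTransportMessage(err.Error())
}

func main() {
	err := errors.New("http2: server sent GOAWAY and closed the connection")
	fmt.Println(isRetryableTransportError(err))
	fmt.Println(isRetryableTransportError(errors.New("401 unauthorized")))
}
```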
Ethan	ef6969dd70	feat(coderd/x/chatd): agent-created file attachments in chat (#24280 ) Agents can already see workspace files and take screenshots, but users could not download those artifacts from chat. This PR adds durable chat attachments to chatd. `attach_file`, explicit `computer` screenshot actions (not the automatic post-action screenshots), and `propose_plan` now fetch bytes over the agent connection, store them in `chat_files`, link them to the chat, and carry attachment metadata in tool responses so `buildAssistantPartsForPersist` can materialize ordinary `type:"file"` assistant parts that the chat file APIs serve. The same storage helpers are reused for other artifact-producing paths. `wait_agent` recordings and thumbnails are stored as chat files and linked back to the parent chat, with best-effort relinking so parent chats retain those artifacts without leaving orphaned rows when chat-file caps reject links. `storeChatAttachment` wraps insert + link in one transaction, files are capped at 10 MB each and 20 per chat, and serving defaults to `Content-Disposition: attachment` with an explicit inline-safe allowlist. This PR also consolidates chat-file media policy in `coderd/chatfiles`. Uploads and tool-generated attachments share byte-based MIME detection, SVG blocking, inline-safety rules, and compatible `text/plain` refinement for JSON, CSV, and Markdown. Prompt construction still only inlines synthetic pasted text for model consumption; assistant-created attachments are persisted for the user and intentionally not replayed into later LLM turns. UI follow-up lives in #24281. Relates to CODAGT-91	2026-04-20 18:04:35 +10:00
Mathias Fredriksson	6b0bb02e5d	fix: server-side diffs and stricter fuzzy splicing for edit_files (#24454 ) Fixes three classes of edit_files bugs and adds structured per-file diff output for tool callers: - New IncludeDiff flag on FileEditRequest; when set, the agent returns FileEditResponse.Files[]{Path, Diff} with unified diffs computed via go-udiff v0.4.1 Lines + ToUnified (not Unified, which calls log.Fatalf on internal error). - Fuzzy match comparators split each line into leading whitespace, body, trailing whitespace, and ending. The splice substitutes at each position: on agreement between search and replace the file's bytes win; on disagreement the replacement's bytes are spliced verbatim. Carve-outs for empty-body lines, multi-line EOF splices, and level-aware indent translation for inserted lines. - Indent-unit detection (GCD for spaces, tab-priority) lets a 4sp LLM search insert correctly into tab or 2sp files. Falls back to the previous cLead-inheritance path when units can't be detected cleanly. - Empty search is rejected with "search string must not be empty". - Duplicate file paths in one request are rejected; symlink aliases resolved via api.resolvePath before the dedup check. - Frontend EditFilesRenderer consumes the structured files array by explicit path (no label munging) with per-file synthetic fallback for older agents or mismatched paths. On error, no diff is rendered so the synthetic fallback doesn't misrepresent a rejected edit as applied. Breaking change: AgentConn.EditFiles changes from (ctx, req) error to (ctx, req) (FileEditResponse, error) in codersdk/workspacesdk. Source-breaking for external Go consumers; no compat shim per plan owner. Out of scope (tracked in CODAGT-214): level-aware indent for middle-substituted splice lines. Locked in TestEditFiles_FuzzyIndent_InsertionLevelAware's Lock_* cases plus TestEditFiles_ReplaceAll_FuzzyIndentGap.	2026-04-18 16:39:34 +03:00
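The indent-unit detection mentioned above (GCD for spaces, tab priority) might look roughly like the sketch below. It is an illustration only: `detectIndentUnit` is a hypothetical name, and reading "tab-priority" as "any tab-indented line makes the unit a tab" is an assumption, not something the commit message spells out.

```go
package main

import (
	"fmt"
	"strings"
)

// gcd returns the greatest common divisor of two non-negative integers.
func gcd(a, b int) int {
	for b != 0 {
		a, b = b, a%b
	}
	return a
}

// detectIndentUnit guesses a file's indentation unit. A tab-indented line wins
// immediately (assumed meaning of "tab-priority"); otherwise the unit is the
// GCD of all leading-space widths. It returns "" when nothing is indented,
// which a caller could treat as "fall back to the previous behaviour".
func detectIndentUnit(lines []string) string {
	g := 0
	for _, line := range lines {
		if strings.TrimSpace(line) == "" {
			continue // blank lines carry no indentation signal
		}
		if strings.HasPrefix(line, "\t") {
			return "\t"
		}
		n := len(line) - len(strings.TrimLeft(line, " "))
		if n > 0 {
			g = gcd(g, n)
		}
	}
	return strings.Repeat(" ", g)
}

func main() {
	twoSpace := []string{"a:", "  b:", "      c: 1"}
	fmt.Printf("%q\n", detectIndentUnit(twoSpace))
	fmt.Printf("%q\n", detectIndentUnit([]string{"func f() {", "\treturn", "}"}))
}
```

With a detected unit, a search block written with 4-space indentation can be translated level by level into the file's own unit before splicing.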
Cian Johnston	3f6b40a833	fix: reap idle chatd stream states on a timer (#24476 ) * Adds `streamJanitorLoop` to clean up stale streams every 30s * Zeroes dropped slots to aid GC eligibility * Adds regression tests in coderd/x/chatd and enterprise/coderd/x/chatd > 🤖	2026-04-17 19:22:00 +01:00
Cian Johnston	4b585465b8	feat: label chatd metrics by model, add stream-state diagnostics (#24475 ) Adds production-observability metrics to coderd/x/chatd/ for model-level correlation and a chatStreams memory-leak investigation. - Label per-request chatd metrics (steps_total, message_count, prompt_size_bytes, tool_result_size_bytes, ttft_seconds, compaction_total) with `model` and enrich the per-turn logger with provider/model. - Add `coderd_chatd_stream_retries_total{provider, model, kind}` counter incremented in chatloop before OnRetry. - Register a prometheus.Collector exposing `streams_active`, `stream_buffer_size_max`, `stream_buffer_events`, `stream_subscribers` from p.chatStreams. - Add `coderd_chatd_stream_buffer_dropped_total` counter, incremented per publishToStream drop independently of the existing log-rate-limited bufferDropCount. - Snapshot logger/model before the title-generation goroutine to avoid a data race with the logger/model rebind below it. > 🤖	2026-04-17 16:16:30 +01:00
Thomas Kosiewski	91f9de27a1	feat(coderd): add chat debug service and summary aggregation (#23916 )	2026-04-17 16:27:53 +02:00
Hugo Dutka	db8191277b	fix: associate computer use recordings with chats (#24471 ) Fixes [CODAGT-195](https://linear.app/codercom/issue/CODAGT-195/agent-uploaded-recordings-are-missing-chat-file-links-entries).	2026-04-17 13:47:59 +02:00
Michael Suchacz	73b5058923	feat: add Explore mode as subagent-only modality (#24448 ) > This PR was authored by Mux on behalf of Mike. Introduce Explore mode, a read-only subagent modality for delegated discovery and code investigation. ## What Adds a `spawn_explore_agent` tool that creates child chats restricted to read-only operations. An admin can optionally configure a deployment-wide model override so Explore subagents use a model optimized for large context or reasoning without changing the root chat's model. ### Backend - New `ChatModeExplore` enum value (migration 000471). - `spawn_explore_agent` tool definition with read-only allowlist: `read_file`, `execute`, `process_output`, `read_skill`, `read_skill_file`. Write tools, file editors, and nested subagent spawning are blocked. - Deployment config storage for the Explore model override (`agents_chat_explore_model_override` in `site_configs`). - Model resolution hierarchy: configured override, then current turn model, then global default. Silent fallback with warning log when the override becomes unavailable. - RBAC: `AsChatd` for daemon reads, `ActionRead` and `ActionUpdate` on `ResourceDeploymentConfig` for admin API calls. - Plan mode root chats can use `spawn_explore_agent` for read-only research, matching the planning prompt guidance. - The Explore override config API now reports malformed saved overrides as "treated as unset" so admins can clear them explicitly. ### Frontend - `ExploreModelOverrideSettings` component in admin agent behavior settings. Uses `ModelSelector`, handles unavailable model warnings, and supports explicit Save and Clear actions. - Malformed saved overrides show a warning and require an explicit Save to clear, instead of Clear auto-submitting behind the scenes. ### Tests - Integration: `TestExploreSubagentIsReadOnly` (full spawn flow, tool verification, prompt overlay, DB state). - Unit: tool allowlist tests for explore, plan, and default modes. 
- Internal: model override resolution with valid, invalid UUID, disabled, and unconfigured override scenarios. - RBAC: `dbauthz_test.go` for `GetChatExploreModelOverride` and `UpsertChatExploreModelOverride`. - API: admin set and clear, malformed stored override reporting, disabled model rejection, non-admin denial.	2026-04-17 13:40:17 +02:00
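The Explore model resolution hierarchy above (configured override, then current turn model, then global default) can be sketched as a short fallback chain. The `Model` type and `resolveExploreModel` below are hypothetical; the real lookup works against stored chat model configs rather than in-memory values.

```go
package main

import "fmt"

// Model is a hypothetical, simplified chat model config.
type Model struct {
	ID      string
	Enabled bool
}

// resolveExploreModel picks the model for an Explore subagent. The second
// return value is true when an override was configured but unavailable, so
// the caller can emit the warning log while still falling back silently.
func resolveExploreModel(override, turn *Model, global Model) (Model, bool) {
	overrideSkipped := false
	if override != nil {
		if override.Enabled {
			return *override, false
		}
		overrideSkipped = true
	}
	if turn != nil {
		return *turn, overrideSkipped
	}
	return global, overrideSkipped
}

func main() {
	global := Model{ID: "default", Enabled: true}
	turn := &Model{ID: "turn-model", Enabled: true}
	disabled := &Model{ID: "big-context", Enabled: false}

	m, skipped := resolveExploreModel(disabled, turn, global)
	fmt.Println(m.ID, skipped)
}
```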
Danielle Maywood	15d8e4ff9f	feat: accept xhigh effort for Anthropic (#24439 )	2026-04-16 17:25:34 +01:00
Michael Suchacz	1092093e98	feat: add internal subagent model override wiring (#24399 ) > Mux working on behalf of Mike. ## Summary - add an enabled chat model config lookup by ID for internal callers - keep `spawn_agent` unchanged while threading an internal model override through child subagent chat creation - extend chatd coverage for inherited bindings, plan mode, and internal override behavior ## Validation - `go test ./coderd/x/chatd ./coderd/database/dbauthz` - `make lint`	2026-04-16 17:08:02 +02:00
Ethan	eae9444dbe	fix: add missing ClientType to InsertChat test params (#24436 ) Two `InsertChatParams` blocks in `startworkspace_test.go` were missing the `ClientType` field. Since the `chat_client_type` enum column is `NOT NULL`, Postgres rejects the Go zero value (`""`), causing `TestStartWorkspace` subtests `StoppedWorkspaceReportsAutoUpdate` and `ManualUpdateRequired` to fail with: ``` pq: invalid input value for enum chat_client_type: "" ``` Closes https://github.com/coder/internal/issues/1471	2026-04-16 15:04:40 +00:00
Ethan	91b35a25ee	fix(coderd): auto-update workspace to active template version on chat start (#24424 ) ## Problem When a template has `require_active_version` enabled and the chat agent tries to start a workspace that is stopped on an older template version, the agent gets stuck in an infinite loop: `start_workspace` fails with a 403 (the old version is not the active version and the user lacks `ActionUpdate` on the template), then `create_workspace` sees the existing stopped workspace and tells the agent to use `start_workspace`, repeat forever. The root cause is that `chatStartWorkspace()` passes the start build request through without setting `TemplateVersionID`, so `wsbuilder` defaults to the previous build's template version — which RBAC rejects when `RequireActiveVersion` is true. ## Fix In `chatStartWorkspace()` (`coderd/exp_chats.go`), when the template's access control has `RequireActiveVersion` enabled, explicitly set `req.TemplateVersionID` to `template.ActiveVersionID` before calling `postWorkspaceBuildsInternal()`. This mirrors how the autobuild executor handles the same scenario (`coderd/autobuild/lifecycle_executor.go`). If the new active version introduces required parameters that cannot be resolved automatically (no defaults, no previous values), the build fails at parameter validation before a provisioner job is created. In that case, a clear error message tells the user to update and start the workspace from the UI instead of surfacing a raw internal error. On successful auto-update, the tool response includes `updated_to_active_version`, `update_reason`, and a human-readable `message` so the model can explain to the user what happened. 
<img width="782" height="122" alt="image" src="https://github.com/user-attachments/assets/289430d6-066e-41cf-bc97-cd013dcf717d" /> ### Changes - `coderd/exp_chats.go`: `chatStartWorkspace()` loads the template, checks `RequireActiveVersion` via `AccessControlStore`, and pins the build to the active version when required. New `isChatStartWorkspaceManualUpdateRequiredError()` classifies parameter validation failures from both the dynamic parameters path (`DiagnosticError`) and the classic path (`ErrParameterValidation` sentinel). - `coderd/wsbuilder/wsbuilder.go`: New `ErrParameterValidation` sentinel error, wrapped into the classic parameter validation `BuildError` so callers can use `errors.Is` instead of string matching. - `coderd/x/chatd/chattool/startworkspace.go`: `waitForAgentAndRespond` now returns `map[string]any` instead of `fantasy.ToolResponse`, letting the caller annotate the result (e.g. auto-update metadata) before converting. Error handling for `StartFn` checks for `httperror.Responder` errors to surface clean messages for the manual-update case. - `coderd/x/chatd/chattool/startworkspace_test.go`: Two new tests — `StoppedWorkspaceReportsAutoUpdate` (verifies auto-update fields in response) and `ManualUpdateRequired` (verifies clean error message without internal wrapping). ### Follow-up The manual-update error message could include a direct link to the workspace settings page, but the chattool layer does not currently have access to the deployment's access URL. Plumbing it through is straightforward but out of scope for this fix. Closes CODAGT-192	2026-04-17 00:16:37 +10:00
Dean Sheather	3452ab3166	chore: add client_type field to chats and telemetry (#24342 ) Add a `chat_client_type` enum (`ui` \| `api`) and `client_type` column to the `chats` table. The column defaults to `api` for new rows so API callers don't need to set it explicitly. Existing rows are backfilled to `ui`. The field flows through `CreateChatRequest`, `chatd.CreateOptions`, `InsertChat`, and is returned in the `Chat` response via `db2sdk`. <details> <summary>Implementation notes (Coder Agents generated)</summary> ### Changes Database migration (000469) - New enum `chat_client_type` with values `ui`, `api`. - New `client_type` column, `NOT NULL DEFAULT 'api'`. - Backfill: `UPDATE chats SET client_type = 'ui'`. SQL query — `InsertChat` now includes `client_type`. SDK — `ChatClientType` type added; `ClientType` field added to both `CreateChatRequest` (optional, defaults server-side to `api`) and `Chat` response. Handler — `postChats` maps the request field (defaulting to `api`) and passes it through `chatd.CreateOptions`. Sub-agent — Child chats inherit their parent's `client_type`. db2sdk — Maps the database value to the SDK type. ### Decision log - Default is `api` (not `ui`) so existing API integrations get the correct value without code changes. - Backfill sets existing rows to `ui` per requirement. - Child chats inherit `client_type` from parent rather than defaulting. </details>	2026-04-16 23:57:05 +10:00
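The defaulting and inheritance rules in the decision log above can be summarized in a few lines. This is a sketch only: `ChatClientType` is named in the commit message, but the constant names and `resolveClientType` are assumptions made for illustration.

```go
package main

import "fmt"

// ChatClientType mirrors the chat_client_type enum (ui | api).
type ChatClientType string

const (
	ChatClientTypeUI  ChatClientType = "ui"  // hypothetical constant name
	ChatClientTypeAPI ChatClientType = "api" // hypothetical constant name
)

// resolveClientType applies the rules from the commit message: child chats
// inherit the parent's value, and an unset request field defaults to "api".
func resolveClientType(requested ChatClientType, parent *ChatClientType) ChatClientType {
	if parent != nil {
		return *parent
	}
	if requested == "" {
		return ChatClientTypeAPI
	}
	return requested
}

func main() {
	ui := ChatClientTypeUI
	fmt.Println(resolveClientType("", nil))                // root chat, field omitted
	fmt.Println(resolveClientType(ChatClientTypeUI, nil))  // root chat from the UI
	fmt.Println(resolveClientType("", &ui))                // child of a UI chat
}
```

Defaulting in one place also avoids inserting the Go zero value (`""`) into a `NOT NULL` enum column.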
Michael Suchacz	1cf0354f72	feat: add plan mode with restricted tool boundary (#24236 ) > This PR was authored by Mux on behalf of Mike. ## Summary - add persistent plan mode for chats and the chat-specific plan file flow - add structured planning tools such as `ask_user_question` and `propose_plan` - keep `write_file` and `edit_files` constrained to the chat-specific plan file during plan turns - allow shell exploration in plan mode, including subagents, via `execute` and `process_output` - block implementation-oriented, provider-native, MCP, dynamic, and computer-use tools during plan turns - update the chat UI, tests, and docs for the new planning flow	2026-04-16 11:12:01 +02:00
blinkagent[bot]	e996f6d44b	chore: increase coderd_chatd_message_count histogram max bucket to 1024 (#24409 ) The `coderd_chatd_message_count` histogram's current max bucket of 128 is being hit in production. This increases the exponential bucket count from 8 to 11, extending coverage from `1..128` to `1..1024`. Before: `1, 2, 4, 8, 16, 32, 64, 128` After: `1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024` Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>	2026-04-16 09:43:54 +01:00
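The before and after bucket lists above follow from an exponential series starting at 1 with factor 2. The sketch below reproduces that arithmetic with a local helper; `exponentialBuckets` is written here for illustration and is assumed to behave like `prometheus.ExponentialBuckets(start, factor, count)` from `client_golang`, which the commit message does not explicitly name.

```go
package main

import "fmt"

// exponentialBuckets returns count upper bounds, starting at start and
// multiplying by factor for each subsequent bucket.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, 0, count)
	for i := 0; i < count; i++ {
		buckets = append(buckets, start)
		start *= factor
	}
	return buckets
}

func main() {
	fmt.Println(exponentialBuckets(1, 2, 8))  // [1 2 4 8 16 32 64 128]
	fmt.Println(exponentialBuckets(1, 2, 11)) // [1 2 4 8 16 32 64 128 256 512 1024]
}
```

Raising the count from 8 to 11 adds three buckets, so the largest finite bound moves from 128 to 1024.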
Kyle Carberry	9c74c8c674	fix: move OnChatUpdated call after agent is ready in create/start workspace (#24410 )	2026-04-15 19:18:54 -04:00
Kyle Carberry	d11849d94a	fix: re-fetch context files and skills from workspace on each turn (#24360 ) Context files (AGENTS.md) and skills were only fetched from the workspace on the first turn or when the agent changed. On subsequent turns, stale content from persisted messages was used. This meant that if AGENTS.md or skills were modified on the workspace between turns, the agent wouldn't see the changes until the user created a new chat. ## Changes - Extract `fetchWorkspaceContext` from `persistInstructionFiles` to allow fetching workspace context without persisting - On subsequent turns, re-fetch fresh context from the workspace instead of reading stale persisted content; falls back to persisted messages if the workspace dial fails - Update `ReloadMessages` callback to re-derive instruction and skills from reloaded database messages after compaction, instead of using captured closure variables - Add `formatSystemInstructionsFromParts` helper to build system instructions directly from agent parts without requiring separate OS/directory params - Add tests for the new helper <details><summary>Implementation Notes</summary> ### Root cause In `runChat`, the `else if hasContextFiles` branch (subsequent turns) called `instructionFromContextFiles(messages)` which read stale content from persisted DB messages. The `ReloadMessages` callback (post-compaction) also used captured `instruction`/`skills` closure variables from the start of the turn, never re-deriving them. ### Approach 1. Extract `fetchWorkspaceContext` — Pure refactor of the fetch-only part of `persistInstructionFiles` (agent connection, context config retrieval, content sanitization, metadata stamping). Returns parts + skills without persisting. 2. Subsequent turns: Instead of reading from persisted messages, launch a `g2` goroutine that calls `fetchWorkspaceContext` to get fresh context from the workspace. Falls back gracefully to persisted messages if the workspace is unreachable. 3. 
ReloadMessages: Re-derive `instruction` from `instructionFromContextFiles(reloadedMsgs)` and `skills` from `skillsFromParts(reloadedMsgs)` using the freshly loaded messages, with fallback to captured values if the reloaded messages don't contain context (e.g. compacted away). </details> > 🤖 Generated by Coder Agents	2026-04-15 16:41:15 -04:00
Cian Johnston	d7439a9de0	feat: add Prometheus metrics for chatd subsystem (#24371 ) Adds 7 Prometheus metrics to the chatd subsystem and introduces typed `ActivityBumpReason` for deadline bump attribution. \| Metric \| Type \| Labels \| \|--------\|------\|--------\| \| `coderd_chatd_chats` \| Gauge \| `state` (streaming, waiting) \| \| `coderd_chatd_message_count` \| Histogram \| `provider` \| \| `coderd_chatd_prompt_size_bytes` \| Histogram \| `provider` \| \| `coderd_chatd_tool_result_size_bytes` \| Histogram \| `provider`, `tool_name` \| \| `coderd_chatd_ttft_seconds` \| Histogram \| `provider` \| \| `coderd_chatd_compaction_total` \| Counter \| `provider`, `result` \| \| `coderd_chatd_steps_total` \| Counter \| `provider` \| > 🤖	2026-04-15 19:53:10 +01:00
Ethan	e7883d4573	fix(coderd/x/chatd): hoist system prompt fetch out of chat creation transactions (#24369 ) ## Problem `resolveDeploymentSystemPrompt` was called inside `InTx` closures in both `CreateChat` (`coderd/x/chatd/chatd.go`) and `createChildSubagentChatWithOptions` (`coderd/x/chatd/subagent.go`). That method uses `p.db` (the root store) internally to call `GetChatSystemPromptConfig`, which requires a second DB pool checkout while the transaction already holds one connection. Under concurrent chat creation load (e.g., the chat scaletest at 4800 chats), this causes pool starvation: every in-flight create holds one connection and blocks waiting for another, leading to `idle in transaction` pileups and cascading timeouts across the entire coderd DB pool — including unrelated background work like prebuild metrics and the chat acquire loop. ## Fix Move the `resolveDeploymentSystemPrompt` call before `p.db.InTx(...)` in both call sites. The system prompt config is a read-only deployment-level setting that does not need transactional consistency with the chat insert, so fetching it before the transaction is both safe and preferable (it also shortens transaction lifetime). ## Backporting The `CreateChat` instance of this bug is also present on `release/2.32` (`coderd/x/chatd/chatd.go` line 907). The `subagent.go` instance is not — the child-subagent-chat creation path with its own `InTx` was added after the branch cut. This should be backported, but because this is only in the chat creation path, and that's not typically hit with a great deal of concurrency in the real world, I don't think an urgent patch for 2.32 is necessary. 
## Lint gap The existing `InTx` ruleguard rule in `scripts/rules.go` catches direct outer-store usage (`p.db.GetFoo()`) and passing the outer store as a function argument inside `InTx` closures, but it explicitly cannot catch indirect access through receiver methods like `p.resolveDeploymentSystemPrompt()` — the rule documents this blind spot at line 273. Catching this class of bug would require interprocedural analysis (following the callee's body to see if it touches `p.db`), which is beyond what ruleguard's AST pattern matching can express. We're considering a lightweight custom `go/analysis` analyzer (similar to `paralleltestctx`) that does 1-level same-package callee inspection to detect this pattern. In the meantime, this PR adds guidance to `AGENTS.md` so AI reviewers can flag the pattern during code review.	2026-04-16 00:13:15 +10:00
Thomas Kosiewski	4651ca5a9a	feat(coderd/x/chatd/chatdebug): add recorder, transport, and redaction (#23915 )	2026-04-15 15:14:51 +02:00
Cian Johnston	6194bd6f57	fix: address post-merge review findings for chat org scoping (#24297 ) Addresses review findings from #23827 that were added post-merge: - Persisted attachments now store `organizationId`; mismatched orgs pruned on restore - Workspace selection reconciliation: stale IDs from previous orgs dropped via derived `effectiveWorkspaceId` - Org picker uses `permittedOrganizations()` for RBAC-aware filtering - Org picker hidden when user belongs to only one org - Ref-sync `useEffect` replaced with `useEffectEvent` - `CreateWorkspace()` and `ListTemplates()` take `organizationID` and `db` as required function parameters instead of optional struct fields — compiler enforces them, removes scattered nil guards - Cross-org template check in `CreateWorkspace` is now unconditional - `ListTemplates` org-scoping filter now has test coverage - `setupChatInfra` comment fixed; test helpers use params structs instead of positional UUIDs - Enterprise test documents that org admin only sees own chats (handler hardcodes `OwnerID` — future work needs sidebar UI before lifting that restriction) > 🤖	2026-04-15 11:39:05 +01:00


139 Commits