Commit Graph

139 Commits

Author SHA1 Message Date
Michael Suchacz 99a83a2702 fix: clean Bedrock headers (#24718)
Bedrock chat provider requests can inherit Anthropic public API headers
from the process environment, which causes mixed Anthropic and Bedrock
auth headers on signed requests.

Update the Anthropic SDK fork so its Bedrock middleware strips
Anthropic-only headers before signing requests, and keep a chatprovider
regression test for the production request shape.

> Mux is acting on Mike's behalf.
2026-04-26 21:50:29 +02:00
Michael Suchacz 62e9752acd fix: prevent malformed OpenAI Responses continuations (#24725)
> Worked on by Mux on Mike's behalf.

## Summary

- Disable OpenAI Responses `previous_response_id` chain mode when the
prior assistant response has unresolved local tool calls, so the next
request can include paired tool outputs instead of sending an incomplete
continuation.
- Update the fantasy pin to a Responses replay fix that preserves stored
reasoning references, only replays web search references when paired
with reasoning, and validates local function-call output pairing before
send.
- Add fake OpenAI Responses input validation for the two production 400
shapes and integration coverage for full-history reasoning plus web
search replay.
- Add sanitized diagnostics for the OpenAI Responses continuity errors.

## Tests

- `go test ./providers/openai -run
'TestResponsesToPrompt_(ReasoningWithStore|ReasoningWithWebSearchCombined|WebSearchRequiresReasoningReference|ReasoningWithFunctionCallCombined|WebSearchProviderExecutedToolResults)|TestPrepareParams_(SkipsProviderExecutedToolReferences|ValidatesFunctionCallOutputPairing)|TestValidateResponsesInput_WebSearchReferenceRequiresReasoning'
-count=1`
- `go test ./providers/openai -count=1`
- `GOWORK=off go test ./coderd/x/chatd/chattest -run
TestValidateResponsesAPIInput -count=1`
- `GOWORK=off go test ./coderd/x/chatd -run
'TestOpenAIResponses(NoStaleWebSearchReplay|FullReplayPairsReasoningAndWebSearch|ChainModeSkipsWhenLocalCallPending|ChainModeStillFiresForProviderExecutedOnly)$|TestResolveChainMode_'
-count=1`
- `GOWORK=off go test ./coderd/x/chatd/chatprompt -run
'TestInjectMissingToolResults_' -count=1`
- `GOWORK=off go test ./coderd/x/chatd/chaterror -run
TestClassify_OpenAIResponsesAPIDiagnostics -count=1`
- `GOWORK=off go test ./coderd/x/chatd/... -count=1`
- `git diff --check`
- `git commit` pre-commit hook
2026-04-26 21:23:06 +02:00
Michael Suchacz ed33e28b13 fix(coderd/x/chatd): wake after auto-promoting queued message (#24714)
`tryAutoPromoteQueuedMessage` in `processChat`'s deferred cleanup could
set a chat back to `pending` without waking the processor. The processor
only noticed on the next 10ms poll, so under load tests like
`TestAutoPromoteQueuedMessageFallsBackForInvalidQueuedModelConfigID`
could time out waiting for the second streaming request (#1500).

Call `p.signalWake()` after the promoted-message publishes when
`promotedMessage != nil`, matching the pattern used by `CreateChat`,
`SendMessage`, `EditMessage`, `PromoteQueued`, and `InterruptChat`. Make
the regression helper `testAutoPromoteQueuedMessageFallback`
deterministic by setting `PendingChatAcquireInterval = time.Hour` and
synchronizing on a `secondRunStarted` channel instead of polling
`requestCount`, so the test fails without the wake instead of relying on
the 10ms ticker.

Closes https://github.com/coder/internal/issues/1500

> Mux is acting on Mike's behalf.
2026-04-26 11:08:32 +02:00
Michael Suchacz 0211448d09 fix(coderd): sanitize Anthropic provider tool history (#24706)
Anthropic can reject replayed chat histories when a provider-executed
tool call, such as `web_search`, is present without its matching
provider result block.

This sanitizes unpaired Anthropic provider-executed tool calls during
prompt reconstruction, before Anthropic requests, and before persistence
so existing poisoned histories can continue and new malformed turns are
not stored.

Resolves: CODAGT-259

> Mux is acting on Mike's behalf.
2026-04-24 23:57:30 +02:00
Michael Suchacz c7cac9debe fix: persist per-turn model on chats and queued messages (#24688)
Previously, `chats.last_model_config_id` was not updated when a user
sent a mid-chat message with a different model, and queued messages did
not store their own per-turn model, so promotion ran against whatever
the chat row said at promote time. Chat watch events also did not merge
`last_model_config_id` into the site's root, child, and per-chat
caches, so sidebar labels stayed stale after direct sends and queued
promotions.

- Add nullable `chat_queued_messages.model_config_id`, backfilled from
  `chats.last_model_config_id`. Queued inserts round-trip the effective
  model id at enqueue time.
- In `coderd/x/chatd`, direct sends update `chats.last_model_config_id`
  inside the same transaction that inserts the admitted user message.
  Manual promotion and auto-promotion use the queued row's stored
  `model_config_id`, with a fallback to `chats.last_model_config_id`
for legacy NULL rows during rollout.
`PromoteQueuedOptions.ModelConfigID`
  is now ignored.
- On the site, extract `mergeWatchedChatSummary` and
  `mergeWatchedChatIntoCaches` in `site/src/api/queries/chats.ts` so
  status-change watch events merge `last_model_config_id` into the
  root infinite chat list, the parent-embedded child entry, and the
  per-chat `chatKey(chatId)` cache. `updated_at` guards against stale
  watch payloads clobbering newer cached state, while diff status
  events still merge their PR metadata because they are timestamped
  outside the chat row. Watch timestamps are compared as instants so
  variable fractional precision does not make fresh events look stale.
- Queued promotion validates stored model config IDs before admission.
  Invalid legacy queued IDs fall back to the chat's current model config
  instead of dropping the queued message during auto-promotion.
- Backend and frontend regression coverage added for admission, queue
  promotion (including FIFO across mixed models, legacy NULL fallback,
  and invalid queued model IDs), and chat watch cache merging.

> Mux is acting on Mike's behalf.
2026-04-24 15:36:08 +02:00
Michael Suchacz 3d90546aae feat: add general subagent model override (#24610)
Adds a deployment-wide admin override for general delegated subagents.

## What changed
- store the general override in `site_configs` and expose it through the
shared `agent-model-override/{context}` API
- apply the general override when spawning delegated general subagents,
while preserving the existing Explore override behavior
- reuse a shared Agents settings form for the general and Explore
override sections

## Validation
- `make gen`
- `go test ./coderd -run 'TestChatModelOverrides'`
- `go test ./coderd/x/chatd -run
'TestSpawnAgent_(GeneralUsesConfiguredModelOverride|GeneralOverrideLogsAndFallsBackWhenCredentialsUnavailable|GeneralOverrideLogsAndFallsBackWhenProviderDisabled)'`
- `pnpm -C site lint:types`
- `pnpm -C site test:storybook --
AgentSettingsAgentsPageView.stories.tsx`
- `make lint`
- `make pre-commit`

> Mux is acting on Mike's behalf.
2026-04-24 12:37:20 +02:00
Cian Johnston a02339c66a fix(coderd/x/chatd): prevent invalid tool results from poisoning chat history (#24663)
- **computeruse.go**: Decode base64 screenshot data before storing in
`ToolResponse.Data` (was casting base64 string to bytes without
decoding)
- **chatloop.go**: Re-encode `ToolResponse.Data` to base64 via
`base64.StdEncoding.EncodeToString` instead of `string()` cast
- **mcpclient.go**: UTF-8 validate all text from MCP responses in
`convertCallResult()` using `strings.ToValidUTF8`
- **chatprompt.go (persist)**: Defense-in-depth UTF-8 sanitization of
text and media Text fields before database storage
- **chatprompt.go (replay)**: Antivenom layer that validates base64 and
UTF-8 at read time, auto-healing already-poisoned chats without
requiring a migration
- `TestToolResultAntivenom`: 4 subtests covering poisoned text, poisoned
media, valid media round-trip, and media with invalid UTF-8 text
-  Adds `TestConvertCallResult_UTF8Sanitization`: 4 subtests covering invalid
UTF-8 in TextContent, EmbeddedResource, valid passthrough, and
multi-part
- Adds `TestComputerUseTool_Run_ScreenshotDataIsDecodedBinary`: Verifies no
double-encode in the computer-use path
- Updated existing computer-use tests for the new decoded-binary
contract

> 🤖
2026-04-23 19:58:38 +01:00
Michael Suchacz dbcc654d28 feat: snapshot explore subagent tool entitlements (#24638)
Explore sub-agents previously could not use `web_search` or external MCP
tools. `runChat` hard-skipped both for Explore. Lifting those guards
naively would over-grant tools, because a child chat could outlive the
spawning turn's plan-mode filter.

This change persists the spawning parent turn's filtered external MCP
server IDs onto the child Explore chat, and simplifies the Explore
provider-tool filter in `runChat`:

- New `resolveExploreToolSnapshot` helper: computes the child's
inherited external MCP subset by running the parent's configs through
`filterExternalMCPConfigsForTurn` (plan-mode policy) and, if the parent
is itself an Explore child, further narrowing to the parent's own
persisted `MCPServerIDs`. The result is written to the child's
`MCPServerIDs` column at spawn time.
- The existing `mcp_server_ids` column is the sole durable snapshot. No
new chat column is added.
- `runChat` for Explore children: loads MCP tools from the persisted
snapshot, and keeps only `web_search` from provider-native tools (to
block computer-use and other write-style tools, since Explore is
read-only). Whether `web_search` is actually available is a per-model
decision, determined by the current model config, just like a main chat.
- Built-in Explore allowlist is unchanged. Workspace-local MCP remains
excluded for Explore.

Verification: `go build ./...`, `go test ./coderd/x/chatd/... -count=1`,
`make gen` (clean tree), `make lint/emdash`, `go vet`. Deep-review ran
12 reviewers on the feature and 5 on the clarity refactor; CAR reviewed
and approved; a subsequent scope reduction dropped a temporary
`allow_web_search` column in favor of per-model handling.

> Mux is acting on Mike's behalf.
2026-04-23 19:07:38 +02:00
Mathias Fredriksson f8fe5d680b fix(coderd): reject API operations on archived chats (#24633)
Archived chats accept mutations (messages, edits, queued-message
promotions, tool-result submissions) via the API, causing them to
re-enter the processing pipeline. This violates the hard-stop
design intent from PR #23758.

Add archived checks at three layers:

- HTTP handlers (postChatMessages, patchChatMessage,
  promoteChatQueuedMessage, postChatToolResults): return 400
  after auth so callers get a clear error.
- Daemon functions (SendMessage, EditMessage, PromoteQueued,
  SubmitToolResults): return ErrChatArchived after row lock,
  guarding against future callers that bypass the handler.
- AcquireChats SQL: filter out archived chats so they are never
  acquired for processing.

Fixes CODAGT-245
2026-04-23 19:03:33 +03:00
Cian Johnston 2e5c7d99c2 fix(coderd/x/chatd): fix flaky TestSpawnComputerUseAgentInheritsContext (#24666)
Fixes flaky `TestSpawnComputerUseAgentInheritsContext`.

- The test inserts an Anthropic provider directly into the DB after
`CreateChat` has already been called
- The server's background goroutine may have already cached the provider
list (OpenAI only) via `configCache.EnabledProviders()` with a 10s TTL
- The direct DB insert bypasses the pubsub event that production uses to
invalidate the cache
- `isAnthropicConfigured()` returns the stale cached result, making
`computer_use` appear unavailable
- Fix: call `server.configCache.InvalidateProviders()` after the insert,
mirroring what production does via pubsub

CI failure:
https://github.com/coder/coder/actions/runs/24829197096/job/72673070101?pr=24648

> 🤖
2026-04-23 13:18:18 +01:00
Mathias Fredriksson 1ace519c6e fix(coderd/x/chatd): remove cache-miss check blocking agent recovery (#24634)
The cache-miss isAgentUnreachable check added in #24336 runs before
dialWithLazyValidation, preventing the existing switch mechanism from
discovering the new agent after a workspace rebuild. The chat's stale
agent binding is never repaired, causing an infinite loop of
'agent is disconnected' errors.

Remove the cache-miss check. The cache-hit check remains (it verifies
the agent behind an established connection). The dial timeout and
dialWithLazyValidation already bound the cache-miss failure path.

Closes CODAGT-248
2026-04-22 21:49:10 +03:00
Cian Johnston 72e3ae9c5f feat: add chatd tool call error metrics and logging (#24559)
- Add `coderd_chatd_tool_errors_total` prometheus counter (labels:
provider, model, tool_name)
- Log tool call errors at warn level with correlation fields: chat_id,
owner_id, organization_id, workspace_id, agent_id, parent_chat_id,
trigger_message_id, tool_name, tool_call_id, provider, model
- Thread enriched logger from chatd.go into chatloop via
`RunOptions.Logger`
- Remove squashing of all MCP tool calls to the `mcp` bucket

> 🤖
2026-04-22 16:19:56 +00:00
Michael Suchacz 9b5d09ebdc test(coderd/x/chatd): seed anthropic provider for computer_use tests (#24611)
`TestSubagentLifecycleToolsIncludePersistedSubagentTypeAcrossVariants/ComputerUse`
and two adjacent positive tests passed a static Anthropic key into
`newInternalTestServer`, but `seedInternalChatDeps` only inserts an
OpenAI
provider. At runtime, `Server.resolveUserProviderAPIKeys` calls
`chatprovider.PruneDisabledProviderKeys`, which clears `keys.Anthropic`
because Anthropic is not in the enabled DB provider set, so the
`computer_use` execution path loses its key.

Add a focused test helper `seedEnabledAnthropicProvider` and use it only
in
the positive tests that actually drive a `computer_use` spawn through
the
runtime key-resolution path (the `computer_use` branch of
`TestSubagentLifecycleToolsIncludePersistedSubagentTypeAcrossVariants`,
`TestSpawnAgent_ComputerUseUsesComputerUseModelNotParent`, and
`TestSpawnAgent_ComputerUseInheritsMCPServerIDs`).
`seedInternalChatDeps`
stays unchanged, so the negative availability tests continue to model
the
"Anthropic unavailable" fixture. No production code is modified.

Closes https://github.com/coder/internal/issues/1486

> This PR was opened by Mux working on Mike's behalf.
2026-04-22 15:54:17 +02:00
Thomas Kosiewski b7c2c59931 fix(coderd/x/chatd/chatdebug): allow Anthropic per-modality ratelimit headers (#24592)
Previously, Anthropic's per-modality, Priority Tier, and fast-mode rate-limit headers (`Anthropic-Ratelimit-Input-Tokens-*`, `Anthropic-Ratelimit-Output-Tokens-*`, `Anthropic-Priority-Input-Tokens-*`, `Anthropic-Priority-Output-Tokens-*`, `Anthropic-Fast-Input-Tokens-*`, and `Anthropic-Fast-Output-Tokens-*`) were shown as `[REDACTED]` in the Debug panel because they contain `"token"` in the name and fell through the generic credential filter.

Add them to the allowlist in `coderd/x/chatd/chatdebug/redaction.go` alongside the existing `Anthropic-Ratelimit-Tokens-*` entries so the limits/remaining/reset values surface in the raw response view.
2026-04-22 15:14:31 +02:00
Thomas Kosiewski 26b64fa523 fix(coderd/x/chatd/chatdebug): record SSE attempts on EOF (#24565)
`chat_turn` debug steps persist with `attempts: []` even when the
streaming call to Anthropic completes successfully. Fantasy's
Anthropic SSE adapter iterates the response to EOF via
`for stream.Next()` and abandons the body without calling `Close()`,
so `RecordingTransport`'s Close-only recording path never fires and
the attempt is lost. Non-streaming runs (`quickgen`,
`title_generation`) go through `model.Generate(...)` and are
unaffected.

Record on `io.EOF` for `text/event-stream` bodies specifically.
Non-SSE responses stay on the Close-only path so JSON integrity,
content-length validation, and inner-`Close()` error semantics are
preserved. `record()` is already `sync.Once`-guarded, so a later
`Close()` is a no-op for recording.
2026-04-22 15:02:02 +02:00
Michael Suchacz 9634739aed fix: support Bedrock ambient AWS credentials for Agents providers (#24397)
> This PR was authored by Mux on behalf of Mike.

Adds AWS Bedrock ambient credential support to the Agents provider path.
Bedrock providers can now be saved without a stored API key and
authenticated via the standard AWS SDK credential chain on the Coder
server (IAM roles, `AWS_ACCESS_KEY_ID`, etc.). Also fixes missing `Base
URL` forwarding for Bedrock.

## Changes

**Backend runtime** (`coderd/x/chatd/chatprovider/chatprovider.go`):
- New `ProviderAllowsAmbientCredentials(provider)` helper. Currently
returns true only for Bedrock.
- `ModelFromConfig` no longer errors on an empty API key when the
provider is in the ambient-allowed set AND was explicitly resolved via
`ByProvider`. This preserves the policy gate: unresolvable providers
(disabled central key, user-key-required without a user key) still
error.
- `setResolvedProviderAPIKey` internalizes the ambient-credentials
contract via `ProviderAllowsAmbientCredentials`, so a
resolved-but-keyless Bedrock provider is represented as an empty
`ByProvider` entry rather than a post-hoc sentinel patch in the caller.
- `WithAPIKey` is only appended when a token is present.
- `WithBaseURL(baseURL)` is now forwarded for Bedrock (was previously
missing).

**Backend admin API** (`coderd/exp_chats.go`):
- `validateChatProviderCentralAPIKey` exempts Bedrock from requiring a
stored API key when central credentials are enabled.
- AI Gateway separation (`ChatProviderAPIKeysFromDeploymentValues`) is
unchanged. No silent reuse of `CODER_AIBRIDGE_BEDROCK_*` flags.

**Frontend**
(`site/src/pages/AgentsPage/components/ChatModelAdminPanel/*`):
- API Key field is optional for Bedrock when central credentials are
enabled.
- Bedrock-specific descriptions on API Key and Base URL fields
(bearer-token vs ambient modes, `AWS_REGION` guidance).
- Right-aligned "Clear stored token" action switches an existing Bedrock
provider back to ambient mode.
- `hasEffectiveAPIKey` treats Bedrock with central credentials enabled
as configured, so the provider list shows the correct status icon.
- Three new stories: `ProviderFormBedrockAmbientCredentials`,
`ProviderFormBedrockBearerToken`, `ProviderFormBedrockClearBearerToken`.

**Docs** (`docs/ai-coder/agents/models.md`,
`docs/ai-coder/ai-gateway/setup.md`):
- New "Configuring AWS Bedrock" section covering both credential modes,
region resolution, and the Base URL override.
- Explicit note that the `us-east-1` region fallback only applies to
bearer-token mode; ambient credentials require a region from the
standard AWS SDK chain.
- Cross-reference in AI Gateway docs clarifying that
`CODER_AIBRIDGE_BEDROCK_*` flags are a separate configuration path from
Agents.

## Not in scope

- Reusing AI Gateway Bedrock flags as an implicit Agents fallback.
- Per-provider AWS access key, secret, or region fields (would need a
migration and audit-table review).
- IMDS or network-backed credential probes in admin/listing request
paths.

## Related

Dogfood deployment integration:
https://github.com/coder/dogfood/pull/324
2026-04-22 14:20:23 +02:00
Mathias Fredriksson 78d9a220cf fix(coderd/x/chatd): detect disconnected agents in getWorkspaceConn (#24336)
Add agent status check and dial timeout to getWorkspaceConn to
prevent tool calls from hanging when a workspace agent disconnects.

Status check: call isAgentUnreachable on every getWorkspaceConn
call. On cache miss, check the freshly fetched agent row. On
cache hit, re-fetch the agent row by PK for a fresh heartbeat
timestamp. Disconnected and timed-out agents return a sentinel
immediately; connecting agents proceed to dial.

Dial timeout: wrap dialWithLazyValidation in a 30s
context.WithTimeoutCause (matching 8 other server-side AgentConn
callers). Parent context cancellation propagates unchanged so
the chatloop can detect ErrInterrupted.

Both sentinels tell the LLM the agent is unreachable and the
workspace may need restarting from the dashboard.

Closes CODAGT-149
2026-04-22 12:10:32 +00:00
Ethan cc4e04afde feat(site): display file attachments in chat UI (#24281)
Renders the durable file attachments introduced in #24280 in the chat
interface. Without this, attachments were stored and served correctly
but the UI showed raw file parts with no previews or download UX.

Every attachment gets a download affordance, split into three rendering
tiers:

- **Images** — thumbnail with a hover/focus overlay containing a
download link. `onFocusCapture`/`onBlurCapture` with
`contains(relatedTarget)` keeps the overlay open while tabbing between
the image and its download link.
- **Text-like files** (`text/*`, `application/json`) — expandable
preview button with loading + error-with-retry states and the same
download overlay. Preview fetches throw a typed
`FetchTextAttachmentError` with a `.status` field instead of a
stringly-typed error.
- **Everything else** — compact `FileCard` with extension badge,
filename, and download link.

User-side and assistant-side rendering now share `AttachmentBlocks.tsx`
(`AttachmentPreviewFrame`, `TextAttachmentButton`,
`ImageAttachmentButton`, `FileCard`, plus
`getAttachmentHref`/`getAttachmentName`) instead of two near-duplicate
implementations. The text-attachment overlay anchors to the preview
surface so the download button stays pinned even when a loading/error
status line widens the row below.

`ComputerRenderer` detects when a screenshot was stored as a durable
attachment (`attachment_file_id`) and suppresses the stale base64
rendering — the screenshot appears as a proper file part instead.
`ToolLabel` shows the attached filename for `attach_file` tool calls.

Storybook coverage in `ConversationTimeline.stories.tsx` was expanded to
cover every tier (single/multiple images, inline + file-id text, JSON,
download-only files, fetch-failure retry, mixed attachments + file
references) with play-function assertions.

<img width="811" height="150" alt="image"
src="https://github.com/user-attachments/assets/27c71081-3502-4e80-92a7-d8adf1ff9323"
/>



## Cleanup

Per Mathias' post-merge suggestion on #24280, this PR also relocates
`coderd/chatfiles` → `coderd/x/chatfiles` so the durable-attachment
helpers live beside the rest of the `chatd` experimental surface.

Closes CODAGT-91
2026-04-22 20:11:53 +10:00
Ethan 353e522614 fix: handle expired chat file attachments in replay and UI (#24518)
Closes CODAGT-216

## Problem

`dbpurge` deletes `chat_files` rows after the deployment's configured
retention window, but `chat_messages.content` can still contain
`file_id` references to those files. On replay, that left the Anthropic
provider with an empty file payload and a `400 image cannot be empty`
error. In the UI, the same missing file showed up as a broken image.

## Fix

- Backend: when replay hits a `file_id` whose bytes are gone, replace it
with a short text placeholder instead of emitting an empty file part. We
could also drop the missing attachment entirely, but that would silently
remove context from the replay and make the conversation harder for the
model to interpret. The placeholder keeps the request valid while still
telling the model that a file used to be there and is no longer
available.
- Frontend: classify chat image failures instead of treating every
broken image the same.
- `404` file fetches render `Image expired`, with a tooltip explaining
that chat attachments are deleted after the retention window set for the
deployment.
- Other remote failures render `Image failed to load`, with a tooltip
that surfaces server/network detail when available.
- Invalid inline image data still renders `Image failed to load` without
a probe.
2026-04-22 14:10:51 +10:00
blinkagent[bot] 79a9f437d7 feat(coderd/x/chatd/chattool): add description tags to tool parameter structs (#24394) 2026-04-21 11:37:29 -07:00
Ethan c1421b4ead test(coderd/x/chatd): deflake stale control notification test (#24545)
Previously, `TestProcessChat_IgnoresStaleControlNotification` could
return as soon as `UpdateChatStatus` ran, even though `processChat`
still re-read chat state and finished deferred cleanup afterward. That
let gomock and quartz teardown race the tail of cleanup and
intermittently fail the test.

Wait for `processChat` itself to return before asserting the final
status, while keeping the existing strict mock expectations intact.

Closes https://github.com/coder/internal/issues/1479
2026-04-22 00:08:34 +10:00
Ethan 2295e9d5be feat: surface upstream provider error details in chat callout (#24546)
Anthropic HTTP 400 responses (e.g. "image exceeds 5 MB maximum") were
collapsed in the chat UI to the generic headline "Anthropic returned an
unexpected error (HTTP 400)." with no actionable detail — the upstream
message survived to the processor log but was dropped before reaching
the client.

Add a new optional `Detail` field on `codersdk.ChatStreamError` that
carries the upstream provider message alongside the existing normalized
headline. The backend extracts `error.message` from
`fantasy.ProviderError.ResponseBody` (the JSON envelope shared by
Anthropic and OpenAI), falls back to the trimmed provider message when
the body is absent or unparseable, and caps the result at 500 runes. The
frontend threads `Detail` through `useChatStore`, `liveStatusModel`, and
`ChatStatusCallout`, rendering it as a muted secondary line inside the
existing `AlertDescription`.

Before:

<img width="1552" height="185" alt="image"
src="https://github.com/user-attachments/assets/524b588e-3cee-4fad-bc15-6bf3aec0899d"
/>

After:

<img width="814" height="173" alt="image"
src="https://github.com/user-attachments/assets/eae82a89-3ac1-4a33-8d18-ef9f77263d89"
/>

## Persistence

`Detail` is **not** persisted — it disappears on refresh. Persisting it
would require a DB change (today `chats.last_error` is a single nullable
`TEXT` column), and the shape of persisted chat errors is worth a more
deliberate rethink — e.g. promoting `last_error` to `JSONB` so we can
also retain structured fields like `kind`, `statusCode`, `provider`, and
`retryable` instead of only the normalized headline string. That's a
bigger design discussion than this PR should carry.

In the meantime, seeing the upstream error reason *immediately on
failure* is already a large UX improvement over the status quo, and this
PR gets us there without prejudicing the eventual persistence design.
Tracking persistence in CODAGT-239.

Closes CODAGT-235
2026-04-22 00:05:27 +10:00
Michael Suchacz f073323c89 refactor: unify subagent spawn behind spawn_subagent (#24535)
Unify the three subagent spawn tools (`spawn_agent`,
`spawn_explore_agent`, `spawn_computer_use_agent`) behind a single
`spawn_subagent` tool keyed by a `subagent_type` discriminant
(`general`, `explore`, `computer_use`). Mirrors the single-entry-point
pattern already used by `task` in mux while keeping `wait_agent`,
`message_agent`, and `close_agent` as separate lifecycle tools.

A new backend subagent definition catalog
(`coderd/x/chatd/subagent_catalog.go`) is the source of truth for tool
description, prompt guidance, availability rules (plan mode,
desktop/Anthropic gating), and child-chat option building.
`spawn_subagent` advertises only the types available in the current
context and validates `subagent_type` server-side; context inheritance
still flows through the existing `createChildSubagentChatWithOptions`
path. `wait_agent`, `message_agent`, and `close_agent` responses now
include a server-derived `subagent_type` so the UI stops inferring
lifecycle state from tool names.

The frontend gets a shared normalization helper
(`site/src/pages/AgentsPage/components/ChatElements/tools/subagentDescriptor.ts`)
that maps either legacy tool names or new `spawn_subagent` args into a
common descriptor (action, variant, icon, fallback copy). Legacy
transcripts still render identically; `Tool.tsx`, `SubagentTool.tsx`,
`ToolLabel.tsx`, `ToolIcon.tsx`, and `messageParsing.ts` now key off the
descriptor instead of hard-coded names. Existing UI copy is preserved
(`Spawning Explore agent...`, `Using the computer...`, computer-use
monitor icon and Open Desktop affordance).

> This PR was opened by Mux working on Mike's behalf.
2026-04-21 14:01:32 +02:00
Michael Suchacz 9d0469fc4c feat: allow approved external MCP tools in root plan mode (#24509)
## Summary

Allow root plan-mode chats to use MCP tools from external servers that
an admin has explicitly approved for plan mode. Workspace MCP and
plan-mode subagents remain blocked.

## Problem

`chatd.go` excluded every MCP tool when `isPlanModeTurn` was true, so
planning had no access to tools like docs search, ticketing, etc.
Lifting that guard wholesale was unsafe: `mcp_server_configs` already
has centralized admin governance, but workspace-local MCP (discovered
from agent `.mcp.json`) does not, and subagents use a narrower trust
boundary.

## Fix

Add an admin-controlled per-server `allow_in_plan_mode` flag (default
`false`) and gate plan-mode MCP access on it.

### Backend / schema
- New migration `000472_mcp_server_allow_in_plan_mode.{up,down}.sql` and
matching fixture update.
- `mcpserverconfigs.sql` + generated code: persist and read the new
column.
- `codersdk/mcp.go`: thread the field through `MCPServerConfig`,
`Create*`, and `Update*` request types.
- `coderd/mcp.go`: validate, persist, and return the flag in
get/list/create/update handlers.

### chatd
- `coderd/x/chatd/chatd.go`: pre-filter selected external MCP configs by
`AllowInPlanMode` before calling `mcpclient.ConnectAll` on plan-mode
root turns. Workspace MCP discovery is skipped entirely on plan-mode
turns.
- Single helper decides whether a tool is available in plan mode, used
both at construction and for active-tool filtering (defense in depth).
Plan-mode subagents, dynamic tools, provider-native tools, computer-use,
and workspace MCP stay unchanged.
- `coderd/x/chatd/prompt.go`: update the root plan-mode overlay text to
match the new boundary.

### UI
- `MCPServerAdminPanel.tsx`: add an explicit toggle ("Allow all tools
from this MCP server in root plan mode") next to the existing governance
controls.
- Regenerated `site/src/api/typesGenerated.ts`.

### Docs
- `docs/ai-coder/agents/architecture.md`: replace the blanket "MCP is
unavailable in plan mode" note with the new root-only, external-only,
admin-approved policy. Explicitly call out that workspace MCP and
plan-mode subagents are still excluded.

### Tests
- Plan-mode visibility (approved vs non-approved external server).
- Plan-mode invocation of an approved external MCP tool.
- End-to-end plan-mode workflow that uses an approved MCP tool and then
reaches `propose_plan`.
- Regressions: workspace MCP still excluded in plan mode; plan-mode
subagents still on the restricted tool boundary; existing tool
allow/deny list filtering still applies.

## Policy precedence

`allow_in_plan_mode` is an **additional** requirement on top of existing
`enabled`, availability, chat-selected / forced server IDs, and tool
allow/deny lists. It approves **all tools on that server** for root plan
mode; a per-tool plan allowlist is deliberately deferred.

## Follow-ups (explicitly out of scope)

- Whether plan-mode subagents should inherit approved external MCP
tools.
- Workspace-local MCP safety model (agent-side `.mcp.json` schema vs. a
coderd-managed workspace MCP config).

## Validation

- `go vet ./coderd/x/chatd/...`
- `go test ./coderd/x/chatd -run 'TestPlan.*|TestMCP.*' -count=1`
- `go test ./coderd/x/chatd -count=1 -timeout 5m` (full chatd suite)
- `make fmt` (no diff)

> Mux opened this PR on Mike's behalf.
2026-04-21 12:26:12 +02:00
Cian Johnston 5f3effd839 fix(coderd/x/chatd): add chattest.OpenAI() default fake server (#24540)
- Add `chattest.OpenAI(t)` convenience wrapper around `NewOpenAI` with
sensible defaults (JSON title response for non-streaming, text chunk for
streaming)
- Update `seedChatDependencies` to use it instead of an empty base URL,
preventing title generation from hitting real `api.openai.com` with a
fake key:

```
    t.go:111: 2026-04-20 19:23:31.885 [debu]  coderd.chatd.processor: title model candidate failed  chat_id=edb43454-f23d-4163-9974-d101b8091de6  chat_id=edb43454-f23d-4163-9974-d101b8091de6 ...
        error= generate structured title:
                   github.com/coder/coder/v2/coderd/x/chatd.generateStructuredTitleWithUsage
                       /home/coder/src/coder/coder/coderd/x/chatd/quickgen.go:443
                 - unauthorized: Incorrect API key provided: test-api-key. You can find your API key at https://platform.openai.com/account/api-keys.
```

> 🤖
2026-04-21 10:26:20 +01:00
Ethan 1203f625b7 feat(coderd): accept parameters in start_workspace tool (#24434)
When the chat `start_workspace` tool triggers an active-version upgrade
that introduces new required parameters, the build fails with a
parameter validation error. Previously this returned a message telling
the user to update from the UI — a dead end for the model.

This PR lets the model recover inside the chat by:

1. Accepting an optional `parameters` map on `start_workspace` (same
schema as `create_workspace`), forwarded as `RichParameterValues`.
2. Returning structured JSON error responses that preserve validation
details and the workspace's `template_id`, so the model can call
`read_template` to discover what changed.
3. Replacing the UI-only guidance in `exp_chats.go` with
model-actionable retry instructions.

The expected model flow on an active-version parameter failure is now:

```
start_workspace → fails (structured error with template_id + validations)
read_template   → discovers new required parameters
start_workspace → retries with parameters map → workspace starts
```
<img width="846" height="511" alt="image"
src="https://github.com/user-attachments/assets/d18b6864-5970-4225-8da0-0f2ab134ccb4"
/>
2026-04-21 11:36:20 +10:00
Jaayden Halko 410f9a5e19 feat: allow renaming of agent chat title (#24489)
Co-authored-by: Coder Agents <noreply@coder.com>
2026-04-20 14:00:46 +01:00
Thomas Kosiewski df7e838c21 feat(coderd): wire debug logging into chat lifecycle (#23917) 2026-04-20 12:27:16 +02:00
Mathias Fredriksson fc2493780f fix: exclude subagent chats from sidebar pagination (#24404)
GetChats now returns only root chats (parent_chat_id IS NULL).
A new GetChildChatsByParentIDs query fetches children for visible
roots and embeds them in each parent's Children field. The
singular getChat endpoint does the same.

Archive invariant is one-way: parent archived implies child
archived. Parent archive/unarchive cascades via root_chat_id.
Individual child archive is permitted; child unarchive while the
parent is archived is rejected atomically (row lock on child,
re-read parent inside the transaction). Embedded children are
filtered by the caller's archive state so individually-archived
children stay hidden from active-parent views.

Gitsync MarkStale uses GetChatsByWorkspaceIDs directly;
MarkStaleParams.OwnerID removed (dead after the switch).

Frontend: buildChatTree reads from the embedded children field,
WebSocket handlers route child events into the parent's children
array, and archiving a child strips it from the parent cache.
2026-04-20 13:19:59 +03:00
Cian Johnston df429b7f60 fix: classify HTTP/2 transport failures as retryable timeouts (#24502)
Modifies chatloop error classification behaviour to treat the following as retryable:
* HTTP/2 `force closed`
* GOAWAY 
* use of closed network connection

* Modfies user-facing retry banner to show "<provider> is temporarily
unavailable."

Relates to CODAGT-212.

> 🤖
2026-04-20 11:09:47 +01:00
Ethan ef6969dd70 feat(coderd/x/chatd): agent-created file attachments in chat (#24280)
Agents can already see workspace files and take screenshots, but users could not download those artifacts from chat. This PR adds durable chat attachments to chatd. `attach_file`, explicit `computer` screenshot actions (not the automatic post-action screenshots), and `propose_plan` now fetch bytes over the agent connection, store them in `chat_files`, link them to the chat, and carry attachment metadata in tool responses so `buildAssistantPartsForPersist` can materialize ordinary `type:"file"` assistant parts that the chat file APIs serve.

The same storage helpers are reused for other artifact-producing paths. `wait_agent` recordings and thumbnails are stored as chat files and linked back to the parent chat, with best-effort relinking so parent chats retain those artifacts without leaving orphaned rows when chat-file caps reject links. `storeChatAttachment` wraps insert + link in one transaction, files are capped at 10 MB each and 20 per chat, and serving defaults to `Content-Disposition: attachment` with an explicit inline-safe allowlist.

This PR also consolidates chat-file media policy in `coderd/chatfiles`. Uploads and tool-generated attachments share byte-based MIME detection, SVG blocking, inline-safety rules, and compatible `text/plain` refinement for JSON, CSV, and Markdown. Prompt construction still only inlines synthetic pasted text for model consumption; assistant-created attachments are persisted for the user and intentionally not replayed into later LLM turns.

UI follow-up lives in #24281.

Relates to CODAGT-91
2026-04-20 18:04:35 +10:00
Mathias Fredriksson 6b0bb02e5d fix: server-side diffs and stricter fuzzy splicing for edit_files (#24454)
Fixes three classes of edit_files bugs and adds structured per-file
diff output for tool callers:

- New IncludeDiff flag on FileEditRequest; when set, the agent
  returns FileEditResponse.Files[]{Path, Diff} with unified diffs
  computed via go-udiff v0.4.1 Lines + ToUnified (not Unified,
  which calls log.Fatalf on internal error).
- Fuzzy match comparators split each line into leading whitespace,
  body, trailing whitespace, and ending. The splice substitutes at
  each position: on agreement between search and replace the file's
  bytes win; on disagreement the replacement's bytes are spliced
  verbatim. Carve-outs for empty-body lines, multi-line EOF splices,
  and level-aware indent translation for inserted lines.
- Indent-unit detection (GCD for spaces, tab-priority) lets a 4sp
  LLM search insert correctly into tab or 2sp files. Falls back to
  the previous cLead-inheritance path when units can't be detected
  cleanly.
- Empty search is rejected with "search string must not be empty".
- Duplicate file paths in one request are rejected; symlink aliases
  resolved via api.resolvePath before the dedup check.
- Frontend EditFilesRenderer consumes the structured files array by
  explicit path (no label munging) with per-file synthetic fallback
  for older agents or mismatched paths. On error, no diff is
  rendered so the synthetic fallback doesn't misrepresent a
  rejected edit as applied.

Breaking change: AgentConn.EditFiles changes from (ctx, req) error
to (ctx, req) (FileEditResponse, error) in codersdk/workspacesdk.
Source-breaking for external Go consumers; no compat shim per plan
owner.

Out of scope (tracked in CODAGT-214): level-aware indent for
middle-substituted splice lines. Locked in
TestEditFiles_FuzzyIndent_InsertionLevelAware's Lock_* cases plus
TestEditFiles_ReplaceAll_FuzzyIndentGap.
2026-04-18 16:39:34 +03:00
Cian Johnston 3f6b40a833 fix: reap idle chatd stream states on a timer (#24476)
* Adds `streamJanitorLoop` to clean up stale streams every 30s
* zeroes dropped slots to aid in gc-eligibliity
* Adds regression tests in coderd/x/chatd and enterprise/coderd/x/chatd

> 🤖
2026-04-17 19:22:00 +01:00
Cian Johnston 4b585465b8 feat: label chatd metrics by model, add stream-state diagnostics (#24475)
Adds production-observability metrics to coderd/x/chatd/ for
model-level correlation and a chatStreams memory-leak investigation.

- Label per-request chatd metrics (steps_total, message_count,
  prompt_size_bytes, tool_result_size_bytes, ttft_seconds,
  compaction_total) with `model` and enrich the per-turn logger
  with provider/model.
- Add `coderd_chatd_stream_retries_total{provider, model, kind}`
  counter incremented in chatloop before OnRetry.
- Register a prometheus.Collector exposing `streams_active`,
  `stream_buffer_size_max`, `stream_buffer_events`,
  `stream_subscribers` from p.chatStreams.
- Add `coderd_chatd_stream_buffer_dropped_total` counter,
  incremented per publishToStream drop independently of the
  existing log-rate-limited bufferDropCount.
- Snapshot logger/model before the title-generation goroutine to
  avoid a data race with the logger/model rebind below it.

> 🤖
2026-04-17 16:16:30 +01:00
Thomas Kosiewski 91f9de27a1 feat(coderd): add chat debug service and summary aggregation (#23916) 2026-04-17 16:27:53 +02:00
Hugo Dutka db8191277b fix: associate computer use recordings with chats (#24471)
Fixes
[CODAGT-195](https://linear.app/codercom/issue/CODAGT-195/agent-uploaded-recordings-are-missing-chat-file-links-entries).
2026-04-17 13:47:59 +02:00
Michael Suchacz 73b5058923 feat: add Explore mode as subagent-only modality (#24448)
> This PR was authored by Mux on behalf of Mike.

Introduce Explore mode, a read-only subagent modality for delegated
discovery and code investigation.

## What

Adds a `spawn_explore_agent` tool that creates child chats restricted to
read-only operations. An admin can optionally configure a
deployment-wide
model override so Explore subagents use a model optimized for large
context
or reasoning without changing the root chat's model.

### Backend

- New `ChatModeExplore` enum value (migration 000471).
- `spawn_explore_agent` tool definition with read-only allowlist:
`read_file`, `execute`, `process_output`, `read_skill`,
`read_skill_file`.
  Write tools, file editors, and nested subagent spawning are blocked.
- Deployment config storage for the Explore model override
  (`agents_chat_explore_model_override` in `site_configs`).
- Model resolution hierarchy: configured override, then current turn
model,
then global default. Silent fallback with warning log when the override
  becomes unavailable.
- RBAC: `AsChatd` for daemon reads, `ActionRead` and `ActionUpdate` on
  `ResourceDeploymentConfig` for admin API calls.
- Plan mode root chats can use `spawn_explore_agent` for read-only
research,
  matching the planning prompt guidance.
- The Explore override config API now reports malformed saved overrides
as
  "treated as unset" so admins can clear them explicitly.

### Frontend

- `ExploreModelOverrideSettings` component in admin agent behavior
settings.
  Uses `ModelSelector`, handles unavailable model warnings, and supports
  explicit Save and Clear actions.
- Malformed saved overrides show a warning and require an explicit Save
to
  clear, instead of Clear auto-submitting behind the scenes.

### Tests

- Integration: `TestExploreSubagentIsReadOnly` (full spawn flow, tool
  verification, prompt overlay, DB state).
- Unit: tool allowlist tests for explore, plan, and default modes.
- Internal: model override resolution with valid, invalid UUID,
disabled, and
  unconfigured override scenarios.
- RBAC: `dbauthz_test.go` for `GetChatExploreModelOverride` and
  `UpsertChatExploreModelOverride`.
- API: admin set and clear, malformed stored override reporting,
disabled
  model rejection, non-admin denial.
2026-04-17 13:40:17 +02:00
Danielle Maywood 15d8e4ff9f feat: accept xhigh effort for Anthropic (#24439) 2026-04-16 17:25:34 +01:00
Michael Suchacz 1092093e98 feat: add internal subagent model override wiring (#24399)
> Mux working on behalf of Mike.

## Summary
- add an enabled chat model config lookup by ID for internal callers
- keep `spawn_agent` unchanged while threading an internal model
override through child subagent chat creation
- extend chatd coverage for inherited bindings, plan mode, and internal
override behavior

## Validation
- `go test ./coderd/x/chatd ./coderd/database/dbauthz`
- `make lint`
2026-04-16 17:08:02 +02:00
Ethan eae9444dbe fix: add missing ClientType to InsertChat test params (#24436)
Two `InsertChatParams` blocks in `startworkspace_test.go` were missing
the `ClientType` field. Since the `chat_client_type` enum column is `NOT
NULL`, Postgres rejects the Go zero value (`""`), causing
`TestStartWorkspace` subtests `StoppedWorkspaceReportsAutoUpdate` and
`ManualUpdateRequired` to fail with:

```
pq: invalid input value for enum chat_client_type: ""
```

Closes https://github.com/coder/internal/issues/1471
2026-04-16 15:04:40 +00:00
Ethan 91b35a25ee fix(coderd): auto-update workspace to active template version on chat start (#24424)
## Problem

When a template has `require_active_version` enabled and the chat agent
tries to start a workspace that is stopped on an older template version,
the agent gets stuck in an infinite loop: `start_workspace` fails with a
403 (the old version is not the active version and the user lacks
`ActionUpdate` on the template), then `create_workspace` sees the
existing stopped workspace and tells the agent to use `start_workspace`,
repeat forever.

The root cause is that `chatStartWorkspace()` passes the start build
request through without setting `TemplateVersionID`, so `wsbuilder`
defaults to the previous build's template version — which RBAC rejects
when `RequireActiveVersion` is true.

## Fix

In `chatStartWorkspace()` (`coderd/exp_chats.go`), when the template's
access control has `RequireActiveVersion` enabled, explicitly set
`req.TemplateVersionID` to `template.ActiveVersionID` before calling
`postWorkspaceBuildsInternal()`. This mirrors how the autobuild executor
handles the same scenario (`coderd/autobuild/lifecycle_executor.go`).

If the new active version introduces required parameters that cannot be
resolved automatically (no defaults, no previous values), the build
fails at parameter validation before a provisioner job is created. In
that case, a clear error message tells the user to update and start the
workspace from the UI instead of surfacing a raw internal error.

On successful auto-update, the tool response includes
`updated_to_active_version`, `update_reason`, and a human-readable
`message` so the model can explain to the user what happened.

<img width="782" height="122" alt="image"
src="https://github.com/user-attachments/assets/289430d6-066e-41cf-bc97-cd013dcf717d"
/>

### Changes

- **`coderd/exp_chats.go`**: `chatStartWorkspace()` loads the template,
checks `RequireActiveVersion` via `AccessControlStore`, and pins the
build to the active version when required. New
`isChatStartWorkspaceManualUpdateRequiredError()` classifies parameter
validation failures from both the dynamic parameters path
(`DiagnosticError`) and the classic path (`ErrParameterValidation`
sentinel).
- **`coderd/wsbuilder/wsbuilder.go`**: New `ErrParameterValidation`
sentinel error, wrapped into the classic parameter validation
`BuildError` so callers can use `errors.Is` instead of string matching.
- **`coderd/x/chatd/chattool/startworkspace.go`**:
`waitForAgentAndRespond` now returns `map[string]any` instead of
`fantasy.ToolResponse`, letting the caller annotate the result (e.g.
auto-update metadata) before converting. Error handling for `StartFn`
checks for `httperror.Responder` errors to surface clean messages for
the manual-update case.
- **`coderd/x/chatd/chattool/startworkspace_test.go`**: Two new tests —
`StoppedWorkspaceReportsAutoUpdate` (verifies auto-update fields in
response) and `ManualUpdateRequired` (verifies clean error message
without internal wrapping).

### Follow-up

The manual-update error message could include a direct link to the
workspace settings page, but the chattool layer does not currently have
access to the deployment's access URL. Plumbing it through is
straightforward but out of scope for this fix.


Closes CODAGT-192
2026-04-17 00:16:37 +10:00
Dean Sheather 3452ab3166 chore: add client_type field to chats and telemetry (#24342)
Add a `chat_client_type` enum (`ui` | `api`) and `client_type` column to
the `chats` table. The column defaults to `api` for new rows so API
callers don't need to set it explicitly. Existing rows are backfilled to
`ui`.

The field flows through `CreateChatRequest`, `chatd.CreateOptions`,
`InsertChat`, and is returned in the `Chat` response via `db2sdk`.

<details>
<summary>Implementation notes (Coder Agents generated)</summary>

### Changes

**Database migration (000469)**
- New enum `chat_client_type` with values `ui`, `api`.
- New `client_type` column, `NOT NULL DEFAULT 'api'`.
- Backfill: `UPDATE chats SET client_type = 'ui'`.

**SQL query** — `InsertChat` now includes `client_type`.

**SDK** — `ChatClientType` type added; `ClientType` field added to both
`CreateChatRequest` (optional, defaults server-side to `api`) and `Chat`
response.

**Handler** — `postChats` maps the request field (defaulting to `api`)
and passes it through `chatd.CreateOptions`.

**Sub-agent** — Child chats inherit their parent's `client_type`.

**db2sdk** — Maps the database value to the SDK type.

### Decision log
- Default is `api` (not `ui`) so existing API integrations get the
correct value without code changes.
- Backfill sets existing rows to `ui` per requirement.
- Child chats inherit `client_type` from parent rather than defaulting.
</details>
2026-04-16 23:57:05 +10:00
Michael Suchacz 1cf0354f72 feat: add plan mode with restricted tool boundary (#24236)
> This PR was authored by Mux on behalf of Mike.

## Summary
- add persistent plan mode for chats and the chat-specific plan file
flow
- add structured planning tools such as `ask_user_question` and
`propose_plan`
- keep `write_file` and `edit_files` constrained to the chat-specific
plan file during plan turns
- allow shell exploration in plan mode, including subagents, via
`execute` and `process_output`
- block implementation-oriented, provider-native, MCP, dynamic, and
computer-use tools during plan turns
- update the chat UI, tests, and docs for the new planning flow
2026-04-16 11:12:01 +02:00
blinkagent[bot] e996f6d44b chore: increase coderd_chatd_message_count histogram max bucket to 1024 (#24409)
The `coderd_chatd_message_count` histogram's current max bucket of 128
is being hit in production. This increases the exponential bucket count
from 8 to 11, extending coverage from `1..128` to `1..1024`.

Before: `1, 2, 4, 8, 16, 32, 64, 128`
After: `1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024`

Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
2026-04-16 09:43:54 +01:00
Kyle Carberry 9c74c8c674 fix: move OnChatUpdated call after agent is ready in create/start workspace (#24410) 2026-04-15 19:18:54 -04:00
Kyle Carberry d11849d94a fix: re-fetch context files and skills from workspace on each turn (#24360)
Context files (AGENTS.md) and skills were only fetched from the
workspace on the first turn or when the agent changed. On subsequent
turns, stale content from persisted messages was used. This meant that
if AGENTS.md or skills were modified on the workspace between turns, the
agent wouldn't see the changes until the user created a new chat.

## Changes

- Extract `fetchWorkspaceContext` from `persistInstructionFiles` to
allow fetching workspace context without persisting
- On subsequent turns, re-fetch fresh context from the workspace instead
of reading stale persisted content; falls back to persisted messages if
the workspace dial fails
- Update `ReloadMessages` callback to re-derive instruction and skills
from reloaded database messages after compaction, instead of using
captured closure variables
- Add `formatSystemInstructionsFromParts` helper to build system
instructions directly from agent parts without requiring separate
OS/directory params
- Add tests for the new helper

<details><summary>Implementation Notes</summary>

### Root cause

In `runChat`, the `else if hasContextFiles` branch (subsequent turns)
called `instructionFromContextFiles(messages)` which read stale content
from persisted DB messages. The `ReloadMessages` callback
(post-compaction) also used captured `instruction`/`skills` closure
variables from the start of the turn, never re-deriving them.

### Approach

1. **Extract `fetchWorkspaceContext`** — Pure refactor of the fetch-only
part of `persistInstructionFiles` (agent connection, context config
retrieval, content sanitization, metadata stamping). Returns parts +
skills without persisting.

2. **Subsequent turns**: Instead of reading from persisted messages,
launch a `g2` goroutine that calls `fetchWorkspaceContext` to get fresh
context from the workspace. Falls back gracefully to persisted messages
if the workspace is unreachable.

3. **ReloadMessages**: Re-derive `instruction` from
`instructionFromContextFiles(reloadedMsgs)` and `skills` from
`skillsFromParts(reloadedMsgs)` using the freshly loaded messages, with
fallback to captured values if the reloaded messages don't contain
context (e.g. compacted away).

</details>

> 🤖 Generated by Coder Agents
2026-04-15 16:41:15 -04:00
Cian Johnston d7439a9de0 feat: add Prometheus metrics for chatd subsystem (#24371)
Adds 7 Prometheus metrics to the chatd subsystem and introduces typed
`ActivityBumpReason` for deadline bump attribution.

| Metric | Type | Labels |
|--------|------|--------|
| `coderd_chatd_chats` | Gauge | `state` (streaming, waiting) |
| `coderd_chatd_message_count` | Histogram | `provider` |
| `coderd_chatd_prompt_size_bytes` | Histogram | `provider` |
| `coderd_chatd_tool_result_size_bytes` | Histogram | `provider`,
`tool_name` |
| `coderd_chatd_ttft_seconds` | Histogram | `provider` |
| `coderd_chatd_compaction_total` | Counter | `provider`, `result` |
| `coderd_chatd_steps_total` | Counter | `provider` |

> 🤖
2026-04-15 19:53:10 +01:00
Ethan e7883d4573 fix(coderd/x/chatd): hoist system prompt fetch out of chat creation transactions (#24369)
## Problem

`resolveDeploymentSystemPrompt` was called inside `InTx` closures in
both `CreateChat` (`coderd/x/chatd/chatd.go`) and
`createChildSubagentChatWithOptions` (`coderd/x/chatd/subagent.go`).
That method uses `p.db` (the root store) internally to call
`GetChatSystemPromptConfig`, which requires a second DB pool checkout
while the transaction already holds one connection.

Under concurrent chat creation load (e.g., the chat scaletest at 4800
chats), this causes pool starvation: every in-flight create holds one
connection and blocks waiting for another, leading to `idle in
transaction` pileups and cascading timeouts across the entire coderd DB
pool — including unrelated background work like prebuild metrics and the
chat acquire loop.

## Fix

Move the `resolveDeploymentSystemPrompt` call before `p.db.InTx(...)` in
both call sites. The system prompt config is a read-only
deployment-level setting that does not need transactional consistency
with the chat insert, so fetching it before the transaction is both safe
and preferable (it also shortens transaction lifetime).

## Backporting

The `CreateChat` instance of this bug is also present on `release/2.32`
(`coderd/x/chatd/chatd.go` line 907). The `subagent.go` instance is not
— the child-subagent-chat creation path with its own `InTx` was added
after the branch cut.

This should be backported, but because this is only in the chat creation
path, and that's not typically hit with a great deal of concurrency in
the real world, I don't think an urgent patch for 2.32 is necessary.

## Lint gap

The existing `InTx` ruleguard rule in `scripts/rules.go` catches direct
outer-store usage (`p.db.GetFoo()`) and passing the outer store as a
function argument inside `InTx` closures, but it explicitly cannot catch
indirect access through receiver methods like
`p.resolveDeploymentSystemPrompt()` — the rule documents this blind spot
at line 273. Catching this class of bug would require interprocedural
analysis (following the callee's body to see if it touches `p.db`),
which is beyond what ruleguard's AST pattern matching can express. We're
considering a lightweight custom `go/analysis` analyzer (similar to
`paralleltestctx`) that does 1-level same-package callee inspection to
detect this pattern. In the meantime, this PR adds guidance to
`AGENTS.md` so AI reviewers can flag the pattern during code review.
2026-04-16 00:13:15 +10:00
Thomas Kosiewski 4651ca5a9a feat(coderd/x/chatd/chatdebug): add recorder, transport, and redaction (#23915) 2026-04-15 15:14:51 +02:00
Cian Johnston 6194bd6f57 fix: address post-merge review findings for chat org scoping (#24297)
Addresses review findings from #23827 that were added post-merge:

- Persisted attachments now store `organizationId`; mismatched orgs
pruned on restore
- Workspace selection reconciliation: stale IDs from previous orgs
dropped via derived `effectiveWorkspaceId`
- Org picker uses `permittedOrganizations()` for RBAC-aware filtering
- Org picker hidden when user belongs to only one org
- Ref-sync `useEffect` replaced with `useEffectEvent`
- `CreateWorkspace()` and `ListTemplates()` take `organizationID` and
`db` as required function parameters instead of optional struct fields —
compiler enforces them, removes scattered nil guards
- Cross-org template check in `CreateWorkspace` is now unconditional
- `ListTemplates` org-scoping filter now has test coverage
- `setupChatInfra` comment fixed; test helpers use params structs
instead of positional UUIDs
- Enterprise test documents that org admin only sees own chats (handler
hardcodes `OwnerID` — future work needs sidebar UI before lifting that
restriction)

> 🤖
2026-04-15 11:39:05 +01:00