Commit Graph

55 Commits

Author SHA1 Message Date
Michael Suchacz ca1f6b19a2 feat: remove legacy chat provider tables (#25416) 2026-05-22 09:50:01 +02:00
Michael Suchacz 5968c3dac7 feat: use AI provider keys at runtime (#25414) 2026-05-22 02:17:09 +02:00
Mathias Fredriksson ec1e861152 fix(coderd/x/chatd): deliver out-of-order durable messages on subscribe (#25433)
The subscriber advanced a single delivery cursor on each notify and
trusted it for both lookups. Concurrent publishMessage calls and PG
NOTIFY commit ordering let cache appends and notifies arrive out of
ID order, after which a late notify would scan above its own message
and drop it. The DB fallback was also skipped whenever the cache
delivered anything, hiding cross-replica messages that only the DB
held.

The cursor becomes a high-water mark, not the lookup key. Notifies
trigger a rescan over the gap they describe and dedupe per
subscription, and the DB pass runs every time so cross-replica
messages can't get eaten by a local cache hit.

Closes coder/internal#1525
Closes CODAGT-357
2026-05-21 10:35:41 +03:00
Michael Suchacz 63900d212d feat: support personal skills in chats (#25366)
> Mux updated this PR on behalf of Mike.

## Stack Context

This PR builds on #25365 in the experimental personal skills stack and
completes the chat integration.

Stack order:
1. #25362 personal skill resolver
2. #25363 storage, permissions, API, and SDK
3. #25365 API test coverage
4. #25366 chattool and chatd integration
5. #25066 settings UI and docs
6. #25386 personal skills slash menu

## What?

Updates chattool skill formatting and `read_skill` resolution so tools
can read personal skills from the database, then injects personal skill
metadata into chatd prompts and registers the skill-reading tools when
skills are available.

This branch has also been merged with current `origin/main` to resolve
merge conflicts.

## Why?

The chattool and chatd changes need to land together so the intermediate
stack state stays buildable. This completes personal skill availability
in chats without syncing personal skills into workspace filesystems.

## Validation

- `go test -count=1 ./coderd/x/chatd/chattool -run
'TestFormatResolvedSkillIndex|TestReadSkillTool|TestReadSkillFileTool'`
- `go test -count=1 ./coderd/x/chatd -run
'TestPersonalSkillsInSystemPrompt|TestPersonalAndWorkspaceSkillCollisionInSystemPrompt|TestSkillIndexRefreshReplacesStaleAliases|TestFetchPersonalSkillMetadata|TestLoadPersonalSkillBody'`
- `go test -count=1 ./coderd -run 'Test.*UserSkill'`
- `git diff --cached --check`
- `make lint`
- pre-commit hook
2026-05-20 19:50:50 +02:00
Kyle Carberry 159089686a fix(coderd/x/chatd): prime workspace MCP cache after create/start (#25298)
## Problem

Mid-turn workspace MCP discovery was broken when an agent was still
cold-starting. `PrepareTools` in `chatd.go` flipped
`workspaceMCPDiscovered = true` *before* calling
`discoverWorkspaceMCPTools`, so a failed discovery attempt permanently
blocked retries within the turn.

Customer-reported repro:

- New chat with no pre-selected workspace.
- LLM calls `create_workspace` mid-turn at `23:35:05`.
- `PrepareTools` fires, dials the agent with a 30s timeout, dial times
out at `23:38:15`, `discoverWorkspaceMCPTools` returns empty.
- Agent connects at `23:38:29`, 14 seconds later.
- `workspaceMCPDiscovered` was already true, so `PrepareTools` never
retried for the rest of the turn. MCP tools only appeared on the next
user message.

A naive retry loop in `PrepareTools` would also miss the bigger picture:
a workspace boot can take several minutes (EC2 cold start, 10 min
startup scripts), and the chatloop only gets a chance to call
`PrepareTools` between LLM steps.

## Fix

Do the workspace MCP discovery from inside the tool that already waits
for the agent. `chattool.CreateWorkspace` and `chattool.StartWorkspace`
call `waitForAgentReady`, which has a 2 min agent-online budget plus a
10 min startup-script budget. By the time they fire `OnChatUpdated`, the
agent is `Ready`. The chatd `onChatUpdated` callback now launches an
async `primeWorkspaceMCPCache` goroutine on every bind that has a valid
workspace ID:

- The primer calls `discoverWorkspaceMCPTools` until it returns a
non-empty list or `workspaceMCPPrimeMaxWait` (30s) elapses, with a 2s
backoff between attempts. The bounded wait handles the short race
between agent-online and the agent's MCP `Connect` settling.
- The primer runs asynchronously so the tool itself never blocks. Some
templates simply do not advertise MCP tools, in which case the primer
would otherwise spend its full budget for nothing.
- The primer shares the chat `ctx` (not a detached one) so it is
canceled together with the chat. A dangling primer would re-dial the
workspace conn after `runChat`'s deferred `workspaceCtx.close()` and
leak that conn.
- `inflight.Add(1)` ensures server shutdown still waits for any
in-progress primer.
- `PrepareTools` is simplified back to a single discovery call. It now
only sets `workspaceMCPDiscovered = true` on success, so an empty result
no longer permanently blocks discovery within the turn. The cache hit
warmed by the primer makes that call cheap in the common case; the dial
fallback handles the rare cache miss.

## Tests

All in `coderd/x/chatd/chatd_internal_test.go`:

- `TestPrimeWorkspaceMCPCache_SuccessOnFirstAttempt` — single
`ListMCPTools` call returning tools populates the cache.
- `TestPrimeWorkspaceMCPCache_RetriesUntilToolsAppear` — first call
empty, second returns tools; primer retries past the backoff and writes
the cache. Uses `quartz.Mock.Trap` on `NewTimer`.
- `TestPrimeWorkspaceMCPCache_GivesUpAfterDeadline` — `ListMCPTools`
always empty; primer stops at `workspaceMCPPrimeMaxWait` and refuses to
cache the empty result so PrepareTools can retry on the next step.

The existing integration test
`TestRunChat_WorkspaceMCPDiscoveryAfterMidTurnCreateWorkspace` continues
to pass and now also exercises the async-primer path end-to-end via the
create_workspace tool.

```
go test ./coderd/x/chatd/... -count=1
go test ./coderd/x/chatd/ -race -count=1
make pre-commit
```

<details>
<summary>Design notes</summary>

- The first iteration of this PR added retry+cooldown+failure-cap logic
inside `PrepareTools`. It worked for the customer's ~30s race window but
did not help workspaces that take several minutes to boot, because
`PrepareTools` only fires between LLM steps. Reviewer pointed out the
right place to handle this is the tool itself; the current
implementation does that.
- Why async: a primer that ran synchronously inside the `OnChatUpdated`
callback blocked the create_workspace tool from returning for up to
`workspaceMCPPrimeMaxWait`, which broke
`TestCreateWorkspaceTool_EndToEnd` and would hurt any template that does
not expose MCP tools. Decoupling lets the tool return immediately and
lets the primer warm the cache concurrently with the next LLM step.
- Why share the chat `ctx` rather than `context.WithoutCancel(ctx)` (the
title-generation pattern): the primer touches
`workspaceCtx.getWorkspaceConn`, which `runChat`'s deferred
`workspaceCtx.close()` invalidates. A detached primer outliving the chat
would dial a fresh conn and leak it.
- The constant naming distinguishes `workspaceMCPDiscoveryTimeout` (35s
per-call dial budget, unchanged from #25169) from
`workspaceMCPPrimeMaxWait` (30s total budget for the post-ready primer
loop) and `workspaceMCPPrimeRetryInterval` (2s between empty-result
retries).

</details>

Follow-up to #25169.

---

_This pull request was generated by Coder Agents._
2026-05-18 07:55:56 -04:00
Ethan a35f71cd8a fix(coderd/x/chatd): retry HTTP/2 stream resets (#25170)
Mid-stream HTTP/2 peer resets from LLM providers can arrive after a 200
streaming response has already emitted provisional parts. Previously
those resets fell through as generic non-retryable errors because
`stream ID` messages did not match retryable transport signals, and
stream IDs could be misread as HTTP statuses.

Classify retryable HTTP/2 RST_STREAM codes as transient timeout
failures, ignore stream IDs during status extraction, and keep the
existing `retry` event as the rollback boundary for provisional message
parts so replacement attempts do not replay failed-attempt output.

Closes CODAGT-382
2026-05-14 11:40:43 +10:00
Kyle Carberry 5a5cd79c4c fix: drop buffered chat parts after their durable message commits (#25164) 2026-05-12 00:30:38 -04:00
Kyle Carberry 0ed57ee343 fix(coderd/x/chatd): checkpoint buffered message_parts to avoid stale replay (#25145) 2026-05-11 17:27:03 -04:00
Michael Suchacz 6bb88775ab test(coderd/x/chatd): pin TestGetWorkspaceConn_StatusCheck to mock clock (#25130)
The `TimedOutAgentCacheHit`, `CacheHitHealthyAgent`, and
`CacheHitDBError` subtests of `TestGetWorkspaceConn_StatusCheck` built
their `WorkspaceAgent` timestamps with `time.Now()` in the parent test's
slice literal and then ran the actual check against the server's real
wall clock (`quartz.NewReal()`). On slow Windows CI runners, more than
`agentInactiveDisconnectTimeout` (30s) of wall time can elapse between
slice construction and the parallel subtest body. In that window, the
cached "healthy" agent gets reclassified as disconnected by
`agentDisconnectedFor`, and `CacheHitHealthyAgent` fails with
`errChatAgentDisconnected` instead of returning the cached connection.

Build each agent inside the subtest with `quartz.NewMock(t)` and feed
the same clock into the `Server` so the agent timestamps and the status
math share a single frozen `now`. This matches the pattern already used
by `TestGetWorkspaceConn_DialTimeoutDisconnectedRecoveryThreshold` in
the same file.

Closes https://github.com/coder/internal/issues/1522

<details>
<summary>Verification</summary>

Inserting `time.Sleep(35 * time.Second)` at the top of each subtest's
body reliably reproduces the original failure
(`errChatAgentDisconnected` on `CacheHitHealthyAgent`) on the parent
commit and passes with this change. After removing the synthetic sleep,
`go test ./coderd/x/chatd -run TestGetWorkspaceConn_StatusCheck
-count=50` passes cleanly.

</details>

> Generated by Coder Agents on behalf of the assignee.

Co-authored-by: Coder Agents <noreply@coder.com>
2026-05-11 19:53:58 +02:00
Ethan bd6cc1aaf2 feat(coderd): add stop_workspace chatd tool and recovery classification (#24997)
## Summary

Adds a `stop_workspace` tool to chatd so the model can recover from the
"workspace running but agent dead" failure mode (e.g. an OOM that leaves
the workspace running but the agent unreachable) by stopping and then
starting the workspace.

<img width="924" height="742" alt="image"
src="https://github.com/user-attachments/assets/279dedb6-6e29-4fe1-8754-3a1f01e538bf"
/>



## What changed

**New `stop_workspace` chatd tool**
(`coderd/x/chatd/chattool/stopworkspace.go`). Mirrors `start_workspace`:
shares `WorkspaceMu` to serialize with create/start, waits for any
in-progress build before issuing a stop, and is idempotent only after a
successful Stop transition. Failed stop builds re-attempt rather than
reporting success.

**New `chatStopWorkspace` coderd hook** (`coderd/exp_chats.go`). Mirrors
`chatStartWorkspace` minus the `RequireActiveVersion` gate. Stop should
not be blocked by template version policy.

**Differentiated recovery sentinels** (`coderd/x/chatd/chatd.go`).
`errChatAgentDisconnected` instructs the model to call `stop_workspace`
then `start_workspace`. `errChatDialTimeout` instructs a single retry,
then user escalation if it repeats. The previous single message
conflated transient and persistent failures.

**Two-signal recovery gate.** Recovery is only surfaced when a tool call
times out *and* a fresh DB read of the latest workspace agent says
`Disconnected`. The previous draft escalated on the DB read alone, which
would fire on a 30-second heartbeat blip (e.g. agent respawn) and prompt
a destructive stop/start unnecessarily.

**Cache-hit disconnected handling** now clears the cache and retries a
fresh dial before escalating, rather than returning the recovery
sentinel immediately. Latest-agent classification uses
`GetWorkspaceAgentsInLatestBuildByWorkspaceID` instead of the chat's
bound `AgentID`, so stale bindings after a rebuild don't misclassify.

**Shared chattool helpers** in `coderd/x/chatd/chattool/chattool.go`:
`latestWorkspaceBuildAndJob`, `publishBuildBinding`,
`provisionerJobTerminal`. Applied to both `start_workspace` and
`stop_workspace`.

## Notes

- Reverts an earlier draft that widened `ask_user_question` to root
standard turns. Plan-mode-only behavior is restored.
- The `stop_workspace` tool currently renders via the generic chat
tool-call UI. A follow-up frontend PR will prettify the `stop_workspace`
tool and style it like the `start_workspace` tool.
- Never-connected (`Timeout` status) agents are intentionally excluded
from recovery. They indicate template or startup failure, not the
running-but-dead case this PR targets.

Closes CODAGT-315
2026-05-11 16:23:07 +10:00
Ethan de9cdca77e fix(coderd): handle external-agent workspaces honestly in chat (#24969)
## Summary

Make Coder's chat agent honest about workspaces that use
`coder_external_agent`. Three behaviors change so the chat stops
pretending it can drive an external workspace through to a usable state
on its own.

<img width="859" height="537" alt="image"
src="https://github.com/user-attachments/assets/0561442b-95f1-4a2d-853c-7e3776114680"
/>


## Problem

External agents are not started by Coder. The user has to run `coder
agent` on their own host with a token Coder generates. Before this
change, the chat agent treated those workspaces like any other:

- `create_workspace` would enqueue a build for an external-agent
template and then wait minutes (~22 worst case) for an agent that was
never going to come up.
- When mid-turn tool calls dialed an external agent that was not
connected, the chat burned the full 30-second dial timeout and returned
generic "the workspace may need to be restarted from the Coder
dashboard" guidance, which is not the action the user can take.
- Nothing told the chat (or the user, through the chat) that the next
action lives outside Coder.

## Fix

Three changes scoped to `coderd/x/chatd/`:

1. **`create_workspace` blocks templates with external agents.** The
tool reads `template_versions.has_external_agent` for the template's
active version and refuses external-agent templates with a message
instructing the chat to pick a different template, or to have the user
create and start the workspace themselves and then attach it.

2. **Attaching an existing external workspace stays open.** No
selection-time gate on attachment; users can still bind a working
external workspace to a chat.

3. **External-agent-aware error handling on connection.** Two
complementary changes both predicated on proven connectivity failures
rather than every dial error:

- **`getWorkspaceConn` preflight and timeout handling.** Before opening
a connection, the cache-miss path reads the agent's status from the
already-loaded row. If the selected agent is external and clearly
offline according to the existing `isAgentUnreachable` helper
(`Disconnected` or `Timeout`, never `Connecting`), it returns an
external-agent-specific error immediately instead of waiting out the
30-second dial timeout. `Connecting` external agents fall through to the
dial so a user who just started the agent on their host can still
succeed in the same turn. The preflight only fires when the agent is
still the latest selected agent for the workspace, so stale-binding
recovery via `dialWithLazyValidation` is unaffected. The post-dial
rewrite is limited to the dial timeout sentinel; stale/no-agent bindings
and non-timeout dial failures preserve their original errors.

- **`waitForAgentReady` timeout-branch rewrite.** The 2-minute retry
loop used by `create_workspace` and `start_workspace` runs unchanged for
all agents. When the loop's outer deadline elapses, the timeout branch
substitutes the external-agent message in place of the raw dial error if
the agent belongs to an external resource.

This applies the same pattern that the cache-hit path of
`getWorkspaceConn` already used (`isAgentUnreachable` returning
`errChatAgentDisconnected`), extended to the cache-miss path and to the
readiness helper, with the external-agent-aware error rewrite layered
only on confirmed offline or timeout paths.

Closes CODAGT-314
2026-05-08 13:51:13 +10:00
Michael Suchacz 0bfb9f6f13 feat: show agent turn summary in agents sidebar (#24942)
Persists the agent-generated turn-end summary on `chats` and shows it as
the Agents sidebar subtitle when present, falling back to the model
name. Errors still take precedence.

> Mux is acting on Mike's behalf.

## What changes

**Storage.** New nullable `last_turn_summary` column on `chats`
(migration `000486`). New `UpdateChatLastTurnSummary` query normalizes
blank/whitespace input to `NULL`, preserves `updated_at` (so the chat
does not jump to the top of the sidebar on summary writes), and uses an
`expected_updated_at` stale-write guard so an older async summary cannot
overwrite a newer turn.

**Backend.** `coderd/x/chatd/chatd.go` decouples summary generation from
webpush. Generated summaries persist for completed parent turns even
when webpush is unconfigured or has no subscriptions. The same generated
text is reused as the webpush body when webpush is configured, so the
summary model is not called twice. Generic fallback push text is no
longer persisted; it clears any stale summary instead.
Error/interrupt/pending-action terminal paths clear `last_turn_summary`
for the latest turn.

**Frontend.** `AgentsSidebar.tsx` subtitle priority is now `errorReason
|| lastTurnSummary || modelName`, normalized via the existing
`asNonEmptyString` helper from `blockUtils.ts`.

## Tests

- `TestUpdateChatLastTurnSummary` (database): success,
whitespace-to-NULL, stale guard rejects, `updated_at` preserved.
- `TestUpdateLastTurnSummaryRejectsStaleWrites` (chatd internal): direct
stale-`expected_updated_at` test.
- `TestSuccessfulChatPersistsTurnSummaryWithoutWebPush`: persistence
works without webpush subscriptions.
- `TestSuccessfulChatSendsWebPushWithSummary`: same generated text
drives both DB and push body.
-
`TestSuccessfulChatSendsWebPushFallbackWithoutSummaryForEmptyAssistantText`:
fallback text is not persisted.
- `TestErroredChatClearsLastTurnSummaryAndSendsWebPush`: error path
clears the field.
- `TestInterruptChatDoesNotSendWebPushNotification`: interrupt path
clears the field, no push fires.
- `AgentsSidebar.test.tsx`: subtitle priority for summary-present,
error-wins, no-summary fallback, whitespace fallback.
- `AgentsSidebar.stories.tsx`: `ChatWithTurnSummary` and
`ChatWithTurnSummaryAndError`.

## Notes

- No backfill. Existing chats keep showing the model name until their
next turn completes.
- Parent chats only in this iteration; the field is rendered on any
`Chat` if a future change extends generation to children.
- Decoupling generation from webpush adds quickgen model calls for
completed parent turns that previously skipped generation when no
subscriptions existed. Existing parent-only, assistant-text-present,
`PushSummaryModel` configured, and bounded-timeout gates keep this
behavior bounded.
2026-05-06 16:43:35 +02:00
Ethan e5c7fdff86 fix(coderd/x/chatd): refresh chat status and bound subscriber reads on Subscribe (#24095)
Tightens the chat stream subscription path on a few related axes. None
of these changes touch the steady-state event flow; they all concern the
subscribe handshake.

## Motivation

`Server.Subscribe` carries three responsibilities that were entangled:

1. Authorize the caller against the chat row.
2. Arm local + pubsub subscriptions before any DB reads
(subscribe-first-then-query).
3. Build the initial snapshot from a fresh chat row, message history,
and queue.

When all three live in one function and share the request context, a few
unfortunate behaviors fall out:

- The HTTP handler's middleware already loaded and authorized the chat
row, but `Subscribe(chatID)` discarded it and re-fetched on every
WebSocket connection.
- The chat row used to populate the initial `status` event was loaded
*before* the pubsub subscription was armed, so a status transition that
happened in that window was silently lost.
- Control-path DB reads inherited whatever context the caller passed in.
A caller without a deadline could wedge a subscriber goroutine
indefinitely on a stalled DB.
- A transient failure of the chat re-read collapsed the entire
subscription instead of degrading gracefully.

## What changes

**Split the auth boundary out into the type signature.** A new
`SubscribeAuthorized(ctx, chat, ...)` takes the already-authorized row
directly. The HTTP handler in `coderd/exp_chats.go` calls it with the
chat row from `httpmw.ChatParam`, eliminating the redundant
`GetChatByID`. `Subscribe(chatID)` is preserved as a thin wrapper for
callers that don't have a chat row in hand (tests, internal callers); it
does the auth lookup and delegates.

**Re-read the chat after arming subscriptions.** Inside
`SubscribeAuthorized`, after the local stream and pubsub subscriptions
are active, we reload the chat row to populate the initial `status`
event and any enterprise relay setup. Combined with the existing
subscribe-first-then-query pattern, this closes the gap where a status
transition between the middleware's load and the subscription arming
would not appear in either the initial snapshot or a live notification.

**Fall back to the middleware row on refresh failure.** If the
post-subscription refresh fails (transient DB blip, brief pool
exhaustion), we log a warning and reuse the row that proved
authorization in the first place. Messages, queue, and pubsub are all
independent of this row, so the stream still works; the initial `status`
is just slightly stale and self-corrects via the next pubsub event.

**Bound subscriber control-path DB reads.** A new
`streamSubscriberControlFetchContext` helper applies a 5-second fallback
timeout only when the caller has no deadline of their own. Used at the
chat refresh, the initial queue load, and the queue-update goroutine
following pubsub notifications. HTTP-driven callers pass through
unchanged; background callers can no longer hang forever on a stalled DB
and leak subscriber goroutines, pubsub subscriptions, and `chatStreams`
entries.
2026-05-06 14:29:53 +10:00
Ethan 46a60e6d5d refactor: move chat error kinds into codersdk (#24955)
Moves the chat error kind taxonomy from `coderd/x/chatd/chaterror` into
`codersdk.ChatErrorKind` and types `ChatError.Kind` /
`ChatStreamRetry.Kind` so generated TypeScript exposes an SDK-owned
union, including `usage_limit`. Backend chat classification now
references the SDK constants directly while preserving the existing JSON
string values.

Keeps chat usage-limit admission failures on their existing 409 response
shape. The frontend maps structured usage-limit responses to the
SDK-owned `usage_limit` kind, uses generated `TypesGen.ChatErrorKind`
directly, and removes the local string union and alias.
2026-05-06 11:57:48 +10:00
Ethan 4751416b29 fix!: persist structured chat errors (#24919)
**Breaking change for changelog:**

> `codersdk.Chat.last_error` now returns a structured `ChatError` object
(`{message, kind, provider, retryable, status_code, detail}`) instead of
a plain string. The chats API is experimental
(`/api/experimental/chats`), so this ships without a deprecation cycle;
consumers reading `chat.last_error` as a string must update to read
`chat.last_error.message`. SDK/generated TypeScript terminal error
payloads now use the single `ChatError` type; the live stream error
payload type is renamed from `ChatStreamError` to `ChatError`.

Persisted chat errors now carry the same provider-specific detail (kind,
provider, retryable, HTTP status, optional detail) as the live stream,
so refreshing a failed chat rehydrates with the full structured error
instead of a one-line headline.

Existing rows are migrated in place: legacy text errors are wrapped into
`{message, kind: "generic"}` so already-errored chats still render, and
rows with `last_error IS NULL` stay NULL. Internally, persisted fallback
decoding now reuses the existing `chaterror.KindGeneric` constant, with
no JSON value change.

Closes CODAGT-239
2026-05-05 12:56:06 +10:00
Michael Suchacz 0bb09935bc feat: add computer-use provider selection for AI agents (#24772)
Adds a deployment-wide setting to select the computer-use provider
(Anthropic or OpenAI) for AI agents, plus the OpenAI computer-use runner
needed to honor that selection.

The setting is stored in `site_configs` under
`agents_computer_use_provider`, defaults to Anthropic when unset, and is
exposed via experimental GET/PUT endpoints under
`/api/experimental/chats/config/computer-use-provider`. The chatd
computer-use tool now dispatches to either `runAnthropicComputerUse` or
`runOpenAIComputerUse` based on the resolved provider, with
provider-specific result metadata for OpenAI screenshots.

Frontend adds a provider dropdown to the Agents Experiments settings
page nested under the virtual desktop toggle, with disabled state
handling while virtual desktop is off and skeleton loaders while config
queries are in flight.

Hugo and Codex review follow-up:
- Uses shared provider validation and clearer computer-use constant
names.
- Removes stale OpenAI pending-safety-checks commentary.
- Documents why provider result metadata is needed for OpenAI
screenshots.
- Keeps the computer-use subagent visible when provider credentials are
missing, then returns a clear spawn-time configuration error.
- Uses OpenAI's recommended 1600x900 screenshot geometry to preserve the
native 16:9 aspect ratio.
- Moves OpenAI-specific computer-use helpers into
`coderd/x/chatd/chatopenai/computeruse` after rebasing onto the provider
package refactor in `main`.
- Converts OpenAI pixel scroll deltas to Coder desktop wheel-click
amounts.
- Preserves OpenAI pointer modifiers with key down/up desktop actions
and rejects unsupported non-left double-click buttons explicitly.
- Maps OpenAI back/forward side-button clicks to browser navigation key
actions.
- Defaults omitted OpenAI click buttons to left-click.
- Retries mouse release cleanup if the final OpenAI drag release fails.
- Keeps computer-use subagent availability messages stable when provider
config cannot be loaded, while logging the backend error.
- Releases remaining OpenAI modifier keys if a synthetic key-up cleanup
action fails.
- Updates Storybook interaction stories so provider snapshots show the
selected final provider.

> Mux updated this PR description on behalf of Mike.
2026-05-04 20:30:50 +02:00
Michael Suchacz 033ed0bb82 feat: add admin-configurable chat title generation model (#24838)
Adds an admin-configurable deployment-wide setting that controls which
model is used for chat title generation. Admins can pick any enabled
chat model config from the Agents settings page, or leave the setting
unset to keep the existing fast-models-then-chat-model fallback
algorithm.

When a model is selected, both automatic and manual title generation use
only that model, with no silent fallback. When the configured model is
disabled, missing credentials, or otherwise unusable, automatic title
generation skips entirely (best-effort) and manual title regeneration
returns a clear error, so admins notice the misconfiguration instead of
silently routing title traffic through another provider.

## Surface

- New deployment-wide setting stored as a `site_configs` row
(`agents_chat_title_generation_model_override`).
- New experimental endpoint `GET/PUT
/api/experimental/chats/config/model-override/{context}`.
- Frontend: title generation now appears as a third dropdown on the
Agents admin settings page alongside the existing general and explore
context overrides.

## DRY refactors folded in

Title generation is integrated as a third value of the existing
`ChatModelOverrideContext` type alongside `general` and `explore`,
sharing the parameterized HTTP route, SDK methods, generated types, and
frontend API plumbing rather than introducing a parallel surface. The
`Agent` prefix was dropped from the type and route since title
generation is not a delegated agent.

The chatd model-override resolver is also shared.
`resolveConfiguredModelOverride` now takes a `failureMode` parameter:

- Subagent overrides use soft failure: misconfigured overrides are
logged and the parent model is used.
- Title generation uses hard failure: misconfigured overrides return an
explicit error so manual title regeneration surfaces the
misconfiguration and automatic title generation skips instead of
silently falling back.

> Mux is acting on Mike's behalf.
2026-05-04 13:13:00 +02:00
Michael Suchacz 203b0a9df8 refactor(coderd/x/chatd): extract OpenAI logic into chatopenai package (#24788)
Extracts OpenAI-specific logic from `coderd/x/chatd` into
`coderd/x/chatd/chatopenai` so the main chat path no longer references
`fantasyopenai` directly for chain mode info, response IDs, web search
tooling, or option mapping.

Structural refactor. The only deliberate behavioral narrowing is
consolidating Responses store checks and related keyed option or
metadata access on `opts[fantasyopenai.Name]`. That is documented by
`TestIsResponsesStoreEnabledIgnoresMalformedNonOpenAIKey` and is
unreachable in production where Responses options always live under
`fantasyopenai.Name`.

Summary:

- Moves OpenAI Responses chain mode info, response ID helpers, web
search tool construction, and provider option conversion into
`chatopenai`.
- Keeps Anthropic, Google, OpenRouter, and Vercel provider branches as
thin, existing code paths.
- `chatopenai` only imports `chatprompt` from chatd subpackages. It does
not import `chatd`, `chatloop`, `chatprovider`, or `chaterror`.
- Follow-up review fixes align helper names, keyed provider option
access, map cloning behavior, and PR documentation with the extracted
package boundary.
- Final sweep trims unused chain-mode state, removes a duplicate
store-check test case, drops an unused provider-tool parameter, and
shares the chat-message test helper through `chattest`.

> Mux is updating this PR on Mike's behalf.
2026-05-04 11:17:19 +02:00
Ethan 3a153ebb15 fix(coderd/x/chatd): replay retry phase on subscribe (#24569)
Retry events were previously fire-and-forget, so subscribers that
connected after a retry started only saw durable history plus
`status=running` and could not tell the stream was backing off.

Keep the current retry phase in `chatStreamState`, capture it atomically
with subscriber registration, replay it in the initial snapshot for
same-chat late joiners, and clear it when streaming resumes or ends so
reconnects get consistent retry state without duplicate delivery at the
subscription boundary.

Relates to CODAGT-139
2026-05-04 11:48:39 +10:00
Cian Johnston 2f855904be refactor: add dbgen chat generators and migrate test boilerplate (#24497)
- Adds chat-related dbgen generators covering defaults, overrides, and message field mapping.
- Replaces raw single-row chat, message, provider, and model-config setup in tests with dbgen helpers.
- Simplifies chat seed helpers after moving fixture setup into dbgen.

> Generated with [Coder Agents](https://coder.com/agents).
2026-05-01 13:29:33 +01:00
Thomas Kosiewski ab75e46f1d fix: record debug runs for proposed chat titles (#24820) 2026-04-29 16:45:48 +02:00
Mathias Fredriksson 782b7166a4 fix: preserve stream state on interrupt, fix auto-promote error handling (#24314)
When tryAutoPromoteQueuedMessage's insert fails, return the error
instead of swallowing it so the transaction rolls back and the
queued message survives. Previously the POP DELETE committed while
the INSERT silently failed, permanently losing the message.

Remove clearStreamState() from the pending/waiting status handler
in the frontend. The durable message event clears stream state via
the existing needsStreamReset path, eliminating the visual gap
where content vanishes before the persisted message arrives.

Fixes CODAGT-61
2026-04-29 14:08:35 +03:00
Cian Johnston 1666bff1f9 fix(coderd/x/chatd): block chain mode when provider missing tool results (#24782)
When `StopAfterTool` fires (e.g., `propose_plan`), the LLM response
containing a `function_call` is stored at OpenAI via `store=true`, but
the tool result is only persisted locally. On the next user message,
`resolveChainMode` sees the tool result in the local DB and concludes
all calls are resolved. Chain mode activates with
`previous_response_id`, but OpenAI rejects because its stored chain has
an unresolved `function_call`.

This adds a `providerMissingToolResults` check to `resolveChainMode`
that detects the `assistant(tool-call) → tool(result) → user` pattern
with no follow-up assistant message. The absence of a follow-up
assistant proves the tool results were never round-tripped to the
provider. When detected, chain mode is blocked and the system falls back
to full history replay, which includes both the tool call and its
result.

Deploying this fix un-bricks existing affected chats with no DB
migration needed.

> Generated by Coder Agents.
2026-04-28 15:30:04 +01:00
Michael Suchacz 62e9752acd fix: prevent malformed OpenAI Responses continuations (#24725)
> Worked on by Mux on Mike's behalf.

## Summary

- Disable OpenAI Responses `previous_response_id` chain mode when the
prior assistant response has unresolved local tool calls, so the next
request can include paired tool outputs instead of sending an incomplete
continuation.
- Update the fantasy pin to a Responses replay fix that preserves stored
reasoning references, only replays web search references when paired
with reasoning, and validates local function-call output pairing before
send.
- Add fake OpenAI Responses input validation for the two production 400
shapes and integration coverage for full-history reasoning plus web
search replay.
- Add sanitized diagnostics for the OpenAI Responses continuity errors.

## Tests

- `go test ./providers/openai -run
'TestResponsesToPrompt_(ReasoningWithStore|ReasoningWithWebSearchCombined|WebSearchRequiresReasoningReference|ReasoningWithFunctionCallCombined|WebSearchProviderExecutedToolResults)|TestPrepareParams_(SkipsProviderExecutedToolReferences|ValidatesFunctionCallOutputPairing)|TestValidateResponsesInput_WebSearchReferenceRequiresReasoning'
-count=1`
- `go test ./providers/openai -count=1`
- `GOWORK=off go test ./coderd/x/chatd/chattest -run
TestValidateResponsesAPIInput -count=1`
- `GOWORK=off go test ./coderd/x/chatd -run
'TestOpenAIResponses(NoStaleWebSearchReplay|FullReplayPairsReasoningAndWebSearch|ChainModeSkipsWhenLocalCallPending|ChainModeStillFiresForProviderExecutedOnly)$|TestResolveChainMode_'
-count=1`
- `GOWORK=off go test ./coderd/x/chatd/chatprompt -run
'TestInjectMissingToolResults_' -count=1`
- `GOWORK=off go test ./coderd/x/chatd/chaterror -run
TestClassify_OpenAIResponsesAPIDiagnostics -count=1`
- `GOWORK=off go test ./coderd/x/chatd/... -count=1`
- `git diff --check`
- `git commit` pre-commit hook
2026-04-26 21:23:06 +02:00
Michael Suchacz dbcc654d28 feat: snapshot explore subagent tool entitlements (#24638)
Explore sub-agents previously could not use `web_search` or external MCP
tools. `runChat` hard-skipped both for Explore. Lifting those guards
naively would over-grant tools, because a child chat could outlive the
spawning turn's plan-mode filter.

This change persists the spawning parent turn's filtered external MCP
server IDs onto the child Explore chat, and simplifies the Explore
provider-tool filter in `runChat`:

- New `resolveExploreToolSnapshot` helper: computes the child's
inherited external MCP subset by running the parent's configs through
`filterExternalMCPConfigsForTurn` (plan-mode policy) and, if the parent
is itself an Explore child, further narrowing to the parent's own
persisted `MCPServerIDs`. The result is written to the child's
`MCPServerIDs` column at spawn time.
- The existing `mcp_server_ids` column is the sole durable snapshot. No
new chat column is added.
- `runChat` for Explore children: loads MCP tools from the persisted
snapshot, and keeps only `web_search` from provider-native tools (to
block computer-use and other write-style tools, since Explore is
read-only). Whether `web_search` is actually available is a per-model
decision, determined by the current model config, just like a main chat.
- Built-in Explore allowlist is unchanged. Workspace-local MCP remains
excluded for Explore.

Verification: `go build ./...`, `go test ./coderd/x/chatd/... -count=1`,
`make gen` (clean tree), `make lint/emdash`, `go vet`. Deep-review ran
12 reviewers on the feature and 5 on the clarity refactor; CAR reviewed
and approved; a subsequent scope reduction dropped a temporary
`allow_web_search` column in favor of per-model handling.

> Mux is acting on Mike's behalf.
2026-04-23 19:07:38 +02:00
Mathias Fredriksson 1ace519c6e fix(coderd/x/chatd): remove cache-miss check blocking agent recovery (#24634)
The cache-miss isAgentUnreachable check added in #24336 runs before
dialWithLazyValidation, preventing the existing switch mechanism from
discovering the new agent after a workspace rebuild. The chat's stale
agent binding is never repaired, causing an infinite loop of
'agent is disconnected' errors.

Remove the cache-miss check. The cache-hit check remains (it verifies
the agent behind an established connection). The dial timeout and
dialWithLazyValidation already bound the cache-miss failure path.

Closes CODAGT-248
2026-04-22 21:49:10 +03:00
Mathias Fredriksson 78d9a220cf fix(coderd/x/chatd): detect disconnected agents in getWorkspaceConn (#24336)
Add agent status check and dial timeout to getWorkspaceConn to
prevent tool calls from hanging when a workspace agent disconnects.

Status check: call isAgentUnreachable on every getWorkspaceConn
call. On cache miss, check the freshly fetched agent row. On
cache hit, re-fetch the agent row by PK for a fresh heartbeat
timestamp. Disconnected and timed-out agents return a sentinel
immediately; connecting agents proceed to dial.

Dial timeout: wrap dialWithLazyValidation in a 30s
context.WithTimeoutCause (matching 8 other server-side AgentConn
callers). Parent context cancellation propagates unchanged so
the chatloop can detect ErrInterrupted.

Both sentinels tell the LLM the agent is unreachable and the
workspace may need restarting from the dashboard.

Closes CODAGT-149
2026-04-22 12:10:32 +00:00
Ethan c1421b4ead test(coderd/x/chatd): deflake stale control notification test (#24545)
Previously, `TestProcessChat_IgnoresStaleControlNotification` could
return as soon as `UpdateChatStatus` ran, even though `processChat`
still re-read chat state and finished deferred cleanup afterward. That
let gomock and quartz teardown race the tail of cleanup and
intermittently fail the test.

Wait for `processChat` itself to return before asserting the final
status, while keeping the existing strict mock expectations intact.

Closes https://github.com/coder/internal/issues/1479
2026-04-22 00:08:34 +10:00
Michael Suchacz f073323c89 refactor: unify subagent spawn behind spawn_subagent (#24535)
Unify the three subagent spawn tools (`spawn_agent`,
`spawn_explore_agent`, `spawn_computer_use_agent`) behind a single
`spawn_subagent` tool keyed by a `subagent_type` discriminant
(`general`, `explore`, `computer_use`). Mirrors the single-entry-point
pattern already used by `task` in mux while keeping `wait_agent`,
`message_agent`, and `close_agent` as separate lifecycle tools.

A new backend subagent definition catalog
(`coderd/x/chatd/subagent_catalog.go`) is the source of truth for tool
description, prompt guidance, availability rules (plan mode,
desktop/Anthropic gating), and child-chat option building.
`spawn_subagent` advertises only the types available in the current
context and validates `subagent_type` server-side; context inheritance
still flows through the existing `createChildSubagentChatWithOptions`
path. `wait_agent`, `message_agent`, and `close_agent` responses now
include a server-derived `subagent_type` so the UI stops inferring
lifecycle state from tool names.

The frontend gets a shared normalization helper
(`site/src/pages/AgentsPage/components/ChatElements/tools/subagentDescriptor.ts`)
that maps either legacy tool names or new `spawn_subagent` args into a
common descriptor (action, variant, icon, fallback copy). Legacy
transcripts still render identically; `Tool.tsx`, `SubagentTool.tsx`,
`ToolLabel.tsx`, `ToolIcon.tsx`, and `messageParsing.ts` now key off the
descriptor instead of hard-coded names. Existing UI copy is preserved
(`Spawning Explore agent...`, `Using the computer...`, computer-use
monitor icon and Open Desktop affordance).

> This PR was opened by Mux working on Mike's behalf.
2026-04-21 14:01:32 +02:00
Michael Suchacz 9d0469fc4c feat: allow approved external MCP tools in root plan mode (#24509)
## Summary

Allow root plan-mode chats to use MCP tools from external servers that
an admin has explicitly approved for plan mode. Workspace MCP and
plan-mode subagents remain blocked.

## Problem

`chatd.go` excluded every MCP tool when `isPlanModeTurn` was true, so
planning had no access to tools like docs search, ticketing, etc.
Lifting that guard wholesale was unsafe: `mcp_server_configs` already
has centralized admin governance, but workspace-local MCP (discovered
from agent `.mcp.json`) does not, and subagents use a narrower trust
boundary.

## Fix

Add an admin-controlled per-server `allow_in_plan_mode` flag (default
`false`) and gate plan-mode MCP access on it.

### Backend / schema
- New migration `000472_mcp_server_allow_in_plan_mode.{up,down}.sql` and
matching fixture update.
- `mcpserverconfigs.sql` + generated code: persist and read the new
column.
- `codersdk/mcp.go`: thread the field through `MCPServerConfig`,
`Create*`, and `Update*` request types.
- `coderd/mcp.go`: validate, persist, and return the flag in
get/list/create/update handlers.

### chatd
- `coderd/x/chatd/chatd.go`: pre-filter selected external MCP configs by
`AllowInPlanMode` before calling `mcpclient.ConnectAll` on plan-mode
root turns. Workspace MCP discovery is skipped entirely on plan-mode
turns.
- Single helper decides whether a tool is available in plan mode, used
both at construction and for active-tool filtering (defense in depth).
Plan-mode subagents, dynamic tools, provider-native tools, computer-use,
and workspace MCP stay unchanged.
- `coderd/x/chatd/prompt.go`: update the root plan-mode overlay text to
match the new boundary.

### UI
- `MCPServerAdminPanel.tsx`: add an explicit toggle ("Allow all tools
from this MCP server in root plan mode") next to the existing governance
controls.
- Regenerated `site/src/api/typesGenerated.ts`.

### Docs
- `docs/ai-coder/agents/architecture.md`: replace the blanket "MCP is
unavailable in plan mode" note with the new root-only, external-only,
admin-approved policy. Explicitly call out that workspace MCP and
plan-mode subagents are still excluded.

### Tests
- Plan-mode visibility (approved vs non-approved external server).
- Plan-mode invocation of an approved external MCP tool.
- End-to-end plan-mode workflow that uses an approved MCP tool and then
reaches `propose_plan`.
- Regressions: workspace MCP still excluded in plan mode; plan-mode
subagents still on the restricted tool boundary; existing tool
allow/deny list filtering still applies.

## Policy precedence

`allow_in_plan_mode` is an **additional** requirement on top of existing
`enabled`, availability, chat-selected / forced server IDs, and tool
allow/deny lists. It approves **all tools on that server** for root plan
mode; a per-tool plan allowlist is deliberately deferred.

## Follow-ups (explicitly out of scope)

- Whether plan-mode subagents should inherit approved external MCP
tools.
- Workspace-local MCP safety model (agent-side `.mcp.json` schema vs. a
coderd-managed workspace MCP config).

## Validation

- `go vet ./coderd/x/chatd/...`
- `go test ./coderd/x/chatd -run 'TestPlan.*|TestMCP.*' -count=1`
- `go test ./coderd/x/chatd -count=1 -timeout 5m` (full chatd suite)
- `make fmt` (no diff)

> Mux opened this PR on Mike's behalf.
2026-04-21 12:26:12 +02:00
Jaayden Halko 410f9a5e19 feat: allow renaming of agent chat title (#24489)
Co-authored-by: Coder Agents <noreply@coder.com>
2026-04-20 14:00:46 +01:00
Thomas Kosiewski df7e838c21 feat(coderd): wire debug logging into chat lifecycle (#23917) 2026-04-20 12:27:16 +02:00
Cian Johnston 3f6b40a833 fix: reap idle chatd stream states on a timer (#24476)
* Adds `streamJanitorLoop` to clean up stale streams every 30s
* zeroes dropped slots to aid in gc-eligibliity
* Adds regression tests in coderd/x/chatd and enterprise/coderd/x/chatd

> 🤖
2026-04-17 19:22:00 +01:00
Michael Suchacz 73b5058923 feat: add Explore mode as subagent-only modality (#24448)
> This PR was authored by Mux on behalf of Mike.

Introduce Explore mode, a read-only subagent modality for delegated
discovery and code investigation.

## What

Adds a `spawn_explore_agent` tool that creates child chats restricted to
read-only operations. An admin can optionally configure a
deployment-wide
model override so Explore subagents use a model optimized for large
context
or reasoning without changing the root chat's model.

### Backend

- New `ChatModeExplore` enum value (migration 000471).
- `spawn_explore_agent` tool definition with read-only allowlist:
`read_file`, `execute`, `process_output`, `read_skill`,
`read_skill_file`.
  Write tools, file editors, and nested subagent spawning are blocked.
- Deployment config storage for the Explore model override
  (`agents_chat_explore_model_override` in `site_configs`).
- Model resolution hierarchy: configured override, then current turn
model,
then global default. Silent fallback with warning log when the override
  becomes unavailable.
- RBAC: `AsChatd` for daemon reads, `ActionRead` and `ActionUpdate` on
  `ResourceDeploymentConfig` for admin API calls.
- Plan mode root chats can use `spawn_explore_agent` for read-only
research,
  matching the planning prompt guidance.
- The Explore override config API now reports malformed saved overrides
as
  "treated as unset" so admins can clear them explicitly.

### Frontend

- `ExploreModelOverrideSettings` component in admin agent behavior
settings.
  Uses `ModelSelector`, handles unavailable model warnings, and supports
  explicit Save and Clear actions.
- Malformed saved overrides show a warning and require an explicit Save
to
  clear, instead of Clear auto-submitting behind the scenes.

### Tests

- Integration: `TestExploreSubagentIsReadOnly` (full spawn flow, tool
  verification, prompt overlay, DB state).
- Unit: tool allowlist tests for explore, plan, and default modes.
- Internal: model override resolution with valid, invalid UUID,
disabled, and
  unconfigured override scenarios.
- RBAC: `dbauthz_test.go` for `GetChatExploreModelOverride` and
  `UpsertChatExploreModelOverride`.
- API: admin set and clear, malformed stored override reporting,
disabled
  model rejection, non-admin denial.
2026-04-17 13:40:17 +02:00
Michael Suchacz 1cf0354f72 feat: add plan mode with restricted tool boundary (#24236)
> This PR was authored by Mux on behalf of Mike.

## Summary
- add persistent plan mode for chats and the chat-specific plan file
flow
- add structured planning tools such as `ask_user_question` and
`propose_plan`
- keep `write_file` and `edit_files` constrained to the chat-specific
plan file during plan turns
- allow shell exploration in plan mode, including subagents, via
`execute` and `process_output`
- block implementation-oriented, provider-native, MCP, dynamic, and
computer-use tools during plan turns
- update the chat UI, tests, and docs for the new planning flow
2026-04-16 11:12:01 +02:00
Cian Johnston d7439a9de0 feat: add Prometheus metrics for chatd subsystem (#24371)
Adds 7 Prometheus metrics to the chatd subsystem and introduces typed
`ActivityBumpReason` for deadline bump attribution.

| Metric | Type | Labels |
|--------|------|--------|
| `coderd_chatd_chats` | Gauge | `state` (streaming, waiting) |
| `coderd_chatd_message_count` | Histogram | `provider` |
| `coderd_chatd_prompt_size_bytes` | Histogram | `provider` |
| `coderd_chatd_tool_result_size_bytes` | Histogram | `provider`,
`tool_name` |
| `coderd_chatd_ttft_seconds` | Histogram | `provider` |
| `coderd_chatd_compaction_total` | Counter | `provider`, `result` |
| `coderd_chatd_steps_total` | Counter | `provider` |

> 🤖
2026-04-15 19:53:10 +01:00
Danielle Maywood 38d4da82b9 refactor: send raw typed payloads over chat WebSockets (#24148) 2026-04-10 10:47:30 +01:00
Kyle Carberry 391b22aef7 feat: add CLI commands for managing chat context from workspaces (#24105)
Adds `coder exp chat context add` and `coder exp chat context clear`
commands that run inside a workspace to manage chat context files via
the agent token.

`add` reads instruction and skill files from a directory (defaulting to
cwd) and inserts them as context-file messages into an active chat.
Multiple calls are additive — `instructionFromContextFiles` already
accumulates all context-file parts across messages.

`clear` soft-deletes all context-file messages, causing
`contextFileAgentID()` to return `!found` on the next turn, which
triggers `needsInstructionPersist=true` and re-fetches defaults from the
agent.

Both commands auto-detect the target chat via `CODER_CHAT_ID` (already
set by `agentproc` on chat-spawned processes), or fall back to
single-active-chat resolution for the agent. The `--chat` flag overrides
both.

Also adds sub-agent context inheritance: `createChildSubagentChat` now
copies parent context-file messages to child chats at spawn time, so
delegated sub-agents share the same instruction context without
independently re-fetching from the workspace agent.

<details><summary>Implementation details</summary>

**New files:**
- `cli/exp_chat.go` — CLI command tree under `coder exp chat context`

**Modified files:**
- `agent/agentcontextconfig/api.go` — `ConfigFromDir()` reads context
from an arbitrary directory without env vars
- `codersdk/agentsdk/agentsdk.go` — `AddChatContext`/`ClearChatContext`
SDK methods
- `coderd/workspaceagents.go` — POST/DELETE handlers on
`/workspaceagents/me/chat-context`
- `coderd/coderd.go` — Route registration
- `coderd/database/queries/chats.sql` — `GetActiveChatsByAgentID`,
`SoftDeleteContextFileMessages`
- `coderd/database/dbauthz/dbauthz.go` — RBAC implementations for new
queries
- `coderd/x/chatd/subagent.go` — `copyParentContextFiles` for sub-agent
inheritance
- `cli/root.go` — Register `chatCommand()` in `AGPLExperimental()`

**Auth pattern:** Uses `AgentAuth` (same as `coder external-auth`) —
agent token via `CODER_AGENT_TOKEN` + `CODER_AGENT_URL` env vars.

</details>

> 🤖 Generated by Coder Agents

---------

Co-authored-by: Michael Suchacz <203725896+ibetitsmike@users.noreply.github.com>
2026-04-09 16:33:00 +02:00
Kyle Carberry 684f21740d perf(coderd): batch chat heartbeat queries into single UPDATE per interval (#24037)
## Summary

Replaces N per-chat heartbeat goroutines with a single centralized
heartbeat loop that issues one `UPDATE` per 30s interval for all running
chats on a worker.

## Problem

Each running chat spawned a dedicated goroutine that issued an
individual `UPDATE chats SET heartbeat_at = NOW() WHERE id = $1 AND
worker_id = $2 AND status = 'running'` query every 30 seconds. At 10,000
concurrent chats this produces **~333 DB queries/second** just for
heartbeats, plus ~333 `ActivityBumpWorkspace` CTE queries/second from
`trackWorkspaceUsage`.

## Solution

New `UpdateChatHeartbeats` (plural) SQL query replaces the old singular
`UpdateChatHeartbeat`:

```sql
UPDATE chats
SET    heartbeat_at = @now::timestamptz
WHERE  worker_id = @worker_id::uuid
  AND  status = 'running'::chat_status
RETURNING id;
```

A single `heartbeatLoop` goroutine on the `Server`:
1. Ticks every `chatHeartbeatInterval` (30s)
2. Issues one batch UPDATE for all registered chats
3. Detects stolen/completed chats via set-difference (equivalent of old
`rows == 0`)
4. Calls `trackWorkspaceUsage` for surviving chats

`processChat` registers an entry in the heartbeat registry instead of
spawning a goroutine.

## Impact

| Metric | Before (10K chats) | After (10K chats) |
|---|---|---|
| Heartbeat queries/sec | ~333 | ~0.03 (1 per 30s per replica) |
| Heartbeat goroutines | 10,000 | 1 |
| Self-interrupt detection | Per-chat `rows==0` | Batch set-difference |

---

> 🤖 Generated by Coder Agents

<details><summary>Implementation notes</summary>

- Uses `@now` parameter instead of `NOW()` so tests with `quartz.Mock`
can control timestamps.
- `heartbeatEntry` stores `context.CancelCauseFunc` + workspace state
for the centralized loop.
- `recoverStaleChats` is unaffected — it reads `heartbeat_at` which is
still updated.
- The old singular `UpdateChatHeartbeat` is removed entirely.
- `dbauthz` wrapper uses system-level `rbac.ResourceChat` authorization
(same pattern as `AcquireChats`).

</details>
2026-04-07 10:25:46 -04:00
Kyle Carberry 919dc299fc feat: agent reads context files and discovers skills locally (#23935)
Piggybacks on #23878. Moves instruction file reading and skill discovery
from `chatd` (server-side, via multiple `LS`/`ReadFile` round-trips
through the agent connection) to the agent itself (local filesystem
access).

This intentionally drops backward compatibility with older agents that
don't support the context-config endpoint. Agents and server are
deployed together; there is no rolling-update contract to maintain here.

## What changed

The agent's `GET /api/v0/context-config` response now returns
`[]ChatMessagePart` directly — the same types chatd persists. This
eliminates intermediate type conversions and makes the protocol
extensible.

| Field | Type | Description |
|---|---|---|
| `parts` | `[]ChatMessagePart` | Context-file and skill parts, ready to
persist |
| `working_dir` | `string` | Agent's resolved working directory |

Removed from the response: `instructions_dirs`, `instructions_file`,
`skills_dirs`, `skill_meta_file`, `mcp_config_files` — the agent reads
files locally and returns their content as parts.

Removed from chatd: all legacy `LS`/`ReadFile` fallback code
(`readHomeInstructionFile`, `readInstructionDirFile`, `DiscoverSkills`
via LS, etc).

## Why

The previous architecture had the agent resolve paths, serve them over
HTTP, then `chatd` make N+1 round-trips back through the agent
connection to read files. The agent has direct filesystem access and
should just read the files.

## Key design decisions

- **Agent returns `ChatMessagePart` directly** — same types chatd
persists. No intermediate `InstructionFileEntry`/`SkillEntry` types
needed.
- **`SkillMeta.MetaFile`** — persisted via `ContextFileSkillMetaFile` on
the skill part, so custom meta file names
(`CODER_AGENT_EXP_SKILL_META_FILE`) survive across chat turns.
- **No pre-read body** — `read_skill` always dials the workspace to
fetch the skill body on demand. Simpler than caching the body in the
response.
- **MCP config paths kept agent-internal** — `MCPConfigFiles()` getter,
not sent over the wire.
- **No backward compat fallback** — old agents that don't support
context-config get no instruction files. This is acceptable since agent
and server deploy together.
2026-04-04 12:45:46 -04:00
Michael Suchacz 7d0a0c6495 feat: provider key policies and user provider settings (#23751) 2026-04-02 19:46:42 +02:00
Ethan fc1e0beb3b fix(coderd/x/chatd): use structured output for chat title generation (#23909)
Chat title generation used free-form text completion, which let models
respond conversationally instead of producing a title. Review chats
started with GitHub URLs were especially affected — models would say "I
don't have the ability to browse external links" and that string became
the persisted title.

Replace the raw-text `generateShortText` path with structured output via
`object.Generate[generatedTitle]`. Both auto-title and manual retitle
now go through the same typed contract: the model must return a JSON
object with a `title` field, validated and normalized before
persistence. Invalid outputs (empty, too long) are rejected and retried
through the existing candidate-model fallback loop.
2026-04-02 14:13:27 +11:00
Kyle Carberry ee855f9618 feat: make agent context paths configurable via env vars (#23878)
Replace hardcoded paths for instruction files, skills, and MCP config
with
values read from `CODER_AGENT_EXP_*` environment variables. Template
authors
configure paths via the existing `coder_agent` `env` block. The agent
resolves `~`, relative, and absolute paths locally, then serves the
resolved config over `GET /api/v0/context-config`. `chatd` fetches this
once per workspace attach and falls back to today's defaults for older
agents.

All path env vars are comma-separated, allowing multiple directories:

| Env Var | Default | Controls |
|---|---|---|
| `CODER_AGENT_EXP_INSTRUCTIONS_DIRS` | `~/.coder` | Dirs containing the
instruction file |
| `CODER_AGENT_EXP_INSTRUCTIONS_FILE` | `AGENTS.md` | Instruction file
name |
| `CODER_AGENT_EXP_SKILLS_DIRS` | `.agents/skills` | Skills directories
|
| `CODER_AGENT_EXP_SKILL_META_FILE` | `SKILL.md` | Skill metadata file
name |
| `CODER_AGENT_EXP_MCP_CONFIG_FILES` | `.mcp.json` | MCP config files |

### Example

```hcl
resource "coder_agent" "main" {
  os   = "linux"
  arch = "amd64"
  env = {
    CODER_AGENT_EXP_INSTRUCTIONS_DIRS  = "/opt/company/agent-config,~/.coder"
    CODER_AGENT_EXP_INSTRUCTIONS_FILE  = "CLAUDE.md"
    CODER_AGENT_EXP_SKILLS_DIRS        = "/opt/company/ai-skills,.agents/skills"
    CODER_AGENT_EXP_MCP_CONFIG_FILES   = "/opt/company/mcp.json,.mcp.json"
  }
}
```

<details>
<summary>Implementation Details</summary>

### Architecture

Follows the same pattern as MCP tool discovery:
agent resolves locally → exposes via HTTP → chatd consumes.

**Agent-side** (`agent/agentcontextconfig/`):
- `ResolvePath` / `ResolvePaths` handle `~`, relative, and absolute path
forms; returns `""` for relative paths when baseDir is empty
- `Config` reads env vars, falls back to defaults, resolves all paths
- `GET /api/v0/context-config` serves the resolved config as JSON

**chatd-side** (`coderd/x/chatd/`):
- Calls `conn.ContextConfig()` once on first workspace attach
- Falls back to hardcoded defaults on 404 (older agents)
- Iterates instruction dirs, skills dirs using resolved absolute paths
- `LSRelativityRoot` everywhere — no more home/root juggling

### Key design decisions

- **`EXP_` prefix**: env vars use `CODER_AGENT_EXP_*` to indicate
experimental status
- **Plural names**: comma-separated vars use plural names (`DIRS`,
`FILES`); single-value vars use singular (`FILE`)
- **Defaults in `workspacesdk`**: default constants live in
`codersdk/workspacesdk/` so both agent and server reference them without
cross-layer imports
- **`skillMetaFile` persistence**: stored on context-file parts via
`ContextFileSkillMetaFile` and restored on subsequent chat turns so
custom values survive across turns
- **Working dir dedup**: `slices.Contains` guard prevents reading the
same instruction file from both `InstructionsDirs` and the working
directory
- **MCP server dedup**: first-occurrence-wins dedup prevents leaking
duplicate connections from overlapping config files
- **ResolvePath safety**: returns `""` for relative paths when `baseDir`
is empty, so `ResolvePaths` filters them out

### Files changed

| File | Change |
|---|---|
| `agent/agentcontextconfig/` | New package — path resolution + HTTP
endpoint |
| `codersdk/workspacesdk/agentconn.go` | `ContextConfigResponse` type,
default constants, client method |
| `agent/agent.go` + `agent/api.go` | Wire up endpoint, pass config to
MCP |
| `agent/x/agentmcp/manager.go` | Accept `[]string` MCP config paths,
dedup by name |
| `coderd/x/chatd/chatd.go` | Fetch config, thread through, named
returns |
| `coderd/x/chatd/instruction.go` | Accept configurable dir + file name,
`skillMetaFileFromParts` |
| `coderd/x/chatd/chattool/skill.go` | Accept configurable dirs + meta
file |
| `codersdk/chats.go` | `ContextFileSkillMetaFile` field for persistence
|

### Test coverage

- `TestConfig` (4 cases): defaults, custom env vars, whitespace
trimming, comma-separated dirs
- `TestResolvePath` / `TestResolvePaths`: including empty baseDir edge
case
- `TestPersistInstructionFilesFallbackOnOlderAgent`: backward-compat
path when `ContextConfig` returns 404
- `TestChatMessagePartVariantTags`: updated exclusion list for new
internal field

### Backward compatibility

Older agents return 404 for the new endpoint. `chatd` catches this and
falls back to today's defaults via `readHomeInstructionFile` (using
`LSRelativityHome`). Existing workspaces work with no changes.

</details>
2026-04-01 12:28:47 -04:00
Cian Johnston a164d508cf fix(coderd/x/chatd): gate control subscriber to ignore stale pubsub notifications (#23865)
Fixes flaky `TestOpenAIReasoningWithWebSearchRoundTripStoreFalse` and
`TestOpenAIReasoningWithWebSearchRoundTrip`.

## Changes

- Gate the `processChat` control subscriber's cancel callback behind a
`chan struct{}` that is closed after publishing `"running"` status
- Add `TestGatedControlCancel` with 4 subtests exercising the gate logic

<details>
<summary>Root cause analysis</summary>

`SendMessage` publishes a `"pending"` notification on
`chat:stream:<chatID>` via PostgreSQL `NOTIFY`. `processChat` subscribes
to the same channel for control signals. Due to async NOTIFY delivery,
the `"pending"` notification can arrive at the control subscriber
**after** it registers its queue — even though it was published
**before**. `shouldCancelChatFromControlNotification("pending")` returns
`true`, immediately self-interrupting the processor before it does any
work.

The fix gates the cancel callback behind a closed channel. The channel
is closed after `processChat` publishes `"running"` status, so stale
notifications from before initialization are harmlessly ignored.
`close()` provides a happens-before guarantee in the Go memory model.
</details>

> 🤖 Written by a Coder Agent. Reviewed by a human.
2026-03-31 22:55:20 +01:00
Kyle Carberry a5cc579453 feat: add last_injected_context column to chats table (#23798)
Adds a nullable JSONB column `last_injected_context` to the `chats`
table that stores the most recently persisted injected context parts
(AGENTS.md context-file and skill message parts). The column is updated
only when `persistInstructionFiles()` runs — on first workspace attach
or when the agent changes — so there are no redundant writes on
subsequent turns.

Internal fields (`ContextFileContent`, `ContextFileOS`,
`ContextFileDirectory`, `SkillDir`) are stripped at write time so the
column only holds small metadata. No stripping needed on the read path.

<details>
<summary>Implementation notes</summary>

- New migration `000456` adds nullable `last_injected_context JSONB`
column.
- New SQL query `UpdateChatLastInjectedContext` writes the column
without touching `updated_at`.
- `persistInstructionFiles()` strips internal fields from parts via
`StripInternal()` before persisting.
- Sentinel path (no AGENTS.md) persists skill-only parts when skills
exist.
- `codersdk.Chat` exposes `LastInjectedContext []ChatMessagePart`
(omitempty).
- `db2sdk.Chat()` passes through the already-clean data.

</details>
2026-03-30 14:11:30 -04:00
Kyle Carberry 4d2b0a2f82 feat: persist skills as message parts like AGENTS.md (#23748)
## Summary

Skills are now discovered once on the first turn (or when the workspace
agent changes) and persisted as `skill` message parts alongside
`context-file` parts. On subsequent turns, the skill index is
reconstructed from persisted parts instead of re-dialing the workspace
agent.

This makes skills consistent with the AGENTS.md pattern and is
groundwork for a future `/context` endpoint that surfaces loaded
workspace context to the frontend.

## Changes

- Add `skill` `ChatMessagePartType` with `SkillName` and
`SkillDescription` fields
- Extend `persistInstructionFiles` to also discover and persist skills
as parts
- Add `skillsFromParts()` to reconstruct skill index from persisted
parts on subsequent turns
- Update `runChat()` to use `skillsFromParts` instead of re-dialing
workspace for skills
- Frontend: handle new `skill` part type (skip rendering, hide
metadata-only messages)

## Before / After

| | AGENTS.md | Skills |
|---|---|---|
| **Before** | Persist as `context-file` parts, reconstruct from parts |
In-memory `skillsCache` only, re-dial workspace on cache miss |
| **After** | Persist as `context-file` parts, reconstruct from parts |
Persist as `skill` parts, reconstruct from parts |

The in-memory `skillsCache` remains for `read_skill`/`read_skill_file`
tool calls that need full skill bodies on demand.

<details><summary>Design context</summary>

This is the first step toward a unified workspace context
representation. Currently:
- Context files are persisted as message parts (works)
- Skills were only in-memory (inconsistent)
- Workspace MCP servers are cached in-memory (future work)

Persisting skills as parts means a future `/context` endpoint can query
both context files and skills from the same message parts in the DB,
without depending on ephemeral server-side caches.
</details>
2026-03-29 21:48:17 -04:00
Michael Suchacz bfeb91d9cd fix: scope title regeneration per chat (#23729)
Previously, generating a new agent title used a page-global pending
state, so one in-flight regeneration disabled the action for every chat
in the Agents UI.

This change tracks regenerations by chat ID, updates the Agents page
contracts to use `regeneratingTitleChatIds`, and adds sidebar story
coverage that proves only the active chat is disabled.
2026-03-29 00:01:53 +01:00
Ethan c4ef94aacf fix(coderd/x/chatd): prevent chat hang when workspace agent is unavailable (#23707)
## Problem

Chats with a persisted `agent_id` binding hang indefinitely when the
workspace is stopped. The stale agent row still exists in the DB, so
`ensureWorkspaceAgent` succeeds, but the dial blocks forever in
`AwaitReachable`. The MCP discovery goroutine used an unbounded context,
so `g2.Wait()` never returned and the LLM never started.

## Fix

Three targeted changes restore the pre-binding behavior where stopped
workspaces degrade gracefully instead of blocking:

1. **`dialWithLazyValidation`**: "no agents in latest build" is now a
terminal fast-fail — the hanging dial is canceled and
`errChatHasNoWorkspaceAgent` returned immediately, instead of falling
through to `waitForOriginalDial`.

2. **Pre-LLM workspace setup**: MCP discovery and instruction
persistence gate on `workspaceAgentIDForConn` before attempting any
dial. MCP discovery is bounded by a 5s timeout and checks the in-memory
tool cache first (using the cheap cached agent from
`ensureWorkspaceAgent`), so the common subsequent-turn path has zero DB
queries.

3. **`persistInstructionFiles`**: tracks whether the workspace
connection succeeded and skips sentinel persistence on failure, so the
next turn retries if the workspace is restarted.

## Scenarios

**Running workspace, subsequent turn (hot path):** MCP cache hit via
in-memory cached agent. Zero DB queries, zero dials. Unchanged from
#23274.

**Stopped workspace, persisted binding (the bug):** MCP cache hit (stale
descriptors, fine — they fail at invocation). Pre-LLM setup completes
instantly. Tool invocation enters `dialWithLazyValidation`, dial fails
or hangs, validation discovers no agents, returns
`errChatHasNoWorkspaceAgent`. Model sees the error and can call
`start_workspace`.

**New chat, running workspace:** `ensureWorkspaceAgent` resolves via
latest-build, persists binding. MCP discovery dials and caches tools.

**New chat, stopped workspace:** `ensureWorkspaceAgent` finds no agents,
returns `errChatHasNoWorkspaceAgent`. Pre-LLM setup skips. LLM starts
with built-in tools only.

**Rebuilt workspace (agent switched):** MCP cache hit with stale agent
(harmless for one turn). Tool invocation dials stale agent, fails fast,
`dialWithLazyValidation` switches to new agent, persists updated
binding.

**Workspace restarted after stop:** No sentinel was persisted during the
stopped turn, so instruction persistence retries. Agent binding switches
to the new agent via `workspaceAgentIDForConn`.

**Transient DB error during validation:** Not
`errChatHasNoWorkspaceAgent`, so `dialWithLazyValidation` falls through
to `waitForOriginalDial` (cannot prove stale). No false positive.

**Tool invocation on stopped workspace:** `getWorkspaceConn` calls
`ensureWorkspaceAgent` (returns stale row), then
`dialWithLazyValidation` validation discovers no agents, returns
`errChatHasNoWorkspaceAgent`, cached state cleared, error returned to
model.
2026-03-27 18:47:39 +11:00
Michael Suchacz 2312e5c428 feat: add manual chat title regeneration (#23633)
## Summary

Adds a "Generate new title" action that lets users manually regenerate a
chat's title using richer conversation context than the automatic
first-message title path.

## Changes

### Backend
- **New endpoint:** `POST
/api/experimental/chats/{chatID}/title/regenerate` returns the updated
Chat with a regenerated title
- **Manual title algorithm:** Extracts useful user/assistant text turns
→ selects first user turn + last 3 turns → builds context with gap
markers → renders prompt with anti-recency guidance → calls lightweight
model → normalizes output
- **Helpers:** `extractManualTitleTurns`,
`selectManualTitleTurnIndexes`, `buildManualTitleContext`,
`renderManualTitlePrompt`, `generateManualTitle` — all private, with the
public `Server.RegenerateChatTitle` method
- **SDK:** `ExperimentalClient.RegenerateChatTitle(ctx, chatID) (Chat,
error)`
- Persists title via existing `UpdateChatByID` and broadcasts
`ChatEventKindTitleChange`

### Frontend
- API client method + React Query mutation with cache invalidation
- "Generate new title" menu item (with wand icon) in both TopBar and
Sidebar dropdown menus
- Loading/disabled state while regeneration is in-flight
- Error toast on failure
- Stories updated for both menus

### Tests
- `quickgen_test.go`: Table-driven tests for all 4 helper functions
(turn extraction, index selection, context building, prompt rendering)
- `exp_chats_test.go`: Handler tests (ChatNotFound,
NotFoundForDifferentUser, NoDaemon)

## Design notes
- The existing auto-title path (`maybeGenerateChatTitle`, `titleInput`)
is completely unchanged
- Manual regeneration uses richer context (first user turn + last 3
turns + gap markers) vs the auto path's single first message
- Endpoint is experimental and marked with `@x-apidocgen {"skip": true}`
2026-03-27 01:47:19 +01:00
Ethan 21c2acbad5 fix: refine chat retry status UX (#23651)
Follow-up to #23282. The retry and terminal error callouts had a few UX
oddities:

- Auto-retrying states reused backend error text that said "Please try
again" even while the UI was already retrying on behalf of the user.
- Terminal error states also said "Please try again" with no action the
user could take.
- `startup_timeout` had no specific title or retry copy — it fell
through to the generic "Retrying request" heading.
- The kind pill showed raw enum values like `startup_timeout` and
`rate_limit`.
- Terminal error metadata showed a "Retryable" / "Not retryable" label
that does not help users.
- A separate "Provider anthropic" metadata row duplicated information
already present in the message body.
- The `usage-limit` error kind used a hyphen while every backend kind
uses underscores.

Changes:

**Backend (`chaterror/message.go`)**

- Split message generation into `terminalMessage()` and
`retryMessage()`, replacing the old `userFacingMessage()`.
- Terminal messages include HTTP status codes and actionable guidance
(e.g. "Check the API key, permissions, and billing settings.").
- Retry messages are clean factual statements without status codes or
remediation, suitable for the retry countdown UI (e.g. "Anthropic is
temporarily overloaded.").
- Removed "Please try again" / "Please try again later" from all paths.
- `StreamRetryPayload` calls `retryMessage()` instead of forwarding
`classified.Message`.

**Frontend**

- Removed the parallel frontend message-generation system:
`getRetryMessage()`, `getProviderDisplayName()`,
`getRetryProviderSubject()`, and the `PROVIDER_DISPLAY_NAMES` map are
all deleted from `chatStatusHelpers.ts`.
- `liveStatusModel.ts` passes `retryState.error` through directly — the
backend owns the copy.
- Added specific title and retry copy for `startup_timeout`, and
extended the title mapping to cover `auth` and `config`.
- Kind pills now show humanized labels ("Startup timeout", "Rate limit",
etc.) instead of raw enum strings.
- Removed the redundant "Provider anthropic" metadata row.
- Removed the terminal "Retryable" / "Not retryable" badge.
- Normalized `"usage-limit"` → `"usage_limit"` and added it to
`ChatProviderFailureKind` so all error kinds follow the same underscore
convention and live in one enum.

Refs #23282.
2026-03-26 17:37:27 +11:00