Commit Graph

44 Commits

Author SHA1 Message Date
Kyle Carberry e388a88592 feat(coderd/chatd): connect to external MCP servers for chat tool invocation (#23333)
## Summary

Adds a new `coderd/chatd/mcpclient` package that connects to
admin-configured MCP servers and wraps their tools as
`fantasy.AgentTool` values that the chat loop can invoke.

## What changed

### New: `coderd/chatd/mcpclient/mcpclient.go`

The core package with a single entry point:

```go
func ConnectAll(
    ctx context.Context,
    logger slog.Logger,
    configs []database.MCPServerConfig,
    tokens []database.MCPServerUserToken,
) (tools []fantasy.AgentTool, cleanup func(), err error)
```

This:
1. Connects to each enabled MCP server using `mark3labs/mcp-go`
(streamable HTTP or SSE transport)
2. Discovers tools via the MCP `tools/list` method
3. Wraps each tool as a `fantasy.AgentTool` with namespaced name
(`serverslug__toolname`)
4. Applies tool allow/deny list filtering from the server config
5. Handles auth: OAuth2 bearer tokens, API keys, and custom headers
6. Skips broken servers with a warning (10s connect timeout per server)
7. Returns a cleanup function to close all MCP connections

### Modified: `coderd/chatd/chatd.go`

In `runChat()`, after loading the model/messages but before assembling
the tool list:
- Reads `chat.MCPServerIDs` from the chat record
- Loads the MCP server configs from the database
- Resolves the user's auth tokens
- Calls `mcpclient.ConnectAll()` to connect and discover tools
- Appends the MCP tools to the chat's tool set
- Defers cleanup to close connections when the chat turn ends

The chat loop (`chatloop.Run`) already handles tools generically —
MCP-backed tools are invoked identically to built-in workspace tools. No
changes needed in `chatloop/`.

### New: `coderd/chatd/mcpclient/mcpclient_test.go`

10 tests covering:
- Tool discovery and namespacing
- Tool call forwarding and result conversion  
- Allow/deny list filtering
- Connection failure handling (graceful skip)
- Multi-server support with correct prefixes
- OAuth2 auth header injection
- Disabled server skipping
- Invalid input handling
- Tool info parameter propagation

## Design decisions

- **Tool namespacing**: `slug__toolname` with double underscore
separator. Avoids collisions with tools containing single underscores.
Stripped when forwarding to `tools/call`.
- **Connection lifecycle**: Fresh connections per chat turn, closed via
`defer`. Matches the `turnWorkspaceContext` pattern.
- **Failure isolation**: Each server connects independently. A broken
server doesn't fail the chat — its tools are simply unavailable.
- **No chatloop changes**: The existing `[]fantasy.AgentTool` interface
is already fully generic.

## What's NOT in this PR (follow-ups)

- Frontend MCP server picker UI (selecting servers for a chat)
- System prompt additions describing available MCP tools
- Token refresh on expiry mid-chat
- The deprecated `aibridged` MCP proxy cleanup
2026-03-20 16:49:55 +00:00
Mathias Fredriksson 41e15ae440 feat: make process output blocking-capable (#23312)
Replace the 200ms polling loop in chatd's execute and
process_output tools with server-side blocking via sync.Cond
on HeadTailBuffer.

The agent's GET /{id}/output endpoint accepts ?wait=true to
block until the process exits or a 5-minute server cap expires.
The process_output tool blocks by default for 10s (overridable
via wait_timeout), and falls back to a non-blocking snapshot on
timeout. The execute tool's foreground path makes a single
blocking call instead of polling.

Related #23316
2026-03-20 14:33:55 +02:00
Cian Johnston 2f50e89afd fix(coderd): bump workspace autostop deadline on chat heartbeat (#23314)
- Wire `workspacestats.ActivityBumpWorkspace` into `trackWorkspaceUsage`
so the workspace build deadline is extended each time the chat heartbeat
fires
- Prevents mid-conversation autostop for chat workspaces
- Updates `TestHeartbeatBumpsWorkspaceUsage` verifying the deadline bump

> This PR was created with the help of Coder Agents, and was reviewed by two humans and their pet robots 🧑‍💻🤝🤖
2026-03-19 22:07:20 +00:00
Mathias Fredriksson 0a0c976a1a test(coderd/chatd): add P0 coverage tests for subagent auth and panic recovery (#23309)
The processChat defer at line 2464 catches panics on its main
goroutine and transitions the chat to error status. This was
previously untested.

The test wraps the database Store to panic during PersistStep's
InTx call, which runs synchronously on the processChat goroutine.
A tool-level panic wouldn't work because executeTools has its own
recover that converts panics into tool error results.
2026-03-19 17:54:03 +00:00
Michael Suchacz 6d214644f6 fix: make TestInterruptAutoPromotionIgnoresLaterUsageLimitIncrease deterministic (#23279)
Eliminates the timing flake in
`TestInterruptAutoPromotionIgnoresLaterUsageLimitIncrease` by making the
chatd worker loop clock-controllable.

## Changes

**`coderd/chatd/chatd.go`**
- Replace `time.NewTicker` calls in `Server.start()` with
`p.clock.NewTicker` using named quartz tags `("chatd", "acquire")` and
`("chatd", "stale-recovery")`.

**`coderd/chatd/chatd_test.go`**
- Inject `quartz.NewMock(t)` into the test via `newActiveTestServer`
config override.
- Trap the acquire ticker so the test controls exactly when pending
chats are reacquired.
- Rewrite the test flow as explicit clock-advance steps instead of
wall-clock polling.

**`AGENTS.md`**
- Document the PR title scope rule (scope must be a real path containing
all changed files).

## Validation
- `go test ./coderd/chatd -run
TestInterruptAutoPromotionIgnoresLaterUsageLimitIncrease -count=100` 
- `go test ./coderd/chatd` 
- `make lint` 
2026-03-19 15:14:00 +00:00
Hugo Dutka d285a3e74e fix: handle null bytes in chat messages (#22946)
This PR fixes a bug where if a tool result contained binary data it
wouldn't be persisted to the database.

`jsonb` in Postgres is unable to store null bytes which are sometimes
output by tool results. This change makes it so that we encode them with
a special escape sequence before saving them to the database, and decode
them on read.

<img width="808" height="637" alt="Screenshot 2026-03-11 at 13 14 06"
src="https://github.com/user-attachments/assets/9be353eb-ff26-40ec-9f0a-195022b11f43"
/>
2026-03-18 21:19:25 +01:00
Cian Johnston 14ed3e3644 feat: bump workspace last_used_at on chat heartbeat (#23205)
- coderd: Wires `options.WorkspaceUsageTracker` into the chatd config.
- chatd: Adds `UsageTracker` and calls `UsageTracker.Add(workspaceID)`
on each heartbeat tick
- chatd: adds tests to verify `last_used_at` bump behaviour

> 🤖 This PR was created with the help of Coder Agents, and will be
reviewed by my human. 🧑‍💻
2026-03-18 19:07:21 +00:00
Kyle Carberry 1f0d896fc9 feat: add deleted flag to chat messages for soft-delete (#23223)
Adds a `deleted` boolean column to the `chat_messages` table. Messages
are never physically deleted from the database — instead they are marked
as deleted so that usage and cost data is preserved.

## Changes

### Migration
- New migration (000444) adds `deleted boolean NOT NULL DEFAULT false`
to `chat_messages`

### SQL queries
- `DeleteChatMessagesAfterID` → `SoftDeleteChatMessagesAfterID` (UPDATE
SET deleted=true instead of DELETE)
- New `SoftDeleteChatMessageByID` query for single-message soft-delete
- All read queries now filter `deleted = false`:
  - `GetChatMessageByID`
  - `GetChatMessagesByChatID`
  - `GetChatMessagesByChatIDDescPaginated`
  - `GetChatMessagesForPromptByChatID` (both CTE and main query)
  - `GetLastChatMessageByRole`
- Cost/usage queries (`GetChatCostSummary`, `GetChatCostPerModel`, etc.)
intentionally still include deleted messages to preserve accurate spend
tracking

### EditMessage behavior
- Previously: updated the message content in-place + hard-deleted
subsequent messages
- Now: soft-deletes the original message + soft-deletes subsequent
messages + inserts a new message with the updated content
- This preserves the original message data (tokens, cost, content) in
the database
2026-03-18 14:37:09 -04:00
Kyle Carberry 483adc59fe feat: replace InsertChatMessage with batch InsertChatMessages (#23220)
Replaces the singular `InsertChatMessage` query with
`InsertChatMessages` that uses PostgreSQL's `unnest()` for batch
inserts. This reduces the number of database round-trips when inserting
multiple messages in a single transaction.

## Changes

- **SQL**: New `InsertChatMessages :many` query using `unnest()` arrays
following the existing codebase pattern (e.g.,
`InsertWorkspaceAgentStats`). Preserves the CTE that updates
`chats.last_model_config_id` using the last non-null model config from
the batch. Uses `NULLIF` for UUID columns to handle NULL foreign keys.
- **Go layers**: Updated `querier.go`, `dbauthz.go`,
`dbmetrics/querymetrics.go`, `dbmock/dbmock.go`, and `queries.sql.go` to
use the new batch signature (`[]ChatMessage` return type, array params).
- **chatd.go**: All call sites converted to batch inserts:
  - **CreateChat**: System prompt + user message batched into one call
- **persistStep**: Assistant message + tool messages batched into one
call
- **persistSummary**: Hidden summary + assistant + tool messages batched
into one call
  - Single-message sites use the same API with single-element arrays
- **Helper**: New `appendChatMessage` function simplifies building batch
params at each call site.
- **Tests**: All test files updated to use the new API.

Builds on top of #23213.
2026-03-18 16:27:07 +00:00
Kyle Carberry 4dd8531f37 feat: track step runtime_ms on chat messages (#23219)
## Summary

Adds a `runtime_ms` column to `chat_messages` that records the
wall-clock duration (in milliseconds) of each LLM step. This covers LLM
streaming, tool execution, and retries — the full time the agent is
"alive" for a step.

This is the foundation for billing by agent alive time. The column
follows the same pattern as `total_cost_micros`: stored per assistant
message, aggregatable with `SUM()` over time periods by user.

## Changes

- **Migration**: adds nullable `runtime_ms bigint` to `chat_messages`.
- **chatloop**: adds `Runtime time.Duration` field to `PersistedStep`,
measures `time.Since(stepStart)` at the beginning of each step (covering
stream + tool execution + retries).
- **chatd**: passes `step.Runtime.Milliseconds()` to the assistant
message `InsertChatMessage` call; all other message types (system, user,
tool) get `NULL`.
- **Tests**: adds `runtime > 0` assertion in chatloop tests.

## Billing query pattern

Once ready, aggregation mirrors the existing cost queries:

```sql
SELECT COALESCE(SUM(cm.runtime_ms), 0)::bigint AS total_runtime_ms
FROM chat_messages cm
JOIN chats c ON c.id = cm.chat_id
WHERE c.owner_id = @user_id
  AND cm.created_at >= @start_time
  AND cm.created_at < @end_time
  AND cm.runtime_ms IS NOT NULL;
```
2026-03-18 10:57:35 -04:00
Kyle Carberry b83b93ea5c feat: add workspace awareness system message on chat creation (#23213)
When a chat is created via `chatd`, a system message is now inserted
informing the model whether the chat was created with or without a
workspace.

**With workspace:**
> This chat is attached to a workspace. You can use workspace tools like
execute, read_file, write_file, etc.

**Without workspace:**
> There is no workspace associated with this chat yet. Create one using
the create_workspace tool before using workspace tools like execute,
read_file, write_file, etc.

This is a model-only visibility system message (not shown to users) that
helps the model understand its available capabilities upfront —
particularly important for subagents spawned without a workspace, which
previously would attempt to use workspace tools and fail.

**Changes:**
- `coderd/chatd/chatd.go`: Added workspace awareness constants and
inserted the system message in `CreateChat` after the system prompt,
before the initial user message.
- `coderd/chatd/chatd_test.go`: Added
`TestCreateChatInsertsWorkspaceAwarenessMessage` with sub-tests for both
with-workspace and without-workspace cases.
2026-03-18 14:01:46 +00:00
Kyle Carberry d42008e93d fix: persist partial assistant response when chat is interrupted mid-stream (#23193)
## Problem

When a user cancels a streaming chat response mid-stream, the partial
content disappears entirely — both from the UI and the database. The
streamed text vanishes as if the response never happened.

## Root Causes

Three issues combine to prevent partial message persistence on
interrupt:

### 1. StreamPartTypeError only matched `context.Canceled`
(`chatloop.go`)

The interrupt detection in `processStepStream` checked:
```go
errors.Is(part.Error, context.Canceled) && errors.Is(context.Cause(ctx), ErrInterrupted)
```
But some providers propagate `ErrInterrupted` directly as the stream
error rather than wrapping it in `context.Canceled`. This caused the
condition to fail, so `flushActiveState` was never called and partial
text accumulated in `activeTextContent` was lost.

### 2. No post-loop interrupt check (`chatloop.go`)

If the stream iterator stops yielding parts without producing a
`StreamPartTypeError` (e.g., a provider that silently closes the
response body on cancel), there was no check after the `for part :=
range stream` loop to detect the interrupt and flush active state.

### 3. Worker ownership check blocked interrupted persists (`chatd.go`)

`InterruptChat` → `setChatWaiting` clears `worker_id` in the DB
**before** the chatloop detects the interrupt. When
`persistInterruptedStep` (using `context.WithoutCancel`) tried to write
the partial message, the ownership check:
```go
if !lockedChat.WorkerID.Valid || lockedChat.WorkerID.UUID != p.workerID {
    return chatloop.ErrInterrupted  // always blocks!
}
```
unconditionally rejected the write. The error was silently logged as a
warning.

## Fix

- **Broaden the `StreamPartTypeError` interrupt detection** to match
both `context.Canceled` and `ErrInterrupted` as the stream error.
- **Add a post-loop interrupt check** in `processStepStream` that
flushes active state when the context was canceled with
`ErrInterrupted`.
- **Allow `persistStep` to write when the chat is in `waiting` status**
(interrupt) even if `worker_id` was cleared. The `pending` status (from
`EditMessage`, where history is truncated) still correctly blocks stale
writes.

## Testing

Added `TestInterruptChatPersistsPartialResponse` — an end-to-end
integration test that:
1. Streams partial text chunks from a mock LLM
2. Waits for the chatloop to publish `message_part` events (confirming
chunks were processed)
3. Interrupts the chat mid-stream
4. Verifies the partial assistant message is persisted in the database
with the expected text content
2026-03-18 11:48:28 +00:00
Hugo Dutka 2cf47ec384 feat: virtual desktop settings toggle backend (#23171)
Adds a new `site_config` entry that controls whether the virtual desktop
feature for Coder Agents is enabled. It can be set via a new
`/api/experimental/chats/config/desktop-enabled` endpoint, which will be
used by the frontend.
2026-03-18 09:35:13 +01:00
Kyle Carberry b779c9ee33 fix: use SQL-level auth filtering for chat listing (#23159)
## Problem

The chat listing endpoint (`GetChatsByOwnerID`) was using
`fetchWithPostFilter`, which fetches N rows from the database and then
filters them in Go memory using RBAC checks. This causes a pagination
bug: if the user requests `limit=25` but some rows fail the auth check,
fewer than 25 rows are returned even though more authorized rows exist
in the database. The client may incorrectly assume it has reached the
end of the list.

## Solution

Switch to the same pattern used by `GetWorkspaces`, `GetTemplates`, and
`GetUsers`: `prepareSQLFilter` + `GetAuthorized*` variant. The RBAC
filter is compiled to a SQL WHERE clause and injected into the query
before `ORDER BY`/`LIMIT`, so the database returns exactly the requested
number of authorized rows.

Additionally, `GetChatsByOwnerID` is renamed to `GetChats` with
`OwnerID` as an optional (nullable) filter parameter, matching the
`GetWorkspaces` naming convention.

## Changes

| File | Change |
|------|--------|
| `queries/chats.sql` | Renamed to `GetChats`, `owner_id` now optional
via CASE/NULL, added `-- @authorize_filter` |
| `queries.sql.go` | Renamed constant, params struct (`GetChatsParams`),
and method |
| `querier.go` | Interface method renamed |
| `modelqueries.go` | Added `chatQuerier` interface +
`GetAuthorizedChats` impl |
| `dbauthz/dbauthz.go` | `GetChats` now uses `prepareSQLFilter` instead
of `fetchWithPostFilter` |
| `dbauthz/dbauthz_test.go` | Updated tests for SQL filter pattern |
| `dbmock/dbmock.go` | Renamed + added mock for `GetAuthorizedChats` |
| `dbmetrics/querymetrics.go` | Renamed + added metrics wrapper |
| `rbac/regosql/configs.go` | Added `ChatConverter` (maps `org_owner` to
empty string literal since `chats` has no `organization_id` column) |
| `rbac/authz.go` | Added `ConfigChats()` |
| `chats.go` | Handler uses renamed method with `uuid.NullUUID` |
| `searchquery/search.go` | Updated return type |
| `gitsync/worker.go` | Updated interface and call site |
| Various test files | Updated for renamed types |
2026-03-17 12:46:24 -04:00
Michael Suchacz 5d0eb772da fix(cored): fix flaky TestInterruptAutoPromotionIgnoresLaterUsageLimitIncrease (#23147) 2026-03-17 19:08:22 +11:00
Ethan 04fca84872 perf(coderd): reduce duplicated reads in push and webpush paths (#23115)
## Background

A 5000-chat scaletest (~50k turns, ~2m45s wall time) completed
successfully,
but the main bottleneck was **DB pool starvation from repeated reads**,
not
individually expensive SQL. The push/webpush path showed a few
especially noisy
reads:

- `GetLastChatMessageByRole` for push body generation
- `GetEnabledChatProviders` + `GetChatModelConfigByID` for push summary
model
  resolution
- `GetWebpushSubscriptionsByUserID` for every webpush dispatch

This PR keeps the optimizations that remove those duplicate reads while
leaving
stream behavior unchanged.

## What changes in this PR

### 1. Reuse resolved chat state for push notifications

`maybeSendPushNotification` used to re-read the last assistant message
and
re-resolve the chat model/provider after `runChat` had already done that
work.

Now `runChat` returns the final assistant text plus the already-resolved
model
and provider keys, and the push goroutine uses that state directly.

That removes the extra push-path reads for:

- `GetLastChatMessageByRole`
- the second `resolveChatModel` path
- the provider/model lookups that came with that second resolution

### 2. Cache webpush subscriptions during dispatch

`Dispatch()` previously hit `GetWebpushSubscriptionsByUserID` on every
push. A
small per-user in-memory cache now avoids those repeated reads.

The follow-up fix keeps that optimization correct: `InvalidateUser()`
bumps a
per-user generation so an older in-flight fetch cannot repopulate the
cache with
pre-mutation data after subscribe/unsubscribe.

That preserves the cache win without letting local subscription changes
be
silently overwritten by stale fetch results.

## Why this is safe

- The push change only reuses data already produced during the same chat
run. It
does not change notification semantics; if there is no assistant text to
  summarize, the existing fallback body still applies.
- The webpush change keeps the existing TTL and `410 Gone` cleanup
behavior. The
generation guard only prevents stale in-flight fetches from poisoning
the
  shared cache after invalidation.
- The final PR does **not** change stream setup, pubsub/relay behavior,
or chat
  status snapshot timing.

## Deliberately not included

- No stream-path optimization in `Subscribe`.
- No inline pubsub message payloads.
- No distributed cross-replica webpush cache invalidation.
2026-03-17 13:50:47 +11:00
Michael Suchacz 1031da9738 feat: add agent chat spend limiting (backend) (#23071)
Introduces deployment-scoped spend limiting for Coder Agents, enabling
administrators to control LLM costs at global, group, and individual
user levels.

## Changes

- **Database migration (000437)**: `chat_usage_limit_config`
(singleton), `chat_usage_limit_overrides` (per-user),
`chat_usage_limit_group_overrides` (per-group)
- **Single-query limit resolution**: individual override > min(group) >
global default via `ResolveUserChatSpendLimit`
- **Fail-open enforcement** in chatd with documented TOCTOU trade-off
- **Experimental API** under `/api/experimental/chats/usage-limits` for
CRUD on limits
- **`AsChatd` RBAC subject** for narrowly-scoped daemon access (replaces
`AsSystemRestricted`)
- **Generated TypeScript types** for the frontend SDK

## Hierarchy

1. Individual user override (highest)
2. Minimum of group limits
3. Global default
4. Disabled / unlimited

Currency stored as micro-dollars (`1,000,000` = $1.00).

Frontend PR: #23072
2026-03-17 01:24:03 +01:00
Kyle Carberry 741af057dc feat: paginate chat messages endpoint with cursor-based infinite scroll (#23083)
Adds cursor-based pagination to the chat messages endpoint.

## Backend

- New `GetChatMessagesByChatIDPaginated` SQL query: returns messages in
`id DESC` order with a `before_id` keyset cursor and configurable
`limit`
- Handler parses `?before_id=N&limit=N` query params, uses the `LIMIT
N+1` trick to set `has_more` without a separate COUNT query
- Queued messages only returned on the first page (no cursor) since
they're always the most recent
- SDK client updated with `ChatMessagesPaginationOptions`
- Fully backward compatible: omitting params returns the 50 newest
messages

## Frontend

- Switches `getChatMessages` from `useQuery` to `useInfiniteQuery` with
cursor chaining via `getNextPageParam`
- Pages flattened and sorted by `id` ascending for chronological display
- `MessagesPaginationSentinel` component uses `IntersectionObserver`
(200px rootMargin prefetch) inside the existing `flex-col-reverse`
scroll container
- `flex-col-reverse` handles scroll anchoring natively when older
messages are prepended — no manual `scrollTop` adjustment needed (same
pattern as coder/blink)

## Why cursor-based instead of offset/limit

Offset-based pagination breaks when new messages arrive while paginating
backward (offsets shift, causing duplicates or missed messages). The
`before_id` cursor is stable regardless of inserts — each page is
deterministic.
2026-03-16 16:40:59 +00:00
Hugo Dutka 84527390c6 feat: chat desktop backend (#23005)
Implement the backend for the desktop feature for agents.

- Adds a new `/api/experimental/chats/$id/desktop` endpoint to coderd
which exposes a VNC stream from a
[portabledesktop](https://github.com/coder/portabledesktop) process
running inside the workspace
- Adds a new `spawn_computer_use_agent` tool to chatd, which spawns a
subagent that has access to the `computer` tool which lets it interact
with the `portabledesktop` process running inside the workspace
- Adds the plumbing to make the above possible

There's a follow up frontend PR here:
https://github.com/coder/coder/pull/23006
2026-03-13 19:49:34 +01:00
Mathias Fredriksson 4a79af1a0d refactor: add chat_message_role enum and content_version column (#23042)
Migration 000434 converts chat_messages.role from text to a Postgres
enum, rebuilds the partial index, and adds content_version smallint.
The column is backfilled with DEFAULT 0, then the default is dropped
so future inserts must set it explicitly.

Version 0 uses the role-aware heuristic from #22958. Version 1 (all
new inserts) stores []ChatMessagePart JSON for all roles, including
system messages. ParseContent takes database.ChatMessage directly
and dispatches on version internally. Unknown versions error.

All string(codersdk.ChatMessageRole*) casts at DB write sites are
replaced with database.ChatMessageRole* constants from sqlc.

Refs #22958
2026-03-13 16:47:36 +00:00
Mathias Fredriksson bdbcd3428b feat(coderd/chatd): unify chat storage on SDK parts and fix file-reference rendering (#22958)
File-reference parts in user messages were flattened to `TextContent` at
write time because fantasy has no file-reference content type. The
frontend never saw them as structured parts.

This moves all write paths (user, assistant, tool) from fantasy envelope
format to `codersdk.ChatMessagePart`. The streaming layer (`chatloop`)
is untouched, the conversion happens at the serialization boundary in
`persistStep`.

Old rows are still readable. `ParseContent` uses a structural heuristic
(`isFantasyEnvelopeFormat`) to distinguish legacy envelopes from SDK
parts. We chose this over try/fallback because fantasy envelopes
partially unmarshal into `ChatMessagePart` (the `type` field matches)
while silently losing content. A guard test enforces that no SDK part
can produce the envelope shape.

This is forward-only: new rows are unreadable by old code. Chat is
behind a feature flag so rollback risk is contained.

Also adds a typed `ChatMessageRole` to replace raw strings and
`fantasy.MessageRole*` casts at the persistence boundary. The type
covers `ChatMessage.Role`, `ChatStreamMessagePart.Role`, the
`PublishMessagePart` callback chain, and all DB write sites.
`fantasy.MessageRole*` remains only where we build `fantasy.Message`
structs for LLM dispatch.

Separately, `ProviderMetadata` was leaking to SSE clients via
`publishMessagePart`. `StripInternal` now runs on both the SSE and REST
paths, covering this.

Other cleanup:

- Old `db2sdk.contentBlockToPart` silently dropped metadata on
text/reasoning/tool-call content. New code preserves it.
- `providerMetadataToOptions` now logs warnings instead of silently
returning nil.
- `db2sdk` shrinks from ~250 lines of parallel conversion to ~15 lines
delegating to `chatprompt.ParseContent()`, removing the `fantasy` import
entirely.

Refs #22821
2026-03-13 17:53:26 +02:00
Kyle Carberry 690e3a87d8 feat: move chat messages to dedicated /chats/{id}/messages endpoint (#23021)
## Summary

Moves the messages response out of `GET /chats/{id}` and into a
dedicated `GET /chats/{id}/messages` endpoint.

### Backend
- `GET /chats/{id}` now returns just the `Chat` object (no messages)
- `GET /chats/{id}/messages` is a new endpoint returning
`ChatMessagesResponse` with `messages` and `queued_messages`
- Added `ChatMessagesResponse` SDK type and `GetChatMessages` client
method

### Frontend
- `getChat()` API method returns `Chat` instead of `ChatWithMessages`
- Added `getChatMessages()` API method for the new endpoint
- Split `chatQuery` into two: `chatQuery` (metadata) and
`chatMessagesQuery` (messages)
- Updated all cache mutations, optimistic updates, and websocket
handlers
- Updated tests and stories

### Files changed
| File | Change |
|---|---|
| `coderd/coderd.go` | Register `GET /messages` route |
| `coderd/chats.go` | Simplify `getChat`, add `getChatMessages` handler
|
| `codersdk/chats.go` | New type + method, update `GetChat` return |
| `site/src/api/api.ts` | New method, update `getChat` |
| `site/src/api/queries/chats.ts` | New query, update cache mutations |
| `site/src/pages/AgentsPage/AgentDetail.tsx` | Use separate queries |
| `site/src/pages/AgentsPage/AgentDetail/ChatContext.ts` | Update types
and cache writes |
| `site/src/pages/AgentsPage/AgentsPage.tsx` | Update websocket cache
handler |
2026-03-13 08:35:46 -04:00
Danielle Maywood 6489d6f714 feat(chatd): use last assistant message as push notification summary (#22671)
Instead of the static 'Agent has finished running.' text, extract a
summary from the last assistant message to give users meaningful context
about what the agent accomplished. Falls back to the static text if no
suitable message is found.

Co-authored-by: Kyle Carberry <kyle@carberry.com>
2026-03-10 15:14:15 +00:00
Kyle Carberry fee5cc5e5b fix(chatd): fix flaky TestCloseDuringShutdownContextCanceledShouldRetryOnNewReplica (#22893)
Fixes https://github.com/coder/internal/issues/1371

## Root causes

Two independent races cause this test to flake at ~2–3/1000:

### 1. Title-generation requests racing with the streaming request
counter

`maybeGenerateChatTitle` fires in a `context.WithoutCancel` goroutine
(line 2130) and makes a **non-streaming** request to the mock OpenAI
handler. The test handler was not filtering by request type, so these
title requests incremented the `requestCount` atomic — throwing off the
coordination logic that uses `requestCount == 1` to identify the first
streaming request and hold it open until shutdown.

**Fix:** Guard the test handler to return a canned response for
non-streaming requests before touching `requestCount`.

### 2. Phantom acquire: `AcquireChat` commits in Postgres but Go sees
`context.Canceled`

During `Close()`, the main loop's `select` can randomly pick
`acquireTicker.C` over `ctx.Done()` (Go spec: when multiple cases are
ready, one is chosen uniformly at random). This calls `processOnce(ctx)`
with an already-canceled context.

In the pq driver, `QueryContext` does **not** check `ctx.Err()` up
front. Instead it calls `watchCancel(ctx)` which spawns a goroutine
monitoring `ctx.Done()`, then sends the query on the existing
connection. When `ctx` is already canceled, a race ensues:

- **pq's watchCancel goroutine** immediately sees `<-done`, opens a
  *new* TCP connection to Postgres, and sends a cancel request.
- **The query** is sent concurrently on the existing connection.

Because the `AcquireChat` UPDATE is fast (sub-millisecond, single row
with `SKIP LOCKED`), it often commits before the cancel arrives via the
second connection. Meanwhile in `database/sql`, `initContextClose`
spawns an `awaitDone` goroutine that fires immediately (context is
already canceled), stores `contextDone`, and calls `rs.close(ctx.Err())`
— which races with `Row.Scan` → `rows.Next()`. If `awaitDone` wins,
`Next()` sees `contextDone` is set and returns false, causing Scan to
return `context.Canceled` (or `ErrNoRows`).

**Result:** Postgres committed the UPDATE (chat is now `running` with
serverA's worker ID), but Go sees an error and never spawns a goroutine
to process it. The chat is stuck as `running` with no worker.

If the previous `processChat` cleanup already set the chat back to
`pending`, this phantom acquire flips it back to `running` — which is
exactly what the debug logs showed: after `Close()` returns, the DB
shows `status=running` with serverA's worker ID.

**Fix:** Three guards in `processOnce`:

1. Early `ctx.Err()` check — catches the common case where `select`
   picked the ticker after cancellation.
2. `context.WithoutCancel(ctx)` for `AcquireChat` — prevents the pq
   `watchCancel` race entirely, ensuring the driver sees the query
   result if Postgres executed it.
3. Post-acquire `ctx.Err()` check — if the context was canceled while
   `AcquireChat` ran (or between the early check and the call),
   immediately release the chat back to `pending`.

## Verification

Passes 2000/2000 iterations (previously flaked at ~2–3/1000):

```
go test -run "TestCloseDuringShutdownContextCanceledShouldRetryOnNewReplica" \
  -count=2000 -timeout 1800s -failfast ./coderd/chatd/
```
2026-03-10 14:22:39 +00:00
Kyle Carberry b9c729457b fix(chatd): queue interrupt messages to preserve conversation order (#22736)
## Problem

When `message_agent` is called with `interrupt=true`, two independent
code paths race to persist messages:

1. `SendMessage` inserts the **user message** into `chat_messages` at
time T1
2. `persistInterruptedStep` saves the partial **assistant response** at
time T2 (T2 > T1)

Since `chat_messages` are ordered by `(created_at, id)`, the assistant
message ends up **after** the user message that triggered the interrupt.
On reload, this produces a broken conversation where the interrupted
response appears below the new user message — and Anthropic rejects the
trailing assistant message as unsupported prefill.

The root cause is that **two independent writers can't guarantee
ordering**. Any solution involving timestamp manipulation or
signal-then-wait coordination leaves race windows.

## Fix

Route interrupt behavior through the existing queued message mechanism:

1. `SendMessage` with `BusyBehaviorInterrupt` now inserts into
`chat_queued_messages` (not `chat_messages`) when the chat is busy
2. After queuing, `setChatWaiting` signals the running loop to stop
3. The deferred cleanup in `processChat` persists the partial assistant
response first, then auto-promotes the queued user message

This eliminates the race entirely: the assistant partial response and
user message are written by the same serialized cleanup flow, so
ordering is guaranteed by the DB's auto-incrementing `id` sequence. No
timestamp hacks, no reordering at send time.

Supersedes #22728 — fixes the root cause instead of reordering at prompt
construction time.
2026-03-06 18:15:40 -05:00
Kyle Carberry eecb7d0b66 fix: resolve bugs in chatd streaming system (#22720)
Split from #22693 per review feedback.

Fixes multiple bugs in coderd/chatd and sub-packages including race
conditions, transaction safety, stream buffer bounds, retry limits, and
enterprise relay improvements.

See commit message for full list.
2026-03-06 21:02:25 +00:00
Danielle Maywood ffb47cea19 feat(chatd): add tag-based dedup to push notifications (#22669) 2026-03-06 10:48:58 +00:00
Danielle Maywood d91d9712f7 fix: use Eventually for web push dispatch assertion in chatd test (#22700) 2026-03-06 09:52:28 +00:00
Hugo Dutka 48ab492f49 feat: agents git watch backend (#22565)
Adds real-time git status watching for workspace agents, so the frontend
can subscribe over WebSocket and show
git file changes in near real-time.

1. Subscription is scoped to a **chat** via `GET
/api/experimental/chats/{chat}/git/watch`.
2. The workspace agent automatically determines which paths to watch
based on tool calls made by the chat (and its ancestor chats).
3. Workspace agent polls subscribed repo working trees on a 30s
interval, on tools calls, and on explicit `refresh` from the client.
4. Scans are rate-limited to at most once per second.
5. Edited paths are tracked **in-memory** inside the workspace agent.
There is no database persistence — state is lost on agent restart. This
will be addresses in a future PR.
6. Messages sent over WebSocket include a full-repo snapshot (unified
diff, branch, origin). A new message is emitted only when the snapshot
changes.

This PR was implemented with AI with me closely controlling what it's
doing. The code follows a plan file that was updated continuously during
implementation. Here's the file if you'd like to see it:
[project.md](https://gist.github.com/hugodutka/8722cf80c92f8a56555f7bc595b770e2).
It reflects the current state of the PR.
2026-03-06 10:47:55 +01:00
Danielle Maywood 0ec27e3d48 feat(chatd): navigate to specific chat on push notification click (#22668) 2026-03-05 16:40:17 +00:00
Kyle Carberry 6520159045 feat(chatd): add start_workspace tool to agent flow (#22646)
## Summary

When a chat's workspace is stopped, the LLM previously had no way to
start it — `create_workspace` would either create a duplicate workspace
or fail. This adds a dedicated `start_workspace` tool to the agent flow.

## Changes

### New: `start_workspace` tool
(`coderd/chatd/chattool/startworkspace.go`)
- Detects if the chat's workspace is stopped and starts it via a new
build with `transition=start`
- Reuses the existing `waitForBuild` and `waitForAgent` helpers (shared
logic)
- Shares the workspace mutex with `create_workspace` to prevent races
- Idempotent: returns immediately if the workspace is already running or
building
- Returns a `no_agent` / `not_ready` status if the agent isn't available
yet (non-fatal)

### Updated: `create_workspace` stopped-workspace hint
- `checkExistingWorkspace` now returns a `stopped` status with message
`"use start_workspace to start it"` when it detects the chat's workspace
is stopped, instead of falling through to create a new workspace

### Wiring
- `chatd.Config` / `chatd.Server`: new `StartWorkspace` /
`startWorkspaceFn` field
- `coderd/chats.go`: new `chatStartWorkspace` method that calls
`postWorkspaceBuildsInternal` with proper RBAC context
- `coderd/coderd.go`: passes `chatStartWorkspace` into chatd config
- Tool registered alongside `create_workspace` for root chats only (not
subagents)

### Tests (`startworkspace_test.go`)
- `NoWorkspace`: error when chat has no workspace
- `AlreadyRunning`: idempotent return for workspace with successful
start build
- `StoppedWorkspace`: verifies StartFn is called, build is waited on,
and success response returned
2026-03-05 15:34:24 +00:00
Cian Johnston d0a51e1752 fix: use testutil.Eventually in chatd interrupt test (#22653)
Follow-up to #22630. Addresses [review
feedback](https://github.com/coder/coder/pull/22630#pullrequestreview-2953419963)
that was missed due to auto-merge.

## Changes

Replaces three `require.Eventually` calls with `testutil.Eventually` in
`TestInterruptChatDoesNotSendWebPushNotification`, linking the condition
to the existing test context (`ctx`) created on line 1194. This ensures
the test respects context cancellation instead of using a standalone
timeout/tick pattern.
2026-03-05 09:42:34 +00:00
Cian Johnston 4d0d187806 fix(chatd): wait for startup scripts before returning from create_workspace (#22498)
The `create_workspace` tool waited for the workspace build to succeed
and the agent to become connectable, but did not wait for the agent's
startup scripts (e.g. git clone) to finish. This caused agents to
attempt file operations on repositories that hadn't been cloned yet.

Add a waitForStartupScripts step that polls the agent's lifecycle_state
via GetWorkspaceAgentLifecycleStateByID until it transitions out of
created/starting into a terminal state (ready, start_error, or
start_timeout). The tool now only returns success once the workspace is
fully initialized.

If the scripts fail or time out, the tool still returns (non-fatal) with
an appropriate agent_status so the model knows something went wrong.

Created using thingies (Opus 4.6 Max)
2026-03-05 09:42:12 +00:00
Kyle Carberry 7bcd9f6de8 fix: skip web push notification when chat is interrupted (#22630)
When a user interrupts a chat, the status transitions to `waiting` which
previously triggered an "Agent has finished running." web push
notification. This is incorrect — the user interrupted it themselves, so
no notification is needed.

## Changes

### `coderd/chatd/chatd.go`
- Added `wasInterrupted` flag alongside the existing `status` variable
- Set the flag when `ErrInterrupted` is detected in the error handler
- Added `!wasInterrupted` to the web push dispatch condition

### `coderd/chatd/chatd_test.go`
- Added `TestInterruptChatDoesNotSendWebPushNotification` that creates a
chat with a mock webpush dispatcher, processes it, interrupts it, and
verifies no push notification was dispatched
- Added `mockWebpushDispatcher` implementing the `webpush.Dispatcher`
interface
2026-03-05 09:08:17 +00:00
Kyle Carberry 30d534b36b fix(chatd): fix relay race conditions, extract enterprise relay logic, move pubsub to OSS (#22589)
## Summary

Fixes a bug where interrupting a streaming chat and sending a new
message
left the relay connected to the wrong replica. Expanded into a broader
refactor that cleanly separates concerns:

- **OSS** owns pubsub subscription, message catch-up, queue updates,
  status forwarding, and local parts merging.
- **Enterprise** (`enterprise/coderd/chatd`) only manages relay dialing,
  reconnection, and stale-dial discarding for cross-replica streaming.

## Architecture

### OSS `coderd/chatd/chatd.go`

`Subscribe()` builds the initial snapshot then runs a single merge
goroutine that handles:

- Pubsub subscription for durable events (status, messages, queue,
errors)
- Message catch-up via `AfterMessageID`
- Local `message_part` forwarding
- Relay events from enterprise (when `SubscribeFn` is set)
- Sends `StatusNotification` to enterprise so it can manage relay
lifecycle

Key types:

- `SubscribeFn` — enterprise hook, returns relay-only events channel
- `SubscribeFnParams` — `ChatID`, `Chat`, `WorkerID`,
`StatusNotifications`, `RequestHeader`, `DB`, `Logger`
- `StatusNotification` — `Status` + `WorkerID`, sent to enterprise on
pubsub status changes

### Enterprise `enterprise/coderd/chatd/chatd.go`

`NewMultiReplicaSubscribeFn(cfg MultiReplicaSubscribeConfig)` returns a
`SubscribeFn` that:

- Opens an initial synchronous relay if the chat is running on a remote
worker
- Reads `StatusNotifications` from OSS to open/close relay connections
- Handles async dial, reconnect timers, stale-dial discarding
- Returns only relay `message_part` events

## Bug fixes

### Original bug: stale relay dial after interrupt

`openRelayAsync` goroutines used `mergedCtx` (subscription-level), not a
per-dial context. `closeRelay()` could not cancel in-flight dials. When
the user interrupts and a new replica picks up the chat, the old dial
goroutine could complete after the new one and deliver a stale
`relayResult`.

**Fix**: per-dial `dialCtx`/`dialCancel`, `expectedWorkerID` tracking,
`workerID` on `relayResult`. `closeRelay()` cancels the dial context and
drains `relayReadyCh`. Merge loop rejects mismatched worker IDs.

### Additional fixes

- `statusNotifications` send-on-closed-channel race — goroutine now owns
  `close()` via defer
- Enterprise spin-loop on `StatusNotifications` close — two-value
receive
  with nil-out
- `hasPubsub` set from `p.pubsub != nil` instead of subscription success
  — now tracks actual subscription result
- `lastMessageID` not initialized from `afterMessageID` — caused
  duplicate messages on catch-up
- `wrappedParts` goroutine leaked remote connection on `dialCtx` cancel
- `closeRelay()` did not drain `relayReadyCh`
- `setChatWaiting` race with `SendMessage(Interrupt)` — wrapped in
`InTx`
- `processChat` post-TX side effects fired when chat was taken by
another
  worker — added `errChatTakenByOtherWorker` sentinel
- Cancel closure data race on `reconnectTimer`
- Bare blocking send on pubsub error path
- `localParts` hot-spin after channel close
- No-pubsub branch dropped relay events and initial snapshot
- Failed relay dial caused permanent stall (no reconnect retry)
- DB error during reconnect timer caused permanent stall
- `time.NewTimer` replaced with `quartz.Clock` for testable timing

## Tests

9 enterprise tests covering:

- Relay reconnect on drop (mock clock)
- Async dial does not block merge loop
- Relay snapshot delivery
- Stale dial discarded after interrupt
- Cancel during in-flight dial
- Running-to-running worker switch
- Failed dial retries (mock clock)
- Local worker closes relay
- Multiple consecutive reconnects (mock clock)

All pass with `-race`.
2026-03-04 18:42:28 -05:00
Kyle Carberry f4a7fa5b95 fix(chatd): block subagents from spawning workspaces (#22603)
## Summary

Subagent (child) chats were previously given access to workspace
provisioning tools (`list_templates`, `read_template`,
`create_workspace`), which could lead to uncontrolled resource
consumption. This PR moves those tools behind the same
`!chat.ParentChatID.Valid` gate that already protects the subagent tools
(`spawn_agent`, `wait_agent`, etc.).

## Changes

- **`coderd/chatd/chatd.go`**: Moved `list_templates`, `read_template`,
and `create_workspace` tool registration into the root-chat-only block
alongside subagent tools.
- **`coderd/chatd/chatd_test.go`**: Added
`TestSubagentChatExcludesWorkspaceProvisioningTools` — an E2E test that
spawns a subagent via a root chat and verifies the subagent's LLM call
does not include workspace provisioning or subagent tools.
- **`coderd/chatd/chattest/openai.go`**: Added `Tools` field to
`OpenAIRequest` and supporting `OpenAITool`/`OpenAIToolFunction` types
so tests can inspect which tools are sent to the model.
2026-03-04 15:49:14 +00:00
Kyle Carberry b7a7683ac0 fix(chatd): harden cross-replica relay for chat stream parts (#22533)
## Problem

Subscribers connecting to a different replica than the one running the
chat see full messages appear but no streaming partials (`message_part`
events). The relay mechanism that forwards ephemeral parts across
replicas had several bugs.

## Root Causes

1. **`openRelay()` blocked the event loop** — The WebSocket dial (TCP +
TLS + HTTP upgrade) to the worker replica ran synchronously inside the
select loop. While dialing, no events could be processed, channels
filled up, and parts were silently dropped.

2. **Relay drops were permanent** — When the relay WebSocket closed
mid-stream, `relayParts` was set to nil and never reopened. No status
notification would re-trigger it since the chat was still running on the
same worker.

3. **`drainInitial` snapshot race** — The `default` case in the initial
drain loop caused the snapshot to be empty if the remote hadn't flushed
data yet (common immediately after WebSocket connect).

4. **Duplicate event delivery** — The `preloaded` slice caused snapshot
events to be sent both in the return value and re-sent through the
channel goroutine.

## Fixes

### `coderd/chatd/chatd.go` (Subscribe method)
- **Async relay dial**: `openRelayAsync()` spawns a goroutine to dial
the remote replica. The result (channel + cancel func) is delivered on a
`relayReadyCh` channel that the select loop reads without blocking.
- **Relay reconnection**: When the relay channel closes, a 500ms timer
fires. The handler re-checks chat status from the DB and reopens the
relay if the chat is still running on a remote worker.
- **Snapshot parts via channel**: Relay snapshot + live parts are
wrapped into a single channel so they flow through the same path,
avoiding races with the select loop.

### `enterprise/coderd/chats.go` (newRemotePartsProvider)
- **Timer-based drain**: Replaced `default` with a 1-second timer. After
the first event, `Reset(0)` switches to non-blocking drain for remaining
buffered events.
- **Remove preloaded duplication**: The goroutine now only forwards new
events; snapshot events are returned to the caller directly.

## Testing

All existing tests pass:
- `TestInterruptChatBroadcastsStatusAcrossInstances`
- `TestSubscribeSnapshotIncludesStatusEvent`
- `TestSubscribeNoPubsubNoDuplicateMessageParts`
- `TestSubscribeAfterMessageID`
- `TestChatStreamRelay/RelayMessagePartsAcrossReplicas`
2026-03-02 19:57:13 -05:00
Kyle Carberry 5eebd3829f fix: use cursor-based query for chat stream notifications (#22510)
## Problem

The pubsub notification handler in `chatd` re-fetched **all** messages
from the DB on every new message notification, then filtered in Go with
`msg.ID > lastMessageID`. This grows linearly with conversation length —
every new message triggers a full table scan of that chat's history.

The `AfterMessageID` field in the pubsub notification payload was
clearly designed for cursor-based fetching, but no matching query
existed.

## Fix

- Add `GetChatMessagesByChatIDAfter` SQL query with `WHERE id >
@after_id`, so the database does the filtering instead of Go.
- Use it in the pubsub notification handler in `chatd.go`, passing
`lastMessageID` as the cursor.
- Implement the dbauthz wrapper (was a `panic("not implemented")` stub
from codegen) with the same read-check-on-parent-chat pattern as
adjacent methods.
- Add dbauthz test coverage for the new method.

**Not changed:** The initial snapshot in `Subscribe()` still loads all
messages — that's correct, since a newly-connecting client needs the
full conversation state. The waste was only in the ongoing notification
path.
2026-03-02 16:31:04 -05:00
Kyle Carberry 1c71fd69f6 fix: workspace auto-refresh during the chat flow (#22447) 2026-02-28 19:07:17 -05:00
Kyle Carberry 2abe55549c fix: return in-flight chats to pending on server shutdown (#22443)
When a chatd server shuts down (`Close()`), the server context is
canceled. Previously, in-flight chats would be marked as `error` because
the `context.Canceled` error was not distinguished from actual
processing failures.

This adds `isShutdownCancellation()` to detect when the error is caused
by the server context being canceled (as opposed to a chat-specific
cancellation like `ErrInterrupted`). When detected, the chat status is
set to `pending` with no `last_error`, allowing another replica to pick
it up and retry.

Extracted from #22440 — only the context cancellation bug fix, no
chattest changes.
2026-02-28 17:14:11 -05:00
Kyle Carberry c5619746d1 fix(chat): fix stream state discrepancies between frontend and backend (#22437)
## Summary

Fixes four frontend↔backend discrepancies in chat stream state
management that could cause duplicate content, UI flicker, and stale
stream state.

### Backend fixes (`coderd/chatd/chatd.go`)

**1. No-pubsub path double-replayed message_part events**

`Subscribe()` built an `initialSnapshot` containing `message_part`
events from `localSnapshot`, then the no-pubsub goroutine replayed the
same `localSnapshot` into the `mergedEvents` channel. Since `streamChat`
sends the snapshot first then reads the channel, the frontend received
every `message_part` twice. `applyMessagePartToStreamState` doesn't
deduplicate — text gets concatenated, so content appeared doubled.

Fix: Only forward live `localParts` in the no-pubsub goroutine; the
snapshot already contains the historical events.

**2. Snapshot missing status event**

The initial snapshot never included a `status` event. The frontend's
`shouldApplyMessagePart()` gates on status (`pending`/`waiting`), but
the initial status came from a separate REST query via `useEffect`.
During the race window between snapshot arrival and REST resolution,
`message_part` events could be incorrectly accepted or rejected.

Fix: Prepend a `status` event to the snapshot after loading the chat
from DB, so the frontend has the authoritative status from the very
first batch.

### Frontend fixes (`ChatContext.ts`)

**3. Scheduled stream reset not canceled by subsequent message_parts**

When a `message` event arrived, `scheduleStreamReset()` queued
`clearStreamState` via `requestAnimationFrame`. If new `message_part`
events arrived in the next WebSocket frame before the rAF fired, they
were pushed to `pendingMessageParts` without canceling the scheduled
reset. The rAF would fire between frames, clearing stream state, then
the next flush would re-populate it — causing a visible flash.

Fix: Call `cancelScheduledStreamReset()` when accumulating
`message_part` events.

**4. startTransition race with synchronous clearStreamState**

`flushMessageParts` wrapped `applyMessageParts` in `startTransition`,
which React can defer. If a `status: "waiting"` event arrived in the
same batch after `message_part` events, the status handler cleared
stream state synchronously, but the deferred `applyMessageParts`
callback could fire afterward and re-populate it.

Fix: Re-check `shouldApplyMessagePart()` inside the `startTransition`
callback at execution time.

### Tests added

- **Go**: `TestSubscribeSnapshotIncludesStatusEvent` — asserts the first
snapshot event is a status event
- **Go**: `TestSubscribeNoPubsubNoDuplicateMessageParts` — asserts the
events channel doesn't replay snapshot events
- **TS**: `cancels scheduled stream reset when message_part arrives
after message` — verifies stream state survives a [message,
message_part] batch
- **TS**: `does not apply message parts after status changes to waiting`
— verifies deferred applyMessageParts respects status transitions
2026-02-28 13:35:23 -05:00
Kyle Carberry 0ad2f9ecd7 feat(chatd): persist last_error on chats table (#22436)
Adds a nullable `last_error` column to the `chats` table so error
reasons survive page reloads.

**Backend:**
- Migration adds `last_error TEXT` (nullable) to chats
- `UpdateChatStatus` writes the error reason when status transitions to
`error`, clears it (NULL) on recovery
- `convertChat` maps `sql.NullString` to `*string` in the SDK

**Frontend:**
- Sidebar falls back to `chat.last_error` when no stream error reason is
cached
- Chat detail page does the same for `persistedErrorReason`
- Fixtures updated for new required field
2026-02-28 12:27:26 -05:00
Kyle Carberry f509c841cf fix(chatd): recover stale chats after coderd redeployment (#22405)
## Problem

When coderd instances are redeployed (e.g. rolling deployment on
dogfood), in-flight chats get stuck in `running` status permanently. The
UI shows them as "thinking" with a spinning indicator, but no worker is
actually processing them. They never error or resume.

## Root Cause

Two bugs combine to cause this:

### Bug 1: Shutdown cleanup uses a canceled context

The `processChat` defer block updates the chat status in the DB when
processing completes. But it uses `ctx`, which `Close()` cancels
*before* the defer runs. The DB transaction silently fails with
`context.Canceled`, leaving the chat in `status=running` with a dead
`worker_id`.

```go
// Close() calls p.cancel() which cancels ctx
// Then the defer tries to use the now-canceled ctx:
defer func() {
    err := p.db.InTx(func(tx database.Store) error {
        tx.GetChatByIDForUpdate(ctx, chat.ID) // FAILS
        tx.UpdateChatStatus(ctx, ...)          // FAILS
    }, nil)
}()
```

### Bug 2: Stale recovery runs only once at startup

`recoverStaleChats()` was called only once in `start()`, not
periodically. During a rolling deployment, the new instance starts while
the old one is still alive (fresh heartbeat). By the time the old
instance crashes, no one checks again.

## Fix

1. **Use `context.WithoutCancel(ctx)` in the processChat defer** — the
cleanup transaction now completes even during graceful shutdown.

2. **Run `recoverStaleChats` periodically** — a second ticker in the
`start()` loop checks for stale chats at `inFlightChatStaleAfter / 5`
intervals (default: every 1 minute). This catches orphaned chats even
when the instance that owns them crashes without clean shutdown.

## Tests

- `TestRecoverStaleChatsPeriodically` — Verifies chats orphaned *after*
startup are recovered by the periodic loop (not just the startup check).
- `TestNewReplicaRecoversStaleChatFromDeadReplica` — Verifies a new
replica recovers stale chats on startup.
- `TestWaitingChatsAreNotRecoveredAsStale` — Negative test: `waiting`
chats are not incorrectly modified by recovery.
2026-02-27 15:25:40 -05:00
Kyle Carberry edee917d88 feat: add experimental agents support (#22290)
feat: add AI chat system with agent tools and chat UI

Introduce the chatd subsystem and Agents UI for AI-powered chat
within Coder workspaces.

- Add chatd package with chat loop, message compaction, prompt
  management, and LLM provider integration (OpenAI, Anthropic)
- Add agent tools: create workspace, list/read templates, read/write/
  edit files, execute commands
- Add chat API endpoints with streaming, message editing, and
  durable reconnection
- Add database schema and migrations for chats, chat messages, chat
  providers, and chat model configs
- Add RBAC policies and dbauthz enforcement for chat resources
- Add Agents UI pages with conversation timeline, queued messages
  list, diff viewer, and model configuration panel
- Add comprehensive test coverage including coderd integration tests,
  chatd unit tests, and Storybook stories
- Gate feature behind experiments flag

---------

Co-authored-by: Cian Johnston <cian@coder.com>
Co-authored-by: Danielle Maywood <danielle@themaywoods.com>
Co-authored-by: Jeremy Ruppel <jeremy@coder.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-27 16:50:56 +00:00