Commit Graph

3582 Commits

Author SHA1 Message Date
Kyle Carberry c5d720f73d feat(coderd): add telemetry for agents chats and messages (#24068)
Adds telemetry collection for the agents chat system (`/agents`) to the
existing telemetry snapshot pipeline.

Three new snapshot fields:
- **`Chats`** — per-chat metadata (id, owner, status, mode,
workspace_id, root_chat_id, has_parent, archived, model config)
collected time-windowed via `createdAfter`
- **`ChatMessageSummaries`** — per-chat aggregated message metrics
(counts by role, token sums by type, cost, runtime, model count,
compression count) collected time-windowed
- **`ChatModelConfigs`** — model configuration metadata (provider,
model, context limit, enabled, default) collected as full dump

No PII is included — titles, message content, and URLs are excluded at
the SQL level. Only structural metadata flows through telemetry.

<details><summary>Implementation plan</summary>

### SQL Queries (`coderd/database/queries/chats.sql`)
- `GetChatsCreatedAfter` — time-windowed chat metadata
- `GetChatMessageSummariesPerChat` — per-chat message aggregates via
`GROUP BY`
- `GetChatModelConfigsForTelemetry` — full dump of model configs

### Telemetry (`coderd/telemetry/telemetry.go`)
- `Chat`, `ChatMessageSummary`, `ChatModelConfig` structs
- `ConvertChat`, `ConvertChatMessageSummary`, `ConvertChatModelConfig`
conversion functions
- Three `eg.Go()` blocks in `createSnapshot()` following the existing
collection pattern

### Authorization (`coderd/database/dbauthz/dbauthz.go`)
- System-only access for all three queries via `rbac.ResourceSystem`

### Tests
- `TestChatsTelemetry` in `coderd/telemetry/telemetry_test.go` — creates
chats (root + child), messages with token/cost data, model configs;
verifies all snapshot fields
- dbauthz test entries for all three queries in
`coderd/database/dbauthz/dbauthz_test.go`

</details>

> 🤖 Generated by Coder Agents
2026-04-08 09:47:44 -04:00
Cian Johnston 233343c010 feat: add chat and chat_files cleanup to dbpurge (#23833)
Fixes https://github.com/coder/coder/issues/23910

Adds periodic cleanup of chats and chat files to the dbpurge background
goroutine, with a configurable retention period exposed in the Agent
settings UI.

> 🤖 Written by a Coder Agent. Reviewed by a human.
2026-04-08 11:08:09 +01:00
Kayla はな c5f1a2fccf feat: make service accounts a Premium feature (#24020) 2026-04-07 12:25:32 -06:00
Zach 565a15bc9b feat: update user secrets queries for REST API and injection (#23998)
Update queries as prep work for user secrets API development:
- Switch all lookups and mutations from ID-based to user_id + name
- Split list query into metadata-only (for API responses) and
with-values (for provisioner/agent)
- Add partial update support using CASE WHEN pattern for write-only
value fields
- Include value_key_id in create for dbcrypt encryption support
- Update dbauthz wrappers and remove stale methods from dbmetrics
2026-04-07 09:03:28 -06:00
Kyle Carberry 684f21740d perf(coderd): batch chat heartbeat queries into single UPDATE per interval (#24037)
## Summary

Replaces N per-chat heartbeat goroutines with a single centralized
heartbeat loop that issues one `UPDATE` per 30s interval for all running
chats on a worker.

## Problem

Each running chat spawned a dedicated goroutine that issued an
individual `UPDATE chats SET heartbeat_at = NOW() WHERE id = $1 AND
worker_id = $2 AND status = 'running'` query every 30 seconds. At 10,000
concurrent chats this produces **~333 DB queries/second** just for
heartbeats, plus ~333 `ActivityBumpWorkspace` CTE queries/second from
`trackWorkspaceUsage`.

## Solution

New `UpdateChatHeartbeats` (plural) SQL query replaces the old singular
`UpdateChatHeartbeat`:

```sql
UPDATE chats
SET    heartbeat_at = @now::timestamptz
WHERE  worker_id = @worker_id::uuid
  AND  status = 'running'::chat_status
RETURNING id;
```

A single `heartbeatLoop` goroutine on the `Server`:
1. Ticks every `chatHeartbeatInterval` (30s)
2. Issues one batch UPDATE for all registered chats
3. Detects stolen/completed chats via set-difference (equivalent of old
`rows == 0`)
4. Calls `trackWorkspaceUsage` for surviving chats

`processChat` registers an entry in the heartbeat registry instead of
spawning a goroutine.

## Impact

| Metric | Before (10K chats) | After (10K chats) |
|---|---|---|
| Heartbeat queries/sec | ~333 | ~0.03 (1 per 30s per replica) |
| Heartbeat goroutines | 10,000 | 1 |
| Self-interrupt detection | Per-chat `rows==0` | Batch set-difference |

---

> 🤖 Generated by Coder Agents

<details><summary>Implementation notes</summary>

- Uses `@now` parameter instead of `NOW()` so tests with `quartz.Mock`
can control timestamps.
- `heartbeatEntry` stores `context.CancelCauseFunc` + workspace state
for the centralized loop.
- `recoverStaleChats` is unaffected — it reads `heartbeat_at` which is
still updated.
- The old singular `UpdateChatHeartbeat` is removed entirely.
- `dbauthz` wrapper uses system-level `rbac.ResourceChat` authorization
(same pattern as `AcquireChats`).

</details>
2026-04-07 10:25:46 -04:00
George K 86ca61d6ca perf: cap count queries and emit native UUID comparisons for audit/connection logs (#23835)
Audit and connection log pages were timing out due to expensive COUNT(*)
queries over large tables. This commit adds opt-in count capping: requests can
return a `count_cap` field signaling that the count was truncated at a threshold,
avoiding full table scans that caused page timeouts.

Text-cast UUID comparisons in regosql-generated authorization queries
also contributed to the slowdown by preventing index usage for connection
and audit log queries. These now emit native UUID operators.

Frontend changes handle the capped state in usePaginatedQuery and
PaginationWidget, optionally displaying a capped count in the pagination
UI (e.g. "Showing 2,076 to 2,100 of 2,000+ logs")

Related to:
https://linear.app/codercom/issue/PLAT-31/connectionaudit-log-performance-issue
2026-04-07 07:24:53 -07:00
Michael Suchacz d7c8213eee fix(coderd/x/chatd/mcpclient): deterministic external MCP tool ordering (#24075)
> This PR was authored by Mux on behalf of Mike.

External MCP tools returned by `ConnectAll` were ordered by goroutine
completion, making the tool list nondeterministic across chat turns.
This broke prompt-cache stability since tools are serialized in order.

Sort tools by their model-visible name after all connections complete,
matching the existing pattern in workspace MCP tools
(`agent/x/agentmcp/manager.go`). Also guards against a nil-client panic
in cleanup when a connected server contributes zero tools after
filtering.
2026-04-07 14:42:30 +02:00
Cian Johnston d5a1792f07 feat: track chat file associations with chat_file_links on chats (#23537)
Needed by #23833

Adds a `chat_file_links` association table to track which files are
associated with each chat.

- `AppendChatFileIDs` query links a file to a chat with deduplication
- `GetChatFileMetadataByIDs` query returns lightweight file metadata by
IDs
- Tool-created files (e.g. `propose_plan`) are linked to the chat after
insert
- User-uploaded files are linked to the chat when the referencing
message is sent
- Single-chat GET endpoint hydrates `files: ChatFileMetadata[]` on the
response

> 🤖 Created by Coder Agents and massaged into shape by a human.
2026-04-07 12:05:29 +01:00
Kyle Carberry acd5f01b4b fix: use GreaterOrEqual for step runtime assertion in chatloop test (#24067)
Fixes https://github.com/coder/internal/issues/1418

The `TestRun_ActiveToolsPrepareBehavior` test asserts
`persistedStep.Runtime > 0`, but on Windows the timer resolution (~15ms)
means the in-memory mock model can complete within the same clock tick,
producing a measured duration of `0s`.

Change the assertion from `require.Greater` to `require.GreaterOrEqual`
so that a legitimately measured zero duration on low-resolution clocks
does not cause a flake.

> Generated by Coder Agents
2026-04-07 02:08:49 +00:00
Kyle Carberry 6c62d8f5e6 fix(coderd/x/chatd): fix flaky TestAwaitSubagentCompletion/CompletesViaPubsub (#24066)
## Fix flaky TestAwaitSubagentCompletion/CompletesViaPubsub

Fixes coder/internal#1435

### Root Cause

During `createParentChildChats`, the processor publishes notifications
on `ChatStreamNotifyChannel(child.ID)` via PostgreSQL `LISTEN/NOTIFY`.
After `drainInflight()` returns, these stale notifications can still be
buffered in the pgListener's `NotifyChan()`. When
`awaitSubagentCompletion` subscribes and a stale notification is
dispatched between `setChatStatus(Waiting)` and
`insertAssistantMessage`, `checkSubagentCompletion` sees `done=true`
(status is `Waiting`) but returns an empty report because the message
hasn't been committed yet.

### Fix

Swap the order: insert the assistant message **before** transitioning
the status to `Waiting`. This guarantees the report is committed before
the status makes the chat appear complete to `checkSubagentCompletion`.

### Verification

- 50 consecutive runs of the specific test: all pass
- 10 runs of the full `TestAwaitSubagentCompletion` suite: all pass
- 20 runs with `-race`: all pass

> Generated by Coder Agents
2026-04-07 02:04:48 +00:00
Kyle Carberry 648787e739 feat: expose busy_behavior on chat message API (#24054)
The backend (`chatd.go`) already fully implements both `"queue"` and
`"interrupt"` busy behaviors for `SendMessage`, and the `message_agent`
subagent tool already leverages both internally. However the HTTP API
hardcoded `"queue"` and the SDK had no way for callers to request
interrupt-on-send.

This adds a `ChatBusyBehavior` enum type to the SDK and an optional
`busy_behavior` field on `CreateChatMessageRequest`. The HTTP handler
validates the field and passes it through to `chatd.SendMessage`.
Default remains `"queue"` for full backward compatibility.

<details><summary>Implementation notes</summary>

- `codersdk/chats.go`: New `ChatBusyBehavior` type with
`ChatBusyBehaviorQueue` and `ChatBusyBehaviorInterrupt` constants. Added
`BusyBehavior` field to `CreateChatMessageRequest` with `enums` tag for
codegen.
- `coderd/exp_chats.go`: `postChatMessages` now reads
`req.BusyBehavior`, maps SDK constants to
`chatd.SendMessageBusyBehavior*`, returns 400 on invalid values.
- `site/src/api/typesGenerated.ts`: Auto-generated via `make gen`.
- No frontend behavior changes — the field is available but unused by
the UI.

</details>

> [!NOTE]
> Generated by Coder Agents
2026-04-06 16:20:14 -04:00
Kyle Carberry 4cfbf544a0 feat: add per-chat system prompt option (#24053)
Adds a `system_prompt` field to `CreateChatRequest` that allows API
consumers to provide custom instructions when creating a chat. The
per-chat prompt is stored as a separate system message (`role=system`,
`visibility=model`) in the `chat_messages` table, inserted between the
deployment system prompt and the workspace awareness message.

Also moves deployment system prompt resolution from the HTTP handler
(`resolvedChatSystemPrompt`) into `chatd.CreateChat` where it belongs.
The handler no longer assembles system prompts —
`CreateOptions.SystemPrompt` is now purely the per-chat user prompt, and
the deployment prompt is resolved internally by chatd.

No database schema changes required.

**Message insertion order:**
1. Deployment system prompt (resolved by chatd, existing)
2. Per-chat user system prompt (new, from `CreateOptions.SystemPrompt`)
3. Workspace awareness (existing)
4. Initial user message (existing)

🤖 Generated with [Coder Agents](https://coder.com/agents)
2026-04-06 17:19:05 +00:00
Kyle Carberry a2ce74f398 feat: add total_runtime_ms to chat cost analytics endpoints (#24050)
Surface the aggregated `runtime_ms` from `chat_messages` through all
four cost analytics queries (summary, per-model, per-chat, per-user).
This is the key billing metric for agent compute time.

The per-chat breakdown already groups by `root_chat_id`, so subagent
runtime is automatically rolled up under the parent chat — no additional
query changes needed.

<details>
<summary>Implementation details</summary>

**SQL** (`coderd/database/queries/chats.sql`): Added
`COALESCE(SUM(cm.runtime_ms), 0)::bigint AS total_runtime_ms` to
`GetChatCostSummary`, `GetChatCostPerModel`, `GetChatCostPerChat`, and
`GetChatCostPerUser`.

**Go SDK** (`codersdk/chats.go`): Added `TotalRuntimeMs int64` to
`ChatCostSummary`, `ChatCostModelBreakdown`, `ChatCostChatBreakdown`,
and `ChatCostUserRollup`.

**Handler** (`coderd/exp_chats.go`): Wired the new field through all
converter functions and the response assembly.

**Tests** (`coderd/exp_chats_test.go`): Updated fixture to seed non-zero
`runtime_ms` values and added assertions for the new field at summary,
per-model, and per-chat levels.
</details>

> 🤖 Generated by Coder Agents
2026-04-06 12:10:57 -04:00
Kyle Carberry 8625543413 feat(coderd/x/chatd): parallelize ConvertMessagesWithFiles with g2 errgroup (#24034)
## Summary

Move `ConvertMessagesWithFiles` into the `g2` errgroup so prompt
conversion runs concurrently with instruction persistence, user prompt
resolution, MCP server connections, and workspace MCP tool discovery.

## Problem

In `runChat`, the setup before the first LLM `Stream()` call is
sequential across two errgroups:

```
g.Wait()                          // model + messages + MCP configs
ConvertMessagesWithFiles()        // sequential — blocked on g2 starting
g2.Wait()                         // instructions + user prompt + MCP connect + workspace MCP
```

`ConvertMessagesWithFiles` can take non-trivial time on conversations
with file attachments (batch DB resolution), and it was blocking g2 from
starting.

## Fix

`ConvertMessagesWithFiles` only reads the `messages` slice (available
after `g.Wait()`) and resolves file references via the database. No g2
task reads or writes the `prompt` variable. This makes it safe to
overlap with g2:

```
g.Wait()
g2.Wait()   // now includes ConvertMessagesWithFiles in parallel
```

The `InsertSystem` call for parent chats and the `promptErr` check are
deferred to after `g2.Wait()`, preserving correctness.

<details><summary>Decision log</summary>

- `ConvertMessagesWithFiles` is read-only on `messages` — no mutation,
safe for concurrent access
- `prompt` and `promptErr` are written only by the conversion goroutine,
read only after `g2.Wait()` — no data race
- Error from prompt conversion is checked immediately after `g2.Wait()`,
before any code that uses `prompt`
- `chatloop.Run` now uses `:=` instead of `=` since the prior `err`
declaration from `prompt, err :=` was removed

</details>

> Generated by Coder Agents
2026-04-05 11:42:07 +00:00
Kyle Carberry e18094825a fix: retain message_part buffer for cross-replica relay (#24031) 2026-04-04 17:24:41 -04:00
Kyle Carberry 919dc299fc feat: agent reads context files and discovers skills locally (#23935)
Piggybacks on #23878. Moves instruction file reading and skill discovery
from `chatd` (server-side, via multiple `LS`/`ReadFile` round-trips
through the agent connection) to the agent itself (local filesystem
access).

This intentionally drops backward compatibility with older agents that
don't support the context-config endpoint. Agents and server are
deployed together; there is no rolling-update contract to maintain here.

## What changed

The agent's `GET /api/v0/context-config` response now returns
`[]ChatMessagePart` directly — the same types chatd persists. This
eliminates intermediate type conversions and makes the protocol
extensible.

| Field | Type | Description |
|---|---|---|
| `parts` | `[]ChatMessagePart` | Context-file and skill parts, ready to
persist |
| `working_dir` | `string` | Agent's resolved working directory |

Removed from the response: `instructions_dirs`, `instructions_file`,
`skills_dirs`, `skill_meta_file`, `mcp_config_files` — the agent reads
files locally and returns their content as parts.

Removed from chatd: all legacy `LS`/`ReadFile` fallback code
(`readHomeInstructionFile`, `readInstructionDirFile`, `DiscoverSkills`
via LS, etc).

## Why

The previous architecture had the agent resolve paths, serve them over
HTTP, then `chatd` make N+1 round-trips back through the agent
connection to read files. The agent has direct filesystem access and
should just read the files.

## Key design decisions

- **Agent returns `ChatMessagePart` directly** — same types chatd
persists. No intermediate `InstructionFileEntry`/`SkillEntry` types
needed.
- **`SkillMeta.MetaFile`** — persisted via `ContextFileSkillMetaFile` on
the skill part, so custom meta file names
(`CODER_AGENT_EXP_SKILL_META_FILE`) survive across chat turns.
- **No pre-read body** — `read_skill` always dials the workspace to
fetch the skill body on demand. Simpler than caching the body in the
response.
- **MCP config paths kept agent-internal** — `MCPConfigFiles()` getter,
not sent over the wire.
- **No backward compat fallback** — old agents that don't support
context-config get no instruction files. This is acceptable since agent
and server deploy together.
2026-04-04 12:45:46 -04:00
Jon Ayers a1d51f0dab feat: batch connection logs to avoid DB lock contention (#23727)
- Running 30k connections was generating a ton of lock contention in the
DB
2026-04-03 15:47:26 -05:00
Jon Ayers 333503f74e feat: improve coordinator peer mapping performance (#23696)
- Skipping DB querying entirely for peers that aren't actually connected
to our coordinator
- Opportunistically batching the queries for peers
2026-04-03 14:22:58 -05:00
Jeremy Ruppel 01b8cdb00d fix: remove work/personal onboarding telemetry (#24021)
Following on from #23989 #24018 

- We also no longer want to collect `IsBusiness` demographic data
- Newsletter fields no longer allow `nil` as a value, instead default to
false

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 14:26:35 -04:00
Hugo Dutka ec83065b59 fix(coderd/x/chatd): inflight wait group data race (#24007)
Addresses https://github.com/coder/internal/issues/1450
2026-04-03 20:04:09 +02:00
Jeremy Ruppel 2a1bef18e0 fix: remove IndustryType and OrgSize from FirstUserOnboarding telemetry (#24018)
New `IndustryType` and `OrgSize` enums were added in #23989, but they
are no longer desired in the onboarding/marketing telemetry data. This
removes them.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 11:35:37 -04:00
Paweł Banaszewski 8369fa88fd feat: add columns for cached tokens from aibridge (#23832)
Two new columns added to aibridge_token_usages:
  - cache_read_input_tokens (BIGINT, default 0)
  - cache_write_input_tokens (BIGINT, default 0)

Migration backfills existing rows by extracting values from the metadata
JSONB column (cache_read_input, input_cached, prompt_cached for reads
(max value selected since only 1 should be set), cache_creation_input
for writes).

All references to data from metadata were updated to reference new
columns. No other changes then changing where data is extracted from.

Requires aibridge library version bump to include:
https://github.com/coder/aibridge/pull/229
Fixes: https://github.com/coder/aibridge/issues/150
2026-04-03 16:27:31 +02:00
Jeremy Ruppel da3c46b557 feat: add onboarding info fields to first user setup (#23989)
Add optional demographic and newsletter preference fields to the setup
page: business use (yes/no), industry type, organization size, and two
newsletter toggles (marketing, release/security updates).

The new data flows through telemetry via a FirstUserOnboarding struct in
the snapshot payload, sent once when the first user is created. The
telemetry-server and BigQuery schema changes are required separately to
persist this data.

---------

Co-authored-by: default <davidiii@fraley.us>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 09:52:52 -04:00
Hugo Dutka 53482adc2d fix(coderd/x/chatd): TestAwaitSubagentCompletion/ContextCanceled flake (#24008)
Addresses https://github.com/coder/internal/issues/1437
2026-04-03 13:11:38 +02:00
Sas Swart 5b6b7719df fix: make prebuild claiming durable and idempotent (#23108)
## Problem

When a prebuilt workspace is claimed, the agent reinitializes via a
single fire-and-forget pubsub event over SSE. If the agent's SSE
connection is interrupted at claim time, the event is permanently lost —
the workspace is stuck with no self-healing path.

Additionally, regular (non-prebuild) workspaces had no way to opt out of
the `/reinit` polling loop — agents would reconnect indefinitely to an
endpoint that would never send them anything useful.

## Root Cause

`workspaceAgentReinit` fetches the workspace (with its current
`owner_id`) via `GetWorkspaceByAgentID`, but never checked whether a
claim already happened. It only subscribed to pubsub for future events.
The database already has durable claim state (`owner_id` changes from
`PrebuildsSystemUserID` to the real user), but no layer ever consulted
it on reconnection.

## Solution

### Server-side durable check with first-build-initiator gating

**TOCTOU-safe ordering**: Subscribe to pubsub claim events *before* any
durable checks, so a claim that fires during the check is buffered in
the channel rather than lost.

**First-build-initiator gating**: When `!workspace.IsPrebuild()` (owner
is no longer the system user), look up the first build's `InitiatorID`.
The prebuild reconciler always uses `PrebuildsSystemUserID` as the
initiator. This distinguishes claimed prebuilds from regular workspaces
without any SQL schema changes.

- **Regular workspace** (first build initiator ≠ system user) → **409
Conflict**, agent stops reconnecting
- **Claimed prebuild, build completed** → pre-seed channel with reinit
event and close it, transmitter delivers one-shot then exits
- **Claimed prebuild, build in-progress** → fall through to pubsub
subscription, agent waits for completion event
- **Unclaimed prebuild** → pubsub subscription (existing happy path)

### Declarative reinit events (defense-in-depth)

- Added `UserID` field to `ReinitializationEvent` with JSON tags
- Switched pubsub serialization from raw string to JSON (with
backward-compat fallback for rolling upgrades)
- Populated `UserID` at both the publish site and the durable check

### Agent SDK: 409 handling

`WaitForReinitLoop` detects 409 Conflict from the server and closes the
`reinitEvents` channel, cleanly exiting the retry goroutine.

### Agent CLI: fixed two bugs + added reinitCtx

- **Closed channel (`!ok`)**: now blocks on `<-ctx.Done()` instead of
`continue`, keeping the current agent running. Previously this would
leak agents by skipping `agnt.Close()` and re-entering the loop.
- **Duplicate owner reinit**: cancels `reinitCtx` (stops the reinit
goroutine), then blocks on `<-ctx.Done()`. Previously `continue` would
skip cleanup and create a new agent on the next loop iteration.
- **`reinitCtx`**: a cancellable child of `ctx` passed to
`WaitForReinitLoop`, allowing the agent to stop the reinit HTTP polling
after reinit completes.

### Agent-side idempotency

Tracks `lastOwnerID` in the agent reinit loop — duplicate events for the
same owner are skipped.

## Testing

- **"unclaimed prebuild receives reinit via pubsub"**: prebuild owned by
system user, pubsub event triggers reinit
- **"claimed prebuild receives one-shot reinit on reconnect"**: first
build by system user, owner changed, build completed → immediate reinit
(no pubsub needed)
- **"claimed prebuild waits during in-progress claim build"**: claimed
but build still running → no reinit until build completes
- **"regular workspace gets 409"**: first build by real user → 409
Conflict, agent stops polling
- Updated claim publisher/listener tests: verify `UserID` survives JSON
round-trip + backward compat with raw string payloads
- Updated SSE round-trip test: verify `UserID` survives transmit →
receive cycle

Fixes #22359

## Rolling upgrade note

During a rolling deploy where old coderd instances coexist with new
ones, the pubsub `ReinitializationEvent` has a new `workspace_id` field
(JSON key `workspace_id`). Old publishers send a raw reason string
instead of JSON; the new listener gracefully falls back by treating the
entire payload as the reason and filling in `WorkspaceID` from context.
The only visible effect during the upgrade window is that `WorkspaceID`
may be the zero UUID in agent-side logs — this is cosmetic and resolves
once all instances are updated.
2026-04-02 23:51:02 +02:00
Zach 990c006f28 feat(coderd/database): add value_key_id column to user_secrets for encryption (#23997)
Add a nullable `value_key_id` column to the `user_secrets` table with a
foreign key to `dbcrypt_keys`. This is the column dbcrypt uses to track
which encryption key encrypted a given secret's value. This is required
for encryption of user secret values.

The column was missing from the original migration (000357).
2026-04-02 15:40:32 -06:00
Michael Suchacz 7d0a0c6495 feat: provider key policies and user provider settings (#23751) 2026-04-02 19:46:42 +02:00
Hugo Dutka 17dec2a70f feat: agents desktop recordings backend (#23894)
This PR introduces screen recording of the computer use agent using the
virtual desktop.

- Screen recording is triggered by a `wait_agent` tool call. Recording
is stopped by a successful `wait_agent` tool call or when there hasn't
been any desktop activity for 10 minutes.
- Recordings are handled by the `portabledesktop` cli via the `record`
command. The videos are sped up in periods of inactivity.
- Recordings are saved to the database to the `chat_files` table.
There's a hard limit of 100MB per recording. Larger recordings are
dropped.
- A successful `wait_agent` on a computer use subagent tool call returns
a `recording_file_id`, later allowing the frontend to display the
corresponding video.
2026-04-02 17:23:27 +00:00
dylanhuff-at-coder f796f3645f fix(coderd): fix isContextLimitKey false positive on max_context_version (#23950)
`isContextLimitKey` had a fallback heuristic that matched any key starting with `"max"` containing `"context"`, causing false positives on keys like `"max_context_version"`. A provider returning such metadata would have the value parsed as a context limit.

Replace substring matching on the separator-stripped key with word-level matching. A new `metadataKeyWords` function tokenizes keys by splitting on separators and camelCase boundaries, then the fallback requires
`"context"` paired with a limit-related word (`"limit"`, `"window"` + qualifier, `"length"` + qualifier, or `"tokens"` + qualifier). Known exact forms like `"context_window"` remain in the fast-path switch.

Closes https://github.com/coder/coder/issues/23332
2026-04-02 10:07:01 -07:00
Cian Johnston b5da77ff55 test: cover read_template and create_workspace allowlist enforcement (#23645)
- Extend `TestChatTemplateAllowlistEnforcement` to also exercise
`read_template` and `create_workspace` through the allowlist
- Mock LLM now chains 4 tool calls: list_templates, read_template
(blocked), read_template (allowed), create_workspace (blocked)
- Wire dummy `CreateWorkspace` config into test server so the tool
reaches the allowlist check
- Generalize tool result collection to support multiple calls per tool
name

> 🤖 Created by Coder Agents and reviewed by Kyle the human.
2026-04-02 15:39:40 +00:00
Cian Johnston d4a9c63e91 feat: auto-assign agents-access role to new users when experiment enabled (#23968)
When the `agents` experiment is enabled, new users are automatically
granted the `agents-access` role at creation time so they can use Coder
Agents without manual admin intervention.

- Auto-assigns in `CreateUser()` — covers admin API, OAuth, and OIDC
creation paths
- Skips auto-assign for OIDC users when enterprise site role sync is
enabled (sync overwrites roles on every login; those admins should use
`--oidc-user-role-default` instead)
- CLI `create-admin-user` bypasses `CreateUser()` but creates `owner`
users who already have all permissions

> 🤖 Written by a Coder Agent. Will be reviewed by a human.
2026-04-02 14:46:47 +01:00
Cian Johnston 2ebc076b9e fix: make 'chat has no workspace agent' error actually helpful (#23971)
- Change `errChatHasNoWorkspaceAgent` message from cryptic `"chat has no
workspace agent"` to actionable `"workspace has no running agent: the
workspace may be stopped. Use the start_workspace tool to start it, or
create_workspace to create a new one"`
- Update test assertions to match the new message substring

> 🤖 Written by a Coder Agent. Reviewed by a human.
2026-04-02 14:18:26 +01:00
Cian Johnston 16add93908 fix(coderd/x/chatd): stabilize subagent pubsub completion test (#23944)
- stabilize `TestAwaitSubagentCompletion/CompletesViaPubsub` by waiting
for durable completion state before sending the synthetic pubsub wake
- add coverage for successful subagent completion with an empty report

> 🤖 Written by a Coder Agent. Reviewed by a human.
2026-04-02 12:29:47 +01:00
Susana Ferreira fb788530b3 feat: add provider_name column to aibridge interceptions (#23960)
## Description

Adds `provider_name` to aibridge interceptions to store the provider
instance name alongside the provider type. This allows distinguishing
between multiple instances of the same provider type (e.g. `copilot` vs
`copilot-business`).

## Changes

* Add `provider_name` column to `aibridge_interceptions` table with
backfill from `provider`.
* Add `provider_name` field to the proto `RecordInterceptionRequest`
message.
* Add `ProviderName` to the `codersdk.AIBridgeInterception` API
response.

_Disclaimer: initially produced by Claude Opus 4.6, modified and
reviewed by @ssncferreira ._
2026-04-02 10:58:13 +01:00
Ethan f4dc8f6b11 test: use non-monitoring RPC role in apptest setup (#23953)
Closes https://github.com/coder/internal/issues/1432
Closes https://github.com/coder/internal/issues/1399

The test setup in `createWorkspaceWithApps` opens a short-lived RPC
connection to fetch the agent manifest before starting the real agent.
This connection used `ConnectRPC()` which sends no `role` parameter, so
the server treated it as a real agent connection and enabled connection
monitoring. When the helper closed, its monitor asynchronously wrote
`disconnectedAt` to the DB — racing with the real agent's monitor and
transiently marking the agent as disconnected.

The fix uses `ConnectRPCWithRole(ctx, "apptest-manifest")` so the helper
doesn't trigger connection monitoring. The server already has this
role-based distinction for non-agent clients like
`coder-logstream-kube`; the test helper just wasn't using it.

Both issues share this codepath: `setupProxyTest` →
`createWorkspaceWithApps` → the `ConnectRPC` call at `setup.go:518`.
Both test configurations have a non-empty `PrimaryAppHost`, so both
enter the affected block.

This is not masking a product issue — the "disconnected" state was
caused by two competing monitors writing to the same agent DB row, a
scenario that only exists in this test setup. No assertions were
weakened; the proxy still checks real agent connectivity on every
request.
2026-04-02 20:01:54 +11:00
Ethan 7757cd8e08 refactor(coderd/x/chatd): insert chats directly as pending on creation (#23888)
Previously, `CreateChat` inserted the `chats` row with the DB default
status (`waiting`), then updated it to `pending` in the same transaction
via `setChatPendingWithStore`. This wasted two extra queries per chat
creation (`GetChatByID` + `UpdateChatStatus`) and rewrote the same row
immediately after inserting it.

Now `CreateChat` passes the status directly to `InsertChat`, so the row
is written once in its final create-time state. The
`setChatPendingWithStore` helper is removed entirely. `InsertChat` now
requires an explicit `status` parameter at all callsites instead of
relying on a DB column default.

## Motivation

On an experimental branch we're trialing firing all chatd notifications
from plpgsql triggers. The old two-step insert made that awkward: in an
`AFTER INSERT` trigger, `NEW` only contained the insert-time row
(`waiting`), not the final committed state (`pending`). To emit the
correct event payload the trigger had to be deferred and re-read the row
from `chats` at commit time.

With this change, `NEW` already contains the correct row to publish — no
deferred trigger, no extra `SELECT`, simpler and cheaper trigger logic.

That said, this seems like a worthwhile change regardless of the trigger
experiment: writing the final row state once removes unnecessary DB work
on every chat creation and makes the create path easier to reason about.
2026-04-02 14:13:51 +11:00
Ethan fc1e0beb3b fix(coderd/x/chatd): use structured output for chat title generation (#23909)
Chat title generation used free-form text completion, which let models
respond conversationally instead of producing a title. Review chats
started with GitHub URLs were especially affected — models would say "I
don't have the ability to browse external links" and that string became
the persisted title.

Replace the raw-text `generateShortText` path with structured output via
`object.Generate[generatedTitle]`. Both auto-title and manual retitle
now go through the same typed contract: the model must return a JSON
object with a `title` field, validated and normalized before
persistence. Invalid outputs (empty, too long) are rejected and retried
through the existing candidate-model fallback loop.
2026-04-02 14:13:27 +11:00
Kyle Carberry ee855f9618 feat: make agent context paths configurable via env vars (#23878)
Replace hardcoded paths for instruction files, skills, and MCP config
with
values read from `CODER_AGENT_EXP_*` environment variables. Template
authors
configure paths via the existing `coder_agent` `env` block. The agent
resolves `~`, relative, and absolute paths locally, then serves the
resolved config over `GET /api/v0/context-config`. `chatd` fetches this
once per workspace attach and falls back to today's defaults for older
agents.

All path env vars are comma-separated, allowing multiple directories:

| Env Var | Default | Controls |
|---|---|---|
| `CODER_AGENT_EXP_INSTRUCTIONS_DIRS` | `~/.coder` | Dirs containing the
instruction file |
| `CODER_AGENT_EXP_INSTRUCTIONS_FILE` | `AGENTS.md` | Instruction file
name |
| `CODER_AGENT_EXP_SKILLS_DIRS` | `.agents/skills` | Skills directories
|
| `CODER_AGENT_EXP_SKILL_META_FILE` | `SKILL.md` | Skill metadata file
name |
| `CODER_AGENT_EXP_MCP_CONFIG_FILES` | `.mcp.json` | MCP config files |

### Example

```hcl
resource "coder_agent" "main" {
  os   = "linux"
  arch = "amd64"
  env = {
    CODER_AGENT_EXP_INSTRUCTIONS_DIRS  = "/opt/company/agent-config,~/.coder"
    CODER_AGENT_EXP_INSTRUCTIONS_FILE  = "CLAUDE.md"
    CODER_AGENT_EXP_SKILLS_DIRS        = "/opt/company/ai-skills,.agents/skills"
    CODER_AGENT_EXP_MCP_CONFIG_FILES   = "/opt/company/mcp.json,.mcp.json"
  }
}
```

<details>
<summary>Implementation Details</summary>

### Architecture

Follows the same pattern as MCP tool discovery:
agent resolves locally → exposes via HTTP → chatd consumes.

**Agent-side** (`agent/agentcontextconfig/`):
- `ResolvePath` / `ResolvePaths` handle `~`, relative, and absolute path
forms; returns `""` for relative paths when baseDir is empty
- `Config` reads env vars, falls back to defaults, resolves all paths
- `GET /api/v0/context-config` serves the resolved config as JSON

**chatd-side** (`coderd/x/chatd/`):
- Calls `conn.ContextConfig()` once on first workspace attach
- Falls back to hardcoded defaults on 404 (older agents)
- Iterates instruction dirs, skills dirs using resolved absolute paths
- `LSRelativityRoot` everywhere — no more home/root juggling

### Key design decisions

- **`EXP_` prefix**: env vars use `CODER_AGENT_EXP_*` to indicate
experimental status
- **Plural names**: comma-separated vars use plural names (`DIRS`,
`FILES`); single-value vars use singular (`FILE`)
- **Defaults in `workspacesdk`**: default constants live in
`codersdk/workspacesdk/` so both agent and server reference them without
cross-layer imports
- **`skillMetaFile` persistence**: stored on context-file parts via
`ContextFileSkillMetaFile` and restored on subsequent chat turns so
custom values survive across turns
- **Working dir dedup**: `slices.Contains` guard prevents reading the
same instruction file from both `InstructionsDirs` and the working
directory
- **MCP server dedup**: first-occurrence-wins dedup prevents leaking
duplicate connections from overlapping config files
- **ResolvePath safety**: returns `""` for relative paths when `baseDir`
is empty, so `ResolvePaths` filters them out

### Files changed

| File | Change |
|---|---|
| `agent/agentcontextconfig/` | New package — path resolution + HTTP
endpoint |
| `codersdk/workspacesdk/agentconn.go` | `ContextConfigResponse` type,
default constants, client method |
| `agent/agent.go` + `agent/api.go` | Wire up endpoint, pass config to
MCP |
| `agent/x/agentmcp/manager.go` | Accept `[]string` MCP config paths,
dedup by name |
| `coderd/x/chatd/chatd.go` | Fetch config, thread through, named
returns |
| `coderd/x/chatd/instruction.go` | Accept configurable dir + file name,
`skillMetaFileFromParts` |
| `coderd/x/chatd/chattool/skill.go` | Accept configurable dirs + meta
file |
| `codersdk/chats.go` | `ContextFileSkillMetaFile` field for persistence
|

### Test coverage

- `TestConfig` (4 cases): defaults, custom env vars, whitespace
trimming, comma-separated dirs
- `TestResolvePath` / `TestResolvePaths`: including empty baseDir edge
case
- `TestPersistInstructionFilesFallbackOnOlderAgent`: backward-compat
path when `ContextConfig` returns 404
- `TestChatMessagePartVariantTags`: updated exclusion list for new
internal field

### Backward compatibility

Older agents return 404 for the new endpoint. `chatd` catches this and
falls back to today's defaults via `readHomeInstructionFile` (using
`LSRelativityHome`). Existing workspaces work with no changes.

</details>
2026-04-01 12:28:47 -04:00
Kyle Carberry 8c8b307b97 fix: persist session cookie to disk to prevent PWA logout (#23746) 2026-04-01 09:54:59 -04:00
Kyle Carberry 19e44f4136 fix: target specific chat in MarkStale instead of broadcasting to all workspace chats (#23883)
## Problem

Subagent chats were receiving git context (branch, remote origin, PR
status) from their parent or sibling chats' git operations. When a git
operation triggers external auth, the workspace agent sends `chat_id`
identifying which chat initiated it — but this was broken at two levels:

1. **Agent side:** `CODER_CHAT_ID` was never injected into process
   environments. `chatd` sets `Coder-Chat-Id` HTTP headers and the
   agent extracts them for process isolation, but never propagated
   `CODER_CHAT_ID` to `cmd.Env`. So `gitaskpass` always sent an empty
   `chat_id`.

2. **Server side:** `workspaceAgentsExternalAuth` ignored the `chat_id`
   query param. `MarkStale` broadcast git context to **all** chats on
   the workspace via `filterChatsByWorkspaceID`.

## Fix

- Inject `CODER_CHAT_ID` into `cmd.Env` in `agentproc` when the chat
  ID is known, so `gitaskpass` can read and forward it.
- Read `chat_id` from query params in `workspaceAgentsExternalAuth`
  and thread it through `chatGitRef`.
- Refactor `MarkStale` to accept a `MarkStaleParams` struct. When
  `ChatID` is provided, target only that specific chat. When empty
  (legacy agents, non-chat git operations), fall back to the existing
  workspace-wide broadcast.
- Extract `markStaleSingle` helper to deduplicate the upsert+publish
  logic.

<details><summary>Investigation notes</summary>

### Data flow before fix

```
chatd → sets Coder-Chat-Id header on agent conn
agent → extracts chatID, stores on process struct
agent → does NOT set CODER_CHAT_ID in cmd.Env  ← gap 1
gitaskpass → reads CODER_CHAT_ID (always empty), sends chat_id=""
server handler → ignores chat_id query param     ← gap 2
MarkStale → broadcasts to ALL workspace chats
```

### Data flow after fix

```
chatd → sets Coder-Chat-Id header on agent conn
agent → extracts chatID, stores on process struct
agent → sets CODER_CHAT_ID in cmd.Env
gitaskpass → reads CODER_CHAT_ID, sends chat_id=<uuid>
server handler → reads chat_id, passes to MarkStale
MarkStale → targets only that specific chat
```

</details>
2026-04-01 13:04:59 +00:00
Kyle Carberry 7861fcf1f6 perf(coderd): stop inline-resolving diff status on every GetChat call (#23901)
## Problem

Every `GET /api/experimental/chats/{chatID}` call was blocking for
200-800ms because the `getChat` handler called `resolveChatDiffStatus`,
which unconditionally hit the git provider API (e.g. GitHub's `GET
/repos/{owner}/{repo}/pulls?head=...`) via `ResolveBranchPullRequest` —
even when the cached diff status was fresh.

This made every chat page load at `/agents/{id}` noticeably slow.

## Root cause

The call chain was:
1. `getChat` → `resolveChatDiffStatus`
2. `resolveChatDiffStatus` → `resolveChatDiffReference` →
`gp.ResolveBranchPullRequest(...)` **(external HTTP call)**
3. Only **after** the external call: `chatDiffStatusIsStale(status,
now)` check

The staleness check happened after the expensive work, so every request
paid the cost regardless of cache freshness.

## Fix

`getChat` now returns the cached `chat_diff_statuses` row directly from
the database. The background `gitsync` worker already keeps these rows
fresh (every `DiffStatusTTL = 120s`), so inline resolution was
redundant.

The `resolveChatDiffContents` endpoint (which fetches actual diff
content) still uses the full resolution path since it needs to make
provider API calls by design.

## Changes

- `getChat` reads cached diff status from DB instead of calling
`resolveChatDiffStatus`
- Remove `resolveChatDiffStatus` (dead code — no production callers)
- Remove `chatDiffStatusIsStale` and `chatDiffStatusTTL` (dead code)
- Remove `RefreshesStaleStatusWithExternalAuth` test (tested the removed
inline refresh path)

<details><summary>Decision log</summary>

- **Why not just add a staleness gate?** The background worker already
handles refreshes on the same schedule. Adding an early-return-if-fresh
would work but leaves dead code for the stale path that's never
exercised in production (the worker gets there first). Removing the
inline path entirely is simpler and eliminates the external API
dependency from the read path.
- **Why keep `resolveChatDiffContents` unchanged?** That endpoint's job
is to fetch the actual diff content from the provider, so external API
calls are inherent to its purpose.

</details>
2026-04-01 12:08:13 +00:00
Cian Johnston d6df78c9b9 chore: remove racy ChatStatusPending assertions after CreateChat (#23882)
Removes 6 fragile `require.Equal(t, codersdk.ChatStatusPending,
chat.Status)` assertions from chat relay and creation tests.

**Root cause**: In HA tests with two replicas sharing the same DB, the
worker can acquire a just-created chat (flipping `pending → running` via
`AcquireChats`) before the HTTP response reaches the test. All affected
tests already synchronize via `require.Eventually` waiting for `running`
status, making the initial assertion both redundant and racy.

- Remove 5 assertions in `enterprise/coderd/exp_chats_test.go` (all
`TestChatStreamRelay` subtests)
- Remove 1 assertion in `coderd/exp_chats_test.go` (`TestPostChats`)
- An existing comment in `TestPostChats/Success` already documents this
exact race

Fixes flake:
https://github.com/coder/coder/actions/runs/23807597632/job/69385425724

> 🤖 Written by a Coder Agent. Will be reviewed by a human.
2026-04-01 10:00:50 +01:00
Ethan 5cba59af79 fix(coderd): unarchive child chats with parents (#23761)
Unarchiving a root chat now restores descendant chats in the database
and emits lifecycle events for every affected chat so passive sessions
converge without a full refetch.

This keeps archive and unarchive symmetric at both the data and
watch-stream layers by returning the affected chat family from the
database, using those post-update rows for chatd pubsub fanout, and
covering descendant lifecycle delivery with a watch-level regression
test.

Closes #23666
2026-04-01 15:30:25 +11:00
Cian Johnston a164d508cf fix(coderd/x/chatd): gate control subscriber to ignore stale pubsub notifications (#23865)
Fixes flaky `TestOpenAIReasoningWithWebSearchRoundTripStoreFalse` and
`TestOpenAIReasoningWithWebSearchRoundTrip`.

## Changes

- Gate the `processChat` control subscriber's cancel callback behind a
`chan struct{}` that is closed after publishing `"running"` status
- Add `TestGatedControlCancel` with 4 subtests exercising the gate logic

<details>
<summary>Root cause analysis</summary>

`SendMessage` publishes a `"pending"` notification on
`chat:stream:<chatID>` via PostgreSQL `NOTIFY`. `processChat` subscribes
to the same channel for control signals. Due to async NOTIFY delivery,
the `"pending"` notification can arrive at the control subscriber
**after** it registers its queue — even though it was published
**before**. `shouldCancelChatFromControlNotification("pending")` returns
`true`, immediately self-interrupting the processor before it does any
work.

The fix gates the cancel callback behind a closed channel. The channel
is closed after `processChat` publishes `"running"` status, so stale
notifications from before initialization are harmlessly ignored.
`close()` provides a happens-before guarantee in the Go memory model.
</details>

> 🤖 Written by a Coder Agent. Reviewed by a human.
2026-03-31 22:55:20 +01:00
Michael Suchacz e2bbd12137 test(coderd/x/chatd): remove flaky OpenAI round-trip tests (#23877) 2026-03-31 17:04:56 -04:00
Yevhenii Shcherbina 84b94a8376 feat: add chatgpt support for aibridge proxy (#23826)
Add ChatGPT support for AIBridgeProxy
2026-03-31 12:54:38 -04:00
Cian Johnston 2a990ce758 feat: show friendly alert for missing agents-access role (#23831)
Replaces the generic red `ErrorAlert` ("Forbidden.") with a proactive
permission check and friendly info alert when a user lacks the
`agents-access` role.

- Add `createChat` permission check to `permissions.json` using
`owner_id: "me"`
- Handle `"me"` owner substitution in `renderPermissions` (SSR path)
- Pass `canCreateChat` from `useAuthenticated().permissions` into
`AgentCreateForm`
- Show `ChatAccessDeniedAlert` and disable input immediately (no need to
trigger a 403 first)
- Also catch 403 errors as a fallback in case permissions aren't yet
loaded
- Add `ForbiddenNoAgentsRole` Storybook story with `play` assertions
- Add `TestRenderPermissionsResolvesMe` Go test to pin the `"me"`
sentinel substitution

<details><summary>Implementation plan & decision log</summary>

- Uses the existing `permissions.json` + `checkAuthorization` system
rather than a separate API call
- `owner_id: "me"` is resolved to the actor's ID by both the auth-check
API endpoint and the SSR `renderPermissions` function
- Go test uses a real `rbac.StrictCachingAuthorizer` (not a mock) so it
verifies both the sentinel substitution and the RBAC role evaluation
end-to-end
- Alert follows the exact same `Alert` pattern as the 409 usage-limit
block
- Uses `severity="info"` and links to the getting-started docs Step 3
- Textarea is disabled proactively so the user never sees the scary
generic error

</details>

> 🤖 Created by a Coder Agent and will be reviewed by a human.
2026-03-31 17:26:58 +01:00
Yevhenii Shcherbina 9440adf435 feat: add chatgpt support for aibridge (#23822)
Registers a new aibridge provider for ChatGPT by reusing the existing
OpenAI provider with a different `Name` and `BaseURL`
(https://chatgpt.com/backend-api/codex). The ChatGPT backend API is
OpenAI-compatible, so no new provider type is needed.
ChatGPT authenticates exclusively via per-user OAuth JWTs (BYOK mode) —
no centralized API key is configured. The OpenAI provider already
handles this: when no key is set, it falls through to the bearer token
from the request's Authorization header.
  
  Depends on #23811
2026-03-31 12:08:45 -04:00
Susana Ferreira b0036af57b feat: register multiple Copilot providers for business and enterprise upstreams (#23811)
## Description

Adds support for multiple Copilot provider instances to route requests to different Copilot upstreams (individual, business, enterprise). Each instance has its own name and base URL, enabling per-upstream metrics, logs, circuit breakers, API dump, and routing.

## Changes

* Add Copilot business and enterprise provider names and host constants
* Register three Copilot provider instances in aibridged (default, business, enterprise)
* Update `defaultAIBridgeProvider` in `aibridgeproxy` to route new Copilot hosts to their corresponding providers

## Related

* Depends on: https://github.com/coder/aibridge/pull/240
* Closes: https://github.com/coder/aibridge/issues/152

Note: documentation changes will be added in a follow-up PR.

_Disclaimer: initially produced by Claude Opus 4.6, heavily modified and reviewed by @ssncferreira ._
2026-03-31 16:00:37 +01:00
Ethan bbf3fbc830 fix(coderd/x/chatd): archive chat hard-interrupts active stream (#23758)
Archiving a chat now transitions pending or running chats to waiting
before setting the archived flag. This publishes a status notification
on `ChatStreamNotifyChannel` so `subscribeChatControl` cancels the
active `processChat` context via `ErrInterrupted` — the same codepath
used by the stop button.

The `processChat` cleanup also skips queued-message auto-promotion when
the chat is archived, so archiving behaves like a hard stop rather than
interrupt-and-continue.

Relates to https://github.com/coder/coder/issues/23666
2026-04-01 00:23:52 +11:00