Commit Graph

3601 Commits

Author SHA1 Message Date
Mathias Fredriksson a62ead8588 fix(coderd): sort pinned chats first in GetChats pagination (#24222)
The GetChats SQL query ordered by (updated_at, id) DESC with no
pin_order awareness. A pinned chat with an old updated_at could
land on page 2+ and be invisible in the sidebar's Pinned section.

Add a 4-column ORDER BY: pinned-first flag DESC, negated pin_order
DESC, updated_at DESC, id DESC. The negation trick keeps all sort
columns DESC so the cursor tuple < comparison still works. Update
the after_id cursor clause to match the expanded sort key.

Fix the false handler comment claiming PinChatByID bumps updated_at.
2026-04-10 17:13:19 +00:00
J. Scott Miller 7bde763b66 feat: add workspace build transition to provisioner job list (#24131)
Closes #16332

Previously `coder provisioner jobs list` showed no indication of what a workspace
build job was doing (i.e., start, stop, or delete). This adds
`workspace_build_transition` to the provisioner job metadata, exposed in
both the REST API and CLI. Template and workspace name columns were also
added, both available via `-c`.

```
$ coder provisioner jobs list -c id,type,status,"workspace build transition"
ID                                    TYPE                     STATUS     WORKSPACE BUILD TRANSITION
95f35545-a59f-4900-813d-80b8c8fd7a33  template_version_import  succeeded
0a903bbe-cef5-4e72-9e62-f7e7b4dfbb7a  workspace_build          succeeded  start
```
2026-04-10 09:50:11 -05:00
Matt Vollmer 36141fafad feat: stack insights tables vertically and paginate Pull requests table (#24198)
The "By model" and "Pull requests" tables on the PR Insights page
(`/agents/settings/insights`) were side-by-side at `lg` breakpoints, and
the Pull requests table was hard-capped at 20 rows by the backend.

- Replaced `lg:grid-cols-2` with a single-column stacked layout so both
tables span the full content width.
- Removed the `LIMIT 20` from the `GetPRInsightsRecentPRs` SQL query so
all PRs in the selected time range are returned.
- Can add this back if we need it. If we do, we should add a little
subheader above this table to indicate that we're not showing all PRs
within the selected timeframe.
- Added client-side pagination to the Pull requests table using
`PaginationWidgetBase` (page size 10), matching the existing pattern in
`ChatCostSummaryView`.
- Renamed the section heading from "Recent" to "Pull requests" since it
now shows the full set for the time range.
<img width="1481" height="1817" alt="image"
src="https://github.com/user-attachments/assets/0066c42f-4d7b-4cee-b64b-6680848edc68"
/>


> 🤖 PR generated with Coder Agents
2026-04-10 10:48:54 -04:00
Garrett Delfosse 3462c31f43 fix: update directory for terraform-managed subagents (#24220)
When a devcontainer subagent is terraform-managed, the provisioner sets
its directory to the host-side `workspace_folder` path at build time. At
runtime, the agent injection code determines the correct
container-internal
path from `devcontainer read-configuration` and sends it via
`CreateSubAgent`.

However, the `CreateSubAgent` handler only updated `display_apps` for
pre-existing agents, ignoring the `Directory` field. This caused
SSH/terminal
sessions to land in `~` instead of the workspace folder (e.g.
`/workspaces/foo`).

Add `UpdateWorkspaceAgentDirectoryByID` query and call it in the
terraform-managed subagent update path to also persist the directory.

Fixes PLAT-118

<details><summary>Root cause analysis</summary>

Two code paths set the subagent `Directory` field:

1. **Provisioner (build time):** `insertDevcontainerSubagent` in
`provisionerdserver.go`
   stores `dc.GetWorkspaceFolder()` — the **host-side** path from the
   `coder_devcontainer` Terraform resource (e.g. `/home/coder/project`).

2. **Agent injection (runtime):**
`maybeInjectSubAgentIntoContainerLocked` in
`api.go` reads the devcontainer config and gets the correct
**container-internal**
path (e.g. `/workspaces/project`), then calls `client.Create(ctx,
subAgentConfig)`.

For terraform-managed subagents (those with `req.Id != nil`),
`CreateSubAgent`
in `coderd/agentapi/subagent.go` recognized the pre-existing agent and
entered
the update path — but only called `UpdateWorkspaceAgentDisplayAppsByID`,
discarding the `Directory` field from the request. The agent kept the
stale
host-side path, which doesn't exist inside the container, causing
`expandPathToAbs` to fall back to `~`.

</details>

> [!NOTE]
> Generated by Coder Agents
2026-04-10 10:11:22 -04:00
Faur Ioan-Aurel 83fd4cf5c2 fix: OAuth2 cancel button in the authorization page not working (#24058)
Go's html/template has a built-in security filter (urlFilter) that only
allows http, https, and mailto URL schemes. Any other scheme gets
replaced with #ZgotmplZ.

The OAuth2 app's callback URL uses custom URI scheme which the filter
considers unsafe. For example the Coder JetBrains plugin exposes a
callback URI with the scheme jetbrains:// - which was effectively
changed by the template engine into #ZgotmplZ. Of course this is not an
actual callback. When users clicked the cancel button nothing happened.

The fix was simple - we now wrap the apps registered callback URI into
htmltemplate.URL. Usually this needs some validation otherwise the
linter will complain about it. The callback URI used by the Cancel logic
is actually validated by our backend when the client app
programmatically registered via the dynamic OAuth2 registration
endpoints, so we refactored the validation around that code and re-used
some of it in the Cancel handling to make sure we don't allow URIs like
`javascript` and `data`, even though in theory these URIs were already
validated.

In addition, while testing this PR with
https://github.com/coder/coder-jetbrains-toolbox/pull/209 I discovered
that we are also not compliant with
https://www.rfc-editor.org/rfc/rfc6749#section-4.1.2.1 which requires
the server to attach the local state if it was provided by the client in
the original request. Also it is optional but generally a good practice
to include `error_description` in the error responses. In fact we follow
this pattern for the other types of error responses. So this is not a
one off.

- resolves #20323
<img width="1485" height="771" alt="Cancel_page_with_invalid_uri"
src="https://github.com/user-attachments/assets/5539d234-9ce3-4dda-b421-d023fc9aa99e"
/>
<img width="486" height="746" alt="Coder Toolbox handling the Cancel
button"
src="https://github.com/user-attachments/assets/acab71a6-d29c-4fa9-80ba-3c0095bbdc8f"
/>

<!--

If you have used AI to produce some or all of this PR, please ensure you
have read our [AI Contribution
guidelines](https://coder.com/docs/about/contributing/AI_CONTRIBUTING)
before submitting.

-->
2026-04-10 12:49:22 +03:00
Danielle Maywood 38d4da82b9 refactor: send raw typed payloads over chat WebSockets (#24148) 2026-04-10 10:47:30 +01:00
Zach 95cff8c5fb feat: add REST API handlers and client methods for user secrets (#24107)
Add the five REST endpoints for managing user secrets, SDK client
methods, and handler tests.

Endpoints:
- `POST /api/v2/users/{user}/secrets`
- `GET /api/v2/users/{user}/secrets`
- `GET /api/v2/users/{user}/secrets/{name}`
- `PATCH /api/v2/users/{user}/secrets/{name}`
- `DELETE /api/v2/users/{user}/secrets/{name}`

Routes are registered under the existing `/{user}` group with
`ExtractUserParam`. The delete query was changed from `:exec` to
`:execrows` so the handler can distinguish "not found" from success
(DELETE with `:exec` silently returns nil for zero affected rows).
2026-04-09 12:12:55 -06:00
Yevhenii Shcherbina 8237822441 feat: byok observability api (#24207)
## Summary
Exposes `credential_kind` and `credential_hint` on AI Bridge session
threads, making credential metadata visible in the session detail API.
   
Each thread in the `/api/v2/aibridge/sessions/{session_id}` response now
includes:
- `credential_kind`: `centralized` or `byok`
- `credential_hint`: masked credential (e.g. `sk-a...pgAA`)
Values are taken from the thread's root interception.
## Changes

- `codersdk/aibridge.go`: Added `CredentialKind` and `CredentialHint`
fields to `AIBridgeThread`
- `coderd/database/db2sdk/db2sdk.go`: Populated from root interception
in `buildAIBridgeThread`
  - `SessionTimeline.stories.tsx`: Added fields to mock thread data
2026-04-09 11:41:17 -04:00
Ethan 65bf7c3b18 fix(coderd/x/chatd/chatloop): stabilize startup-timeout tests with quartz (#24193)
The startup-timeout integration tests in `chatloop` used a 5ms real-time
budget and relied on wall-clock scheduling to fire the startup guard
timer before the first stream part arrived. On loaded CI runners the
timer sometimes lost the race, producing `attempts == 2` instead of
`attempts == 1` and flaking `TestRun_FirstPartDisarmsStartupTimeout`.

Replace the real `time.Timer` in `startupGuard` with a `quartz.Timer` so
tests can control time deterministically. Production behavior is
unchanged: `RunOptions.Clock` defaults to `quartz.NewReal()` when nil,
and the startup timeout still covers both opening the provider stream
and waiting for the first stream part.

- Add `RunOptions.Clock quartz.Clock` with nil-safe default.
- Tag the startup guard timer as `"startupGuard"` for quartz trap
targeting.
- Rewrite the four startup-timeout integration tests to use
`quartz.NewMock(t)` with trap/advance/release sequences instead of
wall-clock sleeps.
- Add `awaitRunResult` helper so tests fail with a clear message instead
of hanging when `Run` does not complete.

Closes https://github.com/coder/internal/issues/1460
2026-04-10 00:40:09 +10:00
Kyle Carberry 391b22aef7 feat: add CLI commands for managing chat context from workspaces (#24105)
Adds `coder exp chat context add` and `coder exp chat context clear`
commands that run inside a workspace to manage chat context files via
the agent token.

`add` reads instruction and skill files from a directory (defaulting to
cwd) and inserts them as context-file messages into an active chat.
Multiple calls are additive — `instructionFromContextFiles` already
accumulates all context-file parts across messages.

`clear` soft-deletes all context-file messages, causing
`contextFileAgentID()` to return `!found` on the next turn, which
triggers `needsInstructionPersist=true` and re-fetches defaults from the
agent.

Both commands auto-detect the target chat via `CODER_CHAT_ID` (already
set by `agentproc` on chat-spawned processes), or fall back to
single-active-chat resolution for the agent. The `--chat` flag overrides
both.

Also adds sub-agent context inheritance: `createChildSubagentChat` now
copies parent context-file messages to child chats at spawn time, so
delegated sub-agents share the same instruction context without
independently re-fetching from the workspace agent.

<details><summary>Implementation details</summary>

**New files:**
- `cli/exp_chat.go` — CLI command tree under `coder exp chat context`

**Modified files:**
- `agent/agentcontextconfig/api.go` — `ConfigFromDir()` reads context
from an arbitrary directory without env vars
- `codersdk/agentsdk/agentsdk.go` — `AddChatContext`/`ClearChatContext`
SDK methods
- `coderd/workspaceagents.go` — POST/DELETE handlers on
`/workspaceagents/me/chat-context`
- `coderd/coderd.go` — Route registration
- `coderd/database/queries/chats.sql` — `GetActiveChatsByAgentID`,
`SoftDeleteContextFileMessages`
- `coderd/database/dbauthz/dbauthz.go` — RBAC implementations for new
queries
- `coderd/x/chatd/subagent.go` — `copyParentContextFiles` for sub-agent
inheritance
- `cli/root.go` — Register `chatCommand()` in `AGPLExperimental()`

**Auth pattern:** Uses `AgentAuth` (same as `coder external-auth`) —
agent token via `CODER_AGENT_TOKEN` + `CODER_AGENT_URL` env vars.

</details>

> 🤖 Generated by Coder Agents

---------

Co-authored-by: Michael Suchacz <203725896+ibetitsmike@users.noreply.github.com>
2026-04-09 16:33:00 +02:00
Hugo Dutka efb19eb748 feat: agents desktop recording thumbnail backend (#24022)
The agents chat interface displays thumbnails for videos recorded by the
computer use agent. Currently, to display a thumbnail, the frontend
downloads the entire video and shows the first frame. This PR starts
storing a new thumbnail file in the database for every recorded video,
and exposes the file id in the `wait_agent` tool result alongside the
recording file id, so the frontend can fetch just the thumbnail.
2026-04-09 13:47:54 +02:00
Atif Ali 584c61acb5 fix: mark connecting agents as unhealthy instead of healthy (#24044)
## Problem

Workspaces showed as "Healthy" immediately after creation while the
agent was still downloading, starting, or connecting. If the agent never
connected, the workspace stayed "Healthy" for the entire connection
timeout (~120s), then abruptly flipped to "Unhealthy".

## Root cause

In `db2sdk.WorkspaceAgent`, the health switch had no case for
`WorkspaceAgentConnecting`. Agents in `connecting` status with a
non-`off` lifecycle (e.g. `created` after a fresh build) fell through to
the `default` case and were marked `Healthy = true`.

## Fix

Add an explicit case for `WorkspaceAgentConnecting` that sets `Healthy =
false` with reason `"agent has not yet connected"`. The case is placed
after the existing `!connected + off` case (which correctly catches
stopped agents as "not running") and before the `timeout`/`disconnected`
cases.

```
Status        + Lifecycle       → Health reason
──────────────────────────────────────────────────────
any !connected + off           → "agent is not running"
connecting    + created/starting → "agent has not yet connected"  ← NEW
timeout       + any            → "agent is taking too long to connect"
disconnected  + any            → "agent has lost connection"
connected     + start_error    → "agent startup script exited with an error"
connected     + shutting_down  → "agent is shutting down"
connected     + ready/starting → healthy
```

The frontend already handles this case — `getAgentHealthIssue()` returns
"Workspace agent is still connecting" with `severity: "info"` for
unhealthy workspaces with connecting agents.

## Test changes

- **Healthy test**: now actually connects the agent via `agenttest.New`
before asserting health (previously passed due to the bug).
- **New Connecting test**: verifies a never-connected agent is correctly
marked unhealthy.
- **Mixed health test**: connects a1 and waits for the mixed state
(`a1.Healthy && !workspace.Healthy`) to avoid a race where both agents
are initially connecting.
- **Sub-agent excluded test**: connects the parent agent and waits for
it to be healthy before creating the sub-agent.
- **TestWorkspaceAgent/Connect**: flipped assertion to `Health.Healthy
== false` for a `dbfake` agent that never connects.

<details>
<summary>Review notes</summary>

### Known follow-up

The `healthy:false` workspace search filter maps to `[disconnected,
timeout]` and does not include `connecting`. This is a pre-existing gap
that is now more consequential — a workspace unhealthy solely due to a
connecting agent won't appear in `healthy:false` results. Worth a
follow-up issue.

### Deep review findings addressed

| Finding | Severity | Status |
|---------|----------|--------|
| Mixed health test race (all 3 reviewers) | P2 | Fixed — tightened
`Eventually` condition |
| `TestWorkspaceAgent/Connect` assertion break | P1 | Fixed — flipped
assertion |
| CLI renders red for connecting agents | Obs | Acknowledged — design
trade-off, accurate but visually strong for transient state |
| Switch case ordering overlap | Obs | Documented with inline comment |

</details>

> 🤖 This PR was created with the help of Coder Agents, and needs a human
review. 🧑💻
2026-04-09 13:21:28 +05:00
dylanhuff-at-coder f4240bb8c1 fix: sanitize workspace agent logs before insert (#24028)
Workspace agent logs could still fail after the earlier invalid UTF-8
fix because NUL bytes are valid Go/protobuf strings but are rejected by
Postgres text columns. The legacy HTTP log upload path also bypassed the
old sanitization entirely, and both server insert paths computed
logs_length from the unsanitized input.

Add a shared log-output sanitizer in agentsdk, use it in the protobuf
conversion path and both server-side insert paths, and compute
OutputLength from the sanitized string so overflow accounting matches
what is actually stored. This keeps the old invalid UTF-8 behavior while
also handling embedded NUL bytes consistently across DRPC and HTTP log
ingestion.

Refs [#23292 ](https://github.com/coder/coder/issues/23292)
Refs [#13433 ](https://github.com/coder/coder/issues/13433)
2026-04-08 16:29:38 -07:00
Zach 9b91af8ab7 feat: add user secrets SDK types and db2sdk converters (#24102)
Adds the SDK types and database-to-SDK conversion helpers for the user secrets feature.
2026-04-08 16:48:41 -06:00
Cian Johnston 7b0421d8c6 fix: revert auto-assign agents-access role enabled (#24170)
This reverts commit d4a9c63e91 (#23968).

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-04-08 20:56:17 +01:00
Yevhenii Shcherbina 7f496c2f18 feat: byok-observability for aibridge (#23808)
## Summary

Adds `credential_kind` and `credential_hint` columns to
`aibridge_interceptions` to record how each LLM request was
authenticated and provide a masked credential identifier for audit
purposes.

This enables admins to distinguish between centralized API keys,
personal API keys, and subscription-based credentials in the
interceptions audit log.

## Changes

- New migration adding `credential_kind`and `credential_hint` to
`aibridge_interceptions`
- Updated `InsertAIBridgeInterception` query and proto definition to
carry the new fields
- Wired proto fields through `translator.go` and `aibridgedserver.go` to
the database

Depends on https://github.com/coder/aibridge/pull/239
2026-04-08 13:24:28 -04:00
Michael Suchacz 590235138f fix: pin fixed anthropic/fantasy forks for streaming token accounting (#24077) 2026-04-08 17:07:39 +00:00
Kyle Carberry 35c26ce22a feat: add CreatedAt to tool-call and tool-result ChatMessageParts (#24101)
Adds an optional `CreatedAt` timestamp to `tool-call` and `tool-result`
`ChatMessagePart` variants so the frontend can compute tool execution
duration (`result.created_at - call.created_at`).

Timestamps are recorded at the correct moments in the chatloop:
- **Tool-call**: when the model stream emits the tool call
- **Tool-result**: when tool execution completes (or is interrupted)

These are passed through `PersistedStep.PartCreatedAt` so the
persistence layer can apply accurate timestamps to stored parts.
SSE-published parts also carry `CreatedAt` for real-time display.

Old persisted messages without `created_at` deserialize to `nil` — fully
backward compatible.

<details><summary>Implementation notes (Coder Agents
generated)</summary>

### Why not stamp in `PartFromContent`?

`PartFromContent` is called both for SSE publishing (correct timing) and
during persistence (wrong timing — both tool-call and tool-result would
get the same "persistence time" timestamp, yielding ~0 duration).
Instead, timestamps are captured in the chatloop at the right moments
and carried through `PersistedStep.PartCreatedAt` as a
`map[string]time.Time` keyed by `"call:<id>"` / `"result:<id>"`.

### Interrupted tool calls

`persistInterruptedStep` also stamps `CreatedAt` on synthetic error
results for cancelled/interrupted tool calls, so partial duration is
available.

### Files changed

| File | Change |
|------|--------|
| `codersdk/chats.go` | Add `CreatedAt *time.Time` field |
| `codersdk/chats_test.go` | JSON round-trip test |
| `coderd/database/dbtime/dbtime.go` | Add `TimePtr` helper |
| `coderd/x/chatd/chatloop/chatloop.go` | Track timestamps, pass through
`PersistedStep` |
| `coderd/x/chatd/chatd.go` | Apply timestamps during persistence |
| `coderd/x/chatd/chatprompt/chatprompt_test.go` | Verify
`PartFromContent` does NOT stamp |
| `site/src/api/typesGenerated.ts` | Auto-generated |

</details>

---------

Co-authored-by: Ethan <39577870+ethanndickson@users.noreply.github.com>
2026-04-08 12:42:03 -04:00
Kyle Carberry b969d66978 feat: add dynamic tools support for chat API (#24036)
Adds client-executed dynamic tools to the chat API. Dynamic tools are
declared by the client at chat creation time, presented to the LLM
alongside built-in tools, but executed by the client rather than chatd.
This enables external systems (Slack bots, IDE extensions, Discord bots,
CI/CD integrations) to plug custom tools into the LLM chat loop without
modifying chatd's built-in tool set.

Modeled after OpenAI's Assistants API: the chat pauses with
`requires_action` status when the LLM calls a dynamic tool, the client
POSTs results back via `POST /chats/{id}/tool-results`, and the chat
resumes.

See [this example](https://github.com/coder/coder-slackbot-poc) as a
reference for how this is used. It's highly-configurable, which would
enable creating chats from webhooks, periodically polling, or running as
a Slackbot.

<details>
<summary>Design context</summary>

### Architecture

The chatloop **exits** when it encounters dynamic tools and
**re-enters** when results arrive. No blocking channels, no pubsub for
tool results, no in-memory registry. The DB is the only coordination
mechanism.

```
Phase 1 (chatloop):
  LLM response → execute built-in tools only →
  Persist(assistant + built-in results) →
  status = requires_action → chatloop exits

Phase 2 (POST /tool-results):
  Persist(dynamic tool results) →
  status = pending → wakeCh → chatloop re-enters
```

### Validation (POST /tool-results)

1. Chat status must be `requires_action` (409 if not)
2. Read chat's `dynamic_tools` → set of dynamic tool names
3. Read last assistant message → extract tool-call parts matching
dynamic tool names
4. Submitted tool_call_ids must match exactly (400 for missing/extra)
5. Persist tool-result message parts, set status to `pending`, signal
wake

### Idempotency

Tool call IDs scoped per LLM step. State machine (`requires_action` →
`pending`) is the guard. First POST wins, subsequent get 409.

### Mixed tool calls

When the LLM calls both built-in and dynamic tools in one step, built-in
tools execute immediately. Their results are persisted in phase 1.
Dynamic tool results arrive via POST in phase 2. The LLM sees all
results when the chatloop resumes.

</details>

> 🤖 Generated by Coder Agents
2026-04-08 11:54:44 -04:00
Kyle Carberry c5d720f73d feat(coderd): add telemetry for agents chats and messages (#24068)
Adds telemetry collection for the agents chat system (`/agents`) to the
existing telemetry snapshot pipeline.

Three new snapshot fields:
- **`Chats`** — per-chat metadata (id, owner, status, mode,
workspace_id, root_chat_id, has_parent, archived, model config)
collected time-windowed via `createdAfter`
- **`ChatMessageSummaries`** — per-chat aggregated message metrics
(counts by role, token sums by type, cost, runtime, model count,
compression count) collected time-windowed
- **`ChatModelConfigs`** — model configuration metadata (provider,
model, context limit, enabled, default) collected as full dump

No PII is included — titles, message content, and URLs are excluded at
the SQL level. Only structural metadata flows through telemetry.

<details><summary>Implementation plan</summary>

### SQL Queries (`coderd/database/queries/chats.sql`)
- `GetChatsCreatedAfter` — time-windowed chat metadata
- `GetChatMessageSummariesPerChat` — per-chat message aggregates via
`GROUP BY`
- `GetChatModelConfigsForTelemetry` — full dump of model configs

### Telemetry (`coderd/telemetry/telemetry.go`)
- `Chat`, `ChatMessageSummary`, `ChatModelConfig` structs
- `ConvertChat`, `ConvertChatMessageSummary`, `ConvertChatModelConfig`
conversion functions
- Three `eg.Go()` blocks in `createSnapshot()` following the existing
collection pattern

### Authorization (`coderd/database/dbauthz/dbauthz.go`)
- System-only access for all three queries via `rbac.ResourceSystem`

### Tests
- `TestChatsTelemetry` in `coderd/telemetry/telemetry_test.go` — creates
chats (root + child), messages with token/cost data, model configs;
verifies all snapshot fields
- dbauthz test entries for all three queries in
`coderd/database/dbauthz/dbauthz_test.go`

</details>

> 🤖 Generated by Coder Agents
2026-04-08 09:47:44 -04:00
Cian Johnston 233343c010 feat: add chat and chat_files cleanup to dbpurge (#23833)
Fixes https://github.com/coder/coder/issues/23910

Adds periodic cleanup of chats and chat files to the dbpurge background
goroutine, with a configurable retention period exposed in the Agent
settings UI.

> 🤖 Written by a Coder Agent. Reviewed by a human.
2026-04-08 11:08:09 +01:00
Kayla はな c5f1a2fccf feat: make service accounts a Premium feature (#24020) 2026-04-07 12:25:32 -06:00
Zach 565a15bc9b feat: update user secrets queries for REST API and injection (#23998)
Update queries as prep work for user secrets API development:
- Switch all lookups and mutations from ID-based to user_id + name
- Split list query into metadata-only (for API responses) and
with-values (for provisioner/agent)
- Add partial update support using CASE WHEN pattern for write-only
value fields
- Include value_key_id in create for dbcrypt encryption support
- Update dbauthz wrappers and remove stale methods from dbmetrics
2026-04-07 09:03:28 -06:00
Kyle Carberry 684f21740d perf(coderd): batch chat heartbeat queries into single UPDATE per interval (#24037)
## Summary

Replaces N per-chat heartbeat goroutines with a single centralized
heartbeat loop that issues one `UPDATE` per 30s interval for all running
chats on a worker.

## Problem

Each running chat spawned a dedicated goroutine that issued an
individual `UPDATE chats SET heartbeat_at = NOW() WHERE id = $1 AND
worker_id = $2 AND status = 'running'` query every 30 seconds. At 10,000
concurrent chats this produces **~333 DB queries/second** just for
heartbeats, plus ~333 `ActivityBumpWorkspace` CTE queries/second from
`trackWorkspaceUsage`.

## Solution

New `UpdateChatHeartbeats` (plural) SQL query replaces the old singular
`UpdateChatHeartbeat`:

```sql
UPDATE chats
SET    heartbeat_at = @now::timestamptz
WHERE  worker_id = @worker_id::uuid
  AND  status = 'running'::chat_status
RETURNING id;
```

A single `heartbeatLoop` goroutine on the `Server`:
1. Ticks every `chatHeartbeatInterval` (30s)
2. Issues one batch UPDATE for all registered chats
3. Detects stolen/completed chats via set-difference (equivalent of old
`rows == 0`)
4. Calls `trackWorkspaceUsage` for surviving chats

`processChat` registers an entry in the heartbeat registry instead of
spawning a goroutine.

## Impact

| Metric | Before (10K chats) | After (10K chats) |
|---|---|---|
| Heartbeat queries/sec | ~333 | ~0.03 (1 per 30s per replica) |
| Heartbeat goroutines | 10,000 | 1 |
| Self-interrupt detection | Per-chat `rows==0` | Batch set-difference |

---

> 🤖 Generated by Coder Agents

<details><summary>Implementation notes</summary>

- Uses `@now` parameter instead of `NOW()` so tests with `quartz.Mock`
can control timestamps.
- `heartbeatEntry` stores `context.CancelCauseFunc` + workspace state
for the centralized loop.
- `recoverStaleChats` is unaffected — it reads `heartbeat_at` which is
still updated.
- The old singular `UpdateChatHeartbeat` is removed entirely.
- `dbauthz` wrapper uses system-level `rbac.ResourceChat` authorization
(same pattern as `AcquireChats`).

</details>
2026-04-07 10:25:46 -04:00
George K 86ca61d6ca perf: cap count queries and emit native UUID comparisons for audit/connection logs (#23835)
Audit and connection log pages were timing out due to expensive COUNT(*)
queries over large tables. This commit adds opt-in count capping: requests can
return a `count_cap` field signaling that the count was truncated at a threshold,
avoiding full table scans that caused page timeouts.

Text-cast UUID comparisons in regosql-generated authorization queries
also contributed to the slowdown by preventing index usage for connection
and audit log queries. These now emit native UUID operators.

Frontend changes handle the capped state in usePaginatedQuery and
PaginationWidget, optionally displaying a capped count in the pagination
UI (e.g. "Showing 2,076 to 2,100 of 2,000+ logs")

Related to:
https://linear.app/codercom/issue/PLAT-31/connectionaudit-log-performance-issue
2026-04-07 07:24:53 -07:00
Michael Suchacz d7c8213eee fix(coderd/x/chatd/mcpclient): deterministic external MCP tool ordering (#24075)
> This PR was authored by Mux on behalf of Mike.

External MCP tools returned by `ConnectAll` were ordered by goroutine
completion, making the tool list nondeterministic across chat turns.
This broke prompt-cache stability since tools are serialized in order.

Sort tools by their model-visible name after all connections complete,
matching the existing pattern in workspace MCP tools
(`agent/x/agentmcp/manager.go`). Also guards against a nil-client panic
in cleanup when a connected server contributes zero tools after
filtering.
2026-04-07 14:42:30 +02:00
Cian Johnston d5a1792f07 feat: track chat file associations with chat_file_links on chats (#23537)
Needed by #23833

Adds a `chat_file_links` association table to track which files are
associated with each chat.

- `AppendChatFileIDs` query links a file to a chat with deduplication
- `GetChatFileMetadataByIDs` query returns lightweight file metadata by
IDs
- Tool-created files (e.g. `propose_plan`) are linked to the chat after
insert
- User-uploaded files are linked to the chat when the referencing
message is sent
- Single-chat GET endpoint hydrates `files: ChatFileMetadata[]` on the
response

> 🤖 Created by Coder Agents and massaged into shape by a human.
2026-04-07 12:05:29 +01:00
Kyle Carberry acd5f01b4b fix: use GreaterOrEqual for step runtime assertion in chatloop test (#24067)
Fixes https://github.com/coder/internal/issues/1418

The `TestRun_ActiveToolsPrepareBehavior` test asserts
`persistedStep.Runtime > 0`, but on Windows the timer resolution (~15ms)
means the in-memory mock model can complete within the same clock tick,
producing a measured duration of `0s`.

Change the assertion from `require.Greater` to `require.GreaterOrEqual`
so that a legitimately measured zero duration on low-resolution clocks
does not cause a flake.

> Generated by Coder Agents
2026-04-07 02:08:49 +00:00
Kyle Carberry 6c62d8f5e6 fix(coderd/x/chatd): fix flaky TestAwaitSubagentCompletion/CompletesViaPubsub (#24066)
## Fix flaky TestAwaitSubagentCompletion/CompletesViaPubsub

Fixes coder/internal#1435

### Root Cause

During `createParentChildChats`, the processor publishes notifications
on `ChatStreamNotifyChannel(child.ID)` via PostgreSQL `LISTEN/NOTIFY`.
After `drainInflight()` returns, these stale notifications can still be
buffered in the pgListener's `NotifyChan()`. When
`awaitSubagentCompletion` subscribes and a stale notification is
dispatched between `setChatStatus(Waiting)` and
`insertAssistantMessage`, `checkSubagentCompletion` sees `done=true`
(status is `Waiting`) but returns an empty report because the message
hasn't been committed yet.

### Fix

Swap the order: insert the assistant message **before** transitioning
the status to `Waiting`. This guarantees the report is committed before
the status makes the chat appear complete to `checkSubagentCompletion`.

### Verification

- 50 consecutive runs of the specific test: all pass
- 10 runs of the full `TestAwaitSubagentCompletion` suite: all pass
- 20 runs with `-race`: all pass

> Generated by Coder Agents
2026-04-07 02:04:48 +00:00
Kyle Carberry 648787e739 feat: expose busy_behavior on chat message API (#24054)
The backend (`chatd.go`) already fully implements both `"queue"` and
`"interrupt"` busy behaviors for `SendMessage`, and the `message_agent`
subagent tool already leverages both internally. However the HTTP API
hardcoded `"queue"` and the SDK had no way for callers to request
interrupt-on-send.

This adds a `ChatBusyBehavior` enum type to the SDK and an optional
`busy_behavior` field on `CreateChatMessageRequest`. The HTTP handler
validates the field and passes it through to `chatd.SendMessage`.
Default remains `"queue"` for full backward compatibility.

<details><summary>Implementation notes</summary>

- `codersdk/chats.go`: New `ChatBusyBehavior` type with
`ChatBusyBehaviorQueue` and `ChatBusyBehaviorInterrupt` constants. Added
`BusyBehavior` field to `CreateChatMessageRequest` with `enums` tag for
codegen.
- `coderd/exp_chats.go`: `postChatMessages` now reads
`req.BusyBehavior`, maps SDK constants to
`chatd.SendMessageBusyBehavior*`, returns 400 on invalid values.
- `site/src/api/typesGenerated.ts`: Auto-generated via `make gen`.
- No frontend behavior changes — the field is available but unused by
the UI.

</details>

> [!NOTE]
> Generated by Coder Agents
2026-04-06 16:20:14 -04:00
Kyle Carberry 4cfbf544a0 feat: add per-chat system prompt option (#24053)
Adds a `system_prompt` field to `CreateChatRequest` that allows API
consumers to provide custom instructions when creating a chat. The
per-chat prompt is stored as a separate system message (`role=system`,
`visibility=model`) in the `chat_messages` table, inserted between the
deployment system prompt and the workspace awareness message.

Also moves deployment system prompt resolution from the HTTP handler
(`resolvedChatSystemPrompt`) into `chatd.CreateChat` where it belongs.
The handler no longer assembles system prompts —
`CreateOptions.SystemPrompt` is now purely the per-chat user prompt, and
the deployment prompt is resolved internally by chatd.

No database schema changes required.

**Message insertion order:**
1. Deployment system prompt (resolved by chatd, existing)
2. Per-chat user system prompt (new, from `CreateOptions.SystemPrompt`)
3. Workspace awareness (existing)
4. Initial user message (existing)

🤖 Generated with [Coder Agents](https://coder.com/agents)
2026-04-06 17:19:05 +00:00
Kyle Carberry a2ce74f398 feat: add total_runtime_ms to chat cost analytics endpoints (#24050)
Surface the aggregated `runtime_ms` from `chat_messages` through all
four cost analytics queries (summary, per-model, per-chat, per-user).
This is the key billing metric for agent compute time.

The per-chat breakdown already groups by `root_chat_id`, so subagent
runtime is automatically rolled up under the parent chat — no additional
query changes needed.

<details>
<summary>Implementation details</summary>

**SQL** (`coderd/database/queries/chats.sql`): Added
`COALESCE(SUM(cm.runtime_ms), 0)::bigint AS total_runtime_ms` to
`GetChatCostSummary`, `GetChatCostPerModel`, `GetChatCostPerChat`, and
`GetChatCostPerUser`.

**Go SDK** (`codersdk/chats.go`): Added `TotalRuntimeMs int64` to
`ChatCostSummary`, `ChatCostModelBreakdown`, `ChatCostChatBreakdown`,
and `ChatCostUserRollup`.

**Handler** (`coderd/exp_chats.go`): Wired the new field through all
converter functions and the response assembly.

**Tests** (`coderd/exp_chats_test.go`): Updated fixture to seed non-zero
`runtime_ms` values and added assertions for the new field at summary,
per-model, and per-chat levels.
</details>

> 🤖 Generated by Coder Agents
2026-04-06 12:10:57 -04:00
Kyle Carberry 8625543413 feat(coderd/x/chatd): parallelize ConvertMessagesWithFiles with g2 errgroup (#24034)
## Summary

Move `ConvertMessagesWithFiles` into the `g2` errgroup so prompt
conversion runs concurrently with instruction persistence, user prompt
resolution, MCP server connections, and workspace MCP tool discovery.

## Problem

In `runChat`, the setup before the first LLM `Stream()` call is
sequential across two errgroups:

```
g.Wait()                          // model + messages + MCP configs
ConvertMessagesWithFiles()        // sequential — blocked on g2 starting
g2.Wait()                         // instructions + user prompt + MCP connect + workspace MCP
```

`ConvertMessagesWithFiles` can take non-trivial time on conversations
with file attachments (batch DB resolution), and it was blocking g2 from
starting.

## Fix

`ConvertMessagesWithFiles` only reads the `messages` slice (available
after `g.Wait()`) and resolves file references via the database. No g2
task reads or writes the `prompt` variable. This makes it safe to
overlap with g2:

```
g.Wait()
g2.Wait()   // now includes ConvertMessagesWithFiles in parallel
```

The `InsertSystem` call for parent chats and the `promptErr` check are
deferred to after `g2.Wait()`, preserving correctness.

<details><summary>Decision log</summary>

- `ConvertMessagesWithFiles` is read-only on `messages` — no mutation,
safe for concurrent access
- `prompt` and `promptErr` are written only by the conversion goroutine,
read only after `g2.Wait()` — no data race
- Error from prompt conversion is checked immediately after `g2.Wait()`,
before any code that uses `prompt`
- `chatloop.Run` now uses `:=` instead of `=` since the prior `err`
declaration from `prompt, err :=` was removed

</details>

> Generated by Coder Agents
2026-04-05 11:42:07 +00:00
Kyle Carberry e18094825a fix: retain message_part buffer for cross-replica relay (#24031) 2026-04-04 17:24:41 -04:00
Kyle Carberry 919dc299fc feat: agent reads context files and discovers skills locally (#23935)
Piggybacks on #23878. Moves instruction file reading and skill discovery
from `chatd` (server-side, via multiple `LS`/`ReadFile` round-trips
through the agent connection) to the agent itself (local filesystem
access).

This intentionally drops backward compatibility with older agents that
don't support the context-config endpoint. Agents and server are
deployed together; there is no rolling-update contract to maintain here.

## What changed

The agent's `GET /api/v0/context-config` response now returns
`[]ChatMessagePart` directly — the same types chatd persists. This
eliminates intermediate type conversions and makes the protocol
extensible.

| Field | Type | Description |
|---|---|---|
| `parts` | `[]ChatMessagePart` | Context-file and skill parts, ready to
persist |
| `working_dir` | `string` | Agent's resolved working directory |

Removed from the response: `instructions_dirs`, `instructions_file`,
`skills_dirs`, `skill_meta_file`, `mcp_config_files` — the agent reads
files locally and returns their content as parts.

Removed from chatd: all legacy `LS`/`ReadFile` fallback code
(`readHomeInstructionFile`, `readInstructionDirFile`, `DiscoverSkills`
via LS, etc).

## Why

The previous architecture had the agent resolve paths, serve them over
HTTP, then `chatd` make N+1 round-trips back through the agent
connection to read files. The agent has direct filesystem access and
should just read the files.

## Key design decisions

- **Agent returns `ChatMessagePart` directly** — same types chatd
persists. No intermediate `InstructionFileEntry`/`SkillEntry` types
needed.
- **`SkillMeta.MetaFile`** — persisted via `ContextFileSkillMetaFile` on
the skill part, so custom meta file names
(`CODER_AGENT_EXP_SKILL_META_FILE`) survive across chat turns.
- **No pre-read body** — `read_skill` always dials the workspace to
fetch the skill body on demand. Simpler than caching the body in the
response.
- **MCP config paths kept agent-internal** — `MCPConfigFiles()` getter,
not sent over the wire.
- **No backward compat fallback** — old agents that don't support
context-config get no instruction files. This is acceptable since agent
and server deploy together.
2026-04-04 12:45:46 -04:00
Jon Ayers a1d51f0dab feat: batch connection logs to avoid DB lock contention (#23727)
- Running 30k connections was generating a ton of lock contention in the
DB
2026-04-03 15:47:26 -05:00
Jon Ayers 333503f74e feat: improve coordinator peer mapping performance (#23696)
- Skipping DB querying entirely for peers that aren't actually connected
to our coordinator
- Opportunistically batching the queries for peers
2026-04-03 14:22:58 -05:00
Jeremy Ruppel 01b8cdb00d fix: remove work/personal onboarding telemetry (#24021)
Following on from #23989 #24018 

- We also no longer want to collect `IsBusiness` demographic data
- Newsletter fields no longer allow `nil` as a value, instead default to
false

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 14:26:35 -04:00
Hugo Dutka ec83065b59 fix(coderd/x/chatd): inflight wait group data race (#24007)
Addresses https://github.com/coder/internal/issues/1450
2026-04-03 20:04:09 +02:00
Jeremy Ruppel 2a1bef18e0 fix: remove IndustryType and OrgSize from FirstUserOnboarding telemetry (#24018)
New `IndustryType` and `OrgSize` enums were added in #23989, but they
are no longer desired in the onboarding/marketing telemetry data. This
removes them.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 11:35:37 -04:00
Paweł Banaszewski 8369fa88fd feat: add columns for cached tokens from aibridge (#23832)
Two new columns added to aibridge_token_usages:
  - cache_read_input_tokens (BIGINT, default 0)
  - cache_write_input_tokens (BIGINT, default 0)

Migration backfills existing rows by extracting values from the metadata
JSONB column (cache_read_input, input_cached, prompt_cached for reads
(max value selected since only 1 should be set), cache_creation_input
for writes).

All references to data from metadata were updated to reference new
columns. No other changes then changing where data is extracted from.

Requires aibridge library version bump to include:
https://github.com/coder/aibridge/pull/229
Fixes: https://github.com/coder/aibridge/issues/150
2026-04-03 16:27:31 +02:00
Jeremy Ruppel da3c46b557 feat: add onboarding info fields to first user setup (#23989)
Add optional demographic and newsletter preference fields to the setup
page: business use (yes/no), industry type, organization size, and two
newsletter toggles (marketing, release/security updates).

The new data flows through telemetry via a FirstUserOnboarding struct in
the snapshot payload, sent once when the first user is created. The
telemetry-server and BigQuery schema changes are required separately to
persist this data.

---------

Co-authored-by: default <davidiii@fraley.us>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 09:52:52 -04:00
Hugo Dutka 53482adc2d fix(coderd/x/chatd): TestAwaitSubagentCompletion/ContextCanceled flake (#24008)
Addresses https://github.com/coder/internal/issues/1437
2026-04-03 13:11:38 +02:00
Sas Swart 5b6b7719df fix: make prebuild claiming durable and idempotent (#23108)
## Problem

When a prebuilt workspace is claimed, the agent reinitializes via a
single fire-and-forget pubsub event over SSE. If the agent's SSE
connection is interrupted at claim time, the event is permanently lost —
the workspace is stuck with no self-healing path.

Additionally, regular (non-prebuild) workspaces had no way to opt out of
the `/reinit` polling loop — agents would reconnect indefinitely to an
endpoint that would never send them anything useful.

## Root Cause

`workspaceAgentReinit` fetches the workspace (with its current
`owner_id`) via `GetWorkspaceByAgentID`, but never checked whether a
claim already happened. It only subscribed to pubsub for future events.
The database already has durable claim state (`owner_id` changes from
`PrebuildsSystemUserID` to the real user), but no layer ever consulted
it on reconnection.

## Solution

### Server-side durable check with first-build-initiator gating

**TOCTOU-safe ordering**: Subscribe to pubsub claim events *before* any
durable checks, so a claim that fires during the check is buffered in
the channel rather than lost.

**First-build-initiator gating**: When `!workspace.IsPrebuild()` (owner
is no longer the system user), look up the first build's `InitiatorID`.
The prebuild reconciler always uses `PrebuildsSystemUserID` as the
initiator. This distinguishes claimed prebuilds from regular workspaces
without any SQL schema changes.

- **Regular workspace** (first build initiator ≠ system user) → **409
Conflict**, agent stops reconnecting
- **Claimed prebuild, build completed** → pre-seed channel with reinit
event and close it, transmitter delivers one-shot then exits
- **Claimed prebuild, build in-progress** → fall through to pubsub
subscription, agent waits for completion event
- **Unclaimed prebuild** → pubsub subscription (existing happy path)

### Declarative reinit events (defense-in-depth)

- Added `UserID` field to `ReinitializationEvent` with JSON tags
- Switched pubsub serialization from raw string to JSON (with
backward-compat fallback for rolling upgrades)
- Populated `UserID` at both the publish site and the durable check

### Agent SDK: 409 handling

`WaitForReinitLoop` detects 409 Conflict from the server and closes the
`reinitEvents` channel, cleanly exiting the retry goroutine.

### Agent CLI: fixed two bugs + added reinitCtx

- **Closed channel (`!ok`)**: now blocks on `<-ctx.Done()` instead of
`continue`, keeping the current agent running. Previously this would
leak agents by skipping `agnt.Close()` and re-entering the loop.
- **Duplicate owner reinit**: cancels `reinitCtx` (stops the reinit
goroutine), then blocks on `<-ctx.Done()`. Previously `continue` would
skip cleanup and create a new agent on the next loop iteration.
- **`reinitCtx`**: a cancellable child of `ctx` passed to
`WaitForReinitLoop`, allowing the agent to stop the reinit HTTP polling
after reinit completes.

### Agent-side idempotency

Tracks `lastOwnerID` in the agent reinit loop — duplicate events for the
same owner are skipped.

## Testing

- **"unclaimed prebuild receives reinit via pubsub"**: prebuild owned by
system user, pubsub event triggers reinit
- **"claimed prebuild receives one-shot reinit on reconnect"**: first
build by system user, owner changed, build completed → immediate reinit
(no pubsub needed)
- **"claimed prebuild waits during in-progress claim build"**: claimed
but build still running → no reinit until build completes
- **"regular workspace gets 409"**: first build by real user → 409
Conflict, agent stops polling
- Updated claim publisher/listener tests: verify `UserID` survives JSON
round-trip + backward compat with raw string payloads
- Updated SSE round-trip test: verify `UserID` survives transmit →
receive cycle

Fixes #22359

## Rolling upgrade note

During a rolling deploy where old coderd instances coexist with new
ones, the pubsub `ReinitializationEvent` has a new `workspace_id` field
(JSON key `workspace_id`). Old publishers send a raw reason string
instead of JSON; the new listener gracefully falls back by treating the
entire payload as the reason and filling in `WorkspaceID` from context.
The only visible effect during the upgrade window is that `WorkspaceID`
may be the zero UUID in agent-side logs — this is cosmetic and resolves
once all instances are updated.
2026-04-02 23:51:02 +02:00
Zach 990c006f28 feat(coderd/database): add value_key_id column to user_secrets for encryption (#23997)
Add a nullable `value_key_id` column to the `user_secrets` table with a
foreign key to `dbcrypt_keys`. This is the column dbcrypt uses to track
which encryption key encrypted a given secret's value. This is required
for encryption of user secret values.

The column was missing from the original migration (000357).
2026-04-02 15:40:32 -06:00
Michael Suchacz 7d0a0c6495 feat: provider key policies and user provider settings (#23751) 2026-04-02 19:46:42 +02:00
Hugo Dutka 17dec2a70f feat: agents desktop recordings backend (#23894)
This PR introduces screen recording of the computer use agent using the
virtual desktop.

- Screen recording is triggered by a `wait_agent` tool call. Recording
is stopped by a successful `wait_agent` tool call or when there hasn't
been any desktop activity for 10 minutes.
- Recordings are handled by the `portabledesktop` cli via the `record`
command. The videos are sped up in periods of inactivity.
- Recordings are saved to the database to the `chat_files` table.
There's a hard limit of 100MB per recording. Larger recordings are
dropped.
- A successful `wait_agent` on a computer use subagent tool call returns
a `recording_file_id`, later allowing the frontend to display the
corresponding video.
2026-04-02 17:23:27 +00:00
dylanhuff-at-coder f796f3645f fix(coderd): fix isContextLimitKey false positive on max_context_version (#23950)
`isContextLimitKey` had a fallback heuristic that matched any key starting with `"max"` containing `"context"`, causing false positives on keys like `"max_context_version"`. A provider returning such metadata would have the value parsed as a context limit.

Replace substring matching on the separator-stripped key with word-level matching. A new `metadataKeyWords` function tokenizes keys by splitting on separators and camelCase boundaries, then the fallback requires
`"context"` paired with a limit-related word (`"limit"`, `"window"` + qualifier, `"length"` + qualifier, or `"tokens"` + qualifier). Known exact forms like `"context_window"` remain in the fast-path switch.

Closes https://github.com/coder/coder/issues/23332
2026-04-02 10:07:01 -07:00
Cian Johnston b5da77ff55 test: cover read_template and create_workspace allowlist enforcement (#23645)
- Extend `TestChatTemplateAllowlistEnforcement` to also exercise
`read_template` and `create_workspace` through the allowlist
- Mock LLM now chains 4 tool calls: list_templates, read_template
(blocked), read_template (allowed), create_workspace (blocked)
- Wire dummy `CreateWorkspace` config into test server so the tool
reaches the allowlist check
- Generalize tool result collection to support multiple calls per tool
name

> 🤖 Created by Coder Agents and reviewed by Kyle the human.
2026-04-02 15:39:40 +00:00
Cian Johnston d4a9c63e91 feat: auto-assign agents-access role to new users when experiment enabled (#23968)
When the `agents` experiment is enabled, new users are automatically
granted the `agents-access` role at creation time so they can use Coder
Agents without manual admin intervention.

- Auto-assigns in `CreateUser()` — covers admin API, OAuth, and OIDC
creation paths
- Skips auto-assign for OIDC users when enterprise site role sync is
enabled (sync overwrites roles on every login; those admins should use
`--oidc-user-role-default` instead)
- CLI `create-admin-user` bypasses `CreateUser()` but creates `owner`
users who already have all permissions

> 🤖 Written by a Coder Agent. Will be reviewed by a human.
2026-04-02 14:46:47 +01:00