Commit Graph

1456 Commits

Author SHA1 Message Date
Jaayden Halko 410f9a5e19 feat: allow renaming of agent chat title (#24489)
Co-authored-by: Coder Agents <noreply@coder.com>
2026-04-20 14:00:46 +01:00
Thomas Kosiewski 18a30a7a10 feat: add chat debug HTTP handlers and API docs (#23918) 2026-04-20 13:34:41 +02:00
Mathias Fredriksson 467430d8fa fix: sort child chats newest-first and prepend on creation (#24524)
GetChildChatsByParentIDs sorted created_at ASC, but the cache
helper appended new children to the end. On refetch the API and
cache agreed on oldest-first, putting the just-created child at
the bottom. Users expect newest first, matching the root-chat
sidebar convention.

- SQL: change child sort to created_at DESC, id DESC.
- Cache: prepend instead of append in addChildToParentInCache
  (renamed from appendChildToParentInCache to avoid leaking
  position semantics).
- Test: update ordering assertion to expect newest-first.

Refs #24404
2026-04-20 10:43:31 +00:00
Thomas Kosiewski df7e838c21 feat(coderd): wire debug logging into chat lifecycle (#23917) 2026-04-20 12:27:16 +02:00
Mathias Fredriksson fc2493780f fix: exclude subagent chats from sidebar pagination (#24404)
GetChats now returns only root chats (parent_chat_id IS NULL).
A new GetChildChatsByParentIDs query fetches children for visible
roots and embeds them in each parent's Children field. The
singular getChat endpoint does the same.

Archive invariant is one-way: parent archived implies child
archived. Parent archive/unarchive cascades via root_chat_id.
Individual child archive is permitted; child unarchive while the
parent is archived is rejected atomically (row lock on child,
re-read parent inside the transaction). Embedded children are
filtered by the caller's archive state so individually-archived
children stay hidden from active-parent views.

Gitsync MarkStale uses GetChatsByWorkspaceIDs directly;
MarkStaleParams.OwnerID removed (dead after the switch).

Frontend: buildChatTree reads from the embedded children field,
WebSocket handlers route child events into the parent's children
array, and archiving a child strips it from the parent cache.
2026-04-20 13:19:59 +03:00
Spike Curtis 2ea27e897b chore: split Pubsub interface into Publisher and Subscriber (#24442)
<!--

If you have used AI to produce some or all of this PR, please ensure you have read our [AI Contribution guidelines](https://coder.com/docs/about/contributing/AI_CONTRIBUTING) before submitting.

-->

Splits the Pubsub into Publisher and Subscriber interfaces. Allows components to scope down their needs if they only publish or only subscribe. This allows smaller fakes/mocks and generally better encapsulation.
2026-04-17 22:58:33 -04:00
Spike Curtis e19b21b7d5 chore: add GetLatestWorkspaceBuildWithStatusByWorkspaceID query (#24441)
<!--

If you have used AI to produce some or all of this PR, please ensure you have read our [AI Contribution guidelines](https://coder.com/docs/about/contributing/AI_CONTRIBUTING) before submitting.

-->

relates to GRU-18  
  
Adds new database query supporting the Agent Connection Watch we will add.
2026-04-17 22:47:08 -04:00
Thomas Kosiewski 91f9de27a1 feat(coderd): add chat debug service and summary aggregation (#23916) 2026-04-17 16:27:53 +02:00
Dean Sheather 4ba74dcdc8 feat(coderd): add PR status summary to telemetry snapshots (#24379)
Adds aggregate PR counts (total, open, merged, closed) from
`chat_diff_statuses` to telemetry snapshots, giving visibility into AI
agent PR outcomes across deployments.

The existing telemetry system reports `Chats`, `ChatMessageSummaries`,
and `ChatModelConfigs`, but had no PR-level data. This adds a
`ChatDiffStatusSummary` field to the `Snapshot` struct with four
all-time counts derived from a single aggregate query.

<details>
<summary>Implementation details</summary>

- New SQL query `GetChatDiffStatusSummary` counts `chat_diff_statuses`
rows with non-NULL `pull_request_state`, grouped by state
(open/merged/closed).
- `ChatDiffStatusSummary` struct added to telemetry `Snapshot`,
collected via a parallel `eg.Go()` block in `createSnapshot()`.
- `dbauthz` wrapper uses `rbac.ResourceSystem` (telemetry-only pattern).
- Test covers both empty state (zero counts) and populated state (mixed
states + NULL-state exclusion).

</details>

> 🤖 Generated by Coder Agents
2026-04-17 21:56:11 +10:00
Michael Suchacz 73b5058923 feat: add Explore mode as subagent-only modality (#24448)
> This PR was authored by Mux on behalf of Mike.

Introduce Explore mode, a read-only subagent modality for delegated
discovery and code investigation.

## What

Adds a `spawn_explore_agent` tool that creates child chats restricted to
read-only operations. An admin can optionally configure a
deployment-wide
model override so Explore subagents use a model optimized for large
context
or reasoning without changing the root chat's model.

### Backend

- New `ChatModeExplore` enum value (migration 000471).
- `spawn_explore_agent` tool definition with read-only allowlist:
`read_file`, `execute`, `process_output`, `read_skill`,
`read_skill_file`.
  Write tools, file editors, and nested subagent spawning are blocked.
- Deployment config storage for the Explore model override
  (`agents_chat_explore_model_override` in `site_configs`).
- Model resolution hierarchy: configured override, then current turn
model,
then global default. Silent fallback with warning log when the override
  becomes unavailable.
- RBAC: `AsChatd` for daemon reads, `ActionRead` and `ActionUpdate` on
  `ResourceDeploymentConfig` for admin API calls.
- Plan mode root chats can use `spawn_explore_agent` for read-only
research,
  matching the planning prompt guidance.
- The Explore override config API now reports malformed saved overrides
as
  "treated as unset" so admins can clear them explicitly.

### Frontend

- `ExploreModelOverrideSettings` component in admin agent behavior
settings.
  Uses `ModelSelector`, handles unavailable model warnings, and supports
  explicit Save and Clear actions.
- Malformed saved overrides show a warning and require an explicit Save
to
  clear, instead of Clear auto-submitting behind the scenes.

### Tests

- Integration: `TestExploreSubagentIsReadOnly` (full spawn flow, tool
  verification, prompt overlay, DB state).
- Unit: tool allowlist tests for explore, plan, and default modes.
- Internal: model override resolution with valid, invalid UUID,
disabled, and
  unconfigured override scenarios.
- RBAC: `dbauthz_test.go` for `GetChatExploreModelOverride` and
  `UpsertChatExploreModelOverride`.
- API: admin set and clear, malformed stored override reporting,
disabled
  model rejection, non-admin denial.
2026-04-17 13:40:17 +02:00
Michael Suchacz 1092093e98 feat: add internal subagent model override wiring (#24399)
> Mux working on behalf of Mike.

## Summary
- add an enabled chat model config lookup by ID for internal callers
- keep `spawn_agent` unchanged while threading an internal model
override through child subagent chat creation
- extend chatd coverage for inherited bindings, plan mode, and internal
override behavior

## Validation
- `go test ./coderd/x/chatd ./coderd/database/dbauthz`
- `make lint`
2026-04-16 17:08:02 +02:00
Ethan 55e525fc28 ci: add InTx linter replacing ruleguard rule (#24422)
Replace the old `InTx` ruleguard rule in `scripts/rules.go` with a
custom in-tree `go/analysis` analyzer under `scripts/intxcheck/`. The
new analyzer catches the same direct and pass-through misuse classes as
before, plus two new classes the pattern-matcher couldn't reach:

- **Indirect same-package helper misuse** — flags `p.someHelper(ctx)`
inside `InTx` when the helper body uses the outer store (the PR #24369
bug class).
- **Nested dangerous closures** — descends into `go func() { ... }()`,
`defer func() { ... }()`, and immediately-invoked function literals.

The analyzer uses semantic `types.Object` identity instead of raw
expression string comparison, which avoids false positives from
closure-local shadowing and catches simple aliases like `outer := s.db`
and `alias := s`.

This PR also fixes three real outer-store-inside-transaction bugs the
new analyzer surfaced:

- `coderd/wsbuilder/wsbuilder.go`: `FindMatchingPresetID` and
`getWorkspaceTask` now use the inner transaction store instead of
`b.store`.
- `enterprise/dbcrypt/dbcrypt.go`: `ensureEncrypted` now calls
`s.InsertDBCryptKey` (the tx-wrapped store) instead of
`db.InsertDBCryptKey`. The `dbCrypt.InTx` method wraps the raw tx in a
new `*dbCrypt`, so `s.InsertDBCryptKey` still dispatches through the
encryption layer.

Two call sites need `// intxcheck:ignore` suppressions. Both are one-off
patterns that only look like misuse because the analyzer doesn't track
assignments — proving them safe would require full dataflow analysis,
which is well beyond what a targeted lint like this should attempt:

- `coderd/database/dbfake/dbfake.go` — `b.db` is reassigned to `tx` on
the preceding line, so `b.doInTX()` actually uses the transaction. The
analyzer sees the original `b.db` identity and flags it.
- `coderd/database/db_test.go` — test intentionally passes the outer
store to `require.Equal` to assert that nested `InTx` returns the same
handle.

Suppressions use `// intxcheck:ignore` instead of `//nolint:intxcheck`
because `intxcheck` runs as a standalone `go/analysis` tool outside
golangci-lint. golangci-lint's `nolintlint` checker flags `//nolint`
directives for linters it doesn't control, so we use a custom comment
prefix to avoid that conflict.
2026-04-17 00:07:30 +10:00
Dean Sheather 3452ab3166 chore: add client_type field to chats and telemetry (#24342)
Add a `chat_client_type` enum (`ui` | `api`) and `client_type` column to
the `chats` table. The column defaults to `api` for new rows so API
callers don't need to set it explicitly. Existing rows are backfilled to
`ui`.

The field flows through `CreateChatRequest`, `chatd.CreateOptions`,
`InsertChat`, and is returned in the `Chat` response via `db2sdk`.

<details>
<summary>Implementation notes (Coder Agents generated)</summary>

### Changes

**Database migration (000469)**
- New enum `chat_client_type` with values `ui`, `api`.
- New `client_type` column, `NOT NULL DEFAULT 'api'`.
- Backfill: `UPDATE chats SET client_type = 'ui'`.

**SQL query** — `InsertChat` now includes `client_type`.

**SDK** — `ChatClientType` type added; `ClientType` field added to both
`CreateChatRequest` (optional, defaults server-side to `api`) and `Chat`
response.

**Handler** — `postChats` maps the request field (defaulting to `api`)
and passes it through `chatd.CreateOptions`.

**Sub-agent** — Child chats inherit their parent's `client_type`.

**db2sdk** — Maps the database value to the SDK type.

### Decision log
- Default is `api` (not `ui`) so existing API integrations get the
correct value without code changes.
- Backfill sets existing rows to `ui` per requirement.
- Child chats inherit `client_type` from parent rather than defaulting.
</details>
2026-04-16 23:57:05 +10:00
Michael Suchacz e5707a13d6 feat: support multiple agents with shared instance-identity auth (#24325)
> This PR was authored by Mux on behalf of Mike.

## Summary

Adds support for multiple peer root workspace agents sharing the same
`auth_instance_id`, so AWS, Azure, and GCP instance-identity auth can
issue the correct session token for a selected agent instead of assuming
a
single root agent per instance.

## Problem

When a Terraform template attaches two or more `coder_agent` resources
(with `auth = "aws-instance-identity"`) to a single compute instance,
every agent shares the same cloud instance ID. The existing singular
lookup picks whichever agent was created most recently, silently
ignoring
the others.

## Solution

Introduce an optional pre-auth agent selector (`CODER_AGENT_NAME`) and
make the server-side lookup ambiguity-aware.

**Database layer:**
- `GetWorkspaceAgentsByInstanceID` (`:many`): returns all matching root
  agents for an instance ID.
- `GetWorkspaceAgentByInstanceIDAndName` (`:one`): returns the named
root
  agent for disambiguation.

**SDK and CLI:**
- `agent_name` field added to AWS, Azure, and GCP request structs
  (`omitempty` for backward compatibility).
- `CODER_AGENT_NAME` env var and `--agent-name` flag wired into the
agent
  bootstrap before instance-identity auth runs.

**Server handler (`handleAuthInstanceID`):**
- When `agent_name` is present: direct lookup by (instance ID, name).
- When absent: legacy lookup, then resource-scoped ambiguity check.
  Returns 409 with available agent names if multiple root agents match.
- Whitespace-only names are trimmed and treated as unspecified.
- Sub-agents remain excluded (`parent_id IS NULL` filter).

**Verification template:**
- `examples/templates/aws-multi-agent/` provisions one EC2 instance with
  two agents (`main` and `dev`), both using instance-identity auth with
  `CODER_AGENT_NAME` set in the cloud-init user data.

## Backward compatibility

Existing single-agent deployments work unchanged. The `agent_name` field
is optional with `omitempty`, and the unnamed path preserves today's
behavior when only one root agent matches.
2026-04-16 13:59:09 +02:00
Michael Suchacz 1cf0354f72 feat: add plan mode with restricted tool boundary (#24236)
> This PR was authored by Mux on behalf of Mike.

## Summary
- add persistent plan mode for chats and the chat-specific plan file
flow
- add structured planning tools such as `ask_user_question` and
`propose_plan`
- keep `write_file` and `edit_files` constrained to the chat-specific
plan file during plan turns
- allow shell exploration in plan mode, including subagents, via
`execute` and `process_output`
- block implementation-oriented, provider-native, MCP, dynamic, and
computer-use tools during plan turns
- update the chat UI, tests, and docs for the new planning flow
2026-04-16 11:12:01 +02:00
Cian Johnston 6194bd6f57 fix: address post-merge review findings for chat org scoping (#24297)
Addresses review findings from #23827 that were added post-merge:

- Persisted attachments now store `organizationId`; mismatched orgs
pruned on restore
- Workspace selection reconciliation: stale IDs from previous orgs
dropped via derived `effectiveWorkspaceId`
- Org picker uses `permittedOrganizations()` for RBAC-aware filtering
- Org picker hidden when user belongs to only one org
- Ref-sync `useEffect` replaced with `useEffectEvent`
- `CreateWorkspace()` and `ListTemplates()` take `organizationID` and
`db` as required function parameters instead of optional struct fields —
compiler enforces them, removes scattered nil guards
- Cross-org template check in `CreateWorkspace` is now unconditional
- `ListTemplates` org-scoping filter now has test coverage
- `setupChatInfra` comment fixed; test helpers use params structs
instead of positional UUIDs
- Enterprise test documents that org admin only sees own chats (handler
hardcodes `OwnerID` — future work needs sidebar UI before lifting that
restriction)

> 🤖
2026-04-15 11:39:05 +01:00
Callum Styan 730edba87a fix: fix false positive disconnected agent metric reporting (#24225)
We noticed during higher active workspace counts that the agent
connection metric, generated via a query to the database, would report a
relatively high amount of agents as disconnected. Somewhere between 5
and 20%. However, other metrics such as # of websocket connections would
suggest that all agent connections are healthy.

Looking at the `Agents` function in prometheus metrics, plus the query
execution time (not accounting for actual database RT time) revealed
that this reporting of agents as disconnected was almost certainly false
positives due to clock drift in the way we're generating the metric
values. At 10k metrics, with a p50 of 2ms and p99 of 5ms, the entire
`agents` function could take upwards of 50s to execute. Because we were
doing a query/database RT to query th apps for each agent individually,
and grabbing a `time.Now` value on each iteration of that loop, it's
likely the portion of agents that were reported as disconnected were
those that had last heartbeat the furthest in the past.

The fix here is to set a consistent `now` before fetching agent data to
avoid clock drift inflating the inactive timeout comparison, and replace
the per-agent app query N+1 with a single batched lookup to prevent loop
execution time from pushing agents over the disconnected threshold.

Signed-off-by: Callum Styan <callumstyan@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 22:23:06 -07:00
Cian Johnston c552f9f281 fix: stop group spend limits from leaking across org boundaries (#24294)
Three SQL queries (`GetUserGroupSpendLimit`,
`ResolveUserChatSpendLimit`, `GetUserChatSpendInPeriod`) aggregated chat
spend limits and usage globally across all organizations. A restrictive
group limit in org A would bleed into org B.

## Changes

- Add `organization_id` parameter to all three SQL queries in
`coderd/database/queries/chats.sql`
- When nil UUID is passed, queries fall back to global behavior
(backward compat for HTTP dashboard endpoints)
- When real org ID is passed, limits and spend are scoped to that
organization
- Thread `organizationID` through `ResolveUsageLimitStatus` →
`checkUsageLimit` → all chatd call sites
- Update dbauthz wrappers for new param structs
- HTTP endpoints (`chatCostSummary`, `getMyChatUsageLimitStatus`) pass
`uuid.Nil` with TODO for future org-scoped UI
- Add `TestResolveUsageLimitStatus_OrgScoped` with 5 test cases covering
org isolation, nil-UUID fallback, spend scoping, and user override
priority

Closes coder/internal#1466

> 🤖
2026-04-14 16:56:17 +01:00
Yevhenii Shcherbina b78eba9f9d feat: make sure creds are always masked (#24241)
## Summary  
Adds a `sanitizeCredentialHint` safety check in the db-to-SDK conversion
layer to ensure credential hints are always masked before being exposed
in the API. Also adds `credential_kind` and `credential_hint` assertions
to the session threads API test.
2026-04-13 10:14:38 -04:00
Thomas Kosiewski 6ab30123bf feat: add chat debug log tables, queries, and SDK types (#23913) 2026-04-13 15:06:06 +02:00
Cian Johnston 22062ec52e feat: add organization scoping to chats (#23827)
Fixes https://github.com/coder/internal/issues/1436

* Adds organization_id to chats with backfill (workspace org → user org membership → default org)
* No support yet for ACLs (follow-up issue)
- Cross-org workspace binding rejected (both in `CreateChatRequest` and in `create_workspace` tool
- Adds `OrganizationAutocomplete` to `AgentCreateForm`
- Docs updated with `organization_id` in chats-api.md

> 🤖 Written by a Coder Agent. Reviewed by many humans and many agents.

---------

Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
2026-04-13 12:31:25 +01:00
Mathias Fredriksson a62ead8588 fix(coderd): sort pinned chats first in GetChats pagination (#24222)
The GetChats SQL query ordered by (updated_at, id) DESC with no
pin_order awareness. A pinned chat with an old updated_at could
land on page 2+ and be invisible in the sidebar's Pinned section.

Add a 4-column ORDER BY: pinned-first flag DESC, negated pin_order
DESC, updated_at DESC, id DESC. The negation trick keeps all sort
columns DESC so the cursor tuple < comparison still works. Update
the after_id cursor clause to match the expanded sort key.

Fix the false handler comment claiming PinChatByID bumps updated_at.
2026-04-10 17:13:19 +00:00
J. Scott Miller 7bde763b66 feat: add workspace build transition to provisioner job list (#24131)
Closes #16332

Previously `coder provisioner jobs list` showed no indication of what a workspace
build job was doing (i.e., start, stop, or delete). This adds
`workspace_build_transition` to the provisioner job metadata, exposed in
both the REST API and CLI. Template and workspace name columns were also
added, both available via `-c`.

```
$ coder provisioner jobs list -c id,type,status,"workspace build transition"
ID                                    TYPE                     STATUS     WORKSPACE BUILD TRANSITION
95f35545-a59f-4900-813d-80b8c8fd7a33  template_version_import  succeeded
0a903bbe-cef5-4e72-9e62-f7e7b4dfbb7a  workspace_build          succeeded  start
```
2026-04-10 09:50:11 -05:00
Matt Vollmer 36141fafad feat: stack insights tables vertically and paginate Pull requests table (#24198)
The "By model" and "Pull requests" tables on the PR Insights page
(`/agents/settings/insights`) were side-by-side at `lg` breakpoints, and
the Pull requests table was hard-capped at 20 rows by the backend.

- Replaced `lg:grid-cols-2` with a single-column stacked layout so both
tables span the full content width.
- Removed the `LIMIT 20` from the `GetPRInsightsRecentPRs` SQL query so
all PRs in the selected time range are returned.
- Can add this back if we need it. If we do, we should add a little
subheader above this table to indicate that we're not showing all PRs
within the selected timeframe.
- Added client-side pagination to the Pull requests table using
`PaginationWidgetBase` (page size 10), matching the existing pattern in
`ChatCostSummaryView`.
- Renamed the section heading from "Recent" to "Pull requests" since it
now shows the full set for the time range.
<img width="1481" height="1817" alt="image"
src="https://github.com/user-attachments/assets/0066c42f-4d7b-4cee-b64b-6680848edc68"
/>


> 🤖 PR generated with Coder Agents
2026-04-10 10:48:54 -04:00
Garrett Delfosse 3462c31f43 fix: update directory for terraform-managed subagents (#24220)
When a devcontainer subagent is terraform-managed, the provisioner sets
its directory to the host-side `workspace_folder` path at build time. At
runtime, the agent injection code determines the correct
container-internal
path from `devcontainer read-configuration` and sends it via
`CreateSubAgent`.

However, the `CreateSubAgent` handler only updated `display_apps` for
pre-existing agents, ignoring the `Directory` field. This caused
SSH/terminal
sessions to land in `~` instead of the workspace folder (e.g.
`/workspaces/foo`).

Add `UpdateWorkspaceAgentDirectoryByID` query and call it in the
terraform-managed subagent update path to also persist the directory.

Fixes PLAT-118

<details><summary>Root cause analysis</summary>

Two code paths set the subagent `Directory` field:

1. **Provisioner (build time):** `insertDevcontainerSubagent` in
`provisionerdserver.go`
   stores `dc.GetWorkspaceFolder()` — the **host-side** path from the
   `coder_devcontainer` Terraform resource (e.g. `/home/coder/project`).

2. **Agent injection (runtime):**
`maybeInjectSubAgentIntoContainerLocked` in
`api.go` reads the devcontainer config and gets the correct
**container-internal**
path (e.g. `/workspaces/project`), then calls `client.Create(ctx,
subAgentConfig)`.

For terraform-managed subagents (those with `req.Id != nil`),
`CreateSubAgent`
in `coderd/agentapi/subagent.go` recognized the pre-existing agent and
entered
the update path — but only called `UpdateWorkspaceAgentDisplayAppsByID`,
discarding the `Directory` field from the request. The agent kept the
stale
host-side path, which doesn't exist inside the container, causing
`expandPathToAbs` to fall back to `~`.

</details>

> [!NOTE]
> Generated by Coder Agents
2026-04-10 10:11:22 -04:00
Zach 95cff8c5fb feat: add REST API handlers and client methods for user secrets (#24107)
Add the five REST endpoints for managing user secrets, SDK client
methods, and handler tests.

Endpoints:
- `POST /api/v2/users/{user}/secrets`
- `GET /api/v2/users/{user}/secrets`
- `GET /api/v2/users/{user}/secrets/{name}`
- `PATCH /api/v2/users/{user}/secrets/{name}`
- `DELETE /api/v2/users/{user}/secrets/{name}`

Routes are registered under the existing `/{user}` group with
`ExtractUserParam`. The delete query was changed from `:exec` to
`:execrows` so the handler can distinguish "not found" from success
(DELETE with `:exec` silently returns nil for zero affected rows).
2026-04-09 12:12:55 -06:00
Yevhenii Shcherbina 8237822441 feat: byok observability api (#24207)
## Summary
Exposes `credential_kind` and `credential_hint` on AI Bridge session
threads, making credential metadata visible in the session detail API.
   
Each thread in the `/api/v2/aibridge/sessions/{session_id}` response now
includes:
- `credential_kind`: `centralized` or `byok`
- `credential_hint`: masked credential (e.g. `sk-a...pgAA`)
Values are taken from the thread's root interception.
## Changes

- `codersdk/aibridge.go`: Added `CredentialKind` and `CredentialHint`
fields to `AIBridgeThread`
- `coderd/database/db2sdk/db2sdk.go`: Populated from root interception
in `buildAIBridgeThread`
  - `SessionTimeline.stories.tsx`: Added fields to mock thread data
2026-04-09 11:41:17 -04:00
Kyle Carberry 391b22aef7 feat: add CLI commands for managing chat context from workspaces (#24105)
Adds `coder exp chat context add` and `coder exp chat context clear`
commands that run inside a workspace to manage chat context files via
the agent token.

`add` reads instruction and skill files from a directory (defaulting to
cwd) and inserts them as context-file messages into an active chat.
Multiple calls are additive — `instructionFromContextFiles` already
accumulates all context-file parts across messages.

`clear` soft-deletes all context-file messages, causing
`contextFileAgentID()` to return `!found` on the next turn, which
triggers `needsInstructionPersist=true` and re-fetches defaults from the
agent.

Both commands auto-detect the target chat via `CODER_CHAT_ID` (already
set by `agentproc` on chat-spawned processes), or fall back to
single-active-chat resolution for the agent. The `--chat` flag overrides
both.

Also adds sub-agent context inheritance: `createChildSubagentChat` now
copies parent context-file messages to child chats at spawn time, so
delegated sub-agents share the same instruction context without
independently re-fetching from the workspace agent.

<details><summary>Implementation details</summary>

**New files:**
- `cli/exp_chat.go` — CLI command tree under `coder exp chat context`

**Modified files:**
- `agent/agentcontextconfig/api.go` — `ConfigFromDir()` reads context
from an arbitrary directory without env vars
- `codersdk/agentsdk/agentsdk.go` — `AddChatContext`/`ClearChatContext`
SDK methods
- `coderd/workspaceagents.go` — POST/DELETE handlers on
`/workspaceagents/me/chat-context`
- `coderd/coderd.go` — Route registration
- `coderd/database/queries/chats.sql` — `GetActiveChatsByAgentID`,
`SoftDeleteContextFileMessages`
- `coderd/database/dbauthz/dbauthz.go` — RBAC implementations for new
queries
- `coderd/x/chatd/subagent.go` — `copyParentContextFiles` for sub-agent
inheritance
- `cli/root.go` — Register `chatCommand()` in `AGPLExperimental()`

**Auth pattern:** Uses `AgentAuth` (same as `coder external-auth`) —
agent token via `CODER_AGENT_TOKEN` + `CODER_AGENT_URL` env vars.

</details>

> 🤖 Generated by Coder Agents

---------

Co-authored-by: Michael Suchacz <203725896+ibetitsmike@users.noreply.github.com>
2026-04-09 16:33:00 +02:00
Atif Ali 584c61acb5 fix: mark connecting agents as unhealthy instead of healthy (#24044)
## Problem

Workspaces showed as "Healthy" immediately after creation while the
agent was still downloading, starting, or connecting. If the agent never
connected, the workspace stayed "Healthy" for the entire connection
timeout (~120s), then abruptly flipped to "Unhealthy".

## Root cause

In `db2sdk.WorkspaceAgent`, the health switch had no case for
`WorkspaceAgentConnecting`. Agents in `connecting` status with a
non-`off` lifecycle (e.g. `created` after a fresh build) fell through to
the `default` case and were marked `Healthy = true`.

## Fix

Add an explicit case for `WorkspaceAgentConnecting` that sets `Healthy =
false` with reason `"agent has not yet connected"`. The case is placed
after the existing `!connected + off` case (which correctly catches
stopped agents as "not running") and before the `timeout`/`disconnected`
cases.

```
Status        + Lifecycle       → Health reason
──────────────────────────────────────────────────────
any !connected + off           → "agent is not running"
connecting    + created/starting → "agent has not yet connected"  ← NEW
timeout       + any            → "agent is taking too long to connect"
disconnected  + any            → "agent has lost connection"
connected     + start_error    → "agent startup script exited with an error"
connected     + shutting_down  → "agent is shutting down"
connected     + ready/starting → healthy
```

The frontend already handles this case — `getAgentHealthIssue()` returns
"Workspace agent is still connecting" with `severity: "info"` for
unhealthy workspaces with connecting agents.

## Test changes

- **Healthy test**: now actually connects the agent via `agenttest.New`
before asserting health (previously passed due to the bug).
- **New Connecting test**: verifies a never-connected agent is correctly
marked unhealthy.
- **Mixed health test**: connects a1 and waits for the mixed state
(`a1.Healthy && !workspace.Healthy`) to avoid a race where both agents
are initially connecting.
- **Sub-agent excluded test**: connects the parent agent and waits for
it to be healthy before creating the sub-agent.
- **TestWorkspaceAgent/Connect**: flipped assertion to `Health.Healthy
== false` for a `dbfake` agent that never connects.

<details>
<summary>Review notes</summary>

### Known follow-up

The `healthy:false` workspace search filter maps to `[disconnected,
timeout]` and does not include `connecting`. This is a pre-existing gap
that is now more consequential — a workspace unhealthy solely due to a
connecting agent won't appear in `healthy:false` results. Worth a
follow-up issue.

### Deep review findings addressed

| Finding | Severity | Status |
|---------|----------|--------|
| Mixed health test race (all 3 reviewers) | P2 | Fixed — tightened
`Eventually` condition |
| `TestWorkspaceAgent/Connect` assertion break | P1 | Fixed — flipped
assertion |
| CLI renders red for connecting agents | Obs | Acknowledged — design
trade-off, accurate but visually strong for transient state |
| Switch case ordering overlap | Obs | Documented with inline comment |

</details>

> 🤖 This PR was created with the help of Coder Agents, and needs a human
review. 🧑💻
2026-04-09 13:21:28 +05:00
Zach 9b91af8ab7 feat: add user secrets SDK types and db2sdk converters (#24102)
Adds the SDK types and database-to-SDK conversion helpers for the user secrets feature.
2026-04-08 16:48:41 -06:00
Yevhenii Shcherbina 7f496c2f18 feat: byok-observability for aibridge (#23808)
## Summary

Adds `credential_kind` and `credential_hint` columns to
`aibridge_interceptions` to record how each LLM request was
authenticated and provide a masked credential identifier for audit
purposes.

This enables admins to distinguish between centralized API keys,
personal API keys, and subscription-based credentials in the
interceptions audit log.

## Changes

- New migration adding `credential_kind`and `credential_hint` to
`aibridge_interceptions`
- Updated `InsertAIBridgeInterception` query and proto definition to
carry the new fields
- Wired proto fields through `translator.go` and `aibridgedserver.go` to
the database

Depends on https://github.com/coder/aibridge/pull/239
2026-04-08 13:24:28 -04:00
Kyle Carberry b969d66978 feat: add dynamic tools support for chat API (#24036)
Adds client-executed dynamic tools to the chat API. Dynamic tools are
declared by the client at chat creation time, presented to the LLM
alongside built-in tools, but executed by the client rather than chatd.
This enables external systems (Slack bots, IDE extensions, Discord bots,
CI/CD integrations) to plug custom tools into the LLM chat loop without
modifying chatd's built-in tool set.

Modeled after OpenAI's Assistants API: the chat pauses with
`requires_action` status when the LLM calls a dynamic tool, the client
POSTs results back via `POST /chats/{id}/tool-results`, and the chat
resumes.

See [this example](https://github.com/coder/coder-slackbot-poc) as a
reference for how this is used. It's highly-configurable, which would
enable creating chats from webhooks, periodically polling, or running as
a Slackbot.

<details>
<summary>Design context</summary>

### Architecture

The chatloop **exits** when it encounters dynamic tools and
**re-enters** when results arrive. No blocking channels, no pubsub for
tool results, no in-memory registry. The DB is the only coordination
mechanism.

```
Phase 1 (chatloop):
  LLM response → execute built-in tools only →
  Persist(assistant + built-in results) →
  status = requires_action → chatloop exits

Phase 2 (POST /tool-results):
  Persist(dynamic tool results) →
  status = pending → wakeCh → chatloop re-enters
```

### Validation (POST /tool-results)

1. Chat status must be `requires_action` (409 if not)
2. Read chat's `dynamic_tools` → set of dynamic tool names
3. Read last assistant message → extract tool-call parts matching
dynamic tool names
4. Submitted tool_call_ids must match exactly (400 for missing/extra)
5. Persist tool-result message parts, set status to `pending`, signal
wake

### Idempotency

Tool call IDs scoped per LLM step. State machine (`requires_action` →
`pending`) is the guard. First POST wins, subsequent get 409.

### Mixed tool calls

When the LLM calls both built-in and dynamic tools in one step, built-in
tools execute immediately. Their results are persisted in phase 1.
Dynamic tool results arrive via POST in phase 2. The LLM sees all
results when the chatloop resumes.

</details>

> 🤖 Generated by Coder Agents
2026-04-08 11:54:44 -04:00
Kyle Carberry c5d720f73d feat(coderd): add telemetry for agents chats and messages (#24068)
Adds telemetry collection for the agents chat system (`/agents`) to the
existing telemetry snapshot pipeline.

Three new snapshot fields:
- **`Chats`** — per-chat metadata (id, owner, status, mode,
workspace_id, root_chat_id, has_parent, archived, model config)
collected time-windowed via `createdAfter`
- **`ChatMessageSummaries`** — per-chat aggregated message metrics
(counts by role, token sums by type, cost, runtime, model count,
compression count) collected time-windowed
- **`ChatModelConfigs`** — model configuration metadata (provider,
model, context limit, enabled, default) collected as full dump

No PII is included — titles, message content, and URLs are excluded at
the SQL level. Only structural metadata flows through telemetry.

<details><summary>Implementation plan</summary>

### SQL Queries (`coderd/database/queries/chats.sql`)
- `GetChatsCreatedAfter` — time-windowed chat metadata
- `GetChatMessageSummariesPerChat` — per-chat message aggregates via
`GROUP BY`
- `GetChatModelConfigsForTelemetry` — full dump of model configs

### Telemetry (`coderd/telemetry/telemetry.go`)
- `Chat`, `ChatMessageSummary`, `ChatModelConfig` structs
- `ConvertChat`, `ConvertChatMessageSummary`, `ConvertChatModelConfig`
conversion functions
- Three `eg.Go()` blocks in `createSnapshot()` following the existing
collection pattern

### Authorization (`coderd/database/dbauthz/dbauthz.go`)
- System-only access for all three queries via `rbac.ResourceSystem`

### Tests
- `TestChatsTelemetry` in `coderd/telemetry/telemetry_test.go` — creates
chats (root + child), messages with token/cost data, model configs;
verifies all snapshot fields
- dbauthz test entries for all three queries in
`coderd/database/dbauthz/dbauthz_test.go`

</details>

> 🤖 Generated by Coder Agents
2026-04-08 09:47:44 -04:00
Cian Johnston 233343c010 feat: add chat and chat_files cleanup to dbpurge (#23833)
Fixes https://github.com/coder/coder/issues/23910

Adds periodic cleanup of chats and chat files to the dbpurge background
goroutine, with a configurable retention period exposed in the Agent
settings UI.

> 🤖 Written by a Coder Agent. Reviewed by a human.
2026-04-08 11:08:09 +01:00
Zach 565a15bc9b feat: update user secrets queries for REST API and injection (#23998)
Update queries as prep work for user secrets API development:
- Switch all lookups and mutations from ID-based to user_id + name
- Split list query into metadata-only (for API responses) and
with-values (for provisioner/agent)
- Add partial update support using CASE WHEN pattern for write-only
value fields
- Include value_key_id in create for dbcrypt encryption support
- Update dbauthz wrappers and remove stale methods from dbmetrics
2026-04-07 09:03:28 -06:00
Kyle Carberry 684f21740d perf(coderd): batch chat heartbeat queries into single UPDATE per interval (#24037)
## Summary

Replaces N per-chat heartbeat goroutines with a single centralized
heartbeat loop that issues one `UPDATE` per 30s interval for all running
chats on a worker.

## Problem

Each running chat spawned a dedicated goroutine that issued an
individual `UPDATE chats SET heartbeat_at = NOW() WHERE id = $1 AND
worker_id = $2 AND status = 'running'` query every 30 seconds. At 10,000
concurrent chats this produces **~333 DB queries/second** just for
heartbeats, plus ~333 `ActivityBumpWorkspace` CTE queries/second from
`trackWorkspaceUsage`.

## Solution

New `UpdateChatHeartbeats` (plural) SQL query replaces the old singular
`UpdateChatHeartbeat`:

```sql
UPDATE chats
SET    heartbeat_at = @now::timestamptz
WHERE  worker_id = @worker_id::uuid
  AND  status = 'running'::chat_status
RETURNING id;
```

A single `heartbeatLoop` goroutine on the `Server`:
1. Ticks every `chatHeartbeatInterval` (30s)
2. Issues one batch UPDATE for all registered chats
3. Detects stolen/completed chats via set-difference (equivalent of old
`rows == 0`)
4. Calls `trackWorkspaceUsage` for surviving chats

`processChat` registers an entry in the heartbeat registry instead of
spawning a goroutine.

## Impact

| Metric | Before (10K chats) | After (10K chats) |
|---|---|---|
| Heartbeat queries/sec | ~333 | ~0.03 (1 per 30s per replica) |
| Heartbeat goroutines | 10,000 | 1 |
| Self-interrupt detection | Per-chat `rows==0` | Batch set-difference |

---

> 🤖 Generated by Coder Agents

<details><summary>Implementation notes</summary>

- Uses `@now` parameter instead of `NOW()` so tests with `quartz.Mock`
can control timestamps.
- `heartbeatEntry` stores `context.CancelCauseFunc` + workspace state
for the centralized loop.
- `recoverStaleChats` is unaffected — it reads `heartbeat_at` which is
still updated.
- The old singular `UpdateChatHeartbeat` is removed entirely.
- `dbauthz` wrapper uses system-level `rbac.ResourceChat` authorization
(same pattern as `AcquireChats`).

</details>
2026-04-07 10:25:46 -04:00
George K 86ca61d6ca perf: cap count queries and emit native UUID comparisons for audit/connection logs (#23835)
Audit and connection log pages were timing out due to expensive COUNT(*)
queries over large tables. This commit adds opt-in count capping: requests can
return a `count_cap` field signaling that the count was truncated at a threshold,
avoiding full table scans that caused page timeouts.

Text-cast UUID comparisons in regosql-generated authorization queries
also contributed to the slowdown by preventing index usage for connection
and audit log queries. These now emit native UUID operators.

Frontend changes handle the capped state in usePaginatedQuery and
PaginationWidget, optionally displaying a capped count in the pagination
UI (e.g. "Showing 2,076 to 2,100 of 2,000+ logs")

Related to:
https://linear.app/codercom/issue/PLAT-31/connectionaudit-log-performance-issue
2026-04-07 07:24:53 -07:00
Cian Johnston d5a1792f07 feat: track chat file associations with chat_file_links on chats (#23537)
Needed by #23833

Adds a `chat_file_links` association table to track which files are
associated with each chat.

- `AppendChatFileIDs` query links a file to a chat with deduplication
- `GetChatFileMetadataByIDs` query returns lightweight file metadata by
IDs
- Tool-created files (e.g. `propose_plan`) are linked to the chat after
insert
- User-uploaded files are linked to the chat when the referencing
message is sent
- Single-chat GET endpoint hydrates `files: ChatFileMetadata[]` on the
response

> 🤖 Created by Coder Agents and massaged into shape by a human.
2026-04-07 12:05:29 +01:00
Kyle Carberry a2ce74f398 feat: add total_runtime_ms to chat cost analytics endpoints (#24050)
Surface the aggregated `runtime_ms` from `chat_messages` through all
four cost analytics queries (summary, per-model, per-chat, per-user).
This is the key billing metric for agent compute time.

The per-chat breakdown already groups by `root_chat_id`, so subagent
runtime is automatically rolled up under the parent chat — no additional
query changes needed.

<details>
<summary>Implementation details</summary>

**SQL** (`coderd/database/queries/chats.sql`): Added
`COALESCE(SUM(cm.runtime_ms), 0)::bigint AS total_runtime_ms` to
`GetChatCostSummary`, `GetChatCostPerModel`, `GetChatCostPerChat`, and
`GetChatCostPerUser`.

**Go SDK** (`codersdk/chats.go`): Added `TotalRuntimeMs int64` to
`ChatCostSummary`, `ChatCostModelBreakdown`, `ChatCostChatBreakdown`,
and `ChatCostUserRollup`.

**Handler** (`coderd/exp_chats.go`): Wired the new field through all
converter functions and the response assembly.

**Tests** (`coderd/exp_chats_test.go`): Updated fixture to seed non-zero
`runtime_ms` values and added assertions for the new field at summary,
per-model, and per-chat levels.
</details>

> 🤖 Generated by Coder Agents
2026-04-06 12:10:57 -04:00
Jon Ayers a1d51f0dab feat: batch connection logs to avoid DB lock contention (#23727)
- Running 30k connections was generating a ton of lock contention in the
DB
2026-04-03 15:47:26 -05:00
Jon Ayers 333503f74e feat: improve coordinator peer mapping performance (#23696)
- Skipping DB querying entirely for peers that aren't actually connected
to our coordinator
- Opportunistically batching the queries for peers
2026-04-03 14:22:58 -05:00
Paweł Banaszewski 8369fa88fd feat: add columns for cached tokens from aibridge (#23832)
Two new columns added to aibridge_token_usages:
  - cache_read_input_tokens (BIGINT, default 0)
  - cache_write_input_tokens (BIGINT, default 0)

Migration backfills existing rows by extracting values from the metadata
JSONB column (cache_read_input, input_cached, prompt_cached for reads
(max value selected since only 1 should be set), cache_creation_input
for writes).

All references to data from metadata were updated to reference new
columns. No other changes then changing where data is extracted from.

Requires aibridge library version bump to include:
https://github.com/coder/aibridge/pull/229
Fixes: https://github.com/coder/aibridge/issues/150
2026-04-03 16:27:31 +02:00
Zach 990c006f28 feat(coderd/database): add value_key_id column to user_secrets for encryption (#23997)
Add a nullable `value_key_id` column to the `user_secrets` table with a
foreign key to `dbcrypt_keys`. This is the column dbcrypt uses to track
which encryption key encrypted a given secret's value. This is required
for encryption of user secret values.

The column was missing from the original migration (000357).
2026-04-02 15:40:32 -06:00
Michael Suchacz 7d0a0c6495 feat: provider key policies and user provider settings (#23751) 2026-04-02 19:46:42 +02:00
Susana Ferreira fb788530b3 feat: add provider_name column to aibridge interceptions (#23960)
## Description

Adds `provider_name` to aibridge interceptions to store the provider
instance name alongside the provider type. This allows distinguishing
between multiple instances of the same provider type (e.g. `copilot` vs
`copilot-business`).

## Changes

* Add `provider_name` column to `aibridge_interceptions` table with
backfill from `provider`.
* Add `provider_name` field to the proto `RecordInterceptionRequest`
message.
* Add `ProviderName` to the `codersdk.AIBridgeInterception` API
response.

_Disclaimer: initially produced by Claude Opus 4.6, modified and
reviewed by @ssncferreira ._
2026-04-02 10:58:13 +01:00
Ethan 7757cd8e08 refactor(coderd/x/chatd): insert chats directly as pending on creation (#23888)
Previously, `CreateChat` inserted the `chats` row with the DB default
status (`waiting`), then updated it to `pending` in the same transaction
via `setChatPendingWithStore`. This wasted two extra queries per chat
creation (`GetChatByID` + `UpdateChatStatus`) and rewrote the same row
immediately after inserting it.

Now `CreateChat` passes the status directly to `InsertChat`, so the row
is written once in its final create-time state. The
`setChatPendingWithStore` helper is removed entirely. `InsertChat` now
requires an explicit `status` parameter at all callsites instead of
relying on a DB column default.

## Motivation

On an experimental branch we're trialing firing all chatd notifications
from plpgsql triggers. The old two-step insert made that awkward: in an
`AFTER INSERT` trigger, `NEW` only contained the insert-time row
(`waiting`), not the final committed state (`pending`). To emit the
correct event payload the trigger had to be deferred and re-read the row
from `chats` at commit time.

With this change, `NEW` already contains the correct row to publish — no
deferred trigger, no extra `SELECT`, simpler and cheaper trigger logic.

That said, this seems like a worthwhile change regardless of the trigger
experiment: writing the final row state once removes unnecessary DB work
on every chat creation and makes the create path easier to reason about.
2026-04-02 14:13:51 +11:00
Ethan 5cba59af79 fix(coderd): unarchive child chats with parents (#23761)
Unarchiving a root chat now restores descendant chats in the database
and emits lifecycle events for every affected chat so passive sessions
converge without a full refetch.

This keeps archive and unarchive symmetric at both the data and
watch-stream layers by returning the affected chat family from the
database, using those post-update rows for chatd pubsub fanout, and
covering descendant lifecycle delivery with a watch-level regression
test.

Closes #23666
2026-04-01 15:30:25 +11:00
Danny Kopping 9fa103929a perf: make ListAIBridgeSessions 10x faster (#23774)
_Disclaimer: produced using Claude Opus 4.6, reviewed by me, and
validated against Dogfood dataset._

The `ListAIBridgeSessions` query materialized and aggregated all
matching interceptions before paginating, then ran expensive
token/prompt lookups across the full dataset. For a page of 25 sessions
against ~200k interceptions (our dogfood dataset), this meant:
- Three CTEs scanning all rows (filtered_interceptions, session_tokens,
session_root)
  - ARRAY_AGG(fi.id) collecting every interception ID per session
- Lateral prompt lookup via ANY(array_of_all_ids) running for every
session, not just the page
  - ~90MB of disk sorts and JIT compilation kicking in

The improvement is to restructure to paginate first and enrich after: a
single CTE groups interceptions into sessions with only cheap aggregates
(MIN, MAX, COUNT), applies cursor pagination and LIMIT, then lateral
joins fetch metadata, tokens, and prompts for just the ~25-row page.

  Measured against 220k interceptions / 160k sessions:

  | Metric             | Before | After |
  |--------------------|--------|-------|
  | Execution time     | 1800ms | 185ms |
  | Shared buffer hits | 737k   | 2.6k  |
  | Disk sort spill    | 86MB   | 16MB  |
  | Lateral loops      | 160k   | 25    |

https://grafana.dev.coder.com/goto/fbODPGtvR?orgId=1 the results are
identical, just _much_ faster.

--- 

Also includes some additional tests which I added prior to refactoring
the query to ensure no regressions on edge-cases.

---------

Signed-off-by: Danny Kopping <danny@coder.com>
2026-03-31 14:42:23 +02:00
Cian Johnston 3ce82bb885 feat: add chat-access site-wide role to gate chat creation (#23724)
- Add `chat-access` built-in role granting chat CRUD at User scope
- Exclude `ResourceChat` from member, org member, and org service
account `allPermsExcept` calls
- Allow system, owner, and user-admin to assign the new role
- Migration auto-assigns role to users who have ever created a chat
- Update RBAC test matrix: `memberMe` denied, `chatAccessUser` allowed

**Breaking change**: Members without `chat-access` lose chat creation
ability. Migration covers existing chat creators. Members who have never
created a chat do not get this role automatically applied.

> 🤖 This PR was created by a Coder Agent and reviewed by me.
2026-03-31 10:07:21 +01:00
Kyle Carberry a5cc579453 feat: add last_injected_context column to chats table (#23798)
Adds a nullable JSONB column `last_injected_context` to the `chats`
table that stores the most recently persisted injected context parts
(AGENTS.md context-file and skill message parts). The column is updated
only when `persistInstructionFiles()` runs — on first workspace attach
or when the agent changes — so there are no redundant writes on
subsequent turns.

Internal fields (`ContextFileContent`, `ContextFileOS`,
`ContextFileDirectory`, `SkillDir`) are stripped at write time so the
column only holds small metadata. No stripping needed on the read path.

<details>
<summary>Implementation notes</summary>

- New migration `000456` adds nullable `last_injected_context JSONB`
column.
- New SQL query `UpdateChatLastInjectedContext` writes the column
without touching `updated_at`.
- `persistInstructionFiles()` strips internal fields from parts via
`StripInternal()` before persisting.
- Sentinel path (no AGENTS.md) persists skill-only parts when skills
exist.
- `codersdk.Chat` exposes `LastInjectedContext []ChatMessagePart`
(omitempty).
- `db2sdk.Chat()` passes through the already-clean data.

</details>
2026-03-30 14:11:30 -04:00