Commit Graph

18 Commits

Author SHA1 Message Date
Michael Suchacz de6d62815e fix(coderd): avoid redundant workspace setup (#25615)
GPT-class chat turns could eagerly create workspaces or repeat setup
such as cloning an existing repo because the system prompt framed setup
work as the default path.

This updates chatd prompt guidance and the `create_workspace` tool
description so agents reuse existing chat and workspace context, treat
injected workspace context as already read, avoid recloning present
repositories, and create or start workspaces only when workspace-backed
work is required. Delegated chats now report workspace needs to the
parent instead of trying to create one.

> Mux opened this PR on behalf of Mike.
2026-05-22 14:08:07 +00:00
Ethan de9cdca77e fix(coderd): handle external-agent workspaces honestly in chat (#24969)
## Summary

Make Coder's chat agent honest about workspaces that use
`coder_external_agent`. Three behaviors change so the chat stops
pretending it can drive an external workspace through to a usable state
on its own.

<img width="859" height="537" alt="image"
src="https://github.com/user-attachments/assets/0561442b-95f1-4a2d-853c-7e3776114680"
/>


## Problem

External agents are not started by Coder. The user has to run `coder
agent` on their own host with a token Coder generates. Before this
change, the chat agent treated those workspaces like any other:

- `create_workspace` would enqueue a build for an external-agent
template and then wait minutes (~22 worst case) for an agent that was
never going to come up.
- When mid-turn tool calls dialed an external agent that was not
connected, the chat burned the full 30-second dial timeout and returned
generic "the workspace may need to be restarted from the Coder
dashboard" guidance, which is not the action the user can take.
- Nothing told the chat (or the user, through the chat) that the next
action lives outside Coder.

## Fix

Three changes scoped to `coderd/x/chatd/`:

1. **`create_workspace` blocks templates with external agents.** The
tool reads `template_versions.has_external_agent` for the template's
active version and refuses external-agent templates with a message
instructing the chat to pick a different template, or to have the user
create and start the workspace themselves and then attach it.

2. **Attaching an existing external workspace stays open.** No
selection-time gate on attachment; users can still bind a working
external workspace to a chat.

3. **External-agent-aware error handling on connection.** Two
complementary changes both predicated on proven connectivity failures
rather than every dial error:

- **`getWorkspaceConn` preflight and timeout handling.** Before opening
a connection, the cache-miss path reads the agent's status from the
already-loaded row. If the selected agent is external and clearly
offline according to the existing `isAgentUnreachable` helper
(`Disconnected` or `Timeout`, never `Connecting`), it returns an
external-agent-specific error immediately instead of waiting out the
30-second dial timeout. `Connecting` external agents fall through to the
dial so a user who just started the agent on their host can still
succeed in the same turn. The preflight only fires when the agent is
still the latest selected agent for the workspace, so stale-binding
recovery via `dialWithLazyValidation` is unaffected. The post-dial
rewrite is limited to the dial timeout sentinel; stale/no-agent bindings
and non-timeout dial failures preserve their original errors.

- **`waitForAgentReady` timeout-branch rewrite.** The 2-minute retry
loop used by `create_workspace` and `start_workspace` runs unchanged for
all agents. When the loop's outer deadline elapses, the timeout branch
substitutes the external-agent message in place of the raw dial error if
the agent belongs to an external resource.

This applies the same pattern that the cache-hit path of
`getWorkspaceConn` already used (`isAgentUnreachable` returning
`errChatAgentDisconnected`), extended to the cache-miss path and to the
readiness helper, with the external-agent-aware error rewrite layered
only on confirmed offline or timeout paths.

Closes CODAGT-314
2026-05-08 13:51:13 +10:00
Ethan ef0151601e feat: report insufficient quota build failures in chat tools (#24956)
## Summary

When a workspace build fails because the user is over their group quota,
the chat tools currently surface the failure as a bare `"workspace build
failed: insufficient quota"` string with no machine-readable error code
and no visibility into the user's current usage. Agents and the UI
cannot distinguish quota failures from any other Terraform error, so
users see an opaque message and have no clear path to recovery.

This PR tags quota failures with a typed error code at the source and
propagates it through the chat tool layer so callers can react to it
explicitly.

Relates to CODAGT-20

## Changes

**Provisioner runner**

- Add `InsufficientQuotaErrorCode = "INSUFFICIENT_QUOTA"` and set it
explicitly at the `commitQuota` failure site via a new
`failedWorkspaceBuildfCode` helper, so `provisioner_jobs.error_code` is
populated only on the genuine quota path. The substring matcher used for
externally produced sentinels (e.g. `"missing parameter"`, `"required
template variables"`) is intentionally not extended; provider errors
that happen to mention "insufficient quota" stay classified as generic
build failures.

**SDK and API contract**

- Add `JobErrorCodeInsufficientQuota` and a
`JobIsInsufficientQuotaErrorCode` helper to `codersdk`.
- Extend the swagger `enums` tag on `ProvisionerJob.ErrorCode` to
include `INSUFFICIENT_QUOTA`.
- Regenerate `coderd/apidoc`, `docs/reference/api/*`, and
`site/src/api/typesGenerated.ts`.

**chattool create_workspace / start_workspace**

- `waitForBuild` now returns a typed `*workspaceBuildError` carrying
both the message and the `JobErrorCode`, instead of a bare error string.
- New `quotaerror.go` introduces a structured `quotaErrorResult` (with
`error_code`, `title`, `message`, `build_id`, and optional `quota`) and
a best-effort `workspaceQuotaDetails` lookup that wraps owner
authorization internally and fetches `credits_consumed` and `budget`
from the database. Quota lookup failures (including authorization
failures) never block the failure payload.
- On quota-coded build failures, both `create_workspace` and
`start_workspace` now return the structured response (with the recovery
guidance inlined into `message`) instead of the bare `"insufficient
quota"` string. This applies to all three failure paths: post-creation,
an in-progress existing build, and a freshly triggered start build.
Non-quota build failures continue to use the existing
`buildToolResponse` / `newBuildError` path.
- Owner authorization is wrapped only on the call sites that need it
(the `CreateFn` and `StartFn` invocations and the quota-detail lookup),
so idempotent fast paths (already running, already in progress,
existing-workspace early returns) do not pay for an extra RBAC
round-trip or fail when role lookup is transient.

## Out of scope

- No changes to quota math, allowances, or bypass behavior.
- No automatic retries.
- No new quota-inspection tools and no changes to MCP
`coder_create_workspace` (which returns immediately and never observed
the build outcome here).
- No frontend UI changes; those will land in a follow-up PR that
consumes the new `INSUFFICIENT_QUOTA` code.
2026-05-07 15:01:58 +10:00
Cian Johnston a74015fc85 refactor: make store and chatID explicit parameter arguments in chattools (#24850)
Fixes CODAGT-175

Addresses a review finding in https://github.com/coder/coder/pull/23827
that the nil-guards for both `database.Store` and `chatID` are both dead
code in practice in the `chattool` package.

- Modifies the return signatures require passing both `database.Store`
and `chatID` explicitly as positional arguments instead of just
parameter struct keys.
- Drops the nil-guards for `database.Store` and `chatID`.
2026-05-06 11:05:16 +01:00
Cian Johnston 04cc983833 fix: add preset support to MCP tools (#24694)
The chat tools (`read_template`, `create_workspace`) did not surface or
respect template version presets. Presets were invisible to the LLM and
preset parameter defaults were never applied at workspace creation. The
`toolsdk` MCP surface had the same gap (ref #24695, now subsumed here).

## What this changes

- **`read_template`** returns presets with `id`, `name`, `default`,
`description`, `icon`, `parameters`, and `desired_prebuild_instances`
(when set), so the LLM can pick the right preset and prefer
prebuilt-backed ones.
- **`create_workspace`** accepts a `preset_id`. The wsbuilder applies
preset parameter defaults and may claim a prebuilt workspace.
- **`start_workspace`** does *not* accept a preset. Presets are a
creation-time choice; subsequent starts use the workspace's existing
version and parameters. Users who need a specific preset or version on
an existing chat can create the workspace out-of-band (CLI / UI / API)
with the desired configuration and attach the chat to it.
- **`toolsdk`** gains `GetTemplate` (with presets including
`desired_prebuild_instances`), preset support on `CreateWorkspace`, and
preset + `rich_parameters` support on `CreateWorkspaceBuild`. The
`template_version_preset_id` description warns about preset/version
affinity.


> 🤖 Generated with [Coder Agents](https://coder.com/agents) and reviewed by a human.

Co-authored-by: Max schwenk <maschwenk@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:57:52 +01:00
blinkagent[bot] 79a9f437d7 feat(coderd/x/chatd/chattool): add description tags to tool parameter structs (#24394) 2026-04-21 11:37:29 -07:00
Ethan 1203f625b7 feat(coderd): accept parameters in start_workspace tool (#24434)
When the chat `start_workspace` tool triggers an active-version upgrade
that introduces new required parameters, the build fails with a
parameter validation error. Previously this returned a message telling
the user to update from the UI — a dead end for the model.

This PR lets the model recover inside the chat by:

1. Accepting an optional `parameters` map on `start_workspace` (same
schema as `create_workspace`), forwarded as `RichParameterValues`.
2. Returning structured JSON error responses that preserve validation
details and the workspace's `template_id`, so the model can call
`read_template` to discover what changed.
3. Replacing the UI-only guidance in `exp_chats.go` with
model-actionable retry instructions.

The expected model flow on an active-version parameter failure is now:

```
start_workspace → fails (structured error with template_id + validations)
read_template   → discovers new required parameters
start_workspace → retries with parameters map → workspace starts
```
<img width="846" height="511" alt="image"
src="https://github.com/user-attachments/assets/d18b6864-5970-4225-8da0-0f2ab134ccb4"
/>
2026-04-21 11:36:20 +10:00
Kyle Carberry 9c74c8c674 fix: move OnChatUpdated call after agent is ready in create/start workspace (#24410) 2026-04-15 19:18:54 -04:00
Kyle Carberry d11849d94a fix: re-fetch context files and skills from workspace on each turn (#24360)
Context files (AGENTS.md) and skills were only fetched from the
workspace on the first turn or when the agent changed. On subsequent
turns, stale content from persisted messages was used. This meant that
if AGENTS.md or skills were modified on the workspace between turns, the
agent wouldn't see the changes until the user created a new chat.

## Changes

- Extract `fetchWorkspaceContext` from `persistInstructionFiles` to
allow fetching workspace context without persisting
- On subsequent turns, re-fetch fresh context from the workspace instead
of reading stale persisted content; falls back to persisted messages if
the workspace dial fails
- Update `ReloadMessages` callback to re-derive instruction and skills
from reloaded database messages after compaction, instead of using
captured closure variables
- Add `formatSystemInstructionsFromParts` helper to build system
instructions directly from agent parts without requiring separate
OS/directory params
- Add tests for the new helper

<details><summary>Implementation Notes</summary>

### Root cause

In `runChat`, the `else if hasContextFiles` branch (subsequent turns)
called `instructionFromContextFiles(messages)` which read stale content
from persisted DB messages. The `ReloadMessages` callback
(post-compaction) also used captured `instruction`/`skills` closure
variables from the start of the turn, never re-deriving them.

### Approach

1. **Extract `fetchWorkspaceContext`** — Pure refactor of the fetch-only
part of `persistInstructionFiles` (agent connection, context config
retrieval, content sanitization, metadata stamping). Returns parts +
skills without persisting.

2. **Subsequent turns**: Instead of reading from persisted messages,
launch a `g2` goroutine that calls `fetchWorkspaceContext` to get fresh
context from the workspace. Falls back gracefully to persisted messages
if the workspace is unreachable.

3. **ReloadMessages**: Re-derive `instruction` from
`instructionFromContextFiles(reloadedMsgs)` and `skills` from
`skillsFromParts(reloadedMsgs)` using the freshly loaded messages, with
fallback to captured values if the reloaded messages don't contain
context (e.g. compacted away).

</details>

> 🤖 Generated by Coder Agents
2026-04-15 16:41:15 -04:00
Cian Johnston 6194bd6f57 fix: address post-merge review findings for chat org scoping (#24297)
Addresses review findings from #23827 that were added post-merge:

- Persisted attachments now store `organizationId`; mismatched orgs
pruned on restore
- Workspace selection reconciliation: stale IDs from previous orgs
dropped via derived `effectiveWorkspaceId`
- Org picker uses `permittedOrganizations()` for RBAC-aware filtering
- Org picker hidden when user belongs to only one org
- Ref-sync `useEffect` replaced with `useEffectEvent`
- `CreateWorkspace()` and `ListTemplates()` take `organizationID` and
`db` as required function parameters instead of optional struct fields —
compiler enforces them, removes scattered nil guards
- Cross-org template check in `CreateWorkspace` is now unconditional
- `ListTemplates` org-scoping filter now has test coverage
- `setupChatInfra` comment fixed; test helpers use params structs
instead of positional UUIDs
- Enterprise test documents that org admin only sees own chats (handler
hardcodes `OwnerID` — future work needs sidebar UI before lifting that
restriction)

> 🤖
2026-04-15 11:39:05 +01:00
Callum Styan 730edba87a fix: fix false positive disconnected agent metric reporting (#24225)
We noticed during higher active workspace counts that the agent
connection metric, generated via a query to the database, would report a
relatively high amount of agents as disconnected. Somewhere between 5
and 20%. However, other metrics such as # of websocket connections would
suggest that all agent connections are healthy.

Looking at the `Agents` function in prometheus metrics, plus the query
execution time (not accounting for actual database RT time) revealed
that this reporting of agents as disconnected was almost certainly false
positives due to clock drift in the way we're generating the metric
values. At 10k metrics, with a p50 of 2ms and p99 of 5ms, the entire
`agents` function could take upwards of 50s to execute. Because we were
doing a query/database RT to query th apps for each agent individually,
and grabbing a `time.Now` value on each iteration of that loop, it's
likely the portion of agents that were reported as disconnected were
those that had last heartbeat the furthest in the past.

The fix here is to set a consistent `now` before fetching agent data to
avoid clock drift inflating the inactive timeout comparison, and replace
the per-agent app query N+1 with a single batched lookup to prevent loop
execution time from pushing agents over the disconnected threshold.

Signed-off-by: Callum Styan <callumstyan@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 22:23:06 -07:00
Cian Johnston 22062ec52e feat: add organization scoping to chats (#23827)
Fixes https://github.com/coder/internal/issues/1436

* Adds organization_id to chats with backfill (workspace org → user org membership → default org)
* No support yet for ACLs (follow-up issue)
- Cross-org workspace binding rejected (both in `CreateChatRequest` and in `create_workspace` tool
- Adds `OrganizationAutocomplete` to `AgentCreateForm`
- Docs updated with `organization_id` in chats-api.md

> 🤖 Written by a Coder Agent. Reviewed by many humans and many agents.

---------

Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
2026-04-13 12:31:25 +01:00
Danielle Maywood cb0b84a2d3 feat: show build logs in chat for start_workspace and create_workspace tools (#24194) 2026-04-12 15:04:10 +01:00
Michael Suchacz 73f6cd8169 feat: suffix-based chat agent selection (#23741)
Adds suffix-based agent selection for chatd. Template authors can direct
chat traffic to a specific root workspace agent by naming it with the
`-coderd-chat` suffix (for example, `coder_agent "dev-coderd-chat"`).
When no suffix match exists, chatd falls back to the first root agent by
`DisplayOrder`, then `Name`. Multiple suffix matches return an error.

The selection logic lives in `coderd/x/chatd/internal/agentselect` and
is shared by chatd core plus the workspace chat tools so all chat entry
points pick the same agent deterministically.

No database migrations, API contract changes, or provider changes. The
experimental sandbox template was split out to #23777.
2026-03-30 11:43:59 +00:00
Ethan 61e31ec5cc perf(coderd/x/chatd): persist workspace agent binding across chat turns (#23274)
## Summary

This change removes the steady-state "resolve the latest workspace
agent" query from chat execution.

Instead of asking the database for the latest build's agent on every
turn, a chat now persists the workspace/build/agent binding it actually
uses and reuses that binding across subsequent turns. The common path
becomes "load the bound agent by ID and dial it", with fallback paths to
repair the binding when it is missing, stale, or intentionally changed.

## What changes

- add `workspace_id`, `build_id`, and `agent_id` binding fields to
`chats`
- expose those fields through the chat API / SDK so the execution
context is explicit
- load the persisted binding first in chatd, instead of always resolving
the latest build's agent
- persist a refreshed binding when chatd has to re-resolve the workspace
agent
- keep child / subagent chats on the same bound workspace context by
inheriting the parent binding
- leave `build_id` / `agent_id` unset for flows like `create_workspace`,
then bind them lazily on the next agent-backed turn

## Runtime behavior

The binding is treated as an optimistic cache of the agent a chat should
use:

- if the bound agent still exists and dials successfully, we use it
without a latest-build lookup
- if the bound agent is missing or no longer reachable, chatd
re-resolves against the latest build and persists the new binding
- if a workspace mutation changes the chat's target workspace, the
binding is updated as part of that mutation

To avoid reintroducing a hot-path query, dialing uses lazy validation:

- start dialing the cached agent immediately
- only validate against the latest build if the dial is still pending
after a short delay
- if validation finds a different agent, cancel the stale dial, switch
to the current agent, and persist the repaired binding

## Result

The hot path stops issuing
`GetWorkspaceAgentsInLatestBuildByWorkspaceID` for every user message,
which is the source of the DB pressure this PR is addressing. At the
same time, chats still converge to the correct workspace agent when the
binding becomes stale due to rebuilds or explicit workspace changes.
2026-03-26 17:22:38 +11:00
Cian Johnston 796872f4de feat: add deployment-wide template allowlist for chats (#23262)
- Stores a deployment-wide agents template allowlist in `site_configs`
(`agents_template_allowlist`)
- Adds `GET/PUT /api/experimental/chats/config/template-allowlist`
endpoints
- Filters `list_templates`, `read_template`, and `create_workspace` chat
tools by allowlist, if defined (empty=all allowed)
- Add "Templates" admin settings tab in Agents UI ([what it looks
like](https://624de63c6aacee003aa84340-sitjilsyrr.chromatic.com/?path=/story/pages-agentspage-agentsettingspageview--template-allowlist))

> 🤖 This PR was created with the help of Coder Agents, and has been
reviewed by my human. 🧑‍💻
2026-03-25 15:19:17 +00:00
Ethan c0a323a751 fix(coderd): use DB liveness for chat workspace reuse (#23551)
create_workspace could create a replacement workspace after a single 5s
agent dial failed, even when the existing workspace agent had recently
checked in. That made temporary reachability blips look like dead
workspaces and let chatd replace a running workspace too aggressively.

Use the workspace agent's DB-backed status with the deployment's
AgentInactiveDisconnectTimeout before allowing replacement. Recently
connected and still-connecting agents now reuse the existing workspace,
while disconnected or timed-out agents still allow a new workspace. This
also threads the inactivity timeout through chatd and adds focused
coverage for the reuse and replacement branches.
2026-03-26 00:12:05 +11:00
Cian Johnston 80a172f932 chore: move chatd and related packages to /x/ subpackage (#23445)
- Moves `coderd/chatd/`, `coderd/gitsync/`, `enterprise/coderd/chatd/`
under `x/` parent directories to signal instability
- Adds `Experimental:` glue code comments in `coderd/coderd.go`

> 🤖 This PR was created with the help of Coder Agents, and was
reviewed by my human. 🧑‍💻
2026-03-23 17:34:43 +00:00