coder

mirror of https://github.com/coder/coder.git synced 2026-06-06 14:38:23 +00:00

Author	SHA1	Message	Date
Ethan	bd6cc1aaf2	feat(coderd): add stop_workspace chatd tool and recovery classification (#24997)

## Summary

Adds a `stop_workspace` tool to chatd so the model can recover from the "workspace running but agent dead" failure mode (e.g. an OOM that leaves the workspace running but the agent unreachable) by stopping and then starting the workspace.

## What changed

- New `stop_workspace` chatd tool (`coderd/x/chatd/chattool/stopworkspace.go`). Mirrors `start_workspace`: shares `WorkspaceMu` to serialize with create/start, waits for any in-progress build before issuing a stop, and is idempotent only after a successful Stop transition. Failed stop builds re-attempt rather than reporting success.
- New `chatStopWorkspace` coderd hook (`coderd/exp_chats.go`). Mirrors `chatStartWorkspace` minus the `RequireActiveVersion` gate. Stop should not be blocked by template version policy.
- Differentiated recovery sentinels (`coderd/x/chatd/chatd.go`). `errChatAgentDisconnected` instructs the model to call `stop_workspace` then `start_workspace`. `errChatDialTimeout` instructs a single retry, then user escalation if it repeats. The previous single message conflated transient and persistent failures.
- Two-signal recovery gate. Recovery is only surfaced when a tool call times out and a fresh DB read of the latest workspace agent says `Disconnected`. The previous draft escalated on the DB read alone, which would fire on a 30-second heartbeat blip (e.g. agent respawn) and prompt a destructive stop/start unnecessarily.
- Cache-hit disconnected handling now clears the cache and retries a fresh dial before escalating, rather than returning the recovery sentinel immediately.
- Latest-agent classification uses `GetWorkspaceAgentsInLatestBuildByWorkspaceID` instead of the chat's bound `AgentID`, so stale bindings after a rebuild don't misclassify.
- Shared chattool helpers in `coderd/x/chatd/chattool/chattool.go`: `latestWorkspaceBuildAndJob`, `publishBuildBinding`, `provisionerJobTerminal`. Applied to both `start_workspace` and `stop_workspace`.

## Notes

- Reverts an earlier draft that widened `ask_user_question` to root standard turns. Plan-mode-only behavior is restored.
- The `stop_workspace` tool currently renders via the generic chat tool-call UI. A follow-up frontend PR will prettify the `stop_workspace` tool and style it like the `start_workspace` tool.
- Never-connected (`Timeout` status) agents are intentionally excluded from recovery. They indicate template or startup failure, not the running-but-dead case this PR targets.

Closes CODAGT-315	2026-05-11 16:23:07 +10:00
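To illustrate the two-signal recovery gate described in #24997, here is a minimal Go sketch. Only the sentinel names `errChatAgentDisconnected` and `errChatDialTimeout` come from the commit message; the error strings, `agentStatus`, `errToolFailed`, and `classifyToolFailure` are hypothetical names invented for this example. The mapping of "timed out but not `Disconnected`" to `errChatDialTimeout` is an assumption, not something the commit states.

```go
package main

import "errors"

// The two sentinel names come from the commit message; the message strings
// are illustrative only.
var (
	errChatAgentDisconnected = errors.New("agent disconnected: call stop_workspace, then start_workspace")
	errChatDialTimeout       = errors.New("dial timed out: retry once, then escalate to the user")
	// errToolFailed is a hypothetical stand-in for any other tool-call error.
	errToolFailed = errors.New("tool call failed")
)

// agentStatus is a hypothetical stand-in for the status read from the database.
type agentStatus string

const (
	statusConnected    agentStatus = "connected"
	statusDisconnected agentStatus = "disconnected"
	statusTimeout      agentStatus = "timeout" // agent never connected
)

// classifyToolFailure (hypothetical) applies the two-signal gate: the
// stop/start recovery sentinel is returned only when the tool call timed out
// AND the freshly read status of the latest agent is Disconnected.
func classifyToolFailure(callErr error, timedOut bool, latest agentStatus) error {
	switch {
	case callErr == nil:
		return nil
	case !timedOut:
		// A Disconnected DB status alone (e.g. a heartbeat blip) must not
		// trigger a destructive stop/start.
		return callErr
	case latest == statusDisconnected:
		return errChatAgentDisconnected
	default:
		// Assumption: every other timeout, including never-connected
		// (Timeout status) agents, gets the retry-once guidance.
		return errChatDialTimeout
	}
}
```

Requiring both signals keeps a transient status flip from prompting the model to restart a healthy workspace, which is the failure the commit attributes to the earlier draft.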
Ethan	de9cdca77e	fix(coderd): handle external-agent workspaces honestly in chat (#24969)

## Summary

Make Coder's chat agent honest about workspaces that use `coder_external_agent`. Three behaviors change so the chat stops pretending it can drive an external workspace through to a usable state on its own.

## Problem

External agents are not started by Coder. The user has to run `coder agent` on their own host with a token Coder generates. Before this change, the chat agent treated those workspaces like any other:

- `create_workspace` would enqueue a build for an external-agent template and then wait minutes (~22 worst case) for an agent that was never going to come up.
- When mid-turn tool calls dialed an external agent that was not connected, the chat burned the full 30-second dial timeout and returned generic "the workspace may need to be restarted from the Coder dashboard" guidance, which is not the action the user can take.
- Nothing told the chat (or the user, through the chat) that the next action lives outside Coder.

## Fix

Three changes scoped to `coderd/x/chatd/`:

1. `create_workspace` blocks templates with external agents. The tool reads `template_versions.has_external_agent` for the template's active version and refuses external-agent templates with a message instructing the chat to pick a different template, or to have the user create and start the workspace themselves and then attach it.
2. Attaching an existing external workspace stays open. No selection-time gate on attachment; users can still bind a working external workspace to a chat.
3. External-agent-aware error handling on connection. Two complementary changes both predicated on proven connectivity failures rather than every dial error:
   - `getWorkspaceConn` preflight and timeout handling. Before opening a connection, the cache-miss path reads the agent's status from the already-loaded row. If the selected agent is external and clearly offline according to the existing `isAgentUnreachable` helper (`Disconnected` or `Timeout`, never `Connecting`), it returns an external-agent-specific error immediately instead of waiting out the 30-second dial timeout. `Connecting` external agents fall through to the dial so a user who just started the agent on their host can still succeed in the same turn. The preflight only fires when the agent is still the latest selected agent for the workspace, so stale-binding recovery via `dialWithLazyValidation` is unaffected. The post-dial rewrite is limited to the dial timeout sentinel; stale/no-agent bindings and non-timeout dial failures preserve their original errors.
   - `waitForAgentReady` timeout-branch rewrite. The 2-minute retry loop used by `create_workspace` and `start_workspace` runs unchanged for all agents. When the loop's outer deadline elapses, the timeout branch substitutes the external-agent message in place of the raw dial error if the agent belongs to an external resource.

This applies the same pattern that the cache-hit path of `getWorkspaceConn` already used (`isAgentUnreachable` returning `errChatAgentDisconnected`), extended to the cache-miss path and to the readiness helper, with the external-agent-aware error rewrite layered only on confirmed offline or timeout paths.

Closes CODAGT-314	2026-05-08 13:51:13 +10:00
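The preflight rule from #24969 can be sketched in a few lines of Go. This is a simplified illustration, not the code in `getWorkspaceConn`: `isAgentUnreachable` is named in the commit message, but its signature here is assumed, and `agentStatus`, `errExternalAgentOffline`, and `preflightExternalAgent` are hypothetical names.

```go
package main

import "errors"

// agentStatus is a hypothetical stand-in for the agent status on the loaded row.
type agentStatus string

const (
	statusConnecting   agentStatus = "connecting"
	statusConnected    agentStatus = "connected"
	statusDisconnected agentStatus = "disconnected"
	statusTimeout      agentStatus = "timeout"
)

// errExternalAgentOffline is a hypothetical sentinel for the
// external-agent-specific guidance.
var errExternalAgentOffline = errors.New("external agent is offline: the user must run `coder agent` on their own host")

// isAgentUnreachable follows the rule quoted in the commit message:
// Disconnected or Timeout, never Connecting. The signature is assumed.
func isAgentUnreachable(s agentStatus) bool {
	return s == statusDisconnected || s == statusTimeout
}

// preflightExternalAgent (hypothetical) fails fast only for an external agent
// that is still the latest selected agent and is clearly offline. A nil
// result means "proceed to the normal dial".
func preflightExternalAgent(isExternal, isLatestAgent bool, s agentStatus) error {
	if isExternal && isLatestAgent && isAgentUnreachable(s) {
		return errExternalAgentOffline
	}
	return nil
}
```

Letting `Connecting` fall through to the dial is what allows a user who has just started the agent on their host to succeed within the same turn.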
Ethan	ef0151601e	feat: report insufficient quota build failures in chat tools (#24956)

## Summary

When a workspace build fails because the user is over their group quota, the chat tools currently surface the failure as a bare `"workspace build failed: insufficient quota"` string with no machine-readable error code and no visibility into the user's current usage. Agents and the UI cannot distinguish quota failures from any other Terraform error, so users see an opaque message and have no clear path to recovery.

This PR tags quota failures with a typed error code at the source and propagates it through the chat tool layer so callers can react to it explicitly.

Relates to CODAGT-20

## Changes

Provisioner runner

- Add `InsufficientQuotaErrorCode = "INSUFFICIENT_QUOTA"` and set it explicitly at the `commitQuota` failure site via a new `failedWorkspaceBuildfCode` helper, so `provisioner_jobs.error_code` is populated only on the genuine quota path. The substring matcher used for externally produced sentinels (e.g. `"missing parameter"`, `"required template variables"`) is intentionally not extended; provider errors that happen to mention "insufficient quota" stay classified as generic build failures.

SDK and API contract

- Add `JobErrorCodeInsufficientQuota` and a `JobIsInsufficientQuotaErrorCode` helper to `codersdk`.
- Extend the swagger `enums` tag on `ProvisionerJob.ErrorCode` to include `INSUFFICIENT_QUOTA`.
- Regenerate `coderd/apidoc`, `docs/reference/api/`, and `site/src/api/typesGenerated.ts`.

chattool create_workspace / start_workspace

- `waitForBuild` now returns a typed `*workspaceBuildError` carrying both the message and the `JobErrorCode`, instead of a bare error string.
- New `quotaerror.go` introduces a structured `quotaErrorResult` (with `error_code`, `title`, `message`, `build_id`, and optional `quota`) and a best-effort `workspaceQuotaDetails` lookup that wraps owner authorization internally and fetches `credits_consumed` and `budget` from the database. Quota lookup failures (including authorization failures) never block the failure payload.
- On quota-coded build failures, both `create_workspace` and `start_workspace` now return the structured response (with the recovery guidance inlined into `message`) instead of the bare `"insufficient quota"` string. This applies to all three failure paths: post-creation, an in-progress existing build, and a freshly triggered start build. Non-quota build failures continue to use the existing `buildToolResponse` / `newBuildError` path.
- Owner authorization is wrapped only on the call sites that need it (the `CreateFn` and `StartFn` invocations and the quota-detail lookup), so idempotent fast paths (already running, already in progress, existing-workspace early returns) do not pay for an extra RBAC round-trip or fail when role lookup is transient.

## Out of scope

- No changes to quota math, allowances, or bypass behavior.
- No automatic retries.
- No new quota-inspection tools and no changes to MCP `coder_create_workspace` (which returns immediately and never observed the build outcome here).
- No frontend UI changes; those will land in a follow-up PR that consumes the new `INSUFFICIENT_QUOTA` code.	2026-05-07 15:01:58 +10:00
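The typed-error approach in #24956 can be illustrated with a short Go sketch. The type and field names below follow the commit message (`workspaceBuildError`, `quotaErrorResult`, `error_code`, `title`, `message`, `build_id`, `quota`, `credits_consumed`, `budget`), but the Go field types, the `quotaDetails` struct, the `quotaToolResponse` helper, and the `"Insufficient quota"` title string are assumptions made for this example. The real tools also inline recovery guidance into `message`, which is omitted here.

```go
package main

import (
	"encoding/json"
	"errors"
)

// The code value comes from the commit message.
const insufficientQuotaErrorCode = "INSUFFICIENT_QUOTA"

// workspaceBuildError carries the message and the job error code, as the
// commit describes; the exact field names are assumed.
type workspaceBuildError struct {
	Message string
	Code    string
}

func (e *workspaceBuildError) Error() string {
	return "workspace build failed: " + e.Message
}

// quotaDetails is a hypothetical shape for the optional quota payload.
type quotaDetails struct {
	CreditsConsumed int `json:"credits_consumed"`
	Budget          int `json:"budget"`
}

// quotaErrorResult uses the JSON keys listed in the commit message.
type quotaErrorResult struct {
	ErrorCode string        `json:"error_code"`
	Title     string        `json:"title"`
	Message   string        `json:"message"`
	BuildID   string        `json:"build_id"`
	Quota     *quotaDetails `json:"quota,omitempty"`
}

// quotaToolResponse (hypothetical) returns a structured payload only when the
// error carries the quota code. It never inspects the message text, so a
// provider error that merely mentions "insufficient quota" is not matched.
// A nil quota models a failed best-effort lookup: the payload is still built.
func quotaToolResponse(err error, buildID string, quota *quotaDetails) (string, bool) {
	var buildErr *workspaceBuildError
	if !errors.As(err, &buildErr) || buildErr.Code != insufficientQuotaErrorCode {
		return "", false
	}
	out, marshalErr := json.Marshal(quotaErrorResult{
		ErrorCode: buildErr.Code,
		Title:     "Insufficient quota",
		Message:   buildErr.Message,
		BuildID:   buildID,
		Quota:     quota,
	})
	if marshalErr != nil {
		return "", false
	}
	return string(out), true
}
```

Matching on the code rather than on a substring is the design choice the commit calls out: classification happens once at the failure site, and every downstream layer only compares codes.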
Cian Johnston	a74015fc85	refactor: make store and chatID explicit parameter arguments in chattools (#24850)

Fixes CODAGT-175

Addresses a review finding in https://github.com/coder/coder/pull/23827 that the nil-guards for `database.Store` and `chatID` in the `chattool` package are dead code in practice.

- Modifies the signatures to require passing both `database.Store` and `chatID` explicitly as positional arguments instead of as fields of a parameter struct.
- Drops the nil-guards for `database.Store` and `chatID`.	2026-05-06 11:05:16 +01:00
Cian Johnston	04cc983833	fix: add preset support to MCP tools (#24694)

The chat tools (`read_template`, `create_workspace`) did not surface or respect template version presets. Presets were invisible to the LLM and preset parameter defaults were never applied at workspace creation. The `toolsdk` MCP surface had the same gap (ref #24695, now subsumed here).

## What this changes

- `read_template` returns presets with `id`, `name`, `default`, `description`, `icon`, `parameters`, and `desired_prebuild_instances` (when set), so the LLM can pick the right preset and prefer prebuilt-backed ones.
- `create_workspace` accepts a `preset_id`. The wsbuilder applies preset parameter defaults and may claim a prebuilt workspace.
- `start_workspace` does not accept a preset. Presets are a creation-time choice; subsequent starts use the workspace's existing version and parameters. Users who need a specific preset or version on an existing chat can create the workspace out-of-band (CLI / UI / API) with the desired configuration and attach the chat to it.
- `toolsdk` gains `GetTemplate` (with presets including `desired_prebuild_instances`), preset support on `CreateWorkspace`, and preset + `rich_parameters` support on `CreateWorkspaceBuild`. The `template_version_preset_id` description warns about preset/version affinity.

> 🤖 Generated with [Coder Agents](https://coder.com/agents) and reviewed by a human.

Co-authored-by: Max schwenk <maschwenk@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 10:57:52 +01:00
Ethan	1203f625b7	feat(coderd): accept parameters in start_workspace tool (#24434)

When the chat `start_workspace` tool triggers an active-version upgrade that introduces new required parameters, the build fails with a parameter validation error. Previously this returned a message telling the user to update from the UI, a dead end for the model.

This PR lets the model recover inside the chat by:

1. Accepting an optional `parameters` map on `start_workspace` (same schema as `create_workspace`), forwarded as `RichParameterValues`.
2. Returning structured JSON error responses that preserve validation details and the workspace's `template_id`, so the model can call `read_template` to discover what changed.
3. Replacing the UI-only guidance in `exp_chats.go` with model-actionable retry instructions.

The expected model flow on an active-version parameter failure is now:

```
start_workspace → fails (structured error with template_id + validations)
read_template   → discovers new required parameters
start_workspace → retries with parameters map → workspace starts
```
	2026-04-21 11:36:20 +10:00
Ethan	91b35a25ee	fix(coderd): auto-update workspace to active template version on chat start (#24424)

## Problem

When a template has `require_active_version` enabled and the chat agent tries to start a workspace that is stopped on an older template version, the agent gets stuck in an infinite loop: `start_workspace` fails with a 403 (the old version is not the active version and the user lacks `ActionUpdate` on the template), then `create_workspace` sees the existing stopped workspace and tells the agent to use `start_workspace`, repeat forever.

The root cause is that `chatStartWorkspace()` passes the start build request through without setting `TemplateVersionID`, so `wsbuilder` defaults to the previous build's template version, which RBAC rejects when `RequireActiveVersion` is true.

## Fix

In `chatStartWorkspace()` (`coderd/exp_chats.go`), when the template's access control has `RequireActiveVersion` enabled, explicitly set `req.TemplateVersionID` to `template.ActiveVersionID` before calling `postWorkspaceBuildsInternal()`. This mirrors how the autobuild executor handles the same scenario (`coderd/autobuild/lifecycle_executor.go`).

If the new active version introduces required parameters that cannot be resolved automatically (no defaults, no previous values), the build fails at parameter validation before a provisioner job is created. In that case, a clear error message tells the user to update and start the workspace from the UI instead of surfacing a raw internal error.

On successful auto-update, the tool response includes `updated_to_active_version`, `update_reason`, and a human-readable `message` so the model can explain to the user what happened.

### Changes

- `coderd/exp_chats.go`: `chatStartWorkspace()` loads the template, checks `RequireActiveVersion` via `AccessControlStore`, and pins the build to the active version when required. New `isChatStartWorkspaceManualUpdateRequiredError()` classifies parameter validation failures from both the dynamic parameters path (`DiagnosticError`) and the classic path (`ErrParameterValidation` sentinel).
- `coderd/wsbuilder/wsbuilder.go`: New `ErrParameterValidation` sentinel error, wrapped into the classic parameter validation `BuildError` so callers can use `errors.Is` instead of string matching.
- `coderd/x/chatd/chattool/startworkspace.go`: `waitForAgentAndRespond` now returns `map[string]any` instead of `fantasy.ToolResponse`, letting the caller annotate the result (e.g. auto-update metadata) before converting. Error handling for `StartFn` checks for `httperror.Responder` errors to surface clean messages for the manual-update case.
- `coderd/x/chatd/chattool/startworkspace_test.go`: Two new tests: `StoppedWorkspaceReportsAutoUpdate` (verifies auto-update fields in response) and `ManualUpdateRequired` (verifies clean error message without internal wrapping).

### Follow-up

The manual-update error message could include a direct link to the workspace settings page, but the chattool layer does not currently have access to the deployment's access URL. Plumbing it through is straightforward but out of scope for this fix.

Closes CODAGT-192	2026-04-17 00:16:37 +10:00
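The sentinel pattern mentioned for `coderd/wsbuilder/wsbuilder.go` in #24424 can be sketched as follows. `ErrParameterValidation` and `BuildError` are named in the commit message, but the `BuildError` fields shown here are a simplified assumption, and `validateRequired` and `needsManualUpdate` are hypothetical helpers that stand in for the real validation and for `isChatStartWorkspaceManualUpdateRequiredError()`.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrParameterValidation is the sentinel named in the commit message; the
// message string is illustrative.
var ErrParameterValidation = errors.New("parameter validation failed")

// BuildError is a simplified stand-in for the wsbuilder error type. The
// field set is an assumption made for this sketch.
type BuildError struct {
	Status  int
	Message string
	Wrapped error
}

func (e BuildError) Error() string { return e.Message }

// Unwrap exposes the wrapped sentinel so errors.Is can find it.
func (e BuildError) Unwrap() error { return e.Wrapped }

// validateRequired (hypothetical) rejects an empty value for a required
// parameter and wraps the sentinel in the returned BuildError.
func validateRequired(name, value string) error {
	if value == "" {
		return BuildError{
			Status:  400,
			Message: fmt.Sprintf("parameter %q is required", name),
			Wrapped: ErrParameterValidation,
		}
	}
	return nil
}

// needsManualUpdate (hypothetical) classifies by identity, not by matching
// text inside the error message.
func needsManualUpdate(err error) bool {
	return errors.Is(err, ErrParameterValidation)
}
```

Because the check goes through `errors.Is`, the user-facing message can be reworded later without breaking the classification.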
Kyle Carberry	9c74c8c674	fix: move OnChatUpdated call after agent is ready in create/start workspace (#24410)	2026-04-15 19:18:54 -04:00
Kyle Carberry	d11849d94a	fix: re-fetch context files and skills from workspace on each turn (#24360)

Context files (AGENTS.md) and skills were only fetched from the workspace on the first turn or when the agent changed. On subsequent turns, stale content from persisted messages was used. This meant that if AGENTS.md or skills were modified on the workspace between turns, the agent wouldn't see the changes until the user created a new chat.

## Changes

- Extract `fetchWorkspaceContext` from `persistInstructionFiles` to allow fetching workspace context without persisting
- On subsequent turns, re-fetch fresh context from the workspace instead of reading stale persisted content; falls back to persisted messages if the workspace dial fails
- Update `ReloadMessages` callback to re-derive instruction and skills from reloaded database messages after compaction, instead of using captured closure variables
- Add `formatSystemInstructionsFromParts` helper to build system instructions directly from agent parts without requiring separate OS/directory params
- Add tests for the new helper

<details><summary>Implementation Notes</summary>

### Root cause

In `runChat`, the `else if hasContextFiles` branch (subsequent turns) called `instructionFromContextFiles(messages)` which read stale content from persisted DB messages. The `ReloadMessages` callback (post-compaction) also used captured `instruction`/`skills` closure variables from the start of the turn, never re-deriving them.

### Approach

1. Extract `fetchWorkspaceContext`: pure refactor of the fetch-only part of `persistInstructionFiles` (agent connection, context config retrieval, content sanitization, metadata stamping). Returns parts + skills without persisting.
2. Subsequent turns: Instead of reading from persisted messages, launch a `g2` goroutine that calls `fetchWorkspaceContext` to get fresh context from the workspace. Falls back gracefully to persisted messages if the workspace is unreachable.
3. ReloadMessages: Re-derive `instruction` from `instructionFromContextFiles(reloadedMsgs)` and `skills` from `skillsFromParts(reloadedMsgs)` using the freshly loaded messages, with fallback to captured values if the reloaded messages don't contain context (e.g. compacted away).

</details>

> 🤖 Generated by Coder Agents	2026-04-15 16:41:15 -04:00
Danielle Maywood	cb0b84a2d3	feat: show build logs in chat for start_workspace and create_workspace tools (#24194)	2026-04-12 15:04:10 +01:00
Michael Suchacz	73f6cd8169	feat: suffix-based chat agent selection (#23741)

Adds suffix-based agent selection for chatd. Template authors can direct chat traffic to a specific root workspace agent by naming it with the `-coderd-chat` suffix (for example, `coder_agent "dev-coderd-chat"`). When no suffix match exists, chatd falls back to the first root agent by `DisplayOrder`, then `Name`. Multiple suffix matches return an error.

The selection logic lives in `coderd/x/chatd/internal/agentselect` and is shared by chatd core plus the workspace chat tools so all chat entry points pick the same agent deterministically.

No database migrations, API contract changes, or provider changes. The experimental sandbox template was split out to #23777.	2026-03-30 11:43:59 +00:00
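The selection rule from #23741 is compact enough to sketch in Go. This is an illustration of the rule as stated in the commit message, not the code in `coderd/x/chatd/internal/agentselect`: the `agent` struct, its `ParentID` field (used here to mark non-root agents), both error values, and `selectChatAgent` are hypothetical.

```go
package main

import (
	"errors"
	"sort"
	"strings"
)

// chatSuffix is the agent-name suffix given in the commit message.
const chatSuffix = "-coderd-chat"

// agent is a hypothetical, trimmed-down agent record. An empty ParentID
// marks a root agent in this sketch.
type agent struct {
	Name         string
	DisplayOrder int
	ParentID     string
}

// Hypothetical error values for the two failure cases.
var (
	errMultipleChatAgents = errors.New("multiple root agents match the chat suffix")
	errNoRootAgents       = errors.New("workspace has no root agents")
)

// selectChatAgent (hypothetical) picks the single root agent whose name ends
// with the suffix. With no suffix match it falls back to the first root
// agent ordered by DisplayOrder, then Name. More than one match is an error.
func selectChatAgent(agents []agent) (agent, error) {
	var roots, matches []agent
	for _, a := range agents {
		if a.ParentID != "" {
			continue // only root agents are candidates
		}
		roots = append(roots, a)
		if strings.HasSuffix(a.Name, chatSuffix) {
			matches = append(matches, a)
		}
	}
	switch {
	case len(matches) > 1:
		return agent{}, errMultipleChatAgents
	case len(matches) == 1:
		return matches[0], nil
	case len(roots) == 0:
		return agent{}, errNoRootAgents
	}
	sort.SliceStable(roots, func(i, j int) bool {
		if roots[i].DisplayOrder != roots[j].DisplayOrder {
			return roots[i].DisplayOrder < roots[j].DisplayOrder
		}
		return roots[i].Name < roots[j].Name
	})
	return roots[0], nil
}
```

Sorting on a fixed key pair instead of relying on input order is what makes the fallback deterministic across every caller that shares the helper.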
Cian Johnston	80a172f932	chore: move chatd and related packages to /x/ subpackage (#23445)

- Moves `coderd/chatd/`, `coderd/gitsync/`, `enterprise/coderd/chatd/` under `x/` parent directories to signal instability
- Adds `Experimental:` glue code comments in `coderd/coderd.go`

> 🤖 This PR was created with the help of Coder Agents, and was reviewed by my human. 🧑‍💻	2026-03-23 17:34:43 +00:00

12 Commits