The `mockEventSenderWrite` function in `newOneWayWriter()` wrote
WebSocket frame data to both the `net.Pipe` and the
`httptest.ResponseRecorder`. After `websocket.Accept()` calls
`WriteHeader(101)`, the recorder rejects body writes with `"response
status code does not allow body"`. When `HeartbeatClose` sends a ping,
the control frame flush routes through the recorder, producing an
ERROR-level log that `slogtest` catches as a test failure.
Removed the `recorder.Write(b)` call from the write function. The
recorder is only needed for header/status inspection; WebSocket frame
data should only go through the `net.Pipe`.
Closes https://github.com/coder/internal/issues/1521
> 🤖 Generated by Coder Agents
Part 1: Backend portion of a change broken into 2 PRs.
Part 2: #25077
Adds three new UserAppearanceSettings fields (theme_mode, theme_light,
theme_dark) on top of the existing theme_preference and terminal_font.
Replaces GetUserThemePreference and GetUserTerminalFont with a single
GetUserAppearanceSettings aggregate query. The PUT handler is wrapped in
db.InTx so sync-mode's mode + slot writes can never half-apply.
Mid-stream HTTP/2 peer resets from LLM providers can arrive after a 200
streaming response has already emitted provisional parts. Previously
those resets fell through as generic non-retryable errors because
`stream ID` messages did not match retryable transport signals, and
stream IDs could be misread as HTTP statuses.
Classify retryable HTTP/2 RST_STREAM codes as transient timeout
failures, ignore stream IDs during status extraction, and keep the
existing `retry` event as the rollback boundary for provisional message
parts so replacement attempts do not replay failed-attempt output.
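The classification described above can be sketched roughly as follows. This is a minimal illustration, not the chatd implementation: the function names, the regular expressions, and the particular set of codes treated as retryable are all assumptions. Only the `stream error: stream ID N; CODE` message shape comes from Go's HTTP/2 `StreamError` formatting.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// retryableRSTCodes is an ASSUMED set of HTTP/2 error codes treated as
// transient; the real list may differ.
var retryableRSTCodes = []string{"INTERNAL_ERROR", "REFUSED_STREAM", "CANCEL"}

// streamIDPattern matches the "stream ID 123" fragment of a stream error.
var streamIDPattern = regexp.MustCompile(`(?i)stream id \d+`)

// statusPattern matches a standalone three-digit HTTP status code.
var statusPattern = regexp.MustCompile(`\b([1-5]\d{2})\b`)

// isRetryableStreamReset reports whether msg looks like an HTTP/2 stream
// reset carrying one of the assumed retryable codes.
func isRetryableStreamReset(msg string) bool {
	if !strings.Contains(strings.ToLower(msg), "stream error") {
		return false
	}
	for _, code := range retryableRSTCodes {
		if strings.Contains(msg, code) {
			return true
		}
	}
	return false
}

// extractStatus returns an HTTP status found in msg. Stream IDs are removed
// first so "stream ID 429" is never mistaken for a 429 response.
func extractStatus(msg string) (int, bool) {
	cleaned := streamIDPattern.ReplaceAllString(msg, "")
	m := statusPattern.FindStringSubmatch(cleaned)
	if m == nil {
		return 0, false
	}
	status, err := strconv.Atoi(m[1])
	if err != nil {
		return 0, false
	}
	return status, true
}

func main() {
	reset := "stream error: stream ID 429; INTERNAL_ERROR; received from peer"
	fmt.Println(isRetryableStreamReset(reset))
	fmt.Println(extractStatus(reset))
	fmt.Println(extractStatus("unexpected status 503 from upstream"))
}
```

Stripping the stream ID before looking for a status keeps the two concerns independent: the reset is classified by its error code, and status extraction only ever sees real response codes.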
Closes CODAGT-382
> Mux working on behalf of Mike.
## Summary
- retune chatd subagent guidance to prefer `general` for substantial
delegated work, including read-only synthesis and planning support
- narrow `explore` guidance to repository-local code lookup and bounded
tracing
- add regression tests for planning, spawn tool, and Plan Mode guidance
text
## Tests
- `go test ./coderd/x/chatd -run
'Test(DefaultSystemPromptPlanningGuidance_SteersSubagentSelection|SpawnAgent_DescriptionSteersGeneralForSubstantialResearch|SpawnAgent_PlanModeDescriptionOmitsComputerUse|PlanningOverlaySubagentGuidance_UsesPlanModeSafeDescriptions|ExploreSubagentIsReadOnly)$'`
- `make lint`
- `make test TEST_PACKAGES=./coderd/x/chatd RUN=Guidance && make test
TEST_PACKAGES=./coderd/x/chatd RUN=Description`
- pre-commit hook during `git commit`
Adds `dynamicparameters.EvaluateSecretMismatch` as a shared helper on
top of the existing renderer, then wires it into the resolve-autostart
handler so the UI can surface unsatisfied `coder_secret` requirements in
a template alongside parameter mismatch for autostart.
The lifecycle executor changes will land in a follow-up that depends
on this helper. The UI changes that consume the new `secret_mismatch`
field are also a follow-up.
Generated with assistance from Coder Agents.
Azure IMDS attested data signatures can now chain through
Microsoft TLS G2 RSA CA OCSP intermediates, then through the
cross-signed Microsoft TLS RSA Root G2 certificate, before reaching
DigiCert Global Root G2.
coderd did not bundle the new G2 OCSP intermediates or the
cross-signed Microsoft TLS RSA Root G2 bridge certificate, so it could
fail to build a trusted chain for affected IMDS signatures.
Related to:
https://linear.app/codercom/issue/PLAT-205/bug-azure-instance-identity-verification-is-broken
Adds a `diff_url:` term to the `q` search parameter on `GET
/api/experimental/chats` so callers can look up the chat associated with
a particular pull request, merge request, or any other URL persisted on
the chat's diff status.
```
q=diff_url:"https://github.com/coder/coder/pull/123"
```
Matching is case-insensitive. When the URL lives on a delegated sub-agent's
diff status, the parent chat is returned so the relationship surfaces
from a single lookup.
<details>
<summary>Design notes</summary>
- **Forge-agnostic.** Reuses the existing `chat_diff_statuses.url`
column rather than introducing a `pr:` vocabulary, since the SDK already
documents the URL as "may point to a pull request or a branch page
depending on whether a PR has been opened." Works for GitHub PRs, GitLab
MRs, branch pages, etc.
- **Composes with `archived:`.** The two terms can be combined:
`q=archived:true diff_url:"..."`.
- **Case handling.** The parser used to lowercase the entire `q` string
up front, which would mangle URL path segments. Switched to lowercasing
only the field key inside `searchTerms` (already happens there) and
keeping the value as the caller typed it. The SQL comparison lowercases
on both sides.
- **Validation.** `diff_url` must be a syntactically valid HTTP(S) URL
with a non-empty host. No forge-specific validation.
- **Index.** Adds `idx_chat_diff_statuses_url_lower` on `LOWER(url)` so
the lookup is cheap even on large datasets.
- **Sub-agent fan-in.** `EXISTS` clause matches when the URL lives on
the chat itself or any chat with `root_chat_id` equal to the chat's id,
so a delegated sub-agent's PR pulls in its parent.
- **Deferred.** Sentinels like `pr:any` / `pr:none` and a forge-agnostic
state filter (`diff_state:open|merged|closed`) were intentionally left
out of this change. They couple cleanly to a second forge or a clearer
product call, and shipping them now would lock in vocabulary we may want
to revisit.
</details>
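As a rough sketch of the validation and case-handling rules in the design notes, the following assumes hypothetical helper names (`validateDiffURL`, `sameDiffURL`). The real parser lives in `coderd/searchquery` and the comparison happens in SQL, so this is illustrative only.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// validateDiffURL (hypothetical name) accepts only syntactically valid
// HTTP(S) URLs with a non-empty host and returns the value unchanged so
// the caller's casing is preserved.
func validateDiffURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", fmt.Errorf("malformed diff_url: %w", err)
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return "", errors.New("diff_url must use http or https")
	}
	if u.Hostname() == "" {
		return "", errors.New("diff_url must include a host")
	}
	return raw, nil
}

// sameDiffURL (hypothetical name) mimics a comparison that lowercases both
// sides, similar in spirit to comparing LOWER(url) values in SQL.
func sameDiffURL(stored, query string) bool {
	return strings.ToLower(stored) == strings.ToLower(query)
}

func main() {
	v, err := validateDiffURL("https://github.com/Coder/Coder/pull/123")
	fmt.Println(v, err)
	_, err = validateDiffURL("ftp://example.com/file")
	fmt.Println(err)
	fmt.Println(sameDiffURL("https://github.com/coder/coder/pull/123", v))
}
```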
## Tests
- `coderd/searchquery`: parser tests for valid URLs, case handling (key
insensitive, value preserved), composition with `archived:`, and
validation errors (non-HTTP scheme, missing host, malformed URL).
- `coderd/exp_chats_test.go`: end-to-end coverage hitting `ListChats`.
Verifies a root chat matches its own URL, a parent chat surfaces when
only a sub-agent has the URL, lookups are case-insensitive, non-matching
URLs return empty, and invalid URLs return `400`.
---
_This PR was authored by a Coder Agent on behalf of @kylecarbs._
Migrates Azure instance identity verification from
`go.mozilla.org/pkcs7` and `github.com/fullsailor/pkcs7` to
`github.com/smallstep/pkcs7`, using `VerifyWithChainAtTime` to validate
both the PKCS7 signature and the certificate chain in one call. The
previous code only verified the signer certificate against a set of
intermediates/roots but did not verify that the PKCS7 signature itself
covered the content, meaning tampered payloads could be accepted.
The `Options` struct is restructured to accept `Roots`, `Intermediates`,
and `CurrentTime` as explicit fields instead of embedding
`x509.VerifyOptions`. The test helper `NewAzureInstanceIdentity` now
builds a realistic 3-level certificate chain (Root CA -> Intermediate CA
-> Signing Cert) matching the real Azure trust hierarchy. New tests
(`TestValidate_TamperedContent`,
`TestValidate_UntrustedCertWithValidSignature`) confirm tampered and
untrusted envelopes are rejected.
Addresses GHSA-6x44-w3xg-hqqf.
> [!NOTE]
> This PR was authored by Coder Agents.
<details>
<summary>Implementation Plan</summary>
### Files Changed
| File | Summary |
|------|---------|
| `coderd/azureidentity/azureidentity.go` | Replace `signer.Verify()`
with `VerifyWithChainAtTime`; restructure `Options` struct; add
`ParseCertificates()` helper |
| `coderd/azureidentity/azureidentity_test.go` | Add `testCertChain`
builder, tampered-content and untrusted-cert tests; update existing
tests for new `Options` API |
| `coderd/coderd.go` | Change `AzureCertificates` field from
`x509.VerifyOptions` to `azureidentity.Options` |
| `coderd/workspaceresourceauth.go` | Pass `api.AzureCertificates`
directly instead of wrapping |
| `coderd/coderdtest/coderdtest.go` | Migrate to `smallstep/pkcs7`;
build 3-level cert chain in test helper |
| `go.mod` / `go.sum` | Add `github.com/smallstep/pkcs7`; remove
`fullsailor/pkcs7` and `go.mozilla.org/pkcs7` |
</details>
Security improvements:
- Restrict cert fetches to a host+port allowlist (Microsoft and DigiCert
on 80/443).
- Route requests through a dedicated `http.Client` that resolves the
host once and dials the validated IP directly, preventing DNS rebinding.
- Reject loopback, private (RFC 1918 / IPv6 ULA), link-local, multicast,
unspecified, CGNAT, benchmarking, and IPv4-mapped IPv6 addresses.
- Cap the certificate response body at 1 MiB.
- Log the underlying error via slog and return a generic detail to the
caller to prevent information disclosure.
* fix(coderd): Harden Azure identity certificate fetch
- Restrict cert fetches to a host+port allowlist (Microsoft and
DigiCert on 80/443).
- Route requests through a dedicated `http.Client` that resolves
the host once and dials the validated IP directly.
- Reject loopback, private (RFC 1918 / IPv6 ULA), link-local,
multicast, unspecified, CGNAT, benchmarking, and IPv4-mapped
IPv6 addresses.
- Cap the certificate response body at 1 MiB.
- Log the underlying error via slog and return a generic detail
to the caller.
- Add unit tests for the URL allowlist, IP classification, and
dialer.
* fix(coderd/azureidentity): add IPv6 special-use ranges to SSRF blocklist
The extraBlockedNetworks list only contained IPv4 CIDRs. Add IPv6
equivalents that Go's stdlib classification methods do not cover:
- 64:ff9b:1::/48 RFC 8215 NAT64 translation
- 100::/64 RFC 6666 discard-only
- 2001:2::/48 RFC 5180 benchmarking
- 2001:db8::/32 RFC 3849 documentation
IPv6 ranges already handled by stdlib (unchanged):
- ::1/128 (IsLoopback)
- fc00::/7 (IsPrivate, ULA)
- fe80::/10 (IsLinkLocalUnicast)
- ff00::/8 (IsMulticast)
- ::/128 (IsUnspecified)
Closes https://github.com/coder/internal/issues/965
Recent `pg_dump` patch releases (13.22+ / 14.19+ / 15.14+ / 16.10+ /
17.6+) emit `\restrict` / `\unrestrict` psql meta-commands at the head
and tail of schema dumps. These broke both `sqlc` and our
`scripts/migrate-test` schema-equality check. PR #19696 worked around it
by pinning `pg_dump` to a Docker image.
This change unpins the workaround now that `sqlc` handles the
meta-commands:
* Bumps the coder/sqlc fork pin to [`337309b` on
coder/sqlc:main](https://github.com/coder/sqlc/commit/337309bfb9524f38466a5090e310040fc7af0203),
the merge of upstream v1.31.1 (coder/sqlc#6). v1.31.1 includes
[sqlc-dev/sqlc#4390](https://github.com/sqlc-dev/sqlc/pull/4390), the
upstream `\restrict` / `\unrestrict` parser fix. Updated in three places
that pin the fork SHA: `flake.nix` (`sqlc-custom`),
`.github/actions/setup-sqlc/action.yaml`, and the
`dogfood/coder/ubuntu-{22,26}.04` Dockerfiles. The flake's `sha256` /
`vendorHash` are reset to `pkgs.lib.fakeSha256`; Nix will surface the
real hashes on first build, per the existing comment block.
* Reverts #19696's Docker pin in `coderd/database/dbtestutil/db.go`.
Local `pg_dump` (13+) and the `postgres:13` Docker fallback both work
again.
* Strips `\restrict` / `\unrestrict` lines in `normalizeDump` so
`scripts/migrate-test`'s schema comparison is stable across `pg_dump`
versions (the token in those lines is randomized per run).
`TestNormalizeDumpStripsRestrict` locks the behavior in.
* Regenerates with v1.31.1, picking up the version stamp and one
upstream correctness fix in `DeleteLicense`
([sqlc-dev/sqlc#4383](https://github.com/sqlc-dev/sqlc/pull/4383): don't
shadow the input parameter when scanning a single-column return).
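To illustrate the `\restrict` / `\unrestrict` stripping, here is a minimal sketch. The helper name `stripRestrictLines` is hypothetical and the real `normalizeDump` may do more; this only shows why removing those lines makes two dumps of the same schema compare equal despite the per-run token.

```go
package main

import (
	"fmt"
	"strings"
)

// stripRestrictLines (hypothetical helper) removes psql \restrict and
// \unrestrict meta-command lines, whose token differs on every pg_dump run.
func stripRestrictLines(dump string) string {
	lines := strings.Split(dump, "\n")
	kept := make([]string, 0, len(lines))
	for _, line := range lines {
		if strings.HasPrefix(line, `\restrict `) || strings.HasPrefix(line, `\unrestrict `) {
			continue
		}
		kept = append(kept, line)
	}
	return strings.Join(kept, "\n")
}

func main() {
	// The tokens below are made-up placeholders.
	first := "\\restrict tokenA\nCREATE TABLE t ();\n\\unrestrict tokenA\n"
	second := "\\restrict tokenB\nCREATE TABLE t ();\n\\unrestrict tokenB\n"
	fmt.Println(stripRestrictLines(first) == stripRestrictLines(second))
}
```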
The soft-delete cleanup trigger (`delete_deleted_user_resources`)
removed `api_keys`, `user_links`, and `user_secrets` but left
`organization_members` rows intact. When a new user was created with a
previously-deleted user's email, both user IDs had org membership rows
in the same organization, producing duplicate-email members.
Extend the trigger to also delete `organization_members` for the
soft-deleted user. This cascades through the existing
`trigger_delete_group_members_on_org_member_delete`, which cleans up
group memberships automatically. The migration backfills by removing
zombie rows for already-deleted users.
Fixes ENG-831
> [!NOTE]
> 🤖 Generated by Coder Agents
<details>
<summary>Implementation notes</summary>
**Root cause**: `GetOrganizationIDsByMemberIDs` does not join on
`users.deleted = false`, so stale org membership rows for soft-deleted
users were visible to internal queries. Even the filtered queries
(`OrganizationMembers`, `PaginatedOrganizationMembers`) could surface
duplicate emails when a new active user reused a deleted user's email.
**What changed**:
- Migration 000491 extends `delete_deleted_user_resources()` to `DELETE
FROM organization_members WHERE user_id = OLD.id`
- Backfill removes existing zombie org memberships for soft-deleted
users
- `TestOrgMembersSoftDeleteTrigger` covers org membership removal, raw
row cleanup, and cascading group membership cleanup
</details>
`TestPatchChatMessage/ChangesModel` hardcoded `"openai"` as the provider
for the override model config. After #25171, the shared chat test
harness registers a single `"openai-compat"` provider by default, so
calling `createAdditionalChatModelConfig(..., "openai", ...)` fails with
HTTP 400 `Chat provider is not configured` before the test can exercise
the model-change path. The subtest was added in #25084 after #25171 was
reviewed, so the harness change and the new hardcoded provider only met
on `main`.
Use `defaultModel.Provider` so the override always matches whatever
provider the harness registered. This mirrors every other call site of
`createAdditionalChatModelConfig` in the file.
Closes https://github.com/coder/internal/issues/1530
Replaces the per-agent Go-side template-version filter in
`handleAuthInstanceID` with a purpose-built SQL query.
`GetWorkspaceBuildAgentsByInstanceID` joins `workspace_agents ->
workspace_resources -> workspace_builds -> provisioner_jobs ->
workspaces` and excludes:
- non-`workspace_build` provisioner jobs (template-version-import,
dry-run)
- deleted agents and sub-agents
- deleted workspaces
The handler:
- drops the per-candidate `GetWorkspaceResourceByID` /
`GetProvisionerJobByID` lookups
- drops the `provisioner_jobs.input` JSON parsing and the follow-up
`GetWorkspaceBuildByID` call
- compares `latestHistory.ID` against `selected.WorkspaceBuildID`
returned directly from the query
- preserves the existing recycled-instance safety check and matching
response codes
One intentional behavior tightening: agents whose workspace is deleted
now return 404 (previously they could reach the recycled-instance check
and return 400, or 200 if the stale build was still latest). This
matches the existing token-auth path, which already refuses to
authenticate against deleted workspaces.
The original `GetWorkspaceAgentsByInstanceID` query is intentionally
untouched. It remains the generic raw lookup used elsewhere in tests and
helpers.
The dbauthz wrapper for the new query uses the system-read fast path
with `fetchWithPostFilter` for non-system reads, with `RBACObject()`
delegating to the embedded `WorkspaceTable`.
Tests:
- new `TestGetWorkspaceBuildAgentsByInstanceID` covering newest-first
ordering, exclusion of deleted/sub agents, exclusion of template-import
and dry-run jobs, and exclusion of deleted workspaces
- new dbauthz mock test for `GetWorkspaceBuildAgentsByInstanceID`
- new `TestPostWorkspaceAuthAWSInstanceIdentity/RecycledInstanceID`
exercising the recycled-instance rejection branch (HTTP 400 when the
agent's build is no longer latest)
- existing `TestPostWorkspaceAuth{AWS,Azure,Google}InstanceIdentity`
continue to cover the handler end to end (including the template-version
+ workspace-build same-instance-ID scenario via
`setupInstanceIDWorkspace`)
> Mux is acting on Mike's behalf.
Editing a previous user message and selecting a different model in the
picker silently kept using the original model: the selection was dropped
on the frontend, in the SDK, and in the backend, so both the replacement
user message and the assistant turn that followed ran against the old
model.
Plumb the selected model through all three layers (`AgentChatPage`,
`codersdk.EditChatMessageRequest`, `chatd.EditMessageOptions` /
`Server.EditMessage`), defaulting to the original message's model when
the client does not specify one. The existing `InsertChatMessages` CTE
already advances `chats.last_model_config_id` when the inserted
message's model differs, so the assistant turn picks up the new
selection without further changes. The new model is validated inside the
transaction, so an unknown ID rolls the edit back and returns a 400
`Invalid model config ID.`, mirroring the `SendMessage` path.
Refs: CODAGT-345
This change was generated by a Coder agent.
<details>
<summary>Implementation plan</summary>
# CODAGT-345: Editing an earlier message cannot change model
## Problem
When editing a previous user message in a chat, the user can change the
model in the model picker, but the backend keeps using the original
message's model. The model selection is dropped at three layers:
1. **Frontend:** `AgentChatPage.tsx`'s edit branch builds an
`EditChatMessageRequest` that omits `model_config_id`. The new-message
branch (a few lines below) does include it.
2. **SDK:** `codersdk.EditChatMessageRequest` has no `ModelConfigID`
field at all.
3. **Backend:** `chatd.EditMessageOptions` has no model field, and
`Server.EditMessage` always copies the original message's
`ModelConfigID` into the replacement message.
Once the replacement user message is inserted with the original model,
the `InsertChatMessages` CTE leaves `chats.last_model_config_id`
unchanged, so the assistant turn that follows runs against the old
model.
## Fix
Plumb the selected model through all three layers, defaulting to the
original message's model when the client doesn't override it. This
mirrors the `SendMessage` path, which already accepts a
`model_config_id` and validates it via
`resolveSendMessageModelConfigID`.
### Backend
- `codersdk/chats.go`: add `ModelConfigID *uuid.UUID` to
`EditChatMessageRequest`.
- `coderd/x/chatd/chatd.go`:
- Add `ModelConfigID uuid.UUID` to `EditMessageOptions`.
- In `EditMessage`, after fetching the edited message, resolve the
model: if `opts.ModelConfigID != uuid.Nil`, validate it exists with
`tx.GetChatModelConfigByID` (using `chatdModelConfigLookupContext`),
otherwise keep `editedMsg.ModelConfigID.UUID`. Pass the resolved ID into
`newChatMessage(...)`.
- Reuse the existing `ErrInvalidModelConfigID` sentinel.
- `coderd/exp_chats.go` (`patchChatMessage`):
- Read `req.ModelConfigID` (nil-safe), pass into
`chatd.EditMessageOptions`.
- Add a `case xerrors.Is(editErr, chatd.ErrInvalidModelConfigID)` arm
returning 400 `Invalid model config ID.`, matching the
`postChatMessages` handler.
### Frontend
- `site/src/pages/AgentsPage/AgentChatPage.tsx`:
- In the edit branch, set `model_config_id: effectiveSelectedModel ||
undefined` on the `EditChatMessageRequest`.
- On success, persist the chosen model to `lastModelConfigIDStorageKey`
so the next chat from this browser keeps the same default. Mirrors the
new-message branch.
### Generated
- `make site/src/api/typesGenerated.ts` and `make
coderd/apidoc/swagger.json` produce the updated `EditChatMessageRequest`
schema in `typesGenerated.ts`, `coderd/apidoc/{docs.go,swagger.json}`,
and `docs/reference/api/{chats.md,schemas.md}`.
## Tests
- `coderd/x/chatd/chatd_test.go`:
- `TestEditMessageWithModelConfigOverride`: edit with a different model
-> replacement message and `chats.LastModelConfigID` use the new model.
- `TestEditMessagePreservesModelConfigByDefault`: edit without
`ModelConfigID` -> original model preserved.
- `TestEditMessageRejectsUnknownModelConfig`: passes a random UUID ->
`ErrInvalidModelConfigID`, original message still present,
`LastModelConfigID` unchanged (rollback).
- `coderd/exp_chats_test.go` (under `TestPatchChatMessage`):
- `ChangesModel`: end-to-end via SDK; `edited.Message.ModelConfigID` and
`chat.LastModelConfigID` both match the new model.
- `InvalidModelConfigID`: random UUID -> 400 `Invalid model config ID.`.
</details>
Chat tests previously constructed a real `openai` provider with a fake
API key and no `BaseURL`, so background title generation hit
`api.openai.com` and timed out under `-race`. The same root cause
produced several distinct flakes: title regeneration races with
synchronous `UpdateChat`/`ProposeChatTitle`, and pagination races
against `updated_at` bumps from real-network processing.
This moves the fake OpenAI-compatible provider and the chat-settle wait
into first-class `coderdtest` capabilities.
`coderd.Options.ChatProviderAPIKeys` is the new seam tests use to
redirect chat traffic to a local `httptest.Server`.
`coderdtest.WaitForChatSettled` replaces per-test waiters and drains
tracked chat-daemon work after the chat row leaves `pending`/`running`.
The `newChatClient*` constructors funnel through one options builder
that installs the fake provider before the coderd test server so cleanup
ordering is deterministic.
Closes https://github.com/coder/internal/issues/1528 & Closes ENG-2659
Closes https://github.com/coder/internal/issues/1480 & Closes CODAGT-359
Closes https://github.com/coder/internal/issues/1507 & Closes CODAGT-368
Relates to https://github.com/coder/internal/issues/1397 & Relates to
CODAGT-374
Adds an Agents General setting to require Cmd/Ctrl+Enter before sending
chat messages. When enabled, plain Enter inserts a newline in agent chat
inputs while the send button remains available.
The preference is now persisted server-side through
`/api/v2/users/{user}/preferences`, alongside the existing user
preference settings, and is applied to both the create-agent input and
existing chat composer. Storybook and API coverage verify the setting,
keyboard behavior, validation, and persistence.
<details>
<summary>Coder Agents notes</summary>
Generated by Coder Agents from a Slack request. Dogfooded with
agent-browser against the Storybook settings and chat input stories.
</details>
## Problem
In `coderd/x/chatd/chatd.go` `runChat`, workspace MCP discovery is gated
on `chat.WorkspaceID.Valid` at the start of each turn. New chats that
bind their workspace mid-turn (via `create_workspace` or
`start_workspace`) get an empty workspace tool list on the first step,
and the model falls back to `execute` (bash) because no workspace MCP
tools are advertised.
**Repro:** new chat → "create a workspace and use MCP tools". No
`/api/v0/mcp/tools` request hits the agent on turn 1; turn 2 in the same
chat works fine.
## Fix
- Add a `PrepareTools` callback to `chatloop.RunOptions`, analogous to
`PrepareMessages`. It is invoked once before each LLM step with the
current tool list. When it returns non-nil, the chatloop replaces
`opts.Tools`, rebuilds the per-step tool definitions, and appends new
tool names to `opts.ActiveTools` so newly injected tools are callable
immediately.
- Wire `PrepareTools` in `runChat` to trigger workspace MCP discovery
the first time the chat snapshot reports a valid `WorkspaceID`. The
previous top-of-turn discovery path is unchanged for chats that start
with a workspace.
- Extract the discovery logic into `Server.discoverWorkspaceMCPTools` so
the top-of-turn and mid-turn paths share identical behavior (cache,
agent resolution, `ListMCPTools` timeout, invalidation).
Mid-turn discovery stays disabled in plan-mode turns and Explore
subagents, matching the existing top-of-turn gate. The
`workspaceMCPDiscovered` flag prevents redundant dials after the first
successful discovery.
## Tests
- `coderd/x/chatd/chatloop/chatloop_test.go`: two new
`TestRun_PrepareTools*` cases covering injection on the next step and
active-set merging when `ActiveTools` is non-empty.
- `coderd/x/chatd/chatd_test.go`:
`TestRunChat_WorkspaceMCPDiscoveryAfterMidTurnCreateWorkspace` drives
`runChat` through a `create_workspace` tool call against a real Postgres
+ mocked agent conn and asserts the second streamed LLM request
advertises the workspace MCP tool. Verified that the test fails (and
pinpoints the missing tool) when the `PrepareTools` wiring is disabled.
## Validation
```
go test ./coderd/x/chatd/chatloop/... -count=1
go test ./coderd/x/chatd/... -count=1
make lint/emdash
```
<details>
<summary>Decision log</summary>
- Chose a per-step `PrepareTools` callback over mutating `opts.Tools` in
place because `chatloop.Run` builds the `fantasy.Tool` definitions once
at start; a hook is required to let the LLM see new tools on the next
step.
- Returned `[]fantasy.AgentTool` (not also active-tool-names) and let
the chatloop derive name merges via `mergeNewToolNames`. This avoids
leaking plan-mode gating decisions into the callback contract.
- Kept the existing top-of-turn discovery path so chats that already
have a workspace at turn start pay no extra latency.
- Skipped reusing `ReloadMessages` (history reload) since this is purely
a tool-availability concern; coupling it to a history reload would
defeat the chatloop cache prefix optimizations.
</details>
---
_This pull request was generated by Coder Agents._
Moves the `coderd_agents_first_connection_seconds` histogram from the
polling-based `prometheusmetrics.Agents()` loop to the event-driven
`agentConnectionMonitor.init()` path. The metric is now recorded exactly
once when an agent first connects over the RPC websocket, instead of
being retroactively computed each polling tick.
The `username` and `workspace_name` labels are removed to reduce
cardinality; only `template_name` and `agent_name` are retained.
Adds unit tests covering both the happy path (first connection recorded)
and the negative-duration guard (clock skew logs a warning, no sample
emitted).
Stream advisor output into the advisor tool card while the nested
advisor call is still running.
This keeps the advisor implementation intentionally advisor-specific:
the parent model still receives the same final structured tool result,
while the frontend receives transient `tool-result.result_delta` parts
to render partial advisor text in the expanded card. The final persisted
chat history remains unchanged.
Refs CODAGT-322.
Generated by Coder Agents.
<details>
<summary>Implementation plan</summary>
- Publish advisor text deltas from the nested `chatloop.Run` via
`RunAdvisorOptions.OnAdviceDelta`.
- Forward those deltas through `chatadvisor.Tool` with the parent
advisor tool call ID.
- Emit transient `ChatMessagePartTypeToolResult` websocket parts with
`ResultDelta` from `chatd`.
- Add `result_delta` to the generated tool-result TypeScript variant.
- Accumulate tool result deltas in frontend stream state and keep the
tool running until the final result arrives.
- Render streamed advisor advice in the existing advisor card using
streaming markdown mode, while retaining the updated advisor UI.
</details>
The `TimedOutAgentCacheHit`, `CacheHitHealthyAgent`, and
`CacheHitDBError` subtests of `TestGetWorkspaceConn_StatusCheck` built
their `WorkspaceAgent` timestamps with `time.Now()` in the parent test's
slice literal and then ran the actual check against the server's real
wall clock (`quartz.NewReal()`). On slow Windows CI runners, more than
`agentInactiveDisconnectTimeout` (30s) of wall time can elapse between
slice construction and the parallel subtest body. In that window, the
cached "healthy" agent gets reclassified as disconnected by
`agentDisconnectedFor`, and `CacheHitHealthyAgent` fails with
`errChatAgentDisconnected` instead of returning the cached connection.
Build each agent inside the subtest with `quartz.NewMock(t)` and feed
the same clock into the `Server` so the agent timestamps and the status
math share a single frozen `now`. This matches the pattern already used
by `TestGetWorkspaceConn_DialTimeoutDisconnectedRecoveryThreshold` in
the same file.
Closes https://github.com/coder/internal/issues/1522
<details>
<summary>Verification</summary>
Inserting `time.Sleep(35 * time.Second)` at the top of each subtest's
body reliably reproduces the original failure
(`errChatAgentDisconnected` on `CacheHitHealthyAgent`) on the parent
commit and passes with this change. After removing the synthetic sleep,
`go test ./coderd/x/chatd -run TestGetWorkspaceConn_StatusCheck
-count=50` passes cleanly.
</details>
> Generated by Coder Agents on behalf of the assignee.
Co-authored-by: Coder Agents <noreply@coder.com>
Fixes
[CODAGT-372](https://linear.app/codercom/issue/CODAGT-372/coderdazureidentity-testvalidateregular-fails-on-macos).
Closes coder/internal#101.
## Problem
`coderd/azureidentity TestValidate/regular` fails on macOS with:
```
verify signature:
github.com/coder/coder/v2/coderd/azureidentity.Validate
/Users/runner/work/coder/coder/coderd/azureidentity/azureidentity.go:75
- x509: “metadata.azure.com” certificate is not standards compliant
```
When `crypto/x509.VerifyOptions.Roots` is `nil`, Go's verifier on
macOS/iOS falls back to the system verifier (`systemVerify` in
`crypto/x509/root_darwin.go`), which delegates to Apple's
`SecTrustEvaluateWithError`. Apple's framework enforces stricter
standards-compliance checks than Go's pure-Go verifier and rejects some
otherwise valid Azure instance-identity leaf certificates with
`errSecCertificateIsNotStandardsCompliant`, surfaced as the `not
standards compliant` error.
The test had been skipped on darwin since #12979 (April 2024) as a
workaround.
## Fix
- Embed the three root CAs that Azure instance-identity certificates
ultimately chain to:
- DigiCert Global Root G2
- DigiCert Global Root G3
- Baltimore CyberTrust Root (kept for historical chains via `Microsoft
RSA TLS CA 01/02`)
- In `Validate`, populate `options.Roots` from those embedded roots when
the caller does not supply its own pool. Because `Roots != nil`, Go no
longer takes the `systemVerify` path on darwin and uses the pure-Go
verifier on all platforms.
- Remove the `runtime.GOOS == "darwin"` skip from `TestValidate`.
- Add `TestEmbeddedRoots` to guard against future regressions in the
embedded root list (parses each PEM, asserts self-signed, requires all
three named roots).
The caller's existing `Intermediates` handling is unchanged. Tests that
pass their own `Roots` (e.g. `coderdtest.NewAzureInstanceIdentity`) are
unaffected.
## Verification
On Linux:
```
$ go test ./coderd/azureidentity/ -race -count=1 -v
=== RUN TestValidate
=== RUN TestValidate/regular
=== RUN TestValidate/govcloud
=== RUN TestValidate/rsa
--- PASS: TestValidate (0.00s)
--- PASS: TestValidate/regular (0.00s)
--- PASS: TestValidate/rsa (0.00s)
--- PASS: TestValidate/govcloud (0.00s)
=== RUN TestEmbeddedRoots
--- PASS: TestEmbeddedRoots (0.00s)
=== RUN TestExpiresSoon
--- SKIP: TestExpiresSoon (0.00s)
PASS
ok github.com/coder/coder/v2/coderd/azureidentity 1.020s
```
The `test-go-pg` job on `macos-latest` in CI is the authoritative
confirmation of the fix on macOS; previously it would have failed
`TestValidate/regular` had the skip been removed.
<details>
<summary>Why this is the correct fix</summary>
From `/usr/local/go/src/crypto/x509/verify.go`:
```go
// Use platform verifiers, where available, if Roots is from SystemCertPool.
if runtime.GOOS == "windows" || runtime.GOOS == "darwin" || runtime.GOOS == "ios" {
	systemPool := systemRootsPool()
	if opts.Roots == nil && (systemPool == nil || systemPool.systemPool) {
		return c.systemVerify(&opts)
	}
	...
}
```
Setting `opts.Roots` to any non-nil, non-system pool deterministically
routes verification through Go's pure-Go verifier, bypassing Apple's
stricter compliance checks. The embedded roots are sufficient to
validate every chain we currently care about, since every intermediate
in `Certificates` ultimately chains to one of the three embedded roots.
</details>
> Generated by Coder Agents. Reviewed manually.
Fixes [CODAGT-367](https://linear.app/codercom/issue/CODAGT-367).
`TestResolveExploreToolSnapshot/*` flaked on CI (Linux and Windows) with
`context deadline exceeded` on the `GetMCPServerConfigsByIDs` call
inside `resolveExploreToolSnapshot`.
Each test setup called `server.CreateChat` twice with `MCPServerIDs` set
to fake `.example.com` URLs. `CreateChat` marks the chat pending and
calls `signalWake`, which causes the chatd background `acquireLoop` to
pick the chat up. That goroutine then dialed the fake MCP URLs
(NXDOMAIN, slower on Windows) and made an OpenAI request with the dbgen
default test key (401). Under CI load, that activity racing the 4
parallel subtests' `GetMCPServerConfigsByIDs` calls was enough to exceed
the 25s test context deadline. The failure logs in the issue showed both
side effects firing in the same job.
`resolveExploreToolSnapshot` only reads `ID`, `MCPServerIDs`,
`PlanMode`, `ParentChatID`, and `Mode` off the parent argument, so the
chats do not need to be persisted. Build them as in-memory
`database.Chat` values instead. The MCP server configs remain in the DB
because the function still queries them via `GetMCPServerConfigsByIDs`.
Verified locally with `go test ./coderd/x/chatd -run
TestResolveExploreToolSnapshot -count=100 -race` (passes, ~5s total) and
the surrounding `TestResolve*` / `TestCreateChildSubagentChat*` /
`TestSpawnAgent_Explore*` tests.
---
_Made by Coder Agents on behalf of @ibetitsmike. [Linear
session](https://linear.app/codercom/issue/CODAGT-367/flake-testresolveexploretoolsnapshot#agent-session-0730f3fe)._
Closes https://github.com/coder/coder/issues/13112
**Breaking Change**: A patch with no diffs no longer returns
`StatusNotModified`. The patch is now always applied and a template is
always returned.
The test built a `Retry-After` HTTP-date with
`time.Now().Add(3*time.Second).UTC().Format(http.TimeFormat)`, then
asserted that the parsed `RetryAfter` was `>= 2s`. `http.TimeFormat` has
second precision, so `Format()` truncates up to ~1s. Combined with the
small elapsed time between formatting in the test and `time.Until()` in
production, the value could land just under `offset-1s` (1.997s observed
in CI), failing the lower bound.
Round the formatted target up to the next whole second so the parsed
deadline is never earlier than `now+offset`, and assert against a
symmetric `[offset-1s, offset+1s]` window.
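
A minimal sketch of the rounding step, assuming hypothetical helpers `roundUpToSecond` and `retryAfterHeader` (the actual test code may be structured differently). It uses a fixed `now` with a 500ms fraction so the effect of second-precision formatting is visible.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// roundUpToSecond returns t unchanged when it already sits on a whole
// second, otherwise the next whole second.
func roundUpToSecond(t time.Time) time.Time {
	truncated := t.Truncate(time.Second)
	if truncated.Before(t) {
		return truncated.Add(time.Second)
	}
	return truncated
}

// retryAfterHeader formats now+offset as an HTTP-date that is never
// earlier than now+offset, despite http.TimeFormat's second precision.
func retryAfterHeader(now time.Time, offset time.Duration) string {
	return roundUpToSecond(now.Add(offset)).UTC().Format(http.TimeFormat)
}

// demo formats a header for a fixed instant and parses it back.
func demo() (header string, delay time.Duration, err error) {
	now := time.Date(2026, 5, 7, 15, 12, 23, 500_000_000, time.UTC)
	header = retryAfterHeader(now, 3*time.Second)
	parsed, err := http.ParseTime(header)
	if err != nil {
		return "", 0, err
	}
	return header, parsed.Sub(now), nil
}

func main() {
	header, delay, err := demo()
	fmt.Println(header, delay, err)
}
```

With the target rounded up, the parsed delay lands in `[offset, offset+1s)`, so it always fits inside the `[offset-1s, offset+1s]` window.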
Closes
[CODAGT-365](https://linear.app/codercom/issue/CODAGT-365/flake-testclassify-parsesretryafterhttpdate)
Refs https://github.com/coder/internal/issues/1512
<sub>Created by [Coder Agents](https://coder.com/docs/agent).</sub>
Co-authored-by: Coder Agents <coderagents@coder.com>
When OpenAI's Responses API returns `Previous response with id ... not
found` for a chained turn, classify it as a `ChainBroken` retry, clear
`previous_response_id`, exit chain mode, reload full history, and let
`chatretry` retry. Self-heals chats whose anchor was poisoned before
#25074 stopped truncated streams from being persisted as a successful
turn with a stored response id.
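
The recovery steps above can be sketched roughly as follows. This is an illustration only: `chainState`, `isChainBroken`, and `recoverFromError` are hypothetical names, and the string match is an assumption about how the provider error could be recognized, not the classifier chatd actually uses.

```go
package main

import (
	"fmt"
	"strings"
)

// chainState is a hypothetical stand-in for the per-chat chaining state.
type chainState struct {
	PreviousResponseID string
	ChainMode          bool
}

// isChainBroken reports whether a provider error message looks like
// "Previous response with id ... not found".
func isChainBroken(msg string) bool {
	m := strings.ToLower(msg)
	return strings.Contains(m, "previous response with id") &&
		strings.Contains(m, "not found")
}

// recoverFromError clears the anchor and leaves chain mode when the chain
// is broken. It returns true when the caller should reload full history
// and retry the turn.
func (s *chainState) recoverFromError(msg string) bool {
	if !s.ChainMode || !isChainBroken(msg) {
		return false
	}
	s.PreviousResponseID = ""
	s.ChainMode = false
	return true
}

func main() {
	s := chainState{PreviousResponseID: "resp_123", ChainMode: true}
	retry := s.recoverFromError("Previous response with id 'resp_123' not found.")
	fmt.Println(retry, s.PreviousResponseID == "", s.ChainMode)
}
```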
The new state is exposed via the existing
`coderd_chatd_stream_retries_total` counter as a
`chain_broken="true"|"false"` label. Aggregating queries (`sum`, `rate`
over `provider`/`model`/`kind`) keep working without changes; raw-series
matchers without aggregation will now see two series per `(provider,
model, kind)` where they previously saw one. The metric is internal-only
so the blast radius should be small, but if you have dashboards that
index by exact label matchers without aggregation they will need an
extra `sum` or an explicit `chain_broken` selector.
> 🤖 This PR was created with the help of Coder Agents, and was reviewed by a human 🧑‍💻
> Mux is acting on Mike's behalf.
Changes chat turn-end summaries into compact status labels for the
cached `last_turn_summary` and successful web push body.
Uses a structured-output model call for successful turns, requiring a
2-5 word `label` and validating it to reject agent-centric phrasing.
Pending and requires-action states keep deterministic status labels.
Removes the earlier deterministic tool-signal pipeline in favor of the
smaller structured-output path.
Extend the delete_deleted_user_resources() trigger so that secrets
belonging to a soft-deleted user are removed in the same transaction as
the existing api_keys and user_links cleanup.
user_secrets.user_id has ON DELETE CASCADE, but Coder soft-deletes users
by flipping users.deleted rather than removing the row, so the foreign key
cascade never fires and secrets would otherwise survive deletion.
Assisted by Coder Agents.
Use typed atomics (atomic.Int64, atomic.Int32, etc.) in test files to prevent
mixing atomic and non-atomic access on the same value, guarantee 64-bit
alignment on 32-bit platforms, and provide a cleaner API.
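
For illustration, a small sketch of the typed-atomic pattern in a test double. `fakeSender` and `countConcurrentSends` are hypothetical names, not types from the affected test files.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// fakeSender counts calls with a typed atomic. The field can only be
// touched through Add/Load/Store, so a plain non-atomic read or write
// cannot be mixed in by accident.
type fakeSender struct {
	calls atomic.Int64
}

func (s *fakeSender) Send() { s.calls.Add(1) }

// countConcurrentSends calls Send from n goroutines and returns the total.
func countConcurrentSends(n int) int64 {
	var s fakeSender
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.Send()
		}()
	}
	wg.Wait()
	return s.calls.Load()
}

func main() {
	fmt.Println(countConcurrentSends(100))
}
```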
TestPromoteQueuedWhileRequiresActionMixedTools has flaked three times across
Windows and Ubuntu CI runners since 2026-05-06; local repro on the dev
workspace has not surfaced it. The May 8 Ubuntu log shows all four
PromoteQueued post-TX pubsub publishes reaching pg_notify, yet the test still
times out 25s later, so the failure is downstream between the subscriber's
listener and the test's events channel. Adds three Debug-level markers in
chatd.go (no logic change) plus two t.Logf markers in the test's reader so
the next CI occurrence pins down exactly which step failed.
Closes ENG-2645
Closes coder/internal#1523
Parallel subtests in `coderd/x/chatd` reused a parent test context with
a `testutil.WaitLong` deadline, so the context could expire before a
subtest was scheduled under load. That made the subagent lifecycle tools
return plain-text context errors instead of the expected JSON payload,
causing flaky JSON unmarshal failures.
Create fresh `chatdTestContext` values inside the affected parallel
subtests and add `chatdTestContext` to the `paralleltestctx` custom
function list so this pattern is caught by `make lint`.
Closes https://github.com/coder/internal/issues/1494
## Summary
Adds a `stop_workspace` tool to chatd so the model can recover from the
"workspace running but agent dead" failure mode (e.g. an OOM that leaves
the workspace running but the agent unreachable) by stopping and then
starting the workspace.
<img width="924" height="742" alt="image"
src="https://github.com/user-attachments/assets/279dedb6-6e29-4fe1-8754-3a1f01e538bf"
/>
## What changed
**New `stop_workspace` chatd tool**
(`coderd/x/chatd/chattool/stopworkspace.go`). Mirrors `start_workspace`:
shares `WorkspaceMu` to serialize with create/start, waits for any
in-progress build before issuing a stop, and is idempotent only after a
successful Stop transition. Failed stop builds re-attempt rather than
reporting success.
**New `chatStopWorkspace` coderd hook** (`coderd/exp_chats.go`). Mirrors
`chatStartWorkspace` minus the `RequireActiveVersion` gate. Stop should
not be blocked by template version policy.
**Differentiated recovery sentinels** (`coderd/x/chatd/chatd.go`).
`errChatAgentDisconnected` instructs the model to call `stop_workspace`
then `start_workspace`. `errChatDialTimeout` instructs a single retry,
then user escalation if it repeats. The previous single message
conflated transient and persistent failures.
**Two-signal recovery gate.** Recovery is only surfaced when a tool call
times out *and* a fresh DB read of the latest workspace agent says
`Disconnected`. The previous draft escalated on the DB read alone, which
would fire on a 30-second heartbeat blip (e.g. agent respawn) and prompt
a destructive stop/start unnecessarily.
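
A minimal sketch of the two-signal gate, assuming a hypothetical `agentStatus` type and `shouldSurfaceRecovery` helper. The real check lives in `coderd/x/chatd/chatd.go` and operates on database types.

```go
package main

import "fmt"

// agentStatus is a hypothetical stand-in for the workspace agent status.
type agentStatus string

const (
	statusConnected    agentStatus = "connected"
	statusConnecting   agentStatus = "connecting"
	statusDisconnected agentStatus = "disconnected"
	statusTimeout      agentStatus = "timeout"
)

// shouldSurfaceRecovery requires both signals: the tool call timed out
// and a fresh read of the latest agent reports Disconnected. Either
// signal alone is not enough.
func shouldSurfaceRecovery(toolCallTimedOut bool, latest agentStatus) bool {
	return toolCallTimedOut && latest == statusDisconnected
}

func main() {
	fmt.Println(shouldSurfaceRecovery(true, statusDisconnected))  // both signals
	fmt.Println(shouldSurfaceRecovery(false, statusDisconnected)) // heartbeat blip only
	fmt.Println(shouldSurfaceRecovery(true, statusTimeout))       // never-connected agent
}
```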
**Cache-hit disconnected handling** now clears the cache and retries a
fresh dial before escalating, rather than returning the recovery
sentinel immediately. Latest-agent classification uses
`GetWorkspaceAgentsInLatestBuildByWorkspaceID` instead of the chat's
bound `AgentID`, so stale bindings after a rebuild don't misclassify.
**Shared chattool helpers** in `coderd/x/chatd/chattool/chattool.go`:
`latestWorkspaceBuildAndJob`, `publishBuildBinding`,
`provisionerJobTerminal`. Applied to both `start_workspace` and
`stop_workspace`.
## Notes
- Reverts an earlier draft that widened `ask_user_question` to root
standard turns. Plan-mode-only behavior is restored.
- The `stop_workspace` tool currently renders via the generic chat
tool-call UI. A follow-up frontend PR will prettify the `stop_workspace`
tool and style it like the `start_workspace` tool.
- Never-connected (`Timeout` status) agents are intentionally excluded
from recovery. They indicate template or startup failure, not the
running-but-dead case this PR targets.
Closes CODAGT-315
Closes
[CODAGT-317](https://linear.app/codercom/issue/CODAGT-317/pr-workspaces-sometimes-require-name-confirmation-to-delete).
## Problem
The `/agents` archive-and-delete molly-guard (typing the workspace name)
was firing for chats that had clearly created their own workspace. The
heuristic in `resolveArchiveAndDeleteAction` decides whether
confirmation is needed by comparing the workspace's `created_at` against
the chat's `created_at`:
```ts
return new Date(workspaceCreatedAt) >= new Date(chatCreatedAt);
```
That assumption breaks for **prebuilt workspaces**.
`ClaimPrebuiltWorkspace` rewrites `owner_id`, `name`, `updated_at`,
`last_used_at`, etc., but **never touches `created_at`**, which still
reflects when the prebuild was provisioned by the reconciler, often
hours before the chat exists. Result: every prebuild-claimed workspace
looks pre-existing, so the molly-guard fires.
Concrete example from a real chat:
| Field | Value |
|---|---|
| `chat.created_at` | `2026-05-07T15:12:23Z` |
| `workspace.created_at` (provision) | `2026-05-07T14:22:24Z` |
| `latest_build.created_at` (claim) | `2026-05-07T15:19:09Z` |
`14:22:24 < 15:12:23` so `isWorkspaceAutoCreated` returned false even
though the chat issued the claim.
## Fix (frontend-only)
Derive the moment a workspace was acquired from existing build history
rather than relying on `workspace.created_at`:
- Build #1 initiator = prebuilds system user → workspace was a prebuild
→ use `build_2.created_at` (the claim build) as the acquisition time.
- Build #1 initiator = real user → workspace was created from scratch →
use `workspace.created_at` (unchanged behavior).
- Unclaimed prebuild or no build history → return `null` (force
confirmation; safe degradation for a destructive flow).
The resolver fetches the build list via the existing
`getWorkspaceBuilds` endpoint when the dialog might fire. No new column,
no migration, no schema change. Works retroactively for all existing
claimed prebuilds; no backfill needed.
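
The decision rules above, sketched in Go for illustration only. The actual change is in the TypeScript frontend; `build`, `acquiredAt`, `isAutoCreated`, and the placeholder `prebuildsUserID` are hypothetical. The example data reuses the timestamps from the table in the Problem section.

```go
package main

import (
	"fmt"
	"time"
)

// prebuildsUserID is a placeholder, not the real system user UUID.
const prebuildsUserID = "prebuilds-system-user"

type build struct {
	Number      int
	InitiatorID string
	CreatedAt   time.Time
}

// acquiredAt returns the moment the workspace was acquired, or false when
// that cannot be determined (no history, or an unclaimed prebuild).
func acquiredAt(workspaceCreatedAt time.Time, builds []build) (time.Time, bool) {
	byNumber := make(map[int]build, len(builds))
	for _, b := range builds {
		byNumber[b.Number] = b
	}
	first, ok := byNumber[1]
	if !ok {
		return time.Time{}, false // no build history: force confirmation
	}
	if first.InitiatorID != prebuildsUserID {
		return workspaceCreatedAt, true // created from scratch
	}
	claim, ok := byNumber[2]
	if !ok {
		return time.Time{}, false // unclaimed prebuild: force confirmation
	}
	return claim.CreatedAt, true
}

// isAutoCreated reports whether the workspace was acquired at or after
// the chat was created.
func isAutoCreated(chatCreatedAt, workspaceCreatedAt time.Time, builds []build) bool {
	acquired, ok := acquiredAt(workspaceCreatedAt, builds)
	return ok && !acquired.Before(chatCreatedAt)
}

func demo() (claimed, unclaimed bool) {
	chat := time.Date(2026, 5, 7, 15, 12, 23, 0, time.UTC)
	provisioned := time.Date(2026, 5, 7, 14, 22, 24, 0, time.UTC)
	claimedAt := time.Date(2026, 5, 7, 15, 19, 9, 0, time.UTC)
	first := build{Number: 1, InitiatorID: prebuildsUserID, CreatedAt: provisioned}
	second := build{Number: 2, InitiatorID: "user-1", CreatedAt: claimedAt}
	claimed = isAutoCreated(chat, provisioned, []build{first, second})
	unclaimed = isAutoCreated(chat, provisioned, []build{first})
	return claimed, unclaimed
}

func main() {
	fmt.Println(demo())
}
```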
The prebuilds system user UUID is exposed via
`codersdk.PrebuildsSystemUserID` and typegen'd to `typesGenerated.ts`.
`coderd/database.PrebuildsSystemUserID` parses that constant via
`uuid.MustParse` so the two cannot drift; if the codersdk literal ever
changes, package init fails fast.
## History
The first draft of this PR added a `workspaces.claimed_at` column
populated by `ClaimPrebuiltWorkspace`. After review feedback from
@johnstcn pointing out that the same fact is already implicit in build
history, I pivoted to the frontend-only approach. Subsequent review
notes consolidated the prebuilds system user UUID into a single
typegen'd constant.
## Why not the other open PRs
- **#25055** (`chatKey` cache fallback) only fixes a different
cache-miss path; it explicitly notes it does not address `created_at <
chat.created_at`.
- **#25053** (`chats.workspace_auto_created` boolean) puts the truth on
the wrong side of the schema: "this workspace was claimed at time T" is
a property of the workspace, not the chat. The MCP plumbing it adds is
also unnecessary now that the same answer is available from build
history.
## Test plan
- `pnpm vitest run --project=unit
src/pages/AgentsPage/utils/agentWorkspaceUtils.test.ts` — 40/40 pass;
new cases cover prebuild claim before/after chat, unclaimed prebuild,
missing-build-history fallback, and the fetch-skip when the chat is not
in cache.
- `pnpm lint:types`, `pnpm check`, `make pre-commit`.
<details>
<summary>Disclosure</summary>
Opened on behalf of @kylecarbs by [Coder
Agents](https://coder.com/coder-agents).
</details>
# Summary
Implements
https://linear.app/codercom/issue/AIGOV-282/add-ai-model-price-table-and-seed-generator
This PR lays the groundwork for AI Bridge cost controls (per the AI
Governance RFC). It adds the foundation needed for future cost tracking:
a place to store per-model token prices, a way to keep those prices in
sync with upstream pricing data, and a startup mechanism that ensures
every deployment has prices loaded before AI Bridge starts processing
requests.
The price data comes from [models.dev](https://models.dev/), a
community-maintained catalogue of AI provider pricing. A generator
script fetches the latest prices, filters to Anthropic and OpenAI for
now, and produces a seed file checked into the repository.
On every server startup the seed is applied to the database, so new
releases automatically pick up any price corrections that landed since
the previous one. Existing rows are overwritten with the latest prices;
rows for models no longer in the seed are left untouched.
# Batching the AI model price seed: three approaches
Context: at server startup we seed the `ai_model_prices` table from an
embedded JSON price book (~70 rows today, will grow as we add providers,
potentially 4000+).
Each row is:
```text
(provider, model, input_price, output_price, cache_read_price, cache_write_price)
```
Any of the four price columns can be:
- `NULL` → “price unknown for this dimension”
- explicit `0` → “free”
The batch must be an UPSERT so re-running is idempotent and existing
rows pick up new prices.
We considered three implementations.
---
## Approach 1 — Per-row UPSERT in a Go loop
```go
for _, row := range rows {
	if err := db.UpsertAIModelPrice(ctx, database.UpsertAIModelPriceParams{
		Provider:   row.Provider,
		Model:      row.Model,
		InputPrice: nullInt64(row.InputPrice),
		// ...
	}); err != nil {
		return err
	}
}
```
### Pros
- Trivial.
- NULL handling falls out naturally from `sql.NullInt64`.
### Cons
- `N` round-trips per seed.
- With ~70 rows that means ~70 statement executions on every startup,
even inside a transaction.
- Doesn't scale gracefully as the price book grows, potentially to 4000+
rows.
---
## Approach 2 — `UNNEST` with parallel arrays
Pass each column as a separate Go slice. Postgres unnests them in
parallel into a virtual table, then `INSERT ... SELECT`.
```sql
INSERT INTO ai_model_prices (
provider,
model,
input_price,
output_price,
cache_read_price,
cache_write_price
)
SELECT
UNNEST(@providers::text[]),
UNNEST(@models::text[]),
NULLIF(UNNEST(@input_prices::bigint[]), -1),
NULLIF(UNNEST(@output_prices::bigint[]), -1),
NULLIF(UNNEST(@cache_read_prices::bigint[]), -1),
NULLIF(UNNEST(@cache_write_prices::bigint[]), -1)
ON CONFLICT (provider, model) DO UPDATE SET
input_price = EXCLUDED.input_price,
output_price = EXCLUDED.output_price,
cache_read_price = EXCLUDED.cache_read_price,
cache_write_price = EXCLUDED.cache_write_price,
updated_at = NOW();
```
Go side: flatten rows into six parallel slices.
Use a sentinel (`-1`) for “missing”, since `lib/pq` can't encode `NULL`
into a `bigint[]` element.
```go
providers := make([]string, len(rows))
models := make([]string, len(rows))
inputs := make([]int64, len(rows))
outputs := make([]int64, len(rows))
cacheR := make([]int64, len(rows))
cacheW := make([]int64, len(rows))
for i, r := range rows {
	providers[i] = r.Provider
	models[i] = r.Model
	inputs[i] = -1
	if r.InputPrice != nil {
		inputs[i] = *r.InputPrice
	}
	outputs[i] = -1
	if r.OutputPrice != nil {
		outputs[i] = *r.OutputPrice
	}
	cacheR[i] = -1
	if r.CacheReadPrice != nil {
		cacheR[i] = *r.CacheReadPrice
	}
	cacheW[i] = -1
	if r.CacheWritePrice != nil {
		cacheW[i] = *r.CacheWritePrice
	}
}
return db.UpsertAIModelPrices(ctx, database.UpsertAIModelPricesParams{
	Providers:        providers,
	Models:           models,
	InputPrices:      inputs,
	OutputPrices:     outputs,
	CacheReadPrices:  cacheR,
	CacheWritePrices: cacheW,
})
```
### Pros
- Single round-trip.
### Cons
- The generated `sqlc` params become plain `[]int64`, which can't
represent `NULL`.
---
## Approach 3 — `jsonb_array_elements` over a single `@seed::jsonb` (chosen)
Pass the raw seed JSON as one parameter; let Postgres expand and parse
it.
```sql
INSERT INTO ai_model_prices (
provider,
model,
input_price,
output_price,
cache_read_price,
cache_write_price
)
SELECT
elem->>'provider',
elem->>'model',
(elem->>'input_price')::bigint,
(elem->>'output_price')::bigint,
(elem->>'cache_read_price')::bigint,
(elem->>'cache_write_price')::bigint
FROM jsonb_array_elements(@seed::jsonb) AS elem
ON CONFLICT (provider, model) DO UPDATE SET
input_price = EXCLUDED.input_price,
output_price = EXCLUDED.output_price,
cache_read_price = EXCLUDED.cache_read_price,
cache_write_price = EXCLUDED.cache_write_price,
updated_at = NOW();
```
Go side reduces to:
```go
return db.UpsertAIModelPrices(ctx, seedJSON)
```
### Pros
- Single round-trip.
- NULLs fall out naturally:
  - `(elem->>'cache_write_price')::bigint` becomes `NULL` when the key is
    missing or JSON `null`
  - no sentinels
- The seed is already JSON, so it is passed through without flattening.
- Existing precedent:
  - `jsonb_array_elements` is already used elsewhere in the codebase
### Cons
- Less type-safe at the SQL boundary than `UNNEST`
- Slightly less standard than `UNNEST`
- Readers need familiarity with:
  - `jsonb_array_elements`
  - `->>` extraction syntax
- Postgres pays JSON parse cost
  - negligible at our scale
---
# Decision
We picked Approach 3.
It collapses the round-trips like `UNNEST` does, but without:
- nullable-array workarounds
- sentinel values
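
To show why no sentinel is needed, here is a small sketch of how the "unknown" versus "free" distinction can be carried in the JSON itself. The `priceRow` struct and `seedJSON` helper are hypothetical (the real seed is an embedded file produced by the generator), and the provider and model names are placeholders.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// priceRow mirrors the six columns. Pointer fields marshal to JSON null
// when nil, which the SQL cast then turns into NULL.
type priceRow struct {
	Provider        string `json:"provider"`
	Model           string `json:"model"`
	InputPrice      *int64 `json:"input_price"`
	OutputPrice     *int64 `json:"output_price"`
	CacheReadPrice  *int64 `json:"cache_read_price"`
	CacheWritePrice *int64 `json:"cache_write_price"`
}

// seedJSON encodes rows as the single JSON array parameter.
func seedJSON(rows []priceRow) (string, error) {
	b, err := json.Marshal(rows)
	return string(b), err
}

func main() {
	free, paid := int64(0), int64(300)
	out, err := seedJSON([]priceRow{{
		Provider:    "example-provider",
		Model:       "example-model",
		InputPrice:  &free, // explicit 0: free
		OutputPrice: &paid,
		// cache prices left nil: unknown
	}})
	fmt.Println(out, err)
}
```

An explicit `0` and a `null` stay distinct all the way to the `(elem->>'...')::bigint` cast, which is what the `-1` sentinel in Approach 2 had to emulate.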
The 5s timeout cancelled cold-start ListMCPTools calls before the
agent's 30s connectTimeout could settle, so workspace MCP tools
never reached the LLM. Bump to 35s and scope to ListMCPTools only.
coder/fantasy now fails closed when Anthropic or OpenAI Responses
streams close before their provider terminal events instead of yielding
a successful finish.
This bumps the fantasy replacement to coder/fantasy#33 and teaches chat
error classification to treat those failures as retryable timeout errors
with explicit stream-closed messages.
<img width="875" height="311" alt="image"
src="https://github.com/user-attachments/assets/69c6f7b5-c885-46d2-a88b-b7a2b111bd55"
/>
## Summary
Make Coder's chat agent honest about workspaces that use
`coder_external_agent`. Three behaviors change so the chat stops
pretending it can drive an external workspace through to a usable state
on its own.
<img width="859" height="537" alt="image"
src="https://github.com/user-attachments/assets/0561442b-95f1-4a2d-853c-7e3776114680"
/>
## Problem
External agents are not started by Coder. The user has to run `coder
agent` on their own host with a token Coder generates. Before this
change, the chat agent treated those workspaces like any other:
- `create_workspace` would enqueue a build for an external-agent
template and then wait minutes (~22 worst case) for an agent that was
never going to come up.
- When mid-turn tool calls dialed an external agent that was not
connected, the chat burned the full 30-second dial timeout and returned
generic "the workspace may need to be restarted from the Coder
dashboard" guidance, which is not the action the user can take.
- Nothing told the chat (or the user, through the chat) that the next
action lives outside Coder.
## Fix
Three changes scoped to `coderd/x/chatd/`:
1. **`create_workspace` blocks templates with external agents.** The
tool reads `template_versions.has_external_agent` for the template's
active version and refuses external-agent templates with a message
instructing the chat to pick a different template, or to have the user
create and start the workspace themselves and then attach it.
2. **Attaching an existing external workspace stays open.** No
selection-time gate on attachment; users can still bind a working
external workspace to a chat.
3. **External-agent-aware error handling on connection.** Two
complementary changes both predicated on proven connectivity failures
rather than every dial error:
- **`getWorkspaceConn` preflight and timeout handling.** Before opening
a connection, the cache-miss path reads the agent's status from the
already-loaded row. If the selected agent is external and clearly
offline according to the existing `isAgentUnreachable` helper
(`Disconnected` or `Timeout`, never `Connecting`), it returns an
external-agent-specific error immediately instead of waiting out the
30-second dial timeout. `Connecting` external agents fall through to the
dial so a user who just started the agent on their host can still
succeed in the same turn. The preflight only fires when the agent is
still the latest selected agent for the workspace, so stale-binding
recovery via `dialWithLazyValidation` is unaffected. The post-dial
rewrite is limited to the dial timeout sentinel; stale/no-agent bindings
and non-timeout dial failures preserve their original errors.
- **`waitForAgentReady` timeout-branch rewrite.** The 2-minute retry
loop used by `create_workspace` and `start_workspace` runs unchanged for
all agents. When the loop's outer deadline elapses, the timeout branch
substitutes the external-agent message in place of the raw dial error if
the agent belongs to an external resource.
This applies the same pattern that the cache-hit path of
`getWorkspaceConn` already used (`isAgentUnreachable` returning
`errChatAgentDisconnected`), extended to the cache-miss path and to the
readiness helper, with the external-agent-aware error rewrite layered
only on confirmed offline or timeout paths.
Closes CODAGT-314
Workspace-agent logs emitted while serving chatd-driven requests were
not correlated with the originating chat, which made them hard to
attribute.
This adds agent-side chat context middleware that parses `Coder-Chat-Id`
once, enriches agent access logs and structured handler/background logs,
and adds a chatd bridge log when chat headers are attached to an agent
connection.
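
A rough sketch of the middleware shape, under the assumption that the header value is carried on the request context. `withChatID`, `chatIDFromContext`, and `demo` are hypothetical names, and the sketch treats the ID as an opaque string rather than validating it.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

const chatIDHeader = "Coder-Chat-Id"

type chatIDKey struct{}

// withChatID reads the header once and stores the value on the request
// context so downstream handlers and loggers can attach it.
func withChatID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if id := strings.TrimSpace(r.Header.Get(chatIDHeader)); id != "" {
			r = r.WithContext(context.WithValue(r.Context(), chatIDKey{}, id))
		}
		next.ServeHTTP(w, r)
	})
}

// chatIDFromContext returns the chat ID stored by withChatID, if any.
func chatIDFromContext(ctx context.Context) (string, bool) {
	id, ok := ctx.Value(chatIDKey{}).(string)
	return id, ok
}

// demo sends one request through the middleware and returns what the
// inner handler observed.
func demo(header string) string {
	handler := withChatID(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if id, ok := chatIDFromContext(r.Context()); ok {
			fmt.Fprintf(w, "chat_id=%s", id)
			return
		}
		fmt.Fprint(w, "no chat")
	}))
	req := httptest.NewRequest(http.MethodGet, "/", nil)
	if header != "" {
		req.Header.Set(chatIDHeader, header)
	}
	rec := httptest.NewRecorder()
	handler.ServeHTTP(rec, req)
	return rec.Body.String()
}

func main() {
	fmt.Println(demo("example-chat-id"))
	fmt.Println(demo(""))
}
```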
Closes CODAGT-324