* fix(coderd): Harden Azure identity certificate fetch
- Restrict cert fetches to a host+port allowlist (Microsoft and
DigiCert on 80/443).
- Route requests through a dedicated `http.Client` that resolves
the host once and dials the validated IP directly.
- Reject loopback, private (RFC 1918 / IPv6 ULA), link-local,
multicast, unspecified, CGNAT, benchmarking, and IPv4-mapped
IPv6 addresses.
- Cap the certificate response body at 1 MiB.
- Log the underlying error via slog and return a generic detail
to the caller.
- Add unit tests for the URL allowlist, IP classification, and
dialer.
* fix(coderd/azureidentity): add IPv6 special-use ranges to SSRF blocklist
The extraBlockedNetworks list only contained IPv4 CIDRs. Add IPv6
equivalents that Go's stdlib classification methods do not cover:
- 64:ff9b:1::/48 RFC 8215 NAT64 translation
- 100::/64 RFC 6666 discard-only
- 2001:2::/48 RFC 5180 benchmarking
- 2001:db8::/32 RFC 3849 documentation
IPv6 ranges already handled by stdlib (unchanged):
- ::1/128 (IsLoopback)
- fc00::/7 (IsPrivate, ULA)
- fe80::/10 (IsLinkLocalUnicast)
- ff00::/8 (IsMulticast)
- ::/128 (IsUnspecified)
Closes https://github.com/coder/internal/issues/965
Recent `pg_dump` patch releases (13.22+ / 14.19+ / 15.14+ / 16.10+ /
17.6+) emit `\restrict` / `\unrestrict` psql meta-commands at the head
and tail of schema dumps. These broke both `sqlc` and our
`scripts/migrate-test` schema-equality check. PR #19696 worked around it
by pinning `pg_dump` to a Docker image.
This change unpins the workaround now that `sqlc` handles the
meta-commands:
* Bumps the coder/sqlc fork pin to [`337309b` on
coder/sqlc:main](https://github.com/coder/sqlc/commit/337309bfb9524f38466a5090e310040fc7af0203),
the merge of upstream v1.31.1 (coder/sqlc#6). v1.31.1 includes
[sqlc-dev/sqlc#4390](https://github.com/sqlc-dev/sqlc/pull/4390), the
upstream `\restrict` / `\unrestrict` parser fix. Updated in three places
that pin the fork SHA: `flake.nix` (`sqlc-custom`),
`.github/actions/setup-sqlc/action.yaml`, and the
`dogfood/coder/ubuntu-{22,26}.04` Dockerfiles. The flake's `sha256` /
`vendorHash` are reset to `pkgs.lib.fakeSha256`; Nix will surface the
real hashes on first build, per the existing comment block.
* Reverts #19696's Docker pin in `coderd/database/dbtestutil/db.go`.
Local `pg_dump` (13+) and the `postgres:13` Docker fallback both work
again.
* Strips `\restrict` / `\unrestrict` lines in `normalizeDump` so
`scripts/migrate-test`'s schema comparison is stable across `pg_dump`
versions (the token in those lines is randomized per run).
`TestNormalizeDumpStripsRestrict` locks the behavior in.
* Regenerates with v1.31.1, picking up the version stamp and one
upstream correctness fix in `DeleteLicense`
([sqlc-dev/sqlc#4383](https://github.com/sqlc-dev/sqlc/pull/4383): don't
shadow the input parameter when scanning a single-column return).
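For illustration, stripping the meta-command lines might look like the sketch below. This is not the actual `normalizeDump` code; `stripRestrict` is a hypothetical helper, and it assumes each meta-command sits on its own line followed by a space and the token.

```go
package main

import (
	"fmt"
	"strings"
)

// stripRestrict drops psql \restrict / \unrestrict meta-command lines so
// dumps taken with different random tokens compare equal.
func stripRestrict(dump string) string {
	lines := strings.Split(dump, "\n")
	kept := make([]string, 0, len(lines))
	for _, line := range lines {
		if strings.HasPrefix(line, `\restrict `) || strings.HasPrefix(line, `\unrestrict `) {
			continue
		}
		kept = append(kept, line)
	}
	return strings.Join(kept, "\n")
}

func main() {
	a := "\\restrict tokenA\nCREATE TABLE t ();\n\\unrestrict tokenA\n"
	b := "\\restrict tokenB\nCREATE TABLE t ();\n\\unrestrict tokenB\n"
	fmt.Println(stripRestrict(a) == stripRestrict(b))
}
```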
The soft-delete cleanup trigger (`delete_deleted_user_resources`)
removed `api_keys`, `user_links`, and `user_secrets` but left
`organization_members` rows intact. When a new user was created with a
previously-deleted user's email, both user IDs had org membership rows
in the same organization, producing duplicate-email members.
Extend the trigger to also delete `organization_members` for the
soft-deleted user. This cascades through the existing
`trigger_delete_group_members_on_org_member_delete`, which cleans up
group memberships automatically. The migration backfills by removing
zombie rows for already-deleted users.
Fixes ENG-831
> [!NOTE]
> 🤖 Generated by Coder Agents
<details>
<summary>Implementation notes</summary>
**Root cause**: `GetOrganizationIDsByMemberIDs` does not join on
`users.deleted = false`, so stale org membership rows for soft-deleted
users were visible to internal queries. Even the filtered queries
(`OrganizationMembers`, `PaginatedOrganizationMembers`) could surface
duplicate emails when a new active user reused a deleted user's email.
**What changed**:
- Migration 000491 extends `delete_deleted_user_resources()` to `DELETE
FROM organization_members WHERE user_id = OLD.id`
- Backfill removes existing zombie org memberships for soft-deleted
users
- `TestOrgMembersSoftDeleteTrigger` covers org membership removal, raw
row cleanup, and cascading group membership cleanup
</details>
`TestPatchChatMessage/ChangesModel` hardcoded `"openai"` as the provider
for the override model config. After #25171, the shared chat test
harness registers a single `"openai-compat"` provider by default, so
calling `createAdditionalChatModelConfig(..., "openai", ...)` fails with
HTTP 400 `Chat provider is not configured` before the test can exercise
the model-change path. The subtest was added in #25084 after #25171 was
reviewed, so the harness change and the new hardcoded provider only met
on `main`.
Use `defaultModel.Provider` so the override always matches whatever
provider the harness registered. This mirrors every other call site of
`createAdditionalChatModelConfig` in the file.
Closes https://github.com/coder/internal/issues/1530
Replaces the per-agent Go-side template-version filter in
`handleAuthInstanceID` with a purpose-built SQL query.
`GetWorkspaceBuildAgentsByInstanceID` joins `workspace_agents ->
workspace_resources -> workspace_builds -> provisioner_jobs ->
workspaces` and excludes:
- non-`workspace_build` provisioner jobs (template-version-import,
dry-run)
- deleted agents and sub-agents
- deleted workspaces
The handler:
- drops the per-candidate `GetWorkspaceResourceByID` /
`GetProvisionerJobByID` lookups
- drops the `provisioner_jobs.input` JSON parsing and the follow-up
`GetWorkspaceBuildByID` call
- compares `latestHistory.ID` against `selected.WorkspaceBuildID`
returned directly from the query
- preserves the existing recycled-instance safety check and matching
response codes
One intentional behavior tightening: agents whose workspace is deleted
now return 404 (previously they could reach the recycled-instance check
and return 400, or 200 if the stale build was still latest). This
matches the existing token-auth path, which already refuses to
authenticate against deleted workspaces.
The original `GetWorkspaceAgentsByInstanceID` query is intentionally
untouched. It remains the generic raw lookup used elsewhere in tests and
helpers.
The dbauthz wrapper for the new query uses the system-read fast path
with `fetchWithPostFilter` for non-system reads, with `RBACObject()`
delegating to the embedded `WorkspaceTable`.
Tests:
- new `TestGetWorkspaceBuildAgentsByInstanceID` covering newest-first
ordering, exclusion of deleted/sub agents, exclusion of template-import
and dry-run jobs, and exclusion of deleted workspaces
- new dbauthz mock test for `GetWorkspaceBuildAgentsByInstanceID`
- new `TestPostWorkspaceAuthAWSInstanceIdentity/RecycledInstanceID`
exercising the recycled-instance rejection branch (HTTP 400 when the
agent's build is no longer latest)
- existing `TestPostWorkspaceAuth{AWS,Azure,Google}InstanceIdentity`
continue to cover the handler end to end (including the template-version
+ workspace-build same-instance-ID scenario via
`setupInstanceIDWorkspace`)
> Mux is acting on Mike's behalf.
Editing a previous user message and selecting a different model in the
picker silently kept using the original model: the selection was dropped
on the frontend, in the SDK, and in the backend, so both the replacement
user message and the assistant turn that followed ran against the old
model.
Plumb the selected model through all three layers (`AgentChatPage`,
`codersdk.EditChatMessageRequest`, `chatd.EditMessageOptions` /
`Server.EditMessage`), defaulting to the original message's model when
the client does not specify one. The existing `InsertChatMessages` CTE
already advances `chats.last_model_config_id` when the inserted
message's model differs, so the assistant turn picks up the new
selection without further changes. The new model is validated inside the
transaction, so an unknown ID rolls the edit back and returns a 400
`Invalid model config ID.`, mirroring the `SendMessage` path.
Refs: CODAGT-345
This change was generated by a Coder agent.
<details>
<summary>Implementation plan</summary>
# CODAGT-345: Editing an earlier message cannot change model
## Problem
When editing a previous user message in a chat, the user can change the
model in the model picker, but the backend keeps using the original
message's model. The model selection is dropped at three layers:
1. **Frontend:** `AgentChatPage.tsx`'s edit branch builds an
`EditChatMessageRequest` that omits `model_config_id`. The new-message
branch (a few lines below) does include it.
2. **SDK:** `codersdk.EditChatMessageRequest` has no `ModelConfigID`
field at all.
3. **Backend:** `chatd.EditMessageOptions` has no model field, and
`Server.EditMessage` always copies the original message's
`ModelConfigID` into the replacement message.
Once the replacement user message is inserted with the original model,
the `InsertChatMessages` CTE leaves `chats.last_model_config_id`
unchanged, so the assistant turn that follows runs against the old
model.
## Fix
Plumb the selected model through all three layers, defaulting to the
original message's model when the client doesn't override it. This
mirrors the `SendMessage` path, which already accepts a
`model_config_id` and validates it via
`resolveSendMessageModelConfigID`.
### Backend
- `codersdk/chats.go`: add `ModelConfigID *uuid.UUID` to
`EditChatMessageRequest`.
- `coderd/x/chatd/chatd.go`:
- Add `ModelConfigID uuid.UUID` to `EditMessageOptions`.
- In `EditMessage`, after fetching the edited message, resolve the
model: if `opts.ModelConfigID != uuid.Nil`, validate it exists with
`tx.GetChatModelConfigByID` (using `chatdModelConfigLookupContext`),
otherwise keep `editedMsg.ModelConfigID.UUID`. Pass the resolved ID into
`newChatMessage(...)`.
- Reuse the existing `ErrInvalidModelConfigID` sentinel.
- `coderd/exp_chats.go` (`patchChatMessage`):
- Read `req.ModelConfigID` (nil-safe), pass into
`chatd.EditMessageOptions`.
- Add a `case xerrors.Is(editErr, chatd.ErrInvalidModelConfigID)` arm
returning 400 `Invalid model config ID.`, matching the
`postChatMessages` handler.
### Frontend
- `site/src/pages/AgentsPage/AgentChatPage.tsx`:
- In the edit branch, set `model_config_id: effectiveSelectedModel ||
undefined` on the `EditChatMessageRequest`.
- On success, persist the chosen model to `lastModelConfigIDStorageKey`
so the next chat from this browser keeps the same default. Mirrors the
new-message branch.
### Generated
- `make site/src/api/typesGenerated.ts` and `make
coderd/apidoc/swagger.json` produce the updated `EditChatMessageRequest`
schema in `typesGenerated.ts`, `coderd/apidoc/{docs.go,swagger.json}`,
and `docs/reference/api/{chats.md,schemas.md}`.
## Tests
- `coderd/x/chatd/chatd_test.go`:
- `TestEditMessageWithModelConfigOverride`: edit with a different model
-> replacement message and `chats.LastModelConfigID` use the new model.
- `TestEditMessagePreservesModelConfigByDefault`: edit without
`ModelConfigID` -> original model preserved.
- `TestEditMessageRejectsUnknownModelConfig`: passes a random UUID ->
`ErrInvalidModelConfigID`, original message still present,
`LastModelConfigID` unchanged (rollback).
- `coderd/exp_chats_test.go` (under `TestPatchChatMessage`):
- `ChangesModel`: end-to-end via SDK; `edited.Message.ModelConfigID` and
`chat.LastModelConfigID` both match the new model.
- `InvalidModelConfigID`: random UUID -> 400 `Invalid model config ID.`.
</details>
Chat tests previously constructed a real `openai` provider with a fake
API key and no `BaseURL`, so background title generation hit
`api.openai.com` and timed out under `-race`. The same root cause
produced several distinct flakes: title regeneration races with
synchronous `UpdateChat`/`ProposeChatTitle`, and pagination races
against `updated_at` bumps from real-network processing.
This moves the fake OpenAI-compatible provider and the chat-settle wait
into first-class `coderdtest` capabilities.
`coderd.Options.ChatProviderAPIKeys` is the new seam tests use to
redirect chat traffic to a local `httptest.Server`.
`coderdtest.WaitForChatSettled` replaces per-test waiters and drains
tracked chat-daemon work after the chat row leaves `pending`/`running`.
The `newChatClient*` constructors funnel through one options builder
that installs the fake provider before the coderd test server so cleanup
ordering is deterministic.
Closes https://github.com/coder/internal/issues/1528 & Closes ENG-2659
Closes https://github.com/coder/internal/issues/1480 & Closes CODAGT-359
Closes https://github.com/coder/internal/issues/1507 & Closes CODAGT-368
Relates to https://github.com/coder/internal/issues/1397 & Relates to
CODAGT-374
Adds an Agents General setting to require Cmd/Ctrl+Enter before sending
chat messages. When enabled, plain Enter inserts a newline in agent chat
inputs while the send button remains available.
The preference is now persisted server-side through
`/api/v2/users/{user}/preferences`, alongside the existing user
preference settings, and is applied to both the create-agent input and
existing chat composer. Storybook and API coverage verify the setting,
keyboard behavior, validation, and persistence.
<details>
<summary>Coder Agents notes</summary>
Generated by Coder Agents from a Slack request. Dogfooded with
agent-browser against the Storybook settings and chat input stories.
</details>
## Problem
In `coderd/x/chatd/chatd.go` `runChat`, workspace MCP discovery is gated
on `chat.WorkspaceID.Valid` at the start of each turn. New chats that
bind their workspace mid-turn (via `create_workspace` or
`start_workspace`) get an empty workspace tool list on the first step,
and the model falls back to `execute` (bash) because no workspace MCP
tools are advertised.
**Repro:** new chat → "create a workspace and use MCP tools". No
`/api/v0/mcp/tools` request hits the agent on turn 1; turn 2 in the same
chat works fine.
## Fix
- Add a `PrepareTools` callback to `chatloop.RunOptions`, analogous to
`PrepareMessages`. It is invoked once before each LLM step with the
current tool list. When it returns non-nil, the chatloop replaces
`opts.Tools`, rebuilds the per-step tool definitions, and appends new
tool names to `opts.ActiveTools` so newly injected tools are callable
immediately.
- Wire `PrepareTools` in `runChat` to trigger workspace MCP discovery
the first time the chat snapshot reports a valid `WorkspaceID`. The
previous top-of-turn discovery path is unchanged for chats that start
with a workspace.
- Extract the discovery logic into `Server.discoverWorkspaceMCPTools` so
the top-of-turn and mid-turn paths share identical behavior (cache,
agent resolution, `ListMCPTools` timeout, invalidation).
Mid-turn discovery stays disabled in plan-mode turns and Explore
subagents, matching the existing top-of-turn gate. The
`workspaceMCPDiscovered` flag prevents redundant dials after the first
successful discovery.
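A rough sketch of the per-step hook, under stated assumptions: tools are reduced to their names, `runOptions`, `prepareStep`, and `mergeNewNames` are hypothetical, and an empty `ActiveTools` is assumed to mean "no restriction" and is left untouched. The real `chatloop` types and merge rules may differ.

```go
package main

import "fmt"

// runOptions is a simplified stand-in for chatloop.RunOptions.
type runOptions struct {
	Tools        []string
	ActiveTools  []string
	PrepareTools func(current []string) []string
}

// mergeNewNames appends names that appear in next but not in prev to active.
// Tools that already existed but were inactive stay inactive.
func mergeNewNames(active, prev, next []string) []string {
	if len(active) == 0 {
		return active // assumption: empty means every tool is active
	}
	seen := make(map[string]bool, len(prev)+len(active))
	for _, n := range prev {
		seen[n] = true
	}
	for _, n := range active {
		seen[n] = true
	}
	out := append([]string(nil), active...)
	for _, n := range next {
		if !seen[n] {
			out = append(out, n)
			seen[n] = true
		}
	}
	return out
}

// prepareStep runs the hook once before a step and swaps in the new tool list.
func prepareStep(opts *runOptions) {
	if opts.PrepareTools == nil {
		return
	}
	next := opts.PrepareTools(opts.Tools)
	if next == nil {
		return // nil means "no change"
	}
	opts.ActiveTools = mergeNewNames(opts.ActiveTools, opts.Tools, next)
	opts.Tools = next
}

func main() {
	opts := &runOptions{
		Tools:       []string{"execute", "create_workspace"},
		ActiveTools: []string{"execute"},
		PrepareTools: func(current []string) []string {
			return append(append([]string(nil), current...), "workspace_mcp_tool")
		},
	}
	prepareStep(opts)
	fmt.Println(opts.Tools, opts.ActiveTools)
}
```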
## Tests
- `coderd/x/chatd/chatloop/chatloop_test.go`: two new
`TestRun_PrepareTools*` cases covering injection on the next step and
active-set merging when `ActiveTools` is non-empty.
- `coderd/x/chatd/chatd_test.go`:
`TestRunChat_WorkspaceMCPDiscoveryAfterMidTurnCreateWorkspace` drives
`runChat` through a `create_workspace` tool call against a real Postgres
+ mocked agent conn and asserts the second streamed LLM request
advertises the workspace MCP tool. Verified that the test fails (and
pinpoints the missing tool) when the `PrepareTools` wiring is disabled.
## Validation
```
go test ./coderd/x/chatd/chatloop/... -count=1
go test ./coderd/x/chatd/... -count=1
make lint/emdash
```
<details>
<summary>Decision log</summary>
- Chose a per-step `PrepareTools` callback over mutating `opts.Tools` in
place because `chatloop.Run` builds the `fantasy.Tool` definitions once
at start; a hook is required to let the LLM see new tools on the next
step.
- Returned `[]fantasy.AgentTool` (not also active-tool-names) and let
the chatloop derive name merges via `mergeNewToolNames`. This avoids
leaking plan-mode gating decisions into the callback contract.
- Kept the existing top-of-turn discovery path so chats that already
have a workspace at turn start pay no extra latency.
- Skipped reusing `ReloadMessages` (history reload) since this is purely
a tool-availability concern; coupling it to a history reload would
defeat the chatloop cache prefix optimizations.
</details>
---
_This pull request was generated by Coder Agents._
Moves the `coderd_agents_first_connection_seconds` histogram from the
polling-based `prometheusmetrics.Agents()` loop to the event-driven
`agentConnectionMonitor.init()` path. The metric is now recorded exactly
once when an agent first connects over the RPC websocket, instead of
being retroactively computed each polling tick.
The `username` and `workspace_name` labels are removed to reduce
cardinality; only `template_name` and `agent_name` are retained.
Adds unit tests covering both the happy path (first connection recorded)
and the negative-duration guard (clock skew logs a warning, no sample
emitted).
Stream advisor output into the advisor tool card while the nested
advisor call is still running.
This keeps the advisor implementation intentionally advisor-specific:
the parent model still receives the same final structured tool result,
while the frontend receives transient `tool-result.result_delta` parts
to render partial advisor text in the expanded card. The final persisted
chat history remains unchanged.
Refs CODAGT-322.
Generated by Coder Agents.
<details>
<summary>Implementation plan</summary>
- Publish advisor text deltas from the nested `chatloop.Run` via
`RunAdvisorOptions.OnAdviceDelta`.
- Forward those deltas through `chatadvisor.Tool` with the parent
advisor tool call ID.
- Emit transient `ChatMessagePartTypeToolResult` websocket parts with
`ResultDelta` from `chatd`.
- Add `result_delta` to the generated tool-result TypeScript variant.
- Accumulate tool result deltas in frontend stream state and keep the
tool running until the final result arrives.
- Render streamed advisor advice in the existing advisor card using
streaming markdown mode, while retaining the updated advisor UI.
</details>
The `TimedOutAgentCacheHit`, `CacheHitHealthyAgent`, and
`CacheHitDBError` subtests of `TestGetWorkspaceConn_StatusCheck` built
their `WorkspaceAgent` timestamps with `time.Now()` in the parent test's
slice literal and then ran the actual check against the server's real
wall clock (`quartz.NewReal()`). On slow Windows CI runners, more than
`agentInactiveDisconnectTimeout` (30s) of wall time can elapse between
slice construction and the parallel subtest body. In that window, the
cached "healthy" agent gets reclassified as disconnected by
`agentDisconnectedFor`, and `CacheHitHealthyAgent` fails with
`errChatAgentDisconnected` instead of returning the cached connection.
Build each agent inside the subtest with `quartz.NewMock(t)` and feed
the same clock into the `Server` so the agent timestamps and the status
math share a single frozen `now`. This matches the pattern already used
by `TestGetWorkspaceConn_DialTimeoutDisconnectedRecoveryThreshold` in
the same file.
Closes https://github.com/coder/internal/issues/1522
<details>
<summary>Verification</summary>
Inserting `time.Sleep(35 * time.Second)` at the top of each subtest's
body reliably reproduces the original failure
(`errChatAgentDisconnected` on `CacheHitHealthyAgent`) on the parent
commit and passes with this change. After removing the synthetic sleep,
`go test ./coderd/x/chatd -run TestGetWorkspaceConn_StatusCheck
-count=50` passes cleanly.
</details>
> Generated by Coder Agents on behalf of the assignee.
Co-authored-by: Coder Agents <noreply@coder.com>
Fixes
[CODAGT-372](https://linear.app/codercom/issue/CODAGT-372/coderdazureidentity-testvalidateregular-fails-on-macos).
Closes coder/internal#101.
## Problem
`coderd/azureidentity TestValidate/regular` fails on macOS with:
```
verify signature:
github.com/coder/coder/v2/coderd/azureidentity.Validate
/Users/runner/work/coder/coder/coderd/azureidentity/azureidentity.go:75
- x509: “metadata.azure.com” certificate is not standards compliant
```
When `crypto/x509.VerifyOptions.Roots` is `nil`, Go's verifier on
macOS/iOS falls back to the system verifier (`systemVerify` in
`crypto/x509/root_darwin.go`), which delegates to Apple's
`SecTrustEvaluateWithError`. Apple's framework enforces stricter
standards-compliance checks than Go's pure-Go verifier and rejects some
otherwise valid Azure instance-identity leaf certificates with
`errSecCertificateIsNotStandardsCompliant`, surfaced as the `not
standards compliant` error.
The test had been skipped on darwin since #12979 (April 2024) as a
workaround.
## Fix
- Embed the three root CAs that Azure instance-identity certificates
ultimately chain to:
- DigiCert Global Root G2
- DigiCert Global Root G3
- Baltimore CyberTrust Root (kept for historical chains via `Microsoft
RSA TLS CA 01/02`)
- In `Validate`, populate `options.Roots` from those embedded roots when
the caller does not supply its own pool. Because `Roots != nil`, Go no
longer takes the `systemVerify` path on darwin and uses the pure-Go
verifier on all platforms.
- Remove the `runtime.GOOS == "darwin"` skip from `TestValidate`.
- Add `TestEmbeddedRoots` to guard against future regressions in the
embedded root list (parses each PEM, asserts self-signed, requires all
three named roots).
The caller's existing `Intermediates` handling is unchanged. Tests that
pass their own `Roots` (e.g. `coderdtest.NewAzureInstanceIdentity`) are
unaffected.
## Verification
On Linux:
```
$ go test ./coderd/azureidentity/ -race -count=1 -v
=== RUN TestValidate
=== RUN TestValidate/regular
=== RUN TestValidate/govcloud
=== RUN TestValidate/rsa
--- PASS: TestValidate (0.00s)
--- PASS: TestValidate/regular (0.00s)
--- PASS: TestValidate/rsa (0.00s)
--- PASS: TestValidate/govcloud (0.00s)
=== RUN TestEmbeddedRoots
--- PASS: TestEmbeddedRoots (0.00s)
=== RUN TestExpiresSoon
--- SKIP: TestExpiresSoon (0.00s)
PASS
ok github.com/coder/coder/v2/coderd/azureidentity 1.020s
```
The `test-go-pg` job on `macos-latest` in CI is the authoritative
confirmation of the fix on macOS; previously it would have failed
`TestValidate/regular` had the skip been removed.
<details>
<summary>Why this is the correct fix</summary>
From `/usr/local/go/src/crypto/x509/verify.go`:
```go
	// Use platform verifiers, where available, if Roots is from SystemCertPool.
	if runtime.GOOS == "windows" || runtime.GOOS == "darwin" || runtime.GOOS == "ios" {
		systemPool := systemRootsPool()
		if opts.Roots == nil && (systemPool == nil || systemPool.systemPool) {
			return c.systemVerify(&opts)
		}
		...
	}
```
Setting `opts.Roots` to any non-nil, non-system pool deterministically
routes verification through Go's pure-Go verifier, bypassing Apple's
stricter compliance checks. The embedded roots are sufficient to
validate every chain we currently care about, since every intermediate
in `Certificates` ultimately issues to one of the three embedded roots.
</details>
> Generated by Coder Agents. Reviewed manually.
Fixes [CODAGT-367](https://linear.app/codercom/issue/CODAGT-367).
`TestResolveExploreToolSnapshot/*` flaked on CI (Linux and Windows) with
`context deadline exceeded` on the `GetMCPServerConfigsByIDs` call
inside `resolveExploreToolSnapshot`.
Each test setup called `server.CreateChat` twice with `MCPServerIDs` set
to fake `.example.com` URLs. `CreateChat` marks the chat pending and
calls `signalWake`, which causes the chatd background `acquireLoop` to
pick the chat up. That goroutine then dialed the fake MCP URLs
(NXDOMAIN, slower on Windows) and made an OpenAI request with the dbgen
default test key (401). Under CI load, that activity racing the 4
parallel subtests' `GetMCPServerConfigsByIDs` calls was enough to exceed
the 25s test context deadline. The failure logs in the issue showed both
side effects firing in the same job.
`resolveExploreToolSnapshot` only reads `ID`, `MCPServerIDs`,
`PlanMode`, `ParentChatID`, and `Mode` off the parent argument, so the
chats do not need to be persisted. Build them as in-memory
`database.Chat` values instead. The MCP server configs remain in the DB
because the function still queries them via `GetMCPServerConfigsByIDs`.
Verified locally with `go test ./coderd/x/chatd -run
TestResolveExploreToolSnapshot -count=100 -race` (passes, ~5s total) and
the surrounding `TestResolve*` / `TestCreateChildSubagentChat*` /
`TestSpawnAgent_Explore*` tests.
---
_Made by Coder Agents on behalf of @ibetitsmike. [Linear
session](https://linear.app/codercom/issue/CODAGT-367/flake-testresolveexploretoolsnapshot#agent-session-0730f3fe)._
Closes https://github.com/coder/coder/issues/13112
**Breaking Change**: Removed status code `StatusNotModified` when no
diffs occur in a patch. Now the patch is always applied and a template
is always returned.
The test built a `Retry-After` HTTP-date with
`time.Now().Add(3*time.Second).UTC().Format(http.TimeFormat)`, then
asserted that the parsed `RetryAfter` was `>= 2s`. `http.TimeFormat` has
second precision, so `Format()` truncates up to ~1s. Combined with the
small elapsed time between formatting in the test and `time.Until()` in
production, the value could land just under `offset-1s` (1.997s observed
in CI), failing the lower bound.
Round the formatted target up to the next whole second so the parsed
deadline is never earlier than `now+offset`, and assert against a
symmetric `[offset-1s, offset+1s]` window.
Closes
[CODAGT-365](https://linear.app/codercom/issue/CODAGT-365/flake-testclassify-parsesretryafterhttpdate)
Refs https://github.com/coder/internal/issues/1512
<sub>Created by [Coder Agents](https://coder.com/docs/agent).</sub>
Co-authored-by: Coder Agents <coderagents@coder.com>
When OpenAI's Responses API returns `Previous response with id ... not
found` for a chained turn, classify it as a `ChainBroken` retry, clear
`previous_response_id`, exit chain mode, reload full history, and let
`chatretry` retry. Self-heals chats whose anchor was poisoned before
#25074 stopped truncated streams from being persisted as a successful
turn with a stored response id.
The new state is exposed via the existing
`coderd_chatd_stream_retries_total` counter as a
`chain_broken="true"|"false"` label. Aggregating queries (`sum`, `rate`
over `provider`/`model`/`kind`) keep working without changes; raw-series
matchers without aggregation will now see two series per `(provider,
model, kind)` where they previously saw one. The metric is internal-only
so the blast radius should be small, but if you have dashboards that
index by exact label matchers without aggregation they will need an
extra `sum` or an explicit `chain_broken` selector.
> 🤖 This PR was created with the help of Coder Agents, and was reviewed by a human 🧑💻
> Mux is acting on Mike's behalf.
Changes chat turn-end summaries into compact status labels for the
cached `last_turn_summary` and successful web push body.
Uses a structured-output model call for successful turns, requiring a
2-5 word `label` and validating it to reject agent-centric phrasing.
Pending and requires-action states keep deterministic status labels.
Removes the earlier deterministic tool-signal pipeline in favor of the
smaller structured-output path.
Extend the delete_deleted_user_resources() trigger so that secrets
belonging to a soft-deleted user are removed in the same transaction as
the existing api_keys and user_links cleanup.
user_secrets.user_id has ON DELETE CASCADE, but Coder soft-deletes users
by flipping users.deleted rather than removing the row, so the foreign key
cascade never fires and secrets would otherwise survive deletion.
Assisted by Coder Agents.
Use typed atomics (atomic.Int64, atomic.Int32, etc.) in test files to prevent
mixing atomic and non-atomic access on the same value, guarantee 64-bit
alignment on 32-bit platforms, and provide a cleaner API.
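A small illustration of the pattern, with a hypothetical `counter` type and `countConcurrently` helper: because the field is an `atomic.Int64`, there is no way to read or write it without going through the atomic methods.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// counter uses a typed atomic, so a plain non-atomic read or write
// of the value cannot be written by accident.
type counter struct {
	calls atomic.Int64
}

// countConcurrently increments the counter from n goroutines and
// returns the final value.
func countConcurrently(n int) int64 {
	var c counter
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.calls.Add(1)
		}()
	}
	wg.Wait()
	return c.calls.Load()
}

func main() {
	fmt.Println(countConcurrently(100))
}
```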
TestPromoteQueuedWhileRequiresActionMixedTools has flaked three times across
Windows and Ubuntu CI runners since 2026-05-06; local repro on the dev
workspace has not surfaced it. The May 8 Ubuntu log shows all four
PromoteQueued post-TX pubsub publishes reaching pg_notify, yet the test still
times out 25s later, so the failure is downstream between the subscriber's
listener and the test's events channel. Adds three Debug-level markers in
chatd.go (no logic change) plus two t.Logf markers in the test's reader so
the next CI occurrence pins down exactly which step failed.
Closes ENG-2645
Closes coder/internal#1523
Parallel subtests in `coderd/x/chatd` reused a parent test context with
a `testutil.WaitLong` deadline, so the context could expire before a
subtest was scheduled under load. That made the subagent lifecycle tools
return plain-text context errors instead of the expected JSON payload,
causing flaky JSON unmarshal failures.
Create fresh `chatdTestContext` values inside the affected parallel
subtests and add `chatdTestContext` to the `paralleltestctx` custom
function list so this pattern is caught by `make lint`.
Closes https://github.com/coder/internal/issues/1494
## Summary
Adds a `stop_workspace` tool to chatd so the model can recover from the
"workspace running but agent dead" failure mode (e.g. an OOM that leaves
the workspace running but the agent unreachable) by stopping and then
starting the workspace.
<img width="924" height="742" alt="image"
src="https://github.com/user-attachments/assets/279dedb6-6e29-4fe1-8754-3a1f01e538bf"
/>
## What changed
**New `stop_workspace` chatd tool**
(`coderd/x/chatd/chattool/stopworkspace.go`). Mirrors `start_workspace`:
shares `WorkspaceMu` to serialize with create/start, waits for any
in-progress build before issuing a stop, and is idempotent only after a
successful Stop transition. Failed stop builds re-attempt rather than
reporting success.
**New `chatStopWorkspace` coderd hook** (`coderd/exp_chats.go`). Mirrors
`chatStartWorkspace` minus the `RequireActiveVersion` gate. Stop should
not be blocked by template version policy.
**Differentiated recovery sentinels** (`coderd/x/chatd/chatd.go`).
`errChatAgentDisconnected` instructs the model to call `stop_workspace`
then `start_workspace`. `errChatDialTimeout` instructs a single retry,
then user escalation if it repeats. The previous single message
conflated transient and persistent failures.
**Two-signal recovery gate.** Recovery is only surfaced when a tool call
times out *and* a fresh DB read of the latest workspace agent says
`Disconnected`. The previous draft escalated on the DB read alone, which
would fire on a 30-second heartbeat blip (e.g. agent respawn) and prompt
a destructive stop/start unnecessarily.
**Cache-hit disconnected handling** now clears the cache and retries a
fresh dial before escalating, rather than returning the recovery
sentinel immediately. Latest-agent classification uses
`GetWorkspaceAgentsInLatestBuildByWorkspaceID` instead of the chat's
bound `AgentID`, so stale bindings after a rebuild don't misclassify.
**Shared chattool helpers** in `coderd/x/chatd/chattool/chattool.go`:
`latestWorkspaceBuildAndJob`, `publishBuildBinding`,
`provisionerJobTerminal`. Applied to both `start_workspace` and
`stop_workspace`.
## Notes
- Reverts an earlier draft that widened `ask_user_question` to root
standard turns. Plan-mode-only behavior is restored.
- The `stop_workspace` tool currently renders via the generic chat
tool-call UI. A follow-up frontend PR will prettify the `stop_workspace`
tool and style it like the `start_workspace` tool.
- Never-connected (`Timeout` status) agents are intentionally excluded
from recovery. They indicate template or startup failure, not the
running-but-dead case this PR targets.
Closes CODAGT-315
Closes
[CODAGT-317](https://linear.app/codercom/issue/CODAGT-317/pr-workspaces-sometimes-require-name-confirmation-to-delete).
## Problem
The `/agents` archive-and-delete molly-guard (typing the workspace name)
was firing for chats that had clearly created their own workspace. The
heuristic in `resolveArchiveAndDeleteAction` decides whether
confirmation is needed by comparing the workspace's `created_at` against
the chat's `created_at`:
```ts
return new Date(workspaceCreatedAt) >= new Date(chatCreatedAt);
```
That assumption breaks for **prebuilt workspaces**.
`ClaimPrebuiltWorkspace` rewrites `owner_id`, `name`, `updated_at`,
`last_used_at`, etc., but **never touches `created_at`**, which still
reflects when the prebuild was provisioned by the reconciler, often
hours before the chat exists. Result: every prebuild-claimed workspace
looks pre-existing, so the molly-guard fires.
Concrete example from a real chat:
| Field | Value |
|---|---|
| `chat.created_at` | `2026-05-07T15:12:23Z` |
| `workspace.created_at` (provision) | `2026-05-07T14:22:24Z` |
| `latest_build.created_at` (claim) | `2026-05-07T15:19:09Z` |
`14:22:24 < 15:12:23` so `isWorkspaceAutoCreated` returned false even
though the chat issued the claim.
## Fix (frontend-only)
Derive the moment a workspace was acquired from existing build history
rather than relying on `workspace.created_at`:
- Build #1 initiator = prebuilds system user → workspace was a prebuild
→ use `build_2.created_at` (the claim build) as the acquisition time.
- Build #1 initiator = real user → workspace was created from scratch →
use `workspace.created_at` (unchanged behavior).
- Unclaimed prebuild or no build history → return `null` (force
confirmation; safe degradation for a destructive flow).
The resolver fetches the build list via the existing
`getWorkspaceBuilds` endpoint when the dialog might fire. No new column,
no migration, no schema change. Works retroactively for all existing
claimed prebuilds; no backfill needed.
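
The decision rule above is language-independent. The actual change lives in the TypeScript frontend; the sketch below restates the rule in Go purely for illustration. The `build` type, `acquiredAt`, `mustParse`, and the placeholder `prebuildsSystemUserID` value are all hypothetical, and the timestamps reuse the concrete example from the table.

```go
package main

import (
	"fmt"
	"time"
)

// prebuildsSystemUserID is a placeholder; the real value is a UUID constant.
const prebuildsSystemUserID = "prebuilds-system-user"

// build is a hypothetical, trimmed-down view of a workspace build.
type build struct {
	Number      int
	InitiatorID string
	CreatedAt   time.Time
}

// acquiredAt returns when the workspace was acquired. ok is false when the
// answer is unknown, in which case the caller should force confirmation.
func acquiredAt(workspaceCreatedAt time.Time, builds []build) (at time.Time, ok bool) {
	var first, second *build
	for i := range builds {
		switch builds[i].Number {
		case 1:
			first = &builds[i]
		case 2:
			second = &builds[i]
		}
	}
	switch {
	case first == nil: // no build history
		return time.Time{}, false
	case first.InitiatorID != prebuildsSystemUserID: // created from scratch
		return workspaceCreatedAt, true
	case second == nil: // unclaimed prebuild
		return time.Time{}, false
	default: // claimed prebuild: the claim build marks acquisition
		return second.CreatedAt, true
	}
}

func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	chatCreated := mustParse("2026-05-07T15:12:23Z")
	wsCreated := mustParse("2026-05-07T14:22:24Z")
	builds := []build{
		{Number: 1, InitiatorID: prebuildsSystemUserID, CreatedAt: wsCreated},
		{Number: 2, InitiatorID: "some-user", CreatedAt: mustParse("2026-05-07T15:19:09Z")},
	}
	at, ok := acquiredAt(wsCreated, builds)
	// The claim (15:19:09) is not before the chat (15:12:23), so the
	// workspace counts as auto-created and no confirmation is needed.
	fmt.Println(ok && !at.Before(chatCreated)) // prints "true"
}
```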
The prebuilds system user UUID is exposed via
`codersdk.PrebuildsSystemUserID` and typegen'd to `typesGenerated.ts`.
`coderd/database.PrebuildsSystemUserID` parses that constant via
`uuid.MustParse` so the two cannot drift; if the codersdk literal ever
changes, package init fails fast.
## History
The first draft of this PR added a `workspaces.claimed_at` column
populated by `ClaimPrebuiltWorkspace`. After review feedback from
@johnstcn pointing out that the same fact is already implicit in build
history, I pivoted to the frontend-only approach. Subsequent review
notes consolidated the prebuilds system user UUID into a single
typegen'd constant.
## Why not the other open PRs
- **#25055** (`chatKey` cache fallback) only fixes a different
cache-miss path; it explicitly notes it does not address `created_at <
chat.created_at`.
- **#25053** (`chats.workspace_auto_created` boolean) puts the truth on
the wrong side of the schema: "this workspace was claimed at time T" is
a property of the workspace, not the chat. The MCP plumbing it adds is
also unnecessary now that the same answer is available from build
history.
## Test plan
- `pnpm vitest run --project=unit
src/pages/AgentsPage/utils/agentWorkspaceUtils.test.ts` — 40/40 pass;
new cases cover prebuild claim before/after chat, unclaimed prebuild,
missing-build-history fallback, and the fetch-skip when the chat is not
in cache.
- `pnpm lint:types`, `pnpm check`, `make pre-commit`.
<details>
<summary>Disclosure</summary>
Opened on behalf of @kylecarbs by [Coder
Agents](https://coder.com/coder-agents).
</details>
# Summary
Implements
https://linear.app/codercom/issue/AIGOV-282/add-ai-model-price-table-and-seed-generator
This PR lays the groundwork for AI Bridge cost controls (per the AI
Governance RFC). It adds the foundation needed for future cost tracking:
a place to store per-model token prices, a way to keep those prices in
sync with upstream pricing data, and a startup mechanism that ensures
every deployment has prices loaded before AI Bridge starts processing
requests.
The price data comes from [models.dev](https://models.dev/), a
community-maintained catalogue of AI provider pricing. A generator
script fetches the latest prices, filters to Anthropic and OpenAI for
now, and produces a seed file checked into the repository.
On every server startup the seed is applied to the database, so new
releases automatically pick up any price corrections that landed since
the previous one. Existing rows are overwritten with the latest prices;
rows for models no longer in the seed are left untouched.
# Batching the AI model price seed: three approaches
Context: at server startup we seed the `ai_model_prices` table from an
embedded JSON price book (~70 rows today, will grow as we add providers,
potentially 4000+).
Each row is:
```text
(provider, model, input_price, output_price, cache_read_price, cache_write_price)
```
Any of the four price columns can be:
- `NULL` → “price unknown for this dimension”
- explicit `0` → “free”
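
One way to keep the `NULL` versus `0` distinction on the Go side is to decode prices into pointer fields. The sketch below assumes the seed JSON uses the same keys as the column names; the `priceRow` type, `parseSeed`, and the sample provider and model names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// priceRow is a hypothetical seed row. Pointer fields stay nil when the
// JSON key is absent or null, and point at 0 when the price is explicitly 0.
type priceRow struct {
	Provider        string `json:"provider"`
	Model           string `json:"model"`
	InputPrice      *int64 `json:"input_price"`
	OutputPrice     *int64 `json:"output_price"`
	CacheReadPrice  *int64 `json:"cache_read_price"`
	CacheWritePrice *int64 `json:"cache_write_price"`
}

func parseSeed(data []byte) ([]priceRow, error) {
	var rows []priceRow
	if err := json.Unmarshal(data, &rows); err != nil {
		return nil, fmt.Errorf("parse seed: %w", err)
	}
	return rows, nil
}

func main() {
	seed := []byte(`[{"provider":"example-provider","model":"example-model",
		"input_price":0,"output_price":null}]`)
	rows, err := parseSeed(seed)
	if err != nil {
		panic(err)
	}
	r := rows[0]
	fmt.Println("input is free:", r.InputPrice != nil && *r.InputPrice == 0)
	fmt.Println("output unknown:", r.OutputPrice == nil)
	fmt.Println("cache read unknown:", r.CacheReadPrice == nil)
}
```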
The batch must be an UPSERT so re-running is idempotent and existing
rows pick up new prices.
We considered three implementations.
---
## Approach 1 — Per-row UPSERT in a Go loop
```go
for _, row := range rows {
	if err := db.UpsertAIModelPrice(ctx, database.UpsertAIModelPriceParams{
		Provider:   row.Provider,
		Model:      row.Model,
		InputPrice: nullInt64(row.InputPrice),
		// ...
	}); err != nil {
		return err
	}
}
```
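
The loop above relies on a `nullInt64` helper that is not shown. Assuming the row prices are `*int64`, it might look like the following sketch; the helper body is an assumption, not the code from the PR.

```go
package main

import (
	"database/sql"
	"fmt"
)

// nullInt64 maps a nil pointer to SQL NULL and any other value, including 0,
// to a valid int64. This keeps "unknown" and "free" distinct.
func nullInt64(p *int64) sql.NullInt64 {
	if p == nil {
		return sql.NullInt64{}
	}
	return sql.NullInt64{Int64: *p, Valid: true}
}

func main() {
	free := int64(0)
	fmt.Printf("%+v\n", nullInt64(nil))
	fmt.Printf("%+v\n", nullInt64(&free))
}
```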
### Pros
- Trivial.
- NULL handling falls out naturally from `sql.NullInt64`.
### Cons
- `N` round-trips per seed.
- With ~70 rows that means ~70 statement executions on every startup,
even inside a transaction.
- Doesn't scale gracefully as the price book grows, potentially to 4000+
rows.
---
## Approach 2 — `UNNEST` with parallel arrays
Pass each column as a separate Go slice. Postgres unnests them in
parallel into a virtual table, then `INSERT ... SELECT`.
```sql
INSERT INTO ai_model_prices (
provider,
model,
input_price,
output_price,
cache_read_price,
cache_write_price
)
SELECT
UNNEST(@providers::text[]),
UNNEST(@models::text[]),
NULLIF(UNNEST(@input_prices::bigint[]), -1),
NULLIF(UNNEST(@output_prices::bigint[]), -1),
NULLIF(UNNEST(@cache_read_prices::bigint[]), -1),
NULLIF(UNNEST(@cache_write_prices::bigint[]), -1)
ON CONFLICT (provider, model) DO UPDATE SET
input_price = EXCLUDED.input_price,
output_price = EXCLUDED.output_price,
cache_read_price = EXCLUDED.cache_read_price,
cache_write_price = EXCLUDED.cache_write_price,
updated_at = NOW();
```
Go side: flatten rows into six parallel slices.
Use a sentinel (`-1`) for “missing”, since `lib/pq` can't encode `NULL`
into a `bigint[]` element.
```go
providers := make([]string, len(rows))
models := make([]string, len(rows))
inputs := make([]int64, len(rows))
outputs := make([]int64, len(rows))
cacheR := make([]int64, len(rows))
cacheW := make([]int64, len(rows))
for i, r := range rows {
	providers[i] = r.Provider
	models[i] = r.Model
	// -1 is the "missing" sentinel; the query turns it back into NULL.
	inputs[i] = -1
	if r.InputPrice != nil {
		inputs[i] = *r.InputPrice
	}
	outputs[i] = -1
	if r.OutputPrice != nil {
		outputs[i] = *r.OutputPrice
	}
	cacheR[i] = -1
	if r.CacheReadPrice != nil {
		cacheR[i] = *r.CacheReadPrice
	}
	cacheW[i] = -1
	if r.CacheWritePrice != nil {
		cacheW[i] = *r.CacheWritePrice
	}
}
return db.UpsertAIModelPrices(ctx, database.UpsertAIModelPricesParams{
	Providers:        providers,
	Models:           models,
	InputPrices:      inputs,
	OutputPrices:     outputs,
	CacheReadPrices:  cacheR,
	CacheWritePrices: cacheW,
})
```
### Pros
- Single round-trip.
### Cons
- The generated `sqlc` params become plain `[]int64`, which can't
represent `NULL`.
---
## Approach 3 — `jsonb_array_elements` over a single `@seed::jsonb`
(chosen)
Pass the raw seed JSON as one parameter; let Postgres expand and parse
it.
```sql
INSERT INTO ai_model_prices (
provider,
model,
input_price,
output_price,
cache_read_price,
cache_write_price
)
SELECT
elem->>'provider',
elem->>'model',
(elem->>'input_price')::bigint,
(elem->>'output_price')::bigint,
(elem->>'cache_read_price')::bigint,
(elem->>'cache_write_price')::bigint
FROM jsonb_array_elements(@seed::jsonb) AS elem
ON CONFLICT (provider, model) DO UPDATE SET
input_price = EXCLUDED.input_price,
output_price = EXCLUDED.output_price,
cache_read_price = EXCLUDED.cache_read_price,
cache_write_price = EXCLUDED.cache_write_price,
updated_at = NOW();
```
Go side reduces to:
```go
return db.UpsertAIModelPrices(ctx, seedJSON)
```
### Pros
- Single round-trip.
- NULLs fall out naturally: `(elem->>'cache_write_price')::bigint`
becomes `NULL` when the key is missing or JSON `null`, so no sentinels
are needed.
- The seed is already JSON, so it is passed through as-is.
- Existing precedent: `jsonb_array_elements` is already used elsewhere
in the codebase.
### Cons
- Less type-safe at the SQL boundary than `UNNEST`
- Slightly less standard than `UNNEST`
- Readers need familiarity with `jsonb_array_elements` and the `->>`
extraction syntax.
- Postgres pays a JSON parse cost, which is negligible at our scale.
---
# Decision
We picked Approach 3.
It collapses the round-trips like `UNNEST` does, but without:
- nullable-array workarounds
- sentinel values
The 5s timeout cancelled cold-start ListMCPTools calls before the
agent's 30s connectTimeout could settle, so workspace MCP tools
never reached the LLM. Bump to 35s and scope to ListMCPTools only.
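
The interaction between the two timeouts can be reproduced with a small sketch. Everything here is hypothetical (`callWithTimeout`, `simulatedConnect`, `demo`), and seconds are scaled down to 10 ms units so the example runs quickly; it only illustrates why the outer budget has to exceed the 30s connect timeout.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// callWithTimeout runs op under its own deadline, scoped to this call only.
func callWithTimeout(parent context.Context, timeout time.Duration, op func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()
	return op(ctx)
}

// simulatedConnect stands in for a cold-start agent connect that needs
// `settle` to finish and gives up as soon as its context is cancelled.
func simulatedConnect(settle time.Duration) func(context.Context) error {
	return func(ctx context.Context) error {
		select {
		case <-time.After(settle):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// demo scales seconds to 10ms units: demo(5, 30) models a 5s budget
// wrapped around a connect that takes 30s to settle.
func demo(timeoutSeconds, connectSeconds int) error {
	const unit = 10 * time.Millisecond
	return callWithTimeout(
		context.Background(),
		time.Duration(timeoutSeconds)*unit,
		simulatedConnect(time.Duration(connectSeconds)*unit),
	)
}

func main() {
	fmt.Println("5s budget:", demo(5, 30))
	fmt.Println("35s budget:", demo(35, 30))
}
```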
coder/fantasy now fails closed when Anthropic or OpenAI Responses
streams close before their provider terminal events instead of yielding
a successful finish.
This bumps the fantasy replacement to coder/fantasy#33 and teaches chat
error classification to treat those failures as retryable timeout errors
with explicit stream-closed messages.
<img width="875" height="311" alt="image"
src="https://github.com/user-attachments/assets/69c6f7b5-c885-46d2-a88b-b7a2b111bd55"
/>
## Summary
Make Coder's chat agent honest about workspaces that use
`coder_external_agent`. Three behaviors change so the chat stops
pretending it can drive an external workspace through to a usable state
on its own.
<img width="859" height="537" alt="image"
src="https://github.com/user-attachments/assets/0561442b-95f1-4a2d-853c-7e3776114680"
/>
## Problem
External agents are not started by Coder. The user has to run `coder
agent` on their own host with a token Coder generates. Before this
change, the chat agent treated those workspaces like any other:
- `create_workspace` would enqueue a build for an external-agent
template and then wait minutes (~22 worst case) for an agent that was
never going to come up.
- When mid-turn tool calls dialed an external agent that was not
connected, the chat burned the full 30-second dial timeout and returned
generic "the workspace may need to be restarted from the Coder
dashboard" guidance, which is not the action the user can take.
- Nothing told the chat (or the user, through the chat) that the next
action lives outside Coder.
## Fix
Three changes scoped to `coderd/x/chatd/`:
1. **`create_workspace` blocks templates with external agents.** The
tool reads `template_versions.has_external_agent` for the template's
active version and refuses external-agent templates with a message
instructing the chat to pick a different template, or to have the user
create and start the workspace themselves and then attach it.
2. **Attaching an existing external workspace stays open.** No
selection-time gate on attachment; users can still bind a working
external workspace to a chat.
3. **External-agent-aware error handling on connection.** Two
complementary changes both predicated on proven connectivity failures
rather than every dial error:
- **`getWorkspaceConn` preflight and timeout handling.** Before opening
a connection, the cache-miss path reads the agent's status from the
already-loaded row. If the selected agent is external and clearly
offline according to the existing `isAgentUnreachable` helper
(`Disconnected` or `Timeout`, never `Connecting`), it returns an
external-agent-specific error immediately instead of waiting out the
30-second dial timeout. `Connecting` external agents fall through to the
dial so a user who just started the agent on their host can still
succeed in the same turn. The preflight only fires when the agent is
still the latest selected agent for the workspace, so stale-binding
recovery via `dialWithLazyValidation` is unaffected. The post-dial
rewrite is limited to the dial timeout sentinel; stale/no-agent bindings
and non-timeout dial failures preserve their original errors.
- **`waitForAgentReady` timeout-branch rewrite.** The 2-minute retry
loop used by `create_workspace` and `start_workspace` runs unchanged for
all agents. When the loop's outer deadline elapses, the timeout branch
substitutes the external-agent message in place of the raw dial error if
the agent belongs to an external resource.
This applies the same pattern that the cache-hit path of
`getWorkspaceConn` already used (`isAgentUnreachable` returning
`errChatAgentDisconnected`), extended to the cache-miss path and to the
readiness helper, with the external-agent-aware error rewrite layered
only on confirmed offline or timeout paths.
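
The preflight rule from item 3 can be sketched as follows. `isAgentUnreachable` is named in the description, but its body here, along with `agentStatus`, `preflight`, and `errExternalAgentOffline`, is an assumption made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// agentStatus is a hypothetical stand-in for the agent status on the
// already-loaded row.
type agentStatus string

const (
	statusConnecting   agentStatus = "connecting"
	statusConnected    agentStatus = "connected"
	statusDisconnected agentStatus = "disconnected"
	statusTimeout      agentStatus = "timeout"
)

// errExternalAgentOffline is a hypothetical external-agent-specific error.
var errExternalAgentOffline = errors.New("external agent is offline; start the agent on its host")

// isAgentUnreachable treats Disconnected and Timeout as offline, never Connecting.
func isAgentUnreachable(s agentStatus) bool {
	return s == statusDisconnected || s == statusTimeout
}

// preflight fails fast only for an external agent that is still the latest
// selected agent and is clearly offline. All other cases fall through to
// the normal dial (nil means "go ahead and dial").
func preflight(isExternal, isLatestAgent bool, s agentStatus) error {
	if isExternal && isLatestAgent && isAgentUnreachable(s) {
		return errExternalAgentOffline
	}
	return nil
}

func main() {
	fmt.Println(preflight(true, true, statusDisconnected)) // fails fast
	fmt.Println(preflight(true, true, statusConnecting))   // falls through to dial
	fmt.Println(preflight(true, false, statusTimeout))     // stale binding: dial path decides
	fmt.Println(preflight(false, true, statusDisconnected))
}
```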
Closes CODAGT-314
Workspace-agent logs emitted while serving chatd-driven requests were
not correlated with the originating chat, which made them hard to
attribute.
This adds agent-side chat context middleware that parses `Coder-Chat-Id`
once, enriches agent access logs and structured handler/background logs,
and adds a chatd bridge log when chat headers are attached to an agent
connection.
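
A minimal sketch of header-to-context middleware of this kind is shown below. Only the `Coder-Chat-Id` header name comes from the description; `withChatID`, `chatIDFromContext`, and `chatIDSeenBy` are hypothetical, and the sketch treats the ID as an opaque string where real code would presumably validate it.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
)

const chatIDHeader = "Coder-Chat-Id"

// chatIDKey is an unexported context key type to avoid collisions.
type chatIDKey struct{}

// withChatID reads the header once and stores the value in the request
// context so loggers further down the chain can attach it as a field.
func withChatID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if id := r.Header.Get(chatIDHeader); id != "" {
			r = r.WithContext(context.WithValue(r.Context(), chatIDKey{}, id))
		}
		next.ServeHTTP(w, r)
	})
}

func chatIDFromContext(ctx context.Context) (string, bool) {
	id, ok := ctx.Value(chatIDKey{}).(string)
	return id, ok
}

// chatIDSeenBy sends one request through the middleware and reports the
// chat ID the inner handler observed ("" when none was attached).
func chatIDSeenBy(headerValue string) string {
	var seen string
	h := withChatID(http.HandlerFunc(func(_ http.ResponseWriter, r *http.Request) {
		seen, _ = chatIDFromContext(r.Context())
	}))
	req := httptest.NewRequest(http.MethodGet, "/", nil)
	if headerValue != "" {
		req.Header.Set(chatIDHeader, headerValue)
	}
	h.ServeHTTP(httptest.NewRecorder(), req)
	return seen
}

func main() {
	fmt.Printf("%q\n", chatIDSeenBy("example-chat-id"))
	fmt.Printf("%q\n", chatIDSeenBy(""))
}
```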
Closes CODAGT-324
Closes #24091
Adds
`TestDeleteChatDebugDataAfterMessageIDStepLevelFieldBoundariesAndNulls`,
which complements the existing triggered-runs test for
`DeleteChatDebugDataAfterMessageID` with boundary and NULL coverage for
step-level message IDs.
The existing
`TestDeleteChatDebugDataAfterMessageIDIncludesTriggeredRuns` already
exercises the `step.assistant_message_id > @message_id` deletion path.
This test focuses on:
- Strict greater-than behavior at the cutoff for assistant and
history-tip step message IDs.
- Step-level assistant and history-tip message ID combinations.
- SQL NULL behavior for step-level message IDs.
- A mixed-step run where one matching step deletes the whole run and
cascades every step.
| Scenario | assistant_message_id | history_tip_message_id | Expected |
|----------|----------------------|------------------------|----------|
| Assistant above cutoff, history tip NULL | cutoff + 5 | NULL | Deleted |
| Assistant above cutoff, history tip below cutoff | cutoff + 20 | cutoff - 3 | Deleted |
| Assistant below cutoff, history tip NULL | cutoff - 3 | NULL | Preserved |
| Assistant at cutoff boundary, history tip NULL | cutoff | NULL | Preserved |
| Assistant NULL, history tip above cutoff | NULL | cutoff + 2 | Deleted |
| Assistant NULL, history tip at cutoff boundary | NULL | cutoff | Preserved |
| Both step message IDs NULL | NULL | NULL | Preserved |
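
The scenarios in the table are consistent with a predicate of the form "either step message ID is strictly greater than the cutoff, and NULL never matches". The Go sketch below mirrors that reading; it is inferred from the table rather than copied from the SQL, and `stepMatches` and `ptr` are hypothetical names.

```go
package main

import "fmt"

// stepMatches mirrors SQL comparison semantics: a nil (NULL) ID never
// satisfies "> cutoff", and the comparison is strict.
func stepMatches(assistant, historyTip *int64, cutoff int64) bool {
	above := func(id *int64) bool { return id != nil && *id > cutoff }
	return above(assistant) || above(historyTip)
}

func ptr(v int64) *int64 { return &v }

func main() {
	const cutoff = 100
	fmt.Println(stepMatches(ptr(cutoff+5), nil, cutoff))            // deleted
	fmt.Println(stepMatches(ptr(cutoff+20), ptr(cutoff-3), cutoff)) // deleted
	fmt.Println(stepMatches(ptr(cutoff), nil, cutoff))              // preserved (boundary)
	fmt.Println(stepMatches(nil, nil, cutoff))                      // preserved (both NULL)
}
```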
> Generated by Coder Agents
<details><summary>Review notes</summary>
- Run-level message IDs are below the cutoff to isolate step-level
selection.
- The assistant-above-cutoff scenario includes a second nonmatching step
to cover mixed-step deletion.
- The test uses unique model and chat names for isolation.
- `go test -v ./coderd/database -run
TestDeleteChatDebugDataAfterMessageID -count=1` passes.
</details>
Anthropic task name responses can include valid JSON followed by a
closing fence or extra text, which made `json.Unmarshal` fail with
trailing-character errors and forced fallback naming.
This updates task name JSON extraction to accept the first JSON value
after optional fences and adds regression coverage for fenced and bare
JSON with trailing content.
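
One way to get "first JSON value, ignore the rest" behavior in Go is `json.Decoder`, which stops after one value instead of rejecting trailing bytes the way `json.Unmarshal` does. The sketch below is an assumption about the approach, not the PR's code: `decodeFirstJSON`, `parseName`, and the `taskName` shape are hypothetical, and skipping to the first `{` or `[` is a simplification of fence handling.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// taskName is a hypothetical response shape used only for this example.
type taskName struct {
	Name string `json:"name"`
}

// decodeFirstJSON skips any leading fence text, then decodes exactly one
// JSON value. Whatever follows that value is left unread.
func decodeFirstJSON(raw string, v any) error {
	s := strings.TrimSpace(raw)
	if i := strings.IndexAny(s, "{["); i > 0 {
		s = s[i:]
	}
	return json.NewDecoder(strings.NewReader(s)).Decode(v)
}

func parseName(raw string) (string, error) {
	var t taskName
	if err := decodeFirstJSON(raw, &t); err != nil {
		return "", err
	}
	return t.Name, nil
}

func main() {
	fence := strings.Repeat("`", 3)
	fenced := fence + "json\n{\"name\":\"fix-login-bug\"}\n" + fence
	fmt.Println(parseName(fenced))
	fmt.Println(parseName("{\"name\":\"fix-login-bug\"} Hope that helps!"))
}
```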
Anthropic rejects inline images over 5,242,880 bytes, but our upload
endpoint accepts images up to 10 MiB — so 5–10 MiB images were
reaching the provider and failing. This adds two layers of
protection: the browser resizes oversized images before upload, and
the server rejects any that still slip through before an upstream
request is issued.
Client-side resizing uses `createImageBitmap` with
`resizeWidth`/`resizeHeight` to clamp the decoded bitmap at decode
time, then iteratively shrinks on an `OffscreenCanvas` (falling back
to `HTMLCanvasElement`) until the output fits the applicable budget.
Anthropic (and Bedrock-hosted Claude — fantasy's bedrock provider is
a thin wrapper around the Anthropic client) uses a ~5 MiB budget;
other providers use a ~10 MiB budget to stay under the server cap.
Doing the resize in the browser avoids decoding attacker-controlled
image bytes in `coderd` (image-bomb DoS surface).
Server-side, `chatFileResolver` now takes a provider string and
looks up the inline-image cap via a new
`chatprovider.InlineImageByteCap`
helper; oversized `image/*` files for capped providers are rejected
with a pre-classified `chaterror` before the SDK call. The backstop
fires for older clients, direct API callers, or any image that was
committed to the composer before the user switched to a stricter
provider.
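
The server-side backstop can be sketched roughly as below. `chatprovider.InlineImageByteCap` is named in the description, but its signature is not shown, so the lowercase `inlineImageByteCap`, the provider strings, and `checkInlineImage` are assumptions made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// anthropicInlineImageCap is the 5,242,880-byte limit quoted above.
const anthropicInlineImageCap int64 = 5 * 1024 * 1024

// inlineImageByteCap returns the cap for a provider, or false when the
// provider has no inline-image cap. Provider names are assumed.
func inlineImageByteCap(provider string) (int64, bool) {
	switch provider {
	case "anthropic", "bedrock":
		return anthropicInlineImageCap, true
	default:
		return 0, false
	}
}

// checkInlineImage rejects oversized image/* files for capped providers
// before any upstream request would be issued.
func checkInlineImage(provider, mediaType string, size int64) error {
	limit, capped := inlineImageByteCap(provider)
	if !capped || !strings.HasPrefix(mediaType, "image/") || size <= limit {
		return nil
	}
	return fmt.Errorf("image is %d bytes; provider %q accepts at most %d", size, provider, limit)
}

func main() {
	fmt.Println(checkInlineImage("anthropic", "image/png", 6*1024*1024))
	fmt.Println(checkInlineImage("anthropic", "image/png", anthropicInlineImageCap))
	fmt.Println(checkInlineImage("openai", "image/png", 6*1024*1024))
}
```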
Attachments commit to composer state synchronously with a new
`"processing"` `UploadState` so paste+Enter can't dispatch before
the resize finishes; the `"uploading"` send gate now covers both
states. Dismissed-while-resizing attachments are tracked in a
`WeakSet` so a late swap can't resurrect a removed file.
Closes CODAGT-215
Closes #24090
Enhances the existing `TestFinalizeStaleChatDebugRows` test with three
missing coverage areas:
1. **Error JSON preservation**: verifies pre-existing error payloads are
not overwritten by finalization
2. **Timestamp correctness**: verifies `updated_at` and `finished_at`
match the `@now` parameter across all finalized row paths
3. **Null error preservation**: verifies finalized steps that had no
error keep a null error column
No production code changed. Test passes against Postgres.
> 🤖 Generated by Coder Agents
<details><summary>Review notes</summary>
- Enhances the existing test rather than adding a new one; the existing
test was the right place
- Covers stale, orphaned, and cascade finalization timestamp assertions
- Preserves both pre-existing error JSON and null error values during
finalization
</details>
This change uses separate HTTP clients and transports in the
TestValidateToken subtests. Previously, the parallel subtests shared
http.DefaultTransport. When one subtest's httptest.Server.Close() ran in
t.Cleanup, it called http.DefaultTransport.CloseIdleConnections, which
could interrupt connections used by another subtest.
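
A minimal sketch of the isolation pattern, assuming each subtest builds its own client: `newIsolatedClient` and `statusVia` are hypothetical helpers, not the ones used in TestValidateToken.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newIsolatedClient returns a client with its own transport, so closing
// idle connections on http.DefaultTransport cannot affect it.
func newIsolatedClient() *http.Client {
	return &http.Client{Transport: &http.Transport{}}
}

// statusVia starts a throwaway server, issues one GET through the given
// client, and returns the response status code.
func statusVia(client *http.Client) (int, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusNoContent)
	}))
	defer srv.Close()
	defer client.CloseIdleConnections()

	resp, err := client.Get(srv.URL)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	code, err := statusVia(newIsolatedClient())
	fmt.Println(code, err)
}
```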
The async title-generation and turn-summary goroutines launched from
processChat run autocommit UPDATEs on the chat row after finishActiveChat
has set the chat to pending and signalWake has fired. If the row lock
from one of those UPDATEs is held while acquireLoop's processOnce runs,
AcquireChats's FOR UPDATE SKIP LOCKED skips the freshly-pending chat and
returns no rows. The wake is then consumed with no acquisition, and the
chat sits in pending until the next acquireTicker (default 1s).
Wake again after each UPDATE commits. The second wake covers the race
window without changing the transaction semantics.
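
The "wake again after commit" idea can be sketched with a coalescing wake channel. This assumes the wake signal is a non-blocking send on a buffered channel; the `signalWake` body and `updateThenWake` are hypothetical and only illustrate the ordering.

```go
package main

import "fmt"

// signalWake performs a non-blocking send. If a wake is already pending,
// the extra signal is dropped, so repeated wakes coalesce into one.
func signalWake(wake chan<- struct{}) {
	select {
	case wake <- struct{}{}:
	default:
	}
}

// updateThenWake runs an autocommit-style update and signals only after it
// has returned successfully, when its row lock is no longer held.
func updateThenWake(update func() error, wake chan<- struct{}) error {
	if err := update(); err != nil {
		return err
	}
	signalWake(wake)
	return nil
}

func main() {
	wake := make(chan struct{}, 1)
	noop := func() error { return nil }

	_ = updateThenWake(noop, wake) // e.g. title generation
	_ = updateThenWake(noop, wake) // e.g. turn summary; coalesces with the first
	fmt.Println("pending wakes:", len(wake))
}
```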
Closes coder/internal#1500
Skips `TestExploreChatSendMessageCannotMutateMCPSnapshot` while the
chatd redesign is in flight. The test exposes a self-interrupt race in
`processChat`'s control-pubsub subscriber that is structurally fixed by
the redesign in #24444; skipping until then matches the existing
`TestSubscribeRelayEstablishedMidStream` skip in
`enterprise/coderd/x/chatd/chatd_test.go`.
Relates to https://github.com/coder/internal/issues/1493.
`launchHeartbeat` could miss a stale-threshold update during startup if
`SetStaleAfter` ran after the heartbeat ticker was created but before
the goroutine subscribed to `thresholdChan`. In that case, the heartbeat
kept the old interval until a future tick, and the mock-clock test could
time out waiting for `Ticker.Reset` without advancing time.
Subscribe to `thresholdChan` before reading the heartbeat interval so
the channel consistently invalidates the interval. The regression test
now changes the threshold while ticker creation is trapped, making the
startup race deterministic.
Closes https://github.com/coder/internal/issues/1513
`TestAdvisorChainMode_SnapshotKeepsFullHistory` was using the generic
active chatd test server, which leaves periodic pending-chat polling
enabled. That made the test inconsistent with the other OpenAI Responses
API tests and allowed stale pending pubsub notifications to interrupt
the second turn before the advisor request was observed.
Use the existing OpenAI Responses test server helper so pending-chat
acquisition is delayed and the test only starts processing after the
SendMessage pending notification has been published.
Closes https://github.com/coder/internal/issues/1510