Context files (AGENTS.md) and skills were only fetched from the
workspace on the first turn or when the agent changed. On subsequent
turns, stale content from persisted messages was used. This meant that
if AGENTS.md or skills were modified on the workspace between turns, the
agent wouldn't see the changes until the user created a new chat.
## Changes
- Extract `fetchWorkspaceContext` from `persistInstructionFiles` to
allow fetching workspace context without persisting
- On subsequent turns, re-fetch fresh context from the workspace instead
of reading stale persisted content; falls back to persisted messages if
the workspace dial fails
- Update `ReloadMessages` callback to re-derive instruction and skills
from reloaded database messages after compaction, instead of using
captured closure variables
- Add `formatSystemInstructionsFromParts` helper to build system
instructions directly from agent parts without requiring separate
OS/directory params
- Add tests for the new helper
<details><summary>Implementation Notes</summary>
### Root cause
In `runChat`, the `else if hasContextFiles` branch (subsequent turns)
called `instructionFromContextFiles(messages)` which read stale content
from persisted DB messages. The `ReloadMessages` callback
(post-compaction) also used captured `instruction`/`skills` closure
variables from the start of the turn, never re-deriving them.
### Approach
1. **Extract `fetchWorkspaceContext`** — Pure refactor of the fetch-only
part of `persistInstructionFiles` (agent connection, context config
retrieval, content sanitization, metadata stamping). Returns parts +
skills without persisting.
2. **Subsequent turns**: Instead of reading from persisted messages,
launch a `g2` goroutine that calls `fetchWorkspaceContext` to get fresh
context from the workspace. Falls back gracefully to persisted messages
if the workspace is unreachable.
3. **ReloadMessages**: Re-derive `instruction` from
`instructionFromContextFiles(reloadedMsgs)` and `skills` from
`skillsFromParts(reloadedMsgs)` using the freshly loaded messages, with
fallback to captured values if the reloaded messages don't contain
context (e.g. compacted away).
</details>
> 🤖 Generated by Coder Agents
The startup-timeout integration tests in `chatloop` used a 5ms real-time
budget and relied on wall-clock scheduling to fire the startup guard
timer before the first stream part arrived. On loaded CI runners the
timer sometimes lost the race, producing `attempts == 2` instead of
`attempts == 1` and flaking `TestRun_FirstPartDisarmsStartupTimeout`.
Replace the real `time.Timer` in `startupGuard` with a `quartz.Timer` so
tests can control time deterministically. Production behavior is
unchanged: `RunOptions.Clock` defaults to `quartz.NewReal()` when nil,
and the startup timeout still covers both opening the provider stream
and waiting for the first stream part.
- Add `RunOptions.Clock quartz.Clock` with nil-safe default.
- Tag the startup guard timer as `"startupGuard"` for quartz trap
targeting.
- Rewrite the four startup-timeout integration tests to use
`quartz.NewMock(t)` with trap/advance/release sequences instead of
wall-clock sleeps.
- Add `awaitRunResult` helper so tests fail with a clear message instead
of hanging when `Run` does not complete.
Closes https://github.com/coder/internal/issues/1460
Adds an optional `CreatedAt` timestamp to `tool-call` and `tool-result`
`ChatMessagePart` variants so the frontend can compute tool execution
duration (`result.created_at - call.created_at`).
Timestamps are recorded at the correct moments in the chatloop:
- **Tool-call**: when the model stream emits the tool call
- **Tool-result**: when tool execution completes (or is interrupted)
These are passed through `PersistedStep.PartCreatedAt` so the
persistence layer can apply accurate timestamps to stored parts.
SSE-published parts also carry `CreatedAt` for real-time display.
Old persisted messages without `created_at` deserialize to `nil` — fully
backward compatible.
<details><summary>Implementation notes (Coder Agents
generated)</summary>
### Why not stamp in `PartFromContent`?
`PartFromContent` is called both for SSE publishing (correct timing) and
during persistence (wrong timing — both tool-call and tool-result would
get the same "persistence time" timestamp, yielding ~0 duration).
Instead, timestamps are captured in the chatloop at the right moments
and carried through `PersistedStep.PartCreatedAt` as a
`map[string]time.Time` keyed by `"call:<id>"` / `"result:<id>"`.
### Interrupted tool calls
`persistInterruptedStep` also stamps `CreatedAt` on synthetic error
results for cancelled/interrupted tool calls, so partial duration is
available.
### Files changed
| File | Change |
|------|--------|
| `codersdk/chats.go` | Add `CreatedAt *time.Time` field |
| `codersdk/chats_test.go` | JSON round-trip test |
| `coderd/database/dbtime/dbtime.go` | Add `TimePtr` helper |
| `coderd/x/chatd/chatloop/chatloop.go` | Track timestamps, pass through
`PersistedStep` |
| `coderd/x/chatd/chatd.go` | Apply timestamps during persistence |
| `coderd/x/chatd/chatprompt/chatprompt_test.go` | Verify
`PartFromContent` does NOT stamp |
| `site/src/api/typesGenerated.ts` | Auto-generated |
</details>
---------
Co-authored-by: Ethan <39577870+ethanndickson@users.noreply.github.com>
Fixes https://github.com/coder/internal/issues/1418
The `TestRun_ActiveToolsPrepareBehavior` test asserts
`persistedStep.Runtime > 0`, but on Windows the timer resolution (~15ms)
means the in-memory mock model can complete within the same clock tick,
producing a measured duration of `0s`.
Change the assertion from `require.Greater` to `require.GreaterOrEqual`
so that a legitimately measured zero duration on low-resolution clocks
does not cause a flake.
> Generated by Coder Agents
Follow-up to #23282. The retry and terminal error callouts had a few UX
oddities:
- Auto-retrying states reused backend error text that said "Please try
again" even while the UI was already retrying on behalf of the user.
- Terminal error states also said "Please try again" with no action the
user could take.
- `startup_timeout` had no specific title or retry copy — it fell
through to the generic "Retrying request" heading.
- The kind pill showed raw enum values like `startup_timeout` and
`rate_limit`.
- Terminal error metadata showed a "Retryable" / "Not retryable" label
that does not help users.
- A separate "Provider anthropic" metadata row duplicated information
already present in the message body.
- The `usage-limit` error kind used a hyphen while every backend kind
uses underscores.
Changes:
**Backend (`chaterror/message.go`)**
- Split message generation into `terminalMessage()` and
`retryMessage()`, replacing the old `userFacingMessage()`.
- Terminal messages include HTTP status codes and actionable guidance
(e.g. "Check the API key, permissions, and billing settings.").
- Retry messages are clean factual statements without status codes or
remediation, suitable for the retry countdown UI (e.g. "Anthropic is
temporarily overloaded.").
- Removed "Please try again" / "Please try again later" from all paths.
- `StreamRetryPayload` calls `retryMessage()` instead of forwarding
`classified.Message`.
**Frontend**
- Removed the parallel frontend message-generation system:
`getRetryMessage()`, `getProviderDisplayName()`,
`getRetryProviderSubject()`, and the `PROVIDER_DISPLAY_NAMES` map are
all deleted from `chatStatusHelpers.ts`.
- `liveStatusModel.ts` passes `retryState.error` through directly — the
backend owns the copy.
- Added specific title and retry copy for `startup_timeout`, and
extended the title mapping to cover `auth` and `config`.
- Kind pills now show humanized labels ("Startup timeout", "Rate limit",
etc.) instead of raw enum strings.
- Removed the redundant "Provider anthropic" metadata row.
- Removed the terminal "Retryable" / "Not retryable" badge.
- Normalized `"usage-limit"` → `"usage_limit"` and added it to
`ChatProviderFailureKind` so all error kinds follow the same underscore
convention and live in one enum.
Refs #23282.
> **PR Stack**
> 1. #23351 ← `#23282`
> 2. #23282 ← `#23275`
> 3. **#23275** ← `#23349` *(you are here)*
> 4. #23349 ← `main`
---
## Summary
Extracts a structured error classification subsystem for agent chat
(`chatd`) so that retry and error payloads carry machine-readable
metadata — error kind, provider name, HTTP status code, and retryability
— instead of raw error strings.
This is the **backend half** of the error-handling work. The frontend
counterpart is in #23282.
## Changes
### New package: `coderd/chatd/chaterror/`
Canonical error classification — extracts error kind, provider, status
code, and user-facing message from raw provider errors. One source of
truth that drives both retry policy and stream payloads.
- **`kind.go`**: Error kind enum (`rate_limit`, `timeout`, `auth`,
`config`, `overloaded`, `unknown`).
- **`signals.go`**: Signal extraction — parses provider name, HTTP
status code, and retryability from error strings and wrapped types.
- **`classify.go`**: Classification logic — maps extracted signals to an
error kind.
- **`message.go`**: User-facing message templates keyed by kind +
signals.
- **`payload.go`**: Projectors that build `ChatStreamError` and
`ChatStreamRetry` payloads from a classified error.
### Modified
- **`codersdk/chats.go`**: Added `Kind`, `Provider`, `Retryable`,
`StatusCode` fields to `ChatStreamError` and `ChatStreamRetry`.
- **`coderd/chatd/chatretry/`**: Thinned to retry-policy only;
classification logic moved to `chaterror`.
- **`coderd/chatd/chatloop/`**: Added per-attempt first-chunk timeout
(60 s) via `guardedStream` wrapper — produces retryable
`startup_timeout` errors instead of hanging forever.
- **`coderd/chatd/chatd.go`**: Publishes normalized retry/error payloads
via `chaterror` projectors.
- Moves `coderd/chatd/`, `coderd/gitsync/`, `enterprise/coderd/chatd/`
under `x/` parent directories to signal instability
- Adds `Experimental:` glue code comments in `coderd/coderd.go`
> 🤖 This PR was created with the help of Coder Agents, and was
reviewed by my human. 🧑💻