## Summary
Fixes a bug where interrupting a streaming chat and sending a new
message
left the relay connected to the wrong replica. Expanded into a broader
refactor that cleanly separates concerns:
- **OSS** owns pubsub subscription, message catch-up, queue updates,
status forwarding, and local parts merging.
- **Enterprise** (`enterprise/coderd/chatd`) only manages relay dialing,
reconnection, and stale-dial discarding for cross-replica streaming.
## Architecture
### OSS `coderd/chatd/chatd.go`
`Subscribe()` builds the initial snapshot then runs a single merge
goroutine that handles:
- Pubsub subscription for durable events (status, messages, queue,
errors)
- Message catch-up via `AfterMessageID`
- Local `message_part` forwarding
- Relay events from enterprise (when `SubscribeFn` is set)
- Sends `StatusNotification` to enterprise so it can manage relay
lifecycle
Key types:
- `SubscribeFn` — enterprise hook, returns relay-only events channel
- `SubscribeFnParams` — `ChatID`, `Chat`, `WorkerID`,
`StatusNotifications`, `RequestHeader`, `DB`, `Logger`
- `StatusNotification` — `Status` + `WorkerID`, sent to enterprise on
pubsub status changes
### Enterprise `enterprise/coderd/chatd/chatd.go`
`NewMultiReplicaSubscribeFn(cfg MultiReplicaSubscribeConfig)` returns a
`SubscribeFn` that:
- Opens an initial synchronous relay if the chat is running on a remote
worker
- Reads `StatusNotifications` from OSS to open/close relay connections
- Handles async dial, reconnect timers, stale-dial discarding
- Returns only relay `message_part` events
## Bug fixes
### Original bug: stale relay dial after interrupt
`openRelayAsync` goroutines used `mergedCtx` (subscription-level), not a
per-dial context. `closeRelay()` could not cancel in-flight dials. When
the user interrupts and a new replica picks up the chat, the old dial
goroutine could complete after the new one and deliver a stale
`relayResult`.
**Fix**: per-dial `dialCtx`/`dialCancel`, `expectedWorkerID` tracking,
`workerID` on `relayResult`. `closeRelay()` cancels the dial context and
drains `relayReadyCh`. Merge loop rejects mismatched worker IDs.
### Additional fixes
- `statusNotifications` send-on-closed-channel race — goroutine now owns
`close()` via defer
- Enterprise spin-loop on `StatusNotifications` close — two-value
receive
with nil-out
- `hasPubsub` set from `p.pubsub != nil` instead of subscription success
— now tracks actual subscription result
- `lastMessageID` not initialized from `afterMessageID` — caused
duplicate messages on catch-up
- `wrappedParts` goroutine leaked remote connection on `dialCtx` cancel
- `closeRelay()` did not drain `relayReadyCh`
- `setChatWaiting` race with `SendMessage(Interrupt)` — wrapped in
`InTx`
- `processChat` post-TX side effects fired when chat was taken by
another
worker — added `errChatTakenByOtherWorker` sentinel
- Cancel closure data race on `reconnectTimer`
- Bare blocking send on pubsub error path
- `localParts` hot-spin after channel close
- No-pubsub branch dropped relay events and initial snapshot
- Failed relay dial caused permanent stall (no reconnect retry)
- DB error during reconnect timer caused permanent stall
- `time.NewTimer` replaced with `quartz.Clock` for testable timing
## Tests
9 enterprise tests covering:
- Relay reconnect on drop (mock clock)
- Async dial does not block merge loop
- Relay snapshot delivery
- Stale dial discarded after interrupt
- Cancel during in-flight dial
- Running-to-running worker switch
- Failed dial retries (mock clock)
- Local worker closes relay
- Multiple consecutive reconnects (mock clock)
All pass with `-race`.
Previously, WorkspaceBuildBuilder.doInTX() inserted provisioner jobs
with empty tags and used a loop in AcquireProvisionerJob that could
match other tests' pending jobs when parallel tests share a database.
Add a unique tag (jobID -> "true") to each provisioner job at insert
time, then use that tag in AcquireProvisionerJob to target only the
correct job. This follows the same pattern used in dbgen.ProvisionerJob.
Closescoder/internal#1367
## Problem
Title generation uses the same model the user selected for chat. This
breaks when:
1. **Thinking/extended thinking models** — `ToolChoice: None` conflicts
with extended thinking on Anthropic. The bare call has no thinking
config, so provider-level defaults can conflict.
2. **Expensive models** — User picks `o3` or `claude-opus-4`, and a
trivial 8-word title generation burns through tokens/cost unnecessarily.
3. **Provider quirks** — Different providers have different constraints
around thinking mode + tool choice combinations.
## Solution
Modeled after how `coder/mux` handles this with
`NAME_GEN_PREFERRED_MODELS` + ordered candidate fallback:
### Phase 1: Candidate model list with fallback
- New `TitleModelFunc` type returns an ordered list of candidate models
- Tries `claude-haiku-4-5` → `gpt-4o-mini` → user's model
- Gracefully skips unavailable candidates (missing API key, provider not
configured)
- Falls back to the user's chat model as last resort
### Phase 2: Provider-safe call options
- Removed `ToolChoice: None` which conflicts with extended thinking on
some providers
- Added `MaxOutputTokens: 256` to cap token usage
- Improved title prompt with verb-noun format guidance (`Fix sidebar
layout`, `Add user authentication`) and explicit
no-markdown/no-code-fences instructions
### Files changed
- `coderd/chatd/title.go` — Candidate loop, improved prompt, safe call
options
- `coderd/chatd/chatd.go` — Build `TitleModelFunc` closure with
lightweight candidates
## Summary
Subagent (child) chats were previously given access to workspace
provisioning tools (`list_templates`, `read_template`,
`create_workspace`), which could lead to uncontrolled resource
consumption. This PR moves those tools behind the same
`!chat.ParentChatID.Valid` gate that already protects the subagent tools
(`spawn_agent`, `wait_agent`, etc.).
## Changes
- **`coderd/chatd/chatd.go`**: Moved `list_templates`, `read_template`,
and `create_workspace` tool registration into the root-chat-only block
alongside subagent tools.
- **`coderd/chatd/chatd_test.go`**: Added
`TestSubagentChatExcludesWorkspaceProvisioningTools` — an E2E test that
spawns a subagent via a root chat and verifies the subagent's LLM call
does not include workspace provisioning or subagent tools.
- **`coderd/chatd/chattest/openai.go`**: Added `Tools` field to
`OpenAIRequest` and supporting `OpenAITool`/`OpenAIToolFunction` types
so tests can inspect which tools are sent to the model.
## Problem
Two bugs in the agents chat flow:
1. **Optimistic rendering glitch**: When sending a message while the
agent is busy, a fake message with a negative ID appears in the
timeline, then gets rolled back to the queued state. This causes a
jarring flash.
2. **Auto-promoted messages not appearing**: When the server
auto-promotes a queued message after finishing a task, the promoted user
message doesn't show up in the timeline until the LLM finishes its
response.
## Root Causes
**Bug 1**: The optimistic rendering system injected placeholder messages
with `id: -Date.now()` into the store. When the server responded with
`queued: true`, the optimistic message was rolled back — but the user
had already seen it flash in the timeline.
**Bug 2**: In `processChat`'s deferred cleanup, the auto-promoted
message was published via `publishEvent()`, which only delivers to local
in-process stream subscribers. The SSE subscriber goroutine only
forwards `message_part` events from the local channel — it ignores
`message` events. Durable events reach the SSE client via pubsub → DB
read, but `publishEvent` doesn't trigger a pubsub notification. The
explicit `PromoteQueued` endpoint correctly used `publishMessage()`
(which does both), but the auto-promote path did not.
## Changes
### Frontend (`site/`)
- **AgentDetail.tsx**: Remove optimistic message injection from send and
edit flows. Instead, use the `CreateChatMessageResponse.message` from
the POST response to insert the real server message into the store
immediately.
- **ChatContext.ts**: Remove the negative-ID cleanup logic from
`upsertDurableMessage` that stripped optimistic placeholders when real
messages arrived.
- **chatStore.test.ts**: Remove 2 tests for negative-ID optimistic
message behavior.
### Backend (`coderd/chatd/`)
- **chatd.go**: In `processChat` cleanup, replace `publishEvent()` with
`publishMessage()` for auto-promoted messages. This ensures the pubsub
notification (`AfterMessageID`) is sent, so SSE subscribers read the new
message from the DB immediately.
closes https://github.com/coder/internal/issues/464
# Summary
This PR resolves a flaky test that was sensitive to DST transitions in
various time zones. The root of the flake was:
* a bug; the query and its tests assume 24 hours per day
* the tests used local system time, which resulted in failures for dates
proximal to DST transitions
# Changes
Query:
The original query assumed 24 hour intervals between each day, which is
not a valid assumption. It now increments `1 day` at a time.
Database tests:
Database level tests for the query all assumed 24 hour days. They now
increment in DST-aware days instead. Instead of using time.Now() as a
base for testing, the test uses a series of dates over the course of an
entire year, to ensure that DST transition dates are present in every
test run.
# API Endpoint
The endpoint that delivers the user status chart now accepts an IANA
timezone name as a parameter and passes it, keeping the existing offset
as a fallback, to the database query.
API level tests were added to ensure the correct response form and error
behaviour. Correctness of content is tested at the database level.
## Problem
There is a race condition in the chat stream reconnect path. When a
client connects (or reconnects) to `/stream`, sometimes they only see a
`status: running` event but never receive any `message_part` events —
the stream appears stuck.
## Root Cause
In `processChat`, the sequence is:
1. `publishStatus(running)` — broadcasts `status: running` to all
subscribers and via pubsub.
2. `runChat()` is called.
3. Inside `runChat`, there's significant setup work (model resolution,
DB queries, title generation, prompt building, instruction resolution).
4. Only **after** all that setup does `runChat` set `buffering = true`
on the stream state.
If a client connects to `/stream` between steps 1 and 4:
- `Subscribe()` reads `chat.Status == running` from the DB, so it
includes `status: running` in the snapshot.
- But `buffering` is still `false`, so `subscribeToStream` returns an
**empty** local snapshot (no message_parts).
- `publishToStream` **drops** all `message_part` events when `buffering`
is false.
- Result: client sees `running` but never gets any streaming content.
## Fix
Move the `buffering = true` setup (and its deferred cleanup) from
`runChat` into `processChat`, right before `publishStatus(running)`.
This guarantees the buffer is active before any subscriber can observe
`status: running`, so:
- The snapshot always includes any in-flight `message_part` events.
- `publishToStream` never drops parts because buffering is already on.
Despite the SDK type having an `Archived` field for chats, this data was
never fetched from the database — the `GetChatsByOwnerID` query
hardcoded `AND archived = false`, and the `convertChat` function never
mapped the field.
This PR adds an optional `archived` query parameter to `GET
/api/experimental/chats`:
| Value | Behavior |
|-------|----------|
| *(not provided)* | Returns all chats (active and archived) |
| `archived=false` | Returns only non-archived chats |
| `archived=true` | Returns only archived chats |
This follows the same pattern used by template versions
(`sqlc.narg('archived')` nullable boolean).
Also fixes `convertChat` to populate the `Archived` field in API
responses, which was never being set despite existing on the SDK type.
Adds database columns and server-side logic to track interception lineage via tool call IDs. When an interception ends, the server resolves the correlating tool call ID to find the parent interception and links them via `parent_id`.
New `provider_tool_call_id` column on `aibridge_tool_usages` and `parent_id` column on `aibridge_interceptions`, with indexes for lookup. `findParentInterceptionID` queries by tool call ID and filters out the current interception to find the parent.
Adapted from the [coder/coder `dk/prompt_provenance_poc`](https://github.com/coder/coder/compare/main...dk/prompt_provenance_poc) branch.
Depends on [coder/aibridge#188](https://github.com/coder/aibridge/pull/188).
Closes https://github.com/coder/internal/issues/1334
Prebuilds need to be valid. Before this change, you can push a template
version that's preset will fail when making a prebuild. This PR ensures
all presets that are used for prebuilds are valid
## Problem
Flaky test:
`TestCloseDuringShutdownContextCanceledShouldRetryOnNewReplica`
(coder/internal#1371)
The test intermittently fails because the chat ends up in `waiting`
status instead of `pending` after server shutdown.
## Root Cause
There is a race condition in `processChat` where `runChat` completes
successfully just as the server context is being canceled during
`Close()`. The sequence:
1. Server calls `Close()`, canceling the server context.
2. The LLM HTTP response has already been fully written by the mock
server (the stream closes normally before context cancellation
propagates to the HTTP client).
3. `runChat` returns `nil` (success) instead of `context.Canceled`.
4. The existing `isShutdownCancellation` check only runs when `runChat`
returns an error, so the shutdown is not detected.
5. `processChat`'s deferred cleanup marks the chat as `waiting` instead
of `pending`.
6. The test's assertion that the chat is `pending` never becomes true.
This race is timing-dependent — it only triggers when the mock server's
HTTP response completes in the narrow window between context
cancellation being initiated and it propagating through the HTTP
transport layer.
## Fix
Add a server context check after `runChat` returns successfully. If the
server is shutting down (`ctx.Err() != nil`), override the status to
`pending` so another replica can pick up the chat.
This is the same pattern already used for the error path
(`isShutdownCancellation`), extended to cover the success path.
## Summary
Removes `time.Sleep` calls in two test files by replacing them with
deterministic or event-driven alternatives.
### Changes
**`coderd/provisionerjobs_test.go`** (34.5s → 0.25s)
Replaced `time.Sleep(1500ms)` with a direct SQL `UPDATE` to bump
`created_at` by 2 seconds. The sleep existed purely to ensure different
timestamps for sort-order testing. The fix is deterministic and cannot
flake. Uses `NewDBWithSQLDB` (the test already required real Postgres
via `WithDumpOnFailure`).
**`coderd/database/pubsub/pubsub_test.go`** (2.05s → 1.3s)
Replaced `time.Sleep(1s)` with a `testutil.Eventually` retry loop that
publishes and checks for subscriber receipt. This is the idiomatic
pattern in the codebase. The old sleep waited for pq.Listener to
re-issue LISTEN after reconnect; the new code polls until it actually
works.
## Summary
Change the four main `coderdtest` Await helper functions to poll at
`IntervalFast` (25ms) instead of `IntervalMedium` (250ms):
- `AwaitTemplateVersionJobCompleted`
- `AwaitWorkspaceBuildJobCompleted`
- `WorkspaceAgentWaiter.WaitFor`
- `WorkspaceAgentWaiter.Wait`
These are called **~855 times** across the test suite. Each call
previously wasted ~125ms on average waiting for the next poll tick.
`AwaitTemplateVersionJobRunning` already used `IntervalFast` — this
makes all Await helpers consistent.
## Measured Impact
Local benchmarks (postgres, `-short -count=1 -p 8 -parallel 8
-tags=testsmallbatch`):
| Package | Before | After | Delta |
|---|---|---|---|
| enterprise/coderd | 90.8s | 76.0s | **-16.3%** |
| coderd | 65.6s | 56.5s | **-13.8%** |
| cli | 57.9s | 37.8s | **-34.7%** |
| enterprise (root) | 41.1s | 39.9s | -2.9% |
| **Sum of all packages** | **623s** | **543s** | **-12.8%** |
Zero test failures across all 199 packages.
The pause/resume endpoints were only registered under /api/experimental
but the frontend and Go SDK were calling /api/v2, resulting in 404s.
Register the routes in the v2 group, update the SDK client paths, and
fix swagger annotations (Accept → Produce) since these POST endpoints
have no request body.
This pull-request follows up #22060
Felt wrong to only make use of Geist when there is a Monospace variant
here too. Felt best we default to this as the default font as its inline
with the rest of the application. This also updates the lower line for
Workspace Statistics 🙂
Replace manual experiment checks in web-push handlers with the
`RequireExperimentWithDevBypass` middleware on the route group, matching
the pattern used by OAuth2, Agents, and MCP experiments.
## Changes
- **`coderd/coderd.go`**: Add `RequireExperimentWithDevBypass`
middleware to `/webpush` route group
- **`coderd/webpush.go`**: Remove inline
`api.Experiments.Enabled(codersdk.ExperimentWebPush)` checks from all
three handlers
- **`cli/server.go`**: Gate webpush dispatcher initialization with
`buildinfo.IsDev()` fallback so dev builds always init the real
dispatcher
- **`coderd/webpush_test.go`**: Remove experiment enablement from tests
(dev bypass handles it)
Net effect: -26 lines removed, +5 added.
Created using whatchamacallits (Opus 4.6 Max)
## Problem
When the git askpass flow triggered diff status refreshes, it updated
**every chat** connected to the workspace. This was wasteful and could
cause confusing status updates on unrelated chats.
## Solution
Thread the chat ID through the entire git askpass flow so only the chat
that initiated the git operation gets updated:
1. **`coderd/chatd/chattool/execute.go`** — Sets `CODER_CHAT_ID` env var
on spawned processes (alongside the existing `CODER_CHAT_AGENT`)
2. **`cli/gitaskpass.go`** — Reads `CODER_CHAT_ID` from the environment
and sends it as a `chat_id` query parameter in the `ExternalAuthRequest`
3. **`codersdk/agentsdk/agentsdk.go`** — Adds `ChatID` field to
`ExternalAuthRequest` and encodes it as a query param
4. **`coderd/workspaceagents.go`** — Parses `chat_id` query param and
passes it through to `storeChatGitRef` and
`triggerWorkspaceChatDiffStatusRefresh`
5. **`coderd/chats.go`** — `storeChatGitRef` and
`refreshWorkspaceChatDiffStatuses` now scope updates to just the
initiating chat when a chat ID is provided, falling back to
all-workspace-chats behavior for backwards compatibility (non-chat git
operations)
Currently the sharing UI is only hidden under certain circumstances,
rather than on a permission basis. This makes it permissions based, and
makes some backend changes to make sure permissions are correct.
## Problem
Subscribers connecting to a different replica than the one running the
chat see full messages appear but no streaming partials (`message_part`
events). The relay mechanism that forwards ephemeral parts across
replicas had several bugs.
## Root Causes
1. **`openRelay()` blocked the event loop** — The WebSocket dial (TCP +
TLS + HTTP upgrade) to the worker replica ran synchronously inside the
select loop. While dialing, no events could be processed, channels
filled up, and parts were silently dropped.
2. **Relay drops were permanent** — When the relay WebSocket closed
mid-stream, `relayParts` was set to nil and never reopened. No status
notification would re-trigger it since the chat was still running on the
same worker.
3. **`drainInitial` snapshot race** — The `default` case in the initial
drain loop caused the snapshot to be empty if the remote hadn't flushed
data yet (common immediately after WebSocket connect).
4. **Duplicate event delivery** — The `preloaded` slice caused snapshot
events to be sent both in the return value and re-sent through the
channel goroutine.
## Fixes
### `coderd/chatd/chatd.go` (Subscribe method)
- **Async relay dial**: `openRelayAsync()` spawns a goroutine to dial
the remote replica. The result (channel + cancel func) is delivered on a
`relayReadyCh` channel that the select loop reads without blocking.
- **Relay reconnection**: When the relay channel closes, a 500ms timer
fires. The handler re-checks chat status from the DB and reopens the
relay if the chat is still running on a remote worker.
- **Snapshot parts via channel**: Relay snapshot + live parts are
wrapped into a single channel so they flow through the same path,
avoiding races with the select loop.
### `enterprise/coderd/chats.go` (newRemotePartsProvider)
- **Timer-based drain**: Replaced `default` with a 1-second timer. After
the first event, `Reset(0)` switches to non-blocking drain for remaining
buffered events.
- **Remove preloaded duplication**: The goroutine now only forwards new
events; snapshot events are returned to the caller directly.
## Testing
All existing tests pass:
- `TestInterruptChatBroadcastsStatusAcrossInstances`
- `TestSubscribeSnapshotIncludesStatusEvent`
- `TestSubscribeNoPubsubNoDuplicateMessageParts`
- `TestSubscribeAfterMessageID`
- `TestChatStreamRelay/RelayMessagePartsAcrossReplicas`
## Problem
The pubsub notification handler in `chatd` re-fetched **all** messages
from the DB on every new message notification, then filtered in Go with
`msg.ID > lastMessageID`. This grows linearly with conversation length —
every new message triggers a full table scan of that chat's history.
The `AfterMessageID` field in the pubsub notification payload was
clearly designed for cursor-based fetching, but no matching query
existed.
## Fix
- Add `GetChatMessagesByChatIDAfter` SQL query with `WHERE id >
@after_id`, so the database does the filtering instead of Go.
- Use it in the pubsub notification handler in `chatd.go`, passing
`lastMessageID` as the cursor.
- Implement the dbauthz wrapper (was a `panic("not implemented")` stub
from codegen) with the same read-check-on-parent-chat pattern as
adjacent methods.
- Add dbauthz test coverage for the new method.
**Not changed:** The initial snapshot in `Subscribe()` still loads all
messages — that's correct, since a newly-connecting client needs the
full conversation state. The waste was only in the ongoing notification
path.
## Summary
`deleteUserWebpushSubscription` in `coderd/webpush.go` had incorrect
error handling that masked database errors as 404 responses.
## Bug
`GetWebpushSubscriptionsByUserID` is a `:many` query — it returns `([],
nil)` when no rows match, never `sql.ErrNoRows`. The previous `if/else
if` chain:
```go
if existing, err := api.Database.GetWebpushSubscriptionsByUserID(ctx, user.ID); err != nil && errors.Is(err, sql.ErrNoRows) {
// dead code — :many queries never return sql.ErrNoRows
} else if idx := slices.IndexFunc(existing, ...); idx == -1 {
// real DB errors fall through here, existing is nil, idx is -1 → 404
}
```
Any real database error (connection failure, timeout, authorization
error) fell through to the `else if` branch where `slices.IndexFunc(nil,
...)` returns `-1`, returning 404 "subscription not found" instead of
500.
## Fix
Split into two separate checks so database errors properly return 500:
```go
existing, err := api.Database.GetWebpushSubscriptionsByUserID(ctx, user.ID)
if err != nil {
// 500
}
if idx := slices.IndexFunc(existing, ...); idx == -1 {
// 404
}
```
## Testing
Added `TestDeleteWebpushSubscription/database_error_returns_500` which
wraps the DB store to inject an error into
`GetWebpushSubscriptionsByUserID` and asserts the handler returns 500
(not 404).
## Problem
Production logs frequently show:
```
[debu] coderd.chats.chat-processor: failed to generate chat title
error= generate title text: context deadline exceeded
```
## Root Cause
The title generation timeout in `maybeGenerateChatTitle` is 10 seconds.
Many LLM providers routinely exceed this under load (cold starts, rate
limits, large models). Since `chatretry` classifies `context deadline
exceeded` as non-retryable, the first timeout kills the entire attempt
with no retry.
## Fix
Increase the timeout from 10s to 30s. Title generation is async and
best-effort — it runs in a background goroutine and doesn't block the
chat response — so a longer timeout has no user-facing impact.
Fixes https://github.com/coder/internal/issues/1371
## Problem
`TestCloseDuringShutdownContextCanceledShouldRetryOnNewReplica` flakes
intermittently in CI. The observed failure is that the chat never
reaches `pending` status after `serverA.Close()`.
## Root cause
Race between context cancellation and the mock OpenAI server's stream
completion marker.
When `Close()` cancels the server context, the in-flight HTTP streaming
request is canceled. The mock server's handler detects this via
`req.Context().Done()` and closes its chunks channel. The mock's
`writeChatCompletionsStreaming` then writes `data: [DONE]` — the SSE
completion marker. On a loopback connection, this marker can reach the
client **before** the client's HTTP transport honors the context
cancellation.
When this happens:
1. The client sees a successful stream completion (not an error)
2. `chatloop.Run` returns `nil`
3. `processChat` falls through without error → status stays `waiting`
(the default)
4. The test expects `pending` → **flake**
## Fix
Skip writing the `[DONE]` marker when the request context is already
canceled, in both `writeChatCompletionsStreaming` and
`writeResponsesAPIStreaming`.
## Problem
When archiving an agent with subagents, the children briefly flash in
the sidebar as root-level items before disappearing. Two issues:
1. **Backend:** Archive used N+1 queries — a recursive DFS
(`archiveChatTree`, no transaction) or BFS loop (`chatd.ArchiveChat`,
N+1 queries in a tx) to walk the tree and archive each chat
individually.
2. **Frontend:** The SSE `deleted` event handler only filtered out the
parent chat from the cache. Children remained briefly, got promoted to
root-level by `buildChatTree`, then disappeared on the next re-fetch.
## Fix
**Backend:** Replace both tree-walk implementations with a single SQL
query:
```sql
UPDATE chats SET archived = true, updated_at = NOW()
WHERE id = @id OR root_chat_id = @id;
```
This leverages the existing `root_chat_id` column (already indexed) to
archive the entire tree atomically.
**Frontend:** When a `deleted` event arrives, also filter out any chats
whose `root_chat_id` matches the deleted chat, so children vanish from
the sidebar immediately with the parent.
## Changes
- `coderd/database/queries/chats.sql` — Added `ArchiveChatTreeByID`
query
- `coderd/chats.go` — Use single query, delete `archiveChatTree`
function
- `coderd/chatd/chatd.go` — Simplify `ArchiveChat` to use single query
- `coderd/database/dbauthz/dbauthz.go` — Auth wrapper for new query
- `coderd/chats_test.go` — Added `TestArchiveChat/ArchivesChildren`
subtest
- `site/src/pages/AgentsPage/AgentsPage.tsx` — Filter children in SSE
handler
- Generated files updated via `make gen`
Add a new SubjectTypeChatd RBAC subject with minimal permissions:
- Chat: CRUD
- Workspace: Read
- DeploymentConfig: Read
Replace all 10 AsSystemRestricted calls in coderd/chatd/chatd.go:
- Line 890: Use AsChatd instead of AsSystemRestricted for the background
processor context.
- Subscribe() path (5 calls): Remove system escalation entirely; these
run under the authenticated user's context from the HTTP handler.
- processChat path (4 calls): Remove redundant per-call wraps; the
context already carries AsChatd from the processor start.
Add TestAsChatd verifying allowed and denied actions.
Created using Mux (Opus 4.6)
## Flake Fix
Resolves https://github.com/coder/internal/issues/1301
`TestAIBridgeListInterceptions/Pagination/offset` flakes with a 500
caused by `runtime error: integer divide by zero` in `pq.ParseTimestamp`
(encode.go:430) during `GetAPIKeyByID` in the auth middleware.
### Root Cause
**PostgreSQL historical timezone formatting + fragile pq parser:**
1. **Year-0001 timestamps trigger unusual PostgreSQL formatting.** New
API keys were initialized with `LastUsed: time.Time{}` (year
0001-01-01). When the PostgreSQL server timezone is non-UTC, it applies
historical Local Mean Time (LMT) offsets for pre-1900 dates. For year
0001, this can produce timestamps with seconds in the timezone offset
like `0001-12-31 19:03:58-04:56:02`, a format the pq parser was never
designed to handle.
2. **The pq parser panics on unexpected formats.** The
fractional-seconds parser at encode.go:430 computes `fracOff` via
`strings.IndexAny`. When the timestamp has an unusual LMT format, index
arithmetic can produce `fracOff ≤ 0`, causing `int(math.Pow(10,
float64(negative))) = 0` → divide-by-zero panic.
3. **Why it is intermittent:** CI Postgres instances may have varying
timezone configs across runs. The pagination test makes 80+ API calls,
each reading `last_used` via `GetAPIKeyByID`, increasing the probability
of hitting the edge case.
4. **Ruled out pq race condition.** The decode path copies bytes to a Go
string via `string(s)` before `ParseTimestamp`, so buffer reuse cannot
corrupt the input.
### Fix
Initialize `LastUsed` to `time.Unix(0, 0).UTC()` (Unix epoch,
1970-01-01) instead of `time.Time{}` (year 0001). This avoids the entire
class of historical timestamp formatting edge cases.
**Why not `dbtime.Now()`?** The auth middleware debounces `LastUsed`
updates — it only writes when `now.Sub(key.LastUsed) > time.Hour`. Using
`dbtime.Now()` makes the key appear freshly used so the debounce never
triggers, breaking `TestPostUsers/LastSeenAt` and
`TestUsersFilter/LastSeenBeforeNow`. Unix epoch is always >1 hour in the
past, so debounce works correctly.
### Follow-up
A defensive fix should also be added to the `coder/pq` fork (guard
`fracOff ≤ 0` before the division in `ParseTimestamp`). Other year-0001
sentinel values exist across the codebase (`workspace_builds.deadline`,
`users.last_seen_at`, `workspaces.last_used_at`, etc.) and remain
theoretically vulnerable until the pq fork is hardened.
## Summary
Wire VAPID web push notifications into the Agents (chat) system so users
get desktop notifications when an agent finishes running.
### Backend
- Add `webpush.Dispatcher` to `chatd.Server` and pass it through from
`coderd.Options.WebPushDispatcher`
- In `processChat()`'s deferred cleanup, dispatch a web push
notification when the chat reaches a terminal state:
- **`waiting`** (success): "Agent has finished running."
- **`error`** (failure): the error message, or "Agent encountered an
error."
- Sub-agent chats (`ParentChatID.Valid`) are skipped to avoid
notification spam from internal delegation
- Gracefully no-ops when the dispatcher is nil (web push disabled)
### Frontend
- New `WebPushButton` component — a bell icon that uses the existing
`useWebpushNotifications` hook
- Returns `null` when the `web-push` experiment is off
- Three states: loading spinner, green bell (subscribed), muted bell-off
(unsubscribed)
- Tooltip + toast feedback on toggle
- Added to both the Agents page empty state top bar and the AgentDetail
top bar
- The Agents page has its own layout (no standard Navbar), so it needs
its own subscribe button
### End-to-end flow
1. User clicks the bell icon on `/agents` → browser subscribes via VAPID
2. User starts an agent chat → chat enters `running` status
3. Agent finishes → `processChat` defer sets status to `waiting`/`error`
→ dispatches web push
4. Browser service worker shows a desktop notification with the chat
title and status
---------
Co-authored-by: Coder <coder@users.noreply.github.com>
## Problem
Chat titles sometimes don't update in the UI. The generated AI title
gets stuck as the fallback (first 6 words of the message) even though
the backend successfully generates a proper title.
## Root Causes
### 1. Cancelable context used during cleanup DB read (P0)
In `processChat`, the deferred cleanup re-reads the chat from the DB to
pick up the AI-generated title for the `status_change` pubsub event. But
it used the cancelable `ctx` instead of `cleanupCtx`:
```go
// Before — ctx may already be canceled here
if freshChat, readErr := p.db.GetChatByID(ctx, chat.ID); readErr == nil {
```
When the context is canceled, the DB read fails silently and the
`status_change` event carries the stale fallback title.
### 2. Title goroutine not tracked by inflight WaitGroup (P2)
The `maybeGenerateChatTitle` goroutine was fire-and-forget — not tracked
by `p.inflight`. During graceful shutdown, the server could exit before
the goroutine completes its DB write or pubsub publish.
### 3. No recovery when watchChats() WebSocket misses events
The frontend relies entirely on the `watchChats()` SSE connection for
title updates. If the connection drops or misses events, titles never
recover — the only fix was a full page reload.
## Fixes
1. **Use `cleanupCtx`** for the `GetChatByID` call and logger in the
deferred cleanup block.
2. **Track the title goroutine** with `p.inflight.Add(1)` / `defer
p.inflight.Done()` so shutdown waits for it.
3. **Invalidate chats query** on WebSocket open/close/error events so
missed updates are recovered via refetch. Also enable
`refetchOnWindowFocus` for the chats query.
Co-authored-by: Coder <coder@users.noreply.github.com>
When a chatd server shuts down (`Close()`), the server context is
canceled. Previously, in-flight chats would be marked as `error` because
the `context.Canceled` error was not distinguished from actual
processing failures.
This adds `isShutdownCancellation()` to detect when the error is caused
by the server context being canceled (as opposed to a chat-specific
cancellation like `ErrInterrupted`). When detected, the chat status is
set to `pending` with no `last_error`, allowing another replica to pick
it up and retry.
Extracted from #22440 — only the context cancellation bug fix, no
chattest changes.
The in-memory stream buffer accumulated message-part events for the
entire duration of a chat run. Late-joining subscribers received all
buffered parts even though the backing messages had already been
committed to the database, wasting memory and potentially duplicating
content.
Clear the buffer at the end of each `persistStep` call so that only
in-flight (uncommitted) parts remain in the buffer.
## Summary
Remove the `workspace_agent_id` column from the `chats` table and
dynamically look up the first workspace agent instead.
## Problem
When a workspace is stopped and restarted, the workspace agent gets a
new ID. The `workspace_agent_id` stored on the chat at creation time
becomes stale, making the agent unreachable. This caused chats to break
after workspace restarts.
## Solution
Instead of persisting the agent ID, dynamically look up the first agent
from the workspace's latest build via
`GetWorkspaceAgentsInLatestBuildByWorkspaceID` whenever an agent
connection is needed. The `workspace_id` on the chat remains stable
across restarts.
This behavior may be refined later (e.g., agent selection heuristics),
but picking the first agent resolves the immediate breakage.
## Changes
- **Migration 000425**: Drop `workspace_agent_id` column from `chats`
- **SQL queries**: Remove `workspace_agent_id` from `InsertChat` and
`UpdateChatWorkspace`
- **chatd.go**: `getWorkspaceConn` and `resolveInstructions` now look up
agents dynamically from workspace ID
- **chatd.go**: Remove `refreshChatWorkspaceSnapshot` (no longer needed)
- **createworkspace.go**: Stop persisting agent ID when associating
workspace with chat
- **subagent.go**: Stop passing agent ID to child chats
- **SDK/frontend**: Remove `WorkspaceAgentID` / `workspace_agent_id`
from Chat type
---------
Co-authored-by: Kyle Carberry <kylecarbs@gmail.com>
Two changes:
1. **Gate subagent tools behind `!chat.ParentChatID.Valid`** so child
agents never receive `spawn_agent`, `wait_agent`, `message_agent`, or
`close_agent`. Previously all 4 tools were given to every chat.
`spawn_agent` would fail at runtime ("delegated chats cannot create
child subagents") but the other 3 had no guard at all — meaning a child
could theoretically operate on sibling chats. Removing the tools
entirely is cleaner and saves context window.
2. **Rewrite tool descriptions to explain *when* to use them**, not just
what they do. `spawn_agent` now says to use it for clearly scoped,
independent, self-contained tasks (e.g. fixing a specific bug, writing a
single module, running a migration) and explicitly says *not* to use it
for simple operations you can handle with
`execute`/`read_file`/`write_file`. It also states that child agents
cannot spawn their own subagents. The other 3 tools get similar
guidance-oriented descriptions.
Co-authored-by: Coder <coder@users.noreply.github.com>
## Summary
Fixes four frontend↔backend discrepancies in chat stream state
management that could cause duplicate content, UI flicker, and stale
stream state.
### Backend fixes (`coderd/chatd/chatd.go`)
**1. No-pubsub path double-replayed message_part events**
`Subscribe()` built an `initialSnapshot` containing `message_part`
events from `localSnapshot`, then the no-pubsub goroutine replayed the
same `localSnapshot` into the `mergedEvents` channel. Since `streamChat`
sends the snapshot first then reads the channel, the frontend received
every `message_part` twice. `applyMessagePartToStreamState` doesn't
deduplicate — text gets concatenated, so content appeared doubled.
Fix: Only forward live `localParts` in the no-pubsub goroutine; the
snapshot already contains the historical events.
**2. Snapshot missing status event**
The initial snapshot never included a `status` event. The frontend's
`shouldApplyMessagePart()` gates on status (`pending`/`waiting`), but
the initial status came from a separate REST query via `useEffect`.
During the race window between snapshot arrival and REST resolution,
`message_part` events could be incorrectly accepted or rejected.
Fix: Prepend a `status` event to the snapshot after loading the chat
from DB, so the frontend has the authoritative status from the very
first batch.
### Frontend fixes (`ChatContext.ts`)
**3. Scheduled stream reset not canceled by subsequent message_parts**
When a `message` event arrived, `scheduleStreamReset()` queued
`clearStreamState` via `requestAnimationFrame`. If new `message_part`
events arrived in the next WebSocket frame before the rAF fired, they
were pushed to `pendingMessageParts` without canceling the scheduled
reset. The rAF would fire between frames, clearing stream state, then
the next flush would re-populate it — causing a visible flash.
Fix: Call `cancelScheduledStreamReset()` when accumulating
`message_part` events.
**4. startTransition race with synchronous clearStreamState**
`flushMessageParts` wrapped `applyMessageParts` in `startTransition`,
which React can defer. If a `status: "waiting"` event arrived in the
same batch after `message_part` events, the status handler cleared
stream state synchronously, but the deferred `applyMessageParts`
callback could fire afterward and re-populate it.
Fix: Re-check `shouldApplyMessagePart()` inside the `startTransition`
callback at execution time.
### Tests added
- **Go**: `TestSubscribeSnapshotIncludesStatusEvent` — asserts the first
snapshot event is a status event
- **Go**: `TestSubscribeNoPubsubNoDuplicateMessageParts` — asserts the
events channel doesn't replay snapshot events
- **TS**: `cancels scheduled stream reset when message_part arrives
after message` — verifies stream state survives a [message,
message_part] batch
- **TS**: `does not apply message parts after status changes to waiting`
— verifies deferred applyMessageParts respects status transitions
## Summary
Adds a new agent-side process management HTTP API and rewrites the chat
execute tool to use it instead of SSH sessions.
## What changed
### New agent/agentproc/ package
- **headtail.go** — Thread-safe io.Writer with bounded memory (16KB head
+ 16KB tail ring buffer). Provides LLM-ready output with truncation
metadata and long-line truncation at 2048 bytes.
- **headtail_test.go** — 16 tests including race detector coverage for
concurrent writes.
- **process.go** — Manager + Process types for lifecycle management
using agentexec.Execer for proper OOM/nice scores.
- **api.go** — HTTP API following the agentfiles chi router pattern. 4
endpoints: start, list, output, signal.
### Agent wiring (agent/agent.go, agent/api.go)
Mounts the process API at /api/v0/processes, mirroring how agentfiles is
mounted.
### SDK (codersdk/workspacesdk/agentconn.go)
4 new AgentConn interface methods + 7 request/response types:
- StartProcess, ListProcesses, ProcessOutput, SignalProcess
### Execute tool rewrite (coderd/chatd/chattool/execute.go)
- SSH to Agent API: conn.StartProcess() + conn.ProcessOutput() polling
- New parameters: workdir, run_in_background
- Structured response: success, exit_code, wall_duration_ms, error,
truncated, note, background_process_id
- Non-interactive env vars: GIT_EDITOR=true, TERM=dumb, NO_COLOR=1,
PAGER=cat, etc.
- Output truncation: HeadTailBuffer caps at 32KB for LLM consumption
- File-dump detection with advisory notes suggesting read_file
- Default timeout: 60s to 10s
- Foreground polling: 200ms intervals until exit or timeout
## Architecture
State lives on the agent, surviving coderd failover and instance
changes. Any coderd replica can query any agent via HTTP over tailnet.
Adds a nullable `last_error` column to the `chats` table so error
reasons survive page reloads.
**Backend:**
- Migration adds `last_error TEXT` (nullable) to chats
- `UpdateChatStatus` writes the error reason when status transitions to
`error`, clears it (NULL) on recovery
- `convertChat` maps `sql.NullString` to `*string` in the SDK
**Frontend:**
- Sidebar falls back to `chat.last_error` when no stream error reason is
cached
- Chat detail page does the same for `persistedErrorReason`
- Fixtures updated for new required field
## Summary
Adds a new `diff_status_change` event kind to the `/chats/watch` pubsub
stream so the sidebar can update diff status (PR created, files changed,
branch info) without a full page reload.
### Problem
When a chat's diff status changes (e.g. PR created via GitHub, git
branch pushed), the sidebar didn't update because:
1. The backend `publishChatPubsubEvent` didn't include diff status data
2. The frontend watch handler only merged `status`, `title`, and
`updated_at` from events
### Solution
A **notify-only** approach: a new `ChatEventKindDiffStatusChange` event
kind tells the frontend "diff status changed for chat X" — the frontend
then invalidates the relevant React Query cache entries to re-fetch.
### Backend changes
- **`coderd/pubsub/chatevent.go`**: New `ChatEventKindDiffStatusChange =
"diff_status_change"` constant
- **`coderd/chatd/chatd.go`**: New `PublishDiffStatusChange(ctx,
chatID)` method on `Server`
- **`coderd/chats.go`**: New `publishChatDiffStatusEvent` helper.
Published from:
- `refreshWorkspaceChatDiffStatuses` — after each chat's diff status is
refreshed via GitHub API
- `storeChatGitRef` — after persisting git branch/origin info from
workspace agent
### Frontend changes
- **`AgentsPage.tsx`**: Handle `diff_status_change` event by
invalidating `chatDiffStatusKey` and `chatDiffContentsKey` queries
- **`ChatContext.ts`**: Remove redundant diff status invalidation that
fired on every chat status change (the new event kind handles this
properly)
## Summary
The UI has always labeled the action as "Archive agent" but the backend
was performing a hard `DELETE`, permanently destroying chats and all
their messages.
This change replaces the hard delete with a soft archive, consistent
with the pattern used by template versions.
## Changes
### Database
- **Migration 000423**: Add `archived boolean DEFAULT false NOT NULL`
column to `chats` table
- Replace `DeleteChatByID` query with `ArchiveChatByID` (`UPDATE SET
archived = true`)
- Add `UnarchiveChatByID` query (`UPDATE SET archived = false`)
- Filter archived chats from `GetChatsByOwnerID` (`WHERE archived =
false`)
### API
- Remove `DELETE /api/experimental/chats/{chat}`
- Add `POST /api/experimental/chats/{chat}/archive` — archives a chat
and all its descendants
- Add `POST /api/experimental/chats/{chat}/unarchive` — unarchives a
single chat (API only, no UI yet)
### Backend
- `archiveChatTree()` recursively archives child chats (replaces
`deleteChatTree()` which hard-deleted)
- Chat daemon's `ArchiveChat()` archives the full chat tree in a
transaction
- Authorization uses `ActionUpdate` instead of `ActionDelete`
### SDK
- Replace `DeleteChat()` with `ArchiveChat()` and `UnarchiveChat()`
- Add `Archived` field to `Chat` struct
### Frontend
- `archiveChat` API call uses `POST .../archive` instead of `DELETE`
- No UI changes — the "Archive agent" button now actually archives
instead of deleting
## Design Decision
This follows the **template version archive pattern** (Pattern B in the
codebase):
- `archived boolean` column (not `deleted boolean`)
- Dedicated `POST .../archive` and `POST .../unarchive` routes (not
repurposing `DELETE`)
- Reversible — users can unarchive via the API (UI for this will come
later)
## Problem
`resolveChatGitHubAccessToken` reads the `OAuthAccessToken` directly
from the database without refreshing it. When the token expires, GitHub
returns "bad credentials" and the chat diff features break.
## Fix
Call `config.RefreshToken()` before returning the token — the same code
path used by `provisionerdserver` when handing tokens to provisioners.
- Builds a map of provider ID → `*externalauth.Config` during the
existing config iteration
- After fetching the `ExternalAuthLink` from the DB, calls
`cfg.RefreshToken()` if a matching config exists
- On refresh failure, falls through to the existing token (GitHub tokens
without expiry still work) with a debug log
## Problem
Context compaction in chatd persisted durable messages for the
`chat_summarized` tool call and result via `publishMessage`, but never
published `message_part` streaming events via `publishMessagePart`. This
meant connected clients had no streaming representation of the
compaction.
The client's `streamState` (built entirely from `message_part` events in
`streamState.ts`) never saw the compaction tool call, so:
- No **"Summarizing..."** running state was shown to the user during
summary generation (which can take up to 90s).
- The durable `message` events arrived after or interleaved with the
`status: waiting` event, causing the tool to appear as "Summarized" with
the chat appearing to just stop.
## Fix
### 1. `CompactionOptions.OnStart` callback (chatloop)
Added an `OnStart` callback to `CompactionOptions`, called in
`maybeCompact` right before `generateCompactionSummary` (the slow LLM
call). This gives `chatd` a hook to publish the tool-call `message_part`
immediately when compaction begins.
### 2. Tool-result streaming part (chatd)
`persistChatContextSummary` now publishes a tool-result `message_part`
before the durable `message` events, so clients transition from
"Summarizing..." to "Summarized" before the status change arrives.
### Event ordering is now:
1. `message_part` (tool call via `OnStart`) — client shows
"Summarizing..."
2. LLM generates summary (up to 90s)
3. `message_part` (tool result) — client shows "Summarized" in stream
state
4. `message` (assistant) — durable message persisted, stream state
resets
5. `message` (tool) — durable tool result persisted
6. `status: waiting` — chat transitions to idle
## Tests
- **`OnStartFiresBeforePersist`**: Verifies callback ordering is
`on_start` → `generate` → `persist`.
- **`OnStartNotCalledBelowThreshold`**: Verifies `OnStart` is not called
when context usage is below the compaction threshold.
## Problem
Non-admin users of the Agents (chat) feature send `model_config_id:
"00000000-0000-0000-0000-000000000000"` (nil UUID) when creating chats,
because the `GET /api/experimental/chats/model-configs` endpoint
requires `policy.ActionRead` on `rbac.ResourceDeploymentConfig`, which
is only granted to admins.
The flow:
1. `AgentsPage.tsx` calls `useQuery(chatModelConfigs())` → hits
`listChatModelConfigs`
2. Non-admin users get a **403 Forbidden** response
3. `chatModelConfigsQuery.data` is `undefined`, so the
`modelConfigIDByModelID` map is empty
4. `handleCreateChat` falls back to `nilUUID` for `model_config_id`
5. The backend rejects the nil UUID: `"Invalid model config ID."`
## Fix
Changed `listChatModelConfigs` to allow all authenticated users to read
model configs:
- **Admin users** continue to see all configs (including disabled ones)
for management via `GetChatModelConfigs`
- **Non-admin users** now see only enabled configs via
`GetEnabledChatModelConfigs` with a system context, which is sufficient
for using the chat feature
This follows the same pattern as `listChatModels`, which already uses
`dbauthz.AsSystemRestricted(ctx)` to allow all authenticated users to
see available models.
Write endpoints (create/update/delete) retain their existing
`ResourceDeploymentConfig` authorization.
## Testing
- Updated `TestListChatModelConfigs/ForbiddenForOrganizationMember` →
`SuccessForOrganizationMember` to verify non-admin users can list
enabled model configs
- All existing chat tests continue to pass
## Problem
When coderd instances are redeployed (e.g. rolling deployment on
dogfood), in-flight chats get stuck in `running` status permanently. The
UI shows them as "thinking" with a spinning indicator, but no worker is
actually processing them. They never error or resume.
## Root Cause
Two bugs combine to cause this:
### Bug 1: Shutdown cleanup uses a canceled context
The `processChat` defer block updates the chat status in the DB when
processing completes. But it uses `ctx`, which `Close()` cancels
*before* the defer runs. The DB transaction silently fails with
`context.Canceled`, leaving the chat in `status=running` with a dead
`worker_id`.
```go
// Close() calls p.cancel() which cancels ctx
// Then the defer tries to use the now-canceled ctx:
defer func() {
err := p.db.InTx(func(tx database.Store) error {
tx.GetChatByIDForUpdate(ctx, chat.ID) // FAILS
tx.UpdateChatStatus(ctx, ...) // FAILS
}, nil)
}()
```
### Bug 2: Stale recovery runs only once at startup
`recoverStaleChats()` was called only once in `start()`, not
periodically. During a rolling deployment, the new instance starts while
the old one is still alive (fresh heartbeat). By the time the old
instance crashes, no one checks again.
## Fix
1. **Use `context.WithoutCancel(ctx)` in the processChat defer** — the
cleanup transaction now completes even during graceful shutdown.
2. **Run `recoverStaleChats` periodically** — a second ticker in the
`start()` loop checks for stale chats at `inFlightChatStaleAfter / 5`
intervals (default: every 1 minute). This catches orphaned chats even
when the instance that owns them crashes without clean shutdown.
## Tests
- `TestRecoverStaleChatsPeriodically` — Verifies chats orphaned *after*
startup are recovered by the periodic loop (not just the startup check).
- `TestNewReplicaRecoversStaleChatFromDeadReplica` — Verifies a new
replica recovers stale chats on startup.
- `TestWaitingChatsAreNotRecoveredAsStale` — Negative test: `waiting`
chats are not incorrectly modified by recovery.
## Summary
Adds a new line-based file reading endpoint to the workspace agent,
replacing the unbounded byte-based approach for the `read_file` chat
tool and `coder_workspace_read_file` MCP tool.
**Problem**: The current `read_file` tool returns the entire file
contents with no limits, which can blow up LLM context windows and cause
OOM issues with large files.
**Solution**: Inspired by [`coder/mux`](https://github.com/coder/mux)
and [`openai/codex`](https://github.com/openai/codex), implement a
line-based reader with safety limits.
## Changes
### Agent (`agent/agentfiles/`)
- New `/read-file-lines` endpoint with `HandleReadFileLines` handler
- Line-based `offset` (1-based line number, default: 1) and `limit`
(line count, default: 2000)
- Safety constants:
| Constant | Value | Purpose |
|---|---|---|
| `MaxFileSize` | 1 MB | Reject files larger than this at stat |
| `MaxLineBytes` | 1,024 | Per-line truncation with `... [truncated]`
marker |
| `MaxResponseLines` | 2,000 | Max lines per response |
| `MaxResponseBytes` | 32 KB | Max total response size |
| `DefaultLineLimit` | 2,000 | Default when no limit specified |
- Line numbering format: `1\tcontent` (tab-separated)
- Structured JSON response: `{ success, file_size, total_lines,
lines_read, content, error }`
- Hard errors when limits exceeded — tells the LLM to use
`offset`/`limit`
- Existing byte-based `/read-file` endpoint preserved (used by
`instruction.go`)
### SDK (`codersdk/workspacesdk/`)
- `ReadFileLinesResponse` type added
- `ReadFileLines` method added to `AgentConn` interface
- Mock regenerated
### Chat tool (`coderd/chatd/chattool/`)
- `read_file` tool now uses `conn.ReadFileLines()` instead of
`conn.ReadFile()`
- Updated tool description to document line-based parameters
- Response includes `file_size`, `total_lines`, `lines_read` metadata
### MCP tool (`codersdk/toolsdk/`)
- `coder_workspace_read_file` updated to use line-based reading
- Schema descriptions updated for line-based offset/limit
- Removed `maxFileLimit` constant (agent handles limits now)
### Tests
- 13 new test cases for `TestReadFileLines`:
- Path validation (empty, relative, non-existent, directory, no
permissions)
- Empty file handling
- Basic read, offset, limit, offset+limit combinations
- Offset beyond file length
- Long line truncation (>1024 bytes)
- Large file rejection (>1MB)
- All existing tests pass unchanged
## Design decisions
| Decision | Rationale |
|---|---|
| Line-based, not byte-based | Both coder/mux and openai/codex use
line-based — matches how LLMs reason about code |
| Default limit of 2000 | Matches codex; prevents accidental full-file
dumps while being generous |
| 32 KB response cap | Compromise between mux (16 KB) and codex (no cap)
|
| 1024 byte/line truncation with marker | More generous than codex
(500), marker helps LLM know data is missing |
| Hard errors on overflow | Matches mux; forces LLM to paginate rather
than getting partial data |
| Preserve byte-based endpoint | `instruction.go` needs raw byte access
for AGENTS.md |
## Problem
Chat titles revert to the fallback truncated title after briefly showing
the AI-generated title. Even reloading the page doesn't help — the
correct title flashes then gets overwritten.
## Root Cause
Single bug, two symptoms.
In `processChat` (`coderd/chatd/chatd.go`), the `chat` variable is
passed by value. The flow:
1. `processChat(ctx, chat)` receives `chat` with the initial fallback
title (truncated first message).
2. Inside `runChat`, `maybeGenerateChatTitle` generates an AI title,
writes it to the DB via `UpdateChatByID`, and publishes a `title_change`
event. **The DB has the correct title.** The client briefly displays it.
3. `runChat` returns. The **deferred cleanup** in `processChat`
publishes `publishChatPubsubEvent(chat, StatusChange)` — but `chat` here
is the original value copy that still has the **old fallback title**.
4. The frontend receives the `status_change` SSE event and
**unconditionally applies `title` from every event kind** (see
`AgentsPage.tsx` line ~305: `title: updatedChat.title`). This overwrites
the correct AI title with the stale fallback.
**Why reload doesn't help:** If the chat is still processing when the
page reloads, `listChats` loads the correct title from the DB, but then
the deferred `status_change` event arrives moments later and clobbers
it. The title was always in the DB — it was the pubsub event that kept
overwriting it.
## Fix
Re-read the chat from the database in the deferred cleanup before
publishing the final `status_change` event, so it carries the current
(AI-generated) title.