- The httpmw upstream from this endpoint only checks for read perms to the
workspace agent. Recreating a dev container should require `update`
perms since it mutates state. This also matches the behavior of the
`DELETE` endpoint
- Empty string is valid for `apiKeyID` in paths that genuinely lack a
caller key (e.g. agent-initiated context injection in
`workspaceAgentAddChatContext`). AI Gateway fail-closed check remains
the runtime safety net.
- Context injection paths (`persistInstructionFiles`, compaction) read
the key from `aibridge.DelegatedAPIKeyIDFromContext(ctx)`, set upstream
by `contextWithActiveTurnAPIKeyID`.
- Subagent context copy branches on `copiedRole ==
database.ChatMessageRoleUser` to choose the right append function.
> Generated by Coder Agents
`TestWatchAgentContainers/CoderdWebSocketCanHandleClientClosing` spent
about 15 seconds waiting for the real websocket heartbeat ticker to
detect that the client closed.
Add a clock-aware `HeartbeatClose` wrapper and pass `api.Clock` through
the containers watch handler so the test can drive the heartbeat
deterministically with `quartz.Mock`. The test still verifies the same
client-close teardown path, but it advances the heartbeat tick instead
of waiting for wall-clock time.
Refs #25557
Discovered as part of the work on CODAGT-381.
GetChats now returns only root chats (parent_chat_id IS NULL).
A new GetChildChatsByParentIDs query fetches children for visible
roots and embeds them in each parent's Children field. The
singular getChat endpoint does the same.
Archive invariant is one-way: parent archived implies child
archived. Parent archive/unarchive cascades via root_chat_id.
Individual child archive is permitted; child unarchive while the
parent is archived is rejected atomically (row lock on child,
re-read parent inside the transaction). Embedded children are
filtered by the caller's archive state so individually-archived
children stay hidden from active-parent views.
Gitsync MarkStale uses GetChatsByWorkspaceIDs directly;
MarkStaleParams.OwnerID removed (dead after the switch).
Frontend: buildChatTree reads from the embedded children field,
WebSocket handlers route child events into the parent's children
array, and archiving a child strips it from the parent cache.
Adds `coder exp chat context add` and `coder exp chat context clear`
commands that run inside a workspace to manage chat context files via
the agent token.
`add` reads instruction and skill files from a directory (defaulting to
cwd) and inserts them as context-file messages into an active chat.
Multiple calls are additive — `instructionFromContextFiles` already
accumulates all context-file parts across messages.
`clear` soft-deletes all context-file messages, causing
`contextFileAgentID()` to return `!found` on the next turn, which
triggers `needsInstructionPersist=true` and re-fetches defaults from the
agent.
Both commands auto-detect the target chat via `CODER_CHAT_ID` (already
set by `agentproc` on chat-spawned processes), or fall back to
single-active-chat resolution for the agent. The `--chat` flag overrides
both.
Also adds sub-agent context inheritance: `createChildSubagentChat` now
copies parent context-file messages to child chats at spawn time, so
delegated sub-agents share the same instruction context without
independently re-fetching from the workspace agent.
<details><summary>Implementation details</summary>
**New files:**
- `cli/exp_chat.go` — CLI command tree under `coder exp chat context`
**Modified files:**
- `agent/agentcontextconfig/api.go` — `ConfigFromDir()` reads context
from an arbitrary directory without env vars
- `codersdk/agentsdk/agentsdk.go` — `AddChatContext`/`ClearChatContext`
SDK methods
- `coderd/workspaceagents.go` — POST/DELETE handlers on
`/workspaceagents/me/chat-context`
- `coderd/coderd.go` — Route registration
- `coderd/database/queries/chats.sql` — `GetActiveChatsByAgentID`,
`SoftDeleteContextFileMessages`
- `coderd/database/dbauthz/dbauthz.go` — RBAC implementations for new
queries
- `coderd/x/chatd/subagent.go` — `copyParentContextFiles` for sub-agent
inheritance
- `cli/root.go` — Register `chatCommand()` in `AGPLExperimental()`
**Auth pattern:** Uses `AgentAuth` (same as `coder external-auth`) —
agent token via `CODER_AGENT_TOKEN` + `CODER_AGENT_URL` env vars.
</details>
> 🤖 Generated by Coder Agents
---------
Co-authored-by: Michael Suchacz <203725896+ibetitsmike@users.noreply.github.com>
Workspace agent logs could still fail after the earlier invalid UTF-8
fix because NUL bytes are valid Go/protobuf strings but are rejected by
Postgres text columns. The legacy HTTP log upload path also bypassed the
old sanitization entirely, and both server insert paths computed
logs_length from the unsanitized input.
Add a shared log-output sanitizer in agentsdk, use it in the protobuf
conversion path and both server-side insert paths, and compute
OutputLength from the sanitized string so overflow accounting matches
what is actually stored. This keeps the old invalid UTF-8 behavior while
also handling embedded NUL bytes consistently across DRPC and HTTP log
ingestion.
Refs [#23292 ](https://github.com/coder/coder/issues/23292)
Refs [#13433 ](https://github.com/coder/coder/issues/13433)
## Problem
When a prebuilt workspace is claimed, the agent reinitializes via a
single fire-and-forget pubsub event over SSE. If the agent's SSE
connection is interrupted at claim time, the event is permanently lost —
the workspace is stuck with no self-healing path.
Additionally, regular (non-prebuild) workspaces had no way to opt out of
the `/reinit` polling loop — agents would reconnect indefinitely to an
endpoint that would never send them anything useful.
## Root Cause
`workspaceAgentReinit` fetches the workspace (with its current
`owner_id`) via `GetWorkspaceByAgentID`, but never checked whether a
claim already happened. It only subscribed to pubsub for future events.
The database already has durable claim state (`owner_id` changes from
`PrebuildsSystemUserID` to the real user), but no layer ever consulted
it on reconnection.
## Solution
### Server-side durable check with first-build-initiator gating
**TOCTOU-safe ordering**: Subscribe to pubsub claim events *before* any
durable checks, so a claim that fires during the check is buffered in
the channel rather than lost.
**First-build-initiator gating**: When `!workspace.IsPrebuild()` (owner
is no longer the system user), look up the first build's `InitiatorID`.
The prebuild reconciler always uses `PrebuildsSystemUserID` as the
initiator. This distinguishes claimed prebuilds from regular workspaces
without any SQL schema changes.
- **Regular workspace** (first build initiator ≠ system user) → **409
Conflict**, agent stops reconnecting
- **Claimed prebuild, build completed** → pre-seed channel with reinit
event and close it, transmitter delivers one-shot then exits
- **Claimed prebuild, build in-progress** → fall through to pubsub
subscription, agent waits for completion event
- **Unclaimed prebuild** → pubsub subscription (existing happy path)
### Declarative reinit events (defense-in-depth)
- Added `UserID` field to `ReinitializationEvent` with JSON tags
- Switched pubsub serialization from raw string to JSON (with
backward-compat fallback for rolling upgrades)
- Populated `UserID` at both the publish site and the durable check
### Agent SDK: 409 handling
`WaitForReinitLoop` detects 409 Conflict from the server and closes the
`reinitEvents` channel, cleanly exiting the retry goroutine.
### Agent CLI: fixed two bugs + added reinitCtx
- **Closed channel (`!ok`)**: now blocks on `<-ctx.Done()` instead of
`continue`, keeping the current agent running. Previously this would
leak agents by skipping `agnt.Close()` and re-entering the loop.
- **Duplicate owner reinit**: cancels `reinitCtx` (stops the reinit
goroutine), then blocks on `<-ctx.Done()`. Previously `continue` would
skip cleanup and create a new agent on the next loop iteration.
- **`reinitCtx`**: a cancellable child of `ctx` passed to
`WaitForReinitLoop`, allowing the agent to stop the reinit HTTP polling
after reinit completes.
### Agent-side idempotency
Tracks `lastOwnerID` in the agent reinit loop — duplicate events for the
same owner are skipped.
## Testing
- **"unclaimed prebuild receives reinit via pubsub"**: prebuild owned by
system user, pubsub event triggers reinit
- **"claimed prebuild receives one-shot reinit on reconnect"**: first
build by system user, owner changed, build completed → immediate reinit
(no pubsub needed)
- **"claimed prebuild waits during in-progress claim build"**: claimed
but build still running → no reinit until build completes
- **"regular workspace gets 409"**: first build by real user → 409
Conflict, agent stops polling
- Updated claim publisher/listener tests: verify `UserID` survives JSON
round-trip + backward compat with raw string payloads
- Updated SSE round-trip test: verify `UserID` survives transmit →
receive cycle
Fixes#22359
## Rolling upgrade note
During a rolling deploy where old coderd instances coexist with new
ones, the pubsub `ReinitializationEvent` has a new `workspace_id` field
(JSON key `workspace_id`). Old publishers send a raw reason string
instead of JSON; the new listener gracefully falls back by treating the
entire payload as the reason and filling in `WorkspaceID` from context.
The only visible effect during the upgrade window is that `WorkspaceID`
may be the zero UUID in agent-side logs — this is cosmetic and resolves
once all instances are updated.
## Problem
Subagent chats were receiving git context (branch, remote origin, PR
status) from their parent or sibling chats' git operations. When a git
operation triggers external auth, the workspace agent sends `chat_id`
identifying which chat initiated it — but this was broken at two levels:
1. **Agent side:** `CODER_CHAT_ID` was never injected into process
environments. `chatd` sets `Coder-Chat-Id` HTTP headers and the
agent extracts them for process isolation, but never propagated
`CODER_CHAT_ID` to `cmd.Env`. So `gitaskpass` always sent an empty
`chat_id`.
2. **Server side:** `workspaceAgentsExternalAuth` ignored the `chat_id`
query param. `MarkStale` broadcast git context to **all** chats on
the workspace via `filterChatsByWorkspaceID`.
## Fix
- Inject `CODER_CHAT_ID` into `cmd.Env` in `agentproc` when the chat
ID is known, so `gitaskpass` can read and forward it.
- Read `chat_id` from query params in `workspaceAgentsExternalAuth`
and thread it through `chatGitRef`.
- Refactor `MarkStale` to accept a `MarkStaleParams` struct. When
`ChatID` is provided, target only that specific chat. When empty
(legacy agents, non-chat git operations), fall back to the existing
workspace-wide broadcast.
- Extract `markStaleSingle` helper to deduplicate the upsert+publish
logic.
<details><summary>Investigation notes</summary>
### Data flow before fix
```
chatd → sets Coder-Chat-Id header on agent conn
agent → extracts chatID, stores on process struct
agent → does NOT set CODER_CHAT_ID in cmd.Env ← gap 1
gitaskpass → reads CODER_CHAT_ID (always empty), sends chat_id=""
server handler → ignores chat_id query param ← gap 2
MarkStale → broadcasts to ALL workspace chats
```
### Data flow after fix
```
chatd → sets Coder-Chat-Id header on agent conn
agent → extracts chatID, stores on process struct
agent → sets CODER_CHAT_ID in cmd.Env
gitaskpass → reads CODER_CHAT_ID, sends chat_id=<uuid>
server handler → reads chat_id, passes to MarkStale
MarkStale → targets only that specific chat
```
</details>
We currently call GetWorkspaceAgentByID millions of times at scale
unnecessarily. This PR embeds immutable fields into the relevant
services instead of fetching for them every time.
resolves https://github.com/coder/scaletest/issues/84
Confirmed with a 10k scaletest that this changeset takes the query from
10M+ queries down to 39k
- Adds `_API_BASE_URL` to `CODER_EXTERNAL_AUTH_CONFIG_`
- Extracts and refactors existing GitHub PR sync logic to new packages
`coderd/gitsync` and `coderd/externalauth/gitprovider`
- Associated wiring and tests
Created using Opus 4.6
Fixes a flaky test (`TestUserTailnetTelemetry/invalid_header`) caused by
sub-microsecond precision mismatch between `time.Now()` calls on
Windows.
The server used `time.Now()` (nanosecond precision) for `ConnectedAt`
and `DisconnectedAt`, while the test compared against its own
`time.Now()`. On Windows, wall-clock jitter can cause the server
timestamp to appear slightly before the test's `predialTime`.
Switch to `dbtime.Now()` which rounds to microsecond precision (matching
Postgres), consistent with all other timestamps in `workspaceagents.go`.
Relates to: https://github.com/coder/internal/issues/1390
## Problem
When the git askpass flow triggered diff status refreshes, it updated
**every chat** connected to the workspace. This was wasteful and could
cause confusing status updates on unrelated chats.
## Solution
Thread the chat ID through the entire git askpass flow so only the chat
that initiated the git operation gets updated:
1. **`coderd/chatd/chattool/execute.go`** — Sets `CODER_CHAT_ID` env var
on spawned processes (alongside the existing `CODER_CHAT_AGENT`)
2. **`cli/gitaskpass.go`** — Reads `CODER_CHAT_ID` from the environment
and sends it as a `chat_id` query parameter in the `ExternalAuthRequest`
3. **`codersdk/agentsdk/agentsdk.go`** — Adds `ChatID` field to
`ExternalAuthRequest` and encodes it as a query param
4. **`coderd/workspaceagents.go`** — Parses `chat_id` query param and
passes it through to `storeChatGitRef` and
`triggerWorkspaceChatDiffStatusRefresh`
5. **`coderd/chats.go`** — `storeChatGitRef` and
`refreshWorkspaceChatDiffStatuses` now scope updates to just the
initiating chat when a chat ID is provided, falling back to
all-workspace-chats behavior for backwards compatibility (non-chat git
operations)
The listen loop in workspaceAgentsExternalAuthListen compared
OAuthExpiry using == which compares `time.Time` internal struct fields
including the `*time.Location` pointer.
`time.LoadLocation` does not cache the returned `*Location` pointer, so
each lib/pq connection gets a distinct pointer for the same timezone.
When `pq.ParseTimestamp()` applies the connection's location to a parsed
timestamp, the resulting time.Time embeds that connection-specific
pointer. If the `sql.DB` pool hands out different connections for the
two GetExternalAuthLink reads, the identical timestamp produces
`time.Time` values where == returns false despite representing the same
instant. This is intermittent because the pool _usually_ reuses the same
connection for sequential queries.
This change uses `.Equal()` to compare instants regardless of location.
Also makes the test's validation call counter atomic to fix a possible
data race between the HTTP server and test goroutines.
<!--
If you have used AI to produce some or all of this PR, please ensure you
have read our [AI Contribution
guidelines](https://coder.com/docs/about/contributing/AI_CONTRIBUTING)
before submitting.
-->
part of https://github.com/coder/coder/issues/21335
This moves updating app status (used by Tasks) into the workspace agent
API over dRPC. This will allow us to update the status without having to
re-authenticate each time, like we would with an HTTP PATCH request.
Further PRs in this stack will pipe these requests thru from the CLI MCP
server to the agentsock and finally to this dRPC call to coderd.
* Adds support for parameter `format=text` in the following API routes:
* `/api/v2/workspaceagents/:id/logs`
* `/api/v2/workspacebuilds/:id/logs`
* `/api/v2/templateversions/:id/logs`
* `/api/v2/templateversions/:id/dry-run/:id/logs`
* Adds links to view raw logs on the following pages:
* Workspace build page
* Template editor page
* Template version page
* Refactors existing log formatting in `cli/logs.go` to live in `codersdk`.
🤖 Generated with Claude Opus 4.5, reviewed by me.
---------
Co-authored-by: Claude <noreply@anthropic.com>
Relates to https://github.com/coder/coder/pull/21676
* Replaces all existing usages of `httpapi.Heartbeat` with `httpapi.HeartbeatClose`
* Removes `httpapi.HeartbeatClose`
AI agents report status via patchWorkspaceAgentAppStatus, but this wasn't
extending workspace deadlines. This prevented proper task auto-pause behavior,
causing tasks to pause mid-execution when there were no human connections.
Now we call ActivityBumpWorkspace when agents report status, using the same
logic as SSH/IDE connections. We bump when transitioning to or from the working
state.
Closescoder/internal#1251
Relates to https://github.com/coder/internal/issues/1214
The `ExtractWorkspaceAgentParam` middleware ends up making 4 database
queries to follow the chain of `WorkspaceAgent` -> `WorkspaceResource`
-> `ProvisionerJob` -> `WorkspaceBuild` -- but then dropping all that
hard work on the floor. The `api.workspaceAgent` handler that references
this middleware then has to do all of that work again, plus one more
query to get the related `User` so we can get the username. This pattern
is also mirrored in `getDatabaseTerminal` but without the middleware.
This PR:
* Adds a new query `GetWorkspaceAgentAndWorkspaceByID` to fetch all
this information at once to avoid the multiple round-trips,
* Updates the existing usage of `GetWorkspaceAgentByID` to this new
query instead,
* Updates `ExtractWorkspaceAgentParam` to also store the workspace in
the request context
Dalibo: [0.63ms](https://explain.dalibo.com/plan/40bb597f3539gc6c)
Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example:
```
import (
"context"
"time"
"github.com/prometheus/client_golang/prometheus"
"golang.org/x/xerrors"
"gopkg.in/natefinch/lumberjack.v2"
"cdr.dev/slog/v3"
"github.com/coder/coder/v2/codersdk/agentsdk"
"github.com/coder/serpent"
)
```
3 groups: standard library, 3rd partly libs, Coder libs.
This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.
Upgrades to slog v3 which includes a small, but backward incompatible API change to the acceptible call arguments when logging. This change allows us to verify via compile time type checking that arguments are correct and won't cause a panic, as was possible in slog v1, which this replaces (v2 was tagged but never used in coder/coder).
It also updates dependencies that also use slog and were updated.
I've left the `aibridge` dependency as a commit SHA, under the assumption that the team there (cc @pawbana @dannykopping ) will tag and update the dependency soon and on their own schedule.
Other dependencies, I pushed new tags.
## Description
This PR fixes an issue where `GetLatestWorkspaceAppStatusesByAppID`
returned an unbounded number of rows for a given app ID, which could
cause performance issues for noisy or long-running AI tasks.
## Impact
This change reduces database query overhead for workspace app status
updates, particularly for busy AI tasks that update their status
frequently. Previously, fetching the latest status would return all
historical statuses, now it returns only the most recent one.
Fixes#20862
---
🤖 This change was written by Claude Sonnet 4.5 Thinking using [mux](https://github.com/coder/mux) and reviewed by a human 🏄🏻♂️
Relates to https://github.com/coder/internal/issues/1098
Currently AgentAPI waits for only 2 seconds worth of identical terminal
screen snapshots before deciding a task has entered a "stable" state. We
interpret this as becoming "idle", resulting in a notification being
triggered. This behavior is not ideal and is ultimately the root cause
of our spammy notifications.
Unfortunately, until we move AgentAPI to either use the Claude Code SDK
(or ACP wrapper around it), we are unable to easily fix the root cause.
This PR instead waits until the agent is ready before it will send state
change notifications. This will at least resolve _some_ of the
complaints about task state notifications being too spammy.
---
🤖 PR was written by Claude Sonnet 4.5 using [Coder
Mux](https://github.com/coder/cmux) and reviewed by a human 👩
Disclaimer: Claude did all of this, reviewed and committed by me.
I find the "task is working" notification straight after creation to be
unnecessary.
Added logic to skip the notification if the first app status is
"working".
## Description
Send a notification to the workspace owner when an AI task’s app state
becomes `Working` or `Idle`.
An AI task is identified by a workspace build with `HasAITask = true`
and `AITaskSidebarAppID` matching the agent app’s ID.
## Changes
* Add `TemplateTaskWorking` notification template.
* Add `TemplateTaskIdle` notification template.
* Add `GetLatestWorkspaceAppStatusesByAppID` SQL query to get the
workspace app statuses ordered by latest first.
* Update `PATCH /workspaceagents/me/app-status` to enqueue:
* `TemplateTaskWorking` when state transitions to `working`
* `TemplateTaskIdle` when state transitions to `idle`
* Notification labels include:
* `task`: task initial prompt
* `workspace`: workspace name
* Notification dedupe: include a minute-bucketed timestamp (UTC
truncated to the minute) in the enqueue data to allow identical content
to resend within the same day (but not more than once per minute).
Closes: https://github.com/coder/coder/issues/19776
When clients disconnected from the /containers/watch endpoint, the WebSocket
connection between coderd and the agent stayed open. This caused heartbeat
traffic every 15s that was incorrectly counted as workspace activity,
extending workspace lifetimes indefinitely.
Now properly cancels the agent connection context when the client disconnects.
Fixes https://github.com/coder/coder/issues/19372
We increase the read limit to 4MiB (we use this limit elsewhere). We
also make sure to stop sending messages when `containersCh` becomes
closed.
This PR replaces the use of the **container** ID with the
**devcontainer** ID. This is a breaking change. This allows rebuilding a
devcontainer when there is no valid container ID.
"Idle" is more accurate than "complete" since:
1. AgentAPI only knows if the screen is active; it has no way of knowing
if the task is complete.
2. The LLM might be done with its current prompt, but that does not mean
the task is complete either (it likely needs refinement).
The "complete" state will be reserved for future definition.
Additionally, in the case where the screen goes idle but the LLM never
reported a status update, we can get an idle icon without a message, and
it looks kinda janky in the UI so if there is no message I display the
state text.
Closes https://github.com/coder/internal/issues/699
This change introduces a refactor of the devcontainers recreation logic
which is now handled asynchronously rather than being request scoped.
The response was consequently changed from "No Content" to "Accepted" to
reflect this.
A new `Status` field was introduced to the devcontainer struct which
replaces `Running` (bool). This reflects that the devcontainer can now
be in various states (starting, running, stopped or errored).
The status field also protects against multiple concurrent recrations,
as long as they are initiated via the API.
Updates #16424
This pull request allows coder workspace agents to be reinitialized when
a prebuilt workspace is claimed by a user. This facilitates the transfer
of ownership between the anonymous prebuilds system user and the new
owner of the workspace.
Only a single agent per prebuilt workspace is supported for now, but
plumbing has already been done to facilitate the seamless transition to
multi-agent support.
---------
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Co-authored-by: Danny Kopping <dannykopping@gmail.com>
* Refactors toolsdk.Tools to remove opaque `map[string]any` argument in
favour of typed args structs.
* Refactors toolsdk.Tools to remove opaque passing of dependencies via
`context.Context` in favour of a tool dependencies struct.
* Adds panic recovery and clean context middleware to all tools.
* Adds `GenericTool` implementation to allow keeping `toolsdk.All` with
uniform type signature while maintaining type information in handlers.
* Adds stricter checks to `patchWorkspaceAgentAppStatus` handler.
There were too many ways to configure the agentcontainers API resulting
in inconsistent behavior or features not being enabled. This refactor
introduces a control flag for enabling or disabling the containers API.
When disabled, all implementations are no-op and explicit endpoint
behaviors are defined. When enabled, concrete implementations are used
by default but can be overridden by passing options.
https://github.com/coder/coder/pull/17163 introduced the
`workspace_app_statuses` table. Two of these fields
(`needs_user_attention`, `icon`) turned out to be surplus to
requirements.
- Removes columns `needs_user_attention` and `icon` from
`workspace_app_statuses`
- Marks the corresponding fields of `codersdk.WorkspaceAppStatus` as
deprecated.