> This PR was authored by Mux on behalf of Mike.
## Summary
Adds support for multiple peer root workspace agents sharing the same
`auth_instance_id`, so AWS, Azure, and GCP instance-identity auth can
issue the correct session token for a selected agent instead of assuming
a
single root agent per instance.
## Problem
When a Terraform template attaches two or more `coder_agent` resources
(with `auth = "aws-instance-identity"`) to a single compute instance,
every agent shares the same cloud instance ID. The existing singular
lookup picks whichever agent was created most recently, silently
ignoring
the others.
## Solution
Introduce an optional pre-auth agent selector (`CODER_AGENT_NAME`) and
make the server-side lookup ambiguity-aware.
**Database layer:**
- `GetWorkspaceAgentsByInstanceID` (`:many`): returns all matching root
agents for an instance ID.
- `GetWorkspaceAgentByInstanceIDAndName` (`:one`): returns the named
root
agent for disambiguation.
**SDK and CLI:**
- `agent_name` field added to AWS, Azure, and GCP request structs
(`omitempty` for backward compatibility).
- `CODER_AGENT_NAME` env var and `--agent-name` flag wired into the
agent
bootstrap before instance-identity auth runs.
**Server handler (`handleAuthInstanceID`):**
- When `agent_name` is present: direct lookup by (instance ID, name).
- When absent: legacy lookup, then resource-scoped ambiguity check.
Returns 409 with available agent names if multiple root agents match.
- Whitespace-only names are trimmed and treated as unspecified.
- Sub-agents remain excluded (`parent_id IS NULL` filter).
**Verification template:**
- `examples/templates/aws-multi-agent/` provisions one EC2 instance with
two agents (`main` and `dev`), both using instance-identity auth with
`CODER_AGENT_NAME` set in the cloud-init user data.
## Backward compatibility
Existing single-agent deployments work unchanged. The `agent_name` field
is optional with `omitempty`, and the unnamed path preserves today's
behavior when only one root agent matches.
## Problem
When a prebuilt workspace is claimed, the agent reinitializes via a
single fire-and-forget pubsub event over SSE. If the agent's SSE
connection is interrupted at claim time, the event is permanently lost —
the workspace is stuck with no self-healing path.
Additionally, regular (non-prebuild) workspaces had no way to opt out of
the `/reinit` polling loop — agents would reconnect indefinitely to an
endpoint that would never send them anything useful.
## Root Cause
`workspaceAgentReinit` fetches the workspace (with its current
`owner_id`) via `GetWorkspaceByAgentID`, but never checked whether a
claim already happened. It only subscribed to pubsub for future events.
The database already has durable claim state (`owner_id` changes from
`PrebuildsSystemUserID` to the real user), but no layer ever consulted
it on reconnection.
## Solution
### Server-side durable check with first-build-initiator gating
**TOCTOU-safe ordering**: Subscribe to pubsub claim events *before* any
durable checks, so a claim that fires during the check is buffered in
the channel rather than lost.
**First-build-initiator gating**: When `!workspace.IsPrebuild()` (owner
is no longer the system user), look up the first build's `InitiatorID`.
The prebuild reconciler always uses `PrebuildsSystemUserID` as the
initiator. This distinguishes claimed prebuilds from regular workspaces
without any SQL schema changes.
- **Regular workspace** (first build initiator ≠ system user) → **409
Conflict**, agent stops reconnecting
- **Claimed prebuild, build completed** → pre-seed channel with reinit
event and close it, transmitter delivers one-shot then exits
- **Claimed prebuild, build in-progress** → fall through to pubsub
subscription, agent waits for completion event
- **Unclaimed prebuild** → pubsub subscription (existing happy path)
### Declarative reinit events (defense-in-depth)
- Added `UserID` field to `ReinitializationEvent` with JSON tags
- Switched pubsub serialization from raw string to JSON (with
backward-compat fallback for rolling upgrades)
- Populated `UserID` at both the publish site and the durable check
### Agent SDK: 409 handling
`WaitForReinitLoop` detects 409 Conflict from the server and closes the
`reinitEvents` channel, cleanly exiting the retry goroutine.
### Agent CLI: fixed two bugs + added reinitCtx
- **Closed channel (`!ok`)**: now blocks on `<-ctx.Done()` instead of
`continue`, keeping the current agent running. Previously this would
leak agents by skipping `agnt.Close()` and re-entering the loop.
- **Duplicate owner reinit**: cancels `reinitCtx` (stops the reinit
goroutine), then blocks on `<-ctx.Done()`. Previously `continue` would
skip cleanup and create a new agent on the next loop iteration.
- **`reinitCtx`**: a cancellable child of `ctx` passed to
`WaitForReinitLoop`, allowing the agent to stop the reinit HTTP polling
after reinit completes.
### Agent-side idempotency
Tracks `lastOwnerID` in the agent reinit loop — duplicate events for the
same owner are skipped.
## Testing
- **"unclaimed prebuild receives reinit via pubsub"**: prebuild owned by
system user, pubsub event triggers reinit
- **"claimed prebuild receives one-shot reinit on reconnect"**: first
build by system user, owner changed, build completed → immediate reinit
(no pubsub needed)
- **"claimed prebuild waits during in-progress claim build"**: claimed
but build still running → no reinit until build completes
- **"regular workspace gets 409"**: first build by real user → 409
Conflict, agent stops polling
- Updated claim publisher/listener tests: verify `UserID` survives JSON
round-trip + backward compat with raw string payloads
- Updated SSE round-trip test: verify `UserID` survives transmit →
receive cycle
Fixes#22359
## Rolling upgrade note
During a rolling deploy where old coderd instances coexist with new
ones, the pubsub `ReinitializationEvent` has a new `workspace_id` field
(JSON key `workspace_id`). Old publishers send a raw reason string
instead of JSON; the new listener gracefully falls back by treating the
entire payload as the reason and filling in `WorkspaceID` from context.
The only visible effect during the upgrade window is that `WorkspaceID`
may be the zero UUID in agent-side logs — this is cosmetic and resolves
once all instances are updated.
* Adds support for parameter `format=text` in the following API routes:
* `/api/v2/workspaceagents/:id/logs`
* `/api/v2/workspacebuilds/:id/logs`
* `/api/v2/templateversions/:id/logs`
* `/api/v2/templateversions/:id/dry-run/:id/logs`
* Adds links to view raw logs on the following pages:
* Workspace build page
* Template editor page
* Template version page
* Refactors existing log formatting in `cli/logs.go` to live in `codersdk`.
🤖 Generated with Claude Opus 4.5, reviewed by me.
---------
Co-authored-by: Claude <noreply@anthropic.com>
This PR replaces the use of the **container** ID with the
**devcontainer** ID. This is a breaking change. This allows rebuilding a
devcontainer when there is no valid container ID.
This commit consolidates two container endpoints on the backend and improves the
frontend devcontainer support by showing names and displaying apps as
appropriate.
With this change, the frontend now has knowledge of the subagent and we can also
display things like port forwards.
The frontend was updated to show dev container labels on the border as well as
subagent connection status. The recreation flow was also adjusted a bit to show
placeholder app icons when relevant.
Support for apps was also added, although these are still WIP on the backend.
And the port forwarding utility was added in since the sub agents now provide
the necessary info.
Fixescoder/internal#666
This change introduces a refactor of the devcontainers recreation logic
which is now handled asynchronously rather than being request scoped.
The response was consequently changed from "No Content" to "Accepted" to
reflect this.
A new `Status` field was introduced to the devcontainer struct which
replaces `Running` (bool). This reflects that the devcontainer can now
be in various states (starting, running, stopped or errored).
The status field also protects against multiple concurrent recrations,
as long as they are initiated via the API.
Updates #16424
This pull request allows coder workspace agents to be reinitialized when
a prebuilt workspace is claimed by a user. This facilitates the transfer
of ownership between the anonymous prebuilds system user and the new
owner of the workspace.
Only a single agent per prebuilt workspace is supported for now, but
plumbing has already been done to facilitate the seamless transition to
multi-agent support.
---------
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Co-authored-by: Danny Kopping <dannykopping@gmail.com>
This does ~95% of the backend work required to integrate the AI work.
Most left to integrate from the tasks branch is just frontend, which
will be a lot smaller I believe.
The real difference between this branch and that one is the abstraction
-- this now attaches statuses to apps, and returns the latest status
reported as part of a workspace.
This change enables us to have a similar UX to in the tasks branch, but
for agents other than Claude Code as well. Any app can report status
now.
* Improves separation of concerns between `runDockerInspect` and
`convertDockerInspect`: `runDockerInspect` now just runs the command and
returns the output, while `convertDockerInspect` now does all of the
conversion and parsing logic.
* Improves testing of `convertDockerInspect` using real test fixtures.
* Fixes issue where the container port is returned instead of the host
port.
* Updates UI to link to correct host port. Container port is still
displayed in the button text, but the HostIP:HostPort is shown in a
popover.
* Adds stories for workspace agent UI
Fixes https://github.com/coder/coder/issues/16268
- Adds `/api/v2/workspaceagents/:id/containers` coderd endpoint that allows listing containers
visible to the agent. Optional filtering by labels is supported.
- Adds go tools to the `coder-dylib` CI step so we can generate mocks if needed
Second step to resolve [open_in
issue](https://github.com/coder/terraform-provider-coder/issues/297)
This PR improves the way the open_in parameter is forwarded across the
code, changing the last `string` to const everywhere.
Also make sure it is available and forwarded up to the `CreateLink`
component.
Closes#14716Closes#14717
Adds a new user-scoped tailnet API endpoint (`api/v2/tailnet`) with a new RPC stream for receiving updates on workspaces owned by a specific user, as defined in #14716.
When a stream is started, the `WorkspaceUpdatesProvider` will begin listening on the user-scoped pubsub events implemented in #14964. When a relevant event type is seen (such as a workspace state transition), the provider will query the DB for all the workspaces (and agents) owned by the user. This gets compared against the result of the previous query to produce a set of workspace updates.
Workspace updates can be requested for any user ID, however only workspaces the authorised user is permitted to `ActionRead` will have their updates streamed.
Opening a tunnel to an agent requires that the user can perform `ActionSSH` against the workspace containing it.
* feat: begin impl of agent script timings
* feat: add job_id and display_name to script timings
* fix: increment migration number
* fix: rename migrations from 251 to 254
* test: get tests compiling
* fix: appease the linter
* fix: get tests passing again
* fix: drop column from correct table
* test: add fixture for agent script timings
* fix: typo
* fix: use job id used in provisioner job timings
* fix: increment migration number
* test: behaviour of script runner
* test: rewrite test
* test: does exit 1 script break things?
* test: rewrite test again
* fix: revert change
Not sure how this came to be, I do not recall manually changing
these files.
* fix: let code breathe
* fix: wrap errors
* fix: justify nolint
* fix: swap require.Equal argument order
* fix: add mutex operations
* feat: add 'ran_on_start' and 'blocked_login' fields
* fix: update testdata fixture
* fix: refer to agent_id instead of job_id in timings
* fix: JobID -> AgentID in dbauthz_test
* fix: add 'id' to scripts, make timing refer to script id
* fix: fix broken tests and convert bug
* fix: update testdata fixtures
* fix: update testdata fixtures again
* feat: capture stage and if script timed out
* fix: update migration number
* test: add test for script api
* fix: fake db query
* fix: use UTC time
* fix: ensure r.scriptComplete is not nil
* fix: move err check to right after call
* fix: uppercase sql
* fix: use dbtime.Now()
* fix: debug log on r.scriptCompleted being nil
* fix: ensure correct rbac permissions
* chore: remove DisplayName
* fix: get tests passing
* fix: remove space in sql up
* docs: document ExecuteOption
* fix: drop 'RETURNING' from sql
* chore: remove 'display_name' from timing table
* fix: testdata fixture
* fix: put r.scriptCompleted call in goroutine
* fix: track goroutine for test + use separate context for reporting
* fix: appease linter, handle trackCommandGoroutine error
* fix: resolve race condition
* feat: replace timed_out column with status column
* test: update testdata fixture
* fix: apply suggestions from review
* revert: linter changes