ForkReap's syscall.ForkExec and process-directed signals remain
flaky in CI despite the subprocess isolation added in #22894.
Restore the testutil.InCI() skip guard that was removed in that
change.
Fixescoder/internal#1402
Implement the backend for the desktop feature for agents.
- Adds a new `/api/experimental/chats/$id/desktop` endpoint to coderd
which exposes a VNC stream from a
[portabledesktop](https://github.com/coder/portabledesktop) process
running inside the workspace
- Adds a new `spawn_computer_use_agent` tool to chatd, which spawns a
subagent that has access to the `computer` tool which lets it interact
with the `portabledesktop` process running inside the workspace
- Adds the plumbing to make the above possible
There's a follow up frontend PR here:
https://github.com/coder/coder/pull/23006
Tests that call ForkReap or send signals to their own process now
re-exec as isolated subprocesses. This prevents ForkReap's
syscall.ForkExec and process-directed signals from interfering
with the parent test binary or other tests running in parallel.
Also:
- Wait for the reaper goroutine to fully exit between subtests
to prevent overlapping reapers from competing on Wait4(-1).
- Register signal handlers synchronously before spawning the
forwarding goroutine so no signal is lost between ForkExec
and the handler being ready.
Each ForkReap call started a reap.ReapChildren goroutine that never
stopped (done=nil). Goroutines accumulated across subtests, racing to
call Wait4(-1, WNOHANG) and stealing the child's wait status before
ForkReap's Wait4(pid) could collect it.
Add a WithDone option to pass the done channel through to ReapChildren,
and use it in tests via a withDone(t) helper.
TestServer_X11_EvictionLRU was timing out under -race because it created
190 sequential SSH shell sessions (~0.55s each = ~105s), exceeding the
90s test timeout. The session count was derived from the production
X11MaxPort constant (6200).
Add a configurable X11MaxPort field to Config so the test can use a
small port range (5 ports instead of 190). This reduces the number of
sessions from 190 to 4, completing in ~3.8s under -race.
## Problem
`TestE2E_WriteFileTriggersGitWatch` and `TestE2E_SubagentAncestorWatch`
flake intermittently in `test-go-race-pg` with:
```
agentgit_test.go:1271: timed out waiting for server message
```
## Root Cause
In `handleWatch()`, `GetPaths(chatID)` was called **before**
`Subscribe(chatID)` on the PathStore. If `AddPaths()` fired between
those two calls:
1. `GetPaths()` returned empty (paths not added yet).
2. `AddPaths()` stored the paths and called `notifySubscribers()` — but
the subscription channel didn't exist yet, so the notification was a
no-op.
3. `Subscribe()` created the channel, but the notification was already
lost.
4. The handler never scanned, and the mock clock never advanced the 30s
fallback ticker → timeout.
Both failing tests connect the WebSocket with an empty PathStore and
immediately call `AddPaths()` from the test goroutine, making them
vulnerable to this scheduling interleaving.
## Fix
Swap the order: call `Subscribe()` first, then `GetPaths()`. This
guarantees:
| `AddPaths` fires... | `Subscribe` sees it? | `GetPaths` sees it? |
Outcome |
|---|---|---|---|
| Before `Subscribe` | No | **Yes** | Picked up by `GetPaths` |
| Between the two calls | **Yes** (queued) | **Yes** | Redundant but
safe (delta dedupes) |
| After `GetPaths` | **Yes** | No | Goroutine handles it |
No window exists where both miss it.
Verified with 10,000 iterations (`-race -count=5000`) — zero failures.
Fixescoder/internal#1389
go-git has bugs in gitignore logic. With more complex gitignores, some
paths that should be ignored aren't. That caused extra, unexpected files
to appear in the git diff panel.
If the git cli isn't available in a workspace, the /git/watch endpoint
will still allow the frontend to connect, but no git changes will ever
be transmitted.
Adds real-time git status watching for workspace agents, so the frontend
can subscribe over WebSocket and show
git file changes in near real-time.
1. Subscription is scoped to a **chat** via `GET
/api/experimental/chats/{chat}/git/watch`.
2. The workspace agent automatically determines which paths to watch
based on tool calls made by the chat (and its ancestor chats).
3. Workspace agent polls subscribed repo working trees on a 30s
interval, on tools calls, and on explicit `refresh` from the client.
4. Scans are rate-limited to at most once per second.
5. Edited paths are tracked **in-memory** inside the workspace agent.
There is no database persistence — state is lost on agent restart. This
will be addresses in a future PR.
6. Messages sent over WebSocket include a full-repo snapshot (unified
diff, branch, origin). A new message is emitted only when the snapshot
changes.
This PR was implemented with AI with me closely controlling what it's
doing. The code follows a plan file that was updated continuously during
implementation. Here's the file if you'd like to see it:
[project.md](https://gist.github.com/hugodutka/8722cf80c92f8a56555f7bc595b770e2).
It reflects the current state of the PR.
relates to #21335
Adds UpdateAppStatus on the agentsocket, wired up to forward to Coderd over the dRPC connection the agent maintains.
Disclosure: I used AI to generate significant portions of this PR, but hand-reviewed and tweaked the code. I consider it approximately indistinguishable from what I would have done by hand.
Add Prometheus metrics to the boundary log proxy for observability:
- batches_dropped_total (reason: buffer_full, forward_failed)
- logs_dropped_total (reason: buffer_full, forward_failed,
boundary_channel_full, boundary_batch_full)
- batches_forwarded_total
Also add BoundaryStatus to the BoundaryMessage envelope so boundary
can report dropped log counts as a separate wire message. The agent
records these as Prometheus metrics, making boundary-side data loss
visible. Backwards compatibility for older versions of boundary is maintained.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
relates to #21335
Enables the agent socket by default and updates docs to strike references to having to enable it.
The PRs in this stack change the MCP server that Tasks use to update their status to rely on the agent socket, rather than directly dialing Coderd with the agent token.
Default disable was a reasonable default when it was only used for the experimental script ordering features, but now that we want to use it for Tasks, it should be default on.
Extend the wire protocol for the boundary <-> agent unix socket with
a message envelope.
The envelope creates a boundary <-> agent data path that is separate
from the agent <-> coderd path. This lets boundary send operational
metadata (drop counts, configuration like jail type, capabilities)
that the agent can act on locally (e.g. Prometheus metrics) or use
to enrich outbound requests, without polluting the coderd-facing proto
with fields coderd never consumes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
## Problem
The `agentproc` process manager spawns processes with only
`os.Environ()`, missing agent-level environment variables like
`GIT_ASKPASS`, `CODER_*`, and `GIT_SSH_COMMAND` that are injected by the
agent's `updateCommandEnv` function. This means processes started
through the HTTP process API (used by chat tools) cannot authenticate
git operations via the Coder gitaskpass helper.
By contrast, SSH sessions get the full agent environment because the SSH
server calls `updateCommandEnv` via its `UpdateEnv` config hook.
## Fix
Wire the agent's `updateCommandEnv` hook into the process manager so all
spawned processes receive the full agent environment. The hook is:
- Passed as a parameter through `NewAPI` → `newManager`
- Called in `manager.start()` with `os.Environ()` as the base, producing
the same enriched env that SSH sessions get
- Gracefully falls back to `os.Environ()` if the hook returns an error
Request-level env vars (`req.Env`, set by chat tools) are still appended
last and take precedence.
## Changes
- `agent/agentproc/process.go`: Add `updateEnv` field to manager, call
it when building process env
- `agent/agentproc/api.go`: Accept `updateEnv` parameter in `NewAPI`
- `agent/agent.go`: Pass `a.updateCommandEnv` when creating the process
API
- `agent/agentproc/api_test.go`: Add `UpdateEnvHook` and
`UpdateEnvHookOverriddenByReqEnv` tests
Co-authored-by: Coder <coder@coder.com>
Inspired by openai/codex's `apply_patch` implementation, this changes
the `edit_files` search-and-replace to use a cascading match strategy
when the exact search string isn't found:
1. **Exact substring match** (byte-for-byte) — existing behavior,
unchanged
2. **Line-by-line match ignoring trailing whitespace** — handles
trailing spaces/tabs the LLM omits
3. **Line-by-line match ignoring all leading/trailing whitespace** —
handles tabs-vs-spaces and wrong indentation depth
## Problem
When the chat agent uses `edit_files`, it generates a search string that
must match the file content exactly. LLMs frequently get whitespace
wrong:
- Emitting spaces when the file uses tabs (or vice versa)
- Getting the indentation depth wrong by one or more levels
- Omitting trailing whitespace that exists in the file
When this happens, the edit silently does nothing, and the agent falls
into a retry loop using `cat -A` to diagnose the exact whitespace
characters.
## Solution
Adopted the same cascading fuzzy match strategy that [openai/codex uses
in
`seek_sequence.rs`](https://github.com/openai/codex/blob/main/codex-rs/apply-patch/src/seek_sequence.rs):
- Pass 1: exact match (existing behavior)
- Pass 2: `TrimRight` each line before comparing (trailing whitespace
tolerance)
- Pass 3: `TrimSpace` each line before comparing (full indentation
tolerance)
When a fuzzy match is found, the matched lines in the original file are
replaced with the replacement text. This preserves surrounding content
exactly.
## Changes
- `agent/agentfiles/files.go`: Replaced `icholy/replace` streaming
transformer with in-memory `fuzzyReplace` + helper functions
(`seekLines`, `spliceLines`)
- `agent/agentfiles/files_test.go`: Added 6 new test cases covering
trailing whitespace, tabs-vs-spaces, different indent depths, exact
match preference, no-match behavior, and mixed whitespace multiline
edits
- Removed `icholy/replace` dependency from go.mod/go.sum
---------
Co-authored-by: Kyle Carberry <kylecarbs@users.noreply.github.com>
## Summary
Adds a new agent-side process management HTTP API and rewrites the chat
execute tool to use it instead of SSH sessions.
## What changed
### New agent/agentproc/ package
- **headtail.go** — Thread-safe io.Writer with bounded memory (16KB head
+ 16KB tail ring buffer). Provides LLM-ready output with truncation
metadata and long-line truncation at 2048 bytes.
- **headtail_test.go** — 16 tests including race detector coverage for
concurrent writes.
- **process.go** — Manager + Process types for lifecycle management
using agentexec.Execer for proper OOM/nice scores.
- **api.go** — HTTP API following the agentfiles chi router pattern. 4
endpoints: start, list, output, signal.
### Agent wiring (agent/agent.go, agent/api.go)
Mounts the process API at /api/v0/processes, mirroring how agentfiles is
mounted.
### SDK (codersdk/workspacesdk/agentconn.go)
4 new AgentConn interface methods + 7 request/response types:
- StartProcess, ListProcesses, ProcessOutput, SignalProcess
### Execute tool rewrite (coderd/chatd/chattool/execute.go)
- SSH to Agent API: conn.StartProcess() + conn.ProcessOutput() polling
- New parameters: workdir, run_in_background
- Structured response: success, exit_code, wall_duration_ms, error,
truncated, note, background_process_id
- Non-interactive env vars: GIT_EDITOR=true, TERM=dumb, NO_COLOR=1,
PAGER=cat, etc.
- Output truncation: HeadTailBuffer caps at 32KB for LLM consumption
- File-dump detection with advisory notes suggesting read_file
- Default timeout: 60s to 10s
- Foreground polling: 200ms intervals until exit or timeout
## Architecture
State lives on the agent, surviving coderd failover and instance
changes. Any coderd replica can query any agent via HTTP over tailnet.
## Summary
Adds a new line-based file reading endpoint to the workspace agent,
replacing the unbounded byte-based approach for the `read_file` chat
tool and `coder_workspace_read_file` MCP tool.
**Problem**: The current `read_file` tool returns the entire file
contents with no limits, which can blow up LLM context windows and cause
OOM issues with large files.
**Solution**: Inspired by [`coder/mux`](https://github.com/coder/mux)
and [`openai/codex`](https://github.com/openai/codex), implement a
line-based reader with safety limits.
## Changes
### Agent (`agent/agentfiles/`)
- New `/read-file-lines` endpoint with `HandleReadFileLines` handler
- Line-based `offset` (1-based line number, default: 1) and `limit`
(line count, default: 2000)
- Safety constants:
| Constant | Value | Purpose |
|---|---|---|
| `MaxFileSize` | 1 MB | Reject files larger than this at stat |
| `MaxLineBytes` | 1,024 | Per-line truncation with `... [truncated]`
marker |
| `MaxResponseLines` | 2,000 | Max lines per response |
| `MaxResponseBytes` | 32 KB | Max total response size |
| `DefaultLineLimit` | 2,000 | Default when no limit specified |
- Line numbering format: `1\tcontent` (tab-separated)
- Structured JSON response: `{ success, file_size, total_lines,
lines_read, content, error }`
- Hard errors when limits exceeded — tells the LLM to use
`offset`/`limit`
- Existing byte-based `/read-file` endpoint preserved (used by
`instruction.go`)
### SDK (`codersdk/workspacesdk/`)
- `ReadFileLinesResponse` type added
- `ReadFileLines` method added to `AgentConn` interface
- Mock regenerated
### Chat tool (`coderd/chatd/chattool/`)
- `read_file` tool now uses `conn.ReadFileLines()` instead of
`conn.ReadFile()`
- Updated tool description to document line-based parameters
- Response includes `file_size`, `total_lines`, `lines_read` metadata
### MCP tool (`codersdk/toolsdk/`)
- `coder_workspace_read_file` updated to use line-based reading
- Schema descriptions updated for line-based offset/limit
- Removed `maxFileLimit` constant (agent handles limits now)
### Tests
- 13 new test cases for `TestReadFileLines`:
- Path validation (empty, relative, non-existent, directory, no
permissions)
- Empty file handling
- Basic read, offset, limit, offset+limit combinations
- Offset beyond file length
- Long line truncation (>1024 bytes)
- Large file rejection (>1MB)
- All existing tests pass unchanged
## Design decisions
| Decision | Rationale |
|---|---|
| Line-based, not byte-based | Both coder/mux and openai/codex use
line-based — matches how LLMs reason about code |
| Default limit of 2000 | Matches codex; prevents accidental full-file
dumps while being generous |
| 32 KB response cap | Compromise between mux (16 KB) and codex (no cap)
|
| 1024 byte/line truncation with marker | More generous than codex
(500), marker helps LLM know data is missing |
| Hard errors on overflow | Matches mux; forces LLM to paginate rather
than getting partial data |
| Preserve byte-based endpoint | `instruction.go` needs raw byte access
for AGENTS.md |
<!--
If you have used AI to produce some or all of this PR, please ensure you
have read our [AI Contribution
guidelines](https://coder.com/docs/about/contributing/AI_CONTRIBUTING)
before submitting.
-->
part of https://github.com/coder/coder/issues/21335
This moves updating app status (used by Tasks) into the workspace agent
API over dRPC. This will allow us to update the status without having to
re-authenticate each time, like we would with an HTTP PATCH request.
Further PRs in this stack will pipe these requests thru from the CLI MCP
server to the agentsock and finally to this dRPC call to coderd.
## Summary
coder-logstream-kube and other tools that use the agent token to connect
to the RPC endpoint were incorrectly triggering connection monitoring,
causing false connected/disconnected timestamps on the agent. This led
to VSCode/JetBrains disconnections and incorrect dashboard status.
## Changes
Add a `role` query parameter to `/api/v2/workspaceagents/me/rpc`:
- `role=agent`: triggers connection monitoring (default for the agent
SDK)
- any other value (e.g. `logstream-kube`): skips connection monitoring
- omitted: triggers monitoring for backward compatibility with older
agents
The agent SDK now sends `role=agent` by default. A new `Role` field on
the `agentsdk.Client` allows non-agent callers to specify a different
role.
## Required follow-up
coder-logstream-kube needs to set `client.Role = "logstream-kube"`
before calling `ConnectRPC20()`. Without that change, it will still send
`role=agent` and trigger monitoring.
Fixes#21625
Adds additional logs for determining what signal the agent receives
prior to shut down. Also helps distinguish whether the signal originated
at the agent or reaper.
Update the agent protobuf schema (agent/proto/agent.proto) to include:
- subagent_id field in WorkspaceAgentDevcontainer message
- id field in CreateSubAgentRequest message
Bump the Agent API version from v2.7 to v2.8 and update all client
references throughout the codebase (ConnectRPC27 -> ConnectRPC28,
DRPCAgentClient27 -> DRPCAgentClient28).
The reaper (PID 1) now returns the child's exit code instead of always
exiting 0. Signal termination uses the standard Unix convention of 128 +
signal number.
fixes#21661
Relates to https://github.com/coder/coder/pull/21676
* Replaces all existing usages of `httpapi.Heartbeat` with `httpapi.HeartbeatClose`
* Removes `httpapi.HeartbeatClose`
fixes https://github.com/coder/internal/issues/1286
We can get blank IP address from the net connection if the client has
already disconnected, as was the case in this flake. Fix is to only log
error if we get something non-empty we can't parse.
---------
Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
Adds template_version_id to re-emitted boundary audit logs to allow
filtering and analysis by specific template versions iin addition to the
existing template_id field. Since boundary policies are defined in the
template, the template version is critical to figuring out which policy
was responsible for boundaries decision in a workspace.
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Boundary policies are currently defined at the template level, so
including the template ID in re-emitted logs by the control plane
allows policy creators to filter and observe boundary activity for
specific templates. This makes it easier to verify that policies are
working as expected and to debug issues with specific template
configurations.
This makes it so we can test it directly without having to go through
Tailnet, which appears to be causing flakes in CI where the requests
time out and never make it to the agent.
Takes inspiration from the container-related API endpoints.
Would probably make sense to refactor the ls tests to also go through
the API (rather than be internal tests like they are currently) but I
left those alone for now to keep the diff minimal.
Modifies `make fmt/go` to also use https://github.com/daixiang0/gci to format our imports to a standard format.
It introduces a new shell script to do the formatting so that our formatting tools are in one place.
Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example:
```
import (
"context"
"time"
"github.com/prometheus/client_golang/prometheus"
"golang.org/x/xerrors"
"gopkg.in/natefinch/lumberjack.v2"
"cdr.dev/slog/v3"
"github.com/coder/coder/v2/codersdk/agentsdk"
"github.com/coder/serpent"
)
```
3 groups: standard library, 3rd partly libs, Coder libs.
This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.
Upgrades to slog v3 which includes a small, but backward incompatible API change to the acceptible call arguments when logging. This change allows us to verify via compile time type checking that arguments are correct and won't cause a panic, as was possible in slog v1, which this replaces (v2 was tagged but never used in coder/coder).
It also updates dependencies that also use slog and were updated.
I've left the `aibridge` dependency as a commit SHA, under the assumption that the team there (cc @pawbana @dannykopping ) will tag and update the dependency soon and on their own schedule.
Other dependencies, I pushed new tags.
Add agent forwarding of boundary audit logs from workspaces to coderd
via agent API, and re-emission of boundary logs to coderd stderr. This
change adds a server to the workspace agent that always listens on a
unix socket for boundary to connect and send audit logs.
coderd log format example:
```
[API] 2025-12-23 18:31:46.755 [info] coderd.agentrpc: boundary_request owner=.. workspace_name=.. agent_name=.. decision=.. workspace_id=.. http_method=.. http_url=.. event_time=.. request_id=..
```
Corresponding boundary PR: https://github.com/coder/boundary/pull/124
RFC:
https://www.notion.so/coderhq/Agent-Boundary-Logs-2afd579be59280f29629fc9823ac41bahttps://github.com/coder/coder/issues/21280
_Disclaimer: investigation done by Claude Opus 4.5_
Closes https://github.com/coder/internal/issues/1173
Closes https://github.com/coder/internal/issues/1174
The agent containers API is only marked "ready" under this condition in
`agent/agentcontainers/api.go`:
```go
// For now, all endpoints require the initial update to be done.
// If we want to allow some endpoints to be available before
// the initial update, we can enable this per-route.
```
However, what was actually being checked for was that the _init_ was
done, not the _initial update_.
In agent/agentcontainers/api.go, the `Start()` method:
1. Called `Init()` which closed `initDone` <--- API marked ready here
2. Then launched `go api.updaterLoop()` asynchronously
3. `updaterLoop()` performs the initial container update <--- should
have marked it ready after this
This PR fixes these semantics to avoid the race which was causing the
above two flakes.
Signed-off-by: Danny Kopping <danny@coder.com>
Add the AgentAPI changes to support the feature that transmits boundary
logs from workspaces to coderd via the agent API for eventual re-emission to
stderr. The API handlers are stubs for now because I'm trying to land
this feature from multiple smaller PRs.
High level architecture:
- Boundary records resource access in batches and sends proto message to
agent
- Agent proxies messages to coderd **(captured by the API changes in
this PR)**
- coderd re-emits logs to stderr
RFC:
https://www.notion.so/coderhq/Agent-Boundary-Logs-2afd579be59280f29629fc9823ac41ba
BREAKING CHANGE: SFTP/SCP now respects the agent's configured directory.
If your workspace agent has a custom `dir` configured in Terraform, SFTP
and SCP connections will now land there instead of `$HOME`. Previously,
only SSH and rsync respected this setting, which caused confusing behavior
where `scp file.txt coder:.` and `rsync file.txt coder:.` would put files
in different places. If you have scripts that relied on SFTP/SCP always
using `$HOME` regardless of agent configuration, you may need to use
explicit paths instead.
fixes https://github.com/coder/internal/issues/1196
The above issue exposes two different bugs in Coder.
In the agent, there is a race where if the agent is closed while starting up networking, it will erroneously disconnect from Coderd, which delays or breaks writing final status and logs.
In Coderd, there is a bug where we don't properly record the latest agent disconnection time if the agent had previously disconnected. This causes us to report the agent status as "Connected" even after it has disconnected up until the inactivity timeout fires.
This PR fixes both issues.
It also slightly reworks when we send workspace updates based on connection and disconnection. Previously we would send two updates when the agent connected in certain circumstances, even though the status would be the same in both (only times changed). Now we universally only send one on connect, and then another on disconnect.
fixes: https://github.com/coder/internal/issues/1179
The problem in that flake is that dRPC doensn't consistently return
`context.Canceled` if you make an RPC call and then cancel it: sometimes
it returns EOF.
Without this PR, if we get an EOF on one of the routines that uses the
agentapi connection, we tear down the whole connection and reconnect to
coderd --- even if we are in the middle of a graceful shutdown.
What happened in the linked flake is that writing stats failed with EOF,
which then caused us to reconnect and write the lifecycle "SHUTTING
DOWN" twice.
## Problem
The `TestAgent_SessionTTYShell` test was flaking on macOS CI runners
with:
```
match deadline exceeded: context deadline exceeded (wanted 1 bytes; got 0: "")
```
The test uses `WaitShort` (10s) for the context timeout when waiting for
shell prompt output via `Peek(ctx, 1)`. On slow macOS CI runners, the
shell startup can exceed this timeout due to resource contention.
This is evidenced in the failure logs, the SSH session was not reported
by the agent until the 10s timeout is nearly up - it took a while to
connect.
## Solution
Increase the timeout from `WaitShort` (10s) to `WaitMedium` (30s). This
matches the timeout used by `ExpectMatch` internally and gives the shell
more time to initialize on slow CI machines.
---
This PR was entirely generated by [mux](https://github.com/coder/mux)
but reviewed by a human.
Closes https://github.com/coder/internal/issues/1177