<!--
If you have used AI to produce some or all of this PR, please ensure you
have read our [AI Contribution
guidelines](https://coder.com/docs/about/contributing/AI_CONTRIBUTING)
before submitting.
-->
part of https://github.com/coder/coder/issues/21335
This moves updating app status (used by Tasks) into the workspace agent
API over dRPC. This will allow us to update the status without having to
re-authenticate each time, like we would with an HTTP PATCH request.
Further PRs in this stack will pipe these requests thru from the CLI MCP
server to the agentsock and finally to this dRPC call to coderd.
## Summary
coder-logstream-kube and other tools that use the agent token to connect
to the RPC endpoint were incorrectly triggering connection monitoring,
causing false connected/disconnected timestamps on the agent. This led
to VSCode/JetBrains disconnections and incorrect dashboard status.
## Changes
Add a `role` query parameter to `/api/v2/workspaceagents/me/rpc`:
- `role=agent`: triggers connection monitoring (default for the agent
SDK)
- any other value (e.g. `logstream-kube`): skips connection monitoring
- omitted: triggers monitoring for backward compatibility with older
agents
The agent SDK now sends `role=agent` by default. A new `Role` field on
the `agentsdk.Client` allows non-agent callers to specify a different
role.
## Required follow-up
coder-logstream-kube needs to set `client.Role = "logstream-kube"`
before calling `ConnectRPC20()`. Without that change, it will still send
`role=agent` and trigger monitoring.
Fixes#21625
Adds additional logs for determining what signal the agent receives
prior to shut down. Also helps distinguish whether the signal originated
at the agent or reaper.
Update the agent protobuf schema (agent/proto/agent.proto) to include:
- subagent_id field in WorkspaceAgentDevcontainer message
- id field in CreateSubAgentRequest message
Bump the Agent API version from v2.7 to v2.8 and update all client
references throughout the codebase (ConnectRPC27 -> ConnectRPC28,
DRPCAgentClient27 -> DRPCAgentClient28).
The reaper (PID 1) now returns the child's exit code instead of always
exiting 0. Signal termination uses the standard Unix convention of 128 +
signal number.
fixes#21661
Relates to https://github.com/coder/coder/pull/21676
* Replaces all existing usages of `httpapi.Heartbeat` with `httpapi.HeartbeatClose`
* Removes `httpapi.HeartbeatClose`
fixes https://github.com/coder/internal/issues/1286
We can get blank IP address from the net connection if the client has
already disconnected, as was the case in this flake. Fix is to only log
error if we get something non-empty we can't parse.
---------
Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
Adds template_version_id to re-emitted boundary audit logs to allow
filtering and analysis by specific template versions iin addition to the
existing template_id field. Since boundary policies are defined in the
template, the template version is critical to figuring out which policy
was responsible for boundaries decision in a workspace.
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Boundary policies are currently defined at the template level, so
including the template ID in re-emitted logs by the control plane
allows policy creators to filter and observe boundary activity for
specific templates. This makes it easier to verify that policies are
working as expected and to debug issues with specific template
configurations.
This makes it so we can test it directly without having to go through
Tailnet, which appears to be causing flakes in CI where the requests
time out and never make it to the agent.
Takes inspiration from the container-related API endpoints.
Would probably make sense to refactor the ls tests to also go through
the API (rather than be internal tests like they are currently) but I
left those alone for now to keep the diff minimal.
Modifies `make fmt/go` to also use https://github.com/daixiang0/gci to format our imports to a standard format.
It introduces a new shell script to do the formatting so that our formatting tools are in one place.
Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example:
```
import (
"context"
"time"
"github.com/prometheus/client_golang/prometheus"
"golang.org/x/xerrors"
"gopkg.in/natefinch/lumberjack.v2"
"cdr.dev/slog/v3"
"github.com/coder/coder/v2/codersdk/agentsdk"
"github.com/coder/serpent"
)
```
3 groups: standard library, 3rd partly libs, Coder libs.
This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.
Upgrades to slog v3 which includes a small, but backward incompatible API change to the acceptible call arguments when logging. This change allows us to verify via compile time type checking that arguments are correct and won't cause a panic, as was possible in slog v1, which this replaces (v2 was tagged but never used in coder/coder).
It also updates dependencies that also use slog and were updated.
I've left the `aibridge` dependency as a commit SHA, under the assumption that the team there (cc @pawbana @dannykopping ) will tag and update the dependency soon and on their own schedule.
Other dependencies, I pushed new tags.
Add agent forwarding of boundary audit logs from workspaces to coderd
via agent API, and re-emission of boundary logs to coderd stderr. This
change adds a server to the workspace agent that always listens on a
unix socket for boundary to connect and send audit logs.
coderd log format example:
```
[API] 2025-12-23 18:31:46.755 [info] coderd.agentrpc: boundary_request owner=.. workspace_name=.. agent_name=.. decision=.. workspace_id=.. http_method=.. http_url=.. event_time=.. request_id=..
```
Corresponding boundary PR: https://github.com/coder/boundary/pull/124
RFC:
https://www.notion.so/coderhq/Agent-Boundary-Logs-2afd579be59280f29629fc9823ac41bahttps://github.com/coder/coder/issues/21280
_Disclaimer: investigation done by Claude Opus 4.5_
Closes https://github.com/coder/internal/issues/1173
Closes https://github.com/coder/internal/issues/1174
The agent containers API is only marked "ready" under this condition in
`agent/agentcontainers/api.go`:
```go
// For now, all endpoints require the initial update to be done.
// If we want to allow some endpoints to be available before
// the initial update, we can enable this per-route.
```
However, what was actually being checked for was that the _init_ was
done, not the _initial update_.
In agent/agentcontainers/api.go, the `Start()` method:
1. Called `Init()` which closed `initDone` <--- API marked ready here
2. Then launched `go api.updaterLoop()` asynchronously
3. `updaterLoop()` performs the initial container update <--- should
have marked it ready after this
This PR fixes these semantics to avoid the race which was causing the
above two flakes.
Signed-off-by: Danny Kopping <danny@coder.com>
Add the AgentAPI changes to support the feature that transmits boundary
logs from workspaces to coderd via the agent API for eventual re-emission to
stderr. The API handlers are stubs for now because I'm trying to land
this feature from multiple smaller PRs.
High level architecture:
- Boundary records resource access in batches and sends proto message to
agent
- Agent proxies messages to coderd **(captured by the API changes in
this PR)**
- coderd re-emits logs to stderr
RFC:
https://www.notion.so/coderhq/Agent-Boundary-Logs-2afd579be59280f29629fc9823ac41ba
BREAKING CHANGE: SFTP/SCP now respects the agent's configured directory.
If your workspace agent has a custom `dir` configured in Terraform, SFTP
and SCP connections will now land there instead of `$HOME`. Previously,
only SSH and rsync respected this setting, which caused confusing behavior
where `scp file.txt coder:.` and `rsync file.txt coder:.` would put files
in different places. If you have scripts that relied on SFTP/SCP always
using `$HOME` regardless of agent configuration, you may need to use
explicit paths instead.
fixes https://github.com/coder/internal/issues/1196
The above issue exposes two different bugs in Coder.
In the agent, there is a race where if the agent is closed while starting up networking, it will erroneously disconnect from Coderd, which delays or breaks writing final status and logs.
In Coderd, there is a bug where we don't properly record the latest agent disconnection time if the agent had previously disconnected. This causes us to report the agent status as "Connected" even after it has disconnected up until the inactivity timeout fires.
This PR fixes both issues.
It also slightly reworks when we send workspace updates based on connection and disconnection. Previously we would send two updates when the agent connected in certain circumstances, even though the status would be the same in both (only times changed). Now we universally only send one on connect, and then another on disconnect.
fixes: https://github.com/coder/internal/issues/1179
The problem in that flake is that dRPC doensn't consistently return
`context.Canceled` if you make an RPC call and then cancel it: sometimes
it returns EOF.
Without this PR, if we get an EOF on one of the routines that uses the
agentapi connection, we tear down the whole connection and reconnect to
coderd --- even if we are in the middle of a graceful shutdown.
What happened in the linked flake is that writing stats failed with EOF,
which then caused us to reconnect and write the lifecycle "SHUTTING
DOWN" twice.
## Problem
The `TestAgent_SessionTTYShell` test was flaking on macOS CI runners
with:
```
match deadline exceeded: context deadline exceeded (wanted 1 bytes; got 0: "")
```
The test uses `WaitShort` (10s) for the context timeout when waiting for
shell prompt output via `Peek(ctx, 1)`. On slow macOS CI runners, the
shell startup can exceed this timeout due to resource contention.
This is evidenced in the failure logs, the SSH session was not reported
by the agent until the 10s timeout is nearly up - it took a while to
connect.
## Solution
Increase the timeout from `WaitShort` (10s) to `WaitMedium` (30s). This
matches the timeout used by `ExpectMatch` internally and gives the shell
more time to initialize on slow CI machines.
---
This PR was entirely generated by [mux](https://github.com/coder/mux)
but reviewed by a human.
Closes https://github.com/coder/internal/issues/1177
fixes: https://github.com/coder/internal/issues/1143
Both gVisor and the Go standard library implementations of `net.Conn` can under certain circumstances return `nil` for `RemoteAddr()` and `LocalAddr()` calls. If we call their methods, we segfault.
This PR fixes these calls and adds ruleguard rules.
Note that `slog.F("remote_addr", conn.RemoteAddr())` is fine because slog detects the `nil` before attempting to stringify the type.
When devcontainer up fails due to a lifecycle script error like
postCreateCommand, the CLI still returns a container ID. Previously
this was discarded and the devcontainer marked as failed. Now we
continue with agent injection if a container ID is available,
allowing users to debug the issue in the running container.
Fixescoder/internal#1137
closes: https://github.com/coder/coder/issues/10352
closes: https://github.com/coder/internal/issues/1094
closes: https://github.com/coder/internal/issues/1095
In this pull request, we enable a new set of experimental cli commands
grouped under `coder exp sync`.
These commands allow any process acting within a coder workspace to
inform the coder agent of its requirements and execution progress. The
coder agent will then relay this information to other processes that
have subscribed.
These commands are:
```
# Check if this feature is enabled in your environment
coder exp sync ping
# express that your unit depends on another
coder exp sync want <unit> <dependency_unit>
# express that your unit intends to start a portion of the script that requires
# other units to have completed first. This command blocks until all dependencies have been met
coder exp sync start <unit>
# express that your unit has completes its work, allowing dependent units to begin their execution
coder exp sync complete <unit>
```
Example:
In order to automatically run claude code in a new workspace, it must
first have a git repository cloned. The scripts responsible for cloning
the repository and for running claude code would coordinate in the
following way:
```bash
# Script A: Claude code
# Inform the agent that the claude script wants the git script.
# That is, the git script must have completed before the claude script can begin its execution
coder exp sync want claude git
# Inform the agent that we would now like to begin execution of claude.
# This command will block until the git script (and any other defined dependencies)
# have completed
coder exp sync start claude
# Now we run claude code and any other commands we need
claude ...
# Once our script has completed, we inform the agent, so that any scripts that depend on this one
# may begin their execution
coder exp sync complete claude
```
```bash
# Script B: Git
# Because the git script does not have any dependencies, we can simply inform the agent that we
# intend to start
coder exp sync start git
git clone ssh://git@github.com/coder/coder
# Once the repository have been cloned, we inform the agent that this script is complete, so that
# scripts that depend on it may begin their execution.
coder exp sync complete git
```
Notes:
* Unit names (ie. `claude` and `git`) given as input to the sync
commands are arbitrary strings. You do not have to conform to specific
identifiers. We recommend naming your scripts descriptively, but
succinctly.
* Scripts unit names should be well documented. Other scripts will need
to know the names you've chosen in order to depend on yours. Therefore,
you
---------
Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
This PR removes a log field that could expose sensitive information in
agent logs for workspaces that pass such information to the agent via
its manifest.
This is to debug context timeouts on API requests to the agent.
Because rbac and database cannot be imported in slim, split the logger
middleware into slim and non-slim versions and break out the recovery
middleware.
fixes https://github.com/coder/internal/issues/1123
We want to tests that ports are not included after they are no longer used, but this isn't safe on the real OS networking stack because there is no way to guarantee a port _won't_ be used. Instead, we introduce an interface and fake implementation for testing.
On order to leave the filtering logic in the test path, this PR also does some refactoring.
Caching logic is left in the real OS querying implementation and a new test case is added for it in this PR.
relates to: https://github.com/coder/internal/issues/1094
This is number 2 of 5 pull requests in an effort to add agent script
ordering. It adds a drpc API that is exposed via a local socket. This
API serves access to a lightweight DAG based dependency manager that was
inspired by systemd.
In follow-up PRs:
* This unit manager will be plumbed into the workspace agent struct.
* CLI commands will use this agentsocket api to express dependencies
between coder scripts
I used an LLM to produce some of these changes, but I have conducted
thorough self review and consider this contribution to be ready for an
external reviewer.
relates to: https://github.com/coder/internal/issues/1094
This is number 1 of 5 pull requests in an effort to add agent script
ordering. It adds a unit manager, which uses an underlying DAG and a
list of subscribers to inform units when their dependencies have changed
in status.
In follow-up PRs:
* This unit manager will be plumbed into the workspace agent struct.
* It will then be exposed to users via a new socket based drpc API
* The agentsocket API will then become accessible via CLI commands that
allow coder scripts to express their dependencies on one another.
This is an experimental feature. There may be ways to improve the
efficiency of the manager struct, but it is more important to validate
this feature with customers before we invest in such optimizations.
See the tests for examples of how units may communicate with one
another. Actual CLI usage will be analogous.
I used an LLM to produce some of these changes, but I have conducted
thorough self review and consider this contribution to be ready for an
external reviewer.
Closes https://github.com/coder/internal/issues/769
According to the `time.NewTicker` documentation [^1] (which is used
under the hood by https://github.com/coder/quartz) it will automatically
adjust the time interval to make up for slow receivers. This means we
should be safe to drop the default branch.
> NewTicker returns a new Ticker containing a channel that will send the
current time on the channel after each tick. The period of the ticks is
specified by the duration argument. The ticker will adjust the time
interval or drop ticks to make up for slow receivers. The duration d
must be greater than zero; if not, NewTicker will panic.
[^1]: https://pkg.go.dev/time#Ticker
Relates to https://github.com/coder/internal/issues/1093
This is the first of N pull requests to allow coder script ordering.
It introduces what is for now dead code, but paves the way for various
interfaces that allow coder scripts and other processes to depend on one
another via CLI commands and terraform configurations.
The next step is to add reactivity to the graph, such that changes in
the status of one vertex will propagate and allow other vertices to
change their own statuses.
Concurrency and stress testing yield the following:
CPU Profile:
<img width="1512" height="862" alt="Screenshot 2025-10-17 at 10 38 52"
src="https://github.com/user-attachments/assets/f46cf1a2-a0b2-4c02-81a0-069798108ee5"
/>
Mem Profile:
<img width="1512" height="862" alt="Screenshot 2025-10-17 at 10 38 01"
src="https://github.com/user-attachments/assets/45be1235-fff6-45ba-a50d-db9880377bd0"
/>
Predictably, lock contention and memory allocation are the largest
components of this system under stress. Nothing seems untoward.
Second flake for this test today 😮💨.
Flake seen here, though I couldn't replicate this locally, some CI
exclusive networking issue.
https://github.com/coder/coder/actions/runs/18770305895/job/53553517887?pr=20448
```
agent_test.go:3619:
Error Trace: /home/runner/work/coder/coder/agent/agent_test.go:3619
Error: Received unexpected error:
expected 1, got 0.000000:
github.com/coder/coder/v2/agent_test.TestAgent_Metrics_SSH.func7
/home/runner/work/coder/coder/agent/agent_test.go:3557
Test: TestAgent_Metrics_SSH
Messages: check fn for coderd_agentstats_currently_reachable_peers failed
```
This value is incremented by a successful ping to the peer from the
agent, which is dependent on all the networking code, which I think is
definitely out of scope of this test for agent metrics. So, we'll just
assert that the metrics exist with the correct labels (`derp`, `p2p`)
Closes https://github.com/coder/internal/issues/921
The flake in the linked issue was caused by the startup script taking longer than 1 second in CI. The existing conditional, that the startup script duration was under a second, was incorrect; the correct conditional is that the metric exists with the `success` label set to `true`.
- Ignore errors when reporting a connection from the server, just log
them instead
- Translate connection log IP `localhost` to `127.0.0.1` on both the
server and the agent
Note that the temporary fix for converting invalid IPs to localhost is
not required in main since the database no longer forbids NULL for the
IP column since https://github.com/coder/coder/pull/19788
Relates to #20194
When we added support for connection tracking in the Workspace agent, we modified the ReconnectingPTY tests to add an initial connection that we immediately hang up and check that connections are logged.
In the case of `screen`-based pty handling, hanging up the initial connection can race with the initial attachment to the `screen` process, and cause that process to exit early. This leaves subsequent connections to the same session ID to fail.
In this PR we just use different pty session IDs so that the initial connections we do to verify logging don't interfere with the rest of the test.
_Arguably_ it's a bug in our Reconnecting PTY code that hanging up immediately can leave the system in a weird state, but we do eventually recover and error out, so I don't think it's worth trying to fix.
This PR makes the initial steps at removing usage of the global Go HTTP
client, which was seen to have impacts on test flakiness in
https://github.com/coder/internal/issues/1020. The first commit removes
uses from tests, with the exception of one test that is tightly coupled
to the default client. The second commit makes easy/low-risk removals
from application code. This should have some impact to reduce test flakiness.