## Summary
coder-logstream-kube and other tools that use the agent token to connect
to the RPC endpoint were incorrectly triggering connection monitoring,
causing false connected/disconnected timestamps on the agent. This led
to VSCode/JetBrains disconnections and incorrect dashboard status.
## Changes
Add a `role` query parameter to `/api/v2/workspaceagents/me/rpc`:
- `role=agent`: triggers connection monitoring (default for the agent
SDK)
- any other value (e.g. `logstream-kube`): skips connection monitoring
- omitted: triggers monitoring for backward compatibility with older
agents
The agent SDK now sends `role=agent` by default. A new `Role` field on
the `agentsdk.Client` allows non-agent callers to specify a different
role.
## Required follow-up
coder-logstream-kube needs to set `client.Role = "logstream-kube"`
before calling `ConnectRPC20()`. Without that change, it will still send
`role=agent` and trigger monitoring.
Fixes#21625
Update the agent protobuf schema (agent/proto/agent.proto) to include:
- subagent_id field in WorkspaceAgentDevcontainer message
- id field in CreateSubAgentRequest message
Bump the Agent API version from v2.7 to v2.8 and update all client
references throughout the codebase (ConnectRPC27 -> ConnectRPC28,
DRPCAgentClient27 -> DRPCAgentClient28).
fixes https://github.com/coder/internal/issues/1286
We can get blank IP address from the net connection if the client has
already disconnected, as was the case in this flake. Fix is to only log
error if we get something non-empty we can't parse.
---------
Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
This makes it so we can test it directly without having to go through
Tailnet, which appears to be causing flakes in CI where the requests
time out and never make it to the agent.
Takes inspiration from the container-related API endpoints.
Would probably make sense to refactor the ls tests to also go through
the API (rather than be internal tests like they are currently) but I
left those alone for now to keep the diff minimal.
Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example:
```
import (
"context"
"time"
"github.com/prometheus/client_golang/prometheus"
"golang.org/x/xerrors"
"gopkg.in/natefinch/lumberjack.v2"
"cdr.dev/slog/v3"
"github.com/coder/coder/v2/codersdk/agentsdk"
"github.com/coder/serpent"
)
```
3 groups: standard library, 3rd partly libs, Coder libs.
This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.
Upgrades to slog v3 which includes a small, but backward incompatible API change to the acceptible call arguments when logging. This change allows us to verify via compile time type checking that arguments are correct and won't cause a panic, as was possible in slog v1, which this replaces (v2 was tagged but never used in coder/coder).
It also updates dependencies that also use slog and were updated.
I've left the `aibridge` dependency as a commit SHA, under the assumption that the team there (cc @pawbana @dannykopping ) will tag and update the dependency soon and on their own schedule.
Other dependencies, I pushed new tags.
Add agent forwarding of boundary audit logs from workspaces to coderd
via agent API, and re-emission of boundary logs to coderd stderr. This
change adds a server to the workspace agent that always listens on a
unix socket for boundary to connect and send audit logs.
coderd log format example:
```
[API] 2025-12-23 18:31:46.755 [info] coderd.agentrpc: boundary_request owner=.. workspace_name=.. agent_name=.. decision=.. workspace_id=.. http_method=.. http_url=.. event_time=.. request_id=..
```
Corresponding boundary PR: https://github.com/coder/boundary/pull/124
RFC:
https://www.notion.so/coderhq/Agent-Boundary-Logs-2afd579be59280f29629fc9823ac41bahttps://github.com/coder/coder/issues/21280
Add the AgentAPI changes to support the feature that transmits boundary
logs from workspaces to coderd via the agent API for eventual re-emission to
stderr. The API handlers are stubs for now because I'm trying to land
this feature from multiple smaller PRs.
High level architecture:
- Boundary records resource access in batches and sends proto message to
agent
- Agent proxies messages to coderd **(captured by the API changes in
this PR)**
- coderd re-emits logs to stderr
RFC:
https://www.notion.so/coderhq/Agent-Boundary-Logs-2afd579be59280f29629fc9823ac41ba
fixes https://github.com/coder/internal/issues/1196
The above issue exposes two different bugs in Coder.
In the agent, there is a race where if the agent is closed while starting up networking, it will erroneously disconnect from Coderd, which delays or breaks writing final status and logs.
In Coderd, there is a bug where we don't properly record the latest agent disconnection time if the agent had previously disconnected. This causes us to report the agent status as "Connected" even after it has disconnected up until the inactivity timeout fires.
This PR fixes both issues.
It also slightly reworks when we send workspace updates based on connection and disconnection. Previously we would send two updates when the agent connected in certain circumstances, even though the status would be the same in both (only times changed). Now we universally only send one on connect, and then another on disconnect.
fixes: https://github.com/coder/internal/issues/1179
The problem in that flake is that dRPC doensn't consistently return
`context.Canceled` if you make an RPC call and then cancel it: sometimes
it returns EOF.
Without this PR, if we get an EOF on one of the routines that uses the
agentapi connection, we tear down the whole connection and reconnect to
coderd --- even if we are in the middle of a graceful shutdown.
What happened in the linked flake is that writing stats failed with EOF,
which then caused us to reconnect and write the lifecycle "SHUTTING
DOWN" twice.
fixes: https://github.com/coder/internal/issues/1143
Both gVisor and the Go standard library implementations of `net.Conn` can under certain circumstances return `nil` for `RemoteAddr()` and `LocalAddr()` calls. If we call their methods, we segfault.
This PR fixes these calls and adds ruleguard rules.
Note that `slog.F("remote_addr", conn.RemoteAddr())` is fine because slog detects the `nil` before attempting to stringify the type.
closes: https://github.com/coder/coder/issues/10352
closes: https://github.com/coder/internal/issues/1094
closes: https://github.com/coder/internal/issues/1095
In this pull request, we enable a new set of experimental cli commands
grouped under `coder exp sync`.
These commands allow any process acting within a coder workspace to
inform the coder agent of its requirements and execution progress. The
coder agent will then relay this information to other processes that
have subscribed.
These commands are:
```
# Check if this feature is enabled in your environment
coder exp sync ping
# express that your unit depends on another
coder exp sync want <unit> <dependency_unit>
# express that your unit intends to start a portion of the script that requires
# other units to have completed first. This command blocks until all dependencies have been met
coder exp sync start <unit>
# express that your unit has completes its work, allowing dependent units to begin their execution
coder exp sync complete <unit>
```
Example:
In order to automatically run claude code in a new workspace, it must
first have a git repository cloned. The scripts responsible for cloning
the repository and for running claude code would coordinate in the
following way:
```bash
# Script A: Claude code
# Inform the agent that the claude script wants the git script.
# That is, the git script must have completed before the claude script can begin its execution
coder exp sync want claude git
# Inform the agent that we would now like to begin execution of claude.
# This command will block until the git script (and any other defined dependencies)
# have completed
coder exp sync start claude
# Now we run claude code and any other commands we need
claude ...
# Once our script has completed, we inform the agent, so that any scripts that depend on this one
# may begin their execution
coder exp sync complete claude
```
```bash
# Script B: Git
# Because the git script does not have any dependencies, we can simply inform the agent that we
# intend to start
coder exp sync start git
git clone ssh://git@github.com/coder/coder
# Once the repository have been cloned, we inform the agent that this script is complete, so that
# scripts that depend on it may begin their execution.
coder exp sync complete git
```
Notes:
* Unit names (ie. `claude` and `git`) given as input to the sync
commands are arbitrary strings. You do not have to conform to specific
identifiers. We recommend naming your scripts descriptively, but
succinctly.
* Scripts unit names should be well documented. Other scripts will need
to know the names you've chosen in order to depend on yours. Therefore,
you
---------
Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
This PR removes a log field that could expose sensitive information in
agent logs for workspaces that pass such information to the agent via
its manifest.
fixes https://github.com/coder/internal/issues/1123
We want to tests that ports are not included after they are no longer used, but this isn't safe on the real OS networking stack because there is no way to guarantee a port _won't_ be used. Instead, we introduce an interface and fake implementation for testing.
On order to leave the filtering logic in the test path, this PR also does some refactoring.
Caching logic is left in the real OS querying implementation and a new test case is added for it in this PR.
- Ignore errors when reporting a connection from the server, just log
them instead
- Translate connection log IP `localhost` to `127.0.0.1` on both the
server and the agent
Note that the temporary fix for converting invalid IPs to localhost is
not required in main since the database no longer forbids NULL for the
IP column since https://github.com/coder/coder/pull/19788
Relates to #20194
Refactors Agent instance identity to be a SessionTokenProvider.
Refactors the CLI to create Agent clients via a centralized function, rather than add-hoc via individual command handlers and their flags.
This allows commands besides `coder agent`, but which still use the agent identity, to support instance identity authentication.
Fixes#19111 by unifying all API requests to go thru the SessionTokenProvider for auth credentials.
Relates to https://github.com/coder/internal/issues/711
This PR implements a project discovery mechanism that searches for any
dev container projects and makes them visible in the UI so that they can
be started. To make the wording on the site more clear, "Rebuild" has
been changed to "Start" when there is no container associated with a
known dev container configuration. I've also made it so that site will
show the dev container config path when there is no other name
available.
### Design decisions
Just want to ensure my explanation for a few design decisions are noted
down:
- We only search for dev container configurations inside git
repositories
- We only search for these git repositories if they're at the top level
or a direct child of the agent directory.
This limited approach is to reduce the amount of files we ultimately
walk when trying to find these projects. It makes sense to limit it to
only the agent directory, although I'm open to expanding how deep we
search.
The agentsdk currently does a remap of the DERP map to change the
EmbeddedRelay node's URL to match the agent's access URL.
This PR makes changes to the `workspacesdk` (used by clients like the
CLI) and `vpn` (used by Coder Desktop) to match this behavior.
This enables us the ability to try Coder clients in dogfood over a VPN
without changing the global access URL.
Previously in #18635 we delayed the containers API `Init` to avoid producing
errors due to Docker and `@devcontainers/cli` not yet being installed by startup
scripts. This had an adverse effect on the UX via UI responsiveness as the
detection of devcontainers was greatly delayed.
This change splits `Init` into `Init` and `Start` so that we can immediately
after `Init` start serving known devcontainers (defined in Terraform), improving
the UX.
Related #18635
Related #18640
Listen to feedback that was missed in
https://github.com/coder/coder/pull/18346
- Adds `CODER_WORKSPACE_OWNER_NAME` into the agent environment.
- Logs warnings for when dev container app creation fails.
Closes https://github.com/coder/internal/issues/619
Implement the `coderd` side of the AgentAPI for the upcoming
dev-container agents work.
`agent/agenttest/client.go` is left unimplemented for a future PR
working to implement the agent side of this feature.
This change introduces a significant refactor to the agentcontainers API
and enables periodic updates of Docker containers rather than on-demand.
Consequently this change also allows us to move away from using a
locking channel and replace it with a mutex, which simplifies usage.
Additionally a previous oversight was fixed, and testing added, to clear
devcontainer running/dirty status when the container has been removed.
Updates coder/coder#16424
Updates coder/internal#621
Closes https://github.com/coder/internal/issues/648
This change introduces a new `ParentId` field to the agent's manifest.
This will allow an agent to know if it is a child or not, as well as
knowing who the owner is.
This is part of the Dev Container Agents work
This pull request allows coder workspace agents to be reinitialized when
a prebuilt workspace is claimed by a user. This facilitates the transfer
of ownership between the anonymous prebuilds system user and the new
owner of the workspace.
Only a single agent per prebuilt workspace is supported for now, but
plumbing has already been done to facilitate the seamless transition to
multi-agent support.
---------
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Co-authored-by: Danny Kopping <dannykopping@gmail.com>
There were too many ways to configure the agentcontainers API resulting
in inconsistent behavior or features not being enabled. This refactor
introduces a control flag for enabling or disabling the containers API.
When disabled, all implementations are no-op and explicit endpoint
behaviors are defined. When enabled, concrete implementations are used
by default but can be overridden by passing options.
Changes the SSH host key seeding to use the owner username, workspace name, and agent name. This prevents SSH from complaining about a mismatched host key if you use Coder Desktop to connect, and delete and recreate your workspace with the same name. Previously this would generate a different key because the workspace ID changed.
We also include the owner's username in anticipation of using Coder Desktop to access shared workspaces (or as a superuser) down the road, so that workspaces with the same name owned by different users will not have the same key.
This change is **BREAKING** in a limited sense that early access users of Coder Desktop will see their SSH clients complain about host keys changing the first time each workspace is rebuilt with this code. It can be resolved by clearing your `.ssh/known_hosts` file of the Coder workspaces you access this way.
Fix hanging workspace shutdowns caused by orphaned SSH child processes.
Key changes:
- Create process groups for non-PTY SSH sessions
- Send SIGHUP to entire process group for proper termination
- Add 5-second timeout to prevent indefinite blocking
Fixes#17108
This change adds support for devcontainer autostart in workspaces. The
preconditions for utilizing this feature are:
1. The `coder_devcontainer` resource must be defined in Terraform
2. By the time the startup scripts have completed,
- The `@devcontainers/cli` tool must be installed
- The given workspace folder must contain a devcontainer configuration
Example Terraform:
```tf
resource "coder_devcontainer" "coder" {
agent_id = coder_agent.main.id
workspace_folder = "/home/coder/coder"
config_path = ".devcontainer/devcontainer.json" # (optional)
}
```
Closes#16423