287 Commits

Author SHA1 Message Date
Steven Masley 1afc6d4fd0 feat: structured disconnect attribution for agent logs (#25191)
Implements
[PLAT-60](https://linear.app/codercom/issue/PLAT-60/enhance-disconnect-logs-with-structured-reason-attribution):
adds structured disconnect attribution to disconnect logs throughout the
agent and tailnet packages.

Every disconnect log site now carries structured slog fields. All
existing logs remain; existing messages are preserved with the fields
added alongside.

New fields on disconnect log lines:

- `connect_type` — which layer disconnected: `server_to_agent`,
`agent_to_client`, or `client_to_server`
- `disconnect_reason` — categorical reason: `graceful`, `network_error`,
`server_shutdown`, etc.
- `disconnect_expected` — whether the disconnect is normal operation
(`true`) or should be investigated (`false`)
- `disconnect_initiator` — who started it: `client`, `agent`, `server`,
or `network` (control-plane sites only)
- `disconnect_detail` — free-form supplemental info (where useful)

## What's covered

**Control plane (`server_to_agent`):** coordination RPC, DERP map
subscriber, agent runLoop, agent Close, `BasicCoordination.Close`,
`Controller.run`.

**Data plane (`agent_to_client`):** SSH sessions, reconnecting PTY,
JetBrains port-forwarding.

<details>
<summary>Control-plane sites</summary>

| Site | Reason | Initiator |
|---|---|---|
| `agent/agent.go` `runLoop` EOF | `network_error` | `network` |
| `agent/agent.go` `runCoordinator` deferred exit | `server_shutdown` /
`graceful` / `network_error` | `agent` / `server` / `network` |
| `agent/agent.go` `runDERPMapSubscriber` deferred exit | same (shared
`classifyCoordinatorRPCExit`) | same |
| `agent/agent.go` `Close` shutdown timeout | `server_shutdown` + detail
| `agent` |
| `agent/agent.go` `Close` clean coord disconnect | `server_shutdown` |
`agent` |
| `tailnet/controllers.go` `BasicCoordination.Close` | `graceful` or
`network_error` | `c.initiator` |
| `tailnet/controllers.go` `Controller.run` `net.ErrClosed` |
`network_error` | `network` |

</details>

<details>
<summary>Data-plane sites</summary>

| Site | Reason | Notes |
|---|---|---|
| `agent/agentssh/agentssh.go` SSH session closed | free-form
(`graceful`, `process exited with error status: N`, etc.) | Also sets
`closeCause("normal exit")` for clean exits so coderd's
`connection_log.DisconnectReason` is no longer empty |
| `agent/reconnectingpty/server.go` PTY closed | `server_shutdown`,
error string, or `graceful` | |
| `agent/agentssh/jetbrainstrack.go` channel closed | `normal close` or
error string | Previously passed empty reason |

</details>

<details>
<summary>Bug fix</summary>

The deferred `disconnected from coordination RPC` log no longer fires
when the initial `Coordinate()` RPC call fails before any connection is
established.

</details>

Refs PLAT-60.

---

_This PR was prepared by Coder Agents on behalf of @Emyrk._
**Manually QA'd a lot of common disconnects**

---------

Co-authored-by: Coder Agents <noreply@coder.com>
2026-05-19 09:47:03 -05:00
Mathias Fredriksson ca6450cf94 fix(agent): gate MCP tool discovery on startup (#25034)
The first `/mcp/tools` request could race workspace startup and return
an empty tool list before startup scripts had a chance to write
`.mcp.json`. Chatd may only discover tools once for a turn, so that
empty response could hide workspace MCP tools even though the agent
loaded them later.

Make the manager wait for startup to settle before treating missing MCP
config files as a real empty state. Tool listing now goes through one
manager-owned path that starts reload work independently of caller
cancellation; caller contexts only bound that caller's wait. After the
first reload body settles, transient reload errors return cached tools
with the error so the HTTP handler can degrade to the last known tool
set instead of returning `[]`.

The handler is intentionally thin: it asks the manager for tools, logs
any degraded path, and still returns the tool response shape callers
already expect. Tests cover startup gating, caller-canceled waits,
manager close, reload timeout via quartz, and cached-tool fallback after
a later reload error.
2026-05-11 12:57:22 +03:00
Sas Swart 1ba7139f21 feat: add session correlation fields to BoundaryLog proto (#24809)
1 of 9 [next >>](https://github.com/coder/coder/pull/24811)

RFC: [Bridge ↔ Boundaries Correlation
RFC](https://www.notion.so/Bridge-Boundaries-Correlation-313d579be59281f3b4efdbfd6896775a)

Adds three new proto fields for boundary session correlation.

**`ReportBoundaryLogsRequest`**
- `session_id` (string, field 2) — UUID generated by boundary at
startup,
  shared across all batches from a single run.
- `confined_process` (string, field 3) — name of the confined process
  (e.g. `claude-code`, `codex`, `copilot`).

**`BoundaryLog`**
- `sequence_number` (uint64, field 4) — monotonically increasing counter
  per session, primary ordering key when boundary is in use.

`BoundaryLog.time` already existed at field 2; no change needed there.

API version bumped to v2.9.

No behaviour change in coderd or the agent. This is a pure schema bump
that the boundary repo will consume in its own stack.

> Generated by Coder Agents
2026-05-05 10:36:26 +02:00
Mathias Fredriksson 881df9a5b0 feat: reload MCP config on change via lazy stat-on-request (#24700)
The MCP manager previously read .mcp.json exactly once at agent startup.
Editing the file had no effect until workspace rebuild or agent restart.

handleListTools now stats config file mtimes on every tool-list request
and triggers a differential reload when any file changed. Unchanged
servers keep their client pointer so in-flight tool calls survive.
Concurrent reload requests coalesce via singleflight.

MCP stdio subprocesses use the agent's execer for resource limits and
receive the same enriched environment as SSH sessions via updateEnv.

On the chatd side, WorkspaceMCPTool.Run detects 404 responses from
CallMCPTool (indicating the server was removed) and drops the chat's
cached tool list so the next turn refetches from the agent.
2026-04-28 19:47:14 +03:00
Mathias Fredriksson 3c450899ea fix: pass agent context config explicitly instead of reading env (#24759)
The CODER_AGENT_EXP_* env vars are agent-internal options. When set
in the workspace environment they leak to MCP subprocesses and user
shells.

ReadEnvConfig() captures the values and ClearEnvVars() strips them
before the reinit loop, so config survives agent restarts. NewAPI
and ReadEnvConfig both use applyDefaults() to fill zero fields.
The chatd test passes config via agenttest.WithContextConfigFromEnv().
2026-04-28 17:58:28 +03:00
Zach 72f35e1cd3 feat: runtime user secrets injection into workspaces (#24313)
Injects user secrets into workspace agents at runtime via the agent
manifest. Secrets with an environment variable name are set as
environment variables in every agent session and startup script. Secrets
with a file path are written to disk before startup scripts run.

- Fetch user secrets in GetManifest and convert to proto
- Defensively strip secrets from manifests received by the agent to
   avoid accidental leakage
- Add WorkspaceSecret type and proto conversion helpers to agentsdk
- Write secret files eagerly on manifest fetch (0600 perms, 0700 dirs)
- Inject secret env vars per-session in updateCommandEnv
- Expand ~/paths using caller-resolved home directory
- Log file write errors without blocking workspace startup
2026-04-17 16:55:24 -06:00
Spike Curtis 4c1a32cd7c feat: wire DERPTLSConfig through CLI, SDK, tailnet, VPN, agent, and health checks (#24435)
Wire DERPTLSConfig through the CLI, SDK, tailnet, VPN client, agent, and
health checks to allow custom TLS configuration for DERP connections.
The main use case is to be able to set a custom CA and also present
client certs (mTLS). See https://github.com/coder/tailscale/pull/105 for
related changes.

Adds three new global CLI flags:
- `--client-tls-ca-file` / `CODER_CLIENT_TLS_CA_FILE`
- `--client-tls-cert-file` / `CODER_CLIENT_TLS_CERT_FILE`
- `--client-tls-key-file` / `CODER_CLIENT_TLS_KEY_FILE`

Based on community PR #22695 by @ibdafna, with autogeneration issues
fixed (protobuf version mismatches in .pb.go files, golden file
regeneration, lint fixes).

> [!NOTE]
> This PR was authored by Coder Agents on behalf of a Coder team member.

<details>
<summary>Relationship to #22695</summary>

This is a clean reimplementation of the changes from #22695 on top of
current `main`, with the following differences:
- **Removed**: Accidental protobuf version changes in `.pb.go` files
(contributor had `protoc v6.33.4` vs project's `protoc v4.23.4`)
- **Added**: Properly regenerated golden files and docs via `make gen`
- **Fixed**: Lint issue (`var-declaration` revive warning on explicit
type in `createHTTPClient`)
- All meaningful code changes are identical to the original PR
</details>
2026-04-16 12:46:52 -04:00
Garrett Delfosse 7b7baea851 feat: support disabling reverse/local port forwarding in agent SSH server (#24026)
The agent SSH server unconditionally allows all four SSH forwarding
paths (TCP local, TCP reverse, Unix local, Unix reverse). This is a
sandbox escape vector when workspaces are used for AI agent containment
— a reverse tunnel lets anything inside the workspace reach the user's
local machine, bypassing network isolation.

This adds two new agent CLI flags / environment variables:

- `--block-reverse-port-forwarding` /
`CODER_AGENT_BLOCK_REVERSE_PORT_FORWARDING` — blocks both TCP (`ssh -R`)
and Unix socket reverse forwarding
- `--block-local-port-forwarding` /
`CODER_AGENT_BLOCK_LOCAL_PORT_FORWARDING` — blocks both TCP (`ssh -L`)
and Unix socket local forwarding

Template admins can set these via the `env` block on the container/VM
resource that runs the agent (e.g. `docker_container`,
`kubernetes_pod`), or via `coder_env` resources tied to the agent.

Fixes https://github.com/coder/coder/issues/22275

<details>
<summary>Implementation notes</summary>

Follows the existing `BlockFileTransfer` pattern:

1. `agent/agentssh/agentssh.go` — New `BlockReversePortForwarding` and
`BlockLocalPortForwarding` fields on `Config`. TCP callbacks check these
before allowing forwarding. The `direct-streamlocal@openssh.com` channel
handler is wrapped to reject Unix local forwards.
2. `agent/agentssh/forward.go` — `forwardedUnixHandler` gains a
`blockReversePortForwarding` field to reject
`streamlocal-forward@openssh.com` requests.
3. `agent/agent.go` — New fields on `Options` and `agent` struct,
plumbed to SSH config.
4. `cli/agent.go` — New serpent flags with env vars.
5. Tests cover all four blocked paths: TCP local, TCP reverse, Unix
local, Unix reverse.

</details>

> 🤖 Generated by Coder Agents
2026-04-08 10:41:55 -04:00
Kyle Carberry 919dc299fc feat: agent reads context files and discovers skills locally (#23935)
Piggybacks on #23878. Moves instruction file reading and skill discovery
from `chatd` (server-side, via multiple `LS`/`ReadFile` round-trips
through the agent connection) to the agent itself (local filesystem
access).

This intentionally drops backward compatibility with older agents that
don't support the context-config endpoint. Agents and server are
deployed together; there is no rolling-update contract to maintain here.

## What changed

The agent's `GET /api/v0/context-config` response now returns
`[]ChatMessagePart` directly — the same types chatd persists. This
eliminates intermediate type conversions and makes the protocol
extensible.

| Field | Type | Description |
|---|---|---|
| `parts` | `[]ChatMessagePart` | Context-file and skill parts, ready to
persist |
| `working_dir` | `string` | Agent's resolved working directory |

Removed from the response: `instructions_dirs`, `instructions_file`,
`skills_dirs`, `skill_meta_file`, `mcp_config_files` — the agent reads
files locally and returns their content as parts.

Removed from chatd: all legacy `LS`/`ReadFile` fallback code
(`readHomeInstructionFile`, `readInstructionDirFile`, `DiscoverSkills`
via LS, etc).

## Why

The previous architecture had the agent resolve paths, serve them over
HTTP, then `chatd` make N+1 round-trips back through the agent
connection to read files. The agent has direct filesystem access and
should just read the files.

## Key design decisions

- **Agent returns `ChatMessagePart` directly** — same types chatd
persists. No intermediate `InstructionFileEntry`/`SkillEntry` types
needed.
- **`SkillMeta.MetaFile`** — persisted via `ContextFileSkillMetaFile` on
the skill part, so custom meta file names
(`CODER_AGENT_EXP_SKILL_META_FILE`) survive across chat turns.
- **No pre-read body** — `read_skill` always dials the workspace to
fetch the skill body on demand. Simpler than caching the body in the
response.
- **MCP config paths kept agent-internal** — `MCPConfigFiles()` getter,
not sent over the wire.
- **No backward compat fallback** — old agents that don't support
context-config get no instruction files. This is acceptable since agent
and server deploy together.
2026-04-04 12:45:46 -04:00
Hugo Dutka 17dec2a70f feat: agents desktop recordings backend (#23894)
This PR introduces screen recording of the computer use agent using the
virtual desktop.

- Screen recording is triggered by a `wait_agent` tool call. Recording
is stopped by a successful `wait_agent` tool call or when there hasn't
been any desktop activity for 10 minutes.
- Recordings are handled by the `portabledesktop` cli via the `record`
command. The videos are sped up in periods of inactivity.
- Recordings are saved to the database to the `chat_files` table.
There's a hard limit of 100MB per recording. Larger recordings are
dropped.
- A successful `wait_agent` on a computer use subagent tool call returns
a `recording_file_id`, later allowing the frontend to display the
corresponding video.
2026-04-02 17:23:27 +00:00
Cian Johnston cd784c755a fix(agent): exorcise data race haunting contextConfigAPI on reconnect (#23946)
Fixes: coder/internal#1441

- Move `contextConfigAPI` init from `handleManifest` to `init()`,
matching all other API fields
- Change `agentcontextconfig.NewAPI` to accept `func() string` closure
(lazy directory evaluation)
- `Config()` and HTTP handler now compute on demand via
`a.manifest.Load().Directory`
- Widen `TestAgent_Reconnect` to loop 5 reconnections with a non-empty
manifest directory
- Add `TestContextConfigAPI_InitOnce` internal test verifying lazy eval
across manifest changes
- Add `TestNewAPI_LazyDirectory` unit test for the lazy contract

> 🤖 Written by a Coder Agent. Reviewed by a human.
2026-04-02 09:00:13 +01:00
Kyle Carberry ee855f9618 feat: make agent context paths configurable via env vars (#23878)
Replace hardcoded paths for instruction files, skills, and MCP config
with
values read from `CODER_AGENT_EXP_*` environment variables. Template
authors
configure paths via the existing `coder_agent` `env` block. The agent
resolves `~`, relative, and absolute paths locally, then serves the
resolved config over `GET /api/v0/context-config`. `chatd` fetches this
once per workspace attach and falls back to today's defaults for older
agents.

All path env vars are comma-separated, allowing multiple directories:

| Env Var | Default | Controls |
|---|---|---|
| `CODER_AGENT_EXP_INSTRUCTIONS_DIRS` | `~/.coder` | Dirs containing the
instruction file |
| `CODER_AGENT_EXP_INSTRUCTIONS_FILE` | `AGENTS.md` | Instruction file
name |
| `CODER_AGENT_EXP_SKILLS_DIRS` | `.agents/skills` | Skills directories
|
| `CODER_AGENT_EXP_SKILL_META_FILE` | `SKILL.md` | Skill metadata file
name |
| `CODER_AGENT_EXP_MCP_CONFIG_FILES` | `.mcp.json` | MCP config files |

### Example

```hcl
resource "coder_agent" "main" {
  os   = "linux"
  arch = "amd64"
  env = {
    CODER_AGENT_EXP_INSTRUCTIONS_DIRS  = "/opt/company/agent-config,~/.coder"
    CODER_AGENT_EXP_INSTRUCTIONS_FILE  = "CLAUDE.md"
    CODER_AGENT_EXP_SKILLS_DIRS        = "/opt/company/ai-skills,.agents/skills"
    CODER_AGENT_EXP_MCP_CONFIG_FILES   = "/opt/company/mcp.json,.mcp.json"
  }
}
```

<details>
<summary>Implementation Details</summary>

### Architecture

Follows the same pattern as MCP tool discovery:
agent resolves locally → exposes via HTTP → chatd consumes.

**Agent-side** (`agent/agentcontextconfig/`):
- `ResolvePath` / `ResolvePaths` handle `~`, relative, and absolute path
forms; returns `""` for relative paths when baseDir is empty
- `Config` reads env vars, falls back to defaults, resolves all paths
- `GET /api/v0/context-config` serves the resolved config as JSON

**chatd-side** (`coderd/x/chatd/`):
- Calls `conn.ContextConfig()` once on first workspace attach
- Falls back to hardcoded defaults on 404 (older agents)
- Iterates instruction dirs, skills dirs using resolved absolute paths
- `LSRelativityRoot` everywhere — no more home/root juggling

### Key design decisions

- **`EXP_` prefix**: env vars use `CODER_AGENT_EXP_*` to indicate
experimental status
- **Plural names**: comma-separated vars use plural names (`DIRS`,
`FILES`); single-value vars use singular (`FILE`)
- **Defaults in `workspacesdk`**: default constants live in
`codersdk/workspacesdk/` so both agent and server reference them without
cross-layer imports
- **`skillMetaFile` persistence**: stored on context-file parts via
`ContextFileSkillMetaFile` and restored on subsequent chat turns so
custom values survive across turns
- **Working dir dedup**: `slices.Contains` guard prevents reading the
same instruction file from both `InstructionsDirs` and the working
directory
- **MCP server dedup**: first-occurrence-wins dedup prevents leaking
duplicate connections from overlapping config files
- **ResolvePath safety**: returns `""` for relative paths when `baseDir`
is empty, so `ResolvePaths` filters them out

### Files changed

| File | Change |
|---|---|
| `agent/agentcontextconfig/` | New package — path resolution + HTTP
endpoint |
| `codersdk/workspacesdk/agentconn.go` | `ContextConfigResponse` type,
default constants, client method |
| `agent/agent.go` + `agent/api.go` | Wire up endpoint, pass config to
MCP |
| `agent/x/agentmcp/manager.go` | Accept `[]string` MCP config paths,
dedup by name |
| `coderd/x/chatd/chatd.go` | Fetch config, thread through, named
returns |
| `coderd/x/chatd/instruction.go` | Accept configurable dir + file name,
`skillMetaFileFromParts` |
| `coderd/x/chatd/chattool/skill.go` | Accept configurable dirs + meta
file |
| `codersdk/chats.go` | `ContextFileSkillMetaFile` field for persistence
|

### Test coverage

- `TestConfig` (4 cases): defaults, custom env vars, whitespace
trimming, comma-separated dirs
- `TestResolvePath` / `TestResolvePaths`: including empty baseDir edge
case
- `TestPersistInstructionFilesFallbackOnOlderAgent`: backward-compat
path when `ContextConfig` returns 404
- `TestChatMessagePartVariantTags`: updated exclusion list for new
internal field

### Backward compatibility

Older agents return 404 for the new endpoint. `chatd` catches this and
falls back to today's defaults via `readHomeInstructionFile` (using
`LSRelativityHome`). Existing workspaces work with no changes.

</details>
2026-04-01 12:28:47 -04:00
Kyle Carberry 0f86c4237e feat: add workspace MCP tool discovery and proxying for chat (#23680)
Coder's chat (chatd) can now discover and use MCP servers configured in
a workspace's `.mcp.json` file. This brings project-specific tooling
(GitHub, databases, docs servers, etc.) into the chat without any manual
configuration.

## How it works

The workspace agent reads `.mcp.json` from the workspace directory (same
format Claude Code uses), connects to the declared MCP servers —
spawning child processes for stdio servers and connecting over the
network for HTTP/SSE — and caches their tool lists. Two new agent HTTP
endpoints expose this:

- `GET /api/v0/mcp/tools` returns the cached tool list (supports
`?refresh=true`)
- `POST /api/v0/mcp/call-tool` proxies calls to the correct server

On each chat turn, chatd calls `ListMCPTools` through the existing
`AgentConn` tailnet connection, wraps each tool as a
`fantasy.AgentTool`, and adds them to the LLM's tool set alongside
built-in and admin-configured MCP tools. Tool names are prefixed with
the server name (`github__create_issue`) to avoid collisions.

Failed server connections are logged and skipped — they never block the
agent or break the chat. Child stdio processes are terminated on agent
shutdown.
2026-03-26 19:57:02 +00:00
Cian Johnston c753a622ad refactor(agent): move agentdesktop under x/ subpackage (#23610)
- Move `agent/agentdesktop/` to `agent/x/agentdesktop/` to signal
experimental/unstable status
- Update import paths in `agent/agent.go` and `api_test.go`

> 🤖 This mechanical refactor was performed by an agent. I made sure it
didn't change anything it wasn't supposed to.
2026-03-25 18:23:52 +00:00
Mathias Fredriksson 147df5c971 refactor: replace sort.Strings with slices.Sort (#23457)
The slices package provides type-safe generic replacements for the
old typed sort convenience functions. The codebase already uses
slices.Sort in 43 call sites; this finishes the migration for the
remaining 29.

- sort.Strings(x)          -> slices.Sort(x)
- sort.Float64s(x)         -> slices.Sort(x)
- sort.StringsAreSorted(x) -> slices.IsSorted(x)
2026-03-23 23:19:23 +02:00
Mathias Fredriksson 119030d795 fix(agent): default process working directory to agent dir or $HOME (#23224)
Processes started via the agent process API inherited the agent's
own working directory (/tmp/coder.xxx) when no WorkDir was
specified. SSH sessions already use a fallback chain: configured
agent directory > $HOME. This wires the same manifest directory
closure into the process manager so the priority is now:

  explicit req.WorkDir > agent configured dir > $HOME

The resolved directory is recorded on the process struct so
ProcessInfo.WorkDir and pathStore notifications reflect where
the process actually ran.
2026-03-18 16:46:26 +00:00
Hugo Dutka fdb1205bdf chore(agent): remove portabledesktop download logic (#23128)
The new way to install portabledesktop in a workspace will be via a
module: https://github.com/coder/registry/pull/805
2026-03-17 15:24:11 +01:00
Hugo Dutka 84527390c6 feat: chat desktop backend (#23005)
Implement the backend for the desktop feature for agents.

- Adds a new `/api/experimental/chats/$id/desktop` endpoint to coderd
which exposes a VNC stream from a
[portabledesktop](https://github.com/coder/portabledesktop) process
running inside the workspace
- Adds a new `spawn_computer_use_agent` tool to chatd, which spawns a
subagent that has access to the `computer` tool which lets it interact
with the `portabledesktop` process running inside the workspace
- Adds the plumbing to make the above possible

There's a follow up frontend PR here:
https://github.com/coder/coder/pull/23006
2026-03-13 19:49:34 +01:00
Hugo Dutka 48ab492f49 feat: agents git watch backend (#22565)
Adds real-time git status watching for workspace agents, so the frontend
can subscribe over WebSocket and show
git file changes in near real-time.

1. Subscription is scoped to a **chat** via `GET
/api/experimental/chats/{chat}/git/watch`.
2. The workspace agent automatically determines which paths to watch
based on tool calls made by the chat (and its ancestor chats).
3. Workspace agent polls subscribed repo working trees on a 30s
interval, on tools calls, and on explicit `refresh` from the client.
4. Scans are rate-limited to at most once per second.
5. Edited paths are tracked **in-memory** inside the workspace agent.
There is no database persistence — state is lost on agent restart. This
will be addresses in a future PR.
6. Messages sent over WebSocket include a full-repo snapshot (unified
diff, branch, origin). A new message is emitted only when the snapshot
changes.

This PR was implemented with AI with me closely controlling what it's
doing. The code follows a plan file that was updated continuously during
implementation. Here's the file if you'd like to see it:
[project.md](https://gist.github.com/hugodutka/8722cf80c92f8a56555f7bc595b770e2).
It reflects the current state of the PR.
2026-03-06 10:47:55 +01:00
Spike Curtis 7cc2b22568 chore: expose UpdateAppStatus on agentsocket (#22353)
relates to #21335

Adds UpdateAppStatus on the agentsocket, wired up to forward to Coderd over the dRPC connection the agent maintains.

Disclosure: I used AI to generate significant portions of this PR, but hand-reviewed and tweaked the code. I consider it approximately indistinguishable from what I would have done by hand.
2026-03-04 21:18:17 +04:00
Zach 5b7377c375 feat: add Prometheus metrics for boundary log drop reporting (#22521)
Add Prometheus metrics to the boundary log proxy for observability:
- batches_dropped_total (reason: buffer_full, forward_failed)
- logs_dropped_total (reason: buffer_full, forward_failed,
  boundary_channel_full, boundary_batch_full)
- batches_forwarded_total

Also add BoundaryStatus to the BoundaryMessage envelope so boundary
can report dropped log counts as a separate wire message. The agent
records these as Prometheus metrics, making boundary-side data loss
visible. Backwards compatibility for older versions of boundary is maintained.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 12:42:34 -07:00
Spike Curtis 56eb57caf4 chore: enable agent socket by default (#22352)
relates to #21335

Enables the agent socket by default and updates docs to strike references to having to enable it.

The PRs in this stack change the MCP server that Tasks use to update their status to rely on the agent socket, rather than directly dialing Coderd with the agent token.

Default disable was a reasonable default when it was only used for the experimental script ordering features, but now that we want to use it for Tasks, it should be default on.
2026-03-03 21:23:59 +04:00
Kyle Carberry fb6bf3a568 fix(agent): wire updateCommandEnv into process manager (#22451)
## Problem

The `agentproc` process manager spawns processes with only
`os.Environ()`, missing agent-level environment variables like
`GIT_ASKPASS`, `CODER_*`, and `GIT_SSH_COMMAND` that are injected by the
agent's `updateCommandEnv` function. This means processes started
through the HTTP process API (used by chat tools) cannot authenticate
git operations via the Coder gitaskpass helper.

By contrast, SSH sessions get the full agent environment because the SSH
server calls `updateCommandEnv` via its `UpdateEnv` config hook.

## Fix

Wire the agent's `updateCommandEnv` hook into the process manager so all
spawned processes receive the full agent environment. The hook is:

- Passed as a parameter through `NewAPI` → `newManager`
- Called in `manager.start()` with `os.Environ()` as the base, producing
the same enriched env that SSH sessions get
- Gracefully falls back to `os.Environ()` if the hook returns an error

Request-level env vars (`req.Env`, set by chat tools) are still appended
last and take precedence.

## Changes

- `agent/agentproc/process.go`: Add `updateEnv` field to manager, call
it when building process env
- `agent/agentproc/api.go`: Accept `updateEnv` parameter in `NewAPI`
- `agent/agent.go`: Pass `a.updateCommandEnv` when creating the process
API
- `agent/agentproc/api_test.go`: Add `UpdateEnvHook` and
`UpdateEnvHookOverriddenByReqEnv` tests

Co-authored-by: Coder <coder@coder.com>
2026-02-28 21:58:59 -05:00
Kyle Carberry a621c3cb13 feat(agent): add process execution API and rewrite execute tool (#22416)
## Summary

Adds a new agent-side process management HTTP API and rewrites the chat
execute tool to use it instead of SSH sessions.

## What changed

### New agent/agentproc/ package

- **headtail.go** — Thread-safe io.Writer with bounded memory (16KB head
+ 16KB tail ring buffer). Provides LLM-ready output with truncation
metadata and long-line truncation at 2048 bytes.
- **headtail_test.go** — 16 tests including race detector coverage for
concurrent writes.
- **process.go** — Manager + Process types for lifecycle management
using agentexec.Execer for proper OOM/nice scores.
- **api.go** — HTTP API following the agentfiles chi router pattern. 4
endpoints: start, list, output, signal.

### Agent wiring (agent/agent.go, agent/api.go)

Mounts the process API at /api/v0/processes, mirroring how agentfiles is
mounted.

### SDK (codersdk/workspacesdk/agentconn.go)

4 new AgentConn interface methods + 7 request/response types:
- StartProcess, ListProcesses, ProcessOutput, SignalProcess

### Execute tool rewrite (coderd/chatd/chattool/execute.go)

- SSH to Agent API: conn.StartProcess() + conn.ProcessOutput() polling
- New parameters: workdir, run_in_background
- Structured response: success, exit_code, wall_duration_ms, error,
truncated, note, background_process_id
- Non-interactive env vars: GIT_EDITOR=true, TERM=dumb, NO_COLOR=1,
PAGER=cat, etc.
- Output truncation: HeadTailBuffer caps at 32KB for LLM consumption
- File-dump detection with advisory notes suggesting read_file
- Default timeout: 60s to 10s
- Foreground polling: 200ms intervals until exit or timeout

## Architecture

State lives on the agent, surviving coderd failover and instance
changes. Any coderd replica can query any agent via HTTP over tailnet.
2026-02-28 12:33:52 -05:00
Kacper Sawicki f016d9e505 fix(coderd): add role param to agent RPC to prevent false connectivity (#22052)
## Summary

coder-logstream-kube and other tools that use the agent token to connect
to the RPC endpoint were incorrectly triggering connection monitoring,
causing false connected/disconnected timestamps on the agent. This led
to VSCode/JetBrains disconnections and incorrect dashboard status.

## Changes

Add a `role` query parameter to `/api/v2/workspaceagents/me/rpc`:
- `role=agent`: triggers connection monitoring (default for the agent
SDK)
- any other value (e.g. `logstream-kube`): skips connection monitoring
- omitted: triggers monitoring for backward compatibility with older
agents

The agent SDK now sends `role=agent` by default. A new `Role` field on
the `agentsdk.Client` allows non-agent callers to specify a different
role.

## Required follow-up

coder-logstream-kube needs to set `client.Role = "logstream-kube"`
before calling `ConnectRPC20()`. Without that change, it will still send
`role=agent` and trigger monitoring.

Fixes #21625
2026-02-18 09:44:06 +01:00
Danielle Maywood 2de8cdf160 feat(agent): add subagent ID fields to devcontainers in manifest (#21848)
Update the agent protobuf schema (agent/proto/agent.proto) to include:
- subagent_id field in WorkspaceAgentDevcontainer message
- id field in CreateSubAgentRequest message

Bump the Agent API version from v2.7 to v2.8 and update all client
references throughout the codebase (ConnectRPC27 -> ConnectRPC28,
DRPCAgentClient27 -> DRPCAgentClient28).
2026-02-03 12:37:30 +00:00
Spike Curtis 3398833919 test: don't drop error on blank IP address in report (#21642)
fixes https://github.com/coder/internal/issues/1286

We can get blank IP address from the net connection if the client has
already disconnected, as was the case in this flake. Fix is to only log
error if we get something non-empty we can't parse.

---------

Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
2026-01-23 10:26:44 +00:00
Asher ff9ed91811 chore: move agent's file API into separate package (#21531)
This makes it so we can test it directly without having to go through
Tailnet, which appears to be causing flakes in CI where the requests
time out and never make it to the agent.

Takes inspiration from the container-related API endpoints.

Would probably make sense to refactor the ls tests to also go through
the API (rather than be internal tests like they are currently) but I
left those alone for now to keep the diff minimal.
2026-01-16 17:03:17 -09:00
Spike Curtis bddb808b25 chore: arrange imports in a standard way (#21452)
Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example:

```
import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"golang.org/x/xerrors"
	"gopkg.in/natefinch/lumberjack.v2"

	"cdr.dev/slog/v3"
	"github.com/coder/coder/v2/codersdk/agentsdk"
	"github.com/coder/serpent"
)
```

3 groups: standard library, 3rd partly libs, Coder libs.

This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.
2026-01-08 15:24:11 +04:00
Spike Curtis 49b34a716a fix: fix slog to always use array of Fields (#21426)
Upgrades to slog v3 which includes a small, but backward incompatible API change to the acceptible call arguments when logging. This change allows us to verify via compile time type checking that arguments are correct and won't cause a panic, as was possible in slog v1, which this replaces (v2 was tagged but never used in coder/coder).

It also updates dependencies that also use slog and were updated.

I've left the `aibridge` dependency as a commit SHA, under the assumption that the team there (cc @pawbana @dannykopping ) will tag and update the dependency soon and on their own schedule.

Other dependencies, I pushed new tags.
2026-01-08 10:29:41 +04:00
Zach 07924037e7 feat: add boundary log forwarding from agent to coderd (#21345)
Add agent forwarding of boundary audit logs from workspaces to coderd
via agent API, and re-emission of boundary logs to coderd stderr. This
change adds a server to the workspace agent that always listens on a
unix socket for boundary to connect and send audit logs.

coderd log format example:
```
[API] 2025-12-23 18:31:46.755 [info] coderd.agentrpc: boundary_request owner=.. workspace_name=.. agent_name=.. decision=.. workspace_id=.. http_method=.. http_url=.. event_time=.. request_id=..
```

Corresponding boundary PR: https://github.com/coder/boundary/pull/124
RFC:
https://www.notion.so/coderhq/Agent-Boundary-Logs-2afd579be59280f29629fc9823ac41ba
https://github.com/coder/coder/issues/21280
2025-12-31 16:38:19 -07:00
Zach 9d1493a13a feat: add initial API for boundary log forwarding to coderd (#21293)
Add the AgentAPI changes to support the feature that transmits boundary
logs from workspaces to coderd via the agent API for eventual re-emission to
stderr. The API handlers are stubs for now because I'm trying to land
this feature from multiple smaller PRs.

High level architecture:
- Boundary records resource access in batches and sends proto message to
agent
- Agent proxies messages to coderd **(captured by the API changes in
this PR)**
- coderd re-emits logs to stderr

RFC:
https://www.notion.so/coderhq/Agent-Boundary-Logs-2afd579be59280f29629fc9823ac41ba
2025-12-19 10:41:39 -07:00
Spike Curtis 71c6dc4043 fix: stop disconnecting from coderd early and record disconnect correctly (#21250)
fixes https://github.com/coder/internal/issues/1196

The above issue exposes two different bugs in Coder.

In the agent, there is a race where if the agent is closed while starting up networking, it will erroneously disconnect from Coderd, which delays or breaks writing final status and logs.

In Coderd, there is a bug where we don't properly record the latest agent disconnection time if the agent had previously disconnected. This causes us to report the agent status as "Connected" even after it has disconnected up until the inactivity timeout fires.

This PR fixes both issues.

It also slightly reworks when we send workspace updates based on connection and disconnection. Previously we would send two updates when the agent connected in certain circumstances, even though the status would be the same in both (only times changed).  Now we universally only send one on connect, and then another on disconnect.
2025-12-15 12:04:01 +04:00
Spike Curtis ce9e7ad909 fix(agent): ignore EOF errors during shutdown (#21187)
fixes: https://github.com/coder/internal/issues/1179

The problem in that flake is that dRPC doensn't consistently return
`context.Canceled` if you make an RPC call and then cancel it: sometimes
it returns EOF.

Without this PR, if we get an EOF on one of the routines that uses the
agentapi connection, we tear down the whole connection and reconnect to
coderd --- even if we are in the middle of a graceful shutdown.

What happened in the linked flake is that writing stats failed with EOF,
which then caused us to reconnect and write the lifecycle "SHUTTING
DOWN" twice.
2025-12-09 17:32:38 +04:00
Spike Curtis 40df21ed62 fix: fixes use of possibly nil RemoteAddr() and LocalAddr() return values (#21076)
fixes: https://github.com/coder/internal/issues/1143

Both gVisor and the Go standard library implementations of `net.Conn` can under certain circumstances return `nil` for `RemoteAddr()` and `LocalAddr()` calls. If we call their methods, we segfault.

This PR fixes these calls and adds ruleguard rules.

Note that `slog.F("remote_addr", conn.RemoteAddr())` is fine because slog detects the `nil` before attempting to stringify the type.
2025-12-03 15:06:00 +04:00
Sas Swart ce627bf23f feat: implement agent socket api, client and cli (#20758)
closes: https://github.com/coder/coder/issues/10352
closes: https://github.com/coder/internal/issues/1094
closes: https://github.com/coder/internal/issues/1095

In this pull request, we enable a new set of experimental cli commands
grouped under `coder exp sync`.
These commands allow any process acting within a coder workspace to
inform the coder agent of its requirements and execution progress. The
coder agent will then relay this information to other processes that
have subscribed.

These commands are:
```
# Check if this feature is enabled in your environment 
coder exp sync ping

# express that your unit depends on another
coder exp sync want <unit> <dependency_unit> 

# express that your unit intends to start a portion of the script that requires 
# other units to have completed first. This command blocks until all dependencies have been met
coder exp sync start <unit> 

# express that your unit has completes its work, allowing dependent units to begin their execution
coder exp sync complete <unit>
```

Example:

In order to automatically run claude code in a new workspace, it must
first have a git repository cloned. The scripts responsible for cloning
the repository and for running claude code would coordinate in the
following way:

```bash
# Script A: Claude code

# Inform the agent that the claude script wants the git script.
# That is, the git script must have completed before the claude script can begin its execution
coder exp sync want claude git

# Inform the agent that we would now like to begin execution of claude.
# This command will block until the git script (and any other defined dependencies)
# have completed
coder exp sync start claude

# Now we run claude code and any other commands we need
claude ...

# Once our script has completed, we inform the agent, so that any scripts that depend on this one
# may begin their execution

coder exp sync complete claude
```

```bash
# Script B: Git

# Because the git script does not have any dependencies, we can simply inform the agent that we 
# intend to start
coder exp sync start git

git clone ssh://git@github.com/coder/coder

# Once the repository have been cloned, we inform the agent that this script is complete, so that
# scripts that depend on it may begin their execution.
coder exp sync complete git
```

Notes:
* Unit names (ie. `claude` and `git`) given as input to the sync
commands are arbitrary strings. You do not have to conform to specific
identifiers. We recommend naming your scripts descriptively, but
succinctly.
* Scripts unit names should be well documented. Other scripts will need
to know the names you've chosen in order to depend on yours. Therefore,
you

---------

Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
2025-11-28 08:33:50 +02:00
Sas Swart 1d726c81bb fix: remove a sensitive field from an agent log line (#20968)
This PR removes a log field that could expose sensitive information in
agent logs for workspaces that pass such information to the agent via
its manifest.
2025-11-27 16:12:03 +02:00
Spike Curtis afd40436f0 fix: mock Agent querying OS for listening ports in tests (#20842)
fixes https://github.com/coder/internal/issues/1123

We want to tests that ports are not included after they are no longer used, but this isn't safe on the real OS networking stack because there is no way to guarantee a port _won't_ be used. Instead, we introduce an interface and fake implementation for testing.

On order to leave the filtering logic in the test path, this PR also does some refactoring.

Caching logic is left in the real OS querying implementation and a new test case is added for it in this PR.
2025-11-25 14:25:24 +04:00
Dean Sheather 6c99d5eca2 fix: avoid connection logging crashes in agent (#20307)
- Ignore errors when reporting a connection from the server, just log
them instead
- Translate connection log IP `localhost` to `127.0.0.1` on both the
server and the agent

Note that the temporary fix for converting invalid IPs to localhost is
not required in main since the database no longer forbids NULL for the
IP column since https://github.com/coder/coder/pull/19788

Relates to #20194
2025-10-16 01:56:43 +11:00
Spike Curtis 1354d84eb4 chore: refactor instance identity to be a SessionTokenProvider (#19566)
Refactors Agent instance identity to be a SessionTokenProvider.

Refactors the CLI to create Agent clients via a centralized function, rather than add-hoc via individual command handlers and their flags.

This allows commands besides `coder agent`, but which still use the agent identity, to support instance identity authentication.

Fixes #19111 by unifying all API requests to go thru the SessionTokenProvider for auth credentials.
2025-09-03 10:38:42 +04:00
Danielle Maywood f41275eb39 feat(agent/agentcontainers): auto detect dev containers (#18950)
Relates to https://github.com/coder/internal/issues/711

This PR implements a project discovery mechanism that searches for any
dev container projects and makes them visible in the UI so that they can
be started. To make the wording on the site more clear, "Rebuild" has
been changed to "Start" when there is no container associated with a
known dev container configuration. I've also made it so that site will
show the dev container config path when there is no other name
available.

### Design decisions

Just want to ensure my explanation for a few design decisions are noted
down:
- We only search for dev container configurations inside git
repositories
- We only search for these git repositories if they're at the top level
or a direct child of the agent directory.

This limited approach is to reduce the amount of files we ultimately
walk when trying to find these projects. It makes sense to limit it to
only the agent directory, although I'm open to expanding how deep we
search.
2025-07-22 19:02:43 +01:00
Dean Sheather a1b87a67c6 fix: use client preferred URL for the default DERP (#18911)
The agentsdk currently does a remap of the DERP map to change the
EmbeddedRelay node's URL to match the agent's access URL.

This PR makes changes to the `workspacesdk` (used by clients like the
CLI) and `vpn` (used by Coder Desktop) to match this behavior.

This enables us the ability to try Coder clients in dogfood over a VPN
without changing the global access URL.
2025-07-17 20:17:44 +10:00
Danielle Maywood 0118e75009 fix(agent): disable dev container integration inside sub agents (#18781)
It appears we accidentally broke this logic in a previous PR. This
should now correctly disable the agent api as we'd expect.
2025-07-08 11:05:30 +01:00
Mathias Fredriksson 0f3a1e9849 fix(agent/agentcontainers): split Init into Init and Start for early API responses (#18640)
Previously in #18635 we delayed the containers API `Init` to avoid producing
errors due to Docker and `@devcontainers/cli` not yet being installed by startup
scripts. This had an adverse effect on the UX via UI responsiveness as the
detection of devcontainers was greatly delayed.

This change splits `Init` into `Init` and `Start` so that we can immediately
after `Init` start serving known devcontainers (defined in Terraform), improving
the UX.

Related #18635
Related #18640
2025-06-27 19:01:50 +03:00
Mathias Fredriksson 8ee2668b39 fix(agent): fix script filtering for devcontainers (#18635) 2025-06-27 16:59:31 +03:00
Mathias Fredriksson 7e99fb7d7e fix(agent): delay containerAPI init to ensure startup scripts run before (#18630) 2025-06-27 14:10:35 +03:00
ケイラ 09cc906981 chore: remove unnecessary redeclarations in for loops (part 2) (#18593) 2025-06-26 12:28:00 -06:00
Mathias Fredriksson eca6381314 feat(agent/agentcontainers): add more envs to readconfig for app URL building (#18603) 2025-06-26 09:33:58 +00:00
Danielle Maywood c4e4fe85f9 fix(agent): start devcontainers through agentcontainers package (#18471)
Fixes https://github.com/coder/internal/issues/706

Context for the implementation here
https://github.com/coder/internal/issues/706#issuecomment-2990490282

Synchronously starts dev containers defined in terraform with our
`DevcontainerCLI` abstraction, instead of piggybacking off of our
`agentscripts` package. This gives us more control over logs, instead of
being reliant on packages which may or may not exist in the
user-provided image.
2025-06-25 11:52:50 +01:00
Mathias Fredriksson 99d124e276 feat(agent): enable devcontainers by default (#18533) 2025-06-24 21:17:04 +03:00