Commit Graph

18 Commits

Author SHA1 Message Date
Mathias Fredriksson c8359d8598 fix(agent/agentproc): read process info before output to prevent TOCTOU (#25646)
handleProcessOutput read proc.output() then proc.info() using
separate locks. Between the two reads the exit goroutine could
finish I/O and set running=false, pairing stale output with final
status. On Windows CI this caused OutputExceedsBuffer to flake
when the buffer snapshot caught mid-write data (OmittedBytes=0)
but info reported the process as exited.

Swap the read order so info is read first. The exit goroutine
completes cmd.Wait (draining all pipe data) before setting
running=false, so seeing Running=false guarantees the subsequent
output read reflects the final buffer state.

Closes CODAGT-399
2026-05-25 17:27:29 +03:00
Ethan 3a9080fff6 feat: tag chat-originating agent logs with chat_id (#25019)
Workspace-agent logs emitted while serving chatd-driven requests were
not correlated with the originating chat, making agent logs hard to
attribute to the corresponding/originating chat.

This adds agent-side chat context middleware that parses `Coder-Chat-Id`
once, enriches agent access logs and structured handler/background logs,
and adds a chatd bridge log when chat headers are attached to an agent
connection.

Closes CODAGT-324
2026-05-08 13:25:30 +10:00
Kyle Carberry 19e44f4136 fix: target specific chat in MarkStale instead of broadcasting to all workspace chats (#23883)
## Problem

Subagent chats were receiving git context (branch, remote origin, PR
status) from their parent or sibling chats' git operations. When a git
operation triggers external auth, the workspace agent sends `chat_id`
identifying which chat initiated it — but this was broken at two levels:

1. **Agent side:** `CODER_CHAT_ID` was never injected into process
   environments. `chatd` sets `Coder-Chat-Id` HTTP headers and the
   agent extracts them for process isolation, but never propagated
   `CODER_CHAT_ID` to `cmd.Env`. So `gitaskpass` always sent an empty
   `chat_id`.

2. **Server side:** `workspaceAgentsExternalAuth` ignored the `chat_id`
   query param. `MarkStale` broadcast git context to **all** chats on
   the workspace via `filterChatsByWorkspaceID`.

## Fix

- Inject `CODER_CHAT_ID` into `cmd.Env` in `agentproc` when the chat
  ID is known, so `gitaskpass` can read and forward it.
- Read `chat_id` from query params in `workspaceAgentsExternalAuth`
  and thread it through `chatGitRef`.
- Refactor `MarkStale` to accept a `MarkStaleParams` struct. When
  `ChatID` is provided, target only that specific chat. When empty
  (legacy agents, non-chat git operations), fall back to the existing
  workspace-wide broadcast.
- Extract `markStaleSingle` helper to deduplicate the upsert+publish
  logic.

<details><summary>Investigation notes</summary>

### Data flow before fix

```
chatd → sets Coder-Chat-Id header on agent conn
agent → extracts chatID, stores on process struct
agent → does NOT set CODER_CHAT_ID in cmd.Env  ← gap 1
gitaskpass → reads CODER_CHAT_ID (always empty), sends chat_id=""
server handler → ignores chat_id query param     ← gap 2
MarkStale → broadcasts to ALL workspace chats
```

### Data flow after fix

```
chatd → sets Coder-Chat-Id header on agent conn
agent → extracts chatID, stores on process struct
agent → sets CODER_CHAT_ID in cmd.Env
gitaskpass → reads CODER_CHAT_ID, sends chat_id=<uuid>
server handler → reads chat_id, passes to MarkStale
MarkStale → targets only that specific chat
```

</details>
2026-04-01 13:04:59 +00:00
Mathias Fredriksson 4aa94fcd4c fix: StatusWriter Unwrap and process output error recovery (#23383)
Add Unwrap() to StatusWriter so http.ResponseController.SetWriteDeadline
can reach the underlying net.Conn through the middleware wrapper. Without
this, the agent's 20s WriteTimeout killed blocking process output
connections.

Also add 30s headroom to the write deadline in handleProcessOutput so
the response can be written after a full-duration blocking wait.

On the tool layer, waitForProcess and the process_output tool now try a
non-blocking snapshot on any error, not just context timeout. Transport
errors (like the WriteTimeout EOF) previously returned with no process
ID and no recovery path. Now if the process finished, the result is
returned transparently. If still running, the error includes the process
ID and tells the agent to use process_output.
2026-03-20 20:00:55 +00:00
Mathias Fredriksson 41e15ae440 feat: make process output blocking-capable (#23312)
Replace the 200ms polling loop in chatd's execute and
process_output tools with server-side blocking via sync.Cond
on HeadTailBuffer.

The agent's GET /{id}/output endpoint accepts ?wait=true to
block until the process exits or a 5-minute server cap expires.
The process_output tool blocks by default for 10s (overridable
via wait_timeout), and falls back to a non-blocking snapshot on
timeout. The execute tool's foreground path makes a single
blocking call instead of polling.

Related #23316
2026-03-20 14:33:55 +02:00
Mathias Fredriksson 6edcbdba7f fix(agent/agentproc): enforce chat ID isolation on output and signal endpoints (#23316)
handleProcessOutput and handleSignalProcess did not check the
chat ID from the request. Any caller that knew a process ID
could read output or signal processes belonging to other chats.

handleListProcesses already filtered by chat ID. Apply the
same check to the output and signal handlers. Non-chat callers
(no Coder-Chat-Id header) are allowed through for backwards
compatibility.
2026-03-20 11:24:45 +02:00
Mathias Fredriksson 119030d795 fix(agent): default process working directory to agent dir or $HOME (#23224)
Processes started via the agent process API inherited the agent's
own working directory (/tmp/coder.xxx) when no WorkDir was
specified. SSH sessions already use a fallback chain: configured
agent directory > $HOME. This wires the same manifest directory
closure into the process manager so the priority is now:

  explicit req.WorkDir > agent configured dir > $HOME

The resolved directory is recorded on the process struct so
ProcessInfo.WorkDir and pathStore notifications reflect where
the process actually ran.
2026-03-18 16:46:26 +00:00
Kyle Carberry 6972d073a2 fix: improve background process handling for agent tools (#23132)
## Problem

Models frequently use shell `&` instead of `run_in_background=true` when
starting long-running processes through `/agents`, causing them to die
shortly after starting. This happens because:

1. **No guidance in tool schema** — The `ExecuteArgs` struct had zero
`description` tags. The model saw `run_in_background: boolean
(optional)` with no explanation of when/why to use it.
2. **Shell `&` is silently broken** — `sh -c "command &"` forks the
process, the shell exits immediately, and the forked child becomes an
orphan not tracked by the process manager.
3. **No process group isolation** — The SSH subsystem sets `Setsid:
true` on spawned processes, but the agent process manager set no
`SysProcAttr` at all. Signals only hit the top-level `sh`, not child
processes.

## Investigation

Compared our implementation against **openai/codex** and **coder/mux**:

| Aspect | codex | mux | coder/coder (before) |
|--------|-------|-----|---------------------|
| Background flag | Yield/resume with `session_id` | `run_in_background`
with rich description | `run_in_background` with **no description** |
| `&` handling | `setsid()` + `killpg()` | `detached: true` +
`killProcessTree()` | **Nothing** — orphaned children escape |
| Process isolation | `setsid()` on every spawn | `set -m; nohup ...
setsid` for background | **No `SysProcAttr` at all** |
| Signal delivery | `killpg(pgid, sig)` — entire group | `kill -15
-\$pid` — negative PID | `proc.cmd.Process.Signal()` — **PID only** |

## Changes

### Fix 1: Add descriptions to `ExecuteArgs` (highest impact)
The model now sees explicit guidance: *"Use for long-running processes
like dev servers, file watchers, or builds. Do NOT use shell & — it will
not work correctly."*

### Fix 2: Update tool description
The top-level execute tool description now reinforces: *"Use
run_in_background=true for long-running processes. Never use shell '&'
for backgrounding."*

### Fix 3: Detect trailing `&` and auto-promote to background
Defense-in-depth: if the model still uses `command &`, we strip the `&`
and promote to `run_in_background=true` automatically. Correctly
distinguishes `&` from `&&`.

### Fix 4: Process group isolation (`Setpgid`)
New platform-specific files (`proc_other.go` / `proc_windows.go`)
following the same pattern as `agentssh/exec_other.go`. Every spawned
process gets its own process group.

### Fix 5: Process group signaling
`signal()` now uses `syscall.Kill(-pid, sig)` on Unix to signal the
entire process group, ensuring child processes from shell pipelines are
also cleaned up.

## Testing
All existing `agent/agentproc` tests pass. Both packages compile
cleanly.
2026-03-16 16:22:10 -04:00
Kyle Carberry 0e1846fe2a fix(agent): reap exited processes and scope process list by chat ID (#22944) 2026-03-12 14:51:05 -07:00
Hugo Dutka 48ab492f49 feat: agents git watch backend (#22565)
Adds real-time git status watching for workspace agents, so the frontend
can subscribe over WebSocket and show
git file changes in near real-time.

1. Subscription is scoped to a **chat** via `GET
/api/experimental/chats/{chat}/git/watch`.
2. The workspace agent automatically determines which paths to watch
based on tool calls made by the chat (and its ancestor chats).
3. Workspace agent polls subscribed repo working trees on a 30s
interval, on tools calls, and on explicit `refresh` from the client.
4. Scans are rate-limited to at most once per second.
5. Edited paths are tracked **in-memory** inside the workspace agent.
There is no database persistence — state is lost on agent restart. This
will be addresses in a future PR.
6. Messages sent over WebSocket include a full-repo snapshot (unified
diff, branch, origin). A new message is emitted only when the snapshot
changes.

This PR was implemented with AI with me closely controlling what it's
doing. The code follows a plan file that was updated continuously during
implementation. Here's the file if you'd like to see it:
[project.md](https://gist.github.com/hugodutka/8722cf80c92f8a56555f7bc595b770e2).
It reflects the current state of the PR.
2026-03-06 10:47:55 +01:00
Kyle Carberry fb6bf3a568 fix(agent): wire updateCommandEnv into process manager (#22451)
## Problem

The `agentproc` process manager spawns processes with only
`os.Environ()`, missing agent-level environment variables like
`GIT_ASKPASS`, `CODER_*`, and `GIT_SSH_COMMAND` that are injected by the
agent's `updateCommandEnv` function. This means processes started
through the HTTP process API (used by chat tools) cannot authenticate
git operations via the Coder gitaskpass helper.

By contrast, SSH sessions get the full agent environment because the SSH
server calls `updateCommandEnv` via its `UpdateEnv` config hook.

## Fix

Wire the agent's `updateCommandEnv` hook into the process manager so all
spawned processes receive the full agent environment. The hook is:

- Passed as a parameter through `NewAPI` → `newManager`
- Called in `manager.start()` with `os.Environ()` as the base, producing
the same enriched env that SSH sessions get
- Gracefully falls back to `os.Environ()` if the hook returns an error

Request-level env vars (`req.Env`, set by chat tools) are still appended
last and take precedence.

## Changes

- `agent/agentproc/process.go`: Add `updateEnv` field to manager, call
it when building process env
- `agent/agentproc/api.go`: Accept `updateEnv` parameter in `NewAPI`
- `agent/agent.go`: Pass `a.updateCommandEnv` when creating the process
API
- `agent/agentproc/api_test.go`: Add `UpdateEnvHook` and
`UpdateEnvHookOverriddenByReqEnv` tests

Co-authored-by: Coder <coder@coder.com>
2026-02-28 21:58:59 -05:00
Kyle Carberry a621c3cb13 feat(agent): add process execution API and rewrite execute tool (#22416)
## Summary

Adds a new agent-side process management HTTP API and rewrites the chat
execute tool to use it instead of SSH sessions.

## What changed

### New agent/agentproc/ package

- **headtail.go** — Thread-safe io.Writer with bounded memory (16KB head
+ 16KB tail ring buffer). Provides LLM-ready output with truncation
metadata and long-line truncation at 2048 bytes.
- **headtail_test.go** — 16 tests including race detector coverage for
concurrent writes.
- **process.go** — Manager + Process types for lifecycle management
using agentexec.Execer for proper OOM/nice scores.
- **api.go** — HTTP API following the agentfiles chi router pattern. 4
endpoints: start, list, output, signal.

### Agent wiring (agent/agent.go, agent/api.go)

Mounts the process API at /api/v0/processes, mirroring how agentfiles is
mounted.

### SDK (codersdk/workspacesdk/agentconn.go)

4 new AgentConn interface methods + 7 request/response types:
- StartProcess, ListProcesses, ProcessOutput, SignalProcess

### Execute tool rewrite (coderd/chatd/chattool/execute.go)

- SSH to Agent API: conn.StartProcess() + conn.ProcessOutput() polling
- New parameters: workdir, run_in_background
- Structured response: success, exit_code, wall_duration_ms, error,
truncated, note, background_process_id
- Non-interactive env vars: GIT_EDITOR=true, TERM=dumb, NO_COLOR=1,
PAGER=cat, etc.
- Output truncation: HeadTailBuffer caps at 32KB for LLM consumption
- File-dump detection with advisory notes suggesting read_file
- Default timeout: 60s to 10s
- Foreground polling: 200ms intervals until exit or timeout

## Architecture

State lives on the agent, surviving coderd failover and instance
changes. Any coderd replica can query any agent via HTTP over tailnet.
2026-02-28 12:33:52 -05:00
Jon Ayers 1f238fed59 feat: integrate new agentexec pkg (#15609)
- Integrates the `agentexec` pkg into the agent and removes the
legacy system of iterating over the process tree. It adds some linting
rules to hopefully catch future improper uses of `exec.Command` in the package.
2024-11-27 20:12:15 +02:00
Jon Ayers bfdc29f466 fix: suppress benign errors when listing processes (#14660) 2024-09-12 23:00:04 +01:00
Jon Ayers 426e9f2b96 feat: support adjusting child proc oom scores (#12655) 2024-04-03 09:42:03 -05:00
Steven Masley dd05a6b13a chore: mockgen archived, moved to new location (#11415)
* chore: mockgen archived, moved to new location
2024-01-04 18:35:56 -06:00
Spike Curtis edeb9bb42a fix: appease linter on darwin (#11154)
Fixing up some linting errors that show up on Darwin, but not in CI.
2023-12-12 17:02:28 +04:00
Jon Ayers 7311ffbd9d feat: implement agent process management (#9461)
- An opt-in feature has been added to the agent to allow
   deprioritizing non coder-related processes for CPU by setting their
   niceness level to 10.
- Opting in to the feature requires setting CODER_PROC_PRIO_MGMT to a non-empty value.
2023-09-14 19:45:05 -05:00