Files
Kyle Carberry 147f50c5e8 fix(agent/x/agentmcp): watch MCP config files for late-appearing or rewritten config (#25172)
## Bug

`agent/x/agentmcp/Manager` resolves its config paths once at boot, calls
`Reload`, and then only re-stats them lazily when a `GET
/api/v0/mcp/tools` request arrives (PR #24700). If any of the manager's
MCP config files (`~/.mcp.json` by default, or whatever paths
`agentcontextconfig.MCPConfigFiles()` resolves from
`CODER_AGENT_EXP_MCP_CONFIG_FILES`) is created, atomically rewritten, or
removed _after_ that initial `Reload` and _before_ the next tools HTTP
request, the manager keeps serving the stale (often empty) snapshot.
`parseAndDedup` silently swallows `fs.ErrNotExist`, so a late-appearing
file looks indistinguishable from "no config" until something pokes the
manager again.

This affects any workspace where the file lands after
`MarkStartupSettled` fires, including:

- a startup script that writes `~/.mcp.json` after MCP init
- a user creating or editing the file mid-session
- an installer, dotfiles step, or sync tool writing the file later in
startup
- another agent process (Claude Code, etc.) producing the file
out-of-band
- editor rewrites that land as `Write + Chmod + Rename` bursts

### Concrete repro (dual-agent workspace)

The race is easiest to reproduce on dual-agent workspaces (inner sandbox
+ outer host), where the inner agent's `scriptRunner` has
`script_count=0` and `mcpManager.Reload` fires at ~t+0.3 s while the
host agent writes `~/.mcp.json` ~21 s later. Timeline from
`workspace-otto-aa16`:

- agent up `20:23:38.918`
- lifecycle Ready `20:23:39.200`
- MCP config file Birth `20:24:00.460` (~21 s gap)
- no MCP log lines for 8 minutes
- first `GET /api/v0/mcp/tools` at `20:32:11.812` logs `[warn] mcp: mcp
reload canceled by caller`, takes 4946 ms; subsequent turns are cached
at 2 ms.

The single-agent case has the same race; it's just usually narrow enough
that the next HTTP request masks it, at the cost of a multi-second stall
on the first call that has to do the lazy reload itself. PR #25034's
`MarkStartupSettled` does not help: "settled" fires before the file is
necessarily on disk.

## Fix

Add an fsnotify-backed `configWatcher` to `agent/x/agentmcp/Manager`.
The watcher consumes whatever paths the manager is told to reload, which
is the same `[]string` returned by
`agentcontextconfig.MCPConfigFiles()`.

For each path the watcher:

- Watches the **parent directory** of the path, not the file itself.
This handles late creation, atomic rewrite (rename + create), and
deletion uniformly because inotify watches on individual non-existent
files return `ENOENT` and are lost across renames. The pattern matches
`agent/agentcontainers/watcher`.
- Walks up to the first existing **ancestor directory** when the parent
does not yet exist, and re-arms deeper on `Create` events that promote
an unrealized path.
- Refcounts directory watches so multiple configured paths sharing a
parent dir only register one inotify watch.
- Resolves symlinks **once at arming time** via `filepath.EvalSymlinks`;
never chases arbitrary symlink targets on events.
- Debounces multi-event editor writes through a single
`quartz.AfterFunc` timer so a `Write + Chmod + Rename` burst produces
one reload.
- Fires a debounced callback that calls `Manager.Reload`, which routes
through the existing singleflight so concurrent triggers coalesce.
- Re-syncs on every `Reload` call so a future path-list change is picked
up.

Lifecycle: the watcher is armed lazily on the first `Reload` (no
goroutine cost for unit tests that never reload). `Manager.Close` marks
the manager closed and closes its `closedCh` before tearing down the
watcher, so any in-flight watcher-driven reload observes the close via
`waitReload` and returns `ErrManagerClosed` instead of blocking
`firesWG.Wait()` on a stuck connect. The watcher then waits for its
goroutine and any in-flight debounced `fire` callback before returning.

`parseAndDedup` behavior is unchanged: `fs.ErrNotExist` still records an
empty snapshot. With the watcher armed before that snapshot is
committed, any `Create` event that races `parseAndDedup` is still
delivered.

This is the agent-side complement to PR #25169, which fixes chatd's
mid-turn workspace MCP discovery. `MarkStartupSettled` semantics are not
changed.

## Tests (`agent/x/agentmcp/configwatcher_internal_test.go`)

All new tests pass with the fix and fail without it. No `time.Sleep`;
synchronization uses `testutil.Eventually` for fsnotify-driven
assertions and the quartz mock clock for debounce assertions.

- `TestWatcher_LateFileTriggersReload` - the late-file regression: empty
dir, settle startup, `Reload` sees no file, write the file later,
watcher reloads, tools appear.
- `TestWatcher_RewriteTriggersReload` - existing file overwritten with a
new server list, watcher reloads, cache reflects new server.
- `TestWatcher_RemovalTransitionsToEmpty` - delete the file, watcher
reloads, manager transitions to empty cleanly.
- `TestWatcher_DebouncesBurst` - quartz mock clock; three back-to-back
`scheduleFire` calls produce exactly one `onChange` after `AdvanceNext`.
- `TestWatcher_CloseStopsGoroutine` - construct/Reload/Close five times
to surface goroutine or fd leaks under `-race`.
- `TestWatcher_DualAgentHTTPNoStall` - integration: write file after
`Reload`, wait for watcher reload, then `GET /tools` returns the MCP
tools in less than `testutil.WaitShort` instead of the multi-second
"reload canceled" stall.
- `TestWatcher_LateParentDirTriggersReload` - parent dir doesn't exist
at `Reload` time; create the dir then the file; watcher re-arms deeper
and reloads.
- `TestWatcher_SharedParentRefcount` - two configured paths share a
parent dir; only one inotify watch is registered and both reload on
changes.
- `TestWatcher_CloseDoesNotStallOnInFlightReload` - installs a
`connectStartedHook` to block a watcher-driven reload mid-`connectAll`,
then asserts `Close()` returns within `WaitMedium`. Regression-verified:
reverting the close-ordering causes the test to time out.

## Acceptance checklist

- [x] `go test ./agent/x/agentmcp/... -race -count=1` passes (also
`-count=5`).
- [x] All `./agent/...` tests pass under `-race -short`.
- [x] No emdash, endash, or ` -- `; `scripts/check_emdash.sh` clean.
- [x] No `time.Sleep` in tests.
- [x] New tests fail without the fix and pass with it (verified by
temporarily disabling `m.armWatcher(paths)` and by reverting `Close()`
ordering).

## Out of scope

No changes to chatd's mid-turn workspace MCP discovery (PR #25169) or
`MarkStartupSettled` semantics.

---

<sub>This pull request was prepared by a [Coder
Agents](https://coder.com/docs/admin/ai-coder) run.</sub>
2026-05-12 11:32:39 -04:00
..