mirror of
https://github.com/coder/coder.git
synced 2026-06-02 20:48:20 +00:00
147f50c5e8
## Bug `agent/x/agentmcp/Manager` resolves its config paths once at boot, calls `Reload`, and then only re-stats them lazily when a `GET /api/v0/mcp/tools` request arrives (PR #24700). If any of the manager's MCP config files (`~/.mcp.json` by default, or whatever paths `agentcontextconfig.MCPConfigFiles()` resolves from `CODER_AGENT_EXP_MCP_CONFIG_FILES`) is created, atomically rewritten, or removed _after_ that initial `Reload` and _before_ the next tools HTTP request, the manager keeps serving the stale (often empty) snapshot. `parseAndDedup` silently swallows `fs.ErrNotExist`, so a late-appearing file looks indistinguishable from "no config" until something pokes the manager again. This affects any workspace where the file lands after `MarkStartupSettled` fires, including: - a startup script that writes `~/.mcp.json` after MCP init - a user creating or editing the file mid-session - an installer, dotfiles step, or sync tool writing the file later in startup - another agent process (Claude Code, etc.) producing the file out-of-band - editor rewrites that land as `Write + Chmod + Rename` bursts ### Concrete repro (dual-agent workspace) The race is easiest to reproduce on dual-agent workspaces (inner sandbox + outer host), where the inner agent's `scriptRunner` has `script_count=0` and `mcpManager.Reload` fires at ~t+0.3 s while the host agent writes `~/.mcp.json` ~21 s later. Timeline from `workspace-otto-aa16`: - agent up `20:23:38.918` - lifecycle Ready `20:23:39.200` - MCP config file Birth `20:24:00.460` (~21 s gap) - no MCP log lines for 8 minutes - first `GET /api/v0/mcp/tools` at `20:32:11.812` logs `[warn] mcp: mcp reload canceled by caller`, takes 4946 ms; subsequent turns are cached at 2 ms. The single-agent case has the same race; it's just usually narrow enough that the next HTTP request masks it, at the cost of a multi-second stall on the first call that has to do the lazy reload itself. PR #25034's `MarkStartupSettled` does not help: "settled" fires before the file is necessarily on disk. ## Fix Add an fsnotify-backed `configWatcher` to `agent/x/agentmcp/Manager`. The watcher consumes whatever paths the manager is told to reload, which is the same `[]string` returned by `agentcontextconfig.MCPConfigFiles()`. For each path the watcher: - Watches the **parent directory** of the path, not the file itself. This handles late creation, atomic rewrite (rename + create), and deletion uniformly because inotify watches on individual non-existent files return `ENOENT` and are lost across renames. The pattern matches `agent/agentcontainers/watcher`. - Walks up to the first existing **ancestor directory** when the parent does not yet exist, and re-arms deeper on `Create` events that promote an unrealized path. - Refcounts directory watches so multiple configured paths sharing a parent dir only register one inotify watch. - Resolves symlinks **once at arming time** via `filepath.EvalSymlinks`; never chases arbitrary symlink targets on events. - Debounces multi-event editor writes through a single `quartz.AfterFunc` timer so a `Write + Chmod + Rename` burst produces one reload. - Fires a debounced callback that calls `Manager.Reload`, which routes through the existing singleflight so concurrent triggers coalesce. - Re-syncs on every `Reload` call so a future path-list change is picked up. Lifecycle: the watcher is armed lazily on the first `Reload` (no goroutine cost for unit tests that never reload). `Manager.Close` marks the manager closed and closes its `closedCh` before tearing down the watcher, so any in-flight watcher-driven reload observes the close via `waitReload` and returns `ErrManagerClosed` instead of blocking `firesWG.Wait()` on a stuck connect. The watcher then waits for its goroutine and any in-flight debounced `fire` callback before returning. `parseAndDedup` behavior is unchanged: `fs.ErrNotExist` still records an empty snapshot. With the watcher armed before that snapshot is committed, any `Create` event that races `parseAndDedup` is still delivered. This is the agent-side complement to PR #25169, which fixes chatd's mid-turn workspace MCP discovery. `MarkStartupSettled` semantics are not changed. ## Tests (`agent/x/agentmcp/configwatcher_internal_test.go`) All new tests pass with the fix and fail without it. No `time.Sleep`; synchronization uses `testutil.Eventually` for fsnotify-driven assertions and the quartz mock clock for debounce assertions. - `TestWatcher_LateFileTriggersReload` - the late-file regression: empty dir, settle startup, `Reload` sees no file, write the file later, watcher reloads, tools appear. - `TestWatcher_RewriteTriggersReload` - existing file overwritten with a new server list, watcher reloads, cache reflects new server. - `TestWatcher_RemovalTransitionsToEmpty` - delete the file, watcher reloads, manager transitions to empty cleanly. - `TestWatcher_DebouncesBurst` - quartz mock clock; three back-to-back `scheduleFire` calls produce exactly one `onChange` after `AdvanceNext`. - `TestWatcher_CloseStopsGoroutine` - construct/Reload/Close five times to surface goroutine or fd leaks under `-race`. - `TestWatcher_DualAgentHTTPNoStall` - integration: write file after `Reload`, wait for watcher reload, then `GET /tools` returns the MCP tools in less than `testutil.WaitShort` instead of the multi-second "reload canceled" stall. - `TestWatcher_LateParentDirTriggersReload` - parent dir doesn't exist at `Reload` time; create the dir then the file; watcher re-arms deeper and reloads. - `TestWatcher_SharedParentRefcount` - two configured paths share a parent dir; only one inotify watch is registered and both reload on changes. - `TestWatcher_CloseDoesNotStallOnInFlightReload` - installs a `connectStartedHook` to block a watcher-driven reload mid-`connectAll`, then asserts `Close()` returns within `WaitMedium`. Regression-verified: reverting the close-ordering causes the test to time out. ## Acceptance checklist - [x] `go test ./agent/x/agentmcp/... -race -count=1` passes (also `-count=5`). - [x] All `./agent/...` tests pass under `-race -short`. - [x] No emdash, endash, or ` -- `; `scripts/check_emdash.sh` clean. - [x] No `time.Sleep` in tests. - [x] New tests fail without the fix and pass with it (verified by temporarily disabling `m.armWatcher(paths)` and by reverting `Close()` ordering). ## Out of scope No changes to chatd's mid-turn workspace MCP discovery (PR #25169) or `MarkStartupSettled` semantics. --- <sub>This pull request was prepared by a [Coder Agents](https://coder.com/docs/admin/ai-coder) run.</sub>