Files
coder/coderd
Kyle Carberry 159089686a fix(coderd/x/chatd): prime workspace MCP cache after create/start (#25298)
## Problem

Mid-turn workspace MCP discovery was broken when an agent was still
cold-starting. `PrepareTools` in `chatd.go` flipped
`workspaceMCPDiscovered = true` *before* calling
`discoverWorkspaceMCPTools`, so a failed discovery attempt permanently
blocked retries within the turn.

Customer-reported repro:

- New chat with no pre-selected workspace.
- LLM calls `create_workspace` mid-turn at `23:35:05`.
- `PrepareTools` fires, dials the agent with a 30s timeout, dial times
out at `23:38:15`, `discoverWorkspaceMCPTools` returns empty.
- Agent connects at `23:38:29`, 14 seconds later.
- `workspaceMCPDiscovered` was already true, so `PrepareTools` never
retried for the rest of the turn. MCP tools only appeared on the next
user message.

A naive retry loop in `PrepareTools` would also miss the bigger picture:
a workspace boot can take several minutes (EC2 cold start, 10 min
startup scripts), and the chatloop only gets a chance to call
`PrepareTools` between LLM steps.

## Fix

Do the workspace MCP discovery from inside the tool that already waits
for the agent. `chattool.CreateWorkspace` and `chattool.StartWorkspace`
call `waitForAgentReady`, which has a 2 min agent-online budget plus a
10 min startup-script budget. By the time they fire `OnChatUpdated`, the
agent is `Ready`. The chatd `onChatUpdated` callback now launches an
async `primeWorkspaceMCPCache` goroutine on every bind that has a valid
workspace ID:

- The primer calls `discoverWorkspaceMCPTools` until it returns a
non-empty list or `workspaceMCPPrimeMaxWait` (30s) elapses, with a 2s
backoff between attempts. The bounded wait handles the short race
between agent-online and the agent's MCP `Connect` settling.
- The primer runs asynchronously so the tool itself never blocks. Some
templates simply do not advertise MCP tools, in which case the primer
would otherwise spend its full budget for nothing.
- The primer shares the chat `ctx` (not a detached one) so it is
canceled together with the chat. A dangling primer would re-dial the
workspace conn after `runChat`'s deferred `workspaceCtx.close()` and
leak that conn.
- `inflight.Add(1)` ensures server shutdown still waits for any
in-progress primer.
- `PrepareTools` is simplified back to a single discovery call. It now
only sets `workspaceMCPDiscovered = true` on success, so an empty result
no longer permanently blocks discovery within the turn. The cache hit
warmed by the primer makes that call cheap in the common case; the dial
fallback handles the rare cache miss.

## Tests

All in `coderd/x/chatd/chatd_internal_test.go`:

- `TestPrimeWorkspaceMCPCache_SuccessOnFirstAttempt` — single
`ListMCPTools` call returning tools populates the cache.
- `TestPrimeWorkspaceMCPCache_RetriesUntilToolsAppear` — first call
empty, second returns tools; primer retries past the backoff and writes
the cache. Uses `quartz.Mock.Trap` on `NewTimer`.
- `TestPrimeWorkspaceMCPCache_GivesUpAfterDeadline` — `ListMCPTools`
always empty; primer stops at `workspaceMCPPrimeMaxWait` and refuses to
cache the empty result so PrepareTools can retry on the next step.

The existing integration test
`TestRunChat_WorkspaceMCPDiscoveryAfterMidTurnCreateWorkspace` continues
to pass and now also exercises the async-primer path end-to-end via the
create_workspace tool.

```
go test ./coderd/x/chatd/... -count=1
go test ./coderd/x/chatd/ -race -count=1
make pre-commit
```

<details>
<summary>Design notes</summary>

- The first iteration of this PR added retry+cooldown+failure-cap logic
inside `PrepareTools`. It worked for the customer's ~30s race window but
did not help workspaces that take several minutes to boot, because
`PrepareTools` only fires between LLM steps. Reviewer pointed out the
right place to handle this is the tool itself; the current
implementation does that.
- Why async: a primer that ran synchronously inside the `OnChatUpdated`
callback blocked the create_workspace tool from returning for up to
`workspaceMCPPrimeMaxWait`, which broke
`TestCreateWorkspaceTool_EndToEnd` and would hurt any template that does
not expose MCP tools. Decoupling lets the tool return immediately and
lets the primer warm the cache concurrently with the next LLM step.
- Why share the chat `ctx` rather than `context.WithoutCancel(ctx)` (the
title-generation pattern): the primer touches
`workspaceCtx.getWorkspaceConn`, which `runChat`'s deferred
`workspaceCtx.close()` invalidates. A detached primer outliving the chat
would dial a fresh conn and leak it.
- The constant naming distinguishes `workspaceMCPDiscoveryTimeout` (35s
per-call dial budget, unchanged from #25169) from
`workspaceMCPPrimeMaxWait` (30s total budget for the post-ready primer
loop) and `workspaceMCPPrimeRetryInterval` (2s between empty-result
retries).

</details>

Follow-up to #25169.

---

_This pull request was generated by Coder Agents._
2026-05-18 07:55:56 -04:00
..