Commit Graph

2 Commits

Author SHA1 Message Date
Ethan c4ef94aacf fix(coderd/x/chatd): prevent chat hang when workspace agent is unavailable (#23707)
## Problem

Chats with a persisted `agent_id` binding hang indefinitely when the
workspace is stopped. The stale agent row still exists in the DB, so
`ensureWorkspaceAgent` succeeds, but the dial blocks forever in
`AwaitReachable`. The MCP discovery goroutine used an unbounded context,
so `g2.Wait()` never returned and the LLM never started.

## Fix

Three targeted changes restore the pre-binding behavior where stopped
workspaces degrade gracefully instead of blocking:

1. **`dialWithLazyValidation`**: "no agents in latest build" is now a
terminal fast-fail. The hanging dial is canceled and
`errChatHasNoWorkspaceAgent` is returned immediately, instead of falling
through to `waitForOriginalDial`.

2. **Pre-LLM workspace setup**: MCP discovery and instruction
persistence gate on `workspaceAgentIDForConn` before attempting any
dial. MCP discovery is bounded by a 5s timeout and checks the in-memory
tool cache first (using the cheap cached agent from
`ensureWorkspaceAgent`), so the common subsequent-turn path has zero DB
queries.

3. **`persistInstructionFiles`**: tracks whether the workspace
connection succeeded and skips sentinel persistence on failure, so the
next turn retries if the workspace is restarted.

## Scenarios

**Running workspace, subsequent turn (hot path):** MCP cache hit via
in-memory cached agent. Zero DB queries, zero dials. Unchanged from
#23274.

**Stopped workspace, persisted binding (the bug):** MCP cache hit (the
stale descriptors are fine because they fail at invocation). Pre-LLM
setup completes instantly. When a tool is invoked,
`dialWithLazyValidation` starts the dial, which fails or hangs;
validation then finds no agents and returns
`errChatHasNoWorkspaceAgent`. The model sees the error and can call
`start_workspace`.

**New chat, running workspace:** `ensureWorkspaceAgent` resolves via
latest-build, persists binding. MCP discovery dials and caches tools.

**New chat, stopped workspace:** `ensureWorkspaceAgent` finds no agents,
returns `errChatHasNoWorkspaceAgent`. Pre-LLM setup skips. LLM starts
with built-in tools only.

**Rebuilt workspace (agent switched):** MCP cache hit with stale agent
(harmless for one turn). Tool invocation dials the stale agent and fails
fast; `dialWithLazyValidation` then switches to the new agent and
persists the updated binding.

**Workspace restarted after stop:** No sentinel was persisted during the
stopped turn, so instruction persistence retries. Agent binding switches
to the new agent via `workspaceAgentIDForConn`.
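The retry behavior in this scenario depends on not writing the sentinel
when the workspace could not be reached. A minimal sketch of that idea
follows, assuming the sentinel is a simple "already persisted" marker;
`chatState`, `persistInstructions`, and `errWorkspaceUnreachable` are
hypothetical names, not the actual `persistInstructionFiles` code.

```go
package main

import (
	"errors"
	"fmt"
)

// chatState is a hypothetical stand-in for per-chat persisted state.
type chatState struct {
	instructionsPersisted bool     // the "sentinel" (assumed to be a marker)
	instructions          []string // instruction file contents
}

// errWorkspaceUnreachable is a hypothetical error for a failed connection.
var errWorkspaceUnreachable = errors.New("workspace unreachable")

// persistInstructions reads instruction files once per chat. If the read
// fails, the sentinel stays unset so the next turn tries again.
func persistInstructions(st *chatState, read func() ([]string, error)) error {
	if st.instructionsPersisted {
		return nil // already done on an earlier turn
	}
	files, err := read()
	if err != nil {
		return err // no sentinel written: a later turn will retry
	}
	st.instructions = files
	st.instructionsPersisted = true
	return nil
}

func main() {
	st := &chatState{}

	// Turn 1: workspace is stopped, so the read fails.
	err := persistInstructions(st, func() ([]string, error) {
		return nil, errWorkspaceUnreachable
	})
	fmt.Println(err, st.instructionsPersisted)

	// Turn 2: workspace has been restarted, so the retry succeeds.
	err = persistInstructions(st, func() ([]string, error) {
		return []string{"AGENTS.md"}, nil
	})
	fmt.Println(err, st.instructionsPersisted)
}
```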

**Transient DB error during validation:** The error is not
`errChatHasNoWorkspaceAgent`, so `dialWithLazyValidation` falls through
to `waitForOriginalDial` because it cannot prove the binding is stale.
No false positive.

**Tool invocation on stopped workspace:** `getWorkspaceConn` calls
`ensureWorkspaceAgent`, which returns the stale row. Validation in
`dialWithLazyValidation` then finds no agents and returns
`errChatHasNoWorkspaceAgent`; the cached state is cleared and the error
is returned to the model.
2026-03-27 18:47:39 +11:00
Ethan 61e31ec5cc perf(coderd/x/chatd): persist workspace agent binding across chat turns (#23274)
## Summary

This change removes the steady-state "resolve the latest workspace
agent" query from chat execution.

Instead of asking the database for the latest build's agent on every
turn, a chat now persists the workspace/build/agent binding it actually
uses and reuses that binding across subsequent turns. The common path
becomes "load the bound agent by ID and dial it", with fallback paths to
repair the binding when it is missing, stale, or intentionally changed.

## What changes

- add `workspace_id`, `build_id`, and `agent_id` binding fields to
`chats`
- expose those fields through the chat API / SDK so the execution
context is explicit
- load the persisted binding first in chatd, instead of always resolving
the latest build's agent
- persist a refreshed binding when chatd has to re-resolve the workspace
agent
- keep child / subagent chats on the same bound workspace context by
inheriting the parent binding
- leave `build_id` / `agent_id` unset for flows like `create_workspace`,
then bind them lazily on the next agent-backed turn

## Runtime behavior

The binding is treated as an optimistic cache of the agent a chat should
use:

- if the bound agent still exists and dials successfully, we use it
without a latest-build lookup
- if the bound agent is missing or no longer reachable, chatd
re-resolves against the latest build and persists the new binding
- if a workspace mutation changes the chat's target workspace, the
binding is updated as part of that mutation
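As a rough illustration of the "optimistic cache" idea, the sketch below
resolves a binding against an in-memory stand-in for the database. All
names here (`binding`, `fakeStore`, `resolveBinding`,
`errNoAgentInLatestBuild`) are hypothetical, and reachability is reduced
to "the agent row still exists"; the real code also dials the agent.

```go
package main

import (
	"errors"
	"fmt"
)

// binding mirrors the workspace/build/agent fields added to a chat.
type binding struct {
	WorkspaceID string
	BuildID     string
	AgentID     string
}

// errNoAgentInLatestBuild is a hypothetical error for an empty latest build.
var errNoAgentInLatestBuild = errors.New("no agent in latest build")

// fakeStore is a hypothetical in-memory stand-in for the database.
type fakeStore struct {
	agents        map[string]bool    // agent IDs that still exist
	latest        map[string]binding // workspace ID -> latest build's binding
	latestLookups int                // counts the expensive latest-build query
}

func (s *fakeStore) latestBinding(workspaceID string) (binding, error) {
	s.latestLookups++
	b, ok := s.latest[workspaceID]
	if !ok {
		return binding{}, errNoAgentInLatestBuild
	}
	return b, nil
}

// resolveBinding returns the binding to use and whether it was repaired.
// A repaired binding is one the caller should persist back to the chat.
func resolveBinding(s *fakeStore, b binding) (binding, bool, error) {
	// Hot path: the bound agent still exists, so skip the latest-build lookup.
	if b.AgentID != "" && s.agents[b.AgentID] {
		return b, false, nil
	}
	// Missing or unset binding: re-resolve against the latest build.
	fresh, err := s.latestBinding(b.WorkspaceID)
	if err != nil {
		return binding{}, false, err
	}
	return fresh, true, nil
}

func main() {
	s := &fakeStore{
		agents: map[string]bool{"agent-2": true},
		latest: map[string]binding{
			"ws-1": {WorkspaceID: "ws-1", BuildID: "build-2", AgentID: "agent-2"},
		},
	}

	// The chat is bound to an agent from an older build.
	stale := binding{WorkspaceID: "ws-1", BuildID: "build-1", AgentID: "agent-1"}
	got, repaired, err := resolveBinding(s, stale)
	fmt.Println(got.AgentID, repaired, err, s.latestLookups)

	// The next turn reuses the repaired binding without another lookup.
	got, repaired, err = resolveBinding(s, got)
	fmt.Println(got.AgentID, repaired, err, s.latestLookups)
}
```

A binding with an empty `AgentID` takes the same re-resolve path, which
matches the lazy binding described for flows like `create_workspace`.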

To avoid reintroducing a hot-path query, dialing uses lazy validation:

- start dialing the cached agent immediately
- only validate against the latest build if the dial is still pending
after a short delay
- if validation finds a different agent, cancel the stale dial, switch
to the current agent, and persist the repaired binding

## Result

The hot path stops issuing
`GetWorkspaceAgentsInLatestBuildByWorkspaceID` for every user message,
which is the source of the DB pressure this PR is addressing. At the
same time, chats still converge to the correct workspace agent when the
binding becomes stale due to rebuilds or explicit workspace changes.
2026-03-26 17:22:38 +11:00