mirror of
https://github.com/coder/coder.git
synced 2026-06-04 13:38:21 +00:00
1e07ec49a672580fff4e7d47061d47ae12461c2d
5 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
42c12176a0 |
fix(chatd): persist interrupted tool call steps instead of losing them (#23011)
## Problem When a chat is interrupted while tools are executing, the step content (text, reasoning, tool calls, and partial tool results) was being lost. Two gaps existed: 1. **During tool execution**: `executeTools` returns with error results for interrupted tools, but the subsequent `PersistStep(ctx, ...)` fails on the canceled context and returns `ErrInterrupted` without persisting anything. 2. **PersistStep race**: If the context is canceled between the post-tool interrupt check and the `PersistStep` call, the same loss occurs. This is inconsistent with how we handle stream interruptions (which properly flush and persist partial content via `persistInterruptedStep`) and how [coder/blink](https://github.com/coder/blink) handles interruptions (always inserting the response message regardless of execution phase). ## Fix Two changes in `chatloop.go`: - **Post-tool-execution interrupt check**: After `executeTools` returns, check if the context was interrupted and route through `persistInterruptedStep` (which uses `context.WithoutCancel` internally) to save the accumulated content. - **PersistStep fallback**: If `PersistStep` returns `ErrInterrupted`, retry via `persistInterruptedStep` so partial content is not lost. ## Tests - `TestRun_InterruptedDuringToolExecutionPersistsStep`: Verifies that when a tool is blocked and the chat is interrupted, the step (text + reasoning + tool call + tool error result) is persisted via the interrupt-safe path. - `TestRun_PersistStepInterruptedFallback`: Verifies that when `PersistStep` itself returns `ErrInterrupted`, the step is retried via the fallback path and content is saved. |
||
|
|
072e9a212f |
fix(chatloop): keep provider-executed tool results in assistant message (#23012)
## Problem When a step contains both provider-executed tool calls (e.g. Anthropic web search) and local tool calls in parallel, the next loop iteration fails with the Anthropic API claiming the regular tool call has no result. However, sending a new user message (which reloads messages from the DB) works fine. ## Root cause `toResponseMessages` was placing **all** tool results into the tool-role message, regardless of `ProviderExecuted`. When Fantasy's Anthropic provider later converted these messages for the API, it moved the provider tool result from the tool message to the **end** of the previous assistant message (`prevMsg.Content = append(...)`). This placed `web_search_tool_result` **after** the regular `tool_use` block: ``` assistant: [server_tool_use(A), tool_use(B), web_search_tool_result(A)] ← wrong order user: [tool_result(B)] ``` The persistence layer in `chatd.go` already handles this correctly — provider-executed tool results stay in the assistant message, producing the expected ordering: ``` assistant: [server_tool_use(A), web_search_tool_result(A), tool_use(B)] ← correct order user: [tool_result(B)] ``` This is why reloading from the DB fixed it. ## Fix In the `ContentTypeToolResult` case of `toResponseMessages`, route provider-executed results to `assistantParts` instead of `toolParts`, matching the persistence layer's behavior. ## Testing Added `TestToResponseMessages_ProviderExecutedToolResultInAssistantMessage` which verifies that mixed provider+local tool results are split correctly between the assistant and tool messages. |
||
|
|
f35b99a4fa |
fix(chatd): preserve context.Canceled in persistStep during shutdown (#22890)
## Problem
When a chat worker shuts down gracefully (e.g. Kubernetes pod SIGTERM)
while a tool is executing (like `wait_agent` polling for a subagent),
the chat gets stuck in `waiting` status forever — no other worker will
pick it up.
### Root Cause
`persistStep` in `chatd.go` unconditionally returned
`chatloop.ErrInterrupted` for **any** canceled context:
```go
if persistCtx.Err() != nil {
return chatloop.ErrInterrupted // BUG: doesn't check WHY the context was canceled
}
```
During shutdown, the context cause is `context.Canceled` (not
`ErrInterrupted`). But because `persistStep` returned `ErrInterrupted`,
the error handling in `processChat` hit the `ErrInterrupted` check first
(line 2011) and set status to `waiting` — the `isShutdownCancellation`
check (line 2017) was never reached:
```go
// Checked FIRST — matches because persistStep returned ErrInterrupted
if errors.Is(err, chatloop.ErrInterrupted) {
status = database.ChatStatusWaiting // Stuck forever
return
}
// NEVER REACHED during shutdown
if isShutdownCancellation(ctx, chatCtx, err) {
status = database.ChatStatusPending // Would have been correct
return
}
```
### Trigger scenario (from production logs)
1. Chat spawns a subagent via `spawn_agent`, then calls `wait_agent`
2. `wait_agent` blocks in `awaitSubagentCompletion` polling loop
3. Worker pod receives SIGTERM → `Close()` cancels server context
4. Context cancellation propagates to `awaitSubagentCompletion` →
returns `context.Canceled`
5. Tool execution completes, `persistStep` is called with canceled
context
6. `persistStep` returns `ErrInterrupted` (wrong!) → status set to
`waiting` (stuck!)
## Fix
Check `context.Cause()` before deciding which error to return:
```go
if persistCtx.Err() != nil {
if errors.Is(context.Cause(persistCtx), chatloop.ErrInterrupted) {
return chatloop.ErrInterrupted // Intentional interruption
}
return persistCtx.Err() // Shutdown → context.Canceled
}
```
This preserves `context.Canceled` for shutdown, allowing
`isShutdownCancellation` to match and set status to `pending` so another
worker retries the chat.
## Test
Added `TestRun_ShutdownDuringToolExecutionReturnsContextCanceled` which:
1. Streams a tool call to a blocking tool (simulating `wait_agent`)
2. Cancels the server context (simulating shutdown) while the tool
blocks
3. Verifies `Run` returns `context.Canceled`, NOT `ErrInterrupted`
|
||
|
|
ddfe630757 |
refactor(chatd): replace fantasy.Agent with custom agent loop (#22507)
## Summary Replaces fantasy's `Agent` abstraction with a direct step loop calling `LanguageModel.Stream()`. Fantasy is retained as the provider abstraction layer (streaming parsers, types, tool schema) but we no longer use `fantasy.Agent`, `AgentStreamCall`, `AgentResult`, or `StepResult`. ## Problems solved | Problem | Before | After | |---|---|---| | **Sentinel prompt hack** | fantasy.Agent requires non-empty Prompt → UUID sentinel generated and stripped in PrepareStep | Messages passed directly to `model.Stream()` | | **Discarded PersistStep errors** | `_ = opts.OnStepFinish(result)` silently swallows errors | Errors propagate directly from `PersistStep()` | | **Shadow draft state** | ~160 LOC tracking content in parallel because fantasy doesn't expose in-progress content on interruption | `stepResult` owns content directly; `flushActiveState()` is trivial | | **Nested retry layers** | fantasy's 2-attempt retry nested inside chatretry's indefinite retry | Single `chatretry.Retry` layer | | **Callback-mediated compaction** | Mutex + boolean flag + coordination between OnStepFinish/PrepareStep callbacks | Inline `if` statement between steps | | **Duplicate compaction paths** | `compactStep()` + `maybeCompact()` sharing ~80% logic | Single `tryCompact()` function | ## Changes ### `coderd/chatd/chatloop/chatloop.go` — Rewritten - **Removed**: `fantasy.NewAgent()`, `AgentStreamCall`, sentinel prompt, shadow draft state (~160 LOC of closures), `compactedMu`/`compacted` flag, `PrepareStepResult` - **Added**: `stepResult` struct, `processStepStream()` (stream consumer), `executeTools()` (sequential tool execution), `flushActiveState()` (interrupt handling), `buildToolDefinitions()`, `toResponseMessages()` - **Changed**: `Run()` return type from `(*fantasy.AgentResult, error)` to `error` (callers already discarded the result) - **Preserved**: Anthropic prompt caching, reasoning title extraction, `extractContextLimit()`, `ErrInterrupted` semantics ### `coderd/chatd/chatloop/compaction.go` — Simplified - Merged `compactStep()` + `maybeCompact()` → single `tryCompact()` - Removed `[]StepResult` parameter from `generateCompactionSummary()` (caller provides complete message list) - Kept helper functions: `normalizedCompactionConfig`, `contextTokensFromUsage`, `resolveContextLimit`, `shouldCompact` ### `coderd/chatd/chatd.go` — Caller updates - Removed `AgentStreamCall` construction - Changed `_, err = chatloop.Run(...)` to `err = chatloop.Run(...)` - Model parameters moved from `AgentStreamCall` fields to `RunOptions` fields ### Tests — 4 new tests - `MidLoopCompactionReloadsMessages` — compaction fires mid-loop, messages reloaded - `PostRunCompactionSkippedAfterMidLoop` — no double compaction - `MultiStepToolExecution` — tools execute between steps, results feed next step - `PersistStepErrorPropagates` — persistence errors propagate (was silently discarded) |
||
|
|
edee917d88 |
feat: add experimental agents support (#22290)
feat: add AI chat system with agent tools and chat UI Introduce the chatd subsystem and Agents UI for AI-powered chat within Coder workspaces. - Add chatd package with chat loop, message compaction, prompt management, and LLM provider integration (OpenAI, Anthropic) - Add agent tools: create workspace, list/read templates, read/write/ edit files, execute commands - Add chat API endpoints with streaming, message editing, and durable reconnection - Add database schema and migrations for chats, chat messages, chat providers, and chat model configs - Add RBAC policies and dbauthz enforcement for chat resources - Add Agents UI pages with conversation timeline, queued messages list, diff viewer, and model configuration panel - Add comprehensive test coverage including coderd integration tests, chatd unit tests, and Storybook stories - Gate feature behind experiments flag --------- Co-authored-by: Cian Johnston <cian@coder.com> Co-authored-by: Danielle Maywood <danielle@themaywoods.com> Co-authored-by: Jeremy Ruppel <jeremy@coder.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> |