coder

mirror of https://github.com/coder/coder.git synced 2026-06-04 13:38:21 +00:00

Author	SHA1	Message	Date
Kyle Carberry	742694eb20	fix: filter empty text/reasoning parts before sending to LLM (#23284)	2026-03-19 12:10:54 -04:00

## Problem

Anthropic rejects requests containing empty text content blocks with:

```
messages: text content blocks must be non-empty
```

Empty text parts (`""` or whitespace-only like `" "`) get persisted in the database when a stream sends `TextStart`/`TextEnd` with no `TextDelta` in between. On the next turn, these parts are loaded from the DB and sent to Anthropic, which rejects them.

## Fix

Filter empty/whitespace-only text and reasoning parts at the two LLM dispatch boundaries, without modifying persistence (the raw record is preserved):

- `partsToMessageParts()` in `chatprompt.go` — filters when converting persisted DB messages to fantasy message parts for LLM calls. This is the last gateway before the Anthropic provider creates `TextBlockParam` objects.
- `toResponseMessages()` in `chatloop.go` — filters when building in-flight conversation messages between steps within a single turn.

Note: `flushActiveState()` (the interruption path) already had this guard — the normal `TextEnd` streaming path did not, but since we're not changing persistence, the fix is applied at the dispatch layer.

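The filtering rule above can be sketched in Go as follows. This is a minimal illustration, not the code in `chatprompt.go` or `chatloop.go`: `Part`, `PartType`, and `filterEmptyParts` are hypothetical stand-ins for the real fantasy and persisted part types. It drops text and reasoning parts that are empty or whitespace-only and builds a new slice, so the stored record is left untouched.

```go
package main

import (
	"fmt"
	"strings"
)

// PartType and Part are hypothetical stand-ins for the real message part
// types; only the fields needed for the filter are modeled.
type PartType string

const (
	PartText      PartType = "text"
	PartReasoning PartType = "reasoning"
	PartToolCall  PartType = "tool-call"
)

type Part struct {
	Type PartType
	Text string
}

// filterEmptyParts returns a new slice without text or reasoning parts whose
// content is empty or whitespace-only. Other part types pass through, and
// the input slice is not modified.
func filterEmptyParts(parts []Part) []Part {
	out := make([]Part, 0, len(parts))
	for _, p := range parts {
		isTextLike := p.Type == PartText || p.Type == PartReasoning
		if isTextLike && strings.TrimSpace(p.Text) == "" {
			continue
		}
		out = append(out, p)
	}
	return out
}

func main() {
	parts := []Part{
		{Type: PartText, Text: ""},
		{Type: PartReasoning, Text: "  "},
		{Type: PartText, Text: "hello"},
		{Type: PartToolCall},
	}
	fmt.Println(len(filterEmptyParts(parts))) // 2
}
```

Filtering at dispatch rather than at write time keeps the database a faithful record of what the stream produced.
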
Kyle Carberry	4dd8531f37	feat: track step runtime_ms on chat messages (#23219)	2026-03-18 10:57:35 -04:00

## Summary

Adds a `runtime_ms` column to `chat_messages` that records the wall-clock duration (in milliseconds) of each LLM step. This covers LLM streaming, tool execution, and retries — the full time the agent is "alive" for a step.

This is the foundation for billing by agent alive time. The column follows the same pattern as `total_cost_micros`: stored per assistant message, aggregatable with `SUM()` over time periods by user.

## Changes

- Migration: adds nullable `runtime_ms bigint` to `chat_messages`.
- chatloop: adds `Runtime time.Duration` field to `PersistedStep`; records `stepStart` at the beginning of each step and measures `time.Since(stepStart)`, so the value covers stream + tool execution + retries.
- chatd: passes `step.Runtime.Milliseconds()` to the assistant message `InsertChatMessage` call; all other message types (system, user, tool) get `NULL`.
- Tests: adds `runtime > 0` assertion in chatloop tests.

## Billing query pattern

Once ready, aggregation mirrors the existing cost queries:

```sql
SELECT COALESCE(SUM(cm.runtime_ms), 0)::bigint AS total_runtime_ms
FROM chat_messages cm
JOIN chats c ON c.id = cm.chat_id
WHERE c.owner_id = @user_id
  AND cm.created_at >= @start_time
  AND cm.created_at < @end_time
  AND cm.runtime_ms IS NOT NULL;
```

Kyle Carberry	42c12176a0	fix(chatd): persist interrupted tool call steps instead of losing them (#23011)	2026-03-12 16:59:16 -04:00

## Problem

When a chat is interrupted while tools are executing, the step content (text, reasoning, tool calls, and partial tool results) was being lost. Two gaps existed:

1. During tool execution: `executeTools` returns with error results for interrupted tools, but the subsequent `PersistStep(ctx, ...)` fails on the canceled context and returns `ErrInterrupted` without persisting anything.
2. PersistStep race: If the context is canceled between the post-tool interrupt check and the `PersistStep` call, the same loss occurs.

This is inconsistent with how we handle stream interruptions (which properly flush and persist partial content via `persistInterruptedStep`) and how [coder/blink](https://github.com/coder/blink) handles interruptions (always inserting the response message regardless of execution phase).

## Fix

Two changes in `chatloop.go`:

- Post-tool-execution interrupt check: After `executeTools` returns, check if the context was interrupted and route through `persistInterruptedStep` (which uses `context.WithoutCancel` internally) to save the accumulated content.
- PersistStep fallback: If `PersistStep` returns `ErrInterrupted`, retry via `persistInterruptedStep` so partial content is not lost.

## Tests

- `TestRun_InterruptedDuringToolExecutionPersistsStep`: Verifies that when a tool is blocked and the chat is interrupted, the step (text + reasoning + tool call + tool error result) is persisted via the interrupt-safe path.
- `TestRun_PersistStepInterruptedFallback`: Verifies that when `PersistStep` itself returns `ErrInterrupted`, the step is retried via the fallback path and content is saved.

Kyle Carberry	072e9a212f	fix(chatloop): keep provider-executed tool results in assistant message (#23012)	2026-03-12 20:22:09 +00:00

## Problem

When a step contains both provider-executed tool calls (e.g. Anthropic web search) and local tool calls in parallel, the next loop iteration fails with the Anthropic API claiming the regular tool call has no result. However, sending a new user message (which reloads messages from the DB) works fine.

## Root cause

`toResponseMessages` was placing all tool results into the tool-role message, regardless of `ProviderExecuted`. When Fantasy's Anthropic provider later converted these messages for the API, it moved the provider tool result from the tool message to the end of the previous assistant message (`prevMsg.Content = append(...)`). This placed `web_search_tool_result` after the regular `tool_use` block:

```
assistant: [server_tool_use(A), tool_use(B), web_search_tool_result(A)]  ← wrong order
user:      [tool_result(B)]
```

The persistence layer in `chatd.go` already handles this correctly — provider-executed tool results stay in the assistant message, producing the expected ordering:

```
assistant: [server_tool_use(A), web_search_tool_result(A), tool_use(B)]  ← correct order
user:      [tool_result(B)]
```

This is why reloading from the DB fixed it.

## Fix

In the `ContentTypeToolResult` case of `toResponseMessages`, route provider-executed results to `assistantParts` instead of `toolParts`, matching the persistence layer's behavior.

## Testing

Added `TestToResponseMessages_ProviderExecutedToolResultInAssistantMessage` which verifies that mixed provider+local tool results are split correctly between the assistant and tool messages.

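The routing rule can be sketched as a simple split over the step's content. This is an illustration only: `Part` and `splitResponseParts` are hypothetical, and the sketch assumes the step content arrives in stream order (provider call A, its result, then local call B and its result). Only locally executed tool results leave the assistant message, which yields the correct ordering shown above.

```go
package main

import "fmt"

// Part is a hypothetical simplification of a step content part.
type Part struct {
	Kind             string // "tool-call" or "tool-result"
	ToolCallID       string
	ProviderExecuted bool
}

// splitResponseParts keeps tool calls and provider-executed tool results in
// the assistant message (preserving their order) and sends only locally
// executed tool results to the tool-role message.
func splitResponseParts(content []Part) (assistantParts, toolParts []Part) {
	for _, p := range content {
		if p.Kind == "tool-result" && !p.ProviderExecuted {
			toolParts = append(toolParts, p)
			continue
		}
		assistantParts = append(assistantParts, p)
	}
	return assistantParts, toolParts
}

func main() {
	content := []Part{
		{Kind: "tool-call", ToolCallID: "A", ProviderExecuted: true},   // server_tool_use(A)
		{Kind: "tool-result", ToolCallID: "A", ProviderExecuted: true}, // web_search_tool_result(A)
		{Kind: "tool-call", ToolCallID: "B"},                           // tool_use(B)
		{Kind: "tool-result", ToolCallID: "B"},                         // tool_result(B)
	}
	assistant, tool := splitResponseParts(content)
	for _, p := range assistant {
		fmt.Println("assistant:", p.Kind, p.ToolCallID)
	}
	for _, p := range tool {
		fmt.Println("tool:", p.Kind, p.ToolCallID)
	}
}
```
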
Kyle Carberry	f35b99a4fa	fix(chatd): preserve context.Canceled in persistStep during shutdown (#22890)	2026-03-10 13:01:45 +00:00

## Problem

When a chat worker shuts down gracefully (e.g. Kubernetes pod SIGTERM) while a tool is executing (like `wait_agent` polling for a subagent), the chat gets stuck in `waiting` status forever — no other worker will pick it up.

### Root Cause

`persistStep` in `chatd.go` unconditionally returned `chatloop.ErrInterrupted` for any canceled context:

```go
if persistCtx.Err() != nil {
	return chatloop.ErrInterrupted // BUG: doesn't check WHY the context was canceled
}
```

During shutdown, the context cause is `context.Canceled` (not `ErrInterrupted`). But because `persistStep` returned `ErrInterrupted`, the error handling in `processChat` hit the `ErrInterrupted` check first (line 2011) and set status to `waiting` — the `isShutdownCancellation` check (line 2017) was never reached:

```go
// Checked FIRST — matches because persistStep returned ErrInterrupted
if errors.Is(err, chatloop.ErrInterrupted) {
	status = database.ChatStatusWaiting // Stuck forever
	return
}

// NEVER REACHED during shutdown
if isShutdownCancellation(ctx, chatCtx, err) {
	status = database.ChatStatusPending // Would have been correct
	return
}
```

### Trigger scenario (from production logs)

1. Chat spawns a subagent via `spawn_agent`, then calls `wait_agent`
2. `wait_agent` blocks in `awaitSubagentCompletion` polling loop
3. Worker pod receives SIGTERM → `Close()` cancels server context
4. Context cancellation propagates to `awaitSubagentCompletion` → returns `context.Canceled`
5. Tool execution completes, `persistStep` is called with canceled context
6. `persistStep` returns `ErrInterrupted` (wrong!) → status set to `waiting` (stuck!)

## Fix

Check `context.Cause()` before deciding which error to return:

```go
if persistCtx.Err() != nil {
	if errors.Is(context.Cause(persistCtx), chatloop.ErrInterrupted) {
		return chatloop.ErrInterrupted // Intentional interruption
	}
	return persistCtx.Err() // Shutdown → context.Canceled
}
```

This preserves `context.Canceled` for shutdown, allowing `isShutdownCancellation` to match and set status to `pending` so another worker retries the chat.

## Test

Added `TestRun_ShutdownDuringToolExecutionReturnsContextCanceled` which:

1. Streams a tool call to a blocking tool (simulating `wait_agent`)
2. Cancels the server context (simulating shutdown) while the tool blocks
3. Verifies `Run` returns `context.Canceled`, NOT `ErrInterrupted`

Kyle Carberry	ddfe630757	refactor(chatd): replace fantasy.Agent with custom agent loop (#22507)	2026-03-02 18:51:57 -05:00

## Summary

Replaces fantasy's `Agent` abstraction with a direct step loop calling `LanguageModel.Stream()`. Fantasy is retained as the provider abstraction layer (streaming parsers, types, tool schema) but we no longer use `fantasy.Agent`, `AgentStreamCall`, `AgentResult`, or `StepResult`.

## Problems solved

| Problem | Before | After |
|---|---|---|
| Sentinel prompt hack | fantasy.Agent requires non-empty Prompt → UUID sentinel generated and stripped in PrepareStep | Messages passed directly to `model.Stream()` |
| Discarded PersistStep errors | `_ = opts.OnStepFinish(result)` silently swallows errors | Errors propagate directly from `PersistStep()` |
| Shadow draft state | ~160 LOC tracking content in parallel because fantasy doesn't expose in-progress content on interruption | `stepResult` owns content directly; `flushActiveState()` is trivial |
| Nested retry layers | fantasy's 2-attempt retry nested inside chatretry's indefinite retry | Single `chatretry.Retry` layer |
| Callback-mediated compaction | Mutex + boolean flag + coordination between OnStepFinish/PrepareStep callbacks | Inline `if` statement between steps |
| Duplicate compaction paths | `compactStep()` + `maybeCompact()` sharing ~80% logic | Single `tryCompact()` function |

## Changes

### `coderd/chatd/chatloop/chatloop.go` — Rewritten

- Removed: `fantasy.NewAgent()`, `AgentStreamCall`, sentinel prompt, shadow draft state (~160 LOC of closures), `compactedMu`/`compacted` flag, `PrepareStepResult`
- Added: `stepResult` struct, `processStepStream()` (stream consumer), `executeTools()` (sequential tool execution), `flushActiveState()` (interrupt handling), `buildToolDefinitions()`, `toResponseMessages()`
- Changed: `Run()` return type from `(fantasy.AgentResult, error)` to `error` (callers already discarded the result)
- Preserved: Anthropic prompt caching, reasoning title extraction, `extractContextLimit()`, `ErrInterrupted` semantics

### `coderd/chatd/chatloop/compaction.go` — Simplified

- Merged `compactStep()` + `maybeCompact()` → single `tryCompact()`
- Removed `[]StepResult` parameter from `generateCompactionSummary()` (caller provides complete message list)
- Kept helper functions: `normalizedCompactionConfig`, `contextTokensFromUsage`, `resolveContextLimit`, `shouldCompact`

### `coderd/chatd/chatd.go` — Caller updates

- Removed `AgentStreamCall` construction
- Changed `_, err = chatloop.Run(...)` to `err = chatloop.Run(...)`
- Model parameters moved from `AgentStreamCall` fields to `RunOptions` fields

### Tests — 4 new tests

- `MidLoopCompactionReloadsMessages` — compaction fires mid-loop, messages reloaded
- `PostRunCompactionSkippedAfterMidLoop` — no double compaction
- `MultiStepToolExecution` — tools execute between steps, results feed next step
- `PersistStepErrorPropagates` — persistence errors propagate (was silently discarded)

Kyle Carberry	edee917d88	feat: add experimental agents support (#22290)	2026-02-27 16:50:56 +00:00

feat: add AI chat system with agent tools and chat UI

Introduce the chatd subsystem and Agents UI for AI-powered chat within Coder workspaces.

- Add chatd package with chat loop, message compaction, prompt management, and LLM provider integration (OpenAI, Anthropic)
- Add agent tools: create workspace, list/read templates, read/write/edit files, execute commands
- Add chat API endpoints with streaming, message editing, and durable reconnection
- Add database schema and migrations for chats, chat messages, chat providers, and chat model configs
- Add RBAC policies and dbauthz enforcement for chat resources
- Add Agents UI pages with conversation timeline, queued messages list, diff viewer, and model configuration panel
- Add comprehensive test coverage including coderd integration tests, chatd unit tests, and Storybook stories
- Gate feature behind experiments flag

---------

Co-authored-by: Cian Johnston <cian@coder.com>
Co-authored-by: Danielle Maywood <danielle@themaywoods.com>
Co-authored-by: Jeremy Ruppel <jeremy@coder.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

7 Commits