mirror of
https://github.com/coder/coder.git
synced 2026-06-03 21:18:24 +00:00
f35b99a4fa
## Problem
When a chat worker shuts down gracefully (e.g. Kubernetes pod SIGTERM)
while a tool is executing (like `wait_agent` polling for a subagent),
the chat gets stuck in `waiting` status forever — no other worker will
pick it up.
### Root Cause
`persistStep` in `chatd.go` unconditionally returned
`chatloop.ErrInterrupted` for **any** canceled context:
```go
if persistCtx.Err() != nil {
return chatloop.ErrInterrupted // BUG: doesn't check WHY the context was canceled
}
```
During shutdown, the context cause is `context.Canceled` (not
`ErrInterrupted`). But because `persistStep` returned `ErrInterrupted`,
the error handling in `processChat` hit the `ErrInterrupted` check first
(line 2011) and set status to `waiting` — the `isShutdownCancellation`
check (line 2017) was never reached:
```go
// Checked FIRST — matches because persistStep returned ErrInterrupted
if errors.Is(err, chatloop.ErrInterrupted) {
status = database.ChatStatusWaiting // Stuck forever
return
}
// NEVER REACHED during shutdown
if isShutdownCancellation(ctx, chatCtx, err) {
status = database.ChatStatusPending // Would have been correct
return
}
```
### Trigger scenario (from production logs)
1. Chat spawns a subagent via `spawn_agent`, then calls `wait_agent`
2. `wait_agent` blocks in `awaitSubagentCompletion` polling loop
3. Worker pod receives SIGTERM → `Close()` cancels server context
4. Context cancellation propagates to `awaitSubagentCompletion` →
returns `context.Canceled`
5. Tool execution completes, `persistStep` is called with canceled
context
6. `persistStep` returns `ErrInterrupted` (wrong!) → status set to
`waiting` (stuck!)
## Fix
Check `context.Cause()` before deciding which error to return:
```go
if persistCtx.Err() != nil {
if errors.Is(context.Cause(persistCtx), chatloop.ErrInterrupted) {
return chatloop.ErrInterrupted // Intentional interruption
}
return persistCtx.Err() // Shutdown → context.Canceled
}
```
This preserves `context.Canceled` for shutdown, allowing
`isShutdownCancellation` to match and set status to `pending` so another
worker retries the chat.
## Test
Added `TestRun_ShutdownDuringToolExecutionReturnsContextCanceled` which:
1. Streams a tool call to a blocking tool (simulating `wait_agent`)
2. Cancels the server context (simulating shutdown) while the tool
blocks
3. Verifies `Run` returns `context.Canceled`, NOT `ErrInterrupted`