Files
coder/coderd/chatd
Kyle Carberry f509c841cf fix(chatd): recover stale chats after coderd redeployment (#22405)
## Problem

When coderd instances are redeployed (e.g. rolling deployment on
dogfood), in-flight chats get stuck in `running` status permanently. The
UI shows them as "thinking" with a spinning indicator, but no worker is
actually processing them. They never error or resume.

## Root Cause

Two bugs combine to cause this:

### Bug 1: Shutdown cleanup uses a canceled context

The `processChat` defer block updates the chat status in the DB when
processing completes. But it uses `ctx`, which `Close()` cancels
*before* the defer runs. The DB transaction silently fails with
`context.Canceled`, leaving the chat in `status=running` with a dead
`worker_id`.

```go
// Close() calls p.cancel() which cancels ctx
// Then the defer tries to use the now-canceled ctx:
defer func() {
    err := p.db.InTx(func(tx database.Store) error {
        tx.GetChatByIDForUpdate(ctx, chat.ID) // FAILS
        tx.UpdateChatStatus(ctx, ...)          // FAILS
    }, nil)
}()
```

### Bug 2: Stale recovery runs only once at startup

`recoverStaleChats()` was called only once in `start()`, not
periodically. During a rolling deployment, the new instance starts while
the old one is still alive (fresh heartbeat). By the time the old
instance crashes, no one checks again.

## Fix

1. **Use `context.WithoutCancel(ctx)` in the processChat defer** — the
cleanup transaction now completes even during graceful shutdown.

2. **Run `recoverStaleChats` periodically** — a second ticker in the
`start()` loop checks for stale chats at `inFlightChatStaleAfter / 5`
intervals (default: every 1 minute). This catches orphaned chats even
when the instance that owns them crashes without clean shutdown.

## Tests

- `TestRecoverStaleChatsPeriodically` — Verifies chats orphaned *after*
startup are recovered by the periodic loop (not just the startup check).
- `TestNewReplicaRecoversStaleChatFromDeadReplica` — Verifies a new
replica recovers stale chats on startup.
- `TestWaitingChatsAreNotRecoveredAsStale` — Negative test: `waiting`
chats are not incorrectly modified by recovery.
2026-02-27 15:25:40 -05:00
..