fix(chatd): return chat to pending when server shuts down during successful completion (#22559)

## Problem Flaky test: `TestCloseDuringShutdownContextCanceledShouldRetryOnNewReplica` (coder/internal#1371) The test intermittently fails because the chat ends up in `waiting` status instead of `pending` after server shutdown. ## Root Cause There is a race condition in `processChat` where `runChat` completes successfully just as the server context is being canceled during `Close()`. The sequence: 1. Server calls `Close()`, canceling the server context. 2. The LLM HTTP response has already been fully written by the mock server (the stream closes normally before context cancellation propagates to the HTTP client). 3. `runChat` returns `nil` (success) instead of `context.Canceled`. 4. The existing `isShutdownCancellation` check only runs when `runChat` returns an error, so the shutdown is not detected. 5. `processChat`'s deferred cleanup marks the chat as `waiting` instead of `pending`. 6. The test's assertion that the chat is `pending` never becomes true. This race is timing-dependent — it only triggers when the mock server's HTTP response completes in the narrow window between context cancellation being initiated and it propagating through the HTTP transport layer. ## Fix Add a server context check after `runChat` returns successfully. If the server is shutting down (`ctx.Err() != nil`), override the status to `pending` so another replica can pick up the chat. This is the same pattern already used for the error path (`isShutdownCancellation`), extended to cover the success path.
2026-06-02 20:48:20 +00:00 · 2026-03-03 11:34:08 -05:00
parent 96cfb7d06a
commit 059ed7ab5c
1 changed files with 15 additions and 0 deletions
@@ -1846,6 +1846,21 @@ func (p *Server) processChat(ctx context.Context, chat database.Chat) {
 		status = database.ChatStatusError
 		return
 	}
+
+	// If runChat completed successfully but the server context was
+	// canceled (e.g. during Close()), the chat should be returned
+	// to pending so another replica can pick it up. There is a
+	// race where the LLM stream finishes just as the server is
+	// shutting down — the HTTP response completes before context
+	// cancellation propagates, so runChat returns nil instead of
+	// a context.Canceled error. Without this check the chat would
+	// be marked "waiting" and never retried.
+	if ctx.Err() != nil {
+		logger.Info(ctx, "chat completed during shutdown; returning to pending")
+		status = database.ChatStatusPending
+		lastError = ""
+		return
+	}
 }

 func isShutdownCancellation(