mirror of
https://github.com/coder/coder.git
synced 2026-06-03 13:08:25 +00:00
059ed7ab5c
## Problem Flaky test: `TestCloseDuringShutdownContextCanceledShouldRetryOnNewReplica` (coder/internal#1371) The test intermittently fails because the chat ends up in `waiting` status instead of `pending` after server shutdown. ## Root Cause There is a race condition in `processChat` where `runChat` completes successfully just as the server context is being canceled during `Close()`. The sequence: 1. Server calls `Close()`, canceling the server context. 2. The LLM HTTP response has already been fully written by the mock server (the stream closes normally before context cancellation propagates to the HTTP client). 3. `runChat` returns `nil` (success) instead of `context.Canceled`. 4. The existing `isShutdownCancellation` check only runs when `runChat` returns an error, so the shutdown is not detected. 5. `processChat`'s deferred cleanup marks the chat as `waiting` instead of `pending`. 6. The test's assertion that the chat is `pending` never becomes true. This race is timing-dependent — it only triggers when the mock server's HTTP response completes in the narrow window between context cancellation being initiated and it propagating through the HTTP transport layer. ## Fix Add a server context check after `runChat` returns successfully. If the server is shutting down (`ctx.Err() != nil`), override the status to `pending` so another replica can pick up the chat. This is the same pattern already used for the error path (`isShutdownCancellation`), extended to cover the success path.