mirror of
https://github.com/coder/coder.git
synced 2026-06-04 05:28:20 +00:00
becc858fa8
Closes CODAGT-541. ## Problem An Agents chat stream could die with a terminal `context cancelled` error and surface to the user as a permanent chat failure, even when no context in our process had actually been canceled. The cancellation was a provider-returned error value (HTTP/2 RST_STREAM mid-body surfacing as `context.Canceled` from Go's net/http2), not a real caller cancel. The chain that produced the bug: - fantasy passed the provider's `context.Canceled` through unchanged. - `chaterror.Classify` short-circuited any `errors.Is(err, context.Canceled)` (or `"context canceled"` text) as terminal generic, before checking HTTP status codes or other retry signals. - `chatretry.Retry` did not retry. - The frontend rendered `type:"error"` and the chat was dead. The same short-circuit also masked retryable 5xx responses whose underlying transport error happened to wrap `context.Canceled`. ## Approach `context.Canceled` has no inherent intent. The same error value can mean a user pressing Stop, a server shutdown, the silence guard firing, or a provider-side stream reset. The only layer that can disambiguate is the one holding both the returned error and the caller context. That is `chatretry`. This PR centralizes the policy there and keeps `chaterror` context-free. ## Changes `coderd/x/chatd/chaterror/classify.go` - Add `ErrProviderTransportReset` sentinel to explicitly mark provider-side stream cancellations. - Remove the broad `context.Canceled` / `"context canceled"` short-circuit so status codes and other retry signals can win. - Classify `ErrProviderTransportReset` (with no status code) as a retryable timeout. - Keep a fallback that classifies bare `context.Canceled` as terminal-generic when no other signal is present, so legitimate caller cancels still terminate cleanly. `coderd/x/chatd/chatretry/chatretry.go` - Add `contextError(ctx)` that returns `context.Cause(ctx)` when set, falling back to `ctx.Err()`, so caller-owned cancel causes (`ErrInterrupted`, `errStreamSilenceTimeout`, server shutdown sentinels) propagate cleanly out of the retry loop. - Add `classifyProviderAttemptError(err)` that wraps a bare `context.Canceled` in `ErrProviderTransportReset` and reclassifies. Errors that already classify as retryable or carry a status code are left alone. - Restructure `Retry` so the policy is explicit and readable: check caller cancellation before attempting, run the attempt, check caller cancellation again before normalizing the provider error, then classify and retry. ## End-to-end behavior - Provider returns `context.Canceled` while caller context is healthy: classified as a retryable timeout, retried, the user sees a brief `type:"retry"` event and the chat continues. - User presses Stop: `contextError(ctx)` returns `ErrInterrupted`. Retry stops. `chatloop` flushes partial content and persists. - Stream-silence guard fires: `attemptCtx` is canceled with `errStreamSilenceTimeout`, `guardedStream` produces a classified retryable error, retry proceeds normally on the still-alive parent. - Server shutdown: parent context's cause propagates out, retry stops.