5 Commits

Author SHA1 Message Date
Hugo Dutka 658a04d28f pr 3 2026-06-04 18:51:22 +00:00
Ethan 3ab1323bc9 fix!: rename chat stream silence timeout error (#25973)
Renames the Agents chat stream-silence error from `startup_timeout` to
`stream_silence_timeout` now that the timeout applies to any gap between
provider stream parts, not just first-token startup.

Updates the SDK enum, generated API docs/types, chat error copy, and
Agents UI stories/status labels so the user-facing wording describes a
stalled provider response instead of startup delay.

> **Breaking change:** This is a very minor breaking change for the
Coder Agents API: the public chat error kind enum no longer includes
`startup_timeout`, so clients matching that specific value should handle
`stream_silence_timeout` instead.
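
A minimal sketch of how a client might absorb the rename, assuming it compares the error kind as a plain string. `ChatErrorKind`, the constant names, and `isStreamSilenceTimeout` are hypothetical stand-ins, not the SDK's actual identifiers. Only the two string values come from this change.

```go
package main

import "fmt"

// ChatErrorKind is a hypothetical stand-in for the SDK's chat error kind enum.
type ChatErrorKind string

const (
	// kindStreamSilenceTimeout is the new value introduced by this change.
	kindStreamSilenceTimeout ChatErrorKind = "stream_silence_timeout"
	// kindLegacyStartupTimeout is the removed value. A client that also talks
	// to older servers may still receive it.
	kindLegacyStartupTimeout ChatErrorKind = "startup_timeout"
)

// isStreamSilenceTimeout reports whether kind describes a stalled provider
// stream, accepting both the new and the legacy value.
func isStreamSilenceTimeout(kind ChatErrorKind) bool {
	switch kind {
	case kindStreamSilenceTimeout, kindLegacyStartupTimeout:
		return true
	default:
		return false
	}
}

func main() {
	for _, k := range []ChatErrorKind{"stream_silence_timeout", "startup_timeout", "other"} {
		fmt.Printf("%s -> %v\n", k, isStreamSilenceTimeout(k))
	}
}
```

Clients that only ever talk to servers with this change can drop the legacy case.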
2026-06-04 18:36:02 +10:00
Ethan becc858fa8 fix(coderd/x/chatd): retry provider stream cancellations (#26010)
Closes CODAGT-541.

## Problem

An Agents chat stream could die with a terminal `context canceled`
error and surface to the user as a permanent chat failure, even when no
context in our process had actually been canceled. The cancellation was
a provider-returned error value (HTTP/2 RST_STREAM mid-body surfacing as
`context.Canceled` from Go's net/http2), not a real caller cancel.

The chain that produced the bug:

- fantasy passed the provider's `context.Canceled` through unchanged.
- `chaterror.Classify` short-circuited any `errors.Is(err,
context.Canceled)` (or `"context canceled"` text) as terminal generic,
before checking HTTP status codes or other retry signals.
- `chatretry.Retry` did not retry.
- The frontend rendered `type:"error"` and the chat was dead.

The same short-circuit also masked retryable 5xx responses whose
underlying transport error happened to wrap `context.Canceled`.

## Approach

`context.Canceled` has no inherent intent. The same error value can mean
a user pressing Stop, a server shutdown, the silence guard firing, or a
provider-side stream reset. The only layer that can disambiguate is the
one holding both the returned error and the caller context. That is
`chatretry`.

This PR centralizes the policy there and keeps `chaterror` context-free.

## Changes

`coderd/x/chatd/chaterror/classify.go`

- Add `ErrProviderTransportReset` sentinel to explicitly mark
provider-side stream cancellations.
- Remove the broad `context.Canceled` / `"context canceled"`
short-circuit so status codes and other retry signals can win.
- Classify `ErrProviderTransportReset` (with no status code) as a
retryable timeout.
- Keep a fallback that classifies bare `context.Canceled` as
terminal-generic when no other signal is present, so legitimate caller
cancels still terminate cleanly.

`coderd/x/chatd/chatretry/chatretry.go`

- Add `contextError(ctx)` that returns `context.Cause(ctx)` when set,
falling back to `ctx.Err()`, so caller-owned cancel causes
(`ErrInterrupted`, `errStreamSilenceTimeout`, server shutdown sentinels)
propagate cleanly out of the retry loop.
- Add `classifyProviderAttemptError(err)` that wraps a bare
`context.Canceled` in `ErrProviderTransportReset` and reclassifies.
Errors that already classify as retryable or carry a status code are
left alone.
- Restructure `Retry` so the policy is explicit and readable: check
caller cancellation before attempting, run the attempt, check caller
cancellation again before normalizing the provider error, then classify
and retry.

## End-to-end behavior

- Provider returns `context.Canceled` while caller context is healthy:
classified as a retryable timeout, retried, the user sees a brief
`type:"retry"` event and the chat continues.
- User presses Stop: `contextError(ctx)` returns `ErrInterrupted`. Retry
stops. `chatloop` flushes partial content and persists.
- Stream-silence guard fires: `attemptCtx` is canceled with
`errStreamSilenceTimeout`, `guardedStream` produces a classified
retryable error, retry proceeds normally on the still-alive parent.
- Server shutdown: parent context's cause propagates out, retry stops.
2026-06-04 12:52:37 +10:00
Ethan 7e2f7198dd fix(coderd/x/chatd/chatloop): use stream silence timeout (#25782)
Replaces the 60 second first-token timeout in the chat loop with a 10
minute stream-silence timeout.

Previously, the guard bounded only the gap before the first stream part.
Once any part arrived the attempt could hang indefinitely if the
provider stopped streaming without closing the connection, and even
normal long-running responses could be killed after 60 seconds if the
provider was slow to emit the first token.

The guard now arms when a model attempt opens its stream, resets on
every received stream part, and fires after 10 minutes of complete
silence. The existing retry path still handles the timeout, and the
public `startup_timeout` error kind is preserved to avoid API and
frontend churn.

10 minutes matches the default request timeout used by the Anthropic and
OpenAI Python SDKs.


Closes CODAGT-493
2026-05-28 21:02:40 +10:00
Ethan c650aabbef chore: standardize on *_internal_test.go for white-box tests (#25601)
My agent added `//nolint:testpackage` to a test file on one of my PRs.
Again. This PR cleans it up across the entire repo and updates the
in-repo conventions so future agents stop doing it.

The repo already has a precedent for white-box tests that need to touch
unexported symbols: `*_internal_test.go` (145+ existing files). The
`testpackage` linter's default `skip-regexp` exempts that filename
suffix, so the `//nolint:testpackage` directive is unnecessary in every
case where someone reached for it. This PR renames 51 such files to
`*_internal_test.go` via `git mv` so blame and history follow, and
strips the dead directive from 2 files that were already correctly named
(`coderd/oauth2provider/authorize_internal_test.go`,
`coderd/x/chatd/advisor_internal_test.go`).

`.claude/docs/TESTING.md` now documents the rule explicitly under *Test
Package Naming*, which is imported into the root `AGENTS.md` via
`@.claude/docs/TESTING.md`. The rule: prefer `package foo_test`; if you
need internal access, rename the file to `*_internal_test.go` rather
than adding a nolint directive.
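
To illustrate why the rename makes the directive unnecessary, here is a small sketch that checks filenames against the pattern `(export|internal)_test\.go`. That pattern is assumed to be the `testpackage` linter's documented default `skip-regexp`; verify it against the linter version in use. `exemptFromTestpackage` is a hypothetical helper, not part of the linter.

```go
package main

import (
	"fmt"
	"regexp"
)

// skipPattern is assumed to match testpackage's default skip-regexp.
var skipPattern = regexp.MustCompile(`(export|internal)_test\.go`)

// exemptFromTestpackage reports whether a test file with this name would be
// skipped by the linter under the assumed default pattern.
func exemptFromTestpackage(filename string) bool {
	return skipPattern.MatchString(filename)
}

func main() {
	for _, name := range []string{
		"authorize_internal_test.go", // white-box test, may use the internal package name
		"authorize_test.go",          // expected to declare the external _test package
	} {
		fmt.Printf("%s exempt=%v\n", name, exemptFromTestpackage(name))
	}
}
```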
2026-05-22 20:24:38 +10:00