mirror of
https://github.com/coder/coder.git
synced 2026-06-03 13:08:25 +00:00
58f6b9c4d0
## Summary Wraps external auth token refresh in an exponential-backoff retry so a brief upstream hiccup (5xx, network timeout, rate-limited 429) no longer surfaces as an `InvalidTokenError` and forces users to re-authenticate. GitHub in particular has been flaky enough lately that this is hitting real users. ## Behavior - `(*Config).RefreshToken` now calls a helper that retries the `TokenSource.Token()` exchange with exponential backoff (250ms → 2s), bounded by a 10s total budget. - Errors classified as permanent by `isFailedRefresh` (e.g. `bad_refresh_token`, `invalid_grant`, `unauthorized_client`, ...) skip the retry loop. Retrying a permanent failure wastes the refresh quota and, on providers with single-use refresh tokens, can mask a legitimate concurrent winner with repeated `bad_refresh_token` responses. - Refreshes with an empty refresh token still short-circuit without making an API call. - The existing concurrent-refresh-race detection and optimistic-lock paths are unchanged. ## Tunables Three new `time.Duration` fields on `externalauth.Config` (`RefreshRetryInitialBackoff`, `RefreshRetryMaxBackoff`, `RefreshRetryTimeout`) let callers override the defaults. They default to zero, which falls back to the package defaults, so existing call sites are unaffected. The fields exist primarily so tests can dial the timing way down without touching package globals (and therefore without serializing parallel tests). ## Tests - `TestRefreshToken/RefreshRetries` now disables internal retries via `RefreshRetryTimeout = time.Nanosecond` so its existing "1 IDP call per `RefreshToken` invocation" assertion still holds. Otherwise its assertions are unchanged. - New `TestRefreshToken/RefreshTokenWithBackoff` simulates 3 transient 5xx failures followed by success and verifies the refresh ultimately succeeds with 4 total IDP attempts. - New `TestRefreshToken/RefreshTokenBackoffPermanentError` returns `bad_refresh_token` and verifies the refresh is **not** retried even with a generous 1s budget. <details> <summary>Why the explicit <code>retryCtx.Err()</code> guard?</summary> `retry.Retrier.Wait` `select`s between `time.After(delay)` and `ctx.Done()`. The first call has `delay == 0`, so `time.After(0)` and an already-cancelled context both fire immediately and Go picks the case nondeterministically. Without the guard, a near-zero retry budget would still trigger an unwanted extra refresh attempt roughly half the time, which would have made the `RefreshRetries` test flaky. </details> This PR was opened by a Coder agent on behalf of @kylecarbs.