coder

mirror of https://github.com/coder/coder.git synced 2026-06-02 20:48:20 +00:00

Author	SHA1	Message	Date
github-actions[bot]	663f1ee834	fix: track credential hint across key failover attempts in aibridge (#25735 ) (#25847 ) Cherry-pick of https://github.com/coder/coder/pull/25735 Original PR: #25735 — fix: track credential hint across key failover attempts in aibridge Merge commit: `7b903cad73` Requested by: @ssncferreira Co-authored-by: Susana Ferreira <susana@coder.com>	2026-05-29 12:59:39 -04:00
github-actions[bot]	cccf436db2	feat: serve 503 sentinel for disabled providers (#25794 ) (#25837 ) Cherry-pick of https://github.com/coder/coder/pull/25794 Original PR: #25794 — feat: serve 503 sentinel for disabled providers Merge commit: `5b10268827` Requested by: @dannykopping Signed-off-by: Danny Kopping <danny@coder.com> Co-authored-by: Danny Kopping <danny@coder.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-29 12:58:13 -04:00
Susana Ferreira	846aac2f74	refactor(aibridge): remove InjectAuthHeader in favor of KeyFailoverConfig (#25618 ) ## Description `Provider.InjectAuthHeader` is no longer needed. With the addition of `KeyFailoverConfig` in #24920, authentication is now applied per-attempt by `KeyFailoverTransport` on passthrough routes. This PR removes the dead method from the `Provider` interface, all implementations (`Anthropic`, `OpenAI`, `Copilot`), and the test mock. The orphaned `InjectAuthHeader` unit tests are replaced with `Test{Anthropic,OpenAI,Copilot}_KeyFailoverConfig`. `TestPassthrough_KeyFailover` is also extended to cover Copilot in the BYOK scenario. Related to: https://linear.app/codercom/issue/AIGOV-334/aibridge-follow-ups-from-key-failover-prs > [!NOTE] > Initially generated by Claude Opus 4.7, modified and reviewed by @ssncferreira	2026-05-25 19:10:38 +01:00
Susana Ferreira	22109a54ad	refactor(aibridge): clean up keypool and provider error handling (#25609 ) ## Description Cleans up how key pool errors are represented and how they get turned into HTTP responses. Consolidates two error types into a single type with a kind tag, and gives the response helpers in both providers consistent names. ## Changes - Replaced the keypool sentinel and transient error struct with one error type that carries a kind and a retry-after duration. - Updated `KeyFailoverConfig.BuildKeyPoolResponse` to take the typed key pool error, so each provider can shape the exhaustion response in its own format. - Removed the per-provider `MarkKey` callback from `KeyFailoverConfig` since providers can rely on the shared `MarkKeyOnStatus` helper. - Renamed the response-error helpers so OpenAI and Anthropic use the same naming. Related to: https://linear.app/codercom/issue/AIGOV-334/aibridge-follow-ups-from-key-failover-prs > [!NOTE] > Initially generated by Claude Opus 4.7, modified and reviewed by @ssncferreira	2026-05-25 18:58:29 +01:00
Susana Ferreira	5d178ada9f	docs(aibridge): document known IsStreaming race condition (#25654 ) Documents the known race in `EventStream.IsStreaming()` and the resulting flake in `TestStreamingInterception_AgenticLoopFailover/agentic_all_keys_fail `, accepted rather than fixed since the inner agentic loop is on track to be removed as part of the reverse proxy migration in coder/aibridge#223. Full reasoning in coder/internal#1524.	2026-05-25 17:57:02 +01:00
Paweł Banaszewski	1a8a153c56	chore: fix flake in TestResponsesInjectedTool (#25630 ) Fixes flake in TestResponsesInjectedTool. See https://github.com/coder/coder/pull/25630/changes/d9bfeb20092129127ad5e7958c5b8dbf46740527 for reproduction. Due to AsyncRecorded token usages may be recorded in different order then expected. Fixes: https://github.com/coder/internal/issues/1544	2026-05-25 16:41:55 +02:00
Ethan	c650aabbef	chore: standardize on _internal_test.go for white-box tests (#25601 ) My agent added `//nolint:testpackage` to a test file on one of my PRs. Again. This PR cleans it up across the entire repo and updates the in-repo conventions so future agents stop doing it. The repo already has a precedent for white-box tests that need to touch unexported symbols: `_internal_test.go` (145+ existing files). The `testpackage` linter's default `skip-regexp` exempts that filename suffix, so the `//nolint:testpackage` directive is unnecessary in every case where someone reached for it. This PR renames 51 such files to `_internal_test.go` via `git mv` so blame and history follow, and strips the dead directive from 2 files that were already correctly named (`coderd/oauth2provider/authorize_internal_test.go`, `coderd/x/chatd/advisor_internal_test.go`). `.claude/docs/TESTING.md` now documents the rule explicitly under Test Package Naming, which is imported into the root `AGENTS.md` via `@.claude/docs/TESTING.md`. The rule: prefer `package foo_test`; if you need internal access, rename the file to `_internal_test.go` rather than adding a nolint directive.	2026-05-22 20:24:38 +10:00
Steven Masley	51b531f5b3	chore: 'go generate' mockgen to use `go tool` wrapper (#25490 ) Calling `mockgen` relies on the executable in the `$PATH`. Using `go tool` uses the one defined in `go.mod`	2026-05-19 14:53:13 +00:00
Paweł Banaszewski	c0b4180206	fix: fix race in setupInjectedToolTest (#25455 ) Fixes race condition in `setupInjectedToolTest`. To reproduce add short sleep to AsyncRecorder.RecordToolUsage method (inside newly spawned go-routine): https://github.com/coder/coder/blob/46821525f7d2f7466735463898fb6e054c169f85/aibridge/recorder/recorder.go#L254-L258 Upstream request counter can reach 2 while no recording has been done since asyncRecorder does it asynchronously: https://github.com/coder/coder/blob/46821525f7d2f7466735463898fb6e054c169f85/aibridge/internal/integrationtest/setupbridge.go#L242-L244 Added consuming request to `setupInjectedToolTest` so `newInterceptionProcessor` handler finishes before returning from `setupInjectedToolTest` which guarantees that all recordings are done: https://github.com/coder/coder/blob/46821525f7d2f7466735463898fb6e054c169f85/aibridge/bridge.go#L282 Fixes: https://github.com/coder/internal/issues/1526	2026-05-19 09:36:38 +02:00
Marcin Tojek	38772bdb7c	refactor: remove cache tokens from ExtraTokenTypes (#25118 ) Fixes: https://github.com/coder/aibridge/issues/243 > Generated with [Coder Agents](https://coder.com/agents)	2026-05-18 11:30:19 +02:00
Danny Kopping	c6ab379c32	fix(aibridge/intercept/messages): convert enabled thinking to adaptive for Bedrock Opus 4.7+ (#25335 ) Disclaimer: implemented by a Coder Agent using Claude Opus 4.6/4.7 Fixes [coder/aibridge#280](https://github.com/coder/aibridge/issues/280). Claude Opus 4.7 (and future adaptive-only Bedrock models) reject the legacy `thinking.type: "enabled"` + `budget_tokens` shape with a 400. Claude Code falls back to that shape when it cannot read the upstream model's capability metadata, which is exactly the case when AI Bridge sits between the client and Bedrock. Pinning back to Opus 4.6 is the only operator workaround today. This is the counterpart to the `adaptive -> enabled` conversion added in [coder/aibridge#225](https://github.com/coder/aibridge/pull/225) for older Bedrock models. ## Behavior - New `bedrockModelRequiresAdaptiveThinking()` helper matches Opus 4.7 (covers `us.anthropic.claude-opus-4-7`, ARN-style application inference profile names that include the model ID, etc.). - New `RequestPayload.convertEnabledThinkingForBedrock()` rewrites `thinking: {type: enabled, budget_tokens: N}` to `thinking: {type: adaptive}`. The budget hint is dropped; an explicit `output_config.effort` from the caller is preserved naturally because we never touch that field. We deliberately do not derive an effort label from the budget (see decision log). - `removeUnsupportedBedrockFields` learns a variadic `exemptFields` parameter. Adaptive-only models support `output_config` natively (no beta flag required), so `augmentRequestForBedrock` exempts that field for those models. - Bedrock Opus 4.7 accepts `output_config.effort` but rejects `output_config.format` (structured outputs) with the same "Extra inputs are not permitted" 400. The generic strip pass operates at top-level granularity only, so a small targeted pass drops `output_config.format` after the top-level strip for adaptive-only models. The whole Bedrock thinking-type shim block carries a header comment flagging it as temporary; a planned native Bedrock provider removes the impedance mismatch and lets us delete it. ## Out of scope The issue calls out a possible follow-up around `Anthropic-Beta: interleaved-thinking-2025-05-14` for adaptive-only models; best evidence is that Opus 4.7 still accepts those flags, so this PR is a no-op there. <details> <summary>Decision log</summary> - `bedrockModelSupportsAdaptiveThinking` now also returns `true` for adaptive-only models. That keeps the existing `convertAdaptiveThinkingForBedrock` branch from running on Opus 4.7 (which would otherwise be incorrect; `adaptive` is the supported native type there), and the new `convertEnabledThinkingForBedrock` runs only for adaptive-only models via the explicit `bedrockModelRequiresAdaptiveThinking` switch case. The two model sets are disjoint by construction. - The reverse conversion does not derive `output_config.effort` from `budget_tokens / max_tokens`. The two thinking shapes encode different intents (`enabled+budget` is "give me exactly N tokens," `adaptive[+effort]` is "model, pick a budget, optionally biased") and there is no canonical mapping between them. An earlier draft of this PR derived effort via midpoints of an invented anchor table; it was symmetric-looking but lossy and required a lot of scaffolding (sorted anchors, init-time invariant guard, round-trip tests) to keep two halves consistent. The reverse direction now just rewrites the shape, which is honest about the information loss and matches platform-defined adaptive behavior when no effort hint is present. - `output_config.format` is stripped only for adaptive-only models. Other Bedrock models either don't get `output_config` through at all (top-level strip handles them) or accept it via a beta flag that may imply broader feature support. Easy to widen if the same 400 shows up elsewhere. - I chose `variadic exemptFields ...string` over passing the model down to `removeUnsupportedBedrockFields`, to keep that function focused on stripping and to localise the model-aware policy in `augmentRequestForBedrock`. </details>	2026-05-15 10:11:41 +02:00
Marcin Tojek	febabfb8b2	feat: add request/response dump support to aibridgeproxyd (#24837 ) Closes https://github.com/coder/coder/issues/24335	2026-05-11 10:59:26 +02:00
Susana Ferreira	0766cc3097	feat: add automatic key failover for AI Bridge passthrough (#24920 ) ## Description Adds automatic key failover for passthrough routes for the Anthropic and OpenAI providers. A new `keyFailoverTransport` wraps the reverse-proxy transport: centralized requests walk the configured key pool and retry with the next key on key-specific failures (401/403/429), reusing the same key-marking semantics as the bridged routes. BYOK passthrough requests run as a single attempt with no failover. ## Changes - New `keypool.KeyFailoverConfig` carrying the `Pool` to walk and the provider-specific closures (`IsBYOK`, `InjectAuthKey`, `MarkKey`, `BuildExhaustedResponse`). - New `keypool.NewKeyFailoverTransport`: wraps an inner `http.RoundTripper`. Returns `inner` unchanged when `Pool` is nil, otherwise produces a transport that buffers the request body once, walks the pool per request, and replays each attempt with the next key. - New `Provider.KeyFailoverConfig(logger)` interface method. Anthropic injects `X-Api-Key`; OpenAI injects `Authorization: Bearer ...`; Copilot returns an empty config. - `passthrough.go` wires `NewKeyFailoverTransport` around the existing apidump middleware, so every retry attempt is recorded. ## Related Issues Related to: https://github.com/coder/internal/issues/1446 Related to: https://linear.app/codercom/issue/AIGOV-197/aibridge-automatic-key-failover-for-bridged-and-passthrough-routes ## Follow-up PRs - Remove dead `Provider.InjectAuthHeader` method now that all auth is applied per-attempt by `KeyFailoverTransport`. - Bedrock multi-key support. - Refactor provider vs interceptor config separation. - Record the actually-used key in the interception credential hint after failover. > [!NOTE] > Initially generated by Claude Opus 4.7, modified and reviewed by @ssncferreira	2026-05-07 15:46:36 +01:00
Susana Ferreira	b6dacb4a3c	feat: add automatic key failover for AI Bridge OpenAI (#24847 ) ## Description Adds automatic key failover for centralized OpenAI provider, covering both chat completions and responses APIs. Same shape as the Anthropic PR: each upstream call walks the configured key pool, keys are marked temporary on 429 (with cooldown from `Retry-After`) and permanent on 401/403. Each agentic-loop iteration gets its own fresh walker so a tool-call continuation can fail over independently of the initial request. BYOK is unchanged: BYOK requests run as a single attempt with no failover. ## Changes - `config.OpenAI` carries a `KeyPool`. `Key` remains for BYOK Authorization Bearer set per interception. - Chat completions blocking interceptor: walks the pool via `newChatCompletionWithKeyFailover`, marks keys on key-specific failures, returns on first success or non-failover error. - Chat completions streaming interceptor: per-iteration walker. Pre-stream failures fail over to the next key; mid-stream errors are relayed as SSE events. - Responses blocking interceptor: extracts `newResponseWithKeyFailover` parallel to chatcompletions. - Responses streaming interceptor: per-iteration walker, retains the existing buffer-then-forward design. ## Related Issues Related to: https://github.com/coder/internal/issues/1446 Related to: https://linear.app/codercom/issue/AIGOV-197/aibridge-automatic-key-failover-for-bridged-and-passthrough-routes ## Follow-up PRs - Bedrock multi-key support. - Refactor provider vs interceptor config separation. - Record the actually-used key in the interception credential hint after failover. > [!NOTE] > Initially generated by Claude Opus 4.7, modified and reviewed by @ssncferreira	2026-05-07 15:35:46 +01:00
Susana Ferreira	f1155ac4d7	feat: add automatic key failover for AI Bridge Anthropic (#24836 ) ## Description Adds automatic key failover for centralized Anthropic provider. When a key pool is configured, each upstream call walks the pool and tries keys in order until one succeeds or the pool is exhausted. Keys are marked temporary on 429 (with cooldown from `Retry-After`) and permanent on 401/403. Errors that aren't key-specific don't trigger failover. Each agentic-loop iteration gets its own fresh walker, so a tool-call continuation can fail over independently of the initial request. BYOK is unchanged: BYOK requests run as a single attempt with no failover. ## Changes - `config.Anthropic` carries a `KeyPool`. `Key` remains for BYOK X-Api-Key set per interception. - Blocking interceptor: walks the pool, marks keys on key-specific failures, returns on first success or non-failover error. - Streaming interceptor: per-iteration walker. Pre-stream failures fail over to the next key; mid-stream errors are relayed as SSE events. - New `keypool` error types: `TransientExhaustionError` (carries soonest cooldown) and `ErrPermanentExhaustion`. Replace the prior `ErrAllKeysExhausted`. - Error responses now consistently include the outer `"type": "error"` field. ## Related Issues Related to: https://github.com/coder/internal/issues/1446 Related to: https://linear.app/codercom/issue/AIGOV-197/aibridge-automatic-key-failover-for-bridged-and-passthrough-routes ## Follow-up PRs - Bedrock multi-key support. - Refactor provider vs interceptor config separation. - Record the actually-used key in the interception credential hint after failover. > [!NOTE] > Initially generated by Claude Opus 4.7, modified and reviewed by @ssncferreira	2026-05-07 14:57:44 +01:00
Susana Ferreira	dbb50ebaaf	feat: remove 429 from aibridge circuit breaker failure conditions (#24701 ) ## Description Removes 429 (Too Many Requests) from the circuit breaker failure conditions. Rate limiting is now handled by automatic key failover instead of tripping the circuit breaker. ## Changes `DefaultIsFailure` no longer treats 429 as a circuit breaker failure. The circuit breaker now only trips on server overload responses (503, 529). Tests and integration tests updated to use 503 instead of 429 for tripping circuits. Description strings in deployment config updated to reflect the change. Closes https://github.com/coder/internal/issues/1445 > [!NOTE] > Initially generated by Coder Agents, modified and reviewed by @ssncferreira	2026-04-30 09:31:32 +01:00
Susana Ferreira	123c8dfc02	feat(aibridge): add key pool with state tracking and walker (#24681 ) ## Description Adds the `aibridge/keypool` package, a thread-safe key pool with per-key state tracking and cooldown expiry. This PR introduces the package only; wiring it into the aibridge providers and coder configuration will happen in upstream PRs. ## Changes Each key is in one of three states: Valid (available), Temporary (rate-limited with cooldown expiry), or Permanent (revoked/unauthorized, terminal until restart). The state is derived from the key fields rather than stored explicitly: once a cooldown expires, the key is valid again without any external action. A `Walker` provides per-request key traversal using a primary-with-fallback strategy: each request walks the pool from index 0, skipping unavailable keys, so the first key is always preferred when healthy. Walkers are independent, so concurrent requests traverse the pool without interfering with each other. `MarkTemporary` preserves the longer cooldown when concurrent requests both mark the same key. Relates to https://github.com/coder/internal/issues/1445 > [!NOTE] > Initially generated by Coder Agents, modified and reviewed by @ssncferreira	2026-04-30 09:07:57 +01:00
Paweł Banaszewski	a24dc19d49	chore: clean up env var usage in aibridge (#24783 ) > AI tools where used when creating this PR This PR removes environment variable parsing from `/aibridge` directory. Added env variables/flags for dump dir as coder options. Only added to new indexed provider options (`CODER_AIBRIDGE_PROVIDER_<N>_`) not to deprecated legacy env variables (`CODER_AIBRIDGE_ANTHROPIC_` and `CODER_AIBRIDGE_OPENAI_KEY_*`). Reverted adding `MaxRetries` option as it will be removed soon due to key failover work: https://github.com/coder/coder/pull/24783#discussion_r3155544808	2026-04-29 18:28:37 +02:00
Paweł Banaszewski	6ea9c61da0	chore: update AI Gateway docs (#24805 ) > AI tools where used when creating this PR This PR: * removes references to aibridge repository from coder docs * updates aibdrige/README.md * makes it clear aibridge (keeping old name) is a handler not a separate process * updates outdated sections about: metrics, recorded interface and supported paths. --------- Co-authored-by: Susana Ferreira <susana@coder.com>	2026-04-29 18:28:09 +02:00
Paweł Banaszewski	5907730dcf	chore: update aibridge/AGENTS.md to reflect it is now part of coder/coder repo (#24822 ) > AI tools where used when creating this PR Updates `AGENTS.md` to reflect that `/aibridge` is now a directory in coder/coder repo not a separate repo.	2026-04-29 15:51:07 +02:00
Paweł Banaszewski	b8906c84a1	fix: fixes aibridge integration tests failing on windows (#24665 ) Fixes aibridge responses and apidump integration tests in windows CI job. example failing job before the fix: https://github.com/coder/coder/actions/runs/24873191866/job/72823942378	2026-04-24 18:28:22 +02:00
Danny Kopping	8aa3294f06	fix(aibridge): track Charm Crush client and session ID (#24630 ) Disclaimer: implemented by a Coder Agent using Claude Opus 4.6 Porting https://github.com/coder/aibridge/pull/277 to coder/coder after the [aibridge code move](https://github.com/coder/coder/pull/24190). ## Summary Fixes client detection and session ID tracking for the [Charm Crush](https://github.com/charmbracelet/crush) AI coding client. ## Changes ### Bug fix: User-Agent matching The actual Crush user-agent is `Charm-Crush/{version} (https://charm.land/crush)` (hyphenated), but `GuessClient` only checked for `charm crush/` (space-separated). After lowercasing, `Charm-Crush/0.2.0` becomes `charm-crush/0.2.0`, which did not match the `charm crush/` prefix. Now matches both formats for backwards compatibility. ### Session ID tracking Adds an explicit `ClientCrush` case to `GuessSessionID`. Crush does not currently send a session ID header to upstream AI providers, so this returns `nil` (consistent with how `ClientZed`, `ClientRoo`, and `ClientCursor` are handled). ### Tests - Added `charm_crush_hyphen` test case for `GuessClient` using the real user-agent format. - Added `crush_returns_empty` test case for `GuessSessionID`.	2026-04-22 19:02:31 +02:00
Paweł Banaszewski	e00e85765b	chore: move aibridge library code into coder repo (#24190 ) This PR merges code from `coder/aibridge` repository into `coder/coder`. It was split into 4 PRs for easier review but stacked PRs will need to be merged into this PR so all checks pass. * https://github.com/coder/coder/pull/24190 -> raw code copy (this PR, before merging PRs on top of it, it was just 1 commit: https://github.com/coder/coder/commit/70d33f33200c7e77df910957595715f81f9bec24) * https://github.com/coder/coder/pull/24570 -> update imports in `coder/coder` to use copied code * https://github.com/coder/coder/pull/24586 -> linter fixes and CI integration (also added README.md) * https://github.com/coder/coder/pull/24571 -> added exclude to scripts/check_emdash.sh check Original PR message (before PR squash): Moves coder/aibridge code into coder/coder repository. Omitted files: - `go.mod`, `go.sum`, `.gitignore`, `.github/workflows/ci.yml,` `Makefile`, `LICENSE`, `README.md` (modified README.md is added later) - `.github`, `example`, `buildinfo,` `scripts` directories Simple verification script (will list omitted files) ``` tmp=$(mktemp -d) echo "$tmp" git clone --depth=1 https://github.com/coder/aibridge "$tmp/aibridge" git clone --depth=1 --branch pb/aibridge-code-move https://github.com/coder/coder "$tmp/coder" diff -rq --exclude=.git "$tmp/aibridge" "$tmp/coder/aibridge" # rm -rf "$tmp" ```	2026-04-22 17:01:01 +02:00

23 Commits