Commit Graph

23 Commits

Author SHA1 Message Date
github-actions[bot] 663f1ee834 fix: track credential hint across key failover attempts in aibridge (#25735) (#25847)
Cherry-pick of https://github.com/coder/coder/pull/25735

Original PR: #25735 — fix: track credential hint across key failover
attempts in aibridge
Merge commit: 7b903cad73
Requested by: @ssncferreira

Co-authored-by: Susana Ferreira <susana@coder.com>
2026-05-29 12:59:39 -04:00
github-actions[bot] cccf436db2 feat: serve 503 sentinel for disabled providers (#25794) (#25837)
Cherry-pick of https://github.com/coder/coder/pull/25794

Original PR: #25794 — feat: serve 503 sentinel for disabled providers
Merge commit: 5b10268827
Requested by: @dannykopping

Signed-off-by: Danny Kopping <danny@coder.com>
Co-authored-by: Danny Kopping <danny@coder.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-29 12:58:13 -04:00
Susana Ferreira 846aac2f74 refactor(aibridge): remove InjectAuthHeader in favor of KeyFailoverConfig (#25618)
## Description

`Provider.InjectAuthHeader` is no longer needed. With the addition of `KeyFailoverConfig` in #24920, authentication is now applied per-attempt by `KeyFailoverTransport` on passthrough routes. This PR removes the dead method from the `Provider` interface, all implementations (`Anthropic`, `OpenAI`, `Copilot`), and the test mock.

The orphaned `InjectAuthHeader` unit tests are replaced with `Test{Anthropic,OpenAI,Copilot}_KeyFailoverConfig`. `TestPassthrough_KeyFailover` is also extended to cover Copilot in the BYOK scenario.

Related to: https://linear.app/codercom/issue/AIGOV-334/aibridge-follow-ups-from-key-failover-prs

> [!NOTE]
> Initially generated by Claude Opus 4.7, modified and reviewed by @ssncferreira
2026-05-25 19:10:38 +01:00
Susana Ferreira 22109a54ad refactor(aibridge): clean up keypool and provider error handling (#25609)
## Description

Cleans up how key pool errors are represented and how they get turned into HTTP responses. Consolidates two error types into a single type with a kind tag, and gives the response helpers in both providers consistent names.

## Changes

- Replaced the keypool sentinel and transient error struct with one error type that carries a kind and a retry-after duration.
- Updated `KeyFailoverConfig.BuildKeyPoolResponse` to take the typed key pool error, so each provider can shape the exhaustion response in its own format.
- Removed the per-provider `MarkKey` callback from `KeyFailoverConfig` since providers can rely on the shared `MarkKeyOnStatus` helper.
- Renamed the response-error helpers so OpenAI and Anthropic use the same naming.

Related to: https://linear.app/codercom/issue/AIGOV-334/aibridge-follow-ups-from-key-failover-prs

> [!NOTE]
> Initially generated by Claude Opus 4.7, modified and reviewed by @ssncferreira
2026-05-25 18:58:29 +01:00
Susana Ferreira 5d178ada9f docs(aibridge): document known IsStreaming race condition (#25654)
Documents the known race in `EventStream.IsStreaming()` and the
resulting flake in
`TestStreamingInterception_AgenticLoopFailover/agentic_all_keys_fail `,
accepted rather than fixed since the inner agentic loop is on track to
be removed as part of the reverse proxy migration in coder/aibridge#223.

Full reasoning in coder/internal#1524.
2026-05-25 17:57:02 +01:00
Paweł Banaszewski 1a8a153c56 chore: fix flake in TestResponsesInjectedTool (#25630)
Fixes flake in TestResponsesInjectedTool.
See
https://github.com/coder/coder/pull/25630/changes/d9bfeb20092129127ad5e7958c5b8dbf46740527
for reproduction.
Due to AsyncRecorded token usages may be recorded in different order
then expected.

Fixes: https://github.com/coder/internal/issues/1544
2026-05-25 16:41:55 +02:00
Ethan c650aabbef chore: standardize on *_internal_test.go for white-box tests (#25601)
My agent added `//nolint:testpackage` to a test file on one of my PRs.
Again. This PR cleans it up across the entire repo and updates the
in-repo conventions so future agents stop doing it.

The repo already has a precedent for white-box tests that need to touch
unexported symbols: `*_internal_test.go` (145+ existing files). The
`testpackage` linter's default `skip-regexp` exempts that filename
suffix, so the `//nolint:testpackage` directive is unnecessary in every
case where someone reached for it. This PR renames 51 such files to
`*_internal_test.go` via `git mv` so blame and history follow, and
strips the dead directive from 2 files that were already correctly named
(`coderd/oauth2provider/authorize_internal_test.go`,
`coderd/x/chatd/advisor_internal_test.go`).

`.claude/docs/TESTING.md` now documents the rule explicitly under *Test
Package Naming*, which is imported into the root `AGENTS.md` via
`@.claude/docs/TESTING.md`. The rule: prefer `package foo_test`; if you
need internal access, rename the file to `*_internal_test.go` rather
than adding a nolint directive.
2026-05-22 20:24:38 +10:00
Steven Masley 51b531f5b3 chore: 'go generate' mockgen to use go tool wrapper (#25490)
Calling `mockgen` relies on the executable in the `$PATH`. Using `go
tool` uses the one defined in `go.mod`
2026-05-19 14:53:13 +00:00
Paweł Banaszewski c0b4180206 fix: fix race in setupInjectedToolTest (#25455)
Fixes race condition in `setupInjectedToolTest`.

To reproduce add short sleep to AsyncRecorder.RecordToolUsage method
(inside newly spawned go-routine):
https://github.com/coder/coder/blob/46821525f7d2f7466735463898fb6e054c169f85/aibridge/recorder/recorder.go#L254-L258

Upstream request counter can reach 2 while no recording has been done
since asyncRecorder does it asynchronously:
https://github.com/coder/coder/blob/46821525f7d2f7466735463898fb6e054c169f85/aibridge/internal/integrationtest/setupbridge.go#L242-L244

Added consuming request to `setupInjectedToolTest` so
`newInterceptionProcessor` handler finishes before returning from
`setupInjectedToolTest` which guarantees that all recordings are done:
https://github.com/coder/coder/blob/46821525f7d2f7466735463898fb6e054c169f85/aibridge/bridge.go#L282

Fixes: https://github.com/coder/internal/issues/1526
2026-05-19 09:36:38 +02:00
Marcin Tojek 38772bdb7c refactor: remove cache tokens from ExtraTokenTypes (#25118)
Fixes: https://github.com/coder/aibridge/issues/243

> Generated with [Coder Agents](https://coder.com/agents)
2026-05-18 11:30:19 +02:00
Danny Kopping c6ab379c32 fix(aibridge/intercept/messages): convert enabled thinking to adaptive for Bedrock Opus 4.7+ (#25335)
*Disclaimer: implemented by a Coder Agent using Claude Opus 4.6/4.7*

Fixes
[coder/aibridge#280](https://github.com/coder/aibridge/issues/280).

Claude Opus 4.7 (and future adaptive-only Bedrock models) reject the
legacy `thinking.type: "enabled"` + `budget_tokens` shape with a 400.
Claude Code falls back to that shape when it cannot read the upstream
model's capability metadata, which is exactly the case when AI Bridge
sits between the client and Bedrock. Pinning back to Opus 4.6 is the
only operator workaround today.

This is the counterpart to the `adaptive -> enabled` conversion added in
[coder/aibridge#225](https://github.com/coder/aibridge/pull/225) for
older Bedrock models.

## Behavior

- New `bedrockModelRequiresAdaptiveThinking()` helper matches Opus 4.7
(covers `us.anthropic.claude-opus-4-7`, ARN-style application inference
profile names that include the model ID, etc.).
- New `RequestPayload.convertEnabledThinkingForBedrock()` rewrites
`thinking: {type: enabled, budget_tokens: N}` to `thinking: {type:
adaptive}`. The budget hint is dropped; an explicit
`output_config.effort` from the caller is preserved naturally because we
never touch that field. We deliberately do **not** derive an effort
label from the budget (see decision log).
- `removeUnsupportedBedrockFields` learns a variadic `exemptFields`
parameter. Adaptive-only models support `output_config` natively (no
beta flag required), so `augmentRequestForBedrock` exempts that field
for those models.
- Bedrock Opus 4.7 accepts `output_config.effort` but rejects
`output_config.format` (structured outputs) with the same "Extra inputs
are not permitted" 400. The generic strip pass operates at top-level
granularity only, so a small targeted pass drops `output_config.format`
after the top-level strip for adaptive-only models.

The whole Bedrock thinking-type shim block carries a header comment
flagging it as temporary; a planned native Bedrock provider removes the
impedance mismatch and lets us delete it.

## Out of scope

The issue calls out a possible follow-up around `Anthropic-Beta:
interleaved-thinking-2025-05-14` for adaptive-only models; best evidence
is that Opus 4.7 still accepts those flags, so this PR is a no-op there.

<details>
<summary>Decision log</summary>

- `bedrockModelSupportsAdaptiveThinking` now also returns `true` for
adaptive-only models. That keeps the existing
`convertAdaptiveThinkingForBedrock` branch from running on Opus 4.7
(which would otherwise be incorrect; `adaptive` is the supported native
type there), and the new `convertEnabledThinkingForBedrock` runs only
for adaptive-only models via the explicit
`bedrockModelRequiresAdaptiveThinking` switch case. The two model sets
are disjoint by construction.
- The reverse conversion does **not** derive `output_config.effort` from
`budget_tokens / max_tokens`. The two thinking shapes encode different
intents (`enabled+budget` is "give me exactly N tokens,"
`adaptive[+effort]` is "model, pick a budget, optionally biased") and
there is no canonical mapping between them. An earlier draft of this PR
derived effort via midpoints of an invented anchor table; it was
symmetric-looking but lossy and required a lot of scaffolding (sorted
anchors, init-time invariant guard, round-trip tests) to keep two halves
consistent. The reverse direction now just rewrites the shape, which is
honest about the information loss and matches platform-defined adaptive
behavior when no effort hint is present.
- `output_config.format` is stripped only for adaptive-only models.
Other Bedrock models either don't get `output_config` through at all
(top-level strip handles them) or accept it via a beta flag that may
imply broader feature support. Easy to widen if the same 400 shows up
elsewhere.
- I chose `variadic exemptFields ...string` over passing the model down
to `removeUnsupportedBedrockFields`, to keep that function focused on
stripping and to localise the model-aware policy in
`augmentRequestForBedrock`.

</details>
2026-05-15 10:11:41 +02:00
Marcin Tojek febabfb8b2 feat: add request/response dump support to aibridgeproxyd (#24837)
Closes https://github.com/coder/coder/issues/24335
2026-05-11 10:59:26 +02:00
Susana Ferreira 0766cc3097 feat: add automatic key failover for AI Bridge passthrough (#24920)
## Description

Adds automatic key failover for passthrough routes for the Anthropic and OpenAI providers. A new `keyFailoverTransport` wraps the reverse-proxy transport: centralized requests walk the configured key pool and retry with the next key on key-specific failures (401/403/429), reusing the same key-marking semantics as the bridged routes.

BYOK passthrough requests run as a single attempt with no failover.

## Changes

- New `keypool.KeyFailoverConfig` carrying the `Pool` to walk and the provider-specific closures (`IsBYOK`, `InjectAuthKey`, `MarkKey`, `BuildExhaustedResponse`).
- New `keypool.NewKeyFailoverTransport`: wraps an inner `http.RoundTripper`. Returns `inner` unchanged when `Pool` is nil, otherwise produces a transport that buffers the request body once, walks the pool per request, and replays each attempt with the next key.
- New `Provider.KeyFailoverConfig(logger)` interface method. Anthropic injects `X-Api-Key`; OpenAI injects `Authorization: Bearer ...`; Copilot returns an empty config.
- `passthrough.go` wires `NewKeyFailoverTransport` around the existing apidump middleware, so every retry attempt is recorded.

## Related Issues

Related to: https://github.com/coder/internal/issues/1446
Related to: https://linear.app/codercom/issue/AIGOV-197/aibridge-automatic-key-failover-for-bridged-and-passthrough-routes

## Follow-up PRs

- Remove dead `Provider.InjectAuthHeader` method now that all auth is applied per-attempt by `KeyFailoverTransport`.
- Bedrock multi-key support.
- Refactor provider vs interceptor config separation.
- Record the actually-used key in the interception credential hint after failover.

> [!NOTE]
> Initially generated by Claude Opus 4.7, modified and reviewed by @ssncferreira
2026-05-07 15:46:36 +01:00
Susana Ferreira b6dacb4a3c feat: add automatic key failover for AI Bridge OpenAI (#24847)
## Description

Adds automatic key failover for centralized OpenAI provider, covering both chat completions and responses APIs. Same shape as the Anthropic PR: each upstream call walks the configured key pool, keys are marked **temporary** on 429 (with cooldown from `Retry-After`) and **permanent** on 401/403. Each agentic-loop iteration gets its own fresh walker so a tool-call continuation can fail over independently of the initial request.

BYOK is unchanged: BYOK requests run as a single attempt with no failover.

## Changes

- `config.OpenAI` carries a `KeyPool`. `Key` remains for BYOK Authorization Bearer set per interception.
- Chat completions blocking interceptor: walks the pool via `newChatCompletionWithKeyFailover`, marks keys on key-specific failures, returns on first success or non-failover error.
- Chat completions streaming interceptor: per-iteration walker. Pre-stream failures fail over to the next key; mid-stream errors are relayed as SSE events.
- Responses blocking interceptor: extracts `newResponseWithKeyFailover` parallel to chatcompletions.
- Responses streaming interceptor: per-iteration walker, retains the existing buffer-then-forward design.

## Related Issues

Related to: https://github.com/coder/internal/issues/1446
Related to: https://linear.app/codercom/issue/AIGOV-197/aibridge-automatic-key-failover-for-bridged-and-passthrough-routes

## Follow-up PRs

- Bedrock multi-key support.
- Refactor provider vs interceptor config separation.
- Record the actually-used key in the interception credential hint after failover.

> [!NOTE]
> Initially generated by Claude Opus 4.7, modified and reviewed by @ssncferreira
2026-05-07 15:35:46 +01:00
Susana Ferreira f1155ac4d7 feat: add automatic key failover for AI Bridge Anthropic (#24836)
## Description

Adds automatic key failover for centralized Anthropic provider. When a key pool is configured, each upstream call walks the pool and tries keys in order until one succeeds or the pool is exhausted. Keys are marked **temporary** on 429 (with cooldown from `Retry-After`) and **permanent** on 401/403. Errors that aren't key-specific don't trigger failover. Each agentic-loop iteration gets its own fresh walker, so a tool-call continuation can fail over independently of the initial request.

BYOK is unchanged: BYOK requests run as a single attempt with no failover.

## Changes

- `config.Anthropic` carries a `KeyPool`. `Key` remains for BYOK X-Api-Key set per interception.
- Blocking interceptor: walks the pool, marks keys on key-specific failures, returns on first success or non-failover error.
- Streaming interceptor: per-iteration walker. Pre-stream failures fail over to the next key; mid-stream errors are relayed as SSE events.
- New `keypool` error types: `TransientExhaustionError` (carries soonest cooldown) and `ErrPermanentExhaustion`. Replace the prior `ErrAllKeysExhausted`.
- Error responses now consistently include the outer `"type": "error"` field.

## Related Issues

Related to: https://github.com/coder/internal/issues/1446
Related to: https://linear.app/codercom/issue/AIGOV-197/aibridge-automatic-key-failover-for-bridged-and-passthrough-routes

## Follow-up PRs

- Bedrock multi-key support.
- Refactor provider vs interceptor config separation.
- Record the actually-used key in the interception credential hint after failover.

> [!NOTE]
> Initially generated by Claude Opus 4.7, modified and reviewed by @ssncferreira
2026-05-07 14:57:44 +01:00
Susana Ferreira dbb50ebaaf feat: remove 429 from aibridge circuit breaker failure conditions (#24701)
## Description

Removes 429 (Too Many Requests) from the circuit breaker failure conditions. Rate limiting is now handled by automatic key failover instead of tripping the circuit breaker.

## Changes

`DefaultIsFailure` no longer treats 429 as a circuit breaker failure. The circuit breaker now only trips on server overload responses (503, 529).

Tests and integration tests updated to use 503 instead of 429 for tripping circuits. Description strings in deployment config updated to reflect the change.

Closes https://github.com/coder/internal/issues/1445

> [!NOTE]
> Initially generated by Coder Agents, modified and reviewed by @ssncferreira
2026-04-30 09:31:32 +01:00
Susana Ferreira 123c8dfc02 feat(aibridge): add key pool with state tracking and walker (#24681)
## Description

Adds the `aibridge/keypool` package, a thread-safe key pool with per-key state tracking and cooldown expiry. This PR introduces the package only; wiring it into the aibridge providers and coder configuration will happen in upstream PRs.

## Changes

Each key is in one of three states: **Valid** (available), **Temporary** (rate-limited with cooldown expiry), or **Permanent** (revoked/unauthorized, terminal until restart). The state is derived from the key fields rather than stored explicitly: once a cooldown expires, the key is valid again without any external action.

A `Walker` provides per-request key traversal using a primary-with-fallback strategy: each request walks the pool from index 0, skipping unavailable keys, so the first key is always preferred when healthy. Walkers are independent, so concurrent requests traverse the pool without interfering with each other.

`MarkTemporary` preserves the longer cooldown when concurrent requests both mark the same key.

Relates to https://github.com/coder/internal/issues/1445

> [!NOTE]
> Initially generated by Coder Agents, modified and reviewed by @ssncferreira
2026-04-30 09:07:57 +01:00
Paweł Banaszewski a24dc19d49 chore: clean up env var usage in aibridge (#24783)
> AI tools where used when creating this PR

This PR removes environment variable parsing from `/aibridge` directory.

Added env variables/flags for dump dir as coder options.
Only added to new indexed provider options
(`CODER_AIBRIDGE_PROVIDER_<N>_*`) not to deprecated legacy env variables
(`CODER_AIBRIDGE_ANTHROPIC_*` and `CODER_AIBRIDGE_OPENAI_KEY_*`).

Reverted adding `MaxRetries` option as it will be removed soon due to
key failover work:
https://github.com/coder/coder/pull/24783#discussion_r3155544808
2026-04-29 18:28:37 +02:00
Paweł Banaszewski 6ea9c61da0 chore: update AI Gateway docs (#24805)
> AI tools where used when creating this PR


This PR:
* removes references to aibridge repository from coder docs
* updates aibdrige/README.md
* makes it clear aibridge (keeping old name) is a handler not a separate
process
* updates outdated sections about: metrics, recorded interface and
supported paths.

---------

Co-authored-by: Susana Ferreira <susana@coder.com>
2026-04-29 18:28:09 +02:00
Paweł Banaszewski 5907730dcf chore: update aibridge/AGENTS.md to reflect it is now part of coder/coder repo (#24822)
> AI tools where used when creating this PR

Updates `AGENTS.md` to reflect that `/aibridge` is now a directory in
coder/coder repo not a separate repo.
2026-04-29 15:51:07 +02:00
Paweł Banaszewski b8906c84a1 fix: fixes aibridge integration tests failing on windows (#24665)
Fixes aibridge responses and apidump integration tests in windows CI
job.

example failing job before the fix:
https://github.com/coder/coder/actions/runs/24873191866/job/72823942378
2026-04-24 18:28:22 +02:00
Danny Kopping 8aa3294f06 fix(aibridge): track Charm Crush client and session ID (#24630)
*Disclaimer: implemented by a Coder Agent using Claude Opus 4.6*

Porting https://github.com/coder/aibridge/pull/277 to coder/coder after
the [aibridge code move](https://github.com/coder/coder/pull/24190).

## Summary

Fixes client detection and session ID tracking for the [Charm
Crush](https://github.com/charmbracelet/crush) AI coding client.

## Changes

### Bug fix: User-Agent matching

The actual Crush user-agent is `Charm-Crush/{version}
(https://charm.land/crush)` (hyphenated), but `GuessClient` only checked
for `charm crush/` (space-separated). After lowercasing,
`Charm-Crush/0.2.0` becomes `charm-crush/0.2.0`, which did not match the
`charm crush/` prefix.

Now matches both formats for backwards compatibility.

### Session ID tracking

Adds an explicit `ClientCrush` case to `GuessSessionID`. Crush does not
currently send a session ID header to upstream AI providers, so this
returns `nil` (consistent with how `ClientZed`, `ClientRoo`, and
`ClientCursor` are handled).

### Tests

- Added `charm_crush_hyphen` test case for `GuessClient` using the real
user-agent format.
- Added `crush_returns_empty` test case for `GuessSessionID`.
2026-04-22 19:02:31 +02:00
Paweł Banaszewski e00e85765b chore: move aibridge library code into coder repo (#24190)
This PR merges code from `coder/aibridge` repository into `coder/coder`.
It was split into 4 PRs for easier review but stacked PRs will need to
be merged into this PR so all checks pass.

* https://github.com/coder/coder/pull/24190 -> raw code copy (this PR,
before merging PRs on top of it, it was just 1 commit:
https://github.com/coder/coder/commit/70d33f33200c7e77df910957595715f81f9bec24)
* https://github.com/coder/coder/pull/24570 -> update imports in
`coder/coder` to use copied code
* https://github.com/coder/coder/pull/24586 -> linter fixes and CI
integration (also added README.md)
* https://github.com/coder/coder/pull/24571 -> added exclude to
scripts/check_emdash.sh check

Original PR message (before PR squash):
Moves coder/aibridge code into coder/coder repository.

Omitted files:

- `go.mod`, `go.sum`, `.gitignore`, `.github/workflows/ci.yml,`
`Makefile`, `LICENSE`, `README.md` (modified README.md is added later)
- `.github`, `example`, `buildinfo,` `scripts` directories

Simple verification script (will list omitted files)

```
tmp=$(mktemp -d)
echo "$tmp"
git clone --depth=1 https://github.com/coder/aibridge "$tmp/aibridge"
git clone --depth=1 --branch pb/aibridge-code-move https://github.com/coder/coder "$tmp/coder"
diff -rq --exclude=.git "$tmp/aibridge" "$tmp/coder/aibridge"
# rm -rf "$tmp"
```
2026-04-22 17:01:01 +02:00