coder

mirror of https://github.com/coder/coder.git synced 2026-06-02 20:48:20 +00:00

Author	SHA1	Message	Date
Danny Kopping	85f56e4944	fix: recreate `ai_provider_type` instead of ADD VALUE (#25895 ) Coder runs all migrations in a single transaction (`pgTxnDriver`). Postgres forbids using an enum value added by `ALTER TYPE ... ADD VALUE` within the same transaction that added it. Migration `000499` widened `ai_provider_type` with `ADD VALUE`, and `000504` casts existing `chat_providers` rows to that enum in the same transaction. On deployments with a legacy provider using one of the new values (for example `openai-compat`), the batch failed with `unsafe use of new value` and the server could not start. Recreate the type (create a new enum, alter the column, drop and rename) instead of using `ADD VALUE`, matching the existing precedent in `000144_user_status_dormant`. A freshly created enum's values are usable immediately in the same transaction, so the cast in `000504` succeeds. The resulting schema is identical, so `make gen` produces no `dump.sql` diff and databases that already applied these migrations see no drift. Added a regression test that seeds an `openai-compat` provider and applies `000499` through `000504` in a single transaction, reproducing the production path. The per-step `Stepper` used by the other migration tests commits each migration separately and cannot surface this class of bug. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Signed-off-by: Danny Kopping <danny@coder.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-01 13:30:45 +00:00
Danny Kopping	a85462bd49	feat: support adding GitHub Copilot AI provider via UI (#25888 ) Copilot is the only AI provider type that could not be added through the `/ai/settings` UI. The aibridge runtime and the env-var seeding path already supported it, but the runtime CRUD API rejected `type=copilot` and the UI omitted it entirely. The root cause is that Copilot's auth model (a per-request GitHub OAuth token, with no pre-shared key) does not fit the credential-centric add-provider flow that every other provider uses. ## Backend Allow `type=copilot` in `CreateAIProviderRequest.Validate()`, and reject `api_keys` for Copilot on both create (validation) and update (handler sentinel), mirroring the existing Bedrock guards. Copilot carries no stored credential. ## Frontend Add Copilot to the provider type picker (with the `github-copilot.svg` icon) and give the form a credential-free branch: name, display name, and a free-text endpoint defaulting to `https://api.business.githubcopilot.com`, with copy explaining that authentication happens via the user's GitHub token at request time. Copilot maps to the distinct `copilot` wire type rather than collapsing to `openai`, and the edit flow recovers it correctly. The endpoint stays required with a business-tier default; users on the individual or enterprise endpoints edit the field. 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-06-01 15:26:37 +02:00
Mathias Fredriksson	82752844bc	fix: isolate MCP HTTP transports from DefaultTransport in tests (#25821 ) Use testing.Testing() inside createTransport to automatically clone http.DefaultTransport when running in tests. In production, DefaultTransport is used as-is (efficient connection pooling). This fixes the CloseIdleConnections flake class: httptest.Server.Close() calls http.DefaultTransport.CloseIdleConnections(), which disrupts any MCP client sharing that transport. The testing.Testing() check means every MCP transport created during tests gets isolation automatically, with no caller changes needed. Closes coder/internal#1016 Closes PLAT-291	2026-06-01 16:17:29 +03:00
Mathias Fredriksson	8b7e040105	fix(coderd/x/chatd/chatloop): discourage doctrine in compaction summaries (#25850 ) Two additions to the compaction summary prompt: 1. Error specificity: the "errors encountered" bullet now instructs the model to keep error notes specific (name the file, the error, the fix) and not generalize from a specific failure to a blanket tool-avoidance rule. This addresses the doctrine crystallization pattern where a single tool failure gets promoted to a standing "avoid tool X" rule that persists across compactions and model swaps. 2. Reproducibility: a new closing sentence instructs the model to reference reproducible content by path, command, or URL rather than inlining it. Content without a stable reproducer is still preserved inline with a brief summary. This targets summary bloat from inlined code blocks (worst case: 34k chars, 76 code blocks reproducing repo content verbatim). Refs CODAGT-331	2026-06-01 12:42:09 +03:00
dylanhuff-at-coder	0401ed3af5	fix(coderd/notifications): serialize pending updates gauge writes (#25495 ) Fixes a race where concurrent notification dispatch goroutines could overwrite `coderd_notifications_pending_updates` with an older buffer-length snapshot. Pending update snapshots now serialize count evaluation with the gauge write, and inhibited dispatch results refresh the metric when buffered.	2026-05-29 11:02:13 -07:00
Jon Ayers	5cdc9e28a9	feat: add nats cluster peer support (#25632 )	2026-05-29 11:35:59 -05:00
Mathias Fredriksson	98d5e7948d	fix(coderd/autobuild): handle concurrent build number race in lifecycle executor (#25824 ) The lifecycle executor did not handle unique-violation errors from InsertWorkspaceBuild. When a concurrent actor (API handler, another lifecycle executor, or prebuilds reconciler) inserts a workspace build with the same build number, PostgreSQL returns a unique constraint violation on workspace_builds_workspace_id_build_number_key. The lifecycle executor treated this as a hard error, logging it and storing it in stats.Errors. The per-workspace advisory lock (pg_try_advisory_xact_lock) prevents two lifecycle executors from racing, but does not protect against races with the CreateWorkspaceBuild API handler or the prebuilds reconciler, which use different (or no) locking. Catch the specific unique-violation error after InTx returns (where the transaction is already rolled back) and clear it. The concurrent actor's build takes effect; the lifecycle executor treats the workspace as a no-op for this tick. Closes coder/internal#455 Closes PLAT-290	2026-05-29 17:12:31 +03:00
Yevhenii Shcherbina	1a91d31793	feat: add user AI budget override endpoints (#25439 ) Implements https://linear.app/codercom/issue/AIGOV-285 Follow the structure established in https://github.com/coder/coder/pull/25203 ## Summary Adds the `user_ai_budget_overrides` table and CRUD API at `/api/v2/users/{user}/ai/budget`. An override sets a custom per-user spend cap that supersedes group-budget resolution, attributing spend to a specific group. ## Schema ```sql CREATE TABLE user_ai_budget_overrides ( user_id UUID PRIMARY KEY REFERENCES users(id) ON DELETE CASCADE, group_id UUID NOT NULL REFERENCES groups(id) ON DELETE CASCADE, spend_limit_micros BIGINT NOT NULL CHECK (spend_limit_micros >= 0), created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW() ); ``` ## Membership lifecycle The membership invariant — a user must be a member of the attributed group, including when that group is "Everyone" — would naturally be expressed as a composite FK on `(user_id, group_id) → group_members_expanded(user_id, group_id)`. PostgreSQL doesn't allow foreign keys to reference views, so enforcement is split across two mechanisms: - Write-time check. A CHECK constraint on the table (`user_ai_budget_overrides_must_be_group_member`) calls a `STABLE` function `is_group_member(user_id, group_id)` that queries `group_members_expanded`. The view surfaces both regular group memberships and the implicit "Everyone" group memberships from `organization_members`. Any INSERT or UPDATE that violates the predicate is rejected with a Postgres `check_violation`, which the handler maps to a 400. `is_group_member` is defined as a general predicate, reusable by any future table that needs the same check. - Cascade on removal. Two `BEFORE DELETE` triggers handle membership loss: - `trigger_delete_user_ai_budget_overrides_on_group_member_delete` on `group_members` — covers regular group removals (admin action, OIDC sync). 
- `trigger_delete_user_ai_budget_overrides_on_org_member_delete` on `organization_members` — covers the "Everyone" group, whose membership lives in `organization_members`. The single-column FKs on `users(id)` and `groups(id)` remain to cascade on user or group deletion (those paths don't pass through `group_members`). ## Authorization The dbauthz layer gates each operation against the `User` and (for writes) `Group` resources: \| Operation \| User resource \| Group resource \| \|-----------\|----------------\|----------------\| \| `GET` \| `ActionRead` \| — \| \| `PUT` \| `ActionUpdate` \| `ActionUpdate` \| \| `DELETE` \| `ActionUpdate` \| `ActionUpdate` \| For `DELETE`, the dbauthz layer fetches the existing override first to learn the attributed `group_id`, then runs both checks. ### Role matrix \| Role \| GET \| PUT \| DELETE \| \|--------------\|-----\|-----\|--------\| \| Owner \| ✅ \| ✅ \| ✅ \| \| UserAdmin \| ✅ \| ✅ \| ✅ \| \| OrgAdmin \| ✅ \| ❌ \| ❌ \| \| OrgUserAdmin \| ✅ \| ❌ \| ❌ \| Internal discussion: https://codercom.slack.com/archives/C096PFVBZKN/p1779392747885359 ## Audit logs Audit logs will be addressed in a follow-up PR.	2026-05-29 10:08:25 -04:00
Danny Kopping	110210d7c9	fix(coderd): block ai provider env key drift (#25849 ) Previously, `SeedAIProvidersFromEnv` only hashed provider-level fields, so env var key changes were silently ignored once a provider already existed in the database. Include bearer keys and Bedrock credentials in the canonical drift hash, and cover multi-key, multi-provider cases so restarts now fail loudly when the configured credentials no longer match what is stored. When changing a key, you'll now see this in the server startup logs: ``` 2026-05-29 12:29:02.674 [info] api: Encountered an error running "coder server", see "coder server --help" for more information 2026-05-29 12:29:02.674 [info] api: error: create coder API: 2026-05-29 12:29:02.674 [info] api: github.com/coder/coder/v2/cli.(RootCmd).Server.func2 2026-05-29 12:29:02.674 [info] api: /home/coder/coder/cli/server.go:1015 2026-05-29 12:29:02.674 [info] api: - seed ai providers from env: 2026-05-29 12:29:02.674 [info] api: github.com/coder/coder/v2/enterprise/cli.(RootCmd).Server.func1 2026-05-29 12:29:02.674 [info] api: /home/coder/coder/enterprise/cli/server.go:187 2026-05-29 12:29:02.674 [info] api: - execute transaction: 2026-05-29 12:29:02.674 [info] api: github.com/coder/coder/v2/coderd/database.(sqlQuerier).runTx 2026-05-29 12:29:02.674 [info] api: /home/coder/coder/coderd/database/db.go:212 ---> 2026-05-29 12:29:02.674 [info] api: - AI provider "vercel" already exists in the database and differs from the current environment configuration; update the provider through the API or remove the CODER_AIBRIDGE_ env vars to stop seeding it: 2026-05-29 12:29:02.674 [info] api: github.com/coder/coder/v2/coderd.SeedAIProvidersFromEnv.func1 2026-05-29 12:29:02.674 [info] api: /home/coder/coder/coderd/ai_providers_migrate.go:139 2026-05-29 12:29:02.674 [info] api: slogjson: failed to write entry: io: read/write on closed pipe 2026-05-29 12:29:02.700 [info] dlv: Stop reason: exited 2026-05-29 12:29:02.825 [info] site: ELIFECYCLE 
Command failed. error: running command "develop": server did not become ready in 1m0s: main.waitForHealthy /home/coder/coder/scripts/develop/main.go:877 - context canceled ``` _This PR was generated with Coder Agents._	2026-05-29 13:14:55 +00:00
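A sketch of a drift hash that covers credentials as well as provider-level fields, as the commit above describes. The `providerConfig` shape, the JSON encoding, and the decision to sort keys before hashing are all assumptions made for illustration; the real `SeedAIProvidersFromEnv` canonical form is not shown in the commit message.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"sort"
)

// providerConfig is a hypothetical shape, not the struct used by coderd.
type providerConfig struct {
	Name    string   `json:"name"`
	Type    string   `json:"type"`
	BaseURL string   `json:"base_url"`
	Keys    []string `json:"keys"`
}

// driftHash hashes provider-level fields together with the keys, so a changed
// key yields a different hash. Keys are sorted on a copy so that their order
// in the environment does not matter (an assumption of this sketch).
func driftHash(p providerConfig) (string, error) {
	keys := append([]string(nil), p.Keys...)
	sort.Strings(keys)
	p.Keys = keys
	b, err := json.Marshal(p)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	stored := providerConfig{Name: "vercel", Type: "vercel", Keys: []string{"key-a"}}
	fromEnv := providerConfig{Name: "vercel", Type: "vercel", Keys: []string{"key-b"}}
	a, _ := driftHash(stored)
	b, _ := driftHash(fromEnv)
	fmt.Println("drift detected:", a != b)
}
```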
Cian Johnston	d0a51da0a9	feat: classify provider_disabled 503 as non-retryable (#25800 ) Builds on top of https://github.com/coder/coder/pull/25794 Adds a new `provider_disabled` error classification in `chatd` with the corresponding plumbing to classify it as non-retryable. Also adds a story for how this particular error kind is displayed in the UI.	2026-05-29 13:14:04 +01:00
Susana Ferreira	7b903cad73	fix: track credential hint across key failover attempts in aibridge (#25735 ) ## Problem Centralized requests recorded the first available key from the pool at `CreateInterceptor` time as `credential_hint`, so the interception could be persisted in the database with a hint that didn't match the key that actually served the request. The fix is to store, at end-of-interception, the hint of the key that succeeded, or of the last attempted key if all keys are unavailable. ## Changes - Add `Key.Hint()` and update `credential_hint` on every failover attempt so it reflects the actually-used key. - Stop pre-populating `credential_hint` at `CreateInterceptor`. Centralized starts empty and is updated by the key failover loop. - Persist the final hint via `RecordInterceptionEnded`; SQL updates `credential_hint` only when `credential_kind = 'centralized'` so BYOK keeps its start-time value. - Log the actually-used hint on interception end/failure; start log uses a `<keypool-pending>` placeholder for centralized. > [!NOTE] > Initially generated by Claude Opus 4.7, modified and reviewed by @ssncferreira	2026-05-29 12:01:37 +01:00
Sas Swart	a586b7e5e0	feat: add `boundary_log` rbac resource (#24810 ) RFC: [Bridge ↔ Boundaries Correlation RFC](https://www.notion.so/coderhq/Gateway-and-Firewall-Correlation-RFC-31ad579be592803aa8b3d48348ccdde9) Register a dedicated `boundary_log` RBAC resource type with `create`, `read`, and `delete` actions, replacing the placeholder `rbac.ResourceAuditLog` and `rbac.ResourceSystem` references previously used in the dbauthz layer. Create is granted at user-level so workspace agents can only write logs owned by their workspace owner, preventing cross-workspace log fabrication. Delete is restricted to `DBPurge` only; no human role (including owner) can delete boundary logs. \| Subject \| Create (own) \| Create (other) \| Read (all) \| Delete \| \|---\|---\|---\|---\|---\| \| Workspace agent \| yes \| no \| no \| no \| \| Owner (site admin) \| yes (via member) \| no \| yes \| no \| \| Auditor \| no \| no \| yes \| no \| \| DBPurge \| no \| no \| no \| yes \| ### Changes - RBAC policy & resource definition: add `boundary_log` to `policy.go` and generate `ResourceBoundaryLog` object, scope constants, and codersdk/TypeScript types. - dbauthz authorization: replace all `ResourceAuditLog`/`ResourceSystem` placeholders with `ResourceBoundaryLog`. `InsertBoundaryLog` and `InsertBoundarySession` derive the workspace owner from the agent and authorize with `.WithOwner()` for user-scoped create. - Role assignments: - Owner (site): read only. Excluded from `allPermsExcept` wildcard; create is inherited from member at user-level. - Member (user-level): create. User-scoped so agents can only write logs they own. - Auditor (site): read. - `boundary_log` is excluded from org-admin, org-member, and org-service-account `allPermsExcept` calls for consistency with `ResourceBoundaryUsage`. - System subjects: - DB Purge (`SubjectTypeDBPurge`): delete. The only subject that can remove boundary logs. 
- Workspace agent scope: `ResourceBoundaryLog` with wildcard ID in the agent scope allow-list (necessary for creation since no pre-existing ID exists). User-level role scoping prevents deployment-wide access. - DB migration (`000510_boundary_log_scopes`): add `boundary_log:`, `boundary_log:create`, `boundary_log:delete`, `boundary_log:read` enum values to `api_key_scope`. - Test coverage: `BoundaryLogCreate` (user-scoped, only matching owner succeeds), `BoundaryLogDelete` (all human roles denied), `BoundaryLogRead` (owner + auditor). dbauthz mock tests set up workspace agent lookups for owner derivation. - Generated docs: update OpenAPI specs, API reference docs, and frontend type definitions. --------- Co-authored-by: Muhammad Danish <mdanishkhdev@gmail.com> Co-authored-by: Coder Agents <coder-agents-review[bot]@users.noreply.github.com>	2026-05-29 12:50:39 +02:00
Danny Kopping	5b10268827	feat: serve 503 sentinel for disabled providers (#25794 ) _Disclosure: created with Coder Agents._ When providers are disabled, we should serve a sentinel error so the requesting client (Claude Code, Coder Agents, etc.) is informed. Coder Agents can also use the sentinel to show a helpful error message. --------- Signed-off-by: Danny Kopping <danny@coder.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-29 10:24:16 +02:00
Ethan	eb2c2799ca	fix: strip deleted MCP IDs from chats on delete (#25763 ) Adds a database migration that reconciles existing stale chat MCP server IDs, then installs a `BEFORE DELETE` trigger on `mcp_server_configs` to remove the deleted ID from `chats.mcp_server_ids`. This keeps chat continuation from failing with `400 One or more MCP server IDs are invalid` after an MCP server config is deleted. This matches the existing repo precedent in `coderd/database/migrations/000241_delete_user_roles.up.sql`, where deleting a custom role cleans `organization_members.roles`, a similarly structured array of references that cannot be protected by a normal foreign key. Closes CODAGT-505	2026-05-29 16:49:25 +10:00
Jon Ayers	bb11946bd4	fix: require update permission to recreate devcontainers (#25812 ) - The httpmw upstream from this endpoint only checks for read perms to the workspace agent. Recreating a dev container should require `update` perms since it mutates state. This also matches the behavior of the `DELETE` endpoint	2026-05-28 15:34:36 -05:00
Cian Johnston	7ea0eff94e	fix: improve chat audit log descriptions and diff rendering (#25728 ) Chat ACL audit diffs rendered as `[object Object]` because the diff viewer called `.toString()` on object values. Common chat operations (archive, share) showed generic "updated chat" descriptions instead of semantic ones. Add `chatAuditLogDescription` to derive semantic descriptions from the audit diff for successful chat writes: "archived/unarchived chat" for archive toggles, "updated sharing for chat" for ACL-only changes. Extract diff value formatting into `formatAuditDiffValue`, which renders object values as deterministic compact JSON with sorted keys, fixing the `[object Object]` rendering for chat ACLs and any other object-valued fields. The previous `determineIdPSyncMappingDiff` workaround for IdP sync mappings was removed because the generic formatting handles it. Closes CODAGT-513 > Generated by Coder Agents on behalf of @johnstcn	2026-05-28 18:37:57 +01:00
Danielle Maywood	0d1340a430	fix: collapse agent command output by default (#25748 )	2026-05-28 16:54:52 +01:00
Steven Masley	4591212482	feat: implement SCIM handler for SCIM 2.0 compliance (#25572 ) Rewrites the SCIM 2.0 user provisioning handler to be RFC 7644 compliant. Verified against an external IdP (Okta). The new behavior is opt-in.	2026-05-28 10:00:37 -05:00
Cian Johnston	6df1536256	fix: add missing_key error kind for missing chat api_key_id (#25783 ) Refs CODAGT-486 - `codersdk/chats.go`: New `ChatErrorKindMissingKey` constant and `AllChatErrorKinds` entry - `coderd/x/chatd/chaterror/message.go`: `terminalMessage` and `retryMessage` cases - `coderd/x/chatd/model_routing_aibridge.go`: Pre-classify error with `WithClassification` - `coderd/x/chatd/model_routing_internal_test.go`: Classification assertion on production path (CRF-2) - `chatStatusHelpers.ts`: Frontend title "Chat interrupted" - `LiveStreamTail.stories.tsx`: Storybook story with `detail` assertion - `docs/ai-coder/ai-gateway/clients/coder-agents.md`: Troubleshooting entry - Tests: classification round-trip, terminal message, metrics kind enumeration > Generated with [Coder Agents](https://coder.com/agents) on behalf of @johnstcn	2026-05-28 15:50:52 +01:00
Danny Kopping	12520ee964	feat: add ai provider status and reload freshness metrics (#25770 ) Add metrics for `aibridged` and `aibridgeproxyd`'s provider statuses. AI providers can be modified, and possibly misconfigured, at runtime. These metrics help operators understand the state of these provider definitions in case unexpected behaviour is observed.	2026-05-28 14:57:33 +02:00
Ethan	7e2f7198dd	fix(coderd/x/chatd/chatloop): use stream silence timeout (#25782 ) Replaces the 60 second first-token timeout in the chat loop with a 10 minute stream-silence timeout. Previously, the guard bounded only the gap before the first stream part. Once any part arrived the attempt could hang indefinitely if the provider stopped streaming without closing the connection, and even normal long-running responses could be killed after 60 seconds if the provider was slow to emit the first token. The guard now arms when a model attempt opens its stream, resets on every received stream part, and fires after 10 minutes of complete silence. The existing retry path still handles the timeout, and the public `startup_timeout` error kind is preserved to avoid API and frontend churn. 10 minutes matches the default request timeout used by the Anthropic and OpenAI Python SDKs. Closes CODAGT-493	2026-05-28 21:02:40 +10:00
Michael Suchacz	f529577bee	fix(coderd/x/chatd): harden openai-compatible chat calls (#25737 ) OpenAI-compatible chat paths hit two provider compatibility issues. Some compatible endpoints reject a named `tool_choice` when there is only one tool, and Gemini's OpenAI-compatible endpoint requires thought signatures on current-turn tool calls. Centralize OpenAI-compatible request patches in the chat provider: rewrite single named tool choices to `"required"`, and add the documented dummy Google thought signature to the first tool call in each current-turn tool step for Gemini routes. Vercel OpenAI-compatible requests are left unchanged for the thought-signature patch. > Mux created this PR on behalf of Mike.	2026-05-28 10:27:32 +02:00
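A sketch of the single-tool `tool_choice` rewrite described in the commit above. The `chatRequest` and `namedToolChoice` types are simplified, hypothetical shapes and not the types used by the chat provider.

```go
package main

import "fmt"

// chatRequest is a simplified, hypothetical request shape.
type chatRequest struct {
	Tools      []string
	ToolChoice any // a string such as "auto" or "required", or a namedToolChoice
}

// namedToolChoice selects one tool by name.
type namedToolChoice struct{ Name string }

// patchToolChoice rewrites a named choice to "required" when exactly one tool
// is offered and it is the named one. With a single tool, both forms force
// the same call, but "required" is accepted by more compatible endpoints.
func patchToolChoice(req chatRequest) chatRequest {
	named, ok := req.ToolChoice.(namedToolChoice)
	if ok && len(req.Tools) == 1 && req.Tools[0] == named.Name {
		req.ToolChoice = "required"
	}
	return req
}

func main() {
	req := chatRequest{
		Tools:      []string{"edit_files"},
		ToolChoice: namedToolChoice{Name: "edit_files"},
	}
	fmt.Println(patchToolChoice(req).ToolChoice)
}
```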
Garrett Delfosse	a2e1ddb56f	fix: validate FileSize in NewDataBuilder to prevent OOM DoS (#25710 ) `NewDataBuilder` allocated `make([]byte, 0, req.FileSize)` using the client-supplied `int64` with no upper-bound check. The DRPC 4 MiB wire cap limits message size but not the integer value, so a crafted message with `FileSize = 1<<40` forces a 1 TiB allocation, triggering an unrecoverable `runtime.throw` that kills the entire `coderd` process. Add a `MaxFileSize` constant (100 MiB, matching `HTTPFileMaxBytes` in `coderd/files.go`) and reject negative or oversized `FileSize`, plus negative or excessive `Chunks`, before the allocation. `BytesToDataUpload` also returns an error for oversized data to preserve the encode/decode round-trip contract. Fix a pre-existing reversed subtraction in the `Add()` overflow error message. Closes https://linear.app/codercom/issue/PLAT-231 <details> <summary>Implementation details</summary> - `provisionersdk/proto/dataupload.go`: New exported `MaxFileSize` constant; validation in `NewDataBuilder` and `BytesToDataUpload`. Fixed reversed subtraction in `Add()` error. - `provisionersdk/proto/dataupload_test.go`: New `TestNewDataBuilderValidation` with 7 subtests. - Updated all 5 callers of `BytesToDataUpload` for new error return. - Audited all `make([]byte, ...)` in provisioner paths; no other client-supplied sizes. </details> > Generated by Coder Agents on behalf of @f0ssel	2026-05-27 14:30:11 -04:00
Jon Ayers	f6f284ea51	feat: add initial NATS implementation (#25602 )	2026-05-27 12:57:20 -05:00
Cian Johnston	b278be7361	fix(coderd): enforce api_key_id on user messages at type level (#25729 ) - Empty string is valid for `apiKeyID` in paths that genuinely lack a caller key (e.g. agent-initiated context injection in `workspaceAgentAddChatContext`). AI Gateway fail-closed check remains the runtime safety net. - Context injection paths (`persistInstructionFiles`, compaction) read the key from `aibridge.DelegatedAPIKeyIDFromContext(ctx)`, set upstream by `contextWithActiveTurnAPIKeyID`. - Subagent context copy branches on `copiedRole == database.ChatMessageRoleUser` to choose the right append function. > Generated by Coder Agents	2026-05-27 17:00:23 +01:00
Danny Kopping	2770bdc9d1	feat: route extra ai_provider_types through OpenAI and Anthropic providers (#25722 ) _Disclosure: produced with Claude Opus 4.7_ AI Gateway only supports Anthropic (+Bedrock), OpenAI, and Copilot providers at present. All other types (Vercel, Gemini, etc.) will be mapped to OpenAI since they support OpenAI-compatible endpoints.	2026-05-27 16:16:05 +02:00
Spike Curtis	6f06ace949	chore: export MsgQueue from pubsub package (#25707 ) Makes `MsgQueue` exported, so it can be used in pubsub implementations outside PGPubsub.	2026-05-27 10:11:51 -04:00
Cian Johnston	0c27224fc2	fix(coderd): pass title API key context (#25723 ) Fixes CODAGT-503 - Add failing-first coverage for manual title generation with missing message `api_key_id`, with both context fallback and fail-closed cases. - Set `aibridge.WithDelegatedAPIKeyID(ctx, apiKey.ID)` in `regenerateChatTitle` and `proposeChatTitle`. - In `generateManualTitleCandidate`, fall back to `aibridge.DelegatedAPIKeyIDFromContext(ctx)` only when `modelBuildOptionsFromMessages` yields an empty `ActiveAPIKeyID`. - Keep `modelBuildOptionsFromMessages` pure and leave automatic title generation unchanged.	2026-05-27 13:20:36 +01:00
Danny Kopping	10f37db35d	fix(coderd/x/chatd/chatprovider): keep gateway model prefix in ResolveModelWithProviderHint (#25725 ) For `vercel`, `openrouter`, and `openai-compat`, the `<provider>/<model>` slash is part of the upstream model ID rather than a hint. `ResolveModelWithProviderHint` was running `parseCanonicalModelRef` before honoring `providerHint`, so a config like `(provider=vercel, model=anthropic/claude-4-5-sonnet)` resolved to `provider=anthropic, model=claude-4-5-sonnet` and the prefix-less model name was forwarded to Vercel, which returned `Model 'claude-4-5-sonnet' not found`. Honor an explicit gateway provider hint before attempting canonical-ref parsing. Non-gateway hints (anthropic, openai, etc.) keep the existing canonical-ref-first behavior so `anthropic/claude-...` still has its prefix stripped when routed directly to Anthropic. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 11:13:39 +00:00
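The resolution order described in the commit above, as a sketch. `resolveModel` and `gatewayProviders` are hypothetical names standing in for `ResolveModelWithProviderHint` and its internals, which are not shown in the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// gatewayProviders treat "<vendor>/<model>" as the full upstream model ID.
var gatewayProviders = map[string]bool{
	"vercel":        true,
	"openrouter":    true,
	"openai-compat": true,
}

// resolveModel returns the provider to route to and the model name to send
// upstream.
func resolveModel(hint, model string) (provider, upstream string) {
	// An explicit gateway hint wins: keep the prefix intact.
	if gatewayProviders[hint] {
		return hint, model
	}
	// Otherwise treat "provider/model" as a canonical reference and strip
	// the prefix.
	if p, m, ok := strings.Cut(model, "/"); ok && p != "" && m != "" {
		return p, m
	}
	return hint, model
}

func main() {
	fmt.Println(resolveModel("vercel", "anthropic/claude-4-5-sonnet"))
	fmt.Println(resolveModel("anthropic", "anthropic/claude-4-5-sonnet"))
}
```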
Danny Kopping	79e007cf30	feat: hot-reload aibridged and aibridgeproxyd providers on DB changes (#25673 ) Previously the in-process aibridge daemon and the enterprise aibridgeproxy daemon both snapshotted their provider routing once at boot. Any `ai_providers` or `ai_provider_keys` mutation required a restart for either to pick it up. Add an `ai_providers_changed` pubsub channel that the CRUD handlers publish on after Create / Update / Delete. Both daemons subscribe: - aibridged rebuilds its `[]aibridge.Provider` snapshot via `BuildProviders` and swaps it into the pool atomically. Inflight requests keep serving against the bridge they already acquired; new acquires build against the new snapshot. Per-provider construction errors stay scoped to the offending row. - aibridgeproxyd rebuilds its routing snapshot from `GetAIProviders` and swaps the host→provider map atomically. The MITM listener picks up new providers without restart. DB read for aibridgeproxyd uses the existing `AsAIProviderMetadataReader` subject for routing-only access.	2026-05-27 11:58:43 +02:00
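A sketch of the atomic snapshot swap described in the commit above, assuming an immutable routing table published through `atomic.Pointer`. The `snapshot` and `router` types are hypothetical; the pubsub subscription that would trigger `reload` is omitted.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// snapshot is an immutable host -> provider routing table.
type snapshot struct{ byHost map[string]string }

type router struct{ current atomic.Pointer[snapshot] }

// reload builds a fresh snapshot from the given rows and publishes it with a
// single atomic store. Existing snapshots are never mutated.
func (r *router) reload(rows map[string]string) {
	next := &snapshot{byHost: make(map[string]string, len(rows))}
	for host, provider := range rows {
		next.byHost[host] = provider
	}
	r.current.Store(next)
}

// acquire returns the snapshot that is current right now. A caller keeps
// using it even if reload runs afterwards.
func (r *router) acquire() *snapshot { return r.current.Load() }

func main() {
	var r router
	r.reload(map[string]string{"api.example.com": "openai"})
	inflight := r.acquire()
	r.reload(map[string]string{"api.example.com": "anthropic"})
	fmt.Println(inflight.byHost["api.example.com"], r.acquire().byHost["api.example.com"])
}
```

In-flight requests hold the pointer they acquired, which is why they keep serving against the old snapshot while new acquires see the new one.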
Cian Johnston	6acfe6c835	fix: classify quota errors as usage_limit instead of auth (#25676 ) Fixes CODAGT-484. - Removed "quota", "billing", "insufficient_quota", "payment required" from `authStrongPatterns` - Added `usageLimitPatterns` slice with those patterns - Added `usageLimitMatch` signal and rule between overloaded and authStrong in priority - Added terminal/retry messages for `ChatErrorKindUsageLimit` - Simplified auth message (removed billing reference) - Frontend: conditional `!usageLimitStatus.provider` guard on the "View Usage" Alert - Added `TestClassify_UsageLimitBeatsAuth` with 5 cases including real production OpenAI error - Added `ProviderQuotaExceeded` story asserting no "View Usage" link and correct `ChatStatusCallout` rendering > Generated with [Coder Agents](https://coder.com/agents)	2026-05-27 09:45:36 +01:00
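A simplified sketch of the classification priority described in the commit above, where usage-limit patterns are checked before auth patterns. The four usage-limit patterns come from the commit message; `authPatterns`, the `"generic"` fallback, and the `classify` signature are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// usageLimitPatterns are the patterns the commit moved out of the auth list.
var usageLimitPatterns = []string{"quota", "billing", "insufficient_quota", "payment required"}

// authPatterns is a hypothetical remainder of the auth list.
var authPatterns = []string{"unauthorized", "invalid api key"}

func containsAny(s string, patterns []string) bool {
	for _, p := range patterns {
		if strings.Contains(s, p) {
			return true
		}
	}
	return false
}

// classify checks usage-limit patterns first, so a quota error that also
// mentions an auth status is not reported as an auth failure.
func classify(msg string) string {
	m := strings.ToLower(msg)
	switch {
	case containsAny(m, usageLimitPatterns):
		return "usage_limit"
	case containsAny(m, authPatterns):
		return "auth"
	default:
		return "generic"
	}
}

func main() {
	fmt.Println(classify("401 Unauthorized: insufficient_quota"))
	fmt.Println(classify("Invalid API key provided"))
}
```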
Zach	47ac4b309a	feat: enforce per-user limits on user_secrets (#25588 ) Add a Postgres trigger and matching codersdk constants that cap each user's secrets in four dimensions: count (50), total stored value bytes (200 KiB), env-injected stored value bytes (24 KiB), and env name length (256 bytes). Without these caps a user could overflow the 4 MiB DRPC agent manifest, the ~32 KiB Windows process env block, or Linux/macOS ARG_MAX at workspace start. The trigger is the source of truth on aggregates; the handler maps its check_violation error into a 400 that names the per-user budget in stored (post-encryption) bytes. A handler test exercises off-by-one at each cap across POST and PATCH, plus per-user budget isolation. Generated with help from Coder Agents.	2026-05-26 14:42:31 -06:00
Kyle Carberry	58f6b9c4d0	fix(coderd/externalauth): retry transient refresh failures with backoff (#25686 ) ## Summary Wraps external auth token refresh in an exponential-backoff retry so a brief upstream hiccup (5xx, network timeout, rate-limited 429) no longer surfaces as an `InvalidTokenError` and forces users to re-authenticate. GitHub in particular has been flaky enough lately that this is hitting real users. ## Behavior - `(Config).RefreshToken` now calls a helper that retries the `TokenSource.Token()` exchange with exponential backoff (250ms → 2s), bounded by a 10s total budget. - Errors classified as permanent by `isFailedRefresh` (e.g. `bad_refresh_token`, `invalid_grant`, `unauthorized_client`, ...) skip the retry loop. Retrying a permanent failure wastes the refresh quota and, on providers with single-use refresh tokens, can mask a legitimate concurrent winner with repeated `bad_refresh_token` responses. - Refreshes with an empty refresh token still short-circuit without making an API call. - The existing concurrent-refresh-race detection and optimistic-lock paths are unchanged. ## Tunables Three new `time.Duration` fields on `externalauth.Config` (`RefreshRetryInitialBackoff`, `RefreshRetryMaxBackoff`, `RefreshRetryTimeout`) let callers override the defaults. They default to zero, which falls back to the package defaults, so existing call sites are unaffected. The fields exist primarily so tests can dial the timing way down without touching package globals (and therefore without serializing parallel tests). ## Tests - `TestRefreshToken/RefreshRetries` now disables internal retries via `RefreshRetryTimeout = time.Nanosecond` so its existing "1 IDP call per `RefreshToken` invocation" assertion still holds. Otherwise its assertions are unchanged. - New `TestRefreshToken/RefreshTokenWithBackoff` simulates 3 transient 5xx failures followed by success and verifies the refresh ultimately succeeds with 4 total IDP attempts. 
- New `TestRefreshToken/RefreshTokenBackoffPermanentError` returns `bad_refresh_token` and verifies the refresh is not retried even with a generous 1s budget. <details> <summary>Why the explicit <code>retryCtx.Err()</code> guard?</summary> `retry.Retrier.Wait` `select`s between `time.After(delay)` and `ctx.Done()`. The first call has `delay == 0`, so `time.After(0)` and an already-cancelled context both fire immediately and Go picks the case nondeterministically. Without the guard, a near-zero retry budget would still trigger an unwanted extra refresh attempt roughly half the time, which would have made the `RefreshRetries` test flaky. </details> This PR was opened by a Coder agent on behalf of @kylecarbs.	2026-05-26 15:35:22 -04:00
Michael Suchacz	8b1705eb65	feat: route chatd provider traffic through aibridge (#25629 ) ## Summary Routes chatd model calls backed by concrete AI Provider rows through the in-process aibridge transport by default, with deployment options to use direct provider routing when AI Gateway is disabled or chat AI Gateway routing is disabled. - Splits model routing into common, direct provider, and AI Gateway paths behind a single deployment-mode entry point. - Builds chatd models through explicit request, route, and options data. Active API key attribution is passed explicitly instead of being hidden inside generic model construction. - For AI Gateway BYOK routes, resolves the user's provider key in chatd, forwards it through provider-specific auth headers, and sets `X-Coder-AI-Governance-Token` to the `delegated` marker so aibridge preserves those headers while still stripping Coder-specific metadata. - Keeps central provider credentials and deployment fallback credentials out of forwarded provider auth headers, so AI Gateway central policy remains authoritative. - Redacts delegated provider auth from default string formatting to avoid accidental plaintext logging of user BYOK credentials. - Covers selected chat models, advisor overrides, title and quickgen paths, subagent overrides, computer use model selection, and an integration-style chat turn through the aibridge transport path. - Persists initiating API key IDs on chat and queued user messages, including subagent child messages, and fails closed for AI Gateway-routed model builds without an active key. - Removes unused `api_key_id` indexes while keeping the persistence columns and foreign keys. - Keeps the deployment option available through config and env parsing, but hides it from CLI help and generated docs. - Stabilizes the subagent poll fallback test so background CreateChat processing cannot win the state transition under slower CI environments. 
## Tests - `go test ./coderd/x/chatd -run 'TestAIGatewayProviderAuthForUser\|TestAIGatewayProviderAuthRedactsFormatting\|TestResolveModelRouteForConfigAIGatewayProviderAuth\|TestAIGatewayModelForwardsProviderAuth\|TestProcessChat_AIGatewayRoutingUsesDelegatedAPIKey\|TestAwaitSubagentCompletion' -count=1` - `go test ./coderd/aibridged -run 'TestServeHTTP_DelegatedAPIKey\|TestServeHTTP_StripCoderToken' -count=1` - `git diff --check HEAD~1..HEAD` - `make lint` > Mux working on behalf of Mike.	2026-05-26 19:31:52 +00:00
Danny Kopping	5d8ca2e5ce	fix: extract key when BYOK header is given with delegated auth (#25688 ) Previously we were only extracting the API key when _not_ delegating auth; this is incorrect. We need to extract the key _always_ when BYOK is intended. --------- Signed-off-by: Danny Kopping <danny@coder.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 19:46:26 +02:00
Danny Kopping	282ab7de34	refactor: load AI providers from the database at startup (#25672 ) Replace the env-based `BuildProviders` with a DB-backed loader. The database is now the single source of truth for runtime provider configuration; env config arrives via `SeedAIProvidersFromEnv` (run at boot) and `BuildProviders` reads it back as `aibridge.Provider` instances. `cli/server.go` and `enterprise/cli/server.go` both call the same path, so aibridged and aibridgeproxyd see the same provider set. Per-provider `DumpDir` is replaced by a top-level `CODER_AI_GATEWAY_DUMP_DIR` base; each provider's effective dump path is `<base>/<provider name>`.	2026-05-26 15:57:01 +02:00
Mathias Fredriksson	32ed9f1f39	fix: use old_text/new_text in edit_files tool schema (#25658 ) Models frequently confuse the search and replace fields in the edit_files tool (CODAGT-312). Rename the model-facing JSON fields to old_text/new_text so the intent is unambiguous. Backend: custom UnmarshalJSON on editFileEdit falls back to deprecated search/replace when old_text/new_text are empty. The workspace agent API is unchanged; toSDKFiles maps old_text/new_text back to search/replace for agent/agentfiles. Frontend: normalizeEdit in parseEditFilesArgs accepts both old_text/new_text and search/replace, normalizing to the internal { search, replace } representation so streaming diff rendering works with either field naming convention.	2026-05-26 11:11:47 +03:00
Danny Kopping	c801dcbbc8	fix: strip route prefix when passing request to aibridged handler (#25671 ) We weren't stripping the API base (`/api/v2/aibridge`), leading to 404s when using the in-memory transport. Signed-off-by: Danny Kopping <danny@coder.com>	2026-05-26 08:04:26 +00:00
Ethan	fe13bb2a20	fix(coderd/x/chatd): seed afterMessageID test directly (#25665 ) This fixes the flaky `TestSubscribeAfterMessageID` by seeding its chat and messages directly, so the test no longer creates pending work that a chat worker can pick up. The assertion now covers only the `afterMessageID` subscription behavior, independent of chat processing lifecycle timing. Closes DEVEX-326 Closes https://github.com/coder/internal/issues/1489	2026-05-26 13:16:32 +10:00
Cian Johnston	579daaff70	feat: add GitLab support to coderd/externalauth/gitprovider Fixes CODAGT-146 Add GitLab support to the gitprovider package for gitsync/chatd PR diff flows. This is a squashed stack of 3 PRs: #25651 - refactor(coderd/externalauth): prepare gitprovider for multi-provider support - Change gitprovider.New to return (Provider, error) - Extract shared helpers (parseRetryAfter, checkRateLimitError, countDiffLines, escapePathPreserveSlashes) from github.go - Update all callers (db2sdk, exp_chats, gitsync) for new signature - Add error logging for provider construction failures - Thread context through provider resolution #25652 - feat(coderd/externalauth/gitprovider): add GitLab provider - Implement full Provider interface: FetchPullRequestStatus, FetchPullRequestDiff, FetchBranchDiff, ResolveBranchPullRequest - Handle nested groups, forks, and self-hosted instances - Rate limit detection on both library and raw HTTP paths - URL parsing/building with NormalizePullRequestURL support - Unit tests covering error paths, URL parsing, state mapping - Document GitLab configuration and known limitations #25653 - test(coderd/externalauth/gitprovider): add GitLab VCR integration tests - FetchPullRequestStatus: 4 fixtures (open, conflicts, merged, closed) - FetchPullRequestDiff: 4 fixtures - FetchBranchDiff: 3 fixtures (open, deleted, fork) - ResolveBranchPullRequest: 3 fixtures - go-vcr cassettes with sanitized GitLab API responses	2026-05-25 17:41:02 +01:00
Danny Kopping	8652ef3e3b	refactor: route `TransportFor` by provider name (#25650 ) Delegate `aibridge` routing responsibility to the in-memory transport layer. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 18:04:12 +02:00
Mathias Fredriksson	00a6dc56a7	test(coderd/x/chatd): wait for settled state in PromoteQueued ordering (#25644 ) TestPromoteQueuedWhileRunningRespectsMessageOrder was flaky because it read queue state from the database immediately after PromoteQueued returned. The active server worker drains queued messages concurrently, so the DB read races the auto-promote pipeline (TOCTOU). Instead of asserting intermediate queue state, wait for all three promoted messages to appear in chat history and verify their relative order (B before A before C). This asserts the same invariant (promote reorders B to the front) without reading during the race window. Closes CODAGT-384	2026-05-25 17:58:31 +03:00
Danny Kopping	4ddda3a9db	feat: filter interceptions and sessions by provider name (#25640 ) Allows filtering sessions & interceptions by provider name, and adds a test to validate that provider name is immutable (at least until #25606 lands).	2026-05-25 16:31:48 +02:00
Mathias Fredriksson	12f082c864	test(coderd/x/chatd): drain all subscriber events per tick in PromoteQueued tests (#25645 ) The root cause of the TestPromoteQueuedWhileRequiresActionMixedTools flake (CODAGT-425) was the subscriber out-of-order durable message delivery bug, fixed by PR #25433 (`ec1e861`). All five CI failures predate that fix. Zero failures since. This change hardens the subscriber event-drain pattern in both PromoteQueued requires_action tests: wrap the channel select in a for-loop so interleaved non-target events (status, queue_update, message_parts) are consumed in the same Eventually tick instead of each burning a 25ms interval. This is defense-in-depth for slow CI runners, not a standalone bug fix. Closes coder/internal#1523 Closes CODAGT-425	2026-05-25 16:55:48 +03:00
Sas Swart	3bf5f80277	feat(coderd/database): add boundary_sessions and boundary_logs tables (#25441 ) RFC: [Bridge ↔ Boundaries Correlation RFC](https://www.notion.so/coderhq/Gateway-and-Firewall-Correlation-RFC-31ad579be592803aa8b3d48348ccdde9) Add up/down migrations and matching sqlc queries for persisting Boundary audit events, as specified in the Bridge/Boundaries Correlation RFC. Tables: - `boundary_sessions`: session metadata with `workspace_agent_id` FK, `confined_process_name`, and timestamps (`started_at`, `updated_at`). ID is externally supplied by the Boundary process (no DB-side default). Created lazily when the first log for a session arrives. - `boundary_logs`: individual audit events with `session_id` FK, `sequence_number` (INT, primary ordering key), protocol/method/detail fields, and `matched_rule` (nullable; non-NULL implies allowed). Indexes (per RFC): - `(session_id, sequence_number)` for the ordering query path - `(captured_at)` for the retention purge path Queries: - `InsertBoundarySession` / `GetBoundarySessionByID` - `InsertBoundaryLog` / `GetBoundaryLogByID` - `ListBoundaryLogsBySessionID` with nullable `seq_after`/`seq_before` exclusive bounds for fetching events between two known interception sequence numbers - `DeleteOldBoundaryLogs` with row limit to avoid long-running transactions Also includes: dbgen helpers (`BoundarySession`, `BoundaryLog`), dbauthz implementations (reads gated on `ResourceAuditLog`, deletes on `ResourceSystem`), and all generated wrappers (dbmock, dbmetrics). No callers yet. A follow-up PR will add the dedicated `boundary_log` RBAC resource type. > Generated by Coder Agents	2026-05-25 11:14:36 +02:00
Danny Kopping	eddd4a8c2f	feat(coderd): accept delegated API key ID from in-process aibridge callers (#25625 ) Allows an `api_key_id` to be passed from a trusted in-memory transport (currently: `chatd`) to `aibridged` for use in authenticating LLM requests. This value can _only_ be passed via context, and all users of the in-memory transport _must_ provide it. It can be used in conjunction with BYOK headers. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 11:08:07 +02:00
Michael Suchacz	6739542875	test(coderd/x/chatd): skip signal wake send flake (#25633 ) Skips `TestSignalWakeSendMessage`, which flakes because the current chatd control notification flow can deliver stale status notifications after a new processing run starts. This mirrors the existing CODAGT-353 skips for the same stale-notification class and leaves the deterministic fix to that notification-flow refactor. Refs https://linear.app/codercom/issue/ENG-2727/flake-testsignalwakesendmessage > Generated by Coder Agents on behalf of @ibetitsmike.	2026-05-22 23:10:31 +00:00
Danny Kopping	0d9718e217	feat: add 'copilot' to ai_provider_type (#25616 )	2026-05-22 16:10:37 +02:00
Michael Suchacz	de6d62815e	fix(coderd): avoid redundant workspace setup (#25615 ) GPT-class chat turns could eagerly create workspaces or repeat setup such as cloning an existing repo because the system prompt framed setup work as the default path. This updates chatd prompt guidance and the `create_workspace` tool description so agents reuse existing chat and workspace context, treat injected workspace context as already read, avoid recloning present repositories, and create or start workspaces only when workspace-backed work is required. Delegated chats now report workspace needs to the parent instead of trying to create one. > Mux opened this PR on behalf of Mike.	2026-05-22 14:08:07 +00:00
Michael Suchacz	bdf2698fcd	fix: parse skill frontmatter as YAML (#25610 )	2026-05-22 15:09:30 +02:00


3937 Commits