coder

mirror of https://github.com/coder/coder.git synced 2026-06-02 20:48:20 +00:00

Author	SHA1	Message	Date
Danny Kopping	c8555e2163	fix: deprecate ai provider seeding env config (#25854 ) Environment variables used to configure AI Gateway providers are now deprecated, and we need to reflect this as such.	2026-06-01 15:15:47 +02:00
Danny Kopping	12520ee964	feat: add ai provider status and reload freshness metrics (#25770 ) Add metrics for `aibridged` and `aibridgeproxyd`'s provider statuses. AI providers can be modified, and possibly misconfigured, at runtime. These metrics help operators understand the state of these provider definitions in case unexpected behaviour is observed.	2026-05-28 14:57:33 +02:00
Danny Kopping	2770bdc9d1	feat: route extra ai_provider_types through OpenAI and Anthropic providers (#25722 ) _Disclosure:_ _produced_ _with_ _Claude_ _Opus_ _4\.7_ AI Gateway only supports Anthropic (+Bedrock), OpenAI, and Copilot providers at present. All other types (Vercel, Gemini, etc) will be mapped to OpenAI since they support OpenAI-compatible endpoints.	2026-05-27 16:16:05 +02:00
Danny Kopping	79e007cf30	feat: hot-reload aibridged and aibridgeproxyd providers on DB changes (#25673 ) Previously the in-process aibridge daemon and the enterprise aibridgeproxy daemon both snapshotted their provider routing once at boot. Any `ai_providers` or `ai_provider_keys` mutation required a restart for either to pick it up. Add an `ai_providers_changed` pubsub channel that the CRUD handlers publish on after Create / Update / Delete. Both daemons subscribe: - aibridged rebuilds its `[]aibridge.Provider` snapshot via `BuildProviders` and swaps it into the pool atomically. Inflight requests keep serving against the bridge they already acquired; new acquires build against the new snapshot. Per-provider construction errors stay scoped to the offending row. - aibridgeproxyd rebuilds its routing snapshot from `GetAIProviders` and swaps the host→provider map atomically. The MITM listener picks up new providers without restart. DB read for aibridgeproxyd uses the existing `AsAIProviderMetadataReader` subject for routing-only access.	2026-05-27 11:58:43 +02:00
Ethan	e91bec8574	fix(cli): close aibridge daemon before WebSocket shutdown wait (#25719 ) > [!WARNING] > The investigation and solution in this PR were done with [Mux](https://mux.coder.com/). I've reviewed the investigation methodology, evidence and solution, and it all appears sound. ## Summary PR #25570 (`refactor: move aibridged out of enterprise to AGPL`, merged 2026-05-22) added an in-memory aibridge DRPC server in `coderd/aibridged.go` that does `api.WebsocketWaitGroup.Add(1)` and only releases `Done()` when its client session is closed. PR #25575 then flipped `CODER_AI_GATEWAY_ENABLED` to default to `true`, so every `cli.Server()` invocation now spins up that goroutine. In `cli/server.go`, the only call to `aibridgeDaemon.Close()` was a `defer` scheduled at function return. During graceful shutdown the code first calls `coderAPICloser.Close()`, which waits on `api.WebsocketWaitGroup`. That wait sits for the full 10s timeout in `coderd/coderd.go` (`websocket shutdown timed out after 10 seconds`), then returns, then the function unwinds, and only then does the deferred `aibridgeDaemon.Close()` fire and let the goroutine call `Done()`. The 10s tax was previously latent (aibridged was enterprise-only and opt-in). After the two May 22 PRs it hit every `cli.Server()` test. On Linux/macOS CI it just makes the suite slower; on the Depot Windows runner, the ramdisk reservation leaves only ~17 GiB of headroom and the ~10s shutdown tails of multiple concurrent package binaries overlap into an OOM, presenting as `test-go-pg (windows-2022)` jobs that die silently at the ~600s watchdog with an empty `steps` array. See Slack: https://codercom.slack.com/archives/C05AE94121Z/p1779807717764189 ## Fix Close `aibridgeDaemon` explicitly during graceful shutdown, before `coderAPICloser.Close()` waits on the WebSocket wait group. This matches the existing ordered-shutdown pattern used for `tunnel` and `notificationsManager`. The deferred `aibridgeDaemon.Close()` is retained as a safety net for early-return paths, and is safe to double-call because `aibridged.Server.Close()` is already idempotent via `shutdownOnce` in `coderd/aibridged/aibridged.go`. ## Regression test `TestServer_AIGatewayShutdownOrdering` boots a real `coder server` with `--ai-gateway-enabled=true`, cancels its context, and asserts graceful shutdown finishes in under 8s. With the fix the test runs in ~0.1s; without the fix it fails deterministically at ~10.0s. The flag is passed explicitly so the test continues to guard the ordering even if the deployment default is ever flipped back. ## Evidence this fixes the OOM On Linux the patched `cli` test package drops from 114 s back to its pre-regression 30 s wall time at the same single-process peak RSS (~7.6 GiB), and the `websocket shutdown timed out after 10 seconds` log line disappears from every server-test run. Since the Windows OOM is the sum of multiple concurrent 10 s shutdown tails overlapping past the runner's ~17 GiB headroom, removing those tails returns the concurrent-RSS budget to its pre-regression level. The Windows OOM was intermittent (a handful of hits across many runs since May 22), so a single green `test-go-pg (windows-2022)` job on this PR is not by itself proof; confirmation will come from watching Windows runs on `main` over the next several days and seeing the ~600 s silent-kill fingerprint stop recurring. Relates to ENG-2771	2026-05-27 17:33:14 +10:00
Danny Kopping	a56c88a0cc	fix: run AI provider seed and build after newAPI so dbcrypt applies (#25699 ) ## Problem Two related symptoms of the same architectural issue: the `dbcrypt` wrapper is installed inside `enterprise/coderd.New`, so any access to `options.Database` that happens before `newAPI` runs bypasses encryption. Symptom 1 (reads): Provider keys added via the admin UI are encrypted at rest. `BuildProviders` was running before `newAPI`, against the unwrapped store, so the ciphertext was read as-is and shoved into the keypool as the upstream credential. Anthropic/OpenAI reject it, and the interception log shows: ``` coderd.aibridged.pool: interception failed ... error="all configured keys failed authentication" credential_kind=centralized credential_hint=PaPb...4A== credential_length=184 ``` Symptom 2 (writes): `SeedAIProvidersFromEnv` was also running before `newAPI`, against the unwrapped store, so env-derived keys (`CODER_AIBRIDGE_OPENAI_KEY`, indexed `CODER_AIBRIDGE_PROVIDER_<N>_KEY`, etc.) landed in `ai_provider_keys` as plaintext with `ApiKeyKeyID = null` even when `CODER_EXTERNAL_TOKEN_ENCRYPTION_KEYS` was set. ## Fix Move both `SeedAIProvidersFromEnv` and `BuildProviders` to after `newAPI`, where `options.Database` is the dbcrypt-wrapped store. Writes encrypt correctly; reads decrypt correctly. The enterprise closure (`enterprise/cli/server.go`) runs inside `newAPI` and calls `BuildProviders` for the aibridgeproxyd at that point. Once the agpl seed moves to after `newAPI`, the proxy on first boot would see no env-seeded providers. Add a matching seed call inside the enterprise closure before its `BuildProviders` to cover that case. Seeding is idempotent, so the agpl-side seed running again post-`newAPI` is a no-op when the rows already exist. ## Known shortcomings The clean version of this fix would just inherit `ctx` like every other startup step and place these calls naturally. It can't, for two reasons that are both about the surrounding handler architecture rather than this change: 1. `dbcrypt` wrapping is positioned inside `newAPI`, not around `options.Database` at creation. That's why both seed and build have to wait until after `newAPI` in the first place. The principled fix is to install the wrapper at the point the store is created (behind a hook the enterprise build supplies), so every consumer sees a single authoritative view and the ordering stops mattering. This would also collapse the duplicated seed call back to a single site. 2. The handler's shutdown sequence is not deferred. `coderAPICloser.Close()` and the other teardown steps run only if control reaches the `select` at the bottom of the handler. An early `return` from anywhere in Phase 1 (e.g. seed/build returning `context.Canceled` when the user hits ctrl-c during startup) skips that block and orphans all the goroutines `newAPI` spawned — tailnet workers, gitsync, telemetry batcher, etc. `goleak` then catches them at package teardown and `TestServer_TelemetryDisabled_FinalReport` fails. Moving the shutdown into deferred closers (with a `sync.Once`-guarded close to avoid double-close from the explicit Phase 2 call) is the principled fix. For this PR I took the smallest change that fixes the reported bugs: a detached context (`context.WithoutCancel(ctx)` + a 30s timeout) at the seed and build call sites in both the agpl and enterprise paths. It lets the calls complete even if the user cancels during startup, after which the handler reaches its shutdown select naturally and tears down through Phase 2. Both shortcomings above are worth addressing separately. ## Test plan - `make test RUN=TestServer_TelemetryDisabled_FinalReport` with `-race`; passes locally with `-count=3`. - Manually verified on a deployment with `CODER_EXTERNAL_TOKEN_ENCRYPTION_KEYS` set and env-configured providers: `ai_provider_keys.api_key_key_id` is populated, `api_key` is base64 ciphertext, and upstream auth succeeds. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 21:27:02 +02:00
Danny Kopping	282ab7de34	refactor: load AI providers from the database at startup (#25672 ) Replace the env-based `BuildProviders` with a DB-backed loader. The database is now the single source of truth for runtime provider configuration; env config arrives via `SeedAIProvidersFromEnv` (run at boot) and `BuildProviders` reads it back as `aibridge.Provider` instances. `cli/server.go` and `enterprise/cli/server.go` both call the same path, so aibridged and aibridgeproxyd see the same provider set. Per-provider `DumpDir` is replaced by a top-level `CODER_AI_GATEWAY_DUMP_DIR` base; each provider's effective dump path is `<base>/<provider name>`.	2026-05-26 15:57:01 +02:00
Danny Kopping	ef6ee2af68	chore: tolerate empty providers at startup and log env seeds (#25605 ) Since AI Gateway is now enabled by default, and if the AI Gateway Proxy is enabled too it's possible the server can start without any configured providers. This would previously block startup, which is unacceptable. In an upstack PR we will handle reloading the providers at runtime, so the server needs to be able to start up even if it can't handle any proxy requests to AI Gateway. This change was necessitated because if there are providers configured in the environment they need to be seeded _before_ the proxy starts.	2026-05-22 12:45:14 +02:00
Danny Kopping	ddec110b0e	refactor: move aibridged out of enterprise to AGPL (#25570 ) In order to allow Coder Agents to use AI Gateway in OSS, we need to rehome the `aibridged`\-related code into the AGPL path. The HTTP API is only registered under enterprise so will still require the AI Governance Add-on to be present in order to use it, whereas Coder Agents uses an in-memory pipe to the same handlers.	2026-05-22 09:11:37 +02:00
Danny Kopping	9341efec9f	feat!: seed ai_providers from env on server startup (#24895 ) _Disclaimer: implemented by a Coder Agent using Claude Opus 4.7_ Part of the implementation of [RFC: Common AI Provider Configs](https://www.notion.so/coderhq/RFC-Common-AI-Provider-Configs-34bd579be59280ed958feffb82024797) (AIGOV-201). ## Note This change can cause a previously working installation to fail to start should a conflict exist between the providers configured in the environment & those now migrated to the database. I'll raise a PR upstack to document this process and workarounds should a startup fail. ## What this PR does Reconciles environment-derived AI provider configuration with the `ai_providers` table at server startup. The seed runs before the aibridged daemon is initialized, so the runtime always reads providers from the database; the legacy `CODER_AIBRIDGE_` environment variables become a one-shot migration source. ### Behavior - Concurrent server starts are serialized through a Postgres advisory lock (`LockIDAIProvidersEnvSeed`). - Missing rows are inserted with an audit entry attributed to the system actor. - Existing rows whose canonical hash matches the env-derived hash are left alone (the common no-op restart path). - Existing rows whose canonical hash does not* match cause server startup to fail with a descriptive error so the operator can explicitly resolve the conflict in either env or DB. - Soft-deleted rows are NOT resurrected from env; an explicit operator deletion is sticky across restarts. - Indexed providers whose name conflicts with a legacy env var fail startup with a clear remediation message. - Unknown provider types (e.g. `copilot`, until the DB enum is widened) are skipped with a log entry rather than failing startup. ### Canonical hashing The `canonicalAIProvider` shape captures exactly the fields that determine runtime behavior — `type`, `base_url`, and the Bedrock subset of settings (access key, access key secret, region, model, small fast model) — and is hashed with SHA-256. The hash is computed on demand from the row + env, never persisted, so the database does not need a new column for it. API keys live in the separate `ai_provider_keys` table and are intentionally excluded from the hash so operators can rotate keys via the API without forcing a server restart. <details> <summary>Decision log</summary> - The hash is intentionally not persisted in the database. The RFC discussed this trade-off; computing on demand keeps the schema minimal and lets the canonical shape evolve without a migration. - The lock uses an `iota` slot in `coderd/database/lock.go` rather than `GenLockID` so it's stable, easy to audit, and matches the convention used for every other startup lock. - A bearer-token Anthropic provider whose env vars also set Bedrock metadata but no AWS credentials does NOT store the Bedrock fields. Without credentials the discriminated settings would misrepresent the row as Bedrock auth. - We deliberately do NOT publish to the `ai_providers_changed` pubsub channel from the seed because the seed completes before any subscriber is started; the follow-up PR introduces that channel. </details>	2026-05-22 08:37:27 +02:00
Paweł Banaszewski	46e93e6325	chore: add ai_gateway options that alias aibridge options (#25061 ) Adds options matching new AI Gateway naming. New options are added as alias for old options. Old options are still working. Old options have deprecated message. No conflict detection was added. Updated documentation so it mentions only new options. Added note about old options still working. > Various AI tools where used to create this PR	2026-05-21 11:14:11 +02:00
Danny Kopping	841b777ccd	feat: add ai_providers table, queries, dbauthz, audit, RBAC (#24892 )	2026-05-14 16:10:46 +02:00
Susana Ferreira	101a4082dd	feat: support multiple keys per AI Bridge provider (#24683 ) ## Description Adds support for configuring multiple API keys per AI Bridge provider. This PR introduces the configuration parsing and validation only; wiring the key pools into the aibridge providers will happen in upstream PRs. ## Changes Providers now accept a comma-separated list of keys via the `KEYS` env var (or a single key via the existing `KEY` var). The two are mutually exclusive. Bedrock follows the same pattern with `BEDROCK_ACCESS_KEYS` / `BEDROCK_ACCESS_KEY_SECRETS`, with an additional validation that the two slices have matching lengths. Key validation at startup checks for empty values, duplicates, and a maximum of 5 keys per provider. Related to: https://github.com/coder/internal/issues/1445 > [!NOTE] > Initially generated by Coder Agents, modified and reviewed by @ssncferreira	2026-04-30 09:19:32 +01:00
Paweł Banaszewski	a24dc19d49	chore: clean up env var usage in aibridge (#24783 ) > AI tools where used when creating this PR This PR removes environment variable parsing from `/aibridge` directory. Added env variables/flags for dump dir as coder options. Only added to new indexed provider options (`CODER_AIBRIDGE_PROVIDER_<N>_`) not to deprecated legacy env variables (`CODER_AIBRIDGE_ANTHROPIC_` and `CODER_AIBRIDGE_OPENAI_KEY_*`). Reverted adding `MaxRetries` option as it will be removed soon due to key failover work: https://github.com/coder/coder/pull/24783#discussion_r3155544808	2026-04-29 18:28:37 +02:00
George K	3f0e015fe5	fix: allow coderd to start with an empty DERP map when built-in DERP is disabled (#24544 ) Allow coderd to start with an empty base DERP map when built-in DERP is disabled and no static DERP map is configured, so DERP can come from workspace proxies after startup. Also add a DERP healthcheck warning when no DERP servers are currently available at runtime. Related to: https://linear.app/codercom/issue/PLAT-43/bug-coderd-unable-to-be-started-if-built-in-derp-server-disabled-and Related to: https://github.com/coder/coder/issues/22324	2026-04-28 09:17:08 -07:00
Cian Johnston	70d6efa311	feat: chat auto-archive owner digest notifications (#24643 ) Depends on #24642 Adds per-owner digest notifications onto the chat auto-archive subsystem. Each tick's archived rows are grouped by owner, the top 25 titles per owner are rendered into a new `Chats Auto-Archived` notification template, and any remainder surfaces as `and N more`. Each digest is per-tick, so users with large amounts of purgeable data may get multiple notifications in sequence (one per user per tick). The template body branches on `retention_days`: when retention is disabled (`retention_days=0`), users are told archived chats are kept indefinitely rather than falsely claiming imminent deletion. ### Changes - migration `000XXX_chat_auto_archive_notification_template` adds new notification template - `dbpurge`: threads `notifications.Enqueuer` through `New`; and enqueues notification message. - `cli/server.go`: passes `options.NotificationsEnqueuer` into `dbpurge.New`. - `coderd/notifications/events.go`: new `TemplateChatAutoArchiveDigest` UUID. - `coderd/inboxnotifications.go`: inbox registration. - Docs: adds a `Notifications` section to `chat-auto-archive.md`. > 🤖	2026-04-28 08:56:36 +01:00
Cian Johnston	a876287d36	feat: auto-archive inactive chats with audit trail (#24642 ) Adds a background job in `dbpurge` that periodically archives chats inactive beyond a configurable threshold. Each archived root chat gets a background audit entry tagged `chat_auto_archive`. Disabled by default. * New `AutoArchiveInactiveChats` SQL query with LATERAL last-activity subquery and partial index on archive candidates * `site_configs`-backed `auto_archive_days` setting with admin-only PUT, any-authenticated-user GET * Cascade archive via `root_chat_id`; pinned chats and active threads exempt * Root-only audit dispatch on detached context, matching manual archive (`patchChat`) behavior * 11 subtests covering disabled no-op, boundary, deleted messages, child activity, pinned exemption, multi-owner, idempotency, and batch pagination PR #24643 adds per-owner digest notifications. PR #24704 adds the requisite UI controls. > 🤖	2026-04-24 14:18:28 +01:00
Paweł Banaszewski	e00e85765b	chore: move aibridge library code into coder repo (#24190 ) This PR merges code from `coder/aibridge` repository into `coder/coder`. It was split into 4 PRs for easier review but stacked PRs will need to be merged into this PR so all checks pass. * https://github.com/coder/coder/pull/24190 -> raw code copy (this PR, before merging PRs on top of it, it was just 1 commit: https://github.com/coder/coder/commit/70d33f33200c7e77df910957595715f81f9bec24) * https://github.com/coder/coder/pull/24570 -> update imports in `coder/coder` to use copied code * https://github.com/coder/coder/pull/24586 -> linter fixes and CI integration (also added README.md) * https://github.com/coder/coder/pull/24571 -> added exclude to scripts/check_emdash.sh check Original PR message (before PR squash): Moves coder/aibridge code into coder/coder repository. Omitted files: - `go.mod`, `go.sum`, `.gitignore`, `.github/workflows/ci.yml,` `Makefile`, `LICENSE`, `README.md` (modified README.md is added later) - `.github`, `example`, `buildinfo,` `scripts` directories Simple verification script (will list omitted files) ``` tmp=$(mktemp -d) echo "$tmp" git clone --depth=1 https://github.com/coder/aibridge "$tmp/aibridge" git clone --depth=1 --branch pb/aibridge-code-move https://github.com/coder/coder "$tmp/coder" diff -rq --exclude=.git "$tmp/aibridge" "$tmp/coder/aibridge" # rm -rf "$tmp" ```	2026-04-22 17:01:01 +02:00
Danny Kopping	914a0f7830	chore: follow-ups from #23948 (#24377 ) A couple follow-ups from #23948 --------- Signed-off-by: Danny Kopping <danny@coder.com>	2026-04-16 14:08:23 +02:00
Danny Kopping	48b90f8cc8	feat: add coder_build_info metric (#24365 ) _Disclaimer: produced by Claude Opus 4.6_ Adds a `coder_build_info` metric which allows operators to see which versions of Coder are currently running. --------- Signed-off-by: Danny Kopping <danny@coder.com>	2026-04-15 12:48:38 +00:00
Danny Kopping	08045c2aac	feat: configure multiple AI Bridge providers of the same type (#23948 ) _Disclaimer: produced mostly by Claude Opus 4.6 following detailed planning._ ## Summary - Support multiple instances of the same AI Bridge provider type via indexed env vars (`CODER_AIBRIDGE_PROVIDER_<N>_<KEY>`), following the `CODER_EXTERNAL_AUTH_<N>_<KEY>` pattern - Existing single-provider env vars (`CODER_AIBRIDGE_OPENAI_KEY`, etc.) continue to work unchanged - Setting both a legacy env var and an indexed provider with the same name errors at startup to prevent silent misconfiguration - Mark legacy provider fields (`OpenAI`, `Anthropic`, `Bedrock`) as deprecated in `AIBridgeConfig` in favor of `Providers` ## Example ```sh CODER_AIBRIDGE_PROVIDER_0_TYPE=anthropic CODER_AIBRIDGE_PROVIDER_0_NAME=anthropic-corp CODER_AIBRIDGE_PROVIDER_0_KEY=sk-ant-corp-xxx CODER_AIBRIDGE_PROVIDER_0_BASE_URL=https://llm-proxy.internal.example.com/anthropic CODER_AIBRIDGE_PROVIDER_1_TYPE=anthropic CODER_AIBRIDGE_PROVIDER_1_NAME=anthropic-direct CODER_AIBRIDGE_PROVIDER_1_KEY=sk-ant-direct-yyy ``` Each instance is routed by name: - /api/v2/aibridge/anthropic-corp/v1/messages - /api/v2/aibridge/anthropic-direct/v1/messages Closes [AIGOV-157](https://linear.app/codercom/issue/AIGOV-157/spike-to-understand-if-there-is-a-simple-way-to-handle-multi-api-key) --------- Signed-off-by: Danny Kopping <danny@coder.com>	2026-04-15 07:59:37 +00:00
Cian Johnston	116323d3cf	feat: graduate web-push from experiment to always-on (#24310 ) * Removes experiment `web-push`. * Falls back to NoopWebpusher in case of error * Checks browser capability in FE * Adds note to agents getting-started docs regarding webpush without TLS > 🤖	2026-04-14 09:07:06 +01:00
Cian Johnston	81fe7543b4	chore: set tls.VersionTLS12 MinVersion in cli/server.go to address gosec warning (#23646 ) I was investigating `//nolint` comments and this one popped up. It raised my eyebrows enough to warrant its own PR.	2026-03-26 14:53:47 +00:00
Cian Johnston	847a88c6ca	chore: clean up stale and dangerous //nolint comments (#23643 ) ## Changes - Commit 1: Remove 17 unnecessary `//nolint` directives: - `//nolint:varnamelen` — linter not active - `//nolint:unused` on exported `SlimUnsupported` - `//nolint:govet` in `coderd/httpmw/csrf` — no longer fires - `//nolint:revive` on functions refactored since the nolint was added - `//nolint:paralleltest` citing Go 1.22 loop variable capture (obsolete) - Bare `//nolint` narrowed to specific `//nolint:gocritic` with justification - Commit 2: Fix root causes behind 5 dangerous nolint suppressions: - Add `MinVersion: tls.VersionTLS12` to TLS client config (removes `gosec` G402) - Delete trivial unexported wrappers `apiKey()`/`normalizeProvider()` in chatprovider (removes `revive` confusing-naming) - Add doc comments to `StartWithAssert` and `Router` (removes `revive` exported) - Rename unused parameters to `_` in integration test helpers > 🤖 This PR was created using Coder Agents and reviewed by me.	2026-03-26 14:13:53 +00:00
Mathias Fredriksson	147df5c971	refactor: replace sort.Strings with slices.Sort (#23457 ) The slices package provides type-safe generic replacements for the old typed sort convenience functions. The codebase already uses slices.Sort in 43 call sites; this finishes the migration for the remaining 29. - sort.Strings(x) -> slices.Sort(x) - sort.Float64s(x) -> slices.Sort(x) - sort.StringsAreSorted(x) -> slices.IsSorted(x)	2026-03-23 23:19:23 +02:00
Cian Johnston	bc27274aba	feat(coderd): refactors github pr sync functionality (#22715 ) - Adds `_API_BASE_URL` to `CODER_EXTERNAL_AUTH_CONFIG_` - Extracts and refactors existing GitHub PR sync logic to new packages `coderd/gitsync` and `coderd/externalauth/gitprovider` - Associated wiring and tests Created using Opus 4.6	2026-03-10 18:46:01 +00:00
Kyle Carberry	a6b9a25f82	fix(cli): bypass access URL redirect for inter-replica chat relay (#22635 ) ## Summary Fixes cross-replica chat relay failing with: ``` failed to open initial relay for chat stream error= dial relay stream: - failed to WebSocket dial: expected handshake response status code 101 but got 200 failed to open relay for message parts error= dial relay stream: - failed to WebSocket dial: expected handshake response status code 101 but got 200 ``` Subscribers see accurate `status=running` (delivered via pubsub) but miss all in-progress `message_part` events (delivered only via the relay WebSocket that never connects). ## Root cause `redirectToAccessURL` in `cli/server.go` redirects any request whose `Host` header doesn't match the access URL. The enterprise chat relay dials another replica directly via its DERP relay address (e.g. `http://10.0.0.2:8080`), so the `Host` header is the pod IP — not the access URL. This triggers a 307 redirect to the access URL. The WebSocket library follows the redirect, but the second request is a plain GET — `Connection: Upgrade` and `Upgrade: websocket` headers are not carried over by HTTP redirect semantics. The load-balanced access URL routes the plain GET to any replica, which serves the SPA catch-all handler and returns HTTP 200 with `index.html`. The WebSocket library then fails: `expected handshake response status code 101 but got 200`. DERP mesh already has an exemption for this exact scenario (`isDERPPath`). Chat relay was added later and didn't get one. ## Fix Bypass `redirectToAccessURL` for requests that carry the `X-Coder-Relay-Source-Replica` header, which the enterprise relay already sets on every request (`enterprise/coderd/chatd/chatd.go:573`). ## Sequence diagram Before (broken): ``` Replica A (subscriber) Replica B (worker) Load Balancer \| \| \| \|--- WS dial pod-ip:8080 ----->\| \| \| \|-- 307 redirect to LB --->\| \| \| \| \|<----------- plain GET (no Upgrade headers) ------------->\| \| \| \|-- routes to any replica \|<----------- 200 index.html -------------------------------\| \| \| X 'expected 101 but got 200' \| ``` After (fixed): ``` Replica A (subscriber) Replica B (worker) \| \| \|--- WS dial pod-ip:8080 ----->\| \| (X-Coder-Relay-Source- \| \| Replica header set) \| \| \|-- bypass redirect \|<--------- 101 Upgrade ------\| \|<==== message_part events ====\| ```	2026-03-04 20:26:03 -05:00
Cian Johnston	517cb0ce73	refactor(webpush): use RequireExperimentWithDevBypass middleware (#22525 ) Replace manual experiment checks in web-push handlers with the `RequireExperimentWithDevBypass` middleware on the route group, matching the pattern used by OAuth2, Agents, and MCP experiments. ## Changes - `coderd/coderd.go`: Add `RequireExperimentWithDevBypass` middleware to `/webpush` route group - `coderd/webpush.go`: Remove inline `api.Experiments.Enabled(codersdk.ExperimentWebPush)` checks from all three handlers - `cli/server.go`: Gate webpush dispatcher initialization with `buildinfo.IsDev()` fallback so dev builds always init the real dispatcher - `coderd/webpush_test.go`: Remove experiment enablement from tests (dev bypass handles it) Net effect: -26 lines removed, +5 added. Created using whatchamacallits (Opus 4.6 Max)	2026-03-03 09:49:04 +00:00
Kyle Carberry	edee917d88	feat: add experimental agents support (#22290 ) feat: add AI chat system with agent tools and chat UI Introduce the chatd subsystem and Agents UI for AI-powered chat within Coder workspaces. - Add chatd package with chat loop, message compaction, prompt management, and LLM provider integration (OpenAI, Anthropic) - Add agent tools: create workspace, list/read templates, read/write/ edit files, execute commands - Add chat API endpoints with streaming, message editing, and durable reconnection - Add database schema and migrations for chats, chat messages, chat providers, and chat model configs - Add RBAC policies and dbauthz enforcement for chat resources - Add Agents UI pages with conversation timeline, queued messages list, diff viewer, and model configuration panel - Add comprehensive test coverage including coderd integration tests, chatd unit tests, and Storybook stories - Gate feature behind experiments flag --------- Co-authored-by: Cian Johnston <cian@coder.com> Co-authored-by: Danielle Maywood <danielle@themaywoods.com> Co-authored-by: Jeremy Ruppel <jeremy@coder.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-27 16:50:56 +00:00
Steven Masley	efdaaa2c8f	chore: add oidc redirect url to override access url (#21521 ) If a deployment has 2 domains, overriding the oidc url allows the oidc redirect to differ from the access_url response to https://github.com/coder/coder/discussions/21500 This config setting is hidden by default	2026-02-20 09:11:01 -06:00
Callum Styan	5f3be6b288	feat: add provisioner job queue wait time histogram and jobs enqueued counter (#21869 ) This PR adds some metrics to help identify job enqueue rates and latencies. This work was initiated as a way to help reduce the cost of the observation/measurement itself for autostart scaletests, which impacts our ability to identify/reason about the load caused by autostart. See: https://github.com/coder/internal/issues/1209 I've extended the metrics here to account for regular user initiated builds, prebuilds, autostarts, etc. IMO there is still the question here of whether we want to include or need the `transition` label, which is only present on workspace builds. Including it does lead to an increase in cardinality, and in the case of the histogram (when not using native histograms) that's at least a few extra series for every bucket. We could remove the transition label there but keep it on the counter. Additionally, the histogram is currently observing latencies for other jobs, such as template builds/version imports, those do not have a transition type associated with them. Tested briefly in a workspace, can see metric values like the following: - `coderd_workspace_builds_enqueued_total{build_reason="autostart",provisioner_type="terraform",status="success",transition="start"} 1` - `coderd_provisioner_job_queue_wait_seconds_bucket{build_reason="autostart",job_type="workspace_build",provisioner_type="terraform",transition="start",le="0.025"} 1` --------- Signed-off-by: Callum Styan <callumstyan@gmail.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-12 13:40:47 -08:00
Garrett Delfosse	953a6159a4	fix: increase retry attempts for builtin postgres port conflicts (#21796 ) ## Summary Fixes flaky `TestServer/BuiltinPostgres` test caused by port conflicts in CI. ## Fix Increase retry attempts from 3 to 10 for better odds when port conflicts occur. Fixes https://github.com/coder/internal/issues/1017	2026-02-05 13:36:32 -05:00
Danny Kopping	49a42eff5c	feat: make database connection pool size configurable (#21403 ) Closes https://github.com/coder/coder/issues/21360 A few considerations/notes: - I've kept the number of conns to 10 in all other places, except coderd - which uses the config value - I opted to also make idle conns configurable; the greater the delta between max open and max idle, the more connection churn - Postgres maintains a [_process_ per connection](https://www.postgresql.org/docs/current/connect-estab.html), contrary to what the comment said previously - Operators should be able to tune this, since process churn can negatively affect OS scheduling - I've set the value to `"auto"` by default so it's not another knob one _has to_ twiddle, and sets max idle = max conns / 3 --------- Signed-off-by: Danny Kopping <danny@coder.com>	2026-01-13 10:50:57 +02:00
Spike Curtis	bddb808b25	chore: arrange imports in a standard way (#21452 ) Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example: ``` import ( "context" "time" "github.com/prometheus/client_golang/prometheus" "golang.org/x/xerrors" "gopkg.in/natefinch/lumberjack.v2" "cdr.dev/slog/v3" "github.com/coder/coder/v2/codersdk/agentsdk" "github.com/coder/serpent" ) ``` 3 groups: standard library, 3rd partly libs, Coder libs. This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.	2026-01-08 15:24:11 +04:00
Spike Curtis	49b34a716a	fix: fix slog to always use array of Fields (#21426 ) Upgrades to slog v3 which includes a small, but backward incompatible API change to the acceptible call arguments when logging. This change allows us to verify via compile time type checking that arguments are correct and won't cause a panic, as was possible in slog v1, which this replaces (v2 was tagged but never used in coder/coder). It also updates dependencies that also use slog and were updated. I've left the `aibridge` dependency as a commit SHA, under the assumption that the team there (cc @pawbana @dannykopping ) will tag and update the dependency soon and on their own schedule. Other dependencies, I pushed new tags.	2026-01-08 10:29:41 +04:00
Jake Howell	00793cc0b5	feat: add prometheus observability metrics for `dbpurge` (#21074 ) Related to [`internal#1139`](https://github.com/coder/internal/issues/1139) This implements some prometheus metrics for records being removed from the database. Currently we're tracking the following fields being removed from the DB by this. They're viewable in the `/api/v2/debug/metrics` endpoint. * `expired_api_keys` * `aibridge_records` * `connection_logs` * `duration` ``` # HELP coderd_dbpurge_iteration_duration_seconds Duration of each dbpurge iteration in seconds. # TYPE coderd_dbpurge_iteration_duration_seconds histogram coderd_dbpurge_iteration_duration_seconds_bucket{success="true",le="1"} 1 coderd_dbpurge_iteration_duration_seconds_bucket{success="true",le="5"} 1 coderd_dbpurge_iteration_duration_seconds_bucket{success="true",le="10"} 1 coderd_dbpurge_iteration_duration_seconds_bucket{success="true",le="30"} 1 coderd_dbpurge_iteration_duration_seconds_bucket{success="true",le="60"} 1 coderd_dbpurge_iteration_duration_seconds_bucket{success="true",le="300"} 1 coderd_dbpurge_iteration_duration_seconds_bucket{success="true",le="600"} 1 coderd_dbpurge_iteration_duration_seconds_bucket{success="true",le="+Inf"} 1 coderd_dbpurge_iteration_duration_seconds_sum{success="true"} 0.014787814 coderd_dbpurge_iteration_duration_seconds_count{success="true"} 1 # HELP coderd_dbpurge_records_purged_total Total number of records purged by type. # TYPE coderd_dbpurge_records_purged_total counter coderd_dbpurge_records_purged_total{record_type="aibridge_records"} 0 coderd_dbpurge_records_purged_total{record_type="audit_logs"} 0 coderd_dbpurge_records_purged_total{record_type="connection_logs"} 0 coderd_dbpurge_records_purged_total{record_type="expired_api_keys"} 0 coderd_dbpurge_records_purged_total{record_type="workspace_agent_logs"} 0 ``` \| Position \| Pull-request \| \| -------- \| ------------ \| \| ✅ \| [feat: add prometheus observability metrics for `dbpurge`](https://github.com/coder/coder/pull/21074) \| \| \| [feat: add rbac specificity for `dbpurge`](https://github.com/coder/coder/pull/21088) \|	2025-12-20 00:20:57 +11:00
Steven Masley	8fefd91e4a	feat!: support PKCE in the oauth2 client's auth/exchange flow (#21215 ) Breaking Change: Existing oauth apps might now use PKCE. If an unknown IdP type was being used, and it does not support PKCE, it will break. To fix, set the PKCE methods on the external auth to `none` ``` export CODER_EXTERNAL_AUTH_1_PKCE_METHODS=none ```	2025-12-15 17:41:47 +00:00
Zach	b4cc982cc2	fix: ensure embedded-postgres state is wiped between retries (#20809 ) Retries were previously added when starting embedded postgres to mitigate port allocation conflicts (we can't use an ephemeral port for tests). Retries alone seemingly did not fix the test flakes. A new failure mode appeared on the retries: timing out connecting to the database. When a port discovery error occurrs, embedded-postgres does not create the database. If the data directory exists on the next attempt, embedded-postgres will assume the database has already been created. This seems to cause the timeout error. Wipe all state between retries to ensure attempts execute the same logic that creates the database. [#658](https://github.com/coder/internal/issues/658)	2025-11-21 08:55:01 -07:00
Danny Kopping	5a7d4f69f6	feat: add configurable retention for aibridge (#20828 ) Closes https://github.com/coder/internal/issues/1134 --------- Signed-off-by: Danny Kopping <danny@coder.com>	2025-11-21 11:35:36 +02:00
Steven Masley	04727c06e8	chore: add experiment toggle for terraform workspace caching (#20559 ) Experiments passed to provisioners to determine behavior. This adds `--experiments` flag to provisioner daemons. Prior to this, provisioners had no method to turn on/off experiments.	2025-11-12 14:26:15 -06:00
Zach	e73f9d356b	fix: retry embedded postgres port allocation (#20371 ) Sometimes tests would fail because the port embedded postgres tries to use is already in use. This is because there's no way to tell postgres to use an ephemeral port in tests. This change adds retries to starting embedded postgres when the port is not explicitly defined (e.g. tests) which should rid of, or at least significantly reduce, these flakes. https://github.com/coder/internal/issues/658	2025-10-21 12:52:17 -06:00
Spike Curtis	1734dfd291	chore: cleanup unused Client in server command (#19762 ) As part of converting production code to use the new ClientBuilder, I noticed some dead code that creates a client with a URL for the only purpose of later accessing the URL. This PR removes the cruft.	2025-09-22 17:38:34 +04:00
Paweł Banaszewski	439b041780	feat: add best effort attempt to revoke oauth access token in external auth provider (#19775 ) Solves #15575 Adds OAuth access token revocation when unlinking external auth provider. Due to revocation not being consistently implemented by providers this is only best effort attempt. Unsuccessful revocation won't influence link removal.	2025-09-19 16:27:02 +02:00
Danny Kopping	348a2e0285	feat: add configs for external auth MCP usage + tool allow/denylist (#19794 ) Closes https://github.com/coder/internal/issues/988 The logic for allowing/denying tools can be found in https://github.com/coder/aibridge/pull/4/files#diff-330a6371a583dd8cadeed79b95499e3a87960ad8ea4d6a94061e8f88a44834c3 (`ProxyBase.filterAllowedTools`).	2025-09-16 20:31:29 +02:00
Thomas Kosiewski	088d14933c	feat: ensure OAuth2 refresh tokens outlive access tokens (#19769 )	2025-09-13 08:57:26 +02:00
Susana Ferreira	0ab345ca84	feat: add prebuild timing metrics to Prometheus (#19503 ) ## Description This PR introduces one counter and two histograms related to workspace creation and claiming. The goal is to provide clearer observability into how workspaces are created (regular vs prebuild) and the time cost of those operations. ### `coderd_workspace_creation_total` * Metric type: Counter * Name: `coderd_workspace_creation_total` * Labels: `organization_name`, `template_name`, `preset_name` This counter tracks whether a regular workspace (not created from a prebuild pool) was created using a preset or not. Currently, we already expose `coderd_prebuilt_workspaces_claimed_total` for claimed prebuilt workspaces, but we lack a comparable metric for regular workspace creations. This metric fills that gap, making it possible to compare regular creations against claims. Implementation notes: * Exposed as a `coderd_` metric, consistent with other workspace-related metrics (e.g. `coderd_api_workspace_latest_build`: https://github.com/coder/coder/blob/main/coderd/prometheusmetrics/prometheusmetrics.go#L149). * Every `defaultRefreshRate` (1 minute ), DB query `GetRegularWorkspaceCreateMetrics` is executed to fetch all regular workspaces (not created from a prebuild pool). * The counter is updated with the total from all time (not just since metric introduction). This differs from the histograms below, which only accumulate from their introduction forward. ### `coderd_workspace_creation_duration_seconds` & `coderd_prebuilt_workspace_claim_duration_seconds` * Metric types: Histogram * Names: * `coderd_workspace_creation_duration_seconds` * Labels: `organization_name`, `template_name`, `preset_name`, `type` (`regular`, `prebuild`) * `coderd_prebuilt_workspace_claim_duration_seconds` * Labels: `organization_name`, `template_name`, `preset_name` We already have `coderd_provisionerd_workspace_build_timings_seconds`, which tracks build run times for all workspace builds handled by the provisioner daemon. However, in the context of this issue, we are only interested in creation and claim build times, not all transitions; additionally, this metric does not include `preset_name`, and adding it there would significantly increase cardinality. Therefore, separate more focused metrics are introduced here: * `coderd_workspace_creation_duration_seconds`: Build time to create a workspace (either a regular workspace or the build into a prebuild pool, for prebuild initial provisioning build). * `coderd_prebuilt_workspace_claim_duration_seconds`: Time to claim a prebuilt workspace from the pool. The reason for two separate histograms is that: * Creation (regular or prebuild): provisioning builds with similar time magnitude, generally expected to take longer than a claim operation. * Claim: expected to be a much faster provisioning build. #### Native histogram usage Provisioning times vary widely between projects. Using static buckets risks unbalanced or poorly informative histograms. To address this, these metrics use [Prometheus native histograms](https://prometheus.io/docs/specs/native_histograms/): * First introduced in Prometheus v2.40.0 * Recommended stable usage from v2.45+ * Requires Go client `prometheus/client_golang` v1.15.0+ * Experimental and must be explicitly enabled on the server (`--enable-feature=native-histograms`) For compatibility, we also retain a classic bucket definition (aligned with the existing provisioner metric: https://github.com/coder/coder/blob/main/provisionerd/provisionerd.go#L182-L189). * If native histograms are enabled, Prometheus ingests the high-resolution histogram. * If not, it falls back to the predefined buckets. Implementation notes: * Unlike the counter, these histograms are updated in real-time at workspace build job completion. * They reflect data only from the point of introduction forward (no historical backfill). ## Relates to Closes: https://github.com/coder/coder/issues/19528 Native histograms tested in observability stack: https://github.com/coder/observability/pull/50	2025-08-28 15:00:26 +01:00
Steven Masley	8ba8b4f061	chore: add profiling labels for pprof analysis (#19232 ) PProf labels segment the code into groups for determing the source of cpu/memory profiles. Since the web server and background jobs share a lot of the same code (eg wsbuilder), it helps to know if the load is user induced, or background job based.	2025-08-07 11:21:17 -05:00
Dean Sheather	9a6dd73f68	feat: add managed agent license limit checks (#18937 ) - Adds a query for counting managed agent workspace builds between two timestamps - The "Actual" field in the feature entitlement for managed agents is now populated with the value read from the database - The wsbuilder package now validates AI agent usage against the limit when a license is installed Closes coder/internal#777	2025-07-22 13:39:26 +10:00
Hugo Dutka	3c2f3d640b	chore: remove dbmem (#18803 ) Remove the in-memory database. Addresses #15109.	2025-07-09 09:46:31 +02:00
ケイラ	09cc906981	chore: remove unnecessary redeclarations in for loops (part 2) (#18593 )	2025-06-26 12:28:00 -06:00

1 2 3 4 5 ...

392 Commits