coder

mirror of https://github.com/coder/coder.git synced 2026-06-05 14:08:20 +00:00

Author	SHA1	Message	Date
Asher	24ab216dd1	feat: add new group members endpoint with filtering and pagination (#23067 ) Partially addresses #21813 (still need to make changes to the "add user" button to be complete) Since there are a lot of user tests already, I moved them into `coderdtest` to be shared.	2026-03-20 12:43:03 -08:00
Jaayden Halko	6f244cddde	feat: display the addon license UI (#22948 ) --------- Co-authored-by: Steven Masley <stevenmasley@gmail.com>	2026-03-20 16:34:17 +00:00
Ethan	a1e912a763	fix(chatd): deliver retry control events via pubsub (#23349 ) > PR Stack > 1. #23351 ← `#23282` > 2. #23282 ← `#23275` > 3. #23275 ← `#23349` > 4. #23349 ← `main` (you are here) --- Retry events were published only to the local in-process stream via `publishEvent()`. When pubsub is active, `Subscribe()`'s merge loop only forwarded durable events (messages, status, errors) from pubsub notifications, so retry events were silently dropped for cross-replica subscribers. This adds a `publishRetry()` helper that publishes both locally and via pubsub, and extends the `Subscribe()` notification handler to forward retry events. Changes: - `coderd/pubsub/chatstreamnotify.go`: add `Retry` field to notify message - `coderd/chatd/chatd.go`: add `publishRetry()`, update `OnRetry` callback, extend `Subscribe()` to forward `notify.Retry` - `coderd/chatd/chatd_internal_test.go`: focused pubsub delivery test - `enterprise/coderd/chatd/chatd_test.go`: cross-replica end-to-end test	2026-03-20 15:19:41 +00:00
Cian Johnston	f1d333f0e6	refactor: deduplicate utility helpers across the codebase (#23338 ) Audited exported helpers in `coderd/util/`, `testutil`, `cryptorand`, and friends, then replaced duplicated implementations with canonical versions. - fix: `maps.SortedKeys` generic signature* — value type was hardcoded to `any`, making it impossible to actually call. Added second type parameter `V any`. Added table-driven tests with `cmp.Diff`. - refactor: replace ad-hoc ptr helpers with `ptr.Ref` — removed `int64Ptr`, `stringPtr`, `boolPtr`, `i64ptr`, `strPtr`, `PtrInt32` across 6 files. - refactor: replace local `sortedKeys`/`sortKeys` with `maps.SortedKeys` — now that the signature is fixed, scripts can use it. - refactor: replace hand-rolled `capitalize` with `strings.Capitalize` — the typegen version was also not UTF-8 safe. > 🤖 This PR was created with the help of Coder Agents, and was reviewed by my human. 🧑‍💻	2026-03-20 15:12:41 +00:00
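The `maps.SortedKeys` signature fix described in f1d333f0e6 can be illustrated with a minimal sketch. The function below is a hypothetical stand-in, not the code from `coderd/util/maps`; it assumes Go 1.21+ and uses `cmp.Ordered` as the key constraint, which may differ from the constraint the real helper uses.

```go
package main

import (
	"cmp"
	"fmt"
	"slices"
)

// SortedKeys returns the keys of m in ascending order.
// The second type parameter V lets callers pass a map with any value type,
// instead of being restricted to map[K]any.
// Hypothetical sketch; the real helper may use a different constraint.
func SortedKeys[K cmp.Ordered, V any](m map[K]V) []K {
	keys := make([]K, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	slices.Sort(keys)
	return keys
}

func main() {
	// Works with a concrete value type (int), not only with `any`.
	fmt.Println(SortedKeys(map[string]int{"b": 2, "a": 1, "c": 3}))
}
```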
Susana Ferreira	139594a4f4	feat: block CONNECT tunnels to private/reserved IP ranges (#23109 ) ## Description Blocks `CONNECT` tunnels to private and reserved IP ranges in aibridgeproxyd, preventing the proxy from being used to reach internal networks. The Coder access URL is always exempt (hostname+port match) so the proxy can reach its own deployment. It is possible to exempt additional ranges via `CODER_AIBRIDGE_PROXY_ALLOWED_PRIVATE_CIDRS`. DNS rebinding is handled differently per path: * Direct (no upstream proxy): validate the resolved IP right before the TCP dial, no window between check and connect. * Upstream proxy: Resolves and checks before forwarding to the upstream dialer. A small rebinding window exists since the upstream proxy re-resolves independently. ## Changes * Add blocked IP denylist covering private, reserved, and special-purpose ranges * Add `AllowedPrivateCIDRs` option with CLI flag and env var * Wire IP checks into `proxy.ConnectDial` for both upstream and direct paths * Add tests for blocked/allowed cases across direct dial, upstream proxy, CIDR exemptions, and CoderAccessURL exemption Notes: documentation will be handled in a follow-up PR. Closes: https://github.com/coder/security/issues/124	2026-03-20 09:49:26 +00:00
Kyle Carberry	d8ff67fb68	feat: add MCP server configuration backend for chats (#23227 ) ## Summary Adds the database schema, API endpoints, SDK types, and encryption wrappers for admin-managed MCP (Model Context Protocol) server configurations that chatd can consume. This is the backend foundation for allowing external MCP tools (Sentry, Linear, GitHub, etc.) to be used during AI chat sessions. ## Database Two new tables: - `mcp_server_configs`: Admin-managed server definitions with URL, transport (Streamable HTTP / SSE), auth config (none / OAuth2 / API key / custom headers), tool allow/deny lists, and an availability policy (`force_on` / `default_on` / `default_off`). Includes CHECK constraints on transport, auth_type, and availability values. - `mcp_server_user_tokens`: Per-user OAuth2 tokens for servers requiring individual authentication. Cascades on user/config deletion. New column on `chats` table: - `mcp_server_ids UUID[]`: Per-chat MCP server selection, following the same pattern as `model_config_id` — passed at chat creation, changeable per-message with nil-means-no-change semantics. ## API Endpoints All routes are under `/api/experimental/mcp/servers/` and gated behind the `agents` experiment. 
Admin endpoints (`ResourceDeploymentConfig` auth): - `POST /` — Create MCP server config - `PATCH /{id}` — Update MCP server config (full-replace) - `DELETE /{id}` — Delete MCP server config Authenticated endpoints (all users, enabled servers only for non-admins): - `GET /` — List configs (admins see all, members see enabled-only with admin fields redacted) - `GET /{id}` — Get config by ID (with `auth_connected` populated per-user) OAuth2 per-user auth flow: - `GET /{id}/oauth2/connect` — Initiate OAuth2 flow (state cookie CSRF protection) - `GET /{id}/oauth2/callback` — Handle OAuth2 callback, store tokens - `DELETE /{id}/oauth2/disconnect` — Remove stored OAuth2 tokens ## Security - Secrets never returned: `OAuth2ClientSecret`, `APIKeyValue`, and `CustomHeaders` are never in API responses — only boolean indicators (`has_oauth2_secret`, `has_api_key`, `has_custom_headers`). - Field redaction for non-admins: `convertMCPServerConfigRedacted` strips `OAuth2ClientID`, auth URLs, scopes, and `APIKeyHeader` from non-admin responses. - dbcrypt encryption at rest: All 5 secret fields use `dbcrypt_keys` encryption with full encrypt-on-write / decrypt-on-read wrappers (11 dbcrypt method overrides + 2 helpers), following the same pattern as `chat_providers.api_key`. - OAuth2 CSRF protection: State parameter stored in `HttpOnly` cookie with `HTTPCookies.Apply()` for correct `Secure`/`SameSite` behind TLS-terminating proxies. - dbauthz authorization: All 18 querier methods have authorization wrappers. Read operations use `ActionRead`, write operations use `ActionUpdate` on `ResourceDeploymentConfig`. 
## Governance Model \| Control \| Implementation \| \|---------\|---------------\| \| Global kill switch \| `enabled` defaults to `false` \| \| Availability policy \| `force_on` (always injected), `default_on` (pre-selected), `default_off` (opt-in) \| \| Per-chat selection \| `mcp_server_ids` on `CreateChatRequest` / `CreateChatMessageRequest` \| \| Auth gate \| OAuth2 servers require per-user auth before tools are injected \| \| Tool-level allow/deny \| Arrays on `mcp_server_configs` for granular tool filtering \| \| Secrets encrypted at rest \| Uses `dbcrypt_keys` (same pattern as `chat_providers.api_key`) \| ## Tests 8 test functions covering: - Full CRUD lifecycle (create, list, update, delete) - Non-admin visibility filtering (enabled-only, field redaction) - `auth_connected` population for OAuth2 vs non-OAuth2 servers - Availability policy validation (valid values + invalid rejection) - Unique slug enforcement (409 Conflict) - OAuth2 disconnect idempotency - Chat creation with `mcp_server_ids` persistence ## Known Limitations (Deferred) These are documented and intentional for an experimental feature: - Audit logging not yet wired — will add when feature stabilizes - Cross-field validation (e.g., OAuth2 fields required when `auth_type=oauth2`) — admin-only endpoint, will add when stabilizing - `force_on` auto-injection — query exists but not yet wired into chatd tool injection (follow-up) - Additional test coverage — 403 auth tests, GET-by-ID tests, callback CSRF tests planned for follow-up ## What's NOT in this PR - Frontend UI (admin panel + chat picker) - Actual MCP client connections (`chatd/chatmcp/` manager) - Tool injection into `chatloop/`	2026-03-19 14:07:36 +00:00
Cian Johnston	65b7658568	chore: extract testutil.FakeSink for slog test assertions (#23208 ) Follow-up to [review comment on #23025](https://github.com/coder/coder/pull/23025#discussion_r2930309487) from @mafredri. Extracts the repeated `logSink` / `fakeSink` test pattern into a shared `testutil.FakeSink` and migrates all existing call sites. > 🤖 This PR was created with the help of Coder Agents, and will be reviewed by my human. 🧑‍💻 --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-03-18 17:02:38 +00:00
Steven Masley	84de391f26	chore: add tallyman events for ai seat tracking (#22689 ) AI seat tracking inserted as heartbeat into usage table.	2026-03-18 09:30:22 -05:00
George K	91ec0f1484	feat: add service_accounts workspace sharing mode (#23093 ) Introduce a three-way workspace sharing setting (none, everyone, service_accounts) replacing the boolean workspace_sharing_disabled. In service_accounts mode, only service account-owned workspaces can be shared while regular members' share permissions are removed. Adds a new organization-service-account system role with per-org permissions reconciled alongside the existing organization-member system role. Related to: https://linear.app/codercom/issue/PLAT-28/feat-service-accounts-sharing-mode-and-rbac-role --------- Co-authored-by: Steven Masley <Emyrk@users.noreply.github.com> Co-authored-by: Kayla はな <mckayla@hey.com>	2026-03-17 12:16:43 -07:00
Danny Kopping	365de3e367	feat: record model thoughts (#22676 ) Depends on https://github.com/coder/aibridge/pull/203 Closes https://github.com/coder/internal/issues/1337 --------- Signed-off-by: Danny Kopping <danny@coder.com>	2026-03-17 11:41:10 +00:00
Michael Suchacz	1031da9738	feat: add agent chat spend limiting (backend) (#23071 ) Introduces deployment-scoped spend limiting for Coder Agents, enabling administrators to control LLM costs at global, group, and individual user levels. ## Changes - Database migration (000437): `chat_usage_limit_config` (singleton), `chat_usage_limit_overrides` (per-user), `chat_usage_limit_group_overrides` (per-group) - Single-query limit resolution: individual override > min(group) > global default via `ResolveUserChatSpendLimit` - Fail-open enforcement in chatd with documented TOCTOU trade-off - Experimental API under `/api/experimental/chats/usage-limits` for CRUD on limits - `AsChatd` RBAC subject for narrowly-scoped daemon access (replaces `AsSystemRestricted`) - Generated TypeScript types for the frontend SDK ## Hierarchy 1. Individual user override (highest) 2. Minimum of group limits 3. Global default 4. Disabled / unlimited Currency stored as micro-dollars (`1,000,000` = $1.00). Frontend PR: #23072	2026-03-17 01:24:03 +01:00
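The limit hierarchy in 1031da9738 (individual override, then the minimum of group limits, then the global default, otherwise unlimited) can be sketched in Go as follows. The commit resolves this in a single SQL query via `ResolveUserChatSpendLimit`; the function below is only a hypothetical in-memory equivalent with assumed parameter shapes, and amounts are in micro-dollars as the commit states.

```go
package main

import (
	"fmt"
	"slices"
)

// resolveLimit returns the effective spend limit in micro-dollars and whether
// any limit applies. Precedence: user override > min(group limits) > global.
// Hypothetical sketch; the real resolution happens in SQL.
func resolveLimit(userOverride *int64, groupLimits []int64, globalDefault *int64) (int64, bool) {
	if userOverride != nil {
		return *userOverride, true
	}
	if len(groupLimits) > 0 {
		return slices.Min(groupLimits), true
	}
	if globalDefault != nil {
		return *globalDefault, true
	}
	return 0, false // disabled / unlimited
}

func main() {
	global := int64(50_000_000) // $50.00
	user := int64(5_000_000)    // $5.00
	groups := []int64{20_000_000, 10_000_000}

	fmt.Println(resolveLimit(&user, groups, &global))
	fmt.Println(resolveLimit(nil, groups, &global))
	fmt.Println(resolveLimit(nil, nil, &global))
	fmt.Println(resolveLimit(nil, nil, nil))
}
```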
Steven Masley	93b9d70a9b	chore: add audit log entry when ai seat is consumed (#22683 ) When an ai seat is consumed, an audit log entry is made. This only happens the first time a seat is used.	2026-03-16 15:30:25 -05:00
Zach	3f76f312e4	feat(cli): add --no-wait flag to coder create (#22867 ) Adds a `--no-wait` flag (CODER_CREATE_NO_WAIT) to the create command, matching the existing pattern in `coder start`. When set, the `coder create` command returns immediately after the workspace creation API call succeeds instead of streaming build logs until completion. This enables fire-and-forget workspace creation in CI/automation contexts (e.g., GitHub Actions), where waiting for the build to finish is unnecessary. Combined with other existing flags, users can create a workspace with no interactivity, assuming the user is already authenticated.	2026-03-16 11:54:30 -06:00
Steven Masley	abf59ee7a6	feat: track ai seat usage (#22682 ) When a user uses an AI feature, we record them in the `ai_seat_state` as consuming a seat. Added debouncing to prevent excessive writes to the db for this feature. There is no need for frequent updates.	2026-03-16 12:36:26 -05:00
Mathias Fredriksson	bdbcd3428b	feat(coderd/chatd): unify chat storage on SDK parts and fix file-reference rendering (#22958 ) File-reference parts in user messages were flattened to `TextContent` at write time because fantasy has no file-reference content type. The frontend never saw them as structured parts. This moves all write paths (user, assistant, tool) from fantasy envelope format to `codersdk.ChatMessagePart`. The streaming layer (`chatloop`) is untouched, the conversion happens at the serialization boundary in `persistStep`. Old rows are still readable. `ParseContent` uses a structural heuristic (`isFantasyEnvelopeFormat`) to distinguish legacy envelopes from SDK parts. We chose this over try/fallback because fantasy envelopes partially unmarshal into `ChatMessagePart` (the `type` field matches) while silently losing content. A guard test enforces that no SDK part can produce the envelope shape. This is forward-only: new rows are unreadable by old code. Chat is behind a feature flag so rollback risk is contained. Also adds a typed `ChatMessageRole` to replace raw strings and `fantasy.MessageRole` casts at the persistence boundary. The type covers `ChatMessage.Role`, `ChatStreamMessagePart.Role`, the `PublishMessagePart` callback chain, and all DB write sites. `fantasy.MessageRole` remains only where we build `fantasy.Message` structs for LLM dispatch. Separately, `ProviderMetadata` was leaking to SSE clients via `publishMessagePart`. `StripInternal` now runs on both the SSE and REST paths, covering this. Other cleanup: - Old `db2sdk.contentBlockToPart` silently dropped metadata on text/reasoning/tool-call content. New code preserves it. - `providerMetadataToOptions` now logs warnings instead of silently returning nil. - `db2sdk` shrinks from ~250 lines of parallel conversion to ~15 lines delegating to `chatprompt.ParseContent()`, removing the `fantasy` import entirely. Refs #22821	2026-03-13 17:53:26 +02:00
Danny Kopping	870583224d	chore: deprecate injected MCP approach in AI Bridge (#23031 ) _Disclaimer: implemented by a Coder Agent using Claude Opus 4.6._ Marks the injected MCP approach in AI Bridge as deprecated across the codebase. ## Changes - `codersdk/deployment.go`: Deprecated `ExternalAuthConfig.MCPURL`, `.MCPToolAllowRegex`, `.MCPToolDenyRegex` fields; deprecated and hid the `--aibridge-inject-coder-mcp-tools` server flag; deprecated `AIBridgeConfig.InjectCoderMCPTools`. - `coderd/externalauth/externalauth.go`: Deprecated `Config.MCPURL`, `.MCPToolAllowRegex`, `.MCPToolDenyRegex`. - `enterprise/aibridgedserver/aibridgedserver.go`: Added runtime deprecation warning when `CODER_AIBRIDGE_INJECT_CODER_MCP_TOOLS` is enabled; deprecated `getCoderMCPServerConfig`. - `enterprise/aibridged/mcp.go`: Deprecated `MCPProxyBuilder` interface and `MCPProxyFactory` struct. - `docs/ai-coder/ai-bridge/mcp.md`: Added deprecation warning banner.	2026-03-13 16:15:33 +02:00
Mathias Fredriksson	57af7abf1f	test: add testutil.WaitBuffer and replace time.Sleep in tests (#22922 ) WaitBuffer is a thread-safe io.Writer that supports blocking until accumulated output matches a substring or custom predicate. It replaces ad-hoc safeBuffer/syncWriter types and time.Sleep-based poll loops in tests with signal-driven waits. - WaitFor/WaitForNth/WaitForCond for blocking on output - Replace custom buffer types in cli/sync_test.go and provisionersdk/agent_test.go - Convert time.Sleep poll loops to require.Eventually/require.Never in cli/ssh_test.go, coderd/activitybump_test.go, coderd/workspaceagentsrpc_test.go, workspaceproxy_test.go, and scaletest tests	2026-03-12 18:07:52 +02:00
George K	e5c19d0af4	feat: backend support for creating and storing service accounts (#22698 ) Add is_service_account column to users table with CHECK constraints enforcing login_type='none' and empty email for service accounts. Update user creation API to validate service account constraints. Related to: https://linear.app/codercom/issue/PLAT-27/feat-backend-support-for-creating-and-storing-service-accounts	2026-03-11 10:19:08 -07:00
Zach	a46336c3ec	fix(cli)!: `coder groups list -o json` returns empty values (#22923 ) The groupsToRows function was not setting the Group field on groupTableRow, causing JSON output to contain zero-value structs. Table output was unaffected since it uses separate fields. BREAKING CHANGE: The JSON output structure changes from `{"Group": {"id": ...}}` to `{"id": ...}` (flat). This is technically a breaking change, but JSON output never contained real data (all fields were zero-valued), so no working consumer could exist. We're taking the opportunity to flatten the structure to match other list commands like `coder list -o json`.	2026-03-11 09:45:00 -06:00
Kyle Carberry	53e52aef78	fix(externalauth): prevent race condition in token refresh with optimistic locking (#22904 ) ## Problem When multiple concurrent callers (e.g., parallel workspace builds) read the same single-use OAuth2 refresh token from the database and race to exchange it with the provider, the first caller succeeds but subsequent callers get `bad_refresh_token`. The losing caller then clears the valid new token from the database, permanently breaking the auth link until the user manually re-authenticates. This is reliably reproducible when launching multiple workspaces simultaneously with GitHub App external auth and user-to-server token expiration enabled. ## Solution Two layers of protection: ### 1. Singleflight deduplication (`Config.RefreshToken` + `ObtainOIDCAccessToken`) Concurrent callers for the same user/provider share a single refresh call via `golang.org/x/sync/singleflight`, keyed by `userID`. The singleflight callback re-reads the link from the database to pick up any token already refreshed by a prior in-flight call, avoiding redundant IDP round-trips entirely. ### 2. Optimistic locking on `UpdateExternalAuthLinkRefreshToken` The SQL `WHERE` clause now includes `AND oauth_refresh_token = @old_oauth_refresh_token`, so if two replicas (HA) race past singleflight, the loser's destructive UPDATE is a harmless no-op rather than overwriting the winner's valid token. ## Changes \| File \| Change \| \|------\|--------\| \| `coderd/externalauth/externalauth.go` \| Added `singleflight.Group` to `Config`; split `RefreshToken` into public wrapper + `refreshTokenInner`; pass `OldOauthRefreshToken` to DB update \| \| `coderd/provisionerdserver/provisionerdserver.go` \| Wrapped OIDC refresh in `ObtainOIDCAccessToken` with package-level singleflight \| \| `coderd/database/queries/externalauth.sql` \| Added optimistic lock (`WHERE ... 
AND oauth_refresh_token = @old_oauth_refresh_token`) \| \| `coderd/database/queries.sql.go` \| Regenerated \| \| `coderd/database/querier.go` \| Regenerated \| \| `coderd/database/dbauthz/dbauthz_test.go` \| Updated test params for new field \| \| `coderd/externalauth/externalauth_test.go` \| Added `ConcurrentRefreshDedup` test; updated existing tests for singleflight DB re-read \| ## Testing - New test `ConcurrentRefreshDedup`: 5 goroutines call `RefreshToken` concurrently, asserts IDP refresh called exactly once, all callers get same token. - All existing `TestRefreshToken/*` subtests updated and passing. - `TestObtainOIDCAccessToken` passing. - `dbauthz` tests passing.	2026-03-10 13:52:55 -04:00
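The optimistic-locking layer in 53e52aef78 can be illustrated without a database. The sketch below is an in-memory analogy of `UPDATE ... WHERE oauth_refresh_token = @old_oauth_refresh_token`; the `linkStore` type and `updateIfUnchanged` method are hypothetical names, and the singleflight layer is deliberately left out.

```go
package main

import (
	"fmt"
	"sync"
)

// linkStore stands in for a single external auth link row.
// Hypothetical type used only for this illustration.
type linkStore struct {
	mu           sync.Mutex
	refreshToken string
}

// updateIfUnchanged writes next only if the stored token still equals old,
// mirroring a WHERE clause on the old refresh token. It reports whether the
// "row" was updated.
func (s *linkStore) updateIfUnchanged(old, next string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.refreshToken != old {
		return false // someone else already rotated the token: no-op
	}
	s.refreshToken = next
	return true
}

func main() {
	store := &linkStore{refreshToken: "r1"}

	// Two callers both read "r1" before either writes.
	winner := store.updateIfUnchanged("r1", "r2") // exchange succeeded
	loser := store.updateIfUnchanged("r1", "")    // exchange failed, tries to clear

	fmt.Println(winner, loser, store.refreshToken)
}
```

The losing caller's attempt to clear the token becomes a no-op, so the winner's valid token survives.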
Mathias Fredriksson	338d30e4c4	fix(enterprise/cli): use :0 for http-address in proxy server tests (#22726 ) `Test_ProxyServer_Headers` never passed `--http-address`, so it bound to the default `127.0.0.1:3000`. `TestWorkspaceProxy_Server_PrometheusEnabled` used `RandomPort(t)` for `--http-address` (a drive-by from #14972 which was fixing the Prometheus port). Both now use `--http-address :0`. `ConfigureHTTPServers` calls `net.Listen("tcp", ":0")` and holds the listener open, so there is no TOCTOU window. Neither test connects to the HTTP listener, so the assigned port is irrelevant. This matches `cli/server_test.go` where `:0` is used throughout.	2026-03-06 17:05:06 -05:00
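The reasoning in 338d30e4c4 about `:0` can be shown with the standard library alone: the kernel assigns a free port at listen time, and because the listener stays open there is no window in which another process can take the port. The helper name `listenEphemeral` is hypothetical, and the sketch binds to loopback (`127.0.0.1:0`) rather than `:0` to keep the example local.

```go
package main

import (
	"fmt"
	"net"
)

// listenEphemeral opens a TCP listener on a kernel-assigned port and returns
// the listener together with the port that was chosen.
// Hypothetical helper for illustration.
func listenEphemeral() (net.Listener, int, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, 0, err
	}
	return ln, ln.Addr().(*net.TCPAddr).Port, nil
}

func main() {
	ln, port, err := listenEphemeral()
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer ln.Close()
	// The port is already bound here, unlike a "pick a random port, then
	// bind later" approach, which can race with other processes.
	fmt.Println("bound to port", port)
}
```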
Kyle Carberry	eecb7d0b66	fix: resolve bugs in chatd streaming system (#22720 ) Split from #22693 per review feedback. Fixes multiple bugs in coderd/chatd and sub-packages including race conditions, transaction safety, stream buffer bounds, retry limits, and enterprise relay improvements. See commit message for full list.	2026-03-06 21:02:25 +00:00
Danny Kopping	13e3df67d6	feat: track client sessions (#22470 ) This change adds support for tracking client session IDs in AI Bridge interceptions to enable better session-based auditing. Depends on https://github.com/coder/aibridge/pull/198 Fixes https://github.com/coder/internal/issues/1337 The session ID field is optional and not universally supported by all clients.	2026-03-06 14:43:53 +02:00
Cian Johnston	81468323e0	fix(coderd): use dbtime.Now() instead of time.Now() in test assertions against DB timestamps (#22685 ) `time.Now()` has nanosecond precision while Postgres timestamps are microsecond precision. When tests compare `time.Now()` against DB-sourced timestamps using `Before`/`After`/`WithinRange`/etc., there is a non-zero flake risk from the precision mismatch. This replaces `time.Now()` with `dbtime.Now()` (which rounds to microsecond precision) in all test assertions that compare against database timestamps. Follows from #22684. ## Changes (11 files) \| File \| Changes \| \|---\|---\| \| `coderd/apikey_test.go` \| 11 comparisons with `ExpiresAt` \| \| `coderd/users_test.go` \| 2 comparisons with `ExpiresAt` \| \| `coderd/oauth2_test.go` \| 1 comparison with `token.Expiry` \| \| `coderd/workspaces_test.go` \| 2 comparisons with `DormantAt` \| \| `coderd/workspaceagents_test.go` \| 3 comparisons with `ConnectedAt`/`DisconnectedAt` \| \| `coderd/workspaceapps/db_test.go` \| 1 comparison with `token.Expiry` \| \| `coderd/provisionerdserver/provisionerdserver_test.go` \| 1 comparison with `key.ExpiresAt` \| \| `enterprise/coderd/workspaces_test.go` \| 1 comparison with `DormantAt` \| \| `enterprise/coderd/license/license_test.go` \| 3 `NotBefore` values \| \| `enterprise/coderd/licenses_test.go` \| 2 `NotBefore` values \| \| `enterprise/coderd/users_test.go` \| 3 `Next()` comparisons \| ## Not changed (intentionally) - `scaletest/placebo/run_test.go` — compares wall-clock elapsed time, not DB timestamps - `cli/server_test.go`, `coderd/jwtutils/jwt_test.go`, `enterprise/aibridgeproxyd/aibridgeproxyd_test.go` — TLS cert fields, not DB-stored - `coderd/azureidentity/azureidentity_test.go` — Azure cert expiry, not DB 🤖 Generated by Claude Opus 4.6 but reviewed manually.	2026-03-06 09:14:11 +00:00
Jon Ayers	6c44de951d	feat: add Prometheus collector for DERP server expvar metrics (#22583 ) This PR does three things: - Exports derp expvars to the pprof endpoint - Exports the expvar metrics as prometheus metrics in both coderd and wsproxy - Updates our tailscale dependency to pick up a fix I also had to make to avoid a data race. I generated this with mux, but I also manually tested that the metrics were being emitted properly.	2026-03-06 01:57:58 -06:00
Danielle Maywood	f91475cd51	test: remove unnecessary dbauthz.AsSystemRestricted calls in tests (#22663 )	2026-03-05 20:29:49 +00:00
Susana Ferreira	21c91cebaa	feat: add TLS listener support to aibridgeproxyd (#22411 ) ## Description Adds optional TLS support for the AI Bridge Proxy listener. When TLS cert and key files are provided, the proxy serves over HTTPS instead of plain HTTP. ## Changes * New configuration options to enable TLS on the proxy listener * Wraps the TCP listener in `tls.NewListener` when configured * Tests for validation errors, invalid files, and full integration (tunneled + MITM) through a TLS listener Note: Documentation for TLS listener setup and client configuration will be handled in a follow-up PR. Related to: https://github.com/coder/internal/issues/1335	2026-03-05 09:19:34 +00:00
Susana Ferreira	c79e8f2707	refactor: clarify MITM certificate naming in aibridgeproxyd (#22408 ) ## Description Renames internal fields, variables, and comments related to the proxy's certificate/key configuration to explicitly reference their MITM CA purpose. The AI Bridge Proxy uses a CA certificate to sign dynamically generated leaf certificates during MITM interception of HTTPS traffic from AI clients. With the upcoming introduction of TLS listener certificates (for serving the proxy itself over HTTPS, implemented upstack https://github.com/coder/coder/pull/22411), the previous generic naming would become ambiguous. This refactor makes it clear which certificate is which. No user-facing flags, environment variables, YAML keys, or JSON fields were changed, this is purely an internal rename to avoid confusion going forward. Related to https://github.com/coder/internal/issues/1335	2026-03-05 09:06:38 +00:00
Kyle Carberry	94a2e440a8	fix(chatd): extract session token from cookie for relay header (#22649 ) ## Problem When a browser connects to the chat stream via WebSocket, it authenticates using cookies only — the native WebSocket API cannot set custom headers like `Coder-Session-Token`. The relay between replicas copies the original request's `Cookie` header but did not set the `Coder-Session-Token` header as a fallback. This causes a 401 on the worker replica when `EnableHostPrefix` is enabled, because the `HTTPCookies.Middleware` strips bare `coder_session_token` cookies (expecting the `__Host-` prefix). Without a `Coder-Session-Token` header fallback, `apiKeyMiddleware` finds no valid credentials. ### Root Cause The data flow: 1. Browser → subscriber replica: `Cookie: __Host-coder_session_token=xxx` (browser sends prefixed cookie) 2. Subscriber's `HTTPCookies.Middleware` normalizes: `Cookie: coder_session_token=xxx` (strips prefix) 3. `relayHeaders()` copies `Cookie: coder_session_token=xxx` to relay request 4. Worker replica's `HTTPCookies.Middleware` sees bare `coder_session_token` → strips it (expects `__Host-` prefix) 5. `apiKeyMiddleware` → `APITokenFromRequest`: no cookie, no header → 401 ## Fix Modified `relayHeaders()` to extract the session token value from the `Cookie` header and set it as the `Coder-Session-Token` header when no explicit session token header is already present. The header is never stripped by middleware, so the worker replica can always authenticate. 
## Testing - `TestRelayHeaders`: Unit tests for the updated `relayHeaders()` function covering all scenarios (cookie-only, header+cookie, no auth, nil source) - `TestExtractSessionTokenFromCookieHeader`: Unit tests for the helper function - `TestChatStreamRelay/RelayCookieOnlyAuth`: Integration test with plain HTTP, cookie-only WebSocket auth - `TestChatStreamRelay/RelayCookieOnlyAuthWithHostPrefix`: Integration test with `EnableHostPrefix=true`, confirming the 401 is fixed - `cookieOnlySessionTokenProvider`: Test helper that simulates browser WebSocket behavior (sets Cookie header only on WebSocket dials, no custom headers) ## Files Changed - `enterprise/coderd/chatd/chatd.go` — `relayHeaders()` fix + `extractSessionTokenFromCookieHeader()` helper - `enterprise/coderd/chatd/relay_headers_internal_test.go` — unit tests (new file) - `enterprise/coderd/chats_test.go` — integration tests + test helper type	2026-03-05 05:11:07 +00:00
Kyle Carberry	219d02bdc3	fix(coderd): poll for metrics in TestWorkspaceProvisionerdServerMetrics (#22644 ) ## Problem `TestWorkspaceProvisionerdServerMetrics` flakes because metric assertions run immediately after `AwaitWorkspaceBuildJobCompleted` returns, but metrics are updated asynchronously after the DB transaction commits in `completeWorkspaceBuildJob`. The timeline in the provisioner server: 1. DB transaction commits (`provisionerdserver.go:~2362`) — job marked completed 2. Audit logging, notifications, DB queries (`~2370-2427`) 3. Metric `.Observe()` (`~2463`) — happens ~100 lines later The test synchronization (`AwaitWorkspaceBuildJobCompleted`) polls for `CompletedAt != nil`, which fires at step 1. The metric assertion then executes before step 3, causing the flake. ## Fix Wrap all three metric assertions (prebuild creation, prebuild claim, regular workspace creation) in `require.Eventually` to poll until the metric appears, then assert on the value. ## Test - `go test -run TestWorkspaceProvisionerdServerMetrics -count=5` — all pass - `go test -race -run TestWorkspaceProvisionerdServerMetrics -count=1` — clean	2026-03-04 22:30:36 -05:00
Kyle Carberry	63b6868113	fix(codersdk): propagate HTTPClient to websocket.Dial for TLS relay (#22642 ) ## Problem In multi-replica Coder deployments, the chat relay WebSocket between replicas fails with HTTP 401 (or TLS handshake errors). The subscriber replica cannot relay `message_part` events from the worker replica. Root cause: `codersdk.Client.Dial()` does not pass `c.HTTPClient` to `websocket.DialOptions.HTTPClient`. The websocket library (`github.com/coder/websocket`) falls back to `http.DefaultClient`, which lacks the mesh TLS configuration needed for inter-replica communication. The relay code in `enterprise/coderd/chatd/chatd.go` correctly sets `sdkClient.HTTPClient = cfg.ReplicaHTTPClient` (which has mesh TLS certs), but that client was never used for the actual WebSocket handshake. ## Fix One-line fix in `codersdk/client.go`: propagate `c.HTTPClient` to `opts.HTTPClient` when the caller hasn't already set one. ## Test Added `TestChatStreamRelay/RelayWithTLSAndCookieAuth` which: - Sets up two replicas with TLS certificates (simulating mesh TLS in production) - Authenticates via cookies (simulating browser WebSocket behavior) - Verifies message_part events relay across replicas over TLS This test times out without the fix because the WebSocket handshake fails with `x509: certificate signed by unknown authority` (http.DefaultClient rejects self-signed certs). ## Related Follow-up to #22635 which fixed the `redirectToAccessURL` middleware bypassing 307 redirects for relay requests. That fix changed the error from HTTP 200 to HTTP 401, exposing this deeper issue.	2026-03-04 21:57:23 -05:00
Kyle Carberry	30d534b36b	fix(chatd): fix relay race conditions, extract enterprise relay logic, move pubsub to OSS (#22589 ) ## Summary Fixes a bug where interrupting a streaming chat and sending a new message left the relay connected to the wrong replica. Expanded into a broader refactor that cleanly separates concerns: - OSS owns pubsub subscription, message catch-up, queue updates, status forwarding, and local parts merging. - Enterprise (`enterprise/coderd/chatd`) only manages relay dialing, reconnection, and stale-dial discarding for cross-replica streaming. ## Architecture ### OSS `coderd/chatd/chatd.go` `Subscribe()` builds the initial snapshot then runs a single merge goroutine that handles: - Pubsub subscription for durable events (status, messages, queue, errors) - Message catch-up via `AfterMessageID` - Local `message_part` forwarding - Relay events from enterprise (when `SubscribeFn` is set) - Sends `StatusNotification` to enterprise so it can manage relay lifecycle Key types: - `SubscribeFn` — enterprise hook, returns relay-only events channel - `SubscribeFnParams` — `ChatID`, `Chat`, `WorkerID`, `StatusNotifications`, `RequestHeader`, `DB`, `Logger` - `StatusNotification` — `Status` + `WorkerID`, sent to enterprise on pubsub status changes ### Enterprise `enterprise/coderd/chatd/chatd.go` `NewMultiReplicaSubscribeFn(cfg MultiReplicaSubscribeConfig)` returns a `SubscribeFn` that: - Opens an initial synchronous relay if the chat is running on a remote worker - Reads `StatusNotifications` from OSS to open/close relay connections - Handles async dial, reconnect timers, stale-dial discarding - Returns only relay `message_part` events ## Bug fixes ### Original bug: stale relay dial after interrupt `openRelayAsync` goroutines used `mergedCtx` (subscription-level), not a per-dial context. `closeRelay()` could not cancel in-flight dials. 
When the user interrupts and a new replica picks up the chat, the old dial goroutine could complete after the new one and deliver a stale `relayResult`. Fix: per-dial `dialCtx`/`dialCancel`, `expectedWorkerID` tracking, `workerID` on `relayResult`. `closeRelay()` cancels the dial context and drains `relayReadyCh`. Merge loop rejects mismatched worker IDs. ### Additional fixes - `statusNotifications` send-on-closed-channel race — goroutine now owns `close()` via defer - Enterprise spin-loop on `StatusNotifications` close — two-value receive with nil-out - `hasPubsub` set from `p.pubsub != nil` instead of subscription success — now tracks actual subscription result - `lastMessageID` not initialized from `afterMessageID` — caused duplicate messages on catch-up - `wrappedParts` goroutine leaked remote connection on `dialCtx` cancel - `closeRelay()` did not drain `relayReadyCh` - `setChatWaiting` race with `SendMessage(Interrupt)` — wrapped in `InTx` - `processChat` post-TX side effects fired when chat was taken by another worker — added `errChatTakenByOtherWorker` sentinel - Cancel closure data race on `reconnectTimer` - Bare blocking send on pubsub error path - `localParts` hot-spin after channel close - No-pubsub branch dropped relay events and initial snapshot - Failed relay dial caused permanent stall (no reconnect retry) - DB error during reconnect timer caused permanent stall - `time.NewTimer` replaced with `quartz.Clock` for testable timing ## Tests 9 enterprise tests covering: - Relay reconnect on drop (mock clock) - Async dial does not block merge loop - Relay snapshot delivery - Stale dial discarded after interrupt - Cancel during in-flight dial - Running-to-running worker switch - Failed dial retries (mock clock) - Local worker closes relay - Multiple consecutive reconnects (mock clock) All pass with `-race`.	2026-03-04 18:42:28 -05:00
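The stale-dial fix in #22589 (per-dial context, `expectedWorkerID` tracking, draining `relayReadyCh`) can be illustrated with a small synchronous sketch. The `relayState` type and its methods are hypothetical and far simpler than the enterprise merge loop; only the names `relayResult`, `workerID`, and `closeRelay` are borrowed from the commit message.

```go
package main

import (
	"context"
	"fmt"
)

// relayResult is a trimmed-down version of what a dial goroutine would hand
// back to the merge loop. The workerID records which replica was dialed.
type relayResult struct {
	workerID string
}

// relayState is a hypothetical holder for the per-subscription relay fields.
type relayState struct {
	expectedWorkerID string
	dialCancel       context.CancelFunc
	ready            chan relayResult // stands in for relayReadyCh
	active           string           // worker whose relay is installed
}

// closeRelay cancels any in-flight dial and drains a pending result so it
// cannot be picked up later.
func (s *relayState) closeRelay() {
	if s.dialCancel != nil {
		s.dialCancel()
		s.dialCancel = nil
	}
	select {
	case <-s.ready:
	default:
	}
	s.expectedWorkerID = ""
	s.active = ""
}

// startDial closes the old relay and returns a fresh per-dial context.
func (s *relayState) startDial(workerID string) context.Context {
	s.closeRelay()
	ctx, cancel := context.WithCancel(context.Background())
	s.expectedWorkerID = workerID
	s.dialCancel = cancel
	return ctx
}

// accept installs res only if it matches the worker currently expected.
func (s *relayState) accept(res relayResult) bool {
	if res.workerID != s.expectedWorkerID {
		return false // stale dial, discard
	}
	s.active = res.workerID
	return true
}

func main() {
	s := &relayState{ready: make(chan relayResult, 1)}

	ctxA := s.startDial("worker-a")
	s.ready <- relayResult{workerID: "worker-a"} // dial finished, not yet consumed

	// Interrupt: another replica picks up the chat.
	s.startDial("worker-b")
	fmt.Println(ctxA.Err() != nil, len(s.ready)) // old dial canceled, channel drained

	fmt.Println(s.accept(relayResult{workerID: "worker-a"})) // late result rejected
	fmt.Println(s.accept(relayResult{workerID: "worker-b"})) // current result accepted
}
```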
Kayla はな	e35717bc19	fix: show a notice when workspace sharing is disabled globally in organization settings (#22580 )	2026-03-04 11:14:52 -07:00
Sas Swart	8c09df52f9	fix(coderd): use WaitSuperLong in TestReinitializeAgent (#22593 ) Fixes coder/internal#642 We recently fixed Windows-specific flakes for this test and re-enabled it. It then failed intermittently due to context deadline expiration. The temporary path created on Windows contained invalid characters. This resulted in a silent startup script failure on Windows. The test then fruitlessly waited until context expiration. The test now uses a valid path on Windows.	2026-03-04 15:22:43 +02:00
Danny Kopping	1b08bc76a6	feat: store tool call IDs to determine interception lineage (#22246 ) Adds database columns and server-side logic to track interception lineage via tool call IDs. When an interception ends, the server resolves the correlating tool call ID to find the parent interception and links them via `parent_id`. New `provider_tool_call_id` column on `aibridge_tool_usages` and `parent_id` column on `aibridge_interceptions`, with indexes for lookup. `findParentInterceptionID` queries by tool call ID and filters out the current interception to find the parent. Adapted from the [coder/coder `dk/prompt_provenance_poc`](https://github.com/coder/coder/compare/main...dk/prompt_provenance_poc) branch. Depends on [coder/aibridge#188](https://github.com/coder/aibridge/pull/188). Closes https://github.com/coder/internal/issues/1334	2026-03-03 21:04:41 +02:00
Spike Curtis	56eb57caf4	chore: enable agent socket by default (#22352 ) relates to #21335 Enables the agent socket by default and updates docs to strike references to having to enable it. The PRs in this stack change the MCP server that Tasks use to update their status to rely on the agent socket, rather than directly dialing Coderd with the agent token. Default disable was a reasonable default when it was only used for the experimental script ordering features, but now that we want to use it for Tasks, it should be default on.	2026-03-03 21:23:59 +04:00
Sas Swart	e563766722	tests: re-enable 'TestReinitializeAgent' on Windows (#22488 ) closes https://github.com/coder/internal/issues/642 This PR: * re-enables `func TestReinitializeAgent(t *testing.T)` * adjusts it to use a Windows-specific command in Windows environments	2026-03-03 11:22:02 +02:00
Kyle Carberry	b7a7683ac0	fix(chatd): harden cross-replica relay for chat stream parts (#22533 ) ## Problem Subscribers connecting to a different replica than the one running the chat see full messages appear but no streaming partials (`message_part` events). The relay mechanism that forwards ephemeral parts across replicas had several bugs. ## Root Causes 1. `openRelay()` blocked the event loop — The WebSocket dial (TCP + TLS + HTTP upgrade) to the worker replica ran synchronously inside the select loop. While dialing, no events could be processed, channels filled up, and parts were silently dropped. 2. Relay drops were permanent — When the relay WebSocket closed mid-stream, `relayParts` was set to nil and never reopened. No status notification would re-trigger it since the chat was still running on the same worker. 3. `drainInitial` snapshot race — The `default` case in the initial drain loop caused the snapshot to be empty if the remote hadn't flushed data yet (common immediately after WebSocket connect). 4. Duplicate event delivery — The `preloaded` slice caused snapshot events to be sent both in the return value and re-sent through the channel goroutine. ## Fixes ### `coderd/chatd/chatd.go` (Subscribe method) - Async relay dial: `openRelayAsync()` spawns a goroutine to dial the remote replica. The result (channel + cancel func) is delivered on a `relayReadyCh` channel that the select loop reads without blocking. - Relay reconnection: When the relay channel closes, a 500ms timer fires. The handler re-checks chat status from the DB and reopens the relay if the chat is still running on a remote worker. - Snapshot parts via channel: Relay snapshot + live parts are wrapped into a single channel so they flow through the same path, avoiding races with the select loop. ### `enterprise/coderd/chats.go` (newRemotePartsProvider) - Timer-based drain: Replaced `default` with a 1-second timer. 
After the first event, `Reset(0)` switches to non-blocking drain for remaining buffered events. - Remove preloaded duplication: The goroutine now only forwards new events; snapshot events are returned to the caller directly. ## Testing All existing tests pass: - `TestInterruptChatBroadcastsStatusAcrossInstances` - `TestSubscribeSnapshotIncludesStatusEvent` - `TestSubscribeNoPubsubNoDuplicateMessageParts` - `TestSubscribeAfterMessageID` - `TestChatStreamRelay/RelayMessagePartsAcrossReplicas`	2026-03-02 19:57:13 -05:00
Steven Masley	7bc454eed8	chore: version is 2.31 not 1.31 (#22494 )	2026-03-02 16:23:09 +00:00
Kyle Carberry	edee917d88	feat: add experimental agents support (#22290 ) feat: add AI chat system with agent tools and chat UI Introduce the chatd subsystem and Agents UI for AI-powered chat within Coder workspaces. - Add chatd package with chat loop, message compaction, prompt management, and LLM provider integration (OpenAI, Anthropic) - Add agent tools: create workspace, list/read templates, read/write/ edit files, execute commands - Add chat API endpoints with streaming, message editing, and durable reconnection - Add database schema and migrations for chats, chat messages, chat providers, and chat model configs - Add RBAC policies and dbauthz enforcement for chat resources - Add Agents UI pages with conversation timeline, queued messages list, diff viewer, and model configuration panel - Add comprehensive test coverage including coderd integration tests, chatd unit tests, and Storybook stories - Gate feature behind experiments flag --------- Co-authored-by: Cian Johnston <cian@coder.com> Co-authored-by: Danielle Maywood <danielle@themaywoods.com> Co-authored-by: Jeremy Ruppel <jeremy@coder.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-27 16:50:56 +00:00
Steven Masley	21bc185254	doc: add language to mention disruptive nature of cookie host prefix (#22384 )	2026-02-27 15:59:01 +00:00
Susana Ferreira	ca234f346d	fix: mark presets as validation_failed to prevent endless prebuild retries (#22085 ) ## Description - Updates `wsbuilder` to return a `BuildError` with `http.StatusBadRequest` to signify a "validation error" on missing or invalid parameters - Adds a short-circuit in `prebuilds.StoreReconciler` to mark presets for which creating a build returns a "validation error" as "validation failed" and skip further attempts to reconcile. - Adds a test to verify the above - Introduces a new Prometheus metric `coderd_prebuilt_workspaces_preset_validation_failed` to track the above Closes: https://github.com/coder/coder/issues/21237 --------- Co-authored-by: Cian Johnston <cian@coder.com>	2026-02-27 14:26:48 +00:00
Jake Howell	a51eb40dca	fix: marshal `convertLicenses()` into a `[]` instead of `nil` (#22366 ) This was a bad smell that was being addressed by the frontend. This type was generating out to be a `nil`/`null` instead of an empty `License[]`. Now this returns as an empty array and we can actively check if we have no licenses with a length of `0`.	2026-02-28 00:23:41 +11:00
Dean Sheather	bef7eb9dcc	fix: avoid derp-related panic during wsproxy registration (#22322 )	2026-02-27 00:07:14 +11:00
Jake Howell	d2787df442	feat: add AI Bridge request logs model filter (#22230 ) This pull request implements simple filtering logic so that we're able to pick which model the user actually used when logs were sent to AI Bridge. - Add `GET /aibridge/models` API endpoint that returns distinct model names from AI Bridge interceptions, with pagination and search support - New `ListAIBridgeModels` SQL query using case-sensitive prefix matching (`LIKE model || '%'`) to allow B-tree index usage - Hand-written `ListAuthorizedAIBridgeModels` in `modelqueries.go` for RBAC authorization filter injection - `AIBridgeModels` search query parser in searchquery/search.go (defaults bare terms to the `model` field) - dbauthz wrappers, dbmetrics, and dbmock implementations for the new query <img width="292" height="185" alt="image" src="https://github.com/user-attachments/assets/134771df-2d26-4c54-acc4-27f58128b351" />	2026-02-26 02:40:45 +11:00
Jon Ayers	4f34452bcc	fix: use separate http.Transports for wsproxy tests (#22292 ) - Previously all tests were sharing the global http.Transport meaning on `Close` it would close connections presumed to be idle for other tests. fixes https://github.com/coder/internal/issues/112	2026-02-24 23:56:58 -06:00
Garrett Delfosse	6c16794173	fix(cli): proactively use active template version when require_active_version is set (#22033 ) Fixes #22030 ## Problem When a template has `require_active_version = true` and a workspace is outdated, the web UI always shows "Update and start" as the only button (for all users including admins), but `coder start` starts with the old version. For admins, this silently succeeds on the stale version. For non-admins, it goes through a clunky 403→retry path. This also affects the VS Code extension, which calls `coder start --yes` under the hood. ## Root Cause `buildWorkspaceStartRequest()` in `cli/start.go` checks `workspace.AutomaticUpdates == "always"` but ignores `workspace.TemplateRequireActiveVersion`. The server-side autostart already ORs both settings together: ```go // coderd/autobuild/lifecycle_executor.go func useActiveVersion(opts, ws) bool { return opts.RequireActiveVersion || ws.AutomaticUpdates == "always" } ``` The CLI was missing the `RequireActiveVersion` check. ## Fix Add `workspace.TemplateRequireActiveVersion` to the existing OR condition: ```go // Before: if workspace.AutomaticUpdates == codersdk.AutomaticUpdatesAlways || action == WorkspaceUpdate { // After: if workspace.AutomaticUpdates == codersdk.AutomaticUpdatesAlways || workspace.TemplateRequireActiveVersion || action == WorkspaceUpdate { ``` Now `coder start` and `coder restart` proactively use the active template version when `require_active_version` is set, matching the web UI and server autostart behavior. The 403→retry fallback remains as a safety net but is no longer the primary path for any user. ## Testing Updated `enterprise/cli/start_test.go` — all user types (owner, template admin, ACL admin, group ACL admin, member) now expect the active version when `require_active_version` is set, and verify the 403→retry message does NOT appear.	2026-02-24 19:51:48 -05:00
Zach	9613e41d21	chore: update boundary version (#22289 ) Updating to the latest tag before the 2.31 code freeze.	2026-02-24 13:33:37 -05:00
Jon Ayers	0a7a3da178	fix: exclude provisioner_state from workspace_build_with_user view (#22159 ) The provisioner state for a workspace build was being loaded for every long-lived agent rpc connection. Since this state can be anywhere from kilobytes to megabytes this can gradually cause the `coderd` memory footprint to grow over time. It's also a lot of unnecessary allocations for every query that fetches a workspace build since only a few callers ever actually reference the provisioner state. This PR removes it from the returned workspace build and adds a query to fetch the provisioner state explicitly.	2026-02-23 22:46:17 -06:00
Sushant P	37a8e61ea2	chore: move Shared Workspaces from experiments to beta (#22206 ) * Removed the shared-workspaces experiment and cleaned up related middleware * Added beta tagging to the UI for shared workspaces	2026-02-23 08:30:32 -08:00


1231 Commits