coder

mirror of https://github.com/coder/coder.git synced 2026-06-03 21:18:24 +00:00

Author	SHA1	Message	Date
Mathias Fredriksson	bdbcd3428b	feat(coderd/chatd): unify chat storage on SDK parts and fix file-reference rendering (#22958 ) File-reference parts in user messages were flattened to `TextContent` at write time because fantasy has no file-reference content type. The frontend never saw them as structured parts. This moves all write paths (user, assistant, tool) from fantasy envelope format to `codersdk.ChatMessagePart`. The streaming layer (`chatloop`) is untouched, the conversion happens at the serialization boundary in `persistStep`. Old rows are still readable. `ParseContent` uses a structural heuristic (`isFantasyEnvelopeFormat`) to distinguish legacy envelopes from SDK parts. We chose this over try/fallback because fantasy envelopes partially unmarshal into `ChatMessagePart` (the `type` field matches) while silently losing content. A guard test enforces that no SDK part can produce the envelope shape. This is forward-only: new rows are unreadable by old code. Chat is behind a feature flag so rollback risk is contained. Also adds a typed `ChatMessageRole` to replace raw strings and `fantasy.MessageRole` casts at the persistence boundary. The type covers `ChatMessage.Role`, `ChatStreamMessagePart.Role`, the `PublishMessagePart` callback chain, and all DB write sites. `fantasy.MessageRole` remains only where we build `fantasy.Message` structs for LLM dispatch. Separately, `ProviderMetadata` was leaking to SSE clients via `publishMessagePart`. `StripInternal` now runs on both the SSE and REST paths, covering this. Other cleanup: - Old `db2sdk.contentBlockToPart` silently dropped metadata on text/reasoning/tool-call content. New code preserves it. - `providerMetadataToOptions` now logs warnings instead of silently returning nil. - `db2sdk` shrinks from ~250 lines of parallel conversion to ~15 lines delegating to `chatprompt.ParseContent()`, removing the `fantasy` import entirely. Refs #22821	2026-03-13 17:53:26 +02:00
Mathias Fredriksson	57af7abf1f	test: add testutil.WaitBuffer and replace time.Sleep in tests (#22922 ) WaitBuffer is a thread-safe io.Writer that supports blocking until accumulated output matches a substring or custom predicate. It replaces ad-hoc safeBuffer/syncWriter types and time.Sleep-based poll loops in tests with signal-driven waits. - WaitFor/WaitForNth/WaitForCond for blocking on output - Replace custom buffer types in cli/sync_test.go and provisionersdk/agent_test.go - Convert time.Sleep poll loops to require.Eventually/require.Never in cli/ssh_test.go, coderd/activitybump_test.go, coderd/workspaceagentsrpc_test.go, workspaceproxy_test.go, and scaletest tests	2026-03-12 18:07:52 +02:00
George K	e5c19d0af4	feat: backend support for creating and storing service accounts (#22698 ) Add is_service_account column to users table with CHECK constraints enforcing login_type='none' and empty email for service accounts. Update user creation API to validate service account constraints. Related to: https://linear.app/codercom/issue/PLAT-27/feat-backend-support-for-creating-and-storing-service-accounts	2026-03-11 10:19:08 -07:00
Kyle Carberry	eecb7d0b66	fix: resolve bugs in chatd streaming system (#22720 ) Split from #22693 per review feedback. Fixes multiple bugs in coderd/chatd and sub-packages including race conditions, transaction safety, stream buffer bounds, retry limits, and enterprise relay improvements. See commit message for full list.	2026-03-06 21:02:25 +00:00
Danny Kopping	13e3df67d6	feat: track client sessions (#22470 ) This change adds support for tracking client session IDs in AI Bridge interceptions to enable better session-based auditing. Depends on https://github.com/coder/aibridge/pull/198 Fixes https://github.com/coder/internal/issues/1337 The session ID field is optional and not universally supported by all clients.	2026-03-06 14:43:53 +02:00
Cian Johnston	81468323e0	fix(coderd): use dbtime.Now() instead of time.Now() in test assertions against DB timestamps (#22685 ) `time.Now()` has nanosecond precision while Postgres timestamps are microsecond precision. When tests compare `time.Now()` against DB-sourced timestamps using `Before`/`After`/`WithinRange`/etc., there is a non-zero flake risk from the precision mismatch. This replaces `time.Now()` with `dbtime.Now()` (which rounds to microsecond precision) in all test assertions that compare against database timestamps. Follows from #22684. ## Changes (11 files) \| File \| Changes \| \|---\|---\| \| `coderd/apikey_test.go` \| 11 comparisons with `ExpiresAt` \| \| `coderd/users_test.go` \| 2 comparisons with `ExpiresAt` \| \| `coderd/oauth2_test.go` \| 1 comparison with `token.Expiry` \| \| `coderd/workspaces_test.go` \| 2 comparisons with `DormantAt` \| \| `coderd/workspaceagents_test.go` \| 3 comparisons with `ConnectedAt`/`DisconnectedAt` \| \| `coderd/workspaceapps/db_test.go` \| 1 comparison with `token.Expiry` \| \| `coderd/provisionerdserver/provisionerdserver_test.go` \| 1 comparison with `key.ExpiresAt` \| \| `enterprise/coderd/workspaces_test.go` \| 1 comparison with `DormantAt` \| \| `enterprise/coderd/license/license_test.go` \| 3 `NotBefore` values \| \| `enterprise/coderd/licenses_test.go` \| 2 `NotBefore` values \| \| `enterprise/coderd/users_test.go` \| 3 `Next()` comparisons \| ## Not changed (intentionally) - `scaletest/placebo/run_test.go` — compares wall-clock elapsed time, not DB timestamps - `cli/server_test.go`, `coderd/jwtutils/jwt_test.go`, `enterprise/aibridgeproxyd/aibridgeproxyd_test.go` — TLS cert fields, not DB-stored - `coderd/azureidentity/azureidentity_test.go` — Azure cert expiry, not DB 🤖 Generated by Claude Opus 4.6 but reviewed manually.	2026-03-06 09:14:11 +00:00
Danielle Maywood	f91475cd51	test: remove unnecessary dbauthz.AsSystemRestricted calls in tests (#22663)	2026-03-05 20:29:49 +00:00
Kyle Carberry	94a2e440a8	fix(chatd): extract session token from cookie for relay header (#22649 ) ## Problem When a browser connects to the chat stream via WebSocket, it authenticates using cookies only — the native WebSocket API cannot set custom headers like `Coder-Session-Token`. The relay between replicas copies the original request's `Cookie` header but did not set the `Coder-Session-Token` header as a fallback. This causes a 401 on the worker replica when `EnableHostPrefix` is enabled, because the `HTTPCookies.Middleware` strips bare `coder_session_token` cookies (expecting the `__Host-` prefix). Without a `Coder-Session-Token` header fallback, `apiKeyMiddleware` finds no valid credentials. ### Root Cause The data flow: 1. Browser → subscriber replica: `Cookie: __Host-coder_session_token=xxx` (browser sends prefixed cookie) 2. Subscriber's `HTTPCookies.Middleware` normalizes: `Cookie: coder_session_token=xxx` (strips prefix) 3. `relayHeaders()` copies `Cookie: coder_session_token=xxx` to relay request 4. Worker replica's `HTTPCookies.Middleware` sees bare `coder_session_token` → strips it (expects `__Host-` prefix) 5. `apiKeyMiddleware` → `APITokenFromRequest`: no cookie, no header → 401 ## Fix Modified `relayHeaders()` to extract the session token value from the `Cookie` header and set it as the `Coder-Session-Token` header when no explicit session token header is already present. The header is never stripped by middleware, so the worker replica can always authenticate. 
## Testing - `TestRelayHeaders`: Unit tests for the updated `relayHeaders()` function covering all scenarios (cookie-only, header+cookie, no auth, nil source) - `TestExtractSessionTokenFromCookieHeader`: Unit tests for the helper function - `TestChatStreamRelay/RelayCookieOnlyAuth`: Integration test with plain HTTP, cookie-only WebSocket auth - `TestChatStreamRelay/RelayCookieOnlyAuthWithHostPrefix`: Integration test with `EnableHostPrefix=true`, confirming the 401 is fixed - `cookieOnlySessionTokenProvider`: Test helper that simulates browser WebSocket behavior (sets Cookie header only on WebSocket dials, no custom headers) ## Files Changed - `enterprise/coderd/chatd/chatd.go` — `relayHeaders()` fix + `extractSessionTokenFromCookieHeader()` helper - `enterprise/coderd/chatd/relay_headers_internal_test.go` — unit tests (new file) - `enterprise/coderd/chats_test.go` — integration tests + test helper type	2026-03-05 05:11:07 +00:00
Kyle Carberry	219d02bdc3	fix(coderd): poll for metrics in TestWorkspaceProvisionerdServerMetrics (#22644 ) ## Problem `TestWorkspaceProvisionerdServerMetrics` flakes because metric assertions run immediately after `AwaitWorkspaceBuildJobCompleted` returns, but metrics are updated asynchronously after the DB transaction commits in `completeWorkspaceBuildJob`. The timeline in the provisioner server: 1. DB transaction commits (`provisionerdserver.go:~2362`) — job marked completed 2. Audit logging, notifications, DB queries (`~2370-2427`) 3. Metric `.Observe()` (`~2463`) — happens ~100 lines later The test synchronization (`AwaitWorkspaceBuildJobCompleted`) polls for `CompletedAt != nil`, which fires at step 1. The metric assertion then executes before step 3, causing the flake. ## Fix Wrap all three metric assertions (prebuild creation, prebuild claim, regular workspace creation) in `require.Eventually` to poll until the metric appears, then assert on the value. ## Test - `go test -run TestWorkspaceProvisionerdServerMetrics -count=5` — all pass - `go test -race -run TestWorkspaceProvisionerdServerMetrics -count=1` — clean	2026-03-04 22:30:36 -05:00
Kyle Carberry	63b6868113	fix(codersdk): propagate HTTPClient to websocket.Dial for TLS relay (#22642 ) ## Problem In multi-replica Coder deployments, the chat relay WebSocket between replicas fails with HTTP 401 (or TLS handshake errors). The subscriber replica cannot relay `message_part` events from the worker replica. Root cause: `codersdk.Client.Dial()` does not pass `c.HTTPClient` to `websocket.DialOptions.HTTPClient`. The websocket library (`github.com/coder/websocket`) falls back to `http.DefaultClient`, which lacks the mesh TLS configuration needed for inter-replica communication. The relay code in `enterprise/coderd/chatd/chatd.go` correctly sets `sdkClient.HTTPClient = cfg.ReplicaHTTPClient` (which has mesh TLS certs), but that client was never used for the actual WebSocket handshake. ## Fix One-line fix in `codersdk/client.go`: propagate `c.HTTPClient` to `opts.HTTPClient` when the caller hasn't already set one. ## Test Added `TestChatStreamRelay/RelayWithTLSAndCookieAuth` which: - Sets up two replicas with TLS certificates (simulating mesh TLS in production) - Authenticates via cookies (simulating browser WebSocket behavior) - Verifies message_part events relay across replicas over TLS This test times out without the fix because the WebSocket handshake fails with `x509: certificate signed by unknown authority` (http.DefaultClient rejects self-signed certs). ## Related Follow-up to #22635 which fixed the `redirectToAccessURL` middleware bypassing 307 redirects for relay requests. That fix changed the error from HTTP 200 to HTTP 401, exposing this deeper issue.	2026-03-04 21:57:23 -05:00
Kyle Carberry	30d534b36b	fix(chatd): fix relay race conditions, extract enterprise relay logic, move pubsub to OSS (#22589 ) ## Summary Fixes a bug where interrupting a streaming chat and sending a new message left the relay connected to the wrong replica. Expanded into a broader refactor that cleanly separates concerns: - OSS owns pubsub subscription, message catch-up, queue updates, status forwarding, and local parts merging. - Enterprise (`enterprise/coderd/chatd`) only manages relay dialing, reconnection, and stale-dial discarding for cross-replica streaming. ## Architecture ### OSS `coderd/chatd/chatd.go` `Subscribe()` builds the initial snapshot then runs a single merge goroutine that handles: - Pubsub subscription for durable events (status, messages, queue, errors) - Message catch-up via `AfterMessageID` - Local `message_part` forwarding - Relay events from enterprise (when `SubscribeFn` is set) - Sends `StatusNotification` to enterprise so it can manage relay lifecycle Key types: - `SubscribeFn` — enterprise hook, returns relay-only events channel - `SubscribeFnParams` — `ChatID`, `Chat`, `WorkerID`, `StatusNotifications`, `RequestHeader`, `DB`, `Logger` - `StatusNotification` — `Status` + `WorkerID`, sent to enterprise on pubsub status changes ### Enterprise `enterprise/coderd/chatd/chatd.go` `NewMultiReplicaSubscribeFn(cfg MultiReplicaSubscribeConfig)` returns a `SubscribeFn` that: - Opens an initial synchronous relay if the chat is running on a remote worker - Reads `StatusNotifications` from OSS to open/close relay connections - Handles async dial, reconnect timers, stale-dial discarding - Returns only relay `message_part` events ## Bug fixes ### Original bug: stale relay dial after interrupt `openRelayAsync` goroutines used `mergedCtx` (subscription-level), not a per-dial context. `closeRelay()` could not cancel in-flight dials. 
When the user interrupts and a new replica picks up the chat, the old dial goroutine could complete after the new one and deliver a stale `relayResult`. Fix: per-dial `dialCtx`/`dialCancel`, `expectedWorkerID` tracking, `workerID` on `relayResult`. `closeRelay()` cancels the dial context and drains `relayReadyCh`. Merge loop rejects mismatched worker IDs. ### Additional fixes - `statusNotifications` send-on-closed-channel race — goroutine now owns `close()` via defer - Enterprise spin-loop on `StatusNotifications` close — two-value receive with nil-out - `hasPubsub` set from `p.pubsub != nil` instead of subscription success — now tracks actual subscription result - `lastMessageID` not initialized from `afterMessageID` — caused duplicate messages on catch-up - `wrappedParts` goroutine leaked remote connection on `dialCtx` cancel - `closeRelay()` did not drain `relayReadyCh` - `setChatWaiting` race with `SendMessage(Interrupt)` — wrapped in `InTx` - `processChat` post-TX side effects fired when chat was taken by another worker — added `errChatTakenByOtherWorker` sentinel - Cancel closure data race on `reconnectTimer` - Bare blocking send on pubsub error path - `localParts` hot-spin after channel close - No-pubsub branch dropped relay events and initial snapshot - Failed relay dial caused permanent stall (no reconnect retry) - DB error during reconnect timer caused permanent stall - `time.NewTimer` replaced with `quartz.Clock` for testable timing ## Tests 9 enterprise tests covering: - Relay reconnect on drop (mock clock) - Async dial does not block merge loop - Relay snapshot delivery - Stale dial discarded after interrupt - Cancel during in-flight dial - Running-to-running worker switch - Failed dial retries (mock clock) - Local worker closes relay - Multiple consecutive reconnects (mock clock) All pass with `-race`.	2026-03-04 18:42:28 -05:00
Kayla はな	e35717bc19	fix: show a notice when workspace sharing is disabled globally in organization settings (#22580)	2026-03-04 11:14:52 -07:00
Sas Swart	8c09df52f9	fix(coderd): use WaitSuperLong in TestReinitializeAgent (#22593 ) Fixes coder/internal#642 We recently fixed Windows specific flakes for this test and reenabled it. It then failed intermittently due to context deadline expiration. The temporary path created on Windows contained invalid characters. This resulted in a silent startup script failure on Windows. The test then fruitlessly waited until context expiration. The test now uses a valid path on Windows.	2026-03-04 15:22:43 +02:00
Spike Curtis	56eb57caf4	chore: enable agent socket by default (#22352 ) relates to #21335 Enables the agent socket by default and updates docs to strike references to having to enable it. The PRs in this stack change the MCP server that Tasks use to update their status to rely on the agent socket, rather than directly dialing Coderd with the agent token. Default disable was a reasonable default when it was only used for the experimental script ordering features, but now that we want to use it for Tasks, it should be default on.	2026-03-03 21:23:59 +04:00
Sas Swart	e563766722	tests: re-enable 'TestReinitializeAgent' on Windows (#22488) closes https://github.com/coder/internal/issues/642 This PR: * re-enables `func TestReinitializeAgent(t *testing.T)` * adjusts it to use a Windows-specific command in Windows environments	2026-03-03 11:22:02 +02:00
Kyle Carberry	b7a7683ac0	fix(chatd): harden cross-replica relay for chat stream parts (#22533 ) ## Problem Subscribers connecting to a different replica than the one running the chat see full messages appear but no streaming partials (`message_part` events). The relay mechanism that forwards ephemeral parts across replicas had several bugs. ## Root Causes 1. `openRelay()` blocked the event loop — The WebSocket dial (TCP + TLS + HTTP upgrade) to the worker replica ran synchronously inside the select loop. While dialing, no events could be processed, channels filled up, and parts were silently dropped. 2. Relay drops were permanent — When the relay WebSocket closed mid-stream, `relayParts` was set to nil and never reopened. No status notification would re-trigger it since the chat was still running on the same worker. 3. `drainInitial` snapshot race — The `default` case in the initial drain loop caused the snapshot to be empty if the remote hadn't flushed data yet (common immediately after WebSocket connect). 4. Duplicate event delivery — The `preloaded` slice caused snapshot events to be sent both in the return value and re-sent through the channel goroutine. ## Fixes ### `coderd/chatd/chatd.go` (Subscribe method) - Async relay dial: `openRelayAsync()` spawns a goroutine to dial the remote replica. The result (channel + cancel func) is delivered on a `relayReadyCh` channel that the select loop reads without blocking. - Relay reconnection: When the relay channel closes, a 500ms timer fires. The handler re-checks chat status from the DB and reopens the relay if the chat is still running on a remote worker. - Snapshot parts via channel: Relay snapshot + live parts are wrapped into a single channel so they flow through the same path, avoiding races with the select loop. ### `enterprise/coderd/chats.go` (newRemotePartsProvider) - Timer-based drain: Replaced `default` with a 1-second timer. 
After the first event, `Reset(0)` switches to non-blocking drain for remaining buffered events. - Remove preloaded duplication: The goroutine now only forwards new events; snapshot events are returned to the caller directly. ## Testing All existing tests pass: - `TestInterruptChatBroadcastsStatusAcrossInstances` - `TestSubscribeSnapshotIncludesStatusEvent` - `TestSubscribeNoPubsubNoDuplicateMessageParts` - `TestSubscribeAfterMessageID` - `TestChatStreamRelay/RelayMessagePartsAcrossReplicas`	2026-03-02 19:57:13 -05:00
Kyle Carberry	edee917d88	feat: add experimental agents support (#22290) feat: add AI chat system with agent tools and chat UI Introduce the chatd subsystem and Agents UI for AI-powered chat within Coder workspaces. - Add chatd package with chat loop, message compaction, prompt management, and LLM provider integration (OpenAI, Anthropic) - Add agent tools: create workspace, list/read templates, read/write/edit files, execute commands - Add chat API endpoints with streaming, message editing, and durable reconnection - Add database schema and migrations for chats, chat messages, chat providers, and chat model configs - Add RBAC policies and dbauthz enforcement for chat resources - Add Agents UI pages with conversation timeline, queued messages list, diff viewer, and model configuration panel - Add comprehensive test coverage including coderd integration tests, chatd unit tests, and Storybook stories - Gate feature behind experiments flag --------- Co-authored-by: Cian Johnston <cian@coder.com> Co-authored-by: Danielle Maywood <danielle@themaywoods.com> Co-authored-by: Jeremy Ruppel <jeremy@coder.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-27 16:50:56 +00:00
Susana Ferreira	ca234f346d	fix: mark presets as validation_failed to prevent endless prebuild retries (#22085 ) ## Description - Updates `wsbuilder` to return a `BuildError` with `http.StatusBadRequest` to signify a "validation error" on missing or invalid parameters - Adds a short-circuit in `prebuilds.StoreReconciler` to mark presets for which creating a build returns a "validation error" as "validation failed" and skip further attempts to reconcile. - Adds a test to verify the above - Introduces a new Prometheus metric `coderd_prebuilt_workspaces_preset_validation_failed` to track the above Closes: https://github.com/coder/coder/issues/21237 --------- Co-authored-by: Cian Johnston <cian@coder.com>	2026-02-27 14:26:48 +00:00
Jake Howell	a51eb40dca	fix: marshal `convertLicenses()` into a `[]` instead of `nil` (#22366 ) This was a bad smell that was being addressed by the frontend. This type was generating out to be a `nil`/`null` instead of an empty `License[]`. Now this returns as an empty array and we can actively check if we have no licenses with a length of `0`.	2026-02-28 00:23:41 +11:00
Dean Sheather	bef7eb9dcc	fix: avoid derp-related panic during wsproxy registration (#22322)	2026-02-27 00:07:14 +11:00
Jake Howell	d2787df442	feat: add AI Bridge request logs model filter (#22230 ) This pull-request implements a simple filtering logic so that we're able to pick which model the user actually used when logs were sent to AI Bridge. - Add `GET /aibridge/models` API endpoint that returns distinct model names from AI Bridge interceptions, with pagination and search support - New `ListAIBridgeModels` SQL query using case-sensitive prefix matching (`LIKE model \|\| '%'`) to allow B-tree index usage - Hand-written `ListAuthorizedAIBridgeModels` in `modelqueries.go` for RBAC authorization filter injection - `AIBridgeModels` search query parser in searchquery/search.go (defaults bare terms to the `model` field) - dbauthz wrappers, dbmetrics, and dbmock implementations for the new query <img width="292" height="185" alt="image" src="https://github.com/user-attachments/assets/134771df-2d26-4c54-acc4-27f58128b351" />	2026-02-26 02:40:45 +11:00
Jon Ayers	4f34452bcc	fix: use separate http.Transports for wsproxy tests (#22292 ) - Previously all tests were sharing the global http.Transport meaning on `Close` it would close connections presumed to be idle for other tests. fixes https://github.com/coder/internal/issues/112	2026-02-24 23:56:58 -06:00
Jon Ayers	0a7a3da178	fix: exclude provisioner_state from workspace_build_with_user view (#22159 ) The provisioner state for a workspace build was being loaded for every long-lived agent rpc connection. Since this state can be anywhere from kilobytes to megabytes this can gradually cause the `coderd` memory footprint to grow over time. It's also a lot of unnecessary allocations for every query that fetches a workspace build since only a few callers ever actually reference the provisioner state. This PR removes it from the returned workspace build and adds a query to fetch the provisioner state explicitly.	2026-02-23 22:46:17 -06:00
Sushant P	37a8e61ea2	chore: move Shared Workspaces from experiments to beta (#22206 ) * Removed the shared-workspaces experiment and cleaned up related middleware * Added beta tagging to the UI for shared workspaces	2026-02-23 08:30:32 -08:00
Jake Howell	d700f9ebc4	fix: restore block to `Managed Agents` on `Enterprise` (#22210 ) #21998 accidentally allowed `Managed Agents` usages whilst being on an `Enterprise` license. This was incorrect, it should work as the following (same as prior to #21998). \| Scenario \| Before your PRs \| After your PRs (bug) \| After this fix \| \|---\|---\|---\|---\| \| Unlicensed (AGPL) \| Permitted \| Permitted \| Permitted \| \| Licensed, no entitlement \| Blocked \| Permitted \| Blocked \| \| Licensed, explicitly disabled (limit=0) \| Blocked \| Permitted \| Blocked \| \| Licensed, entitled, under limit \| Permitted \| Permitted \| Permitted \| \| Licensed, entitled, over limit \| Blocked \| Permitted (advisory) \| Permitted (advisory) \| \| Any license, stop/delete \| Permitted \| Permitted \| Permitted \| \| Any license, non-AI build \| Permitted \| Permitted \| Permitted \|	2026-02-20 20:15:32 +11:00
Jake Howell	051ed34580	feat: convert `soft_limit` to `limit` (#22048 ) In relation to [`internal#1281`](https://github.com/coder/internal/issues/1281) Remove the `soft_limit` field from the `Feature` type and simplify license limit handling. This change: - Removes the `soft_limit` field from the API and SDK - Uses the soft limit value as the single `limit` value in the UI and API - Simplifies warning logic to only show warnings when the limit is exceeded - Updates tests to reflect the new behavior - Updates the UI to use the single limit value for display	2026-02-20 16:09:12 +11:00
Jake Howell	203899718f	feat: remove agent workspaces limit (#21998 ) In relation to [`internal#1281`](https://github.com/coder/internal/issues/1281) Managed agent workspace build limits are now advisory only. Breaching the limit no longer blocks workspace creation — it only surfaces a warning. - Removed hard-limit enforcement in `checkAIBuildUsage` so AI task builds are always permitted regardless of managed agent count. - Updated the license warning to remove "Further managed agent builds will be blocked." verbiage. - Updated tests to assert builds succeed beyond the limit instead of failing. - Removed the "Limit" display from the `ManagedAgentsConsumption` progress bar — the bar is now relative to the included allowance (soft limit) only, and turns orange when usage exceeds it. Bonus: - De-MUI'd `LicenseBannerView` — replaced Emotion CSS and MUI `Link` with Tailwind classes. - Added `highlight-orange` color token to the Tailwind theme.	2026-02-20 12:56:00 +11:00
Danielle Maywood	92a6d6c2c0	chore: remove unnecessary loop variable captures (#22180 ) Since Go 1.22, the loop variable capture issue is resolved. Variables declared by for loops are now per-iteration rather than per-loop, making the 'v := v' pattern unnecessary.	2026-02-19 09:02:19 +00:00
Paweł Banaszewski	90c11f3386	feat: add client column to aibridge_interceptions table (#21839 ) Adds `client` column to `aibridge_interceptions` table. It is set accordingly to what is passed from AI Bridge in `RecordInterception`. Adds interception filtering by `client` value. Depends on: https://github.com/coder/aibridge/pull/158 Updates aibridge library to include this change. Fixes: https://github.com/coder/aibridge/issues/31	2026-02-17 15:43:02 +01:00
Steven Masley	01f06671a1	chore: return 404, not 400 if missing or authz deny (#22069)	2026-02-13 08:19:07 -06:00
Callum Styan	5f3be6b288	feat: add provisioner job queue wait time histogram and jobs enqueued counter (#21869 ) This PR adds some metrics to help identify job enqueue rates and latencies. This work was initiated as a way to help reduce the cost of the observation/measurement itself for autostart scaletests, which impacts our ability to identify/reason about the load caused by autostart. See: https://github.com/coder/internal/issues/1209 I've extended the metrics here to account for regular user initiated builds, prebuilds, autostarts, etc. IMO there is still the question here of whether we want to include or need the `transition` label, which is only present on workspace builds. Including it does lead to an increase in cardinality, and in the case of the histogram (when not using native histograms) that's at least a few extra series for every bucket. We could remove the transition label there but keep it on the counter. Additionally, the histogram is currently observing latencies for other jobs, such as template builds/version imports, those do not have a transition type associated with them. Tested briefly in a workspace, can see metric values like the following: - `coderd_workspace_builds_enqueued_total{build_reason="autostart",provisioner_type="terraform",status="success",transition="start"} 1` - `coderd_provisioner_job_queue_wait_seconds_bucket{build_reason="autostart",job_type="workspace_build",provisioner_type="terraform",transition="start",le="0.025"} 1` --------- Signed-off-by: Callum Styan <callumstyan@gmail.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-12 13:40:47 -08:00
Susana Ferreira	220b9f3cc5	fix: track goroutines and fix race condition in reconciler (#21980) ## Problem CI failure showed 3 goroutines leaked in the prebuilds reconciler, all stuck in `select` state: 1) `MetricsCollector.BackgroundFetch` (metrics goroutine) 2) `StoreReconciler.Run` (main reconciliation loop) 3) `StoreReconciler.Run.func3()` (provisioner job publisher goroutine) All three goroutines were waiting for `ctx.Done()`, which likely means `cancelFn()` was never called to trigger shutdown. Note: I was unable to reproduce the flake locally. The likely cause was a race condition between `Run()` and `Stop()` where `Stop()` could check `running` (seeing `false`), return early, and then `Run()` would start goroutines that never get cleaned up. This could happen in any `coderd` test that starts a server with prebuilds enabled. ### Problems identified 1) Missing waitgroup tracking: the provisioner job publisher goroutine was not tracked in the waitgroup, so it was not waited on for a clean shutdown in `Run defer func()`. 2) The provisioner job publisher goroutine had a redundant `case <-c.done` that could race with the `Stop()` select statement. 3) Race condition between `Run()` and `Stop()`: the `running` and `stopped` fields were `atomic.Bool` values checked and set independently, allowing a window where `Stop()` could see `running=false` and return early, then `Run()` would set `running=true` and start goroutines that would never be cleaned up. This could happen in any `coderd` test that starts a server with prebuilds enabled. 
## Changes * Added `wg.Add(1)` and `defer wg.Done()` to track provisioner job publisher goroutine in waitgroup * Removed redundant `case <-c.done` from provisioner job publisher goroutine to eliminate race condition * Replaced `atomic.Bool` for `running` and `stopped` with a `sync.Mutex` lifecycle state, also protecting `cancelFn` under the same mutex, to eliminate the race between `Run()` and `Stop()` * Added a guard in `Run()` to prevent double-start (`c.stopped \|\| c.running`) * Improved comments in Stop() and Run() to clarify shutdown behavior Closes: https://github.com/coder/internal/issues/1116	2026-02-12 15:35:42 +00:00
Cian Johnston	25a0c807cb	chore(coderd/database/dbfake): add support for provisioner job timestamp control (#21944) Relates to https://github.com/coder/coder/pull/21922 / https://github.com/coder/internal/issues/1259 * Adds `dbfake.BuilderOption func(WorkspaceBuildBuilder)` Adds `BuilderOption` methods for setting various provisioner job related fields on `WorkspaceBuildBuilder`. * Migrates a number of existing tests that previously depended on provisioner job timing to use these updated methods in the following packages: * `coderd/jobreaper` * `coderd/notifications/reports` * `enterprise/coderd/schedule` * `enterprise/coderd/prebuilds` * `scripts/workspace-runtime-audit` 🤖 Created using Mux (Opus 4.5) --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-02-06 09:44:40 +00:00
Cian Johnston	91be688e39	chore(coderd/database): remove deprecated db2sdk.List(Lazy)? methods (#21902 ) Removes deprecated methods db2sdk.List and db2sdk.ListLazy.	2026-02-03 17:52:07 +00:00
Jon Ayers	3c1db17361	fix: use existing transaction to claim prebuild (#21862 ) - Claiming a prebuild was happening outside a transaction	2026-02-02 17:57:59 -06:00
Dean Sheather	6954b73f8a	fix: prevent panic from duplicate metrics registration on license upload (#21832)	2026-02-02 20:57:06 +11:00
Ethan	a464ab67c6	test: use explicit names in TestStartAutoUpdate to prevent flake (#21745 ) The test was creating two template versions without explicit names, relying on `namesgenerator.NameDigitWith()` which can produce collisions. When both versions got the same random name, the test failed with a 409 Conflict error. Fix by giving each version an explicit name (`v1`, `v2`). Closes https://github.com/coder/internal/issues/1309 --- Generated by [mux](https://mux.coder.com)	2026-01-30 13:24:06 +11:00
Marcin Tojek	04b0253e8a	feat: add Prometheus metrics for license warnings and errors (#21749 ) Fixes: coder/internal#767 Adds two new Prometheus metrics for license health monitoring: - `coderd_license_warnings` - count of active license warnings - `coderd_license_errors` - count of active license errors Metrics endpoint after startup of a deployment with license enabled: ``` ... # HELP coderd_license_errors The number of active license errors. # TYPE coderd_license_errors gauge coderd_license_errors 0 ... # HELP coderd_license_warnings The number of active license warnings. # TYPE coderd_license_warnings gauge coderd_license_warnings 0 ... ```	2026-01-29 13:50:15 +01:00
Cian Johnston	c2c225052a	chore(enterprise/coderd): ensure TestManagedAgentLimit differentiates between tasks and workspaces (#21731 ) My previous change to this test did not create another workspace using the template containing `coder_ai_task` resources, meaning that this test was not actually testing the right thing. This PR addresses this oversight.	2026-01-28 16:30:56 +00:00
Zach	2204731ddb	feat: implement boundary usage tracker and telemetry collection (#21716 ) Implements boundary usage tracking across all Coder replicas and reports the aggregated stats via telemetry. Changes: - Implement Tracker with Track(), FlushToDB(), and StartFlushLoop() methods - Add telemetry integration via collectBoundaryUsageSummary() - Use telemetry lock to ensure only one replica collects per period The tracker accumulates unique workspaces, unique users, and request counts (allowed/denied) in memory, then flushes to the database periodically. During telemetry collection, stats are aggregated across all replicas and reset for the next period.	2026-01-27 19:11:40 -07:00
Steven Masley	799b190dee	fix: do not enforce managed agent limit for non-task workspaces (#21689 ) Only task workspaces have the checks in wsbuilder for violating the managed agent caps in the license. Stopped tasks that are resumed with a regular workspace start still count as usage.	2026-01-27 19:01:17 -06:00
Cian Johnston	7b44976618	fix(coderd/provisionerdserver): correct managed agent tracking (#21696 ) Relates to https://github.com/coder/internal/issues/1282 Updates tracking of managed agents to be predicated on the presence of a related `task_id` instead of the presence of a `coder_ai_task` resource.	2026-01-27 12:14:52 +00:00
Jake Howell	6f15b178a4	feat: extend premium license for `aigovernance` (#21499 ) Closes [#1227](https://github.com/coder/internal/issues/1227) Added support for license addons, starting with AI Governance, to enable dynamic feature grouping without requiring license reissuance. ### What changed? - Introduced a new `Addon` type to represent groupings of features that can be added to licenses - Created the first addon `AddonAIGovernance` which includes AI Bridge and Boundary features - Added validation for addon dependencies to ensure required features are present - Added new features: `FeatureBoundary` and `FeatureAIGovernanceUserLimit` - Updated license entitlement logic to handle addons and their features - Added helper methods to check if features belong to addons - Updated tests to verify addon functionality ### Why make this change? This change introduces a more flexible licensing model that allows features to be grouped into addons that can be added to licenses without requiring reissuance when new features are added to an addon. This is particularly useful for specialized feature sets like AI Governance, where related features can be bundled together and sold as a separate SKU. The addon approach allows for better organization of features and more granular control over entitlements.	2026-01-27 22:33:53 +11:00
Kacper Sawicki	78bc5861e0	feat(enterprise/coderd): add soft warning for AI Bridge GA transition (#21675 ) ## Summary AI Bridge is moving to General Availability in v2.30 and will require the AI Governance Add-On license in future versions. This adds a soft warning for deployments using AI Bridge via Premium/Enterprise FeatureSet without an explicit AI Bridge add-on license. Relates to: https://github.com/coder/internal/issues/1226 ## Changes - Track whether AI Bridge was explicitly granted via license Features (add-on) vs inherited from FeatureSet - Show soft warning when AI Bridge is enabled and entitled via FeatureSet but not via explicit add-on - Changed AI Bridge enablement from hardcoded `true` to check `CODER_AIBRIDGE_ENABLED` deployment config ## Behavior Change AI Bridge is now only marked as "enabled" in entitlements when `CODER_AIBRIDGE_ENABLED=true` is set in the deployment config. Previously, it was always enabled for Premium/Enterprise licenses regardless of the config setting. This change ensures that users who do not use AI Bridge will not see the soft warning about the upcoming license requirement. ## Warning Message > AI Bridge is now Generally Available in v2.30. In a future Coder version, your deployment will require the AI Governance Add-On to continue using this feature. Please reach out to your account team or sales@coder.com to learn more. ## Behavior | Condition | Warning Shown | |-----------|---------------| | AI Bridge disabled | ❌ No | | AI Bridge enabled + explicit add-on license | ❌ No | | AI Bridge enabled + Premium/Enterprise FeatureSet (no add-on) | ✅ Yes | ## Screenshots ### 1. No license <img width="1708" height="577" alt="image" src="https://github.com/user-attachments/assets/cbdbfd4d-55de-4d70-8abf-2665f458e96f" /> ### 2. No license + CODER_AIBRIDGE_ENABLED=true <img width="1716" height="513" alt="image" src="https://github.com/user-attachments/assets/344aae76-7703-485f-b568-1f13a1efa48f" /> ### 3. Premium license + CODER_AIBRIDGE_ENABLED=false <img width="1687" height="389" alt="image" src="https://github.com/user-attachments/assets/c2be12b0-1c0f-438d-a293-f9ec9fe6a736" /> ### 4. Premium license + CODER_AIBRIDGE_ENABLED=true <img width="1707" height="525" alt="image" src="https://github.com/user-attachments/assets/1a4640e1-e656-4f9b-bed0-9390cb5d6a84" /> ## Notes - TODO comments added to mark code that should be removed when AI Bridge enforcement is added - Feature continues to work - this is just a transitional warning (soft enforcement)	2026-01-26 10:46:45 +01:00
Kacper Sawicki	b82693d4cc	feat(codersdk): revert "remove AI Bridge entitlement from Premium license" (#21653 ) Reverts coder/coder#21540	2026-01-23 15:58:12 +00:00
Susana Ferreira	f5858c8a18	fix: unregister metrics on reconciler stop to prevent panic on restart (#21647 ) ## Description Fixes a panic that occurs when the prebuilds feature is toggled by adding/removing a license. The `StoreReconciler` was not unregistering the `reconciliationDuration` histogram, causing a "duplicate metrics collector registration attempted" panic when a new reconciler was created. ## Changes * Unregister the `reconciliationDuration` histogram in `Stop()` alongside the existing metrics collector * Change log level when stopping the reconciler with a cause, since "entitlements change" is not an error condition * Add `TestReconcilerLifecycle` to verify the reconciler can be stopped and recreated with the same prometheus registry Related to internal slack thread: https://codercom.slack.com/archives/C07GRNNRW03/p1769116582171379	2026-01-23 14:45:27 +00:00
Kacper Sawicki	9843adb8c6	feat(codersdk): remove AI Bridge entitlement from Premium license (#21540 ) ## Summary AI Bridge is moving out of Premium as a separate add-on (GA on Feb 3). Closes https://github.com/coder/internal/issues/1226 ## Changes - Excludes `FeatureAIBridge` from `Enterprise()` and `FeatureSetPremium.Features()` - Adds soft warning for deployments with AI Bridge enabled but not entitled - Warning is displayed to Auditor/Owner roles in UI banner and CLI headers ## Warning Message When AI Bridge is enabled (`CODER_AIBRIDGE_ENABLED=true`) but the license doesn't include the entitlement: > AI Bridge has reached General Availability and your Coder deployment is not entitled to run this feature. Contact your account team (https://coder.com/contact) for information around getting a license with AI Bridge. ## Behavior - The feature remains usable in v2.30 (soft warning only) - Future versions may include hard enforcement	2026-01-23 13:48:27 +01:00
George K	d29a168785	fix(coderd/rbac): reinstate deployment-wide workspace.share permission for owner role (#21620 ) The removal of that permission from the role broke valid use cases (e.g. a site owner user creating a workspace owned by a system account and then trying to share it with another user). The bulk of the PR is made up of the rollbacks of the previously introduced test updates necessitated by the removal. Related to: https://github.com/coder/internal/issues/1285	2026-01-22 08:12:15 -08:00
Mathias Fredriksson	97e8a5b093	fix(coderd): allow agent auth during workspace shutdown (#21538 ) Agents were losing authentication during workspace shutdown, causing shutdown scripts to fail. The auth query required agents to belong to the latest build, but during shutdown a `stop` build becomes latest while the `start` build's agents are still running. Modified the auth query to allow `start` build agents to authenticate temporarily during `stop` execution. The query allows auth when: - Agent's `start` build job succeeded - Latest build is `stop` with `pending`/`running` job status - Builds are adjacent (`stop` is `build_number + 1`) - Template versions match Auth closes once `stop` completes. Renamed `GetWorkspaceAgentAndLatestBuildByAuthToken` to `GetAuthenticatedWorkspaceAgentAndBuildByAuthToken` since it returns the agent's build (not always latest) during shutdown. Closes coder/internal#1249 Fixes #19467	2026-01-21 13:18:43 +00:00
Susana Ferreira	6ef9670384	fix: limit concurrent database connections in prebuild reconciliation (#20908 ) ## Description This PR addresses database connection pool exhaustion during prebuilds reconciliation by introducing two changes: * `CanSkipReconciliation`: Filters out presets that don't need reconciliation before spawning goroutines. This ensures we only create goroutines for presets that will (_most likely_) perform database operations, avoiding unnecessary connection pool usage. * Dynamic `eg.SetLimit`: Limits concurrent goroutines based on the configured database connection pool size (`CODER_PG_CONN_MAX_OPEN / 2`). This replaces the previous hardcoded limit of 5, ensuring the reconciliation loop scales appropriately with the configured pool size while leaving capacity for other database operations. ## Changes * Add `CanSkipReconciliation()` method to `PresetSnapshot` that returns true for inactive presets with no running workspaces, no pending jobs, or expired prebuilds. * Add `maxDBConnections` parameter to `NewStoreReconciler` and compute `reconciliationConcurrency` as half the pool size (minimum 1). * Add `ReconciliationConcurrency()` getter method to `StoreReconciler`. * Add `eg.SetLimit(c.reconciliationConcurrency)` to bound concurrent reconciliation goroutines. * Add `PresetsTotal` and `PresetsReconciled` to `ReconcileStats` for observability. * Add `TestCanSkipReconciliation` unit tests. * Add `TestReconciliationConcurrency` unit tests. * Add benchmark tests for reconciliation performance. ## Benchmarks * `BenchmarkReconcileAll_NoOps`: Tests presets with no reconciliation actions. All presets are filtered by `CanSkipReconciliation`, resulting in no goroutines spawned and no database connections used. * `BenchmarkReconcileAll_ConnectionContention`: Tests presets where all require reconciliation actions. All presets spawn goroutines, but concurrency is limited by `eg.SetLimit(reconciliationConcurrency)`. 
* `BenchmarkReconcileAll_Mix`: Simulates a realistic scenario with a large subset of inactive presets (filtered by `CanSkipReconciliation`) and a smaller subset requiring reconciliation (limited by `eg.SetLimit`). Closes: https://github.com/coder/coder/issues/20606	2026-01-21 10:56:31 +00:00

752 Commits