coder

mirror of https://github.com/coder/coder.git synced 2026-06-04 05:28:20 +00:00

Author	SHA1	Message	Date
Susana Ferreira	ca234f346d	fix: mark presets as validation_failed to prevent endless prebuild retries (#22085 ) ## Description - Updates `wsbuilder` to return a `BuildError` with `http.StatusBadRequest` to signify a "validation error" on missing or invalid parameters - Adds a short-circuit in `prebuilds.StoreReconciler` to mark presets for which creating a build returns a "validation error" as "validation failed" and skip further attempts to reconcile. - Adds a test to verify the above - Introduces a new Prometheus metric `coderd_prebuilt_workspaces_preset_validation_failed` to track the above Closes: https://github.com/coder/coder/issues/21237 --------- Co-authored-by: Cian Johnston <cian@coder.com>	2026-02-27 14:26:48 +00:00
Danielle Maywood	92a6d6c2c0	chore: remove unnecessary loop variable captures (#22180 ) Since Go 1.22, the loop variable capture issue is resolved. Variables declared by for loops are now per-iteration rather than per-loop, making the 'v := v' pattern unnecessary.	2026-02-19 09:02:19 +00:00
Callum Styan	5f3be6b288	feat: add provisioner job queue wait time histogram and jobs enqueued counter (#21869 ) This PR adds some metrics to help identify job enqueue rates and latencies. This work was initiated as a way to help reduce the cost of the observation/measurement itself for autostart scaletests, which impacts our ability to identify/reason about the load caused by autostart. See: https://github.com/coder/internal/issues/1209 I've extended the metrics here to account for regular user initiated builds, prebuilds, autostarts, etc. IMO there is still the question here of whether we want to include or need the `transition` label, which is only present on workspace builds. Including it does lead to an increase in cardinality, and in the case of the histogram (when not using native histograms) that's at least a few extra series for every bucket. We could remove the transition label there but keep it on the counter. Additionally, the histogram is currently observing latencies for other jobs, such as template builds/version imports, those do not have a transition type associated with them. Tested briefly in a workspace, can see metric values like the following: - `coderd_workspace_builds_enqueued_total{build_reason="autostart",provisioner_type="terraform",status="success",transition="start"} 1` - `coderd_provisioner_job_queue_wait_seconds_bucket{build_reason="autostart",job_type="workspace_build",provisioner_type="terraform",transition="start",le="0.025"} 1` --------- Signed-off-by: Callum Styan <callumstyan@gmail.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-12 13:40:47 -08:00
Susana Ferreira	220b9f3cc5	fix: track goroutines and fix race condition in reconciler (#21980 ) ## Problem CI failure showed 3 goroutines leaked in the prebuilds reconciler, all stuck in `select` state: 1) `MetricsCollector.BackgroundFetch` (metrics goroutine) 2) `StoreReconciler.Run` (main reconciliation loop) 3) `StoreReconciler.Run.func3()` (provisioner job publisher goroutine) All three goroutines were waiting for `ctx.Done()`, which likely means `cancelFn()` was never called to trigger shutdown. Note: I was unable to reproduce the flake locally. The likely cause was a race condition between `Run()` and `Stop()` where `Stop()` could check `running` (seeing `false`), return early, and then `Run()` would start goroutines that never get cleaned up. This could happen in any `coderd` test that starts a server with prebuilds enabled. ### Problems identified 1) Missing waitgroup tracking: provisioner job publisher goroutine was not tracked in the waitgroup, therefore, this goroutine was not tracked for a clean shutdown in `Run defer func()`. 2) The provisioner job publisher goroutine had a redundant `case <-c.done` that could race with `Stop()` select statement. 3) Race condition between `Run()` and `Stop()`: the `running` and `stopped` fields were `atomic.Bool` values checked and set independently, allowing a window where `Stop()` could see `running=false` and return early, then `Run()` would set `running=true` and start goroutines that would never be cleaned up. This could happen in any `coderd` test that starts a server with prebuilds enabled. 
## Changes * Added `wg.Add(1)` and `defer wg.Done()` to track provisioner job publisher goroutine in waitgroup * Removed redundant `case <-c.done` from provisioner job publisher goroutine to eliminate race condition * Replaced `atomic.Bool` for `running` and `stopped` with a `sync.Mutex` lifecycle state, also protecting `cancelFn` under the same mutex, to eliminate the race between `Run()` and `Stop()` * Added a guard in `Run()` to prevent double-start (`c.stopped || c.running`) * Improved comments in Stop() and Run() to clarify shutdown behavior Closes: https://github.com/coder/internal/issues/1116	2026-02-12 15:35:42 +00:00
Cian Johnston	25a0c807cb	chore(coderd/database/dbfake): add support for provisioner job timestamp control (#21944 ) Relates to https://github.com/coder/coder/pull/21922 / https://github.com/coder/internal/issues/1259 * Adds `dbfake.BuilderOption func(WorkspaceBuildBuilder)` Adds `BuilderOption` methods for setting various provisioner job related fields on `WorkspaceBuildBuilder`. * Migrates a number of existing tests that previously depended on provisioner job timing to use these updated methods in the following packages: * `coderd/jobreaper` * `coderd/notifications/reports` * `enterprise/coderd/schedule` * `enterprise/coderd/prebuilds` * `scripts/workspace-runtime-audit` 🤖 Created using Mux (Opus 4.5) --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-02-06 09:44:40 +00:00
Jon Ayers	3c1db17361	fix: use existing transaction to claim prebuild (#21862 ) - Claiming a prebuild was happening outside a transaction	2026-02-02 17:57:59 -06:00
Susana Ferreira	f5858c8a18	fix: unregister metrics on reconciler stop to prevent panic on restart (#21647 ) ## Description Fixes a panic that occurs when the prebuilds feature is toggled by adding/removing a license. The `StoreReconciler` was not unregistering the `reconciliationDuration` histogram, causing a "duplicate metrics collector registration attempted" panic when a new reconciler was created. ## Changes * Unregister the `reconciliationDuration` histogram in `Stop()` alongside the existing metrics collector * Change log level when stopping the reconciler with a cause, since "entitlements change" is not an error condition * Add `TestReconcilerLifecycle` to verify the reconciler can be stopped and recreated with the same prometheus registry Related to internal slack thread: https://codercom.slack.com/archives/C07GRNNRW03/p1769116582171379	2026-01-23 14:45:27 +00:00
Susana Ferreira	6ef9670384	fix: limit concurrent database connections in prebuild reconciliation (#20908 ) ## Description This PR addresses database connection pool exhaustion during prebuilds reconciliation by introducing two changes: * `CanSkipReconciliation`: Filters out presets that don't need reconciliation before spawning goroutines. This ensures we only create goroutines for presets that will (_most likely_) perform database operations, avoiding unnecessary connection pool usage. * Dynamic `eg.SetLimit`: Limits concurrent goroutines based on the configured database connection pool size (`CODER_PG_CONN_MAX_OPEN / 2`). This replaces the previous hardcoded limit of 5, ensuring the reconciliation loop scales appropriately with the configured pool size while leaving capacity for other database operations. ## Changes * Add `CanSkipReconciliation()` method to `PresetSnapshot` that returns true for inactive presets with no running workspaces, no pending jobs, or expired prebuilds. * Add `maxDBConnections` parameter to `NewStoreReconciler` and compute `reconciliationConcurrency` as half the pool size (minimum 1). * Add `ReconciliationConcurrency()` getter method to `StoreReconciler`. * Add `eg.SetLimit(c.reconciliationConcurrency)` to bound concurrent reconciliation goroutines. * Add `PresetsTotal` and `PresetsReconciled` to `ReconcileStats` for observability. * Add `TestCanSkipReconciliation` unit tests. * Add `TestReconciliationConcurrency` unit tests. * Add benchmark tests for reconciliation performance. ## Benchmarks * `BenchmarkReconcileAll_NoOps`: Tests presets with no reconciliation actions. All presets are filtered by `CanSkipReconciliation`, resulting in no goroutines spawned and no database connections used. * `BenchmarkReconcileAll_ConnectionContention`: Tests presets where all require reconciliation actions. All presets spawn goroutines, but concurrency is limited by `eg.SetLimit(reconciliationConcurrency)`. 
* `BenchmarkReconcileAll_Mix`: Simulates a realistic scenario with a large subset of inactive presets (filtered by `CanSkipReconciliation`) and a smaller subset requiring reconciliation (limited by `eg.SetLimit`). Closes: https://github.com/coder/coder/issues/20606	2026-01-21 10:56:31 +00:00
Susana Ferreira	000bc334c9	fix: reuse reconciliation lock transaction for read operations in prebuilds (#21408 ) ## Description Reuses the reconciliation lock transaction for read operations during prebuilds reconciliation, reducing unnecessary database connections. ## Changes * Use the lock transaction (`db`) for read operations and `c.store` for write operations: * `GetPrebuildsSettings`: now uses `db` * `SnapshotState`: now uses `db` * `MembershipReconciler`: continues to use `c.store` (performs write operations) * Add comments explaining the transaction model and when to use `db` vs `c.store` Related to: https://github.com/coder/coder/pull/20587	2026-01-13 15:04:51 +00:00
Spike Curtis	bddb808b25	chore: arrange imports in a standard way (#21452 ) Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example: ``` import ( "context" "time" "github.com/prometheus/client_golang/prometheus" "golang.org/x/xerrors" "gopkg.in/natefinch/lumberjack.v2" "cdr.dev/slog/v3" "github.com/coder/coder/v2/codersdk/agentsdk" "github.com/coder/serpent" ) ``` 3 groups: standard library, 3rd party libs, Coder libs. This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.	2026-01-08 15:24:11 +04:00
Spike Curtis	49b34a716a	fix: fix slog to always use array of Fields (#21426 ) Upgrades to slog v3 which includes a small, but backward incompatible API change to the acceptable call arguments when logging. This change allows us to verify via compile time type checking that arguments are correct and won't cause a panic, as was possible in slog v1, which this replaces (v2 was tagged but never used in coder/coder). It also updates dependencies that also use slog and were updated. I've left the `aibridge` dependency as a commit SHA, under the assumption that the team there (cc @pawbana @dannykopping ) will tag and update the dependency soon and on their own schedule. Other dependencies, I pushed new tags.	2026-01-08 10:29:41 +04:00
Sas Swart	9a0024c45f	chore: add tracing to prebuilds (#21443 ) The implementation for prebuilt workspaces is complex and conversations regarding edge cases and bugs frequently get bogged down by minutiae, because it's hard to reason about the behaviour of the system. To alleviate this, I've introduced otel tracing to the StoreReconciler (see attached). We can now directly observe the behaviour of the prebuilds system under load in order to inform our decisions. Traces are terminated at the boundary between prebuilds and workspace builder, because of prebuilt workspaces' "fire and forget" philosophy and to prevent span explosion. <img width="3024" height="1718" alt="image" src="https://github.com/user-attachments/assets/f9b207be-8f2c-475e-98a8-46ef70bda446" />	2026-01-07 11:04:40 +02:00
Steven Masley	3194bcfc9e	chore: distinct operations for provisioner's 'parse', 'init', 'plan', 'apply', 'graph' (#21064 ) Provisioner steps broken into smaller granular actions. Changes: - `ExtractArchive` moved to `init` request (was in `configure`) - Writing `tfstate` moved to `plan` (was in `configure`) - Moved most plan/apply outputs to `GraphComplete`	2025-12-15 11:26:41 -06:00
Susana Ferreira	ca94588bd5	fix: send prebuild job notification after job build db commit (#20693 ) ## Problem Fix race condition in prebuilds reconciler. Previously, a job notification event was sent to a Go channel before the provisioning database transaction completed. The notification is consumed by a separate goroutine that publishes to PostgreSQL's LISTEN/NOTIFY, using a separate database connection. This creates a potential race: if a provisioner daemon receives the notification and queries for the job before the provisioning transaction commits, it won't find the job in the database. This manifested as a flaky test failure in `TestReinitializeAgent`, where provisioners would occasionally miss newly created jobs. The test uses a 25-second timeout context, while the acquirer's backup polling mechanism checks for jobs every 30 seconds. This made the race condition visible in tests, though in production the backup polling would eventually pick up the job. The solution presented here guarantees that a job notification is only sent after the provisioning database transaction commits. ## Changes * The `provision()` and `provisionDelete()` functions now return the provisioner job instead of sending notifications internally. * A new `publishProvisionerJob()` helper centralizes the notification logic and is called after each transaction completes. Closes: https://github.com/coder/internal/issues/963	2025-11-12 10:36:39 +00:00
Susana Ferreira	7e8fcb4b0f	perf: optimize prebuilds membership reconciliation to check orgs not presets (#20493 ) ## Description The membership reconciliation ensures the prebuilds system user is a member of all organizations with prebuilds configured. To support prebuilds quota management, each organization must have a prebuilds group that the system user belongs to. ## Problem Previously, membership reconciliation iterated over all presets to check and update membership status. This meant database queries `GetGroupByOrgAndName` and `InsertGroupMember` were executed for each preset. Since presets are unique combinations of `(organization, template, template version, preset)`, this resulted in several redundant checks for the same organization. In dogfood, `InsertGroupMember` was called thousands of times per day, even though memberships were already configured ([internal Grafana dashboard link](https://grafana.dev.coder.com/goto/46MZ1UgDg?orgId=1)) <img width="5382" height="1788" alt="Screenshot 2025-10-28 at 16 01 36" src="https://github.com/user-attachments/assets/757b7253-106f-4f72-8586-8e2ede9f18db" /> ## Solution This PR introduces `GetOrganizationsWithPrebuildStatus`, a single query that returns: * All unique organizations with prebuilds configured * Whether the prebuilds user is a member of each organization * Whether the prebuilds group exists in each organization * Whether the prebuilds user is in the prebuilds group The membership reconciliation logic now: * Fetches status for all organizations in one query * Only performs inserts for organizations missing required memberships or groups * Safely handles concurrent operations via unique constraint violations * This reduces database load from `O(presets)` to `O(organizations)` per reconciliation loop, with a single read query when everything is configured. 
## Changes * Add `GetOrganizationsWithPrebuildStatus` SQL query * Update `membership.ReconcileAll` to use organization-based reconciliation instead of preset-based * Update tests to reflect new behavior Related to internal thread: https://codercom.slack.com/archives/C07GRNNRW03/p1760535570381369	2025-10-29 14:24:29 +00:00
Susana Ferreira	aad1b401c1	feat: add prebuilds reconciliation duration metric (#20535 ) ## Description Adds `coderd_prebuilds_reconciliation_duration_seconds` histogram metric to track the duration of each prebuilds reconciliation cycle. This metric helps operators monitor reconciliation performance and identify potential bottlenecks. ## Changes - Added `ReconcileStats` struct to capture reconciliation cycle statistics - Updated `ReconcileAll()` to return stats including elapsed time - Added histogram metric `coderd_prebuilds_reconciliation_duration_seconds`	2025-10-29 12:52:30 +00:00
Susana Ferreira	c3e3bb58f2	feat: delete pending canceled prebuilds (#20499 ) ## Description PR https://github.com/coder/coder/pull/20387 introduced canceling pending prebuild jobs from inactive template versions to avoid provisioning obsolete workspaces. However, the associated prebuilds remained in the database with "Canceled" status, visible in the UI. This PR now orphan-deletes these canceled prebuilt workspaces. Since the canceled jobs were never processed by a provisioner, no Terraform resources were created, making orphan deletion safe. Orphan deletion always creates a provisioner job, but behaves differently based on provisioner availability: - If no provisioner daemon is available, the job is immediately marked as completed and the workspace is marked as deleted without any provisioner processing - If a provisioner daemon is available, it processes the delete job with empty Terraform state (no actual resources to destroy) The job cancellation and workspace deletion occur atomically in the same transaction. We don't split this into two separate reconciliation runs because there's no way to distinguish between system-canceled prebuilds and user-canceled workspaces. If we deleted canceled workspaces in a later run, we'd delete user-canceled workspaces that users may want to keep for troubleshooting. Note: This only applies to system-generated prebuilds from inactive template versions. ## Changes * Update `UpdatePrebuildProvisionerJobWithCancel` query to return job ID, workspace ID, template ID, and template version preset ID * Add `DeprovisionMode` enum to support orphan deletion in the provision flow * Update `ActionTypeCancelPending` handler to cancel jobs and orphan-delete associated workspaces atomically	2025-10-29 10:37:28 +00:00
Susana Ferreira	f6e86c6fdb	feat: cancel pending prebuilds from non-active template versions (#20387 ) ## Description This PR introduces an optimization to automatically cancel pending prebuild-related jobs from non-active template versions in the reconciliation loop. ## Problem Currently, when a template is configured with more prebuild instances than available provisioners, the provisioner queue can become flooded with pending prebuild jobs. This issue is worsened when provisioning/deprovisioning operations take a long time. When the prebuild reconciliation loop generates jobs faster than provisioners can process them, pending jobs accumulate in the queue. Since prebuilt workspaces should always run the latest active template version, pending prebuild jobs from non-active versions become obsolete once a new version is promoted. ## Solution The reconciliation loop cancels pending prebuild-related jobs from non-active template versions that match the following criteria: * Build number: 1 (initial build created by the reconciliation loop) * Job status: `pending` * Not yet picked up by a provisioner (`worker_id` is `NULL`) * Owned by the prebuilds system user * Workspace transition: `start` This prevents the queue from being cluttered with stale prebuild jobs that would provision workspaces on an outdated template version that would consequently need to be deprovisioned. ## Changes * Added new SQL query `CountPendingNonActivePrebuilds` to identify presets with pending jobs from non-active versions * Added new SQL query `UpdatePrebuildProvisionerJobWithCancel` to cancel jobs for a specific preset * New reconciliation action type `ActionTypeCancelPending` handles the cancellation logic * Cancellation is non-blocking: failures to cancel prebuild jobs are logged as errors and don't prevent other reconciliation actions ## Follow-up PR Canceling pending prebuild jobs leaves workspaces in a Canceled state. 
While no Terraform resources need to be destroyed (since jobs were canceled before provisioning started), these database records should still be cleaned up. This will be addressed in a follow-up PR. Closes: https://github.com/coder/coder/issues/20242	2025-10-24 15:27:49 +01:00
Hugo Dutka	e62c5db678	chore: remove references to dbtestutil.WillUsePostgres (#20436 ) Addresses https://github.com/coder/internal/issues/758. This PR only cleans up dead code, it makes no changes to test logic.	2025-10-23 14:24:54 +02:00
Sas Swart	544f15523c	fix: adjust workspace claims to be initiated by users (#20179 ) The prebuilds user never initiates a workspace claim autonomously. A claim can only happen when a user attempts to create a workspace. When listing prebuild provisioner jobs, it would not make sense to see jobs related to users who are creating workspaces and have gotten a prebuilt workspace. When cleaning up an overwhelmed provisioner queue, we should not delete claims as they have humans waiting for them and are not part of the thundering herd. Therefore, this PR ensures that provisioner jobs that claim workspaces are considered to be initiated by the user, not the prebuilds system.	2025-10-08 10:40:54 +02:00
Callum Styan	0ec9df390b	fix: reduce impact of GetPrebuildMetrics on database (#19694 ) see https://github.com/coder/internal/issues/959 but the tl;dr is: - we call this DB query on an interval (every 15s) and it would be called on each coderd replica as well - the generated values update very infrequently (for our most used internal template I saw the builds created/claimed update twice in a 1h period) - we have no index on the initiator ID, so this query has to scan the entire workspace_builds table on every request In reality this should likely just be a Prometheus metric, and Prometheus can handle the counter reset behaviour at query time, but for now this should at least cut the load of the query to 25% of its current impact. --------- Signed-off-by: Callum Styan <callumstyan@gmail.com>	2025-09-04 13:43:50 -07:00
Sas Swart	4e9ee80882	feat(enterprise/coderd): allow system users to be added to groups (#19518 ) closes https://github.com/coder/coder/issues/18274 This pull request makes system users visible in various group related queries so that they can be added to and removed from groups. This allows system user quotas to be configured. System users are still ignored in certain queries, such as when license seat consumption is determined. This pull request further ensures the existence of a "coder_prebuilt_workspaces" group in any organization that needs prebuilt workspaces --------- Co-authored-by: Susana Ferreira <susana@coder.com>	2025-08-27 16:57:59 +02:00
Dean Sheather	6eb02d1c2a	chore: wire up usage tracking for managed agents (#19096 ) Wires up the usage collector and publisher to coderd. Relates to coder/internal#814	2025-08-20 23:38:09 +10:00
Susana Ferreira	8567ecbe52	fix: set prebuilds lifecycle parameters on creation and claim (#19252 ) ## Description This PR ensures that prebuilt workspaces are properly excluded from the lifecycle executor and treated as a separate class of workspaces, fully managed by the prebuild reconciliation loop. It introduces two lifecycle guarantees: * When a prebuilt workspace is created (i.e., when the workspace build completes), all lifecycle-related fields are unset, ensuring the workspace does not participate in TTL, autostop, autostart, dormancy, or auto-deletion logic. * When a prebuilt workspace is claimed, it transitions into a regular user workspace. At this point, all lifecycle fields are correctly populated according to template-level configurations, allowing the workspace to be managed by the lifecycle executor as expected. ## Changes * Prebuilt workspaces now have all lifecycle-relevant fields unset during creation * When a prebuild is claimed: * Lifecycle fields are set based on template and workspace level configurations. This ensures a clean transition into the standard workspace lifecycle flow. * Updated lifecycle-related SQL update queries to explicitly exclude prebuilt workspaces. ## Relates Related issue: https://github.com/coder/coder/issues/18898 To reduce the scope of this PR and make the review process more manageable, the original implementation has been split into the following focused PRs: * https://github.com/coder/coder/pull/19259 * https://github.com/coder/coder/pull/19263 * https://github.com/coder/coder/pull/19264 * https://github.com/coder/coder/pull/19265 These PRs should be considered in conjunction with this one to understand the complete set of lifecycle separation changes for prebuilt workspaces.	2025-08-13 12:45:46 +01:00
Cian Johnston	afb54f6884	chore: revert feat(enterprise/coderd): allow system users to be added to groups (#19254 ) This reverts commit `b200fc8e67` (https://github.com/coder/coder/pull/18341).	2025-08-08 12:18:07 +01:00
Sas Swart	b200fc8e67	feat(enterprise/coderd): allow system users to be added to groups (#18341 ) closes https://github.com/coder/coder/issues/18274 This pull request makes system users visible in various group related queries so that they can be added to and removed from groups. This allows system user quotas to be configured. System users are still ignored in certain queries, such as when license seat consumption is determined. This pull request further ensures the existence of a "coder_prebuilt_workspaces" group in any organization that needs prebuilt workspaces ## Summary by CodeRabbit * New Features * Organization and group member listings now include system users. * Bug Fixes * Updated tests to reflect the inclusion of system users in member and group queries.	2025-08-08 11:03:17 +02:00
Dean Sheather	9a6dd73f68	feat: add managed agent license limit checks (#18937 ) - Adds a query for counting managed agent workspace builds between two timestamps - The "Actual" field in the feature entitlement for managed agents is now populated with the value read from the database - The wsbuilder package now validates AI agent usage against the limit when a license is installed Closes coder/internal#777	2025-07-22 13:39:26 +10:00
Cian Johnston	198d50dbc2	chore: replace original GetPrebuiltWorkspaces with optimized version (#18832 ) Fixes https://github.com/coder/internal/issues/715 Follow-up from https://github.com/coder/coder/pull/18717 Now that we've determined the updated query is safe, remove the duplication.	2025-07-21 15:31:11 +01:00
Cian Johnston	0367dbac43	chore: optimize GetPrebuiltWorkspaces query (#18717 ) * Adds GetRunningPrebuiltWorkspacesOptimized query * Runs both original and updated query side-by-side and logs diffs	2025-07-09 11:30:42 +01:00
Sas Swart	01163ea57b	feat: allow users to pause prebuilt workspace reconciliation (#18700 ) This PR provides two commands: * `coder prebuilds pause` * `coder prebuilds resume` These allow the suspension of all prebuilds activity, intended for use if prebuilds are misbehaving.	2025-07-02 15:05:42 +00:00
Susana Ferreira	b9e32c8eaf	refactor: remove unused enterprise prebuilds id.go (#18543 ) ## Description Remove unused `enterprise/coderd/prebuilds/id.go` file. Note: PR https://github.com/coder/coder/pull/18333 moved `SystemUserID` constant from `coderd/prebuilds/id.go` to the database package `PrebuildsSystemUserID` to resolve an import cycle: https://github.com/coder/coder/blob/main/coderd/database/constants.go	2025-06-24 19:28:41 +01:00
Yevhenii Shcherbina	bca5c35aa2	fix: remove notifications for hard-limited prebuilds (#18528 ) Relates to https://github.com/coder/internal/issues/674 Currently, we send notifications to all template admins for every failed and hard-limited preset. This can generate excessive noise—especially when someone is debugging a template and creates multiple broken versions in quick succession. For now, we've decided to remove hard-limited preset notifications to reduce excessive noise. In the long term, we plan to aggregate failure information and deliver it on a daily or weekly basis.	2025-06-24 08:43:16 -04:00
Steven Masley	82af2e019d	feat: implement dynamic parameter validation (#18482 ) # What does this do? This does parameter validation for dynamic parameters in `wsbuilder`. All input parameters are validated in `coder/coder` before being sent to terraform. The heart of this PR is [`ResolveParameters`](https://github.com/coder/coder/blob/b65001e89c0577199a8e470c138c51e91cf2350c/coderd/dynamicparameters/resolver.go#L30-L30). # What else changes? `wsbuilder` now needs to load the terraform files into memory to succeed. This does add a larger memory requirement to workspace builds. # Future work - Sort autostart handling workspaces by template version id. So workspaces with the same template version only load the terraform files once from the db, and store them in the cache.	2025-06-23 12:35:15 -05:00
ケイラ	fae30a00fd	chore: remove unnecessary redeclarations in for loops (#18440 )	2025-06-20 13:16:55 -06:00
Susana Ferreira	72f7d70bab	feat: allow TemplateAdmin to delete prebuilds via auth layer (#18333 ) ## Description This PR adds support for deleting prebuilt workspaces via the authorization layer. It introduces special-case handling to ensure that `prebuilt_workspace` permissions are evaluated when attempting to delete a prebuilt workspace, falling back to the standard `workspace` resource as needed. Prebuilt workspaces are a subset of workspaces, identified by having `owner_id` set to `PREBUILD_SYSTEM_USER`. This means: * A user with `prebuilt_workspace.delete` permission is allowed to delete only prebuilt workspaces. * A user with `workspace.delete` permission can delete both normal and prebuilt workspaces. ⚠️ This implementation is scoped to deletion operations only. No other operations are currently supported for the `prebuilt_workspace` resource. To delete a workspace, users must have the following permissions: * `workspace.read`: to read the current workspace state * `update`: to modify workspace metadata and related resources during deletion (e.g., updating the `deleted` field in the database) * `delete`: to perform the actual deletion of the workspace ## Changes * Introduced `authorizeWorkspace()` helper to handle prebuilt workspace authorization logic. * Ensured both `prebuilt_workspace` and `workspace` permissions are checked. * Added comments to clarify the current behavior and limitations. * Moved `SystemUserID` constant from the `prebuilds` package to the `database` package `PrebuildsSystemUserID` to resolve an import cycle (commit https://github.com/coder/coder/pull/18333/commits/f24e4ab4b6f0a56726fd04be2d7302c9fdb52d53). * Update middleware `ExtractOrganizationMember` to include system user members.	2025-06-20 17:36:32 +01:00
Yevhenii Shcherbina	8e3022ed9e	docs: add documentation for prebuild scheduling feature (#18462 ) Follow-up to https://github.com/coder/coder/pull/18126 Changes: - address issue mentioned here: https://github.com/coder/coder/pull/18126#discussion_r2144557600 - add docs for prebuilds scheduling --------- Co-authored-by: Danny Kopping <danny@coder.com> Co-authored-by: Atif Ali <atif@coder.com>	2025-06-20 10:08:47 -04:00
Yevhenii Shcherbina	0f6ca55238	feat: implement scheduling mechanism for prebuilds (#18126 ) Closes https://github.com/coder/internal/issues/312 Depends on https://github.com/coder/terraform-provider-coder/pull/408 This PR adds support for defining an autoscaling block for prebuilds, allowing number of desired instances to scale dynamically based on a schedule. Example usage: ``` data "coder_workspace_preset" "us-nix" { ... prebuilds = { instances = 0 # default to 0 instances scheduling = { timezone = "UTC" # a single timezone is used for simplicity # Scale to 3 instances during the work week schedule { cron = "* 8-18 * * 1-5" # from 8AM–6:59PM, Mon–Fri, UTC instances = 3 # scale to 3 instances } # Scale to 1 instance on Saturdays for urgent support queries schedule { cron = "* 8-14 * * 6" # from 8AM–2:59PM, Sat, UTC instances = 1 # scale to 1 instance } } } } ``` ### Behavior - Multiple `schedule` blocks per `prebuilds` block are supported. - If the current time matches any defined autoscaling schedule, the corresponding number of instances is used. - If no schedule matches, the default instance count (`prebuilds.instances`) is used as a fallback. ### Why This feature allows prebuild instance capacity to adapt to predictable usage patterns, such as: - Scaling up during business hours or high-demand periods - Reducing capacity during off-hours to save resources ### Cron specification The cron specification is interpreted as a continuous time range. For example, the expression: ``` * 9-18 * * 1-5 ``` is intended to represent a continuous range from 09:00 to 18:59, Monday through Friday. However, due to minor implementation imprecision, it is currently interpreted as a range from 08:59:00 to 18:58:59, Monday through Friday. 
This slight discrepancy arises because the evaluation is based on whether a specific point in time falls within the range, using the `github.com/coder/coder/v2/coderd/schedule/cron` library, which performs per-minute matching rather than strict range evaluation. --------- Co-authored-by: Danny Kopping <danny@coder.com>	2025-06-19 11:08:48 -04:00
Susana Ferreira	cda9208580	test: add ReconcileAll tests for multiple actions on expired prebuilds (#18265 ) ## Description Adds tests for `ReconcileAll` to verify the full reconciliation flow when handling expired prebuilds. This complements existing lower-level tests by checking multiple reconciliation actions (delete + create) at the higher reconciliation cycle level. Related with comment: https://github.com/coder/coder/pull/17996#issuecomment-2910516489	2025-06-17 13:06:36 +01:00
Sas Swart	5f7e5d7097	feat: support prebuilt workspaces in non-default organizations (#18010 ) closes https://github.com/coder/internal/issues/527	2025-06-04 14:20:29 +02:00
Yevhenii Shcherbina	b330c0803c	fix: reimplement reporting of preset-hard-limited metric (#18055 ) Addresses concerns raised in https://github.com/coder/coder/pull/18045	2025-05-28 14:18:32 -04:00
Yevhenii Shcherbina	e8c75eb1c3	fix: fix metric for hard-limited presets (#18045 ) ``` // Report a metric only if the preset uses the latest version of the template and the template is not deleted. // This avoids conflicts between metrics from old and new template versions. // // NOTE: Multiple versions of a preset can exist with the same orgName, templateName, and presetName, // because templates can have multiple versions — or deleted templates can share the same name. // // The safest approach is to report the metric only for the latest version of the preset. // When a new template version is released, the metric for the new preset should overwrite // the old value in Prometheus. // // However, there’s one edge case: if an admin creates a template, it becomes hard-limited, // then deletes the template and never creates another with the same name, // the old preset will continue to be reported as hard-limited — // even though it’s deleted. This will persist until `coderd` is restarted. ```	2025-05-27 10:07:36 -04:00
Spike Curtis	6c0bed0f53	chore: update to coder/quartz v0.2.0 (#18007 ) Upgrade to coder/quartz v0.2.0 including fixing up a minor API breaking change.	2025-05-27 16:05:03 +04:00
Susana Ferreira	6f6e73af03	feat: implement expiration policy logic for prebuilds (#17996 ) ## Summary This PR introduces support for expiration policies in prebuilds. The TTL (time-to-live) is retrieved from the Terraform configuration ([terraform-provider-coder PR](https://github.com/coder/terraform-provider-coder/pull/404)): ``` prebuilds = { instances = 2 expiration_policy { ttl = 86400 } } ``` Note: Since there is no need for precise TTL enforcement down to the second, in this implementation expired prebuilds are handled in a single reconciliation cycle: they are deleted, and new instances are created only if needed to match the desired count. ## Changes * The outcome of a reconciliation cycle is now expressed as a slice of reconciliation actions, instead of a single aggregated action. * Adjusted reconciliation logic to delete expired prebuilds and guarantee that the number of desired instances is correct. * Updated relevant data structures and methods to support expiration policy parameters. * Added documentation to the `Prebuilt workspaces` page. * Updated `terraform-provider-coder` to version 2.5.0: https://github.com/coder/terraform-provider-coder/releases/tag/v2.5.0 Depends on: https://github.com/coder/terraform-provider-coder/pull/404 Fixes: https://github.com/coder/coder/issues/17916	2025-05-26 20:31:24 +01:00
Yevhenii Shcherbina	2a15aa8a6f	feat: add hard-limited presets metric (#18008 ) Closes https://github.com/coder/coder/issues/17988 Define the `preset_hard_limited` metric, which indicates for every preset whether it has reached the hard failure limit (1 for hard-limited, 0 otherwise). CLI example: ``` curl -X GET localhost:2118/metrics | grep preset_hard_limited # HELP coderd_prebuilt_workspaces_preset_hard_limited Indicates whether a given preset has reached the hard failure limit (1 for hard-limited, 0 otherwise). # TYPE coderd_prebuilt_workspaces_preset_hard_limited gauge coderd_prebuilt_workspaces_preset_hard_limited{organization_name="coder",preset_name="GoLand: Large",template_name="Test7"} 1 coderd_prebuilt_workspaces_preset_hard_limited{organization_name="coder",preset_name="GoLand: Large",template_name="ValidTemplate"} 0 coderd_prebuilt_workspaces_preset_hard_limited{organization_name="coder",preset_name="IU: Medium",template_name="Test7"} 1 coderd_prebuilt_workspaces_preset_hard_limited{organization_name="coder",preset_name="IU: Medium",template_name="ValidTemplate"} 0 coderd_prebuilt_workspaces_preset_hard_limited{organization_name="coder",preset_name="WS: Small",template_name="Test7"} 1 ``` NOTE: ```go if !ps.Preset.Deleted && ps.Preset.UsingActiveVersion { c.metrics.trackHardLimitedStatus(ps.Preset.OrganizationName, ps.Preset.TemplateName, ps.Preset.Name, ps.IsHardLimited) } ``` Only the active template version is tracked. If an admin creates a new template version, the old metric value (for the previous template version) is overwritten with the new value (for the active template version), because `template_version` is not part of the metric labels: ```go labels = []string{"template_name", "preset_name", "organization_name"} ``` The implementation is similar to that of the `MetricResourceReplacementsCount` metric --------- Co-authored-by: Susana Ferreira <ssncferreira@gmail.com>	2025-05-26 11:39:44 -04:00
Yevhenii Shcherbina	53e8e9c7cd	fix: reduce cost of prebuild failure (#17697 ) Relates to https://github.com/coder/coder/issues/17432 ### Part 1: Notes: - The `GetPresetsAtFailureLimit` SQL query is added. It is similar to `GetPresetsBackoff` and uses the same CTEs (`filtered_builds`, `time_sorted_builds`), but the two queries are still different. - The query is executed on every loop iteration. As an optimization, we could mark a specific preset as permanently failed to avoid executing the query on every loop iteration, but I decided not to do that for now. - By default `FailureHardLimit` is set to 3. - `FailureHardLimit` is configurable. Setting it to zero disables the hard limit. ### Part 2 Notes: - The `PrebuildFailureLimitReached` notification is added. - The notification is sent to template admins. - The notification is sent only the first time the hard limit is reached, but a `log.Warn` is emitted on every loop iteration. - I introduced this enum: ```sql CREATE TYPE prebuild_status AS ENUM ( 'normal', -- Prebuilds are working as expected; this is the default, healthy state. 'hard_limited', -- Prebuilds have failed repeatedly and hit the configured hard failure limit; won't be retried anymore. 'validation_failed' -- Prebuilds failed due to a non-retryable validation error (e.g. template misconfiguration); won't be retried. ); ``` `validation_failed` is not used in this PR, but I think it will be used in the next one, so I wanted to save us an extra migration. - The notification looks like this: <img width="472" alt="image" src="https://github.com/user-attachments/assets/e10efea0-1790-4e7f-a65c-f94c40fced27" /> ### Latest notification views: <img width="463" alt="image" src="https://github.com/user-attachments/assets/11310c58-68d1-4075-a497-f76d854633fe" /> <img width="725" alt="image" src="https://github.com/user-attachments/assets/6bbfe21a-91ac-47c3-a9d1-21807bb0c53a" />	2025-05-21 15:16:38 -04:00
Yevhenii Shcherbina	2aa8cbebd7	fix: exclude deleted templates from metrics collection (#17839 ) Also add some clarification about the lack of database constraints for soft template deletion. --------- Signed-off-by: Danny Kopping <dannykopping@gmail.com> Co-authored-by: Danny Kopping <dannykopping@gmail.com>	2025-05-15 13:33:58 +02:00
Danny Kopping	6e967780c9	feat: track resource replacements when claiming a prebuilt workspace (#17571 ) Closes https://github.com/coder/internal/issues/369 We can't know a priori whether a replacement (i.e. drift of terraform state leading to a resource needing to be deleted/recreated) will take place; we can only detect it at `plan` time, because the provider decides whether a resource must be replaced and it cannot be inferred through static analysis of the template. This is likely to be the most common gotcha with using prebuilds, since it requires a slight template modification to use prebuilds effectively, so let's head this off before it's an issue for customers. Drift details will now be logged in the workspace build logs: ![image](https://github.com/user-attachments/assets/da1988b6-2cbe-4a79-a3c5-ea29891f3d6f) Plus a notification will be sent to template admins when this situation arises: ![image](https://github.com/user-attachments/assets/39d555b1-a262-4a3e-b529-03b9f23bf66a) A new metric - `coderd_prebuilt_workspaces_resource_replacements_total` - will also increment each time a workspace encounters replacements. We only track _that_ a resource replacement occurred, not how many. Just one is enough to ruin a prebuild, but we can't know a priori which replacement would cause this. For example, say we have 2 replacements: a `docker_container` and a `null_resource`; we don't know which one might cause an issue (or indeed if either would), so we just track the replacement. --------- Signed-off-by: Danny Kopping <dannykopping@gmail.com>	2025-05-14 14:52:22 +02:00
Danny Kopping	b2a1de9e2a	feat: fetch prebuilds metrics state in background (#17792 ) `Collect()` is called whenever the `/metrics` endpoint is hit to retrieve metrics. The queries used in prebuilds metrics collection are quite heavy, and we want to avoid having them running concurrently / too often to keep db load down. Here I'm moving towards a background retrieval of the state required to set the metrics, which gets invalidated every interval. Also introduces `coderd_prebuilt_workspaces_metrics_last_updated` which operators can use to determine when these metrics go stale. See https://github.com/coder/coder/pull/17789 as well. --------- Signed-off-by: Danny Kopping <dannykopping@gmail.com>	2025-05-13 20:27:41 +02:00
Danny Kopping	a646478aed	fix: move pubsub publishing out of database transactions to avoid conn exhaustion (#17648 ) Database transactions hold onto connections, and `pubsub.Publish` tries to acquire a connection of its own. If the latter is called within a transaction, this can lead to connection exhaustion. I plan two follow-ups to this PR: 1. Make connection counts tuneable https://github.com/coder/coder/blob/main/cli/server.go#L2360-L2376 We will then be able to write tests showing how connection exhaustion occurs. 2. Write a linter/ruleguard to prevent `pubsub.Publish` from being called within a transaction. --------- Signed-off-by: Danny Kopping <dannykopping@gmail.com>	2025-05-05 11:54:18 +02:00
Yevhenii Shcherbina	ef11d4f769	fix: fix bug with deletion of prebuilt workspaces (#17652 ) Don't specify the template version for a delete transition, because the prebuilt workspace may have been created using an older template version. If the template version isn't explicitly set, the builder will automatically use the version from the last workspace build - which is the desired behavior.	2025-05-01 17:26:30 -04:00


59 Commits