coder

mirror of https://github.com/coder/coder.git synced 2026-06-03 13:08:25 +00:00

Author	SHA1	Message	Date
Marcin Tojek	d004710a74	feat: add prebuild invalidation via last_invalidated_at timestamp (#20582) Updates #17917	2025-11-20 17:12:25 +01:00
Susana Ferreira	7e8fcb4b0f	perf: optimize prebuilds membership reconciliation to check orgs not presets (#20493 ) ## Description The membership reconciliation ensures the prebuilds system user is a member of all organizations with prebuilds configured. To support prebuilds quota management, each organization must have a prebuilds group that the system user belongs to. ## Problem Previously, membership reconciliation iterated over all presets to check and update membership status. This meant database queries `GetGroupByOrgAndName` and `InsertGroupMember` were executed for each preset. Since presets are unique combinations of `(organization, template, template version, preset)`, this resulted in several redundant checks for the same organization. In dogfood, `InsertGroupMember` was called thousands of times per day, even though memberships were already configured ([internal Grafana dashboard link](https://grafana.dev.coder.com/goto/46MZ1UgDg?orgId=1)) <img width="5382" height="1788" alt="Screenshot 2025-10-28 at 16 01 36" src="https://github.com/user-attachments/assets/757b7253-106f-4f72-8586-8e2ede9f18db" /> ## Solution This PR introduces `GetOrganizationsWithPrebuildStatus`, a single query that returns: * All unique organizations with prebuilds configured * Whether the prebuilds user is a member of each organization * Whether the prebuilds group exists in each organization * Whether the prebuilds user is in the prebuilds group The membership reconciliation logic now: * Fetches status for all organizations in one query * Only performs inserts for organizations missing required memberships or groups * Safely handles concurrent operations via unique constraint violations * This reduces database load from `O(presets)` to `O(organizations)` per reconciliation loop, with a single read query when everything is configured. 
## Changes * Add `GetOrganizationsWithPrebuildStatus` SQL query * Update `membership.ReconcileAll` to use organization-based reconciliation instead of preset-based * Update tests to reflect new behavior Related to internal thread: https://codercom.slack.com/archives/C07GRNNRW03/p1760535570381369	2025-10-29 14:24:29 +00:00
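A minimal sketch of the organization-based reconciliation described in this commit, written in Python for illustration only (the actual change is Go and SQL). The row shape `OrgPrebuildStatus`, its field names, and the action labels are assumptions and do not reflect the real columns returned by `GetOrganizationsWithPrebuildStatus`. It shows that one status row per organization is enough to decide which writes are still needed, and that no writes are planned when everything is already configured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrgPrebuildStatus:
    # Hypothetical row shape: one row per organization with prebuilds configured.
    org_id: str
    user_is_org_member: bool
    group_exists: bool
    user_in_group: bool


def plan_membership_fixes(statuses):
    """Return the (action, org_id) writes still needed; empty if all is configured."""
    actions = []
    for s in statuses:
        if not s.user_is_org_member:
            actions.append(("insert_org_member", s.org_id))
        if not s.group_exists:
            actions.append(("insert_group", s.org_id))
        if not s.user_in_group:
            actions.append(("insert_group_member", s.org_id))
    return actions


statuses = [
    OrgPrebuildStatus("org-a", True, True, True),    # fully configured
    OrgPrebuildStatus("org-b", True, False, False),  # group is missing
]
plan = plan_membership_fixes(statuses)
```

The work is proportional to the number of organizations, not presets, which is the `O(presets)` to `O(organizations)` reduction the description refers to.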
Susana Ferreira	c3e3bb58f2	feat: delete pending canceled prebuilds (#20499 ) ## Description PR https://github.com/coder/coder/pull/20387 introduced canceling pending prebuild jobs from inactive template versions to avoid provisioning obsolete workspaces. However, the associated prebuilds remained in the database with "Canceled" status, visible in the UI. This PR now orphan-deletes these canceled prebuilt workspaces. Since the canceled jobs were never processed by a provisioner, no Terraform resources were created, making orphan deletion safe. Orphan deletion always creates a provisioner job, but behaves differently based on provisioner availability: - If no provisioner daemon is available, the job is immediately marked as completed and the workspace is marked as deleted without any provisioner processing - If a provisioner daemon is available, it processes the delete job with empty Terraform state (no actual resources to destroy) The job cancellation and workspace deletion occur atomically in the same transaction. We don't split this into two separate reconciliation runs because there's no way to distinguish between system-canceled prebuilds and user-canceled workspaces. If we deleted canceled workspaces in a later run, we'd delete user-canceled workspaces that users may want to keep for troubleshooting. Note: This only applies to system-generated prebuilds from inactive template versions. ## Changes * Update `UpdatePrebuildProvisionerJobWithCancel` query to return job ID, workspace ID, template ID, and template version preset ID * Add `DeprovisionMode` enum to support orphan deletion in the provision flow * Update `ActionTypeCancelPending` handler to cancel jobs and orphan-delete associated workspaces atomically	2025-10-29 10:37:28 +00:00
Susana Ferreira	f6e86c6fdb	feat: cancel pending prebuilds from non-active template versions (#20387 ) ## Description This PR introduces an optimization to automatically cancel pending prebuild-related jobs from non-active template versions in the reconciliation loop. ## Problem Currently, when a template is configured with more prebuild instances than available provisioners, the provisioner queue can become flooded with pending prebuild jobs. This issue is worsened when provisioning/deprovisioning operations take a long time. When the prebuild reconciliation loop generates jobs faster than provisioners can process them, pending jobs accumulate in the queue. Since prebuilt workspaces should always run the latest active template version, pending prebuild jobs from non-active versions become obsolete once a new version is promoted. ## Solution The reconciliation loop cancels pending prebuild-related jobs from non-active template versions that match the following criteria: * Build number: 1 (initial build created by the reconciliation loop) * Job status: `pending` * Not yet picked up by a provisioner (`worker_id` is `NULL`) * Owned by the prebuilds system user * Workspace transition: `start` This prevents the queue from being cluttered with stale prebuild jobs that would provision workspaces on an outdated template version that would consequently need to be deprovisioned. ## Changes * Added new SQL query `CountPendingNonActivePrebuilds` to identify presets with pending jobs from non-active versions * Added new SQL query `UpdatePrebuildProvisionerJobWithCancel` to cancel jobs for a specific preset * New reconciliation action type `ActionTypeCancelPending` handles the cancellation logic * Cancellation is non-blocking: failures to cancel prebuild jobs are logged as errors and don't prevent other reconciliation actions ## Follow-up PR Canceling pending prebuild jobs leaves workspaces in a Canceled state. 
While no Terraform resources need to be destroyed (since jobs were canceled before provisioning started), these database records should still be cleaned up. This will be addressed in a follow-up PR. Closes: https://github.com/coder/coder/issues/20242	2025-10-24 15:27:49 +01:00
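The cancellation criteria listed in this commit can be read as a single predicate. The sketch below is illustrative Python, not the SQL in `UpdatePrebuildProvisionerJobWithCancel`; the `PrebuildJob` type and its field names are hypothetical stand-ins for the columns the description mentions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PrebuildJob:
    # Hypothetical flattened view of a workspace build and its provisioner job.
    build_number: int
    job_status: str
    worker_id: Optional[str]
    owner_is_prebuilds_user: bool
    transition: str
    template_version_is_active: bool


def should_cancel(job: PrebuildJob) -> bool:
    """True only when every criterion from the commit description holds."""
    return (
        not job.template_version_is_active   # non-active template version
        and job.build_number == 1            # initial build from the reconciliation loop
        and job.job_status == "pending"
        and job.worker_id is None            # not yet picked up by a provisioner
        and job.owner_is_prebuilds_user
        and job.transition == "start"
    )


stale = PrebuildJob(1, "pending", None, True, "start", False)
picked_up = PrebuildJob(1, "pending", "worker-1", True, "start", False)
current = PrebuildJob(1, "pending", None, True, "start", True)
```

A job that a provisioner has already acquired, or one that belongs to the active version, is left alone.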
Susana Ferreira	0ab345ca84	feat: add prebuild timing metrics to Prometheus (#19503 ) ## Description This PR introduces one counter and two histograms related to workspace creation and claiming. The goal is to provide clearer observability into how workspaces are created (regular vs prebuild) and the time cost of those operations. ### `coderd_workspace_creation_total` * Metric type: Counter * Name: `coderd_workspace_creation_total` * Labels: `organization_name`, `template_name`, `preset_name` This counter tracks whether a regular workspace (not created from a prebuild pool) was created using a preset or not. Currently, we already expose `coderd_prebuilt_workspaces_claimed_total` for claimed prebuilt workspaces, but we lack a comparable metric for regular workspace creations. This metric fills that gap, making it possible to compare regular creations against claims. Implementation notes: * Exposed as a `coderd_` metric, consistent with other workspace-related metrics (e.g. `coderd_api_workspace_latest_build`: https://github.com/coder/coder/blob/main/coderd/prometheusmetrics/prometheusmetrics.go#L149). * Every `defaultRefreshRate` (1 minute ), DB query `GetRegularWorkspaceCreateMetrics` is executed to fetch all regular workspaces (not created from a prebuild pool). * The counter is updated with the total from all time (not just since metric introduction). This differs from the histograms below, which only accumulate from their introduction forward. 
### `coderd_workspace_creation_duration_seconds` & `coderd_prebuilt_workspace_claim_duration_seconds` * Metric types: Histogram * Names: * `coderd_workspace_creation_duration_seconds` * Labels: `organization_name`, `template_name`, `preset_name`, `type` (`regular`, `prebuild`) * `coderd_prebuilt_workspace_claim_duration_seconds` * Labels: `organization_name`, `template_name`, `preset_name` We already have `coderd_provisionerd_workspace_build_timings_seconds`, which tracks build run times for all workspace builds handled by the provisioner daemon. However, in the context of this issue, we are only interested in creation and claim build times, not all transitions; additionally, this metric does not include `preset_name`, and adding it there would significantly increase cardinality. Therefore, separate more focused metrics are introduced here: * `coderd_workspace_creation_duration_seconds`: Build time to create a workspace (either a regular workspace or the build into a prebuild pool, for prebuild initial provisioning build). * `coderd_prebuilt_workspace_claim_duration_seconds`: Time to claim a prebuilt workspace from the pool. The reason for two separate histograms is that: * Creation (regular or prebuild): provisioning builds with similar time magnitude, generally expected to take longer than a claim operation. * Claim: expected to be a much faster provisioning build. #### Native histogram usage Provisioning times vary widely between projects. Using static buckets risks unbalanced or poorly informative histograms. 
To address this, these metrics use [Prometheus native histograms](https://prometheus.io/docs/specs/native_histograms/): * First introduced in Prometheus v2.40.0 * Recommended stable usage from v2.45+ * Requires Go client `prometheus/client_golang` v1.15.0+ * Experimental and must be explicitly enabled on the server (`--enable-feature=native-histograms`) For compatibility, we also retain a classic bucket definition (aligned with the existing provisioner metric: https://github.com/coder/coder/blob/main/provisionerd/provisionerd.go#L182-L189). * If native histograms are enabled, Prometheus ingests the high-resolution histogram. * If not, it falls back to the predefined buckets. Implementation notes: * Unlike the counter, these histograms are updated in real-time at workspace build job completion. * They reflect data only from the point of introduction forward (no historical backfill). ## Relates to Closes: https://github.com/coder/coder/issues/19528 Native histograms tested in observability stack: https://github.com/coder/observability/pull/50	2025-08-28 15:00:26 +01:00
Sas Swart	f9a6adc704	feat: claim prebuilds based on workspace parameters instead of preset id (#19279) Closes https://github.com/coder/coder/issues/18356. This change finds and selects a matching preset if one was not chosen during workspace creation. This solidifies the relationship between presets and parameters. When a workspace is created without an explicitly chosen preset, it will now still be eligible to claim a prebuilt workspace if one is available.	2025-08-20 11:02:53 +02:00
Susana Ferreira	8567ecbe52	fix: set prebuilds lifecycle parameters on creation and claim (#19252 ) ## Description This PR ensures that prebuilt workspaces are properly excluded from the lifecycle executor and treated as a separate class of workspaces, fully managed by the prebuild reconciliation loop. It introduces two lifecycle guarantees: * When a prebuilt workspace is created (i.e., when the workspace build completes), all lifecycle-related fields are unset, ensuring the workspace does not participate in TTL, autostop, autostart, dormancy, or auto-deletion logic. * When a prebuilt workspace is claimed, it transitions into a regular user workspace. At this point, all lifecycle fields are correctly populated according to template-level configurations, allowing the workspace to be managed by the lifecycle executor as expected. ## Changes * Prebuilt workspaces now have all lifecycle-relevant fields unset during creation * When a prebuild is claimed: * Lifecycle fields are set based on template and workspace level configurations. This ensures a clean transition into the standard workspace lifecycle flow. * Updated lifecycle-related SQL update queries to explicitly exclude prebuilt workspaces. ## Relates Related issue: https://github.com/coder/coder/issues/18898 To reduce the scope of this PR and make the review process more manageable, the original implementation has been split into the following focused PRs: * https://github.com/coder/coder/pull/19259 * https://github.com/coder/coder/pull/19263 * https://github.com/coder/coder/pull/19264 * https://github.com/coder/coder/pull/19265 These PRs should be considered in conjunction with this one to understand the complete set of lifecycle separation changes for prebuilt workspaces.	2025-08-13 12:45:46 +01:00
Cian Johnston	198d50dbc2	chore: replace original GetPrebuiltWorkspaces with optimized version (#18832) Fixes https://github.com/coder/internal/issues/715 Follow-up from https://github.com/coder/coder/pull/18717 Now that we've determined the updated query is safe, remove the duplication.	2025-07-21 15:31:11 +01:00
Cian Johnston	0367dbac43	chore: optimize GetPrebuiltWorkspaces query (#18717) * Adds GetRunningPrebuiltWorkspacesOptimized query * Runs both original and updated query side-by-side and logs diffs	2025-07-09 11:30:42 +01:00
Cian Johnston	4e95b1d20e	fix: revert changes to GetRunningPrebuiltWorkspaces (#18688) … (#18588)" This reverts commit `258a839d27`.	2025-07-01 10:11:43 +00:00
Cian Johnston	258a839d27	chore(coderd/database): optimize GetRunningPrebuiltWorkspaces (#18588) Fixes https://github.com/coder/internal/issues/715 After this change, the only use of the `workspace_prebuilds` view is the `ClaimPrebuiltWorkspace` query. A subsequent PR will update the view. Before: ~44ms https://explain.dalibo.com/plan/76cbe21d1a4c9329#plan After: 7.3ms https://explain.dalibo.com/plan/5abbdf926315677e#plan	2025-07-01 09:42:01 +01:00
Yevhenii Shcherbina	0f6ca55238	feat: implement scheduling mechanism for prebuilds (#18126)

Closes https://github.com/coder/internal/issues/312
Depends on https://github.com/coder/terraform-provider-coder/pull/408

This PR adds support for defining an autoscaling block for prebuilds, allowing number of desired instances to scale dynamically based on a schedule.

Example usage:

```
data "coder_workspace_preset" "us-nix" {
  ...
  prebuilds = {
    instances = 0 # default to 0 instances
    scheduling = {
      timezone = "UTC" # a single timezone is used for simplicity

      # Scale to 3 instances during the work week
      schedule {
        cron = "* 8-18 * * 1-5" # from 8AM–6:59PM, Mon–Fri, UTC
        instances = 3 # scale to 3 instances
      }

      # Scale to 1 instance on Saturdays for urgent support queries
      schedule {
        cron = "* 8-14 * * 6" # from 8AM–2:59PM, Sat, UTC
        instances = 1 # scale to 1 instance
      }
    }
  }
}
```

### Behavior

- Multiple `schedule` blocks per `prebuilds` block are supported.
- If the current time matches any defined autoscaling schedule, the corresponding number of instances is used.
- If no schedule matches, the default instance count (`prebuilds.instances`) is used as a fallback.

### Why

This feature allows prebuild instance capacity to adapt to predictable usage patterns, such as:

- Scaling up during business hours or high-demand periods
- Reducing capacity during off-hours to save resources

### Cron specification

The cron specification is interpreted as a continuous time range. For example, the expression:

```
* 9-18 * * 1-5
```

is intended to represent a continuous range from 09:00 to 18:59, Monday through Friday. However, due to minor implementation imprecision, it is currently interpreted as a range from 08:59:00 to 18:58:59, Monday through Friday.

This slight discrepancy arises because the evaluation is based on whether a specific point in time falls within the range, using the `github.com/coder/coder/v2/coderd/schedule/cron` library, which performs per-minute matching rather than strict range evaluation.

---------

Co-authored-by: Danny Kopping <danny@coder.com>	2025-06-19 11:08:48 -04:00
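The schedule fallback rule in this commit can be sketched as follows. This is illustrative Python and does not use or imitate the `coderd/schedule/cron` library: each schedule is reduced by hand to an hour range and a set of cron weekdays, the tuple layout and the `desired_instances` name are assumptions, and picking the first matching schedule is also an assumption (the description only says a matching schedule's count is used). The one-minute discrepancy mentioned above is not modeled.

```python
from datetime import datetime, timezone


def cron_dow(t: datetime) -> int:
    """Cron weekday numbering: 0 = Sunday, 1 = Monday, ..., 6 = Saturday."""
    return t.isoweekday() % 7


def desired_instances(now: datetime, default: int, schedules) -> int:
    """Return the instance count of the first matching schedule, else the default."""
    for hours, days, instances in schedules:
        if now.hour in hours and cron_dow(now) in days:
            return instances
    return default


# Hand-translated equivalents of the two example schedules.
schedules = [
    (range(8, 19), range(1, 6), 3),  # "* 8-18 * * 1-5": 08:00-18:59, Mon-Fri
    (range(8, 15), {6}, 1),          # "* 8-14 * * 6":   08:00-14:59, Sat
]

wednesday = datetime(2025, 6, 18, 10, 0, tzinfo=timezone.utc)
saturday = datetime(2025, 6, 21, 14, 59, tzinfo=timezone.utc)
sunday = datetime(2025, 6, 22, 10, 0, tzinfo=timezone.utc)
```

With a default of 0, the weekday timestamp yields 3 instances, the Saturday one yields 1, and the Sunday one falls back to the default.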
Susana Ferreira	6f6e73af03	feat: implement expiration policy logic for prebuilds (#17996)

## Summary

This PR introduces support for expiration policies in prebuilds. The TTL (time-to-live) is retrieved from the Terraform configuration ([terraform-provider-coder PR](https://github.com/coder/terraform-provider-coder/pull/404)):

```
prebuilds = {
  instances = 2
  expiration_policy {
    ttl = 86400
  }
}
```

Note: Since there is no need for precise TTL enforcement down to the second, in this implementation expired prebuilds are handled in a single reconciliation cycle: they are deleted, and new instances are created only if needed to match the desired count.

## Changes

* The outcome of a reconciliation cycle is now expressed as a slice of reconciliation actions, instead of a single aggregated action.
* Adjusted reconciliation logic to delete expired prebuilds and guarantee that the number of desired instances is correct.
* Updated relevant data structures and methods to support expiration policies parameters.
* Added documentation to `Prebuilt workspaces` page
* Update `terraform-provider-coder` to version 2.5.0: https://github.com/coder/terraform-provider-coder/releases/tag/v2.5.0

Depends on: https://github.com/coder/terraform-provider-coder/pull/404
Fixes: https://github.com/coder/coder/issues/17916	2025-05-26 20:31:24 +01:00
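A rough sketch of the single-cycle handling described in this commit: expired prebuilds are selected for deletion, and replacements are created only to reach the desired count. This is illustrative Python, not the reconciliation code itself. The function name `plan_expiration`, the use of plain ages in seconds, and treating an age equal to the TTL as expired are all assumptions; handling of surplus non-expired prebuilds is left out.

```python
def plan_expiration(ages_seconds, ttl_seconds, desired):
    """Return (indexes of expired prebuilds to delete, number of new prebuilds to create)."""
    expired = [i for i, age in enumerate(ages_seconds) if age >= ttl_seconds]
    remaining = len(ages_seconds) - len(expired)
    to_create = max(desired - remaining, 0)
    return expired, to_create


# Two of three prebuilds have reached the 86400-second TTL; two instances are desired.
expired, to_create = plan_expiration([100, 90000, 86400], ttl_seconds=86400, desired=2)
```

Here the second and third prebuilds are deleted and one new instance is created, which matches the idea of expressing a cycle's outcome as several actions instead of a single aggregated one.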
Yevhenii Shcherbina	53e8e9c7cd	fix: reduce cost of prebuild failure (#17697)

Relates to https://github.com/coder/coder/issues/17432

### Part 1

Notes:
- The `GetPresetsAtFailureLimit` SQL query is added. It is similar to `GetPresetsBackoff`: both use the same CTEs (`filtered_builds`, `time_sorted_builds`), but they are still different.
- The query is executed on every loop iteration. We could mark a specific preset as permanently failed as an optimization to avoid executing the query on every loop iteration, but I decided not to do it for now.
- By default `FailureHardLimit` is set to 3.
- `FailureHardLimit` is configurable. Setting it to zero means that the hard limit is disabled.

### Part 2

Notes:
- The `PrebuildFailureLimitReached` notification is added.
- The notification is sent to template admins.
- The notification is sent only the first time the hard limit is reached, but it will `log.Warn` on every loop iteration.
- I introduced this enum:

```sql
CREATE TYPE prebuild_status AS ENUM (
  'normal',           -- Prebuilds are working as expected; this is the default, healthy state.
  'hard_limited',     -- Prebuilds have failed repeatedly and hit the configured hard failure limit; won't be retried anymore.
  'validation_failed' -- Prebuilds failed due to a non-retryable validation error (e.g. template misconfiguration); won't be retried.
);
```

`validation_failed` is not used in this PR, but I think it will be used in the next one, so I wanted to save us an extra migration.

- Notification looks like this: <img width="472" alt="image" src="https://github.com/user-attachments/assets/e10efea0-1790-4e7f-a65c-f94c40fced27" />

### Latest notification views:

<img width="463" alt="image" src="https://github.com/user-attachments/assets/11310c58-68d1-4075-a497-f76d854633fe" /> <img width="725" alt="image" src="https://github.com/user-attachments/assets/6bbfe21a-91ac-47c3-a9d1-21807bb0c53a" />	2025-05-21 15:16:38 -04:00
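The hard-limit rule from this commit (default of 3, zero disables it) can be sketched as below. This is illustrative Python only: the real check is the `GetPresetsAtFailureLimit` SQL query, whose exact counting logic is not shown in the commit message. Treating the limit as "the most recent N builds all failed" is an assumption, as are the function name and the boolean list input.

```python
def at_failure_limit(recent_outcomes, hard_limit=3):
    """recent_outcomes is newest first; True means the build failed.

    Assumed rule: the preset is hard-limited when its latest `hard_limit`
    builds all failed. A limit of zero disables the check.
    """
    if hard_limit == 0:
        return False
    latest = recent_outcomes[:hard_limit]
    return len(latest) == hard_limit and all(latest)


def next_status(recent_outcomes, hard_limit=3):
    """Map the check onto two of the `prebuild_status` enum values."""
    return "hard_limited" if at_failure_limit(recent_outcomes, hard_limit) else "normal"
```

Under this reading, a single successful build among the latest three keeps the preset in the `normal` state.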
Yevhenii Shcherbina	2aa8cbebd7	fix: exclude deleted templates from metrics collection (#17839) Also add some clarification about the lack of database constraints for soft template deletion. --------- Signed-off-by: Danny Kopping <dannykopping@gmail.com> Co-authored-by: Danny Kopping <dannykopping@gmail.com>	2025-05-15 13:33:58 +02:00
Yevhenii Shcherbina	27bc60d1b9	feat: implement reconciliation loop (#17261 ) Closes https://github.com/coder/internal/issues/510 <details> <summary> Refactoring Summary </summary> ### 1) `CalculateActions` Function #### Issues Before Refactoring: - Large function (~150 lines), making it difficult to read and maintain. - The control flow is hard to follow due to complex conditional logic. - The `ReconciliationActions` struct was partially initialized early, then mutated in multiple places, making the flow error-prone. Original source: https://github.com/coder/coder/blob/fe60b569ad754245e28bac71e0ef3c83536631bb/coderd/prebuilds/state.go#L13-L167 #### Improvements After Refactoring: - Simplified and broken down into smaller, focused helper methods. - The flow of the function is now more linear and easier to understand. - Struct initialization is cleaner, avoiding partial and incremental mutations. Refactored function: https://github.com/coder/coder/blob/eeb0407d783cdda71ec2418c113f325542c47b1c/coderd/prebuilds/state.go#L67-L84 --- ### 2) `ReconciliationActions` Struct #### Issues Before Refactoring: - The struct mixed both actionable decisions and diagnostic state, which blurred its purpose. - It was unclear which fields were necessary for reconciliation logic, and which were purely for logging/observability. #### Improvements After Refactoring: - Split into two clear, purpose-specific structs: - `ReconciliationActions` — defines the intended reconciliation action. - `ReconciliationState` — captures runtime state and metadata, primarily for logging and diagnostics. 
Original struct: https://github.com/coder/coder/blob/fe60b569ad754245e28bac71e0ef3c83536631bb/coderd/prebuilds/reconcile.go#L29-L41 </details> --------- Signed-off-by: Danny Kopping <dannykopping@gmail.com> Co-authored-by: Sas Swart <sas.swart.cdk@gmail.com> Co-authored-by: Danny Kopping <dannykopping@gmail.com> Co-authored-by: Dean Sheather <dean@deansheather.com> Co-authored-by: Spike Curtis <spike@coder.com> Co-authored-by: Danny Kopping <danny@coder.com>	2025-04-17 09:29:29 -04:00
Sas Swart	99c6f235eb	feat: add migrations and queries to support prebuilds (#16891) Depends on https://github.com/coder/coder/pull/16916 _(change base to `main` once it is merged)_ Closes https://github.com/coder/internal/issues/514 _This is one of several PRs to decompose the `dk/prebuilds` feature branch into separate PRs to merge into `main`._ --------- Signed-off-by: Danny Kopping <dannykopping@gmail.com> Co-authored-by: Danny Kopping <dannykopping@gmail.com> Co-authored-by: evgeniy-scherbina <evgeniy.shcherbina.es@gmail.com>	2025-04-03 10:58:30 +02:00

17 Commits