coder

mirror of https://github.com/coder/coder.git synced 2026-06-03 21:18:24 +00:00

Author	SHA1	Message	Date
Danielle Maywood	92a6d6c2c0	chore: remove unnecessary loop variable captures (#22180 ) Since Go 1.22, the loop variable capture issue is resolved. Variables declared by for loops are now per-iteration rather than per-loop, making the 'v := v' pattern unnecessary.	2026-02-19 09:02:19 +00:00
Jon Ayers	3c1db17361	fix: use existing transaction to claim prebuild (#21862 ) - Claiming a prebuild was happening outside a transaction	2026-02-02 17:57:59 -06:00
Susana Ferreira	6ef9670384	fix: limit concurrent database connections in prebuild reconciliation (#20908 ) ## Description This PR addresses database connection pool exhaustion during prebuilds reconciliation by introducing two changes: * `CanSkipReconciliation`: Filters out presets that don't need reconciliation before spawning goroutines. This ensures we only create goroutines for presets that will (_most likely_) perform database operations, avoiding unnecessary connection pool usage. * Dynamic `eg.SetLimit`: Limits concurrent goroutines based on the configured database connection pool size (`CODER_PG_CONN_MAX_OPEN / 2`). This replaces the previous hardcoded limit of 5, ensuring the reconciliation loop scales appropriately with the configured pool size while leaving capacity for other database operations. ## Changes * Add `CanSkipReconciliation()` method to `PresetSnapshot` that returns true for inactive presets with no running workspaces, no pending jobs, or expired prebuilds. * Add `maxDBConnections` parameter to `NewStoreReconciler` and compute `reconciliationConcurrency` as half the pool size (minimum 1). * Add `ReconciliationConcurrency()` getter method to `StoreReconciler`. * Add `eg.SetLimit(c.reconciliationConcurrency)` to bound concurrent reconciliation goroutines. * Add `PresetsTotal` and `PresetsReconciled` to `ReconcileStats` for observability. * Add `TestCanSkipReconciliation` unit tests. * Add `TestReconciliationConcurrency` unit tests. * Add benchmark tests for reconciliation performance. ## Benchmarks * `BenchmarkReconcileAll_NoOps`: Tests presets with no reconciliation actions. All presets are filtered by `CanSkipReconciliation`, resulting in no goroutines spawned and no database connections used. * `BenchmarkReconcileAll_ConnectionContention`: Tests presets where all require reconciliation actions. All presets spawn goroutines, but concurrency is limited by `eg.SetLimit(reconciliationConcurrency)`. * `BenchmarkReconcileAll_Mix`: Simulates a realistic scenario with a large subset of inactive presets (filtered by `CanSkipReconciliation`) and a smaller subset requiring reconciliation (limited by `eg.SetLimit`). Closes: https://github.com/coder/coder/issues/20606	2026-01-21 10:56:31 +00:00
Spike Curtis	bddb808b25	chore: arrange imports in a standard way (#21452 ) Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example: ``` import ( "context" "time" "github.com/prometheus/client_golang/prometheus" "golang.org/x/xerrors" "gopkg.in/natefinch/lumberjack.v2" "cdr.dev/slog/v3" "github.com/coder/coder/v2/codersdk/agentsdk" "github.com/coder/serpent" ) ``` 3 groups: standard library, 3rd partly libs, Coder libs. This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.	2026-01-08 15:24:11 +04:00
Spike Curtis	49b34a716a	fix: fix slog to always use array of Fields (#21426 ) Upgrades to slog v3 which includes a small, but backward incompatible API change to the acceptible call arguments when logging. This change allows us to verify via compile time type checking that arguments are correct and won't cause a panic, as was possible in slog v1, which this replaces (v2 was tagged but never used in coder/coder). It also updates dependencies that also use slog and were updated. I've left the `aibridge` dependency as a commit SHA, under the assumption that the team there (cc @pawbana @dannykopping ) will tag and update the dependency soon and on their own schedule. Other dependencies, I pushed new tags.	2026-01-08 10:29:41 +04:00
Marcin Tojek	d004710a74	feat: add prebuild invalidation via last_invalidated_at timestamp (#20582 ) Updates #17917	2025-11-20 17:12:25 +01:00
Susana Ferreira	aad1b401c1	feat: add prebuilds reconciliation duration metric (#20535 ) ## Description Adds `coderd_prebuilds_reconciliation_duration_seconds` histogram metric to track the duration of each prebuilds reconciliation cycle. This metric helps operators monitor reconciliation performance and identify potential bottlenecks. ## Changes - Added `ReconcileStats` struct to capture reconciliation cycle statistics - Updated `ReconcileAll()` to return stats including elapsed time - Added histogram metric `coderd_prebuilds_reconciliation_duration_seconds`	2025-10-29 12:52:30 +00:00
Susana Ferreira	f6e86c6fdb	feat: cancel pending prebuilds from non-active template versions (#20387 ) ## Description This PR introduces an optimization to automatically cancel pending prebuild-related jobs from non-active template versions in the reconciliation loop. ## Problem Currently, when a template is configured with more prebuild instances than available provisioners, the provisioner queue can become flooded with pending prebuild jobs. This issue is worsened when provisioning/deprovisioning operations take a long time. When the prebuild reconciliation loop generates jobs faster than provisioners can process them, pending jobs accumulate in the queue. Since prebuilt workspaces should always run the latest active template version, pending prebuild jobs from non-active versions become obsolete once a new version is promoted. ## Solution The reconciliation loop cancels pending prebuild-related jobs from non-active template versions that match the following criteria: * Build number: 1 (initial build created by the reconciliation loop) * Job status: `pending` * Not yet picked up by a provisioner (`worker_id` is `NULL`) * Owned by the prebuilds system user * Workspace transition: `start` This prevents the queue from being cluttered with stale prebuild jobs that would provision workspaces on an outdated template version that would consequently need to be deprovisioned. ## Changes * Added new SQL query `CountPendingNonActivePrebuilds` to identify presets with pending jobs from non-active versions * Added new SQL query `UpdatePrebuildProvisionerJobWithCancel` to cancel jobs for a specific preset * New reconciliation action type `ActionTypeCancelPending` handles the cancellation logic * Cancellation is non-blocking: failures to cancel prebuild jobs are logged as errors and don't prevent other reconciliation actions ## Follow-up PR Canceling pending prebuild jobs leaves workspaces in a Canceled state. While no Terraform resources need to be destroyed (since jobs were canceled before provisioning started), these database records should still be cleaned up. This will be addressed in a follow-up PR. Closes: https://github.com/coder/coder/issues/20242	2025-10-24 15:27:49 +01:00
Sas Swart	544f15523c	fix: adjust workspace claims to be initiated by users (#20179 ) The prebuilds user never initiates a workspace claim autonomously. A claim can only happen when a user attempts to create a workspace. When listing prebuild provisioner jobs, it would not make sense to see jobs related to users who are creating workspaces and have gotten a prebuilt workspace. When cleaning up an overwhelmed provisioner queue, we should not delete claims as they have humans waiting for them and are not part of the thundering herd. Therefore, this PR ensures that provisioner jobs that claim workspaces are considered to be initiated by the user, not the prebuilds system.	2025-10-08 10:40:54 +02:00
Sas Swart	f9a6adc704	feat: claim prebuilds based on workspace parameters instead of preset id (#19279 ) Closes https://github.com/coder/coder/issues/18356. This change finds and selects a matching preset if one was not chosen during workspace creation. This solidifies the relationship between presets and parameters. When a workspace is created without in explicitly chosen preset, it will now still be eligible to claim a prebuilt workspace if one is available.	2025-08-20 11:02:53 +02:00
Susana Ferreira	8567ecbe52	fix: set prebuilds lifecycle parameters on creation and claim (#19252 ) ## Description This PR ensures that prebuilt workspaces are properly excluded from the lifecycle executor and treated as a separate class of workspaces, fully managed by the prebuild reconciliation loop. It introduces two lifecycle guarantees: * When a prebuilt workspace is created (i.e., when the workspace build completes), all lifecycle-related fields are unset, ensuring the workspace does not participate in TTL, autostop, autostart, dormancy, or auto-deletion logic. * When a prebuilt workspace is claimed, it transitions into a regular user workspace. At this point, all lifecycle fields are correctly populated according to template-level configurations, allowing the workspace to be managed by the lifecycle executor as expected. ## Changes * Prebuilt workspaces now have all lifecycle-relevant fields unset during creation * When a prebuild is claimed: * Lifecycle fields are set based on template and workspace level configurations. This ensures a clean transition into the standard workspace lifecycle flow. * Updated lifecycle-related SQL update queries to explicitly exclude prebuilt workspaces. ## Relates Related issue: https://github.com/coder/coder/issues/18898 To reduce the scope of this PR and make the review process more manageable, the original implementation has been split into the following focused PRs: * https://github.com/coder/coder/pull/19259 * https://github.com/coder/coder/pull/19263 * https://github.com/coder/coder/pull/19264 * https://github.com/coder/coder/pull/19265 These PRs should be considered in conjunction with this one to understand the complete set of lifecycle separation changes for prebuilt workspaces.	2025-08-13 12:45:46 +01:00
ケイラ	fae30a00fd	chore: remove unnecessary redeclarations in for loops (#18440 )	2025-06-20 13:16:55 -06:00
Susana Ferreira	72f7d70bab	feat: allow TemplateAdmin to delete prebuilds via auth layer (#18333 ) ## Description This PR adds support for deleting prebuilt workspaces via the authorization layer. It introduces special-case handling to ensure that `prebuilt_workspace` permissions are evaluated when attempting to delete a prebuilt workspace, falling back to the standard `workspace` resource as needed. Prebuilt workspaces are a subset of workspaces, identified by having `owner_id` set to `PREBUILD_SYSTEM_USER`. This means: * A user with `prebuilt_workspace.delete` permission is allowed to delete only prebuilt workspaces. * A user with `workspace.delete` permission can delete both normal and prebuilt workspaces. ⚠️ This implementation is scoped to deletion operations only. No other operations are currently supported for the `prebuilt_workspace` resource. To delete a workspace, users must have the following permissions: * `workspace.read`: to read the current workspace state * `update`: to modify workspace metadata and related resources during deletion (e.g., updating the `deleted` field in the database) * `delete`: to perform the actual deletion of the workspace ## Changes * Introduced `authorizeWorkspace()` helper to handle prebuilt workspace authorization logic. * Ensured both `prebuilt_workspace` and `workspace` permissions are checked. * Added comments to clarify the current behavior and limitations. * Moved `SystemUserID` constant from the `prebuilds` package to the `database` package `PrebuildsSystemUserID` to resolve an import cycle (commit https://github.com/coder/coder/pull/18333/commits/f24e4ab4b6f0a56726fd04be2d7302c9fdb52d53). * Update middleware `ExtractOrganizationMember` to include system user members.	2025-06-20 17:36:32 +01:00
Yevhenii Shcherbina	8e3022ed9e	docs: add documentation for prebuild scheduling feature (#18462 ) Follow-up to https://github.com/coder/coder/pull/18126 Changes: - address issue mentioned here: https://github.com/coder/coder/pull/18126#discussion_r2144557600 - add docs for prebuilds scheduling --------- Co-authored-by: Danny Kopping <danny@coder.com> Co-authored-by: Atif Ali <atif@coder.com>	2025-06-20 10:08:47 -04:00
Yevhenii Shcherbina	0f6ca55238	feat: implement scheduling mechanism for prebuilds (#18126 ) Closes https://github.com/coder/internal/issues/312 Depends on https://github.com/coder/terraform-provider-coder/pull/408 This PR adds support for defining an autoscaling block for prebuilds, allowing number of desired instances to scale dynamically based on a schedule. Example usage: ``` data "coder_workspace_preset" "us-nix" { ... prebuilds = { instances = 0 # default to 0 instances scheduling = { timezone = "UTC" # a single timezone is used for simplicity # Scale to 3 instances during the work week schedule { cron = "* 8-18 * * 1-5" # from 8AM–6:59PM, Mon–Fri, UTC instances = 3 # scale to 3 instances } # Scale to 1 instance on Saturdays for urgent support queries schedule { cron = "* 8-14 * * 6" # from 8AM–2:59PM, Sat, UTC instances = 1 # scale to 1 instance } } } } ``` ### Behavior - Multiple `schedule` blocks per `prebuilds` block are supported. - If the current time matches any defined autoscaling schedule, the corresponding number of instances is used. - If no schedule matches, the default instance count (`prebuilds.instances`) is used as a fallback. ### Why This feature allows prebuild instance capacity to adapt to predictable usage patterns, such as: - Scaling up during business hours or high-demand periods - Reducing capacity during off-hours to save resources ### Cron specification The cron specification is interpreted as a continuous time range. For example, the expression: ``` * 9-18 * * 1-5 ``` is intended to represent a continuous range from 09:00 to 18:59, Monday through Friday. However, due to minor implementation imprecision, it is currently interpreted as a range from 08:59:00 to 18:58:59, Monday through Friday. This slight discrepancy arises because the evaluation is based on whether a specific point in time falls within the range, using the `github.com/coder/coder/v2/coderd/schedule/cron` library, which performs per-minute matching rather than strict range evaluation. --------- Co-authored-by: Danny Kopping <danny@coder.com>	2025-06-19 11:08:48 -04:00
Yevhenii Shcherbina	b330c0803c	fix: reimplement reporting of preset-hard-limited metric (#18055 ) Addresses concerns raised in https://github.com/coder/coder/pull/18045	2025-05-28 14:18:32 -04:00
Susana Ferreira	6f6e73af03	feat: implement expiration policy logic for prebuilds (#17996 ) ## Summary This PR introduces support for expiration policies in prebuilds. The TTL (time-to-live) is retrieved from the Terraform configuration ([terraform-provider-coder PR](https://github.com/coder/terraform-provider-coder/pull/404)): ``` prebuilds = { instances = 2 expiration_policy { ttl = 86400 } } ``` Note: Since there is no need for precise TTL enforcement down to the second, in this implementation expired prebuilds are handled in a single reconciliation cycle: they are deleted, and new instances are created only if needed to match the desired count. ## Changes * The outcome of a reconciliation cycle is now expressed as a slice of reconciliation actions, instead of a single aggregated action. * Adjusted reconciliation logic to delete expired prebuilds and guarantee that the number of desired instances is correct. * Updated relevant data structures and methods to support expiration policies parameters. * Added documentation to `Prebuilt workspaces` page * Update `terraform-provider-coder` to version 2.5.0: https://github.com/coder/terraform-provider-coder/releases/tag/v2.5.0 Depends on: https://github.com/coder/terraform-provider-coder/pull/404 Fixes: https://github.com/coder/coder/issues/17916	2025-05-26 20:31:24 +01:00
Yevhenii Shcherbina	53e8e9c7cd	fix: reduce cost of prebuild failure (#17697 ) Relates to https://github.com/coder/coder/issues/17432 ### Part 1: Notes: - `GetPresetsAtFailureLimit` SQL query is added, which is similar to `GetPresetsBackoff`, they use same CTEs: `filtered_builds`, `time_sorted_builds`, but they are still different. - Query is executed on every loop iteration. We can consider marking specific preset as permanently failed as an optimization to avoid executing query on every loop iteration. But I decided don't do it for now. - By default `FailureHardLimit` is set to 3. - `FailureHardLimit` is configurable. Setting it to zero - means that hard limit is disabled. ### Part 2 Notes: - `PrebuildFailureLimitReached` notification is added. - Notification is sent to template admins. - Notification is sent only the first time, when hard limit is reached. But it will `log.Warn` on every loop iteration. - I introduced this enum: ```sql CREATE TYPE prebuild_status AS ENUM ( 'normal', -- Prebuilds are working as expected; this is the default, healthy state. 'hard_limited', -- Prebuilds have failed repeatedly and hit the configured hard failure limit; won't be retried anymore. 'validation_failed' -- Prebuilds failed due to a non-retryable validation error (e.g. template misconfiguration); won't be retried. ); ``` `validation_failed` not used in this PR, but I think it will be used in next one, so I wanted to save us an extra migration. - Notification looks like this: <img width="472" alt="image" src="https://github.com/user-attachments/assets/e10efea0-1790-4e7f-a65c-f94c40fced27" /> ### Latest notification views: <img width="463" alt="image" src="https://github.com/user-attachments/assets/11310c58-68d1-4075-a497-f76d854633fe" /> <img width="725" alt="image" src="https://github.com/user-attachments/assets/6bbfe21a-91ac-47c3-a9d1-21807bb0c53a" />	2025-05-21 15:16:38 -04:00
Danny Kopping	6e967780c9	feat: track resource replacements when claiming a prebuilt workspace (#17571 ) Closes https://github.com/coder/internal/issues/369 We can't know whether a replacement (i.e. drift of terraform state leading to a resource needing to be deleted/recreated) will take place apriori; we can only detect it at `plan` time, because the provider decides whether a resource must be replaced and it cannot be inferred through static analysis of the template. This is likely to be the most common gotcha with using prebuilds, since it requires a slight template modification to use prebuilds effectively, so let's head this off before it's an issue for customers. Drift details will now be logged in the workspace build logs: ![image](https://github.com/user-attachments/assets/da1988b6-2cbe-4a79-a3c5-ea29891f3d6f) Plus a notification will be sent to template admins when this situation arises: ![image](https://github.com/user-attachments/assets/39d555b1-a262-4a3e-b529-03b9f23bf66a) A new metric - `coderd_prebuilt_workspaces_resource_replacements_total` - will also increment each time a workspace encounters replacements. We only track _that_ a resource replacement occurred, not how many. Just one is enough to ruin a prebuild, but we can't know apriori which replacement would cause this. For example, say we have 2 replacements: a `docker_container` and a `null_resource`; we don't know which one might cause an issue (or indeed if either would), so we just track the replacement. --------- Signed-off-by: Danny Kopping <dannykopping@gmail.com>	2025-05-14 14:52:22 +02:00
Sas Swart	425ee6fa55	feat: reinitialize agents when a prebuilt workspace is claimed (#17475 ) This pull request allows coder workspace agents to be reinitialized when a prebuilt workspace is claimed by a user. This facilitates the transfer of ownership between the anonymous prebuilds system user and the new owner of the workspace. Only a single agent per prebuilt workspace is supported for now, but plumbing has already been done to facilitate the seamless transition to multi-agent support. --------- Signed-off-by: Danny Kopping <dannykopping@gmail.com> Co-authored-by: Danny Kopping <dannykopping@gmail.com>	2025-05-14 14:15:36 +02:00
Yevhenii Shcherbina	02b2de9ae4	refactor: skip reconciliation for some presets (#17595 )	2025-04-29 07:55:37 -04:00
Yevhenii Shcherbina	a78f0fc4e1	refactor: use specific error for agpl and prebuilds (#17591 ) Follow-up PR to https://github.com/coder/coder/pull/17458 Addresses this discussion: https://github.com/coder/coder/pull/17458#discussion_r2055940797	2025-04-28 16:37:41 -04:00
Danny Kopping	e0483e3136	feat: add prebuilds metrics collector (#17547 ) Closes https://github.com/coder/internal/issues/509 --------- Signed-off-by: Danny Kopping <dannykopping@gmail.com>	2025-04-28 12:28:56 +02:00
Danny Kopping	08ad910171	feat: add prebuilds configuration & bootstrapping (#17527 ) Closes https://github.com/coder/internal/issues/508 --------- Signed-off-by: Danny Kopping <dannykopping@gmail.com> Co-authored-by: Cian Johnston <cian@coder.com>	2025-04-25 11:07:15 +02:00
Yevhenii Shcherbina	118f12ac3a	feat: implement claiming of prebuilt workspaces (#17458 ) Signed-off-by: Danny Kopping <dannykopping@gmail.com> Co-authored-by: Danny Kopping <dannykopping@gmail.com> Co-authored-by: Danny Kopping <danny@coder.com> Co-authored-by: Edward Angert <EdwardAngert@users.noreply.github.com> Co-authored-by: EdwardAngert <17991901+EdwardAngert@users.noreply.github.com> Co-authored-by: Jaayden Halko <jaayden.halko@gmail.com> Co-authored-by: Ethan <39577870+ethanndickson@users.noreply.github.com> Co-authored-by: M Atif Ali <atif@coder.com> Co-authored-by: Aericio <16523741+Aericio@users.noreply.github.com> Co-authored-by: M Atif Ali <me@matifali.dev> Co-authored-by: Michael Suchacz <203725896+ibetitsmike@users.noreply.github.com>	2025-04-24 09:39:38 -04:00
Yevhenii Shcherbina	183146e2c9	fix: add minor fix to reconciliation loop (#17454 ) Follow-up PR to https://github.com/coder/coder/pull/17261 I noticed that 1 metrics-related test fails in `dk/prebuilds` after merging my PR into `dk/prebuilds`.	2025-04-17 13:43:24 -04:00
Yevhenii Shcherbina	27bc60d1b9	feat: implement reconciliation loop (#17261 ) Closes https://github.com/coder/internal/issues/510 <details> <summary> Refactoring Summary </summary> ### 1) `CalculateActions` Function #### Issues Before Refactoring: - Large function (~150 lines), making it difficult to read and maintain. - The control flow is hard to follow due to complex conditional logic. - The `ReconciliationActions` struct was partially initialized early, then mutated in multiple places, making the flow error-prone. Original source: https://github.com/coder/coder/blob/fe60b569ad754245e28bac71e0ef3c83536631bb/coderd/prebuilds/state.go#L13-L167 #### Improvements After Refactoring: - Simplified and broken down into smaller, focused helper methods. - The flow of the function is now more linear and easier to understand. - Struct initialization is cleaner, avoiding partial and incremental mutations. Refactored function: https://github.com/coder/coder/blob/eeb0407d783cdda71ec2418c113f325542c47b1c/coderd/prebuilds/state.go#L67-L84 --- ### 2) `ReconciliationActions` Struct #### Issues Before Refactoring: - The struct mixed both actionable decisions and diagnostic state, which blurred its purpose. - It was unclear which fields were necessary for reconciliation logic, and which were purely for logging/observability. #### Improvements After Refactoring: - Split into two clear, purpose-specific structs: - `ReconciliationActions` — defines the intended reconciliation action. - `ReconciliationState` — captures runtime state and metadata, primarily for logging and diagnostics. Original struct: https://github.com/coder/coder/blob/fe60b569ad754245e28bac71e0ef3c83536631bb/coderd/prebuilds/reconcile.go#L29-L41 </details> --------- Signed-off-by: Danny Kopping <dannykopping@gmail.com> Co-authored-by: Sas Swart <sas.swart.cdk@gmail.com> Co-authored-by: Danny Kopping <dannykopping@gmail.com> Co-authored-by: Dean Sheather <dean@deansheather.com> Co-authored-by: Spike Curtis <spike@coder.com> Co-authored-by: Danny Kopping <danny@coder.com>	2025-04-17 09:29:29 -04:00
Danny Kopping	4c33846f6d	chore: add prebuilds system user (#16916 ) Pre-requisite for https://github.com/coder/coder/pull/16891 Closes https://github.com/coder/internal/issues/515 This PR introduces a new concept of a "system" user. Our data model requires that all workspaces have an owner (a `users` relation), and prebuilds is a feature that will spin up workspaces to be claimed later by actual users - and thus needs to own the workspaces in the interim. Naturally, introducing a change like this touches a few aspects around the codebase and we've taken the approach _default hidden_ here; in other words, queries for users will by default _exclude_ all system users, but there is a flag to ensure they can be displayed. This keeps the changeset relatively small. This user has minimal permissions (it's equivalent to a `member` since it has no roles). It will be associated with the default org in the initial migration, and thereafter we'll need to somehow ensure its membership aligns with templates (which are org-scoped) for which it'll need to provision prebuilds; that's a solution we'll have in a subsequent PR. --------- Signed-off-by: Danny Kopping <dannykopping@gmail.com> Co-authored-by: Sas Swart <sas.swart.cdk@gmail.com>	2025-03-25 12:18:06 +00:00

28 Commits