coder

mirror of https://github.com/coder/coder.git synced 2026-06-05 14:08:20 +00:00

Author	SHA1	Message	Date
Callum Styan	5f3be6b288	feat: add provisioner job queue wait time histogram and jobs enqueued counter (#21869 ) This PR adds some metrics to help identify job enqueue rates and latencies. This work was initiated as a way to help reduce the cost of the observation/measurement itself for autostart scaletests, which impacts our ability to identify/reason about the load caused by autostart. See: https://github.com/coder/internal/issues/1209 I've extended the metrics here to account for regular user initiated builds, prebuilds, autostarts, etc. IMO there is still the question here of whether we want to include or need the `transition` label, which is only present on workspace builds. Including it does lead to an increase in cardinality, and in the case of the histogram (when not using native histograms) that's at least a few extra series for every bucket. We could remove the transition label there but keep it on the counter. Additionally, the histogram is currently observing latencies for other jobs, such as template builds/version imports, those do not have a transition type associated with them. Tested briefly in a workspace, can see metric values like the following: - `coderd_workspace_builds_enqueued_total{build_reason="autostart",provisioner_type="terraform",status="success",transition="start"} 1` - `coderd_provisioner_job_queue_wait_seconds_bucket{build_reason="autostart",job_type="workspace_build",provisioner_type="terraform",transition="start",le="0.025"} 1` --------- Signed-off-by: Callum Styan <callumstyan@gmail.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-12 13:40:47 -08:00
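The cardinality concern in 5f3be6b288 can be made concrete with a little arithmetic. A classic (non-native) Prometheus histogram exposes, for each label combination, one `_bucket` series per configured bucket, one for the implicit `+Inf` bucket, and a `_sum` and `_count` series; a counter exposes one series per label combination. The sketch below is not taken from the PR: the bucket count and the number of label combinations are hypothetical, and it assumes the `transition` label takes three values (start, stop, delete).

```go
package main

import "fmt"

// histogramSeries returns how many time series a classic Prometheus
// histogram exposes: per label combination, one _bucket series for each
// configured bucket, one for the implicit +Inf bucket, plus _sum and _count.
func histogramSeries(labelCombos, buckets int) int {
	return labelCombos * (buckets + 3)
}

// counterSeries returns how many series a counter exposes: one per
// label combination.
func counterSeries(labelCombos int) int {
	return labelCombos
}

func main() {
	const (
		buckets     = 10 // hypothetical bucket count
		combos      = 4  // hypothetical label combinations without `transition`
		transitions = 3  // assumed values: start, stop, delete
	)
	// Upper bound: assumes every combination occurs with every transition.
	fmt.Println("histogram without transition:", histogramSeries(combos, buckets))
	fmt.Println("histogram with transition:   ", histogramSeries(combos*transitions, buckets))
	fmt.Println("counter with transition:     ", counterSeries(combos*transitions))
}
```

Under these assumptions the label costs far more on the histogram than on the counter, which is the trade-off behind the suggestion to keep `transition` only on the counter.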
Jon Ayers	3c1db17361	fix: use existing transaction to claim prebuild (#21862 ) - Claiming a prebuild was happening outside a transaction	2026-02-02 17:57:59 -06:00
Susana Ferreira	6ef9670384	fix: limit concurrent database connections in prebuild reconciliation (#20908 ) ## Description This PR addresses database connection pool exhaustion during prebuilds reconciliation by introducing two changes: * `CanSkipReconciliation`: Filters out presets that don't need reconciliation before spawning goroutines. This ensures we only create goroutines for presets that will (_most likely_) perform database operations, avoiding unnecessary connection pool usage. * Dynamic `eg.SetLimit`: Limits concurrent goroutines based on the configured database connection pool size (`CODER_PG_CONN_MAX_OPEN / 2`). This replaces the previous hardcoded limit of 5, ensuring the reconciliation loop scales appropriately with the configured pool size while leaving capacity for other database operations. ## Changes * Add `CanSkipReconciliation()` method to `PresetSnapshot` that returns true for inactive presets with no running workspaces, no pending jobs, or expired prebuilds. * Add `maxDBConnections` parameter to `NewStoreReconciler` and compute `reconciliationConcurrency` as half the pool size (minimum 1). * Add `ReconciliationConcurrency()` getter method to `StoreReconciler`. * Add `eg.SetLimit(c.reconciliationConcurrency)` to bound concurrent reconciliation goroutines. * Add `PresetsTotal` and `PresetsReconciled` to `ReconcileStats` for observability. * Add `TestCanSkipReconciliation` unit tests. * Add `TestReconciliationConcurrency` unit tests. * Add benchmark tests for reconciliation performance. ## Benchmarks * `BenchmarkReconcileAll_NoOps`: Tests presets with no reconciliation actions. All presets are filtered by `CanSkipReconciliation`, resulting in no goroutines spawned and no database connections used. * `BenchmarkReconcileAll_ConnectionContention`: Tests presets where all require reconciliation actions. All presets spawn goroutines, but concurrency is limited by `eg.SetLimit(reconciliationConcurrency)`. 
* `BenchmarkReconcileAll_Mix`: Simulates a realistic scenario with a large subset of inactive presets (filtered by `CanSkipReconciliation`) and a smaller subset requiring reconciliation (limited by `eg.SetLimit`). Closes: https://github.com/coder/coder/issues/20606	2026-01-21 10:56:31 +00:00
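The two changes described in 6ef9670384 (filter first, then bound concurrency to half the pool size with a minimum of 1) can be sketched as follows. This is an illustration only, not the PR's code: the `preset` type and `reconcileAll` are hypothetical, and a buffered channel stands in for `eg.SetLimit` so that the example needs only the standard library.

```go
package main

import (
	"fmt"
	"sync"
)

// preset is a hypothetical stand-in for a preset snapshot.
type preset struct {
	name    string
	canSkip bool // what CanSkipReconciliation() would report
}

// reconciliationConcurrency derives the goroutine limit from the
// connection pool size: half the pool, but never less than 1.
func reconciliationConcurrency(maxDBConnections int) int {
	limit := maxDBConnections / 2
	if limit < 1 {
		return 1
	}
	return limit
}

// reconcileAll skips presets that need no work, then reconciles the rest
// with at most `limit` goroutines in flight. It returns how many presets
// were reconciled and the peak concurrency observed.
func reconcileAll(presets []preset, limit int, reconcile func(preset)) (reconciled, peak int) {
	sem := make(chan struct{}, limit)
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		inFlight int
	)
	for _, p := range presets {
		if p.canSkip {
			continue // filtered before any goroutine is spawned
		}
		reconciled++
		sem <- struct{}{} // blocks while `limit` goroutines are running
		wg.Add(1)
		go func(p preset) {
			defer wg.Done()
			defer func() { <-sem }()
			mu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()
			reconcile(p)
			mu.Lock()
			inFlight--
			mu.Unlock()
		}(p)
	}
	wg.Wait()
	return reconciled, peak
}

func main() {
	presets := []preset{{"a", false}, {"b", true}, {"c", false}, {"d", false}}
	limit := reconciliationConcurrency(4) // pool of 4 connections
	n, peak := reconcileAll(presets, limit, func(preset) {})
	fmt.Printf("limit=%d reconciled=%d peak=%d\n", limit, n, peak)
}
```

Skipped presets never acquire a slot, so only presets that are likely to touch the database compete for the limited concurrency.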
Spike Curtis	bddb808b25	chore: arrange imports in a standard way (#21452 ) Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example: ``` import ( "context" "time" "github.com/prometheus/client_golang/prometheus" "golang.org/x/xerrors" "gopkg.in/natefinch/lumberjack.v2" "cdr.dev/slog/v3" "github.com/coder/coder/v2/codersdk/agentsdk" "github.com/coder/serpent" ) ``` 3 groups: standard library, 3rd party libs, Coder libs. This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.	2026-01-08 15:24:11 +04:00
Sas Swart	9a0024c45f	chore: add tracing to prebuilds (#21443 ) The implementation for prebuilt workspaces is complex and conversations regarding edge cases and bugs frequently get bogged down by minutiae, because it's hard to reason about the behaviour of the system. To alleviate this, I've introduced otel tracing to the StoreReconciler (see attached). We can now directly observe the behaviour of the prebuilds system under load in order to inform our decisions. Traces are terminated at the boundary between prebuilds and workspace builder, because of prebuilt workspaces' "fire and forget" philosophy and to prevent span explosion. <img width="3024" height="1718" alt="image" src="https://github.com/user-attachments/assets/f9b207be-8f2c-475e-98a8-46ef70bda446" />	2026-01-07 11:04:40 +02:00
Steven Masley	3194bcfc9e	chore: distinct operations for provisioner's 'parse', 'init', 'plan', 'apply', 'graph' (#21064 ) Provisioner steps broken into smaller granular actions. Changes: - `ExtractArchive` moved to `init` request (was in `configure`) - Writing `tfstate` moved to `plan` (was in `configure`) - Moved most plan/apply outputs to `GraphComplete`	2025-12-15 11:26:41 -06:00
Sas Swart	544f15523c	fix: adjust workspace claims to be initiated by users (#20179 ) The prebuilds user never initiates a workspace claim autonomously. A claim can only happen when a user attempts to create a workspace. When listing prebuild provisioner jobs, it would not make sense to see jobs related to users who are creating workspaces and have gotten a prebuilt workspace. When cleaning up an overwhelmed provisioner queue, we should not delete claims as they have humans waiting for them and are not part of the thundering herd. Therefore, this PR ensures that provisioner jobs that claim workspaces are considered to be initiated by the user, not the prebuilds system.	2025-10-08 10:40:54 +02:00
Susana Ferreira	8567ecbe52	fix: set prebuilds lifecycle parameters on creation and claim (#19252 ) ## Description This PR ensures that prebuilt workspaces are properly excluded from the lifecycle executor and treated as a separate class of workspaces, fully managed by the prebuild reconciliation loop. It introduces two lifecycle guarantees: * When a prebuilt workspace is created (i.e., when the workspace build completes), all lifecycle-related fields are unset, ensuring the workspace does not participate in TTL, autostop, autostart, dormancy, or auto-deletion logic. * When a prebuilt workspace is claimed, it transitions into a regular user workspace. At this point, all lifecycle fields are correctly populated according to template-level configurations, allowing the workspace to be managed by the lifecycle executor as expected. ## Changes * Prebuilt workspaces now have all lifecycle-relevant fields unset during creation * When a prebuild is claimed: * Lifecycle fields are set based on template and workspace level configurations. This ensures a clean transition into the standard workspace lifecycle flow. * Updated lifecycle-related SQL update queries to explicitly exclude prebuilt workspaces. ## Relates Related issue: https://github.com/coder/coder/issues/18898 To reduce the scope of this PR and make the review process more manageable, the original implementation has been split into the following focused PRs: * https://github.com/coder/coder/pull/19259 * https://github.com/coder/coder/pull/19263 * https://github.com/coder/coder/pull/19264 * https://github.com/coder/coder/pull/19265 These PRs should be considered in conjunction with this one to understand the complete set of lifecycle separation changes for prebuilt workspaces.	2025-08-13 12:45:46 +01:00
Dean Sheather	9a6dd73f68	feat: add managed agent license limit checks (#18937 ) - Adds a query for counting managed agent workspace builds between two timestamps - The "Actual" field in the feature entitlement for managed agents is now populated with the value read from the database - The wsbuilder package now validates AI agent usage against the limit when a license is installed Closes coder/internal#777	2025-07-22 13:39:26 +10:00
Steven Masley	82af2e019d	feat: implement dynamic parameter validation (#18482 ) # What does this do? This does parameter validation for dynamic parameters in `wsbuilder`. All input parameters are validated in `coder/coder` before being sent to terraform. The heart of this PR is [`ResolveParameters`](https://github.com/coder/coder/blob/b65001e89c0577199a8e470c138c51e91cf2350c/coderd/dynamicparameters/resolver.go#L30-L30). # What else changes? `wsbuilder` now needs to load the terraform files into memory to succeed. This does add a larger memory requirement to workspace builds. # Future work - Sort autostart handling workspaces by template version id. So workspaces with the same template version only load the terraform files once from the db, and store them in the cache.	2025-06-23 12:35:15 -05:00
ケイラ	fae30a00fd	chore: remove unnecessary redeclarations in for loops (#18440 )	2025-06-20 13:16:55 -06:00
Sas Swart	5f7e5d7097	feat: support prebuilt workspaces in non-default organizations (#18010 ) closes https://github.com/coder/internal/issues/527	2025-06-04 14:20:29 +02:00
Danny Kopping	6e967780c9	feat: track resource replacements when claiming a prebuilt workspace (#17571 ) Closes https://github.com/coder/internal/issues/369 We can't know whether a replacement (i.e. drift of terraform state leading to a resource needing to be deleted/recreated) will take place a priori; we can only detect it at `plan` time, because the provider decides whether a resource must be replaced and it cannot be inferred through static analysis of the template. This is likely to be the most common gotcha with using prebuilds, since it requires a slight template modification to use prebuilds effectively, so let's head this off before it's an issue for customers. Drift details will now be logged in the workspace build logs: ![image](https://github.com/user-attachments/assets/da1988b6-2cbe-4a79-a3c5-ea29891f3d6f) Plus a notification will be sent to template admins when this situation arises: ![image](https://github.com/user-attachments/assets/39d555b1-a262-4a3e-b529-03b9f23bf66a) A new metric - `coderd_prebuilt_workspaces_resource_replacements_total` - will also increment each time a workspace encounters replacements. We only track _that_ a resource replacement occurred, not how many. Just one is enough to ruin a prebuild, but we can't know a priori which replacement would cause this. For example, say we have 2 replacements: a `docker_container` and a `null_resource`; we don't know which one might cause an issue (or indeed if either would), so we just track the replacement. --------- Signed-off-by: Danny Kopping <dannykopping@gmail.com>	2025-05-14 14:52:22 +02:00
Yevhenii Shcherbina	98e5611e16	fix: fix for prebuilds claiming and deletion (#17624 ) PR contains: - fix for claiming & deleting prebuilds with immutable params - unit test for claiming scenario - unit test for deletion scenario The parameter resolver was failing when deleting/claiming prebuilds because a value for a previously-used parameter was provided to the resolver, but since the value was unchanged (it's coming from the preset) it failed in the resolver. The resolver was missing a check to see if the old value != new value; if the values match then there's no mutation of an immutable parameter. --------- Signed-off-by: Danny Kopping <dannykopping@gmail.com>	2025-05-01 08:52:23 +00:00
Yevhenii Shcherbina	a78f0fc4e1	refactor: use specific error for agpl and prebuilds (#17591 ) Follow-up PR to https://github.com/coder/coder/pull/17458 Addresses this discussion: https://github.com/coder/coder/pull/17458#discussion_r2055940797	2025-04-28 16:37:41 -04:00
Yevhenii Shcherbina	9167cbfe4c	refactor: claim prebuilt workspace tests (#17567 ) Follow-up to: https://github.com/coder/coder/pull/17458 Specifically it addresses these discussions: - https://github.com/coder/coder/pull/17458#discussion_r2053531445	2025-04-28 12:49:23 -04:00
Danny Kopping	e0483e3136	feat: add prebuilds metrics collector (#17547 ) Closes https://github.com/coder/internal/issues/509 --------- Signed-off-by: Danny Kopping <dannykopping@gmail.com>	2025-04-28 12:28:56 +02:00
Yevhenii Shcherbina	118f12ac3a	feat: implement claiming of prebuilt workspaces (#17458 ) Signed-off-by: Danny Kopping <dannykopping@gmail.com> Co-authored-by: Danny Kopping <dannykopping@gmail.com> Co-authored-by: Danny Kopping <danny@coder.com> Co-authored-by: Edward Angert <EdwardAngert@users.noreply.github.com> Co-authored-by: EdwardAngert <17991901+EdwardAngert@users.noreply.github.com> Co-authored-by: Jaayden Halko <jaayden.halko@gmail.com> Co-authored-by: Ethan <39577870+ethanndickson@users.noreply.github.com> Co-authored-by: M Atif Ali <atif@coder.com> Co-authored-by: Aericio <16523741+Aericio@users.noreply.github.com> Co-authored-by: M Atif Ali <me@matifali.dev> Co-authored-by: Michael Suchacz <203725896+ibetitsmike@users.noreply.github.com>	2025-04-24 09:39:38 -04:00

18 Commits