Commit Graph

18 Commits

Author SHA1 Message Date
Callum Styan 5f3be6b288 feat: add provisioner job queue wait time histogram and jobs enqueued counter (#21869)
This PR adds some metrics to help identify job enqueue rates and
latencies. This work was initiated as a way to help reduce the cost of
the observation/measurement itself for autostart scaletests, which
impacts our ability to identify/reason about the load caused by
autostart. See: https://github.com/coder/internal/issues/1209

I've extended the metrics here to account for regular user initiated
builds, prebuilds, autostarts, etc. IMO there is still the question here
of whether we want to include or need the `transition` label, which is
only present on workspace builds. Including it does lead to an increase
in cardinality, and in the case of the histogram (when not using native
histograms) that's at least a few extra series for every bucket. We
could remove the transition label there but keep it on the counter.
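
To make the cardinality concern concrete, here is a rough back-of-the-envelope sketch. It assumes a classic (non-native) Prometheus histogram, which exposes one series per configured bucket plus the implicit `+Inf` bucket, `_sum`, and `_count` for every label combination. The label value counts and bucket count below are hypothetical, chosen only for illustration, and the helper names are not from the PR.

```go
package main

import "fmt"

// labelCombos multiplies the number of distinct values per label. This is an
// upper bound: not every combination necessarily occurs in practice.
func labelCombos(valuesPerLabel ...int) int {
	n := 1
	for _, v := range valuesPerLabel {
		n *= v
	}
	return n
}

// histogramSeries returns how many series a classic histogram exposes:
// per label combination, one series per configured bucket, one for the
// implicit +Inf bucket, plus the _sum and _count series.
func histogramSeries(combos, buckets int) int {
	return combos * (buckets + 3)
}

func main() {
	// Hypothetical numbers, for illustration only.
	buckets := 12
	// build_reason (4 values), job_type (2), provisioner_type (1).
	without := labelCombos(4, 2, 1)
	// Same labels plus transition (3 values: start, stop, delete).
	with := labelCombos(4, 2, 1, 3)

	fmt.Println("series without transition:", histogramSeries(without, buckets))
	fmt.Println("series with transition:   ", histogramSeries(with, buckets))
}
```

Under these assumed numbers the `transition` label triples the histogram's series count, while the same label on a counter only adds one series per combination.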

Additionally, the histogram currently observes latencies for other
jobs, such as template builds/version imports, which do not have a
transition type associated with them.

Tested briefly in a workspace, can see metric values like the following:
- `coderd_workspace_builds_enqueued_total{build_reason="autostart",provisioner_type="terraform",status="success",transition="start"} 1`
- `coderd_provisioner_job_queue_wait_seconds_bucket{build_reason="autostart",job_type="workspace_build",provisioner_type="terraform",transition="start",le="0.025"} 1`

---------

Signed-off-by: Callum Styan <callumstyan@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-12 13:40:47 -08:00
Jon Ayers 3c1db17361 fix: use existing transaction to claim prebuild (#21862)
- Claiming a prebuild was happening outside a transaction
2026-02-02 17:57:59 -06:00
Susana Ferreira 6ef9670384 fix: limit concurrent database connections in prebuild reconciliation (#20908)
## Description

This PR addresses database connection pool exhaustion during prebuilds
reconciliation by introducing two changes:
* `CanSkipReconciliation`: Filters out presets that don't need
reconciliation before spawning goroutines. This ensures we only create
goroutines for presets that will (_most likely_) perform database
operations, avoiding unnecessary connection pool usage.
* Dynamic `eg.SetLimit`: Limits concurrent goroutines based on the
configured database connection pool size (`CODER_PG_CONN_MAX_OPEN / 2`).
This replaces the previous hardcoded limit of 5, ensuring the
reconciliation loop scales appropriately with the configured pool size
while leaving capacity for other database operations.
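
The two changes above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the PR's code: it swaps `errgroup`'s `SetLimit` for a buffered-channel semaphore, and the `preset` type, its `canSkip` field, and `reconcileAll` are hypothetical stand-ins. Only the "half the pool size, minimum 1" rule is taken from the description.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// reconciliationConcurrency follows the rule described above: half of the
// configured pool size, but never less than 1.
func reconciliationConcurrency(maxDBConnections int) int {
	limit := maxDBConnections / 2
	if limit < 1 {
		return 1
	}
	return limit
}

// preset is a hypothetical stand-in for a preset snapshot.
type preset struct {
	name    string
	canSkip bool // stand-in for the result of CanSkipReconciliation()
}

// reconcileAll filters skippable presets before spawning goroutines, then
// runs work with at most limit goroutines in flight. It returns how many
// presets were reconciled and the peak concurrency that was observed.
func reconcileAll(presets []preset, limit int, work func(preset)) (int, int64) {
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	var inFlight, peak int64
	reconciled := 0

	for _, p := range presets {
		if p.canSkip {
			continue // no goroutine, no database connection
		}
		reconciled++
		sem <- struct{}{} // blocks while limit goroutines are running
		wg.Add(1)
		go func(p preset) {
			defer wg.Done()
			defer func() { <-sem }()
			cur := atomic.AddInt64(&inFlight, 1)
			for { // record the highest concurrency seen so far
				old := atomic.LoadInt64(&peak)
				if cur <= old || atomic.CompareAndSwapInt64(&peak, old, cur) {
					break
				}
			}
			work(p)
			atomic.AddInt64(&inFlight, -1)
		}(p)
	}
	wg.Wait()
	return reconciled, atomic.LoadInt64(&peak)
}

func main() {
	presets := []preset{
		{name: "a"}, {name: "b", canSkip: true}, {name: "c"}, {name: "d"},
	}
	limit := reconciliationConcurrency(4)
	n, peak := reconcileAll(presets, limit, func(preset) {})
	fmt.Printf("limit=%d reconciled=%d peak=%d\n", limit, n, peak)
}
```

Filtering before acquiring the semaphore means skipped presets never occupy a concurrency slot, which is the point of checking `CanSkipReconciliation` ahead of spawning goroutines.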

## Changes

* Add `CanSkipReconciliation()` method to `PresetSnapshot` that returns
true for inactive presets with no running workspaces, no pending jobs,
or expired prebuilds.
* Add `maxDBConnections` parameter to `NewStoreReconciler` and compute
`reconciliationConcurrency` as half the pool size (minimum 1).
* Add `ReconciliationConcurrency()` getter method to `StoreReconciler`.
* Add `eg.SetLimit(c.reconciliationConcurrency)` to bound concurrent
reconciliation goroutines.
* Add `PresetsTotal` and `PresetsReconciled` to `ReconcileStats` for
observability.
* Add `TestCanSkipReconciliation` unit tests.
* Add `TestReconciliationConcurrency` unit tests.
* Add benchmark tests for reconciliation performance.

## Benchmarks

* `BenchmarkReconcileAll_NoOps`: Tests presets with no reconciliation
actions. All presets are filtered by `CanSkipReconciliation`, resulting
in no goroutines spawned and no database connections used.
* `BenchmarkReconcileAll_ConnectionContention`: Tests presets where all
require reconciliation actions. All presets spawn goroutines, but
concurrency is limited by `eg.SetLimit(reconciliationConcurrency)`.
* `BenchmarkReconcileAll_Mix`: Simulates a realistic scenario with a
large subset of inactive presets (filtered by `CanSkipReconciliation`)
and a smaller subset requiring reconciliation (limited by
`eg.SetLimit`).

Closes: https://github.com/coder/coder/issues/20606
2026-01-21 10:56:31 +00:00
Spike Curtis bddb808b25 chore: arrange imports in a standard way (#21452)
Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example:

```
import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"golang.org/x/xerrors"
	"gopkg.in/natefinch/lumberjack.v2"

	"cdr.dev/slog/v3"
	"github.com/coder/coder/v2/codersdk/agentsdk"
	"github.com/coder/serpent"
)
```

3 groups: standard library, 3rd party libs, Coder libs.

This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.
2026-01-08 15:24:11 +04:00
Sas Swart 9a0024c45f chore: add tracing to prebuilds (#21443)
The implementation for prebuilt workspaces is complex and conversations
regarding edge cases and bugs frequently get bogged down by minutiae,
because it's hard to reason about the behaviour of the system.

To alleviate this, I've introduced otel tracing to the StoreReconciler
(see attached). We can now directly observe the behaviour of the
prebuilds system under load in order to inform our decisions.

Traces are terminated at the boundary between prebuilds and workspace
builder, because of prebuilt workspaces' "fire and forget" philosophy
and to prevent span explosion.

<img width="3024" height="1718" alt="image"
src="https://github.com/user-attachments/assets/f9b207be-8f2c-475e-98a8-46ef70bda446"
/>
2026-01-07 11:04:40 +02:00
Steven Masley 3194bcfc9e chore: distinct operations for provisioner's 'parse', 'init', 'plan', 'apply', 'graph' (#21064)
Provisioner steps broken into smaller granular actions.
Changes:
- `ExtractArchive` moved to `init` request (was in `configure`)
- Writing `tfstate` moved to `plan` (was in `configure`)
- Moved most plan/apply outputs to `GraphComplete`
2025-12-15 11:26:41 -06:00
Sas Swart 544f15523c fix: adjust workspace claims to be initiated by users (#20179)
The prebuilds user never initiates a workspace claim autonomously. A
claim can only happen when a user attempts to create a workspace. When
listing prebuild provisioner jobs, it would not make sense to see jobs
related to users who are creating workspaces and have gotten a prebuilt
workspace. When cleaning up an overwhelmed provisioner queue, we should
not delete claims as they have humans waiting for them and are not part
of the thundering herd.

Therefore, this PR ensures that provisioner jobs that claim workspaces
are considered to be initiated by the user, not the prebuilds system.
2025-10-08 10:40:54 +02:00
Susana Ferreira 8567ecbe52 fix: set prebuilds lifecycle parameters on creation and claim (#19252)
## Description

This PR ensures that prebuilt workspaces are properly excluded from the
lifecycle executor and treated as a separate class of workspaces, fully
managed by the prebuild reconciliation loop.

It introduces two lifecycle guarantees:
* When a prebuilt workspace is created (i.e., when the workspace build
completes), all lifecycle-related fields are unset, ensuring the
workspace does not participate in TTL, autostop, autostart, dormancy, or
auto-deletion logic.
* When a prebuilt workspace is claimed, it transitions into a regular
user workspace. At this point, all lifecycle fields are correctly
populated according to template-level configurations, allowing the
workspace to be managed by the lifecycle executor as expected.

## Changes

* Prebuilt workspaces now have all lifecycle-relevant fields unset
during creation
* When a prebuild is claimed:
  * Lifecycle fields are set based on template and workspace level
    configurations. This ensures a clean transition into the standard
    workspace lifecycle flow.
* Updated lifecycle-related SQL update queries to explicitly exclude
prebuilt workspaces.

## Relates 

Related issue: https://github.com/coder/coder/issues/18898

To reduce the scope of this PR and make the review process more
manageable, the original implementation has been split into the
following focused PRs:
* https://github.com/coder/coder/pull/19259
* https://github.com/coder/coder/pull/19263
* https://github.com/coder/coder/pull/19264
* https://github.com/coder/coder/pull/19265

These PRs should be considered in conjunction with this one to
understand the complete set of lifecycle separation changes for prebuilt
workspaces.
2025-08-13 12:45:46 +01:00
Dean Sheather 9a6dd73f68 feat: add managed agent license limit checks (#18937)
- Adds a query for counting managed agent workspace builds between two
timestamps
- The "Actual" field in the feature entitlement for managed agents is
now populated with the value read from the database
- The wsbuilder package now validates AI agent usage against the limit
when a license is installed

Closes coder/internal#777
2025-07-22 13:39:26 +10:00
Steven Masley 82af2e019d feat: implement dynamic parameter validation (#18482)
# What does this do?

This does parameter validation for dynamic parameters in `wsbuilder`. All input parameters are validated in `coder/coder` before being sent to terraform.

The heart of this PR is [`ResolveParameters`](https://github.com/coder/coder/blob/b65001e89c0577199a8e470c138c51e91cf2350c/coderd/dynamicparameters/resolver.go#L30-L30).

# What else changes?

`wsbuilder` now needs to load the terraform files into memory to succeed. This does add a larger memory requirement to workspace builds.

# Future work

- Sort autostart handling workspaces by template version id. So workspaces with the same template version only load the terraform files once from the db, and store them in the cache.
2025-06-23 12:35:15 -05:00
ケイラ fae30a00fd chore: remove unnecessary redeclarations in for loops (#18440)
2025-06-20 13:16:55 -06:00
Sas Swart 5f7e5d7097 feat: support prebuilt workspaces in non-default organizations (#18010)
closes https://github.com/coder/internal/issues/527
2025-06-04 14:20:29 +02:00
Danny Kopping 6e967780c9 feat: track resource replacements when claiming a prebuilt workspace (#17571)
Closes https://github.com/coder/internal/issues/369

We can't know whether a replacement (i.e. drift of terraform state
leading to a resource needing to be deleted/recreated) will take place
a priori; we can only detect it at `plan` time, because the provider
decides whether a resource must be replaced and it cannot be inferred
through static analysis of the template.

**This is likely to be the most common gotcha with using prebuilds,
since it requires a slight template modification to use prebuilds
effectively**, so let's head this off before it's an issue for
customers.

Drift details will now be logged in the workspace build logs:


![image](https://github.com/user-attachments/assets/da1988b6-2cbe-4a79-a3c5-ea29891f3d6f)

Plus a notification will be sent to template admins when this situation
arises:


![image](https://github.com/user-attachments/assets/39d555b1-a262-4a3e-b529-03b9f23bf66a)

A new metric - `coderd_prebuilt_workspaces_resource_replacements_total`
- will also increment each time a workspace encounters replacements.

We only track _that_ a resource replacement occurred, not how many. Just
one is enough to ruin a prebuild, but we can't know a priori which
replacement would cause this.
For example, say we have 2 replacements: a `docker_container` and a
`null_resource`; we don't know which one might
cause an issue (or indeed if either would), so we just track the
replacement.

---------

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
2025-05-14 14:52:22 +02:00
Yevhenii Shcherbina 98e5611e16 fix: fix for prebuilds claiming and deletion (#17624)
PR contains:
- fix for claiming & deleting prebuilds with immutable params
- unit test for claiming scenario
- unit test for deletion scenario

The parameter resolver was failing when deleting/claiming prebuilds
because a value for a previously-used parameter was provided to the
resolver, but since the value was unchanged (it's coming from the
preset) it failed in the resolver. The resolver was missing a check to
see if the old value != new value; if the values match then there's no
mutation of an immutable parameter.
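
The missing check can be illustrated with a small sketch. The function name, signature, and error below are hypothetical and are not the resolver's actual API; the sketch only shows the idea that re-supplying an unchanged value for an immutable parameter is not a mutation.

```go
package main

import (
	"errors"
	"fmt"
)

// errImmutable is a hypothetical sentinel error for this sketch.
var errImmutable = errors.New("parameter is immutable and cannot be changed")

// validateImmutable rejects a new value for an immutable parameter only when
// it actually differs from the previously used value. hadOld reports whether
// a previous build supplied a value at all.
func validateImmutable(name string, immutable, hadOld bool, oldValue, newValue string) error {
	if !immutable || !hadOld {
		return nil // mutable, or first time the parameter is set
	}
	if oldValue == newValue {
		return nil // same value again (e.g. from the preset): no mutation
	}
	return fmt.Errorf("parameter %q: %w", name, errImmutable)
}

func main() {
	// Claiming a prebuild re-sends the preset's value: accepted.
	fmt.Println(validateImmutable("region", true, true, "us-east", "us-east"))
	// Changing an immutable parameter: rejected.
	fmt.Println(validateImmutable("region", true, true, "us-east", "eu-west"))
}
```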

---------

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
2025-05-01 08:52:23 +00:00
Yevhenii Shcherbina a78f0fc4e1 refactor: use specific error for agpl and prebuilds (#17591)
Follow-up PR to https://github.com/coder/coder/pull/17458
Addresses this discussion:
https://github.com/coder/coder/pull/17458#discussion_r2055940797
2025-04-28 16:37:41 -04:00
Yevhenii Shcherbina 9167cbfe4c refactor: claim prebuilt workspace tests (#17567)
Follow-up to: https://github.com/coder/coder/pull/17458
Specifically it addresses these discussions:
- https://github.com/coder/coder/pull/17458#discussion_r2053531445
2025-04-28 12:49:23 -04:00
Danny Kopping e0483e3136 feat: add prebuilds metrics collector (#17547)
Closes https://github.com/coder/internal/issues/509

---------

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
2025-04-28 12:28:56 +02:00
Yevhenii Shcherbina 118f12ac3a feat: implement claiming of prebuilt workspaces (#17458)
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Co-authored-by: Danny Kopping <dannykopping@gmail.com>
Co-authored-by: Danny Kopping <danny@coder.com>
Co-authored-by: Edward Angert <EdwardAngert@users.noreply.github.com>
Co-authored-by: EdwardAngert <17991901+EdwardAngert@users.noreply.github.com>
Co-authored-by: Jaayden Halko <jaayden.halko@gmail.com>
Co-authored-by: Ethan <39577870+ethanndickson@users.noreply.github.com>
Co-authored-by: M Atif Ali <atif@coder.com>
Co-authored-by: Aericio <16523741+Aericio@users.noreply.github.com>
Co-authored-by: M Atif Ali <me@matifali.dev>
Co-authored-by: Michael Suchacz <203725896+ibetitsmike@users.noreply.github.com>
2025-04-24 09:39:38 -04:00