mirror of
https://github.com/coder/coder.git
synced 2026-06-03 04:58:23 +00:00
0ab345ca84
## Description This PR introduces one counter and two histograms related to workspace creation and claiming. The goal is to provide clearer observability into how workspaces are created (regular vs prebuild) and the time cost of those operations. ### `coderd_workspace_creation_total` * Metric type: Counter * Name: `coderd_workspace_creation_total` * Labels: `organization_name`, `template_name`, `preset_name` This counter tracks whether a regular workspace (not created from a prebuild pool) was created using a preset or not. Currently, we already expose `coderd_prebuilt_workspaces_claimed_total` for claimed prebuilt workspaces, but we lack a comparable metric for regular workspace creations. This metric fills that gap, making it possible to compare regular creations against claims. Implementation notes: * Exposed as a `coderd_` metric, consistent with other workspace-related metrics (e.g. `coderd_api_workspace_latest_build`: https://github.com/coder/coder/blob/main/coderd/prometheusmetrics/prometheusmetrics.go#L149). * Every `defaultRefreshRate` (1 minute ), DB query `GetRegularWorkspaceCreateMetrics` is executed to fetch all regular workspaces (not created from a prebuild pool). * The counter is updated with the total from all time (not just since metric introduction). This differs from the histograms below, which only accumulate from their introduction forward. ### `coderd_workspace_creation_duration_seconds` & `coderd_prebuilt_workspace_claim_duration_seconds` * Metric types: Histogram * Names: * `coderd_workspace_creation_duration_seconds` * Labels: `organization_name`, `template_name`, `preset_name`, `type` (`regular`, `prebuild`) * `coderd_prebuilt_workspace_claim_duration_seconds` * Labels: `organization_name`, `template_name`, `preset_name` We already have `coderd_provisionerd_workspace_build_timings_seconds`, which tracks build run times for all workspace builds handled by the provisioner daemon. However, in the context of this issue, we are only interested in creation and claim build times, not all transitions; additionally, this metric does not include `preset_name`, and adding it there would significantly increase cardinality. Therefore, separate more focused metrics are introduced here: * `coderd_workspace_creation_duration_seconds`: Build time to create a workspace (either a regular workspace or the build into a prebuild pool, for prebuild initial provisioning build). * `coderd_prebuilt_workspace_claim_duration_seconds`: Time to claim a prebuilt workspace from the pool. The reason for two separate histograms is that: * Creation (regular or prebuild): provisioning builds with similar time magnitude, generally expected to take longer than a claim operation. * Claim: expected to be a much faster provisioning build. #### Native histogram usage Provisioning times vary widely between projects. Using static buckets risks unbalanced or poorly informative histograms. To address this, these metrics use [Prometheus native histograms](https://prometheus.io/docs/specs/native_histograms/): * First introduced in Prometheus v2.40.0 * Recommended stable usage from v2.45+ * Requires Go client `prometheus/client_golang` v1.15.0+ * Experimental and must be explicitly enabled on the server (`--enable-feature=native-histograms`) For compatibility, we also retain a classic bucket definition (aligned with the existing provisioner metric: https://github.com/coder/coder/blob/main/provisionerd/provisionerd.go#L182-L189). * If native histograms are enabled, Prometheus ingests the high-resolution histogram. * If not, it falls back to the predefined buckets. Implementation notes: * Unlike the counter, these histograms are updated in real-time at workspace build job completion. * They reflect data only from the point of introduction forward (no historical backfill). ## Relates to Closes: https://github.com/coder/coder/issues/19528 Native histograms tested in observability stack: https://github.com/coder/observability/pull/50
275 lines
13 KiB
SQL
275 lines
13 KiB
SQL
-- name: ClaimPrebuiltWorkspace :one
|
|
UPDATE workspaces w
|
|
SET owner_id = @new_user_id::uuid,
|
|
name = @new_name::text,
|
|
updated_at = @now::timestamptz,
|
|
-- Update autostart_schedule, next_start_at and ttl according to template and workspace-level
|
|
-- configurations, allowing the workspace to be managed by the lifecycle executor as expected.
|
|
autostart_schedule = @autostart_schedule,
|
|
next_start_at = @next_start_at,
|
|
ttl = @workspace_ttl,
|
|
-- Update last_used_at during claim to ensure the claimed workspace is treated as recently used.
|
|
-- This avoids unintended dormancy caused by prebuilds having stale usage timestamps.
|
|
last_used_at = @now::timestamptz,
|
|
-- Clear dormant and deletion timestamps as a safeguard to ensure a clean lifecycle state after claim.
|
|
-- These fields should not be set on prebuilds, but we defensively reset them here to prevent
|
|
-- accidental dormancy or deletion by the lifecycle executor.
|
|
dormant_at = NULL,
|
|
deleting_at = NULL
|
|
WHERE w.id IN (
|
|
SELECT p.id
|
|
FROM workspace_prebuilds p
|
|
INNER JOIN workspace_latest_builds b ON b.workspace_id = p.id
|
|
INNER JOIN templates t ON p.template_id = t.id
|
|
WHERE (b.transition = 'start'::workspace_transition
|
|
AND b.job_status IN ('succeeded'::provisioner_job_status))
|
|
-- The prebuilds system should never try to claim a prebuild for an inactive template version.
|
|
-- Nevertheless, this filter is here as a defensive measure:
|
|
AND b.template_version_id = t.active_version_id
|
|
AND p.current_preset_id = @preset_id::uuid
|
|
AND p.ready
|
|
AND NOT t.deleted
|
|
LIMIT 1 FOR UPDATE OF p SKIP LOCKED -- Ensure that a concurrent request will not select the same prebuild.
|
|
)
|
|
RETURNING w.id, w.name;
|
|
|
|
-- name: GetTemplatePresetsWithPrebuilds :many
|
|
-- GetTemplatePresetsWithPrebuilds retrieves template versions with configured presets and prebuilds.
|
|
-- It also returns the number of desired instances for each preset.
|
|
-- If template_id is specified, only template versions associated with that template will be returned.
|
|
SELECT
|
|
t.id AS template_id,
|
|
t.name AS template_name,
|
|
o.id AS organization_id,
|
|
o.name AS organization_name,
|
|
tv.id AS template_version_id,
|
|
tv.name AS template_version_name,
|
|
tv.id = t.active_version_id AS using_active_version,
|
|
tvp.id,
|
|
tvp.name,
|
|
tvp.desired_instances AS desired_instances,
|
|
tvp.scheduling_timezone,
|
|
tvp.invalidate_after_secs AS ttl,
|
|
tvp.prebuild_status,
|
|
t.deleted,
|
|
t.deprecated != '' AS deprecated
|
|
FROM templates t
|
|
INNER JOIN template_versions tv ON tv.template_id = t.id
|
|
INNER JOIN template_version_presets tvp ON tvp.template_version_id = tv.id
|
|
INNER JOIN organizations o ON o.id = t.organization_id
|
|
WHERE tvp.desired_instances IS NOT NULL -- Consider only presets that have a prebuild configuration.
|
|
-- AND NOT t.deleted -- We don't exclude deleted templates because there's no constraint in the DB preventing a soft deletion on a template while workspaces are running.
|
|
AND (t.id = sqlc.narg('template_id')::uuid OR sqlc.narg('template_id') IS NULL);
|
|
|
|
-- name: GetRunningPrebuiltWorkspaces :many
|
|
WITH latest_prebuilds AS (
|
|
-- All workspaces that match the following criteria:
|
|
-- 1. Owned by prebuilds user
|
|
-- 2. Not deleted
|
|
-- 3. Latest build is a 'start' transition
|
|
-- 4. Latest build was successful
|
|
SELECT
|
|
workspaces.id,
|
|
workspaces.name,
|
|
workspaces.template_id,
|
|
workspace_latest_builds.template_version_id,
|
|
workspace_latest_builds.job_id,
|
|
workspaces.created_at
|
|
FROM workspace_latest_builds
|
|
JOIN workspaces ON workspaces.id = workspace_latest_builds.workspace_id
|
|
WHERE workspace_latest_builds.transition = 'start'::workspace_transition
|
|
AND workspace_latest_builds.job_status = 'succeeded'::provisioner_job_status
|
|
AND workspaces.owner_id = 'c42fdf75-3097-471c-8c33-fb52454d81c0'::UUID
|
|
AND NOT workspaces.deleted
|
|
),
|
|
workspace_latest_presets AS (
|
|
-- For each of the above workspaces, the preset_id of the most recent
|
|
-- successful start transition.
|
|
SELECT DISTINCT ON (latest_prebuilds.id)
|
|
latest_prebuilds.id AS workspace_id,
|
|
workspace_builds.template_version_preset_id AS current_preset_id
|
|
FROM latest_prebuilds
|
|
JOIN workspace_builds ON workspace_builds.workspace_id = latest_prebuilds.id
|
|
WHERE workspace_builds.transition = 'start'::workspace_transition
|
|
AND workspace_builds.template_version_preset_id IS NOT NULL
|
|
ORDER BY latest_prebuilds.id, workspace_builds.build_number DESC
|
|
),
|
|
ready_agents AS (
|
|
-- For each of the above workspaces, check if all agents are ready.
|
|
SELECT
|
|
latest_prebuilds.job_id,
|
|
BOOL_AND(workspace_agents.lifecycle_state = 'ready'::workspace_agent_lifecycle_state)::boolean AS ready
|
|
FROM latest_prebuilds
|
|
JOIN workspace_resources ON workspace_resources.job_id = latest_prebuilds.job_id
|
|
JOIN workspace_agents ON workspace_agents.resource_id = workspace_resources.id
|
|
WHERE workspace_agents.deleted = false
|
|
AND workspace_agents.parent_id IS NULL
|
|
GROUP BY latest_prebuilds.job_id
|
|
)
|
|
SELECT
|
|
latest_prebuilds.id,
|
|
latest_prebuilds.name,
|
|
latest_prebuilds.template_id,
|
|
latest_prebuilds.template_version_id,
|
|
workspace_latest_presets.current_preset_id,
|
|
COALESCE(ready_agents.ready, false)::boolean AS ready,
|
|
latest_prebuilds.created_at
|
|
FROM latest_prebuilds
|
|
LEFT JOIN ready_agents ON ready_agents.job_id = latest_prebuilds.job_id
|
|
LEFT JOIN workspace_latest_presets ON workspace_latest_presets.workspace_id = latest_prebuilds.id
|
|
ORDER BY latest_prebuilds.id;
|
|
|
|
-- name: CountInProgressPrebuilds :many
|
|
-- CountInProgressPrebuilds returns the number of in-progress prebuilds, grouped by preset ID and transition.
|
|
-- Prebuild considered in-progress if it's in the "starting", "stopping", or "deleting" state.
|
|
SELECT t.id AS template_id, wpb.template_version_id, wpb.transition, COUNT(wpb.transition)::int AS count, wlb.template_version_preset_id as preset_id
|
|
FROM workspace_latest_builds wlb
|
|
INNER JOIN workspace_prebuild_builds wpb ON wpb.id = wlb.id
|
|
-- We only need these counts for active template versions.
|
|
-- It doesn't influence whether we create or delete prebuilds
|
|
-- for inactive template versions. This is because we never create
|
|
-- prebuilds for inactive template versions, we always delete
|
|
-- running prebuilds for inactive template versions, and we ignore
|
|
-- prebuilds that are still building.
|
|
INNER JOIN templates t ON t.active_version_id = wlb.template_version_id
|
|
WHERE wlb.job_status IN ('pending'::provisioner_job_status, 'running'::provisioner_job_status)
|
|
-- AND NOT t.deleted -- We don't exclude deleted templates because there's no constraint in the DB preventing a soft deletion on a template while workspaces are running.
|
|
GROUP BY t.id, wpb.template_version_id, wpb.transition, wlb.template_version_preset_id;
|
|
|
|
-- GetPresetsBackoff groups workspace builds by preset ID.
|
|
-- Each preset is associated with exactly one template version ID.
|
|
-- For each group, the query checks up to N of the most recent jobs that occurred within the
|
|
-- lookback period, where N equals the number of desired instances for the corresponding preset.
|
|
-- If at least one of the job within a group has failed, we should backoff on the corresponding preset ID.
|
|
-- Query returns a list of preset IDs for which we should backoff.
|
|
-- Only active template versions with configured presets are considered.
|
|
-- We also return the number of failed workspace builds that occurred during the lookback period.
|
|
--
|
|
-- NOTE:
|
|
-- - To **decide whether to back off**, we look at up to the N most recent builds (within the defined lookback period).
|
|
-- - To **calculate the number of failed builds**, we consider all builds within the defined lookback period.
|
|
--
|
|
-- The number of failed builds is used downstream to determine the backoff duration.
|
|
-- name: GetPresetsBackoff :many
|
|
WITH filtered_builds AS (
|
|
-- Only select builds which are for prebuild creations
|
|
SELECT wlb.template_version_id, wlb.created_at, tvp.id AS preset_id, wlb.job_status, tvp.desired_instances
|
|
FROM template_version_presets tvp
|
|
INNER JOIN workspace_latest_builds wlb ON wlb.template_version_preset_id = tvp.id
|
|
INNER JOIN workspaces w ON wlb.workspace_id = w.id
|
|
INNER JOIN template_versions tv ON wlb.template_version_id = tv.id
|
|
INNER JOIN templates t ON tv.template_id = t.id AND t.active_version_id = tv.id
|
|
WHERE tvp.desired_instances IS NOT NULL -- Consider only presets that have a prebuild configuration.
|
|
AND wlb.transition = 'start'::workspace_transition
|
|
AND w.owner_id = 'c42fdf75-3097-471c-8c33-fb52454d81c0'
|
|
AND NOT t.deleted
|
|
),
|
|
time_sorted_builds AS (
|
|
-- Group builds by preset, then sort each group by created_at.
|
|
SELECT fb.template_version_id, fb.created_at, fb.preset_id, fb.job_status, fb.desired_instances,
|
|
ROW_NUMBER() OVER (PARTITION BY fb.preset_id ORDER BY fb.created_at DESC) as rn
|
|
FROM filtered_builds fb
|
|
),
|
|
failed_count AS (
|
|
-- Count failed builds per preset in the given period
|
|
SELECT preset_id, COUNT(*) AS num_failed
|
|
FROM filtered_builds
|
|
WHERE job_status = 'failed'::provisioner_job_status
|
|
AND created_at >= @lookback::timestamptz
|
|
GROUP BY preset_id
|
|
)
|
|
SELECT
|
|
tsb.template_version_id,
|
|
tsb.preset_id,
|
|
COALESCE(fc.num_failed, 0)::int AS num_failed,
|
|
MAX(tsb.created_at)::timestamptz AS last_build_at
|
|
FROM time_sorted_builds tsb
|
|
LEFT JOIN failed_count fc ON fc.preset_id = tsb.preset_id
|
|
WHERE tsb.rn <= tsb.desired_instances -- Fetch the last N builds, where N is the number of desired instances; if any fail, we backoff
|
|
AND tsb.job_status = 'failed'::provisioner_job_status
|
|
AND created_at >= @lookback::timestamptz
|
|
GROUP BY tsb.template_version_id, tsb.preset_id, fc.num_failed;
|
|
|
|
-- GetPresetsAtFailureLimit groups workspace builds by preset ID.
|
|
-- Each preset is associated with exactly one template version ID.
|
|
-- For each preset, the query checks the last hard_limit builds.
|
|
-- If all of them failed, the preset is considered to have hit the hard failure limit.
|
|
-- The query returns a list of preset IDs that have reached this failure threshold.
|
|
-- Only active template versions with configured presets are considered.
|
|
-- name: GetPresetsAtFailureLimit :many
|
|
WITH filtered_builds AS (
|
|
-- Only select builds which are for prebuild creations
|
|
SELECT wlb.template_version_id, wlb.created_at, tvp.id AS preset_id, wlb.job_status, tvp.desired_instances
|
|
FROM template_version_presets tvp
|
|
INNER JOIN workspace_latest_builds wlb ON wlb.template_version_preset_id = tvp.id
|
|
INNER JOIN workspaces w ON wlb.workspace_id = w.id
|
|
INNER JOIN template_versions tv ON wlb.template_version_id = tv.id
|
|
INNER JOIN templates t ON tv.template_id = t.id AND t.active_version_id = tv.id
|
|
WHERE tvp.desired_instances IS NOT NULL -- Consider only presets that have a prebuild configuration.
|
|
AND wlb.transition = 'start'::workspace_transition
|
|
AND w.owner_id = 'c42fdf75-3097-471c-8c33-fb52454d81c0'
|
|
),
|
|
time_sorted_builds AS (
|
|
-- Group builds by preset, then sort each group by created_at.
|
|
SELECT fb.template_version_id, fb.created_at, fb.preset_id, fb.job_status, fb.desired_instances,
|
|
ROW_NUMBER() OVER (PARTITION BY fb.preset_id ORDER BY fb.created_at DESC) as rn
|
|
FROM filtered_builds fb
|
|
)
|
|
SELECT
|
|
tsb.template_version_id,
|
|
tsb.preset_id
|
|
FROM time_sorted_builds tsb
|
|
-- For each preset, check the last hard_limit builds.
|
|
-- If all of them failed, the preset is considered to have hit the hard failure limit.
|
|
WHERE tsb.rn <= @hard_limit::bigint
|
|
AND tsb.job_status = 'failed'::provisioner_job_status
|
|
GROUP BY tsb.template_version_id, tsb.preset_id
|
|
HAVING COUNT(*) = @hard_limit::bigint;
|
|
|
|
-- name: GetPrebuildMetrics :many
|
|
SELECT
|
|
t.name as template_name,
|
|
tvp.name as preset_name,
|
|
o.name as organization_name,
|
|
COUNT(*) as created_count,
|
|
COUNT(*) FILTER (WHERE pj.job_status = 'failed'::provisioner_job_status) as failed_count,
|
|
COUNT(*) FILTER (
|
|
WHERE w.owner_id != 'c42fdf75-3097-471c-8c33-fb52454d81c0'::uuid -- The system user responsible for prebuilds.
|
|
) as claimed_count
|
|
FROM workspaces w
|
|
INNER JOIN workspace_prebuild_builds wpb ON wpb.workspace_id = w.id
|
|
INNER JOIN templates t ON t.id = w.template_id
|
|
INNER JOIN template_version_presets tvp ON tvp.id = wpb.template_version_preset_id
|
|
INNER JOIN provisioner_jobs pj ON pj.id = wpb.job_id
|
|
INNER JOIN organizations o ON o.id = w.organization_id
|
|
WHERE NOT t.deleted AND wpb.build_number = 1
|
|
GROUP BY t.name, tvp.name, o.name
|
|
ORDER BY t.name, tvp.name, o.name;
|
|
|
|
-- name: FindMatchingPresetID :one
|
|
-- FindMatchingPresetID finds a preset ID that is the largest exact subset of the provided parameters.
|
|
-- It returns the preset ID if a match is found, or NULL if no match is found.
|
|
-- The query finds presets where all preset parameters are present in the provided parameters,
|
|
-- and returns the preset with the most parameters (largest subset).
|
|
WITH provided_params AS (
|
|
SELECT
|
|
unnest(@parameter_names::text[]) AS name,
|
|
unnest(@parameter_values::text[]) AS value
|
|
),
|
|
preset_matches AS (
|
|
SELECT
|
|
tvp.id AS template_version_preset_id,
|
|
COALESCE(COUNT(tvpp.name), 0) AS total_preset_params,
|
|
COALESCE(COUNT(pp.name), 0) AS matching_params
|
|
FROM template_version_presets tvp
|
|
LEFT JOIN template_version_preset_parameters tvpp ON tvpp.template_version_preset_id = tvp.id
|
|
LEFT JOIN provided_params pp ON pp.name = tvpp.name AND pp.value = tvpp.value
|
|
WHERE tvp.template_version_id = @template_version_id
|
|
GROUP BY tvp.id
|
|
)
|
|
SELECT pm.template_version_preset_id
|
|
FROM preset_matches pm
|
|
WHERE pm.total_preset_params = pm.matching_params -- All preset parameters must match
|
|
ORDER BY pm.total_preset_params DESC -- Return the preset with the most parameters
|
|
LIMIT 1;
|