feat: add prebuild timing metrics to Prometheus (#19503)

## Description

This PR introduces one counter and two histograms related to workspace
creation and claiming. The goal is to provide clearer observability into
how workspaces are created (regular vs prebuild) and the time cost of
those operations.

### `coderd_workspace_creation_total`

* Metric type: Counter
* Name: `coderd_workspace_creation_total`
* Labels: `organization_name`, `template_name`, `preset_name`

This counter tracks creations of regular workspaces (those not created
from a prebuild pool), broken down by whether a preset was used
(`preset_name` is empty when no preset was selected).
Currently, we already expose `coderd_prebuilt_workspaces_claimed_total`
for claimed prebuilt workspaces, but we lack a comparable metric for
regular workspace creations. This metric fills that gap, making it
possible to compare regular creations against claims.

Implementation notes:
* Exposed as a `coderd_` metric, consistent with other workspace-related
metrics (e.g. `coderd_api_workspace_latest_build`:
https://github.com/coder/coder/blob/main/coderd/prometheusmetrics/prometheusmetrics.go#L149).
* Every `defaultRefreshRate` (1 minute), the DB query
`GetRegularWorkspaceCreateMetrics` runs to count all regular
workspaces (not created from a prebuild pool).
* The counter is updated with the total from all time (not just since
metric introduction). This differs from the histograms below, which only
accumulate from their introduction forward.
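
The refresh behaviour described above (reset, then re-add all-time totals on every tick) can be sketched without any dependencies. This is only an illustration: `row`, `labelKey`, `counterVec`, and `refresh` are hypothetical stand-ins, not the types used in the PR.

```go
package main

import "fmt"

// row mirrors the shape of one aggregated query result (hypothetical type).
type row struct {
	Org, Template, Preset string
	Created               int64
}

// labelKey identifies one time series by its label values.
type labelKey struct{ Org, Template, Preset string }

// counterVec is a hypothetical, dependency-free stand-in for a labeled counter.
type counterVec map[labelKey]float64

// refresh clears every series and re-adds the all-time totals, so the exposed
// value always equals the latest query result instead of growing on each tick.
func refresh(c counterVec, rows []row) {
	for k := range c {
		delete(c, k)
	}
	for _, r := range rows {
		c[labelKey{r.Org, r.Template, r.Preset}] += float64(r.Created)
	}
}

func main() {
	c := counterVec{}
	rows := []row{{"acme", "docker", "", 3}, {"acme", "docker", "large", 2}}
	refresh(c, rows)
	refresh(c, rows) // a second tick must not double the totals
	fmt.Println(c[labelKey{"acme", "docker", ""}], c[labelKey{"acme", "docker", "large"}])
}
```

Resetting before adding is what keeps the value tied to the query result; without it, each tick would add the all-time total again.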

### `coderd_workspace_creation_duration_seconds` & `coderd_prebuilt_workspace_claim_duration_seconds`

* Metric types: Histogram
* Names:
  * `coderd_workspace_creation_duration_seconds`
    * Labels: `organization_name`, `template_name`, `preset_name`, `type`
      (`regular`, `prebuild`)
  * `coderd_prebuilt_workspace_claim_duration_seconds`
    * Labels: `organization_name`, `template_name`, `preset_name`

We already have `coderd_provisionerd_workspace_build_timings_seconds`,
which tracks build run times for all workspace builds handled by the
provisioner daemon.
However, in the context of this issue, we are only interested in
creation and claim build times, not all transitions. Additionally, that
metric does not include `preset_name`, and adding it there would
significantly increase cardinality. Therefore, separate, more focused
metrics are introduced here:
* `coderd_workspace_creation_duration_seconds`: Build time to create a
workspace, either a regular workspace or the initial provisioning build
that places a workspace into a prebuild pool.
* `coderd_prebuilt_workspace_claim_duration_seconds`: Time to claim a
prebuilt workspace from the pool.

The reason for two separate histograms is that:
* Creation (regular or prebuild): provisioning builds with similar time
magnitude, generally expected to take longer than a claim operation.
* Claim: expected to be a much faster provisioning build.
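
As a rough illustration of the split, the mapping from build kind to histogram and `type` label could look like the following. The metric names and label values come from this description; `buildKind`, `target`, and `route` are hypothetical names used only for this sketch.

```go
package main

import "fmt"

// buildKind is a hypothetical enum for the three build categories described above.
type buildKind int

const (
	regularCreation buildKind = iota
	prebuildCreation
	prebuildClaim
)

const (
	creationMetric = "coderd_workspace_creation_duration_seconds"
	claimMetric    = "coderd_prebuilt_workspace_claim_duration_seconds"
)

// target names the histogram an observation goes to and its `type` label
// value (empty for the claim histogram, which has no `type` label).
type target struct {
	Metric    string
	TypeLabel string
}

// route returns the target for a build kind; ok is false for unknown kinds.
func route(k buildKind) (target, bool) {
	switch k {
	case regularCreation:
		return target{creationMetric, "regular"}, true
	case prebuildCreation:
		return target{creationMetric, "prebuild"}, true
	case prebuildClaim:
		return target{claimMetric, ""}, true
	}
	return target{}, false
}

func main() {
	for _, k := range []buildKind{regularCreation, prebuildCreation, prebuildClaim} {
		got, _ := route(k)
		fmt.Printf("%s type=%q\n", got.Metric, got.TypeLabel)
	}
}
```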

#### Native histogram usage

Provisioning times vary widely between projects. Using static buckets
risks unbalanced or poorly informative histograms.
To address this, these metrics use [Prometheus native
histograms](https://prometheus.io/docs/specs/native_histograms/):
* First introduced in Prometheus v2.40.0
* Recommended stable usage from v2.45+
* Requires Go client `prometheus/client_golang` v1.15.0+
* Experimental and must be explicitly enabled on the server
(`--enable-feature=native-histograms`)
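
For a sense of the resolution involved: the diff in this commit sets `NativeHistogramBucketFactor: 1.1`. Native histogram bucket boundaries grow by a factor of the form 2^(2^-n), where n is the schema, so a requested factor is rounded to the largest such value that does not exceed it. The sketch below works that out with the standard library only; `growthFactor` and `pickSchema` are hypothetical helpers written for this note, and the schema range of -4 to 8 is an assumption based on the native histogram specification.

```go
package main

import (
	"fmt"
	"math"
)

// growthFactor returns the per-bucket growth factor 2^(2^-n) for schema n.
func growthFactor(schema int) float64 {
	return math.Pow(2, math.Pow(2, float64(-schema)))
}

// pickSchema returns the schema whose growth factor is the largest one that
// does not exceed the requested factor. The range -4..8 is an assumption
// taken from the native histogram specification.
func pickSchema(factor float64) int {
	for n := -4; n <= 8; n++ {
		if growthFactor(n) <= factor {
			return n
		}
	}
	return 8 // finest resolution available
}

func main() {
	n := pickSchema(1.1)
	fmt.Printf("schema=%d base=%.4f\n", n, growthFactor(n))
}
```

Under these assumptions a requested factor of 1.1 works out to schema 3, meaning each bucket is about 9% wider than the previous one.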

For compatibility, we also retain a classic bucket definition (aligned
with the existing provisioner metric:
https://github.com/coder/coder/blob/main/provisionerd/provisionerd.go#L182-L189).
* If native histograms are enabled, Prometheus ingests the
high-resolution histogram.
* If not, it falls back to the predefined buckets.
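
To make the classic fallback concrete, here is a small sketch of how observations land in the predefined creation buckets. The bucket bounds are copied from the diff in this commit; `cumulativeCounts` is a hypothetical helper that mimics the cumulative `le` semantics of classic histograms, not code from the PR.

```go
package main

import "fmt"

// creationBuckets copies the classic upper bounds (in seconds) from the diff.
var creationBuckets = []float64{1, 10, 30, 60, 300, 600, 1800, 3600}

// cumulativeCounts returns, for each upper bound, how many observations are
// less than or equal to it, plus a final +Inf entry holding the total count.
func cumulativeCounts(bounds, observations []float64) []int {
	counts := make([]int, len(bounds)+1)
	for _, v := range observations {
		for i, le := range bounds {
			if v <= le {
				counts[i]++
			}
		}
		counts[len(bounds)]++ // the +Inf bucket counts everything
	}
	return counts
}

func main() {
	// Four hypothetical build durations in seconds.
	obs := []float64{45, 75, 240, 900}
	fmt.Println(cumulativeCounts(creationBuckets, obs))
}
```

With these sample durations, two of the four builds fall between 60s and 300s and are indistinguishable in the classic histogram, which is the loss of detail that native histograms are meant to avoid.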

Implementation notes:
* Unlike the counter, these histograms are updated in real-time at
workspace build job completion.
* They reflect data only from the point of introduction forward (no
historical backfill).
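
The observed value is the job's run time, derived from its start and completion timestamps. A minimal sketch follows; `buildSeconds` and `at` are hypothetical helpers, and the validity and ordering checks are an assumption added for illustration (the diff computes the difference directly).

```go
package main

import (
	"database/sql"
	"fmt"
	"time"
)

// at builds a valid nullable timestamp from Unix seconds (helper for the sketch).
func at(unixSec int64) sql.NullTime {
	return sql.NullTime{Time: time.Unix(unixSec, 0), Valid: true}
}

// buildSeconds returns the job run time in seconds. It reports false when a
// timestamp is missing or the completion precedes the start; these guards are
// an assumption of this sketch, not behaviour taken from the PR.
func buildSeconds(startedAt, completedAt sql.NullTime) (float64, bool) {
	if !startedAt.Valid || !completedAt.Valid || completedAt.Time.Before(startedAt.Time) {
		return 0, false
	}
	return completedAt.Time.Sub(startedAt.Time).Seconds(), true
}

func main() {
	secs, ok := buildSeconds(at(100), at(190))
	fmt.Println(secs, ok)
}
```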

## Relates to 

Closes: https://github.com/coder/coder/issues/19528
Native histograms tested in observability stack:
https://github.com/coder/observability/pull/50
Susana Ferreira
2025-08-28 15:00:26 +01:00
committed by GitHub
parent 9fd33a7653
commit 0ab345ca84
21 changed files with 699 additions and 8 deletions
+12 -6
@@ -62,12 +62,6 @@ import (
"github.com/coder/serpent"
"github.com/coder/wgtunnel/tunnelsdk"
"github.com/coder/coder/v2/coderd/entitlements"
"github.com/coder/coder/v2/coderd/notifications/reports"
"github.com/coder/coder/v2/coderd/runtimeconfig"
"github.com/coder/coder/v2/coderd/webpush"
"github.com/coder/coder/v2/codersdk/drpcsdk"
"github.com/coder/coder/v2/buildinfo"
"github.com/coder/coder/v2/cli/clilog"
"github.com/coder/coder/v2/cli/cliui"
@@ -83,15 +77,19 @@ import (
"github.com/coder/coder/v2/coderd/database/migrations"
"github.com/coder/coder/v2/coderd/database/pubsub"
"github.com/coder/coder/v2/coderd/devtunnel"
"github.com/coder/coder/v2/coderd/entitlements"
"github.com/coder/coder/v2/coderd/externalauth"
"github.com/coder/coder/v2/coderd/gitsshkey"
"github.com/coder/coder/v2/coderd/httpmw"
"github.com/coder/coder/v2/coderd/jobreaper"
"github.com/coder/coder/v2/coderd/notifications"
"github.com/coder/coder/v2/coderd/notifications/reports"
"github.com/coder/coder/v2/coderd/oauthpki"
"github.com/coder/coder/v2/coderd/prometheusmetrics"
"github.com/coder/coder/v2/coderd/prometheusmetrics/insights"
"github.com/coder/coder/v2/coderd/promoauth"
"github.com/coder/coder/v2/coderd/provisionerdserver"
"github.com/coder/coder/v2/coderd/runtimeconfig"
"github.com/coder/coder/v2/coderd/schedule"
"github.com/coder/coder/v2/coderd/telemetry"
"github.com/coder/coder/v2/coderd/tracing"
@@ -99,9 +97,11 @@ import (
"github.com/coder/coder/v2/coderd/util/ptr"
"github.com/coder/coder/v2/coderd/util/slice"
stringutil "github.com/coder/coder/v2/coderd/util/strings"
"github.com/coder/coder/v2/coderd/webpush"
"github.com/coder/coder/v2/coderd/workspaceapps/appurl"
"github.com/coder/coder/v2/coderd/workspacestats"
"github.com/coder/coder/v2/codersdk"
"github.com/coder/coder/v2/codersdk/drpcsdk"
"github.com/coder/coder/v2/cryptorand"
"github.com/coder/coder/v2/provisioner/echo"
"github.com/coder/coder/v2/provisioner/terraform"
@@ -280,6 +280,12 @@ func enablePrometheus(
}
}
provisionerdserverMetrics := provisionerdserver.NewMetrics(logger)
if err := provisionerdserverMetrics.Register(options.PrometheusRegistry); err != nil {
return nil, xerrors.Errorf("failed to register provisionerd_server metrics: %w", err)
}
options.ProvisionerdServerMetrics = provisionerdserverMetrics
//nolint:revive
return ServeHandler(
ctx, logger, promhttp.InstrumentMetricHandler(
+3
@@ -241,6 +241,8 @@ type Options struct {
UpdateAgentMetrics func(ctx context.Context, labels prometheusmetrics.AgentMetricLabels, metrics []*agentproto.Stats_Metric)
StatsBatcher workspacestats.Batcher
ProvisionerdServerMetrics *provisionerdserver.Metrics
// WorkspaceAppAuditSessionTimeout allows changing the timeout for audit
// sessions. Raising or lowering this value will directly affect the write
// load of the audit log table. This is used for testing. Default 1 hour.
@@ -1930,6 +1932,7 @@ func (api *API) CreateInMemoryTaggedProvisionerDaemon(dialCtx context.Context, n
},
api.NotificationsEnqueuer,
&api.PrebuildsReconciler,
api.ProvisionerdServerMetrics,
)
if err != nil {
return nil, err
+3
@@ -184,6 +184,8 @@ type Options struct {
OIDCConvertKeyCache cryptokeys.SigningKeycache
Clock quartz.Clock
TelemetryReporter telemetry.Reporter
ProvisionerdServerMetrics *provisionerdserver.Metrics
}
// New constructs a codersdk client connected to an in-memory API instance.
@@ -604,6 +606,7 @@ func NewOptions(t testing.TB, options *Options) (func(http.Handler), context.Can
Clock: options.Clock,
AppEncryptionKeyCache: options.APIKeyEncryptionCache,
OIDCConvertKeyCache: options.OIDCConvertKeyCache,
ProvisionerdServerMetrics: options.ProvisionerdServerMetrics,
}
}
+7
@@ -2699,6 +2699,13 @@ func (q *querier) GetQuotaConsumedForUser(ctx context.Context, params database.G
return q.db.GetQuotaConsumedForUser(ctx, params)
}
func (q *querier) GetRegularWorkspaceCreateMetrics(ctx context.Context) ([]database.GetRegularWorkspaceCreateMetricsRow, error) {
if err := q.authorizeContext(ctx, policy.ActionRead, rbac.ResourceWorkspace.All()); err != nil {
return nil, err
}
return q.db.GetRegularWorkspaceCreateMetrics(ctx)
}
func (q *querier) GetReplicaByID(ctx context.Context, id uuid.UUID) (database.Replica, error) {
if err := q.authorizeContext(ctx, policy.ActionRead, rbac.ResourceSystem); err != nil {
return database.Replica{}, err
+4
@@ -2177,6 +2177,10 @@ func (s *MethodTestSuite) TestWorkspace() {
dbm.EXPECT().GetWorkspaceAgentDevcontainersByAgentID(gomock.Any(), agt.ID).Return([]database.WorkspaceAgentDevcontainer{d}, nil).AnyTimes()
check.Args(agt.ID).Asserts(w, policy.ActionRead).Returns([]database.WorkspaceAgentDevcontainer{d})
}))
s.Run("GetRegularWorkspaceCreateMetrics", s.Subtest(func(_ database.Store, check *expects) {
check.Args().
Asserts(rbac.ResourceWorkspace.All(), policy.ActionRead)
}))
}
func (s *MethodTestSuite) TestWorkspacePortSharing() {
@@ -1356,6 +1356,13 @@ func (m queryMetricsStore) GetQuotaConsumedForUser(ctx context.Context, ownerID
return consumed, err
}
func (m queryMetricsStore) GetRegularWorkspaceCreateMetrics(ctx context.Context) ([]database.GetRegularWorkspaceCreateMetricsRow, error) {
start := time.Now()
r0, r1 := m.s.GetRegularWorkspaceCreateMetrics(ctx)
m.queryLatencies.WithLabelValues("GetRegularWorkspaceCreateMetrics").Observe(time.Since(start).Seconds())
return r0, r1
}
func (m queryMetricsStore) GetReplicaByID(ctx context.Context, id uuid.UUID) (database.Replica, error) {
start := time.Now()
replica, err := m.s.GetReplicaByID(ctx, id)
+15
@@ -2851,6 +2851,21 @@ func (mr *MockStoreMockRecorder) GetQuotaConsumedForUser(ctx, arg any) *gomock.C
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "GetQuotaConsumedForUser", reflect.TypeOf((*MockStore)(nil).GetQuotaConsumedForUser), ctx, arg)
}
// GetRegularWorkspaceCreateMetrics mocks base method.
func (m *MockStore) GetRegularWorkspaceCreateMetrics(ctx context.Context) ([]database.GetRegularWorkspaceCreateMetricsRow, error) {
m.ctrl.T.Helper()
ret := m.ctrl.Call(m, "GetRegularWorkspaceCreateMetrics", ctx)
ret0, _ := ret[0].([]database.GetRegularWorkspaceCreateMetricsRow)
ret1, _ := ret[1].(error)
return ret0, ret1
}
// GetRegularWorkspaceCreateMetrics indicates an expected call of GetRegularWorkspaceCreateMetrics.
func (mr *MockStoreMockRecorder) GetRegularWorkspaceCreateMetrics(ctx any) *gomock.Call {
mr.mock.ctrl.T.Helper()
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "GetRegularWorkspaceCreateMetrics", reflect.TypeOf((*MockStore)(nil).GetRegularWorkspaceCreateMetrics), ctx)
}
// GetReplicaByID mocks base method.
func (m *MockStore) GetReplicaByID(ctx context.Context, id uuid.UUID) (database.Replica, error) {
m.ctrl.T.Helper()
+3
@@ -306,6 +306,9 @@ type sqlcQuerier interface {
GetProvisionerLogsAfterID(ctx context.Context, arg GetProvisionerLogsAfterIDParams) ([]ProvisionerJobLog, error)
GetQuotaAllowanceForUser(ctx context.Context, arg GetQuotaAllowanceForUserParams) (int64, error)
GetQuotaConsumedForUser(ctx context.Context, arg GetQuotaConsumedForUserParams) (int64, error)
// Count regular workspaces: only those whose first successful 'start' build
// was not initiated by the prebuild system user.
GetRegularWorkspaceCreateMetrics(ctx context.Context) ([]GetRegularWorkspaceCreateMetricsRow, error)
GetReplicaByID(ctx context.Context, id uuid.UUID) (Replica, error)
GetReplicasUpdatedAfter(ctx context.Context, updatedAt time.Time) ([]Replica, error)
GetRunningPrebuiltWorkspaces(ctx context.Context) ([]GetRunningPrebuiltWorkspacesRow, error)
+70 -1
@@ -7309,7 +7309,7 @@ const getPrebuildMetrics = `-- name: GetPrebuildMetrics :many
SELECT
t.name as template_name,
tvp.name as preset_name,
o.name as organization_name,
o.name as organization_name,
COUNT(*) as created_count,
COUNT(*) FILTER (WHERE pj.job_status = 'failed'::provisioner_job_status) as failed_count,
COUNT(*) FILTER (
@@ -20131,6 +20131,75 @@ func (q *sqlQuerier) GetDeploymentWorkspaceStats(ctx context.Context) (GetDeploy
return i, err
}
const getRegularWorkspaceCreateMetrics = `-- name: GetRegularWorkspaceCreateMetrics :many
WITH first_success_build AS (
-- Earliest successful 'start' build per workspace
SELECT DISTINCT ON (wb.workspace_id)
wb.workspace_id,
wb.template_version_preset_id,
wb.initiator_id
FROM workspace_builds wb
JOIN provisioner_jobs pj ON pj.id = wb.job_id
WHERE
wb.transition = 'start'::workspace_transition
AND pj.job_status = 'succeeded'::provisioner_job_status
ORDER BY wb.workspace_id, wb.build_number, wb.id
)
SELECT
t.name AS template_name,
COALESCE(tvp.name, '') AS preset_name,
o.name AS organization_name,
COUNT(*) AS created_count
FROM first_success_build fsb
JOIN workspaces w ON w.id = fsb.workspace_id
JOIN templates t ON t.id = w.template_id
LEFT JOIN template_version_presets tvp ON tvp.id = fsb.template_version_preset_id
JOIN organizations o ON o.id = w.organization_id
WHERE
NOT t.deleted
-- Exclude workspaces whose first successful start was the prebuilds system user
AND fsb.initiator_id != 'c42fdf75-3097-471c-8c33-fb52454d81c0'::uuid
GROUP BY t.name, COALESCE(tvp.name, ''), o.name
ORDER BY t.name, preset_name, o.name
`
type GetRegularWorkspaceCreateMetricsRow struct {
TemplateName string `db:"template_name" json:"template_name"`
PresetName string `db:"preset_name" json:"preset_name"`
OrganizationName string `db:"organization_name" json:"organization_name"`
CreatedCount int64 `db:"created_count" json:"created_count"`
}
// Count regular workspaces: only those whose first successful 'start' build
// was not initiated by the prebuild system user.
func (q *sqlQuerier) GetRegularWorkspaceCreateMetrics(ctx context.Context) ([]GetRegularWorkspaceCreateMetricsRow, error) {
rows, err := q.db.QueryContext(ctx, getRegularWorkspaceCreateMetrics)
if err != nil {
return nil, err
}
defer rows.Close()
var items []GetRegularWorkspaceCreateMetricsRow
for rows.Next() {
var i GetRegularWorkspaceCreateMetricsRow
if err := rows.Scan(
&i.TemplateName,
&i.PresetName,
&i.OrganizationName,
&i.CreatedCount,
); err != nil {
return nil, err
}
items = append(items, i)
}
if err := rows.Close(); err != nil {
return nil, err
}
if err := rows.Err(); err != nil {
return nil, err
}
return items, nil
}
const getWorkspaceACLByID = `-- name: GetWorkspaceACLByID :one
SELECT
group_acl as groups,
+1 -1
@@ -230,7 +230,7 @@ HAVING COUNT(*) = @hard_limit::bigint;
SELECT
t.name as template_name,
tvp.name as preset_name,
o.name as organization_name,
o.name as organization_name,
COUNT(*) as created_count,
COUNT(*) FILTER (WHERE pj.job_status = 'failed'::provisioner_job_status) as failed_count,
COUNT(*) FILTER (
+33
@@ -923,3 +923,36 @@ SET
user_acl = @user_acl
WHERE
id = @id;
-- name: GetRegularWorkspaceCreateMetrics :many
-- Count regular workspaces: only those whose first successful 'start' build
-- was not initiated by the prebuild system user.
WITH first_success_build AS (
-- Earliest successful 'start' build per workspace
SELECT DISTINCT ON (wb.workspace_id)
wb.workspace_id,
wb.template_version_preset_id,
wb.initiator_id
FROM workspace_builds wb
JOIN provisioner_jobs pj ON pj.id = wb.job_id
WHERE
wb.transition = 'start'::workspace_transition
AND pj.job_status = 'succeeded'::provisioner_job_status
ORDER BY wb.workspace_id, wb.build_number, wb.id
)
SELECT
t.name AS template_name,
COALESCE(tvp.name, '') AS preset_name,
o.name AS organization_name,
COUNT(*) AS created_count
FROM first_success_build fsb
JOIN workspaces w ON w.id = fsb.workspace_id
JOIN templates t ON t.id = w.template_id
LEFT JOIN template_version_presets tvp ON tvp.id = fsb.template_version_preset_id
JOIN organizations o ON o.id = w.organization_id
WHERE
NOT t.deleted
-- Exclude workspaces whose first successful start was the prebuilds system user
AND fsb.initiator_id != 'c42fdf75-3097-471c-8c33-fb52454d81c0'::uuid
GROUP BY t.name, COALESCE(tvp.name, ''), o.name
ORDER BY t.name, preset_name, o.name;
@@ -165,6 +165,18 @@ func Workspaces(ctx context.Context, logger slog.Logger, registerer prometheus.R
return nil, err
}
workspaceCreationTotal := prometheus.NewCounterVec(
prometheus.CounterOpts{
Namespace: "coderd",
Name: "workspace_creation_total",
Help: "Total regular (non-prebuilt) workspace creations by organization, template, and preset.",
},
[]string{"organization_name", "template_name", "preset_name"},
)
if err := registerer.Register(workspaceCreationTotal); err != nil {
return nil, err
}
ctx, cancelFunc := context.WithCancel(ctx)
done := make(chan struct{})
@@ -200,6 +212,27 @@ func Workspaces(ctx context.Context, logger slog.Logger, registerer prometheus.R
string(w.LatestBuildTransition),
).Add(1)
}
// Update regular workspaces (without a prebuild transition) creation counter
regularWorkspaces, err := db.GetRegularWorkspaceCreateMetrics(ctx)
if err != nil {
if errors.Is(err, sql.ErrNoRows) {
workspaceCreationTotal.Reset()
} else {
logger.Warn(ctx, "failed to load regular workspaces for metrics", slog.Error(err))
}
return
}
workspaceCreationTotal.Reset()
for _, regularWorkspace := range regularWorkspaces {
workspaceCreationTotal.WithLabelValues(
regularWorkspace.OrganizationName,
regularWorkspace.TemplateName,
regularWorkspace.PresetName,
).Add(float64(regularWorkspace.CreatedCount))
}
}
// Use time.Nanosecond to force an initial tick. It will be reset to the
@@ -424,6 +424,107 @@ func TestWorkspaceLatestBuildStatuses(t *testing.T) {
}
}
func TestWorkspaceCreationTotal(t *testing.T) {
t.Parallel()
for _, tc := range []struct {
Name string
Database func() database.Store
ExpectedWorkspaces int
}{
{
Name: "None",
Database: func() database.Store {
db, _ := dbtestutil.NewDB(t)
return db
},
ExpectedWorkspaces: 0,
},
{
// Should count only the successfully created workspaces
Name: "Multiple",
Database: func() database.Store {
db, _ := dbtestutil.NewDB(t)
u := dbgen.User(t, db, database.User{})
org := dbgen.Organization(t, db, database.Organization{})
insertTemplates(t, db, u, org)
insertCanceled(t, db, u, org)
insertFailed(t, db, u, org)
insertFailed(t, db, u, org)
insertSuccess(t, db, u, org)
insertSuccess(t, db, u, org)
insertSuccess(t, db, u, org)
insertRunning(t, db, u, org)
return db
},
ExpectedWorkspaces: 3,
},
{
// Should not include prebuilt workspaces
Name: "MultipleWithPrebuild",
Database: func() database.Store {
ctx := context.Background()
db, _ := dbtestutil.NewDB(t)
u := dbgen.User(t, db, database.User{})
prebuildUser, err := db.GetUserByID(ctx, database.PrebuildsSystemUserID)
require.NoError(t, err)
org := dbgen.Organization(t, db, database.Organization{})
insertTemplates(t, db, u, org)
insertCanceled(t, db, u, org)
insertFailed(t, db, u, org)
insertSuccess(t, db, u, org)
insertSuccess(t, db, prebuildUser, org)
insertRunning(t, db, u, org)
return db
},
ExpectedWorkspaces: 1,
},
{
// Should include deleted workspaces
Name: "MultipleWithDeleted",
Database: func() database.Store {
db, _ := dbtestutil.NewDB(t)
u := dbgen.User(t, db, database.User{})
org := dbgen.Organization(t, db, database.Organization{})
insertTemplates(t, db, u, org)
insertCanceled(t, db, u, org)
insertFailed(t, db, u, org)
insertSuccess(t, db, u, org)
insertRunning(t, db, u, org)
insertDeleted(t, db, u, org)
return db
},
ExpectedWorkspaces: 2,
},
} {
t.Run(tc.Name, func(t *testing.T) {
t.Parallel()
registry := prometheus.NewRegistry()
closeFunc, err := prometheusmetrics.Workspaces(context.Background(), testutil.Logger(t), registry, tc.Database(), testutil.IntervalFast)
require.NoError(t, err)
t.Cleanup(closeFunc)
require.Eventually(t, func() bool {
metrics, err := registry.Gather()
assert.NoError(t, err)
sum := 0
for _, m := range metrics {
if m.GetName() != "coderd_workspace_creation_total" {
continue
}
for _, metric := range m.Metric {
sum += int(metric.GetCounter().GetValue())
}
}
t.Logf("count = %d, expected == %d", sum, tc.ExpectedWorkspaces)
return sum == tc.ExpectedWorkspaces
}, testutil.WaitShort, testutil.IntervalFast)
})
}
}
func TestAgents(t *testing.T) {
t.Parallel()
@@ -897,6 +998,7 @@ func insertRunning(t *testing.T, db database.Store, u database.User, org databas
Transition: database.WorkspaceTransitionStart,
Reason: database.BuildReasonInitiator,
TemplateVersionID: templateVersionID,
InitiatorID: u.ID,
})
require.NoError(t, err)
// This marks the job as started.
+177
@@ -0,0 +1,177 @@
package provisionerdserver
import (
"context"
"time"
"github.com/prometheus/client_golang/prometheus"
"cdr.dev/slog"
)
type Metrics struct {
logger slog.Logger
workspaceCreationTimings *prometheus.HistogramVec
workspaceClaimTimings *prometheus.HistogramVec
}
type WorkspaceTimingType int
const (
Unsupported WorkspaceTimingType = iota
WorkspaceCreation
PrebuildCreation
PrebuildClaim
)
const (
workspaceTypeRegular = "regular"
workspaceTypePrebuild = "prebuild"
)
type WorkspaceTimingFlags struct {
IsPrebuild bool
IsClaim bool
IsFirstBuild bool
}
func NewMetrics(logger slog.Logger) *Metrics {
log := logger.Named("provisionerd_server_metrics")
return &Metrics{
logger: log,
workspaceCreationTimings: prometheus.NewHistogramVec(prometheus.HistogramOpts{
Namespace: "coderd",
Name: "workspace_creation_duration_seconds",
Help: "Time to create a workspace by organization, template, preset, and type (regular or prebuild).",
Buckets: []float64{
1, // 1s
10,
30,
60, // 1min
60 * 5,
60 * 10,
60 * 30, // 30min
60 * 60, // 1hr
},
NativeHistogramBucketFactor: 1.1,
// Max number of native buckets kept at once to bound memory.
NativeHistogramMaxBucketNumber: 100,
// Minimum time between histogram resets when the bucket limit is exceeded.
NativeHistogramMinResetDuration: time.Hour,
// 0 selects the client library's default zero threshold; with a max of 0
// the zero bucket is never widened.
NativeHistogramZeroThreshold: 0,
NativeHistogramMaxZeroThreshold: 0,
}, []string{"organization_name", "template_name", "preset_name", "type"}),
workspaceClaimTimings: prometheus.NewHistogramVec(prometheus.HistogramOpts{
Namespace: "coderd",
Name: "prebuilt_workspace_claim_duration_seconds",
Help: "Time to claim a prebuilt workspace by organization, template, and preset.",
// Higher resolution between 1m and 5m to show typical prebuild claim times.
// Cap at 5m since longer claims diminish prebuild value.
Buckets: []float64{
1, // 1s
5,
10,
20,
30,
60, // 1m
120, // 2m
180, // 3m
240, // 4m
300, // 5m
},
NativeHistogramBucketFactor: 1.1,
// Max number of native buckets kept at once to bound memory.
NativeHistogramMaxBucketNumber: 100,
// Minimum time between histogram resets when the bucket limit is exceeded.
NativeHistogramMinResetDuration: time.Hour,
// 0 selects the client library's default zero threshold; with a max of 0
// the zero bucket is never widened.
NativeHistogramZeroThreshold: 0,
NativeHistogramMaxZeroThreshold: 0,
}, []string{"organization_name", "template_name", "preset_name"}),
}
}
func (m *Metrics) Register(reg prometheus.Registerer) error {
if err := reg.Register(m.workspaceCreationTimings); err != nil {
return err
}
return reg.Register(m.workspaceClaimTimings)
}
func (f WorkspaceTimingFlags) count() int {
count := 0
if f.IsPrebuild {
count++
}
if f.IsClaim {
count++
}
if f.IsFirstBuild {
count++
}
return count
}
// getWorkspaceTimingType returns the type of the workspace build:
// - isPrebuild: if the workspace build corresponds to the creation of a prebuilt workspace
// - isClaim: if the workspace build corresponds to the claim of a prebuilt workspace
// - isWorkspaceFirstBuild: if the workspace build corresponds to the creation of a regular workspace
// (not created from the prebuild pool)
func getWorkspaceTimingType(flags WorkspaceTimingFlags) WorkspaceTimingType {
switch {
case flags.IsPrebuild:
return PrebuildCreation
case flags.IsClaim:
return PrebuildClaim
case flags.IsFirstBuild:
return WorkspaceCreation
default:
return Unsupported
}
}
// UpdateWorkspaceTimingsMetrics updates the workspace timing metrics based on the workspace build type
func (m *Metrics) UpdateWorkspaceTimingsMetrics(
ctx context.Context,
flags WorkspaceTimingFlags,
organizationName string,
templateName string,
presetName string,
buildTime float64,
) {
m.logger.Debug(ctx, "update workspace timings metrics",
"organizationName", organizationName,
"templateName", templateName,
"presetName", presetName,
"isPrebuild", flags.IsPrebuild,
"isClaim", flags.IsClaim,
"isWorkspaceFirstBuild", flags.IsFirstBuild)
if flags.count() > 1 {
m.logger.Warn(ctx, "invalid workspace timing flags",
"isPrebuild", flags.IsPrebuild,
"isClaim", flags.IsClaim,
"isWorkspaceFirstBuild", flags.IsFirstBuild)
return
}
workspaceTimingType := getWorkspaceTimingType(flags)
switch workspaceTimingType {
case WorkspaceCreation:
// Regular workspace creation (without prebuild pool)
m.workspaceCreationTimings.
WithLabelValues(organizationName, templateName, presetName, workspaceTypeRegular).Observe(buildTime)
case PrebuildCreation:
// Prebuilt workspace creation duration
m.workspaceCreationTimings.
WithLabelValues(organizationName, templateName, presetName, workspaceTypePrebuild).Observe(buildTime)
case PrebuildClaim:
// Prebuilt workspace claim duration
m.workspaceClaimTimings.
WithLabelValues(organizationName, templateName, presetName).Observe(buildTime)
default:
m.logger.Warn(ctx, "unsupported workspace timing flags")
}
}
@@ -129,6 +129,8 @@ type server struct {
heartbeatInterval time.Duration
heartbeatFn func(ctx context.Context) error
metrics *Metrics
}
// We use the null byte (0x00) in generating a canonical map key for tags, so
@@ -178,6 +180,7 @@ func NewServer(
options Options,
enqueuer notifications.Enqueuer,
prebuildsOrchestrator *atomic.Pointer[prebuilds.ReconciliationOrchestrator],
metrics *Metrics,
) (proto.DRPCProvisionerDaemonServer, error) {
// Fail-fast if pointers are nil
if lifecycleCtx == nil {
@@ -248,6 +251,7 @@ func NewServer(
heartbeatFn: options.HeartbeatFn,
PrebuildsOrchestrator: prebuildsOrchestrator,
UsageInserter: usageInserter,
metrics: metrics,
}
if s.heartbeatFn == nil {
@@ -2281,6 +2285,50 @@ func (s *server) completeWorkspaceBuildJob(ctx context.Context, job database.Pro
}
}
// Update workspace (regular and prebuild) timing metrics
if s.metrics != nil {
// Only consider 'start' workspace builds
if workspaceBuild.Transition == database.WorkspaceTransitionStart {
// Get the updated job to report the metrics with correct data
updatedJob, err := s.Database.GetProvisionerJobByID(ctx, jobID)
if err != nil {
s.Logger.Error(ctx, "get updated job from database", slog.Error(err))
} else
// Only consider 'succeeded' provisioner jobs
if updatedJob.JobStatus == database.ProvisionerJobStatusSucceeded {
presetName := ""
if workspaceBuild.TemplateVersionPresetID.Valid {
preset, err := s.Database.GetPresetByID(ctx, workspaceBuild.TemplateVersionPresetID.UUID)
if err != nil {
if !errors.Is(err, sql.ErrNoRows) {
s.Logger.Error(ctx, "get preset by ID for workspace timing metrics", slog.Error(err))
}
} else {
presetName = preset.Name
}
}
buildTime := updatedJob.CompletedAt.Time.Sub(updatedJob.StartedAt.Time).Seconds()
s.metrics.UpdateWorkspaceTimingsMetrics(
ctx,
WorkspaceTimingFlags{
// Is a prebuilt workspace creation build
IsPrebuild: input.PrebuiltWorkspaceBuildStage.IsPrebuild(),
// Is a prebuilt workspace claim build
IsClaim: input.PrebuiltWorkspaceBuildStage.IsPrebuiltWorkspaceClaim(),
// Is a regular workspace creation build
// Only consider the first build number for regular workspaces
IsFirstBuild: workspaceBuild.BuildNumber == 1,
},
workspace.OrganizationName,
workspace.TemplateName,
presetName,
buildTime,
)
}
}
}
msg, err := json.Marshal(wspubsub.WorkspaceEvent{
Kind: wspubsub.WorkspaceEventKindStateChange,
WorkspaceID: workspace.ID,
@@ -4144,6 +4144,7 @@ func setup(t *testing.T, ignoreLogErrors bool, ov *overrides) (proto.DRPCProvisi
},
notifEnq,
&op,
provisionerdserver.NewMetrics(logger),
)
require.NoError(t, err)
return srv, db, ps, daemon
+19
@@ -143,9 +143,12 @@ deployment. They will always be available from the agent.
| `coderd_oauth2_external_requests_rate_limit_total` | gauge | DEPRECATED: use coderd_oauth2_external_requests_rate_limit instead | `name` `resource` |
| `coderd_oauth2_external_requests_rate_limit_used` | gauge | The number of requests made in this interval. | `name` `resource` |
| `coderd_oauth2_external_requests_total` | counter | The total number of api calls made to external oauth2 providers. 'status_code' will be 0 if the request failed with no response. | `name` `source` `status_code` |
| `coderd_prebuilt_workspace_claim_duration_seconds` | histogram | Time to claim a prebuilt workspace by organization, template, and preset. | `organization_name` `preset_name` `template_name` |
| `coderd_provisionerd_job_timings_seconds` | histogram | The provisioner job time duration in seconds. | `provisioner` `status` |
| `coderd_provisionerd_jobs_current` | gauge | The number of currently running provisioner jobs. | `provisioner` |
| `coderd_workspace_builds_total` | counter | The number of workspaces started, updated, or deleted. | `action` `owner_email` `status` `template_name` `template_version` `workspace_name` |
| `coderd_workspace_creation_duration_seconds` | histogram | Time to create a workspace by organization, template, preset, and type (regular or prebuild). | `organization_name` `preset_name` `template_name` `type` |
| `coderd_workspace_creation_total` | counter | Total regular (non-prebuilt) workspace creations by organization, template, and preset. | `organization_name` `preset_name` `template_name` |
| `coderd_workspace_latest_build_status` | gauge | The current workspace statuses by template, transition, and owner. | `status` `template_name` `template_version` `workspace_owner` `workspace_transition` |
| `go_gc_duration_seconds` | summary | A summary of the pause duration of garbage collection cycles. | |
| `go_goroutines` | gauge | Number of goroutines that currently exist. | |
@@ -185,3 +188,19 @@ deployment. They will always be available from the agent.
| `promhttp_metric_handler_requests_total` | counter | Total number of scrapes by HTTP status code. | `code` |
<!-- End generated by 'make docs/admin/integrations/prometheus.md'. -->
### Note on Prometheus native histogram support
The following metrics support native histograms:
* `coderd_workspace_creation_duration_seconds`
* `coderd_prebuilt_workspace_claim_duration_seconds`
Native histograms are an **experimental** Prometheus feature that removes the need to predefine bucket boundaries and allows higher-resolution buckets that adapt to deployment characteristics.
Whether a metric is exposed as classic or native depends entirely on the Prometheus server configuration (see [Prometheus docs](https://prometheus.io/docs/specs/native_histograms/) for details):
* If native histograms are enabled, Prometheus ingests the high-resolution histogram.
* If not, it falls back to the predefined buckets.
⚠️ Important: classic and native histograms cannot be aggregated together. If Prometheus is switched from classic to native at a certain point in time, dashboards may need to account for that transition.
For this reason, it's recommended to follow [Prometheus migration guidelines](https://prometheus.io/docs/specs/native_histograms/#migration-considerations) when moving from classic to native histograms.
@@ -300,6 +300,7 @@ Coder provides several metrics to monitor your prebuilt workspaces:
- `coderd_prebuilt_workspaces_desired` (gauge): Target number of prebuilt workspaces that should be available.
- `coderd_prebuilt_workspaces_running` (gauge): Current number of prebuilt workspaces in a `running` state.
- `coderd_prebuilt_workspaces_eligible` (gauge): Current number of prebuilt workspaces eligible to be claimed.
- `coderd_prebuilt_workspace_claim_duration_seconds` ([_native histogram_](https://prometheus.io/docs/specs/native_histograms) support): Time to claim a prebuilt workspace from the prebuild pool.
#### Logs
@@ -361,6 +361,7 @@ func (api *API) provisionerDaemonServe(rw http.ResponseWriter, r *http.Request)
		},
		api.NotificationsEnqueuer,
		&api.AGPL.PrebuildsReconciler,
		api.ProvisionerdServerMetrics,
	)
	if err != nil {
		if !xerrors.Is(err, context.Canceled) {
@@ -26,6 +26,7 @@ import (
	"github.com/coder/coder/v2/coderd/audit"
	"github.com/coder/coder/v2/coderd/autobuild"
	"github.com/coder/coder/v2/coderd/coderdtest"
	"github.com/coder/coder/v2/coderd/coderdtest/promhelp"
	"github.com/coder/coder/v2/coderd/database"
	"github.com/coder/coder/v2/coderd/database/dbauthz"
	"github.com/coder/coder/v2/coderd/database/dbfake"
@@ -2873,6 +2874,133 @@ func TestPrebuildActivityBump(t *testing.T) {
	require.Zero(t, workspace.LatestBuild.MaxDeadline)
}

func TestWorkspaceProvisionerdServerMetrics(t *testing.T) {
	t.Parallel()

	// Setup
	log := testutil.Logger(t)
	reg := prometheus.NewRegistry()
	provisionerdserverMetrics := provisionerdserver.NewMetrics(log)
	err := provisionerdserverMetrics.Register(reg)
	require.NoError(t, err)
	client, db, owner := coderdenttest.NewWithDatabase(t, &coderdenttest.Options{
		Options: &coderdtest.Options{
			IncludeProvisionerDaemon:  true,
			ProvisionerdServerMetrics: provisionerdserverMetrics,
		},
		LicenseOptions: &coderdenttest.LicenseOptions{
			Features: license.Features{
				codersdk.FeatureWorkspacePrebuilds: 1,
			},
		},
	})

	// Given: a template and a template version with a preset without prebuild instances
	presetNoPrebuildID := uuid.New()
	versionNoPrebuild := coderdtest.CreateTemplateVersion(t, client, owner.OrganizationID, nil)
	_ = coderdtest.AwaitTemplateVersionJobCompleted(t, client, versionNoPrebuild.ID)
	templateNoPrebuild := coderdtest.CreateTemplate(t, client, owner.OrganizationID, versionNoPrebuild.ID)
	presetNoPrebuild := dbgen.Preset(t, db, database.InsertPresetParams{
		ID:                presetNoPrebuildID,
		TemplateVersionID: versionNoPrebuild.ID,
	})

	// Given: a template and a template version with a preset with a prebuild instance
	presetPrebuildID := uuid.New()
	versionPrebuild := coderdtest.CreateTemplateVersion(t, client, owner.OrganizationID, nil)
	_ = coderdtest.AwaitTemplateVersionJobCompleted(t, client, versionPrebuild.ID)
	templatePrebuild := coderdtest.CreateTemplate(t, client, owner.OrganizationID, versionPrebuild.ID)
	presetPrebuild := dbgen.Preset(t, db, database.InsertPresetParams{
		ID:                presetPrebuildID,
		TemplateVersionID: versionPrebuild.ID,
		DesiredInstances:  sql.NullInt32{Int32: 1, Valid: true},
	})

	// Given: a prebuild workspace
	wb := dbfake.WorkspaceBuild(t, db, database.WorkspaceTable{
		OwnerID:    database.PrebuildsSystemUserID,
		TemplateID: templatePrebuild.ID,
	}).Seed(database.WorkspaceBuild{
		TemplateVersionID: versionPrebuild.ID,
		TemplateVersionPresetID: uuid.NullUUID{
			UUID:  presetPrebuildID,
			Valid: true,
		},
	}).WithAgent(func(agent []*proto.Agent) []*proto.Agent {
		return agent
	}).Do()

	// Mark the prebuilt workspace's agent as ready so the prebuild can be claimed
	// nolint:gocritic
	ctx := dbauthz.AsSystemRestricted(testutil.Context(t, testutil.WaitLong))
	agent, err := db.GetWorkspaceAgentAndLatestBuildByAuthToken(ctx, uuid.MustParse(wb.AgentToken))
	require.NoError(t, err)
	err = db.UpdateWorkspaceAgentLifecycleStateByID(ctx, database.UpdateWorkspaceAgentLifecycleStateByIDParams{
		ID:             agent.WorkspaceAgent.ID,
		LifecycleState: database.WorkspaceAgentLifecycleStateReady,
	})
	require.NoError(t, err)

	organizationName, err := client.Organization(ctx, owner.OrganizationID)
	require.NoError(t, err)
	user, err := client.User(ctx, "testUser")
	require.NoError(t, err)

	// Given: no histogram value for prebuilt workspaces claim
	prebuiltWorkspaceHistogramMetric := promhelp.MetricValue(t, reg, "coderd_prebuilt_workspace_claim_duration_seconds", prometheus.Labels{
		"organization_name": organizationName.Name,
		"template_name":     templatePrebuild.Name,
		"preset_name":       presetPrebuild.Name,
	})
	require.Nil(t, prebuiltWorkspaceHistogramMetric)

	// Given: the prebuilt workspace is claimed by a user
	claimedWorkspace, err := client.CreateUserWorkspace(ctx, user.ID.String(), codersdk.CreateWorkspaceRequest{
		TemplateVersionID:       versionPrebuild.ID,
		TemplateVersionPresetID: presetPrebuildID,
		Name:                    coderdtest.RandomUsername(t),
	})
	require.NoError(t, err)
	coderdtest.AwaitWorkspaceBuildJobCompleted(t, client, claimedWorkspace.LatestBuild.ID)
	require.Equal(t, wb.Workspace.ID, claimedWorkspace.ID)

	// Then: the histogram value for prebuilt workspace claim should be updated
	prebuiltWorkspaceHistogram := promhelp.HistogramValue(t, reg, "coderd_prebuilt_workspace_claim_duration_seconds", prometheus.Labels{
		"organization_name": organizationName.Name,
		"template_name":     templatePrebuild.Name,
		"preset_name":       presetPrebuild.Name,
	})
	require.NotNil(t, prebuiltWorkspaceHistogram)
	require.Equal(t, uint64(1), prebuiltWorkspaceHistogram.GetSampleCount())

	// Given: no histogram value for regular workspaces creation
	regularWorkspaceHistogramMetric := promhelp.MetricValue(t, reg, "coderd_workspace_creation_duration_seconds", prometheus.Labels{
		"organization_name": organizationName.Name,
		"template_name":     templateNoPrebuild.Name,
		"preset_name":       presetNoPrebuild.Name,
		"type":              "regular",
	})
	require.Nil(t, regularWorkspaceHistogramMetric)

	// Given: a user creates a regular workspace (without prebuild pool)
	regularWorkspace, err := client.CreateUserWorkspace(ctx, user.ID.String(), codersdk.CreateWorkspaceRequest{
		TemplateVersionID:       versionNoPrebuild.ID,
		TemplateVersionPresetID: presetNoPrebuildID,
		Name:                    coderdtest.RandomUsername(t),
	})
	require.NoError(t, err)
	coderdtest.AwaitWorkspaceBuildJobCompleted(t, client, regularWorkspace.LatestBuild.ID)

	// Then: the histogram value for regular workspace creation should be updated
	regularWorkspaceHistogram := promhelp.HistogramValue(t, reg, "coderd_workspace_creation_duration_seconds", prometheus.Labels{
		"organization_name": organizationName.Name,
		"template_name":     templateNoPrebuild.Name,
		"preset_name":       presetNoPrebuild.Name,
		"type":              "regular",
	})
	require.NotNil(t, regularWorkspaceHistogram)
	require.Equal(t, uint64(1), regularWorkspaceHistogram.GetSampleCount())
}
// TestWorkspaceTemplateParamsChange tests a workspace with a parameter that
// validation changes on apply. The params used in create workspace are invalid
// according to the static params on import.
@@ -715,6 +715,37 @@ coderd_workspace_latest_build_status{status="failed",template_name="docker",temp
coderd_workspace_builds_total{action="START",owner_email="admin@coder.com",status="failed",template_name="docker",template_version="gallant_wright0",workspace_name="test1"} 1
coderd_workspace_builds_total{action="START",owner_email="admin@coder.com",status="success",template_name="docker",template_version="gallant_wright0",workspace_name="test1"} 1
coderd_workspace_builds_total{action="STOP",owner_email="admin@coder.com",status="success",template_name="docker",template_version="gallant_wright0",workspace_name="test1"} 1
# HELP coderd_workspace_creation_total Total regular (non-prebuilt) workspace creations by organization, template, and preset.
# TYPE coderd_workspace_creation_total counter
coderd_workspace_creation_total{organization_name="{organization}",preset_name="",template_name="docker"} 1
# HELP coderd_workspace_creation_duration_seconds Time to create a workspace by organization, template, preset, and type (regular or prebuild).
# TYPE coderd_workspace_creation_duration_seconds histogram
coderd_workspace_creation_duration_seconds_bucket{organization_name="{organization}",preset_name="Falkenstein",template_name="docker",type="prebuild",le="1"} 0
coderd_workspace_creation_duration_seconds_bucket{organization_name="{organization}",preset_name="Falkenstein",template_name="docker",type="prebuild",le="10"} 1
coderd_workspace_creation_duration_seconds_bucket{organization_name="{organization}",preset_name="Falkenstein",template_name="docker",type="prebuild",le="30"} 1
coderd_workspace_creation_duration_seconds_bucket{organization_name="{organization}",preset_name="Falkenstein",template_name="docker",type="prebuild",le="60"} 1
coderd_workspace_creation_duration_seconds_bucket{organization_name="{organization}",preset_name="Falkenstein",template_name="docker",type="prebuild",le="300"} 1
coderd_workspace_creation_duration_seconds_bucket{organization_name="{organization}",preset_name="Falkenstein",template_name="docker",type="prebuild",le="600"} 1
coderd_workspace_creation_duration_seconds_bucket{organization_name="{organization}",preset_name="Falkenstein",template_name="docker",type="prebuild",le="1800"} 1
coderd_workspace_creation_duration_seconds_bucket{organization_name="{organization}",preset_name="Falkenstein",template_name="docker",type="prebuild",le="3600"} 1
coderd_workspace_creation_duration_seconds_bucket{organization_name="{organization}",preset_name="Falkenstein",template_name="docker",type="prebuild",le="+Inf"} 1
coderd_workspace_creation_duration_seconds_sum{organization_name="{organization}",preset_name="Falkenstein",template_name="docker",type="prebuild"} 4.406214
coderd_workspace_creation_duration_seconds_count{organization_name="{organization}",preset_name="Falkenstein",template_name="docker",type="prebuild"} 1
# HELP coderd_prebuilt_workspace_claim_duration_seconds Time to claim a prebuilt workspace by organization, template, and preset.
# TYPE coderd_prebuilt_workspace_claim_duration_seconds histogram
coderd_prebuilt_workspace_claim_duration_seconds_bucket{organization_name="{organization}",preset_name="Falkenstein",template_name="docker",le="1"} 0
coderd_prebuilt_workspace_claim_duration_seconds_bucket{organization_name="{organization}",preset_name="Falkenstein",template_name="docker",le="5"} 1
coderd_prebuilt_workspace_claim_duration_seconds_bucket{organization_name="{organization}",preset_name="Falkenstein",template_name="docker",le="10"} 1
coderd_prebuilt_workspace_claim_duration_seconds_bucket{organization_name="{organization}",preset_name="Falkenstein",template_name="docker",le="20"} 1
coderd_prebuilt_workspace_claim_duration_seconds_bucket{organization_name="{organization}",preset_name="Falkenstein",template_name="docker",le="30"} 1
coderd_prebuilt_workspace_claim_duration_seconds_bucket{organization_name="{organization}",preset_name="Falkenstein",template_name="docker",le="60"} 1
coderd_prebuilt_workspace_claim_duration_seconds_bucket{organization_name="{organization}",preset_name="Falkenstein",template_name="docker",le="120"} 1
coderd_prebuilt_workspace_claim_duration_seconds_bucket{organization_name="{organization}",preset_name="Falkenstein",template_name="docker",le="180"} 1
coderd_prebuilt_workspace_claim_duration_seconds_bucket{organization_name="{organization}",preset_name="Falkenstein",template_name="docker",le="240"} 1
coderd_prebuilt_workspace_claim_duration_seconds_bucket{organization_name="{organization}",preset_name="Falkenstein",template_name="docker",le="300"} 1
coderd_prebuilt_workspace_claim_duration_seconds_bucket{organization_name="{organization}",preset_name="Falkenstein",template_name="docker",le="+Inf"} 1
coderd_prebuilt_workspace_claim_duration_seconds_sum{organization_name="{organization}",preset_name="Falkenstein",template_name="docker"} 4.860075
coderd_prebuilt_workspace_claim_duration_seconds_count{organization_name="{organization}",preset_name="Falkenstein",template_name="docker"} 1
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 2.4056e-05