coder/coderd/database/queries/boundaryusagestats.sql
Zach 90aeea5649 fix: handle boundary usage across snapshots and flush races (#21805)
Previously there were two issues that could cause incorrect boundary
usage telemetry data.

1. Bad handling across snapshot intervals: After a telemetry snapshot deleted
the DB row, the next flush would INSERT the stale cumulative data (which
included already-reported usage). This would then be overwritten by
subsequent UPDATE flushes, causing the delta between the last snapshot
and the reset to be lost (under-reporting usage). Additionally, if there
was no new usage after the reset, the tracker would carry over all usage
from the previous period into the next period (over-reporting usage).

2. Missed usage from a race condition: Track() calls between the first
mutex unlock and second mutex lock in FlushToDB() were lost. The data
wasn't included in the current flush (already snapshotted) and was wiped
by the subsequent reset. This is likely low impact to overall usage
numbers in the real world.

Fix by tracking unique workspace/user deltas separately from cumulative
values and always tracking delta allowed/denied requests. Deltas are used
for INSERT (fresh start after reset), cumulative for UPDATE (accurate unique
counts within a period). All counters reset atomically before the DB operation
so Track() calls during the operation are preserved for the next flush.
2026-02-02 09:11:54 -07:00
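
The fix described in the commit message can be illustrated with a minimal Go sketch. This is not the actual coder implementation: the `Tracker`, `Snapshot`, `NewTracker`, and `Flush` names are hypothetical, the `upsert` callback stands in for the UpsertBoundaryUsageStats query, and rebuilding the cumulative sets after an INSERT is an assumption about how a new period might start. What it shows is the core idea: deltas are captured and reset under a single lock before the DB operation, so Track() calls made during the operation land in the next flush.

```go
package main

import (
	"fmt"
	"sync"
)

type set map[string]struct{}

// Snapshot holds the values one flush would pass to the upsert query.
type Snapshot struct {
	UniqueWorkspacesCount, UniqueUsersCount int64 // cumulative, used on UPDATE
	UniqueWorkspacesDelta, UniqueUsersDelta int64 // since last flush, used on INSERT
	AllowedRequests, DeniedRequests         int64 // always deltas
}

// Tracker is a hypothetical stand-in for the real boundary usage tracker.
type Tracker struct {
	mu                  sync.Mutex
	workspaces, users   set // cumulative for the current period
	deltaWS, deltaUsers set // seen since the last flush
	allowed, denied     int64
}

func NewTracker() *Tracker {
	return &Tracker{workspaces: set{}, users: set{}, deltaWS: set{}, deltaUsers: set{}}
}

// Track records one request in both the cumulative and the delta state.
func (t *Tracker) Track(workspaceID, userID string, allowed bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.workspaces[workspaceID], t.deltaWS[workspaceID] = struct{}{}, struct{}{}
	t.users[userID], t.deltaUsers[userID] = struct{}{}, struct{}{}
	if allowed {
		t.allowed++
	} else {
		t.denied++
	}
}

func union(a, b set) set {
	out := make(set, len(a)+len(b))
	for k := range a {
		out[k] = struct{}{}
	}
	for k := range b {
		out[k] = struct{}{}
	}
	return out
}

// Flush snapshots and resets the deltas in one critical section, then calls
// upsert without holding the lock. upsert reports whether it inserted a new row.
func (t *Tracker) Flush(upsert func(Snapshot) (bool, error)) error {
	t.mu.Lock()
	snap := Snapshot{
		UniqueWorkspacesCount: int64(len(t.workspaces)),
		UniqueUsersCount:      int64(len(t.users)),
		UniqueWorkspacesDelta: int64(len(t.deltaWS)),
		UniqueUsersDelta:      int64(len(t.deltaUsers)),
		AllowedRequests:       t.allowed,
		DeniedRequests:        t.denied,
	}
	flushedWS, flushedUsers := t.deltaWS, t.deltaUsers
	t.deltaWS, t.deltaUsers = set{}, set{}
	t.allowed, t.denied = 0, 0
	t.mu.Unlock()

	newPeriod, err := upsert(snap) // Track() may run concurrently here

	t.mu.Lock()
	defer t.mu.Unlock()
	if err != nil {
		// Assumption: on failure, fold the unflushed deltas back in for a retry.
		t.deltaWS, t.deltaUsers = union(flushedWS, t.deltaWS), union(flushedUsers, t.deltaUsers)
		t.allowed += snap.AllowedRequests
		t.denied += snap.DeniedRequests
		return err
	}
	if newPeriod {
		// Assumption: a fresh row means a new period, so the cumulative sets
		// restart from what was just inserted plus anything tracked since.
		t.workspaces = union(flushedWS, t.deltaWS)
		t.users = union(flushedUsers, t.deltaUsers)
	}
	return nil
}

func main() {
	t := NewTracker()
	t.Track("ws-1", "user-1", true)
	_ = t.Flush(func(s Snapshot) (bool, error) {
		fmt.Printf("flushing: %+v\n", s)
		return true, nil // pretend the row was inserted
	})
}
```

Because the lock is released while `upsert` runs, the sketch never blocks Track() on the database, and nothing recorded in that window is wiped by a later reset.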


-- name: UpsertBoundaryUsageStats :one
-- Upserts boundary usage statistics for a replica. On INSERT (new period), uses
-- delta values for unique counts (only data since last flush). On UPDATE, uses
-- cumulative values for unique counts (accurate period totals). Request counts
-- are always deltas, accumulated in DB. Returns true if insert, false if update.
INSERT INTO boundary_usage_stats (
    replica_id,
    unique_workspaces_count,
    unique_users_count,
    allowed_requests,
    denied_requests,
    window_start,
    updated_at
) VALUES (
    @replica_id,
    @unique_workspaces_delta,
    @unique_users_delta,
    @allowed_requests,
    @denied_requests,
    NOW(),
    NOW()
) ON CONFLICT (replica_id) DO UPDATE SET
    unique_workspaces_count = @unique_workspaces_count,
    unique_users_count = @unique_users_count,
    allowed_requests = boundary_usage_stats.allowed_requests + EXCLUDED.allowed_requests,
    denied_requests = boundary_usage_stats.denied_requests + EXCLUDED.denied_requests,
    updated_at = NOW()
RETURNING (xmax = 0) AS new_period;
-- name: GetBoundaryUsageSummary :one
-- Aggregates boundary usage statistics across all replicas. Filters to only
-- include data where window_start is within the given interval to exclude
-- stale data.
SELECT
    COALESCE(SUM(unique_workspaces_count), 0)::bigint AS unique_workspaces,
    COALESCE(SUM(unique_users_count), 0)::bigint AS unique_users,
    COALESCE(SUM(allowed_requests), 0)::bigint AS allowed_requests,
    COALESCE(SUM(denied_requests), 0)::bigint AS denied_requests
FROM boundary_usage_stats
WHERE window_start >= NOW() - (@max_staleness_ms::bigint || ' ms')::interval;
-- name: ResetBoundaryUsageStats :exec
-- Deletes all boundary usage statistics. Called after telemetry reports the
-- aggregated stats. Each replica will insert a fresh row on its next flush.
DELETE FROM boundary_usage_stats;
-- name: DeleteBoundaryUsageStatsByReplicaID :exec
-- Deletes boundary usage statistics for a specific replica.
DELETE FROM boundary_usage_stats WHERE replica_id = @replica_id;