coder/coderd/boundaryusage/doc.go
Zach 90aeea5649 fix: handle boundary usage across snapshots and flush races (#21805)
Previously there were two issues that could cause incorrect boundary
usage telemetry data.

1. Bad handling across snapshot intervals: After a telemetry snapshot deleted
the DB row, the next flush would INSERT the stale cumulative data (which
included already-reported usage). This would then be overwritten by
subsequent UPDATE flushes, causing the delta between the last snapshot
and the reset to be lost (under-reporting usage). Additionally, if there
was no new usage after the reset, the tracker would carry over all usage
from the previous period into the next period (over-reporting usage).

2. Missed usage from a race condition: Track() calls between the first
mutex unlock and second mutex lock in FlushToDB() were lost. The data
wasn't included in the current flush (already snapshotted) and was wiped
by the subsequent reset. This is likely low impact to overall usage
numbers in the real world.

Fix by tracking unique workspace/user deltas separately from cumulative
values and always tracking delta allowed/denied requests. Deltas are used
for INSERT (fresh start after reset), cumulative for UPDATE (accurate unique
counts within a period). All counters reset atomically before the DB operation
so Track() calls during the operation are preserved for the next flush.
2026-02-02 09:11:54 -07:00
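
The atomic reset described in the commit message can be sketched roughly as follows. This is a minimal illustration and not the code from the commit: `deltaTracker`, `requestDelta`, `takeDelta`, and the `write` callback are hypothetical names, only request counters are modeled (no unique workspace/user sets), and re-adding the delta after a failed write is an assumption. The point is that copying and zeroing the counters happen in one critical section, so a Track() call made while the database operation is in flight is kept for the next flush.

```go
package main

import (
	"fmt"
	"sync"
)

// requestDelta holds request counts accumulated since the previous flush.
type requestDelta struct {
	Allowed int64
	Denied  int64
}

// deltaTracker is a hypothetical, stripped-down tracker for this sketch only.
type deltaTracker struct {
	mu    sync.Mutex
	delta requestDelta
}

// Track records requests. It is safe to call while a flush is writing.
func (t *deltaTracker) Track(allowed, denied int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.delta.Allowed += allowed
	t.delta.Denied += denied
}

// takeDelta copies the counters and zeroes them in a single critical section,
// so no Track() call can land between the copy and the reset.
func (t *deltaTracker) takeDelta() (requestDelta, bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	d := t.delta
	if d == (requestDelta{}) {
		return d, false // nothing new since the last flush
	}
	t.delta = requestDelta{}
	return d, true
}

// Flush hands the pending delta to write, a stand-in for the DB operation.
// The mutex is not held while write runs.
func (t *deltaTracker) Flush(write func(requestDelta) error) error {
	d, ok := t.takeDelta()
	if !ok {
		return nil
	}
	if err := write(d); err != nil {
		// Assumption for this sketch: put the delta back and retry next flush.
		t.Track(d.Allowed, d.Denied)
		return err
	}
	return nil
}

func main() {
	t := &deltaTracker{}
	t.Track(3, 1)

	var written []requestDelta
	write := func(d requestDelta) error {
		t.Track(2, 0) // usage arriving while the DB operation is in flight
		written = append(written, d)
		return nil
	}

	_ = t.Flush(write)
	_ = t.Flush(write)
	fmt.Println(written) // prints [{3 1} {2 0}]
}
```

The second flush reports the two requests tracked during the first write, which is the usage the earlier two-lock approach would have dropped.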

// Package boundaryusage tracks workspace boundary usage for telemetry reporting.
// The design intent is to track trends and rough usage patterns.
//
// Each replica does in-memory usage tracking. Boundary usage is inferred at the
// control plane when workspace agents call the ReportBoundaryLogs RPC. Accumulated
// stats are periodically flushed to a database table keyed by replica ID. Telemetry
// aggregates are computed across all replicas when generating snapshots.
//
// Aggregate Precision:
//
// The aggregated stats represent approximate usage over roughly the telemetry
// snapshot interval, not a precise time window. This imprecision arises because:
//
// - Each replica flushes independently, so their data covers slightly different
// time ranges (varying by up to the flush interval)
// - Unflushed in-memory data at snapshot time rolls into the next period
// - The snapshot captures "data flushed since last reset" rather than "usage
// during exactly the last N minutes"
//
// We accept this imprecision to keep the architecture simple. Each replica
// operates independently and flushes to the database on its own schedule.
// This approach also minimizes database load. The table contains at most one
// row per replica, so flushes are just upserts, and resets only delete N
// rows. There's no accumulation of historical data to clean up. The only
// synchronization is a database lock that ensures exactly one replica reports
// telemetry per period.
//
// Known Shortcomings:
//
// - Unique workspace/user counts may be inflated when the same workspace or
// user connects through multiple replicas, as each replica tracks its own
// unique set
// - Ad-hoc boundary usage in a workspace may not be accounted for, e.g. if
// the boundary command is invoked directly with the --log-proxy-socket-path
// flag set to something other than the workspace agent server
//
// Implementation:
//
// The Tracker maintains sets of unique workspace IDs and user IDs, plus request
// counters. When boundary logs are reported, Track() adds the IDs to the sets
// and increments request counters.
//
// FlushToDB() writes stats to the database only when there's been new activity
// since the last flush. This prevents stale data from being written after a
// telemetry reset when no new usage occurred. Stats accumulate in memory
// throughout the telemetry period.
//
// A new period is detected when the upsert results in an INSERT (meaning
// telemetry deleted the replica's row). At that point, all in-memory stats are
// reset so they only count usage within the new period.
//
// Below is a sequence diagram showing the flow of boundary usage tracking.
//
// ┌───────┐       ┌───────────────┐    ┌──────────┐    ┌────┐    ┌───────────┐
// │ Agent │       │BoundaryLogsAPI│    │ Tracker  │    │ DB │    │ Telemetry │
// └───┬───┘       └───────┬───────┘    └────┬─────┘    └──┬─┘    └─────┬─────┘
//     │                   │                 │             │            │
//     │ ReportBoundaryLogs│                 │             │            │
//     ├──────────────────►│                 │             │            │
//     │                   │ Track(...)      │             │            │
//     │                   ├────────────────►│             │            │
//     │         :         │                 │             │            │
//     │         :         │                 │             │            │
//     │ ReportBoundaryLogs│                 │             │            │
//     ├──────────────────►│                 │             │            │
//     │                   │ Track(...)      │             │            │
//     │                   ├────────────────►│             │            │
//     │                   │                 │             │            │
//     │                   │                 │ FlushToDB   │            │
//     │                   │                 ├────────────►│            │
//     │                   │                 │      :      │            │
//     │                   │                 │      :      │            │
//     │                   │                 │ FlushToDB   │            │
//     │                   │                 ├────────────►│            │
//     │                   │                 │             │            │
//     │                   │                 │             │ Snapshot   │
//     │                   │                 │             │ interval   │
//     │                   │                 │             │◄───────────┤
//     │                   │                 │             │ Aggregate  │
//     │                   │                 │             │ & Reset    │
package boundaryusage
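
To make the "one row per replica" model and the inflated unique counts concrete, here is a small in-memory sketch. It is not the package's code: `replicaRow`, `usageTable`, `upsert`, and `aggregateAndReset` are hypothetical names, the field names do not reflect the real schema, and summing per-replica rows is an assumption about how the aggregate is computed.

```go
package main

import "fmt"

// replicaRow stands in for the single table row each replica owns.
// Field names are illustrative, not the real schema.
type replicaRow struct {
	UniqueWorkspaces int64
	UniqueUsers      int64
	AllowedRequests  int64
	DeniedRequests   int64
}

// usageTable is an in-memory stand-in for the table keyed by replica ID.
type usageTable map[string]replicaRow

// upsert overwrites the replica's row and reports whether the row was newly
// inserted, which happens on first use or after a telemetry reset.
func (t usageTable) upsert(replicaID string, row replicaRow) (inserted bool) {
	_, exists := t[replicaID]
	t[replicaID] = row
	return !exists
}

// aggregateAndReset sums every replica's row (an assumed aggregation) and
// deletes all rows, mirroring the "Aggregate & Reset" step in the diagram.
func (t usageTable) aggregateAndReset() replicaRow {
	var total replicaRow
	for id, row := range t {
		total.UniqueWorkspaces += row.UniqueWorkspaces
		total.UniqueUsers += row.UniqueUsers
		total.AllowedRequests += row.AllowedRequests
		total.DeniedRequests += row.DeniedRequests
		delete(t, id)
	}
	return total
}

func main() {
	table := usageTable{}

	// Replica A saw workspaces ws-1 and ws-2; replica B saw ws-2 and ws-3.
	rowA1 := replicaRow{UniqueWorkspaces: 2, AllowedRequests: 10}
	rowA2 := replicaRow{UniqueWorkspaces: 2, AllowedRequests: 15}
	rowB := replicaRow{UniqueWorkspaces: 2, AllowedRequests: 5}

	fmt.Println(table.upsert("replica-a", rowA1)) // prints true (INSERT)
	fmt.Println(table.upsert("replica-a", rowA2)) // prints false (UPDATE)
	table.upsert("replica-b", rowB)

	total := table.aggregateAndReset()
	// Three distinct workspaces exist, but the summed count is 4 because
	// ws-2 is counted once per replica.
	fmt.Println(total.UniqueWorkspaces, total.AllowedRequests, len(table)) // prints 4 20 0
}
```

After the reset the table is empty, so the next flush from any replica is an insert again, which is how a replica can tell that a new telemetry period has started.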