mirror of
https://github.com/coder/coder.git
synced 2026-06-02 20:48:20 +00:00
feat: add Prometheus metric for agent first connection duration (#24179)
## Summary

Add `coderd_agents_first_connection_seconds` histogram metric that records the duration from workspace agent creation to first connection. This fills an observability gap: provisioner job timings and startup script metrics exist, but the agent connection phase (which can take several minutes) was not exposed to Prometheus.

Closes https://github.com/coder/coder/issues/21282

## Changes

- **`coderd/prometheusmetrics/prometheusmetrics.go`**: Define and register a `HistogramVec` in the existing `Agents()` polling loop. Observe `first_connected_at - created_at` exactly once per agent via a deduplication map, pruned each tick to prevent unbounded memory growth.
- **`coderd/prometheusmetrics/prometheusmetrics_test.go`**: Update `TestAgents` to set `first_connected_at` on the test agent and assert the histogram is collected with correct labels, sample count, and sample sum.
- **`docs/admin/integrations/prometheus.md`**, **`scripts/metricsdocgen/generated_metrics`**: Auto-generated documentation updates from `make gen`.

## Metric details

| Property | Value |
|---|---|
| Name | `coderd_agents_first_connection_seconds` |
| Type | histogram |
| Labels | `template_name`, `agent_name`, `username`, `workspace_name` |
| Buckets | 1s, 10s, 30s, 1m, 2m, 5m, 10m, 30m, 1h |

## Example PromQL

```promql
# P95 agent connection time by template
histogram_quantile(0.95,
  sum(rate(coderd_agents_first_connection_seconds_bucket[1h])) by (le, template_name)
)
```

<details>
<summary>Implementation notes</summary>

### Design decisions

- **Histogram over gauge**: Enables `histogram_quantile()` for percentile queries.
- **Observe in `Agents()` polling loop**: All required data is already fetched by `GetWorkspaceAgentsForMetrics()`, so no new DB queries are needed.
- **Dedup via `map[uuid.UUID]struct{}`**: Prevents re-observing the same agent across polling ticks. Pruned each cycle to bound memory.
- **Buckets**: Aligned with `coderd_provisionerd_workspace_build_timings_seconds` range (1s–1h).
### Overhead at scale (100k active workspaces)

The deduplication map (`observedFirstConnection`) and per-tick pruning map (`currentAgentIDs`) are both `map[[16]byte]struct{}`. At 100k agents:

- **Memory**: ~2.25 MB persistent + ~2.25 MB transient per tick = **~4.5 MB peak**.
- **CPU**: ~25 ms of map operations per tick (one tick per minute) = **<0.05% of one core**.

Both are negligible relative to the existing cost of the `Agents()` loop (the DB query, per-agent `GetWorkspaceAppsByAgentID` calls, and coordinator node lookups dominate).

</details>

> 🤖 Generated by Coder Agents
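
The overhead figures quoted in the implementation notes can be sanity-checked with a few lines of arithmetic. The sketch below only restates the numbers from the notes (25 ms per one-minute tick, ~2.25 MB per map at 100k agents); the helper names are hypothetical, and the per-entry figure assumes decimal megabytes.

```go
package main

import "fmt"

// cpuFraction returns the share of one core used when each tick spends
// busySeconds on CPU out of every tickSeconds of wall-clock time.
func cpuFraction(busySeconds, tickSeconds float64) float64 {
	return busySeconds / tickSeconds
}

// bytesPerEntry divides a total map footprint by its number of entries.
func bytesPerEntry(totalBytes, entries float64) float64 {
	return totalBytes / entries
}

func main() {
	// 25 ms of map work per 60 s tick.
	fmt.Printf("CPU share of one core: %.4f%%\n", cpuFraction(0.025, 60)*100)
	// ~2.25 MB for 100k entries, assuming 1 MB = 1e6 bytes.
	fmt.Printf("implied bytes per map entry: %.1f\n", bytesPerEntry(2.25e6, 100_000))
	// Persistent map plus the transient per-tick map.
	fmt.Printf("peak memory: %.1f MB\n", 2.25+2.25)
}
```

The CPU share works out to roughly 0.042% of one core, consistent with the stated bound of under 0.05%, and the implied footprint of about 22.5 bytes per entry is plausible for a 16-byte key plus map bookkeeping.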
@@ -157,6 +157,9 @@ coderd_agents_connection_latencies_seconds{agent_name="",username="",workspace_n
 # HELP coderd_agents_connections Agent connections with statuses.
 # TYPE coderd_agents_connections gauge
 coderd_agents_connections{agent_name="",username="",workspace_name="",status="",lifecycle_state="",tailnet_node=""} 0
+# HELP coderd_agents_first_connection_seconds Duration from agent creation to first connection to the control plane in seconds.
+# TYPE coderd_agents_first_connection_seconds histogram
+coderd_agents_first_connection_seconds{template_name="",agent_name="",username="",workspace_name=""} 0
 # HELP coderd_agents_up The number of active agents per workspace.
 # TYPE coderd_agents_up gauge
 coderd_agents_up{username="",workspace_name="",template_name="",template_version=""} 0