## Summary

Add `coderd_agents_first_connection_seconds` histogram metric that records the duration from workspace agent creation to first connection. This fills an observability gap: provisioner job timings and startup script metrics exist, but the agent connection phase (which can take several minutes) was not exposed to Prometheus.

Closes https://github.com/coder/coder/issues/21282

## Changes

- **`coderd/prometheusmetrics/prometheusmetrics.go`**: Define and register a `HistogramVec` in the existing `Agents()` polling loop. Observe `first_connected_at - created_at` exactly once per agent via a deduplication map, pruned each tick to prevent unbounded memory growth.
- **`coderd/prometheusmetrics/prometheusmetrics_test.go`**: Update `TestAgents` to set `first_connected_at` on the test agent and assert the histogram is collected with correct labels, sample count, and sample sum.
- **`docs/admin/integrations/prometheus.md`**, **`scripts/metricsdocgen/generated_metrics`**: Auto-generated documentation updates from `make gen`.

## Metric details

| Property | Value |
|---|---|
| Name | `coderd_agents_first_connection_seconds` |
| Type | histogram |
| Labels | `template_name`, `agent_name`, `username`, `workspace_name` |
| Buckets | 1s, 10s, 30s, 1m, 2m, 5m, 10m, 30m, 1h |

## Example PromQL

```promql
# P95 agent connection time by template
histogram_quantile(0.95,
  sum(rate(coderd_agents_first_connection_seconds_bucket[1h])) by (le, template_name)
)
```

<details>
<summary>Implementation notes</summary>

### Design decisions

- **Histogram over gauge**: Enables `histogram_quantile()` for percentile queries.
- **Observe in `Agents()` polling loop**: All required data is already fetched by `GetWorkspaceAgentsForMetrics()`, so no new DB queries are needed.
- **Dedup via `map[uuid.UUID]struct{}`**: Prevents re-observing the same agent across polling ticks. Pruned each cycle to bound memory.
- **Buckets**: Aligned with `coderd_provisionerd_workspace_build_timings_seconds` range (1s–1h).

### Overhead at scale (100k active workspaces)

The deduplication map (`observedFirstConnection`) and per-tick pruning map (`currentAgentIDs`) are both `map[[16]byte]struct{}`. At 100k agents:

- **Memory**: ~2.25 MB persistent + ~2.25 MB transient per tick = **~4.5 MB peak**.
- **CPU**: ~25 ms of map operations per tick (one tick per minute) = **<0.05% of one core**.

Both are negligible relative to the existing cost of the `Agents()` loop (the DB query, per-agent `GetWorkspaceAppsByAgentID` calls, and coordinator node lookups dominate).

</details>

> 🤖 Generated by Coder Agents
# Metrics Documentation Generator

This tool generates the Prometheus metrics documentation at `docs/admin/integrations/prometheus.md`.
## How It Works

The documentation is generated from two metrics files:

- `metrics` (static, manually maintained)
- `generated_metrics` (auto-generated, do not edit)

These files are merged and used to produce the final documentation.
### `metrics` (static)

Contains metrics that are not directly defined in the coder source code:

- `go_*`: Go runtime metrics
- `process_*`: Process metrics from prometheus/client_golang
- `promhttp_*`: Prometheus HTTP handler metrics
- `coder_aibridged_*`: Metrics from external dependencies

> [!NOTE]
> This file also contains edge cases where metric metadata cannot be accurately extracted by the scanner (e.g., labels determined by runtime logic). Static metrics take priority over generated metrics when both files contain the same metric name.

Edit this file to add metrics that should appear in the documentation but are not scanned from the coder codebase, or to manually override metrics where the scanner generates incorrect metadata (e.g., missing runtime-determined labels, as in `agent_scripts_executed_total`).
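
The priority rule (static entries win when both files contain the same metric name) can be sketched as a simple map-based merge. This is an illustration under assumptions, not the tool's actual code: the `metric` struct, the `mergeMetrics` function, and the sample entries (including the placeholder label) are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// metric is a hypothetical, simplified record; the real tool's types may differ.
type metric struct {
	Name   string
	Type   string
	Labels []string
}

// mergeMetrics returns the union of both lists sorted by name. When a name
// appears in both lists, the static entry replaces the generated one.
func mergeMetrics(static, generated []metric) []metric {
	byName := make(map[string]metric, len(static)+len(generated))
	for _, m := range generated {
		byName[m.Name] = m
	}
	for _, m := range static {
		byName[m.Name] = m // static overrides generated
	}
	out := make([]metric, 0, len(byName))
	for _, m := range byName {
		out = append(out, m)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

func main() {
	// Illustrative data only; the label below is a placeholder.
	static := []metric{
		{Name: "agent_scripts_executed_total", Type: "counter", Labels: []string{"placeholder_label"}},
		{Name: "go_goroutines", Type: "gauge"},
	}
	generated := []metric{
		{Name: "agent_scripts_executed_total", Type: "counter"}, // scanner missed the labels
		{Name: "coderd_agents_first_connection_seconds", Type: "histogram"},
	}
	for _, m := range mergeMetrics(static, generated) {
		fmt.Println(m.Name, m.Type, m.Labels)
	}
}
```

Writing the generated entries first and the static entries second keeps the override rule to a single assignment, at the cost of holding both lists in memory.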
### `generated_metrics` (auto-generated)

Contains metrics extracted from the coder source code by the AST scanner (`scanner/scanner.go`).

Do not edit this file directly. It is regenerated by running:

    make scripts/metricsdocgen/generated_metrics
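
To give a rough idea of what AST-based extraction can look like, here is a small sketch built on the standard `go/parser` and `go/ast` packages. It is not the logic in `scanner/scanner.go`: the `metricNames` function, the "type name ends in `Opts`" heuristic, and the sample source (including the use of `Namespace` and `Subsystem`) are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strconv"
	"strings"
)

// exampleSrc is illustrative input; it is parsed only, never compiled.
const exampleSrc = `package example

import "github.com/prometheus/client_golang/prometheus"

var h = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Namespace: "coderd",
	Subsystem: "agents",
	Name:      "first_connection_seconds",
}, []string{"template_name"})
`

// metricNames parses src and, for every composite literal whose selector type
// ends in "Opts", joins its literal Namespace, Subsystem, and Name fields
// with underscores. Non-literal field values are ignored.
func metricNames(src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example.go", src, 0)
	if err != nil {
		return nil, err
	}
	var names []string
	ast.Inspect(file, func(n ast.Node) bool {
		lit, ok := n.(*ast.CompositeLit)
		if !ok {
			return true
		}
		sel, ok := lit.Type.(*ast.SelectorExpr)
		if !ok || !strings.HasSuffix(sel.Sel.Name, "Opts") {
			return true
		}
		fields := map[string]string{}
		for _, elt := range lit.Elts {
			kv, ok := elt.(*ast.KeyValueExpr)
			if !ok {
				continue
			}
			key, ok := kv.Key.(*ast.Ident)
			if !ok {
				continue
			}
			val, ok := kv.Value.(*ast.BasicLit)
			if !ok || val.Kind != token.STRING {
				continue // value computed at runtime; cannot be read statically
			}
			if s, err := strconv.Unquote(val.Value); err == nil {
				fields[key.Name] = s
			}
		}
		if fields["Name"] == "" {
			return true
		}
		var parts []string
		for _, k := range []string{"Namespace", "Subsystem", "Name"} {
			if fields[k] != "" {
				parts = append(parts, fields[k])
			}
		}
		names = append(names, strings.Join(parts, "_"))
		return true
	})
	return names, nil
}

func main() {
	names, err := metricNames(exampleSrc)
	if err != nil {
		panic(err)
	}
	fmt.Println(names)
}
```

Because a sketch like this reads only string literals, anything decided at runtime is invisible to it, which is the kind of gap the static `metrics` file is meant to cover.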
## Updating Metrics Documentation

To regenerate the documentation after code changes:

    make docs/admin/integrations/prometheus.md

This will:

- Run the scanner to update `generated_metrics`
- Merge `metrics` and `generated_metrics` metric files
- Update the documentation file