Adds production-observability metrics to coderd/x/chatd/ for
model-level correlation and a chatStreams memory-leak investigation.
- Label per-request chatd metrics (steps_total, message_count,
prompt_size_bytes, tool_result_size_bytes, ttft_seconds,
compaction_total) with `model` and enrich the per-turn logger
with provider/model.
- Add `coderd_chatd_stream_retries_total{provider, model, kind}`
counter incremented in chatloop before OnRetry.
- Register a prometheus.Collector exposing `streams_active`,
`stream_buffer_size_max`, `stream_buffer_events`,
`stream_subscribers` from p.chatStreams.
- Add `coderd_chatd_stream_buffer_dropped_total` counter,
incremented per publishToStream drop independently of the
existing log-rate-limited bufferDropCount.
- Snapshot logger/model before the title-generation goroutine to
avoid a data race with the logger/model rebind below it.
> 🤖
_Disclaimer: produced by Claude Opus 4.6_
Adds a `coder_build_info` metric which allows operators to see which
versions of Coder are currently running.
---------
Signed-off-by: Danny Kopping <danny@coder.com>
## Summary
Add `coderd_agents_first_connection_seconds` histogram metric that
records the
duration from workspace agent creation to first connection. This fills
an
observability gap — provisioner job timings and startup script metrics
exist,
but the agent connection phase (which can take several minutes) was not
exposed
to Prometheus.
Closes https://github.com/coder/coder/issues/21282
## Changes
- **`coderd/prometheusmetrics/prometheusmetrics.go`** — Define and
register a
`HistogramVec` in the existing `Agents()` polling loop. Observe
`first_connected_at - created_at` exactly once per agent via a
deduplication
map, pruned each tick to prevent unbounded memory growth.
- **`coderd/prometheusmetrics/prometheusmetrics_test.go`** — Update
`TestAgents`
to set `first_connected_at` on the test agent and assert the histogram
is
collected with correct labels, sample count, and sample sum.
- **`docs/admin/integrations/prometheus.md`**,
**`scripts/metricsdocgen/generated_metrics`** —
Auto-generated documentation updates from `make gen`.
## Metric details
| Property | Value |
|---|---|
| Name | `coderd_agents_first_connection_seconds` |
| Type | histogram |
| Labels | `template_name`, `agent_name`, `username`, `workspace_name` |
| Buckets | 1s, 10s, 30s, 1m, 2m, 5m, 10m, 30m, 1h |
## Example PromQL
```promql
# P95 agent connection time by template
histogram_quantile(0.95,
sum(rate(coderd_agents_first_connection_seconds_bucket[1h])) by (le, template_name)
)
```
<details>
<summary>Implementation notes</summary>
### Design decisions
- **Histogram over gauge**: Enables `histogram_quantile()` for
percentile queries.
- **Observe in `Agents()` polling loop**: All required data is already
fetched by
`GetWorkspaceAgentsForMetrics()` — no new DB queries.
- **Dedup via `map[uuid.UUID]struct{}`**: Prevents re-observing the same
agent
across polling ticks. Pruned each cycle to bound memory.
- **Buckets**: Aligned with
`coderd_provisionerd_workspace_build_timings_seconds`
range (1s–1h).
### Overhead at scale (100k active workspaces)
The deduplication map (`observedFirstConnection`) and per-tick pruning
map
(`currentAgentIDs`) are both `map[[16]byte]struct{}`. At 100k agents:
- **Memory**: ~2.25 MB persistent + ~2.25 MB transient per tick = **~4.5
MB peak**.
- **CPU**: ~25 ms of map operations per tick (one tick per minute) =
**<0.05% of one core**.
Both are negligible relative to the existing cost of the `Agents()` loop
(the DB
query, per-agent `GetWorkspaceAppsByAgentID` calls, and coordinator node
lookups
dominate).
</details>
> 🤖 Generated by Coder Agents
- Suppress informational `log.Printf` messages from the metrics scanner
when stdout is not a TTY (i.e. piped via `atomic_write` in `make gen` or
CI)
- Genuine warnings (`warnf`) still print unconditionally so real
problems remain visible
- `log.Fatalf` for fatal errors is unchanged
> 🤖 Created by Coder Agents and reviewed by a human
Audited exported helpers in `coderd/util/*`, `testutil`, `cryptorand`,
and friends, then replaced duplicated implementations with canonical
versions.
- **fix: `maps.SortedKeys` generic signature** — value type was
hardcoded to `any`, making it impossible to actually call. Added second
type parameter `V any`. Added table-driven tests with `cmp.Diff`.
- **refactor: replace ad-hoc ptr helpers with `ptr.Ref`** — removed
`int64Ptr`, `stringPtr`, `boolPtr`, `i64ptr`, `strPtr`, `PtrInt32`
across 6 files.
- **refactor: replace local `sortedKeys`/`sortKeys` with
`maps.SortedKeys`** — now that the signature is fixed, scripts can use
it.
- **refactor: replace hand-rolled `capitalize` with
`strings.Capitalize`** — the typegen version was also not UTF-8 safe.
> 🤖 This PR was created with the help of Coder Agents, and was reviewed
by my human. 🧑💻
This PR does three things:
- Exports derp expvars to the pprof endpoint
- Exports the expvar metrics as prometheus metrics in both coderd and
wsproxy
- Updates our tailscale to a fix I also had to make to avoid a data race
condition
I generated this with mux but I also manually tested that the metrics
were getting properly emitted
`make gen` could not run with `-j` because inter-target dependency edges
were missing. Multiple recipes compile `coderd/rbac` (which includes
generated files like `object_gen.go`), and without explicit ordering,
parallel runs produced syntax errors from mid-write reads.
Three main changes:
**Dependency graph fixes** declare the compile-time chain through
`coderd/rbac` so that `object_gen.go` is written before anything that
imports it is compiled. The DB generation targets use a GNU Make 4.3+
grouped target (`&:`) so Make knows `generate.sh` co-produces
`querier.go`, `unique_constraint.go`, `dbmetrics`, and `dbauthz` in a
single invocation. `SKIP_DUMP_SQL=1` avoids re-entrant `make` inside
`generate.sh` when the Makefile already guarantees `dump.sql` is fresh.
**`scripts/atomicwrite` package** replaces `os.WriteFile` in all gen
scripts with a temp-file-in-same-dir + rename pattern, preventing
interrupted runs from leaving partial files.
**`.PRECIOUS` and shell atomic writes** protect git-tracked generated
files from Make's default delete-on-error behavior. Since these files
are committed, deletion is worse than staleness -- `git restore` is the
recovery path.
CI now runs `make -j --output-sync -B gen` (~32s, down from ~85s
serial).
| Scenario | Before | After |
|-----------------------------------|--------------------|----------|
| `make gen` (serial) | 95s | 95s |
| `make -j gen` (parallel) | race error | **22s** |
| CI `make -j --output-sync -B gen` | forced serial ~85s | **~32s** |
Add Prometheus metrics to the boundary log proxy for observability:
- batches_dropped_total (reason: buffer_full, forward_failed)
- logs_dropped_total (reason: buffer_full, forward_failed,
boundary_channel_full, boundary_batch_full)
- batches_forwarded_total
Also add BoundaryStatus to the BoundaryMessage envelope so boundary
can report dropped log counts as a separate wire message. The agent
records these as Prometheus metrics, making boundary-side data loss
visible. Backwards compatibility for older versions of boundary is maintained.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
## Description
- Updates `wsbuilder` to return a `BuildError` with
`http.StatusBadRequest` to signify a "validation error" on missing or
invalid parameters
- Adds a short-circuit in `prebuilds.StoreReconciler` to mark presets
for which creating a build returns a "validation error" as "validation
failed" and skip further attempts to reconcile.
- Adds a test to verify the above
- Introduces a new Prometheus metric
`coderd_prebuilt_workspaces_preset_validation_failed` to track the above
Closes: https://github.com/coder/coder/issues/21237
---------
Co-authored-by: Cian Johnston <cian@coder.com>
## Description
When multiple organizations have templates with the same name, the
Prometheus `/metrics` endpoint returns HTTP 500 because Prometheus
rejects duplicate label combinations. The three `coderd_insights_*`
metrics (`coderd_insights_templates_active_users`,
`coderd_insights_applications_usage_seconds`,
`coderd_insights_parameters`) used only `template_name` as a
distinguishing label, so two templates named e.g. `"openstack-v1"` in
different orgs would produce duplicate metric series.
This adds `organization_name` as a label to all three insight metric
descriptors to disambiguate templates across organizations.
## Changes
**`coderd/prometheusmetrics/insights/metricscollector.go`**:
- Added `organization_name` label to all three metric descriptors
- Added `organizationNames` field (template ID → org name) to the
`insightsData` struct
- In `doTick`: after fetching templates, collect unique org IDs, fetch
organizations via `GetOrganizations`, and build a
template-ID-to-org-name mapping
- In `Collect()`: pass the organization name as an additional label
value in every `MustNewConstMetric` call
**`coderd/prometheusmetrics/insights/testdata/insights-metrics.json`**:
Updated golden file to include `organization_name=coder` in all metric
label keys.
Fixes#21748
## Description
This PR wires up the metrics scanner in the Makefile to automatically regenerate metrics documentation when source files change.
## Changes
* Add Makefile target `scripts/metricsdocgen/generated_metrics` to run the AST scanner to generate the metrics file
* Update `docs/admin/integrations/prometheus.md` Makefile target to depend on `scripts/metricsdocgen/generated_metrics`
* Add `scripts/metricsdocgen/README.md` documenting the metrics generation process
Closes: https://github.com/coder/coder/issues/13223
## Description
This PR refactors `scripts/metricsdocgen/main.go` to support merging static and generated metrics files for documentation generation.
The static `metrics` file remains necessary for metrics not defined in the coder codebase (`go_*`, `process_*`, `promhttp_*`, `coder_aibridged_*`), as well as **edge cases** the scanner cannot handle (e.g., such as metrics with runtime-determined labels or function-local variable references for fields, ...). Handling these edge cases in the scanner would make it significantly more complex, so we keep this hybrid approach to accommodate them. This means that in such cases, developers need to update the `metrics` file directly, meaning there is still a risk of out-of-date information in the documentation. However, this solution should already encompass most cases.
Static metrics take priority over generated metrics when both files contain the same metric name, allowing manual overrides without modifying the scanner. Some of these edge cases could be easily fixed by updating the codebase to use one of the supported patterns.
## Changes
* Update `scripts/metricsdocgen/main.go` to read from two separate metrics files:
* `metrics`: static, manually maintained metrics (e.g., `go_*`, `process_*`, `promhttp_*`, `coder_aibridged_*`)
* `generated_metrics`: auto-generated by the AST scanner
* Update `metrics` file to contain only static and edge-case metrics
* Skip metrics with empty HELP descriptions in the scanner
* Update `generated_metrics` to reflect skipped metrics
* Update `docs/admin/integrations/prometheus.md` with merged metrics
Related to: https://github.com/coder/coder/issues/13223
**Disclosure:** This PR was mainly developed with Claude Sonnet 4, with iterative review and refinement by @ssncferreira
## Description
This PR implements extraction of metrics defined using `promauto.With()` factory patterns.
## Changes
* Add `extractPromautoMetric()` to handle:
* `promauto.With(reg).NewCounterVec(prometheus.CounterOpts{...}, labels)`
* `factory.NewGaugeVec(prometheus.GaugeOpts{...}, labels)`
* Script generates an updated `scripts/metricsdocgen/generated_metrics` file
Related to: https://github.com/coder/coder/issues/13223
**Disclosure:** This PR was mainly developed with Claude Sonnet 4, with iterative review and refinement by @ssncferreira
## Description
This PR implements extraction of metrics defined using `prometheus.New*()` and `prometheus.New*Vec()` patterns with `*Opts{}` structs.
## Changes
* Add `extractOptsMetric()` to handle:
* `prometheus.NewGauge(prometheus.GaugeOpts{...})`
* `prometheus.NewCounter(prometheus.CounterOpts{...})`
* `prometheus.NewHistogram(prometheus.HistogramOpts{...})`
* `prometheus.NewSummary(prometheus.SummaryOpts{...})`
* `prometheus.New*Vec(prometheus.*Opts{...}, labels)`
* Script generates an updated `scripts/metricsdocgen/generated_metrics` file
Related to: https://github.com/coder/coder/issues/13223
**Disclosure:** This PR was mainly developed with Claude Sonnet 4, with iterative review and refinement by @ssncferreira
## Description
This PR implements extraction of metrics defined using the `prometheus.NewDesc()` pattern.
## Changes
* Add `extractNewDescMetric()` to extract metrics from `prometheus.NewDesc()` calls
* Script generates an updated `scripts/metricsdocgen/generated_metrics` file
Related to: https://github.com/coder/coder/issues/13223
**Disclosure:** This PR was mainly developed with Claude Sonnet 4, with iterative review and refinement by @ssncferreira
## Description
This PR adds an AST-based scanner to automatically generate Prometheus metrics documentation from the coder source code.
## Changes
* Add `scripts/metricsdocgen/scanner/scanner.go` with:
* Directory walking for `agent/`, `coderd/`, `enterprise/`, `provisionerd/`
* Go file parsing (skipping `*_test.go` files)
* AST inspection for metric extraction
* `Metric.String()` for Prometheus text exposition format rendering
* `writeMetrics()` to output metrics to stdout
* Placeholder `extractMetricFromCall()` (implemented in subsequent PRs)
* Empty `scripts/metricsdocgen/generated_metrics` placeholder (populated by subsequent PRs)
**Note:** To facilitate the review process, this was separated into scoped stacked PRs. The division was based on the main structure, the different Prometheus patterns currently present in the codebase, and updates to the build process.
Related to: https://github.com/coder/coder/issues/13223
**Disclosure:** This PR was mainly developed with Claude Sonnet 4, with iterative review and refinement by @ssncferreira
This PR adds some metrics to help identify job enqueue rates and
latencies. This work was initiated as a way to help reduce the cost of
the observation/measurement itself for autostart scaletests, which
impacts our ability to identify/reason about the load caused by
autostart. See: https://github.com/coder/internal/issues/1209
I've extended the metrics here to account for regular user initiated
builds, prebuilds, autostarts, etc. IMO there is still the question here
of whether we want to include or need the `transition` label, which is
only present on workspace builds. Including it does lead to an increase
in cardinality, and in the case of the histogram (when not using native
histograms) that's at least a few extra series for every bucket. We
could remove the transition label there but keep it on the counter.
Additionally, the histogram is currently observing latencies for other
jobs, such as template builds/version imports, those do not have a
transition type associated with them.
Tested briefly in a workspace, can see metric values like the following:
-
`coderd_workspace_builds_enqueued_total{build_reason="autostart",provisioner_type="terraform",status="success",transition="start"}
1`
-
`coderd_provisioner_job_queue_wait_seconds_bucket{build_reason="autostart",job_type="workspace_build",provisioner_type="terraform",transition="start",le="0.025"}
1`
---------
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Adds coderd_template_workspace_build_duration_seconds histogram that
tracks the full duration from workspace build creation to agent ready.
This captures the complete user-perceived build time including
provisioning and agent startup.
The metric is emitted when the agent reports ready/error/timeout via the
lifecycle API, ensuring each build is counted exactly once per replica.
## Description
Removes the following deprecated Prometheus metrics:
- `coderd_api_workspace_latest_build_total` → use
`coderd_api_workspace_latest_build` instead
- `coderd_oauth2_external_requests_rate_limit_total` → use
`coderd_oauth2_external_requests_rate_limit` instead
These metrics were deprecated in #12976 because gauge metrics should
avoid the `_total` suffix per [Prometheus naming
conventions](https://prometheus.io/docs/practices/naming/).
## Changes
- Removed deprecated metric `coderd_api_workspace_latest_build_total`
from `coderd/prometheusmetrics/prometheusmetrics.go`
- Removed deprecated metric
`coderd_oauth2_external_requests_rate_limit_total` from
`coderd/promoauth/oauth2.go`
- Updated tests to use the non-deprecated metric name
Fixes#12999
Fixes: coder/internal#767
Adds two new Prometheus metrics for license health monitoring:
- `coderd_license_warnings` - count of active license warnings
- `coderd_license_errors` - count of active license errors
Metrics endpoint after startup of a deployment with license enabled:
```
...
# HELP coderd_license_errors The number of active license errors.
# TYPE coderd_license_errors gauge
coderd_license_errors 0
...
# HELP coderd_license_warnings The number of active license warnings.
# TYPE coderd_license_warnings gauge
coderd_license_warnings 0
...
```
## Description
Add missing provisionerd metrics to Prometheus documentation:
* `coderd_provisionerd_num_daemons`: The number of provisioner daemons.
* `coderd_provisionerd_workspace_build_timings_seconds`: The time taken
for a workspace to build.
Related to internal thread:
https://codercom.slack.com/archives/C07GRNNRW03/p1760642020583019
## Description
This PR introduces one counter and two histograms related to workspace
creation and claiming. The goal is to provide clearer observability into
how workspaces are created (regular vs prebuild) and the time cost of
those operations.
### `coderd_workspace_creation_total`
* Metric type: Counter
* Name: `coderd_workspace_creation_total`
* Labels: `organization_name`, `template_name`, `preset_name`
This counter tracks whether a regular workspace (not created from a
prebuild pool) was created using a preset or not.
Currently, we already expose `coderd_prebuilt_workspaces_claimed_total`
for claimed prebuilt workspaces, but we lack a comparable metric for
regular workspace creations. This metric fills that gap, making it
possible to compare regular creations against claims.
Implementation notes:
* Exposed as a `coderd_` metric, consistent with other workspace-related
metrics (e.g. `coderd_api_workspace_latest_build`:
https://github.com/coder/coder/blob/main/coderd/prometheusmetrics/prometheusmetrics.go#L149).
* Every `defaultRefreshRate` (1 minute ), DB query
`GetRegularWorkspaceCreateMetrics` is executed to fetch all regular
workspaces (not created from a prebuild pool).
* The counter is updated with the total from all time (not just since
metric introduction). This differs from the histograms below, which only
accumulate from their introduction forward.
### `coderd_workspace_creation_duration_seconds` &
`coderd_prebuilt_workspace_claim_duration_seconds`
* Metric types: Histogram
* Names:
* `coderd_workspace_creation_duration_seconds`
* Labels: `organization_name`, `template_name`, `preset_name`, `type`
(`regular`, `prebuild`)
* `coderd_prebuilt_workspace_claim_duration_seconds`
* Labels: `organization_name`, `template_name`, `preset_name`
We already have `coderd_provisionerd_workspace_build_timings_seconds`,
which tracks build run times for all workspace builds handled by the
provisioner daemon.
However, in the context of this issue, we are only interested in
creation and claim build times, not all transitions; additionally, this
metric does not include `preset_name`, and adding it there would
significantly increase cardinality. Therefore, separate more focused
metrics are introduced here:
* `coderd_workspace_creation_duration_seconds`: Build time to create a
workspace (either a regular workspace or the build into a prebuild pool,
for prebuild initial provisioning build).
* `coderd_prebuilt_workspace_claim_duration_seconds`: Time to claim a
prebuilt workspace from the pool.
The reason for two separate histograms is that:
* Creation (regular or prebuild): provisioning builds with similar time
magnitude, generally expected to take longer than a claim operation.
* Claim: expected to be a much faster provisioning build.
#### Native histogram usage
Provisioning times vary widely between projects. Using static buckets
risks unbalanced or poorly informative histograms.
To address this, these metrics use [Prometheus native
histograms](https://prometheus.io/docs/specs/native_histograms/):
* First introduced in Prometheus v2.40.0
* Recommended stable usage from v2.45+
* Requires Go client `prometheus/client_golang` v1.15.0+
* Experimental and must be explicitly enabled on the server
(`--enable-feature=native-histograms`)
For compatibility, we also retain a classic bucket definition (aligned
with the existing provisioner metric:
https://github.com/coder/coder/blob/main/provisionerd/provisionerd.go#L182-L189).
* If native histograms are enabled, Prometheus ingests the
high-resolution histogram.
* If not, it falls back to the predefined buckets.
Implementation notes:
* Unlike the counter, these histograms are updated in real-time at
workspace build job completion.
* They reflect data only from the point of introduction forward (no
historical backfill).
## Relates to
Closes: https://github.com/coder/coder/issues/19528
Native histograms tested in observability stack:
https://github.com/coder/observability/pull/50
* chore: drop github per user rate limit tracking
Rate limits for authenticated requests are per user.
This would be an excessive number of prometheus labels,
so we only track the unauthorized limit.
This PR adds the prometheus metric coderd_workspace_builds_total.
It measures the total number of workspace builds, along with a number of labels intended to be useful for an operator debugging a failed workspace build trying to discover the scope of the issue.