Commit Graph

19 Commits

Author SHA1 Message Date
Jon Ayers 6c44de951d feat: add Prometheus collector for DERP server expvar metrics (#22583)
This PR does three things:
- Exports derp expvars to the pprof endpoint
- Exports the expvar metrics as prometheus metrics in both coderd and
wsproxy
- Updates our tailscale to a fix I also had to make to avoid a data race
condition

I generated this with mux but I also manually tested that the metrics
were getting properly emitted
2026-03-06 01:57:58 -06:00
Zach 5b7377c375 feat: add Prometheus metrics for boundary log drop reporting (#22521)
Add Prometheus metrics to the boundary log proxy for observability:
- batches_dropped_total (reason: buffer_full, forward_failed)
- logs_dropped_total (reason: buffer_full, forward_failed,
  boundary_channel_full, boundary_batch_full)
- batches_forwarded_total

Also add BoundaryStatus to the BoundaryMessage envelope so boundary
can report dropped log counts as a separate wire message. The agent
records these as Prometheus metrics, making boundary-side data loss
visible. Backwards compatibility for older versions of boundary is maintained.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 12:42:34 -07:00
Susana Ferreira ca234f346d fix: mark presets as validation_failed to prevent endless prebuild retries (#22085)
## Description

- Updates `wsbuilder` to return a `BuildError` with
`http.StatusBadRequest` to signify a "validation error" on missing or
invalid parameters
- Adds a short-circuit in `prebuilds.StoreReconciler` to mark presets
for which creating a build returns a "validation error" as "validation
failed" and skip further attempts to reconcile.
- Adds a test to verify the above
- Introduces a new Prometheus metric
`coderd_prebuilt_workspaces_preset_validation_failed` to track the above

Closes: https://github.com/coder/coder/issues/21237

---------

Co-authored-by: Cian Johnston <cian@coder.com>
2026-02-27 14:26:48 +00:00
Garrett Delfosse 4057363f78 fix(coderd): add organization_name label to insights Prometheus metrics (#22296)
## Description

When multiple organizations have templates with the same name, the
Prometheus `/metrics` endpoint returns HTTP 500 because Prometheus
rejects duplicate label combinations. The three `coderd_insights_*`
metrics (`coderd_insights_templates_active_users`,
`coderd_insights_applications_usage_seconds`,
`coderd_insights_parameters`) used only `template_name` as a
distinguishing label, so two templates named e.g. `"openstack-v1"` in
different orgs would produce duplicate metric series.

This adds `organization_name` as a label to all three insight metric
descriptors to disambiguate templates across organizations.

## Changes

**`coderd/prometheusmetrics/insights/metricscollector.go`**:
- Added `organization_name` label to all three metric descriptors
- Added `organizationNames` field (template ID → org name) to the
`insightsData` struct
- In `doTick`: after fetching templates, collect unique org IDs, fetch
organizations via `GetOrganizations`, and build a
template-ID-to-org-name mapping
- In `Collect()`: pass the organization name as an additional label
value in every `MustNewConstMetric` call

**`coderd/prometheusmetrics/insights/testdata/insights-metrics.json`**:
Updated golden file to include `organization_name=coder` in all metric
label keys.

Fixes #21748
2026-02-25 08:58:50 +00:00
Susana Ferreira df84cea924 feat(scripts/metricsdocgen): support merging static and generated metrics files (#21464)
## Description

This PR refactors `scripts/metricsdocgen/main.go` to support merging static and generated metrics files for documentation generation.

The static `metrics` file remains necessary for metrics not defined in the coder codebase (`go_*`, `process_*`, `promhttp_*`, `coder_aibridged_*`), as well as **edge cases** the scanner cannot handle (e.g.,  such as metrics with runtime-determined labels or function-local variable references for fields, ...). Handling these edge cases in the scanner would make it significantly more complex, so we keep this hybrid approach to accommodate them. This means that in such cases, developers need to update the `metrics` file directly, meaning there is still a risk of out-of-date information in the documentation. However, this solution should already encompass most cases.

Static metrics take priority over generated metrics when both files contain the same metric name, allowing manual overrides without modifying the scanner. Some of these edge cases could be easily fixed by updating the codebase to use one of the supported patterns.

## Changes

* Update `scripts/metricsdocgen/main.go` to read from two separate metrics files:
  * `metrics`: static, manually maintained metrics (e.g., `go_*`, `process_*`, `promhttp_*`, `coder_aibridged_*`)
  * `generated_metrics`: auto-generated by the AST scanner
* Update `metrics` file to contain only static and edge-case metrics
* Skip metrics with empty HELP descriptions in the scanner
* Update `generated_metrics` to reflect skipped metrics
* Update `docs/admin/integrations/prometheus.md` with merged metrics

Related to: https://github.com/coder/coder/issues/13223

**Disclosure:** This PR was mainly developed with Claude Sonnet 4, with iterative review and refinement by @ssncferreira
2026-02-13 12:19:33 +00:00
Callum Styan 5f3be6b288 feat: add provisioner job queue wait time histogram and jobs enqueued counter (#21869)
This PR adds some metrics to help identify job enqueue rates and
latencies. This work was initiated as a way to help reduce the cost of
the observation/measurement itself for autostart scaletests, which
impacts our ability to identify/reason about the load caused by
autostart. See: https://github.com/coder/internal/issues/1209

I've extended the metrics here to account for regular user initiated
builds, prebuilds, autostarts, etc. IMO there is still the question here
of whether we want to include or need the `transition` label, which is
only present on workspace builds. Including it does lead to an increase
in cardinality, and in the case of the histogram (when not using native
histograms) that's at least a few extra series for every bucket. We
could remove the transition label there but keep it on the counter.

Additionally, the histogram is currently observing latencies for other
jobs, such as template builds/version imports, those do not have a
transition type associated with them.

Tested briefly in a workspace, can see metric values like the following:
-
`coderd_workspace_builds_enqueued_total{build_reason="autostart",provisioner_type="terraform",status="success",transition="start"}
1`
-
`coderd_provisioner_job_queue_wait_seconds_bucket{build_reason="autostart",job_type="workspace_build",provisioner_type="terraform",transition="start",le="0.025"}
1`

---------

Signed-off-by: Callum Styan <callumstyan@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-12 13:40:47 -08:00
Jon Ayers 6035e45cb8 feat: add e2e workspace build duration metric (#21739)
Adds coderd_template_workspace_build_duration_seconds histogram that
tracks the full duration from workspace build creation to agent ready.
This captures the complete user-perceived build time including
provisioning and agent startup.

The metric is emitted when the agent reports ready/error/timeout via the
lifecycle API, ensuring each build is counted exactly once per replica.
2026-02-06 16:26:02 -06:00
Marcin Tojek 036ed5672f fix!: remove deprecated prometheus metrics (#21788)
## Description

Removes the following deprecated Prometheus metrics:

- `coderd_api_workspace_latest_build_total` → use
`coderd_api_workspace_latest_build` instead
- `coderd_oauth2_external_requests_rate_limit_total` → use
`coderd_oauth2_external_requests_rate_limit` instead

These metrics were deprecated in #12976 because gauge metrics should
avoid the `_total` suffix per [Prometheus naming
conventions](https://prometheus.io/docs/practices/naming/).

## Changes

- Removed deprecated metric `coderd_api_workspace_latest_build_total`
from `coderd/prometheusmetrics/prometheusmetrics.go`
- Removed deprecated metric
`coderd_oauth2_external_requests_rate_limit_total` from
`coderd/promoauth/oauth2.go`
- Updated tests to use the non-deprecated metric name

Fixes #12999
2026-01-30 13:30:06 +01:00
Marcin Tojek 04b0253e8a feat: add Prometheus metrics for license warnings and errors (#21749)
Fixes: coder/internal#767

Adds two new Prometheus metrics for license health monitoring:

- `coderd_license_warnings` - count of active license warnings
- `coderd_license_errors` - count of active license errors

Metrics endpoint after startup of a deployment with license enabled:

```
...
# HELP coderd_license_errors The number of active license errors.
# TYPE coderd_license_errors gauge
coderd_license_errors 0
...
# HELP coderd_license_warnings The number of active license warnings.
# TYPE coderd_license_warnings gauge
coderd_license_warnings 0
...
```
2026-01-29 13:50:15 +01:00
Callum Styan 806d7e4c11 docs: update metrics docs to include metadata batcher metrics (#21665)
This updates the metrics docs to include metrics added in
https://github.com/coder/coder/pull/21330

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2026-01-26 09:22:14 -08:00
Danny Kopping c6631e1e50 feat: expose aibridged metrics (#20865)
Upgrades `coder/aibridge` to v0.2.0 which includes
https://github.com/coder/aibridge/pull/62.

Creates a `prometheus.Registerer` with a prefix `coder_aibridged_` and
passes that along to coder/aibridge which actually exposes the metrics.

Also includes a side-effect of a change described in
https://github.com/coder/aibridge/pull/62#discussion_r2550017470.

---------

Signed-off-by: Danny Kopping <danny@coder.com>
2025-11-24 18:16:06 +02:00
Susana Ferreira c1f8465de6 fix: add missing provisionerd metrics to docs (#20358)
## Description

Add missing provisionerd metrics to Prometheus documentation:
* `coderd_provisionerd_num_daemons`: The number of provisioner daemons.
* `coderd_provisionerd_workspace_build_timings_seconds`: The time taken
for a workspace to build.

Related to internal thread:
https://codercom.slack.com/archives/C07GRNNRW03/p1760642020583019
2025-10-20 11:33:45 +01:00
Susana Ferreira 0ab345ca84 feat: add prebuild timing metrics to Prometheus (#19503)
## Description

This PR introduces one counter and two histograms related to workspace
creation and claiming. The goal is to provide clearer observability into
how workspaces are created (regular vs prebuild) and the time cost of
those operations.

### `coderd_workspace_creation_total`

* Metric type: Counter
* Name: `coderd_workspace_creation_total`
* Labels: `organization_name`, `template_name`, `preset_name`

This counter tracks whether a regular workspace (not created from a
prebuild pool) was created using a preset or not.
Currently, we already expose `coderd_prebuilt_workspaces_claimed_total`
for claimed prebuilt workspaces, but we lack a comparable metric for
regular workspace creations. This metric fills that gap, making it
possible to compare regular creations against claims.

Implementation notes:
* Exposed as a `coderd_` metric, consistent with other workspace-related
metrics (e.g. `coderd_api_workspace_latest_build`:
https://github.com/coder/coder/blob/main/coderd/prometheusmetrics/prometheusmetrics.go#L149).
* Every `defaultRefreshRate` (1 minute ), DB query
`GetRegularWorkspaceCreateMetrics` is executed to fetch all regular
workspaces (not created from a prebuild pool).
* The counter is updated with the total from all time (not just since
metric introduction). This differs from the histograms below, which only
accumulate from their introduction forward.

### `coderd_workspace_creation_duration_seconds` &
`coderd_prebuilt_workspace_claim_duration_seconds`

* Metric types: Histogram
* Names:
  * `coderd_workspace_creation_duration_seconds`
* Labels: `organization_name`, `template_name`, `preset_name`, `type`
(`regular`, `prebuild`)
  * `coderd_prebuilt_workspace_claim_duration_seconds`
    * Labels: `organization_name`, `template_name`, `preset_name`

We already have `coderd_provisionerd_workspace_build_timings_seconds`,
which tracks build run times for all workspace builds handled by the
provisioner daemon.
However, in the context of this issue, we are only interested in
creation and claim build times, not all transitions; additionally, this
metric does not include `preset_name`, and adding it there would
significantly increase cardinality. Therefore, separate more focused
metrics are introduced here:
* `coderd_workspace_creation_duration_seconds`: Build time to create a
workspace (either a regular workspace or the build into a prebuild pool,
for prebuild initial provisioning build).
* `coderd_prebuilt_workspace_claim_duration_seconds`: Time to claim a
prebuilt workspace from the pool.

The reason for two separate histograms is that:
* Creation (regular or prebuild): provisioning builds with similar time
magnitude, generally expected to take longer than a claim operation.
* Claim: expected to be a much faster provisioning build.

#### Native histogram usage

Provisioning times vary widely between projects. Using static buckets
risks unbalanced or poorly informative histograms.
To address this, these metrics use [Prometheus native
histograms](https://prometheus.io/docs/specs/native_histograms/):
* First introduced in Prometheus v2.40.0
* Recommended stable usage from v2.45+
* Requires Go client `prometheus/client_golang` v1.15.0+
* Experimental and must be explicitly enabled on the server
(`--enable-feature=native-histograms`)

For compatibility, we also retain a classic bucket definition (aligned
with the existing provisioner metric:
https://github.com/coder/coder/blob/main/provisionerd/provisionerd.go#L182-L189).
* If native histograms are enabled, Prometheus ingests the
high-resolution histogram.
* If not, it falls back to the predefined buckets.

Implementation notes:
* Unlike the counter, these histograms are updated in real-time at
workspace build job completion.
* They reflect data only from the point of introduction forward (no
historical backfill).

## Relates to 

Closes: https://github.com/coder/coder/issues/19528
Native histograms tested in observability stack:
https://github.com/coder/observability/pull/50
2025-08-28 15:00:26 +01:00
Edward Angert cf7d143e43 docs: use consistent examples in prometheus doc and add namespaceSelector spec (#16918)
closes: #15385 

- use consistent `prom-http` port (@johnstcn looks like this was
changed/added in #12214 - do we prefer `prom-http` over
`prometheus-http` or is it more important that they align?)
- add `namespaceSelector:` per @francisco-mata (thanks! - sorry it took
so long to get this in)

   from issue:

> For some reason our target was not appearing on our prometheus
targets, we had to add a namespaceSelector key on the Service Monitor to
successfully appear

Co-authored-by: EdwardAngert <17991901+EdwardAngert@users.noreply.github.com>
2025-03-13 22:09:26 -04:00
Michael Vincent Patterson 5295902596 docs: clarified prometheus integration behavior (#16724)
Closes issue #16538 
Updated docs to explain Behavior of enabling Prometheus
2025-02-26 19:30:41 +00:00
Gregory McCue 08dd2ab4cc docs: fix typo in prometheus.md (#16091)
Fixes small `scrape_config` typo in `prometheus.md`
2025-01-10 12:02:25 -05:00
Charlie Voiselle 4e0963966d docs: markdown fixes and edits (#15527)
- **docs: improve admonition for need to add useHttpPath**
- **docs: fix list item nesting**
- **docs: fix list item nesting**
- **docs: improve admonition for authentication**
- **docs: tidy and update vault guide**
- **docs: improve admonitions**
- **docs: improve admonitions**
- **docs: content edits, reference links to make copy easier to read**

previews:
- <https://coder.com/docs/@fix-guides-list-numbers/admin/external-auth>
-
<https://coder.com/docs/@fix-guides-list-numbers/admin/integrations/island>

---------

Co-authored-by: EdwardAngert <17991901+EdwardAngert@users.noreply.github.com>
Co-authored-by: EdwardAngert <EdwardAngert@users.noreply.github.com>
2025-01-03 14:13:46 -05:00
Muhammad Atif Ali 94f5d52fdc chore: adopt markdownlint and markdown-table-formatter for *.md (#15831)
Co-authored-by: Edward Angert <EdwardAngert@users.noreply.github.com>
2025-01-03 13:12:59 +00:00
Muhammad Atif Ali 419eba5fb6 docs: restructure docs (#14421)
Closes #13434 
Supersedes #14182

---------

Co-authored-by: Ethan <39577870+ethanndickson@users.noreply.github.com>
Co-authored-by: Ethan Dickson <ethan@coder.com>
Co-authored-by: Ben Potter <ben@coder.com>
Co-authored-by: Stephen Kirby <58410745+stirby@users.noreply.github.com>
Co-authored-by: Stephen Kirby <me@skirby.dev>
Co-authored-by: EdwardAngert <17991901+EdwardAngert@users.noreply.github.com>
Co-authored-by: Edward Angert <EdwardAngert@users.noreply.github.com>
2024-10-05 10:52:04 -05:00