Susana Ferreira c1f8465de6 fix: add missing provisionerd metrics to docs (#20358)
## Description

Add missing provisionerd metrics to Prometheus documentation:
* `coderd_provisionerd_num_daemons`: The number of provisioner daemons.
* `coderd_provisionerd_workspace_build_timings_seconds`: The time taken
for a workspace to build.

Related to internal thread:
https://codercom.slack.com/archives/C07GRNNRW03/p1760642020583019
2025-10-20 11:33:45 +01:00

# Prometheus

Coder exposes many metrics that a Prometheus server can consume and that give
insight into the current state of a live Coder deployment.

If you don't have a Prometheus server installed, you can follow the Prometheus
[Getting started](https://prometheus.io/docs/prometheus/latest/getting_started/) guide.

## Enable Prometheus metrics

Coder server exports metrics via an HTTP endpoint, which can be enabled using
either the environment variable `CODER_PROMETHEUS_ENABLE` or the flag
`--prometheus-enable`.

The Prometheus endpoint address is `http://localhost:2112/` by default. You can
use either the environment variable `CODER_PROMETHEUS_ADDRESS` or the flag
`--prometheus-address <network-interface>:<port>` to select a different listen
address.

If `coder server --prometheus-enable` is started locally, you can preview the
metrics endpoint in your browser or with `curl`:
```console
$ curl http://localhost:2112/
# HELP coderd_api_active_users_duration_hour The number of users that have been active within the last hour.
# TYPE coderd_api_active_users_duration_hour gauge
coderd_api_active_users_duration_hour 0
...
```
### Kubernetes deployment

The Prometheus endpoint can be enabled in the [Helm chart's](https://github.com/coder/coder/tree/main/helm)
`values.yml` by setting `CODER_PROMETHEUS_ENABLE=true`. Once enabled, the
environment variable `CODER_PROMETHEUS_ADDRESS` will be set by default to
`0.0.0.0:2112`. A Service Endpoint will not be exposed; if you need to
expose the Prometheus port on a Service (for example, to use a
`ServiceMonitor`), create a separate headless service instead.
```yaml
apiVersion: v1
kind: Service
metadata:
  name: coder-prom
  namespace: coder
spec:
  clusterIP: None
  ports:
    - name: prom-http
      port: 2112
      protocol: TCP
      targetPort: 2112
  selector:
    app.kubernetes.io/instance: coder
    app.kubernetes.io/name: coder
  type: ClusterIP
```
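As a minimal sketch of the `CODER_PROMETHEUS_ENABLE=true` setting described above, the fragment below sets the variable through the chart's environment list. It assumes the chart accepts environment variables under a `coder.env` key; check the values file of your chart version before relying on that key.

```yaml
# Helm values fragment (sketch). The `coder.env` key is an assumption;
# verify it against the values file shipped with your chart version.
coder:
  env:
    # Turns on the Prometheus metrics endpoint in the Coder server.
    - name: CODER_PROMETHEUS_ENABLE
      value: "true"
```

The value is quoted because Kubernetes environment variable values must be strings.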

### Prometheus configuration

To allow Prometheus to scrape the Coder metrics, you will need to create a
`scrape_config` in your `prometheus.yml` file, or in the Prometheus Helm chart
values. The following is an example `scrape_config`.
```yaml
scrape_configs:
  - job_name: "coder"
    scheme: "http"
    static_configs:
      # replace with the IP address of the Coder pod or server
      - targets: ["<ip>:2112"]
        labels:
          apps: "coder"
```
To use the Kubernetes Prometheus operator to scrape metrics, you will need to
create a `ServiceMonitor` in your Coder deployment namespace. The following is
an example `ServiceMonitor`.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: coder-service-monitor
  namespace: coder
spec:
  endpoints:
    - port: prom-http
      interval: 10s
      scrapeTimeout: 10s
  namespaceSelector:
    matchNames:
      - coder
  selector:
    matchLabels:
      app.kubernetes.io/name: coder
```

## Available metrics

The `coderd_agentstats_*` metrics must first be enabled with the flag
`--prometheus-collect-agent-stats`, or the environment variable
`CODER_PROMETHEUS_COLLECT_AGENT_STATS`, before they can be retrieved from the
deployment. They are always available from the agent.

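For a Kubernetes deployment, agent stats collection could be switched on next to the metrics endpoint itself. This is a sketch only, and it assumes the Helm chart accepts environment variables under a `coder.env` key.

```yaml
# Helm values fragment (sketch). The `coder.env` key is an assumption;
# verify it against the values file shipped with your chart version.
coder:
  env:
    # Enables the Prometheus metrics endpoint.
    - name: CODER_PROMETHEUS_ENABLE
      value: "true"
    # Enables collection of the coderd_agentstats_* metrics.
    - name: CODER_PROMETHEUS_COLLECT_AGENT_STATS
      value: "true"
```
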
<!-- Code generated by 'make docs/admin/integrations/prometheus.md'. DO NOT EDIT -->
| Name | Type | Description | Labels |
|---------------------------------------------------------------|-----------|----------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------|
| `agent_scripts_executed_total` | counter | Total number of scripts executed by the Coder agent. Includes cron scheduled scripts. | `agent_name` `success` `template_name` `username` `workspace_name` |
| `coderd_agents_apps` | gauge | Agent applications with statuses. | `agent_name` `app_name` `health` `username` `workspace_name` |
| `coderd_agents_connection_latencies_seconds` | gauge | Agent connection latencies in seconds. | `agent_name` `derp_region` `preferred` `username` `workspace_name` |
| `coderd_agents_connections` | gauge | Agent connections with statuses. | `agent_name` `lifecycle_state` `status` `tailnet_node` `username` `workspace_name` |
| `coderd_agents_up` | gauge | The number of active agents per workspace. | `template_name` `username` `workspace_name` |
| `coderd_agentstats_connection_count` | gauge | The number of established connections by agent | `agent_name` `username` `workspace_name` |
| `coderd_agentstats_connection_median_latency_seconds` | gauge | The median agent connection latency | `agent_name` `username` `workspace_name` |
| `coderd_agentstats_currently_reachable_peers` | gauge | The number of peers (e.g. clients) that are currently reachable over the encrypted network. | `agent_name` `connection_type` `template_name` `username` `workspace_name` |
| `coderd_agentstats_rx_bytes` | gauge | Agent Rx bytes | `agent_name` `username` `workspace_name` |
| `coderd_agentstats_session_count_jetbrains` | gauge | The number of session established by JetBrains | `agent_name` `username` `workspace_name` |
| `coderd_agentstats_session_count_reconnecting_pty` | gauge | The number of session established by reconnecting PTY | `agent_name` `username` `workspace_name` |
| `coderd_agentstats_session_count_ssh` | gauge | The number of session established by SSH | `agent_name` `username` `workspace_name` |
| `coderd_agentstats_session_count_vscode` | gauge | The number of session established by VSCode | `agent_name` `username` `workspace_name` |
| `coderd_agentstats_startup_script_seconds` | gauge | The number of seconds the startup script took to execute. | `agent_name` `success` `template_name` `username` `workspace_name` |
| `coderd_agentstats_tx_bytes` | gauge | Agent Tx bytes | `agent_name` `username` `workspace_name` |
| `coderd_api_active_users_duration_hour` | gauge | The number of users that have been active within the last hour. | |
| `coderd_api_concurrent_requests` | gauge | The number of concurrent API requests. | |
| `coderd_api_concurrent_websockets` | gauge | The total number of concurrent API websockets. | |
| `coderd_api_request_latencies_seconds` | histogram | Latency distribution of requests in seconds. | `method` `path` |
| `coderd_api_requests_processed_total` | counter | The total number of processed API requests | `code` `method` `path` |
| `coderd_api_websocket_durations_seconds` | histogram | Websocket duration distribution of requests in seconds. | `path` |
| `coderd_api_workspace_latest_build` | gauge | The latest workspace builds with a status. | `status` |
| `coderd_api_workspace_latest_build_total` | gauge | DEPRECATED: use coderd_api_workspace_latest_build instead | `status` |
| `coderd_insights_applications_usage_seconds` | gauge | The application usage per template. | `application_name` `slug` `template_name` |
| `coderd_insights_parameters` | gauge | The parameter usage per template. | `parameter_name` `parameter_type` `parameter_value` `template_name` |
| `coderd_insights_templates_active_users` | gauge | The number of active users of the template. | `template_name` |
| `coderd_license_active_users` | gauge | The number of active users. | |
| `coderd_license_limit_users` | gauge | The user seats limit based on the active Coder license. | |
| `coderd_license_user_limit_enabled` | gauge | Returns 1 if the current license enforces the user limit. | |
| `coderd_metrics_collector_agents_execution_seconds` | histogram | Histogram for duration of agents metrics collection in seconds. | |
| `coderd_oauth2_external_requests_rate_limit` | gauge | The total number of allowed requests per interval. | `name` `resource` |
| `coderd_oauth2_external_requests_rate_limit_next_reset_unix` | gauge | Unix timestamp of the next interval | `name` `resource` |
| `coderd_oauth2_external_requests_rate_limit_remaining` | gauge | The remaining number of allowed requests in this interval. | `name` `resource` |
| `coderd_oauth2_external_requests_rate_limit_reset_in_seconds` | gauge | Seconds until the next interval | `name` `resource` |
| `coderd_oauth2_external_requests_rate_limit_total` | gauge | DEPRECATED: use coderd_oauth2_external_requests_rate_limit instead | `name` `resource` |
| `coderd_oauth2_external_requests_rate_limit_used` | gauge | The number of requests made in this interval. | `name` `resource` |
| `coderd_oauth2_external_requests_total` | counter | The total number of api calls made to external oauth2 providers. 'status_code' will be 0 if the request failed with no response. | `name` `source` `status_code` |
| `coderd_prebuilt_workspace_claim_duration_seconds` | histogram | Time to claim a prebuilt workspace by organization, template, and preset. | `organization_name` `preset_name` `template_name` |
| `coderd_provisionerd_job_timings_seconds` | histogram | The provisioner job time duration in seconds. | `provisioner` `status` |
| `coderd_provisionerd_jobs_current` | gauge | The number of currently running provisioner jobs. | `provisioner` |
| `coderd_provisionerd_num_daemons` | gauge | The number of provisioner daemons. | |
| `coderd_provisionerd_workspace_build_timings_seconds` | histogram | The time taken for a workspace to build. | `status` `template_name` `template_version` `workspace_transition` |
| `coderd_workspace_builds_total` | counter | The number of workspaces started, updated, or deleted. | `action` `owner_email` `status` `template_name` `template_version` `workspace_name` |
| `coderd_workspace_creation_duration_seconds` | histogram | Time to create a workspace by organization, template, preset, and type (regular or prebuild). | `organization_name` `preset_name` `template_name` `type` |
| `coderd_workspace_creation_total` | counter | Total regular (non-prebuilt) workspace creations by organization, template, and preset. | `organization_name` `preset_name` `template_name` |
| `coderd_workspace_latest_build_status` | gauge | The current workspace statuses by template, transition, and owner. | `status` `template_name` `template_version` `workspace_owner` `workspace_transition` |
| `go_gc_duration_seconds` | summary | A summary of the pause duration of garbage collection cycles. | |
| `go_goroutines` | gauge | Number of goroutines that currently exist. | |
| `go_info` | gauge | Information about the Go environment. | `version` |
| `go_memstats_alloc_bytes` | gauge | Number of bytes allocated and still in use. | |
| `go_memstats_alloc_bytes_total` | counter | Total number of bytes allocated, even if freed. | |
| `go_memstats_buck_hash_sys_bytes` | gauge | Number of bytes used by the profiling bucket hash table. | |
| `go_memstats_frees_total` | counter | Total number of frees. | |
| `go_memstats_gc_sys_bytes` | gauge | Number of bytes used for garbage collection system metadata. | |
| `go_memstats_heap_alloc_bytes` | gauge | Number of heap bytes allocated and still in use. | |
| `go_memstats_heap_idle_bytes` | gauge | Number of heap bytes waiting to be used. | |
| `go_memstats_heap_inuse_bytes` | gauge | Number of heap bytes that are in use. | |
| `go_memstats_heap_objects` | gauge | Number of allocated objects. | |
| `go_memstats_heap_released_bytes` | gauge | Number of heap bytes released to OS. | |
| `go_memstats_heap_sys_bytes` | gauge | Number of heap bytes obtained from system. | |
| `go_memstats_last_gc_time_seconds` | gauge | Number of seconds since 1970 of last garbage collection. | |
| `go_memstats_lookups_total` | counter | Total number of pointer lookups. | |
| `go_memstats_mallocs_total` | counter | Total number of mallocs. | |
| `go_memstats_mcache_inuse_bytes` | gauge | Number of bytes in use by mcache structures. | |
| `go_memstats_mcache_sys_bytes` | gauge | Number of bytes used for mcache structures obtained from system. | |
| `go_memstats_mspan_inuse_bytes` | gauge | Number of bytes in use by mspan structures. | |
| `go_memstats_mspan_sys_bytes` | gauge | Number of bytes used for mspan structures obtained from system. | |
| `go_memstats_next_gc_bytes` | gauge | Number of heap bytes when next garbage collection will take place. | |
| `go_memstats_other_sys_bytes` | gauge | Number of bytes used for other system allocations. | |
| `go_memstats_stack_inuse_bytes` | gauge | Number of bytes in use by the stack allocator. | |
| `go_memstats_stack_sys_bytes` | gauge | Number of bytes obtained from system for stack allocator. | |
| `go_memstats_sys_bytes` | gauge | Number of bytes obtained from system. | |
| `go_threads` | gauge | Number of OS threads created. | |
| `process_cpu_seconds_total` | counter | Total user and system CPU time spent in seconds. | |
| `process_max_fds` | gauge | Maximum number of open file descriptors. | |
| `process_open_fds` | gauge | Number of open file descriptors. | |
| `process_resident_memory_bytes` | gauge | Resident memory size in bytes. | |
| `process_start_time_seconds` | gauge | Start time of the process since unix epoch in seconds. | |
| `process_virtual_memory_bytes` | gauge | Virtual memory size in bytes. | |
| `process_virtual_memory_max_bytes` | gauge | Maximum amount of virtual memory available in bytes. | |
| `promhttp_metric_handler_requests_in_flight` | gauge | Current number of scrapes being served. | |
| `promhttp_metric_handler_requests_total` | counter | Total number of scrapes by HTTP status code. | `code` |
<!-- End generated by 'make docs/admin/integrations/prometheus.md'. -->
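
The metrics in the table can be combined in Prometheus rule files. The following is an illustrative sketch, not a recommended rule set: the group name, rule names, thresholds, and durations are hypothetical, and only the metric and label names come from the table above.

```yaml
groups:
  - name: coder-example-rules # hypothetical group name
    rules:
      # 95th percentile workspace build time per template over 5 minutes,
      # computed from the classic histogram buckets.
      - record: template_name:coderd_workspace_build_seconds:p95_5m # hypothetical rule name
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, template_name) (
              rate(coderd_provisionerd_workspace_build_timings_seconds_bucket[5m])
            )
          )
      # Fires when more than 90% of licensed seats are in use, but only
      # while the license actually enforces a user limit.
      - alert: CoderLicenseSeatsNearlyExhausted # hypothetical alert name
        expr: |
          (coderd_license_active_users / coderd_license_limit_users > 0.9)
          and
          (coderd_license_user_limit_enabled == 1)
        for: 1h # hypothetical duration
        labels:
          severity: warning
        annotations:
          summary: "More than 90% of licensed Coder user seats are in use"
```

The recording rule keeps the `le` label in the aggregation because `histogram_quantile` needs it to read classic histogram buckets.
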
### Note on Prometheus native histogram support

The following metrics support native histograms:

* `coderd_workspace_creation_duration_seconds`
* `coderd_prebuilt_workspace_claim_duration_seconds`

Native histograms are an **experimental** Prometheus feature that removes the need to predefine bucket boundaries and allows higher-resolution buckets that adapt to deployment characteristics.
Whether Prometheus ingests a metric as a classic or native histogram depends entirely on the Prometheus server configuration (see [Prometheus docs](https://prometheus.io/docs/specs/native_histograms/) for details):

* If native histograms are enabled, Prometheus ingests the high-resolution histogram.
* If not, it falls back to the predefined buckets.

⚠️ Important: classic and native histograms cannot be aggregated together. If Prometheus is switched from classic to native at a certain point in time, dashboards may need to account for that transition.
For this reason, it's recommended to follow [Prometheus migration guidelines](https://prometheus.io/docs/specs/native_histograms/#migration-considerations) when moving from classic to native histograms.
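
The difference between the two forms also shows up in queries. The sketch below contrasts a classic and a native quantile query for `coderd_workspace_creation_duration_seconds` as recording rules. The group and rule names are hypothetical, and the native variant assumes native histogram ingestion is already enabled on the Prometheus server.

```yaml
groups:
  - name: coder-workspace-creation # hypothetical group name
    rules:
      # Classic histogram: aggregate the _bucket series and keep the le label.
      - record: template_name:coderd_workspace_creation_seconds:p95_classic # hypothetical
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, template_name) (
              rate(coderd_workspace_creation_duration_seconds_bucket[5m])
            )
          )
      # Native histogram: query the metric itself; there is no _bucket
      # suffix and no le label to aggregate over.
      - record: template_name:coderd_workspace_creation_seconds:p95_native # hypothetical
        expr: |
          histogram_quantile(
            0.95,
            sum by (template_name) (
              rate(coderd_workspace_creation_duration_seconds[5m])
            )
          )
```

Keeping both rules during a migration is one way to let dashboards read whichever series has data for the time range in view.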