mirror of https://github.com/coder/coder.git synced 2026-06-03 04:58:23 +00:00

Jon Ayers 6035e45cb8 feat: add e2e workspace build duration metric (#21739)

Adds coderd_template_workspace_build_duration_seconds histogram that
tracks the full duration from workspace build creation to agent ready.
This captures the complete user-perceived build time including
provisioning and agent startup.

The metric is emitted when the agent reports ready/error/timeout via the
lifecycle API, ensuring each build is counted exactly once per replica.

2026-02-06 16:26:02 -06:00

Prometheus

Coder exposes many metrics that a Prometheus server can consume. They give insight into the current state of a live Coder deployment.

If you don't have a Prometheus server installed, you can follow the Prometheus Getting started guide.

Enable Prometheus metrics

The Coder server exports metrics via an HTTP endpoint, which can be enabled using either the environment variable CODER_PROMETHEUS_ENABLE or the flag --prometheus-enable.

The Prometheus endpoint address is http://localhost:2112/ by default. You can use either the environment variable CODER_PROMETHEUS_ADDRESS or the flag --prometheus-address <network-interface>:<port> to select a different listen address.

If coder server --prometheus-enable is started locally, you can preview the metrics endpoint in your browser or with curl:

$ curl http://localhost:2112/
# HELP coderd_api_active_users_duration_hour The number of users that have been active within the last hour.
# TYPE coderd_api_active_users_duration_hour gauge
coderd_api_active_users_duration_hour 0
...

Kubernetes deployment

The Prometheus endpoint can be enabled in the Helm chart's values.yaml by setting CODER_PROMETHEUS_ENABLE=true. Once enabled, the environment variable CODER_PROMETHEUS_ADDRESS is set to 0.0.0.0:2112 by default. The chart does not expose the Prometheus port on a Service; if you need one (for example, to use a ServiceMonitor), create a separate headless service instead.

apiVersion: v1
kind: Service
metadata:
  name: coder-prom
  namespace: coder
spec:
  clusterIP: None
  ports:
    - name: prom-http
      port: 2112
      protocol: TCP
      targetPort: 2112
  selector:
    app.kubernetes.io/instance: coder
    app.kubernetes.io/name: coder
  type: ClusterIP
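
The Helm setting described above can be sketched as a values fragment. This is a minimal example, assuming the chart accepts extra environment variables as a list under coder.env; check the values file of your chart version before relying on it.

```yaml
# Helm values fragment (sketch). Assumes the chart reads extra
# environment variables from the coder.env list.
coder:
  env:
    # Turns on the Prometheus metrics endpoint of the Coder server.
    - name: CODER_PROMETHEUS_ENABLE
      value: "true"
```

Environment variable values are quoted because Kubernetes expects them to be strings.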

Prometheus configuration

To allow Prometheus to scrape the Coder metrics, you will need to create a scrape_config in your prometheus.yml file, or in the Prometheus Helm chart values. The following is an example scrape_config.

scrape_configs:
  - job_name: "coder"
    scheme: "http"
    static_configs:
      # replace with the IP address of the Coder pod or server
      - targets: ["<ip>:2112"]
        labels:
          apps: "coder"

To use the Kubernetes Prometheus operator to scrape metrics, you will need to create a ServiceMonitor in your Coder deployment namespace. The following is an example ServiceMonitor.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: coder-service-monitor
  namespace: coder
spec:
  endpoints:
    - port: prom-http
      interval: 10s
      scrapeTimeout: 10s
  namespaceSelector:
    matchNames:
    - coder
  selector:
    matchLabels:
      app.kubernetes.io/name: coder

Available metrics

The coderd_agentstats_* metrics must first be enabled with the flag --prometheus-collect-agent-stats, or the environment variable CODER_PROMETHEUS_COLLECT_AGENT_STATS, before they can be retrieved from the deployment. They are always available from the agent.

Name	Type	Description	Labels
`agent_scripts_executed_total`	counter	Total number of scripts executed by the Coder agent. Includes cron scheduled scripts.	`agent_name` `success` `template_name` `username` `workspace_name`
`coder_aibridged_injected_tool_invocations_total`	counter	The number of times an injected MCP tool was invoked by aibridge.	`model` `name` `provider` `server`
`coder_aibridged_interceptions_duration_seconds`	histogram	The total duration of intercepted requests, in seconds. The majority of this time will be the upstream processing of the request. aibridge has no control over upstream processing time, so it's just an illustrative metric.	`model` `provider`
`coder_aibridged_interceptions_inflight`	gauge	The number of intercepted requests which are being processed.	`model` `provider` `route`
`coder_aibridged_interceptions_total`	counter	The count of intercepted requests.	`initiator_id` `method` `model` `provider` `route` `status`
`coder_aibridged_non_injected_tool_selections_total`	counter	The number of times an AI model selected a tool to be invoked by the client.	`model` `name` `provider`
`coder_aibridged_prompts_total`	counter	The number of prompts issued by users (initiators).	`initiator_id` `model` `provider`
`coder_aibridged_tokens_total`	counter	The number of tokens used by intercepted requests.	`initiator_id` `model` `provider` `type`
`coderd_agentapi_metadata_batch_size`	histogram	Total number of metadata entries in each batch, updated before flushes.
`coderd_agentapi_metadata_batch_utilization`	histogram	Number of metadata keys per agent in each batch, updated before flushes.
`coderd_agentapi_metadata_batches_total`	counter	Total number of metadata batches flushed.	`reason`
`coderd_agentapi_metadata_dropped_keys_total`	counter	Total number of metadata keys dropped due to capacity limits.
`coderd_agentapi_metadata_flush_duration_seconds`	histogram	Time taken to flush metadata batch to database and pubsub.	`reason`
`coderd_agentapi_metadata_flushed_total`	counter	Total number of unique metadatas flushed.
`coderd_agentapi_metadata_publish_errors_total`	counter	Total number of metadata batch pubsub publish calls that have resulted in an error.
`coderd_agents_apps`	gauge	Agent applications with statuses.	`agent_name` `app_name` `health` `username` `workspace_name`
`coderd_agents_connection_latencies_seconds`	gauge	Agent connection latencies in seconds.	`agent_name` `derp_region` `preferred` `username` `workspace_name`
`coderd_agents_connections`	gauge	Agent connections with statuses.	`agent_name` `lifecycle_state` `status` `tailnet_node` `username` `workspace_name`
`coderd_agents_up`	gauge	The number of active agents per workspace.	`template_name` `username` `workspace_name`
`coderd_agentstats_connection_count`	gauge	The number of established connections by agent	`agent_name` `username` `workspace_name`
`coderd_agentstats_connection_median_latency_seconds`	gauge	The median agent connection latency	`agent_name` `username` `workspace_name`
`coderd_agentstats_currently_reachable_peers`	gauge	The number of peers (e.g. clients) that are currently reachable over the encrypted network.	`agent_name` `connection_type` `template_name` `username` `workspace_name`
`coderd_agentstats_rx_bytes`	gauge	Agent Rx bytes	`agent_name` `username` `workspace_name`
`coderd_agentstats_session_count_jetbrains`	gauge	The number of session established by JetBrains	`agent_name` `username` `workspace_name`
`coderd_agentstats_session_count_reconnecting_pty`	gauge	The number of session established by reconnecting PTY	`agent_name` `username` `workspace_name`
`coderd_agentstats_session_count_ssh`	gauge	The number of session established by SSH	`agent_name` `username` `workspace_name`
`coderd_agentstats_session_count_vscode`	gauge	The number of session established by VSCode	`agent_name` `username` `workspace_name`
`coderd_agentstats_startup_script_seconds`	gauge	The number of seconds the startup script took to execute.	`agent_name` `success` `template_name` `username` `workspace_name`
`coderd_agentstats_tx_bytes`	gauge	Agent Tx bytes	`agent_name` `username` `workspace_name`
`coderd_api_active_users_duration_hour`	gauge	The number of users that have been active within the last hour.
`coderd_api_concurrent_requests`	gauge	The number of concurrent API requests.
`coderd_api_concurrent_websockets`	gauge	The total number of concurrent API websockets.
`coderd_api_request_latencies_seconds`	histogram	Latency distribution of requests in seconds.	`method` `path`
`coderd_api_requests_processed_total`	counter	The total number of processed API requests	`code` `method` `path`
`coderd_api_websocket_durations_seconds`	histogram	Websocket duration distribution of requests in seconds.	`path`
`coderd_api_workspace_latest_build`	gauge	The latest workspace builds with a status.	`status`
`coderd_insights_applications_usage_seconds`	gauge	The application usage per template.	`application_name` `slug` `template_name`
`coderd_insights_parameters`	gauge	The parameter usage per template.	`parameter_name` `parameter_type` `parameter_value` `template_name`
`coderd_insights_templates_active_users`	gauge	The number of active users of the template.	`template_name`
`coderd_license_active_users`	gauge	The number of active users.
`coderd_license_errors`	gauge	The number of active license errors.
`coderd_license_limit_users`	gauge	The user seats limit based on the active Coder license.
`coderd_license_user_limit_enabled`	gauge	Returns 1 if the current license enforces the user limit.
`coderd_license_warnings`	gauge	The number of active license warnings.
`coderd_metrics_collector_agents_execution_seconds`	histogram	Histogram for duration of agents metrics collection in seconds.
`coderd_oauth2_external_requests_rate_limit`	gauge	The total number of allowed requests per interval.	`name` `resource`
`coderd_oauth2_external_requests_rate_limit_next_reset_unix`	gauge	Unix timestamp of the next interval	`name` `resource`
`coderd_oauth2_external_requests_rate_limit_remaining`	gauge	The remaining number of allowed requests in this interval.	`name` `resource`
`coderd_oauth2_external_requests_rate_limit_reset_in_seconds`	gauge	Seconds until the next interval	`name` `resource`
`coderd_oauth2_external_requests_rate_limit_used`	gauge	The number of requests made in this interval.	`name` `resource`
`coderd_oauth2_external_requests_total`	counter	The total number of api calls made to external oauth2 providers. 'status_code' will be 0 if the request failed with no response.	`name` `source` `status_code`
`coderd_prebuilt_workspace_claim_duration_seconds`	histogram	Time to claim a prebuilt workspace by organization, template, and preset.	`organization_name` `preset_name` `template_name`
`coderd_provisionerd_job_timings_seconds`	histogram	The provisioner job time duration in seconds.	`provisioner` `status`
`coderd_provisionerd_jobs_current`	gauge	The number of currently running provisioner jobs.	`provisioner`
`coderd_provisionerd_num_daemons`	gauge	The number of provisioner daemons.
`coderd_provisionerd_workspace_build_timings_seconds`	histogram	The time taken for a workspace to build.	`status` `template_name` `template_version` `workspace_transition`
`coderd_template_workspace_build_duration_seconds`	histogram	Duration from workspace build creation to agent ready, by template.	`is_prebuild` `organization_name` `status` `template_name` `transition`
`coderd_workspace_builds_total`	counter	The number of workspaces started, updated, or deleted.	`action` `owner_email` `status` `template_name` `template_version` `workspace_name`
`coderd_workspace_creation_duration_seconds`	histogram	Time to create a workspace by organization, template, preset, and type (regular or prebuild).	`organization_name` `preset_name` `template_name` `type`
`coderd_workspace_creation_total`	counter	Total regular (non-prebuilt) workspace creations by organization, template, and preset.	`organization_name` `preset_name` `template_name`
`coderd_workspace_latest_build_status`	gauge	The current workspace statuses by template, transition, and owner.	`status` `template_name` `template_version` `workspace_owner` `workspace_transition`
`go_gc_duration_seconds`	summary	A summary of the pause duration of garbage collection cycles.
`go_goroutines`	gauge	Number of goroutines that currently exist.
`go_info`	gauge	Information about the Go environment.	`version`
`go_memstats_alloc_bytes`	gauge	Number of bytes allocated and still in use.
`go_memstats_alloc_bytes_total`	counter	Total number of bytes allocated, even if freed.
`go_memstats_buck_hash_sys_bytes`	gauge	Number of bytes used by the profiling bucket hash table.
`go_memstats_frees_total`	counter	Total number of frees.
`go_memstats_gc_sys_bytes`	gauge	Number of bytes used for garbage collection system metadata.
`go_memstats_heap_alloc_bytes`	gauge	Number of heap bytes allocated and still in use.
`go_memstats_heap_idle_bytes`	gauge	Number of heap bytes waiting to be used.
`go_memstats_heap_inuse_bytes`	gauge	Number of heap bytes that are in use.
`go_memstats_heap_objects`	gauge	Number of allocated objects.
`go_memstats_heap_released_bytes`	gauge	Number of heap bytes released to OS.
`go_memstats_heap_sys_bytes`	gauge	Number of heap bytes obtained from system.
`go_memstats_last_gc_time_seconds`	gauge	Number of seconds since 1970 of last garbage collection.
`go_memstats_lookups_total`	counter	Total number of pointer lookups.
`go_memstats_mallocs_total`	counter	Total number of mallocs.
`go_memstats_mcache_inuse_bytes`	gauge	Number of bytes in use by mcache structures.
`go_memstats_mcache_sys_bytes`	gauge	Number of bytes used for mcache structures obtained from system.
`go_memstats_mspan_inuse_bytes`	gauge	Number of bytes in use by mspan structures.
`go_memstats_mspan_sys_bytes`	gauge	Number of bytes used for mspan structures obtained from system.
`go_memstats_next_gc_bytes`	gauge	Number of heap bytes when next garbage collection will take place.
`go_memstats_other_sys_bytes`	gauge	Number of bytes used for other system allocations.
`go_memstats_stack_inuse_bytes`	gauge	Number of bytes in use by the stack allocator.
`go_memstats_stack_sys_bytes`	gauge	Number of bytes obtained from system for stack allocator.
`go_memstats_sys_bytes`	gauge	Number of bytes obtained from system.
`go_threads`	gauge	Number of OS threads created.
`process_cpu_seconds_total`	counter	Total user and system CPU time spent in seconds.
`process_max_fds`	gauge	Maximum number of open file descriptors.
`process_open_fds`	gauge	Number of open file descriptors.
`process_resident_memory_bytes`	gauge	Resident memory size in bytes.
`process_start_time_seconds`	gauge	Start time of the process since unix epoch in seconds.
`process_virtual_memory_bytes`	gauge	Virtual memory size in bytes.
`process_virtual_memory_max_bytes`	gauge	Maximum amount of virtual memory available in bytes.
`promhttp_metric_handler_requests_in_flight`	gauge	Current number of scrapes being served.
`promhttp_metric_handler_requests_total`	counter	Total number of scrapes by HTTP status code.	`code`
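
The histogram metrics in the table above are usually summarized as quantiles rather than read bucket by bucket. The following Prometheus rule file is a minimal sketch of that, assuming the metric is scraped as a classic histogram (so the _bucket series and the le label exist). The group name, the recorded series name, the 0.95 quantile, and the 5m window are illustrative choices, not part of Coder.

```yaml
groups:
  - name: coder-workspace-builds   # hypothetical group name
    rules:
      # 95th percentile of end-to-end workspace build duration per template.
      # The recorded series name below is hypothetical.
      - record: coder:template_workspace_build_duration_seconds:p95
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, template_name) (
              rate(coderd_template_workspace_build_duration_seconds_bucket[5m])
            )
          )
```

Keeping le in the sum by clause is required for classic histograms; any other label listed for the metric (for example status or transition) can be added to break the quantile down further.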

Note on Prometheus native histogram support

The following metrics support native histograms:

coderd_workspace_creation_duration_seconds
coderd_prebuilt_workspace_claim_duration_seconds
coderd_template_workspace_build_duration_seconds

Native histograms are an experimental Prometheus feature that removes the need to predefine bucket boundaries and allows higher-resolution buckets that adapt to deployment characteristics. Whether a metric is exposed as classic or native depends entirely on the Prometheus server configuration (see Prometheus docs for details):

If native histograms are enabled, Prometheus ingests the high-resolution histogram.
If not, it falls back to the predefined buckets.
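
As a sketch of the server-side switch, the fragment below enables native histogram ingestion on a Prometheus instance managed by the Prometheus operator. It assumes a Prometheus version in which native histograms are gated behind the native-histograms feature flag; the resource name and namespace are hypothetical. On a standalone server, the equivalent is starting Prometheus with --enable-feature=native-histograms. Consult the Prometheus docs for your version, since the mechanism may differ.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus        # hypothetical resource name
  namespace: monitoring   # hypothetical namespace
spec:
  # Passed to Prometheus as feature flags.
  enableFeatures:
    - native-histograms
```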

⚠️ Important: classic and native histograms cannot be aggregated together. If Prometheus is switched from classic to native at a certain point in time, dashboards may need to account for that transition. For this reason, it’s recommended to follow Prometheus’ migration guidelines when moving from classic to native histograms.
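
To illustrate why dashboards need to account for the transition, the rule file below records the same quantile twice: once from the classic bucket series and once from the native histogram. It is a sketch only; the group name and recorded series names are hypothetical, and the 0.95 quantile and 5m window are arbitrary.

```yaml
groups:
  - name: coder-histogram-migration   # hypothetical group name
    rules:
      # Classic histogram: reads the _bucket series and must keep the le label.
      - record: coder:workspace_creation_duration_seconds:p95_classic
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, template_name) (
              rate(coderd_workspace_creation_duration_seconds_bucket[5m])
            )
          )
      # Native histogram: reads the metric itself; there is no _bucket
      # suffix and no le label.
      - record: coder:workspace_creation_duration_seconds:p95_native
        expr: |
          histogram_quantile(
            0.95,
            sum by (template_name) (
              rate(coderd_workspace_creation_duration_seconds[5m])
            )
          )
```

Only one of the two series has data at any given time, depending on how the metric was ingested, so a panel that spans the switchover would need to query both.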
