coder/docs/admin/integrations/prometheus.md
Jon Ayers 6035e45cb8 feat: add e2e workspace build duration metric (#21739)
Adds coderd_template_workspace_build_duration_seconds histogram that
tracks the full duration from workspace build creation to agent ready.
This captures the complete user-perceived build time including
provisioning and agent startup.

The metric is emitted when the agent reports ready/error/timeout via the
lifecycle API, ensuring each build is counted exactly once per replica.
2026-02-06 16:26:02 -06:00


Prometheus

Coder exposes many metrics that a Prometheus server can consume. These metrics give insight into the current state of a live Coder deployment.

If you don't have a Prometheus server installed, you can follow the Prometheus Getting started guide.

Enable Prometheus metrics

Coder server exports metrics via an HTTP endpoint, which is disabled by default and can be enabled using either the environment variable CODER_PROMETHEUS_ENABLE or the flag --prometheus-enable.

The Prometheus endpoint address is http://localhost:2112/ by default. You can use either the environment variable CODER_PROMETHEUS_ADDRESS or the flag --prometheus-address <network-interface>:<port> to select a different listen address.

If coder server --prometheus-enable is started locally, you can preview the metrics endpoint in your browser or with curl:

$ curl http://localhost:2112/
# HELP coderd_api_active_users_duration_hour The number of users that have been active within the last hour.
# TYPE coderd_api_active_users_duration_hour gauge
coderd_api_active_users_duration_hour 0
...

Kubernetes deployment

The Prometheus endpoint can be enabled in the Helm chart's values.yml by setting CODER_PROMETHEUS_ENABLE=true. Once enabled, the environment variable CODER_PROMETHEUS_ADDRESS will be set by default to 0.0.0.0:2112. A Service Endpoint will not be exposed; if you need to expose the Prometheus port on a Service (for example, to use a ServiceMonitor), create a separate headless service instead.

apiVersion: v1
kind: Service
metadata:
  name: coder-prom
  namespace: coder
spec:
  clusterIP: None
  ports:
    - name: prom-http
      port: 2112
      protocol: TCP
      targetPort: 2112
  selector:
    app.kubernetes.io/instance: coder
    app.kubernetes.io/name: coder
  type: ClusterIP
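For the Helm setting described at the start of this section, a minimal values sketch might look like the following. It assumes the chart accepts extra environment variables as a list under coder.env; check the values reference for your chart version before relying on that key.

```yaml
# Sketch of a Helm values override (assumes the chart reads
# environment variables from the coder.env list).
coder:
  env:
    # Turns on the Prometheus metrics endpoint in the Coder server.
    - name: CODER_PROMETHEUS_ENABLE
      value: "true"
```

The value is quoted because Kubernetes environment variable values must be strings, not YAML booleans.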

Prometheus configuration

To allow Prometheus to scrape the Coder metrics, you will need to create a scrape_config in your prometheus.yml file, or in the Prometheus Helm chart values. The following is an example scrape_config.

scrape_configs:
  - job_name: "coder"
    scheme: "http"
    static_configs:
      # replace with the IP address of the Coder pod or server
      - targets: ["<ip>:2112"]
        labels:
          apps: "coder"

To use the Kubernetes Prometheus operator to scrape metrics, you will need to create a ServiceMonitor in your Coder deployment namespace. The following is an example ServiceMonitor.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: coder-service-monitor
  namespace: coder
spec:
  endpoints:
    - port: prom-http
      interval: 10s
      scrapeTimeout: 10s
  namespaceSelector:
    matchNames:
    - coder
  selector:
    matchLabels:
      app.kubernetes.io/name: coder

Available metrics

The coderd_agentstats_* metrics are not exported by the deployment until you enable them with the flag --prometheus-collect-agent-stats or the environment variable CODER_PROMETHEUS_COLLECT_AGENT_STATS. They are always available from the agent.

| Name | Type | Description | Labels |
| --- | --- | --- | --- |
| agent_scripts_executed_total | counter | Total number of scripts executed by the Coder agent. Includes cron scheduled scripts. | agent_name success template_name username workspace_name |
| coder_aibridged_injected_tool_invocations_total | counter | The number of times an injected MCP tool was invoked by aibridge. | model name provider server |
| coder_aibridged_interceptions_duration_seconds | histogram | The total duration of intercepted requests, in seconds. The majority of this time will be the upstream processing of the request. aibridge has no control over upstream processing time, so it's just an illustrative metric. | model provider |
| coder_aibridged_interceptions_inflight | gauge | The number of intercepted requests which are being processed. | model provider route |
| coder_aibridged_interceptions_total | counter | The count of intercepted requests. | initiator_id method model provider route status |
| coder_aibridged_non_injected_tool_selections_total | counter | The number of times an AI model selected a tool to be invoked by the client. | model name provider |
| coder_aibridged_prompts_total | counter | The number of prompts issued by users (initiators). | initiator_id model provider |
| coder_aibridged_tokens_total | counter | The number of tokens used by intercepted requests. | initiator_id model provider type |
| coderd_agentapi_metadata_batch_size | histogram | Total number of metadata entries in each batch, updated before flushes. | |
| coderd_agentapi_metadata_batch_utilization | histogram | Number of metadata keys per agent in each batch, updated before flushes. | |
| coderd_agentapi_metadata_batches_total | counter | Total number of metadata batches flushed. | reason |
| coderd_agentapi_metadata_dropped_keys_total | counter | Total number of metadata keys dropped due to capacity limits. | |
| coderd_agentapi_metadata_flush_duration_seconds | histogram | Time taken to flush metadata batch to database and pubsub. | reason |
| coderd_agentapi_metadata_flushed_total | counter | Total number of unique metadatas flushed. | |
| coderd_agentapi_metadata_publish_errors_total | counter | Total number of metadata batch pubsub publish calls that have resulted in an error. | |
| coderd_agents_apps | gauge | Agent applications with statuses. | agent_name app_name health username workspace_name |
| coderd_agents_connection_latencies_seconds | gauge | Agent connection latencies in seconds. | agent_name derp_region preferred username workspace_name |
| coderd_agents_connections | gauge | Agent connections with statuses. | agent_name lifecycle_state status tailnet_node username workspace_name |
| coderd_agents_up | gauge | The number of active agents per workspace. | template_name username workspace_name |
| coderd_agentstats_connection_count | gauge | The number of established connections by agent | agent_name username workspace_name |
| coderd_agentstats_connection_median_latency_seconds | gauge | The median agent connection latency | agent_name username workspace_name |
| coderd_agentstats_currently_reachable_peers | gauge | The number of peers (e.g. clients) that are currently reachable over the encrypted network. | agent_name connection_type template_name username workspace_name |
| coderd_agentstats_rx_bytes | gauge | Agent Rx bytes | agent_name username workspace_name |
| coderd_agentstats_session_count_jetbrains | gauge | The number of session established by JetBrains | agent_name username workspace_name |
| coderd_agentstats_session_count_reconnecting_pty | gauge | The number of session established by reconnecting PTY | agent_name username workspace_name |
| coderd_agentstats_session_count_ssh | gauge | The number of session established by SSH | agent_name username workspace_name |
| coderd_agentstats_session_count_vscode | gauge | The number of session established by VSCode | agent_name username workspace_name |
| coderd_agentstats_startup_script_seconds | gauge | The number of seconds the startup script took to execute. | agent_name success template_name username workspace_name |
| coderd_agentstats_tx_bytes | gauge | Agent Tx bytes | agent_name username workspace_name |
| coderd_api_active_users_duration_hour | gauge | The number of users that have been active within the last hour. | |
| coderd_api_concurrent_requests | gauge | The number of concurrent API requests. | |
| coderd_api_concurrent_websockets | gauge | The total number of concurrent API websockets. | |
| coderd_api_request_latencies_seconds | histogram | Latency distribution of requests in seconds. | method path |
| coderd_api_requests_processed_total | counter | The total number of processed API requests | code method path |
| coderd_api_websocket_durations_seconds | histogram | Websocket duration distribution of requests in seconds. | path |
| coderd_api_workspace_latest_build | gauge | The latest workspace builds with a status. | status |
| coderd_insights_applications_usage_seconds | gauge | The application usage per template. | application_name slug template_name |
| coderd_insights_parameters | gauge | The parameter usage per template. | parameter_name parameter_type parameter_value template_name |
| coderd_insights_templates_active_users | gauge | The number of active users of the template. | template_name |
| coderd_license_active_users | gauge | The number of active users. | |
| coderd_license_errors | gauge | The number of active license errors. | |
| coderd_license_limit_users | gauge | The user seats limit based on the active Coder license. | |
| coderd_license_user_limit_enabled | gauge | Returns 1 if the current license enforces the user limit. | |
| coderd_license_warnings | gauge | The number of active license warnings. | |
| coderd_metrics_collector_agents_execution_seconds | histogram | Histogram for duration of agents metrics collection in seconds. | |
| coderd_oauth2_external_requests_rate_limit | gauge | The total number of allowed requests per interval. | name resource |
| coderd_oauth2_external_requests_rate_limit_next_reset_unix | gauge | Unix timestamp of the next interval | name resource |
| coderd_oauth2_external_requests_rate_limit_remaining | gauge | The remaining number of allowed requests in this interval. | name resource |
| coderd_oauth2_external_requests_rate_limit_reset_in_seconds | gauge | Seconds until the next interval | name resource |
| coderd_oauth2_external_requests_rate_limit_used | gauge | The number of requests made in this interval. | name resource |
| coderd_oauth2_external_requests_total | counter | The total number of api calls made to external oauth2 providers. 'status_code' will be 0 if the request failed with no response. | name source status_code |
| coderd_prebuilt_workspace_claim_duration_seconds | histogram | Time to claim a prebuilt workspace by organization, template, and preset. | organization_name preset_name template_name |
| coderd_provisionerd_job_timings_seconds | histogram | The provisioner job time duration in seconds. | provisioner status |
| coderd_provisionerd_jobs_current | gauge | The number of currently running provisioner jobs. | provisioner |
| coderd_provisionerd_num_daemons | gauge | The number of provisioner daemons. | |
| coderd_provisionerd_workspace_build_timings_seconds | histogram | The time taken for a workspace to build. | status template_name template_version workspace_transition |
| coderd_template_workspace_build_duration_seconds | histogram | Duration from workspace build creation to agent ready, by template. | is_prebuild organization_name status template_name transition |
| coderd_workspace_builds_total | counter | The number of workspaces started, updated, or deleted. | action owner_email status template_name template_version workspace_name |
| coderd_workspace_creation_duration_seconds | histogram | Time to create a workspace by organization, template, preset, and type (regular or prebuild). | organization_name preset_name template_name type |
| coderd_workspace_creation_total | counter | Total regular (non-prebuilt) workspace creations by organization, template, and preset. | organization_name preset_name template_name |
| coderd_workspace_latest_build_status | gauge | The current workspace statuses by template, transition, and owner. | status template_name template_version workspace_owner workspace_transition |
| go_gc_duration_seconds | summary | A summary of the pause duration of garbage collection cycles. | |
| go_goroutines | gauge | Number of goroutines that currently exist. | |
| go_info | gauge | Information about the Go environment. | version |
| go_memstats_alloc_bytes | gauge | Number of bytes allocated and still in use. | |
| go_memstats_alloc_bytes_total | counter | Total number of bytes allocated, even if freed. | |
| go_memstats_buck_hash_sys_bytes | gauge | Number of bytes used by the profiling bucket hash table. | |
| go_memstats_frees_total | counter | Total number of frees. | |
| go_memstats_gc_sys_bytes | gauge | Number of bytes used for garbage collection system metadata. | |
| go_memstats_heap_alloc_bytes | gauge | Number of heap bytes allocated and still in use. | |
| go_memstats_heap_idle_bytes | gauge | Number of heap bytes waiting to be used. | |
| go_memstats_heap_inuse_bytes | gauge | Number of heap bytes that are in use. | |
| go_memstats_heap_objects | gauge | Number of allocated objects. | |
| go_memstats_heap_released_bytes | gauge | Number of heap bytes released to OS. | |
| go_memstats_heap_sys_bytes | gauge | Number of heap bytes obtained from system. | |
| go_memstats_last_gc_time_seconds | gauge | Number of seconds since 1970 of last garbage collection. | |
| go_memstats_lookups_total | counter | Total number of pointer lookups. | |
| go_memstats_mallocs_total | counter | Total number of mallocs. | |
| go_memstats_mcache_inuse_bytes | gauge | Number of bytes in use by mcache structures. | |
| go_memstats_mcache_sys_bytes | gauge | Number of bytes used for mcache structures obtained from system. | |
| go_memstats_mspan_inuse_bytes | gauge | Number of bytes in use by mspan structures. | |
| go_memstats_mspan_sys_bytes | gauge | Number of bytes used for mspan structures obtained from system. | |
| go_memstats_next_gc_bytes | gauge | Number of heap bytes when next garbage collection will take place. | |
| go_memstats_other_sys_bytes | gauge | Number of bytes used for other system allocations. | |
| go_memstats_stack_inuse_bytes | gauge | Number of bytes in use by the stack allocator. | |
| go_memstats_stack_sys_bytes | gauge | Number of bytes obtained from system for stack allocator. | |
| go_memstats_sys_bytes | gauge | Number of bytes obtained from system. | |
| go_threads | gauge | Number of OS threads created. | |
| process_cpu_seconds_total | counter | Total user and system CPU time spent in seconds. | |
| process_max_fds | gauge | Maximum number of open file descriptors. | |
| process_open_fds | gauge | Number of open file descriptors. | |
| process_resident_memory_bytes | gauge | Resident memory size in bytes. | |
| process_start_time_seconds | gauge | Start time of the process since unix epoch in seconds. | |
| process_virtual_memory_bytes | gauge | Virtual memory size in bytes. | |
| process_virtual_memory_max_bytes | gauge | Maximum amount of virtual memory available in bytes. | |
| promhttp_metric_handler_requests_in_flight | gauge | Current number of scrapes being served. | |
| promhttp_metric_handler_requests_total | counter | Total number of scrapes by HTTP status code. | code |
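
As an illustration of how the metrics listed above might be used, the following is a sketch of a Prometheus rules file. The group name, rule names, thresholds, and time windows are hypothetical choices, not values recommended by Coder; only the metric and label names come from the list above.

```yaml
groups:
  - name: coder-examples # hypothetical group name
    rules:
      # Fraction of API requests answered with a 5xx status code,
      # computed from the coderd_api_requests_processed_total counter
      # and its "code" label.
      - record: coder:api_requests:error_ratio_rate5m
        expr: |
          sum(rate(coderd_api_requests_processed_total{code=~"5.."}[5m]))
          /
          sum(rate(coderd_api_requests_processed_total[5m]))
      # Fires when more than 90% of licensed seats are in use.
      # The first clause skips deployments whose license does not
      # enforce a user limit.
      - alert: CoderLicenseSeatsNearlyExhausted
        expr: |
          coderd_license_user_limit_enabled == 1
          and
          coderd_license_active_users / coderd_license_limit_users > 0.9
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "More than 90% of licensed Coder user seats are in use."
```

The 90% threshold and the 15-minute hold are arbitrary starting points; adjust them to your deployment.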

Note on Prometheus native histogram support

The following metrics support native histograms:

  • coderd_workspace_creation_duration_seconds
  • coderd_prebuilt_workspace_claim_duration_seconds
  • coderd_template_workspace_build_duration_seconds

Native histograms are an experimental Prometheus feature that removes the need to predefine bucket boundaries and allows higher-resolution buckets that adapt to deployment characteristics. Whether a metric is exposed as classic or native depends entirely on the Prometheus server configuration (see Prometheus docs for details):

  • If native histograms are enabled, Prometheus ingests the high-resolution histogram.
  • If not, it falls back to the predefined buckets.

⚠️ Important: classic and native histograms cannot be aggregated together. If Prometheus is switched from classic to native at a certain point in time, dashboards may need to account for that transition. For this reason, it's recommended to follow the Prometheus migration guidelines when moving from classic to native histograms.
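
To make the difference concrete, the sketch below records the same 95th percentile twice, once against the classic series and once against the native histogram. The group name, rule names, and 5-minute window are hypothetical. The second rule only returns data if your Prometheus server is actually ingesting native histograms.

```yaml
groups:
  - name: coder-workspace-creation # hypothetical group name
    rules:
      # Classic histogram: query the _bucket series and keep the
      # "le" label so histogram_quantile can read the bucket bounds.
      - record: coder:workspace_creation_duration_seconds:p95_classic
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, template_name) (
              rate(coderd_workspace_creation_duration_seconds_bucket[5m])
            )
          )
      # Native histogram: query the metric directly. There is no
      # _bucket suffix and no "le" label to aggregate by.
      - record: coder:workspace_creation_duration_seconds:p95_native
        expr: |
          histogram_quantile(
            0.95,
            sum by (template_name) (
              rate(coderd_workspace_creation_duration_seconds[5m])
            )
          )
```

Because the two queries read different series, a dashboard panel that spans the switchover would need both expressions to show a continuous line.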