coder

mirror of https://github.com/coder/coder.git synced 2026-06-03 13:08:25 +00:00

Author	SHA1	Message	Date
Cian Johnston	72e3ae9c5f	feat: add chatd tool call error metrics and logging (#24559 ) - Add `coderd_chatd_tool_errors_total` prometheus counter (labels: provider, model, tool_name) - Log tool call errors at warn level with correlation fields: chat_id, owner_id, organization_id, workspace_id, agent_id, parent_chat_id, trigger_message_id, tool_name, tool_call_id, provider, model - Thread enriched logger from chatd.go into chatloop via `RunOptions.Logger` - Remove squashing of all MCP tool calls to the `mcp` bucket > 🤖	2026-04-22 16:19:56 +00:00
Cian Johnston	4b585465b8	feat: label chatd metrics by model, add stream-state diagnostics (#24475 ) Adds production-observability metrics to coderd/x/chatd/ for model-level correlation and a chatStreams memory-leak investigation. - Label per-request chatd metrics (steps_total, message_count, prompt_size_bytes, tool_result_size_bytes, ttft_seconds, compaction_total) with `model` and enrich the per-turn logger with provider/model. - Add `coderd_chatd_stream_retries_total{provider, model, kind}` counter incremented in chatloop before OnRetry. - Register a prometheus.Collector exposing `streams_active`, `stream_buffer_size_max`, `stream_buffer_events`, `stream_subscribers` from p.chatStreams. - Add `coderd_chatd_stream_buffer_dropped_total` counter, incremented per publishToStream drop independently of the existing log-rate-limited bufferDropCount. - Snapshot logger/model before the title-generation goroutine to avoid a data race with the logger/model rebind below it. > 🤖	2026-04-17 16:16:30 +01:00
Cian Johnston	d7439a9de0	feat: add Prometheus metrics for chatd subsystem (#24371 ) Adds 7 Prometheus metrics to the chatd subsystem and introduces typed `ActivityBumpReason` for deadline bump attribution. \| Metric \| Type \| Labels \| \|--------\|------\|--------\| \| `coderd_chatd_chats` \| Gauge \| `state` (streaming, waiting) \| \| `coderd_chatd_message_count` \| Histogram \| `provider` \| \| `coderd_chatd_prompt_size_bytes` \| Histogram \| `provider` \| \| `coderd_chatd_tool_result_size_bytes` \| Histogram \| `provider`, `tool_name` \| \| `coderd_chatd_ttft_seconds` \| Histogram \| `provider` \| \| `coderd_chatd_compaction_total` \| Counter \| `provider`, `result` \| \| `coderd_chatd_steps_total` \| Counter \| `provider` \| > 🤖	2026-04-15 19:53:10 +01:00
Danny Kopping	48b90f8cc8	feat: add coder_build_info metric (#24365 ) _Disclaimer: produced by Claude Opus 4.6_ Adds a `coder_build_info` metric which allows operators to see which versions of Coder are currently running. --------- Signed-off-by: Danny Kopping <danny@coder.com>	2026-04-15 12:48:38 +00:00
J. Scott Miller	20b953a99d	feat: add Prometheus metric for agent first connection duration (#24179 ) ## Summary Add `coderd_agents_first_connection_seconds` histogram metric that records the duration from workspace agent creation to first connection. This fills an observability gap — provisioner job timings and startup script metrics exist, but the agent connection phase (which can take several minutes) was not exposed to Prometheus. Closes https://github.com/coder/coder/issues/21282 ## Changes - `coderd/prometheusmetrics/prometheusmetrics.go` — Define and register a `HistogramVec` in the existing `Agents()` polling loop. Observe `first_connected_at - created_at` exactly once per agent via a deduplication map, pruned each tick to prevent unbounded memory growth. - `coderd/prometheusmetrics/prometheusmetrics_test.go` — Update `TestAgents` to set `first_connected_at` on the test agent and assert the histogram is collected with correct labels, sample count, and sample sum. - `docs/admin/integrations/prometheus.md`, `scripts/metricsdocgen/generated_metrics` — Auto-generated documentation updates from `make gen`. ## Metric details \| Property \| Value \| \|---\|---\| \| Name \| `coderd_agents_first_connection_seconds` \| \| Type \| histogram \| \| Labels \| `template_name`, `agent_name`, `username`, `workspace_name` \| \| Buckets \| 1s, 10s, 30s, 1m, 2m, 5m, 10m, 30m, 1h \| ## Example PromQL ```promql # P95 agent connection time by template histogram_quantile(0.95, sum(rate(coderd_agents_first_connection_seconds_bucket[1h])) by (le, template_name) ) ``` <details> <summary>Implementation notes</summary> ### Design decisions - Histogram over gauge: Enables `histogram_quantile()` for percentile queries. - Observe in `Agents()` polling loop: All required data is already fetched by `GetWorkspaceAgentsForMetrics()` — no new DB queries. - Dedup via `map[uuid.UUID]struct{}`: Prevents re-observing the same agent across polling ticks. Pruned each cycle to bound memory. 
- Buckets: Aligned with `coderd_provisionerd_workspace_build_timings_seconds` range (1s–1h). ### Overhead at scale (100k active workspaces) The deduplication map (`observedFirstConnection`) and per-tick pruning map (`currentAgentIDs`) are both `map[[16]byte]struct{}`. At 100k agents: - Memory: ~2.25 MB persistent + ~2.25 MB transient per tick = ~4.5 MB peak. - CPU: ~25 ms of map operations per tick (one tick per minute) = <0.05% of one core. Both are negligible relative to the existing cost of the `Agents()` loop (the DB query, per-agent `GetWorkspaceAppsByAgentID` calls, and coordinator node lookups dominate). </details> > 🤖 Generated by Coder Agents	2026-04-14 12:00:46 -05:00
Faur Ioan-Aurel	83fd4cf5c2	fix: OAuth2 cancel button in the authorization page not working (#24058 ) Go's html/template has a built-in security filter (urlFilter) that only allows http, https, and mailto URL schemes. Any other scheme gets replaced with #ZgotmplZ. The OAuth2 app's callback URL uses a custom URI scheme which the filter considers unsafe. For example, the Coder JetBrains plugin exposes a callback URI with the scheme jetbrains://, which the template engine rewrote into #ZgotmplZ. That is not a usable callback, so nothing happened when users clicked the cancel button. The fix was simple: we now wrap the app's registered callback URI in htmltemplate.URL. Usually this needs some validation, otherwise the linter will complain about it. The callback URI used by the Cancel logic is already validated by our backend when the client app is programmatically registered via the dynamic OAuth2 registration endpoints, so we refactored the validation around that code and re-used some of it in the Cancel handling to make sure we don't allow URIs like `javascript` and `data`, even though in theory these URIs were already validated. In addition, while testing this PR with https://github.com/coder/coder-jetbrains-toolbox/pull/209 I discovered that we are also not compliant with https://www.rfc-editor.org/rfc/rfc6749#section-4.1.2.1 which requires the server to attach the local state if it was provided by the client in the original request. It is also optional but generally good practice to include `error_description` in the error responses. In fact, we follow this pattern for the other types of error responses, so this is not a one-off. 
- resolves #20323	2026-04-10 12:49:22 +03:00
Jon Ayers	6c44de951d	feat: add Prometheus collector for DERP server expvar metrics (#22583 ) This PR does three things: - Exports derp expvars to the pprof endpoint - Exports the expvar metrics as prometheus metrics in both coderd and wsproxy - Updates our tailscale to a fix I also had to make to avoid a data race condition I generated this with mux but I also manually tested that the metrics were getting properly emitted	2026-03-06 01:57:58 -06:00
Zach	5b7377c375	feat: add Prometheus metrics for boundary log drop reporting (#22521 ) Add Prometheus metrics to the boundary log proxy for observability: - batches_dropped_total (reason: buffer_full, forward_failed) - logs_dropped_total (reason: buffer_full, forward_failed, boundary_channel_full, boundary_batch_full) - batches_forwarded_total Also add BoundaryStatus to the BoundaryMessage envelope so boundary can report dropped log counts as a separate wire message. The agent records these as Prometheus metrics, making boundary-side data loss visible. Backwards compatibility for older versions of boundary is maintained. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 12:42:34 -07:00
Susana Ferreira	ca234f346d	fix: mark presets as validation_failed to prevent endless prebuild retries (#22085 ) ## Description - Updates `wsbuilder` to return a `BuildError` with `http.StatusBadRequest` to signify a "validation error" on missing or invalid parameters - Adds a short-circuit in `prebuilds.StoreReconciler` to mark presets for which creating a build returns a "validation error" as "validation failed" and skip further attempts to reconcile. - Adds a test to verify the above - Introduces a new Prometheus metric `coderd_prebuilt_workspaces_preset_validation_failed` to track the above Closes: https://github.com/coder/coder/issues/21237 --------- Co-authored-by: Cian Johnston <cian@coder.com>	2026-02-27 14:26:48 +00:00
Garrett Delfosse	4057363f78	fix(coderd): add organization_name label to insights Prometheus metrics (#22296 ) ## Description When multiple organizations have templates with the same name, the Prometheus `/metrics` endpoint returns HTTP 500 because Prometheus rejects duplicate label combinations. The three `coderd_insights_` metrics (`coderd_insights_templates_active_users`, `coderd_insights_applications_usage_seconds`, `coderd_insights_parameters`) used only `template_name` as a distinguishing label, so two templates named e.g. `"openstack-v1"` in different orgs would produce duplicate metric series. This adds `organization_name` as a label to all three insight metric descriptors to disambiguate templates across organizations. ## Changes `coderd/prometheusmetrics/insights/metricscollector.go`: - Added `organization_name` label to all three metric descriptors - Added `organizationNames` field (template ID → org name) to the `insightsData` struct - In `doTick`: after fetching templates, collect unique org IDs, fetch organizations via `GetOrganizations`, and build a template-ID-to-org-name mapping - In `Collect()`: pass the organization name as an additional label value in every `MustNewConstMetric` call `coderd/prometheusmetrics/insights/testdata/insights-metrics.json`*: Updated golden file to include `organization_name=coder` in all metric label keys. Fixes #21748	2026-02-25 08:58:50 +00:00
Thomas Kosiewski	b776a14b46	fix(coderd): harden OAuth2 provider security (#22194 ) ## Summary Harden the OAuth2 provider with multiple security fixes addressing `coder/security#121` (CSRF session takeover) and converge on OAuth 2.1 compliance. ### Security Fixes \| Fix \| Description \| Commits \| \|-----\|-------------\|---------\| \| CSRF on `/oauth2/authorize` \| Enforce CSRF protection on the authorize endpoint POST (consent form submission) \| `ba7d646`, `b94a64e` \| \| Clickjacking: `frame-ancestors` CSP \| Prevent consent page from being iframed (`Content-Security-Policy: frame-ancestors 'none'` + `X-Frame-Options: DENY`) \| `597aeb2` \| \| Exact redirect URI matching \| Changed from prefix matching to full string exact matching per OAuth 2.1 §4.1.2.1 \| `73d64b1`, `93897f1` \| \| Store & verify `redirect_uri` \| Store redirect_uri with auth code in DB, verify at token exchange matches exactly (RFC 6749 §4.1.3) \| `50569b9`, `d7ca315` \| \| Mandatory PKCE \| Require `code_challenge` at authorization (for `response_type=code`) + unconditional `code_verifier` verification at token exchange \| `d7ca315`, `1cda1a9` \| \| Reject implicit grant \| `response_type=token` now returns `unsupported_response_type` error page (OAuth 2.1 removes implicit flow) \| `d7ca315`, `91b8863` \| ### Changes by File `coderd/httpmw/csrf.go` — Extended the CSRF `ExemptFunc` to enforce CSRF on `/oauth2/authorize` in addition to `/api` routes. The consent form POST is now CSRF-protected to prevent cross-site authorization code theft. `site/site.go` — Added `Content-Security-Policy: frame-ancestors 'none'` and `X-Frame-Options: DENY` headers to `RenderOAuthAllowPage` (consent page only — does not affect the SPA/global CSP used by AI tasks). `coderd/httpapi/queryparams.go` — Changed `RedirectURL` from prefix matching (`strings.HasPrefix(v.Path, base.Path)`) to full URI exact matching (`v.String() != base.String()`), comparing scheme, host, path, and query. 
`coderd/oauth2provider/authorize.go` — Added PKCE enforcement: `code_challenge` is required when `response_type=code` (via a conditional check, not `RequiredNotEmpty`, so `response_type=token` can reach the explicit rejection path). `ShowAuthorizePage` (GET) validates `response_type` before rendering and returns a 400 error page for unsupported types. `ProcessAuthorize` (POST) stores the `redirect_uri` with the auth code when explicitly provided. `coderd/oauth2provider/tokens.go` — PKCE verification is now unconditional (not gated on `code_challenge` being present in DB). If the stored code has a `redirect_uri`, the token endpoint verifies it matches exactly — mismatch returns `errBadCode` → `invalid_grant`. Missing `code_verifier` returns `invalid_grant`. `codersdk/oauth2.go` — `OAuth2ProviderResponseTypeToken` constant and `Valid()` acceptance are kept so the authorize handler can parse `response_type=token` and return the proper `unsupported_response_type` error rather than failing at parameter validation. *`coderd/database/migrations/000421_` — Added `redirect_uri text` column to `oauth2_provider_app_codes`. ### Design Decisions `state` parameter remains optional — The plan initially required `state` via `RequiredNotEmpty`, but this was reverted in `376a753` to avoid breaking existing clients. The `state` is still hashed and stored when provided (via `state_hash` column), securing clients that opt in. `response_type=token` kept in `Valid()` — Removing it from `Valid()` would cause the parameter parser to reject the request before the authorize handler can return the proper `unsupported_response_type` error. The constant is kept for correct error handling flow. CSP scoped to consent page only — `frame-ancestors 'none'` is set only on the OAuth consent page renderer, not globally. The SPA/global CSP was previously changed to allow framing for AI tasks ([#18102](https://github.com/coder/coder/pull/18102)); this change does not regress that. 
### Out of Scope (follow-up PRs) - Bearer tokens in query strings (needs internal caller audit) - Scope enforcement on OAuth2 tokens - Rate limiting on dynamic client registration --- <details> <summary>📋 Implementation Plan</summary> # Plan: Harden OAuth2 Provider — Security Fixes + OAuth 2.1 Compliance ## Context & Why Security issue `coder/security#121` reports a critical session takeover via CSRF on the OAuth2 provider. This plan covers all remaining security fixes from that issue plus convergence on OAuth 2.1 requirements. The goal is a single PR that closes all actionable gaps. ## Current State (already committed on branch `csrf-sjx1`) \| Fix \| Status \| Commits \| \|-----\|--------\|---------\| \| Fix 1: CSRF on `/oauth2/authorize` \| ✅ Done \| `ba7d646`, `b94a64e` \| \| CSRF token in consent form HTML \| ✅ Done \| `b94a64e` \| \| `state_hash` column + storage \| ✅ Done (hash stored, but state still optional) \| `9167d83`, `b94a64e` \| \| Tests for CSRF + state hash \| ✅ Done \| `e4119b5` \| ## Remaining Work ### ~~Fix 2 — Require `state` parameter~~ (DROPPED) > Decision: Do not enforce `state` as required. The `state` parameter is still hashed and stored when provided (via `hashOAuth2State` / `state_hash` column from prior commits), but clients are not forced to supply it. This avoids breaking existing integrations that omit state. Rollback: Remove `"state"` from the `RequiredNotEmpty` call in `coderd/oauth2provider/authorize.go:42`: ```go // BEFORE (current on branch) p.RequiredNotEmpty("response_type", "client_id", "state", "code_challenge") // AFTER p.RequiredNotEmpty("response_type", "client_id", "code_challenge") ``` No test changes needed — tests already pass `state` voluntarily. ### Fix 4 — Exact redirect URI matching Currently `coderd/httpapi/queryparams.go:233` uses prefix matching: ```go // CURRENT — prefix match if v.Host != base.Host \|\| !strings.HasPrefix(v.Path, base.Path) { ``` OAuth 2.1 requires exact string matching. 
Change to: ```go // AFTER — exact match (OAuth 2.1 §4.1.2.1) if v.Host != base.Host || v.Path != base.Path { ``` File: `coderd/httpapi/queryparams.go` — `RedirectURL` method Also update the error message from "must be a subset of" to "must exactly match". Additionally, store `redirect_uri` with the auth code and verify at the token endpoint (RFC 6749 §4.1.3): 1. New migration (same migration file or a new `000421`): Add `redirect_uri text` column to `oauth2_provider_app_codes` 2. Update INSERT query in `coderd/database/queries/oauth2.sql` to include `redirect_uri` 3. `coderd/oauth2provider/authorize.go`: Store `params.redirectURL.String()` when inserting the code 4. `coderd/oauth2provider/tokens.go`: After retrieving the code from DB, verify that `redirect_uri` from the token request matches the stored value exactly. Currently `tokens.go:103` calls `p.RedirectURL(vals, callbackURL, "redirect_uri")` for prefix validation only — it must compare against the stored redirect_uri from the code, not just the app's callback URL. <details> <summary>Why both exact match AND store+verify?</summary> Exact matching at the authorize endpoint prevents open redirectors (attacker can't use a sub-path). Storing and verifying at the token endpoint prevents code injection — an attacker who steals a code can't exchange it with a different redirect_uri than was originally authorized. This is required by RFC 6749 §4.1.3 and OAuth 2.1. </details> ### Fix 7 — `frame-ancestors` CSP on consent page The consent page can be iframed by a workspace app (same-site), which is the attack vector. Add a `Content-Security-Policy` header to prevent framing. 
File: `site/site.go` — `RenderOAuthAllowPage` function (~line 731) Before writing the response, add: ```go func RenderOAuthAllowPage(rw http.ResponseWriter, r *http.Request, data RenderOAuthAllowData) { rw.Header().Set("Content-Type", "text/html; charset=utf-8") // Prevent the consent page from being framed to mitigate // clickjacking attacks (coder/security#121). rw.Header().Set("Content-Security-Policy", "frame-ancestors 'none'") rw.Header().Set("X-Frame-Options", "DENY") ... ``` Both headers for defense-in-depth (CSP for modern browsers, X-Frame-Options for legacy). ### OAuth 2.1 — Mandatory PKCE Currently PKCE is checked only when `code_challenge` was provided during authorization (`tokens.go:258`): ```go // CURRENT — conditional check if dbCode.CodeChallenge.Valid && dbCode.CodeChallenge.String != "" { // verify PKCE } ``` OAuth 2.1 requires PKCE for ALL authorization code flows. Change to: File: `coderd/oauth2provider/authorize.go` — Add `"code_challenge"` to required params: ```go p.RequiredNotEmpty("response_type", "client_id", "code_challenge") ``` File: `coderd/oauth2provider/tokens.go:257-265` — Make PKCE verification unconditional: ```go // AFTER — PKCE always required (OAuth 2.1) if req.CodeVerifier == "" { return codersdk.OAuth2TokenResponse{}, errInvalidPKCE } if !dbCode.CodeChallenge.Valid || dbCode.CodeChallenge.String == "" { // Code was issued without a challenge — should not happen // with the authorize endpoint enforcement, but defend in // depth. return codersdk.OAuth2TokenResponse{}, errInvalidPKCE } if !VerifyPKCE(dbCode.CodeChallenge.String, req.CodeVerifier) { return codersdk.OAuth2TokenResponse{}, errInvalidPKCE } ``` File: `codersdk/oauth2.go` — Remove `OAuth2ProviderResponseTypeToken` from the enum or reject it explicitly in the authorize handler. Currently it's defined at line 216 but the handler ignores `response_type` and always issues a code. 
We should either: - (a) Remove the `"token"` variant from the enum and reject it with `unsupported_response_type`, OR - (b) Add an explicit check in `ProcessAuthorize` that rejects `response_type=token` Option (b) is simpler and more backwards-compatible: ```go // In ProcessAuthorize, after extracting params: if params.responseType != codersdk.OAuth2ProviderResponseTypeCode { httpapi.WriteOAuth2Error(ctx, rw, http.StatusBadRequest, codersdk.OAuth2ErrorCodeUnsupportedResponseType, "Only response_type=code is supported") return } ``` ### OAuth 2.1 — Bearer tokens in query strings `coderd/httpmw/apikey.go:743` accepts `access_token` from URL query parameters. OAuth 2.1 prohibits this. However, this may be used internally (e.g., workspace apps, DERP). Need to audit callers before removing. Approach: This is a larger change with potential breakage. Mark as a separate follow-up issue rather than including in this PR. Document the finding. ### OAuth 2.1 — Removed flows ✅ Already compliant. `tokens.go` only supports `authorization_code` and `refresh_token` grant types. The implicit grant (`response_type=token`) will be explicitly rejected per the PKCE section above. ### OAuth 2.1 — Refresh token rotation ✅ Already compliant. `tokens.go:442` deletes the old API key when a refresh token is used. ## Migration Plan All DB changes can go in a single new migration (or extend 000420 if the branch is rebased before merge). Columns to add: - `redirect_uri text` on `oauth2_provider_app_codes` The `state_hash` column is already added by migration 000420. ## Implementation Order 1. Fix 7 — CSP headers on consent page (isolated, no deps) 2. ~~Fix 2 — Require `state` parameter~~ (DROPPED — state stays optional) 3. Fix 4 — Exact redirect URI matching + store/verify redirect_uri 4. PKCE mandatory — Require `code_challenge` + reject `response_type=token` 5. Rollback — Remove `"state"` from `RequiredNotEmpty` in `authorize.go` 6. Tests — Update/add tests for all changes 7. 
`make gen` after DB changes ## Out of Scope (separate PRs) - Bearer tokens in query strings (needs internal caller audit) - Scope enforcement on OAuth2 tokens - Rate limiting / quota on dynamic client registration </details> --- _Generated with [`mux`](https://github.com/coder/mux) • Model: `anthropic:claude-opus-4-6` • Thinking: `xhigh`_	2026-02-23 12:18:44 +01:00
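The two token-endpoint checks described in the commit above (unconditional PKCE verification and exact `redirect_uri` comparison) can be sketched with the standard library. This assumes the S256 challenge method from RFC 7636; `challengeS256`, `verifyS256`, and `redirectURIMatches` are hypothetical helpers written for this illustration and are not the repository's `VerifyPKCE` implementation.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/base64"
	"fmt"
)

// challengeS256 derives the code_challenge for a verifier using the S256
// method: base64url without padding over the SHA-256 digest.
func challengeS256(verifier string) string {
	sum := sha256.Sum256([]byte(verifier))
	return base64.RawURLEncoding.EncodeToString(sum[:])
}

// verifyS256 fails closed: a missing verifier or a missing stored challenge
// is treated as a verification failure rather than skipped.
func verifyS256(storedChallenge, verifier string) bool {
	if storedChallenge == "" || verifier == "" {
		return false
	}
	computed := challengeS256(verifier)
	return subtle.ConstantTimeCompare([]byte(computed), []byte(storedChallenge)) == 1
}

// redirectURIMatches compares the redirect_uri presented at the token
// endpoint with the one stored alongside the code. When nothing was stored
// there is nothing to compare, so the check passes.
func redirectURIMatches(stored, presented string) bool {
	if stored == "" {
		return true
	}
	return stored == presented
}

func main() {
	// Test vector from RFC 7636, Appendix B.
	verifier := "dBjftJeZ4CVP-mB92K27uhbUJU1p1r_wW1gFWFOEjXk"
	challenge := "E9Melhoa2OwvFrEMTJguCHaoeK1t8URWbuGJSstw-cM"

	fmt.Println(verifyS256(challenge, verifier))         // matching verifier
	fmt.Println(verifyS256(challenge, "wrong-verifier")) // mismatching verifier
	fmt.Println(verifyS256("", verifier))                // code issued without a challenge

	// A sub-path of the registered URI no longer passes.
	fmt.Println(redirectURIMatches("https://app.example/cb", "https://app.example/cb/extra"))
}
```

The constant-time comparison is a defensive choice in this sketch; the commit message does not say how the repository compares the values.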
Danielle Maywood	02a80eac2e	docs: document new terraform-managed devcontainers (#21978 )	2026-02-19 11:45:04 +00:00
Susana Ferreira	df84cea924	feat(scripts/metricsdocgen): support merging static and generated metrics files (#21464 ) ## Description This PR refactors `scripts/metricsdocgen/main.go` to support merging static and generated metrics files for documentation generation. The static `metrics` file remains necessary for metrics not defined in the coder codebase (`go_`, `process_`, `promhttp_`, `coder_aibridged_`), as well as edge cases the scanner cannot handle (e.g., such as metrics with runtime-determined labels or function-local variable references for fields, ...). Handling these edge cases in the scanner would make it significantly more complex, so we keep this hybrid approach to accommodate them. This means that in such cases, developers need to update the `metrics` file directly, meaning there is still a risk of out-of-date information in the documentation. However, this solution should already encompass most cases. Static metrics take priority over generated metrics when both files contain the same metric name, allowing manual overrides without modifying the scanner. Some of these edge cases could be easily fixed by updating the codebase to use one of the supported patterns. ## Changes * Update `scripts/metricsdocgen/main.go` to read from two separate metrics files: * `metrics`: static, manually maintained metrics (e.g., `go_`, `process_`, `promhttp_`, `coder_aibridged_`) * `generated_metrics`: auto-generated by the AST scanner * Update `metrics` file to contain only static and edge-case metrics * Skip metrics with empty HELP descriptions in the scanner * Update `generated_metrics` to reflect skipped metrics * Update `docs/admin/integrations/prometheus.md` with merged metrics Related to: https://github.com/coder/coder/issues/13223 Disclosure: This PR was mainly developed with Claude Sonnet 4, with iterative review and refinement by @ssncferreira	2026-02-13 12:19:33 +00:00
Callum Styan	5f3be6b288	feat: add provisioner job queue wait time histogram and jobs enqueued counter (#21869 ) This PR adds some metrics to help identify job enqueue rates and latencies. This work was initiated as a way to help reduce the cost of the observation/measurement itself for autostart scaletests, which impacts our ability to identify/reason about the load caused by autostart. See: https://github.com/coder/internal/issues/1209 I've extended the metrics here to account for regular user initiated builds, prebuilds, autostarts, etc. IMO there is still the question here of whether we want to include or need the `transition` label, which is only present on workspace builds. Including it does lead to an increase in cardinality, and in the case of the histogram (when not using native histograms) that's at least a few extra series for every bucket. We could remove the transition label there but keep it on the counter. Additionally, the histogram is currently observing latencies for other jobs, such as template builds/version imports, those do not have a transition type associated with them. Tested briefly in a workspace, can see metric values like the following: - `coderd_workspace_builds_enqueued_total{build_reason="autostart",provisioner_type="terraform",status="success",transition="start"} 1` - `coderd_provisioner_job_queue_wait_seconds_bucket{build_reason="autostart",job_type="workspace_build",provisioner_type="terraform",transition="start",le="0.025"} 1` --------- Signed-off-by: Callum Styan <callumstyan@gmail.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-12 13:40:47 -08:00
Jon Ayers	6035e45cb8	feat: add e2e workspace build duration metric (#21739 ) Adds coderd_template_workspace_build_duration_seconds histogram that tracks the full duration from workspace build creation to agent ready. This captures the complete user-perceived build time including provisioning and agent startup. The metric is emitted when the agent reports ready/error/timeout via the lifecycle API, ensuring each build is counted exactly once per replica.	2026-02-06 16:26:02 -06:00
Thomas Kosiewski	dd6aec04d7	fix(coderd/oauth2provider): support client_secret_basic client auth (#21793 )	2026-02-02 16:01:33 +01:00
Marcin Tojek	036ed5672f	fix!: remove deprecated prometheus metrics (#21788 ) ## Description Removes the following deprecated Prometheus metrics: - `coderd_api_workspace_latest_build_total` → use `coderd_api_workspace_latest_build` instead - `coderd_oauth2_external_requests_rate_limit_total` → use `coderd_oauth2_external_requests_rate_limit` instead These metrics were deprecated in #12976 because gauge metrics should avoid the `_total` suffix per [Prometheus naming conventions](https://prometheus.io/docs/practices/naming/). ## Changes - Removed deprecated metric `coderd_api_workspace_latest_build_total` from `coderd/prometheusmetrics/prometheusmetrics.go` - Removed deprecated metric `coderd_oauth2_external_requests_rate_limit_total` from `coderd/promoauth/oauth2.go` - Updated tests to use the non-deprecated metric name Fixes #12999	2026-01-30 13:30:06 +01:00
Marcin Tojek	04b0253e8a	feat: add Prometheus metrics for license warnings and errors (#21749 ) Fixes: coder/internal#767 Adds two new Prometheus metrics for license health monitoring: - `coderd_license_warnings` - count of active license warnings - `coderd_license_errors` - count of active license errors Metrics endpoint after startup of a deployment with license enabled: ``` ... # HELP coderd_license_errors The number of active license errors. # TYPE coderd_license_errors gauge coderd_license_errors 0 ... # HELP coderd_license_warnings The number of active license warnings. # TYPE coderd_license_warnings gauge coderd_license_warnings 0 ... ```	2026-01-29 13:50:15 +01:00
Callum Styan	806d7e4c11	docs: update metrics docs to include metadata batcher metrics (#21665 ) This updates the metrics docs to include metrics added in https://github.com/coder/coder/pull/21330 Signed-off-by: Callum Styan <callumstyan@gmail.com>	2026-01-26 09:22:14 -08:00
Mathias Fredriksson	ea9f003cdd	docs: clarify dev containers entry point and reduce callouts (#21188 ) The user guide jumped straight into integration details without explaining what dev containers are. Now it opens with a brief orientation linking to the spec, then explains this guide covers the Docker-based approach. Converted several NOTE callouts to prose where they were just cross-references or stacked unnecessarily. The Envbuilder index note was reframed to lead with its strengths rather than "we recommend the other thing." Also updates platform support to Linux only per current status. Refs #21157	2025-12-09 16:37:19 +02:00
Mathias Fredriksson	f3e26ca557	docs: add guidance on when to use Project Discovery for Dev Containers (#21190 ) Refs #21157	2025-12-09 16:36:19 +02:00
Mathias Fredriksson	97bc7eb9e5	docs: restructure dev container documentation (#21157 ) Dev container admin docs were scattered across two locations: the Docker-based integration under extending-templates/ and Envbuilder under managing-templates/. There was no landing page explaining that two approaches exist or helping admins choose between them. This moves everything under admin/integrations/devcontainers/ with a decision guide at the top. Dev containers are an integration with the dev container specification, so integrations/ is a natural fit alongside JFrog, Vault, etc. Stub pages remain at the original locations for discoverability. New structure: admin/integrations/devcontainers/ ├── index.md # Landing page + decision guide ├── integration.md # Docker-based dev containers └── envbuilder/ ├── index.md ├── add-envbuilder.md ├── envbuilder-security-caching.md └── envbuilder-releases-known-issues.md Refs #21080	2025-12-09 13:03:02 +02:00
Danny Kopping	c6631e1e50	feat: expose `aibridged` metrics (#20865 ) Upgrades `coder/aibridge` to v0.2.0 which includes https://github.com/coder/aibridge/pull/62. Creates a `prometheus.Registerer` with a prefix `coder_aibridged_` and passes that along to coder/aibridge which actually exposes the metrics. Also includes a side-effect of a change described in https://github.com/coder/aibridge/pull/62#discussion_r2550017470. --------- Signed-off-by: Danny Kopping <danny@coder.com>	2025-11-24 18:16:06 +02:00
Susana Ferreira	c1f8465de6	fix: add missing provisionerd metrics to docs (#20358 ) ## Description Add missing provisionerd metrics to Prometheus documentation: * `coderd_provisionerd_num_daemons`: The number of provisioner daemons. * `coderd_provisionerd_workspace_build_timings_seconds`: The time taken for a workspace to build. Related to internal thread: https://codercom.slack.com/archives/C07GRNNRW03/p1760642020583019	2025-10-20 11:33:45 +01:00
blink-so[bot]	02ecf32afe	docs: replace offline deployments terminology to air-gapped (#19625 ) This PR comprehensively updates the offline deployments documentation to use more precise "air-gapped" terminology and improves consistency throughout the documentation. ## Changes Made ### Terminology Updates - Title: Changed from "Offline Deployments" to "Air-gapped Deployments" - Summary: Updated to prioritize "air-gapped" terminology and added "disconnected" to cover additional deployment scenarios - Content: Updated tutorial references to use "air-gapped" instead of "offline" - Section headers: - Changed "Offline container images" to "Air-gapped container images" - Changed "Offline docs" to "Air-gapped docs" - Table headers: Changed "Offline deployments" to "Air-gapped deployments" ### Navigation & URL Structure - Navigation title: Updated `docs/manifest.json` to show "Air-gapped Deployments" in sidebar - Navigation description: Updated to "Run Coder in air-gapped / disconnected / offline environments" - File rename: `docs/install/offline.md` → `docs/install/airgap.md` for consistency - URL change: `/install/offline` → `/install/airgap` - Subsection anchors: - `/install/offline#offline-docs` → `/install/airgap#airgap-docs` - `/install/offline#offline-container-images` → `/install/airgap#airgap-container-images` ### Internal Links & References Updated all internal documentation links: - `docs/admin/integrations/index.md` - `docs/admin/networking/index.md` - `docs/changelogs/v0.27.0.md` (including anchor reference) - `docs/tutorials/faqs.md` ### Backward Compatibility - Redirects: Added `docs/_redirects` with 301 redirects: - `/install/offline` → `/install/airgap` - `/install/offline#offline-docs` → `/install/airgap#airgap-docs` - `/install/offline#offline-container-images` → `/install/airgap#airgap-container-images` - Content: Maintains "offline" in the description for broader understanding - Deep links: All subsection anchors redirect properly to maintain existing 
bookmarks ## Rationale - "Air-gapped" is more precise and commonly used in enterprise/security contexts - "Disconnected" covers additional scenarios where networks may be temporarily or partially isolated - Consistency ensures filename, URL, navigation, content, and subsection anchors all align with the same terminology - Backward compatibility maintained through comprehensive redirects to prevent broken links at any level ## Testing - [x] Verified all internal links point to the new URL structure - [x] Confirmed navigation title updates correctly - [x] Ensured content accuracy is maintained - [x] Added redirects for backward compatibility (main page + subsections) - [x] Updated all cross-references in related documentation - [x] Verified subsection anchor redirects work properly - [x] Confirmed no unnecessary .md file redirects ## Result Complete terminology consistency across: - ✅ Page title and headers - ✅ Navigation and breadcrumbs - ✅ File names and URL structure - ✅ Internal documentation links - ✅ Table headers and section titles - ✅ Subsection anchors and deep links - ✅ Backward compatibility via comprehensive redirects --------- Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com> Co-authored-by: david-fraley <67079030+david-fraley@users.noreply.github.com>	2025-08-29 09:34:44 -05:00
Susana Ferreira	0ab345ca84	feat: add prebuild timing metrics to Prometheus (#19503 ) ## Description This PR introduces one counter and two histograms related to workspace creation and claiming. The goal is to provide clearer observability into how workspaces are created (regular vs prebuild) and the time cost of those operations. ### `coderd_workspace_creation_total` * Metric type: Counter * Name: `coderd_workspace_creation_total` * Labels: `organization_name`, `template_name`, `preset_name` This counter tracks whether a regular workspace (not created from a prebuild pool) was created using a preset or not. We already expose `coderd_prebuilt_workspaces_claimed_total` for claimed prebuilt workspaces, but we lack a comparable metric for regular workspace creations. This metric fills that gap, making it possible to compare regular creations against claims. Implementation notes: * Exposed as a `coderd_` metric, consistent with other workspace-related metrics (e.g. `coderd_api_workspace_latest_build`: https://github.com/coder/coder/blob/main/coderd/prometheusmetrics/prometheusmetrics.go#L149). * Every `defaultRefreshRate` (1 minute), the DB query `GetRegularWorkspaceCreateMetrics` is executed to fetch all regular workspaces (not created from a prebuild pool). * The counter is updated with the all-time total (not just the total since the metric was introduced). This differs from the histograms below, which only accumulate from their introduction forward.
### `coderd_workspace_creation_duration_seconds` & `coderd_prebuilt_workspace_claim_duration_seconds` * Metric types: Histogram * Names: * `coderd_workspace_creation_duration_seconds` * Labels: `organization_name`, `template_name`, `preset_name`, `type` (`regular`, `prebuild`) * `coderd_prebuilt_workspace_claim_duration_seconds` * Labels: `organization_name`, `template_name`, `preset_name` We already have `coderd_provisionerd_workspace_build_timings_seconds`, which tracks build run times for all workspace builds handled by the provisioner daemon. However, in the context of this issue, we are only interested in creation and claim build times, not all transitions; additionally, this metric does not include `preset_name`, and adding it there would significantly increase cardinality. Therefore, separate, more focused metrics are introduced here: * `coderd_workspace_creation_duration_seconds`: Build time to create a workspace (either a regular workspace or the initial provisioning build that places a workspace into a prebuild pool). * `coderd_prebuilt_workspace_claim_duration_seconds`: Time to claim a prebuilt workspace from the pool. Two separate histograms are used because: * Creation (regular or prebuild): provisioning builds of similar duration, generally expected to take longer than a claim operation. * Claim: expected to be a much faster provisioning build. #### Native histogram usage Provisioning times vary widely between projects. Using static buckets risks unbalanced or poorly informative histograms.
To address this, these metrics use [Prometheus native histograms](https://prometheus.io/docs/specs/native_histograms/): * First introduced in Prometheus v2.40.0 * Recommended stable usage from v2.45+ * Requires Go client `prometheus/client_golang` v1.15.0+ * Experimental and must be explicitly enabled on the server (`--enable-feature=native-histograms`) For compatibility, we also retain a classic bucket definition (aligned with the existing provisioner metric: https://github.com/coder/coder/blob/main/provisionerd/provisionerd.go#L182-L189). * If native histograms are enabled, Prometheus ingests the high-resolution histogram. * If not, it falls back to the predefined buckets. Implementation notes: * Unlike the counter, these histograms are updated in real-time at workspace build job completion. * They reflect data only from the point of introduction forward (no historical backfill). ## Relates to Closes: https://github.com/coder/coder/issues/19528 Native histograms tested in observability stack: https://github.com/coder/observability/pull/50	2025-08-28 15:00:26 +01:00
Hugo Dutka	c94333d9b5	docs: oauth2-provider fixes (#19170 ) Adds the oauth2-provider doc page to the manifest so it's rendered in the docs, fixes formatting in the oauth2-provider doc, and links to it from the MCP doc. To see the formatting issues, visit https://coder.com/docs/@4bcf44a/admin/integrations/oauth2-provider. To see the doc after the fixes, visit https://coder.com/docs/@f05969a/admin/integrations/oauth2-provider.	2025-08-04 21:20:51 +02:00
Thomas Kosiewski	247efc0dcc	docs: add OAuth2 provider experimental feature documentation (#19165 ) # Add OAuth2 Provider Documentation This PR adds comprehensive documentation for the experimental OAuth2 Provider feature, which allows Coder to function as an OAuth2 authorization server. The documentation covers: - Feature overview and experimental status warning - Setup requirements and enabling the feature - Methods for creating OAuth2 applications (UI and API) - Integration patterns including standard OAuth2 and PKCE flows - Discovery endpoints and token management - Testing and development guidance - Troubleshooting common issues - Security considerations and current limitations The documentation is marked as experimental and includes appropriate warnings about production usage. Signed-off-by: Thomas Kosiewski <tk@coder.com>	2025-08-04 20:17:47 +02:00
Eric Paulsen	8b43503aaf	docs: remove deprecated JFrog Xray integration documentation (#19113 )	2025-07-31 18:46:39 +01:00
blink-so[bot]	aa1a985381	docs: update DX integration title from 'DX Data Cloud' to 'DX' (#18981 ) Simplifies the title to reduce customer confusion as requested by @kylejaggi. The DX platform covers all products, not just Data Cloud. This change makes the documentation clearer for customers who might get confused about which DX product the integration refers to. Changes: - Updated page title from "DX Data Cloud" to "DX" in `docs/admin/integrations/dx-data-cloud.md` Testing: - Verified the markdown renders correctly - No functional changes, documentation-only update --------- Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com> Co-authored-by: bpmct <22407953+bpmct@users.noreply.github.com>	2025-07-21 22:02:44 +00:00
Edward Angert	cbe4627893	docs: document how to tag coder users in dx data cloud (#17805 ) [preview](https://coder.com/docs/@tag-coder-users-dx/admin/integrations/data-cloud) --------- Co-authored-by: EdwardAngert <17991901+EdwardAngert@users.noreply.github.com>	2025-06-20 17:13:36 -04:00
Edward Angert	5c16079aff	docs: add more specific steps and information about oidc refresh tokens (#18336 ) closes https://github.com/coder/coder/issues/18307 relates to https://github.com/coder/coder/pull/18318 preview: - [refresh-tokens](https://coder.com/docs/@18307-refresh-tokens/admin/users/oidc-auth/refresh-tokens) - [configuring-okta](https://coder.com/docs/@18307-refresh-tokens/tutorials/configuring-okta) ~(not sure why @Emyrk's photo is so huge there though)~ ✔️ - [x] removed from [idp-sync](https://coder.com/docs/@18307-refresh-tokens/admin/users/idp-sync) to do: - move keycloak - add ping federate and azure - edit text (possibly placeholders for now - I want to see how it all relates and edit it again. right now, there's a note about the same thing in every section in a way that's not super helpful/necessary) - ~convert some paragraphs to OL~ calling this out of scope for now --------- Co-authored-by: EdwardAngert <17991901+EdwardAngert@users.noreply.github.com>	2025-06-16 13:18:55 -04:00
Edward Angert	f4600652c3	docs: remove github avatars (#18338 ) the site is making the pictures big, so I'm just removing them in this PR and then maybe we can investigate it some other time - [live site](https://coder.com/docs/admin/integrations/island) - [preview](https://coder.com/docs/@remove-github-avatars/admin/integrations/island) cc @aqandrew #bring-back-the-hotfix-label Co-authored-by: EdwardAngert <17991901+EdwardAngert@users.noreply.github.com>	2025-06-12 00:52:21 +00:00
Atif Ali	97ba7f1ce9	docs: fix alert in artifactory guide (#18235 ) [preview](https://coder.com/docs/@atif%2Ffix-alert/admin/integrations/jfrog-artifactory#jfrog-token)	2025-06-04 12:50:23 -04:00
M Atif Ali	99979a78f5	docs: update jfrog-artifactory integration docs (#17413 )	2025-04-16 19:48:26 +05:00
Edward Angert	cf7d143e43	docs: use consistent examples in prometheus doc and add namespaceSelector spec (#16918 ) closes: #15385 - use consistent `prom-http` port (@johnstcn looks like this was changed/added in #12214 - do we prefer `prom-http` over `prometheus-http` or is it more important that they align?) - add `namespaceSelector:` per @francisco-mata (thanks! - sorry it took so long to get this in) from issue: > For some reason our target was not appearing on our prometheus targets, we had to add a namespaceSelector key on the Service Monitor to successfully appear Co-authored-by: EdwardAngert <17991901+EdwardAngert@users.noreply.github.com>	2025-03-13 22:09:26 -04:00
Edward Angert	101b62dc3e	docs: convert alerts to use GitHub Flavored Markdown (GFM) (#16850 ) followup to #16761 thanks @lucasmelin ! + thanks: @ethanndickson @Parkreiner @matifali @aqandrew - [x] update snippet - [x] find/replace - [x] spot-check [preview](https://coder.com/docs/@16761-gfm-callouts/admin/templates/managing-templates/schedule) (and others) --------- Co-authored-by: EdwardAngert <17991901+EdwardAngert@users.noreply.github.com> Co-authored-by: M Atif Ali <atif@coder.com>	2025-03-10 16:58:20 -04:00
Michael Vincent Patterson	5295902596	docs: clarified prometheus integration behavior (#16724 ) Closes issue #16538. Updated docs to explain the behavior of enabling Prometheus.	2025-02-26 19:30:41 +00:00
M Atif Ali	f8a49f4984	docs: remove the prerequisite step for kubernetes logs streaming (#16625 )	2025-02-21 22:58:26 +05:00
Ben Potter	dd6d57ed39	feat: add docs explaining how Coder integrates with PlatformX (#16378 ) More details in https://github.com/coder/coder-platformx-notifications Preview at https://coder.com/docs/@dx-integration/admin/integrations/platformx (may be slightly outdated due to caching) closes https://github.com/coder/coder/issues/16308 --------- Co-authored-by: EdwardAngert <17991901+EdwardAngert@users.noreply.github.com> Co-authored-by: Edward Angert <EdwardAngert@users.noreply.github.com>	2025-02-03 18:06:30 -06:00
Edward Angert	4f438e71cf	docs: fix broken links (#16179 ) Co-authored-by: EdwardAngert <17991901+EdwardAngert@users.noreply.github.com> Co-authored-by: Cian Johnston <cian@coder.com>	2025-01-17 13:18:48 -05:00
Gregory McCue	08dd2ab4cc	docs: fix typo in prometheus.md (#16091 ) Fixes small `scrape_config` typo in `prometheus.md`	2025-01-10 12:02:25 -05:00
Charlie Voiselle	4e0963966d	docs: markdown fixes and edits (#15527 ) - docs: improve admonition for need to add useHttpPath - docs: fix list item nesting - docs: fix list item nesting - docs: improve admonition for authentication - docs: tidy and update vault guide - docs: improve admonitions - docs: improve admonitions - docs: content edits, reference links to make copy easier to read previews: - <https://coder.com/docs/@fix-guides-list-numbers/admin/external-auth> - <https://coder.com/docs/@fix-guides-list-numbers/admin/integrations/island> --------- Co-authored-by: EdwardAngert <17991901+EdwardAngert@users.noreply.github.com> Co-authored-by: EdwardAngert <EdwardAngert@users.noreply.github.com>	2025-01-03 14:13:46 -05:00
Muhammad Atif Ali	94f5d52fdc	chore: adopt markdownlint and markdown-table-formatter for *.md (#15831 ) Co-authored-by: Edward Angert <EdwardAngert@users.noreply.github.com>	2025-01-03 13:12:59 +00:00
Eric Paulsen	2ec2e8ae6d	docs: add istio docs (#15733 ) closes https://github.com/coder/coder/issues/11821	2024-12-11 17:48:28 +00:00
Ethan	fa69d1ca74	ci: reenable link checker & fix broken links (#15489 ) Follow-up on #15484.	2024-11-13 16:04:10 +11:00
Ethan	6e18742ad3	ci: replace unmaintained markdown link checker (#15424 ) The old one was flaking a bunch and blocking PRs. This is the one recommended by the maintainer of the old one.	2024-11-07 22:30:43 +11:00
Edward Angert	007f0a35a4	fix: adjust instances of Github to GitHub (#15203 ) s/Github/GitHub Co-authored-by: EdwardAngert <17991901+EdwardAngert@users.noreply.github.com>	2024-10-28 07:43:30 -04:00
Muhammad Atif Ali	419eba5fb6	docs: restructure docs (#14421 ) Closes #13434 Supersedes #14182 --------- Co-authored-by: Ethan <39577870+ethanndickson@users.noreply.github.com> Co-authored-by: Ethan Dickson <ethan@coder.com> Co-authored-by: Ben Potter <ben@coder.com> Co-authored-by: Stephen Kirby <58410745+stirby@users.noreply.github.com> Co-authored-by: Stephen Kirby <me@skirby.dev> Co-authored-by: EdwardAngert <17991901+EdwardAngert@users.noreply.github.com> Co-authored-by: Edward Angert <EdwardAngert@users.noreply.github.com>	2024-10-05 10:52:04 -05:00

49 Commits