coder

mirror of https://github.com/coder/coder.git synced 2026-06-02 20:48:20 +00:00

Author	SHA1	Message	Date
Danny Kopping	48b90f8cc8	feat: add coder_build_info metric (#24365 ) _Disclaimer: produced by Claude Opus 4.6_ Adds a `coder_build_info` metric which allows operators to see which versions of Coder are currently running. --------- Signed-off-by: Danny Kopping <danny@coder.com>	2026-04-15 12:48:38 +00:00
Callum Styan	730edba87a	fix: fix false positive disconnected agent metric reporting (#24225 ) We noticed during higher active workspace counts that the agent connection metric, generated via a query to the database, would report a relatively high number of agents as disconnected, somewhere between 5 and 20%. However, other metrics, such as the number of websocket connections, suggested that all agent connections were healthy. Looking at the `Agents` function in prometheus metrics, plus the query execution time (not accounting for actual database round-trip time), revealed that this reporting of agents as disconnected was almost certainly false positives due to clock drift in the way we're generating the metric values. At 10k metrics, with a p50 of 2ms and p99 of 5ms, the entire `agents` function could take upwards of 50s to execute. Because we were doing a query/database round trip to fetch the apps for each agent individually, and grabbing a `time.Now` value on each iteration of that loop, it's likely the portion of agents that were reported as disconnected were those whose last heartbeat was furthest in the past. The fix here is to set a consistent `now` before fetching agent data to avoid clock drift inflating the inactive timeout comparison, and replace the per-agent app query N+1 with a single batched lookup to prevent loop execution time from pushing agents over the disconnected threshold. Signed-off-by: Callum Styan <callumstyan@gmail.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 22:23:06 -07:00
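The "consistent `now`" part of the fix in #24225 can be illustrated with a minimal sketch. It is not the code from the PR: the `agent` type and `countDisconnected` function are hypothetical stand-ins, and only the standard library is used. The point is that every agent is compared against one timestamp captured before the loop, so time spent inside the loop cannot push an agent over the timeout.

```go
package main

import (
	"fmt"
	"time"
)

// agent is a hypothetical stand-in for a workspace agent row.
type agent struct {
	Name          string
	LastHeartbeat time.Time
}

// countDisconnected compares every agent against the same `now`, so slow
// per-agent work inside the loop cannot inflate the elapsed time.
func countDisconnected(agents []agent, now time.Time, timeout time.Duration) int {
	disconnected := 0
	for _, a := range agents {
		if now.Sub(a.LastHeartbeat) > timeout {
			disconnected++
		}
	}
	return disconnected
}

func main() {
	now := time.Now() // captured once, before any per-agent work
	agents := []agent{
		{Name: "fresh", LastHeartbeat: now.Add(-5 * time.Second)},
		{Name: "stale", LastHeartbeat: now.Add(-2 * time.Minute)},
	}
	fmt.Println(countDisconnected(agents, now, 30*time.Second)) // prints 1
}
```

Calling `time.Now()` inside the loop instead would make the result depend on how long earlier iterations took, which is the drift the commit message describes.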
J. Scott Miller	20b953a99d	feat: add Prometheus metric for agent first connection duration (#24179 ) ## Summary Add `coderd_agents_first_connection_seconds` histogram metric that records the duration from workspace agent creation to first connection. This fills an observability gap — provisioner job timings and startup script metrics exist, but the agent connection phase (which can take several minutes) was not exposed to Prometheus. Closes https://github.com/coder/coder/issues/21282 ## Changes - `coderd/prometheusmetrics/prometheusmetrics.go` — Define and register a `HistogramVec` in the existing `Agents()` polling loop. Observe `first_connected_at - created_at` exactly once per agent via a deduplication map, pruned each tick to prevent unbounded memory growth. - `coderd/prometheusmetrics/prometheusmetrics_test.go` — Update `TestAgents` to set `first_connected_at` on the test agent and assert the histogram is collected with correct labels, sample count, and sample sum. - `docs/admin/integrations/prometheus.md`, `scripts/metricsdocgen/generated_metrics` — Auto-generated documentation updates from `make gen`. ## Metric details \| Property \| Value \| \|---\|---\| \| Name \| `coderd_agents_first_connection_seconds` \| \| Type \| histogram \| \| Labels \| `template_name`, `agent_name`, `username`, `workspace_name` \| \| Buckets \| 1s, 10s, 30s, 1m, 2m, 5m, 10m, 30m, 1h \| ## Example PromQL ```promql # P95 agent connection time by template histogram_quantile(0.95, sum(rate(coderd_agents_first_connection_seconds_bucket[1h])) by (le, template_name) ) ``` <details> <summary>Implementation notes</summary> ### Design decisions - Histogram over gauge: Enables `histogram_quantile()` for percentile queries. - Observe in `Agents()` polling loop: All required data is already fetched by `GetWorkspaceAgentsForMetrics()` — no new DB queries. - Dedup via `map[uuid.UUID]struct{}`: Prevents re-observing the same agent across polling ticks. Pruned each cycle to bound memory. 
- Buckets: Aligned with `coderd_provisionerd_workspace_build_timings_seconds` range (1s–1h). ### Overhead at scale (100k active workspaces) The deduplication map (`observedFirstConnection`) and per-tick pruning map (`currentAgentIDs`) are both `map[[16]byte]struct{}`. At 100k agents: - Memory: ~2.25 MB persistent + ~2.25 MB transient per tick = ~4.5 MB peak. - CPU: ~25 ms of map operations per tick (one tick per minute) = <0.05% of one core. Both are negligible relative to the existing cost of the `Agents()` loop (the DB query, per-agent `GetWorkspaceAppsByAgentID` calls, and coordinator node lookups dominate). </details> > 🤖 Generated by Coder Agents	2026-04-14 12:00:46 -05:00
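The deduplicate-then-prune approach described in #24179 might look roughly like the sketch below. This is an illustration under assumptions, not the PR's implementation: `agentID`, `firstConnectionTracker`, and `tick` are hypothetical names, `[16]byte` stands in for `uuid.UUID`, and the histogram observation itself is left out so the example needs only the standard library.

```go
package main

import "fmt"

// agentID has the same shape as uuid.UUID, using only the standard library.
type agentID [16]byte

// firstConnectionTracker remembers which agents were already observed.
type firstConnectionTracker struct {
	observed map[agentID]struct{}
}

func newFirstConnectionTracker() *firstConnectionTracker {
	return &firstConnectionTracker{observed: make(map[agentID]struct{})}
}

// tick returns the agents to observe this cycle (each at most once) and
// prunes entries for agents that are no longer in the current set.
func (t *firstConnectionTracker) tick(current []agentID) []agentID {
	seen := make(map[agentID]struct{}, len(current))
	var fresh []agentID
	for _, id := range current {
		seen[id] = struct{}{}
		if _, ok := t.observed[id]; ok {
			continue // already observed on an earlier tick
		}
		t.observed[id] = struct{}{}
		fresh = append(fresh, id)
	}
	for id := range t.observed {
		if _, ok := seen[id]; !ok {
			delete(t.observed, id) // bound memory: drop agents that went away
		}
	}
	return fresh
}

func main() {
	t := newFirstConnectionTracker()
	a, b := agentID{1}, agentID{2}
	fmt.Println(len(t.tick([]agentID{a, b}))) // prints 2
	fmt.Println(len(t.tick([]agentID{a})))    // prints 0
	fmt.Println(len(t.observed))              // prints 1
}
```

In a real collector, each ID returned by `tick` would correspond to one histogram observation of `first_connected_at - created_at`.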
Mathias Fredriksson	147df5c971	refactor: replace sort.Strings with slices.Sort (#23457 ) The slices package provides type-safe generic replacements for the old typed sort convenience functions. The codebase already uses slices.Sort in 43 call sites; this finishes the migration for the remaining 29. - sort.Strings(x) -> slices.Sort(x) - sort.Float64s(x) -> slices.Sort(x) - sort.StringsAreSorted(x) -> slices.IsSorted(x)	2026-03-23 23:19:23 +02:00
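For reference, the replacements listed in #23457 look like this in a small standalone program. The `sortedCopy` helper is hypothetical and exists only to make the example easy to exercise; `slices.Sort`, `slices.IsSorted`, and `slices.Clone` are standard library functions available since Go 1.21.

```go
package main

import (
	"fmt"
	"slices"
)

// sortedCopy returns a sorted copy, leaving the input untouched.
func sortedCopy(names []string) []string {
	out := slices.Clone(names)
	slices.Sort(out) // previously: sort.Strings(out)
	return out
}

func main() {
	names := sortedCopy([]string{"workspace", "agent", "template"})
	fmt.Println(names, slices.IsSorted(names)) // previously: sort.StringsAreSorted(names)

	latencies := []float64{2.5, 0.1, 1.0}
	slices.Sort(latencies) // previously: sort.Float64s(latencies)
	fmt.Println(latencies)
}
```

The same generic `slices.Sort` call covers both element types, which is why the typed convenience functions are no longer needed.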
Danielle Maywood	f91475cd51	test: remove unnecessary dbauthz.AsSystemRestricted calls in tests (#22663 )	2026-03-05 20:29:49 +00:00
Garrett Delfosse	4057363f78	fix(coderd): add organization_name label to insights Prometheus metrics (#22296 ) ## Description When multiple organizations have templates with the same name, the Prometheus `/metrics` endpoint returns HTTP 500 because Prometheus rejects duplicate label combinations. The three `coderd_insights_` metrics (`coderd_insights_templates_active_users`, `coderd_insights_applications_usage_seconds`, `coderd_insights_parameters`) used only `template_name` as a distinguishing label, so two templates named e.g. `"openstack-v1"` in different orgs would produce duplicate metric series. This adds `organization_name` as a label to all three insight metric descriptors to disambiguate templates across organizations. ## Changes `coderd/prometheusmetrics/insights/metricscollector.go`: - Added `organization_name` label to all three metric descriptors - Added `organizationNames` field (template ID → org name) to the `insightsData` struct - In `doTick`: after fetching templates, collect unique org IDs, fetch organizations via `GetOrganizations`, and build a template-ID-to-org-name mapping - In `Collect()`: pass the organization name as an additional label value in every `MustNewConstMetric` call `coderd/prometheusmetrics/insights/testdata/insights-metrics.json`*: Updated golden file to include `organization_name=coder` in all metric label keys. Fixes #21748	2026-02-25 08:58:50 +00:00
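The collision described in #22296 can be reproduced without Prometheus at all. The sketch below is an assumption-laden illustration: `template`, `labelKey`, and `countCollisions` are hypothetical names, and a label set is modeled as a plain string. It shows that two same-named templates in different organizations share a label set until `organization_name` is added.

```go
package main

import "fmt"

// template is a hypothetical row: a template name plus its organization.
type template struct {
	Organization string
	Name         string
}

// labelKey builds the label set that identifies one metric series.
func labelKey(t template, includeOrg bool) string {
	if includeOrg {
		return "organization_name=" + t.Organization + ",template_name=" + t.Name
	}
	return "template_name=" + t.Name
}

// countCollisions reports how many templates map to an already-used label set.
func countCollisions(templates []template, includeOrg bool) int {
	seen := make(map[string]struct{}, len(templates))
	collisions := 0
	for _, t := range templates {
		key := labelKey(t, includeOrg)
		if _, dup := seen[key]; dup {
			collisions++
			continue
		}
		seen[key] = struct{}{}
	}
	return collisions
}

func main() {
	templates := []template{
		{Organization: "org-a", Name: "openstack-v1"},
		{Organization: "org-b", Name: "openstack-v1"},
	}
	fmt.Println(countCollisions(templates, false)) // prints 1
	fmt.Println(countCollisions(templates, true))  // prints 0
}
```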
Marcin Tojek	036ed5672f	fix!: remove deprecated prometheus metrics (#21788 ) ## Description Removes the following deprecated Prometheus metrics: - `coderd_api_workspace_latest_build_total` → use `coderd_api_workspace_latest_build` instead - `coderd_oauth2_external_requests_rate_limit_total` → use `coderd_oauth2_external_requests_rate_limit` instead These metrics were deprecated in #12976 because gauge metrics should avoid the `_total` suffix per [Prometheus naming conventions](https://prometheus.io/docs/practices/naming/). ## Changes - Removed deprecated metric `coderd_api_workspace_latest_build_total` from `coderd/prometheusmetrics/prometheusmetrics.go` - Removed deprecated metric `coderd_oauth2_external_requests_rate_limit_total` from `coderd/promoauth/oauth2.go` - Updated tests to use the non-deprecated metric name Fixes #12999	2026-01-30 13:30:06 +01:00
Spike Curtis	bddb808b25	chore: arrange imports in a standard way (#21452 ) Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example: ``` import ( "context" "time" "github.com/prometheus/client_golang/prometheus" "golang.org/x/xerrors" "gopkg.in/natefinch/lumberjack.v2" "cdr.dev/slog/v3" "github.com/coder/coder/v2/codersdk/agentsdk" "github.com/coder/serpent" ) ``` 3 groups: standard library, 3rd-party libs, Coder libs. This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.	2026-01-08 15:24:11 +04:00
Spike Curtis	49b34a716a	fix: fix slog to always use array of Fields (#21426 ) Upgrades to slog v3, which includes a small but backward-incompatible API change to the acceptable call arguments when logging. This change allows us to verify via compile-time type checking that arguments are correct and won't cause a panic, as was possible in slog v1, which this replaces (v2 was tagged but never used in coder/coder). It also updates dependencies that also use slog and were updated. I've left the `aibridge` dependency as a commit SHA, under the assumption that the team there (cc @pawbana @dannykopping ) will tag and update the dependency soon and on their own schedule. For other dependencies, I pushed new tags.	2026-01-08 10:29:41 +04:00
Steven Masley	3194bcfc9e	chore: distinct operations for provisioner's 'parse', 'init', 'plan', 'apply', 'graph' (#21064 ) Provisioner steps broken into smaller granular actions. Changes: - `ExtractArchive` moved to `init` request (was in `configure`) - Writing `tfstate` moved to `plan` (was in `configure`) - Moved most plan/apply outputs to `GraphComplete`	2025-12-15 11:26:41 -06:00
Ethan	645da33767	test: fix TestDescCacheTimestampUpdate flake (#20975 ) ## Problem `TestDescCacheTimestampUpdate` was flaky on Windows CI because `time.Now()` has ~15.6ms resolution, causing consecutive calls to return identical timestamps. ## Solution Inject `quartz.Clock` into `MetricsAggregator` using an options pattern, making the test deterministic by using a mock clock with explicit time advancement. ### Changes - Add `clock quartz.Clock` field to `MetricsAggregator` struct - Add `WithClock()` option for dependency injection - Replace all `time.Now()` calls with `ma.clock.Now()` - Update test to use mock clock with `mClock.Advance(time.Second)` --- This PR was fully generated by [`mux`](https://github.com/coder/mux) using Claude Opus 4.5, and reviewed by me. Closes https://github.com/coder/internal/issues/1146	2025-12-02 10:53:36 +11:00
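The clock-injection pattern from #20975 can be sketched with the standard library alone. This is not the `quartz` API: `Clock`, `fakeClock`, `aggregator`, `withClock`, and `touch` are hypothetical stand-ins chosen for the example. The idea is the same, though: the component reads time through an injected clock, and the test advances that clock explicitly instead of relying on `time.Now()` resolution.

```go
package main

import (
	"fmt"
	"time"
)

// Clock is a minimal, hypothetical clock abstraction.
type Clock interface {
	Now() time.Time
}

type realClock struct{}

func (realClock) Now() time.Time { return time.Now() }

// fakeClock only moves when Advance is called, making tests deterministic.
type fakeClock struct{ now time.Time }

func (f *fakeClock) Now() time.Time          { return f.now }
func (f *fakeClock) Advance(d time.Duration) { f.now = f.now.Add(d) }

// aggregator stands in for a component that timestamps its updates.
type aggregator struct {
	clock      Clock
	lastUpdate time.Time
}

type option func(*aggregator)

// withClock injects a clock, following the functional options pattern.
func withClock(c Clock) option {
	return func(a *aggregator) { a.clock = c }
}

func newAggregator(opts ...option) *aggregator {
	a := &aggregator{clock: realClock{}} // real time unless overridden
	for _, opt := range opts {
		opt(a)
	}
	return a
}

// touch records and returns the current time as seen by the injected clock.
func (a *aggregator) touch() time.Time {
	a.lastUpdate = a.clock.Now()
	return a.lastUpdate
}

func main() {
	fc := &fakeClock{now: time.Unix(0, 0)}
	a := newAggregator(withClock(fc))
	first := a.touch()
	fc.Advance(time.Second)
	second := a.touch()
	fmt.Println(second.Sub(first)) // prints 1s
}
```

With a real clock, two consecutive `touch` calls on a platform with coarse timer resolution could return identical timestamps, which is the flake the commit describes.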
Callum Styan	658e8c34a9	perf: improve performance of metricsAggregator path by reducing memory allocations (#20724 ) Signed-off-by: Callum Styan <callumstyan@gmail.com>	2025-11-24 15:45:08 -08:00
Callum Styan	5a18cf4c86	fix: remove unintentionally added print in test code (#20391 ) accidentally added in https://github.com/coder/coder/pull/19786 Signed-off-by: Callum Styan <callumstyan@gmail.com>	2025-10-20 18:51:15 -07:00
Callum Styan	141ef23c81	fix: introduce dedicated queries for workspaces and workspace agents metrics (#19786 ) Aids in differentiating between sources of calls to `GetWorkspaces` by introducing new queries for metrics-specific use cases --------- Signed-off-by: Callum Styan <callumstyan@gmail.com>	2025-10-17 13:40:10 -07:00
Spike Curtis	1354d84eb4	chore: refactor instance identity to be a SessionTokenProvider (#19566 ) Refactors Agent instance identity to be a SessionTokenProvider. Refactors the CLI to create Agent clients via a centralized function, rather than ad hoc via individual command handlers and their flags. This allows commands besides `coder agent`, but which still use the agent identity, to support instance identity authentication. Fixes #19111 by unifying all API requests to go through the SessionTokenProvider for auth credentials.	2025-09-03 10:38:42 +04:00
dependabot[bot]	519812776e	chore: bump github.com/stretchr/testify from 1.10.0 to 1.11.1 (#19599 ) Bumps [github.com/stretchr/testify](https://github.com/stretchr/testify) from 1.10.0 to 1.11.1. --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Ethan Dickson <ethan@coder.com>	2025-09-02 03:12:37 +00:00
Susana Ferreira	0ab345ca84	feat: add prebuild timing metrics to Prometheus (#19503 ) ## Description This PR introduces one counter and two histograms related to workspace creation and claiming. The goal is to provide clearer observability into how workspaces are created (regular vs prebuild) and the time cost of those operations. ### `coderd_workspace_creation_total` * Metric type: Counter * Name: `coderd_workspace_creation_total` * Labels: `organization_name`, `template_name`, `preset_name` This counter tracks whether a regular workspace (not created from a prebuild pool) was created using a preset or not. Currently, we already expose `coderd_prebuilt_workspaces_claimed_total` for claimed prebuilt workspaces, but we lack a comparable metric for regular workspace creations. This metric fills that gap, making it possible to compare regular creations against claims. Implementation notes: * Exposed as a `coderd_` metric, consistent with other workspace-related metrics (e.g. `coderd_api_workspace_latest_build`: https://github.com/coder/coder/blob/main/coderd/prometheusmetrics/prometheusmetrics.go#L149). * Every `defaultRefreshRate` (1 minute ), DB query `GetRegularWorkspaceCreateMetrics` is executed to fetch all regular workspaces (not created from a prebuild pool). * The counter is updated with the total from all time (not just since metric introduction). This differs from the histograms below, which only accumulate from their introduction forward. 
### `coderd_workspace_creation_duration_seconds` & `coderd_prebuilt_workspace_claim_duration_seconds` * Metric types: Histogram * Names: * `coderd_workspace_creation_duration_seconds` * Labels: `organization_name`, `template_name`, `preset_name`, `type` (`regular`, `prebuild`) * `coderd_prebuilt_workspace_claim_duration_seconds` * Labels: `organization_name`, `template_name`, `preset_name` We already have `coderd_provisionerd_workspace_build_timings_seconds`, which tracks build run times for all workspace builds handled by the provisioner daemon. However, in the context of this issue, we are only interested in creation and claim build times, not all transitions; additionally, this metric does not include `preset_name`, and adding it there would significantly increase cardinality. Therefore, separate more focused metrics are introduced here: * `coderd_workspace_creation_duration_seconds`: Build time to create a workspace (either a regular workspace or the build into a prebuild pool, for prebuild initial provisioning build). * `coderd_prebuilt_workspace_claim_duration_seconds`: Time to claim a prebuilt workspace from the pool. The reason for two separate histograms is that: * Creation (regular or prebuild): provisioning builds with similar time magnitude, generally expected to take longer than a claim operation. * Claim: expected to be a much faster provisioning build. #### Native histogram usage Provisioning times vary widely between projects. Using static buckets risks unbalanced or poorly informative histograms. 
To address this, these metrics use [Prometheus native histograms](https://prometheus.io/docs/specs/native_histograms/): * First introduced in Prometheus v2.40.0 * Recommended stable usage from v2.45+ * Requires Go client `prometheus/client_golang` v1.15.0+ * Experimental and must be explicitly enabled on the server (`--enable-feature=native-histograms`) For compatibility, we also retain a classic bucket definition (aligned with the existing provisioner metric: https://github.com/coder/coder/blob/main/provisionerd/provisionerd.go#L182-L189). * If native histograms are enabled, Prometheus ingests the high-resolution histogram. * If not, it falls back to the predefined buckets. Implementation notes: * Unlike the counter, these histograms are updated in real-time at workspace build job completion. * They reflect data only from the point of introduction forward (no historical backfill). ## Relates to Closes: https://github.com/coder/coder/issues/19528 Native histograms tested in observability stack: https://github.com/coder/observability/pull/50	2025-08-28 15:00:26 +01:00
Callum Styan	014a2d5b0f	perf: don't call GetUserByID unnecessarily for Agents metrics loops (#19395 ) At the moment, the loop which retrieves and updates the values of the agents metrics excessively calls `GetUserByID` (a DB query). First it retrieves a list of all workspaces, filtering out inactive agents (not entirely clear to me whether this is non-running workspaces, or just dead agents), and then iterates over those workspaces to get the rest of the relevant data for the metrics. The next call is `GetUserByID` for `workspace.OwnerID`. This is unnecessary because the `workspaces_visible` view we pull workspaces from has already been joined with the users table to get the username/name/etc. This should at least partially resolve https://github.com/coder/internal/issues/726 --------- Signed-off-by: Callum Styan <callumstyan@gmail.com>	2025-08-21 11:01:32 -07:00
Dean Sheather	6eb02d1c2a	chore: wire up usage tracking for managed agents (#19096 ) Wires up the usage collector and publisher to coderd. Relates to coder/internal#814	2025-08-20 23:38:09 +10:00
Mathias Fredriksson	1b66495b70	fix(coderd/prometheusmetrics)!: filter deleted wsbuilds to reduce db load (#19197 ) This change removes the `GetLatestWorkspaceBuilds` query which includes all workspaces for all time (including deleted). This allows us to also stop using `GetProvisionerJobsByIDs` for said builds as the job status is included in `GetWorkspaces` called separately. BREAKING CHANGE: The `coderd_api_workspace_latest_build` Prometheus metric no longer includes builds belonging to deleted workspaces, as such, this metric will show fewer statuses. Fixes coder/internal#717	2025-08-11 14:48:31 +03:00
Steven Masley	0a3afeddc8	chore: add more pprof labels for various go routines (#19243 ) - ReplicaSync - Notifications - MetricsAggregator - DBPurge	2025-08-07 20:05:32 +00:00
Steven Masley	8ba8b4f061	chore: add profiling labels for pprof analysis (#19232 ) PProf labels segment the code into groups for determining the source of CPU/memory profiles. Since the web server and background jobs share a lot of the same code (e.g. wsbuilder), it helps to know whether the load is user-induced or background-job-based.	2025-08-07 11:21:17 -05:00
Callum Styan	ffbfaf2a6f	feat: allow bypassing current CORS magic based on template config (#18706 ) Solves https://github.com/coder/coder/issues/15096 This is a slight rework/refactor of the earlier PRs from @dannykopping and @Emyrk: - https://github.com/coder/coder/pull/15669 - https://github.com/coder/coder/pull/15684 - https://github.com/coder/coder/pull/17596 Rather than having a per-app CORS behaviour setting and additionally a template level setting for ports, this PR adds a single template level CORS behaviour setting that is then used by all apps/ports for workspaces created from that template. The main changes are in `proxy.go` and `request.go` to: a) get the CORS behaviour setting from the template b) have `HandleSubdomain` bypass the CORS middleware handler if the selected behaviour is `passthru` c) in `proxyWorkspaceApp`, do not modify the response if the selected behaviour is `passthru` <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * New Features * Added support for configuring CORS behavior ("simple" or "passthru") at the template level for all shared ports. * Introduced a new "CORS Behavior" setting in the template creation and settings forms. * API endpoints and responses now include the optional `cors_behavior` property for templates. * Workspace apps and proxy now honor the specified CORS behavior, enabling conditional CORS middleware application. * Enhanced workspace app tests with comprehensive scenarios covering CORS behaviors and authentication states. * Bug Fixes * None. * Documentation * Updated API and admin documentation to describe the new `cors_behavior` property and its usage. * Added examples and schema references for CORS behavior in relevant API docs. * Tests * Extended automated tests to cover different CORS behavior scenarios for templates and workspace apps. * Chores * Updated audit logging to track changes to the `cors_behavior` field on templates. 
<!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Callum Styan <callumstyan@gmail.com>	2025-07-30 13:42:39 -07:00
ケイラ	fae30a00fd	chore: remove unnecessary redeclarations in for loops (#18440 )	2025-06-20 13:16:55 -06:00
Hugo Dutka	277c2c7ea7	chore(coderd/prometheusmetrics): remove dbmem from tests (#18238 )	2025-06-05 09:30:27 +02:00
Steven Masley	ca38729840	chore: revert dynamic params as a safe experiment (#17510 )	2025-04-22 16:21:15 +00:00
Jon Ayers	17ddee05e5	chore: update golang to 1.24.1 (#17035 ) - Update go.mod to use Go 1.24.1 - Update GitHub Actions setup-go action to use Go 1.24.1 - Fix linting issues with golangci-lint by: - Updating to golangci-lint v1.57.1 (more compatible with Go 1.24.1) 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude <claude@anthropic.com>	2025-03-26 01:56:39 -05:00
Eng Zer Jun	04c33968cf	refactor: replace `golang.org/x/exp/slices` with `slices` (#16772 ) The experimental functions in `golang.org/x/exp/slices` are now available in the standard library since Go 1.21. Reference: https://go.dev/doc/go1.21#slices Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>	2025-03-04 00:46:49 +11:00
Spike Curtis	5861e516b9	chore: add standard test logger ignoring db canceled (#15556 ) Refactors our use of `slogtest` to instantiate a "standard logger" across most of our tests. This standard logger incorporates https://github.com/coder/slog/pull/217 to also ignore database query canceled errors by default, which are a source of low-severity flakes. Any test that has set non-default `slogtest.Options` is left alone. In particular, `coderdtest` defaults to ignoring all errors. We might consider revisiting that decision now that we have better tools to target the really common flaky Error logs on shutdown.	2024-11-18 14:09:22 +04:00
Colin Adler	3de98c25db	feat: add prometheus metric for tracking user statuses (#15281 )	2024-10-30 18:41:16 +00:00
Steven Masley	343f8ec9ab	chore: join owner, template, and org in new workspace view (#15116 ) Joins in fields like `username`, `avatar_url`, `organization_name`, `template_name` to `workspaces` via a view. The view must be maintained moving forward, but this prevents needing to add RBAC permissions to fetch related workspace fields.	2024-10-22 09:20:54 -05:00
Garrett Delfosse	922f4c545f	fix: handle new agent stat format correctly (#14576 ) --------- Co-authored-by: Ethan Dickson <ethan@coder.com>	2024-09-20 01:52:14 +10:00
Kayla Washburn-Love	bf4b7abf14	chore(coderd): allow creating workspaces without specifying an organization (#14048 )	2024-07-30 10:44:02 -06:00
Ethan	dd243686e4	chore!: remove deprecated agent v1 routes (#13486 )	2024-06-11 12:22:59 +10:00
Garrett Delfosse	5b9a65e5c1	chore: move Batcher and Tracker to workspacestats (#13418 )	2024-06-10 15:35:23 -04:00
Colin Adler	40390ecc30	chore: fix `TestServer/Prometheus/DBMetricsDisabled` test flake (#13453 ) See: https://github.com/coder/coder/actions/runs/9352137263/job/25739550487#step:5:368	2024-06-03 15:38:59 -05:00
Garrett Delfosse	5789ea5397	chore: move stat reporting into workspacestats package (#13386 )	2024-05-29 11:49:08 -04:00
Garrett Delfosse	c550d0641d	feat: move shared ports out of experiment (#13120 )	2024-05-02 14:11:33 -04:00
Pavel Aseev	4682355eed	chore: deprecate gauge metrics with _total suffix (#12744 ) (#12976 ) * chore: deprecate gauge metrics with _total suffix (#12744) Deprecated metrics: - coderd_oauth2_external_requests_rate_limit_total - coderd_api_workspace_latest_build_total * Apply suggestions from code review add link to follow-up issue Co-authored-by: Cian Johnston <public@cianjohnston.ie> --------- Co-authored-by: Cian Johnston <public@cianjohnston.ie>	2024-04-24 11:23:24 +03:00
Steven Masley	0a8c8ce5cc	chore: remove InsertWorkspaceAgentStat query (#12869 ) * chore: remove InsertWorkspaceAgentStat query InsertWorkspaceAgentStats (batch) exists. We only used the singular in a single unit test place. Removing the single for the batch, reducing the interface size.	2024-04-09 12:35:27 -05:00
Danny Kopping	79fb8e43c5	feat: expose workspace statuses (with details) as a prometheus metric (#12762 ) Implements #12462	2024-04-02 09:57:36 +02:00
Mathias Fredriksson	b183236482	feat(coderd/database): use `template_usage_stats` in `ByTemplate` insights queries (#12668 ) This PR updates the `ByTemplate` insights queries used for generating Prometheus metrics to behave the same way as the new rollup query and re-written insights queries that utilize the rolled up data.	2024-03-25 17:42:02 +02:00
Danny Kopping	9cfd5baa91	feat(coderd): export metric indicating each experiment's status (#12657 )	2024-03-19 14:11:27 +02:00
Steven Masley	f0f9569d51	chore: enforce that provisioners can only acquire jobs in their own organization (#12600 ) * chore: add org ID as optional param to AcquireJob * chore: plumb through organization id to provisioner daemons * add org id to provisioner domain key * enforce org id argument * dbgen provisioner jobs defaults to default org	2024-03-18 12:48:13 -05:00
Danny Kopping	da54c8a51f	fix: fix data race in TestLabelsAggregation tests (#12578 )	2024-03-13 13:47:22 +02:00
Danny Kopping	7a7105ad66	feat: make agent stats' cardinality configurable (#12535 )	2024-03-13 12:03:36 +02:00
Cian Johnston	8f40ee3465	Revert "feat: make agent stats' cardinality configurable (#12468 )" (#12533 ) This reverts commit `21d1873d97`.	2024-03-11 14:33:36 +00:00
Danny Kopping	21d1873d97	feat: make agent stats' cardinality configurable (#12468 ) Closes #12221	2024-03-11 16:04:08 +02:00
Marcin Tojek	aacb4a2b4c	feat: use map instead of slice in metrics aggregator (#11815 )	2024-01-29 09:12:41 +01:00
Spike Curtis	fdd60d316e	fix: fix MetricsAggregator check for metric sameness (#11508 ) Fixes #11451 A refactor of the Agent API passes metrics as protobufs, which include pointers to label name/value pairs. The aggregator tested for sameness by doing a shallow compare of label values, which for different stats reports would compare unequal because the pointers would be different. This fix does a deep compare. While testing I also noted that we neglect to compare template names. This is unlikely to have caused any issue in practice, since the combination of username/workspace is unique, but in the context of comparing metric labels we should do the comparison. If a user creates a workspace, deletes it, then recreates from a different template, we could in principle have reported incorrect stats for the old template.	2024-01-09 15:21:30 +04:00

92 Commits