Commit Graph

92 Commits

Author SHA1 Message Date
Danny Kopping 48b90f8cc8 feat: add coder_build_info metric (#24365)
_Disclaimer: produced by Claude Opus 4.6_

Adds a `coder_build_info` metric which allows operators to see which
versions of Coder are currently running.

---------

Signed-off-by: Danny Kopping <danny@coder.com>
2026-04-15 12:48:38 +00:00
Callum Styan 730edba87a fix: fix false positive disconnected agent metric reporting (#24225)
We noticed during higher active workspace counts that the agent
connection metric, generated via a query to the database, would report a
relatively high number of agents as disconnected, somewhere between 5
and 20%. However, other metrics such as the number of websocket
connections would suggest that all agent connections are healthy.

Looking at the `Agents` function in prometheus metrics, plus the query
execution time (not accounting for actual database round-trip time),
revealed that this reporting of agents as disconnected was almost
certainly false positives due to clock drift in the way we're generating
the metric values. At 10k metrics, with a p50 of 2ms and p99 of 5ms, the
entire `agents` function could take upwards of 50s to execute. Because we
were doing a database round trip to query the apps for each agent
individually, and grabbing a `time.Now` value on each iteration of that
loop, it's likely the agents that were reported as disconnected were
those whose last heartbeat was the furthest in the past.

The fix here is to set a consistent `now` before fetching agent data to
avoid clock drift inflating the inactive timeout comparison, and replace
the per-agent app query N+1 with a single batched lookup to prevent loop
execution time from pushing agents over the disconnected threshold.

Signed-off-by: Callum Styan <callumstyan@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 22:23:06 -07:00
J. Scott Miller 20b953a99d feat: add Prometheus metric for agent first connection duration (#24179)
## Summary

Add `coderd_agents_first_connection_seconds` histogram metric that records
the duration from workspace agent creation to first connection. This fills
an observability gap: provisioner job timings and startup script metrics
exist, but the agent connection phase (which can take several minutes) was
not exposed to Prometheus.

Closes https://github.com/coder/coder/issues/21282

## Changes

- **`coderd/prometheusmetrics/prometheusmetrics.go`**: Define and register a
  `HistogramVec` in the existing `Agents()` polling loop. Observe
  `first_connected_at - created_at` exactly once per agent via a
  deduplication map, pruned each tick to prevent unbounded memory growth.
- **`coderd/prometheusmetrics/prometheusmetrics_test.go`**: Update
  `TestAgents` to set `first_connected_at` on the test agent and assert the
  histogram is collected with correct labels, sample count, and sample sum.
- **`docs/admin/integrations/prometheus.md`**,
  **`scripts/metricsdocgen/generated_metrics`**:
  Auto-generated documentation updates from `make gen`.

## Metric details

| Property | Value |
|---|---|
| Name | `coderd_agents_first_connection_seconds` |
| Type | histogram |
| Labels | `template_name`, `agent_name`, `username`, `workspace_name` |
| Buckets | 1s, 10s, 30s, 1m, 2m, 5m, 10m, 30m, 1h |

## Example PromQL

```promql
# P95 agent connection time by template
histogram_quantile(0.95,
  sum(rate(coderd_agents_first_connection_seconds_bucket[1h])) by (le, template_name)
)
```

<details>
<summary>Implementation notes</summary>

### Design decisions

- **Histogram over gauge**: Enables `histogram_quantile()` for percentile
  queries.
- **Observe in `Agents()` polling loop**: All required data is already
  fetched by `GetWorkspaceAgentsForMetrics()`, so no new DB queries are
  needed.
- **Dedup via `map[uuid.UUID]struct{}`**: Prevents re-observing the same
  agent across polling ticks. Pruned each cycle to bound memory.
- **Buckets**: Aligned with the
  `coderd_provisionerd_workspace_build_timings_seconds` range (1s–1h).

### Overhead at scale (100k active workspaces)

The deduplication map (`observedFirstConnection`) and per-tick pruning map
(`currentAgentIDs`) are both `map[[16]byte]struct{}`. At 100k agents:

- **Memory**: ~2.25 MB persistent + ~2.25 MB transient per tick = **~4.5 MB
  peak**.
- **CPU**: ~25 ms of map operations per tick (one tick per minute) =
  **<0.05% of one core**.

Both are negligible relative to the existing cost of the `Agents()` loop
(the DB query, per-agent `GetWorkspaceAppsByAgentID` calls, and coordinator
node lookups dominate).

</details>

> 🤖 Generated by Coder Agents
2026-04-14 12:00:46 -05:00
Mathias Fredriksson 147df5c971 refactor: replace sort.Strings with slices.Sort (#23457)
The slices package provides type-safe generic replacements for the
old typed sort convenience functions. The codebase already uses
slices.Sort in 43 call sites; this finishes the migration for the
remaining 29.

- sort.Strings(x)          -> slices.Sort(x)
- sort.Float64s(x)         -> slices.Sort(x)
- sort.StringsAreSorted(x) -> slices.IsSorted(x)
2026-03-23 23:19:23 +02:00
Danielle Maywood f91475cd51 test: remove unnecessary dbauthz.AsSystemRestricted calls in tests (#22663) 2026-03-05 20:29:49 +00:00
Garrett Delfosse 4057363f78 fix(coderd): add organization_name label to insights Prometheus metrics (#22296)
## Description

When multiple organizations have templates with the same name, the
Prometheus `/metrics` endpoint returns HTTP 500 because Prometheus
rejects duplicate label combinations. The three `coderd_insights_*`
metrics (`coderd_insights_templates_active_users`,
`coderd_insights_applications_usage_seconds`,
`coderd_insights_parameters`) used only `template_name` as a
distinguishing label, so two templates named e.g. `"openstack-v1"` in
different orgs would produce duplicate metric series.

This adds `organization_name` as a label to all three insight metric
descriptors to disambiguate templates across organizations.
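
The collision can be illustrated without Prometheus at all. The sketch below is an assumption-laden model, not the collector code: it treats a series as identified by its joined label values, and `tmpl`, `seriesKey`, and `hasDuplicateSeries` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// tmpl is a hypothetical stand-in for a template row.
type tmpl struct {
	Org  string
	Name string
}

// seriesKey joins label values into a single identity string.
func seriesKey(labelValues ...string) string {
	return strings.Join(labelValues, "\x00")
}

// hasDuplicateSeries reports whether two templates produce the same key.
func hasDuplicateSeries(templates []tmpl, key func(tmpl) string) bool {
	seen := make(map[string]struct{}, len(templates))
	for _, t := range templates {
		k := key(t)
		if _, dup := seen[k]; dup {
			return true
		}
		seen[k] = struct{}{}
	}
	return false
}

func main() {
	templates := []tmpl{{"org-a", "openstack-v1"}, {"org-b", "openstack-v1"}}
	byName := func(t tmpl) string { return seriesKey(t.Name) }
	byOrgAndName := func(t tmpl) string { return seriesKey(t.Org, t.Name) }
	fmt.Println(hasDuplicateSeries(templates, byName))       // name alone collides
	fmt.Println(hasDuplicateSeries(templates, byOrgAndName)) // org + name is unique
}
```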

## Changes

**`coderd/prometheusmetrics/insights/metricscollector.go`**:
- Added `organization_name` label to all three metric descriptors
- Added `organizationNames` field (template ID → org name) to the
`insightsData` struct
- In `doTick`: after fetching templates, collect unique org IDs, fetch
organizations via `GetOrganizations`, and build a
template-ID-to-org-name mapping
- In `Collect()`: pass the organization name as an additional label
value in every `MustNewConstMetric` call

**`coderd/prometheusmetrics/insights/testdata/insights-metrics.json`**:
Updated golden file to include `organization_name=coder` in all metric
label keys.

Fixes #21748
2026-02-25 08:58:50 +00:00
Marcin Tojek 036ed5672f fix!: remove deprecated prometheus metrics (#21788)
## Description

Removes the following deprecated Prometheus metrics:

- `coderd_api_workspace_latest_build_total` → use
`coderd_api_workspace_latest_build` instead
- `coderd_oauth2_external_requests_rate_limit_total` → use
`coderd_oauth2_external_requests_rate_limit` instead

These metrics were deprecated in #12976 because gauge metrics should
avoid the `_total` suffix per [Prometheus naming
conventions](https://prometheus.io/docs/practices/naming/).

## Changes

- Removed deprecated metric `coderd_api_workspace_latest_build_total`
from `coderd/prometheusmetrics/prometheusmetrics.go`
- Removed deprecated metric
`coderd_oauth2_external_requests_rate_limit_total` from
`coderd/promoauth/oauth2.go`
- Updated tests to use the non-deprecated metric name

Fixes #12999
2026-01-30 13:30:06 +01:00
Spike Curtis bddb808b25 chore: arrange imports in a standard way (#21452)
Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example:

```
import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"golang.org/x/xerrors"
	"gopkg.in/natefinch/lumberjack.v2"

	"cdr.dev/slog/v3"
	"github.com/coder/coder/v2/codersdk/agentsdk"
	"github.com/coder/serpent"
)
```

3 groups: standard library, 3rd party libs, Coder libs.

This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.
2026-01-08 15:24:11 +04:00
Spike Curtis 49b34a716a fix: fix slog to always use array of Fields (#21426)
Upgrades to slog v3, which includes a small but backward-incompatible API change to the acceptable call arguments when logging. This change allows us to verify via compile-time type checking that arguments are correct and won't cause a panic, as was possible in slog v1, which this replaces (v2 was tagged but never used in coder/coder).

It also bumps the dependencies that use slog to their updated versions.

I've left the `aibridge` dependency as a commit SHA, under the assumption that the team there (cc @pawbana @dannykopping ) will tag and update the dependency soon and on their own schedule.

For the other dependencies, I pushed new tags.
2026-01-08 10:29:41 +04:00
Steven Masley 3194bcfc9e chore: distinct operations for provisioner's 'parse', 'init', 'plan', 'apply', 'graph' (#21064)
Provisioner steps broken into smaller granular actions.
Changes:
- `ExtractArchive` moved to `init` request (was in `configure`)
- Writing `tfstate` moved to `plan` (was in `configure`)
- Moved most plan/apply outputs to `GraphComplete`
2025-12-15 11:26:41 -06:00
Ethan 645da33767 test: fix TestDescCacheTimestampUpdate flake (#20975)
## Problem

`TestDescCacheTimestampUpdate` was flaky on Windows CI because
`time.Now()` has ~15.6ms resolution, causing consecutive calls to return
identical timestamps.

## Solution

Inject `quartz.Clock` into `MetricsAggregator` using an options pattern,
making the test deterministic by using a mock clock with explicit time
advancement.

### Changes
- Add `clock quartz.Clock` field to `MetricsAggregator` struct
- Add `WithClock()` option for dependency injection
- Replace all `time.Now()` calls with `ma.clock.Now()`
- Update test to use mock clock with `mClock.Advance(time.Second)`

---

This PR was fully generated by [`mux`](https://github.com/coder/mux)
using Claude Opus 4.5, and reviewed by me.

Closes https://github.com/coder/internal/issues/1146
2025-12-02 10:53:36 +11:00
Callum Styan 658e8c34a9 perf: improve performance of metricsAggregator path by reducing memory allocations (#20724)
Signed-off-by: Callum Styan <callumstyan@gmail.com>
2025-11-24 15:45:08 -08:00
Callum Styan 5a18cf4c86 fix: remove unintentionally added print in test code (#20391)
accidentally added in https://github.com/coder/coder/pull/19786

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2025-10-20 18:51:15 -07:00
Callum Styan 141ef23c81 fix: introduce dedicated queries for workspaces and workspace agents metrics (#19786)
Aid in differentiating between sources of calls to `GetWorkspaces` by introducing new queries for metrics-specific use cases.

---------

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2025-10-17 13:40:10 -07:00
Spike Curtis 1354d84eb4 chore: refactor instance identity to be a SessionTokenProvider (#19566)
Refactors Agent instance identity to be a SessionTokenProvider.

Refactors the CLI to create Agent clients via a centralized function, rather than ad hoc via individual command handlers and their flags.

This allows commands besides `coder agent`, but which still use the agent identity, to support instance identity authentication.

Fixes #19111 by unifying all API requests to go through the SessionTokenProvider for auth credentials.
2025-09-03 10:38:42 +04:00
dependabot[bot] 519812776e chore: bump github.com/stretchr/testify from 1.10.0 to 1.11.1 (#19599)
Bumps [github.com/stretchr/testify](https://github.com/stretchr/testify)
from 1.10.0 to 1.11.1.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/stretchr/testify/releases">github.com/stretchr/testify's
releases</a>.</em></p>
<blockquote>
<h2>v1.11.1</h2>
<p>This release fixes <a
href="https://redirect.github.com/stretchr/testify/issues/1785">#1785</a>
introduced in v1.11.0 where expected argument values implementing the
stringer interface (<code>String() string</code>) with a method which
mutates their value, when passed to mock.Mock.On
(<code>m.On(&quot;Method&quot;, &lt;expected&gt;).Return()</code>) or
actual argument values passed to mock.Mock.Called may no longer match
one another where they previously did match. The behaviour prior to
v1.11.0 where the stringer is always called is restored. Future testify
releases may not call the stringer method at all in this case.</p>
<h2>What's Changed</h2>
<ul>
<li>Backport <a
href="https://redirect.github.com/stretchr/testify/issues/1786">#1786</a>
to release/1.11: mock: revert to pre-v1.11.0 argument matching behavior
for mutating stringers by <a
href="https://github.com/brackendawson"><code>@​brackendawson</code></a>
in <a
href="https://redirect.github.com/stretchr/testify/pull/1788">stretchr/testify#1788</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/stretchr/testify/compare/v1.11.0...v1.11.1">https://github.com/stretchr/testify/compare/v1.11.0...v1.11.1</a></p>
<h2>v1.11.0</h2>
<h2>What's Changed</h2>
<h3>Functional Changes</h3>
<p>v1.11.0 Includes a number of performance improvements.</p>
<ul>
<li>Call stack perf change for CallerInfo by <a
href="https://github.com/mikeauclair"><code>@​mikeauclair</code></a> in
<a
href="https://redirect.github.com/stretchr/testify/pull/1614">stretchr/testify#1614</a></li>
<li>Lazily render mock diff output on successful match by <a
href="https://github.com/mikeauclair"><code>@​mikeauclair</code></a> in
<a
href="https://redirect.github.com/stretchr/testify/pull/1615">stretchr/testify#1615</a></li>
<li>assert: check early in Eventually, EventuallyWithT, and Never by <a
href="https://github.com/cszczepaniak"><code>@​cszczepaniak</code></a>
in <a
href="https://redirect.github.com/stretchr/testify/pull/1427">stretchr/testify#1427</a></li>
<li>assert: add IsNotType by <a
href="https://github.com/bartventer"><code>@​bartventer</code></a> in <a
href="https://redirect.github.com/stretchr/testify/pull/1730">stretchr/testify#1730</a></li>
<li>assert.JSONEq: shortcut if same strings by <a
href="https://github.com/dolmen"><code>@​dolmen</code></a> in <a
href="https://redirect.github.com/stretchr/testify/pull/1754">stretchr/testify#1754</a></li>
<li>assert.YAMLEq: shortcut if same strings by <a
href="https://github.com/dolmen"><code>@​dolmen</code></a> in <a
href="https://redirect.github.com/stretchr/testify/pull/1755">stretchr/testify#1755</a></li>
<li>assert: faster and simpler isEmpty using reflect.Value.IsZero by <a
href="https://github.com/dolmen"><code>@​dolmen</code></a> in <a
href="https://redirect.github.com/stretchr/testify/pull/1761">stretchr/testify#1761</a></li>
<li>suite: faster methods filtering (internal refactor) by <a
href="https://github.com/dolmen"><code>@​dolmen</code></a> in <a
href="https://redirect.github.com/stretchr/testify/pull/1758">stretchr/testify#1758</a></li>
</ul>
<h3>Fixes</h3>
<ul>
<li>assert.ErrorAs: log target type by <a
href="https://github.com/craig65535"><code>@​craig65535</code></a> in <a
href="https://redirect.github.com/stretchr/testify/pull/1345">stretchr/testify#1345</a></li>
<li>Fix failure message formatting for Positive and Negative asserts in
<a
href="https://redirect.github.com/stretchr/testify/pull/1062">stretchr/testify#1062</a></li>
<li>Improve ErrorIs message when error is nil but an error was expected
by <a href="https://github.com/tsioftas"><code>@​tsioftas</code></a> in
<a
href="https://redirect.github.com/stretchr/testify/pull/1681">stretchr/testify#1681</a></li>
<li>fix Subset/NotSubset when calling with mixed input types by <a
href="https://github.com/siliconbrain"><code>@​siliconbrain</code></a>
in <a
href="https://redirect.github.com/stretchr/testify/pull/1729">stretchr/testify#1729</a></li>
<li>Improve ErrorAs failure message when error is nil by <a
href="https://github.com/ccoVeille"><code>@​ccoVeille</code></a> in <a
href="https://redirect.github.com/stretchr/testify/pull/1734">stretchr/testify#1734</a></li>
<li>mock.AssertNumberOfCalls: improve error msg by <a
href="https://github.com/3scalation"><code>@​3scalation</code></a> in <a
href="https://redirect.github.com/stretchr/testify/pull/1743">stretchr/testify#1743</a></li>
</ul>
<h3>Documentation, Build &amp; CI</h3>
<ul>
<li>docs: Fix typo in README by <a
href="https://github.com/alexandear"><code>@​alexandear</code></a> in <a
href="https://redirect.github.com/stretchr/testify/pull/1688">stretchr/testify#1688</a></li>
<li>Replace deprecated io/ioutil with io and os by <a
href="https://github.com/alexandear"><code>@​alexandear</code></a> in <a
href="https://redirect.github.com/stretchr/testify/pull/1684">stretchr/testify#1684</a></li>
<li>Document consequences of calling t.FailNow() by <a
href="https://github.com/greg0ire"><code>@​greg0ire</code></a> in <a
href="https://redirect.github.com/stretchr/testify/pull/1710">stretchr/testify#1710</a></li>
<li>chore: update docs for Unset <a
href="https://redirect.github.com/stretchr/testify/issues/1621">#1621</a>
by <a href="https://github.com/techfg"><code>@​techfg</code></a> in <a
href="https://redirect.github.com/stretchr/testify/pull/1709">stretchr/testify#1709</a></li>
<li>README: apply gofmt to examples by <a
href="https://github.com/alexandear"><code>@​alexandear</code></a> in <a
href="https://redirect.github.com/stretchr/testify/pull/1687">stretchr/testify#1687</a></li>
<li>refactor: use %q and %T to simplify fmt.Sprintf by <a
href="https://github.com/alexandear"><code>@​alexandear</code></a> in <a
href="https://redirect.github.com/stretchr/testify/pull/1674">stretchr/testify#1674</a></li>
<li>Propose Christophe Colombier (ccoVeille) as approver by <a
href="https://github.com/brackendawson"><code>@​brackendawson</code></a>
in <a
href="https://redirect.github.com/stretchr/testify/pull/1716">stretchr/testify#1716</a></li>
<li>Update documentation for the Error function in assert or require
package by <a
href="https://github.com/architagr"><code>@​architagr</code></a> in <a
href="https://redirect.github.com/stretchr/testify/pull/1675">stretchr/testify#1675</a></li>
<li>assert: remove deprecated build constraints by <a
href="https://github.com/alexandear"><code>@​alexandear</code></a> in <a
href="https://redirect.github.com/stretchr/testify/pull/1671">stretchr/testify#1671</a></li>
<li>assert: apply gofumpt to internal test suite by <a
href="https://github.com/ccoVeille"><code>@​ccoVeille</code></a> in <a
href="https://redirect.github.com/stretchr/testify/pull/1739">stretchr/testify#1739</a></li>
<li>CI: fix shebang in .ci.*.sh scripts by <a
href="https://github.com/dolmen"><code>@​dolmen</code></a> in <a
href="https://redirect.github.com/stretchr/testify/pull/1746">stretchr/testify#1746</a></li>
<li>assert,require: enable parallel testing on (almost) all top tests by
<a href="https://github.com/dolmen"><code>@​dolmen</code></a> in <a
href="https://redirect.github.com/stretchr/testify/pull/1747">stretchr/testify#1747</a></li>
<li>suite.Passed: add one more status test report by <a
href="https://github.com/Ararsa-Derese"><code>@​Ararsa-Derese</code></a>
in <a
href="https://redirect.github.com/stretchr/testify/pull/1706">stretchr/testify#1706</a></li>
<li>Add Helper() method in internal mocks and assert.CollectT by <a
href="https://github.com/dolmen"><code>@​dolmen</code></a> in <a
href="https://redirect.github.com/stretchr/testify/pull/1423">stretchr/testify#1423</a></li>
<li>assert.Same/NotSame: improve usage of Sprintf by <a
href="https://github.com/ccoVeille"><code>@​ccoVeille</code></a> in <a
href="https://redirect.github.com/stretchr/testify/pull/1742">stretchr/testify#1742</a></li>
<li>mock: enable parallel testing on internal testsuite by <a
href="https://github.com/dolmen"><code>@​dolmen</code></a> in <a
href="https://redirect.github.com/stretchr/testify/pull/1756">stretchr/testify#1756</a></li>
<li>suite: cleanup use of 'testing' internals at runtime by <a
href="https://github.com/dolmen"><code>@​dolmen</code></a> in <a
href="https://redirect.github.com/stretchr/testify/pull/1751">stretchr/testify#1751</a></li>
<li>assert: check test failure message for Empty and NotEmpty by <a
href="https://github.com/ccoVeille"><code>@​ccoVeille</code></a> in <a
href="https://redirect.github.com/stretchr/testify/pull/1745">stretchr/testify#1745</a></li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/stretchr/testify/commit/2a57335dc9cd6833daa820bc94d9b40c26a7917d"><code>2a57335</code></a>
Merge pull request <a
href="https://redirect.github.com/stretchr/testify/issues/1788">#1788</a>
from brackendawson/1785-backport-1.11</li>
<li><a
href="https://github.com/stretchr/testify/commit/af8c91234f184009f57ef29027b39ca89cb00100"><code>af8c912</code></a>
Backport <a
href="https://redirect.github.com/stretchr/testify/issues/1786">#1786</a>
to release/1.11</li>
<li><a
href="https://github.com/stretchr/testify/commit/b7801fbf5cd58d201296d5d0e132d1849966dbd4"><code>b7801fb</code></a>
Merge pull request <a
href="https://redirect.github.com/stretchr/testify/issues/1778">#1778</a>
from stretchr/dependabot/github_actions/actions/chec...</li>
<li><a
href="https://github.com/stretchr/testify/commit/69831f3b08c40d56a09d0be93e9d5ae034f1590b"><code>69831f3</code></a>
build(deps): bump actions/checkout from 4 to 5</li>
<li><a
href="https://github.com/stretchr/testify/commit/a53be35c3b0cfcd5189cffcfd75df60ea581104c"><code>a53be35</code></a>
Improve captureTestingT helper</li>
<li><a
href="https://github.com/stretchr/testify/commit/aafb604176db7e1f2c9810bc90d644291d057687"><code>aafb604</code></a>
mock: improve formatting of error message</li>
<li><a
href="https://github.com/stretchr/testify/commit/7218e0390acd2aea3edb18574110ec2753c0aeef"><code>7218e03</code></a>
improve error msg</li>
<li><a
href="https://github.com/stretchr/testify/commit/929a2126c2702df436312656a0304580b526c6e9"><code>929a212</code></a>
Merge pull request <a
href="https://redirect.github.com/stretchr/testify/issues/1758">#1758</a>
from stretchr/dolmen/suite-faster-method-filtering</li>
<li><a
href="https://github.com/stretchr/testify/commit/bc7459ec38128532ff32f23cfab4ea0b725210f2"><code>bc7459e</code></a>
suite: faster filtering of methods (-testify.m)</li>
<li><a
href="https://github.com/stretchr/testify/commit/7d37b5c962954410bcd7a71ff3a77c79514056d1"><code>7d37b5c</code></a>
suite: refactor methodFilter</li>
<li>Additional commits viewable in <a
href="https://github.com/stretchr/testify/compare/v1.10.0...v1.11.1">compare
view</a></li>
</ul>
</details>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Ethan Dickson <ethan@coder.com>
2025-09-02 03:12:37 +00:00
Susana Ferreira 0ab345ca84 feat: add prebuild timing metrics to Prometheus (#19503)
## Description

This PR introduces one counter and two histograms related to workspace
creation and claiming. The goal is to provide clearer observability into
how workspaces are created (regular vs prebuild) and the time cost of
those operations.

### `coderd_workspace_creation_total`

* Metric type: Counter
* Name: `coderd_workspace_creation_total`
* Labels: `organization_name`, `template_name`, `preset_name`

This counter tracks whether a regular workspace (not created from a
prebuild pool) was created using a preset or not.
Currently, we already expose `coderd_prebuilt_workspaces_claimed_total`
for claimed prebuilt workspaces, but we lack a comparable metric for
regular workspace creations. This metric fills that gap, making it
possible to compare regular creations against claims.

Implementation notes:
* Exposed as a `coderd_` metric, consistent with other workspace-related
metrics (e.g. `coderd_api_workspace_latest_build`:
https://github.com/coder/coder/blob/main/coderd/prometheusmetrics/prometheusmetrics.go#L149).
* Every `defaultRefreshRate` (1 minute), the DB query
`GetRegularWorkspaceCreateMetrics` is executed to fetch all regular
workspaces (not created from a prebuild pool).
* The counter is updated with the total from all time (not just since
metric introduction). This differs from the histograms below, which only
accumulate from their introduction forward.

### `coderd_workspace_creation_duration_seconds` &
`coderd_prebuilt_workspace_claim_duration_seconds`

* Metric types: Histogram
* Names:
  * `coderd_workspace_creation_duration_seconds`
    * Labels: `organization_name`, `template_name`, `preset_name`, `type`
      (`regular`, `prebuild`)
  * `coderd_prebuilt_workspace_claim_duration_seconds`
    * Labels: `organization_name`, `template_name`, `preset_name`

We already have `coderd_provisionerd_workspace_build_timings_seconds`,
which tracks build run times for all workspace builds handled by the
provisioner daemon.
However, in the context of this issue, we are only interested in
creation and claim build times, not all transitions; additionally, this
metric does not include `preset_name`, and adding it there would
significantly increase cardinality. Therefore, separate, more focused
metrics are introduced here:
* `coderd_workspace_creation_duration_seconds`: Build time to create a
workspace (either a regular workspace or the initial provisioning build
of a prebuilt workspace into a prebuild pool).
* `coderd_prebuilt_workspace_claim_duration_seconds`: Time to claim a
prebuilt workspace from the pool.

The reason for two separate histograms is that:
* Creation (regular or prebuild): provisioning builds with similar time
magnitude, generally expected to take longer than a claim operation.
* Claim: expected to be a much faster provisioning build.

#### Native histogram usage

Provisioning times vary widely between projects. Using static buckets
risks unbalanced or poorly informative histograms.
To address this, these metrics use [Prometheus native
histograms](https://prometheus.io/docs/specs/native_histograms/):
* First introduced in Prometheus v2.40.0
* Recommended stable usage from v2.45+
* Requires Go client `prometheus/client_golang` v1.15.0+
* Experimental and must be explicitly enabled on the server
(`--enable-feature=native-histograms`)

For compatibility, we also retain a classic bucket definition (aligned
with the existing provisioner metric:
https://github.com/coder/coder/blob/main/provisionerd/provisionerd.go#L182-L189).
* If native histograms are enabled, Prometheus ingests the
high-resolution histogram.
* If not, it falls back to the predefined buckets.

Implementation notes:
* Unlike the counter, these histograms are updated in real-time at
workspace build job completion.
* They reflect data only from the point of introduction forward (no
historical backfill).

## Relates to 

Closes: https://github.com/coder/coder/issues/19528
Native histograms tested in observability stack:
https://github.com/coder/observability/pull/50
2025-08-28 15:00:26 +01:00
Callum Styan 014a2d5b0f perf: don't call GetUserByID unnecessarily for Agents metrics loops (#19395)
At the moment, the loop which retrieves and updates the values of the
agents metrics excessively calls `GetUserByID` (a DB query). First it
retrieves a list of all workspaces, filtering out inactive agents (not
entirely clear to me whether this is non-running workspaces, or just
dead agents), and then iterates over those workspaces to get the rest of
the relevant data for the metrics. The next call is `GetUserByID` for
`workspace.OwnerID`. This is unnecessary because the `workspaces_visible` view we pull workspaces from has already been joined with the users table to get the username/name/etc.

This should at least partially resolve
https://github.com/coder/internal/issues/726 
---------

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2025-08-21 11:01:32 -07:00
Dean Sheather 6eb02d1c2a chore: wire up usage tracking for managed agents (#19096)
Wires up the usage collector and publisher to coderd.

Relates to coder/internal#814
2025-08-20 23:38:09 +10:00
Mathias Fredriksson 1b66495b70 fix(coderd/prometheusmetrics)!: filter deleted wsbuilds to reduce db load (#19197)
This change removes the `GetLatestWorkspaceBuilds` query which includes
all workspaces for all time (including deleted). This allows us to also
stop using `GetProvisionerJobsByIDs` for said builds as the job status
is included in `GetWorkspaces` called separately.

**BREAKING CHANGE**: The `coderd_api_workspace_latest_build` Prometheus
metric no longer includes builds belonging to deleted workspaces, as
such, this metric will show fewer statuses.

Fixes coder/internal#717
2025-08-11 14:48:31 +03:00
Steven Masley 0a3afeddc8 chore: add more pprof labels for various go routines (#19243)
- ReplicaSync
- Notifications
- MetricsAggregator
- DBPurge
2025-08-07 20:05:32 +00:00
Steven Masley 8ba8b4f061 chore: add profiling labels for pprof analysis (#19232)
PProf labels segment the code into groups for determining the source of
cpu/memory profiles. Since the web server and background jobs share a
lot of the same code (eg wsbuilder), it helps to know if the load is
user induced, or background job based.
2025-08-07 11:21:17 -05:00
Callum Styan ffbfaf2a6f feat: allow bypassing current CORS magic based on template config (#18706)
Solves https://github.com/coder/coder/issues/15096

This is a slight rework/refactor of the earlier PRs from @dannykopping
and @Emyrk:
- https://github.com/coder/coder/pull/15669
- https://github.com/coder/coder/pull/15684
- https://github.com/coder/coder/pull/17596

Rather than having a per-app CORS behaviour setting and additionally a
template level setting for ports, this PR adds a single template level
CORS behaviour setting that is then used by all apps/ports for
workspaces created from that template.

The main changes are in `proxy.go` and `request.go` to:
a) get the CORS behaviour setting from the template
b) have `HandleSubdomain` bypass the CORS middleware handler if the
selected behaviour is `passthru`
c) in `proxyWorkspaceApp`, do not modify the response if the selected
behaviour is `passthru`
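
The bypass in (b) can be sketched roughly as follows. This is a simplified stand-in, not the code in `proxy.go` or `request.go`: `corsMiddleware`, `withCORSBehavior`, `preflightStatus`, and the example origins are all hypothetical, and only the behaviour names `simple` and `passthru` come from this PR.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// corsMiddleware is a stand-in for a real CORS middleware: it sets its own
// allow-origin header and answers preflight requests without calling the app.
func corsMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Access-Control-Allow-Origin", "https://dashboard.example")
		if r.Method == http.MethodOptions {
			w.WriteHeader(http.StatusNoContent)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// withCORSBehavior wraps the app in the middleware unless the template-level
// behaviour is "passthru", in which case the app handles CORS itself.
func withCORSBehavior(behavior string, app http.Handler) http.Handler {
	if behavior == "passthru" {
		return app
	}
	return corsMiddleware(app)
}

// preflightStatus sends an OPTIONS request through the chain and returns the
// status code, which shows whether the middleware or the app answered.
func preflightStatus(behavior string) int {
	app := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Access-Control-Allow-Origin", "https://app.example")
		w.WriteHeader(http.StatusOK)
	})
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodOptions, "/", nil)
	withCORSBehavior(behavior, app).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("simple:", preflightStatus("simple"))
	fmt.Println("passthru:", preflightStatus("passthru"))
}
```

In this sketch the preflight is answered by the middleware under `simple` and reaches the app untouched under `passthru`.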

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added support for configuring CORS behavior ("simple" or "passthru")
at the template level for all shared ports.
* Introduced a new "CORS Behavior" setting in the template creation and
settings forms.
* API endpoints and responses now include the optional `cors_behavior`
property for templates.
* Workspace apps and proxy now honor the specified CORS behavior,
enabling conditional CORS middleware application.
* Enhanced workspace app tests with comprehensive scenarios covering
CORS behaviors and authentication states.

* **Bug Fixes**
  * None.

* **Documentation**
* Updated API and admin documentation to describe the new
`cors_behavior` property and its usage.
* Added examples and schema references for CORS behavior in relevant API
docs.

* **Tests**
* Extended automated tests to cover different CORS behavior scenarios
for templates and workspace apps.

* **Chores**
* Updated audit logging to track changes to the `cors_behavior` field on
templates.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2025-07-30 13:42:39 -07:00
ケイラ fae30a00fd chore: remove unnecessary redeclarations in for loops (#18440) 2025-06-20 13:16:55 -06:00
Hugo Dutka 277c2c7ea7 chore(coderd/prometheusmetrics): remove dbmem from tests (#18238) 2025-06-05 09:30:27 +02:00
Steven Masley ca38729840 chore: revert dynamic params as a safe experiment (#17510) 2025-04-22 16:21:15 +00:00
Jon Ayers 17ddee05e5 chore: update golang to 1.24.1 (#17035)
- Update go.mod to use Go 1.24.1
- Update GitHub Actions setup-go action to use Go 1.24.1
- Fix linting issues with golangci-lint by:
  - Updating to golangci-lint v1.57.1 (more compatible with Go 1.24.1)

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <claude@anthropic.com>
2025-03-26 01:56:39 -05:00
Eng Zer Jun 04c33968cf refactor: replace golang.org/x/exp/slices with slices (#16772)
The experimental functions in `golang.org/x/exp/slices` are now
available in the standard library since Go 1.21.

Reference: https://go.dev/doc/go1.21#slices
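
For the functions the two packages share, the migration is typically just an import swap. A minimal sketch, assuming Go 1.21 or later; the `sortedUnique` helper is a hypothetical example, not code from this repository:

```go
package main

import (
	"fmt"
	"slices" // previously: "golang.org/x/exp/slices"
)

// sortedUnique returns a sorted copy of in with duplicates removed,
// using only functions from the standard library slices package.
func sortedUnique(in []string) []string {
	out := slices.Clone(in)
	slices.Sort(out)
	return slices.Compact(out)
}

func main() {
	fmt.Println(sortedUnique([]string{"b", "a", "b"}))
}
```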

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
2025-03-04 00:46:49 +11:00
Spike Curtis 5861e516b9 chore: add standard test logger ignoring db canceled (#15556)
Refactors our use of `slogtest` to instantiate a "standard logger" across most of our tests.  This standard logger incorporates https://github.com/coder/slog/pull/217 to also ignore database query canceled errors by default, which are a source of low-severity flakes.

Any test that has set non-default `slogtest.Options` is left alone. In particular, `coderdtest` defaults to ignoring all errors. We might consider revisiting that decision now that we have better tools to target the really common flaky Error logs on shutdown.
2024-11-18 14:09:22 +04:00
Colin Adler 3de98c25db feat: add prometheus metric for tracking user statuses (#15281) 2024-10-30 18:41:16 +00:00
Steven Masley 343f8ec9ab chore: join owner, template, and org in new workspace view (#15116)
Joins in fields like `username`, `avatar_url`, `organization_name`,
`template_name` to `workspaces` via a **view**. 
The view must be maintained moving forward, but this prevents needing to
add RBAC permissions to fetch related workspace fields.
2024-10-22 09:20:54 -05:00
Garrett Delfosse 922f4c545f fix: handle new agent stat format correctly (#14576)
---------

Co-authored-by: Ethan Dickson <ethan@coder.com>
2024-09-20 01:52:14 +10:00
Kayla Washburn-Love bf4b7abf14 chore(coderd): allow creating workspaces without specifying an organization (#14048) 2024-07-30 10:44:02 -06:00
Ethan dd243686e4 chore!: remove deprecated agent v1 routes (#13486) 2024-06-11 12:22:59 +10:00
Garrett Delfosse 5b9a65e5c1 chore: move Batcher and Tracker to workspacestats (#13418) 2024-06-10 15:35:23 -04:00
Colin Adler 40390ecc30 chore: fix TestServer/Prometheus/DBMetricsDisabled test flake (#13453)
See: https://github.com/coder/coder/actions/runs/9352137263/job/25739550487#step:5:368
2024-06-03 15:38:59 -05:00
Garrett Delfosse 5789ea5397 chore: move stat reporting into workspacestats package (#13386) 2024-05-29 11:49:08 -04:00
Garrett Delfosse c550d0641d feat: move shared ports out of experiment (#13120) 2024-05-02 14:11:33 -04:00
Pavel Aseev 4682355eed chore: deprecate gauge metrics with _total suffix (#12744) (#12976)
* chore: deprecate gauge metrics with _total suffix (#12744)

Deprecated metrics:
- coderd_oauth2_external_requests_rate_limit_total
- coderd_api_workspace_latest_build_total

* Apply suggestions from code review

add link to follow-up issue

Co-authored-by: Cian Johnston <public@cianjohnston.ie>

---------

Co-authored-by: Cian Johnston <public@cianjohnston.ie>
2024-04-24 11:23:24 +03:00
Steven Masley 0a8c8ce5cc chore: remove InsertWorkspaceAgentStat query (#12869)
* chore: remove InsertWorkspaceAgentStat query

InsertWorkspaceAgentStats (batch) exists. We only used the singular
query in a single unit test. This removes the singular in favor of the
batch, reducing the interface size.
2024-04-09 12:35:27 -05:00
Danny Kopping 79fb8e43c5 feat: expose workspace statuses (with details) as a prometheus metric (#12762)
Implements #12462
2024-04-02 09:57:36 +02:00
Mathias Fredriksson b183236482 feat(coderd/database): use template_usage_stats in *ByTemplate insights queries (#12668)
This PR updates the `*ByTemplate` insights queries used for generating Prometheus metrics to behave the same way as the new rollup query and re-written insights queries that utilize the rolled up data.
2024-03-25 17:42:02 +02:00
Danny Kopping 9cfd5baa91 feat(coderd): export metric indicating each experiment's status (#12657) 2024-03-19 14:11:27 +02:00
Steven Masley f0f9569d51 chore: enforce that provisioners can only acquire jobs in their own organization (#12600)
* chore: add org ID as optional param to AcquireJob
* chore: plumb through organization id to provisioner daemons
* add org id to provisioner domain key
* enforce org id argument
* dbgen provisioner jobs defaults to default org
2024-03-18 12:48:13 -05:00
Danny Kopping da54c8a51f fix: fix data race in TestLabelsAggregation tests (#12578) 2024-03-13 13:47:22 +02:00
Danny Kopping 7a7105ad66 feat: make agent stats' cardinality configurable (#12535) 2024-03-13 12:03:36 +02:00
Cian Johnston 8f40ee3465 Revert "feat: make agent stats' cardinality configurable (#12468)" (#12533)
This reverts commit 21d1873d97.
2024-03-11 14:33:36 +00:00
Danny Kopping 21d1873d97 feat: make agent stats' cardinality configurable (#12468)
Closes #12221
2024-03-11 16:04:08 +02:00
Marcin Tojek aacb4a2b4c feat: use map instead of slice in metrics aggregator (#11815) 2024-01-29 09:12:41 +01:00
Spike Curtis fdd60d316e fix: fix MetricsAggregator check for metric sameness (#11508)
Fixes #11451

A refactor of the Agent API passes metrics as protobufs, which include pointers to label name/value pairs.  The aggregator tested for sameness by doing a shallow compare of label values, which for different stats reports would compare unequal because the pointers would be different.

This fix does a deep compare.

While testing I also noted that we neglect to compare template names. This is unlikely to have caused any issue in practice, since the combination of username/workspace is unique, but in the context of comparing metric labels we should do the comparison.

If a user creates a workspace, deletes it, then recreates from a different template, we could in principle have reported incorrect stats for the old template.
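
The shallow versus deep distinction can be illustrated with a small sketch. The `label` type and both helper functions are hypothetical stand-ins for the protobuf label type and the aggregator's comparison, not the actual code:

```go
package main

import "fmt"

// label is a hypothetical stand-in for a protobuf label message.
type label struct{ Name, Value string }

// shallowEqual compares the pointers themselves, so two separately
// allocated labels with identical contents compare unequal.
func shallowEqual(a, b []*label) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// deepEqual compares the values the pointers refer to.
func deepEqual(a, b []*label) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] == nil || b[i] == nil {
			if a[i] != b[i] {
				return false
			}
			continue
		}
		if *a[i] != *b[i] {
			return false
		}
	}
	return true
}

func main() {
	// Two reports carrying the same label, allocated separately.
	first := []*label{{Name: "template_name", Value: "docker"}}
	second := []*label{{Name: "template_name", Value: "docker"}}
	fmt.Println("shallow:", shallowEqual(first, second))
	fmt.Println("deep:", deepEqual(first, second))
}
```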
2024-01-09 15:21:30 +04:00