Commit Graph

670 Commits

Author SHA1 Message Date
Danny Kopping ff532d9bf3 chore: handle deprecated aibridge experimental routes (#20565)
In v2.28 we're [removing the aibridge
experiment](https://github.com/coder/coder/pull/20544).

We need to handle `/api/experimental/aibridge/*` until Beta (next
release).

Signed-off-by: Danny Kopping <danny@coder.com>
2025-10-29 19:11:34 -06:00
Susana Ferreira 7e8fcb4b0f perf: optimize prebuilds membership reconciliation to check orgs not presets (#20493)
## Description

The membership reconciliation ensures the prebuilds system user is a
member of all organizations with prebuilds configured. To support
prebuilds quota management, each organization must have a prebuilds
group that the system user belongs to.

## Problem

Previously, membership reconciliation iterated over all presets to check
and update membership status. This meant database queries
`GetGroupByOrgAndName` and `InsertGroupMember` were executed for each
preset. Since presets are unique combinations of `(organization,
template, template version, preset)`, this resulted in several redundant
checks for the same organization.

In dogfood, `InsertGroupMember` was called thousands of times per day,
even though memberships were already configured ([internal Grafana
dashboard link](https://grafana.dev.coder.com/goto/46MZ1UgDg?orgId=1))

<img width="5382" height="1788" alt="Screenshot 2025-10-28 at 16 01 36"
src="https://github.com/user-attachments/assets/757b7253-106f-4f72-8586-8e2ede9f18db"
/>

## Solution

This PR introduces `GetOrganizationsWithPrebuildStatus`, a single query
that returns:
* All unique organizations with prebuilds configured
* Whether the prebuilds user is a member of each organization
* Whether the prebuilds group exists in each organization
* Whether the prebuilds user is in the prebuilds group

The membership reconciliation logic now:
* Fetches status for all organizations in one query
* Only performs inserts for organizations missing required memberships
or groups
* Safely handles concurrent operations via unique constraint violations
* This reduces database load from `O(presets)` to `O(organizations)` per
reconciliation loop, with a single read query when everything is
configured.

## Changes

* Add `GetOrganizationsWithPrebuildStatus` SQL query
* Update `membership.ReconcileAll` to use organization-based
reconciliation instead of preset-based
* Update tests to reflect new behavior

Related to internal thread:
https://codercom.slack.com/archives/C07GRNNRW03/p1760535570381369
2025-10-29 14:24:29 +00:00
Danny Kopping b20fd6f2c1 chore: graduate aibridge API out of experimental (#20523)
<!--

If you have used AI to produce some or all of this PR, please ensure you have read our [AI Contribution guidelines](https://coder.com/docs/about/contributing/AI_CONTRIBUTING) before submitting.

-->
2025-10-29 07:18:54 -06:00
Danny Kopping 2294c55bd9 chore: graduate aibridged* packages out of experimental (#20522)
<!--

If you have used AI to produce some or all of this PR, please ensure you have read our [AI Contribution guidelines](https://coder.com/docs/about/contributing/AI_CONTRIBUTING) before submitting.

-->
2025-10-29 07:00:24 -06:00
Susana Ferreira aad1b401c1 feat: add prebuilds reconciliation duration metric (#20535)
## Description

Adds `coderd_prebuilds_reconciliation_duration_seconds` histogram metric
to track the duration of each prebuilds reconciliation cycle.

This metric helps operators monitor reconciliation performance and
identify potential bottlenecks.

## Changes

- Added `ReconcileStats` struct to capture reconciliation cycle
statistics
- Updated `ReconcileAll()` to return stats including elapsed time
- Added histogram metric
`coderd_prebuilds_reconciliation_duration_seconds`
2025-10-29 12:52:30 +00:00
Danny Kopping 95a1ca898f chore: remove aibridge experiment (#20520)
Removes the experiment and all references to it
2025-10-29 06:18:38 -06:00
Susana Ferreira c3e3bb58f2 feat: delete pending canceled prebuilds (#20499)
## Description

PR https://github.com/coder/coder/pull/20387 introduced canceling
pending prebuild jobs from inactive template versions to avoid
provisioning obsolete workspaces. However, the associated prebuilds
remained in the database with "Canceled" status, visible in the UI.

This PR now orphan-deletes these canceled prebuilt workspaces. Since the
canceled jobs were never processed by a provisioner, no Terraform
resources were created, making orphan deletion safe.

Orphan deletion always creates a provisioner job, but behaves
differently based on provisioner availability:
- If no provisioner daemon is available, the job is immediately marked
as completed and the workspace is marked as deleted without any
provisioner processing
- If a provisioner daemon is available, it processes the delete job with
empty Terraform state (no actual resources to destroy)

The job cancellation and workspace deletion occur atomically in the same
transaction. We don't split this into two separate reconciliation runs
because there's no way to distinguish between system-canceled prebuilds
and user-canceled workspaces. If we deleted canceled workspaces in a
later run, we'd delete user-canceled workspaces that users may want to
keep for troubleshooting.

Note: This only applies to system-generated prebuilds from inactive
template versions.

## Changes

* Update `UpdatePrebuildProvisionerJobWithCancel` query to return job
ID, workspace ID, template ID, and template version preset ID
* Add `DeprovisionMode` enum to support orphan deletion in the provision
flow
* Update `ActionTypeCancelPending` handler to cancel jobs and
orphan-delete associated workspaces atomically
2025-10-29 10:37:28 +00:00
Dean Sheather dec6d310a8 fix: avoid bad switch statement in license code (#20509)
Noticed this while trying to investigate a flake.

Relates to https://github.com/coder/internal/issues/788
2025-10-28 06:19:52 +00:00
ケイラ 4f7b279fd8 feat: add an organization member permission level (#19953) 2025-10-27 17:14:16 -06:00
Paweł Banaszewski 50ba223aa1 feat: add db query for setting interception ended_at field (#20437)
Adds UpdateAIBridgeInterceptionEnded query to mark interceptions as
done.
Needed for https://github.com/coder/internal/issues/1051
2025-10-27 09:51:37 +01:00
Susana Ferreira f6e86c6fdb feat: cancel pending prebuilds from non-active template versions (#20387)
## Description

This PR introduces an optimization to automatically cancel pending
prebuild-related jobs from non-active template versions in the
reconciliation loop.

## Problem

Currently, when a template is configured with more prebuild instances
than available provisioners, the provisioner queue can become flooded
with pending prebuild jobs. This issue is worsened when
provisioning/deprovisioning operations take a long time.

When the prebuild reconciliation loop generates jobs faster than
provisioners can process them, pending jobs accumulate in the queue.
Since prebuilt workspaces should always run the latest active template
version, pending prebuild jobs from non-active versions become obsolete
once a new version is promoted.

## Solution

The reconciliation loop cancels pending prebuild-related jobs from
non-active template versions that match the following criteria:

* Build number: 1 (initial build created by the reconciliation loop)
* Job status: `pending`
* Not yet picked up by a provisioner (`worker_id` is `NULL`)
* Owned by the prebuilds system user
* Workspace transition: `start`

This prevents the queue from being cluttered with stale prebuild jobs
that would provision workspaces on an outdated template version that
would consequently need to be deprovisioned.

## Changes

* Added new SQL query `CountPendingNonActivePrebuilds` to identify
presets with pending jobs from non-active versions
* Added new SQL query `UpdatePrebuildProvisionerJobWithCancel` to cancel
jobs for a specific preset
* New reconciliation action type `ActionTypeCancelPending` handles the
cancellation logic
* Cancellation is non-blocking: failures to cancel prebuild jobs are
logged as errors and don't prevent other reconciliation actions

## Follow-up PR

Canceling pending prebuild jobs leaves workspaces in a Canceled state.
While no Terraform resources need to be destroyed (since jobs were
canceled before provisioning started), these database records should
still be cleaned up. This will be addressed in a follow-up PR.

Closes: https://github.com/coder/coder/issues/20242
2025-10-24 15:27:49 +01:00
Steven Masley 13ca9ead3a chore!: ensure consistent secret token generation and hashing (#20388)
This PR uses the same sha256 hashing technique as we use for APIKeys. So
now all randomly generated secrets will be hashed with sha256 for
consistency.

This is a breaking change for the oauth tokens. Since oauth is only
allowed for dev builds and experimental, this is ok.
2025-10-23 15:38:49 -05:00
Hugo Dutka e62c5db678 chore: remove references to dbtestutil.WillUsePostgres (#20436)
Addresses https://github.com/coder/internal/issues/758.

This PR only cleans up dead code, it makes no changes to test logic.
2025-10-23 14:24:54 +02:00
Jake Howell d455f6ea2b fix: rename total to count in AIBridgeListInterceptionsResponse (#20410)
Thanks to the great work in #20393, we’ve successfully introduced
offset-based pagination for this endpoint. However, the frontend expects
a `count` field in the response rather than `total`. This PR updates the
response payload to rename the returned key to `count` for consistency
with frontend expectations and existing API patterns.

This is necessary to unblock the work in #20331
2025-10-23 13:19:12 +11:00
Marcin Tojek caeca1097b chore: refactor license validation (#20411) 2025-10-22 16:12:36 +02:00
Marcin Tojek f2a410566c feat: add support buttons (#20339)
Fixes: https://github.com/coder/coder/issues/16804
2025-10-22 15:35:16 +02:00
Dean Sheather 69c2c40512 chore: add user details to aibridge interception list endpoint (#20397)
- Adds FK from `aibridge_interceptions.initiator_id` to `users.id`
- This is enforced by deleting any rows that don't have any users. Since
this is an experimental feature AND coder never deletes user rows I
think this is acceptable.
- Adds `name` as a property on `codersdk.MinimalUser`
- This matches the `visible_users` view in the database. I'm unsure why
`name` wasn't already included given that `username` is.
- Adds a new `initiator` field to `codersdk.AIBridgeInterception` which
contains `codersdk.MinimalUser` (ID, username, name, avatar URL)
- Removes `initiator_id` from `codersdk.AIBridgeInterception`
    - Should be fine since we're still in early access
2025-10-22 16:18:31 +11:00
Dean Sheather ea261a1f7c chore: add offset-based pagination support to aibridge list endpoint (#20393)
Necessary for the frontend to be able to paginate easily. Cursor
pagination is good for fetching all events, but doesn't play very well
when a pagination component gets involved.

Adds support for `?offset=x` to the existing endpoint. The cursor-based
pagination (`?after_id=x`) is still supported. The two pagination modes
are mutually exclusive, and are documented as such. If both are
supplied, the request will be rejected.

Also adds a `total` property to the response that contains the full
count of items matching the filter. We already have indices in place so
I don't think this will impact performance (or we can revisit it before
GA).
2025-10-21 11:50:00 +00:00
ケイラ caeff49aba chore: refactor roles to support multiple permission sets scoped by org id (#20186)
In preparation for adding the "member" permission level, which will also
be grouped by org ID, do a bit of a refactor to make room for it and the
existing "org" level to live in the same `map`
2025-10-09 11:08:34 -06:00
Sas Swart 544f15523c fix: adjust workspace claims to be initiated by users (#20179)
The prebuilds user never initiates a workspace claim autonomously. A
claim can only happen when a user attempts to create a workspace. When
listing prebuild provisioner jobs, it would not make sense to see jobs
related to users who are creating workspaces and have gotten a prebuilt
workspace. When cleaning up an overwhelmed provisioner queue, we should
not delete claims as they have humans waiting for them and are not part
of the thundering herd.

Therefore, this PR ensures that provisioner jobs that claim workspaces
are considered to be initiated by the user, not the prebuilds system.
2025-10-08 10:40:54 +02:00
Zach 4d1003eace fix: remove initial global HTTP client usage (#20128)
This PR makes the initial steps at removing usage of the global Go HTTP
client, which was seen to have impacts on test flakiness in
https://github.com/coder/internal/issues/1020. The first commit removes
uses from tests, with the exception of one test that is tightly coupled
to the default client. The second commit makes easy/low-risk removals
from application code. This should have some impact to reduce test flakiness.
2025-10-02 11:43:13 -06:00
Cian Johnston ff930ad4f3 feat(coderd): add ability to search org members by user_id, is_system, github_user_id (#20048)
Adds the ability to search org members by query.
Supported fields: `user_id`, `is_system`, `github_user_id`.
2025-09-30 23:54:21 +01:00
Cian Johnston 3e1f6afd66 chore: work around timing issue in TestReplicas/ErrorWithoutLicense (#20002)
Closes https://github.com/coder/internal/issues/268

Wraps the assertions in a `testutil.Eventually` so that hopefully any
transient timing issues resolve themselves. If this does not resolve the
issue, we may need to plumb through some kind of `chan struct{}` into
`api.Entitlements.Update()`
2025-09-29 10:20:26 +01:00
Dean Sheather fc58996bbf chore: add StripPrefix to aibridge server handler (#19990)
oops
2025-09-26 15:40:42 +00:00
Dean Sheather 43415f0144 chore: add enterprise feature for aibridge (#19976)
Adds enterprise feature "aibridge" and gates the aibridge CRUD and LLM API endpoints behind it.
2025-09-27 01:13:06 +10:00
Paweł Banaszewski 0a6ba5d51a feat: add endpoint to list aibridge interceptions (#19929)
Co-authored-by: Dean Sheather <dean@deansheather.com>
2025-09-27 00:20:33 +10:00
Danny Kopping 0a79817050 feat: initialize aibridged & mount API handler (#19798)
Addresses https://github.com/coder/internal/issues/987
2025-09-25 16:37:28 +02:00
Brett Kolodny 38ca98745b feat: add shared_with_group: and shared_with_user: filters to /workspaces endpoint (#19875)
Adds shared_with_user and shared_with_group filters to the /workspaces
endpoint.

- `shared_with_user`: filters workspaces shared with a specific user.
Accepts a user UUID or username.
- `shared_with_group`: filters workspaces shared with a specific group.
Accepts:
  - a group UUID, or
  - `<organization name>/<group name>`, or
  - `<group name>` (resolved in the default organization).


Closes
[coder/internal#1004](https://github.com/coder/internal/issues/1004)
2025-09-19 16:05:27 -04:00
Brett Kolodny e6b04d1918 feat: add shared filter to workspaces query (#19807)
Adds a `shared:<boolean>` search query to the `/workspaces [get]`
endpoint


https://github.com/user-attachments/assets/ccf84bd9-c1fd-4085-825b-2e3176a2d488

Closes
[coder/internal#972](https://github.com/coder/internal/issues/972)
2025-09-16 12:37:39 -04:00
Ethan 995b330250 test: avoid sharing deployment values between subtests (#19833)
Blink didn't figure out a CI failure on main was caused by a data race; fixing it.


I've also updated the [blink prompt](https://gist.githubusercontent.com/ethanndickson/8dea9f1db3957ac1baf30ae8ce6f1a42/raw/060aea7fabb82bef0029a17dad9a5daee7940760/blink-flake-instructions.md).

https://github.com/coder/coder/actions/runs/17737809615
2025-09-16 13:51:26 +10:00
Ethan 6a9b896f5b fix!: use client ip when creating connection logs for workspace proxied app accesses (#19788)
Breaking API Change: 
> The presence of the `ip` field on `codersdk.ConnectionLog` cannot be
guaranteed, and so the field has been made optional. It may be omitted
on API responses.

When running a scaletest, I noticed logs of the form:
```
2025-09-12 06:34:10.924 [erro]  coderd.workspaceapps: upsert connection log failed  trace=0xa17580  span=0xa17620  workspace_id=81b937d7-5777-4df5-b5cb-80241f30326f  agent_id=78b2ff6d-b4a6-4a4e-88a7-283e05455a88  app_id=00000000-0000-0000-0000-000000000000  user_id=00000000-0000-0000-0000-000000000000  user_agent=""  app_slug_or_port=terminal  status_code=404  request_id=67f03cf8-9523-444a-97bc-90de080a54c8 ...
    error= 1 error occurred:
           	* pq: null value in column "ip" of relation "connection_logs" violates not-null constraint
```

to ensure logs are never omitted from the connection log due to a
missing IP again (i.e. I'm not sure if we can always rely on a valid,
parseable, IP from `(http.Request).RemoteAddr`), I've removed the `NOT
NULL` constraint on `ip` on `connection_logs`, and made `ip` on the API
response optional.


The specific cause for these null IPs was the
`/workspaceproxies/me/issue-signed-app-token [post]` endpoint
constructing it's own `http.Request` without a `RemoteAddr` set, and
then passing that to the token issuer.

To solve this, we'll have workspace proxies send the real IP of the
client when calling `/workspaceproxies/me/issue-signed-app-token [post]`
via the header `Coder-Workspace-Proxy-Real-IP`.
2025-09-15 12:30:17 +10:00
Brett Kolodny 854f3c0187 feat: add workspaces/acl [delete] endpoint (#19772)
Closes
[coder/internal#971](https://github.com/coder/internal/issues/971)
2025-09-12 12:21:01 -04:00
Kacper Sawicki 3074547322 perf(enterprise): remove expensive GetWorkspaces query from entitlements (#19747)
Closes: https://github.com/coder/internal/issues/964

This PR addresses the significant database load issue where the
`GetWorkspaces` query was causing performance problems in the license
entitlements code.
2025-09-09 15:46:11 +02:00
Callum Styan 0ec9df390b fix: reduce impact of GetPrebuildMetrics on database (#19694)
see https://github.com/coder/internal/issues/959 but the tl; dr is:
- we call this DB query on an interval (every 15s) and it would be
called on each coderd replica as well
- the generated values update very infrequently (for our most used
internal template I saw the builds created/claimed update twice in a 1h
period)
- we have no index on the initiator ID, so this query has to scan the
entire workspace_builds table on every request

In reality this should likely just be a Prometheus metric, and
Prometheus can handle the counter reset behaviour at query time, but for
now this should at least cut the load of the query to 25% of it's
current impact.

---------

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2025-09-04 13:43:50 -07:00
Spike Curtis 1354d84eb4 chore: refactor instance identity to be a SessionTokenProvider (#19566)
Refactors Agent instance identity to be a SessionTokenProvider.

Refactors the CLI to create Agent clients via a centralized function, rather than add-hoc via individual command handlers and their flags.

This allows commands besides `coder agent`, but which still use the agent identity, to support instance identity authentication.

Fixes #19111 by unifying all API requests to go thru the SessionTokenProvider for auth credentials.
2025-09-03 10:38:42 +04:00
Dean Sheather 39bf3ba628 chore: replace GetManagedAgentCount query with aggregate table (#19636)
- Removes GetManagedAgentCount query
- Adds new table `usage_events_daily` which stores aggregated usage
events by the type and UTC day
- Adds trigger to update the values in this table when a new row is
inserted into `usage_events`
- Adds a migration that adds `usage_events_daily` rows for existing data
in `usage_events`
- Adds tests for the trigger
- Adds tests for the backfill query in the migration

Since the `usage_events` table is unreleased currently, this migration
will do nothing on real deployments and will only affect preview
deployments such as dogfood.

Closes https://github.com/coder/internal/issues/943
2025-08-30 03:39:37 +10:00
Susana Ferreira 353f5dedc1 fix(coderd): fix logic for reporting prebuilt workspace duration metric (#19641)
## Description

When creating a prebuilt workspace, both `flags.IsPrebuild` and
`flags.IsFirstBuild` are true. Previously, the logic rejected cases with
multiple flags, so `coderd_workspace_creation_duration_seconds` wasn’t
updated for prebuilt creations. This is the only valid scenario where
two flags can be true.

## Changes

* Fix logic to update `coderd_workspace_creation_duration_seconds`
metric for prebuilt workspaces.
* Add prebuild helper functions to coderdenttest (other prebuild tests
can reuse this).
* Update workspace's provisionerdmetric tests to include this metric.

Follow-up: https://github.com/coder/coder/pull/19503
Related to: https://github.com/coder/coder/issues/19528
2025-08-29 15:48:48 +01:00
Dean Sheather 605dad8b1f fix: suppress license expiry warning if a new license covers the gap (#19601)
Previously, if you had a new license that would start before the current
one fully expired, you would get a warning. Now, the license validity
periods are merged together, and a warning is only generated based on
the end of the current contiguous period of license coverage.

Closes #19498
2025-08-29 13:53:23 +00:00
Spike Curtis 192c81e8f9 chore: refactor codersdk to use SessionTokenProvider (#19565)
Refactors `codersdk.Client`'s use of session tokens to use a `SessionTokenProvider`, which abstracts the obtaining and storing of the session token.

The main motiviation is to unify Agent authentication an an upstack PR, which can use cloud instance identity via token exchange, rather than a fixed session token.

However, the abstraction could also allow functionality like obtaining the session token from other external sources like the OS credential manager, or an external secret/key management system like Vault.
2025-08-29 10:41:32 +02:00
Callum Styan 321c2b8fce fix: fix flake in TestExecutorAutostartSkipsWhenNoProvisionersAvailable (#19478)
The flake here had two causes:
1. related to usage of time.Now() in MustWaitForProvisionersAvailable
and
2. the fact that UpdateProvisionerLastSeenAt can not use a time that is
further in the past than the current LastSeenAt time

Previously the test here was calling
`coderdtest.MustWaitForProvisionersAvailable` which was using `time.Now`
rather than the next tick time like the real `hasProvisionersAvailable`
function does. Additionally, when using `UpdateProvisionerLastSeenAt`
the underlying db query enforces that the time we're trying to set
`LastSeenAt` to cannot be older than the current value.

I was able to reliably reproduce the flake by executing both the
`UpdateProvisionerLastSeenAt` call and `tickCh <- next` in their own
goroutines, the former with a small sleep to reliably ensure we'd
trigger the autobuild before we set the `LastSeenAt` time. That's when I
also noticed that `coderdtest.MustWaitForProvisionersAvailable` was
using `time.Now` instead of the tick time. When I updated that function
to take in a tick time + added a 2nd call to
`UpdateProvisionerLastSeenAt` to set an original non-stale time, we
could then never get the test to pass because the later call to set the
stale time would not actually modify `LastSeenAt`. On top of that,
calling the provisioner daemons closer in the middle of the function
doesn't really do anything of value in this test.

**The fix for the flake is to keep the go routines, ensuring there would
be a flake if there was not a relevant fix, but to include the fix which
is to ensure that we explicitly wait for the provisioner to be stale
before passing the time to `tickCh`.**

---------

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2025-08-28 12:07:50 -07:00
Susana Ferreira 0ab345ca84 feat: add prebuild timing metrics to Prometheus (#19503)
## Description

This PR introduces one counter and two histograms related to workspace
creation and claiming. The goal is to provide clearer observability into
how workspaces are created (regular vs prebuild) and the time cost of
those operations.

### `coderd_workspace_creation_total`

* Metric type: Counter
* Name: `coderd_workspace_creation_total`
* Labels: `organization_name`, `template_name`, `preset_name`

This counter tracks whether a regular workspace (not created from a
prebuild pool) was created using a preset or not.
Currently, we already expose `coderd_prebuilt_workspaces_claimed_total`
for claimed prebuilt workspaces, but we lack a comparable metric for
regular workspace creations. This metric fills that gap, making it
possible to compare regular creations against claims.

Implementation notes:
* Exposed as a `coderd_` metric, consistent with other workspace-related
metrics (e.g. `coderd_api_workspace_latest_build`:
https://github.com/coder/coder/blob/main/coderd/prometheusmetrics/prometheusmetrics.go#L149).
* Every `defaultRefreshRate` (1 minute ), DB query
`GetRegularWorkspaceCreateMetrics` is executed to fetch all regular
workspaces (not created from a prebuild pool).
* The counter is updated with the total from all time (not just since
metric introduction). This differs from the histograms below, which only
accumulate from their introduction forward.

### `coderd_workspace_creation_duration_seconds` &
`coderd_prebuilt_workspace_claim_duration_seconds`

* Metric types: Histogram
* Names:
  * `coderd_workspace_creation_duration_seconds`
* Labels: `organization_name`, `template_name`, `preset_name`, `type`
(`regular`, `prebuild`)
  * `coderd_prebuilt_workspace_claim_duration_seconds`
    * Labels: `organization_name`, `template_name`, `preset_name`

We already have `coderd_provisionerd_workspace_build_timings_seconds`,
which tracks build run times for all workspace builds handled by the
provisioner daemon.
However, in the context of this issue, we are only interested in
creation and claim build times, not all transitions; additionally, this
metric does not include `preset_name`, and adding it there would
significantly increase cardinality. Therefore, separate more focused
metrics are introduced here:
* `coderd_workspace_creation_duration_seconds`: Build time to create a
workspace (either a regular workspace or the build into a prebuild pool,
for prebuild initial provisioning build).
* `coderd_prebuilt_workspace_claim_duration_seconds`: Time to claim a
prebuilt workspace from the pool.

The reason for two separate histograms is that:
* Creation (regular or prebuild): provisioning builds with similar time
magnitude, generally expected to take longer than a claim operation.
* Claim: expected to be a much faster provisioning build.

#### Native histogram usage

Provisioning times vary widely between projects. Using static buckets
risks unbalanced or poorly informative histograms.
To address this, these metrics use [Prometheus native
histograms](https://prometheus.io/docs/specs/native_histograms/):
* First introduced in Prometheus v2.40.0
* Recommended stable usage from v2.45+
* Requires Go client `prometheus/client_golang` v1.15.0+
* Experimental and must be explicitly enabled on the server
(`--enable-feature=native-histograms`)

For compatibility, we also retain a classic bucket definition (aligned
with the existing provisioner metric:
https://github.com/coder/coder/blob/main/provisionerd/provisionerd.go#L182-L189).
* If native histograms are enabled, Prometheus ingests the
high-resolution histogram.
* If not, it falls back to the predefined buckets.

Implementation notes:
* Unlike the counter, these histograms are updated in real-time at
workspace build job completion.
* They reflect data only from the point of introduction forward (no
historical backfill).

## Relates to 

Closes: https://github.com/coder/coder/issues/19528
Native histograms tested in observability stack:
https://github.com/coder/observability/pull/50
2025-08-28 15:00:26 +01:00
Sas Swart 4e9ee80882 feat(enterprise/coderd): allow system users to be added to groups (#19518)
closes https://github.com/coder/coder/issues/18274

This pull request makes system users visible in various group related
queries so that they can be added to and removed from groups. This
allows system user quotas to be configured. System users are still
ignored in certain queries, such as when license seat consumption is
determined.

This pull request further ensures the existence of a
"coder_prebuilt_workspaces" group in any organization that needs
prebuilt workspaces

---------

Co-authored-by: Susana Ferreira <susana@coder.com>
2025-08-27 16:57:59 +02:00
ケイラ d7ee1019c0 feat: add endpoint for retrieving workspace acl (#19375)
Implements `/acl [get]` for workspaces, with tests.
Blocked by experiment enablement
2025-08-25 07:11:18 -05:00
Dean Sheather 82f2e15974 chore: add unknown usage event type error (#19436)
- Adds `usagetypes.UnknownEventTypeError` type, which is returned by
`ParseEventWithType`
- Changes `ParseEvent` to not be a generic function since it doesn't
really need it
- Adds `User-Agent` to tallyman requests
2025-08-22 16:32:35 +10:00
Cian Johnston 86f9bed608 chore: fix TestCheckInactiveUsers flake (#19469)
THIS CODE WAS NOT WRITTEN BY A HUMAN.
Use a fixed time interval to avoid timing flakes.
2025-08-21 16:35:31 +01:00
Dean Sheather bfd392b0bf fix: use int64 in publisher delay (#19457) 2025-08-21 11:23:50 +10:00
Rafael Rodriguez 5b1e809862 fix: support oidc group allowlist in oss (#19430)
## Summary

In this pull request we're adding support for OIDC allowed groups in the
OSS version as part of work for
https://github.com/coder/coder/issues/17027.

### Changes

- Restored support for parsing group allow list in OSS code

### Testing

- Added tests for OSS code
- Tested allowed/prohibited group OIDC flows in premium and OSS
2025-08-20 10:09:13 -05:00
Dean Sheather 1a601c30ad chore: move usage types to new package (#19103) 2025-08-20 23:48:38 +10:00
Dean Sheather 6eb02d1c2a chore: wire up usage tracking for managed agents (#19096)
Wires up the usage collector and publisher to coderd.

Relates to coder/internal#814
2025-08-20 23:38:09 +10:00
Susana Ferreira 560cf84251 fix: prevent activity bump for prebuilt workspaces (#19263)
## Description

This PR ensures that activity-based deadline extensions ("activity
bumping") are not applied to prebuilt workspaces. Prebuilds are managed
by the reconciliation loop and must not have `deadline` or
`max_deadline` values set or extended, as they are not part of the
regular lifecycle executor path.

## Changes

- Update `ActivityBumpWorkspace` SQL query to discard prebuilt
workspaces
- Update application layer to avoid calling activity bump logic on
prebuilt workspaces

Related with: 
* Issue: https://github.com/coder/coder/issues/18898
* PR: https://github.com/coder/coder/pull/19252
2025-08-20 12:19:14 +01:00