Commit Graph

3169 Commits

Author SHA1 Message Date
Jon Ayers 6035e45cb8 feat: add e2e workspace build duration metric (#21739)
Adds coderd_template_workspace_build_duration_seconds histogram that
tracks the full duration from workspace build creation to agent ready.
This captures the complete user-perceived build time including
provisioning and agent startup.

The metric is emitted when the agent reports ready/error/timeout via the
lifecycle API, ensuring each build is counted exactly once per replica.
2026-02-06 16:26:02 -06:00
Zach a31e476623 fix: make boundary usage telemetry collection atomic (#21907)
Previously, UpsertBoundaryUsageStats (INSERT...ON CONFLICT DO UPDATE) and
GetAndResetBoundaryUsageSummary (DELETE...RETURNING) could race during
telemetry period cutover. Without serialization, an upsert concurrent with the
delete could lose data (deleted right after being written) or commit after the
delete (miscounted in the next period). Both operations now acquire
LockIDBoundaryUsageStats within a transaction to ensure a clean cutover.
2026-02-06 09:52:17 -07:00
Danielle Maywood 6ccd20d45f feat(agent): populate subagent ID for terraform-defined devcontainers (#21942)
Completes the final piece of the puzzle. Support the pre-creation flow
from the agent side.
2026-02-06 15:52:54 +00:00
Marcin Tojek 456c0bced9 fix: enable strict mode for swagger generation & upgrade swag (#21975)
Adds a Go wrapper (`scripts/apidocgen/swaginit/main.go`) that calls
swag's Go API with `Strict: true`. The `--strict` flag isn't available
in swag's CLI in any version, so the wrapper is the only way to enable
it.

Also upgrades swag from v1.16.2 to v1.16.6 (better generics support,
precise numeric formats, `x-enum-descriptions`, CVE-2024-45338 fix).
2026-02-06 13:04:35 +01:00
Mathias Fredriksson 2549fc71fa feat(coderd): return 409 Conflict for non-active task states (#21887)
Previously we returned 400 Bad Request for all non-active states. This
was semantically incorrect for transitional and paused states where the
request is valid but conflicts with current state.

We now return 409 Conflict for pending/initializing/paused (resolvable
by waiting or resuming) and 400 for error/unknown (actual problems).
This enables client-side auto-resume orchestration per the task
lifecycle RFC.

Closes coder/internal#1265
2026-02-06 12:04:58 +02:00
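The 409/400 split described in 2549fc71fa can be sketched as a small mapping function. This is an illustration only: the `TaskState` type, its constants, and `statusForTaskState` are hypothetical names, not the identifiers used in coderd.

```go
package main

import (
	"fmt"
	"net/http"
)

// TaskState is a hypothetical enum used for illustration; the real
// type and constant names in coderd may differ.
type TaskState string

const (
	TaskActive       TaskState = "active"
	TaskPending      TaskState = "pending"
	TaskInitializing TaskState = "initializing"
	TaskPaused       TaskState = "paused"
	TaskError        TaskState = "error"
	TaskUnknown      TaskState = "unknown"
)

// statusForTaskState returns the HTTP status for a request that needs
// an active task. States that resolve by waiting or resuming map to
// 409 Conflict; error/unknown (and anything unrecognized) map to 400.
func statusForTaskState(s TaskState) int {
	switch s {
	case TaskActive:
		return http.StatusOK
	case TaskPending, TaskInitializing, TaskPaused:
		return http.StatusConflict
	default:
		return http.StatusBadRequest
	}
}

func main() {
	states := []TaskState{
		TaskActive, TaskPending, TaskInitializing,
		TaskPaused, TaskError, TaskUnknown,
	}
	for _, s := range states {
		fmt.Printf("%-12s -> %d\n", s, statusForTaskState(s))
	}
}
```

A client can treat 409 as "retry after waiting or resuming" and 400 as a terminal failure, which is the distinction the commit message relies on for auto-resume.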
Mathias Fredriksson c60c373bc9 fix(coderd): clean up task snapshots on task deletion (#21949)
Task snapshots were orphaned when tasks were soft-deleted. The
`task_snapshots` table has an `ON DELETE CASCADE` foreign key, but
that only fires on hard deletes.

Modified DeleteTask to use a CTE that atomically soft-deletes the
task and removes its snapshot in a single transaction. The query now
returns just the task UUID instead of the full row.

Closes coder/internal#1283
2026-02-06 11:55:33 +02:00
Cian Johnston 25a0c807cb chore(coderd/database/dbfake): add support for provisioner job timestamp control (#21944)
Relates to https://github.com/coder/coder/pull/21922 /
https://github.com/coder/internal/issues/1259

* Adds `dbfake.BuilderOption func(*WorkspaceBuildBuilder)`
* Adds `BuilderOption` methods for setting various provisioner job
related fields on `WorkspaceBuildBuilder`.
* Migrates a number of existing tests that previously depended on
provisioner job timing to use these updated methods in the following
packages:
  * `coderd/jobreaper`
  * `coderd/notifications/reports`
  * `enterprise/coderd/schedule`
  * `enterprise/coderd/prebuilds`
  * `scripts/workspace-runtime-audit` 

🤖 Created using Mux (Opus 4.5)

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-02-06 09:44:40 +00:00
Spike Curtis b84bb43a07 feat: add standard encodings to binary cache (#21921)
fixes: https://github.com/coder/internal/issues/1300

Adds brotli and zstd compression to the binary cache. Also refactors coderd's streaming encoding middleware to use the same standard set of compression algorithms, so we have them in one place.
2026-02-06 11:28:08 +04:00
Spike Curtis 6b1adb8b12 chore: refactor site handler to take cache dir (#21918)
relates to: https://github.com/coder/internal/issues/1300

Refactors the options to the site handler to take the cache directory, rather than expecting the caller to call `ExtractOrReadBinFS` and pass the results.

This is important in this stack because we need direct access to the cache directory for compressed file caching.
2026-02-06 10:56:48 +04:00
Spike Curtis 8aa9e9acc3 feat: add cachecompress package to compress static files for HTTP (#21915)
relates to: https://github.com/coder/internal/issues/1300

Adds a new package called `cachecompress` which takes a `http.FileSystem` and wraps it with an on-disk cache of compressed files. We lazily compress files when they are requested over HTTP.

# Why we want this

With cached compression, we significantly reduce CPU utilization during workspace creation.

![image.png](https://app.graphite.com/user-attachments/assets/b9e6a38e-c83d-47f2-9e5b-22913c129a84.png)

This is from a 2k scaletest at the top of this stack of PRs, so that it's used to serve `/bin/` files. Previously we pegged the 4-core Coderds, with profiling showing 40% of CPU going to `zstd` compression (cf. https://github.com/coder/internal/issues/1300).

With this change compression is reduced down to 1s of CPU time (from 7 minutes).

# Implementation details

The basic structure is taken from Chi's Compressor middleware. I've reproduced the `LICENSE` in the directory because it's MIT licensed, not AGPL like the rest of Coder.

I've structured it not as a middleware that calls an arbitrary upstream HTTP handler, but taking an explicit `http.FileSystem`. This is done for safety so we are only caching static files and not dynamically generated content with this.

One limitation is that on first request for a resource, it compresses the whole file before starting to return any data to the client. For large files like the Coder binaries, this can add 1-5 seconds to the time-to-first-byte, depending on the compression used.

I think this is reasonable: it only affects the very first download of the binary with a particular compression for a particular Coderd.

If we later find this unacceptable, we can fix it without changing interfaces. We can poll the file system to figure out how much data is available while the compression is in progress.
2026-02-06 10:12:58 +04:00
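The lazy, compress-once behavior described in 8aa9e9acc3 can be sketched as follows. This is a simplified illustration under stated assumptions, not the `cachecompress` package: it caches in memory rather than on disk, uses gzip from the standard library instead of brotli or zstd, and stands in a plain map for the wrapped `http.FileSystem`. All type and function names are hypothetical.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"sync"
)

// entry holds the compressed form of one file. once guarantees that
// compression runs a single time even under concurrent requests.
type entry struct {
	once sync.Once
	data []byte
	err  error
}

// lazyCache is a hypothetical in-memory stand-in for an on-disk cache.
type lazyCache struct {
	mu      sync.Mutex
	entries map[string]*entry
	source  map[string][]byte // stands in for the wrapped file system
	runs    int               // how many times compression actually ran
}

func newLazyCache(source map[string][]byte) *lazyCache {
	return &lazyCache{entries: map[string]*entry{}, source: source}
}

// get returns the gzip-compressed file, compressing it on first request.
// As in the commit's stated limitation, the first caller waits for the
// whole file to be compressed before any bytes are returned.
func (c *lazyCache) get(name string) ([]byte, error) {
	c.mu.Lock()
	e, ok := c.entries[name]
	if !ok {
		e = &entry{}
		c.entries[name] = e
	}
	c.mu.Unlock()

	e.once.Do(func() {
		raw, found := c.source[name]
		if !found {
			e.err = fmt.Errorf("file %q not found", name)
			return
		}
		var buf bytes.Buffer
		zw := gzip.NewWriter(&buf)
		if _, err := zw.Write(raw); err != nil {
			e.err = err
			return
		}
		if err := zw.Close(); err != nil {
			e.err = err
			return
		}
		e.data = buf.Bytes()
		c.mu.Lock()
		c.runs++
		c.mu.Unlock()
	})
	return e.data, e.err
}

func main() {
	c := newLazyCache(map[string][]byte{
		"/bin/coder": bytes.Repeat([]byte("coder "), 1000),
	})
	for i := 0; i < 3; i++ {
		data, err := c.get("/bin/coder")
		if err != nil {
			panic(err)
		}
		fmt.Printf("request %d: %d compressed bytes\n", i+1, len(data))
	}
	fmt.Println("compression runs:", c.runs)
}
```

Restricting the cache to a known set of files, rather than wrapping an arbitrary handler, mirrors the safety argument in the commit message: only static content can ever be cached.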
Steven Masley efd98bd93a chore: add template toggle to disable module caching (#21931)
There exist use cases for disabling the new module caching behavior of
workspace builds. Not caching modules was the legacy behavior.
2026-02-05 14:38:55 -06:00
Mathias Fredriksson 96695edfed fix(coderd/database): correct task pending status logic (#21886)
Previously, tasks with pending provisioner jobs (not yet picked up)
were incorrectly reported as "initializing".

Refs #21887
2026-02-05 14:08:03 +02:00
Jon Ayers 22ece10a4a feat: add healthy filter for workspace queries (#21743)
Adds support for filtering workspaces by health status using
healthy:true or healthy:false in the search query.

This is done by changing `has-agent` to accept a list of statuses and
aliasing `healthy:true` to `has-agent:connected` and `healthy:false` to
`has-agent:timeout,disconnected`.

Fixes #21623
2026-02-04 20:48:27 -06:00
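The aliasing described in 22ece10a4a can be sketched as a rewrite of a single search term. The function name `expandHealthy` is hypothetical and the real query parser in coderd is more involved; only the two mappings come from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// expandHealthy rewrites a healthy:<bool> search term into the
// equivalent has-agent term. It reports whether a rewrite happened;
// any other term is returned unchanged.
func expandHealthy(term string) (string, bool) {
	switch strings.ToLower(strings.TrimSpace(term)) {
	case "healthy:true":
		return "has-agent:connected", true
	case "healthy:false":
		return "has-agent:timeout,disconnected", true
	default:
		return term, false
	}
}

func main() {
	for _, term := range []string{"healthy:true", "healthy:false", "owner:me"} {
		out, rewritten := expandHealthy(term)
		fmt.Printf("%-14s -> %-32s (rewritten=%v)\n", term, out, rewritten)
	}
}
```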
Danielle Maywood af0e171595 feat(coderd/agentapi): support terraform-defined subagent ids (#21837)
Update `coderd/agentapi` to handle pre-created sub agents
2026-02-04 15:33:48 +00:00
Steven Masley a4ffafd46d test: remove provisioner heartbeat from 'AllProvisionersStale' (#21903)
Provisioner async heartbeat will mark the 'stale' provisioner as ready

closes https://github.com/coder/internal/issues/1288
2026-02-04 08:29:44 -06:00
Steven Masley 6759b51cd6 feat: add endpoint to fetch singular org member (#21732) 2026-02-03 12:48:25 -06:00
Cian Johnston 91be688e39 chore(coderd/database): remove deprecated db2sdk.List(Lazy)? methods (#21902)
Removes deprecated methods db2sdk.List and db2sdk.ListLazy.
2026-02-03 17:52:07 +00:00
ケイラ 7fd13019e5 fix: disable task sharing (#21867) 2026-02-03 09:43:40 -07:00
Steven Masley a16debee76 test: template import should never complete, use Plan over apply (#21895)
Closes https://github.com/coder/internal/issues/1221
2026-02-03 10:16:53 -06:00
Danielle Maywood 2de8cdf160 feat(agent): add subagent ID fields to devcontainers in manifest (#21848)
Update the agent protobuf schema (agent/proto/agent.proto) to include:
- subagent_id field in WorkspaceAgentDevcontainer message
- id field in CreateSubAgentRequest message

Bump the Agent API version from v2.7 to v2.8 and update all client
references throughout the codebase (ConnectRPC27 -> ConnectRPC28,
DRPCAgentClient27 -> DRPCAgentClient28).
2026-02-03 12:37:30 +00:00
Cian Johnston 353ebd9664 feat: add link for viewing raw build logs in workspace and template build jobs (#21727)
* Adds support for parameter `format=text` in the following API routes:
  * `/api/v2/workspaceagents/:id/logs`
  * `/api/v2/workspacebuilds/:id/logs`
  * `/api/v2/templateversions/:id/logs` 
  * `/api/v2/templateversions/:id/dry-run/:id/logs` 

* Adds links to view raw logs on the following pages:
  * Workspace build page
  * Template editor page
  * Template version page

* Refactors existing log formatting in `cli/logs.go` to live in `codersdk`.

🤖 Generated with Claude Opus 4.5, reviewed by me.

---------

Co-authored-by: Claude <noreply@anthropic.com>
2026-02-03 09:45:23 +00:00
Mathias Fredriksson f75cbab6ce fix(coderd/database): prevent AcquireProvisionerJob from grabbing canceled jobs (#21852)
The AcquireProvisionerJob query only checked started_at IS NULL, allowing
it to acquire jobs that were canceled while pending (which have
completed_at set but started_at still NULL).

Added completed_at IS NULL check to the query to prevent this.

Also fixed JobCompleteBuilder.Do() in dbfake to set started_at when
completing jobs to match production behavior.

Fixes coder/internal#1323
2026-02-03 10:42:17 +02:00
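The condition fixed in f75cbab6ce can be restated as an in-memory predicate. This is only a sketch of the logic; the actual fix is in the SQL query, and the `job` type and helper names here are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// job models only the two timestamps relevant to acquisition.
type job struct {
	StartedAt   *time.Time
	CompletedAt *time.Time
}

// newJob builds a job with the given timestamps set or left nil.
func newJob(started, completed bool) job {
	now := time.Now()
	var j job
	if started {
		j.StartedAt = &now
	}
	if completed {
		j.CompletedAt = &now
	}
	return j
}

// acquirable mirrors "started_at IS NULL AND completed_at IS NULL".
// Checking only StartedAt would wrongly accept a job canceled while
// pending, which has CompletedAt set but StartedAt still nil.
func acquirable(j job) bool {
	return j.StartedAt == nil && j.CompletedAt == nil
}

func main() {
	fmt.Println("pending:               ", acquirable(newJob(false, false)))
	fmt.Println("canceled while pending:", acquirable(newJob(false, true)))
	fmt.Println("running:               ", acquirable(newJob(true, false)))
	fmt.Println("finished:              ", acquirable(newJob(true, true)))
}
```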
Jon Ayers 3c1db17361 fix: use existing transaction to claim prebuild (#21862)
- Claiming a prebuild was happening outside a transaction
2026-02-02 17:57:59 -06:00
Zach 90aeea5649 fix: handle boundary usage across snapshots and flush races (#21805)
Previously there were two issues that could cause incorrect boundary
usage telemetry data.

1. Bad handling across snapshot intervals: After telemetry snapshot deleted
the DB row, the next flush would INSERT the stale cumulative data (which
included already-reported usage). This would then be overwritten by
subsequent UPDATE flushes, causing the delta between the last snapshot
and the reset to be lost (under-reporting usage). Additionally, if there
was no new usage after the reset, the tracker would carry over all usage
from the previous period into the next period (over-reporting usage).

2. Missed usage from a race condition: Track() calls between the first
mutex unlock and second mutex lock in FlushToDB() were lost. The data
wasn't included in the current flush (already snapshotted) and was wiped
by the subsequent reset. This is likely low impact to overall usage
numbers in the real world.

Fix by tracking unique workspace/user deltas separately from cumulative
values and always tracking delta allowed/denied requests. Deltas are used
for INSERT (fresh start after reset), cumulative for UPDATE (accurate unique
counts within a period). All counters reset atomically before the DB operation
so Track() calls during the operation are preserved for the next flush.
2026-02-02 09:11:54 -07:00
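The race fix in 90aeea5649 hinges on snapshotting and resetting the counters under a single lock acquisition. A minimal sketch of that idea follows; it tracks only allowed/denied deltas and omits the unique workspace/user bookkeeping and the database write, and all names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// counts holds request deltas accumulated since the last flush.
type counts struct {
	Allowed int64
	Denied  int64
}

// tracker is a simplified, hypothetical stand-in for the real tracker.
type tracker struct {
	mu    sync.Mutex
	delta counts
}

// Track records one allowed or denied request.
func (t *tracker) Track(allowed bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if allowed {
		t.delta.Allowed++
	} else {
		t.delta.Denied++
	}
}

// takeDelta returns the accumulated counts and resets them while still
// holding the lock. Because snapshot and reset are one critical section,
// a Track call can never land between them and be wiped by the reset.
func (t *tracker) takeDelta() counts {
	t.mu.Lock()
	defer t.mu.Unlock()
	d := t.delta
	t.delta = counts{}
	return d
}

func main() {
	tr := &tracker{}
	tr.Track(true)
	tr.Track(true)
	tr.Track(false)

	first := tr.takeDelta() // the DB write would happen after this returns
	tr.Track(true)          // arrives "during" the write; kept for next flush
	second := tr.takeDelta()

	fmt.Printf("first flush:  %+v\n", first)
	fmt.Printf("second flush: %+v\n", second)
}
```

Because the reset happens before the slow database operation rather than after it, anything tracked while the write is in flight simply accumulates into the next delta.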
Steven Masley 6b3d4377c3 feat: archive modules in size order until limit is hit (#21773)
Archiving modules now attempts to save as many modules as it can before it hits the limit. This enables as much of the template as possible, rather than causing a hard failure.
2026-02-02 09:03:18 -06:00
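One way to read "archive modules in size order until limit is hit" from 6b3d4377c3 is a greedy selection. The sketch below assumes ascending size order and stops at the first module that would exceed the limit; both are assumptions, since the commit message does not state the direction or the stop rule, and the `module` type and `selectModules` function are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// module is a hypothetical record of a module and its archived size.
type module struct {
	Name string
	Size int64
}

// selectModules returns the modules to archive, smallest first, stopping
// before the first module that would push the total past limit.
// The input slice is not modified.
func selectModules(mods []module, limit int64) []module {
	sorted := append([]module(nil), mods...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Size < sorted[j].Size
	})

	var picked []module
	var total int64
	for _, m := range sorted {
		if total+m.Size > limit {
			break
		}
		total += m.Size
		picked = append(picked, m)
	}
	return picked
}

func main() {
	mods := []module{{"a", 50}, {"b", 10}, {"c", 30}}
	for _, m := range selectModules(mods, 45) {
		fmt.Printf("archive %s (%d bytes)\n", m.Name, m.Size)
	}
}
```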
Thomas Kosiewski dd6aec04d7 fix(coderd/oauth2provider): support client_secret_basic client auth (#21793) 2026-02-02 16:01:33 +01:00
Jake Howell 052bd114a4 fix: resolve missing users in <UserCombobox /> (#21822)
Closes #21044

This pull request addresses an issue where we would filter the
`<UserCombobox />` by the user's username or email, not their name
(which the rendered options would show).

To highlight this I created three different users, each with a username
that did not contain their `email` or `name`, and attempted to filter.
Searching for `John` wouldn't actually show the user, as his username
was `x`. In fact, even where a subset of users was returned from the
backend for having `john` in the `email`, it would've been filtered out
by the frontend for not being in the `name` field.

| Name | Username |
| --- | --- |
| `Jake` | `z` |  
| `Jeff` | `y` |
| `John` | `x` |

| Previously | Now |
| --- | --- |
| <img width="560" height="547" alt="OLD_USER_COMBOBOX"
src="https://github.com/user-attachments/assets/a0567264-0034-42ac-aba0-95b05c4f92dd"
/> | <img width="580" height="548" alt="NEW_USER_COMBOBOX"
src="https://github.com/user-attachments/assets/1aa0c942-d340-4b1c-8dde-b97879525bfb"
/> |
2026-02-03 00:13:41 +11:00
Marcin Tojek 3e369c0b04 fix: separate SMTP envelope and header addresses (#21840)
## Description

When configuring a From address with a display name (e.g., `Coder System
<system@coder.com>`), the SMTP `MAIL FROM` command was incorrectly
receiving the full address string instead of just the bare email
address, causing `501 Invalid MAIL argument` errors on some SMTP
servers.

## Changes

- Updated `validateFromAddr` to return both:
  - `envelopeFrom`: bare email for SMTP `MAIL FROM` command (RFC 5321)
  - `headerFrom`: original address with display name for email header
    (RFC 5322)

Fixes #20727
2026-02-02 13:53:02 +01:00
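The envelope/header split from 3e369c0b04 can be illustrated with the standard library's `net/mail` parser. This is a sketch, not the `validateFromAddr` implementation from the commit; `splitFrom` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net/mail"
)

// splitFrom parses a configured From value and returns the bare address
// for the SMTP MAIL FROM command (RFC 5321) alongside the original
// string, display name included, for the From header (RFC 5322).
func splitFrom(from string) (envelopeFrom, headerFrom string, err error) {
	addr, err := mail.ParseAddress(from)
	if err != nil {
		return "", "", fmt.Errorf("parse from address: %w", err)
	}
	return addr.Address, from, nil
}

func main() {
	envelope, header, err := splitFrom("Coder System <system@coder.com>")
	if err != nil {
		panic(err)
	}
	fmt.Println("MAIL FROM:", envelope)
	fmt.Println("From:     ", header)
}
```

Passing the full `Coder System <system@coder.com>` string to `MAIL FROM` is what triggered the `501 Invalid MAIL argument` errors described above; only the bare address belongs in the envelope.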
George K c60f802580 fix(coderd/rbac): make workspace ACL disabled flag atomic (#21799)
The flag is a package-global that was only meant to be set once on
startup. This was a bad assumption since the lack of sync caused test
flakes.

Related to:
https://github.com/coder/internal/issues/1317
https://github.com/coder/internal/issues/1318
2026-01-30 11:21:27 -08:00
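The change in c60f802580 amounts to replacing a plain package-level boolean with an atomic one. A minimal sketch using `sync/atomic` follows; the variable and function names are hypothetical rather than those in `coderd/rbac`.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// aclDisabled is a package-global flag. atomic.Bool makes concurrent
// reads and writes safe, which a plain bool is not.
var aclDisabled atomic.Bool

// setACLDisabled updates the flag.
func setACLDisabled(v bool) { aclDisabled.Store(v) }

// isACLDisabled reads the flag.
func isACLDisabled() bool { return aclDisabled.Load() }

func main() {
	// Concurrent writers and readers, as parallel tests might produce.
	// With a plain bool the race detector would flag this.
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(v bool) {
			defer wg.Done()
			setACLDisabled(v)
			_ = isACLDisabled()
		}(i%2 == 0)
	}
	wg.Wait()

	setACLDisabled(true)
	fmt.Println("disabled:", isACLDisabled())
}
```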
Danielle Maywood 37aecda165 feat(coderd/provisionerdserver): insert sub agent resource (#21699)
Update provisionerdserver to handle the changes introduced to
provisionerd in https://github.com/coder/coder/pull/21602

We now create a relationship between `workspace_agent_devcontainers` and
`workspace_agents` with the newly created `subagent_id`.
2026-01-30 17:19:19 +00:00
Mathias Fredriksson 21eabb1d73 feat(coderd): return log snapshot for paused tasks (#21771)
Previously the task logs endpoint only worked when the workspace was
running, leaving users unable to view task history after pausing.

This change adds snapshot retrieval with state-based branching: active
tasks fetch live logs from AgentAPI, paused/initializing/pending tasks
return stored snapshots (providing continuity during pause/resume), and
error/unknown states return HTTP 409 Conflict.

The response includes snapshot metadata (snapshot, snapshot_at) to
indicate whether logs are live or historical.

Closes coder/internal#1254
2026-01-30 16:09:45 +02:00
Danny Kopping 536bca7ea9 chore: log api key on each HTTP API request (#21785)
Operators need to know which API key was used in HTTP requests.

For example, if a key is leaking and a DDOS is underway using that key, operators need a way to identify the key in use and take steps to expire the key (see https://github.com/coder/coder/issues/21782).

_Disclaimer: created using Claude Opus 4.5_
2026-01-30 14:48:10 +02:00
Marcin Tojek 036ed5672f fix!: remove deprecated prometheus metrics (#21788)
## Description

Removes the following deprecated Prometheus metrics:

- `coderd_api_workspace_latest_build_total` → use
`coderd_api_workspace_latest_build` instead
- `coderd_oauth2_external_requests_rate_limit_total` → use
`coderd_oauth2_external_requests_rate_limit` instead

These metrics were deprecated in #12976 because gauge metrics should
avoid the `_total` suffix per [Prometheus naming
conventions](https://prometheus.io/docs/practices/naming/).

## Changes

- Removed deprecated metric `coderd_api_workspace_latest_build_total`
from `coderd/prometheusmetrics/prometheusmetrics.go`
- Removed deprecated metric
`coderd_oauth2_external_requests_rate_limit_total` from
`coderd/promoauth/oauth2.go`
- Updated tests to use the non-deprecated metric name

Fixes #12999
2026-01-30 13:30:06 +01:00
Jaayden Halko 4847920407 fix: don't allow sharing admins to change own role (#21634)
resolve coder/internal#1280
2026-01-30 06:27:30 -05:00
Ethan a464ab67c6 test: use explicit names in TestStartAutoUpdate to prevent flake (#21745)
The test was creating two template versions without explicit names,
relying on `namesgenerator.NameDigitWith()` which can produce
collisions. When both versions got the same random name, the test failed
with a 409 Conflict error.

Fix by giving each version an explicit name (`v1`, `v2`).

Closes https://github.com/coder/internal/issues/1309

---

*Generated by [mux](https://mux.coder.com)*
2026-01-30 13:24:06 +11:00
Zach 0611e90dd3 feat: add time window fields to telemetry boundary usage (#21772)
Add PeriodStart and PeriodDurationMilliseconds fields to BoundaryUsageSummary
so consumers of telemetry data can understand usage within a particular time window.
2026-01-29 13:40:55 -07:00
Marcin Tojek 04b0253e8a feat: add Prometheus metrics for license warnings and errors (#21749)
Fixes: coder/internal#767

Adds two new Prometheus metrics for license health monitoring:

- `coderd_license_warnings` - count of active license warnings
- `coderd_license_errors` - count of active license errors

Metrics endpoint after startup of a deployment with license enabled:

```
...
# HELP coderd_license_errors The number of active license errors.
# TYPE coderd_license_errors gauge
coderd_license_errors 0
...
# HELP coderd_license_warnings The number of active license warnings.
# TYPE coderd_license_warnings gauge
coderd_license_warnings 0
...
```
2026-01-29 13:50:15 +01:00
Steven Masley dfbd541cee chore: move List util out of db2sdk to avoid circular imports (#21733) 2026-01-28 13:07:53 -06:00
Steven Masley e13f2a9869 chore: remove extra stop_modules from provisionerd proto (#21706)
Was a duplicate of start_modules

Closes https://github.com/coder/coder/issues/21206
2026-01-28 09:25:47 -06:00
Spike Curtis 7090a1e205 chore: renumber duplicate migration 000411 (#21720)
Fixes recent duplicate DB migration in #21607
2026-01-28 08:01:58 +04:00
Spike Curtis f358a6db11 chore: convert tailnet tables to UNLOGGED for improved write performance (#21607)
This migration converts all tailnet coordination tables to UNLOGGED:
- `tailnet_coordinators`
- `tailnet_peers`
- `tailnet_tunnels`

UNLOGGED tables skip Write-Ahead Log (WAL) writes, significantly
improving performance for high-frequency updates like coordinator
heartbeats and peer state changes.

The trade-off is that UNLOGGED tables are truncated on crash recovery
and are not replicated to standby servers. This is acceptable for these
tables because the data is ephemeral:
1. Coordinators re-register on startup
2. Peers re-establish connections on reconnect
3. Tunnels are re-created based on current peer state

**Migration notes:**
- Child tables must be converted before the parent table because LOGGED
child tables cannot reference UNLOGGED parent tables (but the reverse is
allowed)
- The down migration reverses the order: parent first, then children

Fixes https://github.com/coder/coder/issues/21333
2026-01-28 07:12:32 +04:00
Zach 2204731ddb feat: implement boundary usage tracker and telemetry collection (#21716)
Implements boundary usage tracking across all Coder
replicas and reports the results via telemetry.

Changes:
- Implement Tracker with Track(), FlushToDB(), and StartFlushLoop() methods
- Add telemetry integration via collectBoundaryUsageSummary()
- Use telemetry lock to ensure only one replica collects per period

The tracker accumulates unique workspaces, unique users, and request
counts (allowed/denied) in memory, then flushes to the database
periodically. During telemetry collection, stats are aggregated across
all replicas and reset for the next period.
2026-01-27 19:11:40 -07:00
Steven Masley 799b190dee fix: do not enforce managed agent limit for non-task workspaces (#21689)
Only task workspaces have the checks in wsbuilder for violating the
managed agent caps in the license.

Stopped tasks that are resumed with a regular workspace start **still
count as usage**.
2026-01-27 19:01:17 -06:00
Zach 7dfa33b410 feat: add boundary usage tracking database schema and tracker skeleton (#21670)
feat: add boundary usage telemetry database schema and RBAC

Adds the foundation for tracking boundary usage telemetry across Coder
replicas. This includes:

  - Database schema: `boundary_usage_stats` table with per-replica stats
    (unique workspaces, unique users, allowed/denied request counts)
  - Database queries: upsert stats, get aggregated summary, reset stats,
    delete by replica ID
  - RBAC: `boundary_usage` resource type with read/update/delete actions,
    accessible only via system `BoundaryUsageTracker` subject (not regular
    user roles)
  - Tracker skeleton + docs: stub implementation in `coderd/boundaryusage/`

The tracker accumulates stats in memory and periodically flushes to the
database. Stats are aggregated across replicas for telemetry reporting,
then reset when a new reporting period begins. The tracker implementation
and plumbing will be done in a subsequent commit/PR.

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-27 13:29:21 -07:00
George K c352a51b22 fix(coderd): authorize workspace start/stop/delete by transition action (#21691)
Use transition-specific actions when authorizing workspace build
parameter inserts in the database layer so start/stop/delete do not
require workspace.update.

Related to: https://github.com/coder/internal/issues/1299
2026-01-27 09:08:12 -08:00
Cian Johnston 7b44976618 fix(coderd/provisionerdserver): correct managed agent tracking (#21696)
Relates to https://github.com/coder/internal/issues/1282

Updates tracking of managed agents to be predicated on the
presence of a related `task_id` instead of the presence of a
`coder_ai_task` resource.
2026-01-27 12:14:52 +00:00
Mathias Fredriksson 25d7f27cdb feat(coderd): add task log snapshot storage endpoint (#21644)
This change adds a POST /workspaceagents/me/tasks/{task}/log-snapshot
endpoint for agents to upload task conversation history during
workspace shutdown. This allows users to view task logs even when the
workspace is stopped.

The endpoint accepts agentapi format payloads (typically last 10
messages, max 64KB), wraps them in a format envelope, and upserts to the
task_snapshots table. Uses agent token auth and validates the task
belongs to the agent's workspace.

Closes coder/internal#1253
2026-01-27 11:09:24 +02:00
Danny Kopping 7123518baa feat: conditionally send aibridge actor headers (#21643)
Also passes along the authenticated username as actor metadata.

Closes https://github.com/coder/aibridge/issues/135
Depends on https://github.com/coder/aibridge/pull/142

**Replace aibridge tag with merge commit once
https://github.com/coder/aibridge/pull/142 lands.**

---------

Signed-off-by: Danny Kopping <danny@coder.com>
2026-01-26 15:08:17 +00:00
Cian Johnston 612aae2523 chore: replace httpapi.Heartbeat with httpapi.HeartbeatClose (#21676)
Relates to https://github.com/coder/coder/pull/21676

* Replaces all existing usages of `httpapi.Heartbeat` with `httpapi.HeartbeatClose`
* Removes `httpapi.Heartbeat`
2026-01-26 12:11:40 +00:00
Spike Curtis f47f89d997 chore: remove unused tailnet v1 tables and queries (#21646)
Removes the legacy tailnet v1 API tables (`tailnet_clients`, `tailnet_agents`, `tailnet_client_subscriptions`) and their associated queries, triggers, and functions. These were superseded by the v2 tables (`tailnet_peers`, `tailnet_tunnels`) in migration 000168, and the v1 API code was removed in commit d6154c4310, but the database artifacts were never cleaned up.

**Changes:**
- New migration `000410_remove_tailnet_v1_tables` to drop the unused tables
- Removed 11 unused queries from `tailnet.sql`
- Removed associated manual wrapper methods in `dbauthz` and `dbmetrics`
- ~930 lines deleted across 11 files
2026-01-26 14:27:17 +04:00