This PR fixes a workspace app authentication bug where requests that
include an `Authorization` header (intended for the upstream app) can
cause Coder to ignore the workspace app session cookie
(`coder_subdomain_app_session_token_*` /
`coder_path_app_session_token`). When that happens, Coder fails to mint
or renew `coder_signed_app_token` and redirects to
`/api/v2/applications/auth-redirect` instead of proxying the request to
the workspace.
This commonly shows up when users run a frontend and backend in the same
workspace and the backend requires `Authorization` (for example, `curl
-H "Authorization: bearer ..."` or browser `fetch()` calls).
Related issues / context:
* Primary bug report and repro:
[https://github.com/coder/coder/issues/21467](https://github.com/coder/coder/issues/21467)
* Related symptoms reported as CORS / redirect failures for workspace
apps:
*
[https://github.com/coder/coder/issues/20667](https://github.com/coder/coder/issues/20667)
*
[https://github.com/coder/coder/issues/19728](https://github.com/coder/coder/issues/19728)
## Root Cause
In `coderd/workspaceapps/cookies.go`, `AppCookies.TokenFromRequest`
checked `httpmw.APITokenFromRequest(r)` first. That helper returns a
token from several places, including `Authorization: Bearer ...`.
As a result, when a request included an upstream `Authorization` header,
that header value was returned as the “session token” for the app proxy,
and `coder_subdomain_app_session_token_*` was never read. Authentication
then failed and the request was treated as signed out.
## Fix
Change the precedence in `AppCookies.TokenFromRequest`:
1. First check the access-method-specific cookie:
* subdomain apps: `coder_subdomain_app_session_token_{hash}`
* path apps: `coder_path_app_session_token`
2. If not present, fall back to `httpmw.APITokenFromRequest(r)` (so
non-browser clients can still authenticate via query, header, or bearer
tokens if they really want to).
This ensures that:
* Backend requests that require `Authorization` still reach the
workspace.
* `coder_signed_app_token` can be renewed from the app session cookie
even when `Authorization` is present.
* `Authorization` is still forwarded to the upstream app (the reverse
proxy code does not strip it).
Initially, I attempted workarounds
([https://github.com/coder/coder/issues/20667#issuecomment-3868578388](https://github.com/coder/coder/issues/20667#issuecomment-3868578388),
[https://github.com/coder/coder/issues/19728#issuecomment-3868578093](https://github.com/coder/coder/issues/19728#issuecomment-3868578093)),
but adding `/auth-redirect` to the permissive CORS paths and extending
the validity of workspace app auth tokens from 1 minute to 1 hour only
partially masked the issue. After workspace restarts and token expiry, I
no longer saw CORS errors, but the tokens were still not renewed.
After patching my local Nix-based setup on Coder v1.30.0 with this
change, I can no longer observe this behavior.
I was trying to figure out why `goleak` was complaining about a dangling
http2 connection goroutine in tests. Turns out that `taskname.Generate`
will call out to Anthropic if an API key is set, and we're calling it in
`dbgen`. Modified to use testutil method instead.
Closes https://github.com/coder/internal/issues/1261.
This pull request adds an endpoint to pause coder tasks by stopping the
underlying workspace.
* Instead of `POST /api/v2/tasks/{user}/{task}/pause`, the endpoint is
currently experimental.
* We do not currently set the build reason to `task_manual_pause`,
because build reasons are currently only used on stop transitions.
Adds coderd_template_workspace_build_duration_seconds histogram that
tracks the full duration from workspace build creation to agent ready.
This captures the complete user-perceived build time including
provisioning and agent startup.
The metric is emitted when the agent reports ready/error/timeout via the
lifecycle API, ensuring each build is counted exactly once per replica.
Previously, UpsertBoundaryUsageStats (INSERT...ON CONFLICT DO UPDATE) and
GetAndResetBoundaryUsageSummary (DELETE...RETURNING) could race during
telemetry period cutover. Without serialization, an upsert concurrent with the
delete could lose data (deleted right after being written) or commit after the
delete (miscounted in the next period). Both operations now acquire
LockIDBoundaryUsageStats within a transaction to ensure a clean cutover.
Adds a Go wrapper (`scripts/apidocgen/swaginit/main.go`) that calls
swag's Go API with `Strict: true`. The `--strict` flag isn't available
in swag's CLI in any version, so the wrapper is the only way to enable
it.
Also upgrades swag from v1.16.2 to v1.16.6 (better generics support,
precise numeric formats, `x-enum-descriptions`, CVE-2024-45338 fix).
Previously we returned 400 Bad Request for all non-active states. This
was semantically incorrect for transitional and paused states where the
request is valid but conflicts with current state.
We now return 409 Conflict for pending/initializing/paused (resolvable
by waiting or resuming) and 400 for error/unknown (actual problems).
This enables client-side auto-resume orchestration per the task
lifecycle RFC.
Closescoder/internal#1265
Task snapshots were orphaned when tasks were soft-deleted. The
`task_snapshots` table has an `ON DELETE CASCADE` foreign key, but
that only fires on hard deletes.
Modified DeleteTask to use a CTE that atomically soft-deletes the
task and removes its snapshot in a single transaction. The query now
returns just the task UUID instead of the full row.
Closescoder/internal#1283
Relates to https://github.com/coder/coder/pull/21922 /
https://github.com/coder/internal/issues/1259
* Adds `dbfake.BuilderOption func(*WorkspaceBuildBuilder)`
* Adds `BuilderOption` methods for setting various provisioner job
related fields on `WorkspaceBuildBuilder`.
* Migrates a number of existing tests that previously dependeded on
provisioner job timing to use these updated methods in the following
packages:
* `coderd/jobreaper`
* `coderd/notifications/reports`
* `enterprise/coderd/schedule`
* `enterprise/coderd/prebuilds`
* `scripts/workspace-runtime-audit`
🤖 Created using Mux (Opus 4.5)
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
fixes: https://github.com/coder/internal/issues/1300
Adds brotli and zstd compression to the binary cache. Also refactors coderd's streaming encoding middleware to use the same standard set of compression algorithms, so we have them in one place.
relates to: https://github.com/coder/internal/issues/1300
Refactors the options to the site handler to take the cache directory, rather than expecting the caller to call `ExtractOrReadBinFS` and pass the results.
This is important in this stack because we need direct access to the cache directory for compressed file caching.
relates to: https://github.com/coder/internal/issues/1300
Adds a new package called `cachecompress` which takes a `http.FileSystem` and wraps it with an on-disk cache of compressed files. We lazily compress files when they are requested over HTTP.
# Why we want this
With cached compress, we reduce CPU utilization during workspace creation significantly.

This is from a 2k scaletest at the top of this stack of PRs so that it's used to server `/bin/` files. Previously we pegged the 4-core Coderds, with profiling showing 40% of CPU going to `zstd` compression (c.f. https://github.com/coder/internal/issues/1300).
With this change compression is reduced down to 1s of CPU time (from 7 minutes).
# Implementation details
The basic structure is taken from Chi's Compressor middleware. I've reproduced the `LICENSE` in the directory because it's MIT licensed, not AGPL like the rest of Coder.
I've structured it not as a middleware that calls an arbitrary upstream HTTP handler, but taking an explicit `http.FileSystem`. This is done for safety so we are only caching static files and not dynamically generated content with this.
One limitation is that on first request for a resource, it compresses the whole file before starting to return any data to the client. For large files like the Coder binaries, this can add 1-5 seconds to the time-to-first-byte, depending on the compression used.
I think this is reasonable: it only affects the very first download of the binary with a particular compression for a particular Coderd.
If we later find this unacceptible, we can fix it without changing interfaces. We can poll the file system to figure out how much data is available while the compression is inprogress.
Adds support for filtering workspaces by health status using
healthy:true or healthy:false in the search query.
This is done by changing `has-agent` to accept a list of statuses and
aliasing `health:true` to `has-agent:connected` and `healthy:false` to
`has-agent:timeout,disconnected`.
Fixes#21623
Update the agent protobuf schema (agent/proto/agent.proto) to include:
- subagent_id field in WorkspaceAgentDevcontainer message
- id field in CreateSubAgentRequest message
Bump the Agent API version from v2.7 to v2.8 and update all client
references throughout the codebase (ConnectRPC27 -> ConnectRPC28,
DRPCAgentClient27 -> DRPCAgentClient28).
* Adds support for parameter `format=text` in the following API routes:
* `/api/v2/workspaceagents/:id/logs`
* `/api/v2/workspacebuilds/:id/logs`
* `/api/v2/templateversions/:id/logs`
* `/api/v2/templateversions/:id/dry-run/:id/logs`
* Adds links to view raw logs on the following pages:
* Workspace build page
* Template editor page
* Template version page
* Refactors existing log formatting in `cli/logs.go` to live in `codersdk`.
🤖 Generated with Claude Opus 4.5, reviewed by me.
---------
Co-authored-by: Claude <noreply@anthropic.com>
The AcquireProvisionerJob query only checked started_at IS NULL, allowing
it to acquire jobs that were canceled while pending (which have
completed_at set but started_at still NULL).
Added completed_at IS NULL check to the query to prevent this.
Also fixed JobCompleteBuilder.Do() in dbfake to set started_at when
completing jobs to match production behavior.
Fixescoder/internal#1323
Previously there were two issues that could cause incorrect boundary
usage telemetry data.
1. Bad handling across snapshot intervals: After telemetry snapshot deleted
the DB row, the next flush would INSERT the stale cumulative data (which
included already-reported usage). This would then be overwritten by
subsequent UPDATE flushes, causing the delta between the last snapshot
and the reset to be lost (under-reporting usage). Additionally, if there
was no new usage after the reset, the tracker would carry over all usage
from the previous period into the next period (over-reporting usage).
2. Missed usage from a race condition: Track() calls between the first
mutex unlock and second mutex lock in FlushToDB() were lost. The data
wasn't included in the current flush (already snapshotted) and was wiped
by the subsequent reset. This is likely low impact to overall usage
numbers in the real world.
Fix by tracking unique workspace/user deltas separately from cumulative
values and always tracking delta allowed/denied requests. Deltas are used
for INSERT (fresh start after reset), cumulative for UPDATE (accurate unique
counts within a period). All counters reset atomically before the DB operation
so Track() calls during the operation are preserved for the next flush.
Archiving modules attempts to save as many modules as it can before it hits the limit. Enabling the template as much as it can, rather than a hard failure.
Closes#21044
This pull-request addresses an issue we were seeing where we would
attempt to filter the `<UserCombobox />` by the users username or email
not their username (which the rendered options would show).
To highlight this I created three different users. Each with a username
that did not contain their `email` or `name` and attempted to filter.
Attempting to search for `John` wouldn't actually show the user as his
username was `x`, and infact whereas a subset of users might be returned
from the backend for having `john` in the `email` it would've been
filtered by the frontend for not being in the `name` field.
| Name | Username |
| --- | --- |
| `Jake` | `z` |
| `Jeff` | `y` |
| `John` | `x` |
| Previously | Now |
| --- | --- |
| <img width="560" height="547" alt="OLD_USER_COMBOBOX"
src="https://github.com/user-attachments/assets/a0567264-0034-42ac-aba0-95b05c4f92dd"
/> | <img width="580" height="548" alt="NEW_USER_COMBOBOX"
src="https://github.com/user-attachments/assets/1aa0c942-d340-4b1c-8dde-b97879525bfb"
/> |
## Description
When configuring a From address with a display name (e.g., `Coder System
<system@coder.com>`), the SMTP `MAIL FROM` command was incorrectly
receiving the full address string instead of just the bare email
address, causing `501 Invalid MAIL argument` errors on some SMTP
servers.
## Changes
- Updated `validateFromAddr` to return both:
- `envelopeFrom`: bare email for SMTP `MAIL FROM` command (RFC 5321)
- `headerFrom`: original address with display name for email header (RFC
5322)
Fixes#20727
Update provisionerdserver to handle the changes introduced to
provisionerd in https://github.com/coder/coder/pull/21602
We now create a relationship between `workspace_agent_devcontainers` and
`workspace_agents` with the newly created `subagent_id`.
Previously the task logs endpoint only worked when the workspace was
running, leaving users unable to view task history after pausing.
This change adds snapshot retrieval with state-based branching: active
tasks fetch live logs from AgentAPI, paused/initializing/pending tasks
return stored snapshots (providing continuity during pause/resume), and
error/unknown states return HTTP 409 Conflict.
The response includes snapshot metadata (snapshot, snapshot_at) to
indicate whether logs are live or historical.
Closescoder/internal#1254
Operators need to know which API key was used in HTTP requests.
For example, if a key is leaking and a DDOS is underway using that key, operators need a way to identify the key in use and take steps to expire the key (see https://github.com/coder/coder/issues/21782).
_Disclaimer: created using Claude Opus 4.5_
## Description
Removes the following deprecated Prometheus metrics:
- `coderd_api_workspace_latest_build_total` → use
`coderd_api_workspace_latest_build` instead
- `coderd_oauth2_external_requests_rate_limit_total` → use
`coderd_oauth2_external_requests_rate_limit` instead
These metrics were deprecated in #12976 because gauge metrics should
avoid the `_total` suffix per [Prometheus naming
conventions](https://prometheus.io/docs/practices/naming/).
## Changes
- Removed deprecated metric `coderd_api_workspace_latest_build_total`
from `coderd/prometheusmetrics/prometheusmetrics.go`
- Removed deprecated metric
`coderd_oauth2_external_requests_rate_limit_total` from
`coderd/promoauth/oauth2.go`
- Updated tests to use the non-deprecated metric name
Fixes#12999
The test was creating two template versions without explicit names,
relying on `namesgenerator.NameDigitWith()` which can produce
collisions. When both versions got the same random name, the test failed
with a 409 Conflict error.
Fix by giving each version an explicit name (`v1`, `v2`).
Closes https://github.com/coder/internal/issues/1309
---
*Generated by [mux](https://mux.coder.com)*
Add PeriodStart and PeriodDurationMilliseconds fields to BoundaryUsageSummary
so consumers of telemetry data can understand usage within a particular time window.
Fixes: coder/internal#767
Adds two new Prometheus metrics for license health monitoring:
- `coderd_license_warnings` - count of active license warnings
- `coderd_license_errors` - count of active license errors
Metrics endpoint after startup of a deployment with license enabled:
```
...
# HELP coderd_license_errors The number of active license errors.
# TYPE coderd_license_errors gauge
coderd_license_errors 0
...
# HELP coderd_license_warnings The number of active license warnings.
# TYPE coderd_license_warnings gauge
coderd_license_warnings 0
...
```
This migration converts all tailnet coordination tables to UNLOGGED:
- `tailnet_coordinators`
- `tailnet_peers`
- `tailnet_tunnels`
UNLOGGED tables skip Write-Ahead Log (WAL) writes, significantly
improving performance for high-frequency updates like coordinator
heartbeats and peer state changes.
The trade-off is that UNLOGGED tables are truncated on crash recovery
and are not replicated to standby servers. This is acceptable for these
tables because the data is ephemeral:
1. Coordinators re-register on startup
2. Peers re-establish connections on reconnect
3. Tunnels are re-created based on current peer state
**Migration notes:**
- Child tables must be converted before the parent table because LOGGED
child tables cannot reference UNLOGGED parent tables (but the reverse is
allowed)
- The down migration reverses the order: parent first, then children
Fixes https://github.com/coder/coder/issues/21333
Implements telemetry for boundary usage tracking across all Coder
replicas and reports them via telemetry.
Changes:
- Implement Tracker with Track(), FlushToDB(), and StartFlushLoop() methods
- Add telemetry integration via collectBoundaryUsageSummary()
- Use telemetry lock to ensure only one replica collects per period
The tracker accumulates unique workspaces, unique users, and request
counts (allowed/denied) in memory, then flushes to the database
periodically. During telemetry collection, stats are aggregated across
all replicas and reset for the next period.
Only task workspaces have the checks in wsbuilder for violating the
managed agent caps in the license.
Stopped tasks that are resumed with a regular workspace start **still
count as usage**.
feat: add boundary usage telemetry database schema and RBAC
Adds the foundation for tracking boundary usage telemetry across Coder
replicas. This includes:
- Database schema: `boundary_usage_stats` table with per-replica stats
(unique workspaces, unique users, allowed/denied request counts)
- Database queries: upsert stats, get aggregated summary, reset stats,
delete by replica ID
- RBAC: `boundary_usage` resource type with read/update/delete actions,
accessible only via system `BoundaryUsageTracker` subject (not regular
user roles)
- Tracker skeleton + docs: stub implementation in `coderd/boundaryusage/`
The tracker accumulates stats in memory and periodically flushes to the
database. Stats are aggregated across replicas for telemetry reporting,
then reset when a new reporting period begins. The tracker implementation
and plumbing will be done in a subsequent commit/PR.
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>