Adds support for filtering workspaces by health status using
healthy:true or healthy:false in the search query.
This is done by changing `has-agent` to accept a list of statuses and
aliasing `health:true` to `has-agent:connected` and `healthy:false` to
`has-agent:timeout,disconnected`.
Fixes#21623
Update the agent protobuf schema (agent/proto/agent.proto) to include:
- subagent_id field in WorkspaceAgentDevcontainer message
- id field in CreateSubAgentRequest message
Bump the Agent API version from v2.7 to v2.8 and update all client
references throughout the codebase (ConnectRPC27 -> ConnectRPC28,
DRPCAgentClient27 -> DRPCAgentClient28).
* Adds support for parameter `format=text` in the following API routes:
* `/api/v2/workspaceagents/:id/logs`
* `/api/v2/workspacebuilds/:id/logs`
* `/api/v2/templateversions/:id/logs`
* `/api/v2/templateversions/:id/dry-run/:id/logs`
* Adds links to view raw logs on the following pages:
* Workspace build page
* Template editor page
* Template version page
* Refactors existing log formatting in `cli/logs.go` to live in `codersdk`.
🤖 Generated with Claude Opus 4.5, reviewed by me.
---------
Co-authored-by: Claude <noreply@anthropic.com>
The AcquireProvisionerJob query only checked started_at IS NULL, allowing
it to acquire jobs that were canceled while pending (which have
completed_at set but started_at still NULL).
Added completed_at IS NULL check to the query to prevent this.
Also fixed JobCompleteBuilder.Do() in dbfake to set started_at when
completing jobs to match production behavior.
Fixescoder/internal#1323
Previously there were two issues that could cause incorrect boundary
usage telemetry data.
1. Bad handling across snapshot intervals: After telemetry snapshot deleted
the DB row, the next flush would INSERT the stale cumulative data (which
included already-reported usage). This would then be overwritten by
subsequent UPDATE flushes, causing the delta between the last snapshot
and the reset to be lost (under-reporting usage). Additionally, if there
was no new usage after the reset, the tracker would carry over all usage
from the previous period into the next period (over-reporting usage).
2. Missed usage from a race condition: Track() calls between the first
mutex unlock and second mutex lock in FlushToDB() were lost. The data
wasn't included in the current flush (already snapshotted) and was wiped
by the subsequent reset. This is likely low impact to overall usage
numbers in the real world.
Fix by tracking unique workspace/user deltas separately from cumulative
values and always tracking delta allowed/denied requests. Deltas are used
for INSERT (fresh start after reset), cumulative for UPDATE (accurate unique
counts within a period). All counters reset atomically before the DB operation
so Track() calls during the operation are preserved for the next flush.
Archiving modules attempts to save as many modules as it can before it hits the limit. Enabling the template as much as it can, rather than a hard failure.
Closes#21044
This pull-request addresses an issue we were seeing where we would
attempt to filter the `<UserCombobox />` by the users username or email
not their username (which the rendered options would show).
To highlight this I created three different users. Each with a username
that did not contain their `email` or `name` and attempted to filter.
Attempting to search for `John` wouldn't actually show the user as his
username was `x`, and infact whereas a subset of users might be returned
from the backend for having `john` in the `email` it would've been
filtered by the frontend for not being in the `name` field.
| Name | Username |
| --- | --- |
| `Jake` | `z` |
| `Jeff` | `y` |
| `John` | `x` |
| Previously | Now |
| --- | --- |
| <img width="560" height="547" alt="OLD_USER_COMBOBOX"
src="https://github.com/user-attachments/assets/a0567264-0034-42ac-aba0-95b05c4f92dd"
/> | <img width="580" height="548" alt="NEW_USER_COMBOBOX"
src="https://github.com/user-attachments/assets/1aa0c942-d340-4b1c-8dde-b97879525bfb"
/> |
## Description
When configuring a From address with a display name (e.g., `Coder System
<system@coder.com>`), the SMTP `MAIL FROM` command was incorrectly
receiving the full address string instead of just the bare email
address, causing `501 Invalid MAIL argument` errors on some SMTP
servers.
## Changes
- Updated `validateFromAddr` to return both:
- `envelopeFrom`: bare email for SMTP `MAIL FROM` command (RFC 5321)
- `headerFrom`: original address with display name for email header (RFC
5322)
Fixes#20727
Update provisionerdserver to handle the changes introduced to
provisionerd in https://github.com/coder/coder/pull/21602
We now create a relationship between `workspace_agent_devcontainers` and
`workspace_agents` with the newly created `subagent_id`.
Previously the task logs endpoint only worked when the workspace was
running, leaving users unable to view task history after pausing.
This change adds snapshot retrieval with state-based branching: active
tasks fetch live logs from AgentAPI, paused/initializing/pending tasks
return stored snapshots (providing continuity during pause/resume), and
error/unknown states return HTTP 409 Conflict.
The response includes snapshot metadata (snapshot, snapshot_at) to
indicate whether logs are live or historical.
Closescoder/internal#1254
Operators need to know which API key was used in HTTP requests.
For example, if a key is leaking and a DDOS is underway using that key, operators need a way to identify the key in use and take steps to expire the key (see https://github.com/coder/coder/issues/21782).
_Disclaimer: created using Claude Opus 4.5_
## Description
Removes the following deprecated Prometheus metrics:
- `coderd_api_workspace_latest_build_total` → use
`coderd_api_workspace_latest_build` instead
- `coderd_oauth2_external_requests_rate_limit_total` → use
`coderd_oauth2_external_requests_rate_limit` instead
These metrics were deprecated in #12976 because gauge metrics should
avoid the `_total` suffix per [Prometheus naming
conventions](https://prometheus.io/docs/practices/naming/).
## Changes
- Removed deprecated metric `coderd_api_workspace_latest_build_total`
from `coderd/prometheusmetrics/prometheusmetrics.go`
- Removed deprecated metric
`coderd_oauth2_external_requests_rate_limit_total` from
`coderd/promoauth/oauth2.go`
- Updated tests to use the non-deprecated metric name
Fixes#12999
The test was creating two template versions without explicit names,
relying on `namesgenerator.NameDigitWith()` which can produce
collisions. When both versions got the same random name, the test failed
with a 409 Conflict error.
Fix by giving each version an explicit name (`v1`, `v2`).
Closes https://github.com/coder/internal/issues/1309
---
*Generated by [mux](https://mux.coder.com)*
Add PeriodStart and PeriodDurationMilliseconds fields to BoundaryUsageSummary
so consumers of telemetry data can understand usage within a particular time window.
Fixes: coder/internal#767
Adds two new Prometheus metrics for license health monitoring:
- `coderd_license_warnings` - count of active license warnings
- `coderd_license_errors` - count of active license errors
Metrics endpoint after startup of a deployment with license enabled:
```
...
# HELP coderd_license_errors The number of active license errors.
# TYPE coderd_license_errors gauge
coderd_license_errors 0
...
# HELP coderd_license_warnings The number of active license warnings.
# TYPE coderd_license_warnings gauge
coderd_license_warnings 0
...
```
This migration converts all tailnet coordination tables to UNLOGGED:
- `tailnet_coordinators`
- `tailnet_peers`
- `tailnet_tunnels`
UNLOGGED tables skip Write-Ahead Log (WAL) writes, significantly
improving performance for high-frequency updates like coordinator
heartbeats and peer state changes.
The trade-off is that UNLOGGED tables are truncated on crash recovery
and are not replicated to standby servers. This is acceptable for these
tables because the data is ephemeral:
1. Coordinators re-register on startup
2. Peers re-establish connections on reconnect
3. Tunnels are re-created based on current peer state
**Migration notes:**
- Child tables must be converted before the parent table because LOGGED
child tables cannot reference UNLOGGED parent tables (but the reverse is
allowed)
- The down migration reverses the order: parent first, then children
Fixes https://github.com/coder/coder/issues/21333
Implements telemetry for boundary usage tracking across all Coder
replicas and reports them via telemetry.
Changes:
- Implement Tracker with Track(), FlushToDB(), and StartFlushLoop() methods
- Add telemetry integration via collectBoundaryUsageSummary()
- Use telemetry lock to ensure only one replica collects per period
The tracker accumulates unique workspaces, unique users, and request
counts (allowed/denied) in memory, then flushes to the database
periodically. During telemetry collection, stats are aggregated across
all replicas and reset for the next period.
Only task workspaces have the checks in wsbuilder for violating the
managed agent caps in the license.
Stopped tasks that are resumed with a regular workspace start **still
count as usage**.
feat: add boundary usage telemetry database schema and RBAC
Adds the foundation for tracking boundary usage telemetry across Coder
replicas. This includes:
- Database schema: `boundary_usage_stats` table with per-replica stats
(unique workspaces, unique users, allowed/denied request counts)
- Database queries: upsert stats, get aggregated summary, reset stats,
delete by replica ID
- RBAC: `boundary_usage` resource type with read/update/delete actions,
accessible only via system `BoundaryUsageTracker` subject (not regular
user roles)
- Tracker skeleton + docs: stub implementation in `coderd/boundaryusage/`
The tracker accumulates stats in memory and periodically flushes to the
database. Stats are aggregated across replicas for telemetry reporting,
then reset when a new reporting period begins. The tracker implementation
and plumbing will be done in a subsequent commit/PR.
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Use transition-specific actions when authorizing workspace build
parameter inserts in the database layer so start/stop/delete do not
require workspace.update.
Related to: https://github.com/coder/internal/issues/1299
Relates to https://github.com/coder/internal/issues/1282
Updates tracking of managed agents to be predicated instead on the
presence of a related `task_id` instead of the presence of a
`coder_ai_task` resource.
This change adds a POST /workspaceagents/me/tasks/{task}/log-snapshot
endpoint for agents to upload task conversation history during
workspace shutdown. This allows users to view task logs even when the
workspace is stopped.
The endpoint accepts agentapi format payloads (typically last 10
messages, max 64KB), wraps them in a format envelope, and upserts to the
task_snapshots table. Uses agent token auth and validates the task
belongs to the agent's workspace.
Closescoder/internal#1253
Relates to https://github.com/coder/coder/pull/21676
* Replaces all existing usages of `httpapi.Heartbeat` with `httpapi.HeartbeatClose`
* Removes `httpapi.HeartbeatClose`
Removes the legacy tailnet v1 API tables (`tailnet_clients`, `tailnet_agents`, `tailnet_client_subscriptions`) and their associated queries, triggers, and functions. These were superseded by the v2 tables (`tailnet_peers`, `tailnet_tunnels`) in migration 000168, and the v1 API code was removed in commit d6154c4310, but the database artifacts were never cleaned up.
**Changes:**
- New migration `000410_remove_tailnet_v1_tables` to drop the unused tables
- Removed 11 unused queries from `tailnet.sql`
- Removed associated manual wrapper methods in `dbauthz` and `dbmetrics`
- ~930 lines deleted across 11 files
Relates to https://github.com/coder/coder/issues/19715
This is similar to https://github.com/coder/coder/pull/19711
This endpoint works by doing the following:
- Subscribing to the database's with pubsub
- Accepts a WebSocket upgrade
- Starts a `httpapi.Heartbeat`
- Creates a json encoder
- **Infinitely loops waiting for notification until request context
cancelled**
The critical issue here is that `httpapi.Heartbeat` silently fails when
the client has disconnected. This means we never cancel the request
context, leaving the WebSocket alive until we receive a notification
from the database and fail to write that down the pipe.
By replacing usage of `httpapi.Heartbeat` with `httpapi.HeartbeatClose`,
we cancel the context _when the heartbeat fails to write_ due to the
client disconnecting. This allows us to cleanup without waiting for a
notification to come through the pubsub channel.
Relates to
https://github.com/coder/aibridge/pull/143/changes#r2720659638
We previously had been returning the following when attempting to delete
failed due to lack of permissions.
```
500 Internal error deleting template: unauthorized: rbac: forbidden
```
This PR updates the handler to return our usual 403 forbidden response.
Adds template_version_id to re-emitted boundary audit logs to allow
filtering and analysis by specific template versions iin addition to the
existing template_id field. Since boundary policies are defined in the
template, the template version is critical to figuring out which policy
was responsible for boundaries decision in a workspace.
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
The removal of that permission from the role broke valid use cases (e.g.
a site owner user creating a workspace owned by a system account and
then trying to share it with another user).
The bulk of the PR is made up of the rollbacks of the previously
introduced test updates necessitated by the removal.
Related to: https://github.com/coder/internal/issues/1285
Boundary policies are currently defined at the template level, so
including the template ID in re-emitted logs by the control plane
allows policy creators to filter and observe boundary activity for
specific templates. This makes it easier to verify that policies are
working as expected and to debug issues with specific template
configurations.
AI agents report status via patchWorkspaceAgentAppStatus, but this wasn't
extending workspace deadlines. This prevented proper task auto-pause behavior,
causing tasks to pause mid-execution when there were no human connections.
Now we call ActivityBumpWorkspace when agents report status, using the same
logic as SSH/IDE connections. We bump when transitioning to or from the working
state.
Closescoder/internal#1251
## Description
Introduces a new `X-Coder-Token` header for authenticating requests from
AI Proxy to AI Bridge. Previously, the proxy overwrote the
`Authorization` header with the Coder token, which prevented the
original authentication headers from flowing through to upstream
providers.
With this change, AI Proxy sets the Coder token in a separate header,
preserving the original `Authorization` and `X-Api-Key` headers. AI
Bridge uses this header for authentication and removes it before
forwarding requests to upstream providers. For requests that don't come
through AI Proxy, AI Bridge continues to use `Authorization` and
`X-Api-Key` for authentication.
## Changes
* Add `HeaderCoderAuth` constant and update `ExtractAuthToken` to check
headers in the following order: `X-Coder-Token` > `Authorization` >
`X-Api-Key`
* Update AI Proxy to set `X-Coder-Token` instead of overwriting
`Authorization`
* Remove `X-Coder-Token` in AI Bridge before forwarding to upstream
providers
* Add tests for header handling and token extraction priority
Related to: https://github.com/coder/internal/issues/1235
Agents were losing authentication during workspace shutdown, causing
shutdown scripts to fail. The auth query required agents to belong to
the latest build, but during shutdown a `stop` build becomes latest while
the `start` build's agents are still running.
Modified the auth query to allow `start` build agents to authenticate
temporarily during `stop` execution. The query allows auth when:
- Agent's `start` build job succeeded
- Latest build is `stop` with `pending`/`running` job status
- Builds are adjacent (`stop` is `build_number + 1`)
- Template versions match
Auth closes once `stop` completes.
Renamed `GetWorkspaceAgentAndLatestBuildByAuthToken` to
`GetAuthenticatedWorkspaceAgentAndBuildByAuthToken` since it returns the
agent's build (not always latest) during shutdown.
Closes coder/internal#1249
Fixes#19467
## Description
This PR addresses database connection pool exhaustion during prebuilds
reconciliation by introducing two changes:
* `CanSkipReconciliation`: Filters out presets that don't need
reconciliation before spawning goroutines. This ensures we only create
goroutines for presets that will (_most likely_) perform database
operations, avoiding unnecessary connection pool usage.
* Dynamic `eg.SetLimit`: Limits concurrent goroutines based on the
configured database connection pool size (`CODER_PG_CONN_MAX_OPEN / 2`).
This replaces the previous hardcoded limit of 5, ensuring the
reconciliation loop scales appropriately with the configured pool size
while leaving capacity for other database operations.
## Changes
* Add `CanSkipReconciliation()` method to `PresetSnapshot` that returns
true for inactive presets with no running workspaces, no pending jobs,
or expired prebuilds.
* Add `maxDBConnections` parameter to `NewStoreReconciler` and compute
`reconciliationConcurrency` as half the pool size (minimum 1).
* Add `ReconciliationConcurrency()` getter method to `StoreReconciler`.
* Add `eg.SetLimit(c.reconciliationConcurrency)` to bound concurrent
reconciliation goroutines.
* Add `PresetsTotal` and `PresetsReconciled` to `ReconcileStats` for
observability.
* Add `TestCanSkipReconciliation` unit tests.
* Add `TestReconciliationConcurrency` unit tests.
* Add benchmark tests for reconciliation performance.
## Benchmarks
* `BenchmarkReconcileAll_NoOps`: Tests presets with no reconciliation
actions. All presets are filtered by `CanSkipReconciliation`, resulting
in no goroutines spawned and no database connections used.
* `BenchmarkReconcileAll_ConnectionContention`: Tests presets where all
require reconciliation actions. All presets spawn goroutines, but
concurrency is limited by `eg.SetLimit(reconciliationConcurrency)`.
* `BenchmarkReconcileAll_Mix`: Simulates a realistic scenario with a
large subset of inactive presets (filtered by `CanSkipReconciliation`)
and a smaller subset requiring reconciliation (limited by
`eg.SetLimit`).
Closes: https://github.com/coder/coder/issues/20606