Agents were losing authentication during workspace shutdown, causing
shutdown scripts to fail. The auth query required agents to belong to
the latest build, but during shutdown a `stop` build becomes latest while
the `start` build's agents are still running.
Modified the auth query to allow `start` build agents to authenticate
temporarily during `stop` execution. The query allows auth when:
- Agent's `start` build job succeeded
- Latest build is `stop` with `pending`/`running` job status
- Builds are adjacent (`stop` is `build_number + 1`)
- Template versions match
Auth closes once `stop` completes.
Renamed `GetWorkspaceAgentAndLatestBuildByAuthToken` to
`GetAuthenticatedWorkspaceAgentAndBuildByAuthToken` since it returns the
agent's build (not always latest) during shutdown.
Closes coder/internal#1249
Fixes#19467
Relates to https://github.com/coder/internal/issues/1214
The `ExtractWorkspaceAgentParam` middleware ends up making 4 database
queries to follow the chain of `WorkspaceAgent` -> `WorkspaceResource`
-> `ProvisionerJob` -> `WorkspaceBuild` -- but then dropping all that
hard work on the floor. The `api.workspaceAgent` handler that references
this middleware then has to do all of that work again, plus one more
query to get the related `User` so we can get the username. This pattern
is also mirrored in `getDatabaseTerminal` but without the middleware.
This PR:
* Adds a new query `GetWorkspaceAgentAndWorkspaceByID` to fetch all
this information at once to avoid the multiple round-trips,
* Updates the existing usage of `GetWorkspaceAgentByID` to this new
query instead,
* Updates `ExtractWorkspaceAgentParam` to also store the workspace in
the request context
Dalibo: [0.63ms](https://explain.dalibo.com/plan/40bb597f3539gc6c)
Adds a per-organization setting to disable workspace sharing. When enabled,
all existing workspace ACLs in the organization are cleared and the workspace
ACL mutation API endpoints return `403 Forbidden`.
This complements the existing site-wide `--disable-workspace-sharing` flag by
providing more granular control at the organization level.
Closes https://github.com/coder/internal/issues/1073 (part 2)
---------
Co-authored-by: Steven Masley <Emyrk@users.noreply.github.com>
fixes#21303
Update user last_seen_at when we mark them active on login. This prevents a narrow race where they can be re-marked dormant and fail to log in.
closes: https://github.com/coder/internal/issues/858
Similar to https://github.com/coder/coder/pull/19375, this one uses
system permissions for fetching actual user and group data.
Modifies the `workspaces_expanded` view to fetch the required data; this way it's made available to all code paths that make use of it.
Also fixes a bug in a test helper function that can result in `null` being saved to the DB for `user_acl` or `group_acl` and break tests; a defensive check constraint that prevents this is worth a PR, e.g:
`ALTER TABLE workspaces
ADD CONSTRAINT group_acl_is_object CHECK (jsonb_typeof(group_acl) = 'object');`
Also adds missing `OwnerName` in `ConvertWorkspaceRows`.
Previously the GetTemplateVersionVariables query did not sort output,
relying on PostgreSQL on-disk ordering which is undeterministic.
Variables are now sorted by name because there is no alternative for
ordering.
Tests were adjusted to accommodate the new ordering, previously they
relied on data being written to disk in insert order.
In this PR we're optimizing the `GetTemplateAppInsightsByTemplate` query
by pre-filtering out apps which do not have an active session during the
start/end time window.
---------
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Tracking issue here: https://github.com/coder/internal/issues/1009
To summarize, the current version of this query selects from
`workspace_agent_stats` twice. The expensive portion of this query is
the bitmap heap scan we have to do for each of these selects. We can
easily cut the cost of this query by 40-50% by cutting this down to a
single select, and using those rows for both sets of calculations.
Eliminating the heap scan itself would require a follow up PR to
introduce a new index. Blink helped with the rewrite of the query.
The current plan looks like this:
```
Nested Loop (cost=6101.64..6101.69 rows=1 width=64) (actual time=11.782..11.787 rows=1 loops=1)
-> Aggregate (cost=2996.17..2996.19 rows=1 width=32) (actual time=3.356..3.357 rows=1 loops=1)
-> Bitmap Heap Scan on workspace_agent_stats (cost=54.80..2992.86 rows=440 width=24) (actu
al time=0.346..2.927 rows=818 loops=1)
Recheck Cond: (created_at > (now() - '00:15:00'::interval))
Filter: (connection_median_latency_ms > '0'::double precision)
Rows Removed by Filter: 1070
Heap Blocks: exact=486
-> Bitmap Index Scan on idx_agent_stats_created_at (cost=0.00..54.69 rows=1368 width
=0) (actual time=0.241..0.241 rows=1888 loops=1)
Index Cond: (created_at > (now() - '00:15:00'::interval))
-> Aggregate (cost=3105.47..3105.49 rows=1 width=32) (actual time=8.418..8.420 rows=1 loops=1)
-> Subquery Scan on a (cost=3060.95..3105.39 rows=7 width=32) (actual time=7.851..8.394 ro
ws=63 loops=1)
Filter: (a.rn = 1)
-> WindowAgg (cost=3060.95..3088.29 rows=1368 width=209) (actual time=7.850..8.382 r
ows=63 loops=1)
Run Condition: (row_number() OVER (?) <= 1)
-> Sort (cost=3060.93..3064.35 rows=1368 width=56) (actual time=7.836..8.036 r
ows=1888 loops=1)
Sort Key: workspace_agent_stats_1.agent_id, workspace_agent_stats_1.create
d_at DESC
Sort Method: quicksort Memory: 181kB
-> Bitmap Heap Scan on workspace_agent_stats workspace_agent_stats_1 (co
st=55.03..2989.67 rows=1368 width=56) (actual time=0.388..2.096 rows=1888 loops=1)
Recheck Cond: (created_at > (now() - '00:15:00'::interval))
Heap Blocks: exact=486
-> Bitmap Index Scan on idx_agent_stats_created_at (cost=0.00..54.
69 rows=1368 width=0) (actual time=0.295..0.295 rows=1888 loops=1)
Index Cond: (created_at > (now() - '00:15:00'::interval))
Planning Time: 2.350 ms
Execution Time: 13.152 ms
(24 rows)
```
The new plan looks like this
```
Aggregate (cost=2966.96..2966.98 rows=1 width=64) (actual time=3.812..3.814 rows=1 loops=1)
-> WindowAgg (cost=2891.96..2916.94 rows=1250 width=88) (actual time=2.696..3.412 rows=1890 loop
s=1)
-> Sort (cost=2891.94..2895.06 rows=1250 width=80) (actual time=2.686..2.780 rows=1890 loo
ps=1)
Sort Key: workspace_agent_stats.agent_id, workspace_agent_stats.created_at DESC
Sort Method: quicksort Memory: 226kB
-> Bitmap Heap Scan on workspace_agent_stats (cost=50.11..2827.64 rows=1250 width=80
) (actual time=0.218..1.551 rows=1890 loops=1)
Recheck Cond: (created_at > (now() - '00:15:00'::interval))
Heap Blocks: exact=474
-> Bitmap Index Scan on idx_agent_stats_created_at (cost=0.00..49.80 rows=1250
width=0) (actual time=0.146..0.147 rows=1890 loops=1)
Index Cond: (created_at > (now() - '00:15:00'::interval))
Planning Time: 0.534 ms
Execution Time: 3.969 ms
(12 rows)
```
If we compare the results of the query they're similar enough that any
differences can be attributed to slightly different timestamps for
`now()` in the version of the query I am using to generate results for
comparison:
```
workspace_rx_bytes | workspace_tx_bytes | workspace_connection_latency_50 | workspace_connection_latency_95 | session_count_vscode | session_count_ssh | session_count_jetbrains | session_count_reconnecting_pty
--------------------+--------------------+---------------------------------+---------------------------------+----------------------+-------------------+-------------------------+--------------------------------
15263563 | 74555854 | 47.933 | 250.5522 | 239 | 59 | 3 | 3
(1 row)
workspace_rx_bytes | workspace_tx_bytes | workspace_connection_latency_50 | workspace_connection_latency_95 | session_count_vscode | session_count_ssh | session_count_jetbrains | session_count_reconnecting_pty
--------------------+--------------------+---------------------------------+---------------------------------+----------------------+-------------------+-------------------------+--------------------------------
15295819 | 74598410 | 47.933 | 250.5522 | 239 | 59 | 3 | 3
```
---------
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Previously setting AI Bridge retention to 0 would cause records to be
deleted immediately since we didn't check for the zero value before
calculating the deletion threshold.
This adds a check for aibridgeRetention > 0 to skip deletion when
retention is disabled, matching the pattern used for other retention
settings (connection logs, audit logs, etc.).
Also fixes the return type of DeleteOldAIBridgeRecords from int32 to
int64 since COUNT(*) returns bigint in PostgreSQL.
Refs #21055
Replace hardcoded 7-day retention for workspace agent logs with
configurable retention from deployment settings. Defaults to 7d to
preserve existing behavior.
Depends on #21038
Updates #20743
Replace hardcoded 7-day retention for expired API keys with configurable
retention from deployment settings. Skips deletion entirely when effective
retention is 0.
Depends on #21021
Updates #20743
Add configurable retention policy for audit logs. The DeleteOldAuditLogs
query excludes deprecated connection events (connect, disconnect, open,
close) which are handled separately by DeleteOldAuditLogConnectionEvents.
Disabled (0) by default.
Depends on #21021
Updates #20743
Add `DeleteOldConnectionLogs` query and integrate it into the `dbpurge`
routine. Retention is controlled by `--retention-connection-logs` flag.
Disabled (0) by default.
Depends on #21021
Updates #20743
## Problem
Users may not realize that task notifications are disabled by default.
To improve awareness, we show a warning alert on the Tasks page when all
task notifications are disabled.
**Alert visibility logic:**
- Shows when **all** task notification templates (Task Working, Task
Idle, Task Completed, Task Failed) are disabled
- Can be dismissed by the user, which stores the dismissal in the user
preferences API
- If the user later enables any task notification in Account Settings,
the dismissal state is cleared so the alert will show again if they
disable all notifications in the future
<img width="2980" height="1588" alt="Screenshot 2025-11-25 at 17 48 17"
src="https://github.com/user-attachments/assets/316bf097-d9d2-4489-bc16-2987ba45f45c"
/>
## Changes
- Added a warning alert to the Tasks page when all task notifications
are disabled
- Introduced new `/users/{user}/preferences` endpoint to manage user
preferences (stored in `user_configs` table)
- Alert is dismissible and stores the dismissal state via the new user
preferences API endpoint
- Enabling any task notification in Account Settings clears the
dismissal state via the preferences API
- Added comprehensive Storybook stories for both TasksPage and
NotificationsPage to test all alert visibility states and interactions
Closes: https://github.com/coder/internal/issues/1089
Closes https://github.com/coder/coder/issues/20913
I've ran the test without the fix, verified the test caught the issue,
then applied the fix, and confirmed the issue no longer happens.
---
🤖 PR was initially written by Claude Opus 4.5 Thinking using Claude Code
and then review by a human 👩
## Description
This PR fixes an issue where `GetLatestWorkspaceAppStatusesByAppID`
returned an unbounded number of rows for a given app ID, which could
cause performance issues for noisy or long-running AI tasks.
## Impact
This change reduces database query overhead for workspace app status
updates, particularly for busy AI tasks that update their status
frequently. Previously, fetching the latest status would return all
historical statuses, now it returns only the most recent one.
Fixes#20862
---
🤖 This change was written by Claude Sonnet 4.5 Thinking using [mux](https://github.com/coder/mux) and reviewed by a human 🏄🏻♂️
## Problem
Tasks currently only expose a machine-friendly name field (e.g.
`task-python-debug-a1b2`), but this value is primarily an identifier
rather than a clean, descriptive label. We need a separate
display-friendly name for use in the UI.
This PR introduces a new `display_name` field and updates the task-name
generation flow. The Claude system prompt was updated to return valid
JSON with both `name` and `display_name`. The name generation logic
follows a fallback chain (Anthropic > prompt sanitization > random
fallback). To make task names more closely resemble their display names,
the legacy `task-` prefix has been removed. For context, PR
https://github.com/coder/coder/pull/20834 introduced a small Task icon
to the workspace list to help identify workspaces associated to tasks.
## Changes
- Database migration: Added `display_name` column to tasks table
- Updated system prompt to generate both task name and display name as
valid JSON
- Task name generation now follows a fallback chain: Anthropic > prompt
sanitization > random fallback
- Removed `task-` prefix from task names to allow more descriptive names
- Note: PR https://github.com/coder/coder/pull/20834 adds a Task icon to
workspaces in the workspace list to distinguish task-created workspaces
**Note:** UI changes will be addressed in a follow-up PR
Related to: https://github.com/coder/coder/issues/20801
This PR adds the backend implementation for modifying task prompts. Part
of https://github.com/coder/internal/issues/1084
## Changes
- New `UpdateTaskPrompt` database query to update task prompts
- New PATCH `/api/v2/tasks/{task}/prompt` endpoint
## Notes
This is part 1 of a 2-part PR stack. The frontend UI will be added in a
follow-up PR based on this branch
(https://github.com/coder/coder/pull/20812).
---
🤖 PR was written by Claude Sonnet 4.5 Thinking using [Coder
Mux](https://github.com/coder/cmux) and reviewed by a human 👩
Addresses [`aibridge#54`](https://github.com/coder/aibridge/issues/54)
When querying against the values in the database for
`/api/experimental/aibridge/interceptions` we found strange behaviour
wherein there was interceptions that lacked prompting and other various
fields we want. Generally this was as a result of the data not actually
existing for these values (as they were inflight).
The simple solution to this was to hide them if they didn't exist. This
PR addresses that.
---------
Co-authored-by: Danny Kopping <danny@coder.com>
This change restructures the `tasks_with_status` view query to:
- Improve debuggability by adding a `status_debug` column to better
understand the outcome
- Reduce clutter from `bool_or`, `bool_and` which are aggregate
functions that did not actually have serve a purpose (each join is 0-1
rows)
- Improve agent lifecycle state coverage, `start_timeout` and
`start_error` were omitted
- These states are easy to trigger even in a perfectly functioning
workspace/task so we now rely on app health to report whether or not
there was an issue
- Mark canceling and canceled workspace build jobs as error state
- Agent stop states were implicitly `unknown`, now there are explicit (I
initially considered `error`, could go either way)
For experimental and dogfood purposes, this adds the ability to opt in a single template.
Leaving the rest of the templates as is.
For GA, this setting might be removed or changed.
* Adds a `GetTaskByOwnerIDAndName` query
* Updates `httpmw.TaskParam` to fall back to task name if no task by
UUID found.
* Updates the `TaskByIdentifier` used in `cli/` to use direct lookup instead of searching.
## Description
The membership reconciliation ensures the prebuilds system user is a
member of all organizations with prebuilds configured. To support
prebuilds quota management, each organization must have a prebuilds
group that the system user belongs to.
## Problem
Previously, membership reconciliation iterated over all presets to check
and update membership status. This meant database queries
`GetGroupByOrgAndName` and `InsertGroupMember` were executed for each
preset. Since presets are unique combinations of `(organization,
template, template version, preset)`, this resulted in several redundant
checks for the same organization.
In dogfood, `InsertGroupMember` was called thousands of times per day,
even though memberships were already configured ([internal Grafana
dashboard link](https://grafana.dev.coder.com/goto/46MZ1UgDg?orgId=1))
<img width="5382" height="1788" alt="Screenshot 2025-10-28 at 16 01 36"
src="https://github.com/user-attachments/assets/757b7253-106f-4f72-8586-8e2ede9f18db"
/>
## Solution
This PR introduces `GetOrganizationsWithPrebuildStatus`, a single query
that returns:
* All unique organizations with prebuilds configured
* Whether the prebuilds user is a member of each organization
* Whether the prebuilds group exists in each organization
* Whether the prebuilds user is in the prebuilds group
The membership reconciliation logic now:
* Fetches status for all organizations in one query
* Only performs inserts for organizations missing required memberships
or groups
* Safely handles concurrent operations via unique constraint violations
* This reduces database load from `O(presets)` to `O(organizations)` per
reconciliation loop, with a single read query when everything is
configured.
## Changes
* Add `GetOrganizationsWithPrebuildStatus` SQL query
* Update `membership.ReconcileAll` to use organization-based
reconciliation instead of preset-based
* Update tests to reflect new behavior
Related to internal thread:
https://codercom.slack.com/archives/C07GRNNRW03/p1760535570381369
## Description
PR https://github.com/coder/coder/pull/20387 introduced canceling
pending prebuild jobs from inactive template versions to avoid
provisioning obsolete workspaces. However, the associated prebuilds
remained in the database with "Canceled" status, visible in the UI.
This PR now orphan-deletes these canceled prebuilt workspaces. Since the
canceled jobs were never processed by a provisioner, no Terraform
resources were created, making orphan deletion safe.
Orphan deletion always creates a provisioner job, but behaves
differently based on provisioner availability:
- If no provisioner daemon is available, the job is immediately marked
as completed and the workspace is marked as deleted without any
provisioner processing
- If a provisioner daemon is available, it processes the delete job with
empty Terraform state (no actual resources to destroy)
The job cancellation and workspace deletion occur atomically in the same
transaction. We don't split this into two separate reconciliation runs
because there's no way to distinguish between system-canceled prebuilds
and user-canceled workspaces. If we deleted canceled workspaces in a
later run, we'd delete user-canceled workspaces that users may want to
keep for troubleshooting.
Note: This only applies to system-generated prebuilds from inactive
template versions.
## Changes
* Update `UpdatePrebuildProvisionerJobWithCancel` query to return job
ID, workspace ID, template ID, and template version preset ID
* Add `DeprovisionMode` enum to support orphan deletion in the provision
flow
* Update `ActionTypeCancelPending` handler to cancel jobs and
orphan-delete associated workspaces atomically
Relates to
https://github.com/coder/coder/pull/20431/files#diff-9cfc826a6ce7e77d977b2025482474dd263d12965b2a94479a74c7f1d872b782
If the workspace relating to a task was deleted, most of the
workspace-related fields in `taskFromDBTaskAndWorkspace` will be
zero-valued. However, we can still get information relating to the owner
so that "created by" shows up correctly in the UI.
Updates the `tasks_with_status` view with a join on `visible_users` to
get owner-related info.
- Adds a new table to keep track of which payloads have already been
reported since we only report for the last clock hour
- Adds a query to gather and aggregate all the data by
provider/model/client
Relates to https://github.com/coder/coder-telemetry-server/issues/27
## Description
This PR introduces an optimization to automatically cancel pending
prebuild-related jobs from non-active template versions in the
reconciliation loop.
## Problem
Currently, when a template is configured with more prebuild instances
than available provisioners, the provisioner queue can become flooded
with pending prebuild jobs. This issue is worsened when
provisioning/deprovisioning operations take a long time.
When the prebuild reconciliation loop generates jobs faster than
provisioners can process them, pending jobs accumulate in the queue.
Since prebuilt workspaces should always run the latest active template
version, pending prebuild jobs from non-active versions become obsolete
once a new version is promoted.
## Solution
The reconciliation loop cancels pending prebuild-related jobs from
non-active template versions that match the following criteria:
* Build number: 1 (initial build created by the reconciliation loop)
* Job status: `pending`
* Not yet picked up by a provisioner (`worker_id` is `NULL`)
* Owned by the prebuilds system user
* Workspace transition: `start`
This prevents the queue from being cluttered with stale prebuild jobs
that would provision workspaces on an outdated template version that
would consequently need to be deprovisioned.
## Changes
* Added new SQL query `CountPendingNonActivePrebuilds` to identify
presets with pending jobs from non-active versions
* Added new SQL query `UpdatePrebuildProvisionerJobWithCancel` to cancel
jobs for a specific preset
* New reconciliation action type `ActionTypeCancelPending` handles the
cancellation logic
* Cancellation is non-blocking: failures to cancel prebuild jobs are
logged as errors and don't prevent other reconciliation actions
## Follow-up PR
Canceling pending prebuild jobs leaves workspaces in a Canceled state.
While no Terraform resources need to be destroyed (since jobs were
canceled before provisioning started), these database records should
still be cleaned up. This will be addressed in a follow-up PR.
Closes: https://github.com/coder/coder/issues/20242
This PR uses the same sha256 hashing technique as we use for APIKeys. So
now all randomly generated secrets will be hashed with sha256 for
consistency.
This is a breaking change for the oauth tokens. Since oauth is only
allowed for dev builds and experimental, this is ok.
- Adds FK from `aibridge_interceptions.initiator_id` to `users.id`
- This is enforced by deleting any rows that don't have any users. Since
this is an experimental feature AND coder never deletes user rows I
think this is acceptable.
- Adds `name` as a property on `codersdk.MinimalUser`
- This matches the `visible_users` view in the database. I'm unsure why
`name` wasn't already included given that `username` is.
- Adds a new `initiator` field to `codersdk.AIBridgeInterception` which
contains `codersdk.MinimalUser` (ID, username, name, avatar URL)
- Removes `initiator_id` from `codersdk.AIBridgeInterception`
- Should be fine since we're still in early access
Necessary for the frontend to be able to paginate easily. Cursor
pagination is good for fetching all events, but doesn't play very well
when a pagination component gets involved.
Adds support for `?offset=x` to the existing endpoint. The cursor-based
pagination (`?after_id=x`) is still supported. The two pagination modes
are mutually exclusive, and are documented as such. If both are
supplied, the request will be rejected.
Also adds a `total` property to the response that contains the full
count of items matching the filter. We already have indices in place so
I don't think this will impact performance (or we can revisit it before
GA).
aid in differentiation between sources of calls to `GetWorkspaces` but introducing new queries for metrics specific use cases
---------
Signed-off-by: Callum Styan <callumstyan@gmail.com>