Backport of https://github.com/coder/coder/pull/22904 to `release/2.29`.
Adds an optimistic lock to `UpdateExternalAuthLinkRefreshToken` so that
a concurrent caller that lost a token-refresh race cannot overwrite a
valid token stored by the winner. The SQL `WHERE` clause now includes
`AND oauth_refresh_token = @old_oauth_refresh_token`.
Original PR: #22904
Merge commit: 53e52aef78
Cherry-pick applied cleanly with no conflicts.
> Generated by Coder Agents
Co-authored-by: Kyle Carberry <kyle@coder.com>
Backport of #23835.
Audit and connection log pages were timing out due to expensive COUNT(*)
queries over large tables. This commit adds opt-in count capping:
requests can return a `count_cap` field signaling that the count was
truncated at a threshold, avoiding full table scans that caused page
timeouts.
Text-cast UUID comparisons in regosql-generated authorization queries
also contributed to the slowdown by preventing index usage for
connection and audit log queries. These now emit native UUID operators.
Frontend changes handle the capped state in usePaginatedQuery and
PaginationWidget, optionally displaying a capped count in the pagination
UI (e.g. "Showing 2,076 to 2,100 of 2,000+ logs")
---
Cherry picked from 86ca61d6ca
Make keyring usage for session token storage on by default for supported
platforms (Windows and macOS), with the ability to opt-out via
--use-keyring=false.
This change will be a breaking change for any users depending on the
session token being stored on disk, though users can restore file usage
via the flag above.
This change will also require CLI users to authenticate after updating.
This is to debug context timeouts on API requests to the agent.
Because rbac and database cannot be imported in slim, split the logger
middleware into slim and non-slim versions and break out the recovery
middleware.
## Problem
Tasks currently only expose a machine-friendly name field (e.g.
`task-python-debug-a1b2`), but this value is primarily an identifier
rather than a clean, descriptive label. We need a separate
display-friendly name for use in the UI.
This PR introduces a new `display_name` field and updates the task-name
generation flow. The Claude system prompt was updated to return valid
JSON with both `name` and `display_name`. The name generation logic
follows a fallback chain (Anthropic > prompt sanitization > random
fallback). To make task names more closely resemble their display names,
the legacy `task-` prefix has been removed. For context, PR
https://github.com/coder/coder/pull/20834 introduced a small Task icon
to the workspace list to help identify workspaces associated to tasks.
## Changes
- Database migration: Added `display_name` column to tasks table
- Updated system prompt to generate both task name and display name as
valid JSON
- Task name generation now follows a fallback chain: Anthropic > prompt
sanitization > random fallback
- Removed `task-` prefix from task names to allow more descriptive names
- Note: PR https://github.com/coder/coder/pull/20834 adds a Task icon to
workspaces in the workspace list to distinguish task-created workspaces
**Note:** UI changes will be addressed in a follow-up PR
Related to: https://github.com/coder/coder/issues/20801
Closes https://github.com/coder/coder/issues/20711
We now allow agents to be created on dormant workspaces.
I've ran the test with and without the change. I've confirmed that -
without the fix - it triggers the "rbac: unauthorized" error.
Addresses [`aibridge#54`](https://github.com/coder/aibridge/issues/54)
When querying against the values in the database for
`/api/experimental/aibridge/interceptions` we found strange behaviour
wherein there was interceptions that lacked prompting and other various
fields we want. Generally this was as a result of the data not actually
existing for these values (as they were inflight).
The simple solution to this was to hide them if they didn't exist. This
PR addresses that.
---------
Co-authored-by: Danny Kopping <danny@coder.com>
Fixes https://github.com/coder/internal/issues/1119
## Description
The `CacheTFProviders` function in `testutil/terraform_cache.go` was
only available on Linux and macOS due to the `//go:build linux ||
darwin` build tag. This caused a compile error on Windows when
`enterprise/coderd/workspaces_test.go` tried to call it:
```
enterprise\coderd\workspaces_test.go:3403:28: undefined: testutil.CacheTFProviders
```
## Changes
1. Added `testutil/terraform_cache_windows.go` with a Windows-specific
stub implementation that returns an empty string
2. Updated `downloadProviders` helper in
`enterprise/coderd/workspaces_test.go` to handle empty paths gracefully
## Behavior
- On Linux/macOS: Terraform providers are cached as before
- On Windows: Provider caching is skipped, tests download providers
normally during `terraform init`
## Testing
This should fix the Windows nightly gauntlet failure. The test will
still run on Windows, just without provider caching optimization.
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
Fixes#20781
Previously, when `GetProvisionerKey()` failed, the actual error was
swallowed:
```
error: unable to get provisioner key details
```
Now the actual error is preserved using error wrapping, so users can see
the real cause (e.g., 404 Not Found, connection refused, invalid key,
etc.):
```
error: unable to get provisioner key details: GET https://...: 404 Not Found
```
This makes it much easier to diagnose configuration issues.
---
*Generated by [mux](https://cmux.io)*
Relates to https://github.com/coder/internal/issues/914
Adds a runner for scaletesting prebuilds. The runner uploads a no-op template with prebuilds, watches for the corresponding workspaces to be created, and then does the same to tear them down. I didn't originally plan on having metrics for the teardown, but I figured we might as well as it's still the same prebuilds reconciliation loop mechanism being tested.
For experimental and dogfood purposes, this adds the ability to opt in a single template.
Leaving the rest of the templates as is.
For GA, this setting might be removed or changed.
Experiments passed to provisioners to determine behavior. This adds
`--experiments` flag to provisioner daemons. Prior to this, provisioners
had no method to turn on/off experiments.
## Problem
Fix race condition in prebuilds reconciler. Previously, a job
notification event was sent to a Go channel before the provisioning
database transaction completed. The notification is consumed by a
separate goroutine that publishes to PostgreSQL's LISTEN/NOTIFY, using a
separate database connection. This creates a potential race: if a
provisioner daemon receives the notification and queries for the job
before the provisioning transaction commits, it won't find the job in
the database.
This manifested as a flaky test failure in `TestReinitializeAgent`,
where provisioners would occasionally miss newly created jobs. The test
uses a 25-second timeout context, while the acquirer's backup polling
mechanism checks for jobs every 30 seconds. This made the race condition
visible in tests, though in production the backup polling would
eventually pick up the job. The solution presented here guarantees that
a job notification is only sent after the provisioning database
transaction commits.
## Changes
* The `provision()` and `provisionDelete()` functions now return the
provisioner job instead of sending notifications internally.
* A new `publishProvisionerJob()` helper centralizes the notification
logic and is called after each transaction completes.
Closes: https://github.com/coder/internal/issues/963
Currently, when AI Bridge is enabled AND the `oauth2` and
`mcp-server-http` experiments are enabled we inject Coder's MCP tools
into all intercepted AI Bridge requests.
This PR introduces a config to control this behaviour.
**NOTE:** this is a backwards-incompatible change; previously these
tools would be injected automatically, now this setting will need to be
explicitly enabled.
---------
Signed-off-by: Danny Kopping <danny@coder.com>
Fixes flaky `TestWorkspaceTagsTerraform` and
`TestWorkspaceTemplateParamsChange` tests that were failing with
`connection reset by peer` errors when downloading the coder/coder
provider.
This applies the same caching solution which was done in
https://github.com/coder/coder/pull/17373
1. Extracts provider caching logic into `testutil/terraform_cache.go`
2. Updates TestProvision to use the shared caching helpers
3. Updates enterprise workspace tests to use the shared caching helpers
The cache is persisted at `~/.cache/coderv2-test/` and automatically
cached between CI runs via existing GitHub Actions cache setup.
Closes https://github.com/coder/internal/issues/607
This change implements optional secure storage of the CLI token using the operating system
keyring for Windows, with groundwork laid for macOS in a future change. Previously, the
Coder CLI stored authentication tokens in plaintext configuration files, which posed a
security risk because users' tokens are stored unencrypted and can be easily accessed by
other processes or users with file system access.
The keyring is opt-in to preserve compatibility with applications (like the JetBrains
Toolbox plugin, VS code plugin, etc). Users can opt into keyring use with a new
`--use-keyring` flag.
The secure storage is platform dependent. Windows Credential Manager API is used on Windows.
The session token continues to be stored in plain text on macOS and Linux. macOS is omitted
for now while we figure out the best path forward for compatibility with apps like Coder Desktop.
https://www.notion.so/coderhq/CLI-Session-Token-in-OS-Keyring-293d579be592808b8b7fd235304e50d5https://github.com/coder/coder/issues/19403
## Description
The membership reconciliation ensures the prebuilds system user is a
member of all organizations with prebuilds configured. To support
prebuilds quota management, each organization must have a prebuilds
group that the system user belongs to.
## Problem
Previously, membership reconciliation iterated over all presets to check
and update membership status. This meant database queries
`GetGroupByOrgAndName` and `InsertGroupMember` were executed for each
preset. Since presets are unique combinations of `(organization,
template, template version, preset)`, this resulted in several redundant
checks for the same organization.
In dogfood, `InsertGroupMember` was called thousands of times per day,
even though memberships were already configured ([internal Grafana
dashboard link](https://grafana.dev.coder.com/goto/46MZ1UgDg?orgId=1))
<img width="5382" height="1788" alt="Screenshot 2025-10-28 at 16 01 36"
src="https://github.com/user-attachments/assets/757b7253-106f-4f72-8586-8e2ede9f18db"
/>
## Solution
This PR introduces `GetOrganizationsWithPrebuildStatus`, a single query
that returns:
* All unique organizations with prebuilds configured
* Whether the prebuilds user is a member of each organization
* Whether the prebuilds group exists in each organization
* Whether the prebuilds user is in the prebuilds group
The membership reconciliation logic now:
* Fetches status for all organizations in one query
* Only performs inserts for organizations missing required memberships
or groups
* Safely handles concurrent operations via unique constraint violations
* This reduces database load from `O(presets)` to `O(organizations)` per
reconciliation loop, with a single read query when everything is
configured.
## Changes
* Add `GetOrganizationsWithPrebuildStatus` SQL query
* Update `membership.ReconcileAll` to use organization-based
reconciliation instead of preset-based
* Update tests to reflect new behavior
Related to internal thread:
https://codercom.slack.com/archives/C07GRNNRW03/p1760535570381369
## Description
PR https://github.com/coder/coder/pull/20387 introduced canceling
pending prebuild jobs from inactive template versions to avoid
provisioning obsolete workspaces. However, the associated prebuilds
remained in the database with "Canceled" status, visible in the UI.
This PR now orphan-deletes these canceled prebuilt workspaces. Since the
canceled jobs were never processed by a provisioner, no Terraform
resources were created, making orphan deletion safe.
Orphan deletion always creates a provisioner job, but behaves
differently based on provisioner availability:
- If no provisioner daemon is available, the job is immediately marked
as completed and the workspace is marked as deleted without any
provisioner processing
- If a provisioner daemon is available, it processes the delete job with
empty Terraform state (no actual resources to destroy)
The job cancellation and workspace deletion occur atomically in the same
transaction. We don't split this into two separate reconciliation runs
because there's no way to distinguish between system-canceled prebuilds
and user-canceled workspaces. If we deleted canceled workspaces in a
later run, we'd delete user-canceled workspaces that users may want to
keep for troubleshooting.
Note: This only applies to system-generated prebuilds from inactive
template versions.
## Changes
* Update `UpdatePrebuildProvisionerJobWithCancel` query to return job
ID, workspace ID, template ID, and template version preset ID
* Add `DeprovisionMode` enum to support orphan deletion in the provision
flow
* Update `ActionTypeCancelPending` handler to cancel jobs and
orphan-delete associated workspaces atomically
## Description
This PR introduces an optimization to automatically cancel pending
prebuild-related jobs from non-active template versions in the
reconciliation loop.
## Problem
Currently, when a template is configured with more prebuild instances
than available provisioners, the provisioner queue can become flooded
with pending prebuild jobs. This issue is worsened when
provisioning/deprovisioning operations take a long time.
When the prebuild reconciliation loop generates jobs faster than
provisioners can process them, pending jobs accumulate in the queue.
Since prebuilt workspaces should always run the latest active template
version, pending prebuild jobs from non-active versions become obsolete
once a new version is promoted.
## Solution
The reconciliation loop cancels pending prebuild-related jobs from
non-active template versions that match the following criteria:
* Build number: 1 (initial build created by the reconciliation loop)
* Job status: `pending`
* Not yet picked up by a provisioner (`worker_id` is `NULL`)
* Owned by the prebuilds system user
* Workspace transition: `start`
This prevents the queue from being cluttered with stale prebuild jobs
that would provision workspaces on an outdated template version that
would consequently need to be deprovisioned.
## Changes
* Added new SQL query `CountPendingNonActivePrebuilds` to identify
presets with pending jobs from non-active versions
* Added new SQL query `UpdatePrebuildProvisionerJobWithCancel` to cancel
jobs for a specific preset
* New reconciliation action type `ActionTypeCancelPending` handles the
cancellation logic
* Cancellation is non-blocking: failures to cancel prebuild jobs are
logged as errors and don't prevent other reconciliation actions
## Follow-up PR
Canceling pending prebuild jobs leaves workspaces in a Canceled state.
While no Terraform resources need to be destroyed (since jobs were
canceled before provisioning started), these database records should
still be cleaned up. This will be addressed in a follow-up PR.
Closes: https://github.com/coder/coder/issues/20242
This PR uses the same sha256 hashing technique as we use for APIKeys. So
now all randomly generated secrets will be hashed with sha256 for
consistency.
This is a breaking change for the oauth tokens. Since oauth is only
allowed for dev builds and experimental, this is ok.
Thanks to the great work in #20393, we’ve successfully introduced
offset-based pagination for this endpoint. However, the frontend expects
a `count` field in the response rather than `total`. This PR updates the
response payload to rename the returned key to `count` for consistency
with frontend expectations and existing API patterns.
This is necessary to unblock the work in #20331