## Problem
With the new tasks data model, a task starts with an `initializing`
status. However, the API returns `current_state: null` to represent the
agent state, causing the frontend to display "No message available".
This PR updates `codersdk.Task` to return a `current_state` when the
task is initializing with meaningful messages about what's happening
during task initialization.
**Previous message**
<img width="2764" height="288" alt="Screenshot 2025-11-07 at 09 06 13"
src="https://github.com/user-attachments/assets/feec9f15-91ca-4378-8565-5f9de062d11a"
/>
**New message**
<img width="2726" height="226" alt="Screenshot 2025-11-12 at 11 00 15"
src="https://github.com/user-attachments/assets/2f9bee3e-7ac4-4382-b1c3-1d06bbc2906e"
/>
## Changes
- Populate `current_state` with descriptive initialization messages when
task status is `initializing` and no valid app status exists for the
current build
- **dbfake**: Fix `WorkspaceBuild` builder to properly handle
pending/running jobs by linking tasks without requiring agent/app
resources
**Note:** UI Storybook changes to reflect these new messages will be
addressed in a follow-up PR.
Closes: https://github.com/coder/internal/issues/1063
This change restructures the `tasks_with_status` view query to:
- Improve debuggability by adding a `status_debug` column to better
understand the outcome
- Reduce clutter from `bool_or`, `bool_and` which are aggregate
functions that did not actually have serve a purpose (each join is 0-1
rows)
- Improve agent lifecycle state coverage, `start_timeout` and
`start_error` were omitted
- These states are easy to trigger even in a perfectly functioning
workspace/task so we now rely on app health to report whether or not
there was an issue
- Mark canceling and canceled workspace build jobs as error state
- Agent stop states were implicitly `unknown`, now there are explicit (I
initially considered `error`, could go either way)
Previously, `UpdateWorkspaceTimingsMetrics` would log a warning for
workspace builds that aren't tracked (restarts, stops, subsequent builds
after creation). This was noisy since these are legitimate operations,
not errors.
`UpdateWorkspaceTimingsMetrics` is specifically designed to track only
workspace creation, prebuild creation, and prebuild claim timings.
Related with: https://github.com/coder/coder/pull/20772
`prometheus.provisionerd_server_metrics: unsupported workspace timing
flags` appears in the logs, but without knowledge of the available flags
it's not possible to troubleshoot this.
Signed-off-by: Danny Kopping <danny@coder.com>
For experimental and dogfood purposes, this adds the ability to opt in a single template.
Leaving the rest of the templates as is.
For GA, this setting might be removed or changed.
Prior to this, every workspace build ran `terraform init` in a fresh
directory. This would mean the `modules` are downloaded fresh. If the
module is not pinned, subsequent workspace builds would have different
modules.
Experiments passed to provisioners to determine behavior. This adds
`--experiments` flag to provisioner daemons. Prior to this, provisioners
had no method to turn on/off experiments.
Adds some extra meta data sent to provisioners. Also adds a field
`reuse_terraform_workspace` to tell the provisioner whether or not to
use the caching experiment.
Currently, when AI Bridge is enabled AND the `oauth2` and
`mcp-server-http` experiments are enabled we inject Coder's MCP tools
into all intercepted AI Bridge requests.
This PR introduces a config to control this behaviour.
**NOTE:** this is a backwards-incompatible change; previously these
tools would be injected automatically, now this setting will need to be
explicitly enabled.
---------
Signed-off-by: Danny Kopping <danny@coder.com>
Somewhat minor inefficiency in notifications I discovered during a scaletest where I was creating many users. Our `GetUsers` query filter for rbac roles uses the `&&` operator on arrays, which is the intersection of the two arrays. Despite that, we were making seperate DB queries for each role, and then collating the results. I didn't see any other instances of this.
The test changes are required as the order of outgoing notifications is now non-deterministic.
Relates to https://github.com/coder/internal/issues/1098
Currently AgentAPI waits for only 2 seconds worth of identical terminal
screen snapshots before deciding a task has entered a "stable" state. We
interpret this as becoming "idle", resulting in a notification being
triggered. This behavior is not ideal and is ultimately the root cause
of our spammy notifications.
Unfortunately, until we move AgentAPI to either use the Claude Code SDK
(or ACP wrapper around it), we are unable to easily fix the root cause.
This PR instead waits until the agent is ready before it will send state
change notifications. This will at least resolve _some_ of the
complaints about task state notifications being too spammy.
---
🤖 PR was written by Claude Sonnet 4.5 using [Coder
Mux](https://github.com/coder/cmux) and reviewed by a human 👩
* Adds a `GetTaskByOwnerIDAndName` query
* Updates `httpmw.TaskParam` to fall back to task name if no task by
UUID found.
* Updates the `TaskByIdentifier` used in `cli/` to use direct lookup instead of searching.
Create task was still mentioning magic prompt parameter when checking
template task validity. This change updates it to only mention validity
of `coder_ai_task` resource.
Addresses a flake seen locally by @mafredri:
```
panic: interface conversion: proto.isAcquiredJob_Type is nil, not *proto.AcquiredJob_WorkspaceBuild_ [recovered]
panic: interface conversion: proto.isAcquiredJob_Type is nil, not *proto.AcquiredJob_WorkspaceBuild_
goroutine 77 [running]:
testing.tRunner.func1.2({0x35ba440, 0xc000f15620})
/usr/local/go/src/testing/testing.go:1734 +0x21c
testing.tRunner.func1()
/usr/local/go/src/testing/testing.go:1737 +0x35e
panic({0x35ba440?, 0xc000f15620?})
/usr/local/go/src/runtime/panic.go:792 +0x132
github.com/coder/coder/v2/coderd/provisionerdserver_test.TestServer_ExpirePrebuildsSessionToken(0xc00010d500) /home/coder/coder/coderd/provisionerdserver/provisionerdserver_test.go:4128 +0xc4b
testing.tRunner(0xc00010d500, 0x4bd8450)
/usr/local/go/src/testing/testing.go:1792 +0xf4
created by testing.(*T).Run in goroutine 1
/usr/local/go/src/testing/testing.go:1851 +0x413
FAIL github.com/coder/coder/v2/coderd/provisionerdserver 20.830s
FAIL
```
It's unclear why this would happen in the first place.
With `%w` it prints an address instead of the error, like `<op> <url>
0xc001329370` instead of `<op> <url>: some error`, honestly idk why you
even can log with `%w` it seems like it makes no sense to use `%w`
outside of `fmt.Errorf`.
This is to help debug https://github.com/coder/internal/issues/1010.
* Instead of prompting the user to start a deleted workspace (which is
silly), prompt them to create a new task instead.
* Adds a warning dialog when deleting a workspace
* Updates provisionerdserver to delete the related task if a workspace
is related to a task
We recently made a change to the `wsbuilder` to handle task related
logic. Our test coverage for the lifecycle executor didn't handle this
scenario and so we missed that it had insufficient permissions.
This PR adds `Update` and `Read` permissions for `Task`s in the
lifecycle executor, as well as an autostart/autostop test tailored to
task workspaces to verify the change.
---
Anthropic's Claude Sonnet 4.5 Thinking was involved in writing the tests
## Description
The membership reconciliation ensures the prebuilds system user is a
member of all organizations with prebuilds configured. To support
prebuilds quota management, each organization must have a prebuilds
group that the system user belongs to.
## Problem
Previously, membership reconciliation iterated over all presets to check
and update membership status. This meant database queries
`GetGroupByOrgAndName` and `InsertGroupMember` were executed for each
preset. Since presets are unique combinations of `(organization,
template, template version, preset)`, this resulted in several redundant
checks for the same organization.
In dogfood, `InsertGroupMember` was called thousands of times per day,
even though memberships were already configured ([internal Grafana
dashboard link](https://grafana.dev.coder.com/goto/46MZ1UgDg?orgId=1))
<img width="5382" height="1788" alt="Screenshot 2025-10-28 at 16 01 36"
src="https://github.com/user-attachments/assets/757b7253-106f-4f72-8586-8e2ede9f18db"
/>
## Solution
This PR introduces `GetOrganizationsWithPrebuildStatus`, a single query
that returns:
* All unique organizations with prebuilds configured
* Whether the prebuilds user is a member of each organization
* Whether the prebuilds group exists in each organization
* Whether the prebuilds user is in the prebuilds group
The membership reconciliation logic now:
* Fetches status for all organizations in one query
* Only performs inserts for organizations missing required memberships
or groups
* Safely handles concurrent operations via unique constraint violations
* This reduces database load from `O(presets)` to `O(organizations)` per
reconciliation loop, with a single read query when everything is
configured.
## Changes
* Add `GetOrganizationsWithPrebuildStatus` SQL query
* Update `membership.ReconcileAll` to use organization-based
reconciliation instead of preset-based
* Update tests to reflect new behavior
Related to internal thread:
https://codercom.slack.com/archives/C07GRNNRW03/p1760535570381369
## Description
PR https://github.com/coder/coder/pull/20387 introduced canceling
pending prebuild jobs from inactive template versions to avoid
provisioning obsolete workspaces. However, the associated prebuilds
remained in the database with "Canceled" status, visible in the UI.
This PR now orphan-deletes these canceled prebuilt workspaces. Since the
canceled jobs were never processed by a provisioner, no Terraform
resources were created, making orphan deletion safe.
Orphan deletion always creates a provisioner job, but behaves
differently based on provisioner availability:
- If no provisioner daemon is available, the job is immediately marked
as completed and the workspace is marked as deleted without any
provisioner processing
- If a provisioner daemon is available, it processes the delete job with
empty Terraform state (no actual resources to destroy)
The job cancellation and workspace deletion occur atomically in the same
transaction. We don't split this into two separate reconciliation runs
because there's no way to distinguish between system-canceled prebuilds
and user-canceled workspaces. If we deleted canceled workspaces in a
later run, we'd delete user-canceled workspaces that users may want to
keep for troubleshooting.
Note: This only applies to system-generated prebuilds from inactive
template versions.
## Changes
* Update `UpdatePrebuildProvisionerJobWithCancel` query to return job
ID, workspace ID, template ID, and template version preset ID
* Add `DeprovisionMode` enum to support orphan deletion in the provision
flow
* Update `ActionTypeCancelPending` handler to cancel jobs and
orphan-delete associated workspaces atomically
This should resolve https://github.com/coder/internal/issues/728 by
refactoring the ResourceMonitorAPI struct to only require querying the
resource monitor once for memory and once for volumes, then using the
stored monitors on the API struct from that point on. This should
eliminate the vast majority of calls to `GetWorkspaceByAgentID` and
`FetchVolumesResourceMonitorsUpdatedAfter`/`FetchMemoryResourceMonitorsUpdatedAfter`
(millions of calls per week).
Tests passed, and I ran an instance of coder via a workspace with a
template that added resource monitoring every 10s. Note that this is the
default docker container, so there are other sources of
`GetWorkspaceByAgentID` db queries. Note that this workspace was running
for ~15 minutes at the time I gathered this data.
Over 30s for the `ResourceMonitor` calls:
```
coder@callum-coder-2:~/coder$ curl localhost:19090/metrics | grep ResourceMonitor | grep count
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0coderd_db_query_latencies_seconds_count{query="FetchMemoryResourceMonitorsByAgentID"} 2
coderd_db_query_latencies_seconds_count{query="FetchMemoryResourceMonitorsUpdatedAfter"} 2
100 288k 0 288k 0 0 58.3M 0 --:--:-- --:--:-- --:--:-- 70.4M
coderd_db_query_latencies_seconds_count{query="FetchVolumesResourceMonitorsByAgentID"} 2
coderd_db_query_latencies_seconds_count{query="FetchVolumesResourceMonitorsUpdatedAfter"} 2
coderd_db_query_latencies_seconds_count{query="UpdateMemoryResourceMonitor"} 155
coderd_db_query_latencies_seconds_count{query="UpdateVolumeResourceMonitor"} 155
coder@callum-coder-2:~/coder$ curl localhost:19090/metrics | grep ResourceMonitor | grep count
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0coderd_db_query_latencies_seconds_count{query="FetchMemoryResourceMonitorsByAgentID"} 2
coderd_db_query_latencies_seconds_count{query="FetchMemoryResourceMonitorsUpdatedAfter"} 2
100 288k 0 288k 0 0 34.7M 0 --:--:-- --:--:-- --:--:-- 40.2M
coderd_db_query_latencies_seconds_count{query="FetchVolumesResourceMonitorsByAgentID"} 2
coderd_db_query_latencies_seconds_count{query="FetchVolumesResourceMonitorsUpdatedAfter"} 2
coderd_db_query_latencies_seconds_count{query="UpdateMemoryResourceMonitor"} 158
coderd_db_query_latencies_seconds_count{query="UpdateVolumeResourceMonitor"} 158
```
And over 1m for the `GetWorkspaceAgentByID` calls, the majority are from
the workspace metadata stats updates:
```
coder@callum-coder-2:~/coder$ curl localhost:19090/metrics | grep GetWorkspaceByAgentID | grep count
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 284k 0 284k 0 0 42.4M 0 --:--:-- --:--:-- --:--:-- 46.3M
coderd_db_query_latencies_seconds_count{query="GetWorkspaceByAgentID"} 876
coder@callum-coder-2:~/coder$ curl localhost:19090/metrics | grep GetWorkspaceByAgentID | grep count
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 284k 0 284k 0 0 75.4M 0 --:--:-- --:--:-- --:--:-- 92.7M
coderd_db_query_latencies_seconds_count{query="GetWorkspaceByAgentID"} 918
```
---------
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Relates to https://github.com/coder/internal/issues/1098
Currently task notifications are incredibly noisy. We should disable
them by default for the upcoming release whilst we iron them out.
Relates to
https://github.com/coder/coder/pull/20431/files#diff-9cfc826a6ce7e77d977b2025482474dd263d12965b2a94479a74c7f1d872b782
If the workspace relating to a task was deleted, most of the
workspace-related fields in `taskFromDBTaskAndWorkspace` will be
zero-valued. However, we can still get information relating to the owner
so that "created by" shows up correctly in the UI.
Updates the `tasks_with_status` view with a join on `visible_users` to
get owner-related info.
While investigating a flake I noticed that the dbfake workspace builder
executes all database inserts without a transaction. Since our real
wsbuilder implementation utilizes one it makes sense to do here as well.
For example, our normal workspace <-> build relationship is such that a
workspace cannot exist with at least one build. However, our
GetWorkspaces query left joins workspace builds but has types that are
non-nullable, leading to flakes like coder/internal#1103.
- Adds a new table to keep track of which payloads have already been
reported since we only report for the last clock hour
- Adds a query to gather and aggregate all the data by
provider/model/client
Relates to https://github.com/coder/coder-telemetry-server/issues/27
Adds columns to track package and test name to test_databases table, and populates them as databases are created using the Broker.
In order to seamlessly work with existing `coder_database` databases with the old schema, the SQL that creates the table and columns is additive and idempotent, so we run it every time we initialize the Broker (once per test binary execution).
We include a transaction level advisorly lock to prevent deadlocks before attempting to alter the schema. I was seeing deadlocks without this.
## Description
This PR introduces an optimization to automatically cancel pending
prebuild-related jobs from non-active template versions in the
reconciliation loop.
## Problem
Currently, when a template is configured with more prebuild instances
than available provisioners, the provisioner queue can become flooded
with pending prebuild jobs. This issue is worsened when
provisioning/deprovisioning operations take a long time.
When the prebuild reconciliation loop generates jobs faster than
provisioners can process them, pending jobs accumulate in the queue.
Since prebuilt workspaces should always run the latest active template
version, pending prebuild jobs from non-active versions become obsolete
once a new version is promoted.
## Solution
The reconciliation loop cancels pending prebuild-related jobs from
non-active template versions that match the following criteria:
* Build number: 1 (initial build created by the reconciliation loop)
* Job status: `pending`
* Not yet picked up by a provisioner (`worker_id` is `NULL`)
* Owned by the prebuilds system user
* Workspace transition: `start`
This prevents the queue from being cluttered with stale prebuild jobs
that would provision workspaces on an outdated template version that
would consequently need to be deprovisioned.
## Changes
* Added new SQL query `CountPendingNonActivePrebuilds` to identify
presets with pending jobs from non-active versions
* Added new SQL query `UpdatePrebuildProvisionerJobWithCancel` to cancel
jobs for a specific preset
* New reconciliation action type `ActionTypeCancelPending` handles the
cancellation logic
* Cancellation is non-blocking: failures to cancel prebuild jobs are
logged as errors and don't prevent other reconciliation actions
## Follow-up PR
Canceling pending prebuild jobs leaves workspaces in a Canceled state.
While no Terraform resources need to be destroyed (since jobs were
canceled before provisioning started), these database records should
still be cleaned up. This will be addressed in a follow-up PR.
Closes: https://github.com/coder/coder/issues/20242
Updates the UI to use the new API endpoints for tasks and use its new
data model.
Disclaimer: Since the base data model for tasks changed, we had to do a
quite large refactor and I'm sorry for that 🙏, but you'll notice most of
the changes are to adjust the types.
Closescoder/internal#976
---------
Co-authored-by: Bruno Quaresma <bruno_nonato_quaresma@hotmail.com>