Addresses [`aibridge#54`](https://github.com/coder/aibridge/issues/54)
When querying against the values in the database for
`/api/experimental/aibridge/interceptions` we found strange behaviour
wherein there was interceptions that lacked prompting and other various
fields we want. Generally this was as a result of the data not actually
existing for these values (as they were inflight).
The simple solution to this was to hide them if they didn't exist. This
PR addresses that.
---------
Co-authored-by: Danny Kopping <danny@coder.com>
## Problem
Workspaces associated with tasks were not visually distinguishable in
the workspaces list view. Additionally, the list workspaces endpoint was
not returning the `task_id` field.
<img width="2784" height="864" alt="Screenshot 2025-11-20 at 10 32 22"
src="https://github.com/user-attachments/assets/60704f16-3c66-4553-9215-f10654998a38"
/>
## Changes
- Fix `ConvertWorkspaceRows` to include `task_id` in the list workspaces
endpoint response
- Add "Task" icon to the workspace list view for workspaces associated
with tasks
- Add test to verify `task_id` is correctly returned by the list
workspaces endpoint
- Add Storybook story to showcase the Task icon in the workspace list
Closes https://github.com/coder/coder/issues/20802
Fixes: #20744
Upsert audit and connection log entries with a context derived from the API context, rather than the individual request so that we don't error out if the request is canceled or the client hangs up (e.g. if we return an error).
Database insert errors will fail the transaction. So this error is
fatal. Properly return it for a better error call stack, and not just
hiding the error in the logs.
The test flake can be verified by setting `ReportInterval` to a really
low value, like `100 * time.Millisecond`.
We now set it to a really high value to avoid triggering flush without
manually calling the function in test. This can easily happen because
the default value is 30s and we run tests in parallel. The assertion
typically happens such that:
[use workspace] -> [fetch previous last used] -> [flush]
-> [fetch new last used]
When this edge case is triggered:
[use workspace] -> [report interval flush]
-> [fetch previous last used] -> [flush] -> [fetch new last used]
In this case, both the previous and new last used will be the same,
breaking the test assertion.
Fixescoder/internal#960Fixescoder/internal#975
## Problem
With the new tasks data model, a task starts with an `initializing`
status. However, the API returns `current_state: null` to represent the
agent state, causing the frontend to display "No message available".
This PR updates `codersdk.Task` to return a `current_state` when the
task is initializing with meaningful messages about what's happening
during task initialization.
**Previous message**
<img width="2764" height="288" alt="Screenshot 2025-11-07 at 09 06 13"
src="https://github.com/user-attachments/assets/feec9f15-91ca-4378-8565-5f9de062d11a"
/>
**New message**
<img width="2726" height="226" alt="Screenshot 2025-11-12 at 11 00 15"
src="https://github.com/user-attachments/assets/2f9bee3e-7ac4-4382-b1c3-1d06bbc2906e"
/>
## Changes
- Populate `current_state` with descriptive initialization messages when
task status is `initializing` and no valid app status exists for the
current build
- **dbfake**: Fix `WorkspaceBuild` builder to properly handle
pending/running jobs by linking tasks without requiring agent/app
resources
**Note:** UI Storybook changes to reflect these new messages will be
addressed in a follow-up PR.
Closes: https://github.com/coder/internal/issues/1063
This change restructures the `tasks_with_status` view query to:
- Improve debuggability by adding a `status_debug` column to better
understand the outcome
- Reduce clutter from `bool_or`, `bool_and` which are aggregate
functions that did not actually have serve a purpose (each join is 0-1
rows)
- Improve agent lifecycle state coverage, `start_timeout` and
`start_error` were omitted
- These states are easy to trigger even in a perfectly functioning
workspace/task so we now rely on app health to report whether or not
there was an issue
- Mark canceling and canceled workspace build jobs as error state
- Agent stop states were implicitly `unknown`, now there are explicit (I
initially considered `error`, could go either way)
Previously, `UpdateWorkspaceTimingsMetrics` would log a warning for
workspace builds that aren't tracked (restarts, stops, subsequent builds
after creation). This was noisy since these are legitimate operations,
not errors.
`UpdateWorkspaceTimingsMetrics` is specifically designed to track only
workspace creation, prebuild creation, and prebuild claim timings.
Related with: https://github.com/coder/coder/pull/20772
`prometheus.provisionerd_server_metrics: unsupported workspace timing
flags` appears in the logs, but without knowledge of the available flags
it's not possible to troubleshoot this.
Signed-off-by: Danny Kopping <danny@coder.com>
For experimental and dogfood purposes, this adds the ability to opt in a single template.
Leaving the rest of the templates as is.
For GA, this setting might be removed or changed.
Prior to this, every workspace build ran `terraform init` in a fresh
directory. This would mean the `modules` are downloaded fresh. If the
module is not pinned, subsequent workspace builds would have different
modules.
Experiments passed to provisioners to determine behavior. This adds
`--experiments` flag to provisioner daemons. Prior to this, provisioners
had no method to turn on/off experiments.
Adds some extra meta data sent to provisioners. Also adds a field
`reuse_terraform_workspace` to tell the provisioner whether or not to
use the caching experiment.
Currently, when AI Bridge is enabled AND the `oauth2` and
`mcp-server-http` experiments are enabled we inject Coder's MCP tools
into all intercepted AI Bridge requests.
This PR introduces a config to control this behaviour.
**NOTE:** this is a backwards-incompatible change; previously these
tools would be injected automatically, now this setting will need to be
explicitly enabled.
---------
Signed-off-by: Danny Kopping <danny@coder.com>
Somewhat minor inefficiency in notifications I discovered during a scaletest where I was creating many users. Our `GetUsers` query filter for rbac roles uses the `&&` operator on arrays, which is the intersection of the two arrays. Despite that, we were making seperate DB queries for each role, and then collating the results. I didn't see any other instances of this.
The test changes are required as the order of outgoing notifications is now non-deterministic.
Relates to https://github.com/coder/internal/issues/1098
Currently AgentAPI waits for only 2 seconds worth of identical terminal
screen snapshots before deciding a task has entered a "stable" state. We
interpret this as becoming "idle", resulting in a notification being
triggered. This behavior is not ideal and is ultimately the root cause
of our spammy notifications.
Unfortunately, until we move AgentAPI to either use the Claude Code SDK
(or ACP wrapper around it), we are unable to easily fix the root cause.
This PR instead waits until the agent is ready before it will send state
change notifications. This will at least resolve _some_ of the
complaints about task state notifications being too spammy.
---
🤖 PR was written by Claude Sonnet 4.5 using [Coder
Mux](https://github.com/coder/cmux) and reviewed by a human 👩
* Adds a `GetTaskByOwnerIDAndName` query
* Updates `httpmw.TaskParam` to fall back to task name if no task by
UUID found.
* Updates the `TaskByIdentifier` used in `cli/` to use direct lookup instead of searching.
Create task was still mentioning magic prompt parameter when checking
template task validity. This change updates it to only mention validity
of `coder_ai_task` resource.
Addresses a flake seen locally by @mafredri:
```
panic: interface conversion: proto.isAcquiredJob_Type is nil, not *proto.AcquiredJob_WorkspaceBuild_ [recovered]
panic: interface conversion: proto.isAcquiredJob_Type is nil, not *proto.AcquiredJob_WorkspaceBuild_
goroutine 77 [running]:
testing.tRunner.func1.2({0x35ba440, 0xc000f15620})
/usr/local/go/src/testing/testing.go:1734 +0x21c
testing.tRunner.func1()
/usr/local/go/src/testing/testing.go:1737 +0x35e
panic({0x35ba440?, 0xc000f15620?})
/usr/local/go/src/runtime/panic.go:792 +0x132
github.com/coder/coder/v2/coderd/provisionerdserver_test.TestServer_ExpirePrebuildsSessionToken(0xc00010d500) /home/coder/coder/coderd/provisionerdserver/provisionerdserver_test.go:4128 +0xc4b
testing.tRunner(0xc00010d500, 0x4bd8450)
/usr/local/go/src/testing/testing.go:1792 +0xf4
created by testing.(*T).Run in goroutine 1
/usr/local/go/src/testing/testing.go:1851 +0x413
FAIL github.com/coder/coder/v2/coderd/provisionerdserver 20.830s
FAIL
```
It's unclear why this would happen in the first place.
With `%w` it prints an address instead of the error, like `<op> <url>
0xc001329370` instead of `<op> <url>: some error`, honestly idk why you
even can log with `%w` it seems like it makes no sense to use `%w`
outside of `fmt.Errorf`.
This is to help debug https://github.com/coder/internal/issues/1010.
* Instead of prompting the user to start a deleted workspace (which is
silly), prompt them to create a new task instead.
* Adds a warning dialog when deleting a workspace
* Updates provisionerdserver to delete the related task if a workspace
is related to a task
We recently made a change to the `wsbuilder` to handle task related
logic. Our test coverage for the lifecycle executor didn't handle this
scenario and so we missed that it had insufficient permissions.
This PR adds `Update` and `Read` permissions for `Task`s in the
lifecycle executor, as well as an autostart/autostop test tailored to
task workspaces to verify the change.
---
Anthropic's Claude Sonnet 4.5 Thinking was involved in writing the tests
## Description
The membership reconciliation ensures the prebuilds system user is a
member of all organizations with prebuilds configured. To support
prebuilds quota management, each organization must have a prebuilds
group that the system user belongs to.
## Problem
Previously, membership reconciliation iterated over all presets to check
and update membership status. This meant database queries
`GetGroupByOrgAndName` and `InsertGroupMember` were executed for each
preset. Since presets are unique combinations of `(organization,
template, template version, preset)`, this resulted in several redundant
checks for the same organization.
In dogfood, `InsertGroupMember` was called thousands of times per day,
even though memberships were already configured ([internal Grafana
dashboard link](https://grafana.dev.coder.com/goto/46MZ1UgDg?orgId=1))
<img width="5382" height="1788" alt="Screenshot 2025-10-28 at 16 01 36"
src="https://github.com/user-attachments/assets/757b7253-106f-4f72-8586-8e2ede9f18db"
/>
## Solution
This PR introduces `GetOrganizationsWithPrebuildStatus`, a single query
that returns:
* All unique organizations with prebuilds configured
* Whether the prebuilds user is a member of each organization
* Whether the prebuilds group exists in each organization
* Whether the prebuilds user is in the prebuilds group
The membership reconciliation logic now:
* Fetches status for all organizations in one query
* Only performs inserts for organizations missing required memberships
or groups
* Safely handles concurrent operations via unique constraint violations
* This reduces database load from `O(presets)` to `O(organizations)` per
reconciliation loop, with a single read query when everything is
configured.
## Changes
* Add `GetOrganizationsWithPrebuildStatus` SQL query
* Update `membership.ReconcileAll` to use organization-based
reconciliation instead of preset-based
* Update tests to reflect new behavior
Related to internal thread:
https://codercom.slack.com/archives/C07GRNNRW03/p1760535570381369
## Description
PR https://github.com/coder/coder/pull/20387 introduced canceling
pending prebuild jobs from inactive template versions to avoid
provisioning obsolete workspaces. However, the associated prebuilds
remained in the database with "Canceled" status, visible in the UI.
This PR now orphan-deletes these canceled prebuilt workspaces. Since the
canceled jobs were never processed by a provisioner, no Terraform
resources were created, making orphan deletion safe.
Orphan deletion always creates a provisioner job, but behaves
differently based on provisioner availability:
- If no provisioner daemon is available, the job is immediately marked
as completed and the workspace is marked as deleted without any
provisioner processing
- If a provisioner daemon is available, it processes the delete job with
empty Terraform state (no actual resources to destroy)
The job cancellation and workspace deletion occur atomically in the same
transaction. We don't split this into two separate reconciliation runs
because there's no way to distinguish between system-canceled prebuilds
and user-canceled workspaces. If we deleted canceled workspaces in a
later run, we'd delete user-canceled workspaces that users may want to
keep for troubleshooting.
Note: This only applies to system-generated prebuilds from inactive
template versions.
## Changes
* Update `UpdatePrebuildProvisionerJobWithCancel` query to return job
ID, workspace ID, template ID, and template version preset ID
* Add `DeprovisionMode` enum to support orphan deletion in the provision
flow
* Update `ActionTypeCancelPending` handler to cancel jobs and
orphan-delete associated workspaces atomically
This should resolve https://github.com/coder/internal/issues/728 by
refactoring the ResourceMonitorAPI struct to only require querying the
resource monitor once for memory and once for volumes, then using the
stored monitors on the API struct from that point on. This should
eliminate the vast majority of calls to `GetWorkspaceByAgentID` and
`FetchVolumesResourceMonitorsUpdatedAfter`/`FetchMemoryResourceMonitorsUpdatedAfter`
(millions of calls per week).
Tests passed, and I ran an instance of coder via a workspace with a
template that added resource monitoring every 10s. Note that this is the
default docker container, so there are other sources of
`GetWorkspaceByAgentID` db queries. Note that this workspace was running
for ~15 minutes at the time I gathered this data.
Over 30s for the `ResourceMonitor` calls:
```
coder@callum-coder-2:~/coder$ curl localhost:19090/metrics | grep ResourceMonitor | grep count
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0coderd_db_query_latencies_seconds_count{query="FetchMemoryResourceMonitorsByAgentID"} 2
coderd_db_query_latencies_seconds_count{query="FetchMemoryResourceMonitorsUpdatedAfter"} 2
100 288k 0 288k 0 0 58.3M 0 --:--:-- --:--:-- --:--:-- 70.4M
coderd_db_query_latencies_seconds_count{query="FetchVolumesResourceMonitorsByAgentID"} 2
coderd_db_query_latencies_seconds_count{query="FetchVolumesResourceMonitorsUpdatedAfter"} 2
coderd_db_query_latencies_seconds_count{query="UpdateMemoryResourceMonitor"} 155
coderd_db_query_latencies_seconds_count{query="UpdateVolumeResourceMonitor"} 155
coder@callum-coder-2:~/coder$ curl localhost:19090/metrics | grep ResourceMonitor | grep count
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0coderd_db_query_latencies_seconds_count{query="FetchMemoryResourceMonitorsByAgentID"} 2
coderd_db_query_latencies_seconds_count{query="FetchMemoryResourceMonitorsUpdatedAfter"} 2
100 288k 0 288k 0 0 34.7M 0 --:--:-- --:--:-- --:--:-- 40.2M
coderd_db_query_latencies_seconds_count{query="FetchVolumesResourceMonitorsByAgentID"} 2
coderd_db_query_latencies_seconds_count{query="FetchVolumesResourceMonitorsUpdatedAfter"} 2
coderd_db_query_latencies_seconds_count{query="UpdateMemoryResourceMonitor"} 158
coderd_db_query_latencies_seconds_count{query="UpdateVolumeResourceMonitor"} 158
```
And over 1m for the `GetWorkspaceAgentByID` calls, the majority are from
the workspace metadata stats updates:
```
coder@callum-coder-2:~/coder$ curl localhost:19090/metrics | grep GetWorkspaceByAgentID | grep count
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 284k 0 284k 0 0 42.4M 0 --:--:-- --:--:-- --:--:-- 46.3M
coderd_db_query_latencies_seconds_count{query="GetWorkspaceByAgentID"} 876
coder@callum-coder-2:~/coder$ curl localhost:19090/metrics | grep GetWorkspaceByAgentID | grep count
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 284k 0 284k 0 0 75.4M 0 --:--:-- --:--:-- --:--:-- 92.7M
coderd_db_query_latencies_seconds_count{query="GetWorkspaceByAgentID"} 918
```
---------
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Relates to https://github.com/coder/internal/issues/1098
Currently task notifications are incredibly noisy. We should disable
them by default for the upcoming release whilst we iron them out.
Relates to
https://github.com/coder/coder/pull/20431/files#diff-9cfc826a6ce7e77d977b2025482474dd263d12965b2a94479a74c7f1d872b782
If the workspace relating to a task was deleted, most of the
workspace-related fields in `taskFromDBTaskAndWorkspace` will be
zero-valued. However, we can still get information relating to the owner
so that "created by" shows up correctly in the UI.
Updates the `tasks_with_status` view with a join on `visible_users` to
get owner-related info.