Backport of #23835.
Audit and connection log pages were timing out due to expensive COUNT(*)
queries over large tables. This commit adds opt-in count capping:
requests can return a `count_cap` field signaling that the count was
truncated at a threshold, avoiding full table scans that caused page
timeouts.
Text-cast UUID comparisons in regosql-generated authorization queries
also contributed to the slowdown by preventing index usage for
connection and audit log queries. These now emit native UUID operators.
Frontend changes handle the capped state in usePaginatedQuery and
PaginationWidget, optionally displaying a capped count in the pagination
UI (e.g. "Showing 2,076 to 2,100 of 2,000+ logs")
---
Cherry picked from 86ca61d6ca
Backport of #23090 to `release/2.29`.
When a user creates a workspace, opens the web terminal, then the
workspace stops but the web terminal remains open, the web terminal will
retry the connection. Coder would issue a HTTP 502 Bad Gateway response
when this occurred because coderd could not connect to the workspace
agent, however this is problematic as any load balancer sitting in front
of Coder sees a 502 and thinks Coder is unhealthy.
This PR changes the response to a HTTP 404 after internal discussion.
Cherry-picked from merge commit
c33812a430. The conflict in
`coderd/workspaceapps/errors.go` was resolved by applying the status
code change (502 → 404) while keeping the existing
`RetryEnabled`/`DashboardURL` fields (the `Actions` refactor is not on
this branch).
Fixes https://github.com/coder/coder/issues/22375
Updates `stringutil.Truncate` to properly handle multi-byte UTF-8
characters.
Adds tests for multi-byte truncation with word boundary.
Created by Mux using Opus 4.6
(cherry picked from commit 0cfa03718e)
Parent agents were re-using AuthInstanceID when spawning child agents.
This caused GetWorkspaceAgentByInstanceID to return the most recently
created sub agent instead of the parent when the parent tried to refetch
its own manifest.
Fix by not reusing AuthInstanceID for sub agents, and updating
GetWorkspaceAgentByInstanceID to filter them out entirely.
---
Cherry picked from 911d734df9
… (#21652)
Relates to https://github.com/coder/coder/issues/19715
This is similar to https://github.com/coder/coder/pull/19711
This endpoint works by doing the following:
- Subscribing to the database's with pubsub
- Accepts a WebSocket upgrade
- Starts a `httpapi.Heartbeat`
- Creates a json encoder
- **Infinitely loops waiting for notification until request context
cancelled**
The critical issue here is that `httpapi.Heartbeat` silently fails when
the client has disconnected. This means we never cancel the request
context, leaving the WebSocket alive until we receive a notification
from the database and fail to write that down the pipe.
By replacing usage of `httpapi.Heartbeat` with `httpapi.HeartbeatClose`,
we cancel the context _when the heartbeat fails to write_ due to the
client disconnecting. This allows us to cleanup without waiting for a
notification to come through the pubsub channel.
(cherry picked from commit 409360c62d)
<!--
If you have used AI to produce some or all of this PR, please ensure you
have read our [AI Contribution
guidelines](https://coder.com/docs/about/contributing/AI_CONTRIBUTING)
before submitting.
-->
Co-authored-by: Danielle Maywood <danielle@themaywoods.com>
* https://github.com/coder/coder/pull/21493
* https://github.com/coder/coder/pull/21496
* https://github.com/coder/coder/pull/21530
NB these commits were originally authored by Blink on behalf of
@dannykopping, so amended to reflect actual authorship.
**Repro/Verification Steps:**
* Created a Coder deployment with a non-public schema via Docker compose
on v2.28.6:
* Created a DB init script under `db-init/01-create-schema.sql` with the
following:
```sql
CREATE SCHEMA IF NOT EXISTS coder AUTHORIZATION coder;
GRANT ALL PRIVILEGES ON SCHEMA coder TO coder;
ALTER ROLE coder SET search_path TO coder;
```
* Mounted above inside the `postgres` container:
```diff
volumes:
- coder_data:/var/lib/postgresql/data
+ - ./db-init:/docker-entrypoint-initdb.d:ro
```
* Edited `CODER_PG_CONNECTION_URL` to update the search path:
```diff
environment:
- CODER_PG_CONNECTION_URL:
"postgresql://${POSTGRES_USER:-username}:${POSTGRES_PASSWORD:-password}@database/${POSTGRES_DB:-coder}?sslmode=disable"
+ CODER_PG_CONNECTION_URL:
"postgresql://${POSTGRES_USER:-username}:${POSTGRES_PASSWORD:-password}@database/${POSTGRES_DB:-coder}?sslmode=disable&search_path=coder"
```
* Brought up the deployment:
```shell
CODER_VERSION=v2.28.6 CODER_ACCESS_URL=http://localhost:7080
POSTGRES_USER=coder POSTGRES_PASSWORD=coder docker compose up`
```
* Created user / template / workspace
* Updated to `v2.29.1`:
* ```shell
CODER_VERSION=v2.29.1 CODER_ACCESS_URL=http://localhost:7080
POSTGRES_USER=coder POSTGRES_PASSWORD=coder docker compose up`
```
* Observed following error:
```
database-1 | 2026-01-21 15:07:17.629 UTC [102] ERROR: relation
"public.workspace_agents" does not exist
coder-1 | Encountered an error running "coder server", see "coder server
--help" for more information
database-1 | 2026-01-21 15:07:17.629 UTC [102] STATEMENT: CREATE INDEX
IF NOT EXISTS workspace_agents_auth_instance_id_deleted_idx ON
public.workspace_agents (auth_instance_id, deleted);
coder-1 | error: connect to postgres: connect to postgres: migrate up:
up: 2 errors occurred:
coder-1 | * run statement: migration failed: relation
"public.workspace_agents" does not exist in line 0: CREATE INDEX IF NOT
EXISTS workspace_agents_auth_instance_id_deleted_idx ON
public.workspace_agents (auth_instance_id, deleted);
coder-1 | (details: pq: relation "public.workspace_agents" does not
exist)
coder-1 | * commit tx on unlock: pq: Could not complete operation in a
failed transaction
coder-1 exited with code 1
```
* Built image locally:
```console
$ make build/coder_$(./scripts/version.sh)_linux_amd64.tag
...
ghcr.io/coder/coder:v2.29.1-devel-e8c482a98a67-amd64
```
* Started with new image:
```shell
CODER_VERSION=v2.29.1-devel-e8c482a98a67-amd64
CODER_ACCESS_URL=http://localhost:7080 POSTGRES_USER=coder
POSTGRES_PASSWORD=coder docker compose up
```
* Observed migrations ran successfully and Coder came up successfully
---------
Signed-off-by: Danny Kopping <danny@coder.com>
Co-authored-by: Danny Kopping <danny@coder.com>
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
## Context
GetWorkspaceAgentByInstanceID has a suboptimal plan. Even though it is
designed to fetch a small subset of records, there are no corresponding
indexes and that query results in full table scan:
Query:
```
SELECT id, auth_instance_id FROM workspace_agents
where auth_instance_id='i-013c2b96b6441648a' and deleted=FALSE;
```
Plan:
```
------------------------------------------------------------------------------------------------------------------
Seq Scan on workspace_agents (cost=0.00..222325.48 rows=2 width=36) (actual time=0.012..234.152 rows=4 loops=1)
Filter: ((NOT deleted) AND ((auth_instance_id)::text = 'i-013c2b96b6441648a'::text))
Rows Removed by Filter: 302276
Planning Time: 0.173 ms
Execution Time: 234.169 ms
```
After adding the index, the plan improves drastically.
Updated plan:
```
Bitmap Heap Scan on workspace_agents (cost=4.44..12.32 rows=2 width=36) (actual time=0.019..0.019 rows=0 loops=1)
Recheck Cond: (((auth_instance_id)::text = 'i-013c2b96b6441648a'::text) AND (NOT deleted))
-> Bitmap Index Scan on workspace_agents_auth_instance_id_deleted_idx (cost=0.00..4.44 rows=2 width=0) (actual time=0.013..0.014 rows=0 loops=1)
Index Cond: (((auth_instance_id)::text = 'i-013c2b96b6441648a'::text) AND (deleted = false))
Planning Time: 0.388 ms
Execution Time: 0.044 ms
```
## Changes
* add an index to optimize this query
## Testing
* ran the queries manually against prod and test DBs
* ran `./scripts/develop.sh`, connected to the local PostgreSQL
instance, inspected the indexes to make sure new index is there:
```
Indexes:
"workspace_agents_pkey" PRIMARY KEY, btree (id)
// NEW INDEX CREATED SUCCESSFULLY [comment is mine]
"workspace_agents_auth_instance_id_deleted_idx" btree (auth_instance_id, deleted)
"workspace_agents_auth_token_idx" btree (auth_token)
"workspace_agents_resource_id_idx" btree (resource_id)
```
---------
Signed-off-by: Danny Kopping <danny@coder.com>
Co-authored-by: Danny Kopping <danny@coder.com>
closes https://github.com/coder/coder/issues/20899
This is in response to a migration in v2.27 that takes very long on
deployments with large `api_keys` tables.
NOTE: The optimization causes the _up_ migration to delete old data
(keys that expired more than 7 days ago). The _down_ migration won't
resurrect the deleted data.
This is to debug context timeouts on API requests to the agent.
Because rbac and database cannot be imported in slim, split the logger
middleware into slim and non-slim versions and break out the recovery
middleware.
## Summary
Completes the Coder Tasks GA promotion by updating swagger tags and
regenerating API documentation and updating the frontend API structure.
## Related
Follows #20923 and #20921 which promoted Tasks from Beta/Experimental to
GA.
---
🤖 This change was written by Claude Sonnet 4.5 Thinking using
[mux](https://github.com/coder/mux) and reviewed by a human 🏂
## Description
This PR fixes an issue where `GetLatestWorkspaceAppStatusesByAppID`
returned an unbounded number of rows for a given app ID, which could
cause performance issues for noisy or long-running AI tasks.
## Impact
This change reduces database query overhead for workspace app status
updates, particularly for busy AI tasks that update their status
frequently. Previously, fetching the latest status would return all
historical statuses, now it returns only the most recent one.
Fixes#20862
---
🤖 This change was written by Claude Sonnet 4.5 Thinking using [mux](https://github.com/coder/mux) and reviewed by a human 🏄🏻♂️
## Problem
Tasks currently only expose a machine-friendly name field (e.g.
`task-python-debug-a1b2`), but this value is primarily an identifier
rather than a clean, descriptive label. We need a separate
display-friendly name for use in the UI.
This PR introduces a new `display_name` field and updates the task-name
generation flow. The Claude system prompt was updated to return valid
JSON with both `name` and `display_name`. The name generation logic
follows a fallback chain (Anthropic > prompt sanitization > random
fallback). To make task names more closely resemble their display names,
the legacy `task-` prefix has been removed. For context, PR
https://github.com/coder/coder/pull/20834 introduced a small Task icon
to the workspace list to help identify workspaces associated to tasks.
## Changes
- Database migration: Added `display_name` column to tasks table
- Updated system prompt to generate both task name and display name as
valid JSON
- Task name generation now follows a fallback chain: Anthropic > prompt
sanitization > random fallback
- Removed `task-` prefix from task names to allow more descriptive names
- Note: PR https://github.com/coder/coder/pull/20834 adds a Task icon to
workspaces in the workspace list to distinguish task-created workspaces
**Note:** UI changes will be addressed in a follow-up PR
Related to: https://github.com/coder/coder/issues/20801
This PR adds the backend implementation for modifying task prompts. Part
of https://github.com/coder/internal/issues/1084
## Changes
- New `UpdateTaskPrompt` database query to update task prompts
- New PATCH `/api/v2/tasks/{task}/prompt` endpoint
## Notes
This is part 1 of a 2-part PR stack. The frontend UI will be added in a
follow-up PR based on this branch
(https://github.com/coder/coder/pull/20812).
---
🤖 PR was written by Claude Sonnet 4.5 Thinking using [Coder
Mux](https://github.com/coder/cmux) and reviewed by a human 👩
fixes https://github.com/coder/internal/issues/1123
We want to tests that ports are not included after they are no longer used, but this isn't safe on the real OS networking stack because there is no way to guarantee a port _won't_ be used. Instead, we introduce an interface and fake implementation for testing.
On order to leave the filtering logic in the test path, this PR also does some refactoring.
Caching logic is left in the real OS querying implementation and a new test case is added for it in this PR.
Closes https://github.com/coder/coder/issues/20711
We now allow agents to be created on dormant workspaces.
I've ran the test with and without the change. I've confirmed that -
without the fix - it triggers the "rbac: unauthorized" error.
Addresses [`aibridge#54`](https://github.com/coder/aibridge/issues/54)
When querying against the values in the database for
`/api/experimental/aibridge/interceptions` we found strange behaviour
wherein there was interceptions that lacked prompting and other various
fields we want. Generally this was as a result of the data not actually
existing for these values (as they were inflight).
The simple solution to this was to hide them if they didn't exist. This
PR addresses that.
---------
Co-authored-by: Danny Kopping <danny@coder.com>
## Problem
Workspaces associated with tasks were not visually distinguishable in
the workspaces list view. Additionally, the list workspaces endpoint was
not returning the `task_id` field.
<img width="2784" height="864" alt="Screenshot 2025-11-20 at 10 32 22"
src="https://github.com/user-attachments/assets/60704f16-3c66-4553-9215-f10654998a38"
/>
## Changes
- Fix `ConvertWorkspaceRows` to include `task_id` in the list workspaces
endpoint response
- Add "Task" icon to the workspace list view for workspaces associated
with tasks
- Add test to verify `task_id` is correctly returned by the list
workspaces endpoint
- Add Storybook story to showcase the Task icon in the workspace list
Closes https://github.com/coder/coder/issues/20802
Fixes: #20744
Upsert audit and connection log entries with a context derived from the API context, rather than the individual request so that we don't error out if the request is canceled or the client hangs up (e.g. if we return an error).
Database insert errors will fail the transaction. So this error is
fatal. Properly return it for a better error call stack, and not just
hiding the error in the logs.
The test flake can be verified by setting `ReportInterval` to a really
low value, like `100 * time.Millisecond`.
We now set it to a really high value to avoid triggering flush without
manually calling the function in test. This can easily happen because
the default value is 30s and we run tests in parallel. The assertion
typically happens such that:
[use workspace] -> [fetch previous last used] -> [flush]
-> [fetch new last used]
When this edge case is triggered:
[use workspace] -> [report interval flush]
-> [fetch previous last used] -> [flush] -> [fetch new last used]
In this case, both the previous and new last used will be the same,
breaking the test assertion.
Fixescoder/internal#960Fixescoder/internal#975
## Problem
With the new tasks data model, a task starts with an `initializing`
status. However, the API returns `current_state: null` to represent the
agent state, causing the frontend to display "No message available".
This PR updates `codersdk.Task` to return a `current_state` when the
task is initializing with meaningful messages about what's happening
during task initialization.
**Previous message**
<img width="2764" height="288" alt="Screenshot 2025-11-07 at 09 06 13"
src="https://github.com/user-attachments/assets/feec9f15-91ca-4378-8565-5f9de062d11a"
/>
**New message**
<img width="2726" height="226" alt="Screenshot 2025-11-12 at 11 00 15"
src="https://github.com/user-attachments/assets/2f9bee3e-7ac4-4382-b1c3-1d06bbc2906e"
/>
## Changes
- Populate `current_state` with descriptive initialization messages when
task status is `initializing` and no valid app status exists for the
current build
- **dbfake**: Fix `WorkspaceBuild` builder to properly handle
pending/running jobs by linking tasks without requiring agent/app
resources
**Note:** UI Storybook changes to reflect these new messages will be
addressed in a follow-up PR.
Closes: https://github.com/coder/internal/issues/1063
This change restructures the `tasks_with_status` view query to:
- Improve debuggability by adding a `status_debug` column to better
understand the outcome
- Reduce clutter from `bool_or`, `bool_and` which are aggregate
functions that did not actually have serve a purpose (each join is 0-1
rows)
- Improve agent lifecycle state coverage, `start_timeout` and
`start_error` were omitted
- These states are easy to trigger even in a perfectly functioning
workspace/task so we now rely on app health to report whether or not
there was an issue
- Mark canceling and canceled workspace build jobs as error state
- Agent stop states were implicitly `unknown`, now there are explicit (I
initially considered `error`, could go either way)
Previously, `UpdateWorkspaceTimingsMetrics` would log a warning for
workspace builds that aren't tracked (restarts, stops, subsequent builds
after creation). This was noisy since these are legitimate operations,
not errors.
`UpdateWorkspaceTimingsMetrics` is specifically designed to track only
workspace creation, prebuild creation, and prebuild claim timings.
Related with: https://github.com/coder/coder/pull/20772
`prometheus.provisionerd_server_metrics: unsupported workspace timing
flags` appears in the logs, but without knowledge of the available flags
it's not possible to troubleshoot this.
Signed-off-by: Danny Kopping <danny@coder.com>
For experimental and dogfood purposes, this adds the ability to opt in a single template.
Leaving the rest of the templates as is.
For GA, this setting might be removed or changed.
Prior to this, every workspace build ran `terraform init` in a fresh
directory. This would mean the `modules` are downloaded fresh. If the
module is not pinned, subsequent workspace builds would have different
modules.
Experiments passed to provisioners to determine behavior. This adds
`--experiments` flag to provisioner daemons. Prior to this, provisioners
had no method to turn on/off experiments.
Adds some extra meta data sent to provisioners. Also adds a field
`reuse_terraform_workspace` to tell the provisioner whether or not to
use the caching experiment.
Currently, when AI Bridge is enabled AND the `oauth2` and
`mcp-server-http` experiments are enabled we inject Coder's MCP tools
into all intercepted AI Bridge requests.
This PR introduces a config to control this behaviour.
**NOTE:** this is a backwards-incompatible change; previously these
tools would be injected automatically, now this setting will need to be
explicitly enabled.
---------
Signed-off-by: Danny Kopping <danny@coder.com>
Somewhat minor inefficiency in notifications I discovered during a scaletest where I was creating many users. Our `GetUsers` query filter for rbac roles uses the `&&` operator on arrays, which is the intersection of the two arrays. Despite that, we were making seperate DB queries for each role, and then collating the results. I didn't see any other instances of this.
The test changes are required as the order of outgoing notifications is now non-deterministic.
Relates to https://github.com/coder/internal/issues/1098
Currently AgentAPI waits for only 2 seconds worth of identical terminal
screen snapshots before deciding a task has entered a "stable" state. We
interpret this as becoming "idle", resulting in a notification being
triggered. This behavior is not ideal and is ultimately the root cause
of our spammy notifications.
Unfortunately, until we move AgentAPI to either use the Claude Code SDK
(or ACP wrapper around it), we are unable to easily fix the root cause.
This PR instead waits until the agent is ready before it will send state
change notifications. This will at least resolve _some_ of the
complaints about task state notifications being too spammy.
---
🤖 PR was written by Claude Sonnet 4.5 using [Coder
Mux](https://github.com/coder/cmux) and reviewed by a human 👩