Commit Graph

3058 Commits

Author SHA1 Message Date
Spike Curtis 71c6dc4043 fix: stop disconnecting from coderd early and record disconnect correctly (#21250)
fixes https://github.com/coder/internal/issues/1196

The above issue exposes two different bugs in Coder.

In the agent, there is a race where if the agent is closed while starting up networking, it will erroneously disconnect from Coderd, which delays or breaks writing final status and logs.

In Coderd, there is a bug where we don't properly record the latest agent disconnection time if the agent had previously disconnected. This causes us to report the agent status as "Connected" even after it has disconnected up until the inactivity timeout fires.

This PR fixes both issues.

It also slightly reworks when we send workspace updates based on connection and disconnection. Previously we would send two updates when the agent connected in certain circumstances, even though the status would be the same in both (only times changed).  Now we universally only send one on connect, and then another on disconnect.
2025-12-15 12:04:01 +04:00
Asher 27f0413347 feat: add flag to disable template insights (#20940)
Closes #20399 

To summarize the original commit messages:

- Do not log stats to the database.
- Return errors on the insight endpoints.
- Update the frontend to show those errors.
- Also fixes an issue with getting the user status count via codersdk,
since I added a test to ensure it was not disabled by this flag and it
was sending the wrong payload.
2025-12-14 03:00:03 +00:00
Mathias Fredriksson 761dd55ee8 fix(coderd/database): sort template version variables and fix test flake (#21233)
Previously the GetTemplateVersionVariables query did not sort output,
relying on PostgreSQL on-disk ordering which is undeterministic.

Variables are now sorted by name because there is no alternative for
ordering.

Tests were adjusted to accommodate the new ordering, previously they
relied on data being written to disk in insert order.
2025-12-12 11:41:46 +00:00
Danielle Maywood f45a179181 test: move context to after db creation (#21224)
Closes  https://github.com/coder/internal/issues/1040

We move the context to just before it is used to avoid the scenario
where NewDB takes a while to spin up and runs up the context to the
deadline.
2025-12-11 21:51:16 +00:00
Kacper Sawicki 6f86f67754 feat(coderd): add overload protection with rate limiting and concurrency control (#21161)
## Summary

This adds configurable overload protection to the AI Bridge daemon to
prevent the server from being overwhelmed during periods of high load.

Partially addresses coder/internal#1153 (rate limits and concurrency
control; circuit breakers are deferred to a follow-up).

## New Configuration Options

| Option | Environment Variable | Description | Default |
|--------|---------------------|-------------|---------|
| `--aibridge-max-concurrency` | `CODER_AIBRIDGE_MAX_CONCURRENCY` |
Maximum number of concurrent AI Bridge requests. Set to 0 to disable
(unlimited). | `0` |
| `--aibridge-rate-limit` | `CODER_AIBRIDGE_RATE_LIMIT` | Maximum number
of AI Bridge requests per second. Set to 0 to disable rate limiting. |
`0` |

## Behavior

When limits are exceeded:
- **Concurrency limit**: Returns HTTP `503 Service Unavailable` with
message "AI Bridge is currently at capacity. Please try again later."
- **Rate limit**: Returns HTTP `429 Too Many Requests` with
`Retry-After` header.

Both protections are optional and disabled by default (0 values).

## Implementation

The overload protection is implemented as reusable middleware in
`coderd/httpmw/ratelimit.go`:

1. **`RateLimitByAuthToken`**: Per-user rate limiting that uses
`APITokenFromRequest` to extract the authentication token, with fallback
to `X-Api-Key` header for AI provider compatibility (e.g., Anthropic).
Falls back to IP-based rate limiting if no token is present. Includes
`Retry-After` header for backpressure signaling.
2. **`ConcurrencyLimit`**: Uses an atomic counter to track in-flight
requests and reject when at capacity.

The middleware is applied in `enterprise/coderd/aibridge.go` via
`r.Group` in the following order:
1. Concurrency check (faster rejection for load shedding)
2. Rate limit check

**Note**: Rate limiting currently applies to all AI Bridge requests,
including pass-through requests. Ideally only actual interceptions
should count, but this would require changes in the aibridge library.

## Testing

Added comprehensive tests for:
- Rate limiting by auth token (Bearer token, X-Api-Key, no token
fallback to IP)
- Different tokens not rate limited against each other
- Disabled when limit is zero
- Retry-After header is set on 429 responses
- Concurrency limiting (allows within limit, rejects over limit,
disabled when zero)
2025-12-11 16:38:54 +01:00
Danielle Maywood 8ead6f795d test: close provisioner before creating workspace build (#21219)
Closes https://github.com/coder/internal/issues/1178

I verified the fix works by adding a `time.Sleep(100 *time.Millisecond)` 
between the `CreateWorkspaceBuild` and`CancelWorkspaceBuild` 
calls. Adding this reliably triggered the flake, and when I added the fix 
the flake stopped happening.
2025-12-11 13:36:33 +00:00
Danielle Maywood c3224b793e fix: handle scenario where provisionerdserver deletes task before coderd (#21220) 2025-12-11 13:04:13 +00:00
Callum Styan 8ed1c1d372 perf: reduce calls to GetWorkspaceByAgentID in GetWorkspaceAgentByID (#21046)
This PR piggy backs on the agent API cached workspace added in an earlier PR to provide a fast path for avoiding `GetWorkspaceByAgentID` calls in dbauthz's `GetWorkspaceAgentByID`. This query is not the most expensive, but has a significant call volume at ~16 million calls per week.

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2025-12-10 14:03:24 -08:00
George K da71e546bb chore: fix test errors on newer debian-based systems due to deprecated TZ (#21115)
It appears on newer Debian systems `Canada/Newfoundland` TZ is not
present and `America/St_Johns` should be used instead. Coder tests use a
docker PG image where `Canada/Newfoundland` is still supported:

```
$ docker run --rm -it us-docker.pkg.dev/coder-v2-images-public/public/postgres:17 bash
root@ca99e82721dc:/# ls -l /usr/share/zoneinfo/Canada/Newfoundland
lrwxrwxrwx 1 root root 19 Mar 26  2025 /usr/share/zoneinfo/Canada/Newfoundland -> ../America/St_Johns
```
However, if a local PG instance is running on a Debian Trixie host,
coder test will use it and error out due to the zone being unavailable:

```
$ docker run --rm -it debian:trixie bash
root@f285092767e4:/# ls -l /usr/share/zoneinfo/Canada/Newfoundland
ls: cannot access '/usr/share/zoneinfo/Canada/Newfoundland': No such file or directory
root@f285092767e4:/# ls -l /usr/share/zoneinfo/America/St_Johns
-rw-r--r-- 1 root root 3655 Aug 24 20:12 /usr/share/zoneinfo/America/St_Johns
```

... which causes the tests to error out:

```
$ go test ./enterprise/coderd
--- FAIL: TestWorkspaceTemplateParamsChange (0.13s)
    workspaces_test.go:3097: TestWorkspaceTagsTerraform: using cached terraform providers
    workspaces_test.go:3097: Set TF_CLI_CONFIG_FILE=/home/geo/.cache/coderv2-test/terraform_workspace_tags_test/a28ed341dee8/terraform.rc
    coderdenttest.go:84:
        	Error Trace:	/home/geo/coder/coderd/database/dbtestutil/db.go:161
        	            				/home/geo/coder/coderd/database/dbtestutil/db.go:122
        	            				/home/geo/coder/coderd/coderdtest/coderdtest.go:270
        	            				/home/geo/coder/enterprise/coderd/coderdenttest/coderdenttest.go:105
        	            				/home/geo/coder/enterprise/coderd/coderdenttest/coderdenttest.go:84
        	            				/home/geo/coder/enterprise/coderd/coderdenttest/coderdenttest.go:84
        	            				/home/geo/coder/enterprise/coderd/workspaces_test.go:3103
        	Error:      	Received unexpected error:
        	            	pq: invalid value for parameter "TimeZone": "Canada/Newfoundland"
        	Test:       	TestWorkspaceTemplateParamsChange
        	Messages:   	failed to set timezone for database
...
```

This commit replaces the problematic TZ with the canonical one.
2025-12-10 08:09:13 -08:00
Callum Styan 27c3ec072e perf: support fastpath in dbauthz GetLatestWorkspaceBuildByWorkspaceID (#21047)
This PR piggy backs on the agent API cached workspace added in earlier PRs to provide a fast path for avoiding `GetWorkspaceByID` calls in `GetLatestWorkspaceBuildByWorkspaceID` via injection of the workspaces RBAC object into the context. We can do this from the `agentConnectionMonitor` easily since we already cache the workspace.

---------

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2025-12-09 15:53:52 -08:00
Callum Styan a59a84b2a7 perf: optimize GetTemplateAppInsightsByTemplate by pre-filtering on start/end times (#20669)
In this PR we're optimizing the `GetTemplateAppInsightsByTemplate` query
by pre-filtering out apps which do not have an active session during the
start/end time window.

---------

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2025-12-09 15:21:16 -08:00
Callum Styan 6abb889fab perf: optimize GetDeploymentWorkspaceAgentStats by eliminating 2nd select (#21112)
Tracking issue here: https://github.com/coder/internal/issues/1009

To summarize, the current version of this query selects from
`workspace_agent_stats` twice. The expensive portion of this query is
the bitmap heap scan we have to do for each of these selects. We can
easily cut the cost of this query by 40-50% by cutting this down to a
single select, and using those rows for both sets of calculations.

Eliminating the heap scan itself would require a follow up PR to
introduce a new index. Blink helped with the rewrite of the query.

The current plan looks like this:
```
 Nested Loop  (cost=6101.64..6101.69 rows=1 width=64) (actual time=11.782..11.787 rows=1 loops=1)
   ->  Aggregate  (cost=2996.17..2996.19 rows=1 width=32) (actual time=3.356..3.357 rows=1 loops=1)
         ->  Bitmap Heap Scan on workspace_agent_stats  (cost=54.80..2992.86 rows=440 width=24) (actu
al time=0.346..2.927 rows=818 loops=1)
               Recheck Cond: (created_at > (now() - '00:15:00'::interval))
               Filter: (connection_median_latency_ms > '0'::double precision)
               Rows Removed by Filter: 1070
               Heap Blocks: exact=486
               ->  Bitmap Index Scan on idx_agent_stats_created_at  (cost=0.00..54.69 rows=1368 width
=0) (actual time=0.241..0.241 rows=1888 loops=1)
                     Index Cond: (created_at > (now() - '00:15:00'::interval))
   ->  Aggregate  (cost=3105.47..3105.49 rows=1 width=32) (actual time=8.418..8.420 rows=1 loops=1)
         ->  Subquery Scan on a  (cost=3060.95..3105.39 rows=7 width=32) (actual time=7.851..8.394 ro
ws=63 loops=1)
               Filter: (a.rn = 1)
               ->  WindowAgg  (cost=3060.95..3088.29 rows=1368 width=209) (actual time=7.850..8.382 r
ows=63 loops=1)
                     Run Condition: (row_number() OVER (?) <= 1)
                     ->  Sort  (cost=3060.93..3064.35 rows=1368 width=56) (actual time=7.836..8.036 r
ows=1888 loops=1)
                           Sort Key: workspace_agent_stats_1.agent_id, workspace_agent_stats_1.create
d_at DESC
                           Sort Method: quicksort  Memory: 181kB
                           ->  Bitmap Heap Scan on workspace_agent_stats workspace_agent_stats_1  (co
st=55.03..2989.67 rows=1368 width=56) (actual time=0.388..2.096 rows=1888 loops=1)
                                 Recheck Cond: (created_at > (now() - '00:15:00'::interval))
                                 Heap Blocks: exact=486
                                 ->  Bitmap Index Scan on idx_agent_stats_created_at  (cost=0.00..54.
69 rows=1368 width=0) (actual time=0.295..0.295 rows=1888 loops=1)
                                       Index Cond: (created_at > (now() - '00:15:00'::interval))
 Planning Time: 2.350 ms
 Execution Time: 13.152 ms
(24 rows)
```

The new plan looks like this
```
 Aggregate  (cost=2966.96..2966.98 rows=1 width=64) (actual time=3.812..3.814 rows=1 loops=1)
   ->  WindowAgg  (cost=2891.96..2916.94 rows=1250 width=88) (actual time=2.696..3.412 rows=1890 loop
s=1)
         ->  Sort  (cost=2891.94..2895.06 rows=1250 width=80) (actual time=2.686..2.780 rows=1890 loo
ps=1)
               Sort Key: workspace_agent_stats.agent_id, workspace_agent_stats.created_at DESC
               Sort Method: quicksort  Memory: 226kB
               ->  Bitmap Heap Scan on workspace_agent_stats  (cost=50.11..2827.64 rows=1250 width=80
) (actual time=0.218..1.551 rows=1890 loops=1)
                     Recheck Cond: (created_at > (now() - '00:15:00'::interval))
                     Heap Blocks: exact=474
                     ->  Bitmap Index Scan on idx_agent_stats_created_at  (cost=0.00..49.80 rows=1250
 width=0) (actual time=0.146..0.147 rows=1890 loops=1)
                           Index Cond: (created_at > (now() - '00:15:00'::interval))
 Planning Time: 0.534 ms
 Execution Time: 3.969 ms
(12 rows)
```

If we compare the results of the query they're similar enough that any
differences can be attributed to slightly different timestamps for
`now()` in the version of the query I am using to generate results for
comparison:
```
 workspace_rx_bytes | workspace_tx_bytes | workspace_connection_latency_50 | workspace_connection_latency_95 | session_count_vscode | session_count_ssh | session_count_jetbrains | session_count_reconnecting_pty 
--------------------+--------------------+---------------------------------+---------------------------------+----------------------+-------------------+-------------------------+--------------------------------
           15263563 |           74555854 |                          47.933 |                        250.5522 |                  239 |                59 |                       3 |                              3
(1 row)

 workspace_rx_bytes | workspace_tx_bytes | workspace_connection_latency_50 | workspace_connection_latency_95 | session_count_vscode | session_count_ssh | session_count_jetbrains | session_count_reconnecting_pty 
--------------------+--------------------+---------------------------------+---------------------------------+----------------------+-------------------+-------------------------+--------------------------------
           15295819 |           74598410 |                          47.933 |                        250.5522 |                  239 |                59 |                       3 |                              3           
```

---------

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2025-12-09 15:19:55 -08:00
George K 4379230a27 feat: add deployment-wide option to disable workspace sharing (#21172)
Adds `--disable-workspace-sharing` option.

Workspace sharing is disabled by not including user and group ACLs in
the workspace RBAC object, which prevents ACL-based authz.

Closes https://github.com/coder/internal/issues/1072

The commit also adds saving of workspace user/group ACLs in the test DB
data generator.
2025-12-09 08:13:09 -08:00
blinkagent[bot] 4844c978d8 fix: improve task naming prompt to avoid URL content guessing (#21151)
Previously, when a user created a task with a URL-only prompt (e.g.,
`Let's work on https://github.com/coder/coder/issues/21138`), the LLM
would hallucinate what the URL content might be about - generating names
like "Fix GitHub Actions workflow issue" when the actual issue was
unrelated.

Add examples to the task naming system prompt showing expected behavior
for GitHub issue and PR URLs, teaching the model to use visible URL
parts (repo name, issue/PR number) rather than guessing content.

Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
2025-12-09 09:10:54 -06:00
Dean Sheather b199eb1c38 fix: allow stops and deletes after breaching AI limit (#21186)
Fixes a bug a customer encountered once they breached their limit. Adds
a test.
2025-12-09 11:05:12 +00:00
Asher 3a0e8af6e3 feat: add view workspace button to app error page (#20960)
Closes #19984 

As part of this, I refactored the error template to take in a slice of
actions rather than using individual booleans and strings to control the
behavior.

We decided a link resolves the issue for now so that is what I added,
although we may want to consider a way to start the workspace and follow
the logs dynamically on that page and then show the app when finished
(similar to the tasks page), or at least make the link automatically
start the workspace instead of only taking you to the dashboard where
you have to then start the workspace.
2025-12-08 14:16:00 -09:00
blinkagent[bot] 50d42ab0b9 docs: document 200 OK response for upload file API when file exists (#21071)
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
2025-12-08 16:04:56 -06:00
blinkagent[bot] b4be5bcfed docs: fix swagger tags for license endpoints (#21101)
## Summary

Change `@Tags` from `Organizations` to `Enterprise` for `POST /licenses`
and `POST /licenses/refresh-entitlements` to match the `GET` and
`DELETE` license endpoints which are already tagged as `Enterprise`.

## Problem

The license API endpoints were inconsistently tagged in the swagger
annotations:
- `GET /licenses` → `Enterprise` ✓
- `DELETE /licenses/{id}` → `Enterprise` ✓
- `POST /licenses` → `Organizations` ✗
- `POST /licenses/refresh-entitlements` → `Organizations` ✗

This caused the POST endpoints to be documented in the [Organizations
API docs](https://coder.com/docs/reference/api/organizations) instead of
the [Enterprise API
docs](https://coder.com/docs/reference/api/enterprise) where the other
license endpoints live.

## Fix

Simply updated the `@Tags` annotation from `Organizations` to
`Enterprise` for both POST endpoints.

This was an oversight from the original swagger docs addition in #5625
(January 2023).

Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
2025-12-05 15:27:22 +00:00
Callum Styan 83dbf73dde perf: don't calculate build times for deleted templates (#21072)
The metrics cache to calculate and expose build time metrics for
templates currently calls `GetTemplates`, which returns all templates
even if they are deleted. We can use the `GetTemplatesWithFilter` query
to easily filter out deleted templates from the results, and thus not
call `GetTemplateAverageBuildTime` for those deleted templates. Delete
time for workspaces for non-deleted templates is still calculated.

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2025-12-04 10:27:56 -08:00
Mathias Fredriksson cfdd4a9b88 perf(coderd/database): add index on workspace_app_statuses.app_id (#21099) 2025-12-04 17:56:13 +02:00
Mathias Fredriksson 532a1f3054 fix(coderd): exclude sub-agents from workspace health calculation (#21098) 2025-12-04 15:38:24 +02:00
Spike Curtis 40df21ed62 fix: fixes use of possibly nil RemoteAddr() and LocalAddr() return values (#21076)
fixes: https://github.com/coder/internal/issues/1143

Both gVisor and the Go standard library implementations of `net.Conn` can under certain circumstances return `nil` for `RemoteAddr()` and `LocalAddr()` calls. If we call their methods, we segfault.

This PR fixes these calls and adds ruleguard rules.

Note that `slog.F("remote_addr", conn.RemoteAddr())` is fine because slog detects the `nil` before attempting to stringify the type.
2025-12-03 15:06:00 +04:00
Mathias Fredriksson ad93262d07 fix(coderd/database/dbpurge): allow disabling AI Bridge retention with 0 (#21062)
Previously setting AI Bridge retention to 0 would cause records to be
deleted immediately since we didn't check for the zero value before
calculating the deletion threshold.

This adds a check for aibridgeRetention > 0 to skip deletion when
retention is disabled, matching the pattern used for other retention
settings (connection logs, audit logs, etc.).

Also fixes the return type of DeleteOldAIBridgeRecords from int32 to
int64 since COUNT(*) returns bigint in PostgreSQL.

Refs #21055
2025-12-03 09:37:18 +00:00
Mathias Fredriksson ff46917e62 feat: add retention config for workspace_agent_logs (#21039)
Replace hardcoded 7-day retention for workspace agent logs with
configurable retention from deployment settings. Defaults to 7d to
preserve existing behavior.

Depends on #21038
Updates #20743
2025-12-02 16:01:33 +00:00
Mathias Fredriksson 9ec90cf2e7 feat(coderd/database/dbpurge): make API keys retention configurable (#21037)
Replace hardcoded 7-day retention for expired API keys with configurable
retention from deployment settings. Skips deletion entirely when effective
retention is 0.

Depends on #21021
Updates #20743
2025-12-02 15:41:38 +00:00
Mathias Fredriksson c85d79bcdb feat(coderd/database/dbpurge): add retention for audit logs (#21025)
Add configurable retention policy for audit logs. The DeleteOldAuditLogs
query excludes deprecated connection events (connect, disconnect, open,
close) which are handled separately by DeleteOldAuditLogConnectionEvents.

Disabled (0) by default.

Depends on #21021
Updates #20743
2025-12-02 16:50:09 +02:00
Mathias Fredriksson 9ebcca5b0d feat(coderd/database/dbpurge): add retention for connection logs (#21022)
Add `DeleteOldConnectionLogs` query and integrate it into the `dbpurge`
routine. Retention is controlled by `--retention-connection-logs` flag.
Disabled (0) by default.

Depends on #21021
Updates #20743
2025-12-02 14:17:52 +00:00
Mathias Fredriksson 56e7858570 feat(coderd): add retention policy configuration (#21021)
Add `RetentionConfig` with server flags for configuring data retention:

- `--audit-logs-retention`: retention for audit log entries
- `--connection-logs-retention`: retention for connection logs
- `--api-keys-retention`: retention for expired API keys (default 7d)

Updates #20743
2025-12-02 16:04:06 +02:00
Ethan 645da33767 test: fix TestDescCacheTimestampUpdate flake (#20975)
## Problem

`TestDescCacheTimestampUpdate` was flaky on Windows CI because
`time.Now()` has ~15.6ms resolution, causing consecutive calls to return
identical timestamps.

## Solution

Inject `quartz.Clock` into `MetricsAggregator` using an options pattern,
making the test deterministic by using a mock clock with explicit time
advancement.

### Changes
- Add `clock quartz.Clock` field to `MetricsAggregator` struct
- Add `WithClock()` option for dependency injection
- Replace all `time.Now()` calls with `ma.clock.Now()`
- Update test to use mock clock with `mClock.Advance(time.Second)`

---

This PR was fully generated by [`mux`](https://github.com/coder/mux)
using Claude Opus 4.5, and reviewed by me.

Closes https://github.com/coder/internal/issues/1146
2025-12-02 10:53:36 +11:00
Susana Ferreira f8d9a8046f feat: add notification warning alert to Tasks page (#20900)
## Problem

Users may not realize that task notifications are disabled by default.
To improve awareness, we show a warning alert on the Tasks page when all
task notifications are disabled.

**Alert visibility logic:**
- Shows when **all** task notification templates (Task Working, Task
Idle, Task Completed, Task Failed) are disabled
- Can be dismissed by the user, which stores the dismissal in the user
preferences API
- If the user later enables any task notification in Account Settings,
the dismissal state is cleared so the alert will show again if they
disable all notifications in the future

<img width="2980" height="1588" alt="Screenshot 2025-11-25 at 17 48 17"
src="https://github.com/user-attachments/assets/316bf097-d9d2-4489-bc16-2987ba45f45c"
/>

## Changes

- Added a warning alert to the Tasks page when all task notifications
are disabled
- Introduced new `/users/{user}/preferences` endpoint to manage user
preferences (stored in `user_configs` table)
- Alert is dismissible and stores the dismissal state via the new user
preferences API endpoint
- Enabling any task notification in Account Settings clears the
dismissal state via the preferences API
- Added comprehensive Storybook stories for both TasksPage and
NotificationsPage to test all alert visibility states and interactions

Closes: https://github.com/coder/internal/issues/1089
2025-11-28 16:50:59 +00:00
Callum Styan d22d34e45b fix: pass context with authorization to agentapi (#20959)
The agentapi context needs to be a context with some amount of
authorization attached to it via the context so that the cache refresh
routine can fetch the workspace from the db via GetWorkspaceForAgentID.

---------

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2025-11-26 14:53:16 -08:00
Danielle Maywood e7dbbcde87 fix: do not notify marked for deletion for deleted workspaces (#20937)
Closes https://github.com/coder/coder/issues/20913

I've ran the test without the fix, verified the test caught the issue,
then applied the fix, and confirmed the issue no longer happens.

---

🤖 PR was initially written by Claude Opus 4.5 Thinking using Claude Code
and then review by a human 👩
2025-11-26 09:23:16 +00:00
Mykyta Protsenko c87c33f7dd perf: add index to improve the GetWorkspaceAgentByInstanceID query performance (#20936)
## Context

GetWorkspaceAgentByInstanceID has a suboptimal plan. Even though it is
designed to fetch a small subset of records, there are no corresponding
indexes and that query results in full table scan:

Query:

```
SELECT id, auth_instance_id FROM workspace_agents
where auth_instance_id='i-013c2b96b6441648a' and deleted=FALSE;
```

Plan:

```
------------------------------------------------------------------------------------------------------------------
 Seq Scan on workspace_agents  (cost=0.00..222325.48 rows=2 width=36) (actual time=0.012..234.152 rows=4 loops=1)
   Filter: ((NOT deleted) AND ((auth_instance_id)::text = 'i-013c2b96b6441648a'::text))
   Rows Removed by Filter: 302276
 Planning Time: 0.173 ms
 Execution Time: 234.169 ms
```

After adding the index, the plan improves drastically.

Updated plan:

```
 Bitmap Heap Scan on workspace_agents  (cost=4.44..12.32 rows=2 width=36) (actual time=0.019..0.019 rows=0 loops=1)
   Recheck Cond: (((auth_instance_id)::text = 'i-013c2b96b6441648a'::text) AND (NOT deleted))
   ->  Bitmap Index Scan on workspace_agents_auth_instance_id_deleted_idx  (cost=0.00..4.44 rows=2 width=0) (actual time=0.013..0.014 rows=0 loops=1)
         Index Cond: (((auth_instance_id)::text = 'i-013c2b96b6441648a'::text) AND (deleted = false))
 Planning Time: 0.388 ms
 Execution Time: 0.044 ms
```

## Changes

* add an index to optimize this query

## Testing

* ran the queries manually against prod and test DBs
* ran `./scripts/develop.sh`, connected to the local PostgreSQL
instance, inspected the indexes to make sure new index is there:

```
Indexes:
    "workspace_agents_pkey" PRIMARY KEY, btree (id)
    // NEW INDEX CREATED SUCCESSFULLY  [comment is mine]
    "workspace_agents_auth_instance_id_deleted_idx" btree (auth_instance_id, deleted)
    "workspace_agents_auth_token_idx" btree (auth_token)
    "workspace_agents_resource_id_idx" btree (resource_id)
```

---------

Signed-off-by: Danny Kopping <danny@coder.com>
Co-authored-by: Danny Kopping <danny@coder.com>
2025-11-26 05:57:25 +02:00
George K a9261577bc perf: optimize migration 371 to run faster on large deployments (#20906)
closes https://github.com/coder/coder/issues/20899

This is in response to a migration in v2.27 that takes very long on
deployments with large `api_keys` tables.

NOTE: The optimization causes the _up_ migration to delete old data
(keys that expired more than 7 days ago). The _down_ migration won't
resurrect the deleted data.
2025-11-25 21:44:59 -06:00
Asher c266bb830c chore: add debug logging and recovery to agent api requests (#20785)
This is to debug context timeouts on API requests to the agent.

Because rbac and database cannot be imported in slim, split the logger
middleware into slim and non-slim versions and break out the recovery
middleware.
2025-11-25 14:59:20 -09:00
Callum Styan b0e8384b82 perf: reduce DB calls to GetWorkspaceByAgentID via caching workspace info (#20662)
---------

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2025-11-25 14:45:05 -08:00
Mathias Fredriksson e189dc1f81 fix: complete Tasks GA promotion (docs, site) (#20927)
## Summary

Completes the Coder Tasks GA promotion by updating swagger tags and
regenerating API documentation and updating the frontend API structure.

## Related

Follows #20923 and #20921 which promoted Tasks from Beta/Experimental to
GA.

---

🤖 This change was written by Claude Sonnet 4.5 Thinking using
[mux](https://github.com/coder/mux) and reviewed by a human 🏂
2025-11-25 16:46:13 +00:00
Danielle Maywood b255827a52 chore: promote tasks to stable from experimental (#20921)
- Promote tasks from `/api/experimental` to `/api/v2`.
- Move sdk from `ExperimentalClient` to `Client`.
- Update swagger
2025-11-25 15:24:25 +00:00
Mathias Fredriksson 37fc6646ad perf(coderd/database): limit GetLatestWorkspaceAppStatusByAppID to 1 row (#20917)
## Description

This PR fixes an issue where `GetLatestWorkspaceAppStatusesByAppID`
returned an unbounded number of rows for a given app ID, which could
cause performance issues for noisy or long-running AI tasks.

## Impact

This change reduces database query overhead for workspace app status
updates, particularly for busy AI tasks that update their status
frequently. Previously, fetching the latest status would return all
historical statuses, now it returns only the most recent one.

Fixes #20862

---

🤖 This change was written by Claude Sonnet 4.5 Thinking using [mux](https://github.com/coder/mux) and reviewed by a human 🏄🏻‍♂️
2025-11-25 16:56:42 +02:00
Susana Ferreira 3011207519 feat: add display name field for tasks (#20856)
## Problem

Tasks currently only expose a machine-friendly name field (e.g.
`task-python-debug-a1b2`), but this value is primarily an identifier
rather than a clean, descriptive label. We need a separate
display-friendly name for use in the UI.

This PR introduces a new `display_name` field and updates the task-name
generation flow. The Claude system prompt was updated to return valid
JSON with both `name` and `display_name`. The name generation logic
follows a fallback chain (Anthropic > prompt sanitization > random
fallback). To make task names more closely resemble their display names,
the legacy `task-` prefix has been removed. For context, PR
https://github.com/coder/coder/pull/20834 introduced a small Task icon
to the workspace list to help identify workspaces associated to tasks.

## Changes

- Database migration: Added `display_name` column to tasks table
- Updated system prompt to generate both task name and display name as
valid JSON
- Task name generation now follows a fallback chain: Anthropic > prompt
sanitization > random fallback
- Removed `task-` prefix from task names to allow more descriptive names
- Note: PR https://github.com/coder/coder/pull/20834 adds a Task icon to
workspaces in the workspace list to distinguish task-created workspaces

**Note:** UI changes will be addressed in a follow-up PR

Related to: https://github.com/coder/coder/issues/20801
2025-11-25 13:00:59 +00:00
Danielle Maywood 82f525baf3 feat(coderd): add task prompt modification endpoint (#20811)
This PR adds the backend implementation for modifying task prompts. Part
of https://github.com/coder/internal/issues/1084

## Changes

- New `UpdateTaskPrompt` database query to update task prompts
- New PATCH `/api/v2/tasks/{task}/prompt` endpoint

## Notes

This is part 1 of a 2-part PR stack. The frontend UI will be added in a
follow-up PR based on this branch
(https://github.com/coder/coder/pull/20812).

---

🤖 PR was written by Claude Sonnet 4.5 Thinking using [Coder
Mux](https://github.com/coder/cmux) and reviewed by a human 👩
2025-11-25 11:13:32 +00:00
Spike Curtis afd40436f0 fix: mock Agent querying OS for listening ports in tests (#20842)
fixes https://github.com/coder/internal/issues/1123

We want to tests that ports are not included after they are no longer used, but this isn't safe on the real OS networking stack because there is no way to guarantee a port _won't_ be used. Instead, we introduce an interface and fake implementation for testing.

On order to leave the filtering logic in the test path, this PR also does some refactoring.

Caching logic is left in the real OS querying implementation and a new test case is added for it in this PR.
2025-11-25 14:25:24 +04:00
Danielle Maywood c12303f0b2 fix: allow agents to be created on dormant workspaces (#20909)
Closes https://github.com/coder/coder/issues/20711

We now allow agents to be created on dormant workspaces.

I've ran the test with and without the change. I've confirmed that -
without the fix - it triggers the "rbac: unauthorized" error.
2025-11-25 06:24:33 +00:00
Callum Styan 658e8c34a9 perf: improve performance of metricsAggregator path by reducing memory allocations (#20724)
Signed-off-by: Callum Styan <callumstyan@gmail.com>
2025-11-24 15:45:08 -08:00
Jake Howell ca560d36ce fix: remove inflight interceptions from aibridge returned values (#20852)
Addresses [`aibridge#54`](https://github.com/coder/aibridge/issues/54)

When querying against the values in the database for
`/api/experimental/aibridge/interceptions` we found strange behaviour
wherein there was interceptions that lacked prompting and other various
fields we want. Generally this was as a result of the data not actually
existing for these values (as they were inflight).

The simple solution to this was to hide them if they didn't exist. This
PR addresses that.

---------

Co-authored-by: Danny Kopping <danny@coder.com>
2025-11-25 10:23:39 +11:00
Steven Masley cefe07d074 feat: purge expired api keys in dbpurge (#20863)
closes https://github.com/coder/coder/issues/19889

This is in response to a migration in v2.27 that takes very long on deployments with large `api_key` tables.
2025-11-24 10:24:32 -06:00
Atif Ali 636408906f chore(docs): standardize "AIBridge" to "AI Bridge" in documentation (#20831) 2025-11-24 18:09:04 +05:00
Susana Ferreira 2a9afc77de feat: associate task icon with workspaces (#20834)
## Problem

Workspaces associated with tasks were not visually distinguishable in
the workspaces list view. Additionally, the list workspaces endpoint was
not returning the `task_id` field.

<img width="2784" height="864" alt="Screenshot 2025-11-20 at 10 32 22"
src="https://github.com/user-attachments/assets/60704f16-3c66-4553-9215-f10654998a38"
/>

## Changes

- Fix `ConvertWorkspaceRows` to include `task_id` in the list workspaces
endpoint response
- Add "Task" icon to the workspace list view for workspaces associated
with tasks
- Add test to verify `task_id` is correctly returned by the list
workspaces endpoint
- Add Storybook story to showcase the Task icon in the workspace list

Closes https://github.com/coder/coder/issues/20802
2025-11-21 11:47:10 +00:00
Danny Kopping 5a7d4f69f6 feat: add configurable retention for aibridge (#20828)
Closes https://github.com/coder/internal/issues/1134

---------

Signed-off-by: Danny Kopping <danny@coder.com>
2025-11-21 11:35:36 +02:00
Marcin Tojek d004710a74 feat: add prebuild invalidation via last_invalidated_at timestamp (#20582)
Updates #17917
2025-11-20 17:12:25 +01:00