## Summary
This adds configurable overload protection to the AI Bridge daemon to
prevent the server from being overwhelmed during periods of high load.
Partially addresses coder/internal#1153 (rate limits and concurrency
control; circuit breakers are deferred to a follow-up).
## New Configuration Options
| Option | Environment Variable | Description | Default |
|--------|---------------------|-------------|---------|
| `--aibridge-max-concurrency` | `CODER_AIBRIDGE_MAX_CONCURRENCY` |
Maximum number of concurrent AI Bridge requests. Set to 0 to disable
(unlimited). | `0` |
| `--aibridge-rate-limit` | `CODER_AIBRIDGE_RATE_LIMIT` | Maximum number
of AI Bridge requests per second. Set to 0 to disable rate limiting. |
`0` |
## Behavior
When limits are exceeded:
- **Concurrency limit**: Returns HTTP `503 Service Unavailable` with
message "AI Bridge is currently at capacity. Please try again later."
- **Rate limit**: Returns HTTP `429 Too Many Requests` with
`Retry-After` header.
Both protections are optional and disabled by default (0 values).
## Implementation
The overload protection is implemented as reusable middleware in
`coderd/httpmw/ratelimit.go`:
1. **`RateLimitByAuthToken`**: Per-user rate limiting that uses
`APITokenFromRequest` to extract the authentication token, with fallback
to `X-Api-Key` header for AI provider compatibility (e.g., Anthropic).
Falls back to IP-based rate limiting if no token is present. Includes
`Retry-After` header for backpressure signaling.
2. **`ConcurrencyLimit`**: Uses an atomic counter to track in-flight
requests and reject when at capacity.
The middleware is applied in `enterprise/coderd/aibridge.go` via
`r.Group` in the following order:
1. Concurrency check (faster rejection for load shedding)
2. Rate limit check
**Note**: Rate limiting currently applies to all AI Bridge requests,
including pass-through requests. Ideally only actual interceptions
should count, but this would require changes in the aibridge library.
## Testing
Added comprehensive tests for:
- Rate limiting by auth token (Bearer token, X-Api-Key, no token
fallback to IP)
- Different tokens not rate limited against each other
- Disabled when limit is zero
- Retry-After header is set on 429 responses
- Concurrency limiting (allows within limit, rejects over limit,
disabled when zero)
## Summary
Change `@Tags` from `Organizations` to `Enterprise` for `POST /licenses`
and `POST /licenses/refresh-entitlements` to match the `GET` and
`DELETE` license endpoints which are already tagged as `Enterprise`.
## Problem
The license API endpoints were inconsistently tagged in the swagger
annotations:
- `GET /licenses` → `Enterprise` ✓
- `DELETE /licenses/{id}` → `Enterprise` ✓
- `POST /licenses` → `Organizations` ✗
- `POST /licenses/refresh-entitlements` → `Organizations` ✗
This caused the POST endpoints to be documented in the [Organizations
API docs](https://coder.com/docs/reference/api/organizations) instead of
the [Enterprise API
docs](https://coder.com/docs/reference/api/enterprise) where the other
license endpoints live.
## Fix
Simply updated the `@Tags` annotation from `Organizations` to
`Enterprise` for both POST endpoints.
This was an oversight from the original swagger docs addition in #5625
(January 2023).
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
Closes https://github.com/coder/coder/issues/20913
I've ran the test without the fix, verified the test caught the issue,
then applied the fix, and confirmed the issue no longer happens.
---
🤖 PR was initially written by Claude Opus 4.5 Thinking using Claude Code
and then review by a human 👩
Closes https://github.com/coder/coder/issues/20711
We now allow agents to be created on dormant workspaces.
I've ran the test with and without the change. I've confirmed that -
without the fix - it triggers the "rbac: unauthorized" error.
Addresses [`aibridge#54`](https://github.com/coder/aibridge/issues/54)
When querying against the values in the database for
`/api/experimental/aibridge/interceptions` we found strange behaviour
wherein there was interceptions that lacked prompting and other various
fields we want. Generally this was as a result of the data not actually
existing for these values (as they were inflight).
The simple solution to this was to hide them if they didn't exist. This
PR addresses that.
---------
Co-authored-by: Danny Kopping <danny@coder.com>
Fixes https://github.com/coder/internal/issues/1119
## Description
The `CacheTFProviders` function in `testutil/terraform_cache.go` was
only available on Linux and macOS due to the `//go:build linux ||
darwin` build tag. This caused a compile error on Windows when
`enterprise/coderd/workspaces_test.go` tried to call it:
```
enterprise\coderd\workspaces_test.go:3403:28: undefined: testutil.CacheTFProviders
```
## Changes
1. Added `testutil/terraform_cache_windows.go` with a Windows-specific
stub implementation that returns an empty string
2. Updated `downloadProviders` helper in
`enterprise/coderd/workspaces_test.go` to handle empty paths gracefully
## Behavior
- On Linux/macOS: Terraform providers are cached as before
- On Windows: Provider caching is skipped, tests download providers
normally during `terraform init`
## Testing
This should fix the Windows nightly gauntlet failure. The test will
still run on Windows, just without provider caching optimization.
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
Experiments passed to provisioners to determine behavior. This adds
`--experiments` flag to provisioner daemons. Prior to this, provisioners
had no method to turn on/off experiments.
## Problem
Fix race condition in prebuilds reconciler. Previously, a job
notification event was sent to a Go channel before the provisioning
database transaction completed. The notification is consumed by a
separate goroutine that publishes to PostgreSQL's LISTEN/NOTIFY, using a
separate database connection. This creates a potential race: if a
provisioner daemon receives the notification and queries for the job
before the provisioning transaction commits, it won't find the job in
the database.
This manifested as a flaky test failure in `TestReinitializeAgent`,
where provisioners would occasionally miss newly created jobs. The test
uses a 25-second timeout context, while the acquirer's backup polling
mechanism checks for jobs every 30 seconds. This made the race condition
visible in tests, though in production the backup polling would
eventually pick up the job. The solution presented here guarantees that
a job notification is only sent after the provisioning database
transaction commits.
## Changes
* The `provision()` and `provisionDelete()` functions now return the
provisioner job instead of sending notifications internally.
* A new `publishProvisionerJob()` helper centralizes the notification
logic and is called after each transaction completes.
Closes: https://github.com/coder/internal/issues/963
Currently, when AI Bridge is enabled AND the `oauth2` and
`mcp-server-http` experiments are enabled we inject Coder's MCP tools
into all intercepted AI Bridge requests.
This PR introduces a config to control this behaviour.
**NOTE:** this is a backwards-incompatible change; previously these
tools would be injected automatically, now this setting will need to be
explicitly enabled.
---------
Signed-off-by: Danny Kopping <danny@coder.com>
Fixes flaky `TestWorkspaceTagsTerraform` and
`TestWorkspaceTemplateParamsChange` tests that were failing with
`connection reset by peer` errors when downloading the coder/coder
provider.
This applies the same caching solution which was done in
https://github.com/coder/coder/pull/17373
1. Extracts provider caching logic into `testutil/terraform_cache.go`
2. Updates TestProvision to use the shared caching helpers
3. Updates enterprise workspace tests to use the shared caching helpers
The cache is persisted at `~/.cache/coderv2-test/` and automatically
cached between CI runs via existing GitHub Actions cache setup.
Closes https://github.com/coder/internal/issues/607
## Description
The membership reconciliation ensures the prebuilds system user is a
member of all organizations with prebuilds configured. To support
prebuilds quota management, each organization must have a prebuilds
group that the system user belongs to.
## Problem
Previously, membership reconciliation iterated over all presets to check
and update membership status. This meant database queries
`GetGroupByOrgAndName` and `InsertGroupMember` were executed for each
preset. Since presets are unique combinations of `(organization,
template, template version, preset)`, this resulted in several redundant
checks for the same organization.
In dogfood, `InsertGroupMember` was called thousands of times per day,
even though memberships were already configured ([internal Grafana
dashboard link](https://grafana.dev.coder.com/goto/46MZ1UgDg?orgId=1))
<img width="5382" height="1788" alt="Screenshot 2025-10-28 at 16 01 36"
src="https://github.com/user-attachments/assets/757b7253-106f-4f72-8586-8e2ede9f18db"
/>
## Solution
This PR introduces `GetOrganizationsWithPrebuildStatus`, a single query
that returns:
* All unique organizations with prebuilds configured
* Whether the prebuilds user is a member of each organization
* Whether the prebuilds group exists in each organization
* Whether the prebuilds user is in the prebuilds group
The membership reconciliation logic now:
* Fetches status for all organizations in one query
* Only performs inserts for organizations missing required memberships
or groups
* Safely handles concurrent operations via unique constraint violations
* This reduces database load from `O(presets)` to `O(organizations)` per
reconciliation loop, with a single read query when everything is
configured.
## Changes
* Add `GetOrganizationsWithPrebuildStatus` SQL query
* Update `membership.ReconcileAll` to use organization-based
reconciliation instead of preset-based
* Update tests to reflect new behavior
Related to internal thread:
https://codercom.slack.com/archives/C07GRNNRW03/p1760535570381369
## Description
PR https://github.com/coder/coder/pull/20387 introduced canceling
pending prebuild jobs from inactive template versions to avoid
provisioning obsolete workspaces. However, the associated prebuilds
remained in the database with "Canceled" status, visible in the UI.
This PR now orphan-deletes these canceled prebuilt workspaces. Since the
canceled jobs were never processed by a provisioner, no Terraform
resources were created, making orphan deletion safe.
Orphan deletion always creates a provisioner job, but behaves
differently based on provisioner availability:
- If no provisioner daemon is available, the job is immediately marked
as completed and the workspace is marked as deleted without any
provisioner processing
- If a provisioner daemon is available, it processes the delete job with
empty Terraform state (no actual resources to destroy)
The job cancellation and workspace deletion occur atomically in the same
transaction. We don't split this into two separate reconciliation runs
because there's no way to distinguish between system-canceled prebuilds
and user-canceled workspaces. If we deleted canceled workspaces in a
later run, we'd delete user-canceled workspaces that users may want to
keep for troubleshooting.
Note: This only applies to system-generated prebuilds from inactive
template versions.
## Changes
* Update `UpdatePrebuildProvisionerJobWithCancel` query to return job
ID, workspace ID, template ID, and template version preset ID
* Add `DeprovisionMode` enum to support orphan deletion in the provision
flow
* Update `ActionTypeCancelPending` handler to cancel jobs and
orphan-delete associated workspaces atomically
## Description
This PR introduces an optimization to automatically cancel pending
prebuild-related jobs from non-active template versions in the
reconciliation loop.
## Problem
Currently, when a template is configured with more prebuild instances
than available provisioners, the provisioner queue can become flooded
with pending prebuild jobs. This issue is worsened when
provisioning/deprovisioning operations take a long time.
When the prebuild reconciliation loop generates jobs faster than
provisioners can process them, pending jobs accumulate in the queue.
Since prebuilt workspaces should always run the latest active template
version, pending prebuild jobs from non-active versions become obsolete
once a new version is promoted.
## Solution
The reconciliation loop cancels pending prebuild-related jobs from
non-active template versions that match the following criteria:
* Build number: 1 (initial build created by the reconciliation loop)
* Job status: `pending`
* Not yet picked up by a provisioner (`worker_id` is `NULL`)
* Owned by the prebuilds system user
* Workspace transition: `start`
This prevents the queue from being cluttered with stale prebuild jobs
that would provision workspaces on an outdated template version that
would consequently need to be deprovisioned.
## Changes
* Added new SQL query `CountPendingNonActivePrebuilds` to identify
presets with pending jobs from non-active versions
* Added new SQL query `UpdatePrebuildProvisionerJobWithCancel` to cancel
jobs for a specific preset
* New reconciliation action type `ActionTypeCancelPending` handles the
cancellation logic
* Cancellation is non-blocking: failures to cancel prebuild jobs are
logged as errors and don't prevent other reconciliation actions
## Follow-up PR
Canceling pending prebuild jobs leaves workspaces in a Canceled state.
While no Terraform resources need to be destroyed (since jobs were
canceled before provisioning started), these database records should
still be cleaned up. This will be addressed in a follow-up PR.
Closes: https://github.com/coder/coder/issues/20242
This PR uses the same sha256 hashing technique as we use for APIKeys. So
now all randomly generated secrets will be hashed with sha256 for
consistency.
This is a breaking change for the oauth tokens. Since oauth is only
allowed for dev builds and experimental, this is ok.
Thanks to the great work in #20393, we’ve successfully introduced
offset-based pagination for this endpoint. However, the frontend expects
a `count` field in the response rather than `total`. This PR updates the
response payload to rename the returned key to `count` for consistency
with frontend expectations and existing API patterns.
This is necessary to unblock the work in #20331
- Adds FK from `aibridge_interceptions.initiator_id` to `users.id`
- This is enforced by deleting any rows that don't have any users. Since
this is an experimental feature AND coder never deletes user rows I
think this is acceptable.
- Adds `name` as a property on `codersdk.MinimalUser`
- This matches the `visible_users` view in the database. I'm unsure why
`name` wasn't already included given that `username` is.
- Adds a new `initiator` field to `codersdk.AIBridgeInterception` which
contains `codersdk.MinimalUser` (ID, username, name, avatar URL)
- Removes `initiator_id` from `codersdk.AIBridgeInterception`
- Should be fine since we're still in early access
Necessary for the frontend to be able to paginate easily. Cursor
pagination is good for fetching all events, but doesn't play very well
when a pagination component gets involved.
Adds support for `?offset=x` to the existing endpoint. The cursor-based
pagination (`?after_id=x`) is still supported. The two pagination modes
are mutually exclusive, and are documented as such. If both are
supplied, the request will be rejected.
Also adds a `total` property to the response that contains the full
count of items matching the filter. We already have indices in place so
I don't think this will impact performance (or we can revisit it before
GA).
In preparation for adding the "member" permission level, which will also
be grouped by org ID, do a bit of a refactor to make room for it and the
existing "org" level to live in the same `map`
The prebuilds user never initiates a workspace claim autonomously. A
claim can only happen when a user attempts to create a workspace. When
listing prebuild provisioner jobs, it would not make sense to see jobs
related to users who are creating workspaces and have gotten a prebuilt
workspace. When cleaning up an overwhelmed provisioner queue, we should
not delete claims as they have humans waiting for them and are not part
of the thundering herd.
Therefore, this PR ensures that provisioner jobs that claim workspaces
are considered to be initiated by the user, not the prebuilds system.
This PR makes the initial steps at removing usage of the global Go HTTP
client, which was seen to have impacts on test flakiness in
https://github.com/coder/internal/issues/1020. The first commit removes
uses from tests, with the exception of one test that is tightly coupled
to the default client. The second commit makes easy/low-risk removals
from application code. This should have some impact to reduce test flakiness.
Closes https://github.com/coder/internal/issues/268
Wraps the assertions in a `testutil.Eventually` so that hopefully any
transient timing issues resolve themselves. If this does not resolve the
issue, we may need to plumb through some kind of `chan struct{}` into
`api.Entitlements.Update()`
Adds shared_with_user and shared_with_group filters to the /workspaces
endpoint.
- `shared_with_user`: filters workspaces shared with a specific user.
Accepts a user UUID or username.
- `shared_with_group`: filters workspaces shared with a specific group.
Accepts:
- a group UUID, or
- `<organization name>/<group name>`, or
- `<group name>` (resolved in the default organization).
Closes
[coder/internal#1004](https://github.com/coder/internal/issues/1004)
Breaking API Change:
> The presence of the `ip` field on `codersdk.ConnectionLog` cannot be
guaranteed, and so the field has been made optional. It may be omitted
on API responses.
When running a scaletest, I noticed logs of the form:
```
2025-09-12 06:34:10.924 [erro] coderd.workspaceapps: upsert connection log failed trace=0xa17580 span=0xa17620 workspace_id=81b937d7-5777-4df5-b5cb-80241f30326f agent_id=78b2ff6d-b4a6-4a4e-88a7-283e05455a88 app_id=00000000-0000-0000-0000-000000000000 user_id=00000000-0000-0000-0000-000000000000 user_agent="" app_slug_or_port=terminal status_code=404 request_id=67f03cf8-9523-444a-97bc-90de080a54c8 ...
error= 1 error occurred:
* pq: null value in column "ip" of relation "connection_logs" violates not-null constraint
```
to ensure logs are never omitted from the connection log due to a
missing IP again (i.e. I'm not sure if we can always rely on a valid,
parseable, IP from `(http.Request).RemoteAddr`), I've removed the `NOT
NULL` constraint on `ip` on `connection_logs`, and made `ip` on the API
response optional.
The specific cause for these null IPs was the
`/workspaceproxies/me/issue-signed-app-token [post]` endpoint
constructing it's own `http.Request` without a `RemoteAddr` set, and
then passing that to the token issuer.
To solve this, we'll have workspace proxies send the real IP of the
client when calling `/workspaceproxies/me/issue-signed-app-token [post]`
via the header `Coder-Workspace-Proxy-Real-IP`.
Closes: https://github.com/coder/internal/issues/964
This PR addresses the significant database load issue where the
`GetWorkspaces` query was causing performance problems in the license
entitlements code.
see https://github.com/coder/internal/issues/959 but the tl; dr is:
- we call this DB query on an interval (every 15s) and it would be
called on each coderd replica as well
- the generated values update very infrequently (for our most used
internal template I saw the builds created/claimed update twice in a 1h
period)
- we have no index on the initiator ID, so this query has to scan the
entire workspace_builds table on every request
In reality this should likely just be a Prometheus metric, and
Prometheus can handle the counter reset behaviour at query time, but for
now this should at least cut the load of the query to 25% of it's
current impact.
---------
Signed-off-by: Callum Styan <callumstyan@gmail.com>