Fixes: coder/internal#767
Adds two new Prometheus metrics for license health monitoring:
- `coderd_license_warnings` - count of active license warnings
- `coderd_license_errors` - count of active license errors
Metrics endpoint after startup of a deployment with license enabled:
```
...
# HELP coderd_license_errors The number of active license errors.
# TYPE coderd_license_errors gauge
coderd_license_errors 0
...
# HELP coderd_license_warnings The number of active license warnings.
# TYPE coderd_license_warnings gauge
coderd_license_warnings 0
...
```
My previous change to this test did not create another **workspace**
using the template containing `coder_ai_task` resources, meaning that
this test was not actually testing the right thing. This PR addresses
this oversight.
Implements telemetry for boundary usage tracking across all Coder
replicas and reports them via telemetry.
Changes:
- Implement Tracker with Track(), FlushToDB(), and StartFlushLoop() methods
- Add telemetry integration via collectBoundaryUsageSummary()
- Use telemetry lock to ensure only one replica collects per period
The tracker accumulates unique workspaces, unique users, and request
counts (allowed/denied) in memory, then flushes to the database
periodically. During telemetry collection, stats are aggregated across
all replicas and reset for the next period.
Only task workspaces have the checks in wsbuilder for violating the
managed agent caps in the license.
Stopped tasks that are resumed with a regular workspace start **still
count as usage**.
Relates to https://github.com/coder/internal/issues/1282
Updates tracking of managed agents to be predicated instead on the
presence of a related `task_id` instead of the presence of a
`coder_ai_task` resource.
Closes [#1227](https://github.com/coder/internal/issues/1227)
Added support for license addons, starting with AI Governance, to enable
dynamic feature grouping without requiring license reissuance.
### What changed?
- Introduced a new `Addon` type to represent groupings of features that
can be added to licenses
- Created the first addon `AddonAIGovernance` which includes AI Bridge
and Boundary features
- Added validation for addon dependencies to ensure required features
are present
- Added new features: `FeatureBoundary` and
`FeatureAIGovernanceUserLimit`
- Updated license entitlement logic to handle addons and their features
- Added helper methods to check if features belong to addons
- Updated tests to verify addon functionality
### Why make this change?
This change introduces a more flexible licensing model that allows
features to be grouped into addons that can be added to licenses without
requiring reissuance when new features are added to an addon. This is
particularly useful for specialized feature sets like AI Governance,
where related features can be bundled together and sold as a separate
SKU. The addon approach allows for better organization of features and
more granular control over entitlements.
## Summary
AI Bridge is moving to General Availability in v2.30 and will require
the AI Governance Add-On license in future versions. This adds a soft
warning for deployments using AI Bridge via Premium/Enterprise
FeatureSet without an explicit AI Bridge add-on license.
Relates to: https://github.com/coder/internal/issues/1226
## Changes
- Track whether AI Bridge was explicitly granted via license Features
(add-on) vs inherited from FeatureSet
- Show soft warning when AI Bridge is enabled and entitled via
FeatureSet but not via explicit add-on
- Changed AI Bridge enablement from hardcoded `true` to check
`CODER_AIBRIDGE_ENABLED` deployment config
## Behavior Change
AI Bridge is now only marked as "enabled" in entitlements when
`CODER_AIBRIDGE_ENABLED=true` is set in the deployment config.
Previously, it was always enabled for Premium/Enterprise licenses
regardless of the config setting.
This change ensures that users who do not use AI Bridge will not see the
soft warning about the upcoming license requirement.
## Warning Message
> AI Bridge is now Generally Available in v2.30. In a future Coder
version, your deployment will require the AI Governance Add-On to
continue using this feature. Please reach out to your account team or
sales@coder.com to learn more.
## Behavior
| Condition | Warning Shown |
|-----------|---------------|
| AI Bridge disabled | ❌ No |
| AI Bridge enabled + explicit add-on license | ❌ No |
| AI Bridge enabled + Premium/Enterprise FeatureSet (no add-on) | ✅ Yes
|
## Screenshots
### 1. No license
<img width="1708" height="577" alt="image"
src="https://github.com/user-attachments/assets/cbdbfd4d-55de-4d70-8abf-2665f458e96f"
/>
### 2. No license + CODER_AIBRIDGE_ENABLED=true
<img width="1716" height="513" alt="image"
src="https://github.com/user-attachments/assets/344aae76-7703-485f-b568-1f13a1efa48f"
/>
### 3. Premium license + CODER_AIBRIDGE_ENABLED=false
<img width="1687" height="389" alt="image"
src="https://github.com/user-attachments/assets/c2be12b0-1c0f-438d-a293-f9ec9fe6a736"
/>
### 4. Premium license + CODER_AIBRIDGE_ENABLED=true
<img width="1707" height="525" alt="image"
src="https://github.com/user-attachments/assets/1a4640e1-e656-4f9b-bed0-9390cb5d6a84"
/>
## Notes
- TODO comments added to mark code that should be removed when AI Bridge
enforcement is added
- Feature continues to work - this is just a transitional warning (soft
enforcement)
## Description
Fixes a panic that occurs when the prebuilds feature is toggled by
adding/removing a license. The `StoreReconciler` was not unregistering
the `reconciliationDuration` histogram, causing a "duplicate metrics
collector registration attempted" panic when a new reconciler was
created.
## Changes
* Unregister the `reconciliationDuration` histogram in `Stop()`
alongside the existing metrics collector
* Change log level when stopping the reconciler with a cause, since
"entitlements change" is not an error condition
* Add `TestReconcilerLifecycle` to verify the reconciler can be stopped
and recreated with the same prometheus registry
Related to internal slack thread:
https://codercom.slack.com/archives/C07GRNNRW03/p1769116582171379
## Summary
AI Bridge is moving out of Premium as a separate add-on (GA in Feb 3).
Closes https://github.com/coder/internal/issues/1226
## Changes
- Excludes `FeatureAIBridge` from `Enterprise()` and
`FeatureSetPremium.Features()`
- Adds soft warning for deployments with AI Bridge enabled but not
entitled
- Warning is displayed to Auditor/Owner roles in UI banner and CLI
headers
## Warning Message
When AI Bridge is enabled (`CODER_AIBRIDGE_ENABLED=true`) but the
license doesn't include the entitlement:
> AI Bridge has reached General Availability and your Coder deployment
is not entitled to run this feature. Contact your account team
(https://coder.com/contact) for information around getting a license
with AI Bridge.
## Behavior
- The feature remains usable in v2.30 (soft warning only)
- Future versions may include hard enforcement
The removal of that permission from the role broke valid use cases (e.g.
a site owner user creating a workspace owned by a system account and
then trying to share it with another user).
The bulk of the PR is made up of the rollbacks of the previously
introduced test updates necessitated by the removal.
Related to: https://github.com/coder/internal/issues/1285
Agents were losing authentication during workspace shutdown, causing
shutdown scripts to fail. The auth query required agents to belong to
the latest build, but during shutdown a `stop` build becomes latest while
the `start` build's agents are still running.
Modified the auth query to allow `start` build agents to authenticate
temporarily during `stop` execution. The query allows auth when:
- Agent's `start` build job succeeded
- Latest build is `stop` with `pending`/`running` job status
- Builds are adjacent (`stop` is `build_number + 1`)
- Template versions match
Auth closes once `stop` completes.
Renamed `GetWorkspaceAgentAndLatestBuildByAuthToken` to
`GetAuthenticatedWorkspaceAgentAndBuildByAuthToken` since it returns the
agent's build (not always latest) during shutdown.
Closes coder/internal#1249
Fixes#19467
## Description
This PR addresses database connection pool exhaustion during prebuilds
reconciliation by introducing two changes:
* `CanSkipReconciliation`: Filters out presets that don't need
reconciliation before spawning goroutines. This ensures we only create
goroutines for presets that will (_most likely_) perform database
operations, avoiding unnecessary connection pool usage.
* Dynamic `eg.SetLimit`: Limits concurrent goroutines based on the
configured database connection pool size (`CODER_PG_CONN_MAX_OPEN / 2`).
This replaces the previous hardcoded limit of 5, ensuring the
reconciliation loop scales appropriately with the configured pool size
while leaving capacity for other database operations.
## Changes
* Add `CanSkipReconciliation()` method to `PresetSnapshot` that returns
true for inactive presets with no running workspaces, no pending jobs,
or expired prebuilds.
* Add `maxDBConnections` parameter to `NewStoreReconciler` and compute
`reconciliationConcurrency` as half the pool size (minimum 1).
* Add `ReconciliationConcurrency()` getter method to `StoreReconciler`.
* Add `eg.SetLimit(c.reconciliationConcurrency)` to bound concurrent
reconciliation goroutines.
* Add `PresetsTotal` and `PresetsReconciled` to `ReconcileStats` for
observability.
* Add `TestCanSkipReconciliation` unit tests.
* Add `TestReconciliationConcurrency` unit tests.
* Add benchmark tests for reconciliation performance.
## Benchmarks
* `BenchmarkReconcileAll_NoOps`: Tests presets with no reconciliation
actions. All presets are filtered by `CanSkipReconciliation`, resulting
in no goroutines spawned and no database connections used.
* `BenchmarkReconcileAll_ConnectionContention`: Tests presets where all
require reconciliation actions. All presets spawn goroutines, but
concurrency is limited by `eg.SetLimit(reconciliationConcurrency)`.
* `BenchmarkReconcileAll_Mix`: Simulates a realistic scenario with a
large subset of inactive presets (filtered by `CanSkipReconciliation`)
and a smaller subset requiring reconciliation (limited by
`eg.SetLimit`).
Closes: https://github.com/coder/coder/issues/20606
Adds a per-organization setting to disable workspace sharing. When enabled,
all existing workspace ACLs in the organization are cleared and the workspace
ACL mutation API endpoints return `403 Forbidden`.
This complements the existing site-wide `--disable-workspace-sharing` flag by
providing more granular control at the organization level.
Closes https://github.com/coder/internal/issues/1073 (part 2)
---------
Co-authored-by: Steven Masley <Emyrk@users.noreply.github.com>
## Description
Reuses the reconciliation lock transaction for read operations during
prebuilds reconciliation, reducing unnecessary database connections.
## Changes
* Use the lock transaction (`db`) for read operations and `c.store` for
write operations:
* `GetPrebuildsSettings`: now uses `db`
* `SnapshotState`: now uses `db`
* `MembershipReconciler`: continues to use `c.store` (performs write
operations)
* Add comments explaining the transaction model and when to use `db` vs
`c.store`
Related to: https://github.com/coder/coder/pull/20587
Replace the external moby/moby/pkg/namesgenerator dependency with an
internal implementation using gofakeit/v7. The moby package has ~25k
unique name combinations, and with its retry parameter only adds a
random digit 0-9, giving ~250k possibilities. In parallel tests, this
has led to collisions (flakes).
The new internal API at coderd/util/namesgenerator eliminates the
external dependnecy and offers functions with explicit uniqueness
guarantees. This PR also consolidates fragmented name generation in a
few places to use the new package.
| Old (moby/moby) | New |
|-------------------------------------|------------------------|
| namesgenerator.GetRandomName(0) | NameWith("_") |
| namesgenerator.GetRandomName(>0) | NameDigitWith("_") |
| testutil.GetRandomName(t) | UniqueName() |
| testutil.GetRandomNameHyphenated(t) | UniqueNameWith("-") |
namesgenerator package API:
- NameWith(delim): random name, not unique
- NameDigitWith(delim): random name with 1-9 suffix, not unique
- UniqueName(): guaranteed unique via atomic counter
- UniqueNameWith(delim): unique with custom delimiter
Names continue to be docker style `[adjective][delim][surname]`. Unique
names are truncated to 32 characters (preserving the numeric suffix) to
fit common name length limits in Coder.
Related test flakes:
https://github.com/coder/internal/issues/1212https://github.com/coder/internal/issues/118https://github.com/coder/internal/issues/1068
Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example:
```
import (
"context"
"time"
"github.com/prometheus/client_golang/prometheus"
"golang.org/x/xerrors"
"gopkg.in/natefinch/lumberjack.v2"
"cdr.dev/slog/v3"
"github.com/coder/coder/v2/codersdk/agentsdk"
"github.com/coder/serpent"
)
```
3 groups: standard library, 3rd partly libs, Coder libs.
This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.
Upgrades to slog v3 which includes a small, but backward incompatible API change to the acceptible call arguments when logging. This change allows us to verify via compile time type checking that arguments are correct and won't cause a panic, as was possible in slog v1, which this replaces (v2 was tagged but never used in coder/coder).
It also updates dependencies that also use slog and were updated.
I've left the `aibridge` dependency as a commit SHA, under the assumption that the team there (cc @pawbana @dannykopping ) will tag and update the dependency soon and on their own schedule.
Other dependencies, I pushed new tags.
The implementation for prebuilt workspaces is complex and conversations
regarding edge cases and bugs frequently get bogged down by minutiae,
because it's hard to reason about the behaviour of the system.
To alleviate this, I've introduced otel tracing to the StoreReconciler
(see attached). We can now directly observe the behaviour of the
prebuilds system under load in order to inform our decisions.
Traces are terminated at the boundary between prebuilds and workspace
builder, because of prebuilt workspaces' "fire and forget" philosophy
and to prevent span explosion.
<img width="3024" height="1718" alt="image"
src="https://github.com/user-attachments/assets/f9b207be-8f2c-475e-98a8-46ef70bda446"
/>
Delete builds were not deleting resources as the tf state being sent in the apply request was empty.
State removed from apply request and read from the session instead.
Relates to #20925
This PR expands the test coverage of `enterprise/coderd/TestWorkspaceBuild` to also exercise the `postWorkspaceBuilds` handler. Previously, it only exercised the `createWorkspace` handler.
fixes#21303
Update user last_seen_at when we mark them active on login. This prevents a narrow race where they can be re-marked dormant and fail to log in.
Provisioner steps broken into smaller granular actions.
Changes:
- `ExtractArchive` moved to `init` request (was in `configure`)
- Writing `tfstate` moved to `plan` (was in `configure`)
- Moved most plan/apply outputs to `GraphComplete`
closes: https://github.com/coder/internal/issues/858
Similar to https://github.com/coder/coder/pull/19375, this one uses
system permissions for fetching actual user and group data.
Modifies the `workspaces_expanded` view to fetch the required data; this way it's made available to all code paths that make use of it.
Also fixes a bug in a test helper function that can result in `null` being saved to the DB for `user_acl` or `group_acl` and break tests; a defensive check constraint that prevents this is worth a PR, e.g:
`ALTER TABLE workspaces
ADD CONSTRAINT group_acl_is_object CHECK (jsonb_typeof(group_acl) = 'object');`
Also adds missing `OwnerName` in `ConvertWorkspaceRows`.
## Summary
This adds configurable overload protection to the AI Bridge daemon to
prevent the server from being overwhelmed during periods of high load.
Partially addresses coder/internal#1153 (rate limits and concurrency
control; circuit breakers are deferred to a follow-up).
## New Configuration Options
| Option | Environment Variable | Description | Default |
|--------|---------------------|-------------|---------|
| `--aibridge-max-concurrency` | `CODER_AIBRIDGE_MAX_CONCURRENCY` |
Maximum number of concurrent AI Bridge requests. Set to 0 to disable
(unlimited). | `0` |
| `--aibridge-rate-limit` | `CODER_AIBRIDGE_RATE_LIMIT` | Maximum number
of AI Bridge requests per second. Set to 0 to disable rate limiting. |
`0` |
## Behavior
When limits are exceeded:
- **Concurrency limit**: Returns HTTP `503 Service Unavailable` with
message "AI Bridge is currently at capacity. Please try again later."
- **Rate limit**: Returns HTTP `429 Too Many Requests` with
`Retry-After` header.
Both protections are optional and disabled by default (0 values).
## Implementation
The overload protection is implemented as reusable middleware in
`coderd/httpmw/ratelimit.go`:
1. **`RateLimitByAuthToken`**: Per-user rate limiting that uses
`APITokenFromRequest` to extract the authentication token, with fallback
to `X-Api-Key` header for AI provider compatibility (e.g., Anthropic).
Falls back to IP-based rate limiting if no token is present. Includes
`Retry-After` header for backpressure signaling.
2. **`ConcurrencyLimit`**: Uses an atomic counter to track in-flight
requests and reject when at capacity.
The middleware is applied in `enterprise/coderd/aibridge.go` via
`r.Group` in the following order:
1. Concurrency check (faster rejection for load shedding)
2. Rate limit check
**Note**: Rate limiting currently applies to all AI Bridge requests,
including pass-through requests. Ideally only actual interceptions
should count, but this would require changes in the aibridge library.
## Testing
Added comprehensive tests for:
- Rate limiting by auth token (Bearer token, X-Api-Key, no token
fallback to IP)
- Different tokens not rate limited against each other
- Disabled when limit is zero
- Retry-After header is set on 429 responses
- Concurrency limiting (allows within limit, rejects over limit,
disabled when zero)
## Summary
Change `@Tags` from `Organizations` to `Enterprise` for `POST /licenses`
and `POST /licenses/refresh-entitlements` to match the `GET` and
`DELETE` license endpoints which are already tagged as `Enterprise`.
## Problem
The license API endpoints were inconsistently tagged in the swagger
annotations:
- `GET /licenses` → `Enterprise` ✓
- `DELETE /licenses/{id}` → `Enterprise` ✓
- `POST /licenses` → `Organizations` ✗
- `POST /licenses/refresh-entitlements` → `Organizations` ✗
This caused the POST endpoints to be documented in the [Organizations
API docs](https://coder.com/docs/reference/api/organizations) instead of
the [Enterprise API
docs](https://coder.com/docs/reference/api/enterprise) where the other
license endpoints live.
## Fix
Simply updated the `@Tags` annotation from `Organizations` to
`Enterprise` for both POST endpoints.
This was an oversight from the original swagger docs addition in #5625
(January 2023).
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
Closes https://github.com/coder/coder/issues/20913
I've ran the test without the fix, verified the test caught the issue,
then applied the fix, and confirmed the issue no longer happens.
---
🤖 PR was initially written by Claude Opus 4.5 Thinking using Claude Code
and then review by a human 👩
Closes https://github.com/coder/coder/issues/20711
We now allow agents to be created on dormant workspaces.
I've ran the test with and without the change. I've confirmed that -
without the fix - it triggers the "rbac: unauthorized" error.
Addresses [`aibridge#54`](https://github.com/coder/aibridge/issues/54)
When querying against the values in the database for
`/api/experimental/aibridge/interceptions` we found strange behaviour
wherein there was interceptions that lacked prompting and other various
fields we want. Generally this was as a result of the data not actually
existing for these values (as they were inflight).
The simple solution to this was to hide them if they didn't exist. This
PR addresses that.
---------
Co-authored-by: Danny Kopping <danny@coder.com>
Fixes https://github.com/coder/internal/issues/1119
## Description
The `CacheTFProviders` function in `testutil/terraform_cache.go` was
only available on Linux and macOS due to the `//go:build linux ||
darwin` build tag. This caused a compile error on Windows when
`enterprise/coderd/workspaces_test.go` tried to call it:
```
enterprise\coderd\workspaces_test.go:3403:28: undefined: testutil.CacheTFProviders
```
## Changes
1. Added `testutil/terraform_cache_windows.go` with a Windows-specific
stub implementation that returns an empty string
2. Updated `downloadProviders` helper in
`enterprise/coderd/workspaces_test.go` to handle empty paths gracefully
## Behavior
- On Linux/macOS: Terraform providers are cached as before
- On Windows: Provider caching is skipped, tests download providers
normally during `terraform init`
## Testing
This should fix the Windows nightly gauntlet failure. The test will
still run on Windows, just without provider caching optimization.
Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>
Experiments passed to provisioners to determine behavior. This adds
`--experiments` flag to provisioner daemons. Prior to this, provisioners
had no method to turn on/off experiments.
## Problem
Fix race condition in prebuilds reconciler. Previously, a job
notification event was sent to a Go channel before the provisioning
database transaction completed. The notification is consumed by a
separate goroutine that publishes to PostgreSQL's LISTEN/NOTIFY, using a
separate database connection. This creates a potential race: if a
provisioner daemon receives the notification and queries for the job
before the provisioning transaction commits, it won't find the job in
the database.
This manifested as a flaky test failure in `TestReinitializeAgent`,
where provisioners would occasionally miss newly created jobs. The test
uses a 25-second timeout context, while the acquirer's backup polling
mechanism checks for jobs every 30 seconds. This made the race condition
visible in tests, though in production the backup polling would
eventually pick up the job. The solution presented here guarantees that
a job notification is only sent after the provisioning database
transaction commits.
## Changes
* The `provision()` and `provisionDelete()` functions now return the
provisioner job instead of sending notifications internally.
* A new `publishProvisionerJob()` helper centralizes the notification
logic and is called after each transaction completes.
Closes: https://github.com/coder/internal/issues/963
Currently, when AI Bridge is enabled AND the `oauth2` and
`mcp-server-http` experiments are enabled we inject Coder's MCP tools
into all intercepted AI Bridge requests.
This PR introduces a config to control this behaviour.
**NOTE:** this is a backwards-incompatible change; previously these
tools would be injected automatically, now this setting will need to be
explicitly enabled.
---------
Signed-off-by: Danny Kopping <danny@coder.com>
Fixes flaky `TestWorkspaceTagsTerraform` and
`TestWorkspaceTemplateParamsChange` tests that were failing with
`connection reset by peer` errors when downloading the coder/coder
provider.
This applies the same caching solution which was done in
https://github.com/coder/coder/pull/17373
1. Extracts provider caching logic into `testutil/terraform_cache.go`
2. Updates TestProvision to use the shared caching helpers
3. Updates enterprise workspace tests to use the shared caching helpers
The cache is persisted at `~/.cache/coderv2-test/` and automatically
cached between CI runs via existing GitHub Actions cache setup.
Closes https://github.com/coder/internal/issues/607
## Description
The membership reconciliation ensures the prebuilds system user is a
member of all organizations with prebuilds configured. To support
prebuilds quota management, each organization must have a prebuilds
group that the system user belongs to.
## Problem
Previously, membership reconciliation iterated over all presets to check
and update membership status. This meant database queries
`GetGroupByOrgAndName` and `InsertGroupMember` were executed for each
preset. Since presets are unique combinations of `(organization,
template, template version, preset)`, this resulted in several redundant
checks for the same organization.
In dogfood, `InsertGroupMember` was called thousands of times per day,
even though memberships were already configured ([internal Grafana
dashboard link](https://grafana.dev.coder.com/goto/46MZ1UgDg?orgId=1))
<img width="5382" height="1788" alt="Screenshot 2025-10-28 at 16 01 36"
src="https://github.com/user-attachments/assets/757b7253-106f-4f72-8586-8e2ede9f18db"
/>
## Solution
This PR introduces `GetOrganizationsWithPrebuildStatus`, a single query
that returns:
* All unique organizations with prebuilds configured
* Whether the prebuilds user is a member of each organization
* Whether the prebuilds group exists in each organization
* Whether the prebuilds user is in the prebuilds group
The membership reconciliation logic now:
* Fetches status for all organizations in one query
* Only performs inserts for organizations missing required memberships
or groups
* Safely handles concurrent operations via unique constraint violations
* This reduces database load from `O(presets)` to `O(organizations)` per
reconciliation loop, with a single read query when everything is
configured.
## Changes
* Add `GetOrganizationsWithPrebuildStatus` SQL query
* Update `membership.ReconcileAll` to use organization-based
reconciliation instead of preset-based
* Update tests to reflect new behavior
Related to internal thread:
https://codercom.slack.com/archives/C07GRNNRW03/p1760535570381369