Fixes#22030
## Problem
When a template has `require_active_version = true` and a workspace is
outdated, the web UI always shows "Update and start" as the **only**
button (for all users including admins), but `coder start` starts with
the old version. For admins, this silently succeeds on the stale
version. For non-admins, it goes through a clunky 403→retry path. This
also affects the VS Code extension, which calls `coder start --yes`
under the hood.
## Root Cause
`buildWorkspaceStartRequest()` in `cli/start.go` checks
`workspace.AutomaticUpdates == "always"` but ignores
`workspace.TemplateRequireActiveVersion`. The server-side autostart
already ORs both settings together:
```go
// coderd/autobuild/lifecycle_executor.go
func useActiveVersion(opts, ws) bool {
return opts.RequireActiveVersion || ws.AutomaticUpdates == "always"
}
```
The CLI was missing the `RequireActiveVersion` check.
## Fix
Add `workspace.TemplateRequireActiveVersion` to the existing OR
condition:
```go
// Before:
if workspace.AutomaticUpdates == codersdk.AutomaticUpdatesAlways || action == WorkspaceUpdate {
// After:
if workspace.AutomaticUpdates == codersdk.AutomaticUpdatesAlways || workspace.TemplateRequireActiveVersion || action == WorkspaceUpdate {
```
Now `coder start` and `coder restart` proactively use the active
template version when `require_active_version` is set, matching the web
UI and server autostart behavior. The 403→retry fallback remains as a
safety net but is no longer the primary path for any user.
## Testing
Updated `enterprise/cli/start_test.go` — all user types (owner, template
admin, ACL admin, group ACL admin, member) now expect the active version
when `require_active_version` is set, and verify the 403→retry message
does NOT appear.
`--secure-auth-cookie` now automatically sources it's default value from `--access-url`
If the access url uses HTTPS, secure is set to `true`.
To revert to old behavior, set the value explicitly to `false`
Since Go 1.22, the loop variable capture issue is resolved. Variables
declared by for loops are now per-iteration rather than per-loop, making
the 'v := v' pattern unnecessary.
## Problem
Site-wide admins (e.g., Owners) could not use `coder create --org <org>`
to create workspaces in organizations they are not members of. The error
was:
```
$ coder create my-workspace -t docker --org data-science
error: organization "data-science" not found, are you sure you are a member of this organization?
```
This was inconsistent with the web UI, where Owners can create
workspaces in any organization.
## Root Cause
The CLI's `OrganizationContext.Selected()` function only checked the
user's membership list, ignoring site-wide RBAC permissions that grant
Owners access to all organizations.
## Solution
Added a fallback in `OrganizationContext.Selected()` that fetches the
org directly via the API when not found in the membership list. This
works because the API endpoint applies RBAC filtering, allowing Owners
to read any org.
## Impact
This fixes `coder create --org` and all other CLI commands that use
`OrganizationContext.Selected()` (29+ commands), including:
- `coder templates push --org <any-org>`
- `coder organizations members add --org <any-org>`
- `coder provisioner list --org <any-org>`
## Testing
Added `TestEnterpriseCreate/OwnerCanCreateInNonMemberOrg` which:
- Creates an Owner user who is NOT a member of a second org
- Verifies they can create a workspace there using `--org`
- Properly fails without the code fix, passes with it
---
*This PR was generated by [mux](https://mux.coder.com) but reviewed by a
human.*
This PR adds some metrics to help identify job enqueue rates and
latencies. This work was initiated as a way to help reduce the cost of
the observation/measurement itself for autostart scaletests, which
impacts our ability to identify/reason about the load caused by
autostart. See: https://github.com/coder/internal/issues/1209
I've extended the metrics here to account for regular user initiated
builds, prebuilds, autostarts, etc. IMO there is still the question here
of whether we want to include or need the `transition` label, which is
only present on workspace builds. Including it does lead to an increase
in cardinality, and in the case of the histogram (when not using native
histograms) that's at least a few extra series for every bucket. We
could remove the transition label there but keep it on the counter.
Additionally, the histogram is currently observing latencies for other
jobs, such as template builds/version imports, those do not have a
transition type associated with them.
Tested briefly in a workspace, can see metric values like the following:
-
`coderd_workspace_builds_enqueued_total{build_reason="autostart",provisioner_type="terraform",status="success",transition="start"}
1`
-
`coderd_provisioner_job_queue_wait_seconds_bucket{build_reason="autostart",job_type="workspace_build",provisioner_type="terraform",transition="start",le="0.025"}
1`
---------
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Description
Mark `--ssh-hostname-prefix` flag and `CODER_SSH_HOSTNAME_PREFIX` env
variable as deprecated, recommending users to use
`--workspace-hostname-suffix` / `CODER_WORKSPACE_HOSTNAME_SUFFIX`
instead for consistency with Coder Desktop.
The deprecated option is now hidden from help output and docs but
remains functional for backward compatibility. When used, it will show a
deprecation warning pointing to the recommended alternative.
## Changes
- Added `UseInstead` pointing to `workspace-hostname-suffix` option
(triggers deprecation warning)
- Set `Hidden: true` to hide from CLI help and documentation
- Updated description to mention deprecation
- Regenerated docs and help files via `make gen`
Closes#18156
---
_Originally requested by @matifali in
https://github.com/coder/coder/pull/18085#discussion_r2115594447_
## Description
Adds Prometheus metrics to the AI Bridge Proxy for observability into
proxy traffic and performance.
## Changes
* Add Metrics struct with the following metrics:
* `connect_sessions_total`: counts CONNECT sessions by type
(mitm/tunneled)
* `mitm_requests_total`: counts MITM requests by provider
* `inflight_mitm_requests`: gauge tracking in-flight requests by
provider
* `mitm_request_duration_seconds`: histogram of request latencies by
provider
* `mitm_responses_total`: counts responses by status code class
(2XX/3XX/4XX/5XX) and provider
* Register metrics with `coder_aibridgeproxyd_` prefix in CLI
* Unregister metrics on server close to prevent registry leaks
* Add `tunneledMiddleware` to track non-allowlisted CONNECT sessions
* Add tests for metric recording in both MITM and tunneled paths
Closes: https://github.com/coder/internal/issues/1185
This undeprecates the `allow-workspace-renames` flag. IIUC, the 'danger'
with using this flag is that the workspace name might have been used in
the definition of some other terraform resources within template code,
so a rename could cause problems such as with persistent disks.
for https://github.com/coder/coder/issues/21628
---------
Signed-off-by: Callum Styan <callumstyan@gmail.com>
## Summary
AI Bridge is moving to General Availability in v2.30 and will require
the AI Governance Add-On license in future versions. This adds a soft
warning for deployments using AI Bridge via Premium/Enterprise
FeatureSet without an explicit AI Bridge add-on license.
Relates to: https://github.com/coder/internal/issues/1226
## Changes
- Track whether AI Bridge was explicitly granted via license Features
(add-on) vs inherited from FeatureSet
- Show soft warning when AI Bridge is enabled and entitled via
FeatureSet but not via explicit add-on
- Changed AI Bridge enablement from hardcoded `true` to check
`CODER_AIBRIDGE_ENABLED` deployment config
## Behavior Change
AI Bridge is now only marked as "enabled" in entitlements when
`CODER_AIBRIDGE_ENABLED=true` is set in the deployment config.
Previously, it was always enabled for Premium/Enterprise licenses
regardless of the config setting.
This change ensures that users who do not use AI Bridge will not see the
soft warning about the upcoming license requirement.
## Warning Message
> AI Bridge is now Generally Available in v2.30. In a future Coder
version, your deployment will require the AI Governance Add-On to
continue using this feature. Please reach out to your account team or
sales@coder.com to learn more.
## Behavior
| Condition | Warning Shown |
|-----------|---------------|
| AI Bridge disabled | ❌ No |
| AI Bridge enabled + explicit add-on license | ❌ No |
| AI Bridge enabled + Premium/Enterprise FeatureSet (no add-on) | ✅ Yes
|
## Screenshots
### 1. No license
<img width="1708" height="577" alt="image"
src="https://github.com/user-attachments/assets/cbdbfd4d-55de-4d70-8abf-2665f458e96f"
/>
### 2. No license + CODER_AIBRIDGE_ENABLED=true
<img width="1716" height="513" alt="image"
src="https://github.com/user-attachments/assets/344aae76-7703-485f-b568-1f13a1efa48f"
/>
### 3. Premium license + CODER_AIBRIDGE_ENABLED=false
<img width="1687" height="389" alt="image"
src="https://github.com/user-attachments/assets/c2be12b0-1c0f-438d-a293-f9ec9fe6a736"
/>
### 4. Premium license + CODER_AIBRIDGE_ENABLED=true
<img width="1707" height="525" alt="image"
src="https://github.com/user-attachments/assets/1a4640e1-e656-4f9b-bed0-9390cb5d6a84"
/>
## Notes
- TODO comments added to mark code that should be removed when AI Bridge
enforcement is added
- Feature continues to work - this is just a transitional warning (soft
enforcement)
Source code changes:
- Added a wrapper for the boundary subcommand that checks feature
entitlement before executing the underlying command.
- Added a helper that returns the Boundary version using the
runtime/debug package, which reads this information from the go.mod
file.
- Added FeatureBoundary to the corresponding enum.
- Move boundary command from AGPL to enterprise.
`NOTE`: From now on, the Boundary version will be specified in go.mod
instead of being defined in AI modules.
The removal of that permission from the role broke valid use cases (e.g.
a site owner user creating a workspace owned by a system account and
then trying to share it with another user).
The bulk of the PR is made up of the rollbacks of the previously
introduced test updates necessitated by the removal.
Related to: https://github.com/coder/internal/issues/1285
Agents were losing authentication during workspace shutdown, causing
shutdown scripts to fail. The auth query required agents to belong to
the latest build, but during shutdown a `stop` build becomes latest while
the `start` build's agents are still running.
Modified the auth query to allow `start` build agents to authenticate
temporarily during `stop` execution. The query allows auth when:
- Agent's `start` build job succeeded
- Latest build is `stop` with `pending`/`running` job status
- Builds are adjacent (`stop` is `build_number + 1`)
- Template versions match
Auth closes once `stop` completes.
Renamed `GetWorkspaceAgentAndLatestBuildByAuthToken` to
`GetAuthenticatedWorkspaceAgentAndBuildByAuthToken` since it returns the
agent's build (not always latest) during shutdown.
Closes coder/internal#1249
Fixes#19467
## Description
This PR addresses database connection pool exhaustion during prebuilds
reconciliation by introducing two changes:
* `CanSkipReconciliation`: Filters out presets that don't need
reconciliation before spawning goroutines. This ensures we only create
goroutines for presets that will (_most likely_) perform database
operations, avoiding unnecessary connection pool usage.
* Dynamic `eg.SetLimit`: Limits concurrent goroutines based on the
configured database connection pool size (`CODER_PG_CONN_MAX_OPEN / 2`).
This replaces the previous hardcoded limit of 5, ensuring the
reconciliation loop scales appropriately with the configured pool size
while leaving capacity for other database operations.
## Changes
* Add `CanSkipReconciliation()` method to `PresetSnapshot` that returns
true for inactive presets with no running workspaces, no pending jobs,
or expired prebuilds.
* Add `maxDBConnections` parameter to `NewStoreReconciler` and compute
`reconciliationConcurrency` as half the pool size (minimum 1).
* Add `ReconciliationConcurrency()` getter method to `StoreReconciler`.
* Add `eg.SetLimit(c.reconciliationConcurrency)` to bound concurrent
reconciliation goroutines.
* Add `PresetsTotal` and `PresetsReconciled` to `ReconcileStats` for
observability.
* Add `TestCanSkipReconciliation` unit tests.
* Add `TestReconciliationConcurrency` unit tests.
* Add benchmark tests for reconciliation performance.
## Benchmarks
* `BenchmarkReconcileAll_NoOps`: Tests presets with no reconciliation
actions. All presets are filtered by `CanSkipReconciliation`, resulting
in no goroutines spawned and no database connections used.
* `BenchmarkReconcileAll_ConnectionContention`: Tests presets where all
require reconciliation actions. All presets spawn goroutines, but
concurrency is limited by `eg.SetLimit(reconciliationConcurrency)`.
* `BenchmarkReconcileAll_Mix`: Simulates a realistic scenario with a
large subset of inactive presets (filtered by `CanSkipReconciliation`)
and a smaller subset requiring reconciliation (limited by
`eg.SetLimit`).
Closes: https://github.com/coder/coder/issues/20606
## Summary
Add circuit breaker support for AI Bridge to protect against cascading
failures from upstream AI provider rate limits (HTTP 429, 503, and
Anthropic's 529 overloaded responses).
## Changes
- Add 5 new CLI options for circuit breaker configuration:
- `--aibridge-circuit-breaker-enabled` (default: false)
- `--aibridge-circuit-breaker-failure-threshold` (default: 5)
- `--aibridge-circuit-breaker-interval` (default: 10s)
- `--aibridge-circuit-breaker-timeout` (default: 30s)
- `--aibridge-circuit-breaker-max-requests` (default: 3)
- Update aibridge dependency to include circuit breaker support
- Add tests for pool creation with circuit breaker providers
## Notes
- Circuit breaker is **disabled by default** for backward compatibility
- When enabled, applies to both OpenAI and Anthropic providers
- Uses sony/gobreaker internally via the aibridge library
## Testing
```
make test RUN=TestPoolWithCircuitBreakerProviders
```
## Description
Adds upstream proxy support for AI Bridge Proxy passthrough requests.
This allows aiproxy to forward non-allowlisted requests through an
upstream proxy. Currently, the only supported configuration is when
aiproxy is the first proxy in the chain (client → aiproxy → upstream
proxy).
## Changes
* Add `--aibridge-proxy-upstream` option to configure an upstream
HTTP/HTTPS proxy URL for passthrough requests
* Add `--aibridge-proxy-upstream-ca` option to trust custom CA
certificates for HTTPS upstream proxies
* Passthrough requests (non-allowlisted domains) are forwarded through
the upstream proxy
* MITM'd requests (allowlisted domains) continue to go directly to
aibridge, not through the upstream proxy
* Add tests for upstream proxy configuration and request routing
Closes: https://github.com/coder/internal/issues/1204
## Description
Implements selective MITM (Man-in-the-Middle) in `aibridgeproxyd` so
that only requests to allowlisted domains are intercepted and decrypted.
Requests to all other domains are tunneled directly without decryption.
## Changes
* New config option: `CODER_AIBRIDGE_PROXY_DOMAIN_ALLOWLIST` (default:
`api.anthropic.com`,`api.openai.com`)
* Selective MITM: Uses `goproxy.ReqHostIs()` to only intercept `CONNECT`
requests to allowlisted hosts
* Certificate caching: Now only generates/caches certificates for
allowlisted domains
* Validation: Startup fails if domain allowlist is empty or contains
invalid entries
Closes: https://github.com/coder/internal/issues/1182
Closes https://github.com/coder/coder/issues/21360
A few considerations/notes:
- I've kept the number of conns to 10 in all other places, except coderd
- which uses the config value
- I opted to also make idle conns configurable; the greater the delta
between max open and max idle, the more connection churn
- Postgres maintains a [_process_ per
connection](https://www.postgresql.org/docs/current/connect-estab.html),
contrary to what the comment said previously
- Operators should be able to tune this, since process churn can
negatively affect OS scheduling
- I've set the value to `"auto"` by default so it's not another knob one
_has to_ twiddle, and sets max idle = max conns / 3
---------
Signed-off-by: Danny Kopping <danny@coder.com>
Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example:
```
import (
"context"
"time"
"github.com/prometheus/client_golang/prometheus"
"golang.org/x/xerrors"
"gopkg.in/natefinch/lumberjack.v2"
"cdr.dev/slog/v3"
"github.com/coder/coder/v2/codersdk/agentsdk"
"github.com/coder/serpent"
)
```
3 groups: standard library, 3rd partly libs, Coder libs.
This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.
Upgrades to slog v3 which includes a small, but backward incompatible API change to the acceptible call arguments when logging. This change allows us to verify via compile time type checking that arguments are correct and won't cause a panic, as was possible in slog v1, which this replaces (v2 was tagged but never used in coder/coder).
It also updates dependencies that also use slog and were updated.
I've left the `aibridge` dependency as a commit SHA, under the assumption that the team there (cc @pawbana @dannykopping ) will tag and update the dependency soon and on their own schedule.
Other dependencies, I pushed new tags.
The implementation for prebuilt workspaces is complex and conversations
regarding edge cases and bugs frequently get bogged down by minutiae,
because it's hard to reason about the behaviour of the system.
To alleviate this, I've introduced otel tracing to the StoreReconciler
(see attached). We can now directly observe the behaviour of the
prebuilds system under load in order to inform our decisions.
Traces are terminated at the boundary between prebuilds and workspace
builder, because of prebuilt workspaces' "fire and forget" philosophy
and to prevent span explosion.
<img width="3024" height="1718" alt="image"
src="https://github.com/user-attachments/assets/f9b207be-8f2c-475e-98a8-46ef70bda446"
/>
Because this affects more than just the template insights
page (specifically it also affects the deployment stats endpoint which
is shown on bottom bar and Prometheus), the group is being renamed
generically to just "stats collection". In the future if we need to
affect the other stats we can put those options here.
Then, because this change only affects a portion of stats, specifically
usage stats like connection and application time, bytes sent, etc, add a
new sub-group called "usage stats".
Then finally add back the "enable" flag. This also gives us a place to
one day place an "anonymize" flag if we need to go that route.
## Description
Implements request routing for the AI Bridge Proxy. After MITM decryption, requests to known AI providers (Anthropic, OpenAI) are rewritten to the corresponding aibridged endpoint, while requests to unknown hosts are passed through to their original destination.
## Changes
* Add `CoderAccessURL` configuration option for specifying the Coder deployment URL
* Add `handleRequest` to route decrypted requests based on target host
* Route known AI providers (Anthropic and OpenAI) to AI Bridge specific endpoint.
* Passthrough requests to unknown hosts directly to their original destination
* Inject Coder session token (from https://github.com/coder/coder/pull/21342) as `Authorization: Bearer` header for aibridged
* Add tests for routing and passthrough behavior
Depends on: https://github.com/coder/coder/pull/21342
Closes: https://github.com/coder/internal/issues/1181
## Description
Adds the core AI Bridge MITM proxy daemon. This proxy intercepts HTTPS traffic, decrypts it using a configured CA certificate, and forwards requests to AIBridge for processing.
## Changes
* Added `aibridgeproxyd` package with the core proxy server implementation
* Added configuration options: `CODER_AIBRIDGE_PROXY_ENABLED`, `CODER_AIBRIDGE_PROXY_LISTEN_ADDR`, `CODER_AIBRIDGE_PROXY_CERT_FILE`, `CODER_AIBRIDGE_PROXY_KEY_FILE`
* Added tests for server initialization and MITM functionality
Closes https://github.com/coder/internal/issues/1180
Provisioner steps broken into smaller granular actions.
Changes:
- `ExtractArchive` moved to `init` request (was in `configure`)
- Writing `tfstate` moved to `plan` (was in `configure`)
- Moved most plan/apply outputs to `GraphComplete`
Closes#20399
To summarize the original commit messages:
- Do not log stats to the database.
- Return errors on the insight endpoints.
- Update the frontend to show those errors.
- Also fixes an issue with getting the user status count via codersdk,
since I added a test to ensure it was not disabled by this flag and it
was sending the wrong payload.
## Summary
This adds configurable overload protection to the AI Bridge daemon to
prevent the server from being overwhelmed during periods of high load.
Partially addresses coder/internal#1153 (rate limits and concurrency
control; circuit breakers are deferred to a follow-up).
## New Configuration Options
| Option | Environment Variable | Description | Default |
|--------|---------------------|-------------|---------|
| `--aibridge-max-concurrency` | `CODER_AIBRIDGE_MAX_CONCURRENCY` |
Maximum number of concurrent AI Bridge requests. Set to 0 to disable
(unlimited). | `0` |
| `--aibridge-rate-limit` | `CODER_AIBRIDGE_RATE_LIMIT` | Maximum number
of AI Bridge requests per second. Set to 0 to disable rate limiting. |
`0` |
## Behavior
When limits are exceeded:
- **Concurrency limit**: Returns HTTP `503 Service Unavailable` with
message "AI Bridge is currently at capacity. Please try again later."
- **Rate limit**: Returns HTTP `429 Too Many Requests` with
`Retry-After` header.
Both protections are optional and disabled by default (0 values).
## Implementation
The overload protection is implemented as reusable middleware in
`coderd/httpmw/ratelimit.go`:
1. **`RateLimitByAuthToken`**: Per-user rate limiting that uses
`APITokenFromRequest` to extract the authentication token, with fallback
to `X-Api-Key` header for AI provider compatibility (e.g., Anthropic).
Falls back to IP-based rate limiting if no token is present. Includes
`Retry-After` header for backpressure signaling.
2. **`ConcurrencyLimit`**: Uses an atomic counter to track in-flight
requests and reject when at capacity.
The middleware is applied in `enterprise/coderd/aibridge.go` via
`r.Group` in the following order:
1. Concurrency check (faster rejection for load shedding)
2. Rate limit check
**Note**: Rate limiting currently applies to all AI Bridge requests,
including pass-through requests. Ideally only actual interceptions
should count, but this would require changes in the aibridge library.
## Testing
Added comprehensive tests for:
- Rate limiting by auth token (Bearer token, X-Api-Key, no token
fallback to IP)
- Different tokens not rate limited against each other
- Disabled when limit is zero
- Retry-After header is set on 429 responses
- Concurrency limiting (allows within limit, rejects over limit,
disabled when zero)
Adds `--disable-workspace-sharing` option.
Workspace sharing is disabled by not including user and group ACLs in
the workspace RBAC object, which prevents ACL-based authz.
Closes https://github.com/coder/internal/issues/1072
The commit also adds saving of workspace user/group ACLs in the test DB
data generator.
This changes makes it so that we output the empty string for Format
when there is no data. It turns out there are many places in the code
where we have such handling, but in a way that would break the JSON
formatter (since we'd output nothing on stdout or text rather than
`[]`/`null`).