File-reference parts in user messages were flattened to `TextContent` at
write time because fantasy has no file-reference content type. The
frontend never saw them as structured parts.
This moves all write paths (user, assistant, tool) from fantasy envelope
format to `codersdk.ChatMessagePart`. The streaming layer (`chatloop`)
is untouched, the conversion happens at the serialization boundary in
`persistStep`.
Old rows are still readable. `ParseContent` uses a structural heuristic
(`isFantasyEnvelopeFormat`) to distinguish legacy envelopes from SDK
parts. We chose this over try/fallback because fantasy envelopes
partially unmarshal into `ChatMessagePart` (the `type` field matches)
while silently losing content. A guard test enforces that no SDK part
can produce the envelope shape.
This is forward-only: new rows are unreadable by old code. Chat is
behind a feature flag so rollback risk is contained.
Also adds a typed `ChatMessageRole` to replace raw strings and
`fantasy.MessageRole*` casts at the persistence boundary. The type
covers `ChatMessage.Role`, `ChatStreamMessagePart.Role`, the
`PublishMessagePart` callback chain, and all DB write sites.
`fantasy.MessageRole*` remains only where we build `fantasy.Message`
structs for LLM dispatch.
Separately, `ProviderMetadata` was leaking to SSE clients via
`publishMessagePart`. `StripInternal` now runs on both the SSE and REST
paths, covering this.
Other cleanup:
- Old `db2sdk.contentBlockToPart` silently dropped metadata on
text/reasoning/tool-call content. New code preserves it.
- `providerMetadataToOptions` now logs warnings instead of silently
returning nil.
- `db2sdk` shrinks from ~250 lines of parallel conversion to ~15 lines
delegating to `chatprompt.ParseContent()`, removing the `fantasy` import
entirely.
Refs #22821
_Disclaimer: implemented by a Coder Agent using Claude Opus 4.6._
Marks the injected MCP approach in AI Bridge as deprecated across the
codebase.
## Changes
- **`codersdk/deployment.go`**: Deprecated `ExternalAuthConfig.MCPURL`,
`.MCPToolAllowRegex`, `.MCPToolDenyRegex` fields; deprecated and hid the
`--aibridge-inject-coder-mcp-tools` server flag; deprecated
`AIBridgeConfig.InjectCoderMCPTools`.
- **`coderd/externalauth/externalauth.go`**: Deprecated `Config.MCPURL`,
`.MCPToolAllowRegex`, `.MCPToolDenyRegex`.
- **`enterprise/aibridgedserver/aibridgedserver.go`**: Added runtime
deprecation warning when `CODER_AIBRIDGE_INJECT_CODER_MCP_TOOLS` is
enabled; deprecated `getCoderMCPServerConfig`.
- **`enterprise/aibridged/mcp.go`**: Deprecated `MCPProxyBuilder`
interface and `MCPProxyFactory` struct.
- **`docs/ai-coder/ai-bridge/mcp.md`**: Added deprecation warning
banner.
WaitBuffer is a thread-safe io.Writer that supports blocking until
accumulated output matches a substring or custom predicate. It
replaces ad-hoc safeBuffer/syncWriter types and time.Sleep-based
poll loops in tests with signal-driven waits.
- WaitFor/WaitForNth/WaitForCond for blocking on output
- Replace custom buffer types in cli/sync_test.go and
provisionersdk/agent_test.go
- Convert time.Sleep poll loops to require.Eventually/require.Never
in cli/ssh_test.go, coderd/activitybump_test.go,
coderd/workspaceagentsrpc_test.go, workspaceproxy_test.go, and
scaletest tests
The groupsToRows function was not setting the Group field on
groupTableRow, causing JSON output to contain zero-value structs. Table
output was unaffected since it uses separate fields.
BREAKING CHANGE: The JSON output structure changes from `{"Group":
{"id": ...}}` to `{"id": ...}` (flat). This is technically a breaking
change, but JSON output never contained real data (all fields were
zero-valued), so no working consumer could exist. We're taking the
opportunity to flatten the structure to match other list commands like
`coder list -o json`.
## Problem
When multiple concurrent callers (e.g., parallel workspace builds) read
the same single-use OAuth2 refresh token from the database and race to
exchange it with the provider, the first caller succeeds but subsequent
callers get `bad_refresh_token`. The losing caller then **clears the
valid new token** from the database, permanently breaking the auth link
until the user manually re-authenticates.
This is reliably reproducible when launching multiple workspaces
simultaneously with GitHub App external auth and user-to-server token
expiration enabled.
## Solution
Two layers of protection:
### 1. Singleflight deduplication (`Config.RefreshToken` +
`ObtainOIDCAccessToken`)
Concurrent callers for the same user/provider share a single refresh
call via `golang.org/x/sync/singleflight`, keyed by `userID`. The
singleflight callback re-reads the link from the database to pick up any
token already refreshed by a prior in-flight call, avoiding redundant
IDP round-trips entirely.
### 2. Optimistic locking on `UpdateExternalAuthLinkRefreshToken`
The SQL `WHERE` clause now includes `AND oauth_refresh_token =
@old_oauth_refresh_token`, so if two replicas (HA) race past
singleflight, the loser's destructive UPDATE is a harmless no-op rather
than overwriting the winner's valid token.
## Changes
| File | Change |
|------|--------|
| `coderd/externalauth/externalauth.go` | Added `singleflight.Group` to
`Config`; split `RefreshToken` into public wrapper +
`refreshTokenInner`; pass `OldOauthRefreshToken` to DB update |
| `coderd/provisionerdserver/provisionerdserver.go` | Wrapped OIDC
refresh in `ObtainOIDCAccessToken` with package-level singleflight |
| `coderd/database/queries/externalauth.sql` | Added optimistic lock
(`WHERE ... AND oauth_refresh_token = @old_oauth_refresh_token`) |
| `coderd/database/queries.sql.go` | Regenerated |
| `coderd/database/querier.go` | Regenerated |
| `coderd/database/dbauthz/dbauthz_test.go` | Updated test params for
new field |
| `coderd/externalauth/externalauth_test.go` | Added
`ConcurrentRefreshDedup` test; updated existing tests for singleflight
DB re-read |
## Testing
- **New test `ConcurrentRefreshDedup`**: 5 goroutines call
`RefreshToken` concurrently, asserts IDP refresh called exactly once,
all callers get same token.
- All existing `TestRefreshToken/*` subtests updated and passing.
- `TestObtainOIDCAccessToken` passing.
- `dbauthz` tests passing.
`Test_ProxyServer_Headers` never passed `--http-address`, so it bound to
the default `127.0.0.1:3000`.
`TestWorkspaceProxy_Server_PrometheusEnabled`
used `RandomPort(t)` for `--http-address` (a drive-by from #14972 which
was
fixing the Prometheus port).
Both now use `--http-address :0`. `ConfigureHTTPServers` calls
`net.Listen("tcp", ":0")` and holds the listener open, so there is no
TOCTOU window. Neither test connects to the HTTP listener, so the
assigned port is irrelevant. This matches `cli/server_test.go` where
`:0` is used throughout.
Split from #22693 per review feedback.
Fixes multiple bugs in coderd/chatd and sub-packages including race
conditions, transaction safety, stream buffer bounds, retry limits, and
enterprise relay improvements.
See commit message for full list.
`time.Now()` has nanosecond precision while Postgres timestamps are
microsecond precision. When tests compare `time.Now()` against
DB-sourced timestamps using `Before`/`After`/`WithinRange`/etc., there
is a non-zero flake risk from the precision mismatch.
This replaces `time.Now()` with `dbtime.Now()` (which rounds to
microsecond precision) in all test assertions that compare against
database timestamps.
Follows from #22684.
## Changes (11 files)
| File | Changes |
|---|---|
| `coderd/apikey_test.go` | 11 comparisons with `ExpiresAt` |
| `coderd/users_test.go` | 2 comparisons with `ExpiresAt` |
| `coderd/oauth2_test.go` | 1 comparison with `token.Expiry` |
| `coderd/workspaces_test.go` | 2 comparisons with `DormantAt` |
| `coderd/workspaceagents_test.go` | 3 comparisons with
`ConnectedAt`/`DisconnectedAt` |
| `coderd/workspaceapps/db_test.go` | 1 comparison with `token.Expiry` |
| `coderd/provisionerdserver/provisionerdserver_test.go` | 1 comparison
with `key.ExpiresAt` |
| `enterprise/coderd/workspaces_test.go` | 1 comparison with `DormantAt`
|
| `enterprise/coderd/license/license_test.go` | 3 `NotBefore` values |
| `enterprise/coderd/licenses_test.go` | 2 `NotBefore` values |
| `enterprise/coderd/users_test.go` | 3 `Next()` comparisons |
## Not changed (intentionally)
- `scaletest/placebo/run_test.go` — compares wall-clock elapsed time,
not DB timestamps
- `cli/server_test.go`, `coderd/jwtutils/jwt_test.go`,
`enterprise/aibridgeproxyd/aibridgeproxyd_test.go` — TLS cert fields,
not DB-stored
- `coderd/azureidentity/azureidentity_test.go` — Azure cert expiry, not
DB
🤖 Generated by Claude Opus 4.6 but reviewed manually.
This PR does three things:
- Exports derp expvars to the pprof endpoint
- Exports the expvar metrics as prometheus metrics in both coderd and
wsproxy
- Updates our tailscale to a fix I also had to make to avoid a data race
condition
I generated this with mux but I also manually tested that the metrics
were getting properly emitted
## Description
Adds optional TLS support for the AI Bridge Proxy listener. When TLS cert and key files are provided, the proxy serves over HTTPS instead of plain HTTP.
## Changes
* New configuration options to enable TLS on the proxy listener
* Wraps the TCP listener in `tls.NewListener` when configured
* Tests for validation errors, invalid files, and full integration (tunneled + MITM) through a TLS listener
Note: Documentation for TLS listener setup and client configuration will be handled in a follow-up PR.
Related to: https://github.com/coder/internal/issues/1335
## Description
Renames internal fields, variables, and comments related to the proxy's certificate/key configuration to explicitly reference their MITM CA purpose.
The AI Bridge Proxy uses a CA certificate to sign dynamically generated leaf certificates during MITM interception of HTTPS traffic from AI clients. With the upcoming introduction of TLS listener certificates (for serving the proxy itself over HTTPS, implemented upstack https://github.com/coder/coder/pull/22411), the previous generic naming would become ambiguous. This refactor makes it clear which certificate is which.
No user-facing flags, environment variables, YAML keys, or JSON fields were changed, this is purely an internal rename to avoid confusion going forward.
Related to https://github.com/coder/internal/issues/1335
## Problem
When a browser connects to the chat stream via WebSocket, it
authenticates using cookies only — the native WebSocket API cannot set
custom headers like `Coder-Session-Token`. The relay between replicas
copies the original request's `Cookie` header but did **not** set the
`Coder-Session-Token` header as a fallback.
This causes a **401 on the worker replica** when `EnableHostPrefix` is
enabled, because the `HTTPCookies.Middleware` strips bare
`coder_session_token` cookies (expecting the `__Host-` prefix). Without
a `Coder-Session-Token` header fallback, `apiKeyMiddleware` finds no
valid credentials.
### Root Cause
The data flow:
1. Browser → subscriber replica: `Cookie:
__Host-coder_session_token=xxx` (browser sends prefixed cookie)
2. Subscriber's `HTTPCookies.Middleware` normalizes: `Cookie:
coder_session_token=xxx` (strips prefix)
3. `relayHeaders()` copies `Cookie: coder_session_token=xxx` to relay
request
4. Worker replica's `HTTPCookies.Middleware` sees bare
`coder_session_token` → **strips it** (expects `__Host-` prefix)
5. `apiKeyMiddleware` → `APITokenFromRequest`: no cookie, no header →
**401**
## Fix
Modified `relayHeaders()` to extract the session token value from the
`Cookie` header and set it as the `Coder-Session-Token` header when no
explicit session token header is already present. The header is never
stripped by middleware, so the worker replica can always authenticate.
## Testing
- **`TestRelayHeaders`**: Unit tests for the updated `relayHeaders()`
function covering all scenarios (cookie-only, header+cookie, no auth,
nil source)
- **`TestExtractSessionTokenFromCookieHeader`**: Unit tests for the
helper function
- **`TestChatStreamRelay/RelayCookieOnlyAuth`**: Integration test with
plain HTTP, cookie-only WebSocket auth
- **`TestChatStreamRelay/RelayCookieOnlyAuthWithHostPrefix`**:
Integration test with `EnableHostPrefix=true`, confirming the 401 is
fixed
- **`cookieOnlySessionTokenProvider`**: Test helper that simulates
browser WebSocket behavior (sets Cookie header only on WebSocket dials,
no custom headers)
## Files Changed
- `enterprise/coderd/chatd/chatd.go` — `relayHeaders()` fix +
`extractSessionTokenFromCookieHeader()` helper
- `enterprise/coderd/chatd/relay_headers_internal_test.go` — unit tests
(new file)
- `enterprise/coderd/chats_test.go` — integration tests + test helper
type
## Problem
`TestWorkspaceProvisionerdServerMetrics` flakes because metric
assertions run immediately after
`AwaitWorkspaceBuildJobCompleted` returns, but metrics are updated
**asynchronously after the
DB transaction commits** in `completeWorkspaceBuildJob`.
The timeline in the provisioner server:
1. DB transaction commits (`provisionerdserver.go:~2362`) — job marked
completed
2. Audit logging, notifications, DB queries (`~2370-2427`)
3. **Metric `.Observe()`** (`~2463`) — happens ~100 lines later
The test synchronization (`AwaitWorkspaceBuildJobCompleted`) polls for
`CompletedAt != nil`,
which fires at step 1. The metric assertion then executes before step 3,
causing the flake.
## Fix
Wrap all three metric assertions (prebuild creation, prebuild claim,
regular workspace
creation) in `require.Eventually` to poll until the metric appears, then
assert on the value.
## Test
- `go test -run TestWorkspaceProvisionerdServerMetrics -count=5` — all
pass
- `go test -race -run TestWorkspaceProvisionerdServerMetrics -count=1` —
clean
## Problem
In multi-replica Coder deployments, the chat relay WebSocket between
replicas fails with HTTP 401 (or TLS handshake errors). The subscriber
replica cannot relay `message_part` events from the worker replica.
**Root cause:** `codersdk.Client.Dial()` does not pass `c.HTTPClient` to
`websocket.DialOptions.HTTPClient`. The websocket library
(`github.com/coder/websocket`) falls back to `http.DefaultClient`, which
lacks the mesh TLS configuration needed for inter-replica communication.
The relay code in `enterprise/coderd/chatd/chatd.go` correctly sets
`sdkClient.HTTPClient = cfg.ReplicaHTTPClient` (which has mesh TLS
certs), but that client was never used for the actual WebSocket
handshake.
## Fix
One-line fix in `codersdk/client.go`: propagate `c.HTTPClient` to
`opts.HTTPClient` when the caller hasn't already set one.
## Test
Added `TestChatStreamRelay/RelayWithTLSAndCookieAuth` which:
- Sets up two replicas with TLS certificates (simulating mesh TLS in
production)
- Authenticates via cookies (simulating browser WebSocket behavior)
- Verifies message_part events relay across replicas over TLS
This test times out without the fix because the WebSocket handshake
fails with `x509: certificate signed by unknown authority`
(http.DefaultClient rejects self-signed certs).
## Related
Follow-up to #22635 which fixed the `redirectToAccessURL` middleware
bypassing 307 redirects for relay requests. That fix changed the error
from HTTP 200 to HTTP 401, exposing this deeper issue.
## Summary
Fixes a bug where interrupting a streaming chat and sending a new
message
left the relay connected to the wrong replica. Expanded into a broader
refactor that cleanly separates concerns:
- **OSS** owns pubsub subscription, message catch-up, queue updates,
status forwarding, and local parts merging.
- **Enterprise** (`enterprise/coderd/chatd`) only manages relay dialing,
reconnection, and stale-dial discarding for cross-replica streaming.
## Architecture
### OSS `coderd/chatd/chatd.go`
`Subscribe()` builds the initial snapshot then runs a single merge
goroutine that handles:
- Pubsub subscription for durable events (status, messages, queue,
errors)
- Message catch-up via `AfterMessageID`
- Local `message_part` forwarding
- Relay events from enterprise (when `SubscribeFn` is set)
- Sends `StatusNotification` to enterprise so it can manage relay
lifecycle
Key types:
- `SubscribeFn` — enterprise hook, returns relay-only events channel
- `SubscribeFnParams` — `ChatID`, `Chat`, `WorkerID`,
`StatusNotifications`, `RequestHeader`, `DB`, `Logger`
- `StatusNotification` — `Status` + `WorkerID`, sent to enterprise on
pubsub status changes
### Enterprise `enterprise/coderd/chatd/chatd.go`
`NewMultiReplicaSubscribeFn(cfg MultiReplicaSubscribeConfig)` returns a
`SubscribeFn` that:
- Opens an initial synchronous relay if the chat is running on a remote
worker
- Reads `StatusNotifications` from OSS to open/close relay connections
- Handles async dial, reconnect timers, stale-dial discarding
- Returns only relay `message_part` events
## Bug fixes
### Original bug: stale relay dial after interrupt
`openRelayAsync` goroutines used `mergedCtx` (subscription-level), not a
per-dial context. `closeRelay()` could not cancel in-flight dials. When
the user interrupts and a new replica picks up the chat, the old dial
goroutine could complete after the new one and deliver a stale
`relayResult`.
**Fix**: per-dial `dialCtx`/`dialCancel`, `expectedWorkerID` tracking,
`workerID` on `relayResult`. `closeRelay()` cancels the dial context and
drains `relayReadyCh`. Merge loop rejects mismatched worker IDs.
### Additional fixes
- `statusNotifications` send-on-closed-channel race — goroutine now owns
`close()` via defer
- Enterprise spin-loop on `StatusNotifications` close — two-value
receive
with nil-out
- `hasPubsub` set from `p.pubsub != nil` instead of subscription success
— now tracks actual subscription result
- `lastMessageID` not initialized from `afterMessageID` — caused
duplicate messages on catch-up
- `wrappedParts` goroutine leaked remote connection on `dialCtx` cancel
- `closeRelay()` did not drain `relayReadyCh`
- `setChatWaiting` race with `SendMessage(Interrupt)` — wrapped in
`InTx`
- `processChat` post-TX side effects fired when chat was taken by
another
worker — added `errChatTakenByOtherWorker` sentinel
- Cancel closure data race on `reconnectTimer`
- Bare blocking send on pubsub error path
- `localParts` hot-spin after channel close
- No-pubsub branch dropped relay events and initial snapshot
- Failed relay dial caused permanent stall (no reconnect retry)
- DB error during reconnect timer caused permanent stall
- `time.NewTimer` replaced with `quartz.Clock` for testable timing
## Tests
9 enterprise tests covering:
- Relay reconnect on drop (mock clock)
- Async dial does not block merge loop
- Relay snapshot delivery
- Stale dial discarded after interrupt
- Cancel during in-flight dial
- Running-to-running worker switch
- Failed dial retries (mock clock)
- Local worker closes relay
- Multiple consecutive reconnects (mock clock)
All pass with `-race`.
Fixescoder/internal#642
We recently fixed Windows specific flakes for this test and reenabled
it. It then failed intermittently due to context deadline expiration.
The temporary path created on Windows contained invalid characters. This
resulted in a silent startup script failure on Windows. The test then
fruitlessly waited until context expiration. The test now uses a valid
path on Windows.
Adds database columns and server-side logic to track interception lineage via tool call IDs. When an interception ends, the server resolves the correlating tool call ID to find the parent interception and links them via `parent_id`.
New `provider_tool_call_id` column on `aibridge_tool_usages` and `parent_id` column on `aibridge_interceptions`, with indexes for lookup. `findParentInterceptionID` queries by tool call ID and filters out the current interception to find the parent.
Adapted from the [coder/coder `dk/prompt_provenance_poc`](https://github.com/coder/coder/compare/main...dk/prompt_provenance_poc) branch.
Depends on [coder/aibridge#188](https://github.com/coder/aibridge/pull/188).
Closes https://github.com/coder/internal/issues/1334
relates to #21335
Enables the agent socket by default and updates docs to strike references to having to enable it.
The PRs in this stack change the MCP server that Tasks use to update their status to rely on the agent socket, rather than directly dialing Coderd with the agent token.
Default disable was a reasonable default when it was only used for the experimental script ordering features, but now that we want to use it for Tasks, it should be default on.
closes https://github.com/coder/internal/issues/642
This PR:
* re-enables `func TestReinitializeAgent(t *testing.T)`
* adjusts it to use a Windows specific command in Windows environments
## Problem
Subscribers connecting to a different replica than the one running the
chat see full messages appear but no streaming partials (`message_part`
events). The relay mechanism that forwards ephemeral parts across
replicas had several bugs.
## Root Causes
1. **`openRelay()` blocked the event loop** — The WebSocket dial (TCP +
TLS + HTTP upgrade) to the worker replica ran synchronously inside the
select loop. While dialing, no events could be processed, channels
filled up, and parts were silently dropped.
2. **Relay drops were permanent** — When the relay WebSocket closed
mid-stream, `relayParts` was set to nil and never reopened. No status
notification would re-trigger it since the chat was still running on the
same worker.
3. **`drainInitial` snapshot race** — The `default` case in the initial
drain loop caused the snapshot to be empty if the remote hadn't flushed
data yet (common immediately after WebSocket connect).
4. **Duplicate event delivery** — The `preloaded` slice caused snapshot
events to be sent both in the return value and re-sent through the
channel goroutine.
## Fixes
### `coderd/chatd/chatd.go` (Subscribe method)
- **Async relay dial**: `openRelayAsync()` spawns a goroutine to dial
the remote replica. The result (channel + cancel func) is delivered on a
`relayReadyCh` channel that the select loop reads without blocking.
- **Relay reconnection**: When the relay channel closes, a 500ms timer
fires. The handler re-checks chat status from the DB and reopens the
relay if the chat is still running on a remote worker.
- **Snapshot parts via channel**: Relay snapshot + live parts are
wrapped into a single channel so they flow through the same path,
avoiding races with the select loop.
### `enterprise/coderd/chats.go` (newRemotePartsProvider)
- **Timer-based drain**: Replaced `default` with a 1-second timer. After
the first event, `Reset(0)` switches to non-blocking drain for remaining
buffered events.
- **Remove preloaded duplication**: The goroutine now only forwards new
events; snapshot events are returned to the caller directly.
## Testing
All existing tests pass:
- `TestInterruptChatBroadcastsStatusAcrossInstances`
- `TestSubscribeSnapshotIncludesStatusEvent`
- `TestSubscribeNoPubsubNoDuplicateMessageParts`
- `TestSubscribeAfterMessageID`
- `TestChatStreamRelay/RelayMessagePartsAcrossReplicas`
## Description
- Updates `wsbuilder` to return a `BuildError` with
`http.StatusBadRequest` to signify a "validation error" on missing or
invalid parameters
- Adds a short-circuit in `prebuilds.StoreReconciler` to mark presets
for which creating a build returns a "validation error" as "validation
failed" and skip further attempts to reconcile.
- Adds a test to verify the above
- Introduces a new Prometheus metric
`coderd_prebuilt_workspaces_preset_validation_failed` to track the above
Closes: https://github.com/coder/coder/issues/21237
---------
Co-authored-by: Cian Johnston <cian@coder.com>
This was a bad smell that was being addressed by the frontend. This type
was generating out to be a `nil`/`null` instead of an empty `License[]`.
Now this returns as an empty array and we can actively check if we have
no licenses with a length of `0`.
This pull-request implements a simple filtering logic so that we're able
to pick which model the user actually used when logs were sent to AI
Bridge.
- Add `GET /aibridge/models` API endpoint that returns distinct model
names from AI Bridge interceptions, with pagination and search support
- New `ListAIBridgeModels` SQL query using case-sensitive prefix
matching (`LIKE model || '%'`) to allow B-tree index usage
- Hand-written `ListAuthorizedAIBridgeModels` in `modelqueries.go` for
RBAC authorization filter injection
- `AIBridgeModels` search query parser in searchquery/search.go
(defaults bare terms to the `model` field)
- dbauthz wrappers, dbmetrics, and dbmock implementations for the new
query
<img width="292" height="185" alt="image"
src="https://github.com/user-attachments/assets/134771df-2d26-4c54-acc4-27f58128b351"
/>
- Previously all tests were sharing the global http.Transport meaning on
`Close` it would close connections presumed to be idle for other tests.
fixes https://github.com/coder/internal/issues/112
Fixes#22030
## Problem
When a template has `require_active_version = true` and a workspace is
outdated, the web UI always shows "Update and start" as the **only**
button (for all users including admins), but `coder start` starts with
the old version. For admins, this silently succeeds on the stale
version. For non-admins, it goes through a clunky 403→retry path. This
also affects the VS Code extension, which calls `coder start --yes`
under the hood.
## Root Cause
`buildWorkspaceStartRequest()` in `cli/start.go` checks
`workspace.AutomaticUpdates == "always"` but ignores
`workspace.TemplateRequireActiveVersion`. The server-side autostart
already ORs both settings together:
```go
// coderd/autobuild/lifecycle_executor.go
func useActiveVersion(opts, ws) bool {
return opts.RequireActiveVersion || ws.AutomaticUpdates == "always"
}
```
The CLI was missing the `RequireActiveVersion` check.
## Fix
Add `workspace.TemplateRequireActiveVersion` to the existing OR
condition:
```go
// Before:
if workspace.AutomaticUpdates == codersdk.AutomaticUpdatesAlways || action == WorkspaceUpdate {
// After:
if workspace.AutomaticUpdates == codersdk.AutomaticUpdatesAlways || workspace.TemplateRequireActiveVersion || action == WorkspaceUpdate {
```
Now `coder start` and `coder restart` proactively use the active
template version when `require_active_version` is set, matching the web
UI and server autostart behavior. The 403→retry fallback remains as a
safety net but is no longer the primary path for any user.
## Testing
Updated `enterprise/cli/start_test.go` — all user types (owner, template
admin, ACL admin, group ACL admin, member) now expect the active version
when `require_active_version` is set, and verify the 403→retry message
does NOT appear.
The provisioner state for a workspace build was being loaded for every
long-lived agent rpc connection. Since this state can be anywhere from
kilobytes to megabytes this can gradually cause the `coderd` memory
footprint to grow over time. It's also a lot of unnecessary allocations
for every query that fetches a workspace build since only a few callers
ever actually reference the provisioner state.
This PR removes it from the returned workspace build and adds a query to
fetch the provisioner state explicitly.
`--secure-auth-cookie` now automatically sources it's default value from `--access-url`
If the access url uses HTTPS, secure is set to `true`.
To revert to old behavior, set the value explicitly to `false`
In relation to
[`internal#1281`](https://github.com/coder/internal/issues/1281)
Remove the `soft_limit` field from the `Feature` type and simplify
license limit handling. This change:
- Removes the `soft_limit` field from the API and SDK
- Uses the soft limit value as the single `limit` value in the UI and
API
- Simplifies warning logic to only show warnings when the limit is
exceeded
- Updates tests to reflect the new behavior
- Updates the UI to use the single limit value for display
In relation to
[`internal#1281`](https://github.com/coder/internal/issues/1281)
Managed agent workspace build limits are now advisory only. Breaching
the limit no longer blocks workspace creation — it only surfaces a
warning.
- Removed hard-limit enforcement in `checkAIBuildUsage` so AI task
builds are always permitted regardless of managed agent count.
- Updated the license warning to remove "Further managed agent builds
will be blocked." verbiage.
- Updated tests to assert builds succeed beyond the limit instead of
failing.
- Removed the "Limit" display from the `ManagedAgentsConsumption`
progress bar — the bar is now relative to the included allowance (soft
limit) only, and turns orange when usage exceeds it.
Bonus:
- De-MUI'd `LicenseBannerView` — replaced Emotion CSS and MUI `Link`
with Tailwind classes.
- Added `highlight-orange` color token to the Tailwind theme.
Since Go 1.22, the loop variable capture issue is resolved. Variables
declared by for loops are now per-iteration rather than per-loop, making
the 'v := v' pattern unnecessary.
## Summary
> NOTE: Calling this out as a breaking change in case existing consumers
of the CLI depend on being able to see expired tokens OR being able to
delete tokens immediately.
Updates the `coder tokens rm` command to immediately expire a token by
ID, preserving the token record for audit trail purposes. Tokens can
still be deleted by passing `--delete`.
## Problem
During an incident on dev.coder.com, operators needed to urgently expire
an API key that was stuck in a hot loop. The only way to do this was via
direct database access:
```sql
UPDATE api_keys SET expires_at = NOW() WHERE id = '...';
```
This is not ideal for operators who may not have direct DB access or
want to avoid manual SQL.
## Solution
This PR adds:
- **API endpoint**: `PUT /api/v2/users/{user}/keys/{keyid}/expire` -
Sets the token's `expires_at` to now
- **SDK method**: `ExpireAPIKey(ctx, userID, keyID)`
- **Updates CLI**: `coder tokens rm <name|id|token>` now _expires_ by
default. You can still delete by passing the `--delete` flag. The `coder
tokens list` command now also hides expired tokens by default. You can
`--include-expired` if needed to include them.
- **Audit logging**: The expire action is logged with old and new key
states
## Test plan
- Tests cover: owner expiring own token, admin expiring other user's
token, non-admin cannot expire other's token, 404 for non-existent token
Closes#21782🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Problem
Site-wide admins (e.g., Owners) could not use `coder create --org <org>`
to create workspaces in organizations they are not members of. The error
was:
```
$ coder create my-workspace -t docker --org data-science
error: organization "data-science" not found, are you sure you are a member of this organization?
```
This was inconsistent with the web UI, where Owners can create
workspaces in any organization.
## Root Cause
The CLI's `OrganizationContext.Selected()` function only checked the
user's membership list, ignoring site-wide RBAC permissions that grant
Owners access to all organizations.
## Solution
Added a fallback in `OrganizationContext.Selected()` that fetches the
org directly via the API when not found in the membership list. This
works because the API endpoint applies RBAC filtering, allowing Owners
to read any org.
## Impact
This fixes `coder create --org` and all other CLI commands that use
`OrganizationContext.Selected()` (29+ commands), including:
- `coder templates push --org <any-org>`
- `coder organizations members add --org <any-org>`
- `coder provisioner list --org <any-org>`
## Testing
Added `TestEnterpriseCreate/OwnerCanCreateInNonMemberOrg` which:
- Creates an Owner user who is NOT a member of a second org
- Verifies they can create a workspace there using `--org`
- Properly fails without the code fix, passes with it
---
*This PR was generated by [mux](https://mux.coder.com) but reviewed by a
human.*
This PR adds some metrics to help identify job enqueue rates and
latencies. This work was initiated as a way to help reduce the cost of
the observation/measurement itself for autostart scaletests, which
impacts our ability to identify/reason about the load caused by
autostart. See: https://github.com/coder/internal/issues/1209
I've extended the metrics here to account for regular user initiated
builds, prebuilds, autostarts, etc. IMO there is still the question here
of whether we want to include or need the `transition` label, which is
only present on workspace builds. Including it does lead to an increase
in cardinality, and in the case of the histogram (when not using native
histograms) that's at least a few extra series for every bucket. We
could remove the transition label there but keep it on the counter.
Additionally, the histogram is currently observing latencies for other
jobs, such as template builds/version imports, those do not have a
transition type associated with them.
Tested briefly in a workspace, can see metric values like the following:
-
`coderd_workspace_builds_enqueued_total{build_reason="autostart",provisioner_type="terraform",status="success",transition="start"}
1`
-
`coderd_provisioner_job_queue_wait_seconds_bucket{build_reason="autostart",job_type="workspace_build",provisioner_type="terraform",transition="start",le="0.025"}
1`
---------
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Problem
CI failure showed 3 goroutines leaked in the prebuilds reconciler, all
stuck in `select` state:
1) `MetricsCollector.BackgroundFetch` (metrics goroutine)
2) `StoreReconciler.Run` (main reconciliation loop)
3) `StoreReconciler.Run.func3()` (provisioner job publisher goroutine)
All three goroutines were waiting for `ctx.Done()`, which likely means
`cancelFn()` was never called to trigger shutdown.
**Note:** I was unable to reproduce the flake locally. The likely cause
was a race condition between `Run()` and `Stop()` where `Stop()` could
check `running` (seeing `false`), return early, and then `Run()` would
start goroutines that never get cleaned up. This could happen in any
`coderd` test that starts a server with prebuilds enabled.
### Problems identified
1) Missing waitgoroup tracking: provisioner job publisher goroutine was
not tracked in the waitgroup, therefore, this goroutine was not tracked
for a clean shutdown in `Run defer func()`.
2) The provisioner job publisher goroutine had a redundant `case
<-c.done` that could race with `Stop()` select statement.
3) Race condition between `Run()` and `Stop()`: the `running` and
`stopped` fields were `atomic.Bool` values checked and set
independently, allowing a window where `Stop()` could see
`running=false` and return early, then `Run()` would set `running=true`
and start goroutines that would never be cleaned up. This could happen
in any `coderd` test that starts a server with prebuilds enabled.
## Changes
* Added `wg.Add(1)` and `defer wg.Done()` to track provisioner job
publisher goroutine in waitgroup
* Removed redundant `case <-c.done` from provisioner job publisher
goroutine to eliminate race condition
* Replaced `atomic.Bool` for `running` and `stopped` with a `sync.Mutex`
lifecycle state, also protecting `cancelFn` under the same mutex, to
eliminate the race between `Run()` and `Stop()`
* Added a guard in `Run()` to prevent double-start (`c.stopped ||
c.running`)
* Improved comments in Stop() and Run() to clarify shutdown behavior
Closes: https://github.com/coder/internal/issues/1116
Resolves the TODO in TestPool by adding TestPool_Expiry which uses Go
1.25's testing/synctest to verify TTL-based cache eviction.
I wanted to get familiar with the new `synctest` package in Go 1.25 and
found this TODO comment, so I decided to take a stab at it 😄