Commit Graph

760 Commits

Author SHA1 Message Date
Asher 24ab216dd1 feat: add new group members endpoint with filtering and pagination (#23067)
Partially addresses #21813 (still need to make changes to the "add user"
button to be complete)

Since there are a lot of user tests already, I moved them into
`coderdtest` to be shared.
2026-03-20 12:43:03 -08:00
Jaayden Halko 6f244cddde feat: display the addon license UI (#22948)
<img width="1052" height="234" alt="Screenshot 2026-03-18 at 21 58 57"
src="https://github.com/user-attachments/assets/136ccb1f-e47a-44fd-804d-859301161435"
/>

---------

Co-authored-by: Steven Masley <stevenmasley@gmail.com>
2026-03-20 16:34:17 +00:00
Ethan a1e912a763 fix(chatd): deliver retry control events via pubsub (#23349)
> **PR Stack**
> 1. #23351 ← `#23282`
> 2. #23282 ← `#23275`
> 3. #23275 ← `#23349`
> 4. **#23349** ← `main` *(you are here)*

---

Retry events were published only to the local in-process stream via
`publishEvent()`. When pubsub is active, `Subscribe()`'s merge loop only
forwarded durable events (messages, status, errors) from pubsub
notifications,
so retry events were silently dropped for cross-replica subscribers.

This adds a `publishRetry()` helper that publishes both locally and via
pubsub,
and extends the `Subscribe()` notification handler to forward retry
events.

**Changes:**
- `coderd/pubsub/chatstreamnotify.go`: add `Retry` field to notify
message
- `coderd/chatd/chatd.go`: add `publishRetry()`, update `OnRetry`
callback,
  extend `Subscribe()` to forward `notify.Retry`
- `coderd/chatd/chatd_internal_test.go`: focused pubsub delivery test
- `enterprise/coderd/chatd/chatd_test.go`: cross-replica end-to-end test
2026-03-20 15:19:41 +00:00
Kyle Carberry d8ff67fb68 feat: add MCP server configuration backend for chats (#23227)
## Summary

Adds the database schema, API endpoints, SDK types, and encryption
wrappers for admin-managed MCP (Model Context Protocol) server
configurations that chatd can consume. This is the backend foundation
for allowing external MCP tools (Sentry, Linear, GitHub, etc.) to be
used during AI chat sessions.

## Database

Two new tables:
- **`mcp_server_configs`**: Admin-managed server definitions with URL,
transport (Streamable HTTP / SSE), auth config (none / OAuth2 / API key
/ custom headers), tool allow/deny lists, and an availability policy
(`force_on` / `default_on` / `default_off`). Includes CHECK constraints
on transport, auth_type, and availability values.
- **`mcp_server_user_tokens`**: Per-user OAuth2 tokens for servers
requiring individual authentication. Cascades on user/config deletion.

New column on `chats` table:
- **`mcp_server_ids UUID[]`**: Per-chat MCP server selection, following
the same pattern as `model_config_id` — passed at chat creation,
changeable per-message with nil-means-no-change semantics.

## API Endpoints

All routes are under `/api/experimental/mcp/servers/` and gated behind
the `agents` experiment.

**Admin endpoints** (`ResourceDeploymentConfig` auth):
- `POST /` — Create MCP server config
- `PATCH /{id}` — Update MCP server config (full-replace)
- `DELETE /{id}` — Delete MCP server config

**Authenticated endpoints** (all users, enabled servers only for
non-admins):
- `GET /` — List configs (admins see all, members see enabled-only with
admin fields redacted)
- `GET /{id}` — Get config by ID (with `auth_connected` populated
per-user)

**OAuth2 per-user auth flow:**
- `GET /{id}/oauth2/connect` — Initiate OAuth2 flow (state cookie CSRF
protection)
- `GET /{id}/oauth2/callback` — Handle OAuth2 callback, store tokens
- `DELETE /{id}/oauth2/disconnect` — Remove stored OAuth2 tokens

## Security

- **Secrets never returned**: `OAuth2ClientSecret`, `APIKeyValue`, and
`CustomHeaders` are never in API responses — only boolean indicators
(`has_oauth2_secret`, `has_api_key`, `has_custom_headers`).
- **Field redaction for non-admins**: `convertMCPServerConfigRedacted`
strips `OAuth2ClientID`, auth URLs, scopes, and `APIKeyHeader` from
non-admin responses.
- **dbcrypt encryption at rest**: All 5 secret fields use `dbcrypt_keys`
encryption with full encrypt-on-write / decrypt-on-read wrappers (11
dbcrypt method overrides + 2 helpers), following the same pattern as
`chat_providers.api_key`.
- **OAuth2 CSRF protection**: State parameter stored in `HttpOnly`
cookie with `HTTPCookies.Apply()` for correct `Secure`/`SameSite` behind
TLS-terminating proxies.
- **dbauthz authorization**: All 18 querier methods have authorization
wrappers. Read operations use `ActionRead`, write operations use
`ActionUpdate` on `ResourceDeploymentConfig`.

## Governance Model

| Control | Implementation |
|---------|---------------|
| **Global kill switch** | `enabled` defaults to `false` |
| **Availability policy** | `force_on` (always injected), `default_on`
(pre-selected), `default_off` (opt-in) |
| **Per-chat selection** | `mcp_server_ids` on `CreateChatRequest` /
`CreateChatMessageRequest` |
| **Auth gate** | OAuth2 servers require per-user auth before tools are
injected |
| **Tool-level allow/deny** | Arrays on `mcp_server_configs` for
granular tool filtering |
| **Secrets encrypted at rest** | Uses `dbcrypt_keys` (same pattern as
`chat_providers.api_key`) |

## Tests

8 test functions covering:
- Full CRUD lifecycle (create, list, update, delete)
- Non-admin visibility filtering (enabled-only, field redaction)
- `auth_connected` population for OAuth2 vs non-OAuth2 servers
- Availability policy validation (valid values + invalid rejection)
- Unique slug enforcement (409 Conflict)
- OAuth2 disconnect idempotency
- Chat creation with `mcp_server_ids` persistence

## Known Limitations (Deferred)

These are documented and intentional for an experimental feature:
- **Audit logging** not yet wired — will add when feature stabilizes
- **Cross-field validation** (e.g., OAuth2 fields required when
`auth_type=oauth2`) — admin-only endpoint, will add when stabilizing
- **`force_on` auto-injection** — query exists but not yet wired into
chatd tool injection (follow-up)
- **Additional test coverage** — 403 auth tests, GET-by-ID tests,
callback CSRF tests planned for follow-up

## What's NOT in this PR

- Frontend UI (admin panel + chat picker)
- Actual MCP client connections (`chatd/chatmcp/` manager)
- Tool injection into `chatloop/`
2026-03-19 14:07:36 +00:00
Steven Masley 84de391f26 chore: add tallyman events for ai seat tracking (#22689)
AI seat tracking inserted as heartbeat into usage table.
2026-03-18 09:30:22 -05:00
George K 91ec0f1484 feat: add service_accounts workspace sharing mode (#23093)
Introduce a three-way workspace sharing setting (none, everyone,
service_accounts) replacing the boolean workspace_sharing_disabled.
In service_accounts mode, only service account-owned workspaces can be
shared while regular members' share permissions are removed. Adds a
new organization-service-account system role with per-org permissions
reconciled alongside the existing organization-member system role.

Related to:
https://linear.app/codercom/issue/PLAT-28/feat-service-accounts-sharing-mode-and-rbac-role

---------

Co-authored-by: Steven Masley <Emyrk@users.noreply.github.com>
Co-authored-by: Kayla はな <mckayla@hey.com>
2026-03-17 12:16:43 -07:00
Steven Masley 93b9d70a9b chore: add audit log entry when ai seat is consumed (#22683)
When an ai seat is consumed, an audit log entry is made. This only happens the first time a seat is used.
2026-03-16 15:30:25 -05:00
Steven Masley abf59ee7a6 feat: track ai seat usage (#22682)
When a user uses an AI feature, we record them in the `ai_seat_state` as consuming a seat. 

Added in debouching to prevent excessive writes to the db for this feature. There is no need for frequent updates.
2026-03-16 12:36:26 -05:00
Mathias Fredriksson bdbcd3428b feat(coderd/chatd): unify chat storage on SDK parts and fix file-reference rendering (#22958)
File-reference parts in user messages were flattened to `TextContent` at
write time because fantasy has no file-reference content type. The
frontend never saw them as structured parts.

This moves all write paths (user, assistant, tool) from fantasy envelope
format to `codersdk.ChatMessagePart`. The streaming layer (`chatloop`)
is untouched, the conversion happens at the serialization boundary in
`persistStep`.

Old rows are still readable. `ParseContent` uses a structural heuristic
(`isFantasyEnvelopeFormat`) to distinguish legacy envelopes from SDK
parts. We chose this over try/fallback because fantasy envelopes
partially unmarshal into `ChatMessagePart` (the `type` field matches)
while silently losing content. A guard test enforces that no SDK part
can produce the envelope shape.

This is forward-only: new rows are unreadable by old code. Chat is
behind a feature flag so rollback risk is contained.

Also adds a typed `ChatMessageRole` to replace raw strings and
`fantasy.MessageRole*` casts at the persistence boundary. The type
covers `ChatMessage.Role`, `ChatStreamMessagePart.Role`, the
`PublishMessagePart` callback chain, and all DB write sites.
`fantasy.MessageRole*` remains only where we build `fantasy.Message`
structs for LLM dispatch.

Separately, `ProviderMetadata` was leaking to SSE clients via
`publishMessagePart`. `StripInternal` now runs on both the SSE and REST
paths, covering this.

Other cleanup:

- Old `db2sdk.contentBlockToPart` silently dropped metadata on
text/reasoning/tool-call content. New code preserves it.
- `providerMetadataToOptions` now logs warnings instead of silently
returning nil.
- `db2sdk` shrinks from ~250 lines of parallel conversion to ~15 lines
delegating to `chatprompt.ParseContent()`, removing the `fantasy` import
entirely.

Refs #22821
2026-03-13 17:53:26 +02:00
Mathias Fredriksson 57af7abf1f test: add testutil.WaitBuffer and replace time.Sleep in tests (#22922)
WaitBuffer is a thread-safe io.Writer that supports blocking until
accumulated output matches a substring or custom predicate. It
replaces ad-hoc safeBuffer/syncWriter types and time.Sleep-based
poll loops in tests with signal-driven waits.

- WaitFor/WaitForNth/WaitForCond for blocking on output
- Replace custom buffer types in cli/sync_test.go and
  provisionersdk/agent_test.go
- Convert time.Sleep poll loops to require.Eventually/require.Never
  in cli/ssh_test.go, coderd/activitybump_test.go,
  coderd/workspaceagentsrpc_test.go, workspaceproxy_test.go, and
  scaletest tests
2026-03-12 18:07:52 +02:00
George K e5c19d0af4 feat: backend support for creating and storing service accounts (#22698)
Add is_service_account column to users table with CHECK constraints
enforcing login_type='none' and empty email for service accounts.
Update user creation API to validate service account constraints.

Related to:
https://linear.app/codercom/issue/PLAT-27/feat-backend-support-for-creating-and-storing-service-accounts
2026-03-11 10:19:08 -07:00
Kyle Carberry eecb7d0b66 fix: resolve bugs in chatd streaming system (#22720)
Split from #22693 per review feedback.

Fixes multiple bugs in coderd/chatd and sub-packages including race
conditions, transaction safety, stream buffer bounds, retry limits, and
enterprise relay improvements.

See commit message for full list.
2026-03-06 21:02:25 +00:00
Danny Kopping 13e3df67d6 feat: track client sessions (#22470)
This change adds support for tracking client session IDs in AI Bridge interceptions to enable better session-based auditing.

Depends on https://github.com/coder/aibridge/pull/198  
Fixes https://github.com/coder/internal/issues/1337

The session ID field is optional and not universally supported by all clients.
2026-03-06 14:43:53 +02:00
Cian Johnston 81468323e0 fix(coderd): use dbtime.Now() instead of time.Now() in test assertions against DB timestamps (#22685)
`time.Now()` has nanosecond precision while Postgres timestamps are
microsecond precision. When tests compare `time.Now()` against
DB-sourced timestamps using `Before`/`After`/`WithinRange`/etc., there
is a non-zero flake risk from the precision mismatch.

This replaces `time.Now()` with `dbtime.Now()` (which rounds to
microsecond precision) in all test assertions that compare against
database timestamps.

Follows from #22684.

## Changes (11 files)

| File | Changes |
|---|---|
| `coderd/apikey_test.go` | 11 comparisons with `ExpiresAt` |
| `coderd/users_test.go` | 2 comparisons with `ExpiresAt` |
| `coderd/oauth2_test.go` | 1 comparison with `token.Expiry` |
| `coderd/workspaces_test.go` | 2 comparisons with `DormantAt` |
| `coderd/workspaceagents_test.go` | 3 comparisons with
`ConnectedAt`/`DisconnectedAt` |
| `coderd/workspaceapps/db_test.go` | 1 comparison with `token.Expiry` |
| `coderd/provisionerdserver/provisionerdserver_test.go` | 1 comparison
with `key.ExpiresAt` |
| `enterprise/coderd/workspaces_test.go` | 1 comparison with `DormantAt`
|
| `enterprise/coderd/license/license_test.go` | 3 `NotBefore` values |
| `enterprise/coderd/licenses_test.go` | 2 `NotBefore` values |
| `enterprise/coderd/users_test.go` | 3 `Next()` comparisons |

## Not changed (intentionally)

- `scaletest/placebo/run_test.go` — compares wall-clock elapsed time,
not DB timestamps
- `cli/server_test.go`, `coderd/jwtutils/jwt_test.go`,
`enterprise/aibridgeproxyd/aibridgeproxyd_test.go` — TLS cert fields,
not DB-stored
- `coderd/azureidentity/azureidentity_test.go` — Azure cert expiry, not
DB


🤖 Generated by Claude Opus 4.6 but reviewed manually.
2026-03-06 09:14:11 +00:00
Danielle Maywood f91475cd51 test: remove unnecessary dbauthz.AsSystemRestricted calls in tests (#22663) 2026-03-05 20:29:49 +00:00
Kyle Carberry 94a2e440a8 fix(chatd): extract session token from cookie for relay header (#22649)
## Problem

When a browser connects to the chat stream via WebSocket, it
authenticates using cookies only — the native WebSocket API cannot set
custom headers like `Coder-Session-Token`. The relay between replicas
copies the original request's `Cookie` header but did **not** set the
`Coder-Session-Token` header as a fallback.

This causes a **401 on the worker replica** when `EnableHostPrefix` is
enabled, because the `HTTPCookies.Middleware` strips bare
`coder_session_token` cookies (expecting the `__Host-` prefix). Without
a `Coder-Session-Token` header fallback, `apiKeyMiddleware` finds no
valid credentials.

### Root Cause

The data flow:
1. Browser → subscriber replica: `Cookie:
__Host-coder_session_token=xxx` (browser sends prefixed cookie)
2. Subscriber's `HTTPCookies.Middleware` normalizes: `Cookie:
coder_session_token=xxx` (strips prefix)
3. `relayHeaders()` copies `Cookie: coder_session_token=xxx` to relay
request
4. Worker replica's `HTTPCookies.Middleware` sees bare
`coder_session_token` → **strips it** (expects `__Host-` prefix)
5. `apiKeyMiddleware` → `APITokenFromRequest`: no cookie, no header →
**401**

## Fix

Modified `relayHeaders()` to extract the session token value from the
`Cookie` header and set it as the `Coder-Session-Token` header when no
explicit session token header is already present. The header is never
stripped by middleware, so the worker replica can always authenticate.

## Testing

- **`TestRelayHeaders`**: Unit tests for the updated `relayHeaders()`
function covering all scenarios (cookie-only, header+cookie, no auth,
nil source)
- **`TestExtractSessionTokenFromCookieHeader`**: Unit tests for the
helper function
- **`TestChatStreamRelay/RelayCookieOnlyAuth`**: Integration test with
plain HTTP, cookie-only WebSocket auth
- **`TestChatStreamRelay/RelayCookieOnlyAuthWithHostPrefix`**:
Integration test with `EnableHostPrefix=true`, confirming the 401 is
fixed
- **`cookieOnlySessionTokenProvider`**: Test helper that simulates
browser WebSocket behavior (sets Cookie header only on WebSocket dials,
no custom headers)

## Files Changed

- `enterprise/coderd/chatd/chatd.go` — `relayHeaders()` fix +
`extractSessionTokenFromCookieHeader()` helper
- `enterprise/coderd/chatd/relay_headers_internal_test.go` — unit tests
(new file)
- `enterprise/coderd/chats_test.go` — integration tests + test helper
type
2026-03-05 05:11:07 +00:00
Kyle Carberry 219d02bdc3 fix(coderd): poll for metrics in TestWorkspaceProvisionerdServerMetrics (#22644)
## Problem

`TestWorkspaceProvisionerdServerMetrics` flakes because metric
assertions run immediately after
`AwaitWorkspaceBuildJobCompleted` returns, but metrics are updated
**asynchronously after the
DB transaction commits** in `completeWorkspaceBuildJob`.

The timeline in the provisioner server:
1. DB transaction commits (`provisionerdserver.go:~2362`) — job marked
completed
2. Audit logging, notifications, DB queries (`~2370-2427`)
3. **Metric `.Observe()`** (`~2463`) — happens ~100 lines later

The test synchronization (`AwaitWorkspaceBuildJobCompleted`) polls for
`CompletedAt != nil`,
which fires at step 1. The metric assertion then executes before step 3,
causing the flake.

## Fix

Wrap all three metric assertions (prebuild creation, prebuild claim,
regular workspace
creation) in `require.Eventually` to poll until the metric appears, then
assert on the value.

## Test

- `go test -run TestWorkspaceProvisionerdServerMetrics -count=5` — all
pass
- `go test -race -run TestWorkspaceProvisionerdServerMetrics -count=1` —
clean
2026-03-04 22:30:36 -05:00
Kyle Carberry 63b6868113 fix(codersdk): propagate HTTPClient to websocket.Dial for TLS relay (#22642)
## Problem

In multi-replica Coder deployments, the chat relay WebSocket between
replicas fails with HTTP 401 (or TLS handshake errors). The subscriber
replica cannot relay `message_part` events from the worker replica.

**Root cause:** `codersdk.Client.Dial()` does not pass `c.HTTPClient` to
`websocket.DialOptions.HTTPClient`. The websocket library
(`github.com/coder/websocket`) falls back to `http.DefaultClient`, which
lacks the mesh TLS configuration needed for inter-replica communication.

The relay code in `enterprise/coderd/chatd/chatd.go` correctly sets
`sdkClient.HTTPClient = cfg.ReplicaHTTPClient` (which has mesh TLS
certs), but that client was never used for the actual WebSocket
handshake.

## Fix

One-line fix in `codersdk/client.go`: propagate `c.HTTPClient` to
`opts.HTTPClient` when the caller hasn't already set one.

## Test

Added `TestChatStreamRelay/RelayWithTLSAndCookieAuth` which:
- Sets up two replicas with TLS certificates (simulating mesh TLS in
production)
- Authenticates via cookies (simulating browser WebSocket behavior)
- Verifies message_part events relay across replicas over TLS

This test times out without the fix because the WebSocket handshake
fails with `x509: certificate signed by unknown authority`
(http.DefaultClient rejects self-signed certs).

## Related

Follow-up to #22635 which fixed the `redirectToAccessURL` middleware
bypassing 307 redirects for relay requests. That fix changed the error
from HTTP 200 to HTTP 401, exposing this deeper issue.
2026-03-04 21:57:23 -05:00
Kyle Carberry 30d534b36b fix(chatd): fix relay race conditions, extract enterprise relay logic, move pubsub to OSS (#22589)
## Summary

Fixes a bug where interrupting a streaming chat and sending a new
message
left the relay connected to the wrong replica. Expanded into a broader
refactor that cleanly separates concerns:

- **OSS** owns pubsub subscription, message catch-up, queue updates,
  status forwarding, and local parts merging.
- **Enterprise** (`enterprise/coderd/chatd`) only manages relay dialing,
  reconnection, and stale-dial discarding for cross-replica streaming.

## Architecture

### OSS `coderd/chatd/chatd.go`

`Subscribe()` builds the initial snapshot then runs a single merge
goroutine that handles:

- Pubsub subscription for durable events (status, messages, queue,
errors)
- Message catch-up via `AfterMessageID`
- Local `message_part` forwarding
- Relay events from enterprise (when `SubscribeFn` is set)
- Sends `StatusNotification` to enterprise so it can manage relay
lifecycle

Key types:

- `SubscribeFn` — enterprise hook, returns relay-only events channel
- `SubscribeFnParams` — `ChatID`, `Chat`, `WorkerID`,
`StatusNotifications`, `RequestHeader`, `DB`, `Logger`
- `StatusNotification` — `Status` + `WorkerID`, sent to enterprise on
pubsub status changes

### Enterprise `enterprise/coderd/chatd/chatd.go`

`NewMultiReplicaSubscribeFn(cfg MultiReplicaSubscribeConfig)` returns a
`SubscribeFn` that:

- Opens an initial synchronous relay if the chat is running on a remote
worker
- Reads `StatusNotifications` from OSS to open/close relay connections
- Handles async dial, reconnect timers, stale-dial discarding
- Returns only relay `message_part` events

## Bug fixes

### Original bug: stale relay dial after interrupt

`openRelayAsync` goroutines used `mergedCtx` (subscription-level), not a
per-dial context. `closeRelay()` could not cancel in-flight dials. When
the user interrupts and a new replica picks up the chat, the old dial
goroutine could complete after the new one and deliver a stale
`relayResult`.

**Fix**: per-dial `dialCtx`/`dialCancel`, `expectedWorkerID` tracking,
`workerID` on `relayResult`. `closeRelay()` cancels the dial context and
drains `relayReadyCh`. Merge loop rejects mismatched worker IDs.

### Additional fixes

- `statusNotifications` send-on-closed-channel race — goroutine now owns
  `close()` via defer
- Enterprise spin-loop on `StatusNotifications` close — two-value
receive
  with nil-out
- `hasPubsub` set from `p.pubsub != nil` instead of subscription success
  — now tracks actual subscription result
- `lastMessageID` not initialized from `afterMessageID` — caused
  duplicate messages on catch-up
- `wrappedParts` goroutine leaked remote connection on `dialCtx` cancel
- `closeRelay()` did not drain `relayReadyCh`
- `setChatWaiting` race with `SendMessage(Interrupt)` — wrapped in
`InTx`
- `processChat` post-TX side effects fired when chat was taken by
another
  worker — added `errChatTakenByOtherWorker` sentinel
- Cancel closure data race on `reconnectTimer`
- Bare blocking send on pubsub error path
- `localParts` hot-spin after channel close
- No-pubsub branch dropped relay events and initial snapshot
- Failed relay dial caused permanent stall (no reconnect retry)
- DB error during reconnect timer caused permanent stall
- `time.NewTimer` replaced with `quartz.Clock` for testable timing

## Tests

9 enterprise tests covering:

- Relay reconnect on drop (mock clock)
- Async dial does not block merge loop
- Relay snapshot delivery
- Stale dial discarded after interrupt
- Cancel during in-flight dial
- Running-to-running worker switch
- Failed dial retries (mock clock)
- Local worker closes relay
- Multiple consecutive reconnects (mock clock)

All pass with `-race`.
2026-03-04 18:42:28 -05:00
Kayla はな e35717bc19 fix: show a notice when workspace sharing is disabled globally in organization settings (#22580) 2026-03-04 11:14:52 -07:00
Sas Swart 8c09df52f9 fix(coderd): use WaitSuperLong in TestReinitializeAgent (#22593)
Fixes coder/internal#642

We recently fixed Windows specific flakes for this test and reenabled
it. It then failed intermittently due to context deadline expiration.
The temporary path created on Windows contained invalid characters. This
resulted in a silent startup script failure on Windows. The test then
fruitlessly waited until context expiration. The test now uses a valid
path on Windows.
2026-03-04 15:22:43 +02:00
Spike Curtis 56eb57caf4 chore: enable agent socket by default (#22352)
relates to #21335

Enables the agent socket by default and updates docs to strike references to having to enable it.

The PRs in this stack change the MCP server that Tasks use to update their status to rely on the agent socket, rather than directly dialing Coderd with the agent token.

Default disable was a reasonable default when it was only used for the experimental script ordering features, but now that we want to use it for Tasks, it should be default on.
2026-03-03 21:23:59 +04:00
Sas Swart e563766722 tests: re-enable 'TestReinitializeAgent' on Windows (#22488)
closes https://github.com/coder/internal/issues/642

This PR:
* re-enables `func TestReinitializeAgent(t *testing.T)`
* adjusts it to use a Windows specific command in Windows environments
2026-03-03 11:22:02 +02:00
Kyle Carberry b7a7683ac0 fix(chatd): harden cross-replica relay for chat stream parts (#22533)
## Problem

Subscribers connecting to a different replica than the one running the
chat see full messages appear but no streaming partials (`message_part`
events). The relay mechanism that forwards ephemeral parts across
replicas had several bugs.

## Root Causes

1. **`openRelay()` blocked the event loop** — The WebSocket dial (TCP +
TLS + HTTP upgrade) to the worker replica ran synchronously inside the
select loop. While dialing, no events could be processed, channels
filled up, and parts were silently dropped.

2. **Relay drops were permanent** — When the relay WebSocket closed
mid-stream, `relayParts` was set to nil and never reopened. No status
notification would re-trigger it since the chat was still running on the
same worker.

3. **`drainInitial` snapshot race** — The `default` case in the initial
drain loop caused the snapshot to be empty if the remote hadn't flushed
data yet (common immediately after WebSocket connect).

4. **Duplicate event delivery** — The `preloaded` slice caused snapshot
events to be sent both in the return value and re-sent through the
channel goroutine.

## Fixes

### `coderd/chatd/chatd.go` (Subscribe method)
- **Async relay dial**: `openRelayAsync()` spawns a goroutine to dial
the remote replica. The result (channel + cancel func) is delivered on a
`relayReadyCh` channel that the select loop reads without blocking.
- **Relay reconnection**: When the relay channel closes, a 500ms timer
fires. The handler re-checks chat status from the DB and reopens the
relay if the chat is still running on a remote worker.
- **Snapshot parts via channel**: Relay snapshot + live parts are
wrapped into a single channel so they flow through the same path,
avoiding races with the select loop.

### `enterprise/coderd/chats.go` (newRemotePartsProvider)
- **Timer-based drain**: Replaced `default` with a 1-second timer. After
the first event, `Reset(0)` switches to non-blocking drain for remaining
buffered events.
- **Remove preloaded duplication**: The goroutine now only forwards new
events; snapshot events are returned to the caller directly.

## Testing

All existing tests pass:
- `TestInterruptChatBroadcastsStatusAcrossInstances`
- `TestSubscribeSnapshotIncludesStatusEvent`
- `TestSubscribeNoPubsubNoDuplicateMessageParts`
- `TestSubscribeAfterMessageID`
- `TestChatStreamRelay/RelayMessagePartsAcrossReplicas`
2026-03-02 19:57:13 -05:00
Kyle Carberry edee917d88 feat: add experimental agents support (#22290)
feat: add AI chat system with agent tools and chat UI

Introduce the chatd subsystem and Agents UI for AI-powered chat
within Coder workspaces.

- Add chatd package with chat loop, message compaction, prompt
  management, and LLM provider integration (OpenAI, Anthropic)
- Add agent tools: create workspace, list/read templates, read/write/
  edit files, execute commands
- Add chat API endpoints with streaming, message editing, and
  durable reconnection
- Add database schema and migrations for chats, chat messages, chat
  providers, and chat model configs
- Add RBAC policies and dbauthz enforcement for chat resources
- Add Agents UI pages with conversation timeline, queued messages
  list, diff viewer, and model configuration panel
- Add comprehensive test coverage including coderd integration tests,
  chatd unit tests, and Storybook stories
- Gate feature behind experiments flag

---------

Co-authored-by: Cian Johnston <cian@coder.com>
Co-authored-by: Danielle Maywood <danielle@themaywoods.com>
Co-authored-by: Jeremy Ruppel <jeremy@coder.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-27 16:50:56 +00:00
Susana Ferreira ca234f346d fix: mark presets as validation_failed to prevent endless prebuild retries (#22085)
## Description

- Updates `wsbuilder` to return a `BuildError` with
`http.StatusBadRequest` to signify a "validation error" on missing or
invalid parameters
- Adds a short-circuit in `prebuilds.StoreReconciler` to mark presets
for which creating a build returns a "validation error" as "validation
failed" and skip further attempts to reconcile.
- Adds a test to verify the above
- Introduces a new Prometheus metric
`coderd_prebuilt_workspaces_preset_validation_failed` to track the above

Closes: https://github.com/coder/coder/issues/21237

---------

Co-authored-by: Cian Johnston <cian@coder.com>
2026-02-27 14:26:48 +00:00
Jake Howell a51eb40dca fix: marshal convertLicenses() into a [] instead of nil (#22366)
This was a bad smell that was being addressed by the frontend. This type
was generating out to be a `nil`/`null` instead of an empty `License[]`.
Now this returns as an empty array and we can actively check if we have
no licenses with a length of `0`.
2026-02-28 00:23:41 +11:00
Dean Sheather bef7eb9dcc fix: avoid derp-related panic during wsproxy registration (#22322) 2026-02-27 00:07:14 +11:00
Jake Howell d2787df442 feat: add AI Bridge request logs model filter (#22230)
This pull-request implements a simple filtering logic so that we're able
to pick which model the user actually used when logs were sent to AI
Bridge.

- Add `GET /aibridge/models` API endpoint that returns distinct model
names from AI Bridge interceptions, with pagination and search support
- New `ListAIBridgeModels` SQL query using case-sensitive prefix
matching (`LIKE model || '%'`) to allow B-tree index usage
- Hand-written `ListAuthorizedAIBridgeModels` in `modelqueries.go` for
RBAC authorization filter injection
- `AIBridgeModels` search query parser in searchquery/search.go
(defaults bare terms to the `model` field)
- dbauthz wrappers, dbmetrics, and dbmock implementations for the new
query

<img width="292" height="185" alt="image"
src="https://github.com/user-attachments/assets/134771df-2d26-4c54-acc4-27f58128b351"
/>
2026-02-26 02:40:45 +11:00
Jon Ayers 4f34452bcc fix: use separate http.Transports for wsproxy tests (#22292)
- Previously all tests were sharing the global http.Transport meaning on
`Close` it would close connections presumed to be idle for other tests.
fixes https://github.com/coder/internal/issues/112
2026-02-24 23:56:58 -06:00
Jon Ayers 0a7a3da178 fix: exclude provisioner_state from workspace_build_with_user view (#22159)
The provisioner state for a workspace build was being loaded for every
long-lived agent rpc connection. Since this state can be anywhere from
kilobytes to megabytes this can gradually cause the `coderd` memory
footprint to grow over time. It's also a lot of unnecessary allocations
for every query that fetches a workspace build since only a few callers
ever actually reference the provisioner state.

This PR removes it from the returned workspace build and adds a query to
fetch the provisioner state explicitly.
2026-02-23 22:46:17 -06:00
Sushant P 37a8e61ea2 chore: move Shared Workspaces from experiments to beta (#22206)
* Removed the shared-workspaces experiment and cleaned up related
middleware
* Added beta tagging to the UI for shared workspaces
2026-02-23 08:30:32 -08:00
Jake Howell d700f9ebc4 fix: restore block to Managed Agents on Enterprise (#22210)
#21998 accidentally allowed `Managed Agents` usages whilst being on an
`Enterprise` license. This was incorrect, it should work as the
following (same as prior to #21998).

| Scenario | Before your PRs | After your PRs (bug) | After this fix |
|---|---|---|---|
| Unlicensed (AGPL) | Permitted | Permitted | Permitted |
| Licensed, no entitlement | **Blocked** | Permitted | **Blocked** |
| Licensed, explicitly disabled (limit=0) | **Blocked** | Permitted |
**Blocked** |
| Licensed, entitled, under limit | Permitted | Permitted | Permitted |
| Licensed, entitled, over limit | Blocked | Permitted (advisory) |
Permitted (advisory) |
| Any license, stop/delete | Permitted | Permitted | Permitted |
| Any license, non-AI build | Permitted | Permitted | Permitted |
2026-02-20 20:15:32 +11:00
Jake Howell 051ed34580 feat: convert soft_limit to limit (#22048)
In relation to
[`internal#1281`](https://github.com/coder/internal/issues/1281)

Remove the `soft_limit` field from the `Feature` type and simplify
license limit handling. This change:

- Removes the `soft_limit` field from the API and SDK
- Uses the soft limit value as the single `limit` value in the UI and
API
- Simplifies warning logic to only show warnings when the limit is
exceeded
- Updates tests to reflect the new behavior
- Updates the UI to use the single limit value for display
2026-02-20 16:09:12 +11:00
Jake Howell 203899718f feat: remove agent workspaces limit (#21998)
In relation to
[`internal#1281`](https://github.com/coder/internal/issues/1281)

Managed agent workspace build limits are now advisory only. Breaching
the limit no longer blocks workspace creation — it only surfaces a
warning.

- Removed hard-limit enforcement in `checkAIBuildUsage` so AI task
builds are always permitted regardless of managed agent count.
- Updated the license warning to remove "Further managed agent builds
will be blocked." verbiage.
- Updated tests to assert builds succeed beyond the limit instead of
failing.
- Removed the "Limit" display from the `ManagedAgentsConsumption`
progress bar — the bar is now relative to the included allowance (soft
limit) only, and turns orange when usage exceeds it.

Bonus:

- De-MUI'd `LicenseBannerView` — replaced Emotion CSS and MUI `Link`
with Tailwind classes.
- Added `highlight-orange` color token to the Tailwind theme.
2026-02-20 12:56:00 +11:00
Danielle Maywood 92a6d6c2c0 chore: remove unnecessary loop variable captures (#22180)
Since Go 1.22, the loop variable capture issue is resolved. Variables
declared by for loops are now per-iteration rather than per-loop, making
the 'v := v' pattern unnecessary.
2026-02-19 09:02:19 +00:00
Paweł Banaszewski 90c11f3386 feat: add client column to aibridge_interceptions table (#21839)
Adds `client` column to `aibridge_interceptions` table. It is set accordingly to what is passed from AI Bridge in `RecordInterception`.
Adds interception filtering by `client` value.

Depends on: https://github.com/coder/aibridge/pull/158
Updates aibridge library to include this change.

Fixes: https://github.com/coder/aibridge/issues/31
2026-02-17 15:43:02 +01:00
Steven Masley 01f06671a1 chore: return 404, not 400 if missing or authz deny (#22069) 2026-02-13 08:19:07 -06:00
Callum Styan 5f3be6b288 feat: add provisioner job queue wait time histogram and jobs enqueued counter (#21869)
This PR adds some metrics to help identify job enqueue rates and
latencies. This work was initiated as a way to help reduce the cost of
the observation/measurement itself for autostart scaletests, which
impacts our ability to identify/reason about the load caused by
autostart. See: https://github.com/coder/internal/issues/1209

I've extended the metrics here to account for regular user initiated
builds, prebuilds, autostarts, etc. IMO there is still the question here
of whether we want to include or need the `transition` label, which is
only present on workspace builds. Including it does lead to an increase
in cardinality, and in the case of the histogram (when not using native
histograms) that's at least a few extra series for every bucket. We
could remove the transition label there but keep it on the counter.

Additionally, the histogram is currently observing latencies for other
jobs, such as template builds/version imports, those do not have a
transition type associated with them.

Tested briefly in a workspace, can see metric values like the following:
-
`coderd_workspace_builds_enqueued_total{build_reason="autostart",provisioner_type="terraform",status="success",transition="start"}
1`
-
`coderd_provisioner_job_queue_wait_seconds_bucket{build_reason="autostart",job_type="workspace_build",provisioner_type="terraform",transition="start",le="0.025"}
1`

---------

Signed-off-by: Callum Styan <callumstyan@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-12 13:40:47 -08:00
Susana Ferreira 220b9f3cc5 fix: track goroutines and fix race condition in reconciler (#21980)
## Problem

CI failure showed 3 goroutines leaked in the prebuilds reconciler, all
stuck in `select` state:

1) `MetricsCollector.BackgroundFetch` (metrics goroutine)
2) `StoreReconciler.Run` (main reconciliation loop)
3) `StoreReconciler.Run.func3()` (provisioner job publisher goroutine)

All three goroutines were waiting for `ctx.Done()`, which likely means
`cancelFn()` was never called to trigger shutdown.

**Note:** I was unable to reproduce the flake locally. The likely cause
was a race condition between `Run()` and `Stop()` where `Stop()` could
check `running` (seeing `false`), return early, and then `Run()` would
start goroutines that never get cleaned up. This could happen in any
`coderd` test that starts a server with prebuilds enabled.

### Problems identified

1) Missing waitgoroup tracking: provisioner job publisher goroutine was
not tracked in the waitgroup, therefore, this goroutine was not tracked
for a clean shutdown in `Run defer func()`.
2) The provisioner job publisher goroutine had a redundant `case
<-c.done` that could race with `Stop()` select statement.
3) Race condition between `Run()` and `Stop()`: the `running` and
`stopped` fields were `atomic.Bool` values checked and set
independently, allowing a window where `Stop()` could see
`running=false` and return early, then `Run()` would set `running=true`
and start goroutines that would never be cleaned up. This could happen
in any `coderd` test that starts a server with prebuilds enabled.

## Changes

* Added `wg.Add(1)` and `defer wg.Done()` to track provisioner job
publisher goroutine in waitgroup
* Removed redundant `case <-c.done` from provisioner job publisher
goroutine to eliminate race condition
* Replaced `atomic.Bool` for `running` and `stopped` with a `sync.Mutex`
lifecycle state, also protecting `cancelFn` under the same mutex, to
eliminate the race between `Run()` and `Stop()`
* Added a guard in `Run()` to prevent double-start (`c.stopped ||
c.running`)
* Improved comments in Stop() and Run() to clarify shutdown behavior

Closes: https://github.com/coder/internal/issues/1116
2026-02-12 15:35:42 +00:00
Cian Johnston 25a0c807cb chore(coderd/database/dbfake): add support for provisioner job timestamp control (#21944)
Relates to https://github.com/coder/coder/pull/21922 /
https://github.com/coder/internal/issues/1259

* Adds `dbfake.BuilderOption func(*WorkspaceBuildBuilder)`
* Adds `BuilderOption` methods for setting various provisioner job
related fields on `WorkspaceBuildBuilder`.
* Migrates a number of existing tests that previously dependeded on
provisioner job timing to use these updated methods in the following
packages:
  * `coderd/jobreaper`
  * `coderd/notifications/reports`
  * `enterprise/coderd/schedule`
  * `enterprise/coderd/prebuilds`
  * `scripts/workspace-runtime-audit` 

🤖 Created using Mux (Opus 4.5)

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-02-06 09:44:40 +00:00
Cian Johnston 91be688e39 chore(coderd/database): remove deprecated db2sdk.List(Lazy)? methods (#21902)
Removes deprecated methods db2sdk.List and db2sdk.ListLazy.
2026-02-03 17:52:07 +00:00
Jon Ayers 3c1db17361 fix: use existing transaction to claim prebuild (#21862)
- Claiming a prebuild was happening outside a transaction
2026-02-02 17:57:59 -06:00
Dean Sheather 6954b73f8a fix: prevent panic from duplicate metrics registration on license upload (#21832) 2026-02-02 20:57:06 +11:00
Ethan a464ab67c6 test: use explicit names in TestStartAutoUpdate to prevent flake (#21745)
The test was creating two template versions without explicit names,
relying on `namesgenerator.NameDigitWith()` which can produce
collisions. When both versions got the same random name, the test failed
with a 409 Conflict error.

Fix by giving each version an explicit name (`v1`, `v2`).

Closes https://github.com/coder/internal/issues/1309

---

*Generated by [mux](https://mux.coder.com)*
2026-01-30 13:24:06 +11:00
Marcin Tojek 04b0253e8a feat: add Prometheus metrics for license warnings and errors (#21749)
Fixes: coder/internal#767

Adds two new Prometheus metrics for license health monitoring:

- `coderd_license_warnings` - count of active license warnings
- `coderd_license_errors` - count of active license errors

Metrics endpoint after startup of a deployment with license enabled:

```
...
# HELP coderd_license_errors The number of active license errors.
# TYPE coderd_license_errors gauge
coderd_license_errors 0
...
# HELP coderd_license_warnings The number of active license warnings.
# TYPE coderd_license_warnings gauge
coderd_license_warnings 0
...
```
2026-01-29 13:50:15 +01:00
Cian Johnston c2c225052a chore(enterprise/coderd): ensure TestManagedAgentLimit differentiates between tasks and workspaces (#21731)
My previous change to this test did not create another **workspace**
using the template containing `coder_ai_task` resources, meaning that
this test was not actually testing the right thing. This PR addresses
this oversight.
2026-01-28 16:30:56 +00:00
Zach 2204731ddb feat: implement boundary usage tracker and telemetry collection (#21716)
Implements telemetry for boundary usage tracking across all Coder
replicas and reports them via telemetry.

Changes:
- Implement Tracker with Track(), FlushToDB(), and StartFlushLoop() methods
- Add telemetry integration via collectBoundaryUsageSummary()
- Use telemetry lock to ensure only one replica collects per period

The tracker accumulates unique workspaces, unique users, and request
counts (allowed/denied) in memory, then flushes to the database
periodically. During telemetry collection, stats are aggregated across
all replicas and reset for the next period.
2026-01-27 19:11:40 -07:00
Steven Masley 799b190dee fix: do not enforce managed agent limit for non-task workspaces (#21689)
Only task workspaces have the checks in wsbuilder for violating the
managed agent caps in the license.

Stopped tasks that are resumed with a regular workspace start **still
count as usage**.
2026-01-27 19:01:17 -06:00
Cian Johnston 7b44976618 fix(coderd/provisionerdserver): correct managed agent tracking (#21696)
Relates to https://github.com/coder/internal/issues/1282

Updates tracking of managed agents to be predicated instead on the
presence of a related `task_id` instead of the presence of a
`coder_ai_task` resource.
2026-01-27 12:14:52 +00:00