Commit Graph

807 Commits

Author SHA1 Message Date
Jon Ayers 17635dde5c chore: include pgcoordinator schema changes in 2.33 (#24931)
Includes https://github.com/coder/coder/pull/24613 since it landed prior
to the pgcoordinator migration

---------

Co-authored-by: Marcin Tojek <mtojek@users.noreply.github.com>
2026-05-04 15:42:34 -05:00
Cian Johnston df1bfe6479 feat: audit user secret create, update, and delete (#24756) (#24849)
Emit user secret audit log entries for create/update/delete operations.
Reads stay un-audited, matching every other resource.

Audit log entries record changes in user secret name, environment
variable name, file path, and value. The secret value column is marked
`ActionSecret` so the diff records the change without showing the
ciphertext or plaintext.

Closes a TOCTOU window on delete to ensure no phantom audit logs for a
delete of a non-existent secret. Secret update accepts a small TOCTOU
window matching the other audited resources (templates, workspaces,
chats). The two-query pattern is wrapped in a transaction so audit state
can't leak from a failed mutation.

(cherry picked from commit 1c30d52b2b)

Co-authored-by: Zach <3724288+zedkipp@users.noreply.github.com>
2026-04-30 21:01:27 +01:00
George K 3f0e015fe5 fix: allow coderd to start with an empty DERP map when built-in DERP is disabled (#24544)
Allow coderd to start with an empty base DERP map when built-in DERP
is disabled and no static DERP map is configured, so DERP can come from
workspace proxies after startup.

Also add a DERP healthcheck warning when no DERP servers are currently
available at runtime.

Related to: https://linear.app/codercom/issue/PLAT-43/bug-coderd-unable-to-be-started-if-built-in-derp-server-disabled-and
Related to: https://github.com/coder/coder/issues/22324
2026-04-28 09:17:08 -07:00
Cian Johnston b5a625549e feat: migrate agents-access to org-scoped system role for proper chat RBAC (#24438)
The agents-access role previously granted chat permissions at user
scope, but chats are org-scoped objects. Rego skips user-level perms
when org_owner is set, making the grants invisible. Handler-level
band-aids used synthetic non-org-scoped objects as a workaround.

  - Migrates agents-access from users.rbac_roles (site-level) to
    organization_members.roles (org-scoped) via DB migration
  - Redefines agents-access as a predefined org-scoped builtin role
    alongside organization-admin, organization-auditor, etc., with
    Member permissions granting chat create/read/update
  - Excludes ResourceChat from OrgMemberPermissions so org membership
    alone no longer grants chat access
  - Fixes handler Authorize checks to use org-scoped objects with
    semantically correct actions (ActionUpdate for message/tool operations)
  - Grants org admins the ability to assign agents-access

Closes #24250
Fixes CODAGT-174

Note: this does not update the "Usage" endpoints. Tracked by CODAGT-161.
> 🤖
2026-04-23 17:59:42 +01:00
Danny Kopping a8613b2209 chore: deprecate /api/v2/aibridge/interceptions endpoint (#24670)
*Disclaimer: implemented by a Coder Agent using Claude Opus 4.6*

Marks the `GET /api/v2/aibridge/interceptions` endpoint as deprecated in
favor of `/aibridge/sessions`, which provides richer session-level
aggregation including threads and agentic actions.

Changes:
- Add `@Deprecated` Swagger annotation to the endpoint handler
- Add deprecation notice to the
`codersdk.Client.AIBridgeListInterceptions` method
- Regenerated OpenAPI spec with `"deprecated": true` flag

The endpoint remains fully functional.

Fixes https://github.com/coder/internal/issues/1339
2026-04-23 15:33:40 +02:00
Cian Johnston d9e3e206cc fix(enterprise/coderd/x/chatd): deflake relay drain test for multiple timers (#24609)
Fixes https://github.com/coder/internal/issues/1474.

PR #24549 introduced a `quartz.NewMock` clock +
`Trap().NewTimer("drain")` to remove the wall-clock race. However, the
trap consumed only **one** `NewTimer("drain")` call via
`MustWait/MustRelease`.

The merge loop has two code paths that create drain timers with the same
tag:
- Relay result handler (`drainAndClose` path in `relayReadyCh` case):
when an async dial completes after `drainAndClose` was set.
- Status notification handler (`relayParts != nil` branch in
`statusNotifications` case): when `status!=running` arrives while an
active relay exists.

Depending on goroutine scheduling, one or both paths fire. When two
calls hit the trap, the second blocks the merge loop in
`matchCallLocked` (quartz waits for all traps to release). Since the
test already moved past `MustWait`, nobody reads the second call from
the trap channel, deadlocking the test.

- Replace the single `MustWait/MustRelease/Advance` with a goroutine
that loops over `trapDrain.Wait`, releasing and advancing for every
drain timer.
- No production code changes.

> 🤖
2026-04-23 11:13:41 +01:00
Jeremy Ruppel c23abc691f feat: sort AI sessions by last prompt time (#24440)
Previously, the sessions list sorted by `MIN(started_at)` across
interceptions, so sessions with old start times but recent activity
would sink to the bottom of the list regardless of how recently they
were used.

`ListAIBridgeSessions` now sorts by `COALESCE(MAX(prompt.created_at),
MIN(started_at)) DESC`, exposed as the non-nullable `last_active_at`
field. Sessions with prompts surface by last activity; sessions with no
prompts fall back to their start time.

The original implementation used two separate columns (`last_active_at`
as a nullable prompt timestamp and `sort_at` as the non-nullable cursor
key). This revision collapses them into a single `last_active_at` that
is always set — simplifying the SQL, the Go conversion, the API type,
and the frontend.
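
The ordering rule above can be sketched in Go. This is an in-memory illustration only, not the actual SQL or the `ListAIBridgeSessions` implementation; the `session` type, its fields, and the sample data are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// session is a hypothetical in-memory stand-in for one AI Bridge session.
type session struct {
	ID        string
	StartedAt []time.Time // started_at of every interception in the session
	Prompts   []time.Time // created_at of every prompt; may be empty
}

// lastActiveAt mirrors COALESCE(MAX(prompt.created_at), MIN(started_at)):
// the latest prompt time if any prompt exists, else the earliest start time.
func lastActiveAt(s session) time.Time {
	var latestPrompt time.Time
	for _, p := range s.Prompts {
		if p.After(latestPrompt) {
			latestPrompt = p
		}
	}
	if !latestPrompt.IsZero() {
		return latestPrompt
	}
	var earliestStart time.Time
	for i, st := range s.StartedAt {
		if i == 0 || st.Before(earliestStart) {
			earliestStart = st
		}
	}
	return earliestStart
}

// sortByLastActive returns a copy ordered most recently active first (DESC).
func sortByLastActive(sessions []session) []session {
	out := append([]session(nil), sessions...)
	sort.SliceStable(out, func(i, j int) bool {
		return lastActiveAt(out[i]).After(lastActiveAt(out[j]))
	})
	return out
}

// demoSessions returns made-up sample data for illustration.
func demoSessions() []session {
	day := func(d int) time.Time {
		return time.Date(2026, time.April, d, 12, 0, 0, 0, time.UTC)
	}
	return []session{
		{ID: "new-no-prompts", StartedAt: []time.Time{day(10)}},
		{ID: "old-but-active", StartedAt: []time.Time{day(1), day(2)}, Prompts: []time.Time{day(3), day(20)}},
		{ID: "old-and-idle", StartedAt: []time.Time{day(2)}, Prompts: []time.Time{day(2)}},
	}
}

func main() {
	for _, s := range sortByLastActive(demoSessions()) {
		fmt.Println(s.ID, lastActiveAt(s).Format("2006-01-02"))
	}
}
```

In this sample, the session that started earliest but was prompted most recently sorts first, which is the behavior the commit describes.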

🤖 Generated with [Claude Code](https://claude.ai/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 12:06:49 -04:00
Paweł Banaszewski e00e85765b chore: move aibridge library code into coder repo (#24190)
This PR merges code from `coder/aibridge` repository into `coder/coder`.
It was split into 4 PRs for easier review but stacked PRs will need to
be merged into this PR so all checks pass.

* https://github.com/coder/coder/pull/24190 -> raw code copy (this PR,
before merging PRs on top of it, it was just 1 commit:
https://github.com/coder/coder/commit/70d33f33200c7e77df910957595715f81f9bec24)
* https://github.com/coder/coder/pull/24570 -> update imports in
`coder/coder` to use copied code
* https://github.com/coder/coder/pull/24586 -> linter fixes and CI
integration (also added README.md)
* https://github.com/coder/coder/pull/24571 -> added exclude to
scripts/check_emdash.sh check

Original PR message (before PR squash):
Moves coder/aibridge code into coder/coder repository.

Omitted files:

- `go.mod`, `go.sum`, `.gitignore`, `.github/workflows/ci.yml`,
`Makefile`, `LICENSE`, `README.md` (modified README.md is added later)
- `.github`, `example`, `buildinfo`, `scripts` directories

Simple verification script (lists the omitted files):

```
tmp=$(mktemp -d)
echo "$tmp"
git clone --depth=1 https://github.com/coder/aibridge "$tmp/aibridge"
git clone --depth=1 --branch pb/aibridge-code-move https://github.com/coder/coder "$tmp/coder"
diff -rq --exclude=.git "$tmp/aibridge" "$tmp/coder/aibridge"
# rm -rf "$tmp"
```
2026-04-22 17:01:01 +02:00
Cian Johnston f56adf5731 fix(enterprise/coderd/x/chatd): deflake TestSubscribeRelayDrainWithinGraceLeavesBufferRetained (#24549)
Fixes the flake reported in
https://github.com/coder/internal/issues/1474.

- Use a `quartz.NewMock` clock for the subscriber with a drain timer
trap, so the 200ms relay drain fires only when explicitly advanced —
fully deterministic, no wall-clock race
- Give each `testutil.Eventually` call its own context so one slow
assertion cannot starve subsequent ones of their shared 25s deadline

> 🤖
2026-04-21 11:20:32 +00:00
Cian Johnston 12e49c18a5 fix(enterprise/coderd/x/chatd): reduce relay reconnect spam (#24495)
- Replaces the hard-coded 500ms reconnect timer for dialing chat relays with exponential backoff via `coder/retry`.
- `dialRelay` drops the `codersdk.ExperimentalClient.StreamChat` wrapper
and calls `websocket.Dial` directly so we can capture
`*http.Response.StatusCode` without parsing error strings.
- Adds `RelayDialError` that exposes the HTTP status from `websocket.Dial`
- Modifies retry logic: 401/403 tear the stream down immediately, 5xx/network/timeouts retry
then tear down on cap. Outer stream closes cleanly so the browser SDK
reconnects with a fresh cookie.
- Retry state resets on successful dial and on target-worker change, not
on every `closeRelay()`.
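
The retry policy described in the list above could look roughly like the sketch below. It uses only the standard library rather than `coder/retry`, and the names (`classifyDial`, `backoff`, `dialWithBackoff`), the delay bounds, and the attempt cap are all hypothetical. The handling of statuses other than 401/403/5xx is an assumption, since the commit message does not cover them.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// Illustrative values; the commit does not state the real backoff bounds.
const (
	baseDelay   = 500 * time.Millisecond
	maxDelay    = 8 * time.Second
	maxAttempts = 5
)

type dialAction int

const (
	actionRetry dialAction = iota
	actionTearDown
)

// classifyDial decides what to do after a failed dial. A status of 0 stands
// for a network error or timeout where no HTTP response was received.
func classifyDial(status int) dialAction {
	switch {
	case status == http.StatusUnauthorized, status == http.StatusForbidden:
		return actionTearDown
	case status == 0, status >= 500:
		return actionRetry
	default:
		// Assumption: other statuses are treated as non-retryable here.
		return actionTearDown
	}
}

// backoff doubles the delay per attempt, starting at baseDelay and never
// exceeding maxDelay.
func backoff(attempt int) time.Duration {
	d := baseDelay
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

// noSleep is a sleep stub for callers that do not want to wait.
func noSleep(time.Duration) {}

// dialWithBackoff calls dial until it succeeds, returns a non-retryable
// status, or maxAttempts is reached. sleep is injected so tests stay fast.
func dialWithBackoff(dial func() (status int, ok bool), sleep func(time.Duration)) (attempts int, connected bool) {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		status, ok := dial()
		if ok {
			return attempt + 1, true
		}
		if classifyDial(status) == actionTearDown {
			return attempt + 1, false
		}
		if attempt+1 < maxAttempts {
			sleep(backoff(attempt))
		}
	}
	return maxAttempts, false
}

func main() {
	failures := 2
	var slept []time.Duration
	attempts, ok := dialWithBackoff(
		func() (int, bool) {
			if failures > 0 {
				failures--
				return http.StatusServiceUnavailable, false
			}
			return http.StatusSwitchingProtocols, true
		},
		func(d time.Duration) { slept = append(slept, d) },
	)
	fmt.Println(attempts, ok, slept)
}
```

Injecting the sleep function keeps the policy testable without real delays; resetting retry state on a target-worker change is not modeled here.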

> 🤖 Generated by Coder Agents.
2026-04-20 09:19:17 +01:00
Cian Johnston 3f6b40a833 fix: reap idle chatd stream states on a timer (#24476)
* Adds `streamJanitorLoop` to clean up stale streams every 30s
* Zeroes dropped slots to aid GC eligibility
* Adds regression tests in coderd/x/chatd and enterprise/coderd/x/chatd
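
As a rough illustration of the two ideas above (a periodic janitor and zeroing dropped slots), here is a minimal sketch. The `chunk` type and the `dropPrefix` and `janitorLoop` helpers are hypothetical and do not reflect the actual `streamJanitorLoop` code; which slots chatd really zeroes is an assumption.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// chunk is a hypothetical buffered stream part.
type chunk struct{ seq int }

// dropPrefix discards the first n buffered chunks. The dropped slots are set
// to nil first so the shared backing array no longer keeps them reachable.
func dropPrefix(buf []*chunk, n int) []*chunk {
	if n < 0 {
		n = 0
	}
	if n > len(buf) {
		n = len(buf)
	}
	for i := 0; i < n; i++ {
		buf[i] = nil
	}
	return buf[n:]
}

// janitorLoop calls reap on every tick until ctx is cancelled.
func janitorLoop(ctx context.Context, interval time.Duration, reap func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			reap()
		}
	}
}

func main() {
	buf := []*chunk{{seq: 1}, {seq: 2}, {seq: 3}}
	buf = dropPrefix(buf, 2)
	fmt.Println(len(buf), buf[0].seq)

	// A short interval stands in for the 30s cadence mentioned above.
	ctx, cancel := context.WithTimeout(context.Background(), 55*time.Millisecond)
	defer cancel()
	ticks := 0
	janitorLoop(ctx, 10*time.Millisecond, func() { ticks++ })
	fmt.Println("reaped at least once:", ticks > 0)
}
```

Re-slicing alone (`buf[n:]`) would leave the old pointers in the backing array, so the garbage collector could not reclaim the dropped chunks until the whole array is released.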

> 🤖
2026-04-17 19:22:00 +01:00
Dean Sheather 3452ab3166 chore: add client_type field to chats and telemetry (#24342)
Add a `chat_client_type` enum (`ui` | `api`) and `client_type` column to
the `chats` table. The column defaults to `api` for new rows so API
callers don't need to set it explicitly. Existing rows are backfilled to
`ui`.

The field flows through `CreateChatRequest`, `chatd.CreateOptions`,
`InsertChat`, and is returned in the `Chat` response via `db2sdk`.

<details>
<summary>Implementation notes (Coder Agents generated)</summary>

### Changes

**Database migration (000469)**
- New enum `chat_client_type` with values `ui`, `api`.
- New `client_type` column, `NOT NULL DEFAULT 'api'`.
- Backfill: `UPDATE chats SET client_type = 'ui'`.

**SQL query** — `InsertChat` now includes `client_type`.

**SDK** — `ChatClientType` type added; `ClientType` field added to both
`CreateChatRequest` (optional, defaults server-side to `api`) and `Chat`
response.

**Handler** — `postChats` maps the request field (defaulting to `api`)
and passes it through `chatd.CreateOptions`.

**Sub-agent** — Child chats inherit their parent's `client_type`.

**db2sdk** — Maps the database value to the SDK type.

### Decision log
- Default is `api` (not `ui`) so existing API integrations get the
correct value without code changes.
- Backfill sets existing rows to `ui` per requirement.
- Child chats inherit `client_type` from parent rather than defaulting.
</details>
2026-04-16 23:57:05 +10:00
Ethan b9bc0ad6df test: skip TestSubscribeRelayEstablishedMidStream (#24431)
Relates to https://github.com/coder/internal/issues/1455
From that issue:
> Going to skip this test until the underlying race in chatd is fixed.
https://github.com/coder/coder/pull/24279 was a band-aid fix that I no
longer think is valuable pursuing short term. Hugo is working on a RFC
for a redesign of the system to prevent the class of race condition into
the future.
2026-04-16 23:55:41 +10:00
Cian Johnston d7439a9de0 feat: add Prometheus metrics for chatd subsystem (#24371)
Adds 7 Prometheus metrics to the chatd subsystem and introduces typed
`ActivityBumpReason` for deadline bump attribution.

| Metric | Type | Labels |
|--------|------|--------|
| `coderd_chatd_chats` | Gauge | `state` (streaming, waiting) |
| `coderd_chatd_message_count` | Histogram | `provider` |
| `coderd_chatd_prompt_size_bytes` | Histogram | `provider` |
| `coderd_chatd_tool_result_size_bytes` | Histogram | `provider`,
`tool_name` |
| `coderd_chatd_ttft_seconds` | Histogram | `provider` |
| `coderd_chatd_compaction_total` | Counter | `provider`, `result` |
| `coderd_chatd_steps_total` | Counter | `provider` |

> 🤖
2026-04-15 19:53:10 +01:00
Yevhenii Shcherbina dd73ea54bd feat: add allow-byok option for ai-gateway (#24274)
## Summary
Adds `--ai-gateway-allow-byok` deployment option to control whether
users can use Bring Your Own Key (BYOK) mode with AI Gateway.
When disabled (`--ai-gateway-allow-byok=false`), BYOK requests are
rejected with a 403 and a message directing the admin to enable the
flag. Centralized key authentication works regardless of this setting.
Defaults to `true` (BYOK allowed).

---------

Co-authored-by: Danny Kopping <danny@coder.com>
2026-04-15 14:16:49 -04:00
Cian Johnston 6194bd6f57 fix: address post-merge review findings for chat org scoping (#24297)
Addresses review findings from #23827 that were added post-merge:

- Persisted attachments now store `organizationId`; mismatched orgs
pruned on restore
- Workspace selection reconciliation: stale IDs from previous orgs
dropped via derived `effectiveWorkspaceId`
- Org picker uses `permittedOrganizations()` for RBAC-aware filtering
- Org picker hidden when user belongs to only one org
- Ref-sync `useEffect` replaced with `useEffectEvent`
- `CreateWorkspace()` and `ListTemplates()` take `organizationID` and
`db` as required function parameters instead of optional struct fields —
compiler enforces them, removes scattered nil guards
- Cross-org template check in `CreateWorkspace` is now unconditional
- `ListTemplates` org-scoping filter now has test coverage
- `setupChatInfra` comment fixed; test helpers use params structs
instead of positional UUIDs
- Enterprise test documents that org admin only sees own chats (handler
hardcodes `OwnerID` — future work needs sidebar UI before lifting that
restriction)

> 🤖
2026-04-15 11:39:05 +01:00
Cian Johnston c552f9f281 fix: stop group spend limits from leaking across org boundaries (#24294)
Three SQL queries (`GetUserGroupSpendLimit`,
`ResolveUserChatSpendLimit`, `GetUserChatSpendInPeriod`) aggregated chat
spend limits and usage globally across all organizations. A restrictive
group limit in org A would bleed into org B.

## Changes

- Add `organization_id` parameter to all three SQL queries in
`coderd/database/queries/chats.sql`
- When nil UUID is passed, queries fall back to global behavior
(backward compat for HTTP dashboard endpoints)
- When real org ID is passed, limits and spend are scoped to that
organization
- Thread `organizationID` through `ResolveUsageLimitStatus` →
`checkUsageLimit` → all chatd call sites
- Update dbauthz wrappers for new param structs
- HTTP endpoints (`chatCostSummary`, `getMyChatUsageLimitStatus`) pass
`uuid.Nil` with TODO for future org-scoped UI
- Add `TestResolveUsageLimitStatus_OrgScoped` with 5 test cases covering
org isolation, nil-UUID fallback, spend scoping, and user override
priority

Closes coder/internal#1466

> 🤖
2026-04-14 16:56:17 +01:00
Yevhenii Shcherbina b78eba9f9d feat: make sure creds are always masked (#24241)
## Summary
Adds a `sanitizeCredentialHint` safety check in the db-to-SDK conversion
layer to ensure credential hints are always masked before being exposed
in the API. Also adds `credential_kind` and `credential_hint` assertions
to the session threads API test.
2026-04-13 10:14:38 -04:00
Cian Johnston 22062ec52e feat: add organization scoping to chats (#23827)
Fixes https://github.com/coder/internal/issues/1436

- Adds `organization_id` to chats with backfill (workspace org → user org membership → default org)
- No support yet for ACLs (follow-up issue)
- Cross-org workspace binding rejected (both in `CreateChatRequest` and in the `create_workspace` tool)
- Adds `OrganizationAutocomplete` to `AgentCreateForm`
- Docs updated with `organization_id` in chats-api.md

> 🤖 Written by a Coder Agent. Reviewed by many humans and many agents.

---------

Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
2026-04-13 12:31:25 +01:00
Cian Johnston 7b0421d8c6 fix: revert auto-assign agents-access role enabled (#24170)
This reverts commit d4a9c63e91 (#23968).

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-04-08 20:56:17 +01:00
Jon Ayers 08bd9e672a fix: resolve Test_batcherFlush/RetriesOnTransientFailure flake (#24112)
fixes https://github.com/coder/internal/issues/1452
2026-04-07 13:46:26 -05:00
Kayla はな c5f1a2fccf feat: make service accounts a Premium feature (#24020)
2026-04-07 12:25:32 -06:00
Kyle Carberry f3f0a2c553 fix(enterprise/coderd/x/chatd): harden TestSubscribeRelayEstablishedMidStream against CI flakes (#24108)
Fixes coder/internal#1455

Three changes to eliminate the timing-sensitive flake in
`TestSubscribeRelayEstablishedMidStream`:

1. **Reduce `PendingChatAcquireInterval` from `time.Hour` to
`time.Second`.**
   The primary trigger is still `signalWake()` from `SendMessage`, but a
   short fallback poll ensures the worker picks up the pending chat
   even under heavy CI goroutine scheduling contention.

2. **Increase context timeout from `WaitLong` (25s) to `WaitSuperLong`
(60s).**
   The worker pipeline (model resolution, message loading, LLM call)
   involves multiple DB round-trips that can be slow when PostgreSQL
   is shared with many parallel test packages.

3. **Add a status-polling loop while waiting for the streaming
request.**
   If the worker errors out during chat processing, the test now
   fails immediately with the error status and message instead of
   silently timing out.

> Generated by Coder Agents
2026-04-07 13:41:33 -04:00
George K 86ca61d6ca perf: cap count queries and emit native UUID comparisons for audit/connection logs (#23835)
Audit and connection log pages were timing out due to expensive COUNT(*)
queries over large tables. This commit adds opt-in count capping: requests can
return a `count_cap` field signaling that the count was truncated at a threshold,
avoiding full table scans that caused page timeouts.

Text-cast UUID comparisons in regosql-generated authorization queries
also contributed to the slowdown by preventing index usage for connection
and audit log queries. These now emit native UUID operators.

Frontend changes handle the capped state in usePaginatedQuery and
PaginationWidget, optionally displaying a capped count in the pagination
UI (e.g. "Showing 2,076 to 2,100 of 2,000+ logs")
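
A minimal sketch of the capping idea, assuming the count stops one row past the cap so that "more than the cap" can be told apart from "exactly the cap". The `countResult` shape, the helper names, and the exact meaning of `count_cap` (non-zero when truncated) are assumptions, and the real display logic lives in the frontend rather than in Go.

```go
package main

import (
	"fmt"
	"strconv"
)

// countResult is a hypothetical response shape. CountCap is non-zero only
// when the count was truncated at that threshold.
type countResult struct {
	Count    int
	CountCap int
}

// cappedCount simulates counting at most limit+1 matching rows: if more than
// limit rows match, the result is reported as truncated at limit.
func cappedCount(matching, limit int) countResult {
	if limit > 0 && matching > limit {
		return countResult{Count: limit, CountCap: limit}
	}
	return countResult{Count: matching}
}

// withCommas formats a non-negative integer with thousands separators.
func withCommas(n int) string {
	s := strconv.Itoa(n)
	for i := len(s) - 3; i > 0; i -= 3 {
		s = s[:i] + "," + s[i:]
	}
	return s
}

// formatTotal renders the total for a pagination label, adding "+" when the
// count was capped.
func formatTotal(r countResult) string {
	if r.CountCap > 0 {
		return withCommas(r.Count) + "+"
	}
	return withCommas(r.Count)
}

func main() {
	fmt.Printf("Showing 2,076 to 2,100 of %s logs\n", formatTotal(cappedCount(150000, 2000)))
	fmt.Printf("Showing 1 to 25 of %s logs\n", formatTotal(cappedCount(1234, 2000)))
}
```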

Related to:
https://linear.app/codercom/issue/PLAT-31/connectionaudit-log-performance-issue
2026-04-07 07:24:53 -07:00
Kyle Carberry e18094825a fix: retain message_part buffer for cross-replica relay (#24031)
2026-04-04 17:24:41 -04:00
Jon Ayers a1d51f0dab feat: batch connection logs to avoid DB lock contention (#23727)
- Running 30k connections was generating heavy lock contention in the
DB
2026-04-03 15:47:26 -05:00
Paweł Banaszewski 8369fa88fd feat: add columns for cached tokens from aibridge (#23832)
Two new columns added to aibridge_token_usages:
  - cache_read_input_tokens (BIGINT, default 0)
  - cache_write_input_tokens (BIGINT, default 0)

Migration backfills existing rows by extracting values from the metadata
JSONB column: cache_read_input, input_cached, and prompt_cached for
reads (the maximum is taken, since only one should be set) and
cache_creation_input for writes.

All references to data from metadata were updated to reference the new
columns. No other changes than changing where data is extracted from.

Requires aibridge library version bump to include:
https://github.com/coder/aibridge/pull/229
Fixes: https://github.com/coder/aibridge/issues/150
2026-04-03 16:27:31 +02:00
Michael Suchacz 7d0a0c6495 feat: provider key policies and user provider settings (#23751)
2026-04-02 19:46:42 +02:00
Cian Johnston d4a9c63e91 feat: auto-assign agents-access role to new users when experiment enabled (#23968)
When the `agents` experiment is enabled, new users are automatically
granted the `agents-access` role at creation time so they can use Coder
Agents without manual admin intervention.

- Auto-assigns in `CreateUser()` — covers admin API, OAuth, and OIDC
creation paths
- Skips auto-assign for OIDC users when enterprise site role sync is
enabled (sync overwrites roles on every login; those admins should use
`--oidc-user-role-default` instead)
- CLI `create-admin-user` bypasses `CreateUser()` but creates `owner`
users who already have all permissions

> 🤖 Written by a Coder Agent. Will be reviewed by a human.
2026-04-02 14:46:47 +01:00
Ethan 7757cd8e08 refactor(coderd/x/chatd): insert chats directly as pending on creation (#23888)
Previously, `CreateChat` inserted the `chats` row with the DB default
status (`waiting`), then updated it to `pending` in the same transaction
via `setChatPendingWithStore`. This wasted two extra queries per chat
creation (`GetChatByID` + `UpdateChatStatus`) and rewrote the same row
immediately after inserting it.

Now `CreateChat` passes the status directly to `InsertChat`, so the row
is written once in its final create-time state. The
`setChatPendingWithStore` helper is removed entirely. `InsertChat` now
requires an explicit `status` parameter at all callsites instead of
relying on a DB column default.

## Motivation

On an experimental branch we're trialing firing all chatd notifications
from plpgsql triggers. The old two-step insert made that awkward: in an
`AFTER INSERT` trigger, `NEW` only contained the insert-time row
(`waiting`), not the final committed state (`pending`). To emit the
correct event payload the trigger had to be deferred and re-read the row
from `chats` at commit time.

With this change, `NEW` already contains the correct row to publish — no
deferred trigger, no extra `SELECT`, simpler and cheaper trigger logic.

That said, this seems like a worthwhile change regardless of the trigger
experiment: writing the final row state once removes unnecessary DB work
on every chat creation and makes the create path easier to reason about.
2026-04-02 14:13:51 +11:00
Cian Johnston d6df78c9b9 chore: remove racy ChatStatusPending assertions after CreateChat (#23882)
Removes 6 fragile `require.Equal(t, codersdk.ChatStatusPending,
chat.Status)` assertions from chat relay and creation tests.

**Root cause**: In HA tests with two replicas sharing the same DB, the
worker can acquire a just-created chat (flipping `pending → running` via
`AcquireChats`) before the HTTP response reaches the test. All affected
tests already synchronize via `require.Eventually` waiting for `running`
status, making the initial assertion both redundant and racy.

- Remove 5 assertions in `enterprise/coderd/exp_chats_test.go` (all
`TestChatStreamRelay` subtests)
- Remove 1 assertion in `coderd/exp_chats_test.go` (`TestPostChats`)
- An existing comment in `TestPostChats/Success` already documents this
exact race

Fixes flake:
https://github.com/coder/coder/actions/runs/23807597632/job/69385425724

> 🤖 Written by a Coder Agent. Will be reviewed by a human.
2026-04-01 10:00:50 +01:00
Danny Kopping 9fa103929a perf: make ListAIBridgeSessions 10x faster (#23774)
_Disclaimer: produced using Claude Opus 4.6, reviewed by me, and
validated against Dogfood dataset._

The `ListAIBridgeSessions` query materialized and aggregated all
matching interceptions before paginating, then ran expensive
token/prompt lookups across the full dataset. For a page of 25 sessions
against ~200k interceptions (our dogfood dataset), this meant:
- Three CTEs scanning all rows (filtered_interceptions, session_tokens,
session_root)
  - ARRAY_AGG(fi.id) collecting every interception ID per session
- Lateral prompt lookup via ANY(array_of_all_ids) running for every
session, not just the page
  - ~90MB of disk sorts and JIT compilation kicking in

The improvement is to restructure to paginate first and enrich after: a
single CTE groups interceptions into sessions with only cheap aggregates
(MIN, MAX, COUNT), applies cursor pagination and LIMIT, then lateral
joins fetch metadata, tokens, and prompts for just the ~25-row page.
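
The "paginate first, enrich after" shape can be illustrated in memory. This is only a sketch of the idea, not the SQL in the PR: the types, the `enrich` callback (standing in for the lateral joins), and the sort key are hypothetical, and cursor pagination is omitted.

```go
package main

import (
	"fmt"
	"sort"
)

// interception is a hypothetical row; StartedAt is an opaque ordering value.
type interception struct {
	SessionID string
	StartedAt int
}

// sessionRow is one entry of the returned page.
type sessionRow struct {
	ID            string
	FirstStarted  int
	Interceptions int
	Detail        string // filled by the expensive enrichment step
}

// listSessions groups rows with cheap aggregates, applies ordering and the
// limit, and only then runs enrich for the sessions on the page.
func listSessions(rows []interception, limit int, enrich func(sessionID string) string) []sessionRow {
	// Step 1: cheap per-session aggregates (MIN, COUNT).
	agg := map[string]*sessionRow{}
	for _, r := range rows {
		s, ok := agg[r.SessionID]
		if !ok {
			s = &sessionRow{ID: r.SessionID, FirstStarted: r.StartedAt}
			agg[r.SessionID] = s
		}
		if r.StartedAt < s.FirstStarted {
			s.FirstStarted = r.StartedAt
		}
		s.Interceptions++
	}

	// Step 2: order and apply the LIMIT before any expensive work.
	page := make([]sessionRow, 0, len(agg))
	for _, s := range agg {
		page = append(page, *s)
	}
	sort.Slice(page, func(i, j int) bool {
		if page[i].FirstStarted != page[j].FirstStarted {
			return page[i].FirstStarted > page[j].FirstStarted
		}
		return page[i].ID < page[j].ID
	})
	if limit >= 0 && len(page) > limit {
		page = page[:limit]
	}

	// Step 3: enrich only the rows that made it onto the page.
	for i := range page {
		page[i].Detail = enrich(page[i].ID)
	}
	return page
}

// demoRows builds made-up data: two interceptions per session.
func demoRows(sessions int) []interception {
	rows := make([]interception, 0, sessions*2)
	for i := 0; i < sessions; i++ {
		id := fmt.Sprintf("s%03d", i)
		rows = append(rows,
			interception{SessionID: id, StartedAt: i},
			interception{SessionID: id, StartedAt: i + 1000},
		)
	}
	return rows
}

func main() {
	lookups := 0
	page := listSessions(demoRows(100), 25, func(id string) string {
		lookups++
		return "detail for " + id
	})
	fmt.Println(len(page), "rows,", lookups, "enrichment lookups")
}
```

The number of enrichment lookups tracks the page size rather than the number of sessions, which corresponds to the drop in lateral loops reported in the table below.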

  Measured against 220k interceptions / 160k sessions:

  | Metric             | Before | After |
  |--------------------|--------|-------|
  | Execution time     | 1800ms | 185ms |
  | Shared buffer hits | 737k   | 2.6k  |
  | Disk sort spill    | 86MB   | 16MB  |
  | Lateral loops      | 160k   | 25    |

https://grafana.dev.coder.com/goto/fbODPGtvR?orgId=1 the results are
identical, just _much_ faster.

---

Also includes some additional tests which I added prior to refactoring
the query to ensure no regressions on edge-cases.

---------

Signed-off-by: Danny Kopping <danny@coder.com>
2026-03-31 14:42:23 +02:00
Cian Johnston 3ce82bb885 feat: add chat-access site-wide role to gate chat creation (#23724)
- Add `chat-access` built-in role granting chat CRUD at User scope
- Exclude `ResourceChat` from member, org member, and org service
account `allPermsExcept` calls
- Allow system, owner, and user-admin to assign the new role
- Migration auto-assigns role to users who have ever created a chat
- Update RBAC test matrix: `memberMe` denied, `chatAccessUser` allowed

**Breaking change**: Members without `chat-access` lose chat creation
ability. Migration covers existing chat creators. Members who have never
created a chat do not get this role automatically applied.

> 🤖 This PR was created by a Coder Agent and reviewed by me.
2026-03-31 10:07:21 +01:00
Ethan 13dfc9a9bb test: harden chatd relay test setup (#23759)
These chatd relay tests were seeding chats through
`subscriber.CreateChat(...)`, which wakes the subscriber and can race
local acquisition against the intended remote-worker setup.

Seed waiting and remote-running chats directly in the database instead,
and point the default OpenAI provider at a local safety-net server so
accidental processing fails locally instead of reaching the live API.

Closes https://github.com/coder/internal/issues/1430
2026-03-30 17:52:01 +11:00
Jake Howell 71a492a374 feat: implement <ClientFilter /> to AI Bridge request logs (#22694)
Closes #22136

This pull-request implements a `<ClientFilter />` to our `Request Logs`
page for AI Bridge. This will allow the user to select a client which
they wish to filter against. Technically the backend is able to actually
filter against multiple clients at once however the frontend doesn't
currently have a nice way of supporting this (future improvement).

<img width="1447" height="831" alt="image"
src="https://github.com/user-attachments/assets/0be234e2-25f2-4a89-b971-d74817395da1"
/>

---------

Co-authored-by: Jeremy Ruppel <jeremy.ruppel@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 17:18:28 -04:00
Jaayden Halko 86c3983fc0 feat: add AI Governance seat capacity banners (#23411)
## Summary

Add site-wide banners for AI Governance seat usage thresholds:

1. **90% capacity warning (admin-only):** When actual AI Governance
seats are ≥90% and <100% of the license limit, admins see:
   > "You have used 90% of your AI governance add-on seats."

2. **Over-limit banner (admin-only):** When actual seats exceed the
license limit, admins see a prominent warning:
   > "Your organization is using {actual} / {limit} AI Governance user
   > seats ({X}% over the limit). Contact sales@coder.com"
   - Uses floor whole percentage (Go int division / `Math.floor`)
   - Includes a clickable `mailto:sales@coder.com` link
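
The threshold arithmetic can be sketched as follows. The function and type names are hypothetical, and the exact formula for `{X}` is an assumption: it is taken here to be the floored percentage by which actual seats exceed the limit.

```go
package main

import "fmt"

type banner string

const (
	bannerNone      banner = ""
	bannerWarning   banner = "warning"
	bannerOverLimit banner = "over-limit"
)

// overLimitPercent returns the whole percentage by which actual exceeds
// limit. Go integer division truncates, which is a floor for positive values.
func overLimitPercent(actual, limit int) int {
	if limit <= 0 || actual <= limit {
		return 0
	}
	return (actual - limit) * 100 / limit
}

// bannerFor picks the banner: a warning at >=90% and <100% of the limit, and
// the over-limit banner once actual seats exceed the limit.
func bannerFor(actual, limit int) banner {
	switch {
	case limit <= 0:
		return bannerNone
	case actual > limit:
		return bannerOverLimit
	case actual < limit && actual*10 >= limit*9:
		return bannerWarning
	default:
		return bannerNone
	}
}

func main() {
	fmt.Println(bannerFor(95, 100), bannerFor(100, 100), bannerFor(115, 100))
	fmt.Printf("%d / %d seats (%d%% over the limit)\n", 115, 100, overLimitPercent(115, 100))
}
```

Comparing `actual*10 >= limit*9` keeps the 90% check in integers and avoids floating-point rounding at the boundary.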
2026-03-27 05:51:51 +00:00
Danny Kopping 801e57d430 feat: session detail API (#23203)
2026-03-26 18:09:53 +02:00
Ethan 4d74603045 fix(coderd/x/chatd): respect provider Retry-After headers in chat retry loop (#23351)
> **PR Stack**
> 1. **#23351** ← `#23282` *(you are here)*
> 2. #23282 ← `#23275`
> 3. #23275 ← `#23349`
> 4. #23349 ← `main`

---

## Summary

`chatretry.Retry()` used pure exponential backoff (1 s, 2 s, 4 s, …) and
never consulted provider `Retry-After` headers. Fantasy's
`ProviderError` carries `ResponseHeaders` including `Retry-After`, but
`chaterror.Classify()` only parsed error text and silently dropped the
structured transport metadata.

This makes `Retry-After` a first-class signal in the classification →
retry pipeline.

<img width="853" height="346" alt="image"
src="https://github.com/user-attachments/assets/65f012b6-8173-43d2-957e-ab9faddea525"
/>


## Changes

### `coderd/chatd/chaterror/classify.go`

- Added `RetryAfter time.Duration` field to `ClassifiedError` — a
normalized minimum retry delay derived from provider response metadata.
- `Classify()` now calls `extractProviderErrorDetails()` before falling
back to text heuristics. Structured `ProviderError.StatusCode` takes
priority over regex extraction.
- `normalizeClassification()` preserves and clamps `RetryAfter`.

### `coderd/chatd/chaterror/provider_error.go` (new)

Provider-specific extraction, isolated from the text-based
classification logic:

- `extractProviderErrorDetails()` unwraps `*fantasy.ProviderError` from
the error chain via `errors.As`.
- `retryAfterFromHeaders()` parses headers in priority order:
  1. `retry-after-ms` (OpenAI-specific, millisecond precision)
  2. `retry-after` (standard HTTP — integer seconds or HTTP-date)
- Case-insensitive header key lookup.

### `coderd/chatd/chatretry/chatretry.go`

- `effectiveDelay(attempt, classified)` computes `max(Delay(attempt),
classified.RetryAfter)` — the provider hint acts as a floor without
weakening the local exponential backoff.
- `Retry()` now uses `effectiveDelay` and passes the effective delay to
both `onRetry(...)` and the sleep timer, so downstream payloads, logs,
and the frontend countdown stay aligned automatically.

### Tests

- `classify_test.go`: Structured provider status + `Retry-After`
extraction, `retry-after-ms` priority, HTTP-date parsing, invalid header
fallback, `WithProvider` preservation.
- `chatretry_test.go`: Retry-after-as-floor semantics — longer hint
wins, shorter hint keeps base delay.

## Design notes

- **No SDK/API/frontend changes needed.** `codersdk.ChatStreamRetry`
already carries `DelayMs` and `RetryingAt`, and the frontend already
consumes them. The fix is purely in the server-side delay computation.
- **Existing retryability rules unchanged.** This fixes *when* we sleep,
not *whether* an error is retryable.
- **Provider hint is a floor:** `max(baseDelay, RetryAfter)` ensures we
never retry earlier than the provider asks, and never weaken our own
backoff curve.
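
A minimal sketch of the header parsing and the floor rule described above, using only the standard library. The function names echo the commit message, but the signatures, the `map[string]string` header representation, and the 60 s cap on the base delay are assumptions and not the actual `chaterror`/`chatretry` code.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// maxBaseDelay is an assumed cap; the commit message does not state one.
const maxBaseDelay = 60 * time.Second

// lookupHeader finds a header value with a case-insensitive key match.
func lookupHeader(h map[string]string, key string) (string, bool) {
	for k, v := range h {
		if strings.EqualFold(k, key) {
			return strings.TrimSpace(v), true
		}
	}
	return "", false
}

// retryAfterFromHeaders checks retry-after-ms first, then retry-after as
// integer seconds or an HTTP-date. It returns 0 when no usable hint exists.
func retryAfterFromHeaders(h map[string]string) time.Duration {
	if v, ok := lookupHeader(h, "retry-after-ms"); ok {
		if ms, err := strconv.Atoi(v); err == nil && ms > 0 {
			return time.Duration(ms) * time.Millisecond
		}
	}
	if v, ok := lookupHeader(h, "retry-after"); ok {
		if secs, err := strconv.Atoi(v); err == nil && secs > 0 {
			return time.Duration(secs) * time.Second
		}
		if t, err := http.ParseTime(v); err == nil {
			if d := time.Until(t); d > 0 {
				return d
			}
		}
	}
	return 0
}

// baseDelay is plain exponential backoff: 1 s, 2 s, 4 s, ... up to the cap.
func baseDelay(attempt int) time.Duration {
	d := time.Second
	for i := 0; i < attempt && d < maxBaseDelay; i++ {
		d *= 2
	}
	if d > maxBaseDelay {
		d = maxBaseDelay
	}
	return d
}

// effectiveDelay treats the provider hint as a floor under the local backoff.
func effectiveDelay(attempt int, retryAfter time.Duration) time.Duration {
	d := baseDelay(attempt)
	if retryAfter > d {
		return retryAfter
	}
	return d
}

func main() {
	hint := retryAfterFromHeaders(map[string]string{"Retry-After": "30"})
	fmt.Println(effectiveDelay(0, hint))        // longer hint wins
	fmt.Println(effectiveDelay(3, time.Second)) // shorter hint keeps the base delay
}
```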
2026-03-27 01:20:46 +11:00
Danny Kopping 8eade29e68 chore: update AI Bridge warning to require AI Governance Add-On (#23662)
*Disclaimer: implemented by a Coder Agent using Claude Opus 4.6,
reviewed by me.*

Replace the transitional soft warning message:

> AI Bridge is now Generally Available in v2.30. In a future Coder
version, your deployment will require the AI Governance Add-On to
continue using this feature. Please reach out to your account team or
sales@coder.com to learn more.

with the definitive requirement message:

> The AI Governance Add-On is required to use AI Bridge. Please reach
out to your account team or sales@coder.com to learn more.

Updated in:
- `enterprise/coderd/license/license.go`
- `enterprise/coderd/license/license_test.go` (2 occurrences)
2026-03-26 11:10:53 +02:00
Jake Howell 0cea4de69e fix: AI governance into AI Governance (#23553)
2026-03-25 20:06:48 +11:00
Ethan 70f031d793 feat(coderd/chatd): structured chat error classification and retry hardening (#23275)
> **PR Stack**
> 1. #23351 ← `#23282`
> 2. #23282 ← `#23275`
> 3. **#23275** ← `#23349` *(you are here)*
> 4. #23349 ← `main`

---

## Summary

Extracts a structured error classification subsystem for agent chat
(`chatd`) so that retry and error payloads carry machine-readable
metadata — error kind, provider name, HTTP status code, and retryability
— instead of raw error strings.

This is the **backend half** of the error-handling work. The frontend
counterpart is in #23282.

## Changes

### New package: `coderd/chatd/chaterror/`

Canonical error classification — extracts error kind, provider, status
code, and user-facing message from raw provider errors. One source of
truth that drives both retry policy and stream payloads.

- **`kind.go`**: Error kind enum (`rate_limit`, `timeout`, `auth`,
`config`, `overloaded`, `unknown`).
- **`signals.go`**: Signal extraction — parses provider name, HTTP
status code, and retryability from error strings and wrapped types.
- **`classify.go`**: Classification logic — maps extracted signals to an
error kind.
- **`message.go`**: User-facing message templates keyed by kind +
signals.
- **`payload.go`**: Projectors that build `ChatStreamError` and
`ChatStreamRetry` payloads from a classified error.
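
As a rough illustration of the signals-to-kind step, the sketch below maps an HTTP status code to one of the kinds listed above. The type names, the `Classify` function, and the specific status-code mapping are assumptions for illustration, not the actual `chaterror` API.

```go
package main

import (
	"fmt"
	"net/http"
)

// Kind uses the kind strings listed above; the Go identifiers are hypothetical.
type Kind string

const (
	KindRateLimit  Kind = "rate_limit"
	KindTimeout    Kind = "timeout"
	KindAuth       Kind = "auth"
	KindConfig     Kind = "config"
	KindOverloaded Kind = "overloaded"
	KindUnknown    Kind = "unknown"
)

// Signals is a hypothetical container for what signal extraction produces.
type Signals struct {
	Provider   string
	StatusCode int
}

// Classify maps extracted signals to a kind. The mapping below is an
// illustrative assumption, not the real classification table.
func Classify(s Signals) Kind {
	switch s.StatusCode {
	case http.StatusTooManyRequests:
		return KindRateLimit
	case http.StatusUnauthorized, http.StatusForbidden:
		return KindAuth
	case http.StatusRequestTimeout, http.StatusGatewayTimeout:
		return KindTimeout
	case http.StatusServiceUnavailable:
		return KindOverloaded
	default:
		return KindUnknown
	}
}

func main() {
	fmt.Println(Classify(Signals{Provider: "example", StatusCode: 429}))
	fmt.Println(Classify(Signals{Provider: "example", StatusCode: 500}))
}
```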

### Modified

- **`codersdk/chats.go`**: Added `Kind`, `Provider`, `Retryable`,
`StatusCode` fields to `ChatStreamError` and `ChatStreamRetry`.
- **`coderd/chatd/chatretry/`**: Thinned to retry-policy only;
classification logic moved to `chaterror`.
- **`coderd/chatd/chatloop/`**: Added per-attempt first-chunk timeout
(60 s) via `guardedStream` wrapper — produces retryable
`startup_timeout` errors instead of hanging forever.
- **`coderd/chatd/chatd.go`**: Publishes normalized retry/error payloads
via `chaterror` projectors.
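
The first-chunk timeout can be pictured with a small sketch like the one below. It assumes the stream is exposed as a channel of chunks; `firstChunk` and `errStartupTimeout` are hypothetical names, and the real `guardedStream` wrapper may be structured differently.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errStartupTimeout is a hypothetical sentinel for a stream that never
// produced a first chunk before the deadline.
var errStartupTimeout = errors.New("startup_timeout: no first chunk before deadline")

// firstChunk waits for the first value on chunks. If nothing arrives within
// timeout it returns errStartupTimeout instead of blocking forever.
func firstChunk(chunks <-chan string, timeout time.Duration) (string, error) {
	timer := time.NewTimer(timeout)
	defer timer.Stop()

	select {
	case c, ok := <-chunks:
		if !ok {
			return "", errors.New("stream closed before first chunk")
		}
		return c, nil
	case <-timer.C:
		return "", errStartupTimeout
	}
}

func main() {
	// A stream that already has a chunk buffered returns it immediately.
	ready := make(chan string, 1)
	ready <- "hello"
	chunk, err := firstChunk(ready, time.Second)
	fmt.Println(chunk, err)

	// A stalled stream yields the timeout error; the PR uses 60 s, the
	// short duration here just keeps the example fast.
	stalled := make(chan string)
	_, err = firstChunk(stalled, 10*time.Millisecond)
	fmt.Println(err)
}
```

A caller can treat `errStartupTimeout` as retryable and start a fresh attempt.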
2026-03-25 13:47:54 +11:00
Mathias Fredriksson 38f723288f fix: correct malformed struct tags in organizationroles and scim_test (#23497)
Fix leading space in table tag and escaped-quote tag syntax.

Extracted from #23201.
2026-03-25 13:11:08 +11:00
Asher 81188b9ac9 feat: add filtering by service account (#23468)
You can now include or exclude service accounts using
`service_account:true` or `service_account:false`, or using the filter
dropdown.
2026-03-24 10:13:25 -08:00
Danny Kopping dba9f68b11 chore!: remove members' ability to read their own interceptions; rationalize RBAC requirements (#23320)
_Disclaimer: produced by Claude Opus 4.6, reviewed by me._

**This is a breaking change.** Users who do not have the `owner` or sitewide `auditor` role will no longer be able to view interceptions.  
Regular users should not need to view this information; in fact, a malicious insider could use it to learn what we do and do not track, and then exfiltrate data or perform actions unobserved.

---

Changed authorization for AI Bridge interception-related operations from system-level permissions to resource-specific permissions. The following functions now authorize against `rbac.ResourceAibridgeInterception` instead of `rbac.ResourceSystem`:

- `ListAIBridgeTokenUsagesByInterceptionIDs`
- `ListAIBridgeToolUsagesByInterceptionIDs`
- `ListAIBridgeUserPromptsByInterceptionIDs`

Updated RBAC roles to grant AI Bridge interception permissions:

- **User/Member roles**: Can create and update AI Bridge interceptions but cannot read them back
- **Service accounts**: Same create/update permissions without read access
- **Owners/Auditors**: Retain full read access to all interceptions

Removed the system-level authorization bypass in the `populatedAndConvertAIBridgeInterceptions` function, so that proper resource-level authorization checks apply.

Updated tests to reflect the new permission model where members cannot view AI Bridge interceptions, even their own, while owners and auditors maintain full visibility.
2026-03-24 12:03:20 +02:00
Danny Kopping 43a1af3cd6 feat: session list API (#23202)
_Disclaimer: initially produced by Claude Opus 4.6, heavily modified and reviewed by me._

Closes https://github.com/coder/internal/issues/1360

Adds a new `/api/v2/aibridge/sessions` API which returns "sessions".

Sessions, as defined in the [RFC](https://www.notion.so/coderhq/AI-Bridge-Sessions-Threads-2ccd579be59280f28021d3baf7472fbe?source=copy_link), are a set of interceptions logically grouped by a session key issued by the client.  
The API design for this endpoint was done in [this doc](https://github.com/coder/internal/issues/1360).

If the client has not provided a session ID, we fall back to the thread root ID, and if that is not present either, we use the interception's own ID (i.e. a session of a single interception, which is effectively what we currently show in our `/api/v2/aibridge/interceptions` API).
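
The fallback order can be sketched in Go as below. The `Interception` struct and its field names are hypothetical simplifications; the actual grouping is done in the SQL query.

```go
package main

import "fmt"

// Interception is a hypothetical, trimmed-down record. Field names are
// assumptions for illustration only.
type Interception struct {
	ID           string
	SessionID    string // client-provided; may be empty
	ThreadRootID string // may be empty
}

// sessionKey applies the fallback order described above: client session ID,
// then thread root ID, then the interception's own ID.
func sessionKey(i Interception) string {
	if i.SessionID != "" {
		return i.SessionID
	}
	if i.ThreadRootID != "" {
		return i.ThreadRootID
	}
	return i.ID
}

func main() {
	fmt.Println(sessionKey(Interception{ID: "i1", SessionID: "s1", ThreadRootID: "t1"}))
	fmt.Println(sessionKey(Interception{ID: "i2", ThreadRootID: "t1"}))
	fmt.Println(sessionKey(Interception{ID: "i3"}))
}
```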

The SQL query looks gnarly but it's relatively simple, and seems to perform well (~200ms) even when I import dogfood's `aibridge_*` tables into my workspace. If we need to improve performance on this later we can investigate materialized views, perhaps, but for now I don't think it's warranted.

---

_The PR looks large but it's got a lot of generated code; the actual changes aren't huge._
2026-03-24 08:58:47 +02:00
Cian Johnston 80a172f932 chore: move chatd and related packages to /x/ subpackage (#23445)
- Moves `coderd/chatd/`, `coderd/gitsync/`, `enterprise/coderd/chatd/`
under `x/` parent directories to signal instability
- Adds `Experimental:` glue code comments in `coderd/coderd.go`

> 🤖 This PR was created with the help of Coder Agents, and was
reviewed by my human. 🧑‍💻
2026-03-23 17:34:43 +00:00
Cian Johnston ef14654078 chore: move chat methods to ExperimentalClient (#23441)
- Changes all 41 chat method receivers in `codersdk/chats.go` from
`*Client` to `*ExperimentalClient`, so that callers are aware that these
methods call potentially unstable `/api/experimental` endpoints.


> 🤖 This PR was created with the help of Coder Agents, and has been
reviewed by my human. 🧑‍💻
2026-03-23 14:32:11 +00:00
Asher 24ab216dd1 feat: add new group members endpoint with filtering and pagination (#23067)
Partially addresses #21813 (still need to make changes to the "add user"
button to be complete)

Since there are a lot of user tests already, I moved them into
`coderdtest` to be shared.
2026-03-20 12:43:03 -08:00
Jaayden Halko 6f244cddde feat: display the addon license UI (#22948)
<img width="1052" height="234" alt="Screenshot 2026-03-18 at 21 58 57"
src="https://github.com/user-attachments/assets/136ccb1f-e47a-44fd-804d-859301161435"
/>

---------

Co-authored-by: Steven Masley <stevenmasley@gmail.com>
2026-03-20 16:34:17 +00:00
Ethan a1e912a763 fix(chatd): deliver retry control events via pubsub (#23349)
> **PR Stack**
> 1. #23351 ← `#23282`
> 2. #23282 ← `#23275`
> 3. #23275 ← `#23349`
> 4. **#23349** ← `main` *(you are here)*

---

Retry events were published only to the local in-process stream via
`publishEvent()`. When pubsub is active, `Subscribe()`'s merge loop only
forwarded durable events (messages, status, errors) from pubsub
notifications, so retry events were silently dropped for cross-replica
subscribers.

This adds a `publishRetry()` helper that publishes both locally and via
pubsub, and extends the `Subscribe()` notification handler to forward
retry events.

**Changes:**
- `coderd/pubsub/chatstreamnotify.go`: add `Retry` field to notify
  message
- `coderd/chatd/chatd.go`: add `publishRetry()`, update `OnRetry`
  callback, extend `Subscribe()` to forward `notify.Retry`
- `coderd/chatd/chatd_internal_test.go`: focused pubsub delivery test
- `enterprise/coderd/chatd/chatd_test.go`: cross-replica end-to-end test
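
The dual-publish idea behind `publishRetry()` can be sketched as follows. The `publisher` interface and `recorder` type are hypothetical stand-ins for the local stream and pubsub; the real helper's signature is not shown in this PR description.

```go
package main

import (
	"errors"
	"fmt"
)

// publisher is a hypothetical stand-in for either the local in-process
// stream or the cross-replica pubsub.
type publisher interface {
	Publish(event string) error
}

// recorder is a test double that remembers every event it receives.
type recorder struct{ events []string }

func (r *recorder) Publish(event string) error {
	r.events = append(r.events, event)
	return nil
}

// publishRetry sends the event to both destinations so that subscribers on
// other replicas also see it. Both publishes are attempted; any errors are
// joined and returned together.
func publishRetry(local, remote publisher, event string) error {
	return errors.Join(local.Publish(event), remote.Publish(event))
}

func main() {
	local, remote := &recorder{}, &recorder{}
	if err := publishRetry(local, remote, "retry attempt 1"); err != nil {
		fmt.Println("publish failed:", err)
		return
	}
	fmt.Println(len(local.events), len(remote.events))
}
```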
2026-03-20 15:19:41 +00:00