Commit Graph

1358 Commits

Author SHA1 Message Date
Danny Kopping 282ab7de34 refactor: load AI providers from the database at startup (#25672)
Replace the env-based `BuildProviders` with a DB-backed loader. The database is now the single source of truth for runtime provider configuration; env config arrives via `SeedAIProvidersFromEnv` (run at boot) and `BuildProviders` reads it back as `aibridge.Provider` instances. `cli/server.go` and `enterprise/cli/server.go` both call the same path, so aibridged and aibridgeproxyd see the same provider set.

Per-provider `DumpDir` is replaced by a top-level `CODER_AI_GATEWAY_DUMP_DIR` base; each provider's effective dump path is `<base>/<provider name>`.
2026-05-26 15:57:01 +02:00
Danny Kopping c801dcbbc8 fix: strip route prefix when passing request to aibridged handler (#25671)
We weren't stripping the API base (`/api/v2/aibridge`), leading to 404s
when using the in-memory transport.

Signed-off-by: Danny Kopping <danny@coder.com>
2026-05-26 08:04:26 +00:00
Danny Kopping 4ddda3a9db feat: filter interceptions and sessions by provider name (#25640)
Allows filtering sessions & interceptions by provider name, and adds a test to vaidate that provider name is immutable (at least until #25606 lands).
2026-05-25 16:31:48 +02:00
Danny Kopping ef6ee2af68 chore: tolerate empty providers at startup and log env seeds (#25605)
Since AI Gateway is now enabled by default, and if the AI Gateway Proxy is enabled too it's possible the server can start without any configured providers. This would previously block startup, which is unacceptable.

In an upstack PR we will handle reloading the providers at runtime, so the server needs to be able to start up even if it can't handle any proxy requests to AI Gateway.

This change was necessitated because if there are providers configured in the environment they need to be seeded _before_ the proxy starts.
2026-05-22 12:45:14 +02:00
Michael Suchacz ca1f6b19a2 feat: remove legacy chat provider tables (#25416) 2026-05-22 09:50:01 +02:00
Danny Kopping ddec110b0e refactor: move aibridged out of enterprise to AGPL (#25570)
In order to allow Coder Agents to use AI Gateway in OSS, we need to rehome the `aibridged`\-related code into the AGPL path.

The HTTP API is only registered under enterprise so will still require the AI Governance Add-on to be present in order to use it, whereas Coder Agents uses an in-memory pipe to the same handlers.
2026-05-22 09:11:37 +02:00
Danny Kopping c50b0e84b9 feat!: default CODER_AI_GATEWAY_ENABLED to true (#25575)
`CODER_AI_GATEWAY_ENABLED` / `CODER_AIBRIDGE_ENABLED` is now being defaulted to `true` now that it will be used by Coder Agents.

If you previously had this value disabled explicitly, that value will persist.
2026-05-22 08:57:36 +02:00
Danny Kopping 9341efec9f feat!: seed ai_providers from env on server startup (#24895)
_Disclaimer: implemented by a Coder Agent using Claude Opus 4.7_

Part of the implementation of [RFC: Common AI Provider Configs](https://www.notion.so/coderhq/RFC-Common-AI-Provider-Configs-34bd579be59280ed958feffb82024797) (AIGOV-201).

## Note

This change can cause a previously working installation to fail to start should a conflict exist between the providers configured in the environment & those now migrated to the database.

I'll raise a PR upstack to document this process and workarounds should a startup fail.

## What this PR does

Reconciles environment-derived AI provider configuration with the `ai_providers` table at server startup. The seed runs **before** the aibridged daemon is initialized, so the runtime always reads providers from the database; the legacy `CODER_AIBRIDGE_*` environment variables become a one-shot migration source.

### Behavior

- Concurrent server starts are serialized through a Postgres advisory lock (`LockIDAIProvidersEnvSeed`).
- Missing rows are inserted with an audit entry attributed to the system actor.
- Existing rows whose canonical hash matches the env-derived hash are left alone (the common no-op restart path).
- Existing rows whose canonical hash does **not** match cause server startup to fail with a descriptive error so the operator can explicitly resolve the conflict in either env or DB.
- Soft-deleted rows are NOT resurrected from env; an explicit operator deletion is sticky across restarts.
- Indexed providers whose name conflicts with a legacy env var fail startup with a clear remediation message.
- Unknown provider types (e.g. `copilot`, until the DB enum is widened) are skipped with a log entry rather than failing startup.

### Canonical hashing

The `canonicalAIProvider` shape captures exactly the fields that determine runtime behavior — `type`, `base_url`, and the Bedrock subset of settings (access key, access key secret, region, model, small fast model) — and is hashed with SHA-256. The hash is **computed on demand from the row + env**, never persisted, so the database does not need a new column for it. API keys live in the separate `ai_provider_keys` table and are intentionally excluded from the hash so operators can rotate keys via the API without forcing a server restart.

<details>
<summary>Decision log</summary>

- The hash is intentionally not persisted in the database. The RFC discussed this trade-off; computing on demand keeps the schema minimal and lets the canonical shape evolve without a migration.
- The lock uses an `iota` slot in `coderd/database/lock.go` rather than `GenLockID` so it's stable, easy to audit, and matches the convention used for every other startup lock.
- A bearer-token Anthropic provider whose env vars also set Bedrock metadata but no AWS credentials does NOT store the Bedrock fields. Without credentials the discriminated settings would misrepresent the row as Bedrock auth.
- We deliberately do NOT publish to the `ai_providers_changed` pubsub channel from the seed because the seed completes before any subscriber is started; the follow-up PR introduces that channel.

</details>
2026-05-22 08:37:27 +02:00
Michael Suchacz 06526a5822 feat: use AI provider chat APIs (#25415) 2026-05-22 07:53:23 +02:00
Michael Suchacz 40878eeba4 feat: add AI provider schema expansion (#25412) 2026-05-22 02:16:01 +02:00
Jon Ayers 269bd0cb8d fix: skip no-op peer updates in pgcoord binder (#24226) 2026-05-21 17:59:12 -05:00
Paweł Banaszewski 46e93e6325 chore: add ai_gateway options that alias aibridge options (#25061)
Adds options matching new AI Gateway naming.
New options are added as alias for old options. Old options are still
working.
Old options have deprecated message.
No conflict detection was added.

Updated documentation so it mentions only new options. Added note about
old options still working.

> Various AI tools where used to create this PR
2026-05-21 11:14:11 +02:00
Steven Masley 9b6eadab77 fix: drop N+1 db query on template ACL available (#25465)
Fixes
[PLAT-149](https://linear.app/codercom/issue/PLAT-149/template-permissions-search-is-extremely-slow-with-many-groups).

`/acl/available` ran a db query per group. A deployment with >5,000
groups made this route extremely slow.
2026-05-20 22:40:50 +00:00
Spike Curtis 8dc4d76890 chore: add agent-connection-watch for workspaces (#24507)
<!--

If you have used AI to produce some or all of this PR, please ensure you have read our [AI Contribution guidelines](https://coder.com/docs/about/contributing/AI_CONTRIBUTING) before submitting.

-->

relates to GRU-18  
  
Adds basic implementation for Workspace Agent Connection Watch and tests.  
  
Missing are handling of logs.
2026-05-20 13:09:11 -04:00
Danny Kopping 44b1edd4da fix: unify key-ops audit shape and surface per-key detail (#25534)
Adding missed commit from https://github.com/coder/coder/pull/25484

This formats the audit logs correctly

![image.png](https://app.graphite.com/user-attachments/assets/598d018b-cdf5-4a2c-8321-24ba2c650a1a.png)



<!--

If you have used AI to produce some or all of this PR, please ensure you have read our [AI Contribution guidelines](https://coder.com/docs/about/contributing/AI_CONTRIBUTING) before submitting.

-->
2026-05-20 17:33:26 +02:00
Michael Suchacz e105e3af45 test: cover personal skill API (#25365)
> Mux updated this PR on behalf of Mike.

## Stack Context

This PR is the API test coverage slice in the experimental personal
skills stack. The storage, schema, permissions, API, and SDK
implementation merged in #25363.

Stack order:
1. #25362 personal skill resolver
2. #25363 storage, permissions, API, and SDK
3. #25365 API test coverage
4. #25366 chattool and chatd integration
5. #25066 settings UI and docs
6. #25386 personal skills slash menu

## What?

Adds API and audit tests for personal skill CRUD, validation failures,
limits, authorization, soft-delete cleanup, and audit content tracking.

This PR is now test-only. It does not include migrations, generated
database code, or API implementation changes.

## Why?

The feature touches storage, permissions, and audit behavior. These
tests make the server behavior reviewable and protected without
re-reviewing the implementation that already merged in #25363.

## Validation

- `go test ./coderd -run '^(TestUserSkill|TestPatchUserSkill)' -count=1`
- `go test ./enterprise/coderd -run
'^TestUserSkillAuditDiffTracksContent$' -count=1`
- pre-commit hook via `gt modify --no-edit`
2026-05-20 11:27:09 +02:00
Danny Kopping dd3223451b feat: add AI providers HTTP CRUD handlers (#24894) 2026-05-20 10:21:36 +02:00
Michael Suchacz 5a8d0016a5 feat: add personal skill storage, API, and SDK (#25363)
> Mux updated this PR on behalf of Mike.

## Stack Context

This PR is the storage, permissions, API, and SDK layer for experimental
personal skills. #25362 has landed on `main`, so this branch is
restacked directly on `main`.

Stack order:
1. #25363 storage, permissions, API, and SDK
2. #25365 API test coverage
3. #25366 chattool and chatd integration
4. #25066 settings UI and docs
5. #25386 personal skills slash menu

## What?

Adds the `user_skills` database table, generated queries, RBAC resources
and scopes, audit resource handling, experimental user-scoped CRUD
endpoints, SDK types, and generated API/site types.

Follow-up review and restack fixes:
- Enforce a bounded personal skill description in parser and database
constraints.
- Return `403 Forbidden` for unauthorized create and update attempts.
- Return explicit conflict responses when soft-deleted users are
targeted.
- Keep user admins out of personal skills, while site owners can read
and delete but not create or update.
- Document trigger-raised constraint names and keep schema constants
covered by tests.
- Reuse `UserSkillMetadata` in the full `UserSkill` SDK response type.
- Generate user skill IDs in Go instead of relying on a database
default.
- Rebase on latest `main` and renumber the user skills migration to
`000502_user_skills`.

## Why?

Personal skills need durable user-owned storage with owner
authorization, limited site-owner moderation, and a hidden API surface
before chatd can consume them.

## Validation

- `make gen`
- `go test ./coderd/database -run '^TestUserSkillSchemaConstants$'
-count=1`
- `go test ./coderd/database/dbauthz -run
'^TestMethodTestSuite/TestUserSkills$' -count=1`
- `go test ./coderd -run '^TestPatchUserSkill$' -count=1`
- `go test ./codersdk ./coderd/database/db2sdk`
- `make lint`
- pre-commit hook on `97fd58108d`
2026-05-20 00:09:09 +02:00
Steven Masley 51b531f5b3 chore: 'go generate' mockgen to use go tool wrapper (#25490)
Calling `mockgen` relies on the executable in the `$PATH`. Using `go
tool` uses the one defined in `go.mod`
2026-05-19 14:53:13 +00:00
Danielle Maywood 170a6e1fe9 feat: add chat sharing foundation (#25041) 2026-05-18 22:32:05 +01:00
Yevhenii Shcherbina 2732378da2 feat: audit group AI budget mutations (#25374)
Relates to
https://linear.app/codercom/issue/AIGOV-284/add-group-budgets-table-and-crud-api

Adds audit-log support for `group_ai_budget` mutations. Without it, an
admin could silently lower a spend limit from `$500` to `$50` or delete
a budget entirely, with no record of who performed the action.

Both write (`create-or-update`) and delete actions now produce audit log
entries, including before/after diffs for `spend_limit_micros`.

Depends on #25203.

## Old Version
<img width="1340" height="456" alt="image"
src="https://github.com/user-attachments/assets/e9ff52fb-a905-4aef-a4ee-7cdc58e68b75"
/>

## New Version (see
https://github.com/coder/coder/pull/25374/changes/9d22833de87cc106c24142c1d471a3f71872bf67)
<img width="1347" height="496" alt="image"
src="https://github.com/user-attachments/assets/1b9bbfa1-f86d-48e3-a0b1-266eb76f851f"
/>
2026-05-18 15:17:20 -04:00
Marcin Tojek 38772bdb7c refactor: remove cache tokens from ExtraTokenTypes (#25118)
Fixes: https://github.com/coder/aibridge/issues/243

> Generated with [Coder Agents](https://coder.com/agents)
2026-05-18 11:30:19 +02:00
Danny Kopping b7a282d544 fix(enterprise/coderd/prebuilds): downgrade reconciliation stats log to debug (#25340)
*Disclaimer: implemented by a Coder Agent using Claude Opus 4.6*

The `reconciliation stats` log line runs on every reconciliation tick
(every 5 minutes by default), even when there are no presets to
reconcile. In a steady-state installation without prebuilds activity,
this is the only log line that persistently shows up at info level.
Demote it to `debug` so the steady-state log output stays quiet.

Before:

```
2026-05-14 15:01:25.085 [info]  coderd.prebuilds: reconciliation stats  elapsed=1.649153ms  presets_total=0  presets_reconciled=0
```

After: same line is emitted at `debug` level and is suppressed at the
default info log level.
2026-05-18 10:47:41 +02:00
Callum Styan 191dd230ae feat: add agentfake scaletest subcommand (#25072)
This PR builds on top of https://github.com/coder/coder/pull/25070 to
add a way of running the larger "fake agent" manager via the existing
CLI, pulling in the URL/credentials already set.

With this, we can run a pod per scaletest region to act as all the
workspaces in that region.


This is in a new subcommand `scaletest agentfake` currently.

---------

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2026-05-15 14:36:54 -07:00
Danny Kopping 985ae9ecd5 feat(enterprise/dbcrypt): encrypt ai_providers and ai_provider_keys at rest (#25326) 2026-05-15 15:42:02 +02:00
Ethan a59b951565 test: skip stale notification chatd flakes (#25376)
These chatd tests are flaking for the same stale control-notification
race tracked by CODAGT-353, so this change skips the newly reflaking
advisor-chain and `TestPatchChatMessage/ChangesModel` tests and rewrites
the older `TODO(hugodutka)` skips to point at the same root cause. This
keeps the known flakes documented consistently until the chatd
notification-flow refactor lands.

Closes CODAGT-427
Closes https://github.com/coder/internal/issues/1510
2026-05-15 17:36:48 +10:00
Callum Styan 81212470fd feat: implement basic (MVP version) of a fake agent + manager (#25070)
This PR introduces a "fake agent" + manager, which can be used during
scaletests to run a single executable that acts as many workspace
agents. The goals of these are to provide a much lighter weight
implementation of a workspace in terms of resource cost and startup time when executing scaletests.

---------

Signed-off-by: Callum Styan <callumstyan@gmail.com>
Co-authored-by: Mux <noreply@coder.com>
2026-05-14 14:46:36 -07:00
Yevhenii Shcherbina 238968cfa0 feat: add per-group AI budget table and endpoints (#25203)
Closes
https://linear.app/codercom/issue/AIGOV-284/add-group-budgets-table-and-crud-api

## Summary

Adds the `group_ai_budgets` table and the following endpoints:

- `GET /api/v2/groups/{group}/ai/budget`
- `PUT /api/v2/groups/{group}/ai/budget`
- `DELETE /api/v2/groups/{group}/ai/budget`

Each group may have at most one budget row. If no row exists, no budget
is enforced.

### Feature gate
  
Added `RequireFeatureMW(FeatureAIBridge)` on the `/ai/budget` sub-route.

## RBAC

Authorization reuses `rbac.ResourceGroup` with the existing
`.InOrganization(...).WithID(...)` scoping model.

The `dbauthz` wrappers load the parent `groups` row and authorize
against it.

No new resource type is introduced. As a result, anyone with
`group:update` permissions (Owner, OrgAdmin, or UserAdmin within the
organization) can manage AI budgets for that group.

## Read access for group members

`database.Group.RBACObject()` grants `policy.ActionRead` to all members
of the group through the group ACL:

```go
func (g Group) RBACObject() rbac.Object {
	return rbac.ResourceGroup.WithID(g.ID).
		InOrg(g.OrganizationID).
		// Group members can read the group.
		WithGroupACL(map[string][]policy.Action{
			g.ID.String(): {
				policy.ActionRead,
			},
		})
}
```

Because the `GET` endpoint authorizes against the same loaded `Group`
object, any group member can call:

```text
GET /api/v2/groups/{group}/ai/budget
```

`PUT` and `DELETE` remain admin-only. The group ACL grants only
`ActionRead`, so write operations continue to require role-based
`group:update` permissions.

## Alternative considered

A dedicated `rbac.ResourceGroupAiBudget` resource would allow budget
management to be separated from general group administration.

We decided not to add that complexity for now.
2026-05-14 15:54:37 -04:00
Danielle Maywood 9ddfafe2b1 feat: add chat ACL database foundation (#25080) 2026-05-14 17:18:50 +01:00
Danny Kopping 841b777ccd feat: add ai_providers table, queries, dbauthz, audit, RBAC (#24892) 2026-05-14 16:10:46 +02:00
Yevhenii Shcherbina b5e1ea33d8 feat: add AI budget policy and period deployment config (#25122)
Closes
https://linear.app/codercom/issue/AIGOV-283/add-deployment-config-for-ai-budget-policy-and-period

Adds `CODER_AI_BUDGET_POLICY` and `CODER_AI_BUDGET_PERIOD` deployment
options for AI Governance cost controls.
2026-05-12 10:48:36 -04:00
Kyle Carberry 5a5cd79c4c fix: drop buffered chat parts after their durable message commits (#25164) 2026-05-12 00:30:38 -04:00
Kyle Carberry 0ed57ee343 fix(coderd/x/chatd): checkpoint buffered message_parts to avoid stale replay (#25145) 2026-05-11 17:27:03 -04:00
Steven Masley 19573e8aee feat!: patchTemplateMeta to use optional fields (#24984)
Closes https://github.com/coder/coder/issues/13112

**Breaking Change**: Removed status code `StatusNotModified` when no
diffs occur in a patch. Now the patch is always applied and a template
is always returned.
2026-05-11 12:43:52 -05:00
Zach b221632615 fix: wipe user secrets when user is soft-deleted (#24985)
Extend the delete_deleted_user_resources() trigger so that secrets
belonging to a soft-deleted user are removed in the same transaction as
the existing api_keys and user_links cleanup.

user_secrets.user_id has ON DELETE CASCADE, but Coder soft-deletes users
by flipping users.deleted rather than removing the row, so the foreign key
cascade never fires and secrets would otherwise survive deletion.

Assisted by Coder Agents.
2026-05-11 09:07:30 -06:00
Zach 81e2be69e9 test: use typed atomics in test files (#25071)
Use typed atomics (atomic.Int64, atomic.Int32, etc.) in test files to prevent
mixing atomic and non-atomic access on the same value, guarantee 64-bit
alignment on 32-bit platforms, and provide a cleaner API.
2026-05-11 08:41:17 -06:00
Jeremy Ruppel a1dbd758bc feat: add template builder deployment config and telemetry types (#25082) 2026-05-11 09:48:55 -04:00
Thomas Kosiewski 4a6756a3e8 fix: isolate test HTTP clients (#25038) 2026-05-11 11:03:38 +02:00
Marcin Tojek febabfb8b2 feat: add request/response dump support to aibridgeproxyd (#24837)
Closes https://github.com/coder/coder/issues/24335
2026-05-11 10:59:26 +02:00
Ethan b6dbc5614c fix(coderd/x/chatd): handle truncated provider streams (#25074)
coder/fantasy now fails closed when Anthropic or OpenAI Responses
streams close before their provider terminal events instead of yielding
a successful finish.

This bumps the fantasy replacement to coder/fantasy#33 and teaches chat
error classification to treat those failures as retryable timeout errors
with explicit stream-closed messages.

<img width="875" height="311" alt="image"
src="https://github.com/user-attachments/assets/69c6f7b5-c885-46d2-a88b-b7a2b111bd55"
/>
2026-05-08 15:52:42 +10:00
Susana Ferreira b6dacb4a3c feat: add automatic key failover for AI Bridge OpenAI (#24847)
## Description

Adds automatic key failover for centralized OpenAI provider, covering both chat completions and responses APIs. Same shape as the Anthropic PR: each upstream call walks the configured key pool, keys are marked **temporary** on 429 (with cooldown from `Retry-After`) and **permanent** on 401/403. Each agentic-loop iteration gets its own fresh walker so a tool-call continuation can fail over independently of the initial request.

BYOK is unchanged: BYOK requests run as a single attempt with no failover.

## Changes

- `config.OpenAI` carries a `KeyPool`. `Key` remains for BYOK Authorization Bearer set per interception.
- Chat completions blocking interceptor: walks the pool via `newChatCompletionWithKeyFailover`, marks keys on key-specific failures, returns on first success or non-failover error.
- Chat completions streaming interceptor: per-iteration walker. Pre-stream failures fail over to the next key; mid-stream errors are relayed as SSE events.
- Responses blocking interceptor: extracts `newResponseWithKeyFailover` parallel to chatcompletions.
- Responses streaming interceptor: per-iteration walker, retains the existing buffer-then-forward design.

## Related Issues

Related to: https://github.com/coder/internal/issues/1446
Related to: https://linear.app/codercom/issue/AIGOV-197/aibridge-automatic-key-failover-for-bridged-and-passthrough-routes

## Follow-up PRs

- Bedrock multi-key support.
- Refactor provider vs interceptor config separation.
- Record the actually-used key in the interception credential hint after failover.

> [!NOTE]
> Initially generated by Claude Opus 4.7, modified and reviewed by @ssncferreira
2026-05-07 15:35:46 +01:00
Susana Ferreira f1155ac4d7 feat: add automatic key failover for AI Bridge Anthropic (#24836)
## Description

Adds automatic key failover for centralized Anthropic provider. When a key pool is configured, each upstream call walks the pool and tries keys in order until one succeeds or the pool is exhausted. Keys are marked **temporary** on 429 (with cooldown from `Retry-After`) and **permanent** on 401/403. Errors that aren't key-specific don't trigger failover. Each agentic-loop iteration gets its own fresh walker, so a tool-call continuation can fail over independently of the initial request.

BYOK is unchanged: BYOK requests run as a single attempt with no failover.

## Changes

- `config.Anthropic` carries a `KeyPool`. `Key` remains for BYOK X-Api-Key set per interception.
- Blocking interceptor: walks the pool, marks keys on key-specific failures, returns on first success or non-failover error.
- Streaming interceptor: per-iteration walker. Pre-stream failures fail over to the next key; mid-stream errors are relayed as SSE events.
- New `keypool` error types: `TransientExhaustionError` (carries soonest cooldown) and `ErrPermanentExhaustion`. Replace the prior `ErrAllKeysExhausted`.
- Error responses now consistently include the outer `"type": "error"` field.

## Related Issues

Related to: https://github.com/coder/internal/issues/1446
Related to: https://linear.app/codercom/issue/AIGOV-197/aibridge-automatic-key-failover-for-bridged-and-passthrough-routes

## Follow-up PRs

- Bedrock multi-key support.
- Refactor provider vs interceptor config separation.
- Record the actually-used key in the interception credential hint after failover.

> [!NOTE]
> Initially generated by Claude Opus 4.7, modified and reviewed by @ssncferreira
2026-05-07 14:57:44 +01:00
Jon Ayers 2cab1b41ad fix: increase MaxMessageSize to 16 MiB (#24599) 2026-05-06 10:12:56 -05:00
Michael Suchacz 0bfb9f6f13 feat: show agent turn summary in agents sidebar (#24942)
Persists the agent-generated turn-end summary on `chats` and shows it as
the Agents sidebar subtitle when present, falling back to the model
name. Errors still take precedence.

> Mux is acting on Mike's behalf.

## What changes

**Storage.** New nullable `last_turn_summary` column on `chats`
(migration `000486`). New `UpdateChatLastTurnSummary` query normalizes
blank/whitespace input to `NULL`, preserves `updated_at` (so the chat
does not jump to the top of the sidebar on summary writes), and uses an
`expected_updated_at` stale-write guard so an older async summary cannot
overwrite a newer turn.

**Backend.** `coderd/x/chatd/chatd.go` decouples summary generation from
webpush. Generated summaries persist for completed parent turns even
when webpush is unconfigured or has no subscriptions. The same generated
text is reused as the webpush body when webpush is configured, so the
summary model is not called twice. Generic fallback push text is no
longer persisted; it clears any stale summary instead.
Error/interrupt/pending-action terminal paths clear `last_turn_summary`
for the latest turn.

**Frontend.** `AgentsSidebar.tsx` subtitle priority is now `errorReason
|| lastTurnSummary || modelName`, normalized via the existing
`asNonEmptyString` helper from `blockUtils.ts`.

## Tests

- `TestUpdateChatLastTurnSummary` (database): success,
whitespace-to-NULL, stale guard rejects, `updated_at` preserved.
- `TestUpdateLastTurnSummaryRejectsStaleWrites` (chatd internal): direct
stale-`expected_updated_at` test.
- `TestSuccessfulChatPersistsTurnSummaryWithoutWebPush`: persistence
works without webpush subscriptions.
- `TestSuccessfulChatSendsWebPushWithSummary`: same generated text
drives both DB and push body.
-
`TestSuccessfulChatSendsWebPushFallbackWithoutSummaryForEmptyAssistantText`:
fallback text is not persisted.
- `TestErroredChatClearsLastTurnSummaryAndSendsWebPush`: error path
clears the field.
- `TestInterruptChatDoesNotSendWebPushNotification`: interrupt path
clears the field, no push fires.
- `AgentsSidebar.test.tsx`: subtitle priority for summary-present,
error-wins, no-summary fallback, whitespace fallback.
- `AgentsSidebar.stories.tsx`: `ChatWithTurnSummary` and
`ChatWithTurnSummaryAndError`.

## Notes

- No backfill. Existing chats keep showing the model name until their
next turn completes.
- Parent chats only in this iteration; the field is rendered on any
`Chat` if a future change extends generation to children.
- Decoupling generation from webpush adds quickgen model calls for
completed parent turns that previously skipped generation when no
subscriptions existed. Existing parent-only, assistant-text-present,
`PushSummaryModel` configured, and bounded-timeout gates keep this
behavior bounded.
2026-05-06 16:43:35 +02:00
Ethan e5c7fdff86 fix(coderd/x/chatd): refresh chat status and bound subscriber reads on Subscribe (#24095)
Tightens the chat stream subscription path on a few related axes. None
of these changes touch the steady-state event flow; they all concern the
subscribe handshake.

## Motivation

`Server.Subscribe` carries three responsibilities that were entangled:

1. Authorize the caller against the chat row.
2. Arm local + pubsub subscriptions before any DB reads
(subscribe-first-then-query).
3. Build the initial snapshot from a fresh chat row, message history,
and queue.

When all three live in one function and share the request context, a few
unfortunate behaviors fall out:

- The HTTP handler's middleware already loaded and authorized the chat
row, but `Subscribe(chatID)` discarded it and re-fetched on every
WebSocket connection.
- The chat row used to populate the initial `status` event was loaded
*before* the pubsub subscription was armed, so a status transition that
happened in that window was silently lost.
- Control-path DB reads inherited whatever context the caller passed in.
A caller without a deadline could wedge a subscriber goroutine
indefinitely on a stalled DB.
- A transient failure of the chat re-read collapsed the entire
subscription instead of degrading gracefully.

## What changes

**Split the auth boundary out into the type signature.** A new
`SubscribeAuthorized(ctx, chat, ...)` takes the already-authorized row
directly. The HTTP handler in `coderd/exp_chats.go` calls it with the
chat row from `httpmw.ChatParam`, eliminating the redundant
`GetChatByID`. `Subscribe(chatID)` is preserved as a thin wrapper for
callers that don't have a chat row in hand (tests, internal callers); it
does the auth lookup and delegates.

**Re-read the chat after arming subscriptions.** Inside
`SubscribeAuthorized`, after the local stream and pubsub subscriptions
are active, we reload the chat row to populate the initial `status`
event and any enterprise relay setup. Combined with the existing
subscribe-first-then-query pattern, this closes the gap where a status
transition between the middleware's load and the subscription arming
would not appear in either the initial snapshot or a live notification.

**Fall back to the middleware row on refresh failure.** If the
post-subscription refresh fails (transient DB blip, brief pool
exhaustion), we log a warning and reuse the row that proved
authorization in the first place. Messages, queue, and pubsub are all
independent of this row, so the stream still works; the initial `status`
is just slightly stale and self-corrects via the next pubsub event.

**Bound subscriber control-path DB reads.** A new
`streamSubscriberControlFetchContext` helper applies a 5-second fallback
timeout only when the caller has no deadline of their own. Used at the
chat refresh, the initial queue load, and the queue-update goroutine
following pubsub notifications. HTTP-driven callers pass through
unchanged; background callers can no longer hang forever on a stalled DB
and leak subscriber goroutines, pubsub subscriptions, and `chatStreams`
entries.
2026-05-06 14:29:53 +10:00
Ethan 46a60e6d5d refactor: move chat error kinds into codersdk (#24955)
Moves the chat error kind taxonomy from `coderd/x/chatd/chaterror` into
`codersdk.ChatErrorKind` and types `ChatError.Kind` /
`ChatStreamRetry.Kind` so generated TypeScript exposes an SDK-owned
union, including `usage_limit`. Backend chat classification now
references the SDK constants directly while preserving the existing JSON
string values.

Keeps chat usage-limit admission failures on their existing 409 response
shape. The frontend maps structured usage-limit responses to the
SDK-owned `usage_limit` kind, uses generated `TypesGen.ChatErrorKind`
directly, and removes the local string union and alias.
2026-05-06 11:57:48 +10:00
Zach f4197d676c refactor: remove unused tailnet connIO stats fields (#24911)
Drop start, lastWrite, and overwrites fields on connIO along with the
Stats() and Overwrites() methods. They have had no readers since
52901e121 which rewrote the PG coordinator's debug page to query the
database directly.
2026-05-05 08:46:53 -06:00
Ethan 4751416b29 fix!: persist structured chat errors (#24919)
**Breaking change for changelog:**

> `codersdk.Chat.last_error` now returns a structured `ChatError` object
(`{message, kind, provider, retryable, status_code, detail}`) instead of
a plain string. The chats API is experimental
(`/api/experimental/chats`), so this ships without a deprecation cycle;
consumers reading `chat.last_error` as a string must update to read
`chat.last_error.message`. SDK/generated TypeScript terminal error
payloads now use the single `ChatError` type; the live stream error
payload type is renamed from `ChatStreamError` to `ChatError`.

Persisted chat errors now carry the same provider-specific detail (kind,
provider, retryable, HTTP status, optional detail) as the live stream,
so refreshing a failed chat rehydrates with the full structured error
instead of a one-line headline.

Existing rows are migrated in place: legacy text errors are wrapped into
`{message, kind: "generic"}` so already-errored chats still render, and
rows with `last_error IS NULL` stay NULL. Internally, persisted fallback
decoding now reuses the existing `chaterror.KindGeneric` constant, with
no JSON value change.

Closes CODAGT-239
2026-05-05 12:56:06 +10:00
Atif Ali fad69df710 fix: correct SCIM Swagger try it out URLs (#24779) 2026-05-05 02:54:03 +05:00
Jon Ayers 6b9637d85a feat: replace pgcoordinator pg_notify triggers with app-level Publish() (#24717) 2026-05-01 15:00:08 -05:00