coder

mirror of https://github.com/coder/coder.git synced 2026-06-02 20:48:20 +00:00

Author	SHA1	Message	Date
Michael Suchacz	ca1f6b19a2	feat: remove legacy chat provider tables (#25416 )	2026-05-22 09:50:01 +02:00
Ethan	a59b951565	test: skip stale notification chatd flakes (#25376 ) These chatd tests are flaking for the same stale control-notification race tracked by CODAGT-353, so this change skips the newly reflaking advisor-chain and `TestPatchChatMessage/ChangesModel` tests and rewrites the older `TODO(hugodutka)` skips to point at the same root cause. This keeps the known flakes documented consistently until the chatd notification-flow refactor lands. Closes CODAGT-427 Closes https://github.com/coder/internal/issues/1510	2026-05-15 17:36:48 +10:00
Kyle Carberry	5a5cd79c4c	fix: drop buffered chat parts after their durable message commits (#25164 )	2026-05-12 00:30:38 -04:00
Kyle Carberry	0ed57ee343	fix(coderd/x/chatd): checkpoint buffered message_parts to avoid stale replay (#25145 )	2026-05-11 17:27:03 -04:00
Ethan	b6dbc5614c	fix(coderd/x/chatd): handle truncated provider streams (#25074 ) coder/fantasy now fails closed when Anthropic or OpenAI Responses streams close before their provider terminal events instead of yielding a successful finish. This bumps the fantasy replacement to coder/fantasy#33 and teaches chat error classification to treat those failures as retryable timeout errors with explicit stream-closed messages. <img width="875" height="311" alt="image" src="https://github.com/user-attachments/assets/69c6f7b5-c885-46d2-a88b-b7a2b111bd55" />	2026-05-08 15:52:42 +10:00
Ethan	e5c7fdff86	fix(coderd/x/chatd): refresh chat status and bound subscriber reads on Subscribe (#24095 ) Tightens the chat stream subscription path on a few related axes. None of these changes touch the steady-state event flow; they all concern the subscribe handshake. ## Motivation `Server.Subscribe` carries three responsibilities that were entangled: 1. Authorize the caller against the chat row. 2. Arm local + pubsub subscriptions before any DB reads (subscribe-first-then-query). 3. Build the initial snapshot from a fresh chat row, message history, and queue. When all three live in one function and share the request context, a few unfortunate behaviors fall out: - The HTTP handler's middleware already loaded and authorized the chat row, but `Subscribe(chatID)` discarded it and re-fetched on every WebSocket connection. - The chat row used to populate the initial `status` event was loaded before the pubsub subscription was armed, so a status transition that happened in that window was silently lost. - Control-path DB reads inherited whatever context the caller passed in. A caller without a deadline could wedge a subscriber goroutine indefinitely on a stalled DB. - A transient failure of the chat re-read collapsed the entire subscription instead of degrading gracefully. ## What changes Split the auth boundary out into the type signature. A new `SubscribeAuthorized(ctx, chat, ...)` takes the already-authorized row directly. The HTTP handler in `coderd/exp_chats.go` calls it with the chat row from `httpmw.ChatParam`, eliminating the redundant `GetChatByID`. `Subscribe(chatID)` is preserved as a thin wrapper for callers that don't have a chat row in hand (tests, internal callers); it does the auth lookup and delegates. Re-read the chat after arming subscriptions. Inside `SubscribeAuthorized`, after the local stream and pubsub subscriptions are active, we reload the chat row to populate the initial `status` event and any enterprise relay setup. Combined with the existing subscribe-first-then-query pattern, this closes the gap where a status transition between the middleware's load and the subscription arming would not appear in either the initial snapshot or a live notification. Fall back to the middleware row on refresh failure. If the post-subscription refresh fails (transient DB blip, brief pool exhaustion), we log a warning and reuse the row that proved authorization in the first place. Messages, queue, and pubsub are all independent of this row, so the stream still works; the initial `status` is just slightly stale and self-corrects via the next pubsub event. Bound subscriber control-path DB reads. A new `streamSubscriberControlFetchContext` helper applies a 5-second fallback timeout only when the caller has no deadline of their own. Used at the chat refresh, the initial queue load, and the queue-update goroutine following pubsub notifications. HTTP-driven callers pass through unchanged; background callers can no longer hang forever on a stalled DB and leak subscriber goroutines, pubsub subscriptions, and `chatStreams` entries.	2026-05-06 14:29:53 +10:00
Ethan	46a60e6d5d	refactor: move chat error kinds into codersdk (#24955 ) Moves the chat error kind taxonomy from `coderd/x/chatd/chaterror` into `codersdk.ChatErrorKind` and types `ChatError.Kind` / `ChatStreamRetry.Kind` so generated TypeScript exposes an SDK-owned union, including `usage_limit`. Backend chat classification now references the SDK constants directly while preserving the existing JSON string values. Keeps chat usage-limit admission failures on their existing 409 response shape. The frontend maps structured usage-limit responses to the SDK-owned `usage_limit` kind, uses generated `TypesGen.ChatErrorKind` directly, and removes the local string union and alias.	2026-05-06 11:57:48 +10:00
Ethan	4751416b29	fix!: persist structured chat errors (#24919 ) Breaking change for changelog: > `codersdk.Chat.last_error` now returns a structured `ChatError` object (`{message, kind, provider, retryable, status_code, detail}`) instead of a plain string. The chats API is experimental (`/api/experimental/chats`), so this ships without a deprecation cycle; consumers reading `chat.last_error` as a string must update to read `chat.last_error.message`. SDK/generated TypeScript terminal error payloads now use the single `ChatError` type; the live stream error payload type is renamed from `ChatStreamError` to `ChatError`. Persisted chat errors now carry the same provider-specific detail (kind, provider, retryable, HTTP status, optional detail) as the live stream, so refreshing a failed chat rehydrates with the full structured error instead of a one-line headline. Existing rows are migrated in place: legacy text errors are wrapped into `{message, kind: "generic"}` so already-errored chats still render, and rows with `last_error IS NULL` stay NULL. Internally, persisted fallback decoding now reuses the existing `chaterror.KindGeneric` constant, with no JSON value change. Closes CODAGT-239	2026-05-05 12:56:06 +10:00
Cian Johnston	2f855904be	refactor: add dbgen chat generators and migrate test boilerplate (#24497 ) - Adds chat-related dbgen generators covering defaults, overrides, and message field mapping. - Replaces raw single-row chat, message, provider, and model-config setup in tests with dbgen helpers. - Simplifies chat seed helpers after moving fixture setup into dbgen. > Generated with [Coder Agents](https://coder.com/agents).	2026-05-01 13:29:33 +01:00
Mathias Fredriksson	dd49a818f9	fix: export chatd.Start to separate server lifecycle (#24761 ) chatd.New() no longer auto-starts the acquire/wake loop. Callers that want chat processing call server.Start() explicitly. Tests that want a passive server skip Start(); heartbeat, stream janitor, and stale recovery still run. Closes coder/internal#1502	2026-04-29 13:54:49 +03:00
Cian Johnston	d9e3e206cc	fix(enterprise/coderd/x/chatd): deflake relay drain test for multiple timers (#24609 ) Fixes https://github.com/coder/internal/issues/1474. PR #24549 introduced a `quartz.NewMock` clock + `Trap().NewTimer("drain")` to remove the wall-clock race. However, the trap consumed only one `NewTimer("drain")` call via `MustWait/MustRelease`. The merge loop has two code paths that create drain timers with the same tag: - Relay result handler (`drainAndClose` path in `relayReadyCh` case): when an async dial completes after `drainAndClose` was set. - Status notification handler (`relayParts != nil` branch in `statusNotifications` case): when `status!=running` arrives while an active relay exists. Depending on goroutine scheduling, one or both paths fire. When two calls hit the trap, the second blocks the merge loop in `matchCallLocked` (quartz waits for all traps to release). Since the test already moved past `MustWait`, nobody reads the second call from the trap channel, deadlocking the test. - Replace the single `MustWait/MustRelease/Advance` with a goroutine that loops over `trapDrain.Wait`, releasing and advancing for every drain timer. - No production code changes. > 🤖	2026-04-23 11:13:41 +01:00
Cian Johnston	f56adf5731	fix(enterprise/coderd/x/chatd): deflake TestSubscribeRelayDrainWithinGraceLeavesBufferRetained (#24549 ) Fixes the flake reported in https://github.com/coder/internal/issues/1474. - Use a `quartz.NewMock` clock for the subscriber with a drain timer trap, so the 200ms relay drain fires only when explicitly advanced — fully deterministic, no wall-clock race - Give each `testutil.Eventually` call its own context so one slow assertion cannot starve subsequent ones of their shared 25s deadline > 🤖	2026-04-21 11:20:32 +00:00
Cian Johnston	12e49c18a5	fix(enterprise/coderd/x/chatd): reduce relay reconnect spam (#24495 ) - Replaces the hard-coded 500ms reconnect timer for dialing chat relays with exponential backoff via `coder/retry`. - `dialRelay` drops the `codersdk.ExperimentalClient.StreamChat` wrapper and calls `websocket.Dial` directly so we can capture `*http.Response.StatusCode` without parsing error strings. - Adds `RelayDialError` that exposes the HTTP status from `websocket.Dial` - Modifies retry logic: 401/403 tear the stream down immediately, 5xx/network/timeouts retry then tear down on cap. Outer stream closes cleanly so the browser SDK reconnects with a fresh cookie. - Retry state resets on successful dial and on target-worker change, not on every `closeRelay()`. > 🤖 Generated by Coder Agents.	2026-04-20 09:19:17 +01:00
Cian Johnston	3f6b40a833	fix: reap idle chatd stream states on a timer (#24476 ) * Adds `streamJanitorLoop` to clean up stale streams every 30s * zeroes dropped slots to aid in gc-eligibliity * Adds regression tests in coderd/x/chatd and enterprise/coderd/x/chatd > 🤖	2026-04-17 19:22:00 +01:00
Dean Sheather	3452ab3166	chore: add client_type field to chats and telemetry (#24342 ) Add a `chat_client_type` enum (`ui` \| `api`) and `client_type` column to the `chats` table. The column defaults to `api` for new rows so API callers don't need to set it explicitly. Existing rows are backfilled to `ui`. The field flows through `CreateChatRequest`, `chatd.CreateOptions`, `InsertChat`, and is returned in the `Chat` response via `db2sdk`. <details> <summary>Implementation notes (Coder Agents generated)</summary> ### Changes Database migration (000469) - New enum `chat_client_type` with values `ui`, `api`. - New `client_type` column, `NOT NULL DEFAULT 'api'`. - Backfill: `UPDATE chats SET client_type = 'ui'`. SQL query — `InsertChat` now includes `client_type`. SDK — `ChatClientType` type added; `ClientType` field added to both `CreateChatRequest` (optional, defaults server-side to `api`) and `Chat` response. Handler — `postChats` maps the request field (defaulting to `api`) and passes it through `chatd.CreateOptions`. Sub-agent — Child chats inherit their parent's `client_type`. db2sdk — Maps the database value to the SDK type. ### Decision log - Default is `api` (not `ui`) so existing API integrations get the correct value without code changes. - Backfill sets existing rows to `ui` per requirement. - Child chats inherit `client_type` from parent rather than defaulting. </details>	2026-04-16 23:57:05 +10:00
Ethan	b9bc0ad6df	test: skip TestSubscribeRelayEstablishedMidStream (#24431 ) Relates to https://github.com/coder/internal/issues/1455 From that issue: > Going to skip this test until the underlying race in chatd is fixed. https://github.com/coder/coder/pull/24279 was a band-aid fix that I no longer think is valuable pursuing short term. Hugo is working on a RFC for a redesign of the system to prevent the class of race condition into the future.	2026-04-16 23:55:41 +10:00
Cian Johnston	c552f9f281	fix: stop group spend limits from leaking across org boundaries (#24294 ) Three SQL queries (`GetUserGroupSpendLimit`, `ResolveUserChatSpendLimit`, `GetUserChatSpendInPeriod`) aggregated chat spend limits and usage globally across all organizations. A restrictive group limit in org A would bleed into org B. ## Changes - Add `organization_id` parameter to all three SQL queries in `coderd/database/queries/chats.sql` - When nil UUID is passed, queries fall back to global behavior (backward compat for HTTP dashboard endpoints) - When real org ID is passed, limits and spend are scoped to that organization - Thread `organizationID` through `ResolveUsageLimitStatus` → `checkUsageLimit` → all chatd call sites - Update dbauthz wrappers for new param structs - HTTP endpoints (`chatCostSummary`, `getMyChatUsageLimitStatus`) pass `uuid.Nil` with TODO for future org-scoped UI - Add `TestResolveUsageLimitStatus_OrgScoped` with 5 test cases covering org isolation, nil-UUID fallback, spend scoping, and user override priority Closes coder/internal#1466 > 🤖	2026-04-14 16:56:17 +01:00
Cian Johnston	22062ec52e	feat: add organization scoping to chats (#23827 ) Fixes https://github.com/coder/internal/issues/1436 * Adds organization_id to chats with backfill (workspace org → user org membership → default org) * No support yet for ACLs (follow-up issue) - Cross-org workspace binding rejected (both in `CreateChatRequest` and in `create_workspace` tool - Adds `OrganizationAutocomplete` to `AgentCreateForm` - Docs updated with `organization_id` in chats-api.md > 🤖 Written by a Coder Agent. Reviewed by many humans and many agents. --------- Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>	2026-04-13 12:31:25 +01:00
Kyle Carberry	f3f0a2c553	fix(enterprise/coderd/x/chatd): harden TestSubscribeRelayEstablishedMidStream against CI flakes (#24108 ) Fixes coder/internal#1455 Three changes to eliminate the timing-sensitive flake in `TestSubscribeRelayEstablishedMidStream`: 1. Reduce `PendingChatAcquireInterval` from `time.Hour` to `time.Second`. The primary trigger is still `signalWake()` from `SendMessage`, but a short fallback poll ensures the worker picks up the pending chat even under heavy CI goroutine scheduling contention. 2. Increase context timeout from `WaitLong` (25s) to `WaitSuperLong` (60s). The worker pipeline (model resolution, message loading, LLM call) involves multiple DB round-trips that can be slow when PostgreSQL is shared with many parallel test packages. 3. Add a status-polling loop while waiting for the streaming request. If the worker errors out during chat processing, the test now fails immediately with the error status and message instead of silently timing out. > Generated by Coder Agents	2026-04-07 13:41:33 -04:00
Kyle Carberry	e18094825a	fix: retain message_part buffer for cross-replica relay (#24031 )	2026-04-04 17:24:41 -04:00
Michael Suchacz	7d0a0c6495	feat: provider key policies and user provider settings (#23751 )	2026-04-02 19:46:42 +02:00
Ethan	7757cd8e08	refactor(coderd/x/chatd): insert chats directly as pending on creation (#23888 ) Previously, `CreateChat` inserted the `chats` row with the DB default status (`waiting`), then updated it to `pending` in the same transaction via `setChatPendingWithStore`. This wasted two extra queries per chat creation (`GetChatByID` + `UpdateChatStatus`) and rewrote the same row immediately after inserting it. Now `CreateChat` passes the status directly to `InsertChat`, so the row is written once in its final create-time state. The `setChatPendingWithStore` helper is removed entirely. `InsertChat` now requires an explicit `status` parameter at all callsites instead of relying on a DB column default. ## Motivation On an experimental branch we're trialing firing all chatd notifications from plpgsql triggers. The old two-step insert made that awkward: in an `AFTER INSERT` trigger, `NEW` only contained the insert-time row (`waiting`), not the final committed state (`pending`). To emit the correct event payload the trigger had to be deferred and re-read the row from `chats` at commit time. With this change, `NEW` already contains the correct row to publish — no deferred trigger, no extra `SELECT`, simpler and cheaper trigger logic. That said, this seems like a worthwhile change regardless of the trigger experiment: writing the final row state once removes unnecessary DB work on every chat creation and makes the create path easier to reason about.	2026-04-02 14:13:51 +11:00
Ethan	13dfc9a9bb	test: harden chatd relay test setup (#23759 ) These chatd relay tests were seeding chats through `subscriber.CreateChat(...)`, which wakes the subscriber and can race local acquisition against the intended remote-worker setup. Seed waiting and remote-running chats directly in the database instead, and point the default OpenAI provider at a local safety-net server so accidental processing fails locally instead of reaching the live API. Closes https://github.com/coder/internal/issues/1430	2026-03-30 17:52:01 +11:00
Ethan	4d74603045	fix(coderd/x/chatd): respect provider Retry-After headers in chat retry loop (#23351 ) > PR Stack > 1. #23351 ← `#23282` (you are here) > 2. #23282 ← `#23275` > 3. #23275 ← `#23349` > 4. #23349 ← `main` --- ## Summary `chatretry.Retry()` used pure exponential backoff (1 s, 2 s, 4 s, …) and never consulted provider `Retry-After` headers. Fantasy's `ProviderError` carries `ResponseHeaders` including `Retry-After`, but `chaterror.Classify()` only parsed error text and silently dropped the structured transport metadata. This makes `Retry-After` a first-class signal in the classification → retry pipeline. <img width="853" height="346" alt="image" src="https://github.com/user-attachments/assets/65f012b6-8173-43d2-957e-ab9faddea525" /> ## Changes ### `coderd/chatd/chaterror/classify.go` - Added `RetryAfter time.Duration` field to `ClassifiedError` — a normalized minimum retry delay derived from provider response metadata. - `Classify()` now calls `extractProviderErrorDetails()` before falling back to text heuristics. Structured `ProviderError.StatusCode` takes priority over regex extraction. - `normalizeClassification()` preserves and clamps `RetryAfter`. ### `coderd/chatd/chaterror/provider_error.go` (new) Provider-specific extraction, isolated from the text-based classification logic: - `extractProviderErrorDetails()` unwraps `fantasy.ProviderError` from the error chain via `errors.As`. - `retryAfterFromHeaders()` parses headers in priority order: 1. `retry-after-ms` (OpenAI-specific, millisecond precision) 2. `retry-after` (standard HTTP — integer seconds or HTTP-date) - Case-insensitive header key lookup. ### `coderd/chatd/chatretry/chatretry.go` - `effectiveDelay(attempt, classified)` computes `max(Delay(attempt), classified.RetryAfter)` — the provider hint acts as a floor without weakening the local exponential backoff. - `Retry()` now uses `effectiveDelay` and passes the effective delay to both `onRetry(...)` and the sleep timer, so downstream payloads, logs, and the frontend countdown stay aligned automatically. ### Tests - `classify_test.go`: Structured provider status + `Retry-After` extraction, `retry-after-ms` priority, HTTP-date parsing, invalid header fallback, `WithProvider` preservation. - `chatretry_test.go`: Retry-after-as-floor semantics — longer hint wins, shorter hint keeps base delay. ## Design notes - No SDK/API/frontend changes needed.* `codersdk.ChatStreamRetry` already carries `DelayMs` and `RetryingAt`, and the frontend already consumes them. The fix is purely in the server-side delay computation. - Existing retryability rules unchanged. This fixes when we sleep, not whether an error is retryable. - Provider hint is a floor: `max(baseDelay, RetryAfter)` ensures we never retry earlier than the provider asks, and never weaken our own backoff curve.	2026-03-27 01:20:46 +11:00
Ethan	70f031d793	feat(coderd/chatd): structured chat error classification and retry hardening (#23275 ) > PR Stack > 1. #23351 ← `#23282` > 2. #23282 ← `#23275` > 3. #23275 ← `#23349` (you are here) > 4. #23349 ← `main` --- ## Summary Extracts a structured error classification subsystem for agent chat (`chatd`) so that retry and error payloads carry machine-readable metadata — error kind, provider name, HTTP status code, and retryability — instead of raw error strings. This is the backend half of the error-handling work. The frontend counterpart is in #23282. ## Changes ### New package: `coderd/chatd/chaterror/` Canonical error classification — extracts error kind, provider, status code, and user-facing message from raw provider errors. One source of truth that drives both retry policy and stream payloads. - `kind.go`: Error kind enum (`rate_limit`, `timeout`, `auth`, `config`, `overloaded`, `unknown`). - `signals.go`: Signal extraction — parses provider name, HTTP status code, and retryability from error strings and wrapped types. - `classify.go`: Classification logic — maps extracted signals to an error kind. - `message.go`: User-facing message templates keyed by kind + signals. - `payload.go`: Projectors that build `ChatStreamError` and `ChatStreamRetry` payloads from a classified error. ### Modified - `codersdk/chats.go`: Added `Kind`, `Provider`, `Retryable`, `StatusCode` fields to `ChatStreamError` and `ChatStreamRetry`. - `coderd/chatd/chatretry/`: Thinned to retry-policy only; classification logic moved to `chaterror`. - `coderd/chatd/chatloop/`: Added per-attempt first-chunk timeout (60 s) via `guardedStream` wrapper — produces retryable `startup_timeout` errors instead of hanging forever. - `coderd/chatd/chatd.go`: Publishes normalized retry/error payloads via `chaterror` projectors.	2026-03-25 13:47:54 +11:00
Cian Johnston	80a172f932	chore: move chatd and related packages to /x/ subpackage (#23445 ) - Moves `coderd/chatd/`, `coderd/gitsync/`, `enterprise/coderd/chatd/` under `x/` parent directories to signal instability - Adds `Experimental:` glue code comments in `coderd/coderd.go` > 🤖 This PR was created with the help of Coder Agents, and was reviewed by my human. 🧑‍💻	2026-03-23 17:34:43 +00:00

26 Commits