coder/coderd/pubsub/chatstreamnotify.go
Ethan 70f031d793 feat(coderd/chatd): structured chat error classification and retry hardening (#23275)
> **PR Stack**
> 1. #23351 ← `#23282`
> 2. #23282 ← `#23275`
> 3. **#23275** ← `#23349` *(you are here)*
> 4. #23349 ← `main`

---

## Summary

Extracts a structured error classification subsystem for agent chat
(`chatd`) so that retry and error payloads carry machine-readable
metadata — error kind, provider name, HTTP status code, and retryability
— instead of raw error strings.

This is the **backend half** of the error-handling work. The frontend
counterpart is in #23282.

## Changes

### New package: `coderd/chatd/chaterror/`

Canonical error classification — extracts error kind, provider, status
code, and user-facing message from raw provider errors. One source of
truth that drives both retry policy and stream payloads.

- **`kind.go`**: Error kind enum (`rate_limit`, `timeout`, `auth`,
`config`, `overloaded`, `unknown`).
- **`signals.go`**: Signal extraction — parses provider name, HTTP
status code, and retryability from error strings and wrapped types.
- **`classify.go`**: Classification logic — maps extracted signals to an
error kind.
- **`message.go`**: User-facing message templates keyed by kind +
signals.
- **`payload.go`**: Projectors that build `ChatStreamError` and
`ChatStreamRetry` payloads from a classified error.
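
The signals-to-kind flow described above might look roughly like the sketch below. It is illustrative only: the `Signals` struct, the `Classify` and `Retryable` functions, and the specific status-code mapping are assumptions made for this example, not the actual `chaterror` API. Only the kind names come from the list above, and `config` detection is left out because the commit message does not say what triggers it.

```go
package main

import (
	"fmt"
	"net/http"
)

// Kind mirrors the error kinds listed for kind.go.
type Kind string

const (
	KindRateLimit  Kind = "rate_limit"
	KindTimeout    Kind = "timeout"
	KindAuth       Kind = "auth"
	KindConfig     Kind = "config" // listed for completeness; not produced by this sketch
	KindOverloaded Kind = "overloaded"
	KindUnknown    Kind = "unknown"
)

// Signals is a hypothetical container for what signal extraction yields.
type Signals struct {
	Provider   string
	StatusCode int
	TimedOut   bool
}

// Classify maps extracted signals to a kind. The mapping below is an
// assumption chosen for illustration, not the real classification table.
func Classify(s Signals) Kind {
	switch {
	case s.TimedOut,
		s.StatusCode == http.StatusRequestTimeout,
		s.StatusCode == http.StatusGatewayTimeout:
		return KindTimeout
	case s.StatusCode == http.StatusTooManyRequests:
		return KindRateLimit
	case s.StatusCode == http.StatusUnauthorized,
		s.StatusCode == http.StatusForbidden:
		return KindAuth
	case s.StatusCode == http.StatusServiceUnavailable:
		return KindOverloaded
	default:
		return KindUnknown
	}
}

// Retryable reports whether a kind is worth retrying (assumed policy).
func Retryable(k Kind) bool {
	switch k {
	case KindRateLimit, KindTimeout, KindOverloaded:
		return true
	default:
		return false
	}
}

func main() {
	s := Signals{Provider: "example-provider", StatusCode: http.StatusTooManyRequests}
	k := Classify(s)
	fmt.Println(k, Retryable(k)) // prints "rate_limit true"
}
```

Keeping classification in one function means retry policy and stream payloads can both derive from the same `Kind` rather than re-parsing error strings.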

### Modified

- **`codersdk/chats.go`**: Added `Kind`, `Provider`, `Retryable`,
`StatusCode` fields to `ChatStreamError` and `ChatStreamRetry`.
- **`coderd/chatd/chatretry/`**: Thinned to retry-policy only;
classification logic moved to `chaterror`.
- **`coderd/chatd/chatloop/`**: Added a per-attempt first-chunk timeout
(60 s) via a `guardedStream` wrapper, which produces retryable
`startup_timeout` errors instead of hanging forever.
- **`coderd/chatd/chatd.go`**: Publishes normalized retry/error payloads
via `chaterror` projectors.
2026-03-25 13:47:54 +11:00


package pubsub

import (
	"fmt"

	"github.com/google/uuid"

	"github.com/coder/coder/v2/codersdk"
)

// ChatStreamNotifyChannel returns the pubsub channel for per-chat
// stream notifications. Subscribers receive lightweight notifications
// and read actual content from the database.
func ChatStreamNotifyChannel(chatID uuid.UUID) string {
	return fmt.Sprintf("chat:stream:%s", chatID)
}

// ChatStreamNotifyMessage is the payload published on the per-chat
// stream notification channel. Durable message content is still read
// from the database, while transient control events can be carried
// inline for cross-replica delivery.
type ChatStreamNotifyMessage struct {
	// AfterMessageID tells subscribers to query messages after this
	// ID. Set when a new message is persisted.
	AfterMessageID int64 `json:"after_message_id,omitempty"`

	// Status is set when the chat status changes. Subscribers use
	// this to update clients and to manage relay lifecycle.
	Status string `json:"status,omitempty"`

	// WorkerID identifies which replica is running the chat. Used
	// by enterprise relay to know where to connect.
	WorkerID string `json:"worker_id,omitempty"`

	// Retry carries a structured retry event for cross-replica live
	// delivery. This is transient stream state and is not read back
	// from the database.
	Retry *codersdk.ChatStreamRetry `json:"retry,omitempty"`

	// ErrorPayload carries a structured error event for cross-replica
	// live delivery. Keep Error for backward compatibility with older
	// replicas during rolling deploys.
	ErrorPayload *codersdk.ChatStreamError `json:"error_payload,omitempty"`

	// Error is the legacy string-only error payload kept for mixed-
	// version compatibility during rollout.
	Error string `json:"error,omitempty"`

	// QueueUpdate is set when the queued messages change.
	QueueUpdate bool `json:"queue_update,omitempty"`

	// FullRefresh signals that subscribers should re-fetch all
	// messages from the beginning (e.g. after an edit that
	// truncates message history).
	FullRefresh bool `json:"full_refresh,omitempty"`
}
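
The coexistence of `ErrorPayload` and the legacy `Error` string suggests a subscriber-side fallback along the lines of the sketch below: prefer the structured payload, and synthesize a minimal one when only the legacy string is present. This is a guess at how a consumer might handle mixed-version replicas, not code from this repository. The `chatStreamError` type is a local stand-in for `codersdk.ChatStreamError`, its JSON field names and `Message` field are assumptions, and `resolveError` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// chatStreamError is a local stand-in for codersdk.ChatStreamError.
// Field names and JSON tags here are assumptions for illustration.
type chatStreamError struct {
	Message    string `json:"message"`
	Kind       string `json:"kind,omitempty"`
	Provider   string `json:"provider,omitempty"`
	Retryable  bool   `json:"retryable,omitempty"`
	StatusCode int    `json:"status_code,omitempty"`
}

// notifyMessage copies only the two error fields of
// ChatStreamNotifyMessage, using the same JSON keys as the original.
type notifyMessage struct {
	ErrorPayload *chatStreamError `json:"error_payload,omitempty"`
	Error        string           `json:"error,omitempty"`
}

// resolveError returns the structured error if one was published,
// falls back to wrapping the legacy string, and returns nil when the
// notification carries no error event at all.
func resolveError(raw []byte) (*chatStreamError, error) {
	var msg notifyMessage
	if err := json.Unmarshal(raw, &msg); err != nil {
		return nil, err
	}
	if msg.ErrorPayload != nil {
		return msg.ErrorPayload, nil
	}
	if msg.Error != "" {
		// An older replica only knows the string form.
		return &chatStreamError{Message: msg.Error, Kind: "unknown"}, nil
	}
	return nil, nil
}

func main() {
	legacy := []byte(`{"error":"provider request failed"}`)
	e, err := resolveError(legacy)
	if err != nil {
		panic(err)
	}
	fmt.Println(e.Kind, e.Message) // prints "unknown provider request failed"
}
```

A publisher on a newer replica would presumably set both fields, so that older subscribers still see the string while newer ones get the structured form.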