Commit Graph

5 Commits

Author SHA1 Message Date
Ethan a1e912a763 fix(chatd): deliver retry control events via pubsub (#23349)
> **PR Stack**
> 1. #23351 ← `#23282`
> 2. #23282 ← `#23275`
> 3. #23275 ← `#23349`
> 4. **#23349** ← `main` *(you are here)*

---

Retry events were published only to the local in-process stream via
`publishEvent()`. When pubsub is active, `Subscribe()`'s merge loop only
forwarded durable events (messages, status, errors) from pubsub
notifications,
so retry events were silently dropped for cross-replica subscribers.

This adds a `publishRetry()` helper that publishes both locally and via
pubsub,
and extends the `Subscribe()` notification handler to forward retry
events.

**Changes:**
- `coderd/pubsub/chatstreamnotify.go`: add `Retry` field to notify
message
- `coderd/chatd/chatd.go`: add `publishRetry()`, update `OnRetry`
callback,
  extend `Subscribe()` to forward `notify.Retry`
- `coderd/chatd/chatd_internal_test.go`: focused pubsub delivery test
- `enterprise/coderd/chatd/chatd_test.go`: cross-replica end-to-end test
2026-03-20 15:19:41 +00:00
Cian Johnston d8cad81ada fix(coderd/chatd): rate-limit stream drop WARN logs to avoid log spam (#23340)
- Rate-limit "chat stream buffer full" and "dropping chat stream event"
  WARN logs to at most once per 10s per chat.
- Intermediate drops not logged; WARN includes `dropped_count`.
- Per-chat tracking on `chatStreamState` using timestamp comparison
  against `quartz.Clock` — no global tickers, no new `Server` fields.
- Subscriber and buffer drop counters reset at all lifecycle boundaries.

> 🤖 This PR was created with the help of Coder Agents, and was reviewed
by my human. 🧑‍💻
2026-03-20 12:16:39 +00:00
Ethan cda460f5df perf(coderd/chatd): skip same-replica stream DB rereads (#23218)
## Problem

Scaletest follow-up storms showed that the chat stream path was doing a
same-replica DB reread for every durable message it had already
delivered locally.

In a 600-chat / 10-turn run, `/stream`-attributed
`GetChatMessagesByChatID` calls reached about 14.2k across 5,400
follow-up turns — roughly **2.63 rereads per turn**. The primary coderd
replicas saturated their DB pools at 60/60 open connections during the
storm window.

The root cause: when pubsub was active, `Subscribe()` suppressed local
durable `message` events and relied entirely on pubsub notify →
`GetChatMessagesByChatID` for catch-up. Same-replica subscribers paid
the full DB round-trip even though the persisting process was on the
same replica.

## Solution

Add a bounded per-chat **durable message cache** to `chatStreamState` so
that same-replica subscribers can catch up from memory instead of the
database.

### How it works

1. `publishMessage()` caches the SDK event in `chatStreamState` before
local fanout and pubsub notify.
2. `publishEditedMessage()` replaces the cache with only the edited
message, then publishes `FullRefresh`.
3. `Subscribe()` handles ordinary `AfterMessageID` notifies by first
consulting the per-chat durable cache and only falling back to
`GetChatMessagesByChatID` on cache miss.
4. `FullRefresh` always forces a DB reread (cache is bypassed).

### Safety properties

- If the cache misses (e.g. message expired or remote replica), the DB
catch-up still runs — no silent message loss.
- `FullRefresh` (edits) always rereads from the database.
- Remote replicas still use the pubsub + DB path unchanged.
- The cache is bounded (`maxDurableMessageCacheSize = 256`) and scoped
per chat — no unbounded memory growth.

## Impact

This change removes the entire same-replica portion of the stream
rereads. Based on the 600-chat follow-up run, the upper bound on saved
work is the same-replica share of about 14.2k `GetChatMessagesByChatID`
rereads, with the observed total stream reread rate at about 2.63
rereads per follow-up turn.
2026-03-19 14:02:00 +11:00
Ethan a33605df58 perf(coderd/chatd): reuse workspace context within a turn (#23145)
## Summary
- reuse workspace agent context within a single `runChat()` turn
- remove duplicate latest-build agent lookups between
`resolveInstructions()` and `getWorkspaceConn()`
- avoid the extra `GetWorkspaceAgentByID` fetch when the selected
`WorkspaceAgent` already has the needed metadata
- add focused internal tests for reuse and refresh-on-dial-failure

## Why
This came out of a 5000-chat / 10-turn scaletest on bravo against a
single workspace.

The run completed successfully, but coderd stayed DB-pool bound, and one
workspace-backed hot path stood out:
- `GetWorkspaceAgentsInLatestBuildByWorkspaceID ≈ 46.7k`
- `GetWorkspaceByID ≈ 48.0k`
- `GetWorkspaceAgentByID ≈ 2.2k`

Within one `runChat()` turn, chatd was rediscovering the same workspace
agent multiple times just to resolve instructions and open the workspace
connection.

## What this changes
This PR introduces a **turn-local** workspace context helper so a single
acquired turn can:
- resolve the selected workspace agent once
- reuse that agent for instruction resolution
- reuse the same `AgentConn` for workspace tools and reload/compaction

This stays turn-local only, so a later turn on another replica still
rebuilds fresh context from the DB.

## Expected impact
This is an incremental improvement, not a full fix.

It should reduce duplicated workspace-agent lookups and shave some DB
pressure from a hot path for workspace-backed chats, while preserving
multi-replica correctness.

## Testing
- `go test ./coderd/chatd/...`
- `golangci-lint run ./coderd/chatd/...`
2026-03-18 00:33:44 +11:00
Kyle Carberry 1c71fd69f6 fix: workspace auto-refresh during the chat flow (#22447) 2026-02-28 19:07:17 -05:00