mirror of
https://github.com/coder/coder.git
synced 2026-06-03 04:58:23 +00:00
cda460f5df
## Problem

Scaletest follow-up storms showed that the chat stream path was doing a same-replica DB reread for every durable message it had already delivered locally.

In a 600-chat / 10-turn run, `/stream`-attributed `GetChatMessagesByChatID` calls reached about 14.2k across 5,400 follow-up turns, or roughly **2.63 rereads per turn**. The primary coderd replicas saturated their DB pools at 60/60 open connections during the storm window.

The root cause: when pubsub was active, `Subscribe()` suppressed local durable `message` events and relied entirely on pubsub notify → `GetChatMessagesByChatID` for catch-up. Same-replica subscribers paid the full DB round-trip even though the persisting process was on the same replica.

## Solution

Add a bounded per-chat **durable message cache** to `chatStreamState` so that same-replica subscribers can catch up from memory instead of the database.

### How it works

1. `publishMessage()` caches the SDK event in `chatStreamState` before local fanout and pubsub notify.
2. `publishEditedMessage()` replaces the cache with only the edited message, then publishes `FullRefresh`.
3. `Subscribe()` handles ordinary `AfterMessageID` notifies by first consulting the per-chat durable cache and only falling back to `GetChatMessagesByChatID` on cache miss.
4. `FullRefresh` always forces a DB reread (cache is bypassed).

### Safety properties

- If the cache misses (e.g. message expired or remote replica), the DB catch-up still runs, so there is no silent message loss.
- `FullRefresh` (edits) always rereads from the database.
- Remote replicas still use the pubsub + DB path unchanged.
- The cache is bounded (`maxDurableMessageCacheSize = 256`) and scoped per chat, so memory growth stays bounded.

## Impact

This change removes the entire same-replica portion of the stream rereads.
Based on the 600-chat follow-up run, the upper bound on saved work is the same-replica share of about 14.2k `GetChatMessagesByChatID` rereads, with the observed total stream reread rate at about 2.63 rereads per follow-up turn.
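
For readers who want a concrete picture of the cache-then-fallback behavior described under "How it works", here is a minimal Go sketch. It is not the code from this change: `cachedMessage`, `durableMessageCache`, and the methods `add`, `replaceWith`, and `after` are hypothetical names, and the hit rule (the cached window must start at or before the subscriber's cursor and contain at least one newer message) is an assumption chosen to keep the sketch conservative. Only `maxDurableMessageCacheSize = 256` comes from the description above. Locking is omitted; a real implementation would presumably guard the cache with whatever synchronization `chatStreamState` already uses.

```go
package main

import "fmt"

// maxDurableMessageCacheSize mirrors the bound named in the description.
const maxDurableMessageCacheSize = 256

// cachedMessage is a hypothetical stand-in for the cached SDK event.
type cachedMessage struct {
	ID   int64
	Body string
}

// durableMessageCache is a hypothetical bounded per-chat cache.
// Entries are kept in ascending ID order. It is not safe for
// concurrent use as written.
type durableMessageCache struct {
	limit int
	msgs  []cachedMessage
}

func newDurableMessageCache(limit int) *durableMessageCache {
	return &durableMessageCache{limit: limit}
}

// add appends a newly persisted message and evicts the oldest
// entries once the cache exceeds its limit.
func (c *durableMessageCache) add(m cachedMessage) {
	c.msgs = append(c.msgs, m)
	if over := len(c.msgs) - c.limit; over > 0 {
		// Copy so the evicted entries can be garbage collected.
		c.msgs = append([]cachedMessage(nil), c.msgs[over:]...)
	}
}

// replaceWith models the edit path: the cache is reset to hold
// only the edited message.
func (c *durableMessageCache) replaceWith(m cachedMessage) {
	c.msgs = []cachedMessage{m}
}

// after returns cached messages with ID > afterID. The boolean is
// false on a miss, in which case the caller should fall back to
// the database. Assumed miss conditions: the cache is empty, its
// oldest entry is newer than the cursor (messages in between may
// have been evicted), or it holds nothing newer than the cursor
// (the message was likely persisted by another replica).
func (c *durableMessageCache) after(afterID int64) ([]cachedMessage, bool) {
	if len(c.msgs) == 0 || c.msgs[0].ID > afterID {
		return nil, false
	}
	var out []cachedMessage
	for _, m := range c.msgs {
		if m.ID > afterID {
			out = append(out, m)
		}
	}
	return out, len(out) > 0
}

func main() {
	cache := newDurableMessageCache(maxDurableMessageCacheSize)
	for id := int64(1); id <= 3; id++ {
		cache.add(cachedMessage{ID: id, Body: fmt.Sprintf("message %d", id)})
	}
	if msgs, ok := cache.after(1); ok {
		fmt.Println("cache hit:", len(msgs), "messages")
	} else {
		fmt.Println("cache miss: fall back to the database")
	}
}
```

Treating "nothing newer than the cursor" as a miss rather than an empty hit is deliberate in this sketch: a notify that the cache cannot explain is routed to the database, which matches the safety property that a miss never causes silent message loss.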