fix: recover chatd from poisoned chain anchor on retry (#25097)

When OpenAI's Responses API returns `Previous response with id ... not
found` for a chained turn, classify it as a `ChainBroken` retry, clear
`previous_response_id`, exit chain mode, reload full history, and let
`chatretry` retry. This self-heals chats whose anchor was poisoned before
#25074, which stopped truncated streams from being persisted as a
successful turn with a stored response id.

The new state is exposed via the existing
`coderd_chatd_stream_retries_total` counter as a
`chain_broken="true"|"false"` label. Aggregating queries (`sum`, `rate`
over `provider`/`model`/`kind`) keep working without changes; raw-series
matchers without aggregation will now see two series per `(provider,
model, kind)` where they previously saw one. The metric is internal-only,
so the blast radius should be small, but dashboards that select series
by exact label matchers without aggregating will need an extra `sum` or
an explicit `chain_broken` selector.
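
To illustrate why aggregating queries are unaffected, here is a small sketch that models the series as plain Go maps rather than using the Prometheus client. The type and function names, and the sample label values in the usage note, are hypothetical. Summing away `chain_broken` yields one value per `(provider, model, kind)`, which is what a `sum by (provider, model, kind)` query would see.

```go
package main

// retryLabels models the full label set of one raw series after this change.
type retryLabels struct {
	Provider, Model, Kind, ChainBroken string
}

// aggKey is the label set that remains after aggregating chain_broken away.
type aggKey struct {
	Provider, Model, Kind string
}

// sumWithoutChainBroken collapses the raw series into one value per
// (provider, model, kind), ignoring the chain_broken label.
func sumWithoutChainBroken(series map[retryLabels]float64) map[aggKey]float64 {
	out := make(map[aggKey]float64)
	for labels, value := range series {
		key := aggKey{labels.Provider, labels.Model, labels.Kind}
		out[key] += value
	}
	return out
}
```

For example, two raw series that differ only in `chain_broken` (say `"false"` with 3 retries and `"true"` with 1) collapse to a single aggregated entry with value 4, while an unaggregated matcher would return both series.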

> 🤖 This PR was created with the help of Coder Agents, and was reviewed by a human 🧑‍💻
Cian Johnston
2026-05-11 17:43:40 +01:00
committed by GitHub
parent c2dfaa406a
commit e8508b2d90
8 changed files with 814 additions and 43 deletions
+1 -1
@@ -255,7 +255,7 @@ coderd_chatd_stream_buffer_events 0
 coderd_chatd_stream_buffer_size_max 0
 # HELP coderd_chatd_stream_retries_total Total LLM stream retries.
 # TYPE coderd_chatd_stream_retries_total counter
-coderd_chatd_stream_retries_total{provider="",model="",kind=""} 0
+coderd_chatd_stream_retries_total{provider="",model="",kind="",chain_broken=""} 0
 # HELP coderd_chatd_stream_subscribers Current number of chat stream subscribers across all chat streams.
 # TYPE coderd_chatd_stream_subscribers gauge
 coderd_chatd_stream_subscribers 0