coder

mirror of https://github.com/coder/coder.git synced 2026-06-05 22:18:20 +00:00

Author	SHA1	Message	Date
Cian Johnston	e8508b2d90	fix: recover chatd from poisoned chain anchor on retry (#25097) When OpenAI's Responses API returns `Previous response with id ... not found` for a chained turn, classify it as a `ChainBroken` retry, clear `previous_response_id`, exit chain mode, reload full history, and let `chatretry` retry. Self-heals chats whose anchor was poisoned before #25074 stopped truncated streams from being persisted as a successful turn with a stored response id. The new state is exposed via the existing `coderd_chatd_stream_retries_total` counter as a `chain_broken="true"|"false"` label. Aggregating queries (`sum`, `rate` over `provider`/`model`/`kind`) keep working without changes; raw-series matchers without aggregation will now see two series per `(provider, model, kind)` where they previously saw one. The metric is internal-only so the blast radius should be small, but if you have dashboards that index by exact label matchers without aggregation they will need an extra `sum` or an explicit `chain_broken` selector. > 🤖 This PR was created with the help of Coder Agents, and was reviewed by a human 🧑‍💻	2026-05-11 17:43:40 +01:00
Ethan	46a60e6d5d	refactor: move chat error kinds into codersdk (#24955) Moves the chat error kind taxonomy from `coderd/x/chatd/chaterror` into `codersdk.ChatErrorKind` and types `ChatError.Kind` / `ChatStreamRetry.Kind` so generated TypeScript exposes an SDK-owned union, including `usage_limit`. Backend chat classification now references the SDK constants directly while preserving the existing JSON string values. Keeps chat usage-limit admission failures on their existing 409 response shape. The frontend maps structured usage-limit responses to the SDK-owned `usage_limit` kind, uses generated `TypesGen.ChatErrorKind` directly, and removes the local string union and alias.	2026-05-06 11:57:48 +10:00
Cian Johnston	72e3ae9c5f	feat: add chatd tool call error metrics and logging (#24559) - Add `coderd_chatd_tool_errors_total` Prometheus counter (labels: provider, model, tool_name) - Log tool call errors at warn level with correlation fields: chat_id, owner_id, organization_id, workspace_id, agent_id, parent_chat_id, trigger_message_id, tool_name, tool_call_id, provider, model - Thread enriched logger from chatd.go into chatloop via `RunOptions.Logger` - Remove squashing of all MCP tool calls to the `mcp` bucket > 🤖	2026-04-22 16:19:56 +00:00
Cian Johnston	4b585465b8	feat: label chatd metrics by model, add stream-state diagnostics (#24475) Adds production-observability metrics to coderd/x/chatd/ for model-level correlation and a chatStreams memory-leak investigation. - Label per-request chatd metrics (steps_total, message_count, prompt_size_bytes, tool_result_size_bytes, ttft_seconds, compaction_total) with `model` and enrich the per-turn logger with provider/model. - Add `coderd_chatd_stream_retries_total{provider, model, kind}` counter incremented in chatloop before OnRetry. - Register a prometheus.Collector exposing `streams_active`, `stream_buffer_size_max`, `stream_buffer_events`, `stream_subscribers` from p.chatStreams. - Add `coderd_chatd_stream_buffer_dropped_total` counter, incremented per publishToStream drop independently of the existing log-rate-limited bufferDropCount. - Snapshot logger/model before the title-generation goroutine to avoid a data race with the logger/model rebind below it. > 🤖	2026-04-17 16:16:30 +01:00
blinkagent[bot]	e996f6d44b	chore: increase coderd_chatd_message_count histogram max bucket to 1024 (#24409) The `coderd_chatd_message_count` histogram's current max bucket of 128 is being hit in production. This increases the exponential bucket count from 8 to 11, extending coverage from `1..128` to `1..1024`. Before: `1, 2, 4, 8, 16, 32, 64, 128`. After: `1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024`. Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>	2026-04-16 09:43:54 +01:00
Cian Johnston	d7439a9de0	feat: add Prometheus metrics for chatd subsystem (#24371) Adds 7 Prometheus metrics to the chatd subsystem and introduces typed `ActivityBumpReason` for deadline bump attribution. Metrics (name, type, labels): `coderd_chatd_chats`, Gauge, `state` (streaming, waiting); `coderd_chatd_message_count`, Histogram, `provider`; `coderd_chatd_prompt_size_bytes`, Histogram, `provider`; `coderd_chatd_tool_result_size_bytes`, Histogram, `provider`, `tool_name`; `coderd_chatd_ttft_seconds`, Histogram, `provider`; `coderd_chatd_compaction_total`, Counter, `provider`, `result`; `coderd_chatd_steps_total`, Counter, `provider`. > 🤖	2026-04-15 19:53:10 +01:00
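
The bucket change in e996f6d44b can be checked with a few lines of arithmetic. The sketch below is not code from the repository: `exponential_buckets` is a hypothetical helper that mimics the shape of exponential histogram buckets, and the start value of 1 and factor of 2 are assumptions inferred from the bucket lists quoted in the commit message.

```python
def exponential_buckets(start: float, factor: float, count: int) -> list[float]:
    """Return `count` upper bounds, each `factor` times the previous one.

    Hypothetical helper for illustration only; not taken from the repository.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    if start <= 0:
        raise ValueError("start must be positive")
    if factor <= 1:
        raise ValueError("factor must be greater than 1")
    buckets = []
    value = start
    for _ in range(count):
        buckets.append(value)
        value *= factor
    return buckets


# Assumed parameters: start=1 and factor=2, inferred from the commit message.
before = exponential_buckets(1, 2, 8)   # bucket count before the change
after = exponential_buckets(1, 2, 11)   # bucket count after the change

print(before[-1], after[-1])
```

Three extra doublings raise the largest finite bucket from 128 to 1024, so observations between 129 and 1024 messages no longer all fall into the overflow bucket.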

6 Commits
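
The note in e8508b2d90 about the new `chain_broken` label can be illustrated with a small model of label aggregation. This is a minimal sketch, not a real query engine: the sample values, the model name `example-model`, and the `timeout` and `rate_limit` kinds are made up for illustration, and `sum_by` is a hypothetical helper that only imitates summing a counter over a subset of labels.

```python
from collections import defaultdict

# Hypothetical samples: (provider, model, kind, chain_broken) -> counter value.
samples = {
    ("openai", "example-model", "timeout", "false"): 5,
    ("openai", "example-model", "timeout", "true"): 2,
    ("openai", "example-model", "rate_limit", "false"): 3,
}


def sum_by(series, keep):
    """Sum values grouped by the label positions listed in `keep`."""
    totals = defaultdict(int)
    for labels, value in series.items():
        key = tuple(labels[i] for i in keep)
        totals[key] += value
    return dict(totals)


# Aggregating over (provider, model, kind) folds both chain_broken values together.
aggregated = sum_by(samples, keep=(0, 1, 2))

# A raw matcher on (provider, model, kind) alone now matches two series.
raw_matches = [
    labels for labels in samples
    if labels[:3] == ("openai", "example-model", "timeout")
]

print(aggregated[("openai", "example-model", "timeout")], len(raw_matches))
```

Under these assumptions the aggregated total is the same whether or not the label exists, while an unaggregated selector sees one series per `chain_broken` value, which matches the commit's advice to add a `sum` or an explicit `chain_broken` selector.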