mirror of
https://github.com/coder/coder.git
synced 2026-06-06 14:38:23 +00:00
8a2f28fa6a2ea8bf755dd7836cf331eead9628d2
10 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
203b0a9df8 |
refactor(coderd/x/chatd): extract OpenAI logic into chatopenai package (#24788)
Extracts OpenAI-specific logic from `coderd/x/chatd` into `coderd/x/chatd/chatopenai` so the main chat path no longer references `fantasyopenai` directly for chain mode info, response IDs, web search tooling, or option mapping.

Structural refactor. The only deliberate behavioral narrowing is consolidating Responses store checks and related keyed option or metadata access on `opts[fantasyopenai.Name]`. That is documented by `TestIsResponsesStoreEnabledIgnoresMalformedNonOpenAIKey` and is unreachable in production where Responses options always live under `fantasyopenai.Name`.

Summary:

- Moves OpenAI Responses chain mode info, response ID helpers, web search tool construction, and provider option conversion into `chatopenai`.
- Keeps Anthropic, Google, OpenRouter, and Vercel provider branches as thin, existing code paths.
- `chatopenai` only imports `chatprompt` from chatd subpackages. It does not import `chatd`, `chatloop`, `chatprovider`, or `chaterror`.
- Follow-up review fixes align helper names, keyed provider option access, map cloning behavior, and PR documentation with the extracted package boundary.
- Final sweep trims unused chain-mode state, removes a duplicate store-check test case, drops an unused provider-tool parameter, and shares the chat-message test helper through `chattest`.

> Mux is updating this PR on Mike's behalf. |
||
|
|
17409a515c |
feat(coderd): wire advisor runtime to admin config (#24622)
## Summary

Wire the advisor runtime into `chatd`: read the admin config on every `runChat`, gate tool registration and system-prompt guidance on a **single eligibility boolean**, register the `advisor` built-in tool, and apply the exclusive-tool policy from PR 1.

## Motivation

This is the integration seam where PRs 1–3 come together into an actual user-visible feature. Gating is deliberately root-chat-only for the initial rollout; child/sub-agent chats still do not see the tool or the guidance block.

## Changes

### `coderd/x/chatd/chatd.go`

- `loadAdvisorConfig(ctx, logger)` reads the admin config (from PR 3) on each run. If `ModelConfigID` is set, it resolves the override model via `configCache.ModelConfigByID`; otherwise it falls back to the outer chat's model and provider options. Reasoning effort is plumbed into provider options via `applyAdvisorReasoningEffort`.
- One computed `advisorEligible` boolean drives **both** tool registration (after skill tools, before MCP tools) and guidance injection via `chatprompt.InsertSystem(prompt, chatadvisor.ParentGuidanceBlock)`.
- `setAdvisorPromptSnapshot` closures capture the outer prompt state at the right points in the lifecycle (`renderPlanPathPrompt`, `ReloadMessages`, `PrepareMessages`) so the advisor handoff uses the same context the outer model saw.
- `ExclusiveToolNames["advisor"] = true` is passed to `chatloop.Run()` so mixed batches are rejected cleanly (PR 1 machinery).
- `builtinToolNames["advisor"] = true` so metrics keep advisor distinct from the generic `mcp` label.

### Child-chat guard

- Child/sub-agent chats deliberately do not see the advisor tool or guidance block, to avoid recursion/cost blowups until the pattern is proven. This is covered by `TestAdvisorGating_ChildChat` (currently skipped pending a rewrite against the new `plan`/`explore` subagent infrastructure; core gating logic is still exercised by `TestAdvisorGating_Disabled` and `TestAdvisorGating_RootChat`).
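The single-boolean gating described above can be sketched as follows. This is a minimal illustration, not the code in `chatd.go`: the `runInputs` struct, the `assemble` helper, the guidance text, and the `read_skill` and `mcp_search` tool names are all hypothetical stand-ins, and the real prompt insertion goes through `chatprompt.InsertSystem`.

```go
package main

import (
	"fmt"
	"strings"
)

// guidanceBlock stands in for chatadvisor.ParentGuidanceBlock; the text is made up.
const guidanceBlock = "<advisor-guidance>Consult advisor before acting.</advisor-guidance>"

// runInputs is a hypothetical bundle of the facts that decide eligibility.
type runInputs struct {
	AdvisorEnabled bool // admin config switch
	IsRootChat     bool // false for child/sub-agent chats
	ModelResolved  bool // a usable model/provider exists
}

// advisorEligible computes the single boolean that gates everything else.
func advisorEligible(in runInputs) bool {
	return in.AdvisorEnabled && in.IsRootChat && in.ModelResolved
}

// assemble builds the tool list and system prompt for one run. The advisor
// tool is placed after skill tools and before MCP tools, and the guidance
// block is appended only when the tool is registered.
func assemble(in runInputs, systemPrompt string, skillTools, mcpTools []string) ([]string, string) {
	tools := append([]string{}, skillTools...)
	if advisorEligible(in) {
		tools = append(tools, "advisor")
		systemPrompt += "\n" + guidanceBlock
	}
	tools = append(tools, mcpTools...)
	return tools, systemPrompt
}

func main() {
	root := runInputs{AdvisorEnabled: true, IsRootChat: true, ModelResolved: true}
	child := runInputs{AdvisorEnabled: true, IsRootChat: false, ModelResolved: true}

	for _, in := range []runInputs{root, child} {
		tools, prompt := assemble(in, "You are a coding agent.", []string{"read_skill"}, []string{"mcp_search"})
		fmt.Println(tools, strings.Contains(prompt, "<advisor-guidance>"))
	}
}
```

Because both effects hang off one `if`, the prompt cannot mention a tool that was not registered, which is the property the gating tests check.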
## Stack context

This is **PR 4 of 6** in the advisor feature stack. It depends on PRs 1–3.

## Scope / non-goals

- No frontend changes. The feature is invocable via the backend but renders generically until PR 5.
- No separate provider runner; the nested advisor call reuses the existing model/provider path.
- No DB migration.

## Validation

- `go test ./coderd/x/chatd/... -run TestAdvisor`
- `go build ./...`
- `make lint`

---

<details>
<summary>📋 Implementation Plan (shared across the advisor stack)</summary>

# Plan: Add a Mux-style advisor tool to coder agents/chatd

## Outcome

Add a first-class `advisor` tool to agent chats in `coderd/x/chatd` that feels native to Coder:

- it is a built-in server-side tool, not an MCP/dynamic-tool workaround;
- it performs a nested **tool-less** model call for strategic advice;
- it is exposed only when eligible, and the prompt mentions it only when it is actually available;
- it is treated as a **planning-only** tool so it does not run alongside action tools in the same batch;
- it tracks usage/cost separately enough for operators to reason about it;
- it has a minimally polished UI in the Agents page;
- and it ships with explicit dogfooding evidence, including screenshots and repro videos.

## Design decisions to lock before coding

1. **Primary architecture:** native built-in tool in `chattool/`, backed by a small `chatadvisor` package.
2. **Nested model execution:** reuse chatd's existing model/provider stack for a one-step, tool-less advisor call rather than inventing a new provider pathway.
3. **Execution policy:** treat `advisor` as an exclusive/planning-only tool; mixed batches must return structured policy errors and force the model to retry cleanly.
4. **Availability:** initial rollout is for root agent chats only; disable for child/sub-agent chats until recursion/cost policy is proven.
5. **Prompt sync:** use one eligibility boolean to drive both tool registration and advisor guidance injection.
6. **Persistence/cost split:** MVP should keep advisor usage visible in result metadata and server metrics; only add DB schema if product/billing explicitly needs queryable advisor-specific cost.
7. **UI scope:** generic tool rendering is an acceptable temporary milestone during backend bring-up, but the release candidate should include a dedicated lightweight advisor renderer.

## Delivery model

The work should be executed as coordinated workstreams with one integration owner and parallel contributors for low-conflict areas. The integration owner should own `coderd/x/chatd/chatd.go` because prompt assembly, tool registration, and model resolution all converge there.

## Detailed workstreams

### Repo evidence used for this plan

<details>
<summary>Mux reference and current chatd seams</summary>

**Mux reference implementation**

- `src/node/services/tools/advisor.ts`: native advisor tool implementation.
- `src/common/constants/advisor.ts`: advisor prompt/constants and truncation policy.
- `src/common/utils/tools/tools.ts`: conditional tool registration.
- `src/node/services/streamContextBuilder.ts`: injects advisor guidance only when the tool is available.

**Current chatd seams**

- `coderd/x/chatd/chatd.go`
  - `processChat()`: tool assembly, prompt assembly, and chatloop invocation.
  - `resolveChatModel()`: current model/provider/key resolution seam.
  - `type Config struct`: server-level chatd configuration surface.
- `coderd/x/chatd/chatloop/chatloop.go`
  - `Run()`: main streaming/model loop.
  - `executeTools()`: built-in tool execution/batching seam.
- `coderd/x/chatd/chattool/`: built-in tool implementations.
- `site/src/pages/AgentsPage/components/ChatElements/tools/Tool.tsx`: tool renderer dispatch.
- `site/src/pages/AgentsPage/components/ChatConversation/messageParsing.ts` and `ConversationTimeline.tsx`: tool/result merge and rendering flow.

</details>

### Workstream map and ownership

| Workstream | Primary owner | Main files | Can run in parallel? | Done when |
|---|---|---|---|---|
| 0. Integration + gating | Integration lead | `coderd/x/chatd/chatd.go` | No; central merge lane | Tool registration, prompt sync, and model selection are wired together |
| 1. Advisor runtime + tool | Backend agent | new `coderd/x/chatd/chatadvisor/`, new `coderd/x/chatd/chattool/advisor.go` | Yes | Tool can perform a tool-less advisor call in memory and return structured results |
| 2. Planning-only execution policy | Chatloop agent | `coderd/x/chatd/chatloop/chatloop.go`, related tests | Yes | Mixed `advisor` + action-tool batches are rejected cleanly and deterministically |
| 3. Metrics/usage/config | Backend/telemetry agent | `chatd.go`, `chatloop/metrics.go`, optional config plumbing | Partially; coordinate with integration lead | Advisor usage is separately visible in metadata/metrics and limits are enforced |
| 4. Frontend rendering | Frontend agent | `site/.../tools/Tool.tsx`, new `AdvisorTool.tsx`, stories | Yes after result schema stabilizes | Advisor renders as a readable card and story tests pass |
| 5. Dogfood + QA evidence | QA agent | dev server, Storybook, dogfood output | After backend + UI are usable | Repro videos, screenshots, and a concise QA report exist |

### Parallelization rules

- **Do not split `coderd/x/chatd/chatd.go` across multiple execution agents without an integration lead.** That file owns prompt building, tool registration, model resolution, and cost persistence.
- Workstreams 1 and 2 can be developed in parallel and then stacked onto the integration branch.
- Workstream 4 should begin once the backend result schema is agreed on, even if the backend is still behind a feature flag.
- Any agent that needs to re-check Mux behavior should clone `coder/mux` into a temporary directory (for example, `$(mktemp -d)/mux`) and inspect it read-only; do not vendor or copy code from Mux directly.

## Phase 0: Preflight and guardrails

### Goals

- Align the team on the smallest shippable architecture.
- Prevent scope creep into MCP/dynamic-tool/sub-agent variants.
- Decide upfront what is MVP vs. follow-up.

### Tasks

1. **Confirm the MVP boundary.**
   - Ship a built-in advisor tool first.
   - Do **not** make MCP, dynamic tools, or sub-agents the primary implementation.
   - Do **not** add transient streaming phases in the first backend PR unless they fall out almost for free.
2. **Confirm local workflow hygiene before coding.**
   - Ensure the repo is using the project git hooks from `scripts/githooks`.
   - Do not bypass hooks with `--no-verify`.
   - Use `./scripts/develop.sh` for the full dev server rather than manual build/run commands.
3. **Lock the model-selection policy.**
   - **Recommended MVP:** advisor uses the same resolved provider/model/cost config as the current chat, with advisor-specific max-output and usage caps.
   - **Follow-up only if required:** add a separate `AdvisorModelConfigID`-style override that resolves through the existing `configCache`/model-config path. Do not invent a new free-form `provider:model` parser if chatd already stores provider/model separately.
4. **Lock the persistence policy.**
   - **Recommended MVP:** no DB migration. Persist advisor-visible metadata in the tool result and record separate metrics in memory/Prometheus.
   - **Only if product/billing explicitly asks for queryable advisor cost:** add a later DB migration or usage table, following the normal `queries/*.sql` + `make gen` workflow.
5. **Create an execution ADR note in the work item or tracking doc.**
   - Capture: built-in tool, tool-less nested call, root-chat-only rollout, exclusive execution policy, MVP no-DB-migration default.

### Quality gate

- Everyone on the team can state the same answers to these questions:
  - Is advisor a built-in tool? **Yes.**
  - Can advisor run with action tools in the same batch? **No.**
  - Does advisor get tools of its own? **No.**
  - Is a DB migration required for MVP? **No, unless billing insists.**

## Phase 1: Build the advisor runtime and tool wrapper

### Goals

Create the core advisor implementation in a way that is easy to test and keeps `chattool/` thin.

### Files to add

- `coderd/x/chatd/chatadvisor/types.go`
- `coderd/x/chatd/chatadvisor/guidance.go`
- `coderd/x/chatd/chatadvisor/handoff.go`
- `coderd/x/chatd/chatadvisor/runtime.go`
- `coderd/x/chatd/chatadvisor/runner.go`
- `coderd/x/chatd/chattool/advisor.go`

### Responsibilities by file

1. **`types.go`**
   - Define the input/result schema used by the tool and UI.
   - Keep the result shape close to Mux so the UI and model both have predictable cases.
   - Recommended result variants:
     - `advice`
     - `limit_reached`
     - `error`

   Recommended shape:

   ```go
   type AdvisorArgs struct {
   	Question string `json:"question"`
   }

   type AdvisorResult struct {
   	Type          string              `json:"type"`
   	Advice        string              `json:"advice,omitempty"`
   	Error         string              `json:"error,omitempty"`
   	AdvisorModel  string              `json:"advisor_model,omitempty"`
   	RemainingUses int                 `json:"remaining_uses,omitempty"`
   	Usage         *AdvisorUsageResult `json:"usage,omitempty"`
   }
   ```

2. **`guidance.go`**
   - Hold two strings:
     - the nested advisor system prompt;
     - the parent-agent guidance block to inject into the outer system prompt.
   - The nested advisor prompt must say, in plain language:
     - you are advising the parent agent;
     - you do not address the end user directly;
     - you do not claim actions happened;
     - you return concise strategic guidance and tradeoffs.
3. **`runtime.go`**
   - Define the per-run runtime state.
   - Recommended fields:
     - resolved model + model config;
     - provider keys/options reused from the outer chat;
     - `MaxUsesPerRun`;
     - `MaxOutputTokens`;
     - atomic/current call counter;
     - callback(s) to obtain the current prompt snapshot and current-step snapshot;
     - optional metrics/usage hook.
   - Add fail-fast validation for impossible config: nil model, non-positive limits, empty prompt builders, etc.
4. **`handoff.go`**
   - Build the advisor handoff message from:
     - the explicit question;
     - the exact prompt/messages the parent model just used;
     - the current step's text/reasoning snapshot, if available;
     - the most recent relevant tool outputs, if they are already in the prompt snapshot.
   - **Important:** use the already-prepared outer prompt tail, not a fresh DB reload. That keeps the advisor aligned with compaction and the exact context the outer model saw.
   - Apply hard truncation budgets with recent-context bias.
5. **`runner.go`**
   - Execute the nested advisor call.
   - **Recommended implementation:** call `chatloop.Run()` in an in-memory, one-step mode:
     - `Tools: nil`
     - `ProviderTools: nil`
     - `MaxSteps: 1`
     - `PersistStep`: capture the assistant output in memory instead of writing DB rows
   - Reuse the existing provider/model/cost path instead of building a second provider runner.
   - Assert that no tool definitions are passed to the nested call.
6. **`chattool/advisor.go`**
   - Keep this file thin and consistent with other built-ins.
   - Responsibilities:
     - decode `AdvisorArgs`;
     - validate `Question` is non-empty and bounded;
     - call the `chatadvisor` runner;
     - return a structured tool response.

### Defensive programming requirements

- Assert `Question` is non-empty after trimming.
- Assert runtime limits are positive.
- Assert the nested advisor call runs with zero tools/provider tools.
- Assert `AdvisorResult.Type` is one of the known variants before returning.
- Assert remaining uses never goes negative.

### Acceptance criteria

- A unit test can call the advisor tool with a fake model and receive a stable `advice` result.
- The nested advisor call is impossible to run with tools accidentally attached.
- The core logic lives in `chatadvisor/`, not embedded inside `chatd.go`.
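The defensive requirements in Phase 1 (trimmed non-empty question, positive limits, a use counter that never goes negative, and one of three result variants) can be sketched in a few lines. This is only an illustration under assumptions: `Runtime`, `NewRuntime`, `reserve`, and the `ask` callback are hypothetical names, and `ask` stands in for the nested tool-less model call rather than anything in `chatloop`.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"sync/atomic"
)

// Result mirrors the three result variants named in the plan.
type Result struct {
	Type          string // "advice", "limit_reached", or "error"
	Advice        string
	Error         string
	RemainingUses int
}

// Runtime holds per-run advisor state. The ask callback is a stand-in for
// the nested, tool-less model call.
type Runtime struct {
	maxUses int64
	used    atomic.Int64
	ask     func(question string) (string, error)
}

// NewRuntime fails fast on impossible configuration.
func NewRuntime(maxUses int, ask func(string) (string, error)) (*Runtime, error) {
	if maxUses <= 0 {
		return nil, errors.New("advisor: max uses must be positive")
	}
	if ask == nil {
		return nil, errors.New("advisor: ask callback must not be nil")
	}
	return &Runtime{maxUses: int64(maxUses), ask: ask}, nil
}

// reserve claims one use with a compare-and-swap loop, so the counter never
// exceeds maxUses and the remaining count never goes negative.
func (r *Runtime) reserve() (remaining int, ok bool) {
	for {
		used := r.used.Load()
		if used >= r.maxUses {
			return 0, false
		}
		if r.used.CompareAndSwap(used, used+1) {
			return int(r.maxUses - used - 1), true
		}
	}
}

// Call validates the question, enforces the cap, and shapes the result.
func (r *Runtime) Call(question string) Result {
	q := strings.TrimSpace(question)
	if q == "" {
		// Invalid input does not consume a use (a choice made for this sketch).
		left := int(r.maxUses - r.used.Load())
		return Result{Type: "error", Error: "question must not be empty", RemainingUses: left}
	}
	left, ok := r.reserve()
	if !ok {
		return Result{Type: "limit_reached"}
	}
	advice, err := r.ask(q)
	if err != nil {
		return Result{Type: "error", Error: err.Error(), RemainingUses: left}
	}
	return Result{Type: "advice", Advice: advice, RemainingUses: left}
}

func main() {
	rt, err := NewRuntime(2, func(q string) (string, error) {
		return "consider: " + q, nil
	})
	if err != nil {
		panic(err)
	}
	for _, q := range []string{"  ", "split the PR?", "add a cache?", "one more?"} {
		fmt.Printf("%+v\n", rt.Call(q))
	}
}
```

Reserving a use before the model call means a failed call still counts against the cap; whether the real runtime does the same is not stated in the plan.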
## Phase 2: Wire advisor into chatd and keep prompt/tool availability in sync

### Goals

Register the tool in the right place, expose it only when eligible, and inject system guidance only when the tool is present.

### Files to modify

- `coderd/x/chatd/chatd.go`
- optionally a small helper file if `chatd.go` becomes too crowded

### Tasks

1. **Compute one eligibility boolean in `processChat()`.** Recommended inputs:
   - server-level advisor enabled flag;
   - root chat only (`chat.ParentChatID == uuid.Nil` or equivalent existing root/child check);
   - a usable resolved model/provider exists;
   - optional experiment/workspace/org gate if product wants staged rollout.
2. **Create the runtime once per outer chat run.**
   - Use the model/config/keys resolved by `resolveChatModel()`.
   - Reuse provider options from the current chat's `ChatModelCallConfig`.
   - Set `MaxUsesPerRun` and `MaxOutputTokens` from advisor config defaults.
3. **Register the tool in the built-in tool block.**
   - Insert after the skill tools and before MCP tools in `processChat()`.
   - Record `builtinToolNames["advisor"] = true` so metrics stay bounded.
4. **Inject advisor guidance into the outer system prompt using the same boolean.**
   - Use `chatprompt.InsertSystem()` in the same prompt assembly path that already injects user/system instructions.
   - Place the block near the existing instruction insertion, before plan-path/skill context blocks.
   - Wrap the guidance in an explicit tag like `<advisor-guidance>` so it is easy to spot in tests and future refactors.
5. **Keep advisor out of child chats for the first release.**
   - That avoids recursion/cost blowups with `spawn_agent` / `wait_agent` flows.
   - Document this explicitly in the rollout notes and tests.

### Acceptance criteria

- If advisor is disabled, neither the tool nor the prompt guidance appears.
- If advisor is enabled, both the tool and the prompt guidance appear.
- Root chats can use advisor; child chats cannot.
- Built-in tool names include `advisor` so metrics do not collapse it into the generic `mcp` label.

## Phase 3: Enforce planning-only execution policy in `chatloop`

### Goals

Prevent the model from calling `advisor` and action tools in the same execution batch.

### Files to modify

- `coderd/x/chatd/chatloop/chatloop.go`
- related chatloop tests

### Recommended implementation

Keep the MVP small; do **not** build a general policy engine yet.

1. Add a minimal field to `chatloop.RunOptions`, for example:

   ```go
   ExclusiveToolName *string
   ```

2. In `Run()` / `executeTools()`, detect the case where the exclusive tool appears in the same local-tool batch as any other locally executed tool.
3. When that happens, synthesize structured tool-result errors for the affected calls instead of executing anything in the batch.
   - `advisor` should receive a clear error like: _advisor must be called by itself before action tools_.
   - The sibling action tools should receive a paired policy error like: _this tool was skipped because advisor must run alone_.
4. Let the outer model see those tool errors and retry cleanly.
   - This is simpler and safer than partial execution or hidden deferral.
   - It preserves deterministic transcript history for debugging.
5. Pass the just-finished step snapshot into the tool execution context.
   - The advisor runtime should be able to see the current step's text/reasoning content, because that is often the best hint about what the outer model is trying to decide.

### Why this is the right fit

- It matches the intended semantics: advisor is consulted **before** taking action.
- It avoids subtle race conditions caused by concurrent built-in tool execution.
- It keeps the behavior easy to test with fake models.

### Acceptance criteria

- A model-emitted batch containing only `advisor` succeeds.
- A model-emitted batch containing `advisor` plus any other locally executed tool returns deterministic policy errors and executes nothing.
- Non-advisor tool execution stays unchanged for normal chats.

## Phase 4: Usage limits, metrics, and configuration

### Goals

Make advisor safe to operate without over-designing billing/storage in the first release.

### Files to modify

- `coderd/x/chatd/chatd.go`
- `coderd/x/chatd/chatloop/metrics.go` as needed
- `coderd/x/chatd/chatd.go` `Config` struct and constructor path
- optional follow-up config/db files only if a separate advisor model or persistent billing is required

### Tasks

1. **Add explicit server config knobs for MVP.** Recommended fields on `chatd.Config` or a nested advisor config struct:
   - `AdvisorEnabled bool`
   - `AdvisorMaxUsesPerRun int`
   - `AdvisorMaxOutputTokens int64`
2. **Track usage per outer run.**
   - Reset the counter for each `processChat()` invocation.
   - Return `remaining_uses` in the tool result.
   - Return `limit_reached` when the cap is exhausted.
3. **Expose advisor usage metadata in the tool result.**
   - Include model name and token/cost summary if available.
   - Use the same `callConfig.Cost` calculation path as the outer chat for MVP if advisor reuses the same model.
4. **Record server-side metrics.**
   - Count advisor invocations, failures, and latency.
   - Ensure they show up under the built-in tool label `advisor`.
5. **Optional decision gate: separate advisor model.**
   - If product insists on a stronger/different advisor model, add a follow-up config hook that resolves another existing chat model config through the same `configCache` path.
   - Keep that out of the first landing PR unless it is required for acceptance.
6. **Optional decision gate: queryable advisor cost.**
   - If this becomes required, spin a follow-up DB task:
     - update `coderd/database/queries/*.sql`;
     - add migration files;
     - run `make gen`;
     - update audit mappings if a new auditable type/field is introduced.

### Acceptance criteria

- Advisor calls are capped per outer run.
- Limit exhaustion is user-visible in the tool result.
- Metrics distinguish advisor calls from other built-in tools.
- MVP does not require a schema migration unless explicitly approved.

## Phase 5: Frontend rendering and Storybook coverage

### Goals

Make advisor feel intentional in the Agents UI without blocking the backend on fancy streaming UI.

### Files to modify

- `site/src/pages/AgentsPage/components/ChatElements/tools/Tool.tsx`
- new `site/src/pages/AgentsPage/components/ChatElements/tools/AdvisorTool.tsx`
- Storybook story file(s) in the same tools directory

### Delivery strategy

1. **Intermediate milestone during backend bring-up:** rely on the existing generic tool renderer if needed.
   - This is acceptable only as a short-lived integration checkpoint.
2. **Release milestone:** add a dedicated lightweight `AdvisorTool` renderer.
   - Reuse existing primitives:
     - `ToolCollapsible`
     - `ToolIcon`
     - `Response` for markdown/prose rendering
     - `ScrollArea` if the advice can be long
   - Keep styling light and consistent with the Agents page.
   - Do not add unnecessary React memoization in `site/src/pages/AgentsPage/`; that area is already React-Compiler aware.
3. **Render the structured result states cleanly.**
   - `advice`: readable prose/markdown with optional metadata footer.
   - `limit_reached`: warning-style message.
   - `error`: error state with visible fallback text.
   - `running`: existing tool loading state/spinner is enough for MVP.
4. **Add Storybook coverage instead of ad-hoc component tests.** Recommended stories:
   - successful advice;
   - running/loading;
   - limit reached;
   - error.
5. **Keep the UI contract narrow.**
   - Prefer one text field like `advice` plus small metadata rather than a deeply nested schema.
   - That keeps the UI resilient to prompt iteration.

### Acceptance criteria

- The advisor tool card renders readable content rather than raw quoted JSON in the final release branch.
- Running, limit, and error states are visibly distinct.
- Storybook stories and play assertions cover the new states.
- Existing tool rendering flows remain unchanged.

## Phase 6: Automated tests and validation gates

### Backend tests to add

1. **Advisor runtime/tool tests**
   - question validation;
   - tool-less nested execution assertion;
   - success result shaping;
   - limit-reached result shaping;
   - error result shaping.
2. **Prompt/gating tests in chatd**
   - advisor disabled ⇒ no tool, no guidance;
   - advisor enabled/root chat ⇒ tool + guidance;
   - child chat ⇒ advisor absent.
3. **Chatloop policy tests**
   - advisor alone runs;
   - advisor + action tool mixed batch returns deterministic policy errors;
   - non-advisor tools still execute normally.
4. **Usage/metrics tests**
   - per-run cap resets correctly;
   - builtin tool labeling includes `advisor`;
   - returned metadata includes model/usage summary when available.

### Frontend tests to add

- Storybook `play()` assertions for the advisor renderer states.
- Verify expand/collapse behavior and visible fallback text.
- Verify the message timeline still renders adjacent tools correctly.

### Recommended command sequence

Run these as the implementation matures, not only at the end:

1. Backend-focused gate after phases 1–4:
   - `make test RUN=TestAdvisor`
   - `make test RUN=TestChatloopAdvisor`
   - `make lint`
2. Frontend-focused gate after phase 5:
   - `pnpm test:storybook src/pages/AgentsPage/components/ChatElements/tools/AdvisorTool.stories.tsx`
   - `pnpm lint`
   - `pnpm format`
3. Final repo gate before handoff:
   - `make pre-commit`
   - run any additional targeted `make test RUN=...` selections covering touched chatd paths

> Use the exact new test names the implementing agents create; the names above are recommended anchors, not existing tests.

## Dogfooding plan

### Principle

Dogfood the change as a real agent feature, not just a unit-tested backend. Per the dogfood and `agent-browser` skills, the reviewer should get **watchable repro videos** plus screenshots that make the behavior obvious without reading logs.

### Required setup

1. Start the full dev environment with:
   - `./scripts/develop.sh`
2. If the frontend renderer changes, also start Storybook from `site/` with:
   - `pnpm storybook --no-open`
3. Use `agent-browser` directly, **never `npx agent-browser`**.
4. Use named browser sessions and an output folder such as:
   - `./dogfood-output/advisor/`
   - with subfolders `screenshots/` and `videos/`

### Evidence protocol

For every interactive scenario below:

1. Start video recording **before** the action.
2. Capture step-by-step screenshots at human pace.
3. Capture one annotated screenshot of the final state.
4. Stop the recording.
5. Note the exact pass/fail observation in the QA report.

For static UI states (for example Storybook error/limit cards), an annotated screenshot is sufficient; video is optional but still encouraged by this project’s review preference.

### Dogfood scenarios

#### Scenario A: Happy path in the real Agents UI

**Goal:** prove that a root agent chat can invoke advisor and produce a readable recommendation before taking further action.

Steps:

1. Open the Agents page with an advisor-enabled root chat.
2. Start a repro video.
3. Send a prompt that should reasonably trigger strategic planning, such as an architecture or multi-tradeoff question.
4. Capture screenshots of:
   - the prompt before send;
   - the running advisor state;
   - the completed advisor card and the assistant’s follow-up response.
5. Stop recording.

Pass criteria:

- advisor appears in the timeline;
- the rendered result is readable;
- the assistant can continue after consuming the advisor output.

#### Scenario B: Advisor unavailable path

**Goal:** prove the feature is truly gated.

Suggested variants (at least one is required, both are better):

- feature flag/config off;
- child/sub-agent chat.

Evidence:

- annotated screenshot of the chat/tool state showing advisor is absent;
- short video if toggling the gate live is part of the repro.

Pass criteria:

- no advisor tool is available;
- no advisor-specific prompt behavior leaks through.

#### Scenario C: UI states in Storybook

**Goal:** prove the renderer handles non-happy states cleanly.

Required story states:

- success/advice;
- running;
- limit reached;
- error.

Evidence:

- one screenshot per state;
- at least one short video showing collapse/expand behavior.

Pass criteria:

- success renders readable advice;
- limit/error have visible fallback text;
- the component behaves like the other tool cards.

#### Scenario D: Regression sweep of nearby tools

**Goal:** ensure advisor does not break the surrounding chat timeline.

Check at minimum:

- another existing built-in tool still renders correctly near advisor;
- sub-agent/tool cards still expand/collapse normally;
- no obvious console errors appear in the Agents page during the advisor flow.

Evidence:

- screenshots of adjacent tool cards;
- console/error capture if anything suspicious appears.

### `agent-browser` usage notes for the QA agent

- Prefer `agent-browser batch` for 2+ sequential commands when no intermediate parsing is needed.
- Use `snapshot -i` to discover interactive refs.
- Re-snapshot after navigation or major DOM changes.
- Avoid `wait --load networkidle` unless the page is known to go idle; prefer explicit element/text waits or short fixed waits.
- Record videos at human pace and include pauses that a reviewer can follow.

## Rollout plan

### Initial rollout

- Gate behind a server-side advisor-enabled flag.
- Enable only for selected internal/root agent chats first.
- Watch metrics for:
  - invocation count;
  - failure rate;
  - latency;
  - obvious retry loops.

### Expansion conditions

Expand beyond the initial rollout only after the following are true:

- mixed-batch policy behavior is stable;
- cost impact is understood;
- frontend UX is readable in production-like dogfood;
- no recursion surprises have appeared with sub-agent flows.

### Explicit non-goals for the first release

- advisor inside child/sub-agent chats;
- provider-agnostic streaming phase UI;
- MCP-based external advisor implementation;
- mandatory DB-backed advisor cost reporting.

## Final acceptance checklist

- [ ] `advisor` is a built-in chatd tool, not an MCP/dynamic-tool substitute.
- [ ] The nested advisor call is tool-less and bounded to one in-memory step.
- [ ] One eligibility boolean controls both tool registration and prompt guidance injection.
- [ ] Root chats can use advisor; child chats cannot in the initial rollout.
- [ ] Mixed advisor/action batches produce deterministic policy errors instead of partial execution.
- [ ] Per-run usage caps and limit-reached behavior work.
- [ ] Advisor usage is visible in metadata/metrics without forcing a DB migration for MVP.
- [ ] The Agents UI has a readable advisor card and Storybook coverage.
- [ ] Dogfooding produced screenshots and repro videos for the required scenarios.
- [ ] Validation commands (`make lint`, targeted `make test`, Storybook tests, `make pre-commit`) passed before handoff.

## Suggested PR split

1. **PR 1: Backend foundation**
   - `chatadvisor/` package
   - `chattool/advisor.go`
   - `chatloop` exclusive policy
   - chatd gating/prompt sync
   - backend tests
2. **PR 2: Frontend + QA**
   - advisor renderer
   - stories/play assertions
   - dogfood artifacts and QA notes
3. **PR 3: Optional follow-ups only if demanded by stakeholders**
   - separate advisor model override
   - persistent advisor billing/queryability
   - transient phase-stream UX

</details>

---

_Generated with [`mux`](https://github.com/coder/mux) • Model: `anthropic:claude-opus-4-7` • Thinking: `max`_ |
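The mixed-batch policy from Phase 3 of the plan above can be sketched as a pure function over a batch of tool calls. This is a hedged illustration: `ToolCall`, `ToolResult`, and `rejectMixedBatch` are hypothetical names, not the types in `chatloop`, and the sketch follows the plan's wording by rejecting only batches that mix an exclusive tool with some other tool.

```go
package main

import "fmt"

// ToolCall and ToolResult are simplified stand-ins for the chatloop types.
type ToolCall struct {
	ID   string
	Name string
}

type ToolResult struct {
	CallID  string
	IsError bool
	Content string
}

// rejectMixedBatch returns one policy error per call when the batch mixes an
// exclusive tool with any other tool. It returns nil when the batch may run,
// so nothing is executed partially.
func rejectMixedBatch(batch []ToolCall, exclusive map[string]bool) []ToolResult {
	exclusiveName := ""
	sawOther := false
	for _, c := range batch {
		if exclusive[c.Name] {
			if exclusiveName == "" {
				exclusiveName = c.Name
			}
		} else {
			sawOther = true
		}
	}
	if exclusiveName == "" || !sawOther {
		return nil
	}

	results := make([]ToolResult, 0, len(batch))
	for _, c := range batch {
		msg := "this tool was skipped because " + exclusiveName + " must run alone"
		if exclusive[c.Name] {
			msg = c.Name + " must be called by itself before action tools"
		}
		results = append(results, ToolResult{CallID: c.ID, IsError: true, Content: msg})
	}
	return results
}

func main() {
	exclusive := map[string]bool{"advisor": true}

	mixed := []ToolCall{{ID: "1", Name: "advisor"}, {ID: "2", Name: "execute"}}
	for _, r := range rejectMixedBatch(mixed, exclusive) {
		fmt.Printf("%s error=%v %s\n", r.CallID, r.IsError, r.Content)
	}

	solo := []ToolCall{{ID: "3", Name: "advisor"}}
	fmt.Println(rejectMixedBatch(solo, exclusive) == nil)
}
```

Returning errors for every call in the batch, rather than running the action tools and deferring the advisor, keeps the transcript deterministic and lets the model retry with a clean batch.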
||
|
|
62e9752acd |
fix: prevent malformed OpenAI Responses continuations (#24725)
> Worked on by Mux on Mike's behalf.

## Summary

- Disable OpenAI Responses `previous_response_id` chain mode when the prior assistant response has unresolved local tool calls, so the next request can include paired tool outputs instead of sending an incomplete continuation.
- Update the fantasy pin to a Responses replay fix that preserves stored reasoning references, only replays web search references when paired with reasoning, and validates local function-call output pairing before send.
- Add fake OpenAI Responses input validation for the two production 400 shapes and integration coverage for full-history reasoning plus web search replay.
- Add sanitized diagnostics for the OpenAI Responses continuity errors.

## Tests

- `go test ./providers/openai -run 'TestResponsesToPrompt_(ReasoningWithStore|ReasoningWithWebSearchCombined|WebSearchRequiresReasoningReference|ReasoningWithFunctionCallCombined|WebSearchProviderExecutedToolResults)|TestPrepareParams_(SkipsProviderExecutedToolReferences|ValidatesFunctionCallOutputPairing)|TestValidateResponsesInput_WebSearchReferenceRequiresReasoning' -count=1`
- `go test ./providers/openai -count=1`
- `GOWORK=off go test ./coderd/x/chatd/chattest -run TestValidateResponsesAPIInput -count=1`
- `GOWORK=off go test ./coderd/x/chatd -run 'TestOpenAIResponses(NoStaleWebSearchReplay|FullReplayPairsReasoningAndWebSearch|ChainModeSkipsWhenLocalCallPending|ChainModeStillFiresForProviderExecutedOnly)$|TestResolveChainMode_' -count=1`
- `GOWORK=off go test ./coderd/x/chatd/chatprompt -run 'TestInjectMissingToolResults_' -count=1`
- `GOWORK=off go test ./coderd/x/chatd/chaterror -run TestClassify_OpenAIResponsesAPIDiagnostics -count=1`
- `GOWORK=off go test ./coderd/x/chatd/... -count=1`
- `git diff --check`
- `git commit` pre-commit hook |
||
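The chain-mode guard described in the summary above could look roughly like the sketch below. The `ToolCall` type, its fields, and both function names are hypothetical stand-ins, not the types used in `coderd/x/chatd`; only the rule itself (skip chaining while a local tool call has no paired output, ignore provider-executed calls) comes from the commit message.

```go
package main

import "fmt"

// ToolCall is a hypothetical, simplified view of a tool call recorded on the
// prior assistant response.
type ToolCall struct {
	ID               string
	ProviderExecuted bool // true for provider-run tools such as web_search
}

// hasPendingLocalCalls reports whether any locally executed tool call still
// lacks a paired output. Provider-executed calls are skipped because the
// provider already holds their results.
func hasPendingLocalCalls(calls []ToolCall, outputs map[string]bool) bool {
	for _, c := range calls {
		if c.ProviderExecuted {
			continue
		}
		if !outputs[c.ID] {
			return true
		}
	}
	return false
}

// canChain decides whether a follow-up may use previous_response_id. When it
// returns false the caller is assumed to fall back to full-history replay so
// the tool outputs can be sent alongside their calls.
func canChain(store bool, previousResponseID string, calls []ToolCall, outputs map[string]bool) bool {
	if !store || previousResponseID == "" {
		return false
	}
	return !hasPendingLocalCalls(calls, outputs)
}

func main() {
	calls := []ToolCall{
		{ID: "ws_1", ProviderExecuted: true},
		{ID: "call_1"},
	}
	fmt.Println(canChain(true, "resp_1", calls, map[string]bool{}))
	fmt.Println(canChain(true, "resp_1", calls, map[string]bool{"call_1": true}))
}
```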
|
|
dbcc654d28 |
feat: snapshot explore subagent tool entitlements (#24638)
Explore sub-agents previously could not use `web_search` or external MCP tools. `runChat` hard-skipped both for Explore. Lifting those guards naively would over-grant tools, because a child chat could outlive the spawning turn's plan-mode filter.

This change persists the spawning parent turn's filtered external MCP server IDs onto the child Explore chat, and simplifies the Explore provider-tool filter in `runChat`:

- New `resolveExploreToolSnapshot` helper: computes the child's inherited external MCP subset by running the parent's configs through `filterExternalMCPConfigsForTurn` (plan-mode policy) and, if the parent is itself an Explore child, further narrowing to the parent's own persisted `MCPServerIDs`. The result is written to the child's `MCPServerIDs` column at spawn time.
- The existing `mcp_server_ids` column is the sole durable snapshot. No new chat column is added.
- `runChat` for Explore children: loads MCP tools from the persisted snapshot, and keeps only `web_search` from provider-native tools (to block computer-use and other write-style tools, since Explore is read-only). Whether `web_search` is actually available is a per-model decision, determined by the current model config, just like a main chat.
- Built-in Explore allowlist is unchanged. Workspace-local MCP remains excluded for Explore.

Verification: `go build ./...`, `go test ./coderd/x/chatd/... -count=1`, `make gen` (clean tree), `make lint/emdash`, `go vet`. Deep-review ran 12 reviewers on the feature and 5 on the clarity refactor; CAR reviewed and approved; a subsequent scope reduction dropped a temporary `allow_web_search` column in favor of per-model handling.

> Mux is acting on Mike's behalf. |
||
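The narrowing step attributed to `resolveExploreToolSnapshot` above is essentially a set intersection. The sketch below shows that idea only: the function name `narrowToParentSnapshot`, the use of plain strings for server IDs, and the argument shapes are assumptions, and the plan-mode filter is taken as already applied to `filtered`.

```go
package main

import "fmt"

// narrowToParentSnapshot is a hypothetical helper. filtered holds the external
// MCP server IDs that survived the plan-mode filter for the spawning turn.
// When the parent is itself an Explore child, the result is further limited to
// the parent's persisted snapshot, so a child can never gain servers that its
// parent did not have.
func narrowToParentSnapshot(filtered []string, parentIsExplore bool, parentSnapshot []string) []string {
	out := make([]string, 0, len(filtered))
	if !parentIsExplore {
		// Root or non-Explore parent: inherit the filtered set as-is.
		return append(out, filtered...)
	}
	allowed := make(map[string]bool, len(parentSnapshot))
	for _, id := range parentSnapshot {
		allowed[id] = true
	}
	for _, id := range filtered {
		if allowed[id] {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	filtered := []string{"docs", "issues", "search"}
	fmt.Println(narrowToParentSnapshot(filtered, false, nil))
	fmt.Println(narrowToParentSnapshot(filtered, true, []string{"docs"}))
}
```

Persisting the result at spawn time, as the commit describes, means later changes to the parent's plan mode cannot widen what the child sees.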
|
|
5f3effd839 |
fix(coderd/x/chatd): add chattest.OpenAI() default fake server (#24540)
- Add `chattest.OpenAI(t)` convenience wrapper around `NewOpenAI` with
sensible defaults (JSON title response for non-streaming, text chunk for
streaming)
- Update `seedChatDependencies` to use it instead of an empty base URL,
preventing title generation from hitting real `api.openai.com` with a
fake key:
```
t.go:111: 2026-04-20 19:23:31.885 [debu] coderd.chatd.processor: title model candidate failed chat_id=edb43454-f23d-4163-9974-d101b8091de6 chat_id=edb43454-f23d-4163-9974-d101b8091de6 ...
error= generate structured title:
github.com/coder/coder/v2/coderd/x/chatd.generateStructuredTitleWithUsage
/home/coder/src/coder/coder/coderd/x/chatd/quickgen.go:443
- unauthorized: Incorrect API key provided: test-api-key. You can find your API key at https://platform.openai.com/account/api-keys.
```
> 🤖
|
||
|
|
8382e96a81 | feat: add types, context, and model normalization (#23914) | ||
|
|
590235138f | fix: pin fixed anthropic/fantasy forks for streaming token accounting (#24077) | ||
|
|
fc1e0beb3b |
fix(coderd/x/chatd): use structured output for chat title generation (#23909)
Chat title generation used free-form text completion, which let models respond conversationally instead of producing a title. Review chats started with GitHub URLs were especially affected — models would say "I don't have the ability to browse external links" and that string became the persisted title. Replace the raw-text `generateShortText` path with structured output via `object.Generate[generatedTitle]`. Both auto-title and manual retitle now go through the same typed contract: the model must return a JSON object with a `title` field, validated and normalized before persistence. Invalid outputs (empty, too long) are rejected and retried through the existing candidate-model fallback loop. |
||
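The validate-then-fall-back behavior described above can be sketched with the standard library alone. This is not the `object.Generate` path from the commit: `parseTitle`, `firstValidTitle`, the 80-character limit, and the whitespace normalization are assumptions made for illustration, and the raw strings stand in for model outputs.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"unicode/utf8"
)

// maxTitleLen is an assumed limit; the real value is not stated in the commit.
const maxTitleLen = 80

// generatedTitle mirrors the typed contract: a JSON object with a title field.
type generatedTitle struct {
	Title string `json:"title"`
}

// parseTitle decodes one raw model output, normalizes whitespace, and rejects
// empty or overlong titles. Conversational replies fail at the decode step.
func parseTitle(raw string) (string, error) {
	var out generatedTitle
	if err := json.Unmarshal([]byte(raw), &out); err != nil {
		return "", fmt.Errorf("decode title: %w", err)
	}
	title := strings.Join(strings.Fields(out.Title), " ")
	if title == "" {
		return "", errors.New("empty title")
	}
	if utf8.RuneCountInString(title) > maxTitleLen {
		return "", errors.New("title too long")
	}
	return title, nil
}

// firstValidTitle imitates a candidate fallback loop: it tries each raw output
// in order and returns the first one that validates.
func firstValidTitle(candidates []string) (string, error) {
	var errs []error
	for _, raw := range candidates {
		title, err := parseTitle(raw)
		if err == nil {
			return title, nil
		}
		errs = append(errs, err)
	}
	return "", errors.Join(errs...)
}

func main() {
	title, err := firstValidTitle([]string{
		"I don't have the ability to browse external links",
		`{"title": "  Review   chatd title fix "}`,
	})
	fmt.Println(title, err)
}
```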
|
|
02356c61f6 |
fix: use previous_response_id chaining for OpenAI store=true follow-ups (#23450)
OpenAI Responses follow-up turns were replaying full assistant/tool history even when `store=true`, which breaks after reasoning + provider-executed `web_search` output. This change persists the OpenAI response ID on assistant messages, then in `coderd/x/chatd` switches `store=true` follow-ups to `previous_response_id` chaining with a system + new-user-only prompt. `store=false` and missing-ID cases still fall back to manual replay. It also updates the fake OpenAI server and integration coverage for the chaining contract, and carries the rebased path move to `coderd/x/chatd` plus the migration renumber needed after rebasing onto `main`. |
||
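A rough sketch of the follow-up selection described above is shown below. The `Message` and `Request` types and `buildFollowUp` are hypothetical simplifications, not the `coderd/x/chatd` or OpenAI SDK types; the only behavior taken from the commit is that `store=true` with a persisted response ID sends system plus new user messages with `previous_response_id`, and every other case replays the full history.

```go
package main

import "fmt"

// Message is a hypothetical, simplified chat message. ResponseID is only set
// on assistant messages whose OpenAI response ID was persisted.
type Message struct {
	Role       string
	Content    string
	ResponseID string
}

// Request is a hypothetical stand-in for the outgoing Responses request.
type Request struct {
	PreviousResponseID string
	Input              []Message
}

// buildFollowUp chooses between chaining and manual replay.
func buildFollowUp(history []Message, store bool) Request {
	last := -1
	for i := len(history) - 1; i >= 0; i-- {
		if history[i].Role == "assistant" {
			last = i
			break
		}
	}
	// store=false or a missing response ID falls back to full replay.
	if !store || last < 0 || history[last].ResponseID == "" {
		return Request{Input: history}
	}
	// Chain mode: keep system messages and only the user messages that
	// arrived after the last assistant response.
	var input []Message
	for i, m := range history {
		if m.Role == "system" || (i > last && m.Role == "user") {
			input = append(input, m)
		}
	}
	return Request{PreviousResponseID: history[last].ResponseID, Input: input}
}

func main() {
	history := []Message{
		{Role: "system", Content: "You are helpful."},
		{Role: "user", Content: "hi"},
		{Role: "assistant", Content: "hello", ResponseID: "resp_1"},
		{Role: "user", Content: "next question"},
	}
	fmt.Printf("%+v\n", buildFollowUp(history, true))
	fmt.Printf("%+v\n", buildFollowUp(history, false))
}
```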
|
|
80a172f932 |
chore: move chatd and related packages to /x/ subpackage (#23445)
- Moves `coderd/chatd/`, `coderd/gitsync/`, `enterprise/coderd/chatd/` under `x/` parent directories to signal instability
- Adds `Experimental:` glue code comments in `coderd/coderd.go`

> 🤖 This PR was created with the help of Coder Agents, and was reviewed by my human. 🧑‍💻 |