mirror of
https://github.com/coder/coder.git
synced 2026-06-02 20:48:20 +00:00
fix(coderd/x/chatd): hoist system prompt fetch out of chat creation transactions (#24369)
## Problem `resolveDeploymentSystemPrompt` was called inside `InTx` closures in both `CreateChat` (`coderd/x/chatd/chatd.go`) and `createChildSubagentChatWithOptions` (`coderd/x/chatd/subagent.go`). That method uses `p.db` (the root store) internally to call `GetChatSystemPromptConfig`, which requires a second DB pool checkout while the transaction already holds one connection. Under concurrent chat creation load (e.g., the chat scaletest at 4800 chats), this causes pool starvation: every in-flight create holds one connection and blocks waiting for another, leading to `idle in transaction` pileups and cascading timeouts across the entire coderd DB pool — including unrelated background work like prebuild metrics and the chat acquire loop. ## Fix Move the `resolveDeploymentSystemPrompt` call before `p.db.InTx(...)` in both call sites. The system prompt config is a read-only deployment-level setting that does not need transactional consistency with the chat insert, so fetching it before the transaction is both safe and preferable (it also shortens transaction lifetime). ## Backporting The `CreateChat` instance of this bug is also present on `release/2.32` (`coderd/x/chatd/chatd.go` line 907). The `subagent.go` instance is not — the child-subagent-chat creation path with its own `InTx` was added after the branch cut. This should be backported, but because this is only in the chat creation path, and that's not typically hit with a great deal of concurrency in the real world, I don't think an urgent patch for 2.32 is necessary. ## Lint gap The existing `InTx` ruleguard rule in `scripts/rules.go` catches direct outer-store usage (`p.db.GetFoo()`) and passing the outer store as a function argument inside `InTx` closures, but it explicitly cannot catch indirect access through receiver methods like `p.resolveDeploymentSystemPrompt()` — the rule documents this blind spot at line 273. Catching this class of bug would require interprocedural analysis (following the callee's body to see if it touches `p.db`), which is beyond what ruleguard's AST pattern matching can express. We're considering a lightweight custom `go/analysis` analyzer (similar to `paralleltestctx`) that does 1-level same-package callee inspection to detect this pattern. In the meantime, this PR adds guidance to `AGENTS.md` so AI reviewers can flag the pattern during code review.
This commit is contained in:
@@ -128,6 +128,18 @@ app, err := api.Database.GetOAuth2ProviderAppByClientID(ctx, clientID)
|
||||
- Keep nullable-field handling, type coercion, and response shaping in the
|
||||
converter so handlers stay focused on request flow and authorization.
|
||||
|
||||
### Transactions and `InTx`
|
||||
|
||||
- Inside `db.InTx(...)` closures, do not use the outer store (`api.Database`,
|
||||
`p.db`, etc.) directly or indirectly. Use the `tx` handle for DB work inside
|
||||
the closure, or fetch read-only inputs before opening the transaction.
|
||||
- Watch for helper methods on a receiver that hide outer-store access. A call
|
||||
like `p.someHelper(ctx)` is still unsafe inside `InTx` if that helper uses
|
||||
`p.db` internally.
|
||||
- Using the outer store while a transaction is open can hold one connection and
|
||||
then block on another pool checkout, which can cause pool starvation and
|
||||
`idle in transaction` incidents under load.
|
||||
|
||||
## Quick Reference
|
||||
|
||||
### Full workflows available in imported WORKFLOWS.md
|
||||
|
||||
Reference in New Issue
Block a user