fix(coderd/x/chatd): hoist system prompt fetch out of chat creation transactions (#24369)

## Problem

`resolveDeploymentSystemPrompt` was called inside `InTx` closures in
both `CreateChat` (`coderd/x/chatd/chatd.go`) and
`createChildSubagentChatWithOptions` (`coderd/x/chatd/subagent.go`).
That method uses `p.db` (the root store) internally to call
`GetChatSystemPromptConfig`, which requires a second DB pool checkout
while the transaction already holds one connection.

Under concurrent chat creation load (e.g., the chat scaletest at 4800
chats), this causes pool starvation: every in-flight create holds one
connection and blocks waiting for another, leading to `idle in
transaction` pileups and cascading timeouts across the entire coderd DB
pool — including unrelated background work like prebuild metrics and the
chat acquire loop.

## Fix

Move the `resolveDeploymentSystemPrompt` call before `p.db.InTx(...)` in
both call sites. The system prompt config is a read-only
deployment-level setting that does not need transactional consistency
with the chat insert, so fetching it before the transaction is both safe
and preferable (it also shortens transaction lifetime).

## Backporting

The `CreateChat` instance of this bug is also present on `release/2.32`
(`coderd/x/chatd/chatd.go` line 907). The `subagent.go` instance is not
— the child-subagent-chat creation path with its own `InTx` was added
after the branch cut.

This should be backported, but because this is only in the chat creation
path, and that's not typically hit with a great deal of concurrency in
the real world, I don't think an urgent patch for 2.32 is necessary.

## Lint gap

The existing `InTx` ruleguard rule in `scripts/rules.go` catches direct
outer-store usage (`p.db.GetFoo()`) and passing the outer store as a
function argument inside `InTx` closures, but it explicitly cannot catch
indirect access through receiver methods like
`p.resolveDeploymentSystemPrompt()` — the rule documents this blind spot
at line 273. Catching this class of bug would require interprocedural
analysis (following the callee's body to see if it touches `p.db`),
which is beyond what ruleguard's AST pattern matching can express. We're
considering a lightweight custom `go/analysis` analyzer (similar to
`paralleltestctx`) that does 1-level same-package callee inspection to
detect this pattern. In the meantime, this PR adds guidance to
`AGENTS.md` so AI reviewers can flag the pattern during code review.
This commit is contained in:
Ethan
2026-04-16 00:13:15 +10:00
committed by GitHub
parent 227f20df6a
commit e7883d4573
3 changed files with 20 additions and 2 deletions
+12
View File
@@ -128,6 +128,18 @@ app, err := api.Database.GetOAuth2ProviderAppByClientID(ctx, clientID)
- Keep nullable-field handling, type coercion, and response shaping in the
converter so handlers stay focused on request flow and authorization.
### Transactions and `InTx`
- Inside `db.InTx(...)` closures, do not use the outer store (`api.Database`,
`p.db`, etc.) directly or indirectly. Use the `tx` handle for DB work inside
the closure, or fetch read-only inputs before opening the transaction.
- Watch for helper methods on a receiver that hide outer-store access. A call
like `p.someHelper(ctx)` is still unsafe inside `InTx` if that helper uses
`p.db` internally.
- Using the outer store while a transaction is open can hold one connection and
then block on another pool checkout, which can cause pool starvation and
`idle in transaction` incidents under load.
## Quick Reference
### Full workflows available in imported WORKFLOWS.md