mirror of
https://github.com/coder/coder.git
synced 2026-06-02 20:48:20 +00:00
fix(coderd/x/chatd): hoist system prompt fetch out of chat creation transactions (#24369)
## Problem `resolveDeploymentSystemPrompt` was called inside `InTx` closures in both `CreateChat` (`coderd/x/chatd/chatd.go`) and `createChildSubagentChatWithOptions` (`coderd/x/chatd/subagent.go`). That method uses `p.db` (the root store) internally to call `GetChatSystemPromptConfig`, which requires a second DB pool checkout while the transaction already holds one connection. Under concurrent chat creation load (e.g., the chat scaletest at 4800 chats), this causes pool starvation: every in-flight create holds one connection and blocks waiting for another, leading to `idle in transaction` pileups and cascading timeouts across the entire coderd DB pool — including unrelated background work like prebuild metrics and the chat acquire loop. ## Fix Move the `resolveDeploymentSystemPrompt` call before `p.db.InTx(...)` in both call sites. The system prompt config is a read-only deployment-level setting that does not need transactional consistency with the chat insert, so fetching it before the transaction is both safe and preferable (it also shortens transaction lifetime). ## Backporting The `CreateChat` instance of this bug is also present on `release/2.32` (`coderd/x/chatd/chatd.go` line 907). The `subagent.go` instance is not — the child-subagent-chat creation path with its own `InTx` was added after the branch cut. This should be backported, but because this is only in the chat creation path, and that's not typically hit with a great deal of concurrency in the real world, I don't think an urgent patch for 2.32 is necessary. ## Lint gap The existing `InTx` ruleguard rule in `scripts/rules.go` catches direct outer-store usage (`p.db.GetFoo()`) and passing the outer store as a function argument inside `InTx` closures, but it explicitly cannot catch indirect access through receiver methods like `p.resolveDeploymentSystemPrompt()` — the rule documents this blind spot at line 273. Catching this class of bug would require interprocedural analysis (following the callee's body to see if it touches `p.db`), which is beyond what ruleguard's AST pattern matching can express. We're considering a lightweight custom `go/analysis` analyzer (similar to `paralleltestctx`) that does 1-level same-package callee inspection to detect this pattern. In the meantime, this PR adds guidance to `AGENTS.md` so AI reviewers can flag the pattern during code review.
This commit is contained in:
@@ -128,6 +128,18 @@ app, err := api.Database.GetOAuth2ProviderAppByClientID(ctx, clientID)
|
||||
- Keep nullable-field handling, type coercion, and response shaping in the
|
||||
converter so handlers stay focused on request flow and authorization.
|
||||
|
||||
### Transactions and `InTx`
|
||||
|
||||
- Inside `db.InTx(...)` closures, do not use the outer store (`api.Database`,
|
||||
`p.db`, etc.) directly or indirectly. Use the `tx` handle for DB work inside
|
||||
the closure, or fetch read-only inputs before opening the transaction.
|
||||
- Watch for helper methods on a receiver that hide outer-store access. A call
|
||||
like `p.someHelper(ctx)` is still unsafe inside `InTx` if that helper uses
|
||||
`p.db` internally.
|
||||
- Using the outer store while a transaction is open can hold one connection and
|
||||
then block on another pool checkout, which can cause pool starvation and
|
||||
`idle in transaction` incidents under load.
|
||||
|
||||
## Quick Reference
|
||||
|
||||
### Full workflows available in imported WORKFLOWS.md
|
||||
|
||||
@@ -875,6 +875,10 @@ func (p *Server) CreateChat(ctx context.Context, opts CreateOptions) (database.C
|
||||
if opts.Labels == nil {
|
||||
opts.Labels = database.StringMap{}
|
||||
}
|
||||
// Resolve the deployment prompt before opening the transaction so
|
||||
// chat creation does not hold one DB connection while waiting for
|
||||
// another pool checkout.
|
||||
deploymentPrompt := p.resolveDeploymentSystemPrompt(ctx)
|
||||
|
||||
var chat database.Chat
|
||||
txErr := p.db.InTx(func(tx database.Store) error {
|
||||
@@ -915,7 +919,6 @@ func (p *Server) CreateChat(ctx context.Context, opts CreateOptions) (database.C
|
||||
return xerrors.Errorf("insert chat: %w", err)
|
||||
}
|
||||
|
||||
deploymentPrompt := p.resolveDeploymentSystemPrompt(ctx)
|
||||
userPrompt := SanitizePromptText(opts.SystemPrompt)
|
||||
var workspaceAwareness string
|
||||
if opts.WorkspaceID.Valid {
|
||||
|
||||
@@ -452,6 +452,10 @@ func (p *Server) createChildSubagentChatWithOptions(
|
||||
return database.Chat{}, xerrors.Errorf("marshal labels: %w", err)
|
||||
}
|
||||
childSystemPrompt := SanitizePromptText(opts.systemPrompt)
|
||||
// Resolve the deployment prompt before opening the transaction so
|
||||
// child chat creation does not hold one DB connection while waiting
|
||||
// for another pool checkout.
|
||||
deploymentPrompt := p.resolveDeploymentSystemPrompt(ctx)
|
||||
|
||||
var child database.Chat
|
||||
txErr := p.db.InTx(func(tx database.Store) error {
|
||||
@@ -482,7 +486,6 @@ func (p *Server) createChildSubagentChatWithOptions(
|
||||
return xerrors.Errorf("insert child chat: %w", err)
|
||||
}
|
||||
|
||||
deploymentPrompt := p.resolveDeploymentSystemPrompt(ctx)
|
||||
workspaceAwareness := "There is no workspace associated with this chat yet. Create one using the create_workspace tool before using workspace tools like execute, read_file, write_file, etc."
|
||||
if insertedChat.WorkspaceID.Valid {
|
||||
workspaceAwareness = "This chat is attached to a workspace. You can use workspace tools like execute, read_file, write_file, etc."
|
||||
|
||||
Reference in New Issue
Block a user