Adds `coder exp scaletest chat`, a harness for creating Coder Agents
chat load.
Start the mock LLM separately, prepare the scaletest workspaces you want
to target, then run the chat scaletest against the existing
`scaletest-*` fleet selected by the shared workspace targeting flags:
```sh
coder exp scaletest llm-mock --address 127.0.0.1:18080
coder exp scaletest chat --llm-mock-url http://127.0.0.1:18080/v1 --chats-per-workspace 10 --turns 1
coder exp scaletest chat --llm-mock-url http://127.0.0.1:18080/v1 --template docker --target-workspaces 0:10 --chats-per-workspace 1 --turns 10 --turn-start-delay 30s
```
This is the same pattern used by the `workspace-traffic` load generator.
Keeping the fake LLM as a separate process is intentional so it can be
scaled independently from the Coder deployment, which will likely be
necessary as we scale up and up.
This PR is the starting point: it provides the command, mock
provider/model bootstrap, existing workspace selection, chat streaming,
follow-up turns, metrics, and cleanup. Follow-up PRs will add multi-step
turns via tool calls. I'm still a bit iffy on the mechanism I have for
that. It'll likely involve having the runner send some magic strings
that the mock will recognise.
Relates to CODAGT-307
Relates to GRU-48
Relates to https://github.com/coder/scaletest/issues/124
Generated by Mux, but reviewed by a human
Use typed atomics (atomic.Int64, atomic.Int32, etc.) in test files to prevent
mixing atomic and non-atomic access on the same value, guarantee 64-bit
alignment on 32-bit platforms, and provide a cleaner API.
1 of 9 [next >>](https://github.com/coder/coder/pull/24811)
RFC: [Bridge ↔ Boundaries Correlation
RFC](https://www.notion.so/Bridge-Boundaries-Correlation-313d579be59281f3b4efdbfd6896775a)
Adds three new proto fields for boundary session correlation.
**`ReportBoundaryLogsRequest`**
- `session_id` (string, field 2) — UUID generated by boundary at
startup,
shared across all batches from a single run.
- `confined_process` (string, field 3) — name of the confined process
(e.g. `claude-code`, `codex`, `copilot`).
**`BoundaryLog`**
- `sequence_number` (uint64, field 4) — monotonically increasing counter
per session, primary ordering key when boundary is in use.
`BoundaryLog.time` already existed at field 2; no change needed there.
API version bumped to v2.9.
No behaviour change in coderd or the agent. This is a pure schema bump
that the boundary repo will consume in its own stack.
> Generated by Coder Agents
connMetrics.total was written to via atomic.AddInt64 and read as a plain
int64, producing a data race. Fix by switching the field to atomic.Int64
and using its typed Add/Load methods.
The testMetrics mock had a similar issue where mutex use was missing
where GetTotalBytes read the total bytes, which is also fixed in this
change.
We noticed during higher active workspace counts that the agent
connection metric, generated via a query to the database, would report a
relatively high amount of agents as disconnected. Somewhere between 5
and 20%. However, other metrics such as # of websocket connections would
suggest that all agent connections are healthy.
Looking at the `Agents` function in prometheus metrics, plus the query
execution time (not accounting for actual database RT time) revealed
that this reporting of agents as disconnected was almost certainly false
positives due to clock drift in the way we're generating the metric
values. At 10k metrics, with a p50 of 2ms and p99 of 5ms, the entire
`agents` function could take upwards of 50s to execute. Because we were
doing a query/database RT to query th apps for each agent individually,
and grabbing a `time.Now` value on each iteration of that loop, it's
likely the portion of agents that were reported as disconnected were
those that had last heartbeat the furthest in the past.
The fix here is to set a consistent `now` before fetching agent data to
avoid clock drift inflating the inactive timeout comparison, and replace
the per-agent app query N+1 with a single batched lookup to prevent loop
execution time from pushing agents over the disconnected threshold.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
The llmmock Anthropic stream wrote each chunk as `data:` only, so
Anthropic clients never saw the named SSE events they dispatch on and
Claude responses arrived empty even though the stream completed with
HTTP 200.
Update `sendAnthropicStream()` to emit `event: <type>` and `data:
<json>` for each Anthropic chunk while leaving the OpenAI-style streams
unchanged.
The slices package provides type-safe generic replacements for the
old typed sort convenience functions. The codebase already uses
slices.Sort in 43 call sites; this finishes the migration for the
remaining 29.
- sort.Strings(x) -> slices.Sort(x)
- sort.Float64s(x) -> slices.Sort(x)
- sort.StringsAreSorted(x) -> slices.IsSorted(x)
We currently call GetWorkspaceAgentByID millions of times at scale
unnecessarily. This PR embeds immutable fields into the relevant
services instead of fetching for them every time.
resolves https://github.com/coder/scaletest/issues/84
Confirmed with a 10k scaletest that this changeset takes the query from
10M+ queries down to 39k
This PR adds a `WatchAllWorkspaces` function with `watch-all-workspaces`
endpoint, which can be used to listen on a single global pubsub channel
for _all_ workspace build updates, and makes use of it in the autostart
scaletest.
This negates the need to use a workspace watch pubsub channel _per_
workspace, which has auth overhead associated with each call. This is
especially relevant in situations such as the autostart scaletest, where
we need to start/stop a set of workspaces before we can configure their
autostart config. The overhead associated with all the watch requests
skews the scaletest results and makes it harder to reason about the
performance of the autostart feature itself.
The autostart scaletest also no longer generates its own metrics nor
does it wait for all the workspaces to actually start via autostart. We
should update the scaletest dashboard after both PRs are merged to
measure autostart performance via the new metrics.
The new function/endpoint and its usage in the autostart scaletest are
gated behind an experiment feature flag, this is something we should
discuss whether we want to enable the endpoint in prod by default or
not. If so, we can remove the experiment.
---------
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Callum Styan <callum@coder.com>
WaitBuffer is a thread-safe io.Writer that supports blocking until
accumulated output matches a substring or custom predicate. It
replaces ad-hoc safeBuffer/syncWriter types and time.Sleep-based
poll loops in tests with signal-driven waits.
- WaitFor/WaitForNth/WaitForCond for blocking on output
- Replace custom buffer types in cli/sync_test.go and
provisionersdk/agent_test.go
- Convert time.Sleep poll loops to require.Eventually/require.Never
in cli/ssh_test.go, coderd/activitybump_test.go,
coderd/workspaceagentsrpc_test.go, workspaceproxy_test.go, and
scaletest tests
This is useful at least in the case of scaletests but potentially in
other places as well. I noticed that scaletest workspace creation
hammers a single coderd replica.
---------
Signed-off-by: Callum Styan <callumstyan@gmail.com>
relates to #21335
Modifies our taskstatus scaletest load generator to use the dRPC connection to mimic what an actual running Task would do via the MCP server (c.f. PRs below this one in the stack).
Disclosure: I used AI to generate large portions of this PR, but hand-reviewed and tweaked.
Since Go 1.22, the loop variable capture issue is resolved. Variables
declared by for loops are now per-iteration rather than per-loop, making
the 'v := v' pattern unnecessary.
Apply optimizations:
* https://github.com/openai/openai-go/pull/602
* https://github.com/coder/aibridge/pull/160
These reduce CPU time and allocation count for OpenAI `chat/completions`
and `responses` APIs, making the use of OpenAI chat models through AI
Bridge more performant.
In order to test these changes, we add scaletesting support for the
responses API.
Fixes an issue introduce in #21288
The default sdkclient created by the CLI root includes several additional http.RoundTripper wrappers to check versions and attach telemetry, so `DupClientCopyingHeaders` would break and scale tests would fail.
Instead of explicitly adding support for these additional wrappers to `DupClientCopyingHeaders` I think we should just stop unwrapping and move on. Scale tests don't need these wrapped functions.
This is a bit fragile, since it depends on the fact that the headers wrapper needs to be outermost, but that needs to be true for other uses, since things like dialing DERP do a similar thing where they unwrap and extract the auth headers. More long term this needs a refactor to make HTTP headers in the SDK a more first-class resource instead of this hacky RoundTripper wrapping, but that's for a different day.
Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example:
```
import (
"context"
"time"
"github.com/prometheus/client_golang/prometheus"
"golang.org/x/xerrors"
"gopkg.in/natefinch/lumberjack.v2"
"cdr.dev/slog/v3"
"github.com/coder/coder/v2/codersdk/agentsdk"
"github.com/coder/serpent"
)
```
3 groups: standard library, 3rd partly libs, Coder libs.
This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.
Upgrades to slog v3 which includes a small, but backward incompatible API change to the acceptible call arguments when logging. This change allows us to verify via compile time type checking that arguments are correct and won't cause a panic, as was possible in slog v1, which this replaces (v2 was tagged but never used in coder/coder).
It also updates dependencies that also use slog and were updated.
I've left the `aibridge` dependency as a commit SHA, under the assumption that the team there (cc @pawbana @dannykopping ) will tag and update the dependency soon and on their own schedule.
Other dependencies, I pushed new tags.
While scale testing, I noticed that our load generators send basically
all requests to a single Coderd instance.
e.g.

This is because our scale test commands create all `Runner`s using the
same codersdk Client, which means they share an underlying HTTP client.
With HTTP/2 a single TCP session can multiplex many different HTTP
requests (including websockets). So, it creates a single TCP connection
to a single coderd, and then sends all the requests down the one TCP
connections.
This PR modifies the `exp scaletest` load generator commands to create
an independent HTTP client per `Runner`. This means that each runner
will create its own TCP connection. This should help spread the load and
make a more realistic test, because in a real deployment, scaled out
load will be coming over different TCP connections.
Provisioner steps broken into smaller granular actions.
Changes:
- `ExtractArchive` moved to `init` request (was in `configure`)
- Writing `tfstate` moved to `plan` (was in `configure`)
- Moved most plan/apply outputs to `GraphComplete`
This change updates how SMTP notifications are polled during scale
tests.
Before, each of the ~2,000 pollers created its own http.Client, which
opened thousands of short-lived TCP connections.
Under heavy load, this ran out of available network ports and caused
errors like `connect: cannot assign requested address`
Now, all pollers share one HTTP connection pool. This prevents port
exhaustion and makes polling faster and more stable.
If a network error happens, the poller will now retry instead of
stopping, so tests keep running until all notifications are received.
The `SMTPRequestTimeout` is now applied per request using a context,
instead of being set on the `http.Client`.
Wait for the External workspace build job to complete before attempting to pull its credentials from Coder. This resolves a race in the load generator.
Closes https://github.com/coder/internal/issues/1127
In the workspace updates scaletest load generator, we end the test once all clients have seen a workspace update for their workspace. These workspace updates are generated when the workspace is created, NOT when the workspace build has finished.
This means when the runner goes to clean up the workspaces, they may still be building. The runner attempts to address this by cancelling the build, but that fails with a 403 since the runner users don't have permission.
The fix is to simply always wait for the build to finish, regardless of whether we were able to successfully cancel it.
In this test it's probably faster to just wait for the build to finish then the overhead of cancelling it, so that's what I've gone with here.
Relates to https://github.com/coder/internal/issues/914
Adds a runner for scaletesting prebuilds. The runner uploads a no-op template with prebuilds, watches for the corresponding workspaces to be created, and then does the same to tear them down. I didn't originally plan on having metrics for the teardown, but I figured we might as well as it's still the same prebuilds reconciliation loop mechanism being tested.
This PR refactors the notification scale test to use template admins and template deletion as the notification trigger. Additionally, I've added a configurable timeout for SMTP requests.
Previously, notifications were triggered by creating/deleting a user, and notifications were received by users with the owner role. However, because of how many notifications were generated by the runners, we had too many notifications to reliably test notification delivery.
This PR is just committing the changes I made while running the
`workspace-updates` load generator.
It ensures we're not polling the workspace build progress in the
background (while we also watch for workspace updates via the tailnet),
and also removes an unnecessary query to `/api/v2/workspace/{id}` after
each workspace is built.
Since `coder exp scaletest dynamic-parameters` ends up creating template versions, provisioner tags may be required to create the template versions.
On our own scaletest clusters, we tag every provisioner, so an untagged template version won't build and won't get imported.
This PR extends the scaletest notification runner with SMTP support.
If the `--smtp-api-url` flag is provided, the runner will also watch for SMTP notifications using the specified URL.
#### Changes
- Added a new watcher to retrieve emails sent to the runner user
- Tracked WebSocket and SMTP latencies separately
- Updated metrics to include `notification_id` and `notification_type` labels
#### CLI Flags
- `--smtp-api-url`: Address of the SMTP mock HTTP API used to retrieve email notifications
#### Metrics
- `notification_delivery_latency_seconds` now includes:
- `notification_id`
- `notification_type` (`websocket` or `smtp`)
This PR adds a fake SMTP server for scale testing. It collects emails sent during tests, which you can then check using the HTTP API.
#### Changes
- Added mock SMTP server
- Added `coder scaletest smtp` CLI command
- Implemented HTTP API endpoints to retrieve messages by email
- Added auto-purge to prevent memory issues
#### HTTP API Endpoints
- `GET /messages?email=<email>` – Get messages sent to an email address
- `POST /purge` – Clear all messages from memory
The HTTP API parses raw email messages to extract the **date**, **subject**, and **notification ID**.
Notification IDs are sent in emails like this:
```html
<p>
<a href="http://127.0.0.1:3000/settings/notifications?disabled=4e19c0ac-94e1-4532-9515-d1801aa283b2"
style="color: #2563eb; text-decoration: none;">
Stop receiving emails like this
</a>
</p>
```
#### CLI
```bash
coder scaletest smtp --host localhost --port 33199 --api-port 8080 --purge-at-count 1000
```
**Flags:**
- `--host`: Host for the mock SMTP and API server (default: localhost)
- `--port`: Port for the mock SMTP server (random if not specified)
- `--api-port`: Port for the HTTP API server (random if not specified)
- `--purge-at-count`: Max number of messages before auto-purging (default: 100000)
fixes https://github.com/coder/internal/issues/1050
Extends the timeout on the affected test. It uses a postgres database, and so 15s timeout isn't enough to not flake with our test infra these days.
part of https://github.com/coder/internal/issues/912
Adds CLI command `coder exp scaletest dynamic-parameters`
I've left out the configuration of tracing and timeouts for now. I think I want to do some refactoring of the scaletest CLI to make handling those flags take up less boiler plate.
I will add tracing and timeout flags in a follow up PR.
Relates to https://github.com/coder/internal/issues/910
This PR adds a scaletest runner that simulates users receiving notifications through WebSocket connections.
An instance of this notification runner does the following:
1. Creates a user (optionally with specific roles like owner).
2. Connects to /api/v2/notifications/inbox/watch via WebSocket to receive notifications in real-time.
3. Waits for all other concurrently executing runners (per the DialBarrier WaitGroup) to also connect their websockets.
4. For receiving users: Watches the WebSocket for expected notifications and records delivery latency for each notification type.
5. For regular users: Maintains WebSocket connections to simulate concurrent load while receiving users wait for notifications.
6. Waits on the ReceivingWatchBarrier to coordinate between receiving and regular users.
7. Cleans up the created user after the test completes.
Exposes three prometheus metrics:
1. notification_delivery_latency_seconds - HistogramVec. Labels = {username, notification_type}
2. notification_delivery_errors_total - CounterVec. Labels = {username, action}
3. notification_delivery_missed_total - CounterVec. Labels = {username}
The runner measures end-to-end notification latency from when a notification-triggering event occurs (e.g., user creation/deletion) to when the notification is received by a WebSocket client.
Relates to https://github.com/coder/internal/issues/911
This PR adds a scaletest runner that simulates a user with a workspace configured to autostart at a specific time.
An instance of this autostart runner does the following:
1. Creates a user.
2. Creates a workspace.
3. Listens on `/api/v2/workspaces/<ws-id>/watch` to wait for build completions throughout the run.
4. Stops the workspace, waits for the build to finish.
4. Waits for all other concurrently executing runners (per the `SetupBarrier` `WaitGroup`) to also have a stopped workspace.
5. Sets the workspace autostart time to `time.Now().Add(cfg.AutoStartDelay)`
6. Waits for the autostart workspace build to begin, enter the `pending` status, and then `running`.
7. Waits for the autostart workspace build to end.
Exposes four prometheus metrics:
- `autostart_job_creation_latency_seconds` - HistogramVec. Labels = {username, workspace_name}
- `autostart_job_acquired_latency_seconds` - HistogramVec. Labels = {username, workspace_name}
- `autostart_total_latency_seconds` - `HistogramVec`. Labels = `{username, workspace_name}`.
- `autostart_errors_total` - `CounterVec`. Labels = `{username, action}`.
Similar idea as in #19811, this runner doesn't need to conform to `Runnable`, so we have it return the workspace from the `RunReturningWorkspace` function, instead of the more fragile `Run`, followed by a `.WorkspaceID()`.
Relates to https://github.com/coder/internal/issues/889.
This PR adds a scaletest runner that simulates a single Coder Connect client receiving workspace updates.
An instance of a workspace updates runner does the following:
- Creates a user, if a session token is not supplied.
- Attempts to repeatedly dial the Coder Connect endpoint, with a configurable (two minutes by default) timeout.
- Once dialed successfully, waits for any other concurrently executing runners to also dial successfully, or timeout (using the barrier).
- Starts a configurable number of workspace builds.
- Waits for that many workspaces to be seen over the workspace updates stream (with a configurable timeout).
Exposes two prometheus metrics:
- `workspace_updates_latency_seconds` - `HistogramVec`. Labels = `{username, num_owned_workspaces, workspace_name}`
- This is the time between starting a workspace build, and receiving both the corresponding workspace update.
- `workspace_updates_errors_total` - `NewCounterVec`. Labels = `{username, num_owned_workspaces, action}`
- The number of times a specific action of the runner has failed, per user/client.
relates to https://github.com/coder/internal/issues/912
Adds a new scaletest Runner to generate dynamic parameters load.
A later PR will add the CLI command, including creating the template & version.
Relates to https://github.com/coder/internal/issues/889
The existing implementation for exposing read and written bytes was a little awkward - we're going to be adding a bunch of scaletest runners / load generators that *don't* transfer any bytes. This PR has the scaletest reports expose a map of arbitrary string-keyed metrics instead.
FWIW, the latest iteration of the scaletesting infrastructure doesn't parse these reports right now - they're just logged to stdout, so we're good to break the json schema here.