relates to #21335
Modifies our taskstatus scaletest load generator to use the dRPC connection to mimic what an actual running Task would do via the MCP server (c.f. PRs below this one in the stack).
Disclosure: I used AI to generate large portions of this PR, but hand-reviewed and tweaked.
Since Go 1.22, the loop variable capture issue is resolved. Variables
declared by for loops are now per-iteration rather than per-loop, making
the 'v := v' pattern unnecessary.
Apply optimizations:
* https://github.com/openai/openai-go/pull/602
* https://github.com/coder/aibridge/pull/160
These reduce CPU time and allocation count for OpenAI `chat/completions`
and `responses` APIs, making the use of OpenAI chat models through AI
Bridge more performant.
In order to test these changes, we add scaletesting support for the
responses API.
Fixes an issue introduce in #21288
The default sdkclient created by the CLI root includes several additional http.RoundTripper wrappers to check versions and attach telemetry, so `DupClientCopyingHeaders` would break and scale tests would fail.
Instead of explicitly adding support for these additional wrappers to `DupClientCopyingHeaders` I think we should just stop unwrapping and move on. Scale tests don't need these wrapped functions.
This is a bit fragile, since it depends on the fact that the headers wrapper needs to be outermost, but that needs to be true for other uses, since things like dialing DERP do a similar thing where they unwrap and extract the auth headers. More long term this needs a refactor to make HTTP headers in the SDK a more first-class resource instead of this hacky RoundTripper wrapping, but that's for a different day.
Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example:
```
import (
"context"
"time"
"github.com/prometheus/client_golang/prometheus"
"golang.org/x/xerrors"
"gopkg.in/natefinch/lumberjack.v2"
"cdr.dev/slog/v3"
"github.com/coder/coder/v2/codersdk/agentsdk"
"github.com/coder/serpent"
)
```
3 groups: standard library, 3rd partly libs, Coder libs.
This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.
Upgrades to slog v3 which includes a small, but backward incompatible API change to the acceptible call arguments when logging. This change allows us to verify via compile time type checking that arguments are correct and won't cause a panic, as was possible in slog v1, which this replaces (v2 was tagged but never used in coder/coder).
It also updates dependencies that also use slog and were updated.
I've left the `aibridge` dependency as a commit SHA, under the assumption that the team there (cc @pawbana @dannykopping ) will tag and update the dependency soon and on their own schedule.
Other dependencies, I pushed new tags.
While scale testing, I noticed that our load generators send basically
all requests to a single Coderd instance.
e.g.

This is because our scale test commands create all `Runner`s using the
same codersdk Client, which means they share an underlying HTTP client.
With HTTP/2 a single TCP session can multiplex many different HTTP
requests (including websockets). So, it creates a single TCP connection
to a single coderd, and then sends all the requests down the one TCP
connections.
This PR modifies the `exp scaletest` load generator commands to create
an independent HTTP client per `Runner`. This means that each runner
will create its own TCP connection. This should help spread the load and
make a more realistic test, because in a real deployment, scaled out
load will be coming over different TCP connections.
Provisioner steps broken into smaller granular actions.
Changes:
- `ExtractArchive` moved to `init` request (was in `configure`)
- Writing `tfstate` moved to `plan` (was in `configure`)
- Moved most plan/apply outputs to `GraphComplete`
This change updates how SMTP notifications are polled during scale
tests.
Before, each of the ~2,000 pollers created its own http.Client, which
opened thousands of short-lived TCP connections.
Under heavy load, this ran out of available network ports and caused
errors like `connect: cannot assign requested address`
Now, all pollers share one HTTP connection pool. This prevents port
exhaustion and makes polling faster and more stable.
If a network error happens, the poller will now retry instead of
stopping, so tests keep running until all notifications are received.
The `SMTPRequestTimeout` is now applied per request using a context,
instead of being set on the `http.Client`.
Wait for the External workspace build job to complete before attempting to pull its credentials from Coder. This resolves a race in the load generator.
Closes https://github.com/coder/internal/issues/1127
In the workspace updates scaletest load generator, we end the test once all clients have seen a workspace update for their workspace. These workspace updates are generated when the workspace is created, NOT when the workspace build has finished.
This means when the runner goes to clean up the workspaces, they may still be building. The runner attempts to address this by cancelling the build, but that fails with a 403 since the runner users don't have permission.
The fix is to simply always wait for the build to finish, regardless of whether we were able to successfully cancel it.
In this test it's probably faster to just wait for the build to finish then the overhead of cancelling it, so that's what I've gone with here.
Relates to https://github.com/coder/internal/issues/914
Adds a runner for scaletesting prebuilds. The runner uploads a no-op template with prebuilds, watches for the corresponding workspaces to be created, and then does the same to tear them down. I didn't originally plan on having metrics for the teardown, but I figured we might as well as it's still the same prebuilds reconciliation loop mechanism being tested.
This PR refactors the notification scale test to use template admins and template deletion as the notification trigger. Additionally, I've added a configurable timeout for SMTP requests.
Previously, notifications were triggered by creating/deleting a user, and notifications were received by users with the owner role. However, because of how many notifications were generated by the runners, we had too many notifications to reliably test notification delivery.
This PR is just committing the changes I made while running the
`workspace-updates` load generator.
It ensures we're not polling the workspace build progress in the
background (while we also watch for workspace updates via the tailnet),
and also removes an unnecessary query to `/api/v2/workspace/{id}` after
each workspace is built.
Since `coder exp scaletest dynamic-parameters` ends up creating template versions, provisioner tags may be required to create the template versions.
On our own scaletest clusters, we tag every provisioner, so an untagged template version won't build and won't get imported.
This PR extends the scaletest notification runner with SMTP support.
If the `--smtp-api-url` flag is provided, the runner will also watch for SMTP notifications using the specified URL.
#### Changes
- Added a new watcher to retrieve emails sent to the runner user
- Tracked WebSocket and SMTP latencies separately
- Updated metrics to include `notification_id` and `notification_type` labels
#### CLI Flags
- `--smtp-api-url`: Address of the SMTP mock HTTP API used to retrieve email notifications
#### Metrics
- `notification_delivery_latency_seconds` now includes:
- `notification_id`
- `notification_type` (`websocket` or `smtp`)
This PR adds a fake SMTP server for scale testing. It collects emails sent during tests, which you can then check using the HTTP API.
#### Changes
- Added mock SMTP server
- Added `coder scaletest smtp` CLI command
- Implemented HTTP API endpoints to retrieve messages by email
- Added auto-purge to prevent memory issues
#### HTTP API Endpoints
- `GET /messages?email=<email>` – Get messages sent to an email address
- `POST /purge` – Clear all messages from memory
The HTTP API parses raw email messages to extract the **date**, **subject**, and **notification ID**.
Notification IDs are sent in emails like this:
```html
<p>
<a href="http://127.0.0.1:3000/settings/notifications?disabled=4e19c0ac-94e1-4532-9515-d1801aa283b2"
style="color: #2563eb; text-decoration: none;">
Stop receiving emails like this
</a>
</p>
```
#### CLI
```bash
coder scaletest smtp --host localhost --port 33199 --api-port 8080 --purge-at-count 1000
```
**Flags:**
- `--host`: Host for the mock SMTP and API server (default: localhost)
- `--port`: Port for the mock SMTP server (random if not specified)
- `--api-port`: Port for the HTTP API server (random if not specified)
- `--purge-at-count`: Max number of messages before auto-purging (default: 100000)
fixes https://github.com/coder/internal/issues/1050
Extends the timeout on the affected test. It uses a postgres database, and so 15s timeout isn't enough to not flake with our test infra these days.
part of https://github.com/coder/internal/issues/912
Adds CLI command `coder exp scaletest dynamic-parameters`
I've left out the configuration of tracing and timeouts for now. I think I want to do some refactoring of the scaletest CLI to make handling those flags take up less boiler plate.
I will add tracing and timeout flags in a follow up PR.
Relates to https://github.com/coder/internal/issues/910
This PR adds a scaletest runner that simulates users receiving notifications through WebSocket connections.
An instance of this notification runner does the following:
1. Creates a user (optionally with specific roles like owner).
2. Connects to /api/v2/notifications/inbox/watch via WebSocket to receive notifications in real-time.
3. Waits for all other concurrently executing runners (per the DialBarrier WaitGroup) to also connect their websockets.
4. For receiving users: Watches the WebSocket for expected notifications and records delivery latency for each notification type.
5. For regular users: Maintains WebSocket connections to simulate concurrent load while receiving users wait for notifications.
6. Waits on the ReceivingWatchBarrier to coordinate between receiving and regular users.
7. Cleans up the created user after the test completes.
Exposes three prometheus metrics:
1. notification_delivery_latency_seconds - HistogramVec. Labels = {username, notification_type}
2. notification_delivery_errors_total - CounterVec. Labels = {username, action}
3. notification_delivery_missed_total - CounterVec. Labels = {username}
The runner measures end-to-end notification latency from when a notification-triggering event occurs (e.g., user creation/deletion) to when the notification is received by a WebSocket client.
Relates to https://github.com/coder/internal/issues/911
This PR adds a scaletest runner that simulates a user with a workspace configured to autostart at a specific time.
An instance of this autostart runner does the following:
1. Creates a user.
2. Creates a workspace.
3. Listens on `/api/v2/workspaces/<ws-id>/watch` to wait for build completions throughout the run.
4. Stops the workspace, waits for the build to finish.
4. Waits for all other concurrently executing runners (per the `SetupBarrier` `WaitGroup`) to also have a stopped workspace.
5. Sets the workspace autostart time to `time.Now().Add(cfg.AutoStartDelay)`
6. Waits for the autostart workspace build to begin, enter the `pending` status, and then `running`.
7. Waits for the autostart workspace build to end.
Exposes four prometheus metrics:
- `autostart_job_creation_latency_seconds` - HistogramVec. Labels = {username, workspace_name}
- `autostart_job_acquired_latency_seconds` - HistogramVec. Labels = {username, workspace_name}
- `autostart_total_latency_seconds` - `HistogramVec`. Labels = `{username, workspace_name}`.
- `autostart_errors_total` - `CounterVec`. Labels = `{username, action}`.
Similar idea as in #19811, this runner doesn't need to conform to `Runnable`, so we have it return the workspace from the `RunReturningWorkspace` function, instead of the more fragile `Run`, followed by a `.WorkspaceID()`.
Relates to https://github.com/coder/internal/issues/889.
This PR adds a scaletest runner that simulates a single Coder Connect client receiving workspace updates.
An instance of a workspace updates runner does the following:
- Creates a user, if a session token is not supplied.
- Attempts to repeatedly dial the Coder Connect endpoint, with a configurable (two minutes by default) timeout.
- Once dialed successfully, waits for any other concurrently executing runners to also dial successfully, or timeout (using the barrier).
- Starts a configurable number of workspace builds.
- Waits for that many workspaces to be seen over the workspace updates stream (with a configurable timeout).
Exposes two prometheus metrics:
- `workspace_updates_latency_seconds` - `HistogramVec`. Labels = `{username, num_owned_workspaces, workspace_name}`
- This is the time between starting a workspace build, and receiving both the corresponding workspace update.
- `workspace_updates_errors_total` - `NewCounterVec`. Labels = `{username, num_owned_workspaces, action}`
- The number of times a specific action of the runner has failed, per user/client.
relates to https://github.com/coder/internal/issues/912
Adds a new scaletest Runner to generate dynamic parameters load.
A later PR will add the CLI command, including creating the template & version.
Relates to https://github.com/coder/internal/issues/889
The existing implementation for exposing read and written bytes was a little awkward - we're going to be adding a bunch of scaletest runners / load generators that *don't* transfer any bytes. This PR has the scaletest reports expose a map of arbitrary string-keyed metrics instead.
FWIW, the latest iteration of the scaletesting infrastructure doesn't parse these reports right now - they're just logged to stdout, so we're good to break the json schema here.
Closes https://github.com/coder/internal/issues/985
Simple refactor of the user creation logic into it's own test runner. This lets us create users independently of workspaces, for use in a bunch of load generators, including the Coder Connect load generator.
This PR creates the new runner, and has the existing `createworkspaces` runner use it.
Relates to https://github.com/coder/internal/issues/985.
Some scaletest runners would autogenerate names if they weren't supplied on the config, while others required a name be supplied, and a name was autogenerated in the CLI command handler. This PR unifies the runners to make names and emails optional on each config, and generate them in the scaletest runner if omitted.
The create user runner in the PR above in the stack will do this too.
This PR improves the ruleguard rule for detecting `t.Fail` calls in goroutines. It picks up additional violations, of which are fixed in this PR.
See self-review for details.
The motivation for fixing this comes from a flake I fixed in https://github.com/coder/coder/pull/19599, where tests would fail from a `require` in an `Eventually`.
Refactors Agent instance identity to be a SessionTokenProvider.
Refactors the CLI to create Agent clients via a centralized function, rather than add-hoc via individual command handlers and their flags.
This allows commands besides `coder agent`, but which still use the agent identity, to support instance identity authentication.
Fixes#19111 by unifying all API requests to go thru the SessionTokenProvider for auth credentials.
Refactors `codersdk.Client`'s use of session tokens to use a `SessionTokenProvider`, which abstracts the obtaining and storing of the session token.
The main motiviation is to unify Agent authentication an an upstack PR, which can use cloud instance identity via token exchange, rather than a fixed session token.
However, the abstraction could also allow functionality like obtaining the session token from other external sources like the OS credential manager, or an external secret/key management system like Vault.
Relates to https://github.com/coder/internal/issues/888
As part of our renewed connection scaletesting efforts, we want to
scaletest coder in a scenario where direct connections aren't available
(relatively common for our customers), and all concurrent connections
are relayed via DERP.
This PR adds a flag, `--disable-direct` that can be included on the
existing`coder exp scaletest workspace-traffic -ssh` to disable direct
connections.
We've successfully migrated the latest iteration of our scaletest
infrastructure (`scaletest/terraform/action`) to
https://github.com/coder/scaletest (private repo). This PR removes the
older iterations, and the scriptsfor spinning up & running the load
generators against that infrastructure (`scaletest.sh`). The tooling for
generating load against a Coder deployment remains untouched, as does
the public documentation for that tooling (i.e. `coder exp scaletest`).
If we ever need that old scaletest Terraform code, it's always in the
git history!
This PR refactors the scaletest infrastructure to use a dedicated VPC for each deployment (i.e. alpha, bravo, charlie). It then peers that VPC with the observability VPC, and the Cloud SQL database. It also sets up subnetting for and within each deployment.
With this deployed, I was able to get the scaletest running with metrics flowing into `scaletest.cdr.dev`.
Co-authored-by: Dean Sheather <dean@deansheather.com>
Closes https://github.com/coder/internal/issues/850
This PR has the scaletest infrastructure retrieve and use TLS certificates from the persistent observability cluster.
To support creating multiple instances of the infrastructure simultaneously, `var.name` can be set to `alpha`, `bravo` or `charlie`, which retrieves the corresponding certificates.
Also:
- Adds support for wildcard apps.
- Retrieves the Cloudflare token from GCP secrets.
Removes the requirement to obtain a Cloudflare DNS token from our scaletest/terraform/action builds. Instead, by default, we pull the token from Google Secrets Manager and use the `scaletest.dev` DNS domain.
Removes cloudflare_email as this was unneeded.
Removes the cloudflare_zone_id and instead pulls it from a data source via the Cloudflare API.
closes https://github.com/coder/internal/issues/839
Fixes https://github.com/coder/internal/issues/907
We convert `workspacesdk.AgentConn` to an interface and generate a mock
for it. This allows writing `coderd` tests that rely on the agent's HTTP
api to not have to set up an entire tailnet networking stack.
Relaxes the `terraform` version constraint to be at least 1.9, since
1.12 is installed in our Dogfood image
Adds a `small` scenario to keep costs down while we continue to develop
capabilities.
Closes#17791
This PR adds ability to cancel workspace builds that are in "pending"
status.
Breaking changes:
- CancelWorkspaceBuild method in codersdk now accepts an optional
request parameter
API:
- Added `expect_status` query parameter to the cancel workspace build
endpoint
- This parameter ensures the job hasn't changed state before canceling
- API returns `412 Precondition Failed` if the job is not in the
expected status
- Valid values: `running` or `pending`
- Wrapped the entire cancel method in a database transaction
UI:
- Added confirmation dialog to the `Cancel` button, since it's a
destructive operation


- Enabled cancel action for pending workspaces (`expect_status=pending`
is sent if workspace is in pending status)

---------
Co-authored-by: Dean Sheather <dean@deansheather.com>