This PR ensures that waits on channels will time out according to the
test context, rather than waiting indefinitely. This should alleviate
the panic seen in https://github.com/coder/internal/issues/645 and, if
the deadlock recurs, allow the test to be retried automatically in CI.
Closes https://github.com/coder/vscode-coder/issues/447
Closes https://github.com/coder/jetbrains-coder/issues/543
Closes https://github.com/coder/coder-jetbrains-toolbox/issues/21
This PR adds Coder Connect support to `coder ssh --stdio`.
When connecting to a workspace, if `--force-new-tunnel` is not passed, the CLI will first do a DNS lookup for `<agent>.<workspace>.<owner>.<hostname-suffix>`. If an IP address is returned, and it's within the Coder service prefix, the CLI will not create a new tailnet connection to the workspace, and instead dial the SSH server running on port 22 on the workspace directly over TCP.
This allows IDE extensions to use the Coder Connect tunnel, without requiring any modifications to the extensions themselves.
Additionally, `using_coder_connect` is added to the `sshNetworkStats` file, which the VS Code extension (and maybe Jetbrains?) will be able to read, and indicate to the user that they are using Coder Connect.
One advantage of this approach is that running `coder ssh --stdio` on an offline workspace with Coder Connect enabled will have the CLI wait for the workspace to build, the agent to connect (and optionally, for the startup scripts to finish), before finally connecting using the Coder Connect tunnel.
As a result, `coder ssh --stdio` has the overhead of looking up the workspace and agent, and checking if they are running. On my device, this meant `coder ssh --stdio <workspace>` was approximately a second slower than just connecting to the workspace directly using `ssh <workspace>.coder` (I would assume anyone serious about their Coder Connect usage would know to just do the latter anyway).
To ensure this doesn't come at a significant performance cost, I've also benchmarked this PR.
<details>
<summary>Benchmark</summary>
## Methodology
All tests were completed on `dev.coder.com`, where a Linux workspace running in AWS `us-west1` was created.
The machine running Coder Desktop (the 'client') was a Windows VM running in the same AWS region and VPC as the workspace.
To test the performance of specifically the SSH connection, a port was forwarded between the client and workspace using:
```
ssh -p 22 -L7001:localhost:7001 <host>
```
where `host` was either an alias for an SSH ProxyCommand that called `coder ssh`, or a Coder Connect hostname.
For latency, [`tcping`](https://www.elifulkerson.com/projects/tcping.php) was used against the forwarded port:
```
tcping -n 100 localhost 7001
```
For throughput, [`iperf3`](https://iperf.fr/iperf-download.php) was used:
```
iperf3 -c localhost -p 7001
```
where an `iperf3` server was running on the workspace on port 7001.
## Test Cases
### Testcase 1: `coder ssh` `ProxyCommand` that bicopies from Coder Connect
This case tests the implementation in this PR, such that we can write a config like:
```
Host codercliconnect
ProxyCommand /path/to/coder ssh --stdio workspace
```
With Coder Connect enabled, `ssh -p 22 -L7001:localhost:7001 codercliconnect` will use the Coder Connect tunnel. The results were as follows:
**Throughput, 10 tests, back to back:**
- Average throughput across all tests: 788.20 Mbits/sec
- Minimum average throughput: 731 Mbits/sec
- Maximum average throughput: 871 Mbits/sec
- Standard Deviation: 38.88 Mbits/sec
**Latency, 100 RTTs:**
- Average: 0.369ms
- Minimum: 0.290ms
- Maximum: 0.473ms
### Testcase 2: `ssh` dialing Coder Connect directly without a `ProxyCommand`
This is what we assume to be the 'best' way to use Coder Connect
**Throughput, 10 tests, back to back:**
- Average throughput across all tests: 789.50 Mbits/sec
- Minimum average throughput: 708 Mbits/sec
- Maximum average throughput: 839 Mbits/sec
- Standard Deviation: 39.98 Mbits/sec
**Latency, 100 RTTs:**
- Average: 0.369ms
- Minimum: 0.267ms
- Maximum: 0.440ms
### Testcase 3: `coder ssh` `ProxyCommand` that creates its own Tailnet connection in-process
This is what normally happens when you run `coder ssh`:
**Throughput, 10 tests, back to back:**
- Average throughput across all tests: 610.20 Mbits/sec
- Minimum average throughput: 569 Mbits/sec
- Maximum average throughput: 664 Mbits/sec
- Standard Deviation: 27.29 Mbits/sec
**Latency, 100 RTTs:**
- Average: 0.335ms
- Minimum: 0.262ms
- Maximum: 0.452ms
## Analysis
Performing a two-tailed, unpaired t-test against the throughput of testcases 1 and 2, we find a P value of `0.9450`. This suggests the difference between the data sets is not statistically significant. In other words, there is a 94.5% chance that the difference between the data sets is due to chance.
## Conclusion
From the t-test, and by comparison to the status quo (regular `coder ssh`, which uses gvisor, and is noticeably slower), I think it's safe to say any impact on throughput or latency by the `ProxyCommand` performing a bicopy against Coder Connect is negligible. Users are very much unlikely to run into performance issues as a result of using Coder Connect via `coder ssh`, as implemented in this PR.
Less scientifically, I ran these same tests on my home network with my Sydney workspace, and both throughput and latency were consistent across testcases 1 and 2.
</details>
Addresses https://github.com/coder/internal/issues/322.
This PR starts caching Terraform providers used by `TestProvision` in
`provisioner/terraform/provision_test.go`. The goal is to improve the
reliability of this test by cutting down on the number of network calls
to external services. It leverages GitHub Actions cache, which [on depot
runners is persisted for 14 days by
default](https://depot.dev/docs/github-actions/overview#cache-retention-policy).
Other than the aforementioned `TestProvision`, I couldn't find any other
tests which depend on external terraform providers.
Adds a `coder exp mcp` command which will start a local MCP server
listening on stdio with the following capabilities:
* Show logged in user (`coder whoami`)
* List workspaces (`coder list`)
* List templates (`coder templates list`)
* Start a workspace (`coder start`)
* Stop a workspace (`coder stop`)
* Fetch a single workspace (no direct CLI analogue)
* Execute a command inside a workspace (`coder exp rpty`)
* Report the status of a task (currently a no-op, pending task support)
This can be tested as follows:
```
# Start a local Coder server.
./scripts/develop.sh
# Start a workspace. Currently, creating workspaces is not supported.
./scripts/coder-dev.sh create -t docker --yes
# Add the MCP to your Claude config.
claude mcp add coder ./scripts/coder-dev.sh exp mcp
# Tell Claude to do something Coder-related. You may need to nudge it to use the tools.
claude 'start a docker workspace and tell me what version of python is installed'
```
- Update go.mod to use Go 1.24.1
- Update GitHub Actions setup-go action to use Go 1.24.1
- Fix linting issues with golangci-lint by:
- Updating to golangci-lint v1.57.1 (more compatible with Go 1.24.1)
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: Claude <claude@anthropic.com>
This change adds support for workspace app auditing.
To avoid audit log spam, we introduce the concept of app audit sessions.
An audit session is unique per workspace app, user, ip, user agent and
http status code. The sessions are stored in a separate table from audit
logs to allow use-case specific optimizations. Sessions are ephemeral
and the table does not function as a log.
The logic for auditing is placed in the DBTokenProvider for workspace
apps so that wsproxies are included.
This is the final change affecting the API fo #15139.
Updates #15139
This PR:
- Reduces test parallelism on Windows in CI
- Unifies wait intervals on Windows with Linux and macOS. Previously we
had custom intervals for Windows to reduce test flakiness on smaller CI
workers, but we don't run tests on small CI workers anymore. Due to how
our CI file is defined, forks run tests on small CI machines, but I'm
not sure if the different intervals actually help or whether that's a
heuristic that happened to fix issues on a particular day and was it
ever reevaluated. I propose we make the change and if someone complains,
revert it.
In particular, reduced test parallelism seems to actually help: I was
able to run Windows tests 5 times in a row without flakes. Not sure if
that's going to fix the problem long term, but it seems worth trying.
- Adds `testutil.GoleakOptions` and consolidates existing options to
this location
- Pre-emptively adds required ignore for this Dependabot PR to pass CI
https://github.com/coder/coder/pull/16066
Refactors our use of `slogtest` to instantiate a "standard logger" across most of our tests. This standard logger incorporates https://github.com/coder/slog/pull/217 to also ignore database query canceled errors by default, which are a source of low-severity flakes.
Any test that has set non-default `slogtest.Options` is left alone. In particular, `coderdtest` defaults to ignoring all errors. We might consider revisiting that decision now that we have better tools to target the really common flaky Error logs on shutdown.
- Assert rbac in fake notifications enqueuer
- Move fake notifications enqueuer to separate notificationstest package
- Update dbauthz rbac policy to allow provisionerd and autostart to create and read notification messages
- Update tests as required
Joins in fields like `username`, `avatar_url`, `organization_name`,
`template_name` to `workspaces` via a **view**.
The view must be maintained moving forward, but this prevents needing to
add RBAC permissions to fetch related workspace fields.
Fixes#13910
Adds testutil.GetRandomName that replaces namesgenerator.GetRandomName but instead appends a monotonically increasing integer instead of a number between 1 and 10.
Adds support to Coordination to call SetAllPeersLost() when it is closed. This ensure that when we disconnect from a Coordinator, we set all peers lost.
This covers CoderSDK (CLI client) and Agent. Next PR will cover MultiAgent (notably, `wsproxy`).
* assert provisioner daemon version and api_version in unit tests
* add build info in HTTP header, extract codersdk.BuildVersionHeader
* add api_version to codersdk.ProvisionerDaemon
* testutil.MustString -> testutil.MustRandString
Fixes an issue where remote forwards are not correctly torn down when using OpenSSH with `coder ssh --stdio`. OpenSSH sends a disconnect signal, but then also sends SIGHUP to `coder`. Previously, we just exited when we got SIGHUP, and this raced against properly disconnecting.
Fixes https://github.com/coder/customers/issues/327
* chore: add /v2 to import module path
go mod requires semantic versioning with versions greater than 1.x
This was a mechanical update by running:
```
go install github.com/marwan-at-work/mod/cmd/mod@latest
mod upgrade
```
Migrate generated files to import /v2
* Fix gen
It looks like it is possible for screen to use control sequences instead
of literal newlines which fails the tests.
This reuses the existing readUntil function used in other pty tests.
- (breaking) Protects Logger and LogBodies fields of codersdk.Client with its mutex. This addresses a data race in cli/scaletest.
- Fillets the existing cli/createworkspaces unit test and moves the testing logic there into the tests under scaletest/createworkspaces.
- Adds testutil.RaceEnabled bool const and conditionaly skips previously-skipped tests under scaletest/ if the race detector is enabled. This is unfortunate and sad, but I would prefer to have these tests at least running without the race detector than not running at all.
- Adds IgnoreErrors option to fake in-memory agent loggers; having the agents fail the test immediately when they encounter any sort of error isn't really helpful.
* chore: skip timing-sensistive AgentMetadata test in the standard suite
* Add test-timing target
* fix windows?
* Works on my Windows desktop?
* Use tag system
* fixup! Use tag system
* feat: Add connection_timeout and troubleshooting_url to agent
This commit adds the connection timeout and troubleshooting url fields
to coder agents.
If an initial connection cannot be established within connection timeout
seconds, then the agent status will be marked as `"timeout"`.
The troubleshooting URL will be present, if configured in the Terraform
template, it can be presented to the user when the agent state is either
`"timeout"` or `"disconnected"`.
Fixes#4678
* feat: HA tailnet coordinator
* fixup! feat: HA tailnet coordinator
* fixup! feat: HA tailnet coordinator
* remove printlns
* close all connections on coordinator
* impelement high availability feature
* fixup! impelement high availability feature
* fixup! impelement high availability feature
* fixup! impelement high availability feature
* fixup! impelement high availability feature
* Add replicas
* Add DERP meshing to arbitrary addresses
* Move packages to highavailability folder
* Move coordinator to high availability package
* Add flags for HA
* Rename to replicasync
* Denest packages for replicas
* Add test for multiple replicas
* Fix coordination test
* Add HA to the helm chart
* Rename function pointer
* Add warnings for HA
* Add the ability to block endpoints
* Add flag to disable P2P connections
* Wow, I made the tests pass
* Add replicas endpoint
* Ensure close kills replica
* Update sql
* Add database latency to high availability
* Pipe TLS to DERP mesh
* Fix DERP mesh with TLS
* Add tests for TLS
* Fix replica sync TLS
* Fix RootCA for replica meshing
* Remove ID from replicasync
* Fix getting certificates for meshing
* Remove excessive locking
* Fix linting
* Store mesh key in the database
* Fix replica key for tests
* Fix types gen
* Fix unlocking unlocked
* Fix race in tests
* Update enterprise/derpmesh/derpmesh.go
Co-authored-by: Colin Adler <colin1adler@gmail.com>
* Rename to syncReplicas
* Reuse http client
* Delete old replicas on a CRON
* Fix race condition in connection tests
* Fix linting
* Fix nil type
* Move pubsub to in-memory for twenty test
* Add comment for configuration tweaking
* Fix leak with transport
* Fix close leak in derpmesh
* Fix race when creating server
* Remove handler update
* Skip test on Windows
* Fix DERP mesh test
* Wrap HTTP handler replacement in mutex
* Fix error message for relay
* Fix API handler for normal tests
* Fix speedtest
* Fix replica resend
* Fix derpmesh send
* Ping async
* Increase wait time of template version jobd
* Fix race when closing replica sync
* Add name to client
* Log the derpmap being used
* Don't connect if DERP is empty
* Improve agent coordinator logging
* Fix lock in coordinator
* Fix relay addr
* Fix race when updating durations
* Fix client publish race
* Run pubsub loop in a queue
* Store agent nodes in order
* Fix coordinator locking
* Check for closed pipe
Co-authored-by: Colin Adler <colin1adler@gmail.com>