Relates to https://github.com/coder/internal/issues/1217
Adds a background goroutine in `--stdio` mode to check if the parent PID
is still alive and exit if it is no longer present.
🤖 Implemented using Mux + Claude Opus 4.5, reviewed and refactored by
me.
Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example:
```
import (
"context"
"time"
"github.com/prometheus/client_golang/prometheus"
"golang.org/x/xerrors"
"gopkg.in/natefinch/lumberjack.v2"
"cdr.dev/slog/v3"
"github.com/coder/coder/v2/codersdk/agentsdk"
"github.com/coder/serpent"
)
```
3 groups: standard library, 3rd partly libs, Coder libs.
This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.
Upgrades to slog v3 which includes a small, but backward incompatible API change to the acceptible call arguments when logging. This change allows us to verify via compile time type checking that arguments are correct and won't cause a panic, as was possible in slog v1, which this replaces (v2 was tagged but never used in coder/coder).
It also updates dependencies that also use slog and were updated.
I've left the `aibridge` dependency as a commit SHA, under the assumption that the team there (cc @pawbana @dannykopping ) will tag and update the dependency soon and on their own schedule.
Other dependencies, I pushed new tags.
If you were to somehow get a 401, or some other unexpected HTTP status code when following a workspace's build logs, `coder ssh` would swallow the error, and give you a different error that didn't make sense.
In this case I was getting a 401 on `/templateversions/{templateversion}/dry-run/{jobID}/logs` , but the CLI error would say the workspace had no agents.
I ran into the 401 when running a scaletest, and then whilst attempting to reproduce the issue locally, I ran `coder ssh` from one build of Coder that used `coder_session_token` as the session token cookie name, whilst the other build used `dev_coder_session_token` (as set by `develop.sh`). For reference, the CLI uses cookies when following the build logs.
Adds CompletionHandler to the ssh command that dynamically suggests
workspace and agent targets based on the user's running workspaces.
Features:
- Suggests workspace name for single-agent workspaces
- Suggests agent.workspace format for all agents in multi-agent
workspaces
- Only shows running workspaces (matches immediate availability)
- Alphabetically sorted completions for better UX
Tests cover single-agent, multi-agent, and network error scenarios.
Amp-Thread-ID:
https://ampcode.com/threads/T-d137d343-53f3-4ece-be5a-584249bbd9e8
<!--
If you have used AI to produce some or all of this PR, please ensure you
have read our [AI Contribution
guidelines](https://coder.com/docs/about/contributing/AI_CONTRIBUTING)
before submitting.
-->
closes#20158
Demo:
https://github.com/user-attachments/assets/e1000463-ded6-4bc9-b013-61780453f019
---------
Co-authored-by: Ethan Dickson <ethan@coder.com>
Closes#19812
## Problem
> When I try to SSH into my workspace with multiple agents. It does not
provide an intuitive way to do that successfully and instead misguides
by printing wrong instructions.
This PR enhances the error handling to provide suggestions with SSH
commands that users can copy and paste directly.
Before:
```
Encountered an error running "coder ssh", see "coder ssh --help" for more information
error: multiple agents found, please specify the agent name, available agents: [coder dev]
```
After:
```
Encountered an error running "coder ssh", see "coder ssh --help" for more information
error: multiple agents found, please specify the agent name, available agents: [coder dev]
Try running:
$ ssh coder.dogfood.me.coder
$ ssh dev.dogfood.me.coder
```
Refactors the CLI to create the `*codersdk.Client` in the handlers. This is groundwork for changing the `rootCmd.InitClient()` to use the new `ClientOption`s.
It also improves variable locality, scoping the Client to the handler. This makes misuse less likely and reduces the memory allocations to just the command being executed, rather than allocating a Client for every command regardless of whether it is executed.
Fixes https://github.com/coder/internal/issues/907
We convert `workspacesdk.AgentConn` to an interface and generate a mock
for it. This allows writing `coderd` tests that rely on the agent's HTTP
api to not have to set up an entire tailnet networking stack.
This pull request introduces support for external workspace management, allowing users to register and manage workspaces that are provisioned and managed outside of the Coder.
* coder external-workspaces create - Creates a new external workspace (this command extends coder create)
* Example: coder external-workspaces create ext-workspace --template=externally-managed-workspace -y
* Checks if template has coder_external_agent resource before creating a workspace
* coder external-workspaces list - Lists all external workspaces
* coder external-workspaces agent-instructions <workspace name> <agent name> - Retrieves agent connection instruction
* Example: coder external-workspaces agent-instructions ext-workspace main --output=json
This PR introduces new build reason values to identify what type of
connection triggered a workspace build, helping to troubleshoot
workspace-related issues.
## Database Migration
Added migration 000349_extend_workspace_build_reason.up.sql that extends
the build_reason enum with new values:
```
dashboard, cli, ssh_connection, vscode_connection, jetbrains_connection
```
## Implementation
The build reason is specified through the API when creating new
workspace builds:
- Dashboard: Automatically sets reason to `dashboard` when users start
workspaces via the web interface
- CLI `start` command: Sets reason to `cli` when workspaces are started
via the command line
- CLI `ssh` command: Sets reason to ssh_connection when workspaces are
started due to SSH connections
- VS Code connections: Will be set to `vscode_connection` by the VS Code
extension through CLI hidden flag
(https://github.com/coder/vscode-coder/pull/550)
- JetBrains connections: Will be set to `jetbrains_connection` by the
Jetbrains Toolbox
(https://github.com/coder/coder-jetbrains-toolbox/pull/150) and
Jetbrains Gateway extension
(https://github.com/coder/jetbrains-coder/pull/561)
## UI Changes:
* Tooltip with reason in Build history
<img width="309" height="457" alt="image"
src="https://github.com/user-attachments/assets/bde8440b-bf3b-49a1-a244-ed7e8eb9763c"
/>
* Reason in Audit Logs Row tooltip
<img width="906" height="237" alt="image"
src="https://github.com/user-attachments/assets/ebbb62c7-cf07-4398-afbf-323c83fb6426"
/>
<img width="909" height="188" alt="image"
src="https://github.com/user-attachments/assets/1ddbab07-44bf-4dee-8867-b4e2cd56ae96"
/>
This change allows a devcontainer to be opened via the agent syntax,
`coder open vscode <workspace>.<agent>` and removes the `--container`
option to simplify the subcommand. Accessing the subagent will behave
similarly to how the `--container` option behaved.
Fixescoder/internal#748
In the past we randomly selected workspace agent if there were multiple.
Unless both are running on the same machine with the same configuration,
this would be very confusing behavior for a user.
With the introduction of sub agents (devcontainer agents), we have now
made this an error state and require the specifying of agent when there
is more than one (either normal agent or sub agent).
This aligns with the behavior of e.g. Coder Desktop.
Fixescoder/internal#696
Closes#17982.
The purpose of this PR is to expose network latency via the API used by Coder Desktop.
This PR has the tunnel ping all known agents every 5 seconds, in order to produce an instance of:
```proto
message LastPing {
// latency is the RTT of the ping to the agent.
google.protobuf.Duration latency = 1;
// did_p2p indicates whether the ping was sent P2P, or over DERP.
bool did_p2p = 2;
// preferred_derp is the human readable name of the preferred DERP region,
// or the region used for the last ping, if it was sent over DERP.
string preferred_derp = 3;
// preferred_derp_latency is the last known latency to the preferred DERP
// region. Unset if the region does not appear in the DERP map.
optional google.protobuf.Duration preferred_derp_latency = 4;
}
```
The contents of this message are stored and included on all subsequent upsertions of the agent.
Note that we upsert existing agents every 5 seconds to update the `last_handshake` value.
On the desktop apps, this message will be used to produce a tooltip similar to that of the VS Code extension:
<img width="495" alt="image" src="https://github.com/user-attachments/assets/d8b65f3d-f536-4c64-9af9-35c1a42c92d2" />
(wording not final)
Unlike the VS Code extension, we omit:
- The Latency of *all* available DERP regions. It seems not ideal to send a copy of this entire map for every online agent, and it certainly doesn't make sense for it to be on the `Agent` or `LastPing` message.
If we do want to expose this info on Coder Desktop, we should consider how best to do so; maybe we want to include it on a more generic `Netcheck` message.
- The current throughput (Bytes up/down). This is out of scope of the linked issue, and is non-trivial to implement. I'm also not sure of the value given the frequency we're doing these status updates (every 5 seconds).
If we want to expose it, it'll be in a separate PR.
<img width="343" alt="image" src="https://github.com/user-attachments/assets/8447d03b-9721-4111-8ac1-332d70a1e8f1" />
Closes#18088.
The linked issue is misleading -- `coder config-ssh` continues to support the `coder.` prefix. The reason the command
`ssh coder.workspace.agent` fails is because `coder ssh workspace.agent` wasn't supported. This PR fixes that.
We know we used to support `workspace.agent`, as this is what we recommend in the Web UI:

This PR also adds support for `coder ssh agent.workspace.owner`, such that after running `coder config-ssh`, a command like
```
ssh agent.workspace.owner.coder
```
works, even without Coder Connect running. This is done for parity with an existing workflow that uses `ssh workspace.coder`, which either uses Coder Connect if available, or the CLI.
The regular network info file creation code also calls `Mkdirall`.
Wasn't picked up in manual testing as I already had the `/net` folder in
my VSCode.
Wasn't picked up in automated testing because we use an in-memory FS,
which for some reason does this implicitly.
Closes https://github.com/coder/vscode-coder/issues/447
Closes https://github.com/coder/jetbrains-coder/issues/543
Closes https://github.com/coder/coder-jetbrains-toolbox/issues/21
This PR adds Coder Connect support to `coder ssh --stdio`.
When connecting to a workspace, if `--force-new-tunnel` is not passed, the CLI will first do a DNS lookup for `<agent>.<workspace>.<owner>.<hostname-suffix>`. If an IP address is returned, and it's within the Coder service prefix, the CLI will not create a new tailnet connection to the workspace, and instead dial the SSH server running on port 22 on the workspace directly over TCP.
This allows IDE extensions to use the Coder Connect tunnel, without requiring any modifications to the extensions themselves.
Additionally, `using_coder_connect` is added to the `sshNetworkStats` file, which the VS Code extension (and maybe Jetbrains?) will be able to read, and indicate to the user that they are using Coder Connect.
One advantage of this approach is that running `coder ssh --stdio` on an offline workspace with Coder Connect enabled will have the CLI wait for the workspace to build, the agent to connect (and optionally, for the startup scripts to finish), before finally connecting using the Coder Connect tunnel.
As a result, `coder ssh --stdio` has the overhead of looking up the workspace and agent, and checking if they are running. On my device, this meant `coder ssh --stdio <workspace>` was approximately a second slower than just connecting to the workspace directly using `ssh <workspace>.coder` (I would assume anyone serious about their Coder Connect usage would know to just do the latter anyway).
To ensure this doesn't come at a significant performance cost, I've also benchmarked this PR.
<details>
<summary>Benchmark</summary>
## Methodology
All tests were completed on `dev.coder.com`, where a Linux workspace running in AWS `us-west1` was created.
The machine running Coder Desktop (the 'client') was a Windows VM running in the same AWS region and VPC as the workspace.
To test the performance of specifically the SSH connection, a port was forwarded between the client and workspace using:
```
ssh -p 22 -L7001:localhost:7001 <host>
```
where `host` was either an alias for an SSH ProxyCommand that called `coder ssh`, or a Coder Connect hostname.
For latency, [`tcping`](https://www.elifulkerson.com/projects/tcping.php) was used against the forwarded port:
```
tcping -n 100 localhost 7001
```
For throughput, [`iperf3`](https://iperf.fr/iperf-download.php) was used:
```
iperf3 -c localhost -p 7001
```
where an `iperf3` server was running on the workspace on port 7001.
## Test Cases
### Testcase 1: `coder ssh` `ProxyCommand` that bicopies from Coder Connect
This case tests the implementation in this PR, such that we can write a config like:
```
Host codercliconnect
ProxyCommand /path/to/coder ssh --stdio workspace
```
With Coder Connect enabled, `ssh -p 22 -L7001:localhost:7001 codercliconnect` will use the Coder Connect tunnel. The results were as follows:
**Throughput, 10 tests, back to back:**
- Average throughput across all tests: 788.20 Mbits/sec
- Minimum average throughput: 731 Mbits/sec
- Maximum average throughput: 871 Mbits/sec
- Standard Deviation: 38.88 Mbits/sec
**Latency, 100 RTTs:**
- Average: 0.369ms
- Minimum: 0.290ms
- Maximum: 0.473ms
### Testcase 2: `ssh` dialing Coder Connect directly without a `ProxyCommand`
This is what we assume to be the 'best' way to use Coder Connect
**Throughput, 10 tests, back to back:**
- Average throughput across all tests: 789.50 Mbits/sec
- Minimum average throughput: 708 Mbits/sec
- Maximum average throughput: 839 Mbits/sec
- Standard Deviation: 39.98 Mbits/sec
**Latency, 100 RTTs:**
- Average: 0.369ms
- Minimum: 0.267ms
- Maximum: 0.440ms
### Testcase 3: `coder ssh` `ProxyCommand` that creates its own Tailnet connection in-process
This is what normally happens when you run `coder ssh`:
**Throughput, 10 tests, back to back:**
- Average throughput across all tests: 610.20 Mbits/sec
- Minimum average throughput: 569 Mbits/sec
- Maximum average throughput: 664 Mbits/sec
- Standard Deviation: 27.29 Mbits/sec
**Latency, 100 RTTs:**
- Average: 0.335ms
- Minimum: 0.262ms
- Maximum: 0.452ms
## Analysis
Performing a two-tailed, unpaired t-test against the throughput of testcases 1 and 2, we find a P value of `0.9450`. This suggests the difference between the data sets is not statistically significant. In other words, there is a 94.5% chance that the difference between the data sets is due to chance.
## Conclusion
From the t-test, and by comparison to the status quo (regular `coder ssh`, which uses gvisor, and is noticeably slower), I think it's safe to say any impact on throughput or latency by the `ProxyCommand` performing a bicopy against Coder Connect is negligible. Users are very much unlikely to run into performance issues as a result of using Coder Connect via `coder ssh`, as implemented in this PR.
Less scientifically, I ran these same tests on my home network with my Sydney workspace, and both throughput and latency were consistent across testcases 1 and 2.
</details>
There were too many ways to configure the agentcontainers API resulting
in inconsistent behavior or features not being enabled. This refactor
introduces a control flag for enabling or disabling the containers API.
When disabled, all implementations are no-op and explicit endpoint
behaviors are defined. When enabled, concrete implementations are used
by default but can be overridden by passing options.
Adds `hostname-suffix` flag to `coder ssh` command for use in SSH Config ProxyCommands.
Also enforces that Coder server doesn't start the suffix with a dot.
part of: #16828
Fixes an issue where old ssh configs that use the
`owner--workspace--agent` format will fail to properly use the `coder
ssh` command since we migrated off the `coder vscodessh` command.
- Update go.mod to use Go 1.24.1
- Update GitHub Actions setup-go action to use Go 1.24.1
- Fix linting issues with golangci-lint by:
- Updating to golangci-lint v1.57.1 (more compatible with Go 1.24.1)
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: Claude <claude@anthropic.com>
This adds a flag matching `--ssh-host-prefix` from `coder config-ssh` to
`coder ssh`. By trimming a custom prefix from the argument, we can set
up wildcard-based `Host` entries in SSH config for the IDE plugins (and
eventually `coder config-ssh`).
We also replace `--` in the argument with `/`, so ownership can be
specified in wildcard-based SSH hosts like `<owner>--<workspace>`.
Replaces #16087.
Part of https://github.com/coder/coder/issues/14986.
Related to https://github.com/coder/coder/pull/16078 and
https://github.com/coder/coder/pull/16080.
Part of bringing `coder ssh` to parity with `coder vscodessh` is
associating the log files with a particular parent process (in this
case, the ssh process that spawned the coder CLI via `ProxyCommand`).
`coder vscodessh` named log files using the parent PID, but coder ssh is
missing this. Add the parent PID to the log file name when used in stdio
mode so that the VS Code extension will be able to identify the correct
log file.
See also #16078.
This is the first in a series of PRs to enable `coder ssh` to replace
`coder vscodessh`.
This change adds `--network-info-dir` and `--network-info-interval`
flags to the `ssh` subcommand. These were formerly only available with
the `vscodessh` subcommand.
Subsequent PRs will add a `--ssh-host-prefix` flag to the ssh
subcommand, and adjust the log file naming to contain the parent PID.
Refactor autobuild/notify and tests to use the clock testing library.
I also rewrote some of the comments because I didn't understand them when I was looking at the package.
Currently, importing `codersdk` just to interact with the API requires
importing tailscale, which causes builds to fail unless manually using
our fork.
* fix: separate signals for passive, active, and forced shutdown
`SIGTERM`: Passive shutdown stopping provisioner daemons from accepting new
jobs but waiting for existing jobs to successfully complete.
`SIGINT` (old existing behavior): Notify provisioner daemons to cancel in-flight jobs, wait 5s for jobs to be exited, then force quit.
`SIGKILL`: Untouched from before, will force-quit.
* Revert dramatic signal changes
* Rename
* Fix shutdown behavior for provisioner daemons
* Add test for graceful shutdown
I noticed in my logs that sometimes `coder ssh` doesn't gracefully disconnect from the coordinator.
The cause is the `closerStack` construct we use in that function. It has two paths to start closing things down:
1. explicit `close()` which we do in `defer`
2. context cancellation, which happens if the cli function returns an error
sometimes the ssh remote command returns an error, and this triggers context cancellation of the `closerStack`. That is fine in and of itself, but we still want the explicit `close()` to wait until everything is closed before returning, since that's where we do cleanup, including the graceful disconnect. Prior to this fix the `close()` just immediately exits if another goroutine is closing the stack. Here we add a wait until everything is done.
Beginnings of a solution to #12297
Doesn't cover disco or definitively display whether we successfully connected to DERP, but shows some checklist diagnostics for connecting to an agent.
For this first PR, I just added it to `coder ping` to see how we like it, but could be incorporated into `coder ssh` _et al._ after a timeout.
```
$ coder ping dogfood2
p2p connection established in 147ms
pong from dogfood2 p2p via 95.217.xxx.yyy:42631 in 147ms
pong from dogfood2 p2p via 95.217.xxx.yyy:42631 in 140ms
pong from dogfood2 p2p via 95.217.xxx.yyy:42631 in 140ms
✔ preferred DERP region 999 (Council Bluffs, Iowa)
✔ sent local data to Coder networking coodinator
✔ received remote agent data from Coder networking coordinator
preferred DERP 10013 (Europe Fly.io (Paris))
endpoints: 95.217.xxx.yyy:42631, 95.217.xxx.yyy:37576, 172.17.0.1:37576, 172.20.0.10:37576
✔ Wireguard handshake 11s ago
```