Closes https://github.com/coder/internal/issues/466
```
$ dig -6 @fd60:627a:a42b::53 is.coder--connect--enabled--right--now.coder AAAA
; <<>> DiG 9.10.6 <<>> -6 @fd60:627a:a42b::53 is.coder--connect--enabled--right--now.coder AAAA
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62390
;; flags: qr aa rd ra ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;is.coder--connect--enabled--right--now.coder. IN AAAA
;; ANSWER SECTION:
is.coder--connect--enabled--right--now.coder. 2 IN AAAA fd60:627a:a42b::53
;; Query time: 3 msec
;; SERVER: fd60:627a:a42b::53#53(fd60:627a:a42b::53)
;; WHEN: Wed Apr 09 16:59:18 AEST 2025
;; MSG SIZE rcvd: 134
```
Hostname considerations:
- Workspace names, usernames, and agent names can't have double hyphens, so this name can't conflict with a real Coder Connect hostname.
- Components can't start or end with hyphens according to [RFC 952](https://www.rfc-editor.org/rfc/rfc952.html)
- DNS records can't have hyphens in the 3rd and 4th positions, as to not conflict with IDNs https://datatracker.ietf.org/doc/html/rfc5891
- Update go.mod to use Go 1.24.1
- Update GitHub Actions setup-go action to use Go 1.24.1
- Fix linting issues with golangci-lint by:
- Updating to golangci-lint v1.57.1 (more compatible with Go 1.24.1)
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: Claude <claude@anthropic.com>
Updating golden files is an unnecessary extra step in addition to gen
that is easily overlooked, leading to the developer noticing the issue
in CI leading to lost developer time waiting for tests to complete.
The experimental functions in `golang.org/x/exp/slices` are now
available in the standard library since Go 1.21.
Reference: https://go.dev/doc/go1.21#slices
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
Fixes: https://github.com/coder/internal/issues/377
Added an additional SSH listener on port 22, so the agent now listens on both, port one and port 22.
---
Change-Id: Ifd986b260f8ac317e37d65111cd4e0bd1dc38af8
Signed-off-by: Thomas Kosiewski <tk@coder.com>
Addresses https://github.com/coder/coder/issues/16231.
This PR reduces the volume of logs we print after server startup in
order to surface the web UI URL better.
Here are the logs after the changes a couple of seconds after starting
the server:
<img width="868" alt="Screenshot 2025-02-18 at 16 31 32"
src="https://github.com/user-attachments/assets/786dc4b8-7383-48c8-a5c3-a997c01ca915"
/>
The warning is due to running a development site-less build. It wouldn't
show in a release build.
This change adds a new `ReportConnection` endpoint to the `agentapi`.
The protocol version was bumped previously, so it has been omitted here.
This allows the agent to report connection events, for example when the
user connects to the workspace via SSH or VS Code.
Updates #15139
Closes https://github.com/coder/coder-desktop-macos/issues/54
I've also double checked that agents with hyphens & underscores play nice once programmed, as do workspaces with hyphens:
```
$ ping6 main_agent-1.main-workspace.admin.coder
PING6(56=40+8+8 bytes) fd60:627a:a42b:4e91:88c0:da4a:df4f:b54e --> fd60:627a:a42b:46d4:8b55:e549:e498:e6f5
```
also fine in Firefox & Safari, though I'm a little surprised underscores work.
As part of the new resources monitoring logic - more specifically for
OOM & OOD Notifications , we need to update the AgentAPI , and the
agents logic.
This PR aims to do it, and more specifically :
We are updating the AgentAPI & TailnetAPI to version 24 to add two new
methods in the AgentAPI :
- One method to fetch the resources monitoring configuration
- One method to push the datapoints for the resources monitoring.
Also, this PR adds a new logic on the agent side, with a routine running
and ticking - fetching the resources usage each time , but also storing
it in a FIFO like queue.
Finally, this PR fixes a problem we had with RBAC logic on the resources
monitoring model, applying the same logic than we have for similar
entities.
Replace Depot build action with Nix for Nix dogfood image builds
The dogfood Nix image is now built using Nix's native container tooling instead of Depot. This change:
- Adds Nix setup steps to the GitHub Actions workflow
- Removes the Dockerfile.nix in favor of a Nix-native container build
- Updates the flake.nix to support building Docker images
- Introduces a hash file to track Nix-related changes
- Updates the vendorHash for Go dependencies
Change-Id: I4e011fe3a19d9a1375fbfd5223c910e59d66a5d9
Signed-off-by: Thomas Kosiewski <tk@coder.com>
- Adds `testutil.GoleakOptions` and consolidates existing options to
this location
- Pre-emptively adds required ignore for this Dependabot PR to pass CI
https://github.com/coder/coder/pull/16066
Migrates us to `coder/websocket` v1.8.12 rather than `nhooyr/websocket` on an older version.
Works around https://github.com/coder/websocket/issues/504 by adding an explicit test for `xerrors.Is(err, io.EOF)` where we were previously getting `io.EOF` from the netConn.
Addresses #14734.
This PR wires up `tunnel.go` to a `tailnet.Conn` via the new `/tailnet` endpoint, with all the necessary controllers such that a VPN connection can be started, stopped and inspected via the CoderVPN protocol.
fixes: https://github.com/coder/internal/issues/217
> There are a couple problems:
>
> One is that we assert the RPCs succeed, but if the pipeDialer context is canceled at the end of the test, then these assertions happen after the test is officially complete, which panics and affects other tests.
This converts these to just return the error rather than assert.
> The other is that the retrier is slightly bugged: if the current retry delay is 0 AND the ctx is done, (e.g. after successfully connecting and then gracefully disconnecting), then retrier.Wait(c.ctx) is racy and could return either true or false.
Fixes the phantom redial by explicitly checking the context before dialing. Also, in the test, we assert that the controller is closed before completing the test.
Fixes a test flake on TestTailnet_ForcesWebsockets like:
```
t.go:106: 2024-11-18 07:44:25.939 [debu] w2: dial tcp addr_port="[fd7a:115c:a1e0:46cc:bd8e:400d:1bc6:f6ac]:35565"
t.go:106: 2024-11-18 07:44:25.943 [debu] w1.net.netstack: netstack: could not connect to local server at 127.0.0.1:35565 (or [::1]:35565)%!(EXTRA *net.OpError=dial tcp [::1]:35565: connect: connection refused)
conn_test.go:146:
Error Trace: /Users/spike/repos/coder/tailnet/conn_test.go:146
Error: Received unexpected error:
connect tcp [fd7a:115c:a1e0:46cc:bd8e:400d:1bc6:f6ac]:35565: connection was refused
Test: TestTailnet/ForcesWebSockets
t.go:106: 2024-11-18 07:44:25.945 [info] w1: closing tailnet Conn
t.go:106: 2024-11-18 07:44:25.945 [debu] w1: closing configMaps configLoop
t.go:106: 2024-11-18 07:44:25.945 [debu] w1: closing nodeUpdater updateLoop
t.go:106: 2024-11-18 07:44:25.945 [debu] w1: closed netstack
conn_test.go:135:
Error Trace: /Users/spike/repos/coder/tailnet/conn_test.go:135
/Users/spike/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.8.darwin-arm64/src/runtime/asm_arm64.s:1222
Error: Received unexpected error:
connection closed:
github.com/coder/coder/v2/tailnet.init
<autogenerated>:1
Test: TestTailnet/ForcesWebSockets
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0x0 pc=0x1039771dc]
goroutine 2224 [running]:
github.com/coder/coder/v2/tailnet_test.TestTailnet.func3.2()
/Users/spike/repos/coder/tailnet/conn_test.go:136 +0x7c
created by github.com/coder/coder/v2/tailnet_test.TestTailnet.func3 in goroutine 109
/Users/spike/repos/coder/tailnet/conn_test.go:133 +0x7dc
```
Test didn't synchronize listening on the port before dialing it.
It also has a nil pointer deference when the test fails, which causes a bunch of unrelated output. Also fixed.
Refactors our use of `slogtest` to instantiate a "standard logger" across most of our tests. This standard logger incorporates https://github.com/coder/slog/pull/217 to also ignore database query canceled errors by default, which are a source of low-severity flakes.
Any test that has set non-default `slogtest.Options` is left alone. In particular, `coderdtest` defaults to ignoring all errors. We might consider revisiting that decision now that we have better tools to target the really common flaky Error logs on shutdown.
closes#14730
Adds support for WorkspaceUpdates to the WebsocketDialer. This allows us to dial the new endpoint added in #14847 and connect it up to a `tailnet.Controllers` to connect to all agents over the tailnet.
I refactored the fakeWorkspaceUpdatesProvider to a mock and moved it to `tailnettest` so it could be more easily reused. The Mock is a little more full-featured.
re: #14730
Adds support in `tailnet.Controller` for WorkspaceUpdates.
Also checks configured controllers against the clients returned by the dialer, so that if we connect with a dialer that doesn't support an RPC (for instance the in-memory dialer for ServerTailnet doesn't support WorkspaceUpdates), we throw an error if there is a controller expecting it.
Bumps the Tailnet and Agent API version 2.3, and creates some extra controls and machinery around these versions.
What happened is that we accidentally shipped two new API features without bumping the version. `ScriptCompleted` on the Agent API in Coder v2.16 and `RefreshResumeToken` on the Tailnet API in Coder v2.15.
Since we can't easily retroactively bump the versions, we'll roll these changes into API version 2.3 along with the new WorkspaceUpdates RPC, which hasn't been released yet. That means there is some ambiguity in Coder v2.15-v2.17 about exactly what methods are supported on the Tailnet and Agent APIs. This isn't great, but hasn't caused us major issues because
1. RefreshResumeToken is considered optional, and clients just log and move on if the RPC isn't supported.
2. Agents basically never get started talking to a Coderd that is older than they are, since the agent binary is normally downloaded from Coderd at workspace start.
Still it's good to get things squared away in terms of versions for SDK users and possible edge cases around client and server versions.
To mitigate against this thing happening again, this PR also:
1. adds a CODEOWNERS for the API proto packages, so I'll review changes
2. defines interface types for different API versions, and has the agent explicitly use a specific version. That way, if you add a new method, and try to use it in the agent without thinking explicitly about versions, it won't compile.
With the protocol controllers stuff, we've sort of already abstracted the Tailnet API such that the interface type strategy won't work, but I'll work on getting the Controller to be version aware, such that it can check the API version it's getting against the controllers it has -- in a later PR.
re: #14730
Adds support for the workspace updates protocol controller to also program DNS names for each agent.
Right now, we only program names like `myagent.myworkspace.me.coder` and `myworkspace.coder.` (if there is exactly one agent in the workspace). We also want to support `myagent.myworkspace.username.coder.`, but for that we need to update WorkspaceUpdates RPC to also send the workspace owner's username, which will be in a separate PR.
re: #14730
Adds a protocol controller for WorkspaceUpdates RPC that takes all the agents we learn about over the RPC, and programs them into the Coordination controller, so that we set up tunnels to all the agents.
Handling DNS is in a PR up the stack, as is actually wiring it up to anything.
Closes#14729
Expands the Coordination controller used by the CLI client to allow multiple tunnel destinations (agents). Our current client uses just one, but this unifies the logic so that when we add Coder VPN, 1 is just a special case of "many."
chore of #14729
Refactors the `ServerTailnet` to use `tailnet.Controller` so that we reuse logic around reconnection and handling control messages, instead of reimplementing. This unifies our "client" use of the tailscale API across CLI, coderd, and wsproxy.
refactors `tailnetAPIConnector` to use the `Dialer` interface in `tailnet`, introduced lower in this stack of PRs. This will let us use the same Tailnet API handling code across different things that connect to the Tailnet API (CLI client, coderd, workspace proxies, and soon: Coder VPN).
chore re: #14729
Refactors the way clients of the Tailnet API (clients of the API, which include both workspace "agents" and "clients") interact with the API. Introduces the idea of abstract "controllers" for each of the RPCs in the API, and implements a Coordination controller by refactoring from `workspacesdk`.
chore re: #14729
Closes#14716Closes#14717
Adds a new user-scoped tailnet API endpoint (`api/v2/tailnet`) with a new RPC stream for receiving updates on workspaces owned by a specific user, as defined in #14716.
When a stream is started, the `WorkspaceUpdatesProvider` will begin listening on the user-scoped pubsub events implemented in #14964. When a relevant event type is seen (such as a workspace state transition), the provider will query the DB for all the workspaces (and agents) owned by the user. This gets compared against the result of the previous query to produce a set of workspace updates.
Workspace updates can be requested for any user ID, however only workspaces the authorised user is permitted to `ActionRead` will have their updates streamed.
Opening a tunnel to an agent requires that the user can perform `ActionSSH` against the workspace containing it.
fixes#14715
Configures agents to use an address both in the Tailscale service prefix and the new Coder service prefix. Also modifies the Coordinator auth to allow the new prefix.
Updates `coder/tailscale` to include https://github.com/coder/tailscale/pull/62 which fixes a bug around forwarding TCP connections to localhost. This functionality is tested in the modifications to `TestAgent_Dial`.
re: #14715
This PR introduces the Coder service prefix: `fd60:627a:a42b::/48` and refactors our existing code as calling the Tailscale service prefix explicitly (rather than implicitly).
Removes the unused `Addresses` agent option. All clients today assume they can compute the Agent's IP address based on its UUID, so an agent started with a custom address would break things.
Fixes#12560
When gracefully disconnecting from the coordinator, we would send the Disconnect message and then close the dRPC stream. However, closing the dRPC stream can cause the server not to process the Disconnect message, since we use the stream context in a `select` while sending it to the coordinator.
This is a product bug uncovered by the flake, and probably results in us failing graceful disconnect some minority of the time.
Instead, the `remoteCoordination` (and `inMemoryCoordination` for consistency) should send the Disconnect message and then wait for the coordinator to hang up (on some graceful disconnect timer, in the form of a context).
Drops support for v1 of the tailnet API, which was the original coordination protocol where we only sent node updates, never marked them lost or disconnected.
v2 of the tailnet API went GA for CLI clients in Coder 2.8.0, so clients older than that would stop working.
Removes the support for the Agent's "legacy IP" which was a hardcoded IP address all agents used to use, before we introduced "single tailnet". Single tailnet went GA in 2.7.0.
fixes#14365
I bet what's going on is that in `connectToCoordinatorAndFetchResumeToken()` we call `Coordinate()`, send a message on the `Coordinate` client and then close it in rapid succession. We don't wait around for a response from the coordinator, so dRPC is likely aborting the call `Coordinate()` in the backend because the stream is closed before it even gets a chance.
Instead of using the Coordinator to record the peer ID assigned on the API call, we can wrap the resume token provider, since we call that API _and_ wait for a response. This also affords the opportunity to directly assert we get called with the right token.