Commit Graph

208 Commits

Author SHA1 Message Date
Mathias Fredriksson c069563af1 test: fix use of t.Logf where t.Log would suffice (#16328) 2025-01-29 14:35:04 +00:00
Dean Sheather 28088165a1 chore: get TUN/DNS working on Windows for CoderVPN (#16310) 2025-01-29 08:09:36 +00:00
Thomas Kosiewski 1336925c9f feat(flake.nix): switch dogfood dev image to buildNixShellImage from dockerTools (#16223)
Replace Depot build action with Nix for Nix dogfood image builds

The dogfood Nix image is now built using Nix's native container tooling instead of Depot. This change:

- Adds Nix setup steps to the GitHub Actions workflow
- Removes the Dockerfile.nix in favor of a Nix-native container build
- Updates the flake.nix to support building Docker images
- Introduces a hash file to track Nix-related changes
- Updates the vendorHash for Go dependencies

Change-Id: I4e011fe3a19d9a1375fbfd5223c910e59d66a5d9
Signed-off-by: Thomas Kosiewski <tk@coder.com>
2025-01-28 16:38:37 +01:00
Cian Johnston 7b88776403 chore(testutil): add testutil.GoleakOptions (#16070)
- Adds `testutil.GoleakOptions` and consolidates existing options to
this location
- Pre-emptively adds required ignore for this Dependabot PR to pass CI
https://github.com/coder/coder/pull/16066
2025-01-08 15:38:37 +00:00
Spike Curtis 2c7f8ac65f chore: migrate to coder/websocket 1.8.12 (#15898)
Migrates us to `coder/websocket` v1.8.12 rather than `nhooyr/websocket` on an older version.

Works around https://github.com/coder/websocket/issues/504 by adding an explicit test for `xerrors.Is(err, io.EOF)` where we were previously getting `io.EOF` from the netConn.
2024-12-19 00:51:30 +04:00
Ethan ba48069325 chore: implement CoderVPN client & tunnel (#15612)
Addresses #14734.

This PR wires up `tunnel.go` to a `tailnet.Conn` via the new `/tailnet` endpoint, with all the necessary controllers such that a VPN connection can be started, stopped and inspected via the CoderVPN protocol.
2024-12-05 13:30:22 +11:00
Spike Curtis 029cd5d064 fix(tailnet): prevent redial after Coord graceful restart (#15586)
fixes: https://github.com/coder/internal/issues/217

> There are a couple problems:
>
> One is that we assert the RPCs succeed, but if the pipeDialer context is canceled at the end of the test, then these assertions happen after the test is officially complete, which panics and affects other tests.

This converts these to just return the error rather than assert.

> The other is that the retrier is slightly bugged: if the current retry delay is 0 AND the ctx is done, (e.g. after successfully connecting and then gracefully disconnecting), then retrier.Wait(c.ctx) is racy and could return either true or false.

Fixes the phantom redial by explicitly checking the context before dialing. Also, in the test, we assert that the controller is closed before completing the test.
2024-11-19 11:37:11 +04:00
Spike Curtis 85c3c4c025 feat(tailnet): add alias with username and short alias to DNS (#15585)
Adds DNS aliases of the form `<agent>.<workspace>.<username>.coder.` and
`<workspace>.coder.`
2024-11-19 11:23:17 +04:00
Spike Curtis 1aaaad998c fix: fix listening flake on TestTailnet_ForcesWebSockets (#15555)
Fixes a test flake on TestTailnet_ForcesWebsockets like:

```
    t.go:106: 2024-11-18 07:44:25.939 [debu]  w2: dial tcp  addr_port="[fd7a:115c:a1e0:46cc:bd8e:400d:1bc6:f6ac]:35565"
    t.go:106: 2024-11-18 07:44:25.943 [debu]  w1.net.netstack: netstack: could not connect to local server at 127.0.0.1:35565 (or [::1]:35565)%!(EXTRA *net.OpError=dial tcp [::1]:35565: connect: connection refused)
    conn_test.go:146:
        	Error Trace:	/Users/spike/repos/coder/tailnet/conn_test.go:146
        	Error:      	Received unexpected error:
        	            	connect tcp [fd7a:115c:a1e0:46cc:bd8e:400d:1bc6:f6ac]:35565: connection was refused
        	Test:       	TestTailnet/ForcesWebSockets
    t.go:106: 2024-11-18 07:44:25.945 [info]  w1: closing tailnet Conn
    t.go:106: 2024-11-18 07:44:25.945 [debu]  w1: closing configMaps configLoop
    t.go:106: 2024-11-18 07:44:25.945 [debu]  w1: closing nodeUpdater updateLoop
    t.go:106: 2024-11-18 07:44:25.945 [debu]  w1: closed netstack
    conn_test.go:135:
        	Error Trace:	/Users/spike/repos/coder/tailnet/conn_test.go:135
        	            				/Users/spike/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.8.darwin-arm64/src/runtime/asm_arm64.s:1222
        	Error:      	Received unexpected error:
        	            	connection closed:
        	            	    github.com/coder/coder/v2/tailnet.init
        	            	        <autogenerated>:1
        	Test:       	TestTailnet/ForcesWebSockets
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0x0 pc=0x1039771dc]

goroutine 2224 [running]:
github.com/coder/coder/v2/tailnet_test.TestTailnet.func3.2()
	/Users/spike/repos/coder/tailnet/conn_test.go:136 +0x7c
created by github.com/coder/coder/v2/tailnet_test.TestTailnet.func3 in goroutine 109
	/Users/spike/repos/coder/tailnet/conn_test.go:133 +0x7dc
```

Test didn't synchronize listening on the port before dialing it. 

It also has a nil pointer deference when the test fails, which causes a bunch of unrelated output. Also fixed.
2024-11-18 16:05:16 +04:00
Spike Curtis 5861e516b9 chore: add standard test logger ignoring db canceled (#15556)
Refactors our use of `slogtest` to instantiate a "standard logger" across most of our tests.  This standard logger incorporates https://github.com/coder/slog/pull/217 to also ignore database query canceled errors by default, which are a source of low-severity flakes.

Any test that has set non-default `slogtest.Options` is left alone. In particular, `coderdtest` defaults to ignoring all errors. We might consider revisiting that decision now that we have better tools to target the really common flaky Error logs on shutdown.
2024-11-18 14:09:22 +04:00
Spike Curtis 747f7ce173 feat: add support for WorkspaceUpdates to WebsocketDialer (#15534)
closes #14730

Adds support for WorkspaceUpdates to the WebsocketDialer. This allows us to dial the new endpoint added in #14847 and connect it up to a `tailnet.Controllers` to connect to all agents over the tailnet.

I refactored the fakeWorkspaceUpdatesProvider to a mock and moved it to `tailnettest` so it could be more easily reused.  The Mock is a little more full-featured.
2024-11-18 10:54:11 +04:00
Spike Curtis 16992ee548 feat(tailnet): add workspace updates support to Controller (#15529)
re: #14730

Adds support in `tailnet.Controller` for WorkspaceUpdates.

Also checks configured controllers against the clients returned by the dialer, so that if we connect with a dialer that doesn't support an RPC (for instance the in-memory dialer for ServerTailnet doesn't support WorkspaceUpdates), we throw an error if there is a controller expecting it.
2024-11-18 10:41:19 +04:00
Spike Curtis 40802958e9 fix: use explicit api versions for agent and tailnet (#15508)
Bumps the Tailnet and Agent API version 2.3, and creates some extra controls and machinery around these versions.

What happened is that we accidentally shipped two new API features without bumping the version.  `ScriptCompleted` on the Agent API in Coder v2.16 and `RefreshResumeToken` on the Tailnet API in Coder v2.15.

Since we can't easily retroactively bump the versions, we'll roll these changes into API version 2.3 along with the new WorkspaceUpdates RPC, which hasn't been released yet.  That means there is some ambiguity in Coder v2.15-v2.17 about exactly what methods are supported on the Tailnet and Agent APIs.  This isn't great, but hasn't caused us major issues because 

1. RefreshResumeToken is considered optional, and clients just log and move on if the RPC isn't supported. 
2. Agents basically never get started talking to a Coderd that is older than they are, since the agent binary is normally downloaded from Coderd at workspace start.

Still it's good to get things squared away in terms of versions for SDK users and possible edge cases around client and server versions.

To mitigate against this thing happening again, this PR also:

1. adds a CODEOWNERS for the API proto packages, so I'll review changes
2. defines interface types for different API versions, and has the agent explicitly use a specific version.  That way, if you add a new method, and try to use it in the agent without thinking explicitly about versions, it won't compile.

With the protocol controllers stuff, we've sort of already abstracted the Tailnet API such that the interface type strategy won't work, but I'll work on getting the Controller to be version aware, such that it can check the API version it's getting against the controllers it has -- in a later PR.
2024-11-15 11:16:28 +04:00
Spike Curtis 916df4d411 feat: set DNS hostnames in workspace updates controller (#15507)
re: #14730

Adds support for the workspace updates protocol controller to also program DNS names for each agent.

Right now, we only program names like `myagent.myworkspace.me.coder` and `myworkspace.coder.` (if there is exactly one agent in the workspace).  We also want to support `myagent.myworkspace.username.coder.`, but for that we need to update WorkspaceUpdates RPC to also send the workspace owner's username, which will be in a separate PR.
2024-11-15 11:00:19 +04:00
Spike Curtis 08216aaad6 feat: add workspace updates controller (#15506)
re: #14730

Adds a protocol controller for WorkspaceUpdates RPC that takes all the agents we learn about over the RPC, and programs them into the Coordination controller, so that we set up tunnels to all the agents.

Handling DNS is in a PR up the stack, as is actually wiring it up to anything.
2024-11-14 16:16:04 +04:00
Ethan 5d853fcfd8 chore: support adding dns hosts to tailnet.Conn (#15419)
Relates to #14718.

The remaining changes (regarding the Tailscale DNS service) will need to
be made on `coder/tailscale`.
2024-11-08 09:37:56 +00:00
Spike Curtis e5661c2748 feat: add support for multiple tunnel destinations in tailnet (#15409)
Closes #14729

Expands the Coordination controller used by the CLI client to allow multiple tunnel destinations (agents).  Our current client uses just one, but this unifies the logic so that when we add Coder VPN, 1 is just a special case of "many."
2024-11-08 13:32:07 +04:00
Spike Curtis 8c00ebc6ee chore: refactor ServerTailnet to use tailnet.Controllers (#15408)
chore of #14729

Refactors the `ServerTailnet` to use `tailnet.Controller` so that we reuse logic around reconnection and handling control messages, instead of reimplementing.  This unifies our "client" use of the tailscale API across CLI, coderd, and wsproxy.
2024-11-08 13:18:56 +04:00
Spike Curtis 718722af1b chore: refactor tailnetAPIConnector to tailnet.Controller (#15361)
Refactors `workspacesdk.tailnetAPIConnector` as a `tailnet.Controller` to reuse all the reconnection and graceful disconnect logic.

chore re: #14729
2024-11-08 10:10:54 +04:00
Spike Curtis 2d061e698d chore: refactor tailnetAPIConnector to use dialer (#15347)
refactors `tailnetAPIConnector` to use the `Dialer` interface in `tailnet`, introduced lower in this stack of PRs. This will let us use the same Tailnet API handling code across different things that connect to the Tailnet API (CLI client, coderd, workspace proxies, and soon: Coder VPN).

chore re: #14729
2024-11-07 17:24:19 +04:00
Ethan 098728138f chore: add a tailscale router that uses the CoderVPN protocol (#15391)
Closes #14732.
2024-11-07 22:45:17 +11:00
Spike Curtis d7e86278c8 chore: add resume token controller (#15346)
Implements a controller for the Tailnet API resume token RPC, by refactoring from `workspacesdk`.

chore re: #14729
2024-11-07 11:32:20 +04:00
Spike Curtis 335e4ab6bf chore: refactor sending telemetry (#15345)
Implements a tailnet API Telemetry controller by refactoring from `workspacesdk`.

chore re: #14729
2024-11-06 20:23:23 +04:00
Spike Curtis 9126cd78a6 chore: refactor DERP setting loop (#15344)
Implements a Tailnet API DERP controller by refactoring from `workspacesdk`

chore re: #14729
2024-11-06 20:04:05 +04:00
Spike Curtis 886dcbec84 chore: refactor coordination (#15343)
Refactors the way clients of the Tailnet API (clients of the API, which include both workspace "agents" and "clients") interact with the API.  Introduces the idea of abstract "controllers" for each of the RPCs in the API, and implements a Coordination controller by refactoring from `workspacesdk`.

chore re: #14729
2024-11-05 13:50:10 +04:00
Ethan 871cc05e99 chore: add a dns.OSConfigurator implementation that uses the CoderVPN protocol (#15342)
Closes #14733.
2024-11-05 19:23:16 +11:00
Ethan b1298a3c1e feat: add WorkspaceUpdates tailnet RPC (#14847)
Closes #14716
Closes #14717

Adds a new user-scoped tailnet API endpoint (`api/v2/tailnet`) with a new RPC stream for receiving updates on workspaces owned by a specific user, as defined in #14716. 

When a stream is started, the `WorkspaceUpdatesProvider` will begin listening on the user-scoped pubsub events implemented in #14964. When a relevant event type is seen (such as a workspace state transition), the provider will query the DB for all the workspaces (and agents) owned by the user. This gets compared against the result of the previous query to produce a set of workspace updates. 

Workspace updates can be requested for any user ID, however only workspaces the authorised user is permitted to `ActionRead` will have their updates streamed.
Opening a tunnel to an agent requires that the user can perform `ActionSSH` against the workspace containing it.
2024-11-01 14:53:53 +11:00
Jon Ayers cd890aa3a0 feat: enable key rotation (#15066)
This PR contains the remaining logic necessary to hook up key rotation
to the product.
2024-10-25 17:14:35 +01:00
Spike Curtis 8785a51b09 feat: include Coder service prefix on agents (#14944)
fixes #14715

Configures agents to use an address both in the Tailscale service prefix and the new Coder service prefix. Also modifies the Coordinator auth to allow the new prefix.

Updates `coder/tailscale` to include https://github.com/coder/tailscale/pull/62 which fixes a bug around forwarding TCP connections to localhost.  This functionality is tested in the modifications to `TestAgent_Dial`.
2024-10-04 10:16:33 +04:00
Spike Curtis 7d9f5ab81d chore: add Coder service prefix to tailnet (#14943)
re: #14715

This PR introduces the Coder service prefix: `fd60:627a:a42b::/48` and refactors our existing code as calling the Tailscale service prefix explicitly (rather than implicitly).

Removes the unused `Addresses` agent option. All clients today assume they can compute the Agent's IP address based on its UUID, so an agent started with a custom address would break things.
2024-10-04 10:04:10 +04:00
Ethan b8944074c4 chore: improve coder server ux (#14761) 2024-09-24 13:16:36 +10:00
Spike Curtis 2df9a3e554 fix: fix tailnet remoteCoordination to wait for server (#14666)
Fixes #12560

When gracefully disconnecting from the coordinator, we would send the Disconnect message and then close the dRPC stream.  However, closing the dRPC stream can cause the server not to process the Disconnect message, since we use the stream context in a `select` while sending it to the coordinator.

This is a product bug uncovered by the flake, and probably results in us failing graceful disconnect some minority of the time.

Instead, the `remoteCoordination` (and `inMemoryCoordination` for consistency) should send the Disconnect message and then wait for the coordinator to hang up (on some graceful disconnect timer, in the form of a context).
2024-09-16 09:24:30 +04:00
Spike Curtis d6154c4310 chore: remove tailnet v1 API support (#14641)
Drops support for v1 of the tailnet API, which was the original coordination protocol where we only sent node updates, never marked them lost or disconnected.

v2 of the tailnet API went GA for CLI clients in Coder 2.8.0, so clients older than that would stop working.
2024-09-12 07:56:31 +04:00
Spike Curtis fb3523b37f chore: remove legacy AgentIP address (#14640)
Removes the support for the Agent's "legacy IP" which was a hardcoded IP address all agents used to use, before we introduced "single tailnet". Single tailnet went GA in 2.7.0.
2024-09-12 07:40:19 +04:00
Spike Curtis 5bd19f8ba3 fix: fix flake in TestWorkspaceAgentClientCoordinate_ResumeToken (#14642)
fixes #14365

I bet what's going on is that in `connectToCoordinatorAndFetchResumeToken()` we call `Coordinate()`, send a message on the `Coordinate` client and then close it in rapid succession. We don't wait around for a response from the coordinator, so dRPC is likely aborting the call `Coordinate()` in the backend because the stream is closed before it even gets a chance.

Instead of using the Coordinator to record the peer ID assigned on the API call, we can wrap the resume token provider, since we call that API _and_ wait for a response. This also affords the opportunity to directly assert we get called with the right token.
2024-09-11 16:32:47 +04:00
Spike Curtis 7b39f6b0d4 fix: improves coordination logging (#14556) 2024-09-04 15:10:43 +04:00
Ethan 8c15192433 feat(cli): add p2p diagnostics to ping (#14426)
First PR to address #14244.

Adds common potential reasons as to why a direct connection to the workspace agent couldn't be established to `coder ping`:
- If the Coder deployment administrator has blocked direction connections (`CODER_BLOCK_DIRECT`).
- If the client has no STUN servers within it's DERP map.
- If the client or agent appears to be behind a hard NAT, as per Tailscale `netInfo.MappingVariesByDestIP`

Also adds a warning if the client or agent has a network interface below the 'safe' MTU for tailnet. This warning is always displayed at the end of a `coder ping`.
2024-08-28 15:39:01 +10:00
Ethan 4cc26be5ec fix: set network telemetry client version on server (#14376) 2024-08-23 06:17:28 +00:00
Dean Sheather cf8be4eac5 feat: add resume support to coordinator connections (#14234) 2024-08-20 17:16:49 +10:00
Jon Ayers 4fc047954e fix: avoid deleting peers on graceful close (#14165)
* fix: avoid deleting peers on graceful close

- Fixes an issue where a coordinator deletes all
  its peers on shutdown. This can cause disconnects
  whenever a coderd is redeployed.
2024-08-14 15:16:08 -04:00
Ethan e09ad1ddc1 fix: lock adding to tailnet waitgroup to avoid race (and fix flake) (#14195) 2024-08-07 15:52:42 +10:00
Spike Curtis dda9c56098 fix: fix TestTailnet/Connect to wait for listener before dialing (#14148) 2024-08-05 15:45:46 +04:00
Dean Sheather d2b035312e chore: fix parse typo for network telemetry (#13971) 2024-07-22 17:14:37 +00:00
Ethan f36b816391 chore: add coder version to network telemetry events (#13871) 2024-07-11 20:46:37 +10:00
Ethan e8db21c89e chore: add additional network telemetry stats & events (#13800) 2024-07-10 14:14:35 +10:00
Spike Curtis e5268e4551 chore: spin clock library out to coder/quartz repo (#13777)
Code that was in `/clock` has been moved to github.com/coder/quartz.  This PR refactors our use of the clock library to point to the external Quartz repo.
2024-07-03 15:02:54 +04:00
Ethan a110d18275 chore: add DRPC tailnet & cli network telemetry (#13687) 2024-07-03 15:23:46 +10:00
Dean Sheather 6c94dd4f23 chore: add DRPC server implementation for network telemetry (#13675) 2024-07-02 01:50:52 +10:00
Colin Adler 3dec6ff32f chore: add protobuf types for tailnet telemetry (#13617) 2024-06-24 12:13:03 -05:00
Ethan 58bf0ec1c6 chore: add additional tailnet topology integration tests (#13549) 2024-06-12 16:02:34 +00:00