Adds support to our coordinator implementations to send Error updates before disconnecting clients.
I was recently debugging a connection issue where the client was getting repeatedly disconnected from the Coordinator, but since we never send any error information it was really hard without server logs.
This PR aims to correct that, by sending a CoordinateResponse with `Error` set in cases where we disconnect a client without them asking us to.
It also logs the error whenever we get one in the client controller.
Updating golden files is an unnecessary extra step in addition to gen
that is easily overlooked, leading to the developer noticing the issue
in CI leading to lost developer time waiting for tests to complete.
- Adds `testutil.GoleakOptions` and consolidates existing options to
this location
- Pre-emptively adds required ignore for this Dependabot PR to pass CI
https://github.com/coder/coder/pull/16066
Refactors our use of `slogtest` to instantiate a "standard logger" across most of our tests. This standard logger incorporates https://github.com/coder/slog/pull/217 to also ignore database query canceled errors by default, which are a source of low-severity flakes.
Any test that has set non-default `slogtest.Options` is left alone. In particular, `coderdtest` defaults to ignoring all errors. We might consider revisiting that decision now that we have better tools to target the really common flaky Error logs on shutdown.
chore of #14729
Refactors the `ServerTailnet` to use `tailnet.Controller` so that we reuse logic around reconnection and handling control messages, instead of reimplementing. This unifies our "client" use of the tailscale API across CLI, coderd, and wsproxy.
This PR is the first in a series aimed at closing
[#15109](https://github.com/coder/coder/issues/15109).
### Changes
- **Template Database Creation:**
`dbtestutil.Open` now has the ability to create a template database if
none is provided via `DB_FROM`. The template database’s name is derived
from a hash of the migration files, ensuring that it can be reused
across tests and is automatically updated whenever migrations change.
- **Optimized Database Handling:**
Previously, `dbtestutil.Open` would spin up a new container for each
test when `DB_FROM` was unset. Now, it first checks for an active
PostgreSQL instance on `localhost:5432`. If none is found, it creates a
single container that remains available for subsequent tests,
eliminating repeated container startups.
These changes address the long individual test times (10+ seconds)
reported by some users, likely due to the time Docker took to start and
complete migrations.
Closes#14716Closes#14717
Adds a new user-scoped tailnet API endpoint (`api/v2/tailnet`) with a new RPC stream for receiving updates on workspaces owned by a specific user, as defined in #14716.
When a stream is started, the `WorkspaceUpdatesProvider` will begin listening on the user-scoped pubsub events implemented in #14964. When a relevant event type is seen (such as a workspace state transition), the provider will query the DB for all the workspaces (and agents) owned by the user. This gets compared against the result of the previous query to produce a set of workspace updates.
Workspace updates can be requested for any user ID, however only workspaces the authorised user is permitted to `ActionRead` will have their updates streamed.
Opening a tunnel to an agent requires that the user can perform `ActionSSH` against the workspace containing it.
re: #14715
This PR introduces the Coder service prefix: `fd60:627a:a42b::/48` and refactors our existing code as calling the Tailscale service prefix explicitly (rather than implicitly).
Removes the unused `Addresses` agent option. All clients today assume they can compute the Agent's IP address based on its UUID, so an agent started with a custom address would break things.
Drops support for v1 of the tailnet API, which was the original coordination protocol where we only sent node updates, never marked them lost or disconnected.
v2 of the tailnet API went GA for CLI clients in Coder 2.8.0, so clients older than that would stop working.
Removes the support for the Agent's "legacy IP" which was a hardcoded IP address all agents used to use, before we introduced "single tailnet". Single tailnet went GA in 2.7.0.
* fix: avoid deleting peers on graceful close
- Fixes an issue where a coordinator deletes all
its peers on shutdown. This can cause disconnects
whenever a coderd is redeployed.
Code that was in `/clock` has been moved to github.com/coder/quartz. This PR refactors our use of the clock library to point to the external Quartz repo.
* chore: create type for unique role names
Using `string` was confusing when something should be combined with
org context, and when not to. Naming this new name, "RoleIdentifier"
c.f. https://github.com/coder/coder/pull/13192#issuecomment-2097657692
We need to wait for PGCoordinator to finish its work before returning on `Close()`, so that we delete database state (best effort -- if this fails others will filter it out based on heartbeats).
Removes our pseudo rbac resources like `WorkspaceApplicationConnect` in favor of additional verbs like `ssh`. This is to make more intuitive permissions for building custom roles.
The source of truth is now `policy.go`
Just moved `rbac.Action` -> `policy.Action`. This is for the stacked PR to not have circular dependencies when doing autogen. Without this, the autogen can produce broken golang code, which prevents the autogen from compiling.
So just avoiding circular dependencies. Doing this in it's own PR to reduce LoC diffs in the primary PR, since this has 0 functional changes.
Instead of removing the mappings of unhealthy coordinators entirely,
mark them as lost instead. This prevents peers from disappearing from
other peers if a coordinator misses a heartbeat.
fixes#12923
Prevents Coordinate peer connections from generating spurious database queries like DeleteTailnetPeer when the coordinator is unhealthy.
It does this by checking the health of the querier before accepting a connection, rather than unconditionally accepting it only for it to get swatted down later.
Currently, importing `codersdk` just to interact with the API requires
importing tailscale, which causes builds to fail unless manually using
our fork.
This adds the ability for `TunnelAuth` to also authorize incoming wireguard node IPs, preventing agents from reporting anything other than their static IP generated from the agent ID.
Fixes#12141Fixes#11750
PGCoord shutdown was uncoordinated, so an update at an inopportune time during shutdown would be rejected because the coordinator row was already deleted.
This PR ensures that the PGCoord subcomponents that write updates are shut down before we take down the heartbeats, which is responsible for deleting the coordinator row.
Fixes an issue where a MultiAgentConn isn't closed properly when the coordinator it is connected to is closed.
Since servertailnet checks whether the conn is closed before reinitializing, it is important that we check this, otherwise servertailnet can get stuck if the coordinator closes (e.g. when we switch from AGPL to PGCoordinator after decoding a license).
This one is huge, and I'm sorry.
The problem is that once I change `tailnet.Conn` to start doing v2 behavior, I kind of have to change it everywhere, including in CoderSDK (CLI), the agent, wsproxy, and ServerTailnet.
There is still a bit more cleanup to do, and I need to add code so that when we lose connection to the Coordinator, we mark all peers as LOST, but that will be in a separate PR since this is big enough!
wsproxy also needs to be updated to use tailnet v2 because the `tailnet.Conn` stores peers by ID, and the peerID was not being carried by the JSON protocol. This adds a query param to the endpoint to conditionally switch to the new protocol.