Commit Graph

48 Commits

Author SHA1 Message Date
Spike Curtis 345435a04c feat: modify coordinators to send errors and peers to log them (#17467)
Adds support to our coordinator implementations to send Error updates before disconnecting clients.

I was recently debugging a connection issue where the client was getting repeatedly disconnected from the Coordinator, but since we never send any error information it was really hard without server logs.

This PR aims to correct that, by sending a CoordinateResponse with `Error` set in cases where we disconnect a client without them asking us to.

It also logs the error whenever we get one in the client controller.
2025-04-21 11:40:56 +04:00
Dean Sheather fbe2fa66f5 chore: add test for coord rolling restart (#14680)
Closes https://github.com/coder/team-coconut/issues/50

---------

Co-authored-by: Ethan Dickson <ethan@coder.com>
2024-11-20 18:04:33 +11:00
Spike Curtis 8c00ebc6ee chore: refactor ServerTailnet to use tailnet.Controllers (#15408)
chore of #14729

Refactors the `ServerTailnet` to use `tailnet.Controller` so that we reuse logic around reconnection and handling control messages, instead of reimplementing.  This unifies our "client" use of the tailscale API across CLI, coderd, and wsproxy.
2024-11-08 13:18:56 +04:00
Spike Curtis d6154c4310 chore: remove tailnet v1 API support (#14641)
Drops support for v1 of the tailnet API, which was the original coordination protocol where we only sent node updates, never marked them lost or disconnected.

v2 of the tailnet API went GA for CLI clients in Coder 2.8.0, so clients older than that would stop working.
2024-09-12 07:56:31 +04:00
Jon Ayers 4fc047954e fix: avoid deleting peers on graceful close (#14165)
* fix: avoid deleting peers on graceful close

- Fixes an issue where a coordinator deletes all
  its peers on shutdown. This can cause disconnects
  whenever a coderd is redeployed.
2024-08-14 15:16:08 -04:00
Spike Curtis e5268e4551 chore: spin clock library out to coder/quartz repo (#13777)
Code that was in `/clock` has been moved to github.com/coder/quartz.  This PR refactors our use of the clock library to point to the external Quartz repo.
2024-07-03 15:02:54 +04:00
Spike Curtis ce7f13c6c3 fix: fix TestPGCoordinatorSingle_MissedHeartbeats flake (#13686) 2024-06-27 19:17:24 +04:00
Steven Masley 5ccf5084e8 chore: create type for unique role names (#13506)
* chore: create type for unique role names

Using `string` was confusing when something should be combined with
org context, and when not to. Naming this new name, "RoleIdentifier"
2024-06-11 08:55:28 -05:00
Spike Curtis 8326a3a675 chore: change mock clock to allow Advance() within timer/tick functions (#13500) 2024-06-10 15:27:24 +04:00
Spike Curtis a0962ba089 fix: wait for PGCoordinator to clean up db state (#13351)
c.f. https://github.com/coder/coder/pull/13192#issuecomment-2097657692

We need to wait for PGCoordinator to finish its work before returning on `Close()`, so that we delete database state (best effort -- if this fails others will filter it out based on heartbeats).
2024-05-24 12:01:03 +04:00
Steven Masley 1f5788feff chore: remove rbac psuedo resources, add custom verbs (#13276)
Removes our pseudo rbac resources like `WorkspaceApplicationConnect` in favor of additional verbs like `ssh`. This is to make more intuitive permissions for building custom roles.

The source of truth is now `policy.go`
2024-05-15 11:09:42 -05:00
Steven Masley cb6b5e8fbd chore: push rbac actions to policy package (#13274)
Just moved `rbac.Action` -> `policy.Action`. This is for the stacked PR to not have circular dependencies when doing autogen. Without this, the autogen can produce broken golang code, which prevents the autogen from compiling.

So just avoiding circular dependencies. Doing this in it's own PR to reduce LoC diffs in the primary PR, since this has 0 functional changes.
2024-05-15 09:46:35 -05:00
Colin Adler 205c43da99 fix(enterprise): mark nodes from unhealthy coordinators as lost (#13123)
Instead of removing the mappings of unhealthy coordinators entirely,
mark them as lost instead. This prevents peers from disappearing from
other peers if a coordinator misses a heartbeat.
2024-05-03 14:07:29 -05:00
Colin Adler 777dfbe965 feat(enterprise): add ready for handshake support to pgcoord (#12935) 2024-04-16 15:01:10 -05:00
Spike Curtis 06eae954c9 fix: stop sending DeleteTailnetPeer when coordinator is unhealthy (#12925)
fixes #12923

Prevents Coordinate peer connections from generating spurious database queries like DeleteTailnetPeer when the coordinator is unhealthy.

It does this by checking the health of the querier before accepting a connection, rather than unconditionally accepting it only for it to get swatted down later.
2024-04-10 22:49:13 +04:00
Colin Adler e5d911462f fix(tailnet): enforce valid agent and client addresses (#12197)
This adds the ability for `TunnelAuth` to also authorize incoming wireguard node IPs, preventing agents from reporting anything other than their static IP generated from the agent ID.
2024-03-01 09:02:33 -06:00
Spike Curtis 627232eae9 fix: fix pgcoord to delete coordinator row last (#12155)
Fixes #12141
Fixes #11750

PGCoord shutdown was uncoordinated, so an update at an inopportune time during shutdown would be rejected because the coordinator row was already deleted.

This PR ensures that the PGCoord subcomponents that write updates are shut down before we take down the heartbeats, which is responsible for deleting the coordinator row.
2024-02-15 16:34:29 +04:00
Spike Curtis 1c8b803785 feat: add logging to pgcoord subscribe/unsubscribe (#11952)
Adds logging to unsubscribing from peer and tunnel updates in pgcoordinator, since #11950 seems to be problem with these subscriptions
2024-01-31 12:15:58 +04:00
Cian Johnston ecae6f9135 fix(enterprise/tailnet): handle query canceled error in sendBeat() (#11794) 2024-01-24 18:42:05 +00:00
Spike Curtis cae095fdb6 fix: stop logging errors on canceled cleanup queries (#11547)
Fixes flake seen here: https://github.com/coder/coder/actions/runs/7474259128/job/20340051975
2024-01-10 16:20:29 +04:00
Spike Curtis f2606a78dd fix: avoid converting nil node
fixes: #11276
2023-12-19 13:38:15 +04:00
Dean Sheather e46431078c feat: add AgentAPI using DRPC (#10811)
Co-authored-by: Spike Curtis <spike@coder.com>
2023-12-18 22:53:28 +10:00
Spike Curtis ad3fed72bc chore: rename Coordinator to CoordinatorV1 (#11222)
Renames the tailnet.Coordinator to represent both v1 and v2 APIs, so that we can use this interface for the main atomic pointer.

Part of #10532
2023-12-15 11:38:12 +04:00
Spike Curtis bf3b35b1e2 fix: stop logging context Canceled as error (#11177)
fixes #11166 and a related log that could have the same problem
2023-12-13 13:08:30 +04:00
Spike Curtis b34ecf1e9e fix: fix deadlock of mappingQuery on context canceled
Fixes #11078

replace bare channel send with SendCtx so that we properly shut down when context is canceled.
2023-12-07 17:19:18 +04:00
Spike Curtis 2c86d0bed0 feat: support v2 Tailnet API in AGPL coordinator (#11010)
Fixes #10529
2023-12-06 15:04:28 +04:00
Spike Curtis 612e67a53b feat: add cleanup of lost tailnet peers and tunnels to PGCoordinator (#10939)
Adds the "lost" peer cleanup queries to PGCoordinator, including tests.
2023-12-01 10:13:29 +04:00
Spike Curtis 0cab6e7763 feat: support graceful disconnect in PGCoordinator (#10937)
Adds support for graceful disconnect to PGCoordinator.  When peers gracefully disconnect, they send a disconnect message.  This triggers the peer to be disconnected from all tunneled peers.

The Multi-Agent Client supports graceful disconnect, since it is in memory and we know that when it is closed, we really mean to disconnect.

The v1 agent and client Websocket connections do not support graceful disconnect, since the v1 protocol doesn't have this feature.  That means that if a v1 peer connects to a v2 peer, when the v1 peer's coordinator connection is closed, the v2 peer will
see it as "lost" since we don't know whether the v1 peer meant to disconnect, or it just lost connectivity to the coordinator.
2023-12-01 09:55:25 +04:00
Spike Curtis 52901e1219 feat: implement HTMLDebug for PGCoord with v2 API (#10914)
Implements HTMLDebug for the PGCoordinator with the new v2 API and related DB tables.
2023-11-28 22:37:20 +04:00
Spike Curtis 5c48cb4447 feat: modify PG Coordinator to work with new v2 Tailnet API (#10573)
re: #10528

Refactors PG Coordinator to work with the Tailnet v2 API, including wrappers for the existing v1 API.

The debug endpoint functions, but doesn't return sensible data, that will be in another stacked PR.
2023-11-20 14:31:04 +04:00
Colin Adler 36f3151b71 fix(enterprise/tailnet): properly detect legacy agents (#10083) 2023-10-06 16:49:26 +00:00
Colin Adler c900b5f8df feat: add single tailnet support to pgcoord (#9351) 2023-09-21 14:30:48 -05:00
Spike Curtis a415395e9e fix: stop dropping error log on context canceled after heartbeat (#9427)
Signed-off-by: Spike Curtis <spike@coder.com>
2023-08-30 14:44:00 +04:00
Kyle Carberry 22e781eced chore: add /v2 to import module path (#9072)
* chore: add /v2 to import module path

go mod requires semantic versioning with versions greater than 1.x

This was a mechanical update by running:
```
go install github.com/marwan-at-work/mod/cmd/mod@latest
mod upgrade
```

Migrate generated files to import /v2

* Fix gen
2023-08-18 18:55:43 +00:00
Spike Curtis 2f46f2315c fix: fix race in PGCoord at startup (#9144)
Signed-off-by: Spike Curtis <spike@coder.com>
2023-08-18 09:53:03 +04:00
Spike Curtis c7a6d626b4 fix: make PGCoordinator close connections when unhealthy (#9125)
Signed-off-by: Spike Curtis <spike@coder.com>
2023-08-17 09:36:47 +04:00
Colin Adler bc862fa493 chore: upgrade tailscale to v1.46.1 (#8913) 2023-08-09 19:50:26 +00:00
Colin Adler 0b4f333a6f chore: add http debug support to pgcoord (#8795) 2023-07-28 17:59:31 -05:00
Colin Adler dd2f79995b chore(tailnet): rewrite coordinator debug using html/template (#8752) 2023-07-26 22:54:21 +00:00
Colin Adler 6b92abebb9 fix(tailnet): track agent names for http debug (#8744) 2023-07-26 18:44:10 +00:00
Colin Adler 1cb39fc65d test: ignore more spurious pgcoord errors (#8628) 2023-07-20 19:55:25 +00:00
Colin Adler 00b9a3ce58 fix: prevent error log when pgcoord query is canceled (#8609) 2023-07-19 16:40:57 -05:00
Colin Adler c47b78c44b chore: replace wsconncache with a single tailnet (#8176) 2023-07-12 17:37:31 -05:00
Spike Curtis b4057bd74a feat: make pgCoordinator generally available (#8419)
* pgCoord to GA, fix tests

Signed-off-by: Spike Curtis <spike@coder.com>

* Fix generation and coordinator delete RBAC

Signed-off-by: Spike Curtis <spike@coder.com>

* Fix fakeQuerier -> FakeQuerier

Signed-off-by: Spike Curtis <spike@coder.com>

---------

Signed-off-by: Spike Curtis <spike@coder.com>
2023-07-12 13:35:29 +04:00
Spike Curtis 7943a5b85e fix PG coordinator context and RBAC subject (#8223)
Signed-off-by: Spike Curtis <spike@coder.com>
2023-06-27 10:14:31 +00:00
Spike Curtis 5d48122f12 fix: fix PG Coordinator to update when heartbeats (re)start (#8178)
* fix: fix PG Coordinator to update when heartbeats (re)start

Signed-off-by: Spike Curtis <spike@coder.com>

* rename resetExpiryTimer(WithLock)

Signed-off-by: Spike Curtis <spike@coder.com>

---------

Signed-off-by: Spike Curtis <spike@coder.com>
2023-06-23 10:38:58 +00:00
Spike Curtis ba9d038d42 feat: add periodic cleanup of PG Coordinator state (#8142)
* PG Coordinator cleans orphaned state

Signed-off-by: Spike Curtis <spike@coder.com>

* Don't need pubsub

Signed-off-by: Spike Curtis <spike@coder.com>

---------

Signed-off-by: Spike Curtis <spike@coder.com>
2023-06-23 13:23:28 +04:00
Spike Curtis cc17d2feea refactor: add postgres tailnet coordinator (#8044)
* postgres tailnet coordinator

Signed-off-by: Spike Curtis <spike@coder.com>

* Fix db migration; tests

Signed-off-by: Spike Curtis <spike@coder.com>

* Add fixture, regenerate

Signed-off-by: Spike Curtis <spike@coder.com>

* Fix fixtures

Signed-off-by: Spike Curtis <spike@coder.com>

* review comments, run clean gen

Signed-off-by: Spike Curtis <spike@coder.com>

* Rename waitForConn -> cleanupConn

Signed-off-by: Spike Curtis <spike@coder.com>

* code review updates

Signed-off-by: Spike Curtis <spike@coder.com>

* db migration order

Signed-off-by: Spike Curtis <spike@coder.com>

* fix log field name last_heartbeat

Signed-off-by: Spike Curtis <spike@coder.com>

* fix heartbeat_from log field

Signed-off-by: Spike Curtis <spike@coder.com>

* fix slog fields for linting

Signed-off-by: Spike Curtis <spike@coder.com>

---------

Signed-off-by: Spike Curtis <spike@coder.com>
2023-06-21 16:20:58 +04:00