Commit Graph

54 Commits

Author SHA1 Message Date
Spike Curtis 06e396188f test: subscribe to heartbeats synchronously on PGCoord startup (#21746)
fixes: https://github.com/coder/internal/issues/1304

Subscribe to heartbeats synchronously on startup of PGCoordinator. This ensures tests that send heartbeats don't race with this subscription.
2026-01-29 13:34:34 +04:00
Spike Curtis bddb808b25 chore: arrange imports in a standard way (#21452)
Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example:

```
import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"golang.org/x/xerrors"
	"gopkg.in/natefinch/lumberjack.v2"

	"cdr.dev/slog/v3"
	"github.com/coder/coder/v2/codersdk/agentsdk"
	"github.com/coder/serpent"
)
```

3 groups: standard library, 3rd partly libs, Coder libs.

This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.
2026-01-08 15:24:11 +04:00
Spike Curtis 49b34a716a fix: fix slog to always use array of Fields (#21426)
Upgrades to slog v3 which includes a small, but backward incompatible API change to the acceptible call arguments when logging. This change allows us to verify via compile time type checking that arguments are correct and won't cause a panic, as was possible in slog v1, which this replaces (v2 was tagged but never used in coder/coder).

It also updates dependencies that also use slog and were updated.

I've left the `aibridge` dependency as a commit SHA, under the assumption that the team there (cc @pawbana @dannykopping ) will tag and update the dependency soon and on their own schedule.

Other dependencies, I pushed new tags.
2026-01-08 10:29:41 +04:00
Spike Curtis 05b037bdea fix: avoid deadlock race writing to a disconnected mapper (#20303)
fixes https://github.com/coder/internal/issues/1045

Fixes a race condition in our PG Coordinator when a peer disconnects. We issue database queries to find the peer mappings (node structures for each peer connected via a tunnel), and then send these to the "mapper" that generates diffs and eventually writes the update to the websocket.

Before this change we erroneously used the querier's context for this update, which has the same lifetime as the coordinator itself. If the peer has disconnected, the mapper might not be reading from its channel, and this causes a deadlock in a querier worker. This also prevents us from doing any more work on the peer.

I also added some more debug logging that would have been helpful when tracking this down.
2025-10-15 15:56:07 +04:00
ケイラ caeff49aba chore: refactor roles to support multiple permission sets scoped by org id (#20186)
In preparation for adding the "member" permission level, which will also
be grouped by org ID, do a bit of a refactor to make room for it and the
existing "org" level to live in the same `map`
2025-10-09 11:08:34 -06:00
Spike Curtis 04dfda8a0e fix: change enqueue error to debug log level (#19686)
fixes https://github.com/coder/internal/issues/958

Logging was being done at error level, but most likely any errors are from simple races between an update triggered around the same time as a client disconnecting. Debug is fine for these.
2025-09-03 13:42:02 +04:00
Spike Curtis 345435a04c feat: modify coordinators to send errors and peers to log them (#17467)
Adds support to our coordinator implementations to send Error updates before disconnecting clients.

I was recently debugging a connection issue where the client was getting repeatedly disconnected from the Coordinator, but since we never send any error information it was really hard without server logs.

This PR aims to correct that, by sending a CoordinateResponse with `Error` set in cases where we disconnect a client without them asking us to.

It also logs the error whenever we get one in the client controller.
2025-04-21 11:40:56 +04:00
Dean Sheather fbe2fa66f5 chore: add test for coord rolling restart (#14680)
Closes https://github.com/coder/team-coconut/issues/50

---------

Co-authored-by: Ethan Dickson <ethan@coder.com>
2024-11-20 18:04:33 +11:00
Spike Curtis 8c00ebc6ee chore: refactor ServerTailnet to use tailnet.Controllers (#15408)
chore of #14729

Refactors the `ServerTailnet` to use `tailnet.Controller` so that we reuse logic around reconnection and handling control messages, instead of reimplementing.  This unifies our "client" use of the tailscale API across CLI, coderd, and wsproxy.
2024-11-08 13:18:56 +04:00
Spike Curtis d6154c4310 chore: remove tailnet v1 API support (#14641)
Drops support for v1 of the tailnet API, which was the original coordination protocol where we only sent node updates, never marked them lost or disconnected.

v2 of the tailnet API went GA for CLI clients in Coder 2.8.0, so clients older than that would stop working.
2024-09-12 07:56:31 +04:00
Jon Ayers 4fc047954e fix: avoid deleting peers on graceful close (#14165)
* fix: avoid deleting peers on graceful close

- Fixes an issue where a coordinator deletes all
  its peers on shutdown. This can cause disconnects
  whenever a coderd is redeployed.
2024-08-14 15:16:08 -04:00
Spike Curtis e5268e4551 chore: spin clock library out to coder/quartz repo (#13777)
Code that was in `/clock` has been moved to github.com/coder/quartz.  This PR refactors our use of the clock library to point to the external Quartz repo.
2024-07-03 15:02:54 +04:00
Spike Curtis ce7f13c6c3 fix: fix TestPGCoordinatorSingle_MissedHeartbeats flake (#13686) 2024-06-27 19:17:24 +04:00
Steven Masley 5ccf5084e8 chore: create type for unique role names (#13506)
* chore: create type for unique role names

Using `string` was confusing when something should be combined with
org context, and when not to. Naming this new name, "RoleIdentifier"
2024-06-11 08:55:28 -05:00
Spike Curtis 8326a3a675 chore: change mock clock to allow Advance() within timer/tick functions (#13500) 2024-06-10 15:27:24 +04:00
Spike Curtis a0962ba089 fix: wait for PGCoordinator to clean up db state (#13351)
c.f. https://github.com/coder/coder/pull/13192#issuecomment-2097657692

We need to wait for PGCoordinator to finish its work before returning on `Close()`, so that we delete database state (best effort -- if this fails others will filter it out based on heartbeats).
2024-05-24 12:01:03 +04:00
Steven Masley 1f5788feff chore: remove rbac psuedo resources, add custom verbs (#13276)
Removes our pseudo rbac resources like `WorkspaceApplicationConnect` in favor of additional verbs like `ssh`. This is to make more intuitive permissions for building custom roles.

The source of truth is now `policy.go`
2024-05-15 11:09:42 -05:00
Steven Masley cb6b5e8fbd chore: push rbac actions to policy package (#13274)
Just moved `rbac.Action` -> `policy.Action`. This is for the stacked PR to not have circular dependencies when doing autogen. Without this, the autogen can produce broken golang code, which prevents the autogen from compiling.

So just avoiding circular dependencies. Doing this in it's own PR to reduce LoC diffs in the primary PR, since this has 0 functional changes.
2024-05-15 09:46:35 -05:00
Colin Adler 205c43da99 fix(enterprise): mark nodes from unhealthy coordinators as lost (#13123)
Instead of removing the mappings of unhealthy coordinators entirely,
mark them as lost instead. This prevents peers from disappearing from
other peers if a coordinator misses a heartbeat.
2024-05-03 14:07:29 -05:00
Colin Adler 777dfbe965 feat(enterprise): add ready for handshake support to pgcoord (#12935) 2024-04-16 15:01:10 -05:00
Spike Curtis 06eae954c9 fix: stop sending DeleteTailnetPeer when coordinator is unhealthy (#12925)
fixes #12923

Prevents Coordinate peer connections from generating spurious database queries like DeleteTailnetPeer when the coordinator is unhealthy.

It does this by checking the health of the querier before accepting a connection, rather than unconditionally accepting it only for it to get swatted down later.
2024-04-10 22:49:13 +04:00
Colin Adler e5d911462f fix(tailnet): enforce valid agent and client addresses (#12197)
This adds the ability for `TunnelAuth` to also authorize incoming wireguard node IPs, preventing agents from reporting anything other than their static IP generated from the agent ID.
2024-03-01 09:02:33 -06:00
Spike Curtis 627232eae9 fix: fix pgcoord to delete coordinator row last (#12155)
Fixes #12141
Fixes #11750

PGCoord shutdown was uncoordinated, so an update at an inopportune time during shutdown would be rejected because the coordinator row was already deleted.

This PR ensures that the PGCoord subcomponents that write updates are shut down before we take down the heartbeats, which is responsible for deleting the coordinator row.
2024-02-15 16:34:29 +04:00
Spike Curtis 1c8b803785 feat: add logging to pgcoord subscribe/unsubscribe (#11952)
Adds logging to unsubscribing from peer and tunnel updates in pgcoordinator, since #11950 seems to be problem with these subscriptions
2024-01-31 12:15:58 +04:00
Cian Johnston ecae6f9135 fix(enterprise/tailnet): handle query canceled error in sendBeat() (#11794) 2024-01-24 18:42:05 +00:00
Spike Curtis cae095fdb6 fix: stop logging errors on canceled cleanup queries (#11547)
Fixes flake seen here: https://github.com/coder/coder/actions/runs/7474259128/job/20340051975
2024-01-10 16:20:29 +04:00
Spike Curtis f2606a78dd fix: avoid converting nil node
fixes: #11276
2023-12-19 13:38:15 +04:00
Dean Sheather e46431078c feat: add AgentAPI using DRPC (#10811)
Co-authored-by: Spike Curtis <spike@coder.com>
2023-12-18 22:53:28 +10:00
Spike Curtis ad3fed72bc chore: rename Coordinator to CoordinatorV1 (#11222)
Renames the tailnet.Coordinator to represent both v1 and v2 APIs, so that we can use this interface for the main atomic pointer.

Part of #10532
2023-12-15 11:38:12 +04:00
Spike Curtis bf3b35b1e2 fix: stop logging context Canceled as error (#11177)
fixes #11166 and a related log that could have the same problem
2023-12-13 13:08:30 +04:00
Spike Curtis b34ecf1e9e fix: fix deadlock of mappingQuery on context canceled
Fixes #11078

replace bare channel send with SendCtx so that we properly shut down when context is canceled.
2023-12-07 17:19:18 +04:00
Spike Curtis 2c86d0bed0 feat: support v2 Tailnet API in AGPL coordinator (#11010)
Fixes #10529
2023-12-06 15:04:28 +04:00
Spike Curtis 612e67a53b feat: add cleanup of lost tailnet peers and tunnels to PGCoordinator (#10939)
Adds the "lost" peer cleanup queries to PGCoordinator, including tests.
2023-12-01 10:13:29 +04:00
Spike Curtis 0cab6e7763 feat: support graceful disconnect in PGCoordinator (#10937)
Adds support for graceful disconnect to PGCoordinator.  When peers gracefully disconnect, they send a disconnect message.  This triggers the peer to be disconnected from all tunneled peers.

The Multi-Agent Client supports graceful disconnect, since it is in memory and we know that when it is closed, we really mean to disconnect.

The v1 agent and client Websocket connections do not support graceful disconnect, since the v1 protocol doesn't have this feature.  That means that if a v1 peer connects to a v2 peer, when the v1 peer's coordinator connection is closed, the v2 peer will
see it as "lost" since we don't know whether the v1 peer meant to disconnect, or it just lost connectivity to the coordinator.
2023-12-01 09:55:25 +04:00
Spike Curtis 52901e1219 feat: implement HTMLDebug for PGCoord with v2 API (#10914)
Implements HTMLDebug for the PGCoordinator with the new v2 API and related DB tables.
2023-11-28 22:37:20 +04:00
Spike Curtis 5c48cb4447 feat: modify PG Coordinator to work with new v2 Tailnet API (#10573)
re: #10528

Refactors PG Coordinator to work with the Tailnet v2 API, including wrappers for the existing v1 API.

The debug endpoint functions, but doesn't return sensible data, that will be in another stacked PR.
2023-11-20 14:31:04 +04:00
Colin Adler 36f3151b71 fix(enterprise/tailnet): properly detect legacy agents (#10083) 2023-10-06 16:49:26 +00:00
Colin Adler c900b5f8df feat: add single tailnet support to pgcoord (#9351) 2023-09-21 14:30:48 -05:00
Spike Curtis a415395e9e fix: stop dropping error log on context canceled after heartbeat (#9427)
Signed-off-by: Spike Curtis <spike@coder.com>
2023-08-30 14:44:00 +04:00
Kyle Carberry 22e781eced chore: add /v2 to import module path (#9072)
* chore: add /v2 to import module path

go mod requires semantic versioning with versions greater than 1.x

This was a mechanical update by running:
```
go install github.com/marwan-at-work/mod/cmd/mod@latest
mod upgrade
```

Migrate generated files to import /v2

* Fix gen
2023-08-18 18:55:43 +00:00
Spike Curtis 2f46f2315c fix: fix race in PGCoord at startup (#9144)
Signed-off-by: Spike Curtis <spike@coder.com>
2023-08-18 09:53:03 +04:00
Spike Curtis c7a6d626b4 fix: make PGCoordinator close connections when unhealthy (#9125)
Signed-off-by: Spike Curtis <spike@coder.com>
2023-08-17 09:36:47 +04:00
Colin Adler bc862fa493 chore: upgrade tailscale to v1.46.1 (#8913) 2023-08-09 19:50:26 +00:00
Colin Adler 0b4f333a6f chore: add http debug support to pgcoord (#8795) 2023-07-28 17:59:31 -05:00
Colin Adler dd2f79995b chore(tailnet): rewrite coordinator debug using html/template (#8752) 2023-07-26 22:54:21 +00:00
Colin Adler 6b92abebb9 fix(tailnet): track agent names for http debug (#8744) 2023-07-26 18:44:10 +00:00
Colin Adler 1cb39fc65d test: ignore more spurious pgcoord errors (#8628) 2023-07-20 19:55:25 +00:00
Colin Adler 00b9a3ce58 fix: prevent error log when pgcoord query is canceled (#8609) 2023-07-19 16:40:57 -05:00
Colin Adler c47b78c44b chore: replace wsconncache with a single tailnet (#8176) 2023-07-12 17:37:31 -05:00
Spike Curtis b4057bd74a feat: make pgCoordinator generally available (#8419)
* pgCoord to GA, fix tests

Signed-off-by: Spike Curtis <spike@coder.com>

* Fix generation and coordinator delete RBAC

Signed-off-by: Spike Curtis <spike@coder.com>

* Fix fakeQuerier -> FakeQuerier

Signed-off-by: Spike Curtis <spike@coder.com>

---------

Signed-off-by: Spike Curtis <spike@coder.com>
2023-07-12 13:35:29 +04:00