Commit Graph

95 Commits

Author SHA1 Message Date
Spike Curtis 06e396188f test: subscribe to heartbeats synchronously on PGCoord startup (#21746)
fixes: https://github.com/coder/internal/issues/1304

Subscribe to heartbeats synchronously on startup of PGCoordinator. This ensures tests that send heartbeats don't race with this subscription.
2026-01-29 13:34:34 +04:00
Spike Curtis bddb808b25 chore: arrange imports in a standard way (#21452)
Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example:

```
import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"golang.org/x/xerrors"
	"gopkg.in/natefinch/lumberjack.v2"

	"cdr.dev/slog/v3"
	"github.com/coder/coder/v2/codersdk/agentsdk"
	"github.com/coder/serpent"
)
```

3 groups: standard library, 3rd partly libs, Coder libs.

This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.
2026-01-08 15:24:11 +04:00
Spike Curtis 49b34a716a fix: fix slog to always use array of Fields (#21426)
Upgrades to slog v3 which includes a small, but backward incompatible API change to the acceptible call arguments when logging. This change allows us to verify via compile time type checking that arguments are correct and won't cause a panic, as was possible in slog v1, which this replaces (v2 was tagged but never used in coder/coder).

It also updates dependencies that also use slog and were updated.

I've left the `aibridge` dependency as a commit SHA, under the assumption that the team there (cc @pawbana @dannykopping ) will tag and update the dependency soon and on their own schedule.

Other dependencies, I pushed new tags.
2026-01-08 10:29:41 +04:00
Hugo Dutka e62c5db678 chore: remove references to dbtestutil.WillUsePostgres (#20436)
Addresses https://github.com/coder/internal/issues/758.

This PR only cleans up dead code, it makes no changes to test logic.
2025-10-23 14:24:54 +02:00
Spike Curtis 05b037bdea fix: avoid deadlock race writing to a disconnected mapper (#20303)
fixes https://github.com/coder/internal/issues/1045

Fixes a race condition in our PG Coordinator when a peer disconnects. We issue database queries to find the peer mappings (node structures for each peer connected via a tunnel), and then send these to the "mapper" that generates diffs and eventually writes the update to the websocket.

Before this change we erroneously used the querier's context for this update, which has the same lifetime as the coordinator itself. If the peer has disconnected, the mapper might not be reading from its channel, and this causes a deadlock in a querier worker. This also prevents us from doing any more work on the peer.

I also added some more debug logging that would have been helpful when tracking this down.
2025-10-15 15:56:07 +04:00
ケイラ caeff49aba chore: refactor roles to support multiple permission sets scoped by org id (#20186)
In preparation for adding the "member" permission level, which will also
be grouped by org ID, do a bit of a refactor to make room for it and the
existing "org" level to live in the same `map`
2025-10-09 11:08:34 -06:00
Ethan 50704a5014 ci: improve 'tfail in goroutine' ruleguard rule (#19682)
This PR improves the ruleguard rule for detecting `t.Fail` calls in goroutines. It picks up additional violations, of which are fixed in this PR.
See self-review for details.

The motivation for fixing this comes from a flake I fixed in https://github.com/coder/coder/pull/19599, where tests would fail from a `require` in an `Eventually`.
2025-09-04 14:28:29 +10:00
Spike Curtis 04dfda8a0e fix: change enqueue error to debug log level (#19686)
fixes https://github.com/coder/internal/issues/958

Logging was being done at error level, but most likely any errors are from simple races between an update triggered around the same time as a client disconnecting. Debug is fine for these.
2025-09-03 13:42:02 +04:00
Spike Curtis 6c0bed0f53 chore: update to coder/quartz v0.2.0 (#18007)
Upgrade to coder/quartz v0.2.0 including fixing up a minor API breaking change.
2025-05-27 16:05:03 +04:00
Spike Curtis 345435a04c feat: modify coordinators to send errors and peers to log them (#17467)
Adds support to our coordinator implementations to send Error updates before disconnecting clients.

I was recently debugging a connection issue where the client was getting repeatedly disconnected from the Coordinator, but since we never send any error information it was really hard without server logs.

This PR aims to correct that, by sending a CoordinateResponse with `Error` set in cases where we disconnect a client without them asking us to.

It also logs the error whenever we get one in the client controller.
2025-04-21 11:40:56 +04:00
ケイラ f670bc31f5 chore: update testutil chan helpers (#17408) 2025-04-16 10:37:09 -06:00
Mathias Fredriksson b79167293c chore(Makefile): update golden files as part of make gen (#17039)
Updating golden files is an unnecessary extra step in addition to gen
that is easily overlooked, leading to the developer noticing the issue
in CI leading to lost developer time waiting for tests to complete.
2025-03-21 13:04:30 +00:00
Mathias Fredriksson c069563af1 test: fix use of t.Logf where t.Log would suffice (#16328) 2025-01-29 14:35:04 +00:00
Cian Johnston 7b88776403 chore(testutil): add testutil.GoleakOptions (#16070)
- Adds `testutil.GoleakOptions` and consolidates existing options to
this location
- Pre-emptively adds required ignore for this Dependabot PR to pass CI
https://github.com/coder/coder/pull/16066
2025-01-08 15:38:37 +00:00
Spike Curtis 63572d9f53 fix: loosen timing checks for heartbeats (#15923)
Fixes #15782.

I believe that Windows doesn't always have high-resolution timers available, so this PR loosens the check for PG Coordinator heartbeats, to avoid flakes like:

https://github.com/coder/coder/actions/runs/12397381823/job/34607639048
2024-12-19 13:49:01 +04:00
Hugo Dutka 83c493e832 chore: fix more flaky tests on Windows with Postgres (#15629)
Addresses the following flakes:

- https://github.com/coder/internal/issues/222
- https://github.com/coder/internal/issues/223
- https://github.com/coder/internal/issues/224
- https://github.com/coder/internal/issues/225
- https://github.com/coder/internal/issues/226
- https://github.com/coder/internal/issues/227
- https://github.com/coder/internal/issues/228
- https://github.com/coder/internal/issues/229
- https://github.com/coder/internal/issues/230
2024-11-26 11:56:07 +01:00
Dean Sheather fbe2fa66f5 chore: add test for coord rolling restart (#14680)
Closes https://github.com/coder/team-coconut/issues/50

---------

Co-authored-by: Ethan Dickson <ethan@coder.com>
2024-11-20 18:04:33 +11:00
Spike Curtis 5861e516b9 chore: add standard test logger ignoring db canceled (#15556)
Refactors our use of `slogtest` to instantiate a "standard logger" across most of our tests.  This standard logger incorporates https://github.com/coder/slog/pull/217 to also ignore database query canceled errors by default, which are a source of low-severity flakes.

Any test that has set non-default `slogtest.Options` is left alone. In particular, `coderdtest` defaults to ignoring all errors. We might consider revisiting that decision now that we have better tools to target the really common flaky Error logs on shutdown.
2024-11-18 14:09:22 +04:00
Spike Curtis 8c00ebc6ee chore: refactor ServerTailnet to use tailnet.Controllers (#15408)
chore of #14729

Refactors the `ServerTailnet` to use `tailnet.Controller` so that we reuse logic around reconnection and handling control messages, instead of reimplementing.  This unifies our "client" use of the tailscale API across CLI, coderd, and wsproxy.
2024-11-08 13:18:56 +04:00
Hugo Dutka 1bfa7d42e8 chore: add postgres template caching for tests (#15336)
This PR is the first in a series aimed at closing
[#15109](https://github.com/coder/coder/issues/15109).

### Changes

- **Template Database Creation:**  
`dbtestutil.Open` now has the ability to create a template database if
none is provided via `DB_FROM`. The template database’s name is derived
from a hash of the migration files, ensuring that it can be reused
across tests and is automatically updated whenever migrations change.

- **Optimized Database Handling:**  
Previously, `dbtestutil.Open` would spin up a new container for each
test when `DB_FROM` was unset. Now, it first checks for an active
PostgreSQL instance on `localhost:5432`. If none is found, it creates a
single container that remains available for subsequent tests,
eliminating repeated container startups.

These changes address the long individual test times (10+ seconds)
reported by some users, likely due to the time Docker took to start and
complete migrations.
2024-11-04 17:23:31 +01:00
Ethan b1298a3c1e feat: add WorkspaceUpdates tailnet RPC (#14847)
Closes #14716
Closes #14717

Adds a new user-scoped tailnet API endpoint (`api/v2/tailnet`) with a new RPC stream for receiving updates on workspaces owned by a specific user, as defined in #14716. 

When a stream is started, the `WorkspaceUpdatesProvider` will begin listening on the user-scoped pubsub events implemented in #14964. When a relevant event type is seen (such as a workspace state transition), the provider will query the DB for all the workspaces (and agents) owned by the user. This gets compared against the result of the previous query to produce a set of workspace updates. 

Workspace updates can be requested for any user ID, however only workspaces the authorised user is permitted to `ActionRead` will have their updates streamed.
Opening a tunnel to an agent requires that the user can perform `ActionSSH` against the workspace containing it.
2024-11-01 14:53:53 +11:00
Spike Curtis 7d9f5ab81d chore: add Coder service prefix to tailnet (#14943)
re: #14715

This PR introduces the Coder service prefix: `fd60:627a:a42b::/48` and refactors our existing code as calling the Tailscale service prefix explicitly (rather than implicitly).

Removes the unused `Addresses` agent option. All clients today assume they can compute the Agent's IP address based on its UUID, so an agent started with a custom address would break things.
2024-10-04 10:04:10 +04:00
Spike Curtis d6154c4310 chore: remove tailnet v1 API support (#14641)
Drops support for v1 of the tailnet API, which was the original coordination protocol where we only sent node updates, never marked them lost or disconnected.

v2 of the tailnet API went GA for CLI clients in Coder 2.8.0, so clients older than that would stop working.
2024-09-12 07:56:31 +04:00
Spike Curtis fb3523b37f chore: remove legacy AgentIP address (#14640)
Removes the support for the Agent's "legacy IP" which was a hardcoded IP address all agents used to use, before we introduced "single tailnet". Single tailnet went GA in 2.7.0.
2024-09-12 07:40:19 +04:00
Spike Curtis 7b39f6b0d4 fix: improves coordination logging (#14556) 2024-09-04 15:10:43 +04:00
Dean Sheather cf8be4eac5 feat: add resume support to coordinator connections (#14234) 2024-08-20 17:16:49 +10:00
Jon Ayers 4fc047954e fix: avoid deleting peers on graceful close (#14165)
* fix: avoid deleting peers on graceful close

- Fixes an issue where a coordinator deletes all
  its peers on shutdown. This can cause disconnects
  whenever a coderd is redeployed.
2024-08-14 15:16:08 -04:00
Spike Curtis e5268e4551 chore: spin clock library out to coder/quartz repo (#13777)
Code that was in `/clock` has been moved to github.com/coder/quartz.  This PR refactors our use of the clock library to point to the external Quartz repo.
2024-07-03 15:02:54 +04:00
Dean Sheather 6c94dd4f23 chore: add DRPC server implementation for network telemetry (#13675) 2024-07-02 01:50:52 +10:00
Spike Curtis ce7f13c6c3 fix: fix TestPGCoordinatorSingle_MissedHeartbeats flake (#13686) 2024-06-27 19:17:24 +04:00
Steven Masley 5ccf5084e8 chore: create type for unique role names (#13506)
* chore: create type for unique role names

Using `string` was confusing when something should be combined with
org context, and when not to. Naming this new name, "RoleIdentifier"
2024-06-11 08:55:28 -05:00
Spike Curtis 8326a3a675 chore: change mock clock to allow Advance() within timer/tick functions (#13500) 2024-06-10 15:27:24 +04:00
Spike Curtis a0962ba089 fix: wait for PGCoordinator to clean up db state (#13351)
c.f. https://github.com/coder/coder/pull/13192#issuecomment-2097657692

We need to wait for PGCoordinator to finish its work before returning on `Close()`, so that we delete database state (best effort -- if this fails others will filter it out based on heartbeats).
2024-05-24 12:01:03 +04:00
Steven Masley 1f5788feff chore: remove rbac psuedo resources, add custom verbs (#13276)
Removes our pseudo rbac resources like `WorkspaceApplicationConnect` in favor of additional verbs like `ssh`. This is to make more intuitive permissions for building custom roles.

The source of truth is now `policy.go`
2024-05-15 11:09:42 -05:00
Steven Masley cb6b5e8fbd chore: push rbac actions to policy package (#13274)
Just moved `rbac.Action` -> `policy.Action`. This is for the stacked PR to not have circular dependencies when doing autogen. Without this, the autogen can produce broken golang code, which prevents the autogen from compiling.

So just avoiding circular dependencies. Doing this in it's own PR to reduce LoC diffs in the primary PR, since this has 0 functional changes.
2024-05-15 09:46:35 -05:00
Colin Adler 205c43da99 fix(enterprise): mark nodes from unhealthy coordinators as lost (#13123)
Instead of removing the mappings of unhealthy coordinators entirely,
mark them as lost instead. This prevents peers from disappearing from
other peers if a coordinator misses a heartbeat.
2024-05-03 14:07:29 -05:00
Colin Adler 777dfbe965 feat(enterprise): add ready for handshake support to pgcoord (#12935) 2024-04-16 15:01:10 -05:00
Spike Curtis 06eae954c9 fix: stop sending DeleteTailnetPeer when coordinator is unhealthy (#12925)
fixes #12923

Prevents Coordinate peer connections from generating spurious database queries like DeleteTailnetPeer when the coordinator is unhealthy.

It does this by checking the health of the querier before accepting a connection, rather than unconditionally accepting it only for it to get swatted down later.
2024-04-10 22:49:13 +04:00
Colin Adler 4d5a7b2d56 chore(codersdk): move all tailscale imports out of codersdk (#12735)
Currently, importing `codersdk` just to interact with the API requires
importing tailscale, which causes builds to fail unless manually using
our fork.
2024-03-26 12:44:31 -05:00
Steven Masley bd6ad88077 chore: nolint always return error function (#12701) 2024-03-21 09:35:10 -05:00
Steven Masley 131d0bd2ba chore: fix linting issue in main(#12697) 2024-03-20 20:15:01 -05:00
Colin Adler e5d911462f fix(tailnet): enforce valid agent and client addresses (#12197)
This adds the ability for `TunnelAuth` to also authorize incoming wireguard node IPs, preventing agents from reporting anything other than their static IP generated from the agent ID.
2024-03-01 09:02:33 -06:00
Cian Johnston a2cbb0f87f fix(enterprise/coderd): check provisionerd API version on connection (#12191) 2024-02-16 18:43:07 +00:00
Spike Curtis 627232eae9 fix: fix pgcoord to delete coordinator row last (#12155)
Fixes #12141
Fixes #11750

PGCoord shutdown was uncoordinated, so an update at an inopportune time during shutdown would be rejected because the coordinator row was already deleted.

This PR ensures that the PGCoord subcomponents that write updates are shut down before we take down the heartbeats, which is responsible for deleting the coordinator row.
2024-02-15 16:34:29 +04:00
Spike Curtis 1c8b803785 feat: add logging to pgcoord subscribe/unsubscribe (#11952)
Adds logging to unsubscribing from peer and tunnel updates in pgcoordinator, since #11950 seems to be problem with these subscriptions
2024-01-31 12:15:58 +04:00
Spike Curtis 520b12e1a2 fix: close MultiAgentConn when coordinator closes (#11941)
Fixes an issue where a MultiAgentConn isn't closed properly when the coordinator it is connected to is closed.

Since servertailnet checks whether the conn is closed before reinitializing, it is important that we check this, otherwise servertailnet can get stuck if the coordinator closes (e.g. when we switch from AGPL to PGCoordinator after decoding a license).
2024-01-31 00:38:19 +04:00
Cian Johnston ecae6f9135 fix(enterprise/tailnet): handle query canceled error in sendBeat() (#11794) 2024-01-24 18:42:05 +00:00
Spike Curtis f01cab9894 feat: use tailnet v2 API for coordination (#11638)
This one is huge, and I'm sorry.

The problem is that once I change `tailnet.Conn` to start doing v2 behavior, I kind of have to change it everywhere, including in CoderSDK (CLI), the agent, wsproxy, and ServerTailnet.

There is still a bit more cleanup to do, and I need to add code so that when we lose connection to the Coordinator, we mark all peers as LOST, but that will be in a separate PR since this is big enough!
2024-01-22 11:07:50 +04:00
Spike Curtis 8910ac715c feat: add tailnet v2 support to wsproxy coordinate endpoint (#11637)
wsproxy also needs to be updated to use tailnet v2 because the `tailnet.Conn` stores peers by ID, and the peerID was not being carried by the JSON protocol.  This adds a query param to the endpoint to conditionally switch to the new protocol.
2024-01-18 10:10:36 +04:00
Spike Curtis cae095fdb6 fix: stop logging errors on canceled cleanup queries (#11547)
Fixes flake seen here: https://github.com/coder/coder/actions/runs/7474259128/job/20340051975
2024-01-10 16:20:29 +04:00