Commit Graph

1715 Commits

Author SHA1 Message Date
Spike Curtis af3fdc68c3 chore: refactor agent routines that use the v2 API (#12223)
In anticipation of needing the `LogSender` to run on a context that doesn't get immediately canceled when you `Close()` the agent, I've undertaken a little refactor to manage the goroutines that get run against the Tailnet and Agent API connection.

This handles controlling two contexts, one that gets canceled right away at the start of graceful shutdown, and another that stays up to allow graceful shutdown to complete.
2024-02-23 11:04:23 +04:00
Cian Johnston 53e8f9c0f9 fix(coderd): only allow untagged provisioners to pick up untagged jobs (#12269)
Alternative solution to #6442

Modifies the behaviour of AcquireProvisionerJob and adds a special case for 'un-tagged' jobs such that they can only be picked up by 'un-tagged' provisioners.

Also adds comprehensive test coverage for AcquireJob given various combinations of tags.
2024-02-22 15:04:31 +00:00
Steven Masley d4d8424ce0 fix: fix GetOrganizationsByUserID error when multiple organizations exist (#12257)
* test: fetching user orgs fails if multi orgs in pg db
* fix: GetOrganizationsByUserID fixed if multi orgs exist
2024-02-22 08:14:48 -06:00
Steven Masley c3a7b13690 chore: remove organization requirement from convertGroup() (#12195)
* feat: convertGroups() no longer requires organization info

Removing role information from some users in the api. This info is
excessive and not required. It is costly to always include
2024-02-21 15:58:11 -06:00
Kayla Washburn-Love 475c3650ca feat: add support for optional external auth providers (#12021) 2024-02-21 11:18:38 -07:00
Bruno Quaresma a827185b6d refactor: move auto fill feature into an experiment (#12230) 2024-02-21 11:48:34 -03:00
Asher 3d742f64e6 fix: move oauth2 routes (#12240)
* fix: move oauth2 routes

From /login/oauth2/* to /oauth2/*.

/login/oauth2 causes /login to no longer get served by the frontend,
even if nothing is actually served on /login itself.

* Add forgotten comment on delete
2024-02-20 17:01:25 -09:00
Asher 4d39da294e feat: add oauth2 token exchange (#12196)
Co-authored-by: Steven Masley <stevenmasley@gmail.com>
2024-02-20 14:58:43 -09:00
Steven Masley 07cccf9033 feat: disable directory listings for static files (#12229)
* feat: disable directory listings for static files

Static file server handles serving static asset files (js, css, etc).
The default file server would also list all files in a directory.
This has been changed to only serve files.
2024-02-20 15:50:30 -06:00
Steven Masley 2dac34276a fix: add postgres triggers to remove deleted users from user_links (#12117)
* chore: add database test fixture to insert non-unique linked_ids
* chore: create unit test to exercise failed email change bug
* fix: add postgres triggers to keep user_links clear of deleted users
* Add migrations to prevent deleted users with links
* Force soft delete of users, do not allow un-delete
2024-02-20 13:19:38 -06:00
Garrett Delfosse b342bd7869 feat: add port sharing frontend (#12119) 2024-02-20 13:26:34 -05:00
Cian Johnston 643c3ee54b refactor(provisionerd): move provisionersdk.VersionCurrent -> provisionerdproto.VersionCurrent (#12225) 2024-02-20 12:44:19 +00:00
Cian Johnston a2cbb0f87f fix(enterprise/coderd): check provisionerd API version on connection (#12191) 2024-02-16 18:43:07 +00:00
Steven Masley f17149c59d feat: set groupsync to use default org (#12146)
* fix: assign new oauth users to default org

This is not a final solution, as we eventually want to be able
to map to different orgs. This makes it so multi-org does not break oauth/oidc.
2024-02-16 11:09:19 -06:00
Steven Masley 75870c22ab fix: assign new oauth users to default org (#12145)
* fix: assign new oauth users to default org

This is not a final solution, as we eventually want to be able
to map to different orgs. This makes it so multi-org does not break oauth/oidc.
2024-02-16 08:47:26 -06:00
Steven Masley 2a8004b1b2 feat: use default org for PostUser (#12143)
Instead of assuming only 1 org exists, this uses the
is_default org to place a user in if not specified.
2024-02-16 08:28:36 -06:00
Steven Masley 2bf2f88b09 feat: implement 'is_default' org field (#12142)
The first organization created is now marked as "default". This is
to allow "single org" behavior as we move to a multi org codebase.

It is intentional that the user cannot change the default org at this
stage. Only 1 default org can exist, and it is always the first org.

Closes: https://github.com/coder/coder/issues/11961
2024-02-15 11:01:16 -06:00
Marcin Tojek 5aa5ff1bde chore: deprecate API workspace build resources (#12167) 2024-02-15 17:13:44 +01:00
Marcin Tojek 7a453608c9 feat: support order property of coder_agent (#12121) 2024-02-15 13:33:13 +01:00
Spike Curtis 2d0b9106c0 fix: change servertailnet to register the DERP dialer before setting DERP map (#12137)
I noticed a possible race where tailnet.Conn can try to dial the embedded region before we've set our custom dialer that send the DERP in-memory.  This closes that race and adds a test case for servertailnet with no STUN and an embedded relay
2024-02-15 10:51:12 +04:00
Cian Johnston d6b025db14 Revert "feat: add activity status and autostop reason to workspace overview (#11987)" (#12144)
Related to https://github.com/coder/coder/pull/11987

This reverts commit d37b131.
2024-02-14 17:14:49 +00:00
Spike Curtis 04991f425a fix: set node callback each time we reinit the coordinator in servertailnet (#12140)
I think this will resolve #12136 but lets get a proper test at the system level before closing.

Before this change, we only register the node callback at start of day for the server tailnet.  If the coordinator changes, like we know happens when we are licensed for the PGCoordinator, we close the connection to the old coord, and open a new one to the new coord.

The callback is designed to direct the updates to the new coordinator, but there is nothing that specifically triggers it to fire after we connect to the new coordinator.

If we have STUN, then period re-STUNs will generally get it to fire eventually, but without STUN it we could go indefinitely without a callback.

This PR changes the servertailnet to re-register the callback each time we reconnect to the coordinator.  Registering a callback (even if it's the same callback) triggers an immediate call with our node information, so the new coordinator will have it.
2024-02-14 20:45:31 +04:00
Spike Curtis 5a0d240bc3 feat: expose DERP server debug metrics (#12135)
Adds some debug endpoints for looking into the DERP server.

The `api/v2/debug/derp/traffic` endpoint requires the `ss` utility to be present in order to function.  I have *not* added the `iproute2` package to our base image as it adds 11MB, so this endpoint won't be useful by default.  However, in a debugging situation, we could exec into the container and then `apk add iproute2`, or build a special debug image.

The `api/v2/debug/expvar` handler contains DERP metrics as well as commandline and memstats.

Example:

```
{
"alert_failed": 0,
"alert_generated": 0,
"cmdline": ["/Users/spike/repos/coder/build/coder_darwin_arm64","--global-config","/Users/spike/repos/coder/.coderv2","server","--http-address","0.0.0.0:3000","--swagger-enable","--access-url","http://127.0.0.1:3000","--dangerous-allow-cors-requests=true"],
"derp": {"accepts": 1, "average_queue_duration_ms": 0, "bytes_received": 0, "bytes_sent": 0, "counter_packets_dropped_reason": {"gone_disconnected": 0, "gone_not_here": 0, "queue_head": 0, "queue_tail": 0, "unknown_dest": 0, "unknown_dest_on_fwd": 0, "write_error": 0}, "counter_packets_dropped_type": {"disco": 0, "other": 0}, "counter_packets_received_kind": {"disco": 0, "other": 0}, "counter_tcp_rtt": {}, "counter_total_dup_client_conns": 0, "gauge_clients_local": 1, "gauge_clients_remote": 0, "gauge_clients_total": 1, "gauge_current_connections": 1, "gauge_current_dup_client_conns": 0, "gauge_current_dup_client_keys": 0, "gauge_current_file_descriptors": 0, "gauge_current_home_connections": 1, "gauge_memstats_sys0": 20874504, "gauge_watchers": 0, "got_ping": 0, "home_moves_in": 0, "home_moves_out": 0, "multiforwarder_created": 0, "multiforwarder_deleted": 0, "packet_forwarder_delete_other_value": 0, "packets_dropped": 0, "packets_forwarded_in": 0, "packets_forwarded_out": 0, "packets_received": 0, "packets_sent": 0, "peer_gone_disconnected_frames": 0, "peer_gone_not_here_frames": 0, "sent_pong": 0, "unknown_frames": 0, "version": "1.47.0-dev20240214-t64db8c604"},
"memstats": {"Alloc":286506256,"TotalAlloc":297594632,"Sys":310621512,"Lookups":0,"Mallocs":304204,"Frees":171570,"HeapAlloc":286506256,"HeapSys":294060032,"HeapIdle":3694592,"HeapInuse":290365440,"HeapReleased":3620864,"HeapObjects":132634,"StackInuse":3735552,"StackSys":3735552,"MSpanInuse":347256,"MSpanSys":358512,"MCacheInuse":9600,"MCacheSys":15600,"BuckHashSys":1469877,"GCSys":9434896,"OtherSys":1547043,"NextGC":551867656,"LastGC":1707892877408883000,"PauseTotalNs":1247000,"PauseNs":[200333,229375,239875,209542,106958,203792,57125,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"PauseEnd":[1707892876217481000,1707892876219726000,1707892876222273000,1707892876226151000,1707892876234815000,1707892877398146000,1707892877408883000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"NumGC":7,"NumForcedGC":0,"GCCPUFraction":0.0022425810335762954,"EnableGC":true,"DebugGC":false,"BySize":[{"Size":0,"Mallocs":0,"Frees":0},{"Size":8,"Mallocs":14396,"Frees":9143},{"Size":16,"Mallocs":89090,"Frees":50507},{"Size":24,"Mallocs":40839,"Frees":24456},{"Size":32,"Mallocs":22404,"Frees":12379},{"Size":48,"Mallocs":51174,"Frees":23718},{"Size":64,"Mallocs":15406,"Frees":3501},{"Size":80,"Mallocs":6688,"Frees":2352},{"Size":96,"Mallocs":2567,"Frees":374},{"Size":112,"Mallocs":19371,"Frees":16883},{"Size":128,"Mallocs":2873,"Frees":1061},{"Size":144,"Mallocs":5600,"Frees":2742},{"Size":160,"Mallocs":2159,"Frees":622},{"Size":176,"Mallocs":454,"Frees":86},{"Size":192,"Mallocs":227,"Frees":128},{"Size":208,"Mallocs":1407,"Frees":732},{"Size":224,"Mallocs":1365,"Frees":1090},{"Size":240,"Mallocs":82,"Frees":48},{"Size":256,"Mallocs":310,"Frees":162},{"Size":288,"Mallocs":1945,"Frees":562},{"Size":320,"Mallocs":1200,"Frees":458},{"Size":352,"Mallocs":133,"Frees":33},{"Size":384,"Mallocs":582,"Frees":51},{"Size":416,"Mallocs":747,"Frees":200},{"Size":448,"Mallocs":113,"Frees":22},{"Size":480,"Mallocs":34,"Frees":21},{"Size":512,"Mallocs":951,"Frees":91},{"Size":576,"Mallocs":364,"Frees":122},{"Size":640,"Mallocs":532,"Frees":270},{"Size":704,"Mallocs":93,"Frees":39},{"Size":768,"Mallocs":83,"Frees":35},{"Size":896,"Mallocs":308,"Frees":175},{"Size":1024,"Mallocs":226,"Frees":122},{"Size":1152,"Mallocs":198,"Frees":100},{"Size":1280,"Mallocs":314,"Frees":171},{"Size":1408,"Mallocs":77,"Frees":47},{"Size":1536,"Mallocs":80,"Frees":54},{"Size":1792,"Mallocs":199,"Frees":107},{"Size":2048,"Mallocs":112,"Frees":48},{"Size":2304,"Mallocs":71,"Frees":32},{"Size":2688,"Mallocs":206,"Frees":81},{"Size":3072,"Mallocs":39,"Frees":15},{"Size":3200,"Mallocs":16,"Frees":7},{"Size":3456,"Mallocs":44,"Frees":29},{"Size":4096,"Mallocs":192,"Frees":83},{"Size":4864,"Mallocs":44,"Frees":25},{"Size":5376,"Mallocs":105,"Frees":43},{"Size":6144,"Mallocs":25,"Frees":5},{"Size":6528,"Mallocs":22,"Frees":7},{"Size":6784,"Mallocs":3,"Frees":0},{"Size":6912,"Mallocs":4,"Frees":2},{"Size":8192,"Mallocs":59,"Frees":10},{"Size":9472,"Mallocs":31,"Frees":12},{"Size":9728,"Mallocs":5,"Frees":2},{"Size":10240,"Mallocs":5,"Frees":0},{"Size":10880,"Mallocs":27,"Frees":11},{"Size":12288,"Mallocs":4,"Frees":1},{"Size":13568,"Mallocs":4,"Frees":2},{"Size":14336,"Mallocs":9,"Frees":2},{"Size":16384,"Mallocs":10,"Frees":2},{"Size":18432,"Mallocs":4,"Frees":2}]},
"warning_failed": 0,
"warning_generated": 0
}
```

If we find the DERP metrics useful we could consider how to include them in Prometheus scrapes based on the tailnet `varz` package.  That's for a later PR if at all.
2024-02-14 15:11:45 +04:00
Steven Masley 5d483a7ea1 fix: do not query user_link for deleted accounts (#12112) 2024-02-13 13:02:21 -06:00
Steven Masley 06f3ab1206 chore: add database test fixture to insert non-unique linked_ids (#12111)
* chore: add database test fixture to insert non-unique linked_ids
2024-02-13 12:06:47 -06:00
Kayla Washburn-Love d37b131426 feat: add activity status and autostop reason to workspace overview (#11987) 2024-02-13 10:50:17 -07:00
Garrett Delfosse 3ab3a62bef feat: add port-sharing backend (#11939) 2024-02-13 09:31:20 -05:00
Dean Sheather e1e352d8c1 feat: add template activity_bump property (#11734)
Allows template admins to configure the activity bump duration. Defaults to 1h.
2024-02-13 07:00:35 +00:00
Dean Sheather fead57f304 fix: allow access to unhealthy/initializing apps (#12086) 2024-02-13 16:30:49 +10:00
Marcin Tojek 3e68650791 feat: support order property of coder_app resource (#12077) 2024-02-12 15:11:31 +01:00
Spike Curtis 92b2e26a48 feat: send log limit exceeded in response, not error (#12078)
When we exceed the db-imposed limit of logs, we need to communicate that back to the agent.  In v1 we did it with a 4xx-level HTTP status, but with dRPC, the errors are delivered as strings, which feels fragile to me for something we want to gracefully handle.

So, this PR adds the log limit exceeded as a field on the response message, and fixes the API handler to set it as appropriate instead of an error.
2024-02-09 16:17:20 +04:00
Spike Curtis 1f5a6d59ba chore: consolidate websocketNetConn implementations (#12065)
Consolidates websocketNetConn from multiple packages in favor of a central one in codersdk
2024-02-09 11:39:08 +04:00
Colin Adler ec8e41f516 chore: add logging around agent app health reporting (#12071) 2024-02-08 23:37:44 -06:00
Marcin Tojek c0e169ebf9 feat: support custom order of agent metadata (#12066) 2024-02-08 17:29:34 +01:00
Spike Curtis 151aaadc23 fix: allow startup scripts larger than 32k (#12060)
Fixes #12057 and adds a regression test.
2024-02-07 22:26:42 +04:00
Eric Paulsen 1abe0cfa1a docs: fix /audit & /insights params (#12043) 2024-02-07 08:38:54 -05:00
Spike Curtis 1cf4b62867 feat: change agent to use v2 API for reporting stats (#12024)
Modifies the agent to use the v2 API to report its statistics, using the `statsReporter` subcomponent.
2024-02-07 15:26:41 +04:00
Spike Curtis 213ae69bee fix: start timer before subscribing to avoid test race (#12031)
Fixes #12030

This is a good example of the kind of thing I'd like to address with a time-testing lib.  The problem is that there is a race between the watchdog starting it's timer and the test incrementing the time.  What would make this easier is if the time-testing library could wait for and assert the call to start the timer before incrementing the time.
2024-02-06 20:21:23 +04:00
Dean Sheather 98b86f3cd6 chore: add logs to pq notification dialer (#12020) 2024-02-06 15:21:48 +00:00
Spike Curtis e09cd2c6bd feat: add watchdog to pubsub (#12011)
adds a watchdog to our pubsub and runs it for Coder server.

If the watchdog times out, it triggers a graceful exit in `coder server` to give any provisioner jobs a chance to shut down.

c.f. #11950
2024-02-06 16:58:45 +04:00
Colin Adler c7f52b73bb feat(coderd): add prometheus metrics to servertailnet (#11988) 2024-02-05 23:57:18 -06:00
Spike Curtis c84a637116 fix: stop logging error on query canceled (#12017)
Fixes flake seen here: https://github.com/coder/coder/actions/runs/7782340530/job/21218566449
2024-02-06 08:43:34 +04:00
Marcin Tojek ad8e0db172 feat: add custom error message on signups disabled page (#11959) 2024-02-01 18:01:25 +01:00
Steven Masley 79d5c238cc fix: always return a clean http client for promoauth (#11963)
* fix: add unit test to verify default client is not broken

* always return a clean http client
* No need to clone the tripper
2024-02-01 11:13:34 -05:00
Spike Curtis 1aa117b9ec chore: rename client Listen to ConnectRPC (#11916)
ConnectRPC seems more appropriate for this function
2024-02-01 14:44:11 +04:00
Spike Curtis d5a98cc6d7 fix: avoid race in TestPGPubsub_Metrics by using Eventually (#11973)
Annoyingly, prometheus Registry collects metrics async, which is causing our test to be racy.  They also don't export enough from the Metric interface for us to replicate a synchronous collect, so we have to use Eventually to test.
2024-02-01 12:10:19 +04:00
Spike Curtis 5a359d50dd feat: add metrics to PGPubsub (#11971)
Adds prometheus metrics to PGPubsub for monitoring its health and performance in production.

Related to #11950 --- additional diagnostics to help figure out what's happening
2024-02-01 11:25:03 +04:00
Colin Adler 3ace7982aa fix: rewrite url to agent ip in single tailnet (#11810)
This restores previous behavior of being able to cache connections
across agents in single tailnet.
2024-02-01 00:25:52 -06:00
Colin Adler 4ed1f5581a chore(coderd): add logging to agent rpc yamux conn (#11965) 2024-01-31 23:17:20 -06:00
Spike Curtis b79785c86f feat: move agent v2 API connection monitoring to yamux layer (#11910)
Moves monitoring of the agent v2 API connection to the yamux layer.

Present behavior monitors this at the websocket layer, and closes the websocket on completion. This can cause yamux to hit unexpected errors since the connection is closed underneath it.

This might be the cause of yamux errors that some customers are seeing

![image.png](https://graphite-user-uploaded-assets-prod.s3.amazonaws.com/tCz4CxRU9jhAJ7zH8RTi/53b8b5ef-e9e5-44a5-b559-99c37c136071.png)

In any case, it's more graceful to close yamux first and let yamux close the underlying websocket.  That should limit yamux error logging to truly unexpected/error cases.

The only downside is that the yamux `Close()` doesn't accept a reason, so if the agent becomes outdated and we close the API connection, the agent just sees the connection close without a reason.  I'm not sure we log this at the agent anyway, but it would be nice.  I think more accurate logging on Coderd are more important.

I've also added some logging when the monitor disconnects for reasons other than the context being canceled (e.g. agent outdated, failed pings).
2024-02-01 08:18:35 +04:00