Commit Graph

359 Commits

Author SHA1 Message Date
Cian Johnston 172e52317c feat(agent): wire up agentssh server to allow exec into container (#16638)
Builds on top of https://github.com/coder/coder/pull/16623/ and wires up
the ReconnectingPTY server. This does nothing to wire up the web
terminal yet but the added test demonstrates the functionality working.

Other changes:
* Refactors and moves the `SystemEnvInfo` interface to the
`agent/usershell` package to address follow-up from
https://github.com/coder/coder/pull/16623#discussion_r1967580249
* Marks `usershellinfo.Get` as deprecated. Consumers should use the
`EnvInfoer` interface instead.

---------

Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
Co-authored-by: Danny Kopping <danny@coder.com>
2025-02-26 09:03:27 +00:00
Cian Johnston 304007b5ea feat(agent/agentcontainers): add ContainerEnvInfoer (#16623)
This PR adds an alternative implementation of EnvInfo
(https://github.com/coder/coder/pull/16603) that reads information from
a running container.

---------

Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
2025-02-24 15:05:15 +00:00
Thomas Kosiewski 660746462e fix(agent/agentssh): use deterministic host key for SSH server (#16626)
Fixes: https://github.com/coder/coder/issues/16490

The Agent's SSH server now initially generates fixed host keys and, once it receives its manifest, generates and replaces that host key with the one derived from the workspace ID, ensuring consistency across agent restarts. This prevents SSH warnings and host key verification errors when connecting to workspaces through Coder Desktop.

While deterministic keys might seem insecure, the underlying Wireguard tunnel already provides encryption and anti-spoofing protection at the network layer, making this approach acceptable for our use case.

---
Change-Id: I8c7e3070324e5d558374fd6891eea9d48660e1e9
Signed-off-by: Thomas Kosiewski <tk@coder.com>
2025-02-21 14:58:41 +01:00
Mathias Fredriksson b07b33ec9d feat: add agentapi endpoint to report connections for audit (#16507)
This change adds a new `ReportConnection` endpoint to the `agentapi`.

The protocol version was bumped previously, so it has been omitted here.

This allows the agent to report connection events, for example when the
user connects to the workspace via SSH or VS Code.

Updates #15139
2025-02-20 14:52:01 +02:00
Mathias Fredriksson 9f5ad23644 refactor(agent/agentssh): move parsing of magic session and create type (#16630)
This change refactors the parsing of MagicSessionEnvs in the agentssh
package and moves the logic to an earlier stage. Also intoduces enums
for MagicSessionType.

Refs #15139
2025-02-19 22:18:31 +02:00
Cian Johnston 4edd77bc82 chore(agent/agentssh): extract CreateCommandDeps (#16603)
Extracts environment-level dependencies of
`agentssh.Server.CreateCommand()` to an interface to allow alternative
implementations to be passed in.
2025-02-19 09:03:59 +00:00
Vincent Vielle bc609d0056 feat: integrate agentAPI with resources monitoring logic (#16438)
As part of the new resources monitoring logic - more specifically for
OOM & OOD Notifications , we need to update the AgentAPI , and the
agents logic.

This PR aims to do it, and more specifically :  
We are updating the AgentAPI & TailnetAPI to version 24 to add two new
methods in the AgentAPI :
- One method to fetch the resources monitoring configuration
- One method to push the datapoints for the resources monitoring.

Also, this PR adds a new logic on the agent side, with a routine running
and ticking - fetching the resources usage each time , but also storing
it in a FIFO like queue.

Finally, this PR fixes a problem we had with RBAC logic on the resources
monitoring model, applying the same logic than we have for similar
entities.
2025-02-14 10:28:15 +01:00
Cian Johnston 35901028d2 feat(agent): add CODER_AGENT_DEVCONTAINERS_ENABLE option (#16525) 2025-02-11 15:29:59 +00:00
Cian Johnston 140f2a9013 chore(agent/agentcontainers): skip TestDockerCLIContainerLister by default (#16502)
Addresses a test flake seen here:
https://github.com/coder/coder/actions/runs/13239819615/job/36952521742

Also addresses the case where we would try to run `docker inspect` with
no container IDs, which is a silly thing to do.
2025-02-10 12:12:32 +00:00
Cian Johnston 31b1ff7d3b feat(agent): add container list handler (#16346)
Fixes https://github.com/coder/coder/issues/16268

- Adds `/api/v2/workspaceagents/:id/containers` coderd endpoint that allows listing containers
visible to the agent. Optional filtering by labels is supported.
- Adds go tools to the `coder-dylib` CI step so we can generate mocks if needed
2025-02-10 11:29:30 +00:00
Mathias Fredriksson 9520da338e fix: conform to stricter printf usage in Go 1.24 (#16330) 2025-01-29 18:06:22 +02:00
Mathias Fredriksson c069563af1 test: fix use of t.Logf where t.Log would suffice (#16328) 2025-01-29 14:35:04 +00:00
Cian Johnston e8e16c9404 chore(agent/agentscripts): log command cancellation (#16324)
Relates to https://github.com/coder/internal/issues/329

It's currently unclear where the SIGHUP came from; adding some logging
to make it more clear if it happens again in future.

---------

Co-authored-by: Danny Kopping <danny@coder.com>
2025-01-29 13:37:43 +00:00
Thomas Kosiewski 1336925c9f feat(flake.nix): switch dogfood dev image to buildNixShellImage from dockerTools (#16223)
Replace Depot build action with Nix for Nix dogfood image builds

The dogfood Nix image is now built using Nix's native container tooling instead of Depot. This change:

- Adds Nix setup steps to the GitHub Actions workflow
- Removes the Dockerfile.nix in favor of a Nix-native container build
- Updates the flake.nix to support building Docker images
- Introduces a hash file to track Nix-related changes
- Updates the vendorHash for Go dependencies

Change-Id: I4e011fe3a19d9a1375fbfd5223c910e59d66a5d9
Signed-off-by: Thomas Kosiewski <tk@coder.com>
2025-01-28 16:38:37 +01:00
Cian Johnston 7b88776403 chore(testutil): add testutil.GoleakOptions (#16070)
- Adds `testutil.GoleakOptions` and consolidates existing options to
this location
- Pre-emptively adds required ignore for this Dependabot PR to pass CI
https://github.com/coder/coder/pull/16066
2025-01-08 15:38:37 +00:00
Vincent Vielle 289338f19e feat(site): connect open_in parameter (#16036)
Second step to resolve [open_in
issue](https://github.com/coder/terraform-provider-coder/issues/297)

This PR improves the way the open_in parameter is forwarded across the
code, changing the last `string` to const everywhere.

Also make sure it is available and forwarded up to the `CreateLink`
component.
2025-01-07 18:08:03 +01:00
Vincent Vielle 08463c27d8 feat: add OpenIn option to coder_app (#15743)
This PR is the coder/coder part of [the open_in parameter
issue](https://github.com/coder/terraform-provider-coder/issues/297)
aiming to add a new optional parameter to choose how to open modules.

This PR is heavily linked [to this
PR](https://github.com/coder/terraform-provider-coder/pull/321).

ℹ️ For now, some integrations tests can not be pushed as it requires a
release on the terraform-provider repo.
2025-01-03 11:27:02 +01:00
Mathias Fredriksson 4c5b737368 fix: accumulate agentstats until reported and fix insights DAU offset (#15832) 2024-12-18 11:26:38 +02:00
Jon Ayers 354d0fc4c8 fix: filter agent-exec env vars (#15764)
- Filters env vars specific to agent-exec from the exec'd process. This
is to prevent any issues when developing Coder in Coder, particularly
agent tests in the cli pkg.
2024-12-05 16:33:27 +00:00
Jon Ayers f8d938d299 fix: fix oom_score adjustments failing if caps set (#15758)
- Fixes an issue where oom scores would fail to be adjusted in cases
where the `coder` binary has capabilities set on it. This is because
`PR_SET_DUMPABLE` is set to `0` when a process is executed with elevated
capabilities. The fix is to flip `PR_SET_DUMPABLE` to `1` prior to
writing to `oom_score_adj`.
2024-12-05 15:30:58 +02:00
Jon Ayers ce573b9faa fix: add agent exec abstraction (#15717) 2024-12-04 23:30:25 +02:00
Cian Johnston 74f7961018 chore(agent/agentexec): fix flake in agent/agentexec test (#15681)
Should hopefully fix https://github.com/coder/internal/issues/233
2024-11-28 14:28:10 +00:00
Jon Ayers 24d44b4518 fix: add additional context to agent exec errors (#15676) 2024-11-27 21:29:08 +02:00
Jon Ayers 1f238fed59 feat: integrate new agentexec pkg (#15609)
- Integrates the `agentexec` pkg into the agent and removes the
legacy system of iterating over the process tree. It adds some linting
rules to hopefully catch future improper uses of `exec.Command` in the package.
2024-11-27 20:12:15 +02:00
Jon Ayers bbc549d2df feat: add agent exec pkg (#15577) 2024-11-25 17:22:12 +02:00
Spike Curtis 103824f726 fix: fix panic while tearing down reconnecting PTY (#15615)
fixes https://github.com/coder/internal/issues/221

Fixes an issue where two goroutines were sharing the `err` variable, leading to a data race where we'd fail to process the error and then nil-pointer panic.

I ended up refactoring reconnecting PTY stuff into the `reconnectingpty` package, instead of having it on the agent.  That `createTailnet` routine had waaay too many deeply nested goroutines, which is I'm sure a big contributor to the bug appearing in the first place.
2024-11-22 09:46:25 +04:00
Spike Curtis 5861e516b9 chore: add standard test logger ignoring db canceled (#15556)
Refactors our use of `slogtest` to instantiate a "standard logger" across most of our tests.  This standard logger incorporates https://github.com/coder/slog/pull/217 to also ignore database query canceled errors by default, which are a source of low-severity flakes.

Any test that has set non-default `slogtest.Options` is left alone. In particular, `coderdtest` defaults to ignoring all errors. We might consider revisiting that decision now that we have better tools to target the really common flaky Error logs on shutdown.
2024-11-18 14:09:22 +04:00
Cian Johnston 12a9d6336b fix(agent): start rpty lifecycle after all reads/writes (#15535)
Fixes https://github.com/coder/internal/issues/214

#15475 missed that we also write to `rpty` after starting
`rpty.lifecycle()`.
This PR moves the function call right at the end. Hopefully this should
address the data races before we go resorting to mutexes.
2024-11-15 14:48:17 +00:00
Spike Curtis 40802958e9 fix: use explicit api versions for agent and tailnet (#15508)
Bumps the Tailnet and Agent API version 2.3, and creates some extra controls and machinery around these versions.

What happened is that we accidentally shipped two new API features without bumping the version.  `ScriptCompleted` on the Agent API in Coder v2.16 and `RefreshResumeToken` on the Tailnet API in Coder v2.15.

Since we can't easily retroactively bump the versions, we'll roll these changes into API version 2.3 along with the new WorkspaceUpdates RPC, which hasn't been released yet.  That means there is some ambiguity in Coder v2.15-v2.17 about exactly what methods are supported on the Tailnet and Agent APIs.  This isn't great, but hasn't caused us major issues because 

1. RefreshResumeToken is considered optional, and clients just log and move on if the RPC isn't supported. 
2. Agents basically never get started talking to a Coderd that is older than they are, since the agent binary is normally downloaded from Coderd at workspace start.

Still it's good to get things squared away in terms of versions for SDK users and possible edge cases around client and server versions.

To mitigate against this thing happening again, this PR also:

1. adds a CODEOWNERS for the API proto packages, so I'll review changes
2. defines interface types for different API versions, and has the agent explicitly use a specific version.  That way, if you add a new method, and try to use it in the agent without thinking explicitly about versions, it won't compile.

With the protocol controllers stuff, we've sort of already abstracted the Tailnet API such that the interface type strategy won't work, but I'll work on getting the Controller to be version aware, such that it can check the API version it's getting against the controllers it has -- in a later PR.
2024-11-15 11:16:28 +04:00
Cian Johnston b6e7498cb8 fix(agent/reconnectingpty): generate rpty id before starting lifecycle (#15475)
Fixes https://github.com/coder/coder/issues/12687

There was a race condition where we would start the rpty lifecycle
before generating the ID, leading to a data race where we would try to
concurrently read and write the struct field.
2024-11-11 15:02:55 +00:00
Spike Curtis e5661c2748 feat: add support for multiple tunnel destinations in tailnet (#15409)
Closes #14729

Expands the Coordination controller used by the CLI client to allow multiple tunnel destinations (agents).  Our current client uses just one, but this unifies the logic so that when we add Coder VPN, 1 is just a special case of "many."
2024-11-08 13:32:07 +04:00
Spike Curtis 8c00ebc6ee chore: refactor ServerTailnet to use tailnet.Controllers (#15408)
chore of #14729

Refactors the `ServerTailnet` to use `tailnet.Controller` so that we reuse logic around reconnection and handling control messages, instead of reimplementing.  This unifies our "client" use of the tailscale API across CLI, coderd, and wsproxy.
2024-11-08 13:18:56 +04:00
Spike Curtis 886dcbec84 chore: refactor coordination (#15343)
Refactors the way clients of the Tailnet API (clients of the API, which include both workspace "agents" and "clients") interact with the API.  Introduces the idea of abstract "controllers" for each of the RPCs in the API, and implements a Coordination controller by refactoring from `workspacesdk`.

chore re: #14729
2024-11-05 13:50:10 +04:00
Ethan c5a4095610 fix: include custom agent headers in tailnet to support DERP connections (#15145)
Fixes #15131.
2024-10-21 20:59:21 +11:00
Jon Ayers 7da231bc92 fix: fix error handling to prevent spam in proc prio management (#15071) 2024-10-15 02:17:10 +00:00
Spike Curtis 8785a51b09 feat: include Coder service prefix on agents (#14944)
fixes #14715

Configures agents to use an address both in the Tailscale service prefix and the new Coder service prefix. Also modifies the Coordinator auth to allow the new prefix.

Updates `coder/tailscale` to include https://github.com/coder/tailscale/pull/62 which fixes a bug around forwarding TCP connections to localhost.  This functionality is tested in the modifications to `TestAgent_Dial`.
2024-10-04 10:16:33 +04:00
Spike Curtis 7d9f5ab81d chore: add Coder service prefix to tailnet (#14943)
re: #14715

This PR introduces the Coder service prefix: `fd60:627a:a42b::/48` and refactors our existing code as calling the Tailscale service prefix explicitly (rather than implicitly).

Removes the unused `Addresses` agent option. All clients today assume they can compute the Agent's IP address based on its UUID, so an agent started with a custom address would break things.
2024-10-04 10:04:10 +04:00
Danielle Maywood ae522c558d feat: add agent timings (#14713)
* feat: begin impl of agent script timings

* feat: add job_id and display_name to script timings

* fix: increment migration number

* fix: rename migrations from 251 to 254

* test: get tests compiling

* fix: appease the linter

* fix: get tests passing again

* fix: drop column from correct table

* test: add fixture for agent script timings

* fix: typo

* fix: use job id used in provisioner job timings

* fix: increment migration number

* test: behaviour of script runner

* test: rewrite test

* test: does exit 1 script break things?

* test: rewrite test again

* fix: revert change

Not sure how this came to be, I do not recall manually changing
these files.

* fix: let code breathe

* fix: wrap errors

* fix: justify nolint

* fix: swap require.Equal argument order

* fix: add mutex operations

* feat: add 'ran_on_start' and 'blocked_login' fields

* fix: update testdata fixture

* fix: refer to agent_id instead of job_id in timings

* fix: JobID -> AgentID in dbauthz_test

* fix: add 'id' to scripts, make timing refer to script id

* fix: fix broken tests and convert bug

* fix: update testdata fixtures

* fix: update testdata fixtures again

* feat: capture stage and if script timed out

* fix: update migration number

* test: add test for script api

* fix: fake db query

* fix: use UTC time

* fix: ensure r.scriptComplete is not nil

* fix: move err check to right after call

* fix: uppercase sql

* fix: use dbtime.Now()

* fix: debug log on r.scriptCompleted being nil

* fix: ensure correct rbac permissions

* chore: remove DisplayName

* fix: get tests passing

* fix: remove space in sql up

* docs: document ExecuteOption

* fix: drop 'RETURNING' from sql

* chore: remove 'display_name' from timing table

* fix: testdata fixture

* fix: put r.scriptCompleted call in goroutine

* fix: track goroutine for test + use separate context for reporting

* fix: appease linter, handle trackCommandGoroutine error

* fix: resolve race condition

* feat: replace timed_out column with status column

* test: update testdata fixture

* fix: apply suggestions from review

* revert: linter changes
2024-09-24 10:51:49 +01:00
Danielle Maywood 86f68b220e feat: add 'display_name' column to 'workspace_agent_scripts' (#14747)
* feat: add 'display_name' column to 'workspace_agent_scripts'

* fix: backfill from workspace_agent_log_sources

* fix: run 'make gen'
2024-09-20 14:26:13 +01:00
Spike Curtis 2df9a3e554 fix: fix tailnet remoteCoordination to wait for server (#14666)
Fixes #12560

When gracefully disconnecting from the coordinator, we would send the Disconnect message and then close the dRPC stream.  However, closing the dRPC stream can cause the server not to process the Disconnect message, since we use the stream context in a `select` while sending it to the coordinator.

This is a product bug uncovered by the flake, and probably results in us failing graceful disconnect some minority of the time.

Instead, the `remoteCoordination` (and `inMemoryCoordination` for consistency) should send the Disconnect message and then wait for the coordinator to hang up (on some graceful disconnect timer, in the form of a context).
2024-09-16 09:24:30 +04:00
Jon Ayers bfdc29f466 fix: suppress benign errors when listing processes (#14660) 2024-09-12 23:00:04 +01:00
Jon Ayers 9f4972901c fix: avoid logging no such process errors for process priority (#14655) 2024-09-12 18:00:42 +00:00
Spike Curtis fb3523b37f chore: remove legacy AgentIP address (#14640)
Removes the support for the Agent's "legacy IP" which was a hardcoded IP address all agents used to use, before we introduced "single tailnet". Single tailnet went GA in 2.7.0.
2024-09-12 07:40:19 +04:00
Ethan c8580a415a feat: expose current agent connections by type via prometheus (#14612) 2024-09-11 14:13:30 +10:00
Danielle Maywood 25f1ddbf5e feat: add 'hidden' option to 'coder_app' to hide app from UI (#14570)
Add 'hidden' property to 'coder_app' resource to allow hiding apps from the UI.
2024-09-09 14:39:32 +01:00
Mathias Fredriksson 8f07d3357e feat(agent/agentssh): use tcp for X11 forwarding (#14560)
Fixes #14198
2024-09-04 20:06:08 +03:00
Danielle Maywood 839918c5e7 chore(docs): document agent api debug endpoints (#14454)
* chore(docs): add agent api debug docs

* chore(docs): add sections to agent api readme

* chore(docs): link debug manifest to agentsdk.Manifest schema

* chore(docs): add high level overview of agent api debug docs

* chore(docs): link to agent api docs from reference

* chore(docs): fix invalid paths

* chore(docs): use env variable for coder agent debug address
2024-08-28 09:47:14 +01:00
Ethan 8c15192433 feat(cli): add p2p diagnostics to ping (#14426)
First PR to address #14244.

Adds common potential reasons as to why a direct connection to the workspace agent couldn't be established to `coder ping`:
- If the Coder deployment administrator has blocked direction connections (`CODER_BLOCK_DIRECT`).
- If the client has no STUN servers within it's DERP map.
- If the client or agent appears to be behind a hard NAT, as per Tailscale `netInfo.MappingVariesByDestIP`

Also adds a warning if the client or agent has a network interface below the 'safe' MTU for tailnet. This warning is always displayed at the end of a `coder ping`.
2024-08-28 15:39:01 +10:00
Mathias Fredriksson 8c0565177e chore(agent): remove err=<nil> log for batch update metadata complete (#14179)
Co-authored-by: Steven Masley <Emyrk@users.noreply.github.com>
2024-08-07 11:31:47 +00:00
Spike Curtis 70c5c47efd fix: stop blocking fake Agent API channel writes after context expires (#13908) 2024-07-16 23:22:13 +04:00