coder

mirror of https://github.com/coder/coder.git synced 2026-06-02 20:48:20 +00:00

Author	SHA1	Message	Date
Kacper Sawicki	f016d9e505	fix(coderd): add role param to agent RPC to prevent false connectivity (#22052 ) ## Summary coder-logstream-kube and other tools that use the agent token to connect to the RPC endpoint were incorrectly triggering connection monitoring, causing false connected/disconnected timestamps on the agent. This led to VSCode/JetBrains disconnections and incorrect dashboard status. ## Changes Add a `role` query parameter to `/api/v2/workspaceagents/me/rpc`: - `role=agent`: triggers connection monitoring (default for the agent SDK) - any other value (e.g. `logstream-kube`): skips connection monitoring - omitted: triggers monitoring for backward compatibility with older agents The agent SDK now sends `role=agent` by default. A new `Role` field on the `agentsdk.Client` allows non-agent callers to specify a different role. ## Required follow-up coder-logstream-kube needs to set `client.Role = "logstream-kube"` before calling `ConnectRPC20()`. Without that change, it will still send `role=agent` and trigger monitoring. Fixes #21625	2026-02-18 09:44:06 +01:00
Jon Ayers	6035e45cb8	feat: add e2e workspace build duration metric (#21739 ) Adds coderd_template_workspace_build_duration_seconds histogram that tracks the full duration from workspace build creation to agent ready. This captures the complete user-perceived build time including provisioning and agent startup. The metric is emitted when the agent reports ready/error/timeout via the lifecycle API, ensuring each build is counted exactly once per replica.	2026-02-06 16:26:02 -06:00
Zach	2204731ddb	feat: implement boundary usage tracker and telemetry collection (#21716 ) Implements telemetry for boundary usage tracking across all Coder replicas and reports them via telemetry. Changes: - Implement Tracker with Track(), FlushToDB(), and StartFlushLoop() methods - Add telemetry integration via collectBoundaryUsageSummary() - Use telemetry lock to ensure only one replica collects per period The tracker accumulates unique workspaces, unique users, and request counts (allowed/denied) in memory, then flushes to the database periodically. During telemetry collection, stats are aggregated across all replicas and reset for the next period.	2026-01-27 19:11:40 -07:00
Callum Styan	e195856c43	perf: reduce pg_notify call volume by batching together agent metadata updates (#21330 ) --------- Signed-off-by: Callum Styan <callumstyan@gmail.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-01-22 22:47:49 -08:00
Zach	6c49938fca	feat: add template version ID to re-emitted boundary logs (#21636 ) Adds template_version_id to re-emitted boundary audit logs to allow filtering and analysis by specific template versions in addition to the existing template_id field. Since boundary policies are defined in the template, the template version is critical to figuring out which policy was responsible for a boundary's decision in a workspace. Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 15:06:02 -07:00
Spike Curtis	bddb808b25	chore: arrange imports in a standard way (#21452 ) Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example: ``` import ( "context" "time" "github.com/prometheus/client_golang/prometheus" "golang.org/x/xerrors" "gopkg.in/natefinch/lumberjack.v2" "cdr.dev/slog/v3" "github.com/coder/coder/v2/codersdk/agentsdk" "github.com/coder/serpent" ) ``` 3 groups: standard library, 3rd party libs, Coder libs. This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.	2026-01-08 15:24:11 +04:00
Spike Curtis	49b34a716a	fix: fix slog to always use array of Fields (#21426 ) Upgrades to slog v3 which includes a small, but backward incompatible API change to the acceptable call arguments when logging. This change allows us to verify via compile time type checking that arguments are correct and won't cause a panic, as was possible in slog v1, which this replaces (v2 was tagged but never used in coder/coder). It also updates dependencies that also use slog and were updated. I've left the `aibridge` dependency as a commit SHA, under the assumption that the team there (cc @pawbana @dannykopping ) will tag and update the dependency soon and on their own schedule. Other dependencies, I pushed new tags.	2026-01-08 10:29:41 +04:00
Spike Curtis	71c6dc4043	fix: stop disconnecting from coderd early and record disconnect correctly (#21250 ) fixes https://github.com/coder/internal/issues/1196 The above issue exposes two different bugs in Coder. In the agent, there is a race where if the agent is closed while starting up networking, it will erroneously disconnect from Coderd, which delays or breaks writing final status and logs. In Coderd, there is a bug where we don't properly record the latest agent disconnection time if the agent had previously disconnected. This causes us to report the agent status as "Connected" even after it has disconnected up until the inactivity timeout fires. This PR fixes both issues. It also slightly reworks when we send workspace updates based on connection and disconnection. Previously we would send two updates when the agent connected in certain circumstances, even though the status would be the same in both (only times changed). Now we universally only send one on connect, and then another on disconnect.	2025-12-15 12:04:01 +04:00
Callum Styan	27c3ec072e	perf: support fastpath in dbauthz GetLatestWorkspaceBuildByWorkspaceID (#21047 ) This PR piggy backs on the agent API cached workspace added in earlier PRs to provide a fast path for avoiding `GetWorkspaceByID` calls in `GetLatestWorkspaceBuildByWorkspaceID` via injection of the workspaces RBAC object into the context. We can do this from the `agentConnectionMonitor` easily since we already cache the workspace. --------- Signed-off-by: Callum Styan <callumstyan@gmail.com>	2025-12-09 15:53:52 -08:00
Callum Styan	d22d34e45b	fix: pass context with authorization to agentapi (#20959 ) The agentapi context needs to be a context with some amount of authorization attached to it via the context so that the cache refresh routine can fetch the workspace from the db via GetWorkspaceForAgentID. --------- Signed-off-by: Callum Styan <callumstyan@gmail.com>	2025-11-26 14:53:16 -08:00
Callum Styan	b0e8384b82	perf: reduce DB calls to `GetWorkspaceByAgentID` via caching workspace info (#20662 ) --------- Signed-off-by: Callum Styan <callumstyan@gmail.com>	2025-11-25 14:45:05 -08:00
Steven Masley	34c46c0748	chore: rename `service` -> `coder_service`, remove `agent_id` label (#19241 ) Pyroscope uses `service` tag for top level distinction. So move our `service` -> `coder_service`	2025-08-07 13:58:39 -05:00
Ethan	08e17a07fc	chore!: route connection logs to new table (#18340 ) ### Breaking Change (changelog note): > User connections to workspaces, and the opening of workspace apps or ports will no longer create entries in the audit log. Those events will now be included in the 'Connection Log'. Please see the 'Connection Log' page in the dashboard, and the Connection Log [documentation](https://coder.com/docs/admin/monitoring/connection-logs) for details. Those with permission to view the Audit Log will also be able to view the Connection Log. The new Connection Log has the same licensing restrictions as the Audit Log, and requires a Premium Coder deployment. ### Context This is the first PR of a few for moving connection events out of the audit log, and into a new database table and web UI page called the 'Connection Log'. This PR: - Creates the new table - Adds and tests queries for inserting and reading, including reading with an RBAC filter. - Implements the corresponding RBAC changes, such that anyone who can view the audit log can read from the table - Implements, under the enterprise package, a `ConnectionLogger` abstraction to replace the `Auditor` abstraction for these logs. (No-op'd in AGPL, like the `Auditor`) - Routes SSH connection and Workspace App events into the new `ConnectionLogger` - Updates all existing tests to check the values of the `ConnectionLogger` instead of the `Auditor`. Future PRs: - Add filtering to the query - Add an enterprise endpoint to query the new table - Write a query to delete old events from the audit log, call it from dbpurge. - Implement a table in the Web UI for viewing connection logs. > [!NOTE] > The PRs in this stack obviously won't be (completely) atomic. Whilst they'll each pass CI, the stack is designed to be merged all at once. I'm splitting them up for the sake of those reviewing, and so changes can be reviewed as early as possible. Despite this, it's really hard to make this PR any smaller than it already is. I'll be keeping it in draft until it's actually ready to merge.	2025-07-15 14:36:06 +10:00
Danielle Maywood	b712d0b23f	feat(coderd/agentapi): implement sub agent api (#17823 ) Closes https://github.com/coder/internal/issues/619 Implement the `coderd` side of the AgentAPI for the upcoming dev-container agents work. `agent/agenttest/client.go` is left unimplemented for a future PR working to implement the agent side of this feature.	2025-05-29 12:15:47 +01:00
Thomas Kosiewski	93f17bc73e	fix: remove unnecessary user lookup in agent API calls (#17934 ) # Use workspace.OwnerUsername instead of fetching the owner This PR optimizes the agent API by using the `workspace.OwnerUsername` field directly instead of making an additional database query to fetch the owner's username. The change removes the need to call `GetUserByID` in the manifest API and workspace agent RPC endpoints. An issue arose when the agent token was scoped without access to user data (`api_key_scope = "no_user_data"`), causing the agent to fail to fetch the manifest due to an RBAC issue. Change-Id: I3b6e7581134e2374b364ee059e3b18ece3d98b41 Signed-off-by: Thomas Kosiewski <tk@coder.com>	2025-05-20 17:07:50 +02:00
Mathias Fredriksson	b07b33ec9d	feat: add agentapi endpoint to report connections for audit (#16507 ) This change adds a new `ReportConnection` endpoint to the `agentapi`. The protocol version was bumped previously, so it has been omitted here. This allows the agent to report connection events, for example when the user connects to the workspace via SSH or VS Code. Updates #15139	2025-02-20 14:52:01 +02:00
Danielle Maywood	d6b9806098	chore: implement oom/ood processing component (#16436 ) Implements the processing logic as set out in the OOM/OOD RFC.	2025-02-17 16:56:52 +00:00
Spike Curtis	2c7f8ac65f	chore: migrate to coder/websocket 1.8.12 (#15898 ) Migrates us to `coder/websocket` v1.8.12 rather than `nhooyr/websocket` on an older version. Works around https://github.com/coder/websocket/issues/504 by adding an explicit test for `xerrors.Is(err, io.EOF)` where we were previously getting `io.EOF` from the netConn.	2024-12-19 00:51:30 +04:00
Ethan	31506e694b	chore: send workspace pubsub events by owner id (#14964 ) We currently send empty payloads to pubsub channels of the form `workspace:<workspace_id>` to notify listeners of updates to workspaces (such as for refreshing the workspace dashboard). To support https://github.com/coder/coder/issues/14716, we'll instead send `WorkspaceEvent` payloads to pubsub channels of the form `workspace_owner:<owner_id>`. This enables a listener to receive events for all workspaces owned by a user. This PR replaces the usage of the old channels without modifying any existing behaviors. ``` type WorkspaceEvent struct { Kind WorkspaceEventKind `json:"kind"` WorkspaceID uuid.UUID `json:"workspace_id" format:"uuid"` // AgentID is only set for WorkspaceEventKindAgent* events // (excluding AgentTimeout) AgentID *uuid.UUID `json:"agent_id,omitempty" format:"uuid"` } ``` We've defined `WorkspaceEventKind`s based on how the old channel was used, but it's not yet necessary to inspect the types of any of the events, as the existing listeners are designed to fire off any of them. ``` WorkspaceEventKindStateChange WorkspaceEventKind = "state_change" WorkspaceEventKindStatsUpdate WorkspaceEventKind = "stats_update" WorkspaceEventKindMetadataUpdate WorkspaceEventKind = "mtd_update" WorkspaceEventKindAppHealthUpdate WorkspaceEventKind = "app_health" WorkspaceEventKindAgentLifecycleUpdate WorkspaceEventKind = "agt_lifecycle_update" WorkspaceEventKindAgentLogsUpdate WorkspaceEventKind = "agt_logs_update" WorkspaceEventKindAgentConnectionUpdate WorkspaceEventKind = "agt_connection_update" WorkspaceEventKindAgentLogsOverflow WorkspaceEventKind = "agt_logs_overflow" WorkspaceEventKindAgentTimeout WorkspaceEventKind = "agt_timeout" ```	2024-11-01 14:17:05 +11:00
Cian Johnston	5366f2576f	fix(provisionerd/runner): do not log entire resources (#14538 ) fix(coderd/workspaceagentsrpc): do not log entire agent fix(provisionerd/runner): do not log entire resources	2024-09-04 10:23:34 +01:00
Dean Sheather	d2b035312e	chore: fix parse typo for network telemetry (#13971 )	2024-07-22 17:14:37 +00:00
Dean Sheather	6c94dd4f23	chore: add DRPC server implementation for network telemetry (#13675 )	2024-07-02 01:50:52 +10:00
Garrett Delfosse	fed668b432	chore: switch ssh session stats based on experiment (#13637 )	2024-06-25 10:58:45 -04:00
Ethan	dd243686e4	chore!: remove deprecated agent v1 routes (#13486 )	2024-06-11 12:22:59 +10:00
Garrett Delfosse	5789ea5397	chore: move stat reporting into workspacestats package (#13386 )	2024-05-29 11:49:08 -04:00
Spike Curtis	5469011018	fix: stop logging session shutdown as warning (#12922 ) A customer hit like 200k of ErrSessionShutdown, which just dupes any errors we would have generated when shutting down the session for e.g. Ping failures.	2024-04-10 11:50:46 +04:00
Garrett Delfosse	0723dd3abf	fix: ensure agent token is from latest build in middleware (#12443 )	2024-03-14 12:27:32 -04:00
Colin Adler	e5d911462f	fix(tailnet): enforce valid agent and client addresses (#12197 ) This adds the ability for `TunnelAuth` to also authorize incoming wireguard node IPs, preventing agents from reporting anything other than their static IP generated from the agent ID.	2024-03-01 09:02:33 -06:00
Spike Curtis	1f5a6d59ba	chore: consolidate websocketNetConn implementations (#12065 ) Consolidates websocketNetConn from multiple packages in favor of a central one in codersdk	2024-02-09 11:39:08 +04:00
Spike Curtis	c84a637116	fix: stop logging error on query canceled (#12017 ) Fixes flake seen here: https://github.com/coder/coder/actions/runs/7782340530/job/21218566449	2024-02-06 08:43:34 +04:00
Colin Adler	4ed1f5581a	chore(coderd): add logging to agent rpc yamux conn (#11965 )	2024-01-31 23:17:20 -06:00
Spike Curtis	b79785c86f	feat: move agent v2 API connection monitoring to yamux layer (#11910 ) Moves monitoring of the agent v2 API connection to the yamux layer. Present behavior monitors this at the websocket layer, and closes the websocket on completion. This can cause yamux to hit unexpected errors since the connection is closed underneath it. This might be the cause of yamux errors that some customers are seeing ![image.png](https://graphite-user-uploaded-assets-prod.s3.amazonaws.com/tCz4CxRU9jhAJ7zH8RTi/53b8b5ef-e9e5-44a5-b559-99c37c136071.png) In any case, it's more graceful to close yamux first and let yamux close the underlying websocket. That should limit yamux error logging to truly unexpected/error cases. The only downside is that the yamux `Close()` doesn't accept a reason, so if the agent becomes outdated and we close the API connection, the agent just sees the connection close without a reason. I'm not sure we log this at the agent anyway, but it would be nice. I think more accurate logging on Coderd is more important. I've also added some logging when the monitor disconnects for reasons other than the context being canceled (e.g. agent outdated, failed pings).	2024-02-01 08:18:35 +04:00
Spike Curtis	2599850e54	feat: use agent v2 API to post startup (#11877 ) Uses the v2 Agent API to post startup information.	2024-01-30 11:23:28 +04:00
Spike Curtis	207328ca50	feat: use appearance.Fetcher in agentapi (#11770 ) This PR updates the Agent API to use the appearance.Fetcher, which is set by entitlement code in Enterprise coderd. This brings the agentapi into compliance with the Enterprise feature.	2024-01-29 21:22:50 +04:00
Dean Sheather	29707099d7	chore: add agentapi tests (#11269 )	2024-01-26 07:04:19 +00:00
Spike Curtis	f5dbc718a7	fix: accept agent RPC connection without version query parameter (#11790 ) Fixes an issue where Coder v2.7.1 agents connect to /api/v2/workspaceagents/me/rpc without a version query parameter	2024-01-24 09:10:16 +04:00
Spike Curtis	3e0e7f8739	feat: check agent API version on connection (#11696 ) fixes #10531 Adds a check for `version` on connection to the Agent API websocket endpoint. This is primarily for future-proofing, so that up-level agents get a sensible error if they connect to a back-level Coderd. It also refactors the location of the `CurrentVersion` variables, to be part of the `proto` packages, since the versions refer to the APIs defined therein.	2024-01-23 14:27:49 +04:00
Spike Curtis	c9b7d61769	chore: refactor agent connection updates (#11301 ) Refactors the code that handles monitoring an agent websocket with pings and updating the connection times in the DB. Consolidates v1 and v2 agent APIs under the same code for this. One substantive change (not _just_ a refactor) is that I've made it so that we actually disconnect if the agent fails to respond to our pings, rather than the old behavior where we would update the database, but not actually tear down the websocket.	2024-01-02 16:04:37 +04:00
Spike Curtis	36636bb6a5	feat: add tailnet to agent RPC service (#11304 ) Adds tailnet.DRPCService to the agent API Supports #10531 but we still need to add version negotiation to the websocket endpoint	2024-01-02 10:10:20 +04:00
Dean Sheather	e46431078c	feat: add AgentAPI using DRPC (#10811 ) Co-authored-by: Spike Curtis <spike@coder.com>	2023-12-18 22:53:28 +10:00
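
Commit 2204731ddb describes a tracker that accumulates unique workspaces, unique users, and allowed/denied request counts in memory and flushes them periodically. The pattern might look roughly like the sketch below. All names here (`usageTracker`, `usageSnapshot`, `track`, `flush`) are hypothetical and are not the signatures from the commit. The database write, the flush loop, and the cross-replica aggregation are left out, and resetting on every flush is a simplifying assumption.

```go
package main

import "sync"

// usageSnapshot is a hypothetical summary of one accumulation period.
type usageSnapshot struct {
	UniqueWorkspaces int
	UniqueUsers      int
	Allowed          int64
	Denied           int64
}

// usageTracker accumulates usage in memory between flushes.
// Illustrative only; not the Tracker type from the commit.
type usageTracker struct {
	mu         sync.Mutex
	workspaces map[string]struct{}
	users      map[string]struct{}
	allowed    int64
	denied     int64
}

func newUsageTracker() *usageTracker {
	return &usageTracker{
		workspaces: make(map[string]struct{}),
		users:      make(map[string]struct{}),
	}
}

// track records request counts for one workspace and its owner.
// Sets deduplicate IDs, so repeated calls count each ID once.
func (t *usageTracker) track(workspaceID, userID string, allowed, denied int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.workspaces[workspaceID] = struct{}{}
	t.users[userID] = struct{}{}
	t.allowed += allowed
	t.denied += denied
}

// flush returns the accumulated totals and resets the tracker.
// A real implementation would hand the snapshot to a database write.
func (t *usageTracker) flush() usageSnapshot {
	t.mu.Lock()
	defer t.mu.Unlock()
	snap := usageSnapshot{
		UniqueWorkspaces: len(t.workspaces),
		UniqueUsers:      len(t.users),
		Allowed:          t.allowed,
		Denied:           t.denied,
	}
	t.workspaces = make(map[string]struct{})
	t.users = make(map[string]struct{})
	t.allowed, t.denied = 0, 0
	return snap
}
```

A caller would invoke `track` on each request and call `flush` from a periodic loop. The mutex keeps both safe to use from concurrent goroutines.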

40 Commits