Closes#21440
The `TestDBPurgeAuthorization` test was overfitting by calling each
purge method individually, which reimplemented dbpurge logic in the test
and created a maintenance burden. When new purge steps are added, they
either need to be reflected in the test or there will be a testing
blindspot.
This change extracts the `doTick` closure into an exported `PurgeTick`
function that returns an error, making the core purge logic testable.
The test now calls `PurgeTick` directly to exercise the actual dbpurge
behavior rather than reimplementing it. Retention values are configured
to ensure all purge operations run, so we test RBAC permissions for all
code paths.
- Tests actual dbpurge behavior instead of reimplementing it
- Automatically covers new purge steps when they're added
- Still validates that all operations have proper RBAC permissions
The test focuses on authorization (checking for RBAC errors) rather than
verifying deletion behavior, which is already covered by other tests
like `TestDeleteExpiredAPIKeys` and `TestDeleteOldAuditLogs`.
Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example:
```
import (
"context"
"time"
"github.com/prometheus/client_golang/prometheus"
"golang.org/x/xerrors"
"gopkg.in/natefinch/lumberjack.v2"
"cdr.dev/slog/v3"
"github.com/coder/coder/v2/codersdk/agentsdk"
"github.com/coder/serpent"
)
```
3 groups: standard library, 3rd partly libs, Coder libs.
This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.
Upgrades to slog v3 which includes a small, but backward incompatible API change to the acceptible call arguments when logging. This change allows us to verify via compile time type checking that arguments are correct and won't cause a panic, as was possible in slog v1, which this replaces (v2 was tagged but never used in coder/coder).
It also updates dependencies that also use slog and were updated.
I've left the `aibridge` dependency as a commit SHA, under the assumption that the team there (cc @pawbana @dannykopping ) will tag and update the dependency soon and on their own schedule.
Other dependencies, I pushed new tags.
Related to
[`internal#1139`](https://github.com/coder/internal/issues/1139)
Continuation of #21074
This implements some RBAC role specificity for `dbpurge`, ensuring that
we follow the least-privileged model for removing data from the
database. It is specified as following.
```go
Site: rbac.Permissions(map[string][]policy.Action{
// DeleteOldWorkspaceAgentLogs
// DeleteOldWorkspaceAgentStats
// DeleteOldProvisionerDaemons
// DeleteOldTelemetryLocks
// DeleteOldAuditLogConnectionEvents
// DeleteOldConnectionLogs
rbac.ResourceSystem.Type: {policy.ActionDelete},
// DeleteOldNotificationMessages
rbac.ResourceNotificationMessage.Type: {policy.ActionDelete},
// ExpirePrebuildsAPIKeys
// DeleteExpiredAPIKeys
rbac.ResourceApiKey.Type: {policy.ActionDelete},
// DeleteOldAIBridgeRecords
rbac.ResourceAibridgeInterception.Type: {policy.ActionDelete},
}),
```
| Position | Pull-request |
| -------- | ------------ |
| | [feat: add prometheus observability metrics for
`dbpurge`](https://github.com/coder/coder/pull/21074) |
| ✅ | [feat: add rbac specificity for
`dbpurge`](https://github.com/coder/coder/pull/21088) |
Related to
[`internal#1139`](https://github.com/coder/internal/issues/1139)
This implements some prometheus metrics for records being removed from
the database. Currently we're tracking the following fields being
removed from the DB by this. They're viewable in the
`/api/v2/debug/metrics` endpoint.
* `expired_api_keys`
* `aibridge_records`
* `connection_logs`
* `duration`
```
# HELP coderd_dbpurge_iteration_duration_seconds Duration of each dbpurge iteration in seconds.
# TYPE coderd_dbpurge_iteration_duration_seconds histogram
coderd_dbpurge_iteration_duration_seconds_bucket{success="true",le="1"} 1
coderd_dbpurge_iteration_duration_seconds_bucket{success="true",le="5"} 1
coderd_dbpurge_iteration_duration_seconds_bucket{success="true",le="10"} 1
coderd_dbpurge_iteration_duration_seconds_bucket{success="true",le="30"} 1
coderd_dbpurge_iteration_duration_seconds_bucket{success="true",le="60"} 1
coderd_dbpurge_iteration_duration_seconds_bucket{success="true",le="300"} 1
coderd_dbpurge_iteration_duration_seconds_bucket{success="true",le="600"} 1
coderd_dbpurge_iteration_duration_seconds_bucket{success="true",le="+Inf"} 1
coderd_dbpurge_iteration_duration_seconds_sum{success="true"} 0.014787814
coderd_dbpurge_iteration_duration_seconds_count{success="true"} 1
# HELP coderd_dbpurge_records_purged_total Total number of records purged by type.
# TYPE coderd_dbpurge_records_purged_total counter
coderd_dbpurge_records_purged_total{record_type="aibridge_records"} 0
coderd_dbpurge_records_purged_total{record_type="audit_logs"} 0
coderd_dbpurge_records_purged_total{record_type="connection_logs"} 0
coderd_dbpurge_records_purged_total{record_type="expired_api_keys"} 0
coderd_dbpurge_records_purged_total{record_type="workspace_agent_logs"} 0
```
| Position | Pull-request |
| -------- | ------------ |
| ✅ | [feat: add prometheus observability metrics for
`dbpurge`](https://github.com/coder/coder/pull/21074) |
| | [feat: add rbac specificity for
`dbpurge`](https://github.com/coder/coder/pull/21088) |
Previously setting AI Bridge retention to 0 would cause records to be
deleted immediately since we didn't check for the zero value before
calculating the deletion threshold.
This adds a check for aibridgeRetention > 0 to skip deletion when
retention is disabled, matching the pattern used for other retention
settings (connection logs, audit logs, etc.).
Also fixes the return type of DeleteOldAIBridgeRecords from int32 to
int64 since COUNT(*) returns bigint in PostgreSQL.
Refs #21055
Replace hardcoded 7-day retention for workspace agent logs with
configurable retention from deployment settings. Defaults to 7d to
preserve existing behavior.
Depends on #21038
Updates #20743
Replace hardcoded 7-day retention for expired API keys with configurable
retention from deployment settings. Skips deletion entirely when effective
retention is 0.
Depends on #21021
Updates #20743
Add configurable retention policy for audit logs. The DeleteOldAuditLogs
query excludes deprecated connection events (connect, disconnect, open,
close) which are handled separately by DeleteOldAuditLogConnectionEvents.
Disabled (0) by default.
Depends on #21021
Updates #20743
Add `DeleteOldConnectionLogs` query and integrate it into the `dbpurge`
routine. Retention is controlled by `--retention-connection-logs` flag.
Disabled (0) by default.
Depends on #21021
Updates #20743
- Adds a new table to keep track of which payloads have already been
reported since we only report for the last clock hour
- Adds a query to gather and aggregate all the data by
provider/model/client
Relates to https://github.com/coder/coder-telemetry-server/issues/27
* provisionerdserver: Expires prebuild user token for workspace, if it
exists, when regenerating session token.
* dbauthz: disallow prebuilds user from creating api keys
* dbpurge: added functionality to expire stale api keys owned by the
prebuilds user
### Breaking change (changelog note):
>With new connection events appearing in the Connection Log, connection events older than 90 days will now be deleted from the Audit Log. If you require this legacy data, we recommend querying it from the REST API or making a backup of the database/these events before upgrading your Coder deployment. Please see the PR for details on what exactly will be deleted.
Of note is that there are currently no plans to delete connection events from the Connection Log.
### Context
This is the fifth PR for moving connection events out of the audit log.
In previous PRs:
- **New** connection logs have been routed to the `connection_logs` table. They will *not* appear in the audit log.
- These new connection logs are served from the new `/api/v2/connectionlog` endpoint.
In this PR:
- We'll now clean existing connection events out of the audit log, if they are older than 90 days, We do this in batches of 1000, every 10 minutes.
The criteria for deletion is simple:
```
WHERE
(
action = 'connect'
OR action = 'disconnect'
OR action = 'open'
OR action = 'close'
)
AND "time" < @before_time::timestamp with time zone
```
where `@before_time` is currently configured to 90 days in the past.
Future PRs:
- Write documentation for the endpoint / feature
Addresses https://github.com/coder/coder/issues/16231.
This PR reduces the volume of logs we print after server startup in
order to surface the web UI URL better.
Here are the logs after the changes a couple of seconds after starting
the server:
<img width="868" alt="Screenshot 2025-02-18 at 16 31 32"
src="https://github.com/user-attachments/assets/786dc4b8-7383-48c8-a5c3-a997c01ca915"
/>
The warning is due to running a development site-less build. It wouldn't
show in a release build.
Before db_metrics were all or nothing. Now `InTx` metrics are always recorded, and query metrics are opt in.
Adds instrumentation & logging around serialization failures in the database.
Updates the `DeleteOldWorkspaceAgentLogs` to:
- Retain logs for the most recent build regardless of age,
- Delete logs for agents that never connected and were created before
the cutoff for deleting logs while still retaining the logs most recent build.
Related to #10576
This PR introduces quartz to coderd/database/dbpurge and updates the following unit tests to make use of Quartz's functionality:
- TestPurge
- TestDeleteOldWorkspaceAgentLogs
Additionally, updates DeleteOldWorkspaceAgentLogs to replace the hard-coded interval with a parameter passed into the query. This aids in testing and brings us a step towards allowing operators to configure the cutoff interval for workspace agent logs.
Add `dbrollup` service that runs the `UpsertTemplateUsageStats` query
every 5 minutes, on the minute. This allows us to have fairly real-time
insights data when viewing "today".
* chore: add /v2 to import module path
go mod requires semantic versioning with versions greater than 1.x
This was a mechanical update by running:
```
go install github.com/marwan-at-work/mod/cmd/mod@latest
mod upgrade
```
Migrate generated files to import /v2
* Fix gen
* chore: rename startup logs to agent logs
This also adds a `source` property to every agent log. It
should allow us to group logs and display them nicer in
the UI as they stream in.
* Fix migration order
* Fix naming
* Rename the frontend
* Fix tests
* Fix down migration
* Match enums for workspace agent logs
* Fix inserting log source
* Fix migration order
* Fix logs tests
* Fix psql insert