Commit Graph

21 Commits

Author SHA1 Message Date
Callum Styan 6abb889fab perf: optimize GetDeploymentWorkspaceAgentStats by eliminating 2nd select (#21112)
Tracking issue here: https://github.com/coder/internal/issues/1009

To summarize, the current version of this query selects from
`workspace_agent_stats` twice. The expensive portion of this query is
the bitmap heap scan we have to do for each of these selects. We can
easily cut the cost of this query by 40-50% by cutting this down to a
single select, and using those rows for both sets of calculations.

Eliminating the heap scan itself would require a follow up PR to
introduce a new index. Blink helped with the rewrite of the query.

The current plan looks like this:
```
 Nested Loop  (cost=6101.64..6101.69 rows=1 width=64) (actual time=11.782..11.787 rows=1 loops=1)
   ->  Aggregate  (cost=2996.17..2996.19 rows=1 width=32) (actual time=3.356..3.357 rows=1 loops=1)
         ->  Bitmap Heap Scan on workspace_agent_stats  (cost=54.80..2992.86 rows=440 width=24) (actu
al time=0.346..2.927 rows=818 loops=1)
               Recheck Cond: (created_at > (now() - '00:15:00'::interval))
               Filter: (connection_median_latency_ms > '0'::double precision)
               Rows Removed by Filter: 1070
               Heap Blocks: exact=486
               ->  Bitmap Index Scan on idx_agent_stats_created_at  (cost=0.00..54.69 rows=1368 width
=0) (actual time=0.241..0.241 rows=1888 loops=1)
                     Index Cond: (created_at > (now() - '00:15:00'::interval))
   ->  Aggregate  (cost=3105.47..3105.49 rows=1 width=32) (actual time=8.418..8.420 rows=1 loops=1)
         ->  Subquery Scan on a  (cost=3060.95..3105.39 rows=7 width=32) (actual time=7.851..8.394 ro
ws=63 loops=1)
               Filter: (a.rn = 1)
               ->  WindowAgg  (cost=3060.95..3088.29 rows=1368 width=209) (actual time=7.850..8.382 r
ows=63 loops=1)
                     Run Condition: (row_number() OVER (?) <= 1)
                     ->  Sort  (cost=3060.93..3064.35 rows=1368 width=56) (actual time=7.836..8.036 r
ows=1888 loops=1)
                           Sort Key: workspace_agent_stats_1.agent_id, workspace_agent_stats_1.create
d_at DESC
                           Sort Method: quicksort  Memory: 181kB
                           ->  Bitmap Heap Scan on workspace_agent_stats workspace_agent_stats_1  (co
st=55.03..2989.67 rows=1368 width=56) (actual time=0.388..2.096 rows=1888 loops=1)
                                 Recheck Cond: (created_at > (now() - '00:15:00'::interval))
                                 Heap Blocks: exact=486
                                 ->  Bitmap Index Scan on idx_agent_stats_created_at  (cost=0.00..54.
69 rows=1368 width=0) (actual time=0.295..0.295 rows=1888 loops=1)
                                       Index Cond: (created_at > (now() - '00:15:00'::interval))
 Planning Time: 2.350 ms
 Execution Time: 13.152 ms
(24 rows)
```

The new plan looks like this
```
 Aggregate  (cost=2966.96..2966.98 rows=1 width=64) (actual time=3.812..3.814 rows=1 loops=1)
   ->  WindowAgg  (cost=2891.96..2916.94 rows=1250 width=88) (actual time=2.696..3.412 rows=1890 loop
s=1)
         ->  Sort  (cost=2891.94..2895.06 rows=1250 width=80) (actual time=2.686..2.780 rows=1890 loo
ps=1)
               Sort Key: workspace_agent_stats.agent_id, workspace_agent_stats.created_at DESC
               Sort Method: quicksort  Memory: 226kB
               ->  Bitmap Heap Scan on workspace_agent_stats  (cost=50.11..2827.64 rows=1250 width=80
) (actual time=0.218..1.551 rows=1890 loops=1)
                     Recheck Cond: (created_at > (now() - '00:15:00'::interval))
                     Heap Blocks: exact=474
                     ->  Bitmap Index Scan on idx_agent_stats_created_at  (cost=0.00..49.80 rows=1250
 width=0) (actual time=0.146..0.147 rows=1890 loops=1)
                           Index Cond: (created_at > (now() - '00:15:00'::interval))
 Planning Time: 0.534 ms
 Execution Time: 3.969 ms
(12 rows)
```

If we compare the results of the query they're similar enough that any
differences can be attributed to slightly different timestamps for
`now()` in the version of the query I am using to generate results for
comparison:
```
 workspace_rx_bytes | workspace_tx_bytes | workspace_connection_latency_50 | workspace_connection_latency_95 | session_count_vscode | session_count_ssh | session_count_jetbrains | session_count_reconnecting_pty 
--------------------+--------------------+---------------------------------+---------------------------------+----------------------+-------------------+-------------------------+--------------------------------
           15263563 |           74555854 |                          47.933 |                        250.5522 |                  239 |                59 |                       3 |                              3
(1 row)

 workspace_rx_bytes | workspace_tx_bytes | workspace_connection_latency_50 | workspace_connection_latency_95 | session_count_vscode | session_count_ssh | session_count_jetbrains | session_count_reconnecting_pty 
--------------------+--------------------+---------------------------------+---------------------------------+----------------------+-------------------+-------------------------+--------------------------------
           15295819 |           74598410 |                          47.933 |                        250.5522 |                  239 |                59 |                       3 |                              3           
```

---------

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2025-12-09 15:19:55 -08:00
Mathias Fredriksson 057d7dacdc chore(coderd/database/queries): remove trailing whitespace (#20192) 2025-10-07 13:10:38 +00:00
Garrett Delfosse 5f640eb219 fix: correct connection_median_latency_ms in query (#15086)
Closes https://github.com/coder/coder/issues/14805
2024-10-17 12:22:26 -04:00
Garrett Delfosse 922f4c545f fix: handle new agent stat format correctly (#14576)
---------

Co-authored-by: Ethan Dickson <ethan@coder.com>
2024-09-20 01:52:14 +10:00
Ethan 628750232f fix: delete workspace agent stats after 180 days (#14489)
Fixes #13430.

The test for purging old workspace agent stats from the DB was consistently failing when ran with Postgres towards the end of the month, but not with the in-memory DB. 

This was because month intervals are calculated differently for `time.Time` and the `interval` type in Postgres:

```
ethan=# SELECT
    '2024-08-30'::DATE AS original_date,
    ('2024-08-30'::DATE - INTERVAL '6 months') AS sub_date;
 original_date |      sub_date
---------------+---------------------
 2024-08-30    | 2024-02-29 00:00:00
(1 row)
```

Using `func (t Time) AddDate(years int, months int, days int) Time`, where `months` is `-6`:
```
Original: 2024-08-30 00:00:00 +0000 UTC
6 Months Earlier: 2024-03-01 00:00:00 +0000 UTC
```

Since 6 months was chosen arbitrarily, we should be able to change it to 180 days, to remove any ambiguity between the in-memory DB, and the Postgres DB. The alternative solution would involve implementing Postgres' month interval algorithm in Go.

The UI only shows stats as old as 168 days (24 weeks), so a frontend change isn't required for the extra days of stats we lose in some cases.
2024-08-30 18:30:04 +10:00
Mathias Fredriksson a69fc657f2 chore(coderd/database): reduce dbpurge load with smaller batches of agent stats (#13049) 2024-04-23 15:01:56 +03:00
Mathias Fredriksson e17e8aa3c9 feat(coderd/database): keep only 1 day of workspace_agent_stats after rollup (#12674) 2024-04-22 13:11:50 +03:00
Steven Masley 0a8c8ce5cc chore: remove InsertWorkspaceAgentStat query (#12869)
* chore: remove InsertWorkspaceAgentStat query

InsertWorkspaceAgentStats (batch) exists. We only used the singular in
a single unit test place. Removing the single for the batch, reducing
the interface size.
2024-04-09 12:35:27 -05:00
Marcin Tojek 941e3873a8 fix: implement fake DeleteOldWorkspaceAgentStats (#11076) 2023-12-07 14:08:16 +01:00
Marcin Tojek b0e3daa120 feat(coderd): support weekly aggregated insights (#9684) 2023-09-19 13:06:19 +02:00
Cian Johnston 9fb18f3ae5 feat(coderd): batch agent stats inserts (#8875)
This PR adds support for batching inserts to the workspace_agents_stats table.
Up to 1024 stats are batched, and flushed every second in a batch.
2023-08-04 17:00:42 +01:00
Steven Masley c795a0e500 feat: Fix Deployment DAUs to work with local timezones (#7647)
* chore: Add timezone param to DAU SQL query
* Merge DAUs response
* Pass time offsets to metricscache
2023-05-30 13:18:27 -04:00
Marcin Tojek 942aba3a66 feat: expose agent stats via Prometheus endpoint (#7115)
* WIP

* WIP

* WIP

* Agents

* fix

* 1min

* fix

* WIP

* Test

* docs

* fmt

* Add timer to measure the metrics collection

* Use CachedGaugeVec

* Unit tests

* WIP

* WIP

* db: GetWorkspaceAgentStatsAndLabels

* fmt

* WIP

* gauges

* feat: collect

* fix

* fmt

* minor fixes

* Prometheus flag

* fix

* WIP

* fix tests

* WIP

* fix json

* Rx Tx bytes

* CloseFunc

* fix

* fix

* Fixes

* fix

* fix: IgnoreErrors

* Fix: Windows

* fix

* reflect.DeepEquals
2023-04-14 16:14:52 +02:00
Kyle Carberry f91b3acf93 fix: group routine workspace agent stats by id (#6601)
Before this was creating separate rows for distinct stat entries, which
resulted in significantly more data being sent to telemetry.
2023-03-14 10:52:03 -05:00
Kyle Carberry 35df1b10d0 feat: add workspace agent stat reporting to telemetry (#6577)
This aggregates stats periodically and sends them by agent ID to
our telemetry server. It should help us identify which editors are
primarily in use.
2023-03-13 14:16:54 -05:00
Kyle Carberry 70b093ff2a fix: filter session count sums by created_at (#6529)
Fixes the session totals being waaaaay too high!
2023-03-09 17:08:41 +02:00
Kyle Carberry 1cc10f2ffb fix: only sum connection latencies when they are set (#6524)
This was producing a median that didn't make sense.
2023-03-09 03:53:09 +00:00
Kyle Carberry 5304b4e483 feat: add connection statistics for workspace agents (#6469)
* fix: don't make session counts cumulative

This made for some weird tracking... we want the point-in-time
number of counts!

* Add databasefake query for getting agent stats

* Add deployment stats endpoint

* The query... works?!?

* Fix aggregation query

* Select from multiple tables instead

* Fix continuous stats

* Increase period of stat refreshes

* Add workspace counts to deployment stats

* fmt

* Add a slight bit of responsiveness

* Fix template version editor overflow

* Add refresh button

* Fix font family on button

* Fix latest stat being reported

* Revert agent conn stats

* Fix linting error

* Fix tests

* Fix gen

* Fix migrations

* Block on sending stat updates

* Add test fixtures

* Fix response structure

* make gen
2023-03-08 21:05:45 -06:00
Kyle Carberry 87ed7a7dba chore: use nil map on agent stats to check if report interval should be returned (#6479)
See https://github.com/coder/coder/actions/runs/4350638262/jobs/7601537088
2023-03-07 14:25:04 +00:00
Kyle Carberry 2ff1c6d613 feat: add agent stats for different connection types (#6412)
This allows us to track when our extensions are used, when the
web terminal is used, and average connection latency to the agent.
2023-03-02 08:06:00 -06:00
Kyle Carberry 05e449943d chore: convert agent stats to use a table (#6374)
* chore: convert workspace agent stats from json to table

* chore: convert agent stats to use a table

Backwards compatibility becomes hard when all agent stats are in a JSON blob.
We also want to query this table for new agents that are failing health checks
so we can display it in the UI.

* Fix migration using default values
2023-02-28 13:33:33 -06:00