Commit Graph

1760 Commits

Author SHA1 Message Date
Susana Ferreira a002fbbae6 refactor: avoid terminology collision with aibridge by renaming passthrough to tunneled (#21562)
## Description

Renames "passthrough" to "tunneled" in aiproxy to avoid terminology
collision with aibridge, which has its own passthrough concept.

Follow-up from:
https://github.com/coder/coder/pull/21512#discussion_r2698231778

---------

Co-authored-by: Danny Kopping <danny@coder.com>
2026-01-19 13:23:42 +00:00
Susana Ferreira a406ed7cc5 feat: add upstream proxy support to aiproxy for passthrough requests (#21512)
## Description

Adds upstream proxy support for AI Bridge Proxy passthrough requests.
This allows aiproxy to forward non-allowlisted requests through an
upstream proxy. Currently, the only supported configuration is when
aiproxy is the first proxy in the chain (client → aiproxy → upstream
proxy).

## Changes

* Add `--aibridge-proxy-upstream` option to configure an upstream
HTTP/HTTPS proxy URL for passthrough requests
* Add `--aibridge-proxy-upstream-ca` option to trust custom CA
certificates for HTTPS upstream proxies
* Passthrough requests (non-allowlisted domains) are forwarded through
the upstream proxy
* MITM'd requests (allowlisted domains) continue to go directly to
aibridge, not through the upstream proxy
* Add tests for upstream proxy configuration and request routing

Closes: https://github.com/coder/internal/issues/1204
2026-01-19 08:50:57 +00:00
Asher 4d414a0df7 feat: add --use-parameter-defaults flag (#21119)
This is like `--yes`, but for parameter prompts.
2026-01-16 17:04:57 -09:00
Cian Johnston ab126e0f0a feat: improve usability of coder show (#21539)
This PR improves the usability of `coder show`:

- Adds a header with workspace owner/name, latest build status and time
since, and template name / version name.
- Updates `namedWorkspace` to allow looking up by UUID
- Also improves associated `TestShow` to respect context deadlines.
2026-01-16 15:45:33 +00:00
Sas Swart 0ebe8e57ad chore: add scaletesting tools for aibridge (#21279)
This pull request adds scaletesting tools for aibridge.

See
https://www.notion.so/Scale-tests-2c5d579be5928088b565d15dd8bdea41?source=copy_link
for information and instructions.

closes: https://github.com/coder/internal/issues/1156
closes: https://github.com/coder/internal/issues/1155
closes: https://github.com/coder/internal/issues/1158
2026-01-15 17:05:46 +02:00
George K 0712faef4f feat(enterprise): implement organization "disable workspace sharing" option (#21376)
Adds a per-organization setting to disable workspace sharing. When enabled,
all existing workspace ACLs in the organization are cleared and the workspace
ACL mutation API endpoints return `403 Forbidden`.

This complements the existing site-wide `--disable-workspace-sharing` flag by
providing more granular control at the organization level.

Closes https://github.com/coder/internal/issues/1073 (part 2)

---------

Co-authored-by: Steven Masley <Emyrk@users.noreply.github.com>
2026-01-14 09:47:50 -08:00
Danny Kopping 7d5cd06f83 feat: add aibridge structured logging (#21492)
Closes https://github.com/coder/internal/issues/1151

Sample:

```
[API] 2026-01-13 15:50:20.795 [info]  coderd.aibridgedserver: interception started  trace=8bb5a1d8eb10526cc46ad90f191bb468  span=a3e5b5da9546032a  record_type=interception_start  interception_id=97461880-4a6c-47c1-8292-3588dd715312  initiator_id=360c6167-a93a-4442-9c3e-f87a6d1cfb66  api_key_id=vg1sbUv97d  provider=anthropic  model=claude-opus-4-5-20251101  started_at="2026-01-13T15:50:20.790690781Z"  metadata={}
[API] 2026-01-13 15:50:23.741 [info]  coderd.aibridgedserver: token usage recorded  trace=8bb5a1d8eb10526cc46ad90f191bb468  span=a114f0cc3047296e  record_type=token_usage  interception_id=97461880-4a6c-47c1-8292-3588dd715312  msg_id=msg_01VJH1rYKspfun8BW29CrYEu  input_tokens=10  output_tokens=8  created_at="2026-01-13T15:50:23.731587038Z"  metadata={"cache_creation_input":53194,"cache_ephemeral_1h_input":0,"cache_ephemeral_5m_input":53194,"cache_read_input":0,"web_search_requests":0}
[API] 2026-01-13 15:50:26.265 [info]  coderd.aibridgedserver: token usage recorded  trace=8bb5a1d8eb10526cc46ad90f191bb468  span=dbdafb563bff2c9c  record_type=token_usage  interception_id=97461880-4a6c-47c1-8292-3588dd715312  msg_id=msg_01VJH1rYKspfun8BW29CrYEu  input_tokens=0  output_tokens=130  created_at="2026-01-13T15:50:26.254467904Z"  metadata={}
[API] 2026-01-13 15:50:26.268 [info]  coderd.aibridgedserver: prompt usage recorded  trace=8bb5a1d8eb10526cc46ad90f191bb468  span=da51887a757226fc  record_type=prompt_usage  interception_id=97461880-4a6c-47c1-8292-3588dd715312  msg_id=msg_01VJH1rYKspfun8BW29CrYEu  prompt="list the jmia share price"  created_at="2026-01-13T15:50:26.255299811Z"  metadata={}
[API] 2026-01-13 15:50:26.268 [info]  coderd.aibridgedserver: interception ended  trace=8bb5a1d8eb10526cc46ad90f191bb468  span=3fa25397705ee7c9  record_type=interception_end  interception_id=97461880-4a6c-47c1-8292-3588dd715312  ended_at="2026-01-13T15:50:26.25555547Z"
[API] 2026-01-13 15:50:26.269 [info]  coderd.aibridgedserver: tool usage recorded  trace=8bb5a1d8eb10526cc46ad90f191bb468  span=b54af90afc604d29  record_type=tool_usage  interception_id=97461880-4a6c-47c1-8292-3588dd715312  msg_id=msg_01VJH1rYKspfun8BW29CrYEu  tool=mcp__stonks__getStockPriceSnapshot  input="{\"ticker\":\"JMIA\"}"  server_url=""  injected=false  invocation_error=""  created_at="2026-01-13T15:50:26.255164652Z"  metadata={}
```

Structured logging is only enabled when
`CODER_AIBRIDGE_STRUCTURED_LOGGING=true`.

---------

Signed-off-by: Danny Kopping <danny@coder.com>
2026-01-14 17:26:08 +02:00
Susana Ferreira 74b6d12a8a feat: implement selective MITM with configurable domain allowlist in aibridgeproxyd (#21473)
## Description

Implements selective MITM (Man-in-the-Middle) in `aibridgeproxyd` so
that only requests to allowlisted domains are intercepted and decrypted.
Requests to all other domains are tunneled directly without decryption.

## Changes

* New config option: `CODER_AIBRIDGE_PROXY_DOMAIN_ALLOWLIST` (default:
`api.anthropic.com`,`api.openai.com`)
* Selective MITM: Uses `goproxy.ReqHostIs()` to only intercept `CONNECT`
requests to allowlisted hosts
* Certificate caching: Now only generates/caches certificates for
allowlisted domains
* Validation: Startup fails if domain allowlist is empty or contains
invalid entries

Closes: https://github.com/coder/internal/issues/1182
2026-01-13 11:30:51 +00:00
Danny Kopping 49a42eff5c feat: make database connection pool size configurable (#21403)
Closes https://github.com/coder/coder/issues/21360

A few considerations/notes:
- I've kept the number of conns to 10 in all other places, except coderd
- which uses the config value
- I opted to also make idle conns configurable; the greater the delta
between max open and max idle, the more connection churn
- Postgres maintains a [_process_ per
connection](https://www.postgresql.org/docs/current/connect-estab.html),
contrary to what the comment said previously
- Operators should be able to tune this, since process churn can
negatively affect OS scheduling
- I've set the value to `"auto"` by default so it's not another knob one
_has to_ twiddle, and sets max idle = max conns / 3

---------

Signed-off-by: Danny Kopping <danny@coder.com>
2026-01-13 10:50:57 +02:00
George K cc2efe9e1f feat(coderd/rbac): make organization-member a per-org system custom role (#21359)
Migrated the built-in organization-member role to DB storage so it can be customized per org.

Closes https://github.com/coder/internal/issues/1073 (part 1)
2026-01-12 18:19:19 -08:00
Cian Johnston 2b448c7178 feat(cli): enrich user-agent header for client requests (#21483)
Adds the following information to CLI User-Agent headers to aid
deployment administrators in troubleshooting where requests are coming
from.

Before: `Go-http-client/1.1`
After: `coder-cli/v2.34.5 (linux/amd64; coder whoami)`

🤖 These changes were generated by Claude Sonnet 4.5 but reviewed and
edited manually by me.
2026-01-12 17:46:05 +00:00
Kacper Sawicki 6ca70d3618 feat(cli): add --no-build flag to state push for state-only updates (#21374)
## Summary

Adds a `--no-build` flag to `coder state push` that updates the
Terraform state directly without triggering a workspace build.

## Use Case

This enables state-only migrations, such as migrating Kubernetes
resources from deprecated types (e.g., `kubernetes_config_map`) to
versioned types (e.g., `kubernetes_config_map_v1`):

```bash
coder state pull my-workspace > state.json
terraform init
terraform state rm -state=state.json kubernetes_config_map.example
terraform import -state=state.json kubernetes_config_map_v1.example default/example
coder state push --no-build my-workspace state.json
```

## Changes

- Add `PUT /api/v2/workspacebuilds/{id}/state` endpoint to update state
without triggering a build
- Add `UpdateWorkspaceBuildState` SDK method
- Add `--no-build`/`-n` flag to `coder state push`
- Add confirmation prompt (can be skipped with `--yes`/`-y`) since this
is a potentially dangerous operation
- Add test for `--no-build` functionality

Fixes #21336
2026-01-12 15:16:59 +01:00
Steven Masley d2044c2ee9 chore: update protobuf to reuse file request (#21447)
**This is just the protobuf changes for the PR https://github.com/coder/coder/pull/21398**

Moved `UploadFileRequest` from `provisionerd.proto` -> `provisioner.proto`.

Renamed to `FileUpload` because it is now bi-directional.

This **is backwards compatible**. I tested it to confirm the payloads are identical.  Types were just renamed and moved around.

```golang
func TestTypeUpgrade(t *testing.T) {
	t.Parallel()

	x := &proto2.UploadFileRequest{
		Type: &proto2.UploadFileRequest_ChunkPiece{
			ChunkPiece: &proto.ChunkPiece{
				Data:         []byte("Hello World!"),
				FullDataHash: []byte("Foobar"),
				PieceIndex:   42,
			},
		},
	}

	data, err := protobuf.Marshal(x)
	require.NoError(t, err)

	// Exactly the same output
	// EhgKDEhlbGxvIFdvcmxkIRIGRm9vYmFyGCo= on `main`
	// EhgKDEhlbGxvIFdvcmxkIRIGRm9vYmFyGCo= on this branch
	fmt.Println(base64.StdEncoding.EncodeToString(data))
}
```


# What this does

This allows provisioner daemons to download files from `coderd`'s `files` table. This is used to send over cached module files and prevent the need of downloading these modules on each workspace build.
2026-01-09 11:23:32 -06:00
Steven Masley 89f4d60e7b chore: remove experiment "terraform-directory-reuse" (#21397)
Experiment is no longer required, the new method will be released without an experiment and without a toggle

Main PR is: https://github.com/coder/coder/pull/21398
2026-01-09 11:13:16 -06:00
Spike Curtis bddb808b25 chore: arrange imports in a standard way (#21452)
Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example:

```
import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"golang.org/x/xerrors"
	"gopkg.in/natefinch/lumberjack.v2"

	"cdr.dev/slog/v3"
	"github.com/coder/coder/v2/codersdk/agentsdk"
	"github.com/coder/serpent"
)
```

3 groups: standard library, 3rd partly libs, Coder libs.

This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.
2026-01-08 15:24:11 +04:00
Cian Johnston 0f446f99dd feat(cli): add logs cmd (#21430)
This PR adds a command to view the provisioner and agent logs for a
given workspace.
Note: I did investigate using the existing `cliui` methods to tail the
logs but they are tailored to a very specific use-case.

Other changes:
- Adds `Agents` to `dbfake.WorkspaceResponse`
- Adds methods to generate provisioner and agent logs in `dbgen`

---------

Co-authored-by: Steven Masley <Emyrk@users.noreply.github.com>
2026-01-08 09:58:10 +00:00
Spike Curtis 49b34a716a fix: fix slog to always use array of Fields (#21426)
Upgrades to slog v3 which includes a small, but backward incompatible API change to the acceptible call arguments when logging. This change allows us to verify via compile time type checking that arguments are correct and won't cause a panic, as was possible in slog v1, which this replaces (v2 was tagged but never used in coder/coder).

It also updates dependencies that also use slog and were updated.

I've left the `aibridge` dependency as a commit SHA, under the assumption that the team there (cc @pawbana @dannykopping ) will tag and update the dependency soon and on their own schedule.

Other dependencies, I pushed new tags.
2026-01-08 10:29:41 +04:00
Danielle Maywood c77c0fce52 fix(cli/open): wait for agent to be created (#21448)
Fix https://github.com/coder/internal/issues/596

---

🤖 Claude Code with Claude Opus 4.5
2026-01-07 16:06:00 +00:00
Cian Johnston 6bd2d1c85f chore(cli): seed healthcheck cache in TestSupportBundle (#21436)
Fixes https://github.com/coder/internal/issues/272

This test periodically fails due to the healthcheck timing out.
The problem is compounded due to the fact that we stand up a new
coderdtest instance for each test.

This PR does the following:
* Updates the subtests to share a single `coderdtest` instance.
* Hits the `/debug/health` endpoint before completing the setup phase so
that the result is cached.

This will not completely remove the issue, as the healthcheck could
still fail due to test-infrastructure-related issues. In this case we
may decide to add a retry in this 'seed' function.
2026-01-07 08:47:31 +00:00
Asher 4a97df3768 chore: rename flag to disable template insights (#21329)
Because this affects more than just the template insights
page (specifically it also affects the deployment stats endpoint which
is shown on bottom bar and Prometheus), the group is being renamed
generically to just "stats collection".  In the future if we need to
affect the other stats we can put those options here.

Then, because this change only affects a portion of stats, specifically
usage stats like connection and application time, bytes sent, etc, add a
new sub-group called "usage stats".

Then finally add back the "enable" flag.  This also gives us a place to
one day place an "anonymize" flag if we need to go that route.
2026-01-05 11:44:06 -09:00
Zach 07924037e7 feat: add boundary log forwarding from agent to coderd (#21345)
Add agent forwarding of boundary audit logs from workspaces to coderd
via agent API, and re-emission of boundary logs to coderd stderr. This
change adds a server to the workspace agent that always listens on a
unix socket for boundary to connect and send audit logs.

coderd log format example:
```
[API] 2025-12-23 18:31:46.755 [info] coderd.agentrpc: boundary_request owner=.. workspace_name=.. agent_name=.. decision=.. workspace_id=.. http_method=.. http_url=.. event_time=.. request_id=..
```

Corresponding boundary PR: https://github.com/coder/boundary/pull/124
RFC:
https://www.notion.so/coderhq/Agent-Boundary-Logs-2afd579be59280f29629fc9823ac41ba
https://github.com/coder/coder/issues/21280
2025-12-31 16:38:19 -07:00
Susana Ferreira b97572285a feat: add core AI MITM proxy daemon (#21296)
## Description

Adds the core AI Bridge MITM proxy daemon. This proxy intercepts HTTPS traffic, decrypts it using a configured CA certificate, and forwards requests to AIBridge for processing.

## Changes

* Added `aibridgeproxyd` package with the core proxy server implementation
* Added configuration options: `CODER_AIBRIDGE_PROXY_ENABLED`, `CODER_AIBRIDGE_PROXY_LISTEN_ADDR`, `CODER_AIBRIDGE_PROXY_CERT_FILE`, `CODER_AIBRIDGE_PROXY_KEY_FILE`
* Added tests for server initialization and MITM functionality

Closes https://github.com/coder/internal/issues/1180
2025-12-29 15:31:51 +00:00
Danielle Maywood 44a46db487 feat(agent): support deleting dev containers (#21247)
Add logic to the agent, and an endpoint, to allow requesting and then deleting a Dev Container and its related agent.
2025-12-22 11:28:31 +00:00
Rowan Smith 81cbf03a52 chore: fix typo in organization roles create help text (#21352)
A simple typo fix to the help text

stidin > stdin

```
➜  coder git:(org_role_fix) ✗ coder organizations roles create -h                            
coder v2.29.1+59cdd7e

USAGE:
  coder organizations roles create [flags] <role_name>

  Create a new organization custom role

    - Run with an input.json file:
  
        $ coder organization -O <organization_name> roles create --stidin < role.json 
```
2025-12-22 11:24:00 +11:00
Jake Howell 00793cc0b5 feat: add prometheus observability metrics for dbpurge (#21074)
Related to
[`internal#1139`](https://github.com/coder/internal/issues/1139)

This implements some prometheus metrics for records being removed from
the database. Currently we're tracking the following fields being
removed from the DB by this. They're viewable in the
`/api/v2/debug/metrics` endpoint.

* `expired_api_keys`
* `aibridge_records`
* `connection_logs`
* `duration`

```
# HELP coderd_dbpurge_iteration_duration_seconds Duration of each dbpurge iteration in seconds.
# TYPE coderd_dbpurge_iteration_duration_seconds histogram
coderd_dbpurge_iteration_duration_seconds_bucket{success="true",le="1"} 1
coderd_dbpurge_iteration_duration_seconds_bucket{success="true",le="5"} 1
coderd_dbpurge_iteration_duration_seconds_bucket{success="true",le="10"} 1
coderd_dbpurge_iteration_duration_seconds_bucket{success="true",le="30"} 1
coderd_dbpurge_iteration_duration_seconds_bucket{success="true",le="60"} 1
coderd_dbpurge_iteration_duration_seconds_bucket{success="true",le="300"} 1
coderd_dbpurge_iteration_duration_seconds_bucket{success="true",le="600"} 1
coderd_dbpurge_iteration_duration_seconds_bucket{success="true",le="+Inf"} 1
coderd_dbpurge_iteration_duration_seconds_sum{success="true"} 0.014787814
coderd_dbpurge_iteration_duration_seconds_count{success="true"} 1
# HELP coderd_dbpurge_records_purged_total Total number of records purged by type.
# TYPE coderd_dbpurge_records_purged_total counter
coderd_dbpurge_records_purged_total{record_type="aibridge_records"} 0
coderd_dbpurge_records_purged_total{record_type="audit_logs"} 0
coderd_dbpurge_records_purged_total{record_type="connection_logs"} 0
coderd_dbpurge_records_purged_total{record_type="expired_api_keys"} 0
coderd_dbpurge_records_purged_total{record_type="workspace_agent_logs"} 0
```

| Position | Pull-request |
| -------- | ------------ |
|  | [feat: add prometheus observability metrics for
`dbpurge`](https://github.com/coder/coder/pull/21074) |
| | [feat: add rbac specificity for
`dbpurge`](https://github.com/coder/coder/pull/21088) |
2025-12-20 00:20:57 +11:00
Spike Curtis 73253df6bf fix: use separate HTTP clients in scale test load generators (#21288)
While scale testing, I noticed that our load generators send basically
all requests to a single Coderd instance.

e.g. 


![image.png](https://app.graphite.com/user-attachments/assets/e259862a-adf1-47e7-a37b-fd14e420058e.png)


This is because our scale test commands create all `Runner`s using the
same codersdk Client, which means they share an underlying HTTP client.
With HTTP/2 a single TCP session can multiplex many different HTTP
requests (including websockets). So, it creates a single TCP connection
to a single coderd, and then sends all the requests down the one TCP
connections.

This PR modifies the `exp scaletest` load generator commands to create
an independent HTTP client per `Runner`. This means that each runner
will create its own TCP connection. This should help spread the load and
make a more realistic test, because in a real deployment, scaled out
load will be coming over different TCP connections.
2025-12-19 12:22:49 +04:00
Spike Curtis cac6d4ce98 feat: add --max-failures to coder exp scaletest create-workspaces (#21315)
Adds `--max-failures` flag to `coder exp scaletest create-workspaces` so that we can tolerate a few failures without failing the command.

When running our scale test infra, we create Kubernetes Jobs to create the initial cluster workspaces, then we have load-generation jobs that depend on them. At high scale, it's kind of expected that some of the requests will fail: even with 99.9% success, you still expect one failure per 1000. It's useful to be able to carry on with the scale test anyway and proceed to traffic generation.
2025-12-18 11:21:35 +04:00
Zach 174a6192fa refactor: consolidate darwin unix socket test helpers (#21283) 2025-12-16 09:11:54 -07:00
Steven Masley 8fefd91e4a feat!: support PKCE in the oauth2 client's auth/exchange flow (#21215)
**Breaking Change:** Existing oauth apps might now use PKCE. If an
unknown IdP type was being used, and it does not support PKCE, it will
break.

To fix, set the PKCE methods on the external auth to `none`
```
export CODER_EXTERNAL_AUTH_1_PKCE_METHODS=none
```
2025-12-15 17:41:47 +00:00
Steven Masley 3194bcfc9e chore: distinct operations for provisioner's 'parse', 'init', 'plan', 'apply', 'graph' (#21064)
Provisioner steps broken into smaller granular actions.
Changes:
- `ExtractArchive` moved to `init` request (was in `configure`)
- Writing `tfstate` moved to `plan` (was in `configure`)
- Moved most plan/apply outputs to `GraphComplete`
2025-12-15 11:26:41 -06:00
Zach 7ecfd1aa07 fix: isolate keyring usage by parallel test processes (#21256)
This change ensures keyring tests that utilize the real OS keyring use
credentials that are isolated by process ID so that parallel test processes
do not access the same credentials.

https://github.com/coder/internal/issues/1192
2025-12-15 09:40:59 -07:00
Asher 27f0413347 feat: add flag to disable template insights (#20940)
Closes #20399 

To summarize the original commit messages:

- Do not log stats to the database.
- Return errors on the insight endpoints.
- Update the frontend to show those errors.
- Also fixes an issue with getting the user status count via codersdk,
since I added a test to ensure it was not disabled by this flag and it
was sending the wrong payload.
2025-12-14 03:00:03 +00:00
Mathias Fredriksson 761dd55ee8 fix(coderd/database): sort template version variables and fix test flake (#21233)
Previously the GetTemplateVersionVariables query did not sort output,
relying on PostgreSQL on-disk ordering which is undeterministic.

Variables are now sorted by name because there is no alternative for
ordering.

Tests were adjusted to accommodate the new ordering, previously they
relied on data being written to disk in insert order.
2025-12-12 11:41:46 +00:00
Mathias Fredriksson 3d38cd568e test(cli): attempt to fix TestGitSSH flake (#21230)
Since the failing test logs are gone, we can only guess at what went
wrong. Given our parallel test-suite, and that tests typically run slow
on Windows, it seems reasonable that the context timed out due to a
single context being responsbile for setup and two command executions.

This change fixes the issue by updating the context usage, if this flake
ever resurfaces, we can re-investigate.

Fixes coder/internal#770
2025-12-11 18:41:45 +00:00
Mathias Fredriksson 2e4aa729be test(cli): fix flaky TestProvisioners_Golden (#21228)
Use a single base time with consistent offsets and ensure CreatedAt
is set on all dbgen-created resources.

Fixes coder/internal#449
2025-12-11 18:19:18 +00:00
Kacper Sawicki 6f86f67754 feat(coderd): add overload protection with rate limiting and concurrency control (#21161)
## Summary

This adds configurable overload protection to the AI Bridge daemon to
prevent the server from being overwhelmed during periods of high load.

Partially addresses coder/internal#1153 (rate limits and concurrency
control; circuit breakers are deferred to a follow-up).

## New Configuration Options

| Option | Environment Variable | Description | Default |
|--------|---------------------|-------------|---------|
| `--aibridge-max-concurrency` | `CODER_AIBRIDGE_MAX_CONCURRENCY` |
Maximum number of concurrent AI Bridge requests. Set to 0 to disable
(unlimited). | `0` |
| `--aibridge-rate-limit` | `CODER_AIBRIDGE_RATE_LIMIT` | Maximum number
of AI Bridge requests per second. Set to 0 to disable rate limiting. |
`0` |

## Behavior

When limits are exceeded:
- **Concurrency limit**: Returns HTTP `503 Service Unavailable` with
message "AI Bridge is currently at capacity. Please try again later."
- **Rate limit**: Returns HTTP `429 Too Many Requests` with
`Retry-After` header.

Both protections are optional and disabled by default (0 values).

## Implementation

The overload protection is implemented as reusable middleware in
`coderd/httpmw/ratelimit.go`:

1. **`RateLimitByAuthToken`**: Per-user rate limiting that uses
`APITokenFromRequest` to extract the authentication token, with fallback
to `X-Api-Key` header for AI provider compatibility (e.g., Anthropic).
Falls back to IP-based rate limiting if no token is present. Includes
`Retry-After` header for backpressure signaling.
2. **`ConcurrencyLimit`**: Uses an atomic counter to track in-flight
requests and reject when at capacity.

The middleware is applied in `enterprise/coderd/aibridge.go` via
`r.Group` in the following order:
1. Concurrency check (faster rejection for load shedding)
2. Rate limit check

**Note**: Rate limiting currently applies to all AI Bridge requests,
including pass-through requests. Ideally only actual interceptions
should count, but this would require changes in the aibridge library.

## Testing

Added comprehensive tests for:
- Rate limiting by auth token (Bearer token, X-Api-Key, no token
fallback to IP)
- Different tokens not rate limited against each other
- Disabled when limit is zero
- Retry-After header is set on 429 responses
- Concurrency limiting (allows within limit, rejects over limit,
disabled when zero)
2025-12-11 16:38:54 +01:00
George K 4379230a27 feat: add deployment-wide option to disable workspace sharing (#21172)
Adds `--disable-workspace-sharing` option.

Workspace sharing is disabled by not including user and group ACLs in
the workspace RBAC object, which prevents ACL-based authz.

Closes https://github.com/coder/internal/issues/1072

The commit also adds saving of workspace user/group ACLs in the test DB
data generator.
2025-12-09 08:13:09 -08:00
Mathias Fredriksson 7fc8ee4c60 test(cli/cliui): add test for context cancellation during log streaming (#21125)
Verifies that streamLogs properly returns ctx.Err() when the context is
cancelled while waiting for logs. This covers the case where a user
interrupts an SSH connection (e.g., Ctrl+C) during startup script
execution.

Refs #21104
2025-12-08 14:17:25 +00:00
Mathias Fredriksson d351821ec3 fix(cli/cliui): skip startup script logs when Wait=false (#21105)
When users pass --wait=no or set CODER_SSH_WAIT=no, startup logs are no
longer dumped to stderr. The stage indicator is still shown, just not
the log content.

Fixes #13580
2025-12-08 14:11:47 +00:00
Mathias Fredriksson 0c453d7f8e refactor(cli/cliui): extract agentWaiter struct from agent connection state machine (#21104)
The Agent function had complex nested control flow and cross-case state sharing
via the showStartupLogs flag. This made the code hard to follow and maintain.

This change extract an agentWaiter struct with self-contained methods:

- wait: main state machine loop
- waitForConnection: handles Connecting/Timeout states
- handleConnected: handles Connected state and startup scripts
- streamLogs: handles log streaming/fetching
- waitForReconnection: handles Disconnected state
- pollWhile: helper to consolidate polling loops

Each handler is now self-contained with no cross-method state sharing and the 
showStartupLogs flag is replaced by return values and the waitedForConnection
tracking variable.
2025-12-08 14:00:25 +00:00
Danny Kopping 259dee2ea8 fix: move contexts to appropriate locations (#21121)
Closes https://github.com/coder/internal/issues/1173,
https://github.com/coder/internal/issues/1174

Currently these two tests are flaky because the contexts were created
before a potentially long-running process. By the time the context was
actually used, it may have timed out - leading to confusion.

Additionally, the `ExpectMatch` calls were not using the test context -
but rather a background context. I've marked that func as deprecated
because we should always tie these to the test context.

Special thanks to @mafredri for the brain probe 🧠

---------

Signed-off-by: Danny Kopping <danny@coder.com>
2025-12-05 13:14:35 +00:00
Mathias Fredriksson c750695d83 feat(cli/cliui): output empty string for empty table (#20967)
This changes makes it so that we output the empty string for Format
when there is no data. It turns out there are many places in the code
where we have such handling, but in a way that would break the JSON
formatter (since we'd output nothing on stdout or text rather than
`[]`/`null`).
2025-12-03 11:32:59 +02:00
Mathias Fredriksson ff46917e62 feat: add retention config for workspace_agent_logs (#21039)
Replace hardcoded 7-day retention for workspace agent logs with
configurable retention from deployment settings. Defaults to 7d to
preserve existing behavior.

Depends on #21038
Updates #20743
2025-12-02 16:01:33 +00:00
Mathias Fredriksson 56e7858570 feat(coderd): add retention policy configuration (#21021)
Add `RetentionConfig` with server flags for configuring data retention:

- `--audit-logs-retention`: retention for audit log entries
- `--connection-logs-retention`: retention for connection logs
- `--api-keys-retention`: retention for expired API keys (default 7d)

Updates #20743
2025-12-02 16:04:06 +02:00
Ethan bf40d678ec fix(cli): close prebuild runner prometheus server last (#21053)
## Description

Fixes the prebuilds scaletest command where the prometheus server was
being shut down before waiting for metrics to be scraped.

The issue was the defer order - since defers execute in LIFO (last-in,
first-out) order:

**Before (broken):**
1. Register tracing defer (includes wait for prometheus scrape)
2. Register prometheus server defer

Execution order: prometheus closes first, then wait happens (server
already gone!)

**After (fixed):**
1. Register prometheus server defer  
2. Register tracing defer (includes wait for prometheus scrape)

Execution order: wait happens first (server still up), then prometheus
closes.

This matches the pattern used in other scaletest commands.

## Impact

The `coderd_scaletest_prebuild_deletion_jobs_completed` metric (and
potentially others) was always showing 0 because the server shut down
before Prometheus could scrape the final values.

_This PR was generated by [`mux`](https://github.com/coder/mux) and
reviewed by a human._
2025-12-02 12:10:50 +00:00
Jake Howell ab4366f5c6 feat!: implement AI Bridge heading to /deployment/observability (#20791)
> [!CAUTION]
> In whichever release this lands, we've removed the ability to provide
keys via a YAML file (specifically on `openai_key`, `anthropic_key`,
`bedrock_access_key` and finally `bedrock_access_key_secret`). This will
need to be described in the release notes as to not break peoples AI
Bridge integrations upgrading from older versions.

This pull-request ensures that we can see the overview of the settings
of the `AI Bridge` feature within the `/deployment/observability` route.
This set of options only render when the `aibridge` feature flag is
enabled.

### Preview


![preview-ai-bridge-observability](https://github.com/user-attachments/assets/262d2456-94b4-49b2-9b4e-b14583e70ede)
2025-12-01 21:23:46 +00:00
Sas Swart ce627bf23f feat: implement agent socket api, client and cli (#20758)
closes: https://github.com/coder/coder/issues/10352
closes: https://github.com/coder/internal/issues/1094
closes: https://github.com/coder/internal/issues/1095

In this pull request, we enable a new set of experimental cli commands
grouped under `coder exp sync`.
These commands allow any process acting within a coder workspace to
inform the coder agent of its requirements and execution progress. The
coder agent will then relay this information to other processes that
have subscribed.

These commands are:
```
# Check if this feature is enabled in your environment 
coder exp sync ping

# express that your unit depends on another
coder exp sync want <unit> <dependency_unit> 

# express that your unit intends to start a portion of the script that requires 
# other units to have completed first. This command blocks until all dependencies have been met
coder exp sync start <unit> 

# express that your unit has completes its work, allowing dependent units to begin their execution
coder exp sync complete <unit>
```

Example:

In order to automatically run claude code in a new workspace, it must
first have a git repository cloned. The scripts responsible for cloning
the repository and for running claude code would coordinate in the
following way:

```bash
# Script A: Claude code

# Inform the agent that the claude script wants the git script.
# That is, the git script must have completed before the claude script can begin its execution
coder exp sync want claude git

# Inform the agent that we would now like to begin execution of claude.
# This command will block until the git script (and any other defined dependencies)
# have completed
coder exp sync start claude

# Now we run claude code and any other commands we need
claude ...

# Once our script has completed, we inform the agent, so that any scripts that depend on this one
# may begin their execution

coder exp sync complete claude
```

```bash
# Script B: Git

# Because the git script does not have any dependencies, we can simply inform the agent that we 
# intend to start
coder exp sync start git

git clone ssh://git@github.com/coder/coder

# Once the repository have been cloned, we inform the agent that this script is complete, so that
# scripts that depend on it may begin their execution.
coder exp sync complete git
```

Notes:
* Unit names (ie. `claude` and `git`) given as input to the sync
commands are arbitrary strings. You do not have to conform to specific
identifiers. We recommend naming your scripts descriptively, but
succinctly.
* Scripts unit names should be well documented. Other scripts will need
to know the names you've chosen in order to depend on yours. Therefore,
you

---------

Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
2025-11-28 08:33:50 +02:00
Zach bbf7b137da fix(cli): remove defaulting to keyring when --global-config set (#20943)
This fixes a regression that caused the VS code extension to be unable
to authenticate after making keyring usage on by default. This is
because the VS code extension assumes the CLI will always use the
session token stored on disk, specifically in the directory specified by
--global-config.

This fix makes keyring usage enabled when the --global-config directory
is not set. This is a bit wonky but necessary to allow the extension to
continue working without modification and without backwards compat
concerns. In the future we should modify these extensions to either
access the credential in the keyring (like Coder Desktop) or some other
approach that doesn't rely on the session token being stored on disk.

Tests:
`coder login dev.coder.com` -> token stored in keyring
`coder login --global-config=/tmp/ dev.coder.com` -> token stored in
`/tmp/session`
2025-11-26 10:17:31 +01:00
Zach 6238a99275 feat(cli)!: enable keyring usage by default (#20851)
Make keyring usage for session token storage on by default for supported
platforms (Windows and macOS), with the ability to opt-out via
--use-keyring=false.

This change will be a breaking change for any users depending on the
session token being stored on disk, though users can restore file usage
via the flag above.

This change will also require CLI users to authenticate after updating.
2025-11-25 18:13:00 -07:00
Danielle Maywood b255827a52 chore: promote tasks to stable from experimental (#20921)
- Promote tasks from `/api/experimental` to `/api/v2`.
- Move sdk from `ExperimentalClient` to `Client`.
- Update swagger
2025-11-25 15:24:25 +00:00