Relates to CODAGT-115
Adds metric `coderd_api_websocket_probes_total`. Every successful
heartbeat for a given path will increment the metric.
Comparing this with `coderd_api_concurrent_websockets` will give an
indication of how many websocket connections are open but in a 'wedged'
state (when heartbeats stopped versus when we closed the connection).
## Problem
Commit 386b449 (PR #23745) changed the `OneWayWebSocketEventSender`
event channel from unbuffered to buffered(64) to reduce chat streaming
latency. This introduced a nondeterministic race in `sendEvent`:
```go
sendEvent := func(event codersdk.ServerSentEvent) error {
select {
case eventC <- event: // buffered channel — almost always ready
case <-ctx.Done(): // also ready after cancellation
}
return nil
}
```
After context cancellation, Go's `select` randomly picks between two
ready cases, so `send()` sometimes returns `nil` instead of `ctx.Err()`.
With the old unbuffered channel the send case was rarely ready (no
reader), masking the bug.
## Fix
Add a priority `select` that checks `ctx.Done()` before attempting the
channel send:
```go
select {
case <-ctx.Done():
return ctx.Err()
default:
}
select {
case eventC <- event:
case <-ctx.Done():
return ctx.Err()
}
```
This is the standard Go pattern for prioritizing one channel over
another. When the context is already cancelled, the first select returns
immediately. The second select still handles the case where cancellation
happens concurrently with the send.
## Verification
- Ran the flaky test 20× in a loop (`-count=20`): all passed
- Ran the full `TestOneWayWebSocketEventSender` suite 5× (`-count=5`):
all passed
- Ran the complete `coderd/httpapi` test package: all passed
Fixescoder/internal#1429
Previously, when a user sent a message, there was a 0–1000ms (avg
~500ms) polling delay before processing began.
`SendMessage`/`CreateChat`/`EditMessage` set `status='pending'` in the
DB and returned, but nothing woke the processing loop — it was a blind
1-second ticker.
## Changes
**Event-driven acquisition (main change):** Adds a `wakeCh` channel to
the chatd `Server`. `CreateChat`, `SendMessage`, `EditMessage`, and
`PromoteQueued` call `signalWake()` after committing their transactions,
which wakes the run loop to call `processOnce` immediately. The 1-second
ticker remains as a fallback safety net for edge cases (stale recovery,
missed signals).
**Buffer WebSocket write channel:** Changes the
`OneWayWebSocketEventSender` event channel from unbuffered to buffered
(64), decoupling the event producer from WebSocket write speed. The
existing 10s write timeout guards against stuck connections.
<details><summary>Implementation plan & analysis</summary>
The full latency analysis identified these sources of delay in the
streaming pipeline:
1. **Chat acquisition polling** — 0–1000ms (avg 500ms) dead time per
message. Fixed by wake channel.
2. **Unbuffered WebSocket write channel** — each token blocked on the
previous WS write completing. Fixed by buffering.
3. **PersistStep DB transaction per step** — `FOR UPDATE` lock + batch
insert. Not addressed in this PR (medium risk, would overlap DB write
with next provider TTFB).
4. **Multi-hop channel pipeline** — 4 channel hops per token. Not
addressed (medium complexity).
</details>
<details><summary>Test stabilization notes</summary>
`signalWake()` causes the chatd daemon to process chats immediately
after creation/send/edit, which exposed timing assumptions in several
tests that expected chats to remain in `pending` status long enough to
assert on. These tests were updated with `require.Eventually` +
`WaitUntilIdleForTest` patterns to wait for processing to settle before
asserting.
The race detector (`test-go-race-pg`) shows failures in
`TestCreateWorkspaceTool_EndToEnd` and `TestAwaitSubagentCompletion` —
these appear to be pre-existing races in the end-to-end chat flow that
are now exercised more aggressively because processing starts
immediately instead of after a 1s delay. Main branch CI (race detector)
passes without these changes.
</details>
Relates to https://github.com/coder/coder/pull/21676
* Replaces all existing usages of `httpapi.Heartbeat` with `httpapi.HeartbeatClose`
* Removes `httpapi.HeartbeatClose`
Add comprehensive OAuth2 enum types to codersdk following RFC specifications:
- OAuth2ProviderGrantType (RFC 6749)
- OAuth2ProviderResponseType (RFC 6749)
- OAuth2TokenEndpointAuthMethod (RFC 7591)
- OAuth2PKCECodeChallengeMethod (RFC 7636)
- OAuth2TokenType (RFC 6749, RFC 9449)
- OAuth2RevocationTokenTypeHint (RFC 7009)
- OAuth2ErrorCode (RFC 6749, RFC 7009, RFC 8707)
Add OAuth2TokenRequest, OAuth2TokenResponse, OAuth2TokenRevocationRequest,
and OAuth2Error structs to the SDK. Update OAuth2ClientRegistrationRequest,
OAuth2ClientRegistrationResponse, OAuth2ClientConfiguration, and
OAuth2AuthorizationServerMetadata to use typed enums instead of raw strings.
This makes codersdk the single source of truth for OAuth2 types, eliminating
duplication between SDK and server-side structs.
Closes#21476
Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example:
```
import (
"context"
"time"
"github.com/prometheus/client_golang/prometheus"
"golang.org/x/xerrors"
"gopkg.in/natefinch/lumberjack.v2"
"cdr.dev/slog/v3"
"github.com/coder/coder/v2/codersdk/agentsdk"
"github.com/coder/serpent"
)
```
3 groups: standard library, 3rd partly libs, Coder libs.
This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.
## Summary
This PR implements critical MCP OAuth2 compliance features for Coder's authorization server, adding PKCE support, resource parameter handling, and OAuth2 server metadata discovery. This brings Coder's OAuth2 implementation significantly closer to production readiness for MCP (Model Context Protocol)
integrations.
## What's Added
### OAuth2 Authorization Server Metadata (RFC 8414)
- Add `/.well-known/oauth-authorization-server` endpoint for automatic client discovery
- Returns standardized metadata including supported grant types, response types, and PKCE methods
- Essential for MCP client compatibility and OAuth2 standards compliance
### PKCE Support (RFC 7636)
- Implement Proof Key for Code Exchange with S256 challenge method
- Add `code_challenge` and `code_challenge_method` parameters to authorization flow
- Add `code_verifier` validation in token exchange
- Provides enhanced security for public clients (mobile apps, CLIs)
### Resource Parameter Support (RFC 8707)
- Add `resource` parameter to authorization and token endpoints
- Store resource URI and bind tokens to specific audiences
- Critical for MCP's resource-bound token model
### Enhanced OAuth2 Error Handling
- Add OAuth2-compliant error responses with proper error codes
- Use standard error format: `{"error": "code", "error_description": "details"}`
- Improve error consistency across OAuth2 endpoints
### Authorization UI Improvements
- Fix authorization flow to use POST-based consent instead of GET redirects
- Remove dependency on referer headers for security decisions
- Improve CSRF protection with proper state parameter validation
## Why This Matters
**For MCP Integration:** MCP requires OAuth2 authorization servers to support PKCE, resource parameters, and metadata discovery. Without these features, MCP clients cannot securely authenticate with Coder.
**For Security:** PKCE prevents authorization code interception attacks, especially critical for public clients. Resource binding ensures tokens are only valid for intended services.
**For Standards Compliance:** These are widely adopted OAuth2 extensions that improve interoperability with modern OAuth2 clients.
## Database Changes
- **Migration 000343:** Adds `code_challenge`, `code_challenge_method`, `resource_uri` to `oauth2_provider_app_codes`
- **Migration 000343:** Adds `audience` field to `oauth2_provider_app_tokens` for resource binding
- **Audit Updates:** New OAuth2 fields properly tracked in audit system
- **Backward Compatibility:** All changes maintain compatibility with existing OAuth2 flows
## Test Coverage
- Comprehensive PKCE test suite in `coderd/identityprovider/pkce_test.go`
- OAuth2 metadata endpoint tests in `coderd/oauth2_metadata_test.go`
- Integration tests covering PKCE + resource parameter combinations
- Negative tests for invalid PKCE verifiers and malformed requests
## Testing Instructions
```bash
# Run the comprehensive OAuth2 test suite
./scripts/oauth2/test-mcp-oauth2.sh
Manual Testing with Interactive Server
# Start Coder in development mode
./scripts/develop.sh
# In another terminal, set up test app and run interactive flow
eval $(./scripts/oauth2/setup-test-app.sh)
./scripts/oauth2/test-manual-flow.sh
# Opens browser with OAuth2 flow, handles callback automatically
# Clean up when done
./scripts/oauth2/cleanup-test-app.sh
Individual Component Testing
# Test metadata endpoint
curl -s http://localhost:3000/.well-known/oauth-authorization-server | jq .
# Test PKCE generation
./scripts/oauth2/generate-pkce.sh
# Run specific test suites
go test -v ./coderd/identityprovider -run TestVerifyPKCE
go test -v ./coderd -run TestOAuth2AuthorizationServerMetadata
```
### Breaking Changes
None. All changes maintain backward compatibility with existing OAuth2 flows.
---
Change-Id: Ifbd0d9a543d545f9f56ecaa77ff2238542ff954a
Signed-off-by: Thomas Kosiewski <tk@coder.com>
Closes https://github.com/coder/coder/issues/16775
## Changes made
- Added `OneWayWebSocket` function that establishes WebSocket
connections that don't allow client-to-server communication
- Added tests for the new function
- Updated API endpoints to make new WS-based endpoints, and mark
previous SSE-based endpoints as deprecated
- Updated existing SSE handlers to use the same core logic as the new WS
handlers
## Notes
- Frontend changes handled via #16855
Using negative permissions, this role prevents a user's ability to
create & delete a workspace within a given organization.
Workspaces are uniquely owned by an org and a user, so the org has to
supercede the user permission with a negative permission.
# Use case
Organizations must be able to restrict a member's ability to create a
workspace. This permission is implicitly granted (see
https://github.com/coder/coder/issues/16546#issuecomment-2655437860).
To revoke this permission, the solution chosen was to use negative
permissions in a built in role called `WorkspaceCreationBan`.
# Rational
Using negative permissions is new territory, and not ideal. However,
workspaces are in a unique position.
Workspaces have 2 owners. The organization and the user. To prevent
users from creating a workspace in another organization, an [implied
negative
permission](https://github.com/coder/coder/blob/36d9f5ddb3d98029fee07d004709e1e51022e979/coderd/rbac/policy.rego#L172-L192)
is used. So the truth table looks like: _how to read this table
[here](https://github.com/coder/coder/blob/36d9f5ddb3d98029fee07d004709e1e51022e979/coderd/rbac/README.md#roles)_
| Role (example) | Site | Org | User | Result |
|-----------------|------|------|------|--------|
| non-org-member | \_ | N | YN\_ | N |
| user | \_ | \_ | Y | Y |
| WorkspaceBan | \_ | N | Y | Y |
| unauthenticated | \_ | \_ | \_ | N |
This new role, `WorkspaceCreationBan` is the same truth table condition
as if the user was not a member of the organization (when doing a
workspace create/delete). So this behavior **is not entirely new**.
<details>
<summary>How to do it without a negative permission</summary>
The alternate approach would be to remove the implied permission, and
grant it via and organization role. However this would add new behavior
that an organizational role has the ability to grant a user permissions
on their own resources?
It does not make sense for an org role to prevent user from changing
their profile information for example. So the only option is to create a
new truth table column for resources that are owned by both an
organization and a user.
| Role (example) | Site | Org |User+Org| User | Result |
|-----------------|------|------|--------|------|--------|
| non-org-member | \_ | N | \_ | \_ | N |
| user | \_ | \_ | \_ | \_ | N |
| WorkspaceAllow | \_ | \_ | Y | \_ | Y |
| unauthenticated | \_ | \_ | \_ | \_ | N |
Now a user has no opinion on if they can create a workspace, which feels
a little wrong. A user should have the authority over what is theres.
There is fundamental _philosophical_ question of "Who does a workspace
belong to?". The user has some set of autonomy, yet it is the
organization that controls it's existence. A head scratcher 🤔
</details>
## Will we need more negative built in roles?
There are few resources that have shared ownership. Only
`ResourceOrganizationMember` and `ResourceGroupMember`. Since negative
permissions is intended to revoke access to a shared resource, then
**no.** **This is the only one we need**.
Classic resources like `ResourceTemplate` are entirely controlled by the
Organization permissions. And resources entirely in the user control
(like user profile) are only controlled by `User` permissions.
![Uploading Screenshot 2025-02-26 at 22.26.52.png…]()
---------
Co-authored-by: Jaayden Halko <jaayden.halko@gmail.com>
Co-authored-by: ケイラ <mckayla@hey.com>
Migrates us to `coder/websocket` v1.8.12 rather than `nhooyr/websocket` on an older version.
Works around https://github.com/coder/websocket/issues/504 by adding an explicit test for `xerrors.Is(err, io.EOF)` where we were previously getting `io.EOF` from the netConn.
* chore: implement deleting custom roles
* add trigger to delete role from organization members on delete
* chore: add comments to explain populated field
* Add database tables for OAuth2 applications
These are applications that will be able to use OAuth2 to get an API key
from Coder.
* Add endpoints for managing OAuth2 applications
These let you add, update, and remove OAuth2 applications.
* Add frontend for managing OAuth2 applications
* chore: add /v2 to import module path
go mod requires semantic versioning with versions greater than 1.x
This was a mechanical update by running:
```
go install github.com/marwan-at-work/mod/cmd/mod@latest
mod upgrade
```
Migrate generated files to import /v2
* Fix gen
* chore: Rbac errors should be returned, and not hidden behind 404
SqlErrNoRows was hiding actual errors
* Replace sql.ErrNoRow checks
* Remove sql err no rows check from dbauthz test
* Fix to use dbauthz system user
Adds a flag --disable-password-auth that prevents the password login
endpoint from working unless the user has the "owner" (aka. site admin)
role.
Adds a subcommand `coder server create-admin-user` which creates a user
directly in the database with the "owner" role, the "admin" role in
every organization, and password auth. This is to avoid lock-out
situations where all accounts have the login type set to an identity
provider and nobody can login.
* feat: Make workspace watching realtime instead of polling
This was leading to performance issues on the frontend, where
the page should only be rendered if changes occur. While this
could be changed on the frontend, it was always the intention
to make this socket ~realtime anyways.
* Fix workspace tests waiting, erroring on workspace update, and add comments to workspace events
The goroutine launched by `ServerSentEventSender` can perform a write
and flush after the calling http handler has exited, at this point the
resources (e.g. `http.ResponseWriter`) are no longer safe to use.
To work around this issue, heartbeats and sending events are now handled
by the goroutine which signals its closure via a channel. This allows
the calling handler to ensure it is kept alive until it's safe to exit.
Fixes#4807