Compare commits

2 Commits

Author SHA1 Message Date
Ben Potter db433e4d34 fix(coder-labs/claude-self-hosted-runner): infer lock status from active_sessions
The previous metadata script scraped
claude_code_self_hosted_runner_locked_account from /metrics. In
Anthropic BYOC build 2.1.97-byoc.9 the runner declares the HELP/TYPE
lines for that gauge but never emits a sample line, even when the
runner is actively serving sessions. The metadata item always
showed 'unlocked' regardless of state, which is wrong: a runner
with active_sessions > 0 has been locked to an Anthropic user for
its lifetime per Anthropic's spec.

Switch to inferring the lock status from /healthz's
active_sessions, and latch a sticky flag at $HOME/.claude/locked
on first observation of active_sessions > 0. Once locked, always
locked, even when the active count drops back to zero between
sessions of the same locked user.

Module tests still pass (6/6). Live verification will follow on the
bpmct/coder-dev-tunnel deployment.
2026-05-18 14:20:43 +00:00
Ben Potter 99f3524160 feat(coder-labs/claude-self-hosted-runner): scaffold module
Drops Anthropic's Claude Code self-hosted runner into any Coder
template that has a coder_agent and a workspace image with the
runner binary installed. The module owns the runner script (writes
a per-session wrapper that forces --permission-mode
bypassPermissions, then spawns a detached supervisor that runs the
runner in the foreground and POSTs a delete build to self-evict on
drain), the agent env it needs, optional bot-git askpass, and host
Docker socket gid fixup. Agent metadata items (lock status, active
sessions, runner ID, last poll) are emitted via the agent_metadata
output for the parent to splat into a dynamic metadata block.

The parent template still owns the coder_agent itself, the
per-workspace scope-restricted self-evict token (minted via the
Mastercard/restapi provider against an admin bootstrap token), the
prebuild preset, and the infra block.

terraform test passes (6/6); rendered run.sh shellchecks clean. This
module is referenced from the Coder docs guide at
docs/ai-coder/claude-code-self-hosted-runners but is not yet
published to the registry; the docs reference the GitHub source
directly until the EAP recipe stabilizes.
2026-05-14 01:03:57 +00:00
4 changed files with 495 additions and 0 deletions
@@ -0,0 +1,73 @@
---
display_name: Claude Code self-hosted runner
description: Run Anthropic's Claude Code self-hosted runner as a long-lived process inside a Coder workspace, with per-workspace scoped self-eviction so the prebuild reconciler keeps the pool warm.
icon: ../../../../.icons/claude.svg
verified: false
tags: [ai, claude, claude-code, anthropic, runner]
---
# Claude Code self-hosted runner

Drops Anthropic's [Claude Code self-hosted runner](https://docs.anthropic.com/en/docs/claude-code/self-hosted-runners) into any Coder template that has a `coder_agent` and a workspace image with the runner binary installed (`/usr/local/bin/claude self-hosted-runner` by default).

The module owns the runner script (writes a per-session wrapper that forces `--permission-mode bypassPermissions`, then spawns a detached supervisor that runs the runner in the foreground and POSTs a delete build to self-evict on drain), the agent environment variables it needs, an optional bot-git askpass setup, and a host Docker socket gid fixup. Agent metadata items (lock status, active sessions, runner ID, last poll) are emitted via the `agent_metadata` output for the parent to splat into a `dynamic "metadata"` block.

The parent template still owns the `coder_agent` itself, the per-workspace scope-restricted self-evict token (minted via the `Mastercard/restapi` provider against an admin bootstrap token), the prebuild preset, and the infra block (`docker_container`, `kubernetes_pod`, etc.).

> [!IMPORTANT]
> This module is part of the [Claude Code self-hosted runners on Coder](https://coder.com/docs/ai-coder/claude-code-self-hosted-runners) recipe, which currently targets Anthropic's EAP build of the runner. Both the runner binary and the wire contract are still evolving; expect API drift until Anthropic ships GA.
## Usage
```tf
module "claude_self_hosted_runner" {
  source           = "registry.coder.com/coder-labs/claude-self-hosted-runner/coder"
  version          = "1.0.0"
  agent_id         = coder_agent.main.id
  workspace_id     = data.coder_workspace.me.id
  pool_secret      = var.pool_secret
  self_evict_token = jsondecode(restapi_object.self_evict_token.api_response).key
  git_bot_token    = var.git_bot_token
  capacity         = tonumber(data.coder_parameter.capacity.value)
}

resource "coder_agent" "main" {
  # ... arch, os, dir, startup_script_behavior, etc.

  # Static metadata blocks coexist with the dynamic block below;
  # Terraform concatenates them on the same coder_agent.
  metadata {
    display_name = "CPU"
    key          = "cpu"
    script       = "top -bn1 | awk '/Cpu/ {print $2 \"%\"}'"
    interval     = 10
    timeout      = 5
  }

  dynamic "metadata" {
    for_each = module.claude_self_hosted_runner.agent_metadata
    content {
      display_name = metadata.value.display_name
      key          = metadata.value.key
      interval     = metadata.value.interval
      timeout      = metadata.value.timeout
      script       = metadata.value.script
    }
  }
}
```
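
The `Lock status` item cannot read the lock directly: in the BYOC build named in the commit log, `/metrics` declares the gauge but never emits a sample. Instead it infers the lock from `/healthz`'s `active_sessions` and latches a sticky flag file. The sketch below restates that logic as a testable shell function. The name `lock_status` and the choice to pass the count and flag path as arguments (rather than calling `curl` inside) are hypothetical, for illustration only; this is not the module's code verbatim.

```shell
#!/usr/bin/env bash
# lock_status ACTIVE_SESSIONS FLAG_PATH
# Hypothetical helper restating the module's "Lock status" metadata script.
lock_status() {
  local active="${1:-0}" flag="$2"
  # First sighting of a live session latches the flag; the runner stays
  # locked to that Anthropic user for the rest of its lifetime.
  if [ "$active" -gt 0 ] 2>/dev/null && [ ! -f "$flag" ]; then
    touch "$flag" 2>/dev/null || true
  fi
  # The flag, not the current count, decides the answer, so a locked
  # runner still reports "locked" between sessions.
  if [ -f "$flag" ]; then
    printf 'locked\n'
  else
    printf 'unlocked\n'
  fi
}

# Live usage inside the workspace (same endpoint and jq filter as the module):
#   active=$(curl -fsS http://127.0.0.1:8080/healthz | jq -r '.active_sessions // 0')
#   lock_status "$active" "$HOME/.claude/locked"
```

Called with counts 0, 2, then 0 against the same flag path, it reports `unlocked`, `locked`, `locked`: the drop back to zero between sessions does not unlock the runner.
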
## What the module does
- Writes `$HOME/.claude/wrapper.sh` at agent start. The wrapper appends `--permission-mode bypassPermissions` after `"$@"` so unattended sessions never stall on a tool-approval prompt; Claude Code's flag parser is last-occurrence-wins, so this overrides the server-supplied permission mode.
- Sets up the runner's required environment (`CLAUDE_POOL_SECRET`, `CLAUDE_CAPACITY`, `GIT_BOT_TOKEN`, `CODER_SELF_TOKEN`, `CODER_WORKSPACE_ID`) via `coder_env` resources on the agent.
- Spawns a `setsid nohup` supervisor that runs the runner in the foreground. When the runner exits on drain, the supervisor POSTs `/api/v2/workspaces/{id}/builds` with `{"transition":"delete"}` to self-evict, so Coder's prebuild reconciler can queue a replacement.
- Configures git's `core.askPass` to a small helper that prints `GIT_BOT_TOKEN` when `git_bot_token` is supplied, so the runner's child `claude` process can `git push` without baking credentials into the image.
- If the parent template mounts the host Docker socket at `/var/run/docker.sock` and its gid does not match the in-container `docker` group, runs `sudo chgrp` on the socket so the workspace user can use Docker without sudo.
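
Why appending the flag after `"$@"` is enough can be shown with a toy parser. `fake_claude` below is a hypothetical stand-in that, like the behavior the wrapper relies on, keeps the last occurrence of a repeated `--permission-mode`. It is not Claude Code's actual parser, and it models only that one flag.

```shell
#!/usr/bin/env bash
# fake_claude: hypothetical stand-in for a CLI whose parser keeps the LAST
# occurrence of a repeated flag. It prints the effective permission mode.
fake_claude() {
  local mode="default"
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --permission-mode)
        if [ "$#" -ge 2 ]; then
          mode="$2"
          shift 2
        else
          shift # trailing flag with no value: ignore it
        fi
        ;;
      *)
        shift
        ;;
    esac
  done
  printf '%s\n' "$mode"
}

# Same shape as the generated wrapper.sh: server-computed flags arrive in
# "$@", and the forced flag is appended after them, so it always wins.
wrapper() {
  fake_claude "$@" --permission-mode bypassPermissions
}

wrapper --print --permission-mode plan # prints: bypassPermissions
```

Putting the forced flag before `"$@"` would instead let a server-supplied `--permission-mode plan` win, which is exactly the stall the wrapper exists to prevent.
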
## Self-eviction security model
The `self_evict_token` input is minted by the parent template via the `Mastercard/restapi` provider at workspace build time, against an admin bootstrap token that lives in Terraform state and is never injected into the workspace. The minted token is scoped to `workspace:delete + workspace:read + template:read + user:read` and allow-listed to this single workspace's UUID. A leaked copy can do exactly one thing: delete this one workspace. No read of peer prebuilds, no SSH, no external auth, no git creds.

The supervisor uses raw `curl` against `/api/v2/workspaces/{id}/builds`, not the `coder delete` CLI. The CLI fetches workspace resources first, which fails against the scoped token whose allow-list intersection excludes peer workspaces.
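
A minimal sketch of the supervisor's eviction call, split into two hypothetical helpers (`evict_endpoint`, `self_evict`). It assumes the same environment variables the module provides (`CODER_AGENT_URL`, `CODER_WORKSPACE_ID`, `CODER_SELF_TOKEN`). Trimming a trailing slash from the base URL is our own defensive assumption, and unlike the module's supervisor this version discards the response body instead of logging it.

```shell
#!/usr/bin/env bash
# evict_endpoint BASE_URL WORKSPACE_ID
# Builds the workspace-builds URL. Trimming one trailing slash from the
# base URL is our own defensive assumption.
evict_endpoint() {
  local base="${1%/}" workspace_id="$2"
  printf '%s/api/v2/workspaces/%s/builds\n' "$base" "$workspace_id"
}

# self_evict: POST a delete transition using the scoped token from the
# environment. Returns success only on HTTP 201 (build queued).
self_evict() {
  local url code
  url=$(evict_endpoint "$CODER_AGENT_URL" "$CODER_WORKSPACE_ID")
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    -X POST \
    -H "Coder-Session-Token: $CODER_SELF_TOKEN" \
    -H 'Content-Type: application/json' \
    -d '{"transition":"delete"}' \
    "$url")
  [ "$code" = "201" ]
}

# Example (inside the workspace, after the runner exits):
#   self_evict || echo "self-eviction failed" >&2
```

The request touches only this workspace's `builds` endpoint, so it stays inside the token's allow-list; that is the same reason the module avoids the `coder delete` CLI.
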
@@ -0,0 +1,185 @@
terraform {
  required_version = ">= 1.5"
  required_providers {
    coder = {
      source  = "coder/coder"
      version = ">= 2.13"
    }
  }
}

variable "agent_id" {
  type        = string
  description = "The ID of a Coder agent."
}

variable "workspace_id" {
  type        = string
  description = "data.coder_workspace.me.id from the parent template. Used by the supervisor to self-evict via the workspace builds endpoint."
}

variable "pool_secret" {
  type        = string
  description = "Claude Code self-hosted runner pool secret (from claude.ai)."
  sensitive   = true
}

variable "self_evict_token" {
  type        = string
  description = "Per-workspace, scope-restricted Coder API token. Scope = workspace:delete + workspace:read + template:read + user:read, allow_list = this workspace's UUID. A leaked copy can only delete this one workspace. The parent template mints it via the Mastercard/restapi provider at build time."
  sensitive   = true
}

variable "git_bot_token" {
  type        = string
  description = "Optional git PAT for the bot identity. Served by a git core.askPass helper so the runner's child claude can push without baking credentials into the image."
  sensitive   = true
  default     = ""
}

variable "capacity" {
  type        = number
  description = "Maximum sessions the runner serves at once. The runner locks to one Anthropic user; this caps parallelism within that user's queue."
  default     = 4

  validation {
    condition     = var.capacity >= 1 && var.capacity <= 16
    error_message = "capacity must be between 1 and 16."
  }
}

variable "runner_binary_path" {
  type        = string
  description = "Path to the `claude self-hosted-runner` binary inside the workspace."
  default     = "/usr/local/bin/claude"
}

variable "claude_binary_path" {
  type        = string
  description = "Path to the Claude Code binary the wrapper execs for each session."
  default     = "/opt/claude/claude"
}

variable "order" {
  type        = number
  description = "Order of the runner script in the agent UI."
  default     = null
}

resource "coder_env" "pool_secret" {
  agent_id = var.agent_id
  name     = "CLAUDE_POOL_SECRET"
  value    = var.pool_secret
}

resource "coder_env" "capacity" {
  agent_id = var.agent_id
  name     = "CLAUDE_CAPACITY"
  value    = tostring(var.capacity)
}

resource "coder_env" "git_bot_token" {
  agent_id = var.agent_id
  name     = "GIT_BOT_TOKEN"
  value    = var.git_bot_token
}

resource "coder_env" "self_token" {
  agent_id = var.agent_id
  name     = "CODER_SELF_TOKEN"
  value    = var.self_evict_token
}

resource "coder_env" "workspace_id" {
  agent_id = var.agent_id
  name     = "CODER_WORKSPACE_ID"
  value    = var.workspace_id
}

resource "coder_script" "claude_runner" {
  agent_id           = var.agent_id
  display_name       = "Claude self-hosted runner"
  icon               = "/icon/code.svg"
  run_on_start       = true
  start_blocks_login = false
  script = templatefile("${path.module}/scripts/run.sh", {
    CLAUDE_BINARY_PATH = var.claude_binary_path
    RUNNER_BINARY_PATH = var.runner_binary_path
  })
}

# Agent metadata items. The parent splats this list into a
# `dynamic "metadata"` block on its own `coder_agent` because nested
# blocks cannot be injected from a module. Every item is scraped from
# the runner's local /healthz endpoint. This is the only window a
# Coder admin has into whether the Anthropic pool has bound this
# workspace to a user; the runner does not expose the locked user's
# email over its local endpoints (that lives in claude.ai >
# Self-hosted runner pools).
output "agent_metadata" {
  description = "List of agent metadata items the parent template should splat into a `dynamic \"metadata\"` block on its coder_agent."
  value = [
    {
      display_name = "Lock status"
      key          = "0_lock_status"
      interval     = 10
      timeout      = 5
      # The runner does not expose its locked state via /metrics or
      # /healthz in the current BYOC build, so we infer it from
      # active_sessions and latch a sticky flag on disk: once a
      # session has landed, the runner is locked to that Anthropic
      # user for its entire lifetime per Anthropic's spec, even when
      # the active count drops back to zero between sessions.
      script = <<-EOT
        flag="$HOME/.claude/locked"
        active=$(curl -fsS http://127.0.0.1:8080/healthz 2>/dev/null \
          | jq -r '.active_sessions // 0')
        if [ "$${active:-0}" -gt 0 ] && [ ! -f "$flag" ]; then
          touch "$flag" 2>/dev/null || true
        fi
        if [ -f "$flag" ]; then
          printf 'locked'
        else
          printf 'unlocked'
        fi
      EOT
    },
    {
      display_name = "Active sessions"
      key          = "1_active_sessions"
      interval     = 5
      timeout      = 5
      script       = <<-EOT
        active=$(curl -fsS http://127.0.0.1:8080/healthz 2>/dev/null \
          | jq -r '.active_sessions // empty')
        if [ -z "$active" ]; then echo '?'; exit 0; fi
        printf '%s / %s' "$active" "$${CLAUDE_CAPACITY:-1}"
      EOT
    },
    {
      display_name = "Runner ID"
      key          = "2_runner_id"
      interval     = 30
      timeout      = 5
      script       = <<-EOT
        curl -fsS http://127.0.0.1:8080/healthz 2>/dev/null \
          | jq -r '.runner_id // "(starting)"'
      EOT
    },
    {
      display_name = "Last Anthropic poll"
      key          = "3_last_poll"
      interval     = 15
      timeout      = 5
      script       = <<-EOT
        age=$(curl -fsS http://127.0.0.1:8080/healthz 2>/dev/null \
          | jq -r '.last_poll_age_ms // empty')
        if [ -z "$age" ]; then echo '?'; exit 0; fi
        if [ "$age" -lt 30000 ]; then
          printf 'ok (%sms ago)' "$age"
        else
          printf 'stale (%ss ago)' $((age/1000))
        fi
      EOT
    },
  ]
}
@@ -0,0 +1,123 @@
run "plan_with_required_vars" {
  command = plan

  variables {
    agent_id         = "test-agent"
    workspace_id     = "test-workspace"
    pool_secret      = "test-pool-secret"
    self_evict_token = "test-self-token"
  }

  assert {
    condition     = length(resource.coder_env.pool_secret.value) > 0
    error_message = "pool_secret env should be set"
  }

  assert {
    condition     = resource.coder_env.capacity.value == "4"
    error_message = "default capacity should be 4"
  }

  assert {
    condition     = resource.coder_script.claude_runner.display_name == "Claude self-hosted runner"
    error_message = "expected the runner coder_script display_name"
  }
}

run "custom_capacity_and_binary_paths" {
  command = plan

  variables {
    agent_id           = "test-agent"
    workspace_id       = "test-workspace"
    pool_secret        = "test-pool-secret"
    self_evict_token   = "test-self-token"
    capacity           = 8
    claude_binary_path = "/custom/claude"
    runner_binary_path = "/custom/runner"
  }

  assert {
    condition     = resource.coder_env.capacity.value == "8"
    error_message = "capacity input should flow into CLAUDE_CAPACITY env"
  }

  assert {
    condition     = strcontains(resource.coder_script.claude_runner.script, "/custom/claude")
    error_message = "claude_binary_path should appear in the rendered script"
  }

  assert {
    condition     = strcontains(resource.coder_script.claude_runner.script, "/custom/runner")
    error_message = "runner_binary_path should appear in the rendered script"
  }
}

run "git_bot_token_optional" {
  command = plan

  variables {
    agent_id         = "test-agent"
    workspace_id     = "test-workspace"
    pool_secret      = "test-pool-secret"
    self_evict_token = "test-self-token"
  }

  assert {
    condition     = resource.coder_env.git_bot_token.value == ""
    error_message = "git_bot_token should default to empty string"
  }
}

run "capacity_validation_rejects_zero" {
  command = plan

  variables {
    agent_id         = "test-agent"
    workspace_id     = "test-workspace"
    pool_secret      = "test-pool-secret"
    self_evict_token = "test-self-token"
    capacity         = 0
  }

  expect_failures = [
    var.capacity,
  ]
}

run "capacity_validation_rejects_high" {
  command = plan

  variables {
    agent_id         = "test-agent"
    workspace_id     = "test-workspace"
    pool_secret      = "test-pool-secret"
    self_evict_token = "test-self-token"
    capacity         = 17
  }

  expect_failures = [
    var.capacity,
  ]
}

run "agent_metadata_output_has_four_items" {
  command = apply

  variables {
    agent_id         = "test-agent"
    workspace_id     = "test-workspace"
    pool_secret      = "test-pool-secret"
    self_evict_token = "test-self-token"
  }

  assert {
    condition     = length(output.agent_metadata) == 4
    error_message = "agent_metadata should expose four scraping items"
  }

  assert {
    condition     = output.agent_metadata[0].key == "0_lock_status"
    error_message = "first metadata item should be lock_status"
  }
}
@@ -0,0 +1,114 @@
#!/usr/bin/env bash
# Wires up everything the Claude Code self-hosted runner needs at agent
# start, then spawns a detached supervisor that keeps the runner alive
# and self-evicts on drain.
#
# Runtime env (set by coder_env in main.tf):
#   CLAUDE_POOL_SECRET   Anthropic pool secret (mandatory).
#   CLAUDE_CAPACITY      Max parallel sessions per runner (module default 4;
#                        the supervisor falls back to 1 if unset).
#   GIT_BOT_TOKEN        Optional bot PAT, served by a git core.askPass helper.
#   CODER_SELF_TOKEN     Per-workspace scope-restricted Coder API token.
#   CODER_WORKSPACE_ID   This workspace's UUID, used by self-eviction.
#   CODER_AGENT_URL      Set by the Coder agent itself.
set -euo pipefail

CLAUDE_BINARY_PATH='${CLAUDE_BINARY_PATH}'
RUNNER_BINARY_PATH='${RUNNER_BINARY_PATH}'

if [ -z "$${CLAUDE_POOL_SECRET:-}" ]; then
  echo "CLAUDE_POOL_SECRET is empty. Set the pool_secret input on the module."
  exit 1
fi

install -d -m 0700 "$HOME/.claude"

# --- Bot git askpass ----------------------------------------------------
if [ -n "$${GIT_BOT_TOKEN:-}" ]; then
  install -d -m 0700 "$HOME/.git-creds"
  # Remove any copy left by a previous start: it is mode 0500, so
  # redirecting over it fails for a non-root user and aborts under set -e.
  rm -f "$HOME/.git-creds/askpass.sh"
  cat > "$HOME/.git-creds/askpass.sh" << 'ASK'
#!/bin/sh
printf '%s' "$GIT_BOT_TOKEN"
ASK
  chmod 0500 "$HOME/.git-creds/askpass.sh"
  git config --global core.askPass "$HOME/.git-creds/askpass.sh"
  git config --global credential.helper ''
fi

# --- Host Docker socket gid fixup --------------------------------------
if [ -S /var/run/docker.sock ]; then
  sock_gid=$(stat -c %g /var/run/docker.sock)
  docker_gid=$(getent group docker | cut -d: -f3 || true)
  if [ -n "$${docker_gid:-}" ] && [ "$${sock_gid}" != "$${docker_gid}" ]; then
    # -n: fail instead of prompting if sudo would need a password.
    sudo -n chgrp "$${docker_gid}" /var/run/docker.sock 2> /dev/null || true
  fi
fi

# --- Pool secret on disk -----------------------------------------------
POOL_SECRET_FILE="$HOME/.claude/pool-secret"
rm -f "$POOL_SECRET_FILE"
umask 077
printf '%s' "$${CLAUDE_POOL_SECRET}" > "$POOL_SECRET_FILE"
chmod 0400 "$POOL_SECRET_FILE"
# --- Wrapper script -----------------------------------------------------
# Runner execs this once per session, appending its server-computed
# flags. Claude Code's flag parser is last-occurrence-wins, so flags
# after "$@" win. Force --permission-mode bypassPermissions so
# unattended sessions never stall on a tool-approval prompt.
WRAPPER="$HOME/.claude/wrapper.sh"
{
  echo '#!/bin/bash'
  echo "exec $${CLAUDE_BINARY_PATH} \"\$@\" --permission-mode bypassPermissions"
} > "$WRAPPER"
chmod 0755 "$WRAPPER"

# --- Supervisor --------------------------------------------------------
# Runs the runner in the foreground; on runner exit POSTs a delete
# build to self-evict. Raw curl, not `coder delete`: the CLI fetches
# workspace resources first, which fails with the per-workspace
# scoped token whose allow-list excludes peer prebuilds.
#
# Unquoted heredoc: the outer shell expands only RUNNER_BINARY_PATH,
# baking the binary path in at write time. Every \$-escaped reference
# (CODER_SELF_TOKEN, CODER_AGENT_URL, etc.) is left for the supervisor
# to read at runtime, when it's invoked under setsid.
SUPERVISOR="$HOME/.claude/supervisor.sh"
cat > "$SUPERVISOR" << SUP
#!/usr/bin/env bash
set -uo pipefail
exec >>"\$HOME/.claude/supervisor.log" 2>&1
echo "[supervisor] start \$(date -Is)"
$${RUNNER_BINARY_PATH} self-hosted-runner \\
  --pool-secret-file "\$HOME/.claude/pool-secret" \\
  --capacity "\$${CLAUDE_CAPACITY:-1}" \\
  --log-file "\$HOME/.claude/runner.log" \\
  --exec-path "\$HOME/.claude/wrapper.sh"
echo "[supervisor] runner exited rc=\$? \$(date -Is)"
if [ -z "\$${CODER_SELF_TOKEN:-}" ]; then
  echo "[supervisor] CODER_SELF_TOKEN is empty; skipping self-eviction."
  exit 0
fi
# Strip any trailing slash from CODER_AGENT_URL so the path has no "//".
http_code=\$(curl -s -o /tmp/evict.out -w "%%{http_code}" \\
  -X POST \\
  -H "Coder-Session-Token: \$CODER_SELF_TOKEN" \\
  -H "Content-Type: application/json" \\
  -d '{"transition":"delete"}' \\
  "\$${CODER_AGENT_URL%/}/api/v2/workspaces/\$CODER_WORKSPACE_ID/builds")
if [ "\$http_code" = "201" ]; then
  echo "[supervisor] self-eviction queued (HTTP 201)."
else
  echo "[supervisor] self-eviction failed (HTTP \$http_code): \$(head -c 300 /tmp/evict.out)"
fi
SUP
chmod 0700 "$SUPERVISOR"

# Detach with setsid + nohup. The supervisor reopens stdout/stderr to
# its own logfile; redirect all standard fds here to /dev/null so this
# script's exit doesn't drag the supervisor with it.
setsid nohup "$SUPERVISOR" < /dev/null > /dev/null 2>&1 &
disown
echo "Runner spawned as detached supervisor (pid=$!). See ~/.claude/supervisor.log."