mirror of
https://github.com/coder/coder.git
synced 2026-06-04 13:38:21 +00:00
155 lines
5.7 KiB
Markdown
155 lines
5.7 KiB
Markdown
# Improving Agent Resiliency
|
|
|
|
Coder's agent can automatically lower the scheduling priority
|
|
and raise the OOM (out-of-memory) kill score of user processes
|
|
so the agent itself stays alive under resource pressure.
|
|
|
|
## Prerequisites
|
|
|
|
- **Linux** — The feature is ignored on other operating systems.
|
|
- **`CAP_SYS_NICE`** — Required if the agent needs to lower
|
|
the nice value below its current value. In Kubernetes, add
|
|
it to the container's security context:
|
|
|
|
```hcl
|
|
container {
|
|
security_context {
|
|
capabilities {
|
|
add = ["CAP_SYS_NICE"]
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
## Environment variables
|
|
|
|
Configure the feature with environment variables in the
|
|
environment that launches the agent binary. These must be set
|
|
on the workspace container or host, not in the `coder_agent`
|
|
resource's `env` block — the agent reads them from its own
|
|
process environment at startup.
|
|
|
|
| Variable | Required | Default | Description |
|
|
|-------------------------|----------|-------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
|
| `CODER_PROC_PRIO_MGMT` | Yes | — | Set to enable the feature. The agent checks whether the variable is present, not its value — even an empty string enables it. Use `1` by convention. To disable, unset the variable entirely. |
|
|
| `CODER_PROC_OOM_SCORE` | No | Computed from agent's score | Explicit `oom_score_adj` value for child processes. Range: `-1000` to `1000`. |
|
|
| `CODER_PROC_NICE_SCORE` | No | Agent nice + 5 (capped at 19) | Explicit nice value for child processes. Range: `-20` to `19` (higher = lower priority). |
|
|
|
|
### OOM score defaults
|
|
|
|
If you do not set `CODER_PROC_OOM_SCORE`, the agent computes a
|
|
value based on its own `oom_score_adj`:
|
|
|
|
| Agent's `oom_score_adj` | Child score | Rationale |
|
|
|-------------------------|-------------|------------------------------------------------|
|
|
| Negative (< 0) | `0` | Children are treated as normal processes. |
|
|
| >= 998 | `1000` | Children get the maximum score (killed first). |
|
|
| Any other value | `998` | Children get a near-maximum score. |
|
|
|
|
The goal is for the kernel's OOM killer to target child
|
|
processes before the agent, keeping remote connectivity alive
|
|
even when a workspace runs out of memory.
|
|
|
|
### Nice score defaults
|
|
|
|
If you do not set `CODER_PROC_NICE_SCORE`, the agent sets
|
|
children to its own nice value plus 5, capped at 19. This
|
|
gives the agent more CPU scheduling priority than user
|
|
workloads.
|
|
|
|
## Example
|
|
|
|
The following Kubernetes template snippet enables process
|
|
priority management on the workspace container:
|
|
|
|
```hcl
|
|
resource "kubernetes_deployment" "workspace" {
|
|
# ... other configuration
|
|
|
|
spec {
|
|
template {
|
|
spec {
|
|
container {
|
|
name = "dev"
|
|
image = "codercom/enterprise-base:ubuntu"
|
|
|
|
env {
|
|
name = "CODER_AGENT_TOKEN"
|
|
value = coder_agent.main.token
|
|
}
|
|
env {
|
|
name = "CODER_PROC_PRIO_MGMT"
|
|
value = "1"
|
|
}
|
|
env {
|
|
name = "CODER_PROC_OOM_SCORE"
|
|
value = "10"
|
|
}
|
|
env {
|
|
name = "CODER_PROC_NICE_SCORE"
|
|
value = "1"
|
|
}
|
|
|
|
security_context {
|
|
capabilities {
|
|
add = ["CAP_SYS_NICE"]
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
- `CODER_PROC_OOM_SCORE=10` gives child processes a slightly
|
|
elevated OOM score while keeping them well below the maximum.
|
|
- `CODER_PROC_NICE_SCORE=1` gives children a mildly lower CPU
|
|
priority than the agent.
|
|
- `CAP_SYS_NICE` allows the agent to set nice values.
|
|
|
|
## Troubleshooting
|
|
|
|
### OOM score adjustment fails
|
|
|
|
If you see `failed to adjust oom score` in stderr but the
|
|
process still starts, the agent likely lacks permission to
|
|
write to `/proc/self/oom_score_adj`. Ensure the process is
|
|
dumpable — this is handled automatically by the agent, but
|
|
can fail if the container runtime restricts `prctl` calls.
|
|
|
|
### Nice value is not applied
|
|
|
|
If you see `failed to adjust niceness` in stderr, nice values
|
|
can only be increased (lowered in priority) without
|
|
`CAP_SYS_NICE`. If your template sets a `CODER_PROC_NICE_SCORE`
|
|
lower than the agent's current nice value, add the capability
|
|
to the container's security context.
|
|
|
|
### Environment variables leak to nested Coder agents
|
|
|
|
The agent strips all `CODER_PROC_*` variables from child
|
|
environments automatically. This prevents interference in
|
|
"Coder on Coder" development scenarios where a workspace
|
|
runs another Coder agent.
|
|
|
|
### Verifying the feature is enabled
|
|
|
|
The agent logs whether process priority management is active
|
|
at startup. Look for these lines in the agent log:
|
|
|
|
```text
|
|
"process priority management enabled"
|
|
"process priority management not enabled (linux-only)"
|
|
```
|
|
|
|
The log entry includes the `CODER_PROC_PRIO_MGMT` value and
|
|
the operating system. Check the agent log file at
|
|
`<log-dir>/coder-agent.log` or stderr output.
|
|
|
|
### Feature has no effect on macOS or Windows
|
|
|
|
Process priority management is Linux-only. Setting
|
|
`CODER_PROC_PRIO_MGMT` on other operating systems is safe
|
|
but has no effect.
|