From 36d52ba504ddb66e841a8cb14d3eb3914b8f0e29 Mon Sep 17 00:00:00 2001
From: Nick Vigilante
Date: Tue, 12 May 2026 14:18:31 -0400
Subject: [PATCH] feat(.github/workflows): trigger Algolia, ISR, and Vercel deploy on docs/** changes (#25049)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Folds the Algolia/ISR sync trigger and surgical-reindex path computation into the existing `deploy-docs.yaml` workflow so a single `docs/**` push fires every update path the docs site needs. One preflight job feeds two parallel sibling jobs:

- **`changes`** (preflight): diffs `github.event.before` against `github.sha` to compute `manifest_changed` and `paths_json` (a JSON array of `{path, status}` objects derived from `git diff --name-status -z`, capped at 50 entries). The mapping is `A → added`, `M/T → modified`, `D → deleted`, `R → renamed` (indexed by the new path). Falls back to whole-branch (emits `paths_json: "[]"`) on `workflow_dispatch`, the first push to a new branch, fetch failure, manifest changes (route restructuring would orphan records), or >50 markdown files.
- **`algolia-and-isr`** (always, parallel with `vercel-rebuild`): HMAC-signed POST to `coder.com/api/algolia-docs-sync` with the `paths_json` array as part of the body. Refreshes the Algolia `docs` slice for the `(corpus, ref)` pair and ISR-revalidates every navigable route the handler touched. Markdown-only edits surface in seconds with no full rebuild. The `Mode` line in the step summary (`whole-branch` or `surgical (N path(s))`) lets operators verify which path ran without scrolling through the curl output.
- **`vercel-rebuild`** (parallel with `algolia-and-isr`, only when `docs/manifest.json` changed): fires the existing Vercel deploy hook for a full build. Manifest changes can register or remove routes that Next.js's `getStaticPaths` only re-evaluates on a full build, so ISR-per-existing-path is not enough.
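
As a rough illustration of the status mapping the `changes` job applies, the sketch below turns synthetic `git diff --name-status -z` output into tab-separated path and status lines. It is not the workflow's parser: the `map_status` helper is a hypothetical name, and the sketch converts NUL bytes to newlines first, so it assumes no path contains a newline. Unknown status codes are silently skipped here, whereas the workflow warns about them.

```shell
#!/usr/bin/env bash
# Illustrative sketch only (map_status is a hypothetical helper, not
# part of the workflow). Maps `git diff --name-status -z` records to
# "path<TAB>status" lines. Assumes no path contains a newline.
set -euo pipefail

map_status() {
	tr '\0' '\n' | awk '
		# Every record starts with a status line. Renames carry two
		# path lines (old, then new); all other codes carry one.
		/^R/    { getline oldp; getline newp; printf "%s\trenamed\n", newp; next }
		/^A/    { getline p; printf "%s\tadded\n", p; next }
		/^[MT]/ { getline p; printf "%s\tmodified\n", p; next }
		/^D/    { getline p; printf "%s\tdeleted\n", p; next }
		# Unknown code: consume its path so later records stay aligned.
		{ getline skipped }
	'
}

# Synthetic input: one modified file, one rename, one deletion.
printf 'M\0docs/a.md\0R100\0docs/old.md\0docs/new.md\0D\0docs/c.md\0' | map_status
```

The rename record is indexed by its new path, matching the `R → renamed` rule above.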

Trigger expanded from "main + manifest.json" to "main and `release/*` + any `docs/**`" so release-branch docs edits also flow through the same pipeline. The Vercel rebuild path stays gated on manifest changes regardless of branch.

The pure shell + curl + openssl + jq + awk pipeline is preserved verbatim. Zero Algolia or Node dependencies in CI.

## Why one workflow instead of two

The original split (a standalone Algolia workflow + the existing `deploy-docs.yaml`) would have run twice per manifest push, with two parallel concurrency groups, two GitHub Actions step summaries, and two ways to forget to add a secret. Folding into one file makes the trigger story symmetrical: "docs change → all docs surfaces refresh," with the rebuild path being a strict superset of the ISR path, and the surgical path strictly cheaper than whole-branch when computable.

## Pre-merge testing

The companion handler PR (coder/coder.com#741) supports an `ALGOLIA_DOCS_INDEX` env-var override, scoped to `docs_smoke` on the Vercel preview deploy, so this workflow can be exercised end-to-end against a disposable index without touching production records. The smoke harness at `~/audit/smoke/run.sh` (workspace-only) signs and posts the same body shape this workflow does, so it covers the same crypto path.

To exercise the workflow itself, push a docs-only commit to a throwaway branch that matches the `release/*` trigger and watch the step summary; the `algolia-and-isr` job will print the resolved mode.

## Prerequisites before this can do anything useful

1. `secrets.ALGOLIA_DOCS_SYNC_SECRET` must be added as an Actions secret on this repo. The same value goes on `coder.com`'s Vercel env. The workflow logs a clear error and aborts with no network call if the secret is missing.
2. The handler at coder/coder.com#741 must be merged and deployed. Without it, the POST will 404.
3. `secrets.DEPLOY_DOCS_VERCEL_WEBHOOK` is already in place from the existing `deploy-docs.yaml`; this PR does not change its usage.
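
For anyone checking that the shared secret is wired up consistently on both sides, the signing step can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the workflow or the smoke harness: `sign_body` is a hypothetical helper name, the secret and body are placeholder values, and the sketch takes the last whitespace-separated field of the `openssl dgst` output so it tolerates differences in the label OpenSSL prints before the digest.

```shell
#!/usr/bin/env bash
# Illustrative sketch only (sign_body is a hypothetical helper).
# Computes the "sha256=<hex>" value sent in X-Coder-Signature.
set -euo pipefail

# $1 = shared secret, $2 = exact request body bytes.
sign_body() {
	local secret="$1" body="$2" hex
	# printf '%s' avoids appending a newline, so the HMAC covers
	# exactly the bytes that would be POSTed.
	hex="$(printf '%s' "$body" | openssl dgst -sha256 -hmac "$secret" -hex | awk '{print $NF}')"
	printf 'sha256=%s' "$hex"
}

# Placeholder values for illustration; never hard-code a real secret.
SECRET="example-secret"
BODY='{"action":"index","corpus":"v2","ref":"main","paths":[]}'
SIG="$(sign_body "$SECRET" "$BODY")"
echo "X-Coder-Signature: $SIG"
```

Because the signature covers the raw body, any re-serialization between signing and sending (for example, pretty-printing the JSON) would invalidate it.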

## Demo, validation, and design

- Front-end-only fixes (modal layout, scroll-shadow, rank-order preservation): coder/coder.com#749 ships these against production today, independent of this PR.
- Companion handler PR on `coder.com`: coder/coder.com#741. Includes the surgical-mode plumbing this workflow's `paths_json` output drives.
- Full design lives in the workspace at `~/plans/algolia-search-revamp.md`. Key sections:
  - §6.0–6.2: why the indexer lives in `coder.com`, not here.
  - §6.7: per-version add/remove mechanics.
  - §6.8: ISR revalidate rationale and same-time refresh.
  - §6.9: surgical per-page reindex (workflow + handler + planning rules).

---

This PR was generated by Coder Agents.
---
 .github/workflows/deploy-docs.yaml         | 467 ++++++++++++++++++++-
 .github/workflows/test-deploy-docs-diff.sh | 291 +++++++++++++
 2 files changed, 749 insertions(+), 9 deletions(-)
 create mode 100755 .github/workflows/test-deploy-docs-diff.sh

diff --git a/.github/workflows/deploy-docs.yaml b/.github/workflows/deploy-docs.yaml
index 41c6e35bda..51659f2cd1 100644
--- a/.github/workflows/deploy-docs.yaml
+++ b/.github/workflows/deploy-docs.yaml
@@ -1,23 +1,472 @@
-# This workflow triggers a Vercel deploy hook which builds+deploys coder.com
-# (a Next.js app), to keep coder.com/docs URLs in sync with docs/manifest.json
+name: Update coder.com/docs
+
+# Triggers updates to the public docs at coder.com/docs whenever this
+# branch's docs/** content changes. One preflight job (`changes`) feeds
+# two parallel sibling jobs so that search records, the static cache,
+# and any new routes register at the same time:
+#
+#   1. algolia-and-isr: HMAC-signed POST to coder.com/api/algolia-docs-sync.
+#      The handler re-extracts records for the (corpus, ref) pair and
+#      atomically replaces the slice of the Algolia `docs` index, then
+#      calls `res.revalidate(p)` for every navigable manifest entry to
+#      refresh Vercel's static-page cache without a full rebuild. Runs
+#      on every docs/** push.
+#
+#   2. vercel-rebuild: fires the Vercel deploy hook for a full
+#      build+deploy. Only runs when docs/manifest.json changed, since a
+#      manifest change can introduce or remove routes that Next.js's
+#      `getStaticPaths` only re-evaluates on a full rebuild.
+#
+# Markdown-only edits hit only path 1 and surface in seconds. Manifest
+# edits hit both paths in parallel; the ISR revalidate is harmless
+# against the previous deployment while the new build is in flight,
+# and Vercel only swaps to the new build atomically when ready.
 #
 # https://vercel.com/docs/deploy-hooks#triggering-a-deploy-hook
-
-name: Update coder.com/docs
+# See coder/coder.com/src/pages/api/algolia-docs-sync.ts.
 
 on:
   push:
     branches:
       - main
+      - "release/*"
     paths:
-      - "docs/manifest.json"
+      - "docs/**"
+      - ".github/workflows/deploy-docs.yaml"
+  workflow_dispatch:
+    inputs:
+      action:
+        description: "Algolia action to perform"
+        required: true
+        type: choice
+        default: index
+        options:
+          - index
+          - delete
+      ref:
+        description: "Branch to (re)index or delete (e.g. main, release/2.32). Defaults to the workflow's checkout ref."
+        required: false
+        type: string
 
-permissions: {}
+permissions:
+  contents: read
+
+# Do not cancel in-progress runs. Each run's `changes` job diffs the
+# event's own (before, after) SHA pair, so two rapid pushes produce two
+# non-overlapping surgical-mode requests. Cancelling the first run
+# would silently drop its diff: the second run only sees its own pair,
+# never sees the cancelled run's paths, and the dropped pages would
+# stay stale until the next whole-branch reindex (manifest change,
+# >50-file push, or manual workflow_dispatch). Runs are lightweight
+# (shell + curl, ~2 minutes), so overlapping runs are cheap.
+concurrency:
+  group: deploy-docs-${{ github.ref }}
+  cancel-in-progress: false
 
 jobs:
-  deploy-docs:
+  # Detect what changed so the dependent jobs know:
+  #   - whether a Vercel full rebuild is needed (manifest changed), and
+  #   - which markdown pages to surgically reindex (the changed set).
+  #
+  # Outputs:
+  #   manifest_changed: "true" | "false"
+  #   paths_json:       a JSON array of {path, status} objects, or "[]"
+  #                     when no markdown changes are eligible for
+  #                     surgical mode (manifest-only push, an
+  #                     uncomputable diff, a workflow_dispatch trigger,
+  #                     or a diff that exceeds the surgical-mode cap).
+  #                     An empty array tells the handler to fall back
+  #                     to whole-branch reindex.
+  changes:
     runs-on: ubuntu-latest
+    outputs:
+      manifest_changed: ${{ steps.diff.outputs.manifest_changed }}
+      paths_json: ${{ steps.diff.outputs.paths_json }}
     steps:
-      - name: Deploy docs site
+      - name: Compute changed-files signal
+        id: diff
+        env:
+          EVENT_NAME: ${{ github.event_name }}
+          BEFORE_SHA: ${{ github.event.before }}
+          AFTER_SHA: ${{ github.sha }}
         run: |
-          curl -X POST "${{ secrets.DEPLOY_DOCS_VERCEL_WEBHOOK }}"
+          set -euo pipefail
+          emit_whole_branch_fallback() {
+            # Tells the algolia-and-isr job to operate in whole-branch
+            # mode by sending an empty paths array. The handler treats
+            # the absence of paths (or an empty list) as "reindex
+            # everything for this (corpus, ref)".
+            echo "paths_json=[]" >> "$GITHUB_OUTPUT"
+          }
+          # workflow_dispatch never has a diff range; treat as
+          # "manifest unchanged" so the manual reindex/delete path
+          # doesn't trigger a Vercel rebuild it didn't ask for, and as
+          # whole-branch so a manual reindex is exhaustive.
+          if [ "$EVENT_NAME" != "push" ]; then
+            echo "manifest_changed=false" >> "$GITHUB_OUTPUT"
+            emit_whole_branch_fallback
+            exit 0
+          fi
+          # First push to a brand-new branch has BEFORE_SHA = all zeros.
+          # In that edge case we conservatively assume the manifest is
+          # part of the initial state and trigger a full rebuild + a
+          # whole-branch reindex.
+          if [ -z "${BEFORE_SHA:-}" ] || [ "$BEFORE_SHA" = "0000000000000000000000000000000000000000" ]; then
+            echo "manifest_changed=true" >> "$GITHUB_OUTPUT"
+            emit_whole_branch_fallback
+            exit 0
+          fi
+          # We don't need a full checkout for `git diff` against two
+          # known SHAs. A shallow fetch of just those two commits is
+          # enough.
+          git init -q
+          git remote add origin "https://github.com/${GITHUB_REPOSITORY}.git"
+          GIT_ERR=$(mktemp)
+          if ! git -c protocol.version=2 fetch --depth=1 origin "$BEFORE_SHA" "$AFTER_SHA" 2>"$GIT_ERR"; then
+            # Fall back to whole-branch if the shallow fetch failed
+            # (e.g. force-push rewrote history). Surfacing the git
+            # stderr line in the warning lets operators diagnose
+            # network or auth failures without reproducing the fetch
+            # manually.
+            FIRST_ERR=$(head -1 "$GIT_ERR" 2>/dev/null || true)
+            echo "::warning::Could not fetch BEFORE_SHA=$BEFORE_SHA: ${FIRST_ERR:-unknown}; assuming manifest changed"
+            echo "manifest_changed=true" >> "$GITHUB_OUTPUT"
+            emit_whole_branch_fallback
+            exit 0
+          fi
+          # Manifest signal.
+          if git diff --name-only "$BEFORE_SHA" "$AFTER_SHA" -- docs/manifest.json | grep -q .; then
+            echo "manifest_changed=true" >> "$GITHUB_OUTPUT"
+            # Manifest changes can rename or restructure routes, so
+            # surgical mode is not safe; a per-path delete keyed off
+            # the new canonical URL would miss records under old URLs.
+            # Whole-branch reindex is the right behavior here.
+            emit_whole_branch_fallback
+            exit 0
+          else
+            echo "manifest_changed=false" >> "$GITHUB_OUTPUT"
+          fi
+          # Surgical mode: emit the changed markdown set as a JSON
+          # array of {path, status} objects. We use --name-status -z
+          # so the handler can distinguish modified/added (re-extract
+          # + save) from deleted/renamed-old-side (delete only), and
+          # so paths containing whitespace or quotes survive intact.
+          DIFF_FILE=$(mktemp)
+          git diff --name-status -z "$BEFORE_SHA" "$AFTER_SHA" -- 'docs/**/*.md' > "$DIFF_FILE"
+          # Parse the NUL-delimited diff into <path>\t<status> lines.
+          # `--name-status -z` uses NUL between fields and between
+          # records, with a special twist for renames: the record is
+          # `R<score>\0<old>\0<new>\0`, three NUL-delimited fields instead
+          # of two. Status codes: A=added, M=modified, T=type-changed
+          # (treated as modified), D=deleted, R=renamed (we index
+          # the new path since that is the live route). Unknown codes
+          # log a warning and are skipped; a single awk handles both
+          # the parsing and the count so the two cannot disagree.
+          #
+          # Tested in test-deploy-docs-diff.sh. Keep that script in
+          # sync with any changes to this block.
+          PARSED=$(mktemp)
+          awk -v RS='\0' '
+            function emit(path, status) {
+              printf "%s\t%s\n", path, status
+            }
+            {
+              code = substr($0, 1, 1)
+              if (code == "A") { getline; emit($0, "added"); next }
+              if (code == "M") { getline; emit($0, "modified"); next }
+              if (code == "T") { getline; emit($0, "modified"); next }
+              if (code == "D") { getline; emit($0, "deleted"); next }
+              if (code == "R") {
+                # R<score>\0<old>\0<new>\0
+                getline old_path
+                getline new_path
+                emit(new_path, "renamed")
+                next
+              }
+              if ($0 != "") {
+                # Unknown status code. Consume the path field so the
+                # record alignment stays correct, then warn.
+                unknown_code = $0
+                getline unknown_path
+                printf "::warning::Unknown git diff status %s for %s; skipping.\n", unknown_code, unknown_path > "/dev/stderr"
+              }
+            }
+          ' "$DIFF_FILE" > "$PARSED"
+          # Count is derived from the emitter output, so the count and
+          # the JSON payload cannot diverge by construction (DEREM-21).
+          CHANGED=$(wc -l < "$PARSED" | tr -d ' ')
+          if [ "$CHANGED" -eq 0 ]; then
+            # The markdown-only pathspec on the diff above means we
+            # only get here on edits to non-markdown files under docs/
+            # (e.g., images). Whole-branch reindex is overkill for
+            # those, but it is also harmless and avoids a special case;
+            # an empty paths array makes the handler skip both the
+            # save and the revalidate when no manifest entry maps to
+            # the changed file.
+            emit_whole_branch_fallback
+            exit 0
+          fi
+          # Cap at 50 changed files. Above that a whole-branch reindex
+          # is faster (one deleteBy + one saveObjects vs N deleteBy
+          # calls), and the surgical-mode payload also stays well under
+          # GitHub Actions' output size limit.
+          if [ "$CHANGED" -gt 50 ]; then
+            echo "::notice::$CHANGED markdown files changed; falling back to whole-branch reindex (cap is 50 for surgical mode)"
+            emit_whole_branch_fallback
+            exit 0
+          fi
+          # jq -Rcn slurps the <path>\t<status> lines and handles JSON
+          # escaping for quotes, backslashes, and any other special
+          # characters in the path.
+          PATHS_JSON=$(jq -Rcn '
+            [ inputs
+              | split("\t")
+              | { path: .[0], status: .[1] }
+            ]
+          ' < "$PARSED")
+          # Defense in depth: fail loudly if jq could not parse what
+          # we built. jq -c already validates structure; this catches
+          # the empty-stdin edge case.
+          if [ -z "$PATHS_JSON" ] || [ "$PATHS_JSON" = "null" ]; then
+            PATHS_JSON='[]'
+          fi
+          echo "paths_json=$PATHS_JSON" >> "$GITHUB_OUTPUT"
+          echo "Surgical mode: $CHANGED path(s) changed."
+
+  # Path 1: always run. Notifies coder.com to refresh Algolia records
+  # and ISR-revalidate the affected pages.
+  algolia-and-isr:
+    runs-on: ubuntu-latest
+    needs: changes
+    steps:
+      - name: Compute action and ref
+        id: input
+        env:
+          INPUT_ACTION: ${{ inputs.action }}
+          INPUT_REF: ${{ inputs.ref }}
+          GITHUB_REF_NAME: ${{ github.ref_name }}
+        run: |
+          set -euo pipefail
+          ACTION="${INPUT_ACTION:-index}"
+          REF="${INPUT_REF:-$GITHUB_REF_NAME}"
+          # Reject newlines/carriage returns in either input. GitHub
+          # Actions parses GITHUB_OUTPUT line-by-line with last-writer-
+          # wins, so a newline in $REF would let an operator dispatch
+          # `release/x\naction=delete\nref=main` past the validation
+          # below (the case `*` glob matches the multi-line string),
+          # then have `echo "ref=$REF" >> $GITHUB_OUTPUT` write three
+          # lines whose effective outputs are `action=delete ref=main`.
+          # `inputs.ref` is a single-line UI field; the REST API will
+          # accept anything. Reject embedded newlines explicitly.
+          case "$ACTION" in
+            *[$'\n\r']*)
+              echo "::error::action must not contain newlines."
+              exit 1
+              ;;
+          esac
+          case "$REF" in
+            *[$'\n\r']*)
+              echo "::error::ref must not contain newlines."
+              exit 1
+              ;;
+          esac
+          # The workflow_dispatch `type: choice` is enforced only by
+          # the GitHub UI. The REST API will accept any string. We
+          # validate explicitly so a malformed action never reaches
+          # the handler (which trusts this value after HMAC check).
+          case "$ACTION" in
+            index|delete) ;;
+            *)
+              echo "::error::Unsupported action '$ACTION'. Must be 'index' or 'delete'."
+              exit 1
+              ;;
+          esac
+          case "$REF" in
+            main|release/*) ;;
+            *)
+              echo "::error::Unsupported ref '$REF'. Only main and release/* are eligible."
+              exit 1
+              ;;
+          esac
+          # Refuse to run `action=delete` against main. The dispatch
+          # UI defaults `ref` to the dispatching branch (typically
+          # `main`), so a single forgotten field when cleaning up a
+          # release branch would wipe production search records.
+          # Force the operator to type the ref explicitly for delete.
+          if [ "$ACTION" = "delete" ] && [ "$REF" = "main" ]; then
+            echo "::error::Refusing to delete records for ref=main. Specify a release/* ref explicitly when dispatching delete."
+            exit 1
+          fi
+          echo "action=$ACTION" >> "$GITHUB_OUTPUT"
+          echo "ref=$REF" >> "$GITHUB_OUTPUT"
+
+      - name: POST to coder.com docs indexer
+        env:
+          ACTION: ${{ steps.input.outputs.action }}
+          REF: ${{ steps.input.outputs.ref }}
+          PATHS_JSON: ${{ needs.changes.outputs.paths_json }}
+          SECRET: ${{ secrets.ALGOLIA_DOCS_SYNC_SECRET }}
+        run: |
+          set -euo pipefail
+          if [ -z "${SECRET:-}" ]; then
+            echo "::error::ALGOLIA_DOCS_SYNC_SECRET is not configured."
+            exit 1
+          fi
+          # Build the webhook body. paths_json is always a valid JSON
+          # array (possibly empty) thanks to the changes job. An empty
+          # array tells the handler to do a whole-branch reindex; a
+          # non-empty array triggers surgical per-page mode.
+          if [ -z "${PATHS_JSON:-}" ]; then
+            PATHS_JSON='[]'
+          fi
+          BODY=$(jq -nc \
+            --arg action "$ACTION" \
+            --arg corpus "v2" \
+            --arg ref "$REF" \
+            --argjson paths "$PATHS_JSON" \
+            '{action: $action, corpus: $corpus, ref: $ref, paths: $paths}')
+          # SHA-256 HMAC over the exact bytes we POST. The handler verifies
+          # with crypto.timingSafeEqual on the same raw body, so the
+          # prefix and hex casing must match.
+          SIG="sha256=$(printf '%s' "$BODY" | openssl dgst -sha256 -hmac "$SECRET" -hex | awk '{print $2}')"
+          PATHS_COUNT=$(printf '%s' "$PATHS_JSON" | jq 'length')
+          MODE="whole-branch"
+          if [ "$PATHS_COUNT" -gt 0 ]; then
+            MODE="surgical ($PATHS_COUNT path(s))"
+          fi
+          echo "Action: $ACTION Ref: $REF Mode: $MODE"
+          RESPONSE=$(mktemp)
+          RC=0
+          HTTP_STATUS=$(curl --fail-with-body -sS \
+            --connect-timeout 10 \
+            --max-time 120 \
+            -o "$RESPONSE" \
+            -w '%{http_code}' \
+            -X POST \
+            -H 'Content-Type: application/json' \
+            -H "X-Coder-Signature: $SIG" \
+            --data "$BODY" \
+            https://coder.com/api/algolia-docs-sync) || RC=$?
+          # Render only an allowlisted subset of the handler response in
+          # the step summary. The handler can include free-form fields
+          # (error, reason, revalidateSampleErrors, skippedReasons,
+          # recordsByType) that may reflect upstream error strings. This
+          # repository is public, so the step summary is visible to
+          # anyone with read access; filter those fields out before the
+          # summary is written. The full response remains in the curl
+          # output captured in the workflow logs, which are restricted
+          # to repo collaborators.
+          #
+          # Keep this allowlist in sync with SyncResponseBody in
+          # coder/coder.com/src/pages/api/algolia-docs-sync.ts; add a
+          # field here only after confirming it is bounded enough to be
+          # safe for a public UI.
+          SAFE_RESPONSE=$(jq '
+            if type == "object" then
+              {
+                action,
+                corpus,
+                ref,
+                records,
+                pagesIndexed,
+                pagesSkipped,
+                revalidated,
+                revalidateFailed,
+                mode,
+                pathsRequested,
+                pathsSkipped,
+                index,
+                tookMs
+              } | with_entries(select(.value != null))
+            else
+              {}
+            end
+          ' "$RESPONSE" 2>/dev/null) || SAFE_RESPONSE='{}'
+          {
+            echo "## Algolia + ISR sync"
+            echo
+            echo "- Action: \`$ACTION\`"
+            echo "- Ref: \`$REF\`"
+            echo "- Mode: \`$MODE\`"
+            echo "- HTTP status: \`${HTTP_STATUS:-n/a}\`"
+            echo
+            echo "### Response (allowlisted fields)"
+            echo
+            echo '```json'
+            printf '%s\n' "$SAFE_RESPONSE"
+            echo '```'
+            if [ "$RC" -ne 0 ]; then
+              echo
+              echo "### Error"
+              echo
+              echo "The request failed. See the workflow logs for the full handler response; the step summary suppresses free-form error strings because this repository is public."
+            fi
+          } >> "$GITHUB_STEP_SUMMARY"
+          if [ "$RC" -ne 0 ]; then
+            exit "$RC"
+          fi
+
+  # Path 2: full Vercel rebuild. Only fires when docs/manifest.json
+  # changed, because manifest changes can introduce or remove routes
+  # that Next.js's `getStaticPaths` only re-evaluates on a full build.
+  # Markdown-only edits don't need this; ISR revalidate covers them.
+  vercel-rebuild:
+    runs-on: ubuntu-latest
+    needs: changes
+    if: needs.changes.outputs.manifest_changed == 'true'
+    steps:
+      - name: Trigger Vercel deploy hook
+        env:
+          HOOK: ${{ secrets.DEPLOY_DOCS_VERCEL_WEBHOOK }}
+        run: |
+          set -euo pipefail
+          if [ -z "${HOOK:-}" ]; then
+            echo "::error::DEPLOY_DOCS_VERCEL_WEBHOOK is not configured."
+            exit 1
+          fi
+          # Mirror the sibling job's pattern: capture response body and
+          # HTTP status, write the step summary unconditionally, then
+          # propagate failure. Without this, set -e would kill the
+          # script before the summary block on curl failure.
+          RESPONSE=$(mktemp)
+          RC=0
+          HTTP_STATUS=$(curl --fail-with-body -sS \
+            --connect-timeout 10 \
+            --max-time 120 \
+            -o "$RESPONSE" \
+            -w '%{http_code}' \
+            -X POST "$HOOK") || RC=$?
+          # Render only an allowlisted subset of the Vercel deploy hook
+          # response (job.id, job.state, job.createdAt). The deploy hook
+          # URL itself is the only secret in this flow; the response
+          # shape is bounded today, but we filter explicitly to insulate
+          # the public step summary from any future shape change
+          # upstream and to keep the two summary blocks consistent.
+          SAFE_RESPONSE=$(jq '
+            if type == "object" and (.job | type) == "object" then
+              { job: (.job | { id, state, createdAt } | with_entries(select(.value != null))) }
+            else
+              {}
+            end
+          ' "$RESPONSE" 2>/dev/null) || SAFE_RESPONSE='{}'
+          {
+            echo "## Vercel rebuild"
+            echo
+            echo "- Reason: \`docs/manifest.json\` changed"
+            echo "- HTTP status: \`${HTTP_STATUS:-n/a}\`"
+            echo
+            echo "### Response (allowlisted fields)"
+            echo
+            echo '```json'
+            printf '%s\n' "$SAFE_RESPONSE"
+            echo '```'
+            if [ "$RC" -ne 0 ]; then
+              echo
+              echo "### Error"
+              echo
+              echo "The request failed. See the workflow logs for the full hook response; the step summary suppresses free-form error strings because this repository is public."
+            fi
+          } >> "$GITHUB_STEP_SUMMARY"
+          if [ "$RC" -ne 0 ]; then
+            exit "$RC"
+          fi
diff --git a/.github/workflows/test-deploy-docs-diff.sh b/.github/workflows/test-deploy-docs-diff.sh
new file mode 100755
index 0000000000..f130f31e1b
--- /dev/null
+++ b/.github/workflows/test-deploy-docs-diff.sh
@@ -0,0 +1,291 @@
+#!/usr/bin/env bash
+# Regression tests for the NUL-delimited diff parser in deploy-docs.yaml.
+# The workflow runs `git diff --name-status -z` into $DIFF_FILE and feeds
+# the result through an awk script that emits <path>\t<status> lines.
+# jq then slurps those lines into a JSON array. This script exercises
+# the awk parser against synthetic NUL-delimited inputs so we can
+# verify path escaping, rename handling, and unknown-status-code
+# behavior without spinning up the full workflow.
+#
+# Keep `parse_diff` and `build_json_array` below in sync with
+# deploy-docs.yaml. The workflow comment "Tested in
+# test-deploy-docs-diff.sh" is the contract.
+#
+# Test inputs are passed to the parser as file paths (not via shell
+# variables) because bash strips NUL bytes from command substitutions
+# and parameter values. Each test writes its synthetic diff to a tmp
+# file before invoking the parser, which is also how the workflow
+# itself feeds the parser ($DIFF_FILE).
+
+set -euo pipefail
+
+TMPDIR_SELF="$(mktemp -d)"
+trap 'rm -rf "$TMPDIR_SELF"' EXIT
+
+# parse_diff replicates the awk block in deploy-docs.yaml so we can
+# exercise it without running the full workflow. Reads NUL-delimited
+# `git diff --name-status -z` output from $1 and emits
+# <path>\t<status> lines on stdout. Unknown status codes log a warning
+# to stderr and consume the path field so the record alignment stays
+# correct.
+parse_diff() {
+	awk -v RS='\0' '
+		function emit(path, status) {
+			printf "%s\t%s\n", path, status
+		}
+		{
+			code = substr($0, 1, 1)
+			if (code == "A") { getline; emit($0, "added"); next }
+			if (code == "M") { getline; emit($0, "modified"); next }
+			if (code == "T") { getline; emit($0, "modified"); next }
+			if (code == "D") { getline; emit($0, "deleted"); next }
+			if (code == "R") {
+				# R<score>\0<old>\0<new>\0
+				getline old_path
+				getline new_path
+				emit(new_path, "renamed")
+				next
+			}
+			if ($0 != "") {
+				unknown_code = $0
+				getline unknown_path
+				printf "::warning::Unknown git diff status %s for %s; skipping.\n", unknown_code, unknown_path > "/dev/stderr"
+			}
+		}
+	' "$1"
+}
+
+# build_json_array mirrors the jq slurp in deploy-docs.yaml. Reads
+# <path>\t<status> lines from $1 and emits a compact JSON array.
+build_json_array() {
+	jq -Rcn '
+		[ inputs
+			| split("\t")
+			| { path: .[0], status: .[1] }
+		]
+	' <"$1"
+}
+
+# write_nul_input writes a NUL-delimited diff to a fresh tmp file and
+# echoes the file path. Args become NUL-delimited records.
+write_nul_input() {
+	local f
+	f="$(mktemp -p "$TMPDIR_SELF")"
+	# Cannot use a single printf %s\0 list because bash's printf will
+	# happily emit literal NULs, but the surrounding command
+	# substitution does not strip NULs from file descriptors, only
+	# from variables. Write directly to the file.
+	local arg
+	for arg in "$@"; do
+		printf '%s\0' "$arg"
+	done >"$f"
+	printf '%s' "$f"
+}
+
+failures=0
+section=""
+
+start_section() {
+	section="$1"
+	echo
+	echo "--- $section ---"
+}
+
+assert_parse() {
+	local description="$1"
+	local input_file="$2"
+	local expected="$3"
+	local actual
+	actual="$(parse_diff "$input_file" 2>/dev/null)"
+	if [ "$actual" = "$expected" ]; then
+		echo "PASS: $description"
+	else
+		echo "FAIL: $description"
+		echo " expected: $(printf '%s' "$expected" | cat -A)"
+		echo " actual: $(printf '%s' "$actual" | cat -A)"
+		failures=$((failures + 1))
+	fi
+}
+
+assert_json() {
+	local description="$1"
+	local input_file="$2"
+	local expected="$3"
+	local parsed
+	parsed="$(mktemp -p "$TMPDIR_SELF")"
+	parse_diff "$input_file" 2>/dev/null >"$parsed"
+	local actual
+	actual="$(build_json_array "$parsed")"
+	if [ "$actual" = "$expected" ]; then
+		echo "PASS: $description"
+	else
+		echo "FAIL: $description"
+		echo " expected: $expected"
+		echo " actual: $actual"
+		failures=$((failures + 1))
+	fi
+}
+
+assert_warns() {
+	local description="$1"
+	local input_file="$2"
+	local needle="$3"
+	local stderr_out
+	stderr_out="$(parse_diff "$input_file" 2>&1 >/dev/null)"
+	if printf '%s' "$stderr_out" | grep -q -- "$needle"; then
+		echo "PASS: $description"
+	else
+		echo "FAIL: $description"
+		echo " needle: $needle"
+		echo " stderr: $stderr_out"
+		failures=$((failures + 1))
+	fi
+}
+
+assert_count_matches_emitter() {
+	# Verify count derivation cannot diverge from the emitter output.
+	# This is the structural guarantee DEREM-21 calls out: counter and
+	# emitter must agree by construction. Here that means
+	# `wc -l < parsed` always equals the number of <path>\t<status>
+	# lines emitted, even when the input contains unknown codes.
+	local description="$1"
+	local input_file="$2"
+	local expected_count="$3"
+	local actual_count
+	actual_count="$(parse_diff "$input_file" 2>/dev/null | wc -l | tr -d ' ')"
+	if [ "$actual_count" = "$expected_count" ]; then
+		echo "PASS: $description (count=$actual_count)"
+	else
+		echo "FAIL: $description"
+		echo " expected count: $expected_count"
+		echo " actual count: $actual_count"
+		failures=$((failures + 1))
+	fi
+}
+
+# ---------------------------------------------------------------
+start_section "Status codes (covers DEREM-3 awk rewrite)"
+# ---------------------------------------------------------------
+
+assert_parse "single added file" \
+	"$(write_nul_input 'A' 'docs/added.md')" \
+	$'docs/added.md\tadded'
+
+assert_parse "single modified file" \
+	"$(write_nul_input 'M' 'docs/modified.md')" \
+	$'docs/modified.md\tmodified'
+
+assert_parse "type-changed treated as modified" \
+	"$(write_nul_input 'T' 'docs/typechange.md')" \
+	$'docs/typechange.md\tmodified'
+
+assert_parse "single deleted file" \
+	"$(write_nul_input 'D' 'docs/deleted.md')" \
+	$'docs/deleted.md\tdeleted'
+
+assert_parse "rename indexes the new path" \
+	"$(write_nul_input 'R100' 'docs/old.md' 'docs/new.md')" \
+	$'docs/new.md\trenamed'
+
+assert_parse "multiple mixed records" \
+	"$(write_nul_input 'A' 'docs/a.md' 'M' 'docs/b.md' 'D' 'docs/c.md')" \
+	$'docs/a.md\tadded\ndocs/b.md\tmodified\ndocs/c.md\tdeleted'
+
+assert_parse "rename interleaved with simple records" \
+	"$(write_nul_input 'A' 'docs/a.md' 'R85' 'docs/old.md' 'docs/new.md' 'D' 'docs/c.md')" \
+	$'docs/a.md\tadded\ndocs/new.md\trenamed\ndocs/c.md\tdeleted'
+
+empty_file="$(mktemp -p "$TMPDIR_SELF")"
+: >"$empty_file"
+assert_parse "empty input emits nothing" "$empty_file" ""
+
+# ---------------------------------------------------------------
+start_section "Path escaping (covers DEREM-2 path-injection rewrite)"
+# ---------------------------------------------------------------
+
+assert_parse "path with spaces survives" \
+	"$(write_nul_input 'M' 'docs/file with space.md')" \
+	$'docs/file with space.md\tmodified'
+
+assert_parse "path with double quote survives raw" \
+	"$(write_nul_input 'M' 'docs/quote".md')" \
+	$'docs/quote".md\tmodified'
+
+assert_parse "path with backslash survives raw" \
+	"$(write_nul_input 'M' 'docs/back\slash.md')" \
+	$'docs/back\\slash.md\tmodified'
+
+# Tab inside a path: the parser is line-based, so a tab character
+# inside the path field will be preserved verbatim through awk; jq's
+# split on tab then turns this into a multi-element array. We don't
+# defend against this at the parser layer because real-world doc paths
+# never contain tabs and git would normally quote-escape them anyway.
+# Capture the current behavior so a future change is visible.
+assert_parse "tab in path preserved raw by parser" \
+	"$(write_nul_input 'M' $'docs/has\ttab.md')" \
+	$'docs/has\ttab.md\tmodified'
+
+assert_json "jq escapes double quote in JSON output" \
+	"$(write_nul_input 'M' 'docs/quote".md')" \
+	'[{"path":"docs/quote\".md","status":"modified"}]'
+
+assert_json "jq escapes backslash in JSON output" \
+	"$(write_nul_input 'M' 'docs/back\slash.md')" \
+	'[{"path":"docs/back\\slash.md","status":"modified"}]'
+
+assert_json "jq emits empty array for empty input" "$empty_file" "[]"
+
+# ---------------------------------------------------------------
+start_section "Unknown status codes (DEREM-21 structural guarantee)"
+# ---------------------------------------------------------------
+
+# This is the exact case the reviewer reproduced. Old design diverged:
+# counter awk said 2, emitter awk said 1. New design has a single awk
+# whose output is the source of truth for both.
+assert_parse "unknown code consumes its path, valid record after is preserved" \
+	"$(write_nul_input 'X' 'docs/a.md' 'M' 'docs/real.md')" \
+	$'docs/real.md\tmodified'
+
+assert_warns "unknown code emits a workflow warning" \
+	"$(write_nul_input 'X' 'docs/a.md' 'M' 'docs/real.md')" \
+	'::warning::Unknown git diff status X for docs/a.md'
+
+assert_count_matches_emitter "count matches emitter when an unknown code is skipped" \
+	"$(write_nul_input 'X' 'docs/a.md' 'M' 'docs/real.md')" \
+	"1"
+
+assert_count_matches_emitter "count matches emitter for a clean batch" \
+	"$(write_nul_input 'A' 'docs/a.md' 'M' 'docs/b.md' 'D' 'docs/c.md')" \
+	"3"
+
+assert_count_matches_emitter "rename counts as one record, not two" \
+	"$(write_nul_input 'R100' 'docs/old.md' 'docs/new.md')" \
+	"1"
+
+assert_count_matches_emitter "all unknown produces zero" \
+	"$(write_nul_input 'X' 'docs/a.md' 'Y' 'docs/b.md')" \
+	"0"
+
+# ---------------------------------------------------------------
+start_section "Sanity checks"
+# ---------------------------------------------------------------
+
+# 50-file boundary at the parser layer. The cap-at-50 decision lives
+# above this parser in the workflow, but the parser must handle the
+# boundary input correctly regardless.
+big_input="$(mktemp -p "$TMPDIR_SELF")"
+{
+	for i in $(seq 1 50); do
+		printf 'M\0docs/big-%02d.md\0' "$i"
+	done
+} >"$big_input"
+assert_count_matches_emitter "50 records parse to 50 lines" "$big_input" "50"
+
+if [ "$failures" -gt 0 ]; then
+	echo
+	echo "$failures test(s) failed."
+	exit 1
+fi
+
+echo
+echo "All tests passed."