From c1823b5f62aac6e158dc6c9143b625c993b0e11f Mon Sep 17 00:00:00 2001
From: Your Name
Date: Thu, 2 Jul 2026 01:20:35 +0800
Subject: [PATCH] fix(ops): preserve controlled drain lane staging

---
 docs/LOGBOOK.md                               |  25 +++
 docs/runbooks/FULL-STACK-COLD-START-SOP.md    |   4 +-
 docs/runbooks/REBOOT-RECOVERY-SOP.md          |  15 +-
 .../enforce-110-runner-failclosed.sh          | 151 +++++++++++++++++-
 .../test_cold_start_monitor_bounded_probes.py |  42 +++++
 5 files changed, 233 insertions(+), 4 deletions(-)

diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index c7b6a029..3e664e84 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -52380,6 +52380,31 @@ production browser smoke:
 **Next steps**:
 - After commit / push, read back the new Gitea CD; the goal is for controlled-runtime to skip B5 directly and no longer require the Docker socket because of cold-start metadata/script changes.
 
+## 2026-07-02 — P0 110 controlled CD lane fail-closed enforcer staging fix
+
+**Completed**:
+- Fixed `scripts/reboot-recovery/enforce-110-runner-failclosed.sh`: legacy / generic runners and the primary `awoooi-cd-lane.service` remain fail-closed, but when the non-secret staging artifacts of `awoooi-cd-lane-drain.service` pass the guardrails, the enforcer no longer indiscriminately seals the config / binary / unit back to a stub.
+- Added the `controlled_drain_staging_allowed` check: it requires `capacity <= 1`, only the labels `awoooi-host:host` and `awoooi-ubuntu:docker://192.168.0.110:5000/awoooi/ci-runner:act-22.04`, an executable ELF binary, a unit with the registration condition, CPU / Memory / Tasks accounting and limits, and `NoNewPrivileges=true`, plus a service that is inactive, has `MainPID=0`, and is neither enabled nor masked.
+- `seal_live_binary_paths`, `seal_lane_binary_restore_sources`, `quarantine_lane_registration_sources`, `seal_lane_unit_files`, and `unit_ok` now all respect controlled drain staging; also added the readback / textfile metric `CONTROLLED_DRAIN_STAGING_ALLOWED`.
+- `ops/runner/awoooi-cd-lane-drain.service` now includes `CPUAccounting=true`, `MemoryAccounting=true`, and `TasksAccounting=true`, so the source unit matches the verifier guardrails.
+- `docs/runbooks/FULL-STACK-COLD-START-SOP.md` and `docs/runbooks/REBOOT-RECOVERY-SOP.md` now record this lesson: if the config / binary is moved away again, check the enforcer staging guardrail first instead of manually re-adding the artifacts over and over.
+
+**Local verification results**:
+- `bash -n scripts/reboot-recovery/enforce-110-runner-failclosed.sh ops/runner/check-awoooi-110-controlled-cd-lane-readiness.sh scripts/reboot-recovery/awoooi-startup-110.sh`: passed.
+- `python3.11 -m pytest scripts/reboot-recovery/tests/test_cold_start_monitor_bounded_probes.py -q`: `16 passed`.
+- `python3.11 -m pytest ops/runner/test_check_awoooi_110_controlled_cd_lane_readiness.py -q`: `5 passed`.
+- `python3 ops/runner/guard-gitea-runner-pressure.py --root .`: passed, `auto_branch_events_on_110=0`, `generic_runner_labels=0`.
+- `git diff --check`: passed.
+
+**Still in effect**:
+- No secret / token / `.env` / raw sessions / SQLite / auth was read; the contents of `.runner` were not read.
+- GitHub / gh / GitHub API / GitHub Actions were not used.
+- No host reboot; no Docker / Nginx / K3s / DB / firewall restart; no workflow_dispatch; no DROP / TRUNCATE / restore / prune.
+- No runner was registered and `awoooi-cd-lane-drain.service` was not started; runner registration must still go through the token-safe path that neither prints the token nor reads the contents of `.runner`.
+
+**Next steps**:
+- After commit / push to Gitea main, read back CD; then sync the new enforcer to 110 in a controlled manner and rerun the non-secret guardrail apply and `check-awoooi-110-controlled-cd-lane-readiness.sh`. The goal is for active blockers to converge to registration / service inactive, with no more false blockers caused by the enforcer re-sealing the config / binary / unit.
+
 ## 2026-07-01 — 08:50 P0 188 DB circuit breaker post-push readback
 
 **Completed**:
diff --git a/docs/runbooks/FULL-STACK-COLD-START-SOP.md b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
index e4db66e7..b9889a51 100644
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -1,6 +1,6 @@
 # AWOOOI Full-Stack Cold Start and Host Reboot SOP
 
-> Version: v1.91
+> Version: v1.92
 > Last updated: 2026-07-02 Asia/Taipei
 > Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.
 
@@ -20,6 +20,8 @@ v1.80 / v1.81 credential escrow intake scorecard rule: same-round owner response
 
 2026-07-02 110 control-path / Harbor recovery receipt rule: if the Gitea Harbor repair queue still retains `harbor_110_remote_ssh_publickey_auth_stalled`, remote-control unavailable, jobs stale, or historical failure, but local evidence from the same round also proves that the `wooo` command path is ready, the 110 local Harbor `/v2/` is ready, and the public/internal registry `/v2/` returns `401`, then that Gitea Harbor repair failure may only be listed as historical queue metadata and must no longer be treated as a current SSH blocker. Use `/api/v1/agents/harbor-registry-controlled-recovery-receipt` or an equivalent validator to merge `diagnose-110-ssh-publickey-auth.sh`, `recover-110-control-path-and-harbor-local.sh --check`, the public Gitea queue readback, and the registry `/v2/` verifier, and write the machine-readable result to a snapshot of the form `docs/operations/harbor-110-control-path-recovery-readback-2026-07-02.snapshot.json`. The 2026-07-02 live receipt shows: public/internal registry `/v2/` both return `401`, the latest visible CD `#4335` is `Success`, and the Gitea Harbor repair failure is already `historical_after_latest_cd_success=true`; active blockers have converged to the 110 controlled CD lane config / binary / registration / service guardrails, active action container pressure, and the Gitea CD jobs head-SHA / stale readback mismatch. If the local-console output contains only the `AWOOOI_110_CONTROLLED_CD_LANE_READY` marker, the non110 runner parser must not derive a non110 blocker from 110 `BLOCKER` lines; non110 may be listed as an active blocker only when the `AWOOOI_NON110_RUNNER_READY` marker is seen.
 
+2026-07-02 110 controlled CD lane fail-closed enforcer staging rule: after the 110 runner pressure incident, legacy / generic runners must remain fail-closed; however, the non-secret staging artifacts of `awoooi-cd-lane-drain.service` must no longer be indiscriminately sealed back to a stub by the enforcer. `scripts/reboot-recovery/enforce-110-runner-failclosed.sh` may preserve the drain config / binary / unit, and output `CONTROLLED_DRAIN_STAGING_ALLOWED=1` plus the textfile metric, only when `config.yaml` satisfies `capacity <= 1` and contains only `awoooi-host:host` and `awoooi-ubuntu:docker://192.168.0.110:5000/awoooi/ci-runner:act-22.04`, the binary is an executable ELF, the systemd unit carries guardrails such as `ConditionPathExists=/home/wooo/awoooi-cd-lane-drain/data/.runner`, `CPUAccounting` / `MemoryAccounting` / `TasksAccounting` / `NoNewPrivileges`, and the service is `inactive`, has `MainPID=0`, and is neither enabled nor masked. This staging rule must not read the token, must not read the contents of `.runner`, must not register a runner, and must not start the service; if registration is missing, the readiness verifier must still leave only blockers of the `controlled_cd_lane_registration_missing` / `controlled_cd_lane_service_not_active` kind. If `CONTROLLED_DRAIN_STAGING_ALLOWED=0` and the config / binary is moved away again, fix the source enforcer / unit guardrail first; do not manually re-add the same set of artifacts over and over.
+
 2026-07-01 23:00 latest live summary: core cold-start has converged from degraded to GREEN, but DR complete or up-to-date MOMO sales data still must not be declared. The `full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1` artifact `/tmp/awoooi-cold-start-source-gate-20260701-225720.log` returned `PASS=96 WARN=0 BLOCKED=0` and `Result: GREEN`. MOMO daily staleness no longer counts as a core cold-start warning when `MOMO_DAILY_FRESHNESS 7|2026-06-24` holds together with no-newer-source evidence: cold-start outputs `OK 188 momo daily sales source gate has no newer Drive candidate` and `INFO 188 momo daily sales data remains stale; product data freshness is pending source arrival`; this means host/service recovery is complete, but product data freshness remains at the source-arrival gate and must be updated by the original import pipeline once the official Drive source arrives, never by a fake manual DB update. The 110 live monitor has been synced, with `/home/wooo/scripts/full-stack-cold-start-check.sh` hash `6115f73002b7e5b0fc46a031a2e7e9049d68abfcc8110f638e975218792c468e`; the 110 textfile reads back `awoooi_cold_start_monitor_up=1`, `pass=96`, `warn=0`, `blocked=0`, `last_exit_code=0`, `last_result{result="green"}=1`, `last_run_duration_seconds=26`. `verify-cold-start-monitor-deploy.sh` returned `COLD_START_MONITOR_DEPLOY_PARITY_OK`, with runtime state `monitor_up=1 warn=0 blocked=0 green=1 blocked_state=0` and ColdStart alerts `0`. `full-stack-recovery-scorecard.sh` returned `CORE_COLD_START_GREEN=1`, `CORE_COLD_START_WARN_GATES=0`, `CORE_COLD_START_BLOCKED_GATES=0`, `CORE_COLD_START_FIRING_ALERTS=0`, `CORE_COLD_START_DEPLOY_PARITY=1`, `CORE_REGISTRY_READY=1`, `DR_OFFSITE_EVIDENCE_READBACK=1`, `ESCROW_MISSING_COUNT=5`, `NEXT_STEP=complete_credential_escrow_review`, `RECOVERY_STATE=CORE_READY_DR_OFFSITE_PENDING`. Allowed declaration: 110 / 120 / 121 / 188 core cold-start service recovery GREEN; public routes / AWOOOI service / Gitea / Harbor registry / K3s / Stock public route / 188 backup-from-110 / 110 awoooi_db freshness have recovered. Forbidden declaration: DR complete, credential escrow complete, MOMO data up to date, 110 SSH permanently stable, or masking source freshness with fake data or manual DB writes.
 
 2026-07-01 21:32 previous live summary: the false cold-start WARNs have converged and hard blockers remain at `0`, but full green or completion of the 10-minute fully automatic recovery still must not be declared. The `full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1` final artifact `/tmp/awoooi-cold-start-final-20260701-212632.log` returned `PASS=95 WARN=1 BLOCKED=0`; public routes / TLS all passed, the StockPlatform `502` around 21:22 was confirmed to be web/admin/edge replacement warmup, 5 consecutive external requests to `https://stock.wooo.work/` returned `200`, and the final cold-start also returned `stock 200`. K3s `BAD_PODS=2` was likewise a transient rollout state; 6 consecutive read-only observations found no pods outside Running/Completed, with final `BAD_PODS 0`. MOMO current-month `0|0|-|-|-|-` is no longer listed as a WARN: `momo-drive-token-source-recovery-preflight.sh` outputs `MOMO_LATEST_IMPORT_CLEAN` and `MOMO_SOURCE_ABSENT_WITHOUT_NEWER_DRIVE`, and when cold-start reads a latest clean import and Drive has no newer source candidate, it judges the current-month sync as not applicable. 110 backup current health is also no longer pushed down to WARN by the old aggregate log: `BACKUP_HEALTH_110 total=13 stale=0 missing_cron=0 missing_script=0 failed_count=5 config_failed=0 integrity_total=2 integrity_stale=0` means current component freshness / critical config / integrity are OK; `failed_count=5` is kept as INFO evidence until the next full `backup-all` naturally overwrites it. The live 110 monitor has been synced, with hashes `full-stack-cold-start-check.sh=d0711f75dfb1ee680442c9d6cf2191741f3b27605f347c9ef2a25a4fed6d40ac` and `momo-drive-token-source-recovery-preflight.sh=571d75e81c509683eb8a38fabbe81fc7822befe45206145f4fb4e865473f5254`; the 110 textfile reads back `awoooi_cold_start_monitor_up=1`, `pass=95`, `warn=1`, `blocked=0`, `last_exit_code=1`, `last_result{result="degraded"}=1`. `verify-cold-start-monitor-deploy.sh` returned `COLD_START_MONITOR_DEPLOY_PARITY_OK`; `full-stack-recovery-scorecard.sh` returned `CORE_COLD_START_WARN_GATES=1`, `CORE_COLD_START_BLOCKED_GATES=0`, `CORE_COLD_START_DEPLOY_PARITY=1`, `CORE_REGISTRY_READY=1`, `DR_OFFSITE_EVIDENCE_READBACK=1`, `ESCROW_MISSING_COUNT=5`, `NEXT_STEP=complete_credential_escrow_review`. Allowed declaration: public routes / AWOOOI service / Gitea / Harbor registry / 188 backup-from-110 / 110 awoooi_db freshness / K3s rollout / Stock public route have recovered, with cold-start hard blockers at `0`. Forbidden declaration: full green, 10-minute fully automatic recovery complete, DR complete, credential escrow complete, MOMO data up to date, 110 SSH permanently stable. The only cold-start WARN is MOMO daily data freshness: `MOMO_DAILY_FRESHNESS 7|2026-06-24`, with no new candidates in the Drive intake / failed folder; this must go through the source-arrival / formal import gate and must not be masked with fake data or manual DB writes.
diff --git a/docs/runbooks/REBOOT-RECOVERY-SOP.md b/docs/runbooks/REBOOT-RECOVERY-SOP.md
index a0aaf35e..258f144c 100644
--- a/docs/runbooks/REBOOT-RECOVERY-SOP.md
+++ b/docs/runbooks/REBOOT-RECOVERY-SOP.md
@@ -1,9 +1,9 @@
 # AWOOOI Reboot Recovery SOP
 
-> **Version**: v5.1
+> **Version**: v5.2
 > **Last updated**: 2026-07-02 (Taipei time)
 > **Updated by**: Codex
-> **Trigger event**: 110 control-path / Harbor recovery receipt and Gitea stale queue blocker convergence
+> **Trigger event**: 110 control-path / Harbor recovery receipt, Gitea stale queue blocker, and controlled CD lane fail-closed enforcer staging convergence
 
 ---
 
@@ -105,6 +105,17 @@
 Git push → Gitea(110:3001) → Build Docker image → Harbor(:5000) → kubectl → K3s pods
 ```
 
+### 110 Controlled CD Lane and Fail-Closed Enforcer
+
+After the 110 runner pressure incident, `awoooi-cd-lane.service`, legacy runners, generic labels, and heavy runners must remain fail-closed. The only 110 CD entrypoint that may be preserved is the non-secret staging state of the dedicated `awoooi-cd-lane-drain.service`, and it must satisfy all of the following conditions:
+
+- `config.yaml` allows only `capacity <= 1`, `awoooi-host:host`, and `awoooi-ubuntu:docker://192.168.0.110:5000/awoooi/ci-runner:act-22.04`.
+- `awoooi_cd_lane_controlled` must be an executable ELF and must not be a fail-closed shell stub.
+- The unit must contain `ConditionPathExists=/home/wooo/awoooi-cd-lane-drain/data/.runner`, CPU / Memory / Tasks accounting and limits, and `NoNewPrivileges=true`.
+- During staging the service must be inactive, with `MainPID=0`, and neither enabled nor masked; the runner token registration and the contents of `.runner` still must not be read or printed.
+
+For readback, always run `scripts/reboot-recovery/enforce-110-runner-failclosed.sh --check` first and check `CONTROLLED_DRAIN_STAGING_ALLOWED`; then run `ops/runner/check-awoooi-110-controlled-cd-lane-readiness.sh` and check `CONFIG_READY`, `BINARY_READY`, `REGISTRATION_READY`, and `SERVICE_READY`. If only registration / service inactive blockers remain, the non-secret guardrails have converged and only then is the token-safe registration path the next step; manually re-adding the config / binary over and over must not replace fixing the source enforcer.
+
 ### Key dependencies
 
 | Service | Key dependency | If the dependency fails |
diff --git a/scripts/reboot-recovery/enforce-110-runner-failclosed.sh b/scripts/reboot-recovery/enforce-110-runner-failclosed.sh
index 2471cfc0..4f1a17f4 100755
--- a/scripts/reboot-recovery/enforce-110-runner-failclosed.sh
+++ b/scripts/reboot-recovery/enforce-110-runner-failclosed.sh
@@ -96,6 +96,13 @@ LIVE_BINARY_PATHS=(
   "/home/wooo/awoooi-controlled-runner/awoooi_controlled_runner"
 )
 
+CONTROLLED_DRAIN_UNIT="${CONTROLLED_DRAIN_UNIT:-awoooi-cd-lane-drain.service}"
+CONTROLLED_DRAIN_DIR="${CONTROLLED_DRAIN_DIR:-/home/wooo/awoooi-cd-lane-drain}"
+CONTROLLED_DRAIN_BINARY="${CONTROLLED_DRAIN_BINARY:-$CONTROLLED_DRAIN_DIR/awoooi_cd_lane_controlled}"
+CONTROLLED_DRAIN_CONFIG="${CONTROLLED_DRAIN_CONFIG:-$CONTROLLED_DRAIN_DIR/config.yaml}"
+CONTROLLED_DRAIN_REGISTRATION="${CONTROLLED_DRAIN_REGISTRATION:-$CONTROLLED_DRAIN_DIR/data/.runner}"
+CONTROLLED_DRAIN_MAX_CAPACITY="${CONTROLLED_DRAIN_MAX_CAPACITY:-1}"
+
 as_root() {
   if [ "${EUID:-$(id -u)}" -eq 0 ]; then
     "$@"
@@ -137,6 +144,124 @@ count_runner_processes() {
   pgrep -f '^/home/wooo/act-runner/act_runner|^/home/wooo/act-runner-controlled/act_runner|^/home/wooo/awoooi-controlled-runner/awoooi_controlled_runner|Runner.Listener|Runner.Worker' 2>/dev/null | wc -l | tr -d ' '
 }
 
+extract_runner_capacity() {
+  local config_path="$1"
+  awk '
+    /^runner:[[:space:]]*$/ {
+      in_runner=1
+      next
+    }
+    in_runner && /^[^[:space:]]/ && $0 !~ /^runner:[[:space:]]*$/ {
+      in_runner=0
+    }
+    in_runner && /^[[:space:]]*capacity:[[:space:]]*/ {
+      line=$0
+      sub(/^[[:space:]]*capacity:[[:space:]]*/, "", line)
+      gsub(/["'\'']/, "", line)
+      print line
+      exit
+    }
+  ' "$config_path"
+}
+
+extract_runner_labels() {
+  local config_path="$1"
+  awk '
+    /^[[:space:]]*labels:[[:space:]]*$/ {
+      in_labels=1
+      next
+    }
+    in_labels && /^[[:space:]]*-[[:space:]]*/ {
+      line=$0
+      sub(/^[[:space:]]*-[[:space:]]*"/, "", line)
+      sub(/^[[:space:]]*-[[:space:]]*/, "", line)
+      sub(/"[[:space:]]*$/, "", line)
+      print line
+      next
+    }
+    in_labels && /^[^[:space:]]/ {
+      in_labels=0
+    }
+  ' "$config_path"
+}
+
+label_name() {
+  printf '%s' "${1%%:*}"
+}
+
+controlled_drain_config_safe() {
+  local capacity labels label name has_host=0 has_ubuntu=0
+  [ -r "$CONTROLLED_DRAIN_CONFIG" ] || return 1
+  capacity="$(extract_runner_capacity "$CONTROLLED_DRAIN_CONFIG" | head -1)"
+  printf '%s' "${capacity:-}" | grep -Eq '^[0-9]+$' || return 1
+  [ "$capacity" -le "$CONTROLLED_DRAIN_MAX_CAPACITY" ] || return 1
+  labels="$(extract_runner_labels "$CONTROLLED_DRAIN_CONFIG" || true)"
+  [ -n "$labels" ] || return 1
+  while IFS= read -r label; do
+    [ -n "$label" ] || continue
+    name="$(label_name "$label")"
+    case "$name" in
+      awoooi-host)
+        [ "$label" = "awoooi-host:host" ] || return 1
+        has_host=1
+        ;;
+      awoooi-ubuntu)
+        [ "$label" = "awoooi-ubuntu:docker://192.168.0.110:5000/awoooi/ci-runner:act-22.04" ] || return 1
+        has_ubuntu=1
+        ;;
+      ubuntu-latest|ubuntu-*|self-hosted|stockplatform*|stock-platform*|headless*|playwright*)
+        return 1
+        ;;
+      *)
+        return 1
+        ;;
+    esac
+  done <<<"$labels"
+  [ "$has_host" -eq 1 ] && [ "$has_ubuntu" -eq 1 ]
+}
+
+controlled_drain_binary_safe() {
+  local kind
+  [ -f "$CONTROLLED_DRAIN_BINARY" ] && [ -x "$CONTROLLED_DRAIN_BINARY" ] || return 1
+  kind="$(file -b "$CONTROLLED_DRAIN_BINARY" 2>/dev/null || echo missing)"
+  grep -qi 'ELF' <<<"$kind"
+}
+
+controlled_drain_unit_safe() {
+  local text
+  text="$(systemctl cat "$CONTROLLED_DRAIN_UNIT" 2>/dev/null || true)"
+  [ -n "$text" ] || return 1
+  grep -Fq -- "ConditionPathExists=$CONTROLLED_DRAIN_REGISTRATION" <<<"$text" || return 1
+  grep -Fq -- "$CONTROLLED_DRAIN_BINARY daemon --config $CONTROLLED_DRAIN_CONFIG" <<<"$text" || return 1
+  grep -Eq '^[[:space:]]*CPUAccounting=true' <<<"$text" || return 1
+  grep -Eq '^[[:space:]]*CPUQuota=' <<<"$text" || return 1
+  grep -Eq '^[[:space:]]*MemoryAccounting=true' <<<"$text" || return 1
+  grep -Eq '^[[:space:]]*Memory(High|Max)=' <<<"$text" || return 1
+  grep -Eq '^[[:space:]]*TasksAccounting=true' <<<"$text" || return 1
+  grep -Eq '^[[:space:]]*TasksMax=' <<<"$text" || return 1
+  grep -Eq '^[[:space:]]*NoNewPrivileges=true' <<<"$text" || return 1
+}
+
+controlled_drain_service_inactive() {
+  local load active unitfile mainpid
+  load="$(systemctl show "$CONTROLLED_DRAIN_UNIT" -p LoadState --value 2>/dev/null || true)"
+  active="$(systemctl show "$CONTROLLED_DRAIN_UNIT" -p ActiveState --value 2>/dev/null || true)"
+  unitfile="$(systemctl show "$CONTROLLED_DRAIN_UNIT" -p UnitFileState --value 2>/dev/null || true)"
+  mainpid="$(systemctl show "$CONTROLLED_DRAIN_UNIT" -p MainPID --value 2>/dev/null || true)"
+  { [ "$active" = "inactive" ] || [ "$active" = "failed" ] || [ "$active" = "unknown" ] || [ -z "$active" ]; } || return 1
+  [ "${mainpid:-0}" = "0" ] || return 1
+  [ "$load" != "masked" ] || return 1
+  [ "$unitfile" != "masked" ] || return 1
+  [ "$unitfile" != "enabled" ] || return 1
+}
+
+controlled_drain_staging_allowed() {
+  controlled_drain_config_safe \
+    && controlled_drain_binary_safe \
+    && controlled_drain_unit_safe \
+    && controlled_drain_service_inactive
+}
+
 list_action_runner_units() {
   {
     systemctl list-unit-files 'actions.runner.*' --no-legend --plain 2>/dev/null | awk '{print $1}'
@@ -147,6 +272,11 @@ list_action_runner_units() {
 stop_and_mask_units() {
   local unit
   for unit in "${RUNNER_UNITS[@]}"; do
+    if [ "$unit" = "$CONTROLLED_DRAIN_UNIT" ] && controlled_drain_staging_allowed; then
+      as_root systemctl reset-failed "$unit" >/dev/null 2>&1 || true
+      as_root systemctl disable "$unit" >/dev/null 2>&1 || true
+      continue
+    fi
     as_root systemctl kill --signal=SIGKILL "$unit" >/dev/null 2>&1 || true
     as_root systemctl stop "$unit" >/dev/null 2>&1 || true
     as_root systemctl reset-failed "$unit" >/dev/null 2>&1 || true
@@ -218,6 +348,9 @@ seal_lane_binary_restore_sources() {
   local path
   while IFS= read -r -d '' path; do
     [ -e "$path" ] || continue
+    if [ "$path" = "$CONTROLLED_DRAIN_BINARY" ] && controlled_drain_staging_allowed; then
+      continue
+    fi
     write_failclosed_stub "$path"
   done < <(
     {
@@ -234,6 +367,9 @@ quarantine_lane_registration_sources() {
   local target
   for lane_dir in "/home/wooo/awoooi-cd-lane" "/home/wooo/awoooi-cd-lane-drain"; do
     [ -d "$lane_dir" ] || continue
+    if [ "$lane_dir" = "$CONTROLLED_DRAIN_DIR" ] && controlled_drain_staging_allowed; then
+      continue
+    fi
     quarantine_dir="$lane_dir/quarantine-failclosed-${STAMP}"
     as_root chattr -i "$lane_dir" "$lane_dir/data" >/dev/null 2>&1 || true
     as_root mkdir -p "$quarantine_dir" >/dev/null 2>&1 || true
@@ -257,6 +393,9 @@ quarantine_lane_registration_sources() {
 seal_live_binary_paths() {
   local path
   for path in "${LIVE_BINARY_PATHS[@]}"; do
+    if [ "$path" = "$CONTROLLED_DRAIN_BINARY" ] && controlled_drain_staging_allowed; then
+      continue
+    fi
     write_failclosed_stub "$path"
   done
 }
@@ -666,7 +805,10 @@ mask_unit_file_to_devnull() {
 
 seal_lane_unit_files() {
   mask_unit_file_to_devnull "awoooi-cd-lane.service"
-  mask_unit_file_to_devnull "awoooi-cd-lane-drain.service"
+  if controlled_drain_staging_allowed; then
+    return 0
+  fi
+  mask_unit_file_to_devnull "$CONTROLLED_DRAIN_UNIT"
 }
 
 root_restore_sources_left() {
@@ -680,6 +822,9 @@ root_restore_sources_left() {
 unit_ok() {
   local unit="$1"
   local load active unitfile mainpid
+  if [ "$unit" = "$CONTROLLED_DRAIN_UNIT" ] && controlled_drain_staging_allowed; then
+    return 0
+  fi
   load="$(systemctl show "$unit" -p LoadState --value 2>/dev/null || true)"
   active="$(systemctl show "$unit" -p ActiveState --value 2>/dev/null || true)"
   unitfile="$(systemctl show "$unit" -p UnitFileState --value 2>/dev/null || true)"
@@ -729,6 +874,9 @@ awoooi_runner_failclosed_enforcer_root_restore_sources_left $(root_restore_sourc
 # HELP awoooi_runner_failclosed_enforcer_apply_performed Whether this run used apply mode.
 # TYPE awoooi_runner_failclosed_enforcer_apply_performed gauge
 awoooi_runner_failclosed_enforcer_apply_performed $APPLY_PERFORMED
+# HELP awoooi_runner_failclosed_enforcer_controlled_drain_staging_allowed Controlled drain lane non-secret guardrail staging allowance.
+# TYPE awoooi_runner_failclosed_enforcer_controlled_drain_staging_allowed gauge
+awoooi_runner_failclosed_enforcer_controlled_drain_staging_allowed $(controlled_drain_staging_allowed && echo 1 || echo 0)
 EOF
   as_root install -o root -g root -m 0644 "$tmp" "$dir/awoooi_runner_failclosed_enforcer.prom" >/dev/null 2>&1 || true
   rm -f "$tmp"
@@ -743,6 +891,7 @@ print_readback() {
   echo "LANE_PROCESS_COUNT=$(count_lane_processes)"
   echo "RUNNER_PROCESS_COUNT=$(count_runner_processes)"
   echo "ROOT_RESTORE_SOURCES_LEFT=$(root_restore_sources_left)"
+  echo "CONTROLLED_DRAIN_STAGING_ALLOWED=$(controlled_drain_staging_allowed && echo 1 || echo 0)"
   echo "RUNNER_UNITS_BAD_COUNT=$(runner_units_bad_count)"
   for unit in "${RUNNER_UNITS[@]}"; do
     load="$(systemctl show "$unit" -p LoadState --value 2>/dev/null || true)"
diff --git a/scripts/reboot-recovery/tests/test_cold_start_monitor_bounded_probes.py b/scripts/reboot-recovery/tests/test_cold_start_monitor_bounded_probes.py
index a8bff783..1ea5c0fe 100644
--- a/scripts/reboot-recovery/tests/test_cold_start_monitor_bounded_probes.py
+++ b/scripts/reboot-recovery/tests/test_cold_start_monitor_bounded_probes.py
@@ -22,6 +22,7 @@ REPAIR_STARTUP_STUB = (
 FAILCLOSED_ENFORCER = (
     ROOT / "scripts" / "reboot-recovery" / "enforce-110-runner-failclosed.sh"
 )
+CONTROLLED_CD_LANE_DRAIN_UNIT = ROOT / "ops" / "runner" / "awoooi-cd-lane-drain.service"
 SSH_AUTH_DIAGNOSE = (
     ROOT / "scripts" / "reboot-recovery" / "diagnose-110-ssh-publickey-auth.sh"
 )
@@ -206,6 +207,47 @@ def test_runner_failclosed_enforcer_does_not_seal_live_startup_recovery_script()
     assert "awoooi-startup-110.sh.*controlled*" in text
 
 
+def test_runner_failclosed_enforcer_preserves_controlled_drain_staging_only() -> None:
+    text = FAILCLOSED_ENFORCER.read_text(encoding="utf-8")
+
+    assert "controlled_drain_staging_allowed()" in text
+    assert "controlled_drain_config_safe" in text
+    assert "controlled_drain_binary_safe" in text
+    assert "controlled_drain_unit_safe" in text
+    assert "controlled_drain_service_inactive" in text
+    assert "awoooi-host:host" in text
+    assert (
+        "awoooi-ubuntu:docker://192.168.0.110:5000/awoooi/ci-runner:act-22.04"
+        in text
+    )
+    assert "ubuntu-latest|ubuntu-*|self-hosted|stockplatform*|stock-platform*|headless*|playwright*)" in text
+    assert 'grep -Fq -- "ConditionPathExists=$CONTROLLED_DRAIN_REGISTRATION"' in text
+    assert 'grep -Eq \'^[[:space:]]*CPUAccounting=true\'' in text
+    assert 'grep -Eq \'^[[:space:]]*MemoryAccounting=true\'' in text
+    assert 'grep -Eq \'^[[:space:]]*TasksAccounting=true\'' in text
+    assert '[ "$unitfile" != "enabled" ] || return 1' in text
+    assert 'if [ "$unit" = "$CONTROLLED_DRAIN_UNIT" ] && controlled_drain_staging_allowed; then' in text
+    assert 'if [ "$path" = "$CONTROLLED_DRAIN_BINARY" ] && controlled_drain_staging_allowed; then' in text
+    assert 'if [ "$lane_dir" = "$CONTROLLED_DRAIN_DIR" ] && controlled_drain_staging_allowed; then' in text
+    assert "CONTROLLED_DRAIN_STAGING_ALLOWED=" in text
+
+
+def test_controlled_cd_lane_unit_source_has_required_accounting_guardrails() -> None:
+    text = CONTROLLED_CD_LANE_DRAIN_UNIT.read_text(encoding="utf-8")
+
+    assert "ConditionPathExists=/home/wooo/awoooi-cd-lane-drain/data/.runner" in text
+    assert "CPUAccounting=true" in text
+    assert "CPUQuota=250%" in text
+    assert "MemoryAccounting=true" in text
+    assert "MemoryHigh=8G" in text
+    assert "MemoryMax=12G" in text
+    assert "TasksAccounting=true" in text
+    assert "TasksMax=512" in text
+    assert "IOAccounting=true" in text
+    assert "IOWeight=100" in text
+    assert "NoNewPrivileges=true" in text
+
+
 def test_110_ssh_publickey_auth_diagnosis_is_bounded_and_read_only() -> None:
     text = SSH_AUTH_DIAGNOSE.read_text(encoding="utf-8")