fix(ops): preserve controlled drain lane staging
Some checks failed
CD Pipeline / workflow-shape (push) Successful in 0s
CD Pipeline / cancel-stale-cd (push) Has been skipped
CD Pipeline / tests (push) Failing after 2m8s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped

Your Name
2026-07-02 01:20:35 +08:00
parent fe5bc42210
commit c1823b5f62
5 changed files with 233 additions and 4 deletions


@@ -52380,6 +52380,31 @@ production browser smoke:
**Next steps**
- After commit / push, read back the new Gitea CD run. The goal is for controlled-runtime to skip B5 directly and no longer require the Docker socket because of cold-start metadata/script changes.
## 2026-07-02 — P0 110 controlled CD lane fail-closed enforcer staging fix
**Completed**
- Fixed `scripts/reboot-recovery/enforce-110-runner-failclosed.sh`: legacy / generic runners and the primary `awoooi-cd-lane.service` stay fail-closed, but when the non-secret staging artifacts of `awoooi-cd-lane-drain.service` pass the guardrail, the enforcer no longer indiscriminately seals config / binary / unit back to a stub.
- Added the `controlled_drain_staging_allowed` check. It requires `capacity <= 1`; only the labels `awoooi-host:host` and `awoooi-ubuntu:docker://192.168.0.110:5000/awoooi/ci-runner:act-22.04`; a binary that is an executable ELF; a unit with the registration condition, CPU / Memory / Tasks accounting and limits, and `NoNewPrivileges=true`; and a service that is inactive with `MainPID=0`, neither enabled nor masked.
- `seal_live_binary_paths`, `seal_lane_binary_restore_sources`, `quarantine_lane_registration_sources`, `seal_lane_unit_files`, and `unit_ok` now all respect controlled drain staging. A new readback field / textfile metric `CONTROLLED_DRAIN_STAGING_ALLOWED` was added as well.
- `ops/runner/awoooi-cd-lane-drain.service` now also sets `CPUAccounting=true`, `MemoryAccounting=true`, and `TasksAccounting=true`, so the source unit matches the verifier guardrail.
- `docs/runbooks/FULL-STACK-COLD-START-SOP.md` and `docs/runbooks/REBOOT-RECOVERY-SOP.md` now record this lesson: the next time config / binary gets moved away again, check the enforcer staging guardrail first instead of repeatedly re-adding the artifacts by hand.
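The config part of the guardrail listed above can be sketched in a few lines of Python. This is only an illustration of the stated rule (capacity at most 1, exactly the two allowed labels, nothing else), not the enforcer's implementation, which is the bash script in this commit. The helper name `drain_config_is_safe` and the already-parsed `capacity` / `labels` inputs are assumptions made for the example.

```python
import re

# Allowed labels as stated in the guardrail; any other label name is rejected.
ALLOWED_LABELS = {
    "awoooi-host": "awoooi-host:host",
    "awoooi-ubuntu": (
        "awoooi-ubuntu:docker://192.168.0.110:5000/awoooi/ci-runner:act-22.04"
    ),
}
MAX_CAPACITY = 1


def drain_config_is_safe(capacity: str, labels: list[str]) -> bool:
    """Hypothetical helper: True only if capacity and labels pass the guardrail."""
    # Capacity must be a plain non-negative integer no larger than the limit.
    if not re.fullmatch(r"[0-9]+", capacity) or int(capacity) > MAX_CAPACITY:
        return False
    seen = set()
    for label in labels:
        name = label.split(":", 1)[0]
        # The full label must match the allowed value for its name exactly.
        if ALLOWED_LABELS.get(name) != label:
            return False
        seen.add(name)
    # Both allowed labels must be present.
    return seen == set(ALLOWED_LABELS)
```

Matching the whole label string, rather than only the name before the colon, is what keeps a lookalike such as `awoooi-host:docker://...` from passing.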
**Local verification results**
- `bash -n scripts/reboot-recovery/enforce-110-runner-failclosed.sh ops/runner/check-awoooi-110-controlled-cd-lane-readiness.sh scripts/reboot-recovery/awoooi-startup-110.sh`: passed.
- `python3.11 -m pytest scripts/reboot-recovery/tests/test_cold_start_monitor_bounded_probes.py -q`: `16 passed`.
- `python3.11 -m pytest ops/runner/test_check_awoooi_110_controlled_cd_lane_readiness.py -q`: `5 passed`.
- `python3 ops/runner/guard-gitea-runner-pressure.py --root .`: passed, with `auto_branch_events_on_110=0` and `generic_runner_labels=0`.
- `git diff --check`: passed.
**Still in effect**
- No secret / token / `.env` / raw sessions / SQLite / auth was read, and the contents of `.runner` were not read.
- GitHub / gh / GitHub API / GitHub Actions were not used.
- No host reboot; no Docker / Nginx / K3s / DB / firewall restart; no workflow_dispatch; no DROP / TRUNCATE / restore / prune.
- No runner was registered and `awoooi-cd-lane-drain.service` was not started. Runner registration must still go through the token-safe path that neither prints the token nor reads the contents of `.runner`.
**Next steps**
- After commit / push to Gitea main, read back CD, then sync the new enforcer to 110 in a controlled manner and rerun the non-secret guardrail apply and `check-awoooi-110-controlled-cd-lane-readiness.sh`. The goal is for active blockers to converge to registration / service inactive only, with no more false blockers caused by the enforcer re-sealing config / binary / unit.
## 2026-07-01 — 08:50 P0 188 DB circuit breaker post-push readback
**Completed**


@@ -1,6 +1,6 @@
# AWOOOI Full-Stack Cold Start and Host Reboot SOP
-> Version: v1.91
+> Version: v1.92
> Last updated: 2026-07-02 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.
@@ -20,6 +20,8 @@ v1.80 / v1.81 credential escrow intake scorecard rule: same-round owner response
2026-07-02 110 control-path / Harbor recovery receipt rule: if the Gitea Harbor repair queue still retains `harbor_110_remote_ssh_publickey_auth_stalled`, remote-control unavailable, jobs stale, or a historical failure, but local evidence from the same round also proves that the `wooo` command path is ready, 110 local Harbor `/v2/` is ready, and public/internal registry `/v2/` returns `401`, then that Gitea Harbor repair failure may only be listed as historical queue metadata and must no longer be treated as a current SSH blocker. Use `/api/v1/agents/harbor-registry-controlled-recovery-receipt` or an equivalent validator to merge `diagnose-110-ssh-publickey-auth.sh`, `recover-110-control-path-and-harbor-local.sh --check`, the public Gitea queue readback, and the registry `/v2/` verifier, and write the machine-readable result to a snapshot of the `docs/operations/harbor-110-control-path-recovery-readback-2026-07-02.snapshot.json` type. The 2026-07-02 live receipt shows: public/internal registry `/v2/` both return `401`; the latest visible CD `#4335` is `Success`; the Gitea Harbor repair failure is already `historical_after_latest_cd_success=true`; and active blockers have converged to the 110 controlled CD lane config / binary / registration / service guardrail, active action container pressure, and the Gitea CD jobs head-SHA / stale readback mismatch. If the local-console output contains only the `AWOOOI_110_CONTROLLED_CD_LANE_READY` marker, the non110 runner parser must not derive a non110 blocker from 110 `BLOCKER` lines; non110 may be listed as an active blocker only when the `AWOOOI_NON110_RUNNER_READY` marker is seen.
2026-07-02 110 controlled CD lane fail-closed enforcer staging rule: after the 110 runner pressure incident, legacy / generic runners must still stay fail-closed, but the non-secret staging artifacts of `awoooi-cd-lane-drain.service` must no longer be indiscriminately sealed back to a stub by the enforcer. `scripts/reboot-recovery/enforce-110-runner-failclosed.sh` may keep the drain config / binary / unit, and emit `CONTROLLED_DRAIN_STAGING_ALLOWED=1` plus the textfile metric, only when all of the following hold: `config.yaml` satisfies `capacity <= 1` and contains only `awoooi-host:host` and `awoooi-ubuntu:docker://192.168.0.110:5000/awoooi/ci-runner:act-22.04`; the binary is an executable ELF; the systemd unit carries guardrails such as `ConditionPathExists=/home/wooo/awoooi-cd-lane-drain/data/.runner` and `CPUAccounting` / `MemoryAccounting` / `TasksAccounting` / `NoNewPrivileges`; and the service is `inactive` with `MainPID=0`, neither enabled nor masked. This staging rule must not read the token, must not read the contents of `.runner`, must not register a runner, and must not start the service. If registration is missing, the readiness verifier must still leave only blockers of the `controlled_cd_lane_registration_missing` / `controlled_cd_lane_service_not_active` kind. If `CONTROLLED_DRAIN_STAGING_ALLOWED=0` and config / binary has been moved away again, fix the source enforcer / unit guardrail first; do not repeatedly re-add the same artifacts by hand.
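As a rough illustration of the unit-level part of this rule, the sketch below checks a unit file's text for the registration condition and the accounting / limit directives. It is written in Python for readability and is not the enforcer itself. The names `REQUIRED_DIRECTIVES` and `unit_text_is_safe` are hypothetical, and the sketch deliberately leaves out the `ExecStart` command check and the service-state check that the bash script also performs.

```python
import re

# Directives the staging rule expects in the unit text (one regex per directive).
REQUIRED_DIRECTIVES = [
    r"^[ \t]*CPUAccounting=true",
    r"^[ \t]*CPUQuota=",
    r"^[ \t]*MemoryAccounting=true",
    r"^[ \t]*Memory(High|Max)=",
    r"^[ \t]*TasksAccounting=true",
    r"^[ \t]*TasksMax=",
    r"^[ \t]*NoNewPrivileges=true",
]


def unit_text_is_safe(unit_text: str, registration_path: str) -> bool:
    """Hypothetical helper: True if the unit text carries every guardrail."""
    # The unit must refuse to start until the registration file exists.
    # Only the path is compared; the file's contents are never read.
    if f"ConditionPathExists={registration_path}" not in unit_text:
        return False
    # Every accounting / limit directive must appear at the start of a line.
    return all(
        re.search(pattern, unit_text, re.MULTILINE)
        for pattern in REQUIRED_DIRECTIVES
    )
```

Requiring both the `*Accounting=true` switch and a matching limit reflects the rule's wording of accounting plus limit; a unit that enables accounting without any quota would fail the check.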
2026-07-01 23:00 latest live summary: core cold-start has converged from degraded to GREEN, but DR complete or up-to-date MOMO sales data still must not be declared. `full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1` artifact `/tmp/awoooi-cold-start-source-gate-20260701-225720.log`: `PASS=96 WARN=0 BLOCKED=0`, `Result: GREEN`. When MOMO daily data is stale at `MOMO_DAILY_FRESHNESS 7|2026-06-24` and the no-newer-source evidence holds, it no longer counts as a core cold-start warning; cold-start emits `OK 188 momo daily sales source gate has no newer Drive candidate` and `INFO 188 momo daily sales data remains stale; product data freshness is pending source arrival`. This means host/service recovery is complete, but product data freshness remains at the source-arrival gate: it must wait until the official Drive source arrives and then be updated by the original import pipeline, and it must not be faked with a manual DB update. The 110 live monitor is synced: `/home/wooo/scripts/full-stack-cold-start-check.sh` hash `6115f73002b7e5b0fc46a031a2e7e9049d68abfcc8110f638e975218792c468e`; the 110 textfile reads back `awoooi_cold_start_monitor_up=1`, `pass=96`, `warn=0`, `blocked=0`, `last_exit_code=0`, `last_result{result="green"}=1`, `last_run_duration_seconds=26`; `verify-cold-start-monitor-deploy.sh`: `COLD_START_MONITOR_DEPLOY_PARITY_OK`, runtime state `monitor_up=1 warn=0 blocked=0 green=1 blocked_state=0`, ColdStart alerts `0`; `full-stack-recovery-scorecard.sh`: `CORE_COLD_START_GREEN=1`, `CORE_COLD_START_WARN_GATES=0`, `CORE_COLD_START_BLOCKED_GATES=0`, `CORE_COLD_START_FIRING_ALERTS=0`, `CORE_COLD_START_DEPLOY_PARITY=1`, `CORE_REGISTRY_READY=1`, `DR_OFFSITE_EVIDENCE_READBACK=1`, `ESCROW_MISSING_COUNT=5`, `NEXT_STEP=complete_credential_escrow_review`, `RECOVERY_STATE=CORE_READY_DR_OFFSITE_PENDING`. Allowed declaration: 110 / 120 / 121 / 188 core cold-start service recovery GREEN; public routes / AWOOOI service / Gitea / Harbor registry / K3s / Stock public route / 188 backup-from-110 / 110 awoooi_db freshness have recovered. Forbidden declaration: DR complete, credential escrow complete, MOMO data up to date, 110 SSH permanently stable, or masking source freshness with fake data or manual DB writes.
2026-07-01 21:32 previous live summary: the false cold-start WARNs have converged and hard blockers remain `0`, but full green or completed 10-minute fully automatic recovery still must not be declared. `full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1` final artifact `/tmp/awoooi-cold-start-final-20260701-212632.log`: `PASS=95 WARN=1 BLOCKED=0`. Public routes / TLS all pass. The StockPlatform `502` at around 21:22 was confirmed to be web/admin/edge replacement warmup; 5 consecutive external requests to `https://stock.wooo.work/` returned `200`, and the final cold-start also returned `stock 200`. K3s `BAD_PODS=2` was likewise a rollout transient; 6 consecutive read-only observations showed no pods outside Running/Completed, and the final value is `BAD_PODS 0`. MOMO current-month `0|0|-|-|-|-` is no longer listed as a WARN: `momo-drive-token-source-recovery-preflight.sh` emits `MOMO_LATEST_IMPORT_CLEAN` and `MOMO_SOURCE_ABSENT_WITHOUT_NEWER_DRIVE`, and when cold-start reads a latest clean import and Drive has no newer source candidate, it judges current-month sync not applicable. 110 backup current health is also no longer pushed down to WARN by the old aggregate log: `BACKUP_HEALTH_110 total=13 stale=0 missing_cron=0 missing_script=0 failed_count=5 config_failed=0 integrity_total=2 integrity_stale=0` means current component freshness / critical config / integrity are OK; `failed_count=5` is retained as INFO evidence until the next full `backup-all` naturally overwrites it. The live 110 monitor is synced: hashes `full-stack-cold-start-check.sh=d0711f75dfb1ee680442c9d6cf2191741f3b27605f347c9ef2a25a4fed6d40ac` and `momo-drive-token-source-recovery-preflight.sh=571d75e81c509683eb8a38fabbe81fc7822befe45206145f4fb4e865473f5254`; the 110 textfile reads back `awoooi_cold_start_monitor_up=1`, `pass=95`, `warn=1`, `blocked=0`, `last_exit_code=1`, `last_result{result="degraded"}=1`; `verify-cold-start-monitor-deploy.sh`: `COLD_START_MONITOR_DEPLOY_PARITY_OK`; `full-stack-recovery-scorecard.sh`: `CORE_COLD_START_WARN_GATES=1`, `CORE_COLD_START_BLOCKED_GATES=0`, `CORE_COLD_START_DEPLOY_PARITY=1`, `CORE_REGISTRY_READY=1`, `DR_OFFSITE_EVIDENCE_READBACK=1`, `ESCROW_MISSING_COUNT=5`, `NEXT_STEP=complete_credential_escrow_review`. Allowed declaration: public routes / AWOOOI service / Gitea / Harbor registry / 188 backup-from-110 / 110 awoooi_db freshness / K3s rollout / Stock public route have recovered, and cold-start hard blockers are `0`. Forbidden declaration: full green, completed 10-minute fully automatic recovery, DR complete, credential escrow complete, MOMO data up to date, 110 SSH permanently stable. The only cold-start WARN is MOMO daily data freshness: `MOMO_DAILY_FRESHNESS 7|2026-06-24`, with no new candidate in the Drive intake / failed folder; it must go through the source-arrival / formal import gate and must not be masked with fake data or manual DB writes.


@@ -1,9 +1,9 @@
# AWOOOI Reboot Recovery SOP
-> **Version**: v5.1
+> **Version**: v5.2
> **Last updated**: 2026-07-02 (Taipei time)
> **Updated by**: Codex
-> **Trigger event**: 110 control-path / Harbor recovery receipt and Gitea stale queue blocker convergence
+> **Trigger event**: 110 control-path / Harbor recovery receipt, Gitea stale queue blocker, and controlled CD lane fail-closed enforcer staging convergence
---
@@ -105,6 +105,17 @@ Git push → Gitea(110:3001)
→ Build Docker image → Harbor(:5000) → kubectl → K3s pods
```
### 110 Controlled CD Lane and Fail-Closed Enforcer
After the 110 runner pressure incident, `awoooi-cd-lane.service`, legacy runners, generic labels, and heavy runners must still stay fail-closed. The only 110 CD entrypoint that may be kept is the non-secret staging state of the dedicated `awoooi-cd-lane-drain.service`, and it must meet all of the following conditions:
- `config.yaml` allows only `capacity <= 1`, `awoooi-host:host`, and `awoooi-ubuntu:docker://192.168.0.110:5000/awoooi/ci-runner:act-22.04`.
- `awoooi_cd_lane_controlled` must be an executable ELF, not a fail-closed shell stub.
- The unit must contain `ConditionPathExists=/home/wooo/awoooi-cd-lane-drain/data/.runner`, CPU / Memory / Tasks accounting and limits, and `NoNewPrivileges=true`.
- During the staging phase the service must be inactive, with `MainPID=0`, neither enabled nor masked. Runner token registration and the contents of `.runner` still must not be read or printed.
The readback procedure is fixed: first run `scripts/reboot-recovery/enforce-110-runner-failclosed.sh --check` and look at `CONTROLLED_DRAIN_STAGING_ALLOWED`; then run `ops/runner/check-awoooi-110-controlled-cd-lane-readiness.sh` and look at `CONFIG_READY`, `BINARY_READY`, `REGISTRATION_READY`, and `SERVICE_READY`. If only the registration / service inactive blockers remain, the non-secret guardrail has converged, and only then is the token-safe registration path the next step. Repeatedly hand-patching config / binary must not substitute for fixing the source enforcer.
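The decision described in the readback procedure can be sketched as a small parser over the two scripts' outputs. This is a hedged illustration only: `CONTROLLED_DRAIN_STAGING_ALLOWED=1` matches the enforcer's readback format shown in this commit, but the assumption that the readiness script prints `CONFIG_READY`, `BINARY_READY`, `REGISTRATION_READY`, and `SERVICE_READY` as `KEY=1` / `KEY=0` lines is mine, and the function names and returned strings are hypothetical.

```python
def parse_readback(output: str) -> dict[str, str]:
    """Collect KEY=VALUE lines from script output; other lines are ignored."""
    fields: dict[str, str] = {}
    for line in output.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key.isidentifier():
            fields[key] = value
    return fields


def next_step(enforcer_output: str, readiness_output: str) -> str:
    """Hypothetical helper: map the two readbacks to the SOP's next action."""
    enforcer = parse_readback(enforcer_output)
    readiness = parse_readback(readiness_output)
    # Non-secret guardrail: staging allowed, config and binary in place.
    non_secret_ok = (
        enforcer.get("CONTROLLED_DRAIN_STAGING_ALLOWED") == "1"
        and readiness.get("CONFIG_READY") == "1"
        and readiness.get("BINARY_READY") == "1"
    )
    if not non_secret_ok:
        # Do not hand-patch artifacts; fix the enforcer / unit guardrail.
        return "fix source enforcer / unit guardrail"
    if (
        readiness.get("REGISTRATION_READY") != "1"
        or readiness.get("SERVICE_READY") != "1"
    ):
        # Only registration / service blockers remain.
        return "token-safe registration path"
    return "lane ready"
```

A missing key is treated the same as `0`, so truncated or unexpected output falls back to the most conservative action.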
### Key Dependencies
| Service | Key dependency | If the dependency fails |


@@ -96,6 +96,13 @@ LIVE_BINARY_PATHS=(
  "/home/wooo/awoooi-controlled-runner/awoooi_controlled_runner"
)

CONTROLLED_DRAIN_UNIT="${CONTROLLED_DRAIN_UNIT:-awoooi-cd-lane-drain.service}"
CONTROLLED_DRAIN_DIR="${CONTROLLED_DRAIN_DIR:-/home/wooo/awoooi-cd-lane-drain}"
CONTROLLED_DRAIN_BINARY="${CONTROLLED_DRAIN_BINARY:-$CONTROLLED_DRAIN_DIR/awoooi_cd_lane_controlled}"
CONTROLLED_DRAIN_CONFIG="${CONTROLLED_DRAIN_CONFIG:-$CONTROLLED_DRAIN_DIR/config.yaml}"
CONTROLLED_DRAIN_REGISTRATION="${CONTROLLED_DRAIN_REGISTRATION:-$CONTROLLED_DRAIN_DIR/data/.runner}"
CONTROLLED_DRAIN_MAX_CAPACITY="${CONTROLLED_DRAIN_MAX_CAPACITY:-1}"

as_root() {
  if [ "${EUID:-$(id -u)}" -eq 0 ]; then
    "$@"
@@ -137,6 +144,124 @@ count_runner_processes() {
  pgrep -f '^/home/wooo/act-runner/act_runner|^/home/wooo/act-runner-controlled/act_runner|^/home/wooo/awoooi-controlled-runner/awoooi_controlled_runner|Runner.Listener|Runner.Worker' 2>/dev/null | wc -l | tr -d ' '
}

extract_runner_capacity() {
  local config_path="$1"
  awk '
    /^runner:[[:space:]]*$/ {
      in_runner=1
      next
    }
    in_runner && /^[^[:space:]]/ && $0 !~ /^runner:[[:space:]]*$/ {
      in_runner=0
    }
    in_runner && /^[[:space:]]*capacity:[[:space:]]*/ {
      line=$0
      sub(/^[[:space:]]*capacity:[[:space:]]*/, "", line)
      gsub(/["'\'']/, "", line)
      print line
      exit
    }
  ' "$config_path"
}

extract_runner_labels() {
  local config_path="$1"
  awk '
    /^[[:space:]]*labels:[[:space:]]*$/ {
      in_labels=1
      next
    }
    in_labels && /^[[:space:]]*-[[:space:]]*/ {
      line=$0
      sub(/^[[:space:]]*-[[:space:]]*"/, "", line)
      sub(/^[[:space:]]*-[[:space:]]*/, "", line)
      sub(/"[[:space:]]*$/, "", line)
      print line
      next
    }
    in_labels && /^[^[:space:]]/ {
      in_labels=0
    }
  ' "$config_path"
}

label_name() {
  printf '%s' "${1%%:*}"
}

controlled_drain_config_safe() {
  local capacity labels label name has_host=0 has_ubuntu=0
  [ -r "$CONTROLLED_DRAIN_CONFIG" ] || return 1
  capacity="$(extract_runner_capacity "$CONTROLLED_DRAIN_CONFIG" | head -1)"
  printf '%s' "${capacity:-}" | grep -Eq '^[0-9]+$' || return 1
  [ "$capacity" -le "$CONTROLLED_DRAIN_MAX_CAPACITY" ] || return 1
  labels="$(extract_runner_labels "$CONTROLLED_DRAIN_CONFIG" || true)"
  [ -n "$labels" ] || return 1
  while IFS= read -r label; do
    [ -n "$label" ] || continue
    name="$(label_name "$label")"
    case "$name" in
      awoooi-host)
        [ "$label" = "awoooi-host:host" ] || return 1
        has_host=1
        ;;
      awoooi-ubuntu)
        [ "$label" = "awoooi-ubuntu:docker://192.168.0.110:5000/awoooi/ci-runner:act-22.04" ] || return 1
        has_ubuntu=1
        ;;
      ubuntu-latest|ubuntu-*|self-hosted|stockplatform*|stock-platform*|headless*|playwright*)
        return 1
        ;;
      *)
        return 1
        ;;
    esac
  done <<<"$labels"
  [ "$has_host" -eq 1 ] && [ "$has_ubuntu" -eq 1 ]
}

controlled_drain_binary_safe() {
  local kind
  [ -f "$CONTROLLED_DRAIN_BINARY" ] && [ -x "$CONTROLLED_DRAIN_BINARY" ] || return 1
  kind="$(file -b "$CONTROLLED_DRAIN_BINARY" 2>/dev/null || echo missing)"
  grep -qi 'ELF' <<<"$kind"
}

controlled_drain_unit_safe() {
  local text
  text="$(systemctl cat "$CONTROLLED_DRAIN_UNIT" 2>/dev/null || true)"
  [ -n "$text" ] || return 1
  grep -Fq -- "ConditionPathExists=$CONTROLLED_DRAIN_REGISTRATION" <<<"$text" || return 1
  grep -Fq -- "$CONTROLLED_DRAIN_BINARY daemon --config $CONTROLLED_DRAIN_CONFIG" <<<"$text" || return 1
  grep -Eq '^[[:space:]]*CPUAccounting=true' <<<"$text" || return 1
  grep -Eq '^[[:space:]]*CPUQuota=' <<<"$text" || return 1
  grep -Eq '^[[:space:]]*MemoryAccounting=true' <<<"$text" || return 1
  grep -Eq '^[[:space:]]*Memory(High|Max)=' <<<"$text" || return 1
  grep -Eq '^[[:space:]]*TasksAccounting=true' <<<"$text" || return 1
  grep -Eq '^[[:space:]]*TasksMax=' <<<"$text" || return 1
  grep -Eq '^[[:space:]]*NoNewPrivileges=true' <<<"$text" || return 1
}

controlled_drain_service_inactive() {
  local load active unitfile mainpid
  load="$(systemctl show "$CONTROLLED_DRAIN_UNIT" -p LoadState --value 2>/dev/null || true)"
  active="$(systemctl show "$CONTROLLED_DRAIN_UNIT" -p ActiveState --value 2>/dev/null || true)"
  unitfile="$(systemctl show "$CONTROLLED_DRAIN_UNIT" -p UnitFileState --value 2>/dev/null || true)"
  mainpid="$(systemctl show "$CONTROLLED_DRAIN_UNIT" -p MainPID --value 2>/dev/null || true)"
  { [ "$active" = "inactive" ] || [ "$active" = "failed" ] || [ "$active" = "unknown" ] || [ -z "$active" ]; } || return 1
  [ "${mainpid:-0}" = "0" ] || return 1
  [ "$load" != "masked" ] || return 1
  [ "$unitfile" != "masked" ] || return 1
  [ "$unitfile" != "enabled" ] || return 1
}

controlled_drain_staging_allowed() {
  controlled_drain_config_safe \
    && controlled_drain_binary_safe \
    && controlled_drain_unit_safe \
    && controlled_drain_service_inactive
}

list_action_runner_units() {
  {
    systemctl list-unit-files 'actions.runner.*' --no-legend --plain 2>/dev/null | awk '{print $1}'
@@ -147,6 +272,11 @@ list_action_runner_units() {
stop_and_mask_units() {
  local unit
  for unit in "${RUNNER_UNITS[@]}"; do
    if [ "$unit" = "$CONTROLLED_DRAIN_UNIT" ] && controlled_drain_staging_allowed; then
      as_root systemctl reset-failed "$unit" >/dev/null 2>&1 || true
      as_root systemctl disable "$unit" >/dev/null 2>&1 || true
      continue
    fi
    as_root systemctl kill --signal=SIGKILL "$unit" >/dev/null 2>&1 || true
    as_root systemctl stop "$unit" >/dev/null 2>&1 || true
    as_root systemctl reset-failed "$unit" >/dev/null 2>&1 || true
@@ -218,6 +348,9 @@ seal_lane_binary_restore_sources() {
  local path
  while IFS= read -r -d '' path; do
    [ -e "$path" ] || continue
    if [ "$path" = "$CONTROLLED_DRAIN_BINARY" ] && controlled_drain_staging_allowed; then
      continue
    fi
    write_failclosed_stub "$path"
  done < <(
    {
@@ -234,6 +367,9 @@ quarantine_lane_registration_sources() {
  local target
  for lane_dir in "/home/wooo/awoooi-cd-lane" "/home/wooo/awoooi-cd-lane-drain"; do
    [ -d "$lane_dir" ] || continue
    if [ "$lane_dir" = "$CONTROLLED_DRAIN_DIR" ] && controlled_drain_staging_allowed; then
      continue
    fi
    quarantine_dir="$lane_dir/quarantine-failclosed-${STAMP}"
    as_root chattr -i "$lane_dir" "$lane_dir/data" >/dev/null 2>&1 || true
    as_root mkdir -p "$quarantine_dir" >/dev/null 2>&1 || true
@@ -257,6 +393,9 @@ quarantine_lane_registration_sources() {
seal_live_binary_paths() {
  local path
  for path in "${LIVE_BINARY_PATHS[@]}"; do
    if [ "$path" = "$CONTROLLED_DRAIN_BINARY" ] && controlled_drain_staging_allowed; then
      continue
    fi
    write_failclosed_stub "$path"
  done
}
@@ -666,7 +805,10 @@ mask_unit_file_to_devnull() {
seal_lane_unit_files() {
  mask_unit_file_to_devnull "awoooi-cd-lane.service"
-  mask_unit_file_to_devnull "awoooi-cd-lane-drain.service"
+  if controlled_drain_staging_allowed; then
+    return 0
+  fi
+  mask_unit_file_to_devnull "$CONTROLLED_DRAIN_UNIT"
}

root_restore_sources_left() {
@@ -680,6 +822,9 @@ root_restore_sources_left() {
unit_ok() {
  local unit="$1"
  local load active unitfile mainpid
  if [ "$unit" = "$CONTROLLED_DRAIN_UNIT" ] && controlled_drain_staging_allowed; then
    return 0
  fi
  load="$(systemctl show "$unit" -p LoadState --value 2>/dev/null || true)"
  active="$(systemctl show "$unit" -p ActiveState --value 2>/dev/null || true)"
  unitfile="$(systemctl show "$unit" -p UnitFileState --value 2>/dev/null || true)"
@@ -729,6 +874,9 @@ awoooi_runner_failclosed_enforcer_root_restore_sources_left $(root_restore_sourc
# HELP awoooi_runner_failclosed_enforcer_apply_performed Whether this run used apply mode.
# TYPE awoooi_runner_failclosed_enforcer_apply_performed gauge
awoooi_runner_failclosed_enforcer_apply_performed $APPLY_PERFORMED
# HELP awoooi_runner_failclosed_enforcer_controlled_drain_staging_allowed Controlled drain lane non-secret guardrail staging allowance.
# TYPE awoooi_runner_failclosed_enforcer_controlled_drain_staging_allowed gauge
awoooi_runner_failclosed_enforcer_controlled_drain_staging_allowed $(controlled_drain_staging_allowed && echo 1 || echo 0)
EOF
  as_root install -o root -g root -m 0644 "$tmp" "$dir/awoooi_runner_failclosed_enforcer.prom" >/dev/null 2>&1 || true
  rm -f "$tmp"
@@ -743,6 +891,7 @@ print_readback() {
  echo "LANE_PROCESS_COUNT=$(count_lane_processes)"
  echo "RUNNER_PROCESS_COUNT=$(count_runner_processes)"
  echo "ROOT_RESTORE_SOURCES_LEFT=$(root_restore_sources_left)"
  echo "CONTROLLED_DRAIN_STAGING_ALLOWED=$(controlled_drain_staging_allowed && echo 1 || echo 0)"
  echo "RUNNER_UNITS_BAD_COUNT=$(runner_units_bad_count)"
  for unit in "${RUNNER_UNITS[@]}"; do
    load="$(systemctl show "$unit" -p LoadState --value 2>/dev/null || true)"


@@ -22,6 +22,7 @@ REPAIR_STARTUP_STUB = (
FAILCLOSED_ENFORCER = (
    ROOT / "scripts" / "reboot-recovery" / "enforce-110-runner-failclosed.sh"
)
CONTROLLED_CD_LANE_DRAIN_UNIT = ROOT / "ops" / "runner" / "awoooi-cd-lane-drain.service"
SSH_AUTH_DIAGNOSE = (
    ROOT / "scripts" / "reboot-recovery" / "diagnose-110-ssh-publickey-auth.sh"
)
@@ -206,6 +207,47 @@ def test_runner_failclosed_enforcer_does_not_seal_live_startup_recovery_script()
    assert "awoooi-startup-110.sh.*controlled*" in text


def test_runner_failclosed_enforcer_preserves_controlled_drain_staging_only() -> None:
    text = FAILCLOSED_ENFORCER.read_text(encoding="utf-8")
    assert "controlled_drain_staging_allowed()" in text
    assert "controlled_drain_config_safe" in text
    assert "controlled_drain_binary_safe" in text
    assert "controlled_drain_unit_safe" in text
    assert "controlled_drain_service_inactive" in text
    assert "awoooi-host:host" in text
    assert (
        "awoooi-ubuntu:docker://192.168.0.110:5000/awoooi/ci-runner:act-22.04"
        in text
    )
    assert "ubuntu-latest|ubuntu-*|self-hosted|stockplatform*|stock-platform*|headless*|playwright*)" in text
    assert 'grep -Fq -- "ConditionPathExists=$CONTROLLED_DRAIN_REGISTRATION"' in text
    assert 'grep -Eq \'^[[:space:]]*CPUAccounting=true\'' in text
    assert 'grep -Eq \'^[[:space:]]*MemoryAccounting=true\'' in text
    assert 'grep -Eq \'^[[:space:]]*TasksAccounting=true\'' in text
    assert '[ "$unitfile" != "enabled" ] || return 1' in text
    assert 'if [ "$unit" = "$CONTROLLED_DRAIN_UNIT" ] && controlled_drain_staging_allowed; then' in text
    assert 'if [ "$path" = "$CONTROLLED_DRAIN_BINARY" ] && controlled_drain_staging_allowed; then' in text
    assert 'if [ "$lane_dir" = "$CONTROLLED_DRAIN_DIR" ] && controlled_drain_staging_allowed; then' in text
    assert "CONTROLLED_DRAIN_STAGING_ALLOWED=" in text


def test_controlled_cd_lane_unit_source_has_required_accounting_guardrails() -> None:
    text = CONTROLLED_CD_LANE_DRAIN_UNIT.read_text(encoding="utf-8")
    assert "ConditionPathExists=/home/wooo/awoooi-cd-lane-drain/data/.runner" in text
    assert "CPUAccounting=true" in text
    assert "CPUQuota=250%" in text
    assert "MemoryAccounting=true" in text
    assert "MemoryHigh=8G" in text
    assert "MemoryMax=12G" in text
    assert "TasksAccounting=true" in text
    assert "TasksMax=512" in text
    assert "IOAccounting=true" in text
    assert "IOWeight=100" in text
    assert "NoNewPrivileges=true" in text


def test_110_ssh_publickey_auth_diagnosis_is_bounded_and_read_only() -> None:
    text = SSH_AUTH_DIAGNOSE.read_text(encoding="utf-8")