fix(ops): preserve controlled drain lane staging
Some checks failed
CD Pipeline / workflow-shape (push) Successful in 0s
CD Pipeline / cancel-stale-cd (push) Has been skipped
CD Pipeline / tests (push) Failing after 2m8s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
@@ -52380,6 +52380,31 @@ production browser smoke:
**Next steps**:
- After commit / push, read back the new Gitea CD run. The goal is for controlled-runtime to skip B5 directly and to stop requiring the Docker socket because of cold-start metadata/script changes.

## 2026-07-02 — P0 110 controlled CD lane fail-closed enforcer staging fix

**Completed**:
- Fixed `scripts/reboot-recovery/enforce-110-runner-failclosed.sh`: legacy / generic runners and the primary `awoooi-cd-lane.service` remain fail-closed, but when the non-secret staging artifacts of `awoooi-cd-lane-drain.service` pass the guardrail, the enforcer no longer indiscriminately seals the config / binary / unit back into stubs.
- Added the `controlled_drain_staging_allowed` check. It requires `capacity <= 1`; allows only `awoooi-host:host` and `awoooi-ubuntu:docker://192.168.0.110:5000/awoooi/ci-runner:act-22.04`; requires the binary to be an executable ELF; requires the unit to carry the registration condition, CPU / Memory / Tasks accounting and limits, and `NoNewPrivileges=true`; and requires the service to be inactive with `MainPID=0` and neither enabled nor masked.
- `seal_live_binary_paths`, `seal_lane_binary_restore_sources`, `quarantine_lane_registration_sources`, `seal_lane_unit_files`, and `unit_ok` now all respect controlled drain staging; a readback / textfile metric `CONTROLLED_DRAIN_STAGING_ALLOWED` was also added.
- `ops/runner/awoooi-cd-lane-drain.service` now includes `CPUAccounting=true`, `MemoryAccounting=true`, and `TasksAccounting=true`, so that the source unit matches the verifier guardrail.
- `docs/runbooks/FULL-STACK-COLD-START-SOP.md` and `docs/runbooks/REBOOT-RECOVERY-SOP.md` now record the lesson from this incident: if the config / binary get moved away again, check the enforcer staging guardrail first instead of repeatedly re-adding artifacts by hand.

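The allowance described in this entry is an all-or-nothing conjunction: a single failed condition sends the drain lane back to the fail-closed path. As a rough illustration only, the decision could be modelled as below. The `DrainFacts` type and its field names are hypothetical and not part of the commit; the real check is the bash function `controlled_drain_staging_allowed`, which also tolerates `failed`, `unknown`, or empty active states.

```python
from dataclasses import dataclass

# Allowed labels, copied from the changelog entry.
ALLOWED_LABELS = {
    "awoooi-host:host",
    "awoooi-ubuntu:docker://192.168.0.110:5000/awoooi/ci-runner:act-22.04",
}


@dataclass(frozen=True)
class DrainFacts:
    """Hypothetical snapshot of the non-secret facts the enforcer inspects."""
    capacity: int
    labels: frozenset
    binary_is_executable_elf: bool
    unit_guardrails_present: bool
    active_state: str
    main_pid: int
    unit_file_state: str


def staging_allowed(facts: DrainFacts, max_capacity: int = 1) -> bool:
    """Return True only when every guardrail holds (fail-closed otherwise)."""
    return all((
        facts.capacity <= max_capacity,
        facts.labels == ALLOWED_LABELS,
        facts.binary_is_executable_elf,
        facts.unit_guardrails_present,
        facts.active_state in ("inactive", "failed", "unknown", ""),
        facts.main_pid == 0,
        facts.unit_file_state not in ("enabled", "masked"),
    ))
```

Keeping the conditions in one flat `all(...)` makes it harder to add a shortcut that bypasses a guardrail by accident.
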
**Local verification results**:
- `bash -n scripts/reboot-recovery/enforce-110-runner-failclosed.sh ops/runner/check-awoooi-110-controlled-cd-lane-readiness.sh scripts/reboot-recovery/awoooi-startup-110.sh`: passed.
- `python3.11 -m pytest scripts/reboot-recovery/tests/test_cold_start_monitor_bounded_probes.py -q`: `16 passed`.
- `python3.11 -m pytest ops/runner/test_check_awoooi_110_controlled_cd_lane_readiness.py -q`: `5 passed`.
- `python3 ops/runner/guard-gitea-runner-pressure.py --root .`: passed, with `auto_branch_events_on_110=0` and `generic_runner_labels=0`.
- `git diff --check`: passed.

**Still in force**:
- No secret / token / `.env` / raw sessions / SQLite / auth was read; the `.runner` contents were not read.
- GitHub / gh / GitHub API / GitHub Actions were not used.
- No host was rebooted; there was no Docker / Nginx / K3s / DB / firewall restart, no workflow_dispatch, and no DROP / TRUNCATE / restore / prune.
- No runner was registered and `awoooi-cd-lane-drain.service` was not started; runner registration must still go through the token-safe path that neither prints the token nor reads the `.runner` contents.

**Next steps**:
- After commit / push to Gitea main, read back the CD run; then sync the new enforcer to 110 in a controlled way and rerun the non-secret guardrail apply and `check-awoooi-110-controlled-cd-lane-readiness.sh`. The goal is for active blockers to converge to registration / service inactive only, with no more false blockers caused by the enforcer re-sealing the config / binary / unit.

## 2026-07-01 — 08:50 P0 188 DB circuit breaker post-push readback

**Completed**:

@@ -1,6 +1,6 @@
# AWOOOI Full-Stack Cold Start and Host Reboot SOP

> Version: v1.91
> Version: v1.92
> Last updated: 2026-07-02 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.

@@ -20,6 +20,8 @@ v1.80 / v1.81 credential escrow intake scorecard rule: same-round owner response

2026-07-02 110 control-path / Harbor recovery receipt rule: if the Gitea Harbor repair queue still retains `harbor_110_remote_ssh_publickey_auth_stalled`, remote-control unavailable, jobs stale, or historical failure, but local evidence from the same round also proves that the `wooo` command path is ready, the 110 local Harbor `/v2/` is ready, and the public/internal registry `/v2/` returns `401`, then that Gitea Harbor repair failure may only be listed as historical queue metadata and must no longer be treated as a current SSH blocker. Use `/api/v1/agents/harbor-registry-controlled-recovery-receipt` or an equivalent validator to merge `diagnose-110-ssh-publickey-auth.sh`, `recover-110-control-path-and-harbor-local.sh --check`, the public Gitea queue readback, and the registry `/v2/` verifier, and write the machine-readable result to a snapshot of the `docs/operations/harbor-110-control-path-recovery-readback-2026-07-02.snapshot.json` kind. The 2026-07-02 live receipt shows: public/internal registry `/v2/` both return `401`, the latest visible CD `#4335` is `Success`, and the Gitea Harbor repair failure is already `historical_after_latest_cd_success=true`; active blockers have converged to the 110 controlled CD lane config / binary / registration / service guardrail, active action container pressure, and the Gitea CD jobs head-SHA / stale readback mismatch. If the local-console output contains only the `AWOOOI_110_CONTROLLED_CD_LANE_READY` marker, the non110 runner parser must not derive a non110 blocker from the 110 `BLOCKER` lines; non110 may be listed as an active blocker only when the `AWOOOI_NON110_RUNNER_READY` marker is seen.

2026-07-02 110 controlled CD lane fail-closed enforcer staging rule: after the 110 runner pressure incident, legacy / generic runners must still fail closed, but the non-secret staging artifacts of `awoooi-cd-lane-drain.service` must no longer be indiscriminately sealed back into stubs by the enforcer. `scripts/reboot-recovery/enforce-110-runner-failclosed.sh` may keep the drain config / binary / unit, and emit `CONTROLLED_DRAIN_STAGING_ALLOWED=1` plus the textfile metric, only when all of the following hold: `config.yaml` has `capacity <= 1` and contains only `awoooi-host:host` and `awoooi-ubuntu:docker://192.168.0.110:5000/awoooi/ci-runner:act-22.04`; the binary is an executable ELF; the systemd unit carries guardrails such as `ConditionPathExists=/home/wooo/awoooi-cd-lane-drain/data/.runner`, `CPUAccounting` / `MemoryAccounting` / `TasksAccounting` / `NoNewPrivileges`; and the service is `inactive` with `MainPID=0` and neither enabled nor masked. This staging rule must not read the token, must not read the `.runner` contents, must not register a runner, and must not start the service; if registration is missing, the readiness verifier must still leave only blockers of the `controlled_cd_lane_registration_missing` / `controlled_cd_lane_service_not_active` kind. If `CONTROLLED_DRAIN_STAGING_ALLOWED=0` and the config / binary have been moved away again, fix the source enforcer / unit guardrail first rather than repeatedly re-adding the same artifacts by hand.

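The rule above hinges on reading `CONTROLLED_DRAIN_STAGING_ALLOWED` back from the enforcer output. A minimal sketch of such a readback parser follows, assuming the output is a series of plain `KEY=VALUE` lines; the function names `parse_readback` and `staging_allowed_from_readback` are hypothetical helpers, not part of the repository.

```python
def parse_readback(output: str) -> dict:
    """Parse KEY=VALUE lines and ignore anything that does not look like one."""
    values = {}
    for raw in output.splitlines():
        key, sep, value = raw.strip().partition("=")
        if sep and key.isidentifier():
            values[key] = value
    return values


def staging_allowed_from_readback(output: str) -> bool:
    # A missing key or any value other than "1" is treated as not allowed,
    # which keeps the interpretation fail-closed.
    return parse_readback(output).get("CONTROLLED_DRAIN_STAGING_ALLOWED") == "1"
```

Treating an absent key as "not allowed" matches the fail-closed intent of the rule: an older enforcer that does not print the key can never be mistaken for one that granted staging.
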
2026-07-01 23:00 latest live summary: core cold-start has converged from degraded to GREEN, but DR complete or up-to-date MOMO sales data still must not be declared. `full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1` artifact `/tmp/awoooi-cold-start-source-gate-20260701-225720.log` returned `PASS=96 WARN=0 BLOCKED=0` and `Result: GREEN`. When MOMO daily data is stale at `MOMO_DAILY_FRESHNESS 7|2026-06-24` and the no-newer-source evidence holds, it no longer counts as a core cold-start warning: cold-start prints `OK 188 momo daily sales source gate has no newer Drive candidate` and `INFO 188 momo daily sales data remains stale; product data freshness is pending source arrival`. This means host/service recovery is complete, but product data freshness stays at the source-arrival gate; it must wait until the official Drive source arrives and is updated by the original import pipeline, and must not be faked by manual DB updates. The 110 live monitor is synced: `/home/wooo/scripts/full-stack-cold-start-check.sh` hash `6115f73002b7e5b0fc46a031a2e7e9049d68abfcc8110f638e975218792c468e`; the 110 textfile reads back `awoooi_cold_start_monitor_up=1`, `pass=96`, `warn=0`, `blocked=0`, `last_exit_code=0`, `last_result{result="green"}=1`, `last_run_duration_seconds=26`. `verify-cold-start-monitor-deploy.sh` returned `COLD_START_MONITOR_DEPLOY_PARITY_OK`, runtime state `monitor_up=1 warn=0 blocked=0 green=1 blocked_state=0`, ColdStart alerts `0`. `full-stack-recovery-scorecard.sh` returned `CORE_COLD_START_GREEN=1`, `CORE_COLD_START_WARN_GATES=0`, `CORE_COLD_START_BLOCKED_GATES=0`, `CORE_COLD_START_FIRING_ALERTS=0`, `CORE_COLD_START_DEPLOY_PARITY=1`, `CORE_REGISTRY_READY=1`, `DR_OFFSITE_EVIDENCE_READBACK=1`, `ESCROW_MISSING_COUNT=5`, `NEXT_STEP=complete_credential_escrow_review`, `RECOVERY_STATE=CORE_READY_DR_OFFSITE_PENDING`. Allowed declaration: 110 / 120 / 121 / 188 core cold-start service recovery GREEN; public routes / AWOOOI service / Gitea / Harbor registry / K3s / Stock public route / 188 backup-from-110 / 110 awoooi_db freshness have recovered. Forbidden declaration: DR complete, credential escrow complete, MOMO data up to date, 110 SSH permanently stable, or masking source freshness with fake data or manual DB writes.

2026-07-01 21:32 previous live summary: the false cold-start WARNs have converged and hard blockers remain `0`, but full green or completed 10-minute fully automatic recovery still must not be declared. `full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1` final artifact `/tmp/awoooi-cold-start-final-20260701-212632.log` returned `PASS=95 WARN=1 BLOCKED=0`; public routes / TLS all passed. The StockPlatform `502` at around 21:22 was confirmed to be web/admin/edge replacement warmup: 5 consecutive external requests to `https://stock.wooo.work/` returned `200`, and the final cold-start also returned `stock 200`. K3s `BAD_PODS=2` was likewise a rollout transient: 6 consecutive read-only observations showed no pods outside Running/Completed, and the final result was `BAD_PODS 0`. MOMO current-month `0|0|-|-|-|-` is no longer listed as a WARN: `momo-drive-token-source-recovery-preflight.sh` prints `MOMO_LATEST_IMPORT_CLEAN` and `MOMO_SOURCE_ABSENT_WITHOUT_NEWER_DRIVE`, and when cold-start reads a latest clean import and Drive has no newer source candidate, it judges current-month sync not applicable. 110 backup current health is also no longer pushed to WARN by the old aggregate log: `BACKUP_HEALTH_110 total=13 stale=0 missing_cron=0 missing_script=0 failed_count=5 config_failed=0 integrity_total=2 integrity_stale=0` means current component freshness / critical config / integrity are OK; `failed_count=5` is kept as INFO evidence until the next full `backup-all` naturally overwrites it. The live 110 monitor is synced, with hashes `full-stack-cold-start-check.sh=d0711f75dfb1ee680442c9d6cf2191741f3b27605f347c9ef2a25a4fed6d40ac` and `momo-drive-token-source-recovery-preflight.sh=571d75e81c509683eb8a38fabbe81fc7822befe45206145f4fb4e865473f5254`; the 110 textfile reads back `awoooi_cold_start_monitor_up=1`, `pass=95`, `warn=1`, `blocked=0`, `last_exit_code=1`, `last_result{result="degraded"}=1`. `verify-cold-start-monitor-deploy.sh` returned `COLD_START_MONITOR_DEPLOY_PARITY_OK`; `full-stack-recovery-scorecard.sh` returned `CORE_COLD_START_WARN_GATES=1`, `CORE_COLD_START_BLOCKED_GATES=0`, `CORE_COLD_START_DEPLOY_PARITY=1`, `CORE_REGISTRY_READY=1`, `DR_OFFSITE_EVIDENCE_READBACK=1`, `ESCROW_MISSING_COUNT=5`, `NEXT_STEP=complete_credential_escrow_review`. Allowed declaration: public routes / AWOOOI service / Gitea / Harbor registry / 188 backup-from-110 / 110 awoooi_db freshness / K3s rollout / Stock public route have recovered, with cold-start hard blockers `0`. Forbidden declaration: full green, completed 10-minute fully automatic recovery, DR complete, credential escrow complete, MOMO data up to date, 110 SSH permanently stable. The only cold-start WARN is MOMO daily data freshness: `MOMO_DAILY_FRESHNESS 7|2026-06-24`, with no new candidates in the Drive intake / failed folder; it must go through the source-arrival / formal import gate and must not be masked with fake data or manual DB writes.

@@ -1,9 +1,9 @@
# AWOOOI Reboot Recovery SOP

> **Version**: v5.1
> **Version**: v5.2
> **Last updated**: 2026-07-02 (Taipei time)
> **Updated by**: Codex
> **Trigger event**: 110 control-path / Harbor recovery receipt and Gitea stale queue blocker convergence
> **Trigger event**: 110 control-path / Harbor recovery receipt, Gitea stale queue blocker, and controlled CD lane fail-closed enforcer staging convergence

---

@@ -105,6 +105,17 @@ Git push → Gitea(110:3001)
→ Build Docker image → Harbor(:5000) → kubectl → K3s pods
```

### 110 Controlled CD Lane and Fail-Closed Enforcer

After the 110 runner pressure incident, `awoooi-cd-lane.service`, legacy runners, generic labels, and heavy runners must still remain fail-closed. The only 110 CD entrypoint that may be kept is the non-secret staging state of the dedicated `awoooi-cd-lane-drain.service`, and it must meet all of the following conditions:

- `config.yaml` allows only `capacity <= 1`, `awoooi-host:host`, and `awoooi-ubuntu:docker://192.168.0.110:5000/awoooi/ci-runner:act-22.04`.
- `awoooi_cd_lane_controlled` must be an executable ELF and must not be a fail-closed shell stub.
- The unit must contain `ConditionPathExists=/home/wooo/awoooi-cd-lane-drain/data/.runner`, CPU / Memory / Tasks accounting and limits, and `NoNewPrivileges=true`.
- During the staging phase the service must be inactive, with `MainPID=0`, and neither enabled nor masked; the runner token registration and the `.runner` contents still must not be read or printed.

For readback, always run `scripts/reboot-recovery/enforce-110-runner-failclosed.sh --check` first and look at `CONTROLLED_DRAIN_STAGING_ALLOWED`; then run `ops/runner/check-awoooi-110-controlled-cd-lane-readiness.sh` and look at `CONFIG_READY`, `BINARY_READY`, `REGISTRATION_READY`, and `SERVICE_READY`. If only the registration / service inactive blockers remain, the non-secret guardrails have converged and the next step is the token-safe registration path; repeatedly re-adding config / binary by hand must not replace fixing the source enforcer.

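The readback order above amounts to a small decision table. The sketch below is illustrative only: it assumes the readiness script reports each flag as `KEY=1` when ready (the exact output format is not shown in this SOP), and the returned action labels are hypothetical names chosen for the example.

```python
def next_step(staging_allowed: bool, readiness: dict) -> str:
    """Map readback flags to a coarse next action (labels are illustrative)."""
    def ready(key: str) -> bool:
        # Assumption: a flag counts as ready only when its value is "1".
        return readiness.get(key) == "1"

    # Non-secret guardrails come first: fix the source, not the artifacts.
    if not (staging_allowed and ready("CONFIG_READY") and ready("BINARY_READY")):
        return "fix-source-enforcer-or-unit-guardrail"
    # Only registration / service blockers remain.
    if not (ready("REGISTRATION_READY") and ready("SERVICE_READY")):
        return "token-safe-registration-path"
    return "lane-ready"
```

For example, `next_step(True, {"CONFIG_READY": "1", "BINARY_READY": "1", "REGISTRATION_READY": "0", "SERVICE_READY": "0"})` falls through to the registration branch, which is the converged state this section describes.
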
### Key dependency notes

| Service | Key dependency | If the dependency fails |

@@ -96,6 +96,13 @@ LIVE_BINARY_PATHS=(
  "/home/wooo/awoooi-controlled-runner/awoooi_controlled_runner"
)

CONTROLLED_DRAIN_UNIT="${CONTROLLED_DRAIN_UNIT:-awoooi-cd-lane-drain.service}"
CONTROLLED_DRAIN_DIR="${CONTROLLED_DRAIN_DIR:-/home/wooo/awoooi-cd-lane-drain}"
CONTROLLED_DRAIN_BINARY="${CONTROLLED_DRAIN_BINARY:-$CONTROLLED_DRAIN_DIR/awoooi_cd_lane_controlled}"
CONTROLLED_DRAIN_CONFIG="${CONTROLLED_DRAIN_CONFIG:-$CONTROLLED_DRAIN_DIR/config.yaml}"
CONTROLLED_DRAIN_REGISTRATION="${CONTROLLED_DRAIN_REGISTRATION:-$CONTROLLED_DRAIN_DIR/data/.runner}"
CONTROLLED_DRAIN_MAX_CAPACITY="${CONTROLLED_DRAIN_MAX_CAPACITY:-1}"

as_root() {
  if [ "${EUID:-$(id -u)}" -eq 0 ]; then
    "$@"
@@ -137,6 +144,124 @@ count_runner_processes() {
  pgrep -f '^/home/wooo/act-runner/act_runner|^/home/wooo/act-runner-controlled/act_runner|^/home/wooo/awoooi-controlled-runner/awoooi_controlled_runner|Runner.Listener|Runner.Worker' 2>/dev/null | wc -l | tr -d ' '
}

extract_runner_capacity() {
  local config_path="$1"
  awk '
    /^runner:[[:space:]]*$/ {
      in_runner=1
      next
    }
    in_runner && /^[^[:space:]]/ && $0 !~ /^runner:[[:space:]]*$/ {
      in_runner=0
    }
    in_runner && /^[[:space:]]*capacity:[[:space:]]*/ {
      line=$0
      sub(/^[[:space:]]*capacity:[[:space:]]*/, "", line)
      gsub(/["'\'']/, "", line)
      print line
      exit
    }
  ' "$config_path"
}

extract_runner_labels() {
  local config_path="$1"
  awk '
    /^[[:space:]]*labels:[[:space:]]*$/ {
      in_labels=1
      next
    }
    in_labels && /^[[:space:]]*-[[:space:]]*/ {
      line=$0
      sub(/^[[:space:]]*-[[:space:]]*"/, "", line)
      sub(/^[[:space:]]*-[[:space:]]*/, "", line)
      sub(/"[[:space:]]*$/, "", line)
      print line
      next
    }
    in_labels && /^[^[:space:]]/ {
      in_labels=0
    }
  ' "$config_path"
}

label_name() {
  printf '%s' "${1%%:*}"
}

controlled_drain_config_safe() {
  local capacity labels label name has_host=0 has_ubuntu=0
  [ -r "$CONTROLLED_DRAIN_CONFIG" ] || return 1
  capacity="$(extract_runner_capacity "$CONTROLLED_DRAIN_CONFIG" | head -1)"
  printf '%s' "${capacity:-}" | grep -Eq '^[0-9]+$' || return 1
  [ "$capacity" -le "$CONTROLLED_DRAIN_MAX_CAPACITY" ] || return 1
  labels="$(extract_runner_labels "$CONTROLLED_DRAIN_CONFIG" || true)"
  [ -n "$labels" ] || return 1
  while IFS= read -r label; do
    [ -n "$label" ] || continue
    name="$(label_name "$label")"
    case "$name" in
      awoooi-host)
        [ "$label" = "awoooi-host:host" ] || return 1
        has_host=1
        ;;
      awoooi-ubuntu)
        [ "$label" = "awoooi-ubuntu:docker://192.168.0.110:5000/awoooi/ci-runner:act-22.04" ] || return 1
        has_ubuntu=1
        ;;
      ubuntu-latest|ubuntu-*|self-hosted|stockplatform*|stock-platform*|headless*|playwright*)
        return 1
        ;;
      *)
        return 1
        ;;
    esac
  done <<<"$labels"
  [ "$has_host" -eq 1 ] && [ "$has_ubuntu" -eq 1 ]
}

controlled_drain_binary_safe() {
  local kind
  [ -f "$CONTROLLED_DRAIN_BINARY" ] && [ -x "$CONTROLLED_DRAIN_BINARY" ] || return 1
  kind="$(file -b "$CONTROLLED_DRAIN_BINARY" 2>/dev/null || echo missing)"
  grep -qi 'ELF' <<<"$kind"
}

controlled_drain_unit_safe() {
  local text
  text="$(systemctl cat "$CONTROLLED_DRAIN_UNIT" 2>/dev/null || true)"
  [ -n "$text" ] || return 1
  grep -Fq -- "ConditionPathExists=$CONTROLLED_DRAIN_REGISTRATION" <<<"$text" || return 1
  grep -Fq -- "$CONTROLLED_DRAIN_BINARY daemon --config $CONTROLLED_DRAIN_CONFIG" <<<"$text" || return 1
  grep -Eq '^[[:space:]]*CPUAccounting=true' <<<"$text" || return 1
  grep -Eq '^[[:space:]]*CPUQuota=' <<<"$text" || return 1
  grep -Eq '^[[:space:]]*MemoryAccounting=true' <<<"$text" || return 1
  grep -Eq '^[[:space:]]*Memory(High|Max)=' <<<"$text" || return 1
  grep -Eq '^[[:space:]]*TasksAccounting=true' <<<"$text" || return 1
  grep -Eq '^[[:space:]]*TasksMax=' <<<"$text" || return 1
  grep -Eq '^[[:space:]]*NoNewPrivileges=true' <<<"$text" || return 1
}

controlled_drain_service_inactive() {
  local load active unitfile mainpid
  load="$(systemctl show "$CONTROLLED_DRAIN_UNIT" -p LoadState --value 2>/dev/null || true)"
  active="$(systemctl show "$CONTROLLED_DRAIN_UNIT" -p ActiveState --value 2>/dev/null || true)"
  unitfile="$(systemctl show "$CONTROLLED_DRAIN_UNIT" -p UnitFileState --value 2>/dev/null || true)"
  mainpid="$(systemctl show "$CONTROLLED_DRAIN_UNIT" -p MainPID --value 2>/dev/null || true)"
  { [ "$active" = "inactive" ] || [ "$active" = "failed" ] || [ "$active" = "unknown" ] || [ -z "$active" ]; } || return 1
  [ "${mainpid:-0}" = "0" ] || return 1
  [ "$load" != "masked" ] || return 1
  [ "$unitfile" != "masked" ] || return 1
  [ "$unitfile" != "enabled" ] || return 1
}

controlled_drain_staging_allowed() {
  controlled_drain_config_safe \
    && controlled_drain_binary_safe \
    && controlled_drain_unit_safe \
    && controlled_drain_service_inactive
}

list_action_runner_units() {
  {
    systemctl list-unit-files 'actions.runner.*' --no-legend --plain 2>/dev/null | awk '{print $1}'
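To make the config guardrail in this hunk easier to reason about, here is a rough Python mirror of `extract_runner_capacity`, `extract_runner_labels`, and `controlled_drain_config_safe`. It is a sketch for illustration, not code from the commit: the function names and the `SAMPLE` config are hypothetical, and quote handling is slightly simplified compared with the awk version.

```python
import re

# The two labels the enforcer accepts, copied from the script in this hunk.
ALLOWED_LABELS = {
    "awoooi-host:host",
    "awoooi-ubuntu:docker://192.168.0.110:5000/awoooi/ci-runner:act-22.04",
}


def extract_capacity(text):
    """Return the raw `capacity:` value inside the top-level `runner:` block."""
    in_runner = False
    for line in text.splitlines():
        if re.match(r"^runner:\s*$", line):
            in_runner = True
            continue
        if in_runner and re.match(r"^\S", line):
            in_runner = False  # another top-level key ends the block
        if in_runner:
            match = re.match(r"^\s*capacity:\s*(.*)$", line)
            if match:
                return match.group(1).replace('"', "").replace("'", "")
    return None


def extract_labels(text):
    """Collect `- item` entries that follow a bare `labels:` line."""
    labels, in_labels = [], False
    for line in text.splitlines():
        if re.match(r"^\s*labels:\s*$", line):
            in_labels = True
            continue
        if not in_labels:
            continue
        match = re.match(r"^\s*-\s*(.*)$", line)
        if match:
            labels.append(match.group(1).strip().strip('"'))
        elif re.match(r"^\S", line):
            in_labels = False
    return labels


def config_safe(text, max_capacity=1):
    """Fail closed unless capacity and labels both satisfy the allowlist."""
    capacity = extract_capacity(text)
    if capacity is None or not re.fullmatch(r"[0-9]+", capacity):
        return False
    if int(capacity) > max_capacity:
        return False
    labels = [label for label in extract_labels(text) if label]
    # Both allowed labels must be present and nothing else may appear.
    return set(labels) == ALLOWED_LABELS


# Hypothetical config used only to exercise the sketch.
SAMPLE = """\
runner:
  capacity: 1
  labels:
    - "awoooi-host:host"
    - "awoooi-ubuntu:docker://192.168.0.110:5000/awoooi/ci-runner:act-22.04"
"""
```

The set comparison expresses the same allowlist as the `case` statement: any extra label, including the explicitly denied `ubuntu-latest` or `self-hosted` families, makes the check fail.
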
@@ -147,6 +272,11 @@ list_action_runner_units() {
stop_and_mask_units() {
  local unit
  for unit in "${RUNNER_UNITS[@]}"; do
    if [ "$unit" = "$CONTROLLED_DRAIN_UNIT" ] && controlled_drain_staging_allowed; then
      as_root systemctl reset-failed "$unit" >/dev/null 2>&1 || true
      as_root systemctl disable "$unit" >/dev/null 2>&1 || true
      continue
    fi
    as_root systemctl kill --signal=SIGKILL "$unit" >/dev/null 2>&1 || true
    as_root systemctl stop "$unit" >/dev/null 2>&1 || true
    as_root systemctl reset-failed "$unit" >/dev/null 2>&1 || true
@@ -218,6 +348,9 @@ seal_lane_binary_restore_sources() {
  local path
  while IFS= read -r -d '' path; do
    [ -e "$path" ] || continue
    if [ "$path" = "$CONTROLLED_DRAIN_BINARY" ] && controlled_drain_staging_allowed; then
      continue
    fi
    write_failclosed_stub "$path"
  done < <(
    {
@@ -234,6 +367,9 @@ quarantine_lane_registration_sources() {
  local target
  for lane_dir in "/home/wooo/awoooi-cd-lane" "/home/wooo/awoooi-cd-lane-drain"; do
    [ -d "$lane_dir" ] || continue
    if [ "$lane_dir" = "$CONTROLLED_DRAIN_DIR" ] && controlled_drain_staging_allowed; then
      continue
    fi
    quarantine_dir="$lane_dir/quarantine-failclosed-${STAMP}"
    as_root chattr -i "$lane_dir" "$lane_dir/data" >/dev/null 2>&1 || true
    as_root mkdir -p "$quarantine_dir" >/dev/null 2>&1 || true
@@ -257,6 +393,9 @@ quarantine_lane_registration_sources() {
seal_live_binary_paths() {
  local path
  for path in "${LIVE_BINARY_PATHS[@]}"; do
    if [ "$path" = "$CONTROLLED_DRAIN_BINARY" ] && controlled_drain_staging_allowed; then
      continue
    fi
    write_failclosed_stub "$path"
  done
}
@@ -666,7 +805,10 @@ mask_unit_file_to_devnull() {

seal_lane_unit_files() {
  mask_unit_file_to_devnull "awoooi-cd-lane.service"
  mask_unit_file_to_devnull "awoooi-cd-lane-drain.service"
  if controlled_drain_staging_allowed; then
    return 0
  fi
  mask_unit_file_to_devnull "$CONTROLLED_DRAIN_UNIT"
}

root_restore_sources_left() {
@@ -680,6 +822,9 @@ root_restore_sources_left() {
unit_ok() {
  local unit="$1"
  local load active unitfile mainpid
  if [ "$unit" = "$CONTROLLED_DRAIN_UNIT" ] && controlled_drain_staging_allowed; then
    return 0
  fi
  load="$(systemctl show "$unit" -p LoadState --value 2>/dev/null || true)"
  active="$(systemctl show "$unit" -p ActiveState --value 2>/dev/null || true)"
  unitfile="$(systemctl show "$unit" -p UnitFileState --value 2>/dev/null || true)"
@@ -729,6 +874,9 @@ awoooi_runner_failclosed_enforcer_root_restore_sources_left $(root_restore_sourc
# HELP awoooi_runner_failclosed_enforcer_apply_performed Whether this run used apply mode.
# TYPE awoooi_runner_failclosed_enforcer_apply_performed gauge
awoooi_runner_failclosed_enforcer_apply_performed $APPLY_PERFORMED
# HELP awoooi_runner_failclosed_enforcer_controlled_drain_staging_allowed Controlled drain lane non-secret guardrail staging allowance.
# TYPE awoooi_runner_failclosed_enforcer_controlled_drain_staging_allowed gauge
awoooi_runner_failclosed_enforcer_controlled_drain_staging_allowed $(controlled_drain_staging_allowed && echo 1 || echo 0)
EOF
  as_root install -o root -g root -m 0644 "$tmp" "$dir/awoooi_runner_failclosed_enforcer.prom" >/dev/null 2>&1 || true
  rm -f "$tmp"
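The new gauge follows the usual three-line textfile layout (`# HELP`, `# TYPE`, then the sample). As a small sketch of what the heredoc above produces for this one metric, the hypothetical helper `render_staging_metric` below builds the same lines; it is an illustration, not part of the enforcer.

```python
METRIC = "awoooi_runner_failclosed_enforcer_controlled_drain_staging_allowed"


def render_staging_metric(allowed: bool) -> str:
    """Render the gauge as HELP, TYPE, and sample lines, newline-terminated."""
    lines = [
        f"# HELP {METRIC} Controlled drain lane non-secret guardrail staging allowance.",
        f"# TYPE {METRIC} gauge",
        f"{METRIC} {1 if allowed else 0}",
    ]
    return "\n".join(lines) + "\n"
```

A value of `0` is still a useful sample: it distinguishes "the enforcer ran and refused staging" from "the metric is absent because the enforcer never ran".
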
@@ -743,6 +891,7 @@ print_readback() {
  echo "LANE_PROCESS_COUNT=$(count_lane_processes)"
  echo "RUNNER_PROCESS_COUNT=$(count_runner_processes)"
  echo "ROOT_RESTORE_SOURCES_LEFT=$(root_restore_sources_left)"
  echo "CONTROLLED_DRAIN_STAGING_ALLOWED=$(controlled_drain_staging_allowed && echo 1 || echo 0)"
  echo "RUNNER_UNITS_BAD_COUNT=$(runner_units_bad_count)"
  for unit in "${RUNNER_UNITS[@]}"; do
    load="$(systemctl show "$unit" -p LoadState --value 2>/dev/null || true)"

@@ -22,6 +22,7 @@ REPAIR_STARTUP_STUB = (
FAILCLOSED_ENFORCER = (
    ROOT / "scripts" / "reboot-recovery" / "enforce-110-runner-failclosed.sh"
)
CONTROLLED_CD_LANE_DRAIN_UNIT = ROOT / "ops" / "runner" / "awoooi-cd-lane-drain.service"
SSH_AUTH_DIAGNOSE = (
    ROOT / "scripts" / "reboot-recovery" / "diagnose-110-ssh-publickey-auth.sh"
)
@@ -206,6 +207,47 @@ def test_runner_failclosed_enforcer_does_not_seal_live_startup_recovery_script()
    assert "awoooi-startup-110.sh.*controlled*" in text


def test_runner_failclosed_enforcer_preserves_controlled_drain_staging_only() -> None:
    text = FAILCLOSED_ENFORCER.read_text(encoding="utf-8")

    assert "controlled_drain_staging_allowed()" in text
    assert "controlled_drain_config_safe" in text
    assert "controlled_drain_binary_safe" in text
    assert "controlled_drain_unit_safe" in text
    assert "controlled_drain_service_inactive" in text
    assert "awoooi-host:host" in text
    assert (
        "awoooi-ubuntu:docker://192.168.0.110:5000/awoooi/ci-runner:act-22.04"
        in text
    )
    assert "ubuntu-latest|ubuntu-*|self-hosted|stockplatform*|stock-platform*|headless*|playwright*)" in text
    assert 'grep -Fq -- "ConditionPathExists=$CONTROLLED_DRAIN_REGISTRATION"' in text
    assert 'grep -Eq \'^[[:space:]]*CPUAccounting=true\'' in text
    assert 'grep -Eq \'^[[:space:]]*MemoryAccounting=true\'' in text
    assert 'grep -Eq \'^[[:space:]]*TasksAccounting=true\'' in text
    assert '[ "$unitfile" != "enabled" ] || return 1' in text
    assert 'if [ "$unit" = "$CONTROLLED_DRAIN_UNIT" ] && controlled_drain_staging_allowed; then' in text
    assert 'if [ "$path" = "$CONTROLLED_DRAIN_BINARY" ] && controlled_drain_staging_allowed; then' in text
    assert 'if [ "$lane_dir" = "$CONTROLLED_DRAIN_DIR" ] && controlled_drain_staging_allowed; then' in text
    assert "CONTROLLED_DRAIN_STAGING_ALLOWED=" in text


def test_controlled_cd_lane_unit_source_has_required_accounting_guardrails() -> None:
    text = CONTROLLED_CD_LANE_DRAIN_UNIT.read_text(encoding="utf-8")

    assert "ConditionPathExists=/home/wooo/awoooi-cd-lane-drain/data/.runner" in text
    assert "CPUAccounting=true" in text
    assert "CPUQuota=250%" in text
    assert "MemoryAccounting=true" in text
    assert "MemoryHigh=8G" in text
    assert "MemoryMax=12G" in text
    assert "TasksAccounting=true" in text
    assert "TasksMax=512" in text
    assert "IOAccounting=true" in text
    assert "IOWeight=100" in text
    assert "NoNewPrivileges=true" in text


def test_110_ssh_publickey_auth_diagnosis_is_bounded_and_read_only() -> None:
    text = SSH_AUTH_DIAGNOSE.read_text(encoding="utf-8")
