fix(reboot): reject unknown uptime as fresh boot

Your Name
2026-07-03 00:45:28 +08:00
parent b97bc6f35e
commit 85d8eeb0db
4 changed files with 100 additions and 8 deletions

@@ -1,3 +1,23 @@
## 2026-07-03 — 00:43 P0-006 reboot-event detector unknown uptime false-fresh fix
**Completed**
- Ran a verify-only host probe / event detector on the production action matrix primary lane `reboot_event_detector_and_host_probe`; no reboot, no restart, and no state written.
- Live host probe evidence: 99 reachable but `uptime_seconds=unknown`; 110 reachable / systemd running / startup unit inactive; 111 `reachable=0`; 112/120/121 reachable; 188 reachable but `systemd_state=degraded` with `awoooi-startup.service failed`.
- Found and fixed a bug in `scripts/reboot-recovery/reboot-event-detector.py`: the old logic parsed `uptime_seconds=unknown` as `-1`, and because `-1 <= target_seconds` always holds, the ping-only readback from 99 was misclassified as `fresh_boot_hosts=['99']`. The fix also excludes `reachable_unknown_boot` from boot-id change detection.
- New rule: a fresh boot requires `reachable=true`, `uptime_seconds >= 0`, and an uptime inside the target window; a boot-id change requires that neither the previous nor the current boot id is an `unknown` / `reachable_unknown_boot` placeholder.
- `FULL-STACK-COLD-START-SOP.md` bumped to v1.98, which pins the rule that an unknown uptime must not be used as fresh-reboot evidence.
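The false positive described in this entry can be reproduced in isolation. The sketch below is a minimal illustration rather than the detector's actual code: `parse_uptime`, `fresh_boot_old`, and `fresh_boot_new` are hypothetical names, and `parse_uptime` assumes only that non-numeric input falls back to `-1`, as the entry states.

```python
from typing import Any


def parse_uptime(value: Any, default: int = -1) -> int:
    """Hypothetical stand-in: non-numeric input such as "unknown" yields default."""
    try:
        return int(str(value))
    except (TypeError, ValueError):
        return default


def fresh_boot_old(reachable: bool, uptime: Any, target_seconds: int) -> bool:
    # Old check: -1 (unknown) is always <= target_seconds, so it passes.
    return bool(reachable and parse_uptime(uptime) <= target_seconds)


def fresh_boot_new(reachable: bool, uptime: Any, target_seconds: int) -> bool:
    # New check: a negative value means "unknown" and is rejected.
    seconds = parse_uptime(uptime)
    return bool(reachable and 0 <= seconds <= target_seconds)


if __name__ == "__main__":
    target = 10 * 60  # 10-minute window, in seconds
    print(fresh_boot_old(True, "unknown", target))  # True (the false positive)
    print(fresh_boot_new(True, "unknown", target))  # False
    print(fresh_boot_new(True, "120", target))      # True
```

Keeping `-1` as the sentinel and adding an explicit lower bound is the smallest change that closes the gap; returning `None` for unknown values would be an alternative, at the cost of touching every caller.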
**Verification**
- `python3.11 -m pytest scripts/reboot-recovery/tests/test_reboot_event_detector.py scripts/reboot-recovery/tests/test_reboot_auto_recovery_slo_scorecard.py -q -p no:cacheprovider`: `15 passed`.
- `DATABASE_URL=postgresql+asyncpg://test:test@localhost/test PYTHONPATH=apps/api python3.11 -m pytest apps/api/tests/test_reboot_auto_recovery_slo_scorecard_api.py -q -p no:cacheprovider`: `8 passed`.
- `python3.11 -m py_compile scripts/reboot-recovery/reboot-event-detector.py scripts/reboot-recovery/reboot-auto-recovery-slo-scorecard.py apps/api/src/services/reboot_auto_recovery_slo_scorecard.py`: passed.
- `git diff --check`: passed.
- Post-fix live no-write detector artifact `/tmp/awoooi-reboot-detector-fix-20260703-004256`: `reboot_detected=false`, `fresh_boot_hosts=[]`, `rebooted_hosts=[]`, `unreachable_hosts=['111']`, `all_required_hosts_observed=false`, `all_required_hosts_in_reboot_window=false`, `recovery_deadline_status=target_window_elapsed`, `state_written=false`.
**Still in effect**
- P0-006 still does not meet the 10-minute all-host automatic recovery SLO: 111 unreachable, 188 degraded/startup failed, 99 uptime unknown, and the Windows99 VMware verifier / no-secret remote execution channel not being ready must all remain blockers.
- Did not read secrets / tokens / `.env` / raw sessions / SQLite / auth; did not use GitHub / gh; no workflow_dispatch; no host / VM / service restart; no Docker / Nginx / K3s / DB / firewall restart; no DROP / TRUNCATE / restore / prune / delete / force push.
## 2026-07-03 — 00:36 P0-006 reboot SLO action-matrix alert routing
**Completed**

@@ -1,6 +1,6 @@
# AWOOOI Full-Stack Cold Start and Host Reboot SOP
-> Version: v1.97
+> Version: v1.98
> Last updated: 2026-07-03 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.
@@ -28,6 +28,8 @@ v1.96 reboot SLO per-blocker alert projection rule: after a reboot, do not look only at `awoo
v1.97 reboot SLO action-matrix routing rule: after a reboot, active blockers must not simply be handed to a human to interpret. `reboot-auto-recovery-slo-scorecard.py`, production `/api/v1/agents/reboot-auto-recovery-slo-scorecard`, and the exporter artifact must output `active_blocker_action_matrix`, and every blocker must always include `category`, `owner_lane`, `telegram_severity`, `evidence_inputs`, `next_safe_action`, `post_verifier`, `controlled_apply_mode`, and `forbidden_actions`. The `awoooi_reboot_auto_recovery_slo_active_blocker` metric must carry the `category`, `owner_lane`, `severity`, and `primary` labels. Telegram / Alertmanager read these labels first; if the labels are missing, the only valid conclusion is exporter/action-matrix drift. `windows99_remote_execution_channel_unavailable` must be ordered before `windows99_vmware_autostart_readback_missing`, because while the no-secret collection channel has not been restored it is not acceptable to just wait for the VMware verifier. Every action matrix row has `controlled_apply_authorized_by_scorecard=false`, meaning the scorecard authorizes only verifier / check-mode routing and does not authorize reboots, VM power changes, Docker / Nginx / K3s / DB / firewall restarts, restore, prune, delete, or secret reads; low-risk controlled apply must go through each lane's own check-mode, rollback, and post-verifier.
v1.98 reboot-event detector unknown-uptime rule: `reboot-event-detector.py` must not treat `uptime_seconds=unknown`, `boot_id=unknown`, or `boot_id=reachable_unknown_boot` as a fresh reboot or as a changed boot id. A host enters `fresh_boot_hosts` only when `reachable=true`, `uptime_seconds >= 0`, and `uptime_seconds <= target_seconds`; `boot_id_changed` may be asserted only when neither the previous nor the current boot id is a placeholder. If 99 is only ping / TCP reachable, 111 is unreachable, or 188 is degraded or has a failed startup, `all_required_hosts_in_reboot_window=false` and the blocked SLO state must be kept; a ping-only host must not be used as evidence of a reboot within 10 minutes.
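The boot-id half of the v1.98 rule can be sketched as a small standalone predicate. This is an illustration under stated assumptions, not the SOP's normative code: `PLACEHOLDER_BOOT_IDS`, `normalize_boot_id`, and `boot_id_changed` are hypothetical names, and the placeholder set contains only the values the rule names plus the empty string.

```python
from typing import Any

# Placeholder values named in the v1.98 rule; they carry no boot identity.
PLACEHOLDER_BOOT_IDS = {"", "unknown", "reachable_unknown_boot"}


def normalize_boot_id(value: Any) -> str:
    """Return the boot id, or "" when it is missing or a placeholder."""
    boot_id = str(value or "")
    return "" if boot_id in PLACEHOLDER_BOOT_IDS else boot_id


def boot_id_changed(previous: Any, current: Any) -> bool:
    """True only when both ids are real (non-placeholder) and differ."""
    prev_id = normalize_boot_id(previous)
    curr_id = normalize_boot_id(current)
    return bool(prev_id and curr_id and prev_id != curr_id)


if __name__ == "__main__":
    # A ping-only readback must not look like a boot-id change.
    print(boot_id_changed("win-boot-1", "reachable_unknown_boot"))  # False
    print(boot_id_changed("win-boot-1", "win-boot-2"))              # True
    print(boot_id_changed("unknown", "win-boot-2"))                 # False
```

Normalizing placeholders to the empty string lets one truthiness check cover missing, `unknown`, and `reachable_unknown_boot` values alike.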
2026-07-02 110 control-path / Harbor recovery receipt rule: if the Gitea Harbor repair queue still retains `harbor_110_remote_ssh_publickey_auth_stalled`, remote-control unavailable, jobs stale, or historical failure, but local evidence from the same round also proves that the `wooo` command path is ready, 110 local Harbor `/v2/` is ready, and public/internal registry `/v2/` returns `401`, then that Gitea Harbor repair failure may be listed only as historical queue metadata and must no longer be treated as a current SSH blocker. Use `/api/v1/agents/harbor-registry-controlled-recovery-receipt` or an equivalent validator to merge `diagnose-110-ssh-publickey-auth.sh`, `recover-110-control-path-and-harbor-local.sh --check`, the public Gitea queue readback, and the registry `/v2/` verifier, and write the machine-readable result to a snapshot of the `docs/operations/harbor-110-control-path-recovery-readback-2026-07-02.snapshot.json` type. The 2026-07-02 live receipt shows: public/internal registry `/v2/` both return `401`; latest visible CD `#4335` is `Success`; the Gitea Harbor repair failure is already `historical_after_latest_cd_success=true`; and the active blockers have converged to the 110 controlled CD lane config / binary / registration / service guardrail, active action container pressure, and the Gitea CD jobs head-SHA / stale readback mismatch. If local-console output contains only the `AWOOOI_110_CONTROLLED_CD_LANE_READY` marker, the non110 runner parser must not derive a non110 blocker from 110 `BLOCKER` lines; non110 may be listed as an active blocker only when the `AWOOOI_NON110_RUNNER_READY` marker is seen.
2026-07-02 110 controlled CD lane fail-closed enforcer staging rule: after the 110 runner pressure incident, legacy / generic runners must still fail closed, but the non-secret staging artifacts of `awoooi-cd-lane-drain.service` must no longer be indiscriminately sealed back to a stub by the enforcer. `scripts/reboot-recovery/enforce-110-runner-failclosed.sh` may keep the drain config / binary / unit, and output `CONTROLLED_DRAIN_STAGING_ALLOWED=1` plus a textfile metric, only when all of the following hold: `config.yaml` satisfies `capacity <= 1` and contains only `awoooi-host:host` and `awoooi-ubuntu:docker://192.168.0.110:5000/awoooi/ci-runner:act-22.04`; the binary is an executable ELF; the systemd unit has guardrails such as `ConditionPathExists=/home/wooo/awoooi-cd-lane-drain/data/.runner` and `CPUAccounting` / `MemoryAccounting` / `TasksAccounting` / `NoNewPrivileges`; and the service is `inactive` with `MainPID=0` and is neither enabled nor masked. This staging rule must not read tokens, must not read the contents of `.runner`, must not register a runner, and must not start the service; if registration is missing, the readiness verifier must still leave only blockers of the `controlled_cd_lane_registration_missing` / `controlled_cd_lane_service_not_active` kind. If `CONTROLLED_DRAIN_STAGING_ALLOWED=0` and the config / binary has been moved away again, fix the source enforcer / unit guardrail first rather than manually restoring the same set of artifacts over and over.

@@ -45,6 +45,13 @@ def int_value(value: Any, default: int = -1) -> int:
         return default


+def known_boot_id(value: Any) -> str:
+    boot_id = str(value or "")
+    if boot_id in {"", "unknown", "reachable_unknown_boot"}:
+        return ""
+    return boot_id
+
+
 def parse_host_probe(text: str) -> list[dict[str, Any]]:
     rows: list[dict[str, Any]] = []
     for raw_line in text.splitlines():
@@ -114,19 +121,22 @@ def build_payload(args: argparse.Namespace) -> dict[str, Any]:
         if not current["reachable"]:
             unreachable_hosts.append(alias)
         previous_boot_id = (
-            str(previous_host.get("boot_id"))
-            if isinstance(previous_host, dict) and previous_host.get("boot_id")
+            known_boot_id(previous_host.get("boot_id"))
+            if isinstance(previous_host, dict)
             else ""
         )
-        current_boot_id = str(current.get("boot_id") or "")
+        current_boot_id = known_boot_id(current.get("boot_id"))
         boot_id_changed = bool(
             previous_boot_id
-            and previous_boot_id != "unknown"
             and current_boot_id
-            and current_boot_id != "unknown"
             and previous_boot_id != current_boot_id
         )
-        fresh_boot = bool(current.get("reachable") and int_value(current.get("uptime_seconds")) <= target_seconds)
+        uptime_seconds = int_value(current.get("uptime_seconds"))
+        fresh_boot = bool(
+            current.get("reachable")
+            and uptime_seconds >= 0
+            and uptime_seconds <= target_seconds
+        )
         if boot_id_changed:
             changed_boot_id_hosts.append(alias)
         if fresh_boot:
@@ -142,7 +152,7 @@ def build_payload(args: argparse.Namespace) -> dict[str, Any]:
                     "uptime_seconds": current.get("uptime_seconds"),
                     "deadline_at": (
                         observed_at
-                        + timedelta(seconds=max(0, target_seconds - int_value(current.get("uptime_seconds"), 0)))
+                        + timedelta(seconds=max(0, target_seconds - uptime_seconds))
                     ).isoformat(timespec="seconds"),
                 }
             )
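The `deadline_at` expression in the last hunk reduces to simple arithmetic: the host booted at `observed_at - uptime`, so the recovery window closes at `observed_at + (target - uptime)`, floored at zero once the window has passed. The sketch below works through two cases; `recovery_deadline` is a hypothetical helper name, and it assumes the uptime is known and non-negative.

```python
from datetime import datetime, timedelta


def recovery_deadline(observed_at: datetime, uptime_seconds: int, target_seconds: int) -> str:
    """Hypothetical helper: time left in the window is target minus uptime, floored at 0."""
    remaining = max(0, target_seconds - uptime_seconds)
    return (observed_at + timedelta(seconds=remaining)).isoformat(timespec="seconds")


if __name__ == "__main__":
    observed = datetime.fromisoformat("2026-06-30T18:00:00+08:00")
    # 120 s after boot with a 600 s target leaves 480 s.
    print(recovery_deadline(observed, 120, 600))  # 2026-06-30T18:08:00+08:00
    # Uptime beyond the target: the deadline collapses to the observation time.
    print(recovery_deadline(observed, 900, 600))  # 2026-06-30T18:00:00+08:00
```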

@@ -110,3 +110,63 @@ def test_reboot_detector_fails_visible_when_windows_or_vm_host_missing(tmp_path:
     assert "99" in payload["missing_hosts"]
     assert payload["all_required_hosts_observed"] is False
     assert payload["all_required_hosts_in_reboot_window"] is False
+
+
+def test_reboot_detector_does_not_treat_unknown_uptime_as_fresh_boot(
+    tmp_path: Path,
+) -> None:
+    probe_path = tmp_path / "host-probe.txt"
+    state_path = tmp_path / "state.json"
+    output_path = tmp_path / "event.json"
+    probe_path.write_text(
+        "\n".join(
+            [
+                "AWOOOI_REBOOT_AUTO_RECOVERY_HOST_PROBE=1",
+                "TARGET_HOSTS=99",
+                (
+                    "HOST_BOOT alias=99 target=192.168.0.99 "
+                    "startup_unit=vmware-host-autostart reachable=1 "
+                    "boot_id=reachable_unknown_boot uptime_seconds=unknown "
+                    "systemd_state=ping_reachable startup_enabled=unknown "
+                    "startup_active=unknown"
+                ),
+            ]
+        )
+        + "\n",
+        encoding="utf-8",
+    )
+    state_path.write_text(
+        json.dumps({"hosts": {"99": {"boot_id": "win-boot-1"}}}),
+        encoding="utf-8",
+    )
+
+    subprocess.run(
+        [
+            sys.executable,
+            str(SCRIPT),
+            "--host-probe-file",
+            str(probe_path),
+            "--state-file",
+            str(state_path),
+            "--target-minutes",
+            "10",
+            "--generated-at",
+            "2026-06-30T18:00:00+08:00",
+            "--output",
+            str(output_path),
+            "--required-host",
+            "99",
+            "--no-write-state",
+        ],
+        check=True,
+    )
+
+    payload = json.loads(output_path.read_text(encoding="utf-8"))
+
+    assert payload["observed_hosts"] == ["99"]
+    assert payload["reboot_detected"] is False
+    assert payload["fresh_boot_hosts"] == []
+    assert payload["rebooted_hosts"] == []
+    assert payload["all_required_hosts_observed"] is True
+    assert payload["all_required_hosts_in_reboot_window"] is False
+    assert payload["state_written"] is False
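The fixture in this test feeds the detector a `HOST_BOOT` line made of whitespace-separated `key=value` fields. The body of `parse_host_probe` is not part of this diff, so the sketch below is only a guess at how such a line could be read: `parse_host_boot_line` is a hypothetical name, and it assumes that no field value contains whitespace.

```python
from typing import Optional


def parse_host_boot_line(line: str) -> Optional[dict[str, str]]:
    """Hypothetical parser: split one HOST_BOOT line into its key=value fields."""
    parts = line.split()
    if not parts or parts[0] != "HOST_BOOT":
        return None  # header lines such as TARGET_HOSTS=99 are not host rows
    fields: dict[str, str] = {}
    for token in parts[1:]:
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            fields[key] = value
    return fields


if __name__ == "__main__":
    row = parse_host_boot_line(
        "HOST_BOOT alias=99 target=192.168.0.99 reachable=1 "
        "boot_id=reachable_unknown_boot uptime_seconds=unknown"
    )
    print(row["alias"], row["uptime_seconds"])  # 99 unknown
```

Values stay as strings here on purpose, so that `unknown` reaches the later numeric and placeholder checks unchanged instead of being coerced during parsing.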