docs(ops): add post-start quick check SOP [skip ci]

2026-06-25 14:32:50 +08:00
parent 0eb303816b
commit 9f81ed0e50
4 changed files with 187 additions and 3 deletions
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -1,3 +1,14 @@
+## 2026-06-25｜One-page post-reboot quick check SOP reinforcement
+
+**Background**: SOP v1.51 can already determine full-stack service GREEN, but the long SOP is too exhaustive to serve as the operating page for the first T+10 minutes after every reboot. To avoid again confusing route 200, container healthy, DB freshness, backup, Wazuh registry, and DR escrow, this round adds a one-page post-start quick check.
+
+**Updates**:
+- Added `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md`, which fixes the read-only order for the first 10 minutes after a reboot: host / SSH, cold-start scorecard, MOMO freshness, backup / offsite / escrow, public routes, 110 CPU / runaway process.
+- `docs/runbooks/FULL-STACK-COLD-START-SOP.md` is bumped to `v1.52` and links directly to the quick check from the latest baseline section.
+- `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md` updates the P3 docs / automation contract row and P3-008 to clearly separate the short quick check from the long SOP / Plan B.
+
+**Boundary**: This round is docs-only: no SSH, Docker, systemd, Nginx, firewall, K8s, ArgoCD, Wazuh runtime, active scan, or secret operations. The quick check still forbids treating a website 200 as proof of fresh data, backup fresh as DR complete, or a Wazuh route 200 as agent registry accepted.
+
 ## 2026-06-25｜14:16 full cold-start GREEN / MOMO data freshness recovered

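 **Background**: The 11:53 full cold-start was still blocked by stale MOMO business data. The 14:16 read-only refresh shows that MOMO has successfully imported new data and that both the data freshness gate and the Google Drive token metadata gate have recovered, so the release verdict in the SOP / workplan must be updated.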
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -1,6 +1,6 @@
 # AWOOOI Full-Stack Cold-Start and Host Reboot SOP

-> Version: v1.51
+> Version: v1.52
 > Last updated: 2026-06-25 Asia/Taipei
 > Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.

@@ -10,6 +10,8 @@

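 This section is the first decision anchor after every handover, power-on, shutdown, or reboot. If the date is not today, rerun the live check first, then update this section and `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md`.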

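+If you only need to decide quickly after a reboot whether recovery can be claimed, run the one-page quick check first: `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md`. The long SOP keeps the full background, exception handling, and Plan B; the short checklist owns the fixed verdict within T+10 minutes of every reboot.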
+
 2026-06-25 14:16 live read-only refresh supersedes the 11:53 BLOCKED wording. Hosts, routes, K3s, AWOOOI API health, MOMO service health, MOMO business data freshness, backup core/offsite, and core monitoring/exporter surfaces are green for controlled runner/CD release. MOMO is healthy on `V10.674`; latest import job `57` completed cleanly; `MOMO_DAILY_FRESHNESS 1|2026-06-24`; current-month daily snapshot and realtime tables match through `2026-06-24`. Full-stack service readiness is now GREEN, but DR remains blocked by missing credential escrow evidence (`escrow_missing=5`). Do not turn this into a DR complete or security/runtime acceptance claim. Wazuh host registry acceptance remains outside this SOP lane and is still not complete.

 ```text
--- a/docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md
+++ b/docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md
@@ -0,0 +1,171 @@
+# One-Page Post-Reboot Host Check
+
+> Version: v1.0
+> Last updated: 2026-06-25 Asia/Taipei
+> Scope: 110 / 120 / 121 / 188 post-reboot service recovery. 112 Kali / Wazuh / active scan are out of scope for this procedure.
+
+---
+
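+## 1. When to use
+
+After any of the hosts 110 / 120 / 121 / 188 is powered on, shut down, rebooted, recovered from a power loss, run through a VMware console fsck, or subjected to a large Docker / K3s reschedule, run this page first, then decide whether to claim recovery.
+
+This page answers only four questions:
+
+1. Did the hosts come up?
+2. Are the services actually usable?
+3. Are the data and backups fresh?
+4. What must not be claimed as complete?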
+
+---
+
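+## 2. Absolute verdict rules
+
+| Level | May claim | Required evidence |
+|------|----------|----------|
+| `HOST_BOOTED` | Host is powered on | ping / SSH port responds, or console login prompt. |
+| `HOST_READY` | Host is manageable | SSH read-only login works; no hard blockers in failed units / disk / clock / network. |
+| `SERVICE_READY` | Single-site service is usable | route / container / local health / DB or dependency health all pass. |
+| `FULL_STACK_GREEN` | Service recovery for this reboot is complete | cold-start `WARN=0` and `BLOCKED=0`; route, K3s, DB freshness, backup, alert, CronJob, and exporter all pass. |
+| `DR_COMPLETE` | Disaster recovery is also complete | `FULL_STACK_GREEN` plus credential escrow missing `0` and complete offsite / restore / escrow evidence. |
+
+Never substitute a single signal for the overall verdict:
+
+- A website `200` does not mean the data is fresh.
+- A container reporting `healthy` does not mean DB / backup / alert are fine.
+- A K3s node in `Ready` does not mean workload distribution and CronJob freshness are fine.
+- A Wazuh route `200` does not mean every host's agent registry is accepted.
+- Backup fresh does not mean DR complete; the credential escrow gap must be tracked separately.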
+
+---
+
+## 3. 10-minute read-only check order
+
+### Step 1 - Hosts and SSH
+
+```bash
+for host in 192.168.0.110 192.168.0.120 192.168.0.121 192.168.0.188; do
+  ping -c 1 -W 1 "$host" >/dev/null && echo "PING_OK $host" || echo "PING_FAIL $host"
+  nc -z -w 2 "$host" 22 && echo "SSH_PORT_OK $host" || echo "SSH_PORT_FAIL $host"
+done
+```
+
+If any P0 host fails, do not jump to fixing Nginx. First determine whether the problem is power / NIC / fsck / SSH trust / host boot.
+
+### Step 2 - Full-stack cold-start scorecard
+
+```bash
+scripts/reboot-recovery/full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1
+```
+
+Verdict:
+
+- `PASS>0 WARN=0 BLOCKED=0`: eligible as a `FULL_STACK_GREEN` candidate.
+- `WARN>0 BLOCKED=0`: may only claim `SERVICE_AVAILABLE_DEGRADED`; every WARN must be listed.
+- `BLOCKED>0`: recovery must not be claimed; handle the first blocker first.
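
The Step 2 verdict rules can be sketched as a small shell function. This is an illustrative sketch only: `classify_scorecard` and the labels `FULL_STACK_GREEN_CANDIDATE`, `BLOCKED`, `NO_EVIDENCE`, and `UNPARSEABLE` are hypothetical names, and it assumes the scorecard summary contains `PASS=<n> WARN=<n> BLOCKED=<n>` on one line, which may not match the real script's output.

```bash
# Hypothetical helper (not part of the repository): map a scorecard summary
# to the strongest wording Step 2 allows.
# Assumption: the summary line contains PASS=<n> WARN=<n> BLOCKED=<n>.
classify_scorecard() (
  line="$1"
  # Pull each counter out of the line; an empty result means it was missing.
  pass="$(printf '%s\n' "$line" | sed -n 's/.*PASS=\([0-9][0-9]*\).*/\1/p')"
  warn="$(printf '%s\n' "$line" | sed -n 's/.*WARN=\([0-9][0-9]*\).*/\1/p')"
  blocked="$(printf '%s\n' "$line" | sed -n 's/.*BLOCKED=\([0-9][0-9]*\).*/\1/p')"

  if [ -z "$pass" ] || [ -z "$warn" ] || [ -z "$blocked" ]; then
    echo "UNPARSEABLE"                 # never guess a verdict from partial output
  elif [ "$blocked" -gt 0 ]; then
    echo "BLOCKED"                     # handle the first blocker first
  elif [ "$warn" -gt 0 ]; then
    echo "SERVICE_AVAILABLE_DEGRADED"  # every WARN still has to be listed
  elif [ "$pass" -gt 0 ]; then
    echo "FULL_STACK_GREEN_CANDIDATE"
  else
    echo "NO_EVIDENCE"                 # PASS=0 proves nothing
  fi
)

classify_scorecard "PASS=89 WARN=0 BLOCKED=0"
```

The BLOCKED check comes first on purpose, so a line with both warnings and blockers is never reported as merely degraded.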
+
+### Step 3 - MOMO-specific freshness gate
+
+```bash
+scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh
+```
+
+Required fields:
+
+- `MOMO_HEALTH_VERSION`
+- `SCHEDULER_HEALTH`
+- `TOKEN_STAT` / `CONTAINER_TOKEN_STAT`: inspect metadata only; do not read the token.
+- `DB_MONTHLY_SYNC`
+- `DB_DAILY_FRESHNESS`
+- `DB_LATEST_DAILY_IMPORT_JOB`
+
+If `DB_DAILY_FRESHNESS > 2` or the import job failed, MOMO data must not be claimed as recovered.
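
A minimal sketch of the freshness threshold, for illustration only. `momo_freshness_gate` and the labels `FRESHNESS_OK` and `UNPARSEABLE` are hypothetical, and the sketch assumes the preflight prints a line shaped like `DB_DAILY_FRESHNESS 1|2026-06-24`, where the number before `|` is the data age in days. It covers only the freshness half of the rule; the import job result still has to be checked separately.

```bash
# Hypothetical helper (not part of the repository): apply the Step 3 threshold
# to one preflight line such as "DB_DAILY_FRESHNESS 1|2026-06-24".
# Assumption: the number before "|" is the data age in days.
momo_freshness_gate() (
  line="$1"
  value="${line#DB_DAILY_FRESHNESS }"   # drop the field name
  days="${value%%|*}"                   # keep the part before the first "|"
  case "$days" in
    ''|*[!0-9]*)
      echo "UNPARSEABLE"                # missing or non-numeric age
      ;;
    *)
      if [ "$days" -gt 2 ]; then
        echo "BLOCKED_MOMO_DATA_FRESHNESS"
      else
        echo "FRESHNESS_OK"
      fi
      ;;
  esac
)

momo_freshness_gate "DB_DAILY_FRESHNESS 1|2026-06-24"
```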
+
+### Step 4 - Backup / offsite / escrow
+
+Run read-only on 110:
+
+```bash
+/backup/scripts/backup-status.sh --no-notify --no-refresh
+```
+
+Required fields:
+
+- 110 backup fresh / failed count.
+- 188 backup fresh / failed count.
+- `core_blockers=0`.
+- `integrity_stale=0`.
+- `offsite_fresh=1`.
+- `rclone_gdrive_fresh=1`.
+- `escrow_missing` must be reported as-is.
+
+When `escrow_missing>0`, service may be green, but DR must not be green.
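
The Step 4 field checks could be combined as below. This is a sketch under stated assumptions: `backup_verdict` and the labels `BACKUP_REQUIRED_FIELD_FAILED`, `BACKUP_OK_ESCROW_OK`, `BACKUP_OK_DR_ESCROW_BLOCKED`, and `ESCROW_UNKNOWN` are hypothetical, and the input is assumed to be whitespace-separated `key=value` tokens, which may differ from the real `backup-status.sh` output.

```bash
# Hypothetical helper (not part of the repository): turn Step 4 fields into a verdict.
# Assumption: fields arrive as whitespace-separated key=value tokens.
backup_verdict() (
  status="$1"
  # Print the value of one key, or nothing if the key is absent.
  field() {
    printf '%s\n' "$status" | tr -s ' \t' '\n\n' | sed -n "s/^$1=//p" | head -n 1
  }

  if [ "$(field core_blockers)" != "0" ]; then
    echo "BLOCKED_BACKUP_CORE"             # a missing field also lands here
    exit 0
  fi
  if [ "$(field integrity_stale)" != "0" ] ||
     [ "$(field offsite_fresh)" != "1" ] ||
     [ "$(field rclone_gdrive_fresh)" != "1" ]; then
    echo "BACKUP_REQUIRED_FIELD_FAILED"
    exit 0
  fi

  case "$(field escrow_missing)" in
    0) echo "BACKUP_OK_ESCROW_OK" ;;
    ''|*[!0-9]*) echo "ESCROW_UNKNOWN" ;;          # report as-is, never assume 0
    *) echo "BACKUP_OK_DR_ESCROW_BLOCKED" ;;       # service may be green, DR may not
  esac
)

backup_verdict "core_blockers=0 integrity_stale=0 offsite_fresh=1 rclone_gdrive_fresh=1 escrow_missing=5"
```

Keeping the escrow check last mirrors the rule above: an escrow gap never turns a healthy backup red, it only withholds the DR claim.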
+
+### Step 5 - Public routes as supporting evidence only
+
+```bash
+for url in \
+  https://awoooi.wooo.work/api/v1/health \
+  https://awoooi.wooo.work/zh-TW/iwooos \
+  https://mo.wooo.work/health \
+  https://stock.wooo.work/; do
+  code="$(curl -k -sS -o /dev/null -w '%{http_code}' "$url" || true)"
+  echo "$code $url"
+done
+```
+
+Route smoke must be read together with cold-start / DB / backup; it cannot serve as recovery proof on its own.
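
For reading the codes printed by the Step 5 loop, a small classifier might look like the sketch below. `route_code_class` and its three labels are hypothetical, and treating any 2xx/3xx as passing is an assumption; which codes each route is expected to return is not stated on this page. curl prints `000` for `%{http_code}` when no HTTP response was received.

```bash
# Hypothetical helper (not part of the repository): classify one HTTP status code
# from the Step 5 loop. Assumption: any 2xx/3xx counts as a passing route.
route_code_class() (
  case "$1" in
    2[0-9][0-9]|3[0-9][0-9]) echo "ROUTE_OK" ;;
    000|'') echo "ROUTE_UNREACHABLE" ;;   # curl got no HTTP response at all
    *) echo "ROUTE_FAIL" ;;
  esac
)

# Example with literal codes; in practice feed it the "$code" from the loop.
for code in 200 301 000 502; do
  echo "$code $(route_code_class "$code")"
done
```

Even when every route reports `ROUTE_OK`, that is still supporting evidence only and does not replace the cold-start, DB, or backup checks.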
+
+### Step 6 - 110 CPU / runaway process
+
+```bash
+ssh wooo@192.168.0.110 'uptime; vmstat 1 5; ps -eo pid,ppid,pgid,stat,pcpu,pmem,comm,args --sort=-pcpu | head -25'
+```
+
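+Classification:
+
+- orphan Chrome / headless smoke: follow the runaway process PlayBook; do not kill without approval.
+- Gitea Actions / CI build / test: label as short-term CI load first; do not treat as an incident.
+- Docker / DB / Harbor / Sentry sustained high load: go back to service dependencies and exporter readback.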
+
+---
+
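+## 4. Release and block wording
+
+| Result | Wording |
+|------|------|
+| `FULL_STACK_GREEN_DR_ESCROW_BLOCKED` | May claim all service surfaces recovered; must not claim DR complete. |
+| `SERVICE_AVAILABLE_DEGRADED` | May claim service available; must list WARNs and next steps. |
+| `BLOCKED_MOMO_DATA_FRESHNESS` | May claim the website is available; must not claim the data is fresh. |
+| `BLOCKED_HOST_OR_K3S` | Must not claim full-stack recovery; fix host / K3s first. |
+| `BLOCKED_BACKUP_CORE` | Must not claim recovery complete; a red backup takes priority. |
+| `BLOCKED_WAZUH_REGISTRY` | Not a service-recovery blocker under this SOP; hand off to the IwoooS / Wazuh lane and do not modify the Wazuh runtime. |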
+
+---
+
+## 5. Required LOGBOOK summary after completion
+
+```text
+Time:
+Command type: read-only / docs-only / write-with-approval
+Hosts: 110 / 120 / 121 / 188
+Cold-start: PASS=? WARN=? BLOCKED=? RESULT=?
+MOMO: version=? daily_freshness=? latest_job=?
+Backup: 110=? 188=? core_blockers=? offsite=? escrow_missing=?
+Routes: list the main route codes
+CPU / runaway: orphan=? active_ci=? load=?
+Still blocked:
+Must not claim:
+```
+
+---
+
+## 6. Latest verified baseline
+
+2026-06-25 14:16:
+
+- Cold-start: `PASS=89 WARN=0 BLOCKED=0`, Result `GREEN`.
+- MOMO: `V10.674`, job `57` clean, `DB_DAILY_FRESHNESS 1|2026-06-24`.
+- Backup: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`.
+- DR: `escrow_missing=5`; DR complete must not be claimed.
+- Wazuh: host registry accepted is still not a completion item of this SOP; full host enrollment must not be claimed.
--- a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
+++ b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
@@ -15,7 +15,7 @@
 | P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-25 09:05 readback shows 120 is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, VIP `192.168.0.125` is present, node filesystem / disk-pressure / readonly events are `0`, and latest `km-vectorize-29705460-55rgs` completed. |
 | P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-25 09:05 backup / alert readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`, last aggregate `2026-06-25 02:35:09`. DR remains blocked on real non-secret credential escrow evidence IDs. |
 | P2 service / data truth | GREEN | 100% | Public route/TLS, API/Web route, MOMO health `V10.674`, MOMO main / CD `#904` monthly-sync failure boundary, MOMO main / CD `#910` Drive-auth fail-closed boundary, direct 14:16 public route smoke all expected 2xx/3xx, current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, backup exporters, schedules, K3s node readiness/storage conditions, VIP, and 110 / 188 runtime health are green. 14:16 preflight confirms app / scheduler / Telegram bot healthy, scheduler restart count `0`, token metadata aligned to scheduler UID, latest job `57` completed cleanly, and `DB_DAILY_FRESHNESS 1|2026-06-24`. |
-| P3 docs / automation contracts | DONE_WITH_MOMO_PREFLIGHT_AND_CPU_TRIAGE | 100% | Workplan, SOP v1.51, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, 188 node-exporter restore helper, 188 DB/Redis exporter restore helper, 188 MinIO/Velero restore helper, 188 nginx-exporter restore helper, 110 Docker disk pressure cleanup boundary, MOMO Google Drive token userns readback, MOMO data freshness hard blocker, MOMO Pro false-noise health monitor source-of-truth, docker-health direct Telegram fallback cooldown, Bitan public-content same-fingerprint cooldown, notification-noise readback, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side cold-start v1.42 source absence classifier, live-sync parity gate, MOMO import-boundary production deploy, MOMO Drive-auth fail-closed production deploy, 10:04 scheduler fail-closed live proof, 10:35 route / DB / backup refresh, 11:44 MOMO dedicated preflight blocked readback, 14:16 MOMO dedicated preflight recovery on V10.674 / job 57 / freshness 1, 10:58 user-approved 110 orphan Chrome SIGTERM evidence, MacBook Pro Codex safe artifact sync readback, and 2026-06-25 live refresh with full cold-start GREEN are updated. 2026-06-24 23:15 read-only verify still shows repo cold-start hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05` differs from 110 live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`; live 110 script sync of the v1.42 classifier is not claimed until separately approved and recorded. |
+| P3 docs / automation contracts | DONE_WITH_MOMO_PREFLIGHT_AND_CPU_TRIAGE | 100% | Workplan, SOP v1.52, one-page post-start quick check, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, 188 node-exporter restore helper, 188 DB/Redis exporter restore helper, 188 MinIO/Velero restore helper, 188 nginx-exporter restore helper, 110 Docker disk pressure cleanup boundary, MOMO Google Drive token userns readback, MOMO data freshness hard blocker, MOMO Pro false-noise health monitor source-of-truth, docker-health direct Telegram fallback cooldown, Bitan public-content same-fingerprint cooldown, notification-noise readback, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side cold-start v1.42 source absence classifier, live-sync parity gate, MOMO import-boundary production deploy, MOMO Drive-auth fail-closed production deploy, 10:04 scheduler fail-closed live proof, 10:35 route / DB / backup refresh, 11:44 MOMO dedicated preflight blocked readback, 14:16 MOMO dedicated preflight recovery on V10.674 / job 57 / freshness 1, 10:58 user-approved 110 orphan Chrome SIGTERM evidence, MacBook Pro Codex safe artifact sync readback, and 2026-06-25 live refresh with full cold-start GREEN are updated. 2026-06-24 23:15 read-only verify still shows repo cold-start hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05` differs from 110 live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`; live 110 script sync of the v1.42 classifier is not claimed until separately approved and recorded. |

 2026-06-25 14:16 supplemental readback supersedes the 11:53 BLOCKED wording: direct route smoke is 200 for AWOOOI API / IwoooS / MOMO health / Stock, and cold-start public route/TLS gate is green for all expected 2xx/3xx routes. Repo-side cold-start returns `PASS=89 WARN=0 BLOCKED=0`; `/backup/scripts/backup-status.sh --no-notify --no-refresh` reports 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`; MOMO dedicated preflight returns `PASS=18 WARN=3 BLOCKED=0`; MOMO health is `V10.674`; 110 load is around `3.85 / 3.33 / 3.19`, with active Gitea Actions / 2026 World Cup pipeline visible, not orphan Chrome.

@@ -181,7 +181,7 @@ Next: <single next action>
 | P3-005 | DONE | 100 | Update cold-start SOP | SOP now includes start, shutdown, reboot, record, comparison, and 120 blocker handling. | Increment SOP version after each process change. | SOP has controlled power-operation sections and ledger template. |
 | P3-006 | DONE | 100 | Update backup status | Backup status now reflects current cron, rclone latest-only, failure-only alert posture, and escrow blocker. | Refresh after 120 backup rerun. | Backup status no longer claims noisy success Telegram notifications. |
 | P3-007 | DONE | 100 | Harden Gitea backup stale dump handling | 2026-06-05 manual Gitea backup failed because the container retained `/tmp/gitea-dump.zip` from the 02:00 failure. `scripts/backup/backup-gitea.sh` now renames stale container dump files to timestamped evidence before running a new dump, and the live 110 script is updated. | Watch the next 02:00 Gitea backup. | `bash -n` passes locally and on 110; manual Gitea backup completed after stale evidence rename. |
-| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.51 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, stale-vs-active K8s failed Job classification, post-reboot / post-CD recovery anchors, AA/AS determination, workload distribution determination, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, 110 runaway browser / CI load triage PlayBook, healthy-heartbeat suppression, 188 node-exporter restore, 188 DB/Redis exporter restore, 188 MinIO/Velero restore, 188 nginx-exporter restore, 110 Docker disk cleanup boundary, MOMO Google Drive token userns readback, MOMO data freshness hard blocker, post-reboot notification noise gates, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side scorecard source-absence classifier, 110 live-sync parity gate, CD monitoring coverage target-down classification, MOMO dedicated token/source preflight, MOMO V10.674 / StartedAt / lifecycle / job 57 / freshness 1 recovery readback, and 2026-06-25 110 CPU orphan Chrome vs active CI triage evidence. | Use v1.51 for the next reboot record, then compare actual timing, Plan B trigger, degraded level, failed/stale/active Job counters, runaway-process metrics, CI load attribution, MOMO source availability, data freshness, Velero freshness, exporter scrape, disk usage, notification-noise state, monitoring coverage, and blockers against §1.4 plus §11.1 / §14.8 through §14.35. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow / runaway-process / notification-noise / MOMO preflight / monitoring coverage checks. If using the live 110 script, record its hash and do not assume repo-side v1.42 behavior until synced under approval and deploy parity passes. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN`, and `B5_DR_COMPLETE`; latest MOMO dedicated preflight returns `PASS=18 WARN=3 BLOCKED=0`; 110 CPU evidence records old orphan Chrome groups removed by approved SIGTERM while active CI load remains observation-only; repeated healthy/same-failure notification noise is controlled without hiding real alerts, and monitoring coverage target-down is routed through exporter restore before any product restart. |
+| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.52 adds one-page post-start quick check, startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, stale-vs-active K8s failed Job classification, post-reboot / post-CD recovery anchors, AA/AS determination, workload distribution determination, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, 110 runaway browser / CI load triage PlayBook, healthy-heartbeat suppression, 188 node-exporter restore, 188 DB/Redis exporter restore, 188 MinIO/Velero restore, 188 nginx-exporter restore, 110 Docker disk cleanup boundary, MOMO Google Drive token userns readback, MOMO data freshness hard blocker, post-reboot notification noise gates, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side scorecard source-absence classifier, 110 live-sync parity gate, CD monitoring coverage target-down classification, MOMO dedicated token/source preflight, MOMO V10.674 / StartedAt / lifecycle / job 57 / freshness 1 recovery readback, and 2026-06-25 110 CPU orphan Chrome vs active CI triage evidence. | Use `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` for T+10 post-reboot triage, then use SOP v1.52 for exceptions, Plan B, blocker-specific recovery, and historical comparison. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow / runaway-process / notification-noise / MOMO preflight / monitoring coverage checks. If using the live 110 script, record its hash and do not assume repo-side v1.42 behavior until synced under approval and deploy parity passes. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN`, and `B5_DR_COMPLETE`; quick check has one-page command order and LOGBOOK template; latest MOMO dedicated preflight returns `PASS=18 WARN=3 BLOCKED=0`; 110 CPU evidence records old orphan Chrome groups removed by approved SIGTERM while active CI load remains observation-only; repeated healthy/same-failure notification noise is controlled without hiding real alerts, and monitoring coverage target-down is routed through exporter restore before any product restart. |
 | P3-009 | DONE | 100 | Assess 120/121 AA/AS role and host load balancing | 2026-06-12 15:19 live check confirms 120 and 121 are both `Ready control-plane`, `k3s active`, `k3s-agent inactive`, with no taints; however most AWOOOI / ArgoCD / Velero workload remains on 121 after 120 fsck recovery. New assessment defines control-plane AA vs workload AA, migration candidates from 110/188, and stateful migration blockers. | After P0 backup/offsite/cold-start green, implement topology spread for AWOOOI API/Web before moving additional services. | `docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md` exists; SOP v1.6 links AA/AS and load-balancing checks; migration implementation remains explicitly `0%`. |
 | P3-010 | DONE | 100 | Update workload balancing docs with 2026-06-13 live truth | Host role assessment, workplan, SOP, backup status, and LOGBOOK are refreshed with current cold-start, backup, 188 certbot degraded, ArgoCD `km-vectorize` degraded, Gitea main `acaae999`, ArgoCD sync, and final pod placement evidence. | Keep updating this file after the next reboot or deploy. | Docs separate service-green status from DR escrow, workload rollout, and non-service governance debt. |
 | P3-011 | DONE | 100 | Record `km-vectorize` remediation status | LOGBOOK, this workplan, and SOP now state the schedule/label fix, ArgoCD sync evidence, the invalid manual Job boundary, and the 90% waiting-for-next-schedule gate. | After next 03:00 run, update this row and the top verdict with `lastSuccessfulTime` / ArgoCD health evidence. | No document claims ArgoCD green before official CronJob success evidence exists. |