@@ -11,15 +11,15 @@
| Area | Status | Completion | Evidence |
|------|--------|------------|----------|
| Overall recovery readiness | SERVICE_AVAILABLE_MOMO_DATA_STALE_GDRIVE_TOKEN_WARN_DR_ESCROW_BLOCKED | 97% | 2026-06-25 11:35 live cold-start returned `PASS=87 WARN=1 BLOCKED=1`, result `BLOCKED` because MOMO business data freshness remains stale and Google Drive token ownership/writeback metadata is not confirmed. 110 / 120 / 121 / 188 ping and SSH port are OK, K3s `mon` / `mon1` are Ready, public routes/TLS are green, AWOOOI API health is healthy/prod/mock=false, MOMO service health is healthy on `V10.665` after an 11:33 compose replace, 110 / 188 runtime and backup checks are green. MOMO Gitea `main` is `e137d7a5d02a7595a44c3f3cc1cf54b766424ee7`; `cd.yaml #910` succeeded and deployed a fail-closed Drive auth/API boundary into 188 host source and `momo-scheduler` container source. Remaining hard service blocker is still MOMO business data freshness: `MOMO_DAILY_FRESHNESS 8|2026-06-17`; DB current-month readback remains `daily_sales_snapshot=104614|2025-07-01|2026-06-17` and `realtime_sales_monthly=10936|2026/06/01|2026/06/17`; latest valid job `56` is still completed with `sync_success=true` and bounds `2026-06-01..2026-06-17`. Warning evidence: metadata-only check shows `/home/ollama/momo-pro/config/google_token.json` missing on host and `config/google_token.json` missing inside `momo-scheduler`, while scheduler runs as UID/GID `100000:100000`; no token content was read. DR remains blocked because credential escrow evidence markers are still missing and must not be forged. |
| Overall recovery readiness | SERVICE_AVAILABLE_MOMO_DATA_STALE_GDRIVE_TOKEN_WARN_DR_ESCROW_BLOCKED | 97% | 2026-06-25 11:35 live cold-start returned `PASS=87 WARN=1 BLOCKED=1`, result `BLOCKED` because MOMO business data freshness remains stale and Google Drive token ownership/writeback metadata is not confirmed. 2026-06-25 11:44 dedicated MOMO preflight returned `PASS=15 WARN=5 BLOCKED=2`: 110 / 120 / 121 / 188 ping and SSH port are OK, K3s `mon` / `mon1` are Ready, public routes/TLS are green, AWOOOI API health is healthy/prod/mock=false, MOMO service health is healthy on `V10.667` after 11:42-11:43 replacement / restart warm-up evidence, 110 / 188 runtime and backup checks are green. MOMO Gitea `main` is `e137d7a5d02a7595a44c3f3cc1cf54b766424ee7`; `cd.yaml #910` succeeded and deployed a fail-closed Drive auth/API boundary into 188 host source and `momo-scheduler` container source. Remaining hard service blocker is still MOMO business data freshness: `MOMO_DAILY_FRESHNESS 8|2026-06-17`; DB current-month readback remains `daily_sales_snapshot=104614|2025-07-01|2026-06-17` and `realtime_sales_monthly=10936|2026/06/01|2026/06/17`; latest valid job `56` is still completed with `sync_success=true` and bounds `2026-06-01..2026-06-17`. Warning evidence: metadata-only check shows `/home/ollama/momo-pro/config/google_token.json` missing on host and `config/google_token.json` missing inside `momo-scheduler`, while scheduler runs as UID/GID `100000:100000`; no token content was read. DR remains blocked because credential escrow evidence markers are still missing and must not be forged. |
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-25 09:05 readback shows 120 is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, VIP `192.168.0.125` is present, node filesystem / disk-pressure / readonly events are `0`, and latest `km-vectorize-29705460-55rgs` completed. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-25 09:05 backup / alert readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`, last aggregate `2026-06-25 02:35:09`. DR remains blocked on real non-secret credential escrow evidence IDs. |
| P2 service / data truth | BLOCKED_MOMO_DATA_FRESHNESS_WITH_GDRIVE_TOKEN_WARN | 98% | Public route/TLS, API/Web route, MOMO health `V10.665`, MOMO main / CD `#904` monthly-sync failure boundary, MOMO main / CD `#910` Drive-auth fail-closed boundary, 10:04 live scheduler fail-closed proof, direct 11:35 public route smoke all 200 for AWOOOI API / IwoooS / VibeWork / AwoooGo / MOMO health / Stock / Bitan, current-month parity `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`, backup exporters, schedules, K3s node readiness/storage conditions, VIP, and 110 / 188 runtime health are green. MOMO latest business date remains `2026-06-17`; stale age is `8` days as of 11:35. Latest valid job `56` already imported `即時業績_當日.xlsx` with `sync_success=true` and bounds `2026-06-01..2026-06-17`; targeted source search did not find a newer `即時業績_當日` intake file. Google Drive token metadata is still a WARN because host and container token paths are missing; this requires owner-gated metadata repair/evidence and must not be solved by reading token contents. |
| P3 docs / automation contracts | DONE_WITH_MOMO_PREFLIGHT_AND_CPU_TRIAGE | 100% | Workplan, SOP v1.49, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, 188 node-exporter restore helper, 188 DB/Redis exporter restore helper, 188 MinIO/Velero restore helper, 188 nginx-exporter restore helper, 110 Docker disk pressure cleanup boundary, MOMO Google Drive token userns readback, MOMO daily freshness blocker, MOMO Pro false-noise health monitor source-of-truth, docker-health direct Telegram fallback cooldown, Bitan public-content same-fingerprint cooldown, notification-noise readback, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side cold-start v1.42 source absence classifier, live-sync parity gate, MOMO import-boundary production deploy, MOMO Drive-auth fail-closed production deploy, 10:04 scheduler fail-closed live proof, 10:35 route / DB / backup refresh, 11:01 MOMO dedicated preflight gate, 10:58 user-approved 110 orphan Chrome SIGTERM evidence, MacBook Pro Codex safe artifact sync readback, and 2026-06-25 live refresh with Google Drive token metadata WARN are updated. 2026-06-24 23:15 read-only verify still shows repo cold-start hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05` differs from 110 live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`; live 110 script sync of the v1.42 classifier is not claimed until separately approved and recorded. |
| P2 service / data truth | BLOCKED_MOMO_DATA_FRESHNESS_WITH_GDRIVE_TOKEN_WARN | 98% | Public route/TLS, API/Web route, MOMO health `V10.667`, MOMO main / CD `#904` monthly-sync failure boundary, MOMO main / CD `#910` Drive-auth fail-closed boundary, 10:04 live scheduler fail-closed proof, direct 11:35 public route smoke all 200 for AWOOOI API / IwoooS / VibeWork / AwoooGo / MOMO health / Stock / Bitan, current-month parity `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`, backup exporters, schedules, K3s node readiness/storage conditions, VIP, and 110 / 188 runtime health are green. 11:44 preflight confirms app / scheduler / Telegram bot healthy, scheduler restart count `0`, recent lifecycle events `23`, and exact local source candidate count `0`. MOMO latest business date remains `2026-06-17`; stale age is `8` days as of 11:44. Latest valid job `56` already imported `即時業績_當日.xlsx` with `sync_success=true` and bounds `2026-06-01..2026-06-17`; targeted source search did not find a newer `即時業績_當日` intake file. Google Drive token metadata is still a WARN because host and container token paths are missing; this requires owner-gated metadata repair/evidence and must not be solved by reading token contents. |
| P3 docs / automation contracts | DONE_WITH_MOMO_PREFLIGHT_AND_CPU_TRIAGE | 100% | Workplan, SOP v1.50, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, 188 node-exporter restore helper, 188 DB/Redis exporter restore helper, 188 MinIO/Velero restore helper, 188 nginx-exporter restore helper, 110 Docker disk pressure cleanup boundary, MOMO Google Drive token userns readback, MOMO daily freshness blocker, MOMO Pro false-noise health monitor source-of-truth, docker-health direct Telegram fallback cooldown, Bitan public-content same-fingerprint cooldown, notification-noise readback, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side cold-start v1.42 source absence classifier, live-sync parity gate, MOMO import-boundary production deploy, MOMO Drive-auth fail-closed production deploy, 10:04 scheduler fail-closed live proof, 10:35 route / DB / backup refresh, 11:44 MOMO dedicated preflight version / StartedAt / lifecycle / source-candidate gate, 10:58 user-approved 110 orphan Chrome SIGTERM evidence, MacBook Pro Codex safe artifact sync readback, and 2026-06-25 live refresh with Google Drive token metadata WARN are updated. 2026-06-24 23:15 read-only verify still shows repo cold-start hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05` differs from 110 live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`; live 110 script sync of the v1.42 classifier is not claimed until separately approved and recorded. |
2026-06-25 11:35 supplemental readback supersedes the 11:21 route / DB / backup wording for current evidence: direct route smoke is still 200 for AWOOOI API / IwoooS / VibeWork / AwoooGo / MOMO health / Stock / Bitan; repo-side cold-start returns `PASS=87 WARN=1 BLOCKED=1`; `/backup/scripts/backup-status.sh --no-notify --no-refresh` reports 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`; MOMO dedicated preflight remains `PASS=8 WARN=4 BLOCKED=2`; MOMO health is `V10.665` after an 11:33 compose replace; 110 CPU is stable around load `3.16 / 3.26 / 4.36`, not orphan Chrome.
2026-06-25 11:44 supplemental readback supersedes the 11:35 MOMO service-version wording for current evidence: direct route smoke is still 200 for AWOOOI API / IwoooS / VibeWork / AwoooGo / MOMO health / Stock / Bitan; repo-side cold-start returns `PASS=87 WARN=1 BLOCKED=1`; `/backup/scripts/backup-status.sh --no-notify --no-refresh` reports 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`; MOMO dedicated preflight returns `PASS=15 WARN=5 BLOCKED=2`; MOMO health is `V10.667` after 11:42-11:43 lifecycle events; 110 CPU is stable around load `3.16 / 3.26 / 4.36`, not orphan Chrome.
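
The backup counters quoted in this readback can be turned into explicit gates. The following is a minimal sketch, assuming the counters are available as whitespace-separated `name=value` tokens; the real output layout of `backup-status.sh` is not shown here, and the helper names `parse_counters`, `backup_core_green`, and `dr_escrow_blocked` are hypothetical. Treating `offsite_fresh=1` and `rclone_gdrive_fresh=1` as the green values is also an assumption, based on how this workplan reports them.

```python
import re


def parse_counters(readback: str) -> dict[str, int]:
    """Collect every `name=<integer>` token from a readback string."""
    pairs = re.findall(r"\b([a-z_]+)=(\d+)\b", readback)
    return {name: int(value) for name, value in pairs}


def backup_core_green(counters: dict[str, int]) -> bool:
    """True only when all core backup counters are present and healthy.

    A missing counter compares unequal and therefore fails closed.
    """
    return (
        counters.get("core_blockers") == 0
        and counters.get("integrity_stale") == 0
        and counters.get("offsite_fresh") == 1
        and counters.get("rclone_gdrive_fresh") == 1
    )


def dr_escrow_blocked(counters: dict[str, int]) -> bool:
    """DR stays blocked while any escrow evidence marker is missing."""
    return counters.get("escrow_missing", 0) > 0


# Counter values copied from the 11:44 readback above.
READBACK = (
    "core_blockers=0 integrity_stale=0 offsite_fresh=1 "
    "rclone_gdrive_fresh=1 escrow_missing=5"
)

if __name__ == "__main__":
    counters = parse_counters(READBACK)
    print("backup core green:", backup_core_green(counters))
    print("DR escrow blocked:", dr_escrow_blocked(counters))
```

Keeping the two gates separate mirrors the split in this workplan: backups can be green while DR remains blocked on escrow evidence.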
Full cold-start service readiness may not be declared green for the latest verified evidence set. As of 2026-06-25 11:35, routes/hosts/K3s/backups/exporters/monitoring surfaces are available, AWOOOI API is healthy, and MOMO service health is `V10.665`, but the latest live read-only cold-start scorecard remains `PASS=87 WARN=1 BLOCKED=1` because MOMO business data freshness is stale beyond 3 days and Google Drive token metadata is missing / writeback not confirmed. The hard blocker is `188 momo daily sales data stale beyond 3 days`; the token state is a separate WARN and not a reason to read token contents. MOMO Drive auth/API failure is no longer allowed to be recorded as a no-file success after CD `#910`; the 10:04 scheduler run proved it now fails closed and sends failure notification. This code fix does not create new business data. Do not declare DR scorecard complete while credential escrow evidence remains blocked.
Full cold-start service readiness may not be declared green for the latest verified evidence set. As of 2026-06-25 11:44, routes/hosts/K3s/backups/exporters/monitoring surfaces are available, AWOOOI API is healthy, and MOMO service health is `V10.667`, but the latest live read-only cold-start scorecard remains `PASS=87 WARN=1 BLOCKED=1` because MOMO business data freshness is stale beyond 3 days and Google Drive token metadata is missing / writeback not confirmed. The hard blocker is `188 momo daily sales data stale beyond 3 days`; the token state is a separate WARN and not a reason to read token contents. MOMO Drive auth/API failure is no longer allowed to be recorded as a no-file success after CD `#910`; the 10:04 scheduler run proved it now fails closed and sends failure notification. This code fix does not create new business data. Do not declare DR scorecard complete while credential escrow evidence remains blocked.
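
The verdict logic in this paragraph can be sketched as follows. This is an illustrative sketch, not the cold-start script itself: the function names are hypothetical, and the rule that a WARN-only scorecard yields `WARN` is an assumption (the source only confirms that `BLOCKED=1` produces result `BLOCKED`). The 3-day threshold comes from the blocker text `stale beyond 3 days`.

```python
import re
from datetime import date

STALE_THRESHOLD_DAYS = 3  # from "stale beyond 3 days" in the blocker text


def parse_scorecard(line: str) -> dict[str, int]:
    """Extract PASS / WARN / BLOCKED counts from a scorecard line."""
    return {k: int(v) for k, v in re.findall(r"\b(PASS|WARN|BLOCKED)=(\d+)", line)}


def verdict(counts: dict[str, int]) -> str:
    """Any BLOCKED item blocks the result; WARN-only handling is assumed."""
    if counts.get("BLOCKED", 0) > 0:
        return "BLOCKED"
    if counts.get("WARN", 0) > 0:
        return "WARN"
    return "PASS"


def stale_days(latest_business_date: date, as_of: date) -> int:
    """Age in whole days of the latest business date."""
    return (as_of - latest_business_date).days


def is_stale(latest_business_date: date, as_of: date) -> bool:
    return stale_days(latest_business_date, as_of) > STALE_THRESHOLD_DAYS


if __name__ == "__main__":
    counts = parse_scorecard("PASS=87 WARN=1 BLOCKED=1")
    print(verdict(counts))
    # Latest business date 2026-06-17, read back on 2026-06-25.
    print(stale_days(date(2026, 6, 17), date(2026, 6, 25)))
```

With the values from this readback, the age works out to 8 days, which matches `MOMO_DAILY_FRESHNESS 8|2026-06-17` and exceeds the 3-day threshold.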
2026-06-13 01:26 refresh: full cold-start is again green for the current evidence set. AWOOOI API/Web workload balancing survived the next normal CD deploy: Gitea main `e4a349bc`, ArgoCD revision `e4a349bc`, images from `414413a5`, API/Web split across `mon` / `mon1`, and global `known_hosts` retained 120 / 188 after CD fix `80e6ec1a`. Do not declare DR complete while credential escrow is missing. `km-vectorize` remediation is `90%`: schedule/label fix is live, and the remaining gate is the next official 03:00 CronJob success readback.
@@ -160,7 +160,7 @@ Next: <single next action>
| ID | Status | % | Work item | Fine analysis | Next action | Done criteria |
|----|--------|---:|-----------|---------------|-------------|---------------|
| P2-001 | VERIFIED | 100 | Public route smoke | 2026-06-12 18:57 cold-start confirms all listed domains returned expected 2xx/3xx over HTTPS; registry root route returned 200 in the scorecard and `/v2/` remains the normal unauthenticated 401 pattern from earlier checks. This proves ingress/TLS plus current route availability. | Keep as one row in scorecard. | Public route table updated after each reboot. |
| P2-002 | BLOCKED_MOMO_DATA_FRESHNESS_WITH_GDRIVE_TOKEN_WARN | 98 | momo latest/current-month parity and freshness | Latest current-month parity is good: `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`. Full-table read-only DB evidence is `daily_sales_snapshot=104614 rows, 2025-07-01..2026-06-17` and current-month `realtime_sales_monthly=10936 rows, 2026/06/01..2026/06/17`. 11:01 dedicated preflight returns `PASS=8 WARN=4 BLOCKED=2`: public/local health and scheduler are healthy, latest job `56` is clean, but latest business data is stale: `DB_DAILY_FRESHNESS 8|2026-06-17`; host/container token metadata remains missing, and scheduler fail-closed log evidence is not present after the latest container restart window. 10:04 scheduler run remains the previous proof that Drive auth failure now fails closed and sends Telegram failure notification. | Run `scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh` before any MOMO recovery claim. Obtain owner-provided non-secret evidence ref for Google Drive token artifact recovery or a newer legitimate PChome daily-sales source file. Recovery must have maintenance window, rollback owner, token metadata-only verification, import job `sync_success=true`, file movement only after success, table bounds, and `MOMO_DAILY_FRESHNESS <= 2`. Do not read token contents, do not manually import stale local samples, product exports, header-only sheets, or already imported archives. | Token artifact metadata matches scheduler UID/mode without exposing token value; source folder has a newer legitimate source or scheduler can list the expected folder; next import has `sync_success=true`; snapshot/current-month row count and bounds match; daily freshness is within threshold; preflight returns no BLOCKED result. |
| P2-002 | BLOCKED_MOMO_DATA_FRESHNESS_WITH_GDRIVE_TOKEN_WARN | 98 | momo latest/current-month parity and freshness | Latest current-month parity is good: `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`. Full-table read-only DB evidence is `daily_sales_snapshot=104614 rows, 2025-07-01..2026-06-17` and current-month `realtime_sales_monthly=10936 rows, 2026/06/01..2026/06/17`. 11:44 dedicated preflight returns `PASS=15 WARN=5 BLOCKED=2`: public/local health and scheduler are healthy on `V10.667`, app / scheduler / Telegram bot StartedAt metadata is captured, recent lifecycle events are visible, exact local daily-sales source candidates are `0`, latest job `56` is clean, but latest business data is stale: `DB_DAILY_FRESHNESS 8|2026-06-17`; host/container token metadata remains missing, and scheduler fail-closed log evidence is not present after the latest container restart window. 10:04 scheduler run remains the previous proof that Drive auth failure now fails closed and sends Telegram failure notification. | Run `scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh` before any MOMO recovery claim. Obtain owner-provided non-secret evidence ref for Google Drive token artifact recovery or a newer legitimate PChome daily-sales source file. Recovery must have maintenance window, rollback owner, token metadata-only verification, import job `sync_success=true`, file movement only after success, table bounds, and `MOMO_DAILY_FRESHNESS <= 2`. Do not read token contents, do not manually import stale local samples, product exports, header-only sheets, or already imported archives. | Token artifact metadata matches scheduler UID/mode without exposing token value; source folder has a newer legitimate source or scheduler can list the expected folder; next import has `sync_success=true`; snapshot/current-month row count and bounds match; daily freshness is within threshold; preflight returns no BLOCKED result. |
| P2-008 | DONE_SUPERSEDED_BY_TOKEN_WARN | 100 | Separate MOMO service recovery from upstream source absence | 2026-06-24 11:35 readback proved MOMO service was healthy and source-file absence was the blocker. 2026-06-25 10:35 supersedes that with a stricter split: service is still healthy, DB parity is still good, but token artifact metadata is missing and the latest scheduler evidence is auth failure, not a healthy empty-source listing. SOP v1.48 records GO/NO-GO rules forbidding old archive re-import, product-export import, truncate, whole-DB restore, fake freshness, or token secret exposure. | Keep stale warning and token WARN active until owner-gated Drive token/source evidence is restored and a legitimate newer `即時業績_當日` source imports cleanly. | Operators can say "MOMO service recovered, data pipeline blocked by Drive token/source evidence and stale business data" without calling the full stack green. |
| P2-003 | DONE_PRODUCTION_DEPLOYED_WAITING_NEXT_REAL_IMPORT | 99 | Fix momo job semantics | Gitea-first repair is in `/Users/ogt/codex-workspaces/momo-pro-dev` commit `84035906aba0e5e190d031a13cfd9b47a8cd1f73` on branch `codex/momo-current-main-dev-base-20260624`, also fast-forwarded to MacBook Pro and fast-forwarded to MOMO `main`. Gitea Actions `cd.yaml #904` succeeded, and 188 live source contains `_table_columns`, `業績分析儀表板同步失敗`, and `保留來源檔案等待重試,不移動 Google Drive 檔案`. `process_daily_sales_import()` marks monthly sync failure as `failed`, records the sync error in summary, returns `False`, and leaves `auto_import_from_drive()` outside the Drive archive/move path. Regression tests cover both job failure and no-move behavior. | Watch the next real Google Drive import and confirm no file moves unless both tables sync; if a real monthly sync failure happens, verify import job status is `failed` and source file remains pending. | `pytest tests/test_import_service_sql_params.py tests/test_auto_import_data_sync.py tests/test_auto_import_failure_boundaries.py -q` returns `10 passed`; production deployment/readback is complete; final behavioral closeout requires next real import evidence. |
| P2-004 | DONE | 100 | PostgreSQL index corruption runbook path | SOP v1.2 now states `posting list tuple ... cannot be split` is an index repair incident. | Use only concurrent reindex if the error returns. | No truncate, no whole DB restore; `REINDEX TABLE CONCURRENTLY public.realtime_sales_monthly;` and idempotent resync evidence recorded. |
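
The metadata-only token verification required by P2-002 can be sketched as below. This is a hedged illustration, not the contents of `momo-drive-token-source-recovery-preflight.sh`: the helper names are hypothetical, and the `0o600` permission ceiling is an assumed policy that the workplan does not state. The sketch relies only on `os.stat`, so the token file is never opened or read.

```python
import os
import stat


def token_metadata(path: str) -> dict:
    """Return ownership and mode for `path` without opening the file."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return {"exists": False}
    return {
        "exists": True,
        "uid": st.st_uid,
        "gid": st.st_gid,
        "mode": stat.S_IMODE(st.st_mode),  # permission bits only
    }


def token_metadata_ok(
    meta: dict, expected_uid: int, expected_gid: int, max_mode: int = 0o600
) -> bool:
    """Check owner/group and that no permission bit exceeds `max_mode`.

    `max_mode` is an assumed policy value, not taken from the workplan.
    """
    if not meta.get("exists"):
        return False
    if meta["uid"] != expected_uid or meta["gid"] != expected_gid:
        return False
    return meta["mode"] & ~max_mode == 0


if __name__ == "__main__":
    # Host path and scheduler UID/GID as recorded in this workplan.
    meta = token_metadata("/home/ollama/momo-pro/config/google_token.json")
    print(meta, token_metadata_ok(meta, 100000, 100000))
```

A missing file returns `{"exists": False}` and fails the check, which corresponds to the current WARN state rather than a pass.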
@@ -181,7 +181,7 @@ Next: <single next action>
| P3-005 | DONE | 100 | Update cold-start SOP | SOP now includes start, shutdown, reboot, record, comparison, and 120 blocker handling. | Increment SOP version after each process change. | SOP has controlled power-operation sections and ledger template. |
| P3-006 | DONE | 100 | Update backup status | Backup status now reflects current cron, rclone latest-only, failure-only alert posture, and escrow blocker. | Refresh after 120 backup rerun. | Backup status no longer claims noisy success Telegram notifications. |
| P3-007 | DONE | 100 | Harden Gitea backup stale dump handling | 2026-06-05 manual Gitea backup failed because the container retained `/tmp/gitea-dump.zip` from the 02:00 failure. `scripts/backup/backup-gitea.sh` now renames stale container dump files to timestamped evidence before running a new dump, and the live 110 script is updated. | Watch the next 02:00 Gitea backup. | `bash -n` passes locally and on 110; manual Gitea backup completed after stale evidence rename. |
| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.49 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, stale-vs-active K8s failed Job classification, post-reboot / post-CD recovery anchors, AA/AS determination, workload distribution determination, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, 110 runaway browser / CI load triage PlayBook, healthy-heartbeat suppression, 188 node-exporter restore, 188 DB/Redis exporter restore, 188 MinIO/Velero restore, 188 nginx-exporter restore, 110 Docker disk cleanup boundary, MOMO Google Drive token userns readback, MOMO data freshness hard blocker, post-reboot notification noise gates, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side scorecard source-absence classifier, 110 live-sync parity gate, CD monitoring coverage target-down classification, MOMO dedicated token/source preflight, and 2026-06-25 110 CPU orphan Chrome vs active CI triage evidence. | Use v1.49 for the next reboot record, then compare actual timing, Plan B trigger, degraded level, failed/stale/active Job counters, runaway-process metrics, CI load attribution, MOMO source availability, data freshness, Velero freshness, exporter scrape, disk usage, notification-noise state, monitoring coverage, and blockers against §1.4 plus §11.1 / §14.8 through §14.35. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow / runaway-process / notification-noise / MOMO preflight / monitoring coverage checks. If using the live 110 script, record its hash and do not assume repo-side v1.42 behavior until synced under approval and deploy parity passes. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN`, and `B5_DR_COMPLETE`; latest MOMO dedicated preflight returns `PASS=8 WARN=4 BLOCKED=2`; 110 CPU evidence records old orphan Chrome groups removed by approved SIGTERM while active CI load remains observation-only; repeated healthy/same-failure notification noise is controlled without hiding real alerts, and monitoring coverage target-down is routed through exporter restore before any product restart. |
| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.50 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, stale-vs-active K8s failed Job classification, post-reboot / post-CD recovery anchors, AA/AS determination, workload distribution determination, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, 110 runaway browser / CI load triage PlayBook, healthy-heartbeat suppression, 188 node-exporter restore, 188 DB/Redis exporter restore, 188 MinIO/Velero restore, 188 nginx-exporter restore, 110 Docker disk cleanup boundary, MOMO Google Drive token userns readback, MOMO data freshness hard blocker, post-reboot notification noise gates, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side scorecard source-absence classifier, 110 live-sync parity gate, CD monitoring coverage target-down classification, MOMO dedicated token/source preflight, MOMO V10.667 version / StartedAt / lifecycle / exact source-candidate readback, and 2026-06-25 110 CPU orphan Chrome vs active CI triage evidence. | Use v1.50 for the next reboot record, then compare actual timing, Plan B trigger, degraded level, failed/stale/active Job counters, runaway-process metrics, CI load attribution, MOMO source availability, data freshness, Velero freshness, exporter scrape, disk usage, notification-noise state, monitoring coverage, and blockers against §1.4 plus §11.1 / §14.8 through §14.35. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow / runaway-process / notification-noise / MOMO preflight / monitoring coverage checks. If using the live 110 script, record its hash and do not assume repo-side v1.42 behavior until synced under approval and deploy parity passes. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN`, and `B5_DR_COMPLETE`; latest MOMO dedicated preflight returns `PASS=15 WARN=5 BLOCKED=2`; 110 CPU evidence records old orphan Chrome groups removed by approved SIGTERM while active CI load remains observation-only; repeated healthy/same-failure notification noise is controlled without hiding real alerts, and monitoring coverage target-down is routed through exporter restore before any product restart. |
| P3-009 | DONE | 100 | Assess 120/121 AA/AS role and host load balancing | 2026-06-12 15:19 live check confirms 120 and 121 are both `Ready control-plane`, `k3s active`, `k3s-agent inactive`, with no taints; however most AWOOOI / ArgoCD / Velero workload remains on 121 after 120 fsck recovery. New assessment defines control-plane AA vs workload AA, migration candidates from 110/188, and stateful migration blockers. | After P0 backup/offsite/cold-start green, implement topology spread for AWOOOI API/Web before moving additional services. | `docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md` exists; SOP v1.6 links AA/AS and load-balancing checks; migration implementation remains explicitly `0%`. |
| P3-010 | DONE | 100 | Update workload balancing docs with 2026-06-13 live truth | Host role assessment, workplan, SOP, backup status, and LOGBOOK are refreshed with current cold-start, backup, 188 certbot degraded, ArgoCD `km-vectorize` degraded, Gitea main `acaae999`, ArgoCD sync, and final pod placement evidence. | Keep updating this file after the next reboot or deploy. | Docs separate service-green status from DR escrow, workload rollout, and non-service governance debt. |
| P3-011 | DONE | 100 | Record `km-vectorize` remediation status | LOGBOOK, this workplan, and SOP now state the schedule/label fix, ArgoCD sync evidence, the invalid manual Job boundary, and the 90% waiting-for-next-schedule gate. | After next 03:00 run, update this row and the top verdict with `lastSuccessfulTime` / ArgoCD health evidence. | No document claims ArgoCD green before official CronJob success evidence exists. |
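
The deploy-parity rule in P3-008 (record the live 110 script hash and do not assume repo-side v1.42 behavior until the hashes match) can be sketched as below. This is a minimal sketch under stated assumptions: the recorded hashes are 64 hex characters, which is consistent with SHA-256, but the workplan does not name the algorithm, and `sha256_file` / `deploy_parity` are hypothetical helper names.

```python
import hashlib


def sha256_file(path: str, chunk_size: int = 65536) -> str:
    """Hash a script file in chunks and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def deploy_parity(repo_hash: str, live_hash: str) -> bool:
    """Parity passes only when both recorded digests are identical."""
    return repo_hash.strip().lower() == live_hash.strip().lower()


# Hashes recorded by the 2026-06-24 23:15 read-only verify.
REPO_HASH = "f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05"
LIVE_HASH = "10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8"

if __name__ == "__main__":
    # Parity currently fails, so the repo-side classifier must not be
    # assumed to be running on 110.
    print("deploy parity:", deploy_parity(REPO_HASH, LIVE_HASH))
```

Comparing recorded digests rather than file contents keeps the check read-only on both sides.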