fix(ops): harden reboot API warmup evidence flow
This commit is contained in:
@@ -11,11 +11,11 @@
|
||||
|
||||
| Area | Status | Completion | Evidence |
|
||||
|------|--------|------------|----------|
|
||||
| Overall recovery readiness | SERVICE_GREEN_PRODUCT_DATA_PENDING_EOD_WINDOW | 98% | 2026-06-26 18:45 即時摘要覆蓋 12:13 對今日產品資料的判讀。`post-reboot-readiness-summary.sh --no-color` 回傳 `SERVICE_GREEN=1`、`PRODUCT_DATA_GREEN=0`、`POST_START_RESULT=PRODUCT_DATA_PENDING_EOD_WINDOW`、`STOCK_LATEST_TRADING_DATE=2026-06-26`、`STOCK_BLOCKERS=core_margin_short_daily_missing,ai_recommendations_stale`、`BACKUP_CORE_GREEN=1`、`ESCROW_MISSING_COUNT=5`、`WAZUH_MANAGER_REGISTRY_ACCEPTED=0`。長版 live cold-start 腳本仍回傳 `PASS=87 WARN=0 BLOCKED=0` / `GREEN`,所以主機 / 服務 / route / K3s / backup core 已恢復;StockPlatform 今日官方融資融券資料尚未發布,需等既有 EOD retry,不可宣稱所有產品資料最新。 |
|
||||
| Overall recovery readiness | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-26 23:56 即時摘要覆蓋 18:45 EOD pending 判讀。`post-reboot-readiness-summary.sh --no-color` 回傳 `SERVICE_GREEN=1`、`PRODUCT_DATA_GREEN=1`、`POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`、`POST_START_WARN=3`、`STOCK_FRESHNESS_STATUS=ok`、`STOCK_LATEST_TRADING_DATE=2026-06-26`、`STOCK_BLOCKERS=none`、`BACKUP_CORE_GREEN=1`、`ESCROW_MISSING_COUNT=5`、`WAZUH_MANAGER_REGISTRY_ACCEPTED=0`。主機 / K3s / public routes / AWOOOI / MOMO / Stock / backup core / 188 hygiene 已恢復;120 ArgoCD 為 `Synced / Healthy`,`awoooi-prod` Pod 均為 `Running` 或 `Completed`;歷史 `km-vectorize-29689620` failed Job 已由後續成功 Job 覆蓋。DR 仍因 credential escrow 缺 5 不能宣稱 complete;Wazuh registry 已有脫敏 manager readback,但尚未 Dashboard API / owner acceptance。 |
|
||||
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-26 07:19 readback shows 120 and 121 reachable, K3s active, `mon` and `mon1` both `Ready control-plane`, AWOOOI API/Web replicas split across both nodes, ArgoCD `awoooi-prod Synced / Healthy` at revision `1fd5e2a8b0f18d24eed16aa2a44286bcbf230603`, and `km-vectorize` official 03:00 台北時間 run succeeded with `lastSuccess=2026-06-25T19:00:14Z`. |
|
||||
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-26 06:58 backup readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`, last aggregate `2026-06-26 02:31:02`。DR remains blocked on real non-secret credential escrow evidence IDs; do not write placeholder markers or paste secret values. |
|
||||
| P2 service / data truth | BLOCKED_STOCK_EOD_WINDOW | 90% | Public routes 與 service health 為綠燈,MOMO `V10.716` healthy,current-month parity 為 `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`。StockPlatform 網站為 `200`,且 2026-06-26 price / chips / market index 已存在,但 `/api/v1/system/freshness` 仍為 `blocked`,因為 2026-06-26 的 `core.margin_short_daily` 仍等待官方來源 / row count `0`;`ai.recommendations` 依設計停在 2026-06-25。這是產品資料 freshness blocker,不是 reboot / Nginx / Docker blocker。 |
|
||||
| P3 docs / automation contracts | DONE_WITH_SINGLE_SUMMARY_REPLAY_V175 | 100% | Workplan, SOP v1.75, post-reboot declaration guard, machine-readable post-reboot readiness summary with Wazuh registry detail fields and auto-persisted `summary.txt`, post-reboot next-gate dispatch checklist, owner-packet JSON generator, dynamic owner-packet contract guard, post-reboot owner response preflight, owner response placeholder template, one-page post-start quick check v1.15, route retry gate, delegated cold-start public-route warmup classifier, deploy warmup classification, expanded public route list, StockPlatform freshness gate, StockPlatform cron-source recovery evidence, StockPlatform natural schedule green evidence, 110 orphan Chrome recurrence cleanup evidence, 188 fail-closed startup data recovery gate, 188 host hygiene read-only checklist, 188 PostgreSQL runtime-ready source-of-truth, 188 ACME route/timer hygiene, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and stricter product-data / route retry gates are updated. Declaration guard now machine-checks allowed / forbidden recovery statements from the same `summary.txt`: service/data/backup/188 host hygiene green may be declared when live summary says so, while `DR_COMPLETE`、`WAZUH_REGISTRY_RECOVERED` and `RUNTIME_ACTION_AUTHORIZED` remain forbidden until evidence gates close. Owner response preflight blocks missing files, placeholder templates, secret payloads, credential marker writes, Wazuh active response / re-enroll / restart, host write, and Kali active scan before any evidence can be counted as received or accepted. Live 110 script sync remains a separate approved live-write gate; do not claim it here. |
|
||||
| P2 service / data truth | DONE | 100% | Public routes 與 service health 為綠燈,MOMO health `V10.719`,current-month parity 為 `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`。StockPlatform `/api/v1/system/freshness` 為 `ok`,latest trading date `2026-06-26`,blockers `none`;先前 Stock EOD blocker 已由官方來源與正式 cron 自然收斂。 |
|
||||
| P3 docs / automation contracts | DONE_WITH_API_WARMUP_CLASSIFIER_V176 | 100% | Workplan, SOP v1.76, post-reboot declaration guard, machine-readable post-reboot readiness summary with Wazuh registry detail fields and auto-persisted `summary.txt`, post-reboot next-gate dispatch checklist, owner-packet JSON generator, dynamic owner-packet contract guard, post-reboot owner response preflight, owner response placeholder template, one-page post-start quick check v1.16, route retry gate, delegated cold-start public-route / AWOOOI API warmup classifier, deploy warmup classification, expanded public route list, StockPlatform freshness gate, StockPlatform cron-source recovery evidence, StockPlatform natural schedule green evidence, 110 orphan Chrome recurrence cleanup evidence, 188 fail-closed startup data recovery gate, 188 host hygiene read-only checklist, 188 PostgreSQL runtime-ready source-of-truth, 188 ACME route/timer hygiene, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and stricter product-data / route retry gates are updated. Declaration guard now machine-checks allowed / forbidden recovery statements from the same `summary.txt`: service/data/backup/188 host hygiene green may be declared when live summary says so, while `DR_COMPLETE`、`WAZUH_REGISTRY_RECOVERED` and `RUNTIME_ACTION_AUTHORIZED` remain forbidden until evidence gates close. |
|
||||
|
||||
2026-06-26 12:13 machine-readable summary baseline supersedes the 07:47 / 08:59 gate set: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` stores delegated logs under `/tmp/awoooi-post-reboot-readiness-20260626-121303` and returns `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_SERVICE_GREEN=1`, `HOST_188_HYGIENE_BLOCKED=0`, `HOST_188_CHECK_RC=0`, `HOST_188_RESULT=HOST_188_HYGIENE_GREEN.`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_COVERAGE_SCOPE=6`, `WAZUH_DIRECT_ACTIVE=2`, `WAZUH_NO_TRANSPORT=1`, `WAZUH_SSH_BLOCKED=3`, `WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning`, `WAZUH_DASHBOARD_INDEX_OK=3`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_RUNTIME_GATE=0`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, and `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export`. This is now the preferred first operator/AI-agent entrypoint after reboot because it separates service health from DR and security registry evidence; 188 host hygiene is no longer a next gate unless the live checklist regresses.
|
||||
|
||||
|
||||
Reference in New Issue
Block a user