diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index ac0e6767..9d40be9f 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -1,3 +1,38 @@
+## 2026-06-24 | MOMO Google Drive permission repair and data freshness gate hardening
+
+**Background**: `mo.wooo.work` / `/health` are back to 200 and the MOMO containers are healthy, but the page data is still stuck at an old version. Looking only at route 200 / container healthy leads to a wrong verdict, so the DB, import jobs, Google Drive source, and scheduler logs were rechecked.
+
+**Live findings**:
+- `daily_sales_snapshot`: `104614` rows, dates `2025-07-01` through `2026-06-17`.
+- `realtime_sales_monthly`: `786621` rows, dates `2024-01-01` through `2026-06-17`.
+- Latest successful `daily_sales` import job: `id=56`, `2026-06-18 11:41`, source `即時業績_當日.xlsx`, `imported_count=10936`, `sync_success=true`, date range `2026-06-01` through `2026-06-17`.
+- There is no new `daily_sales` import job after 6/18; the Drive pending-import folder `當日業績匯入` currently has no Excel file matching `即時業績_當日`.
+- In the Drive imported folder `當日業績匯入/已匯入`, the newest file has modifiedTime `2026-06-18T01:30:39Z`; there is no pending source newer than that.
+- `momo-scheduler` auto-import kept running over the last 24 hours, but a permission error on `/app/config/google_token.json` made it log `Permission denied`, which the import flow collapsed into "沒有找到待匯入的檔案" (no pending file found).
+
+**Fix**:
+- On the 188 host, `google_token.json` was originally `1000:1000 600`; under the Docker user namespace the scheduler process runs as `100000:100000`, so the Google client could not write back when it needed to refresh the token.
+- The token file owner/mode has been changed to `100000:100000 600`. The token content was not read, printed, stored, or committed.
+- After the fix, the read/write check on `/app/config/google_token.json` passes and Drive list queries work.
+
+**SOP / automation hardening**:
+- `scripts/reboot-recovery/full-stack-cold-start-check.sh` adds `MOMO_GDRIVE_TOKEN_STAT`, which requires the token owner to match the scheduler userns and the mode to be no wider than `600`.
+- The same script adds `MOMO_DAILY_FRESHNESS`: latest data lagging `0-2` days is OK; lagging `3` days or more is `BLOCKED`.
+- Live `/home/wooo/scripts/full-stack-cold-start-check.sh` has been synced, SHA256 `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`.
+
+**Live verification**:
+- 2026-06-24 02:44 cold-start read-only: `PASS=86 WARN=0 BLOCKED=1`.
+- `MOMO_GDRIVE_TOKEN_STAT 100000:100000:600 scheduler_uid=100000`: PASS.
+- `MOMO_MONTHLY_SYNC 10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`: the two tables match.
+- `MOMO_DAILY_FRESHNESS 6|2026-06-17`: BLOCKED, data has been stale for more than 3 days.
+- `data_stale_alert` already has a 2026-06-23 `upstream_drive` record, last_date `2026-06-17`, Telegram sent `true`; the alert has 24-hour dedupe and is not healthy-heartbeat noise.
+
+**Current verdict**:
+- Host / route / K3s / backup / Alertmanager are mostly recovered.
+- Full-stack green must not be declared, because the stalled MOMO business data source is still a valid blocker.
+- Website 200, container healthy, or DB parity must not stand in for data freshness.
+- DR still must not be declared complete: credential escrow evidence missing remains `5`.
+
 ## 2026-06-24 | 188 node-exporter recovery and backup-health-missing alert closure
 
 **Background**: cold-start and `backup-status` both showed the 188 backup textfile as fresh, but Alertmanager still had `BackupHealthMonitorMissing188` active. Investigation confirmed this was not a backup failure: Prometheus could not scrape `192.168.0.188:9100`, so it could not see 188's `backup_health.prom`.
diff --git a/docs/runbooks/FULL-STACK-COLD-START-SOP.md b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
index fc4b8c2f..54cb5765 100644
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -1,7 +1,7 @@
 # AWOOOI Full-Stack Cold-Start and Host Reboot SOP
 
-> Version: v1.26
-> Last updated: 2026-06-18 Asia/Taipei
+> Version: v1.29
+> Last updated: 2026-06-24 Asia/Taipei
 > Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.
 
 ---
@@ -10,6 +10,18 @@
 
 This section is the first judgment anchor after every handover, power-on, shutdown, or reboot. If the date is not today, rerun the live check first, then update this section and `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md`.
 
+2026-06-24 02:44 live readback supersedes the earlier 02:08 green wording:
+
+```text
+Repo-side reboot SOP / Plan B / automation contracts: COMPLETE, 100%.
+Live cold-start read-only check: PASS=86 WARN=0 BLOCKED=1, Result=BLOCKED.
+Service state: SERVICE_AVAILABLE_DATA_STALE_BLOCKED; 110/120/121/188 reachable, K3s mon/mon1 Ready, public routes/TLS green, 110/188 backup health fresh, 188 node-exporter textfile scrape restored.
+MOMO state: current-month daily_sales_snapshot and realtime_sales_monthly match, but both stop at 2026-06-17. MOMO_DAILY_FRESHNESS is 6 days, which is a hard blocker because business data is not current.
+Google Drive state: momo scheduler token ownership is fixed for Docker userns, Drive listing works, but folder 當日業績匯入 currently has no matching 即時業績_當日 Excel source file. Archive latest matching file is 2026-06-18 and was already imported.
+Allowed declaration: core hosts, routes, K3s, backup/exporter surfaces are recovered; MOMO data pipeline is blocked waiting for a newer source file or owner-provided source evidence.
+Forbidden declaration: full-stack green, MOMO data current, DR complete, or runtime/security acceptance. Credential escrow evidence is still missing and must not be forged.
+```
+
 2026-06-18 12:17 live readback supersedes older service-availability wording:
 
 ```text
@@ -69,13 +81,13 @@ Allowed declaration: monitoring, alert rules, AI event packet, PlayBook / KM con
 Forbidden declaration: AI runtime remediation is enabled. Process termination, Docker/systemd restart, Nginx reload, firewall/K8s action, Telegram live send, Gateway queue write, Bot API call, production write, and secret read remain forbidden without owner approval, maintenance window, evidence ref, dry-run, and post-check.
 ```
 
-| Item | 2026-06-18 13:43 Asia/Taipei live result | Verdict |
+| Item | 2026-06-24 02:44 Asia/Taipei live result | Verdict |
 |------|-------------------------------------------|------|
-| Overall recovery readiness | `99%` | `SERVICE_GREEN_DR_ESCROW_BLOCKED` |
+| Overall recovery readiness | `97%` | `SERVICE_AVAILABLE_MOMO_SOURCE_BLOCKED_DR_ESCROW_BLOCKED` |
 | P0 host / K3s recovery | `100%` | `DONE` |
-| P1 backup / alert / escrow | `92%` | `BLOCKED_DR_ESCROW` |
-| P2 service / data truth | `100%` | `VERIFIED_FULL_STACK_GREEN_FOR_SERVICE` |
-| P3 docs / automation contracts | `100%` | `DONE_WITH_STALE_JOB_CLASSIFICATION` |
+| P1 backup / alert / escrow | `93%` | `BLOCKED_DR_ESCROW` |
+| P2 service / data truth | `96%` | `BLOCKED_MOMO_DATA_FRESHNESS` |
+| P3 docs / automation contracts | `100%` | `DONE_WITH_MOMO_FRESHNESS_GATE` |
 | 110 host runtime | `fwupd-refresh.timer` intentionally disabled/inactive after non-runtime firmware metadata refresh failed units were classified; `systemctl --failed` returns `0 loaded units listed`; rollback is `sudo systemctl enable --now fwupd-refresh.timer` | `GREEN_WITH_FWUPD_TIMER_DISABLED` |
 | 110 host runaway process guard | 14:31-14:32 live scrape confirms `monitor_up=1`, orphan browser groups `0`, active Gitea Actions containers `2`, `load5_per_core≈0.79-0.81`, `swap_used_ratio≈1.0`, and `remediation_authorized=0`; exporter/helper also remain in Ansible textfile exporter source-of-truth. | `LIVE_SCRAPED_RUNTIME_GATE_0` |
 | 120 reachability | ping OK, SSH OK, boot around `2026-06-14 02:23`, K3s active, node `mon Ready` | `GREEN` |
@@ -168,8 +180,8 @@ The rule is simple: **recover the dependency chain, not the loudest symptom.**
 | `HOST_POWERED` | Host or VM appears to be powered on | console / hypervisor shows running, or LAN ARP entries start to appear | OS has finished booting |
 | `HOST_BOOTED` | OS has reached an interactive state | ping OK, SSH port open, `who -b` shows this boot's time | systemd / Docker / K3s are healthy |
 | `HOST_READY` | Host base services can carry the next layer | `systemctl is-system-running` is not degraded; failed units are explainable; cron / docker / DB / K3s are normal for the host's role | public route or business data is normal |
-| `SERVICE_READY` | Services hosted on the machine are available | service health, port, container health, DB / Redis / K3s / Harbor / Alertmanager checks pass | backup, schedules, alerts, and data consistency are verified |
-| `FULL_STACK_GREEN` | Reboot recovery may be declared complete | cold-start scorecard `WARN=0`, `BLOCKED=0`, and backup/offsite/DB/alert/schedule are all green | can never be declared while 120 is unreachable |
+| `SERVICE_READY` | Services hosted on the machine are available | service health, port, container health, DB / Redis / K3s / Harbor / Alertmanager checks pass | backup, schedules, alerts, data consistency, and data freshness are verified |
+| `FULL_STACK_GREEN` | Reboot recovery may be declared complete | cold-start scorecard `WARN=0`, `BLOCKED=0`, and backup/offsite/DB/alert/schedule/data freshness are all green | can never be declared while 120 is unreachable or MOMO business data is stale |
 
 The 2026-06-12 110/120 incident closure verdict is:
 
@@ -181,6 +193,17 @@ FULL_STACK_GREEN = yes, because cold-start scorecard is PASS=83 WARN=0 BLOCKED=0
 DR_COMPLETE = no, because credential escrow evidence is incomplete
 ```
 
+The 2026-06-24 MOMO stale-data verdict is:
+
+```text
+110 / 120 / 121 / 188 HOST_READY = yes
+Core public services SERVICE_READY = yes
+MOMO_DB_PARITY = yes
+MOMO_DATA_FRESH = no, because latest daily_sales_snapshot date is 2026-06-17 and stale age is 6 days
+FULL_STACK_GREEN = no, because cold-start scorecard is PASS=86 WARN=0 BLOCKED=1
+DR_COMPLETE = no, because credential escrow evidence is incomplete
+```
+
 All reports must use this vocabulary, to avoid misreporting "service surface available" as "overall DR complete".
 
 ---
@@ -291,7 +314,7 @@ Plan B is not another reboot flow that can bypass preflight, nor is it an incident
 The goal of Plan A is:
 
 ```text
-B4_FULL_STACK_GREEN: cold-start scorecard WARN=0 / BLOCKED=0; backup, offsite, DB, alert, scheduler, K3s, and public route are all green.
+B4_FULL_STACK_GREEN: cold-start scorecard WARN=0 / BLOCKED=0; backup, offsite, DB, alert, scheduler, K3s, public route, and business data freshness are all green.
 ```
 
 The goal of Plan B is:
@@ -345,7 +368,7 @@ Plan B's machine-readable contract is fixed in `ops/reboot-recovery/full-stack-cold-start-basel
 | `B1_HOST_RECOVERY_ONLY` | Only host-level recovery is complete | Target host ping / SSH / boot time / systemd base state can be determined; services are not yet fully verified. |
 | `B2_CORE_SERVICE_READY` | Core services are available, but the full dependency chain has not passed | public route, API, DB, or K3s main surfaces are available; backup / alert / scheduler / scorecard are not yet all green. |
 | `B3_SERVICE_AVAILABLE_DEGRADED` | Core services are available; cold-start has no hard block but still has WARN | cold-start `BLOCKED=0`; WARNs are explicitly listed and not silenced. |
-| `B4_FULL_STACK_GREEN` | This reboot recovery is complete | cold-start `PASS>0 WARN=0 BLOCKED=0`, backup / offsite / DB / alert / scheduler all green. |
+| `B4_FULL_STACK_GREEN` | This reboot recovery is complete | cold-start `PASS>0 WARN=0 BLOCKED=0`, backup / offsite / DB / alert / scheduler / data freshness all green. |
 | `B5_DR_COMPLETE` | DR is complete | `B4` plus credential escrow missing `0`, with complete restore / escrow / offsite evidence. |
 
 #### Plan B execution timeline
@@ -1208,11 +1231,15 @@ docker ps --format "{{.Names}}\t{{.Status}}" | head -120
 | 2 | Redis | `PONG` |
 | 3 | Docker / containerd | active; momo-db / signoz / openclaw / litellm not in a restart loop |
 | 4 | momo DB parity | current-month row counts and date bounds of `daily_sales_snapshot` and `realtime_sales_monthly` match |
+| 4a | momo Google Drive token writeback | `/home/ollama/momo-pro/config/google_token.json` owner matches the Docker userns scheduler UID, mode no wider than `600`; token content must not be read or printed |
+| 4b | momo business data freshness | latest `daily_sales_snapshot` date lagging `0-2` days is acceptable; lagging `3` days or more is `BLOCKED`, and full-stack green must not be declared even when the homepage / health / DB parity are all normal |
 | 5 | SignOz / monitoring bridge | HTTP 200; ClickHouse is not in a repair storm |
 | 6 | momo scheduler | container healthy, recent activity pattern > 0; heavy imports are released after the DB is green |
 | 7 | backup freshness | 188 backup textfile / 110 backup-from-188 freshness OK |
 
-For 188 post-reboot, "homepage 200" must not stand in for DB parity; if `posting list tuple ... cannot be split` appears, use only `REINDEX TABLE CONCURRENTLY public.realtime_sales_monthly;`; do not truncate or restore the whole database.
+For 188 post-reboot, "homepage 200" must not stand in for DB parity, and DB parity must not stand in for data freshness. If `posting list tuple ... cannot be split` appears, use only `REINDEX TABLE CONCURRENTLY public.realtime_sales_monthly;`; do not truncate or restore the whole database.
+
+2026-06-24 addendum: if `momo-scheduler` logs show `Google Drive 認證失敗` / `Permission denied: 'config/google_token.json'`, first check the UID mapped by the Docker user namespace. It has been verified that the scheduler process currently runs on the host as `100000:100000`, so the token file must be `100000:100000 600` for the Google client to refresh and write back the token. This step changes only the file owner/mode; it does not read, store, or paste the token value.
 
 ### 14.4.3 120 recovery command card
 
@@ -1652,6 +1679,8 @@ SOP update:
 
 For Worker / CronJob / queue services whose startup time may exceed the startup probe, do not declare production down just because the first `rollout status --timeout=60s` failed. Also check the deploy marker, image tag, pod readiness, container restart count, service health, and public route / API health. If the pod eventually becomes ready but CD is red, that is CI timeout / probe tuning work, not a service restart incident; the startup probe or post-deploy timeout should be adjusted afterward.
 
+2026-06-24 02:44 addendum: this section's 02:08 `PASS=85 WARN=0 BLOCKED=0` has been superseded by the MOMO data freshness gate in §14.28; that result must no longer be cited to declare full-stack green.
+
 ### 14.27 2026-06-24 188 node-exporter / backup health alert closure
 
 The second change on 2026-06-24 restored the 188 node-exporter textfile scrape. Both `backup-status` and cold-start could read a fresh 188 `backup_health.prom` over SSH, but with the Prometheus `node-exporter-188` scrape down, `BackupHealthMonitorMissing188` fires correctly. In this situation the alert must not be silenced; the exporter must be restored.
@@ -1682,6 +1711,42 @@ ssh ollama@192.168.0.188 'bash -s' < scripts/ops/188-node-exporter-restore.sh
 
 After restoring, recheck Prometheus / Alertmanager; do not silence directly.
 
+### 14.28 2026-06-24 MOMO Google Drive token and data freshness blocker
+
+The third change on 2026-06-24 brings "MOMO service is alive but data is not fresh" into the cold-start hard gate. This is not a website 502, nor a DB parity failure; the actual problem is that the Google Drive pending-import folder has no new source file, and after the reboot the token file ownership temporarily prevented the scheduler from refreshing the token.
+
+| Item | 2026-06-24 MOMO freshness baseline |
+|------|------------------------------------|
+| SOP version | `v1.29` |
+| Token root cause | Under the Docker user namespace the `momo-scheduler` host UID/GID is `100000:100000`; `google_token.json` was originally `1000:1000 600`, so the Google client got permission denied when it needed to write back the token |
+| Token live repair | `google_token.json` changed to `100000:100000 600`; only owner/mode changed, token value not read, printed, or stored |
+| Drive pending folder | `當日業績匯入`, pattern `即時業績_當日`, current matching Excel count `0` |
+| Drive archive folder | `當日業績匯入/已匯入`, latest matching file modifiedTime `2026-06-18T01:30:39Z`, already imported by import job `56` |
+| DB parity | `MOMO_MONTHLY_SYNC 10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17` |
+| Data freshness | `MOMO_DAILY_FRESHNESS 6|2026-06-17` |
+| Live cold-start readback | `PASS=86 WARN=0 BLOCKED=1`, result `BLOCKED` |
+| 110 live script sync | `/home/wooo/scripts/full-stack-cold-start-check.sh` hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8` |
+| Alert dedupe | `data_stale_alert` for `upstream_drive` has 24h dedupe; latest evidence was 2026-06-23 with last_date `2026-06-17` |
+| Declaration limit | May declare hosts/routes/K3s/backups recovered; must not declare MOMO data current, full-stack green, or DR complete |
+
+MOMO post-reboot minimal readback:
+
+```bash
+ssh ollama@192.168.0.188 '
+stat -c "%u:%g:%a %n" /home/ollama/momo-pro/config/google_token.json
+docker top momo-scheduler -eo pid,user,uid,gid,args | head -n 3
+docker logs --since 2h momo-scheduler 2>&1 | grep -E "AutoImport|Google Drive|Permission denied|沒有找到|發現檔案" | tail -80
+'
+
+ssh ollama@192.168.0.188 'db_user=$(docker exec momo-pro-system printenv POSTGRES_USER); db_name=$(docker exec momo-pro-system printenv POSTGRES_DB); db_pass=$(docker exec momo-pro-system printenv POSTGRES_PASSWORD); docker exec -i -e PGPASSWORD="$db_pass" momo-db psql -h 127.0.0.1 -U "$db_user" -d "$db_name" -At' <<'SQL'
+SELECT 'daily_sales_snapshot|' || count(*) || '|' || min(snapshot_date)::date || '|' || max(snapshot_date)::date FROM daily_sales_snapshot;
+SELECT 'realtime_sales_monthly|' || count(*) || '|' || min("日期")::date || '|' || max("日期")::date FROM realtime_sales_monthly;
+SELECT 'daily_freshness|' || (CURRENT_DATE - max(snapshot_date)::date) || '|' || max(snapshot_date)::date FROM daily_sales_snapshot;
+SQL
+```
+
+If the Drive pending folder has no new source file, do not manually truncate, do not re-import an old archive file to manufacture "latest" data, and do not treat DB parity as data freshness. The next evidence that lifts the blocker must be: a new `即時業績_當日` source file is visible, the import job succeeds, `sync_success=true`, the date bounds of `daily_sales_snapshot` and `realtime_sales_monthly` match, and `MOMO_DAILY_FRESHNESS <= 2`.
+
 ### 14.22 Post-reboot timeline verification
 
 After every reboot, advance along the timeline; do not wait until the end to make a single judgment.
diff --git a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
index 5e7692a0..6326e5ce 100644
--- a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
+++ b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
@@ -11,13 +11,13 @@
 
 | Area | Status | Completion | Evidence |
 |------|--------|------------|----------|
-| Overall recovery readiness | SERVICE_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-24 02:08 live cold-start read-only gate returned `PASS=85 WARN=0 BLOCKED=0`, result `GREEN`. 110 / 120 / 121 / 188 ping and SSH port are OK, K3s `mon` / `mon1` are Ready, `NODE_FS_ERROR_EVENTS=0`, `NODE_READONLY_FILESYSTEM_TRUE=0`, `NODE_DISK_PRESSURE_TRUE=0`, public routes/TLS are green, 110 / 188 runtime checks are green. K8s schedule readback distinguishes `FAILED_JOBS=1` / `STALE_FAILED_JOBS=1` / `ACTIVE_FAILED_JOBS=0`; the retained failed Job is historical evidence, not an active service blocker. DR remains blocked because credential escrow evidence markers are still missing and must not be forged. |
+| Overall recovery readiness | SERVICE_AVAILABLE_MOMO_SOURCE_BLOCKED_DR_ESCROW_BLOCKED | 97% | 2026-06-24 02:44 live cold-start read-only gate returned `PASS=86 WARN=0 BLOCKED=1`, result `BLOCKED`. 110 / 120 / 121 / 188 ping and SSH port are OK, K3s `mon` / `mon1` are Ready, `NODE_FS_ERROR_EVENTS=0`, `NODE_READONLY_FILESYSTEM_TRUE=0`, `NODE_DISK_PRESSURE_TRUE=0`, public routes/TLS are green, 110 / 188 runtime and backup checks are green. 188 `node-exporter` textfile scrape is restored. Remaining service blocker is MOMO business data freshness: `MOMO_DAILY_FRESHNESS 6|2026-06-17`; Drive listing works after token owner repair, but `當日業績匯入` has no newer `即時業績_當日` Excel source file. DR remains blocked because credential escrow evidence markers are still missing and must not be forged. |
 | P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-14 18:15 readback shows 120 is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, and cold-start P0/P1 checks are green. |
 | P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 93% | 2026-06-24 02:20 `backup-status` shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-23 20:53:42`. 02:24 restored 188 `node-exporter` textfile scrape; Prometheus now has `up{job="node-exporter-188"}=1` and `awoooi_backup_health_monitor_up{host="188"}=1`; `BackupHealthMonitorMissing188` resolved. DR remains blocked on real non-secret credential escrow evidence IDs. |
-| P2 service / data truth | VERIFIED_FULL_STACK_GREEN_FOR_SERVICE | 100% | 2026-06-24 02:08 cold-start verifies public route/TLS, API/Web route, momo health and current-month parity `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`, backup exporters, schedules, K3s node readiness/storage conditions, VIP, and 110 / 188 runtime health. K8s active failed Job count is `0`, bad pods are `0`, and cold-start returns `PASS=85 WARN=0 BLOCKED=0`. |
-| P3 docs / automation contracts | DONE_WITH_HEARTBEAT_NOISE_MOMO_AND_188_EXPORTER_CLOSURE | 100% | Workplan, SOP v1.28, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, 188 node-exporter restore helper, and 2026-06-24 live readback are updated. Production image `a84a5a0b` is live with API `2/2`, Web `2/2`, Worker `1/1`; CD `#3289` is a known false-negative caused by worker startup / rollout timeout after deploy marker `4a7b5329`. |
+| P2 service / data truth | BLOCKED_MOMO_DATA_FRESHNESS | 96% | Public route/TLS, API/Web route, momo health, current-month parity `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`, backup exporters, schedules, K3s node readiness/storage conditions, VIP, and 110 / 188 runtime health are green. However MOMO latest business date is `2026-06-17`; stale age is `6` days. Drive pending folder has `0` matching files and archive latest is the already-imported 2026-06-18 file, so there is no safe newer source to import. |
+| P3 docs / automation contracts | DONE_WITH_MOMO_FRESHNESS_GATE | 100% | Workplan, SOP v1.29, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, 188 node-exporter restore helper, MOMO Google Drive token userns readback, MOMO daily freshness blocker, and 2026-06-24 live readback are updated. Production image `a84a5a0b` is live with API `2/2`, Web `2/2`, Worker `1/1`; CD `#3289` is a known false-negative caused by worker startup / rollout timeout after deploy marker `4a7b5329`. |
 
-Full cold-start service readiness may be declared green for the latest verified evidence set. As of 2026-06-24 02:08, services are green with `PASS=85 WARN=0 BLOCKED=0`; the retained stale failed Job remains historical evidence only. Do not declare DR scorecard complete while credential escrow evidence remains blocked.
+Full cold-start service readiness may not be declared green for the latest verified evidence set.
As of 2026-06-24 02:44, routes/hosts/K3s/backups are available, but the scorecard is `PASS=86 WARN=0 BLOCKED=1` because MOMO business data freshness is stale beyond 3 days. Do not declare DR scorecard complete while credential escrow evidence remains blocked.
 
 2026-06-13 01:26 refresh: full cold-start is again green for the current evidence set. AWOOOI API/Web workload balancing survived the next normal CD deploy: Gitea main `e4a349bc`, ArgoCD revision `e4a349bc`, images from `414413a5`, API/Web split across `mon` / `mon1`, and global `known_hosts` retained 120 / 188 after CD fix `80e6ec1a`. Do not declare DR complete while credential escrow is missing. `km-vectorize` remediation is `90%`: schedule/label fix is live, and the remaining gate is the next official 03:00 CronJob success readback.
 
@@ -155,7 +155,7 @@ Next:
 | ID | Status | % | Work item | Fine analysis | Next action | Done criteria |
 |----|--------|---:|-----------|---------------|-------------|---------------|
 | P2-001 | VERIFIED | 100 | Public route smoke | 2026-06-12 18:57 cold-start confirms all listed domains returned expected 2xx/3xx over HTTPS; registry root route returned 200 in the scorecard and `/v2/` remains the normal unauthenticated 401 pattern from earlier checks. This proves ingress/TLS plus current route availability. | Keep as one row in scorecard. | Public route table updated after each reboot. |
-| P2-002 | VERIFIED | 100 | momo latest/current-month parity | Latest current-month scorecard check: both tables have 4571 rows and matching bounds from `2026-06-01` through `2026-06-07`. | Keep daily check in cold-start SOP. | Latest snapshot/current-month row count and bounds match. |
+| P2-002 | BLOCKED_MOMO_DATA_FRESHNESS | 96 | momo latest/current-month parity and freshness | Latest current-month parity is good: `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`.
However latest business data is stale: `MOMO_DAILY_FRESHNESS 6|2026-06-17`; Drive pending folder `當日業績匯入` has `0` matching `即時業績_當日` Excel files after token owner repair. | Wait for or obtain a newer legitimate source file, then verify import job `sync_success=true`, archive movement, table bounds, and `MOMO_DAILY_FRESHNESS <= 2`. | Snapshot/current-month row count and bounds match, source folder has no unprocessed stale file, and daily freshness is within threshold. |
 | P2-003 | VERIFIED | 95 | Fix momo job semantics | `/Users/ogt/momo-pro-system/services/import_service.py` and live `/home/ollama/momo-pro/services/import_service.py` now mark monthly sync failure as `failed`, write `drive_file_movable=false`, return `False`, emit a failure alert path, and make auto-import aggregate failures as `success=false`. Live 188 backup: `services/import_service.py.bak.20260604-152827`; live hash after patch: `3fc45671986fa4cc155119f588bc1ebefd272927730052e42e2b9eb4352b2586`. | Watch the next real Google Drive import and confirm no file moves unless both tables sync; keep canonical source-control reconciliation open as a separate supply-chain task. | Live isolated temp-DB/real-Excel test passes; containers reloaded healthy; Telegram token/chat markers are present without exposing secrets; latest DB parity remains 404/404. |
 | P2-004 | DONE | 100 | PostgreSQL index corruption runbook path | SOP v1.2 now states `posting list tuple ... cannot be split` is an index repair incident. | Use only concurrent reindex if the error returns. | No truncate, no whole DB restore; `REINDEX TABLE CONCURRENTLY public.realtime_sales_monthly;` and idempotent resync evidence recorded. |
 | P2-005 | VERIFIED | 100 | Do not rely on route 200 only | 2026-06-12 closeout has route + DB + backup + offsite + schedule + alert + K3s + cold-start scorecard evidence. The only remaining blocker is DR credential escrow, outside service availability. | Keep this cross-surface checklist mandatory after every reboot. | Each reboot record has route, DB, backup, schedules, alert, scorecard rows. |
@@ -169,13 +169,13 @@ Next:
 | ID | Status | % | Work item | Fine analysis | Next action | Done criteria |
 |----|--------|---:|-----------|---------------|-------------|---------------|
 | P3-001 | VERIFIED | 100 | Confirm hardening commit | Gitea `main` currently points to `0260ec89...`; `git merge-base --is-ancestor ae7b39d9 0260ec89...` returned true. | Keep evidence in LOGBOOK. | Gitea main contains `ae7b39d9 fix(ops): harden reboot recovery and backup alerts`. |
-| P3-002 | VERIFIED | 100 | Confirm live 110 scripts | All six required scripts exist under `/home/wooo/scripts/`; cold-start script hash `31321428207308d6c159fabb679d9f1d0848194b8e6d7eb7b04a2c05779ade46` is live on 110. | Record in LOGBOOK. | Script paths and hashes recorded. |
+| P3-002 | VERIFIED | 100 | Confirm live 110 scripts | All required recovery/check scripts exist under `/home/wooo/scripts/`; cold-start script hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8` is live on 110 after the MOMO freshness gate update. | Record every live script hash change in LOGBOOK and SOP. | Script paths and hashes recorded. |
 | P3-003 | DONE | 100 | Reconcile 188 nginx Ansible baseline | Live 188 already routes `aiops.wooo.work` through VIP; the Ansible template matches that route and has no 120 upstream for aiops. `nginx-sync.yml` now also carries the `188-internal-tools-https.conf.j2` source-of-truth path, and `ansible-validate.sh` syntax-check passes with repo-local roles path. | Run only approved dry-run/apply from the normal Ansible environment before changing live nginx. | Template and live config agree; no 120 upstream for aiops; repo-side syntax and readiness contract pass. |
 | P3-004 | DONE | 100 | Update `docs/LOGBOOK.md` | Live blocker and new docs are recorded. | Keep this entry updated after each recovery phase. | LOGBOOK has current recovery status and next actions. |
 | P3-005 | DONE | 100 | Update cold-start SOP | SOP now includes start, shutdown, reboot, record, comparison, and 120 blocker handling. | Increment SOP version after each process change. | SOP has controlled power-operation sections and ledger template. |
 | P3-006 | DONE | 100 | Update backup status | Backup status now reflects current cron, rclone latest-only, failure-only alert posture, and escrow blocker. | Refresh after 120 backup rerun. | Backup status no longer claims noisy success Telegram notifications. |
 | P3-007 | DONE | 100 | Harden Gitea backup stale dump handling | 2026-06-05 manual Gitea backup failed because the container retained `/tmp/gitea-dump.zip` from the 02:00 failure. `scripts/backup/backup-gitea.sh` now renames stale container dump files to timestamped evidence before running a new dump, and the live 110 script is updated. | Watch the next 02:00 Gitea backup. | `bash -n` passes locally and on 110; manual Gitea backup completed after stale evidence rename. |
-| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.26 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, repo-side readiness audit blocker closure, stale-vs-active K8s failed Job classification, 2026-06-18 live cold-start GREEN readback, post-reboot / post-CD recovery anchors, AA/AS judgment, workload distribution judgment, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, 110 runaway browser / CI load-diversion PlayBook, and allowed declaration wording. | Use v1.26 for the next reboot record, then compare actual timing, Plan B trigger, degraded level, failed/stale/active Job counters, runaway-process metrics, CI load attribution, and blockers against §1.4 plus §11.1 / §14.8 through §14.25. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow / runaway-process checks. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN_FOR_SERVICE`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN_FOR_SERVICE`, and `B5_DR_COMPLETE`; repo-side readiness audit checks runaway process exporter / alerts / gated remediation helper, and live cold-start returns `PASS=84 WARN=0 BLOCKED=0`. |
+| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.29 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, stale-vs-active K8s failed Job classification, post-reboot / post-CD recovery anchors, AA/AS judgment, workload distribution judgment, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, 110 runaway browser / CI load-diversion PlayBook, healthy-heartbeat suppression, 188 node-exporter restore, MOMO Google Drive token userns readback, and MOMO data freshness hard blocker. | Use v1.29 for the next reboot record, then compare actual timing, Plan B trigger, degraded level, failed/stale/active Job counters, runaway-process metrics, CI load attribution, MOMO source availability, data freshness, and blockers against §1.4 plus §11.1 / §14.8 through §14.28.
Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow / runaway-process checks. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN`, and `B5_DR_COMPLETE`; live cold-start now returns `PASS=86 WARN=0 BLOCKED=1` when MOMO data freshness is stale, preventing false green. | | P3-009 | DONE | 100 | Assess 120/121 AA/AS role and host load balancing | 2026-06-12 15:19 live check confirms 120 and 121 are both `Ready control-plane`, `k3s active`, `k3s-agent inactive`, with no taints; however most AWOOOI / ArgoCD / Velero workload remains on 121 after 120 fsck recovery. New assessment defines control-plane AA vs workload AA, migration candidates from 110/188, and stateful migration blockers. | After P0 backup/offsite/cold-start green, implement topology spread for AWOOOI API/Web before moving additional services. | `docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md` exists; SOP v1.6 links AA/AS and load-balancing checks; migration implementation remains explicitly `0%`. | | P3-010 | DONE | 100 | Update workload balancing docs with 2026-06-13 live truth | Host role assessment, workplan, SOP, backup status, and LOGBOOK are refreshed with current cold-start, backup, 188 certbot degraded, ArgoCD `km-vectorize` degraded, Gitea main `acaae999`, ArgoCD sync, and final pod placement evidence. | Keep updating this file after the next reboot or deploy. | Docs separate service-green status from DR escrow, workload rollout, and non-service governance debt. | | P3-011 | DONE | 100 | Record `km-vectorize` remediation status | LOGBOOK, this workplan, and SOP now state the schedule/label fix, ArgoCD sync evidence, the invalid manual Job boundary, and the 90% waiting-for-next-schedule gate. 
| After next 03:00 run, update this row and the top verdict with `lastSuccessfulTime` / ArgoCD health evidence. | No document claims ArgoCD green before official CronJob success evidence exists. |
diff --git a/scripts/reboot-recovery/full-stack-cold-start-check.sh b/scripts/reboot-recovery/full-stack-cold-start-check.sh
index 53a3a095..e5c722c5 100755
--- a/scripts/reboot-recovery/full-stack-cold-start-check.sh
+++ b/scripts/reboot-recovery/full-stack-cold-start-check.sh
@@ -467,15 +467,21 @@ echo "SCHEDULER_CONTAINER_RUNNING $(docker inspect -f "{{.State.Running}}" momo-
 echo "SCHEDULER_CONTAINER_HEALTH $(docker inspect -f "{{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}" momo-scheduler 2>/dev/null || true)"
 echo "SCHEDULER_REGISTERED $(docker logs --tail 400 momo-scheduler 2>&1 | grep -Ec "全部排程任務已註冊|排程任務已註冊|Scheduler started|APScheduler" || true)"
 echo "SCHEDULER_RECENT_ACTIVITY $(docker logs --since 2h momo-scheduler 2>&1 | grep -Ec "AutoImport|Meta-Analysis|Scheduler|排程|任務|批次 [0-9]+: 取得|\\[Feeder\\]|HITL|候選屬" || true)"
+token_stat=$(stat -c "%u:%g:%a" /home/ollama/momo-pro/config/google_token.json 2>/dev/null || true)
+scheduler_uid=$(docker top momo-scheduler -eo pid,user,uid 2>/dev/null | awk "NR==2 {print \$3}" || true)
+echo "MOMO_GDRIVE_TOKEN_STAT ${token_stat:-missing} scheduler_uid=${scheduler_uid:-unknown}"
 db_user=$(docker exec momo-pro-system printenv POSTGRES_USER 2>/dev/null || true)
 db_name=$(docker exec momo-pro-system printenv POSTGRES_DB 2>/dev/null || true)
 db_pass=$(docker exec momo-pro-system printenv POSTGRES_PASSWORD 2>/dev/null || true)
 if [ -n "$db_user" ] && [ -n "$db_name" ] && [ -n "$db_pass" ]; then
 momo_sync=$(docker exec -e PGPASSWORD="$db_pass" -e PGCONNECT_TIMEOUT=5 momo-db psql -h 127.0.0.1 -U "$db_user" -d "$db_name" -Atc "WITH scope AS (SELECT min(snapshot_date::date) dmin, max(snapshot_date::date) dmax, count(*) sc FROM daily_sales_snapshot WHERE snapshot_date::date >= make_date(extract(year from current_date)::int, extract(month from current_date)::int, 1)), monthly AS (SELECT count(*) mc, min(\"日期\"::date) mmin, max(\"日期\"::date) mmax FROM realtime_sales_monthly, scope WHERE scope.sc > 0 AND \"日期\"::date BETWEEN scope.dmin AND scope.dmax) SELECT coalesce(scope.sc,0)::text || chr(124) || coalesce(monthly.mc,0)::text || chr(124) || coalesce(scope.dmin::text,chr(45)) || chr(124) || coalesce(scope.dmax::text,chr(45)) || chr(124) || coalesce(monthly.mmin::text,chr(45)) || chr(124) || coalesce(monthly.mmax::text,chr(45)) FROM scope, monthly;" 2>/dev/null || true)
+ momo_freshness=$(docker exec -e PGPASSWORD="$db_pass" -e PGCONNECT_TIMEOUT=5 momo-db psql -h 127.0.0.1 -U "$db_user" -d "$db_name" -Atc "SELECT coalesce((current_date - max(snapshot_date::date))::text, chr(45)) || chr(124) || coalesce(max(snapshot_date::date)::text, chr(45)) FROM daily_sales_snapshot;" 2>/dev/null || true)
 else
 momo_sync=""
+ momo_freshness=""
 fi
 echo "MOMO_MONTHLY_SYNC ${momo_sync:-unavailable}"
+echo "MOMO_DAILY_FRESHNESS ${momo_freshness:-unavailable}"
 ' 2>&1); then
 echo "$out"
 grep -q "CRON_188 active" <<<"$out" && ok "188 cron active" || warn "188 cron not confirmed"
@@ -494,7 +500,15 @@ echo "MOMO_MONTHLY_SYNC ${momo_sync:-unavailable}"
 else
 warn "188 momo scheduler registration/activity not confirmed"
 fi
+ awk '/MOMO_GDRIVE_TOKEN_STAT / {split($2,a,":"); split($3,b,"="); exit !(a[1] == b[2] && a[3] <= 600)}' <<<"$out" && ok "188 momo Google Drive token ownership matches scheduler userns" || warn "188 momo Google Drive token ownership/writeback not confirmed"
 awk '/MOMO_MONTHLY_SYNC / {split($2,a,"|"); exit !(a[1] > 0 && a[1] == a[2] && a[3] == a[5] && a[4] == a[6])}' <<<"$out" && ok "188 momo current-month snapshot and realtime tables match" || warn "188 momo current-month snapshot/realtime sync not confirmed"
+ if awk '/MOMO_DAILY_FRESHNESS / {split($2,a,"|"); exit !(a[1] ~ /^[0-9]+$/ && a[1] >= 0 && a[1] <= 2)}' <<<"$out"; then
+ ok "188 momo daily sales data fresh enough"
+ elif awk '/MOMO_DAILY_FRESHNESS / {split($2,a,"|"); exit !(a[1] ~ /^[0-9]+$/ && a[1] >= 3)}' <<<"$out"; then
+ fail "188 momo daily sales data stale beyond 3 days"
+ else
+ warn "188 momo daily sales freshness not confirmed"
+ fi
 else
 warn "188 schedule check unavailable"
 echo "$out"