docs(ops): record heartbeat noise and cold-start detector closure [skip ci]

2026-06-24 02:19:30 +08:00
parent 4a7b532962
commit 8aeeadbde1
4 changed files with 90 additions and 9 deletions
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -1,3 +1,27 @@
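+## 2026-06-24｜Telegram healthy-heartbeat noise reduction and cold-start MOMO detector fix
+
+**Background**: The Telegram group received `AWOOOI 心跳 / 告警鏈路: ✅ 正常` (heartbeat / alert pipeline: ✅ OK) every 30 minutes, so healthy signals flooded the channel. Separately, the 2026-06-24 01:45 live cold-start run already reported `BLOCKED=0` but still raised two WARNs because the MOMO scheduler log pattern and the DB read method were outdated. Both produce false noise / false warnings in the reboot SOP and had to be fixed.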
+
+**Completed work**:
+- `a84a5a0b fix(api): suppress healthy Telegram heartbeat noise` has been pushed, and deploy marker `4a7b5329 chore(cd): deploy a84a5a0 [skip ci]` has been written back to `gitea/main`.
+- Production API / Web / Worker images are all `a84a5a0bc4a672ac6feb95a85ac590aa2dd4bb71`; readiness is API `2/2`, Web `2/2`, Worker `1/1`.
+- A healthy heartbeat with no warnings now only updates the Redis suppression marker / log and is no longer sent to Telegram; warnings are still sent, and one recovery message is still sent when a warning state returns to healthy.
+- As a temporary mitigation, `heartbeat:silent_last_sent` and `heartbeat:healthy_suppressed_last_seen` were set in production Redis with a 24-hour TTL so that the next healthy heartbeat would not keep flooding the group before CD finished; no token was read or printed.
+- The live `/home/wooo/scripts/full-stack-cold-start-check.sh` has been synced with the MOMO detector fix; the backup is `/home/wooo/scripts/full-stack-cold-start-check.sh.before-momo-detector-20260624-020759`, and the new hash is `47e67d0c018f741acfba17a93cb1d668779bd08745902099a10ee61e73ea55b6`.
+- The repo version of `full-stack-cold-start-check.sh` also hardens the K3s node Ready / storage condition checks and the MOMO scheduler / current-month DB parity read; the DB password is passed to `docker exec` only via `PGPASSWORD`, and no secret is printed.
+
+**Live verification**:
+- 2026-06-24 02:08 cold-start: `PASS=85 WARN=0 BLOCKED=0`, result `GREEN`.
+- MOMO evidence: `SCHEDULER_CONTAINER_RUNNING true`, `SCHEDULER_CONTAINER_HEALTH healthy`, `SCHEDULER_RECENT_ACTIVITY 1303`, `MOMO_MONTHLY_SYNC 10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`.
+- K3s evidence: `mon` / `mon1` Ready, VIP `192.168.0.125` present, `NODE_FS_ERROR_EVENTS 0`, `NODE_READONLY_FILESYSTEM_TRUE 0`, `NODE_DISK_PRESSURE_TRUE 0`.
+- K8s jobs: `FAILED_JOBS 1`, `STALE_FAILED_JOBS 1`, `ACTIVE_FAILED_JOBS 0`; the historical failed Job is retained as evidence and is not treated as an active blocker.
+- Production health: `/api/v1/health` returns `healthy / prod / mock_mode=false`; `postgresql`, `redis`, `ollama`, `openclaw`, `signoz`, `ollama_gcp_a`, and `ollama_gcp_b` are all up; `ollama_local` is still reported as a downstream provider status and does not affect the API's overall healthy result.
+
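+**Notes**:
+- Gitea CD `#3289` ended up marked Failure; the root cause is a false negative from the worker startup probe / rollout status timeout. K8s had actually completed the rollout, and a deploy marker exists. The SOP needs a follow-up adjustment to the worker startup window / CD post-deploy timeout; this kind of false red light must not be treated as a production outage, nor may it be ignored.
+- This change only addresses healthy-heartbeat noise and the cold-start detector false warnings; it does not mean Telegram warning / recovery alerts are muted.
+- DR still cannot be declared complete: credential escrow evidence missing must remain `5`; do not forge it or put any secret value into the repo / chat.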
+
 ## 2026-06-19｜Governance page public process wording and enum sanitizer production verification complete

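 **Background**: The previous round of governance-page public-display cleanup made the main cards visible, but the production desktop smoke test still found residual free text from deep snapshots, such as `live worker`, `Direct Bot API`, `dual approval`, and `owner approval`, and `MISSING_MESSAGE` appeared because the sanitizer translated enum values into Chinese first. This round converges the messages and the governance tab sanitizer, and completes desktop / mobile readback after the production deployment.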
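Editor's sketch for the 2026-06-24 entry above: the healthy-heartbeat rule (suppress healthy heartbeats, still send warnings, send one recovery message) can be expressed as a small decision function. This is a minimal illustration, not the code in `a84a5a0b`; the names `Action`, `decide_heartbeat_action`, and `previous_had_warning` are hypothetical, and the Redis marker handling is deliberately left out.

```python
from enum import Enum
from typing import Sequence


class Action(Enum):
    SUPPRESS = "suppress"            # only update the suppression marker / log
    SEND_WARNING = "send_warning"    # warnings still go to Telegram
    SEND_RECOVERY = "send_recovery"  # one message when a warning clears


def decide_heartbeat_action(
    status: str,
    warnings: Sequence[str],
    previous_had_warning: bool,
) -> Action:
    """Hypothetical decision rule; how the previous state is stored is assumed."""
    if status != "healthy" or warnings:
        return Action.SEND_WARNING
    if previous_had_warning:
        # Healthy again after a warning: keep exactly one recovery message.
        return Action.SEND_RECOVERY
    # Healthy with no warnings: do not send anything to Telegram.
    return Action.SUPPRESS
```

Keeping the decision separate from the transport makes it easy to check that suppression never swallows a warning or a recovery message.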
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -1628,6 +1628,30 @@ SOP update:
 | Declaration limit | May declare `FULL_STACK_GREEN_FOR_SERVICE`; must not declare `DR_COMPLETE`, `credential escrow complete`, or any runtime/security acceptance |
 | SOP change | v1.25 defines retained failed Job evidence vs active failed Job blocker; future reboot comparison must record all three counters |

+### 14.26 2026-06-24 heartbeat noise / MOMO detector / rollout false-negative closure
+
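+The 2026-06-24 change was not a host reboot; it closes two kinds of false signal in the reboot SOP: healthy Telegram heartbeats no longer flood the group every 30 minutes, and the MOMO scheduler / current-month parity detector no longer raises false WARNs because of an outdated log pattern or the wrong DB exec user. This anchor also records the CD rollout false negative: the worker startup probe timed out on the first attempt and the container restarted once; K8s eventually became ready, but Gitea CD `#3289` was marked Failure because of the rollout status timeout.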
+
+| Item | 2026-06-24 live baseline |
+|------|--------------------------|
+| SOP version | `v1.27` |
+| Heartbeat code | `a84a5a0b fix(api): suppress healthy Telegram heartbeat noise` |
+| Deploy marker | `4a7b5329 chore(cd): deploy a84a5a0 [skip ci]` |
+| Production image readback | API/Web/Worker image tag `a84a5a0bc4a672ac6feb95a85ac590aa2dd4bb71` |
+| Production rollout | API `2/2`, Web `2/2`, Worker `1/1` Ready |
+| CD result caveat | Gitea CD `#3289` shows Failure because worker rollout status timed out before old replica convergence; K8s deploy marker and production readiness are green |
+| Healthy heartbeat rule | When `status=healthy` and there are no warnings, only the suppression marker / log is updated and nothing is sent to Telegram; warnings and recovery messages can still be sent |
+| Live temporary suppression | Redis keys `heartbeat:silent_last_sent` and `heartbeat:healthy_suppressed_last_seen` set with 24h TTL during deployment; no token or secret printed |
+| 110 live script sync | `/home/wooo/scripts/full-stack-cold-start-check.sh` hash `47e67d0c018f741acfba17a93cb1d668779bd08745902099a10ee61e73ea55b6`; previous version backed up to `/home/wooo/scripts/full-stack-cold-start-check.sh.before-momo-detector-20260624-020759` |
+| MOMO scheduler evidence | `SCHEDULER_CONTAINER_RUNNING true`, `SCHEDULER_CONTAINER_HEALTH healthy`, `SCHEDULER_RECENT_ACTIVITY 1303` |
+| MOMO DB parity evidence | `MOMO_MONTHLY_SYNC 10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17` |
+| K3s node evidence | `NODE_FS_ERROR_EVENTS 0`, `NODE_READONLY_FILESYSTEM_TRUE 0`, `NODE_DISK_PRESSURE_TRUE 0`, VIP `192.168.0.125` present |
+| Live cold-start readback | `PASS=85 WARN=0 BLOCKED=0`, result `GREEN` |
+| Declaration limit | May declare the current service recovery scorecard green; must not declare `DR_COMPLETE`; credential escrow evidence missing remains `5` |
+| SOP change | v1.27 requires heartbeat success-message suppression, MOMO detector parity using app-provided DB env, and rollout false-negative classification before retrying CD |
+
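+For Worker / CronJob / queue-type services whose startup time may exceed the startup probe, a single failed `rollout status --timeout=60s` is not enough to conclude that production is down. Check the deploy marker, image tag, pod readiness, container restart count, service health, and public route / API health together. If the pod eventually becomes ready but CD is red, this is CI timeout / probe tuning work, not a service restart incident; the startup probe or post-deploy timeout should be adjusted afterwards.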
+
 ### 14.22 Post-reboot timeline verification

 After every reboot, proceed along the timeline; do not wait until the end to make a single judgment.
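Editor's sketch for section 14.26 above: the rollout false-negative classification can be read as a check over several independent evidence items rather than the CD result alone. The sketch below is illustrative only and is not part of the SOP tooling; `RolloutEvidence`, `classify_rollout`, and the three result labels are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RolloutEvidence:
    cd_succeeded: bool           # CD pipeline result (e.g. rollout status exit)
    deploy_marker_present: bool
    image_tag_matches: bool
    pods_ready: bool
    service_healthy: bool
    public_route_healthy: bool
    restart_count: int           # recorded for the ledger, not a gate here


def classify_rollout(e: RolloutEvidence) -> str:
    """Classify a deploy from all evidence, not only the CD light."""
    live_green = all([
        e.deploy_marker_present,
        e.image_tag_matches,
        e.pods_ready,
        e.service_healthy,
        e.public_route_healthy,
    ])
    if live_green and e.cd_succeeded:
        return "GREEN"
    if live_green:
        # Service is up but CD is red: probe / timeout tuning work.
        return "CD_FALSE_NEGATIVE"
    return "INVESTIGATE"
```

Under these assumptions, a deploy like CD `#3289` (red pipeline, one worker restart, everything else green) would be classified as `CD_FALSE_NEGATIVE` rather than as an outage.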
--- a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
+++ b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
@@ -11,13 +11,13 @@

 | Area | Status | Completion | Evidence |
 |------|--------|------------|----------|
-| Overall recovery readiness | SERVICE_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-18 13:43 live cold-start read-only gate returned `PASS=84 WARN=0 BLOCKED=0`, result `GREEN`. 110 / 120 / 121 / 188 ping and SSH port are OK, K3s `mon` / `mon1` are Ready, `NODE_FS_ERROR_EVENTS=0`, public routes/TLS are green, 110 backup health is `13/13 fresh failed=0`, 188 backup health is `2/2 fresh failed=0`. K8s schedule readback now distinguishes `FAILED_JOBS=1` / `STALE_FAILED_JOBS=1` / `ACTIVE_FAILED_JOBS=0`; the retained `km-vectorize-29689620` failed Job is historical evidence, not an active service blocker. DR remains blocked because credential escrow evidence markers are still missing and must not be forged. |
+| Overall recovery readiness | SERVICE_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-24 02:08 live cold-start read-only gate returned `PASS=85 WARN=0 BLOCKED=0`, result `GREEN`. 110 / 120 / 121 / 188 ping and SSH port are OK, K3s `mon` / `mon1` are Ready, `NODE_FS_ERROR_EVENTS=0`, `NODE_READONLY_FILESYSTEM_TRUE=0`, `NODE_DISK_PRESSURE_TRUE=0`, public routes/TLS are green, 110 / 188 runtime checks are green. K8s schedule readback distinguishes `FAILED_JOBS=1` / `STALE_FAILED_JOBS=1` / `ACTIVE_FAILED_JOBS=0`; the retained failed Job is historical evidence, not an active service blocker. DR remains blocked because credential escrow evidence markers are still missing and must not be forged. |
 | P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-14 18:15 readback shows 120 is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, and cold-start P0/P1 checks are green. |
 | P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 92% | 2026-06-15 03:11 `backup-status` shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-15 02:40:13`. Offsite / escrow report shows `SCRIPT_MISSING_COUNT=0`, `OFFSITE_CONFIGURED=1`, `RCLONE_CONFIGURED=1`, `ESCROW_MISSING_COUNT=5`. Owner request package is ready; actual marker write remains blocked on real non-secret evidence IDs. |
-| P2 service / data truth | VERIFIED_FULL_STACK_GREEN_FOR_SERVICE | 100% | 2026-06-18 13:43 cold-start verifies public route/TLS, API/Web route, momo health and current-month parity `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`, backup exporters, schedules, K3s node readiness, VIP, and 110 / 188 runtime health. K8s active failed Job count is `0`, bad pods are `0`, and cold-start returns `PASS=84 WARN=0 BLOCKED=0`. |
-| P3 docs / automation contracts | DONE_WITH_RUNAWAY_PROCESS_AIOPS_LIVE_SCRAPED | 100% | Workplan, SOP v1.26, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, and 2026-06-18 live readback are updated. 14:31-14:32 Prometheus scrape confirms 110 `monitor_up=1`, orphan browser group count `0`, active CI containers `2`, load5/core around `0.79-0.81`, swap ratio around `1.0`, `remediation_authorized=0`, and missing/orphan alerts not firing. Repo-side readiness audit also checks runaway process exporter / remediation helper / alert group; live cold-start remains `PASS=84 WARN=0 BLOCKED=0` from the latest service readiness readback. |
+| P2 service / data truth | VERIFIED_FULL_STACK_GREEN_FOR_SERVICE | 100% | 2026-06-24 02:08 cold-start verifies public route/TLS, API/Web route, momo health and current-month parity `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`, backup exporters, schedules, K3s node readiness/storage conditions, VIP, and 110 / 188 runtime health. K8s active failed Job count is `0`, bad pods are `0`, and cold-start returns `PASS=85 WARN=0 BLOCKED=0`. |
+| P3 docs / automation contracts | DONE_WITH_HEARTBEAT_NOISE_AND_MOMO_DETECTOR_CLOSURE | 100% | Workplan, SOP v1.27, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, and 2026-06-24 live readback are updated. Production image `a84a5a0b` is live with API `2/2`, Web `2/2`, Worker `1/1`; CD `#3289` is a known false-negative caused by worker startup / rollout timeout after deploy marker `4a7b5329`. |

-Full cold-start service readiness may be declared green for the latest verified evidence set. As of 2026-06-18 13:43, services are green with `WARN=0` and `BLOCKED=0`; the retained stale `km-vectorize` failed Job remains historical evidence only. Do not declare DR scorecard complete while credential escrow evidence remains blocked.
+Full cold-start service readiness may be declared green for the latest verified evidence set. As of 2026-06-24 02:08, services are green with `PASS=85 WARN=0 BLOCKED=0`; the retained stale failed Job remains historical evidence only. Do not declare DR scorecard complete while credential escrow evidence remains blocked.

 2026-06-13 01:26 refresh: full cold-start is again green for the current evidence set. AWOOOI API/Web workload balancing survived the next normal CD deploy: Gitea main `e4a349bc`, ArgoCD revision `e4a349bc`, images from `414413a5`, API/Web split across `mon` / `mon1`, and global `known_hosts` retained 120 / 188 after CD fix `80e6ec1a`. Do not declare DR complete while credential escrow is missing. `km-vectorize` remediation is `90%`: schedule/label fix is live, and the remaining gate is the next official 03:00 CronJob success readback.

--- a/scripts/reboot-recovery/full-stack-cold-start-check.sh
+++ b/scripts/reboot-recovery/full-stack-cold-start-check.sh
@@ -316,8 +316,27 @@ kcmd() {
 }
 kcmd get nodes -o wide 2>/dev/null || true
 kcmd get pods -n awoooi-prod -o wide 2>/dev/null || true
+node_condition_summary=$(kcmd get nodes -o json 2>/dev/null | python3 -c "import json,sys
+try:
+  d=json.load(sys.stdin)
+except Exception:
+  d={\"items\": []}
+not_ready=readonly=disk_pressure=0
+for node in d.get(\"items\", []):
+  conds={c.get(\"type\"): c.get(\"status\") for c in node.get(\"status\",{}).get(\"conditions\",[]) or []}
+  if conds.get(\"Ready\") != \"True\":
+    not_ready += 1
+  if conds.get(\"ReadonlyFilesystem\") == \"True\":
+    readonly += 1
+  if conds.get(\"DiskPressure\") == \"True\":
+    disk_pressure += 1
+print(f\"NODE_NOT_READY {not_ready}\")
+print(f\"NODE_READONLY_FILESYSTEM_TRUE {readonly}\")
+print(f\"NODE_DISK_PRESSURE_TRUE {disk_pressure}\")" || true)
+printf "%s\n" "$node_condition_summary"
 node_fs_events=$(kcmd get events -A --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp 2>/dev/null \
-  | grep -Eic "filesystem|fsck|I/O error|read-only file system|Structure needs cleaning|orphan linked list|EXT4-fs|xfs" || true)
+  | grep -Eiv "InvalidDiskCapacity|image filesystem" \
+  | grep -Eic "fsck|I/O error|read-only file system|Structure needs cleaning|orphan linked list|EXT4-fs.*error|XFS.*(corruption|metadata)|Remounting filesystem read-only" || true)
 echo "NODE_FS_ERROR_EVENTS ${node_fs_events:-0}"
 ip addr show | grep 192.168.0.125 || true
 ' 2>&1); then
@@ -339,7 +358,14 @@ ip addr show | grep 192.168.0.125 || true

  grep -q "PG188_PORT OPEN" <<<"$out" && ok "120 can reach 188 PostgreSQL port" || fail "120 cannot reach 188 PostgreSQL"
  grep -q " Ready " <<<"$out$local_kubectl_out" && ok "K3s has Ready node output" || fail "K3s nodes not Ready or kubectl unavailable"
-  grep -q "NODE_FS_ERROR_EVENTS 0" <<<"$out" && ok "K3s node filesystem error events absent" || fail "K3s node filesystem error events present"
+  grep -q "NODE_NOT_READY 0" <<<"$out" && ok "K3s node Ready condition clean" || fail "K3s node Ready condition not clean"
+  if grep -q "NODE_FS_ERROR_EVENTS 0" <<<"$out" \
+    && grep -q "NODE_READONLY_FILESYSTEM_TRUE 0" <<<"$out" \
+    && grep -q "NODE_DISK_PRESSURE_TRUE 0" <<<"$out"; then
+    ok "K3s node storage conditions clean"
+  else
+    fail "K3s node storage condition or severe filesystem event present"
+  fi
  grep -q "192.168.0.125" <<<"$out" && ok "VIP 192.168.0.125 present on 120" || warn "VIP not confirmed on 120"
 }

@@ -439,9 +465,16 @@ if [ -f /home/ollama/node_exporter_textfiles/storage_health.prom ]; then
 fi
 echo "SCHEDULER_CONTAINER_RUNNING $(docker inspect -f "{{.State.Running}}" momo-scheduler 2>/dev/null || true)"
 echo "SCHEDULER_CONTAINER_HEALTH $(docker inspect -f "{{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}" momo-scheduler 2>/dev/null || true)"
-echo "SCHEDULER_REGISTERED $(docker logs --tail 200 momo-scheduler 2>&1 | grep -c "全部排程任務已註冊" || true)"
-echo "SCHEDULER_RECENT_ACTIVITY $(docker logs --since 2h momo-scheduler 2>&1 | grep -Ec "AutoImport|Meta-Analysis|Scheduler" || true)"
-momo_sync=$(docker exec momo-db sh -c "psql -U \"\$POSTGRES_USER\" -d \"\$POSTGRES_DB\" -Atc \"WITH scope AS (SELECT min(snapshot_date::date) dmin, max(snapshot_date::date) dmax, count(*) sc FROM daily_sales_snapshot WHERE snapshot_date::date >= make_date(extract(year from current_date)::int, extract(month from current_date)::int, 1)), monthly AS (SELECT count(*) mc, min(\\\"日期\\\"::date) mmin, max(\\\"日期\\\"::date) mmax FROM realtime_sales_monthly, scope WHERE scope.sc > 0 AND \\\"日期\\\"::date BETWEEN scope.dmin AND scope.dmax) SELECT coalesce(scope.sc,0)::text || chr(124) || coalesce(monthly.mc,0)::text || chr(124) || coalesce(scope.dmin::text,chr(45)) || chr(124) || coalesce(scope.dmax::text,chr(45)) || chr(124) || coalesce(monthly.mmin::text,chr(45)) || chr(124) || coalesce(monthly.mmax::text,chr(45)) FROM scope, monthly;\"" 2>/dev/null || true)
+echo "SCHEDULER_REGISTERED $(docker logs --tail 400 momo-scheduler 2>&1 | grep -Ec "全部排程任務已註冊|排程任務已註冊|Scheduler started|APScheduler" || true)"
+echo "SCHEDULER_RECENT_ACTIVITY $(docker logs --since 2h momo-scheduler 2>&1 | grep -Ec "AutoImport|Meta-Analysis|Scheduler|排程|任務|批次 [0-9]+: 取得|\\[Feeder\\]|HITL|候選屬" || true)"
+db_user=$(docker exec momo-pro-system printenv POSTGRES_USER 2>/dev/null || true)
+db_name=$(docker exec momo-pro-system printenv POSTGRES_DB 2>/dev/null || true)
+db_pass=$(docker exec momo-pro-system printenv POSTGRES_PASSWORD 2>/dev/null || true)
+if [ -n "$db_user" ] && [ -n "$db_name" ] && [ -n "$db_pass" ]; then
+  momo_sync=$(docker exec -e PGPASSWORD="$db_pass" -e PGCONNECT_TIMEOUT=5 momo-db psql -h 127.0.0.1 -U "$db_user" -d "$db_name" -Atc "WITH scope AS (SELECT min(snapshot_date::date) dmin, max(snapshot_date::date) dmax, count(*) sc FROM daily_sales_snapshot WHERE snapshot_date::date >= make_date(extract(year from current_date)::int, extract(month from current_date)::int, 1)), monthly AS (SELECT count(*) mc, min(\"日期\"::date) mmin, max(\"日期\"::date) mmax FROM realtime_sales_monthly, scope WHERE scope.sc > 0 AND \"日期\"::date BETWEEN scope.dmin AND scope.dmax) SELECT coalesce(scope.sc,0)::text || chr(124) || coalesce(monthly.mc,0)::text || chr(124) || coalesce(scope.dmin::text,chr(45)) || chr(124) || coalesce(scope.dmax::text,chr(45)) || chr(124) || coalesce(monthly.mmin::text,chr(45)) || chr(124) || coalesce(monthly.mmax::text,chr(45)) FROM scope, monthly;" 2>/dev/null || true)
+else
+  momo_sync=""
+fi
 echo "MOMO_MONTHLY_SYNC ${momo_sync:-unavailable}"
 ' 2>&1); then
    echo "$out"
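Editor's sketch: the `MOMO_MONTHLY_SYNC` line printed by the hunk above carries six pipe-separated fields, in the order built by the SQL: snapshot row count, monthly row count, snapshot min/max date, monthly min/max date, with `-` for missing dates and `unavailable` when the query could not run. The hypothetical helper below shows one way a reader might parse and judge that line. The parity criterion (equal counts and equal date ranges, with at least one row) is an assumption; the script's actual evaluation is not shown in this diff.

```python
from typing import NamedTuple, Optional


class MonthlySync(NamedTuple):
    snapshot_count: int
    monthly_count: int
    snapshot_min: str
    snapshot_max: str
    monthly_min: str
    monthly_max: str


def parse_monthly_sync(line: str) -> Optional[MonthlySync]:
    """Parse 'MOMO_MONTHLY_SYNC a|b|c|d|e|f'; return None if unusable."""
    prefix = "MOMO_MONTHLY_SYNC "
    if not line.startswith(prefix):
        return None
    fields = line[len(prefix):].strip().split("|")
    if len(fields) != 6:
        return None  # also covers the 'unavailable' fallback
    try:
        snapshot_count, monthly_count = int(fields[0]), int(fields[1])
    except ValueError:
        return None
    return MonthlySync(snapshot_count, monthly_count, *fields[2:])


def is_in_parity(sync: Optional[MonthlySync]) -> bool:
    """Assumed criterion: non-empty, equal counts, equal date ranges."""
    if sync is None or sync.snapshot_count == 0:
        return False
    return (
        sync.snapshot_count == sync.monthly_count
        and sync.snapshot_min == sync.monthly_min
        and sync.snapshot_max == sync.monthly_max
    )
```

With the evidence line recorded in this commit (`10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`), this check would report parity; an `unavailable` line would not.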