docs(ops): record heartbeat noise and cold-start detector closure [skip ci]

Your Name
2026-06-24 02:19:30 +08:00
parent 4a7b532962
commit 8aeeadbde1
4 changed files with 90 additions and 9 deletions


@@ -1,3 +1,27 @@
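## 2026-06-24: Telegram healthy-heartbeat noise reduction and cold-start MOMO detector fix
**Background**: The Telegram group received `AWOOOI 心跳 / 告警鏈路: ✅ 正常` (AWOOOI heartbeat / alert chain: ✅ normal) every 30 minutes, so healthy signals flooded the channel. Separately, the 2026-06-24 01:45 live cold-start already reported `BLOCKED=0` but still raised two WARNs because the MOMO scheduler log pattern and the DB read method were outdated. Both produce false noise / false warnings in the reboot SOP and had to be fixed.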
**Completed work**
- `a84a5a0b fix(api): suppress healthy Telegram heartbeat noise` has been pushed; deploy marker `4a7b5329 chore(cd): deploy a84a5a0 [skip ci]` has been written back to `gitea/main`.
- Production API / Web / Worker images are all `a84a5a0bc4a672ac6feb95a85ac590aa2dd4bb71`; readiness is API `2/2`, Web `2/2`, Worker `1/1`.
- A healthy heartbeat with no warnings now only updates the Redis suppression marker / log and is no longer sent to Telegram. Warnings are still sent, and one recovery message is still sent when a warning returns to healthy.
- As a temporary mitigation, `heartbeat:silent_last_sent` and `heartbeat:healthy_suppressed_last_seen` were set in production Redis with a 24-hour TTL so that the next healthy heartbeat would not keep flooding the channel before CD finished; no token was read or printed.
- The live `/home/wooo/scripts/full-stack-cold-start-check.sh` has been synced with the MOMO detector fix. The previous version is backed up as `/home/wooo/scripts/full-stack-cold-start-check.sh.before-momo-detector-20260624-020759`, and the new hash is `47e67d0c018f741acfba17a93cb1d668779bd08745902099a10ee61e73ea55b6`.
- The repo version of `full-stack-cold-start-check.sh` also hardens the K3s node Ready / storage condition checks and the MOMO scheduler / current-month DB parity read. The DB password is passed to `docker exec` only through `PGPASSWORD`, and no secret is printed.
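
The heartbeat rule described in this list can be sketched as a small decision function. This is a minimal illustration, not the code shipped in `a84a5a0b`: the `HeartbeatGate` class, the in-memory `store` dict standing in for Redis, the `heartbeat:last_had_warning` key, and the returned labels are all assumptions made for the example. Only `heartbeat:healthy_suppressed_last_seen` comes from the entry above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class HeartbeatGate:
    """Decides whether a heartbeat is forwarded to Telegram (illustrative only)."""

    # Plain dict standing in for Redis so the sketch runs without a server.
    store: dict = field(default_factory=dict)

    def decide(self, status: str, warnings: list[str], now: float) -> str:
        # Hypothetical key remembering whether the previous heartbeat had a problem.
        had_warning = self.store.get("heartbeat:last_had_warning", False)
        healthy = status == "healthy" and not warnings
        self.store["heartbeat:last_had_warning"] = not healthy

        if not healthy:
            return "send_warning"      # warnings are never suppressed
        if had_warning:
            return "send_recovery"     # exactly one message on warning -> healthy
        # Healthy and nothing to recover from: record the marker, stay silent.
        self.store["heartbeat:healthy_suppressed_last_seen"] = now
        return "suppress"
```

Keeping the previous state in the same store as the suppression marker means the recovery message is emitted once and later healthy heartbeats fall back to `suppress`.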
**Live verification**
- 2026-06-24 02:08 cold-start: `PASS=85 WARN=0 BLOCKED=0`, result `GREEN`.
- MOMO evidence: `SCHEDULER_CONTAINER_RUNNING true`, `SCHEDULER_CONTAINER_HEALTH healthy`, `SCHEDULER_RECENT_ACTIVITY 1303`, `MOMO_MONTHLY_SYNC 10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`.
- K3s evidence: `mon` / `mon1` Ready, VIP `192.168.0.125` present, `NODE_FS_ERROR_EVENTS 0`, `NODE_READONLY_FILESYSTEM_TRUE 0`, `NODE_DISK_PRESSURE_TRUE 0`.
- K8s jobs: `FAILED_JOBS 1`, `STALE_FAILED_JOBS 1`, `ACTIVE_FAILED_JOBS 0`; the historical failed Job is retained as evidence and is not treated as an active blocker.
- Production health: `/api/v1/health` returns `healthy / prod / mock_mode=false`; `postgresql`, `redis`, `ollama`, `openclaw`, `signoz`, `ollama_gcp_a`, and `ollama_gcp_b` are all up. `ollama_local` remains a downstream provider status and does not affect the API's overall healthy state.
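
The evidence lines above are machine-readable, so a readback can be checked in code rather than by eye. The sketch below parses the cold-start summary and the six-field `MOMO_MONTHLY_SYNC` line. The function names are hypothetical, and the parity rule (equal row counts and equal date ranges) is an assumption inferred from the field order in the detector query, not a rule stated by the script.

```python
import re


def parse_cold_start_summary(line: str) -> dict:
    """Extract the PASS / WARN / BLOCKED counters from a summary line."""
    return {k: int(v) for k, v in re.findall(r"\b(PASS|WARN|BLOCKED)=(\d+)", line)}


def momo_parity_ok(line: str) -> bool:
    """Assumed parity rule: snapshot and monthly tables agree on count and range."""
    prefix = "MOMO_MONTHLY_SYNC "
    if not line.startswith(prefix):
        return False
    fields = line[len(prefix):].split("|")
    if len(fields) != 6:
        return False  # e.g. "MOMO_MONTHLY_SYNC unavailable"
    snap_count, monthly_count, snap_min, snap_max, month_min, month_max = fields
    return (
        snap_count.isdigit()
        and int(snap_count) > 0
        and snap_count == monthly_count
        and (snap_min, snap_max) == (month_min, month_max)
    )


summary = parse_cold_start_summary("PASS=85 WARN=0 BLOCKED=0")
parity = momo_parity_ok(
    "MOMO_MONTHLY_SYNC 10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17"
)
```
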
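**Notes**
- Gitea CD `#3289` ended as Failure. The root cause is a false negative from the worker startup probe / rollout status timeout; K8s actually completed the rollout and the deploy marker exists. The SOP needs a follow-up to tune the worker startup window / CD post-deploy timeout. This kind of false red must not be treated as a production outage, and it must not be ignored either.
- This change only covers healthy-heartbeat noise reduction and cold-start detector false warnings; it does not mean Telegram warning / recovery alerts are muted.
- DR still must not be declared complete. The credential escrow evidence-missing count must stay at `5`; do not forge evidence or put any secret value in the repo or chat.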
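## 2026-06-19: Governance page public workflow wording and enum sanitizer production verification complete
**Background**: The previous round of governance-page public display cleanup made the main cards visible, but the production desktop smoke test still found leftover free text deep in snapshots, such as `live worker`, `Direct Bot API`, `dual approval`, and `owner approval`. It also hit `MISSING_MESSAGE` because the sanitizer translated enum values into Chinese first. This round consolidates the messages and the governance tab sanitizer, and completes desktop / mobile readback after production deployment.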


@@ -1628,6 +1628,30 @@ SOP update:
| Declaration limit | May declare `FULL_STACK_GREEN_FOR_SERVICE`; must not declare `DR_COMPLETE`, `credential escrow complete`, or any runtime/security acceptance |
| SOP change | v1.25 defines retained failed Job evidence vs active failed Job blocker; future reboot comparison must record all three counters |
### 14.26 2026-06-24 heartbeat noise / MOMO detector / rollout false-negative closure
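The 2026-06-24 change was not a host reboot. It closes two false signals in the reboot SOP: Telegram healthy heartbeats no longer flood the channel every 30 minutes, and the MOMO scheduler / current-month parity detector no longer raises false WARNs because of an old log pattern or the wrong DB exec user. This anchor also records the CD rollout false negative: the worker startup probe timed out on the first attempt and the worker restarted once, K8s eventually became ready, but Gitea CD `#3289` was marked Failure because of the rollout status timeout.
| Item | 2026-06-24 live baseline |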
|------|--------------------------|
| SOP version | `v1.27` |
| Heartbeat code | `a84a5a0b fix(api): suppress healthy Telegram heartbeat noise` |
| Deploy marker | `4a7b5329 chore(cd): deploy a84a5a0 [skip ci]` |
| Production image readback | API/Web/Worker image tag `a84a5a0bc4a672ac6feb95a85ac590aa2dd4bb71` |
| Production rollout | API `2/2`, Web `2/2`, Worker `1/1` Ready |
| CD result caveat | Gitea CD `#3289` shows Failure because worker rollout status timed out before old replica convergence; K8s deploy marker and production readiness are green |
| Healthy heartbeat rule | When `status=healthy` and there are no warnings, only update the suppression marker / log and do not send to Telegram; warnings and recovery can still be sent |
| Live temporary suppression | Redis keys `heartbeat:silent_last_sent` and `heartbeat:healthy_suppressed_last_seen` set with 24h TTL during deployment; no token or secret printed |
| 110 live script sync | `/home/wooo/scripts/full-stack-cold-start-check.sh` hash `47e67d0c018f741acfba17a93cb1d668779bd08745902099a10ee61e73ea55b6`; previous version backed up to `/home/wooo/scripts/full-stack-cold-start-check.sh.before-momo-detector-20260624-020759` |
| MOMO scheduler evidence | `SCHEDULER_CONTAINER_RUNNING true`, `SCHEDULER_CONTAINER_HEALTH healthy`, `SCHEDULER_RECENT_ACTIVITY 1303` |
| MOMO DB parity evidence | `MOMO_MONTHLY_SYNC 10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17` |
| K3s node evidence | `NODE_FS_ERROR_EVENTS 0`, `NODE_READONLY_FILESYSTEM_TRUE 0`, `NODE_DISK_PRESSURE_TRUE 0`, VIP `192.168.0.125` present |
| Live cold-start readback | `PASS=85 WARN=0 BLOCKED=0`, result `GREEN` |
| Declaration limit | May declare the current service recovery scorecard green; must not declare `DR_COMPLETE`; credential escrow evidence missing remains `5` |
| SOP change | v1.27 requires heartbeat success-message suppression, MOMO detector parity using app-provided DB env, and rollout false-negative classification before retrying CD |
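For Worker / CronJob / queue services whose startup time may exceed the startup probe, a single failed `rollout status --timeout=60s` is not enough to declare production down. Check the deploy marker, image tag, pod readiness, container restart count, service health, and public route / API health together. If the pod eventually becomes ready but CD is red, that is CI timeout / probe tuning work, not a service restart incident; follow up by adjusting the startup probe or the post-deploy timeout.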
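
The checklist in the preceding paragraph can be expressed as a small classifier. This is only a sketch of the decision order under stated assumptions: the `RolloutEvidence` fields, the `max_restarts` threshold, and the result labels are hypothetical names chosen for illustration, and collecting the evidence itself (kubectl readbacks, health probes) is left out.

```python
from dataclasses import dataclass


@dataclass
class RolloutEvidence:
    cd_job_failed: bool          # what the CD run reported
    deploy_marker_present: bool
    image_tag_matches: bool
    pods_ready: bool
    restart_count: int
    service_healthy: bool
    public_route_healthy: bool


def classify_rollout(e: RolloutEvidence, max_restarts: int = 1) -> str:
    """Combine every signal instead of trusting the CD status alone."""
    runtime_green = all(
        [
            e.deploy_marker_present,
            e.image_tag_matches,
            e.pods_ready,
            e.service_healthy,
            e.public_route_healthy,
        ]
    )
    if not e.cd_job_failed:
        return "GREEN" if runtime_green else "INVESTIGATE"
    # CD is red: only call it a false negative when the runtime evidence is
    # green and restarts stay within the assumed tolerance.
    if runtime_green and e.restart_count <= max_restarts:
        return "CD_FALSE_NEGATIVE"
    return "PRODUCTION_SUSPECT"


# Shape of the 2026-06-24 case: CD red, one worker restart, runtime green.
incident = RolloutEvidence(True, True, True, True, 1, True, True)
verdict = classify_rollout(incident)
```

A `CD_FALSE_NEGATIVE` verdict still implies follow-up work on the probe or timeout; it is a classification of the red light, not permission to ignore it.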
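### 14.22 Post-reboot timeline verification
After every reboot, proceed along the timeline; do not wait until the end to make a single judgment.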


@@ -11,13 +11,13 @@
| Area | Status | Completion | Evidence |
|------|--------|------------|----------|
| Overall recovery readiness | SERVICE_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-18 13:43 live cold-start read-only gate returned `PASS=84 WARN=0 BLOCKED=0`, result `GREEN`。110 / 120 / 121 / 188 ping and SSH port are OK, K3s `mon` / `mon1` are Ready, `NODE_FS_ERROR_EVENTS=0`, public routes/TLS are green, 110 backup health is `13/13 fresh failed=0`, 188 backup health is `2/2 fresh failed=0`。K8s schedule readback now distinguishes `FAILED_JOBS=1` / `STALE_FAILED_JOBS=1` / `ACTIVE_FAILED_JOBS=0`; the retained `km-vectorize-29689620` failed Job is historical evidence, not an active service blocker. DR remains blocked because credential escrow evidence markers are still missing and must not be forged. |
| Overall recovery readiness | SERVICE_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-24 02:08 live cold-start read-only gate returned `PASS=85 WARN=0 BLOCKED=0`, result `GREEN`。110 / 120 / 121 / 188 ping and SSH port are OK, K3s `mon` / `mon1` are Ready, `NODE_FS_ERROR_EVENTS=0`, `NODE_READONLY_FILESYSTEM_TRUE=0`, `NODE_DISK_PRESSURE_TRUE=0`, public routes/TLS are green, 110 / 188 runtime checks are green。K8s schedule readback distinguishes `FAILED_JOBS=1` / `STALE_FAILED_JOBS=1` / `ACTIVE_FAILED_JOBS=0`; the retained failed Job is historical evidence, not an active service blocker. DR remains blocked because credential escrow evidence markers are still missing and must not be forged. |
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-14 18:15 readback shows 120 is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, and cold-start P0/P1 checks are green. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 92% | 2026-06-15 03:11 `backup-status` shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-15 02:40:13`. Offsite / escrow report shows `SCRIPT_MISSING_COUNT=0`, `OFFSITE_CONFIGURED=1`, `RCLONE_CONFIGURED=1`, `ESCROW_MISSING_COUNT=5`. Owner request package is ready; actual marker write remains blocked on real non-secret evidence IDs. |
| P2 service / data truth | VERIFIED_FULL_STACK_GREEN_FOR_SERVICE | 100% | 2026-06-18 13:43 cold-start verifies public route/TLS, API/Web route, momo health and current-month parity `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`, backup exporters, schedules, K3s node readiness, VIP, and 110 / 188 runtime health. K8s active failed Job count is `0`, bad pods are `0`, and cold-start returns `PASS=84 WARN=0 BLOCKED=0`. |
| P3 docs / automation contracts | DONE_WITH_RUNAWAY_PROCESS_AIOPS_LIVE_SCRAPED | 100% | Workplan, SOP v1.26, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, and 2026-06-18 live readback are updated. 14:31-14:32 Prometheus scrape confirms 110 `monitor_up=1`, orphan browser group count `0`, active CI containers `2`, load5/core around `0.79-0.81`, swap ratio around `1.0`, `remediation_authorized=0`, and missing/orphan alerts not firing. Repo-side readiness audit also checks runaway process exporter / remediation helper / alert group; live cold-start remains `PASS=84 WARN=0 BLOCKED=0` from the latest service readiness readback. |
| P2 service / data truth | VERIFIED_FULL_STACK_GREEN_FOR_SERVICE | 100% | 2026-06-24 02:08 cold-start verifies public route/TLS, API/Web route, momo health and current-month parity `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`, backup exporters, schedules, K3s node readiness/storage conditions, VIP, and 110 / 188 runtime health. K8s active failed Job count is `0`, bad pods are `0`, and cold-start returns `PASS=85 WARN=0 BLOCKED=0`. |
| P3 docs / automation contracts | DONE_WITH_HEARTBEAT_NOISE_AND_MOMO_DETECTOR_CLOSURE | 100% | Workplan, SOP v1.27, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, and 2026-06-24 live readback are updated. Production image `a84a5a0b` is live with API `2/2`, Web `2/2`, Worker `1/1`; CD `#3289` is a known false-negative caused by worker startup / rollout timeout after deploy marker `4a7b5329`. |
Full cold-start service readiness may be declared green for the latest verified evidence set. As of 2026-06-18 13:43, services are green with `WARN=0` and `BLOCKED=0`; the retained stale `km-vectorize` failed Job remains historical evidence only. Do not declare DR scorecard complete while credential escrow evidence remains blocked.
Full cold-start service readiness may be declared green for the latest verified evidence set. As of 2026-06-24 02:08, services are green with `PASS=85 WARN=0 BLOCKED=0`; the retained stale failed Job remains historical evidence only. Do not declare DR scorecard complete while credential escrow evidence remains blocked.
2026-06-13 01:26 refresh: full cold-start is again green for the current evidence set. AWOOOI API/Web workload balancing survived the next normal CD deploy: Gitea main `e4a349bc`, ArgoCD revision `e4a349bc`, images from `414413a5`, API/Web split across `mon` / `mon1`, and global `known_hosts` retained 120 / 188 after CD fix `80e6ec1a`. Do not declare DR complete while credential escrow is missing. `km-vectorize` remediation is `90%`: schedule/label fix is live, and the remaining gate is the next official 03:00 CronJob success readback.


@@ -316,8 +316,27 @@ kcmd() {
}
kcmd get nodes -o wide 2>/dev/null || true
kcmd get pods -n awoooi-prod -o wide 2>/dev/null || true
node_condition_summary=$(kcmd get nodes -o json 2>/dev/null | python3 -c "import json,sys
try:
    d=json.load(sys.stdin)
except Exception:
    d={\"items\": []}
not_ready=readonly=disk_pressure=0
for node in d.get(\"items\", []):
    conds={c.get(\"type\"): c.get(\"status\") for c in node.get(\"status\",{}).get(\"conditions\",[]) or []}
    if conds.get(\"Ready\") != \"True\":
        not_ready += 1
    if conds.get(\"ReadonlyFilesystem\") == \"True\":
        readonly += 1
    if conds.get(\"DiskPressure\") == \"True\":
        disk_pressure += 1
print(f\"NODE_NOT_READY {not_ready}\")
print(f\"NODE_READONLY_FILESYSTEM_TRUE {readonly}\")
print(f\"NODE_DISK_PRESSURE_TRUE {disk_pressure}\")" || true)
printf "%s\n" "$node_condition_summary"
node_fs_events=$(kcmd get events -A --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp 2>/dev/null \
| grep -Eic "filesystem|fsck|I/O error|read-only file system|Structure needs cleaning|orphan linked list|EXT4-fs|xfs" || true)
| grep -Eiv "InvalidDiskCapacity|image filesystem" \
| grep -Eic "fsck|I/O error|read-only file system|Structure needs cleaning|orphan linked list|EXT4-fs.*error|XFS.*(corruption|metadata)|Remounting filesystem read-only" || true)
echo "NODE_FS_ERROR_EVENTS ${node_fs_events:-0}"
ip addr show | grep 192.168.0.125 || true
' 2>&1); then
@@ -339,7 +358,14 @@ ip addr show | grep 192.168.0.125 || true
grep -q "PG188_PORT OPEN" <<<"$out" && ok "120 can reach 188 PostgreSQL port" || fail "120 cannot reach 188 PostgreSQL"
grep -q " Ready " <<<"$out$local_kubectl_out" && ok "K3s has Ready node output" || fail "K3s nodes not Ready or kubectl unavailable"
grep -q "NODE_FS_ERROR_EVENTS 0" <<<"$out" && ok "K3s node filesystem error events absent" || fail "K3s node filesystem error events present"
grep -q "NODE_NOT_READY 0" <<<"$out" && ok "K3s node Ready condition clean" || fail "K3s node Ready condition not clean"
if grep -q "NODE_FS_ERROR_EVENTS 0" <<<"$out" \
&& grep -q "NODE_READONLY_FILESYSTEM_TRUE 0" <<<"$out" \
&& grep -q "NODE_DISK_PRESSURE_TRUE 0" <<<"$out"; then
ok "K3s node storage conditions clean"
else
fail "K3s node storage condition or severe filesystem event present"
fi
grep -q "192.168.0.125" <<<"$out" && ok "VIP 192.168.0.125 present on 120" || warn "VIP not confirmed on 120"
}
@@ -439,9 +465,16 @@ if [ -f /home/ollama/node_exporter_textfiles/storage_health.prom ]; then
fi
echo "SCHEDULER_CONTAINER_RUNNING $(docker inspect -f "{{.State.Running}}" momo-scheduler 2>/dev/null || true)"
echo "SCHEDULER_CONTAINER_HEALTH $(docker inspect -f "{{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}" momo-scheduler 2>/dev/null || true)"
echo "SCHEDULER_REGISTERED $(docker logs --tail 200 momo-scheduler 2>&1 | grep -c "全部排程任務已註冊" || true)"
echo "SCHEDULER_RECENT_ACTIVITY $(docker logs --since 2h momo-scheduler 2>&1 | grep -Ec "AutoImport|Meta-Analysis|Scheduler" || true)"
momo_sync=$(docker exec momo-db sh -c "psql -U \"\$POSTGRES_USER\" -d \"\$POSTGRES_DB\" -Atc \"WITH scope AS (SELECT min(snapshot_date::date) dmin, max(snapshot_date::date) dmax, count(*) sc FROM daily_sales_snapshot WHERE snapshot_date::date >= make_date(extract(year from current_date)::int, extract(month from current_date)::int, 1)), monthly AS (SELECT count(*) mc, min(\\\"日期\\\"::date) mmin, max(\\\"日期\\\"::date) mmax FROM realtime_sales_monthly, scope WHERE scope.sc > 0 AND \\\"日期\\\"::date BETWEEN scope.dmin AND scope.dmax) SELECT coalesce(scope.sc,0)::text || chr(124) || coalesce(monthly.mc,0)::text || chr(124) || coalesce(scope.dmin::text,chr(45)) || chr(124) || coalesce(scope.dmax::text,chr(45)) || chr(124) || coalesce(monthly.mmin::text,chr(45)) || chr(124) || coalesce(monthly.mmax::text,chr(45)) FROM scope, monthly;\"" 2>/dev/null || true)
echo "SCHEDULER_REGISTERED $(docker logs --tail 400 momo-scheduler 2>&1 | grep -Ec "全部排程任務已註冊|排程任務已註冊|Scheduler started|APScheduler" || true)"
echo "SCHEDULER_RECENT_ACTIVITY $(docker logs --since 2h momo-scheduler 2>&1 | grep -Ec "AutoImport|Meta-Analysis|Scheduler|排程|任務|批次 [0-9]+: 取得|\\[Feeder\\]|HITL|候選屬" || true)"
db_user=$(docker exec momo-pro-system printenv POSTGRES_USER 2>/dev/null || true)
db_name=$(docker exec momo-pro-system printenv POSTGRES_DB 2>/dev/null || true)
db_pass=$(docker exec momo-pro-system printenv POSTGRES_PASSWORD 2>/dev/null || true)
if [ -n "$db_user" ] && [ -n "$db_name" ] && [ -n "$db_pass" ]; then
momo_sync=$(docker exec -e PGPASSWORD="$db_pass" -e PGCONNECT_TIMEOUT=5 momo-db psql -h 127.0.0.1 -U "$db_user" -d "$db_name" -Atc "WITH scope AS (SELECT min(snapshot_date::date) dmin, max(snapshot_date::date) dmax, count(*) sc FROM daily_sales_snapshot WHERE snapshot_date::date >= make_date(extract(year from current_date)::int, extract(month from current_date)::int, 1)), monthly AS (SELECT count(*) mc, min(\"日期\"::date) mmin, max(\"日期\"::date) mmax FROM realtime_sales_monthly, scope WHERE scope.sc > 0 AND \"日期\"::date BETWEEN scope.dmin AND scope.dmax) SELECT coalesce(scope.sc,0)::text || chr(124) || coalesce(monthly.mc,0)::text || chr(124) || coalesce(scope.dmin::text,chr(45)) || chr(124) || coalesce(scope.dmax::text,chr(45)) || chr(124) || coalesce(monthly.mmin::text,chr(45)) || chr(124) || coalesce(monthly.mmax::text,chr(45)) FROM scope, monthly;" 2>/dev/null || true)
else
momo_sync=""
fi
echo "MOMO_MONTHLY_SYNC ${momo_sync:-unavailable}"
' 2>&1); then
echo "$out"