From 15ff063cdacc73756c436079c8696df95d3682f7 Mon Sep 17 00:00:00 2001
From: Your Name
Date: Tue, 30 Jun 2026 22:21:36 +0800
Subject: [PATCH] docs(recovery): correct p0 cd failure readback [skip ci]

---
 docs/LOGBOOK.md | 4 ++--
 .../2026-06-04-reboot-cold-start-backup-recovery-workplan.md | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index 6757e7a6a..da16128e3 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -50707,8 +50707,8 @@ production browser smoke:
 ## 2026-06-30 — 22:08 P0 mainline live scorecard / 110 Harbor control channel blocker readback

 **Readbacks completed in priority order**:
-- Used the clean worktree `/Users/ogt/codex-workspaces/awoooi-p0-006-postgres-readback-20260630`, fast-forwarded to Gitea `main` / `230ee54fa test(agent): align log loop writeback counts`; the old `/Users/ogt/awoooi` is still behind and dirty and was not touched.
-- public Gitea queue: CD `#4095` Running, CD `#4094` Canceled, CD `#4093` Failure, CD `#4091` Failure, Harbor repair `#4092` Scheduled / Waiting; queue readback status is still `blocked_harbor_110_repair_no_matching_runner`, with no online `awoooi-host` runner. The jobs API returns stale/mismatched `ai-code-review` / `ubuntu-latest` for `#4092`, which does not mean the repair job has run; the CD `#4095` self-heal skip reason is still `not_110_host`.
+- Used the clean worktree `/Users/ogt/codex-workspaces/awoooi-p0-006-postgres-readback-20260630`, fast-forwarded to Gitea `main` / `230ee54fa test(agent): align log loop writeback counts`, and recorded this round's readback in docs `[skip ci]` commit `9540a479`; the old `/Users/ogt/awoooi` is still behind and dirty and was not touched.
+- public Gitea queue: latest executable CD `#4095` Failure, classifier=`harbor_registry_public_route_unavailable`, status code `502`, controlled repair attempted=`true`, skip reason=`not_110_host`; CD `#4094` Canceled, `#4093` Failure, `#4091` Failure, Harbor repair `#4092` Scheduled / Waiting. Queue readback status is still `blocked_harbor_110_repair_no_matching_runner`, with no online `awoooi-host` runner. The jobs API returns stale/mismatched `ai-code-review` / `ubuntu-latest` for `#4092`, which does not mean the repair job has run.
 - live probes: `https://registry.wooo.work/v2/` 502, `http://192.168.0.110:5000/v2/` 502, `https://harbor.wooo.work/api/v2.0/health` 502, `https://signoz.wooo.work/` 502.
 - StockPlatform public freshness / ingestion still return `status=not_configured` with blocker `postgres_not_ready`; production `/api/v1/agents/reboot-auto-recovery-slo-scorecard` still returns stale data from 2026-06-29 and cannot be used as recovery evidence for this round.
 - `post-reboot-readiness-summary.sh --no-color` artifact `/tmp/awoooi-post-reboot-readiness-20260630-220250/summary.txt` reports `POST_START_PASS=33 WARN=6 BLOCKED=8`, `SERVICE_GREEN=0`, `PRODUCT_DATA_GREEN=0`, `BACKUP_CORE_GREEN=0`, `HOST_188_SERVICE_GREEN=0`, `OVERALL_DECLARATION=SERVICE_BLOCKED`.
diff --git a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
index 05137833e..bc2fbafa3 100644
--- a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
+++ b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
@@ -18,8 +18,8 @@
 | P0-1 | BLOCKED | All-host cold-start / 10-minute auto-recovery SLO | 22:05 `post-reboot-readiness-summary.sh --no-color` artifact `/tmp/awoooi-post-reboot-readiness-20260630-220250/summary.txt` reports `POST_START_PASS=33 WARN=6 BLOCKED=8`, `SERVICE_GREEN=0`, `PRODUCT_DATA_GREEN=0`, `BACKUP_CORE_GREEN=0`, `HOST_188_SERVICE_GREEN=0`, `OVERALL_DECLARATION=SERVICE_BLOCKED`; `full-stack-cold-start-check.sh --monitor-read-only --no-color` reports `PASS=66 WARN=5 BLOCKED=5`; SLO scorecard `/tmp/awoooi-reboot-slo-live-20260630-2205-scorecard.json` reports `can_claim_all_services_recovered_within_target=false`, active blockers `11`. The reboot detector's read-only evaluation `/tmp/awoooi-reboot-event-live-20260630-2205.json` reports `reboot_detected=true`, but only 99 is fresh, 111 is unreachable, and not all required hosts are within the 10-minute window; 99 uptime unknown, 188 startup failed/degraded, and 110/112/120/121/188 are past the 10-minute window. | Fix the first runtime blocker first: 110 control path / Harbor registry `/v2`. Rerun the same summary / cold-start / SLO scorecard until `SERVICE_GREEN=1`, `POST_START_BLOCKED=0`, `PASS` with no BLOCKED, all-host required observed/reachable, and `awoooi_reboot_auto_recovery_slo_ready=1`; a route 200 alone must not be used to claim recovery. |
 | P0-2 | DONE_THIS_INCIDENT | User-visible 502: Tsenyang | `www.tsenyang.com` / `tsenyang.com` recovered from 502 to 200; 188 `tsenyang-website` container running; local `127.0.0.1:3000` returns 200. | For the next 502 of this kind, check the release symlink / image / container first; do not start with Nginx, DNS, DB, or a host reboot. |
 | P0-3 | BLOCKED | StockPlatform data freshness | public `/healthz` and `/api/healthz` return 200; freshness / ingestion return `not_configured`, `postgres_not_ready`. | After the 110 control path is restored, inspect `/home/wooo/stockplatform-v2` compose / DB schema / migration status read-only; fake freshness, manual DB rows, and restore/prune are prohibited. |
-| P0-4 | BLOCKED | AWOOOI production version currency | The latest Gitea SSH `main` is `230ee54fa`; public CD `#4095` Running, `#4094` Canceled, `#4093` Failure, `#4091` Failure; the previous main CD with a clearly Harbor-shaped failure, `#4087`, has failure classifier=`harbor_registry_public_route_unavailable`, status code `502`, controlled repair attempted=`true`, skip reason=`not_110_host`. Production `/api/v1/agents/reboot-auto-recovery-slo-scorecard` still returns stale data from 2026-06-29, which is inconsistent with the 22:05 live Stock `postgres_not_ready` / Harbor 502 and cannot be treated as current truth. | Bring deploy marker / runtime SHA / endpoint readback into agreement; until Harbor `/v2` is restored, CD cannot ship the latest source to production, and AWOOOI must not be declared up to date until they agree. |
-| P0-5 | BLOCKED | 110 control path / Harbor registry `/v2` | 22:02 live probe: `https://registry.wooo.work/v2/` 502, `http://192.168.0.110:5000/v2/` 502, `https://harbor.wooo.work/api/v2.0/health` 502, `https://signoz.wooo.work/` 502; public Gitea queue readback reports `status=blocked_harbor_110_repair_no_matching_runner`, Harbor repair `#4092` Scheduled / Waiting, `workflow_no_matching_runner_labels={"harbor-110-local-repair.yaml":"awoooi-host"}`; the jobs API is still stale/mismatched `ai-code-review` / `ubuntu-latest`, and the `harbor-110-local-repair` job has not actually run; CD `#4095` still shows self-heal skip reason `not_110_host`. The 110 SSH read-only command path still times out. | Get the 110-local repair workflow or a 110 console/local script to actually run `recover-110-control-path-and-harbor-local.sh --check` / `--apply-all`, and read back public/internal `/v2` as `200/401`. Stock DB, Gitea dump, and 110 backup completeness can only be verified after the SSH read-only command path is restored. |
+| P0-4 | BLOCKED | AWOOOI production version currency | The latest Gitea SSH `main` is `9540a479` docs `[skip ci]`; the latest executable CD is still `#4095` for `230ee54fa`, status Failure, classifier=`harbor_registry_public_route_unavailable`, status code `502`, controlled repair attempted=`true`, skip reason=`not_110_host`. Production `/api/v1/agents/reboot-auto-recovery-slo-scorecard` still returns stale data from 2026-06-29, which is inconsistent with the 22:05 live Stock `postgres_not_ready` / Harbor 502 and cannot be treated as current truth. | Bring deploy marker / runtime SHA / endpoint readback into agreement; until Harbor `/v2` is restored, CD cannot ship the latest source to production, and AWOOOI must not be declared up to date until they agree. |
+| P0-5 | BLOCKED | 110 control path / Harbor registry `/v2` | 22:20 live probe: `https://registry.wooo.work/v2/` 502, `http://192.168.0.110:5000/v2/` 502, `https://harbor.wooo.work/api/v2.0/health` 502, `https://signoz.wooo.work/` 502; public Gitea queue readback reports `status=blocked_harbor_110_repair_no_matching_runner`, Harbor repair `#4092` Scheduled / Waiting, `workflow_no_matching_runner_labels={"harbor-110-local-repair.yaml":"awoooi-host"}`; the jobs API is still stale/mismatched `ai-code-review` / `ubuntu-latest`, and the `harbor-110-local-repair` job has not actually run; CD `#4095` self-heal skip reason `not_110_host`. The 110 SSH read-only command path still times out. | Get the 110-local repair workflow or a 110 console/local script to actually run `recover-110-control-path-and-harbor-local.sh --check` / `--apply-all`, and read back public/internal `/v2` as `200/401`. Stock DB, Gitea dump, and 110 backup completeness can only be verified after the SSH read-only command path is restored. |
 | P0-6 | BLOCKED_BACKUP_COMPLETENESS | Gitea repo visibility and full backup | Gitea version API 200; public repo search lists only 4 public repos; `stockplatform-v2` public page/API returns 404, but internal `git ls-remote` succeeds; 188 `/home/ollama/backup/110/gitea` was initially empty. A verified emergency bundle was created at `/home/ollama/backup/110/gitea/git-bundles/20260630-190931`: bundle verify + checksum succeeded for 4 public/internal repos, while `AwoooGo`, `stockplatform-v2`, and `vibework` failed closed on private auth. Because 110 `backup-status` could not be read back, the 20:18 summary reports `BACKUP_CORE_GREEN=0`, `DR_ESCROW_BLOCKED=1`, `DR_ESCROW_EVIDENCE_UNKNOWN=1`. | The 188 `gitea_repo_mirror_from_110` subtree metric / alert has been added; the next step is still to restore the 110 SSH command path and then run a formal `gitea dump`, non-interactive private repo backup, repo count, backup-status, and restore drill readback. Unknown must not be treated as backup / DR green. |
 | P0-7 | SOURCE_READY_RUNTIME_BLOCKED | 99 VMware / VM autostart | The repo already has `windows99-vmware-autostart.ps1`; the 22:05 host probe found 99 ping reachable but `boot_id=reachable_unknown_boot` / uptime unknown, 111 unreachable, 112/120/121/188 readable, and the 188 startup unit failed/degraded. An earlier read-only readback showed 99 RDP 3389 / SSH 22 reachable, WinRM 5985 failing, and `administrator@192.168.0.99` SSH publickey denied. | Restore a controllable channel to 99 or apply the script from the console; afterwards read back 111/188/120/121/112 boot evidence, requiring all-host required observed/reachable and 99 no longer reporting unknown uptime. |
 | P0-8 | SOURCE_READY_RUNTIME_BLOCKED | 502 maintenance fallback / Telegram / backup alert | The L0/L1 fallback runbook, Nginx snippet, and reboot / backup alert rules are already in source; runtime still needs deployment and external L1 provider readback. | L0: verify `X-AWOOOI-Fallback` with a test vhost; L1 requires an external cloud/CDN probe; Telegram: verify with a redacted alert receipt. |