diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index aa2bb3dc9..0ff611a9f 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -1,3 +1,15 @@
+## 2026-07-03 — 01:58 P0-006 host-probe TCP timeout convergence
+
+**Completed**:
+- Continued the production reboot SLO primary lane `reboot_event_detector_and_host_probe`. Confirmed that in the previous round the host probe stalled after emitting rows for only 99 / 110, leaving 111 / 112 / 120 / 121 / 188 without rows and directly breaking the 10-minute SLO determination.
+- `scripts/reboot-recovery/reboot-auto-recovery-host-probe.sh` now wraps every `nc -z` port probe in the reachable-only fallback with `run_with_timeout "${TCP_CONNECT_TIMEOUT_SECONDS:-2}"`, so that an `nc -w` that fails to exit on time in some environments cannot stall the probes for the remaining hosts.
+- `scripts/reboot-recovery/tests/test_reboot_p0_operational_contract.py` locks in the requirement that the TCP probe has an outer timeout; `FULL-STACK-COLD-START-SOP.md` is bumped to v1.100 and fixes the rule that a partial probe / verifier timeout must fail closed and must not be reported as all-host observed.
+- Live no-write artifact `/tmp/awoooi-reboot-host-probe-confirm-20260703-015844`: the bounded host probe completed in about `10.825s`, with `HOST_BOOT` rows `7`, observed hosts `110,111,112,120,121,188,99`, and `missing_hosts=[]`.
+
+**Still in effect**:
+- The 10-minute all-host auto-recovery SLO is not yet met: 111 is still unreachable, 99 is still only ping reachable / uptime unknown, 188 still reports `systemd_state=degraded` with `awoooi-startup.service failed`, and the detector still returns `reboot_detected=false` and `recovery_deadline_status=target_window_elapsed`.
+- No secret / token / `.env` / raw sessions / SQLite / auth was read; GitHub / gh was not used; no workflow_dispatch; no host / VM / service was restarted; no Docker / Nginx / K3s / DB / firewall restart; no DROP / TRUNCATE / restore / prune / delete / force push.
+
 ## 2026-07-03 — 01:24 IwoooS first-screen segmented workbench

 **Completed**:
diff --git a/docs/runbooks/FULL-STACK-COLD-START-SOP.md b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
index 02ba75682..b42c019de 100644
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -1,6 +1,6 @@
 # AWOOOI Full-Stack Cold Start and Host Reboot SOP

-> Version: v1.99
+> Version: v1.100
 > Last updated: 2026-07-03 Asia/Taipei
 > Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.
@@ -32,6 +32,8 @@ v1.98 reboot-event detector unknown-uptime rule:`reboot-event-detector.py` 不

 v1.99 runtime blocker action-matrix classification rule: the `conversation_event_hot_path_index_migration_source_missing` blocker reported back by Prometheus / the runtime overlay must not remain in `reboot_slo_unknown`. This blocker belongs to `host_cpu_pressure`; the owner lane is fixed as `host_pressure_controller` and the next action is fixed as `restore_conversation_event_hot_path_migration_source_then_rerun_host_pressure_and_reboot_slo_scorecard_no_restart`; evidence inputs must include `source_controls` and `runtime_metric_readback`. This rule authorizes only source-control / verifier repair and check-mode; it still does not permit a direct DB destructive migration, kill process, Docker / DB / K3s / Nginx restart, or reboot.

+v1.100 host-probe bounded TCP verifier rule: the reachable-only TCP fallback in `scripts/reboot-recovery/reboot-auto-recovery-host-probe.sh` must not rely on `nc -w` alone; every `nc -z` port probe must additionally be wrapped in `run_with_timeout "${TCP_CONNECT_TIMEOUT_SECONDS:-2}"`, so that a single stalled host or port cannot leave the remaining hosts without rows and make the 10-minute SLO readback undecidable. If the verifier times out or host rows are missing, `reboot-event-detector.py` / `reboot-auto-recovery-slo-scorecard.py` must fail closed with a host boot detection blocker; the first part of a partial probe must not be used to claim all-host observed, fresh reboot, or auto-recovery ready. 2026-07-03 live no-write verification artifact `/tmp/awoooi-reboot-host-probe-confirm-20260703-015844`: the bounded host probe completed in about `10.825s`, with `HOST_BOOT` rows `7` and `missing_hosts=[]`; however, 111 is still unreachable, 99 still reports `uptime_seconds=unknown`, and 188 still reports `systemd_state=degraded` / `startup_active=failed`, so the 10-minute all-host auto-recovery SLO remains blocked.
+
 2026-07-02 110 control-path / Harbor recovery receipt rule: if the Gitea Harbor repair queue still retains `harbor_110_remote_ssh_publickey_auth_stalled`, remote-control unavailable, jobs stale, or historical failure, but local evidence from the same round also proves that the `wooo` command path is ready, 110 local Harbor `/v2/` is ready, and the public/internal registry `/v2/` returns `401`, then that Gitea Harbor repair failure may only be listed as historical queue metadata and must no longer be treated as a current SSH blocker. Use `/api/v1/agents/harbor-registry-controlled-recovery-receipt` or an equivalent validator to merge `diagnose-110-ssh-publickey-auth.sh`, `recover-110-control-path-and-harbor-local.sh --check`, the public Gitea queue readback, and the registry `/v2/` verifier, and write the machine-readable result to a snapshot such as `docs/operations/harbor-110-control-path-recovery-readback-2026-07-02.snapshot.json`. The 2026-07-02 live receipt shows: public/internal registry `/v2/` both return `401`, the latest visible CD `#4335` is `Success`, and the Gitea Harbor repair failure is already `historical_after_latest_cd_success=true`; active blockers have narrowed to the 110 controlled CD lane config / binary / registration / service guardrail, active action container pressure, and the Gitea CD jobs head-SHA / stale readback mismatch. If the local-console output contains only the `AWOOOI_110_CONTROLLED_CD_LANE_READY` marker, the non110 runner parser must not derive a non110 blocker from 110 `BLOCKER` lines; non110 may be listed as an active blocker only when the `AWOOOI_NON110_RUNNER_READY` marker is seen.

 2026-07-02 110 controlled CD lane fail-closed enforcer staging rule: after the 110 runner pressure incident, legacy / generic runners must still fail closed; however, the non-secret staging artifacts of `awoooi-cd-lane-drain.service` must no longer be indiscriminately sealed back into stubs by the enforcer. `scripts/reboot-recovery/enforce-110-runner-failclosed.sh` may keep the drain config / binary / unit, and emit `CONTROLLED_DRAIN_STAGING_ALLOWED=1` plus a textfile metric, only when `config.yaml` satisfies `capacity <= 1` and contains only `awoooi-host:host` and `awoooi-ubuntu:docker://192.168.0.110:5000/awoooi/ci-runner:act-22.04`, the binary is an executable ELF, the systemd unit carries guardrails such as `ConditionPathExists=/home/wooo/awoooi-cd-lane-drain/data/.runner` and `CPUAccounting` / `MemoryAccounting` / `TasksAccounting` / `NoNewPrivileges`, and the service is `inactive` with `MainPID=0` and neither enabled nor masked. This staging rule must not read the token, must not read the contents of `.runner`, must not register a runner, and must not start the service; if registration is missing, the readiness verifier must still leave only blockers of the `controlled_cd_lane_registration_missing` / `controlled_cd_lane_service_not_active` kind. If `CONTROLLED_DRAIN_STAGING_ALLOWED=0` and the config / binary has been moved away again, fix the source enforcer / unit guardrail first rather than repeatedly re-adding the same artifacts by hand.
diff --git a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
index 4d4b57a1e..95dd2f340 100644
--- a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
+++ b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
@@ -31,6 +31,7 @@
 - User-visible 502s take priority over data freshness; restore static/container services first, then return to the data layer and version consistency.
 - Version currency must be checked against source SHA, deploy marker, runtime SHA, and public endpoint together; looking at Gitea main alone is not enough.
 - 2026-06-30 live testing showed that having the reboot detector / alert / VMware autostart / maintenance fallback in source does not mean the runtime meets the target; the all-host probe, SLO metric, public `/v2`, Stock freshness, backup-status, and Telegram delivery must all be read back together.
+- 2026-07-03 live testing showed that if the host probe itself lacks a bounded TCP fallback, the verifier layer produces partial evidence and the 10-minute SLO cannot be judged. The `nc -z` calls in `reboot-auto-recovery-host-probe.sh` must be wrapped in `run_with_timeout`; a partial probe / missing host rows always fail closed and must not be reported as all-host observed.

 | Area | Status | Completion | Evidence |
 |------|--------|------------|----------|
diff --git a/scripts/reboot-recovery/reboot-auto-recovery-host-probe.sh b/scripts/reboot-recovery/reboot-auto-recovery-host-probe.sh
index 0fd1e559f..08ad26936 100755
--- a/scripts/reboot-recovery/reboot-auto-recovery-host-probe.sh
+++ b/scripts/reboot-recovery/reboot-auto-recovery-host-probe.sh
@@ -153,7 +153,8 @@ probe_reachable_only() {

   if command -v nc >/dev/null 2>&1; then
     for port in ${BOOT_PROBE_TCP_PORTS:-22 80 443 3389 5985 9100}; do
-      if nc -z -w "${TCP_CONNECT_TIMEOUT_SECONDS:-2}" "$target_host" "$port" >/dev/null 2>&1; then
+      if run_with_timeout "${TCP_CONNECT_TIMEOUT_SECONDS:-2}" \
+        nc -z -w "${TCP_CONNECT_TIMEOUT_SECONDS:-2}" "$target_host" "$port" >/dev/null 2>&1; then
         emit_boot_row "$alias" "$target" "$unit" 1 "reachable_unknown_boot" "unknown" "tcp_${port}_reachable" "unknown" "unknown"
         return 0
       fi
diff --git a/scripts/reboot-recovery/tests/test_reboot_p0_operational_contract.py b/scripts/reboot-recovery/tests/test_reboot_p0_operational_contract.py
index 7a4be0312..18d31fe23 100644
--- a/scripts/reboot-recovery/tests/test_reboot_p0_operational_contract.py
+++ b/scripts/reboot-recovery/tests/test_reboot_p0_operational_contract.py
@@ -19,6 +19,7 @@ def test_reboot_p0_contract_covers_all_required_hosts_and_vmware_autostart() ->

     for host in ["99", "110", "111", "112", "120", "121", "188"]:
         assert host in host_probe
+    assert 'run_with_timeout "${TCP_CONNECT_TIMEOUT_SECONDS:-2}"' in host_probe
     assert "AWOOOI-Start-VMware-VMs" in windows99
     assert "NoAutoRebootWithLoggedOnUsers" in windows99
     assert "Host110Vmx" in windows99
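
Reviewer note: the two rules this patch encodes, an outer timeout around each probe command and fail-closed handling of missing host rows, can be sketched in Python, the language of the contract test. This is a minimal illustration under stated assumptions, not the repository's implementation. `run_bounded`, `missing_hosts`, `coverage_verdict`, and the `host_boot_detection` blocker label are hypothetical names; only the required host list and the 2-second default are taken from the diff.

```python
import subprocess
import sys

# Host aliases taken from the contract test in this diff.
REQUIRED_HOSTS = ("99", "110", "111", "112", "120", "121", "188")

# Mirrors the ${TCP_CONNECT_TIMEOUT_SECONDS:-2} default used by the shell script.
DEFAULT_TCP_CONNECT_TIMEOUT_SECONDS = 2


def run_bounded(cmd, timeout_seconds=DEFAULT_TCP_CONNECT_TIMEOUT_SECONDS):
    """Run cmd under an outer timeout (hypothetical helper).

    Returns the exit code, or None if the outer timeout fired. The child is
    killed on timeout, so one stuck probe cannot stall the remaining hosts.
    """
    try:
        completed = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
            check=False,
        )
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode


def missing_hosts(observed, required=REQUIRED_HOSTS):
    """Return required hosts that have no row, in the order of `required`."""
    seen = set(observed)
    return [host for host in required if host not in seen]


def coverage_verdict(observed, verifier_timed_out=False):
    """Fail closed: a timeout or any missing host blocks 'all hosts observed'."""
    missing = missing_hosts(observed)
    if verifier_timed_out or missing:
        return {
            "all_hosts_observed": False,
            "blocker": "host_boot_detection",  # hypothetical label
            "missing_hosts": missing,
        }
    return {"all_hosts_observed": True, "blocker": None, "missing_hosts": []}


if __name__ == "__main__":
    # A deliberately hung child stands in for an `nc` that ignores `-w`.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_bounded(hung, timeout_seconds=0.5))
    # A partial probe like the earlier failure mode (rows for 99 / 110 only).
    print(coverage_verdict(["99", "110"]))
```

The sketch treats a timed-out verifier as a blocker even when every host has a row, on the reading that a run cut short cannot vouch for its own completeness. Whether the real detector and scorecard draw the line in the same place is not shown in this diff.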