fix(reboot): bound host boot tcp probes

Your Name
2026-07-03 02:00:55 +08:00
parent 1392369e56
commit e8e9bf33a6
5 changed files with 19 additions and 2 deletions


@@ -1,3 +1,15 @@
## 2026-07-03 - 01:58 P0-006 host-probe TCP timeout tightening
**Completed**
- Continuing the production reboot SLO primary lane `reboot_event_detector_and_host_probe`: confirmed that when the previous round's host probe hung after emitting rows only for 99 / 110, the rows for 111 / 112 / 120 / 121 / 188 went missing, which directly breaks the 10-minute SLO determination.
- `scripts/reboot-recovery/reboot-auto-recovery-host-probe.sh` now wraps every `nc -z` port probe in the reachable-only fallback with `run_with_timeout "${TCP_CONNECT_TIMEOUT_SECONDS:-2}"`, so that an `nc -w` that fails to exit on time in some environments cannot stall the probes for the remaining hosts.
- `scripts/reboot-recovery/tests/test_reboot_p0_operational_contract.py` pins the requirement that the TCP probe has an outer timeout; `FULL-STACK-COLD-START-SOP.md` is bumped to v1.100 and fixes the rule that a partial probe or verifier timeout must fail closed and must not be reported as all-host observed.
- Live no-write artifact `/tmp/awoooi-reboot-host-probe-confirm-20260703-015844`: the bounded host probe completed in about `10.825s`, with `7` `HOST_BOOT` rows, observed hosts `110,111,112,120,121,188,99`, and `missing_hosts=[]`.
**Unchanged**
- The 10-minute all-host auto-recovery SLO is not yet met: 111 is still unreachable; 99 is still only ping-reachable with uptime unknown; 188 still shows `systemd_state=degraded` and `awoooi-startup.service failed`; the detector still returns `reboot_detected=false` and `recovery_deadline_status=target_window_elapsed`.
- Did not read secrets / tokens / `.env` / raw sessions / SQLite / auth; did not use GitHub / gh; no workflow_dispatch; no host / VM / service restart; no Docker / Nginx / K3s / DB / firewall restart; no DROP / TRUNCATE / restore / prune / delete / force push.
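The fail-closed rule in this entry can be illustrated with a small Python sketch. It is not the logic of `reboot-event-detector.py` or the scorecard script, which this commit does not show; the function names `missing_hosts` and `all_hosts_observed` are hypothetical, and the required host list is assumed from the aliases the contract test iterates over.

```python
# Hypothetical sketch of a fail-closed host-coverage check.
# REQUIRED_HOSTS is an assumption taken from the aliases in the contract test.
REQUIRED_HOSTS = ("99", "110", "111", "112", "120", "121", "188")


def missing_hosts(observed, required=REQUIRED_HOSTS):
    """Return required host aliases with no probe row, in required order."""
    seen = {str(host).strip() for host in observed}
    return [host for host in required if host not in seen]


def all_hosts_observed(observed, probe_timed_out=False, required=REQUIRED_HOSTS):
    """Fail closed: a timed-out probe or any missing row is never 'all-host observed'."""
    if probe_timed_out:
        return False
    return not missing_hosts(observed, required)


if __name__ == "__main__":
    # Observed hosts as reported by the bounded probe in this entry.
    print(missing_hosts("110,111,112,120,121,188,99".split(",")))  # []
    # The earlier stalled run emitted rows only for 99 and 110.
    print(missing_hosts(["99", "110"]))  # ['111', '112', '120', '121', '188']
```

A verifier timeout is treated the same way as a missing row, so a run that covers only the first few hosts can never be reported as complete.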
## 2026-07-03 - 01:24 IwoooS first-screen segmented workbench
**Completed**


@@ -1,6 +1,6 @@
# AWOOOI Full-Stack Cold Start and Host Reboot SOP
-> Version: v1.99
+> Version: v1.100
> Last updated: 2026-07-03 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.
@@ -32,6 +32,8 @@ v1.98 reboot-event detector unknown-uptime rule`reboot-event-detector.py` 不
v1.99 runtime blocker action-matrix classification rule: the `conversation_event_hot_path_index_migration_source_missing` blocker brought back by the Prometheus / runtime overlay must not remain in `reboot_slo_unknown`. This blocker belongs to `host_cpu_pressure`; the owner lane is fixed as `host_pressure_controller`; the next action is fixed as `restore_conversation_event_hot_path_migration_source_then_rerun_host_pressure_and_reboot_slo_scorecard_no_restart`; evidence inputs must include `source_controls` and `runtime_metric_readback`. This rule authorizes only source-control / verifier repair and check-mode; it still prohibits performing a destructive DB migration, killing processes, restarting Docker / DB / K3s / Nginx, or rebooting directly.
v1.100 host-probe bounded TCP verifier rule: the reachable-only TCP fallback in `scripts/reboot-recovery/reboot-auto-recovery-host-probe.sh` must not rely on `nc -w` alone; every `nc -z` port probe must additionally be wrapped in `run_with_timeout "${TCP_CONNECT_TIMEOUT_SECONDS:-2}"`, so that a single stuck host or port cannot cause later hosts to go missing from the output and leave the 10-minute SLO readback undeterminable. If the verifier times out or host rows are missing, `reboot-event-detector.py` / `reboot-auto-recovery-slo-scorecard.py` must fail closed to a host boot detection blocker and must not use the first-half partial probe to claim all-host observed, fresh reboot, or auto-recovery ready. 2026-07-03 live no-write verification artifact `/tmp/awoooi-reboot-host-probe-confirm-20260703-015844`: the bounded host probe completed in about `10.825s`, with `7` `HOST_BOOT` rows and `missing_hosts=[]`; however, 111 is still unreachable, 99 still reports `uptime_seconds=unknown`, and 188 is still `systemd_state=degraded` / `startup_active=failed`, so the 10-minute all-host auto-recovery SLO remains blocked.
2026-07-02 110 control-path / Harbor recovery receipt rule: if the Gitea Harbor repair queue still retains `harbor_110_remote_ssh_publickey_auth_stalled`, remote-control unavailable, stale jobs, or a historical failure, but local evidence from the same round simultaneously proves that the `wooo` command path is ready, the 110 local Harbor `/v2/` is ready, and the public/internal registry `/v2/` returns `401`, then that Gitea Harbor repair failure may be listed only as historical queue metadata and must no longer be treated as a current SSH blocker. Use `/api/v1/agents/harbor-registry-controlled-recovery-receipt` or an equivalent validator to merge `diagnose-110-ssh-publickey-auth.sh`, `recover-110-control-path-and-harbor-local.sh --check`, the public Gitea queue readback, and the registry `/v2/` verifier, and write the machine-readable result into a snapshot of the `docs/operations/harbor-110-control-path-recovery-readback-2026-07-02.snapshot.json` type. The 2026-07-02 live receipt shows: public/internal registry `/v2/` both return `401`; the latest visible CD `#4335` is `Success`; the Gitea Harbor repair failure is already `historical_after_latest_cd_success=true`; and the active blockers have narrowed to the 110 controlled CD lane config / binary / registration / service guardrail, active action container pressure, and the Gitea CD jobs head-SHA / stale readback mismatch. If the local-console output contains only the `AWOOOI_110_CONTROLLED_CD_LANE_READY` marker, the non110 runner parser must not derive a non110 blocker from 110 `BLOCKER` lines; non110 may be listed as an active blocker only when the `AWOOOI_NON110_RUNNER_READY` marker is seen.
2026-07-02 110 controlled CD lane fail-closed enforcer staging rule: after the 110 runner pressure incident, legacy / generic runners must still fail closed; the non-secret staging artifacts of `awoooi-cd-lane-drain.service` must no longer be indiscriminately sealed back into stubs by the enforcer. `scripts/reboot-recovery/enforce-110-runner-failclosed.sh` may retain the drain config / binary / unit and emit `CONTROLLED_DRAIN_STAGING_ALLOWED=1` plus a textfile metric only when all of the following hold: `config.yaml` satisfies `capacity <= 1` and contains only `awoooi-host:host` and `awoooi-ubuntu:docker://192.168.0.110:5000/awoooi/ci-runner:act-22.04`; the binary is an executable ELF; the systemd unit has guardrails such as `ConditionPathExists=/home/wooo/awoooi-cd-lane-drain/data/.runner` and `CPUAccounting` / `MemoryAccounting` / `TasksAccounting` / `NoNewPrivileges`; and the service is `inactive` with `MainPID=0` and is neither enabled nor masked. This staging rule must not read tokens, must not read the contents of `.runner`, must not register a runner, and must not start the service; if registration is missing, the readiness verifier must still leave only blockers of the `controlled_cd_lane_registration_missing` / `controlled_cd_lane_service_not_active` kind. If `CONTROLLED_DRAIN_STAGING_ALLOWED=0` and the config / binary has been moved away again, prioritize fixing the source enforcer / unit guardrail rather than repeatedly hand-patching the same set of artifacts.


@@ -31,6 +31,7 @@
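- User-visible 502s take priority over data freshness: restore static / container services first, then return to the data layer and version consistency.
- Version currency must be checked against the source SHA, deploy marker, runtime SHA, and public endpoint together; looking only at Gitea main is not enough.
- The 2026-06-30 live test showed that having the reboot detector / alert / VMware autostart / maintenance fallback in source does not mean the runtime meets the target; the all-host probe, SLO metric, public `/v2`, Stock freshness, backup-status, and Telegram delivery must all be read back together.
- The 2026-07-03 live test showed that if the host probe itself lacks a bounded TCP fallback, it manufactures partial evidence at the verifier layer and leaves the 10-minute SLO undeterminable. In `reboot-auto-recovery-host-probe.sh`, `nc -z` must be wrapped by `run_with_timeout`; partial probes / missing host rows always fail closed and must not be reported as all-host observed.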
| Area | Status | Completion | Evidence |
|------|--------|------------|----------|


@@ -153,7 +153,8 @@ probe_reachable_only() {
   if command -v nc >/dev/null 2>&1; then
     for port in ${BOOT_PROBE_TCP_PORTS:-22 80 443 3389 5985 9100}; do
-      if nc -z -w "${TCP_CONNECT_TIMEOUT_SECONDS:-2}" "$target_host" "$port" >/dev/null 2>&1; then
+      if run_with_timeout "${TCP_CONNECT_TIMEOUT_SECONDS:-2}" \
+        nc -z -w "${TCP_CONNECT_TIMEOUT_SECONDS:-2}" "$target_host" "$port" >/dev/null 2>&1; then
         emit_boot_row "$alias" "$target" "$unit" 1 "reachable_unknown_boot" "unknown" "tcp_${port}_reachable" "unknown" "unknown"
         return 0
       fi
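
The outer-timeout idea in this hunk can also be sketched in Python. This is an illustrative analog, not the script's `run_with_timeout` helper, whose body is not part of this diff. The Python function names, the `first_reachable_port` helper, and the argument shapes are all assumptions; only the `nc -z -w` flags come from the hunk above.

```python
import subprocess
import sys


def run_with_timeout(seconds, argv):
    """Run argv and return True only if it exits 0 within `seconds`.

    subprocess.run kills the child and raises TimeoutExpired once the
    deadline passes, so a hung probe cannot block the caller.
    """
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=seconds,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Fail closed: a timeout or a missing binary counts as "not reachable".
        return False
    return result.returncode == 0


def first_reachable_port(host, ports, timeout=2):
    """Hypothetical helper: return the first port `nc -z` can reach, else None."""
    for port in ports:
        argv = ["nc", "-z", "-w", str(int(timeout)), host, str(port)]
        if run_with_timeout(timeout, argv):
            return port
    return None


if __name__ == "__main__":
    # A child that sleeps past the deadline is killed and reported as a failure.
    slow_child = [sys.executable, "-c", "import time; time.sleep(10)"]
    print(run_with_timeout(0.5, slow_child))  # False
```

Because the deadline is enforced by the caller rather than by `nc` itself, the worst-case time per port is bounded even if `nc -w` misbehaves, which is the property the shell change relies on.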


@@ -19,6 +19,7 @@ def test_reboot_p0_contract_covers_all_required_hosts_and_vmware_autostart() ->
     for host in ["99", "110", "111", "112", "120", "121", "188"]:
         assert host in host_probe
+    assert 'run_with_timeout "${TCP_CONNECT_TIMEOUT_SECONDS:-2}"' in host_probe
     assert "AWOOOI-Start-VMware-VMs" in windows99
     assert "NoAutoRebootWithLoggedOnUsers" in windows99
     assert "Host110Vmx" in windows99