fix(reboot): surface windows99 console channel readback

ogt
2026-07-02 18:32:52 +08:00
parent f237b25453
commit 5b5ef7fe2d
12 changed files with 269 additions and 10 deletions


@@ -57,9 +57,9 @@
| Order | ID | Priority | User-inserted request | Normalized work item | Current status | Next verifiable action |
| --- | --- | --- | --- | --- | --- | --- |
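| 1 | CIR-P0-RBT-001 | P0 | "All hosts must fully recover within 10 minutes of a reboot, and it must be determined automatically that all hosts were rebooted" | Build a reboot event detector for 99/110/111/112/120/121/188 + 10-minute SLO scorecard + fixed triage order | Live scorecard updated 2026-07-02 15:08: readiness `43%`, active blockers `11`; `windows99_verify_collection` and `windows99_management_channel` are now in the API / scorecard; fresh all-host 10-minute proof is still missing; 111 is unreachable; the 99 uptime / VMware verifier loop is not closed | Prioritize converging the 99 no-secret management channel / verifier readback and 111 reachability; do not claim the 10-minute SLA is proven |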
| 1 | CIR-P0-RBT-001 | P0 | "All hosts must fully recover within 10 minutes of a reboot, and it must be determined automatically that all hosts were rebooted" | Build a reboot event detector for 99/110/111/112/120/121/188 + 10-minute SLO scorecard + fixed triage order | Live scorecard updated 2026-07-02 18:28: readiness `43%`, active blockers `11`; `windows99_verify_collection`, `windows99_management_channel`, and `windows99_local_console_channel_reachable` are now in the API / scorecard; fresh all-host 10-minute proof is still missing; 111 is unreachable; the 99 uptime / VMware verifier loop is not closed | Prioritize converging the 99 local console Verify output / no-secret management channel and 111 reachability; do not claim the 10-minute SLA is proven |
| 2 | CIR-P0-RBT-002 | P0 | "No host reboot was detected" | Fix host reboot/shutdown/up detection; boot_id / uptime / node exporter / Windows exporter / VMware VM power state must all feed the same event | Scorecard is wired to the collection packet + management probe; host 99 is reachable but its uptime is unknown; 111 is unreachable; stale hosts still exist | Get the 99 verifier / Windows exporter or an equivalent no-secret readback into the host boot event, and add 111 reachability evidence |
| 3 | CIR-P0-RBT-003 | P0 | "VMware on 192.168.0.99 must start automatically, and 111/188/120/121/112 inside it must start automatically too" | Windows 99 VMware host autostart + guest VM autostart contract; boot order and readback for VM hosts 111/188/120/121/112 | Source verifier / parser / API readback / collection packet are complete; management probe reads back `host_reachable=true`, RDP open, SSH BatchMode `permission_denied`, WinRM timeout; snapshot active blockers=`windows99_remote_execution_channel_unavailable`, `windows99_vmware_autostart_readback_missing` | Restore the no-secret management channel or collect the local console Verify output, then confirm that `VMRUN_PRESENT`, scheduled task, VMware services, VM power, and VMX present are all green |
| 3 | CIR-P0-RBT-003 | P0 | "VMware on 192.168.0.99 must start automatically, and 111/188/120/121/112 inside it must start automatically too" | Windows 99 VMware host autostart + guest VM autostart contract; boot order and readback for VM hosts 111/188/120/121/112 | Source verifier / parser / API readback / collection packet are complete; management probe reads back `host_reachable=true`, RDP open, `2179` VMConnect / console channel open, SSH BatchMode `permission_denied`, WinRM timeout; snapshot active blockers=`windows99_remote_execution_channel_unavailable`, `windows99_vmware_autostart_readback_missing` | Collect the local console Verify output or restore the no-secret management channel, then confirm that `VMRUN_PRESENT`, scheduled task, VMware services, VM power, and VMX present are all green |
| 4 | CIR-P0-RBT-004 | P0 | "192.168.0.99 must not reboot without warning because of Windows Update" | Windows Update reboot policy; active hours / no auto-restart / maintenance window / update notification audit | Source verifier now covers `WINDOWS_UPDATE_POLICY` and `WINDOWS_UPDATE_NO_AUTO_REBOOT_READY`; the collection packet lists forbidden actions; the 99 management channel cannot yet collect the policy readback | Obtain the Verify output; if the policy is not green, proceed with a controlled apply; requesting or recording the Windows password is forbidden |
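| 5 | CIR-P0-RBT-005 | P0 | "502 errors after a site restart seriously hurt the experience; there should be a maintenance page, using an external cloud or a professional approach" | Public maintenance fallback; Nginx / edge / external static maintenance page / status page / fail-open UX; avoid serving raw 502s | Not yet fully implemented; currently a requirements gap | Produce a `public_maintenance_fallback` decision record; compare the risks of DNS / edge / external cloud / local Nginx fallback; start with a check-mode that does not switch traffic |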
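| 6 | CIR-P0-RBT-006 | P0 | "Send a Telegram alert immediately when any host shuts down, alert again after it reboots, and think through all other alerts thoroughly as well" | Alert matrix: down / shutdown suspected / reboot detected / reboot recovered / SLO missed / backup failed / freshness stale / CPU pressure / Gitea queue | Some Alertmanager rules and Telegram receipt hardening already exist; a complete shutdown/up E2E receipt is still missing | Build the Telegram alert matrix + receipt verifier; read back Alertmanager active/resolved state and the outbound receipt item by item; do not send test secrets |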
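
Rows 1 and 2 above describe detecting a reboot from boot_id readbacks and scoring recovery against a 10-minute SLO. The following is only a minimal sketch of that idea, not the repository's detector: the `Sample` type, the `recovery_status` function, and the status strings are hypothetical names introduced for illustration. It assumes each probe yields a timestamp, an optional boot_id (missing when the host is unreachable or the readback fails), and a health flag.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

SLO = timedelta(minutes=10)  # recovery window taken from the table above


@dataclass(frozen=True)
class Sample:
    """One probe result for a host (hypothetical shape, for illustration)."""
    at: datetime
    boot_id: Optional[str]  # None when unreachable or readback failed
    healthy: bool


def recovery_status(samples: list[Sample], slo: timedelta = SLO) -> str:
    """Classify one host's timeline from its probe samples."""
    ordered = sorted(samples, key=lambda s: s.at)
    known = [s for s in ordered if s.boot_id is not None]
    if len(known) < 2:
        # Not enough evidence: never claim the SLO is met.
        return "unknown"

    # A reboot is a change of boot_id between consecutive readable samples.
    for prev, cur in zip(known, known[1:]):
        if prev.boot_id != cur.boot_id:
            last_before, new_boot_id = prev, cur.boot_id
            break
    else:
        return "no_reboot"

    # Conservative choice: the outage clock starts at the last sample
    # observed on the old boot, not at the first failed probe.
    for s in ordered:
        if s.at > last_before.at and s.boot_id == new_boot_id and s.healthy:
            within = s.at - last_before.at <= slo
            return "recovered_within_slo" if within else "slo_missed"
    return "not_recovered"
```

Returning `unknown` when fewer than two boot_id readbacks exist mirrors the table's rule that the 10-minute SLA must not be claimed without evidence, as with host 99 while its uptime cannot be read back.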