docs(reboot): record cd readback closure [metadata-only]
All checks were successful
CD Pipeline / workflow-shape (push) Successful in 1s
CD Pipeline / cancel-stale-cd (push) Has been skipped
CD Pipeline / tests (push) Successful in 1m8s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
@@ -1,3 +1,19 @@
## 2026-07-03 — 10:44 CD bounded wrapper production readback succeeded

**Completed**:
- Gitea CD `#4558` reached this round's `run_docker_step` wrapper: API build / API push sha / API push latest / Web build / Web push sha / Web push latest all returned `rc=0`, and `cd_docker_step_timeout` was not triggered.
- CD pushed the deploy marker `bc8d3af3f chore(cd): deploy dacdd90 [skip ci]`; the public queue log shows `deploy_marker=dacdd90`, `job_succeeded=true`, and `production_deploy_readback_matched=true`. Gitea remote `main` has fast-forwarded to the deploy marker.
- Production `/api/v1/agents/reboot-auto-recovery-slo-scorecard` now reads back the latest API behavior: `active_blocker_count=10`, `readiness_percent=60`, `runtime_metric_runtime_readback_added_blockers=["windows99_vmware_autostart_config_not_ready","windows99_update_no_auto_reboot_policy_not_ready"]`, `windows99_update_no_auto_reboot_ready=false`.
- This is not SLO completion; it is correct fail-closed behavior. The Windows99 VMware service / autostart config / Windows Update policy blockers that were previously buried in nested evidence now appear in `active_blockers` and in the fixed next action. The next step is fixed as `restore_windows99_missing_vmx_source_for_aliases_then_rerun_no_secret_collector_and_scorecard_no_vm_power_change`.

**Verification run**:
- Gitea remote ref: `main=bc8d3af3f`, deploy marker subject `chore(cd): deploy dacdd90 [skip ci]`.
- Public queue readback: CD `#4558` log success marker / production deploy readback matched.
- Production scorecard readback: active blockers `10`, with the new Windows99 config / policy blockers added; still `status=blocked_reboot_auto_recovery_slo_not_ready`.

**Constraints still upheld**:
- Did not read secret / token / `.env` / raw sessions / SQLite / auth; did not use GitHub / gh; no workflow_dispatch; did not start or shut down any VM; did not restart any host / service; no Docker / Nginx / K3s / DB / firewall restart; no DROP / TRUNCATE / restore / prune / delete / force push.
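The bounded-step behavior described in this entry can be sketched roughly as follows. Only the names `run_docker_step`, `rc`, and the marker text `BLOCKER cd_docker_step_timeout` come from the log. The real wrapper lives in `.gitea/workflows/cd.yaml` and is not shown in this commit, so the Python form, the function signature, the `124` return code, and the output format are all assumptions made for illustration.

```python
import subprocess
import sys

# Marker text taken from the log entry; everything else here is hypothetical.
TIMEOUT_MARKER = "BLOCKER cd_docker_step_timeout"


def run_docker_step(label, argv, timeout_s):
    """Run one build/push step under a hard time limit and return its exit code.

    On timeout the child process is killed and a named blocker line is
    printed, so a log reader can classify the failure instead of seeing a
    run that stays `Running` with no ETA.
    """
    try:
        completed = subprocess.run(argv, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        print(f"{TIMEOUT_MARKER} step={label} timeout_s={timeout_s}", flush=True)
        return 124  # assumption: mirror the exit code used by coreutils `timeout`
    print(f"step={label} rc={completed.returncode}", flush=True)
    return completed.returncode


if __name__ == "__main__":
    # A harmless stand-in for `docker build`; a real call would pass the docker argv.
    run_docker_step("api-build", [sys.executable, "-c", "pass"], timeout_s=10)
```

Under this sketch, each of the six steps listed above (API/Web build, push sha, push latest) would be one call, and a step that exceeds its limit leaves a single greppable line rather than a hung job.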
## 2026-07-03 — 10:30 CD Docker build/push bounded timeout and queue classifier

**Completed**:
@@ -13,20 +13,20 @@
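This section supersedes the old "manual triage after each individual reboot" practice. All subsequent status reports must proceed in this order; if noise would mask a P0, attach it to the same row instead of opening a separate side track.

### 2026-07-03 10:30 Latest P0 override ordering
### 2026-07-03 10:44 Latest P0 override ordering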
The table below supersedes the initial 2026-06-30 incident rows; the old table is kept for historical tracking. Every newly inserted requirement must be attached to this table and must not be scattered into ad hoc side tracks again.

| Priority | Status | Work item | Latest evidence | Next step / completion criteria |
|------|------|--------|----------|-------------------|
| P0-1 | BLOCKED_HOST_WINDOWS | All-host reboot auto-detection / auto-trigger / 10-minute recovery SLO | 2026-07-03 08:59 production scorecard: `status=blocked_reboot_auto_recovery_slo_not_ready`, `active_blocker_count=8`, `readiness_percent=67`, `primary_blocker=reboot_event_required_host_unreachable`, `can_claim_all_services_recovered_within_target=false`. No-write artifact `/tmp/awoooi-reboot-continue-20260703-085205`: 111 socket probe still timeout / unreachable; 99 RDP / VMConnect reachable but uptime unknown; 188 socket probe reachable, previous host probe still `systemd_state=degraded` / `startup_active=failed`. Production readback also shows `windows99_vmware_readback_present=true`, `windows99_vmware_config_ready=false`, `windows99_vmware_power_ready=false`, `windows99_missing_vmx_aliases=["111"]`, `windows99_powered_off_aliases=["111","112","120","121","188"]`. The 09:22 source/API contract additionally projects the unreliable console artifact as a `windows99_console_clipboard_unreliable`-class active blocker. | First converge 99 / Windows99 / VMware and 111: restore the no-secret management channel or obtain a complete console Verify stdout that the validator accepts, read back VMX / VM power / host uptime, then rerun host probe + reboot-event detector; no reboot, no VM power change, no reading of Windows passwords, and no RDP clipboard fragments as completion evidence. |
| P0-1 | BLOCKED_HOST_WINDOWS | All-host reboot auto-detection / auto-trigger / 10-minute recovery SLO | 2026-07-03 10:44 production scorecard: `status=blocked_reboot_auto_recovery_slo_not_ready`, `active_blocker_count=10`, `readiness_percent=60`; active blockers still include reboot event / host unreachable / 99+111+Windows99 VMX / guest power, and now also `windows99_vmware_autostart_config_not_ready` and `windows99_update_no_auto_reboot_policy_not_ready`. No-write artifact `/tmp/awoooi-reboot-continue-20260703-085205`: 111 socket probe still timeout / unreachable; 99 RDP / VMConnect reachable but uptime unknown; 188 socket probe reachable, previous host probe still `systemd_state=degraded` / `startup_active=failed`. | First converge 99 / Windows99 / VMware and 111: restore the no-secret management channel or obtain a complete console Verify stdout that the validator accepts, read back VMX / VM power / host uptime, then rerun host probe + reboot-event detector; no reboot, no VM power change, no reading of Windows passwords, and no RDP clipboard fragments as completion evidence. |
| P0-2 | BLOCKED_EDGE_PRIVILEGED_APPLY | Public 502 maintenance page and external fallback during deploy / reboot | Gitea CD `#4519` pushed deploy marker `3aca484 -> a94ddd5`, but after the marker the public probe still reads a raw `502` from `https://awoooi.wooo.work/api/v1/health` with an empty fallback header/body; on live 188, `/etc/nginx/sites-enabled/awoooi.wooo.work.conf` lacks the maintenance fallback, `/var/www/maintenance/maintenance.html` is missing, and `ollama@188` has no passwordless sudo. | First run `scripts/reboot-recovery/public-maintenance-edge-fallback-apply.sh --check` to leave a drift receipt; once a privileged channel is available, run `--apply`, requiring backup, `nginx -t`, reload, and public route probe to be all green. Do not read passwords, and do not use an app restart to mask edge fallback drift. |
| P0-3 | BLOCKED_DEPLOY_READBACK | Version and data freshness of all products / websites | Gitea `main=647d81163` already contains the Windows99 collected verifier / next safe action source; the production SLO route has not yet read back that source: `active_blocker_count=8`, `runtime_metric_runtime_readback_added_blockers=[]`, `windows99_update_no_auto_reboot_ready=true`. Public Gitea CD `#4556` is still `Running`; the build log shows that after waiting on the Docker build lock it cleared the empty lock on its own and entered API/Web build/push, but there is no deploy marker / production deploy readback yet. The 10:30 source adds the CD Docker build/push bounded timeout and queue classifier, so the next round is not left with only a `Running` that has no ETA. Stock freshness remains `status=ok`, `latest_trading_date=2026-07-02`, blockers `[]`. | First let CD complete the deploy marker / production readback; the completion criterion is that source SHA, deploy marker, runtime endpoint, public route watch, and freshness all agree (the four-layer consistency check). CD `Running`, an up-to-date Gitea main, or green tests alone must not be used to claim production is latest. |
| P0-4 | BLOCKED_WINDOWS99_AUTOSTART | VMware autostart on 192.168.0.99 and VM guests 111 / 188 / 120 / 121 / 112 | 10:15 source/API patch: the production Windows99 nested readback still shows `missing_vmx_aliases=["111"]` and `powered_off_aliases=["111","112","120","121","188"]`, and the nested `service_blockers=["VMAuthdService","VMnetDHCP"]` and `policy_blockers=["windows_update_policy_readback_missing"]` may no longer be flattened into a single guest power blocker; the API overlay now adds `windows99_vmware_autostart_config_not_ready` / `windows99_update_no_auto_reboot_policy_not_ready` as runtime-readback-added blockers. 09:53 no-secret management probe: 99 reachable, RDP / Hyper-V VMConnect reachable, WinRM timeout, Mac vantage collector still `ssh_batchmode_auth_ready=0` / `blocked_ssh_publickey_auth_missing`; the no-secret collector readback from the production vantage once returned collected, but completion still depends on VMX / service / policy / power all being green. | First fix the 111 VMX source and the VMware service / policy readback so that the check-mode package can explain the powered-off aliases; do not restart Windows services, change the registry, start / shut down VMs, or reboot the host directly from the scorecard. Completion criteria: VMX config ready, guest power ready, 99 uptime known, all required hosts reachable, and the active blocker matrix shows service / policy / VMX / power converging separately. |
| P0-3 | PARTIAL_GREEN_DEPLOYED_SOURCE | Version and data freshness of all products / websites | Gitea `main=bc8d3af3f chore(cd): deploy dacdd90 [skip ci]`; the CD `#4558` log shows all API/Web build/push wrappers at `rc=0`, deploy marker `dacdd90`, and production deploy readback matched. The production SLO route now reads back the latest API behavior: `runtime_metric_runtime_readback_added_blockers=["windows99_vmware_autostart_config_not_ready","windows99_update_no_auto_reboot_policy_not_ready"]`, `windows99_update_no_auto_reboot_ready=false`. Stock freshness remains `status=ok`, `latest_trading_date=2026-07-02`, blockers `[]`. | The version layer has caught up on AWOOOI production; completing the overall P0 still requires converging the Windows99 VMX / service / policy / power and host reboot-event blockers. Going forward, every public product must still keep source SHA, deploy marker, runtime endpoint, public route watch, and freshness in agreement (the four-layer consistency check). |
| P0-4 | BLOCKED_WINDOWS99_AUTOSTART | VMware autostart on 192.168.0.99 and VM guests 111 / 188 / 120 / 121 / 112 | The 10:44 production readback has rolled the nested Windows99 evidence up into active blockers: `windows99_vmware_vmx_missing`, `windows99_vmware_guest_power_not_ready`, `windows99_vmware_autostart_config_not_ready`, `windows99_update_no_auto_reboot_policy_not_ready`; `runtime_metric_runtime_readback_added_blockers` contains the config / policy pair, and `windows99_update_no_auto_reboot_ready=false`. 09:53 no-secret management probe: 99 reachable, RDP / Hyper-V VMConnect reachable, WinRM timeout, Mac vantage collector still `ssh_batchmode_auth_ready=0` / `blocked_ssh_publickey_auth_missing`; completion still depends on VMX / service / policy / power all being green. | First fix the 111 VMX source and the VMware service / policy readback so that the check-mode package can explain the powered-off aliases; do not restart Windows services, change the registry, start / shut down VMs, or reboot the host directly from the scorecard. Completion criteria: VMX config ready, guest power ready, 99 uptime known, all required hosts reachable, and the active blocker matrix shows service / policy / VMX / power converging separately. |
| P0-5 | RUNTIME_READY_BACKUP_RECEIPT_GAP | Backup monitoring and alerting for Gitea / hosts / DB / websites / services / packages / tools / logs | Gitea repo bundle readback ready: expected `12`, rows `12`, missing `0`, failed `0`, sample restore dry-run ok; backup core green. As of 2026-07-03, source / runtime have deployed the `awoooi_backup_alert_receipt_*` metrics and Prometheus rules; the 110 exporter reads back 88 stage requirements and 188 reads back 12, `BackupAlertReceiptMetricMissing*` is inactive, and `BackupAlertReceiptStageMissing` has been fixed to aggregate pending alerts per `host / receipt_channel`: one for 110, one for 188. | Add the redacted `/backup/alert-receipts/*.last_success` markers; the next layer still needs Gitea full dump, DB/settings/issues/packages/LFS, and full backup monitoring for all tools and logs. |
| P0-6 | RUNTIME_READY_ALERT_RECEIPT_GAP | Telegram alerts for host shutdown / reboot / SLO miss / backup failure | Reboot per-blocker alert and backup receipt alert rules have been deployed and read back; missing backup receipt stages no longer produce 100 stage-level noise alerts and now aggregate into two host-level pending alerts for 110 / 188. The scorecard still has 8 reboot active blockers, and the full production matrix of redacted delivery receipts for shutdown / reboot / backup alerts is not yet complete. | Add alert receipt readback: host down, host up, SLO miss, Windows99 blocker, backup stale/failed, deploy 502, freshness stale; the completion criterion is that every alert class has sent / received / dedup / escalation evidence. |
| P0-6 | RUNTIME_READY_ALERT_RECEIPT_GAP | Telegram alerts for host shutdown / reboot / SLO miss / backup failure | Reboot per-blocker alert and backup receipt alert rules have been deployed and read back; missing backup receipt stages no longer produce 100 stage-level noise alerts and now aggregate into two host-level pending alerts for 110 / 188. The scorecard currently has 10 reboot active blockers, of which the Windows99 config / policy blockers are now available for per-blocker alert routing; the full production matrix of redacted delivery receipts for shutdown / reboot / backup alerts is not yet complete. | Add alert receipt readback: host down, host up, SLO miss, Windows99 blocker, backup stale/failed, deploy 502, freshness stale; the completion criterion is that every alert class has sent / received / dedup / escalation evidence. |
| P0-7 | SOURCE_READY_SLA_AUTOMATION | Fixed triage order, ETA / wait reason, automated diagnosis and repair | The scorecard has fixed `current_phase=host_boot_detection_blocked`, `eta_or_wait_reason=reboot_event_readback_missing_eta_unavailable`, `primary_blocker=reboot_event_required_host_unreachable`, the fixed triage order, and the next safe action; the 08:23 artifact fixes the next step as no-secret Windows99 verify / host probe rerun. As of 09:22, `active_blocker_action_matrix` can point the unreliable console artifact to `windows99_console_or_no_secret_management_channel` and a fixed next safe action. As of 10:30, `read-public-gitea-actions-queue.py` can roll `BLOCKER cd_docker_step_timeout` up into `blocked_cd_docker_step_timeout` / `latest_visible_cd_docker_step_timeout*` / fixed safe-next-action; automatic recovery within 10 minutes is still not met. | Wire each blocker's next_safe_action, post_verifier, and forbidden_actions into automatic work items / Telegram / scorecard; the completion criterion is that after a reboot the system diagnoses automatically, alerts proactively, and reruns verifiers proactively, with no more on-the-spot manual guessing of the procedure. |
| P0-8 | PARTIAL_READY_POLICY | Prevent Windows Update from rebooting Windows99 without warning | 10:15 source/API patch: if the Windows99 nested verifier returns `policy_blockers=["windows_update_policy_readback_missing"]`, the production API must fail closed and re-raise `windows99_update_no_auto_reboot_policy_not_ready`; it must not rely solely on the stale Prometheus `windows99_update_no_auto_reboot_ready=true`. | Keep the no-secret verifier and add periodic policy readback and a Telegram drift alert; the completion criterion is that Windows Update policy drift / missing readback triggers an alert automatically without reading secrets. Do not apply registry changes directly from the scorecard. |
| P0-8 | BLOCKED_POLICY_READBACK | Prevent Windows Update from rebooting Windows99 without warning | The 10:44 production readback has failed closed: `windows99_update_no_auto_reboot_ready=false`, and active blockers include `windows99_update_no_auto_reboot_policy_not_ready`, sourced from the nested verifier's `policy_blockers=["windows_update_policy_readback_missing"]`. | Keep the no-secret verifier and add periodic policy readback and a Telegram drift alert; the completion criterion is that Windows Update policy drift / missing readback triggers an alert automatically without reading secrets. Do not apply registry changes directly from the scorecard. |
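The fail-closed overlay described in rows P0-4 and P0-8 can be illustrated with a minimal sketch. The blocker names and the `service_blockers` / `policy_blockers` keys come from the table. The function name, its signature, and the exact mapping from nested service blockers to `windows99_vmware_autostart_config_not_ready` are assumptions, since the production API code is not part of this commit.

```python
AUTOSTART_BLOCKER = "windows99_vmware_autostart_config_not_ready"
POLICY_BLOCKER = "windows99_update_no_auto_reboot_policy_not_ready"


def overlay_nested_blockers(metric_ready, active_blockers, nested):
    """Return (ready, active_blockers, added) after a fail-closed overlay.

    `metric_ready` stands for a possibly stale metric value such as
    `windows99_update_no_auto_reboot_ready`; `nested` is the nested verifier
    readback. Hypothetical helper, not the production implementation.
    """
    candidates = []
    if nested.get("service_blockers"):
        candidates.append(AUTOSTART_BLOCKER)  # assumed mapping
    if nested.get("policy_blockers"):
        candidates.append(POLICY_BLOCKER)
    # Only blockers not already active count as runtime-readback-added.
    added = [b for b in candidates if b not in active_blockers]
    # Fail closed: a nested policy blocker overrides a stale `ready=true`.
    ready = bool(metric_ready) and not nested.get("policy_blockers")
    return ready, list(active_blockers) + added, added


nested_readback = {
    "service_blockers": ["VMAuthdService", "VMnetDHCP"],
    "policy_blockers": ["windows_update_policy_readback_missing"],
}
ready, blockers, added = overlay_nested_blockers(
    True, ["windows99_vmware_vmx_missing"], nested_readback
)
print(ready, added)
```

The point of the design is that the stale metric can only ever make the result stricter, never greener: a `true` from the metric is accepted only when the nested readback reports no policy blocker.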
| Priority | Status | Work item | 2026-06-30 evidence | Next step / completion criteria |
|------|------|--------|------------------|-------------------|
@@ -65,11 +65,11 @@
| 6 | CIR-P0-RBT-006 | P0 | "Every host shutdown must trigger an immediate Telegram alert, reboots must alert too, and all other alerts must be thought through together" | Down / shutdown suspected / reboot detected / reboot recovered / SLO missed / backup failed / freshness stale / CPU pressure / Gitea queue alert matrix | HostDown / HostRebootEventDetected / RebootAutoRecoverySLOMissed already exist; per-blocker reboot alerts and backup receipt rules have been deployed and read back. Missing backup receipt stages have converged from 100 stage-level noise alerts to two host-level pending alerts for 110 / 188; a complete shutdown/up E2E receipt is still needed | Add Prometheus / Alertmanager active/resolved and outbound receipts; for backup alerts, first add the redacted `/backup/alert-receipts/*.last_success` markers, without sending test secrets or rebooting hosts |
| 7 | CIR-P0-RBT-007 | P0 | "None of the backups, including hosts, DB, websites, services, packages, tools, and logs, have monitoring or alerts" | Backup observability coverage: backup job inventory, last success, freshness, offsite, restore drill, Telegram/AwoooP receipt | Backup health exporter / alert rules / Gitea bundle restore dry-run already exist; the 2026-07-03 runtime reads back 88 receipt stage requirements on 110 and 12 on 188, `BackupAlertReceiptMetricMissing*` is inactive, and `BackupAlertReceiptStageMissing` is aggregated to one pending alert each for 110 / 188 | Add `/backup/alert-receipts/*.last_success`; then add Gitea full dump / DB / settings / issues / packages / LFS and full backup monitoring for all tools/logs |
| 8 | CIR-P0-RBT-008 | P0 | "Every reboot triage is different, nobody knows how long recovery takes, and that does not meet the SLA" | Standardized reboot runbook: fixed triage order, ETA, active blocker, remaining seconds, owner lane, next command | Production scorecard readback has fixed `status=blocked_reboot_auto_recovery_slo_not_ready`, readiness `67%`, active blockers `8`, primary `reboot_event_required_host_unreachable`; the 09:22 source/API contract wires the unreliable console artifact into `active_blocker_action_matrix`, owner lane, and next safe action | Prioritize converging 99 no-secret Verify / 111 reachability / 188 startup failed/degraded; do not bypass the scorecard with a different triage path |
| 9 | CIR-P0-RBT-009 | P0 | "All products and websites must be on the latest version; verify whether versions and data are up to date" | Product freshness/version matrix: source commit, deploy marker, runtime image, public health, data freshness, latest source availability | AWOOOI Gitea `main=647d81163` already contains the Windows99 collected verifier next-step source, but the production SLO route has not yet read back that source: `active_blocker_count=8`, `runtime_metric_runtime_readback_added_blockers=[]`, `windows99_update_no_auto_reboot_ready=true`; Gitea CD `#4556` is still Running, with no deploy marker / production deploy readback yet. StockPlatform public freshness / ingestion reads back `ok`, latest trading date `2026-07-02`. At 10:30 the CD Docker build/push bounded timeout and queue classifier source were added, to avoid another Running with no ETA in the next round. | First read back the AWOOOI deploy marker / production runtime SHA / scorecard behavior; then build the all-product readback table: product, canonical repo, main SHA, deploy marker, public URL, data freshness, blocked reason |
| 9 | CIR-P0-RBT-009 | P0 | "All products and websites must be on the latest version; verify whether versions and data are up to date" | Product freshness/version matrix: source commit, deploy marker, runtime image, public health, data freshness, latest source availability | AWOOOI Gitea `main=bc8d3af3f chore(cd): deploy dacdd90 [skip ci]`; the CD `#4558` log shows all API/Web build/push wrappers at `rc=0`, deploy marker `dacdd90`, and production deploy readback matched. The production SLO route now reads back the latest API behavior: `active_blocker_count=10`, `runtime_metric_runtime_readback_added_blockers=["windows99_vmware_autostart_config_not_ready","windows99_update_no_auto_reboot_policy_not_ready"]`, `windows99_update_no_auto_reboot_ready=false`. StockPlatform public freshness / ingestion remains `ok`, latest trading date `2026-07-02`. | The version layer has caught up on AWOOOI production; the next step is still to build the all-product readback table and to converge the Windows99 VMX / service / policy / power and host reboot-event blockers |
| 10 | CIR-P0-GIT-001 | P0 | "Have the Gitea repositories all disappeared? Is there no complete Gitea backup?" | Gitea repository identity + backup proof + restore drill: do not rely on UI visibility alone; compare SSH heads, repo path, bundle backup, restore sample | 2026-07-02 production `/api/v1/agents/gitea-repo-bundle-backup-readback` is ready: 9 expected repos present/ok, missing=0, failed=0, checksum_missing=0, bundle_fresh=true, all_expected_ok=true, sample_restore_dry_run_ok=true; the repo bundle / restore dry-run layer is closed, and this is not a repo-missing case. | Maintain daily bundle backup + restore dry-run monitoring; additionally add Gitea full dump / DB / settings / issues / packages / LFS backup readback. Do not delete repos / change visibility / read tokens / restore to production |
| 11 | CIR-P0-CPU-001 | P0 | "CPU load on 110 / 188 stays too high; why is there no monitoring alert and no proactive repair" | Sustained CPU pressure automation: Alertmanager → controller → evidence → service playbook → verifier → KM writeback | 110 already has `Host110SustainedModeratePressure`, the Gitea playbook, and Stock/Postgres evidence; 188 still needs an equivalent controller/alerts readback | Next, wire in `postgres_hot_query_or_backup_export_playbook`; also add the 188 equivalent readback, and do not close the case on a single drop in load |
| 12 | CIR-P0-CPU-002 | P0 | "Noise affects the real problems; handle them together" | Alert noise / real issue correlation: backup aggregate noise, CPU pressure, Gitea queue, and Stock freshness must be separated into primary and secondary causes | Partly noted in the SOP; a unified correlation scorecard is still needed | Build the incident correlation readback: primary_blocker, secondary_noise, ignored_noise_reason, evidence_ref |
| 13 | CIR-P0-CD-001 | P0 | "None of the projects can release / I need to see implemented results" | Gitea-only CD baseline: every main push must have a visible run, deploy marker, and production readback; GitHub is not a solution | AWOOOI main can be pushed, but the latest CD `#4556` is still Running and production has not yet read back the latest source; the source adds the `cd_docker_step_timeout` bounded marker and queue classifier, so that a stuck CD no longer lacks a named blocker | First push / read back the bounded CD classifier, then read the deploy marker / production runtime; then wire the product governance matrix into each product's Gitea CD readiness |
| 13 | CIR-P0-CD-001 | P0 | "None of the projects can release / I need to see implemented results" | Gitea-only CD baseline: every main push must have a visible run, deploy marker, and production readback; GitHub is not a solution | AWOOOI latest source has been deployed successfully through Gitea CD `#4558`: the bounded API/Web build/push wrappers actually ran with `rc=0`, deploy marker `bc8d3af3f chore(cd): deploy dacdd90 [skip ci]`, production deploy readback matched; the queue reader also has the `cd_docker_step_timeout` classifier | Wire the product governance matrix into each product's Gitea CD readiness; while CD is `Running` and not yet matched, do not claim the latest version is on production |
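| 14 | CIR-P1-AI-001 | P1 | "Where is the AI expertise? It must detect and repair proactively" | AI controlled repair loop: detect → classify → candidate → check-mode → controlled apply → post verifier → KM / PlayBook trust | CPU / Gitea / Telegram receipt have partly landed; the global AI loop is not yet fully wired | Add `candidate_action`, `controlled_apply_allowed`, `post_verifier`, and `trust_writeback` to every P0 runbook |
| 15 | CIR-P1-KM-001 | P1 | "Capture the repair process and lessons fully in the SOP and integrate them into the current version" | Every P0 repair must update LOGBOOK, SOP, PlayBook, and the workplan ledger together; it must not live only in conversation | This ledger, LOGBOOK, and SOP have started to be filled in; at 09:22 the lesson that the Windows99 console clipboard is unreliable was written into SOP v1.108, the P0 workplan, and the scorecard regression; an API/UI read model is still needed | Turn this ledger into a read-only API / governance UI row and establish `last_updated` / `evidence_count` |
| 16 | CIR-P1-WORK-001 | P1 | "Make all started, in-progress, and completed work clearly visible" | Work status inventory: Done / In Progress / Blocked / Deferred / Next Action + evidence | This ledger has an initial Done/In Progress/Blocked list; the new P0 items in this section still need to be included | Update Done/In Progress/Blocked below to list all of reboot/backup/VMware/maintenance/CPU |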
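Rows 9 and 13 state that a `Running` CD, an up-to-date `main`, or green tests alone must not be used to claim production is latest. A minimal sketch of such a gate is shown below. The layers checked (source SHA, deploy marker, runtime endpoint, public route watch, freshness) are the ones named in the table; the `DeployReadback` type, the field names, and the blocker strings are hypothetical, and it is an assumption that the runtime endpoint reports a comparable SHA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployReadback:
    source_sha: str          # SHA the deploy is supposed to ship
    deploy_marker_sha: str   # SHA named in the deploy marker subject ("" if none yet)
    runtime_sha: str         # SHA reported by the runtime endpoint (assumed available)
    public_route_ok: bool    # public route watch is green
    data_fresh: bool         # freshness readback is ok


def same_commit(a: str, b: str) -> bool:
    """Compare SHAs that may be abbreviated to different lengths."""
    return bool(a) and bool(b) and (a.startswith(b) or b.startswith(a))


def latest_claim_blockers(r: DeployReadback) -> list[str]:
    """Return the reasons production may NOT be called latest (empty list = may claim)."""
    blockers = []
    if not same_commit(r.source_sha, r.deploy_marker_sha):
        blockers.append("deploy_marker_not_matched")
    if not same_commit(r.source_sha, r.runtime_sha):
        blockers.append("runtime_endpoint_not_matched")
    if not r.public_route_ok:
        blockers.append("public_route_watch_not_green")
    if not r.data_fresh:
        blockers.append("freshness_not_ok")
    return blockers


# A CD run that is still Running has no deploy marker yet, so the claim is blocked.
still_running = DeployReadback("dacdd90", "", "", True, True)
print(latest_claim_blockers(still_running))
```

Returning the list of unmet layers, rather than a bare boolean, keeps the result usable as a named blocker in the same style as the scorecard.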
@@ -115,7 +115,7 @@
| Public maintenance fallback runtime readback | Gitea CD `#4459` / deploy marker `8d7a6faaf` let the production scorecard read back `public_maintenance_fallback.ready=true`, raw 5xx=`0`, P0 blockers `11`, readiness `47` |
| Reboot SLO per-blocker alert projection | Source adds `awoooi_reboot_auto_recovery_slo_active_blocker{blocker=...}`, `RebootAutoRecoveryActiveBlocker`, `RebootAutoRecoveryActiveBlockerMetricMissing`, and contract tests |
| Backup alert receipt runtime contract | Source / runtime add `awoooi_backup_alert_receipt_expected_info`, `awoooi_backup_alert_receipt_stage_fresh`, `BackupAlertReceiptMetricMissing*`, `BackupAlertReceiptStageMissing`, the baseline contract, the live visibility checker, and focused tests; the Prometheus rule is deployed and missing-stage alerts are aggregated into host-level pending alerts for 110 / 188; no Telegram message sent, no token read |
| CD Docker build/push timeout classifier source | `.gitea/workflows/cd.yaml` wraps API/Web `docker build` / `docker push` in `run_docker_step`, and a timeout emits `BLOCKER cd_docker_step_timeout`; `read-public-gitea-actions-queue.py` rolls it up into `blocked_cd_docker_step_timeout` / `latest_visible_cd_docker_step_timeout*` / fixed safe-next-action; local focused tests `92 passed` |
| CD Docker build/push timeout classifier production proof | `.gitea/workflows/cd.yaml` wraps API/Web `docker build` / `docker push` in `run_docker_step`, and a timeout emits `BLOCKER cd_docker_step_timeout`; `read-public-gitea-actions-queue.py` rolls it up into `blocked_cd_docker_step_timeout` / `latest_visible_cd_docker_step_timeout*` / fixed safe-next-action; local focused tests `92 passed`. The CD `#4558` live log proves that all API/Web build/push wrappers returned `rc=0`, with deploy marker `dacdd90` and production deploy readback matched |
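The roll-up that the last two rows attribute to `read-public-gitea-actions-queue.py` can be sketched as a small log classifier. The marker `BLOCKER cd_docker_step_timeout`, the status `blocked_cd_docker_step_timeout`, and the `job_succeeded` / `production_deploy_readback_matched` fields appear in this document. The function name, the `key=value` parsing, and the other two status strings are assumptions, and the real script may work quite differently.

```python
import re

# Matches `key=value` tokens such as `job_succeeded=true` in a log line.
KEY_VALUE = re.compile(r"\b(\w+)=(\S+)")


def classify_cd_log(log_text: str) -> str:
    """Map raw CD log text to one coarse status string (hypothetical helper)."""
    # A named timeout blocker wins over everything else in the log.
    if "BLOCKER cd_docker_step_timeout" in log_text:
        return "blocked_cd_docker_step_timeout"
    fields = dict(KEY_VALUE.findall(log_text))  # later occurrences overwrite earlier ones
    if (
        fields.get("job_succeeded") == "true"
        and fields.get("production_deploy_readback_matched") == "true"
    ):
        return "deployed_readback_matched"  # hypothetical status name
    return "running_or_unknown"  # hypothetical status name


sample = "deploy_marker=dacdd90 job_succeeded=true production_deploy_readback_matched=true"
print(classify_cd_log(sample))
```

Checking for the blocker marker first means a run that timed out in one step can never be reported as matched, even if an earlier line in the same log carried a success field.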
### In Progress
@@ -135,7 +135,7 @@
| OpenClaw / Gather-style persistent animation studio | The route exists and is listed as a P1 work item | Add production desktop/mobile smoke, AwoooP traffic routing, and screenshot evidence |
| AI professional UI / non-wall-of-text cockpit | Listed as P2 UX acceptance | Condense long text blocks into a first-viewport cockpit, cards, flow rows, and expandable details |
| 10-minute reboot auto-recovery SLA | The 2026-07-03 08:23 production scorecard is fixed at `8` blockers / readiness `67%`, but `can_claim_all_services_recovered_within_target=false`; artifact `/tmp/awoooi-reboot-verify-only-20260703-082310` retains the host probe / reboot detector / scorecard | Converge 99 no-secret Verify, 111 reachability, and 188 startup failed/degraded, then rerun host probe + reboot-event detector |
| AWOOOI latest source production deploy readback | Gitea `main=647d81163` has advanced and public CD `#4556` is Running; the production SLO route still shows the old blocker behavior and has not read back the latest source; this round's source adds the CD timeout classifier, awaiting push / CD / deploy marker / production readback | After pushing the bounded classifier commit, read `read-public-gitea-actions-queue.py --json`, the deploy marker, and the production scorecard; until matched, do not claim the latest version is on production |
| AWOOOI latest source production deploy readback | Gitea `main=bc8d3af3f`, deploy marker `dacdd90`; the public CD `#4558` log shows job succeeded / production deploy readback matched; the production scorecard has read back the latest API behavior and rolled up the Windows99 config / policy blockers | Next, return to the Windows99 VMX / service / policy / power and host reboot-event blockers; do not claim the 10-minute SLO is complete because deploy/readback succeeded |
| 99 Windows / VMware autostart | Source verifier / parser / API readback / collection packet are complete; the live 99 Verify output has not been collected yet; the 08:23 management probe shows RDP / Hyper-V VMConnect reachable but SSH BatchMode / WinRM blocked, collector `blocked_ssh_publickey_auth_missing` | Collect the 99 no-secret Verify output and confirm that VMs 111/188/120/121/112 are running and that scheduled task / services / Windows Update policy are all green |
| Reboot SLO blocker convergence | The production scorecard is fixed at `8` blockers / readiness `67%`; source / runtime add the per-blocker metric/alert; the remaining main blockers are host boot detection + Windows99 VMware VMX / guest power | Converge 99/111/188 by named blocker; do not claim SLA completion based on a green route or RDP visibility |
| Full backup monitoring/alert coverage | Exporter/rule already have host/DB/site/service/package/tool/log coverage and the backup alert receipt requirement; runtime rules are deployed and missing-stage alerts aggregate into host-level pending alerts for 110 / 188; production receipt markers / full Gitea dump / DB/settings/issues/packages/LFS are not yet all green | Add `/backup/alert-receipts/*.last_success`, Gitea full dump, and the restore drill verifier |
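The noise reduction mentioned for the backup receipt alerts (100 stage-level alerts collapsed into one pending alert per `host / receipt_channel`) can be illustrated in plain Python. The real aggregation is done by a Prometheus rule that is not shown in this commit; the record shape, the function name, and the `telegram` channel label below are assumptions, while the 88 and 12 stage counts for 110 and 188 come from the tables above.

```python
from collections import Counter


def aggregate_missing_stages(missing):
    """Collapse per-stage gaps into one pending alert per (host, receipt_channel)."""
    counts = Counter((m["host"], m["receipt_channel"]) for m in missing)
    return [
        {"host": host, "receipt_channel": channel, "missing_stage_count": n}
        for (host, channel), n in sorted(counts.items())
    ]


# 88 stage requirements on 110 and 12 on 188, all missing, on one assumed channel.
missing = [
    {"host": "110", "receipt_channel": "telegram", "stage": f"stage-{i}"} for i in range(88)
] + [
    {"host": "188", "receipt_channel": "telegram", "stage": f"stage-{i}"} for i in range(12)
]
alerts = aggregate_missing_stages(missing)
print(len(missing), "stage gaps ->", len(alerts), "pending alerts")
```

Keeping the per-group count in the aggregated alert preserves the detail for triage while leaving only one notification per host and channel.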