docs(governance): expand commander inserted requirement priorities
Some checks failed
CD Pipeline / workflow-shape (push) Successful in 0s
CD Pipeline / cancel-stale-cd (push) Has been skipped
CD Pipeline / tests (push) Successful in 1m4s
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
@@ -16,6 +16,21 @@
**Still in effect**:
- GitHub / `gh` / the GitHub API were not used; no secret / token / `.env` / raw sessions / SQLite / auth data was read; no workflow was triggered; no host / Docker / Nginx / K3s / DB / firewall was restarted; nothing was written to the StockPlatform DB.

## 2026-07-02 - 13:20 Commander-inserted requirements: P0/P1/P2 priority backfill

**Completed**:
- `docs/workplans/2026-07-02-commander-inserted-requirements-priority-ledger.md` now records the explicit requirements inserted after the all-host reboot; it no longer covers only GitHub / Gitea / runner / token requirements.
- Added work items `CIR-P0-RBT-*`, `CIR-P0-GIT-001`, `CIR-P0-CPU-*`, `CIR-P0-CD-001`, `CIR-P1-AI-001`, `CIR-P1-KM-001`, `CIR-P1-WORK-001`, `CIR-P2-OBS-001`, `CIR-P2-UX-001`, among others.
- Explicitly included: 10-minute reboot auto-recovery SLO, reboot detection, 99 Windows / VMware autostart, VM 111/188/120/121/112 autostart, Windows Update no-auto-restart, 502 maintenance fallback, host down/up Telegram alerts, full backup monitoring and alerting, fixed SLA/ETA reporting, version and data freshness for all products, Gitea repo / backup / restore drill, proactive CPU detection and repair for 110/188, noise vs. real-issue correlation, AI controlled repair loop, and SOP/KM/PlayBook capture.
- The fixed execution order for the next round is now: Gitea/CD truth → 110 Stock/Postgres pressure → reboot detection / SLO → 99 VMware / Windows Update → maintenance fallback → backup observability → product freshness/version → 110 controlled drain → product governance → KM/RAG/MCP/LOG.

**Verification**:
- A ledger check confirmed that all newly added key IDs are present: `CIR-P0-RBT-001`, `CIR-P0-RBT-003`, `CIR-P0-RBT-005`, `CIR-P0-RBT-006`, `CIR-P0-RBT-007`, `CIR-P0-RBT-009`, `CIR-P0-GIT-001`, `CIR-P0-CPU-001`, `CIR-P1-AI-001`, `CIR-P1-KM-001`.
- `.gitea/workflows/cd.yaml` and `ops/runner/test_cd_controlled_runtime_profile.py` already include the ledger path, so the controlled-runtime profile is preserved.

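The ledger check in the verification note can be approximated with a few lines of Python. This is a minimal sketch, not the script that was actually run: the function name `missing_ledger_ids` and the ID regex are assumptions for illustration, and only the ID list is taken from the note itself.

```python
import re

# Key IDs copied from the verification note.
REQUIRED_IDS = [
    "CIR-P0-RBT-001", "CIR-P0-RBT-003", "CIR-P0-RBT-005", "CIR-P0-RBT-006",
    "CIR-P0-RBT-007", "CIR-P0-RBT-009", "CIR-P0-GIT-001", "CIR-P0-CPU-001",
    "CIR-P1-AI-001", "CIR-P1-KM-001",
]

# Assumed ID shape: CIR-P<digit>-<LANE>-<3 digits>.
ID_PATTERN = re.compile(r"\bCIR-P\d-[A-Z]+-\d{3}\b")


def missing_ledger_ids(ledger_text, required=REQUIRED_IDS):
    """Return the required IDs that do not appear anywhere in the ledger text."""
    present = set(ID_PATTERN.findall(ledger_text))
    return [work_id for work_id in required if work_id not in present]
```

Passing the text of the ledger file should return an empty list when every key ID is present; any ID that comes back points to a row that still has to be added.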
**Next step**:
- After commit / push to Gitea `main`, read back CD; then continue with the current active P0: read-only evidence / source freshness / query attribution for the 110 Stock/Postgres hot pressure.

## 2026-07-02 - 13:10 Telegram alert receipt and AI controlled readback hardening

**Completed**:

@@ -51,6 +51,31 @@
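| 24 | CIR-P2-002 | P2 | production health readback shows degraded | A SignOz component being down does not block this CD success, but must be routed to the observability lane | `signoz` connection refused observed | Scheduled under Observability P2; no service restart this round |
| 25 | CIR-P3-001 | P3 | Future restoration after the GitHub appeal | Only if the user explicitly requests GitHub restoration in the future; risk assessment and manual confirmation come first | Waiting on external status | Freeze stays in place until confirmed |
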
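## 2026-07-02 Requirements inserted after the all-host reboot

> This section records the explicit requirements the user inserted while following up on the all-host reboot, the Gitea repositories, 110/188 CPU pressure, and SLA accountability. These requirements must no longer be treated as conversational noise; every subsequent work window must consult this section first, then proceed in the listed order.
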
| Order | ID | Priority | User-inserted requirement | Normalized work item | Current status | Next verifiable action |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | CIR-P0-RBT-001 | P0 | "All hosts fully recovered within 10 minutes of a reboot, with automatic detection of every rebooted host" | Build a 99/110/111/112/120/121/188 reboot event detector + 10-minute SLO scorecard + fixed triage order | `reboot_recovery_slo_alerts`, scorecard, and textfile partly exist; a fresh all-host reboot/drill proof is still required | Produce the latest reboot SLO scorecard readback; if no fresh event exists, mark `awaiting_next_reboot_or_approved_drill` and do not claim the 10-minute SLA is proven |
| 2 | CIR-P0-RBT-002 | P0 | "Host reboots were not detected" | Fix host reboot/shutdown/up detection: boot_id / uptime / node exporter / Windows exporter / VMware VM power state must all feed the same event | A Prometheus rule already alerts on gaps, but the 99/VMware/Windows sources are not yet fully closed-loop | Add 99/VMware/Windows probe sources and textfile readback; when 99 is missing, 110/120/121/188 green must not be treated as all-host green |
| 3 | CIR-P0-RBT-003 | P0 | "VMware on 192.168.0.99 must start automatically, and 111/188/120/121/112 inside it must start automatically too" | Windows 99 VMware host autostart + guest VM autostart contract; boot order and readback for VMs 111/188/120/121/112 | Not done; belongs to the Windows / VMware host configuration lane and cannot be substituted by Linux host green | Build a non-secret Windows/VMware autostart checklist and verifier; read-only confirmation of `.vmx` / autostart policy / VMware service state, without reading Windows passwords |
| 4 | CIR-P0-RBT-004 | P0 | "192.168.0.99 must not reboot unannounced because of Windows Update" | Windows Update reboot policy: active hours / no auto-restart / maintenance window / update notification audit | Not done; belongs to the 99 Windows policy lane | Build a Windows Update policy verifier; the next step needs the 99 console or an authorized remote management channel, but must not ask the user to paste a password |
| 5 | CIR-P0-RBT-005 | P0 | "A 502 after a site restart badly hurts the experience; we need a maintenance page, via external cloud or another professional approach" | Public maintenance fallback: Nginx / edge / external static maintenance page / status page / fail-open UX, so that a 502 is never served directly | Not fully implemented; currently a requirement gap | Produce a `public_maintenance_fallback` decision record: risk comparison of DNS / edge / external cloud / local Nginx fallback, starting with a check-mode that does not switch traffic |
| 6 | CIR-P0-RBT-006 | P0 | "Any host shutdown must trigger an immediate Telegram alert, and so must the reboot; think through all other alerts as well" | Alert matrix covering down / shutdown suspected / reboot detected / reboot recovered / SLO missed / backup failed / freshness stale / CPU pressure / Gitea queue | Alertmanager rules and Telegram receipt hardening partly exist; a complete shutdown/up E2E receipt is still missing | Build a Telegram alert matrix + receipt verifier that reads back Alertmanager active/resolved and outbound receipt item by item, without sending test secrets |
| 7 | CIR-P0-RBT-007 | P0 | "None of the backups (hosts, DB, websites, services, packages, tools, logs) have monitoring or alerting" | Backup observability coverage: backup job inventory, last success, freshness, offsite, restore drill, Telegram receipt | backup health exporter / alert rules partly exist; global coverage and restore drill are not all green | Build a backup coverage matrix: host / DB / website / service config / package list / tool scripts / logs, each row with metric, alert, last_success, restore_verifier |
| 8 | CIR-P0-RBT-008 | P0 | "Every reboot investigation is different and nobody knows how long recovery takes; this does not meet the SLA" | Standardized reboot runbook: fixed triage order, ETA, active blocker, remaining seconds, owner lane, next command | scorecard / SOP partly exist; all reports still need a unified format | Make the SLO scorecard always output `current_phase`, `eta_or_wait_reason`, `active_blockers`, `next_safe_action` |
| 9 | CIR-P0-RBT-009 | P0 | "All products and websites must be on the latest version; verify that both version and data are current" | Product freshness/version matrix: source commit, deploy marker, runtime image, public health, data freshness, latest source availability | Partly in progress for AWOOOI / StockPlatform; not yet unified across all products | Build an all-product readback table: product, canonical repo, main SHA, deploy marker, public URL, data freshness, blocked reason |
| 10 | CIR-P0-GIT-001 | P0 | "Are the Gitea repositories all gone? Was Gitea not fully backed up?" | Gitea repository identity + backup proof + restore drill: UI visibility alone is not enough; compare SSH heads, repo path, bundle backup, restore sample | A handoff shows 9 expected repos OK / backup health missing=0; restore drill proof is still required | Add Gitea repo bundle backup readback + sample restore dry-run verifier; deleting repos or changing visibility is forbidden |
| 11 | CIR-P0-CPU-001 | P0 | "CPU load on 110 / 188 stays too high; why is there no monitoring alert and no proactive repair?" | Sustained CPU pressure automation: Alertmanager → controller → evidence → service playbook → verifier → KM writeback | 110 already has `Host110SustainedModeratePressure`, the Gitea playbook, and Stock/Postgres evidence; 188 still needs an equivalent controller/alerts readback | Next, wire in `postgres_hot_query_or_backup_export_playbook` and add the equivalent 188 readback; a single drop in load does not close the item |
| 12 | CIR-P0-CPU-002 | P0 | "Noise interferes with real problems; handle them together" | Alert noise / real issue correlation: backup aggregate noise, CPU pressure, Gitea queue, and Stock freshness must be separated into primary and secondary causes | Partly noted in the SOP; a unified correlation scorecard is still needed | Build an incident correlation readback: primary_blocker, secondary_noise, ignored_noise_reason, evidence_ref |
| 13 | CIR-P0-CD-001 | P0 | "No project can ship a release / I need to see actual results" | Gitea-only CD baseline: every main push must have a visible run, deploy marker, and production readback; GitHub is not a solution | AWOOOI latest main can be pushed, and CD success/deploy marker has been proven multiple times; not all products are green | Connect the product governance matrix to each product's Gitea CD readiness; stop reporting AWOOOI only |
| 14 | CIR-P1-AI-001 | P1 | "Where is the AI expertise? It must find and fix problems proactively" | AI controlled repair loop: detect → classify → candidate → check-mode → controlled apply → post verifier → KM / PlayBook trust | CPU / Gitea / Telegram receipt partly implemented; the global AI loop is not fully connected | Add `candidate_action`, `controlled_apply_allowed`, `post_verifier`, `trust_writeback` to every P0 runbook |
| 15 | CIR-P1-KM-001 | P1 | "Capture the repair process and lessons fully in the SOP, integrated into the current version" | Every P0 repair must update LOGBOOK, SOP, PlayBook, and workplan ledger together; it cannot live only in chat | This ledger, LOGBOOK, and SOP have started to be filled in; an API/UI read model is still needed | Turn this ledger into a read-only API / governance UI row, and add `last_updated` / `evidence_count` |
| 16 | CIR-P1-WORK-001 | P1 | "Make all started, in-progress, and completed work clearly visible" | Work status inventory: Done / In Progress / Blocked / Deferred / Next Action + evidence | This ledger has a first version of Done/In Progress/Blocked; the new P0 items in this section must be included | Update Done/In Progress/Blocked below to list reboot/backup/VMware/maintenance/CPU in full |
| 17 | CIR-P2-OBS-001 | P2 | "Think through which other alerts are needed as well" | Observability coverage expansion: SignOz/Sentry/Langfuse/Harbor/Registry/K3s/DB/backup/freshness/route/TLS alerts | Most rules exist but are scattered; the coverage matrix is incomplete | Build an alert coverage matrix that separates P0 actionable items from P2 observability debt |
| 18 | CIR-P2-UX-001 | P2 | "Evaluate external cloud or other professional approaches for the maintenance page" | Maintenance UX can start with design / decision record; actual DNS/edge cutover requires controlled apply | Not started | Start with a no-write design / rollback / smoke verifier; do not switch DNS directly |

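CIR-P0-RBT-001, CIR-P0-RBT-002, and CIR-P0-RBT-008 describe a reboot detector that compares boot identifiers and a scorecard with fixed output fields. The sketch below shows one way those pieces could fit together. It is illustrative only: `HostSample`, `detect_reboots`, `build_scorecard`, the blocker strings, and the phase names are hypothetical; only the 10-minute target, the host list, the four scorecard field names, and `awaiting_next_reboot_or_approved_drill` come from the table.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

SLO_SECONDS = 600  # 10-minute recovery target from CIR-P0-RBT-001
EXPECTED_HOSTS = ["99", "110", "111", "112", "120", "121", "188"]


@dataclass
class HostSample:
    """One probe reading for a host (hypothetical shape)."""
    boot_id: str                       # e.g. /proc/sys/kernel/random/boot_id on Linux
    recovery_seconds: Optional[float]  # boot-to-healthy time; None while still recovering


def detect_reboots(previous: Dict[str, HostSample],
                   current: Dict[str, HostSample]) -> List[str]:
    """Return hosts whose boot_id changed between two probe rounds."""
    return sorted(
        host for host, sample in current.items()
        if host in previous and previous[host].boot_id != sample.boot_id
    )


def build_scorecard(previous, current, expected_hosts=EXPECTED_HOSTS) -> dict:
    """Summarise reboot recovery using the fixed fields from CIR-P0-RBT-008."""
    # A host without a probe source blocks any "all hosts green" claim.
    blockers = [f"no_probe_source:{h}" for h in expected_hosts if h not in current]
    rebooted = detect_reboots(previous, current)
    for host in rebooted:
        recovery = current[host].recovery_seconds
        if recovery is None:
            blockers.append(f"still_recovering:{host}")
        elif recovery > SLO_SECONDS:
            blockers.append(f"slo_missed:{host}")

    if blockers:
        phase, reason = "blocked", "see active_blockers"
        action = "collect read-only evidence for each blocker"
    elif not rebooted:
        # No fresh reboot event: the SLO can be neither proven nor disproven.
        phase, reason = "no_fresh_event", "awaiting_next_reboot_or_approved_drill"
        action = "wait for the next reboot or an approved drill"
    else:
        phase, reason = "recovered_within_slo", "none"
        action = "archive this scorecard as evidence"

    return {
        "current_phase": phase,
        "eta_or_wait_reason": reason,
        "active_blockers": blockers,
        "next_safe_action": action,
        "rebooted_hosts": rebooted,
    }
```

In a probe round where host 99 has no sample, the scorecard stays `blocked` even if every Linux host recovered in time, which mirrors the rule that partial green must not be reported as all-host green.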
## Current Done / In Progress / Blocked

### Done

@@ -64,6 +89,8 @@
| Production deploy readback matched | `631fc785fa` matched runtime / desired image tag |
| Playwright smoke passed | `5 passed` |
| Baseline at creation time synced to Gitea truth | local `main...origin/main` clean at deploy marker `ab748b1a` |
| 110 Gitea CPU dedicated check-mode playbook | `gitea-queue-hook-backlog-playbook.py` is on main; live readback can output health/version/hooktasks/active Actions |
| 110 CPU evidence / controller routing consistency | live evidence and controller both route Stock/Postgres pressure to `postgres_hot_query_or_backup_export_playbook` first |

### In Progress

@@ -72,6 +99,11 @@
| 110 controlled drain lane | The live verifier still cannot claim ready; staging artifacts / registration metadata / service guardrails must all be green | After adding staging artifacts, run only the readiness verifier |
| All-product source-control governance | 9 expected Gitea repos OK, but the cross-product dev/prod CI baseline is not all green | Use Gitea / Gitea SSH / backup health as source truth |
| KM / PlayBook / RAG / MCP integration | Listed as P1 and no longer overlooked | Define the work item schema and trust writeback fields |
| 10-minute reboot auto-recovery SLA | SLO exporter / alerts partly exist, but a fresh all-host reboot/drill proof is missing | Add the latest scorecard readback; if no event exists, explicitly mark it as waiting for the next reboot or approved drill |
| 99 Windows / VMware autostart | The autostart verifier for the 99 host + VMs 111/188/120/121/112 is not done | Build a non-secret VMware/Windows verifier that reads no passwords |
| 502 maintenance fallback | The external maintenance page / edge fallback decision and implementation are not done | Start with a no-write decision record + smoke verifier |
| Full backup monitoring and alerting coverage | Some exporters/rules exist, but host/DB/site/service/package/tool/log coverage is not fully listed | Build the backup coverage matrix and restore drill verifier |
| Stock/Postgres hot pressure | 110 live is already routed to the Stock/Postgres playbook; hot query / backup export playbook closure is not done | Next, run read-only Stock/Postgres evidence and source freshness / query attribution |

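The backup coverage matrix item lends itself to a mechanical check. The following is a sketch under stated assumptions: the scope keys, the `coverage_gaps` function, the gap labels, and the 24-hour freshness threshold are all hypothetical; the required fields (`metric`, `alert`, `last_success`, `restore_verifier`) follow CIR-P0-RBT-007 in the ledger.

```python
from datetime import datetime, timedelta, timezone

SCOPES = ("host", "db", "website", "service_config",
          "package_list", "tool_scripts", "logs")
REQUIRED_FIELDS = ("metric", "alert", "last_success", "restore_verifier")


def coverage_gaps(matrix, now, max_age=timedelta(hours=24)):
    """List (scope, gap) pairs for missing rows, empty fields, or stale backups."""
    gaps = []
    for scope in SCOPES:
        row = matrix.get(scope)
        if row is None:
            gaps.append((scope, "row_missing"))
            continue
        for name in REQUIRED_FIELDS:
            if not row.get(name):
                gaps.append((scope, f"{name}_missing"))
        last_success = row.get("last_success")
        # Assumed threshold: a backup older than max_age counts as stale.
        if last_success is not None and now - last_success > max_age:
            gaps.append((scope, "stale"))
    return gaps


# Example with placeholder values (not real metric or job names).
now = datetime(2026, 7, 2, 13, 20, tzinfo=timezone.utc)
example = {"db": {"metric": "m", "alert": "a",
                  "last_success": now - timedelta(days=3),
                  "restore_verifier": "v"}}
gaps = coverage_gaps(example, now)
```

Here only the `db` row exists and its last success is three days old, so the result flags `db` as stale and every other scope as a missing row. An empty result would be the precondition for calling backup coverage green.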
### Blocked / Deferred

@@ -80,12 +112,20 @@
| GitHub push / mirror for all products | GitHub account safety freeze; GitHub / gh / API must not be used | Keep stopped / do_not_use; risk assessment required in the future |
| Direct use of sensitive values pasted in chat | secret safety hard gate | Do not record, reprint, or store; use rotate / revoke / hidden prompt instead |
| VMware console / sudo password path | Belongs to the break-glass / local console lane | Open a non-secret SOP only when the control path is unavailable and the target is explicit |
| Windows Update policy apply | Requires a 99 Windows management channel or console; passwords must not be collected | Finish the policy verifier / checklist first, then do a controlled apply |
| GitHub push / mirror / CI | GitHub freeze | Keep only the stopped / do_not_use record; not listed as an executable P0 |

## Fixed execution order for the next round

1. First confirm that Gitea `main` / CD / production readback still match the latest truth.
2. Continue with the current active P0, the 110 Stock/Postgres hot pressure: run the read-only evidence / source freshness / query attribution of `postgres_hot_query_or_backup_export_playbook`.
3. Complete the reboot auto-recovery P0: 99/110/111/112/120/121/188 reboot detection, 10-minute SLO scorecard, Telegram down/up/recovered/SLO missed receipts.
4. Complete the 99 Windows / VMware autostart P0: the 99 host starts VMware automatically and VMs 111/188/120/121/112 start automatically; add the Windows Update no-auto-restart verifier at the same time.
5. Complete the public maintenance fallback P0: avoid serving 502 directly; first finish the external/edge/local fallback decision record and the no-write smoke.
6. Complete the backup observability P0: host / DB / website / service config / package list / tools / logs backup matrix, last success, offsite, restore drill, Telegram receipt.
7. Complete the product freshness/version P0: source SHA / deploy marker / runtime image / public URL / data freshness readback for every product website.
8. Continue building the staging / verifier for the 110 controlled drain lane; do not restore the generic runner.
9. Connect all-product repo identity / backup health / private visibility / dev-prod CI baseline to the product governance matrix.
10. Map the P0/P1 work items in this ledger into KM / PlayBook / RAG / MCP / LOG so that requirements do not live only in chat.
11. Route SignOz degraded to Observability P2; it does not block the Gitea/CD main line.
12. Perform no GitHub operations before the GitHub appeal; after the appeal, if the user explicitly requests it, do a risk assessment first.
