diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index d9952916..64fb71f1 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -1,3 +1,32 @@
+## 2026-06-18 | 110 Runaway Process AIOps: monitoring / alerting / PlayBook convergence
+
+**Background**: The confirmed root cause of the 110 CPU saturation was 5 orphan process groups left behind by the cross-project stockPlatform headless Chrome smoke run, two of which each consumed about 120% CPU; after a targeted `SIGTERM`, `REMAINING_AFTER_TERM=0`. The load that stayed high afterwards came from active Gitea Actions CI build/test jobs, not from orphan Chrome or a Docker/Sentry/Harbor incident. This class of problem cannot stop at manual `top/ps`; it has to be productized into monitoring, alerting, a PlayBook, KM, and gated remediation.
+
+**Completed**:
+- Added `scripts/ops/host-runaway-process-exporter.py`, a read-only exporter that writes `host_runaway_process.prom`, classifies orphan browser smoke, active Gitea Actions, load5/core, and swap ratio, and pins `awoooi_host_runaway_process_remediation_authorized=0`.
+- Added `scripts/ops/host-runaway-process-remediation.py`, dry-run by default; `--apply` must be accompanied by `--confirm-apply`, `--owner-approval-id`, `--maintenance-window-id`, `--evidence-ref`, and `--rule`; it only sends `SIGTERM` and never performs `SIGKILL`, a Docker restart, or a systemd restart.
+- Added `scripts/ops/tests/test_host_runaway_process_exporter.py`, which locks in orphan group classification, BSD / Linux `ps` parsing, ignoring of legitimate / young processes, CI/swap metrics, dry-run behavior, and apply-gate rejection.
+- `ops/monitoring/alerts-unified.yml` gains `host_runaway_process_alerts`: `HostOrphanBrowserSmokeHighCpu`, `HostCiRunnerLoadSaturation`, monitor missing/stale, and the `HostRunawayProcessRemediationUnexpectedlyAuthorized` fuse.
+- `infra/ansible/roles/host-textfile-exporters` and `infra/ansible/playbooks/110-devops.yml` now cover the 110 exporter, the gated remediation helper, cron, an immediate refresh, and metric verification.
+- Added `docs/runbooks/HOST-RUNAWAY-PROCESS-AIOPS-PLAYBOOK.md`, which defines monitoring -> alert -> AI triage packet -> KM / PlayBook evidence -> gated remediation -> post-check / recurrence guard.
+- `docs/runbooks/FULL-STACK-COLD-START-SOP.md` is bumped to v1.26, adding 110 runaway browser / CI load triage to the runner/CD release conditions and the AI auto-remediation gate.
+- 
`docs/runbooks/ANSIBLE-OPERATING-MODEL.md`, `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md`, and `scripts/reboot-recovery/reboot-recovery-readiness-audit.sh` are synced with the new exporter / alert / gated remediation contract.
+
+**Verification**:
+- `/Users/ogt/.pyenv/shims/python3.11 -m pytest scripts/ops/tests/test_host_runaway_process_exporter.py -q`: `6 passed`.
+- `scripts/ops/host-runaway-process-exporter.py --stdout --host 110`: successfully emitted `awoooi_host_runaway_process_monitor_up=1`, the orphan group metrics, the CI active container metric, load5/core, swap ratio, and `remediation_authorized=0`.
+- `PATH="/Users/ogt/.pyenv/shims:$PATH" bash scripts/ops/deploy-alerts.sh --dry-run`: YAML validation passed; alert rules `123`, SLO recording `7`, SLO alerts `11`.
+- `PATH="/Users/ogt/.pyenv/shims:$PATH" bash scripts/reboot-recovery/reboot-recovery-readiness-audit.sh --no-color`: `PASS=197 WARN=2 BLOCKED=0`; the WARNs are the missing local `ansible-playbook` and the live cold-start not having been run.
+- `source-control-owner-response-guard.py`, `security-mirror-progress-guard.py`, `doc-secrets-sanity-check.py`, `git diff --check`: passed.
+
+**Completion sync**:
+- 110 runaway process monitoring / alert / PlayBook / KM contract: `100%`.
+- Repo-side automation mechanism: `100%`.
+- Runtime auto-remediation: `0%`; this is a safety gate, not a missing piece. A real `SIGTERM` requires owner approval, a maintenance window, an evidence ref, a rule, and a post-check.
+- 110 CPU orphan Chrome recurrence guard: `READY`; next time `HostOrphanBrowserSmokeHighCpu` will point to the PlayBook instead of leaving only the generic `HostHighCpuLoad`.
+
+**Boundaries**: This work did not write to any host over SSH, kill any live process, restart Docker/systemd/Nginx, change the firewall/K8s, read secrets, or open the runtime gate. Any live exporter deployment / alert reload must go through the existing Ansible / deploy-alerts maintenance flow and owner readback.
+
 ## 2026-06-18 | P2-406B Receipt Readback Owner Review completed locally

 **Background**: P2-004 already converged dependency / supply-chain drift into a read-only monitoring readback; the Commander requires that every step forward keep the goal and direction in view, so this work ties the daily / weekly / monthly reports, Telegram receipt owner review, the P2-004 drift monitor, and the P2-403J report truth into a single owner review surface, letting the governance page show directly the AI Agent division of labor, cross-review, and the runtime boundaries that remain closed.
diff --git a/docs/runbooks/ANSIBLE-OPERATING-MODEL.md b/docs/runbooks/ANSIBLE-OPERATING-MODEL.md
index 
bc35ee98..7c9365c5 100644
--- a/docs/runbooks/ANSIBLE-OPERATING-MODEL.md
+++ b/docs/runbooks/ANSIBLE-OPERATING-MODEL.md
@@ -1,6 +1,6 @@
 # AWOOOI Ansible Operating Model

-> Last updated: 2026-05-12 (Taipei time)
+> Last updated: 2026-06-18 (Taipei time)
 > Scope: Describes the role Ansible plays in operations, cold-start recovery, monitoring, and deployment safety for 110 / 120 / 121 / 188.

 ## Product architecture positioning
@@ -35,7 +35,7 @@ Git repo
 | 110 Ollama proxy | `110-ollama-proxy.conf.j2` | `/etc/nginx/sites-enabled/110-ollama-proxy.conf` |
 | 110 cold-start monitor | `roles/cold-start-monitor` | `/home/wooo/scripts`, cron, node-exporter textfile |
 | 110 runner guardrails | `roles/runner-guardrails` | `actions.runner.*` systemd drop-ins |
-| 110/188 Docker/systemd/storage/backup textfile exporters | `roles/host-textfile-exporters` | `/home/*/node_exporter_textfiles/docker_stats.prom`, `storage_health.prom`, `backup_health.prom`, 110 `systemd_units.prom` |
+| 110/188 Docker/systemd/storage/backup/runaway-process textfile exporters | `roles/host-textfile-exporters` | `/home/*/node_exporter_textfiles/docker_stats.prom`, `storage_health.prom`, `backup_health.prom`, 110 `systemd_units.prom`, 110 `host_runaway_process.prom` |
 | 110 Sentry backup / integrity drill | `110-devops.yml --tags backup_jobs` | `/backup/scripts/backup-sentry.sh`, `check-backup-integrity.sh`, weekly/monthly cron |
 | Host health description | `110-devops.yml`, `188-ai-web.yml` | Read-only checks and limited host state repair |

@@ -171,6 +171,24 @@ Sentry 資料層備份由 `/backup/scripts/backup-sentry.sh` 負責,納入每
 - Monthly `--mode restore-drill`: dump one small file from each repo with `restic dump latest ` into a 0700 temporary directory to verify the snapshot is readable.
 - Run status is written to `/backup/integrity/check.status` and `/backup/integrity/restore-drill.status`, and `backup-health-textfile-exporter.py` converts it into Prometheus metrics.

+## 110 Runaway Process Classification
+
+Since 2026-06-18, `roles/host-textfile-exporters` also manages the following on 110:
+
+```text
+/home/wooo/scripts/host-runaway-process-exporter.py
+/home/wooo/scripts/host-runaway-process-remediation.py
+/home/wooo/node_exporter_textfiles/host_runaway_process.prom
+```
+
+Every 2 minutes the exporter reads only `ps`, the names of Docker active task containers, `/proc/loadavg`, and
`/proc/meminfo`, in order to distinguish:
+
+- orphan headless Chrome / Chromium / Playwright smoke process groups.
+- legitimate Gitea Actions CI build/test load.
+- load5/core and swap ratio.
+
+`host-runaway-process-remediation.py` is a gated PlayBook helper and is never run by cron; `--apply` requires owner approval, a maintenance window, and an evidence ref. Ansible only deploys and refreshes the read-only exporter; it does not authorize killing processes.
+
 ## Next items to bring under Ansible

 | Priority | Item | Reason |
@@ -179,7 +197,7 @@ Sentry 資料層備份由 `/backup/scripts/backup-sentry.sh` 負責,納入每
 | P0 | Sentry-specific backup and restic integrity drill | `backup_jobs` is already in the 110 playbook; the next step is to accumulate nightly/weekly/monthly success evidence |
 | P0 | 188 nginx HTTPS route ownership | Prevent public tool routes from drifting again after an incident or a sync |
 | P1 | certbot/snap certbot standardization | The current apt certbot/OpenSSL path is fragile; renewal needs a unified path |
-| P1 | 110/188 Docker/systemd/storage/backup textfile exporters | `roles/host-textfile-exporters` exists; the next step is to dry-run/apply on the ops host and confirm `docker_stats.prom` / `storage_health.prom` / `backup_health.prom` / `systemd_units.prom` freshness |
+| P1 | 110/188 Docker/systemd/storage/backup/runaway-process textfile exporters | `roles/host-textfile-exporters` exists; the next step is to dry-run/apply on the ops host and confirm `docker_stats.prom` / `storage_health.prom` / `backup_health.prom` / `systemd_units.prom` / `host_runaway_process.prom` freshness |
 | P1 | node-exporter/cAdvisor caps | Monitoring components must not themselves become a load source |
 | P2 | K3s diagnostic-only host tasks | Only verify containerd/kubelet state; no destructive repair |
 | P2 | 112 Kali inventory only | Record first; no scanning, no repair |
diff --git a/docs/runbooks/FULL-STACK-COLD-START-SOP.md b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
index 774576fe..c90fd8f9 100644
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -1,6 +1,6 @@
 # AWOOOI Full-Stack Cold-Start and Host Reboot SOP

-> Version: v1.25
+> Version: v1.26
 > Last updated: 2026-06-18 Asia/Taipei
 > Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.
@@ -41,6 +41,7 @@ Forbidden declaration: DR complete or runtime/security acceptance. Credential es
 | P2 service / data truth | `100%` | `VERIFIED_FULL_STACK_GREEN_FOR_SERVICE` |
 | P3 docs / automation contracts | `100%` | `DONE_WITH_STALE_JOB_CLASSIFICATION` |
 | 110 host runtime | `fwupd-refresh.timer` intentionally disabled/inactive after non-runtime firmware metadata refresh failed units were classified; `systemctl --failed` returns `0 loaded units listed`; rollback is `sudo systemctl enable --now fwupd-refresh.timer` | `GREEN_WITH_FWUPD_TIMER_DISABLED` |
+| 110 host runaway process guard | `host-runaway-process-exporter.py` / `host-runaway-process-remediation.py` are now part of the 110 Ansible textfile exporter source-of-truth; alerts can distinguish orphan browser smoke from legitimate Gitea Actions CI load; the remediator defaults to dry-run, and `SIGTERM` requires owner approval, a maintenance window, and an evidence ref | `AIOPS_MONITOR_READY_RUNTIME_GATE_0` |
 | 120 reachability | ping OK, SSH OK, boot around `2026-06-14 02:23`, K3s active, node `mon Ready` | `GREEN` |
 | 121 reachability | ping OK, SSH OK, failed units `0` | `GREEN` |
 | 188 host runtime | production services green; host degraded only by `certbot.service` and `snap.certbot.renew.service` | `GREEN_WITH_CERTBOT_DEBT` |
@@ -784,7 +785,33 @@ Only release these after P0/P1/P2 gates are green:
 | 188 | litellm | `/health/liveliness` good and provider route verified |
 | 110 | Sentry Snuba consumers | ClickHouse healthy and Kafka backlog decreasing |
 | 110 | Sentry uptime-checker | Sentry web/DB healthy |
-| 110 | runners | all previous gates green and load/core < 1.0 for 15 minutes |
+| 110 | runners | all previous gates green, `host_runaway_process.prom` fresh, orphan browser group count `0`, and load/core < 1.0 for 15 minutes unless the remaining load is explicitly attributed to active CI |
+
+### 11.1 110 Runaway Browser / CI Load Triage
+
+The 2026-06-18 CPU saturation incident on 110 showed that the generic `HostHighCpuLoad` alert can only say the host is busy; it cannot tell the operator whether to kill a process. 110 must now use the dedicated host runaway process metrics as the first layer of triage:
+
+```bash
+grep -E 
'awoooi_host_runaway_|awoooi_host_gitea_actions_|awoooi_host_load5_per_core|awoooi_host_swap_used_ratio' \
+  /home/wooo/node_exporter_textfiles/host_runaway_process.prom
+```
+
+How to read the output:
+
+| Metric combination | Verdict | Action |
+|----------|------|------|
+| `awoooi_host_runaway_browser_orphan_group_count > 0` and CPU `>= 100` | orphan headless browser / smoke process group | Run `host-runaway-process-remediation.py` in dry-run; a gated `SIGTERM` is allowed only after manual confirmation |
+| orphan count `0` and `awoooi_host_gitea_actions_active_container_count > 0` | legitimate CI build/test load | Watch the Gitea Actions queue / workflow timeout; do not kill processes |
+| `awoooi_host_runaway_process_monitor_up` missing or stale | monitoring blind spot | Fix cron / textfile collector / Ansible role; do not claim AI Ops observability |
+| `awoooi_host_runaway_process_remediation_authorized > 0` | the monitor has been mistakenly turned into an executor | Roll back immediately; runtime remediation must go only through the gated helper |
+
+Official PlayBook:
+
+```text
+docs/runbooks/HOST-RUNAWAY-PROCESS-AIOPS-PLAYBOOK.md
+```
+
+This PlayBook does not replace the Docker / Sentry / Harbor / K3s / backup SOPs. It only handles the classification of orphan browser smoke versus CI load, so that high CPU does not lead to a mistaken Docker restart or to killing a legitimate build.

 ---

@@ -800,7 +827,7 @@ These are release gates after the first cold-start recovery pass:
 | 110 host | Harbor `/v2/` 200/401, Gitea 200/302, Prometheus ready, Alertmanager healthy, Sentry HTTP 200/302/400, no ClickHouse/Kafka restart loop |
 | K3s | 120/121 nodes Ready, VIP `192.168.0.125` present, AWOOOI API 2xx/3xx, Web 2xx/3xx |
 | Public routes | `https://awoooi.wooo.work/api/v1/health` 2xx/3xx, `https://mo.wooo.work/health` 2xx/3xx |
-| Guardrails | Docker/systemd textfile exporters fresh, runner `CPUQuota=200%`, `MemoryMax=2G`, `WatchdogUSec=0` |
+| Guardrails | Docker/systemd/storage/backup/runaway-process textfile exporters fresh, runner `CPUQuota=200%`, `MemoryMax=2G`, `WatchdogUSec=0` |
 | Schedules | cron active on 110/188/120/121; K8s CronJobs unsuspended; no current failed Jobs; 188 backup success `< 25h` |
 | Backlog | ClickHouse merges and Kafka/Snuba lag trending down, not increasing for two consecutive checks |

@@ -812,6 +839,7 @@ 
AI auto-repair can move from observe-only to limited execution only after:

 - Prometheus rules are loaded.
 - docker/systemd textfile exporter files are fresh.
+- runaway process textfile exporter is fresh and `remediation_authorized=0`.
 - blackbox probes have stable results.
 - cron/CronJob schedule checks are green.
 - AWOOOI API `/api/v1/health` passes.
@@ -827,6 +855,7 @@ Until then:
 - require human approval for remediation
 - no DB/ClickHouse/Harbor/Sentry destructive action
 - no generic restart action against stateful services
+- no process kill unless `host-runaway-process-remediation.py` has dry-run evidence plus owner approval, maintenance window, and evidence ref

 ---

@@ -1596,6 +1625,7 @@ All must be true:
 - Runners are guarded and released last.
 - AI auto-remediation is not in full execution mode until all gates are green.
 - 110 cold-start textfile monitor is fresh, and Prometheus has the cold-start alert rules loaded.
+- 110 runaway process textfile monitor is fresh, and Prometheus has `HostOrphanBrowserSmokeHighCpu` plus CI load classification rules loaded.
 - 110 global `/home/wooo/.ssh/known_hosts` still contains verified 120 / 188 entries after any CD run; deploy jobs use `/home/wooo/.ssh/deploy_known_hosts` only.

 ### 15.1 Declarable status
diff --git a/docs/runbooks/HOST-RUNAWAY-PROCESS-AIOPS-PLAYBOOK.md b/docs/runbooks/HOST-RUNAWAY-PROCESS-AIOPS-PLAYBOOK.md
new file mode 100644
index 00000000..d46be540
--- /dev/null
+++ b/docs/runbooks/HOST-RUNAWAY-PROCESS-AIOPS-PLAYBOOK.md
@@ -0,0 +1,153 @@
+# Host Runaway Process AIOps PlayBook
+
+> Last updated: 2026-06-18 Asia/Taipei
+> Scope: 110 host CPU saturation, orphan Chrome / Playwright smoke, and Gitea Actions CI load triage.
+
+---
+
+## 1. 
Goal
+
+This PlayBook turns the 2026-06-18 CPU saturation incident on 110 into a closed AI Ops loop:
+
+```text
+read-only exporter -> Prometheus alert -> AI triage packet -> KM / PlayBook evidence -> gated remediation -> post-check / recurrence guard
+```
+
+The aim is not to address the generic symptom "CPU is high", but to tell the following cases apart precisely:
+
+| Type | How to identify | Handling |
+|------|------|------|
+| orphan browser smoke | headless Chrome / Chromium / Playwright process group that has lived too long, has PPID=1 or a missing group leader, and has excessive total CPU | Go through the dry-run remediation packet; `SIGTERM` may be sent after human approval |
+| legitimate CI load | Gitea Actions task containers are running and there is no orphan browser metric | Watch queue / timeout; do not kill by mistake |
+| Docker / Sentry / Harbor incident | container restart, port down, journal error, cold-start gate blocked | Follow each service's own SOP; do not use this PlayBook to kill processes |
+| swap full but not thrashing | swap ratio is high but `vmstat` / load classification shows no live thrash | Do not clear swap manually; reduce the high-CPU source first |
+
+---
+
+## 2. Metrics and Alerts
+
+On 110 the Ansible `host-textfile-exporters` role deploys:
+
+```text
+/home/wooo/scripts/host-runaway-process-exporter.py
+/home/wooo/scripts/host-runaway-process-remediation.py
+/home/wooo/node_exporter_textfiles/host_runaway_process.prom
+```
+
+Core metrics:
+
+| Metric | Meaning |
+|--------|------|
+| `awoooi_host_runaway_process_monitor_up{host="110"}` | whether the exporter is emitting output normally |
+| `awoooi_host_runaway_browser_orphan_group_count{host="110",rule=...}` | number of orphan browser process groups matching the rule |
+| `awoooi_host_runaway_browser_orphan_cpu_percent{host="110",rule=...}` | total CPU of the orphan groups |
+| `awoooi_host_gitea_actions_active_container_count{host="110"}` | currently active Gitea Actions task containers |
+| `awoooi_host_load5_per_core{host="110"}` | load5 / CPU cores |
+| `awoooi_host_swap_used_ratio{host="110"}` | swap usage ratio |
+| `awoooi_host_runaway_process_remediation_authorized{host="110"}` | must always be `0`; the exporter is not an executor |
+
+Alerts:
+
+| Alert | Condition | Action |
+|-------|------|------|
+| `HostOrphanBrowserSmokeHighCpu` | orphan browser group `> 0` and CPU `>= 100%` sustained for 10 minutes | Generate the dry-run remediation packet; confirm owner / maintenance window / evidence |
+| `HostCiRunnerLoadSaturation` | load5/core `> 1.0` and active Gitea Actions `> 0` | Mark as short-term CI load, check the runner 
queue, do not kill directly |
+| `HostRunawayProcessMonitorMissing` / `Stale` | exporter missing or not updated for more than 10 minutes | Fix exporter / cron / textfile collector |
+| `HostRunawayProcessRemediationUnexpectedlyAuthorized` | `remediation_authorized > 0` | Roll back immediately; turning the monitor into an executor is forbidden |
+
+---
+
+## 3. Required AI Triager Checks
+
+On receiving `HostOrphanBrowserSmokeHighCpu`, the AI / operator must first produce a dry-run:
+
+```bash
+python3 scripts/ops/host-runaway-process-remediation.py \
+  --rule stockplatform_headless_smoke \
+  --min-age-seconds 1800 \
+  --min-cpu-percent 50
+```
+
+The dry-run must check:
+
+1. `candidate_count > 0`.
+2. `orphan_reason` is `ppid_1` or `missing_group_leader`.
+3. `oldest_age_seconds` exceeds the PlayBook threshold.
+4. The candidate process group does not belong to an `active Gitea Actions` job that is still legitimately running.
+5. It is not the Docker daemon, Sentry, Harbor, PostgreSQL, ClickHouse, K3s, or a backup service itself.
+6. An owner / maintenance window / evidence ref already exists.
+
+If only `HostCiRunnerLoadSaturation` fires and the orphan group count is `0`, the default verdict is "legitimate short-term CI load" and no automatic remediation is allowed.
+
+---
+
+## 4. Gated Remediation
+
+Actually sending `SIGTERM` requires all three gates (owner approval, maintenance window, and evidence ref), together with `--confirm-apply` and `--rule`:
+
+```bash
+python3 scripts/ops/host-runaway-process-remediation.py \
+  --apply \
+  --confirm-apply \
+  --rule stockplatform_headless_smoke \
+  --owner-approval-id OWNER-APPROVAL-REDACTED \
+  --maintenance-window-id MW-REDACTED \
+  --evidence-ref INC-REDACTED \
+  --wait-seconds 5
+```
+
+Prohibited:
+
+- Do not default to `SIGKILL`.
+- Do not run `systemctl restart docker` just because CPU is high.
+- Do not restart Sentry / Harbor / Gitea / Nginx.
+- Do not change firewall / iptables / NetworkPolicy.
+- Do not read or print any secret value, token, hash, or prefix / suffix.
+- Do not treat a route returning 200, a container being up, or the UI being visible as proof that remediation is complete.
+
+Remediation is complete when:
+
+```text
+signaled_process_group_count > 0
+remaining_after_wait = []
+awoooi_host_runaway_browser_orphan_group_count == 0
+load5/core starts to fall or remains explainable
+if active Gitea Actions remain, the alert is downgraded to CI load rather than orphan smoke
+```
+
+---
+
+## 5. 
KM / PlayBook Write-Back Contract
+
+Every trigger must leave behind:
+
+| Asset | Required fields |
+|------|----------|
+| Incident evidence | alert name, host, rule, pgid count, cpu percent, oldest age, active CI count, swap ratio |
+| PlayBook run | dry-run payload, owner approval id, maintenance window id, evidence ref, actual signal summary |
+| KM entry | root-cause classification, misjudgment safeguards, remediation result, recurrence guard |
+| Verifier | post-check metrics, load trend, orphan group count, runner queue state |
+| Work item | if owner / evidence / maintenance window is missing, create a follow-up item; do not falsely raise the runtime gate |
+
+Product-level conclusions must be presented separately:
+
+```text
+monitoring_ready=true
+alert_ready=true
+playbook_ready=true
+km_writeback_required=true
+runtime_remediation_authorized=false unless gated apply is executed
+```
+
+---
+
+## 6. Relationship to the Reboot SOP
+
+After a 110 reboot, runner / CD / high-load batch are released last. If service health is green but load stays high:
+
+1. Read `host_runaway_process.prom` first.
+2. Orphan browser metrics red: follow this PlayBook.
+3. Active CI metrics red but orphan is 0: wait / drain / workflow timeout; do not kill.
+4. Docker / systemd / storage / backup metrics red: go back to the corresponding section of `FULL-STACK-COLD-START-SOP.md`.
+
+This PlayBook is the host-CPU-specific closed loop of the AI automation product; it does not replace the cold-start scorecard, nor does it lift the credential escrow / DR gate.
diff --git a/docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md b/docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md
index d8e07db9..0f15290e 100644
--- a/docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md
+++ b/docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md
@@ -5074,3 +5074,17 @@ Trigger commit `f5cd37b7` 與 deploy marker `0ba92357` 已把 governance UI 的
 - Local verification: JSON schema / snapshot parse passed; `python3 -m py_compile apps/api/src/services/ai_agent_receipt_readback_owner_review.py` passed; target service / API tests `9 passed`; zh-TW / en i18n JSON parse passed. Web typecheck was not run because the clean worktree lacks `node_modules`; no packages were installed and no lockfile was written this round.

 **Verdict:** P2-406B is receipt readback owner review and governance visualization; it is not a Telegram send, Gateway queue write, Bot API call, receipt production write, AI analysis runtime, medium/low-risk auto
execution, production optimization, secret read, paid API, host write, kubectl action, destructive operation, or OpenClaw replacement. The next step is `P2-407`: a no-write runtime for automated AI report analysis that only produces committed snapshots / drafts and an actionability score, and must not actually send or write to production.
+
+### 2026-06-18 14:20 (Taipei) - §8 / Host CPU AIOps - Added 110 runaway process monitoring / alerting / PlayBook / gated remediation
+
+**Trigger**: The 110 CPU saturation was confirmed to be 5 orphan process groups left behind by the cross-project stockPlatform headless Chrome smoke run; after a targeted SIGTERM, `REMAINING_AFTER_TERM=0`. The load that stayed high afterwards came from AWOOOI / VibeWork / 2026 World Cup Gitea Actions build/test jobs. This shows that the generic `HostHighCpuLoad` alert is not enough to support an AI automation product; it must be possible to separate orphan processes, legitimate CI load, and Docker/Sentry/Harbor incidents.
+
+**Progress made:**
+- Added `scripts/ops/host-runaway-process-exporter.py`, a read-only exporter that reports orphan browser process groups, active Gitea Actions, load5/core, swap ratio, and `remediation_authorized=0`.
+- Added `scripts/ops/host-runaway-process-remediation.py`, dry-run by default; `--apply` requires owner approval, maintenance window, evidence ref, rule, and the confirm gate; it only sends SIGTERM and does not default to SIGKILL.
+- `ops/monitoring/alerts-unified.yml` gains `host_runaway_process_alerts`, covering orphan browser smoke critical, CI load saturation warning, monitor missing/stale, and the remediation authorization fuse.
+- `infra/ansible/roles/host-textfile-exporters` and `110-devops.yml` now cover the exporter, gated helper, cron, immediate refresh, and metric verification.
+- Added `docs/runbooks/HOST-RUNAWAY-PROCESS-AIOPS-PLAYBOOK.md`, and synced SOP v1.26, the Ansible operating model, the reboot recovery workplan, LOGBOOK, and the readiness audit.
+- Added pytest coverage that locks in orphan classification, Linux / BSD `ps` parsing, ignoring of legitimate / young processes, CI/swap metrics, dry-run behavior, and apply-gate rejection; the readiness audit re-run with pyenv Python returned `BLOCKED=0`.
+
+**Verdict:** This is the observe -> classify -> alert -> PlayBook -> KM contract -> gated remediation closed loop for host CPU runaway, not authorization for automatic runtime kills. AI may automatically diagnose, alert, and produce the dry-run remediation packet and the KM/PlayBook write-back request; actual process termination still requires owner approval, a maintenance window, an evidence ref, and a post-check. Docker restart, systemd restart, Nginx reload, firewall change, secret read, host write, and production write all remain prohibited.
diff --git 
a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
index 7d99ca3d..90a63065 100644
--- a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
+++ b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
@@ -15,7 +15,7 @@
 | P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-14 18:15 readback shows 120 is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, and cold-start P0/P1 checks are green. |
 | P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 92% | 2026-06-15 03:11 `backup-status` shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-15 02:40:13`. Offsite / escrow report shows `SCRIPT_MISSING_COUNT=0`, `OFFSITE_CONFIGURED=1`, `RCLONE_CONFIGURED=1`, `ESCROW_MISSING_COUNT=5`. Owner request package is ready; actual marker write remains blocked on real non-secret evidence IDs. |
 | P2 service / data truth | VERIFIED_FULL_STACK_GREEN_FOR_SERVICE | 100% | 2026-06-18 13:43 cold-start verifies public route/TLS, API/Web route, momo health and current-month parity `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`, backup exporters, schedules, K3s node readiness, VIP, and 110 / 188 runtime health. K8s active failed Job count is `0`, bad pods are `0`, and cold-start returns `PASS=84 WARN=0 BLOCKED=0`. 
|
-| P3 docs / automation contracts | DONE_WITH_STALE_JOB_CLASSIFICATION | 100% | Workplan, SOP v1.25, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, and 2026-06-18 live readback are updated. Repo-side `reboot-recovery-readiness-audit.sh --no-color` returned `PASS=187 WARN=1 BLOCKED=0`; live cold-start returned `PASS=84 WARN=0 BLOCKED=0`. 
|
+| P3 docs / automation contracts | DONE_WITH_RUNAWAY_PROCESS_AIOPS | 100% | Workplan, SOP v1.26, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, and 2026-06-18 live readback are updated. Repo-side readiness audit now also checks runaway process exporter / remediation helper / alert group; live cold-start remains `PASS=84 WARN=0 BLOCKED=0` from the latest service readiness readback. |

 Full cold-start service readiness may be declared green for the latest verified evidence set. As of 2026-06-18 13:43, services are green with `WARN=0` and `BLOCKED=0`; the retained stale `km-vectorize` failed Job remains historical evidence only. Do not declare DR scorecard complete while credential escrow evidence remains blocked.

@@ -175,7 +175,7 @@ Next:
 | P3-005 | DONE | 100 | Update cold-start SOP | SOP now includes start, shutdown, reboot, record, comparison, and 120 blocker handling. | Increment SOP version after each process change. | SOP has controlled power-operation sections and ledger template. |
 | P3-006 | DONE | 100 | Update backup status | Backup status now reflects current cron, rclone latest-only, failure-only alert posture, and escrow blocker. 
| Refresh after 120 backup rerun. | Backup status no longer claims noisy success Telegram notifications. |
 | P3-007 | DONE | 100 | Harden Gitea backup stale dump handling | 2026-06-05 manual Gitea backup failed because the container retained `/tmp/gitea-dump.zip` from the 02:00 failure. `scripts/backup/backup-gitea.sh` now renames stale container dump files to timestamped evidence before running a new dump, and the live 110 script is updated. | Watch the next 02:00 Gitea backup. | `bash -n` passes locally and on 110; manual Gitea backup completed after stale evidence rename. |
-| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.25 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, repo-side readiness audit blocker closure, stale-vs-active K8s failed Job classification, 2026-06-18 live cold-start GREEN readback, post-reboot / post-CD recovery anchors, AA/AS determination, workload distribution determination, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, and allowed declaration wording. | Use v1.25 for the next reboot record, then compare actual timing, Plan B trigger, degraded level, failed/stale/active Job counters, and blockers against §1.4 plus §14.8 through §14.25. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow checks. 
| SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN_FOR_SERVICE`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN_FOR_SERVICE`, and `B5_DR_COMPLETE`; repo-side `reboot-recovery-readiness-audit.sh --no-color` returns `PASS=187 WARN=1 BLOCKED=0`, and live cold-start returns `PASS=84 WARN=0 BLOCKED=0`. |
+| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.26 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, repo-side readiness audit blocker closure, stale-vs-active K8s failed Job classification, 2026-06-18 live cold-start GREEN readback, post-reboot / post-CD recovery anchors, AA/AS determination, workload distribution determination, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, 110 runaway browser / CI load triage PlayBook, and allowed declaration wording. | Use v1.26 for the next reboot record, then compare actual timing, Plan B trigger, degraded level, failed/stale/active Job counters, runaway-process metrics, CI load attribution, and blockers against §1.4 plus §11.1 / §14.8 through §14.25. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow / runaway-process checks. 
| SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN_FOR_SERVICE`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN_FOR_SERVICE`, and `B5_DR_COMPLETE`; repo-side readiness audit checks runaway process exporter / alerts / gated remediation helper, and live cold-start returns `PASS=84 WARN=0 BLOCKED=0`. |
 | P3-009 | DONE | 100 | Assess 120/121 AA/AS role and host load balancing | 2026-06-12 15:19 live check confirms 120 and 121 are both `Ready control-plane`, `k3s active`, `k3s-agent inactive`, with no taints; however most AWOOOI / ArgoCD / Velero workload remains on 121 after 120 fsck recovery. New assessment defines control-plane AA vs workload AA, migration candidates from 110/188, and stateful migration blockers. | After P0 backup/offsite/cold-start green, implement topology spread for AWOOOI API/Web before moving additional services. | `docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md` exists; SOP v1.6 links AA/AS and load-balancing checks; migration implementation remains explicitly `0%`. |
 | P3-010 | DONE | 100 | Update workload balancing docs with 2026-06-13 live truth | Host role assessment, workplan, SOP, backup status, and LOGBOOK are refreshed with current cold-start, backup, 188 certbot degraded, ArgoCD `km-vectorize` degraded, Gitea main `acaae999`, ArgoCD sync, and final pod placement evidence. | Keep updating this file after the next reboot or deploy. | Docs separate service-green status from DR escrow, workload rollout, and non-service governance debt. |
 | P3-011 | DONE | 100 | Record `km-vectorize` remediation status | LOGBOOK, this workplan, and SOP now state the schedule/label fix, ArgoCD sync evidence, the invalid manual Job boundary, and the 90% waiting-for-next-schedule gate. 
| After next 03:00 run, update this row and the top verdict with `lastSuccessfulTime` / ArgoCD health evidence. | No document claims ArgoCD green before official CronJob success evidence exists. |
@@ -214,6 +214,12 @@ Do not run `truncate`, whole DB restore, force-push, DROP, or online root filesy
 ## 9. Progress Updates

 ```text
+2026-06-18 15:10 Asia/Taipei
+Phase: P3 AI Ops runaway process automation
+Before: 110 CPU saturation could only be judged by manual `ps/top`; the generic `HostHighCpuLoad` alert could not distinguish cross-project orphan Chrome smoke from legitimate Gitea Actions CI load.
+After: Added the read-only `host-runaway-process-exporter.py`, the gated `host-runaway-process-remediation.py`, the Prometheus `host_runaway_process_alerts` group, the Ansible textfile exporter source-of-truth, SOP v1.26, and `HOST-RUNAWAY-PROCESS-AIOPS-PLAYBOOK.md`. The exporter exposes orphan browser, active CI, load/core, swap ratio, and `remediation_authorized=0`; the remediator defaults to dry-run, and `SIGTERM` requires owner approval, a maintenance window, and an evidence ref.
+Completion: monitoring / alert / PlayBook / KM contract 100%; runtime auto-remediation remains gated at 0 until a real owner-approved apply is executed.
+
 2026-06-18 13:43 Asia/Taipei
 Phase: P1/P2/P3 live readback
 Before: live cold-start was `PASS=83 WARN=1 BLOCKED=0`, result `DEGRADED`, because retained stale `km-vectorize-29689620` failed Job evidence was still counted as a service warning. 
diff --git a/infra/ansible/playbooks/110-devops.yml b/infra/ansible/playbooks/110-devops.yml
index df84e4a7..0ebab73a 100644
--- a/infra/ansible/playbooks/110-devops.yml
+++ b/infra/ansible/playbooks/110-devops.yml
@@ -29,6 +29,7 @@
   vars:
     host_textfile_user: wooo
     host_textfile_host_label: "110"
+    host_textfile_manage_runaway_process: true
     host_textfile_manage_systemd_units: true
     host_textfile_systemd_unit_glob: "actions.runner.*.service"
     host_textfile_systemd_units:
diff --git a/infra/ansible/roles/host-textfile-exporters/defaults/main.yml b/infra/ansible/roles/host-textfile-exporters/defaults/main.yml
index 1947ea1b..327a7da5 100644
--- a/infra/ansible/roles/host-textfile-exporters/defaults/main.yml
+++ b/infra/ansible/roles/host-textfile-exporters/defaults/main.yml
@@ -7,13 +7,19 @@ host_textfile_docker_stats_src: "{{ playbook_dir }}/../../../scripts/ops/docker-
 host_textfile_systemd_units_src: "{{ playbook_dir }}/../../../scripts/ops/systemd-units-textfile-exporter.py"
 host_textfile_storage_health_src: "{{ playbook_dir }}/../../../scripts/ops/storage-health-textfile-exporter.py"
 host_textfile_backup_health_src: "{{ playbook_dir }}/../../../scripts/ops/backup-health-textfile-exporter.py"
+host_textfile_runaway_process_src: "{{ playbook_dir }}/../../../scripts/ops/host-runaway-process-exporter.py"
+host_textfile_runaway_process_remediation_src: "{{ playbook_dir }}/../../../scripts/ops/host-runaway-process-remediation.py"
 host_textfile_docker_cron_minute: "*"
 host_textfile_systemd_cron_minute: "*"
 host_textfile_storage_cron_minute: "*"
 host_textfile_backup_cron_minute: "*/10"
+host_textfile_runaway_process_cron_minute: "*/2"
 host_textfile_manage_docker_stats: true
 host_textfile_manage_systemd_units: false
 host_textfile_manage_storage_health: true
 host_textfile_manage_backup_health: true
+host_textfile_manage_runaway_process: false
+host_textfile_runaway_process_min_age_seconds: 1800
+host_textfile_runaway_process_min_cpu_percent: 50
 host_textfile_systemd_unit_glob: ""
 host_textfile_systemd_units: []
diff --git a/infra/ansible/roles/host-textfile-exporters/tasks/main.yml b/infra/ansible/roles/host-textfile-exporters/tasks/main.yml
index 09c110a4..404de90c 100644
--- a/infra/ansible/roles/host-textfile-exporters/tasks/main.yml
+++ b/infra/ansible/roles/host-textfile-exporters/tasks/main.yml
@@ -161,6 +161,69 @@
     - not ansible_check_mode
   tags: textfile_exporters
 
+- name: "host textfile exporters | install runaway process exporter"
+  ansible.builtin.copy:
+    src: "{{ host_textfile_runaway_process_src }}"
+    dest: "{{ host_textfile_script_dir }}/host-runaway-process-exporter.py"
+    owner: "{{ host_textfile_user }}"
+    group: "{{ host_textfile_user }}"
+    mode: "0755"
+  when: host_textfile_manage_runaway_process
+  tags: textfile_exporters
+
+- name: "host textfile exporters | install runaway process gated remediation helper"
+  ansible.builtin.copy:
+    src: "{{ host_textfile_runaway_process_remediation_src }}"
+    dest: "{{ host_textfile_script_dir }}/host-runaway-process-remediation.py"
+    owner: "{{ host_textfile_user }}"
+    group: "{{ host_textfile_user }}"
+    mode: "0755"
+  when: host_textfile_manage_runaway_process
+  tags: textfile_exporters
+
+- name: "host textfile exporters | install runaway process cron"
+  ansible.builtin.cron:
+    name: "AWOOOI runaway process textfile exporter"
+    user: "{{ host_textfile_user }}"
+    minute: "{{ host_textfile_runaway_process_cron_minute }}"
+    job: >-
+      PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
+      AIOPS_HOST_LABEL={{ host_textfile_host_label }}
+      NODE_EXPORTER_TEXTFILE_DIR={{ host_textfile_dir }}
+      AIOPS_RUNAWAY_PROCESS_MIN_AGE_SECONDS={{ host_textfile_runaway_process_min_age_seconds }}
+      AIOPS_RUNAWAY_PROCESS_MIN_CPU_PERCENT={{ host_textfile_runaway_process_min_cpu_percent }}
+      {{ host_textfile_script_dir }}/host-runaway-process-exporter.py
+      >/tmp/awoooi-host-runaway-process-exporter.cron.log 2>&1
+  when: host_textfile_manage_runaway_process
+  tags: textfile_exporters
+
+- name: "host textfile exporters | refresh runaway process metrics immediately"
+  ansible.builtin.command:
+    cmd: "{{ host_textfile_script_dir }}/host-runaway-process-exporter.py"
+  environment:
+    AIOPS_HOST_LABEL: "{{ host_textfile_host_label }}"
+    NODE_EXPORTER_TEXTFILE_DIR: "{{ host_textfile_dir }}"
+    AIOPS_RUNAWAY_PROCESS_MIN_AGE_SECONDS: "{{ host_textfile_runaway_process_min_age_seconds }}"
+    AIOPS_RUNAWAY_PROCESS_MIN_CPU_PERCENT: "{{ host_textfile_runaway_process_min_cpu_percent }}"
+  become: true
+  become_user: "{{ host_textfile_user }}"
+  changed_when: false
+  when:
+    - host_textfile_manage_runaway_process
+    - not ansible_check_mode
+  tags: textfile_exporters
+
+- name: "host textfile exporters | verify runaway process metric exists"
+  ansible.builtin.command:
+    cmd: "grep -q '^awoooi_host_runaway_process_monitor_up{' {{ host_textfile_dir }}/host_runaway_process.prom"
+  become: true
+  become_user: "{{ host_textfile_user }}"
+  changed_when: false
+  when:
+    - host_textfile_manage_runaway_process
+    - not ansible_check_mode
+  tags: textfile_exporters
+
 - name: "host textfile exporters | probe systemd units"
   ansible.builtin.shell: |
     set -o pipefail
diff --git a/ops/monitoring/alerts-unified.yml b/ops/monitoring/alerts-unified.yml
index 9521dec1..66294877 100644
--- a/ops/monitoring/alerts-unified.yml
+++ b/ops/monitoring/alerts-unified.yml
@@ -133,6 +133,106 @@ groups:
           description: "Disk usage exceeds 85%"
           auto_repair_action: "ssh {{ $labels.instance }} 'echo \"=== CPU TOP ===\"; ps aux --sort=-%cpu | head -15; echo \"=== MEMORY ===\"; free -h; echo \"=== DISK ===\"; df -h; echo \"=== LOAD ===\"; uptime'"
 
+  # =========================================================================
+  # Host runaway process / CI load classification
+  # =========================================================================
+  - name: host_runaway_process_alerts
+    rules:
+      - alert: HostRunawayProcessMonitorMissing
+        expr: absent(awoooi_host_runaway_process_monitor_up{host="110",mode="read_only"})
+        for: 15m
+        labels:
+          severity: warning
+          layer: systemd-110
+          component: host-runaway-process-monitor
+          host: "110"
+          team: ops
+          alert_category: host_resource
+          notification_type: TYPE-1
+          auto_repair: "false"
+        annotations:
+          summary: "110 runaway process textfile metric missing"
+          description: "110 is not emitting host_runaway_process.prom; orphan Chrome / Playwright smoke and CI load classification are currently unobservable."
+          runbook: "Use Ansible `110-devops.yml --tags textfile_exporters` or manually deploy scripts/ops/host-runaway-process-exporter.py, then confirm /home/wooo/node_exporter_textfiles/host_runaway_process.prom"
+
+      - alert: HostRunawayProcessMonitorStale
+        expr: time() - awoooi_host_runaway_process_last_run_timestamp{host="110"} > 600
+        for: 10m
+        labels:
+          severity: warning
+          layer: systemd-110
+          component: host-runaway-process-monitor
+          host: "110"
+          team: ops
+          alert_category: host_resource
+          notification_type: TYPE-1
+          auto_repair: "false"
+        annotations:
+          summary: "110 runaway process monitor stale"
+          description: "The host runaway process exporter has not updated for more than 10 minutes; under CPU saturation, orphan smoke cannot be automatically distinguished from legitimate CI."
+          runbook: "SSH to 110 and check crontab, /tmp/awoooi-host-runaway-process-exporter.cron.log, and the node-exporter textfile collector."
+
+      - alert: HostOrphanBrowserSmokeHighCpu
+        expr: |
+          (awoooi_host_runaway_browser_orphan_group_count{host="110"} > 0)
+          and on(host, rule)
+          (awoooi_host_runaway_browser_orphan_cpu_percent{host="110"} >= 100)
+        for: 10m
+        labels:
+          severity: critical
+          layer: systemd-110
+          component: host-runaway-process
+          host: "110"
+          team: ops
+          alert_category: host_resource
+          notification_type: TYPE-3
+          auto_repair: "false"
+          mcp_provider: "ssh_host"
+          host_type: "bare_metal"
+        annotations:
+          summary: "110 orphan browser smoke process group CPU too high"
+          description: "Detected {{ $labels.rule }} orphan process group with total CPU >= 100% for 10 minutes. This is usually a leftover cross-project headless Chrome / Playwright smoke run, not a Docker/Sentry/Harbor incident."
+          runbook: "First run `scripts/ops/host-runaway-process-remediation.py --rule {{ $labels.rule }}` to produce a dry-run; only after confirming active Gitea Actions, owner, maintenance window, and evidence ref may --apply --confirm-apply be used to send SIGTERM. Default SIGKILL, Docker restart, systemctl restart, and firewall changes are forbidden."
+
+      - alert: HostRunawayProcessRemediationUnexpectedlyAuthorized
+        expr: awoooi_host_runaway_process_remediation_authorized{host="110"} > 0
+        for: 1m
+        labels:
+          severity: critical
+          layer: systemd-110
+          component: host-runaway-process
+          host: "110"
+          team: ops
+          alert_category: host_resource
+          notification_type: TYPE-3
+          auto_repair: "false"
+        annotations:
+          summary: "110 runaway process monitor exposed runtime remediation authorization"
+          description: "The host-runaway-process exporter must always stay read-only; remediation_authorized > 0 means someone turned the monitor into an executor or wired up the runtime gate by mistake."
+          runbook: "Roll back the exporter immediately and check the Git diff, cron, Ansible role, and /home/wooo/scripts/host-runaway-process-exporter.py. Actual remediation may only be run by the gated remediation helper after human approval."
+
+      - alert: HostCiRunnerLoadSaturation
+        expr: |
+          (awoooi_host_load5_per_core{host="110"} > 1.0)
+          and on(host)
+          (awoooi_host_gitea_actions_active_container_count{host="110"} > 0)
+        for: 15m
+        labels:
+          severity: warning
+          layer: systemd-110
+          component: gitea-actions-runner
+          host: "110"
+          team: ops
+          alert_category: host_resource
+          notification_type: TYPE-1
+          auto_repair: "false"
+          mcp_provider: "ssh_host"
+          host_type: "bare_metal"
+        annotations:
+          summary: "110 high load is currently explained by active Gitea Actions"
+          description: "load5/core > 1.0 with Gitea Actions task containers present; if the orphan browser metrics are 0, treat this as short-term CI build/test load first and do not misclassify it as a Docker/Sentry/Harbor incident."
+          runbook: "Check Gitea runs, the runner queue, and `docker ps --filter name=GITEA-ACTIONS-TASK-`; follow the runner drain / cleanup PlayBook only when a job is hung, exceeds the workflow timeout, or is cancelled by the owner."
+
   # =========================================================================
   # K8s cluster alerts (kubernetes_alerts)
   # =========================================================================
diff --git a/scripts/ops/host-runaway-process-exporter.py b/scripts/ops/host-runaway-process-exporter.py
new file mode 100755
index 00000000..c022f13d
--- /dev/null
+++ b/scripts/ops/host-runaway-process-exporter.py
@@ -0,0 +1,390 @@
+#!/usr/bin/env python3
+"""
+Host runaway process textfile exporter for AWOOOI AIOps.
+
+This exporter is read-only. It classifies orphaned headless browser/smoke
+process groups separately from legitimate Gitea Actions load so host CPU alerts
+can point to a concrete PlayBook instead of a generic "high CPU" symptom.
+"""
+
+from __future__ import annotations
+
+import argparse
+import os
+import re
+import subprocess
+import tempfile
+import time
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Iterable
+
+
+TEXTFILE_DIR = Path(os.environ.get("NODE_EXPORTER_TEXTFILE_DIR", "/var/lib/node_exporter/textfile_collector"))
+OUTPUT_NAME = "host_runaway_process.prom"
+HOST_LABEL = os.environ.get("AIOPS_HOST_LABEL", os.uname().nodename)
+LABEL_RE = re.compile(r'["\\\n]')
+
+
+@dataclass(frozen=True)
+class ProcessRow:
+    pid: int
+    ppid: int
+    pgid: int
+    sid: int
+    etimes: int
+    pcpu: float
+    stat: str
+    comm: str
+    args: str
+
+
+@dataclass(frozen=True)
+class RunawayRule:
+    rule_id: str
+    command_pattern: re.Pattern[str]
+    context_pattern: re.Pattern[str]
+
+
+@dataclass(frozen=True)
+class ProcessGroup:
+    rule_id: str
+    pgid: int
+    rows: tuple[ProcessRow, ...]
+    cpu_percent: float
+    oldest_age_seconds: int
+    orphan_reason: str
+    sample_comm: str
+
+
+DEFAULT_RULES = (
+    RunawayRule(
+        "stockplatform_headless_smoke",
+        re.compile(r"(chrome|chromium|playwright)", re.IGNORECASE),
+        re.compile(r"stockplatform-review-bulk-ux|/tmp/stockplatform", re.IGNORECASE),
+    ),
+    RunawayRule(
+        "headless_browser_smoke",
+        re.compile(r"(chrome|chromium|playwright)", re.IGNORECASE),
+        re.compile(r"--headless|--user-data-dir=/tmp|/tmp/.*(smoke|ux|playwright)", re.IGNORECASE),
+    ),
+)
+
+
+def escape_label(value: str) -> str:
+    return LABEL_RE.sub(lambda m: {"\n": r"\n", "\\": r"\\", '"': r"\""}[m.group(0)], value)
+
+
+def run_text(command: list[str], timeout: int = 20) -> str:
+    return subprocess.run(command, check=True, capture_output=True, text=True, timeout=timeout).stdout
+
+
+def read_ps_text(ps_file: Path | None = None) -> str:
+    if ps_file:
+        return ps_file.read_text(encoding="utf-8")
+    linux_command = [
+        "ps",
+        "-eo",
+        "pid=,ppid=,pgid=,sid=,etimes=,pcpu=,stat=,comm=,args=",
+    ]
+    try:
+        return run_text(linux_command)
+    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
+        return run_text(
+            [
+                "ps",
+                "-axo",
+                "pid=,ppid=,pgid=,sess=,etime=,pcpu=,stat=,comm=,command=",
+            ]
+        )
+
+
+def elapsed_to_seconds(value: str) -> int:
+    try:
+        return int(float(value))
+    except ValueError:
+        pass
+
+    days = 0
+    clock = value
+    if "-" in value:
+        raw_days, clock = value.split("-", 1)
+        days = int(raw_days)
+    parts = [int(part) for part in clock.split(":")]
+    if len(parts) == 3:
+        hours, minutes, seconds = parts
+    elif len(parts) == 2:
+        hours = 0
+        minutes, seconds = parts
+    else:
+        hours = 0
+        minutes = 0
+        seconds = parts[0]
+    return days * 86400 + hours * 3600 + minutes * 60 + seconds
+
+
+def parse_ps_rows(text: str) -> list[ProcessRow]:
+    rows: list[ProcessRow] = []
+    for line in text.splitlines():
+        raw = line.strip()
+        if not raw:
+            continue
+        parts = raw.split(None, 8)
+        if len(parts) < 9:
+            continue
+        try:
+            rows.append(
+                ProcessRow(
+                    pid=int(parts[0]),
+                    ppid=int(parts[1]),
+                    pgid=int(parts[2]),
+                    sid=int(parts[3]),
+                    etimes=elapsed_to_seconds(parts[4]),
+                    pcpu=float(parts[5]),
+                    stat=parts[6],
+                    comm=parts[7],
+                    args=parts[8],
+                )
+            )
+        except ValueError:
+            continue
+    return rows
+
+
+def matching_rule(row: ProcessRow, rules: Iterable[RunawayRule] = DEFAULT_RULES) -> str | None:
+    haystack = f"{row.comm} {row.args}"
+    for rule in rules:
+        if rule.command_pattern.search(haystack) and rule.context_pattern.search(haystack):
+            return rule.rule_id
+    return None
+
+
+def orphan_reason(rows: list[ProcessRow], all_pids: set[int]) -> str | None:
+    if any(row.ppid == 1 for row in rows):
+        return "ppid_1"
+    pgid = rows[0].pgid
+    if pgid not in all_pids:
+        return "missing_group_leader"
+    return None
+
+
+def classify_groups(
+    rows: list[ProcessRow],
+    *,
+    min_age_seconds: int,
+    min_cpu_percent: float,
+) -> list[ProcessGroup]:
+    all_pids = {row.pid for row in rows}
+    grouped: dict[tuple[str, int], list[ProcessRow]] = {}
+    for row in rows:
+        rule_id = matching_rule(row)
+        if rule_id is None:
+            continue
+        grouped.setdefault((rule_id, row.pgid), []).append(row)
+
+    groups: list[ProcessGroup] = []
+    for (rule_id, pgid), members in grouped.items():
+        reason = orphan_reason(members, all_pids)
+        if reason is None:
+            continue
+        oldest = max(row.etimes for row in members)
+        cpu_percent = sum(row.pcpu for row in members)
+        if oldest < min_age_seconds or cpu_percent < min_cpu_percent:
+            continue
+        sample_comm = sorted({row.comm for row in members})[0][:48]
+        groups.append(
+            ProcessGroup(
+                rule_id=rule_id,
+                pgid=pgid,
+                rows=tuple(sorted(members, key=lambda row: row.pid)),
+                cpu_percent=cpu_percent,
+                oldest_age_seconds=oldest,
+                orphan_reason=reason,
+                sample_comm=sample_comm,
+            )
+        )
+    return sorted(groups, key=lambda group: (-group.cpu_percent, group.rule_id, group.pgid))
+
+
+def active_gitea_action_containers(docker_file: Path | None = None) -> int:
+    try:
+        if docker_file:
+            names = docker_file.read_text(encoding="utf-8").splitlines()
+        else:
+            names = run_text(["docker", "ps", "--format", "{{.Names}}"], timeout=10).splitlines()
+    except Exception:
+        return -1
+    return sum(1 for name in names if "GITEA-ACTIONS-TASK-" in name)
+
+
+def load5_per_core() -> float:
+    try:
+        load5 = float(Path("/proc/loadavg").read_text(encoding="utf-8").split()[1])
+    except Exception:
+        try:
+            load5 = os.getloadavg()[1]
+        except OSError:
+            return 0.0
+    cores = os.cpu_count() or 1
+    return load5 / cores
+
+
+def swap_used_ratio(meminfo_file: Path | None = None) -> float:
+    path = meminfo_file or Path("/proc/meminfo")
+    try:
+        values: dict[str, float] = {}
+        for line in path.read_text(encoding="utf-8").splitlines():
+            key, _, raw = line.partition(":")
+            if key in {"SwapTotal", "SwapFree"}:
+                values[key] = float(raw.strip().split()[0]) * 1024
+        total = values.get("SwapTotal", 0.0)
+        free = values.get("SwapFree", 0.0)
+        if total <= 0:
+            return 0.0
+        return max(0.0, min(1.0, (total - free) / total))
+    except Exception:
+        return 0.0
+
+
+def render_metrics(
+    *,
+    host: str,
+    groups: list[ProcessGroup],
+    active_action_containers: int,
+    min_age_seconds: int,
+    min_cpu_percent: float,
+    now: int,
+    load_ratio: float,
+    swap_ratio: float,
+) -> str:
+    labels_host = f'host="{escape_label(host)}"'
+    rule_ids = sorted({rule.rule_id for rule in DEFAULT_RULES})
+    by_rule = {rule_id: [group for group in groups if group.rule_id == rule_id] for rule_id in rule_ids}
+    lines = [
+        "# HELP awoooi_host_runaway_process_monitor_up Whether the host runaway process exporter completed.",
+        "# TYPE awoooi_host_runaway_process_monitor_up gauge",
+        "# HELP awoooi_host_runaway_process_last_run_timestamp Unix timestamp of the last exporter run.",
+        "# TYPE awoooi_host_runaway_process_last_run_timestamp gauge",
+        "# HELP awoooi_host_runaway_browser_orphan_group_count Count of orphaned browser/smoke process groups above thresholds.",
+        "# TYPE awoooi_host_runaway_browser_orphan_group_count gauge",
+        "# HELP awoooi_host_runaway_browser_orphan_process_count Count of orphaned browser/smoke processes above thresholds.",
+        "# TYPE awoooi_host_runaway_browser_orphan_process_count gauge",
+        "# HELP awoooi_host_runaway_browser_orphan_cpu_percent Sum CPU percent for orphaned browser/smoke process groups above thresholds.",
+        "# TYPE awoooi_host_runaway_browser_orphan_cpu_percent gauge",
+        "# HELP awoooi_host_runaway_browser_orphan_oldest_age_seconds Oldest age of matching orphaned process groups.",
+        "# TYPE awoooi_host_runaway_browser_orphan_oldest_age_seconds gauge",
+        "# HELP awoooi_host_runaway_browser_orphan_group_cpu_percent CPU percent for an individual orphaned browser/smoke process group.",
+        "# TYPE awoooi_host_runaway_browser_orphan_group_cpu_percent gauge",
+        "# HELP awoooi_host_runaway_browser_orphan_group_info Metadata for an individual orphaned browser/smoke process group.",
+        "# TYPE awoooi_host_runaway_browser_orphan_group_info gauge",
+        "# HELP awoooi_host_gitea_actions_active_container_count Active Gitea Actions task containers visible on the host, -1 when Docker is unavailable.",
+        "# TYPE awoooi_host_gitea_actions_active_container_count gauge",
+        "# HELP awoooi_host_load5_per_core Host load5 divided by CPU core count.",
+        "# TYPE awoooi_host_load5_per_core gauge",
+        "# HELP awoooi_host_swap_used_ratio Host swap used ratio from /proc/meminfo.",
+        "# TYPE awoooi_host_swap_used_ratio gauge",
+        "# HELP awoooi_host_runaway_process_remediation_authorized Static guardrail: remediation is not authorized by this exporter.",
+        "# TYPE awoooi_host_runaway_process_remediation_authorized gauge",
+        f"awoooi_host_runaway_process_monitor_up{{{labels_host},mode=\"read_only\"}} 1",
+        f"awoooi_host_runaway_process_last_run_timestamp{{{labels_host}}} {now}",
+        f"awoooi_host_gitea_actions_active_container_count{{{labels_host}}} {active_action_containers}",
+        f"awoooi_host_load5_per_core{{{labels_host}}} {load_ratio:.6f}",
+        f"awoooi_host_swap_used_ratio{{{labels_host}}} {swap_ratio:.6f}",
+        f"awoooi_host_runaway_process_remediation_authorized{{{labels_host}}} 0",
+    ]
+
+    for rule_id in rule_ids:
+        rule_labels = (
+            f'{labels_host},rule="{escape_label(rule_id)}",'
+            f'min_age_seconds="{min_age_seconds}",min_cpu_percent="{min_cpu_percent:g}"'
+        )
+        rule_groups = by_rule[rule_id]
+        lines.append(f"awoooi_host_runaway_browser_orphan_group_count{{{rule_labels}}} {len(rule_groups)}")
+        lines.append(
+            f"awoooi_host_runaway_browser_orphan_process_count{{{rule_labels}}} "
+            f"{sum(len(group.rows) for group in rule_groups)}"
+        )
+        lines.append(
+            f"awoooi_host_runaway_browser_orphan_cpu_percent{{{rule_labels}}} "
+            f"{sum(group.cpu_percent for group in rule_groups):.6f}"
+        )
+        lines.append(
+            f"awoooi_host_runaway_browser_orphan_oldest_age_seconds{{{rule_labels}}} "
+            f"{max((group.oldest_age_seconds for group in rule_groups), default=0)}"
+        )
+
+    for group in groups[:20]:
+        group_labels = (
+            f'{labels_host},rule="{escape_label(group.rule_id)}",pgid="{group.pgid}",'
+            f'orphan_reason="{escape_label(group.orphan_reason)}",comm="{escape_label(group.sample_comm)}"'
+        )
+        lines.append(f"awoooi_host_runaway_browser_orphan_group_cpu_percent{{{group_labels}}} {group.cpu_percent:.6f}")
+        lines.append(f"awoooi_host_runaway_browser_orphan_group_info{{{group_labels}}} 1")
+
+    return "\n".join(lines) + "\n"
+
+
+def collect(args: argparse.Namespace) -> str:
+    rows = parse_ps_rows(read_ps_text(args.ps_file))
+    groups = classify_groups(
+        rows,
+        min_age_seconds=args.min_age_seconds,
+        min_cpu_percent=args.min_cpu_percent,
+    )
+    return render_metrics(
+        host=args.host,
+        groups=groups,
+        active_action_containers=active_gitea_action_containers(args.docker_ps_file),
+        min_age_seconds=args.min_age_seconds,
+        min_cpu_percent=args.min_cpu_percent,
+        now=int(time.time()),
+        load_ratio=load5_per_core(),
+        swap_ratio=swap_used_ratio(args.meminfo_file),
+    )
+
+
+def write_textfile(payload: str, textfile_dir: Path, output_name: str) -> Path:
+    textfile_dir.mkdir(parents=True, exist_ok=True)
+    with tempfile.NamedTemporaryFile("w", dir=textfile_dir, delete=False, encoding="utf-8") as tmp:
+        tmp.write(payload)
+        tmp_path = Path(tmp.name)
+    output_path = textfile_dir / output_name
+    tmp_path.replace(output_path)
+    output_path.chmod(0o644)
+    return output_path
+
+
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Export AWOOOI host runaway process metrics.")
+    parser.add_argument("--host", default=HOST_LABEL)
+    parser.add_argument("--textfile-dir", type=Path, default=TEXTFILE_DIR)
+    parser.add_argument("--output-name", default=OUTPUT_NAME)
+    parser.add_argument("--stdout", action="store_true", help="Print metrics instead of writing the textfile.")
+    parser.add_argument("--ps-file", type=Path, help="Use a fixture file instead of running ps.")
+    parser.add_argument("--docker-ps-file", type=Path, help="Use a fixture file instead of docker ps.")
+    parser.add_argument("--meminfo-file", type=Path, help="Use a fixture file instead of /proc/meminfo.")
+    parser.add_argument(
+        "--min-age-seconds",
+        type=int,
+        default=int(os.environ.get("AIOPS_RUNAWAY_PROCESS_MIN_AGE_SECONDS", "1800")),
+    )
+    parser.add_argument(
+        "--min-cpu-percent",
+        type=float,
+        default=float(os.environ.get("AIOPS_RUNAWAY_PROCESS_MIN_CPU_PERCENT", "50")),
+    )
+    return parser.parse_args()
+
+
+def main() -> None:
+    args = parse_args()
+    payload = collect(args)
+    if args.stdout:
+        print(payload, end="")
+        return
+    output_path = write_textfile(payload, args.textfile_dir, args.output_name)
+    print(f"HOST_RUNAWAY_PROCESS_EXPORTER_OK output={output_path}")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/scripts/ops/host-runaway-process-remediation.py b/scripts/ops/host-runaway-process-remediation.py
new file mode 100755
index 00000000..928306a7
--- /dev/null
+++ b/scripts/ops/host-runaway-process-remediation.py
@@ -0,0 +1,165 @@
+#!/usr/bin/env python3
+"""
+Gated remediation helper for AWOOOI host runaway process groups.
+
+Default mode is dry-run. Applying SIGTERM requires explicit owner approval,
+maintenance window, evidence reference, and --confirm-apply. This script is a
+PlayBook primitive, not a background auto-kill daemon.
+"""
+
+from __future__ import annotations
+
+import argparse
+import importlib.util
+import json
+import os
+import signal
+import sys
+import time
+from pathlib import Path
+from types import ModuleType
+
+
+EXPORTER_PATH = Path(__file__).with_name("host-runaway-process-exporter.py")
+
+
+def load_exporter() -> ModuleType:
+    spec = importlib.util.spec_from_file_location("host_runaway_process_exporter", EXPORTER_PATH)
+    if spec is None or spec.loader is None:
+        raise RuntimeError(f"cannot load exporter module: {EXPORTER_PATH}")
+    module = importlib.util.module_from_spec(spec)
+    sys.modules[spec.name] = module
+    spec.loader.exec_module(module)
+    return module
+
+
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Dry-run or gated SIGTERM for AWOOOI runaway process groups.")
+    parser.add_argument("--host", default=os.environ.get("AIOPS_HOST_LABEL", os.uname().nodename))
+    parser.add_argument("--rule", help="Limit candidates to one rule id. Required with --apply.")
+    parser.add_argument("--ps-file", type=Path, help="Use a fixture ps file for tests or offline review.")
+    parser.add_argument("--min-age-seconds", type=int, default=1800)
+    parser.add_argument("--min-cpu-percent", type=float, default=50)
+    parser.add_argument("--apply", action="store_true", help="Send SIGTERM to matching process groups.")
+    parser.add_argument("--confirm-apply", action="store_true", help="Required together with --apply.")
+    parser.add_argument("--owner-approval-id", default="")
+    parser.add_argument("--maintenance-window-id", default="")
+    parser.add_argument("--evidence-ref", default="")
+    parser.add_argument("--wait-seconds", type=int, default=0, help="Optional wait after SIGTERM before re-reading ps.")
+    return parser.parse_args()
+
+
+def validate_apply_args(args: argparse.Namespace) -> None:
+    if not args.apply:
+        return
+    missing = []
+    if not args.confirm_apply:
+        missing.append("--confirm-apply")
+    if not args.rule:
+        missing.append("--rule")
+    if not args.owner_approval_id:
+        missing.append("--owner-approval-id")
+    if not args.maintenance_window_id:
+        missing.append("--maintenance-window-id")
+    if not args.evidence_ref:
+        missing.append("--evidence-ref")
+    if missing:
+        raise SystemExit(
+            "Refusing apply; missing required gates: "
+            + ", ".join(missing)
+            + ". Use dry-run output for the PlayBook packet first."
+        )
+
+
+def current_process_group() -> int:
+    try:
+        return os.getpgrp()
+    except Exception:
+        return -1
+
+
+def main() -> None:
+    args = parse_args()
+    validate_apply_args(args)
+    exporter = load_exporter()
+    rows = exporter.parse_ps_rows(exporter.read_ps_text(args.ps_file))
+    groups = exporter.classify_groups(
+        rows,
+        min_age_seconds=args.min_age_seconds,
+        min_cpu_percent=args.min_cpu_percent,
+    )
+    if args.rule:
+        groups = [group for group in groups if group.rule_id == args.rule]
+
+    own_pgrp = current_process_group()
+    candidates = []
+    for group in groups:
+        blocked_reason = None
+        if group.pgid <= 1:
+            blocked_reason = "unsafe_pgid"
+        elif group.pgid == own_pgrp:
+            blocked_reason = "own_process_group"
+        candidates.append(
+            {
+                "rule": group.rule_id,
+                "pgid": group.pgid,
+                "process_count": len(group.rows),
+                "cpu_percent": round(group.cpu_percent, 3),
+                "oldest_age_seconds": group.oldest_age_seconds,
+                "orphan_reason": group.orphan_reason,
+                "sample_comm": group.sample_comm,
+                "blocked_reason": blocked_reason,
+                "action": "skip" if blocked_reason else ("sigterm" if args.apply else "dry_run"),
+            }
+        )
+
+    signaled: list[int] = []
+    if args.apply:
+        for candidate in candidates:
+            if candidate["blocked_reason"]:
+                continue
+            os.killpg(int(candidate["pgid"]), signal.SIGTERM)
+            signaled.append(int(candidate["pgid"]))
+
+    remaining_after_wait = None
+    if args.apply and args.wait_seconds > 0:
+        time.sleep(args.wait_seconds)
+        fresh_rows = exporter.parse_ps_rows(exporter.read_ps_text(args.ps_file))
+        fresh_groups = exporter.classify_groups(
+            fresh_rows,
+            min_age_seconds=args.min_age_seconds,
+            min_cpu_percent=args.min_cpu_percent,
+        )
+        remaining_after_wait = [
+            group.pgid for group in fresh_groups if not args.rule or group.rule_id == args.rule
+        ]
+
+    payload = {
+        "schema_version": "host_runaway_process_remediation_v1",
+        "host": args.host,
+        "mode": "apply_sigterm" if args.apply else "dry_run",
+        "runtime_gate": 1 if args.apply else 0,
+        "owner_approval_id": args.owner_approval_id if args.apply else None,
+        "maintenance_window_id": args.maintenance_window_id if args.apply else None,
+        "evidence_ref": args.evidence_ref if args.apply else None,
+        "min_age_seconds": args.min_age_seconds,
+        "min_cpu_percent": args.min_cpu_percent,
+        "candidate_count": len(candidates),
+        "signaled_process_group_count": len(signaled),
+        "signaled_process_groups": signaled,
+        "remaining_after_wait": remaining_after_wait,
+        "candidates": candidates,
+        "forbidden_without_gates": [
+            "sigkill",
+            "docker_restart",
+            "systemctl_restart",
+            "nginx_reload",
+            "firewall_change",
+            "secret_collection",
+        ],
+    }
+    print(json.dumps(payload, ensure_ascii=False, indent=2, sort_keys=True))
+
+
+if __name__ == "__main__":
+    main()
diff --git a/scripts/ops/tests/test_host_runaway_process_exporter.py b/scripts/ops/tests/test_host_runaway_process_exporter.py
new file mode 100644
index 00000000..336a0d8c
--- /dev/null
+++ b/scripts/ops/tests/test_host_runaway_process_exporter.py
@@ -0,0 +1,144 @@
+from __future__ import annotations
+
+import importlib.util
+import subprocess
+import sys
+from pathlib import Path
+
+
+SCRIPT_ROOT = Path(__file__).resolve().parents[1]
+EXPORTER_PATH = SCRIPT_ROOT / "host-runaway-process-exporter.py"
+REMEDIATION_PATH = SCRIPT_ROOT / "host-runaway-process-remediation.py"
+
+
+def load_exporter():
+    spec = importlib.util.spec_from_file_location("host_runaway_process_exporter", EXPORTER_PATH)
+    assert spec and spec.loader
+    module = importlib.util.module_from_spec(spec)
+    sys.modules[spec.name] = module
+    spec.loader.exec_module(module)
+    return module
+
+
+def test_classifies_orphan_stockplatform_headless_group() -> None:
+    exporter = load_exporter()
+    rows = exporter.parse_ps_rows(
+        """
+        100 1 100 100 7200 65.0 S chrome /opt/chrome/chrome --headless --user-data-dir=/tmp/stockplatform-review-bulk-ux-aa
+        101 100 100 100 7190 55.0 S chromium /opt/chrome/chromium --type=renderer /tmp/stockplatform-review-bulk-ux-aa
+        200 10 200 200 600 90.0 S node pnpm --filter @awoooi/web build
+        """
+    )
+
+    groups = exporter.classify_groups(rows, min_age_seconds=1800, min_cpu_percent=50)
+
+    assert len(groups) == 1
+    assert groups[0].rule_id == "stockplatform_headless_smoke"
+    assert groups[0].pgid == 100
+    assert groups[0].orphan_reason == "ppid_1"
+    assert groups[0].cpu_percent == 120.0
+    assert len(groups[0].rows) == 2
+
+
+def test_ignores_non_orphan_or_young_browser_processes() -> None:
+    exporter = load_exporter()
+    rows = exporter.parse_ps_rows(
+        """
+        100 99 100 100 7200 65.0 S chrome /opt/chrome/chrome --headless --user-data-dir=/tmp/stockplatform-review-bulk-ux-aa
+        101 100 100 100 7190 55.0 S chromium /opt/chrome/chromium /tmp/stockplatform-review-bulk-ux-aa
+        300 1 300 300 60 120.0 S chrome /opt/chrome/chrome --headless --user-data-dir=/tmp/stockplatform-review-bulk-ux-bb
+        """
+    )
+
+    assert exporter.classify_groups(rows, min_age_seconds=1800, min_cpu_percent=50) == []
+
+
+def test_parses_bsd_elapsed_time_for_local_smoke() -> None:
+    exporter = load_exporter()
+    rows = exporter.parse_ps_rows(
+        """
+        100 1 100 100 01:00:00 65.0 S chrome /opt/chrome/chrome --headless --user-data-dir=/tmp/stockplatform-review-bulk-ux-aa
+        101 100 100 100 2-00:00:10 55.0 S chromium /opt/chrome/chromium /tmp/stockplatform-review-bulk-ux-aa
+        """
+    )
+
+    assert rows[0].etimes == 3600
+    assert rows[1].etimes == 172810
+
+
+def test_renders_ci_load_and_swap_without_authorizing_repair(tmp_path: Path) -> None:
+    exporter = load_exporter()
+    groups = exporter.classify_groups(
+        exporter.parse_ps_rows(
+            "100 1 100 100 7200 65.0 S chrome /opt/chrome/chrome --headless --user-data-dir=/tmp/stockplatform-review-bulk-ux-aa"
+        ),
+        min_age_seconds=1800,
+        min_cpu_percent=50,
+    )
+    metrics = exporter.render_metrics(
+        host="110",
+        groups=groups,
+        active_action_containers=3,
+        min_age_seconds=1800,
+        min_cpu_percent=50,
+        now=123,
+        load_ratio=1.25,
+        swap_ratio=1.0,
+    )
+
+    assert 'awoooi_host_runaway_process_monitor_up{host="110",mode="read_only"} 1' in metrics
+    assert 'awoooi_host_gitea_actions_active_container_count{host="110"} 3' in metrics
+    assert 'awoooi_host_swap_used_ratio{host="110"} 1.000000' in metrics
+    assert 'awoooi_host_runaway_process_remediation_authorized{host="110"} 0' in metrics
+    assert 'rule="stockplatform_headless_smoke"' in metrics
+
+
+def test_remediation_defaults_to_dry_run(tmp_path: Path) -> None:
+    ps_file = tmp_path / "ps.txt"
+    ps_file.write_text(
+        "100 1 100 100 7200 65.0 S chrome /opt/chrome/chrome --headless --user-data-dir=/tmp/stockplatform-review-bulk-ux-aa\n",
+        encoding="utf-8",
+    )
+
+    result = subprocess.run(
+        [
+            sys.executable,
+            str(REMEDIATION_PATH),
+            "--ps-file",
+            str(ps_file),
+            "--rule",
+            "stockplatform_headless_smoke",
+        ],
+        check=True,
+        capture_output=True,
+        text=True,
+    )
+
+    assert '"mode": "dry_run"' in result.stdout
+    assert '"runtime_gate": 0' in result.stdout
+    assert '"action": "dry_run"' in result.stdout
+
+
+def test_remediation_refuses_apply_without_gates(tmp_path: Path) -> None:
+    ps_file = tmp_path / "ps.txt"
+    ps_file.write_text(
+        "100 1 100 100 7200 65.0 S chrome /opt/chrome/chrome --headless --user-data-dir=/tmp/stockplatform-review-bulk-ux-aa\n",
+        encoding="utf-8",
+    )
+
+    result = subprocess.run(
+        [
+            sys.executable,
+            str(REMEDIATION_PATH),
+            "--ps-file",
+            str(ps_file),
+            "--apply",
+            "--rule",
+            "stockplatform_headless_smoke",
+        ],
+        capture_output=True,
+        text=True,
+    )
+
+    assert result.returncode != 0
+    assert "Refusing apply" in result.stderr
diff --git a/scripts/reboot-recovery/reboot-recovery-readiness-audit.sh b/scripts/reboot-recovery/reboot-recovery-readiness-audit.sh
index 19095d14..dd61160b 100755
--- a/scripts/reboot-recovery/reboot-recovery-readiness-audit.sh
+++ b/scripts/reboot-recovery/reboot-recovery-readiness-audit.sh
@@ -193,6 +193,8 @@ require_file scripts/ops/docker-stats-textfile-exporter.py "Docker stats textfil
 require_file scripts/ops/systemd-units-textfile-exporter.py "Systemd units textfile exporter"
 require_file scripts/ops/storage-health-textfile-exporter.py "Storage health textfile exporter"
 require_file scripts/ops/backup-health-textfile-exporter.py "Backup health textfile exporter"
+require_file scripts/ops/host-runaway-process-exporter.py "Host runaway process textfile exporter"
+require_file scripts/ops/host-runaway-process-remediation.py "Host runaway process gated remediation helper"
 require_file scripts/ops/backup-alert-label-contract-check.py "Backup alert label contract check"
 require_file scripts/ops/backup-alert-live-visibility-check.py "Backup alert live visibility check"
 require_file scripts/ops/recovery-scorecard-contract-check.py "Recovery scorecard contract check"
@@ -270,6 +272,11 @@ require_pattern "awoooi_backup_offsite_full_sync_enabled" scripts/ops/backup-hea
 require_pattern "awoooi_backup_retention_latest_only" scripts/ops/backup-health-textfile-exporter.py "110 latest-only retention textfile metric"
 require_pattern "awoooi_backup_cron_active_duplicate_count" scripts/ops/backup-health-textfile-exporter.py "110 backup cron duplicate textfile metric"
 require_pattern "awoooi_backup_cron_singular_entry_ok" scripts/ops/backup-health-textfile-exporter.py "110 backup cron singular textfile metric"
+require_pattern "awoooi_host_runaway_process_monitor_up" scripts/ops/host-runaway-process-exporter.py "110 runaway process monitor metric"
+require_pattern "awoooi_host_runaway_process_remediation_authorized" scripts/ops/host-runaway-process-exporter.py "110 runaway process remediation authorization guard metric"
+require_pattern "owner-approval-id" scripts/ops/host-runaway-process-remediation.py "Runaway process remediation owner approval gate"
+require_pattern "maintenance-window-id" scripts/ops/host-runaway-process-remediation.py "Runaway process remediation maintenance window gate"
+require_pattern "evidence-ref" scripts/ops/host-runaway-process-remediation.py "Runaway process remediation evidence gate"
 require_pattern "textfile_exporters" infra/ansible/playbooks/188-ai-web.yml "188 textfile exporters tag"
 require_pattern "backup-momo-188-pg.sh" infra/ansible/playbooks/188-ai-web.yml "188 momo PostgreSQL backup deploy"
 require_pattern "/home/ollama/bin/momo-pg-backup.sh" infra/ansible/playbooks/188-ai-web.yml "188 host-owned momo backup entrypoint"
@@ -291,6 +298,10 @@ require_pattern "awoooi_cold_start_blocker_reason" ops/monitoring/alerts-unified
 require_pattern "docker_container_cpu_cores" ops/monitoring/alerts-unified.yml "Docker CPU alert metric"
 require_pattern "systemd_unit_watchdog_seconds" ops/monitoring/alerts-unified.yml "Systemd watchdog alert metric"
 require_pattern "awoooi_host_storage_error_count" ops/monitoring/alerts-unified.yml "Storage health alert metric"
+require_pattern "host_runaway_process_alerts" ops/monitoring/alerts-unified.yml "Host runaway process alert group"
+require_pattern "HostOrphanBrowserSmokeHighCpu" ops/monitoring/alerts-unified.yml "Host orphan browser smoke alert"
+require_pattern "HostCiRunnerLoadSaturation" ops/monitoring/alerts-unified.yml "Host CI runner load classification alert"
+require_pattern "awoooi_host_runaway_process_remediation_authorized" ops/monitoring/alerts-unified.yml "Host runaway process remediation guard alert metric"
 require_pattern "awoooi_backup_job_fresh" ops/monitoring/alerts-unified.yml "Backup freshness alert metric"
 require_pattern "awoooi_backup_integrity_fresh" ops/monitoring/alerts-unified.yml "Backup integrity alert metric"
 require_pattern "awoooi_backup_offsite_configured" ops/monitoring/alerts-unified.yml "Backup offsite alert metric"