diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index d9952916..64fb71f1 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -1,3 +1,32 @@
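+## 2026-06-18｜110 Runaway Process AIOps monitoring / alerting / PlayBook consolidation
+
+**Background:** The 110 CPU saturation has a confirmed root cause: a cross-project stockPlatform headless Chrome smoke run left behind 5 orphan process groups, two of which each consumed about 120% CPU; after a targeted `SIGTERM`, `REMAINING_AFTER_TERM=0`. The load that stayed high afterwards came from active Gitea Actions CI build/test, not from orphan Chrome or a Docker/Sentry/Harbor incident. This class of problem cannot stop at manual `top/ps`; it must be productized into monitoring, alerting, a PlayBook, KM, and gated remediation.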
+
+**Completed:**
+- Added `scripts/ops/host-runaway-process-exporter.py`, which writes `host_runaway_process.prom` read-only, classifies orphan browser smoke, active Gitea Actions, load5/core, and swap ratio, and pins `awoooi_host_runaway_process_remediation_authorized=0`.
+- Added `scripts/ops/host-runaway-process-remediation.py`, dry-run by default; `--apply` must be accompanied by `--confirm-apply`, `--owner-approval-id`, `--maintenance-window-id`, `--evidence-ref`, and `--rule`. It sends only `SIGTERM` and never performs `SIGKILL`, a Docker restart, or a systemd restart.
+- Added `scripts/ops/tests/test_host_runaway_process_exporter.py`, which locks in orphan group classification, BSD / Linux `ps` parsing, ignoring of legitimate / young processes, CI/swap metrics, dry-run behavior, and apply-gate rejection.
+- `ops/monitoring/alerts-unified.yml` gains the `host_runaway_process_alerts` group: `HostOrphanBrowserSmokeHighCpu`, `HostCiRunnerLoadSaturation`, monitor missing/stale, and the `HostRunawayProcessRemediationUnexpectedlyAuthorized` fuse.
+- `infra/ansible/roles/host-textfile-exporters` and `infra/ansible/playbooks/110-devops.yml` now cover the 110 exporter, the gated remediation helper, cron, an immediate refresh, and metric verification.
+- Added `docs/runbooks/HOST-RUNAWAY-PROCESS-AIOPS-PLAYBOOK.md`, which defines monitoring -> alert -> AI triage packet -> KM / PlayBook evidence -> gated remediation -> post-check / recurrence guard.
+- `docs/runbooks/FULL-STACK-COLD-START-SOP.md` is bumped to v1.26 and folds the 110 runaway browser / CI load triage into the runner/CD release conditions and the AI auto-remediation gate.
+- `docs/runbooks/ANSIBLE-OPERATING-MODEL.md`, `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md`, and `scripts/reboot-recovery/reboot-recovery-readiness-audit.sh` are synced with the new exporter / alert / gated remediation contract.
+
+**Verification:**
+- `/Users/ogt/.pyenv/shims/python3.11 -m pytest scripts/ops/tests/test_host_runaway_process_exporter.py -q`: `6 passed`.
+- `scripts/ops/host-runaway-process-exporter.py --stdout --host 110`: successfully emitted `awoooi_host_runaway_process_monitor_up=1`, the orphan group metrics, the CI active container metric, load5/core, swap ratio, and `remediation_authorized=0`.
+- `PATH="/Users/ogt/.pyenv/shims:$PATH" bash scripts/ops/deploy-alerts.sh --dry-run`: YAML validation passed; alert rules `123`, SLO recording `7`, SLO alerts `11`.
+- `PATH="/Users/ogt/.pyenv/shims:$PATH" bash scripts/reboot-recovery/reboot-recovery-readiness-audit.sh --no-color`: `PASS=197 WARN=2 BLOCKED=0`; the two WARNs are the missing local `ansible-playbook` and the live cold-start not having been run.
+- `source-control-owner-response-guard.py`, `security-mirror-progress-guard.py`, `doc-secrets-sanity-check.py`, and `git diff --check`: passed.
+
+**Completion sync:**
+- 110 runaway process monitoring / alert / PlayBook / KM contract: `100%`.
+- Repo-side automation mechanism: `100%`.
+- Runtime auto-remediation: `0%`; this is a safety gate, not a missing piece. A real `SIGTERM` requires owner approval, a maintenance window, an evidence ref, a rule, and a post-check.
+- 110 CPU orphan Chrome recurrence guard: `READY`; the next occurrence will be routed to the PlayBook by `HostOrphanBrowserSmokeHighCpu` instead of surfacing only as the generic `HostHighCpuLoad`.
+
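+**Boundary:** This entry did not SSH-write to any host, kill any live process, restart Docker/systemd/Nginx, change firewall/K8s, read secrets, or open the runtime gate. If the live exporter deployment / alert reload is to be carried out, it must go through the existing Ansible / deploy-alerts maintenance flow and owner readback.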
+
 ## 2026-06-18｜P2-406B Receipt Readback Owner Review 本地完成
 
 **背景**：P2-004 已把依賴 / 供應鏈漂移收斂成只讀監控讀回；統帥要求每次推進都不能忘記目標與方向，因此本段把日報 / 週報 / 月報、Telegram receipt owner review、P2-004 drift monitor 與 P2-403J 報表真相串成同一個 owner review surface，讓治理頁可以直接看到 AI Agent 分工、互審與仍被關閉的 runtime 邊界。
diff --git a/docs/runbooks/ANSIBLE-OPERATING-MODEL.md b/docs/runbooks/ANSIBLE-OPERATING-MODEL.md
index bc35ee98..7c9365c5 100644
--- a/docs/runbooks/ANSIBLE-OPERATING-MODEL.md
+++ b/docs/runbooks/ANSIBLE-OPERATING-MODEL.md
@@ -1,6 +1,6 @@
 # AWOOOI Ansible 運作模型
 
-> 最後更新：2026-05-12（台北時間）
+> 最後更新：2026-06-18（台北時間）
 > 範圍：說明 Ansible 在 110 / 120 / 121 / 188 的運維、冷啟動恢復、監控與部署安全中扮演的角色。
 
 ## 產品架構定位
@@ -35,7 +35,7 @@ Git repo
 | 110 Ollama proxy | `110-ollama-proxy.conf.j2` | `/etc/nginx/sites-enabled/110-ollama-proxy.conf` |
 | 110 cold-start monitor | `roles/cold-start-monitor` | `/home/wooo/scripts`、cron、node-exporter textfile |
 | 110 runner guardrails | `roles/runner-guardrails` | `actions.runner.*` systemd drop-ins |
-| 110/188 Docker/systemd/storage/backup textfile exporters | `roles/host-textfile-exporters` | `/home/*/node_exporter_textfiles/docker_stats.prom`、`storage_health.prom`、`backup_health.prom`、110 `systemd_units.prom` |
+| 110/188 Docker/systemd/storage/backup/runaway-process textfile exporters | `roles/host-textfile-exporters` | `/home/*/node_exporter_textfiles/docker_stats.prom`、`storage_health.prom`、`backup_health.prom`、110 `systemd_units.prom`、110 `host_runaway_process.prom` |
 | 110 Sentry backup / integrity drill | `110-devops.yml --tags backup_jobs` | `/backup/scripts/backup-sentry.sh`、`check-backup-integrity.sh`、weekly/monthly cron |
 | 主機健康描述 | `110-devops.yml`、`188-ai-web.yml` | 只讀檢查與有限度主機狀態修復 |
 
@@ -171,6 +171,24 @@ Sentry 資料層備份由 `/backup/scripts/backup-sentry.sh` 負責，納入每
 - 每月 `--mode restore-drill`：從每個 repo 抽一個小檔案 `restic dump latest <sample>` 到 0700 暫存目錄，驗證 snapshot 可讀。
 - 執行狀態寫入 `/backup/integrity/check.status` 與 `/backup/integrity/restore-drill.status`，由 `backup-health-textfile-exporter.py` 轉成 Prometheus metrics。
 
+## 110 Runaway Process classification
+
+Since 2026-06-18, `roles/host-textfile-exporters` also manages the following on 110:
+
+```text
+/home/wooo/scripts/host-runaway-process-exporter.py
+/home/wooo/scripts/host-runaway-process-remediation.py
+/home/wooo/node_exporter_textfiles/host_runaway_process.prom
+```
+
+Every 2 minutes the exporter reads only `ps`, the names of active Docker task containers, `/proc/loadavg`, and `/proc/meminfo`, in order to distinguish:
+
+- orphan headless Chrome / Chromium / Playwright smoke process groups.
+- legitimate Gitea Actions CI build/test load.
+- load5/core and swap ratio.
+
+`host-runaway-process-remediation.py` is a gated PlayBook helper and is never run by cron; `--apply` requires `--confirm-apply` and `--rule` plus an owner approval, a maintenance window, and an evidence ref. Ansible only deploys and refreshes the read-only exporter; it does not authorize killing processes.
+
 ## 下一批納入 Ansible 的項目
 
 | 優先級 | 項目 | 原因 |
@@ -179,7 +197,7 @@ Sentry 資料層備份由 `/backup/scripts/backup-sentry.sh` 負責，納入每
 | P0 | Sentry 專屬備份與 restic integrity drill | `backup_jobs` 已納入 110 playbook；下一步累積 nightly/weekly/monthly 成功證據 |
 | P0 | 188 nginx HTTPS route ownership | 避免 public tool routes 在事故後或同步後再次漂移 |
 | P1 | certbot/snap certbot 標準化 | 目前 apt certbot/OpenSSL 路徑脆弱，renewal 需要統一路徑 |
-| P1 | 110/188 Docker/systemd/storage/backup textfile exporters | `roles/host-textfile-exporters` 已建立；下一步是在 ops host 上 dry-run/apply，並確認 `docker_stats.prom` / `storage_health.prom` / `backup_health.prom` / `systemd_units.prom` freshness |
+| P1 | 110/188 Docker/systemd/storage/backup/runaway-process textfile exporters | `roles/host-textfile-exporters` 已建立；下一步是在 ops host 上 dry-run/apply，並確認 `docker_stats.prom` / `storage_health.prom` / `backup_health.prom` / `systemd_units.prom` / `host_runaway_process.prom` freshness |
 | P1 | node-exporter/cAdvisor caps | 監控元件本身不能變成負載來源 |
 | P2 | K3s diagnostic-only host tasks | 只驗證 containerd/kubelet 狀態，不做破壞性修復 |
 | P2 | 112 Kali inventory only | 先記錄，不掃描、不修復 |
diff --git a/docs/runbooks/FULL-STACK-COLD-START-SOP.md b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
index 774576fe..c90fd8f9 100644
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -1,6 +1,6 @@
 # AWOOOI 全棧冷啟動與主機重啟 SOP
 
-> Version: v1.25
+> Version: v1.26
 > Last updated: 2026-06-18 Asia/Taipei
 > Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.
 
@@ -41,6 +41,7 @@ Forbidden declaration: DR complete or runtime/security acceptance. Credential es
 | P2 service / data truth | `100%` | `VERIFIED_FULL_STACK_GREEN_FOR_SERVICE` |
 | P3 docs / automation contracts | `100%` | `DONE_WITH_STALE_JOB_CLASSIFICATION` |
 | 110 host runtime | `fwupd-refresh.timer` intentionally disabled/inactive after non-runtime firmware metadata refresh failed units were classified; `systemctl --failed` returns `0 loaded units listed`; rollback is `sudo systemctl enable --now fwupd-refresh.timer` | `GREEN_WITH_FWUPD_TIMER_DISABLED` |
+| 110 host runaway process guard | `host-runaway-process-exporter.py` / `host-runaway-process-remediation.py` are part of the 110 Ansible textfile exporter source-of-truth; alerts can tell orphan browser smoke apart from legitimate Gitea Actions CI load; the remediation helper defaults to dry-run, and `SIGTERM` requires owner approval, a maintenance window, and an evidence ref | `AIOPS_MONITOR_READY_RUNTIME_GATE_0` |
 | 120 reachability | ping OK, SSH OK, boot around `2026-06-14 02:23`, K3s active, node `mon Ready` | `GREEN` |
 | 121 reachability | ping OK, SSH OK, failed units `0` | `GREEN` |
 | 188 host runtime | production services green; host degraded only by `certbot.service` and `snap.certbot.renew.service` | `GREEN_WITH_CERTBOT_DEBT` |
@@ -784,7 +785,33 @@ Only release these after P0/P1/P2 gates are green:
 | 188 | litellm | `/health/liveliness` good and provider route verified |
 | 110 | Sentry Snuba consumers | ClickHouse healthy and Kafka backlog decreasing |
 | 110 | Sentry uptime-checker | Sentry web/DB healthy |
-| 110 | runners | all previous gates green and load/core < 1.0 for 15 minutes |
+| 110 | runners | all previous gates green, `host_runaway_process.prom` fresh, orphan browser group count `0`, and load/core < 1.0 for 15 minutes unless the remaining load is explicitly attributed to active CI |
+
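+### 11.1 110 Runaway Browser / CI Load triage
+
+The 2026-06-18 110 CPU saturation incident showed that the generic `HostHighCpuLoad` alert only says the host is busy; it cannot tell the operator whether any process should be killed. 110 must now use the dedicated host runaway process metrics as the first triage layer: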
+
+```bash
+grep -E 'awoooi_host_runaway_|awoooi_host_gitea_actions_|awoooi_host_load5_per_core|awoooi_host_swap_used_ratio' \
+  /home/wooo/node_exporter_textfiles/host_runaway_process.prom
+```
+
+How to read the result:
+
+| Metric combination | Verdict | Action |
+|----------|------|------|
+| `awoooi_host_runaway_browser_orphan_group_count > 0` and orphan group CPU `>= 100%` | orphan headless browser / smoke process group | run `host-runaway-process-remediation.py` in dry-run; gated `SIGTERM` is allowed only after human confirmation |
+| orphan count `0` and `awoooi_host_gitea_actions_active_container_count > 0` | legitimate CI build/test load | watch the Gitea Actions queue / workflow timeout; do not kill processes |
+| `awoooi_host_runaway_process_monitor_up` missing or stale | monitoring blind spot | fix cron / textfile collector / Ansible role; do not claim AI Ops observability |
+| `awoooi_host_runaway_process_remediation_authorized > 0` | the monitor has been wrongly turned into an executor | roll back immediately; runtime remediation must go only through the gated helper |
+
+Official PlayBook:
+
+```text
+docs/runbooks/HOST-RUNAWAY-PROCESS-AIOPS-PLAYBOOK.md
+```
+
+This PlayBook does not replace the Docker / Sentry / Harbor / K3s / backup SOPs. It only classifies orphan browser smoke versus CI load, so that high CPU does not lead to a mistaken Docker restart or to killing a legitimate build.
 
 ---
 
@@ -800,7 +827,7 @@ These are release gates after the first cold-start recovery pass:
 | 110 host | Harbor `/v2/` 200/401, Gitea 200/302, Prometheus ready, Alertmanager healthy, Sentry HTTP 200/302/400, no ClickHouse/Kafka restart loop |
 | K3s | 120/121 nodes Ready, VIP `192.168.0.125` present, AWOOOI API 2xx/3xx, Web 2xx/3xx |
 | Public routes | `https://awoooi.wooo.work/api/v1/health` 2xx/3xx, `https://mo.wooo.work/health` 2xx/3xx |
-| Guardrails | Docker/systemd textfile exporters fresh, runner `CPUQuota=200%`, `MemoryMax=2G`, `WatchdogUSec=0` |
+| Guardrails | Docker/systemd/storage/backup/runaway-process textfile exporters fresh, runner `CPUQuota=200%`, `MemoryMax=2G`, `WatchdogUSec=0` |
 | Schedules | cron active on 110/188/120/121; K8s CronJobs unsuspended; no current failed Jobs; 188 backup success `< 25h` |
 | Backlog | ClickHouse merges and Kafka/Snuba lag trending down, not increasing for two consecutive checks |
 
@@ -812,6 +839,7 @@ AI auto-repair can move from observe-only to limited execution only after:
 
 - Prometheus rules are loaded.
 - docker/systemd textfile exporter files are fresh.
+- runaway process textfile exporter is fresh and `remediation_authorized=0`.
 - blackbox probes have stable results.
 - cron/CronJob schedule checks are green.
 - AWOOOI API `/api/v1/health` passes.
@@ -827,6 +855,7 @@ Until then:
 - require human approval for remediation
 - no DB/ClickHouse/Harbor/Sentry destructive action
 - no generic restart action against stateful services
+- no process kill unless `host-runaway-process-remediation.py` has dry-run evidence plus owner approval, maintenance window, and evidence ref
 
 ---
 
@@ -1596,6 +1625,7 @@ All must be true:
 - Runners are guarded and released last.
 - AI auto-remediation is not in full execution mode until all gates are green.
 - 110 cold-start textfile monitor is fresh, and Prometheus has the cold-start alert rules loaded.
+- 110 runaway process textfile monitor is fresh, and Prometheus has `HostOrphanBrowserSmokeHighCpu` plus CI load classification rules loaded.
 - 110 global `/home/wooo/.ssh/known_hosts` still contains verified 120 / 188 entries after any CD run; deploy jobs use `/home/wooo/.ssh/deploy_known_hosts` only.
 
 ### 15.1 可宣稱狀態
diff --git a/docs/runbooks/HOST-RUNAWAY-PROCESS-AIOPS-PLAYBOOK.md b/docs/runbooks/HOST-RUNAWAY-PROCESS-AIOPS-PLAYBOOK.md
new file mode 100644
index 00000000..d46be540
--- /dev/null
+++ b/docs/runbooks/HOST-RUNAWAY-PROCESS-AIOPS-PLAYBOOK.md
@@ -0,0 +1,153 @@
+# Host Runaway Process AIOps PlayBook
+
+> Last updated: 2026-06-18 Asia/Taipei
+> Scope: 110 host CPU saturation, orphan Chrome / Playwright smoke, and Gitea Actions CI load triage.
+
+---
+
+## 1. Goal
+
+This PlayBook turns the 2026-06-18 110 CPU saturation incident into a closed AI Ops loop:
+
+```text
+read-only exporter -> Prometheus alert -> AI triage packet -> KM / PlayBook evidence -> gated remediation -> post-check / recurrence guard
+```
+
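+The problem it addresses is not the generic symptom "CPU is high" but precise classification:
+
+| Type | Criteria | Handling |
+|------|------|------|
+| orphan browser smoke | a headless Chrome / Chromium / Playwright process group that has lived too long, with PPID=1 or a missing group leader, and high aggregate CPU | use the dry-run remediation packet; `SIGTERM` may be sent after human approval |
+| legitimate CI load | a Gitea Actions task container is running and no orphan browser metric is raised | watch queue / timeout; do not kill by mistake |
+| Docker / Sentry / Harbor incident | container restart, port down, journal error, cold-start gate blocked | follow each service's own SOP; do not use this PlayBook to kill processes |
+| swap full but not thrashing | swap ratio is high but `vmstat` / load classification shows no live thrashing | do not clear swap manually; reduce the high-CPU source first |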
+
+---
+
+## 2. Metrics and alerts
+
+On 110, the Ansible `host-textfile-exporters` role deploys:
+
+```text
+/home/wooo/scripts/host-runaway-process-exporter.py
+/home/wooo/scripts/host-runaway-process-remediation.py
+/home/wooo/node_exporter_textfiles/host_runaway_process.prom
+```
+
+Core metrics:
+
+| Metric | Meaning |
+|--------|------|
+| `awoooi_host_runaway_process_monitor_up{host="110"}` | whether the exporter is emitting output normally |
+| `awoooi_host_runaway_browser_orphan_group_count{host="110",rule=...}` | number of orphan browser process groups matching the rule |
+| `awoooi_host_runaway_browser_orphan_cpu_percent{host="110",rule=...}` | aggregate CPU of the orphan groups |
+| `awoooi_host_gitea_actions_active_container_count{host="110"}` | currently active Gitea Actions task containers |
+| `awoooi_host_load5_per_core{host="110"}` | load5 / CPU cores |
+| `awoooi_host_swap_used_ratio{host="110"}` | swap usage ratio |
+| `awoooi_host_runaway_process_remediation_authorized{host="110"}` | must always be `0`; the exporter is not an executor |
+
+Alerts:
+
+| Alert | Condition | Action |
+|-------|------|------|
+| `HostOrphanBrowserSmokeHighCpu` | orphan browser group `> 0` and CPU `>= 100%` for 10 minutes | generate the dry-run remediation packet; confirm owner / maintenance window / evidence |
+| `HostCiRunnerLoadSaturation` | load5/core `> 1.0` and active Gitea Actions `> 0` | mark as short-term CI load, check the runner queue, do not kill directly |
+| `HostRunawayProcessMonitorMissing` / `Stale` | exporter missing or not updated for more than 10 minutes | fix exporter / cron / textfile collector |
+| `HostRunawayProcessRemediationUnexpectedlyAuthorized` | `remediation_authorized > 0` | roll back immediately; turning the monitor into an executor is forbidden |
+
+---
+
+## 3. Required AI Triager checks
+
+On receiving `HostOrphanBrowserSmokeHighCpu`, the AI / operator must first produce a dry-run:
+
+```bash
+python3 scripts/ops/host-runaway-process-remediation.py \
+  --rule stockplatform_headless_smoke \
+  --min-age-seconds 1800 \
+  --min-cpu-percent 50
+```
+
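+The dry-run must verify:
+
+1. `candidate_count > 0`.
+2. `orphan_reason` is `ppid_1` or `missing_group_leader`.
+3. `oldest_age_seconds` exceeds the PlayBook threshold.
+4. The candidate process group does not belong to a legitimate job that active Gitea Actions is still running.
+5. The candidate is not the Docker daemon, Sentry, Harbor, PostgreSQL, ClickHouse, K3s, or a backup service itself.
+6. An owner / maintenance window / evidence ref is already in place.
+
+If only `HostCiRunnerLoadSaturation` fires and the orphan group count is `0`, the default verdict is "legitimate short-term CI load" and automatic remediation is not allowed.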
+
+---
+
+## 4. Gated Remediation
+
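+Actually sending `SIGTERM` requires all three approval gates (owner approval, maintenance window, evidence ref) in addition to `--apply`, `--confirm-apply`, and `--rule`: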
+
+```bash
+python3 scripts/ops/host-runaway-process-remediation.py \
+  --apply \
+  --confirm-apply \
+  --rule stockplatform_headless_smoke \
+  --owner-approval-id OWNER-APPROVAL-REDACTED \
+  --maintenance-window-id MW-REDACTED \
+  --evidence-ref INC-REDACTED \
+  --wait-seconds 5
+```
+
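+Prohibited:
+
+- Do not default to `SIGKILL`.
+- Do not run `systemctl restart docker` just because CPU is high.
+- Do not restart Sentry / Harbor / Gitea / Nginx.
+- Do not change firewall / iptables / NetworkPolicy.
+- Do not read or print any secret value, token, hash, or prefix / suffix.
+- Do not treat route 200, container up, or a visible UI as proof that remediation is complete.
+
+Remediation is complete when: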
+
+```text
+signaled_process_group_count > 0
+remaining_after_wait = []
+awoooi_host_runaway_browser_orphan_group_count == 0
+load5/core starts to fall or stays at an explainable level
+if active Gitea Actions still exist, the alert is downgraded to CI load rather than orphan smoke
+```
+
+---
+
+## 5. KM / PlayBook write-back contract
+
+Every trigger must leave behind:
+
+| Asset | Required fields |
+|------|----------|
+| Incident evidence | alert name, host, rule, pgid count, cpu percent, oldest age, active CI count, swap ratio |
+| PlayBook run | dry-run payload, owner approval id, maintenance window id, evidence ref, actual signal summary |
+| KM entry | root-cause classification, false-positive guard, remediation result, recurrence guard |
+| Verifier | post-check metrics, load trend, orphan group count, runner queue state |
+| Work item | if the owner / evidence / maintenance window is missing, create a follow-up item; do not artificially raise the runtime gate |
+
+Product-facing conclusions must be reported separately:
+
+```text
+monitoring_ready=true
+alert_ready=true
+playbook_ready=true
+km_writeback_required=true
+runtime_remediation_authorized=false unless gated apply is executed
+```
+
+---
+
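+## 6. Relationship to the reboot SOP
+
+After a 110 reboot, runner / CD / high-load batch jobs are released last. If service health is green but load stays high:
+
+1. Read `host_runaway_process.prom` first.
+2. Orphan browser metrics red: follow this PlayBook.
+3. Active CI metrics red but orphan count 0: wait / drain / rely on workflow timeout; do not kill.
+4. Docker / systemd / storage / backup metrics red: return to the corresponding section of `FULL-STACK-COLD-START-SOP.md`.
+
+This PlayBook is the host-CPU-specific closed loop of the AI automation product; it does not replace the cold-start scorecard, nor does it lift the credential escrow / DR gate.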
diff --git a/docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md b/docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md
index d8e07db9..0f15290e 100644
--- a/docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md
+++ b/docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md
@@ -5074,3 +5074,17 @@ Trigger commit `f5cd37b7` 與 deploy marker `0ba92357` 已把 governance UI 的
 - 本地驗證：JSON schema / snapshot parse 通過；`python3 -m py_compile apps/api/src/services/ai_agent_receipt_readback_owner_review.py` 通過；目標 service / API tests `9 passed`；zh-TW / en i18n JSON parse 通過。Web typecheck 因乾淨 worktree 缺 `node_modules` 未執行，本輪未安裝套件或寫 lockfile。
 
 **裁決：** P2-406B 是 receipt readback owner review 與 governance 可視化，不是 Telegram send、Gateway queue write、Bot API call、receipt production write、AI analysis runtime、中低風險 auto execution、production optimization、secret read、paid API、host write、kubectl action、destructive operation 或 OpenClaw 替換。下一步是 `P2-407`：AI 報表自動分析 no-write runtime，只產生 committed snapshot / 草稿與 actionability score，不得實發或寫 production。
+
+### 2026-06-18 14:20 (台北) — §8 / Host CPU AIOps — 新增 110 runaway process 監控 / 告警 / PlayBook / gated remediation
+
+**Trigger:** The 110 CPU saturation has been confirmed as 5 orphan process groups left behind by a cross-project stockPlatform headless Chrome smoke run; after a targeted SIGTERM, `REMAINING_AFTER_TERM=0`. The load that stayed high afterwards came from AWOOOI / VibeWork / 2026 World Cup Gitea Actions build/test. This shows that the generic `HostHighCpuLoad` is not enough to support an AI automation product; orphan processes, legitimate CI load, and Docker/Sentry/Harbor incidents must be told apart.
+
+**Progress:**
+- Added `scripts/ops/host-runaway-process-exporter.py`, which reports orphan browser process groups, active Gitea Actions, load5/core, swap ratio, and `remediation_authorized=0` in read-only mode.
+- Added `scripts/ops/host-runaway-process-remediation.py`, dry-run by default; `--apply` requires owner approval, maintenance window, evidence ref, rule, and the confirm gate. It sends only SIGTERM and does not default to SIGKILL.
+- `ops/monitoring/alerts-unified.yml` gains `host_runaway_process_alerts`, covering orphan browser smoke (critical), CI load saturation (warning), monitor missing/stale, and the remediation authorization fuse.
+- `infra/ansible/roles/host-textfile-exporters` and `110-devops.yml` now cover the exporter, the gated helper, cron, an immediate refresh, and metric verification.
+- Added `docs/runbooks/HOST-RUNAWAY-PROCESS-AIOPS-PLAYBOOK.md` and synced SOP v1.26, the Ansible operating model, the reboot recovery workplan, LOGBOOK, and the readiness audit.
+- Added pytest coverage that locks in orphan classification, Linux / BSD `ps` parsing, ignoring of legitimate / young processes, CI/swap metrics, dry-run behavior, and apply-gate rejection; the readiness audit, rerun with pyenv Python, returned `BLOCKED=0`.
+
+**Verdict:** This is the observe -> classify -> alert -> PlayBook -> KM contract -> gated remediation closed loop for host CPU runaway, not an authorization for runtime automatic kills. AI may automatically diagnose, alert, and produce dry-run remediation packets and KM/PlayBook write-back requests; actual process termination still requires owner approval, a maintenance window, an evidence ref, and a post-check. Docker restart, systemd restart, Nginx reload, firewall change, secret read, host write, and production write all remain prohibited.
diff --git a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
index 7d99ca3d..90a63065 100644
--- a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
+++ b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
@@ -15,7 +15,7 @@
 | P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-14 18:15 readback shows 120 is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, and cold-start P0/P1 checks are green. |
 | P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 92% | 2026-06-15 03:11 `backup-status` shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-15 02:40:13`. Offsite / escrow report shows `SCRIPT_MISSING_COUNT=0`, `OFFSITE_CONFIGURED=1`, `RCLONE_CONFIGURED=1`, `ESCROW_MISSING_COUNT=5`. Owner request package is ready; actual marker write remains blocked on real non-secret evidence IDs. |
 | P2 service / data truth | VERIFIED_FULL_STACK_GREEN_FOR_SERVICE | 100% | 2026-06-18 13:43 cold-start verifies public route/TLS, API/Web route, momo health and current-month parity `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`, backup exporters, schedules, K3s node readiness, VIP, and 110 / 188 runtime health. K8s active failed Job count is `0`, bad pods are `0`, and cold-start returns `PASS=84 WARN=0 BLOCKED=0`. |
-| P3 docs / automation contracts | DONE_WITH_STALE_JOB_CLASSIFICATION | 100% | Workplan, SOP v1.25, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, and 2026-06-18 live readback are updated. Repo-side `reboot-recovery-readiness-audit.sh --no-color` returned `PASS=187 WARN=1 BLOCKED=0`; live cold-start returned `PASS=84 WARN=0 BLOCKED=0`. |
+| P3 docs / automation contracts | DONE_WITH_RUNAWAY_PROCESS_AIOPS | 100% | Workplan, SOP v1.26, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, and 2026-06-18 live readback are updated. Repo-side readiness audit now also checks runaway process exporter / remediation helper / alert group; live cold-start remains `PASS=84 WARN=0 BLOCKED=0` from the latest service readiness readback. |
 
 Full cold-start service readiness may be declared green for the latest verified evidence set. As of 2026-06-18 13:43, services are green with `WARN=0` and `BLOCKED=0`; the retained stale `km-vectorize` failed Job remains historical evidence only. Do not declare DR scorecard complete while credential escrow evidence remains blocked.
 
@@ -175,7 +175,7 @@ Next: <single next action>
 | P3-005 | DONE | 100 | Update cold-start SOP | SOP now includes start, shutdown, reboot, record, comparison, and 120 blocker handling. | Increment SOP version after each process change. | SOP has controlled power-operation sections and ledger template. |
 | P3-006 | DONE | 100 | Update backup status | Backup status now reflects current cron, rclone latest-only, failure-only alert posture, and escrow blocker. | Refresh after 120 backup rerun. | Backup status no longer claims noisy success Telegram notifications. |
 | P3-007 | DONE | 100 | Harden Gitea backup stale dump handling | 2026-06-05 manual Gitea backup failed because the container retained `/tmp/gitea-dump.zip` from the 02:00 failure. `scripts/backup/backup-gitea.sh` now renames stale container dump files to timestamped evidence before running a new dump, and the live 110 script is updated. | Watch the next 02:00 Gitea backup. | `bash -n` passes locally and on 110; manual Gitea backup completed after stale evidence rename. |
-| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.25 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, repo-side readiness audit blocker closure, stale-vs-active K8s failed Job classification, 2026-06-18 live cold-start GREEN readback, post-reboot / post-CD recovery anchors, AA/AS 判定, workload 分散判定, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note，以及 allowed declaration wording. | Use v1.25 for the next reboot record, then compare actual timing, Plan B trigger, degraded level, failed/stale/active Job counters, and blockers against §1.4 plus §14.8 through §14.25. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow checks. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN_FOR_SERVICE`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN_FOR_SERVICE`, and `B5_DR_COMPLETE`; repo-side `reboot-recovery-readiness-audit.sh --no-color` returns `PASS=187 WARN=1 BLOCKED=0`, and live cold-start returns `PASS=84 WARN=0 BLOCKED=0`. |
+| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.26 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, repo-side readiness audit blocker closure, stale-vs-active K8s failed Job classification, 2026-06-18 live cold-start GREEN readback, post-reboot / post-CD recovery anchors, AA/AS determination, workload distribution determination, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, 110 runaway browser / CI load triage PlayBook, and allowed declaration wording. | Use v1.26 for the next reboot record, then compare actual timing, Plan B trigger, degraded level, failed/stale/active Job counters, runaway-process metrics, CI load attribution, and blockers against §1.4 plus §11.1 / §14.8 through §14.25. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow / runaway-process checks. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN_FOR_SERVICE`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN_FOR_SERVICE`, and `B5_DR_COMPLETE`; repo-side readiness audit checks runaway process exporter / alerts / gated remediation helper, and live cold-start returns `PASS=84 WARN=0 BLOCKED=0`. |
 | P3-009 | DONE | 100 | Assess 120/121 AA/AS role and host load balancing | 2026-06-12 15:19 live check confirms 120 and 121 are both `Ready control-plane`, `k3s active`, `k3s-agent inactive`, with no taints; however most AWOOOI / ArgoCD / Velero workload remains on 121 after 120 fsck recovery. New assessment defines control-plane AA vs workload AA, migration candidates from 110/188, and stateful migration blockers. | After P0 backup/offsite/cold-start green, implement topology spread for AWOOOI API/Web before moving additional services. | `docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md` exists; SOP v1.6 links AA/AS and load-balancing checks; migration implementation remains explicitly `0%`. |
 | P3-010 | DONE | 100 | Update workload balancing docs with 2026-06-13 live truth | Host role assessment, workplan, SOP, backup status, and LOGBOOK are refreshed with current cold-start, backup, 188 certbot degraded, ArgoCD `km-vectorize` degraded, Gitea main `acaae999`, ArgoCD sync, and final pod placement evidence. | Keep updating this file after the next reboot or deploy. | Docs separate service-green status from DR escrow, workload rollout, and non-service governance debt. |
 | P3-011 | DONE | 100 | Record `km-vectorize` remediation status | LOGBOOK, this workplan, and SOP now state the schedule/label fix, ArgoCD sync evidence, the invalid manual Job boundary, and the 90% waiting-for-next-schedule gate. | After next 03:00 run, update this row and the top verdict with `lastSuccessfulTime` / ArgoCD health evidence. | No document claims ArgoCD green before official CronJob success evidence exists. |
@@ -214,6 +214,12 @@ Do not run `truncate`, whole DB restore, force-push, DROP, or online root filesy
 ## 9. Progress Updates
 
 ```text
+2026-06-18 15:10 Asia/Taipei
+Phase: P3 AI Ops runaway process automation
+Before: 110 CPU saturation could only be judged by manual `ps/top`; the generic `HostHighCpuLoad` could not distinguish cross-project orphan Chrome smoke from legitimate Gitea Actions CI load.
+After: Added the read-only `host-runaway-process-exporter.py`, the gated `host-runaway-process-remediation.py`, the Prometheus `host_runaway_process_alerts` group, the Ansible textfile exporter source-of-truth, SOP v1.26, and `HOST-RUNAWAY-PROCESS-AIOPS-PLAYBOOK.md`. The exporter exposes orphan browser, active CI, load/core, swap ratio, and `remediation_authorized=0`; the remediation helper defaults to dry-run, and `SIGTERM` requires owner approval, a maintenance window, and an evidence ref.
+Completion: monitoring / alert / PlayBook / KM contract 100%; runtime auto-remediation remains gated at 0 until a real owner-approved apply is executed.
+
 2026-06-18 13:43 Asia/Taipei
 Phase: P1/P2/P3 live readback
 Before: live cold-start was `PASS=83 WARN=1 BLOCKED=0`, result `DEGRADED`, because retained stale `km-vectorize-29689620` failed Job evidence was still counted as a service warning.
diff --git a/infra/ansible/playbooks/110-devops.yml b/infra/ansible/playbooks/110-devops.yml
index df84e4a7..0ebab73a 100644
--- a/infra/ansible/playbooks/110-devops.yml
+++ b/infra/ansible/playbooks/110-devops.yml
@@ -29,6 +29,7 @@
       vars:
         host_textfile_user: wooo
         host_textfile_host_label: "110"
+        host_textfile_manage_runaway_process: true
         host_textfile_manage_systemd_units: true
         host_textfile_systemd_unit_glob: "actions.runner.*.service"
         host_textfile_systemd_units:
diff --git a/infra/ansible/roles/host-textfile-exporters/defaults/main.yml b/infra/ansible/roles/host-textfile-exporters/defaults/main.yml
index 1947ea1b..327a7da5 100644
--- a/infra/ansible/roles/host-textfile-exporters/defaults/main.yml
+++ b/infra/ansible/roles/host-textfile-exporters/defaults/main.yml
@@ -7,13 +7,19 @@ host_textfile_docker_stats_src: "{{ playbook_dir }}/../../../scripts/ops/docker-
 host_textfile_systemd_units_src: "{{ playbook_dir }}/../../../scripts/ops/systemd-units-textfile-exporter.py"
 host_textfile_storage_health_src: "{{ playbook_dir }}/../../../scripts/ops/storage-health-textfile-exporter.py"
 host_textfile_backup_health_src: "{{ playbook_dir }}/../../../scripts/ops/backup-health-textfile-exporter.py"
+host_textfile_runaway_process_src: "{{ playbook_dir }}/../../../scripts/ops/host-runaway-process-exporter.py"
+host_textfile_runaway_process_remediation_src: "{{ playbook_dir }}/../../../scripts/ops/host-runaway-process-remediation.py"
 host_textfile_docker_cron_minute: "*"
 host_textfile_systemd_cron_minute: "*"
 host_textfile_storage_cron_minute: "*"
 host_textfile_backup_cron_minute: "*/10"
+host_textfile_runaway_process_cron_minute: "*/2"
 host_textfile_manage_docker_stats: true
 host_textfile_manage_systemd_units: false
 host_textfile_manage_storage_health: true
 host_textfile_manage_backup_health: true
+host_textfile_manage_runaway_process: false
+host_textfile_runaway_process_min_age_seconds: 1800
+host_textfile_runaway_process_min_cpu_percent: 50
 host_textfile_systemd_unit_glob: ""
 host_textfile_systemd_units: []
diff --git a/infra/ansible/roles/host-textfile-exporters/tasks/main.yml b/infra/ansible/roles/host-textfile-exporters/tasks/main.yml
index 09c110a4..404de90c 100644
--- a/infra/ansible/roles/host-textfile-exporters/tasks/main.yml
+++ b/infra/ansible/roles/host-textfile-exporters/tasks/main.yml
@@ -161,6 +161,69 @@
     - not ansible_check_mode
   tags: textfile_exporters
 
+- name: "host textfile exporters | install runaway process exporter"
+  ansible.builtin.copy:
+    src: "{{ host_textfile_runaway_process_src }}"
+    dest: "{{ host_textfile_script_dir }}/host-runaway-process-exporter.py"
+    owner: "{{ host_textfile_user }}"
+    group: "{{ host_textfile_user }}"
+    mode: "0755"
+  when: host_textfile_manage_runaway_process
+  tags: textfile_exporters
+
+- name: "host textfile exporters | install runaway process gated remediation helper"
+  ansible.builtin.copy:
+    src: "{{ host_textfile_runaway_process_remediation_src }}"
+    dest: "{{ host_textfile_script_dir }}/host-runaway-process-remediation.py"
+    owner: "{{ host_textfile_user }}"
+    group: "{{ host_textfile_user }}"
+    mode: "0755"
+  when: host_textfile_manage_runaway_process
+  tags: textfile_exporters
+
+- name: "host textfile exporters | install runaway process cron"
+  ansible.builtin.cron:
+    name: "AWOOOI runaway process textfile exporter"
+    user: "{{ host_textfile_user }}"
+    minute: "{{ host_textfile_runaway_process_cron_minute }}"
+    job: >-
+      PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
+      AIOPS_HOST_LABEL={{ host_textfile_host_label }}
+      NODE_EXPORTER_TEXTFILE_DIR={{ host_textfile_dir }}
+      AIOPS_RUNAWAY_PROCESS_MIN_AGE_SECONDS={{ host_textfile_runaway_process_min_age_seconds }}
+      AIOPS_RUNAWAY_PROCESS_MIN_CPU_PERCENT={{ host_textfile_runaway_process_min_cpu_percent }}
+      {{ host_textfile_script_dir }}/host-runaway-process-exporter.py
+      >/tmp/awoooi-host-runaway-process-exporter.cron.log 2>&1
+  when: host_textfile_manage_runaway_process
+  tags: textfile_exporters
+
+- name: "host textfile exporters | refresh runaway process metrics immediately"
+  ansible.builtin.command:
+    cmd: "{{ host_textfile_script_dir }}/host-runaway-process-exporter.py"
+  environment:
+    AIOPS_HOST_LABEL: "{{ host_textfile_host_label }}"
+    NODE_EXPORTER_TEXTFILE_DIR: "{{ host_textfile_dir }}"
+    AIOPS_RUNAWAY_PROCESS_MIN_AGE_SECONDS: "{{ host_textfile_runaway_process_min_age_seconds }}"
+    AIOPS_RUNAWAY_PROCESS_MIN_CPU_PERCENT: "{{ host_textfile_runaway_process_min_cpu_percent }}"
+  become: true
+  become_user: "{{ host_textfile_user }}"
+  changed_when: false
+  when:
+    - host_textfile_manage_runaway_process
+    - not ansible_check_mode
+  tags: textfile_exporters
+
+- name: "host textfile exporters | verify runaway process metric exists"
+  ansible.builtin.command:
+    cmd: "grep -q '^awoooi_host_runaway_process_monitor_up{' {{ host_textfile_dir }}/host_runaway_process.prom"
+  become: true
+  become_user: "{{ host_textfile_user }}"
+  changed_when: false
+  when:
+    - host_textfile_manage_runaway_process
+    - not ansible_check_mode
+  tags: textfile_exporters
+
 - name: "host textfile exporters | 探測 systemd units"
   ansible.builtin.shell: |
     set -o pipefail
diff --git a/ops/monitoring/alerts-unified.yml b/ops/monitoring/alerts-unified.yml
index 9521dec1..66294877 100644
--- a/ops/monitoring/alerts-unified.yml
+++ b/ops/monitoring/alerts-unified.yml
@@ -133,6 +133,106 @@ groups:
           description: "磁碟使用率超過 85%"
           auto_repair_action: "ssh {{ $labels.instance }} 'echo \"=== CPU TOP ===\"; ps aux --sort=-%cpu | head -15; echo \"=== MEMORY ===\"; free -h; echo \"=== DISK ===\"; df -h; echo \"=== LOAD ===\"; uptime'"
 
+  # =========================================================================
+  # Host runaway process / CI load classification
+  # =========================================================================
+  - name: host_runaway_process_alerts
+    rules:
+      - alert: HostRunawayProcessMonitorMissing
+        expr: absent(awoooi_host_runaway_process_monitor_up{host="110",mode="read_only"})
+        for: 15m
+        labels:
+          severity: warning
+          layer: systemd-110
+          component: host-runaway-process-monitor
+          host: "110"
+          team: ops
+          alert_category: host_resource
+          notification_type: TYPE-1
+          auto_repair: "false"
+        annotations:
+          summary: "110 runaway process textfile metric missing"
+          description: "110 is not emitting host_runaway_process.prom; orphan Chrome / Playwright smoke and CI load classification is currently unobservable."
+          runbook: "Run Ansible `110-devops.yml --tags textfile_exporters` or manually deploy scripts/ops/host-runaway-process-exporter.py, then confirm /home/wooo/node_exporter_textfiles/host_runaway_process.prom exists."
+
+      - alert: HostRunawayProcessMonitorStale
+        expr: time() - awoooi_host_runaway_process_last_run_timestamp{host="110"} > 600
+        for: 10m
+        labels:
+          severity: warning
+          layer: systemd-110
+          component: host-runaway-process-monitor
+          host: "110"
+          team: ops
+          alert_category: host_resource
+          notification_type: TYPE-1
+          auto_repair: "false"
+        annotations:
+          summary: "110 runaway process monitor stale"
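+          description: "The host runaway process exporter has not updated for more than 10 minutes; during CPU saturation, orphan smoke processes cannot be automatically distinguished from legitimate CI."
+          runbook: "SSH to 110 and check the crontab, /tmp/awoooi-host-runaway-process-exporter.cron.log, and the node-exporter textfile collector."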
+
+      - alert: HostOrphanBrowserSmokeHighCpu
+        expr: |
+          (awoooi_host_runaway_browser_orphan_group_count{host="110"} > 0)
+          and on(host, rule)
+          (awoooi_host_runaway_browser_orphan_cpu_percent{host="110"} >= 100)
+        for: 10m
+        labels:
+          severity: critical
+          layer: systemd-110
+          component: host-runaway-process
+          host: "110"
+          team: ops
+          alert_category: host_resource
+          notification_type: TYPE-3
+          auto_repair: "false"
+          mcp_provider: "ssh_host"
+          host_type: "bare_metal"
+        annotations:
+          summary: "110 orphan browser smoke process group CPU too high"
+          description: "Detected {{ $labels.rule }} orphan process groups with total CPU >= 100% sustained for 10 minutes. This is usually leftover cross-project headless Chrome / Playwright smoke, not a Docker/Sentry/Harbor incident."
+          runbook: "First run `scripts/ops/host-runaway-process-remediation.py --rule {{ $labels.rule }}` to produce a dry-run; only after confirming active Gitea Actions, owner, maintenance window, and evidence ref may SIGTERM be sent with --apply --confirm-apply. Do not default to SIGKILL, Docker restart, systemctl restart, or firewall changes."
+
+      - alert: HostRunawayProcessRemediationUnexpectedlyAuthorized
+        expr: awoooi_host_runaway_process_remediation_authorized{host="110"} > 0
+        for: 1m
+        labels:
+          severity: critical
+          layer: systemd-110
+          component: host-runaway-process
+          host: "110"
+          team: ops
+          alert_category: host_resource
+          notification_type: TYPE-3
+          auto_repair: "false"
+        annotations:
+          summary: "110 runaway process monitor exposed runtime remediation authorization"
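+          description: "The host-runaway-process exporter must always remain read-only; remediation_authorized > 0 means someone turned the monitor into an executor or wired in the runtime gate by mistake."
+          runbook: "Roll back the exporter immediately and inspect the Git diff, cron, the Ansible role, and /home/wooo/scripts/host-runaway-process-exporter.py. Actual remediation may only be run by the gated remediation helper after human approval."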
+
+      - alert: HostCiRunnerLoadSaturation
+        expr: |
+          (awoooi_host_load5_per_core{host="110"} > 1.0)
+          and on(host)
+          (awoooi_host_gitea_actions_active_container_count{host="110"} > 0)
+        for: 15m
+        labels:
+          severity: warning
+          layer: systemd-110
+          component: gitea-actions-runner
+          host: "110"
+          team: ops
+          alert_category: host_resource
+          notification_type: TYPE-1
+          auto_repair: "false"
+          mcp_provider: "ssh_host"
+          host_type: "bare_metal"
+        annotations:
+          summary: "110 high load is currently explained by active Gitea Actions"
+          description: "load5/core > 1.0 and Gitea Actions task containers exist; if the orphan browser metrics are 0, treat this first as short-term CI build/test load and do not misclassify it as a Docker/Sentry/Harbor incident."
+          runbook: "Check Gitea runs, the runner queue, and `docker ps --filter name=GITEA-ACTIONS-TASK-`; follow the runner drain / cleanup PlayBook only when a job is stuck, exceeds the workflow timeout, or has been cancelled by its owner."
+
   # =========================================================================
   # K8s 叢集告警 (kubernetes_alerts)
   # =========================================================================
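
Reviewer note: the two classification alerts above (`HostOrphanBrowserSmokeHighCpu` and `HostCiRunnerLoadSaturation`) can be read as a small decision table. The sketch below restates that table in Python for illustration only. The function name `triage_high_load` and its return labels are hypothetical and not part of this change, and the sketch ignores the `for:` hold durations that Prometheus applies before firing.

```python
def triage_high_load(
    load5_per_core: float,
    active_ci_containers: int,
    orphan_groups: int,
    orphan_cpu_percent: float,
) -> str:
    """Hypothetical restatement of the alert expressions as one decision."""
    # Mirrors HostOrphanBrowserSmokeHighCpu: at least one orphan group and
    # a combined CPU of 100% or more for that rule.
    if orphan_groups > 0 and orphan_cpu_percent >= 100:
        return "orphan_browser_smoke"
    # Mirrors HostCiRunnerLoadSaturation: load5/core above 1.0 while Gitea
    # Actions task containers are running. The exporter reports -1 when
    # Docker is unavailable, which this comparison treats as "no CI".
    if load5_per_core > 1.0 and active_ci_containers > 0:
        return "ci_runner_load"
    return "unclassified"
```

In the real rules both alerts can fire at the same time; the ordering here only reflects the runbook advice to rule out orphan browsers before attributing load to CI.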
diff --git a/scripts/ops/host-runaway-process-exporter.py b/scripts/ops/host-runaway-process-exporter.py
new file mode 100755
index 00000000..c022f13d
--- /dev/null
+++ b/scripts/ops/host-runaway-process-exporter.py
@@ -0,0 +1,390 @@
+#!/usr/bin/env python3
+"""
+Host runaway process textfile exporter for AWOOOI AIOps.
+
+This exporter is read-only. It classifies orphaned headless browser/smoke
+process groups separately from legitimate Gitea Actions load so host CPU alerts
+can point to a concrete PlayBook instead of a generic "high CPU" symptom.
+"""
+
+from __future__ import annotations
+
+import argparse
+import os
+import re
+import subprocess
+import tempfile
+import time
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Iterable
+
+
+TEXTFILE_DIR = Path(os.environ.get("NODE_EXPORTER_TEXTFILE_DIR", "/var/lib/node_exporter/textfile_collector"))
+OUTPUT_NAME = "host_runaway_process.prom"
+HOST_LABEL = os.environ.get("AIOPS_HOST_LABEL", os.uname().nodename)
+LABEL_RE = re.compile(r'["\\\n]')
+
+
+@dataclass(frozen=True)
+class ProcessRow:
+    pid: int
+    ppid: int
+    pgid: int
+    sid: int
+    etimes: int
+    pcpu: float
+    stat: str
+    comm: str
+    args: str
+
+
+@dataclass(frozen=True)
+class RunawayRule:
+    rule_id: str
+    command_pattern: re.Pattern[str]
+    context_pattern: re.Pattern[str]
+
+
+@dataclass(frozen=True)
+class ProcessGroup:
+    rule_id: str
+    pgid: int
+    rows: tuple[ProcessRow, ...]
+    cpu_percent: float
+    oldest_age_seconds: int
+    orphan_reason: str
+    sample_comm: str
+
+
+DEFAULT_RULES = (
+    RunawayRule(
+        "stockplatform_headless_smoke",
+        re.compile(r"(chrome|chromium|playwright)", re.IGNORECASE),
+        re.compile(r"stockplatform-review-bulk-ux|/tmp/stockplatform", re.IGNORECASE),
+    ),
+    RunawayRule(
+        "headless_browser_smoke",
+        re.compile(r"(chrome|chromium|playwright)", re.IGNORECASE),
+        re.compile(r"--headless|--user-data-dir=/tmp|/tmp/.*(smoke|ux|playwright)", re.IGNORECASE),
+    ),
+)
+
+
+def escape_label(value: str) -> str:
+    return LABEL_RE.sub(lambda m: {"\n": r"\n", "\\": r"\\", '"': r"\""}[m.group(0)], value)
+
+
+def run_text(command: list[str], timeout: int = 20) -> str:
+    return subprocess.run(command, check=True, capture_output=True, text=True, timeout=timeout).stdout
+
+
+def read_ps_text(ps_file: Path | None = None) -> str:
+    if ps_file:
+        return ps_file.read_text(encoding="utf-8")
+    linux_command = [
+        "ps",
+        "-eo",
+        "pid=,ppid=,pgid=,sid=,etimes=,pcpu=,stat=,comm=,args=",
+    ]
+    try:
+        return run_text(linux_command)
+    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
+        return run_text(
+            [
+                "ps",
+                "-axo",
+                "pid=,ppid=,pgid=,sess=,etime=,pcpu=,stat=,comm=,command=",
+            ]
+        )
+
+
+def elapsed_to_seconds(value: str) -> int:
+    try:
+        return int(float(value))
+    except ValueError:
+        pass
+
+    days = 0
+    clock = value
+    if "-" in value:
+        raw_days, clock = value.split("-", 1)
+        days = int(raw_days)
+    parts = [int(part) for part in clock.split(":")]
+    if len(parts) == 3:
+        hours, minutes, seconds = parts
+    elif len(parts) == 2:
+        hours = 0
+        minutes, seconds = parts
+    else:
+        hours = 0
+        minutes = 0
+        seconds = parts[0]
+    return days * 86400 + hours * 3600 + minutes * 60 + seconds
+
+
+def parse_ps_rows(text: str) -> list[ProcessRow]:
+    rows: list[ProcessRow] = []
+    for line in text.splitlines():
+        raw = line.strip()
+        if not raw:
+            continue
+        parts = raw.split(None, 8)
+        if len(parts) < 9:
+            continue
+        try:
+            rows.append(
+                ProcessRow(
+                    pid=int(parts[0]),
+                    ppid=int(parts[1]),
+                    pgid=int(parts[2]),
+                    sid=int(parts[3]),
+                    etimes=elapsed_to_seconds(parts[4]),
+                    pcpu=float(parts[5]),
+                    stat=parts[6],
+                    comm=parts[7],
+                    args=parts[8],
+                )
+            )
+        except ValueError:
+            continue
+    return rows
+
+
+def matching_rule(row: ProcessRow, rules: Iterable[RunawayRule] = DEFAULT_RULES) -> str | None:
+    haystack = f"{row.comm} {row.args}"
+    for rule in rules:
+        if rule.command_pattern.search(haystack) and rule.context_pattern.search(haystack):
+            return rule.rule_id
+    return None
+
+
+def orphan_reason(rows: list[ProcessRow], all_pids: set[int]) -> str | None:
+    if any(row.ppid == 1 for row in rows):
+        return "ppid_1"
+    pgid = rows[0].pgid
+    if pgid not in all_pids:
+        return "missing_group_leader"
+    return None
+
+
+def classify_groups(
+    rows: list[ProcessRow],
+    *,
+    min_age_seconds: int,
+    min_cpu_percent: float,
+) -> list[ProcessGroup]:
+    all_pids = {row.pid for row in rows}
+    grouped: dict[tuple[str, int], list[ProcessRow]] = {}
+    for row in rows:
+        rule_id = matching_rule(row)
+        if rule_id is None:
+            continue
+        grouped.setdefault((rule_id, row.pgid), []).append(row)
+
+    groups: list[ProcessGroup] = []
+    for (rule_id, pgid), members in grouped.items():
+        reason = orphan_reason(members, all_pids)
+        if reason is None:
+            continue
+        oldest = max(row.etimes for row in members)
+        cpu_percent = sum(row.pcpu for row in members)
+        if oldest < min_age_seconds or cpu_percent < min_cpu_percent:
+            continue
+        sample_comm = sorted({row.comm for row in members})[0][:48]
+        groups.append(
+            ProcessGroup(
+                rule_id=rule_id,
+                pgid=pgid,
+                rows=tuple(sorted(members, key=lambda row: row.pid)),
+                cpu_percent=cpu_percent,
+                oldest_age_seconds=oldest,
+                orphan_reason=reason,
+                sample_comm=sample_comm,
+            )
+        )
+    return sorted(groups, key=lambda group: (-group.cpu_percent, group.rule_id, group.pgid))
+
+
+def active_gitea_action_containers(docker_file: Path | None = None) -> int:
+    try:
+        if docker_file:
+            names = docker_file.read_text(encoding="utf-8").splitlines()
+        else:
+            names = run_text(["docker", "ps", "--format", "{{.Names}}"], timeout=10).splitlines()
+    except Exception:
+        return -1
+    return sum(1 for name in names if "GITEA-ACTIONS-TASK-" in name)
+
+
+def load5_per_core() -> float:
+    try:
+        load5 = float(Path("/proc/loadavg").read_text(encoding="utf-8").split()[1])
+    except Exception:
+        try:
+            load5 = os.getloadavg()[1]
+        except OSError:
+            return 0.0
+    cores = os.cpu_count() or 1
+    return load5 / cores
+
+
+def swap_used_ratio(meminfo_file: Path | None = None) -> float:
+    path = meminfo_file or Path("/proc/meminfo")
+    try:
+        values: dict[str, float] = {}
+        for line in path.read_text(encoding="utf-8").splitlines():
+            key, _, raw = line.partition(":")
+            if key in {"SwapTotal", "SwapFree"}:
+                values[key] = float(raw.strip().split()[0]) * 1024
+        total = values.get("SwapTotal", 0.0)
+        free = values.get("SwapFree", 0.0)
+        if total <= 0:
+            return 0.0
+        return max(0.0, min(1.0, (total - free) / total))
+    except Exception:
+        return 0.0
+
+
+def render_metrics(
+    *,
+    host: str,
+    groups: list[ProcessGroup],
+    active_action_containers: int,
+    min_age_seconds: int,
+    min_cpu_percent: float,
+    now: int,
+    load_ratio: float,
+    swap_ratio: float,
+) -> str:
+    labels_host = f'host="{escape_label(host)}"'
+    rule_ids = sorted({rule.rule_id for rule in DEFAULT_RULES})
+    by_rule = {rule_id: [group for group in groups if group.rule_id == rule_id] for rule_id in rule_ids}
+    lines = [
+        "# HELP awoooi_host_runaway_process_monitor_up Whether the host runaway process exporter completed.",
+        "# TYPE awoooi_host_runaway_process_monitor_up gauge",
+        "# HELP awoooi_host_runaway_process_last_run_timestamp Unix timestamp of the last exporter run.",
+        "# TYPE awoooi_host_runaway_process_last_run_timestamp gauge",
+        "# HELP awoooi_host_runaway_browser_orphan_group_count Count of orphaned browser/smoke process groups above thresholds.",
+        "# TYPE awoooi_host_runaway_browser_orphan_group_count gauge",
+        "# HELP awoooi_host_runaway_browser_orphan_process_count Count of orphaned browser/smoke processes above thresholds.",
+        "# TYPE awoooi_host_runaway_browser_orphan_process_count gauge",
+        "# HELP awoooi_host_runaway_browser_orphan_cpu_percent Sum CPU percent for orphaned browser/smoke process groups above thresholds.",
+        "# TYPE awoooi_host_runaway_browser_orphan_cpu_percent gauge",
+        "# HELP awoooi_host_runaway_browser_orphan_oldest_age_seconds Oldest age of matching orphaned process groups.",
+        "# TYPE awoooi_host_runaway_browser_orphan_oldest_age_seconds gauge",
+        "# HELP awoooi_host_runaway_browser_orphan_group_cpu_percent CPU percent for an individual orphaned browser/smoke process group.",
+        "# TYPE awoooi_host_runaway_browser_orphan_group_cpu_percent gauge",
+        "# HELP awoooi_host_runaway_browser_orphan_group_info Metadata for an individual orphaned browser/smoke process group.",
+        "# TYPE awoooi_host_runaway_browser_orphan_group_info gauge",
+        "# HELP awoooi_host_gitea_actions_active_container_count Active Gitea Actions task containers visible on the host, -1 when Docker is unavailable.",
+        "# TYPE awoooi_host_gitea_actions_active_container_count gauge",
+        "# HELP awoooi_host_load5_per_core Host load5 divided by CPU core count.",
+        "# TYPE awoooi_host_load5_per_core gauge",
+        "# HELP awoooi_host_swap_used_ratio Host swap used ratio from /proc/meminfo.",
+        "# TYPE awoooi_host_swap_used_ratio gauge",
+        "# HELP awoooi_host_runaway_process_remediation_authorized Static guardrail: remediation is not authorized by this exporter.",
+        "# TYPE awoooi_host_runaway_process_remediation_authorized gauge",
+        f"awoooi_host_runaway_process_monitor_up{{{labels_host},mode=\"read_only\"}} 1",
+        f"awoooi_host_runaway_process_last_run_timestamp{{{labels_host}}} {now}",
+        f"awoooi_host_gitea_actions_active_container_count{{{labels_host}}} {active_action_containers}",
+        f"awoooi_host_load5_per_core{{{labels_host}}} {load_ratio:.6f}",
+        f"awoooi_host_swap_used_ratio{{{labels_host}}} {swap_ratio:.6f}",
+        f"awoooi_host_runaway_process_remediation_authorized{{{labels_host}}} 0",
+    ]
+
+    for rule_id in rule_ids:
+        rule_labels = (
+            f'{labels_host},rule="{escape_label(rule_id)}",'
+            f'min_age_seconds="{min_age_seconds}",min_cpu_percent="{min_cpu_percent:g}"'
+        )
+        rule_groups = by_rule[rule_id]
+        lines.append(f"awoooi_host_runaway_browser_orphan_group_count{{{rule_labels}}} {len(rule_groups)}")
+        lines.append(
+            f"awoooi_host_runaway_browser_orphan_process_count{{{rule_labels}}} "
+            f"{sum(len(group.rows) for group in rule_groups)}"
+        )
+        lines.append(
+            f"awoooi_host_runaway_browser_orphan_cpu_percent{{{rule_labels}}} "
+            f"{sum(group.cpu_percent for group in rule_groups):.6f}"
+        )
+        lines.append(
+            f"awoooi_host_runaway_browser_orphan_oldest_age_seconds{{{rule_labels}}} "
+            f"{max((group.oldest_age_seconds for group in rule_groups), default=0)}"
+        )
+
+    for group in groups[:20]:
+        group_labels = (
+            f'{labels_host},rule="{escape_label(group.rule_id)}",pgid="{group.pgid}",'
+            f'orphan_reason="{escape_label(group.orphan_reason)}",comm="{escape_label(group.sample_comm)}"'
+        )
+        lines.append(f"awoooi_host_runaway_browser_orphan_group_cpu_percent{{{group_labels}}} {group.cpu_percent:.6f}")
+        lines.append(f"awoooi_host_runaway_browser_orphan_group_info{{{group_labels}}} 1")
+
+    return "\n".join(lines) + "\n"
+
+
+def collect(args: argparse.Namespace) -> str:
+    rows = parse_ps_rows(read_ps_text(args.ps_file))
+    groups = classify_groups(
+        rows,
+        min_age_seconds=args.min_age_seconds,
+        min_cpu_percent=args.min_cpu_percent,
+    )
+    return render_metrics(
+        host=args.host,
+        groups=groups,
+        active_action_containers=active_gitea_action_containers(args.docker_ps_file),
+        min_age_seconds=args.min_age_seconds,
+        min_cpu_percent=args.min_cpu_percent,
+        now=int(time.time()),
+        load_ratio=load5_per_core(),
+        swap_ratio=swap_used_ratio(args.meminfo_file),
+    )
+
+
+def write_textfile(payload: str, textfile_dir: Path, output_name: str) -> Path:
+    textfile_dir.mkdir(parents=True, exist_ok=True)
+    with tempfile.NamedTemporaryFile("w", dir=textfile_dir, delete=False, encoding="utf-8") as tmp:
+        tmp.write(payload)
+        tmp_path = Path(tmp.name)
+    output_path = textfile_dir / output_name
+    tmp_path.replace(output_path)
+    output_path.chmod(0o644)
+    return output_path
+
+
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Export AWOOOI host runaway process metrics.")
+    parser.add_argument("--host", default=HOST_LABEL)
+    parser.add_argument("--textfile-dir", type=Path, default=TEXTFILE_DIR)
+    parser.add_argument("--output-name", default=OUTPUT_NAME)
+    parser.add_argument("--stdout", action="store_true", help="Print metrics instead of writing the textfile.")
+    parser.add_argument("--ps-file", type=Path, help="Use a fixture file instead of running ps.")
+    parser.add_argument("--docker-ps-file", type=Path, help="Use a fixture file instead of docker ps.")
+    parser.add_argument("--meminfo-file", type=Path, help="Use a fixture file instead of /proc/meminfo.")
+    parser.add_argument(
+        "--min-age-seconds",
+        type=int,
+        default=int(os.environ.get("AIOPS_RUNAWAY_PROCESS_MIN_AGE_SECONDS", "1800")),
+    )
+    parser.add_argument(
+        "--min-cpu-percent",
+        type=float,
+        default=float(os.environ.get("AIOPS_RUNAWAY_PROCESS_MIN_CPU_PERCENT", "50")),
+    )
+    return parser.parse_args()
+
+
+def main() -> None:
+    args = parse_args()
+    payload = collect(args)
+    if args.stdout:
+        print(payload, end="")
+        return
+    output_path = write_textfile(payload, args.textfile_dir, args.output_name)
+    print(f"HOST_RUNAWAY_PROCESS_EXPORTER_OK output={output_path}")
+
+
+if __name__ == "__main__":
+    main()
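
Reviewer note: the exporter accepts both Linux `etimes` (plain seconds) and the BSD `etime` form (`[[dd-]hh:]mm:ss`) in the elapsed-time column. The following standalone sketch shows the same conversion in isolation so the accepted formats are easy to check. `etime_to_seconds` is a hypothetical name used only here; the script's own function is `elapsed_to_seconds`.

```python
def etime_to_seconds(value: str) -> int:
    """Convert a ps elapsed-time field to seconds (illustrative helper)."""
    # Linux `etimes` is already a plain integer number of seconds.
    if value.isdigit():
        return int(value)
    # BSD `etime` may carry a leading "dd-" day count.
    days, _, clock = value.rpartition("-")
    fields = [int(part) for part in clock.split(":")]
    # Left-pad missing hour/minute fields with zeros.
    while len(fields) < 3:
        fields.insert(0, 0)
    hours, minutes, seconds = fields
    return int(days or 0) * 86400 + hours * 3600 + minutes * 60 + seconds
```

With the default `min_age_seconds` of 1800, an `etime` of `29:59` (1799 seconds) stays below the threshold, while `30:00` meets it.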
diff --git a/scripts/ops/host-runaway-process-remediation.py b/scripts/ops/host-runaway-process-remediation.py
new file mode 100755
index 00000000..928306a7
--- /dev/null
+++ b/scripts/ops/host-runaway-process-remediation.py
@@ -0,0 +1,165 @@
+#!/usr/bin/env python3
+"""
+Gated remediation helper for AWOOOI host runaway process groups.
+
+Default mode is dry-run. Applying SIGTERM requires explicit owner approval,
+maintenance window, evidence reference, and --confirm-apply. This script is a
+PlayBook primitive, not a background auto-kill daemon.
+"""
+
+from __future__ import annotations
+
+import argparse
+import importlib.util
+import json
+import os
+import signal
+import sys
+import time
+from pathlib import Path
+from types import ModuleType
+
+
+EXPORTER_PATH = Path(__file__).with_name("host-runaway-process-exporter.py")
+
+
+def load_exporter() -> ModuleType:
+    spec = importlib.util.spec_from_file_location("host_runaway_process_exporter", EXPORTER_PATH)
+    if spec is None or spec.loader is None:
+        raise RuntimeError(f"cannot load exporter module: {EXPORTER_PATH}")
+    module = importlib.util.module_from_spec(spec)
+    sys.modules[spec.name] = module
+    spec.loader.exec_module(module)
+    return module
+
+
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Dry-run or gated SIGTERM for AWOOOI runaway process groups.")
+    parser.add_argument("--host", default=os.environ.get("AIOPS_HOST_LABEL", os.uname().nodename))
+    parser.add_argument("--rule", help="Limit candidates to one rule id. Required with --apply.")
+    parser.add_argument("--ps-file", type=Path, help="Use a fixture ps file for tests or offline review.")
+    parser.add_argument("--min-age-seconds", type=int, default=1800)
+    parser.add_argument("--min-cpu-percent", type=float, default=50)
+    parser.add_argument("--apply", action="store_true", help="Send SIGTERM to matching process groups.")
+    parser.add_argument("--confirm-apply", action="store_true", help="Required together with --apply.")
+    parser.add_argument("--owner-approval-id", default="")
+    parser.add_argument("--maintenance-window-id", default="")
+    parser.add_argument("--evidence-ref", default="")
+    parser.add_argument("--wait-seconds", type=int, default=0, help="Optional wait after SIGTERM before re-reading ps.")
+    return parser.parse_args()
+
+
+def validate_apply_args(args: argparse.Namespace) -> None:
+    if not args.apply:
+        return
+    missing = []
+    if not args.confirm_apply:
+        missing.append("--confirm-apply")
+    if not args.rule:
+        missing.append("--rule")
+    if not args.owner_approval_id:
+        missing.append("--owner-approval-id")
+    if not args.maintenance_window_id:
+        missing.append("--maintenance-window-id")
+    if not args.evidence_ref:
+        missing.append("--evidence-ref")
+    if missing:
+        raise SystemExit(
+            "Refusing apply; missing required gates: "
+            + ", ".join(missing)
+            + ". Use dry-run output for the PlayBook packet first."
+        )
+
+
+def current_process_group() -> int:
+    try:
+        return os.getpgrp()
+    except Exception:
+        return -1
+
+
+def main() -> None:
+    args = parse_args()
+    validate_apply_args(args)
+    exporter = load_exporter()
+    rows = exporter.parse_ps_rows(exporter.read_ps_text(args.ps_file))
+    groups = exporter.classify_groups(
+        rows,
+        min_age_seconds=args.min_age_seconds,
+        min_cpu_percent=args.min_cpu_percent,
+    )
+    if args.rule:
+        groups = [group for group in groups if group.rule_id == args.rule]
+
+    own_pgrp = current_process_group()
+    candidates = []
+    for group in groups:
+        blocked_reason = None
+        if group.pgid <= 1:
+            blocked_reason = "unsafe_pgid"
+        elif group.pgid == own_pgrp:
+            blocked_reason = "own_process_group"
+        candidates.append(
+            {
+                "rule": group.rule_id,
+                "pgid": group.pgid,
+                "process_count": len(group.rows),
+                "cpu_percent": round(group.cpu_percent, 3),
+                "oldest_age_seconds": group.oldest_age_seconds,
+                "orphan_reason": group.orphan_reason,
+                "sample_comm": group.sample_comm,
+                "blocked_reason": blocked_reason,
+                "action": "skip" if blocked_reason else ("sigterm" if args.apply else "dry_run"),
+            }
+        )
+
+    signaled: list[int] = []
+    if args.apply:
+        for candidate in candidates:
+            if candidate["blocked_reason"]:
+                continue
+            os.killpg(int(candidate["pgid"]), signal.SIGTERM)
+            signaled.append(int(candidate["pgid"]))
+
+    remaining_after_wait = None
+    if args.apply and args.wait_seconds > 0:
+        time.sleep(args.wait_seconds)
+        fresh_rows = exporter.parse_ps_rows(exporter.read_ps_text(args.ps_file))
+        fresh_groups = exporter.classify_groups(
+            fresh_rows,
+            min_age_seconds=args.min_age_seconds,
+            min_cpu_percent=args.min_cpu_percent,
+        )
+        remaining_after_wait = [
+            group.pgid for group in fresh_groups if not args.rule or group.rule_id == args.rule
+        ]
+
+    payload = {
+        "schema_version": "host_runaway_process_remediation_v1",
+        "host": args.host,
+        "mode": "apply_sigterm" if args.apply else "dry_run",
+        "runtime_gate": 1 if args.apply else 0,
+        "owner_approval_id": args.owner_approval_id if args.apply else None,
+        "maintenance_window_id": args.maintenance_window_id if args.apply else None,
+        "evidence_ref": args.evidence_ref if args.apply else None,
+        "min_age_seconds": args.min_age_seconds,
+        "min_cpu_percent": args.min_cpu_percent,
+        "candidate_count": len(candidates),
+        "signaled_process_group_count": len(signaled),
+        "signaled_process_groups": signaled,
+        "remaining_after_wait": remaining_after_wait,
+        "candidates": candidates,
+        "forbidden_without_gates": [
+            "sigkill",
+            "docker_restart",
+            "systemctl_restart",
+            "nginx_reload",
+            "firewall_change",
+            "secret_collection",
+        ],
+    }
+    print(json.dumps(payload, ensure_ascii=False, indent=2, sort_keys=True))
+
+
+if __name__ == "__main__":
+    main()
diff --git a/scripts/ops/tests/test_host_runaway_process_exporter.py b/scripts/ops/tests/test_host_runaway_process_exporter.py
new file mode 100644
index 00000000..336a0d8c
--- /dev/null
+++ b/scripts/ops/tests/test_host_runaway_process_exporter.py
@@ -0,0 +1,144 @@
+from __future__ import annotations
+
+import importlib.util
+import subprocess
+import sys
+from pathlib import Path
+
+
+SCRIPT_ROOT = Path(__file__).resolve().parents[1]
+EXPORTER_PATH = SCRIPT_ROOT / "host-runaway-process-exporter.py"
+REMEDIATION_PATH = SCRIPT_ROOT / "host-runaway-process-remediation.py"
+
+
+def load_exporter():
+    spec = importlib.util.spec_from_file_location("host_runaway_process_exporter", EXPORTER_PATH)
+    assert spec and spec.loader
+    module = importlib.util.module_from_spec(spec)
+    sys.modules[spec.name] = module
+    spec.loader.exec_module(module)
+    return module
+
+
+def test_classifies_orphan_stockplatform_headless_group() -> None:
+    exporter = load_exporter()
+    rows = exporter.parse_ps_rows(
+        """
+        100 1 100 100 7200 65.0 S chrome /opt/chrome/chrome --headless --user-data-dir=/tmp/stockplatform-review-bulk-ux-aa
+        101 100 100 100 7190 55.0 S chromium /opt/chrome/chromium --type=renderer /tmp/stockplatform-review-bulk-ux-aa
+        200 10 200 200 600 90.0 S node pnpm --filter @awoooi/web build
+        """
+    )
+
+    groups = exporter.classify_groups(rows, min_age_seconds=1800, min_cpu_percent=50)
+
+    assert len(groups) == 1
+    assert groups[0].rule_id == "stockplatform_headless_smoke"
+    assert groups[0].pgid == 100
+    assert groups[0].orphan_reason == "ppid_1"
+    assert groups[0].cpu_percent == 120.0
+    assert len(groups[0].rows) == 2
+
+
+def test_ignores_non_orphan_or_young_browser_processes() -> None:
+    exporter = load_exporter()
+    rows = exporter.parse_ps_rows(
+        """
+        100 99 100 100 7200 65.0 S chrome /opt/chrome/chrome --headless --user-data-dir=/tmp/stockplatform-review-bulk-ux-aa
+        101 100 100 100 7190 55.0 S chromium /opt/chrome/chromium /tmp/stockplatform-review-bulk-ux-aa
+        300 1 300 300 60 120.0 S chrome /opt/chrome/chrome --headless --user-data-dir=/tmp/stockplatform-review-bulk-ux-bb
+        """
+    )
+
+    assert exporter.classify_groups(rows, min_age_seconds=1800, min_cpu_percent=50) == []
+
+
+def test_parses_bsd_elapsed_time_for_local_smoke() -> None:
+    exporter = load_exporter()
+    rows = exporter.parse_ps_rows(
+        """
+        100 1 100 100 01:00:00 65.0 S chrome /opt/chrome/chrome --headless --user-data-dir=/tmp/stockplatform-review-bulk-ux-aa
+        101 100 100 100 2-00:00:10 55.0 S chromium /opt/chrome/chromium /tmp/stockplatform-review-bulk-ux-aa
+        """
+    )
+
+    assert rows[0].etimes == 3600
+    assert rows[1].etimes == 172810
+
+
+def test_renders_ci_load_and_swap_without_authorizing_repair() -> None:
+    exporter = load_exporter()
+    groups = exporter.classify_groups(
+        exporter.parse_ps_rows(
+            "100 1 100 100 7200 65.0 S chrome /opt/chrome/chrome --headless --user-data-dir=/tmp/stockplatform-review-bulk-ux-aa"
+        ),
+        min_age_seconds=1800,
+        min_cpu_percent=50,
+    )
+    metrics = exporter.render_metrics(
+        host="110",
+        groups=groups,
+        active_action_containers=3,
+        min_age_seconds=1800,
+        min_cpu_percent=50,
+        now=123,
+        load_ratio=1.25,
+        swap_ratio=1.0,
+    )
+
+    assert 'awoooi_host_runaway_process_monitor_up{host="110",mode="read_only"} 1' in metrics
+    assert 'awoooi_host_gitea_actions_active_container_count{host="110"} 3' in metrics
+    assert 'awoooi_host_swap_used_ratio{host="110"} 1.000000' in metrics
+    assert 'awoooi_host_runaway_process_remediation_authorized{host="110"} 0' in metrics
+    assert 'rule="stockplatform_headless_smoke"' in metrics
+
+
+def test_remediation_defaults_to_dry_run(tmp_path: Path) -> None:
+    ps_file = tmp_path / "ps.txt"
+    ps_file.write_text(
+        "100 1 100 100 7200 65.0 S chrome /opt/chrome/chrome --headless --user-data-dir=/tmp/stockplatform-review-bulk-ux-aa\n",
+        encoding="utf-8",
+    )
+
+    result = subprocess.run(
+        [
+            sys.executable,
+            str(REMEDIATION_PATH),
+            "--ps-file",
+            str(ps_file),
+            "--rule",
+            "stockplatform_headless_smoke",
+        ],
+        check=True,
+        capture_output=True,
+        text=True,
+    )
+
+    assert '"mode": "dry_run"' in result.stdout
+    assert '"runtime_gate": 0' in result.stdout
+    assert '"action": "dry_run"' in result.stdout
+
+
+def test_remediation_refuses_apply_without_gates(tmp_path: Path) -> None:
+    ps_file = tmp_path / "ps.txt"
+    ps_file.write_text(
+        "100 1 100 100 7200 65.0 S chrome /opt/chrome/chrome --headless --user-data-dir=/tmp/stockplatform-review-bulk-ux-aa\n",
+        encoding="utf-8",
+    )
+
+    result = subprocess.run(
+        [
+            sys.executable,
+            str(REMEDIATION_PATH),
+            "--ps-file",
+            str(ps_file),
+            "--apply",
+            "--rule",
+            "stockplatform_headless_smoke",
+        ],
+        capture_output=True,
+        text=True,
+    )
+
+    assert result.returncode != 0
+    assert "Refusing apply" in result.stderr
diff --git a/scripts/reboot-recovery/reboot-recovery-readiness-audit.sh b/scripts/reboot-recovery/reboot-recovery-readiness-audit.sh
index 19095d14..dd61160b 100755
--- a/scripts/reboot-recovery/reboot-recovery-readiness-audit.sh
+++ b/scripts/reboot-recovery/reboot-recovery-readiness-audit.sh
@@ -193,6 +193,8 @@ require_file scripts/ops/docker-stats-textfile-exporter.py "Docker stats textfil
 require_file scripts/ops/systemd-units-textfile-exporter.py "Systemd units textfile exporter"
 require_file scripts/ops/storage-health-textfile-exporter.py "Storage health textfile exporter"
 require_file scripts/ops/backup-health-textfile-exporter.py "Backup health textfile exporter"
+require_file scripts/ops/host-runaway-process-exporter.py "Host runaway process textfile exporter"
+require_file scripts/ops/host-runaway-process-remediation.py "Host runaway process gated remediation helper"
 require_file scripts/ops/backup-alert-label-contract-check.py "Backup alert label contract check"
 require_file scripts/ops/backup-alert-live-visibility-check.py "Backup alert live visibility check"
 require_file scripts/ops/recovery-scorecard-contract-check.py "Recovery scorecard contract check"
@@ -270,6 +272,11 @@ require_pattern "awoooi_backup_offsite_full_sync_enabled" scripts/ops/backup-hea
 require_pattern "awoooi_backup_retention_latest_only" scripts/ops/backup-health-textfile-exporter.py "110 latest-only retention textfile metric"
 require_pattern "awoooi_backup_cron_active_duplicate_count" scripts/ops/backup-health-textfile-exporter.py "110 backup cron duplicate textfile metric"
 require_pattern "awoooi_backup_cron_singular_entry_ok" scripts/ops/backup-health-textfile-exporter.py "110 backup cron singular textfile metric"
+require_pattern "awoooi_host_runaway_process_monitor_up" scripts/ops/host-runaway-process-exporter.py "110 runaway process monitor metric"
+require_pattern "awoooi_host_runaway_process_remediation_authorized" scripts/ops/host-runaway-process-exporter.py "110 runaway process remediation authorization guard metric"
+require_pattern "owner-approval-id" scripts/ops/host-runaway-process-remediation.py "Runaway process remediation owner approval gate"
+require_pattern "maintenance-window-id" scripts/ops/host-runaway-process-remediation.py "Runaway process remediation maintenance window gate"
+require_pattern "evidence-ref" scripts/ops/host-runaway-process-remediation.py "Runaway process remediation evidence gate"
 require_pattern "textfile_exporters" infra/ansible/playbooks/188-ai-web.yml "188 textfile exporters tag"
 require_pattern "backup-momo-188-pg.sh" infra/ansible/playbooks/188-ai-web.yml "188 momo PostgreSQL backup deploy"
 require_pattern "/home/ollama/bin/momo-pg-backup.sh" infra/ansible/playbooks/188-ai-web.yml "188 host-owned momo backup entrypoint"
@@ -291,6 +298,10 @@ require_pattern "awoooi_cold_start_blocker_reason" ops/monitoring/alerts-unified
 require_pattern "docker_container_cpu_cores" ops/monitoring/alerts-unified.yml "Docker CPU alert metric"
 require_pattern "systemd_unit_watchdog_seconds" ops/monitoring/alerts-unified.yml "Systemd watchdog alert metric"
 require_pattern "awoooi_host_storage_error_count" ops/monitoring/alerts-unified.yml "Storage health alert metric"
+require_pattern "host_runaway_process_alerts" ops/monitoring/alerts-unified.yml "Host runaway process alert group"
+require_pattern "HostOrphanBrowserSmokeHighCpu" ops/monitoring/alerts-unified.yml "Host orphan browser smoke alert"
+require_pattern "HostCiRunnerLoadSaturation" ops/monitoring/alerts-unified.yml "Host CI runner load classification alert"
+require_pattern "awoooi_host_runaway_process_remediation_authorized" ops/monitoring/alerts-unified.yml "Host runaway process remediation guard alert metric"
 require_pattern "awoooi_backup_job_fresh" ops/monitoring/alerts-unified.yml "Backup freshness alert metric"
 require_pattern "awoooi_backup_integrity_fresh" ops/monitoring/alerts-unified.yml "Backup integrity alert metric"
 require_pattern "awoooi_backup_offsite_configured" ops/monitoring/alerts-unified.yml "Backup offsite alert metric"