From 228768ff687df5b02636e72e1e76ee823fb503ca Mon Sep 17 00:00:00 2001
From: Your Name
Date: Tue, 5 May 2026 14:31:59 +0800
Subject: [PATCH] docs(ops): record host baseline follow-up

---
 docs/LOGBOOK.md | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index dfba3860..62698580 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -1575,6 +1575,44 @@ psql $DATABASE_URL -f apps/api/migrations/cleanup_duplicate_deprecated_playbooks
 
 ---
 
+## 2026-05-05 (Taipei): 110/188 host sustained-overload baseline and systemd runner blind-spot fix
+
+**Trigger**: The commander asked for a fresh audit of the sustained CPU/load overload on hosts 110/188, and for confirmation of whether the CPU/memory settings Claude Code applied earlier caused services to hang.
+
+### Completed
+
+| Item | Result |
+|------|--------|
+| Docker Compose baseline | Added a textfile exporter for `docker_container_cpu_cores`, memory limit, and restart count; 100 Prometheus rules loaded |
+| 188 momo AutoHeal schema drift | `incidents.traceback_str` / `matched_playbook_id` / severity length fixed by migration; schema probe passes |
+| 110 systemd runner blind spot | Added `systemd-units-textfile-exporter.py`; runner restart/watchdog/quota state is now visible in Prometheus |
+| SystemdRunner alerts | Added `SystemdRunnerRestartSpike`, `SystemdRunnerWatchdogEnabled`, `SystemdRunnerMissingResourceQuota` |
+| AwoooI classification/rules | `SystemdRunner*` alerts are triaged early as `host_resource TYPE-3` and match `systemd_runner_baseline_alert`; the SSH diagnostic command can fill in `{unit}` |
+| Guardrail script | Added `scripts/ops/apply-runner-systemd-guardrails.sh`; dry-run by default, `--apply` requires sudo |
+
+### Live status
+
+- Load on 188 has settled back to roughly 2-4; no further `traceback_str` incident create failed errors have been seen.
+- 110 still runs `actions.runner.owenhytsai-awoooi.awoooi-110.service` with `WatchdogUSec=5min`, `NRestarts>8490`, and unlimited CPU/memory.
+- Fixing the 110 runner requires sudo: remove `watchdog.conf` and apply `CPUQuota=200%` / `MemoryMax=2G`.
+
+### Commits
+
+| commit | Description |
+|--------|-------------|
+| `fe618960` | systemd runner textfile exporter + Prometheus/inspector/runbook |
+| `34d1c76` | `SystemdRunner*` alert rule routing + sudo guardrail script |
+| `0e14935` | `SystemdRunner*` early classification + `{unit}` template variable |
+| `ab0f0a8` | deploy API image `runner-classify-20260505-0e14935` |
+
+### Next steps
+
+1. On 110, run `bash scripts/ops/apply-runner-systemd-guardrails.sh --apply` with sudo.
+2. Verify that the Prometheus alerts `SystemdRunnerWatchdogEnabled` / `SystemdRunnerMissingResourceQuota` clear.
+3. Watch whether load5/core on 110 stays below 1.5; if it is still high, tune Sentry ingestion/ClickHouse parts next.
+
+---
+
 ## 📍 2026-04-22: Dynamic system report, 5 new sections added (commit 9244c5e)
 
 ### Requirements