docs(ops): record host baseline follow-up
@@ -1575,6 +1575,44 @@ psql $DATABASE_URL -f apps/api/migrations/cleanup_duplicate_deprecated_playbooks
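---

## 2026-05-05 (Taipei): prolonged overload baseline for hosts 110/188 and closing the systemd runner blind spot

**Trigger**: The Commander asked for a fresh review of the prolonged CPU/load overload on hosts 110/188, and for confirmation of whether Claude Code's earlier CPU/memory configuration caused services to hang.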

### Completed

| Item | Result |
|------|--------|
| Docker Compose baseline | Added a textfile exporter for `docker_container_cpu_cores`, memory limit, and restart count; 100 Prometheus rules loaded |
| 188 momo AutoHeal schema drift | `incidents.traceback_str` / `matched_playbook_id` / severity length fixed by migration; schema probe passes |
| 110 systemd runner blind spot | Added `systemd-units-textfile-exporter.py`; runner restart/watchdog/quota are now visible in Prometheus |
| SystemdRunner alerts | Added `SystemdRunnerRestartSpike`, `SystemdRunnerWatchdogEnabled`, `SystemdRunnerMissingResourceQuota` |
| AwoooI classification/rules | `SystemdRunner*` is triaged early as `host_resource TYPE-3`, matches `systemd_runner_baseline_alert`, and the SSH diagnostic command can be filled in with `{unit}` |
| Guardrail script | Added `scripts/ops/apply-runner-systemd-guardrails.sh`; dry-run by default, `--apply` requires sudo |
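
The exporter row above can be illustrated with a minimal sketch. This is not the contents of `systemd-units-textfile-exporter.py`; it only assumes the exporter reads `KEY=VALUE` output from `systemctl show <unit>` and writes lines in the Prometheus text format for the node_exporter textfile collector. The metric names and helper functions below are hypothetical, as is the rule that `0`, `infinity`, or an empty value means "not set".

```python
from typing import Dict


def parse_systemctl_show(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines such as those printed by `systemctl show <unit>`."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that contain no '='
            props[key.strip()] = value.strip()
    return props


def _is_set(value: str) -> int:
    # Assumption: "0", "infinity", or an empty value means the limit is not configured.
    return 0 if value in ("", "0", "infinity") else 1


def _label(value: str) -> str:
    # Escape backslashes and double quotes for a Prometheus label value.
    return value.replace("\\", "\\\\").replace('"', '\\"')


def render_metrics(unit: str, props: Dict[str, str]) -> str:
    """Render hypothetical runner metrics in the Prometheus text format."""
    label = f'unit="{_label(unit)}"'
    restarts = props.get("NRestarts", "0")
    rows = [
        ("systemd_runner_restarts_total", int(restarts) if restarts.isdigit() else 0),
        ("systemd_runner_watchdog_enabled", _is_set(props.get("WatchdogUSec", ""))),
        ("systemd_runner_cpu_quota_set", _is_set(props.get("CPUQuotaPerSecUSec", ""))),
        ("systemd_runner_memory_max_set", _is_set(props.get("MemoryMax", ""))),
    ]
    return "".join(f"{name}{{{label}}} {value}\n" for name, value in rows)
```

In practice the rendered text would be written to a `*.prom` file in the collector's textfile directory, ideally via a temporary file and rename so that a scrape never sees a half-written file.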

### Live status

- Load on 188 has settled back to roughly 2-4; no further `traceback_str` incident create failed errors have been seen.
- 110 still runs `actions.runner.owenhytsai-awoooi.awoooi-110.service` with `WatchdogUSec=5min`, `NRestarts>8490`, and unlimited CPU/Memory.
- Fixing the 110 runner requires sudo: remove `watchdog.conf` and apply `CPUQuota=200%` / `MemoryMax=2G`.
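
The runner fix described in the last bullet can be sketched as a dry-run plan. This is an illustration only, not the logic of `apply-runner-systemd-guardrails.sh`. It assumes `watchdog.conf` is a drop-in under `/etc/systemd/system/<unit>.d/`; the new drop-in file name `90-resource-limits.conf` and both function names are hypothetical.

```python
from pathlib import PurePosixPath
from typing import List, Tuple


def render_dropin(cpu_quota: str = "200%", memory_max: str = "2G") -> str:
    """Return the text of a systemd drop-in that sets the resource limits."""
    return f"[Service]\nCPUQuota={cpu_quota}\nMemoryMax={memory_max}\n"


def plan_guardrails(unit: str, dropin_root: str = "/etc/systemd/system") -> List[Tuple[str, str]]:
    """List the steps a dry run would print; nothing is executed here."""
    # Assumption: drop-ins for the unit live in <dropin_root>/<unit>.d/
    dropin_dir = PurePosixPath(dropin_root) / f"{unit}.d"
    return [
        ("remove", str(dropin_dir / "watchdog.conf")),
        ("write", str(dropin_dir / "90-resource-limits.conf")),  # hypothetical name
        ("run", "systemctl daemon-reload"),
        ("run", f"systemctl restart {unit}"),
    ]


if __name__ == "__main__":
    for action, target in plan_guardrails("actions.runner.owenhytsai-awoooi.awoooi-110.service"):
        print(f"[dry-run] {action}: {target}")
    print(render_dropin(), end="")
```

Keeping the plan separate from its execution mirrors the dry-run default: the steps can be reviewed before anything runs with sudo.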

### Commits

| commit | Description |
|--------|-------------|
| `fe618960` | systemd runner textfile exporter + Prometheus/inspector/runbook |
| `34d1c76` | `SystemdRunner*` alert rule routing + sudo guardrail script |
| `0e14935` | `SystemdRunner*` early classification + `{unit}` template variable |
| `ab0f0a8` | deploy API image `runner-classify-20260505-0e14935` |
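
As an illustration of the `{unit}` template variable from commit `0e14935`, the sketch below fills a diagnostic command template with a unit name. It is an assumption about how such a substitution could be done safely, not the AwoooI implementation; the template string, the `fill_unit` helper, and the allowed-character rule are all hypothetical.

```python
import re
import shlex

# Assumption: accept only characters commonly found in systemd unit names.
_UNIT_RE = re.compile(r"^[A-Za-z0-9:_.@\\-]+$")


def fill_unit(template: str, unit: str) -> str:
    """Substitute {unit} in a command template, rejecting suspicious names."""
    if not _UNIT_RE.match(unit):
        raise ValueError(f"unexpected unit name: {unit!r}")
    # shlex.quote is a second line of defence before the command reaches a shell.
    return template.replace("{unit}", shlex.quote(unit))


if __name__ == "__main__":
    # Hypothetical diagnostic template; the real playbook command may differ.
    template = "systemctl show {unit} -p NRestarts -p WatchdogUSec -p MemoryMax"
    print(fill_unit(template, "actions.runner.owenhytsai-awoooi.awoooi-110.service"))
```

Validating the unit name before substitution matters here because the value comes from an alert label and ends up in a command run over SSH.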

### Next steps

1. On 110, run `bash scripts/ops/apply-runner-systemd-guardrails.sh --apply` with sudo.
2. Verify that the `SystemdRunnerWatchdogEnabled` / `SystemdRunnerMissingResourceQuota` alerts clear in Prometheus.
3. Watch whether load5/core on 110 stays below 1.5; if it is still high, tune Sentry ingestion/ClickHouse parts next.
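
The load5/core check in step 3 can be sketched as follows, assuming "load5/core" means the 5-minute load average divided by the number of logical CPUs. The function names are hypothetical; the 1.5 threshold comes from the step above. The live reading uses `os.getloadavg()`, which is available on Unix hosts only.

```python
import os


def load5_per_core(load5: float, cores: int) -> float:
    """Return the 5-minute load average normalised by core count."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load5 / cores


def is_stable(load5: float, cores: int, threshold: float = 1.5) -> bool:
    """True when load5/core is strictly below the threshold."""
    return load5_per_core(load5, cores) < threshold


def current_load5_per_core() -> float:
    """Read the live value on the local host (Unix only)."""
    _, load5, _ = os.getloadavg()
    return load5_per_core(load5, os.cpu_count() or 1)
```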

---

## 📍 2026-04-22: Dynamic system report with 5 new major sections (commit 9244c5e)

### Requirements