diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index a146a350..f843cfac 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -54,6 +54,33 @@
 - This entry only patches the frontend message catalog; it does not change executor, Telegram, host, K8s, runtime gate, secret, or live apply permissions.
 - P2-415 still keeps controlled executor dispatch / live apply / Telegram send / Bot API / host write / kubectl / destructive operation all at `0`.
 
+## 2026-06-27|12:43 Host 110 high CPU load protection: CI / smoke container detection fixed and deployed
+
+**Time and sources**:
+- 2026-06-27 12:40-12:43 Asia/Taipei.
+- Sources: `ps` on 110 / Docker stats / host runaway exporter / stale Gitea action dry-run / Gitea log readback.
+
+**Findings**:
+- At 12:40 the `load average` had dropped from the previous peak to `4.87, 9.98, 40.74`. The main load source had shifted from the stockPlatform smoke run to the AWOOOI production CD job `awoooi-cd-5901-1-e2e-smoke`, Playwright / Chrome / ffmpeg, and Gitea log/update. That job had been running for only about ten seconds, so it was not judged stale.
+- At 12:43 the `load average` dropped further to `2.28, 6.29, 33.05`; active CI containers were back to zero, `gitea` used about `55%` container CPU, and no orphan browser group older than 30 minutes was seen.
+- The gap was not a complete lack of monitoring: the old `host-runaway-process-exporter.py` only counted `GITEA-ACTIONS-*` and missed the current `awoooi-cd--` / `awoooi-code-review--` naming, so the active action container count was misreported as `0`.
+
+**Fixes**:
+- `scripts/ops/host-runaway-process-exporter.py` now supports the `GITEA-ACTIONS-*`, `*-cd--*`, and `*-code-review--*` container names.
+- `scripts/ops/stop-stale-gitea-actions-jobs.sh` now supports the same names and gives deploy / tests / post-deploy / code-review / e2e-smoke / source-link-smoke separate stale thresholds; the default is still dry-run.
+- The runner drain check in `scripts/reboot-recovery/awoooi-startup-110.sh` also supports the current container names, so the next reboot does not miss a running CD task.
+- Deployed to 110: `/home/wooo/scripts/host-runaway-process-exporter.py`, `/home/wooo/scripts/stop-stale-gitea-actions-jobs.sh`, `/usr/local/bin/awoooi-startup-110.sh`.
+
+**Verification**:
+- `pytest scripts/ops/tests/test_host_runaway_process_exporter.py -q`: `7 passed`.
+- `bash -n scripts/ops/stop-stale-gitea-actions-jobs.sh scripts/reboot-recovery/awoooi-startup-110.sh`: passed.
+- Exporter readback on 110: `awoooi_host_gitea_actions_active_container_count{host="110"} 0`, `awoooi_host_runaway_browser_orphan_group_count ... 0`, `awoooi_host_load5_per_core{host="110"} 0.524167`.
+- Stale job dry-run on 110: `No stale Gitea Actions containers older than policy threshold`.
+
+**Boundaries**:
+- This entry did not restart Docker / Gitea / Nginx / systemd services / DB / K3s, and did not touch firewall / secret / active response.
+- This is an update and deployment of the actual host protection tooling, not a relaxation of the runtime gate; automated repair still requires owner / maintenance window / evidence gates.
+
 ## 2026-06-27|P2-415 AI Agent controlled executor handoff runway: API / frontend / tests complete
 
 **Background**: P2-409 moved high-risk items from "manual owner review" to the controlled apply queue, while critical items keep the break-glass boundary. This entry follows P2-409 / P2-410 / P2-411 and adds a controlled executor handoff runway that the product and the production API can read back, so the handoff does not stay at UI copy or verbal approval.
diff --git a/docs/runbooks/HOST-RESOURCE-BASELINE-110-188.md b/docs/runbooks/HOST-RESOURCE-BASELINE-110-188.md
index 4146eb47..b8171af0 100644
--- a/docs/runbooks/HOST-RESOURCE-BASELINE-110-188.md
+++ b/docs/runbooks/HOST-RESOURCE-BASELINE-110-188.md
@@ -52,7 +52,7 @@ Use these thresholds for alerting and AI triage:
 | Systemd runner restarts | > 2 in 15m | Critical; inspect watchdog/drop-ins and active CI jobs. |
 | Systemd runner WatchdogSec | > 0 for 10m | Warning; GitHub Actions runner should not be killed by systemd watchdog. |
 | Systemd runner quota | CPU or memory unlimited for 30m | Warning; apply CPUQuota/MemoryMax or move CI away from Sentry host. |
-| Gitea Actions job runtime | > 20m for 5m | Warning; inspect logs and run `/home/wooo/scripts/stop-stale-gitea-actions-jobs.sh` dry-run before stopping stale job containers. |
+| Gitea Actions job runtime | > 20m for 5m | Warning; inspect logs and run `/home/wooo/scripts/stop-stale-gitea-actions-jobs.sh` dry-run before stopping stale job containers. The stale-job detector must include both legacy `GITEA-ACTIONS-*` containers and repo-scoped names such as `awoooi-cd--` / `awoooi-code-review--`. |
 
 ## Rules
 
@@ -79,7 +79,7 @@ Use these thresholds for alerting and AI triage:
 5. Tune `momo-scheduler` crawler concurrency on 188; keep 2 CPU / 2 GiB until success rate and latency prove it is too low.
 6. Fix 188 Elephant Alpha/OpenClaw allowed-action drift before enabling resource auto-repair beyond diagnosis.
 7. Add modest caps to currently unlimited low-risk services in small batches. Do not alert every unlimited auxiliary container at once; promote candidates only after 24h usage data.
-8. Deploy `scripts/ops/stop-stale-gitea-actions-jobs.sh` to 110 as `/home/wooo/scripts/stop-stale-gitea-actions-jobs.sh`; keep Prometheus auto action in dry-run mode.
+8. Deploy `scripts/ops/stop-stale-gitea-actions-jobs.sh` to 110 as `/home/wooo/scripts/stop-stale-gitea-actions-jobs.sh`; keep Prometheus auto action in dry-run mode. As of 2026-06-27, it must recognize both legacy `GITEA-ACTIONS-*` and current repo-scoped `*-cd--*` / `*-code-review--*` container names.
 9. Fix 110 runner services with sudo-capable host maintenance:
 
 ```bash
diff --git a/scripts/ops/host-runaway-process-exporter.py b/scripts/ops/host-runaway-process-exporter.py
index c022f13d..ac758868 100755
--- a/scripts/ops/host-runaway-process-exporter.py
+++ b/scripts/ops/host-runaway-process-exporter.py
@@ -24,6 +24,9 @@ TEXTFILE_DIR = Path(os.environ.get("NODE_EXPORTER_TEXTFILE_DIR", "/var/lib/node_
 OUTPUT_NAME = "host_runaway_process.prom"
 HOST_LABEL = os.environ.get("AIOPS_HOST_LABEL", os.uname().nodename)
 LABEL_RE = re.compile(r'["\\\n]')
+GITEA_ACTION_CONTAINER_RE = re.compile(
+    r"^(?:GITEA-ACTIONS-|[A-Za-z0-9][A-Za-z0-9_.-]*-(?:cd|code-review)-[0-9]+-)"
+)
 
 
 @dataclass(frozen=True)
@@ -214,7 +217,7 @@ def active_gitea_action_containers(docker_file: Path | None = None) -> int:
             names = run_text(["docker", "ps", "--format", "{{.Names}}"], timeout=10).splitlines()
         except Exception:
             return -1
-    return sum(1 for name in names if "GITEA-ACTIONS-TASK-" in name)
+    return sum(1 for name in names if GITEA_ACTION_CONTAINER_RE.search(name))
 
 
 def load5_per_core() -> float:
diff --git a/scripts/ops/stop-stale-gitea-actions-jobs.sh b/scripts/ops/stop-stale-gitea-actions-jobs.sh
index 181fede0..8a06eb81 100644
--- a/scripts/ops/stop-stale-gitea-actions-jobs.sh
+++ b/scripts/ops/stop-stale-gitea-actions-jobs.sh
@@ -11,7 +11,8 @@ set -euo pipefail
 #   bash scripts/ops/stop-stale-gitea-actions-jobs.sh --apply
 #
 # Safety rules:
-# - Only touches Docker containers named GITEA-ACTIONS-*.
+# - Only touches Docker containers named GITEA-ACTIONS-* or repo-scoped
+#   Gitea job containers such as awoooi-cd-5901-1-e2e-smoke.
 # - Defaults to containers older than 20 minutes.
 # - Known long-running workflows get a higher stop threshold than the alert threshold.
 # - Skips containers with recent log output unless --force is provided.
@@ -24,17 +25,20 @@ threshold_for_name() {
   local name="$1"
 
   case "$name" in
-    *WORKFLOW-CD-Pipeline_JOB-deploy*)
+    *WORKFLOW-CD-Pipeline_JOB-deploy*|*-cd-*-deploy*)
       # .gitea/workflows/cd.yaml deploy job timeout is 60m. Give act/Gitea
       # cleanup a buffer before treating the container as abandoned.
       echo 4500
       ;;
-    *WORKFLOW-CD-Pipeline_JOB-tests*|*WORKFLOW-CD-Pipeline_JOB-post-deploy-checks*)
+    *WORKFLOW-CD-Pipeline_JOB-tests*|*WORKFLOW-CD-Pipeline_JOB-post-deploy-checks*|*-cd-*-tests*|*-cd-*-post-deploy*)
       echo 2400
       ;;
-    *WORKFLOW-Code-Review_JOB-ai-code-review*)
+    *WORKFLOW-Code-Review_JOB-ai-code-review*|*-code-review-*)
       echo 720
       ;;
+    *-cd-*-e2e-smoke*|*-cd-*-source-link-smoke*)
+      echo 900
+      ;;
     *WORKFLOW-Deploy-Alert-Rules_JOB-deploy-alerts*)
       echo 900
       ;;
@@ -97,7 +101,7 @@ while read -r name; do
     fi
     docker stop "$name"
   fi
-done < <(docker ps --format '{{.Names}}' | grep '^GITEA-ACTIONS-' || true)
+done < <(docker ps --format '{{.Names}}' | grep -E '^(GITEA-ACTIONS-|[A-Za-z0-9][A-Za-z0-9_.-]*-(cd|code-review)-[0-9]+-)' || true)
 
 if [[ "$found" == "0" ]]; then
   echo "No stale Gitea Actions containers older than policy threshold (minimum ${MIN_AGE_SECONDS}s)."
diff --git a/scripts/ops/tests/test_host_runaway_process_exporter.py b/scripts/ops/tests/test_host_runaway_process_exporter.py
index 336a0d8c..343ab0e1 100644
--- a/scripts/ops/tests/test_host_runaway_process_exporter.py
+++ b/scripts/ops/tests/test_host_runaway_process_exporter.py
@@ -93,6 +93,26 @@ def test_renders_ci_load_and_swap_without_authorizing_repair(tmp_path: Path) ->
     assert 'rule="stockplatform_headless_smoke"' in metrics
 
 
+def test_counts_modern_gitea_action_container_names(tmp_path: Path) -> None:
+    exporter = load_exporter()
+    docker_file = tmp_path / "docker.txt"
+    docker_file.write_text(
+        "\n".join(
+            [
+                "GITEA-ACTIONS-TASK-123",
+                "awoooi-cd-5901-1-e2e-smoke",
+                "awoooi-cd-5873-1-source-link-smoke",
+                "awoooi-code-review-3323-1-ai-code-review",
+                "gitea",
+                "stockplatform-v2-api-1",
+            ]
+        ),
+        encoding="utf-8",
+    )
+
+    assert exporter.active_gitea_action_containers(docker_file) == 4
+
+
 def test_remediation_defaults_to_dry_run(tmp_path: Path) -> None:
     ps_file = tmp_path / "ps.txt"
     ps_file.write_text(
diff --git a/scripts/reboot-recovery/awoooi-startup-110.sh b/scripts/reboot-recovery/awoooi-startup-110.sh
index 78e18093..e5afeccf 100644
--- a/scripts/reboot-recovery/awoooi-startup-110.sh
+++ b/scripts/reboot-recovery/awoooi-startup-110.sh
@@ -231,7 +231,7 @@ PY
   # If a task container is still running when this recovery script is run manually, send SIGINT
   # so act_runner drains: it stops taking new jobs and waits for in-flight jobs to finish.
   docker update --restart=no gitea-runner >/dev/null 2>&1 || true
-  if docker ps --format '{{.Names}}' | grep -q '^GITEA-ACTIONS-TASK-'; then
+  if docker ps --format '{{.Names}}' | grep -Eq '^(GITEA-ACTIONS-|[A-Za-z0-9][A-Za-z0-9_.-]*-(cd|code-review)-[0-9]+-)'; then
     log "⚠️ Gitea Actions task container still running; draining docker-wrapped gitea-runner"
     docker kill --signal=SIGINT gitea-runner >/dev/null 2>&1 || true
   else