fix(ops): recognize repo-scoped CI containers in load guard [skip ci]

ogt
2026-06-27 12:45:01 +08:00
parent c4fcd9cb12
commit 7f706feded
6 changed files with 63 additions and 9 deletions

@@ -54,6 +54,33 @@
- This change only patches the frontend message catalog; it does not change the executor, Telegram, host, K8s, runtime gate, secret, or live apply permissions.
- P2-415 still keeps controlled executor dispatch / live apply / Telegram send / Bot API / host write / kubectl / destructive operation all at `0`.
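## 2026-06-27 12:43 110 high CPU load host protection: CI / smoke container recognition fixed and deployed
**Time and sources**
- 2026-06-27 12:40-12:43 Asia/Taipei.
- Sources: 110 `ps` / Docker stats / host runaway exporter / stale Gitea action dry-run / Gitea log readback.
**Findings**
- At 12:40 the `load average` had dropped from the previous peak to `4.87, 9.98, 40.74`. The main source of load had shifted from the stockPlatform smoke to the AWOOOI production CD job `awoooi-cd-5901-1-e2e-smoke`, Playwright / Chrome / ffmpeg, and Gitea log/update. That job had been running for only a dozen or so seconds at the time, so it was not judged stale.
- At 12:43 the `load average` dropped further to `2.28, 6.29, 33.05`. Active CI containers were back to zero, `gitea` was at `55%` container CPU, and no orphan browser group older than 30 minutes was seen.
- The gap was not a complete lack of monitoring. The old version of `host-runaway-process-exporter.py` counted only `GITEA-ACTIONS-*` and missed the current `awoooi-cd-<run>-<job>` / `awoooi-code-review-<run>-<job>` naming, so the active action container count was misreported as `0`.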
**Fixes**
- `scripts/ops/host-runaway-process-exporter.py` now supports the `GITEA-ACTIONS-*`, `*-cd-<run>-*`, and `*-code-review-<run>-*` container naming.
- `scripts/ops/stop-stale-gitea-actions-jobs.sh` now supports the same naming and gives deploy / tests / post-deploy / code-review / e2e-smoke / source-link-smoke different stale thresholds; the default is still dry-run.
- The runner drain check in `scripts/reboot-recovery/awoooi-startup-110.sh` also supports the current container naming, so a running CD task is not missed at the next reboot.
- Deployed to 110: `/home/wooo/scripts/host-runaway-process-exporter.py`, `/home/wooo/scripts/stop-stale-gitea-actions-jobs.sh`, `/usr/local/bin/awoooi-startup-110.sh`.
**Verification**
- `pytest scripts/ops/tests/test_host_runaway_process_exporter.py -q`: `7 passed`.
- `bash -n scripts/ops/stop-stale-gitea-actions-jobs.sh scripts/reboot-recovery/awoooi-startup-110.sh`: passed.
- 110 exporter readback: `awoooi_host_gitea_actions_active_container_count{host="110"} 0`, `awoooi_host_runaway_browser_orphan_group_count ... 0`, `awoooi_host_load5_per_core{host="110"} 0.524167`.
- 110 stale job dry-run: `No stale Gitea Actions containers older than policy threshold`.
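The `awoooi_host_load5_per_core` value in the readback is consistent with the 12:43 five-minute load average divided by a core count: `6.29 / 12` rounds to `0.524167`. A minimal sketch of that normalization follows. It assumes the exporter divides by the logical CPU count; the helper names and the 12-core figure are inferred for illustration and are not taken from the exporter source.

```python
import os


def load5_per_core_from(load5: float, cores: int) -> float:
    """Normalize a 5-minute load average by the number of CPU cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load5 / cores


def current_load5_per_core() -> float:
    """Read the live 5-minute load average and normalize it (Unix only)."""
    _load1, load5, _load15 = os.getloadavg()
    # os.cpu_count() can return None; fall back to 1 so the division is defined.
    return load5_per_core_from(load5, os.cpu_count() or 1)


# Worked example with the 12:43 readback: load5 = 6.29, assumed 12 cores.
print(round(load5_per_core_from(6.29, 12), 6))
```

A per-core value below `1.0` means the run queue averaged less than one runnable task per core over the window.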
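**Boundaries**
- This change did not restart Docker / Gitea / Nginx / systemd service / DB / K3s, and involved no firewall / secret / active response changes.
- This is an update and deployment of actual host protection tooling, not a relaxation of the runtime gate; automatic remediation still requires owner / maintenance window / evidence gates.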
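## 2026-06-27 P2-415 AI Agent controlled executor handoff runway: API / frontend / tests complete
**Background**: P2-409 moved high risk from "manual owner review" to being eligible for the controlled apply queue, while critical still keeps the break-glass boundary. This section builds on P2-409 / P2-410 / P2-411 and adds a controlled executor handoff runway that the product and the production API can read back, so that it does not remain only UI copy or verbal approval.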

@@ -52,7 +52,7 @@ Use these thresholds for alerting and AI triage:
| Systemd runner restarts | > 2 in 15m | Critical; inspect watchdog/drop-ins and active CI jobs. |
| Systemd runner WatchdogSec | > 0 for 10m | Warning; GitHub Actions runner should not be killed by systemd watchdog. |
| Systemd runner quota | CPU or memory unlimited for 30m | Warning; apply CPUQuota/MemoryMax or move CI away from Sentry host. |
-| Gitea Actions job runtime | > 20m for 5m | Warning; inspect logs and run `/home/wooo/scripts/stop-stale-gitea-actions-jobs.sh` dry-run before stopping stale job containers. |
+| Gitea Actions job runtime | > 20m for 5m | Warning; inspect logs and run `/home/wooo/scripts/stop-stale-gitea-actions-jobs.sh` dry-run before stopping stale job containers. The stale-job detector must include both legacy `GITEA-ACTIONS-*` containers and repo-scoped names such as `awoooi-cd-<run>-<job>` / `awoooi-code-review-<run>-<job>`. |
## Rules
@@ -79,7 +79,7 @@ Use these thresholds for alerting and AI triage:
5. Tune `momo-scheduler` crawler concurrency on 188; keep 2 CPU / 2 GiB until success rate and latency prove it is too low.
6. Fix 188 Elephant Alpha/OpenClaw allowed-action drift before enabling resource auto-repair beyond diagnosis.
7. Add modest caps to currently unlimited low-risk services in small batches. Do not alert every unlimited auxiliary container at once; promote candidates only after 24h usage data.
-8. Deploy `scripts/ops/stop-stale-gitea-actions-jobs.sh` to 110 as `/home/wooo/scripts/stop-stale-gitea-actions-jobs.sh`; keep Prometheus auto action in dry-run mode.
+8. Deploy `scripts/ops/stop-stale-gitea-actions-jobs.sh` to 110 as `/home/wooo/scripts/stop-stale-gitea-actions-jobs.sh`; keep Prometheus auto action in dry-run mode. As of 2026-06-27, it must recognize both legacy `GITEA-ACTIONS-*` and current repo-scoped `*-cd-<run>-*` / `*-code-review-<run>-*` container names.
9. Fix 110 runner services with sudo-capable host maintenance:
```bash

@@ -24,6 +24,9 @@ TEXTFILE_DIR = Path(os.environ.get("NODE_EXPORTER_TEXTFILE_DIR", "/var/lib/node_
OUTPUT_NAME = "host_runaway_process.prom"
HOST_LABEL = os.environ.get("AIOPS_HOST_LABEL", os.uname().nodename)
LABEL_RE = re.compile(r'["\\\n]')
+GITEA_ACTION_CONTAINER_RE = re.compile(
+    r"^(?:GITEA-ACTIONS-|[A-Za-z0-9][A-Za-z0-9_.-]*-(?:cd|code-review)-[0-9]+-)"
+)
@dataclass(frozen=True)
@@ -214,7 +217,7 @@ def active_gitea_action_containers(docker_file: Path | None = None) -> int:
names = run_text(["docker", "ps", "--format", "{{.Names}}"], timeout=10).splitlines()
except Exception:
return -1
-return sum(1 for name in names if "GITEA-ACTIONS-TASK-" in name)
+return sum(1 for name in names if GITEA_ACTION_CONTAINER_RE.search(name))
def load5_per_core() -> float:
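
The effect of swapping the substring check for the anchored pattern can be sketched in isolation. The regex below is copied from the hunk above; the two counting helpers and the sample list are illustrative only and are not part of the exporter.

```python
import re

# Same pattern as in the exporter hunk: legacy GITEA-ACTIONS-* names, or
# repo-scoped <repo>-cd-<run>-... / <repo>-code-review-<run>-... names.
GITEA_ACTION_CONTAINER_RE = re.compile(
    r"^(?:GITEA-ACTIONS-|[A-Za-z0-9][A-Za-z0-9_.-]*-(?:cd|code-review)-[0-9]+-)"
)


def count_action_containers(names: list[str]) -> int:
    """Count names that look like Gitea Actions job containers."""
    return sum(1 for name in names if GITEA_ACTION_CONTAINER_RE.search(name))


def count_legacy_only(names: list[str]) -> int:
    """Previous behaviour: substring check for the legacy task prefix."""
    return sum(1 for name in names if "GITEA-ACTIONS-TASK-" in name)


sample = [
    "GITEA-ACTIONS-TASK-123",
    "awoooi-cd-5901-1-e2e-smoke",
    "awoooi-code-review-3323-1-ai-code-review",
    "gitea",
    "stockplatform-v2-api-1",
]
print(count_legacy_only(sample), count_action_containers(sample))
```

Because the pattern keys only on the container name, any unrelated container that happens to be called `<something>-cd-<digits>-...` would also be counted.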

@@ -11,7 +11,8 @@ set -euo pipefail
# bash scripts/ops/stop-stale-gitea-actions-jobs.sh --apply
#
# Safety rules:
-# - Only touches Docker containers named GITEA-ACTIONS-*.
+# - Only touches Docker containers named GITEA-ACTIONS-* or repo-scoped
+# Gitea job containers such as awoooi-cd-5901-1-e2e-smoke.
# - Defaults to containers older than 20 minutes.
# - Known long-running workflows get a higher stop threshold than the alert threshold.
# - Skips containers with recent log output unless --force is provided.
@@ -24,17 +25,20 @@ threshold_for_name() {
local name="$1"
case "$name" in
-*WORKFLOW-CD-Pipeline_JOB-deploy*)
+*WORKFLOW-CD-Pipeline_JOB-deploy*|*-cd-*-deploy*)
# .gitea/workflows/cd.yaml deploy job timeout is 60m. Give act/Gitea
# cleanup a buffer before treating the container as abandoned.
echo 4500
;;
-*WORKFLOW-CD-Pipeline_JOB-tests*|*WORKFLOW-CD-Pipeline_JOB-post-deploy-checks*)
+*WORKFLOW-CD-Pipeline_JOB-tests*|*WORKFLOW-CD-Pipeline_JOB-post-deploy-checks*|*-cd-*-tests*|*-cd-*-post-deploy*)
echo 2400
;;
-*WORKFLOW-Code-Review_JOB-ai-code-review*)
+*WORKFLOW-Code-Review_JOB-ai-code-review*|*-code-review-*)
echo 720
;;
+*-cd-*-e2e-smoke*|*-cd-*-source-link-smoke*)
+echo 900
+;;
*WORKFLOW-Deploy-Alert-Rules_JOB-deploy-alerts*)
echo 900
;;
@@ -97,7 +101,7 @@ while read -r name; do
fi
docker stop "$name"
fi
-done < <(docker ps --format '{{.Names}}' | grep '^GITEA-ACTIONS-' || true)
+done < <(docker ps --format '{{.Names}}' | grep -E '^(GITEA-ACTIONS-|[A-Za-z0-9][A-Za-z0-9_.-]*-(cd|code-review)-[0-9]+-)' || true)
if [[ "$found" == "0" ]]; then
echo "No stale Gitea Actions containers older than policy threshold (minimum ${MIN_AGE_SECONDS}s)."
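
The first-match behaviour of the `threshold_for_name` patterns in this script can be checked outside the shell with a small Python sketch. The patterns and second values are copied from the hunk above. The fallback of 1200 seconds is an assumption based on the "Defaults to containers older than 20 minutes" comment, because the script's default branch is not shown in this diff, and the rule table itself is a hypothetical structure.

```python
from fnmatch import fnmatchcase

# Ordered like the shell `case`: the first matching glob wins.
THRESHOLD_RULES: list[tuple[tuple[str, ...], int]] = [
    (("*WORKFLOW-CD-Pipeline_JOB-deploy*", "*-cd-*-deploy*"), 4500),
    (
        (
            "*WORKFLOW-CD-Pipeline_JOB-tests*",
            "*WORKFLOW-CD-Pipeline_JOB-post-deploy-checks*",
            "*-cd-*-tests*",
            "*-cd-*-post-deploy*",
        ),
        2400,
    ),
    (("*WORKFLOW-Code-Review_JOB-ai-code-review*", "*-code-review-*"), 720),
    (("*-cd-*-e2e-smoke*", "*-cd-*-source-link-smoke*"), 900),
    (("*WORKFLOW-Deploy-Alert-Rules_JOB-deploy-alerts*",), 900),
]

# Assumption: the fallback equals the documented 20-minute default.
DEFAULT_THRESHOLD_SECONDS = 1200


def threshold_for_name(name: str) -> int:
    """Return the stale threshold in seconds for a container name."""
    for patterns, seconds in THRESHOLD_RULES:
        if any(fnmatchcase(name, pattern) for pattern in patterns):
            return seconds
    return DEFAULT_THRESHOLD_SECONDS


for example in (
    "awoooi-cd-5901-1-e2e-smoke",
    "awoooi-code-review-3323-1-ai-code-review",
    "awoooi-cd-5901-1-post-deploy-checks",  # hypothetical job name
):
    print(example, threshold_for_name(example))
```

One consequence of the ordering is worth checking against real container names: a repo-scoped name containing `-post-deploy` also matches the earlier `*-cd-*-deploy*` glob, so it would receive the 4500-second deploy threshold rather than 2400.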

@@ -93,6 +93,26 @@ def test_renders_ci_load_and_swap_without_authorizing_repair(tmp_path: Path) ->
assert 'rule="stockplatform_headless_smoke"' in metrics
+def test_counts_modern_gitea_action_container_names(tmp_path: Path) -> None:
+    exporter = load_exporter()
+    docker_file = tmp_path / "docker.txt"
+    docker_file.write_text(
+        "\n".join(
+            [
+                "GITEA-ACTIONS-TASK-123",
+                "awoooi-cd-5901-1-e2e-smoke",
+                "awoooi-cd-5873-1-source-link-smoke",
+                "awoooi-code-review-3323-1-ai-code-review",
+                "gitea",
+                "stockplatform-v2-api-1",
+            ]
+        ),
+        encoding="utf-8",
+    )
+    assert exporter.active_gitea_action_containers(docker_file) == 4
def test_remediation_defaults_to_dry_run(tmp_path: Path) -> None:
ps_file = tmp_path / "ps.txt"
ps_file.write_text(

@@ -231,7 +231,7 @@ PY
# If a task container is still running when this recovery script is run manually, send SIGINT
# so that act_runner drains: it stops accepting new jobs and waits for in-flight jobs to finish.
docker update --restart=no gitea-runner >/dev/null 2>&1 || true
-if docker ps --format '{{.Names}}' | grep -q '^GITEA-ACTIONS-TASK-'; then
+if docker ps --format '{{.Names}}' | grep -Eq '^(GITEA-ACTIONS-|[A-Za-z0-9][A-Za-z0-9_.-]*-(cd|code-review)-[0-9]+-)'; then
log "⚠️ Gitea Actions task container still running; draining docker-wrapped gitea-runner"
docker kill --signal=SIGINT gitea-runner >/dev/null 2>&1 || true
else