fix(ops): recognize repo-scoped CI containers in load guard [skip ci]
@@ -54,6 +54,33 @@
 - This section only patches the frontend message catalog; it does not change executor, Telegram, host, K8s, runtime gate, secret, or live apply permissions.
 - P2-415 still keeps controlled executor dispatch / live apply / Telegram send / Bot API / host write / kubectl / destructive operation all at `0`.

+## 2026-06-27|12:43 110 high CPU load, actual protection: CI / smoke container recognition fixed and deployed
+
+**Time and sources**:
+- 2026-06-27 12:40-12:43 Asia/Taipei.
+- Sources: 110 `ps` / Docker stats / host runaway exporter / stale Gitea action dry-run / Gitea log readback.
+
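+**Actual assessment**:
+- At 12:40 the `load average` had dropped from the previous peak to `4.87, 9.98, 40.74`. The main source of the peak had shifted from the stockPlatform smoke run to the AWOOOI production CD job `awoooi-cd-5901-1-e2e-smoke`, Playwright / Chrome / ffmpeg, and Gitea log/update. That job had been running for only ten-odd seconds at the time, so it was not judged stale.
+- At 12:43 the `load average` dropped further to `2.28, 6.29, 33.05`; active CI containers were back to zero, `gitea` was at about `55%` container CPU, and no orphan browser group older than 30 minutes was seen.
+- The gap was not a total lack of monitoring: the old version of `host-runaway-process-exporter.py` counted only `GITEA-ACTIONS-*` and missed the current `awoooi-cd-<run>-<job>` / `awoooi-code-review-<run>-<job>` naming, so the active action container count was falsely reported as `0`.
+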
+**Fix**:
+- `scripts/ops/host-runaway-process-exporter.py` now supports the `GITEA-ACTIONS-*`, `*-cd-<run>-*`, and `*-code-review-<run>-*` container naming.
+- `scripts/ops/stop-stale-gitea-actions-jobs.sh` now supports the same naming and assigns different stale thresholds to deploy / tests / post-deploy / code-review / e2e-smoke / source-link-smoke; it still defaults to dry-run.
+- The runner drain check in `scripts/reboot-recovery/awoooi-startup-110.sh` now also supports the current container naming, so a running CD task is not missed on the next reboot.
+- Deployed to 110: `/home/wooo/scripts/host-runaway-process-exporter.py`, `/home/wooo/scripts/stop-stale-gitea-actions-jobs.sh`, `/usr/local/bin/awoooi-startup-110.sh`.
+
+**Verification**:
+- `pytest scripts/ops/tests/test_host_runaway_process_exporter.py -q`: `7 passed`.
+- `bash -n scripts/ops/stop-stale-gitea-actions-jobs.sh scripts/reboot-recovery/awoooi-startup-110.sh`: passed.
+- 110 exporter readback: `awoooi_host_gitea_actions_active_container_count{host="110"} 0`, `awoooi_host_runaway_browser_orphan_group_count ... 0`, `awoooi_host_load5_per_core{host="110"} 0.524167`.
+- 110 stale job dry-run: `No stale Gitea Actions containers older than policy threshold`.
+
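+**Boundaries**:
+- This section did not restart Docker / Gitea / Nginx / systemd service / DB / K3s, and performed no firewall / secret / active response action.
+- This is an update and deployment of actual host protection tooling, not a relaxation of the runtime gate; automatic remediation still requires owner / maintenance window / evidence gates.
+
 ## 2026-06-27|P2-415 AI Agent controlled executor handoff runway: API / frontend / tests complete

 **Background**: P2-409 moved high risk from "manual owner review" to being eligible for the controlled apply queue, while critical keeps the break-glass boundary. This section follows P2-409 / P2-410 / P2-411 and adds a controlled executor handoff runway that the product and the production API can read back, so the work does not stop at UI copy or verbal approval.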
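
Reviewer note: the naming gap described in the 12:43 log entry above can be illustrated with a small Python sketch. The regex is copied from this commit's `GITEA_ACTION_CONTAINER_RE`, and the sample names come from the test added in this commit. The helper names `count_legacy` and `count_current` are illustrative assumptions, not functions from the repository.

```python
import re

# Pattern copied from this commit's GITEA_ACTION_CONTAINER_RE.
GITEA_ACTION_CONTAINER_RE = re.compile(
    r"^(?:GITEA-ACTIONS-|[A-Za-z0-9][A-Za-z0-9_.-]*-(?:cd|code-review)-[0-9]+-)"
)


def count_legacy(names):
    """Hypothetical helper: the pre-fix substring check shown in the diff."""
    return sum(1 for name in names if "GITEA-ACTIONS-TASK-" in name)


def count_current(names):
    """Hypothetical helper: the post-fix regex check shown in the diff."""
    return sum(1 for name in names if GITEA_ACTION_CONTAINER_RE.search(name))


# Two repo-scoped job containers plus two unrelated service containers.
names = [
    "awoooi-cd-5901-1-e2e-smoke",
    "awoooi-code-review-3323-1-ai-code-review",
    "gitea",
    "stockplatform-v2-api-1",
]

print(count_legacy(names))   # 0: repo-scoped jobs are invisible to the old check
print(count_current(names))  # 2: both job containers are counted
```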
@@ -52,7 +52,7 @@ Use these thresholds for alerting and AI triage:
 | Systemd runner restarts | > 2 in 15m | Critical; inspect watchdog/drop-ins and active CI jobs. |
 | Systemd runner WatchdogSec | > 0 for 10m | Warning; GitHub Actions runner should not be killed by systemd watchdog. |
 | Systemd runner quota | CPU or memory unlimited for 30m | Warning; apply CPUQuota/MemoryMax or move CI away from Sentry host. |
-| Gitea Actions job runtime | > 20m for 5m | Warning; inspect logs and run `/home/wooo/scripts/stop-stale-gitea-actions-jobs.sh` dry-run before stopping stale job containers. |
+| Gitea Actions job runtime | > 20m for 5m | Warning; inspect logs and run `/home/wooo/scripts/stop-stale-gitea-actions-jobs.sh` dry-run before stopping stale job containers. The stale-job detector must include both legacy `GITEA-ACTIONS-*` containers and repo-scoped names such as `awoooi-cd-<run>-<job>` / `awoooi-code-review-<run>-<job>`. |

 ## Rules

@@ -79,7 +79,7 @@ Use these thresholds for alerting and AI triage:
 5. Tune `momo-scheduler` crawler concurrency on 188; keep 2 CPU / 2 GiB until success rate and latency prove it is too low.
 6. Fix 188 Elephant Alpha/OpenClaw allowed-action drift before enabling resource auto-repair beyond diagnosis.
 7. Add modest caps to currently unlimited low-risk services in small batches. Do not alert every unlimited auxiliary container at once; promote candidates only after 24h usage data.
-8. Deploy `scripts/ops/stop-stale-gitea-actions-jobs.sh` to 110 as `/home/wooo/scripts/stop-stale-gitea-actions-jobs.sh`; keep Prometheus auto action in dry-run mode.
+8. Deploy `scripts/ops/stop-stale-gitea-actions-jobs.sh` to 110 as `/home/wooo/scripts/stop-stale-gitea-actions-jobs.sh`; keep Prometheus auto action in dry-run mode. As of 2026-06-27, it must recognize both legacy `GITEA-ACTIONS-*` and current repo-scoped `*-cd-<run>-*` / `*-code-review-<run>-*` container names.
 9. Fix 110 runner services with sudo-capable host maintenance:

 ```bash
@@ -24,6 +24,9 @@ TEXTFILE_DIR = Path(os.environ.get("NODE_EXPORTER_TEXTFILE_DIR", "/var/lib/node_
 OUTPUT_NAME = "host_runaway_process.prom"
 HOST_LABEL = os.environ.get("AIOPS_HOST_LABEL", os.uname().nodename)
 LABEL_RE = re.compile(r'["\\\n]')
+GITEA_ACTION_CONTAINER_RE = re.compile(
+    r"^(?:GITEA-ACTIONS-|[A-Za-z0-9][A-Za-z0-9_.-]*-(?:cd|code-review)-[0-9]+-)"
+)


 @dataclass(frozen=True)
@@ -214,7 +217,7 @@ def active_gitea_action_containers(docker_file: Path | None = None) -> int:
         names = run_text(["docker", "ps", "--format", "{{.Names}}"], timeout=10).splitlines()
     except Exception:
         return -1
-    return sum(1 for name in names if "GITEA-ACTIONS-TASK-" in name)
+    return sum(1 for name in names if GITEA_ACTION_CONTAINER_RE.search(name))


 def load5_per_core() -> float:
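
Reviewer note: the exporter writes gauges such as `awoooi_host_gitea_actions_active_container_count{host="110"} 0` to a node_exporter textfile. The diff shows `LABEL_RE` but not how it is used, so the sketch below is an assumption: it escapes label values as the Prometheus text exposition format requires (backslash, double quote, and newline) and renders one gauge line. `escape_label` and `render_gauge` are hypothetical names.

```python
import re

# Character class copied from the exporter diff: double quote, backslash, newline.
LABEL_RE = re.compile(r'["\\\n]')

# Replacements required by the Prometheus text exposition format.
_ESCAPES = {'"': '\\"', "\\": "\\\\", "\n": "\\n"}


def escape_label(value: str) -> str:
    """Hypothetical helper: escape a label value for the text format."""
    return LABEL_RE.sub(lambda match: _ESCAPES[match.group(0)], value)


def render_gauge(name: str, host: str, value) -> str:
    """Hypothetical helper: render one sample line with a single host label."""
    return f'{name}{{host="{escape_label(host)}"}} {value}'


print(render_gauge("awoooi_host_gitea_actions_active_container_count", "110", 0))
# awoooi_host_gitea_actions_active_container_count{host="110"} 0
```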
@@ -11,7 +11,8 @@ set -euo pipefail
 #   bash scripts/ops/stop-stale-gitea-actions-jobs.sh --apply
 #
 # Safety rules:
-# - Only touches Docker containers named GITEA-ACTIONS-*.
+# - Only touches Docker containers named GITEA-ACTIONS-* or repo-scoped
+#   Gitea job containers such as awoooi-cd-5901-1-e2e-smoke.
 # - Defaults to containers older than 20 minutes.
 # - Known long-running workflows get a higher stop threshold than the alert threshold.
 # - Skips containers with recent log output unless --force is provided.
@@ -24,17 +25,20 @@ threshold_for_name() {
   local name="$1"

   case "$name" in
-    *WORKFLOW-CD-Pipeline_JOB-deploy*)
+    *WORKFLOW-CD-Pipeline_JOB-deploy*|*-cd-*-deploy*)
       # .gitea/workflows/cd.yaml deploy job timeout is 60m. Give act/Gitea
       # cleanup a buffer before treating the container as abandoned.
       echo 4500
       ;;
-    *WORKFLOW-CD-Pipeline_JOB-tests*|*WORKFLOW-CD-Pipeline_JOB-post-deploy-checks*)
+    *WORKFLOW-CD-Pipeline_JOB-tests*|*WORKFLOW-CD-Pipeline_JOB-post-deploy-checks*|*-cd-*-tests*|*-cd-*-post-deploy*)
       echo 2400
       ;;
-    *WORKFLOW-Code-Review_JOB-ai-code-review*)
+    *WORKFLOW-Code-Review_JOB-ai-code-review*|*-code-review-*)
       echo 720
       ;;
+    *-cd-*-e2e-smoke*|*-cd-*-source-link-smoke*)
+      echo 900
+      ;;
     *WORKFLOW-Deploy-Alert-Rules_JOB-deploy-alerts*)
       echo 900
       ;;
@@ -97,7 +101,7 @@ while read -r name; do
     fi
     docker stop "$name"
   fi
-done < <(docker ps --format '{{.Names}}' | grep '^GITEA-ACTIONS-' || true)
+done < <(docker ps --format '{{.Names}}' | grep -E '^(GITEA-ACTIONS-|[A-Za-z0-9][A-Za-z0-9_.-]*-(cd|code-review)-[0-9]+-)' || true)

 if [[ "$found" == "0" ]]; then
   echo "No stale Gitea Actions containers older than policy threshold (minimum ${MIN_AGE_SECONDS}s)."
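
Reviewer note: the `case` arms in `threshold_for_name` are evaluated top to bottom and the first matching glob wins. A minimal Python sketch of that lookup follows, using `fnmatch.fnmatchcase` as a stand-in for Bash pattern matching. The patterns and values are copied from the diff; the `1200` fallback is an assumption based on the header comment "Defaults to containers older than 20 minutes", because the default arm is not shown in this hunk.

```python
from fnmatch import fnmatchcase

# (glob patterns, stop threshold in seconds), in the same order as the case arms.
THRESHOLDS = [
    (("*WORKFLOW-CD-Pipeline_JOB-deploy*", "*-cd-*-deploy*"), 4500),
    (
        (
            "*WORKFLOW-CD-Pipeline_JOB-tests*",
            "*WORKFLOW-CD-Pipeline_JOB-post-deploy-checks*",
            "*-cd-*-tests*",
            "*-cd-*-post-deploy*",
        ),
        2400,
    ),
    (("*WORKFLOW-Code-Review_JOB-ai-code-review*", "*-code-review-*"), 720),
    (("*-cd-*-e2e-smoke*", "*-cd-*-source-link-smoke*"), 900),
    (("*WORKFLOW-Deploy-Alert-Rules_JOB-deploy-alerts*",), 900),
]

# Assumption: 20-minute default taken from the script's header comment.
DEFAULT_THRESHOLD = 1200


def threshold_for_name(name: str) -> int:
    """Return the threshold of the first arm with a matching pattern."""
    for patterns, seconds in THRESHOLDS:
        if any(fnmatchcase(name, pattern) for pattern in patterns):
            return seconds
    return DEFAULT_THRESHOLD


print(threshold_for_name("awoooi-cd-5901-1-e2e-smoke"))                # 900
print(threshold_for_name("awoooi-code-review-3323-1-ai-code-review"))  # 720
print(threshold_for_name("awoooi-cd-5901-1-post-deploy-checks"))       # 4500
```

The last call uses a hypothetical container name. Under first-match semantics it hits `*-cd-*-deploy*` before it reaches `*-cd-*-post-deploy*`, so it would receive 4500 rather than 2400. The commit does not say whether that ordering is intended.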
@@ -93,6 +93,26 @@ def test_renders_ci_load_and_swap_without_authorizing_repair(tmp_path: Path) ->
     assert 'rule="stockplatform_headless_smoke"' in metrics


+def test_counts_modern_gitea_action_container_names(tmp_path: Path) -> None:
+    exporter = load_exporter()
+    docker_file = tmp_path / "docker.txt"
+    docker_file.write_text(
+        "\n".join(
+            [
+                "GITEA-ACTIONS-TASK-123",
+                "awoooi-cd-5901-1-e2e-smoke",
+                "awoooi-cd-5873-1-source-link-smoke",
+                "awoooi-code-review-3323-1-ai-code-review",
+                "gitea",
+                "stockplatform-v2-api-1",
+            ]
+        ),
+        encoding="utf-8",
+    )
+
+    assert exporter.active_gitea_action_containers(docker_file) == 4
+
+
 def test_remediation_defaults_to_dry_run(tmp_path: Path) -> None:
     ps_file = tmp_path / "ps.txt"
     ps_file.write_text(
@@ -231,7 +231,7 @@ PY
 # If a task container is still running when this recovery script is run manually, send SIGINT
 # so that act_runner drains: it stops taking new jobs and waits for in-flight jobs to finish.
 docker update --restart=no gitea-runner >/dev/null 2>&1 || true
-if docker ps --format '{{.Names}}' | grep -q '^GITEA-ACTIONS-TASK-'; then
+if docker ps --format '{{.Names}}' | grep -Eq '^(GITEA-ACTIONS-|[A-Za-z0-9][A-Za-z0-9_.-]*-(cd|code-review)-[0-9]+-)'; then
   log "⚠️ Gitea Actions task container still running; draining docker-wrapped gitea-runner"
   docker kill --signal=SIGINT gitea-runner >/dev/null 2>&1 || true
 else