diff --git a/.gitea/workflows/ansible-lint.yml b/.gitea/workflows/ansible-lint.yml
index 78ee0792..8644e8ca 100644
--- a/.gitea/workflows/ansible-lint.yml
+++ b/.gitea/workflows/ansible-lint.yml
@@ -26,7 +26,7 @@ on:

 jobs:
   validate:
-    runs-on: self-hosted
+    runs-on: awoooi-ubuntu
     timeout-minutes: 15
     steps:
       - uses: actions/checkout@v4
diff --git a/.gitea/workflows/cd.yaml b/.gitea/workflows/cd.yaml
index ec1403aa..011e388e 100644
--- a/.gitea/workflows/cd.yaml
+++ b/.gitea/workflows/cd.yaml
@@ -1245,6 +1245,12 @@ jobs:

       - uses: actions/checkout@v4

+      - name: Wait for Host Web Build Pressure
+        # 2026-06-27 Codex: post-deploy Playwright smoke is browser-heavy too.
+        # Refuse to add another smoke run while 110 already has CI/build/smoke
+        # pressure; this gate is read-only and never kills other repo work.
+        run: bash scripts/ci/wait-host-web-build-pressure.sh
+
       - name: Get Commit Info
         id: commit
         run: |
diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index 24238127..dd1188f0 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -1,3 +1,32 @@
+## 2026-06-27 | 110 Gitea runner pressure reduction, rebound prevention, and workflow label consolidation
+
+**Background**: The confirmed root cause of the 110 CPU incident is the Gitea runner repeatedly launching the StockPlatform headless Chrome smoke. The previous round stopped `gitea-act-runner-host.service`, cleared Actions / smoke, and narrowed the live runner labels to `awoooi-ubuntu` / `awoooi-host`. This round aims to keep the cold-start / startup flow from automatically bringing the runner back up, and to complete the AWOOI workflow labels and the post-deploy pressure gate.
+
+**Completed**:
+- In `.gitea/workflows/cd.yaml`, `post-deploy-checks` now runs `Wait for Host Web Build Pressure` after checkout, so Alert Chain / Source Link / Monitoring / Playwright smoke do not stack on top of the existing build / smoke / load pressure on 110.
+- `.gitea/workflows/ansible-lint.yml` moves from `self-hosted` to `awoooi-ubuntu`; AWOOI workflows now use only two label classes, `awoooi-ubuntu` / `awoooi-host`.
+- `scripts/reboot-recovery/awoooi-startup-110.sh` no longer starts the Gitea host runner by default; startup may bring up the runner only when `AWOOOI_START_GITEA_RUNNER_ON_BOOT=1` is set explicitly.
+- The new version is installed at live `/usr/local/bin/awoooi-startup-110.sh`; the old file is backed up as
`/usr/local/bin/awoooi-startup-110.sh.bak-20260627-runner-inactive`; this round did not run the startup script and did not restart the runner.
+- `ops/runner/audit-workflow-labels.py` fixes the local fallback: when Gitea auth is unavailable but `--local-repo` is given, it no longer emits a spurious empty result.
+- `ops/runner/check-runner-isolation-readiness.sh` now recognizes `awoooi-ubuntu`, so the new label is not misclassified as unknown / mixed owner.
+- `ops/runner/README.md` documents the 2026-06-27 runner pressure-reduction state, the hard-fail pressure gate, the startup switch, and the workflow label boundaries.
+
+**Verification results**:
+- `bash -n scripts/reboot-recovery/awoooi-startup-110.sh scripts/ci/wait-host-web-build-pressure.sh ops/runner/check-runner-isolation-readiness.sh ops/runner/audit-runner-pool.sh`: passed.
+- `python3 -m py_compile ops/runner/audit-workflow-labels.py scripts/ops/host-runaway-process-exporter.py`: passed.
+- Gitea workflow YAML parse: all 10 workflows passed.
+- `rg "runs-on: (ubuntu-latest|self-hosted|ubuntu-22.04|ubuntu-24.04)" .gitea/workflows`: no matches.
+- `ops/runner/audit-workflow-labels.py --repo wooo/awoooi --local-repo wooo/awoooi=/Users/ogt/awoooi`: only the labels `awoooi-host` / `awoooi-ubuntu` remain.
+- 110 readback: `gitea-act-runner-host.service=inactive`, Actions containers `0`, active CI groups `0`, StockPlatform orphan groups `0`.
+- 110 readiness: primary labels `awoooi-ubuntu` / `awoooi-host` are both `awoooi_dedicated`, `mixed_owner_classes=0`, active action containers `none`.
+- 110 pressure gate currently returns `GATE_RC=1` because `load5/core 0.886667 > 0.85`; the top processes show this is mainly the `restic` 6h backup, not a recurrence of the Gitea Actions / Chrome smoke incident.
+- 110 local Gitea / Sentry / Alertmanager / Grafana health readback: `200 / 302 / 200 / 200`.
+
+**Boundaries and next steps**:
+- The runner is inactive on purpose to reduce pressure; do not restart it directly before rate limiting / migration is complete.
+- This round did not restart Docker / Nginx / firewall / K3s, did not kill any process, and did not read raw sessions / SQLite / auth / secrets.
+- Next P0: move the StockPlatform smoke to a rate-limited schedule or to a non-110 runner; then run the all-host cold-start scorecard and data freshness readback.
+
 ## 2026-06-27 | P2-416 D1N: currently effective AI Agent autonomy control layer and daily / weekly / monthly report Telegram Gateway wiring

 **Background**: The user has explicitly asked to stop proceeding under the old no-send / no-live / high-risk-defaults-to-manual rules; the currently effective direction is that low / medium / high risk work, under allowlist, Ansible check-mode, controlled
apply, post-apply verifier, KM / PlayBook writeback, and Telegram receipt, is handled automatically by the AI Agent in a controlled manner. Operations such as critical / secret / destructive / reboot / node drain / provider switch / force push remain hard blockers.
diff --git a/ops/runner/README.md b/ops/runner/README.md
index 7aa2d855..bd1d015b 100644
--- a/ops/runner/README.md
+++ b/ops/runner/README.md
@@ -132,9 +132,9 @@ runner:

 | Job | runner label | Purpose |
 |-----|--------------|------|
-| `tests` | `ubuntu-latest` | API unit + B5 integration tests, still running in the ci-runner container |
+| `tests` | `awoooi-host` | API unit + B5 integration tests, running directly on the 110 host runner |
 | `build-and-deploy` | `awoooi-host` | Harbor login, API/Web image build/push, GitOps deploy, running directly on the 110 host |
-| `post-deploy-checks` | `ubuntu-latest` | Alert chain, monitoring coverage, Playwright smoke |
+| `post-deploy-checks` | `awoooi-host` | Alert chain, monitoring coverage, Playwright smoke |

 110 keeps only the host-level `act_runner` daemon and declares two label classes in the same config:

@@ -143,9 +143,7 @@ runner:
   capacity: 1
   shutdown_timeout: 1h
   labels:
-    - "ubuntu-latest:docker://192.168.0.110:5000/awoooi/ci-runner:act-22.04"
-    - "ubuntu-22.04:docker://192.168.0.110:5000/awoooi/ci-runner:act-22.04"
-    - "ubuntu-24.04:docker://192.168.0.110:5000/awoooi/ci-runner:act-22.04"
+    - "awoooi-ubuntu:docker://192.168.0.110:5000/awoooi/ci-runner:act-22.04"
     - "awoooi-host:host"
 ```

@@ -208,15 +206,27 @@ AWOOI's Docker lock, and overlaps with the Next production build inside the AWOOI Web image
 - Only reads `ps`; does not kill / renice / reset any external process.
 - Excludes processes from AWOOI's own checkout, local worktree, and `/app/apps/web`
   inside the Web Docker build, so its own deployment is not misjudged.
-- By default it waits at most 60 times, 10 seconds each; if an external build is still running, it lets the job through with a warning,
-  so CD does not hang forever.
-- Set `HOST_WEB_BUILD_PRESSURE_WARN_ONLY=0` to switch to hard fail, but first confirm that
-  runner isolation and the build schedules of other repos have settled, so shared runner pressure does not turn into
-  deployment interruptions.
+- By default it waits at most 60 times, 10 seconds each; if external build / smoke / CI pressure remains,
+  it hard fails, so no new browser smoke is stacked onto the production host.
+- It lets the job through with a warning only when `HOST_WEB_BUILD_PRESSURE_WARN_ONLY=1` is set explicitly; this is only
+  for controlled reruns where the pressure source has been confirmed acceptable.

 The long-term direction is still runner isolation or build offload; this gate
is a conservative protection layer that, until the shared runner is
 split, reduces heavy frontend builds trampling on each other.

+### Layer 4 addendum: startup does not automatically restart the Gitea runner
+
+After the 2026-06-27 110 CPU incident was contained, `gitea-act-runner-host.service` stays inactive as a
+deliberate pressure-reduction state. `scripts/reboot-recovery/awoooi-startup-110.sh` can still fix the runner
+`shutdown_timeout` and labels, and still disables the legacy Docker runner, but by default it does not start the
+host runner. Startup may bring up the runner only when the following switch is set explicitly:
+
+```bash
+AWOOOI_START_GITEA_RUNNER_ON_BOOT=1 /usr/local/bin/awoooi-startup-110.sh
+```
+
+Until runner rate limiting / migration is complete, do not add this switch to the systemd environment.
+
 ### Layer 5 fix: legacy Docker runner drain

 2026-05-21: reconfirmed that two runners exist on 110 at the same time:
@@ -370,6 +380,12 @@ runner registration / service:

 Only after all three split runner smokes pass, drain the primary runner and remove the mixed labels.

+2026-06-27 live update: `gitea-act-runner-host.service` on 110 has been deliberately left
+`inactive`; the labels in `/home/wooo/act-runner/config.yaml` have been narrowed to
+`awoooi-ubuntu` and `awoooi-host`, and capacity is still `1`. This is a pressure-reduction and label isolation
+state; AWOOI workflows should also use only `awoooi-ubuntu` or `awoooi-host`, and must no longer use
+generic labels such as `ubuntu-latest` / `self-hosted`. This does not mean the runner migration is complete, nor that the runner can simply be restarted.
+
 ---
 Version: v2.0 | Updated: 2026-03-29 | Author: Claude Code
 Changes: v1.0→v2.0 sequential builds replace Job Concurrency Groups
diff --git a/ops/runner/audit-workflow-labels.py b/ops/runner/audit-workflow-labels.py
index 37a3db34..e1b8c201 100755
--- a/ops/runner/audit-workflow-labels.py
+++ b/ops/runner/audit-workflow-labels.py
@@ -179,7 +179,7 @@ def fetch_local_labels(repo: str, branch: str, repo_path: Path) -> tuple[list[Wo

 def label_owner(label: str) -> str:
     value = label.strip().strip("'\"")
-    if value == "awoooi-host":
+    if value in {"awoooi-host", "awoooi-ubuntu"}:
         return "awoooi_dedicated"
     if value == "ewoooc-host":
         return "foreign_dedicated"
@@ -234,7 +234,13 @@ def main() -> int:
         error: str | None = None
         if auth is not None:
             repo_labels, error = fetch_gitea_labels(repo, args.branch, auth)
-        elif repo not in local_paths:
+        elif repo in local_paths:
+            repo_labels, local_error = fetch_local_labels(repo, args.branch, local_paths[repo])
+            if local_error:
+                errors.append(f"{repo}: 
{local_error}")
+            labels.extend(repo_labels)
+            continue
+        else:
             error = "gitea_auth_unavailable"

         if error and repo in local_paths:
diff --git a/ops/runner/check-runner-isolation-readiness.sh b/ops/runner/check-runner-isolation-readiness.sh
index 7e24ae7b..d68d21a6 100755
--- a/ops/runner/check-runner-isolation-readiness.sh
+++ b/ops/runner/check-runner-isolation-readiness.sh
@@ -70,7 +70,7 @@ label_owner() {
   local label="$1"
   local label_name="${label%%:*}"
   case "$label_name" in
-    awoooi-host)
+    awoooi-host|awoooi-ubuntu|awoooi-*)
       printf 'awoooi_dedicated'
       ;;
     ewoooc-host)
diff --git a/scripts/reboot-recovery/awoooi-startup-110.sh b/scripts/reboot-recovery/awoooi-startup-110.sh
index ec566614..607e7464 100644
--- a/scripts/reboot-recovery/awoooi-startup-110.sh
+++ b/scripts/reboot-recovery/awoooi-startup-110.sh
@@ -184,15 +184,18 @@ fi
 # ──────────────────────────────────────────────
 # STEP 6: Gitea Act Runner (CI/CD core)
 # 2026-04-05 Claude Code: added to fix the Gitea runner going offline and CD failing after a reboot
-# Important: the runner must only be started after the Gitea server is up
+# 2026-06-27 Codex: 110 is the production / registry / observability host;
+# the runner stays disabled by default to reduce pressure; startup must not bring it up automatically before rate limiting / migration is complete.
 # ──────────────────────────────────────────────
-log "[6/6] Starting Gitea Act Runner..."
+log "[6/6] Checking Gitea Act Runner (not started automatically by default)..."
 RUNNER_DIR="/home/wooo/act-runner"
 RUNNER_SERVICE="gitea-act-runner-host.service"
+START_GITEA_RUNNER_ON_BOOT="${AWOOOI_START_GITEA_RUNNER_ON_BOOT:-0}"
 if [ -x "$RUNNER_DIR/act_runner" ] && [ -f "$RUNNER_DIR/config.yaml" ]; then
-    # If the old .runner config points to a stale hostname, clear it first so the runner re-registers
+    # If the old .runner config points to a stale hostname, clear it and re-register only
+    # when starting the runner is explicitly allowed; the default pressure-reduction mode must not touch registration state.
     RUNNER_FILE="$RUNNER_DIR/data/.runner"
-    if [ -f "$RUNNER_FILE" ]; then
+    if [ "$START_GITEA_RUNNER_ON_BOOT" = "1" ] && [ -f "$RUNNER_FILE" ]; then
         OLD_URL=$(python3 -c "import json; d=json.load(open('$RUNNER_FILE')); print(d.get('address',''))" 2>/dev/null || echo "")
         if [ "$OLD_URL" != "http://192.168.0.110:3001" ]; then
             log "⚠️ runner config is stale ($OLD_URL), clearing it to re-register..."
@@ -248,10 +251,14 @@ while idx < len(lines):
 path.write_text("\n".join(output) + "\n")
 PY

-    if systemctl list-unit-files "$RUNNER_SERVICE" >/dev/null 2>&1; then
-        systemctl enable --now "$RUNNER_SERVICE" >/dev/null 2>&1 || true
-    elif ! pgrep -f "$RUNNER_DIR/act_runner daemon" >/dev/null; then
-        nohup "$RUNNER_DIR/run-host-runner.sh" >> "$RUNNER_DIR/host-runner.log" 2>&1 &
+    if [ "$START_GITEA_RUNNER_ON_BOOT" = "1" ]; then
+        if systemctl list-unit-files "$RUNNER_SERVICE" >/dev/null 2>&1; then
+            systemctl enable --now "$RUNNER_SERVICE" >/dev/null 2>&1 || true
+        elif !
pgrep -f "$RUNNER_DIR/act_runner daemon" >/dev/null; then
+            nohup "$RUNNER_DIR/run-host-runner.sh" >> "$RUNNER_DIR/host-runner.log" 2>&1 &
+        fi
+    else
+        log "⏸️ Gitea host runner stays disabled; set AWOOOI_START_GITEA_RUNNER_ON_BOOT=1 to allow startup to start it"
     fi

     # The Docker-wrapped runner is disabled so it cannot steal host label jobs.
@@ -269,9 +276,11 @@ PY

     # Verify the runner is connected to Gitea
     if pgrep -f "$RUNNER_DIR/act_runner daemon" >/dev/null; then
-        log "✅ Gitea host act_runner started"
-    else
+        log "⚠️ Gitea host act_runner is currently running; confirm this is a controlled post-rate-limiting / post-migration state"
+    elif [ "$START_GITEA_RUNNER_ON_BOOT" = "1" ]; then
         log "⚠️ Gitea host act_runner may not have started yet, see: $RUNNER_DIR/host-runner.log"
+    else
+        log "✅ Gitea host act_runner remains inactive (pressure-reduction state)"
     fi
 else
     log "⚠️ act-runner binary/config not found: $RUNNER_DIR"