diff --git a/.gitea/workflows/cd.yaml b/.gitea/workflows/cd.yaml
index ceda902b..48d0bf72 100644
--- a/.gitea/workflows/cd.yaml
+++ b/.gitea/workflows/cd.yaml
@@ -254,6 +254,8 @@ jobs:
               ;;
             docs/LOGBOOK.md)
               ;;
+            docs/runbooks/REBOOT-RECOVERY-SOP.md)
+              ;;
             docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md)
               ;;
             docs/runbooks/FULL-STACK-COLD-START-SOP.md)
@@ -580,6 +582,8 @@ jobs:
               ;;
             scripts/reboot-recovery/deploy-to-188.sh)
               ;;
+            scripts/reboot-recovery/enforce-110-runner-failclosed.sh)
+              ;;
             scripts/reboot-recovery/recover-110-control-path-and-harbor-local.sh)
               ;;
             scripts/reboot-recovery/apply-credential-escrow-closeout-receipt-to-110.sh)
@@ -821,6 +825,7 @@ jobs:
             ../../ops/runner/install-awoooi-non110-runner-user-service.sh \
             ../../scripts/reboot-recovery/deploy-to-110.sh \
             ../../scripts/reboot-recovery/deploy-to-188.sh \
+            ../../scripts/reboot-recovery/enforce-110-runner-failclosed.sh \
             ../../scripts/reboot-recovery/recover-110-control-path-and-harbor-local.sh \
             ../../scripts/reboot-recovery/awoooi-startup.sh \
             ../../scripts/reboot-recovery/install-reboot-auto-recovery-slo-110.sh \
diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index 3e664e84..c9b7ef46 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -52405,6 +52405,30 @@ production browser smoke:
 **Next steps**:
 - After commit / push to Gitea main, read back CD; then sync the new enforcer to 110 in a controlled manner and rerun the non-secret guardrail apply and `check-awoooi-110-controlled-cd-lane-readiness.sh`. The goal is for active blockers to converge to registration / service inactive, with no more false blockers caused by the enforcer re-sealing config / binary / unit.
 
+## 2026-07-02 — P0 110 controlled drain live staging and CD #4341 B5 mis-run fix
+
+**Completed**:
+- Source commit `c1823b5f6 fix(ops): preserve controlled drain lane staging` has been pushed to Gitea main. Live 110 has received, via controlled sync, the new enforcer, the readiness verifier, `awoooi-cd-lane-drain.service`, and the narrow-label `config.yaml`, and the ELF `awoooi_cd_lane_controlled` was extracted from the existing `gitea/act_runner:latest`.
+- The live apply explicitly reported `SERVICE_STARTED=0`, `REGISTRATION_TOUCHED=0`, `operation_boundary_runner_token_read=false`, `operation_boundary_raw_runner_registration_read=false`: no runner was registered, no service was started, and the `.runner` contents were not read.
+- 110 enforcer readback: `CONTROLLED_DRAIN_STAGING_ALLOWED=1`, `RUNNER_UNITS_BAD_COUNT=0`, `awoooi-cd-lane-drain.service load=loaded active=inactive unitfile=disabled`; legacy / generic runners all remain masked / inactive.
+- 110 readiness verifier readback: `CONFIG_READY=1`, `BINARY_READY=1`, `REGISTRATION_READY=0`, `SERVICE_READY=0`, `LEGACY_FAILCLOSED=1`, `PRIMARY_LANE_FAILCLOSED=1`, `BLOCKER_COUNT=2`; the only remaining blockers are `controlled_cd_lane_registration_missing` and `controlled_cd_lane_service_not_active`.
+- The failure of Gitea CD `#4341` has been traced to the tests job mistakenly running full-profile B5: `BLOCKER b5_docker_socket_unavailable`. The root cause is that this round's changes include `docs/runbooks/REBOOT-RECOVERY-SOP.md` and `scripts/reboot-recovery/enforce-110-runner-failclosed.sh`, but these two paths were not in the controlled-runtime allowlist.
+- `.gitea/workflows/cd.yaml` now adds both paths to the controlled-runtime profile and adds the enforcer to the controlled-runtime `bash -n` syntax check; `ops/runner/test_cd_controlled_runtime_profile.py` adds a regression test that keeps similar recovery/enforcer patches from landing on the B5 Docker socket path again.
+
+**Local verification results**:
+- `python3.11 -m pytest ops/runner/test_cd_controlled_runtime_profile.py -q`: `43 passed`.
+- `python3 ops/runner/guard-gitea-runner-pressure.py --root .`: passed, `auto_branch_events_on_110=0`, `generic_runner_labels=0`.
+- `node scripts/ci/check-gitea-step-env-secrets.js`: passed, `no Gitea run/with secrets or legacy Telegram routes`.
+- `git diff --check`: passed.
+
+**Still in effect**:
+- No secret / token / `.env` / raw sessions / SQLite / auth was read; the `.runner` contents were not read.
+- GitHub / gh / GitHub API / GitHub Actions were not used.
+- No host reboot, no Docker / Nginx / K3s / DB / firewall restart, no workflow_dispatch, no DROP / TRUNCATE / restore / prune.
+
+**Next steps**:
+- Commit / push the workflow classifier fix, read back the new Gitea CD run, and confirm that tests take the controlled-runtime profile and skip B5; runner registration still has to be completed through a token-safe path before `awoooi-cd-lane-drain.service` can be started.
+
 ## 2026-07-01 — 08:50 P0 188 DB circuit breaker post-push readback
 
 **Completed**:
diff --git a/ops/runner/test_cd_controlled_runtime_profile.py b/ops/runner/test_cd_controlled_runtime_profile.py
index 5c67c320..136be8c3 100644
--- a/ops/runner/test_cd_controlled_runtime_profile.py
+++ b/ops/runner/test_cd_controlled_runtime_profile.py
@@ -743,6 +743,7 @@ def test_post_start_recovery_verifiers_stay_on_controlled_runtime_profile() -> N
     text = _workflow_text()
     expected_sources = [
         "docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md)",
+        "docs/runbooks/REBOOT-RECOVERY-SOP.md)",
         "docs/runbooks/FULL-STACK-COLD-START-SOP.md)",
         "docs/operations/host-cpu-pressure-drain-readback-2026-07-01.snapshot.json)",
         "docs/operations/post-reboot-runtime-recovery-readback-2026-07-01.snapshot.json)",
@@ -759,6 +760,7 @@ def test_post_start_recovery_verifiers_stay_on_controlled_runtime_profile() -> N
         "scripts/ops/host-runaway-process-exporter.py)",
         "scripts/ops/host-sustained-load-evidence.py)",
         "scripts/reboot-recovery/deploy-to-110.sh)",
+        "scripts/reboot-recovery/enforce-110-runner-failclosed.sh)",
         "scripts/reboot-recovery/recover-110-control-path-and-harbor-local.sh)",
         "scripts/reboot-recovery/apply-credential-escrow-closeout-receipt-to-110.sh)",
         "scripts/reboot-recovery/post-start-quick-check.sh)",
@@ -791,6 +793,7 @@ def test_post_start_recovery_verifiers_stay_on_controlled_runtime_profile() -> N
         "../../ops/reboot-recovery/full-stack-cold-start-baseline.yml",
         "../../ops/runner/check-awoooi-110-controlled-cd-lane-readiness.sh",
         "../../scripts/reboot-recovery/deploy-to-110.sh",
+        "../../scripts/reboot-recovery/enforce-110-runner-failclosed.sh",
         "../../scripts/reboot-recovery/recover-110-control-path-and-harbor-local.sh",
         "../../scripts/reboot-recovery/apply-credential-escrow-closeout-receipt-to-110.sh",
         "../../scripts/reboot-recovery/post-start-quick-check.sh",
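
Review note: the logbook entry attributes the CD `#4341` failure to two changed paths missing from the controlled-runtime allowlist, which sent the tests job to the full profile and its B5 Docker socket check. The sketch below is a minimal Python model of that classification rule, not the workflow's actual shell `case` logic. The names `CONTROLLED_RUNTIME_PATHS` and `select_profile` are hypothetical, the allowlist is only a small subset of the paths named in the diff, and treating an empty change set as "full" is an assumption.

```python
from typing import Iterable

# Hypothetical allowlist: a small subset of the paths named in the diff.
CONTROLLED_RUNTIME_PATHS = frozenset({
    "docs/LOGBOOK.md",
    "docs/runbooks/REBOOT-RECOVERY-SOP.md",
    "docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md",
    "scripts/reboot-recovery/deploy-to-188.sh",
    "scripts/reboot-recovery/enforce-110-runner-failclosed.sh",
})


def select_profile(changed_paths: Iterable[str]) -> str:
    """Return "controlled-runtime" only if every changed path is allowlisted.

    A single unlisted path falls through to "full", which mirrors the
    failure mode described in the logbook entry. An empty change set also
    maps to "full" here (an assumption made for this sketch).
    """
    paths = list(changed_paths)
    if paths and all(path in CONTROLLED_RUNTIME_PATHS for path in paths):
        return "controlled-runtime"
    return "full"
```

Under this model, a change set containing only `docs/LOGBOOK.md` and the enforcer script selects `controlled-runtime`, while adding any unlisted file selects `full`. That is why the regression test pins both new paths in the workflow text.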