fix(runner): lower harbor repair schedule pressure [skip ci]
@@ -1,7 +1,7 @@
 # AWOOOI Harbor 110 Local Repair
 #
 # Controlled runtime:
-# - workflow_dispatch + low-frequency schedule only
+# - workflow_dispatch + hourly low-frequency schedule only
 # - no push / pull_request / pull_request_target trigger
 # - runs on the non-110 controlled host lane, then reaches 110 only through a
 #   bounded SSH control channel
@@ -13,12 +13,14 @@ name: AWOOOI Harbor 110 Local Repair
 on:
   workflow_dispatch:
   schedule:
-    - cron: "*/10 * * * *"
+    - cron: "17 * * * *"
 
 env:
   AWOOOI_HARBOR_110_LOCAL_REPAIR_ENABLED: "1"
   AWOOOI_110_EXPECTED_HOST_IP: 192.168.0.110
   AWOOOI_110_SSH_TARGET: wooo@192.168.0.110
+  AWOOOI_110_SSH_CONNECT_TIMEOUT_SECONDS: "3"
+  AWOOOI_110_SSH_COMMAND_TIMEOUT_SECONDS: "12"
   AWOOOI_HARBOR_110_LOCAL_REPAIR_TRIGGER: ${{ github.event_name }}
 
 jobs:
@@ -60,9 +62,9 @@ jobs:
 ssh_base=(
   ssh
   -o BatchMode=yes
-  -o ConnectTimeout=8
-  -o ServerAliveInterval=5
-  -o ServerAliveCountMax=2
+  -o ConnectTimeout="${AWOOOI_110_SSH_CONNECT_TIMEOUT_SECONDS}"
+  -o ServerAliveInterval=3
+  -o ServerAliveCountMax=1
   "${AWOOOI_110_SSH_TARGET}"
 )
 SSH_PROBE_ATTEMPTS="${AWOOOI_110_SSH_PROBE_ATTEMPTS:-6}"
@@ -73,7 +75,7 @@ jobs:
 attempt=1
 rc=1
 while [ "${attempt}" -le "${SSH_PROBE_ATTEMPTS}" ]; do
-  if timeout 30 "${ssh_base[@]}" "$@"; then
+  if timeout "${AWOOOI_110_SSH_COMMAND_TIMEOUT_SECONDS}" "${ssh_base[@]}" "$@"; then
     echo "harbor_110_remote_ssh_probe_attempt=${attempt} result=success"
     return 0
   else
@@ -98,7 +100,7 @@ jobs:
   -o KbdInteractiveAuthentication=no \
   -o GSSAPIAuthentication=no \
   -o NumberOfPasswordPrompts=0 \
-  -o ConnectTimeout=8 \
+  -o ConnectTimeout="${AWOOOI_110_SSH_CONNECT_TIMEOUT_SECONDS}" \
   -o ConnectionAttempts=1 \
   -o ServerAliveInterval=3 \
   -o ServerAliveCountMax=1 \
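The probe loop in the hunks above retries a read-only SSH command a bounded number of times and returns the real failure code when every attempt fails. A minimal Python analogue of that control flow is sketched below. It is not the workflow's code: `probe_with_retry`, `run_once`, and the `result=failed rc=...` message are hypothetical names for illustration, while the defaults of 6 attempts and a 10-second sleep follow the `SSH_PROBE_ATTEMPTS` and `SSH_PROBE_SLEEP_SECONDS` defaults pinned by the profile test.

```python
import time
from typing import Callable


def probe_with_retry(
    run_once: Callable[[], int],
    attempts: int = 6,
    sleep_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run a read-only probe up to `attempts` times.

    Returns 0 on the first success, otherwise the exit code of the last
    failed attempt (for example 124 when GNU `timeout` kills the command).
    """
    rc = 1  # mirrors the `rc=1` initial value in the shell loop
    for attempt in range(1, attempts + 1):
        rc = run_once()
        if rc == 0:
            print(f"harbor_110_remote_ssh_probe_attempt={attempt} result=success")
            return 0
        # Keep the original failure code so repeated timeouts are never
        # reported as success. The message format here is hypothetical.
        print(f"harbor_110_remote_ssh_probe_attempt={attempt} result=failed rc={rc}")
        if attempt < attempts:
            sleep(sleep_seconds)  # assumption: pause only between attempts
    return rc
```

Injecting `sleep` keeps the sketch testable without real delays, for example `probe_with_retry(lambda: 124, attempts=3, sleep=lambda s: None)` returns `124`.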
@@ -1,9 +1,25 @@
+## 2026-07-01 - 09:47 Harbor remote repair schedule / SSH timeout pressure guard
+
+**Issues fixed in line with the mainline**:
+- Latest readback: Gitea public health returns `200`, but the public/internal Harbor registry `/v2` still returns `502`; TCP ports `22/2222/5000/3000` on 110 are open, but `ssh wooo@192.168.0.110` still times out. The main Gitea queue status remains `blocked_harbor_110_remote_control_channel_unavailable`.
+- `.gitea/workflows/harbor-110-local-repair.yaml` was previously scheduled every 10 minutes, so it kept creating repair runs even while the 110 SSH control channel was unavailable; this cannot repair 110 and only adds Gitea / runner noise.
+- Reduce the Harbor 110 repair schedule from `*/10 * * * *` to `17 * * * *`, and change the bounded SSH control check to `ConnectTimeout=3` with a command timeout of `12s`; workflow_dispatch is retained for controlled repair once the 110 control path recovers.
+- The profile test pins the low-frequency / bounded-timeout rules above so that later changes cannot revert to high-frequency retry.
+
+**Verification**:
+- `python3.11 -m pytest ops/runner/test_cd_controlled_runtime_profile.py -q` passed.
+- `python3.11 - <<'PY' ... yaml.safe_load(.gitea/workflows/harbor-110-local-repair.yaml) ... PY` passed.
+- `python3.11 ops/runner/guard-gitea-runner-pressure.py --root .` passed.
+- `git diff --check` passed.
+
+**Boundary**: changed only the Gitea Harbor repair workflow schedule / SSH timeout / tests / LOGBOOK; did not read secret / token / `.env` / raw sessions / SQLite / auth; did not trigger workflow_dispatch; did not reboot the host or restart Docker / Nginx / K3s / DB / firewall.
+
 ## 2026-07-01 - 09:44 Harbor repair SSH probe bounded retry
 
 **Issues fixed in line with the mainline**:
 - Latest live truth: CD `#4215` still fails because Harbor public `/v2/` = `502`; the concrete blocker for Harbor repair `#4212` is `harbor_110_remote_control_channel_unavailable`.
 - The 188 non-110 runner lane reads back as ready with normal host pressure, but the 188 → 110 bounded SSH probe is intermittent: one `true` call can succeed, and the next `recover-110-control-path-and-harbor-local.sh --check` times out again.
-- `.gitea/workflows/harbor-110-local-repair.yaml` adds bounded retry to the non-writing SSH probe / verifier: `6` attempts by default, each still limited by `ConnectTimeout=8`, `ServerAlive*`, and the outer `timeout 30`, and it emits a `harbor_110_remote_ssh_probe_attempt=...` receipt. `run_recovery --apply-all` is not retried automatically, so a partial apply is never re-run.
+- `.gitea/workflows/harbor-110-local-repair.yaml` adds bounded retry to the non-writing SSH probe / verifier: `6` attempts by default, each still limited by `ConnectTimeout`, `ServerAlive*`, and the command timeout, and it emits a `harbor_110_remote_ssh_probe_attempt=...` receipt. `run_recovery --apply-all` is not retried automatically, so a partial apply is never re-run.
 - Follow-up fix: the retry failure branch must save the original `rc` inside `else`, so that the shell `if` compound status does not misrecord consecutive timeouts as `rc=0` / success.
 
 **Verification**:
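The schedule change recorded in the LOGBOOK entry can be quantified with a little arithmetic. The sketch below counts how many times per day each cron expression fires, assuming standard five-field cron semantics. `minute_matches` and `runs_per_day` are hypothetical helpers, not part of the repository, and they only handle schedules whose hour, day, month, and weekday fields are all `*`.

```python
def minute_matches(field: str) -> set[int]:
    """Expand a cron minute field made of `*`, `*/n`, or plain integers."""
    minutes: set[int] = set()
    for part in field.split(","):
        if part == "*":
            minutes.update(range(60))
        elif part.startswith("*/"):
            minutes.update(range(0, 60, int(part[2:])))
        else:
            minutes.add(int(part))
    return minutes


def runs_per_day(cron: str) -> int:
    """Count daily firings for a schedule that varies only the minute field."""
    minute, hour, day_of_month, month, day_of_week = cron.split()
    if (hour, day_of_month, month, day_of_week) != ("*", "*", "*", "*"):
        raise ValueError("sketch only handles schedules that vary the minute field")
    return len(minute_matches(minute)) * 24


print(runs_per_day("*/10 * * * *"))  # 144 scheduled runs per day before the change
print(runs_per_day("17 * * * *"))    # 24 scheduled runs per day after the change
```

Under these assumptions the change cuts scheduled repair runs from 144 to 24 per day, a sixfold reduction in runs created while the 110 control channel is unavailable.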
@@ -122,7 +122,7 @@ def test_harbor_110_local_repair_workflow_is_dispatch_only_and_bounded() -> None
 
     assert "workflow_dispatch:" in text
     assert "schedule:" in text
-    assert 'cron: "*/10 * * * *"' in text
+    assert 'cron: "17 * * * *"' in text
     assert "push:" not in text
     assert "pull_request:" not in text
     assert "pull_request_target:" not in text
@@ -135,7 +135,6 @@ def test_harbor_110_local_repair_workflow_is_dispatch_only_and_bounded() -> None
     assert "sudo -n env" in text
     assert "AWOOOI_110_SSH_TARGET" in text
     assert "BatchMode=yes" in text
-    assert "ConnectTimeout=8" in text
     assert 'SSH_PROBE_ATTEMPTS="${AWOOOI_110_SSH_PROBE_ATTEMPTS:-6}"' in text
     assert (
         'SSH_PROBE_SLEEP_SECONDS="${AWOOOI_110_SSH_PROBE_SLEEP_SECONDS:-10}"'
@@ -143,6 +142,10 @@ def test_harbor_110_local_repair_workflow_is_dispatch_only_and_bounded() -> None
     )
     assert "else\n rc=$?" in text
     assert "harbor_110_remote_ssh_probe_attempt=" in text
+    assert 'AWOOOI_110_SSH_CONNECT_TIMEOUT_SECONDS: "3"' in text
+    assert 'AWOOOI_110_SSH_COMMAND_TIMEOUT_SECONDS: "12"' in text
+    assert 'ConnectTimeout="${AWOOOI_110_SSH_CONNECT_TIMEOUT_SECONDS}"' in text
+    assert 'timeout "${AWOOOI_110_SSH_COMMAND_TIMEOUT_SECONDS}"' in text
     assert "operation_boundary_remote_ssh_bounded=true" in text
     assert "harbor_110_remote_control_channel_unavailable" in text
     assert "harbor_110_remote_repair_check_start=1" in text
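The timeout values pinned by the test bound how long a single probe loop can occupy the runner. The following sketch estimates that worst case before and after the change. It assumes every attempt runs for the full command timeout, that the loop sleeps only between attempts, and that the defaults of 6 attempts and a 10-second sleep apply. `ProbeProfile` and `keepalive_detection_seconds` are hypothetical names introduced for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeProfile:
    """Timing knobs for one bounded SSH probe loop."""

    command_timeout_seconds: int
    attempts: int = 6        # default of AWOOOI_110_SSH_PROBE_ATTEMPTS
    sleep_seconds: int = 10  # default of AWOOOI_110_SSH_PROBE_SLEEP_SECONDS

    def worst_case_seconds(self) -> int:
        # Every attempt hits the outer `timeout`; sleeps occur between
        # attempts only (an assumption about the loop body).
        busy = self.attempts * self.command_timeout_seconds
        idle = (self.attempts - 1) * self.sleep_seconds
        return busy + idle


def keepalive_detection_seconds(interval: int, count_max: int) -> int:
    """Approximate time for ssh to drop an unresponsive established session."""
    return interval * count_max


old = ProbeProfile(command_timeout_seconds=30)
new = ProbeProfile(command_timeout_seconds=12)
print(old.worst_case_seconds())           # 230
print(new.worst_case_seconds())           # 122
print(keepalive_detection_seconds(5, 2))  # 10 (ServerAliveInterval=5, CountMax=2)
print(keepalive_detection_seconds(3, 1))  # 3  (ServerAliveInterval=3, CountMax=1)
```

Under these assumptions a fully failing probe loop shrinks from roughly 230 seconds to roughly 122 seconds, and a stalled session is abandoned after about 3 seconds instead of about 10.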