fix(reboot): repair 110 ssh control path on startup
Some checks failed
CD Pipeline / workflow-shape (push) Successful in 0s
CD Pipeline / cancel-stale-cd (push) Has been skipped
CD Pipeline / tests (push) Successful in 48s
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled

This commit is contained in:
Your Name
2026-07-02 10:53:16 +08:00
parent 37027382b9
commit d84be37e26
3 changed files with 57 additions and 0 deletions

View File

@@ -1,3 +1,22 @@
## 2026-07-02 — 10:52 110 startup SSH control path metadata 自動修復
**照主線修正的問題**
- 全主機 cold-start readback 顯示 110 唯一 BLOCKED 是 read-only SSH command path timeout`SSH_110_BLOCKER remote_control_channel_unavailable`node-exporter / runner pressure evidence 顯示 110 runner job count 為 0、load classifier 不是高負載,因此主線問題收斂為重啟後 control path metadata / session path 需能自動修復。
- `scripts/reboot-recovery/awoooi-startup-110.sh` 新增 `repair_ssh_control_path_metadata`,在 Docker / Harbor / runner 啟動前先呼叫 `/usr/local/bin/repair-110-ssh-publickey-auth-local.sh --apply`,預設 target user `wooo``RELOAD_SSH=0`,只修復 `.ssh` / `authorized_keys` ownership / mode 與 `sshd -t` syntax guard。
- 這條路徑不讀、不列印、不建立 key materialhelper 缺失或修復失敗時記錄 warning 並繼續既有 startup recovery避免 SSH guard 反過來阻斷 Docker / Harbor recovery。
- `scripts/reboot-recovery/tests/test_cold_start_monitor_bounded_probes.py` 新增 contract鎖定 SSH metadata repair 必須在 Docker 前執行、必須使用 local-only repair helper、預設不 reload SSH、不得包含 `cat authorized_keys` / `ssh-keygen` / `ssh-copy-id`
**驗證**
- `bash -n scripts/reboot-recovery/awoooi-startup-110.sh scripts/reboot-recovery/repair-110-ssh-publickey-auth-local.sh scripts/reboot-recovery/recover-110-control-path-and-harbor-local.sh`:通過。
- `python3.11 -m pytest scripts/reboot-recovery/tests/test_cold_start_monitor_bounded_probes.py scripts/reboot-recovery/tests/test_recover_110_control_path_and_harbor_local.py -q``20 passed`
- `python3.11 -m pytest ops/runner/test_cd_controlled_runtime_profile.py -q``43 passed`
- `python3.11 ops/runner/guard-gitea-runner-pressure.py --root .``GITEA_RUNNER_PRESSURE_GUARD_OK workflow_files=12 scheduled_workflows=4 auto_branch_events_on_110=0 generic_runner_labels=0`
- `git diff --check`:通過。
**邊界**:未使用 GitHub / `gh` / GitHub API未讀 secret / token / `.env` / raw sessions / SQLite / auth未重啟主機未 restart Docker / Nginx / K3s / DB / firewall未 workflow_dispatch未恢復 generic runner。當前 110 live SSH command path 仍需等此版部署到 startup 後由本機 startup / console path 套用,不能把 source 修正宣稱成 runtime 已解。
**下一步**commit / push Gitea `main`,讀回 Gitea CD controlled-runtimeCD 成功後再重跑 cold-start scorecard確認 110 command path 是否已恢復,若仍 blocked 則走 local console recovery package 驗證,不展開支線。
## 2026-07-01 — 23:52 Gitea CD #4315 B5 socket / queue historical blocker 修正
**照主線修正的問題**

View File

@@ -99,7 +99,26 @@ quarantine_corrupt_docker_container_metadata() {
log "Docker corrupt metadata quarantine dir=$qdir count=$count"
}
repair_ssh_control_path_metadata() {
local script="${AWOOOI_110_SSH_REPAIR_SCRIPT:-/usr/local/bin/repair-110-ssh-publickey-auth-local.sh}"
local target_user="${AWOOOI_110_SSH_REPAIR_USER:-wooo}"
local reload_ssh="${AWOOOI_110_SSH_RELOAD_AFTER_REPAIR:-0}"
log "[0/6] 修復 110 SSH control path metadata..."
if [ ! -x "$script" ]; then
log "⚠️ SSH control path metadata repair helper missing or not executable: $script"
return 0
fi
if TARGET_USER="$target_user" RELOAD_SSH="$reload_ssh" "$script" --apply; then
log "✅ 110 SSH control path metadata guard completed target_user=$target_user reload_ssh=$reload_ssh"
else
log "⚠️ 110 SSH control path metadata guard failed; continuing startup"
fi
}
log "=== 192.168.0.110 啟動序列開始 ==="
repair_ssh_control_path_metadata
# ──────────────────────────────────────────────
# STEP 1: Docker 修復(若 BoltDB 損壞)

View File

@@ -129,6 +129,25 @@ def test_startup_110_quarantines_corrupt_docker_container_metadata() -> None:
assert 'run_bounded "$DOCKER_START_TIMEOUT_SECONDS" systemctl start docker.socket docker.service' in text
def test_startup_110_repairs_ssh_control_path_metadata_before_docker() -> None:
text = STARTUP_110.read_text(encoding="utf-8")
repair_fn = text.index("repair_ssh_control_path_metadata()")
repair_call = text.index("repair_ssh_control_path_metadata", repair_fn + 1)
docker_step = text.index('log "[1/5] 檢查 Docker..."')
assert repair_fn < repair_call < docker_step
assert 'AWOOOI_110_SSH_REPAIR_SCRIPT:-/usr/local/bin/repair-110-ssh-publickey-auth-local.sh' in text
assert 'AWOOOI_110_SSH_REPAIR_USER:-wooo' in text
assert 'AWOOOI_110_SSH_RELOAD_AFTER_REPAIR:-0' in text
assert 'TARGET_USER="$target_user" RELOAD_SSH="$reload_ssh" "$script" --apply' in text
assert "SSH control path metadata guard completed" in text
assert "cat \"$home_dir/.ssh/authorized_keys\"" not in text
assert "cat ~/.ssh/authorized_keys" not in text
assert "ssh-keygen" not in text
assert "ssh-copy-id" not in text
def test_startup_110_opens_only_controlled_cd_lane_after_guardrails() -> None:
text = STARTUP_110.read_text(encoding="utf-8")