diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index 8980f121..781aef77 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -1,3 +1,30 @@
+## 2026-07-01 — 21:05 backup / DR freshness convergence and high-frequency backup protection
+
+**Issues fixed along the main line**:
+- 188 `backup_from_110` stale: a controlled rerun of `/home/ollama/bin/backup-from-110.sh` succeeded for the Harbor / Gitea / bitan backups; the 188 exporter reads back `backup_from_110 fresh=1`, Gitea private bundle `expected_repo_missing_count=0`, and `failed_repo_count=0`.
+- 110 DR phase misjudgment: when the offsite full marker was fresh but the partial marker was stale, `backup-health-textfile-exporter.py` wrongly fell back to `run_small_dry_run_then_partial_sync`; it now pins the next step to `complete_credential_escrow_review` when full is fresh and escrow is missing, and a regression test was added.
+- 110 `awoooi_db` high-frequency backup stale: the 20:00 cron failed with `too many connections for role "awoooi"`. Readback showed that the `awoooi` role originally had `CONNECTION LIMIT 2`, that the DB had `0` legacy OR queries at that moment, and that Postgres CPU was already low, so a controlled short window was used: temporarily raise the limit to `4`, lift `statement_timeout`, run one backup, and restore the role afterwards. The official conservative value was then set to `CONNECTION LIMIT 4`, keeping `statement_timeout=750ms`, `max_parallel_workers_per_gather=0`, and `enable_seqscan=off`.
+- High-frequency backup script source-of-truth fix: `scripts/backup/backup-awoooi-frequent.sh` drops the old hard-coded DB password and instead resolves `AWOOOI_BACKUP_DATABASE_URL` / `BACKUP_DATABASE_URL` / `DATABASE_URL` from the K8s Secret, in that key preference order and only inside the script; it uses a temporary `.pgpass` and sets `PGOPTIONS` so the backup session overrides the role-level `statement_timeout`. The backup script is deployed on live 110, hash `01c5de7d08bb21d88aa39fd4d195e70131fa4335a8fc0b6e3f101d2bc9747142`.
+
+**Verification**:
+- `python3.11 -m pytest scripts/ops/tests/test_backup_health_textfile_exporter.py -q`: `3 passed`.
+- `python3.11 -m py_compile scripts/ops/backup-health-textfile-exporter.py`: passed.
+- `bash -n scripts/backup/backup-awoooi-frequent.sh`: passed.
+- 110 high-frequency backup: `pg_dump 28G`, Restic snapshot `211a8948`, `2026-07-01T20:58:20+08:00`, tags `service:awoooi,freq:6h,timestamp:20260701_204344`; tmp dump removed.
+- 110 backup exporter: `awoooi_backup_job_fresh{job="awoooi_db"} 1`, `awoooi_backup_dr_phase{next_step="complete_credential_escrow_review"} 3`, `awoooi_backup_dr_credential_escrow_missing_count 5`.
+- `bash scripts/reboot-recovery/full-stack-recovery-scorecard.sh`: `CORE_COLD_START_WARN_GATES=3`, `CORE_COLD_START_BLOCKED_GATES=0`, `CORE_COLD_START_DEPLOY_PARITY=1`, `CORE_REGISTRY_READY=1`, `DR_OFFSITE_EVIDENCE_READBACK=1`, `NEXT_STEP=complete_credential_escrow_review`.
+- `bash scripts/reboot-recovery/full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1`: `PASS=93 WARN=3 BLOCKED=0`.
+
+**Still held / unfinished**:
+- Do not declare full green or that the 10-minute fully automatic recovery is complete; the remaining WARNs are the MOMO current-month/source freshness warning and the failed components in the last 110 aggregate `backup-all` run.
+- `ESCROW_MISSING_COUNT=5` remains the real DR evidence gate; do not forge markers or treat a general approval as credential escrow evidence.
+- This high-frequency full `pg_dump` took `1152s` and produced a `28G` dump, which shows that a full DB dump every 6 hours is too heavy for both 188 and 110. A follow-up P0 must create a dedicated backup DB role / `AWOOOI_BACKUP_DATABASE_URL` and switch the large rebuildable snapshot tables to tiered backup or retention before including them in the full backup; relaxing the app role can no longer be the only long-term fix.
+- The 110 SSH control path still has intermittent timeouts, and the post-backup readback shows load5 around `10.41` with `systemd` / `gitea` / ClickHouse / Kafka at the top of `top`; services are available, but do not declare the 110 control path permanently stable.
+
+**Boundaries**: no host reboot; no restart of Docker / Nginx / K3s / DB / firewall; no reads of secret values / tokens / `.env` / raw sessions / SQLite / auth; no credential escrow marker written; no use of GitHub / `gh` / GitHub API; generic runner not restored.
+
+**Next step**: continue converging the remaining WARNs under P0: first handle MOMO source freshness / current-month confirmation and the 110 `backup-all` failed components; separately open a controlled design that moves the high-frequency full pg_dump to a dedicated backup role plus tiered backup, so the next reboot or cron run does not create DB/IO pressure again.
+
 ## 2026-07-01 — 20:34 P0-006 reboot SLO machine-readback source closure
 
 **Issues fixed along the main line**:
diff --git a/docs/runbooks/FULL-STACK-COLD-START-SOP.md b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
index aebd30ec..47698c62 100644
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -1,6 +1,6 @@
 # AWOOOI Full-Stack Cold Start and Host Reboot SOP
 
-> Version: v1.87
+> Version: v1.88
 > Last updated: 2026-07-01 Asia/Taipei
 > Scope: 110 / 120 / 121 / 188 full-stack
reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.
 
@@ -18,6 +18,8 @@ v1.79 active owner response template rule: after the owner packet for the same round is generated, p
 v1.80 / v1.81 credential escrow intake scorecard rule: after the owner response preflight in the same round, the DR escrow gate must be converged with `scripts/reboot-recovery/post-reboot-credential-escrow-intake-scorecard.py --summary-file "$ARTIFACT_DIR/summary.txt" --owner-packet-file --response-file --offsite-report-file --escrow-status-file `. The scorecard reads only sanitized artifacts; it must not read secret values, write markers, send owner requests, or open the runtime gate. The placeholder readback is expected to be `STATUS=blocked_waiting_non_secret_credential_escrow_evidence`, `EFFECTIVE_ESCROW_MISSING_COUNT=5`, `OWNER_RESPONSE_RECEIVED_COUNT=0`, `OWNER_RESPONSE_ACCEPTED_COUNT=0`, `RUNTIME_GATE_COUNT=0`, `CREDENTIAL_MARKER_WRITE_AUTHORIZED_COUNT=0`. If a qualified redacted owner response is received later and the preflight returns `ready_for_independent_reviewer_acceptance`, the scorecard should move to `STATUS=ready_for_independent_reviewer_acceptance`; even though the marker has not been written yet, it may only proceed to `independent_reviewer_acceptance_then_marker_dry_run` and must not write the marker directly or declare `DR_COMPLETE`.
 
+2026-07-01 21:05 latest live summary: backup / DR freshness has converged, but full green or a completed 10-minute fully automatic recovery still must not be declared. 188 `backup_from_110` was rerun under control and succeeded, with the Harbor / Gitea / bitan backups all OK; the 188 exporter reads back `backup_from_110 fresh=1`, Gitea private bundle `expected_repo_missing_count=0`, and `failed_repo_count=0`. The 110 `awoooi_db` high-frequency backup at 20:00 failed with `too many connections for role "awoooi"`; in this round, with `0` legacy OR queries on the DB and Postgres CPU already low, a controlled short window was used and produced Restic snapshot `211a8948` at `2026-07-01T20:58:20+08:00`, tags `service:awoooi,freq:6h,timestamp:20260701_204344`, with the tmp dump removed. The official conservative setting is `awoooi CONNECTION LIMIT 4`, keeping `statement_timeout=750ms`, `max_parallel_workers_per_gather=0`, and `enable_seqscan=off`; rollback is `ALTER ROLE awoooi CONNECTION LIMIT 2` with the same role config kept. The 110 backup exporter reads back `awoooi_backup_job_fresh{job="awoooi_db"} 1`, `awoooi_backup_dr_phase{next_step="complete_credential_escrow_review"} 3`, and `awoooi_backup_dr_credential_escrow_missing_count 5`. `full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1` returns `PASS=93 WARN=3 BLOCKED=0`; `full-stack-recovery-scorecard.sh` returns `CORE_COLD_START_WARN_GATES=3`, `CORE_COLD_START_BLOCKED_GATES=0`, `CORE_COLD_START_DEPLOY_PARITY=1`, `CORE_REGISTRY_READY=1`, `DR_OFFSITE_EVIDENCE_READBACK=1`, `NEXT_STEP=complete_credential_escrow_review`. Allowed declaration: public routes / AWOOOI service / Gitea / Harbor registry / 188 backup-from-110 / 110 awoooi_db freshness are restored, cold-start hard blockers `0`. Forbidden declaration: full green, 10-minute fully automatic recovery complete, DR complete, credential escrow complete, MOMO data up to date, 110 SSH permanently stable. The remaining WARNs are the MOMO current-month/source freshness warning and the failed components in the last 110 aggregate `backup-all` run. This round's high-frequency full `pg_dump` took `1152s` and produced a `28G` dump; a follow-up P0 must create a dedicated backup DB role / `AWOOOI_BACKUP_DATABASE_URL` and switch the large rebuildable snapshot tables to tiered backup or retention before including them in the full backup; relaxing the app role must not be treated as the only long-term fix.
+
 2026-07-01 19:32 latest live summary: the root cause of 110 being stuck has been confirmed and fixed. Host ping / SSH / Prometheus / Gitea / registry were all reachable, but `awoooi-startup-110.service` always ended in `exit 75` because `/usr/local/bin/awoooi-startup-110.sh` had been replaced by the runner/CD lane fail-closed stub, which caused the post-reboot systemd degraded state and the cold-start warning. The live stub hash was `55e1e87d44cf5b8cefb714a49d085b886697d856569ce9478537059d49129f88`; the full startup script hash in the repo is `fcc67c7dde889b3cf8ffeca89cc5e58407fabc6f5a80554e39f3f1465f90b318`. The fix must always use `scripts/reboot-recovery/repair-110-startup-script-stub.sh`: the default `--verify` is read-only; `--apply` only uploads the full startup script from the repo, backs up the live stub, removes the immutable flag, installs to `/usr/local/bin/awoooi-startup-110.sh`, runs `systemctl daemon-reload`, and only runs `reset-failed awoooi-startup-110.service`; `--start-service` must be passed explicitly and separately because it runs the full startup checks. The live apply command was `BACKUP_SUFFIX=before-full-startup-restore-20260701-1928 bash scripts/reboot-recovery/repair-110-startup-script-stub.sh --apply`; post-readback showed `remote_matches_expected=1` and `remote_is_runner_failclosed_stub=0`, and the rollback backup is `/usr/local/bin/awoooi-startup-110.sh.before-full-startup-restore-20260701-1928`. The stale failed state of `systemd-ask-password-console.path` / `systemd-ask-password-wall.path` was then cleared; 110 is back to `systemctl is-system-running=running` with failed units `0`. The latest cold-start readback is `PASS=90 WARN=6 BLOCKED=0`, `FAILED_UNITS_110 0`; the 110 startup stub / systemd degraded state is no longer a blocker. Never again seal the live `/usr/local/bin/awoooi-startup-110.sh` into a stub as if it were a runner/CD lane opener; the runner fail-closed enforcer may only seal the runner binary, runner unit, controlled opener artifact, or upload opener source, and must not seal the host startup recovery script. Allowed declaration: 110 is no longer stuck on the startup stub / systemd degraded state, cold-start hard blockers `0`. Forbidden declaration: full green, 10-minute fully automatic recovery complete, DR complete, credential escrow complete, MOMO data up to date, or 110 SSH permanently stable; the remaining warnings still have to be handled by the backup / DR escrow / alert convergence.
 
 2026-07-01 19:13 live summary: the service hard blockers after the all-host reboot have been cleared to zero, but DR complete or full green still must not be declared. `full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1` returns `PASS=89 WARN=7 BLOCKED=0`; public routes / TLS, including `https://signoz.wooo.work/`, all return the expected 2xx/3xx, AWOOOI API health is `200`, Gitea version API is `200`, and the latest visible Gitea CD run `#4283` is `Success`. The MOMO dedicated source preflight has been downgraded from a hard blocker to a source freshness warning: `MOMO_DRIVE_TOKEN_SOURCE_PREFLIGHT PASS=19 WARN=6 BLOCKED=0`, `DB_DAILY_FRESHNESS 7|2026-06-24`, `DRIVE_INTAKE_COUNT=0`, `DRIVE_FAILED_COUNT=0`, `DRIVE_GLOBAL_LATEST_MODIFIED=2026-06-25T04:21:47Z`, and latest import job `57 completed` with `15383|15383|0`; therefore "no Drive source newer than the last clean import" must no longer hard-block cold start. If in the future there is an auth failure, a failed folder has candidates, a new source has not been imported, or the latest import is not clean, it still hard-blocks. The 110 cold-start monitor deploy parity now covers the hashes of all three of `full-stack-cold-start-check.sh`, `cold-start-textfile-exporter.sh`, and `momo-drive-token-source-recovery-preflight.sh`; the latest remote textfile metrics are `awoooi_cold_start_pass_gates=89`, `warn=7`, `blocked=0`, `last_result{result="degraded"}=1`. The root-owned Nginx source drift for SignOz has been fixed in the repo to `192.168.0.110:8080`, but `ollama` sudo on 188 still requires a password; to avoid reading the password or reloading Nginx, this round first restored the live route through a user-level controlled bridge: `scripts/reboot-recovery/signoz-188-upstream-bridge.sh --apply` creates `/home/ollama/bin/awoooi-signoz-upstream-bridge.sh` and a user crontab `@reboot` entry on 188, using `socat` to listen on `127.0.0.1:3301` and forward to `192.168.0.110:8080`; the status readback is `SIGNOZ_BRIDGE_STATUS running=1 listen_code=200 upstream_code=200`, and rollback is `scripts/reboot-recovery/signoz-188-upstream-bridge.sh --rollback`. The 19:13 scorecard reads back `CORE_COLD_START_DEPLOY_PARITY=1`, `CORE_REGISTRY_READY=1`, `CORE_COLD_START_BLOCKED_GATES=0`, `CORE_COLD_START_WARN_GATES=7`, `CORE_COLD_START_FIRING_ALERTS=2`, `DR_OFFSITE_EVIDENCE_READBACK=1`, `ESCROW_MISSING_COUNT=5`, `RECOVERY_STATE=CORE_NOT_READY_DR_OFFSITE_PENDING`. Allowed declaration: website / public routes / AWOOOI service / Gitea / Harbor registry path / SignOz public route are restored, cold-start hard blockers `0`. Forbidden declaration: 10-minute fully automatic recovery complete, full green, DR complete, credential escrow complete, MOMO business data up to date, 110 SSH permanently stable, or Nginx privileged source apply complete. The next step is fixed as P0 backup/DR escrow/alert warning convergence: credential escrow non-secret evidence is still missing `5`, and backup health still has stale / failed-component warnings.
 
diff --git a/scripts/backup/backup-awoooi-frequent.sh b/scripts/backup/backup-awoooi-frequent.sh
index 19c369af..7f661368 100755
--- a/scripts/backup/backup-awoooi-frequent.sh
+++ b/scripts/backup/backup-awoooi-frequent.sh
@@ -15,37 +15,223 @@ source "$(dirname "$0")/common.sh"
 SERVICE="awoooi-frequent"
 AWOOOI_HOST="192.168.0.188"
 AWOOOI_DB_USER="awoooi"
-AWOOOI_DB_PASS="awoooi_prod_2026"
+AWOOOI_DB_PASS="${AWOOOI_DB_PASS:-}"
 AWOOOI_DB_HOST="localhost"
 AWOOOI_DB_PORT="5432"
 LOCAL_REPO="${BACKUP_BASE}/awoooi"
 DUMP_DIR="/tmp/awoooi-freq-backup-$$"
+AWOOOI_K8S_HOST="${AWOOOI_K8S_HOST:-192.168.0.120}"
+AWOOOI_K8S_HOSTS="${AWOOOI_K8S_HOSTS:-${AWOOOI_K8S_HOST} 192.168.0.121 192.168.0.125}"
+AWOOOI_K8S_SECRET_NAME="${AWOOOI_K8S_SECRET_NAME:-awoooi-secrets}"
+AWOOOI_K8S_NAMESPACE="${AWOOOI_K8S_NAMESPACE:-awoooi-prod}"
+AWOOOI_K8S_DATABASE_URL_KEYS="${AWOOOI_K8S_DATABASE_URL_KEYS:-AWOOOI_BACKUP_DATABASE_URL BACKUP_DATABASE_URL DATABASE_URL}"
+FORCE_RLS_RESTORE_SQL=""
+FORCE_RLS_RESTORE_DB=""
 
 # High-frequency backup retention policy
-KEEP_HOURLY=28  # keep 7 days of 6-hour snapshots (7*4=28)
-KEEP_DAILY=30
-KEEP_WEEKLY=12
-KEEP_MONTHLY=24
+# 2026-05-19 ogt + Codex: the retention policy is delegated entirely to common.sh.
+# The default is latest-only keep-last=1, to avoid piling up high-frequency DB snapshots.
+
+resolve_database_url() {
+    if [ -n "${AWOOOI_DATABASE_URL:-}" ]; then
+        printf '%s\n' "${AWOOOI_DATABASE_URL}"
+        return 0
+    fi
+    if [ -n "${DATABASE_URL:-}" ]; then
+        printf '%s\n' "${DATABASE_URL}"
+        return 0
+    fi
+
+    # 2026-07-01 ogt + Codex: prefer the dedicated backup DB URL; fall back to the
+    # runtime DATABASE_URL only when it is absent. Decode only inside the remote flow; never write the secret value to the log.
+    local k8s_host key encoded decoded
+    for k8s_host in ${AWOOOI_K8S_HOSTS}; do
+        for key in ${AWOOOI_K8S_DATABASE_URL_KEYS}; do
+            encoded="$(ssh -o BatchMode=yes -o StrictHostKeyChecking=accept-new -o ConnectTimeout=8 "wooo@${k8s_host}" \
+                "sudo -n kubectl get secret ${AWOOOI_K8S_SECRET_NAME} -n ${AWOOOI_K8S_NAMESPACE} -o jsonpath='{.data.${key}}' 2>/dev/null || kubectl get secret ${AWOOOI_K8S_SECRET_NAME} -n ${AWOOOI_K8S_NAMESPACE} -o jsonpath='{.data.${key}}'" \
+                2>/dev/null || true)"
+            decoded="$(printf '%s' "${encoded}" | base64 -d 2>/dev/null || true)"
+            if [ -n "${decoded}" ]; then
+                printf '%s\n' "${decoded}"
+                return 0
+            fi
+        done
+    done
+    return 1
+}
+
+load_database_config() {
+    local database_url
+    database_url="$(resolve_database_url || true)"
+    if [ -z "${database_url}" ]; then
+        log_error "Cannot resolve AWOOOI DATABASE_URL; refusing to use the old hard-coded password"
+        return 1
+    fi
+
+    eval "$(
+        python3 - 3<<< "${database_url}" <<'PY'
+import shlex
+from urllib.parse import unquote, urlparse
+
+with open(3) as source:
+    url = source.read().strip()
+parsed = urlparse(url)
+
+values = {
+    "AWOOOI_DB_USER": unquote(parsed.username or "awoooi"),
+    "AWOOOI_DB_PASS": unquote(parsed.password or ""),
+    "AWOOOI_DB_HOST": parsed.hostname or "localhost",
+    "AWOOOI_DB_PORT": str(parsed.port or 5432),
+}
+for key, value in values.items():
+    print(f"{key}={shlex.quote(value)}")
+PY
+    )"
+}
+
+quote_remote() {
+    printf "%q" "$1"
+}
+
+pgpass_escape() {
+    local value="$1"
+    value="${value//\\/\\\\}"
+    value="${value//:/\\:}"
+    printf '%s' "${value}"
+}
+
+pgpass_line() {
+    local database="$1"
+    printf '%s:%s:%s:%s:%s\n' \
+        "$(pgpass_escape "${AWOOOI_DB_HOST}")" \
+        "$(pgpass_escape "${AWOOOI_DB_PORT}")" \
+        "$(pgpass_escape "${database}")" \
+        "$(pgpass_escape "${AWOOOI_DB_USER}")" \
+        "$(pgpass_escape "${AWOOOI_DB_PASS}")"
+}
+
+remote_psql_command() {
+    local database="$1"
+    printf "psql --no-password -U %s -h %s -p %s -d %s -v ON_ERROR_STOP=1" \
+        "$(quote_remote "${AWOOOI_DB_USER}")" \
+        "$(quote_remote "${AWOOOI_DB_HOST}")" \
+        "$(quote_remote "${AWOOOI_DB_PORT}")" \
+        "$(quote_remote "${database}")"
+}
+
+remote_pgpass_wrapper() {
+    local command="$1"
+    printf 'umask 077; pgpass=$(mktemp "${TMPDIR:-/tmp}/awoooi-pgpass.XXXXXX") || exit 1; cleanup() { rm -f "$pgpass"; }; trap cleanup EXIT HUP INT TERM; cat > "$pgpass"; PGOPTIONS="-c statement_timeout=0 -c max_parallel_workers_per_gather=0" PGPASSFILE="$pgpass" %s' "${command}"
+}
+
+run_remote_pgpass_command() {
+    local database="$1"
+    local command="$2"
+    pgpass_line "${database}" | ssh "ollama@${AWOOOI_HOST}" "$(remote_pgpass_wrapper "${command}")"
+}
+
+latest_restic_snapshot_id() {
+    restic -r "${LOCAL_REPO}" snapshots --latest 1 --json \
+        --password-file "${RESTIC_PASSWORD_FILE}" 2>/dev/null | \
+        python3 -c 'import json,sys; rows=json.load(sys.stdin); row=max(rows,key=lambda r: r.get("time","")) if rows else {}; print(row.get("short_id","unknown"))' \
+        2>/dev/null || echo "unknown"
+}
+
+collect_force_rls_sql() {
+    local database="$1"
+    local mode="$2"
+    local query
+
+    query="
+select format('ALTER TABLE %I.%I ${mode} ROW LEVEL SECURITY;', n.nspname, c.relname)
+from pg_class c
+join pg_namespace n on n.oid = c.relnamespace
+where c.relkind in ('r', 'p')
+  and c.relforcerowsecurity
+  and pg_get_userbyid(c.relowner) = current_user
+order by 1;
+"
+    run_remote_pgpass_command "${database}" "$(remote_psql_command "${database}") -At -c $(quote_remote "${query}")"
+}
+
+apply_remote_sql() {
+    local database="$1"
+    local sql="$2"
+    [ -n "${sql}" ] || return 0
+    run_remote_pgpass_command "${database}" "$(remote_psql_command "${database}") -c $(quote_remote "${sql}") >/dev/null"
+}
+
+restore_force_rls() {
+    if [ -n "${FORCE_RLS_RESTORE_DB}" ] && [ -n "${FORCE_RLS_RESTORE_SQL}" ]; then
+        if apply_remote_sql "${FORCE_RLS_RESTORE_DB}" "${FORCE_RLS_RESTORE_SQL}"; then
+            log_info "FORCE ROW LEVEL SECURITY restored (${FORCE_RLS_RESTORE_DB})"
+        else
+            log_error "FORCE ROW LEVEL SECURITY restore failed (${FORCE_RLS_RESTORE_DB})"
+            return 1
+        fi
+        FORCE_RLS_RESTORE_DB=""
+        FORCE_RLS_RESTORE_SQL=""
+    fi
+}
+
+trap restore_force_rls EXIT
+
+dump_database_with_rls_guard() {
+    local database="$1"
+    local output_file="$2"
+    local stderr_file="${output_file}.stderr"
+    local noforce_sql force_sql dump_rc
+
+    noforce_sql="$(collect_force_rls_sql "${database}" "NO FORCE")"
+    force_sql="$(printf '%s\n' "${noforce_sql}" | sed 's/NO FORCE/FORCE/')"
+
+    if [ -n "${noforce_sql}" ]; then
+        FORCE_RLS_RESTORE_DB="${database}"
+        FORCE_RLS_RESTORE_SQL="${force_sql}"
+        log_info "Temporarily lifting FORCE RLS to complete a full pg_dump (${database}, tables=$(printf '%s\n' "${noforce_sql}" | awk 'NF {count++} END {print count+0}'))"
+        apply_remote_sql "${database}" "${noforce_sql}"
+    fi
+
+    set +e
+    run_remote_pgpass_command "${database}" "pg_dump --no-password \
+        -U $(quote_remote "${AWOOOI_DB_USER}") -h $(quote_remote "${AWOOOI_DB_HOST}") -p $(quote_remote "${AWOOOI_DB_PORT}") \
+        $(quote_remote "${database}")" > "${output_file}" 2>"${stderr_file}"
+    dump_rc=$?
+    set -e
+
+    restore_force_rls
+
+    if [ "${dump_rc}" -ne 0 ]; then
+        log_error "${database} dump failed; the tail of pg_dump stderr follows (credential output avoided):"
+        tail -40 "${stderr_file}" | sed -E 's/(password=)[^ ]+/\1REDACTED/g' || true
+        return "${dump_rc}"
+    fi
+    rm -f "${stderr_file}"
+}
 
 main() {
     local start_time=$(date +%s)
     log_info "========== AWOOOI high-frequency backup ($(date '+%H:%M')) =========="
 
     mkdir -p "${DUMP_DIR}"
+    load_database_config || {
+        notify_clawbot "failed" "${SERVICE}" "AWOOOI high-frequency backup failed: DATABASE_URL unavailable"
+        rm -rf "${DUMP_DIR}"
+        exit 1
+    }
 
     local timestamp=$(date "+%Y%m%d_%H%M%S")
 
     # Back up only awoooi_prod (high-frequency core)
-    if ssh ollama@${AWOOOI_HOST} "PGPASSWORD='${AWOOOI_DB_PASS}' pg_dump \
-        -U ${AWOOOI_DB_USER} -h ${AWOOOI_DB_HOST} -p ${AWOOOI_DB_PORT} \
-        awoooi_prod" > "${DUMP_DIR}/awoooi_prod_${timestamp}.sql" 2>&1; then
+    if dump_database_with_rls_guard "awoooi_prod" "${DUMP_DIR}/awoooi_prod_${timestamp}.sql"; then
         local size=$(du -h "${DUMP_DIR}/awoooi_prod_${timestamp}.sql" | cut -f1)
         log_success "awoooi_prod dump complete (${size})"
     else
+        local status=$?
         log_error "awoooi_prod dump failed"
         notify_clawbot "failed" "${SERVICE}" "AWOOOI high-frequency backup failed"
         rm -rf "${DUMP_DIR}"
-        exit 1
+        exit "${status}"
     fi
 
     # Restic backup (same repository, different frequency)
@@ -54,18 +240,11 @@ main() {
         --tag "service:awoooi" --tag "freq:6h" \
         --tag "timestamp:${timestamp}" 2>&1
 
-    local snapshot_id=$(restic -r "${LOCAL_REPO}" snapshots --latest 1 --json \
-        --password-file "${RESTIC_PASSWORD_FILE}" 2>/dev/null | \
-        grep -oP '"short_id":"\K[^"]+' | head -1)
+    local snapshot_id
+    snapshot_id="$(latest_restic_snapshot_id)"
     log_success "Snapshot: ${snapshot_id}"
 
-    # GFS cleanup (with hourly retention added)
-    restic -r "${LOCAL_REPO}" forget --prune \
-        --password-file "${RESTIC_PASSWORD_FILE}" \
-        --keep-hourly ${KEEP_HOURLY} \
-        --keep-daily ${KEEP_DAILY} \
-        --keep-weekly ${KEEP_WEEKLY} \
-        --keep-monthly ${KEEP_MONTHLY} 2>&1
+    cleanup_old_backups "${LOCAL_REPO}"
 
     rm -rf "${DUMP_DIR}"
 
diff --git a/scripts/ops/backup-health-textfile-exporter.py b/scripts/ops/backup-health-textfile-exporter.py
index 3332655d..0c607357 100755
--- a/scripts/ops/backup-health-textfile-exporter.py
+++ b/scripts/ops/backup-health-textfile-exporter.py
@@ -643,6 +643,9 @@ def _offsite_and_escrow_metric_lines(host: str) -> list[str]:
     if not offsite_configured:
         next_step = "configure_google_drive_rclone_on_110_tty"
         phase = 1
+    elif escrow_missing_count > 0 and full_fresh:
+        next_step = "complete_credential_escrow_review"
+        phase = 3
     elif not any_partial_fresh:
         next_step = "run_small_dry_run_then_partial_sync"
         phase = 2
diff --git a/scripts/ops/tests/test_backup_health_textfile_exporter.py b/scripts/ops/tests/test_backup_health_textfile_exporter.py
index ce513ee7..4a3863fc 100644
--- a/scripts/ops/tests/test_backup_health_textfile_exporter.py
+++ b/scripts/ops/tests/test_backup_health_textfile_exporter.py
@@ -84,3 +84,33 @@ def test_gitea_bundle_metrics_fail_when_checksum_missing(tmp_path: Path, monkeyp
     assert all_ok == 0
     assert 'awoooi_gitea_bundle_checksum_missing_count{host="188"' in rendered
     assert rendered.rstrip().endswith(" 0")
+
+
+def test_dr_phase_does_not_regress_when_full_offsite_is_fresh_and_partial_is_stale(
+    tmp_path: Path, monkeypatch
+) -> None:
+    exporter = load_exporter()
+    offsite_dir = tmp_path / "offsite"
+    escrow_dir = tmp_path / "escrow"
+    offsite_dir.mkdir()
+    escrow_dir.mkdir()
+    now = 1_782_900_000
+
+    monkeypatch.setattr(exporter, "OFFSITE_STATUS_DIR", offsite_dir)
+    monkeypatch.setattr(exporter, "ESCROW_EVIDENCE_DIR", escrow_dir)
+    monkeypatch.setattr(exporter.time, "time", lambda: now)
+    monkeypatch.setattr(exporter, "_b2_configured", lambda: False)
+    monkeypatch.setattr(exporter, "_rclone_configured", lambda: True)
+    (offsite_dir / "rclone-last-success").write_text(str(now - 3600), encoding="utf-8")
+    (offsite_dir / "rclone-partial-last-success").write_text(str(now - 72 * 3600), encoding="utf-8")
+
+    metrics = exporter._offsite_and_escrow_metric_lines("110")
+    rendered = "\n".join(metrics)
+
+    assert 'awoooi_backup_offsite_fresh{host="110",provider="rclone",max_age_hours="48"} 1' in rendered
+    assert (
+        'awoooi_backup_offsite_partial_fresh{host="110",provider="rclone",scope="partial",max_age_hours="48"} 0'
+        in rendered
+    )
+    assert 'awoooi_backup_dr_credential_escrow_missing_count{host="110"} 5' in rendered
+    assert 'awoooi_backup_dr_phase{host="110",next_step="complete_credential_escrow_review"} 3' in rendered
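Reviewer note: the `.pgpass` handling that `backup-awoooi-frequent.sh` gains in this diff (`load_database_config`, `pgpass_escape`, `pgpass_line`) may be easier to reason about as one small Python sketch. This is an illustration under stated assumptions, not the script's implementation: the URL, host name, and user in the usage line are hypothetical, and the fallbacks (`localhost`, `5432`, `awoooi`) simply mirror the defaults in the shell heredoc.

```python
from urllib.parse import unquote, urlparse


def pgpass_escape(value: str) -> str:
    """Escape backslashes first, then colons, as .pgpass fields require."""
    return value.replace("\\", "\\\\").replace(":", "\\:")


def pgpass_line(database_url: str, database: str) -> str:
    """Build one host:port:database:user:password entry from a database URL."""
    parsed = urlparse(database_url)
    fields = [
        parsed.hostname or "localhost",        # same fallback as the shell script
        str(parsed.port or 5432),
        database,                              # target DB is passed in, not taken from the URL path
        unquote(parsed.username or "awoooi"),  # percent-decoding happens before escaping
        unquote(parsed.password or ""),
    ]
    return ":".join(pgpass_escape(field) for field in fields)


if __name__ == "__main__":
    # Hypothetical URL: the password carries an encoded ':' (%3A) and '\' (%5C).
    print(pgpass_line("postgresql://backup_user:p%3Aw%5Cd@db.example:5433/ignored", "awoooi_prod"))
```

The order of the two `replace` calls matters: escaping colons first would leave the backslashes it inserts to be doubled by the second pass, which is why the shell version also handles backslashes before colons.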