diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index d797c0cf..4cdb2428 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -1,3 +1,23 @@
+## 2026-07-01 — 14:05 110 stale Docker stats attribution hardening
+
+**Issues fixed on the main line**:
+- The live 110 diagnosis is back under high pressure: `NODE_LOAD1=29.9`, `NODE_LOAD5=22.66`, `NODE_LOAD1_PER_CPU=2.49`, `NODE_LOAD_CLASSIFIER=high_load`, but the `node_textfile_mtime_seconds` of `docker_stats.prom` has been stale for about `107565s`; the old value `docker_container_cpu_cores{container_name="gitea"}=3.4019` therefore can no longer be treated as the current CPU culprit.
+- `scripts/ops/host-sustained-load-controller.py` now has a Docker stats freshness gate. When the textfile is older than `300s`, `top_container_cpu` becomes `null` and the old container sample is kept only in `top_container_cpu_untrusted`; the same live metrics are now classified as `blocked_unknown_sustained_load_requires_source_specific_playbook` instead of being misreported as `blocked_gitea_queue_or_hook_backlog_requires_playbook`.
+- `scripts/ops/host-sustained-load-evidence.py` likewise moves stale container samples to `top_containers_untrusted`, leaves `top_containers=[]`, and changes the recommendation to `source_specific_playbook_required`.
+- `scripts/reboot-recovery/reboot-auto-recovery-slo-scorecard.py` now supports `docker_stats.fresh=false` / `top_containers_fresh=false`; the SLO still keeps `host_pressure_high_load` but no longer produces `host_110_gitea_cpu_pressure`, and the next step becomes `restore_docker_stats_textfile_exporter_then_collect_sanitized_host_pressure_no_restart_no_secret_read`.
+- The live route / cold-start is still blocked: `https://registry.wooo.work/v2/` returns 502, `https://harbor.wooo.work/api/v2.0/health` returns 502, and `/tmp/awoooi-full-stack-cold-start-after-e580954e8-20260701-135922.log` reports `PASS=61 WARN=9 BLOCKED=6`. The core blocker is still the 110 control path / local recovery package / Harbor `/v2`, not a confirmed Gitea CPU backlog.
+
+**Verification**:
+- `python3.11 -m pytest scripts/ops/tests/test_host_runaway_process_exporter.py -q`: `19 passed`.
+- `python3.11 -m pytest scripts/reboot-recovery/tests/test_reboot_auto_recovery_slo_scorecard.py -q`: `9 passed`.
+- `python3.11 -m py_compile scripts/ops/host-sustained-load-controller.py scripts/ops/host-sustained-load-evidence.py scripts/reboot-recovery/reboot-auto-recovery-slo-scorecard.py`: passed.
+- `python3.11 ops/runner/guard-gitea-runner-pressure.py --root .`: `GITEA_RUNNER_PRESSURE_GUARD_OK workflow_files=12 scheduled_workflows=4 auto_branch_events_on_110=0 generic_runner_labels=0`.
+
+**Boundaries**: read only node exporter / public routes / public Gitea queue; did not read secret / token / `.env` / raw sessions / SQLite / auth; did not use GitHub / `gh` / GitHub API; no workflow_dispatch; did not reboot the host or restart Docker / Nginx / K3s / DB / firewall; sent no process signal.
+
+**Next steps**:
+- P0, no side branches: first make the 110 local recovery package runnable from the 110 console or the restored SSH control path, then run `recover-110-control-path-and-harbor-local.sh --check`; if only the exporter freshness gap remains, restore the Docker stats textfile exporter first, then collect sanitized host pressure.
+
 ## 2026-07-01 — 14:55 Work Items show AI Loop LOG source tags

 **Issues fixed on the main line**:
diff --git a/docs/runbooks/FULL-STACK-COLD-START-SOP.md b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
index 6aa09939..6866e576 100644
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -72,6 +72,8 @@ v1.82 bounded summary rule:`post-start-quick-check.sh` 與 `188-host-hygiene-m

 2026-07-01 13:35 110 CPU/load interpretation rule update: check the mtime / freshness of `docker_stats.prom` first; data older than 300 seconds must not be used for current container CPU attribution or as a blocker. If 110 `node_load5` is high but Prometheus CPU mode still shows idle, `awoooi_host_gitea_actions_active_process_count=0`, and orphan browser count=0, the main blocker must not be misreported as Gitea / Playwright / Stock smoke CPU. In that case, look first at `NODE_LOAD_CLASSIFIER`, `DOCKER_STATS_TEXTFILE_FRESHNESS`, and `SYSTEMD_UNIT ... classifier=systemctl_show_timeout|systemctl_timeout_budget_exhausted` from `diagnose-110-ssh-publickey-auth.sh`. On an external SSH userauth timeout, cold-start must output `SSH_110_BLOCKER remote_control_channel_unavailable` and `SSH_110_NEXT_ACTION local_console_run_recover_110_control_path_and_harbor_local_check`; the next step is to run `recover-110-control-path-and-harbor-local.sh --check` from the 110 local console / a restored control path, not to rerun the Harbor workflow or blame Gitea using old docker stats.

+2026-07-01 14:05 added controller / SLO stale-attribution guard: `host-sustained-load-controller.py` and `host-sustained-load-evidence.py` must mark Docker stats samples older than `300s` as untrusted; `top_container_cpu` / `top_containers` must not use stale `docker_container_cpu_cores`, and old values may remain only in `top_container_cpu_untrusted` / `top_containers_untrusted` as evidence. If `reboot-auto-recovery-slo-scorecard.py` receives `docker_stats.fresh=false` or `top_containers_fresh=false`, it may keep only `host_pressure_high_load` and `host_container_cpu_attribution_stale` and must not produce `host_110_gitea_cpu_pressure`. The next step is then fixed: restore the Docker stats textfile exporter or collect sanitized host pressure. Restarting Docker / Nginx / K3s / DB / firewall, restoring the generic runner, and using stale Gitea CPU samples to cancel or drain any work all remain prohibited.
+
 2026-06-25 20:25 orphan Chrome cleanup / scorecard refresh supersedes the 20:11 CPU wording. 110 high CPU was traced to two `stockplatform-review-bulk-ux` Chrome process groups `2756503` and `2829627` with root Chrome process `PPID=1`, elapsed about 5h, no active parent smoke, and sustained GPU/renderer CPU. With user approval, only those two process groups received targeted `SIGTERM` at 20:24. Post-check showed no remaining PGID entries; `vmstat` showed CPU idle around `85-90%`, `si/so=0`, and no immediate swap thrash. No Docker/systemd/Nginx/firewall/K8s action, CI cancellation, manual data ingestion, manual DB write, Wazuh/SOC runtime change, or secret read was performed. The 20:25 full post-start wrapper then returned cold-start `PASS=89 WARN=0 BLOCKED=0`, but overall `POST_START_QUICK_CHECK PASS=37 WARN=2 BLOCKED=1`, `RESULT=BLOCKED`, because StockPlatform data freshness was still blocked at that time and DR remained incomplete.

 2026-06-25 20:11 StockPlatform cron-source recovery supersedes the 19:35 source-version wording. StockPlatform Gitea `main` and live `/home/wooo/stockplatform-v2` are now at `fb91aa4c6272469d1d26e0820169629eac17d28a fix(ops): restore production cron recovery entrypoints`; six missing production cron entrypoint scripts are restored, `run-intelligence-sync.sh` contains the Docker-backed `psql` shim, and live contract check confirms every `scripts/ops/*.sh` referenced by `install-production-cron.sh` exists. The only live write performed for StockPlatform recovery was a fast-forward `git pull --ff-only origin main` on 110; no Docker/systemd/Nginx/firewall/K8s restart, manual ingestion run, manual DB write, or secret read was performed. Natural cron evidence after the pull is now green for the repaired entrypoints: `source-remediation-queue` 19:56 and 20:00 succeeded, `market-index-ingestion` 20:00 succeeded, `price-ingestion` 20:02 succeeded, `margin-short-ingestion` 20:05 succeeded, `chips-ingestion` 20:06 succeeded, and `ai-recommendation-pipeline` 20:10 ran but correctly produced the internal blocker `core_margin_short_daily_incomplete,official_margin_short_daily_official_pending`. StockPlatform `/api/v1/system/freshness` therefore still returns `status=blocked` because the 2026-06-25 official margin-short source is pending and `ai.recommendations` must stay on 2026-06-24 until that gate clears. This is no longer a route, source-version, or missing-cron-script blocker; it is a product-data freshness blocker waiting on official source availability and the next valid AI pipeline run.
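Reviewer note on the 300-second rule in the runbook hunk above: the arithmetic behind the freshness gate can be sketched in a few lines. This is a minimal illustration, not the shipped code. `docker_stats_age` and `is_fresh` are hypothetical helper names, and the clamp-to-zero and inclusive `<=` comparison are taken from the `docker_stats_freshness` function added in this patch.

```python
def docker_stats_age(mtime_seconds: float, now_seconds: float) -> int:
    """Age of the textfile in whole seconds, clamped at zero.

    Mirrors `max(0, int(now - mtime))` from the patch (hypothetical helper name).
    """
    return max(0, int(now_seconds - mtime_seconds))


def is_fresh(mtime_seconds: float, now_seconds: float, max_age_seconds: int = 300) -> bool:
    """True when the sample may be used for current container CPU attribution."""
    return docker_stats_age(mtime_seconds, now_seconds) <= max_age_seconds


# Values used by the new tests: mtime=1000, node_time_seconds=5000.
print(docker_stats_age(1000, 5000), is_fresh(1000, 5000))
# Boundary: an age of exactly 300 seconds still counts as fresh.
print(is_fresh(1000, 1300), is_fresh(1000, 1301))
# Clock skew: an mtime in the future clamps to age 0 and is treated as fresh.
print(docker_stats_age(6000, 5000), is_fresh(6000, 5000))
```

One consequence worth keeping in mind: because the age is clamped at zero, a textfile whose mtime lies ahead of the reference clock passes the gate. Whether that is acceptable depends on how far the exporter and reader clocks can drift.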
diff --git a/scripts/ops/host-sustained-load-controller.py b/scripts/ops/host-sustained-load-controller.py
index 33428fab..ce13db51 100755
--- a/scripts/ops/host-sustained-load-controller.py
+++ b/scripts/ops/host-sustained-load-controller.py
@@ -20,12 +20,14 @@ from __future__ import annotations
 import argparse
 import json
 import re
+import time
 from pathlib import Path
 from typing import Any


 DEFAULT_METRICS_FILE = Path("/home/wooo/node_exporter_textfiles/host_runaway_process.prom")
 DEFAULT_DOCKER_STATS_FILE = Path("/home/wooo/node_exporter_textfiles/docker_stats.prom")
+DEFAULT_DOCKER_STATS_MAX_AGE_SECONDS = 300
 SCHEMA_VERSION = "host_sustained_load_controlled_automation_v1"
 LABEL_RE = re.compile(r"(?P[A-Za-z_][A-Za-z0-9_]*)=\"(?P(?:[^\"\\\\]|\\\\.)*)\"")
 METRIC_RE = re.compile(
@@ -41,6 +43,11 @@ def parse_args() -> argparse.Namespace:
     parser.add_argument("--host", default="110")
     parser.add_argument("--metrics-file", type=Path, default=DEFAULT_METRICS_FILE)
     parser.add_argument("--docker-stats-file", type=Path, default=DEFAULT_DOCKER_STATS_FILE)
+    parser.add_argument(
+        "--docker-stats-max-age-seconds",
+        type=int,
+        default=DEFAULT_DOCKER_STATS_MAX_AGE_SECONDS,
+    )
     parser.add_argument("--load5-per-core-threshold", type=float, default=1.5)
     parser.add_argument("--ci-stale-age-seconds", type=int, default=1800)
     parser.add_argument("--json", action="store_true", help="Print JSON only.")
@@ -92,6 +99,55 @@ def _sample_value(
     return default


+def _sample_value_any(samples: list[dict[str, Any]], name: str) -> float | None:
+    for sample in samples:
+        if sample["name"] == name:
+            return float(sample["value"])
+    return None
+
+
+def _textfile_mtime_seconds(samples: list[dict[str, Any]], suffix: str) -> float | None:
+    for sample in samples:
+        if sample["name"] != "node_textfile_mtime_seconds":
+            continue
+        file_label = str(sample["labels"].get("file") or "")
+        if file_label.endswith(suffix):
+            return float(sample["value"])
+    return None
+
+
+def docker_stats_freshness(
+    *,
+    samples: list[dict[str, Any]],
+    docker_stats_file: Path,
+    max_age_seconds: int,
+) -> dict[str, Any]:
+    mtime = _textfile_mtime_seconds(samples, "docker_stats.prom")
+    now = _sample_value_any(samples, "node_time_seconds")
+    source = "node_textfile_mtime_seconds"
+    if mtime is None:
+        try:
+            mtime = docker_stats_file.stat().st_mtime
+            now = time.time()
+            source = "file_stat_mtime"
+        except FileNotFoundError:
+            return {
+                "fresh": False,
+                "age_seconds": None,
+                "max_age_seconds": max_age_seconds,
+                "source": "missing",
+            }
+    if now is None:
+        now = time.time()
+    age_seconds = max(0, int(now - mtime))
+    return {
+        "fresh": age_seconds <= max_age_seconds,
+        "age_seconds": age_seconds,
+        "max_age_seconds": max_age_seconds,
+        "source": source,
+    }
+
+
 def _rule_values(samples: list[dict[str, Any]], name: str, *, host: str) -> list[dict[str, Any]]:
     values = []
     for sample in samples:
@@ -159,6 +215,7 @@ def build_packet(
     host: str,
     samples: list[dict[str, Any]],
     docker_samples: list[dict[str, Any]],
+    docker_stats_status: dict[str, Any],
     load5_per_core_threshold: float,
     ci_stale_age_seconds: int,
 ) -> dict[str, Any]:
@@ -209,7 +266,8 @@ def build_packet(
         )
     )
     top_orphan = _top_orphan_rule(samples, host=host)
-    top_container = _top_container_cpu(docker_samples, host=host)
+    raw_top_container = _top_container_cpu(docker_samples, host=host)
+    top_container = raw_top_container if docker_stats_status.get("fresh") is True else None
     top_container_name = str((top_container or {}).get("container_name") or "").lower()
     top_container_cpu = float((top_container or {}).get("cpu_cores") or 0.0)

@@ -317,6 +375,8 @@ def build_packet(
             "active_ci_oldest_age_seconds": active_ci_oldest_age,
             "top_orphan_rule": top_orphan,
             "top_container_cpu": top_container,
+            "top_container_cpu_untrusted": raw_top_container,
+            "docker_stats": docker_stats_status,
         },
         "commands": {
             "dry_run": dry_run_command,
@@ -364,6 +424,11 @@ def main() -> int:
         host=args.host,
         samples=samples,
         docker_samples=docker_samples,
+        docker_stats_status=docker_stats_freshness(
+            samples=samples,
+            docker_stats_file=args.docker_stats_file,
+            max_age_seconds=args.docker_stats_max_age_seconds,
+        ),
         load5_per_core_threshold=args.load5_per_core_threshold,
         ci_stale_age_seconds=args.ci_stale_age_seconds,
     )
diff --git a/scripts/ops/host-sustained-load-evidence.py b/scripts/ops/host-sustained-load-evidence.py
index 0cbf71c6..ef1417b7 100755
--- a/scripts/ops/host-sustained-load-evidence.py
+++ b/scripts/ops/host-sustained-load-evidence.py
@@ -14,12 +14,14 @@ import json
 import os
 import re
 import subprocess
+import time
 from pathlib import Path
 from typing import Any


 DEFAULT_HOST_METRICS_FILE = Path("/home/wooo/node_exporter_textfiles/host_runaway_process.prom")
 DEFAULT_DOCKER_STATS_FILE = Path("/home/wooo/node_exporter_textfiles/docker_stats.prom")
+DEFAULT_DOCKER_STATS_MAX_AGE_SECONDS = 300
 SCHEMA_VERSION = "host_sustained_load_sanitized_evidence_v1"
 LABEL_RE = re.compile(r"(?P[A-Za-z_][A-Za-z0-9_]*)=\"(?P(?:[^\"\\\\]|\\\\.)*)\"")
 METRIC_RE = re.compile(
@@ -33,6 +35,11 @@ def parse_args() -> argparse.Namespace:
     parser.add_argument("--host", default=os.environ.get("AIOPS_HOST_LABEL", "110"))
     parser.add_argument("--metrics-file", type=Path, default=DEFAULT_HOST_METRICS_FILE)
     parser.add_argument("--docker-stats-file", type=Path, default=DEFAULT_DOCKER_STATS_FILE)
+    parser.add_argument(
+        "--docker-stats-max-age-seconds",
+        type=int,
+        default=DEFAULT_DOCKER_STATS_MAX_AGE_SECONDS,
+    )
     parser.add_argument("--ps-file", type=Path)
     parser.add_argument("--top-n", type=int, default=8)
     parser.add_argument("--json", action="store_true")
@@ -66,6 +73,55 @@ def parse_prometheus_text(text: str) -> list[dict[str, Any]]:
     return samples


+def _sample_value_any(samples: list[dict[str, Any]], name: str) -> float | None:
+    for sample in samples:
+        if sample["name"] == name:
+            return float(sample["value"])
+    return None
+
+
+def _textfile_mtime_seconds(samples: list[dict[str, Any]], suffix: str) -> float | None:
+    for sample in samples:
+        if sample["name"] != "node_textfile_mtime_seconds":
+            continue
+        file_label = str(sample["labels"].get("file") or "")
+        if file_label.endswith(suffix):
+            return float(sample["value"])
+    return None
+
+
+def docker_stats_freshness(
+    *,
+    samples: list[dict[str, Any]],
+    docker_stats_file: Path,
+    max_age_seconds: int,
+) -> dict[str, Any]:
+    mtime = _textfile_mtime_seconds(samples, "docker_stats.prom")
+    now = _sample_value_any(samples, "node_time_seconds")
+    source = "node_textfile_mtime_seconds"
+    if mtime is None:
+        try:
+            mtime = docker_stats_file.stat().st_mtime
+            now = time.time()
+            source = "file_stat_mtime"
+        except FileNotFoundError:
+            return {
+                "fresh": False,
+                "age_seconds": None,
+                "max_age_seconds": max_age_seconds,
+                "source": "missing",
+            }
+    if now is None:
+        now = time.time()
+    age_seconds = max(0, int(now - mtime))
+    return {
+        "fresh": age_seconds <= max_age_seconds,
+        "age_seconds": age_seconds,
+        "max_age_seconds": max_age_seconds,
+        "source": source,
+    }
+
+
 def read_text(path: Path | None) -> str:
     if path is None:
         return ""
@@ -234,8 +290,14 @@ def recommend_playbook(process_families: list[dict[str, Any]], containers: list[
 def build_payload(args: argparse.Namespace) -> dict[str, Any]:
     host_samples = parse_prometheus_text(read_text(args.metrics_file))
     docker_samples = parse_prometheus_text(read_text(args.docker_stats_file))
+    docker_stats_status = docker_stats_freshness(
+        samples=host_samples,
+        docker_stats_file=args.docker_stats_file,
+        max_age_seconds=args.docker_stats_max_age_seconds,
+    )
     process_summary = summarize_processes(parse_ps_text(collect_ps_text(args.ps_file)), top_n=args.top_n)
-    containers = top_docker_containers(docker_samples, host=args.host, top_n=args.top_n)
+    untrusted_containers = top_docker_containers(docker_samples, host=args.host, top_n=args.top_n)
+    containers = untrusted_containers if docker_stats_status.get("fresh") is True else []
     recommendation = recommend_playbook(process_summary["families"], containers)

     return {
@@ -248,10 +310,12 @@ def build_payload(args: argparse.Namespace) -> dict[str, Any]:
         "readback": {
             "host_metric_sample_count": len(host_samples),
             "docker_metric_sample_count": len(docker_samples),
+            "docker_stats": docker_stats_status,
             "top_container_count": len(containers),
             "top_process_family_count": len(process_summary["families"]),
         },
         "top_containers": containers,
+        "top_containers_untrusted": untrusted_containers,
         "top_process_families": process_summary["families"],
         "top_processes_sanitized": process_summary["top_processes"],
         "redaction": {
diff --git a/scripts/ops/tests/test_host_runaway_process_exporter.py b/scripts/ops/tests/test_host_runaway_process_exporter.py
index d1d8259d..d977ffa6 100644
--- a/scripts/ops/tests/test_host_runaway_process_exporter.py
+++ b/scripts/ops/tests/test_host_runaway_process_exporter.py
@@ -425,6 +425,60 @@ def test_sustained_load_controller_routes_gitea_backlog_from_docker_metrics(tmp_
     assert "host-sustained-load-evidence.py" in payload["commands"]["dry_run"]


+def test_sustained_load_controller_ignores_stale_docker_stats_attribution(tmp_path: Path) -> None:
+    metrics_file = tmp_path / "host.prom"
+    metrics_file.write_text(
+        "\n".join(
+            [
+                'awoooi_host_runaway_process_monitor_up{host="110",mode="read_only"} 1',
+                'awoooi_host_load5_per_core{host="110"} 2.5',
+                'awoooi_host_swap_used_ratio{host="110"} 0.1',
+                'awoooi_host_runaway_process_remediation_authorized{host="110"} 0',
+                'awoooi_host_gitea_actions_active_container_count{host="110"} 0',
+                'awoooi_host_gitea_actions_active_process_group_count{host="110"} 0',
+                'awoooi_host_runaway_browser_orphan_group_count{host="110",rule="stockplatform_headless_smoke",min_age_seconds="1800",min_cpu_percent="50"} 0',
+                'node_textfile_mtime_seconds{file="/host/home/wooo/node_exporter_textfiles/docker_stats.prom"} 1000',
+                'node_time_seconds 5000',
+            ]
+        ),
+        encoding="utf-8",
+    )
+    docker_file = tmp_path / "docker.prom"
+    docker_file.write_text(
+        "\n".join(
+            [
+                'docker_container_cpu_cores{host="110",container_name="gitea"} 3.4',
+                'docker_container_cpu_cores{host="110",container_name="redis"} 0.2',
+            ]
+        ),
+        encoding="utf-8",
+    )
+
+    result = subprocess.run(
+        [
+            sys.executable,
+            str(CONTROLLER_PATH),
+            "--host",
+            "110",
+            "--metrics-file",
+            str(metrics_file),
+            "--docker-stats-file",
+            str(docker_file),
+            "--json",
+        ],
+        capture_output=True,
+        text=True,
+    )
+
+    assert result.returncode == 75
+    payload = json.loads(result.stdout)
+    assert payload["classification"] == "blocked_unknown_sustained_load_requires_source_specific_playbook"
+    assert payload["readback"]["docker_stats"]["fresh"] is False
+    assert payload["readback"]["top_container_cpu"] is None
+    assert payload["readback"]["top_container_cpu_untrusted"]["container_name"] == "gitea"
+    assert payload["controlled_apply_allowed"] is False
+
+
 def test_sustained_load_controller_routes_unknown_load_to_sanitized_evidence(tmp_path: Path) -> None:
     metrics_file = tmp_path / "host.prom"
     metrics_file.write_text(
@@ -506,6 +560,55 @@ def test_sustained_load_evidence_emits_sanitized_gitea_recommendation(tmp_path:
     assert "/home/wooo" not in result.stdout


+def test_sustained_load_evidence_keeps_stale_container_samples_untrusted(tmp_path: Path) -> None:
+    metrics_file = tmp_path / "host.prom"
+    metrics_file.write_text(
+        "\n".join(
+            [
+                'node_textfile_mtime_seconds{file="/host/home/wooo/node_exporter_textfiles/docker_stats.prom"} 1000',
+                'node_time_seconds 5000',
+            ]
+        ),
+        encoding="utf-8",
+    )
+    docker_file = tmp_path / "docker.prom"
+    docker_file.write_text(
+        'docker_container_cpu_cores{host="110",container_name="gitea"} 3.4\n',
+        encoding="utf-8",
+    )
+    ps_file = tmp_path / "ps.txt"
+    ps_file.write_text(
+        "100 1 100 120 5.0 1.0 python python monitor.py\n",
+        encoding="utf-8",
+    )
+
+    result = subprocess.run(
+        [
+            sys.executable,
+            str(EVIDENCE_PATH),
+            "--host",
+            "110",
+            "--metrics-file",
+            str(metrics_file),
+            "--ps-file",
+            str(ps_file),
+            "--docker-stats-file",
+            str(docker_file),
+            "--json",
+        ],
+        check=True,
+        capture_output=True,
+        text=True,
+    )
+
+    payload = json.loads(result.stdout)
+    assert payload["recommendation"] != "gitea_queue_or_hook_backlog_playbook"
+    assert payload["readback"]["docker_stats"]["fresh"] is False
+    assert payload["top_containers"] == []
+    assert payload["top_containers_untrusted"][0]["container_name"] == "gitea"
+    assert payload["operation_boundaries"]["host_write_performed"] is False
+
+
 def test_sustained_load_controller_routes_unknown_load_to_sanitized_evidence(tmp_path: Path) -> None:
     metrics_file = tmp_path / "host.prom"
     metrics_file.write_text(
diff --git a/scripts/reboot-recovery/reboot-auto-recovery-slo-scorecard.py b/scripts/reboot-recovery/reboot-auto-recovery-slo-scorecard.py
index 73309e83..cab4f271 100755
--- a/scripts/reboot-recovery/reboot-auto-recovery-slo-scorecard.py
+++ b/scripts/reboot-recovery/reboot-auto-recovery-slo-scorecard.py
@@ -446,6 +446,7 @@ def build_host_pressure_readback(payload: dict[str, Any]) -> dict[str, Any]:
     high_load_hosts: list[str] = []
     gitea_pressure_hosts: list[str] = []
     postgres_pressure_hosts: list[str] = []
+    container_attribution_stale_hosts: list[str] = []

     for item in hosts:
         if not isinstance(item, dict):
@@ -459,6 +460,11 @@ def build_host_pressure_readback(payload: dict[str, Any]) -> dict[str, Any]:
         if load5_per_core <= 0 and cores > 0:
             load5_per_core = load5 / cores
         top_containers = normalize_top_containers(item.get("top_containers"))
+        docker_stats = item.get("docker_stats")
+        top_containers_fresh = item.get("top_containers_fresh")
+        if top_containers_fresh is None and isinstance(docker_stats, dict):
+            top_containers_fresh = docker_stats.get("fresh")
+        container_attribution_fresh = top_containers_fresh is not False
         row = {
             "host": host,
             "load1": round(float_value(item.get("load1")), 4),
@@ -467,17 +473,22 @@ def build_host_pressure_readback(payload: dict[str, Any]) -> dict[str, Any]:
             "load5_per_core": round(load5_per_core, 4),
             "node_procs_running": int_value(item.get("node_procs_running")),
             "node_procs_blocked": int_value(item.get("node_procs_blocked")),
+            "top_containers_fresh": container_attribution_fresh,
             "top_containers": top_containers[:5],
         }
+        if isinstance(docker_stats, dict):
+            row["docker_stats"] = docker_stats
         rows.append(row)
         if load5_per_core > 1.0:
             high_load_hosts.append(host)
-        if host == "110" and any(
+        if top_containers and not container_attribution_fresh:
+            container_attribution_stale_hosts.append(host)
+        if host == "110" and container_attribution_fresh and any(
             container["container_name"] == "gitea" and container["cpu_cores"] >= 2.0
             for container in top_containers
         ):
             gitea_pressure_hosts.append(host)
-        if host == "188" and any(
+        if host == "188" and container_attribution_fresh and any(
             container["container_name"] == "k3s-postgres-recovery"
             and container["cpu_cores"] >= 4.0
             for container in top_containers
@@ -486,6 +497,8 @@ def build_host_pressure_readback(payload: dict[str, Any]) -> dict[str, Any]:

     if high_load_hosts:
         blockers.append("host_pressure_high_load")
+    if container_attribution_stale_hosts:
+        blockers.append("host_container_cpu_attribution_stale")
     if gitea_pressure_hosts:
         blockers.append("host_110_gitea_cpu_pressure")
     if postgres_pressure_hosts:
@@ -500,6 +513,7 @@ def build_host_pressure_readback(payload: dict[str, Any]) -> dict[str, Any]:
         "high_load_hosts": high_load_hosts,
         "gitea_pressure_hosts": gitea_pressure_hosts,
         "postgres_pressure_hosts": postgres_pressure_hosts,
+        "container_attribution_stale_hosts": container_attribution_stale_hosts,
         "conversation_event_hot_path_indexes_present": payload.get(
             "conversation_event_hot_path_indexes_present"
         ),
@@ -507,6 +521,7 @@ def build_host_pressure_readback(payload: dict[str, Any]) -> dict[str, Any]:
         "safe_actions": [
             "keep_110_legacy_runner_failclosed",
             "read_public_gitea_queue_metadata_only",
+            "restore_docker_stats_textfile_exporter_before_container_cpu_attribution",
             "apply_conversation_event_hot_path_indexes_via_controlled_db_migration",
             "rerun_host_pressure_and_cold_start_scorecard_after_apply",
         ],
@@ -558,6 +573,11 @@ def choose_safe_next_step(
             "keep_110_runner_failclosed_read_public_gitea_queue_and_recover_awoooi_host_"
             "controlled_lane_only_after_verifier_no_generic_runner"
         )
+    if "host_container_cpu_attribution_stale" in pressure_blockers:
+        return (
+            "restore_docker_stats_textfile_exporter_then_collect_sanitized_host_"
+            "pressure_no_restart_no_secret_read"
+        )
     if blockers == ["host_boot_observation_older_than_target_window"]:
         return (
             "timer_deployed_and_services_readback_green_wait_for_next_all_host_reboot_"
diff --git a/scripts/reboot-recovery/tests/test_reboot_auto_recovery_slo_scorecard.py b/scripts/reboot-recovery/tests/test_reboot_auto_recovery_slo_scorecard.py
index d1722c1b..a9b72fd8 100644
--- a/scripts/reboot-recovery/tests/test_reboot_auto_recovery_slo_scorecard.py
+++ b/scripts/reboot-recovery/tests/test_reboot_auto_recovery_slo_scorecard.py
@@ -327,6 +327,46 @@ def test_host_pressure_blocks_slo_with_index_drift_next_step(tmp_path: Path) ->
     )


+def test_host_pressure_does_not_attribute_stale_docker_stats_to_gitea(tmp_path: Path) -> None:
+    payload = run_scorecard_with_host_pressure(
+        tmp_path,
+        GREEN_SUMMARY,
+        {
+            "hosts": [
+                {
+                    "host": "110",
+                    "load1": 20.74,
+                    "load5": 18.05,
+                    "cores": 12,
+                    "node_procs_running": 63,
+                    "node_procs_blocked": 0,
+                    "docker_stats": {
+                        "fresh": False,
+                        "age_seconds": 107475,
+                        "max_age_seconds": 300,
+                        "source": "node_textfile_mtime_seconds",
+                    },
+                    "top_containers": [
+                        {"container_name": "gitea", "cpu_cores": 3.4019},
+                    ],
+                },
+            ],
+        },
+    )
+
+    assert payload["status"] == "blocked_reboot_auto_recovery_slo_not_ready"
+    assert payload["host_pressure"]["high_load_hosts"] == ["110"]
+    assert payload["host_pressure"]["gitea_pressure_hosts"] == []
+    assert payload["host_pressure"]["container_attribution_stale_hosts"] == ["110"]
+    assert "host_pressure_high_load" in payload["active_blockers"]
+    assert "host_container_cpu_attribution_stale" in payload["active_blockers"]
+    assert "host_110_gitea_cpu_pressure" not in payload["active_blockers"]
+    assert payload["safe_next_step"] == (
+        "restore_docker_stats_textfile_exporter_then_collect_sanitized_host_"
+        "pressure_no_restart_no_secret_read"
+    )
+
+
 def test_stockplatform_recovered_marks_controlled_gate_not_required(
     tmp_path: Path,
 ) -> None:
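Reviewer note on the scorecard change: the freshness flag is effectively tri-state, and the sketch below isolates that resolution so the default is easy to see. `attribution_is_fresh` is a hypothetical name used only for illustration. The body follows the lines added to `build_host_pressure_readback` in this patch, under the assumption that each host entry is a plain dict.

```python
from typing import Any


def attribution_is_fresh(item: dict[str, Any]) -> bool:
    """Resolve whether container CPU attribution may be trusted for one host entry.

    An explicit `top_containers_fresh` wins; otherwise `docker_stats.fresh` is
    consulted; if neither is present the entry is treated as fresh.
    """
    fresh = item.get("top_containers_fresh")
    docker_stats = item.get("docker_stats")
    if fresh is None and isinstance(docker_stats, dict):
        fresh = docker_stats.get("fresh")
    # Only an explicit False marks the attribution as stale.
    return fresh is not False


print(attribution_is_fresh({"docker_stats": {"fresh": False}}))
print(attribution_is_fresh({"top_containers_fresh": False}))
print(attribution_is_fresh({"top_containers_fresh": True, "docker_stats": {"fresh": False}}))
print(attribution_is_fresh({}))
```

The last case is the one to weigh during review: a payload that carries no freshness information at all is still trusted, which keeps older collectors working but means the stale guard only engages when the producer reports `false` explicitly.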