fix(recovery): ignore stale docker cpu attribution

Your Name
2026-07-01 14:06:51 +08:00
parent e580954e82
commit 1ac8808607
7 changed files with 318 additions and 4 deletions


@@ -1,3 +1,23 @@
## 2026-07-01 — 14:05 110 stale Docker stats attribution hardening
**Issues fixed on the mainline**
- The live 110 diagnosis is back under high pressure: `NODE_LOAD1=29.9`, `NODE_LOAD5=22.66`, `NODE_LOAD1_PER_CPU=2.49`, `NODE_LOAD_CLASSIFIER=high_load`. However, the `node_textfile_mtime_seconds` for `docker_stats.prom` is stale by about `107565s`, so the old value `docker_container_cpu_cores{container_name="gitea"}=3.4019` can no longer be treated as the current CPU culprit.
- `scripts/ops/host-sustained-load-controller.py` now has a Docker stats freshness gate. When the textfile is older than `300s`, `top_container_cpu` becomes `null` and the old container sample is kept only in `top_container_cpu_untrusted`. The same live metrics are now classified as `blocked_unknown_sustained_load_requires_source_specific_playbook` and are no longer misreported as `blocked_gitea_queue_or_hook_backlog_requires_playbook`.
- `scripts/ops/host-sustained-load-evidence.py` likewise moves stale container samples to `top_containers_untrusted`, sets `top_containers=[]`, and changes the recommendation to `source_specific_playbook_required`.
- `scripts/reboot-recovery/reboot-auto-recovery-slo-scorecard.py` now supports `docker_stats.fresh=false` / `top_containers_fresh=false`. The SLO still keeps `host_pressure_high_load` but no longer produces `host_110_gitea_cpu_pressure`; the next step becomes `restore_docker_stats_textfile_exporter_then_collect_sanitized_host_pressure_no_restart_no_secret_read`.
- Live route / cold-start is still blocked: `https://registry.wooo.work/v2/` returns 502, `https://harbor.wooo.work/api/v2.0/health` returns 502, and `/tmp/awoooi-full-stack-cold-start-after-e580954e8-20260701-135922.log` reports `PASS=61 WARN=9 BLOCKED=6`. The core blocker remains the 110 control path / local recovery package / Harbor `/v2`, not a confirmed Gitea CPU backlog.
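
The freshness gate described in the list above can be sketched roughly as follows. This is a minimal illustration, not the controller's actual code: the helper name `gate_top_container` and its keyword arguments are hypothetical, while the `300s` threshold, the `107565s` age, and the readback field names come from this entry.

```python
from __future__ import annotations

MAX_AGE_SECONDS = 300  # threshold named in this changelog entry


def gate_top_container(
    top_container: dict | None,
    *,
    mtime_seconds: float,
    now_seconds: float,
    max_age_seconds: int = MAX_AGE_SECONDS,
) -> dict:
    """Hypothetical helper: split a container CPU sample into trusted/untrusted fields."""
    # Clamp at zero so a small clock skew never yields a negative age.
    age_seconds = max(0, int(now_seconds - mtime_seconds))
    fresh = age_seconds <= max_age_seconds
    return {
        "docker_stats": {
            "fresh": fresh,
            "age_seconds": age_seconds,
            "max_age_seconds": max_age_seconds,
        },
        # Only a fresh sample may be used for current CPU attribution.
        "top_container_cpu": top_container if fresh else None,
        # The raw sample is always kept as evidence, never as a verdict.
        "top_container_cpu_untrusted": top_container,
    }


sample = {"container_name": "gitea", "cpu_cores": 3.4019}
stale = gate_top_container(sample, mtime_seconds=1000, now_seconds=1000 + 107565)
print(stale["top_container_cpu"])  # None
```

Keeping the raw sample under a separate `*_untrusted` key means the evidence is still available for later review, while any classifier that reads only `top_container_cpu` cannot act on it.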
**Verification**
- `python3.11 -m pytest scripts/ops/tests/test_host_runaway_process_exporter.py -q`: `19 passed`.
- `python3.11 -m pytest scripts/reboot-recovery/tests/test_reboot_auto_recovery_slo_scorecard.py -q`: `9 passed`.
- `python3.11 -m py_compile scripts/ops/host-sustained-load-controller.py scripts/ops/host-sustained-load-evidence.py scripts/reboot-recovery/reboot-auto-recovery-slo-scorecard.py`: passed.
- `python3.11 ops/runner/guard-gitea-runner-pressure.py --root .`: `GITEA_RUNNER_PRESSURE_GUARD_OK workflow_files=12 scheduled_workflows=4 auto_branch_events_on_110=0 generic_runner_labels=0`.
**Boundaries**: read only node exporter / public routes / public Gitea queue; did not read secret / token / `.env` / raw sessions / SQLite / auth; did not use GitHub / `gh` / GitHub API; no workflow_dispatch; did not reboot any host; did not restart Docker / Nginx / K3s / DB / firewall; sent no process signals.
**Next steps**
- P0, do not branch off the mainline: first make the 110 local recovery package runnable from the 110 console or a restored SSH control path, then run `recover-110-control-path-and-harbor-local.sh --check`. If only the exporter freshness gap remains, restore the Docker stats textfile exporter first, then collect sanitized host pressure.
## 2026-07-01 — 14:55 Work Items display AI Loop LOG source tags
**Issues fixed on the mainline**


@@ -72,6 +72,8 @@ v1.82 bounded summary rule: `post-start-quick-check.sh` and `188-host-hygiene-m
2026-07-01 13:35 110 CPU/load interpretation rule update: check the mtime / freshness of `docker_stats.prom` first; data older than 300 seconds must not be used for current container CPU attribution or as a blocker. If `node_load5` on 110 is high but Prometheus CPU mode still shows idle, `awoooi_host_gitea_actions_active_process_count=0`, and orphan browser count=0, the main blocker must not be misreported as Gitea / Playwright / Stock smoke CPU. In that case, look first at the `NODE_LOAD_CLASSIFIER`, `DOCKER_STATS_TEXTFILE_FRESHNESS`, and `SYSTEMD_UNIT ... classifier=systemctl_show_timeout|systemctl_timeout_budget_exhausted` output of `diagnose-110-ssh-publickey-auth.sh`. When external SSH userauth times out, cold-start must output `SSH_110_BLOCKER remote_control_channel_unavailable` and `SSH_110_NEXT_ACTION local_console_run_recover_110_control_path_and_harbor_local_check`. The next step is to run `recover-110-control-path-and-harbor-local.sh --check` from the 110 local console or a restored control path, not to rerun the Harbor workflow or use old docker stats to blame Gitea.
2026-07-01 14:05 addition, controller / SLO stale-attribution guard: `host-sustained-load-controller.py` and `host-sustained-load-evidence.py` must mark Docker stats samples older than `300s` as untrusted. `top_container_cpu` / `top_containers` must not use stale `docker_container_cpu_cores`; old values may remain only in `top_container_cpu_untrusted` / `top_containers_untrusted` as evidence. If `reboot-auto-recovery-slo-scorecard.py` receives `docker_stats.fresh=false` or `top_containers_fresh=false`, it may keep only `host_pressure_high_load` and `host_container_cpu_attribution_stale` and must not produce `host_110_gitea_cpu_pressure`. The next step is then fixed: restore the Docker stats textfile exporter or collect sanitized host pressure, while still not restarting Docker / Nginx / K3s / DB / firewall, not restoring the generic runner, and not using stale Gitea CPU samples to cancel or drain any work.
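
As an illustration of the scorecard rule in the 14:05 entry, the sketch below derives blockers for a single host record. It is a simplified, assumption-laden version: the function name `pressure_blockers` is hypothetical and it handles only one host, while the blocker names and the `1.0` load-per-core and `2.0` CPU-core thresholds are taken from the scorecard diff in this commit.

```python
from __future__ import annotations

from typing import Any


def pressure_blockers(host: dict[str, Any]) -> list[str]:
    """Hypothetical single-host version of the stale-attribution guard."""
    docker_stats = host.get("docker_stats")
    fresh = host.get("top_containers_fresh")
    if fresh is None and isinstance(docker_stats, dict):
        fresh = docker_stats.get("fresh")
    # Only an explicit False marks attribution as stale; a missing flag stays trusted.
    attribution_fresh = fresh is not False

    containers = host.get("top_containers") or []
    cores = float(host.get("cores") or 0)
    load5_per_core = float(host.get("load5") or 0) / cores if cores > 0 else 0.0

    blockers: list[str] = []
    if load5_per_core > 1.0:
        blockers.append("host_pressure_high_load")
    if containers and not attribution_fresh:
        blockers.append("host_container_cpu_attribution_stale")
    if host.get("host") == "110" and attribution_fresh and any(
        c["container_name"] == "gitea" and c["cpu_cores"] >= 2.0 for c in containers
    ):
        blockers.append("host_110_gitea_cpu_pressure")
    return blockers


stale_host = {
    "host": "110",
    "load5": 18.05,
    "cores": 12,
    "docker_stats": {"fresh": False, "age_seconds": 107475, "max_age_seconds": 300},
    "top_containers": [{"container_name": "gitea", "cpu_cores": 3.4019}],
}
print(pressure_blockers(stale_host))
# ['host_pressure_high_load', 'host_container_cpu_attribution_stale']
```

With the same record but `"fresh": True`, the stale blocker disappears and `host_110_gitea_cpu_pressure` is reported instead, so the freshness flag alone decides whether the Gitea sample counts as an attribution.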
2026-06-25 20:25 orphan Chrome cleanup / scorecard refresh supersedes the 20:11 CPU wording. 110 high CPU was traced to two `stockplatform-review-bulk-ux` Chrome process groups `2756503` and `2829627` with root Chrome process `PPID=1`, elapsed about 5h, no active parent smoke, and sustained GPU/renderer CPU. With user approval, only those two process groups received targeted `SIGTERM` at 20:24. Post-check showed no remaining PGID entries; `vmstat` showed CPU idle around `85-90%`, `si/so=0`, and no immediate swap thrash. No Docker/systemd/Nginx/firewall/K8s action, CI cancellation, manual data ingestion, manual DB write, Wazuh/SOC runtime change, or secret read was performed. The 20:25 full post-start wrapper then returned cold-start `PASS=89 WARN=0 BLOCKED=0`, but overall `POST_START_QUICK_CHECK PASS=37 WARN=2 BLOCKED=1`, `RESULT=BLOCKED`, because StockPlatform data freshness was still blocked at that time and DR remained incomplete.
2026-06-25 20:11 StockPlatform cron-source recovery supersedes the 19:35 source-version wording. StockPlatform Gitea `main` and live `/home/wooo/stockplatform-v2` are now at `fb91aa4c6272469d1d26e0820169629eac17d28a fix(ops): restore production cron recovery entrypoints`; six missing production cron entrypoint scripts are restored, `run-intelligence-sync.sh` contains the Docker-backed `psql` shim, and live contract check confirms every `scripts/ops/*.sh` referenced by `install-production-cron.sh` exists. The only live write performed for StockPlatform recovery was a fast-forward `git pull --ff-only origin main` on 110; no Docker/systemd/Nginx/firewall/K8s restart, manual ingestion run, manual DB write, or secret read was performed. Natural cron evidence after the pull is now green for the repaired entrypoints: `source-remediation-queue` 19:56 and 20:00 succeeded, `market-index-ingestion` 20:00 succeeded, `price-ingestion` 20:02 succeeded, `margin-short-ingestion` 20:05 succeeded, `chips-ingestion` 20:06 succeeded, and `ai-recommendation-pipeline` 20:10 ran but correctly produced the internal blocker `core_margin_short_daily_incomplete,official_margin_short_daily_official_pending`. StockPlatform `/api/v1/system/freshness` therefore still returns `status=blocked` because the 2026-06-25 official margin-short source is pending and `ai.recommendations` must stay on 2026-06-24 until that gate clears. This is no longer a route, source-version, or missing-cron-script blocker; it is a product-data freshness blocker waiting on official source availability and the next valid AI pipeline run.


@@ -20,12 +20,14 @@ from __future__ import annotations
import argparse
import json
import re
import time
from pathlib import Path
from typing import Any
DEFAULT_METRICS_FILE = Path("/home/wooo/node_exporter_textfiles/host_runaway_process.prom")
DEFAULT_DOCKER_STATS_FILE = Path("/home/wooo/node_exporter_textfiles/docker_stats.prom")
DEFAULT_DOCKER_STATS_MAX_AGE_SECONDS = 300
SCHEMA_VERSION = "host_sustained_load_controlled_automation_v1"
LABEL_RE = re.compile(r"(?P<key>[A-Za-z_][A-Za-z0-9_]*)=\"(?P<value>(?:[^\"\\\\]|\\\\.)*)\"")
METRIC_RE = re.compile(
@@ -41,6 +43,11 @@ def parse_args() -> argparse.Namespace:
parser.add_argument("--host", default="110")
parser.add_argument("--metrics-file", type=Path, default=DEFAULT_METRICS_FILE)
parser.add_argument("--docker-stats-file", type=Path, default=DEFAULT_DOCKER_STATS_FILE)
parser.add_argument(
"--docker-stats-max-age-seconds",
type=int,
default=DEFAULT_DOCKER_STATS_MAX_AGE_SECONDS,
)
parser.add_argument("--load5-per-core-threshold", type=float, default=1.5)
parser.add_argument("--ci-stale-age-seconds", type=int, default=1800)
parser.add_argument("--json", action="store_true", help="Print JSON only.")
@@ -92,6 +99,55 @@ def _sample_value(
return default


def _sample_value_any(samples: list[dict[str, Any]], name: str) -> float | None:
    for sample in samples:
        if sample["name"] == name:
            return float(sample["value"])
    return None


def _textfile_mtime_seconds(samples: list[dict[str, Any]], suffix: str) -> float | None:
    for sample in samples:
        if sample["name"] != "node_textfile_mtime_seconds":
            continue
        file_label = str(sample["labels"].get("file") or "")
        if file_label.endswith(suffix):
            return float(sample["value"])
    return None


def docker_stats_freshness(
    *,
    samples: list[dict[str, Any]],
    docker_stats_file: Path,
    max_age_seconds: int,
) -> dict[str, Any]:
    mtime = _textfile_mtime_seconds(samples, "docker_stats.prom")
    now = _sample_value_any(samples, "node_time_seconds")
    source = "node_textfile_mtime_seconds"
    if mtime is None:
        try:
            mtime = docker_stats_file.stat().st_mtime
            now = time.time()
            source = "file_stat_mtime"
        except FileNotFoundError:
            return {
                "fresh": False,
                "age_seconds": None,
                "max_age_seconds": max_age_seconds,
                "source": "missing",
            }
    if now is None:
        now = time.time()
    age_seconds = max(0, int(now - mtime))
    return {
        "fresh": age_seconds <= max_age_seconds,
        "age_seconds": age_seconds,
        "max_age_seconds": max_age_seconds,
        "source": source,
    }


def _rule_values(samples: list[dict[str, Any]], name: str, *, host: str) -> list[dict[str, Any]]:
values = []
for sample in samples:
@@ -159,6 +215,7 @@ def build_packet(
host: str,
samples: list[dict[str, Any]],
docker_samples: list[dict[str, Any]],
docker_stats_status: dict[str, Any],
load5_per_core_threshold: float,
ci_stale_age_seconds: int,
) -> dict[str, Any]:
@@ -209,7 +266,8 @@ def build_packet(
)
)
top_orphan = _top_orphan_rule(samples, host=host)
top_container = _top_container_cpu(docker_samples, host=host)
raw_top_container = _top_container_cpu(docker_samples, host=host)
top_container = raw_top_container if docker_stats_status.get("fresh") is True else None
top_container_name = str((top_container or {}).get("container_name") or "").lower()
top_container_cpu = float((top_container or {}).get("cpu_cores") or 0.0)
@@ -317,6 +375,8 @@ def build_packet(
"active_ci_oldest_age_seconds": active_ci_oldest_age,
"top_orphan_rule": top_orphan,
"top_container_cpu": top_container,
"top_container_cpu_untrusted": raw_top_container,
"docker_stats": docker_stats_status,
},
"commands": {
"dry_run": dry_run_command,
@@ -364,6 +424,11 @@ def main() -> int:
host=args.host,
samples=samples,
docker_samples=docker_samples,
docker_stats_status=docker_stats_freshness(
samples=samples,
docker_stats_file=args.docker_stats_file,
max_age_seconds=args.docker_stats_max_age_seconds,
),
load5_per_core_threshold=args.load5_per_core_threshold,
ci_stale_age_seconds=args.ci_stale_age_seconds,
)


@@ -14,12 +14,14 @@ import json
import os
import re
import subprocess
import time
from pathlib import Path
from typing import Any
DEFAULT_HOST_METRICS_FILE = Path("/home/wooo/node_exporter_textfiles/host_runaway_process.prom")
DEFAULT_DOCKER_STATS_FILE = Path("/home/wooo/node_exporter_textfiles/docker_stats.prom")
DEFAULT_DOCKER_STATS_MAX_AGE_SECONDS = 300
SCHEMA_VERSION = "host_sustained_load_sanitized_evidence_v1"
LABEL_RE = re.compile(r"(?P<key>[A-Za-z_][A-Za-z0-9_]*)=\"(?P<value>(?:[^\"\\\\]|\\\\.)*)\"")
METRIC_RE = re.compile(
@@ -33,6 +35,11 @@ def parse_args() -> argparse.Namespace:
parser.add_argument("--host", default=os.environ.get("AIOPS_HOST_LABEL", "110"))
parser.add_argument("--metrics-file", type=Path, default=DEFAULT_HOST_METRICS_FILE)
parser.add_argument("--docker-stats-file", type=Path, default=DEFAULT_DOCKER_STATS_FILE)
parser.add_argument(
"--docker-stats-max-age-seconds",
type=int,
default=DEFAULT_DOCKER_STATS_MAX_AGE_SECONDS,
)
parser.add_argument("--ps-file", type=Path)
parser.add_argument("--top-n", type=int, default=8)
parser.add_argument("--json", action="store_true")
@@ -66,6 +73,55 @@ def parse_prometheus_text(text: str) -> list[dict[str, Any]]:
return samples


def _sample_value_any(samples: list[dict[str, Any]], name: str) -> float | None:
    for sample in samples:
        if sample["name"] == name:
            return float(sample["value"])
    return None


def _textfile_mtime_seconds(samples: list[dict[str, Any]], suffix: str) -> float | None:
    for sample in samples:
        if sample["name"] != "node_textfile_mtime_seconds":
            continue
        file_label = str(sample["labels"].get("file") or "")
        if file_label.endswith(suffix):
            return float(sample["value"])
    return None


def docker_stats_freshness(
    *,
    samples: list[dict[str, Any]],
    docker_stats_file: Path,
    max_age_seconds: int,
) -> dict[str, Any]:
    mtime = _textfile_mtime_seconds(samples, "docker_stats.prom")
    now = _sample_value_any(samples, "node_time_seconds")
    source = "node_textfile_mtime_seconds"
    if mtime is None:
        try:
            mtime = docker_stats_file.stat().st_mtime
            now = time.time()
            source = "file_stat_mtime"
        except FileNotFoundError:
            return {
                "fresh": False,
                "age_seconds": None,
                "max_age_seconds": max_age_seconds,
                "source": "missing",
            }
    if now is None:
        now = time.time()
    age_seconds = max(0, int(now - mtime))
    return {
        "fresh": age_seconds <= max_age_seconds,
        "age_seconds": age_seconds,
        "max_age_seconds": max_age_seconds,
        "source": source,
    }


def read_text(path: Path | None) -> str:
if path is None:
return ""
@@ -234,8 +290,14 @@ def recommend_playbook(process_families: list[dict[str, Any]], containers: list[
def build_payload(args: argparse.Namespace) -> dict[str, Any]:
host_samples = parse_prometheus_text(read_text(args.metrics_file))
docker_samples = parse_prometheus_text(read_text(args.docker_stats_file))
docker_stats_status = docker_stats_freshness(
samples=host_samples,
docker_stats_file=args.docker_stats_file,
max_age_seconds=args.docker_stats_max_age_seconds,
)
process_summary = summarize_processes(parse_ps_text(collect_ps_text(args.ps_file)), top_n=args.top_n)
containers = top_docker_containers(docker_samples, host=args.host, top_n=args.top_n)
untrusted_containers = top_docker_containers(docker_samples, host=args.host, top_n=args.top_n)
containers = untrusted_containers if docker_stats_status.get("fresh") is True else []
recommendation = recommend_playbook(process_summary["families"], containers)
return {
@@ -248,10 +310,12 @@ def build_payload(args: argparse.Namespace) -> dict[str, Any]:
"readback": {
"host_metric_sample_count": len(host_samples),
"docker_metric_sample_count": len(docker_samples),
"docker_stats": docker_stats_status,
"top_container_count": len(containers),
"top_process_family_count": len(process_summary["families"]),
},
"top_containers": containers,
"top_containers_untrusted": untrusted_containers,
"top_process_families": process_summary["families"],
"top_processes_sanitized": process_summary["top_processes"],
"redaction": {


@@ -425,6 +425,60 @@ def test_sustained_load_controller_routes_gitea_backlog_from_docker_metrics(tmp_
assert "host-sustained-load-evidence.py" in payload["commands"]["dry_run"]
def test_sustained_load_controller_ignores_stale_docker_stats_attribution(tmp_path: Path) -> None:
metrics_file = tmp_path / "host.prom"
metrics_file.write_text(
"\n".join(
[
'awoooi_host_runaway_process_monitor_up{host="110",mode="read_only"} 1',
'awoooi_host_load5_per_core{host="110"} 2.5',
'awoooi_host_swap_used_ratio{host="110"} 0.1',
'awoooi_host_runaway_process_remediation_authorized{host="110"} 0',
'awoooi_host_gitea_actions_active_container_count{host="110"} 0',
'awoooi_host_gitea_actions_active_process_group_count{host="110"} 0',
'awoooi_host_runaway_browser_orphan_group_count{host="110",rule="stockplatform_headless_smoke",min_age_seconds="1800",min_cpu_percent="50"} 0',
'node_textfile_mtime_seconds{file="/host/home/wooo/node_exporter_textfiles/docker_stats.prom"} 1000',
'node_time_seconds 5000',
]
),
encoding="utf-8",
)
docker_file = tmp_path / "docker.prom"
docker_file.write_text(
"\n".join(
[
'docker_container_cpu_cores{host="110",container_name="gitea"} 3.4',
'docker_container_cpu_cores{host="110",container_name="redis"} 0.2',
]
),
encoding="utf-8",
)
result = subprocess.run(
[
sys.executable,
str(CONTROLLER_PATH),
"--host",
"110",
"--metrics-file",
str(metrics_file),
"--docker-stats-file",
str(docker_file),
"--json",
],
capture_output=True,
text=True,
)
assert result.returncode == 75
payload = json.loads(result.stdout)
assert payload["classification"] == "blocked_unknown_sustained_load_requires_source_specific_playbook"
assert payload["readback"]["docker_stats"]["fresh"] is False
assert payload["readback"]["top_container_cpu"] is None
assert payload["readback"]["top_container_cpu_untrusted"]["container_name"] == "gitea"
assert payload["controlled_apply_allowed"] is False
def test_sustained_load_controller_routes_unknown_load_to_sanitized_evidence(tmp_path: Path) -> None:
metrics_file = tmp_path / "host.prom"
metrics_file.write_text(
@@ -506,6 +560,55 @@ def test_sustained_load_evidence_emits_sanitized_gitea_recommendation(tmp_path:
assert "/home/wooo" not in result.stdout
def test_sustained_load_evidence_keeps_stale_container_samples_untrusted(tmp_path: Path) -> None:
metrics_file = tmp_path / "host.prom"
metrics_file.write_text(
"\n".join(
[
'node_textfile_mtime_seconds{file="/host/home/wooo/node_exporter_textfiles/docker_stats.prom"} 1000',
'node_time_seconds 5000',
]
),
encoding="utf-8",
)
docker_file = tmp_path / "docker.prom"
docker_file.write_text(
'docker_container_cpu_cores{host="110",container_name="gitea"} 3.4\n',
encoding="utf-8",
)
ps_file = tmp_path / "ps.txt"
ps_file.write_text(
"100 1 100 120 5.0 1.0 python python monitor.py\n",
encoding="utf-8",
)
result = subprocess.run(
[
sys.executable,
str(EVIDENCE_PATH),
"--host",
"110",
"--metrics-file",
str(metrics_file),
"--ps-file",
str(ps_file),
"--docker-stats-file",
str(docker_file),
"--json",
],
check=True,
capture_output=True,
text=True,
)
payload = json.loads(result.stdout)
assert payload["recommendation"] != "gitea_queue_or_hook_backlog_playbook"
assert payload["readback"]["docker_stats"]["fresh"] is False
assert payload["top_containers"] == []
assert payload["top_containers_untrusted"][0]["container_name"] == "gitea"
assert payload["operation_boundaries"]["host_write_performed"] is False
def test_sustained_load_controller_routes_unknown_load_to_sanitized_evidence(tmp_path: Path) -> None:
metrics_file = tmp_path / "host.prom"
metrics_file.write_text(


@@ -446,6 +446,7 @@ def build_host_pressure_readback(payload: dict[str, Any]) -> dict[str, Any]:
high_load_hosts: list[str] = []
gitea_pressure_hosts: list[str] = []
postgres_pressure_hosts: list[str] = []
container_attribution_stale_hosts: list[str] = []
for item in hosts:
if not isinstance(item, dict):
@@ -459,6 +460,11 @@ def build_host_pressure_readback(payload: dict[str, Any]) -> dict[str, Any]:
if load5_per_core <= 0 and cores > 0:
load5_per_core = load5 / cores
top_containers = normalize_top_containers(item.get("top_containers"))
docker_stats = item.get("docker_stats")
top_containers_fresh = item.get("top_containers_fresh")
if top_containers_fresh is None and isinstance(docker_stats, dict):
top_containers_fresh = docker_stats.get("fresh")
container_attribution_fresh = top_containers_fresh is not False
row = {
"host": host,
"load1": round(float_value(item.get("load1")), 4),
@@ -467,17 +473,22 @@ def build_host_pressure_readback(payload: dict[str, Any]) -> dict[str, Any]:
"load5_per_core": round(load5_per_core, 4),
"node_procs_running": int_value(item.get("node_procs_running")),
"node_procs_blocked": int_value(item.get("node_procs_blocked")),
"top_containers_fresh": container_attribution_fresh,
"top_containers": top_containers[:5],
}
if isinstance(docker_stats, dict):
row["docker_stats"] = docker_stats
rows.append(row)
if load5_per_core > 1.0:
high_load_hosts.append(host)
if host == "110" and any(
if top_containers and not container_attribution_fresh:
container_attribution_stale_hosts.append(host)
if host == "110" and container_attribution_fresh and any(
container["container_name"] == "gitea" and container["cpu_cores"] >= 2.0
for container in top_containers
):
gitea_pressure_hosts.append(host)
if host == "188" and any(
if host == "188" and container_attribution_fresh and any(
container["container_name"] == "k3s-postgres-recovery"
and container["cpu_cores"] >= 4.0
for container in top_containers
@@ -486,6 +497,8 @@ def build_host_pressure_readback(payload: dict[str, Any]) -> dict[str, Any]:
if high_load_hosts:
blockers.append("host_pressure_high_load")
if container_attribution_stale_hosts:
blockers.append("host_container_cpu_attribution_stale")
if gitea_pressure_hosts:
blockers.append("host_110_gitea_cpu_pressure")
if postgres_pressure_hosts:
@@ -500,6 +513,7 @@ def build_host_pressure_readback(payload: dict[str, Any]) -> dict[str, Any]:
"high_load_hosts": high_load_hosts,
"gitea_pressure_hosts": gitea_pressure_hosts,
"postgres_pressure_hosts": postgres_pressure_hosts,
"container_attribution_stale_hosts": container_attribution_stale_hosts,
"conversation_event_hot_path_indexes_present": payload.get(
"conversation_event_hot_path_indexes_present"
),
@@ -507,6 +521,7 @@ def build_host_pressure_readback(payload: dict[str, Any]) -> dict[str, Any]:
"safe_actions": [
"keep_110_legacy_runner_failclosed",
"read_public_gitea_queue_metadata_only",
"restore_docker_stats_textfile_exporter_before_container_cpu_attribution",
"apply_conversation_event_hot_path_indexes_via_controlled_db_migration",
"rerun_host_pressure_and_cold_start_scorecard_after_apply",
],
@@ -558,6 +573,11 @@ def choose_safe_next_step(
"keep_110_runner_failclosed_read_public_gitea_queue_and_recover_awoooi_host_"
"controlled_lane_only_after_verifier_no_generic_runner"
)
if "host_container_cpu_attribution_stale" in pressure_blockers:
return (
"restore_docker_stats_textfile_exporter_then_collect_sanitized_host_"
"pressure_no_restart_no_secret_read"
)
if blockers == ["host_boot_observation_older_than_target_window"]:
return (
"timer_deployed_and_services_readback_green_wait_for_next_all_host_reboot_"


@@ -327,6 +327,46 @@ def test_host_pressure_blocks_slo_with_index_drift_next_step(tmp_path: Path) ->
)
def test_host_pressure_does_not_attribute_stale_docker_stats_to_gitea(tmp_path: Path) -> None:
payload = run_scorecard_with_host_pressure(
tmp_path,
GREEN_SUMMARY,
{
"hosts": [
{
"host": "110",
"load1": 20.74,
"load5": 18.05,
"cores": 12,
"node_procs_running": 63,
"node_procs_blocked": 0,
"docker_stats": {
"fresh": False,
"age_seconds": 107475,
"max_age_seconds": 300,
"source": "node_textfile_mtime_seconds",
},
"top_containers": [
{"container_name": "gitea", "cpu_cores": 3.4019},
],
},
],
},
)
assert payload["status"] == "blocked_reboot_auto_recovery_slo_not_ready"
assert payload["host_pressure"]["high_load_hosts"] == ["110"]
assert payload["host_pressure"]["gitea_pressure_hosts"] == []
assert payload["host_pressure"]["container_attribution_stale_hosts"] == ["110"]
assert "host_pressure_high_load" in payload["active_blockers"]
assert "host_container_cpu_attribution_stale" in payload["active_blockers"]
assert "host_110_gitea_cpu_pressure" not in payload["active_blockers"]
assert payload["safe_next_step"] == (
"restore_docker_stats_textfile_exporter_then_collect_sanitized_host_"
"pressure_no_restart_no_secret_read"
)
def test_stockplatform_recovered_marks_controlled_gate_not_required(
tmp_path: Path,
) -> None: