fix(recovery): ignore stale docker cpu attribution
Some checks failed
CD Pipeline / workflow-shape (push) Successful in 0s
CD Pipeline / cancel-stale-cd (push) Has been skipped
CD Pipeline / tests (push) Failing after 33s
AWOOOI Harbor 110 Local Repair / workflow-shape (push) Successful in 0s
CD Pipeline / build-and-deploy (push) Has been skipped
AWOOOI Harbor 110 Local Repair / harbor-110-local-repair (push) Failing after 1m41s
CD Pipeline / post-deploy-checks (push) Has been skipped
@@ -1,3 +1,23 @@
+## 2026-07-01 — 14:05 110 stale Docker stats attribution hardening
+
+**Issues fixed on the main line**:
+- Live 110 diagnosis is back under high pressure: `NODE_LOAD1=29.9`, `NODE_LOAD5=22.66`, `NODE_LOAD1_PER_CPU=2.49`, `NODE_LOAD_CLASSIFIER=high_load`. However, the `node_textfile_mtime_seconds` for `docker_stats.prom` is already stale by about `107565s`, so the old value `docker_container_cpu_cores{container_name="gitea"}=3.4019` can no longer be treated as the current CPU culprit.
+- `scripts/ops/host-sustained-load-controller.py` now has a Docker stats freshness gate. When the textfile is older than `300s`, `top_container_cpu` becomes `null` and the old container sample is kept only in `top_container_cpu_untrusted`; the same live metrics are now classified as `blocked_unknown_sustained_load_requires_source_specific_playbook` instead of being misreported as `blocked_gitea_queue_or_hook_backlog_requires_playbook`.
+- `scripts/ops/host-sustained-load-evidence.py` likewise moves stale container samples into `top_containers_untrusted`, sets `top_containers=[]`, and changes the recommendation to `source_specific_playbook_required`.
+- `scripts/reboot-recovery/reboot-auto-recovery-slo-scorecard.py` now supports `docker_stats.fresh=false` / `top_containers_fresh=false`. The SLO still keeps `host_pressure_high_load` but no longer produces `host_110_gitea_cpu_pressure`; the next step becomes `restore_docker_stats_textfile_exporter_then_collect_sanitized_host_pressure_no_restart_no_secret_read`.
+- Live route / cold-start are still blocked: `https://registry.wooo.work/v2/` returns 502, `https://harbor.wooo.work/api/v2.0/health` returns 502, and `/tmp/awoooi-full-stack-cold-start-after-e580954e8-20260701-135922.log` reports `PASS=61 WARN=9 BLOCKED=6`. The core blocker remains the 110 control path / local recovery package / Harbor `/v2`, not a confirmed Gitea CPU backlog.
+
+**Verification**:
+- `python3.11 -m pytest scripts/ops/tests/test_host_runaway_process_exporter.py -q`: `19 passed`.
+- `python3.11 -m pytest scripts/reboot-recovery/tests/test_reboot_auto_recovery_slo_scorecard.py -q`: `9 passed`.
+- `python3.11 -m py_compile scripts/ops/host-sustained-load-controller.py scripts/ops/host-sustained-load-evidence.py scripts/reboot-recovery/reboot-auto-recovery-slo-scorecard.py`: passed.
+- `python3.11 ops/runner/guard-gitea-runner-pressure.py --root .`: `GITEA_RUNNER_PRESSURE_GUARD_OK workflow_files=12 scheduled_workflows=4 auto_branch_events_on_110=0 generic_runner_labels=0`.
+
+**Boundaries**: read only node exporter / public routes / public Gitea queue; did not read secret / token / `.env` / raw sessions / SQLite / auth; did not use GitHub / `gh` / GitHub API; no workflow_dispatch; did not reboot the host or restart Docker / Nginx / K3s / DB / firewall; sent no process signal.
+
+**Next steps**:
+- P0 stays on the main line: first make the 110 local recovery package runnable from the 110 console or from the SSH control path once it is restored, then run `recover-110-control-path-and-harbor-local.sh --check`. If only the exporter freshness gap remains, restore the Docker stats textfile exporter first, then collect sanitized host pressure.
+
 ## 2026-07-01 — 14:55 Work Items show AI Loop LOG source tags

 **Issues fixed on the main line**:
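The freshness gate described in the 14:05 entry can be illustrated with a minimal sketch. It assumes only what the entry states: a 300-second limit, a stale age of about 107565 seconds, and the old `gitea` sample of 3.4019 cores. The helper names `is_fresh` and `gate_top_container` are hypothetical and are not functions from the repository.

```python
from __future__ import annotations

from typing import Any, Optional

MAX_AGE_SECONDS = 300  # limit quoted in the log entry


def is_fresh(age_seconds: Optional[int], max_age_seconds: int = MAX_AGE_SECONDS) -> bool:
    """Treat an unknown age as stale; otherwise compare against the limit."""
    return age_seconds is not None and age_seconds <= max_age_seconds


def gate_top_container(
    raw_sample: Optional[dict[str, Any]], age_seconds: Optional[int]
) -> dict[str, Any]:
    """Hypothetical helper: expose the sample as trusted only when it is fresh."""
    fresh = is_fresh(age_seconds)
    return {
        "fresh": fresh,
        "top_container_cpu": raw_sample if fresh else None,
        # The raw sample is always kept as evidence, never as attribution.
        "top_container_cpu_untrusted": raw_sample,
    }


stale = gate_top_container({"container_name": "gitea", "cpu_cores": 3.4019}, 107565)
print(stale["top_container_cpu"])  # None
```

Keeping the raw sample under a separate untrusted key preserves the evidence while preventing a classifier from reading it as a current CPU attribution.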
@@ -72,6 +72,8 @@ v1.82 bounded summary rule:`post-start-quick-check.sh` 與 `188-host-hygiene-m

 2026-07-01 13:35 110 CPU/load interpretation rule update: check the mtime / freshness of `docker_stats.prom` first; once it is older than 300 seconds it must not be used for current container CPU attribution or as a blocker. If `node_load5` on 110 is high but Prometheus CPU mode still shows idle, `awoooi_host_gitea_actions_active_process_count=0`, and orphan browser count=0, the main blocker must not be misattributed to Gitea / Playwright / Stock smoke CPU. In that case, look first at `NODE_LOAD_CLASSIFIER`, `DOCKER_STATS_TEXTFILE_FRESHNESS`, and `SYSTEMD_UNIT ... classifier=systemctl_show_timeout|systemctl_timeout_budget_exhausted` from `diagnose-110-ssh-publickey-auth.sh`. On an external SSH userauth timeout, cold-start must emit `SSH_110_BLOCKER remote_control_channel_unavailable` and `SSH_110_NEXT_ACTION local_console_run_recover_110_control_path_and_harbor_local_check`; the next step is to run `recover-110-control-path-and-harbor-local.sh --check` from the 110 local console or a restored control path, not to rerun the Harbor workflow or use old docker stats to blame Gitea.

+2026-07-01 14:05 Added controller / SLO stale-attribution guard: `host-sustained-load-controller.py` and `host-sustained-load-evidence.py` must mark Docker stats samples older than `300s` as untrusted; `top_container_cpu` / `top_containers` must not use stale `docker_container_cpu_cores`, and old values may remain only in `top_container_cpu_untrusted` / `top_containers_untrusted` as evidence. If `reboot-auto-recovery-slo-scorecard.py` receives `docker_stats.fresh=false` or `top_containers_fresh=false`, it may keep only `host_pressure_high_load` and `host_container_cpu_attribution_stale` and must not produce `host_110_gitea_cpu_pressure`. The next step is then fixed: restore the Docker stats textfile exporter or collect sanitized host pressure, still without restarting Docker / Nginx / K3s / DB / firewall, without restoring the generic runner, and without using stale Gitea CPU samples to cancel or drain any work.
+
 2026-06-25 20:25 orphan Chrome cleanup / scorecard refresh supersedes the 20:11 CPU wording. 110 high CPU was traced to two `stockplatform-review-bulk-ux` Chrome process groups `2756503` and `2829627` with root Chrome process `PPID=1`, elapsed about 5h, no active parent smoke, and sustained GPU/renderer CPU. With user approval, only those two process groups received targeted `SIGTERM` at 20:24. Post-check showed no remaining PGID entries; `vmstat` showed CPU idle around `85-90%`, `si/so=0`, and no immediate swap thrash. No Docker/systemd/Nginx/firewall/K8s action, CI cancellation, manual data ingestion, manual DB write, Wazuh/SOC runtime change, or secret read was performed. The 20:25 full post-start wrapper then returned cold-start `PASS=89 WARN=0 BLOCKED=0`, but overall `POST_START_QUICK_CHECK PASS=37 WARN=2 BLOCKED=1`, `RESULT=BLOCKED`, because StockPlatform data freshness was still blocked at that time and DR remained incomplete.

 2026-06-25 20:11 StockPlatform cron-source recovery supersedes the 19:35 source-version wording. StockPlatform Gitea `main` and live `/home/wooo/stockplatform-v2` are now at `fb91aa4c6272469d1d26e0820169629eac17d28a fix(ops): restore production cron recovery entrypoints`; six missing production cron entrypoint scripts are restored, `run-intelligence-sync.sh` contains the Docker-backed `psql` shim, and live contract check confirms every `scripts/ops/*.sh` referenced by `install-production-cron.sh` exists. The only live write performed for StockPlatform recovery was a fast-forward `git pull --ff-only origin main` on 110; no Docker/systemd/Nginx/firewall/K8s restart, manual ingestion run, manual DB write, or secret read was performed. Natural cron evidence after the pull is now green for the repaired entrypoints: `source-remediation-queue` 19:56 and 20:00 succeeded, `market-index-ingestion` 20:00 succeeded, `price-ingestion` 20:02 succeeded, `margin-short-ingestion` 20:05 succeeded, `chips-ingestion` 20:06 succeeded, and `ai-recommendation-pipeline` 20:10 ran but correctly produced the internal blocker `core_margin_short_daily_incomplete,official_margin_short_daily_official_pending`. StockPlatform `/api/v1/system/freshness` therefore still returns `status=blocked` because the 2026-06-25 official margin-short source is pending and `ai.recommendations` must stay on 2026-06-24 until that gate clears. This is no longer a route, source-version, or missing-cron-script blocker; it is a product-data freshness blocker waiting on official source availability and the next valid AI pipeline run.
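The 14:05 rule above can be sketched as a small blocker derivation. This is an illustrative sketch only: `pressure_blockers` is a hypothetical name, and the `1.0` load and `2.0` core thresholds are taken from the scorecard diff on this page rather than from any published specification.

```python
from __future__ import annotations

from typing import Any, Optional


def pressure_blockers(
    load5_per_core: float,
    top_containers: list[dict[str, Any]],
    top_containers_fresh: Optional[bool],
) -> list[str]:
    """Hypothetical helper: derive host-110 pressure blockers from one reading."""
    blockers: list[str] = []
    # Only an explicit False marks attribution as stale; None is treated as fresh.
    fresh = top_containers_fresh is not False
    if load5_per_core > 1.0:
        blockers.append("host_pressure_high_load")
    if top_containers and not fresh:
        blockers.append("host_container_cpu_attribution_stale")
    if fresh and any(
        c["container_name"] == "gitea" and c["cpu_cores"] >= 2.0 for c in top_containers
    ):
        blockers.append("host_110_gitea_cpu_pressure")
    return blockers


sample = [{"container_name": "gitea", "cpu_cores": 3.4019}]
print(pressure_blockers(1.5, sample, False))
print(pressure_blockers(1.5, sample, True))
```

With a stale reading the Gitea blocker is never emitted, which is the behavior the rule requires; the same sample produces it only when the attribution is fresh.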
@@ -20,12 +20,14 @@ from __future__ import annotations
 import argparse
 import json
 import re
+import time
 from pathlib import Path
 from typing import Any


 DEFAULT_METRICS_FILE = Path("/home/wooo/node_exporter_textfiles/host_runaway_process.prom")
 DEFAULT_DOCKER_STATS_FILE = Path("/home/wooo/node_exporter_textfiles/docker_stats.prom")
+DEFAULT_DOCKER_STATS_MAX_AGE_SECONDS = 300
 SCHEMA_VERSION = "host_sustained_load_controlled_automation_v1"
 LABEL_RE = re.compile(r"(?P<key>[A-Za-z_][A-Za-z0-9_]*)=\"(?P<value>(?:[^\"\\\\]|\\\\.)*)\"")
 METRIC_RE = re.compile(
@@ -41,6 +43,11 @@ def parse_args() -> argparse.Namespace:
     parser.add_argument("--host", default="110")
     parser.add_argument("--metrics-file", type=Path, default=DEFAULT_METRICS_FILE)
     parser.add_argument("--docker-stats-file", type=Path, default=DEFAULT_DOCKER_STATS_FILE)
+    parser.add_argument(
+        "--docker-stats-max-age-seconds",
+        type=int,
+        default=DEFAULT_DOCKER_STATS_MAX_AGE_SECONDS,
+    )
     parser.add_argument("--load5-per-core-threshold", type=float, default=1.5)
     parser.add_argument("--ci-stale-age-seconds", type=int, default=1800)
     parser.add_argument("--json", action="store_true", help="Print JSON only.")
@@ -92,6 +99,55 @@ def _sample_value(
     return default


+def _sample_value_any(samples: list[dict[str, Any]], name: str) -> float | None:
+    for sample in samples:
+        if sample["name"] == name:
+            return float(sample["value"])
+    return None
+
+
+def _textfile_mtime_seconds(samples: list[dict[str, Any]], suffix: str) -> float | None:
+    for sample in samples:
+        if sample["name"] != "node_textfile_mtime_seconds":
+            continue
+        file_label = str(sample["labels"].get("file") or "")
+        if file_label.endswith(suffix):
+            return float(sample["value"])
+    return None
+
+
+def docker_stats_freshness(
+    *,
+    samples: list[dict[str, Any]],
+    docker_stats_file: Path,
+    max_age_seconds: int,
+) -> dict[str, Any]:
+    mtime = _textfile_mtime_seconds(samples, "docker_stats.prom")
+    now = _sample_value_any(samples, "node_time_seconds")
+    source = "node_textfile_mtime_seconds"
+    if mtime is None:
+        try:
+            mtime = docker_stats_file.stat().st_mtime
+            now = time.time()
+            source = "file_stat_mtime"
+        except FileNotFoundError:
+            return {
+                "fresh": False,
+                "age_seconds": None,
+                "max_age_seconds": max_age_seconds,
+                "source": "missing",
+            }
+    if now is None:
+        now = time.time()
+    age_seconds = max(0, int(now - mtime))
+    return {
+        "fresh": age_seconds <= max_age_seconds,
+        "age_seconds": age_seconds,
+        "max_age_seconds": max_age_seconds,
+        "source": source,
+    }
+
+
 def _rule_values(samples: list[dict[str, Any]], name: str, *, host: str) -> list[dict[str, Any]]:
     values = []
     for sample in samples:
@@ -159,6 +215,7 @@ def build_packet(
     host: str,
     samples: list[dict[str, Any]],
     docker_samples: list[dict[str, Any]],
+    docker_stats_status: dict[str, Any],
     load5_per_core_threshold: float,
     ci_stale_age_seconds: int,
 ) -> dict[str, Any]:
@@ -209,7 +266,8 @@ def build_packet(
         )
     )
     top_orphan = _top_orphan_rule(samples, host=host)
-    top_container = _top_container_cpu(docker_samples, host=host)
+    raw_top_container = _top_container_cpu(docker_samples, host=host)
+    top_container = raw_top_container if docker_stats_status.get("fresh") is True else None
     top_container_name = str((top_container or {}).get("container_name") or "").lower()
     top_container_cpu = float((top_container or {}).get("cpu_cores") or 0.0)

@@ -317,6 +375,8 @@ def build_packet(
             "active_ci_oldest_age_seconds": active_ci_oldest_age,
             "top_orphan_rule": top_orphan,
             "top_container_cpu": top_container,
+            "top_container_cpu_untrusted": raw_top_container,
+            "docker_stats": docker_stats_status,
         },
         "commands": {
             "dry_run": dry_run_command,
@@ -364,6 +424,11 @@ def main() -> int:
         host=args.host,
         samples=samples,
         docker_samples=docker_samples,
+        docker_stats_status=docker_stats_freshness(
+            samples=samples,
+            docker_stats_file=args.docker_stats_file,
+            max_age_seconds=args.docker_stats_max_age_seconds,
+        ),
         load5_per_core_threshold=args.load5_per_core_threshold,
         ci_stale_age_seconds=args.ci_stale_age_seconds,
     )
@@ -14,12 +14,14 @@ import json
 import os
 import re
 import subprocess
+import time
 from pathlib import Path
 from typing import Any


 DEFAULT_HOST_METRICS_FILE = Path("/home/wooo/node_exporter_textfiles/host_runaway_process.prom")
 DEFAULT_DOCKER_STATS_FILE = Path("/home/wooo/node_exporter_textfiles/docker_stats.prom")
+DEFAULT_DOCKER_STATS_MAX_AGE_SECONDS = 300
 SCHEMA_VERSION = "host_sustained_load_sanitized_evidence_v1"
 LABEL_RE = re.compile(r"(?P<key>[A-Za-z_][A-Za-z0-9_]*)=\"(?P<value>(?:[^\"\\\\]|\\\\.)*)\"")
 METRIC_RE = re.compile(
@@ -33,6 +35,11 @@ def parse_args() -> argparse.Namespace:
     parser.add_argument("--host", default=os.environ.get("AIOPS_HOST_LABEL", "110"))
     parser.add_argument("--metrics-file", type=Path, default=DEFAULT_HOST_METRICS_FILE)
     parser.add_argument("--docker-stats-file", type=Path, default=DEFAULT_DOCKER_STATS_FILE)
+    parser.add_argument(
+        "--docker-stats-max-age-seconds",
+        type=int,
+        default=DEFAULT_DOCKER_STATS_MAX_AGE_SECONDS,
+    )
     parser.add_argument("--ps-file", type=Path)
     parser.add_argument("--top-n", type=int, default=8)
     parser.add_argument("--json", action="store_true")
@@ -66,6 +73,55 @@ def parse_prometheus_text(text: str) -> list[dict[str, Any]]:
     return samples


+def _sample_value_any(samples: list[dict[str, Any]], name: str) -> float | None:
+    for sample in samples:
+        if sample["name"] == name:
+            return float(sample["value"])
+    return None
+
+
+def _textfile_mtime_seconds(samples: list[dict[str, Any]], suffix: str) -> float | None:
+    for sample in samples:
+        if sample["name"] != "node_textfile_mtime_seconds":
+            continue
+        file_label = str(sample["labels"].get("file") or "")
+        if file_label.endswith(suffix):
+            return float(sample["value"])
+    return None
+
+
+def docker_stats_freshness(
+    *,
+    samples: list[dict[str, Any]],
+    docker_stats_file: Path,
+    max_age_seconds: int,
+) -> dict[str, Any]:
+    mtime = _textfile_mtime_seconds(samples, "docker_stats.prom")
+    now = _sample_value_any(samples, "node_time_seconds")
+    source = "node_textfile_mtime_seconds"
+    if mtime is None:
+        try:
+            mtime = docker_stats_file.stat().st_mtime
+            now = time.time()
+            source = "file_stat_mtime"
+        except FileNotFoundError:
+            return {
+                "fresh": False,
+                "age_seconds": None,
+                "max_age_seconds": max_age_seconds,
+                "source": "missing",
+            }
+    if now is None:
+        now = time.time()
+    age_seconds = max(0, int(now - mtime))
+    return {
+        "fresh": age_seconds <= max_age_seconds,
+        "age_seconds": age_seconds,
+        "max_age_seconds": max_age_seconds,
+        "source": source,
+    }
+
+
 def read_text(path: Path | None) -> str:
     if path is None:
         return ""
@@ -234,8 +290,14 @@ def recommend_playbook(process_families: list[dict[str, Any]], containers: list[
 def build_payload(args: argparse.Namespace) -> dict[str, Any]:
     host_samples = parse_prometheus_text(read_text(args.metrics_file))
     docker_samples = parse_prometheus_text(read_text(args.docker_stats_file))
+    docker_stats_status = docker_stats_freshness(
+        samples=host_samples,
+        docker_stats_file=args.docker_stats_file,
+        max_age_seconds=args.docker_stats_max_age_seconds,
+    )
     process_summary = summarize_processes(parse_ps_text(collect_ps_text(args.ps_file)), top_n=args.top_n)
-    containers = top_docker_containers(docker_samples, host=args.host, top_n=args.top_n)
+    untrusted_containers = top_docker_containers(docker_samples, host=args.host, top_n=args.top_n)
+    containers = untrusted_containers if docker_stats_status.get("fresh") is True else []
     recommendation = recommend_playbook(process_summary["families"], containers)

     return {
@@ -248,10 +310,12 @@ def build_payload(args: argparse.Namespace) -> dict[str, Any]:
         "readback": {
             "host_metric_sample_count": len(host_samples),
             "docker_metric_sample_count": len(docker_samples),
+            "docker_stats": docker_stats_status,
             "top_container_count": len(containers),
             "top_process_family_count": len(process_summary["families"]),
         },
         "top_containers": containers,
+        "top_containers_untrusted": untrusted_containers,
         "top_process_families": process_summary["families"],
         "top_processes_sanitized": process_summary["top_processes"],
         "redaction": {
@@ -425,6 +425,60 @@ def test_sustained_load_controller_routes_gitea_backlog_from_docker_metrics(tmp_
     assert "host-sustained-load-evidence.py" in payload["commands"]["dry_run"]


+def test_sustained_load_controller_ignores_stale_docker_stats_attribution(tmp_path: Path) -> None:
+    metrics_file = tmp_path / "host.prom"
+    metrics_file.write_text(
+        "\n".join(
+            [
+                'awoooi_host_runaway_process_monitor_up{host="110",mode="read_only"} 1',
+                'awoooi_host_load5_per_core{host="110"} 2.5',
+                'awoooi_host_swap_used_ratio{host="110"} 0.1',
+                'awoooi_host_runaway_process_remediation_authorized{host="110"} 0',
+                'awoooi_host_gitea_actions_active_container_count{host="110"} 0',
+                'awoooi_host_gitea_actions_active_process_group_count{host="110"} 0',
+                'awoooi_host_runaway_browser_orphan_group_count{host="110",rule="stockplatform_headless_smoke",min_age_seconds="1800",min_cpu_percent="50"} 0',
+                'node_textfile_mtime_seconds{file="/host/home/wooo/node_exporter_textfiles/docker_stats.prom"} 1000',
+                'node_time_seconds 5000',
+            ]
+        ),
+        encoding="utf-8",
+    )
+    docker_file = tmp_path / "docker.prom"
+    docker_file.write_text(
+        "\n".join(
+            [
+                'docker_container_cpu_cores{host="110",container_name="gitea"} 3.4',
+                'docker_container_cpu_cores{host="110",container_name="redis"} 0.2',
+            ]
+        ),
+        encoding="utf-8",
+    )
+
+    result = subprocess.run(
+        [
+            sys.executable,
+            str(CONTROLLER_PATH),
+            "--host",
+            "110",
+            "--metrics-file",
+            str(metrics_file),
+            "--docker-stats-file",
+            str(docker_file),
+            "--json",
+        ],
+        capture_output=True,
+        text=True,
+    )
+
+    assert result.returncode == 75
+    payload = json.loads(result.stdout)
+    assert payload["classification"] == "blocked_unknown_sustained_load_requires_source_specific_playbook"
+    assert payload["readback"]["docker_stats"]["fresh"] is False
+    assert payload["readback"]["top_container_cpu"] is None
+    assert payload["readback"]["top_container_cpu_untrusted"]["container_name"] == "gitea"
+    assert payload["controlled_apply_allowed"] is False
+
+
 def test_sustained_load_controller_routes_unknown_load_to_sanitized_evidence(tmp_path: Path) -> None:
     metrics_file = tmp_path / "host.prom"
     metrics_file.write_text(
@@ -506,6 +560,55 @@ def test_sustained_load_evidence_emits_sanitized_gitea_recommendation(tmp_path:
     assert "/home/wooo" not in result.stdout


+def test_sustained_load_evidence_keeps_stale_container_samples_untrusted(tmp_path: Path) -> None:
+    metrics_file = tmp_path / "host.prom"
+    metrics_file.write_text(
+        "\n".join(
+            [
+                'node_textfile_mtime_seconds{file="/host/home/wooo/node_exporter_textfiles/docker_stats.prom"} 1000',
+                'node_time_seconds 5000',
+            ]
+        ),
+        encoding="utf-8",
+    )
+    docker_file = tmp_path / "docker.prom"
+    docker_file.write_text(
+        'docker_container_cpu_cores{host="110",container_name="gitea"} 3.4\n',
+        encoding="utf-8",
+    )
+    ps_file = tmp_path / "ps.txt"
+    ps_file.write_text(
+        "100 1 100 120 5.0 1.0 python python monitor.py\n",
+        encoding="utf-8",
+    )
+
+    result = subprocess.run(
+        [
+            sys.executable,
+            str(EVIDENCE_PATH),
+            "--host",
+            "110",
+            "--metrics-file",
+            str(metrics_file),
+            "--ps-file",
+            str(ps_file),
+            "--docker-stats-file",
+            str(docker_file),
+            "--json",
+        ],
+        check=True,
+        capture_output=True,
+        text=True,
+    )
+
+    payload = json.loads(result.stdout)
+    assert payload["recommendation"] != "gitea_queue_or_hook_backlog_playbook"
+    assert payload["readback"]["docker_stats"]["fresh"] is False
+    assert payload["top_containers"] == []
+    assert payload["top_containers_untrusted"][0]["container_name"] == "gitea"
+    assert payload["operation_boundaries"]["host_write_performed"] is False
+
+
 def test_sustained_load_controller_routes_unknown_load_to_sanitized_evidence(tmp_path: Path) -> None:
     metrics_file = tmp_path / "host.prom"
     metrics_file.write_text(
@@ -446,6 +446,7 @@ def build_host_pressure_readback(payload: dict[str, Any]) -> dict[str, Any]:
     high_load_hosts: list[str] = []
     gitea_pressure_hosts: list[str] = []
     postgres_pressure_hosts: list[str] = []
+    container_attribution_stale_hosts: list[str] = []

     for item in hosts:
         if not isinstance(item, dict):
@@ -459,6 +460,11 @@ def build_host_pressure_readback(payload: dict[str, Any]) -> dict[str, Any]:
         if load5_per_core <= 0 and cores > 0:
             load5_per_core = load5 / cores
         top_containers = normalize_top_containers(item.get("top_containers"))
+        docker_stats = item.get("docker_stats")
+        top_containers_fresh = item.get("top_containers_fresh")
+        if top_containers_fresh is None and isinstance(docker_stats, dict):
+            top_containers_fresh = docker_stats.get("fresh")
+        container_attribution_fresh = top_containers_fresh is not False
         row = {
             "host": host,
             "load1": round(float_value(item.get("load1")), 4),
@@ -467,17 +473,22 @@ def build_host_pressure_readback(payload: dict[str, Any]) -> dict[str, Any]:
             "load5_per_core": round(load5_per_core, 4),
             "node_procs_running": int_value(item.get("node_procs_running")),
             "node_procs_blocked": int_value(item.get("node_procs_blocked")),
+            "top_containers_fresh": container_attribution_fresh,
             "top_containers": top_containers[:5],
         }
+        if isinstance(docker_stats, dict):
+            row["docker_stats"] = docker_stats
         rows.append(row)
         if load5_per_core > 1.0:
             high_load_hosts.append(host)
-        if host == "110" and any(
+        if top_containers and not container_attribution_fresh:
+            container_attribution_stale_hosts.append(host)
+        if host == "110" and container_attribution_fresh and any(
             container["container_name"] == "gitea" and container["cpu_cores"] >= 2.0
             for container in top_containers
         ):
             gitea_pressure_hosts.append(host)
-        if host == "188" and any(
+        if host == "188" and container_attribution_fresh and any(
             container["container_name"] == "k3s-postgres-recovery"
             and container["cpu_cores"] >= 4.0
             for container in top_containers
@@ -486,6 +497,8 @@ def build_host_pressure_readback(payload: dict[str, Any]) -> dict[str, Any]:

     if high_load_hosts:
         blockers.append("host_pressure_high_load")
+    if container_attribution_stale_hosts:
+        blockers.append("host_container_cpu_attribution_stale")
     if gitea_pressure_hosts:
         blockers.append("host_110_gitea_cpu_pressure")
     if postgres_pressure_hosts:
@@ -500,6 +513,7 @@ def build_host_pressure_readback(payload: dict[str, Any]) -> dict[str, Any]:
         "high_load_hosts": high_load_hosts,
         "gitea_pressure_hosts": gitea_pressure_hosts,
         "postgres_pressure_hosts": postgres_pressure_hosts,
+        "container_attribution_stale_hosts": container_attribution_stale_hosts,
         "conversation_event_hot_path_indexes_present": payload.get(
             "conversation_event_hot_path_indexes_present"
         ),
@@ -507,6 +521,7 @@ def build_host_pressure_readback(payload: dict[str, Any]) -> dict[str, Any]:
         "safe_actions": [
             "keep_110_legacy_runner_failclosed",
             "read_public_gitea_queue_metadata_only",
+            "restore_docker_stats_textfile_exporter_before_container_cpu_attribution",
             "apply_conversation_event_hot_path_indexes_via_controlled_db_migration",
             "rerun_host_pressure_and_cold_start_scorecard_after_apply",
         ],
@@ -558,6 +573,11 @@ def choose_safe_next_step(
             "keep_110_runner_failclosed_read_public_gitea_queue_and_recover_awoooi_host_"
             "controlled_lane_only_after_verifier_no_generic_runner"
         )
+    if "host_container_cpu_attribution_stale" in pressure_blockers:
+        return (
+            "restore_docker_stats_textfile_exporter_then_collect_sanitized_host_"
+            "pressure_no_restart_no_secret_read"
+        )
     if blockers == ["host_boot_observation_older_than_target_window"]:
         return (
             "timer_deployed_and_services_readback_green_wait_for_next_all_host_reboot_"
@@ -327,6 +327,46 @@ def test_host_pressure_blocks_slo_with_index_drift_next_step(tmp_path: Path) ->
     )


+def test_host_pressure_does_not_attribute_stale_docker_stats_to_gitea(tmp_path: Path) -> None:
+    payload = run_scorecard_with_host_pressure(
+        tmp_path,
+        GREEN_SUMMARY,
+        {
+            "hosts": [
+                {
+                    "host": "110",
+                    "load1": 20.74,
+                    "load5": 18.05,
+                    "cores": 12,
+                    "node_procs_running": 63,
+                    "node_procs_blocked": 0,
+                    "docker_stats": {
+                        "fresh": False,
+                        "age_seconds": 107475,
+                        "max_age_seconds": 300,
+                        "source": "node_textfile_mtime_seconds",
+                    },
+                    "top_containers": [
+                        {"container_name": "gitea", "cpu_cores": 3.4019},
+                    ],
+                },
+            ],
+        },
+    )
+
+    assert payload["status"] == "blocked_reboot_auto_recovery_slo_not_ready"
+    assert payload["host_pressure"]["high_load_hosts"] == ["110"]
+    assert payload["host_pressure"]["gitea_pressure_hosts"] == []
+    assert payload["host_pressure"]["container_attribution_stale_hosts"] == ["110"]
+    assert "host_pressure_high_load" in payload["active_blockers"]
+    assert "host_container_cpu_attribution_stale" in payload["active_blockers"]
+    assert "host_110_gitea_cpu_pressure" not in payload["active_blockers"]
+    assert payload["safe_next_step"] == (
+        "restore_docker_stats_textfile_exporter_then_collect_sanitized_host_"
+        "pressure_no_restart_no_secret_read"
+    )
+
+
 def test_stockplatform_recovered_marks_controlled_gate_not_required(
     tmp_path: Path,
 ) -> None:
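For readers checking why the stale-attribution test fixture still lands in `high_load_hosts`, the arithmetic can be reproduced with a short sketch. It assumes the fixture values shown in the test (`load5=18.05`, `cores=12`) and the `> 1.0` threshold and four-decimal rounding visible in the scorecard diff; `load5_per_core` here is a hypothetical standalone helper, not the scorecard's own function.

```python
def load5_per_core(load5: float, cores: int) -> float:
    """Hypothetical helper: per-core 5-minute load, rounded as in the readback row."""
    if cores <= 0:
        return 0.0
    return round(load5 / cores, 4)


value = load5_per_core(18.05, 12)
print(value)        # 1.5042
print(value > 1.0)  # True, so the host counts as high load
```

The host is therefore flagged for load on its own merits, independent of whether the container CPU attribution is fresh.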