diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index 744f160f..fbee3f73 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -52440,21 +52440,23 @@ production browser smoke:
 - `scripts/ops/host-runaway-process-exporter.py` adds sanitized process-family metrics: `awoooi_host_process_family_cpu_percent`, `process_count`, `oldest_age_seconds`, `top_info`; it emits only the `family` / `comm` labels and never the raw command line, workspace path, URL, or secret.
 - `scripts/ops/host-sustained-load-controller.py` adds live `ps` process-family classification and routes moderate pressure to the `control_plane_saturation`, `gitea_queue_or_hook_backlog`, and `stockplatform_hot_query_or_api_pressure` check-mode packets; the controller still only produces a controlled packet and performs no Docker / systemd / DB / Nginx restart.
 - `ops/monitoring/alerts-unified.yml` extends `Host110SustainedModeratePressure` from watching only `load5/core` / container CPU to also watching `awoooi_host_process_family_cpu_percent{family=~"systemd_control_plane|ssh_control_plane|gitea_service|postgres"} > 50`; the Gitea / StockPlatform container early-triage threshold drops from `> 2.0 core` to `> 1.0 core`.
+- `scripts/ops/host-sustained-load-controller.py` adds `--script-dir` (default `/home/wooo/scripts`) so that the `dry_run_command` / `post_apply_verifier` emitted for live alerts point at the helpers actually deployed on 110, instead of the repo-relative `scripts/ops/...` paths that do not resolve there.
 - Deployed live to 110:
   - `/home/wooo/scripts/host-runaway-process-exporter.py` SHA `d85d27c81ea76a8f2f370ee85c92381bec4440eea4fd37efb2efb9f43dbd1a8a`
-  - `/home/wooo/scripts/host-sustained-load-controller.py` SHA `c731c6b5fe5e1683931cf949b3b18c8b290c9a5d5973fac826487ad8de05434a`
+  - `/home/wooo/scripts/host-sustained-load-controller.py` SHA `7a11407c7df05427085982d6f6d11d1756f908591573a28c9b3267de32b94f3e`
 - `bash scripts/ops/deploy-alerts.sh` completed; Prometheus has loaded `159` rules.
 
 **Live readback evidence**:
-- The 110 textfile now emits process-family metrics: `systemd_control_plane=72.4`, `gitea_service=53.2`, `postgres=23.3`.
+- The 110 textfile / Prometheus now emit process-family metrics: `systemd_control_plane=72.4`, `gitea_service=53.1`, `postgres=11.1`.
 - Prometheus rule readback: `Host110SustainedModeratePressure state=firing health=ok`; the query now includes `docker_container_cpu_cores > 1` and `awoooi_host_process_family_cpu_percent > 50`.
-- Prometheus `ALERTS{alertname="Host110SustainedModeratePressure"}` already shows `firing` for `container_name="gitea"`, `family="systemd_control_plane"`, and `family="gitea_service"`; `stockplatform-v2-postgres-1` was `pending` for a short time after deployment.
-- Alertmanager `/api/v2/alerts` has an active alert with `status.state=active`, `notification_type=TYPE-1`, `auto_repair=true`, `team=ops`.
-- Live controller readback: `classification=blocked_control_plane_saturation_requires_playbook`, `next_action=run_control_plane_saturation_playbook_check_mode`, `control_plane_process_cpu_percent=72.4`, `gitea_process_cpu_percent=53.2`, `top_container_cpu.gitea=0.2507~1.5789`, and `docker_stats.fresh=true`.
+- Alertmanager `/api/v2/alerts` has active alerts for `family=gitea_service` and `family=systemd_control_plane`, with `status.state=active` and `auto_repair=true`.
+- Live controller readback: `classification=blocked_control_plane_saturation_requires_playbook`, `next_action=run_control_plane_saturation_playbook_check_mode`, `dry_run_command=/home/wooo/scripts/host-sustained-load-evidence.py ...`, `post_apply_verifier=/home/wooo/scripts/host-sustained-load-controller.py ...`, `controller_exit=75`.
+- Live sanitized evidence readback: `recommendation=control_plane_saturation_playbook`, `controlled_apply_allowed=false`; `top_process_families` includes `systemd_control_plane=72.4`, `gitea_service=53.1`, `unknown=54.3`; `top_containers` is led by `gitea=1.5951 core`; the evidence explicitly states that it emits no raw command line / URL / secret.
 
 **Local verification results**:
-- `python3.11 -m py_compile scripts/ops/host-runaway-process-exporter.py scripts/ops/host-sustained-load-controller.py`: passed.
+- `python3.11 -m py_compile scripts/ops/host-runaway-process-exporter.py scripts/ops/host-sustained-load-controller.py scripts/ops/host-sustained-load-evidence.py`: passed.
 - `python3.11 -m pytest scripts/ops/tests/test_host_runaway_process_exporter.py scripts/ops/tests/test_host_pressure_alert_contract.py -q`: `27 passed`.
+- `python3.11 -m pytest ops/runner/test_cd_controlled_runtime_profile.py scripts/ops/tests/test_host_runaway_process_exporter.py scripts/ops/tests/test_host_pressure_alert_contract.py -q`: `70 passed`.
 - `python3.11 -c "import yaml; yaml.safe_load(open('ops/monitoring/alerts-unified.yml'))"`: passed.
 - `python3.11 ops/runner/guard-gitea-runner-pressure.py --root .`: passed, with `auto_branch_events_on_110=0` and `generic_runner_labels=0`.
 - `git diff --check`: passed.
diff --git a/docs/runbooks/FULL-STACK-COLD-START-SOP.md b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
index b59a9b46..2482ec0c 100644
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -92,6 +92,8 @@ v1.82 bounded summary rule: `post-start-quick-check.sh` and `188-host-hygiene-m
 
 2026-07-01 23:28 added proactive alerting for 110 and controlled quota tightening; 2026-07-02 00:40 updated the Gitea cap; 2026-07-02 11:54 added process-family proactive alerting: the `load5/core > 1.5 for 15m` condition of `HostLoadAverageSustainedHigh` can serve only as the critical threshold and cannot answer "why was there no alert when 110 CPU stayed high again". Prometheus must also carry `Host110SustainedModeratePressure`, which alerts when any of `awoooi_host_load5_per_core{host="110"} > 0.75`, `docker_container_cpu_cores > 1.0` on the key Gitea / StockPlatform containers, or `awoooi_host_process_family_cpu_percent{family=~"systemd_control_plane|ssh_control_plane|gitea_service|postgres"} > 50` persists for 1 minute; the auto-repair action must point at the controller deployed on 110, `/home/wooo/scripts/host-sustained-load-controller.py --load5-per-core-threshold 0.75 --hot-container-cpu-threshold 1.0 --process-family-cpu-threshold 50`. `host-runaway-process-exporter.py` must emit sanitized process-family metrics with only the family / comm labels, and must not emit the raw command line, workspace path, URL, or secret. The Gitea runtime CPU quota has been tightened further from `docker update --cpus=2 gitea` to `docker update --cpus=1.5 gitea`; rollback may only return temporarily to `docker update --cpus=2 gitea` and requires a Prometheus readback, without restarting the Docker daemon or the Gitea container. If `BackupAggregateRunFailed` is firing only because of the stale aggregate failed_count for `exported_job="backup_all"` while the component jobs / `backup-status.sh` are already green, treat it as noise and judge by the component failed count; it must not interfere with the cold-start / Telegram main line.
 
+2026-07-02 12:15 added the controller command path contract: the `dry_run_command` / `post_apply_verifier` produced by `host-sustained-load-controller.py` must use the paths actually deployed on 110: `/home/wooo/scripts/host-sustained-load-evidence.py`, `/home/wooo/scripts/host-sustained-load-controller.py`, and `/home/wooo/scripts/host-runaway-process-remediation.py`. If the controller emits repo-relative paths such as `scripts/ops/...` on the live host, treat it as a break in the alert automation chain: fix the controller / `--script-dir` first, then enter playbook check-mode; a "helper not found" result must not be taken to mean that the CPU root cause has been handled.
+
 2026-06-25 20:25 orphan Chrome cleanup / scorecard refresh supersedes the 20:11 CPU wording. 110 high CPU was traced to two `stockplatform-review-bulk-ux` Chrome process groups `2756503` and `2829627` with root Chrome process `PPID=1`, elapsed about 5h, no active parent smoke, and sustained GPU/renderer CPU. With user approval, only those two process groups received targeted `SIGTERM` at 20:24. Post-check showed no remaining PGID entries; `vmstat` showed CPU idle around `85-90%`, `si/so=0`, and no immediate swap thrash. No Docker/systemd/Nginx/firewall/K8s action, CI cancellation, manual data ingestion, manual DB write, Wazuh/SOC runtime change, or secret read was performed. The 20:25 full post-start wrapper then returned cold-start `PASS=89 WARN=0 BLOCKED=0`, but overall `POST_START_QUICK_CHECK PASS=37 WARN=2 BLOCKED=1`, `RESULT=BLOCKED`, because StockPlatform data freshness was still blocked at that time and DR remained incomplete.
 
 2026-06-25 20:11 StockPlatform cron-source recovery supersedes the 19:35 source-version wording. StockPlatform Gitea `main` and live `/home/wooo/stockplatform-v2` are now at `fb91aa4c6272469d1d26e0820169629eac17d28a fix(ops): restore production cron recovery entrypoints`; six missing production cron entrypoint scripts are restored, `run-intelligence-sync.sh` contains the Docker-backed `psql` shim, and live contract check confirms every `scripts/ops/*.sh` referenced by `install-production-cron.sh` exists. The only live write performed for StockPlatform recovery was a fast-forward `git pull --ff-only origin main` on 110; no Docker/systemd/Nginx/firewall/K8s restart, manual ingestion run, manual DB write, or secret read was performed. Natural cron evidence after the pull is now green for the repaired entrypoints: `source-remediation-queue` 19:56 and 20:00 succeeded, `market-index-ingestion` 20:00 succeeded, `price-ingestion` 20:02 succeeded, `margin-short-ingestion` 20:05 succeeded, `chips-ingestion` 20:06 succeeded, and `ai-recommendation-pipeline` 20:10 ran but correctly produced the internal blocker `core_margin_short_daily_incomplete,official_margin_short_daily_official_pending`. StockPlatform `/api/v1/system/freshness` therefore still returns `status=blocked` because the 2026-06-25 official margin-short source is pending and `ai.recommendations` must stay on 2026-06-24 until that gate clears. This is no longer a route, source-version, or missing-cron-script blocker; it is a product-data freshness blocker waiting on official source availability and the next valid AI pipeline run.
diff --git a/scripts/ops/host-sustained-load-controller.py b/scripts/ops/host-sustained-load-controller.py
index ecc44881..4e4576a8 100755
--- a/scripts/ops/host-sustained-load-controller.py
+++ b/scripts/ops/host-sustained-load-controller.py
@@ -29,6 +29,7 @@ from typing import Any
 DEFAULT_METRICS_FILE = Path("/home/wooo/node_exporter_textfiles/host_runaway_process.prom")
 DEFAULT_DOCKER_STATS_FILE = Path("/home/wooo/node_exporter_textfiles/docker_stats.prom")
 DEFAULT_DOCKER_STATS_MAX_AGE_SECONDS = 300
+DEFAULT_SCRIPT_DIR = Path("/home/wooo/scripts")
 SCHEMA_VERSION = "host_sustained_load_controlled_automation_v1"
 LABEL_RE = re.compile(r"(?P[A-Za-z_][A-Za-z0-9_]*)=\"(?P(?:[^\"\\\\]|\\\\.)*)\"")
 METRIC_RE = re.compile(
@@ -54,6 +55,7 @@ def parse_args() -> argparse.Namespace:
     parser.add_argument("--hot-container-cpu-threshold", type=float, default=1.0)
     parser.add_argument("--process-family-cpu-threshold", type=float, default=50.0)
     parser.add_argument("--ci-stale-age-seconds", type=int, default=1800)
+    parser.add_argument("--script-dir", type=Path, default=DEFAULT_SCRIPT_DIR)
     parser.add_argument("--ps-file", type=Path)
     parser.add_argument("--top-n", type=int, default=8)
     parser.add_argument("--json", action="store_true", help="Print JSON only.")
@@ -382,6 +384,7 @@ def build_packet(
     hot_container_cpu_threshold: float,
     process_family_cpu_threshold: float,
     ci_stale_age_seconds: int,
+    script_dir: Path = DEFAULT_SCRIPT_DIR,
 ) -> dict[str, Any]:
     monitor_up = int(
         _sample_value(
@@ -445,8 +448,11 @@ def build_packet(
     next_action = "keep_read_only_monitoring"
     dry_run_command = ""
     controlled_apply_command = ""
+    controller_script = script_dir / "host-sustained-load-controller.py"
+    evidence_script = script_dir / "host-sustained-load-evidence.py"
+    remediation_script = script_dir / "host-runaway-process-remediation.py"
     verifier_command = (
-        "scripts/ops/host-sustained-load-controller.py "
+        f"{controller_script} "
         f"--host {host} --metrics-file {DEFAULT_METRICS_FILE}"
     )
 
@@ -463,14 +469,14 @@ def build_packet(
         severity = "critical"
         controlled_apply_allowed = True
         rule = top_orphan["rule"]
-        dry_run_command = f"scripts/ops/host-runaway-process-remediation.py --rule {rule}"
+        dry_run_command = f"{remediation_script} --rule {rule}"
         controlled_apply_command = (
-            "scripts/ops/host-runaway-process-remediation.py "
+            f"{remediation_script} "
             f"--rule {rule} --apply --confirm-apply "
             "--controlled-apply-id ${CONTROLLED_APPLY_ID} "
             "--evidence-ref ${EVIDENCE_REF} "
             "--post-apply-verifier "
-            "'scripts/ops/host-sustained-load-controller.py --host "
+            f"'{controller_script} --host "
             f"{host} --metrics-file {DEFAULT_METRICS_FILE}' "
             "--wait-seconds 10"
         )
@@ -510,7 +516,7 @@ def build_packet(
             else "warning"
         )
         dry_run_command = (
-            "scripts/ops/host-sustained-load-evidence.py "
+            f"{evidence_script} "
             f"--host {host} --metrics-file {DEFAULT_METRICS_FILE} "
             f"--docker-stats-file {DEFAULT_DOCKER_STATS_FILE} --json"
         )
@@ -526,7 +532,7 @@ def build_packet(
             else "warning"
         )
         dry_run_command = (
-            "scripts/ops/host-sustained-load-evidence.py "
+            f"{evidence_script} "
             f"--host {host} --metrics-file {DEFAULT_METRICS_FILE} "
             f"--docker-stats-file {DEFAULT_DOCKER_STATS_FILE} --json"
         )
@@ -535,7 +541,7 @@ def build_packet(
         classification = "blocked_control_plane_saturation_requires_playbook"
         severity = "critical" if load5_per_core > load5_per_core_threshold else "warning"
         dry_run_command = (
-            "scripts/ops/host-sustained-load-evidence.py "
+            f"{evidence_script} "
             f"--host {host} --metrics-file {DEFAULT_METRICS_FILE} "
             f"--docker-stats-file {DEFAULT_DOCKER_STATS_FILE} --json"
         )
@@ -550,7 +556,7 @@ def build_packet(
         classification = "blocked_stockplatform_hot_query_or_api_pressure_requires_playbook"
         severity = "critical" if load5_per_core > load5_per_core_threshold else "warning"
         dry_run_command = (
-            "scripts/ops/host-sustained-load-evidence.py "
+            f"{evidence_script} "
             f"--host {host} --metrics-file {DEFAULT_METRICS_FILE} "
             f"--docker-stats-file {DEFAULT_DOCKER_STATS_FILE} --json"
         )
@@ -561,7 +567,7 @@ def build_packet(
         classification = "blocked_gitea_queue_or_hook_backlog_requires_playbook"
         severity = "critical" if load5_per_core > load5_per_core_threshold else "warning"
         dry_run_command = (
-            "scripts/ops/host-sustained-load-evidence.py "
+            f"{evidence_script} "
             f"--host {host} --metrics-file {DEFAULT_METRICS_FILE} "
             f"--docker-stats-file {DEFAULT_DOCKER_STATS_FILE} --json"
         )
@@ -574,10 +580,9 @@ def build_packet(
         classification = "blocked_unknown_sustained_load_requires_source_specific_playbook"
         severity = "critical"
         dry_run_command = (
-            "scripts/ops/host-sustained-load-evidence.py "
+            f"{evidence_script} "
             f"--host {host} --metrics-file {DEFAULT_METRICS_FILE} "
-            "--docker-stats-file /home/wooo/node_exporter_textfiles/docker_stats.prom "
-            "--json"
+            f"--docker-stats-file {DEFAULT_DOCKER_STATS_FILE} --json"
         )
         next_action = "collect_sanitized_top_process_and_container_stats_then_select_playbook"
 
@@ -596,6 +601,7 @@ def build_packet(
             "container_cpu_threshold": container_cpu_threshold,
             "hot_container_cpu_threshold": hot_container_cpu_threshold,
             "process_family_cpu_threshold": process_family_cpu_threshold,
+            "script_dir": str(script_dir),
             "swap_used_ratio": round(swap_used_ratio, 6),
             "remediation_authorized": remediation_authorized,
             "active_ci_container_count": active_ci_containers,
@@ -668,6 +674,7 @@ def main() -> int:
         hot_container_cpu_threshold=args.hot_container_cpu_threshold,
         process_family_cpu_threshold=args.process_family_cpu_threshold,
         ci_stale_age_seconds=args.ci_stale_age_seconds,
+        script_dir=args.script_dir,
     )
     if args.json:
         print(json.dumps(packet, ensure_ascii=False, indent=2, sort_keys=True))
diff --git a/scripts/ops/tests/test_host_runaway_process_exporter.py b/scripts/ops/tests/test_host_runaway_process_exporter.py
index 2558c2fc..5fea41ec 100644
--- a/scripts/ops/tests/test_host_runaway_process_exporter.py
+++ b/scripts/ops/tests/test_host_runaway_process_exporter.py
@@ -334,8 +334,13 @@ def test_sustained_load_controller_routes_orphan_browser_to_controlled_remediati
     payload = json.loads(result.stdout)
     assert payload["classification"] == "controlled_orphan_browser_remediation_ready"
     assert payload["controlled_apply_allowed"] is True
-    assert "host-runaway-process-remediation.py --rule stockplatform_headless_smoke" in payload["commands"]["dry_run"]
+    assert (
+        "/home/wooo/scripts/host-runaway-process-remediation.py "
+        "--rule stockplatform_headless_smoke"
+    ) in payload["commands"]["dry_run"]
     assert "--controlled-apply-id" in payload["commands"]["controlled_apply"]
+    assert "scripts/ops/" not in payload["commands"]["dry_run"]
+    assert "scripts/ops/" not in payload["commands"]["post_apply_verifier"]
     assert payload["operation_boundaries"]["process_signal_performed"] is False
 
 
@@ -460,7 +465,8 @@ def test_sustained_load_controller_routes_gitea_backlog_from_docker_metrics(tmp_
     assert payload["classification"] == "blocked_gitea_queue_or_hook_backlog_requires_playbook"
     assert payload["readback"]["top_container_cpu"]["container_name"] == "gitea"
     assert payload["controlled_apply_allowed"] is False
-    assert "host-sustained-load-evidence.py" in payload["commands"]["dry_run"]
+    assert "/home/wooo/scripts/host-sustained-load-evidence.py" in payload["commands"]["dry_run"]
+    assert "scripts/ops/" not in payload["commands"]["dry_run"]
 
 
 def test_sustained_load_controller_routes_gitea_quota_pressure_even_when_load_is_moderate(
@@ -516,7 +522,8 @@ def test_sustained_load_controller_routes_gitea_quota_pressure_even_when_load_is
     assert payload["severity"] == "warning"
     assert payload["readback"]["container_cpu_threshold"] == 2.0
     assert payload["readback"]["top_container_cpu"]["cpu_cores"] == 2.08
-    assert "host-sustained-load-evidence.py" in payload["commands"]["dry_run"]
+    assert "/home/wooo/scripts/host-sustained-load-evidence.py" in payload["commands"]["dry_run"]
+    assert "scripts/ops/" not in payload["commands"]["dry_run"]
 
 
 def test_sustained_load_controller_ignores_stale_docker_stats_attribution(tmp_path: Path) -> None:
@@ -608,7 +615,8 @@ def test_sustained_load_controller_routes_unknown_load_to_sanitized_evidence(tmp
     payload = json.loads(result.stdout)
     assert payload["classification"] == "blocked_unknown_sustained_load_requires_source_specific_playbook"
     assert payload["controlled_apply_allowed"] is False
-    assert "host-sustained-load-evidence.py" in payload["commands"]["dry_run"]
+    assert "/home/wooo/scripts/host-sustained-load-evidence.py" in payload["commands"]["dry_run"]
+    assert "scripts/ops/" not in payload["commands"]["dry_run"]
     assert payload["operation_boundaries"]["process_signal_performed"] is False
 
 
@@ -683,7 +691,8 @@ def test_sustained_load_controller_routes_moderate_stock_container_pressure(tmp_
     assert payload["readback"]["top_container_cpu"]["container_name"] == "stockplatform-v2-postgres-1"
     assert payload["readback"]["top_process_family"]["family"] == "gitea_service"
     assert payload["controlled_apply_allowed"] is False
-    assert "host-sustained-load-evidence.py" in payload["commands"]["dry_run"]
+    assert "/home/wooo/scripts/host-sustained-load-evidence.py" in payload["commands"]["dry_run"]
+    assert "scripts/ops/" not in payload["commands"]["dry_run"]
     assert "/home/wooo/gitea/app.ini" not in result.stdout
 
 
@@ -785,6 +794,10 @@ def test_sustained_load_controller_routes_control_plane_family_pressure(tmp_path
     assert payload["classification"] == "blocked_control_plane_saturation_requires_playbook"
     assert payload["readback"]["control_plane_process_cpu_percent"] == 55.0
     assert payload["next_action"] == "run_control_plane_saturation_playbook_check_mode"
+    assert "/home/wooo/scripts/host-sustained-load-evidence.py" in payload["commands"]["dry_run"]
+    assert "/home/wooo/scripts/host-sustained-load-controller.py" in payload["commands"]["post_apply_verifier"]
+    assert "scripts/ops/" not in payload["commands"]["dry_run"]
+    assert "scripts/ops/" not in payload["commands"]["post_apply_verifier"]
 
 
 def test_sustained_load_evidence_emits_sanitized_gitea_recommendation(tmp_path: Path) -> None:
@@ -913,7 +926,8 @@ def test_sustained_load_controller_routes_unknown_load_to_sanitized_evidence(tmp
         == "blocked_unknown_sustained_load_requires_source_specific_playbook"
     )
     assert payload["controlled_apply_allowed"] is False
-    assert "host-sustained-load-evidence.py" in payload["commands"]["dry_run"]
+    assert "/home/wooo/scripts/host-sustained-load-evidence.py" in payload["commands"]["dry_run"]
+    assert "scripts/ops/" not in payload["commands"]["dry_run"]
     assert payload["operation_boundaries"]["host_write_performed"] is False
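
Reviewer note: the command path contract in this change can be illustrated in isolation with a minimal sketch. It is not the controller's actual `build_packet`; `build_helper_commands` and `find_repo_relative_commands` are hypothetical names introduced here for illustration only. The default directory and helper file names are taken from the diff, and the substring check is the same `"scripts/ops/" not in ...` condition the tests assert.

```python
from pathlib import Path

# Default taken from the diff (DEFAULT_SCRIPT_DIR); everything else below is illustrative.
DEFAULT_SCRIPT_DIR = Path("/home/wooo/scripts")


def build_helper_commands(
    host: str,
    metrics_file: Path,
    script_dir: Path = DEFAULT_SCRIPT_DIR,
) -> dict[str, str]:
    """Hypothetical helper: derive helper commands from one script directory.

    Joining every helper name onto ``script_dir`` means a single flag
    (``--script-dir`` in the controller) relocates all emitted commands.
    """
    evidence_script = script_dir / "host-sustained-load-evidence.py"
    controller_script = script_dir / "host-sustained-load-controller.py"
    return {
        "dry_run": (
            f"{evidence_script} --host {host} --metrics-file {metrics_file} --json"
        ),
        "post_apply_verifier": (
            f"{controller_script} --host {host} --metrics-file {metrics_file}"
        ),
    }


def find_repo_relative_commands(commands: dict[str, str]) -> list[str]:
    """Return the keys whose command still contains the repo-relative fragment."""
    return sorted(key for key, value in commands.items() if "scripts/ops/" in value)


if __name__ == "__main__":
    packet_commands = build_helper_commands("110", Path("/tmp/example.prom"))
    for name, command in packet_commands.items():
        print(f"{name}: {command}")
    print("repo-relative:", find_repo_relative_commands(packet_commands))
```

Passing `script_dir=Path("scripts/ops")` to the same helper reproduces the pre-change behaviour, which is the case the runbook now classifies as a break in the alert automation chain.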