diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index fbee3f73..c5eec4fe 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -52441,21 +52441,23 @@ production browser smoke:
 - `scripts/ops/host-sustained-load-controller.py` adds live `ps` process-family classification and routes moderate pressure to the `control_plane_saturation`, `gitea_queue_or_hook_backlog`, and `stockplatform_hot_query_or_api_pressure` check-mode packets; the controller still only produces a controlled packet and performs no Docker / systemd / DB / Nginx restart.
 - `ops/monitoring/alerts-unified.yml` extends `Host110SustainedModeratePressure` from checking only `load5/core` / container CPU to also checking `awoooi_host_process_family_cpu_percent{family=~"systemd_control_plane|ssh_control_plane|gitea_service|postgres"} > 50`; the Gitea / StockPlatform container early-triage threshold is lowered from `> 2.0 core` to `> 1.0 core`.
 - `scripts/ops/host-sustained-load-controller.py` adds `--script-dir` (default `/home/wooo/scripts`) so that the `dry_run_command` / `post_apply_verifier` emitted for live alerts point at the helpers actually deployed on 110, and no longer emits the repo-relative path `scripts/ops/...`, which resolves to nothing on that host.
+- `scripts/ops/host-sustained-load-controller.py` / `host-sustained-load-evidence.py` fix the routing priority in step: when fresh Docker stats show `gitea` or a key StockPlatform container above `1.0 core`, route to the service playbook first, so the average CPU of the long-lived `systemd_control_plane` no longer preempts it and diverts to the control-plane playbook.
 - Deployed live to 110:
   - `/home/wooo/scripts/host-runaway-process-exporter.py` SHA `d85d27c81ea76a8f2f370ee85c92381bec4440eea4fd37efb2efb9f43dbd1a8a`
-  - `/home/wooo/scripts/host-sustained-load-controller.py` SHA `7a11407c7df05427085982d6f6d11d1756f908591573a28c9b3267de32b94f3e`
+  - `/home/wooo/scripts/host-sustained-load-controller.py` SHA `1bf9c183fe8d89c30008e08db0903c24c609df31eeebe62f9d59b3a26a3bd1c0`
+  - `/home/wooo/scripts/host-sustained-load-evidence.py` SHA `2fd8e7d43a0249f97b35a865cfd4ce2aa45162729faebfc837cde8cf48beec38`
   - `bash scripts/ops/deploy-alerts.sh` completed; Prometheus has loaded `159` rules.

 **Live readback evidence**:
 - The 110 textfile / Prometheus now exports the process-family metric: `systemd_control_plane=72.4`, `gitea_service=53.1`, `postgres=11.1`.
 - Prometheus rule readback: `Host110SustainedModeratePressure state=firing health=ok`; the query now includes `docker_container_cpu_cores > 1` and `awoooi_host_process_family_cpu_percent > 50`.
 - Alertmanager `/api/v2/alerts` has active alerts: `family=gitea_service`, `family=systemd_control_plane`, with `status.state=active` and `auto_repair=true`.
-- live controller readback: `classification=blocked_control_plane_saturation_requires_playbook`, `next_action=run_control_plane_saturation_playbook_check_mode`, `dry_run_command=/home/wooo/scripts/host-sustained-load-evidence.py ...`, `post_apply_verifier=/home/wooo/scripts/host-sustained-load-controller.py ...`, `controller_exit=75`.
-- live sanitized evidence readback: `recommendation=control_plane_saturation_playbook`, `controlled_apply_allowed=false`; `top_process_families` includes `systemd_control_plane=72.4`, `gitea_service=53.1`, `unknown=54.3`; `top_containers` is led by `gitea=1.5951 core`; the evidence explicitly states that it emits no raw command line / URL / secret.
+- live controller readback: `classification=blocked_gitea_queue_or_hook_backlog_requires_playbook`, `next_action=run_gitea_queue_or_hook_backlog_playbook_check_mode`, `dry_run_command=/home/wooo/scripts/host-sustained-load-evidence.py ...`, `post_apply_verifier=/home/wooo/scripts/host-sustained-load-controller.py ...`, `controller_exit=75`.
+- live sanitized evidence readback: `recommendation=gitea_queue_or_hook_backlog_playbook`, `controlled_apply_allowed=false`; `top_process_families` includes `systemd_control_plane=72.5`, `gitea_service=53.1`, `unknown=53.7`; `top_containers` is led by `gitea=1.3055 core`; the evidence explicitly states that it emits no raw command line / URL / secret.

 **Local verification results**:
 - `python3.11 -m py_compile scripts/ops/host-runaway-process-exporter.py scripts/ops/host-sustained-load-controller.py scripts/ops/host-sustained-load-evidence.py`: passed.
-- `python3.11 -m pytest scripts/ops/tests/test_host_runaway_process_exporter.py scripts/ops/tests/test_host_pressure_alert_contract.py -q`: `27 passed`.
+- `python3.11 -m pytest scripts/ops/tests/test_host_runaway_process_exporter.py scripts/ops/tests/test_host_pressure_alert_contract.py -q`: `29 passed`.
 - `python3.11 -m pytest ops/runner/test_cd_controlled_runtime_profile.py scripts/ops/tests/test_host_runaway_process_exporter.py scripts/ops/tests/test_host_pressure_alert_contract.py -q`: `70 passed`.
 - `python3.11 -c "import yaml; yaml.safe_load(open('ops/monitoring/alerts-unified.yml'))"`: passed.
 - `python3.11 ops/runner/guard-gitea-runner-pressure.py --root .`: passed, `auto_branch_events_on_110=0`, `generic_runner_labels=0`.
diff --git a/docs/runbooks/FULL-STACK-COLD-START-SOP.md b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
index 2482ec0c..bdfdb7a0 100644
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -94,6 +94,8 @@ v1.82 bounded summary rule: `post-start-quick-check.sh` and `188-host-hygiene-m

 2026-07-02 12:15 adds the controller command path contract: the `dry_run_command` / `post_apply_verifier` produced by `host-sustained-load-controller.py` must use the paths actually deployed on 110: `/home/wooo/scripts/host-sustained-load-evidence.py`, `/home/wooo/scripts/host-sustained-load-controller.py`, `/home/wooo/scripts/host-runaway-process-remediation.py`. If the controller emits a repo-relative path such as `scripts/ops/...` on the live host, treat it as a broken link in the alert automation: fix the controller / `--script-dir` first, then enter playbook check-mode. A "helper not found" result must not be treated as the CPU root cause having been handled.

+2026-07-02 12:35 adds the 110 CPU routing priority: if Docker stats are fresh and `gitea` or a key StockPlatform container has exceeded the early-triage threshold of `1.0 core`, the controller / evidence must route to the corresponding service playbook first and must not let the `systemd_control_plane` average from long-lived `ps %CPU` preempt it and divert to the control-plane playbook. Control-plane saturation remains the fallback path for cases with no known hot container / hot service family.
+
 2026-06-25 20:25 orphan Chrome cleanup / scorecard refresh supersedes the 20:11 CPU wording. 110 high CPU was traced to two `stockplatform-review-bulk-ux` Chrome process groups `2756503` and `2829627` with root Chrome process `PPID=1`, elapsed about 5h, no active parent smoke, and sustained GPU/renderer CPU.
With user approval, only those two process groups received targeted `SIGTERM` at 20:24. Post-check showed no remaining PGID entries; `vmstat` showed CPU idle around `85-90%`, `si/so=0`, and no immediate swap thrash. No Docker/systemd/Nginx/firewall/K8s action, CI cancellation, manual data ingestion, manual DB write, Wazuh/SOC runtime change, or secret read was performed. The 20:25 full post-start wrapper then returned cold-start `PASS=89 WARN=0 BLOCKED=0`, but overall `POST_START_QUICK_CHECK PASS=37 WARN=2 BLOCKED=1`, `RESULT=BLOCKED`, because StockPlatform data freshness was still blocked at that time and DR remained incomplete.

 2026-06-25 20:11 StockPlatform cron-source recovery supersedes the 19:35 source-version wording. StockPlatform Gitea `main` and live `/home/wooo/stockplatform-v2` are now at `fb91aa4c6272469d1d26e0820169629eac17d28a fix(ops): restore production cron recovery entrypoints`; six missing production cron entrypoint scripts are restored, `run-intelligence-sync.sh` contains the Docker-backed `psql` shim, and live contract check confirms every `scripts/ops/*.sh` referenced by `install-production-cron.sh` exists. The only live write performed for StockPlatform recovery was a fast-forward `git pull --ff-only origin main` on 110; no Docker/systemd/Nginx/firewall/K8s restart, manual ingestion run, manual DB write, or secret read was performed. Natural cron evidence after the pull is now green for the repaired entrypoints: `source-remediation-queue` 19:56 and 20:00 succeeded, `market-index-ingestion` 20:00 succeeded, `price-ingestion` 20:02 succeeded, `margin-short-ingestion` 20:05 succeeded, `chips-ingestion` 20:06 succeeded, and `ai-recommendation-pipeline` 20:10 ran but correctly produced the internal blocker `core_margin_short_daily_incomplete,official_margin_short_daily_official_pending`.
StockPlatform `/api/v1/system/freshness` therefore still returns `status=blocked` because the 2026-06-25 official margin-short source is pending and `ai.recommendations` must stay on 2026-06-24 until that gate clears. This is no longer a route, source-version, or missing-cron-script blocker; it is a product-data freshness blocker waiting on official source availability and the next valid AI pipeline run.
diff --git a/scripts/ops/host-sustained-load-controller.py b/scripts/ops/host-sustained-load-controller.py
index 4e4576a8..168d7ad5 100755
--- a/scripts/ops/host-sustained-load-controller.py
+++ b/scripts/ops/host-sustained-load-controller.py
@@ -537,15 +537,6 @@ def build_packet(
             f"--docker-stats-file {DEFAULT_DOCKER_STATS_FILE} --json"
         )
         next_action = "run_gitea_queue_or_hook_backlog_playbook_check_mode"
-    elif control_plane_cpu >= process_family_cpu_threshold:
-        classification = "blocked_control_plane_saturation_requires_playbook"
-        severity = "critical" if load5_per_core > load5_per_core_threshold else "warning"
-        dry_run_command = (
-            f"{evidence_script} "
-            f"--host {host} --metrics-file {DEFAULT_METRICS_FILE} "
-            f"--docker-stats-file {DEFAULT_DOCKER_STATS_FILE} --json"
-        )
-        next_action = "run_control_plane_saturation_playbook_check_mode"
     elif (
         "stockplatform-v2-postgres-1" in top_container_name
         and top_container_cpu >= hot_container_cpu_threshold
@@ -572,6 +563,15 @@ def build_packet(
             f"--docker-stats-file {DEFAULT_DOCKER_STATS_FILE} --json"
         )
         next_action = "run_gitea_queue_or_hook_backlog_playbook_check_mode"
+    elif control_plane_cpu >= process_family_cpu_threshold:
+        classification = "blocked_control_plane_saturation_requires_playbook"
+        severity = "critical" if load5_per_core > load5_per_core_threshold else "warning"
+        dry_run_command = (
+            f"{evidence_script} "
+            f"--host {host} --metrics-file {DEFAULT_METRICS_FILE} "
+            f"--docker-stats-file {DEFAULT_DOCKER_STATS_FILE} --json"
+        )
+        next_action = "run_control_plane_saturation_playbook_check_mode"
     elif load5_per_core > load5_per_core_threshold and swap_used_ratio >= 0.85:
         classification = "blocked_memory_or_swap_pressure_requires_service_playbook"
         severity = "critical"
diff --git a/scripts/ops/host-sustained-load-evidence.py b/scripts/ops/host-sustained-load-evidence.py
index 42cb8dd1..b5d4e612 100755
--- a/scripts/ops/host-sustained-load-evidence.py
+++ b/scripts/ops/host-sustained-load-evidence.py
@@ -281,11 +281,23 @@ def recommend_playbook(process_families: list[dict[str, Any]], containers: list[
     top_container_cpu = float(top_container.get("cpu_cores") or 0.0)
     top_family = process_families[0] if process_families else {}
     family = str(top_family.get("family") or "")
+    family_cpu = {
+        str(item.get("family") or ""): float(item.get("cpu_percent") or 0.0)
+        for item in process_families
+    }

-    if "gitea" in top_container_name and top_container_cpu >= 2.0:
+    if "gitea" in top_container_name and top_container_cpu >= 1.0:
         return "gitea_queue_or_hook_backlog_playbook"
-    if "postgres" in top_container_name or "postgres" in family:
+    if (
+        (
+            "postgres" in top_container_name
+            or "stockplatform-v2-postgres-1" in top_container_name
+        )
+        and top_container_cpu >= 1.0
+    ) or family_cpu.get("postgres", 0.0) >= 50.0:
         return "postgres_hot_query_or_backup_export_playbook"
+    if family_cpu.get("gitea_service", 0.0) >= 50.0:
+        return "gitea_queue_or_hook_backlog_playbook"
     if family in {"docker_build", "web_build", "gitea_actions_runner"}:
         return "build_or_runner_pressure_playbook"
     if family in {"systemd_control_plane", "ssh_control_plane"}:
diff --git a/scripts/ops/tests/test_host_runaway_process_exporter.py b/scripts/ops/tests/test_host_runaway_process_exporter.py
index 5fea41ec..cff3cbe7 100644
--- a/scripts/ops/tests/test_host_runaway_process_exporter.py
+++ b/scripts/ops/tests/test_host_runaway_process_exporter.py
@@ -526,6 +526,80 @@ def test_sustained_load_controller_routes_gitea_quota_pressure_even_when_load_is
     assert "scripts/ops/" not in payload["commands"]["dry_run"]
+def test_sustained_load_controller_prioritizes_hot_gitea_container_over_control_plane_average(
+    tmp_path: Path,
+) -> None:
+    metrics_file = tmp_path / "host.prom"
+    metrics_file.write_text(
+        "\n".join(
+            [
+                'awoooi_host_runaway_process_monitor_up{host="110",mode="read_only"} 1',
+                'awoooi_host_load5_per_core{host="110"} 0.70',
+                'awoooi_host_swap_used_ratio{host="110"} 0.1',
+                'awoooi_host_runaway_process_remediation_authorized{host="110"} 0',
+                'awoooi_host_gitea_actions_active_container_count{host="110"} 0',
+                'awoooi_host_gitea_actions_active_process_group_count{host="110"} 0',
+                'awoooi_host_runaway_browser_orphan_group_count{host="110",rule="stockplatform_headless_smoke",min_age_seconds="1800",min_cpu_percent="50"} 0',
+            ]
+        ),
+        encoding="utf-8",
+    )
+    docker_file = tmp_path / "docker.prom"
+    docker_file.write_text(
+        "\n".join(
+            [
+                'docker_container_cpu_cores{host="110",container_name="gitea"} 1.59',
+                'docker_container_cpu_cores{host="110",container_name="redis"} 0.2',
+            ]
+        ),
+        encoding="utf-8",
+    )
+    ps_file = tmp_path / "ps.txt"
+    ps_file.write_text(
+        "\n".join(
+            [
+                "100 1 100 75507 61.8 0.0 systemd /sbin/init",
+                "101 1 101 75469 6.7 0.0 dbus-daemon @dbus-daemon --system",
+                "200 1 200 75348 53.1 1.3 gitea /usr/local/bin/gitea web --config /home/wooo/gitea/app.ini",
+            ]
+        ),
+        encoding="utf-8",
+    )
+
+    result = subprocess.run(
+        [
+            sys.executable,
+            str(CONTROLLER_PATH),
+            "--host",
+            "110",
+            "--load5-per-core-threshold",
+            "0.75",
+            "--hot-container-cpu-threshold",
+            "1.0",
+            "--container-cpu-threshold",
+            "2.0",
+            "--metrics-file",
+            str(metrics_file),
+            "--docker-stats-file",
+            str(docker_file),
+            "--ps-file",
+            str(ps_file),
+            "--json",
+        ],
+        capture_output=True,
+        text=True,
+    )
+
+    assert result.returncode == 75
+    payload = json.loads(result.stdout)
+    assert payload["classification"] == "blocked_gitea_queue_or_hook_backlog_requires_playbook"
+    assert payload["next_action"] == "run_gitea_queue_or_hook_backlog_playbook_check_mode"
+    assert payload["readback"]["control_plane_process_cpu_percent"] == 68.5
+    assert payload["readback"]["top_container_cpu"]["container_name"] == "gitea"
+    assert "/home/wooo/scripts/host-sustained-load-evidence.py" in payload["commands"]["dry_run"]
+    assert "/home/wooo/gitea/app.ini" not in result.stdout
+
+
 def test_sustained_load_controller_ignores_stale_docker_stats_attribution(tmp_path: Path) -> None:
     metrics_file = tmp_path / "host.prom"
     metrics_file.write_text(
@@ -842,6 +916,50 @@ def test_sustained_load_evidence_emits_sanitized_gitea_recommendation(tmp_path:
     assert "/home/wooo" not in result.stdout


+def test_sustained_load_evidence_prioritizes_hot_gitea_container_over_control_plane_average(
+    tmp_path: Path,
+) -> None:
+    ps_file = tmp_path / "ps.txt"
+    ps_file.write_text(
+        "\n".join(
+            [
+                "100 1 100 75507 61.8 0.0 systemd /sbin/init",
+                "101 1 101 75469 6.7 0.0 dbus-daemon @dbus-daemon --system",
+                "200 1 200 75348 53.1 1.3 gitea /usr/local/bin/gitea web --config /home/wooo/gitea/app.ini",
+            ]
+        ),
+        encoding="utf-8",
+    )
+    docker_file = tmp_path / "docker.prom"
+    docker_file.write_text(
+        'docker_container_cpu_cores{host="110",container_name="gitea"} 1.4591\n',
+        encoding="utf-8",
+    )
+
+    result = subprocess.run(
+        [
+            sys.executable,
+            str(EVIDENCE_PATH),
+            "--host",
+            "110",
+            "--ps-file",
+            str(ps_file),
+            "--docker-stats-file",
+            str(docker_file),
+            "--json",
+        ],
+        check=True,
+        capture_output=True,
+        text=True,
+    )
+
+    payload = json.loads(result.stdout)
+    assert payload["recommendation"] == "gitea_queue_or_hook_backlog_playbook"
+    assert payload["top_process_families"][0]["family"] == "systemd_control_plane"
+    assert payload["top_containers"][0]["container_name"] == "gitea"
+    assert "/home/wooo/gitea/app.ini" not in result.stdout
+
+
 def test_sustained_load_evidence_keeps_stale_container_samples_untrusted(tmp_path: Path) -> None:
     metrics_file = tmp_path / "host.prom"
     metrics_file.write_text(