fix(ops): route container pressure alerts to controller
Some checks failed
CD Pipeline / workflow-shape (push) Successful in 1s
CD Pipeline / cancel-stale-cd (push) Has been skipped
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / tests (push) Has been cancelled
@@ -51922,6 +51922,7 @@ production browser smoke:
- Host 110 live pressure re-read: `load5` had climbed back to about `8.91` with `awoooi_host_load5_per_core=0.8075`, and live `docker stats` for Gitea peaked at `218.56%`. However, the existing `HostLoadAverageSustainedHigh` threshold is `load5/core > 1.5 for 15m`, and `DockerContainerCpuSustainedHigh` was still only pending at `>2 core for 10m`. The earlier absence of CPU firing / Telegram notifications was therefore not a monitoring gap: the thresholds trip too late and the auto-repair action pointed at a path that had not been deployed.
- Deployed `/home/wooo/scripts/host-sustained-load-controller.py`, `host-sustained-load-evidence.py`, and `host-runaway-process-remediation.py` to 110, with backup suffix `before-host-pressure-controller-20260701-232314`. The controller's live readback is executable and does not read secrets, raw sessions, or runner registration.
- `ops/monitoring/alerts-unified.yml` adds `Host110SustainedModeratePressure`: a warning is raised when either `load5/core > 0.75` or a key Gitea / StockPlatform container CPU of `>2.0 core` is sustained for 1 minute, and the auto-repair action points at the controller path actually deployed on 110.
- `host-sustained-load-controller.py` gains `--container-cpu-threshold`: when a key Gitea / StockPlatform container is pinned against its CPU quota above the threshold, the controller emits a source-specific playbook packet even if `load5/core` has not yet reached critical. Below the threshold it only reports observing; it does not kill or restart anything.
- Tightened the Gitea container runtime CPU quota from `3` cores to `2` cores: `docker update --cpus=2 gitea`; rollback is `docker update --cpus=3 gitea`. Post-check: `nanocpus=2000000000`, memory still `3GiB`, Gitea API `/api/v1/version` returns `1.25.5`, no container restart.
- Fixed backup noise: `BackupAggregateRunFailed` no longer fires on the stale aggregate failed_count from `backup_all`; it now looks only at the component job failed count. Live `backup-status.sh --no-notify` now returns `每日備份心跳正常` (daily backup heartbeat normal), `component_failed=0`, `core_blockers=0`, `escrow_missing=0`.
- Alertmanager / webhook readback: Alertmanager still has 5 active non-CPU warnings. The default route is `awoooi-webhook`; `telegram-direct` is reserved for failures of the alert chain itself. Synthetic no-secret smoke tests from 110 to VIP / 120 / 121 `/api/v1/webhooks/alertmanager` all returned HTTP 200 with `告警已排入背景分析` (alert queued for background analysis); `/api/v1/telegram/health` returns `configured`.
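The container-CPU gate described above can be illustrated with a minimal sketch. This is not the code in `host-sustained-load-controller.py`; the function names and the `needs_playbook:` label are hypothetical, the sample CPU values in the usage lines are illustrative, and only the `2.0` core threshold and the `observing_load_within_threshold` classification come from these notes. The host-load branch is deliberately left out.

```python
from typing import Dict, List

# Threshold value taken from the notes above; all names here are hypothetical.
CONTAINER_CPU_THRESHOLD_CORES = 2.0


def containers_over_threshold(
    cpu_cores: Dict[str, float],
    threshold: float = CONTAINER_CPU_THRESHOLD_CORES,
) -> List[str]:
    """Return the sorted names of containers whose CPU usage exceeds the threshold."""
    return sorted(name for name, cores in cpu_cores.items() if cores > threshold)


def classify(
    cpu_cores: Dict[str, float],
    threshold: float = CONTAINER_CPU_THRESHOLD_CORES,
) -> str:
    """Observe only while every container is at or below the threshold.

    The "needs_playbook:" prefix is an illustrative label, not the
    controller's real classification string.
    """
    hot = containers_over_threshold(cpu_cores, threshold)
    if not hot:
        return "observing_load_within_threshold"
    return "needs_playbook:" + ",".join(hot)


if __name__ == "__main__":
    print(classify({"gitea": 1.5}))     # below the threshold: observe only
    print(classify({"gitea": 2.1856}))  # above the threshold: hand off to a playbook
```

The comparison is strict, so a container sitting exactly at the threshold is still treated as observing; whether the real controller uses `>` or `>=` is not stated in these notes.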
@@ -51929,6 +51930,7 @@ production browser smoke:
**Live readback**:
- Prometheus rule readback: `Host110SustainedModeratePressure=inactive`, `DockerContainerCpuSustainedHigh=inactive`, `BackupAggregateRunFailed=inactive`.
- Node exporter readback: `awoooi_host_load5_per_core{host="110"} 0.536667`, `node_load5 6.52`, `docker_container_cpu_cores{container_name="gitea",host="110"} 1.4917`, `docker_container_cpu_limit_cores{container_name="gitea",host="110"} 2`.
- Second-round controller readback: `load5_per_core=0.473333`, Gitea at `1.7221` cores, `container_cpu_threshold=2.0`, classification `observing_load_within_threshold`. The brief Prometheus pending state came from the previous round's Gitea >2 core sample; the live controller did not act in error.
- Alertmanager active alerts after fix: `DockerContainerMissingResourceLimit` on 188, `HostDiskUsageHigh` on 110/188, `HostOutOfDiskSpace` on 110/188; CPU / backup aggregate alerts are no longer firing.
- Full-stack cold-start after fix: `PASS=96 WARN=0 BLOCKED=0`, Result `GREEN`; 110 registry / Gitea / Harbor / Prometheus / Alertmanager OK, runner fail-closed OK, 110 docker/systemd/storage/backup textfiles fresh, public routes return the expected 2xx/3xx, and the backup aggregate failed_count is listed as INFO only and no longer forms a blocker.
@@ -92,6 +92,11 @@
  "change": "add Host110SustainedModeratePressure and point sustained-load auto_repair_action to the deployed /home/wooo/scripts controller path",
  "evidence": "Prometheus rule readback shows Host110SustainedModeratePressure loaded; after quota apply it is inactive because load5/core and Gitea CPU are below threshold"
},
{
  "path": "scripts/ops/host-sustained-load-controller.py",
  "change": "route Gitea / StockPlatform container CPU quota pressure to source-specific playbook packets even when host load is below the critical load5/core threshold",
  "evidence": "live controller readback after Gitea dropped below threshold returns observing; test fixture with Gitea 2.08 cores returns blocked_gitea_queue_or_hook_backlog_requires_playbook"
},
{
  "path": "scripts/ops/backup-alert-label-contract-check.py",
  "change": "make BackupAggregateRunFailed ignore aggregate-only backup_all noise and require component job failed-count evidence",
@@ -140,6 +145,8 @@
"gitea_container_cpu_cores_before_quota": "2.1856",
"gitea_container_cpu_cores_after_quota_textfile": "1.4917",
"gitea_container_cpu_limit_cores": "2",
"controller_container_cpu_threshold": "2.0",
"controller_latest_classification": "observing_load_within_threshold",
"ssh_control_path": "available",
"alert_rules": "Host110SustainedModeratePressure loaded; DockerContainerCpuSustainedHigh inactive; BackupAggregateRunFailed inactive",
"alert_chain": "110 to VIP/120/121 /api/v1/webhooks/alertmanager synthetic no-secret smoke returned HTTP 200",