docs(logbook): 2026-04-19 20:00 full record of this session's 22 commits

Recorded:
- Commander decisions: deprecate Rule 1 + keep Rule 2 + fix the noise algorithm
- Hermes LLM upgrade (OpenClaw analyzes the real causes of false alarms)
- coverage_evaluator extended by 4 dimensions (all 7 dimensions implemented)
- deploy-alerts workflow deploys HostDiskUsageHigh/Critical to Prometheus
- All 5 bugs found in review fixed

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -6,6 +6,94 @@
---
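## 📍 2026-04-19 evening 20:00: Hermes LLM upgrade + Rule 1 deprecation + all 7 coverage dimensions completed 🎖️🎖️🎖️

### Triggered by Commander feedback
- "I don't understand! You haven't given me complete information, so I can't decide!" → provided the full YAML + incidents trace for the 2 rules
- "Is there really no real traffic? Or did you just not actually look and see that there is real traffic?!" → checked against real evidence
- "Keep pushing forward + keep reviewing the original approach + move toward AI autonomy" → executed
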
### Commander decisions
1. **PostgreSQLDiskGrowthRate**: chose **C, Deprecate** (growth of 500MB/h is normal PG WAL behavior)
2. **NoAlertsReceived2Hours**: **keep** (it guards the real alerting pipeline)
3. **noise_rate algorithm fix** (NO_ACTION does not count as a false positive; adjust further after observation)

### Implemented this round (commits ba18ad2 → c1f23cf)

**1. rule_stats_updater v2**: excludes EXPIRED approvals whose action is NO_ACTION/OBSERVE/INVESTIGATE (they no longer count as false positives)

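The v2 exclusion can be sketched roughly as follows. This is an illustration, not the actual rule_stats_updater code. The `Approval` record, its field names, the `RESTART_POD` action, and the definition of noise rate as false positives divided by all approvals are assumptions made for the example. Only the exclusion of EXPIRED approvals with a NO_ACTION, OBSERVE, or INVESTIGATE action comes from the entry above.

```python
from dataclasses import dataclass

# Actions that did not need executing. Per the logbook, an EXPIRED approval
# carrying one of these is not a false positive.
PASSIVE_ACTIONS = {"NO_ACTION", "OBSERVE", "INVESTIGATE"}


@dataclass
class Approval:  # hypothetical record shape, not the real schema
    rule_name: str
    action: str
    status: str  # e.g. "APPROVED" or "EXPIRED"


def is_false_positive(a: Approval) -> bool:
    """Assumed v2 rule: only EXPIRED approvals with a non-passive action count."""
    return a.status == "EXPIRED" and a.action not in PASSIVE_ACTIONS


def noise_rate(approvals: list[Approval]) -> float:
    """False positives divided by all approvals; 0.0 when there are none."""
    if not approvals:
        return 0.0
    return sum(is_false_positive(a) for a in approvals) / len(approvals)


sample = [
    Approval("PostgreSQLDiskGrowthRate", "NO_ACTION", "EXPIRED"),    # excluded in v2
    Approval("PostgreSQLDiskGrowthRate", "OBSERVE", "EXPIRED"),      # excluded in v2
    Approval("PostgreSQLDiskGrowthRate", "RESTART_POD", "EXPIRED"),  # counted
    Approval("PostgreSQLDiskGrowthRate", "RESTART_POD", "APPROVED"),
]
print(noise_rate(sample))  # 0.25
```

Under a rule that counted every EXPIRED approval, the same sample would score 0.75, which is the kind of inflation the fix is meant to remove.
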
**2. Hermes LLM upgrade**:
- Added `_llm_analyze_noisy_rule`: uses OpenClaw (Ollama/NemoTron/Gemini) to analyze each noisy rule
- Outputs JSON: probable_root_causes / recommended_actions / confidence / should_deprecate
- The Telegram summary includes the AI verdict + the top 2 suggestions
- Follows the Commander's iron rule: AI only analyzes; humans decide

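A minimal sketch of how the JSON reply might be validated and turned into the Telegram summary. The four field names come from the entry above; everything else is assumed for illustration: the helper names `parse_analysis` and `telegram_summary`, the field types (in particular a numeric `confidence`), the message layout, and the sample reply. The real `_llm_analyze_noisy_rule` may differ.

```python
import json

# Field names are from the logbook; the expected types are an assumption.
REQUIRED_FIELDS = {
    "probable_root_causes": list,
    "recommended_actions": list,
    "confidence": (int, float),
    "should_deprecate": bool,
}


def parse_analysis(raw: str) -> dict:
    """Parse the model reply and check that the four documented fields exist."""
    data = json.loads(raw)
    for key, expected in REQUIRED_FIELDS.items():
        if key not in data:
            raise ValueError(f"missing field: {key}")
        if not isinstance(data[key], expected):
            raise ValueError(f"wrong type for field: {key}")
    return data


def telegram_summary(rule_name: str, analysis: dict) -> str:
    """Build a short text: AI verdict plus at most the top 2 suggestions."""
    verdict = "deprecate" if analysis["should_deprecate"] else "keep"
    lines = [
        f"Rule: {rule_name}",
        f"AI verdict: {verdict} (confidence {analysis['confidence']:.2f})",
        "Top suggestions (a human makes the decision):",
    ]
    lines += [f"- {action}" for action in analysis["recommended_actions"][:2]]
    return "\n".join(lines)


# Made-up reply, only to exercise the helpers.
raw = json.dumps({
    "probable_root_causes": ["WAL growth during normal writes"],
    "recommended_actions": [
        "Deprecate the rule",
        "Alert on disk usage instead",
        "Tune the threshold",
    ],
    "confidence": 0.8,
    "should_deprecate": True,
})
summary = telegram_summary("PostgreSQLDiskGrowthRate", parse_analysis(raw))
print(summary)
```

The summary reports a verdict but triggers nothing, which matches the "AI only analyzes; humans decide" rule.
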
**3. Rule 1 PostgreSQLDiskGrowthRate deprecated**:
- Edited `ops/monitoring/alerts-unified.yml` to delete the old rule
- Added `HostDiskUsageHigh` (>80% for 10m, warning)
- Added `HostDiskUsageCritical` (>90% for 5m, critical)
- `labels.supersedes=PostgreSQLDiskGrowthRate` for traceability
- DB updated immediately: `UPDATE review_status='deprecated'`
- **The deploy-alerts workflow deployed the rules to Prometheus automatically and they are in effect** ✅

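For orientation, the two replacement rules could look roughly like the structures built below. The names, thresholds, `for` durations, severities, and the `supersedes` label are from the entry above. The PromQL expression (standard node_exporter filesystem metrics), the group name `host-disk`, and the choice to set `supersedes` on both rules are assumptions, since the logbook does not show the actual contents of `alerts-unified.yml`.

```python
import json


def disk_usage_rule(name: str, threshold_pct: int, for_duration: str, severity: str) -> dict:
    """Build one alerting rule in the shape Prometheus rule files use."""
    # Assumed expression: percentage of filesystem space in use, taken from
    # node_exporter metrics. The real rule may use a different expression.
    expr = (
        "(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100"
        f" > {threshold_pct}"
    )
    return {
        "alert": name,
        "expr": expr,
        "for": for_duration,
        "labels": {
            "severity": severity,
            "supersedes": "PostgreSQLDiskGrowthRate",  # traceability label
        },
    }


rules = [
    disk_usage_rule("HostDiskUsageHigh", 80, "10m", "warning"),
    disk_usage_rule("HostDiskUsageCritical", 90, "5m", "critical"),
]

# "host-disk" is a hypothetical group name.
print(json.dumps({"groups": [{"name": "host-disk", "rules": rules}]}, indent=2))
```

Alerting on absolute usage instead of growth rate avoids firing on the steady WAL growth that made the old rule noisy.
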
**4. coverage_evaluator v2 extended by 4 dimensions**:
- auto_playbook: asset.name appears in playbooks.symptom_pattern/description → green
- auto_remediation: remediation_events.target ILIKE asset.name within the past 30d → green/red
- auto_rule_matching: incidents triggered in the past 30d + matching asset labels → green/yellow
- auto_rule_creation: alert_rule_catalog.source='ai_generated' → currently all red (turns green once Hermes produces AI rules)
- **coverage went from the original 3 implemented dimensions to all 7, 100% complete**

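The four new dimensions can be sketched as small pure functions over in-memory rows. This is a simplified illustration rather than the real coverage_evaluator, which presumably queries the database. The row shapes, the `created_at` key, the red fallback for auto_playbook, and reading ILIKE as a case-insensitive substring match are assumptions. The colors and the 30-day window come from the list above.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=30)


def auto_playbook(asset_name: str, playbooks: list[dict]) -> str:
    """Green when the asset name appears in a playbook pattern or description."""
    for p in playbooks:
        text = f"{p.get('symptom_pattern', '')} {p.get('description', '')}"
        if asset_name in text:
            return "green"
    return "red"  # assumed fallback; the logbook only states the green case


def auto_remediation(asset_name: str, events: list[dict], now: datetime) -> str:
    """Green when a remediation event in the last 30 days targets the asset."""
    for e in events:
        recent = now - e["created_at"] <= WINDOW
        if recent and asset_name.lower() in e["target"].lower():
            return "green"
    return "red"


def auto_rule_matching(asset_labels: dict, incidents: list[dict], now: datetime) -> str:
    """Green when a recent incident shares at least one label pair with the asset."""
    for i in incidents:
        recent = now - i["created_at"] <= WINDOW
        if recent and asset_labels.items() & i["labels"].items():
            return "green"
    return "yellow"


def auto_rule_creation(rules: list[dict]) -> str:
    """Green when any catalog rule was generated by AI."""
    return "green" if any(r["source"] == "ai_generated" for r in rules) else "red"


now = datetime(2026, 4, 19, 20, 0)
result = {
    "auto_playbook": auto_playbook(
        "api-gateway", [{"description": "restart api-gateway when 5xx spikes"}]
    ),
    "auto_remediation": auto_remediation(
        "api-gateway",
        [{"target": "deployment/Api-Gateway", "created_at": datetime(2026, 4, 10)}],
        now,
    ),
    "auto_rule_matching": auto_rule_matching(
        {"app": "api-gateway"},
        [{"labels": {"app": "api-gateway"}, "created_at": datetime(2026, 3, 1)}],
        now,
    ),
    "auto_rule_creation": auto_rule_creation([{"source": "manual"}]),
}
print(result)
```

In this made-up data the only matching incident is older than 30 days, so auto_rule_matching stays yellow, and no rule has source `ai_generated`, so auto_rule_creation is red, as the entry reports for all assets today.
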
### Full results of this session (22 commits in total as of 19:50)
| Category | Commits |
|---|---|
| aol writer + verifier await + drift 400 | e7ba8cb / c0f3509 |
| CI cd.yaml B5 shared network (3 rounds of debugging) | b636d3b / ddb902f / 5b9b36f |
| 4 core scanners | 4259a10 / 0226344 |
| asset_scanner v3 + ReplicaSet bridge | d11b09c / fdf8b73 / e677773 |
| coverage_evaluator (KM fix) | 007c7ef / 5052323 / c8b263d |
| rule_stats_updater + asset_change_tracker | df71c9a / 6b14194 / 92349bc |
| Hermes rule quality advisor | 9ed135e / 6ab0ce9 |
| LOGBOOK + memory | 2dc84e7 / c015a77 |
| **This round: LLM Hermes + Rule 1 deprecate** | **ba18ad2** |
| **This round: coverage 4-dimension extension** | **996ac1d / c1f23cf** |

### Verified numbers (2026-04-19 19:50)
| Table | Current state |
|---|---|
| asset_inventory | 140+, all resource types |
| asset_relationship | 114 (including 54+ Pod→Deployment) |
| alert_rule_catalog | 69 rules (originally 68 - 1 deprecated + 2 new = 69) |
| asset_coverage_snapshot | all 7 dimensions can be evaluated (becomes complete on the first run after deployment) |
| host_capacity_snapshot | 3 hosts, accumulating daily |
| asset_compliance_snapshot | 39 × 7 = 273 per scan |
| incident_evidence | 339/24h, collected continuously |
| aol op_types | 6 active types (asset_discovered/rule_created/rule_updated/capacity_recommendation/coverage_recalculated/notification_formatted) |

### In effect on Prometheus
- HostDiskUsageHigh/Critical are deployed to 110:/home/wooo/monitoring/alerts.yml
- The deploy-alerts workflow sent the notification "✅ Prometheus alert rule deployment success (ba18ad2)"
- Prometheus has loaded 69 rules (shown in the log)

### Pending verification (requires real traffic)
- aol(playbook_executed): the next real APPROVED+execute approval
- incident_evidence.verification_result: same as above
- capacity_violation_event: an over-threshold condition (currently cpu 66%, mem 15%, still below the 80%/85% thresholds)

### All 5 bugs found in review fixed
1. kubectl_get namespace parameter bug → call subprocess directly
2. asset_scanner blind spot (it scanned only pods) → v3 scans multiple resource types
3. ReplicaSet bridge was missing Pod→Deployment → rs_to_deployment map
4. coverage_evaluator KM field body→content → schema corrected
5. drift diff HTTP 400 → accumulate length item by item

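Fix 3 relies on the fact that a Pod created by a Deployment is owned by a ReplicaSet, which in turn is owned by the Deployment. A minimal sketch of such a bridge is shown below, using the `metadata.ownerReferences` structure that `kubectl get ... -o json` returns. Apart from the name `rs_to_deployment`, the function names and sample objects are hypothetical, and the real asset_scanner code may be organized differently.

```python
from typing import Optional


def owner_of(obj: dict, kind: str) -> Optional[str]:
    """Return the name of the first owner of the given kind, if any."""
    for ref in obj.get("metadata", {}).get("ownerReferences", []):
        if ref.get("kind") == kind:
            return ref.get("name")
    return None


def build_rs_to_deployment(replicasets: list[dict]) -> dict:
    """Map (namespace, ReplicaSet name) to the owning Deployment name."""
    mapping = {}
    for rs in replicasets:
        deployment = owner_of(rs, "Deployment")
        if deployment:
            key = (rs["metadata"]["namespace"], rs["metadata"]["name"])
            mapping[key] = deployment
    return mapping


def pod_to_deployment(pod: dict, rs_to_deployment: dict) -> Optional[str]:
    """Resolve Pod → ReplicaSet → Deployment; None for Pods without that chain."""
    rs_name = owner_of(pod, "ReplicaSet")
    if rs_name is None:
        return None
    return rs_to_deployment.get((pod["metadata"]["namespace"], rs_name))


# Hypothetical objects, trimmed to the fields the sketch reads.
replicasets = [{
    "metadata": {
        "namespace": "prod",
        "name": "api-7d9f",
        "ownerReferences": [{"kind": "Deployment", "name": "api"}],
    }
}]
pod = {
    "metadata": {
        "namespace": "prod",
        "name": "api-7d9f-x2k4q",
        "ownerReferences": [{"kind": "ReplicaSet", "name": "api-7d9f"}],
    }
}
rs_to_deployment = build_rs_to_deployment(replicasets)
print(pod_to_deployment(pod, rs_to_deployment))  # api
```

Keying the map by namespace as well as name avoids linking a Pod to a same-named ReplicaSet in another namespace.
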
### Candidates for the next phase (2 of the 4 Commander-approved items are done)
- ✅ Hermes LLM upgrade (completed this round)
- ⏳ SSL/CVE/backup compliance 6-dimension implementation
- ✅ auto_playbook/auto_remediation/auto_rule_matching/auto_rule_creation (extended this round)
- ⏳ Phase 4 Holt-Winters AI capacity forecasting

---
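## 📍 2026-04-19 evening 18:00: In-depth review, Phase 7 completed (all 8 tables written + coverage upgrade + Hermes AI suggestions) 🎖️🎖️

### Triggered by the Commander's instruction "Keep pushing forward + keep reviewing the original approach + move toward AI autonomy"
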