fix(governance): stabilize adr100 km growth slo

Your Name
2026-05-14 19:33:52 +08:00
parent cdb8bf6802
commit d2a4a17969
9 changed files with 267 additions and 30 deletions

@@ -110,11 +110,18 @@ sum(rate(approval_records_high_confidence_total[1h]))
**SLI formula**:
```promql
max(knowledge_entries_created_24h)
or
increase(knowledge_entries_total[24h])
```
**Recording rule**: `sli:km_growth_rate:24h`
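**Data source note (2026-05-14 T19)**: `knowledge_entries_created_24h`
is a gauge that the API `/metrics` endpoint produces directly from PostgreSQL (`knowledge_entries.created_at >= now()-24h`).
`increase(knowledge_entries_total[24h])` is kept only as a fallback to the legacy counter.
Preferring the gauge avoids misreporting KM growth as 0 when the emitter has just gone live and Prometheus does not yet have 24h of counter history.
**Target (SLO)**: ≥ 20 entries/day
**Error budget**: the standard burn rate does not apply (absolute-value SLO); threshold alerting is used instead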
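
The gauge-first fallback and the threshold check described above can be sketched in Python as follows. This is a simplified illustration, not the project's implementation: the function names `km_growth_24h` and `km_growth_meets_slo` are hypothetical, the inputs are assumed to be already-fetched sample values (`None` meaning "no samples"), and PromQL label matching for `or` is not modelled.

```python
from typing import Optional

# From the SLO above: at least 20 entries per day.
SLO_TARGET_PER_DAY = 20


def km_growth_24h(gauge_value: Optional[float],
                  counter_increase: Optional[float]) -> Optional[float]:
    """Prefer the DB-derived gauge; use the legacy counter increase
    only when the gauge has no samples (hypothetical helper)."""
    if gauge_value is not None:
        return gauge_value
    return counter_increase


def km_growth_meets_slo(value: Optional[float]) -> Optional[bool]:
    """Absolute threshold check. None means 'no data', which is kept
    distinct from a breach so a missing series is not reported as red."""
    if value is None:
        return None
    return value >= SLO_TARGET_PER_DAY
```

With a freshly deployed emitter, the counter increase may still be 0 while the gauge already reports the real count, so `km_growth_24h(25.0, 0.0)` yields the gauge value rather than the warm-up artifact.
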
@@ -158,7 +165,7 @@ increase(knowledge_entries_total[24h])
| `ops/monitoring/tests/test_slo_rules.yaml` | promtool unit tests |
| `ops/monitoring/grafana/dashboards/ai-slo-dashboard.json` | Grafana SLO Dashboard |
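| `apps/api/src/services/governance_agent.py` | `check_slo_compliance()` integration |
| `apps/api/src/services/adr100_slo_metrics_service.py` | 2026-05-14 T18: emits the underlying ADR-100 Prometheus series from the PostgreSQL source of truth; `automation_operation_log_total` covers only the remediation / PlayBook / auto-repair scope, and background governance jobs are excluded from the AI auto-repair SLO denominator |
| `apps/api/src/services/adr100_slo_metrics_service.py` | 2026-05-14 T18: emits the underlying ADR-100 Prometheus series from the PostgreSQL source of truth; `automation_operation_log_total` covers only the remediation / PlayBook / auto-repair scope, and background governance jobs are excluded from the AI auto-repair SLO denominator. 2026-05-14 T19: adds `*_created_24h` gauges so the governance Agent / frontend can directly display the factual counts for the last 24h, avoiding false reds caused by counter warm-up |
| `apps/api/src/main.py` `/metrics` | 2026-05-14 T18: adds the DB-derived SLO emitter so the existing `awoooi-api` scrape job can pick up the underlying series |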
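
As a rough illustration of what a DB-derived gauge looks like on `/metrics`, the sketch below renders one gauge in the Prometheus text exposition format. It is not taken from `adr100_slo_metrics_service.py`: the helper `render_gauge`, the help text, and the exact SQL string are assumptions; only the metric name and the `created_at >= now()-24h` condition come from the notes above. The count is assumed to have been fetched from PostgreSQL already.

```python
# Assumed query shape, based on the condition quoted in the data source note.
KM_CREATED_24H_SQL = (
    "SELECT count(*) FROM knowledge_entries "
    "WHERE created_at >= now() - interval '24 hours'"
)


def render_gauge(name: str, help_text: str, value: float) -> str:
    """Render a single unlabelled gauge as Prometheus text exposition
    lines (hypothetical helper for illustration)."""
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name} {value}\n"
    )


if __name__ == "__main__":
    # 23 stands in for the row count returned by KM_CREATED_24H_SQL.
    print(render_gauge(
        "knowledge_entries_created_24h",
        "Knowledge entries created in the last 24h",
        23,
    ), end="")
```

Because the value is recomputed from the database on every scrape, it is correct from the first sample and needs no 24h of Prometheus history.
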
## Decision rationale