docs(logbook): record auto repair handoff card deploy [skip ci]
@@ -4965,3 +4965,88 @@ telegram_request_failed / telegram_api_error。
```
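
Assessment: The fourth layer of Telegram noise control is now live. The alert path is now: the first parent card makes the incident visible to humans; child alerts in the same group are recorded as AwoooP events; if a parent card already exists, Telegram only adds a low-frequency digest reply. Any further improvement should pick one of three directions (parent-card status editing, AwoooP Run drilldown, or an hourly war-room summary) rather than adding more per-alert Telegram messages.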
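
The alert path described above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `AlertGroup`, `route_alert`, the returned action names, and the 600-second digest interval are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class AlertGroup:
    """State for one alert group (hypothetical; field names are illustrative)."""
    parent_card_sent: bool = False
    events: list = field(default_factory=list)
    pending_digest: int = 0
    last_digest_at: float = 0.0


def route_alert(group: AlertGroup, alert: str, now: float,
                digest_interval: float = 600.0) -> str:
    """Decide what Telegram should do for one incoming alert.

    Returns one of:
      'parent_card'  - send the first card so a human sees the incident
      'digest_reply' - reply to the parent card with a batched summary
      'event_only'   - record the alert as an event and send nothing
    """
    if not group.parent_card_sent:
        group.parent_card_sent = True
        group.last_digest_at = now  # the digest window starts at the parent card
        return "parent_card"

    # Child alerts in the same group are always recorded as events first.
    group.events.append(alert)
    group.pending_digest += 1

    # Only emit a digest reply once the (assumed) interval has elapsed.
    if now - group.last_digest_at >= digest_interval:
        group.last_digest_at = now
        group.pending_digest = 0
        return "digest_reply"
    return "event_only"
```

With these assumptions, a burst of child alerts inside one interval produces no Telegram traffic at all; they only accumulate as events until the next digest.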
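
### 02:15 Semantic auto-repair result cards: AUTO RESOLVED / HANDOFF REQUIRED

**Background**:

- Telegram war-room screenshots show that when `[AUTO] AI 自動修復失敗` (AI auto-repair failed), `ESCALATION`, and `ACTION REQUIRED` messages are mixed together, SREs cannot tell at a glance which items were completed automatically and which need a human to take over after automation stopped.
- This round first tightens the auto-repair result reply into a fixed semantic card, rather than raw action / exception fragments.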
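
**Changes**:

- `decision_manager.py` adds `_format_auto_repair_status_line()`:
  - Success: `AUTO RESOLVED|AI 自動修復完成` (AI auto-repair completed).
  - Failure: `HANDOFF REQUIRED|AI 自動修復失敗,已轉人工` (AI auto-repair failed, handed off to a human).
  - The failure card states explicitly that automation has stopped and will not retry, and asks the SRE to handle the incident via the AwoooP Run or the original alert card.
  - `incident_id`, target, action, error, and metrics delta are all compressed into short fields and HTML-escaped, to avoid Telegram parse errors and long commands flooding the chat.
- `test_telegram_message_templates.py` adds two regression tests:
  - The failure card must be labeled `HANDOFF REQUIRED` and must escape `<scheme> & %d format`.
  - The success card must be labeled `AUTO RESOLVED` and must escape the metrics delta.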
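
To make the short-field compression and HTML escaping concrete, here is a minimal sketch. It is not the actual `_format_auto_repair_status_line()`: the function names, signature, field labels, card layout, and length limits below are assumptions; only the two status strings and the two failure-card notices are taken from this entry.

```python
import html


def shorten(value: object, limit: int = 80) -> str:
    """Collapse whitespace, truncate, then HTML-escape a field value."""
    text = " ".join(str(value).split())
    if len(text) > limit:
        # Truncate before escaping so an entity such as &amp; is never cut in half.
        text = text[: limit - 1] + "…"
    return html.escape(text)


def format_status_card_sketch(
    success: bool,
    incident_id: str,
    target: str,
    action: str,
    error: str = "",
    metrics_delta: str = "",
) -> str:
    """Build a fixed-layout result card (hypothetical layout and labels)."""
    if success:
        lines = ["<b>AUTO RESOLVED|AI 自動修復完成</b>"]
    else:
        lines = ["<b>HANDOFF REQUIRED|AI 自動修復失敗,已轉人工</b>"]
    lines.append(f"incident: <code>{shorten(incident_id, 40)}</code>")
    lines.append(f"target: {shorten(target)}")
    lines.append(f"action: {shorten(action)}")
    if success:
        if metrics_delta:
            lines.append(f"metrics: {shorten(metrics_delta)}")
    else:
        lines.append(f"error: {shorten(error)}")
        # Notices quoted in this entry: automation stopped, SRE takes over.
        lines.append("自動化已停止,不再重試")
        lines.append("請 SRE 依 AwoooP Run / 原告警卡處理")
    return "\n".join(lines)
```

The two regression cases listed above would then look something like this against the sketch:

```python
card = format_status_card_sketch(
    False, "INC-1", "api", "restart", error="<scheme> & %d format"
)
assert "HANDOFF REQUIRED" in card
assert "&lt;scheme&gt; &amp; %d format" in card
```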

**Verification**:

```text
py_compile:
  apps/api/src/services/decision_manager.py
  apps/api/tests/test_telegram_message_templates.py
  # passed

pytest:
  DATABASE_URL='postgresql+asyncpg://test:test@127.0.0.1:5432/test' \
  /Users/ogt/awoooi/apps/api/.venv/bin/python -m pytest \
    apps/api/tests/test_telegram_message_templates.py \
    apps/api/tests/test_channel_hub_grouped_alert_events.py \
    apps/api/tests/test_alert_grouping_service.py \
    apps/api/tests/test_ssh_provider_tools.py \
    apps/api/tests/test_operation_parser_ssh.py -q
  # 64 passed

ruff import order:
  apps/api/tests/test_telegram_message_templates.py
  # All checks passed

note:
  decision_manager.py is an existing Tier 3 large file; a whole-file ruff import-order check still reports legacy local-import ordering issues.
  This round validated the narrow change with py_compile and the related unit tests only, without any unrelated large-scale cleanup.
```

**Production deployment**:

```text
Commit:
  3f69e03f fix(telegram): clarify auto repair handoff cards

Gitea workflows:
  1491 CD Pipeline -> success
  1492 Code Review -> success

awoooi-api image:
  192.168.0.110:5000/awoooi/api:3f69e03fcb915514aabf25263b5004b7de5912dc

awoooi-web image:
  192.168.0.110:5000/awoooi/web:3f69e03fcb915514aabf25263b5004b7de5912dc

awoooi-worker image:
  192.168.0.110:5000/awoooi/api:3f69e03fcb915514aabf25263b5004b7de5912dc

K8s rollout:
  awoooi-api -> successfully rolled out, 2/2 ready
  awoooi-web -> successfully rolled out, 2/2 ready
  awoooi-worker -> successfully rolled out, 1/1 ready

HTTP:
  /api/v1/health -> 200
  /zh-TW/awooop/runs -> 200

New pod log:
  No auto_repair_result_push_failed / Traceback /
  telegram_request_failed / telegram_api_error in the last 5 minutes.
```
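
The new-pod log check can be expressed as a small scan over recent log lines. This is only a sketch of that kind of check: the marker names come from this entry, while `find_failure_markers` and the way the log lines are collected are assumptions.

```python
# Marker strings taken from the log check in this entry.
FAILURE_MARKERS = (
    "auto_repair_result_push_failed",
    "Traceback",
    "telegram_request_failed",
    "telegram_api_error",
)


def find_failure_markers(log_lines):
    """Return {marker: count} for every marker seen at least once."""
    counts = {}
    for line in log_lines:
        for marker in FAILURE_MARKERS:
            if marker in line:
                counts[marker] = counts.get(marker, 0) + 1
    return counts
```

An empty result corresponds to the "none seen in the last 5 minutes" outcome recorded above; the caller is responsible for passing only lines from that time window.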
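
**Progress calibration**:

- Telegram noise and readability track: about 86%.
- AwoooP + AI automation flywheel, overall closed loop: about 64%.

Assessment: This round resolved the high-pain problem of humans being unable to read message state. The next step should add a state mapping to the AwoooP Run detail / Timeline views, so that every Telegram reply can be traced in the Console to the full handling context of the same run / incident.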