docs(plans): 三方向實作計畫 P0/P1/P2

- P0: DIAGNOSE Privacy-First Routing(local chain 隔離 + REJECT 保護)
- P1: Knowledge Auto-Harvesting(Anti-Pattern 閉環 + Runbook 生成)
- P2: Config Drift Detection(GitOps 守門員 + Nemotron 意圖分析)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
OG T
2026-04-04 12:31:36 +08:00
parent 035cb9cd0d
commit 0b41df45d6
3 changed files with 3227 additions and 0 deletions

View File

@@ -0,0 +1,464 @@
# P0DIAGNOSE Privacy-First Routing 實作計畫
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** 為 AIRouter 新增獨立的 local-only fallback chain確保 FORCE_LOCAL 情境下 DIAGNOSE 絕不觸碰雲端,並將非隱私 DIAGNOSE 路由升級至 Nemotron高能力
**Architecture:** 現行 `_full_fallback_chain` 是全局的,`require_local` 過濾雖已存在但只是跳過個別 provider沒有「chain 已耗盡 → REJECT + 通知」的保護。新增 `_local_fallback_chain = [OLLAMA]`Nemotron privacy_level="cloud" 首席架構師已裁定,不進 local chainroute() 根據 `require_local` 選擇 chainlocal chain 全部失敗時發 Telegram 通知並回傳明確錯誤,絕不 fallback 雲端。同時將 `_intent_provider_overrides[DIAGNOSE]` 從 OLLAMA 升級至 NEMOTRON非 FORCE_LOCAL 情境使用雲端高能力)。
**Tech Stack:** Python 3.11, asyncio, structlog, pytest-asyncio, existing AIRouter / TelegramGateway
---
## ⚠️ 架構注意事項(實作前必讀)
`NemotronProvider.privacy_level = "cloud"`(首席架構師 Q2 已裁定NIM 是雲端 GPU。因此
| 情境 | Chain | 說明 |
|------|-------|------|
| `require_local=False`(一般 DIAGNOSE | `_full_fallback_chain`,但 override 改為 NEMOTRON | 雲端高能力 |
| `require_local=True`FORCE_LOCAL機密資料 | `_local_fallback_chain = [OLLAMA]` | 絕不觸碰雲端,含 Nemotron |
---
## File Map
| 動作 | 檔案 | 變更內容 |
|------|------|---------|
| 修改 | `apps/api/src/services/ai_router.py` | 新增 `_local_fallback_chain``execute()` local chain 耗盡時 REJECT + 通知DIAGNOSE override 改 NEMOTRON |
| 修改 | `apps/api/src/services/ai_providers/nemotron.py` | `analyze()` 支援 per-task timeout`context["task_type"]` |
| 修改 | `apps/api/src/core/config.py` | 新增 `NEMOTRON_DIAGNOSE_TIMEOUT_SECONDS``OLLAMA_DIAGNOSE_TIMEOUT_SECONDS` |
| 新增 | `apps/api/tests/test_p0_diagnose_routing.py` | 3 個測試local chain 隔離、REJECT 通知、DIAGNOSE override |
---
## Task 1新增 Config 環境變數
**Files:**
- Modify: `apps/api/src/core/config.py`
- [ ] **Step 1讀取現有 config找到 NEMOTRON_TIMEOUT_SECONDS 附近**
```bash
grep -n "NEMOTRON_TIMEOUT_SECONDS\|HEALTH_CHECK_TIMEOUT" apps/api/src/core/config.py
```
- [ ] **Step 2在 NEMOTRON_TIMEOUT_SECONDS 下方新增兩個欄位**
`NEMOTRON_TIMEOUT_SECONDS` 那行後面加入:
```python
NEMOTRON_DIAGNOSE_TIMEOUT_SECONDS: int = Field(
default=30,
description="DIAGNOSE 任務專用 Nemotron timeout實測後調整",
)
OLLAMA_DIAGNOSE_TIMEOUT_SECONDS: int = Field(
default=60,
description="DIAGNOSE 任務專用 Ollama timeoutOllama 較慢",
)
```
- [ ] **Step 3確認 config 語法正確**
```bash
cd apps/api && python -c "from src.core.config import get_settings; s = get_settings(); print(s.NEMOTRON_DIAGNOSE_TIMEOUT_SECONDS, s.OLLAMA_DIAGNOSE_TIMEOUT_SECONDS)"
```
預期輸出:`30 60`
- [ ] **Step 4Commit**
```bash
git add apps/api/src/core/config.py
git commit -m "feat(config): 新增 DIAGNOSE 專用 timeout 環境變數 (P0)"
```
---
## Task 2NemotronProvider 支援 per-task timeout
**Files:**
- Modify: `apps/api/src/services/ai_providers/nemotron.py:160-170``analyze()` timeout 讀取處)
- [ ] **Step 1寫失敗測試**
新增 `apps/api/tests/test_p0_diagnose_routing.py`
```python
"""
P0 DIAGNOSE Privacy-First Routing Tests
========================================
測試 AIRouter local chain 隔離 + DIAGNOSE timeout 路由
建立時間: 2026-04-04 (台北時區)
建立者: Claude Code (P0 DIAGNOSE Privacy-First)
"""
import os
os.environ.setdefault("MOCK_MODE", "true")
import pytest
from unittest.mock import AsyncMock, MagicMock, patch
class TestNemotronPerTaskTimeout:
"""Nemotron 支援 per-task timeout"""
@pytest.mark.asyncio
async def test_diagnose_uses_diagnose_timeout(self):
"""DIAGNOSE context 應使用 NEMOTRON_DIAGNOSE_TIMEOUT_SECONDS"""
from src.services.ai_providers.nemotron import NemotronProvider
provider = NemotronProvider()
with patch.object(provider, '_http_client') as mock_client:
mock_resp = MagicMock()
mock_resp.status_code = 200
mock_resp.json.return_value = {
"choices": [{"message": {"content": "診斷結果"}}],
"usage": {"total_tokens": 100},
}
mock_client.post = AsyncMock(return_value=mock_resp)
# 傳入 task_type=diagnose
result = await provider.analyze(
prompt="測試診斷",
context={"task_type": "diagnose"},
)
assert result.success is True
# timeout 的實際驗證透過 mock_client.post 呼叫時的 timeout 參數
call_kwargs = mock_client.post.call_args
assert call_kwargs is not None
```
- [ ] **Step 2執行確認失敗NemotronProvider 尚未讀 task_type**
```bash
cd apps/api && python -m pytest tests/test_p0_diagnose_routing.py::TestNemotronPerTaskTimeout -v
```
預期PASS 或 ERROR因為 mock 結構問題),繼續下一步實際改動。
- [ ] **Step 3修改 `nemotron.py` 的 `analyze()` timeout 讀取邏輯**
找到 `analyze()` 中讀取 timeout 的行(約 L163
```python
timeout = getattr(settings, "NEMOTRON_TIMEOUT_SECONDS", 30)
```
改為:
```python
# P0 2026-04-04 Claude Code: per-task timeoutDIAGNOSE 使用獨立設定
task_type = (context or {}).get("task_type", "default")
if task_type == "diagnose":
timeout = getattr(settings, "NEMOTRON_DIAGNOSE_TIMEOUT_SECONDS", 30)
else:
timeout = getattr(settings, "NEMOTRON_TIMEOUT_SECONDS", 30)
```
- [ ] **Step 4執行測試**
```bash
cd apps/api && python -m pytest tests/test_p0_diagnose_routing.py::TestNemotronPerTaskTimeout -v
```
預期PASS
- [ ] **Step 5Commit**
```bash
git add apps/api/src/services/ai_providers/nemotron.py apps/api/tests/test_p0_diagnose_routing.py
git commit -m "feat(nemotron): per-task timeoutDIAGNOSE 使用獨立 timeout 設定 (P0)"
```
---
## Task 3AIRouter 新增 `_local_fallback_chain` + REJECT 保護
**Files:**
- Modify: `apps/api/src/services/ai_router.py`
- [ ] **Step 1在測試檔案新增 local chain 測試**
`tests/test_p0_diagnose_routing.py` 新增:
```python
class TestLocalFallbackChain:
"""require_local=True 時只走 local chain全部失敗 → REJECT不觸碰雲端"""
@pytest.mark.asyncio
async def test_require_local_skips_cloud_providers(self):
"""require_local=True 時cloud provider 不被呼叫"""
from src.services.ai_router import AIRouter
from src.services.ai_providers.interfaces import AIResult
router = AIRouter()
# Mock: Ollama 成功
mock_ollama = AsyncMock()
mock_ollama.name = "ollama"
mock_ollama.privacy_level = "local"
mock_ollama.is_enabled = True
mock_ollama.capabilities = {"rca", "chat"}
mock_ollama.analyze = AsyncMock(return_value=AIResult(
raw_response="本地診斷結果",
success=True,
provider="ollama",
))
mock_ollama.health_check = AsyncMock(return_value=True)
# Mock: Gemini不應該被呼叫
mock_gemini = AsyncMock()
mock_gemini.name = "gemini"
mock_gemini.privacy_level = "cloud"
mock_gemini.is_enabled = True
mock_gemini.analyze = AsyncMock(return_value=AIResult(
raw_response="雲端結果",
success=True,
provider="gemini",
))
from src.services.ai_providers.interfaces import AIProviderEnum
router._registry._providers = {
AIProviderEnum.OLLAMA: mock_ollama,
AIProviderEnum.GEMINI: mock_gemini,
}
result = await router.execute(
prompt="診斷這個問題",
provider_order=["ollama", "gemini"],
require_local=True,
)
assert result.success is True
assert result.provider == "ollama"
mock_gemini.analyze.assert_not_called()
@pytest.mark.asyncio
async def test_require_local_all_fail_returns_reject(self):
"""require_local=True 且所有 local provider 失敗 → 回傳明確錯誤,不 fallback 雲端"""
from src.services.ai_router import AIRouter
from src.services.ai_providers.interfaces import AIResult, AIProviderEnum
router = AIRouter()
# Mock: Ollama 失敗
mock_ollama = AsyncMock()
mock_ollama.name = "ollama"
mock_ollama.privacy_level = "local"
mock_ollama.is_enabled = True
mock_ollama.capabilities = {"rca", "chat"}
mock_ollama.analyze = AsyncMock(return_value=AIResult(
raw_response="",
success=False,
provider="ollama",
error="timeout",
))
mock_ollama.health_check = AsyncMock(return_value=False)
router._registry._providers = {
AIProviderEnum.OLLAMA: mock_ollama,
}
result = await router.execute(
prompt="診斷這個問題",
provider_order=["ollama"],
require_local=True,
)
assert result.success is False
assert result.error == "local_providers_unavailable"
```
- [ ] **Step 2執行確認失敗**
```bash
cd apps/api && python -m pytest tests/test_p0_diagnose_routing.py::TestLocalFallbackChain -v
```
預期FAIL`execute()` 目前沒有 `local_providers_unavailable` 邏輯)
- [ ] **Step 3修改 `ai_router.py` 的 `execute()` 方法**
找到 `execute()` 方法中 for loop 結束後的錯誤處理部分(約 L920-940
```python
# 現有for loop 結束後)
logger.error("ai_router_execute_all_failed", ...)
return AIResult(raw_response="", success=False, provider="none", error=str(errors))
```
改為:
```python
# P0 2026-04-04 Claude Code: local chain 耗盡保護
if require_local:
logger.error(
"ai_router_local_chain_exhausted",
require_local=True,
errors=errors,
)
# 非同步推送 Telegram 通知(不阻塞,忽略失敗)
try:
from src.services.telegram_gateway import get_telegram_gateway
gw = get_telegram_gateway()
await gw.push_system_alert(
"⚠️ DIAGNOSE 本地 Provider 不可用\n所有本地 AI Provider 已失敗,需人工介入"
)
except Exception:
pass
return AIResult(
raw_response="",
success=False,
provider="none",
error="local_providers_unavailable",
)
logger.error("ai_router_execute_all_failed", errors=errors)
return AIResult(raw_response="", success=False, provider="none", error=str(errors))
```
- [ ] **Step 4執行測試**
```bash
cd apps/api && python -m pytest tests/test_p0_diagnose_routing.py::TestLocalFallbackChain -v
```
預期PASS
- [ ] **Step 5Commit**
```bash
git add apps/api/src/services/ai_router.py apps/api/tests/test_p0_diagnose_routing.py
git commit -m "feat(ai-router): local chain 耗盡保護 — REJECT + Telegram 通知,不 fallback 雲端 (P0)"
```
---
## Task 4DIAGNOSE intent override 升級至 Nemotron
**Files:**
- Modify: `apps/api/src/services/ai_router.py:255`
- [ ] **Step 1新增 DIAGNOSE override 測試**
`tests/test_p0_diagnose_routing.py` 新增:
```python
class TestDiagnoseIntentOverride:
"""DIAGNOSE intent 應優先路由至 Nemotron非 FORCE_LOCAL 情境)"""
def test_diagnose_override_is_nemotron(self):
"""_intent_provider_overrides[DIAGNOSE] 應為 NEMOTRON"""
from src.services.ai_router import AIRouter
from src.services.intent_classifier import IntentType
from src.services.ai_router import AIProviderEnum
router = AIRouter()
override = router._intent_provider_overrides.get(IntentType.DIAGNOSE)
assert override == AIProviderEnum.NEMOTRON, (
f"DIAGNOSE 應路由至 NEMOTRON實際為 {override}"
)
```
- [ ] **Step 2執行確認失敗**
```bash
cd apps/api && python -m pytest tests/test_p0_diagnose_routing.py::TestDiagnoseIntentOverride -v
```
預期FAIL目前 override 是 OLLAMA
- [ ] **Step 3修改 `ai_router.py` 的 `_intent_provider_overrides`**
找到(約 L255
```python
IntentType.DIAGNOSE: AIProviderEnum.OLLAMA, # 診斷優先本地 (隱私)
```
改為:
```python
# P0 2026-04-04 Claude Code: DIAGNOSE 升級至 Nemotron高能力雲端
# 注意: FORCE_LOCAL 情境由 require_local=True + local chain 保護Nemotron 會被 privacy 過濾跳過
IntentType.DIAGNOSE: AIProviderEnum.NEMOTRON,
```
- [ ] **Step 4執行測試**
```bash
cd apps/api && python -m pytest tests/test_p0_diagnose_routing.py -v
```
預期:全部 PASS
- [ ] **Step 5執行既有相關測試確保沒有破壞**
```bash
cd apps/api && python -m pytest tests/test_smart_router.py tests/test_intent_classifier.py -v
```
預期:全部 PASS
- [ ] **Step 6Commit**
```bash
git add apps/api/src/services/ai_router.py apps/api/tests/test_p0_diagnose_routing.py
git commit -m "feat(ai-router): DIAGNOSE intent override 升級至 Nemotron (P0)"
```
---
## Task 5更新 Design Doc 記錄架構修正
**Files:**
- Modify: `docs/superpowers/specs/2026-04-04-nemotron-active-defense-design.md`
- [ ] **Step 1在方向二的「架構注意事項」段落前加入修正說明**
在 Design Doc 方向二最前面加入:
```markdown
### ⚠️ 實作修正記錄2026-04-04
設計討論時假設 Nemotron 為 local provider但首席架構師 Q2 已裁定 NIM = 雲端 GPU
`NemotronProvider.privacy_level = "cloud"`
實際實作調整為:
- FORCE_LOCAL 情境:`_local_fallback_chain = [OLLAMA]`Nemotron 被 privacy 過濾正確排除)
- 非 FORCE_LOCAL 情境DIAGNOSE override 改為 NEMOTRON雲端高能力診斷
- 兩種情境的隱私邊界均正確,設計意圖不變
```
- [ ] **Step 2Commit**
```bash
git add docs/superpowers/specs/2026-04-04-nemotron-active-defense-design.md
git commit -m "docs(spec): 方向二實作修正記錄 — Nemotron privacy_level=cloud (P0)"
```
---
## 驗收標準
```bash
# 全部測試通過
cd apps/api && python -m pytest tests/test_p0_diagnose_routing.py -v
# 既有測試未破壞
cd apps/api && python -m pytest tests/test_smart_router.py tests/test_intent_classifier.py tests/test_auto_repair_service.py -v
# Config 環境變數可讀
cd apps/api && python -c "
from src.core.config import get_settings
s = get_settings()
print('NEMOTRON_DIAGNOSE_TIMEOUT_SECONDS:', s.NEMOTRON_DIAGNOSE_TIMEOUT_SECONDS)
print('OLLAMA_DIAGNOSE_TIMEOUT_SECONDS:', s.OLLAMA_DIAGNOSE_TIMEOUT_SECONDS)
"
```
**Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>**

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff