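docs(adr): ADR-017 LLMOps Observability three-layer observability architecture

Add the Phase 15 LLMOps observability architecture decision record, covering:
- Three-layer observability architecture (Langfuse + SignOz + Sentry)
- Langfuse integration and Deep Linking implementation
- Redis Streams trace context propagation
- Sampling-rate strategy and cost estimate

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>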
284
docs/adr/ADR-017-llmops-observability.md
Normal file
@@ -0,0 +1,284 @@
# ADR-017: LLMOps Observability

> **Status**: ✅ Approved
> **Date**: 2026-03-26
> **Decision maker**: Commander
> **Phase**: 15
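
## Background

AWOOOI is an AI-first product, so LLM calls are a core business flow. Before Phase 15, however, observability had the following blind spots:

### Problem 1: LLM calls are not traced

- Prompt content is not recorded, so issues cannot be reproduced
- Token consumption is not tracked, so costs can spiral out of control
- The model fallback chain is invisible
- AI hallucinations cannot be traced to their source

### Problem 2: Trace context is broken

- Redis Streams does not propagate the OTEL trace ID automatically
- LLM calls handled by the worker are disconnected from the originating request
- A frontend error cannot be traced back to the AI decision process

### Problem 3: Observability silos

- Sentry (frontend errors) and SignOz (backend traces) cannot be cross-linked
- The same trace must be searched for manually across three systems
- Troubleshooting is inefficient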

---

## Decision

Adopt a **three-layer observability architecture (The Holy Trinity)** so that no trace link is broken between layers:

```
┌────────────────────────────────────────────────────────────────────────────────┐
│                   AWOOOI Layered Observability Architecture                    │
├────────────────────────────────────────────────────────────────────────────────┤
│                                                                                │
│  Layer 3: AI decision layer            ← Langfuse (100% sampling)              │
│  ┌──────────────────────────────────────────────────────────────────────────┐  │
│  │ Prompt tracing │ Token cost │ Agent trajectory │ Hallucination detection │  │
│  └──────────────────────────────────────────────────────────────────────────┘  │
│                                       ▲                                        │
│                                ⚠️ Redis Streams                                │
│                           (manual context injection)                           │
│                                       │                                        │
│  Layer 2: Infrastructure layer         ← SignOz (10% sampling, 100% on errors) │
│  ┌──────────────────────────────────────────────────────────────────────────┐  │
│  │ API latency │ K8s resources │ DB queries │ Backend errors                │  │
│  └──────────────────────────────────────────────────────────────────────────┘  │
│                                       ▲                                        │
│                  (propagated automatically via HTTP headers)                   │
│                                       │                                        │
│  Layer 1: Frontend user layer          ← Sentry (10% sampling, 100% on errors) │
│  ┌──────────────────────────────────────────────────────────────────────────┐  │
│  │ JS errors │ Session replay │ Performance metrics │ Crash stack traces    │  │
│  └──────────────────────────────────────────────────────────────────────────┘  │
│                                                                                │
└────────────────────────────────────────────────────────────────────────────────┘
```

### Core components

| Component | Deployed at | Purpose |
|-----------|-------------|---------|
| Langfuse | 192.168.0.110:3100 | LLM call tracing, token cost |
| SignOz | 192.168.0.188:3301 | API traces, metrics, logs |
| Sentry | 192.168.0.110:9000 | Frontend errors, Session Replay |

---

## Technical implementation

### 1. Langfuse Client (`apps/api/src/services/langfuse_client.py`)

A dedicated LLM tracing client that provides:

```python
# Trace an LLM call chain with a context manager
from src.services.langfuse_client import langfuse_trace


# Wrapped in a coroutine because `async with` is only valid inside one
async def record_decision(prompt: str, result: str) -> None:
    async with langfuse_trace("openclaw_decision") as trace:
        # Record the LLM generation
        trace.generation(
            name="ollama_call",
            model="qwen2.5:7b-instruct",
            input=prompt,
            output=result,
            usage={"input": 500, "output": 200},
        )

        # Record a score (used to track prompt quality)
        trace.score("response_quality", value=0.95)
```

**Key features**:
- `LangfuseTraceContext`: integrates the OTEL trace_id automatically
- `langfuse_trace()`: context manager wrapper
- `langfuse_observe()`: decorator that traces a function automatically
- Injects the SignOz Deep Link into metadata automatically
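
As a rough illustration of the decorator idea, here is a minimal sketch of how something like `langfuse_observe()` could be structured. It does not use the Langfuse SDK: the in-memory list `RECORDED`, the decorator name `observe`, and the recorded fields are all assumptions made for this example, not the project's actual implementation.

```python
import asyncio
import functools
import time
from typing import Any, Awaitable, Callable

# Hypothetical in-memory sink standing in for the tracing backend.
RECORDED: list[dict[str, Any]] = []


def observe(name: str) -> Callable:
    """Return a decorator that records one entry per call of an async function."""

    def decorator(func: Callable[..., Awaitable[Any]]) -> Callable[..., Awaitable[Any]]:
        @functools.wraps(func)
        async def wrapper(*args: Any, **kwargs: Any) -> Any:
            started = time.perf_counter()
            entry: dict[str, Any] = {"name": name, "input": {"args": args, "kwargs": kwargs}}
            try:
                entry["output"] = await func(*args, **kwargs)
                return entry["output"]
            except Exception as exc:
                # Record the failure, then re-raise so callers still see it.
                entry["error"] = repr(exc)
                raise
            finally:
                # Runs on both success and failure.
                entry["duration_s"] = time.perf_counter() - started
                RECORDED.append(entry)

        return wrapper

    return decorator


@observe("ollama_call")
async def fake_llm(prompt: str) -> str:
    # Placeholder for a real model call.
    return prompt.upper()


result = asyncio.run(fake_llm("hello"))
```

After the call, `RECORDED` holds one entry with the call's name, input, output, and duration. A real integration would forward that entry to the tracing client instead of appending to a list.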

### 2. Deep Linking (`apps/api/src/core/deep_linking.py`)

A URL generator that cross-links the three systems:

```python
from src.core.deep_linking import DeepLinking

# Get every available Deep Link
links = DeepLinking.get_all_links(
    otel_trace_id="abc123...",
    langfuse_trace_id="lf-xyz...",
    sentry_issue_id="12345",
)
# {
#     "signoz_trace": "http://192.168.0.188:3301/trace/abc123...",
#     "langfuse_trace": "http://192.168.0.110:3100/project/awoooi-openclaw/traces/lf-xyz...",
#     "sentry_issue": "http://192.168.0.110:9000/organizations/sentry/issues/12345/",
# }
```

**URL formats**:

| System | URL format |
|--------|------------|
| SignOz Trace | `http://192.168.0.188:3301/trace/{trace_id}` |
| Langfuse Trace | `http://192.168.0.110:3100/project/awoooi-openclaw/traces/{id}` |
| Sentry Issue | `http://192.168.0.110:9000/organizations/sentry/issues/{id}/` |
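
The URL formats can be turned into a small builder. The sketch below is an illustration only: the function name `build_deep_links` and its behaviour of skipping missing IDs are assumptions, and it is not the actual `DeepLinking` class. The hosts, paths, and project name come from this document.

```python
from typing import Optional
from urllib.parse import quote

# Base URLs and path formats as listed in this ADR.
SIGNOZ_BASE = "http://192.168.0.188:3301"
LANGFUSE_BASE = "http://192.168.0.110:3100"
SENTRY_BASE = "http://192.168.0.110:9000"
LANGFUSE_PROJECT = "awoooi-openclaw"


def build_deep_links(
    otel_trace_id: Optional[str] = None,
    langfuse_trace_id: Optional[str] = None,
    sentry_issue_id: Optional[str] = None,
) -> dict[str, str]:
    """Return one link per ID that is provided; missing IDs are skipped."""
    links: dict[str, str] = {}
    if otel_trace_id:
        links["signoz_trace"] = f"{SIGNOZ_BASE}/trace/{quote(otel_trace_id, safe='')}"
    if langfuse_trace_id:
        links["langfuse_trace"] = (
            f"{LANGFUSE_BASE}/project/{LANGFUSE_PROJECT}/traces/"
            f"{quote(langfuse_trace_id, safe='')}"
        )
    if sentry_issue_id:
        links["sentry_issue"] = (
            f"{SENTRY_BASE}/organizations/sentry/issues/{quote(sentry_issue_id, safe='')}/"
        )
    return links


links = build_deep_links(otel_trace_id="abc123", sentry_issue_id="12345")
```

Percent-encoding each ID with `quote` keeps an unexpected character in an ID from changing the path of the generated URL.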

### 3. Redis trace context propagation

Fixes the broken trace link across Redis Streams:

```python
# Producer (webhooks.py): inject the trace context
async def publish_to_redis(signal: Signal):
    from src.core.telemetry import get_trace_context

    trace_ctx = get_trace_context()  # {"trace_id": "xxx", "span_id": "yyy"}
    payload = {
        **signal.dict(),
        "_trace_id": trace_ctx["trace_id"],
        "_span_id": trace_ctx["span_id"],
    }
    await redis.xadd("stream:signals", payload)


# Consumer (signal_worker.py): restore the trace context
async def consume_from_redis():
    from src.core.telemetry import restore_trace_context

    # Simplified: `message` stands for the field dict of a single stream entry
    message = await redis.xreadgroup(...)
    trace_id = message.pop("_trace_id", None)
    span_id = message.pop("_span_id", None)

    # Rebuild the OTEL context (W3C traceparent format)
    with restore_trace_context(trace_id, span_id):
        # Processing logic; Langfuse inherits this trace_id
        await process_signal(message)
```
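
The consumer comment mentions the W3C traceparent format. As an illustration, the sketch below carries that value as a single stream field instead of the separate `_trace_id` and `_span_id` fields used above. The function names are hypothetical, and no Redis or OpenTelemetry calls are involved; it only shows the encoding and validation step.

```python
import re
from typing import Optional

# W3C traceparent: version "00", 32-hex trace-id, 16-hex parent-id, 2-hex flags.
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def inject_trace_context(
    fields: dict[str, str], trace_id: str, span_id: str, sampled: bool = True
) -> dict[str, str]:
    """Return a copy of `fields` with a `traceparent` entry added."""
    flags = "01" if sampled else "00"
    return {**fields, "traceparent": f"00-{trace_id}-{span_id}-{flags}"}


def extract_trace_context(fields: dict[str, str]) -> Optional[tuple[str, str, bool]]:
    """Pop `traceparent` from `fields`; return (trace_id, span_id, sampled) or None."""
    match = _TRACEPARENT_RE.fullmatch(fields.pop("traceparent", ""))
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None  # all-zero IDs are invalid in the W3C format
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
span_id = "00f067aa0ba902b7"
message = inject_trace_context({"kind": "alert"}, trace_id, span_id)
restored = extract_trace_context(message)
```

Returning `None` for a missing or malformed value lets the worker start a fresh trace rather than fail on messages published before the injection was added.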

### 4. Link direction diagram

```
Sentry Error
    │  tags.otel_trace_id
    │  contexts.signoz.trace_url
    ▼
SignOz Trace
    │  span.langfuse.trace_id
    │  span.langfuse.trace_url
    ▼
Langfuse Generation
    │  metadata.otel_trace_id
    │  metadata.signoz_trace_url
    ▲
    └── Bidirectional link complete
```

---

## Deployment configuration

### Langfuse Self-Hosted (192.168.0.110:3100)

| Item | Value |
|------|-------|
| Host | 192.168.0.110 (DevOps vault) |
| Port | 3100 |
| Image | langfuse/langfuse:2 |
| Database | PostgreSQL 15 (bundled container) |
| Deployment tier | Container tier (Docker Compose) |

### Environment variables

```yaml
# K8s awoooi-secrets
LANGFUSE_ENABLED: "true"
LANGFUSE_URL: "http://192.168.0.110:3100"
LANGFUSE_PUBLIC_KEY: "pk-lf-..."
LANGFUSE_SECRET_KEY: "sk-lf-..."
```
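
A minimal sketch of how these four variables might be read at startup. The class `LangfuseSettings` and the function `load_langfuse_settings` are hypothetical names for this example; the actual handling lives in `apps/api/src/core/config.py` and may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LangfuseSettings:
    url: str
    public_key: str
    secret_key: str


def load_langfuse_settings(env: Mapping[str, str] = os.environ) -> Optional[LangfuseSettings]:
    """Return settings when tracing is enabled and fully configured, else None."""
    if env.get("LANGFUSE_ENABLED", "false").strip().lower() != "true":
        return None  # tracing switched off
    required = ("LANGFUSE_URL", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise ValueError(f"missing Langfuse settings: {', '.join(missing)}")
    return LangfuseSettings(
        url=env["LANGFUSE_URL"].rstrip("/"),
        public_key=env["LANGFUSE_PUBLIC_KEY"],
        secret_key=env["LANGFUSE_SECRET_KEY"],
    )


settings = load_langfuse_settings(
    {
        "LANGFUSE_ENABLED": "true",
        "LANGFUSE_URL": "http://192.168.0.110:3100",
        "LANGFUSE_PUBLIC_KEY": "pk-lf-...",
        "LANGFUSE_SECRET_KEY": "sk-lf-...",
    }
)
```

Failing loudly when tracing is enabled but a key is missing surfaces a misconfigured secret at startup instead of silently dropping traces later.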

### Project configuration

| Item | Value |
|------|-------|
| Project name | awoooi-openclaw |
| Admin account | admin@awoooi.local |

---

## Sampling-rate strategy

### Cost-control principles

| Tool | Normal traffic | On error | Rationale |
|------|----------------|----------|-----------|
| Sentry | 10% | 100% | High frontend traffic volume |
| SignOz | 10% | 100% | High API traffic volume |
| Langfuse | **100%** | 100% | AI decisions must be recorded in full |

### Dynamic sampling configuration

```python
# Sentry traces_sampler
def traces_sampler(sampling_context):
    # Assumes the caller supplies an "error" flag through the custom sampling
    # context; the SDK's default sampling context has no such key.
    if sampling_context.get("error"):
        return 1.0  # Always record errors
    return 0.1  # 10% of normal traffic
```
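
To see what the 10% / 100% split means for volume, the retained share of traffic is `error_rate * 1.0 + (1 - error_rate) * 0.1`. The sketch below works through that arithmetic; the function name and the 2% error rate are hypothetical inputs chosen for illustration, not measurements from AWOOOI.

```python
def retained_fraction(
    error_rate: float,
    normal_sample_rate: float = 0.1,
    error_sample_rate: float = 1.0,
) -> float:
    """Expected share of all requests that end up recorded."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    return error_rate * error_sample_rate + (1.0 - error_rate) * normal_sample_rate


# Hypothetical example: 2% of requests fail.
share = retained_fraction(0.02)  # 0.02 * 1.0 + 0.98 * 0.1, about 0.118
```

Under that assumption roughly 11.8% of requests are kept in Sentry and SignOz, while Langfuse at 100% keeps every LLM call.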

---

## Cost estimate

| Item | Monthly cost | Notes |
|------|--------------|-------|
| Langfuse | $0 | Self-hosted |
| Additional storage | ~$5 | PostgreSQL volume |
| **Total** | **~$5/month** | Nearly free |

---

## Consequences

### Pros

- **End-to-end traceability**: trace a frontend error back to the AI decision
- **Cost visibility**: real-time monitoring of token consumption
- **One-click navigation**: Deep Linking across the three systems
- **Prompt versioning**: supports A/B testing

### Cons

- **Added latency**: Langfuse writes add ~5 ms
- **Storage cost**: data accumulates quickly at 100% sampling

### Risks

- **Redis Streams broken link**: context injection must be maintained by hand
- **Langfuse outage**: LLM calls keep working; only tracing is lost
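
The "LLM calls keep working" property depends on tracing failures never reaching the caller. One way to get that behaviour is a fail-open wrapper around the tracing callback, sketched below. The names `fail_open`, `broken_backend`, and `call_llm` are hypothetical, and the model call is a placeholder; this is not the project's actual client.

```python
import logging
from typing import Callable

logger = logging.getLogger("tracing")


def fail_open(record: Callable[[dict], None]) -> Callable[[dict], None]:
    """Wrap a tracing callback so its exceptions are logged, never raised."""

    def safe_record(event: dict) -> None:
        try:
            record(event)
        except Exception:
            logger.warning("tracing backend unavailable; event dropped", exc_info=True)

    return safe_record


def broken_backend(event: dict) -> None:
    # Simulates an unreachable tracing server.
    raise ConnectionError("Langfuse unreachable")


def call_llm(prompt: str, record: Callable[[dict], None]) -> str:
    answer = f"echo: {prompt}"  # placeholder for the real model call
    record({"input": prompt, "output": answer})
    return answer


answer = call_llm("ping", fail_open(broken_backend))
```

Here `call_llm` still returns its answer even though the backend raised; the only effect of the outage is a warning in the log and a missing trace.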

---

## Related files

| File | Description |
|------|-------------|
| `apps/api/src/services/langfuse_client.py` | Langfuse Client wrapper |
| `apps/api/src/core/deep_linking.py` | Deep Linking URL generator |
| `apps/api/src/core/telemetry.py` | Trace context utility functions |
| `apps/api/src/core/config.py` | LANGFUSE_* settings |

---

## References

- [Langfuse Documentation](https://langfuse.com/docs)
- [W3C Trace Context](https://www.w3.org/TR/trace-context/)
- Memory: `project_phase15_llmops_observability.md`
- Memory: `project_phase15_langfuse.md`