docs(awooop): define private Ollama mesh gateway

2026-05-05 22:56:22 +08:00
parent 7baa316224
commit ed7c6946cb
9 changed files with 786 additions and 9 deletions
--- a/docs/adr/ADR-110-gcp-ollama-topology.md
+++ b/docs/adr/ADR-110-gcp-ollama-topology.md
@@ -5,6 +5,10 @@
 **Decision Maker**: Commander (統帥)
 **Related**: Supersedes ADR-105 (Revert A2 Ollama Primary)

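+> 2026-05-05 correction: this ADR's "GCP-A → GCP-B → 111 → paid provider" logic
+> remains valid, but the public GCP IP / 110 nginx proxy is only a transitional
+> transport. The production transport and runtime management are superseded by
+> ADR-125 (GCP Ollama Private Mesh and AwoooP Inference Gateway).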
+
 ---

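 ## Background
@@ -62,3 +66,15 @@ K8s NetworkPolicy egress now includes /32 egress rules for GCP-A/GCP-B (port 11434
 - Primary Ollama traffic runs on GCP SSD, improving performance
 - Local 111 is retained as the last line of defense and is not retired
 - The Gemini/Nemotron/Claude fallback chain is unchanged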
+
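+## 2026-05-05 Field Correction
+
+Measurements taken during the cold-start recovery showed:
+
+- GCP-A / GCP-B are reachable through the 110 nginx proxy, but long prompts have returned 504.
+- `/api/ps` reports `size_vram: 0` for GCP-A / GCP-B, so they must not be assumed to match 111's GPU/VRAM inference capability.
+- The synchronous alert path must use a fast-lane model such as `gemma3:4b`; 14B/32B models must move to async or to 111/GPU nodes.
+- The public endpoints `34.143.170.20:11434` / `34.21.145.224:11434` are no longer considered the final security architecture.
+
+ADR-125 governs from here on: the WireGuard private mesh is the official network
+layer, and the AwoooP Inference Gateway is the official runtime layer.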
--- a/docs/adr/ADR-125-gcp-ollama-private-mesh-inference-gateway.md
+++ b/docs/adr/ADR-125-gcp-ollama-private-mesh-inference-gateway.md
@@ -0,0 +1,187 @@
+# ADR-125: GCP Ollama Private Mesh and AwoooP Inference Gateway
+
+**Status**: Accepted  
+**Date**: 2026-05-05 (Asia/Taipei)  
+**Decision Maker**: ogt / Codex  
+**Related**: ADR-110, ADR-111, ADR-113, ADR-121, ADR-124
+
+---
+
+## Context
+
+ADR-110 moved Ollama priority from local-only 111 to a three-layer Ollama
+topology (items 1-3), with paid cloud as the last resort:
+
+1. GCP-A
+2. GCP-B
+3. Local 111
+4. Paid cloud fallback only after all Ollama lanes fail
+
+The 2026-05-05 dirty-reboot recovery and alert incident exposed two gaps in the
+initial ADR-110 implementation:
+
+- The live transport is `K8s Pod -> 192.168.0.110 nginx -> GCP public IP`, not a
+  true private network path.
+- GCP-A and GCP-B reported `size_vram: 0` in `/api/ps`, so they are CPU-only from
+  Ollama's perspective. Private networking improves reachability and security,
+  but does not make these nodes equivalent to local 111 GPU/VRAM behavior.
+
+The public nginx proxy is useful as a bootstrap bridge, but it must not become
+the long-term primary transport for platform inference.
+
+## Decision
+
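+Adopt a two-layer target architecture: a WireGuard private mesh as the network
+layer (D1, D2) and the AwoooP Inference Gateway as the runtime layer (D3, D4).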
+
+### D1 - WireGuard private mesh is the target transport
+
+AwoooP uses a WireGuard site-to-site mesh for GCP Ollama access.
+
+Planned mesh addressing (CIDR `10.77.114.0/24`):
+
+| Host | Role | WireGuard IP |
+|------|------|--------------|
+| 110 | DevOps / transitional proxy / optional mesh router | `10.77.114.10` |
+| 120 | K3s control-plane node | `10.77.114.120` |
+| 121 | K3s control-plane node | `10.77.114.121` |
+| 111 | Local Ollama fallback | `10.77.114.111` |
+| GCP-A | Ollama primary | `10.77.114.21` |
+| GCP-B | Ollama secondary | `10.77.114.22` |
+
+Ollama endpoints after cutover:
+
+| Tier | Endpoint |
+|------|----------|
+| Primary | `http://10.77.114.21:11434` |
+| Secondary | `http://10.77.114.22:11434` |
+| Fallback | `http://10.77.114.111:11434` |
+
+The current `192.168.0.110:11435/11436` nginx proxy remains an emergency bridge
+only until the mesh cutover passes shadow and canary gates.
+
+### D2 - Public Ollama exposure is forbidden after cutover
+
+After mesh cutover:
+
+- GCP firewall must deny public `0.0.0.0/0 -> 11434`.
+- Ollama should bind to the mesh interface, or the host firewall should allow
+  `11434/tcp` only from `10.77.114.0/24`.
+- K8s NetworkPolicy should allow egress only to the mesh IPs for Ollama.
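+
+A minimal sketch of the NetworkPolicy rule above, assuming a hypothetical
+namespace `awooop` and pod label `app: awooop-api` (neither is defined in this
+ADR). Once a pod is selected by an Egress policy, only explicitly allowed egress
+is permitted, so DNS and other required destinations need their own rules:
+
+```yaml
+apiVersion: networking.k8s.io/v1
+kind: NetworkPolicy
+metadata:
+  name: allow-ollama-mesh-egress   # hypothetical name
+  namespace: awooop                # assumption: not specified in this ADR
+spec:
+  podSelector:
+    matchLabels:
+      app: awooop-api              # assumption: label of the calling pods
+  policyTypes:
+    - Egress
+  egress:
+    - to:
+        - ipBlock:
+            cidr: 10.77.114.21/32  # GCP-A
+        - ipBlock:
+            cidr: 10.77.114.22/32  # GCP-B
+        - ipBlock:
+            cidr: 10.77.114.111/32 # local 111
+      ports:
+        - protocol: TCP
+          port: 11434
+```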
+
+### D3 - AwoooP Inference Gateway owns runtime routing
+
+Provider clients should stop selecting raw Ollama hosts directly. They should
+call an AwoooP Inference Gateway that owns:
+
+- endpoint health and circuit breakers
+- per-lane concurrency limits
+- model residency and keep-alive policy
+- request timeouts by intent
+- token/cost audit spans
+- fallback order: GCP-A -> GCP-B -> 111 -> paid provider
+
+The gateway may initially expose an Ollama-compatible surface:
+
+| Endpoint | Purpose |
+|----------|---------|
+| `/api/tags` | health/model inventory |
+| `/api/ps` | residency inventory |
+| `/api/generate` | Ollama-compatible generation |
+| `/v1/awooop/inference/runs` | future async AwoooP run API |
+
+Gateway requests must carry `project_id`, `trace_id`, and an intent/lane label
+when called from AwoooP-aware code.
+
+### D4 - Alert lane is protected
+
+Alert diagnosis must not share an unconstrained queue with heavy code-review or
+deep-RCA jobs.
+
+Initial lanes:
+
+| Lane | Model | Primary use | Default timeout |
+|------|-------|-------------|-----------------|
+| `alert-fast` | `gemma3:4b` | Telegram incident cards and low-risk RCA | 45s |
+| `code-review` | `qwen2.5-coder:7b` | Gitea review | 90s |
+| `embedding` | `bge-m3` | RAG embeddings | 30s |
+| `deep-rca` | 111-hosted 14B-class model | slow human-reviewed diagnosis | async only |
+
+No 14B/32B model may evict `alert-fast` residency on GCP-A/GCP-B unless the
+gateway explicitly opens a maintenance window.
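+
+As an illustration only, the lane table could be captured in a gateway config
+such as the following. The schema and every key name are hypothetical; only the
+lane names, models, timeouts, and fallback order come from this ADR:
+
+```yaml
+# Hypothetical gateway lane config; all key names are assumptions.
+lanes:
+  alert-fast:
+    model: "gemma3:4b"
+    timeout_seconds: 45
+    protected_residency: true   # 14B/32B models may not evict it outside a maintenance window
+  code-review:
+    model: "qwen2.5-coder:7b"
+    timeout_seconds: 90
+  embedding:
+    model: "bge-m3"
+    timeout_seconds: 30
+  deep-rca:
+    mode: async                 # async only; no synchronous timeout
+    host: "111"                 # 111-hosted 14B-class model; exact model not specified here
+fallback_order: ["GCP-A", "GCP-B", "111", "paid-provider"]
+```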
+
+## Migration Plan
+
+### Phase 0 - Current bridge
+
+- Keep `192.168.0.110:11435` and `192.168.0.110:11436` active.
+- Alert path uses `ALERT_OLLAMA_MODEL=gemma3:4b`.
+- Gemini remains paid emergency fallback only.
+
+### Phase 1 - Mesh build in parallel
+
+- Install WireGuard on 110, 120, 121, 111, GCP-A, and GCP-B.
+- Assign mesh IPs from `10.77.114.0/24`.
+- Keep public proxy and old env values unchanged.
+- Verify `/api/tags`, `/api/ps`, and `gemma3:4b` generation over mesh.
+
+### Phase 2 - Shadow mesh
+
+- Add shadow health checks from the API pod to mesh endpoints.
+- Emit OTel spans with both `active_endpoint` and `shadow_endpoint`.
+- Do not send production inference traffic to mesh yet.
+
+Promotion gate:
+
+- 24h continuous mesh health
+- p95 `alert-fast` latency over mesh <= 110% of the current proxy p95
+- zero public-path-only success events
+
+### Phase 3 - Switch active endpoints
+
+Set production env:
+
+```yaml
+OLLAMA_URL: "http://10.77.114.21:11434"
+OLLAMA_SECONDARY_URL: "http://10.77.114.22:11434"
+OLLAMA_FALLBACK_URL: "http://10.77.114.111:11434"
+```
+
+Promotion gate:
+
+- 7 days canary
+- Gemini usage for the alert lane is zero, except during a documented all-Ollama outage
+- no alert-card timeout regression
+
+### Phase 4 - Close public exposure
+
+- Remove or firewall public GCP `11434/tcp`.
+- Keep the nginx bridge config, but disable its listener or restrict it to
+  operator-only rollback use.
+
+## Rollback
+
+Rollback is env-only while the bridge remains available:
+
+```yaml
+OLLAMA_URL: "http://192.168.0.110:11435"
+OLLAMA_SECONDARY_URL: "http://192.168.0.110:11436"
+OLLAMA_FALLBACK_URL: "http://192.168.0.111:11434"
+```
+
+If GCP-A/B are unstable, force 111-first temporarily:
+
+```yaml
+OLLAMA_URL: "http://192.168.0.111:11434"
+OLLAMA_SECONDARY_URL: "http://192.168.0.110:11435"
+OLLAMA_FALLBACK_URL: "http://192.168.0.110:11436"
+```
+
+Paid provider fallback must remain budget-gated.
+
+## Consequences
+
+- GCP Ollama becomes private-by-default instead of public-IP dependent.
+- K8s NetworkPolicy can move from public `/32` rules to stable mesh `/32` rules.
+- AwoooP can manage Ollama as a platform resource shared by all tenants.
+- CPU-only GCP performance remains a capacity constraint; routing must keep
+  heavy jobs off the alert lane or use GPU-capable GCP nodes.
+