docs(awooop): define private Ollama mesh gateway

Your Name
2026-05-05 22:56:22 +08:00
parent 7baa316224
commit ed7c6946cb
9 changed files with 786 additions and 9 deletions


@@ -5,6 +5,10 @@
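**Decision Maker**: Commander
**Related**: Supersedes ADR-105 (Revert A2 Ollama Primary)
> 2026-05-05 amendment: this ADR's "GCP-A → GCP-B → 111 → paid provider" logic remains valid,
> but the public GCP IP / 110 nginx proxy is only a transitional transport. The formal transport and runtime
> management are superseded by ADR-125 (GCP Ollama Private Mesh and AwoooP Inference Gateway).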
---
## Background
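@@ -62,3 +66,15 @@ K8s NetworkPolicy egress already adds /32 egress rules for GCP-A/GCP-B (port 11434)
- Primary Ollama traffic goes to GCP (SSD-backed, improved performance)
- Local 111 is retained as the last line of defense and is not deprecated
- The Gemini/Nemotron/Claude fallback chain is unchanged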
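## 2026-05-05 Field Correction
Measurements taken during the cold-start recovery showed:
- GCP-A / GCP-B are reachable through the 110 nginx proxy, but long prompts have returned 504.
- `/api/ps` shows `size_vram: 0` for GCP-A / GCP-B, so they must not be assumed equivalent to 111's GPU/VRAM inference capability.
- The synchronous alert path must use a fast-lane model such as `gemma3:4b`; 14B/32B models must move to async or to 111/GPU nodes.
- Public `34.143.170.20:11434` / `34.21.145.224:11434` are no longer regarded as the final security architecture.
ADR-125 is authoritative going forward: the WireGuard private mesh is the formal network layer, and the AwoooP
Inference Gateway is the formal runtime layer.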


@@ -0,0 +1,187 @@
# ADR-125: GCP Ollama Private Mesh and AwoooP Inference Gateway
**Status**: Accepted
**Date**: 2026-05-05 (Asia/Taipei)
**Decision Maker**: ogt / Codex
**Related**: ADR-110, ADR-111, ADR-113, ADR-121, ADR-124
---
## Context
ADR-110 moved Ollama priority from local-only 111 to a three-tier Ollama topology, with paid cloud as a fourth and final step:
1. GCP-A
2. GCP-B
3. Local 111
4. Paid cloud fallback only after all Ollama lanes fail
The 2026-05-05 dirty-reboot recovery and alert incident exposed two gaps in the
initial ADR-110 implementation:
- The live transport is `K8s Pod -> 192.168.0.110 nginx -> GCP public IP`, not a
true private network path.
- GCP-A and GCP-B reported `size_vram: 0` in `/api/ps`, so they are CPU-only from
Ollama's perspective. Private networking improves reachability and security,
but does not make these nodes equivalent to local 111 GPU/VRAM behavior.
The public nginx proxy is useful as a bootstrap bridge, but it must not become
the long-term primary transport for platform inference.
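The `size_vram: 0` observation can be checked mechanically. The following is a minimal sketch, assuming the usual `/api/ps` response shape (a `models` list whose entries carry `name`, `size`, and `size_vram`); the function name and the sample payload are hypothetical and the byte counts are illustrative only.

```python
from typing import Any


def cpu_only_models(ps_payload: dict[str, Any]) -> list[str]:
    """Return names of loaded models that report no VRAM usage."""
    names = []
    for model in ps_payload.get("models", []):
        # size_vram == 0 means none of the model's memory is resident on a GPU.
        if model.get("size_vram", 0) == 0:
            names.append(model.get("name", "<unknown>"))
    return names


# Illustrative payload, not captured from GCP-A or GCP-B.
sample = {
    "models": [
        {"name": "gemma3:4b", "size": 4_000_000_000, "size_vram": 0},
        {"name": "qwen2.5-coder:7b", "size": 6_000_000_000, "size_vram": 6_000_000_000},
    ]
}
print(cpu_only_models(sample))
```

A node whose every loaded model appears in this list should be treated as CPU-only for routing purposes, regardless of how it is reached.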
## Decision
Adopt a two-layer target architecture:
### D1 - WireGuard private mesh is the target transport
AwoooP uses a WireGuard site-to-site mesh for GCP Ollama access.
Planned mesh CIDR is `10.77.114.0/24`, with the following per-host addresses:
| Host | Role | WireGuard IP |
|------|------|--------------|
| 110 | DevOps / transitional proxy / optional mesh router | `10.77.114.10` |
| 120 | K3s control-plane node | `10.77.114.120` |
| 121 | K3s control-plane node | `10.77.114.121` |
| 111 | Local Ollama fallback | `10.77.114.111` |
| GCP-A | Ollama primary | `10.77.114.21` |
| GCP-B | Ollama secondary | `10.77.114.22` |
Ollama endpoints after cutover:
| Tier | Endpoint |
|------|----------|
| Primary | `http://10.77.114.21:11434` |
| Secondary | `http://10.77.114.22:11434` |
| Fallback | `http://10.77.114.111:11434` |
The current `192.168.0.110:11435/11436` nginx proxy remains an emergency bridge
only until the mesh cutover passes shadow and canary gates.
### D2 - Public Ollama exposure is forbidden after cutover
After mesh cutover:
- GCP firewall must deny public `0.0.0.0/0 -> 11434`.
- Ollama should bind to the mesh interface, or the host firewall should allow
`11434/tcp` only from `10.77.114.0/24`.
- K8s NetworkPolicy should allow egress only to the mesh IPs for Ollama.
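The intent of the D2 rules can be modeled as a simple source-address check. This is only a sketch of the policy logic, not a firewall or NetworkPolicy definition; the function name is hypothetical, and `203.0.113.7` in the usage lines is a documentation-range address standing in for any public source.

```python
import ipaddress

MESH_CIDR = ipaddress.ip_network("10.77.114.0/24")
OLLAMA_PORT = 11434


def allow_ollama_connection(source_ip: str, dest_port: int) -> bool:
    """Model the D2 rule: 11434/tcp is reachable only from mesh addresses."""
    if dest_port != OLLAMA_PORT:
        # This sketch only reasons about the Ollama port.
        return False
    return ipaddress.ip_address(source_ip) in MESH_CIDR


print(allow_ollama_connection("10.77.114.120", 11434))  # mesh peer (host 120)
print(allow_ollama_connection("203.0.113.7", 11434))    # public source
```

The same predicate can serve as an audit check against exported firewall rules once the cutover is done.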
### D3 - AwoooP Inference Gateway owns runtime routing
Provider clients should stop selecting raw Ollama hosts directly. They should
call an AwoooP Inference Gateway that owns:
- endpoint health and circuit breakers
- per-lane concurrency limits
- model residency and keep-alive policy
- request timeouts by intent
- token/cost audit spans
- fallback order: GCP-A -> GCP-B -> 111 -> paid provider
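The fallback order combined with circuit breakers might look like the sketch below. It is an assumption-laden illustration rather than the gateway's design: the `Backend` class, the `route` function, and the threshold of three consecutive failures are all hypothetical, and the breaker has no half-open recovery step.

```python
from dataclasses import dataclass
from typing import Callable

FALLBACK_ORDER = ("GCP-A", "GCP-B", "111", "paid provider")


@dataclass
class Backend:
    name: str
    failures: int = 0
    threshold: int = 3  # assumed: open the breaker after 3 consecutive failures

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold


def route(backends: list[Backend], call: Callable[[str], str]) -> tuple[str, str]:
    """Try backends in order, skipping open breakers; return (backend name, result)."""
    for backend in backends:
        if backend.open:
            continue
        try:
            result = call(backend.name)
        except OSError:
            # Connection errors and timeouts count against the breaker.
            backend.failures += 1
            continue
        backend.failures = 0
        return backend.name, result
    raise RuntimeError("all backends failed or have open breakers")
```

Here `call` stands in for whatever performs the actual inference request, so the routing policy can be exercised without a live Ollama node.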
The gateway may initially expose an Ollama-compatible surface:
| Endpoint | Purpose |
|----------|---------|
| `/api/tags` | health/model inventory |
| `/api/ps` | residency inventory |
| `/api/generate` | Ollama-compatible generation |
| `/v1/awooop/inference/runs` | future async AwoooP run API |
Gateway requests must carry `project_id`, `trace_id`, and an intent/lane label
when called from AwoooP-aware code.
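A gateway could reject AwoooP-aware requests that lack the required metadata before any routing happens. In this minimal sketch the request is a plain dict, and the key `lane` is an assumed name for the intent/lane label; the ADR does not fix the field name.

```python
# Assumed field names; only project_id and trace_id are named by the ADR.
REQUIRED_FIELDS = ("project_id", "trace_id", "lane")


def missing_metadata(request: dict) -> list[str]:
    """Return the required fields that are absent or empty in the request."""
    return [field for field in REQUIRED_FIELDS if not request.get(field)]


print(missing_metadata({"project_id": "demo", "trace_id": "t-1"}))
```

Returning the list of missing fields, rather than a boolean, lets the gateway produce an actionable error for the caller.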
### D4 - Alert lane is protected
Alert diagnosis must not share an unconstrained queue with heavy code-review or
deep-RCA jobs.
Initial lanes:
| Lane | Model | Primary use | Default timeout |
|------|-------|-------------|-----------------|
| `alert-fast` | `gemma3:4b` | Telegram incident cards and low-risk RCA | 45s |
| `code-review` | `qwen2.5-coder:7b` | Gitea review | 90s |
| `embedding` | `bge-m3` | RAG embeddings | 30s |
| `deep-rca` | 111-hosted 14B-class model | slow human-reviewed diagnosis | async only |
No 14B/32B model may evict `alert-fast` residency on GCP-A/GCP-B unless the
gateway explicitly opens a maintenance window.
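The lane table and the residency rule can be expressed as data plus a guard. This sketch simplifies the rule: instead of tracking evictions, it refuses to load any 14B/32B model on GCP-A/GCP-B outside a maintenance window. `LanePolicy`, `may_load_on_gcp`, and the tag-parsing heuristic are hypothetical, and `qwen2.5:14b` in the usage line is only an example tag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LanePolicy:
    model: str
    timeout_s: Optional[int]  # None means the lane is async only


LANES = {
    "alert-fast": LanePolicy("gemma3:4b", 45),
    "code-review": LanePolicy("qwen2.5-coder:7b", 90),
    "embedding": LanePolicy("bge-m3", 30),
    "deep-rca": LanePolicy("111-hosted 14B-class model", None),
}

HEAVY_SIZES = ("14b", "32b")


def may_load_on_gcp(model: str, maintenance_window: bool) -> bool:
    """Heavy models may load on GCP-A/GCP-B only during a maintenance window."""
    # Assumed heuristic: the parameter size is the start of the tag after ':'.
    tag = model.lower().rpartition(":")[2]
    is_heavy = any(tag.startswith(size) for size in HEAVY_SIZES)
    return maintenance_window or not is_heavy


print(may_load_on_gcp("qwen2.5:14b", maintenance_window=False))
```

Keeping the lane table as data makes it easy to review timeout changes in the same way as any other config diff.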
## Migration Plan
### Phase 0 - Current bridge
- Keep `192.168.0.110:11435` and `192.168.0.110:11436` active.
- Alert path uses `ALERT_OLLAMA_MODEL=gemma3:4b`.
- Gemini remains paid emergency fallback only.
### Phase 1 - Mesh build in parallel
- Install WireGuard on 110, 120, 121, 111, GCP-A, and GCP-B.
- Assign mesh IPs from `10.77.114.0/24`.
- Keep public proxy and old env values unchanged.
- Verify `/api/tags`, `/api/ps`, and `gemma3:4b` generation over mesh.
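The verification step could be scripted along these lines. The sketch assumes that an HTTP 200 from each of the three endpoints is a sufficient pass criterion and that a one-word prompt is an adequate generation probe; the function names and the 45-second timeout default (borrowed from the `alert-fast` lane) are assumptions. The transport is injected so the check logic can be exercised without network access.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Optional

# (method, url, body) -> HTTP status code, or 0 when the host is unreachable
Fetch = Callable[[str, str, Optional[bytes]], int]


def mesh_checks(base_url: str, model: str = "gemma3:4b"):
    """List the three requests used to verify one mesh endpoint."""
    body = json.dumps({"model": model, "prompt": "ping", "stream": False}).encode()
    return [
        ("GET", f"{base_url}/api/tags", None),
        ("GET", f"{base_url}/api/ps", None),
        ("POST", f"{base_url}/api/generate", body),
    ]


def verify_endpoint(base_url: str, fetch: Fetch) -> dict[str, bool]:
    """Run every check and report pass/fail per URL."""
    return {url: fetch(method, url, body) == 200
            for method, url, body in mesh_checks(base_url)}


def http_fetch(method: str, url: str, body: Optional[bytes], timeout: float = 45.0) -> int:
    """Real transport built on urllib; returns the status code or 0."""
    request = urllib.request.Request(
        url, data=body, method=method, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 504 from an intermediate proxy
    except OSError:
        return 0  # connection refused, DNS failure, timeout
```

Running `verify_endpoint("http://10.77.114.21:11434", http_fetch)` from a mesh peer would exercise GCP-A; the same call with the other mesh IPs covers GCP-B and 111.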
### Phase 2 - Shadow mesh
- Add shadow health checks from the API pod to mesh endpoints.
- Emit OTel spans with both `active_endpoint` and `shadow_endpoint`.
- Do not send production inference traffic to mesh yet.
Promotion gate:
- 24h continuous mesh health
- p95 `alert-fast` latency over the mesh is at most 10% above the current proxy p95
- zero public-path-only success events (requests that succeed over the public path but fail over the mesh)
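The Phase 2 promotion gate can be evaluated as a single predicate. This is a sketch under stated assumptions: p95 is computed with the nearest-rank method, latencies are plain numbers in a consistent unit, and a "public-path-only success" is counted upstream as a request that succeeded via the proxy while its shadow mesh check failed. The function names are hypothetical.

```python
import math
from typing import Sequence


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile."""
    if not samples:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def phase2_gate(
    healthy_hours: float,
    mesh_latencies: Sequence[float],
    proxy_latencies: Sequence[float],
    public_only_successes: int,
) -> bool:
    """True when all three Phase 2 promotion conditions hold."""
    return (
        healthy_hours >= 24
        and p95(mesh_latencies) <= 1.10 * p95(proxy_latencies)
        and public_only_successes == 0
    )
```

Feeding the function from the shadow health-check spans keeps the promotion decision reproducible from recorded telemetry.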
### Phase 3 - Switch active endpoints
Set production env:
```yaml
OLLAMA_URL: "http://10.77.114.21:11434"
OLLAMA_SECONDARY_URL: "http://10.77.114.22:11434"
OLLAMA_FALLBACK_URL: "http://10.77.114.111:11434"
```
Promotion gate:
- 7 days canary
- Gemini usage for the alert lane is zero, except during a documented all-Ollama outage
- no alert-card timeout regression
### Phase 4 - Close public exposure
- Remove or firewall public GCP `11434/tcp`.
- Keep the nginx bridge config, but disable its listener or restrict it to
operator-only rollback use.
## Rollback
Rollback is env-only while the bridge remains available:
```yaml
OLLAMA_URL: "http://192.168.0.110:11435"
OLLAMA_SECONDARY_URL: "http://192.168.0.110:11436"
OLLAMA_FALLBACK_URL: "http://192.168.0.111:11434"
```
If GCP-A/B are unstable, force 111-first temporarily:
```yaml
OLLAMA_URL: "http://192.168.0.111:11434"
OLLAMA_SECONDARY_URL: "http://192.168.0.110:11435"
OLLAMA_FALLBACK_URL: "http://192.168.0.110:11436"
```
Paid provider fallback must remain budget-gated.
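Both rollback variants reuse the same three variable names, so a client that derives its endpoint order purely from the environment needs no code change to roll back. A minimal sketch of such a loader follows; the function name is hypothetical, and skipping unset variables is an assumed behavior.

```python
import os
from typing import Mapping

# Priority order: primary, secondary, fallback.
ENV_ORDER = ("OLLAMA_URL", "OLLAMA_SECONDARY_URL", "OLLAMA_FALLBACK_URL")


def ollama_endpoints(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return configured Ollama base URLs in priority order, skipping unset ones."""
    return [env[key].rstrip("/") for key in ENV_ORDER if env.get(key)]


# The 111-first rollback values from this section.
rollback_env = {
    "OLLAMA_URL": "http://192.168.0.111:11434",
    "OLLAMA_SECONDARY_URL": "http://192.168.0.110:11435",
    "OLLAMA_FALLBACK_URL": "http://192.168.0.110:11436",
}
print(ollama_endpoints(rollback_env))
```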
## Consequences
- GCP Ollama becomes private-by-default instead of public-IP dependent.
- K8s NetworkPolicy can move from public `/32` rules to stable mesh `/32` rules.
- AwoooP can manage Ollama as a platform resource shared by all tenants.
- CPU-only GCP performance remains a capacity constraint; routing must keep
heavy jobs off the alert lane or use GPU-capable GCP nodes.