docs(awooop): define private Ollama mesh gateway
All checks were successful
Code Review / ai-code-review (push) Successful in 10s
@@ -5,6 +5,10 @@
**Decision Maker**: Commander (統帥)
**Related**: Supersedes ADR-105 (Revert A2 Ollama Primary)

> 2026-05-05 correction: this ADR's "GCP-A → GCP-B → 111 → paid provider" logic remains valid,
> but the public GCP IP / 110 nginx proxy is only a transitional transport. The formal transport and runtime
> management are taken over by ADR-125 (GCP Ollama Private Mesh and AwoooP Inference Gateway).

---

## Background

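@@ -62,3 +66,15 @@ K8s NetworkPolicy egress now includes /32 egress rules for GCP-A/GCP-B (port 11434
- Primary Ollama traffic runs on GCP SSD, improving performance
- Local 111 is retained as the last line of defense and is not deprecated
- The Gemini/Nemotron/Claude fallback chain is unchanged

## 2026-05-05 Field Correction

Measurements taken during the cold-start recovery showed:

- GCP-A / GCP-B are reachable through the 110 nginx proxy, but long prompts have returned 504.
- `/api/ps` reports `size_vram: 0` for GCP-A / GCP-B, so they must not be assumed equivalent to 111's GPU/VRAM inference capability.
- The synchronous alert path must use a fast-lane model such as `gemma3:4b`; 14B/32B models must move to async or to 111/GPU nodes.
- The public `34.143.170.20:11434` / `34.21.145.224:11434` endpoints are no longer considered the final security architecture.

ADR-125 is authoritative going forward: the WireGuard private mesh is the formal network layer, and the AwoooP
Inference Gateway is the formal runtime layer.
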
187
docs/adr/ADR-125-gcp-ollama-private-mesh-inference-gateway.md
Normal file
@@ -0,0 +1,187 @@
# ADR-125: GCP Ollama Private Mesh and AwoooP Inference Gateway

**Status**: Accepted
**Date**: 2026-05-05 (Asia/Taipei)
**Decision Maker**: ogt / Codex
**Related**: ADR-110, ADR-111, ADR-113, ADR-121, ADR-124

---

## Context

ADR-110 moved Ollama priority from local-only 111 to a three-layer Ollama topology, with paid cloud as a last resort:

1. GCP-A
2. GCP-B
3. Local 111
4. Paid cloud fallback only after all Ollama lanes fail

The 2026-05-05 dirty-reboot recovery and alert incident exposed two gaps in the
initial ADR-110 implementation:

- The live transport is `K8s Pod -> 192.168.0.110 nginx -> GCP public IP`, not a
  true private network path.
- GCP-A and GCP-B reported `size_vram: 0` in `/api/ps`, so they are CPU-only from
  Ollama's perspective. Private networking improves reachability and security,
  but does not make these nodes equivalent to local 111 GPU/VRAM behavior.

The public nginx proxy is useful as a bootstrap bridge, but it must not become
the long-term primary transport for platform inference.

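The `size_vram: 0` observation can be turned into a repeatable check. The sketch below is illustrative only: it assumes the `/api/ps` response is a JSON object with a `models` list whose entries carry `name` and `size_vram`, and the sample payload is hypothetical rather than captured from GCP-A or GCP-B.

```python
import json


def cpu_only_models(ps_payload: dict) -> list:
    """Return names of loaded models that report zero bytes resident in VRAM."""
    return [
        model.get("name", "<unknown>")
        for model in ps_payload.get("models", [])
        if model.get("size_vram", 0) == 0
    ]


# Hypothetical /api/ps body, trimmed to the fields used above.
sample = json.loads("""
{"models": [
  {"name": "gemma3:4b", "size": 4000000000, "size_vram": 0},
  {"name": "bge-m3", "size": 1200000000, "size_vram": 1200000000}
]}
""")

print(cpu_only_models(sample))
```

A node whose loaded models all appear in this list should be treated as CPU-only when sizing timeouts and lanes.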
## Decision

Adopt a two-layer target architecture:

### D1 - WireGuard private mesh is the target transport

AwoooP uses a WireGuard site-to-site mesh for GCP Ollama access.

Planned mesh addressing (CIDR `10.77.114.0/24`):

| Host | Role | WireGuard IP |
|------|------|--------------|
| 110 | DevOps / transitional proxy / optional mesh router | `10.77.114.10` |
| 120 | K3s control-plane node | `10.77.114.120` |
| 121 | K3s control-plane node | `10.77.114.121` |
| 111 | Local Ollama fallback | `10.77.114.111` |
| GCP-A | Ollama primary | `10.77.114.21` |
| GCP-B | Ollama secondary | `10.77.114.22` |

Ollama endpoints after cutover:

| Tier | Endpoint |
|------|----------|
| Primary | `http://10.77.114.21:11434` |
| Secondary | `http://10.77.114.22:11434` |
| Fallback | `http://10.77.114.111:11434` |

The current `192.168.0.110:11435/11436` nginx proxy remains an emergency bridge
only until the mesh cutover passes shadow and canary gates.

### D2 - Public Ollama exposure is forbidden after cutover

After mesh cutover:

- GCP firewall must deny public `0.0.0.0/0 -> 11434`.
- Ollama should bind to the mesh interface, or the host firewall should allow
  `11434/tcp` only from `10.77.114.0/24`.
- K8s NetworkPolicy should allow egress only to the mesh IPs for Ollama.

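One way to guard against a public or bridge endpoint slipping back into configuration after cutover is a small pre-deploy check. This is a minimal sketch, not part of the ADR: the function name `is_mesh_endpoint` is hypothetical, and it assumes endpoints are written as literal IPs (hostnames are rejected rather than resolved).

```python
import ipaddress
from urllib.parse import urlsplit

MESH_CIDR = ipaddress.ip_network("10.77.114.0/24")
OLLAMA_PORT = 11434


def is_mesh_endpoint(url: str) -> bool:
    """True if the URL targets a literal IP inside the mesh CIDR on the Ollama port."""
    parts = urlsplit(url)
    try:
        host = ipaddress.ip_address(parts.hostname or "")
    except ValueError:
        return False  # not a literal IP address
    return host in MESH_CIDR and parts.port == OLLAMA_PORT


endpoints = [
    "http://10.77.114.21:11434",
    "http://10.77.114.22:11434",
    "http://10.77.114.111:11434",
    "http://34.143.170.20:11434",  # public GCP address, forbidden after cutover
    "http://192.168.0.110:11435",  # transitional nginx bridge
]
violations = [url for url in endpoints if not is_mesh_endpoint(url)]
print(violations)
```

The same predicate could back a CI check on the production env values, so a rollback to the bridge is always a deliberate, visible change.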
### D3 - AwoooP Inference Gateway owns runtime routing

Provider clients should stop selecting raw Ollama hosts directly. They should
call an AwoooP Inference Gateway that owns:

- endpoint health and circuit breakers
- per-lane concurrency limits
- model residency and keep-alive policy
- request timeouts by intent
- token/cost audit spans
- fallback order: GCP-A -> GCP-B -> 111 -> paid provider

The gateway may initially expose an Ollama-compatible surface:

| Endpoint | Purpose |
|----------|---------|
| `/api/tags` | health/model inventory |
| `/api/ps` | residency inventory |
| `/api/generate` | Ollama-compatible generation |
| `/v1/awooop/inference/runs` | future async AwoooP run API |

Gateway requests must carry `project_id`, `trace_id`, and an intent/lane label
when called from AwoooP-aware code.

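The fallback order and circuit-breaker responsibility in D3 can be sketched as follows. This is an illustration under stated assumptions, not the gateway's implementation: the `Lane` class, the `route` function, the failure threshold of 3, and the 30-second cooldown are all hypothetical, and the paid-provider step is only marked by the final error.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Lane:
    name: str
    url: str
    failure_threshold: int = 3         # assumed, not specified by the ADR
    cooldown_s: float = 30.0           # assumed, not specified by the ADR
    failures: int = 0
    opened_at: Optional[float] = None  # time the breaker opened, if open

    def available(self, now: float) -> bool:
        # Closed breaker, or open breaker whose cooldown has elapsed (half-open).
        return self.opened_at is None or now - self.opened_at >= self.cooldown_s

    def record(self, ok: bool, now: float) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now


def route(lanes: List[Lane], send: Callable[[str], str],
          clock: Callable[[], float] = time.monotonic) -> Tuple[str, str]:
    """Try lanes in order, skipping open breakers; return (lane name, result)."""
    for lane in lanes:
        if not lane.available(clock()):
            continue
        try:
            result = send(lane.url)
        except Exception:
            lane.record(False, clock())
            continue
        lane.record(True, clock())
        return lane.name, result
    raise RuntimeError("all Ollama lanes unavailable; paid fallback would go here")


lanes = [
    Lane("gcp-a", "http://10.77.114.21:11434"),
    Lane("gcp-b", "http://10.77.114.22:11434"),
    Lane("local-111", "http://10.77.114.111:11434"),
]


def fake_send(url: str) -> str:
    """Stand-in transport that simulates a GCP-A outage."""
    if url.endswith(".21:11434"):
        raise ConnectionError("simulated GCP-A outage")
    return "ok from " + url


print(route(lanes, fake_send))
```

Keeping breaker state inside the gateway, rather than in each provider client, is what lets every caller benefit from one failure observation.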
### D4 - Alert lane is protected

Alert diagnosis must not share an unconstrained queue with heavy code-review or
deep-RCA jobs.

Initial lanes:

| Lane | Model | Primary use | Default timeout |
|------|-------|-------------|-----------------|
| `alert-fast` | `gemma3:4b` | Telegram incident cards and low-risk RCA | 45s |
| `code-review` | `qwen2.5-coder:7b` | Gitea review | 90s |
| `embedding` | `bge-m3` | RAG embeddings | 30s |
| `deep-rca` | 111-hosted 14B-class model | slow human-reviewed diagnosis | async only |

No 14B/32B model may evict `alert-fast` residency on GCP-A/GCP-B unless the
gateway explicitly opens a maintenance window.

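The lane table and the residency rule could be encoded roughly as below. Treat this as a sketch: `LanePolicy`, `may_load`, the node labels, and the heuristic of detecting heavy models from a `14b`/`32b` tag suffix are assumptions made for illustration, and the `deep-rca` model name is a placeholder because the ADR does not name one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LanePolicy:
    model: str
    timeout_s: Optional[int]  # None means the lane is async only


LANES = {
    "alert-fast": LanePolicy("gemma3:4b", 45),
    "code-review": LanePolicy("qwen2.5-coder:7b", 90),
    "embedding": LanePolicy("bge-m3", 30),
    "deep-rca": LanePolicy("<14B-class model on 111>", None),  # placeholder name
}

HEAVY_SUFFIXES = ("14b", "32b")       # assumed heuristic based on the model tag
PROTECTED_NODES = {"gcp-a", "gcp-b"}  # hypothetical node labels


def may_load(model: str, node: str, maintenance_window: bool = False) -> bool:
    """Refuse heavy models on protected nodes unless a maintenance window is open."""
    heavy = model.lower().endswith(HEAVY_SUFFIXES)
    if heavy and node in PROTECTED_NODES:
        return maintenance_window
    return True


print(may_load("example-model:14b", "gcp-a"))
print(may_load("example-model:14b", "local-111"))
```

A real gateway would more likely classify models from inventory metadata than from the tag string; the suffix check is only the simplest stand-in.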
## Migration Plan

### Phase 0 - Current bridge

- Keep `192.168.0.110:11435` and `192.168.0.110:11436` active.
- Alert path uses `ALERT_OLLAMA_MODEL=gemma3:4b`.
- Gemini remains paid emergency fallback only.

### Phase 1 - Mesh build in parallel

- Install WireGuard on 110, 120, 121, 111, GCP-A, and GCP-B.
- Assign mesh IPs from `10.77.114.0/24`.
- Keep public proxy and old env values unchanged.
- Verify `/api/tags`, `/api/ps`, and `gemma3:4b` generation over mesh.

### Phase 2 - Shadow mesh

- Add shadow health checks from the API pod to mesh endpoints.
- Emit OTel spans with both `active_endpoint` and `shadow_endpoint`.
- Do not send production inference traffic to mesh yet.

Promotion gate:

- 24h continuous mesh health
- p95 `alert-fast` latency <= current proxy p95 + 10%
- zero public-path-only success events

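The Phase 2 promotion gate is mechanical enough to evaluate in code. The sketch below assumes latency samples are collected per path and uses a nearest-rank p95; the function names and the `healthy_hours` and `public_only_successes` inputs are hypothetical summaries of whatever the shadow health checks actually record.

```python
from typing import Sequence


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample set."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def shadow_gate_passes(mesh_latencies_s: Sequence[float],
                       proxy_latencies_s: Sequence[float],
                       healthy_hours: float,
                       public_only_successes: int) -> bool:
    """Apply the three Phase 2 criteria; all must hold."""
    return (
        healthy_hours >= 24
        and p95(mesh_latencies_s) <= p95(proxy_latencies_s) * 1.10
        and public_only_successes == 0
    )


# Hypothetical samples: mesh is slightly slower but within the 10% budget.
mesh = [1.05] * 20
proxy = [1.00] * 20
print(shadow_gate_passes(mesh, proxy, healthy_hours=24, public_only_successes=0))
```

Comparing against the proxy's own p95 rather than a fixed number keeps the gate meaningful on CPU-only nodes, where absolute latency is already high.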
### Phase 3 - Switch active endpoints

Set production env:

```yaml
OLLAMA_URL: "http://10.77.114.21:11434"
OLLAMA_SECONDARY_URL: "http://10.77.114.22:11434"
OLLAMA_FALLBACK_URL: "http://10.77.114.111:11434"
```

Promotion gate:

- 7 days canary
- Gemini usage for the alert lane is zero except during a documented all-Ollama outage
- no alert-card timeout regression

### Phase 4 - Close public exposure

- Remove or firewall public GCP `11434/tcp`.
- Keep the nginx bridge config, but disable the listener or restrict it to
  operator-only rollback.

## Rollback

Rollback is env-only while the bridge remains available:

```yaml
OLLAMA_URL: "http://192.168.0.110:11435"
OLLAMA_SECONDARY_URL: "http://192.168.0.110:11436"
OLLAMA_FALLBACK_URL: "http://192.168.0.111:11434"
```

If GCP-A/B are unstable, force 111-first temporarily:

```yaml
OLLAMA_URL: "http://192.168.0.111:11434"
OLLAMA_SECONDARY_URL: "http://192.168.0.110:11435"
OLLAMA_FALLBACK_URL: "http://192.168.0.110:11436"
```

Paid provider fallback must remain budget-gated.

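Env-only rollback works as long as the client derives its endpoint order purely from these three variables. A minimal sketch of that idea, assuming a hypothetical helper `endpoint_order` (the ADR does not describe how the client reads its configuration):

```python
import os
from typing import List, Mapping

ENV_KEYS = ("OLLAMA_URL", "OLLAMA_SECONDARY_URL", "OLLAMA_FALLBACK_URL")


def endpoint_order(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return configured endpoints in priority order, skipping unset entries."""
    return [env[key] for key in ENV_KEYS if env.get(key)]


# The bridge rollback profile from above, expressed as a plain mapping.
bridge_profile = {
    "OLLAMA_URL": "http://192.168.0.110:11435",
    "OLLAMA_SECONDARY_URL": "http://192.168.0.110:11436",
    "OLLAMA_FALLBACK_URL": "http://192.168.0.111:11434",
}
print(endpoint_order(bridge_profile))
```

Because the order is data rather than code, switching between mesh, bridge, and 111-first profiles needs only a config change and a restart.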
## Consequences

- GCP Ollama becomes private-by-default instead of public-IP dependent.
- K8s NetworkPolicy can move from public `/32` rules to stable mesh `/32` rules.
- AwoooP can manage Ollama as a platform resource shared by all tenants.
- CPU-only GCP performance remains a capacity constraint; routing must keep
  heavy jobs off the alert lane or use GPU-capable GCP nodes.