fix(db): cap production connection pool budget
All checks were successful
CD Pipeline / workflow-shape (push) Successful in 0s
CD Pipeline / cancel-stale-cd (push) Has been skipped
CD Pipeline / tests (push) Successful in 33s
CD Pipeline / build-and-deploy (push) Successful in 4m38s
CD Pipeline / post-deploy-checks (push) Successful in 51s

Your Name
2026-07-01 17:03:30 +08:00
parent 4561c65fe9
commit fdbbe408c3
6 changed files with 119 additions and 2 deletions


@@ -238,8 +238,14 @@ jobs:
# controlled profile so non-110 CD does not fall into B5's Docker
# socket path just because the previous deploy recorded image
# truth.
k8s/awoooi-prod/04-configmap.yaml)
;;
k8s/awoooi-prod/06-deployment-api.yaml)
;;
k8s/awoooi-prod/08-deployment-worker.yaml)
;;
k8s/awoooi-prod/10-deployment-auto-repair-canary.yaml)
;;
k8s/awoooi-prod/kustomization.yaml)
;;
product.awoooi.yaml)
@@ -294,6 +300,10 @@ jobs:
;;
apps/api/src/api/v1/webhooks.py)
;;
apps/api/src/core/config.py)
;;
apps/api/src/db/base.py)
;;
apps/api/src/services/agent_replay_normalizer.py)
;;
apps/api/src/services/ai_agent_log_intelligence_integration_readback.py)
@@ -438,8 +448,12 @@ jobs:
;;
apps/api/tests/test_awooop_operator_timeline_labels.py)
;;
apps/api/tests/test_config_url_validation.py)
;;
apps/api/tests/test_delivery_closure_workbench_api.py)
;;
apps/api/tests/test_runtime_bootstrap_guards.py)
;;
apps/api/tests/test_backup_dr_target_inventory.py)
;;
apps/api/tests/test_backup_dr_target_inventory_api.py)
@@ -668,6 +682,8 @@ jobs:
if [ "${AWOOOI_CD_TEST_PROFILE:-full}" = "controlled-runtime" ]; then
echo "✅ controlled-runtime profile: running focused replay/auto-approve/copy tests"
python3.11 -m py_compile \
src/core/config.py \
src/db/base.py \
src/api/v1/platform/events.py \
src/api/v1/agents.py \
src/api/v1/iwooos.py \
@@ -768,7 +784,9 @@ jobs:
tests/test_destructive_patterns.py \
tests/test_approval_pending_visibility.py \
tests/test_awooop_operator_timeline_labels.py::test_outbound_timeline_title_labels_runbook_review \
tests/test_config_url_validation.py \
tests/test_delivery_closure_workbench_api.py \
tests/test_runtime_bootstrap_guards.py \
tests/test_backup_dr_target_inventory.py \
tests/test_backup_dr_target_inventory_api.py \
tests/test_backup_dr_readiness_matrix.py \
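
The `case` arms added in this workflow appear to act as an allowlist: a changed path that matches one of them keeps the run on the `controlled-runtime` profile instead of the full B5 path. A minimal Python sketch of that idea follows. The function name `select_profile`, the allowlist subset, and the rule that every changed path must match are assumptions for illustration, not the workflow's actual logic.

```python
from typing import Iterable

# Hypothetical subset of the paths listed in the workflow's case arms.
CONTROLLED_RUNTIME_PATHS = frozenset({
    "apps/api/src/core/config.py",
    "apps/api/src/db/base.py",
    "apps/api/tests/test_config_url_validation.py",
    "apps/api/tests/test_runtime_bootstrap_guards.py",
    "k8s/awoooi-prod/04-configmap.yaml",
    "k8s/awoooi-prod/06-deployment-api.yaml",
    "k8s/awoooi-prod/08-deployment-worker.yaml",
    "k8s/awoooi-prod/10-deployment-auto-repair-canary.yaml",
    "k8s/awoooi-prod/kustomization.yaml",
})


def select_profile(changed_paths: Iterable[str]) -> str:
    """Return "controlled-runtime" only if every changed path is allowlisted.

    Assumption: any path outside the allowlist, or an empty change set,
    falls back to the "full" profile.
    """
    paths = list(changed_paths)
    if paths and all(p in CONTROLLED_RUNTIME_PATHS for p in paths):
        return "controlled-runtime"
    return "full"
```

Under these assumptions, a commit that touches only `apps/api/src/db/base.py` and the worker manifest stays on the focused profile, while one unlisted path sends the run back to the full suite.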


@@ -1,3 +1,15 @@
## 2026-07-01 — 17:08 Gitea CD #4269 B5 socket blocker / DB pool rollout profile convergence
**Issues fixed on the mainline**
- Gitea API, private repo heads, Harbor health, and registry `/v2/` all read back normal. The previous CD run `#4268` succeeded for tests / build / deploy, and the production deploy readback now matches the Gitea desired image tag `1578b13fad`.
- The follow-up DB pool source fix commit `4561c65fe` triggered CD `#4269`. The full API test suite reached `3642 passed, 23 skipped`; the actual failure point was the B5 real-DB integration stage, which read back `BLOCKER b5_docker_socket_unavailable` on a non-110 runner and blocked build/deploy. This is a missing Docker socket capability on the runner, not a failure of the DB pool tests themselves.
- This round adds the production DB pool / rollout guard sources and tests to the `.gitea/workflows/cd.yaml` controlled-runtime profile: `apps/api/src/core/config.py`, `apps/api/src/db/base.py`, `apps/api/tests/test_config_url_validation.py`, `apps/api/tests/test_runtime_bootstrap_guards.py`, `k8s/awoooi-prod/04-configmap.yaml`, `06-deployment-api.yaml`, `08-deployment-worker.yaml`, `10-deployment-auto-repair-canary.yaml`.
- The controlled-runtime focused pytest run now also covers `test_config_url_validation.py` and `test_runtime_bootstrap_guards.py`, and the `ops/runner/test_cd_controlled_runtime_profile.py` guard is kept, so that small DB pool / GitOps rollout fixes do not fall into the B5 Docker socket path again.
**Boundaries**: did not read secret / token / `.env` / raw sessions / SQLite / auth; did not use GitHub / `gh` / GitHub API; did not restart host / Docker / Nginx / K3s / DB / firewall; did not force push.
**Next steps**: finish the rebase, run targeted local verification, push to Gitea main to trigger the next CD run, and read back tests / build-and-deploy / post-deploy plus production image freshness. If it still fails, keep converging only on the failing stage of the latest run.
## 2026-07-01 — 16:32 AI Loop registry-ready / deploy-marker blocker separation
**Issues fixed on the mainline**
@@ -51679,6 +51691,34 @@ production browser smoke:
**Next steps**
- After commit / push, read back the new Gitea CD run and confirm that tests, build, and deploy readback all pass, then read the production health / ArgoCD / image tag evidence.
## 2026-07-01 — 17:02 Production 502 / DB connection budget recovery
**Completed**
- Gitea CD `#4268` finished as `Success`; the deploy readback converged on attempt `30/36`: `Production deploy readback matches Gitea main desired image tag (1578b13fad)`.
- During the CD rollout the public API briefly returned 502. K8s read back `awoooi-api` in `CrashLoopBackOff`, and the API logs show the root cause: `asyncpg.exceptions.TooManyConnectionsError: too many connections for role "awoooi"`.
- Controlled recovery: temporarily scaled down worker / auto-repair-canary to free connections and deleted the CrashLoop API pod. The API read back `health_http=200` at `16:57`; ArgoCD then restored worker / canary, and all four deployments are Ready.
- Production deploy summary readback: `runtime=1578b13fad`, `desired=1578b13fad`, `desired_status=ok`, `production_deploy_image_tag_matches_main=True`, `production_deploy_status=closure_verified`.
- Source fix to prevent recurrence:
  - `apps/api/src/core/config.py` adds `DATABASE_POOL_SIZE` and `DATABASE_MAX_OVERFLOW`.
  - `apps/api/src/db/base.py` now builds the SQLAlchemy async engine from those settings.
  - The prod API / worker manifests set `DATABASE_POOL_SIZE=1` and `DATABASE_MAX_OVERFLOW=0`.
  - The worker / auto-repair-canary rollouts switch to `maxSurge=0`, `maxUnavailable=1`, so old and new pods do not overlap and compete for DB connections during a rollout.
**Local verification results**
- `python3.11 -m py_compile apps/api/src/core/config.py apps/api/src/db/base.py ops/runner/test_cd_controlled_runtime_profile.py`: passed.
- `DATABASE_URL=postgresql://test:test@localhost:5432/test python3.11 -m pytest apps/api/tests/test_config_url_validation.py ops/runner/test_cd_controlled_runtime_profile.py -q`: `53 passed`.
- `python3 ops/runner/guard-gitea-runner-pressure.py --root .`: passed.
- `node scripts/ci/check-gitea-step-env-secrets.js`: passed.
- `git diff --check`: passed.
**Still in effect**
- Did not read secret / token / `.env` / raw sessions / SQLite / auth; did not read `.runner` contents.
- Did not use GitHub / gh / GitHub API / GitHub Actions.
- Did not restart the host; no Docker / Nginx / K3s / DB restart; no workflow_dispatch; no DROP / TRUNCATE / restore / prune.
**Next steps**
- Commit / push the DB pool budget fix; read back Gitea CD, production health, deployment image/env, and whether the DB connection errors are gone.
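
The connection budget behind this entry can be checked with simple arithmetic: each pod may hold up to `pool_size + max_overflow` connections, and a rolling update can temporarily add `maxSurge` extra pods. The sketch below is illustrative only; the `Deployment` class, the replica counts, and the API surge value are hypothetical, since the source states neither the replica counts nor the role's connection limit.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Deployment:
    name: str
    replicas: int      # hypothetical; not stated in the source
    pool_size: int     # DATABASE_POOL_SIZE
    max_overflow: int  # DATABASE_MAX_OVERFLOW
    max_surge: int     # extra pods allowed during a rolling update

    def per_pod(self) -> int:
        # Upper bound on connections a single pod's pool may open.
        return self.pool_size + self.max_overflow

    def peak_connections(self) -> int:
        # Worst case during a rollout: old pods plus surged new pods.
        return (self.replicas + self.max_surge) * self.per_pod()


def peak_total(deployments: Iterable[Deployment]) -> int:
    """Sum the worst-case connection count across deployments."""
    return sum(d.peak_connections() for d in deployments)


# Hypothetical fleet using the pool values from this commit.
fleet = [
    Deployment("api", replicas=2, pool_size=1, max_overflow=0, max_surge=1),
    Deployment("worker", replicas=1, pool_size=1, max_overflow=0, max_surge=0),
    Deployment("canary", replicas=1, pool_size=1, max_overflow=0, max_surge=0),
]
```

With these placeholder numbers the peak is five connections. Comparing `peak_total(fleet)` against the real limit of the `awoooi` role would show how much headroom the rollout leaves.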
## 2026-07-01 — 08:55 Truth-chain hot lookup helper test alignment
**Completed**


@@ -90,6 +90,13 @@ spec:
# Production readback compares runtime image truth against this
# GitOps desired tag instead of doing a slow Gitea raw fetch.
value: "1578b13fad9fcbc749179aea22ff24959b5886cc"
- name: DATABASE_POOL_SIZE
# 2026-07-01 Codex: production role `awoooi` currently has a low
# connection limit. Keep API pool conservative until DB role
# limit is raised and verified.
value: "1"
- name: DATABASE_MAX_OVERFLOW
value: "0"
- name: IWOOOS_WAZUH_READONLY_ENABLED
# 2026-06-30 Codex: controlled GitOps enablement after owner
# metadata, manager registry acceptance, dry-run, rollback, and


@@ -85,6 +85,12 @@ spec:
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: DATABASE_POOL_SIZE
# 2026-07-01 Codex: keep worker DB usage inside the current
# production role connection budget during reboot rollouts.
value: "1"
- name: DATABASE_MAX_OVERFLOW
value: "0"
resources:
requests:
cpu: "100m"
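
Both manifests pass `DATABASE_POOL_SIZE` and `DATABASE_MAX_OVERFLOW` as string env values. The sketch below shows one way such values could be parsed and validated before they reach the engine. The helper names and the fallback defaults are hypothetical, and the real `config.py` / `base.py` changes are not shown in this diff.

```python
import os
from typing import Dict, Mapping, Optional


def _parse_int(env: Mapping[str, str], key: str, default: int, minimum: int) -> int:
    """Read an integer setting, rejecting non-numeric or too-small values."""
    raw = env.get(key)
    if raw is None or raw.strip() == "":
        return default
    try:
        value = int(raw)
    except ValueError as exc:
        raise ValueError(f"{key} must be an integer, got {raw!r}") from exc
    if value < minimum:
        raise ValueError(f"{key} must be >= {minimum}, got {value}")
    return value


def pool_kwargs_from_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, int]:
    """Build pool keyword arguments from the environment.

    Assumption: the defaults 5 and 10 are placeholders for illustration.
    pool_size >= 1 and max_overflow >= 0 are enforced so that the pool
    always has a finite upper bound.
    """
    env = os.environ if env is None else env
    return {
        "pool_size": _parse_int(env, "DATABASE_POOL_SIZE", default=5, minimum=1),
        "max_overflow": _parse_int(env, "DATABASE_MAX_OVERFLOW", default=10, minimum=0),
    }
```

The returned dict could be passed to SQLAlchemy as `create_async_engine(url, **pool_kwargs_from_env())`, since `pool_size` and `max_overflow` are the engine keyword arguments for its queue pool.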


@@ -24,8 +24,10 @@ spec:
strategy:
type: RollingUpdate
rollingUpdate:
-maxSurge: 1
-maxUnavailable: 0
+# 2026-07-01 Codex: keep canary rollout non-overlapping during
+# post-reboot DB connection pressure recovery.
+maxSurge: 0
+maxUnavailable: 1
template:
metadata:
labels:
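
This hunk switches the canary rolling update from `maxSurge: 1` / `maxUnavailable: 0` to `maxSurge: 0` / `maxUnavailable: 1`. The sketch below shows what that means for pod counts during a rollout. The function name and the single-replica example are assumptions, and only absolute values are handled, although Kubernetes also accepts percentages.

```python
from typing import Tuple


def rollout_pod_bounds(replicas: int, max_surge: int, max_unavailable: int) -> Tuple[int, int]:
    """Return (min_available, max_total) pods during a rolling update.

    Kubernetes rejects a strategy where both values are zero, because the
    rollout could then neither add a new pod nor remove an old one.
    """
    if replicas < 0 or max_surge < 0 or max_unavailable < 0:
        raise ValueError("replicas, max_surge and max_unavailable must be >= 0")
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("max_surge and max_unavailable cannot both be 0")
    min_available = max(replicas - max_unavailable, 0)
    max_total = replicas + max_surge
    return min_available, max_total
```

For a hypothetical single-replica deployment, the old strategy gives `(1, 2)`, so old and new pods overlap and both may hold DB connections. The new strategy gives `(0, 1)`: there is never a second pod, at the cost of a short window with no available pod.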


@@ -11,6 +11,11 @@ HARBOR_110_REPAIR_WORKFLOW = (
    ROOT / ".gitea" / "workflows" / "harbor-110-local-repair.yaml"
)
WAIT_HOST_PRESSURE = ROOT / "scripts" / "ci" / "wait-host-web-build-pressure.sh"
PROD_API_DEPLOYMENT = ROOT / "k8s" / "awoooi-prod" / "06-deployment-api.yaml"
PROD_WORKER_DEPLOYMENT = ROOT / "k8s" / "awoooi-prod" / "08-deployment-worker.yaml"
PROD_CANARY_DEPLOYMENT = (
    ROOT / "k8s" / "awoooi-prod" / "10-deployment-auto-repair-canary.yaml"
)


def _workflow_text() -> str:
@@ -35,10 +40,49 @@ def test_product_manifest_changes_stay_on_controlled_runtime_profile() -> None:
def test_deploy_marker_k8s_files_stay_on_controlled_runtime_profile() -> None:
    text = _workflow_text()
    assert "build-and-deploy writes only these GitOps" in text
    assert "k8s/awoooi-prod/04-configmap.yaml)" in text
    assert "k8s/awoooi-prod/06-deployment-api.yaml)" in text
    assert "k8s/awoooi-prod/08-deployment-worker.yaml)" in text
    assert "k8s/awoooi-prod/10-deployment-auto-repair-canary.yaml)" in text
    assert "k8s/awoooi-prod/kustomization.yaml)" in text


def test_prod_db_pool_recovery_sources_stay_on_controlled_runtime_profile() -> None:
    text = _workflow_text()
    expected_sources = [
        "apps/api/src/core/config.py)",
        "apps/api/src/db/base.py)",
        "apps/api/tests/test_config_url_validation.py)",
        "apps/api/tests/test_runtime_bootstrap_guards.py)",
        "src/core/config.py",
        "src/db/base.py",
        "tests/test_config_url_validation.py",
        "tests/test_runtime_bootstrap_guards.py",
    ]
    for source in expected_sources:
        assert source in text


def test_prod_db_pool_budget_and_non_overlap_rollouts_are_source_controlled() -> None:
    api = PROD_API_DEPLOYMENT.read_text(encoding="utf-8")
    worker = PROD_WORKER_DEPLOYMENT.read_text(encoding="utf-8")
    canary = PROD_CANARY_DEPLOYMENT.read_text(encoding="utf-8")
    assert "DATABASE_POOL_SIZE" in api
    assert "value: \"1\"" in api
    assert "DATABASE_MAX_OVERFLOW" in api
    assert "value: \"0\"" in api
    assert "DATABASE_POOL_SIZE" in worker
    assert "DATABASE_MAX_OVERFLOW" in worker
    assert "maxSurge: 0" in worker
    assert "maxUnavailable: 1" in worker
    assert "post-reboot DB connection pressure recovery" in canary
    assert "maxSurge: 0" in canary
    assert "maxUnavailable: 1" in canary


def test_workflow_secret_transport_sources_stay_on_controlled_runtime_profile() -> None:
    text = _workflow_text()
    assert "workflow secret-transport and guard-only" in text
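
The budget test above checks for `DATABASE_POOL_SIZE` and `value: "1"` as separate substrings, so it would still pass if some other env entry carried the value `"1"`. A stricter check could pair each env name with its own literal value. The helper below is a hypothetical, line-based sketch that uses only the standard library; it is not a full YAML parser and assumes the `- name:` / `value: "..."` layout seen in these manifests.

```python
import re
from typing import Dict, Optional

_NAME_RE = re.compile(r'^\s*-\s*name:\s*(\S+)\s*$')
_VALUE_RE = re.compile(r'^\s*value:\s*"([^"]*)"\s*$')


def literal_env_values(manifest_text: str) -> Dict[str, str]:
    """Map each `- name: X` entry to the quoted `value:` that follows it.

    Comment and blank lines between the name and the value are skipped.
    Entries followed by anything else (for example `valueFrom:`) are ignored.
    """
    values: Dict[str, str] = {}
    current: Optional[str] = None
    for line in manifest_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        name_match = _NAME_RE.match(line)
        if name_match:
            current = name_match.group(1)
            continue
        value_match = _VALUE_RE.match(line)
        if value_match and current is not None:
            values[current] = value_match.group(1)
        # Any other key ends the current entry.
        current = None
    return values
```

A test could then assert `literal_env_values(api)["DATABASE_POOL_SIZE"] == "1"` instead of two independent substring checks.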