awoooi/docs/security/K8S-ARGOCD-POST-INCIDENT-READBACK-PLAN.md
Your Name 45c2b8ebe6
feat(iwooos): add K8s ArgoCD incident readback gate
2026-06-15 20:42:12 +08:00


K8s / ArgoCD Post-Incident Readback Plan

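Status: read-only plan complete; waiting for the GitOps / K8s owner to provide a redacted post-incident readback package.
Scope: K8s production manifests, ArgoCD, Velero, monitoring manifests.
Boundary: this document is not an ArgoCD API read, sync, rollback, kubectl, Helm, Kustomize, NetworkPolicy / NodePort / RBAC change, route smoke, secret collection, production write, or runtime gate.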

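1. Why this layer was added

k8s_argocd_change_evidence_acceptance_v1 already defines the intake rules for proposed commit, rendered manifest diff, ArgoCD app / sync revision, health before / after, rollout, route smoke, metrics / alert, secret metadata parity, blast radius, maintenance window, rollback revision, and post-check owner.

This round adds k8s_argocd_post_incident_readback_plan_v1 because the following can occur at the same time during an incident:

  • ArgoCD shows Synced / Degraded, but parts of the service have still not recovered.
  • A Pod, Job, or CronJob stays Pending, possibly stuck on image pull, scheduling, quota, node, PVC, or RBAC.
  • drift-scanner, backup schedules, monitoring alerts, and public / admin routes affect one another.
  • CD success, a deploy marker, route 200, Pod Running, or UI visibility is misread as completed security acceptance.

This layer therefore only establishes the post-incident readback fields, reviewer checks, outcome lanes, and blocked actions, so that owner evidence can be received with low friction. Until a reviewer record exists, all accepted / runtime counts must stay at 0.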

2. Fixed summary

| Field | Value |
| --- | --- |
| schema | k8s_argocd_post_incident_readback_plan_v1 |
| source schema | k8s_argocd_change_evidence_acceptance_v1 |
| readback candidates | 4 |
| C0 candidates | 3 |
| C1 candidates | 1 |
| write-capable candidates | 4 |
| source manifest files | 49 |
| source YAML manifest files | 45 |
| source C0 files | 36 |
| Deployment objects | 5 |
| CronJob objects | 5 |
| Secret objects | 6 |
| NetworkPolicy objects | 6 |
| RBAC objects | 5 |
| ArgoCD Application objects | 1 |
| PrometheusRule objects | 4 |
| readback fields | 36 |
| required readback fields | 31 |
| reviewer checks | 28 |
| outcome lanes | 10 |
| blocked actions | 41 |
| coverage after plan | 66% |
| post-incident readback received / accepted | 0 / 0 |
| runtime gate | 0 |
| action buttons | 0 |

3. Readback candidates

| Candidate | Scope |
| --- | --- |
| k8s_argocd_post_incident_readback:awoooi_prod | AWOOOI production manifests, Deployment, Service, NetworkPolicy, Secret metadata, route and rollout impact |
| k8s_argocd_post_incident_readback:argocd | ArgoCD Application, sync / health, revision, degraded state, and rollback evidence |
| k8s_argocd_post_incident_readback:velero | Velero, backup / restore, CronJob / schedule, PVC / object storage, and DR impact |
| k8s_argocd_post_incident_readback:monitoring | PrometheusRule, monitoring manifests, alert health, drift-scanner, and receiver route impact |

4. Required post-incident readback fields

  1. k8s_incident_or_change_ref
  2. actor_attribution_ref
  3. argocd_app_health_ref
  4. argocd_sync_status_ref
  5. degraded_state_ref
  6. pending_workload_ref
  7. image_pull_or_scheduling_ref
  8. rollout_before_ref
  9. rollout_after_ref
  10. event_summary_ref
  11. metrics_alert_ref
  12. drift_scanner_ref
  13. cronjob_schedule_ref
  14. secret_metadata_parity_ref
  15. network_policy_service_impact_ref
  16. rbac_serviceaccount_impact_ref
  17. public_admin_route_impact_ref
  18. ai_provider_monitoring_impact_ref
  19. operator_notification_ref
  20. cross_project_sync_ref
  21. recovery_or_still_degraded_ref
  22. postcheck_readback_ref
  23. recurrence_guard_ref
  24. maintenance_window
  25. rollback_revision
  26. rollback_owner
  27. followup_owner
  28. redacted_evidence_refs
  29. no_secret_value_attestation
  30. no_raw_manifest_or_kubeconfig_attestation
  31. no_false_green_attestation
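
The required-field list above lends itself to a mechanical completeness check before a reviewer looks at a package. The following is a minimal sketch, assuming the readback package arrives as a flat dictionary keyed by field name; the function name `missing_required_fields` and the dictionary shape are hypothetical and not defined by the plan.

```python
# The 31 field names come from the list above. The function name and the
# dict-shaped package are assumptions made for this sketch.
REQUIRED_READBACK_FIELDS = (
    "k8s_incident_or_change_ref",
    "actor_attribution_ref",
    "argocd_app_health_ref",
    "argocd_sync_status_ref",
    "degraded_state_ref",
    "pending_workload_ref",
    "image_pull_or_scheduling_ref",
    "rollout_before_ref",
    "rollout_after_ref",
    "event_summary_ref",
    "metrics_alert_ref",
    "drift_scanner_ref",
    "cronjob_schedule_ref",
    "secret_metadata_parity_ref",
    "network_policy_service_impact_ref",
    "rbac_serviceaccount_impact_ref",
    "public_admin_route_impact_ref",
    "ai_provider_monitoring_impact_ref",
    "operator_notification_ref",
    "cross_project_sync_ref",
    "recovery_or_still_degraded_ref",
    "postcheck_readback_ref",
    "recurrence_guard_ref",
    "maintenance_window",
    "rollback_revision",
    "rollback_owner",
    "followup_owner",
    "redacted_evidence_refs",
    "no_secret_value_attestation",
    "no_raw_manifest_or_kubeconfig_attestation",
    "no_false_green_attestation",
)


def missing_required_fields(package: dict) -> list[str]:
    """Return required fields that are absent or empty, in plan order."""
    missing = []
    for name in REQUIRED_READBACK_FIELDS:
        value = package.get(name)
        # None, False, and empty or whitespace-only strings count as missing.
        is_blank = isinstance(value, str) and not value.strip()
        if value is None or value is False or is_blank:
            missing.append(name)
    return missing
```

An empty result only means the package is complete enough to be routed; it says nothing about whether the referenced evidence is acceptable, which remains a reviewer decision.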

5. Reviewer checks

The reviewer must confirm:

  • The incident / change / deploy marker ref is traceable.
  • The actor role / team is traceable; anonymous sync, rollback, kubectl, Helm, rollout, or image changes cannot be accepted.
  • ArgoCD sync status, health status, revision, and degraded state each have a redacted ref.
  • For Pending workloads, a summary of the cause has been read back: image pull, scheduling, quota, node, PVC, RBAC, or another cause.
  • Rollout before / after, event summary, metrics / alert, drift-scanner, CronJob schedule, and route / AI provider / monitoring impact are readable.
  • Secrets provide metadata parity only, with no value, hash, partial token, DSN, cookie, or kubeconfig.
  • NetworkPolicy, Service, Ingress, NodePort, RBAC, and ServiceAccount impact is clearly listed.
  • The post-check is independent of the original operator and of the UI card.
  • Recurrence guard, maintenance window, rollback revision, rollback owner, and followup owner are traceable.
  • Route 200, Pod Running, ArgoCD Synced, CD success, dashboard up, or UI visibility is not used as incident acceptance.
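
The last check, not treating surface signals as incident acceptance, can be illustrated with a small set comparison. This is a sketch under one possible reading: a claim is a false green when everything it rests on is one of the listed signals. The signal labels (`route_200`, `pod_running`, and so on) and the function name are hypothetical; the plan does not define a machine-readable form for them.

```python
# Hypothetical labels for the surface signals named in the last reviewer check.
FALSE_GREEN_SIGNALS = frozenset({
    "route_200",
    "pod_running",
    "argocd_synced",
    "cd_success",
    "dashboard_up",
    "ui_visible",
})


def is_false_green_claim(acceptance_basis: set[str]) -> bool:
    """True when a claim rests only on false-green signals, or on nothing."""
    # Subset test: every item offered as evidence is a surface signal.
    return set(acceptance_basis) <= FALSE_GREEN_SIGNALS
```

Under this reading, a claim that cites `argocd_synced` together with a redacted post-check ref is not flagged, while a claim citing only `route_200` and `pod_running` is.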

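6. Outcome lanes

| Lane | Meaning |
| --- | --- |
| waiting_post_incident_readback | The K8s / ArgoCD post-incident readback package has not been received; all accepted / runtime counts stay at 0 |
| request_actor_or_revision_supplement | Request a supplement when actor, deploy marker, ArgoCD revision, or rollback revision is missing |
| request_degraded_pending_supplement | Request a supplement when Degraded / Pending / image pull / scheduling / rollout before-after is missing |
| request_event_metric_supplement | Request a supplement when event summary, metrics, alert, drift scanner, or CronJob schedule is missing |
| request_route_dependency_supplement | Request a supplement when route, AI provider, monitoring, backup / restore, or network / RBAC impact is missing |
| quarantine_raw_payload | Quarantine when a Secret value, kubeconfig, raw manifest, raw log, raw event, or unredacted screenshot is received |
| reject_runtime_claim | Reject when CD success, ArgoCD Synced, route 200, Pod Running, or UI visibility is presented as acceptance |
| ready_for_k8s_post_incident_review | Once the metadata qualifies, the package may only proceed to reviewer review |
| recurrence_guard_backfill_required | A recurrence guard, owner review, change freeze, or automation block must be backfilled |
| waiting_runtime_gate | Even when the readback is accepted, the runtime gate still requires separate manual approval |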
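
The lane descriptions above read like a triage order, which could be sketched as a single routing function. Several things here are assumptions rather than part of the plan: the precedence between lanes, the mapping from each supplement lane to specific readback fields, and the two boolean flags. The `waiting_runtime_gate` lane is left out because it applies after reviewer review, which this sketch does not model.

```python
from typing import Optional

# Assumed mapping from supplement lanes to the readback fields they cover.
SUPPLEMENT_LANES = (
    ("request_actor_or_revision_supplement",
     ("actor_attribution_ref", "k8s_incident_or_change_ref",
      "argocd_sync_status_ref", "rollback_revision")),
    ("request_degraded_pending_supplement",
     ("degraded_state_ref", "pending_workload_ref",
      "image_pull_or_scheduling_ref", "rollout_before_ref",
      "rollout_after_ref")),
    ("request_event_metric_supplement",
     ("event_summary_ref", "metrics_alert_ref",
      "drift_scanner_ref", "cronjob_schedule_ref")),
    ("request_route_dependency_supplement",
     ("public_admin_route_impact_ref", "ai_provider_monitoring_impact_ref",
      "network_policy_service_impact_ref", "rbac_serviceaccount_impact_ref")),
)


def route_lane(package: Optional[dict],
               has_raw_payload: bool = False,
               claims_false_green: bool = False) -> str:
    """Pick one outcome lane for a readback package (assumed precedence)."""
    if package is None:
        return "waiting_post_incident_readback"
    # Raw or unredacted material is isolated before anything else is read.
    if has_raw_payload:
        return "quarantine_raw_payload"
    if claims_false_green:
        return "reject_runtime_claim"
    for lane, fields in SUPPLEMENT_LANES:
        if any(not package.get(field) for field in fields):
            return lane
    if not package.get("recurrence_guard_ref"):
        return "recurrence_guard_backfill_required"
    # Reaching this lane only queues the package for a human reviewer.
    return "ready_for_k8s_post_incident_review"
```

Checking quarantine first is a deliberate choice in this sketch: a package that contains raw payloads should not have its other fields inspected until it has been isolated.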

7. Blocked actions

This plan fixes blocked_actions=41. They include:

  • argocd_api_read, argocd_sync, argocd_rollback
  • live_cluster_read, kubectl_get_live, kubectl_apply, kubectl_patch, kubectl_delete, kubectl_rollout_restart, kubectl_scale
  • helm_upgrade, helm_rollback, kustomize_image_set
  • change_network_policy, change_nodeport, change_service_type, change_ingress_route, change_rbac, change_serviceaccount
  • change_configmap_runtime, change_secret_metadata
  • secret_value_collection, kubeconfig_collection, raw_manifest_storage, raw_event_dump_storage, raw_pod_log_storage, raw_secret_storage
  • velero_restore, restore_backup, prometheus_rule_apply, alertmanager_reload
  • route_smoke, production_write
  • accept_synced_as_healthy, accept_route_200_as_all_green, accept_pod_running_as_all_green, accept_cd_success_as_security_acceptance
  • skip_degraded_pending_review, mark_readback_accepted_without_reviewer_record, open_runtime_gate, add_action_button
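
The blocked-action list can be held as a fixed set and consulted before any tooling acts. The sketch below assumes a simple denylist guard; `BlockedActionError` and `assert_action_allowed` are hypothetical names, and the plan does not say how or where such a guard would be enforced.

```python
# The 41 action names are taken from the list above.
BLOCKED_ACTIONS = frozenset({
    "argocd_api_read", "argocd_sync", "argocd_rollback",
    "live_cluster_read", "kubectl_get_live", "kubectl_apply",
    "kubectl_patch", "kubectl_delete", "kubectl_rollout_restart",
    "kubectl_scale",
    "helm_upgrade", "helm_rollback", "kustomize_image_set",
    "change_network_policy", "change_nodeport", "change_service_type",
    "change_ingress_route", "change_rbac", "change_serviceaccount",
    "change_configmap_runtime", "change_secret_metadata",
    "secret_value_collection", "kubeconfig_collection",
    "raw_manifest_storage", "raw_event_dump_storage",
    "raw_pod_log_storage", "raw_secret_storage",
    "velero_restore", "restore_backup", "prometheus_rule_apply",
    "alertmanager_reload",
    "route_smoke", "production_write",
    "accept_synced_as_healthy", "accept_route_200_as_all_green",
    "accept_pod_running_as_all_green",
    "accept_cd_success_as_security_acceptance",
    "skip_degraded_pending_review",
    "mark_readback_accepted_without_reviewer_record",
    "open_runtime_gate", "add_action_button",
})


class BlockedActionError(RuntimeError):
    """Raised when a caller requests an action this plan blocks."""


def assert_action_allowed(action: str) -> None:
    """Raise BlockedActionError if the action is on the blocked list."""
    if action in BLOCKED_ACTIONS:
        raise BlockedActionError(f"blocked by readback plan: {action}")
```

A length check against the fixed count of 41 is a cheap way to notice when the list and the summary drift apart.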

8. Next steps

The only permitted next step is to receive the redacted post-incident readback package from the GitOps / K8s owner. It must contain at least:

  • incident / change / deploy marker ref
  • actor role / team
  • ArgoCD sync / health / revision summary
  • Degraded / Pending / image pull / scheduling summary
  • rollout before / after
  • event summary, metrics / alert, drift scanner, CronJob schedule
  • NetworkPolicy / Service / Ingress / NodePort / RBAC / ServiceAccount impact
  • public / admin route, AI provider, monitoring, backup / restore impact
  • operator notification and cross-project sync ref
  • recovery or still-degraded ref
  • post-check, recurrence guard, maintenance window, rollback revision, rollback owner, followup owner
  • no-secret-value, no-raw-manifest-or-kubeconfig, no-false-green attestation

Until a reviewer record exists, post_incident_readback_received_count=0, post_incident_readback_accepted_count=0, runtime_gate_count=0, and action_button_count=0 must remain unchanged.
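
The zero-count rule can be expressed as a small invariant check. This is a minimal sketch that uses the four counter names from the sentence above; the `ReadbackCounters` class and the `nonzero_counters` function are hypothetical, and the check is meant to be run only while no reviewer record exists.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReadbackCounters:
    """The four counters that must stay at 0 before a reviewer record."""
    post_incident_readback_received_count: int = 0
    post_incident_readback_accepted_count: int = 0
    runtime_gate_count: int = 0
    action_button_count: int = 0


def nonzero_counters(counters: ReadbackCounters) -> list[str]:
    """Return the names of counters that violate the zero-count rule."""
    return [name for name, value in asdict(counters).items() if value != 0]
```

An empty list means the invariant holds; any returned name points at a counter that moved before a reviewer record was written.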