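K8s / ArgoCD Post-Incident Readback Plan
Status: read-only plan complete; waiting for the GitOps / K8s owner to provide a redacted incident readback package.
Scope: K8s production manifests, ArgoCD, Velero, and monitoring manifests.
Boundary: this document is not an ArgoCD API read, sync, or rollback; a kubectl, Helm, or Kustomize operation; a NetworkPolicy / NodePort / RBAC change; a route smoke; secret collection; a production write; or a runtime gate.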
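1. Why this layer was added
k8s_argocd_change_evidence_acceptance_v1 already defines the intake rules for proposed commit, rendered manifest diff, ArgoCD app / sync revision, health before / after, rollout, route smoke, metrics / alert, secret metadata parity, blast radius, maintenance window, rollback revision, and post-check owner.
This round adds k8s_argocd_post_incident_readback_plan_v1 because the following can occur at the same time during an incident:
- ArgoCD shows Synced / Degraded, but part of the service has still not recovered.
- A Pod, Job, or CronJob stays Pending, possibly stuck on image pull, scheduling, quota, node, PVC, or RBAC.
- drift-scanner, backup schedules, monitoring alerts, and public / admin routes affect one another.
- CD success, a deploy marker, route 200, Pod Running, or UI visibility is misread as completed security acceptance.
This layer therefore only establishes the post-incident readback fields, reviewer checks, outcome lanes, and blocked actions, so that owner evidence can be received with low friction. Until a reviewer record exists, every accepted / runtime count must stay at 0.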
2. Fixed summary
| Field | Value |
|---|---|
| schema | k8s_argocd_post_incident_readback_plan_v1 |
| source schema | k8s_argocd_change_evidence_acceptance_v1 |
| readback candidates | 4 |
| C0 candidates | 3 |
| C1 candidates | 1 |
| write-capable candidates | 4 |
| source manifest files | 49 |
| source YAML manifest files | 45 |
| source C0 files | 36 |
| Deployment objects | 5 |
| CronJob objects | 5 |
| Secret objects | 6 |
| NetworkPolicy objects | 6 |
| RBAC objects | 5 |
| ArgoCD Application objects | 1 |
| PrometheusRule objects | 4 |
| readback fields | 36 |
| required readback fields | 31 |
| reviewer checks | 28 |
| outcome lanes | 10 |
| blocked actions | 41 |
| coverage after plan | 66% |
| post-incident readback received / accepted | 0 / 0 |
| runtime gate | 0 |
| action buttons | 0 |
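The counts in the summary can be cross-checked mechanically. The following is a minimal sketch, not part of the plan itself: the dictionary keys are hypothetical snake_case names for the table rows, and the relationships it checks (C0 + C1 equals the candidate total, required fields do not exceed all fields, write-capable candidates do not exceed candidates) are assumptions that happen to hold for the values above.

```python
# Values copied from the fixed summary table.
# Key names and the consistency rules are illustrative assumptions.
SUMMARY = {
    "readback_candidates": 4,
    "c0_candidates": 3,
    "c1_candidates": 1,
    "write_capable_candidates": 4,
    "readback_fields": 36,
    "required_readback_fields": 31,
}


def summary_problems(summary: dict) -> list:
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = []
    if summary["c0_candidates"] + summary["c1_candidates"] != summary["readback_candidates"]:
        problems.append("C0 + C1 candidates do not add up to readback candidates")
    if summary["write_capable_candidates"] > summary["readback_candidates"]:
        problems.append("more write-capable candidates than candidates")
    if summary["required_readback_fields"] > summary["readback_fields"]:
        problems.append("more required fields than fields")
    return problems


if __name__ == "__main__":
    print(summary_problems(SUMMARY))
```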
3. Readback candidates
| Candidate | Scope |
|---|---|
| k8s_argocd_post_incident_readback:awoooi_prod | AWOOOI production manifests, Deployment, Service, NetworkPolicy, Secret metadata, route and rollout impact |
| k8s_argocd_post_incident_readback:argocd | ArgoCD Application, sync / health, revision, degraded state, and rollback evidence |
| k8s_argocd_post_incident_readback:velero | Velero, backup / restore, CronJob / schedule, PVC / object storage, and DR impact |
| k8s_argocd_post_incident_readback:monitoring | PrometheusRule, monitoring manifests, alert health, drift-scanner, and receiver route impact |
4. Required post-incident readback fields
- k8s_incident_or_change_ref
- actor_attribution_ref
- argocd_app_health_ref
- argocd_sync_status_ref
- degraded_state_ref
- pending_workload_ref
- image_pull_or_scheduling_ref
- rollout_before_ref
- rollout_after_ref
- event_summary_ref
- metrics_alert_ref
- drift_scanner_ref
- cronjob_schedule_ref
- secret_metadata_parity_ref
- network_policy_service_impact_ref
- rbac_serviceaccount_impact_ref
- public_admin_route_impact_ref
- ai_provider_monitoring_impact_ref
- operator_notification_ref
- cross_project_sync_ref
- recovery_or_still_degraded_ref
- postcheck_readback_ref
- recurrence_guard_ref
- maintenance_window
- rollback_revision
- rollback_owner
- followup_owner
- redacted_evidence_refs
- no_secret_value_attestation
- no_raw_manifest_or_kubeconfig_attestation
- no_false_green_attestation
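A received package could be checked against the 31 required fields before any reviewer looks at it. The sketch below is one possible shape for that check and assumes the package arrives as a flat dictionary keyed by field name, which the plan does not specify; `missing_fields` is a hypothetical helper name. It operates only on already-redacted metadata and performs no cluster or ArgoCD read.

```python
# The 31 required readback fields, in the order listed in this section.
REQUIRED_FIELDS = (
    "k8s_incident_or_change_ref", "actor_attribution_ref",
    "argocd_app_health_ref", "argocd_sync_status_ref",
    "degraded_state_ref", "pending_workload_ref",
    "image_pull_or_scheduling_ref", "rollout_before_ref",
    "rollout_after_ref", "event_summary_ref", "metrics_alert_ref",
    "drift_scanner_ref", "cronjob_schedule_ref",
    "secret_metadata_parity_ref", "network_policy_service_impact_ref",
    "rbac_serviceaccount_impact_ref", "public_admin_route_impact_ref",
    "ai_provider_monitoring_impact_ref", "operator_notification_ref",
    "cross_project_sync_ref", "recovery_or_still_degraded_ref",
    "postcheck_readback_ref", "recurrence_guard_ref",
    "maintenance_window", "rollback_revision", "rollback_owner",
    "followup_owner", "redacted_evidence_refs",
    "no_secret_value_attestation",
    "no_raw_manifest_or_kubeconfig_attestation",
    "no_false_green_attestation",
)


def missing_fields(package: dict) -> list:
    """Return required fields that are absent or falsy, in plan order.

    Assumption: an empty string, None, or False counts as missing, so an
    attestation explicitly set to False is reported as well.
    """
    return [name for name in REQUIRED_FIELDS if not package.get(name)]


if __name__ == "__main__":
    partial = {"k8s_incident_or_change_ref": "INC-REDACTED"}  # placeholder ref
    print(len(missing_fields(partial)), "required fields still missing")
```

A non-empty result would naturally map to one of the supplement lanes rather than to a rejection.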
5. Reviewer checks
The reviewer must confirm that:
- The incident / change / deploy marker ref is traceable.
- The actor role / team is traceable; anonymous sync, rollback, kubectl, Helm, rollout, or image changes cannot be accepted.
- ArgoCD sync status, health status, revision, and degraded state each have a redacted ref.
- For Pending workloads, a cause summary has been read back: image pull, scheduling, quota, node, PVC, RBAC, or another cause.
- Rollout before / after, event summary, metrics / alert, drift-scanner, CronJob schedule, and route / AI provider / monitoring impact are readable.
- Secrets provide metadata parity only, with no value, hash, partial token, DSN, cookie, or kubeconfig.
- NetworkPolicy, Service, Ingress, NodePort, RBAC, and ServiceAccount impact is listed clearly.
- The post-check is independent of the original operator and of the UI card.
- Recurrence guard, maintenance window, rollback revision, rollback owner, and followup owner are traceable.
- Route 200, Pod Running, ArgoCD Synced, CD success, dashboard up, or UI visibility is not used as incident acceptance.
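Two of the checks above lend themselves to a mechanical pre-screen: the false-green rule and the Secret metadata-only rule. The following is a rough sketch under stated assumptions: the signal names and key names are hypothetical normalizations of the terms in the list, and the inputs are assumed to be a list of claimed acceptance signals and a dictionary of Secret metadata. It supports a reviewer and does not replace one.

```python
# Signals that must never be treated as incident acceptance on their own.
# The snake_case names are hypothetical normalizations of the list above.
FALSE_GREEN_SIGNALS = frozenset({
    "route_200", "pod_running", "argocd_synced",
    "cd_success", "dashboard_up", "ui_visible",
})

# Keys that should never appear in Secret metadata parity evidence.
FORBIDDEN_SECRET_KEYS = frozenset({
    "value", "hash", "partial_token", "dsn", "cookie", "kubeconfig",
})


def false_green_claims(acceptance_basis) -> list:
    """Return the claimed acceptance signals that are false-green."""
    return sorted(set(acceptance_basis) & FALSE_GREEN_SIGNALS)


def forbidden_secret_keys(secret_metadata: dict) -> list:
    """Return metadata keys that indicate secret material was included."""
    return sorted(k for k in secret_metadata if k.lower() in FORBIDDEN_SECRET_KEYS)


if __name__ == "__main__":
    print(false_green_claims(["argocd_synced", "postcheck_readback_ref"]))
    print(forbidden_secret_keys({"name": "db-credentials", "Hash": "redacted"}))
```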
6. Outcome lanes
| Lane | Meaning |
|---|---|
| waiting_post_incident_readback | The K8s / ArgoCD incident readback package has not been received; all accepted / runtime counts stay at 0 |
| request_actor_or_revision_supplement | Request a supplement when actor, deploy marker, ArgoCD revision, or rollback revision is missing |
| request_degraded_pending_supplement | Request a supplement when Degraded / Pending / image pull / scheduling / rollout before-after is missing |
| request_event_metric_supplement | Request a supplement when event summary, metrics, alert, drift scanner, or CronJob schedule is missing |
| request_route_dependency_supplement | Request a supplement when route, AI provider, monitoring, backup / restore, or network / RBAC impact is missing |
| quarantine_raw_payload | Quarantine when a Secret value, kubeconfig, raw manifest, raw log, raw event, or unredacted screenshot is received |
| reject_runtime_claim | Reject when CD success, ArgoCD Synced, route 200, Pod Running, or UI visibility is presented as acceptance |
| ready_for_k8s_post_incident_review | Once the metadata qualifies, the package may only proceed to reviewer review |
| recurrence_guard_backfill_required | A recurrence guard, owner review, change freeze, or automation block must be backfilled |
| waiting_runtime_gate | Even after the readback is accepted, the runtime gate still requires separate manual approval |
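The lanes above can be read as a triage order. The sketch below shows one way to route a package; the precedence (quarantine before rejection before supplement requests) and the mapping from lanes to specific field names are assumptions made for illustration, not rules stated by the plan. `recurrence_guard_backfill_required` and `waiting_runtime_gate` are deliberately left out because they depend on reviewer judgement.

```python
from typing import Optional

# Assumed mapping from supplement lanes to the fields whose absence triggers them.
SUPPLEMENT_LANES = (
    ("request_actor_or_revision_supplement",
     ("k8s_incident_or_change_ref", "actor_attribution_ref", "rollback_revision")),
    ("request_degraded_pending_supplement",
     ("degraded_state_ref", "pending_workload_ref", "image_pull_or_scheduling_ref",
      "rollout_before_ref", "rollout_after_ref")),
    ("request_event_metric_supplement",
     ("event_summary_ref", "metrics_alert_ref", "drift_scanner_ref",
      "cronjob_schedule_ref")),
    ("request_route_dependency_supplement",
     ("public_admin_route_impact_ref", "ai_provider_monitoring_impact_ref",
      "network_policy_service_impact_ref", "rbac_serviceaccount_impact_ref")),
)


def choose_lane(package: Optional[dict],
                has_raw_payload: bool = False,
                claims_false_green: bool = False) -> str:
    """Pick the first applicable lane for a (possibly absent) readback package."""
    if package is None:
        return "waiting_post_incident_readback"
    if has_raw_payload:  # raw or unredacted material is isolated first
        return "quarantine_raw_payload"
    if claims_false_green:
        return "reject_runtime_claim"
    for lane, fields in SUPPLEMENT_LANES:
        if any(not package.get(name) for name in fields):
            return lane
    # Qualifying metadata only unlocks reviewer review, never acceptance.
    return "ready_for_k8s_post_incident_review"


if __name__ == "__main__":
    print(choose_lane(None))
    print(choose_lane({"k8s_incident_or_change_ref": "INC-REDACTED"}))
```

Note that the best outcome this function can return is `ready_for_k8s_post_incident_review`; it never marks a readback as accepted.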
7. Blocked actions
This plan fixes blocked_actions=41. They include:
- argocd_api_read, argocd_sync, argocd_rollback
- live_cluster_read, kubectl_get_live, kubectl_apply, kubectl_patch, kubectl_delete, kubectl_rollout_restart, kubectl_scale
- helm_upgrade, helm_rollback, kustomize_image_set
- change_network_policy, change_nodeport, change_service_type, change_ingress_route, change_rbac, change_serviceaccount
- change_configmap_runtime, change_secret_metadata
- secret_value_collection, kubeconfig_collection, raw_manifest_storage, raw_event_dump_storage, raw_pod_log_storage, raw_secret_storage
- velero_restore, restore_backup, prometheus_rule_apply, alertmanager_reload
- route_smoke, production_write
- accept_synced_as_healthy, accept_route_200_as_all_green, accept_pod_running_as_all_green, accept_cd_success_as_security_acceptance
- skip_degraded_pending_review, mark_readback_accepted_without_reviewer_record, open_runtime_gate, add_action_button
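Any tooling built around this plan could carry the blocked list as data and refuse those actions outright. The following is a minimal sketch, assuming actions are identified by the string names listed above; `BlockedActionError` and `ensure_not_blocked` are hypothetical names. It is a denylist check only, and a stricter design would allowlist the few permitted actions instead.

```python
# The 41 blocked actions named in this section.
BLOCKED_ACTIONS = frozenset({
    "argocd_api_read", "argocd_sync", "argocd_rollback",
    "live_cluster_read", "kubectl_get_live", "kubectl_apply", "kubectl_patch",
    "kubectl_delete", "kubectl_rollout_restart", "kubectl_scale",
    "helm_upgrade", "helm_rollback", "kustomize_image_set",
    "change_network_policy", "change_nodeport", "change_service_type",
    "change_ingress_route", "change_rbac", "change_serviceaccount",
    "change_configmap_runtime", "change_secret_metadata",
    "secret_value_collection", "kubeconfig_collection", "raw_manifest_storage",
    "raw_event_dump_storage", "raw_pod_log_storage", "raw_secret_storage",
    "velero_restore", "restore_backup", "prometheus_rule_apply",
    "alertmanager_reload",
    "route_smoke", "production_write",
    "accept_synced_as_healthy", "accept_route_200_as_all_green",
    "accept_pod_running_as_all_green",
    "accept_cd_success_as_security_acceptance",
    "skip_degraded_pending_review",
    "mark_readback_accepted_without_reviewer_record",
    "open_runtime_gate", "add_action_button",
})


class BlockedActionError(Exception):
    """Raised when a requested action is on the plan's blocked list."""


def ensure_not_blocked(action: str) -> None:
    """Raise BlockedActionError if the action is blocked; otherwise return None."""
    if action in BLOCKED_ACTIONS:
        raise BlockedActionError(f"action '{action}' is blocked by the readback plan")


if __name__ == "__main__":
    try:
        ensure_not_blocked("argocd_sync")
    except BlockedActionError as exc:
        print(exc)
```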
8. Next step
The next step is limited to receiving the redacted incident readback package from the GitOps / K8s owner. It must contain at least:
- incident / change / deploy marker ref
- actor role / team
- ArgoCD sync / health / revision summary
- Degraded / Pending / image pull / scheduling summary
- rollout before / after
- event summary, metrics / alert, drift scanner, CronJob schedule
- NetworkPolicy / Service / Ingress / NodePort / RBAC / ServiceAccount impact
- public / admin route, AI provider, monitoring, and backup / restore impact
- operator notification and cross-project sync ref
- recovery or still-degraded ref
- post-check, recurrence guard, maintenance window, rollback revision, rollback owner, followup owner
- no-secret-value, no-raw-manifest-or-kubeconfig, and no-false-green attestations
Until a reviewer record exists, post_incident_readback_received_count=0, post_incident_readback_accepted_count=0, runtime_gate_count=0, and action_button_count=0 must remain unchanged.
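The zero-count rule can be expressed as an invariant that is cheap to assert wherever the counters are reported. This is a sketch under assumptions: the counter names come from the sentence above, while the `Counters` dataclass, the `reviewer_record_present` flag, and the function name are hypothetical.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Counters:
    """The four counters that must stay at 0 until a reviewer record exists."""
    post_incident_readback_received_count: int = 0
    post_incident_readback_accepted_count: int = 0
    runtime_gate_count: int = 0
    action_button_count: int = 0


def counter_violations(counters: Counters, reviewer_record_present: bool) -> list:
    """Return the names of counters that are non-zero without a reviewer record."""
    if reviewer_record_present:
        # What is allowed after a reviewer record is outside this sketch.
        return []
    return [name for name, value in asdict(counters).items() if value != 0]


if __name__ == "__main__":
    print(counter_violations(Counters(), reviewer_record_present=False))
    print(counter_violations(Counters(runtime_gate_count=1), reviewer_record_present=False))
```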