fix(runner): 永久修復 _diag/pages 檔案衝突問題
問題: Runner 並行執行時 "file already exists" 導致 CD 失敗 解決方案: 1. CD Workflow: 刪除整個 _diag/pages 目錄再重建 (非 rm -rf /*) 2. Systemd Timer: 每 5 分鐘自動清理過期檔案 3. flock 鎖定: 防止清理程序競爭 新增檔案: - ops/runner/cleanup-runner-diag.sh - 清理腳本 - ops/runner/runner-diag-cleanup.service - Systemd service - ops/runner/runner-diag-cleanup.timer - 定時器 - ops/runner/deploy-runner-cleanup.sh - 部署腳本 - ops/runner/README.md - 文檔 部署指令: ssh wooo@192.168.0.110 bash awoooi/ops/runner/deploy-runner-cleanup.sh Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
76
.github/workflows/cd.yaml
vendored
76
.github/workflows/cd.yaml
vendored
@@ -56,13 +56,51 @@ jobs:
|
||||
runs-on: [self-hosted, harbor, k8s]
|
||||
timeout-minutes: 1
|
||||
steps:
|
||||
# 2026-03-26: 清理暫存目錄,避免 file conflict (pages + temp)
|
||||
- name: "Clean Runner temp"
|
||||
# =======================================================================
|
||||
# 2026-03-29: Runner _diag/pages 檔案衝突永久修復
|
||||
# 問題: 並行 Job 寫入同一診斷檔案導致 "file already exists"
|
||||
# 解法: 強制清理 + flock 鎖定 + 重建目錄
|
||||
# =======================================================================
|
||||
- name: "Clean Runner Diagnostics (Anti-Collision)"
|
||||
run: |
|
||||
set +e # 不因清理失敗而中斷
|
||||
|
||||
RUNNER_ROOT=$(dirname "$(dirname "$RUNNER_TEMP")")
|
||||
rm -rf "$RUNNER_TEMP"/* 2>/dev/null || true
|
||||
rm -rf "$RUNNER_ROOT/_diag/pages"/* 2>/dev/null || true
|
||||
rm -rf .claude/worktrees 2>/dev/null || true
|
||||
DIAG_DIR="$RUNNER_ROOT/_diag"
|
||||
PAGES_DIR="$DIAG_DIR/pages"
|
||||
LOCK_FILE="/tmp/runner-diag-cleanup.lock"
|
||||
|
||||
echo "🧹 Cleaning Runner diagnostics..."
|
||||
echo " RUNNER_ROOT: $RUNNER_ROOT"
|
||||
echo " PAGES_DIR: $PAGES_DIR"
|
||||
|
||||
# 使用 flock 確保同一時間只有一個清理程序
|
||||
(
|
||||
flock -w 10 200 || { echo "⚠️ Lock timeout, proceeding anyway"; }
|
||||
|
||||
# 1. 清理 _diag/pages (最關鍵)
|
||||
if [ -d "$PAGES_DIR" ]; then
|
||||
# 刪除所有 .log 檔案
|
||||
find "$PAGES_DIR" -name "*.log" -type f -delete 2>/dev/null
|
||||
# 重建目錄確保乾淨
|
||||
rm -rf "$PAGES_DIR" 2>/dev/null
|
||||
mkdir -p "$PAGES_DIR" 2>/dev/null
|
||||
echo " ✅ Cleaned _diag/pages"
|
||||
fi
|
||||
|
||||
# 2. 清理 RUNNER_TEMP
|
||||
rm -rf "$RUNNER_TEMP"/* 2>/dev/null
|
||||
echo " ✅ Cleaned RUNNER_TEMP"
|
||||
|
||||
# 3. 清理 Claude worktrees
|
||||
rm -rf .claude/worktrees 2>/dev/null
|
||||
|
||||
# 4. 清理陳舊的 _work 暫存
|
||||
find "$RUNNER_ROOT/_work" -name "*.tmp" -mmin +30 -delete 2>/dev/null || true
|
||||
|
||||
) 200>"$LOCK_FILE"
|
||||
|
||||
echo "✅ Runner cleanup completed"
|
||||
|
||||
# =======================================================================
|
||||
# ADR-035: Telegram 告警鏈路強制驗證
|
||||
@@ -122,11 +160,12 @@ jobs:
|
||||
web: ${{ inputs.force_deploy == true && 'true' || steps.filter.outputs.web }}
|
||||
k3s-system: ${{ steps.filter.outputs.k3s-system }}
|
||||
steps:
|
||||
# 2026-03-26: 清理暫存目錄 (temp + pages)
|
||||
- name: "Clean Runner temp"
|
||||
# 2026-03-29: Runner 診斷檔案清理 (防止並行衝突)
|
||||
- name: "Clean Runner Diagnostics"
|
||||
run: |
|
||||
RUNNER_ROOT=$(dirname "$(dirname "$RUNNER_TEMP")")
|
||||
rm -rf "$RUNNER_TEMP"/* "$RUNNER_ROOT/_diag/pages"/* .claude/worktrees 2>/dev/null || true
|
||||
rm -rf "$RUNNER_TEMP"/* "$RUNNER_ROOT/_diag/pages" .claude/worktrees 2>/dev/null || true
|
||||
mkdir -p "$RUNNER_ROOT/_diag/pages" 2>/dev/null || true
|
||||
|
||||
- uses: actions/checkout@v4
|
||||
with:
|
||||
@@ -162,11 +201,12 @@ jobs:
|
||||
outputs:
|
||||
image_tag: ${{ steps.tag.outputs.tag }}
|
||||
steps:
|
||||
# 2026-03-26: 清理暫存目錄 (temp + pages)
|
||||
- name: "Clean Runner temp"
|
||||
# 2026-03-29: Runner 診斷檔案清理 (防止並行衝突)
|
||||
- name: "Clean Runner Diagnostics"
|
||||
run: |
|
||||
RUNNER_ROOT=$(dirname "$(dirname "$RUNNER_TEMP")")
|
||||
rm -rf "$RUNNER_TEMP"/* "$RUNNER_ROOT/_diag/pages"/* .claude/worktrees 2>/dev/null || true
|
||||
rm -rf "$RUNNER_TEMP"/* "$RUNNER_ROOT/_diag/pages" .claude/worktrees 2>/dev/null || true
|
||||
mkdir -p "$RUNNER_ROOT/_diag/pages" 2>/dev/null || true
|
||||
|
||||
- uses: actions/checkout@v4
|
||||
|
||||
@@ -200,11 +240,12 @@ jobs:
|
||||
outputs:
|
||||
image_tag: ${{ steps.tag.outputs.tag }}
|
||||
steps:
|
||||
# 2026-03-26: 清理暫存目錄 (temp + pages)
|
||||
- name: "Clean Runner temp"
|
||||
# 2026-03-29: Runner 診斷檔案清理 (防止並行衝突)
|
||||
- name: "Clean Runner Diagnostics"
|
||||
run: |
|
||||
RUNNER_ROOT=$(dirname "$(dirname "$RUNNER_TEMP")")
|
||||
rm -rf "$RUNNER_TEMP"/* "$RUNNER_ROOT/_diag/pages"/* .claude/worktrees 2>/dev/null || true
|
||||
rm -rf "$RUNNER_TEMP"/* "$RUNNER_ROOT/_diag/pages" .claude/worktrees 2>/dev/null || true
|
||||
mkdir -p "$RUNNER_ROOT/_diag/pages" 2>/dev/null || true
|
||||
|
||||
- uses: actions/checkout@v4
|
||||
|
||||
@@ -245,11 +286,12 @@ jobs:
|
||||
if: always() && (needs.build-api.result == 'success' || needs.build-api.result == 'skipped') && (needs.build-web.result == 'success' || needs.build-web.result == 'skipped')
|
||||
environment: production
|
||||
steps:
|
||||
# 2026-03-26: 清理暫存目錄 (temp + pages)
|
||||
- name: "Clean Runner temp"
|
||||
# 2026-03-29: Runner 診斷檔案清理 (防止並行衝突)
|
||||
- name: "Clean Runner Diagnostics"
|
||||
run: |
|
||||
RUNNER_ROOT=$(dirname "$(dirname "$RUNNER_TEMP")")
|
||||
rm -rf "$RUNNER_TEMP"/* "$RUNNER_ROOT/_diag/pages"/* .claude/worktrees 2>/dev/null || true
|
||||
rm -rf "$RUNNER_TEMP"/* "$RUNNER_ROOT/_diag/pages" .claude/worktrees 2>/dev/null || true
|
||||
mkdir -p "$RUNNER_ROOT/_diag/pages" 2>/dev/null || true
|
||||
|
||||
- uses: actions/checkout@v4
|
||||
with:
|
||||
|
||||
66
ops/runner/README.md
Normal file
66
ops/runner/README.md
Normal file
@@ -0,0 +1,66 @@
|
||||
# GitHub Actions Runner 穩定性修復
|
||||
|
||||
## 問題: `_diag/pages` 檔案衝突
|
||||
|
||||
```
|
||||
Error: The file '/home/wooo/actions-runner-awoooi/_diag/pages/xxx.log' already exists.
|
||||
```
|
||||
|
||||
### 根因
|
||||
- GitHub Actions Runner 在執行 Job 時會寫入診斷日誌
|
||||
- 並行 Job 或快速連續執行可能產生 UUID 碰撞
|
||||
- 前次執行的殘留檔案未清理
|
||||
|
||||
### 解決方案
|
||||
|
||||
#### 1. CD Workflow 修復 (即時生效)
|
||||
每個 Job 開始前強制清理並重建 `_diag/pages` 目錄:
|
||||
|
||||
```yaml
|
||||
- name: "Clean Runner Diagnostics"
|
||||
run: |
|
||||
RUNNER_ROOT=$(dirname "$(dirname "$RUNNER_TEMP")")
|
||||
rm -rf "$RUNNER_TEMP"/* "$RUNNER_ROOT/_diag/pages" .claude/worktrees 2>/dev/null || true
|
||||
mkdir -p "$RUNNER_ROOT/_diag/pages" 2>/dev/null || true
|
||||
```
|
||||
|
||||
**關鍵**: 刪除整個目錄再重建,而非 `rm -rf _diag/pages/*`
|
||||
|
||||
#### 2. Systemd Timer (背景清理)
|
||||
每 5 分鐘自動清理過期的診斷檔案:
|
||||
|
||||
```bash
|
||||
# 部署
|
||||
ssh wooo@192.168.0.110
|
||||
cd /path/to/awoooi/ops/runner
|
||||
bash deploy-runner-cleanup.sh
|
||||
```
|
||||
|
||||
### 檔案說明
|
||||
|
||||
| 檔案 | 用途 |
|
||||
|------|------|
|
||||
| `cleanup-runner-diag.sh` | 清理腳本 (安裝到 Runner 目錄) |
|
||||
| `runner-diag-cleanup.service` | Systemd service 定義 |
|
||||
| `runner-diag-cleanup.timer` | Systemd timer (每 5 分鐘) |
|
||||
| `deploy-runner-cleanup.sh` | 一鍵部署腳本 |
|
||||
|
||||
### 監控
|
||||
|
||||
```bash
|
||||
# 查看 timer 狀態
|
||||
sudo systemctl status runner-diag-cleanup.timer
|
||||
|
||||
# 查看清理日誌
|
||||
journalctl -u runner-diag-cleanup.service -f
|
||||
|
||||
# 手動觸發清理
|
||||
sudo systemctl start runner-diag-cleanup.service
|
||||
```
|
||||
|
||||
### 相關文件
|
||||
- Memory: `feedback_runner_zombie_process.md`
|
||||
- ADR: 待建立 (如果問題持續)
|
||||
|
||||
---
|
||||
版本: v1.0 | 建立: 2026-03-29 | 作者: Claude Code
|
||||
82
ops/runner/cleanup-runner-diag.sh
Normal file
82
ops/runner/cleanup-runner-diag.sh
Normal file
@@ -0,0 +1,82 @@
|
||||
#!/bin/bash
|
||||
# =============================================================================
|
||||
# Runner Diagnostic Cleanup Script
|
||||
# =============================================================================
|
||||
# 解決 _diag/pages 檔案衝突問題
|
||||
#
|
||||
# 部署位置: 192.168.0.110 (awoooi-runner)
|
||||
# 執行方式: systemd timer 每 5 分鐘執行
|
||||
#
|
||||
# 版本: v1.0
|
||||
# 建立: 2026-03-29 (台北時區)
|
||||
# 建立者: Claude Code (Runner 穩定性修復)
|
||||
# =============================================================================
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
RUNNER_DIR="/home/wooo/actions-runner-awoooi"
|
||||
DIAG_PAGES_DIR="${RUNNER_DIR}/_diag/pages"
|
||||
LOG_FILE="/var/log/runner-diag-cleanup.log"
|
||||
MAX_AGE_MINUTES=10
|
||||
|
||||
log() {
|
||||
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
|
||||
}
|
||||
|
||||
# 只有當 Runner 不在執行 Job 時才清理
|
||||
check_runner_idle() {
|
||||
# 檢查是否有正在執行的 Job
|
||||
if pgrep -f "Runner.Worker" > /dev/null 2>&1; then
|
||||
return 1 # Runner 忙碌中
|
||||
fi
|
||||
return 0 # Runner 閒置
|
||||
}
|
||||
|
||||
cleanup_diag_pages() {
|
||||
if [[ ! -d "$DIAG_PAGES_DIR" ]]; then
|
||||
log "DIAG_PAGES_DIR not found, skipping"
|
||||
return 0
|
||||
fi
|
||||
|
||||
# 統計檔案數量
|
||||
local count=$(find "$DIAG_PAGES_DIR" -type f -name "*.log" 2>/dev/null | wc -l)
|
||||
|
||||
if [[ $count -eq 0 ]]; then
|
||||
return 0
|
||||
fi
|
||||
|
||||
log "Found $count diagnostic files"
|
||||
|
||||
# 刪除超過 MAX_AGE_MINUTES 的檔案
|
||||
local deleted=$(find "$DIAG_PAGES_DIR" -type f -name "*.log" -mmin +${MAX_AGE_MINUTES} -delete -print 2>/dev/null | wc -l)
|
||||
|
||||
if [[ $deleted -gt 0 ]]; then
|
||||
log "Deleted $deleted stale files (older than ${MAX_AGE_MINUTES}m)"
|
||||
fi
|
||||
}
|
||||
|
||||
cleanup_work_temp() {
|
||||
# 清理 _work/_temp 目錄中的殘留檔案
|
||||
local temp_dir="${RUNNER_DIR}/_work/_temp"
|
||||
if [[ -d "$temp_dir" ]]; then
|
||||
local deleted=$(find "$temp_dir" -type f -mmin +30 -delete -print 2>/dev/null | wc -l)
|
||||
if [[ $deleted -gt 0 ]]; then
|
||||
log "Deleted $deleted temp files from _work/_temp"
|
||||
fi
|
||||
fi
|
||||
}
|
||||
|
||||
main() {
|
||||
# 檢查 Runner 是否閒置
|
||||
if ! check_runner_idle; then
|
||||
log "Runner is busy, skipping cleanup"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
cleanup_diag_pages
|
||||
cleanup_work_temp
|
||||
|
||||
log "Cleanup completed"
|
||||
}
|
||||
|
||||
main "$@"
|
||||
52
ops/runner/deploy-runner-cleanup.sh
Normal file
52
ops/runner/deploy-runner-cleanup.sh
Normal file
@@ -0,0 +1,52 @@
|
||||
#!/bin/bash
|
||||
# =============================================================================
|
||||
# Deploy Runner Diagnostic Cleanup Service
|
||||
# =============================================================================
|
||||
# 在 192.168.0.110 (Runner 主機) 上執行此腳本
|
||||
#
|
||||
# 執行方式:
|
||||
# ssh wooo@192.168.0.110
|
||||
# bash /path/to/deploy-runner-cleanup.sh
|
||||
#
|
||||
# 版本: v1.0
|
||||
# 建立: 2026-03-29 (台北時區)
|
||||
# =============================================================================
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
RUNNER_DIR="/home/wooo/actions-runner-awoooi"
|
||||
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
|
||||
|
||||
echo "🚀 Deploying Runner Diagnostic Cleanup Service..."
|
||||
|
||||
# 1. 複製清理腳本
|
||||
echo "📋 Copying cleanup script..."
|
||||
cp "$SCRIPT_DIR/cleanup-runner-diag.sh" "$RUNNER_DIR/"
|
||||
chmod +x "$RUNNER_DIR/cleanup-runner-diag.sh"
|
||||
|
||||
# 2. 安裝 systemd 服務
|
||||
echo "📋 Installing systemd service..."
|
||||
sudo cp "$SCRIPT_DIR/runner-diag-cleanup.service" /etc/systemd/system/
|
||||
sudo cp "$SCRIPT_DIR/runner-diag-cleanup.timer" /etc/systemd/system/
|
||||
|
||||
# 3. 重載 systemd
|
||||
echo "🔄 Reloading systemd..."
|
||||
sudo systemctl daemon-reload
|
||||
|
||||
# 4. 啟用並啟動 timer
|
||||
echo "⏰ Enabling cleanup timer..."
|
||||
sudo systemctl enable --now runner-diag-cleanup.timer
|
||||
|
||||
# 5. 驗證
|
||||
echo ""
|
||||
echo "✅ Deployment complete!"
|
||||
echo ""
|
||||
echo "📊 Timer status:"
|
||||
sudo systemctl status runner-diag-cleanup.timer --no-pager || true
|
||||
echo ""
|
||||
echo "📊 Next scheduled runs:"
|
||||
sudo systemctl list-timers runner-diag-cleanup.timer --no-pager || true
|
||||
echo ""
|
||||
echo "📝 To test manually:"
|
||||
echo " sudo systemctl start runner-diag-cleanup.service"
|
||||
echo " journalctl -u runner-diag-cleanup.service -f"
|
||||
21
ops/runner/runner-diag-cleanup.service
Normal file
21
ops/runner/runner-diag-cleanup.service
Normal file
@@ -0,0 +1,21 @@
|
||||
# =============================================================================
|
||||
# Runner Diagnostic Cleanup Service
|
||||
# =============================================================================
|
||||
# 部署: sudo cp runner-diag-cleanup.service /etc/systemd/system/
|
||||
# 啟用: sudo systemctl enable --now runner-diag-cleanup.timer
|
||||
# =============================================================================
|
||||
|
||||
[Unit]
|
||||
Description=GitHub Actions Runner Diagnostic Cleanup
|
||||
After=network.target
|
||||
|
||||
[Service]
|
||||
Type=oneshot
|
||||
User=wooo
|
||||
Group=wooo
|
||||
ExecStart=/home/wooo/actions-runner-awoooi/cleanup-runner-diag.sh
|
||||
StandardOutput=journal
|
||||
StandardError=journal
|
||||
|
||||
[Install]
|
||||
WantedBy=multi-user.target
|
||||
18
ops/runner/runner-diag-cleanup.timer
Normal file
18
ops/runner/runner-diag-cleanup.timer
Normal file
@@ -0,0 +1,18 @@
|
||||
# =============================================================================
|
||||
# Runner Diagnostic Cleanup Timer
|
||||
# =============================================================================
|
||||
# 每 5 分鐘執行一次清理,防止 _diag/pages 檔案堆積
|
||||
# =============================================================================
|
||||
|
||||
[Unit]
|
||||
Description=Runner Diagnostic Cleanup Timer
|
||||
Requires=runner-diag-cleanup.service
|
||||
|
||||
[Timer]
|
||||
OnBootSec=1min
|
||||
OnUnitActiveSec=5min
|
||||
AccuracySec=30s
|
||||
Persistent=true
|
||||
|
||||
[Install]
|
||||
WantedBy=timers.target
|
||||
Reference in New Issue
Block a user