From 73d40cf1489970badf5deb738ea0f3912c359a6a Mon Sep 17 00:00:00 2001
From: Mustafa Bayar
Date: Tue, 1 Jul 2025 11:04:40 +0200
Subject: [PATCH] doc: Create high-level design for RAFT backups

---
 doc/raft_recovery.md | 269 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 269 insertions(+)
 create mode 100644 doc/raft_recovery.md

diff --git a/doc/raft_recovery.md b/doc/raft_recovery.md
new file mode 100644
index 00000000000..1b18ff0bfa1
--- /dev/null
+++ b/doc/raft_recovery.md
@@ -0,0 +1,269 @@
# Gitaly WAL Based Backups for RAFT

See [RAFT.md](raft.md) for an overview of Gitaly's Multi-Raft architecture.

## Executive Summary

This document describes how Gitaly's partition backup system will operate with RAFT clusters. Partition backups for non-RAFT deployments should remain unchanged, as the backup solution relies solely on transaction data.

## Current Architecture (Summary)

Gitaly has a backup solution called **Partition Backup** which takes the current read snapshot of a partition via the transaction manager and archives its filesystem:

- The `gitaly-backup` tool receives a list of servers via an environment variable.
- For each server, it discovers partitions through the server and calls the `BackupPartition` RPC.
- The `BackupPartition` RPC creates a TAR archive with the repositories and the partition's key-value state, and uploads it to the backup storage.
- Separately, WAL entry archiving happens continuously via hooks in the Partition Manager whenever a new log entry is applied.

### How It Works in the Existing Tooling

```mermaid
flowchart TD
    A[gitaly-backup tool] --> B[Receive backup request<br/>with gitaly servers]

    B --> C[Connect to Server A]
    B --> D[Connect to Server B]

    C --> E[ListPartitions<br/><br/>p1, p2]
    E --> F[BackupPartition p1]
    E --> G[BackupPartition p2]

    D --> H[ListPartitions<br/><br/>p3]
    H --> I[BackupPartition p3]

    style A fill:#e3f2fd
    style B fill:#fff9c4
    style E fill:#f3e5f5
    style H fill:#f3e5f5
    style F fill:#e8f5e9
    style G fill:#e8f5e9
    style I fill:#e8f5e9
```
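As a minimal sketch of the per-server fan-out described above, assuming simplified client stubs (the `Client` interface and its method signatures are illustrative, not the actual PartitionService gRPC API):

```go
package backup

import (
	"context"
	"fmt"
)

// Client is an illustrative stand-in for the per-server gRPC client; the
// real tool talks to Gitaly's partition service instead.
type Client interface {
	ListPartitions(ctx context.Context, storage string) ([]string, error)
	BackupPartition(ctx context.Context, storage, partitionID string) error
}

// backupServer discovers every partition on one server and archives each
// one, mirroring the fan-out in the diagram above.
func backupServer(ctx context.Context, c Client, storage string) error {
	partitions, err := c.ListPartitions(ctx, storage)
	if err != nil {
		return fmt.Errorf("list partitions on %q: %w", storage, err)
	}
	for _, id := range partitions {
		// BackupPartition archives the repositories and the partition's
		// key-value state as a TAR and uploads it to backup storage.
		if err := c.BackupPartition(ctx, storage, id); err != nil {
			return fmt.Errorf("backup partition %s/%s: %w", storage, id, err)
		}
	}
	return nil
}
```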
### Backup Object Storage Layout

```yaml
├── gitaly-backups/
│   ├── partition-backups/
│   │   ├── storage-a/
│   │   │   ├── partition-1/
│   │   │   │   ├── 0000000a.tar
│   │   │   │   └── 0000000f.tar
│   │   │   └── partition-2/
│   │   │       └── 0000000c.tar
│   │   └── storage-b/
│   │       └── partition-3/
│   │           └── 0000000a.tar
│   └── partition-manifests/
│       ├── storage-a/
│       │   ├── partition-1.json
│       │   └── partition-2.json
│       └── storage-b/
│           └── partition-3.json
└── gitaly-wal-backups/
    ├── storage-a/
    │   ├── partition-1/
    │   │   ├── 00000001.tar
    │   │   ├── 00000002.tar
    │   │   ├── 00000003.tar
    │   │   ├── 00000004.tar
    .   .   .
    .   .   .
    .   .   .
```

## Partition Backup for RAFT

### Core Principle: Leader-Only Backups

When RAFT is enabled, all backup operations are automatically routed to the partition's RAFT leader:

```mermaid
flowchart TD
A[BackupPartition request for Partition 2] --> B[Receiving Server]
B --> H{Is this partition stored<br/>in my storage?}
H -->|No| I[Redirect to correct<br/>RAFT group]
H -->|Yes| C{Is RAFT enabled?}
C -->|No| D[Execute backup]
C -->|Yes| E{Am I the leader?}
E -->|Yes| F[Execute backup]
E -->|No| G[Forward to leader]

style A fill:#e1f5fe
style D fill:#c8e6c9
style F fill:#c8e6c9
style G fill:#fff9c4
style I fill:#fff9c4
```
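A minimal sketch of that decision tree, with every field and method name hypothetical (the real logic would live in the `BackupPartition` RPC handler):

```go
package backup

import "context"

// server carries the hypothetical state the routing decision needs; every
// field here is illustrative, not Gitaly's actual API.
type server struct {
	raftEnabled     bool
	storesPartition func(id string) bool
	isLeader        func(id string) bool

	executeBackup   func(ctx context.Context, id string) error
	forwardToLeader func(ctx context.Context, id string) error
	redirectToGroup func(ctx context.Context, id string) error
}

// routeBackup mirrors the decision tree in the diagram above.
func (s *server) routeBackup(ctx context.Context, id string) error {
	switch {
	case !s.storesPartition(id):
		// The partition lives in another RAFT group: point the client there.
		return s.redirectToGroup(ctx, id)
	case !s.raftEnabled:
		// Non-RAFT deployments keep today's behaviour unchanged.
		return s.executeBackup(ctx, id)
	case s.isLeader(id):
		return s.executeBackup(ctx, id)
	default:
		// Followers forward so that only the leader archives data.
		return s.forwardToLeader(ctx, id)
	}
}
```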
### How It Would Work in the Existing Tooling

#### 1. Discovery Phase (Unchanged)

```mermaid
flowchart TD
    A[gitaly-backup tool] --> B[Connect to Server A]
    A --> C[Connect to Server B]
    A --> D[Connect to Server C]

    B --> E[ListPartitions<br/><br/>p1, p2, p4]
    C --> F[ListPartitions<br/><br/>p1, p2, p3]
    D --> G[ListPartitions<br/><br/>p1, p3, p4]

    style A fill:#e3f2fd
    style E fill:#f3e5f5
    style F fill:#f3e5f5
    style G fill:#f3e5f5
```

#### 2. Execution Phase (With RAFT Routing)

```mermaid
flowchart TD
    A[For each discovered partition] --> B[BackupPartition p1<br/>to ANY server]
    A --> C[BackupPartition p2<br/>to ANY server]
    A --> D[BackupPartition p3<br/>to ANY server]

    B --> E[Automatically routes<br/>to leader<br/>e.g., Server A]
    C --> F[Automatically routes<br/>to leader<br/>e.g., Server A]
    D --> G[Automatically routes<br/>to leader<br/>e.g., Server C]

    style A fill:#fff3e0
    style B fill:#e8f5e9
    style C fill:#e8f5e9
    style D fill:#e8f5e9
    style E fill:#e3f2fd
    style F fill:#e3f2fd
    style G fill:#e3f2fd

    E --> H[Result: Each partition backed up<br/>exactly once from its leader]
    F --> H
    G --> H

    style H fill:#c8e6c9,stroke:#4caf50,stroke-width:2px
```
### Key Benefits

1. **Consistency**: Always backs up the latest committed data
1. **Simplicity**: No need to track which server is the leader
1. **Deduplication**: The `BackupPartition` call handles de-duplication; multiple discoveries of the same partition result in a single backup (provided no WAL entries were produced between calls for that partition)
1. **Transparency**: Works with the existing `gitaly-backup` tool without changes

The de-duplication can be improved further if the recovery tooling tracks the list of partitions across different storages, as sketched below.
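For illustration, tracking partitions across storages could be as simple as collapsing the per-server discovery results before issuing any `BackupPartition` calls; this function is a hypothetical sketch, not existing tooling:

```go
package backup

// dedupePartitions collapses per-server discovery results (server name to
// partition IDs) so the tool issues a single BackupPartition call per
// partition, instead of relying on server-side de-duplication alone.
func dedupePartitions(discovered map[string][]string) []string {
	seen := make(map[string]struct{})
	var unique []string
	for _, partitions := range discovered {
		for _, id := range partitions {
			if _, ok := seen[id]; ok {
				continue
			}
			seen[id] = struct{}{}
			unique = append(unique, id)
		}
	}
	return unique
}
```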
### Alternative Solution

Instead of directly executing backup jobs, the node can create a job record that is persisted in the replicated storage (it can be part of the KV storage).
This makes jobs easier to track across nodes, and if leadership changes before a job is fully committed, the new leader can pick it up.
It also handles de-duplication, as we won't create a new job for a partition while the previous job for the same LSN is still incomplete. This approach is
inspired by [CockroachDB's backup architecture](https://www.cockroachlabs.com/docs/stable/backup-architecture).
If we decide on this solution, we need to discuss the overall architecture of the job processing, such as:

```yaml
- Job state machine (pending → running → completed)
- How to handle stale jobs
- Cleanup policies for completed jobs
```
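A rough sketch of what such a job record and its de-duplication rule could look like; the schema and all names are assumptions made to illustrate the idea, not a settled design:

```go
package backup

import "time"

// JobState follows the pending → running → completed lifecycle listed above.
type JobState string

const (
	JobPending   JobState = "pending"
	JobRunning   JobState = "running"
	JobCompleted JobState = "completed"
)

// BackupJob is a hypothetical record persisted in the replicated key-value
// storage so a new leader can resume it after a leadership change.
type BackupJob struct {
	PartitionID string
	LSN         uint64 // backups are de-duplicated per (partition, LSN)
	State       JobState
	StartedAt   time.Time // used to detect stale jobs
}

// enqueue creates a job only if no job for the same partition and LSN is
// still in flight, which gives de-duplication for free.
func enqueue(jobs map[string]BackupJob, j BackupJob) bool {
	if existing, ok := jobs[j.PartitionID]; ok &&
		existing.LSN == j.LSN && existing.State != JobCompleted {
		return false // previous job for this LSN has not finished yet
	}
	j.State = JobPending
	jobs[j.PartitionID] = j
	return true
}
```

Because the record lives in replicated storage, a new leader can scan for `pending` or stale `running` jobs and pick them up where the old leader left off.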
## WAL Archiving with RAFT (including auto snapshotting)

- Only RAFT leaders archive WAL entries
- The WAL archiver tracks metrics for backup triggers:
  - Total WAL size since the last base backup
  - Time elapsed since the last base backup
- When configured thresholds (time- or size-based) are exceeded, it triggers a base partition backup
- Leadership changes trigger an archiver handoff
- This ensures a single, authoritative WAL sequence

This approach eliminates the need for a separate auto-backup monitor, since the WAL archiver already has all the necessary information and runs only on leaders.

```markdown
┌────────────────────────────────────────────────────┐
│                  Simplified Flow                   │
├────────────────────────────────────────────────────┤
│                                                    │
│  Partition Manager                                 │
│       │                                            │
│       ▼                                            │
│  WAL Archiver (Leader Only)                        │
│       │                                            │
│       ├─── Archive WAL entry                       │
│       │                                            │
│       ├─── Track Metrics:                          │
│       │    • WAL archived since last backup        │
│       │    • Time since last backup                │
│       │                                            │
│       └─── Check Trigger Conditions:               │
│            • WAL size > threshold?                 │
│            • Time elapsed > max interval?          │
│                 │                                  │
│                 ▼ (if triggered)                   │
│         Trigger Partition Backup                   │
│                                                    │
│  Benefits:                                         │
│  • Single component handles both WAL and backups   │
│  • Natural checkpoint after X amount of changes    │
│  • No separate monitoring infrastructure           │
│  • Direct correlation between changes and backups  │
│                                                    │
└────────────────────────────────────────────────────┘
```
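A sketch of the trigger check the archiver could run after each archived entry, assuming hypothetical threshold names (these are not existing Gitaly configuration keys):

```go
package backup

import "time"

// triggerConfig holds the illustrative thresholds for base backups.
type triggerConfig struct {
	MaxWALBytes uint64        // size-based trigger
	MaxInterval time.Duration // time-based trigger
}

// archiverState is what the leader-only WAL archiver would track between
// base backups.
type archiverState struct {
	walBytesSinceBackup uint64
	lastBackup          time.Time
}

// shouldTriggerBackup evaluates the trigger conditions from the flow above.
func (s *archiverState) shouldTriggerBackup(cfg triggerConfig, now time.Time) bool {
	return s.walBytesSinceBackup > cfg.MaxWALBytes ||
		now.Sub(s.lastBackup) > cfg.MaxInterval
}

// onEntryArchived updates the counters after each archived WAL entry and
// fires a base partition backup when a threshold is crossed.
func (s *archiverState) onEntryArchived(cfg triggerConfig, entryBytes uint64, trigger func()) {
	s.walBytesSinceBackup += entryBytes
	now := time.Now()
	if s.shouldTriggerBackup(cfg, now) {
		trigger() // kick off a base partition backup
		s.walBytesSinceBackup = 0
		s.lastBackup = now
	}
}
```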
## Recovery Process

### Full Disaster Recovery Strategy

Restore first, then form the cluster:

```markdown
┌────────────────────────────────────────────────────┐
│            Full Disaster Recovery Flow             │
├────────────────────────────────────────────────────┤
│                                                    │
│  1. Prepare Empty Nodes                            │
│     └─► Install Gitaly on new servers              │
│                                                    │
│  2. Restore Partitions to Individual Nodes         │
│     ├─► Node 1: Restore partitions 1, 3, 5         │
│     ├─► Node 2: Restore partitions 2, 4, 6         │
│     └─► Node 3: Restore partitions 1-6 (subset)    │
│                                                    │
│  3. Start Nodes in Recovery Mode                   │
│     └─► Nodes start without RAFT enabled           │
│                                                    │
│  4. Form RAFT Cluster                              │
│     ├─► Initialize cluster topology                │
│     ├─► Assign partition replicas                  │
│     └─► Enable RAFT consensus                      │
│                                                    │
│  5. Replicate Missing Data                         │
│     └─► RAFT automatically syncs replicas          │
│                                                    │
└────────────────────────────────────────────────────┘
```

This approach requires the operator to know the list of partitions to be restored, which can be derived from the `/gitaly-wal-backups` directory in the backup storage.
Currently, there is no manifest that holds the entire list of partitions. If we need the partition restore process to be automated with a single restore command, without manual
partition discovery, we either need a separate manifest file that contains every partition, or we have to programmatically iterate through the `/gitaly-wal-backups` directory
to figure out all of the partitions that were backed up. Some options:

```yaml
Option 1: Discovery service
  - Tool that scans backup storage on-demand
  - Builds a special manifest dynamically for recovery which can be fed into the recovery tool
  - Special recovery manifest can make it possible to restore directly on the desired nodes
  - Guaranteed to be an accurate list of partitions that can be restored
  - Possible with the existing go-cloud blob integration (sketched below)

Option 2: Periodic manifest generation
  - WAL archiver updates a global manifest file
  - Lists all known partitions and their latest backups

Option 3: Metadata partition
  - Special partition that tracks all other partitions
  - Always restored first during recovery
```
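For Option 1, a sketch of the discovery scan using the go-cloud blob API that the codebase already integrates; the bucket URL and the manifest shape are illustrative:

```go
package backup

import (
	"context"
	"fmt"
	"io"
	"strings"

	"gocloud.dev/blob"
	_ "gocloud.dev/blob/s3blob" // or any other supported driver
)

// discoverPartitions scans the WAL backup prefix and derives the set of
// storage/partition pairs that can be restored. It assumes the current
// layout: gitaly-wal-backups/<storage>/<partition>/<entry>.tar.
func discoverPartitions(ctx context.Context, bucketURL string) (map[string][]string, error) {
	bucket, err := blob.OpenBucket(ctx, bucketURL)
	if err != nil {
		return nil, fmt.Errorf("open bucket: %w", err)
	}
	defer bucket.Close()

	partitions := map[string]map[string]struct{}{}
	iter := bucket.List(&blob.ListOptions{Prefix: "gitaly-wal-backups/"})
	for {
		obj, err := iter.Next(ctx)
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, fmt.Errorf("list objects: %w", err)
		}
		// A key looks like gitaly-wal-backups/storage-a/partition-1/00000001.tar.
		parts := strings.Split(obj.Key, "/")
		if len(parts) < 4 {
			continue
		}
		storage, partition := parts[1], parts[2]
		if partitions[storage] == nil {
			partitions[storage] = map[string]struct{}{}
		}
		partitions[storage][partition] = struct{}{}
	}

	// Flatten into the manifest shape a recovery tool could consume.
	manifest := map[string][]string{}
	for storage, set := range partitions {
		for partition := range set {
			manifest[storage] = append(manifest[storage], partition)
		}
	}
	return manifest, nil
}
```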
Another problem with the current architecture is the backup storage path. We tie the backup to the current storage, which makes it harder to discover during restore.
We could remove the storage name from the path, but that can affect non-RAFT architectures, as two different storages can contain the same partition ID for different repositories.
--
GitLab