From 73d40cf1489970badf5deb738ea0f3912c359a6a Mon Sep 17 00:00:00 2001
From: Mustafa Bayar
Date: Tue, 1 Jul 2025 11:04:40 +0200
Subject: [PATCH] doc: Create high-level design for RAFT backups

---
 doc/raft_recovery.md | 269 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 269 insertions(+)
 create mode 100644 doc/raft_recovery.md

diff --git a/doc/raft_recovery.md b/doc/raft_recovery.md
new file mode 100644
index 00000000000..1b18ff0bfa1
--- /dev/null
+++ b/doc/raft_recovery.md
@@ -0,0 +1,269 @@
# Gitaly WAL Based Backups for RAFT

See [RAFT.md](raft.md) for an overview of Gitaly's Multi-Raft architecture.

## Executive Summary

This document describes how Gitaly's partition backup system will operate with RAFT clusters. Partition backups for non-RAFT deployments should remain unchanged, as the backup solution relies solely on transaction data.

## Current Architecture (Summary)

Gitaly has a backup solution called **Partition Backup** which takes the current read snapshot of a partition via the transaction manager and archives its filesystem:

- The `gitaly-backup` tool receives a list of servers via an environment variable.
- For each server, it discovers partitions through the server and calls the `BackupPartition` RPC.
- The `BackupPartition` RPC creates a TAR archive with the repositories and the partition's key-value state, and uploads it to the backup storage.
- Separately, WAL entry archiving happens continuously via hooks in the Partition Manager whenever a new log entry is applied.

### How It Works in the Existing Tooling

```mermaid
flowchart TD
    A[gitaly-backup tool] --> B[Receive backup request<br/>with gitaly servers]

    B --> C[Connect to Server A]
    B --> D[Connect to Server B]

    C --> E[ListPartitions<br/><br/>p1, p2]
    E --> F[BackupPartition p1]
    E --> G[BackupPartition p2]

    D --> H[ListPartitions<br/><br/>p3]
    H --> I[BackupPartition p3]

    style A fill:#e3f2fd
    style B fill:#fff9c4
    style E fill:#f3e5f5
    style H fill:#f3e5f5
    style F fill:#e8f5e9
    style G fill:#e8f5e9
    style I fill:#e8f5e9
```
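As a minimal sketch of the per-server fan-out described above, assuming simplified client stubs (the `Client` interface and its method signatures are illustrative, not the actual PartitionService gRPC API):

```go
package backup

import (
	"context"
	"fmt"
)

// Client is an illustrative stand-in for the per-server gRPC client; the
// real tool talks to Gitaly's partition service instead.
type Client interface {
	ListPartitions(ctx context.Context, storage string) ([]string, error)
	BackupPartition(ctx context.Context, storage, partitionID string) error
}

// backupServer discovers every partition on one server and archives each
// one, mirroring the fan-out in the diagram above.
func backupServer(ctx context.Context, c Client, storage string) error {
	partitions, err := c.ListPartitions(ctx, storage)
	if err != nil {
		return fmt.Errorf("list partitions on %q: %w", storage, err)
	}
	for _, id := range partitions {
		// BackupPartition archives the repositories and the partition's
		// key-value state as a TAR and uploads it to backup storage.
		if err := c.BackupPartition(ctx, storage, id); err != nil {
			return fmt.Errorf("backup partition %s/%s: %w", storage, id, err)
		}
	}
	return nil
}
```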
### Backup Object Storage Layout

```yaml
├── gitaly-backups/
│   ├── partition-backups/
│   │   ├── storage-a/
│   │   │   ├── partition-1/
│   │   │   │   ├── 0000000a.tar
│   │   │   │   └── 0000000f.tar
│   │   │   └── partition-2/
│   │   │       └── 0000000c.tar
│   │   └── storage-b/
│   │       └── partition-3/
│   │           └── 0000000a.tar
│   └── partition-manifests/
│       ├── storage-a/
│       │   ├── partition-1.json
│       │   └── partition-2.json
│       └── storage-b/
│           └── partition-3.json
└── gitaly-wal-backups/
    ├── storage-a/
    │   ├── partition-1/
    │   │   ├── 00000001.tar
    │   │   ├── 00000002.tar
    │   │   ├── 00000003.tar
    │   │   ├── 00000004.tar
    .   .   .
    .   .   .
    .   .   .
```

## Partition Backup for RAFT

### Core Principle: Leader-Only Backups

When RAFT is enabled, all backup operations are automatically routed to the partition's RAFT leader:

```mermaid
flowchart TD
A[BackupPartition request for Partition 2] --> B[Receiving Server]
B --> H{Is this partition stored<br/>in my storage?}
H -->|No| I[Redirect to correct<br/>RAFT group]
H -->|Yes| C{Is RAFT enabled?}
C -->|No| D[Execute backup]
C -->|Yes| E{Am I the leader?}
E -->|Yes| F[Execute backup]
E -->|No| G[Forward to leader]

style A fill:#e1f5fe
style D fill:#c8e6c9
style F fill:#c8e6c9
style G fill:#fff9c4
style I fill:#fff9c4
```
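A minimal sketch of that decision tree, with every field and method name hypothetical (the real logic would live in the `BackupPartition` RPC handler):

```go
package backup

import "context"

// server carries the hypothetical state the routing decision needs; every
// field here is illustrative, not Gitaly's actual API.
type server struct {
	raftEnabled     bool
	storesPartition func(id string) bool
	isLeader        func(id string) bool

	executeBackup   func(ctx context.Context, id string) error
	forwardToLeader func(ctx context.Context, id string) error
	redirectToGroup func(ctx context.Context, id string) error
}

// routeBackup mirrors the decision tree in the diagram above.
func (s *server) routeBackup(ctx context.Context, id string) error {
	switch {
	case !s.storesPartition(id):
		// The partition lives in another RAFT group: point the client there.
		return s.redirectToGroup(ctx, id)
	case !s.raftEnabled:
		// Non-RAFT deployments keep today's behaviour unchanged.
		return s.executeBackup(ctx, id)
	case s.isLeader(id):
		return s.executeBackup(ctx, id)
	default:
		// Followers forward so that only the leader archives data.
		return s.forwardToLeader(ctx, id)
	}
}
```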
### How It Would Work in the Existing Tooling

#### 1. Discovery Phase (Unchanged)

```mermaid
flowchart TD
    A[gitaly-backup tool] --> B[Connect to Server A]
    A --> C[Connect to Server B]
    A --> D[Connect to Server C]

    B --> E[ListPartitions<br/><br/>p1, p2, p4]
    C --> F[ListPartitions<br/><br/>p1, p2, p3]
    D --> G[ListPartitions<br/><br/>p1, p3, p4]

    style A fill:#e3f2fd
    style E fill:#f3e5f5
    style F fill:#f3e5f5
    style G fill:#f3e5f5
```

#### 2. Execution Phase (With RAFT Routing)

```mermaid
flowchart TD
    A[For each discovered partition] --> B[BackupPartition p1<br/>to ANY server]
    A --> C[BackupPartition p2<br/>to ANY server]
    A --> D[BackupPartition p3<br/>to ANY server]

    B --> E[Automatically routes<br/>to leader<br/>e.g., Server A]
    C --> F[Automatically routes<br/>to leader<br/>e.g., Server A]
    D --> G[Automatically routes<br/>to leader<br/>e.g., Server C]

    style A fill:#fff3e0
    style B fill:#e8f5e9
    style C fill:#e8f5e9
    style D fill:#e8f5e9
    style E fill:#e3f2fd
    style F fill:#e3f2fd
    style G fill:#e3f2fd

    E --> H[Result: Each partition backed up<br/>exactly once from its leader]
    F --> H
    G --> H

    style H fill:#c8e6c9,stroke:#4caf50,stroke-width:2px
```
### Key Benefits

1. **Consistency**: Always backs up the latest committed data
1. **Simplicity**: No need to track which server is the leader
1. **Deduplication**: The `BackupPartition` call handles de-duplication; multiple discoveries of the same partition result in a single backup (provided no WAL entries were produced between calls for that partition)
1. **Transparency**: Works with the existing `gitaly-backup` tool without changes

The de-duplication can be improved further if the recovery tooling tracks the list of partitions across different storages, as sketched below.
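For illustration, tracking partitions across storages could be as simple as collapsing the per-server discovery results before issuing any `BackupPartition` calls; this function is a hypothetical sketch, not existing tooling:

```go
package backup

// dedupePartitions collapses per-server discovery results (server name to
// partition IDs) so the tool issues a single BackupPartition call per
// partition, instead of relying on server-side de-duplication alone.
func dedupePartitions(discovered map[string][]string) []string {
	seen := make(map[string]struct{})
	var unique []string
	for _, partitions := range discovered {
		for _, id := range partitions {
			if _, ok := seen[id]; ok {
				continue
			}
			seen[id] = struct{}{}
			unique = append(unique, id)
		}
	}
	return unique
}
```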
### Alternative Solution

Instead of directly executing backup jobs, the node can create a job record that is persisted in the replicated storage (it can be part of the KV storage).
This makes jobs easier to track across nodes, and if leadership changes before a job is fully committed, the new leader can pick it up.
It also handles de-duplication, as we won't create a new job for a partition while the previous job for the same LSN is still incomplete. This approach is
inspired by [CockroachDB's backup architecture](https://www.cockroachlabs.com/docs/stable/backup-architecture).
If we decide on this solution, we need to discuss the overall architecture of the job processing, such as:

```yaml
- Job state machine (pending → running → completed)
- How to handle stale jobs
- Cleanup policies for completed jobs
```
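A rough sketch of what such a job record and its de-duplication rule could look like; the schema and all names are assumptions made to illustrate the idea, not a settled design:

```go
package backup

import "time"

// JobState follows the pending → running → completed lifecycle listed above.
type JobState string

const (
	JobPending   JobState = "pending"
	JobRunning   JobState = "running"
	JobCompleted JobState = "completed"
)

// BackupJob is a hypothetical record persisted in the replicated key-value
// storage so a new leader can resume it after a leadership change.
type BackupJob struct {
	PartitionID string
	LSN         uint64 // backups are de-duplicated per (partition, LSN)
	State       JobState
	StartedAt   time.Time // used to detect stale jobs
}

// enqueue creates a job only if no job for the same partition and LSN is
// still in flight, which gives de-duplication for free.
func enqueue(jobs map[string]BackupJob, j BackupJob) bool {
	if existing, ok := jobs[j.PartitionID]; ok &&
		existing.LSN == j.LSN && existing.State != JobCompleted {
		return false // previous job for this LSN has not finished yet
	}
	j.State = JobPending
	jobs[j.PartitionID] = j
	return true
}
```

Because the record lives in replicated storage, a new leader can scan for `pending` or stale `running` jobs and pick them up where the old leader left off.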
## WAL Archiving with RAFT (including auto snapshotting)

- Only RAFT leaders archive WAL entries
- The WAL archiver tracks metrics for backup triggers:
  - Total WAL size since the last base backup
  - Time elapsed since the last base backup
- When configured thresholds (time- or size-based) are exceeded, it triggers a base partition backup
- Leadership changes trigger an archiver handoff
- This ensures a single, authoritative WAL sequence

This approach eliminates the need for a separate auto-backup monitor, since the WAL archiver already has all the necessary information and runs only on leaders.

```markdown
┌────────────────────────────────────────────────────┐
│                  Simplified Flow                   │
├────────────────────────────────────────────────────┤
│                                                    │
│  Partition Manager                                 │
│       │                                            │
│       ▼                                            │
│  WAL Archiver (Leader Only)                        │
│       │                                            │
│       ├─── Archive WAL entry                       │
│       │                                            │
│       ├─── Track Metrics:                          │
│       │    • WAL archived since last backup        │
│       │    • Time since last backup                │
│       │                                            │
│       └─── Check Trigger Conditions:               │
│            • WAL size > threshold?                 │
│            • Time elapsed > max interval?          │
│                 │                                  │
│                 ▼ (if triggered)                   │
│         Trigger Partition Backup                   │
│                                                    │
│  Benefits:                                         │
│  • Single component handles both WAL and backups   │
│  • Natural checkpoint after X amount of changes    │
│  • No separate monitoring infrastructure           │
│  • Direct correlation between changes and backups  │
│                                                    │
└────────────────────────────────────────────────────┘
```
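A sketch of the trigger check the archiver could run after each archived entry, assuming hypothetical threshold names (these are not existing Gitaly configuration keys):

```go
package backup

import "time"

// triggerConfig holds the illustrative thresholds for base backups.
type triggerConfig struct {
	MaxWALBytes uint64        // size-based trigger
	MaxInterval time.Duration // time-based trigger
}

// archiverState is what the leader-only WAL archiver would track between
// base backups.
type archiverState struct {
	walBytesSinceBackup uint64
	lastBackup          time.Time
}

// shouldTriggerBackup evaluates the trigger conditions from the flow above.
func (s *archiverState) shouldTriggerBackup(cfg triggerConfig, now time.Time) bool {
	return s.walBytesSinceBackup > cfg.MaxWALBytes ||
		now.Sub(s.lastBackup) > cfg.MaxInterval
}

// onEntryArchived updates the counters after each archived WAL entry and
// fires a base partition backup when a threshold is crossed.
func (s *archiverState) onEntryArchived(cfg triggerConfig, entryBytes uint64, trigger func()) {
	s.walBytesSinceBackup += entryBytes
	now := time.Now()
	if s.shouldTriggerBackup(cfg, now) {
		trigger() // kick off a base partition backup
		s.walBytesSinceBackup = 0
		s.lastBackup = now
	}
}
```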
## Recovery Process

### Full Disaster Recovery Strategy

Restore first, then form the cluster:

```markdown
┌────────────────────────────────────────────────────┐
│            Full Disaster Recovery Flow             │
├────────────────────────────────────────────────────┤
│                                                    │
│  1. Prepare Empty Nodes                            │
│     └─► Install Gitaly on new servers              │
│                                                    │
│  2. Restore Partitions to Individual Nodes         │
│     ├─► Node 1: Restore partitions 1, 3, 5         │
│     ├─► Node 2: Restore partitions 2, 4, 6         │
│     └─► Node 3: Restore partitions 1-6 (subset)    │
│                                                    │
│  3. Start Nodes in Recovery Mode                   │
│     └─► Nodes start without RAFT enabled           │
│                                                    │
│  4. Form RAFT Cluster                              │
│     ├─► Initialize cluster topology                │
│     ├─► Assign partition replicas                  │
│     └─► Enable RAFT consensus                      │
│                                                    │
│  5. Replicate Missing Data                         │
│     └─► RAFT automatically syncs replicas          │
│                                                    │
└────────────────────────────────────────────────────┘
```

This approach requires the operator to know the list of partitions to be restored, which can be derived from the `/gitaly-wal-backups` directory in the backup storage.
Currently, there is no manifest that holds the entire list of partitions. If we need the partition restore process to be automated with a single restore command, without manual
partition discovery, we either need a separate manifest file that contains every partition, or we have to programmatically iterate through the `/gitaly-wal-backups` directory
to figure out all of the partitions that were backed up. Some options:

```yaml
Option 1: Discovery service
  - Tool that scans backup storage on-demand
  - Builds a special manifest dynamically for recovery which can be fed into the recovery tool
  - Special recovery manifest can make it possible to restore directly on the desired nodes
  - Guaranteed to be an accurate list of partitions that can be restored
  - Possible with the existing go-cloud blob integration (sketched below)

Option 2: Periodic manifest generation
  - WAL archiver updates a global manifest file
  - Lists all known partitions and their latest backups

Option 3: Metadata partition
  - Special partition that tracks all other partitions
  - Always restored first during recovery
```
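For Option 1, a sketch of the discovery scan using the go-cloud blob API that the codebase already integrates; the bucket URL and the manifest shape are illustrative:

```go
package backup

import (
	"context"
	"fmt"
	"io"
	"strings"

	"gocloud.dev/blob"
	_ "gocloud.dev/blob/s3blob" // or any other supported driver
)

// discoverPartitions scans the WAL backup prefix and derives the set of
// storage/partition pairs that can be restored. It assumes the current
// layout: gitaly-wal-backups/<storage>/<partition>/<entry>.tar.
func discoverPartitions(ctx context.Context, bucketURL string) (map[string][]string, error) {
	bucket, err := blob.OpenBucket(ctx, bucketURL)
	if err != nil {
		return nil, fmt.Errorf("open bucket: %w", err)
	}
	defer bucket.Close()

	partitions := map[string]map[string]struct{}{}
	iter := bucket.List(&blob.ListOptions{Prefix: "gitaly-wal-backups/"})
	for {
		obj, err := iter.Next(ctx)
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, fmt.Errorf("list objects: %w", err)
		}
		// A key looks like gitaly-wal-backups/storage-a/partition-1/00000001.tar.
		parts := strings.Split(obj.Key, "/")
		if len(parts) < 4 {
			continue
		}
		storage, partition := parts[1], parts[2]
		if partitions[storage] == nil {
			partitions[storage] = map[string]struct{}{}
		}
		partitions[storage][partition] = struct{}{}
	}

	// Flatten into the manifest shape a recovery tool could consume.
	manifest := map[string][]string{}
	for storage, set := range partitions {
		for partition := range set {
			manifest[storage] = append(manifest[storage], partition)
		}
	}
	return manifest, nil
}
```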
Another problem with the current architecture is the backup storage path. We tie the backup to the current storage, which makes it harder to discover during restore.
We could remove the storage name from the path, but that can affect non-RAFT architectures, as two different storages can contain the same partition ID for different repositories.
--
GitLab