raft: Add gossip protocol foundation for cluster-wide routing table discovery

Overview

This MR adds the foundation for gossip-based routing table propagation in Gitaly Raft clusters. Currently, each Raft group maintains its own routing table in isolation: nodes only know about the partitions they directly manage. This limitation prevents efficient cross-partition request routing in multi-Raft-group deployments.

This MR implements the first three deliverables of the gossip propagation project: configuration infrastructure, memberlist management, and protobuf message definitions. These components establish the groundwork for cluster-wide partition discovery without requiring external coordination services.

What does this MR contain?

  • Adds an optional Gossip configuration section under [raft] in config.toml
  • Auto-detects the advertise address from the hostname, falling back to 127.0.0.1 (see the sketch after the configuration examples)
  • Integrates the HashiCorp memberlist v0.5.1 dependency
  • Implements MemberlistManager with full lifecycle management (start/join/shutdown); see the sketch after this list
  • Uses an EventDelegate for handling node join/leave/update events
  • Handles seed discovery and degrades gracefully when gossip initialization fails
  • Adds a GossipRoutingTableUpdate message to proto/cluster.proto
  • Implements serialization/deserialization helpers with comprehensive validation (also sketched below)
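
For orientation, the following is a minimal sketch of what the MemberlistManager lifecycle could look like on top of HashiCorp memberlist. The type name and the start/join/shutdown responsibilities come from this MR; the field names, method signatures, and logging are illustrative assumptions rather than the actual Gitaly implementation.

// Sketch only: create the memberlist, join seeds, shut down cleanly.
// Configuration wiring and error handling are simplified for illustration.
package gossip

import (
	"fmt"
	"time"

	"github.com/hashicorp/memberlist"
)

// eventDelegate is a hypothetical memberlist.EventDelegate that reports
// membership changes; the real delegate would feed the routing layer.
type eventDelegate struct{}

func (d *eventDelegate) NotifyJoin(n *memberlist.Node)   { fmt.Printf("node joined: %s\n", n.Name) }
func (d *eventDelegate) NotifyLeave(n *memberlist.Node)  { fmt.Printf("node left: %s\n", n.Name) }
func (d *eventDelegate) NotifyUpdate(n *memberlist.Node) { fmt.Printf("node updated: %s\n", n.Name) }

// MemberlistManager owns the memberlist instance for this node.
type MemberlistManager struct {
	list  *memberlist.Memberlist
	seeds []string
}

// Start creates the memberlist and attempts to join the configured seeds.
// If no seed can be reached, the node keeps running as a single-member
// cluster, mirroring the graceful-degradation behaviour described above.
func (m *MemberlistManager) Start(bindAddr string, bindPort int, advertiseAddr string, advertisePort int) error {
	cfg := memberlist.DefaultLANConfig()
	cfg.BindAddr = bindAddr
	cfg.BindPort = bindPort
	cfg.AdvertiseAddr = advertiseAddr
	cfg.AdvertisePort = advertisePort
	cfg.Events = &eventDelegate{}

	list, err := memberlist.Create(cfg)
	if err != nil {
		return fmt.Errorf("create memberlist: %w", err)
	}
	m.list = list

	if len(m.seeds) > 0 {
		if _, err := list.Join(m.seeds); err != nil {
			// Degrade gracefully: log and continue with local-only membership.
			fmt.Printf("gossip seed join failed, continuing standalone: %v\n", err)
		}
	}
	return nil
}

// Shutdown leaves the cluster politely, then stops background goroutines.
func (m *MemberlistManager) Shutdown() error {
	if m.list == nil {
		return nil
	}
	if err := m.list.Leave(5 * time.Second); err != nil {
		return err
	}
	return m.list.Shutdown()
}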

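The serialization and validation helpers around GossipRoutingTableUpdate can be pictured along these lines. The real code operates on the generated protobuf type from proto/cluster.proto; here a plain Go struct and encoding/json stand in so the sketch stays self-contained, and every field and helper name is hypothetical.

// Sketch only: serialize, deserialize, and validate routing updates before
// they are broadcast or applied. Field names are assumptions, not the actual
// proto/cluster.proto schema.
package gossip

import (
	"encoding/json"
	"fmt"
)

// RoutingUpdate is a stand-in for the GossipRoutingTableUpdate message.
type RoutingUpdate struct {
	ClusterID    string   `json:"cluster_id"`
	PartitionKey string   `json:"partition_key"`
	Replicas     []string `json:"replicas"`
	Term         uint64   `json:"term"`
}

// Marshal serializes an update for broadcast over gossip.
func Marshal(u *RoutingUpdate) ([]byte, error) {
	if err := validate(u); err != nil {
		return nil, err
	}
	return json.Marshal(u)
}

// Unmarshal parses and validates an update received from another node.
func Unmarshal(data []byte) (*RoutingUpdate, error) {
	var u RoutingUpdate
	if err := json.Unmarshal(data, &u); err != nil {
		return nil, fmt.Errorf("decode routing update: %w", err)
	}
	if err := validate(&u); err != nil {
		return nil, err
	}
	return &u, nil
}

// validate rejects obviously malformed updates before they reach the
// aggregated routing table.
func validate(u *RoutingUpdate) error {
	switch {
	case u.ClusterID == "":
		return fmt.Errorf("routing update missing cluster_id")
	case u.PartitionKey == "":
		return fmt.Errorf("routing update missing partition_key")
	case len(u.Replicas) == 0:
		return fmt.Errorf("routing update has no replicas")
	}
	return nil
}
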
High-level design

┌─────────────────────────────────────────────────────────┐
│                 Gitaly Node (Future State)               │
├─────────────────────────────────────────────────────────┤
│  RaftEnabledStorage                                      │
│  ├─ RoutingTable (per-partition, BadgerDB)               │
│  ├─ GossipManager (IN THIS MR: Config + Memberlist)     │
│  │  ├─ MemberlistManager ✅ (cluster membership)        │
│  │  ├─ RoutingBroadcaster ⏸️ (sends updates)            │
│  │  ├─ RoutingListener ⏸️ (receives updates)            │
│  │  └─ ClusterRoutingTable ⏸️ (aggregated view)         │
│  └─ Replica (Raft group)                                 │
│     └─ processConfChange() → BroadcastUpdate() ⏸️       │
└─────────────────────────────────────────────────────────┘
           │                                    ▲
           │ Gossip Protocol (memberlist SWIM)  │
           │ (Broadcasting logic in future MR)  │
           ▼                                    │
┌─────────────────────────────────────────────────────────┐
│              Other Gitaly Nodes (Future)                 │
│  Discover partitions via gossip, route cross-group       │
└─────────────────────────────────────────────────────────┘

The gossip protocol will enable:

  • Cluster-wide partition discovery: Any node can query routing info for any partition
  • Automatic membership tracking: SWIM protocol detects node joins/leaves
  • Eventual consistency: Updates propagate within ~1 second cluster-wide
  • Graceful degradation: Continues operating if gossip fails (local routing still works)

Configuration Examples

Minimal (relies on defaults):

[raft]
  enabled = true
  cluster_id = "my-cluster-uuid"
# Gossip auto-configures: 0.0.0.0:7946, auto-detected advertise address

With Overrides (for complex networks):

[raft]
  enabled = true
  cluster_id = "my-cluster-uuid"

  [raft.gossip]
    advertise_addr = "public-ip.example.com"  # Override auto-detection
    advertise_port = 7946
    seeds = ["node1:7946", "node2:7946"]      # Manual seed list
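
The advertise-address auto-detection mentioned above can be sketched roughly as follows. The fallback order (explicit config, then hostname resolution, then 127.0.0.1) follows the MR description, while the function and parameter names are assumptions rather than the actual implementation.

// Sketch only: pick the address other nodes should use to reach this node.
package gossip

import (
	"net"
	"os"
)

// resolveAdvertiseAddr prefers the configured value, then an address derived
// from the hostname, then falls back to 127.0.0.1.
func resolveAdvertiseAddr(configured string) string {
	if configured != "" {
		return configured // explicit override from [raft.gossip] wins
	}
	if hostname, err := os.Hostname(); err == nil {
		if addrs, err := net.LookupHost(hostname); err == nil && len(addrs) > 0 {
			return addrs[0] // first address the hostname resolves to
		}
	}
	return "127.0.0.1" // last-resort fallback for single-node setups
}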

What's next?

This MR establishes the foundation. Subsequent MRs will add:

  1. Cluster Routing Table - Aggregated view with merge logic
  2. Routing Table Delegate - Gossip delegate for broadcasting/receiving updates
  3. Raft Integration - Hook routing table updates into Raft replica lifecycle
  4. Integration Tests - Multi-node cluster scenarios with gossip propagation
  5. Observability - Additional metrics and structured logging
