diff --git a/doc/failure_analysis/index.md b/doc/failure_analysis/index.md new file mode 100644 index 0000000000000000000000000000000000000000..c1e400286dbdc1cfc5053400526d45bf578be8bd --- /dev/null +++ b/doc/failure_analysis/index.md @@ -0,0 +1,113 @@ +# Gitaly Cluster Failure Mode Analysis + +This document attempts to describe the current failure modes of Gitaly +Cluster across versions in an attempt to identify areas of improvements. + +## Writes + +The following sequence diagram shows what happens during a write in +Gitaly Cluster for GitLab 14.0.x: + +```mermaid +sequenceDiagram + autonumber + Client->>+Workhorse: git push + Workhorse->>+Praefect: ReceivePack + Praefect->>+Database: GetPrimary() + Database->>+Database: SELECT primary FROM repositories + Database->>+Praefect: Gitaly A + Praefect->>+Gitaly: ReceivePack + Gitaly->>+Git: git receive-pack + Git->>+Hook: update HEAD master + Hook->>+Praefect: TX: update HEAD master + Praefect->>+Praefect: collect votes + Praefect->>+Hook: Commit + Hook->>+Git: exit 0 + Git->>+Gitaly: exit 0 + Gitaly->>+Praefect: success + Praefect->>Workhorse: success + Workhorse->>Client: success +``` + +In step 3, when Praefect receives the `ReceivePack` RPC from Workhorse, +it calls `GetPrimary()`, which consults the database. Praefect considers +a primary repository on a per-repository basis, which means that in a +cluster of 3 nodes, the primary repository could reside in any of the 3 +nodes. Each entry in the `repositories` table has a row pertaining to a +specific repository, and the `primary` column denotes the current Gitaly +node serving as the primary. + +### Failover + +In GitLab 14.0.x, failover happens whenever a majority of Praefect nodes +deem that a Gitaly node is no longer a valid primary: + +1. The Gitaly node is no longer in the [`valid_primaries` database view](https://gitlab.com/gitlab-org/gitaly/blob/d0083f4c828772537e6891cae4fe0df1f6b255f4/internal/praefect/datastore/migrations/20210525143540_healthy_storages_view.go#L9-20). +1. `valid_primaries` depends on the [`healthy_storages` database view](https://gitlab.com/gitlab-org/gitaly/blob/d0083f4c828772537e6891cae4fe0df1f6b255f4/internal/praefect/datastore/migrations/20210525143540_healthy_storages_view.go#L9-21). +2. `healthy_storages` depends on the [`node_status` database table](https://gitlab.com/gitlab-org/gitaly/blob/d0083f4c828772537e6891cae4fe0df1f6b255f4/internal/praefect/datastore/migrations/20210525143540_healthy_storages_view.go#L11). +3. Praefect nodes attempt to send `HealthCheck` RPC messages to Gitaly nodes once per second and update the [`node_status` table](https://gitlab.com/gitlab-org/gitaly/blob/098c6dcdbde9824385d61b6cc56c2e10724a104b/internal/praefect/nodes/health_manager.go#L144-154) every time. + +A failover is triggered whenever a primary node is no longer in the +`valid_primaries` table. This happens when a majority of Praefect nodes: + +1. Have updated the `node_status` table to indicate they attempted to +contact the Gitaly node 60 seconds ago. +2. The last successful response from that node was over 10 seconds ago. + +In addition, a primary node can also be demoted if the [`replication_queue` for that +node meets certain criteria](https://gitlab.com/gitlab-org/gitaly/blob/371310f8236046666f75710ef02b016011b87deb/internal/praefect/datastore/migrations/20210525173505_valid_primaries_view.go#L22-32). +This can happen if the [Praefect reconciler creates a `delete_replica` job for +that Gitaly node that has not yet been completed](https://gitlab.com/gitlab-org/gitaly/blob/12e0bf3ac80b72bef07a5733a70c270f70771859/internal/praefect/reconciler/reconciler.go#L95-107). + +This failover is triggered when a Praefect detects that the +[`healthy_storages` view has changed](https://gitlab.com/gitlab-org/gitaly/blob/098c6dcdbde9824385d61b6cc56c2e10724a104b/internal/praefect/nodes/health_manager.go#L164-189). + +For GitLab 14.0.x, when a failover is triggered, Praefect updates **all** repositories +pointing to the original Gitaly node to the new primary. For example, +suppose there are three Gitaly nodes: `gitaly-1`, `gitaly-2`, and +`gitaly-3`. if `gitaly-1` has been marked down, Praefect will attempt to +update the `repositories.primary` column to point to a new primary +chosen randomly. If `gitaly-2` and `gitaly-3` are available, one of them +will randomly be picked. + +#### Differences with GitLab 14.1 + +Starting with GitLab 14.1, Gitaly has been changed to perform the +failover election lazily. That means the failover will NOT affect +**all** repositories pointing to the original node. + +### Failure Mode Analysis + +Based on the background above, we can summarize possible failures: + +| Failure mode | Generic root cause | Specific root causes | Likelihood | Effect level | +|--------------------|-------------------------------------------|-------------------------------------------|------------|--------------| +| Writes not working | Workhorse not relaying ReceivePack | Network outage | | | +| | | DNS failure | | | +| | | Configuration error | | | +| | Incorrect/invalid Praefect database state | Stalled database queries | | | +| | | Database connection limits | | | +| | | Out of disk space | | | +| | | Praefect migrations not applied | | | +| | | `delete_replica` job incorrectly inserted | | | +| | | Single node restored from backup | | | +| | | Missing/deleted `repositories` entry | | | +| | Node status not properly updating | Praefect deadlock | | | +| | | Gitaly deadlock (Health RPC OK) | | | +| | | Database deadlock | | | +| | | Database table/row locks | | | +| | | Inconsistent network partition | | | +| | | Clocks out of sync | | | +| | | Configuration error (e.g. auth) | | | +| | Repository corruption | Hardware reboots/failures | | | +| | | Split-brain due to improper failover | | | +| | | Out of disk space | | | + +### Mitigation Strategies (TODO) + +1. Since `node_status` is a critical part of failover detection, this should be as robust as possible. +We should [consider a Gossip-based protocol approach](https://gitlab.com/gitlab-org/gitaly/-/issues/3807) to take +the database out of the equation and use a distributed consensus algorithm to obtain a consistent view of the cluster. + +1. Check repository checksums with every write. +