Experiment: Speed up Transaction snapshotting with overlayfs
Disclaimer: This MR is experimental, aiming to explore the use of overlayfs to speed up snapshotting operations of Gitaly transactions. It is still far from production-ready, so please treat it accordingly. Many tests are still failing because some filesystem mocking uses chmod to simulate errors. The tests of the WAL and Transactions packages are all green (on my local environment), though.
TL;DR
This MR explores using overlayfs to dramatically improve Gitaly transaction snapshotting performance, addressing critical production issues where nodes experience excessive load due to high snapshot creation rates (up to 200 ops/s). The current "deepclone" strategy requires O(n) operations proportional to the number of files in a repository, becoming prohibitively expensive for repositories with thousands of files (some reaching 120,000 files). By leveraging overlayfs, a battle-tested union filesystem used by Docker, snapshot creation becomes a constant-time O(1) operation regardless of repository size.
Benchmarks show overlayfs maintains a constant 0.011 ms snapshot creation time from 10 to 50,000 files, while deepclone degrades to 31.7 ms (roughly 3,000x slower). For large repositories, overlayfs delivers 92,356 snapshots/sec versus deepclone's 31 snapshots/sec, while reducing CPU usage by 60% and syscall count by 80%.
The implementation provides a pluggable driver interface for seamless switching between strategies. While there are operational considerations around permissions and container compatibility, the performance gains are compelling enough to warrant exploration as repository complexity continues to grow.
Problem statement
Transactions were introduced into Gitaly as an innovative technique that brings ACID properties to Gitaly and lays a crucial building block for other important projects such as Raft replication, backup, etc. Recently, we have spent a lot of effort rolling it out globally on GitLab SaaS. Unfortunately, we kept rolling back due to excessive additional load on the impacted nodes:
- gitaly-04-stor-gprd.c.gitlab-gitaly-gprd-83fd.internal
- gitaly-07-stor-gprd.c.gitlab-gitaly-gprd-0fe1.internal
Many factors contribute to the additional load, but there are two main ones I would love to tackle in this MR:
- Low snapshot re-use rate on those nodes.
- An unexpectedly large number of files inside repositories.
The first (1) problem heavily depends on traffic patterns toward the repositories on a node. On CNY, one big repository, gitlab-org/gitlab, and its forks dominate traffic. They have extensive read and write rates, so the snapshot sharing rate is reasonably high. This significantly reduces the snapshot creation rate on CNY.
The traffic pattern on the impacted nodes is completely different. Repositories are distributed more evenly than on CNY, and traffic is scattered more equally between the major repositories on the node. This leads to a much lower sharing ratio and hence a disproportionately higher snapshot creation rate. At peak, a node can generate 200 ops/s, 20 times more than on CNY.
Share ratio | Snapshot creation rate | CNY snapshot creation rate |
---|---|---|
![]() | ![]() | ![]() |
Source | Source | Source |
Unfortunately, I believe this traffic pattern is more representative than CNY's. Having one dominant repository is not ideal: it leads to annoying side effects and makes scaling harder. We will likely face more nodes with this pattern as we expand the rollout scope. Therefore, the performance of snapshot management is crucial.
The second (2) problem is extremely annoying. As described in this note, some repositories may reach 76,000 files (and directories). All those files are legitimate and cannot be removed. Unfortunately, this problem is fairly common across repositories on the SaaS platform. Looking at the last 7 days of Gitaly logs:
- Median file count across all repositories is ~50
- 95th percentile is 800. That's a huge jump.
- 99th percentile is 1,535.
- The max file count across rolled-out nodes is 120,000
File count percentiles | Max file count per node |
---|---|
![]() | ![]() |
Source | Source |
These numbers are absolutely not ideal for snapshotting, which is the heart of Gitaly Transactions. Deep down in the implementation of snapshotting, Gitaly "clones" the repository hierarchy of the original SSOT repository. Let's call this strategy "deepclone". In general, Gitaly walks the tree, recreates the same directory hierarchy, and hard-links all files to the new destination. The code can be found here. It provides perfect snapshot isolation for requests to work on. Hard-linking is a clever idea because the notoriously heavy packfiles are not copied over.
Unfortunately, this technique doesn't scale well as the number of files in a repository grows. We need O(num(files)) hard-linking calls and O(num(dirs)) mkdir calls for a single snapshot creation. As a result, it takes a large number of syscalls, context switches, and computational resources to create a snapshot of a repository. When a snapshot is no longer in use, we need the same number of unlink and rmdir calls to clean it up.
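For reference, here is a minimal Go sketch of what the deepclone strategy boils down to. The helper name and the simplified permission handling are mine for illustration; this is not the actual Gitaly implementation linked above:

```go
package main

import (
	"io/fs"
	"os"
	"path/filepath"
)

// deepCloneSketch is a hypothetical, simplified version of the deepclone
// strategy: walk the source repository, recreate every directory, and
// hard-link every file into the snapshot. Cost is O(dirs) mkdir calls plus
// O(files) linkat calls, and cleanup later costs the same again.
func deepCloneSketch(source, snapshot string) error {
	return filepath.WalkDir(source, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(source, path)
		if err != nil {
			return err
		}
		target := filepath.Join(snapshot, rel)
		if d.IsDir() {
			// One mkdir syscall per directory (the real code preserves modes).
			return os.MkdirAll(target, 0o755)
		}
		// One linkat syscall per file; heavy packfiles are shared, not copied.
		return os.Link(path, target)
	})
}
```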
Another side effect of the deepclone strategy is that the change set generated by a request is opaque. Gitaly depends on tree walking, Git hooks, and some Git commands to collect the changes of a WAL entry. This drawback will be addressed in another MR (using the same proposed technique).
We are also working on many issues aimed at reducing the number of files, such as rolling out reftables, improving housekeeping, etc. That work will improve the situation to the point that the Transactions overhead is reasonable for a global rollout. However, it doesn't change the fundamental fact that snapshot creation/deletion is O(N). Repository patterns constantly change, and the number of files might spike, leading to potential incidents. We also don't control the data of self-managed instances or Dedicated.
Proposal
Instead of the current deepclone strategy, we propose leveraging overlayfs to achieve constant-time snapshot creation regardless of repository size. Overlayfs is a union filesystem that has been part of the Linux kernel since version 3.18 (2014) and is widely deployed in production environments. It serves as the foundation for Docker's storage driver, powers container runtimes across millions of deployments, and is used by major Linux distributions for live CD/DVD systems. This battle-tested filesystem has proven its reliability and performance at massive scale.
The fundamental difference between overlayfs and our current deepclone approach is striking. While deepclone requires O(num(files)) + O(num(dirs)) syscalls to create hard-links and directories respectively, overlayfs snapshot creation is O(1) - a single mount operation regardless of whether the repository contains 50 files or 120,000 files. Cleanup is equally efficient: instead of walking the entire tree to unlink files and remove directories, we simply perform an O(1) unmount operation followed by cleaning up only the upperdir containing actual changes.
Here's how overlayfs maps to Gitaly's snapshotting needs:
Transaction Mount Point (/path/to/snapshots/txn-456/merged)
│
└── Merged View (what transaction sees)
├── refs/heads/main ← From upperdir (updated)
├── objects/pack/new.pack ← From upperdir (new packfile)
├── objects/pack/old.pack ← From lowerdir (unchanged)
├── refs/tags/ ← From lowerdir (unchanged)
└── config ← From lowerdir (unchanged)
┌────────────────────────────────────────────────────────┐
│ upperdir/ │ ← Transaction changes (writable layer) │
│ ├── refs/heads/main │
│ ├── objects/pack/new.pack │
└────────────────────────────────────────────────────────┘
▲
│ overlayfs union
┌────────────────────────────────────────────────────────┐
│ lowerdir/ │ ← Original repository (read-only) │
│ /path/to/repository/ │
│ ├── objects/pack/ (heavy packfiles) │
│ ├── refs/heads/main │
│ ├── refs/tags/ │
│ └── config │
└────────────────────────────────────────────────────────┘
Mount: O(1) operation regardless of repository size
Unmount: O(1) cleanup operation
From the transaction's perspective, /path/to/snapshots/txn-456/merged appears as a complete, writable copy of the repository. All reads that don't correspond to modified files are served directly from the original repository (lowerdir). Any writes, modifications, or deletions are captured in the upperdir, creating perfect isolation without duplicating unchanged data. In the example above, when the transaction creates a new packfile or updates a reference, only those specific changes are stored in the upperdir while the existing heavy packfiles remain untouched in the lowerdir.
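To make the O(1) claim concrete, here is a hedged Go sketch of what mounting and tearing down such a snapshot could look like, assuming the golang.org/x/sys/unix package and illustrative helper names rather than the driver code in this MR:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// mountSnapshot creates an overlayfs snapshot of repoPath under snapshotDir.
// The cost is a single mount(2) call, independent of the repository's file
// count. Note that this requires CAP_SYS_ADMIN (or an appropriate namespace).
func mountSnapshot(repoPath, snapshotDir string) (string, error) {
	upper := filepath.Join(snapshotDir, "upper")
	work := filepath.Join(snapshotDir, "work")
	merged := filepath.Join(snapshotDir, "merged")
	for _, dir := range []string{upper, work, merged} {
		if err := os.MkdirAll(dir, 0o755); err != nil {
			return "", err
		}
	}

	opts := fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s", repoPath, upper, work)
	if err := unix.Mount("overlay", merged, "overlay", 0, opts); err != nil {
		return "", fmt.Errorf("mount overlay: %w", err)
	}
	return merged, nil
}

// unmountSnapshot tears the snapshot down: one umount(2) call plus removal of
// the snapshot directory, which only holds the transaction's own changes.
func unmountSnapshot(snapshotDir string) error {
	if err := unix.Unmount(filepath.Join(snapshotDir, "merged"), 0); err != nil {
		return fmt.Errorf("unmount overlay: %w", err)
	}
	return os.RemoveAll(snapshotDir)
}
```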
This approach eliminates the filesystem traversal bottleneck entirely. Whether we're snapshotting a minimal repository or one with 120,000 files, both creation and cleanup times remain constant. The performance gap widens as repository complexity grows, which is exactly the scenario we're facing on the problematic nodes.
While overlayfs offers compelling advantages for our use case, there are important caveats around filesystem compatibility and operational considerations that we'll explore in the following sections.
Implementation approach
This MR takes a generalized, pluggable approach that makes overlayfs easy to try out while maintaining backward compatibility. We introduce a Driver interface that abstracts the different snapshotting strategies, allowing users to switch between implementations seamlessly without code changes.
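As an illustration only, the interface could take roughly this shape; the method names below are indicative, not necessarily the ones used in the diff:

```go
package driver

import "context"

// Driver abstracts a snapshotting strategy so that deepclone and overlayfs
// can be swapped without changing the transaction code that consumes
// snapshots. This is a sketch of the idea, not the MR's exact interface.
type Driver interface {
	// CreateSnapshot materializes an isolated, writable view of the repository
	// at snapshotPath. For deepclone this is O(files); for overlayfs it is O(1).
	CreateSnapshot(ctx context.Context, repoPath, snapshotPath string) error

	// RemoveSnapshot releases the snapshot and everything it accumulated.
	RemoveSnapshot(ctx context.Context, snapshotPath string) error
}
```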
This plugin architecture enables straightforward experimentation: users can test overlayfs performance by configuring their Gitaly settings:
[Transactions]
Enabled = true
Driver = "overlayfs"
The system automatically falls back to deepclone if the compatibility checks fail. During development, the GITALY_TEST_WAL_DRIVER environment variable allows testing different drivers without configuration changes. Driver selection happens at runtime with comprehensive compatibility validation, ensuring robust operation across different deployment scenarios.
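The compatibility check could be as simple as probing a throwaway overlay mount at startup and degrading to deepclone on failure. The sketch below is an assumption about how such a probe might look, not the MR's actual validation code:

```go
package driver

import (
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// selectDriver returns the configured driver, falling back to deepclone when
// an overlayfs probe mount fails (missing privileges, unsupported kernel, ...).
// Names and logging are illustrative.
func selectDriver(configured, tmpDir string) string {
	if configured != "overlayfs" {
		return "deepclone"
	}
	if err := probeOverlayfs(tmpDir); err != nil {
		fmt.Fprintf(os.Stderr, "overlayfs unavailable, falling back to deepclone: %v\n", err)
		return "deepclone"
	}
	return "overlayfs"
}

// probeOverlayfs mounts and immediately unmounts a tiny overlay to verify that
// the current process may create overlay mounts on this kernel.
func probeOverlayfs(tmpDir string) error {
	probe, err := os.MkdirTemp(tmpDir, "overlayfs-probe-")
	if err != nil {
		return err
	}
	defer os.RemoveAll(probe)

	lower := filepath.Join(probe, "lower")
	upper := filepath.Join(probe, "upper")
	work := filepath.Join(probe, "work")
	merged := filepath.Join(probe, "merged")
	for _, dir := range []string{lower, upper, work, merged} {
		if err := os.Mkdir(dir, 0o755); err != nil {
			return err
		}
	}

	opts := fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s", lower, upper, work)
	if err := unix.Mount("overlay", merged, "overlay", 0, opts); err != nil {
		return err
	}
	return unix.Unmount(merged, 0)
}
```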
Benchmarks
The benchmarks were conducted on an Intel(R) Core(TM) i7-14700F (28 cores, 32 GB of RAM) running Fedora Linux. Benchmarking commands:
unshare --user --map-root-user --mount go test ./internal/gitaly/storage/storagemgr/partition/snapshot/driver -benchtime=30s -bench=BenchmarkDriver_
unshare --user --map-root-user --mount go test ./internal/gitaly/storage/storagemgr/partition -bench=BenchmarkTransactionManager
# Or
su
go test ./internal/gitaly/storage/storagemgr/partition/snapshot/driver -benchtime=30s -bench=BenchmarkDriver_
go test ./internal/gitaly/storage/storagemgr/partition -bench=BenchmarkTransactionManager
Pure snapshot creation/deletion with large file counts
Benchmark | Time (ms/op) | Snapshots/sec | Allocations/sec | Alloc’d Mem |
---|---|---|---|---|
10 files – DeepClone | 0.011 | 92 078 | 27.44 M | 22.98 KB |
10 files – OverlayFS | 0.011 | 91 042 | 8.01 M | 6.99 KB |
50 files – DeepClone | 0.029 | 34 017 | 25.87 M | 77.89 KB |
50 files – OverlayFS | 0.011 | 89 643 | 7.89 M | 6.99 KB |
100 files – DeepClone | 0.053 | 18 915 | 25.53 M | 148.59 KB |
100 files – OverlayFS | 0.011 | 91 309 | 8.04 M | 6.96 KB |
500 files – DeepClone | 0.232 | 4 318 | 25.32 M | 686.00 KB |
500 files – OverlayFS | 0.011 | 91 636 | 8.06 M | 6.96 KB |
1 000 files – DeepClone | 0.446 | 2 244 | 25.79 M | 1.32 MB |
1 000 files – OverlayFS | 0.011 | 91 376 | 8.04 M | 6.95 KB |
5 000 files – DeepClone | 2.149 | 465.3 | 26.33 M | 6.59 MB |
5 000 files – OverlayFS | 0.011 | 90 870 | 8.00 M | 6.99 KB |
50 000 files – DeepClone | 31.714 | 31.53 | 17.77 M | 67.65 MB |
50 000 files – OverlayFS | 0.011 | 92 356 | 8.13 M | 7.08 KB |
Pure snapshot creation/deletion with different file size profiles
Benchmark | Time (ms/op) | Snapshots/sec | Allocations/sec | Alloc’d Mem |
---|---|---|---|---|
1 KB file – DeepClone | 0.472 | 2 119 | 2.09 M | 130.67 KB |
1 KB file – OverlayFS | 0.069 | 14 521 | 1.28 M | 6.73 KB |
10 KB file – DeepClone | 0.416 | 2 405 | 2.37 M | 130.71 KB |
10 KB file – OverlayFS | 0.064 | 15 671 | 1.38 M | 6.75 KB |
100 KB file – DeepClone | 0.421 | 2 376 | 2.34 M | 130.77 KB |
100 KB file – OverlayFS | 0.073 | 13 692 | 1.20 M | 6.86 KB |
1 MB file – DeepClone | 0.471 | 2 122 | 2.09 M | 131.36 KB |
1 MB file – OverlayFS | 0.067 | 14 844 | 1.31 M | 6.88 KB |
10 MB file – DeepClone | 0.412 | 2 430 | 2.40 M | 130.73 KB |
10 MB file – OverlayFS | 0.067 | 14 928 | 1.31 M | 6.93 KB |
100 MB file – DeepClone | 0.414 | 2 415 | 2.38 M | 133.68 KB |
100 MB file – OverlayFS | 0.064 | 15 608 | 1.37 M | 7.00 KB |
Run N transactions with different drivers
Benchmark | Time (ms/op) | Tx/sec | Allocations/sec | Alloc’d Mem |
---|---|---|---|---|
1 repo · 1 updater · 10 tx size · 2 txs – DeepClone | 43.93 | 22.76 | 273.07 K | 1.42 MB |
1 repo · 1 updater · 10 tx size · 2 txs – OverlayFS | 41.16 | 24.29 | 246.85 K | 1.28 MB |
1 repo · 1 updater · 10 tx size · 5 txs – DeepClone | 93.32 | 10.71 | 291.29 K | 3.32 MB |
1 repo · 1 updater · 10 tx size · 5 txs – OverlayFS | 97.75 | 10.23 | 228.45 K | 2.96 MB |
1 repo · 1 updater · 500 tx size · 250 txs – DeepClone | 6344.07 | 0.16 | 13.53 M | 1.16 GB |
1 repo · 1 updater · 500 tx size · 250 txs – OverlayFS | 2698.12 | 0.37 | 4.38 M | 420.63 MB |
1 repo · 1 updater · 1000 tx size · 500 txs – DeepClone | 13 996.22 | 0.07 | 46.89 M | 4.06 GB |
1 repo · 1 updater · 1000 tx size · 500 txs – OverlayFS | 5 359.62 | 0.19 | 12.48 M | 1.18 GB |
System metrics
System-wide CPU consumption | System-wide memory consumption (nearly the same) | System-wide syscall count |
---|---|---|
![]() | ![]() | ![]() |
Bottom line
Use OverlayFS for virtually all snapshotting: its O(1) behavior yields far lower latency, lower CPU consumption, and fewer allocations in Gitaly.
- OverlayFS work is O(1) with respect to file count, so latency stays flat at ~0.011 ms from 10 files to 50 000 files; DeepClone is roughly O(N): it starts at 0.011 ms for 10 files and rises to 31.7 ms for 50 000 files (~3 000 × slower).
- In the same extreme case (50 000 files) OverlayFS delivers 92 356 snapshots/sec, while DeepClone manages only 31 snapshots/sec – a 3 000 × throughput gap.
- Memory per snapshot is constant for OverlayFS (~7 KB), but grows with N for DeepClone (23 KB → 67 MB, a 9 600 × jump); allocations show a similar pattern (≈8 M vs 18-27 M allocs/sec).
- File size from 1 KB to 100 MB barely affects either method; directory breadth (number of entries) is the real cost driver.
- When testing with real transactions, setup overhead means the benchmark barely pushes a single core past 5%; we'll need much heavier traffic for a conclusive comparison. That said, overlayfs still yields a 2.6x better result and uses up to 4x less memory.
- CPU usage with OverlayFS is significantly lower: DeepClone saturates system and user CPU (~100%), while OverlayFS runs at a much lower baseline.
- OverlayFS dramatically reduces system call volume: DeepClone generates millions of linkat, unlinkat, and read calls per run; OverlayFS keeps these near minimal, saving CPU, memory, and syscall latency.
Caveats and operational considerations
The primary operational consideration for overlayfs is permission requirements. Standard overlayfs mounting requires the CAP_SYS_ADMIN capability or root privileges to create overlay mounts. In rootless environments, this can be addressed through user namespaces combined with mount namespaces, for example by using unshare to create a user namespace where overlayfs operations become available:
# Create user and mount namespace, then mount overlayfs
unshare -Urm sh -c '
mkdir -p /tmp/lower /tmp/upper /tmp/work /tmp/merged
mount -t overlay overlay \
-o lowerdir=/tmp/lower,upperdir=/tmp/upper,workdir=/tmp/work \
/tmp/merged
'
In containerized environments, Docker containers require the --privileged flag or specific capabilities to perform overlayfs mounts:
# Run container with overlayfs capabilities
docker run --privileged --rm -it ubuntu bash
# or with specific capabilities
docker run --cap-add=SYS_ADMIN --rm -it ubuntu bash
For environments where standard overlayfs faces permission constraints or older kernel limitations, we've identified fuse-overlayfs as a promising fallback strategy. This userspace implementation provides the same O(1) performance characteristics while bypassing kernel-level permission requirements, making it particularly valuable for containerized deployments or systems with restrictive security policies.
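As a hedged illustration, and assuming a fuse-overlayfs binary is available on the PATH (this is not part of the MR), the fallback could shell out roughly like this:

```go
package main

import (
	"fmt"
	"os/exec"
)

// mountFuseOverlay invokes the fuse-overlayfs binary, which accepts the same
// lowerdir/upperdir/workdir options as kernel overlayfs but runs in user
// space, so no CAP_SYS_ADMIN is required. Illustrative sketch only.
func mountFuseOverlay(lower, upper, work, merged string) error {
	opts := fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s", lower, upper, work)
	return exec.Command("fuse-overlayfs", "-o", opts, merged).Run()
}

// unmountFuseOverlay releases the FUSE mount via fusermount3
// (fusermount on older systems).
func unmountFuseOverlay(merged string) error {
	return exec.Command("fusermount3", "-u", merged).Run()
}
```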
There are some other things to consider:
- Hidden overhead: while overlayfs creates additional upper and work directories, the overhead is typically minimal since only modified files consume additional space. I don't have enough knowledge to evaluate this deeply yet.
- Compatibility with containers: while overlayfs generally plays well with container runtimes, the interaction between Gitaly's overlayfs snapshots and container orchestration systems hasn't been extensively tested across all deployment scenarios.
- More thorough testing and benchmarking: the benchmarks in this MR are quite simple. We'll need a more realistic benchmark before drawing a final conclusion.