# WIP Proof of concept for `git pack-objects` cache
This is a proof of concept for caching packfiles generated by Gitaly for `git clone` and `git fetch`. It is primarily meant to accelerate concurrent or near-concurrent fetches of the exact same data, as you might see when parallel CI jobs need to clone an MR at a specific commit.
Related to gitlab-com/gl-infra/scalability#668 and #1657 (closed).
## Video demo
https://youtu.be/jtSMqWes7_Q (demo part is 7 minutes)
## What this does
The hard work of a clone or fetch is in generating the pack data. This is done by `git pack-objects`. This MR uses a Git configuration option to direct `git upload-pack` to call `gitaly-hooks` instead of `git pack-objects`. Then `gitaly-hooks` makes a gRPC call back into Gitaly on a Unix socket, to an RPC called `PackObjectsHook`.
`PackObjectsHook` has a cache that lives in memory in the Gitaly process, but which stores the packfile data in object storage. `PackObjectsHook` calculates a cache key based on the inputs that would have been provided to `git pack-objects` and checks the in-memory cache for an entry. If there is one, it uses the metadata in the entry to stream the pack data back to `git` from object storage. If there is no cache entry, it creates one, and with it a goroutine that runs the real `git pack-objects` and puts its output into object storage.
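The key derivation can be sketched as follows. This is only an illustration of the idea, under the assumption that the key covers the `git pack-objects` arguments plus the want/have lines on stdin; the names here are hypothetical, not Gitaly's actual code.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
	"strings"
)

// cacheKey derives a deterministic key from the inputs that git upload-pack
// would have handed to git pack-objects: the command-line arguments plus the
// want/have lines on stdin. Illustrative sketch only.
func cacheKey(args, stdinLines []string) string {
	lines := append([]string(nil), stdinLines...)
	sort.Strings(lines) // assume the order of want/have lines is irrelevant

	h := sha256.New()
	h.Write([]byte(strings.Join(args, "\x00")))
	h.Write([]byte{0}) // separate args from stdin so they cannot bleed together
	h.Write([]byte(strings.Join(lines, "\x00")))
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	a := cacheKey([]string{"--revs", "--stdout"}, []string{"want 1234", "have 5678"})
	b := cacheKey([]string{"--revs", "--stdout"}, []string{"have 5678", "want 1234"})
	fmt.Println(a == b) // the same negotiation in a different order hits the same entry
}
```

Because the key is derived only from the request inputs, two clones asking for the same commits map to the same entry and share one `git pack-objects` run.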
The in-memory cache uses an LRU eviction policy with a fixed number of entries, combined with a staleness check. We count on the object storage provider to delete stale blobs for us via object lifecycle rules.
For the sake of responsiveness, progress information generated by `git pack-objects` is cached in memory by Gitaly so we can send it to the client unbuffered. The pack data is buffered into chunks in object storage (currently 50MB, though this should of course be configurable) so we can amortize the overhead of creating object storage blobs, while still starting to stream pack data to the client before `git pack-objects` has generated the entire response.
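The chunking can be sketched as a writer that cuts the `git pack-objects` output stream into fixed-size blobs. This is a toy version: `putBlob` stands in for a real object-storage upload, and the chunk size (50MB in the PoC) is shrunk for illustration.

```go
package main

import (
	"bytes"
	"fmt"
)

// store simulates an object-storage bucket for this sketch.
var store = map[string][]byte{}

func putBlob(name string, data []byte) {
	store[name] = append([]byte(nil), data...) // copy: buf.Next reuses memory
}

// chunkWriter buffers incoming pack data and uploads it one fixed-size
// blob at a time. Completed chunks can be streamed to waiting clients
// while git pack-objects is still producing the rest.
type chunkWriter struct {
	chunkSize int
	buf       bytes.Buffer
	blobs     []string // names of completed chunks, in order
}

func (w *chunkWriter) Write(p []byte) (int, error) {
	w.buf.Write(p)
	for w.buf.Len() >= w.chunkSize {
		w.flush(w.buf.Next(w.chunkSize))
	}
	return len(p), nil
}

// Close uploads whatever partial chunk remains at end of stream.
func (w *chunkWriter) Close() error {
	if w.buf.Len() > 0 {
		w.flush(w.buf.Next(w.buf.Len()))
	}
	return nil
}

func (w *chunkWriter) flush(data []byte) {
	name := fmt.Sprintf("pack-chunk-%d", len(w.blobs))
	putBlob(name, data)
	w.blobs = append(w.blobs, name)
}

func main() {
	w := &chunkWriter{chunkSize: 50}
	w.Write(bytes.Repeat([]byte("x"), 120))
	w.Close()
	fmt.Println(len(w.blobs), len(store["pack-chunk-2"])) // 3 20
}
```

A larger chunk size means fewer blob-creation round trips but a longer wait before the first chunk is readable by other clones, which is why it wants to be tunable.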
## Benefits
From experience we know, anecdotally, that Git clone and fetch traffic puts significant pressure on Gitaly servers, particularly parallel clone/fetch requests on the same repository such as those created by CI pipelines.
With this change, an arbitrary number of near-concurrent clones of the same commits from the same repository will spawn only one `git pack-objects` process on the Gitaly server. We can condense dozens or more parallel clones into a single process doing the heavy lifting, where today we do that heavy lifting independently for each individual clone.
The exact benefits are hard to predict because they depend on user behavior. The more parallelism, the more benefit. At the opposite end, for one-off fetches or clones this cache has no benefit, and adds overhead.