# WIP Proof of concept for `git pack-objects` cache
This is a proof of concept for caching packfiles generated by Gitaly for `git clone` and `git fetch`. It is primarily meant to accelerate concurrent or near-concurrent fetches of the exact same data, as you might see when parallel CI jobs need to clone an MR at a specific commit.
Related to gitlab-com/gl-infra/scalability#668 and #1657 (closed).
## Video demo
https://youtu.be/jtSMqWes7_Q (demo part is 7 minutes)
## What this does
The hard work of a clone or fetch is in generating the pack data. This is done by `git pack-objects`. This MR uses a Git configuration option to direct `git upload-pack` to call `gitaly-hooks` instead of `git pack-objects`. Then `gitaly-hooks` makes a gRPC call back into Gitaly on a Unix socket, to an RPC called `PackObjectsHook`.
`PackObjectsHook` has a cache that lives in memory in the Gitaly process, but which stores the packfile data in object storage. `PackObjectsHook` calculates a cache key based on the inputs that would have been provided to `git pack-objects` and checks the in-memory cache for an entry. If there is one, it uses the metadata in the entry to stream the pack data back to `git` from object storage. If there is no cache entry, it creates one, and with it a goroutine that runs the real `git pack-objects` and puts its output into object storage.
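The key derivation can be sketched as follows. This is only an illustration of the idea, under the assumption that the key covers the `git pack-objects` arguments plus the want/have lines on stdin; the names here are hypothetical, not Gitaly's actual code.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
	"strings"
)

// cacheKey derives a deterministic key from the inputs that git upload-pack
// would have handed to git pack-objects: the command-line arguments plus the
// want/have lines on stdin. Illustrative sketch only.
func cacheKey(args, stdinLines []string) string {
	lines := append([]string(nil), stdinLines...)
	sort.Strings(lines) // assume the order of want/have lines is irrelevant

	h := sha256.New()
	h.Write([]byte(strings.Join(args, "\x00")))
	h.Write([]byte{0}) // separate args from stdin so they cannot bleed together
	h.Write([]byte(strings.Join(lines, "\x00")))
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	a := cacheKey([]string{"--revs", "--stdout"}, []string{"want 1234", "have 5678"})
	b := cacheKey([]string{"--revs", "--stdout"}, []string{"have 5678", "want 1234"})
	fmt.Println(a == b) // the same negotiation in a different order hits the same entry
}
```

Because the key is derived only from the request inputs, two clones asking for the same commits map to the same entry and share one `git pack-objects` run.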
The in-memory cache uses an LRU eviction policy with a fixed number of entries, combined with a staleness check. We count on the object storage provider to delete stale blobs for us via object lifecycle rules.
For the sake of responsiveness, progress information generated by `git pack-objects` is cached in memory by Gitaly so we can send it to the client unbuffered. The pack data is buffered into chunks in object storage (currently 50MB, though this should of course be configurable) so we can amortize the overhead of creating object storage blobs, while still starting to stream pack data to the client before `git pack-objects` has generated the entire response.
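The chunking can be sketched as a writer that cuts the `git pack-objects` output stream into fixed-size blobs. This is a toy version: `putBlob` stands in for a real object-storage upload, and the chunk size (50MB in the PoC) is shrunk for illustration.

```go
package main

import (
	"bytes"
	"fmt"
)

// store simulates an object-storage bucket for this sketch.
var store = map[string][]byte{}

func putBlob(name string, data []byte) {
	store[name] = append([]byte(nil), data...) // copy: buf.Next reuses memory
}

// chunkWriter buffers incoming pack data and uploads it one fixed-size
// blob at a time. Completed chunks can be streamed to waiting clients
// while git pack-objects is still producing the rest.
type chunkWriter struct {
	chunkSize int
	buf       bytes.Buffer
	blobs     []string // names of completed chunks, in order
}

func (w *chunkWriter) Write(p []byte) (int, error) {
	w.buf.Write(p)
	for w.buf.Len() >= w.chunkSize {
		w.flush(w.buf.Next(w.chunkSize))
	}
	return len(p), nil
}

// Close uploads whatever partial chunk remains at end of stream.
func (w *chunkWriter) Close() error {
	if w.buf.Len() > 0 {
		w.flush(w.buf.Next(w.buf.Len()))
	}
	return nil
}

func (w *chunkWriter) flush(data []byte) {
	name := fmt.Sprintf("pack-chunk-%d", len(w.blobs))
	putBlob(name, data)
	w.blobs = append(w.blobs, name)
}

func main() {
	w := &chunkWriter{chunkSize: 50}
	w.Write(bytes.Repeat([]byte("x"), 120))
	w.Close()
	fmt.Println(len(w.blobs), len(store["pack-chunk-2"])) // 3 20
}
```

A larger chunk size means fewer blob-creation round trips but a longer wait before the first chunk is readable by other clones, which is why it wants to be tunable.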
## Benefits
From experience we know, anecdotally, that Git clone and fetch traffic puts significant pressure on Gitaly servers, particularly parallel clone/fetch requests on the same repository such as those created by CI pipelines.
With this change, an arbitrary number of near-concurrent clones of the same commits from the same repository will spawn only one `git pack-objects` process on the Gitaly server. We can condense dozens or more parallel clones into a single process doing the heavy lifting, where today we do that heavy lifting independently for each individual clone.
The exact benefits are hard to predict because they depend on user behavior. The more parallelism, the more benefit. At the opposite end, for one-off fetches or clones this cache has no benefit, and adds overhead.