CreateFork is slow and duplicates all objects
CreateFork has always used --no-local (#817 (closed)) to clone from any internal repository. This turns off the local-clone optimization that https://www.git-scm.com/docs/git-clone describes for --local:
When the repository to clone from is on a local machine, this flag bypasses the normal "Git aware" transport mechanism and clones the repository by making a copy of HEAD and everything under objects and refs directories. The files under .git/objects/ directory are hardlinked to save space when possible.
I think this was done because it makes it easy to clone a repository from any shard in a generic way, and it potentially avoids copying over cruft in the main repo.
However, the problem is that it's very slow. On my test instance with gitlab-org/gitlab, notice it takes over 2 minutes to complete:
# time git clone --bare --no-local /var/opt/gitlab/git-data/repositories/@hashed/c2/35/c2356069e9d1e79ca924378153cfbbfb4d4416b1f99d41a2940bfdb66c5319db.git /tmp/no-local.git
Cloning into bare repository '/tmp/no-local.git'...
remote: Counting objects: 2660817, done.
remote: Total 2660817 (delta 0), reused 0 (delta 0)
Receiving objects: 100% (2660817/2660817), 1.04 GiB | 27.21 MiB/s, done.
Resolving deltas: 100% (2062629/2062629), done.
real 2m16.601s
user 2m17.052s
sys 0m12.271s
Compare that with a --local and --no-hardlinks clone of a deduplicated repo:
# time git clone --bare --local --no-hardlinks /var/opt/gitlab/git-data/repositories/@hashed/c2/35/c2356069e9d1e79ca924378153cfbbfb4d4416b1f99d41a2940bfdb66c5319db.git /tmp/local.git
Cloning into bare repository '/tmp/local.git'...
done.
real 0m0.221s
user 0m0.106s
sys 0m0.108s
It turns out the local copy also brings in objects/info/alternates, which makes the clone fast because it doesn't have to copy everything from the pool repository. I thought !2887 (merged) would do the same thing, but --no-local basically negates that: every object is transferred and stored again in the fork.
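The difference can be reproduced on a throwaway repository. The following is a minimal sketch, not how Gitaly sets up object pools: it assumes git is installed, uses git clone --shared as a stand-in for linking a repository to a pool, and all paths and variable names are made up for illustration.

```shell
set -e

work=$(mktemp -d)

# Build a tiny "pool" repository holding a single (empty) commit.
git init --quiet "$work/src"
git -C "$work/src" -c user.name=demo -c user.email=demo@example.com \
    commit --quiet --allow-empty -m "initial"
git clone --quiet --bare "$work/src" "$work/pool.git"

# "member.git" borrows its objects from the pool via objects/info/alternates.
git clone --quiet --bare --shared "$work/pool.git" "$work/member.git"

# Clone the member both ways, as in the timings above.
git clone --quiet --bare --local --no-hardlinks "$work/member.git" "$work/local.git"
git clone --quiet --bare --no-local "$work/member.git" "$work/no-local.git"

# Record whether each clone ended up with a non-empty alternates file.
has_alternates() {
    if [ -s "$1/objects/info/alternates" ]; then echo yes; else echo no; fi
}
local_alt=$(has_alternates "$work/local.git")
no_local_alt=$(has_alternates "$work/no-local.git")

echo "--local clone has alternates:    $local_alt"
echo "--no-local clone has alternates: $no_local_alt"

rm -rf "$work"
```

If the --local clone reports an alternates file and the --no-local clone does not, that matches the behavior described above: the --no-local clone receives its own copy of every object instead of pointing at the pool.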
Proposal:
- If the source repo is on the same shard as the target repo, use --local.
- Introduce a boolean field to enable this behavior.
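The decision in the proposal could look roughly like the sketch below. The function name clone_transport_flag, its arguments, and the storage names are hypothetical; the real change would live in the CreateFork implementation, and whether --no-hardlinks should also be passed is left open here.

```shell
# Hypothetical helper: prints the clone transport flag to use for a fork.
#   $1: storage (shard) name of the source repository
#   $2: storage (shard) name of the target repository
#   $3: "true" if the caller opted in via the proposed boolean field
clone_transport_flag() {
    if [ "$3" = "true" ] && [ "$1" = "$2" ]; then
        # Same shard and opted in: copy locally, keeping alternates.
        echo "--local"
    else
        # Different shard or not opted in: keep the current behavior.
        echo "--no-local"
    fi
}

flag=$(clone_transport_flag "default" "default" "true")
echo "git clone --bare $flag <source> <target>"
```

Keeping --no-local as the fallback means existing callers see no change unless they set the new field and both repositories live on the same shard.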