From 559c55042651cfe93f4fda288eff64bf0cc81ec4 Mon Sep 17 00:00:00 2001
From: Karthik Nayak
Date: Sat, 22 Mar 2025 18:05:41 +0100
Subject: [PATCH 1/3] bundle: fix non-linear performance scaling with refs
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Hello,

At GitLab, we noticed that bundle creation doesn't seem to scale linearly
with the number of references in a repository. The following benchmark
demonstrates the issue:

  Benchmark 1: bundle (refcount = 100)
    Time (mean ± σ):       4.4 ms ±   0.5 ms    [User: 1.8 ms, System: 2.4 ms]
    Range (min … max):     3.4 ms …   7.7 ms    434 runs

  Benchmark 2: bundle (refcount = 1000)
    Time (mean ± σ):      16.5 ms ±   1.7 ms    [User: 9.6 ms, System: 7.2 ms]
    Range (min … max):    14.1 ms …  21.7 ms    176 runs

  Benchmark 3: bundle (refcount = 10000)
    Time (mean ± σ):     220.6 ms ±   3.2 ms    [User: 171.6 ms, System: 55.7 ms]
    Range (min … max):   215.8 ms … 224.9 ms    13 runs

  Benchmark 4: bundle (refcount = 100000)
    Time (mean ± σ):      9.622 s ±  0.063 s    [User: 9.143 s, System: 0.546 s]
    Range (min … max):    9.563 s …  9.738 s    10 runs

  Summary
    bundle (refcount = 100) ran
      3.79 ± 0.61 times faster than bundle (refcount = 1000)
      50.63 ± 6.39 times faster than bundle (refcount = 10000)
      2207.95 ± 277.35 times faster than bundle (refcount = 100000)

Digging into this, the cause is the check for duplicate refnames added
by the user. This check uses an O(N^2) algorithm, which does not scale
linearly with the number of refs.

The first commit in this small series adds a bunch of tests for this
behavior, while also uncovering a missed edge case.

The second commit introduces an alternative approach which uses a
'strset' to check for duplicates. The new approach fixes the performance
problems noticed while also fixing the earlier missed edge case.

Overall we see a 6x performance improvement with this series.

I found that there is a conflict with 'ps/object-wo-the-repository' in
'seen', but the resolution seems simple enough.
Happy to support as needed.

---
Changes in v2:
- Link to v1: https://lore.kernel.org/r/20250401-488-generating-bundles-with-many-references-has-non-linear-performance-v1-0-6d23b2d96557@gmail.com

--- b4-submit-tracking ---
# This section is used internally by b4 prep for tracking purposes.
{
  "series": {
    "revision": 2,
    "change-id": "20250322-488-generating-bundles-with-many-references-has-non-linear-performance-64aec8e0cf1d",
    "prefixes": [],
    "history": {
      "v1": [
        "20250401-488-generating-bundles-with-many-references-has-non-linear-performance-v1-0-6d23b2d96557@gmail.com"
      ]
    }
  }
}
-- 
GitLab

From 40fc702abf0bd56bd3e635a4e86d8a82f824f979 Mon Sep 17 00:00:00 2001
From: Karthik Nayak
Date: Sun, 30 Mar 2025 12:50:16 +0200
Subject: [PATCH 2/3] t6020: test for duplicate refnames in bundle creation

The commit b2a6d1c686 (bundle: allow the same ref to be given more than
once, 2009-01-17) added functionality to detect and remove duplicate
refnames from being added during bundle creation. This ensured that
clones created from such bundles wouldn't barf about duplicate refnames.

The following commit will add some optimizations to make this check
faster, but before doing that, it would be good to add tests to capture
the current behavior.

Add tests to capture duplicate refnames provided by the user during
bundle creation. This can be a combination of:

  - refnames directly provided by the user.
  - refnames duplicated by using the '--all' flag alongside manually
    provided references.
  - exclusion criteria provided via a refname "main^!".
  - short forms of refnames provided, "main" vs "refs/heads/main".

Note that currently duplicates due to usage of short and long forms go
undetected. This should be fixed with the optimizations made in the next
commit.

Signed-off-by: Karthik Nayak
---
 t/t6020-bundle-misc.sh | 57 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 57 insertions(+)

diff --git a/t/t6020-bundle-misc.sh b/t/t6020-bundle-misc.sh
index b3807e8f35f..dd09df12873 100755
--- a/t/t6020-bundle-misc.sh
+++ b/t/t6020-bundle-misc.sh
@@ -673,6 +673,63 @@ test_expect_success 'bundle progress with --no-quiet' '
 	grep "%" err
 '
 
+test_expect_success 'create bundle with duplicate refnames' '
+	git bundle create out.bdl "main" "main" &&
+
+	git bundle list-heads out.bdl |
+		make_user_friendly_and_stable_output >actual &&
+	cat >expect <<-\EOF &&
+	refs/heads/main
+	EOF
+	test_cmp expect actual
+'
+
+# This exhibits a bug, since the same refname is now added to the bundle twice.
+test_expect_success 'create bundle with duplicate refnames and --all' '
+	git bundle create out.bdl --all "main" "main" &&
+
+	git bundle list-heads out.bdl |
+		make_user_friendly_and_stable_output >actual &&
+	cat >expect <<-\EOF &&
+	refs/heads/main
+	refs/heads/release
+	refs/heads/topic/1
+	refs/heads/topic/2
+	refs/pull/1/head
+	refs/pull/2/head
+	refs/tags/v1
+	refs/tags/v2
+	refs/tags/v3
+	HEAD
+	refs/heads/main
+	EOF
+	test_cmp expect actual
+'
+
+test_expect_success 'create bundle with duplicate exlusion refnames' '
+	git bundle create out.bdl "main" "main^!" &&
+
+	git bundle list-heads out.bdl |
+		make_user_friendly_and_stable_output >actual &&
+	cat >expect <<-\EOF &&
+	refs/heads/main
+	EOF
+	test_cmp expect actual
+'
+
+# This exhibits a bug, since the same refname is now added to the bundle twice.
+test_expect_success 'create bundle with duplicate refname short-form' '
+	git bundle create out.bdl "main" "main" "refs/heads/main" "refs/heads/main" &&
+
+	git bundle list-heads out.bdl |
+		make_user_friendly_and_stable_output >actual &&
+	cat >expect <<-\EOF &&
+	refs/heads/main
+	refs/heads/main
+	EOF
+	test_cmp expect actual
+'
+
 test_expect_success 'read bundle over stdin' '
 	git bundle create some.bundle HEAD &&
 
-- 
GitLab

From 41f0df7afef9c9e118b698839b267a7b2e95d435 Mon Sep 17 00:00:00 2001
From: Karthik Nayak
Date: Sun, 23 Mar 2025 00:35:48 +0100
Subject: [PATCH 3/3] bundle: fix non-linear performance scaling with refs
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The 'git bundle create' command has non-linear performance with the
number of refs in the repository. Benchmarking the command shows that a
large portion of the time (~75%) is spent in the
`object_array_remove_duplicates()` function.

The `object_array_remove_duplicates()` function was added in b2a6d1c686
(bundle: allow the same ref to be given more than once, 2009-01-17) to
prevent duplicate refs provided by the user from being written to the
bundle. Since this is an O(N^2) algorithm, in repos with a large number
of references, this can take up a large amount of time.

Let's instead use a 'strset' to skip duplicates inside
`write_bundle_refs()`.
This improves the performance by around 6 times when tested against a
repository with 100000 refs:

  Benchmark 1: bundle (refcount = 100000, revision = master)
    Time (mean ± σ):     14.653 s ±  0.203 s    [User: 13.940 s, System: 0.762 s]
    Range (min … max):   14.237 s … 14.920 s    10 runs

  Benchmark 2: bundle (refcount = 100000, revision = HEAD)
    Time (mean ± σ):      2.394 s ±  0.023 s    [User: 1.684 s, System: 0.798 s]
    Range (min … max):    2.364 s …  2.425 s    10 runs

  Summary
    bundle (refcount = 100000, revision = HEAD) ran
      6.12 ± 0.10 times faster than bundle (refcount = 100000, revision = master)

Previously, `object_array_remove_duplicates()` ensured that both the
refname and the object it pointed to were checked for duplicates. The
new approach, implemented within `write_bundle_refs()`, eliminates
duplicate refnames without comparing the objects they reference. This
works because, for bundle creation, we only need to prevent duplicate
refs from being written to the bundle header. The `revs->pending` array
can contain duplicates of multiple types.

First, references which resolve to the same refname. For example, "git
bundle create out.bdl master master" or "git bundle create out.bdl
refs/heads/master refs/heads/master" or "git bundle create out.bdl
master refs/heads/master". In these scenarios we want to prevent writing
"refs/heads/master" twice to the bundle header. Since both the refnames
here would point to the same object (unless there is a race), we do not
need to check equality of the object.

Second, refnames which are duplicates but do not point to the same
object. This can happen when we use exclusion criteria. For example,
"git bundle create out.bdl master master^!". Here `revs->pending` would
contain two elements, both with refname set to "master". However, one
would point to an INTERESTING object and the other to an UNINTERESTING
one. Since we only write refnames with INTERESTING objects to the bundle
header, we perform our duplicate checks only on such objects.

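For illustration only, the difference between the two approaches can be
sketched outside of Git. The following is a minimal standalone sketch,
not the code in this series: `dedupe_names()` and `hash_name()` are
hypothetical helpers, and a small open-addressing hash table stands in
for 'strset'. It keeps the first occurrence of each name in a single
pass, so each lookup is expected O(1) instead of a rescan of every name
kept so far.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* FNV-1a hash of a NUL-terminated string. */
static uint64_t hash_name(const char *s)
{
	uint64_t h = 14695981039346656037ULL;

	for (; *s; s++) {
		h ^= (unsigned char)*s;
		h *= 1099511628211ULL;
	}
	return h;
}

/*
 * Hypothetical helper, not part of Git: keep only the first occurrence
 * of each name, preserving order. Returns the new count. If the table
 * cannot be allocated, the array is left untouched and n is returned.
 */
static size_t dedupe_names(const char **names, size_t n)
{
	size_t cap = 16, kept = 0, i;
	const char **table;

	while (cap < 2 * n)
		cap *= 2;	/* power of two, at most half full */
	table = calloc(cap, sizeof(*table));
	if (!table)
		return n;

	for (i = 0; i < n; i++) {
		size_t slot = hash_name(names[i]) & (cap - 1);

		/* Linear probing: stop at an empty slot or a match. */
		while (table[slot] && strcmp(table[slot], names[i]))
			slot = (slot + 1) & (cap - 1);
		if (table[slot])
			continue;	/* name already seen, skip it */
		table[slot] = names[i];
		names[kept++] = names[i];
	}
	free(table);
	return kept;
}
```

Calling `dedupe_names()` on an array holding "refs/heads/main" twice
leaves a single entry at the position of its first occurrence, which
mirrors how the header should list each refname once.
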
Signed-off-by: Karthik Nayak
---
 bundle.c               |  8 +++++++-
 object.c               | 33 ---------------------------------
 object.h               |  6 ------
 t/t6020-bundle-misc.sh |  4 ----
 4 files changed, 7 insertions(+), 44 deletions(-)

diff --git a/bundle.c b/bundle.c
index d7ad6908433..0614426e202 100644
--- a/bundle.c
+++ b/bundle.c
@@ -384,6 +384,7 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
 {
 	int i;
 	int ref_count = 0;
+	struct strset objects = STRSET_INIT;
 
 	for (i = 0; i < revs->pending.nr; i++) {
 		struct object_array_entry *e = revs->pending.objects + i;
@@ -401,6 +402,9 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
 			flag = 0;
 		display_ref = (flag & REF_ISSYMREF) ? e->name : ref;
 
+		if (strset_contains(&objects, display_ref))
+			goto skip_write_ref;
+
 		if (e->item->type == OBJ_TAG &&
 				!is_tag_in_date_range(e->item, revs)) {
 			e->item->flags |= UNINTERESTING;
@@ -423,6 +427,7 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
 		}
 
 		ref_count++;
+		strset_add(&objects, display_ref);
 		write_or_die(bundle_fd, oid_to_hex(&e->item->oid), the_hash_algo->hexsz);
 		write_or_die(bundle_fd, " ", 1);
 		write_or_die(bundle_fd, display_ref, strlen(display_ref));
@@ -431,6 +436,8 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
 		free(ref);
 	}
 
+	strset_clear(&objects);
+
 	/* end header */
 	write_or_die(bundle_fd, "\n", 1);
 	return ref_count;
@@ -566,7 +573,6 @@ int create_bundle(struct repository *r, const char *path,
 	 */
 	revs.blob_objects = revs.tree_objects = 0;
 	traverse_commit_list(&revs, write_bundle_prerequisites, NULL, &bpi);
-	object_array_remove_duplicates(&revs_copy.pending);
 
 	/* write bundle refs */
 	ref_count = write_bundle_refs(bundle_fd, &revs_copy);
diff --git a/object.c b/object.c
index 100bf9b8d12..a2c59861785 100644
--- a/object.c
+++ b/object.c
@@ -491,39 +491,6 @@ void object_array_clear(struct object_array *array)
 	array->nr = array->alloc = 0;
 }
 
-/*
- * Return true if array already contains an entry.
- */
-static int contains_object(struct object_array *array,
-			   const struct object *item, const char *name)
-{
-	unsigned nr = array->nr, i;
-	struct object_array_entry *object = array->objects;
-
-	for (i = 0; i < nr; i++, object++)
-		if (item == object->item && !strcmp(object->name, name))
-			return 1;
-	return 0;
-}
-
-void object_array_remove_duplicates(struct object_array *array)
-{
-	unsigned nr = array->nr, src;
-	struct object_array_entry *objects = array->objects;
-
-	array->nr = 0;
-	for (src = 0; src < nr; src++) {
-		if (!contains_object(array, objects[src].item,
-				     objects[src].name)) {
-			if (src != array->nr)
-				objects[array->nr] = objects[src];
-			array->nr++;
-		} else {
-			object_array_release_entry(&objects[src]);
-		}
-	}
-}
-
 void clear_object_flags(unsigned flags)
 {
 	int i;
diff --git a/object.h b/object.h
index 17f32f1103e..0e12c75922c 100644
--- a/object.h
+++ b/object.h
@@ -324,12 +324,6 @@ typedef int (*object_array_each_func_t)(struct object_array_entry *, void *);
 void object_array_filter(struct object_array *array,
 			 object_array_each_func_t want, void *cb_data);
 
-/*
- * Remove from array all but the first entry with a given name.
- * Warning: this function uses an O(N^2) algorithm.
- */
-void object_array_remove_duplicates(struct object_array *array);
-
 /*
  * Remove any objects from the array, freeing all used memory; afterwards
  * the array is ready to store more objects with add_object_array().
diff --git a/t/t6020-bundle-misc.sh b/t/t6020-bundle-misc.sh
index dd09df12873..500c81b8a14 100755
--- a/t/t6020-bundle-misc.sh
+++ b/t/t6020-bundle-misc.sh
@@ -684,7 +684,6 @@ test_expect_success 'create bundle with duplicate refnames' '
 	test_cmp expect actual
 '
 
-# This exhibits a bug, since the same refname is now added to the bundle twice.
 test_expect_success 'create bundle with duplicate refnames and --all' '
 	git bundle create out.bdl --all "main" "main" &&
 
@@ -701,7 +700,6 @@ test_expect_success 'create bundle with duplicate refnames and --all' '
 	refs/tags/v2
 	refs/tags/v3
 	HEAD
-	refs/heads/main
 	EOF
 	test_cmp expect actual
 '
@@ -717,7 +715,6 @@ test_expect_success 'create bundle with duplicate exlusion refnames' '
 	test_cmp expect actual
 '
 
-# This exhibits a bug, since the same refname is now added to the bundle twice.
 test_expect_success 'create bundle with duplicate refname short-form' '
 	git bundle create out.bdl "main" "main" "refs/heads/main" "refs/heads/main" &&
 
@@ -725,7 +722,6 @@ test_expect_success 'create bundle with duplicate refname short-form' '
 		make_user_friendly_and_stable_output >actual &&
 	cat >expect <<-\EOF &&
 	refs/heads/main
-	refs/heads/main
 	EOF
 	test_cmp expect actual
 '
-- 
GitLab