* [PATCH 0/2] bundle: fix non-linear performance scaling with refs
@ 2025-04-01 17:00 Karthik Nayak
2025-04-01 17:00 ` [PATCH 1/2] t6020: test for duplicate refnames in bundle creation Karthik Nayak
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Karthik Nayak @ 2025-04-01 17:00 UTC (permalink / raw)
To: git; +Cc: jltobler, ps, Karthik Nayak
Hello,
At GitLab, we noticed that bundle creation doesn't seem to scale linearly with
the number of references in a repository. The following benchmark demonstrates
the issue:
Benchmark 1: bundle (refcount = 100)
Time (mean ± σ): 4.4 ms ± 0.5 ms [User: 1.8 ms, System: 2.4 ms]
Range (min … max): 3.4 ms … 7.7 ms 434 runs
Benchmark 2: bundle (refcount = 1000)
Time (mean ± σ): 16.5 ms ± 1.7 ms [User: 9.6 ms, System: 7.2 ms]
Range (min … max): 14.1 ms … 21.7 ms 176 runs
Benchmark 3: bundle (refcount = 10000)
Time (mean ± σ): 220.6 ms ± 3.2 ms [User: 171.6 ms, System: 55.7 ms]
Range (min … max): 215.8 ms … 224.9 ms 13 runs
Benchmark 4: bundle (refcount = 100000)
Time (mean ± σ): 9.622 s ± 0.063 s [User: 9.143 s, System: 0.546 s]
Range (min … max): 9.563 s … 9.738 s 10 runs
Summary
bundle (refcount = 100) ran
3.79 ± 0.61 times faster than bundle (refcount = 1000)
50.63 ± 6.39 times faster than bundle (refcount = 10000)
2207.95 ± 277.35 times faster than bundle (refcount = 100000)
Digging into this, the cause is the check for duplicate refnames added by the
user. This check uses an O(N^2) algorithm, which does not scale linearly with
the number of refs.
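To make the quadratic growth concrete, here is a minimal sketch of a pairwise duplicate check that counts its own string comparisons. It is not Git's code: `dedup_pairwise` and its `comparisons` out-parameter are hypothetical names used only for illustration, and the sketch compares names only, ignoring the objects they point to.

```c
#include <stddef.h>
#include <string.h>

/*
 * Pairwise de-duplication: every candidate is compared against all
 * names kept so far. The array is compacted in place so that the
 * first `kept` slots hold the unique names. Returns the number of
 * unique names and stores the number of strcmp() calls made in
 * *comparisons.
 */
size_t dedup_pairwise(const char **names, size_t nr, size_t *comparisons)
{
	size_t kept = 0, src, i;

	*comparisons = 0;
	for (src = 0; src < nr; src++) {
		int seen = 0;

		for (i = 0; i < kept; i++) {
			(*comparisons)++;
			if (!strcmp(names[i], names[src])) {
				seen = 1;
				break;
			}
		}
		if (!seen)
			names[kept++] = names[src];
	}
	return kept;
}
```

With N distinct names this loop performs N(N-1)/2 comparisons, so in this sketch 100000 refs would need roughly 5 * 10^9 of them, while a hash-based set needs about one lookup per name.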
The first commit in this small series adds a bunch of tests for this behavior,
while also discovering a missed edge case. The second commit introduces an
alternative approach which uses an 'strset' to check for duplicates. The new
approach fixes the performance problems noticed while also fixing the earlier
missed edge case. Overall we see a 6x performance improvement with this series.
I found that there is a conflict with 'ps/object-wo-the-repository' in seen,
but the resolution seems simple enough. Happy to support as needed.
---
bundle.c | 10 +++++++++-
object.c | 33 -------------------------------
object.h | 6 ------
t/t6020-bundle-misc.sh | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 62 insertions(+), 40 deletions(-)
Karthik Nayak (2):
t6020: test for duplicate refnames in bundle creation
bundle: fix non-linear performance scaling with refs
---
---
base-commit: 683c54c999c301c2cd6f715c411407c413b1d84e
change-id: 20250322-488-generating-bundles-with-many-references-has-non-linear-performance-64aec8e0cf1d
Thanks
- Karthik
* [PATCH 1/2] t6020: test for duplicate refnames in bundle creation
2025-04-01 17:00 [PATCH 0/2] bundle: fix non-linear performance scaling with refs Karthik Nayak
@ 2025-04-01 17:00 ` Karthik Nayak
2025-04-01 17:00 ` [PATCH 2/2] bundle: fix non-linear performance scaling with refs Karthik Nayak
2025-04-08 9:00 ` [PATCH v2 0/2] " Karthik Nayak
2 siblings, 0 replies; 10+ messages in thread
From: Karthik Nayak @ 2025-04-01 17:00 UTC (permalink / raw)
To: git; +Cc: jltobler, ps, Karthik Nayak
The commit b2a6d1c686 (bundle: allow the same ref to be given more than
once, 2009-01-17) added functionality to detect and drop duplicate
refnames during bundle creation. This ensured that clones created from
such bundles wouldn't barf about duplicate refnames.
The following commit will add some optimizations to make this check
faster, but before doing that, let's add tests to capture the current
behavior.
Add tests to capture duplicate refnames provided by the user during
bundle creation. This can be a combination of:
- refnames directly provided by the user.
- refnames duplicated by using the '--all' flag alongside manually
provided references.
- exclusion criteria provided via a refname "main^!".
- short forms of refnames provided, "main" vs "refs/heads/main".
Note that currently duplicates due to usage of short and long forms go
undetected. This should be fixed with the optimizations made in the next
commit.
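As a rough illustration of why short and long forms can slip through, the sketch below contrasts comparing names exactly as typed with comparing them after resolution. It assumes the existing check works on the names as given; `toy_resolve` is a hypothetical stand-in that is far simpler than Git's real ref resolution rules.

```c
#include <stdio.h>
#include <string.h>

/*
 * Toy resolver (an assumption for illustration only): names already
 * under "refs/" are kept, anything else is treated as a branch name.
 */
void toy_resolve(const char *given, char *out, size_t outsz)
{
	if (!strncmp(given, "refs/", 5))
		snprintf(out, outsz, "%s", given);
	else
		snprintf(out, outsz, "refs/heads/%s", given);
}

/* Compares the names exactly as typed on the command line. */
int same_as_given(const char *a, const char *b)
{
	return !strcmp(a, b);
}

/* Compares the names after resolving both to their full form. */
int same_after_resolve(const char *a, const char *b)
{
	char full_a[256], full_b[256];

	toy_resolve(a, full_a, sizeof(full_a));
	toy_resolve(b, full_b, sizeof(full_b));
	return !strcmp(full_a, full_b);
}
```

Under these assumptions, `same_as_given("main", "refs/heads/main")` reports no duplicate while `same_after_resolve("main", "refs/heads/main")` does, which matches the short-form case exercised by the last test.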
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
---
t/t6020-bundle-misc.sh | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 57 insertions(+)
diff --git a/t/t6020-bundle-misc.sh b/t/t6020-bundle-misc.sh
index b3807e8f35..dd09df1287 100755
--- a/t/t6020-bundle-misc.sh
+++ b/t/t6020-bundle-misc.sh
@@ -673,6 +673,63 @@ test_expect_success 'bundle progress with --no-quiet' '
grep "%" err
'
+test_expect_success 'create bundle with duplicate refnames' '
+ git bundle create out.bdl "main" "main" &&
+
+ git bundle list-heads out.bdl |
+ make_user_friendly_and_stable_output >actual &&
+ cat >expect <<-\EOF &&
+ <COMMIT-P> refs/heads/main
+ EOF
+ test_cmp expect actual
+'
+
+# This exhibits a bug, since the same refname is now added to the bundle twice.
+test_expect_success 'create bundle with duplicate refnames and --all' '
+ git bundle create out.bdl --all "main" "main" &&
+
+ git bundle list-heads out.bdl |
+ make_user_friendly_and_stable_output >actual &&
+ cat >expect <<-\EOF &&
+ <COMMIT-P> refs/heads/main
+ <COMMIT-N> refs/heads/release
+ <COMMIT-D> refs/heads/topic/1
+ <COMMIT-H> refs/heads/topic/2
+ <COMMIT-D> refs/pull/1/head
+ <COMMIT-G> refs/pull/2/head
+ <TAG-1> refs/tags/v1
+ <TAG-2> refs/tags/v2
+ <TAG-3> refs/tags/v3
+ <COMMIT-P> HEAD
+ <COMMIT-P> refs/heads/main
+ EOF
+ test_cmp expect actual
+'
+
+test_expect_success 'create bundle with duplicate exlusion refnames' '
+ git bundle create out.bdl "main" "main^!" &&
+
+ git bundle list-heads out.bdl |
+ make_user_friendly_and_stable_output >actual &&
+ cat >expect <<-\EOF &&
+ <COMMIT-P> refs/heads/main
+ EOF
+ test_cmp expect actual
+'
+
+# This exhibits a bug, since the same refname is now added to the bundle twice.
+test_expect_success 'create bundle with duplicate refname short-form' '
+ git bundle create out.bdl "main" "main" "refs/heads/main" "refs/heads/main" &&
+
+ git bundle list-heads out.bdl |
+ make_user_friendly_and_stable_output >actual &&
+ cat >expect <<-\EOF &&
+ <COMMIT-P> refs/heads/main
+ <COMMIT-P> refs/heads/main
+ EOF
+ test_cmp expect actual
+'
+
test_expect_success 'read bundle over stdin' '
git bundle create some.bundle HEAD &&
--
2.48.1
* [PATCH 2/2] bundle: fix non-linear performance scaling with refs
2025-04-01 17:00 [PATCH 0/2] bundle: fix non-linear performance scaling with refs Karthik Nayak
2025-04-01 17:00 ` [PATCH 1/2] t6020: test for duplicate refnames in bundle creation Karthik Nayak
@ 2025-04-01 17:00 ` Karthik Nayak
2025-04-03 19:07 ` Toon Claes
2025-04-08 9:00 ` [PATCH v2 0/2] " Karthik Nayak
2 siblings, 1 reply; 10+ messages in thread
From: Karthik Nayak @ 2025-04-01 17:00 UTC (permalink / raw)
To: git; +Cc: jltobler, ps, Karthik Nayak
The 'git bundle create' command has non-linear performance with the
number of refs in the repository. Benchmarking the command shows that
a large portion of the time (~75%) is spent in the
`object_array_remove_duplicates()` function.
The `object_array_remove_duplicates()` function was added in
b2a6d1c686 (bundle: allow the same ref to be given more than once,
2009-01-17) to keep duplicate refs provided by the user from being
written to the bundle. Since this is an O(N^2) algorithm, in repos with
a large number of references, this can take up a large amount of time.
Let's instead use a 'strset' to skip duplicates inside
`write_bundle_refs()`. This improves the performance by around 6 times
when tested against a repository with 100000 refs:
Benchmark 1: bundle (refcount = 100000, revision = master)
Time (mean ± σ): 14.653 s ± 0.203 s [User: 13.940 s, System: 0.762 s]
Range (min … max): 14.237 s … 14.920 s 10 runs
Benchmark 2: bundle (refcount = 100000, revision = HEAD)
Time (mean ± σ): 2.394 s ± 0.023 s [User: 1.684 s, System: 0.798 s]
Range (min … max): 2.364 s … 2.425 s 10 runs
Summary
bundle (refcount = 100000, revision = HEAD) ran
6.12 ± 0.10 times faster than bundle (refcount = 100000, revision = master)
Previously, `object_array_remove_duplicates()` ensured that both the
refname and the object it pointed to were checked for duplicates. The
new approach, implemented within `write_bundle_refs()`, eliminates
duplicate refnames without comparing the objects they reference. This
works because, for bundle creation, we only need to prevent duplicate
refs from being written to the bundle header. The `revs->pending` array
can contain duplicates of multiple types.
First, references which resolve to the same refname. For example, "git
bundle create out.bdl master master" or "git bundle create out.bdl
refs/heads/master refs/heads/master" or "git bundle create out.bdl
master refs/heads/master". In these scenarios we want to prevent writing
"refs/heads/master" twice to the bundle header. Since both the refnames
here would point to the same object (unless there is a race), we do not
need to check equality of the object.
Second, refnames which are duplicates but do not point to the same
object. This can happen when we use an exclusion criterion. For example,
with "git bundle create out.bdl master master^!", `revs->pending` would
contain two elements, both with refname set to "master". However, one
of them would point to an INTERESTING object and the other to an
UNINTERESTING object. Since we only write refnames with INTERESTING
objects to the bundle header, we perform our duplicate checks only on
such objects.
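The two cases above can be sketched as a single pass that skips UNINTERESTING entries and then filters by refname with a hash set. This is a self-contained approximation, not the patch itself: `struct entry`, `struct name_set`, `name_set_add()` and `select_refs()` are hypothetical stand-ins for `revs->pending` and Git's 'strset', and the fixed-size table exists only to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define UNINTERESTING 1u	/* stand-in for Git's flag of the same name */
#define SET_SLOTS 64		/* fixed size, power of two; sketch only */

struct entry {			/* hypothetical pending entry */
	const char *refname;	/* resolved name, e.g. "refs/heads/master" */
	unsigned flags;
};

struct name_set {
	const char *slots[SET_SLOTS];
};

/* FNV-1a string hash. */
static uint32_t hash_name(const char *s)
{
	uint32_t h = 2166136261u;

	while (*s) {
		h ^= (unsigned char)*s++;
		h *= 16777619u;
	}
	return h;
}

/* Returns 1 if the name was newly added, 0 if it was already present. */
static int name_set_add(struct name_set *set, const char *name)
{
	uint32_t i = hash_name(name) & (SET_SLOTS - 1);

	while (set->slots[i]) {
		if (!strcmp(set->slots[i], name))
			return 0;
		i = (i + 1) & (SET_SLOTS - 1);	/* linear probing */
	}
	set->slots[i] = name;
	return 1;
}

/*
 * Copies the refnames that would be written to the header into out[]
 * and returns how many there are. Each name costs about one set
 * lookup instead of a scan over everything kept so far.
 */
size_t select_refs(const struct entry *pending, size_t nr, const char **out)
{
	struct name_set seen = { { NULL } };
	size_t i, count = 0;

	assert(nr < SET_SLOTS);	/* the toy table must never fill up */
	for (i = 0; i < nr; i++) {
		if (pending[i].flags & UNINTERESTING)
			continue;	/* excluded, e.g. via "master^!" */
		if (!name_set_add(&seen, pending[i].refname))
			continue;	/* refname already selected */
		out[count++] = pending[i].refname;
	}
	return count;
}
```

Because only the refname is used as the key, two entries with the same name are collapsed regardless of which object each one points to.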
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
---
bundle.c | 10 +++++++++-
object.c | 33 ---------------------------------
object.h | 6 ------
t/t6020-bundle-misc.sh | 4 ----
4 files changed, 9 insertions(+), 44 deletions(-)
diff --git a/bundle.c b/bundle.c
index d7ad690843..30cfba0be2 100644
--- a/bundle.c
+++ b/bundle.c
@@ -384,6 +384,9 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
{
int i;
int ref_count = 0;
+ struct strset objects;
+
+ strset_init(&objects);
for (i = 0; i < revs->pending.nr; i++) {
struct object_array_entry *e = revs->pending.objects + i;
@@ -401,6 +404,9 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
flag = 0;
display_ref = (flag & REF_ISSYMREF) ? e->name : ref;
+ if (strset_contains(&objects, display_ref))
+ goto skip_write_ref;
+
if (e->item->type == OBJ_TAG &&
!is_tag_in_date_range(e->item, revs)) {
e->item->flags |= UNINTERESTING;
@@ -423,6 +429,7 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
}
ref_count++;
+ strset_add(&objects, display_ref);
write_or_die(bundle_fd, oid_to_hex(&e->item->oid), the_hash_algo->hexsz);
write_or_die(bundle_fd, " ", 1);
write_or_die(bundle_fd, display_ref, strlen(display_ref));
@@ -431,6 +438,8 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
free(ref);
}
+ strset_clear(&objects);
+
/* end header */
write_or_die(bundle_fd, "\n", 1);
return ref_count;
@@ -566,7 +575,6 @@ int create_bundle(struct repository *r, const char *path,
*/
revs.blob_objects = revs.tree_objects = 0;
traverse_commit_list(&revs, write_bundle_prerequisites, NULL, &bpi);
- object_array_remove_duplicates(&revs_copy.pending);
/* write bundle refs */
ref_count = write_bundle_refs(bundle_fd, &revs_copy);
diff --git a/object.c b/object.c
index 100bf9b8d1..a2c5986178 100644
--- a/object.c
+++ b/object.c
@@ -491,39 +491,6 @@ void object_array_clear(struct object_array *array)
array->nr = array->alloc = 0;
}
-/*
- * Return true if array already contains an entry.
- */
-static int contains_object(struct object_array *array,
- const struct object *item, const char *name)
-{
- unsigned nr = array->nr, i;
- struct object_array_entry *object = array->objects;
-
- for (i = 0; i < nr; i++, object++)
- if (item == object->item && !strcmp(object->name, name))
- return 1;
- return 0;
-}
-
-void object_array_remove_duplicates(struct object_array *array)
-{
- unsigned nr = array->nr, src;
- struct object_array_entry *objects = array->objects;
-
- array->nr = 0;
- for (src = 0; src < nr; src++) {
- if (!contains_object(array, objects[src].item,
- objects[src].name)) {
- if (src != array->nr)
- objects[array->nr] = objects[src];
- array->nr++;
- } else {
- object_array_release_entry(&objects[src]);
- }
- }
-}
-
void clear_object_flags(unsigned flags)
{
int i;
diff --git a/object.h b/object.h
index 17f32f1103..0e12c75922 100644
--- a/object.h
+++ b/object.h
@@ -324,12 +324,6 @@ typedef int (*object_array_each_func_t)(struct object_array_entry *, void *);
void object_array_filter(struct object_array *array,
object_array_each_func_t want, void *cb_data);
-/*
- * Remove from array all but the first entry with a given name.
- * Warning: this function uses an O(N^2) algorithm.
- */
-void object_array_remove_duplicates(struct object_array *array);
-
/*
* Remove any objects from the array, freeing all used memory; afterwards
* the array is ready to store more objects with add_object_array().
diff --git a/t/t6020-bundle-misc.sh b/t/t6020-bundle-misc.sh
index dd09df1287..500c81b8a1 100755
--- a/t/t6020-bundle-misc.sh
+++ b/t/t6020-bundle-misc.sh
@@ -684,7 +684,6 @@ test_expect_success 'create bundle with duplicate refnames' '
test_cmp expect actual
'
-# This exhibits a bug, since the same refname is now added to the bundle twice.
test_expect_success 'create bundle with duplicate refnames and --all' '
git bundle create out.bdl --all "main" "main" &&
@@ -701,7 +700,6 @@ test_expect_success 'create bundle with duplicate refnames and --all' '
<TAG-2> refs/tags/v2
<TAG-3> refs/tags/v3
<COMMIT-P> HEAD
- <COMMIT-P> refs/heads/main
EOF
test_cmp expect actual
'
@@ -717,7 +715,6 @@ test_expect_success 'create bundle with duplicate exlusion refnames' '
test_cmp expect actual
'
-# This exhibits a bug, since the same refname is now added to the bundle twice.
test_expect_success 'create bundle with duplicate refname short-form' '
git bundle create out.bdl "main" "main" "refs/heads/main" "refs/heads/main" &&
@@ -725,7 +722,6 @@ test_expect_success 'create bundle with duplicate refname short-form' '
make_user_friendly_and_stable_output >actual &&
cat >expect <<-\EOF &&
<COMMIT-P> refs/heads/main
- <COMMIT-P> refs/heads/main
EOF
test_cmp expect actual
'
--
2.48.1
* Re: [PATCH 2/2] bundle: fix non-linear performance scaling with refs
2025-04-01 17:00 ` [PATCH 2/2] bundle: fix non-linear performance scaling with refs Karthik Nayak
@ 2025-04-03 19:07 ` Toon Claes
2025-04-06 20:48 ` Karthik Nayak
0 siblings, 1 reply; 10+ messages in thread
From: Toon Claes @ 2025-04-03 19:07 UTC (permalink / raw)
To: Karthik Nayak, git; +Cc: jltobler, ps, Karthik Nayak
Karthik Nayak <karthik.188@gmail.com> writes:
> The 'git bundle create' command has non-linear performance with the
> number of refs in the repository. Benchmarking the command shows that
> a large portion of the time (~75%) is spent in the
> `object_array_remove_duplicates()` function.
>
> The `object_array_remove_duplicates()` function was added in
> b2a6d1c686 (bundle: allow the same ref to be given more than once,
> 2009-01-17) to skip duplicate refs provided by the user from being
> written to the bundle. Since this is an O(N^2) algorithm, in repos with
> large number of references, this can take up a large amount of time.
>
> Let's instead use a 'strset' to skip duplicates inside
> `write_bundle_refs()`. This improves the performance by around 6 times
> when tested against a repository with 100000 refs:
>
> Benchmark 1: bundle (refcount = 100000, revision = master)
> Time (mean ± σ): 14.653 s ± 0.203 s [User: 13.940 s, System: 0.762 s]
> Range (min … max): 14.237 s … 14.920 s 10 runs
>
> Benchmark 2: bundle (refcount = 100000, revision = HEAD)
> Time (mean ± σ): 2.394 s ± 0.023 s [User: 1.684 s, System: 0.798 s]
> Range (min … max): 2.364 s … 2.425 s 10 runs
>
> Summary
> bundle (refcount = 100000, revision = HEAD) ran
> 6.12 ± 0.10 times faster than bundle (refcount = 100000, revision = master)
That's a good find!
> Previously, `object_array_remove_duplicates()` ensured that both the
> refname and the object it pointed to were checked for duplicates. The
> new approach, implemented within `write_bundle_refs()`, eliminates
> duplicate refnames without comparing the objects they reference. This
> works because, for bundle creation, we only need to prevent duplicate
> refs from being written to the bundle header. The `revs->pending` array
> can contain duplicates of multiple types.
Makes sense to me.
> First, references which resolve to the same refname. For e.g. "git
> bundle create out.bdl master master" or "git bundle create out.bdl
> refs/heads/master refs/heads/master" or "git bundle create out.bdl
> master refs/heads/master". In these scenarios we want to prevent writing
> "refs/heads/master" twice to the bundle header. Since both the refnames
> here would point to the same object (unless there is a race), we do not
> need to check equality of the object.
Yeah, we can never be sure about the changes that happen while the
bundle is being created. I recently fixed another race[1] where comparing
equality of the object caused the ref to be omitted. We can only act on
a "best effort" basis, and having the ref point to /some/ object is the
best we can do.
[1]: https://lore.kernel.org/git/20241211-fix-bundle-create-race-v3-1-0587f6f9db1b@iotcl.com/
> Second, refnames which are duplicates but do not point to the same
> object. This can happen when we use an exclusion criteria. For e.g. "git
> bundle create out.bdl master master^!", Here `revs->pending` would
> contain two elements, both with refname set to "master". However, each
> of them would be pointing to an INTERESTING and UNINTERESTING object
> respectively. Since we only write refnames with INTERESTING objects to
> the bundle header, we perform our duplicate checks only on such
> objects.
Thanks for that context, I didn't consider that.
> Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
> ---
> bundle.c | 10 +++++++++-
> object.c | 33 ---------------------------------
> object.h | 6 ------
> t/t6020-bundle-misc.sh | 4 ----
> 4 files changed, 9 insertions(+), 44 deletions(-)
>
> diff --git a/bundle.c b/bundle.c
> index d7ad690843..30cfba0be2 100644
> --- a/bundle.c
> +++ b/bundle.c
> @@ -384,6 +384,9 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
> {
> int i;
> int ref_count = 0;
> + struct strset objects;
> +
> + strset_init(&objects);
Any reason why you're not using the `STRSET_INIT` macro?
>
> for (i = 0; i < revs->pending.nr; i++) {
> struct object_array_entry *e = revs->pending.objects + i;
> @@ -401,6 +404,9 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
> flag = 0;
> display_ref = (flag & REF_ISSYMREF) ? e->name : ref;
>
> + if (strset_contains(&objects, display_ref))
> + goto skip_write_ref;
> +
> if (e->item->type == OBJ_TAG &&
> !is_tag_in_date_range(e->item, revs)) {
> e->item->flags |= UNINTERESTING;
> @@ -423,6 +429,7 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
> }
>
> ref_count++;
> + strset_add(&objects, display_ref);
> write_or_die(bundle_fd, oid_to_hex(&e->item->oid), the_hash_algo->hexsz);
> write_or_die(bundle_fd, " ", 1);
> write_or_die(bundle_fd, display_ref, strlen(display_ref));
> @@ -431,6 +438,8 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
> free(ref);
> }
>
> + strset_clear(&objects);
> +
> /* end header */
> write_or_die(bundle_fd, "\n", 1);
> return ref_count;
> @@ -566,7 +575,6 @@ int create_bundle(struct repository *r, const char *path,
> */
> revs.blob_objects = revs.tree_objects = 0;
> traverse_commit_list(&revs, write_bundle_prerequisites, NULL, &bpi);
> - object_array_remove_duplicates(&revs_copy.pending);
>
> /* write bundle refs */
> ref_count = write_bundle_refs(bundle_fd, &revs_copy);
> diff --git a/object.c b/object.c
> index 100bf9b8d1..a2c5986178 100644
> --- a/object.c
> +++ b/object.c
> @@ -491,39 +491,6 @@ void object_array_clear(struct object_array *array)
> array->nr = array->alloc = 0;
> }
>
> -/*
> - * Return true if array already contains an entry.
> - */
> -static int contains_object(struct object_array *array,
> - const struct object *item, const char *name)
> -{
> - unsigned nr = array->nr, i;
> - struct object_array_entry *object = array->objects;
> -
> - for (i = 0; i < nr; i++, object++)
> - if (item == object->item && !strcmp(object->name, name))
> - return 1;
> - return 0;
> -}
> -
> -void object_array_remove_duplicates(struct object_array *array)
> -{
> - unsigned nr = array->nr, src;
> - struct object_array_entry *objects = array->objects;
> -
> - array->nr = 0;
> - for (src = 0; src < nr; src++) {
> - if (!contains_object(array, objects[src].item,
> - objects[src].name)) {
> - if (src != array->nr)
> - objects[array->nr] = objects[src];
> - array->nr++;
> - } else {
> - object_array_release_entry(&objects[src]);
> - }
> - }
> -}
> -
> void clear_object_flags(unsigned flags)
> {
> int i;
> diff --git a/object.h b/object.h
> index 17f32f1103..0e12c75922 100644
> --- a/object.h
> +++ b/object.h
> @@ -324,12 +324,6 @@ typedef int (*object_array_each_func_t)(struct object_array_entry *, void *);
> void object_array_filter(struct object_array *array,
> object_array_each_func_t want, void *cb_data);
>
> -/*
> - * Remove from array all but the first entry with a given name.
> - * Warning: this function uses an O(N^2) algorithm.
Funny this has been here for more than 10 years. Thanks for this cleanup.
> - */
> -void object_array_remove_duplicates(struct object_array *array);
> -
> /*
> * Remove any objects from the array, freeing all used memory; afterwards
> * the array is ready to store more objects with add_object_array().
> diff --git a/t/t6020-bundle-misc.sh b/t/t6020-bundle-misc.sh
> index dd09df1287..500c81b8a1 100755
> --- a/t/t6020-bundle-misc.sh
> +++ b/t/t6020-bundle-misc.sh
> @@ -684,7 +684,6 @@ test_expect_success 'create bundle with duplicate refnames' '
> test_cmp expect actual
> '
>
> -# This exhibits a bug, since the same refname is now added to the bundle twice.
> test_expect_success 'create bundle with duplicate refnames and --all' '
> git bundle create out.bdl --all "main" "main" &&
>
> @@ -701,7 +700,6 @@ test_expect_success 'create bundle with duplicate refnames and --all' '
> <TAG-2> refs/tags/v2
> <TAG-3> refs/tags/v3
> <COMMIT-P> HEAD
> - <COMMIT-P> refs/heads/main
> EOF
> test_cmp expect actual
> '
> @@ -717,7 +715,6 @@ test_expect_success 'create bundle with duplicate exlusion refnames' '
> test_cmp expect actual
> '
>
> -# This exhibits a bug, since the same refname is now added to the bundle twice.
> test_expect_success 'create bundle with duplicate refname short-form' '
> git bundle create out.bdl "main" "main" "refs/heads/main" "refs/heads/main" &&
>
> @@ -725,7 +722,6 @@ test_expect_success 'create bundle with duplicate refname short-form' '
> make_user_friendly_and_stable_output >actual &&
> cat >expect <<-\EOF &&
> <COMMIT-P> refs/heads/main
> - <COMMIT-P> refs/heads/main
> EOF
> test_cmp expect actual
> '
Great work on the alternative implementation. And thanks for adding these
tests and actually fixing them. I've been manually testing a few more
edge cases, and I couldn't find any other scenario that's not covered by
the current implementation.
I approve.
--
Toon
* Re: [PATCH 2/2] bundle: fix non-linear performance scaling with refs
2025-04-03 19:07 ` Toon Claes
@ 2025-04-06 20:48 ` Karthik Nayak
0 siblings, 0 replies; 10+ messages in thread
From: Karthik Nayak @ 2025-04-06 20:48 UTC (permalink / raw)
To: Toon Claes, git; +Cc: jltobler, ps
Toon Claes <toon@iotcl.com> writes:
> Karthik Nayak <karthik.188@gmail.com> writes:
>
>> The 'git bundle create' command has non-linear performance with the
>> number of refs in the repository. Benchmarking the command shows that
>> a large portion of the time (~75%) is spent in the
>> `object_array_remove_duplicates()` function.
>>
>> The `object_array_remove_duplicates()` function was added in
>> b2a6d1c686 (bundle: allow the same ref to be given more than once,
>> 2009-01-17) to skip duplicate refs provided by the user from being
>> written to the bundle. Since this is an O(N^2) algorithm, in repos with
>> large number of references, this can take up a large amount of time.
>>
>> Let's instead use a 'strset' to skip duplicates inside
>> `write_bundle_refs()`. This improves the performance by around 6 times
>> when tested against a repository with 100000 refs:
>>
>> Benchmark 1: bundle (refcount = 100000, revision = master)
>> Time (mean ± σ): 14.653 s ± 0.203 s [User: 13.940 s, System: 0.762 s]
>> Range (min … max): 14.237 s … 14.920 s 10 runs
>>
>> Benchmark 2: bundle (refcount = 100000, revision = HEAD)
>> Time (mean ± σ): 2.394 s ± 0.023 s [User: 1.684 s, System: 0.798 s]
>> Range (min … max): 2.364 s … 2.425 s 10 runs
>>
>> Summary
>> bundle (refcount = 100000, revision = HEAD) ran
>> 6.12 ± 0.10 times faster than bundle (refcount = 100000, revision = master)
>
> That's a good find!
>
>> Previously, `object_array_remove_duplicates()` ensured that both the
>> refname and the object it pointed to were checked for duplicates. The
>> new approach, implemented within `write_bundle_refs()`, eliminates
>> duplicate refnames without comparing the objects they reference. This
>> works because, for bundle creation, we only need to prevent duplicate
>> refs from being written to the bundle header. The `revs->pending` array
>> can contain duplicates of multiple types.
>
> Makes sense to me.
>
>> First, references which resolve to the same refname. For e.g. "git
>> bundle create out.bdl master master" or "git bundle create out.bdl
>> refs/heads/master refs/heads/master" or "git bundle create out.bdl
>> master refs/heads/master". In these scenarios we want to prevent writing
>> "refs/heads/master" twice to the bundle header. Since both the refnames
>> here would point to the same object (unless there is a race), we do not
>> need to check equality of the object.
>
> Yeah, we can never be sure about the changes that happen while the
> bundle is being created. I fixed another race[1] recently which also was
> comparing equality of the object, that causes the ref to be omitted. We
> can only act by "best effort" and having the ref point to /some/ object
> is the best we can do.
>
> [1]: https://lore.kernel.org/git/20241211-fix-bundle-create-race-v3-1-0587f6f9db1b@iotcl.com/
>
>> Second, refnames which are duplicates but do not point to the same
>> object. This can happen when we use an exclusion criteria. For e.g. "git
>> bundle create out.bdl master master^!", Here `revs->pending` would
>> contain two elements, both with refname set to "master". However, each
>> of them would be pointing to an INTERESTING and UNINTERESTING object
>> respectively. Since we only write refnames with INTERESTING objects to
>> the bundle header, we perform our duplicate checks only on such
>> objects.
>
> Thanks for that context, I didn't consider that.
>
I didn't at first, but luckily we have a test for such refs, which got
tripped and allowed me to consider that scenario too!
>> Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
>> ---
>> bundle.c | 10 +++++++++-
>> object.c | 33 ---------------------------------
>> object.h | 6 ------
>> t/t6020-bundle-misc.sh | 4 ----
>> 4 files changed, 9 insertions(+), 44 deletions(-)
>>
>> diff --git a/bundle.c b/bundle.c
>> index d7ad690843..30cfba0be2 100644
>> --- a/bundle.c
>> +++ b/bundle.c
>> @@ -384,6 +384,9 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
>> {
>> int i;
>> int ref_count = 0;
>> + struct strset objects;
>> +
>> + strset_init(&objects);
>
> Any reason why you're not using the `STRSET_INIT` macro?
>
None. This should be the ideal way to do it, will change in the next
version.
>>
>> for (i = 0; i < revs->pending.nr; i++) {
>> struct object_array_entry *e = revs->pending.objects + i;
>> @@ -401,6 +404,9 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
>> flag = 0;
>> display_ref = (flag & REF_ISSYMREF) ? e->name : ref;
>>
>> + if (strset_contains(&objects, display_ref))
>> + goto skip_write_ref;
>> +
>> if (e->item->type == OBJ_TAG &&
>> !is_tag_in_date_range(e->item, revs)) {
>> e->item->flags |= UNINTERESTING;
>> @@ -423,6 +429,7 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
>> }
>>
>> ref_count++;
>> + strset_add(&objects, display_ref);
>> write_or_die(bundle_fd, oid_to_hex(&e->item->oid), the_hash_algo->hexsz);
>> write_or_die(bundle_fd, " ", 1);
>> write_or_die(bundle_fd, display_ref, strlen(display_ref));
>> @@ -431,6 +438,8 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
>> free(ref);
>> }
>>
>> + strset_clear(&objects);
>> +
>> /* end header */
>> write_or_die(bundle_fd, "\n", 1);
>> return ref_count;
>> @@ -566,7 +575,6 @@ int create_bundle(struct repository *r, const char *path,
>> */
>> revs.blob_objects = revs.tree_objects = 0;
>> traverse_commit_list(&revs, write_bundle_prerequisites, NULL, &bpi);
>> - object_array_remove_duplicates(&revs_copy.pending);
>>
>> /* write bundle refs */
>> ref_count = write_bundle_refs(bundle_fd, &revs_copy);
>> diff --git a/object.c b/object.c
>> index 100bf9b8d1..a2c5986178 100644
>> --- a/object.c
>> +++ b/object.c
>> @@ -491,39 +491,6 @@ void object_array_clear(struct object_array *array)
>> array->nr = array->alloc = 0;
>> }
>>
>> -/*
>> - * Return true if array already contains an entry.
>> - */
>> -static int contains_object(struct object_array *array,
>> - const struct object *item, const char *name)
>> -{
>> - unsigned nr = array->nr, i;
>> - struct object_array_entry *object = array->objects;
>> -
>> - for (i = 0; i < nr; i++, object++)
>> - if (item == object->item && !strcmp(object->name, name))
>> - return 1;
>> - return 0;
>> -}
>> -
>> -void object_array_remove_duplicates(struct object_array *array)
>> -{
>> - unsigned nr = array->nr, src;
>> - struct object_array_entry *objects = array->objects;
>> -
>> - array->nr = 0;
>> - for (src = 0; src < nr; src++) {
>> - if (!contains_object(array, objects[src].item,
>> - objects[src].name)) {
>> - if (src != array->nr)
>> - objects[array->nr] = objects[src];
>> - array->nr++;
>> - } else {
>> - object_array_release_entry(&objects[src]);
>> - }
>> - }
>> -}
>> -
>> void clear_object_flags(unsigned flags)
>> {
>> int i;
>> diff --git a/object.h b/object.h
>> index 17f32f1103..0e12c75922 100644
>> --- a/object.h
>> +++ b/object.h
>> @@ -324,12 +324,6 @@ typedef int (*object_array_each_func_t)(struct object_array_entry *, void *);
>> void object_array_filter(struct object_array *array,
>> object_array_each_func_t want, void *cb_data);
>>
>> -/*
>> - * Remove from array all but the first entry with a given name.
>> - * Warning: this function uses an O(N^2) algorithm.
>
> Funny this has been here for more than 10 years. Thanks for this cleanup.
>
>> - */
>> -void object_array_remove_duplicates(struct object_array *array);
>> -
>> /*
>> * Remove any objects from the array, freeing all used memory; afterwards
>> * the array is ready to store more objects with add_object_array().
>> diff --git a/t/t6020-bundle-misc.sh b/t/t6020-bundle-misc.sh
>> index dd09df1287..500c81b8a1 100755
>> --- a/t/t6020-bundle-misc.sh
>> +++ b/t/t6020-bundle-misc.sh
>> @@ -684,7 +684,6 @@ test_expect_success 'create bundle with duplicate refnames' '
>> test_cmp expect actual
>> '
>>
>> -# This exhibits a bug, since the same refname is now added to the bundle twice.
>> test_expect_success 'create bundle with duplicate refnames and --all' '
>> git bundle create out.bdl --all "main" "main" &&
>>
>> @@ -701,7 +700,6 @@ test_expect_success 'create bundle with duplicate refnames and --all' '
>> <TAG-2> refs/tags/v2
>> <TAG-3> refs/tags/v3
>> <COMMIT-P> HEAD
>> - <COMMIT-P> refs/heads/main
>> EOF
>> test_cmp expect actual
>> '
>> @@ -717,7 +715,6 @@ test_expect_success 'create bundle with duplicate exlusion refnames' '
>> test_cmp expect actual
>> '
>>
>> -# This exhibits a bug, since the same refname is now added to the bundle twice.
>> test_expect_success 'create bundle with duplicate refname short-form' '
>> git bundle create out.bdl "main" "main" "refs/heads/main" "refs/heads/main" &&
>>
>> @@ -725,7 +722,6 @@ test_expect_success 'create bundle with duplicate refname short-form' '
>> make_user_friendly_and_stable_output >actual &&
>> cat >expect <<-\EOF &&
>> <COMMIT-P> refs/heads/main
>> - <COMMIT-P> refs/heads/main
>> EOF
>> test_cmp expect actual
>> '
>
> Great work on the alternative implementation. And thanks for adding these
> tests and actually fixing them. I've been manually testing a few more
> edge cases, and I couldn't find any other scenario that's not covered by the
> current implementation.
>
> I approve.
>
> --
> Toon
Appreciate you taking the time and exploring all scenarios. Thanks for
the review.
Karthik
* [PATCH v2 0/2] bundle: fix non-linear performance scaling with refs
2025-04-01 17:00 [PATCH 0/2] bundle: fix non-linear performance scaling with refs Karthik Nayak
2025-04-01 17:00 ` [PATCH 1/2] t6020: test for duplicate refnames in bundle creation Karthik Nayak
2025-04-01 17:00 ` [PATCH 2/2] bundle: fix non-linear performance scaling with refs Karthik Nayak
@ 2025-04-08 9:00 ` Karthik Nayak
2025-04-08 9:00 ` [PATCH v2 1/2] t6020: test for duplicate refnames in bundle creation Karthik Nayak
2025-04-08 9:00 ` [PATCH v2 2/2] bundle: fix non-linear performance scaling with refs Karthik Nayak
2 siblings, 2 replies; 10+ messages in thread
From: Karthik Nayak @ 2025-04-08 9:00 UTC (permalink / raw)
To: git; +Cc: jltobler, ps, toon, Karthik Nayak
Hello,
At GitLab, we noticed that bundle creation doesn't seem to scale linearly
with the number of references in a repository. The following benchmark
demonstrates the issue:
Benchmark 1: bundle (refcount = 100)
Time (mean ± σ): 4.4 ms ± 0.5 ms [User: 1.8 ms, System: 2.4 ms]
Range (min … max): 3.4 ms … 7.7 ms 434 runs
Benchmark 2: bundle (refcount = 1000)
Time (mean ± σ): 16.5 ms ± 1.7 ms [User: 9.6 ms, System: 7.2 ms]
Range (min … max): 14.1 ms … 21.7 ms 176 runs
Benchmark 3: bundle (refcount = 10000)
Time (mean ± σ): 220.6 ms ± 3.2 ms [User: 171.6 ms, System: 55.7 ms]
Range (min … max): 215.8 ms … 224.9 ms 13 runs
Benchmark 4: bundle (refcount = 100000)
Time (mean ± σ): 9.622 s ± 0.063 s [User: 9.143 s, System: 0.546 s]
Range (min … max): 9.563 s … 9.738 s 10 runs
Summary
bundle (refcount = 100) ran
3.79 ± 0.61 times faster than bundle (refcount = 1000)
50.63 ± 6.39 times faster than bundle (refcount = 10000)
2207.95 ± 277.35 times faster than bundle (refcount = 100000)
Digging into this, the cause is the check for duplicate refnames added by
the user. This check uses an O(N^2) algorithm, which does not scale
linearly with the number of refs.
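As a rough reading of the benchmark above, the growth factor for each 10x increase in refs can be worked out from the reported means. The model T(N) ~ a*N + b*N^2 is an assumption used only for interpretation; it was not fitted or measured in this series:

```latex
\begin{align*}
\frac{T(10^{3})}{T(10^{2})} &\approx \frac{16.5\ \text{ms}}{4.4\ \text{ms}} \approx 3.8, \\
\frac{T(10^{4})}{T(10^{3})} &\approx \frac{220.6\ \text{ms}}{16.5\ \text{ms}} \approx 13.4, \\
\frac{T(10^{5})}{T(10^{4})} &\approx \frac{9622\ \text{ms}}{220.6\ \text{ms}} \approx 43.6.
\end{align*}
```

A purely linear cost would give a factor of at most 10 per step and a purely quadratic one a factor of 100. The factors climbing past 10 is what one would expect if a quadratic term increasingly dominates as N grows.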
The first commit in this small series adds a bunch of tests for this behavior,
while also discovering a missed edge case. The second commit introduces an
alternative approach which uses an 'strset' to check for duplicates. The new
approach fixes the performance problems noticed while also fixing the earlier
missed edge case. Overall we see a 6x performance improvement with this series.
I found that there is a conflict with 'ps/object-wo-the-repository' in seen,
the resolution seems simple enough. Happy to support as needed.
---
Changes in v2:
- Use STRSET_INIT macro instead of strset_init().
- Link to v1: https://lore.kernel.org/r/20250401-488-generating-bundles-with-many-references-has-non-linear-performance-v1-0-6d23b2d96557@gmail.com
---
bundle.c | 8 +++++++-
object.c | 33 -------------------------------
object.h | 6 ------
t/t6020-bundle-misc.sh | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 60 insertions(+), 40 deletions(-)
Karthik Nayak (2):
t6020: test for duplicate refnames in bundle creation
bundle: fix non-linear performance scaling with refs
---
Range-diff versus v1:
1: 1fd141cd34 = 1: 2b608cb908 t6020: test for duplicate refnames in bundle creation
2: 25b86d1a6c ! 2: 054389419f bundle: fix non-linear performance scaling with refs
@@ bundle.c: static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
{
int i;
int ref_count = 0;
-+ struct strset objects;
-+
-+ strset_init(&objects);
++ struct strset objects = STRSET_INIT;
for (i = 0; i < revs->pending.nr; i++) {
struct object_array_entry *e = revs->pending.objects + i;
---
base-commit: 683c54c999c301c2cd6f715c411407c413b1d84e
change-id: 20250322-488-generating-bundles-with-many-references-has-non-linear-performance-64aec8e0cf1d
Thanks
- Karthik
* [PATCH v2 1/2] t6020: test for duplicate refnames in bundle creation
2025-04-08 9:00 ` [PATCH v2 0/2] " Karthik Nayak
@ 2025-04-08 9:00 ` Karthik Nayak
2025-04-08 9:00 ` [PATCH v2 2/2] bundle: fix non-linear performance scaling with refs Karthik Nayak
1 sibling, 0 replies; 10+ messages in thread
From: Karthik Nayak @ 2025-04-08 9:00 UTC (permalink / raw)
To: git; +Cc: jltobler, ps, toon, Karthik Nayak
The commit b2a6d1c686 (bundle: allow the same ref to be given more than
once, 2009-01-17) added functionality to detect and drop duplicate
refnames during bundle creation. This ensured that clones created from
such bundles wouldn't barf about duplicate refnames.
The following commit will add some optimizations to make this check
faster, but before doing that, it is worth adding tests to capture the
current behavior.
Add tests to capture duplicate refnames provided by the user during
bundle creation. This can be a combination of:
- refnames directly provided by the user.
- refnames duplicated by using the '--all' flag alongside manually
provided references.
- exclusion criteria provided via a refname "main^!".
- short forms of refnames provided, "main" vs "refs/heads/main".
Note that duplicates due to mixing short and long forms currently go
undetected. This should be fixed with the optimizations made in the next
commit.
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
---
t/t6020-bundle-misc.sh | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 57 insertions(+)
diff --git a/t/t6020-bundle-misc.sh b/t/t6020-bundle-misc.sh
index b3807e8f35..dd09df1287 100755
--- a/t/t6020-bundle-misc.sh
+++ b/t/t6020-bundle-misc.sh
@@ -673,6 +673,63 @@ test_expect_success 'bundle progress with --no-quiet' '
grep "%" err
'
+test_expect_success 'create bundle with duplicate refnames' '
+ git bundle create out.bdl "main" "main" &&
+
+ git bundle list-heads out.bdl |
+ make_user_friendly_and_stable_output >actual &&
+ cat >expect <<-\EOF &&
+ <COMMIT-P> refs/heads/main
+ EOF
+ test_cmp expect actual
+'
+
+# This exhibits a bug, since the same refname is now added to the bundle twice.
+test_expect_success 'create bundle with duplicate refnames and --all' '
+ git bundle create out.bdl --all "main" "main" &&
+
+ git bundle list-heads out.bdl |
+ make_user_friendly_and_stable_output >actual &&
+ cat >expect <<-\EOF &&
+ <COMMIT-P> refs/heads/main
+ <COMMIT-N> refs/heads/release
+ <COMMIT-D> refs/heads/topic/1
+ <COMMIT-H> refs/heads/topic/2
+ <COMMIT-D> refs/pull/1/head
+ <COMMIT-G> refs/pull/2/head
+ <TAG-1> refs/tags/v1
+ <TAG-2> refs/tags/v2
+ <TAG-3> refs/tags/v3
+ <COMMIT-P> HEAD
+ <COMMIT-P> refs/heads/main
+ EOF
+ test_cmp expect actual
+'
+
+test_expect_success 'create bundle with duplicate exlusion refnames' '
+ git bundle create out.bdl "main" "main^!" &&
+
+ git bundle list-heads out.bdl |
+ make_user_friendly_and_stable_output >actual &&
+ cat >expect <<-\EOF &&
+ <COMMIT-P> refs/heads/main
+ EOF
+ test_cmp expect actual
+'
+
+# This exhibits a bug, since the same refname is now added to the bundle twice.
+test_expect_success 'create bundle with duplicate refname short-form' '
+ git bundle create out.bdl "main" "main" "refs/heads/main" "refs/heads/main" &&
+
+ git bundle list-heads out.bdl |
+ make_user_friendly_and_stable_output >actual &&
+ cat >expect <<-\EOF &&
+ <COMMIT-P> refs/heads/main
+ <COMMIT-P> refs/heads/main
+ EOF
+ test_cmp expect actual
+'
+
test_expect_success 'read bundle over stdin' '
git bundle create some.bundle HEAD &&
--
2.48.1
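One plausible reading of why the short-form case slips through, sketched below: the removed check compared the name as given on the command line, so "main" and "refs/heads/main" look different even though both resolve to the same ref. The helper `resolve_refname()` is hypothetical and far simpler than git's real ref resolution; it only prepends "refs/heads/" to names that do not already start with "refs/".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical, much-simplified stand-in for ref resolution: names that
 * already start with "refs/" are kept, anything else is treated as a
 * branch name. Real git applies more rules than this.
 */
static void resolve_refname(const char *given, char *out, size_t outlen)
{
	if (!strncmp(given, "refs/", 5))
		snprintf(out, outlen, "%s", given);
	else
		snprintf(out, outlen, "refs/heads/%s", given);
}

/* Compare the names exactly as the user typed them. */
static int same_given_name(const char *a, const char *b)
{
	return !strcmp(a, b);
}

/* Compare the names after resolving both to their full form. */
static int same_resolved_name(const char *a, const char *b)
{
	char ra[256], rb[256];

	resolve_refname(a, ra, sizeof(ra));
	resolve_refname(b, rb, sizeof(rb));
	return !strcmp(ra, rb);
}
```

Keying the duplicate check on the resolved name is what makes "main" and "refs/heads/main" collapse into one entry in this sketch.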
* [PATCH v2 2/2] bundle: fix non-linear performance scaling with refs
2025-04-08 9:00 ` [PATCH v2 0/2] " Karthik Nayak
2025-04-08 9:00 ` [PATCH v2 1/2] t6020: test for duplicate refnames in bundle creation Karthik Nayak
@ 2025-04-08 9:00 ` Karthik Nayak
2025-04-10 8:57 ` Toon Claes
1 sibling, 1 reply; 10+ messages in thread
From: Karthik Nayak @ 2025-04-08 9:00 UTC (permalink / raw)
To: git; +Cc: jltobler, ps, toon, Karthik Nayak
The 'git bundle create' command has non-linear performance with the
number of refs in the repository. Benchmarking the command shows that
a large portion of the time (~75%) is spent in the
`object_array_remove_duplicates()` function.
The `object_array_remove_duplicates()` function was added in
b2a6d1c686 (bundle: allow the same ref to be given more than once,
2009-01-17) to skip duplicate refs provided by the user from being
written to the bundle. Since this is an O(N^2) algorithm, in repos with a
large number of references, this can take up a large amount of time.
Let's instead use a 'strset' to skip duplicates inside
`write_bundle_refs()`. This improves the performance by around 6 times
when tested against a repository with 100000 refs:
Benchmark 1: bundle (refcount = 100000, revision = master)
Time (mean ± σ): 14.653 s ± 0.203 s [User: 13.940 s, System: 0.762 s]
Range (min … max): 14.237 s … 14.920 s 10 runs
Benchmark 2: bundle (refcount = 100000, revision = HEAD)
Time (mean ± σ): 2.394 s ± 0.023 s [User: 1.684 s, System: 0.798 s]
Range (min … max): 2.364 s … 2.425 s 10 runs
Summary
bundle (refcount = 100000, revision = HEAD) ran
6.12 ± 0.10 times faster than bundle (refcount = 100000, revision = master)
Previously, `object_array_remove_duplicates()` ensured that both the
refname and the object it pointed to were checked for duplicates. The
new approach, implemented within `write_bundle_refs()`, eliminates
duplicate refnames without comparing the objects they reference. This
works because, for bundle creation, we only need to prevent duplicate
refs from being written to the bundle header. The `revs->pending` array
can contain duplicates of multiple types.
First, references which resolve to the same refname. For example, "git
bundle create out.bdl master master" or "git bundle create out.bdl
refs/heads/master refs/heads/master" or "git bundle create out.bdl
master refs/heads/master". In these scenarios we want to prevent writing
"refs/heads/master" twice to the bundle header. Since both the refnames
here would point to the same object (unless there is a race), we do not
need to check equality of the object.
Second, refnames which are duplicates but do not point to the same
object. This can happen when we use an exclusion criterion. For example,
"git bundle create out.bdl master master^!". Here `revs->pending` would
contain two elements, both with refname set to "master". However, the
first would point to an INTERESTING object and the second to an
UNINTERESTING one. Since we only write refnames with INTERESTING objects
to the bundle header, we perform our duplicate checks only on such objects.
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
---
bundle.c | 8 +++++++-
object.c | 33 ---------------------------------
object.h | 6 ------
t/t6020-bundle-misc.sh | 4 ----
4 files changed, 7 insertions(+), 44 deletions(-)
diff --git a/bundle.c b/bundle.c
index d7ad690843..0614426e20 100644
--- a/bundle.c
+++ b/bundle.c
@@ -384,6 +384,7 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
{
int i;
int ref_count = 0;
+ struct strset objects = STRSET_INIT;
for (i = 0; i < revs->pending.nr; i++) {
struct object_array_entry *e = revs->pending.objects + i;
@@ -401,6 +402,9 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
flag = 0;
display_ref = (flag & REF_ISSYMREF) ? e->name : ref;
+ if (strset_contains(&objects, display_ref))
+ goto skip_write_ref;
+
if (e->item->type == OBJ_TAG &&
!is_tag_in_date_range(e->item, revs)) {
e->item->flags |= UNINTERESTING;
@@ -423,6 +427,7 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
}
ref_count++;
+ strset_add(&objects, display_ref);
write_or_die(bundle_fd, oid_to_hex(&e->item->oid), the_hash_algo->hexsz);
write_or_die(bundle_fd, " ", 1);
write_or_die(bundle_fd, display_ref, strlen(display_ref));
@@ -431,6 +436,8 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
free(ref);
}
+ strset_clear(&objects);
+
/* end header */
write_or_die(bundle_fd, "\n", 1);
return ref_count;
@@ -566,7 +573,6 @@ int create_bundle(struct repository *r, const char *path,
*/
revs.blob_objects = revs.tree_objects = 0;
traverse_commit_list(&revs, write_bundle_prerequisites, NULL, &bpi);
- object_array_remove_duplicates(&revs_copy.pending);
/* write bundle refs */
ref_count = write_bundle_refs(bundle_fd, &revs_copy);
diff --git a/object.c b/object.c
index 100bf9b8d1..a2c5986178 100644
--- a/object.c
+++ b/object.c
@@ -491,39 +491,6 @@ void object_array_clear(struct object_array *array)
array->nr = array->alloc = 0;
}
-/*
- * Return true if array already contains an entry.
- */
-static int contains_object(struct object_array *array,
- const struct object *item, const char *name)
-{
- unsigned nr = array->nr, i;
- struct object_array_entry *object = array->objects;
-
- for (i = 0; i < nr; i++, object++)
- if (item == object->item && !strcmp(object->name, name))
- return 1;
- return 0;
-}
-
-void object_array_remove_duplicates(struct object_array *array)
-{
- unsigned nr = array->nr, src;
- struct object_array_entry *objects = array->objects;
-
- array->nr = 0;
- for (src = 0; src < nr; src++) {
- if (!contains_object(array, objects[src].item,
- objects[src].name)) {
- if (src != array->nr)
- objects[array->nr] = objects[src];
- array->nr++;
- } else {
- object_array_release_entry(&objects[src]);
- }
- }
-}
-
void clear_object_flags(unsigned flags)
{
int i;
diff --git a/object.h b/object.h
index 17f32f1103..0e12c75922 100644
--- a/object.h
+++ b/object.h
@@ -324,12 +324,6 @@ typedef int (*object_array_each_func_t)(struct object_array_entry *, void *);
void object_array_filter(struct object_array *array,
object_array_each_func_t want, void *cb_data);
-/*
- * Remove from array all but the first entry with a given name.
- * Warning: this function uses an O(N^2) algorithm.
- */
-void object_array_remove_duplicates(struct object_array *array);
-
/*
* Remove any objects from the array, freeing all used memory; afterwards
* the array is ready to store more objects with add_object_array().
diff --git a/t/t6020-bundle-misc.sh b/t/t6020-bundle-misc.sh
index dd09df1287..500c81b8a1 100755
--- a/t/t6020-bundle-misc.sh
+++ b/t/t6020-bundle-misc.sh
@@ -684,7 +684,6 @@ test_expect_success 'create bundle with duplicate refnames' '
test_cmp expect actual
'
-# This exhibits a bug, since the same refname is now added to the bundle twice.
test_expect_success 'create bundle with duplicate refnames and --all' '
git bundle create out.bdl --all "main" "main" &&
@@ -701,7 +700,6 @@ test_expect_success 'create bundle with duplicate refnames and --all' '
<TAG-2> refs/tags/v2
<TAG-3> refs/tags/v3
<COMMIT-P> HEAD
- <COMMIT-P> refs/heads/main
EOF
test_cmp expect actual
'
@@ -717,7 +715,6 @@ test_expect_success 'create bundle with duplicate exlusion refnames' '
test_cmp expect actual
'
-# This exhibits a bug, since the same refname is now added to the bundle twice.
test_expect_success 'create bundle with duplicate refname short-form' '
git bundle create out.bdl "main" "main" "refs/heads/main" "refs/heads/main" &&
@@ -725,7 +722,6 @@ test_expect_success 'create bundle with duplicate refname short-form' '
make_user_friendly_and_stable_output >actual &&
cat >expect <<-\EOF &&
<COMMIT-P> refs/heads/main
- <COMMIT-P> refs/heads/main
EOF
test_cmp expect actual
'
--
2.48.1
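The approach described in the commit message above can be sketched outside of git roughly as follows. This is a minimal illustration under stated assumptions, not the code from the patch: `struct pending_ref`, `struct str_set` and `dedup_refs()` are hypothetical names, and the small hash set stands in for git's 'strset'. Unlike the patch, which checks with `strset_contains()` first and calls `strset_add()` later, the sketch folds both into a single add that reports whether the name was new.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one entry of revs->pending. */
struct pending_ref {
	const char *refname; /* resolved name, e.g. "refs/heads/main" */
	int uninteresting;   /* non-zero for exclusions such as "main^!" */
};

/* Minimal open-addressing string set; this is NOT git's strset. */
struct str_set {
	const char **slots;
	size_t cap; /* always a power of two */
};

static uint64_t hash_str(const char *s)
{
	uint64_t h = 14695981039346656037ULL; /* FNV-1a offset basis */

	for (; *s; s++) {
		h ^= (unsigned char)*s;
		h *= 1099511628211ULL; /* FNV-1a prime */
	}
	return h;
}

/* Size the table for at most `expected` strings; returns 0 on success. */
static int str_set_init(struct str_set *set, size_t expected)
{
	set->cap = 16;
	while (set->cap < expected * 2)
		set->cap <<= 1;
	set->slots = calloc(set->cap, sizeof(*set->slots));
	return set->slots ? 0 : -1;
}

/* Returns 1 if `s` was newly added, 0 if it was already present. */
static int str_set_add(struct str_set *set, const char *s)
{
	size_t i = (size_t)hash_str(s) & (set->cap - 1);

	while (set->slots[i]) {
		if (!strcmp(set->slots[i], s))
			return 0;
		i = (i + 1) & (set->cap - 1);
	}
	set->slots[i] = s;
	return 1;
}

/*
 * Copy each interesting refname to `out` once, in first-seen order.
 * Returns the number of names written (0 if allocation fails).
 */
static size_t dedup_refs(const struct pending_ref *refs, size_t nr,
			 const char **out)
{
	struct str_set seen;
	size_t i, kept = 0;

	if (str_set_init(&seen, nr))
		return 0;
	for (i = 0; i < nr; i++) {
		if (refs[i].uninteresting)
			continue; /* exclusions never reach the header */
		if (!str_set_add(&seen, refs[i].refname))
			continue; /* refname already written */
		out[kept++] = refs[i].refname;
	}
	free(seen.slots);
	return kept;
}
```

Each entry costs one expected constant-time set operation, so the whole pass is expected O(N) in the number of pending entries, compared with the nested scan of the removed function.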
* Re: [PATCH v2 2/2] bundle: fix non-linear performance scaling with refs
2025-04-08 9:00 ` [PATCH v2 2/2] bundle: fix non-linear performance scaling with refs Karthik Nayak
@ 2025-04-10 8:57 ` Toon Claes
2025-04-10 9:04 ` Karthik Nayak
0 siblings, 1 reply; 10+ messages in thread
From: Toon Claes @ 2025-04-10 8:57 UTC (permalink / raw)
To: Karthik Nayak, git; +Cc: jltobler, ps, Karthik Nayak
Karthik Nayak <karthik.188@gmail.com> writes:
> The 'git bundle create' command has non-linear performance with the
> number of refs in the repository. Benchmarking the command shows that
> a large portion of the time (~75%) is spent in the
> `object_array_remove_duplicates()` function.
>
> The `object_array_remove_duplicates()` function was added in
> b2a6d1c686 (bundle: allow the same ref to be given more than once,
> 2009-01-17) to skip duplicate refs provided by the user from being
> written to the bundle. Since this is an O(N^2) algorithm, in repos with a
> large number of references, this can take up a large amount of time.
>
> Let's instead use a 'strset' to skip duplicates inside
> `write_bundle_refs()`. This improves the performance by around 6 times
> when tested against a repository with 100000 refs:
>
> Benchmark 1: bundle (refcount = 100000, revision = master)
> Time (mean ± σ): 14.653 s ± 0.203 s [User: 13.940 s, System: 0.762 s]
> Range (min … max): 14.237 s … 14.920 s 10 runs
>
> Benchmark 2: bundle (refcount = 100000, revision = HEAD)
> Time (mean ± σ): 2.394 s ± 0.023 s [User: 1.684 s, System: 0.798 s]
> Range (min … max): 2.364 s … 2.425 s 10 runs
>
> Summary
> bundle (refcount = 100000, revision = HEAD) ran
> 6.12 ± 0.10 times faster than bundle (refcount = 100000, revision = master)
I've done some benchmarking with some "real life" repositories, which
only have a couple of thousand refs, and there the difference is
(expectedly) barely noticeable. It is good to know there isn't any
regression either.
This version looks good to me, I approve.
--
Toon
* Re: [PATCH v2 2/2] bundle: fix non-linear performance scaling with refs
2025-04-10 8:57 ` Toon Claes
@ 2025-04-10 9:04 ` Karthik Nayak
0 siblings, 0 replies; 10+ messages in thread
From: Karthik Nayak @ 2025-04-10 9:04 UTC (permalink / raw)
To: Toon Claes, git; +Cc: jltobler, ps
Toon Claes <toon@iotcl.com> writes:
> Karthik Nayak <karthik.188@gmail.com> writes:
>
>> The 'git bundle create' command has non-linear performance with the
>> number of refs in the repository. Benchmarking the command shows that
>> a large portion of the time (~75%) is spent in the
>> `object_array_remove_duplicates()` function.
>>
>> The `object_array_remove_duplicates()` function was added in
>> b2a6d1c686 (bundle: allow the same ref to be given more than once,
>> 2009-01-17) to skip duplicate refs provided by the user from being
>> written to the bundle. Since this is an O(N^2) algorithm, in repos with a
>> large number of references, this can take up a large amount of time.
>>
>> Let's instead use a 'strset' to skip duplicates inside
>> `write_bundle_refs()`. This improves the performance by around 6 times
>> when tested against a repository with 100000 refs:
>>
>> Benchmark 1: bundle (refcount = 100000, revision = master)
>> Time (mean ± σ): 14.653 s ± 0.203 s [User: 13.940 s, System: 0.762 s]
>> Range (min … max): 14.237 s … 14.920 s 10 runs
>>
>> Benchmark 2: bundle (refcount = 100000, revision = HEAD)
>> Time (mean ± σ): 2.394 s ± 0.023 s [User: 1.684 s, System: 0.798 s]
>> Range (min … max): 2.364 s … 2.425 s 10 runs
>>
>> Summary
>> bundle (refcount = 100000, revision = HEAD) ran
>> 6.12 ± 0.10 times faster than bundle (refcount = 100000, revision = master)
>
> I've done some benchmarking with some "real life" repositories, which
> only have a couple of thousand refs, and there the difference is
> (expectedly) barely noticeable. It is good to know there isn't any
> regression either.
>
> This version looks good to me, I approve.
>
> --
> Toon
Thanks Toon. That is good news. That is also what the earlier benchmarks
showed. For smaller repositories the difference becomes inconsequential,
but for large repos, this can take a significant amount of time.
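To put rough numbers on this, assume there are no duplicates, so the removed `contains_object()` scan compares each entry against every entry kept so far. Taking N = 2000 as a stand-in for "a couple of thousand refs" (an assumed figure, not one reported in this thread), the number of entry comparisons C(N) works out to:

```latex
\begin{align*}
C(N) &= \sum_{k=0}^{N-1} k = \frac{N(N-1)}{2}, \\
C(2 \times 10^{3}) &= 1\,999\,000 \approx 2 \times 10^{6}, \\
C(10^{5}) &= 4\,999\,950\,000 \approx 5 \times 10^{9}.
\end{align*}
```

A few million comparisons are negligible next to the rest of bundle creation, while a few billion are not, which matches the observation that only very large repositories see a difference.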
end of thread
Thread overview: 10+ messages
2025-04-01 17:00 [PATCH 0/2] bundle: fix non-linear performance scaling with refs Karthik Nayak
2025-04-01 17:00 ` [PATCH 1/2] t6020: test for duplicate refnames in bundle creation Karthik Nayak
2025-04-01 17:00 ` [PATCH 2/2] bundle: fix non-linear performance scaling with refs Karthik Nayak
2025-04-03 19:07 ` Toon Claes
2025-04-06 20:48 ` Karthik Nayak
2025-04-08 9:00 ` [PATCH v2 0/2] " Karthik Nayak
2025-04-08 9:00 ` [PATCH v2 1/2] t6020: test for duplicate refnames in bundle creation Karthik Nayak
2025-04-08 9:00 ` [PATCH v2 2/2] bundle: fix non-linear performance scaling with refs Karthik Nayak
2025-04-10 8:57 ` Toon Claes
2025-04-10 9:04 ` Karthik Nayak