git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] refs/files: use heuristic to decide whether to repack with `--auto`
@ 2024-09-02 13:48 Patrick Steinhardt
  2024-09-02 13:48 ` [PATCH 1/2] t0601: merge tests for auto-packing of refs Patrick Steinhardt
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Patrick Steinhardt @ 2024-09-02 13:48 UTC (permalink / raw)
  To: git

Hi,

I recently noticed that `git maintenance run --auto` always ends up
repacking loose references. This is because the "files" backend does not
yet have any heuristics wired up to decide whether or not packing refs
is in order. The consequence is that we always decided to pack them,
which is of course quite a waste of time.

This small patch series fixes this by introducing a new heuristic for
the "files" backend. The heuristic is rather simple: the bigger the
"packed-refs" file, the more loose refs we require before we decide to
repack. We have been using this heurisitc successfully for a long time
in Gitaly by now.

Thanks!

Patrick

Patrick Steinhardt (2):
  t0601: merge tests for auto-packing of refs
  refs/files: use heuristic to decide whether to repack with `--auto`

 refs/files-backend.c          |  75 +++++++++++++++++++++++++
 refs/packed-backend.c         |  18 ++++++
 refs/packed-backend.h         |   7 +++
 t/t0601-reffiles-pack-refs.sh | 101 ++++++++++++++++++++++++++++------
 4 files changed, 185 insertions(+), 16 deletions(-)

-- 
2.46.0.421.g159f2d50e7.dirty


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/2] t0601: merge tests for auto-packing of refs
  2024-09-02 13:48 [PATCH 0/2] refs/files: use heuristic to decide whether to repack with `--auto` Patrick Steinhardt
@ 2024-09-02 13:48 ` Patrick Steinhardt
  2024-09-02 13:48 ` [PATCH 2/2] refs/files: use heuristic to decide whether to repack with `--auto` Patrick Steinhardt
  2024-09-04  8:52 ` [PATCH v2 0/3] refs/files: use heuristics " Patrick Steinhardt
  2 siblings, 0 replies; 16+ messages in thread
From: Patrick Steinhardt @ 2024-09-02 13:48 UTC (permalink / raw)
  To: git

We have two tests in t0601 which exercise the same underlying logic,
once via `git pack-refs --auto` and once via `git maintenance run
--auto`. Merge these two tests into one such that it becomes easier to
extend test coverage for both commands at the same time.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 t/t0601-reffiles-pack-refs.sh | 36 +++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/t/t0601-reffiles-pack-refs.sh b/t/t0601-reffiles-pack-refs.sh
index 60a544b8ee8..ed9652bb829 100755
--- a/t/t0601-reffiles-pack-refs.sh
+++ b/t/t0601-reffiles-pack-refs.sh
@@ -161,13 +161,6 @@ test_expect_success 'test --exclude takes precedence over --include' '
 	git pack-refs --include "refs/heads/pack*" --exclude "refs/heads/pack*" &&
 	test -f .git/refs/heads/dont_pack5'
 
-test_expect_success '--auto packs and prunes refs as usual' '
-	git branch auto &&
-	test_path_is_file .git/refs/heads/auto &&
-	git pack-refs --auto --all &&
-	test_path_is_missing .git/refs/heads/auto
-'
-
 test_expect_success 'see if up-to-date packed refs are preserved' '
 	git branch q &&
 	git pack-refs --all --prune &&
@@ -367,14 +360,25 @@ test_expect_success 'pack-refs does not drop broken refs during deletion' '
 	test_cmp expect actual
 '
 
-test_expect_success 'maintenance --auto unconditionally packs loose refs' '
-	git update-ref refs/heads/something HEAD &&
-	test_path_is_file .git/refs/heads/something &&
-	git rev-parse refs/heads/something >expect &&
-	git maintenance run --task=pack-refs --auto &&
-	test_path_is_missing .git/refs/heads/something &&
-	git rev-parse refs/heads/something >actual &&
-	test_cmp expect actual
-'
+for command in "git pack-refs --all --auto" "git maintenance run --task=pack-refs --auto"
+do
+	test_expect_success "$command unconditionally packs loose refs" '
+		test_when_finished "rm -rf repo" &&
+		git init repo &&
+		(
+			cd repo &&
+			git config set maintenance.auto false &&
+			test_commit initial &&
+
+			git update-ref refs/heads/something HEAD &&
+			test_path_is_file .git/refs/heads/something &&
+			git rev-parse refs/heads/something >expect &&
+			$command &&
+			test_path_is_missing .git/refs/heads/something &&
+			git rev-parse refs/heads/something >actual &&
+			test_cmp expect actual
+		)
+	'
+done
 
 test_done
-- 
2.46.0.421.g159f2d50e7.dirty


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/2] refs/files: use heuristic to decide whether to repack with `--auto`
  2024-09-02 13:48 [PATCH 0/2] refs/files: use heuristic to decide whether to repack with `--auto` Patrick Steinhardt
  2024-09-02 13:48 ` [PATCH 1/2] t0601: merge tests for auto-packing of refs Patrick Steinhardt
@ 2024-09-02 13:48 ` Patrick Steinhardt
  2024-09-03  9:00   ` karthik nayak
  2024-09-04  8:52 ` [PATCH v2 0/3] refs/files: use heuristics " Patrick Steinhardt
  2 siblings, 1 reply; 16+ messages in thread
From: Patrick Steinhardt @ 2024-09-02 13:48 UTC (permalink / raw)
  To: git

The `--auto` flag for git-pack-refs(1) allows the ref backend to decide
whether or not a repack is in order. This switch has been introduced
mostly with the "reftable" backend in mind, which already knows to
auto-compact its tables during normal operations. When the flag is set,
then it will use the same auto-compaction mechanism and thus end up
doing nothing in most cases.

The "files" backend does not have any such heuristic yet and instead
packs any loose references unconditionally. So we rewrite the complete
"packed-refs" file even if there's only a single loose reference to be
packed.

Even worse, starting with 9f6714ab3e (builtin/gc: pack refs when using
`git maintenance run --auto`, 2024-03-25), `git pack-refs --auto` is
unconditionally executed via our auto maintenance, so we end up repacking
references every single time auto maintenance kicks in. And while that
commit already mentioned that the "files" backend unconditionally packs
refs now, the author obviously didn't quite think about the consequences
thereof. So while the idea was sound, we really should have added a
heuristic to the "files" backend before implementing it.

Introduce a heuristic that decides whether or not it is worth to pack
loose references. The important factors to decide here are the number of
loose references in comparison to the overall size of the "packed-refs"
file. The bigger the "packed-refs" file, the longer it takes to rewrite
it and thus we scale up the limit of allowed loose references before we
repack.

As is the nature of heuristics, this mechansim isn't obviously
"correct", but should rather be seen as a tradeoff between how much
resources we spend packing refs and how inefficient the ref store
becomes. For all I can say, we have successfully been using the exact
same heuristic in Gitaly for several years by now.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 refs/files-backend.c          | 75 +++++++++++++++++++++++++++++++
 refs/packed-backend.c         | 18 ++++++++
 refs/packed-backend.h         |  7 +++
 t/t0601-reffiles-pack-refs.sh | 85 ++++++++++++++++++++++++++++++-----
 4 files changed, 175 insertions(+), 10 deletions(-)

diff --git a/refs/files-backend.c b/refs/files-backend.c
index 1cff65f6ae5..261e2b8570f 100644
--- a/refs/files-backend.c
+++ b/refs/files-backend.c
@@ -1311,6 +1311,78 @@ static int should_pack_ref(struct files_ref_store *refs,
 	return 0;
 }
 
+static size_t fastlog2(size_t sz)
+{
+	size_t l = 0;
+	if (!sz)
+		return 0;
+	for (; sz; sz /= 2)
+		l++;
+	return l - 1;
+}
+
+static int should_pack_refs(struct files_ref_store *refs,
+			    struct pack_refs_opts *opts)
+{
+	struct ref_iterator *iter;
+	size_t packed_size;
+	size_t refcount = 0;
+	size_t limit;
+	int ret;
+
+	if (!(opts->flags & PACK_REFS_AUTO))
+		return 1;
+
+	ret = packed_refs_size(refs->packed_ref_store, &packed_size);
+	if (ret < 0)
+		die("cannot determine packed-refs size");
+
+	/*
+	 * Packing loose references into the packed-refs file scales with the
+	 * number of references we're about to write. We thus decide whether we
+	 * repack refs by weighing the current size of the packed-refs file
+	 * against the number of loose references. This is done such that we do
+	 * not repack too often on repositories with a huge number of
+	 * references, where we can expect a lot of churn in the number of
+	 * references.
+	 *
+	 * As a heuristic, we repack if the number of loose references in the
+	 * repository exceeds `log2(nr_packed_refs) * 5`, where we estimate
+	 * `nr_packed_refs = packed_size / 100`, which scales as following:
+	 *
+	 * - 1kB ~ 10 packed refs: 16 refs
+	 * - 10kB ~ 100 packed refs: 33 refs
+	 * - 100kB ~ 1k packed refs: 49 refs
+	 * - 1MB ~ 10k packed refs: 66 refs
+	 * - 10MB ~ 100k packed refs: 82 refs
+	 * - 100MB ~ 1m packed refs: 99 refs
+	 *
+	 * We thus allow roughly 16 additional loose refs per factor of ten of
+	 * packed refs. This heuristic may be tweaked in the future, but should
+	 * serve as a sufficiently good first iteration.
+	 */
+	limit = fastlog2(packed_size / 100) * 5;
+	if (limit < 16)
+		limit = 16;
+
+	iter = cache_ref_iterator_begin(get_loose_ref_cache(refs, 0), NULL,
+					refs->base.repo, 0);
+	while ((ret = ref_iterator_advance(iter)) == ITER_OK) {
+		if (should_pack_ref(refs, iter->refname, iter->oid,
+				    iter->flags, opts))
+			refcount++;
+		if (refcount >= limit) {
+			ref_iterator_abort(iter);
+			return 1;
+		}
+	}
+
+	if (ret != ITER_DONE)
+		die("error while iterating over references");
+
+	return 0;
+}
+
 static int files_pack_refs(struct ref_store *ref_store,
 			   struct pack_refs_opts *opts)
 {
@@ -1323,6 +1395,9 @@ static int files_pack_refs(struct ref_store *ref_store,
 	struct strbuf err = STRBUF_INIT;
 	struct ref_transaction *transaction;
 
+	if (!should_pack_refs(refs, opts))
+		return 0;
+
 	transaction = ref_store_transaction_begin(refs->packed_ref_store, &err);
 	if (!transaction)
 		return -1;
diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index 7a0a695ca2f..07c57fd541a 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -1250,6 +1250,24 @@ int packed_refs_is_locked(struct ref_store *ref_store)
 	return is_lock_file_locked(&refs->lock);
 }
 
+int packed_refs_size(struct ref_store *ref_store,
+		     size_t *out)
+{
+	struct packed_ref_store *refs = packed_downcast(ref_store, REF_STORE_READ,
+							"packed_refs_size");
+	struct stat st;
+
+	if (stat(refs->path, &st) < 0) {
+		if (errno != ENOENT)
+			return -1;
+		*out = 0;
+		return 0;
+	}
+
+	*out = st.st_size;
+	return 0;
+}
+
 /*
  * The packed-refs header line that we write out. Perhaps other traits
  * will be added later.
diff --git a/refs/packed-backend.h b/refs/packed-backend.h
index 09437ad13bd..9481d5e7c2e 100644
--- a/refs/packed-backend.h
+++ b/refs/packed-backend.h
@@ -27,6 +27,13 @@ int packed_refs_lock(struct ref_store *ref_store, int flags, struct strbuf *err)
 void packed_refs_unlock(struct ref_store *ref_store);
 int packed_refs_is_locked(struct ref_store *ref_store);
 
+/*
+ * Obtain the size of the `packed-refs` file. Reports `0` as size in case there
+ * is no packed-refs file. Returns 0 on success, negative otherwise.
+ */
+int packed_refs_size(struct ref_store *ref_store,
+		     size_t *out);
+
 /*
  * Return true if `transaction` really needs to be carried out against
  * the specified packed_ref_store, or false if it can be skipped
diff --git a/t/t0601-reffiles-pack-refs.sh b/t/t0601-reffiles-pack-refs.sh
index ed9652bb829..d8cbd3f202b 100755
--- a/t/t0601-reffiles-pack-refs.sh
+++ b/t/t0601-reffiles-pack-refs.sh
@@ -362,21 +362,86 @@ test_expect_success 'pack-refs does not drop broken refs during deletion' '
 
 for command in "git pack-refs --all --auto" "git maintenance run --task=pack-refs --auto"
 do
-	test_expect_success "$command unconditionally packs loose refs" '
+	test_expect_success "$command does not repack below 16 refs without packed-refs" '
 		test_when_finished "rm -rf repo" &&
 		git init repo &&
 		(
 			cd repo &&
 			git config set maintenance.auto false &&
-			test_commit initial &&
-
-			git update-ref refs/heads/something HEAD &&
-			test_path_is_file .git/refs/heads/something &&
-			git rev-parse refs/heads/something >expect &&
-			$command &&
-			test_path_is_missing .git/refs/heads/something &&
-			git rev-parse refs/heads/something >actual &&
-			test_cmp expect actual
+			git commit --allow-empty --message "initial" &&
+
+			# Create 14 additional references, which brings us to
+			# 15 together with the default branch.
+			printf "create refs/heads/loose-%d HEAD\n" $(test_seq 14) >stdin &&
+			git update-ref --stdin <stdin &&
+			test_path_is_missing .git/packed-refs &&
+			git pack-refs --auto --all &&
+			test_path_is_missing .git/packed-refs &&
+
+			# Create the 16th reference, which should cause us to repack.
+			git update-ref refs/heads/loose-15 HEAD &&
+			git pack-refs --auto --all &&
+			test_path_is_file .git/packed-refs
+		)
+	'
+
+	test_expect_success "$command does not repack below 16 refs with small packed-refs" '
+		test_when_finished "rm -rf repo" &&
+		git init repo &&
+		(
+			cd repo &&
+			git config set maintenance.auto false &&
+			git commit --allow-empty --message "initial" &&
+
+			git pack-refs --all &&
+			test_line_count = 2 .git/packed-refs &&
+
+			# Create 15 loose references.
+			printf "create refs/heads/loose-%d HEAD\n" $(test_seq 15) >stdin &&
+			git update-ref --stdin <stdin &&
+			git pack-refs --auto --all &&
+			test_line_count = 2 .git/packed-refs &&
+
+			# Create the 16th loose reference, which should cause us to repack.
+			git update-ref refs/heads/loose-17 HEAD &&
+			git pack-refs --auto --all &&
+			test_line_count = 18 .git/packed-refs
+		)
+	'
+
+	test_expect_success "$command scales with size of packed-refs" '
+		test_when_finished "rm -rf repo" &&
+		git init repo &&
+		(
+			cd repo &&
+			git config set maintenance.auto false &&
+			git commit --allow-empty --message "initial" &&
+
+			# Create 99 packed refs. This should cause the heuristic
+			# to require more than the minimum amount of loose refs.
+			test_seq 99 |
+			while read i
+			do
+				printf "create refs/heads/packed-%d HEAD\n" $i || return 1
+			done >stdin &&
+			git update-ref --stdin <stdin &&
+			git pack-refs --all &&
+			test_line_count = 101 .git/packed-refs &&
+
+			# Create 24 loose refs, which should not yet cause us to repack.
+			printf "create refs/heads/loose-%d HEAD\n" $(test_seq 24) >stdin &&
+			git update-ref --stdin <stdin &&
+			git pack-refs --auto --all &&
+			test_line_count = 101 .git/packed-refs &&
+
+			# Create another handful of refs to cross the border.
+			# Note that we explicitly do not check for strict
+			# boundaries here, as this also depends on the size of
+			# the object hash.
+			printf "create refs/heads/addn-%d HEAD\n" $(test_seq 10) >stdin &&
+			git update-ref --stdin <stdin &&
+			git pack-refs --auto --all &&
+			test_line_count = 135 .git/packed-refs
 		)
 	'
 done
-- 
2.46.0.421.g159f2d50e7.dirty


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] refs/files: use heuristic to decide whether to repack with `--auto`
  2024-09-02 13:48 ` [PATCH 2/2] refs/files: use heuristic to decide whether to repack with `--auto` Patrick Steinhardt
@ 2024-09-03  9:00   ` karthik nayak
  2024-09-03  9:23     ` Patrick Steinhardt
  2024-09-04  8:49     ` Patrick Steinhardt
  0 siblings, 2 replies; 16+ messages in thread
From: karthik nayak @ 2024-09-03  9:00 UTC (permalink / raw)
  To: Patrick Steinhardt, git

[-- Attachment #1: Type: text/plain, Size: 2996 bytes --]

Patrick Steinhardt <ps@pks.im> writes:

> The `--auto` flag for git-pack-refs(1) allows the ref backend to decide
> whether or not a repack is in order. This switch has been introduced
> mostly with the "reftable" backend in mind, which already knows to
> auto-compact its tables during normal operations. When the flag is set,
> then it will use the same auto-compaction mechanism and thus end up
> doing nothing in most cases.
>
> The "files" backend does not have any such heuristic yet and instead

Nit: s/instead/will instead/

> packs any loose references unconditionally. So we rewrite the complete
> "packed-refs" file even if there's only a single loose reference to be
> packed.
>
> Even worse, starting with 9f6714ab3e (builtin/gc: pack refs when using
> `git maintenance run --auto`, 2024-03-25), `git pack-refs --auto` is
> unconditionally executed via our auto maintenance, so we end up repacking
> references every single time auto maintenance kicks in. And while that
> commit already mentioned that the "files" backend unconditionally packs
> refs now, the author obviously didn't quite think about the consequences
> thereof. So while the idea was sound, we really should have added a
> heuristic to the "files" backend before implementing it.
>
> Introduce a heuristic that decides whether or not it is worth to pack
> loose references. The important factors to decide here are the number of
> loose references in comparison to the overall size of the "packed-refs"
> file. The bigger the "packed-refs" file, the longer it takes to rewrite
> it and thus we scale up the limit of allowed loose references before we
> repack.
>
> As is the nature of heuristics, this mechansim isn't obviously
> "correct", but should rather be seen as a tradeoff between how much
> resources we spend packing refs and how inefficient the ref store
> becomes. For all I can say, we have successfully been using the exact
> same heuristic in Gitaly for several years by now.
>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  refs/files-backend.c          | 75 +++++++++++++++++++++++++++++++
>  refs/packed-backend.c         | 18 ++++++++
>  refs/packed-backend.h         |  7 +++
>  t/t0601-reffiles-pack-refs.sh | 85 ++++++++++++++++++++++++++++++-----
>  4 files changed, 175 insertions(+), 10 deletions(-)
>
> diff --git a/refs/files-backend.c b/refs/files-backend.c
> index 1cff65f6ae5..261e2b8570f 100644
> --- a/refs/files-backend.c
> +++ b/refs/files-backend.c
> @@ -1311,6 +1311,78 @@ static int should_pack_ref(struct files_ref_store *refs,
>  	return 0;
>  }
>
> +static size_t fastlog2(size_t sz)

Nit: We already `reftable/stack_test.c:fastlog2` and `bisect.c:log2i`. I
wonder if it would be nice to consolidate all of them. But I guess it's
okay, since the reftable code anyways is isolated from the rest of the
Git code.

> +{
> +	size_t l = 0;
> +	if (!sz)
> +		return 0;
> +	for (; sz; sz /= 2)
> +		l++;
> +	return l - 1;
> +}

[snip]

The rest of the patch looks great!

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 690 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] refs/files: use heuristic to decide whether to repack with `--auto`
  2024-09-03  9:00   ` karthik nayak
@ 2024-09-03  9:23     ` Patrick Steinhardt
  2024-09-03 18:23       ` Junio C Hamano
  2024-09-04  8:49     ` Patrick Steinhardt
  1 sibling, 1 reply; 16+ messages in thread
From: Patrick Steinhardt @ 2024-09-03  9:23 UTC (permalink / raw)
  To: karthik nayak; +Cc: git

On Tue, Sep 03, 2024 at 02:00:16AM -0700, karthik nayak wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> > diff --git a/refs/files-backend.c b/refs/files-backend.c
> > index 1cff65f6ae5..261e2b8570f 100644
> > --- a/refs/files-backend.c
> > +++ b/refs/files-backend.c
> > @@ -1311,6 +1311,78 @@ static int should_pack_ref(struct files_ref_store *refs,
> >  	return 0;
> >  }
> >
> > +static size_t fastlog2(size_t sz)
> 
> Nit: We already `reftable/stack_test.c:fastlog2` and `bisect.c:log2i`. I
> wonder if it would be nice to consolidate all of them. But I guess it's
> okay, since the reftable code anyways is isolated from the rest of the
> Git code.

Yeah, the version we have here is in fact an (almost exact) copy of the
function in our reftable codebase. And as you note, we do not want to
deduplicate these such that the reftable library is somewhat isolated
from the rest of the codebase.

I also noticed `log2i()`, but honestly the only reason why I didn't
reuse it is that I had no clue where to put it. There isn't any header
that would be a good fit for it, and creating a new "math.h" header for
a single function felt overblown to me. So I decided to just not bother.
I'm happy to adjust though if somebody has a suggestion for where to put
it.

Patrick

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] refs/files: use heuristic to decide whether to repack with `--auto`
  2024-09-03  9:23     ` Patrick Steinhardt
@ 2024-09-03 18:23       ` Junio C Hamano
  2024-09-04  7:42         ` Patrick Steinhardt
  0 siblings, 1 reply; 16+ messages in thread
From: Junio C Hamano @ 2024-09-03 18:23 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: karthik nayak, git

Patrick Steinhardt <ps@pks.im> writes:

> I also noticed `log2i()`, but honestly the only reason why I didn't
> reuse it is that I had no clue where to put it. There isn't any header
> that would be a good fit for it, and creating a new "math.h" header for
> a single function felt overblown to me. So I decided to just not bother.
> I'm happy to adjust though if somebody has a suggestion for where to put
> it.

Given the existing contents of wrapper.h (near the end of the file),
I think wrapper.c would be a good place to do so.

Shouldn't this essentially be a call to ffs() with the argument
tweaked, by the way?


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] refs/files: use heuristic to decide whether to repack with `--auto`
  2024-09-03 18:23       ` Junio C Hamano
@ 2024-09-04  7:42         ` Patrick Steinhardt
  2024-09-04 16:15           ` Junio C Hamano
  0 siblings, 1 reply; 16+ messages in thread
From: Patrick Steinhardt @ 2024-09-04  7:42 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: karthik nayak, git

On Tue, Sep 03, 2024 at 11:23:12AM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > I also noticed `log2i()`, but honestly the only reason why I didn't
> > reuse it is that I had no clue where to put it. There isn't any header
> > that would be a good fit for it, and creating a new "math.h" header for
> > a single function felt overblown to me. So I decided to just not bother.
> > I'm happy to adjust though if somebody has a suggestion for where to put
> > it.
> 
> Given the existing contents of wrapper.h (near the end of the file),
> I think wrapper.c would be a good place to do so.

Indeed. Quite a grab bag of functions, thanks for the hint!

> Shouldn't this essentially be a call to ffs() with the argument
> tweaked, by the way?

We are looking for the last set bit, not for the first set bit. We can
massage things a bit, but wouldn't that require us to be aware of the
platforms endianess?

In any case, GCC is clever enough to notice what we're doing:

    fastlog2(unsigned long):
            xor     eax, eax
            test    rdi, rdi
            je      .L5
            bsr     rax, rdi
    .L5:
            ret

The `log2i()` function looks a bit less efficient:

    log2i(unsigned long):
            test    rdi, rdi
            je      .L3
            bsr     rdi, rdi
            lea     eax, [rdi+1]
            cdqe
            ret
    .L3:
            xor     eax, eax
            ret

Clang isn't yet clever enough with v18.1, but is with trunk:

    log2i(unsigned long):
            bsr     rax, rdi
            inc     rax
            test    rdi, rdi
            cmove   rax, rdi
            ret

    fastlog2(unsigned long):
            test    rdi, rdi
            je      .LBB1_1
            shr     rdi
            bsr     rcx, rdi
            mov     eax, 127
            cmovne  rax, rcx
            xor     eax, -64
            add     eax, 65
            ret
    .LBB1_1:
            mov     eax, -1
            ret

So with the following definition we're optimizing both with GCC and
Clang:

    size_t fastlog2(size_t sz)
    {
        size_t l = 0;
        if (!sz)
            return 0;
        for (; sz; sz >>= 1)
            l++;
        return l;
    }

I'd thus say we can just pick that function instead of caring about
platform endianess with `ffs()`.

Patrick

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] refs/files: use heuristic to decide whether to repack with `--auto`
  2024-09-03  9:00   ` karthik nayak
  2024-09-03  9:23     ` Patrick Steinhardt
@ 2024-09-04  8:49     ` Patrick Steinhardt
  2024-09-05  8:44       ` karthik nayak
  1 sibling, 1 reply; 16+ messages in thread
From: Patrick Steinhardt @ 2024-09-04  8:49 UTC (permalink / raw)
  To: karthik nayak; +Cc: git

On Tue, Sep 03, 2024 at 02:00:16AM -0700, karthik nayak wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > The `--auto` flag for git-pack-refs(1) allows the ref backend to decide
> > whether or not a repack is in order. This switch has been introduced
> > mostly with the "reftable" backend in mind, which already knows to
> > auto-compact its tables during normal operations. When the flag is set,
> > then it will use the same auto-compaction mechanism and thus end up
> > doing nothing in most cases.
> >
> > The "files" backend does not have any such heuristic yet and instead
> 
> Nit: s/instead/will instead/
> 
> > packs any loose references unconditionally. So we rewrite the complete
> > "packed-refs" file even if there's only a single loose reference to be
> > packed.

Revisiting this: isn't the current version of this sentence correct?
Replacing it with "will instead" certainly makes it incorrect without
also changing "packs" to "pack".

Patrick

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v2 0/3] refs/files: use heuristics to decide whether to repack with `--auto`
  2024-09-02 13:48 [PATCH 0/2] refs/files: use heuristic to decide whether to repack with `--auto` Patrick Steinhardt
  2024-09-02 13:48 ` [PATCH 1/2] t0601: merge tests for auto-packing of refs Patrick Steinhardt
  2024-09-02 13:48 ` [PATCH 2/2] refs/files: use heuristic to decide whether to repack with `--auto` Patrick Steinhardt
@ 2024-09-04  8:52 ` Patrick Steinhardt
  2024-09-04  8:53   ` [PATCH v2 1/3] wrapper: introduce `log2u()` Patrick Steinhardt
                     ` (2 more replies)
  2 siblings, 3 replies; 16+ messages in thread
From: Patrick Steinhardt @ 2024-09-04  8:52 UTC (permalink / raw)
  To: git; +Cc: karthik nayak, Junio C Hamano

Hi,

this is the second version of my patch series that introduces a new
heuristic for packing refs with the "files" backend. This heuristic is
designed to avoid needlessly rewriting the "packed-refs" file when there
are only a small set of loose refs. The number of loose refs required
scales with the size of the "packed-refs" file.

There is only one change compared to v1, namely a deduplication of the
log2 functions we have in our tree.

Thanks!

Patrick

Patrick Steinhardt (3):
  wrapper: introduce `log2u()`
  t0601: merge tests for auto-packing of refs
  refs/files: use heuristic to decide whether to repack with `--auto`

 bisect.c                      |  12 +---
 refs/files-backend.c          |  65 ++++++++++++++++++++++
 refs/packed-backend.c         |  18 ++++++
 refs/packed-backend.h         |   7 +++
 t/t0601-reffiles-pack-refs.sh | 101 ++++++++++++++++++++++++++++------
 wrapper.h                     |  18 ++++++
 6 files changed, 194 insertions(+), 27 deletions(-)

Range-diff against v1:
-:  ----------- > 1:  df8c5dffffe wrapper: introduce `log2u()`
1:  3a8063e8b2c = 2:  4a59cec205d t0601: merge tests for auto-packing of refs
2:  9a63abfe3b8 ! 3:  49f953142b1 refs/files: use heuristic to decide whether to repack with `--auto`
    @@ refs/files-backend.c: static int should_pack_ref(struct files_ref_store *refs,
      	return 0;
      }
      
    -+static size_t fastlog2(size_t sz)
    -+{
    -+	size_t l = 0;
    -+	if (!sz)
    -+		return 0;
    -+	for (; sz; sz /= 2)
    -+		l++;
    -+	return l - 1;
    -+}
    -+
     +static int should_pack_refs(struct files_ref_store *refs,
     +			    struct pack_refs_opts *opts)
     +{
    @@ refs/files-backend.c: static int should_pack_ref(struct files_ref_store *refs,
     +	 * packed refs. This heuristic may be tweaked in the future, but should
     +	 * serve as a sufficiently good first iteration.
     +	 */
    -+	limit = fastlog2(packed_size / 100) * 5;
    ++	limit = log2u(packed_size / 100) * 5;
     +	if (limit < 16)
     +		limit = 16;
     +

base-commit: 4590f2e9412378c61eac95966709c78766d326ba
-- 
2.46.0.519.g2e7b89e038.dirty


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v2 1/3] wrapper: introduce `log2u()`
  2024-09-04  8:52 ` [PATCH v2 0/3] refs/files: use heuristics " Patrick Steinhardt
@ 2024-09-04  8:53   ` Patrick Steinhardt
  2024-09-04  8:53   ` [PATCH v2 2/3] t0601: merge tests for auto-packing of refs Patrick Steinhardt
  2024-09-04  8:53   ` [PATCH v2 3/3] refs/files: use heuristic to decide whether to repack with `--auto` Patrick Steinhardt
  2 siblings, 0 replies; 16+ messages in thread
From: Patrick Steinhardt @ 2024-09-04  8:53 UTC (permalink / raw)
  To: git; +Cc: karthik nayak, Junio C Hamano

We have an implementation of a function that computes the log2 for an
integer. While we could instead use log2(3P), that involves floating
point numbers and is thus needlessly complex and inefficient.

We're about to add a second caller that wants to compute log2 for a
`size_t`. Let's thus move the function into "wrapper.h" such that it
becomes generally available.

While at it, tweak the implementation a bit:

  - The parameter is converted from `int` to `uintmax_t`. This
    conversion is safe to do in "bisect.c" because we already check that
    the argument is positive.

  - The return value is an `unsigned`. It cannot ever be negative, so it
    is pointless for it to be a signed integer.

  - Loop until `!n` instead of requiring that `n > 1` and then subtract
    1 from the result and add a special case for `!sz`. This helps
    compilers to generate more efficient code.

Compilers recognize the pattern of this function and optimize
accordingly. On GCC 14.2 x86_64:

    log2u(unsigned long):
            test    rdi, rdi
            je      .L3
            bsr     rax, rdi
            ret
    .L3:
            mov     eax, -1
            ret

Clang 18.1 does not yet recognize the pattern, but starts to do so on
Clang trunk x86_64. The code isn't quite as efficient as the one
generated by GCC, but still manages to optimize away the loop:

    log2u(unsigned long):
            test    rdi, rdi
            je      .LBB0_1
            shr     rdi
            bsr     rcx, rdi
            mov     eax, 127
            cmovne  rax, rcx
            xor     eax, -64
            add     eax, 65
            ret
    .LBB0_1:
            mov     eax, -1
            ret

The pattern is also recognized on other platforms like ARM64 GCC 14.2.0,
where we end up using `clz`:

    log2u(unsigned long):
            clz     x2, x0
            cmp     x0, 0
            mov     w1, 63
            sub     w0, w1, w2
            csinv   w0, w0, wzr, ne
            ret

Note that we have a similar function `fastlog2()` in the reftable code.
As that codebase is separate from the Git codebase we do not adapt it to
use the new function.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 bisect.c  | 12 +-----------
 wrapper.h | 18 ++++++++++++++++++
 2 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/bisect.c b/bisect.c
index 4406fc44cf9..4713226fad4 100644
--- a/bisect.c
+++ b/bisect.c
@@ -1130,16 +1130,6 @@ enum bisect_error bisect_next_all(struct repository *r, const char *prefix)
 	return res;
 }
 
-static inline int log2i(int n)
-{
-	int log2 = 0;
-
-	for (; n > 1; n >>= 1)
-		log2++;
-
-	return log2;
-}
-
 static inline int exp2i(int n)
 {
 	return 1 << n;
@@ -1162,7 +1152,7 @@ int estimate_bisect_steps(int all)
 	if (all < 3)
 		return 0;
 
-	n = log2i(all);
+	n = log2u(all);
 	e = exp2i(n);
 	x = all - e;
 
diff --git a/wrapper.h b/wrapper.h
index 1b2b047ea06..a6b3e1f09ec 100644
--- a/wrapper.h
+++ b/wrapper.h
@@ -140,4 +140,22 @@ int csprng_bytes(void *buf, size_t len);
  */
 uint32_t git_rand(void);
 
+/* Provide log2 of the given `size_t`. */
+static inline unsigned log2u(uintmax_t sz)
+{
+	unsigned l = 0;
+
+	/*
+	 * Technically this isn't required, but it helps the compiler optimize
+	 * this to a `bsr` instruction.
+	 */
+	if (!sz)
+		return 0;
+
+	for (; sz; sz >>= 1)
+		l++;
+
+	return l - 1;
+}
+
 #endif /* WRAPPER_H */
-- 
2.46.0.519.g2e7b89e038.dirty


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 2/3] t0601: merge tests for auto-packing of refs
  2024-09-04  8:52 ` [PATCH v2 0/3] refs/files: use heuristics " Patrick Steinhardt
  2024-09-04  8:53   ` [PATCH v2 1/3] wrapper: introduce `log2u()` Patrick Steinhardt
@ 2024-09-04  8:53   ` Patrick Steinhardt
  2024-09-04  8:53   ` [PATCH v2 3/3] refs/files: use heuristic to decide whether to repack with `--auto` Patrick Steinhardt
  2 siblings, 0 replies; 16+ messages in thread
From: Patrick Steinhardt @ 2024-09-04  8:53 UTC (permalink / raw)
  To: git; +Cc: karthik nayak, Junio C Hamano

We have two tests in t0601 which exercise the same underlying logic,
once via `git pack-refs --auto` and once via `git maintenance run
--auto`. Merge these two tests into one such that it becomes easier to
extend test coverage for both commands at the same time.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 t/t0601-reffiles-pack-refs.sh | 36 +++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/t/t0601-reffiles-pack-refs.sh b/t/t0601-reffiles-pack-refs.sh
index 60a544b8ee8..ed9652bb829 100755
--- a/t/t0601-reffiles-pack-refs.sh
+++ b/t/t0601-reffiles-pack-refs.sh
@@ -161,13 +161,6 @@ test_expect_success 'test --exclude takes precedence over --include' '
 	git pack-refs --include "refs/heads/pack*" --exclude "refs/heads/pack*" &&
 	test -f .git/refs/heads/dont_pack5'
 
-test_expect_success '--auto packs and prunes refs as usual' '
-	git branch auto &&
-	test_path_is_file .git/refs/heads/auto &&
-	git pack-refs --auto --all &&
-	test_path_is_missing .git/refs/heads/auto
-'
-
 test_expect_success 'see if up-to-date packed refs are preserved' '
 	git branch q &&
 	git pack-refs --all --prune &&
@@ -367,14 +360,25 @@ test_expect_success 'pack-refs does not drop broken refs during deletion' '
 	test_cmp expect actual
 '
 
-test_expect_success 'maintenance --auto unconditionally packs loose refs' '
-	git update-ref refs/heads/something HEAD &&
-	test_path_is_file .git/refs/heads/something &&
-	git rev-parse refs/heads/something >expect &&
-	git maintenance run --task=pack-refs --auto &&
-	test_path_is_missing .git/refs/heads/something &&
-	git rev-parse refs/heads/something >actual &&
-	test_cmp expect actual
-'
+for command in "git pack-refs --all --auto" "git maintenance run --task=pack-refs --auto"
+do
+	test_expect_success "$command unconditionally packs loose refs" '
+		test_when_finished "rm -rf repo" &&
+		git init repo &&
+		(
+			cd repo &&
+			git config set maintenance.auto false &&
+			test_commit initial &&
+
+			git update-ref refs/heads/something HEAD &&
+			test_path_is_file .git/refs/heads/something &&
+			git rev-parse refs/heads/something >expect &&
+			$command &&
+			test_path_is_missing .git/refs/heads/something &&
+			git rev-parse refs/heads/something >actual &&
+			test_cmp expect actual
+		)
+	'
+done
 
 test_done
-- 
2.46.0.519.g2e7b89e038.dirty


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 3/3] refs/files: use heuristic to decide whether to repack with `--auto`
  2024-09-04  8:52 ` [PATCH v2 0/3] refs/files: use heuristics " Patrick Steinhardt
  2024-09-04  8:53   ` [PATCH v2 1/3] wrapper: introduce `log2u()` Patrick Steinhardt
  2024-09-04  8:53   ` [PATCH v2 2/3] t0601: merge tests for auto-packing of refs Patrick Steinhardt
@ 2024-09-04  8:53   ` Patrick Steinhardt
  2024-09-04 15:24     ` Junio C Hamano
  2 siblings, 1 reply; 16+ messages in thread
From: Patrick Steinhardt @ 2024-09-04  8:53 UTC (permalink / raw)
  To: git; +Cc: karthik nayak, Junio C Hamano

The `--auto` flag for git-pack-refs(1) allows the ref backend to decide
whether or not a repack is in order. This switch has been introduced
mostly with the "reftable" backend in mind, which already knows to
auto-compact its tables during normal operations. When the flag is set,
then it will use the same auto-compaction mechanism and thus end up
doing nothing in most cases.

The "files" backend does not have any such heuristic yet and instead
packs any loose references unconditionally. So we rewrite the complete
"packed-refs" file even if there's only a single loose reference to be
packed.

Even worse, starting with 9f6714ab3e (builtin/gc: pack refs when using
`git maintenance run --auto`, 2024-03-25), `git pack-refs --auto` is
unconditionally executed via our auto maintenance, so we end up repacking
references every single time auto maintenance kicks in. And while that
commit already mentioned that the "files" backend unconditionally packs
refs now, the author obviously didn't quite think about the consequences
thereof. So while the idea was sound, we really should have added a
heuristic to the "files" backend before implementing it.

Introduce a heuristic that decides whether or not it is worth to pack
loose references. The important factors to decide here are the number of
loose references in comparison to the overall size of the "packed-refs"
file. The bigger the "packed-refs" file, the longer it takes to rewrite
it and thus we scale up the limit of allowed loose references before we
repack.

As is the nature of heuristics, this mechansim isn't obviously
"correct", but should rather be seen as a tradeoff between how much
resources we spend packing refs and how inefficient the ref store
becomes. For all I can say, we have successfully been using the exact
same heuristic in Gitaly for several years by now.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 refs/files-backend.c          | 65 +++++++++++++++++++++++++++
 refs/packed-backend.c         | 18 ++++++++
 refs/packed-backend.h         |  7 +++
 t/t0601-reffiles-pack-refs.sh | 85 ++++++++++++++++++++++++++++++-----
 4 files changed, 165 insertions(+), 10 deletions(-)

diff --git a/refs/files-backend.c b/refs/files-backend.c
index 1cff65f6ae5..51bcd16f821 100644
--- a/refs/files-backend.c
+++ b/refs/files-backend.c
@@ -1311,6 +1311,68 @@ static int should_pack_ref(struct files_ref_store *refs,
 	return 0;
 }
 
+static int should_pack_refs(struct files_ref_store *refs,
+			    struct pack_refs_opts *opts)
+{
+	struct ref_iterator *iter;
+	size_t packed_size;
+	size_t refcount = 0;
+	size_t limit;
+	int ret;
+
+	if (!(opts->flags & PACK_REFS_AUTO))
+		return 1;
+
+	ret = packed_refs_size(refs->packed_ref_store, &packed_size);
+	if (ret < 0)
+		die("cannot determine packed-refs size");
+
+	/*
+	 * Packing loose references into the packed-refs file scales with the
+	 * number of references we're about to write. We thus decide whether we
+	 * repack refs by weighing the current size of the packed-refs file
+	 * against the number of loose references. This is done such that we do
+	 * not repack too often on repositories with a huge number of
+	 * references, where we can expect a lot of churn in the number of
+	 * references.
+	 *
+	 * As a heuristic, we repack if the number of loose references in the
+	 * repository exceeds `log2(nr_packed_refs) * 5`, where we estimate
+	 * `nr_packed_refs = packed_size / 100`, which scales as following:
+	 *
+	 * - 1kB ~ 10 packed refs: 16 refs
+	 * - 10kB ~ 100 packed refs: 33 refs
+	 * - 100kB ~ 1k packed refs: 49 refs
+	 * - 1MB ~ 10k packed refs: 66 refs
+	 * - 10MB ~ 100k packed refs: 82 refs
+	 * - 100MB ~ 1m packed refs: 99 refs
+	 *
+	 * We thus allow roughly 16 additional loose refs per factor of ten of
+	 * packed refs. This heuristic may be tweaked in the future, but should
+	 * serve as a sufficiently good first iteration.
+	 */
+	limit = log2u(packed_size / 100) * 5;
+	if (limit < 16)
+		limit = 16;
+
+	iter = cache_ref_iterator_begin(get_loose_ref_cache(refs, 0), NULL,
+					refs->base.repo, 0);
+	while ((ret = ref_iterator_advance(iter)) == ITER_OK) {
+		if (should_pack_ref(refs, iter->refname, iter->oid,
+				    iter->flags, opts))
+			refcount++;
+		if (refcount >= limit) {
+			ref_iterator_abort(iter);
+			return 1;
+		}
+	}
+
+	if (ret != ITER_DONE)
+		die("error while iterating over references");
+
+	return 0;
+}
+
 static int files_pack_refs(struct ref_store *ref_store,
 			   struct pack_refs_opts *opts)
 {
@@ -1323,6 +1385,9 @@ static int files_pack_refs(struct ref_store *ref_store,
 	struct strbuf err = STRBUF_INIT;
 	struct ref_transaction *transaction;
 
+	if (!should_pack_refs(refs, opts))
+		return 0;
+
 	transaction = ref_store_transaction_begin(refs->packed_ref_store, &err);
 	if (!transaction)
 		return -1;
diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index 7a0a695ca2f..07c57fd541a 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -1250,6 +1250,24 @@ int packed_refs_is_locked(struct ref_store *ref_store)
 	return is_lock_file_locked(&refs->lock);
 }
 
+int packed_refs_size(struct ref_store *ref_store,
+		     size_t *out)
+{
+	struct packed_ref_store *refs = packed_downcast(ref_store, REF_STORE_READ,
+							"packed_refs_size");
+	struct stat st;
+
+	if (stat(refs->path, &st) < 0) {
+		if (errno != ENOENT)
+			return -1;
+		*out = 0;
+		return 0;
+	}
+
+	*out = st.st_size;
+	return 0;
+}
+
 /*
  * The packed-refs header line that we write out. Perhaps other traits
  * will be added later.
diff --git a/refs/packed-backend.h b/refs/packed-backend.h
index 09437ad13bd..9481d5e7c2e 100644
--- a/refs/packed-backend.h
+++ b/refs/packed-backend.h
@@ -27,6 +27,13 @@ int packed_refs_lock(struct ref_store *ref_store, int flags, struct strbuf *err)
 void packed_refs_unlock(struct ref_store *ref_store);
 int packed_refs_is_locked(struct ref_store *ref_store);
 
+/*
+ * Obtain the size of the `packed-refs` file. Reports `0` as size in case there
+ * is no packed-refs file. Returns 0 on success, negative otherwise.
+ */
+int packed_refs_size(struct ref_store *ref_store,
+		     size_t *out);
+
 /*
  * Return true if `transaction` really needs to be carried out against
  * the specified packed_ref_store, or false if it can be skipped
diff --git a/t/t0601-reffiles-pack-refs.sh b/t/t0601-reffiles-pack-refs.sh
index ed9652bb829..d8cbd3f202b 100755
--- a/t/t0601-reffiles-pack-refs.sh
+++ b/t/t0601-reffiles-pack-refs.sh
@@ -362,21 +362,86 @@ test_expect_success 'pack-refs does not drop broken refs during deletion' '
 
 for command in "git pack-refs --all --auto" "git maintenance run --task=pack-refs --auto"
 do
-	test_expect_success "$command unconditionally packs loose refs" '
+	test_expect_success "$command does not repack below 16 refs without packed-refs" '
 		test_when_finished "rm -rf repo" &&
 		git init repo &&
 		(
 			cd repo &&
 			git config set maintenance.auto false &&
-			test_commit initial &&
-
-			git update-ref refs/heads/something HEAD &&
-			test_path_is_file .git/refs/heads/something &&
-			git rev-parse refs/heads/something >expect &&
-			$command &&
-			test_path_is_missing .git/refs/heads/something &&
-			git rev-parse refs/heads/something >actual &&
-			test_cmp expect actual
+			git commit --allow-empty --message "initial" &&
+
+			# Create 14 additional references, which brings us to
+			# 15 together with the default branch.
+			printf "create refs/heads/loose-%d HEAD\n" $(test_seq 14) >stdin &&
+			git update-ref --stdin <stdin &&
+			test_path_is_missing .git/packed-refs &&
+			git pack-refs --auto --all &&
+			test_path_is_missing .git/packed-refs &&
+
+			# Create the 16th reference, which should cause us to repack.
+			git update-ref refs/heads/loose-15 HEAD &&
+			git pack-refs --auto --all &&
+			test_path_is_file .git/packed-refs
+		)
+	'
+
+	test_expect_success "$command does not repack below 16 refs with small packed-refs" '
+		test_when_finished "rm -rf repo" &&
+		git init repo &&
+		(
+			cd repo &&
+			git config set maintenance.auto false &&
+			git commit --allow-empty --message "initial" &&
+
+			git pack-refs --all &&
+			test_line_count = 2 .git/packed-refs &&
+
+			# Create 15 loose references.
+			printf "create refs/heads/loose-%d HEAD\n" $(test_seq 15) >stdin &&
+			git update-ref --stdin <stdin &&
+			git pack-refs --auto --all &&
+			test_line_count = 2 .git/packed-refs &&
+
+			# Create the 16th loose reference, which should cause us to repack.
+			git update-ref refs/heads/loose-17 HEAD &&
+			git pack-refs --auto --all &&
+			test_line_count = 18 .git/packed-refs
+		)
+	'
+
+	test_expect_success "$command scales with size of packed-refs" '
+		test_when_finished "rm -rf repo" &&
+		git init repo &&
+		(
+			cd repo &&
+			git config set maintenance.auto false &&
+			git commit --allow-empty --message "initial" &&
+
+			# Create 99 packed refs. This should cause the heuristic
+			# to require more than the minimum amount of loose refs.
+			test_seq 99 |
+			while read i
+			do
+				printf "create refs/heads/packed-%d HEAD\n" $i || return 1
+			done >stdin &&
+			git update-ref --stdin <stdin &&
+			git pack-refs --all &&
+			test_line_count = 101 .git/packed-refs &&
+
+			# Create 24 loose refs, which should not yet cause us to repack.
+			printf "create refs/heads/loose-%d HEAD\n" $(test_seq 24) >stdin &&
+			git update-ref --stdin <stdin &&
+			git pack-refs --auto --all &&
+			test_line_count = 101 .git/packed-refs &&
+
+			# Create another handful of refs to cross the border.
+			# Note that we explicitly do not check for strict
+			# boundaries here, as this also depends on the size of
+			# the object hash.
+			printf "create refs/heads/addn-%d HEAD\n" $(test_seq 10) >stdin &&
+			git update-ref --stdin <stdin &&
+			git pack-refs --auto --all &&
+			test_line_count = 135 .git/packed-refs
 		)
 	'
 done
-- 
2.46.0.519.g2e7b89e038.dirty


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 3/3] refs/files: use heuristic to decide whether to repack with `--auto`
  2024-09-04  8:53   ` [PATCH v2 3/3] refs/files: use heuristic to decide whether to repack with `--auto` Patrick Steinhardt
@ 2024-09-04 15:24     ` Junio C Hamano
  2024-09-05  9:58       ` Patrick Steinhardt
  0 siblings, 1 reply; 16+ messages in thread
From: Junio C Hamano @ 2024-09-04 15:24 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, karthik nayak

Patrick Steinhardt <ps@pks.im> writes:

> Introduce a heuristic that decides whether or not it is worth to pack
> loose references. The important factors to decide here are the number of
> loose references in comparison to the overall size of the "packed-refs"
> file. The bigger the "packed-refs" file, the longer it takes to rewrite
> it and thus we scale up the limit of allowed loose references before we
> repack.
>
> As is the nature of heuristics, this mechansim isn't obviously
> "correct", but should rather be seen as a tradeoff between how much
> resources we spend packing refs and how inefficient the ref store
> becomes. For all I can say, we have successfully been using the exact
> same heuristic in Gitaly for several years by now.

This seems to hit the balance between the thoroughness of repack
(leaving fewer loose refs is good) and the frequencey (doing repack
less often is good) in a quite sensible way.

I also have to wonder if it adds a good component to the heuristics
to leave younger loose refs (wrt mtime) out of packed-refs, with the
expectation that they are more likely to be updated again soon than
refs that were written long time ago and stayed the same value.

Thanks.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] refs/files: use heuristic to decide whether to repack with `--auto`
  2024-09-04  7:42         ` Patrick Steinhardt
@ 2024-09-04 16:15           ` Junio C Hamano
  0 siblings, 0 replies; 16+ messages in thread
From: Junio C Hamano @ 2024-09-04 16:15 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: karthik nayak, git

Patrick Steinhardt <ps@pks.im> writes:

> In any case, GCC is clever enough to notice what we're doing:
>
>     fastlog2(unsigned long):
>             xor     eax, eax
>             test    rdi, rdi
>             je      .L5
>             bsr     rax, rdi
>     .L5:
>             ret

Nice.  Aiming to compile to "bsr" is very good.

> So with the following definition we're optimizing both with GCC and
> Clang:
>
>     size_t fastlog2(size_t sz)
>     {
>         size_t l = 0;
>         if (!sz)
>             return 0;
>         for (; sz; sz >>= 1)
>             l++;
>         return l;
>     }
>
> I'd thus say we can just pick that function instead of caring about
> platform endianess with `ffs()`.

The above loop that compilers seem to know to reduce to "bsr" is
good.

FWIW, because the definition of "first bit" in ffs() is "least
significant bit", you do not need to worry about endianness at all,
but what we want is most significant, so it is not directly usable.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] refs/files: use heuristic to decide whether to repack with `--auto`
  2024-09-04  8:49     ` Patrick Steinhardt
@ 2024-09-05  8:44       ` karthik nayak
  0 siblings, 0 replies; 16+ messages in thread
From: karthik nayak @ 2024-09-05  8:44 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 1062 bytes --]

Patrick Steinhardt <ps@pks.im> writes:

> On Tue, Sep 03, 2024 at 02:00:16AM -0700, karthik nayak wrote:
>> Patrick Steinhardt <ps@pks.im> writes:
>>
>> > The `--auto` flag for git-pack-refs(1) allows the ref backend to decide
>> > whether or not a repack is in order. This switch has been introduced
>> > mostly with the "reftable" backend in mind, which already knows to
>> > auto-compact its tables during normal operations. When the flag is set,
>> > then it will use the same auto-compaction mechanism and thus end up
>> > doing nothing in most cases.
>> >
>> > The "files" backend does not have any such heuristic yet and instead
>>
>> Nit: s/instead/will instead/
>>
>> > packs any loose references unconditionally. So we rewrite the complete
>> > "packed-refs" file even if there's only a single loose reference to be
>> > packed.
>
> Revisiting this: isn't the current version of this sentence correct?
> Replacing it with "will instead" certainly makes it incorrect without
> also changing "packs" to "pack".
>
> Patrick

You're right, I did missread.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 690 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 3/3] refs/files: use heuristic to decide whether to repack with `--auto`
  2024-09-04 15:24     ` Junio C Hamano
@ 2024-09-05  9:58       ` Patrick Steinhardt
  0 siblings, 0 replies; 16+ messages in thread
From: Patrick Steinhardt @ 2024-09-05  9:58 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, karthik nayak

On Wed, Sep 04, 2024 at 08:24:44AM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > Introduce a heuristic that decides whether or not it is worth to pack
> > loose references. The important factors to decide here are the number of
> > loose references in comparison to the overall size of the "packed-refs"
> > file. The bigger the "packed-refs" file, the longer it takes to rewrite
> > it and thus we scale up the limit of allowed loose references before we
> > repack.
> >
> > As is the nature of heuristics, this mechansim isn't obviously
> > "correct", but should rather be seen as a tradeoff between how much
> > resources we spend packing refs and how inefficient the ref store
> > becomes. For all I can say, we have successfully been using the exact
> > same heuristic in Gitaly for several years by now.
> 
> This seems to hit the balance between the thoroughness of repack
> (leaving fewer loose refs is good) and the frequencey (doing repack
> less often is good) in a quite sensible way.
> 
> I also have to wonder if it adds a good component to the heuristics
> to leave younger loose refs (wrt mtime) out of packed-refs, with the
> expectation that they are more likely to be updated again soon than
> refs that were written long time ago and stayed the same value.

Maybe.

In general, I expect that most users typically only touch a very small
set of refs, e.g. the four or five feature branches that they have. So
even without such an additional component we would not end up repacking
all that often.

That picture changes once you consider remotes, because with bigger
teams it's quite likely that you'll get many ref updates there. I'm not
sure a time-based heuristic would be a good fit for this usecase,
because I'd think that repacking those right away is sensible most of
the time.

We also have to consider that an mtime-based component makes the overall
system harder to understand and indeterministic. Which isn't to say that
it doesn't make sense.  But I rather think we should land the simple and
stupid solution first, and in case we see that it's insufficient we can
iterate and improve it in the future.

Patrick

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2024-09-05  9:58 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-09-02 13:48 [PATCH 0/2] refs/files: use heuristic to decide whether to repack with `--auto` Patrick Steinhardt
2024-09-02 13:48 ` [PATCH 1/2] t0601: merge tests for auto-packing of refs Patrick Steinhardt
2024-09-02 13:48 ` [PATCH 2/2] refs/files: use heuristic to decide whether to repack with `--auto` Patrick Steinhardt
2024-09-03  9:00   ` karthik nayak
2024-09-03  9:23     ` Patrick Steinhardt
2024-09-03 18:23       ` Junio C Hamano
2024-09-04  7:42         ` Patrick Steinhardt
2024-09-04 16:15           ` Junio C Hamano
2024-09-04  8:49     ` Patrick Steinhardt
2024-09-05  8:44       ` karthik nayak
2024-09-04  8:52 ` [PATCH v2 0/3] refs/files: use heuristics " Patrick Steinhardt
2024-09-04  8:53   ` [PATCH v2 1/3] wrapper: introduce `log2u()` Patrick Steinhardt
2024-09-04  8:53   ` [PATCH v2 2/3] t0601: merge tests for auto-packing of refs Patrick Steinhardt
2024-09-04  8:53   ` [PATCH v2 3/3] refs/files: use heuristic to decide whether to repack with `--auto` Patrick Steinhardt
2024-09-04 15:24     ` Junio C Hamano
2024-09-05  9:58       ` Patrick Steinhardt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).