git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] ref-filter: format iteratively with lexicographic refname sorting
@ 2024-10-16  6:00 Patrick Steinhardt
  2024-10-16 22:11 ` Taylor Blau
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Patrick Steinhardt @ 2024-10-16  6:00 UTC (permalink / raw)
  To: git; +Cc: Victoria Dye, Derrick Stolee

In bd98f9774e (ref-filter.c: filter & format refs in the same callback,
2023-11-14), we have introduced logic into the ref-filter subsystem that
determines whether or not we can output references iteratively instead
of first collecting all references, post-processing them and printing
them once done. This has the advantage that we don't have to store all
refs in memory and, when used with e.g. `--count=1`, that we don't have
to read all refs in the first place.

One restriction we have in place for that is that caller must not ask
for sorted refs, because there is no way to sort the refs without first
reading them all into an array. So the benefits can only be reaped when
explicitly asking for output not to be sorted.

But there is one exception here where we _can_ get away with sorting
refs while streaming: ref backends sort references returned by their
iterators in lexicographic order. So if the following conditions are all
true we can do iterative streaming:

  - The caller uses at most a single name pattern. Otherwise we'd have
    to sort results from multiple invocations of the iterator.

  - There must be at most a single sorting specification, as otherwise
    we're not using plain lexicographic ordering.

  - The sorting specification must use the "refname".

  - The sorting specification must not be using any flags, like
    case-insensitive sorting.

Now the resulting logic does feel quite fragile overall, which makes me
a bit uneasy. But after thinking about this for a while I couldn't find
any obvious gaps in my reasoning. Furthermore, given that lexicographic
sorting order is the default in git-for-each-ref(1), this is likely to
benefit a whole lot of usecases out there.

The following benchmark executes git-for-each-ref(1) in a crafted repo
with 1 million references:

  Benchmark 1: git for-each-ref (revision = HEAD~)
    Time (mean ± σ):      6.756 s ±  0.014 s    [User: 3.004 s, System: 3.541 s]
    Range (min … max):    6.738 s …  6.784 s    10 runs

  Benchmark 2: git for-each-ref (revision = HEAD)
    Time (mean ± σ):      6.479 s ±  0.017 s    [User: 2.858 s, System: 3.422 s]
    Range (min … max):    6.450 s …  6.519 s    10 runs

  Summary
    git for-each-ref (revision = HEAD)
      1.04 ± 0.00 times faster than git for-each-ref (revision = HEAD~)

The change results in a slight performance improvement, but nothing that
would really stand out. Something that cannot be seen in the benchmark
though is peak memory usage, which went from 404.5MB to 68.96kB.

A more interesting benchmark is printing a single referenence with
`--count=1`:

  Benchmark 1: git for-each-ref --count=1 (revision = HEAD~)
    Time (mean ± σ):      6.655 s ±  0.018 s    [User: 2.865 s, System: 3.576 s]
    Range (min … max):    6.630 s …  6.680 s    10 runs

  Benchmark 2: git for-each-ref --count=1 (revision = HEAD)
    Time (mean ± σ):       8.6 ms ±   1.3 ms    [User: 2.3 ms, System: 6.1 ms]
    Range (min … max):     6.7 ms …  14.4 ms    266 runs

  Summary
    git git for-each-ref --count=1 (revision = HEAD)
    770.58 ± 116.19 times faster than git for-each-ref --count=1 (revision = HEAD~)

Whereas we scaled with the number of references before, we now print the
first reference and exit immediately, which provides a massive win.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---

I'm honestly not quite sure whether I think that this change is fine, or
whether it is getting too fragile. I decided to send the patch anyway so
that we can discuss on the mailing list, mostly because I think that the
results speak for themselves.

Ultimately, this very much feels like a tradeoff to me.

Thanks!

Patrick

 ref-filter.c | 37 ++++++++++++++++++++++++++++++-------
 1 file changed, 30 insertions(+), 7 deletions(-)

diff --git a/ref-filter.c b/ref-filter.c
index dd195007ce1..e075ca21d8e 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -3244,10 +3244,40 @@ int filter_refs(struct ref_array *array, struct ref_filter *filter, unsigned int
 	return ret;
 }
 
+struct ref_sorting {
+	struct ref_sorting *next;
+	int atom; /* index into used_atom array (internal) */
+	enum ref_sorting_order sort_flags;
+};
+
 static inline int can_do_iterative_format(struct ref_filter *filter,
 					  struct ref_sorting *sorting,
 					  struct ref_format *format)
 {
+	/*
+	 * Reference backends sort patterns lexicographically by refname, so if
+	 * the sorting options ask for exactly that we may be able to do
+	 * iterative formatting.
+	 */
+	if (sorting) {
+		size_t n = 0;
+
+		/*
+		 * There must be a single sorting filter that uses
+		 * lexicographic sorting of the refname.
+		 */
+		if (sorting->next ||
+		    sorting->sort_flags ||
+		    used_atom[sorting->atom].atom_type != ATOM_REFNAME)
+			return 0;
+
+		/* And there must be at most a single name pattern. */
+		while (filter->name_patterns && filter->name_patterns[n] && n < 2)
+			n++;
+		if (n > 1)
+			return 0;
+	}
+
 	/*
 	 * Filtering & formatting results within a single ref iteration
 	 * callback is not compatible with options that require
@@ -3258,7 +3288,6 @@ static inline int can_do_iterative_format(struct ref_filter *filter,
 	 */
 	return !(filter->reachable_from ||
 		 filter->unreachable_from ||
-		 sorting ||
 		 format->bases.nr ||
 		 format->is_base_tips.nr);
 }
@@ -3316,12 +3345,6 @@ static int memcasecmp(const void *vs1, const void *vs2, size_t n)
 	return 0;
 }
 
-struct ref_sorting {
-	struct ref_sorting *next;
-	int atom; /* index into used_atom array (internal) */
-	enum ref_sorting_order sort_flags;
-};
-
 static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, struct ref_array_item *b)
 {
 	struct atom_value *va, *vb;
-- 
2.47.0.72.gef8ce8f3d4.dirty


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] ref-filter: format iteratively with lexicographic refname sorting
  2024-10-16  6:00 [PATCH] ref-filter: format iteratively with lexicographic refname sorting Patrick Steinhardt
@ 2024-10-16 22:11 ` Taylor Blau
  2024-10-17  2:48   ` Jeff King
  2024-10-17  5:09 ` [PATCH v2] " Patrick Steinhardt
  2024-10-21 11:33 ` [PATCH v3] " Patrick Steinhardt
  2 siblings, 1 reply; 12+ messages in thread
From: Taylor Blau @ 2024-10-16 22:11 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Victoria Dye, Derrick Stolee

On Wed, Oct 16, 2024 at 08:00:30AM +0200, Patrick Steinhardt wrote:
> But there is one exception here where we _can_ get away with sorting
> refs while streaming: ref backends sort references returned by their
> iterators in lexicographic order. So if the following conditions are all
> true we can do iterative streaming:
>
>   - The caller uses at most a single name pattern. Otherwise we'd have
>     to sort results from multiple invocations of the iterator.
>
>   - There must be at most a single sorting specification, as otherwise
>     we're not using plain lexicographic ordering.
>
>   - The sorting specification must use the "refname".
>
>   - The sorting specification must not be using any flags, like
>     case-insensitive sorting.

Perhaps a niche case, but what about ancient packed-refs files that were
written before the 'sorted' capability was introduced?

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] ref-filter: format iteratively with lexicographic refname sorting
  2024-10-16 22:11 ` Taylor Blau
@ 2024-10-17  2:48   ` Jeff King
  2024-10-17  4:28     ` Patrick Steinhardt
  0 siblings, 1 reply; 12+ messages in thread
From: Jeff King @ 2024-10-17  2:48 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Patrick Steinhardt, git, Victoria Dye, Derrick Stolee

On Wed, Oct 16, 2024 at 06:11:47PM -0400, Taylor Blau wrote:

> On Wed, Oct 16, 2024 at 08:00:30AM +0200, Patrick Steinhardt wrote:
> > But there is one exception here where we _can_ get away with sorting
> > refs while streaming: ref backends sort references returned by their
> > iterators in lexicographic order. So if the following conditions are all
> > true we can do iterative streaming:
> >
> >   - The caller uses at most a single name pattern. Otherwise we'd have
> >     to sort results from multiple invocations of the iterator.
> >
> >   - There must be at most a single sorting specification, as otherwise
> >     we're not using plain lexicographic ordering.
> >
> >   - The sorting specification must use the "refname".
> >
> >   - The sorting specification must not be using any flags, like
> >     case-insensitive sorting.
> 
> Perhaps a niche case, but what about ancient packed-refs files that were
> written before the 'sorted' capability was introduced?

We should be OK there. In that case we actually read in and sort the
packed-refs entries ourselves. We have to, since we do an in-order merge
with the loose refs while iterating.

I do think this optimization is worth doing, and not a problem with our
current backends. The biggest worries would be:

  1. Some new ref backend that doesn't return sorted results. I find
     this unlikely, and anyway it's easily caught by having coverage in
     the test suite (which I assume we already have, but I didn't look).

  2. Some new flag combination that requires disabling the optimization,
     and which must be dealt with in the code. This seems unlikely to me
     but not impossible. I think enabling the optimization is worth it,
     though.

> >   - The caller uses at most a single name pattern. Otherwise we'd have
> >     to sort results from multiple invocations of the iterator.

I think this part is erring on the cautious side, as we can often
collapse these into a single iteration due to the ref-prefix work. It
may be OK to keep using the slower code here if multiple patterns aren't
commonly used, but I'd suspect that:

  git for-each-ref --sort=refname refs/heads refs/tags

could benefit.

-Peff

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] ref-filter: format iteratively with lexicographic refname sorting
  2024-10-17  2:48   ` Jeff King
@ 2024-10-17  4:28     ` Patrick Steinhardt
  2024-10-21 12:36       ` karthik nayak
  0 siblings, 1 reply; 12+ messages in thread
From: Patrick Steinhardt @ 2024-10-17  4:28 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, Victoria Dye, Derrick Stolee

On Wed, Oct 16, 2024 at 10:48:28PM -0400, Jeff King wrote:
> On Wed, Oct 16, 2024 at 06:11:47PM -0400, Taylor Blau wrote:
> 
> > On Wed, Oct 16, 2024 at 08:00:30AM +0200, Patrick Steinhardt wrote:
> > > But there is one exception here where we _can_ get away with sorting
> > > refs while streaming: ref backends sort references returned by their
> > > iterators in lexicographic order. So if the following conditions are all
> > > true we can do iterative streaming:
> > >
> > >   - The caller uses at most a single name pattern. Otherwise we'd have
> > >     to sort results from multiple invocations of the iterator.
> > >
> > >   - There must be at most a single sorting specification, as otherwise
> > >     we're not using plain lexicographic ordering.
> > >
> > >   - The sorting specification must use the "refname".
> > >
> > >   - The sorting specification must not be using any flags, like
> > >     case-insensitive sorting.
> > 
> > Perhaps a niche case, but what about ancient packed-refs files that were
> > written before the 'sorted' capability was introduced?
> 
> We should be OK there. In that case we actually read in and sort the
> packed-refs entries ourselves. We have to, since we do an in-order merge
> with the loose refs while iterating.
> 
> I do think this optimization is worth doing, and not a problem with our
> current backends. The biggest worries would be:
> 
>   1. Some new ref backend that doesn't return sorted results. I find
>      this unlikely, and anyway it's easily caught by having coverage in
>      the test suite (which I assume we already have, but I didn't look).

My assumption is that a ref iterator that _isn't_ sorted would lead to
undesirable behaviour. I'd be surprised if git-show-ref(1) started to
show refs in a random order. So we have essentially baked the
requirement that ref iterators return refs in lexicographic order into
our codebase already.

>   2. Some new flag combination that requires disabling the optimization,
>      and which must be dealt with in the code. This seems unlikely to me
>      but not impossible. I think enabling the optimization is worth it,
>      though.

And this part is an issue with or without my patch. The logic we have
in the ref-filter API is quite fragile, and everybody who wants to add
some new flags must remember to update `can_do_iterative_format()`. I'm
not really a huge fan of that, but on the other hand the subsystem does
not change all that frequently anyway.

> > >     to sort results from multiple invocations of the iterator.
> 
> I think this part is erring on the cautious side, as we can often
> collapse these into a single iteration due to the ref-prefix work. It
> may be OK to keep using the slower code here if multiple patterns aren't
> commonly used, but I'd suspect that:
> 
>   git for-each-ref --sort=refname refs/heads refs/tags
> 
> could benefit.

Mh. So we do end up using `refs_for_each_fullref_in_prefixes()`, which
may or may not end up collapsing the prefixes. We do sort and dedup the
prefixes via `find_longest_prefixes()`, so we don't have to worry about
e.g. `git for-each-ref refs/tags refs/heads refs/tags`.

So... it should be fine to also use this with multiple patterns? None of
our tests fail, either, which reassures me a bit.

I'll send a v2 that loosens this requirement.

Thanks for your input!

Patrick

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v2] ref-filter: format iteratively with lexicographic refname sorting
  2024-10-16  6:00 [PATCH] ref-filter: format iteratively with lexicographic refname sorting Patrick Steinhardt
  2024-10-16 22:11 ` Taylor Blau
@ 2024-10-17  5:09 ` Patrick Steinhardt
  2024-10-17 20:57   ` Taylor Blau
  2024-10-21 11:10   ` Toon Claes
  2024-10-21 11:33 ` [PATCH v3] " Patrick Steinhardt
  2 siblings, 2 replies; 12+ messages in thread
From: Patrick Steinhardt @ 2024-10-17  5:09 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau, Jeff King

In bd98f9774e (ref-filter.c: filter & format refs in the same callback,
2023-11-14), we have introduced logic into the ref-filter subsystem that
determines whether or not we can output references iteratively instead
of first collecting all references, post-processing them and printing
them once done. This has the advantage that we don't have to store all
refs in memory and, when used with e.g. `--count=1`, that we don't have
to read all refs in the first place.

One restriction we have in place for that is that caller must not ask
for sorted refs, because there is no way to sort the refs without first
reading them all into an array. So the benefits can only be reaped when
explicitly asking for output not to be sorted.

But there is one exception here where we _can_ get away with sorting
refs while streaming: ref backends sort references returned by their
iterators in lexicographic order. So if the following conditions are all
true we can do iterative streaming:

  - There must be at most a single sorting specification, as otherwise
    we're not using plain lexicographic ordering.

  - The sorting specification must use the "refname".

  - The sorting specification must not be using any flags, like
    case-insensitive sorting.

Now the resulting logic does feel quite fragile overall, which makes me
a bit uneasy. But after thinking about this for a while I couldn't find
any obvious gaps in my reasoning. Furthermore, given that lexicographic
sorting order is the default in git-for-each-ref(1), this is likely to
benefit a whole lot of usecases out there.

The following benchmark executes git-for-each-ref(1) in a crafted repo
with 1 million references:

  Benchmark 1: git for-each-ref (revision = HEAD~)
    Time (mean ± σ):      6.756 s ±  0.014 s    [User: 3.004 s, System: 3.541 s]
    Range (min … max):    6.738 s …  6.784 s    10 runs

  Benchmark 2: git for-each-ref (revision = HEAD)
    Time (mean ± σ):      6.479 s ±  0.017 s    [User: 2.858 s, System: 3.422 s]
    Range (min … max):    6.450 s …  6.519 s    10 runs

  Summary
    git for-each-ref (revision = HEAD)
      1.04 ± 0.00 times faster than git for-each-ref (revision = HEAD~)

The change results in a slight performance improvement, but nothing that
would really stand out. Something that cannot be seen in the benchmark
though is peak memory usage, which went from 404.5MB to 68.96kB.

A more interesting benchmark is printing a single referenence with
`--count=1`:

  Benchmark 1: git for-each-ref --count=1 (revision = HEAD~)
    Time (mean ± σ):      6.655 s ±  0.018 s    [User: 2.865 s, System: 3.576 s]
    Range (min … max):    6.630 s …  6.680 s    10 runs

  Benchmark 2: git for-each-ref --count=1 (revision = HEAD)
    Time (mean ± σ):       8.6 ms ±   1.3 ms    [User: 2.3 ms, System: 6.1 ms]
    Range (min … max):     6.7 ms …  14.4 ms    266 runs

  Summary
    git git for-each-ref --count=1 (revision = HEAD)
    770.58 ± 116.19 times faster than git for-each-ref --count=1 (revision = HEAD~)

Whereas we scaled with the number of references before, we now print the
first reference and exit immediately, which provides a massive win.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---

There's only a single change compared to v1, namely that we don't
disable the optimization with multiple name patterns anymore. See also
the range-diff at the end.

Thanks!

Patrick

 ref-filter.c | 28 +++++++++++++++++++++-------
 1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/ref-filter.c b/ref-filter.c
index dd195007ce1..424a9cb50ae 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -3244,10 +3244,31 @@ int filter_refs(struct ref_array *array, struct ref_filter *filter, unsigned int
 	return ret;
 }
 
+struct ref_sorting {
+	struct ref_sorting *next;
+	int atom; /* index into used_atom array (internal) */
+	enum ref_sorting_order sort_flags;
+};
+
 static inline int can_do_iterative_format(struct ref_filter *filter,
 					  struct ref_sorting *sorting,
 					  struct ref_format *format)
 {
+	/*
+	 * Reference backends sort patterns lexicographically by refname, so if
+	 * the sorting options ask for exactly that we are able to do iterative
+	 * formatting.
+	 *
+	 * Note that we do not have to worry about multiple name patterns,
+	 * either. Those get sorted and deduplicated eventually in
+	 * `refs_for_each_fullref_in_prefixes()`, so we return names in the
+	 * correct ordering here, too.
+	 */
+	if (sorting && (sorting->next ||
+			sorting->sort_flags ||
+			used_atom[sorting->atom].atom_type != ATOM_REFNAME))
+		return 0;
+
 	/*
 	 * Filtering & formatting results within a single ref iteration
 	 * callback is not compatible with options that require
@@ -3258,7 +3279,6 @@ static inline int can_do_iterative_format(struct ref_filter *filter,
 	 */
 	return !(filter->reachable_from ||
 		 filter->unreachable_from ||
-		 sorting ||
 		 format->bases.nr ||
 		 format->is_base_tips.nr);
 }
@@ -3316,12 +3336,6 @@ static int memcasecmp(const void *vs1, const void *vs2, size_t n)
 	return 0;
 }
 
-struct ref_sorting {
-	struct ref_sorting *next;
-	int atom; /* index into used_atom array (internal) */
-	enum ref_sorting_order sort_flags;
-};
-
 static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, struct ref_array_item *b)
 {
 	struct atom_value *va, *vb;

Range-diff against v1:
1:  93e0b80c990 ! 1:  e0daa6a2eac ref-filter: format iteratively with lexicographic refname sorting
    @@ Commit message
         iterators in lexicographic order. So if the following conditions are all
         true we can do iterative streaming:
     
    -      - The caller uses at most a single name pattern. Otherwise we'd have
    -        to sort results from multiple invocations of the iterator.
    -
           - There must be at most a single sorting specification, as otherwise
             we're not using plain lexicographic ordering.
     
    @@ ref-filter.c: int filter_refs(struct ref_array *array, struct ref_filter *filter
      {
     +	/*
     +	 * Reference backends sort patterns lexicographically by refname, so if
    -+	 * the sorting options ask for exactly that we may be able to do
    -+	 * iterative formatting.
    ++	 * the sorting options ask for exactly that we are able to do iterative
    ++	 * formatting.
    ++	 *
    ++	 * Note that we do not have to worry about multiple name patterns,
    ++	 * either. Those get sorted and deduplicated eventually in
    ++	 * `refs_for_each_fullref_in_prefixes()`, so we return names in the
    ++	 * correct ordering here, too.
     +	 */
    -+	if (sorting) {
    -+		size_t n = 0;
    -+
    -+		/*
    -+		 * There must be a single sorting filter that uses
    -+		 * lexicographic sorting of the refname.
    -+		 */
    -+		if (sorting->next ||
    -+		    sorting->sort_flags ||
    -+		    used_atom[sorting->atom].atom_type != ATOM_REFNAME)
    -+			return 0;
    -+
    -+		/* And there must be at most a single name pattern. */
    -+		while (filter->name_patterns && filter->name_patterns[n] && n < 2)
    -+			n++;
    -+		if (n > 1)
    -+			return 0;
    -+	}
    ++	if (sorting && (sorting->next ||
    ++			sorting->sort_flags ||
    ++			used_atom[sorting->atom].atom_type != ATOM_REFNAME))
    ++		return 0;
     +
      	/*
      	 * Filtering & formatting results within a single ref iteration
-- 
2.47.0.72.gef8ce8f3d4.dirty


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v2] ref-filter: format iteratively with lexicographic refname sorting
  2024-10-17  5:09 ` [PATCH v2] " Patrick Steinhardt
@ 2024-10-17 20:57   ` Taylor Blau
  2024-10-21 11:10   ` Toon Claes
  1 sibling, 0 replies; 12+ messages in thread
From: Taylor Blau @ 2024-10-17 20:57 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jeff King

On Thu, Oct 17, 2024 at 07:09:51AM +0200, Patrick Steinhardt wrote:
>  ref-filter.c | 28 +++++++++++++++++++++-------
>  1 file changed, 21 insertions(+), 7 deletions(-)

Looking good. Let's see if Peff has any further comments, otherwise I
think we can start merging this one down.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2] ref-filter: format iteratively with lexicographic refname sorting
  2024-10-17  5:09 ` [PATCH v2] " Patrick Steinhardt
  2024-10-17 20:57   ` Taylor Blau
@ 2024-10-21 11:10   ` Toon Claes
  2024-10-21 11:33     ` Patrick Steinhardt
  1 sibling, 1 reply; 12+ messages in thread
From: Toon Claes @ 2024-10-21 11:10 UTC (permalink / raw)
  To: Patrick Steinhardt, git; +Cc: Taylor Blau, Jeff King

Patrick Steinhardt <ps@pks.im> writes:

> [snip]
>
>  ref-filter.c | 28 +++++++++++++++++++++-------
>  1 file changed, 21 insertions(+), 7 deletions(-)
>
> diff --git a/ref-filter.c b/ref-filter.c
> index dd195007ce1..424a9cb50ae 100644
> --- a/ref-filter.c
> +++ b/ref-filter.c
> @@ -3244,10 +3244,31 @@ int filter_refs(struct ref_array *array, struct ref_filter *filter, unsigned int
>  	return ret;
>  }
>  
> +struct ref_sorting {
> +	struct ref_sorting *next;
> +	int atom; /* index into used_atom array (internal) */
> +	enum ref_sorting_order sort_flags;
> +};
> +
>  static inline int can_do_iterative_format(struct ref_filter *filter,
>  					  struct ref_sorting *sorting,
>  					  struct ref_format *format)
>  {
> +	/*
> +	 * Reference backends sort patterns lexicographically by refname, so if
> +	 * the sorting options ask for exactly that we are able to do iterative
> +	 * formatting.
> +	 *
> +	 * Note that we do not have to worry about multiple name patterns,
> +	 * either. Those get sorted and deduplicated eventually in
> +	 * `refs_for_each_fullref_in_prefixes()`, so we return names in the
> +	 * correct ordering here, too.
> +	 */
> +	if (sorting && (sorting->next ||
> +			sorting->sort_flags ||
> +			used_atom[sorting->atom].atom_type != ATOM_REFNAME))
> +		return 0;
> +
>  	/*
>  	 * Filtering & formatting results within a single ref iteration
>  	 * callback is not compatible with options that require
> @@ -3258,7 +3279,6 @@ static inline int can_do_iterative_format(struct ref_filter *filter,
>  	 */
>  	return !(filter->reachable_from ||
>  		 filter->unreachable_from ||
> -		 sorting ||

Just a small nit, because we remove `sorting` from this condition, I
suggest to also remove the following comment above it:

	 * - sorting the filtered results


Otherwise no comments from my side.

-- 
Toon

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v3] ref-filter: format iteratively with lexicographic refname sorting
  2024-10-16  6:00 [PATCH] ref-filter: format iteratively with lexicographic refname sorting Patrick Steinhardt
  2024-10-16 22:11 ` Taylor Blau
  2024-10-17  5:09 ` [PATCH v2] " Patrick Steinhardt
@ 2024-10-21 11:33 ` Patrick Steinhardt
  2024-10-21 20:46   ` Taylor Blau
  2 siblings, 1 reply; 12+ messages in thread
From: Patrick Steinhardt @ 2024-10-21 11:33 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau, Jeff King, Toon Claes

In bd98f9774e (ref-filter.c: filter & format refs in the same callback,
2023-11-14), we have introduced logic into the ref-filter subsystem that
determines whether or not we can output references iteratively instead
of first collecting all references, post-processing them and printing
them once done. This has the advantage that we don't have to store all
refs in memory and, when used with e.g. `--count=1`, that we don't have
to read all refs in the first place.

One restriction we have in place for that is that caller must not ask
for sorted refs, because there is no way to sort the refs without first
reading them all into an array. So the benefits can only be reaped when
explicitly asking for output not to be sorted.

But there is one exception here where we _can_ get away with sorting
refs while streaming: ref backends sort references returned by their
iterators in lexicographic order. So if the following conditions are all
true we can do iterative streaming:

  - There must be at most a single sorting specification, as otherwise
    we're not using plain lexicographic ordering.

  - The sorting specification must use the "refname".

  - The sorting specification must not be using any flags, like
    case-insensitive sorting.

Now the resulting logic does feel quite fragile overall, which makes me
a bit uneasy. But after thinking about this for a while I couldn't find
any obvious gaps in my reasoning. Furthermore, given that lexicographic
sorting order is the default in git-for-each-ref(1), this is likely to
benefit a whole lot of usecases out there.

The following benchmark executes git-for-each-ref(1) in a crafted repo
with 1 million references:

  Benchmark 1: git for-each-ref (revision = HEAD~)
    Time (mean ± σ):      6.756 s ±  0.014 s    [User: 3.004 s, System: 3.541 s]
    Range (min … max):    6.738 s …  6.784 s    10 runs

  Benchmark 2: git for-each-ref (revision = HEAD)
    Time (mean ± σ):      6.479 s ±  0.017 s    [User: 2.858 s, System: 3.422 s]
    Range (min … max):    6.450 s …  6.519 s    10 runs

  Summary
    git for-each-ref (revision = HEAD)
      1.04 ± 0.00 times faster than git for-each-ref (revision = HEAD~)

The change results in a slight performance improvement, but nothing that
would really stand out. Something that cannot be seen in the benchmark
though is peak memory usage, which went from 404.5MB to 68.96kB.

A more interesting benchmark is printing a single referenence with
`--count=1`:

  Benchmark 1: git for-each-ref --count=1 (revision = HEAD~)
    Time (mean ± σ):      6.655 s ±  0.018 s    [User: 2.865 s, System: 3.576 s]
    Range (min … max):    6.630 s …  6.680 s    10 runs

  Benchmark 2: git for-each-ref --count=1 (revision = HEAD)
    Time (mean ± σ):       8.6 ms ±   1.3 ms    [User: 2.3 ms, System: 6.1 ms]
    Range (min … max):     6.7 ms …  14.4 ms    266 runs

  Summary
    git git for-each-ref --count=1 (revision = HEAD)
    770.58 ± 116.19 times faster than git for-each-ref --count=1 (revision = HEAD~)

Whereas we scaled with the number of references before, we now print the
first reference and exit immediately, which provides a massive win.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---

There's only a single change compared to v2, namely a fixup for a
now-stale comment. Thanks!

Patrick

 ref-filter.c | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/ref-filter.c b/ref-filter.c
index dd195007ce1..84c60361072 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -3244,21 +3244,40 @@ int filter_refs(struct ref_array *array, struct ref_filter *filter, unsigned int
 	return ret;
 }
 
+struct ref_sorting {
+	struct ref_sorting *next;
+	int atom; /* index into used_atom array (internal) */
+	enum ref_sorting_order sort_flags;
+};
+
 static inline int can_do_iterative_format(struct ref_filter *filter,
 					  struct ref_sorting *sorting,
 					  struct ref_format *format)
 {
+	/*
+	 * Reference backends sort patterns lexicographically by refname, so if
+	 * the sorting options ask for exactly that we are able to do iterative
+	 * formatting.
+	 *
+	 * Note that we do not have to worry about multiple name patterns,
+	 * either. Those get sorted and deduplicated eventually in
+	 * `refs_for_each_fullref_in_prefixes()`, so we return names in the
+	 * correct ordering here, too.
+	 */
+	if (sorting && (sorting->next ||
+			sorting->sort_flags ||
+			used_atom[sorting->atom].atom_type != ATOM_REFNAME))
+		return 0;
+
 	/*
 	 * Filtering & formatting results within a single ref iteration
 	 * callback is not compatible with options that require
 	 * post-processing a filtered ref_array. These include:
 	 * - filtering on reachability
-	 * - sorting the filtered results
 	 * - including ahead-behind information in the formatted output
 	 */
 	return !(filter->reachable_from ||
 		 filter->unreachable_from ||
-		 sorting ||
 		 format->bases.nr ||
 		 format->is_base_tips.nr);
 }
@@ -3316,12 +3335,6 @@ static int memcasecmp(const void *vs1, const void *vs2, size_t n)
 	return 0;
 }
 
-struct ref_sorting {
-	struct ref_sorting *next;
-	int atom; /* index into used_atom array (internal) */
-	enum ref_sorting_order sort_flags;
-};
-
 static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, struct ref_array_item *b)
 {
 	struct atom_value *va, *vb;

Range-diff against v2:
1:  e0daa6a2eac ! 1:  d23c3e3ee7f ref-filter: format iteratively with lexicographic refname sorting
    @@ ref-filter.c: int filter_refs(struct ref_array *array, struct ref_filter *filter
      	/*
      	 * Filtering & formatting results within a single ref iteration
      	 * callback is not compatible with options that require
    -@@ ref-filter.c: static inline int can_do_iterative_format(struct ref_filter *filter,
    + 	 * post-processing a filtered ref_array. These include:
    + 	 * - filtering on reachability
    +-	 * - sorting the filtered results
    + 	 * - including ahead-behind information in the formatted output
      	 */
      	return !(filter->reachable_from ||
      		 filter->unreachable_from ||
-- 
2.47.0.72.gef8ce8f3d4.dirty


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v2] ref-filter: format iteratively with lexicographic refname sorting
  2024-10-21 11:10   ` Toon Claes
@ 2024-10-21 11:33     ` Patrick Steinhardt
  0 siblings, 0 replies; 12+ messages in thread
From: Patrick Steinhardt @ 2024-10-21 11:33 UTC (permalink / raw)
  To: Toon Claes; +Cc: git, Taylor Blau, Jeff King

On Mon, Oct 21, 2024 at 01:10:44PM +0200, Toon Claes wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> > @@ -3258,7 +3279,6 @@ static inline int can_do_iterative_format(struct ref_filter *filter,
> >  	 */
> >  	return !(filter->reachable_from ||
> >  		 filter->unreachable_from ||
> > -		 sorting ||
> 
> Just a small nit, because we remove `sorting` from this condition, I
> suggest to also remove the following comment above it:
> 
> 	 * - sorting the filtered results
> 
> 
> Otherwise no comments from my side.

Ah, true. Will fix in v3, thanks!

Patrick

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] ref-filter: format iteratively with lexicographic refname sorting
  2024-10-17  4:28     ` Patrick Steinhardt
@ 2024-10-21 12:36       ` karthik nayak
  2024-10-21 20:45         ` Taylor Blau
  0 siblings, 1 reply; 12+ messages in thread
From: karthik nayak @ 2024-10-21 12:36 UTC (permalink / raw)
  To: Patrick Steinhardt, Jeff King
  Cc: Taylor Blau, git, Victoria Dye, Derrick Stolee

[-- Attachment #1: Type: text/plain, Size: 3491 bytes --]

Patrick Steinhardt <ps@pks.im> writes:

> On Wed, Oct 16, 2024 at 10:48:28PM -0400, Jeff King wrote:
>> On Wed, Oct 16, 2024 at 06:11:47PM -0400, Taylor Blau wrote:
>>
>> > On Wed, Oct 16, 2024 at 08:00:30AM +0200, Patrick Steinhardt wrote:
>> > > But there is one exception here where we _can_ get away with sorting
>> > > refs while streaming: ref backends sort references returned by their
>> > > iterators in lexicographic order. So if the following conditions are all
>> > > true we can do iterative streaming:
>> > >
>> > >   - The caller uses at most a single name pattern. Otherwise we'd have
>> > >     to sort results from multiple invocations of the iterator.
>> > >
>> > >   - There must be at most a single sorting specification, as otherwise
>> > >     we're not using plain lexicographic ordering.
>> > >
>> > >   - The sorting specification must use the "refname".
>> > >
>> > >   - The sorting specification must not be using any flags, like
>> > >     case-insensitive sorting.
>> >
>> > Perhaps a niche case, but what about ancient packed-refs files that were
>> > written before the 'sorted' capability was introduced?
>>
>> We should be OK there. In that case we actually read in and sort the
>> packed-refs entries ourselves. We have to, since we do an in-order merge
>> with the loose refs while iterating.
>>
>> I do think this optimization is worth doing, and not a problem with our
>> current backends. The biggest worries would be:
>>
>>   1. Some new ref backend that doesn't return sorted results. I find
>>      this unlikely, and anyway it's easily caught by having coverage in
>>      the test suite (which I assume we already have, but I didn't look).
>
> My assumption is that a ref iterator that _isn't_ sorted would lead to
> undesirable behaviour. I'd be surprised if git-show-ref(1) started to
> show refs in a random order. So we have essentially baked the
> requirement that ref iterators return refs in lexicographic order into
> our codebase already.
>
>>   2. Some new flag combination that requires disabling the optimization,
>>      and which must be dealt with in the code. This seems unlikely to me
>>      but not impossible. I think enabling the optimization is worth it,
>>      though.
>
> And this part is an issue with or without my patch. The logic we have
> in the ref-filter API is quite fragile, and everybody who wants to add
> some new flags must remember to update `can_do_iterative_format()`. I'm
> not really a huge fan of that, but on the other hand the subsystem does
> not change all that frequently anyway.
>
>> > >     to sort results from multiple invocations of the iterator.
>>
>> I think this part is erring on the cautious side, as we can often
>> collapse these into a single iteration due to the ref-prefix work. It
>> may be OK to keep using the slower code here if multiple patterns aren't
>> commonly used, but I'd suspect that:
>>
>>   git for-each-ref --sort=refname refs/heads refs/tags
>>
>> could benefit.
>
> Mh. So we do end up using `refs_for_each_fullref_in_prefixes()`, which
> may or may not end up collapsing the prefixes. We do sort and dedup the
> prefixes via `find_longest_prefixes()`, so we don't have to worry about
> e.g. `git for-each-ref refs/tags refs/heads refs/tags`.
>

Tangent: This sent me down a rabbit hole, I wonder if we can do better
with naming, `find_longest_prefixes` calls `find_longest_prefixes_1`,
The `_1` doesn't help at all with explaining what the function does.

[snip]

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 690 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] ref-filter: format iteratively with lexicographic refname sorting
  2024-10-21 12:36       ` karthik nayak
@ 2024-10-21 20:45         ` Taylor Blau
  0 siblings, 0 replies; 12+ messages in thread
From: Taylor Blau @ 2024-10-21 20:45 UTC (permalink / raw)
  To: karthik nayak
  Cc: Patrick Steinhardt, Jeff King, git, Victoria Dye, Derrick Stolee

On Mon, Oct 21, 2024 at 07:36:55AM -0500, karthik nayak wrote:
> > Mh. So we do end up using `refs_for_each_fullref_in_prefixes()`, which
> > may or may not end up collapsing the prefixes. We do sort and dedup the
> > prefixes via `find_longest_prefixes()`, so we don't have to worry about
> > e.g. `git for-each-ref refs/tags refs/heads refs/tags`.
>
> Tangent: This sent me down a rabbit hole, I wonder if we can do better
> with naming, `find_longest_prefixes` calls `find_longest_prefixes_1`,
> The `_1` doesn't help at all with explaining what the function does.
>
> [snip]

This is actually one of the examples I was thinking of when I replied to
you in the other thread. find_longest_prefixes() is the entry-point, and
does a little bit of setup and tear down that is unique.

But the recursion happens within find_longest_prefixes_1(), which does
not want to repeat the same setup and tear down, hence the split. I
still maintain that _1() is a useful convention for differentiating
between the two, but I'm fine to be in the minority there ;-).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] ref-filter: format iteratively with lexicographic refname sorting
  2024-10-21 11:33 ` [PATCH v3] " Patrick Steinhardt
@ 2024-10-21 20:46   ` Taylor Blau
  0 siblings, 0 replies; 12+ messages in thread
From: Taylor Blau @ 2024-10-21 20:46 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jeff King, Toon Claes

On Mon, Oct 21, 2024 at 01:33:23PM +0200, Patrick Steinhardt wrote:
>  ref-filter.c | 29 +++++++++++++++++++++--------
>  1 file changed, 21 insertions(+), 8 deletions(-)

Thanks, as you note this is awfully similar to v2, which was already
looking quite good in my opinion.

Thanks to those who have participated in reviewing this topic. Let's
start merging this one down to 'next'.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2024-10-21 20:46 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-16  6:00 [PATCH] ref-filter: format iteratively with lexicographic refname sorting Patrick Steinhardt
2024-10-16 22:11 ` Taylor Blau
2024-10-17  2:48   ` Jeff King
2024-10-17  4:28     ` Patrick Steinhardt
2024-10-21 12:36       ` karthik nayak
2024-10-21 20:45         ` Taylor Blau
2024-10-17  5:09 ` [PATCH v2] " Patrick Steinhardt
2024-10-17 20:57   ` Taylor Blau
2024-10-21 11:10   ` Toon Claes
2024-10-21 11:33     ` Patrick Steinhardt
2024-10-21 11:33 ` [PATCH v3] " Patrick Steinhardt
2024-10-21 20:46   ` Taylor Blau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).