Re: [PATCH] setlocalversion: Add workaround for "git describe" performance issue

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [PATCH] setlocalversion: Add workaround for "git describe" performance issue
       [not found] <309549cafdcfe50c4fceac3263220cc3d8b109b2.1730337435.git.jpoimboe@kernel.org>
@ 2024-10-31 10:37 ` Rasmus Villemoes
  2024-10-31 11:42   ` Jeff King
  2024-10-31 11:43   ` Masahiro Yamada
  0 siblings, 2 replies; 27+ messages in thread
From: Rasmus Villemoes @ 2024-10-31 10:37 UTC (permalink / raw)
  To: Josh Poimboeuf; +Cc: Masahiro Yamada, linux-kernel, linux-kbuild, git

On Wed, Oct 30 2024, Josh Poimboeuf <jpoimboe@kernel.org> wrote:

> If HEAD isn't associated with an annotated tag, a bug (or feature?) in
> "git describe --match" causes it to search every commit in the entire
> repository looking for additional match candidates.  Instead of it
> taking a fraction of a second, it adds 10-15 seconds to the beginning of
> every kernel build.
>
> Fix it by adding an additional dummy match which is slightly further
> away from the most recent one, along with setting the max candidate
> count to 1 (not 2, apparently another bug).
>

cc += git list

Hm, I tried looking at the git describe source code, and while I can't
claim I understand it very well, I think the main problem is around
this part:

			if (!tags && !all && n->prio < 2) {
				unannotated_cnt++;
			} else if (match_cnt < max_candidates) {
				struct possible_tag *t = &all_matches[match_cnt++];
				t->name = n;
				t->depth = seen_commits - 1;
				t->flag_within = 1u << match_cnt;
				t->found_order = match_cnt;
				c->object.flags |= t->flag_within;
				if (n->prio == 2)
					annotated_cnt++;
			}
			else {
				gave_up_on = c;
				break;
			}

So in the case where one doesn't pass any --match, we get something like

git describe --debug 5f78aec0d7e9
describe 5f78aec0d7e9
No exact match on refs or tags, searching to describe
 annotated        243 v4.19-rc5
 annotated        485 v4.19-rc4
 annotated        814 v4.19-rc3
 annotated       1124 v4.19-rc2
 annotated       1391 v4.19-rc1
 annotated      10546 v4.18
 annotated      10611 v4.18-rc8
 annotated      10819 v4.18-rc7
 annotated      11029 v4.18-rc6
 annotated      11299 v4.18-rc5
traversed 11400 commits
more than 10 tags found; listed 10 most recent
gave up search at 1e4b044d22517cae7047c99038abb444423243ca
v4.19-rc5-243-g5f78aec0d7e9

and that "gave up" commit is v4.18-rc4, the eleventh commit
encountered. That also explains why you have to add a "dummy" second
--match to make --candidates=1 have the expected behaviour.

Perhaps the logic should instead be that as soon as match_cnt hits
max_candidates (i.e. all the tags we're going to consider have actually
been visited), we break out. That is, the last "else" above should
instead be replaced by

  if (match_cnt == max_candidates) {
    ... /* ? , gave_up_on is now a misnomer */
    break;
  }

Then as a further DWIM aid, wherever the initialization logic is could
be updated so that, after expanding all the --match= wildcards, if the
number of tags is less than max_candidates, automatically lower
max_candidates to that number (which in the setlocalversion case will
always be 1 because we're not actually passing a wildcard).

Or, we could explicitly on the kernel side pass that --candidates=1, but
yes, I agree it looks like a bug that the loop requires encountering +1
tag to hit that break;.

Rasmus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] setlocalversion: Add workaround for "git describe" performance issue
  2024-10-31 10:37 ` [PATCH] setlocalversion: Add workaround for "git describe" performance issue Rasmus Villemoes
@ 2024-10-31 11:42   ` Jeff King
  2024-10-31 12:24     ` Jeff King
  2024-10-31 11:43   ` Masahiro Yamada
  1 sibling, 1 reply; 27+ messages in thread
From: Jeff King @ 2024-10-31 11:42 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Josh Poimboeuf, Masahiro Yamada, linux-kernel, linux-kbuild, git

On Thu, Oct 31, 2024 at 11:37:27AM +0100, Rasmus Villemoes wrote:

> and that "gave up" commit is v4.18-rc4, the eleventh commit
> encountered. That also explains why you have to add a "dummy" second
> --match to make --candidates=1 have the expected behaviour.
> 
> Perhaps the logic should instead be that as soon as match_cnt hits
> max_candidates (i.e. all the tags we're going to consider have actually
> been visited), we break out. That is, the last "else" above should
> instead be replaced by
> 
>   if (match_cnt == max_candidates) {
>     ... /* ? , gave_up_on is now a misnomer */
>     break;
>   }

Yes, I agree that is the right direction. Replacing the "else" entirely
feels a little weird, because it is part of the:

  if (!tags && !all && n->prio < 2)
	...
  else if (match_cnt < max_candidates)
	...
  else
	...

So we'd now run that check even if we triggered the first block. But I
don't think it should matter in practice. We only increment match_cnt in
the else-if here. So the "else" block could go away, and the check for
giving up could go inside the else-if.

It does seem like gave_up_on is now pointless, but I'm not sure I
understand all of the code here. I assumed that it was only used to
report "this is where we gave up", and to give you the extra bit of
information that there _were_ other candidates that we omitted (and not
just exactly max_candidates). Of course we don't show that without
--debug. So it seems silly to spend a bunch of extra CPU for that.

But the plot thickens.

What I was going to suggest is that if we wanted to retain that one bit
of information, what we could do instead is: independent of
max_candidates, see if we've found all of the possible names we expanded
from --match. Then max_candidates would work as it does now, but we'd
avoid fruitlessly searching when there are no more names to find.

Counting the number of expanded names is a little weird. We use them to
annotate the commits, but of course multiple names can point to a single
commit, and there's a priority override system. I think the final number
we can find is the number of entries in the "names" hash.

So I expected this to work:

diff --git a/builtin/describe.c b/builtin/describe.c
index 7330a77b38..70a11072de 100644
--- a/builtin/describe.c
+++ b/builtin/describe.c
@@ -380,6 +380,9 @@ static void describe_commit(struct object_id *oid, struct strbuf *dst)
 				c->object.flags |= t->flag_within;
 				if (n->prio == 2)
 					annotated_cnt++;
+
+				if (match_cnt == hashmap_get_size(&names))
+					break;
 			}
 			else {
 				gave_up_on = c;

but it's still slow! If we set "gave_up_on = c", then it gets fast. I'm
not sure why that is. Later we do:

        if (gave_up_on) {
                commit_list_insert_by_date(gave_up_on, &list);
                seen_commits--;
        }
        seen_commits += finish_depth_computation(&list, &all_matches[0]);

but I don't at all understand why adding gave_up_on lets that finish
sooner. So I'm worried we're missing something about how it is used.

One hack is to just, like the max_candidates case, let us look at one
_more_ commit before bailing. Like this:

diff --git a/builtin/describe.c b/builtin/describe.c
index 7330a77b38..177c8232f6 100644
--- a/builtin/describe.c
+++ b/builtin/describe.c
@@ -365,6 +365,11 @@ static void describe_commit(struct object_id *oid, struct strbuf *dst)
 		struct commit_list *parents = c->parents;
 		struct commit_name **slot;

+		if (match_cnt == hashmap_get_size(&names)) {
+			gave_up_on = c;
+			break;
+		}
+
 		seen_commits++;
 		slot = commit_names_peek(&commit_names, c);
 		n = slot ? *slot : NULL;

That works, but I have a feeling that figured out what the heck is going
on with gave_up_on might produce a more elegant solution.

> Then as a further DWIM aid, wherever the initialization logic is could
> be updated so that, after expanding all the --match= wildcards, if the
> number of tags is less than max_candidates, automatically lower
> max_candidates to that number (which in the setlocalversion case will
> always be 1 because we're not actually passing a wildcard).

Yeah, I had the same thought (though if we do a separate hashmap check
as above, it wouldn't be needed).

-Peff

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH] setlocalversion: Add workaround for "git describe" performance issue
  2024-10-31 10:37 ` [PATCH] setlocalversion: Add workaround for "git describe" performance issue Rasmus Villemoes
  2024-10-31 11:42   ` Jeff King
@ 2024-10-31 11:43   ` Masahiro Yamada
  1 sibling, 0 replies; 27+ messages in thread
From: Masahiro Yamada @ 2024-10-31 11:43 UTC (permalink / raw)
  To: Rasmus Villemoes; +Cc: Josh Poimboeuf, linux-kernel, linux-kbuild, git

On Thu, Oct 31, 2024 at 11:37 AM Rasmus Villemoes <ravi@prevas.dk> wrote:
>
> On Wed, Oct 30 2024, Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> > If HEAD isn't associated with an annotated tag, a bug (or feature?) in
> > "git describe --match" causes it to search every commit in the entire
> > repository looking for additional match candidates.  Instead of it
> > taking a fraction of a second, it adds 10-15 seconds to the beginning of
> > every kernel build.
> >
> > Fix it by adding an additional dummy match which is slightly further
> > away from the most recent one, along with setting the max candidate
> > count to 1 (not 2, apparently another bug).
> >
>
> cc += git list
>
> Hm, I tried looking at the git describe source code, and while I can't
> claim I understand it very well, I think the main problem is around
> this part:
>
>                         if (!tags && !all && n->prio < 2) {
>                                 unannotated_cnt++;
>                         } else if (match_cnt < max_candidates) {
>                                 struct possible_tag *t = &all_matches[match_cnt++];
>                                 t->name = n;
>                                 t->depth = seen_commits - 1;
>                                 t->flag_within = 1u << match_cnt;
>                                 t->found_order = match_cnt;
>                                 c->object.flags |= t->flag_within;
>                                 if (n->prio == 2)
>                                         annotated_cnt++;
>                         }
>                         else {
>                                 gave_up_on = c;
>                                 break;
>                         }
>
> So in the case where one doesn't pass any --match, we get something like
>
> git describe --debug 5f78aec0d7e9
> describe 5f78aec0d7e9
> No exact match on refs or tags, searching to describe
>  annotated        243 v4.19-rc5
>  annotated        485 v4.19-rc4
>  annotated        814 v4.19-rc3
>  annotated       1124 v4.19-rc2
>  annotated       1391 v4.19-rc1
>  annotated      10546 v4.18
>  annotated      10611 v4.18-rc8
>  annotated      10819 v4.18-rc7
>  annotated      11029 v4.18-rc6
>  annotated      11299 v4.18-rc5
> traversed 11400 commits
> more than 10 tags found; listed 10 most recent
> gave up search at 1e4b044d22517cae7047c99038abb444423243ca
> v4.19-rc5-243-g5f78aec0d7e9
>
> and that "gave up" commit is v4.18-rc4, the eleventh commit
> encountered. That also explains why you have to add a "dummy" second
> --match to make --candidates=1 have the expected behaviour.
>
> Perhaps the logic should instead be that as soon as match_cnt hits
> max_candidates (i.e. all the tags we're going to consider have actually
> been visited), we break out. That is, the last "else" above should
> instead be replaced by
>
>   if (match_cnt == max_candidates) {
>     ... /* ? , gave_up_on is now a misnomer */
>     break;
>   }
>
> Then as a further DWIM aid, wherever the initialization logic is could
> be updated so that, after expanding all the --match= wildcards, if the
> number of tags is less than max_candidates, automatically lower
> max_candidates to that number (which in the setlocalversion case will
> always be 1 because we're not actually passing a wildcard).
>
> Or, we could explicitly on the kernel side pass that --candidates=1, but
> yes, I agree it looks like a bug that the loop requires encountering +1
> tag to hit that break;.
>
> Rasmus
>

I still do not understand the logic either.

git traverses all the way back to
d8470b7c13e11c18cf14a7e3180f0b00e715e4f0.



$ git describe --match=v6.12-rc5 --debug c1e939a21eb1
describe c1e939a21eb1
No exact match on refs or tags, searching to describe
finished search at d8470b7c13e11c18cf14a7e3180f0b00e715e4f0
 annotated         44 v6.12-rc5
traversed 1310005 commits
v6.12-rc5-44-gc1e939a21eb1



Or, more simply,


$ git describe --match=v6.12-rc5 --debug e42b1a9a2557
describe e42b1a9a2557
No exact match on refs or tags, searching to describe
finished search at d8470b7c13e11c18cf14a7e3180f0b00e715e4f0
 annotated          5 v6.12-rc5
traversed 1309966 commits
v6.12-rc5-5-ge42b1a9a2557




e42b1a9a2557 merges v6.12-rc5 and 25f00a13dccf.

The latter is obviously, v6.12-rc2 + 4 commits.
v6.12-rc2 is an ancestor of v6.12-rc5.

I do not understand why git traverses 1.3 million commits.



--
Best Regards

Masahiro Yamada

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] setlocalversion: Add workaround for "git describe" performance issue
  2024-10-31 11:42   ` Jeff King
@ 2024-10-31 12:24     ` Jeff King
  2024-10-31 14:43       ` Jeff King
  2024-11-01 10:23       ` [PATCH] setlocalversion: Add workaround for "git describe" performance issue Rasmus Villemoes
  0 siblings, 2 replies; 27+ messages in thread
From: Jeff King @ 2024-10-31 12:24 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Josh Poimboeuf, Masahiro Yamada, linux-kernel, linux-kbuild, git

On Thu, Oct 31, 2024 at 07:42:10AM -0400, Jeff King wrote:

> That works, but I have a feeling that figured out what the heck is going
> on with gave_up_on might produce a more elegant solution.

OK, I think I might have made some sense of this.

In finish_depth_computation(), we traverse down "list" forever, passing
flags up to our parents, until we find a commit that is marked with the
same "within" flag as our candidate. And then if everything left has
that same "within" flag set, we can bail.

So I _think_ the point is to basically count up what we'd get from this
traversal:

  $tag..$commit

where "$tag" is the candidate tag we found, and "$commit" is what we're
trying to describe (so imagine "git describe --match=$tag $commit").

We can't just use the depth we found while traversing down to $tag,
because there might be side branches we need to count up, too. And we
don't start a new traversal, because we'd be repeating the bits we
already went over when finding $tag in the first place.

And we feed that "list" from the original traversal state. So if we
break out of the traversal early but don't set gave_up_on, then we have
nothing in that state that holds the "within" flag. So we just walk all
of the commits down to the root, because nobody is propagating the flag
to them.

We have to feed at least one commit with the "within" flag into the
traversal so that it can let us end things. But I don't think it really
matters if that commit is the one we found, or if it's a parent of one
that we happened to pass "within" bits down to.

So I think we can just set "gave_up_on" to the final element we found
(whether from max_candidates or from finding every possible name). I.e.,
what I showed earlier, or what you were proposing.

I was also a bit puzzled how this works when there are multiple tags.
We feed only one "best" candidate to finish_depth_computation(), but
gave_up_on does not necessarily have its flag set. But I think that case
the point is that _some_ commit in the list does, and we literally add
every commit to that list.

I'm actually a bit skeptical that any of this is faster than simply
starting over a new traversal of $tag..$commit to find the depth, since
we are considering each commit anew. And there's a bunch of accidentally
quadratic bits of finish_depth_computation(). But frankly I'm somewhat
afraid to touch any of this more than necessary.

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] setlocalversion: Add workaround for "git describe" performance issue
  2024-10-31 12:24     ` Jeff King
@ 2024-10-31 14:43       ` Jeff King
  2024-11-04 12:37         ` Benno Evers
  2024-11-01 10:23       ` [PATCH] setlocalversion: Add workaround for "git describe" performance issue Rasmus Villemoes
  1 sibling, 1 reply; 27+ messages in thread
From: Jeff King @ 2024-10-31 14:43 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Benno Evers, Josh Poimboeuf, Masahiro Yamada, linux-kernel,
	linux-kbuild, git

On Thu, Oct 31, 2024 at 08:24:56AM -0400, Jeff King wrote:

> We have to feed at least one commit with the "within" flag into the
> traversal so that it can let us end things. But I don't think it really
> matters if that commit is the one we found, or if it's a parent of one
> that we happened to pass "within" bits down to.
> 
> So I think we can just set "gave_up_on" to the final element we found
> (whether from max_candidates or from finding every possible name). I.e.,
> what I showed earlier, or what you were proposing.

Hmph. So I don't think this is quite true, but now I'm puzzled again.

It is accurate to say that we must make sure _some_ commit with the
those flag bits set remains in "list". And I don't think it matters if
it's the candidate we found, or its parent.

But there's other stuff happening in that loop, after we process that
max candidate (where we'd proposed to break) but before we hit the next
possible candidate. Stuff like adding onto the depth of the other
candidates. Josh's example doesn't show that because it only has one
candidate, but I could imagine a case where it does matter (though I
didn't construct one).

So I'd have thought that this:

diff --git a/builtin/describe.c b/builtin/describe.c
index 7330a77b38..b0f645c41d 100644
--- a/builtin/describe.c
+++ b/builtin/describe.c
@@ -366,6 +366,12 @@ static void describe_commit(struct object_id *oid, struct strbuf *dst)
 		struct commit_name **slot;

 		seen_commits++;
+
+		if (match_cnt == max_candidates) {
+			gave_up_on = c;
+			break;
+		}
+
 		slot = commit_names_peek(&commit_names, c);
 		n = slot ? *slot : NULL;
 		if (n) {
@@ -381,10 +387,6 @@ static void describe_commit(struct object_id *oid, struct strbuf *dst)
 				if (n->prio == 2)
 					annotated_cnt++;
 			}
-			else {
-				gave_up_on = c;
-				break;
-			}
 		}
 		for (cur_match = 0; cur_match < match_cnt; cur_match++) {
 			struct possible_tag *t = &all_matches[cur_match];

would do it, by just finishing out the loop iteration and bailing on the
next commit. After all, that commit _could_ be a candidate itself. But
it causes a test in t6120 to fail. We have a disjoint history like this:

                 B
                 o
                  \
    o-----o---o----x
          A

and we expect that "x" is described as "A-3" (because we are including
the disjoint B). But after the patch above and with --candidates=2
(since there are only two tags and part of our goal is to limit
candidates to the number of tags), we find "B-4". Which is worse (at
least by some metrics).

I think this comes from 30b1c7ad9d (describe: don't abort too early when
searching tags, 2020-02-26). And given the problem description there, I
can see how quitting early in a disjoint history will give you worse
answers. But the patch above is triggering a case that already _could_
trigger.

So it feels like 30b1c7ad9d is incomplete. Without any patches, if I
limit it to --candidates=2 but make A^ a tag, then it gets the same
wrong answer (for the exact same reason). And I don't see a way to make
it correct without losing the ability to break out of the traversal
early when we hit max_candidates (which is obviously a very important
optimization in general). But maybe I'm missing something.

I do think my patch above is not introducing a new problem that wasn't
already there. It's just that the toy repo, having so few tags, means
any logic to reduce max_candidates will trigger there.

+cc the author of 30b1c7ad9d for any wisdom

-Peff

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH] setlocalversion: Add workaround for "git describe" performance issue
  2024-10-31 12:24     ` Jeff King
  2024-10-31 14:43       ` Jeff King
@ 2024-11-01 10:23       ` Rasmus Villemoes
  2024-11-01 11:39         ` Jeff King
  1 sibling, 1 reply; 27+ messages in thread
From: Rasmus Villemoes @ 2024-11-01 10:23 UTC (permalink / raw)
  To: Jeff King
  Cc: Josh Poimboeuf, Masahiro Yamada, linux-kernel, linux-kbuild, git

On Thu, Oct 31 2024, Jeff King <peff@peff.net> wrote:

> On Thu, Oct 31, 2024 at 07:42:10AM -0400, Jeff King wrote:
>
>> That works, but I have a feeling that figured out what the heck is going
>> on with gave_up_on might produce a more elegant solution.
>
> OK, I think I might have made some sense of this.
>
> In finish_depth_computation(), we traverse down "list" forever, passing
> flags up to our parents, until we find a commit that is marked with the
> same "within" flag as our candidate. And then if everything left has
> that same "within" flag set, we can bail.
>
> So I _think_ the point is to basically count up what we'd get from this
> traversal:
>
>   $tag..$commit
>
> where "$tag" is the candidate tag we found, and "$commit" is what we're
> trying to describe (so imagine "git describe --match=$tag $commit").

Yeah, so this is really just what the setlocalversion script wants to
know. For a few diffent possible values of $tag (in most cases just 1),
we ask: Is $tag an annotated tag? Is it an ancestor of HEAD? And if so,
how many commits are in $tag..HEAD.

Perhaps we could on the kernel side replace the "git describe --match"
calls with a helper, something like this (needs a lot of polishing):

===
# Produce output similar to what "git describe --match=$tag 2>
# /dev/null" would.  It doesn't have to match exactly as the caller is
# only interested in whether $tag == HEAD, and if not, the number
# between the tag and the short sha1.
describe()
{
    # Is $tag an annotated tag? Could/should probably be written using
    # some plumbing instead of git describe, but with --exact-match,
    # we avoid the walk-to-the-start-of-history behaviour, so fine for
    # this demo.
    git describe --exact-match --match=$tag $tag >/dev/null 2>/dev/null || return 1

    # Can it be used to describe HEAD, i.e. is it an ancestor of HEAD?
    git merge-base --is-ancestor $tag HEAD || return 1

    # Find the number that "git describe" would append.
    count=$(git rev-list --count $tag..HEAD)
    if [ $count -eq 0 ] ; then
        echo "$tag"
    else
        echo "$tag-$count-$head"
    fi
}
===

But if we go this route, we should probably rework the logic
somewhat. There's no point getting the count ourselves, stuffing that
into a string, and then splitting that string with awk to %05d format
the count.

I also don't know if either the --is-ancestor or the rev-list count
could end up doing the same walk-all-commits we're trying to avoid.

Rasmus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] setlocalversion: Add workaround for "git describe" performance issue
  2024-11-01 10:23       ` [PATCH] setlocalversion: Add workaround for "git describe" performance issue Rasmus Villemoes
@ 2024-11-01 11:39         ` Jeff King
  0 siblings, 0 replies; 27+ messages in thread
From: Jeff King @ 2024-11-01 11:39 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Josh Poimboeuf, Masahiro Yamada, linux-kernel, linux-kbuild, git

On Fri, Nov 01, 2024 at 11:23:05AM +0100, Rasmus Villemoes wrote:

> Perhaps we could on the kernel side replace the "git describe --match"
> calls with a helper, something like this (needs a lot of polishing):

Yeah, if you are describing off of a single tag, it may just be easier
to query things about the tag directly. Though I do still think
git-describe should be faster here. I'm still pondering what to do about
the disjoint history tests, but otherwise have a polished series to
send.

> ===
> # Produce output similar to what "git describe --match=$tag 2>
> # /dev/null" would.  It doesn't have to match exactly as the caller is
> # only interested in whether $tag == HEAD, and if not, the number
> # between the tag and the short sha1.
> describe()
> {
>     # Is $tag an annotated tag? Could/should probably be written using
>     # some plumbing instead of git describe, but with --exact-match,
>     # we avoid the walk-to-the-start-of-history behaviour, so fine for
>     # this demo.
>     git describe --exact-match --match=$tag $tag >/dev/null 2>/dev/null || return 1

Probably "git cat-file -t $tag" is the simplest way to see if it points
to a tag.

>     # Can it be used to describe HEAD, i.e. is it an ancestor of HEAD?
>     git merge-base --is-ancestor $tag HEAD || return 1
> 
>     # Find the number that "git describe" would append.
>     count=$(git rev-list --count $tag..HEAD)
>     if [ $count -eq 0 ] ; then
>         echo "$tag"
>     else
>         echo "$tag-$count-$head"
>     fi

You can query both of these at once with:

  git rev-list --count --left-right $tag...HEAD

That will traverse down to the merge base and give you two counts. If
the first one is 0, then $tag is a direct ancestor. And the second one
is the count of what's in HEAD.

At first glance, it seems like you'd waste time counting the HEAD side
when the --is-ancestor check could have rejected the tag earlier. But in
practice I think the time will always be dominated by walking down to
the merge base in all commands.

> I also don't know if either the --is-ancestor or the rev-list count
> could end up doing the same walk-all-commits we're trying to avoid.

It shouldn't. In all of those cases we'll generally walk breadth-first
down to the merge base. They're also operations that can take advantage
of other optimizations that git-describe never learned about. E.g.,
generation numbers in the commit graph.

We can even do fast --count with reachability bitmaps, though I wouldn't
expect most dev repos to have bitmaps built. Also, it looks like
"--left-right --count" does not support bitmaps. IMHO that is a bug. ;)

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] setlocalversion: Add workaround for "git describe" performance issue
  2024-10-31 14:43       ` Jeff King
@ 2024-11-04 12:37         ` Benno Evers
  2024-11-06 19:22           ` [PATCH 0/4] perf improvements for git-describe with few tags Jeff King
  0 siblings, 1 reply; 27+ messages in thread
From: Benno Evers @ 2024-11-04 12:37 UTC (permalink / raw)
  To: Jeff King
  Cc: Rasmus Villemoes, Benno Evers, Josh Poimboeuf, Masahiro Yamada,
	linux-kernel, linux-kbuild, git

Hi,

I'm afraid I can't offer much wisdom, but a few thoughts:

In the testcase, the difference between A-3 and B-4 looks very
academic, but on a real repo the results are more obviously wrong. For
example, if I put the test setup on top of the current git repo:

    benno@bourbaki:~/src/git/tmp-test$ git describe HEAD
    A-3-ga53f69dfb5
    benno@bourbaki:~/src/git/tmp-test$ git describe --candidates=2 HEAD
    B-75205-ga53f69dfb5

When writing the patch I thought that it might be a good idea to
change the definition of `describe` to favor the tag with the shortest
first-parent distance to the described tag and print A-2 in the test
scenario, to me that seems the most intuitive description. But that's
a change in behavior, and it's not even clear that most people would
agree A-2 is better, so I discarded the idea.

Other than that, the only way I can see to implement the behavior
exactly as described would be add the same condition when breaking for
reaching the max number of candidates, ie. to stop adding new
candidates but to delay the break from the loop until all disjoint
paths are unified. No idea how much of a performance hit that would be
in practice, I guess it depends on average branch lengths.

Best regards,
Benno

Am Do., 31. Okt. 2024 um 15:43 Uhr schrieb Jeff King <peff@peff.net>:
>
> On Thu, Oct 31, 2024 at 08:24:56AM -0400, Jeff King wrote:
>
> > We have to feed at least one commit with the "within" flag into the
> > traversal so that it can let us end things. But I don't think it really
> > matters if that commit is the one we found, or if it's a parent of one
> > that we happened to pass "within" bits down to.
> >
> > So I think we can just set "gave_up_on" to the final element we found
> > (whether from max_candidates or from finding every possible name). I.e.,
> > what I showed earlier, or what you were proposing.
>
> Hmph. So I don't think this is quite true, but now I'm puzzled again.
>
> It is accurate to say that we must make sure _some_ commit with the
> those flag bits set remains in "list". And I don't think it matters if
> it's the candidate we found, or its parent.
>
> But there's other stuff happening in that loop, after we process that
> max candidate (where we'd proposed to break) but before we hit the next
> possible candidate. Stuff like adding onto the depth of the other
> candidates. Josh's example doesn't show that because it only has one
> candidate, but I could imagine a case where it does matter (though I
> didn't construct one).
>
> So I'd have thought that this:
>
> diff --git a/builtin/describe.c b/builtin/describe.c
> index 7330a77b38..b0f645c41d 100644
> --- a/builtin/describe.c
> +++ b/builtin/describe.c
> @@ -366,6 +366,12 @@ static void describe_commit(struct object_id *oid, struct strbuf *dst)
>                 struct commit_name **slot;
>
>                 seen_commits++;
> +
> +               if (match_cnt == max_candidates) {
> +                       gave_up_on = c;
> +                       break;
> +               }
> +
>                 slot = commit_names_peek(&commit_names, c);
>                 n = slot ? *slot : NULL;
>                 if (n) {
> @@ -381,10 +387,6 @@ static void describe_commit(struct object_id *oid, struct strbuf *dst)
>                                 if (n->prio == 2)
>                                         annotated_cnt++;
>                         }
> -                       else {
> -                               gave_up_on = c;
> -                               break;
> -                       }
>                 }
>                 for (cur_match = 0; cur_match < match_cnt; cur_match++) {
>                         struct possible_tag *t = &all_matches[cur_match];
>
> would do it, by just finishing out the loop iteration and bailing on the
> next commit. After all, that commit _could_ be a candidate itself. But
> it causes a test in t6120 to fail. We have a disjoint history like this:
>
>                  B
>                  o
>                   \
>     o-----o---o----x
>           A
>
> and we expect that "x" is described as "A-3" (because we are including
> the disjoint B). But after the patch above and with --candidates=2
> (since there are only two tags and part of our goal is to limit
> candidates to the number of tags), we find "B-4". Which is worse (at
> least by some metrics).
>
> I think this comes from 30b1c7ad9d (describe: don't abort too early when
> searching tags, 2020-02-26). And given the problem description there, I
> can see how quitting early in a disjoint history will give you worse
> answers. But the patch above is triggering a case that already _could_
> trigger.
>
> So it feels like 30b1c7ad9d is incomplete. Without any patches, if I
> limit it to --candidates=2 but make A^ a tag, then it gets the same
> wrong answer (for the exact same reason). And I don't see a way to make
> it correct without losing the ability to break out of the traversal
> early when we hit max_candidates (which is obviously a very important
> optimization in general). But maybe I'm missing something.
>
> I do think my patch above is not introducing a new problem that wasn't
> already there. It's just that the toy repo, having so few tags, means
> any logic to reduce max_candidates will trigger there.
>
> +cc the author of 30b1c7ad9d for any wisdom
>
> -Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 0/4] perf improvements for git-describe with few tags
  2024-11-04 12:37         ` Benno Evers
@ 2024-11-06 19:22           ` Jeff King
  2024-11-06 19:26             ` Jeff King
                               ` (5 more replies)
  0 siblings, 6 replies; 27+ messages in thread
From: Jeff King @ 2024-11-06 19:22 UTC (permalink / raw)
  To: git
  Cc: Benno Evers, Rasmus Villemoes, Benno Evers, Josh Poimboeuf,
	Masahiro Yamada

On Mon, Nov 04, 2024 at 01:37:27PM +0100, Benno Evers wrote:

> I'm afraid I can't offer much wisdom, but a few thoughts:

Thank you. Between this response and a bit of pondering over the last
few days, I think I have a firm understanding of the issue and possible
paths forward.

So here's the series I came up with, which starts by adjusting the tests
to be resilient to the later changes, but also to show the existing
failure mode.

And then the rest of the patches add the performance improvements we've
been discussing in the thread.

I'll drop the kernel lists from the cc since I think this has gotten
well off topic there.

  [1/4]: t6120: demonstrate weakness in disjoint-root handling
  [2/4]: t/perf: add tests for git-describe
  [3/4]: describe: stop digging for max_candidates+1
  [4/4]: describe: stop traversing when we run out of names

 builtin/describe.c       | 17 ++++++++++-------
 t/perf/p6100-describe.sh | 30 ++++++++++++++++++++++++++++++
 t/t6120-describe.sh      | 15 ++++++++++++---
 3 files changed, 52 insertions(+), 10 deletions(-)
 create mode 100755 t/perf/p6100-describe.sh

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/4] perf improvements for git-describe with few tags
  2024-11-06 19:22           ` [PATCH 0/4] perf improvements for git-describe with few tags Jeff King
@ 2024-11-06 19:26             ` Jeff King
  2024-11-06 21:16             ` [PATCH 1/4] t6120: demonstrate weakness in disjoint-root handling Jeff King
                               ` (4 subsequent siblings)
  5 siblings, 0 replies; 27+ messages in thread
From: Jeff King @ 2024-11-06 19:26 UTC (permalink / raw)
  To: git
  Cc: Benno Evers, Rasmus Villemoes, Benno Evers, Josh Poimboeuf,
	Masahiro Yamada

On Wed, Nov 06, 2024 at 02:22:37PM -0500, Jeff King wrote:

> So here's the series I came up with, which starts by adjusting the tests
> to be resilient to the later changes, but also to show the existing
> failure mode.
> 
> And then the rest of the patches add the performance improvements we've
> been discussing in the thread.

Oops, I just realized there's a silly error in my patch, but I need to
go offline before I can fix it. I'll send patches later today, but I
didn't want to leave anybody confused about the delay. :)

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/4] t6120: demonstrate weakness in disjoint-root handling
  2024-11-06 19:22           ` [PATCH 0/4] perf improvements for git-describe with few tags Jeff King
  2024-11-06 19:26             ` Jeff King
@ 2024-11-06 21:16             ` Jeff King
  2024-11-06 21:17             ` [PATCH 2/4] t/perf: add tests for git-describe Jeff King
                               ` (3 subsequent siblings)
  5 siblings, 0 replies; 27+ messages in thread
From: Jeff King @ 2024-11-06 21:16 UTC (permalink / raw)
  To: git
  Cc: Benno Evers, Rasmus Villemoes, Benno Evers, Josh Poimboeuf,
	Masahiro Yamada

Commit 30b1c7ad9d (describe: don't abort too early when searching tags,
2020-02-26) tried to fix a problem that happens when there are disjoint
histories: to accurately compare the counts for different tags, we need
to keep walking the history longer in order to find a common base.

But its fix misses a case: we may still bail early if we hit the
max_candidates limit, producing suboptimal output. You can see this in
action by adding "--candidates=2" to the tests; we'll stop traversing as
soon as we see the second tag and will produce the wrong answer. I hit
this in practice while trying to teach git-describe not to keep looking
for candidates after we've seen all tags in the repo (effectively adding
--candidates=2, since these toy repos have only two tags each).

This is probably fixable by continuing to walk after hitting the
max-candidates limit, all the way down to a common ancestor of all
candidates. But it's not clear in practice what the preformance
implications would be (it would depend on how long the branches that
hold the candidates are).

So I'm punting on that for now, but I'd like to adjust the tests to be
more resilient, and to document the findings. So this patch:

  1. Adds an extra tag at the bottom of history. This shouldn't change
     the output, but does mean we are more resilient to low values of
     --candidates (e.g., if we start reducing it to the total number of
     tags). This is arguably closer to the real world anyway, where
     you're not going to have just 2 tags, but an arbitrarily long
     history going back in time, possibly with multiple irrelevant tags
     in it (I called the new tag "H" here for "history").

  2. Run the same tests with --candidates=2, which shows that even with
     the current code they can fail if we end the traversal early. That
     leaves a trail for anybody interested in trying to improve the
     behavior.

Signed-off-by: Jeff King <peff@peff.net>
---
 t/t6120-describe.sh | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/t/t6120-describe.sh b/t/t6120-describe.sh
index 05ed2510d9..69689d2f36 100755
--- a/t/t6120-describe.sh
+++ b/t/t6120-describe.sh
@@ -19,13 +19,17 @@ TEST_PASSES_SANITIZE_LEAK=true
 
 check_describe () {
 	indir= &&
+	outcome=success &&
 	while test $# != 0
 	do
 		case "$1" in
 		-C)
 			indir="$2"
 			shift
 			;;
+		--expect-failure)
+			outcome=failure
+			;;
 		*)
 			break
 			;;
@@ -36,7 +40,7 @@ check_describe () {
 	expect="$1"
 	shift
 	describe_opts="$@"
-	test_expect_success "describe $describe_opts" '
+	test_expect_${outcome} "describe $describe_opts" '
 		git ${indir:+ -C "$indir"} describe $describe_opts >raw &&
 		sed -e "s/-g[0-9a-f]*\$/-gHASH/" <raw >actual &&
 		echo "$expect" >expect &&
@@ -617,7 +621,7 @@ test_expect_success 'name-rev --annotate-stdin works with commitGraph' '
 
 #               B
 #               o
-#                \
+#  H             \
 #  o-----o---o----x
 #        A
 #
@@ -627,6 +631,7 @@ test_expect_success 'setup: describe commits with disjoint bases' '
 		cd disjoint1 &&
 
 		echo o >> file && git add file && git commit -m o &&
+		git tag H -a -m H &&
 		echo A >> file && git add file && git commit -m A &&
 		git tag A -a -m A &&
 		echo o >> file && git add file && git commit -m o &&
@@ -639,8 +644,9 @@ test_expect_success 'setup: describe commits with disjoint bases' '
 '
 
 check_describe -C disjoint1 "A-3-gHASH" HEAD
+check_describe -C disjoint1 --expect-failure "A-3-gHASH" --candidates=2 HEAD
 
-#           B
+#       H   B
 #   o---o---o------------.
 #                         \
 #                  o---o---x
@@ -658,13 +664,15 @@ test_expect_success 'setup: describe commits with disjoint bases 2' '
 		git checkout --orphan branch &&
 		echo o >> file2 && git add file2 && GIT_COMMITTER_DATE="2020-01-01 15:00" git commit -m o &&
 		echo o >> file2 && git add file2 && GIT_COMMITTER_DATE="2020-01-01 15:01" git commit -m o &&
+		git tag H -a -m H &&
 		echo B >> file2 && git add file2 && GIT_COMMITTER_DATE="2020-01-01 15:02" git commit -m B &&
 		git tag B -a -m B &&
 		git merge --no-ff --allow-unrelated-histories main -m x
 	)
 '
 
 check_describe -C disjoint2 "B-3-gHASH" HEAD
+check_describe -C disjoint2 --expect-failure "B-3-gHASH" --candidates=2 HEAD
 
 test_expect_success 'setup misleading taggerdates' '
 	GIT_COMMITTER_DATE="2006-12-12 12:31" git tag -a -m "another tag" newer-tag-older-commit unique-file~1
-- 
2.47.0.441.g1a09955689


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 2/4] t/perf: add tests for git-describe
  2024-11-06 19:22           ` [PATCH 0/4] perf improvements for git-describe with few tags Jeff King
  2024-11-06 19:26             ` Jeff King
  2024-11-06 21:16             ` [PATCH 1/4] t6120: demonstrate weakness in disjoint-root handling Jeff King
@ 2024-11-06 21:17             ` Jeff King
  2024-11-06 21:17             ` [PATCH 3/4] describe: stop digging for max_candidates+1 Jeff King
                               ` (2 subsequent siblings)
  5 siblings, 0 replies; 27+ messages in thread
From: Jeff King @ 2024-11-06 21:17 UTC (permalink / raw)
  To: git
  Cc: Benno Evers, Rasmus Villemoes, Benno Evers, Josh Poimboeuf,
	Masahiro Yamada

We don't have a perf script for git-describe, despite it often being
accused of slowness. Let's add a few simple tests to start with.

Rather than use the existing tags from our test repo, we'll make our own
so that we have a known quantity and position. We'll add a "new" tag
near the tip of HEAD, and an "old" one that is at the very bottom. And
then our tests are:

  1. Describing HEAD naively requires walking all the way down to the
     old tag as we collect candidates. This gives us a baseline for what
     "slow" looks like.

  2. Doing the same with --candidates=1 can potentially be fast, because
     we can quie after finding "new". But we don't, and it's also slow.

  3. Likewise we should be able to quit when there are no more tags to
     find. This can happen naturally if a repo has few tags, but also if
     you restrict the set of tags with --match.

Here are the results running against linux.git. Note that I have a
commit-graph built for the repo, so "slow" here is ~700ms. Without a
commit graph it's more like 9s!

  Test                                           HEAD
  --------------------------------------------------------------
  6100.2: describe HEAD                          0.70(0.66+0.04)
  6100.3: describe HEAD with one max candidate   0.70(0.66+0.04)
  6100.4: describe HEAD with one tag             0.70(0.64+0.06)

Signed-off-by: Jeff King <peff@peff.net>
---
 t/perf/p6100-describe.sh | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)
 create mode 100755 t/perf/p6100-describe.sh

diff --git a/t/perf/p6100-describe.sh b/t/perf/p6100-describe.sh
new file mode 100755
index 0000000000..069f91ce49
--- /dev/null
+++ b/t/perf/p6100-describe.sh
@@ -0,0 +1,30 @@
+#!/bin/sh
+
+test_description='performance of git-describe'
+. ./perf-lib.sh
+
+test_perf_default_repo
+
+# clear out old tags and give us a known state
+test_expect_success 'set up tags' '
+	git for-each-ref --format="delete %(refname)" refs/tags >to-delete &&
+	git update-ref --stdin <to-delete &&
+	new=$(git rev-list -1000 HEAD | tail -n 1) &&
+	git tag -m new new $new &&
+	old=$(git rev-list       HEAD | tail -n 1) &&
+	git tag -m old old $old
+'
+
+test_perf 'describe HEAD' '
+	git describe HEAD
+'
+
+test_perf 'describe HEAD with one max candidate' '
+	git describe --candidates=1 HEAD
+'
+
+test_perf 'describe HEAD with one tag' '
+	git describe --match=new HEAD
+'
+
+test_done
-- 
2.47.0.441.g1a09955689


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 3/4] describe: stop digging for max_candidates+1
  2024-11-06 19:22           ` [PATCH 0/4] perf improvements for git-describe with few tags Jeff King
                               ` (2 preceding siblings ...)
  2024-11-06 21:17             ` [PATCH 2/4] t/perf: add tests for git-describe Jeff King
@ 2024-11-06 21:17             ` Jeff King
  2024-11-06 21:17             ` [PATCH 4/4] describe: stop traversing when we run out of names Jeff King
  2024-11-26  5:05             ` [PATCH 0/4] perf improvements for git-describe with few tags Junio C Hamano
  5 siblings, 0 replies; 27+ messages in thread
From: Jeff King @ 2024-11-06 21:17 UTC (permalink / raw)
  To: git
  Cc: Benno Evers, Rasmus Villemoes, Benno Evers, Josh Poimboeuf,
	Masahiro Yamada

By default, describe considers only 10 candidate matches, and stops
traversing when we have enough. This makes things much faster in a large
repository, where collecting all candidates requires walking all the way
down to the root (or at least to the oldest tag). This goes all the way
back to 8713ab3079 (Improve git-describe performance by reducing
revision listing., 2007-01-13).

However, we don't stop immediately when we have enough candidates. We
keep traversing and only bail when we find one more candidate that we're
ignoring. Usually this is not too expensive, if the tags are sprinkled
evenly throughout history. But if you are unlucky, you might hit the max
candidate quickly, and then have a huge swath of history before finding
the next one.

Our p6100 test has exactly this unlucky case: with a max of "1", we find
a recent tag quickly and then have to go all the way to the root to find
the old tag that will be discarded.

A more interesting real-world case is:

  git describe --candidates=1 --match=v6.12-rc4 HEAD

in the linux.git repo. There we restrict the set of tags to a single
one, so there is no older candidate to find at all! But despite
--candidates=1, we keep traversing to the root only to find nothing.

So why do we keep traversing after hitting thet max? There are two
reasons I can see:

  1. In theory the extra information that there was another candidate
     could be useful, and we record it in the gave_up_on variable. But
     we only show this information with --debug.

  2. After finding the candidate, there's more processing we do in our
     loop. The most important of this is propagating the "within" flags
     to our parent commits, and putting them in the commit_list we'll
     use for finish_depth_computation().

     That function continues the traversal until we've counted all
     commits reachable from the starting point but not reachable from
     our best candidate tag (so essentially counting "$tag..$start", but
     avoiding re-walking over the bits we've seen).  If we break
     immediately without putting those commits into the list, our depth
     computation will be wrong (in the worst case we'll count all the
     way down to the root, not realizing those commits are included in
     our tag).

But we don't need to find a new candidate for (2). As soon as we finish
the loop iteration where we hit max_candidates, we can then quit on the
next iteration. This should produce the same output as the original code
(which could, after all, find a candidate on the very next commit
anyway) but ends the traversal with less pointless digging.

We still have to set "gave_up_on"; we've popped it off the list and it
has to go back. An alternative would be to re-order the loop so that it
never gets popped, but it's perhaps still useful to show in the --debug
output, so we need to know it anyway. We do have to adjust the --debug
output since it's now just a commit where we stopped traversing, and not
the max+1th candidate.

p6100 shows the speedup using linux.git:

  Test                                           HEAD^             HEAD
  ---------------------------------------------------------------------------------------
  6100.2: describe HEAD                          0.70(0.63+0.06)   0.71(0.66+0.04) +1.4%
  6100.3: describe HEAD with one max candidate   0.70(0.64+0.05)   0.01(0.00+0.00) -98.6%
  6100.4: describe HEAD with one tag             0.70(0.67+0.03)   0.70(0.63+0.06) +0.0%

Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
Helped-by: Rasmus Villemoes <ravi@prevas.dk>
Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/describe.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/builtin/describe.c b/builtin/describe.c
index 7330a77b38..69f2d942be 100644
--- a/builtin/describe.c
+++ b/builtin/describe.c
@@ -366,6 +366,12 @@ static void describe_commit(struct object_id *oid, struct strbuf *dst)
 		struct commit_name **slot;

 		seen_commits++;
+
+		if (match_cnt == max_candidates) {
+			gave_up_on = c;
+			break;
+		}
+
 		slot = commit_names_peek(&commit_names, c);
 		n = slot ? *slot : NULL;
 		if (n) {
@@ -381,10 +387,6 @@ static void describe_commit(struct object_id *oid, struct strbuf *dst)
 				if (n->prio == 2)
 					annotated_cnt++;
 			}
-			else {
-				gave_up_on = c;
-				break;
-			}
 		}
 		for (cur_match = 0; cur_match < match_cnt; cur_match++) {
 			struct possible_tag *t = &all_matches[cur_match];
@@ -470,9 +472,8 @@ static void describe_commit(struct object_id *oid, struct strbuf *dst)
 		fprintf(stderr, _("traversed %lu commits\n"), seen_commits);
 		if (gave_up_on) {
 			fprintf(stderr,
-				_("more than %i tags found; listed %i most recent\n"
-				"gave up search at %s\n"),
-				max_candidates, max_candidates,
+				_("found %i tags; gave up search at %s\n"),
+				max_candidates,
 				oid_to_hex(&gave_up_on->object.oid));
 		}
 	}
-- 
2.47.0.441.g1a09955689

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 4/4] describe: stop traversing when we run out of names
  2024-11-06 19:22           ` [PATCH 0/4] perf improvements for git-describe with few tags Jeff King
                               ` (3 preceding siblings ...)
  2024-11-06 21:17             ` [PATCH 3/4] describe: stop digging for max_candidates+1 Jeff King
@ 2024-11-06 21:17             ` Jeff King
  2024-12-04 23:15               ` [PATCH] fixup! " Josh Steadmon
  2024-11-26  5:05             ` [PATCH 0/4] perf improvements for git-describe with few tags Junio C Hamano
  5 siblings, 1 reply; 27+ messages in thread
From: Jeff King @ 2024-11-06 21:17 UTC (permalink / raw)
  To: git
  Cc: Benno Evers, Rasmus Villemoes, Benno Evers, Josh Poimboeuf,
	Masahiro Yamada

When trying to describe a commit, we'll traverse from the commit,
collecting candidate tags that point to its ancestors. But once we've
seen all of the tags in the repo, there's no point in traversing
further. There's nothing left to find!

For a default "git describe", this isn't usually a big problem. In a
large repo you'll probably have multiple tags, so we'll eventually find
10 candidates (the default for max_candidates) and stop there. And in a
small repo, it's quick to traverse to the root.

But you can imagine a large repo with few tags. Or, as we saw in a real
world case, explicitly limiting the set of matches like this (on
linux.git):

  git describe --match=v6.12-rc4 HEAD

which goes all the way to the root before realizing that no, there are
no other tags under consideration besides the one we fed via --match.
If we add in "--candidates=1" there, it's much faster (at least as of
the previous commit).

But we should be able to speed this up without the user asking for it.
After expanding all matching tags, we know the total number of names. We
could just stop the traversal there, but as hinted at above we already
have a mechanism for doing that: the max_candidate limit. So we can just
reduce that limit to match the number of possible candidates.

Our p6100 test shows this off:

  Test                                           HEAD^             HEAD
  ---------------------------------------------------------------------------------------
  6100.2: describe HEAD                          0.71(0.65+0.06)   0.72(0.68+0.04) +1.4%
  6100.3: describe HEAD with one max candidate   0.01(0.00+0.00)   0.01(0.00+0.00) +0.0%
  6100.4: describe HEAD with one tag             0.72(0.66+0.05)   0.01(0.00+0.00) -98.6%

Now we are fast automatically, just as if --candidates=1 were supplied
by the user.

Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
Helped-by: Rasmus Villemoes <ravi@prevas.dk>
Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/describe.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/builtin/describe.c b/builtin/describe.c
index 69f2d942be..8ec3be87df 100644
--- a/builtin/describe.c
+++ b/builtin/describe.c
@@ -667,6 +667,8 @@ int cmd_describe(int argc,
 			     NULL);
 	if (!hashmap_get_size(&names) && !always)
 		die(_("No names found, cannot describe anything."));
+	if (hashmap_get_size(&names) < max_candidates)
+		max_candidates = hashmap_get_size(&names);

 	if (argc == 0) {
 		if (broken) {
-- 
2.47.0.441.g1a09955689

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/4] perf improvements for git-describe with few tags
  2024-11-06 19:22           ` [PATCH 0/4] perf improvements for git-describe with few tags Jeff King
                               ` (4 preceding siblings ...)
  2024-11-06 21:17             ` [PATCH 4/4] describe: stop traversing when we run out of names Jeff King
@ 2024-11-26  5:05             ` Junio C Hamano
  2024-12-04 23:04               ` Josh Steadmon
  5 siblings, 1 reply; 27+ messages in thread
From: Junio C Hamano @ 2024-11-26  5:05 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Benno Evers, Rasmus Villemoes, Benno Evers, Josh Poimboeuf,
	Masahiro Yamada

Jeff King <peff@peff.net> writes:

> So here's the series I came up with, which starts by adjusting the tests
> to be resilient to the later changes, but also to show the existing
> failure mode.
>
> And then the rest of the patches add the performance improvements we've
> been discussing in the thread.

So, this did not get any comments, but I had a time to read it over,
and did not find anything suspicious in there.

Let me mark it for 'next', this time for real, in the What's cooking
draft I have.

Thanks.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/4] perf improvements for git-describe with few tags
  2024-11-26  5:05             ` [PATCH 0/4] perf improvements for git-describe with few tags Junio C Hamano
@ 2024-12-04 23:04               ` Josh Steadmon
  2024-12-04 23:20                 ` Jeff King
  0 siblings, 1 reply; 27+ messages in thread
From: Josh Steadmon @ 2024-12-04 23:04 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, git, Benno Evers, Rasmus Villemoes, Benno Evers,
	Josh Poimboeuf, Masahiro Yamada

On 2024.11.26 14:05, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
> 
> > So here's the series I came up with, which starts by adjusting the tests
> > to be resilient to the later changes, but also to show the existing
> > failure mode.
> >
> > And then the rest of the patches add the performance improvements we've
> > been discussing in the thread.
> 
> So, this did not get any comments, but I had a time to read it over,
> and did not find anything suspicious in there.
> 
> Let me mark it for 'next', this time for real, in the What's cooking
> draft I have.
> 
> Thanks.
> 

This breaks the case of `git describe --always $SOME_HASH` (we hit the
die at builtin/describe.c:340) when there are no tags in the repo. I can
send a test case and a small fix shortly.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH] fixup! describe: stop traversing when we run out of names
  2024-11-06 21:17             ` [PATCH 4/4] describe: stop traversing when we run out of names Jeff King
@ 2024-12-04 23:15               ` Josh Steadmon
  2024-12-04 23:27                 ` Jeff King
  0 siblings, 1 reply; 27+ messages in thread
From: Josh Steadmon @ 2024-12-04 23:15 UTC (permalink / raw)
  To: git; +Cc: peff, gitster, benno.martin.evers, benno, ravi, jpoimboe,
	masahiroy

Don't exit when we run out of names if we also set --always

Signed-off-by: Josh Steadmon <steadmon@google.com>
---
 builtin/describe.c  |  2 +-
 t/t6120-describe.sh | 14 ++++++++++++++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/builtin/describe.c b/builtin/describe.c
index 8ec3be87df..065c1bde6e 100644
--- a/builtin/describe.c
+++ b/builtin/describe.c
@@ -336,7 +336,7 @@ static void describe_commit(struct object_id *oid, struct strbuf *dst)
 		return;
 	}
 
-	if (!max_candidates)
+	if (!max_candidates && !always)
 		die(_("no tag exactly matches '%s'"), oid_to_hex(&cmit->object.oid));
 	if (debug)
 		fprintf(stderr, _("No exact match on refs or tags, searching to describe\n"));
diff --git a/t/t6120-describe.sh b/t/t6120-describe.sh
index 5633b11d01..9aebf09d3d 100755
--- a/t/t6120-describe.sh
+++ b/t/t6120-describe.sh
@@ -715,4 +715,18 @@ test_expect_success 'describe --broken --dirty with a file with changed stat' '
 	)
 '
 
+test_expect_success '--always with no refs falls back to commit hash' '
+	git init always-no-refs &&
+	(
+		cd always-no-refs &&
+		test_commit --no-tag A &&
+		test_commit --no-tag B &&
+		test_commit --no-tag C &&
+		git describe --abbrev=12 --always HEAD^ >actual &&
+		echo 13 >expected_size &&
+		test_file_size actual >actual_size &&
+		test_cmp expected_size actual_size
+	)
+'
+
 test_done

base-commit: a4f8a869558d59677e8d9798666a23391f0b4ca8
-- 
2.47.0.338.g60cca15819-goog


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/4] perf improvements for git-describe with few tags
  2024-12-04 23:04               ` Josh Steadmon
@ 2024-12-04 23:20                 ` Jeff King
  0 siblings, 0 replies; 27+ messages in thread
From: Jeff King @ 2024-12-04 23:20 UTC (permalink / raw)
  To: Josh Steadmon
  Cc: Junio C Hamano, git, Benno Evers, Rasmus Villemoes, Benno Evers,
	Josh Poimboeuf, Masahiro Yamada

On Wed, Dec 04, 2024 at 03:04:59PM -0800, Josh Steadmon wrote:

> This breaks the case of `git describe --always $SOME_HASH` (we hit the
> die at builtin/describe.c:340) when there are no tags in the repo. I can
> send a test case and a small fix shortly.

Yeah, this is easy to reproduce. I think it was always broken with:

  git describe --candidates=0 --always ...

since there is a line that skips the whole algorithm and bails early if
there are no candidates allowed. But that should surely not kick in if
"always" is set. I.e., I'd expect the fix to be something like:

diff --git a/builtin/describe.c b/builtin/describe.c
index d6c77a714f..d4c869e3d5 100644
--- a/builtin/describe.c
+++ b/builtin/describe.c
@@ -332,7 +332,7 @@ static void describe_commit(struct object_id *oid, struct strbuf *dst)
 		return;
 	}
 
-	if (!max_candidates)
+	if (!max_candidates && !always)
 		die(_("no tag exactly matches '%s'"), oid_to_hex(&cmit->object.oid));
 	if (debug)
 		fprintf(stderr, _("No exact match on refs or tags, searching to describe\n"));

But I'll wait and see what your proposed fix looks like.

-Peff

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH] fixup! describe: stop traversing when we run out of names
  2024-12-04 23:15               ` [PATCH] fixup! " Josh Steadmon
@ 2024-12-04 23:27                 ` Jeff King
  2024-12-04 23:54                   ` Jeff King
  2024-12-05 20:14                   ` [PATCH] describe: drop early return for max_candidates == 0 Jeff King
  0 siblings, 2 replies; 27+ messages in thread
From: Jeff King @ 2024-12-04 23:27 UTC (permalink / raw)
  To: Josh Steadmon
  Cc: git, gitster, benno.martin.evers, benno, ravi, jpoimboe,
	masahiroy

On Wed, Dec 04, 2024 at 03:15:42PM -0800, Josh Steadmon wrote:

> diff --git a/builtin/describe.c b/builtin/describe.c
> index 8ec3be87df..065c1bde6e 100644
> --- a/builtin/describe.c
> +++ b/builtin/describe.c
> @@ -336,7 +336,7 @@ static void describe_commit(struct object_id *oid, struct strbuf *dst)
>  		return;
>  	}
>  
> -	if (!max_candidates)
> +	if (!max_candidates && !always)
>  		die(_("no tag exactly matches '%s'"), oid_to_hex(&cmit->object.oid));
>  	if (debug)
>  		fprintf(stderr, _("No exact match on refs or tags, searching to describe\n"));

Yep, this is the same spot I found. I think it's the right place to make
the fix.

> Subject: Re: [PATCH] fixup! describe: stop traversing when we run out of names

This commit is already in 'next', so it's too late to squash in a change
(though I'd have done this separately anyway, as it's already an issue
for a manual --candidates=0 setting, as unlikely as that is).

Can you re-send with a full commit message?

> diff --git a/t/t6120-describe.sh b/t/t6120-describe.sh
> index 5633b11d01..9aebf09d3d 100755
> --- a/t/t6120-describe.sh
> +++ b/t/t6120-describe.sh
> @@ -715,4 +715,18 @@ test_expect_success 'describe --broken --dirty with a file with changed stat' '
>  	)
>  '
>  
> +test_expect_success '--always with no refs falls back to commit hash' '
> +	git init always-no-refs &&
> +	(
> +		cd always-no-refs &&
> +		test_commit --no-tag A &&
> +		test_commit --no-tag B &&
> +		test_commit --no-tag C &&
> +		git describe --abbrev=12 --always HEAD^ >actual &&
> +		echo 13 >expected_size &&
> +		test_file_size actual >actual_size &&
> +		test_cmp expected_size actual_size
> +	)
> +'

I'm not sure if I'm missing anything subtle, but this seems more
complicated than necessary to show the bug. I think just the exit code
of:

  git describe --match=does-not-exist --always HEAD

is sufficient, even in a repo with tags. If you really want to check
stdout, then probably comparing against:

  git rev-list -1 --abbrev-commit --abbrev=13 HEAD >expect &&
  test_cmp expect actual

is a little more obvious than the size check.

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] fixup! describe: stop traversing when we run out of names
  2024-12-04 23:27                 ` Jeff King
@ 2024-12-04 23:54                   ` Jeff King
  2024-12-05 20:14                   ` [PATCH] describe: drop early return for max_candidates == 0 Jeff King
  1 sibling, 0 replies; 27+ messages in thread
From: Jeff King @ 2024-12-04 23:54 UTC (permalink / raw)
  To: Josh Steadmon
  Cc: git, gitster, benno.martin.evers, benno, ravi, jpoimboe,
	masahiroy

On Wed, Dec 04, 2024 at 06:27:50PM -0500, Jeff King wrote:

> > -	if (!max_candidates)
> > +	if (!max_candidates && !always)
> >  		die(_("no tag exactly matches '%s'"), oid_to_hex(&cmit->object.oid));
> >  	if (debug)
> >  		fprintf(stderr, _("No exact match on refs or tags, searching to describe\n"));
> 
> Yep, this is the same spot I found. I think it's the right place to make
> the fix.
> 
> > Subject: Re: [PATCH] fixup! describe: stop traversing when we run out of names
> 
> This commit is already in 'next', so it's too late to squash in a change
> (though I'd have done this separately anyway, as it's already an issue
> for a manual --candidates=0 setting, as unlikely as that is).
> 
> Can you re-send with a full commit message?

In case it helps with writing a commit message:

I wondered why this line was there at all. It comes from 2c33f75754
(Teach git-describe --exact-match to avoid expensive tag searches,
2008-02-24). The --exact-match option is implemented by setting
max-candidates to 0. So:

  git describe --exact-match --always

has always been broken, but probably nobody ever cared. My series
reduces the max_candidates setting automatically when there is nothing
to find, which means you are more likely to hit the bug.

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH] describe: drop early return for max_candidates == 0
  2024-12-04 23:27                 ` Jeff King
  2024-12-04 23:54                   ` Jeff King
@ 2024-12-05 20:14                   ` Jeff King
  2024-12-05 22:28                     ` Josh Steadmon
  2024-12-06  3:01                     ` Junio C Hamano
  1 sibling, 2 replies; 27+ messages in thread
From: Jeff King @ 2024-12-05 20:14 UTC (permalink / raw)
  To: Josh Steadmon
  Cc: git, gitster, benno.martin.evers, benno, ravi, jpoimboe,
	masahiroy

On Wed, Dec 04, 2024 at 06:27:50PM -0500, Jeff King wrote:

> > Subject: Re: [PATCH] fixup! describe: stop traversing when we run out of names
> 
> This commit is already in 'next', so it's too late to squash in a change
> (though I'd have done this separately anyway, as it's already an issue
> for a manual --candidates=0 setting, as unlikely as that is).
> 
> Can you re-send with a full commit message?

Actually, after thinking on this a bit more, I think the solution below
is a bit more elegant. This can go on top of jk/describe-perf.

-- >8 --
From: Josh Steadmon <steadmon@google.com>
Subject: [PATCH] describe: drop early return for max_candidates == 0

Before we even start the describe algorithm, we check to see if
max_candidates is 0 and bail immediately if we did not find an exact
match. This comes from 2c33f75754 (Teach git-describe --exact-match to
avoid expensive tag searches, 2008-02-24), since the the --exact-match
option just sets max_candidates to 0.

But this interacts badly with the --always option (ironically added only
a week later in da2478dbb0 (describe --always: fall back to showing an
abbreviated object name, 2008-03-02)). With --always, we'd still want to
show the hash rather than calling die().

So this:

  git describe --exact-match --always

and likewise:

  git describe --exact-match --candidates=0

has always been broken. But nobody ever noticed, because using those
options together is rather unlikely. However, this bug became a lot
easier to trigger with a30154187a (describe: stop traversing when we run
out of names, 2024-10-31). There we reduce max_candidates automatically
based on the number of tags available. So in a repo with no tags (or one
where --match finds no tags), max_candidates becomes 0, and --always
will never show anything.

So that early check for --exact-match's zero candidates needs to be
adjusted. One way to do so is to have it check the "always" flag and
handle it specially, producing the expected hash. But that would require
duplicating the output code for "always".

Instead, we'd prefer to just fall through to the normal algorithm, which
should notice that we are not allowed to find any more candidates, stop
looking, and then hit the regular "always" output code. Back when
2c33f75754 was first done, this was a bad idea, since the normal
algorithm kept looking for the max+1 candidate. But since 082a4d90af
(describe: stop digging for max_candidates+1, 2024-10-31), we don't do
that anymore, and the algorithm is essentially a noop.

So we can drop the early return entirely, and the fact that
max_candidates is 0 will let us quit early without any special casing.

Reported-by: Josh Steadmon <steadmon@google.com>
Signed-off-by: Jeff King <peff@peff.net>
---
There is some small bit of setup work in the algorithm, like creating
the reverse index of commits->names in a slab. I don't think that's
worth worrying about. But if we did care, we could lazily initialize
that index, which would also benefit any other cases that bail before
needing it.

 builtin/describe.c  | 2 --
 t/t6120-describe.sh | 6 ++++++
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/builtin/describe.c b/builtin/describe.c
index 8ec3be87df..21e1c87c65 100644
--- a/builtin/describe.c
+++ b/builtin/describe.c
@@ -336,8 +336,6 @@ static void describe_commit(struct object_id *oid, struct strbuf *dst)
 		return;
 	}

-	if (!max_candidates)
-		die(_("no tag exactly matches '%s'"), oid_to_hex(&cmit->object.oid));
 	if (debug)
 		fprintf(stderr, _("No exact match on refs or tags, searching to describe\n"));

diff --git a/t/t6120-describe.sh b/t/t6120-describe.sh
index 5633b11d01..009d84ff17 100755
--- a/t/t6120-describe.sh
+++ b/t/t6120-describe.sh
@@ -715,4 +715,10 @@ test_expect_success 'describe --broken --dirty with a file with changed stat' '
 	)
 '

+test_expect_success '--always with no refs falls back to commit hash' '
+	git rev-parse HEAD >expect &&
+	git describe --no-abbrev --always --match=no-such-tag >actual &&
+	test_cmp expect actual
+'
+
 test_done
-- 
2.47.1.734.g721956425b

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH] describe: drop early return for max_candidates == 0
  2024-12-05 20:14                   ` [PATCH] describe: drop early return for max_candidates == 0 Jeff King
@ 2024-12-05 22:28                     ` Josh Steadmon
  2024-12-05 23:21                       ` Jeff King
  2024-12-06  3:01                     ` Junio C Hamano
  1 sibling, 1 reply; 27+ messages in thread
From: Josh Steadmon @ 2024-12-05 22:28 UTC (permalink / raw)
  To: Jeff King
  Cc: git, gitster, benno.martin.evers, benno, ravi, jpoimboe,
	masahiroy

On 2024.12.05 15:14, Jeff King wrote:
> On Wed, Dec 04, 2024 at 06:27:50PM -0500, Jeff King wrote:
> 
> > > Subject: Re: [PATCH] fixup! describe: stop traversing when we run out of names
> > 
> > This commit is already in 'next', so it's too late to squash in a change
> > (though I'd have done this separately anyway, as it's already an issue
> > for a manual --candidates=0 setting, as unlikely as that is).
> > 
> > Can you re-send with a full commit message?
> 
> Actually, after thinking on this a bit more, I think the solution below
> is a bit more elegant. This can go on top of jk/describe-perf.
> 

Thanks, and sorry for not replying earlier, I got distracted by a
different $DAYJOB breakage:
https://lore.kernel.org/git/b41ae080654a3603af09801018df539f656cf9d8.1733430345.git.steadmon@google.com/

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] describe: drop early return for max_candidates == 0
  2024-12-05 22:28                     ` Josh Steadmon
@ 2024-12-05 23:21                       ` Jeff King
  0 siblings, 0 replies; 27+ messages in thread
From: Jeff King @ 2024-12-05 23:21 UTC (permalink / raw)
  To: Josh Steadmon
  Cc: git, gitster, benno.martin.evers, benno, ravi, jpoimboe,
	masahiroy

On Thu, Dec 05, 2024 at 02:28:45PM -0800, Josh Steadmon wrote:

> > Actually, after thinking on this a bit more, I think the solution below
> > is a bit more elegant. This can go on top of jk/describe-perf.
> > 
> 
> Thanks, and sorry for not replying earlier, I got distracted by a
> different $DAYJOB breakage:

No problem. Thanks for finding it!

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] describe: drop early return for max_candidates == 0
  2024-12-05 20:14                   ` [PATCH] describe: drop early return for max_candidates == 0 Jeff King
  2024-12-05 22:28                     ` Josh Steadmon
@ 2024-12-06  3:01                     ` Junio C Hamano
  2024-12-06  3:28                       ` Jeff King
  1 sibling, 1 reply; 27+ messages in thread
From: Junio C Hamano @ 2024-12-06  3:01 UTC (permalink / raw)
  To: Jeff King
  Cc: Josh Steadmon, git, benno.martin.evers, benno, ravi, jpoimboe,
	masahiroy

Jeff King <peff@peff.net> writes:

> Actually, after thinking on this a bit more, I think the solution below
> is a bit more elegant. This can go on top of jk/describe-perf.
>
> -- >8 --
> From: Josh Steadmon <steadmon@google.com>
> Subject: [PATCH] describe: drop early return for max_candidates == 0

OK, so the patch authorship still blames Josh.  But there is no
sign-off because ... the approach to the fix is so different that
blaming Josh for this implementation is no longer appropriate?

> Reported-by: Josh Steadmon <steadmon@google.com>
> Signed-off-by: Jeff King <peff@peff.net>

If so, please take the authorship yourself.

> Before we even start the describe algorithm, we check to see if
> max_candidates is 0 and bail immediately if we did not find an exact
> match. This comes from 2c33f75754 (Teach git-describe --exact-match to
> avoid expensive tag searches, 2008-02-24), since the the --exact-match
> option just sets max_candidates to 0.
> ...
> So this:
>
>   git describe --exact-match --always
>
> and likewise:
>
>   git describe --exact-match --candidates=0

Did the latter mean to say "git decribe --candidates=0 --always", as
the earlier paragraph explains that "--exact" affects the number of
candidates?

Without this patch, all three give the same result:

    $ git describe --exact-match --always HEAD
    fatal: no tag exactly matches '59d18088fe8ace4bf18ade27eeef3664fb6b0878'
    $ git describe --exact-match --candidates=0 HEAD
    fatal: no tag exactly matches '59d18088fe8ace4bf18ade27eeef3664fb6b0878'
    $ git describe --candidates=0 --always HEAD
    fatal: no tag exactly matches '59d18088fe8ace4bf18ade27eeef3664fb6b0878'

With this patch, we instead get this:

    $ ./git describe --exact-match --always HEAD
    59d18088fe
    $ ./git describe --exact-match --candidates=0 HEAD
    fatal: No tags can describe '59d18088fe8ace4bf18ade27eeef3664fb6b0878'.
    Try --always, or create some tags.
    $ ./git describe --candidates=0 --always HEAD
    59d18088fe

> But this interacts badly with the --always option (ironically added only
> a week later in da2478dbb0 (describe --always: fall back to showing an
> abbreviated object name, 2008-03-02)). With --always, we'd still want to
> show the hash rather than calling die().
> ...

> has always been broken.

Hmph, I am not sure if the behaviour is _broken_ in the first place.

The user asks with "--exact-match" that a result based on some ref
that does not directly point at the object being described is *not*
acceptable, so with or without "--always", it looks to me that it is
doing the right thing, if there is no exact match (or there is no
tag and the user only allowed tag to describe the objects) and the
result is "no tag exactly matches object X" failure.

Or is our position that these mutually incompatible options, namely
"--exact-match" and "--always", follow the "last one wins" rule?
The implementation does not seem to say so.

If the earlier request is to describe only as exact tag (and fail if
there is no appropriate tag), but then we changed our mind and ask
to fall back to an abbreviation, this one is understandable:

    $ ./git describe --exact-match --always HEAD
    59d18088fe

But then this is not.  The last thing we explicitly told the command
is that we accept only the exact match, but this one does not fail,
which seems like a bug:

    $ ./git describe --always --exact-match HEAD
    59d18088fe

So I am not sure if the "buggy" behaviour is buggy to begin with.
The way these two are documented can be read both ways,

    --exact-match::
            Only output exact matches (a tag directly references the
            supplied commit).  This is a synonym for --candidates=0.

    --always::
            Show uniquely abbreviated commit object as fallback.

but my reading is when you give both and when the object given is
not directly pointed at by any existing tag, "ONLY output exact
matches" cannot be satisified.  And "show as fallback" cannot be
satisfied within the constraint that the command is allowed "only
output exact matches".

I think the complexity from the point of view of a calling script to
deal with either behaviour is probably similar.  If you ask for
"--exact-match" and there is no exact match, you can ask rev-parse
to give a shortened one, and you know which one you are giving the
user.  We can change what "--exact-match + --candidate=0" combination
means to let it fallback, but then you'd need to check the output to
see if you got an exact tag or a fallback, and for that you'd
probably need to ask "show-ref refs/tags/$output" or something.

So I am not sure if it is worth changing the behaviour this late in
the game?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] describe: drop early return for max_candidates == 0
  2024-12-06  3:01                     ` Junio C Hamano
@ 2024-12-06  3:28                       ` Jeff King
  2024-12-06  4:19                         ` Junio C Hamano
  0 siblings, 1 reply; 27+ messages in thread
From: Jeff King @ 2024-12-06  3:28 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Josh Steadmon, git, benno.martin.evers, benno, ravi, jpoimboe,
	masahiroy

On Fri, Dec 06, 2024 at 12:01:41PM +0900, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Actually, after thinking on this a bit more, I think the solution below
> > is a bit more elegant. This can go on top of jk/describe-perf.
> >
> > -- >8 --
> > From: Josh Steadmon <steadmon@google.com>
> > Subject: [PATCH] describe: drop early return for max_candidates == 0
> 
> OK, so the patch authorship still blames Josh.  But there is no
> sign-off because ... the approach to the fix is so different that
> blaming Josh for this implementation is no longer appropriate?

Oh, whoops. I had originally intended to just write the commit message
and leave him with credit, but then I ended up changing approach.
Leaving him as the author was an oversight.

> > Before we even start the describe algorithm, we check to see if
> > max_candidates is 0 and bail immediately if we did not find an exact
> > match. This comes from 2c33f75754 (Teach git-describe --exact-match to
> > avoid expensive tag searches, 2008-02-24), since the the --exact-match
> > option just sets max_candidates to 0.
> > ...
> > So this:
> >
> >   git describe --exact-match --always
> >
> > and likewise:
> >
> >   git describe --exact-match --candidates=0
> 
> Did the latter mean to say "git decribe --candidates=0 --always", as
> the earlier paragraph explains that "--exact" affects the number of
> candidates?

Urgh, yes, you are correct. I can resend, but I think we should resolve
the questions below.

> Without this patch, all three give the same result:
> 
>     $ git describe --exact-match --always HEAD
>     fatal: no tag exactly matches '59d18088fe8ace4bf18ade27eeef3664fb6b0878'
>     $ git describe --exact-match --candidates=0 HEAD
>     fatal: no tag exactly matches '59d18088fe8ace4bf18ade27eeef3664fb6b0878'
>     $ git describe --candidates=0 --always HEAD
>     fatal: no tag exactly matches '59d18088fe8ace4bf18ade27eeef3664fb6b0878'
> 
> With this patch, we instead get this:
> 
>     $ ./git describe --exact-match --always HEAD
>     59d18088fe
>     $ ./git describe --exact-match --candidates=0 HEAD
>     fatal: No tags can describe '59d18088fe8ace4bf18ade27eeef3664fb6b0878'.
>     Try --always, or create some tags.
>     $ ./git describe --candidates=0 --always HEAD
>     59d18088fe

Right, exactly.

> > But this interacts badly with the --always option (ironically added only
> > a week later in da2478dbb0 (describe --always: fall back to showing an
> > abbreviated object name, 2008-03-02)). With --always, we'd still want to
> > show the hash rather than calling die().
> > ...
> > has always been broken.
> 
> Hmph, I am not sure if the behaviour is _broken_ in the first place.
> 
> The user asks with "--exact-match" that a result based on some ref
> that does not directly point at the object being described is *not*
> acceptable, so with or without "--always", it looks to me that it is
> doing the right thing, if there is no exact match (or there is no
> tag and the user only allowed tag to describe the objects) and the
> result is "no tag exactly matches object X" failure.
> 
> Or is our position that these mutually incompatible options, namely
> "--exact-match" and "--always", follow the "last one wins" rule?
> The implementation does not seem to say so.

I think you could argue that they are mutually incompatible. But we have
never marked them as such, nor do we do any sort of last-one-wins.
They are two distinct options, but in --exact-match mode, --always is
simply ignored. Which I think is a bug.

> So I am not sure if the "buggy" behaviour is buggy to begin with.
> The way these two are documented can be read both ways,
> 
>     --exact-match::
>             Only output exact matches (a tag directly references the
>             supplied commit).  This is a synonym for --candidates=0.
> 
>     --always::
>             Show uniquely abbreviated commit object as fallback.
> 
> but my reading is when you give both and when the object given is
> not directly pointed at by any existing tag, "ONLY output exact
> matches" cannot be satisified.  And "show as fallback" cannot be
> satisfied within the constraint that the command is allowed "only
> output exact matches".

I think there can be a more expansive reading of --exact-match (or of
--candidates=0), which is: only output a tag that matches exactly. And
then --always is orthogonal to that. There is no other output to
produce, so we show the commit object itself.

Now that more expansive reading is not what --exact-match says above.
But it is the only thing that makes sense to me for --candidates=0, and
the two are synonyms.

> I think the complexity from the point of view of a calling script to
> deal with either behaviour is probably similar.  If you ask for
> "--exact-match" and there is no exact match, you can ask rev-parse
> to give a shortened one, and you know which one you are giving the
> user.  We can change what "--exact-match + --candidate=0" combination
> means to let it fallback, but then you'd need to check the output to
> see if you got an exact tag or a fallback, and for that you'd
> probably need to ask "show-ref refs/tags/$output" or something.
> 
> So I am not sure if it is worth changing the behaviour this late in
> the game?

I think there are really two questions here:

  1. Is the current behavior of "describe --exact-match --always" a bug?
     I'll grant that probably nobody cares deeply, which is why the
     interaction has not been noticed for all of these years. I think
     the semantics this patch gives are the only ones that make sense,
     but I also don't care that deeply. But...

  2. What should we do about the new regression caused by limiting the
     candidate list? I.e., my earlier patches in this topic make us
     behave as if --candidates=<n> was given when there are fewer tags
     in the repo. That runs afoul of the special-casing of
     --candidates=0 when there are no tags in the repo (or you limit the
     candidates to zero via --match).

     If we are not going to address (1) as this patch does, then we need
     another solution. We can internally hold an extra variable to
     distinguish the number of user-requested candidates from the number
     of actual candidates available. But I think my solution to (1) here
     harmonizes the --candidates=0 case with --always, and then the
     auto-adjusted max-candidates case just falls out naturally.

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] describe: drop early return for max_candidates == 0
  2024-12-06  3:28                       ` Jeff King
@ 2024-12-06  4:19                         ` Junio C Hamano
  2024-12-06  5:42                           ` [PATCH] describe: split "found all tags" and max_candidates logic Jeff King
  0 siblings, 1 reply; 27+ messages in thread
From: Junio C Hamano @ 2024-12-06  4:19 UTC (permalink / raw)
  To: Jeff King
  Cc: Josh Steadmon, git, benno.martin.evers, benno, ravi, jpoimboe,
	masahiroy

Jeff King <peff@peff.net> writes:

>> Without this patch, all three give the same result:
>> 
>>     $ git describe --exact-match --always HEAD
>>     fatal: no tag exactly matches '59d18088fe8ace4bf18ade27eeef3664fb6b0878'
>>     $ git describe --exact-match --candidates=0 HEAD
>>     fatal: no tag exactly matches '59d18088fe8ace4bf18ade27eeef3664fb6b0878'
>>     $ git describe --candidates=0 --always HEAD
>>     fatal: no tag exactly matches '59d18088fe8ace4bf18ade27eeef3664fb6b0878'
>> 
>> With this patch, we instead get this:
>> 
>>     $ ./git describe --exact-match --always HEAD
>>     59d18088fe
>>     $ ./git describe --exact-match --candidates=0 HEAD
>>     fatal: No tags can describe '59d18088fe8ace4bf18ade27eeef3664fb6b0878'.
>>     Try --always, or create some tags.
>>     $ ./git describe --candidates=0 --always HEAD
>>     59d18088fe
> ...
> I think there are really two questions here:
>
>   1. Is the current behavior of "describe --exact-match --always" a bug?
>      I'll grant that probably nobody cares deeply, which is why the
>      interaction has not been noticed for all of these years. I think
>      the semantics this patch gives are the only ones that make sense,
>      but I also don't care that deeply. But...
>
>   2. What should we do about the new regression caused by limiting the
>      candidate list?

Ahh, OK, these --candidate=0 / --exact-match were for illustration
purposes only.  The real issue is that the user does not, with

  $ git describe --always HEAD

ask for exact matches only at all, but we internally pretend as if
they did, which is not nice.

My gut reaction is that it is wrong not to give the abbreviated
object name in this case, but the price to do so shouldn't be to
change the behaviour when --exact-match was requested the the user.

Loosening the interaction between the two options, when both were
given explicitly, may be an improvement, but I think that should be
treated as a separate topic, with its merit justified independently,
since the command has been behaving this way from fairly early
version, possibly the one that had both of the options for the first
time.

  $ rungit v2.20.0 describe --exact-match HEAD
  fatal: No names found, cannot describe anything.
  $ rungit v2.20.0 describe --exact-match --always HEAD
  fatal: no tag exactly matches '13a3dd7fe014658da465e9621ec3651f5473041e'
  $ rungit v2.20.0 describe --exact-match --candidate=0 HEAD
  fatal: No names found, cannot describe anything.

Thanks.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH] describe: split "found all tags" and max_candidates logic
  2024-12-06  4:19                         ` Junio C Hamano
@ 2024-12-06  5:42                           ` Jeff King
  0 siblings, 0 replies; 27+ messages in thread
From: Jeff King @ 2024-12-06  5:42 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Josh Steadmon, git, benno.martin.evers, benno, ravi, jpoimboe,
	masahiroy

On Fri, Dec 06, 2024 at 01:19:36PM +0900, Junio C Hamano wrote:

> My gut reaction is that it is wrong not to give the abbreviated
> object name in this case, but the price to do so shouldn't be to
> change the behaviour when --exact-match was requested the the user.

I think this is where we differ. You call it a price, but to me it is a
bonus. ;)

> Loosening the interaction between the two options, when both were
> given explicitly, may be an improvement, but I think that should be
> treated as a separate topic, with its merit justified independently,
> since the command has been behaving this way from fairly early
> version, possibly the one that had both of the options for the first
> time.
> 
>   $ rungit v2.20.0 describe --exact-match HEAD
>   fatal: No names found, cannot describe anything.
>   $ rungit v2.20.0 describe --exact-match --always HEAD
>   fatal: no tag exactly matches '13a3dd7fe014658da465e9621ec3651f5473041e'
>   $ rungit v2.20.0 describe --exact-match --candidate=0 HEAD
>   fatal: No names found, cannot describe anything.

OK. That's certainly the more conservative approach, since it reduces
the change of behavior from previous versions.

Here's a replacement patch (again, on top of jk/describe-perf).

-- >8 --
Subject: [PATCH] describe: split "found all tags" and max_candidates logic

Commit a30154187a (describe: stop traversing when we run out of names,
2024-10-31) taught git-describe to automatically reduce the
max_candidates setting to match the total number of possible names. This
lets us break out of the traversal rather than fruitlessly searching for
more candidates when there are no more to be found.

However, setting max_candidates to 0 (e.g., if the repo has no tags)
overlaps with the --exact-match option, which explicitly uses the same
value. And this causes a regression with --always, which is ignored in
exact-match mode. We used to get this in a repo with no tags:

  $ git describe --always HEAD
  b2f0a7f

and now we get:

  $ git describe --always HEAD
  fatal: no tag exactly matches 'b2f0a7f47f5f2aebe1e7fceff19a57de20a78c06'

The reason is that we bail early in describe_commit() when
max_candidates is set to 0. This logic goes all the way back to
2c33f75754 (Teach git-describe --exact-match to avoid expensive tag
searches, 2008-02-24).

We should obviously fix this regression, but there are two paths,
depending on what you think:

  $ git describe --always --exact-match

and

  $ git describe --always --candidates=0

should do. Since the "--always" option was added, it has always been
ignored in --exact-match (or --candidates=0) mode. I.e., we treat
--exact-match as a true exact match of a tag, and never fall back to
using --always, even if it was requested.

If we think that's a bug (or at least a misfeature), then the right
solution is to fix it by removing the early bail-out from 2c33f75754,
letting the noop algorithm run and then hitting the --always fallback
output. And then our regression naturally goes away, because it follows
the same path.

If we think that the current "--exact-match --always" behavior is the
right thing, then we have to differentiate the case where we
automatically reduced max_candidates to 0 from the case where the user
asked for it specifically. That's possible to do with a flag, but we can
also just reimplement the logic from a30154187a to explicitly break out
of the traversal when we run out of candidates (rather than relying on
the existing max_candidates check).

My gut feeling is along the lines of option 1 (it's a bug, and people
would be happy for "--exact-match --always" to give the fallback rather
than ignoring "--always"). But the documentation can be interpreted in
the other direction, and we've certainly lived with the existing
behavior for many years. So it's possible that changing it now is the
wrong thing.

So this patch fixes the regression by taking the second option,
retaining the "--exact-match" behavior as-is. There are two new tests.
The first shows that the regression is fixed (we don't even need a new
repo without tags; a restrictive --match is enough to create the
situation that there are no candidate names).

The second test confirms that the "--exact-match --always" behavior
remains unchanged and continues to die when there is no tag pointing at
the specified commit. It's possible we may reconsider this in the
future, but this shows that the approach described above is implemented
faithfully.

We can also run the perf tests in p6100 to see that we've retained the
speedup that a30154187a was going for:

  Test                                           HEAD^             HEAD
  --------------------------------------------------------------------------------------
  6100.2: describe HEAD                          0.72(0.64+0.07)   0.72(0.66+0.06) +0.0%
  6100.3: describe HEAD with one max candidate   0.01(0.00+0.00)   0.01(0.00+0.00) +0.0%
  6100.4: describe HEAD with one tag             0.01(0.01+0.00)   0.01(0.01+0.00) +0.0%

Reported-by: Josh Steadmon <steadmon@google.com>
Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/describe.c  |  5 ++---
 t/t6120-describe.sh | 10 ++++++++++
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/builtin/describe.c b/builtin/describe.c
index 8ec3be87df..a6ef8af32a 100644
--- a/builtin/describe.c
+++ b/builtin/describe.c
@@ -367,7 +367,8 @@ static void describe_commit(struct object_id *oid, struct strbuf *dst)

 		seen_commits++;

-		if (match_cnt == max_candidates) {
+		if (match_cnt == max_candidates ||
+		    match_cnt == hashmap_get_size(&names)) {
 			gave_up_on = c;
 			break;
 		}
@@ -667,8 +668,6 @@ int cmd_describe(int argc,
 			     NULL);
 	if (!hashmap_get_size(&names) && !always)
 		die(_("No names found, cannot describe anything."));
-	if (hashmap_get_size(&names) < max_candidates)
-		max_candidates = hashmap_get_size(&names);

 	if (argc == 0) {
 		if (broken) {
diff --git a/t/t6120-describe.sh b/t/t6120-describe.sh
index 5633b11d01..3f6160d702 100755
--- a/t/t6120-describe.sh
+++ b/t/t6120-describe.sh
@@ -715,4 +715,14 @@ test_expect_success 'describe --broken --dirty with a file with changed stat' '
 	)
 '

+test_expect_success '--always with no refs falls back to commit hash' '
+	git rev-parse HEAD >expect &&
+	git describe --no-abbrev --always --match=no-such-tag >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success '--exact-match does not show --always fallback' '
+	test_must_fail git describe --exact-match --always
+'
+
 test_done
-- 
2.47.1.734.g721956425b

^ permalink raw reply related	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2024-12-06  5:42 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <309549cafdcfe50c4fceac3263220cc3d8b109b2.1730337435.git.jpoimboe@kernel.org>
2024-10-31 10:37 ` [PATCH] setlocalversion: Add workaround for "git describe" performance issue Rasmus Villemoes
2024-10-31 11:42   ` Jeff King
2024-10-31 12:24     ` Jeff King
2024-10-31 14:43       ` Jeff King
2024-11-04 12:37         ` Benno Evers
2024-11-06 19:22           ` [PATCH 0/4] perf improvements for git-describe with few tags Jeff King
2024-11-06 19:26             ` Jeff King
2024-11-06 21:16             ` [PATCH 1/4] t6120: demonstrate weakness in disjoint-root handling Jeff King
2024-11-06 21:17             ` [PATCH 2/4] t/perf: add tests for git-describe Jeff King
2024-11-06 21:17             ` [PATCH 3/4] describe: stop digging for max_candidates+1 Jeff King
2024-11-06 21:17             ` [PATCH 4/4] describe: stop traversing when we run out of names Jeff King
2024-12-04 23:15               ` [PATCH] fixup! " Josh Steadmon
2024-12-04 23:27                 ` Jeff King
2024-12-04 23:54                   ` Jeff King
2024-12-05 20:14                   ` [PATCH] describe: drop early return for max_candidates == 0 Jeff King
2024-12-05 22:28                     ` Josh Steadmon
2024-12-05 23:21                       ` Jeff King
2024-12-06  3:01                     ` Junio C Hamano
2024-12-06  3:28                       ` Jeff King
2024-12-06  4:19                         ` Junio C Hamano
2024-12-06  5:42                           ` [PATCH] describe: split "found all tags" and max_candidates logic Jeff King
2024-11-26  5:05             ` [PATCH 0/4] perf improvements for git-describe with few tags Junio C Hamano
2024-12-04 23:04               ` Josh Steadmon
2024-12-04 23:20                 ` Jeff King
2024-11-01 10:23       ` [PATCH] setlocalversion: Add workaround for "git describe" performance issue Rasmus Villemoes
2024-11-01 11:39         ` Jeff King
2024-10-31 11:43   ` Masahiro Yamada

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).