All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jakub Narebski <jnareb@gmail.com>
To: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, Junio C Hamano <gitster@pobox.com>,
	Derrick Stolee <dstolee@microsoft.com>
Subject: Re: [PATCH 1/1] commit: don't use generation numbers if not needed
Date: Sat, 06 Oct 2018 18:54:01 +0200	[thread overview]
Message-ID: <86woqvxbh2.fsf@gmail.com> (raw)
In-Reply-To: <efa3720fb40638e5d61c6130b55e3348d8e4339e.1535633886.git.gitgitgadget@gmail.com> (Derrick Stolee via GitGitGadget's message of "Thu, 30 Aug 2018 05:58:09 -0700 (PDT)")

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> In 3afc679b "commit: use generations in paint_down_to_common()",
> the queue in paint_down_to_common() was changed to use a priority
> order based on generation number before commit date. This served
> two purposes:
>
>  1. When generation numbers are present, the walk guarantees
>     correct topological relationships, regardless of clock skew in
>     commit dates.

Which means that we would walk no more commits that there are on the
_longest_ path...

>  2. It enables short-circuiting the walk when the min_generation
>     parameter is added in d7c1ec3e "commit: add short-circuit to
>     paint_down_to_common()". This short-circuit helps commands
>     like 'git branch --contains' from needing to walk to a merge
>     base when we know the result is false.
>
> The commit message for 3afc679b includes the following sentence:
>
>     This change does not affect the number of commits that are
>     walked during the execution of paint_down_to_common(), only
>     the order that those commits are inspected.
>
> This statement is incorrect. Because it changes the order in which
> the commits are inspected, it changes the order they are added to
> the queue, and hence can change the number of loops before the
> queue_has_nonstale() method returns true.

...which does not mean that it would walk no more commits than on
shortest path; thus it can give performance regression if it chooses to
walk longer path first, compared to date-based heuristic if it chooses
shorter one (but can walk few commits more that strictly necessary on
chosen path).

O.K., I can understand that.

>
> This change makes a concrete difference depending on the topology
> of the commit graph. For instance, computing the merge-base between
> consecutive versions of the Linux kernel has no effect for versions
> after v4.9, but 'git merge-base v4.8 v4.9' presents a performance
> regression:
>
>     v2.18.0: 0.122s
> v2.19.0-rc1: 0.547s
>        HEAD: 0.127s

Now I will wonder if the 0.005s difference between v2.18.0 and HEAD
version is within the noise (within the standard deviation), or
not... (just kidding).

>
> To determine that this was simply an ordering issue, I inserted
> a counter within the while loop of paint_down_to_common() and
> found that the loop runs 167,468 times in v2.18.0 and 635,579
> times in v2.19.0-rc1.

Nice analysis.

>
> The topology of this case can be described in a simplified way
> here:
>
>   v4.9
>    |  \
>    |   \
>   v4.8  \
>    | \   \
>    |  \   |
>   ...  A  B
>    |  /  /
>    | /  /
>    |/__/
>    C
>
> Here, the "..." means "a very long line of commits". By generation
> number, A and B have generation one more than C. However, A and B
> have commit date higher than most of the commits reachable from
> v4.8. When the walk reaches v4.8, we realize that it has PARENT1
> and PARENT2 flags, so everything it can reach is marked as STALE,
> including A. B has only the PARENT1 flag, so is not STALE.
>
> When paint_down_to_common() is run using
> compare_commits_by_commit_date, A and B are removed from the queue
> early and C is inserted into the queue. At this point, C and the
> rest of the queue entries are marked as STALE. The loop then
> terminates.
>
> When paint_down_to_common() is run using
> compare_commits_by_gen_then_commit_date, B is removed from the
> queue only after the many commits reachable from v4.8 are explored.
> This causes the loop to run longer. The reason for this regression
> is simple: the queue order is intended to not explore a commit
> until everything that _could_ reach that commit is explored. From
> the information gathered by the original ordering, we have no
> guarantee that there is not a commit D reachable from v4.8 that
> can also reach B.

So the problem is with shortcuts, i.e. merges of short-length branches
(like e.g. fixup topic branches) based on an older commits (like best
practices recommend: base off oldest applicable commit).  Due to the
bottom-up nature of generation numbers examining such shortcut is
postponed, compared to "heuristic" ordering by the creation date.

By problem I mean that generation-number based code gives worse
performance than commitdate-based code.

>                  We gained absolute correctness in exchange for
> a performance regression.

Or rather we gained better worst case performance in exchange for worse
average case performance (thus a performance regression).

> The performance regression is probably the worse option, since
> these incorrect results in paint_down_to_common() are rare. The
> topology required for the performance regression are less rare,
> but still require multiple merge commits where the parents differ
> greatly in generation number. In our example above, the commit A
> is as important as the commit B to demonstrate the problem, since
> otherwise the commit C will sit in the queue as non-stale just as
> long in both orders.

All right, that is good explanation of why the change.  Worst case
performance is rare, performance in case of "shortcuts" is more
important: they are more often (I guess -- are there any tests for
this?).

> The solution provided uses the min_generation parameter to decide
> if we should use generation numbers in our ordering. When
> min_generation is equal to zero, it means that the caller has no
> known cutoff for the walk, so we should rely on our commit-date
> heuristic as before; this is the case with merge_bases_many().
> When min_generation is non-zero, then the caller knows a valuable
> cutoff for the short-circuit mechanism; this is the case with
> remove_redundant() and in_merge_bases_many().

TLDR; use compare_commits_by_commit_date() if there is no min generation
cutoff, compare_commits_by_gen_then_commit_date() otherwise, right?

> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/commit.c b/commit.c
> index 1a6e632185..449c1f4920 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -874,6 +874,9 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n,
>  	int i;
>  	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
>  
> +	if (!min_generation)
> +		queue.compare = compare_commits_by_commit_date;
> +
>  	one->object.flags |= PARENT1;
>  	if (!n) {
>  		commit_list_append(one, &result);
> @@ -891,7 +894,7 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n,
>  		struct commit_list *parents;
>  		int flags;
>  
> -		if (commit->generation > last_gen)
> +		if (min_generation && commit->generation > last_gen)
>  			BUG("bad generation skip %8x > %8x at %s",
>  			    commit->generation, last_gen,
>  			    oid_to_hex(&commit->object.oid));

  reply	other threads:[~2018-10-06 16:54 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-08-30 12:58 [PATCH 0/1] v2.19.0-rc1 Performance Regression in 'git merge-base' Derrick Stolee via GitGitGadget
2018-08-30 12:58 ` [PATCH 1/1] commit: don't use generation numbers if not needed Derrick Stolee via GitGitGadget
2018-10-06 16:54   ` Jakub Narebski [this message]
2018-08-30 15:26 ` [PATCH 0/1] v2.19.0-rc1 Performance Regression in 'git merge-base' Junio C Hamano
2018-10-06 16:09 ` Jakub Narebski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=86woqvxbh2.fsf@gmail.com \
    --to=jnareb@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.