in_merge_bases() is too expensive for recent "pu" update

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* in_merge_bases() is too expensive for recent "pu" update
@ 2012-08-23 12:32 Nguyen Thai Ngoc Duy
  2012-08-23 14:20 ` Thomas Rast
  0 siblings, 1 reply; 16+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-08-23 12:32 UTC (permalink / raw)
  To: Git Mailing List

I just did a "git fetch". It took 19 seconds (measured with
gettimeofday) to finish in_merge_bases() in update_local_ref() in
fetch.c, just to print this line

 + a4f2db3...b95a282 pu         -> origin/pu  (forced update)

It's partly my fault because I'm on gcc -O0. But should it not take
that much time for one line? As a grumpy user, I'd be perfectly ok
with git saying "probably forced update, check it yourself" if the
spent time exceeds half a second. On the same token, if diffstat takes
too long for git-pull, then perhaps just stop and print "do it
yourself".
-- 
Duy

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: in_merge_bases() is too expensive for recent "pu" update
  2012-08-23 12:32 in_merge_bases() is too expensive for recent "pu" update Nguyen Thai Ngoc Duy
@ 2012-08-23 14:20 ` Thomas Rast
  2012-08-23 18:08   ` Junio C Hamano
                     ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Thomas Rast @ 2012-08-23 14:20 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: Git Mailing List

Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:

> I just did a "git fetch". It took 19 seconds (measured with
> gettimeofday) to finish in_merge_bases() in update_local_ref() in
> fetch.c, just to print this line
>
>  + a4f2db3...b95a282 pu         -> origin/pu  (forced update)
>
> It's partly my fault because I'm on gcc -O0. But should it not take
> that much time for one line? As a grumpy user, I'd be perfectly ok
> with git saying "probably forced update, check it yourself" if the
> spent time exceeds half a second. On the same token, if diffstat takes
> too long for git-pull, then perhaps just stop and print "do it
> yourself".

I missed out on the a4f2db3 update, so I can't test exactly this, but my
reflog tells the I had before b95a282 was d3dd556.  Computing all
merge-bases takes very long indeed:

  $ time git merge-base --all d3dd556 b95a282 | wc -l
  51

  real    0m4.877s
  user    0m4.829s
  sys     0m0.026s

And I think the reason is that the algorithm is just doing too much.
From a quick glance, it appears to first compute the result list rslt[]
of merge bases, then for each pair attempts to eliminate one of them by
determining that rslt[i] or rslt[j] is also a merge base of rslt[j:].
It is thus spending quadratic time in the number of merge-bases, times
the size of history (ok, I can't actually come up with an example that
does the latter).

At the very least it should be possible to change in_merge_bases() to
not do any of the post-filtering; perhaps like the patch below.  It
passes the test suite.  The whole "merge bases of A and a list of Bs"
thing is blowing my overheated mind, though, so I'm not able to convince
myself that it is correct in all cases.

Even if this turns out to be flawed, we should also identify uses of
in_merge_bases() where the real question was is_descendant_of() [I
somewhat suspect that's all of them], and then replace is_descendant_of
with a much cheaper lookup.  This can be as simple as propagating a mark
from the candidate until it either goes beyond all possible ancestors,
or hits one of them.

By the way, the internal slowness of git-merge-base also affects the
A...B syntax.  For example,

  git rev-list --left-right --count @{upstream}...

is used by the __git_ps1 code to determine the prompt display, which
just got very slow for me today.  Again, it should be easy to figure out
the boundary of the symmetric difference simply by propagating two
marks.  I do not think that the result of A...B actually depends on
figuring out exactly what the merge bases are, simply excluding *any*
candidate without pruning is enough.

diff --git i/commit.c w/commit.c
index 65a8485..70427ab 100644
--- i/commit.c
+++ w/commit.c
@@ -837,10 +837,13 @@ int in_merge_bases(struct commit *commit, struct commit **reference, int num)
 	struct commit_list *bases, *b;
 	int ret = 0;

-	if (num == 1)
-		bases = get_merge_bases(commit, *reference, 1);
-	else
+	if (num != 1)
 		die("not yet");
+
+	bases = merge_bases_many(commit, 1, reference);
+	clear_commit_marks(commit, all_flags);
+	clear_commit_marks(*reference, all_flags);
+	
 	for (b = bases; b; b = b->next) {
 		if (!hashcmp(commit->object.sha1, b->item->object.sha1)) {
 			ret = 1;

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: in_merge_bases() is too expensive for recent "pu" update
  2012-08-23 14:20 ` Thomas Rast
@ 2012-08-23 18:08   ` Junio C Hamano
  2012-08-23 20:42     ` Junio C Hamano
  2012-08-24 11:42   ` Nguyen Thai Ngoc Duy
  2012-08-28  1:50   ` Junio C Hamano
  2 siblings, 1 reply; 16+ messages in thread
From: Junio C Hamano @ 2012-08-23 18:08 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Nguyen Thai Ngoc Duy, Git Mailing List

Thomas Rast <trast@student.ethz.ch> writes:

> At the very least it should be possible to change in_merge_bases() to
> not do any of the post-filtering; perhaps like the patch below.

I do not recall the details but the post-filtering was added after
the initial naive version without it that was posted to the list did
not work correctly in corner cases.  I wouldn't doubt that N*(N-1)
loop can be optimized to something with log(N), but I offhand find
it hard to believe "to not do any" could be correct without seeing
more detailed analysis.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: in_merge_bases() is too expensive for recent "pu" update
  2012-08-23 18:08   ` Junio C Hamano
@ 2012-08-23 20:42     ` Junio C Hamano
  2012-08-23 21:03       ` Junio C Hamano
  2012-08-24  9:43       ` Thomas Rast
  0 siblings, 2 replies; 16+ messages in thread
From: Junio C Hamano @ 2012-08-23 20:42 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Nguyen Thai Ngoc Duy, Git Mailing List

Junio C Hamano <gitster@pobox.com> writes:

> Thomas Rast <trast@student.ethz.ch> writes:
>
>> At the very least it should be possible to change in_merge_bases() to
>> not do any of the post-filtering; perhaps like the patch below.
>
> I do not recall the details but the post-filtering was added after
> the initial naive version without it that was posted to the list did
> not work correctly in corner cases.  I wouldn't doubt that N*(N-1)
> loop can be optimized to something with log(N), but I offhand find
> it hard to believe "to not do any" could be correct without seeing
> more detailed analysis.

If "one on one side, many on the other side" in merge_bases_many()
confuses any of you in the readers, you can just ignore many-ness.
Getting merge bases for "one" against many "two"s using
merge_bases_many() is exactly the same as getting merge bases for
"one" against a (fictitious) single commit you can make by merging
all "two"s.

So we can think about the degenerate case of merge base between two
commits A and B in the following discussion.

A merge-base between A and B is defined to be a commit that is a
common ancestor of both A and B, and that is not an ancestor of any
other merge-base between A and B.  In the simplest case (illustrated
below), 1 is a merge base between A and B, but 2 is not (even though
it is an ancestor of both A and B, it is an ancestor of 1 which is a
merge base).

                  y---A
                 /
     ---2---o---1---x---B

So, the thinking goes, in order to find all merge bases, first we
can traverse from A and B, following "parent" link from children,
and painting found parents in two colors.

Start from A and B.  Follow from B to find 'x' and paint it in blue,
follow from A to find 'y' and paint it in amber.  Follow from 'x' to
'1', paint it in blue.  Follow from 'y' to '1', paint it in amber
but notice that it already is painted in blue.  Stop the traversal
there and we found a candidate '1' that could be a merge base.  We
know digging from '1' will not find more merge bases, so we should
stop there (I do not recall offhand if the current code does stop
there, though) [*1*].

There may be other paths that are not depicted in the above
illustration that reach '2' from A and B without passing '1'
through.

            o-------o
           /         \
          /       y---A
         /       /
     ---2---z---1---x---B
         \         /
          o-------o

In such a history, we may stop after finding '1' and '2' in the
first pass of "stop when a node is painted in both amber and blue".
This way, the first pass will find _all_ merge bases, but it also
may find common ancestors that are not merge bases.

So we need to notice that '1' and '2' have ancestry relation in
order to reject '2' as "common but not merge-base".  One way of
doing so is not to stop at '1' and keep digging (and eventually we
find that '2' is what we could reach from '1' that already is a
merge base), but then we will be susceptible to the same kind of
clock skew issue as the revision traverser.  Instead, merge-base
traverser chose to do this by running the same "stop traversing at
common" traverser between the candidates (in this case, '1' and
'2').

The objective of this second traversal is very different from the
first one, though.  We do not need _all_ the merge bases between '1'
and '2'; we do not even need merge bases.

The only thing we need to prove that '1' is a merge base (i.e. not
an ancestor of any other merge base candidates the first round of
traversal between A and B found) is to do the first round of the
traversal for '1' as "one" and all the other ('2' in this case) as
"two"s; if the first round of such traversal finishes without
painting '1' in both colors, that would mean '1' is not reachable
from any other candidates of merge base between A and B, so we have
proven that '1' is a merge base.

So I suspect that the postprocess phase could be made from N*(N-1)
to N (run merge_bases_many() once for each of the common ancestor
the first round found).  You might also be able to stop the
traversal used in the first phase (i.e. get_merge_bases()) a bit
earlier, if we are digging through '1' to paint 'z' (and eventually
'2') in STALE (amber+blue) color, as digging further only to paint
things in STALE color is not necessary with the postprocess.

[Footnote]

*1* Digging through '2' down may find that other candidate merge
bases we reach by traversing other paths that may not be depicted in
the above illustration, and there may be such paths to reach '1'
from A and B that does not pass '2' through.  That is a possible
alternative way to reject common ancestor that is not merge-base.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: in_merge_bases() is too expensive for recent "pu" update
  2012-08-23 20:42     ` Junio C Hamano
@ 2012-08-23 21:03       ` Junio C Hamano
  2012-08-24  9:32         ` Thomas Rast
  2012-08-24  9:43       ` Thomas Rast
  1 sibling, 1 reply; 16+ messages in thread
From: Junio C Hamano @ 2012-08-23 21:03 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Nguyen Thai Ngoc Duy, Git Mailing List

Junio C Hamano <gitster@pobox.com> writes:

> The objective of this second traversal is very different from the
> first one, though.  We do not need _all_ the merge bases between '1'
> and '2'; we do not even need merge bases.
>
> The only thing we need to prove that '1' is a merge base (i.e. not
> an ancestor of any other merge base candidates the first round of
> traversal between A and B found) is to do the first round of the
> traversal for '1' as "one" and all the other ('2' in this case) as
> "two"s; if the first round of such traversal finishes without
> painting '1' in both colors, that would mean '1' is not reachable
> from any other candidates of merge base between A and B, so we have
> proven that '1' is a merge base.
>
> So I suspect that the postprocess phase could be made from N*(N-1)
> to N (run merge_bases_many() once for each of the common ancestor
> the first round found).  You might also be able to stop the
> traversal used in the first phase (i.e. get_merge_bases()) a bit
> earlier, if we are digging through '1' to paint 'z' (and eventually
> '2') in STALE (amber+blue) color, as digging further only to paint
> things in STALE color is not necessary with the postprocess.

As a corollary, the "is pu@{0} a fast-forward of pu@{1}?" check does
not need merge base computation at all.  The only thing it needs to
do is to prove pu@{1} is reachable from pu@{0}, and that can be done
the same way in which '1' can be proved unreachable from '2' in the
post processing phase as described above, i.e. it needs only the
first phase of running merge_bases_many() without postprocessing.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: in_merge_bases() is too expensive for recent "pu" update
  2012-08-23 21:03       ` Junio C Hamano
@ 2012-08-24  9:32         ` Thomas Rast
  2012-08-24 15:50           ` Junio C Hamano
  0 siblings, 1 reply; 16+ messages in thread
From: Thomas Rast @ 2012-08-24  9:32 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Nguyen Thai Ngoc Duy, Git Mailing List

Junio C Hamano <gitster@pobox.com> writes:

> As a corollary, the "is pu@{0} a fast-forward of pu@{1}?" check does
> not need merge base computation at all.  The only thing it needs to
> do is to prove pu@{1} is reachable from pu@{0}, and that can be done
> the same way in which '1' can be proved unreachable from '2' in the
> post processing phase as described above, i.e. it needs only the
> first phase of running merge_bases_many() without postprocessing.

Well, yeah, you snipped this part from my original post :-)

} Even if this turns out to be flawed, we should also identify uses of
} in_merge_bases() where the real question was is_descendant_of() [I
} somewhat suspect that's all of them], and then replace is_descendant_of
} with a much cheaper lookup.  This can be as simple as propagating a mark
} from the candidate until it either goes beyond all possible ancestors,
} or hits one of them.

As far as the "real question" goes, I'm purely guessing here -- it is
entirely possible that a bunch of the in_merge_bases() checks really do
need the pruned set of merge bases.  But this particular check, and I
suspect a bunch of others, does not.

Then there's the next bit:

} By the way, the internal slowness of git-merge-base also affects the
} A...B syntax.  For example,
} 
}   git rev-list --left-right --count @{upstream}...
} 
} is used by the __git_ps1 code to determine the prompt display, which
} just got very slow for me today.  Again, it should be easy to figure out
} the boundary of the symmetric difference simply by propagating two
} marks.  I do not think that the result of A...B actually depends on
} figuring out exactly what the merge bases are, simply excluding *any*
} candidate without pruning is enough.

Apart from __git_ps1, this is also used by git-status, git-checkout and
'git branch -v' to show "your branch is N behind and M ahead".  Again
it's a bit of a hunch, but I think figuring out the symmetric difference
is a simple matter of propagating two marks and including only commits
that ended up having exactly one of them.  At least the counting case
should be easy, rev-list is slightly harder to fix.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: in_merge_bases() is too expensive for recent "pu" update
  2012-08-23 20:42     ` Junio C Hamano
  2012-08-23 21:03       ` Junio C Hamano
@ 2012-08-24  9:43       ` Thomas Rast
  2012-08-24 15:15         ` Jeff King
  2012-08-24 15:52         ` Junio C Hamano
  1 sibling, 2 replies; 16+ messages in thread
From: Thomas Rast @ 2012-08-24  9:43 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Nguyen Thai Ngoc Duy, Git Mailing List, Jeff King

Junio C Hamano <gitster@pobox.com> writes:

> Junio C Hamano <gitster@pobox.com> writes:
>
>> Thomas Rast <trast@student.ethz.ch> writes:
>>
>>> At the very least it should be possible to change in_merge_bases() to
>>> not do any of the post-filtering; perhaps like the patch below.
>>
>> I do not recall the details but the post-filtering was added after
>> the initial naive version without it that was posted to the list did
>> not work correctly in corner cases.  I wouldn't doubt that N*(N-1)
>> loop can be optimized to something with log(N), but I offhand find
>> it hard to believe "to not do any" could be correct without seeing
>> more detailed analysis.
>
> If "one on one side, many on the other side" in merge_bases_many()
> confuses any of you in the readers, you can just ignore many-ness.
> Getting merge bases for "one" against many "two"s using
> merge_bases_many() is exactly the same as getting merge bases for
> "one" against a (fictitious) single commit you can make by merging
> all "two"s.
>
> So we can think about the degenerate case of merge base between two
> commits A and B in the following discussion.
>
> A merge-base between A and B is defined to be a commit that is a
> common ancestor of both A and B, and that is not an ancestor of any
> other merge-base between A and B.

Ok, under that definition I really do think that it's "easy".

You have all the pieces here but one:

> Start from A and B.  Follow from B to find 'x' and paint it in blue,
> follow from A to find 'y' and paint it in amber.  Follow from 'x' to
> '1', paint it in blue.  Follow from 'y' to '1', paint it in amber
> but notice that it already is painted in blue.
[...]
>             o-------o
>            /         \
>           /       y---A
>          /       /
>      ---2---z---1---x---B
>          \         /
>           o-------o
[...]
> So we need to notice that '1' and '2' have ancestry relation in
> order to reject '2' as "common but not merge-base".  One way of
> doing so is not to stop at '1' and keep digging (and eventually we
> find that '2' is what we could reach from '1' that already is a
> merge base), but then we will be susceptible to the same kind of
> clock skew issue as the revision traverser.

I think that is *the* way to do it.

And the way to fix the clock skew issue (also in the revision traversal)
is Peff's generation number cache.  Just keep propagating marks, digging
in generation order, until the generations prove that nothing new can
happen.

  [Side note, in reply to an earlier comment in the rev-list thread: I
  agree with Peff's reasoning that a cache is better than a commit
  header.]

The precise termination condition should be:

  There are no more one-colored commits to walk, and

  The maximum generation of the remaining commits in the walk is less
  than the minimum generation of the merge base candidates

Then the generation numbers ensure that no commit in the walk (which by
then only propagates STALE marks) can possibly be a descendant of any of
the candidates.  At that point, the candidates are the true set of merge
bases.


I conjecture that every history walking problem can be solved in time
linear in the number of commits once we properly use the generation
numbers ;-)

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: in_merge_bases() is too expensive for recent "pu" update
  2012-08-23 14:20 ` Thomas Rast
  2012-08-23 18:08   ` Junio C Hamano
@ 2012-08-24 11:42   ` Nguyen Thai Ngoc Duy
  2012-08-24 11:51     ` Nguyen Thai Ngoc Duy
  2012-08-28  1:50   ` Junio C Hamano
  2 siblings, 1 reply; 16+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-08-24 11:42 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Git Mailing List

On Thu, Aug 23, 2012 at 9:20 PM, Thomas Rast <trast@student.ethz.ch> wrote:
> At the very least it should be possible to change in_merge_bases() to
> not do any of the post-filtering; perhaps like the patch below.  It
> passes the test suite.  The whole "merge bases of A and a list of Bs"
> thing is blowing my overheated mind, though, so I'm not able to convince
> myself that it is correct in all cases.
>
> diff --git i/commit.c w/commit.c
> index 65a8485..70427ab 100644
> --- i/commit.c
> +++ w/commit.c
> @@ -837,10 +837,13 @@ int in_merge_bases(struct commit *commit, struct commit **reference, int num)
>         struct commit_list *bases, *b;
>         int ret = 0;
>
> -       if (num == 1)
> -               bases = get_merge_bases(commit, *reference, 1);
> -       else
> +       if (num != 1)
>                 die("not yet");
> +
> +       bases = merge_bases_many(commit, 1, reference);
> +       clear_commit_marks(commit, all_flags);
> +       clear_commit_marks(*reference, all_flags);
> +
>         for (b = bases; b; b = b->next) {
>                 if (!hashcmp(commit->object.sha1, b->item->object.sha1)) {
>                         ret = 1;

Without looking into detail (as I'm not familiar with this code), this
patch does not help much. Without the patch:

$ time ./git merge-base a4f2db3 b95a282
ed36e5bd41f7192e42e9b4c573875a343a9daf48

real    0m19.988s
user    0m19.797s
sys     0m0.082s

With the patch:

$ time ./git merge-base a4f2db3 b95a282
ed36e5bd41f7192e42e9b4c573875a343a9daf48

real    0m19.560s
user    0m19.448s
sys     0m0.037s
-- 
Duy

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: in_merge_bases() is too expensive for recent "pu" update
  2012-08-24 11:42   ` Nguyen Thai Ngoc Duy
@ 2012-08-24 11:51     ` Nguyen Thai Ngoc Duy
  0 siblings, 0 replies; 16+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-08-24 11:51 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Git Mailing List

On Fri, Aug 24, 2012 at 6:42 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> Without looking into detail (as I'm not familiar with this code), this
> patch does not help much.

And I should have looked. git merge-base does not use that function.
With your patch, git fetch is instant again for same 'pu' update.
-- 
Duy

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: in_merge_bases() is too expensive for recent "pu" update
  2012-08-24  9:43       ` Thomas Rast
@ 2012-08-24 15:15         ` Jeff King
  2012-08-24 16:15           ` Junio C Hamano
  2012-08-24 15:52         ` Junio C Hamano
  1 sibling, 1 reply; 16+ messages in thread
From: Jeff King @ 2012-08-24 15:15 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Junio C Hamano, Nguyen Thai Ngoc Duy, Git Mailing List

On Fri, Aug 24, 2012 at 11:43:40AM +0200, Thomas Rast wrote:

> > Start from A and B.  Follow from B to find 'x' and paint it in blue,
> > follow from A to find 'y' and paint it in amber.  Follow from 'x' to
> > '1', paint it in blue.  Follow from 'y' to '1', paint it in amber
> > but notice that it already is painted in blue.
> [...]
> >             o-------o
> >            /         \
> >           /       y---A
> >          /       /
> >      ---2---z---1---x---B
> >          \         /
> >           o-------o
> [...]
> > So we need to notice that '1' and '2' have ancestry relation in
> > order to reject '2' as "common but not merge-base".  One way of
> > doing so is not to stop at '1' and keep digging (and eventually we
> > find that '2' is what we could reach from '1' that already is a
> > merge base), but then we will be susceptible to the same kind of
> > clock skew issue as the revision traverser.
> 
> I think that is *the* way to do it.
> 
> And the way to fix the clock skew issue (also in the revision traversal)
> is Peff's generation number cache.  Just keep propagating marks, digging
> in generation order, until the generations prove that nothing new can
> happen.

I thought you were just interested in speeding up is_descendent_of. You
should be able to do that without a generation number. Just start from A
and B as above, do the two-color painting, and do not add the parents of
any two-color commits (because you know they are ancestors of both, and
therefore you cannot find either by looking backwards). If you paint "B"
amber, then it is a descendent of "A" (and vice versa if you paint "A"
blue).

Clock skew may make you take longer (because you may go depth-first down
an uninteresting chain of commits), but it should never give you the
wrong answer. It's not as fast as using a generation-based cutoff
(because you have to keep walking to the actual shared ancestor), but in
practice it's usually fine.

The reason I did not do that for "tag --contains" is that I wanted to
do a simultaneous walk for all tags. That would require N color bits for
N tags, and we have a limited space in each commit object. I didn't time
using a separate hash table to store those bits outside of the commit
objects. That would have a higher constant, but should still yield good
big-O complexity.

>   [Side note, in reply to an earlier comment in the rev-list thread: I
>   agree with Peff's reasoning that a cache is better than a commit
>   header.]

I still think it's a good idea to keep it out of the commit header. But
I think I might lean towards tying it to the pack index (i.e.,
generating it when we generate the index, and depending on the sha1
table in the index). That would be smaller, and gets rid of any
complexity or performance issues with making incremental updates to the
cache. Loose objects wouldn't have a cached version number, but that's
OK; in practice you can quickly walk backwards to a commit that _is_
cached.

-Peff

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: in_merge_bases() is too expensive for recent "pu" update
  2012-08-24  9:32         ` Thomas Rast
@ 2012-08-24 15:50           ` Junio C Hamano
  0 siblings, 0 replies; 16+ messages in thread
From: Junio C Hamano @ 2012-08-24 15:50 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Nguyen Thai Ngoc Duy, Git Mailing List

Thomas Rast <trast@student.ethz.ch> writes:

> Well, yeah, you snipped this part from my original post :-)
>
> } Even if this turns out to be flawed, we should also identify uses of
> } in_merge_bases() where the real question was is_descendant_of() [I
> } somewhat suspect that's all of them], and then replace is_descendant_of
> } with a much cheaper lookup.  This can be as simple as propagating a mark
> } from the candidate until it either goes beyond all possible ancestors,
> } or hits one of them.

Yeah, I agree with the above, and the "cheaper lookup" would
probably be merge-bases-many() without postprocess.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: in_merge_bases() is too expensive for recent "pu" update
  2012-08-24  9:43       ` Thomas Rast
  2012-08-24 15:15         ` Jeff King
@ 2012-08-24 15:52         ` Junio C Hamano
  1 sibling, 0 replies; 16+ messages in thread
From: Junio C Hamano @ 2012-08-24 15:52 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Nguyen Thai Ngoc Duy, Git Mailing List, Jeff King

Thomas Rast <trast@student.ethz.ch> writes:

> Junio C Hamano <gitster@pobox.com> writes:
> ...
>> Start from A and B.  Follow from B to find 'x' and paint it in blue,
>> follow from A to find 'y' and paint it in amber.  Follow from 'x' to
>> '1', paint it in blue.  Follow from 'y' to '1', paint it in amber
>> but notice that it already is painted in blue.
> [...]
>>             o-------o
>>            /         \
>>           /       y---A
>>          /       /
>>      ---2---z---1---x---B
>>          \         /
>>           o-------o
> [...]
>> So we need to notice that '1' and '2' have ancestry relation in
>> order to reject '2' as "common but not merge-base".  One way of
>> doing so is not to stop at '1' and keep digging (and eventually we
>> find that '2' is what we could reach from '1' that already is a
>> merge base), but then we will be susceptible to the same kind of
>> clock skew issue as the revision traverser.
>
> I think that is *the* way to do it.

But we do not live in *that* world.  At least not yet.

> I conjecture that every history walking problem can be solved in time
> linear in the number of commits once we properly use the generation
> numbers ;-)

I would conjecture that too, but we do not live in that world yet.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: in_merge_bases() is too expensive for recent "pu" update
  2012-08-24 15:15         ` Jeff King
@ 2012-08-24 16:15           ` Junio C Hamano
  0 siblings, 0 replies; 16+ messages in thread
From: Junio C Hamano @ 2012-08-24 16:15 UTC (permalink / raw)
  To: Jeff King; +Cc: Thomas Rast, Nguyen Thai Ngoc Duy, Git Mailing List

Jeff King <peff@peff.net> writes:

> I thought you were just interested in speeding up is_descendent_of. You
> should be able to do that without a generation number. Just start from A
> and B as above, do the two-color painting, and do not add the parents of
> any two-color commits (because you know they are ancestors of both, and
> therefore you cannot find either by looking backwards). If you paint "B"
> amber, then it is a descendent of "A" (and vice versa if you paint "A"
> blue).

Yes, that is what merge_bases_many() (the one without the post
processing to cull candidates) is primarily doing.  The function
also does the "STALE" bit processing that aims to reduce the number
of candidates in "no clock skew" cases with minimum effort, to
mimimize the cost of get_merge_bases_many() in the post-processing
phase, but removing the "STALE" bit processing shouldn't affect the
correctness of get_merge_bases_many() and would help performance of
merge_bases_many() proper, which is useful for is_descendant_of().

> The reason I did not do that for "tag --contains" is that I wanted to
> do a simultaneous walk for all tags. That would require N color bits for
> N tags, and we have a limited space in each commit object. I didn't time
> using a separate hash table to store those bits outside of the commit
> objects. That would have a higher constant, but should still yield good
> big-O complexity.

It may not be a bad idea to at least try and see the performance
implications of moving many users of object->flags to a new
implementation of per-purpose flag bits based on the decoration
infrastracture.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: in_merge_bases() is too expensive for recent "pu" update
  2012-08-23 14:20 ` Thomas Rast
  2012-08-23 18:08   ` Junio C Hamano
  2012-08-24 11:42   ` Nguyen Thai Ngoc Duy
@ 2012-08-28  1:50   ` Junio C Hamano
  2012-08-28  8:12     ` Thomas Rast
  2 siblings, 1 reply; 16+ messages in thread
From: Junio C Hamano @ 2012-08-28  1:50 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Nguyen Thai Ngoc Duy, Git Mailing List

Thomas Rast <trast@student.ethz.ch> writes:

> diff --git i/commit.c w/commit.c
> index 65a8485..70427ab 100644
> --- i/commit.c
> +++ w/commit.c
> @@ -837,10 +837,13 @@ int in_merge_bases(struct commit *commit, struct commit **reference, int num)
>  	struct commit_list *bases, *b;
>  	int ret = 0;
>  
> -	if (num == 1)
> -		bases = get_merge_bases(commit, *reference, 1);
> -	else
> +	if (num != 1)
>  		die("not yet");
> +
> +	bases = merge_bases_many(commit, 1, reference);
> +	clear_commit_marks(commit, all_flags);
> +	clear_commit_marks(*reference, all_flags);
> +	
>  	for (b = bases; b; b = b->next) {
>  		if (!hashcmp(commit->object.sha1, b->item->object.sha1)) {
>  			ret = 1;

This ended up being part of the series I sent earlier, and I want to
assign authorship to you. As you did this as part of the discussion,
naturally the patch came without a sign-off.  Can we consider it
signed off?  Just saying "ok" is fine.

Thanks.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: in_merge_bases() is too expensive for recent "pu" update
  2012-08-28  1:50   ` Junio C Hamano
@ 2012-08-28  8:12     ` Thomas Rast
  2012-08-28 16:03       ` Junio C Hamano
  0 siblings, 1 reply; 16+ messages in thread
From: Thomas Rast @ 2012-08-28  8:12 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Nguyen Thai Ngoc Duy, Git Mailing List

Junio C Hamano <gitster@pobox.com> writes:

> Thomas Rast <trast@student.ethz.ch> writes:
>
>> diff --git i/commit.c w/commit.c
>> index 65a8485..70427ab 100644
>> --- i/commit.c
>> +++ w/commit.c
>> @@ -837,10 +837,13 @@ int in_merge_bases(struct commit *commit, struct commit **reference, int num)
>>  	struct commit_list *bases, *b;
>>  	int ret = 0;
>>  
>> -	if (num == 1)
>> -		bases = get_merge_bases(commit, *reference, 1);
>> -	else
>> +	if (num != 1)
>>  		die("not yet");
>> +
>> +	bases = merge_bases_many(commit, 1, reference);
>> +	clear_commit_marks(commit, all_flags);
>> +	clear_commit_marks(*reference, all_flags);
>> +	
>>  	for (b = bases; b; b = b->next) {
>>  		if (!hashcmp(commit->object.sha1, b->item->object.sha1)) {
>>  			ret = 1;
>
> This ended up being part of the series I sent earlier, and I want to
> assign authorship to you. As you did this as part of the discussion,
> naturally the patch came without a sign-off.  Can we consider it
> signed off?  Just saying "ok" is fine.

Sure:

  ok

;-)

I'm also mildly surprised that it ended up being correct, albeit with
some extra work from you :-)

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: in_merge_bases() is too expensive for recent "pu" update
  2012-08-28  8:12     ` Thomas Rast
@ 2012-08-28 16:03       ` Junio C Hamano
  0 siblings, 0 replies; 16+ messages in thread
From: Junio C Hamano @ 2012-08-28 16:03 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Nguyen Thai Ngoc Duy, Git Mailing List

Thomas Rast <trast@inf.ethz.ch> writes:

> I'm also mildly surprised that it ended up being correct, albeit with
> some extra work from you :-)

I actually am not all that surprised.  It just shows that the
original code was layered in more or less the right way.  At the the
bottom layer we would want a way to paint graph down to common
ancestors cheaply, and then on top we want to have a way to filter
out the redundant ones.  It is a different story that the
implementation of individual layers may still have rooms for
improvements (just like the last patch in my series showed for the
upper layer where we used to do N * (N-1) when we only needed N).

I have a suspicion that the merge_bases_many() at the bottom layer
could be built on an even cheaper API.  This caller you added does
not need "bases" list returned; it only wants to see that ancestors
of "commit" and "reference" down to their common ancestors are
painted in two colors PARENT1 and PARENT2.  Instead of iterating
over the returned candidate list, we might be able to do

	common = PARENT1 | PARENT2;
	paint_ancestors(commit, 1, reference);
        commit_is_common = ((commit->object.flags & common) == common);
        clear_commit_marks(commit, all_flags);
        clear_commit_marks(*reference, all_flags);

where paint_ancestors() just paints but does not accumlate to return
a commit_list.  Note that paint_ancestors() need to always paint its
arguments (merge_bases_many() has an early special case to return a
list that only has "commit" when it appears in "reference" without
painting anything, and the callers like get_merge_bases_many() know
this to avoid calling clear_commit_marks()) if we are to go in this
direction, so it is unclear if it will be overall gain.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2012-08-28 16:03 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-08-23 12:32 in_merge_bases() is too expensive for recent "pu" update Nguyen Thai Ngoc Duy
2012-08-23 14:20 ` Thomas Rast
2012-08-23 18:08   ` Junio C Hamano
2012-08-23 20:42     ` Junio C Hamano
2012-08-23 21:03       ` Junio C Hamano
2012-08-24  9:32         ` Thomas Rast
2012-08-24 15:50           ` Junio C Hamano
2012-08-24  9:43       ` Thomas Rast
2012-08-24 15:15         ` Jeff King
2012-08-24 16:15           ` Junio C Hamano
2012-08-24 15:52         ` Junio C Hamano
2012-08-24 11:42   ` Nguyen Thai Ngoc Duy
2012-08-24 11:51     ` Nguyen Thai Ngoc Duy
2012-08-28  1:50   ` Junio C Hamano
2012-08-28  8:12     ` Thomas Rast
2012-08-28 16:03       ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).