[RFH] git cherry vs. git rev-list --cherry, or: Why does "..." suck?

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFH] git cherry vs. git rev-list --cherry, or: Why does "..." suck?
@ 2011-03-22 12:07 Michael J Gruber
  2011-03-23 16:46 ` Michael J Gruber
  2011-03-23 17:19 ` Jeff King
  0 siblings, 2 replies; 5+ messages in thread
From: Michael J Gruber @ 2011-03-22 12:07 UTC (permalink / raw)
  To: Git Mailing List

In the process of converting "git cherry" and "git format patch" to use
the new rev-list options (the saner way according to d7a17ca (git-log
--cherry-pick A...B, 2007-04-09) already!), I have a simple question and
a hard one which I both ask help for:

run_command
===========

I could use either run_command_v_opt(args, RUN_GIT_CMD) or setup the
walker, call it etc. For the former I have to check how to treat the
third argument to "git cherry", the latter seems to be more code (and I
would need to call the rev-list/log output loop somehow).

Is there a general preference for using or avoiding run_command?

(There's also the question of what details of git cherry's output I need
to preserve.)

Performance
===========

I don't get this:

git cherry A B: 0.4s
git rev-list --cherry A...B: 1.7s
(more details below)

This makes "rev-list --cherry" almost unacceptable as a replacement. But
I'd like to understand this difference (and maybe do something about
it). I'm lost with gprof, but here are more details on the timing:

A is pu at 0f169fc
B is next at 5ddab49 plus three commits which are not upstream

rev-list --count 5ddab49..A is 166 (117 without merges), for B it is 3

Now the timings (rev-list done with --count):

cherry A B: 0.4s
cherry B A: 0.4s
rev-list --cherry A...B: 1.7s

The latter computes merge bases (there are 25), the former does not. How
much is it:

merge-base A B: 0.95s
merge-base --all A B: 0.95s
rev-parse A...B: 0.95s

So this accounts for much of the difference (and we need to do something
about get_merge_bases()), but not all. How much is the patch-id computation:

rev-list --no-merges --right-only --cherry-pick A...B: 1.7s
(the above is --cherry)
rev-list --no-merges --right-only A...B: 1.0s
rev-list --no-merges --left-right A...B: 1.0s

Why does it take rev-list 0.7s to do the same patch-id computations that
cherry does in less than 0.4s? (More details on what they do below.)

rev-list --no-merges A..B: 0.04s (counting to 3, yeah)
rev-list --no-merges A..B: 0.6s (counting to 117)

The latter has no patch-id nor merge computation. Should this really
take 0.6s?

I'm stomped. Help, please!

Michael

What the commands roughly do:

cherry A B [limit]:
===================
add pending B ^A
walk B..A (on temp rev_info) and
add_commit_patch_id() on these
clear_commit_marks()
add pending ^limit if specified
walk A..B and
reverse that list and
has_commit_patch_id() on these

rev-list --cherry A...B:
========================
get_merge_bases for A,B
add pending --not merge bases
add pending A B
add_commit_patch_id() on smaller side
has_commit_patch_id() on other side (&& mark id seen)
recheck smaller side (based on id->seen)

This seems to enumerate A..B and B..A more often, but is iterating
through a commit list that time consuming? The number of patch-id
computations is the same as far as I can see.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFH] git cherry vs. git rev-list --cherry, or: Why does "..." suck?
  2011-03-22 12:07 [RFH] git cherry vs. git rev-list --cherry, or: Why does "..." suck? Michael J Gruber
@ 2011-03-23 16:46 ` Michael J Gruber
  2011-03-23 18:20   ` Junio C Hamano
  2011-03-23 17:19 ` Jeff King
  1 sibling, 1 reply; 5+ messages in thread
From: Michael J Gruber @ 2011-03-23 16:46 UTC (permalink / raw)
  To: Git Mailing List

Adding some recent insight:

Michael J Gruber venit, vidit, dixit 22.03.2011 13:07:
> Performance
> ===========
> 
> I don't get this:
> 
> git cherry A B: 0.4s
> git rev-list --cherry A...B: 1.7s
> (more details below)

I can get the latter down to 0.95s and this

> merge-base A B: 0.95s
> merge-base --all A B: 0.95s
> rev-parse A...B: 0.95s

to 0.16s each. The downside is that merge-base may give a few
unneccessary candidates (commits which are ancestors of other commits it
returns), but this does not change the results for rev-list, of course.

I get this dramatic speedup by removing the check for duplicates from
get_merge_bases_many() in commit.c. After a first merge_bases_many()
run, returning N commits, that check calls merge_bases_many() again for
each pair (N choose 2) to check whether one is contained in the other.
Quite a bottleneck. Removing it works great. But can we live with a few
additional merge bases?

Michael

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFH] git cherry vs. git rev-list --cherry, or: Why does "..." suck?
  2011-03-22 12:07 [RFH] git cherry vs. git rev-list --cherry, or: Why does "..." suck? Michael J Gruber
  2011-03-23 16:46 ` Michael J Gruber
@ 2011-03-23 17:19 ` Jeff King
  1 sibling, 0 replies; 5+ messages in thread
From: Jeff King @ 2011-03-23 17:19 UTC (permalink / raw)
  To: Michael J Gruber; +Cc: Git Mailing List

On Tue, Mar 22, 2011 at 01:07:53PM +0100, Michael J Gruber wrote:

> I don't get this:
> 
> git cherry A B: 0.4s
> git rev-list --cherry A...B: 1.7s
> (more details below)
> 
> This makes "rev-list --cherry" almost unacceptable as a replacement. But
> I'd like to understand this difference (and maybe do something about
> it). I'm lost with gprof, but here are more details on the timing:

I don't have much to say on the problem at hand, but have you tried
using the "perf" tool from the kernel to measure? These days it ships in
linux-tools-2.6.x in Debian unstable; I don't know about other distros.

You can do this:

  perf record git cherry $A $B >/dev/null
  perf record git rev-list --cherry $A...$B >/dev/null
  perf diff

to get a list of the hot-spots with the time differences between the two
runs (along with "perf report" and "perf annotate" to get more
information).

Disclaimer: I am very new to perf, so I may be misleading you about how
useful the "diff" would be. In particular, it seems to be based on a
percentages of time spent between two runs. Which is great for two runs
of the same program calling very similar functions. But for two programs
calling _mostly_ the same functions, I don't know how misleading it is.

-Peff

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFH] git cherry vs. git rev-list --cherry, or: Why does "..." suck?
  2011-03-23 16:46 ` Michael J Gruber
@ 2011-03-23 18:20   ` Junio C Hamano
  2011-03-24  7:38     ` Michael J Gruber
  0 siblings, 1 reply; 5+ messages in thread
From: Junio C Hamano @ 2011-03-23 18:20 UTC (permalink / raw)
  To: Michael J Gruber; +Cc: Git Mailing List

Michael J Gruber <git@drmicha.warpmail.net> writes:

> Adding some recent insight:
>
> Michael J Gruber venit, vidit, dixit 22.03.2011 13:07:
>> Performance
>> ===========
>> 
>> I don't get this:
>> 
>> git cherry A B: 0.4s
>> git rev-list --cherry A...B: 1.7s
>> (more details below)
>
> I can get the latter down to 0.95s and this
>
>> merge-base A B: 0.95s
>> merge-base --all A B: 0.95s
>> rev-parse A...B: 0.95s
>
> to 0.16s each. The downside is that merge-base may give a few
> unneccessary candidates (commits which are ancestors of other commits it
> returns), but this does not change the results for rev-list, of course.
>
> I get this dramatic speedup by removing the check for duplicates from
> get_merge_bases_many() in commit.c. After a first merge_bases_many()
> run, returning N commits, that check calls merge_bases_many() again for
> each pair (N choose 2) to check whether one is contained in the other.
> Quite a bottleneck. Removing it works great. But can we live with a few
> additional merge bases?

When we run merge-base as the top-level command (this includes
reduce_heads() that is used by "git merge"), we have to cull unnecessary
phantom bases that can be reached by other bases, so you are not allowed
to make such a change unconditionally.

Passing down a parameter from a caller that is prepared to handle phantom
merge bases correctly is probably the right approach.  Existing callers
can make "safer" calls for now; you can later examine them and turn them
into "faster" calls if they operate correctly given a result that contain
phantom bases.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFH] git cherry vs. git rev-list --cherry, or: Why does "..." suck?
  2011-03-23 18:20   ` Junio C Hamano
@ 2011-03-24  7:38     ` Michael J Gruber
  0 siblings, 0 replies; 5+ messages in thread
From: Michael J Gruber @ 2011-03-24  7:38 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List

Junio C Hamano venit, vidit, dixit 23.03.2011 19:20:
> Michael J Gruber <git@drmicha.warpmail.net> writes:
> 
>> Adding some recent insight:
>>
>> Michael J Gruber venit, vidit, dixit 22.03.2011 13:07:
>>> Performance
>>> ===========
>>>
>>> I don't get this:
>>>
>>> git cherry A B: 0.4s
>>> git rev-list --cherry A...B: 1.7s
>>> (more details below)
>>
>> I can get the latter down to 0.95s and this
>>
>>> merge-base A B: 0.95s
>>> merge-base --all A B: 0.95s
>>> rev-parse A...B: 0.95s
>>
>> to 0.16s each. The downside is that merge-base may give a few
>> unneccessary candidates (commits which are ancestors of other commits it
>> returns), but this does not change the results for rev-list, of course.
>>
>> I get this dramatic speedup by removing the check for duplicates from
>> get_merge_bases_many() in commit.c. After a first merge_bases_many()
>> run, returning N commits, that check calls merge_bases_many() again for
>> each pair (N choose 2) to check whether one is contained in the other.
>> Quite a bottleneck. Removing it works great. But can we live with a few
>> additional merge bases?
> 
> When we run merge-base as the top-level command (this includes
> reduce_heads() that is used by "git merge"), we have to cull unnecessary
> phantom bases that can be reached by other bases, so you are not allowed
> to make such a change unconditionally.
> 
> Passing down a parameter from a caller that is prepared to handle phantom
> merge bases correctly is probably the right approach.  Existing callers
> can make "safer" calls for now; you can later examine them and turn them
> into "faster" calls if they operate correctly given a result that contain
> phantom bases.

Yes, I was thinking of having thorough vs. fast mode, but I'll dig more
into merge_bases_many().

My current impression is that those phantom merge bases appear only
(mainly?) when there are severe date ordering problems in the dag.
merge_bases_many() uses commit_list_insert_by_date(), and given the way
it walks, later merge bases which are ancestors of another one get
marked STALE automatically.

For callers which are interested in one base only, the check makes a
difference only if there are date ordering problems.

In my A B example, merge_bases_many() comes up with the correct 25 ones
during its first run, and then gets called 300 times again (25*24/2) to
check each pair of them, without any reduction. (Clearly, it's a non
issue in the case of unique merge base.)

What I'm mainly interested in is the A...B case. And for a revision walk
with one A...B range and revs->limited, one could even embed the whole
mb-logic into the walk! But I don't see how to do this for multiple
symmetric ranges.

Michael

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2011-03-24  7:42 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-03-22 12:07 [RFH] git cherry vs. git rev-list --cherry, or: Why does "..." suck? Michael J Gruber
2011-03-23 16:46 ` Michael J Gruber
2011-03-23 18:20   ` Junio C Hamano
2011-03-24  7:38     ` Michael J Gruber
2011-03-23 17:19 ` Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).