* three-way diff performance problem
@ 2009-07-21 18:10 Linus Torvalds
2009-07-21 18:16 ` Linus Torvalds
2009-07-21 19:21 ` Junio C Hamano
0 siblings, 2 replies; 8+ messages in thread
From: Linus Torvalds @ 2009-07-21 18:10 UTC (permalink / raw)
To: Git Mailing List, Junio C Hamano
Ok, so I decided to abuse git today, because I actually had a problem
where git could potentially solve it. Sadly, it's now 55 minutes later,
and my single git command hasn't completed yet :(
What did I do?
We have a really odd bug-report for the kernel where switching the
compiler from using '-fwrapv' to using '-fno-strict-overflow' causes a
non-working kernel.
So the exact same kernel sources work fine with either '-fwrapv' or,
indeed, with neither option, but with '-fno-strict-overflow' it just hangs
at boot.
Now, the problem is that I can't actually see the issue, so I just have
three kernel binaries, two of which work, and one which does not. I can
disassemble them, and I did, and the end result is three files with
roughly 1.14 million lines.
The diffs between them are in the 40- to 90-thousand line area, and it's
really hard to read diffs of assembly code and try to figure out what the
problem with the generated code is. Yeah, I'm pretty good at reading
assembler, but I'm not _that_ good. Just scanning 40 thousand lines takes
a while, looking at it _closely_ is just incredibly hard and isn't going
to ever really work out.
But there's a saving grace: I have two working kernels. And the
differences are often silly things like different register allocation, and
they are _different_. Many functions end up being the same between one
working kernel and the failing one, while being different in another
working kernel.
So what I want is the differences that are different from _both_ working
kernels, since being the same as _one_ of them means that it's not a bug.
Now, being a git person, what does that say? Right: just check in the
working kernels as two branches under the same filename, then merge the
branches, and force the merge result to be the non-working kernel, and do
a three-way combined context diff! So I did exactly that.
An hour later, it's still thinking.
So I used our nice new 'perf' tool from the kernel on a run that I killed
after five minutes, and lookie here:
#
# (17566764 samples)
#
# Overhead Command Shared Object Symbol
# ........ ................ ......................... ......
#
96.26% git /home/torvalds/bin/git [.] consume_line
...
it says we're spending pretty much all our time in a single function.
In fact, on the instruction-level profile, it looks like it's all spent on
four instructions:
3.51 : 45aec0: 4c 85 72 10 test %r14,0x10(%rdx)
86.27 : 45aec4: 48 0f 45 ca cmovne %rdx,%rcx
: if (sline->lost_head) {
: struct lline *last_one = NULL;
: /* We cannot squash it with earlier one */
: for (lline = sline->lost_head;
: lline;
: lline = lline->next)
6.69 : 45aec8: 48 8b 12 mov (%rdx),%rdx
:
: /* Check to see if we can squash things */
: if (sline->lost_head) {
: struct lline *last_one = NULL;
: /* We cannot squash it with earlier one */
: for (lline = sline->lost_head;
3.51 : 45aecb: 48 85 d2 test %rdx,%rdx
0.01 : 45aece: 75 f0 jne 45aec0 <consume_line+0xc0>
(ok, five, I included the 0.01% of the branch instruction that finishes
the loop - the way Nehalem works, it will merge those test+jne
instructions into one macro-instruction internally, which is why the
branch that is obviously part of the loop doesn't get a lot of hits)
Just FYI, those four instructions come from inlining 'append_lost()'. It has
compiled that first for-loop into some very good assembly language, but
the problem is that this all looks very much quadratic in the number of
lost lines. The quality of the compiler thus sadly in no way helps make up
for a really, really expensive algorithm.
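To see why that pattern is quadratic, here is a minimal self-contained sketch (hypothetical code, not git's actual structures): each append first walks the whole list, the same shape as the profiled loop, so n appends visit n*(n-1)/2 nodes in total.

```c
#include <stdlib.h>

/* Hypothetical miniature of the lost-line list -- not the real
 * combine-diff structs. Each append scans the entire list first,
 * mirroring the loop the profile blames. */
struct node {
	struct node *next;
};

/* Append 'item' after scanning the whole list from the head; return how
 * many existing nodes the scan visited. */
static long append_with_scan(struct node **head, struct node *item)
{
	long visited = 0;
	struct node **tail = head;
	while (*tail) {
		visited++;
		tail = &(*tail)->next;
	}
	item->next = NULL;
	*tail = item;
	return visited;
}

/* Total nodes visited while appending n items one by one:
 * 0 + 1 + ... + (n-1) = n*(n-1)/2, i.e. quadratic in n. */
long total_scan_work(int n)
{
	struct node *nodes = calloc(n, sizeof(*nodes));
	struct node *head = NULL;
	long total = 0;
	for (int i = 0; i < n; i++)
		total += append_with_scan(&head, &nodes[i]);
	free(nodes);
	return total;
}
```

With a million lost lines, as in this dump, that sum is on the order of 5*10^11 visits, which matches the hour-plus runtime.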
Any ideas?
For any git people who want to look at the horror that caused this, feel
free to download the (fairly small) git repository from
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/git-merge-test.git
and try to do a
git show --color fno-strict-overflow
just for fun. Do "gitk --all" to see the simple nature of the repository,
but if you do, you'll need to kill the "git diff-tree -r -p --textconv .."
process that gitk will start (and which will take forever).
The code in question is all Junio's, and is over three years old. So it's
obviously been working fine. And the load is obviously not really
appropriate, but at the same time I think it's an interesting thing to
try. And it doesn't seem to be about memory use - it's now been running for
85 minutes (ok, this email took a long time to write, and I uploaded the thing
etc. in the meantime), and it's holding steady at roughly 175MB RSS.
Linus
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: three-way diff performance problem
2009-07-21 18:10 three-way diff performance problem Linus Torvalds
@ 2009-07-21 18:16 ` Linus Torvalds
2009-07-21 19:21 ` Junio C Hamano
1 sibling, 0 replies; 8+ messages in thread
From: Linus Torvalds @ 2009-07-21 18:16 UTC (permalink / raw)
To: Git Mailing List, Junio C Hamano
On Tue, 21 Jul 2009, Linus Torvalds wrote:
>
> Now, being a git person, what does that say? Right: just check in the
> working kernels as two branches under the same filename, then merge the
> branches, and force the merge result to be the non-working kernel, and do
> a three-way combined context diff! So I did exactly that.
Btw, I did it wrong, which is why it takes too long. Instead of making the
result be the right dump-file, I made it be a bad one, so the diffs are
much bigger than they should be. Which then makes the combined-merge diff
take longer than necessary (with the O(n^2) thing).
Doing it _correctly_ actually made the three-way diff only take 45
seconds, but sadly still left me with 22 thousand lines.
Linus
* Re: three-way diff performance problem
2009-07-21 18:10 three-way diff performance problem Linus Torvalds
2009-07-21 18:16 ` Linus Torvalds
@ 2009-07-21 19:21 ` Junio C Hamano
2009-07-21 19:31 ` Linus Torvalds
2009-07-21 19:47 ` Junio C Hamano
1 sibling, 2 replies; 8+ messages in thread
From: Junio C Hamano @ 2009-07-21 19:21 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano
Linus Torvalds <torvalds@linux-foundation.org> writes:
> In fact, on the instruction-level profile, it looks like it's all spent on
> four instructions:
>
> 3.51 : 45aec0: 4c 85 72 10 test %r14,0x10(%rdx)
> 86.27 : 45aec4: 48 0f 45 ca cmovne %rdx,%rcx
> : if (sline->lost_head) {
> : struct lline *last_one = NULL;
> : /* We cannot squash it with earlier one */
> : for (lline = sline->lost_head;
> : lline;
> : lline = lline->next)
> 6.69 : 45aec8: 48 8b 12 mov (%rdx),%rdx
> :
> : /* Check to see if we can squash things */
> : if (sline->lost_head) {
> : struct lline *last_one = NULL;
> : /* We cannot squash it with earlier one */
> : for (lline = sline->lost_head;
> 3.51 : 45aecb: 48 85 d2 test %rdx,%rdx
> 0.01 : 45aece: 75 f0 jne 45aec0 <consume_line+0xc0>
>
> (ok, five, I included the 0.01% of the branch instruction that finishes
> the loop - the way Nehalem works, it will merge those test+jne
> instructions into one macro-instruction internally, which is why the
> branch that is obviously part of the loop doesn't get a lot of hits)
>
> Just FYI, those four instructions come from inlining 'append_lost()'. It's
> compiled that first for-loop into some very good assembly language, but
> the problem is that this all looks very much quadratic in number of
> lost_lines. The quality of the compiler thus sadly in no way helps make up
> for a really really expensive algorithm.
>
> Any ideas?
What's that cmovne?
The function append_lost() is the meat of combining. When you have seen
a hunk like this:
@@ -l,k +m,n @@
-lost line
-another lost line
common line
+added line
We queue the lost lines in front of a surviving line (that is, the sline that
points at "common line"). "lost line" and "another lost line" are stored
in lline (lost line) and they are queued to sline->lost_head.
When another diff that corresponds to this section looks like this:
@@ -l,k +m1,n1 @@
-lost line
common line
+added line
the function looks at "lost line" and tries to find an identical one in the
list of llines. We find one, and mark the fact that the lline element is
shared between the two parents.
That allows the output routine to say
@@@ -l,k +m,n +m1,n1 @@@
--lost line
- another lost line
common line
++added line
If that cmovne that eats 86.27% of the run time has something to do with
the memcmp() that is repeatedly done, and if the length of lost_head list
is very large, I think a possible optimization is to add a line hash value
in struct lline.
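That hash idea can be sketched like this (a hypothetical illustration, not the real struct lline; FNV-1a is just one plausible choice of cheap hash): a hash precomputed when the line is queued rejects almost every non-matching candidate with a single integer compare, so memcmp() only runs on likely matches.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical hashed lost line -- the real struct lline has no hash
 * field; this sketches the proposed addition. */
struct lline_hashed {
	const char *line;
	size_t len;
	unsigned long hash;	/* computed once, when the line is queued */
};

/* FNV-1a over the line bytes; any cheap, stable hash would do. */
unsigned long line_hash(const char *line, size_t len)
{
	unsigned long h = 2166136261ul;
	for (size_t i = 0; i < len; i++) {
		h ^= (unsigned char)line[i];
		h *= 16777619ul;
	}
	return h;
}

/* Fast reject on hash and length before the byte-wise comparison. */
int lline_equal(const struct lline_hashed *l,
		const char *line, size_t len, unsigned long hash)
{
	if (l->hash != hash || l->len != len)
		return 0;
	return !memcmp(l->line, line, len);
}
```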
After feeding the first diff (forget the one with +m1,n1 example above),
if the second diff looks like this:
@@ -l,k +m2,n2 @@
-another lost line
-lost line
common line
+added line
we match "-another lost line" with the second element in lost_head list,
and when processing the next line "-lost line", we try to avoid matching
it with the first one in the list, by stupidly scanning from the front of
the list. I think we should add a member to struct sline to remember the
last element in lost_head list for the current parent to skip that scan.
Perhaps that scan is what is costing us, not the memcmp()?
* Re: three-way diff performance problem
2009-07-21 19:21 ` Junio C Hamano
@ 2009-07-21 19:31 ` Linus Torvalds
2009-07-21 19:47 ` Junio C Hamano
1 sibling, 0 replies; 8+ messages in thread
From: Linus Torvalds @ 2009-07-21 19:31 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Git Mailing List
On Tue, 21 Jul 2009, Junio C Hamano wrote:
>
> What's that cmovne?
It's "conditional move if not equal".
So the source code is
/* We cannot squash it with earlier one */
for (lline = sline->lost_head;
lline;
lline = lline->next)
if (lline->parent_map & this_mask)
last_one = lline;
and the compiler has generated this:
loop:
test %r14,0x10(%rdx)
cmovne %rdx,%rcx
mov (%rdx),%rdx
test %rdx,%rdx
jne loop
from it. In the above, '%r14' contains this_mask, and '%rcx' contains
'last_one' and '%rdx' contains 'lline'.
So:
- test %r14,0x10(%rdx)
test "this_mask & lline->parent_map"
- cmovne %rdx,%rcx
"if the test above was non-zero, then last_one = lline"
- mov (%rdx),%rdx
"lline = lline->next"
- test %rdx,%rdx
"is lline NULL?"
- jne loop
"if not, loop again."
> The function append_lost() is the meat of combining. When you have seen
> a hunk like this:
>
> @@ -l,k +m,n @@
> -lost line
> -another lost line
> common line
> +added line
>
> We queue the lost lines in front of a surviving line (that is sline that
> points at "common line"). "lost line" and "another lost line" are stored
> in lline (lost line) and they are queued to sline->lost_head.
Right. And "sline->lost_head" is going to have over a _million_ entries.
So each time you add one, we'll traverse that loop a million times. For
each new entry. End result: that loop gets executed on the order of a
million million times. It doesn't help that the compiler made it be very
efficient code ;(
Linus
* Re: three-way diff performance problem
2009-07-21 19:21 ` Junio C Hamano
2009-07-21 19:31 ` Linus Torvalds
@ 2009-07-21 19:47 ` Junio C Hamano
2009-07-21 20:34 ` Linus Torvalds
1 sibling, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2009-07-21 19:47 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List
Junio C Hamano <gitster@pobox.com> writes:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
>
>> In fact, on the instruction-level profile, it looks like it's all spent on
>> four instructions:
>>
>> 3.51 : 45aec0: 4c 85 72 10 test %r14,0x10(%rdx)
>> 86.27 : 45aec4: 48 0f 45 ca cmovne %rdx,%rcx
>> : if (sline->lost_head) {
>> : struct lline *last_one = NULL;
>> : /* We cannot squash it with earlier one */
>> : for (lline = sline->lost_head;
>> : lline;
>> : lline = lline->next)
>> 6.69 : 45aec8: 48 8b 12 mov (%rdx),%rdx
>> :
>> : /* Check to see if we can squash things */
>> : if (sline->lost_head) {
>> : struct lline *last_one = NULL;
>> : /* We cannot squash it with earlier one */
>> : for (lline = sline->lost_head;
>> 3.51 : 45aecb: 48 85 d2 test %rdx,%rdx
>> 0.01 : 45aece: 75 f0 jne 45aec0 <consume_line+0xc0>
>> ...
> After feeding the first diff (forget the one with +m1,n1 example above),
> if the second diff looks like this:
>
> @@ -l,k +m2,n2 @@
> -another lost line
> -lost line
> common line
> +added line
>
> we match "-another lost line" with the second element in lost_head list,
> and when processing the next line "-lost line", we try to avoid matching
> it with the first one in the list, by stupidly scanning from the front of
> the list. I think we should add a member to struct sline to remember the
> last element in lost_head list for the current parent to skip that scan.
> Perhaps that scan is the one that is costing, not memcmp()?
Here is a patch to do that. I haven't tested it yet though.
combine-diff.c | 16 ++++++++--------
1 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/combine-diff.c b/combine-diff.c
index bbf74fc..8194104 100644
--- a/combine-diff.c
+++ b/combine-diff.c
@@ -80,6 +80,7 @@ struct lline {
/* Lines surviving in the merge result */
struct sline {
struct lline *lost_head, **lost_tail;
+ struct lline *lost_last_for_current_parent;
char *bol;
int len;
/* bit 0 up to (N-1) are on if the parent has this line (i.e.
@@ -121,18 +122,15 @@ static void append_lost(struct sline *sline, int n, const char *line, int len)
/* Check to see if we can squash things */
if (sline->lost_head) {
- struct lline *last_one = NULL;
- /* We cannot squash it with earlier one */
- for (lline = sline->lost_head;
- lline;
- lline = lline->next)
- if (lline->parent_map & this_mask)
- last_one = lline;
- lline = last_one ? last_one->next : sline->lost_head;
+ if (sline->lost_last_for_current_parent)
+ lline = sline->lost_last_for_current_parent;
+ else
+ lline = sline->lost_head;
while (lline) {
if (lline->len == len &&
!memcmp(lline->line, line, len)) {
lline->parent_map |= this_mask;
+ sline->lost_last_for_current_parent = lline;
return;
}
lline = lline->next;
@@ -147,6 +145,7 @@ static void append_lost(struct sline *sline, int n, const char *line, int len)
lline->line[len] = 0;
*sline->lost_tail = lline;
sline->lost_tail = &lline->next;
+ sline->lost_last_for_current_parent = lline;
}
struct combine_diff_state {
@@ -187,6 +186,7 @@ static void consume_line(void *state_, char *line, unsigned long len)
xcalloc(state->num_parent,
sizeof(unsigned long));
state->sline[state->nb-1].p_lno[state->n] = state->ob;
+ state->lost_bucket->lost_last_for_current_parent = NULL;
return;
}
if (!state->lost_bucket)
* Re: three-way diff performance problem
2009-07-21 19:47 ` Junio C Hamano
@ 2009-07-21 20:34 ` Linus Torvalds
2009-07-21 20:46 ` Junio C Hamano
0 siblings, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2009-07-21 20:34 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Git Mailing List
On Tue, 21 Jul 2009, Junio C Hamano wrote:
>
> Here is a patch to do that. I haven't tested it yet though.
Well, it seems to work.
It turned the 85+ minute thing (which I eventually just killed) into
something that took 1.4 _seconds_. Of course, I didn't check that the end
result was identical, since I never had the patience to wait for the
original case.
Now sadly, my fixed test (which took 45 seconds before) didn't much
improve (it doesn't have nearly as many removed lines, since I checked in
the _right_ file that is a much closer version to the parent files).
But what is intriguing is that it gets different results. So I suspect
this one actually changed behavior in some way:
[torvalds@nehalem git-merge]$ time git show --color HEAD | wc
22671 63302 672234
real 0m44.024s
user 0m43.879s
sys 0m0.148s
[torvalds@nehalem git-merge]$ time ~/git/git show --color HEAD | wc
22596 63122 671076
real 0m43.553s
user 0m43.435s
sys 0m0.128s
and notice how the git version with your change gives different line
numbers.
Now, this is a diff, and different answers are possibly _both_ valid, so
who knows. But I suspect that you _intended_ for the patch to be a
semantic no-op, and it doesn't seem to be.
Linus
* Re: three-way diff performance problem
2009-07-21 20:34 ` Linus Torvalds
@ 2009-07-21 20:46 ` Junio C Hamano
2009-07-21 21:01 ` Linus Torvalds
0 siblings, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2009-07-21 20:46 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Tue, 21 Jul 2009, Junio C Hamano wrote:
>>
>> Here is a patch to do that. I haven't tested it yet though.
>
> Well, it seems to work.
>
> It turned the 85+ minute thing (which I eventually just killed) into
> something that took 1.4 _seconds_. Of course, I didn't check that the end
> result was identical, since I never had the patience to wait for the
> original case.
>
> Now sadly, my fixed test (which took 45 seconds before) didn't much
> improve (it doesn't have nearly as many removed lines, since I checked in
> the _right_ file that is a much closer version to the parent files).
>
> But what is intriguing is that it it gets different results. So I suspect
> this one actually changed behavior some way:
>
> [torvalds@nehalem git-merge]$ time git show --color HEAD | wc
> 22671 63302 672234
>
> real 0m44.024s
> user 0m43.879s
> sys 0m0.148s
>
> [torvalds@nehalem git-merge]$ time ~/git/git show --color HEAD | wc
> 22596 63122 671076
>
> real 0m43.553s
> user 0m43.435s
> sys 0m0.128s
>
> and notice how the git version with your change gives different line
> numbers.
>
> Now, this is a diff, and different answers are possibly _both_ valid, so
> who knows. But I suspect that you _intended_ for the patch to be a
> semantic no-op, and it doesn't seem to be.
I know why. The conversion was wrong. The original found the last_one
that was from the parent we are looking at, and when there was one, it
started the scan from the one _after that_. Without that, if the lost_head
list had an entry
@@ -l,k +m,n @@
-one
two
three
and the diff with the parent we are currently looking at duplicates the
removal:
@@ -l,k +m1,n1 @@
-one
-one
two
three
we will end up losing the second removal, which would be what is happening
with the patch you tried.
I actually was scratching my head wondering why it wasn't happening in the
original code after I sent that faulty patch.
Here is another attempt.
combine-diff.c | 13 +++++--------
1 files changed, 5 insertions(+), 8 deletions(-)
diff --git a/combine-diff.c b/combine-diff.c
index bbf74fc..2438490 100644
--- a/combine-diff.c
+++ b/combine-diff.c
@@ -80,6 +80,7 @@ struct lline {
/* Lines surviving in the merge result */
struct sline {
struct lline *lost_head, **lost_tail;
+ struct lline *next_lost;
char *bol;
int len;
/* bit 0 up to (N-1) are on if the parent has this line (i.e.
@@ -121,18 +122,12 @@ static void append_lost(struct sline *sline, int n, const char *line, int len)
/* Check to see if we can squash things */
if (sline->lost_head) {
- struct lline *last_one = NULL;
- /* We cannot squash it with earlier one */
- for (lline = sline->lost_head;
- lline;
- lline = lline->next)
- if (lline->parent_map & this_mask)
- last_one = lline;
- lline = last_one ? last_one->next : sline->lost_head;
+ lline = sline->next_lost;
while (lline) {
if (lline->len == len &&
!memcmp(lline->line, line, len)) {
lline->parent_map |= this_mask;
+ sline->next_lost = lline->next;
return;
}
lline = lline->next;
@@ -147,6 +142,7 @@ static void append_lost(struct sline *sline, int n, const char *line, int len)
lline->line[len] = 0;
*sline->lost_tail = lline;
sline->lost_tail = &lline->next;
+ sline->next_lost = NULL;
}
struct combine_diff_state {
@@ -187,6 +183,7 @@ static void consume_line(void *state_, char *line, unsigned long len)
xcalloc(state->num_parent,
sizeof(unsigned long));
state->sline[state->nb-1].p_lno[state->n] = state->ob;
+ state->lost_bucket->next_lost = state->lost_bucket->lost_head;
return;
}
if (!state->lost_bucket)
* Re: three-way diff performance problem
2009-07-21 20:46 ` Junio C Hamano
@ 2009-07-21 21:01 ` Linus Torvalds
0 siblings, 0 replies; 8+ messages in thread
From: Linus Torvalds @ 2009-07-21 21:01 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Git Mailing List
On Tue, 21 Jul 2009, Junio C Hamano wrote:
>
> I know why. The conversion was wrong. The original found the last_one
> that was from the parent we are looking at, and when there is such, it
> started the scan from the one _after that_. Otherwise, if lost_head list
> had an entry
>
> @@ -l,k +m,n @@
> -one
> two
> three
>
> and the diff with the parent we are currently looking at duplicates
> removal:
>
> @@ -l,k +m1,n1 @@
> -one
> -one
> two
> three
>
> we will end up losing the second removal, which would be what is happening
> with the patch you tried.
Ok, that matches what I saw. Doing a 'diff' between the outputs showed
missing lines from the result with your first patch.
> I actually was scratching my head wondering why it wasn't happening in the
> original code after I sent that faulty patch.
>
> Here is another attempt.
Yup. This attempt now generates output that matches the original one (in
44s), while at the same time still handling the previously very expensive
case quickly (1.4s).
So it looks good now from my testing - I'm not going to say anything else
about the patch, since I'm not all that familiar with the code itself.
Btw, the 'perf report' on the fixed build now says
89.67% git /home/torvalds/git/git [.] xdl_prepare_env
6.42% git /home/torvalds/git/git [.] xdl_recs_cmp
0.92% git [kernel] [k] hpet_next_event
0.52% git /home/torvalds/git/git [.] xdl_prepare_ctx
0.44% git /lib64/libz.so.1.2.3 [.] inflate_fast
0.29% git /home/torvalds/git/git [.] xdl_hash_record
0.27% git /lib64/libc-2.10.1.so [.] __GI_memcmp
0.26% git /home/torvalds/git/git [.] show_patch_diff
0.14% git [kernel] [k] _spin_lock
0.09% git [kernel] [k] clear_page_c
0.09% git /lib64/libz.so.1.2.3 [.] adler32
so now the expensive part is xdl_prepare_env. And that does actually make
sense: it's the code that generates all the hash records for the lines to
be compared, so there doesn't seem to be anything hugely wrong with it. If
there are a lot of identical lines, I'd expect that to be fairly
expensive.
Linus