* absurdly slow git-diff @ 2008-11-07 20:01 Abhijit Menon-Sen 2008-11-07 21:28 ` Mike Hommey 2008-11-07 21:37 ` Linus Torvalds 0 siblings, 2 replies; 13+ messages in thread From: Abhijit Menon-Sen @ 2008-11-07 20:01 UTC (permalink / raw) To: git I have a 240k-line file, and I change one character on every sixth line. The resulting diff gives git serious indigestion: $ git --version git version 1.6.0.3.640.g6331a $ mkdir a; cd a; git init Initialized empty Git repository in /home/ams/a/.git/ $ cp ../1 .; git add 1; git commit -q -m 1 $ cp ../2 1; git add 1; git commit -q -m 2 $ time git show HEAD > x git show HEAD > x 309.88s user 0.46s system 97% cpu 5:17.06 total (I use commit -q above not only for brevity; for the second commit, calculating the diffstat takes the same five minutes that git show, git log -p, git log --stat etc. all take.) Note that diff(1) can handle the patch fine: $ time diff -u ../1 ../2 >/dev/null diff -u ../1 ../2 > /dev/null 0.30s user 0.06s system 69% cpu 0.519 total If anyone's interested, the files are http://toroid.org/misc/1 and http://toroid.org/misc/2 Does anyone understand why this slowdown might happen or have suggestions about where I should look for it? Thanks. -- ams ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: absurdly slow git-diff 2008-11-07 20:01 absurdly slow git-diff Abhijit Menon-Sen @ 2008-11-07 21:28 ` Mike Hommey 2008-11-07 21:37 ` Linus Torvalds 1 sibling, 0 replies; 13+ messages in thread From: Mike Hommey @ 2008-11-07 21:28 UTC (permalink / raw) To: Abhijit Menon-Sen; +Cc: git On Sat, Nov 08, 2008 at 01:31:27AM +0530, Abhijit Menon-Sen wrote: > I have a 240k-line file, and I change one character on every sixth line. > The resulting diff gives git serious indigestion: > > $ git --version > git version 1.6.0.3.640.g6331a > $ mkdir a; cd a; git init > Initialized empty Git repository in /home/ams/a/.git/ > $ cp ../1 .; git add 1; git commit -q -m 1 > $ cp ../2 1; git add 1; git commit -q -m 2 You don't need to go that far. You can stop at cp ../2 1 and run git diff from there. All the time is spent in the two loops in xdiff/xprepare.c:xdl_cleanup_records, on line 400 and 412. I'll leave the rest of the investigation to people actually knowing this code ;) Mike ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: absurdly slow git-diff 2008-11-07 20:01 absurdly slow git-diff Abhijit Menon-Sen 2008-11-07 21:28 ` Mike Hommey @ 2008-11-07 21:37 ` Linus Torvalds 2008-11-07 23:04 ` Davide Libenzi 2008-11-08 0:14 ` Pierre Habouzit 1 sibling, 2 replies; 13+ messages in thread From: Linus Torvalds @ 2008-11-07 21:37 UTC (permalink / raw) To: Abhijit Menon-Sen, Pierre Habouzit, Davide Libenzi; +Cc: Git Mailing List On Sat, 8 Nov 2008, Abhijit Menon-Sen wrote: > > If anyone's interested, the files are http://toroid.org/misc/1 and > http://toroid.org/misc/2 Btw, you can see this by just doing git diff 1 2 without even doing "git init" or doing any actual git repository. > Does anyone understand why this slowdown might happen or have > suggestions about where I should look for it? Sure. It's actually fairly simple. You're hitting a O(n^2) thing (possibly higher), and it's triggered by the fact that almost all your lines are identical, ie you have a file that basically has 40,000 lines of each of xxxx: xxx, xx xxx xxxx xx:xx:xx +xxxx xx: xxxx xxxxxxxxxxx <xxxx@xxxx.xxx> xxxx: xxxx xxxxxxxxxxx <xxxx@xxxx.xxx> and 30,000 of * xxxxx xxxxx (xxxxxx.xxxx xxx xxxxx (\Xxxxxx) xxxxxx.xxxxxx {xxx} with a smattering of others. And this is a case where the internal git implementation does really badly. And nobody has really cared before, because nobody has ever had a case that mattered. There's a number of different 'diff' algorithms, and it looks like GNU diff has one that avoids the O(n^2) case for this case. I'm adding Davide as the original author of the diff library to the cc. I'm also adding Pierre, since he was talking about trying to implement another diff algorithm (although I'm not at all sure that the patience diff really would help this case at all). Linus ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: absurdly slow git-diff 2008-11-07 21:37 ` Linus Torvalds @ 2008-11-07 23:04 ` Davide Libenzi 2008-11-07 23:18 ` Davide Libenzi 2008-11-08 0:14 ` Pierre Habouzit 1 sibling, 1 reply; 13+ messages in thread From: Davide Libenzi @ 2008-11-07 23:04 UTC (permalink / raw) To: Linus Torvalds; +Cc: Abhijit Menon-Sen, Pierre Habouzit, Git Mailing List On Fri, 7 Nov 2008, Linus Torvalds wrote: > On Sat, 8 Nov 2008, Abhijit Menon-Sen wrote: > > > > If anyone's interested, the files are http://toroid.org/misc/1 and > > http://toroid.org/misc/2 > > Btw, you can see this by just doing > > git diff 1 2 > > without even doing "git init" or doing any actual git repository. > > > Does anyone understand why this slowdown might happen or have > > suggestions about where I should look for it? > > Sure. It's actually fairly simple. You're hitting a O(n^2) thing (possibly > higher), and it's triggered by the fact that almost all your lines are > identical, ie you have a file that basically has 40,000 lines of each of > > xxxx: xxx, xx xxx xxxx xx:xx:xx +xxxx > xx: xxxx xxxxxxxxxxx <xxxx@xxxx.xxx> > xxxx: xxxx xxxxxxxxxxx <xxxx@xxxx.xxx> > > and 30,000 of > > * xxxxx xxxxx (xxxxxx.xxxx xxx xxxxx (\Xxxxxx) xxxxxx.xxxxxx {xxx} > > with a smattering of others. And this is a case where the internal git > implementation does really badly. And nobody has really cared before, > because nobody has ever had a case that mattered. > > There's a number of different 'diff' algorithms, and it looks like GNU > diff has one that avoids the O(n^2) case for this case. > > I'm adding Davide as the original author of the diff library to the cc. > I'm also adding Pierre, since he was talking about trying to implement > another diff algorithm (although I'm not at all sure that the patience > diff really would help this case at all). That should be an easy fix. Just need to limit the window by which xdl_clean_mmatch() scans the current position. - Davide ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: absurdly slow git-diff 2008-11-07 23:04 ` Davide Libenzi @ 2008-11-07 23:18 ` Davide Libenzi 2008-11-07 23:42 ` Linus Torvalds 0 siblings, 1 reply; 13+ messages in thread From: Davide Libenzi @ 2008-11-07 23:18 UTC (permalink / raw) To: Linus Torvalds; +Cc: Abhijit Menon-Sen, Pierre Habouzit, Git Mailing List On Fri, 7 Nov 2008, Davide Libenzi wrote: > On Fri, 7 Nov 2008, Linus Torvalds wrote: > > > On Sat, 8 Nov 2008, Abhijit Menon-Sen wrote: > > > > > > If anyone's interested, the files are http://toroid.org/misc/1 and > > > http://toroid.org/misc/2 > > > > Btw, you can see this by just doing > > > > git diff 1 2 > > > > without even doing "git init" or doing any actual git repository. > > > > > Does anyone understand why this slowdown might happen or have > > > suggestions about where I should look for it? > > > > Sure. It's actually fairly simple. You're hitting a O(n^2) thing (possibly > > higher), and it's triggered by the fact that almost all your lines are > > identical, ie you have a file that basically has 40,000 lines of each of > > > > xxxx: xxx, xx xxx xxxx xx:xx:xx +xxxx > > xx: xxxx xxxxxxxxxxx <xxxx@xxxx.xxx> > > xxxx: xxxx xxxxxxxxxxx <xxxx@xxxx.xxx> > > > > and 30,000 of > > > > * xxxxx xxxxx (xxxxxx.xxxx xxx xxxxx (\Xxxxxx) xxxxxx.xxxxxx {xxx} > > > > with a smattering of others. And this is a case where the internal git > > implementation does really badly. And nobody has really cared before, > > because nobody has ever had a case that mattered. > > > > There's a number of different 'diff' algorithms, and it looks like GNU > > diff has one that avoids the O(n^2) case for this case. > > > > I'm adding Davide as the original author of the diff library to the cc. > > I'm also adding Pierre, since he was talking about trying to implement > > another diff algorithm (although I'm not at all sure that the patience > > diff really would help this case at all). > > That should be an easy fix. Just need to limit the window by which > xdl_clean_mmatch() scans the current position. With +/- 100 lines (200 lines window): davide@alien:~$ time ./xdiff_test --diff 1 2 > /dev/null real 0m1.534s user 0m1.466s sys 0m0.040s - Davide ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: absurdly slow git-diff 2008-11-07 23:18 ` Davide Libenzi @ 2008-11-07 23:42 ` Linus Torvalds 2008-11-07 23:48 ` Davide Libenzi 0 siblings, 1 reply; 13+ messages in thread From: Linus Torvalds @ 2008-11-07 23:42 UTC (permalink / raw) To: Davide Libenzi; +Cc: Abhijit Menon-Sen, Pierre Habouzit, Git Mailing List On Fri, 7 Nov 2008, Davide Libenzi wrote: > > With +/- 100 lines (200 lines window): > > davide@alien:~$ time ./xdiff_test --diff 1 2 > /dev/null > > real 0m1.534s > user 0m1.466s > sys 0m0.040s I assume the patch is something like the appended? Linus --- xdiff/xprepare.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c index e87ab57..4bebd76 100644 --- a/xdiff/xprepare.c +++ b/xdiff/xprepare.c @@ -318,7 +318,7 @@ static int xdl_clean_mmatch(char const *dis, long i, long s, long e) { * Note that we always call this function with dis[i] > 1, so the * current line (i) is already a multimatch line. */ - for (r = 1, rdis0 = 0, rpdis0 = 1; (i - r) >= s; r++) { + for (r = 1, rdis0 = 0, rpdis0 = 1; r < 100 && (i - r) >= s; r++) { if (!dis[i - r]) rdis0++; else if (dis[i - r] == 2) @@ -334,7 +334,7 @@ static int xdl_clean_mmatch(char const *dis, long i, long s, long e) { */ if (rdis0 == 0) return 0; - for (r = 1, rdis1 = 0, rpdis1 = 1; (i + r) <= e; r++) { + for (r = 1, rdis1 = 0, rpdis1 = 1; r < 100 && (i + r) <= e; r++) { if (!dis[i + r]) rdis1++; else if (dis[i + r] == 2) ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: absurdly slow git-diff 2008-11-07 23:42 ` Linus Torvalds @ 2008-11-07 23:48 ` Davide Libenzi 2008-11-07 23:57 ` Linus Torvalds 2008-11-08 5:30 ` Junio C Hamano 0 siblings, 2 replies; 13+ messages in thread From: Davide Libenzi @ 2008-11-07 23:48 UTC (permalink / raw) To: Linus Torvalds; +Cc: Abhijit Menon-Sen, Pierre Habouzit, Git Mailing List On Fri, 7 Nov 2008, Linus Torvalds wrote: > > > On Fri, 7 Nov 2008, Davide Libenzi wrote: > > > > With +/- 100 lines (200 lines window): > > > > davide@alien:~$ time ./xdiff_test --diff 1 2 > /dev/null > > > > real 0m1.534s > > user 0m1.466s > > sys 0m0.040s > > I assume the patch is something like the appended? > > Linus > > --- > xdiff/xprepare.c | 4 ++-- > 1 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c > index e87ab57..4bebd76 100644 > --- a/xdiff/xprepare.c > +++ b/xdiff/xprepare.c > @@ -318,7 +318,7 @@ static int xdl_clean_mmatch(char const *dis, long i, long s, long e) { > * Note that we always call this function with dis[i] > 1, so the > * current line (i) is already a multimatch line. > */ > - for (r = 1, rdis0 = 0, rpdis0 = 1; (i - r) >= s; r++) { > + for (r = 1, rdis0 = 0, rpdis0 = 1; r < 100 && (i - r) >= s; r++) { > if (!dis[i - r]) > rdis0++; > else if (dis[i - r] == 2) > @@ -334,7 +334,7 @@ static int xdl_clean_mmatch(char const *dis, long i, long s, long e) { > */ > if (rdis0 == 0) > return 0; > - for (r = 1, rdis1 = 0, rpdis1 = 1; (i + r) <= e; r++) { > + for (r = 1, rdis1 = 0, rpdis1 = 1; r < 100 && (i + r) <= e; r++) { > if (!dis[i + r]) > rdis1++; > else if (dis[i + r] == 2) > Yeah, similar. Mine is below. There's one less branch in the for loops. - Davide diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c index deba25a..3ebd87c 100644 --- a/xdiff/xprepare.c +++ b/xdiff/xprepare.c @@ -23,10 +23,9 @@ #include "xinclude.h" - #define XDL_KPDIS_RUN 4 #define XDL_MAX_EQLIMIT 1024 - +#define XDL_SIMSCAN_WINDOWN 100 typedef struct s_xdlclass { @@ -246,6 +245,18 @@ static int xdl_clean_mmatch(char const *dis, long i, long s, long e) { long r, rdis0, rpdis0, rdis1, rpdis1; /* + * Limits the window the is examined during the similar-lines + * scan. The loops below stops when dis[i - r] == 1 (line that + * has no match), but there are corner cases where the loop + * proceed all the way to the extremities by causing huge + * performance penalties in case of big files. + */ + if (i - s > XDL_SIMSCAN_WINDOWN) + s = i - XDL_SIMSCAN_WINDOWN; + if (e - i > XDL_SIMSCAN_WINDOWN) + e = i + XDL_SIMSCAN_WINDOWN; + + /* * Scans the lines before 'i' to find a run of lines that either * have no match (dis[j] == 0) or have multiple matches (dis[j] > 1). * Note that we always call this function with dis[i] > 1, so the ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: absurdly slow git-diff 2008-11-07 23:48 ` Davide Libenzi @ 2008-11-07 23:57 ` Linus Torvalds 2008-11-08 4:57 ` Abhijit Menon-Sen 2008-11-08 21:02 ` Junio C Hamano 2008-11-08 5:30 ` Junio C Hamano 1 sibling, 2 replies; 13+ messages in thread From: Linus Torvalds @ 2008-11-07 23:57 UTC (permalink / raw) To: Davide Libenzi, Junio C Hamano Cc: Abhijit Menon-Sen, Pierre Habouzit, Git Mailing List On Fri, 7 Nov 2008, Davide Libenzi wrote: > > Yeah, similar. Mine is below. There's one less branch in the for loops. ..and has a comment and made the magic constant be named. Junio, the time difference is quite big for Abhijit's admittedly odd test-case: - Before: [torvalds@nehalem slow-diff]$ time git diff 1 2 > out.old real 2m19.912s user 2m19.885s sys 0m0.024s - After: [torvalds@nehalem slow-diff]$ time ~/git/git diff 1 2 > out real 0m0.841s user 0m0.816s sys 0m0.024s with no difference in output. Linus ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: absurdly slow git-diff 2008-11-07 23:57 ` Linus Torvalds @ 2008-11-08 4:57 ` Abhijit Menon-Sen 2008-11-08 21:02 ` Junio C Hamano 1 sibling, 0 replies; 13+ messages in thread From: Abhijit Menon-Sen @ 2008-11-08 4:57 UTC (permalink / raw) To: Linus Torvalds; +Cc: Davide Libenzi, Junio C Hamano, Pierre Habouzit, git At 2008-11-07 15:57:23 -0800, torvalds@linux-foundation.org wrote: > > > Yeah, similar. Mine is below. There's one less branch in the for > > loops. > > ..and has a comment and made the magic constant be named. It works fine for me (the time went from 5m17s to 1.8s). (By the way, my test case is certainly very odd, but it is a real file from my test suite, albeit with the content x'ed away; and the change was to adjust the expected output for all the items. I wasn't looking for bugs. :-) Thanks for the explanation and the patch. -- ams ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: absurdly slow git-diff 2008-11-07 23:57 ` Linus Torvalds 2008-11-08 4:57 ` Abhijit Menon-Sen @ 2008-11-08 21:02 ` Junio C Hamano 1 sibling, 0 replies; 13+ messages in thread From: Junio C Hamano @ 2008-11-08 21:02 UTC (permalink / raw) To: Linus Torvalds Cc: Davide Libenzi, Abhijit Menon-Sen, Pierre Habouzit, Git Mailing List Linus Torvalds <torvalds@linux-foundation.org> writes: > Junio, the time difference is quite big for Abhijit's admittedly odd > test-case: > ... > with no difference in output. In git.git history, "git-whatchanged -m -p -1" gives different output for the following commits, with and without the patch: c0e9892637e8144f10f2c408e276a470520f3601 d6b3e3a33f71910526ccf80af6c13a230363cd89 cecb98a9c3ed9271b0974bb6d7edbcf16e8a68f3 ce18135d862b5dbc731d203b27c279529e58b54b 36b5b3c65948694d9a92de5a17f2b97c3cd84879 767e130915015f897fb87b939843b4882212574b 927a503cd07718ea0f700052043f383253904a56 I've sampled a few (but not all) of them and they are different only because just how common lines are matched up, which is expected. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: absurdly slow git-diff 2008-11-07 23:48 ` Davide Libenzi 2008-11-07 23:57 ` Linus Torvalds @ 2008-11-08 5:30 ` Junio C Hamano 2008-11-08 16:27 ` Davide Libenzi 1 sibling, 1 reply; 13+ messages in thread From: Junio C Hamano @ 2008-11-08 5:30 UTC (permalink / raw) To: Davide Libenzi Cc: Linus Torvalds, Abhijit Menon-Sen, Pierre Habouzit, Git Mailing List Davide Libenzi <davidel@xmailserver.org> writes: > Yeah, similar. Mine is below. There's one less branch in the for loops. Thanks, will apply like this, but I am not sure if you meant windowN or just window... -- >8 -- From: Davide Libenzi <davidel@xmailserver.org> Date: Fri, 7 Nov 2008 21:24:33 -0800 Subject: [PATCH] xdiff: give up scanning similar lines early In a corner case of large files whose lines do not match uniquely, the loop to eliminate a line that matches multiple locations adjacent to a run of lines that do not uniquely match wasted too much cycles. Fix this by giving up early after scanning 100 lines in both direction. --- xdiff/xprepare.c | 15 +++++++++++++-- 1 files changed, 13 insertions(+), 2 deletions(-) diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c index e87ab57..6a70cdf 100644 --- a/xdiff/xprepare.c +++ b/xdiff/xprepare.c @@ -23,10 +23,9 @@ #include "xinclude.h" - #define XDL_KPDIS_RUN 4 #define XDL_MAX_EQLIMIT 1024 - +#define XDL_SIMSCAN_WINDOWN 100 typedef struct s_xdlclass { @@ -313,6 +312,18 @@ static int xdl_clean_mmatch(char const *dis, long i, long s, long e) { long r, rdis0, rpdis0, rdis1, rpdis1; /* + * Limits the window the is examined during the similar-lines + * scan. The loops below stops when dis[i - r] == 1 (line that + * has no match), but there are corner cases where the loop + * proceed all the way to the extremities by causing huge + * performance penalties in case of big files. + */ + if (i - s > XDL_SIMSCAN_WINDOWN) + s = i - XDL_SIMSCAN_WINDOWN; + if (e - i > XDL_SIMSCAN_WINDOWN) + e = i + XDL_SIMSCAN_WINDOWN; + + /* * Scans the lines before 'i' to find a run of lines that either * have no match (dis[j] == 0) or have multiple matches (dis[j] > 1). * Note that we always call this function with dis[i] > 1, so the -- 1.6.0.3.674.gdf99f ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: absurdly slow git-diff 2008-11-08 5:30 ` Junio C Hamano @ 2008-11-08 16:27 ` Davide Libenzi 0 siblings, 0 replies; 13+ messages in thread From: Davide Libenzi @ 2008-11-08 16:27 UTC (permalink / raw) To: Junio C Hamano Cc: Linus Torvalds, Abhijit Menon-Sen, Pierre Habouzit, Git Mailing List On Fri, 7 Nov 2008, Junio C Hamano wrote: > Davide Libenzi <davidel@xmailserver.org> writes: > > > Yeah, similar. Mine is below. There's one less branch in the for loops. > > Thanks, will apply like this, but I am not sure if you meant windowN or > just window... Whoops, just WINDOW. - Davide ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: absurdly slow git-diff 2008-11-07 21:37 ` Linus Torvalds 2008-11-07 23:04 ` Davide Libenzi @ 2008-11-08 0:14 ` Pierre Habouzit 1 sibling, 0 replies; 13+ messages in thread From: Pierre Habouzit @ 2008-11-08 0:14 UTC (permalink / raw) To: Linus Torvalds; +Cc: Abhijit Menon-Sen, Davide Libenzi, Git Mailing List [-- Attachment #1: Type: text/plain, Size: 1163 bytes --] On Fri, Nov 07, 2008 at 09:37:29PM +0000, Linus Torvalds wrote: > > On Sat, 8 Nov 2008, Abhijit Menon-Sen wrote: > > > > If anyone's interested, the files are http://toroid.org/misc/1 and > > http://toroid.org/misc/2 > I'm also adding Pierre, since he was talking about trying to implement > another diff algorithm (although I'm not at all sure that the patience > diff really would help this case at all). FWIW Patience diff wouldn't help at all here. Patience diff is just a matter of preseeding your preferred diff algorithm with better (wrt human readability) candidate for the invariant lines. IOW it helps dividing the problem into smaller bits, but requires *unique lines* to start with. If you haven't any, then basically, Patience diff does nothing and calls your usual diff algorithm on the whole files. It does so in a pseudo linear complexity, hence should not make overall time really worse, but will not help for the ending time usually either. -- ·O· Pierre Habouzit ··O madcoder@debian.org OOO http://www.madism.org [-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2008-11-08 21:04 UTC | newest] Thread overview: 13+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-11-07 20:01 absurdly slow git-diff Abhijit Menon-Sen 2008-11-07 21:28 ` Mike Hommey 2008-11-07 21:37 ` Linus Torvalds 2008-11-07 23:04 ` Davide Libenzi 2008-11-07 23:18 ` Davide Libenzi 2008-11-07 23:42 ` Linus Torvalds 2008-11-07 23:48 ` Davide Libenzi 2008-11-07 23:57 ` Linus Torvalds 2008-11-08 4:57 ` Abhijit Menon-Sen 2008-11-08 21:02 ` Junio C Hamano 2008-11-08 5:30 ` Junio C Hamano 2008-11-08 16:27 ` Davide Libenzi 2008-11-08 0:14 ` Pierre Habouzit
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox