* Re: Linux 2.6.24-rc6 [not found] ` <20071221030152.GC8535@fattire.cabal.ca> @ 2007-12-21 3:49 ` Linus Torvalds 2007-12-21 3:50 ` Kyle McMartin 2007-12-21 4:22 ` Linus Torvalds 0 siblings, 2 replies; 9+ messages in thread From: Linus Torvalds @ 2007-12-21 3:49 UTC (permalink / raw) To: Kyle McMartin, Junio C Hamano, Git Mailing List; +Cc: Linux Kernel Mailing List On Thu, 20 Dec 2007, Kyle McMartin wrote: > > I think I see the problem, it's lack of context in the diff, No, the problem is that "git diff" is apparently broken by a recent optimization. The diff is simply broken. The tar-ball and the git archive itself is fine, but yes, the diff from 2.6.23 to 2.6.24-rc6 is bad. It's the "trim_common_tail()" optimization that has caused way too much pain. Sorry about that, I'll fix it up asap. Linus ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Linux 2.6.24-rc6 2007-12-21 3:49 ` Linux 2.6.24-rc6 Linus Torvalds @ 2007-12-21 3:50 ` Kyle McMartin 2007-12-21 4:22 ` Linus Torvalds 1 sibling, 0 replies; 9+ messages in thread From: Kyle McMartin @ 2007-12-21 3:50 UTC (permalink / raw) To: Linus Torvalds Cc: Junio C Hamano, Git Mailing List, Linux Kernel Mailing List On Thu, Dec 20, 2007 at 07:49:05PM -0800, Linus Torvalds wrote: > > > On Thu, 20 Dec 2007, Kyle McMartin wrote: > > > > I think I see the problem, it's lack of context in the diff, > > No, the problem is that "git diff" is apparently broken by a recent > optimization. The diff is simply broken. > > The tar-ball and the git archive itself is fine, but yes, the diff from > 2.6.23 to 2.6.24-rc6 is bad. It's the "trim_common_tail()" optimization > that has caused way too much pain. > > Sorry about that, I'll fix it up asap. > no biggie, thanks! cheers, Kyle ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Linux 2.6.24-rc6 2007-12-21 3:49 ` Linux 2.6.24-rc6 Linus Torvalds 2007-12-21 3:50 ` Kyle McMartin @ 2007-12-21 4:22 ` Linus Torvalds 2007-12-21 4:40 ` Linus Torvalds 2007-12-21 4:58 ` Linus Torvalds 1 sibling, 2 replies; 9+ messages in thread From: Linus Torvalds @ 2007-12-21 4:22 UTC (permalink / raw) To: Kyle McMartin, Junio C Hamano, Git Mailing List; +Cc: Linux Kernel Mailing List On Thu, 20 Dec 2007, Linus Torvalds wrote: > > The tar-ball and the git archive itself is fine, but yes, the diff from > 2.6.23 to 2.6.24-rc6 is bad. It's the "trim_common_tail()" optimization > that has caused way too much pain. Very interesting breakage. The patch was actually "correct" in a (rather limited) technical sense, but the context at the end was missing because while the trim_common_tail() code made sure to keep enough common context to allow a valid diff to be generated, the diff machinery itself could decide that it could generate the diff differently than the "obvious" solution. It only happened for a few files that had lots of repeated lines - so that the diff could literally be done multiple different ways - and in fact, the file that caused the problems really had a bogus commit that duplicated *way* too much data, and caused lots of #define's to exist twice. But the sad fact appears that the git optimization (which is very important for "git blame", which needs no context), is only really valid for that one case where we really don't need any context. I uploaded a fixed patch. And here's the git patch to avoid this optimization when there is context. Linus --- xdiff-interface.c | 12 ++++++------ 1 files changed, 6 insertions(+), 6 deletions(-) diff --git a/xdiff-interface.c b/xdiff-interface.c index 9ee877c..0b7e057 100644 --- a/xdiff-interface.c +++ b/xdiff-interface.c @@ -110,22 +110,22 @@ int xdiff_outf(void *priv_, mmbuffer_t *mb, int nbuf) static void trim_common_tail(mmfile_t *a, mmfile_t *b, long ctx) { const int blk = 1024; - long trimmed = 0, recovered = 0; + long trimmed = 0; char *ap = a->ptr + a->size; char *bp = b->ptr + b->size; long smaller = (a->size < b->size) ? a->size : b->size; + if (ctx) + return; + while (blk + trimmed <= smaller && !memcmp(ap - blk, bp - blk, blk)) { trimmed += blk; ap -= blk; bp -= blk; } - while (recovered < trimmed && 0 <= ctx) - if (ap[recovered++] == '\n') - ctx--; - a->size -= (trimmed - recovered); - b->size -= (trimmed - recovered); + a->size -= trimmed; + b->size -= trimmed; } int xdi_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, xdemitconf_t const *xecfg, xdemitcb_t *xecb) ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: Linux 2.6.24-rc6 2007-12-21 4:22 ` Linus Torvalds @ 2007-12-21 4:40 ` Linus Torvalds 2007-12-21 4:50 ` Kyle McMartin 2007-12-21 4:58 ` Linus Torvalds 1 sibling, 1 reply; 9+ messages in thread From: Linus Torvalds @ 2007-12-21 4:40 UTC (permalink / raw) To: Kyle McMartin, Junio C Hamano, Git Mailing List, Raphael Assenat, Andrew Morton Cc: Linux Kernel Mailing List On Thu, 20 Dec 2007, Linus Torvalds wrote: > > It only happened for a few files that had lots of repeated lines - so that > the diff could literally be done multiple different ways - and in fact, > the file that caused the problems really had a bogus commit that > duplicated *way* too much data, and caused lots of #define's to exist > twice. Here's the example of this kind of behaviour: in the 2.6.26-rc5 tree the file drivers/video/mbx/reg_bits.h has the #defines for /* DINTRS - Display Interrupt Status Register */ /* DINTRE - Display Interrupt Enable Register */ duplicated twice due to commit ba282daa919f89c871780f344a71e5403a70b634 ("mbxfb: Improvements and new features") by Raphael Assenat mistakenly adding another copy of the same old set of defines that we already got added once before by commit fb137d5b7f2301f2717944322bba38039083c431 ("mbxfb: Add more registers bits access macros"). Now, that was a mistake - and one that probably happened because Rafael or more likely Andrew Morton used GNU patch with its insane defaults (which is to happily apply the same patch that adds things twice, because it doesn't really care if the context matches or not). But what that kind of thing causes is that when you create a patch of the end result, it can show the now new duplicate lines two different (but equally valid) ways: it can show it as an addition of the _first_ set of lines, or it can show it as an addition of the _second_ set of lines. They are the same, after all. Now, it doesn't really matter which way you choose to show it, although because of how "git diff" finds similarities, it tends to prefer to show the second set of identical lines as the "new" ones. Which is generally reasonable. However, that interacted really badly with the new git logic that said that "if the two files end in the same sequence, just ignore the common tail of the file", because the latter copy of the identical lines would now show up as _part_ of that common tail, so the lines that the git diff machinery would normally like to show up as "new" did in fact end up being considered uninteresting, because they were part of an idential tail. So now "git diff" would happily pick _earlier_ lines as the new ones, and it would still be a conceptually valid diff, but because we had trimmed the tail of the file, that conceptually valid diff no longer had the expected shared context at the end. And while it's a bit embarrassing, I'm really rather happy that both GNU patch and "git apply" actually refused to apply the patch. It may have been "conceptually correct" (ie it did really contain all of the changes!) but because it lacked the expected context it really wasn't a good patch. That was a rather long-winded explanation of what happened, mainly because it was all very unexpected to me, and I had personally mistakenly thought the git optimization was perfectly valid and actually had to go through the end result to see what was going on. Anyway, the diff on kernel.org should be all ok now, and mirrored out too. Linus ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Linux 2.6.24-rc6 2007-12-21 4:40 ` Linus Torvalds @ 2007-12-21 4:50 ` Kyle McMartin 0 siblings, 0 replies; 9+ messages in thread From: Kyle McMartin @ 2007-12-21 4:50 UTC (permalink / raw) To: Linus Torvalds Cc: Junio C Hamano, Git Mailing List, Raphael Assenat, Andrew Morton, Linux Kernel Mailing List On Thu, Dec 20, 2007 at 08:40:54PM -0800, Linus Torvalds wrote: > That was a rather long-winded explanation of what happened, mainly because > it was all very unexpected to me, and I had personally mistakenly thought > the git optimization was perfectly valid and actually had to go through > the end result to see what was going on. > > Anyway, the diff on kernel.org should be all ok now, and mirrored out too. > Thanks again for being so quick to track this down, applies fine and is out for building in rawhide now. cheers, Kyle ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Linux 2.6.24-rc6 2007-12-21 4:22 ` Linus Torvalds 2007-12-21 4:40 ` Linus Torvalds @ 2007-12-21 4:58 ` Linus Torvalds 2007-12-21 5:21 ` Linus Torvalds 2007-12-21 16:56 ` Junio C Hamano 1 sibling, 2 replies; 9+ messages in thread From: Linus Torvalds @ 2007-12-21 4:58 UTC (permalink / raw) To: Kyle McMartin, Junio C Hamano, Git Mailing List; +Cc: Linux Kernel Mailing List On Thu, 20 Dec 2007, Linus Torvalds wrote: > > And here's the git patch to avoid this optimization when there is > context. Actually, the code to finding one '\n' is still needed to avoid the (pathological) case of getting a "\No newline", so scrap that one which was too aggressive, and use this (simpler) one instead. Not that it matters in real life, since nobody uses -U0, and "git blame" won't care. But let's get it right anyway ;) This whole function has had more bugs than it has lines. Linus --- xdiff-interface.c | 7 +++++-- 1 files changed, 5 insertions(+), 2 deletions(-) diff --git a/xdiff-interface.c b/xdiff-interface.c index 9ee877c..711029e 100644 --- a/xdiff-interface.c +++ b/xdiff-interface.c @@ -115,15 +115,18 @@ static void trim_common_tail(mmfile_t *a, mmfile_t *b, long ctx) char *bp = b->ptr + b->size; long smaller = (a->size < b->size) ? a->size : b->size; + if (ctx) + return; + while (blk + trimmed <= smaller && !memcmp(ap - blk, bp - blk, blk)) { trimmed += blk; ap -= blk; bp -= blk; } - while (recovered < trimmed && 0 <= ctx) + while (recovered < trimmed) if (ap[recovered++] == '\n') - ctx--; + break; a->size -= (trimmed - recovered); b->size -= (trimmed - recovered); } ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: Linux 2.6.24-rc6 2007-12-21 4:58 ` Linus Torvalds @ 2007-12-21 5:21 ` Linus Torvalds 2007-12-21 17:45 ` Linus Torvalds 2007-12-21 16:56 ` Junio C Hamano 1 sibling, 1 reply; 9+ messages in thread From: Linus Torvalds @ 2007-12-21 5:21 UTC (permalink / raw) To: Junio C Hamano, Git Mailing List On Thu, 20 Dec 2007, Linus Torvalds wrote: > > Not that it matters in real life, since nobody uses -U0, and "git blame" > won't care. But let's get it right anyway ;) .. and here's a test-case to see the *impact* of this whole issue: yes | head -1000 > a yes | head -2000 > b git diff -U0 a b | grep @@ diff -U0 a b | grep @@ and notice the differences. Both will add a thousand lines of "y", but the "git diff" will add it at a different point, ie git says: @@ -488,0 +489,1000 @@ y while a non-tail-optimizing diff will say @@ -1000,0 +1001,1000 @@ which may be a bit more obvious. Both answers are *correct*, though. The particular choice of "insert at line 489, after line 488" is a bit odd, but is because we don't actually search to exactly the beginning of where the differences started, we search in blocks of 1kB and then we go forward to the next newline. (We could also go to exactly where the differences started, and if the previous character was a newline or the beginning of the file, we'd not move forward at all, and then we'd get a more "logical" diff of inserting at the beginning). Considering that the answer is correct, and this only happens for insane cases anyway, I don't really think anybody cares, but it's an interesting case nonetheless. Linus ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Linux 2.6.24-rc6 2007-12-21 5:21 ` Linus Torvalds @ 2007-12-21 17:45 ` Linus Torvalds 0 siblings, 0 replies; 9+ messages in thread From: Linus Torvalds @ 2007-12-21 17:45 UTC (permalink / raw) To: Junio C Hamano, Git Mailing List On Thu, 20 Dec 2007, Linus Torvalds wrote: > > Both answers are *correct*, though. The particular choice of "insert at > line 489, after line 488" is a bit odd, but is because we don't actually > search to exactly the beginning of where the differences started, we > search in blocks of 1kB and then we go forward to the next newline. This slightly more involved diff does a better job at this particular issue. Whether the complexity is worth it or not, I dunno, but it changes the "remove common lines at the end" to do an exact job, which for this particular test-case means that the end result of adding a thousand lines of 'y' will look like [torvalds@woody ~]$ git diff -U0 a b | grep @@ @@ -0,0 +1,1000 @@ instead - ie it will say that they were added at the very beginning of the file rather than added at some arbitrary point in the middle. Whether this is really worth it, I dunno. Also, I'm kind of debating with myself whether it would make most sense to only do this kind of optimization when (pick arbitrary cut-off here) something like more than half of the file is identical at the end. If we don't have a noticeable fraction of the file being the same, it may not make sense to really bother with this, since it really is meant for just things like ChangeLog files etc that have data added at the beginning. That would make this whole optimization a lot more targeted to the case where it really matters and really helps. I also do have an incling of a really evil way to make xdiff handle the case of having multiple lines of context right too, and basically just move all of this logic into xdiff itself rather than have this interface-level hack, but I'll have to let that idea brew for a while yet. Linus --- xdiff-interface.c | 38 ++++++++++++++++++++++++++++++-------- 1 files changed, 30 insertions(+), 8 deletions(-) diff --git a/xdiff-interface.c b/xdiff-interface.c index 9ee877c..54a53d2 100644 --- a/xdiff-interface.c +++ b/xdiff-interface.c @@ -109,21 +109,43 @@ int xdiff_outf(void *priv_, mmbuffer_t *mb, int nbuf) */ static void trim_common_tail(mmfile_t *a, mmfile_t *b, long ctx) { - const int blk = 1024; + int blk = 1024; long trimmed = 0, recovered = 0; char *ap = a->ptr + a->size; char *bp = b->ptr + b->size; long smaller = (a->size < b->size) ? a->size : b->size; - while (blk + trimmed <= smaller && !memcmp(ap - blk, bp - blk, blk)) { - trimmed += blk; - ap -= blk; - bp -= blk; - } + if (ctx) + return; + + do { + while (blk + trimmed <= smaller && !memcmp(ap - blk, bp - blk, blk)) { + trimmed += blk; + ap -= blk; + bp -= blk; + } + blk /= 2; + } while (blk); + + /* Did we trim one of them all away? */ + if (trimmed == smaller) { + char *bigger; + if (a->size == b->size) + return; + bigger = a->ptr; + if (a->size > b->size) + bigger = b->ptr; + + /* Did the other one end in a newline? */ + if (bigger[trimmed-1] == '\n') + goto done; + } - while (recovered < trimmed && 0 <= ctx) + /* Find the next newline */ + while (recovered < trimmed) if (ap[recovered++] == '\n') - ctx--; + break; +done: a->size -= (trimmed - recovered); b->size -= (trimmed - recovered); } ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: Linux 2.6.24-rc6 2007-12-21 4:58 ` Linus Torvalds 2007-12-21 5:21 ` Linus Torvalds @ 2007-12-21 16:56 ` Junio C Hamano 1 sibling, 0 replies; 9+ messages in thread From: Junio C Hamano @ 2007-12-21 16:56 UTC (permalink / raw) To: Linus Torvalds; +Cc: Kyle McMartin, Git Mailing List, Linux Kernel Mailing List Linus Torvalds <torvalds@linux-foundation.org> writes: > Actually, the code to finding one '\n' is still needed to avoid the > (pathological) case of getting a "\No newline", so scrap that one which > was too aggressive, and use this (simpler) one instead. > > Not that it matters in real life, since nobody uses -U0, and "git blame" > won't care. But let's get it right anyway ;) Actually "blame won't care" is a bit too strong. It's only we (you) made it not to care. It is a different story if the change to make it not to care was sensible. The diff text "git blame" will see is affected with the tail trimming optimization exactly the same way as the optimization triggered this thread. With the common tails trimmed, the hunks match differently from the way they match without trimming (the gcc changelog testcase has differences between the unoptimized blame and the tail-trimming one --- your original to put this logic only in blame and my rewrite to move it in xdiff-interface produce the same result that is different from the unoptimized version, although both are 4x faster than the original). When there are multiple possible matches, any match among them is a correct match, and a match with a line in a blob is a match to the blob no matter what line the match is anyway. The usual workflow of checking blame to find the commit that introduced the change and then going to "git show" to view the bigger picture won't get affected. But the blamed line numbers will be different from the unoptimized blame, and it may not match the expectation of human readers. It won't match "git show" output of the blamed commit. > This whole function has had more bugs than it has lines. I have to agree. It started as a brilliant idea but then it was enhanced (in an attempt to support context) and executed not so brilliantly. ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2007-12-21 17:46 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- [not found] <alpine.LFD.0.9999.0712201731010.21557@woody.linux-foundation.org> [not found] ` <20071221024805.GB8535@fattire.cabal.ca> [not found] ` <20071221030152.GC8535@fattire.cabal.ca> 2007-12-21 3:49 ` Linux 2.6.24-rc6 Linus Torvalds 2007-12-21 3:50 ` Kyle McMartin 2007-12-21 4:22 ` Linus Torvalds 2007-12-21 4:40 ` Linus Torvalds 2007-12-21 4:50 ` Kyle McMartin 2007-12-21 4:58 ` Linus Torvalds 2007-12-21 5:21 ` Linus Torvalds 2007-12-21 17:45 ` Linus Torvalds 2007-12-21 16:56 ` Junio C Hamano
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).