* Re: Linux 2.6.24-rc6
[not found] ` <20071221030152.GC8535@fattire.cabal.ca>
@ 2007-12-21 3:49 ` Linus Torvalds
2007-12-21 3:50 ` Kyle McMartin
2007-12-21 4:22 ` Linus Torvalds
0 siblings, 2 replies; 9+ messages in thread
From: Linus Torvalds @ 2007-12-21 3:49 UTC (permalink / raw)
To: Kyle McMartin, Junio C Hamano, Git Mailing List; +Cc: Linux Kernel Mailing List
On Thu, 20 Dec 2007, Kyle McMartin wrote:
>
> I think I see the problem, it's lack of context in the diff,
No, the problem is that "git diff" is apparently broken by a recent
optimization. The diff is simply broken.
The tar-ball and the git archive itself is fine, but yes, the diff from
2.6.23 to 2.6.24-rc6 is bad. It's the "trim_common_tail()" optimization
that has caused way too much pain.
Sorry about that, I'll fix it up asap.
Linus
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Linux 2.6.24-rc6
2007-12-21 3:49 ` Linux 2.6.24-rc6 Linus Torvalds
@ 2007-12-21 3:50 ` Kyle McMartin
2007-12-21 4:22 ` Linus Torvalds
1 sibling, 0 replies; 9+ messages in thread
From: Kyle McMartin @ 2007-12-21 3:50 UTC (permalink / raw)
To: Linus Torvalds
Cc: Junio C Hamano, Git Mailing List, Linux Kernel Mailing List
On Thu, Dec 20, 2007 at 07:49:05PM -0800, Linus Torvalds wrote:
>
>
> On Thu, 20 Dec 2007, Kyle McMartin wrote:
> >
> > I think I see the problem, it's lack of context in the diff,
>
> No, the problem is that "git diff" is apparently broken by a recent
> optimization. The diff is simply broken.
>
> The tar-ball and the git archive itself is fine, but yes, the diff from
> 2.6.23 to 2.6.24-rc6 is bad. It's the "trim_common_tail()" optimization
> that has caused way too much pain.
>
> Sorry about that, I'll fix it up asap.
>
no biggie, thanks!
cheers, Kyle
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Linux 2.6.24-rc6
2007-12-21 3:49 ` Linux 2.6.24-rc6 Linus Torvalds
2007-12-21 3:50 ` Kyle McMartin
@ 2007-12-21 4:22 ` Linus Torvalds
2007-12-21 4:40 ` Linus Torvalds
2007-12-21 4:58 ` Linus Torvalds
1 sibling, 2 replies; 9+ messages in thread
From: Linus Torvalds @ 2007-12-21 4:22 UTC (permalink / raw)
To: Kyle McMartin, Junio C Hamano, Git Mailing List; +Cc: Linux Kernel Mailing List
On Thu, 20 Dec 2007, Linus Torvalds wrote:
>
> The tar-ball and the git archive itself is fine, but yes, the diff from
> 2.6.23 to 2.6.24-rc6 is bad. It's the "trim_common_tail()" optimization
> that has caused way too much pain.
Very interesting breakage. The patch was actually "correct" in a (rather
limited) technical sense, but the context at the end was missing because
while the trim_common_tail() code made sure to keep enough common context
to allow a valid diff to be generated, the diff machinery itself could
decide that it could generate the diff differently than the "obvious"
solution.
It only happened for a few files that had lots of repeated lines - so that
the diff could literally be done multiple different ways - and in fact,
the file that caused the problems really had a bogus commit that
duplicated *way* too much data, and caused lots of #define's to exist
twice.
But the sad fact appears that the git optimization (which is very
important for "git blame", which needs no context), is only really valid
for that one case where we really don't need any context.
I uploaded a fixed patch. And here's the git patch to avoid this
optimization when there is context.
Linus
---
xdiff-interface.c | 12 ++++++------
1 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/xdiff-interface.c b/xdiff-interface.c
index 9ee877c..0b7e057 100644
--- a/xdiff-interface.c
+++ b/xdiff-interface.c
@@ -110,22 +110,22 @@ int xdiff_outf(void *priv_, mmbuffer_t *mb, int nbuf)
static void trim_common_tail(mmfile_t *a, mmfile_t *b, long ctx)
{
const int blk = 1024;
- long trimmed = 0, recovered = 0;
+ long trimmed = 0;
char *ap = a->ptr + a->size;
char *bp = b->ptr + b->size;
long smaller = (a->size < b->size) ? a->size : b->size;
+ if (ctx)
+ return;
+
while (blk + trimmed <= smaller && !memcmp(ap - blk, bp - blk, blk)) {
trimmed += blk;
ap -= blk;
bp -= blk;
}
- while (recovered < trimmed && 0 <= ctx)
- if (ap[recovered++] == '\n')
- ctx--;
- a->size -= (trimmed - recovered);
- b->size -= (trimmed - recovered);
+ a->size -= trimmed;
+ b->size -= trimmed;
}
int xdi_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, xdemitconf_t const *xecfg, xdemitcb_t *xecb)
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: Linux 2.6.24-rc6
2007-12-21 4:22 ` Linus Torvalds
@ 2007-12-21 4:40 ` Linus Torvalds
2007-12-21 4:50 ` Kyle McMartin
2007-12-21 4:58 ` Linus Torvalds
1 sibling, 1 reply; 9+ messages in thread
From: Linus Torvalds @ 2007-12-21 4:40 UTC (permalink / raw)
To: Kyle McMartin, Junio C Hamano, Git Mailing List, Raphael Assenat,
Andrew Morton
Cc: Linux Kernel Mailing List
On Thu, 20 Dec 2007, Linus Torvalds wrote:
>
> It only happened for a few files that had lots of repeated lines - so that
> the diff could literally be done multiple different ways - and in fact,
> the file that caused the problems really had a bogus commit that
> duplicated *way* too much data, and caused lots of #define's to exist
> twice.
Here's the example of this kind of behaviour: in the 2.6.26-rc5 tree the
file drivers/video/mbx/reg_bits.h has the #defines for
/* DINTRS - Display Interrupt Status Register */
/* DINTRE - Display Interrupt Enable Register */
duplicated twice due to commit ba282daa919f89c871780f344a71e5403a70b634
("mbxfb: Improvements and new features") by Raphael Assenat mistakenly
adding another copy of the same old set of defines that we already got
added once before by commit fb137d5b7f2301f2717944322bba38039083c431
("mbxfb: Add more registers bits access macros").
Now, that was a mistake - and one that probably happened because Rafael or
more likely Andrew Morton used GNU patch with its insane defaults (which
is to happily apply the same patch that adds things twice, because it
doesn't really care if the context matches or not).
But what that kind of thing causes is that when you create a patch of the
end result, it can show the now new duplicate lines two different (but
equally valid) ways: it can show it as an addition of the _first_ set of
lines, or it can show it as an addition of the _second_ set of lines. They
are the same, after all.
Now, it doesn't really matter which way you choose to show it, although
because of how "git diff" finds similarities, it tends to prefer to show
the second set of identical lines as the "new" ones. Which is generally
reasonable.
However, that interacted really badly with the new git logic that said
that "if the two files end in the same sequence, just ignore the common
tail of the file", because the latter copy of the identical lines would
now show up as _part_ of that common tail, so the lines that the git diff
machinery would normally like to show up as "new" did in fact end up being
considered uninteresting, because they were part of an idential tail.
So now "git diff" would happily pick _earlier_ lines as the new ones, and
it would still be a conceptually valid diff, but because we had trimmed
the tail of the file, that conceptually valid diff no longer had the
expected shared context at the end.
And while it's a bit embarrassing, I'm really rather happy that both GNU
patch and "git apply" actually refused to apply the patch. It may have
been "conceptually correct" (ie it did really contain all of the changes!)
but because it lacked the expected context it really wasn't a good patch.
That was a rather long-winded explanation of what happened, mainly because
it was all very unexpected to me, and I had personally mistakenly thought
the git optimization was perfectly valid and actually had to go through
the end result to see what was going on.
Anyway, the diff on kernel.org should be all ok now, and mirrored out too.
Linus
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Linux 2.6.24-rc6
2007-12-21 4:40 ` Linus Torvalds
@ 2007-12-21 4:50 ` Kyle McMartin
0 siblings, 0 replies; 9+ messages in thread
From: Kyle McMartin @ 2007-12-21 4:50 UTC (permalink / raw)
To: Linus Torvalds
Cc: Junio C Hamano, Git Mailing List, Raphael Assenat, Andrew Morton,
Linux Kernel Mailing List
On Thu, Dec 20, 2007 at 08:40:54PM -0800, Linus Torvalds wrote:
> That was a rather long-winded explanation of what happened, mainly because
> it was all very unexpected to me, and I had personally mistakenly thought
> the git optimization was perfectly valid and actually had to go through
> the end result to see what was going on.
>
> Anyway, the diff on kernel.org should be all ok now, and mirrored out too.
>
Thanks again for being so quick to track this down, applies fine and is
out for building in rawhide now.
cheers, Kyle
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Linux 2.6.24-rc6
2007-12-21 4:22 ` Linus Torvalds
2007-12-21 4:40 ` Linus Torvalds
@ 2007-12-21 4:58 ` Linus Torvalds
2007-12-21 5:21 ` Linus Torvalds
2007-12-21 16:56 ` Junio C Hamano
1 sibling, 2 replies; 9+ messages in thread
From: Linus Torvalds @ 2007-12-21 4:58 UTC (permalink / raw)
To: Kyle McMartin, Junio C Hamano, Git Mailing List; +Cc: Linux Kernel Mailing List
On Thu, 20 Dec 2007, Linus Torvalds wrote:
>
> And here's the git patch to avoid this optimization when there is
> context.
Actually, the code to finding one '\n' is still needed to avoid the
(pathological) case of getting a "\No newline", so scrap that one which
was too aggressive, and use this (simpler) one instead.
Not that it matters in real life, since nobody uses -U0, and "git blame"
won't care. But let's get it right anyway ;)
This whole function has had more bugs than it has lines.
Linus
---
xdiff-interface.c | 7 +++++--
1 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/xdiff-interface.c b/xdiff-interface.c
index 9ee877c..711029e 100644
--- a/xdiff-interface.c
+++ b/xdiff-interface.c
@@ -115,15 +115,18 @@ static void trim_common_tail(mmfile_t *a, mmfile_t *b, long ctx)
char *bp = b->ptr + b->size;
long smaller = (a->size < b->size) ? a->size : b->size;
+ if (ctx)
+ return;
+
while (blk + trimmed <= smaller && !memcmp(ap - blk, bp - blk, blk)) {
trimmed += blk;
ap -= blk;
bp -= blk;
}
- while (recovered < trimmed && 0 <= ctx)
+ while (recovered < trimmed)
if (ap[recovered++] == '\n')
- ctx--;
+ break;
a->size -= (trimmed - recovered);
b->size -= (trimmed - recovered);
}
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: Linux 2.6.24-rc6
2007-12-21 4:58 ` Linus Torvalds
@ 2007-12-21 5:21 ` Linus Torvalds
2007-12-21 17:45 ` Linus Torvalds
2007-12-21 16:56 ` Junio C Hamano
1 sibling, 1 reply; 9+ messages in thread
From: Linus Torvalds @ 2007-12-21 5:21 UTC (permalink / raw)
To: Junio C Hamano, Git Mailing List
On Thu, 20 Dec 2007, Linus Torvalds wrote:
>
> Not that it matters in real life, since nobody uses -U0, and "git blame"
> won't care. But let's get it right anyway ;)
.. and here's a test-case to see the *impact* of this whole issue:
yes | head -1000 > a
yes | head -2000 > b
git diff -U0 a b | grep @@
diff -U0 a b | grep @@
and notice the differences.
Both will add a thousand lines of "y", but the "git diff" will add it at a
different point, ie git says:
@@ -488,0 +489,1000 @@ y
while a non-tail-optimizing diff will say
@@ -1000,0 +1001,1000 @@
which may be a bit more obvious.
Both answers are *correct*, though. The particular choice of "insert at
line 489, after line 488" is a bit odd, but is because we don't actually
search to exactly the beginning of where the differences started, we
search in blocks of 1kB and then we go forward to the next newline.
(We could also go to exactly where the differences started, and if the
previous character was a newline or the beginning of the file, we'd not
move forward at all, and then we'd get a more "logical" diff of inserting
at the beginning).
Considering that the answer is correct, and this only happens for insane
cases anyway, I don't really think anybody cares, but it's an interesting
case nonetheless.
Linus
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Linux 2.6.24-rc6
2007-12-21 4:58 ` Linus Torvalds
2007-12-21 5:21 ` Linus Torvalds
@ 2007-12-21 16:56 ` Junio C Hamano
1 sibling, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2007-12-21 16:56 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Kyle McMartin, Git Mailing List, Linux Kernel Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
> Actually, the code to finding one '\n' is still needed to avoid the
> (pathological) case of getting a "\No newline", so scrap that one which
> was too aggressive, and use this (simpler) one instead.
>
> Not that it matters in real life, since nobody uses -U0, and "git blame"
> won't care. But let's get it right anyway ;)
Actually "blame won't care" is a bit too strong. It's only we
(you) made it not to care. It is a different story if the
change to make it not to care was sensible.
The diff text "git blame" will see is affected with the tail
trimming optimization exactly the same way as the optimization
triggered this thread. With the common tails trimmed, the hunks
match differently from the way they match without trimming (the
gcc changelog testcase has differences between the unoptimized
blame and the tail-trimming one --- your original to put this
logic only in blame and my rewrite to move it in xdiff-interface
produce the same result that is different from the unoptimized
version, although both are 4x faster than the original).
When there are multiple possible matches, any match among them
is a correct match, and a match with a line in a blob is a match
to the blob no matter what line the match is anyway. The usual
workflow of checking blame to find the commit that introduced
the change and then going to "git show" to view the bigger
picture won't get affected. But the blamed line numbers will be
different from the unoptimized blame, and it may not match the
expectation of human readers. It won't match "git show" output
of the blamed commit.
> This whole function has had more bugs than it has lines.
I have to agree. It started as a brilliant idea but then it was
enhanced (in an attempt to support context) and executed not so
brilliantly.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Linux 2.6.24-rc6
2007-12-21 5:21 ` Linus Torvalds
@ 2007-12-21 17:45 ` Linus Torvalds
0 siblings, 0 replies; 9+ messages in thread
From: Linus Torvalds @ 2007-12-21 17:45 UTC (permalink / raw)
To: Junio C Hamano, Git Mailing List
On Thu, 20 Dec 2007, Linus Torvalds wrote:
>
> Both answers are *correct*, though. The particular choice of "insert at
> line 489, after line 488" is a bit odd, but is because we don't actually
> search to exactly the beginning of where the differences started, we
> search in blocks of 1kB and then we go forward to the next newline.
This slightly more involved diff does a better job at this particular
issue. Whether the complexity is worth it or not, I dunno, but it changes
the "remove common lines at the end" to do an exact job, which for this
particular test-case means that the end result of adding a thousand lines
of 'y' will look like
[torvalds@woody ~]$ git diff -U0 a b | grep @@
@@ -0,0 +1,1000 @@
instead - ie it will say that they were added at the very beginning of the
file rather than added at some arbitrary point in the middle.
Whether this is really worth it, I dunno.
Also, I'm kind of debating with myself whether it would make most sense to
only do this kind of optimization when (pick arbitrary cut-off here)
something like more than half of the file is identical at the end. If we
don't have a noticeable fraction of the file being the same, it may not
make sense to really bother with this, since it really is meant for just
things like ChangeLog files etc that have data added at the beginning.
That would make this whole optimization a lot more targeted to the case
where it really matters and really helps.
I also do have an incling of a really evil way to make xdiff handle the
case of having multiple lines of context right too, and basically just
move all of this logic into xdiff itself rather than have this
interface-level hack, but I'll have to let that idea brew for a while yet.
Linus
---
xdiff-interface.c | 38 ++++++++++++++++++++++++++++++--------
1 files changed, 30 insertions(+), 8 deletions(-)
diff --git a/xdiff-interface.c b/xdiff-interface.c
index 9ee877c..54a53d2 100644
--- a/xdiff-interface.c
+++ b/xdiff-interface.c
@@ -109,21 +109,43 @@ int xdiff_outf(void *priv_, mmbuffer_t *mb, int nbuf)
*/
static void trim_common_tail(mmfile_t *a, mmfile_t *b, long ctx)
{
- const int blk = 1024;
+ int blk = 1024;
long trimmed = 0, recovered = 0;
char *ap = a->ptr + a->size;
char *bp = b->ptr + b->size;
long smaller = (a->size < b->size) ? a->size : b->size;
- while (blk + trimmed <= smaller && !memcmp(ap - blk, bp - blk, blk)) {
- trimmed += blk;
- ap -= blk;
- bp -= blk;
- }
+ if (ctx)
+ return;
+
+ do {
+ while (blk + trimmed <= smaller && !memcmp(ap - blk, bp - blk, blk)) {
+ trimmed += blk;
+ ap -= blk;
+ bp -= blk;
+ }
+ blk /= 2;
+ } while (blk);
+
+ /* Did we trim one of them all away? */
+ if (trimmed == smaller) {
+ char *bigger;
+ if (a->size == b->size)
+ return;
+ bigger = a->ptr;
+ if (a->size > b->size)
+ bigger = b->ptr;
+
+ /* Did the other one end in a newline? */
+ if (bigger[trimmed-1] == '\n')
+ goto done;
+ }
- while (recovered < trimmed && 0 <= ctx)
+ /* Find the next newline */
+ while (recovered < trimmed)
if (ap[recovered++] == '\n')
- ctx--;
+ break;
+done:
a->size -= (trimmed - recovered);
b->size -= (trimmed - recovered);
}
^ permalink raw reply related [flat|nested] 9+ messages in thread
end of thread, other threads:[~2007-12-21 17:46 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <alpine.LFD.0.9999.0712201731010.21557@woody.linux-foundation.org>
[not found] ` <20071221024805.GB8535@fattire.cabal.ca>
[not found] ` <20071221030152.GC8535@fattire.cabal.ca>
2007-12-21 3:49 ` Linux 2.6.24-rc6 Linus Torvalds
2007-12-21 3:50 ` Kyle McMartin
2007-12-21 4:22 ` Linus Torvalds
2007-12-21 4:40 ` Linus Torvalds
2007-12-21 4:50 ` Kyle McMartin
2007-12-21 4:58 ` Linus Torvalds
2007-12-21 5:21 ` Linus Torvalds
2007-12-21 17:45 ` Linus Torvalds
2007-12-21 16:56 ` Junio C Hamano
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).