Re: Linux 2.6.24-rc6

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: Linux 2.6.24-rc6
       [not found]   ` <20071221030152.GC8535@fattire.cabal.ca>
@ 2007-12-21  3:49     ` Linus Torvalds
  2007-12-21  3:50       ` Kyle McMartin
  2007-12-21  4:22       ` Linus Torvalds
  0 siblings, 2 replies; 9+ messages in thread
From: Linus Torvalds @ 2007-12-21  3:49 UTC (permalink / raw)
  To: Kyle McMartin, Junio C Hamano, Git Mailing List; +Cc: Linux Kernel Mailing List

On Thu, 20 Dec 2007, Kyle McMartin wrote:
> 
> I think I see the problem, it's lack of context in the diff,

No, the problem is that "git diff" is apparently broken by a recent 
optimization. The diff is simply broken.

The tar-ball and the git archive itself is fine, but yes, the diff from 
2.6.23 to 2.6.24-rc6 is bad. It's the "trim_common_tail()" optimization 
that has caused way too much pain.

Sorry about that, I'll fix it up asap.

		Linus

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Linux 2.6.24-rc6
  2007-12-21  3:49     ` Linux 2.6.24-rc6 Linus Torvalds
@ 2007-12-21  3:50       ` Kyle McMartin
  2007-12-21  4:22       ` Linus Torvalds
  1 sibling, 0 replies; 9+ messages in thread
From: Kyle McMartin @ 2007-12-21  3:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Junio C Hamano, Git Mailing List, Linux Kernel Mailing List

On Thu, Dec 20, 2007 at 07:49:05PM -0800, Linus Torvalds wrote:
> 
> 
> On Thu, 20 Dec 2007, Kyle McMartin wrote:
> > 
> > I think I see the problem, it's lack of context in the diff,
> 
> No, the problem is that "git diff" is apparently broken by a recent 
> optimization. The diff is simply broken.
> 
> The tar-ball and the git archive itself is fine, but yes, the diff from 
> 2.6.23 to 2.6.24-rc6 is bad. It's the "trim_common_tail()" optimization 
> that has caused way too much pain.
> 
> Sorry about that, I'll fix it up asap.
> 

no biggie, thanks!

cheers, Kyle

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Linux 2.6.24-rc6
  2007-12-21  3:49     ` Linux 2.6.24-rc6 Linus Torvalds
  2007-12-21  3:50       ` Kyle McMartin
@ 2007-12-21  4:22       ` Linus Torvalds
  2007-12-21  4:40         ` Linus Torvalds
  2007-12-21  4:58         ` Linus Torvalds
  1 sibling, 2 replies; 9+ messages in thread
From: Linus Torvalds @ 2007-12-21  4:22 UTC (permalink / raw)
  To: Kyle McMartin, Junio C Hamano, Git Mailing List; +Cc: Linux Kernel Mailing List

On Thu, 20 Dec 2007, Linus Torvalds wrote:
> 
> The tar-ball and the git archive itself is fine, but yes, the diff from 
> 2.6.23 to 2.6.24-rc6 is bad. It's the "trim_common_tail()" optimization 
> that has caused way too much pain.

Very interesting breakage. The patch was actually "correct" in a (rather 
limited) technical sense, but the context at the end was missing because 
while the trim_common_tail() code made sure to keep enough common context 
to allow a valid diff to be generated, the diff machinery itself could 
decide that it could generate the diff differently than the "obvious" 
solution.

It only happened for a few files that had lots of repeated lines - so that 
the diff could literally be done multiple different ways - and in fact, 
the file that caused the problems really had a bogus commit that 
duplicated *way* too much data, and caused lots of #define's to exist 
twice.

But the sad fact appears that the git optimization (which is very 
important for "git blame", which needs no context), is only really valid 
for that one case where we really don't need any context.

I uploaded a fixed patch. And here's the git patch to avoid this 
optimization when there is context.

		Linus

---
 xdiff-interface.c |   12 ++++++------
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/xdiff-interface.c b/xdiff-interface.c
index 9ee877c..0b7e057 100644
--- a/xdiff-interface.c
+++ b/xdiff-interface.c
@@ -110,22 +110,22 @@ int xdiff_outf(void *priv_, mmbuffer_t *mb, int nbuf)
 static void trim_common_tail(mmfile_t *a, mmfile_t *b, long ctx)
 {
 	const int blk = 1024;
-	long trimmed = 0, recovered = 0;
+	long trimmed = 0;
 	char *ap = a->ptr + a->size;
 	char *bp = b->ptr + b->size;
 	long smaller = (a->size < b->size) ? a->size : b->size;

+	if (ctx)
+		return;
+
 	while (blk + trimmed <= smaller && !memcmp(ap - blk, bp - blk, blk)) {
 		trimmed += blk;
 		ap -= blk;
 		bp -= blk;
 	}

-	while (recovered < trimmed && 0 <= ctx)
-		if (ap[recovered++] == '\n')
-			ctx--;
-	a->size -= (trimmed - recovered);
-	b->size -= (trimmed - recovered);
+	a->size -= trimmed;
+	b->size -= trimmed;
 }

 int xdi_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, xdemitconf_t const *xecfg, xdemitcb_t *xecb)

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: Linux 2.6.24-rc6
  2007-12-21  4:22       ` Linus Torvalds
@ 2007-12-21  4:40         ` Linus Torvalds
  2007-12-21  4:50           ` Kyle McMartin
  2007-12-21  4:58         ` Linus Torvalds
  1 sibling, 1 reply; 9+ messages in thread
From: Linus Torvalds @ 2007-12-21  4:40 UTC (permalink / raw)
  To: Kyle McMartin, Junio C Hamano, Git Mailing List, Raphael Assenat,
	Andrew Morton
  Cc: Linux Kernel Mailing List

On Thu, 20 Dec 2007, Linus Torvalds wrote:
> 
> It only happened for a few files that had lots of repeated lines - so that 
> the diff could literally be done multiple different ways - and in fact, 
> the file that caused the problems really had a bogus commit that 
> duplicated *way* too much data, and caused lots of #define's to exist 
> twice.

Here's the example of this kind of behaviour: in the 2.6.26-rc5 tree the 
file drivers/video/mbx/reg_bits.h has the #defines for 

	/* DINTRS - Display Interrupt Status Register */
	/* DINTRE - Display Interrupt Enable Register */

duplicated twice due to commit ba282daa919f89c871780f344a71e5403a70b634 
("mbxfb: Improvements and new features") by Raphael Assenat mistakenly 
adding another copy of the same old set of defines that we already got 
added once before by commit fb137d5b7f2301f2717944322bba38039083c431 
("mbxfb: Add more registers bits access macros").

Now, that was a mistake - and one that probably happened because Rafael or 
more likely Andrew Morton used GNU patch with its insane defaults (which 
is to happily apply the same patch that adds things twice, because it 
doesn't really care if the context matches or not).

But what that kind of thing causes is that when you create a patch of the 
end result, it can show the now new duplicate lines two different (but 
equally valid) ways: it can show it as an addition of the _first_ set of 
lines, or it can show it as an addition of the _second_ set of lines. They 
are the same, after all.

Now, it doesn't really matter which way you choose to show it, although 
because of how "git diff" finds similarities, it tends to prefer to show 
the second set of identical lines as the "new" ones. Which is generally 
reasonable.

However, that interacted really badly with the new git logic that said 
that "if the two files end in the same sequence, just ignore the common 
tail of the file", because the latter copy of the identical lines would 
now show up as _part_ of that common tail, so the lines that the git diff 
machinery would normally like to show up as "new" did in fact end up being 
considered uninteresting, because they were part of an idential tail. 

So now "git diff" would happily pick _earlier_ lines as the new ones, and 
it would still be a conceptually valid diff, but because we had trimmed 
the tail of the file, that conceptually valid diff no longer had the 
expected shared context at the end.

And while it's a bit embarrassing, I'm really rather happy that both GNU 
patch and "git apply" actually refused to apply the patch. It may have 
been "conceptually correct" (ie it did really contain all of the changes!) 
but because it lacked the expected context it really wasn't a good patch. 

That was a rather long-winded explanation of what happened, mainly because 
it was all very unexpected to me, and I had personally mistakenly thought 
the git optimization was perfectly valid and actually had to go through 
the end result to see what was going on.

Anyway, the diff on kernel.org should be all ok now, and mirrored out too.

			Linus

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Linux 2.6.24-rc6
  2007-12-21  4:40         ` Linus Torvalds
@ 2007-12-21  4:50           ` Kyle McMartin
  0 siblings, 0 replies; 9+ messages in thread
From: Kyle McMartin @ 2007-12-21  4:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Junio C Hamano, Git Mailing List, Raphael Assenat, Andrew Morton,
	Linux Kernel Mailing List

On Thu, Dec 20, 2007 at 08:40:54PM -0800, Linus Torvalds wrote:
> That was a rather long-winded explanation of what happened, mainly because 
> it was all very unexpected to me, and I had personally mistakenly thought 
> the git optimization was perfectly valid and actually had to go through 
> the end result to see what was going on.
> 
> Anyway, the diff on kernel.org should be all ok now, and mirrored out too.
> 

Thanks again for being so quick to track this down, applies fine and is
out for building in rawhide now.

cheers, Kyle

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Linux 2.6.24-rc6
  2007-12-21  4:22       ` Linus Torvalds
  2007-12-21  4:40         ` Linus Torvalds
@ 2007-12-21  4:58         ` Linus Torvalds
  2007-12-21  5:21           ` Linus Torvalds
  2007-12-21 16:56           ` Junio C Hamano
  1 sibling, 2 replies; 9+ messages in thread
From: Linus Torvalds @ 2007-12-21  4:58 UTC (permalink / raw)
  To: Kyle McMartin, Junio C Hamano, Git Mailing List; +Cc: Linux Kernel Mailing List



On Thu, 20 Dec 2007, Linus Torvalds wrote:
>
> And here's the git patch to avoid this optimization when there is 
> context.

Actually, the code to finding one '\n' is still needed to avoid the 
(pathological) case of getting a "\No newline", so scrap that one which 
was too aggressive, and use this (simpler) one instead.

Not that it matters in real life, since nobody uses -U0, and "git blame" 
won't care. But let's get it right anyway ;)

This whole function has had more bugs than it has lines.

		Linus

---
 xdiff-interface.c |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/xdiff-interface.c b/xdiff-interface.c
index 9ee877c..711029e 100644
--- a/xdiff-interface.c
+++ b/xdiff-interface.c
@@ -115,15 +115,18 @@ static void trim_common_tail(mmfile_t *a, mmfile_t *b, long ctx)
 	char *bp = b->ptr + b->size;
 	long smaller = (a->size < b->size) ? a->size : b->size;
 
+	if (ctx)
+		return;
+
 	while (blk + trimmed <= smaller && !memcmp(ap - blk, bp - blk, blk)) {
 		trimmed += blk;
 		ap -= blk;
 		bp -= blk;
 	}
 
-	while (recovered < trimmed && 0 <= ctx)
+	while (recovered < trimmed)
 		if (ap[recovered++] == '\n')
-			ctx--;
+			break;
 	a->size -= (trimmed - recovered);
 	b->size -= (trimmed - recovered);
 }

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: Linux 2.6.24-rc6
  2007-12-21  4:58         ` Linus Torvalds
@ 2007-12-21  5:21           ` Linus Torvalds
  2007-12-21 17:45             ` Linus Torvalds
  2007-12-21 16:56           ` Junio C Hamano
  1 sibling, 1 reply; 9+ messages in thread
From: Linus Torvalds @ 2007-12-21  5:21 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List

On Thu, 20 Dec 2007, Linus Torvalds wrote:
> 
> Not that it matters in real life, since nobody uses -U0, and "git blame" 
> won't care. But let's get it right anyway ;)

.. and here's a test-case to see the *impact* of this whole issue:

	yes | head -1000 > a
	yes | head -2000 > b
	git diff -U0 a b | grep @@
	diff -U0 a b | grep @@

and notice the differences.

Both will add a thousand lines of "y", but the "git diff" will add it at a 
different point, ie git says:

	@@ -488,0 +489,1000 @@ y

while a non-tail-optimizing diff will say

	@@ -1000,0 +1001,1000 @@

which may be a bit more obvious.

Both answers are *correct*, though. The particular choice of "insert at 
line 489, after line 488" is a bit odd, but is because we don't actually 
search to exactly the beginning of where the differences started, we 
search in blocks of 1kB and then we go forward to the next newline.

(We could also go to exactly where the differences started, and if the 
previous character was a newline or the beginning of the file, we'd not 
move forward at all, and then we'd get a more "logical" diff of inserting 
at the beginning).

Considering that the answer is correct, and this only happens for insane 
cases anyway, I don't really think anybody cares, but it's an interesting 
case nonetheless.

			Linus

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Linux 2.6.24-rc6
  2007-12-21  4:58         ` Linus Torvalds
  2007-12-21  5:21           ` Linus Torvalds
@ 2007-12-21 16:56           ` Junio C Hamano
  1 sibling, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2007-12-21 16:56 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kyle McMartin, Git Mailing List, Linux Kernel Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> Actually, the code to finding one '\n' is still needed to avoid the 
> (pathological) case of getting a "\No newline", so scrap that one which 
> was too aggressive, and use this (simpler) one instead.
>
> Not that it matters in real life, since nobody uses -U0, and "git blame" 
> won't care. But let's get it right anyway ;)

Actually "blame won't care" is a bit too strong.  It's only we
(you) made it not to care.  It is a different story if the
change to make it not to care was sensible.

The diff text "git blame" will see is affected with the tail
trimming optimization exactly the same way as the optimization
triggered this thread.  With the common tails trimmed, the hunks
match differently from the way they match without trimming (the
gcc changelog testcase has differences between the unoptimized
blame and the tail-trimming one --- your original to put this
logic only in blame and my rewrite to move it in xdiff-interface
produce the same result that is different from the unoptimized
version, although both are 4x faster than the original).

When there are multiple possible matches, any match among them
is a correct match, and a match with a line in a blob is a match
to the blob no matter what line the match is anyway.  The usual
workflow of checking blame to find the commit that introduced
the change and then going to "git show" to view the bigger
picture won't get affected.  But the blamed line numbers will be
different from the unoptimized blame, and it may not match the
expectation of human readers.  It won't match "git show" output
of the blamed commit.

> This whole function has had more bugs than it has lines.

I have to agree.  It started as a brilliant idea but then it was
enhanced (in an attempt to support context) and executed not so
brilliantly.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Linux 2.6.24-rc6
  2007-12-21  5:21           ` Linus Torvalds
@ 2007-12-21 17:45             ` Linus Torvalds
  0 siblings, 0 replies; 9+ messages in thread
From: Linus Torvalds @ 2007-12-21 17:45 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List

On Thu, 20 Dec 2007, Linus Torvalds wrote:
> 
> Both answers are *correct*, though. The particular choice of "insert at 
> line 489, after line 488" is a bit odd, but is because we don't actually 
> search to exactly the beginning of where the differences started, we 
> search in blocks of 1kB and then we go forward to the next newline.

This slightly more involved diff does a better job at this particular 
issue. Whether the complexity is worth it or not, I dunno, but it changes 
the "remove common lines at the end" to do an exact job, which for this 
particular test-case means that the end result of adding a thousand lines 
of 'y' will look like

	[torvalds@woody ~]$ git diff -U0 a b | grep @@
	@@ -0,0 +1,1000 @@

instead - ie it will say that they were added at the very beginning of the 
file rather than added at some arbitrary point in the middle.

Whether this is really worth it, I dunno.

Also, I'm kind of debating with myself whether it would make most sense to 
only do this kind of optimization when (pick arbitrary cut-off here) 
something like more than half of the file is identical at the end. If we 
don't have a noticeable fraction of the file being the same, it may not 
make sense to really bother with this, since it really is meant for just 
things like ChangeLog files etc that have data added at the beginning.

That would make this whole optimization a lot more targeted to the case 
where it really matters and really helps.

I also do have an incling of a really evil way to make xdiff handle the 
case of having multiple lines of context right too, and basically just 
move all of this logic into xdiff itself rather than have this 
interface-level hack, but I'll have to let that idea brew for a while yet. 

			Linus

---
 xdiff-interface.c |   38 ++++++++++++++++++++++++++++++--------
 1 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/xdiff-interface.c b/xdiff-interface.c
index 9ee877c..54a53d2 100644
--- a/xdiff-interface.c
+++ b/xdiff-interface.c
@@ -109,21 +109,43 @@ int xdiff_outf(void *priv_, mmbuffer_t *mb, int nbuf)
  */
 static void trim_common_tail(mmfile_t *a, mmfile_t *b, long ctx)
 {
-	const int blk = 1024;
+	int blk = 1024;
 	long trimmed = 0, recovered = 0;
 	char *ap = a->ptr + a->size;
 	char *bp = b->ptr + b->size;
 	long smaller = (a->size < b->size) ? a->size : b->size;

-	while (blk + trimmed <= smaller && !memcmp(ap - blk, bp - blk, blk)) {
-		trimmed += blk;
-		ap -= blk;
-		bp -= blk;
-	}
+	if (ctx)
+		return;
+
+	do {
+		while (blk + trimmed <= smaller && !memcmp(ap - blk, bp - blk, blk)) {
+			trimmed += blk;
+			ap -= blk;
+			bp -= blk;
+		}
+		blk /= 2;
+	} while (blk);
+
+	/* Did we trim one of them all away? */
+	if (trimmed == smaller) {
+		char *bigger;
+		if (a->size == b->size)
+			return;
+		bigger = a->ptr;
+		if (a->size > b->size)
+			bigger = b->ptr;
+
+		/* Did the other one end in a newline? */
+		if (bigger[trimmed-1] == '\n')
+			goto done;
+	}		

-	while (recovered < trimmed && 0 <= ctx)
+	/* Find the next newline */
+	while (recovered < trimmed)
 		if (ap[recovered++] == '\n')
-			ctx--;
+			break;
+done:
 	a->size -= (trimmed - recovered);
 	b->size -= (trimmed - recovered);
 }

^ permalink raw reply related	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2007-12-21 17:46 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <alpine.LFD.0.9999.0712201731010.21557@woody.linux-foundation.org>
     [not found] ` <20071221024805.GB8535@fattire.cabal.ca>
     [not found]   ` <20071221030152.GC8535@fattire.cabal.ca>
2007-12-21  3:49     ` Linux 2.6.24-rc6 Linus Torvalds
2007-12-21  3:50       ` Kyle McMartin
2007-12-21  4:22       ` Linus Torvalds
2007-12-21  4:40         ` Linus Torvalds
2007-12-21  4:50           ` Kyle McMartin
2007-12-21  4:58         ` Linus Torvalds
2007-12-21  5:21           ` Linus Torvalds
2007-12-21 17:45             ` Linus Torvalds
2007-12-21 16:56           ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).