[PATCH] xdiff-interface.c (buffer_is_binary): Remove buffer size limitation

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] xdiff-interface.c (buffer_is_binary): Remove buffer size limitation
@ 2007-12-01 16:01 Dmitry V. Levin
  2007-12-01 19:46 ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Dmitry V. Levin @ 2007-12-01 16:01 UTC (permalink / raw)
  To: Git Mailing List; +Cc: Junio C Hamano

When checking buffer for NUL byte, do not limit size of buffer we check.
Otherwise we break git-rebase: git-format-patch may generate output which
git-mailinfo cannot handle properly.

Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
---
 t/t3407-rebase-binary.sh |   32 ++++++++++++++++++++++++++++++++
 xdiff-interface.c        |    3 ---
 2 files changed, 32 insertions(+), 3 deletions(-)
 create mode 100755 t/t3407-rebase-binary.sh

diff --git a/t/t3407-rebase-binary.sh b/t/t3407-rebase-binary.sh
new file mode 100755
index 0000000..213dc9d
--- /dev/null
+++ b/t/t3407-rebase-binary.sh
@@ -0,0 +1,32 @@
+#!/bin/sh
+
+test_description='rebase binary test'
+
+. ./test-lib.sh
+
+yes 1234567 |head -n 2003 >text
+yes 1234567 |head -n 2000 >bin && printf 'binary\0bin\n' >>bin && yes 1234567 |head -n 3 >>bin
+
+test_expect_success setup '
+
+	cat text >file &&
+	git add file &&
+	git commit -m"text" &&
+
+	git branch bin &&
+
+	echo side >side &&
+	git add side &&
+	git commit -m"side" &&
+
+	git checkout bin &&
+	cat bin >file &&
+	git commit -a -m"bin"
+'
+
+test_expect_success rebase '
+
+	git rebase master
+'
+
+test_done
diff --git a/xdiff-interface.c b/xdiff-interface.c
index be866d1..fa9f58d 100644
--- a/xdiff-interface.c
+++ b/xdiff-interface.c
@@ -122,11 +122,8 @@ int read_mmfile(mmfile_t *ptr, const char *filename)
 	return 0;
 }
 
-#define FIRST_FEW_BYTES 8000
 int buffer_is_binary(const char *ptr, unsigned long size)
 {
-	if (FIRST_FEW_BYTES < size)
-		size = FIRST_FEW_BYTES;
 	return !!memchr(ptr, 0, size);
 }
 
-- 
ldv

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] xdiff-interface.c (buffer_is_binary): Remove buffer size limitation
  2007-12-01 16:01 [PATCH] xdiff-interface.c (buffer_is_binary): Remove buffer size limitation Dmitry V. Levin
@ 2007-12-01 19:46 ` Junio C Hamano
  2007-12-03 21:50   ` Dmitry V. Levin
  0 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2007-12-01 19:46 UTC (permalink / raw)
  To: Dmitry V. Levin; +Cc: Git Mailing List

"Dmitry V. Levin" <ldv@altlinux.org> writes:

> When checking buffer for NUL byte, do not limit size of buffer we check.
> Otherwise we break git-rebase: git-format-patch may generate output which
> git-mailinfo cannot handle properly.

I think this is tackling a valid problem but it is a wrong solution.
The change penalizes text changes which is the majority, just in case
there is an unusual change that has an embedded NUL far into the file
(iow, exception).

Perhaps mailinfo can be updated to handle embedded NUL.

Another alternative (I've been trying to find time to do so for quite a
while now but dealing with list traffic always takes priority on my time
allotment) is to update rebase not to rely on "format-patch piped to
am", and I think that is more correct solution in the longer term.

In the meantime, a workaround would be to use "rebase -i".  It uses
cherry-pick machinery instead of "format-patch piped to am", and
hopefully would handle NULs better.  It probably is slower than non
interactive one exactly because it uses cherry-pick, and that is the
reason I am first working on updating cherry-pick before actually making
the non-interactive rebase to use it.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] xdiff-interface.c (buffer_is_binary): Remove buffer size limitation
  2007-12-01 19:46 ` Junio C Hamano
@ 2007-12-03 21:50   ` Dmitry V. Levin
  2007-12-03 23:24     ` Junio C Hamano
  2007-12-04  0:00     ` Linus Torvalds
  0 siblings, 2 replies; 7+ messages in thread
From: Dmitry V. Levin @ 2007-12-03 21:50 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List


[-- Attachment #1.1: Type: text/plain, Size: 986 bytes --]

On Sat, Dec 01, 2007 at 11:46:52AM -0800, Junio C Hamano wrote:
> On Sat, Dec 01, 2007 at 07:01:13PM +0300, Dmitry V. Levin wrote:
> 
> > When checking buffer for NUL byte, do not limit size of buffer we check.
> > Otherwise we break git-rebase: git-format-patch may generate output which
> > git-mailinfo cannot handle properly.
> 
> I think this is tackling a valid problem but it is a wrong solution.
> The change penalizes text changes which is the majority, just in case
> there is an unusual change that has an embedded NUL far into the file
> (iow, exception).

Penalizes?
Average file size in the linux-2.6.23.9 kernel tree is 10944 bytes,
FIRST_FEW_BYTES limit is 8000 bytes.
Well, I prefer slightly penalized but working properly git-rebase.
Attached test case demonstrates how current git-rebase can just run
successfully but produce a wrong result.

P.S. The real life example where you can hit this git-rebase problem is
GNU .info files.


-- 
ldv

[-- Attachment #1.2: t3408-rebase-binary-correctness.sh --]
[-- Type: application/x-sh, Size: 506 bytes --]

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] xdiff-interface.c (buffer_is_binary): Remove buffer size limitation
  2007-12-03 21:50   ` Dmitry V. Levin
@ 2007-12-03 23:24     ` Junio C Hamano
  2007-12-04  0:00     ` Linus Torvalds
  1 sibling, 0 replies; 7+ messages in thread
From: Junio C Hamano @ 2007-12-03 23:24 UTC (permalink / raw)
  To: Dmitry V. Levin; +Cc: Git Mailing List

"Dmitry V. Levin" <ldv@altlinux.org> writes:

> On Sat, Dec 01, 2007 at 11:46:52AM -0800, Junio C Hamano wrote:
>> On Sat, Dec 01, 2007 at 07:01:13PM +0300, Dmitry V. Levin wrote:
>> 
>> > When checking buffer for NUL byte, do not limit size of buffer we check.
>> > Otherwise we break git-rebase: git-format-patch may generate output which
>> > git-mailinfo cannot handle properly.
>> 
>> I think this is tackling a valid problem but it is a wrong solution.
>> The change penalizes text changes which is the majority, just in case
>> there is an unusual change that has an embedded NUL far into the file
>> (iow, exception).
>
> Penalizes?
> Average file size in the linux-2.6.23.9 kernel tree is 10944 bytes,
> FIRST_FEW_BYTES limit is 8000 bytes.

I really wish we were living in a simpler time, back when I could just
say "we optimize for the kernel" and did not have to be worried about
getting laughed at.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] xdiff-interface.c (buffer_is_binary): Remove buffer size limitation
  2007-12-03 21:50   ` Dmitry V. Levin
  2007-12-03 23:24     ` Junio C Hamano
@ 2007-12-04  0:00     ` Linus Torvalds
  2007-12-04  1:00       ` Johannes Schindelin
  1 sibling, 1 reply; 7+ messages in thread
From: Linus Torvalds @ 2007-12-04  0:00 UTC (permalink / raw)
  To: Dmitry V. Levin; +Cc: Junio C Hamano, Git Mailing List

On Tue, 4 Dec 2007, Dmitry V. Levin wrote:
>
> Average file size in the linux-2.6.23.9 kernel tree is 10944 bytes,

Don't do "average" sizes. That's an almost totally meaningless number.

"Average" makes sense if you have some kind of gaussian distribution or 
similar. File sizes tend to be exponential distributions, and what makes 
much more sense is to look at the median. That doesn't show the effect of 
a few larger files, and also gives you a much better "half the files are 
smaller than x" idea.

And the median filesize for the kernel is just a few bytes over 4k.

Of the 23,000+ files in the current kernel, about 15,500 are less than 
8kB. And 17,179 are smaller than the 10944 bytes you mention.

I'd argue that 8kB (or even 4kB) is probably a good number for things like 
that: it catches the bulk of all files in their entirety, but it *avoids* 
spending tons of time on the (few) really large files.

			Linus

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] xdiff-interface.c (buffer_is_binary): Remove buffer size limitation
  2007-12-04  0:00     ` Linus Torvalds
@ 2007-12-04  1:00       ` Johannes Schindelin
  2007-12-05 10:47         ` David Kastrup
  0 siblings, 1 reply; 7+ messages in thread
From: Johannes Schindelin @ 2007-12-04  1:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dmitry V. Levin, Junio C Hamano, Git Mailing List

Hi,

On Mon, 3 Dec 2007, Linus Torvalds wrote:

> On Tue, 4 Dec 2007, Dmitry V. Levin wrote:
> >
> > Average file size in the linux-2.6.23.9 kernel tree is 10944 bytes,
> 
> Don't do "average" sizes. That's an almost totally meaningless number.
> 
> "Average" makes sense if you have some kind of gaussian distribution or 
> similar.

To enhance on that: Gaussian is symmetric, which cannot be the proper 
distribution for anything that is non-negative.

I see so many mis-applications of statistics/probability theory in my day 
job that I cannot resist pointing people to the Poisson distribution here 
(in whose context "average" actually makes kind of sense).

But back to the problem: if you have a truly binary file, then _every_ 
byte (absent further information, of course) has a probability of 1/256 of 
being 0.

Which means that if a file is binary, but is unusual enough to have that 
property only for half of the first 8192 bytes, you get a probability of 
1 - 1 / 256^4096 = 1 - 1 / 2 ^ 32768 that the current test succeeds.

I fail to see how this test can possibly fail for the average case.

So if it fails only for special cases, we are probably (in the common, not 
the mathematical, sense) better off asking those people encountering them 
to add git-attributes for the files.

IMHO that is not asking for too much.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] xdiff-interface.c (buffer_is_binary): Remove buffer size limitation
  2007-12-04  1:00       ` Johannes Schindelin
@ 2007-12-05 10:47         ` David Kastrup
  0 siblings, 0 replies; 7+ messages in thread
From: David Kastrup @ 2007-12-05 10:47 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Linus Torvalds, Dmitry V. Levin, Junio C Hamano, Git Mailing List

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> On Mon, 3 Dec 2007, Linus Torvalds wrote:
>
>> On Tue, 4 Dec 2007, Dmitry V. Levin wrote:
>> >
>> > Average file size in the linux-2.6.23.9 kernel tree is 10944 bytes,
>> 
>> Don't do "average" sizes. That's an almost totally meaningless number.
>> 
>> "Average" makes sense if you have some kind of gaussian distribution or 
>> similar.
>
> To enhance on that: Gaussian is symmetric, which cannot be the proper 
> distribution for anything that is non-negative.

This reasoning is nonsense, since Gaussians are not necessarily
symmetric about zero, and for example the equally distributed
probability between 0 and 1 is both symmetric and non-negative.  And the
trivial distribution of "always zero" is even _both_ symmetric around
zero and non-negative.

What is true for Gaussians is that their probability is non-zero
everywhere.  But with a meaning of "non-zero" that should let the kind
of people using hashes for unique file identification sleep well.

> I see so many mis-applications of statistics/probability theory in my
> day job that I cannot resist pointing people to the Poisson
> distribution here (in whose context "average" actually makes kind of
> sense).

The main point of Gaussians is that they approximate a distribution
coming from a sum of independent random sources pretty well, even if the
individual sources are not Gaussians themselves.

Poisson distributions come about as the sum of independent exponential
distributions, and yes, when the number of summands grows, the result is
quite well modeled by a Gaussian: the impossible outliers predicted by
the Gaussian approximation take a negligible probability of the total.

If the distribution is more like coming from a product of independent
sources, the results will be better modeled by log-Gaussian
distributions (which happen to be non-negative).

The exponential distribution happens to be the log-distribution of the
(0,1) equal probability distribution, so for lower order Poisson
distributions, approximation with log-Gaussians may seem more
straightforward.

> But back to the problem: if you have a truly binary file, then _every_
> byte (absent further information, of course) has a probability of
> 1/256 of being 0.

Absent any information, there is no reason to assume equal probability
as more likely than other probabilities.

> Which means that if a file is binary,

is _random_ binary.  Few people version random binary files.  It is
rather pointless.

> but is unusual enough to have that property only for half of the first
> 8192 bytes, you get a probability of 1 - 1 / 256^4096 = 1 - 1 / 2 ^
> 32768 that the current test succeeds.

Rather

(1-1/256)^4096 ~= exp(-4096/256) = about 1 in 10 million

Which means that the test would yield a false negative (with regard to
the data being binary) about once in 10000000 times when done on 4096
bytes, even given your rather absurd assumption of random binary data
being versioned.

To make this somewhat more obvious: let's take a look at the probability
of 128 random bytes being all non-zero.  According to your model, that
should be (1-1/256^128)=1-1/32768.  According to mine, (1-1/256)^128 ~=
exp(-0.5) ~= 60%.  Why is this even more than 50%, which would be the
naive assumption?  Because it is likely that we will have duplicate
bytes among our 128 bytes, so the probability for an occurence of 0
becomes less.

> I fail to see how this test can possibly fail for the average case.

You start with a nonsensical mathematical model, and don't even do the
math right that would follow from it.

The "average case" is not random equidistributed uncorrelated data,
anyway.

> So if it fails only for special cases, we are probably (in the common,
> not the mathematical, sense) better off asking those people
> encountering them to add git-attributes for the files.
>
> IMHO that is not asking for too much.

That the problem has actually been encountered in real life does not
exactly do much to support your assumptions and your conclusions, does
it?

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2007-12-05 10:47 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-12-01 16:01 [PATCH] xdiff-interface.c (buffer_is_binary): Remove buffer size limitation Dmitry V. Levin
2007-12-01 19:46 ` Junio C Hamano
2007-12-03 21:50   ` Dmitry V. Levin
2007-12-03 23:24     ` Junio C Hamano
2007-12-04  0:00     ` Linus Torvalds
2007-12-04  1:00       ` Johannes Schindelin
2007-12-05 10:47         ` David Kastrup

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).