* [PATCH] xdiff-interface.c (buffer_is_binary): Remove buffer size limitation @ 2007-12-01 16:01 Dmitry V. Levin 2007-12-01 19:46 ` Junio C Hamano 0 siblings, 1 reply; 7+ messages in thread From: Dmitry V. Levin @ 2007-12-01 16:01 UTC (permalink / raw) To: Git Mailing List; +Cc: Junio C Hamano When checking buffer for NUL byte, do not limit size of buffer we check. Otherwise we break git-rebase: git-format-patch may generate output which git-mailinfo cannot handle properly. Signed-off-by: Dmitry V. Levin <ldv@altlinux.org> --- t/t3407-rebase-binary.sh | 32 ++++++++++++++++++++++++++++++++ xdiff-interface.c | 3 --- 2 files changed, 32 insertions(+), 3 deletions(-) create mode 100755 t/t3407-rebase-binary.sh diff --git a/t/t3407-rebase-binary.sh b/t/t3407-rebase-binary.sh new file mode 100755 index 0000000..213dc9d --- /dev/null +++ b/t/t3407-rebase-binary.sh @@ -0,0 +1,32 @@ +#!/bin/sh + +test_description='rebase binary test' + +. ./test-lib.sh + +yes 1234567 |head -n 2003 >text +yes 1234567 |head -n 2000 >bin && printf 'binary\0bin\n' >>bin && yes 1234567 |head -n 3 >>bin + +test_expect_success setup ' + + cat text >file && + git add file && + git commit -m"text" && + + git branch bin && + + echo side >side && + git add side && + git commit -m"side" && + + git checkout bin && + cat bin >file && + git commit -a -m"bin" +' + +test_expect_success rebase ' + + git rebase master +' + +test_done diff --git a/xdiff-interface.c b/xdiff-interface.c index be866d1..fa9f58d 100644 --- a/xdiff-interface.c +++ b/xdiff-interface.c @@ -122,11 +122,8 @@ int read_mmfile(mmfile_t *ptr, const char *filename) return 0; } -#define FIRST_FEW_BYTES 8000 int buffer_is_binary(const char *ptr, unsigned long size) { - if (FIRST_FEW_BYTES < size) - size = FIRST_FEW_BYTES; return !!memchr(ptr, 0, size); } -- ldv ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH] xdiff-interface.c (buffer_is_binary): Remove buffer size limitation 2007-12-01 16:01 [PATCH] xdiff-interface.c (buffer_is_binary): Remove buffer size limitation Dmitry V. Levin @ 2007-12-01 19:46 ` Junio C Hamano 2007-12-03 21:50 ` Dmitry V. Levin 0 siblings, 1 reply; 7+ messages in thread From: Junio C Hamano @ 2007-12-01 19:46 UTC (permalink / raw) To: Dmitry V. Levin; +Cc: Git Mailing List "Dmitry V. Levin" <ldv@altlinux.org> writes: > When checking buffer for NUL byte, do not limit size of buffer we check. > Otherwise we break git-rebase: git-format-patch may generate output which > git-mailinfo cannot handle properly. I think this is tackling a valid problem but it is a wrong solution. The change penalizes text changes which is the majority, just in case there is an unusual change that has an embedded NUL far into the file (iow, exception). Perhaps mailinfo can be updated to handle embedded NUL. Another alternative (I've been trying to find time to do so for quite a while now but dealing with list traffic always takes priority on my time allotment) is to update rebase not to rely on "format-patch piped to am", and I think that is more correct solution in the longer term. In the meantime, a workaround would be to use "rebase -i". It uses cherry-pick machinery instead of "format-patch piped to am", and hopefully would handle NULs better. It probably is slower than non interactive one exactly because it uses cherry-pick, and that is the reason I am first working on updating cherry-pick before actually making the non-interactive rebase to use it. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] xdiff-interface.c (buffer_is_binary): Remove buffer size limitation 2007-12-01 19:46 ` Junio C Hamano @ 2007-12-03 21:50 ` Dmitry V. Levin 2007-12-03 23:24 ` Junio C Hamano 2007-12-04 0:00 ` Linus Torvalds 0 siblings, 2 replies; 7+ messages in thread From: Dmitry V. Levin @ 2007-12-03 21:50 UTC (permalink / raw) To: Junio C Hamano; +Cc: Git Mailing List [-- Attachment #1.1: Type: text/plain, Size: 986 bytes --] On Sat, Dec 01, 2007 at 11:46:52AM -0800, Junio C Hamano wrote: > On Sat, Dec 01, 2007 at 07:01:13PM +0300, Dmitry V. Levin wrote: > > > When checking buffer for NUL byte, do not limit size of buffer we check. > > Otherwise we break git-rebase: git-format-patch may generate output which > > git-mailinfo cannot handle properly. > > I think this is tackling a valid problem but it is a wrong solution. > The change penalizes text changes which is the majority, just in case > there is an unusual change that has an embedded NUL far into the file > (iow, exception). Penalizes? Average file size in the linux-2.6.23.9 kernel tree is 10944 bytes, FIRST_FEW_BYTES limit is 8000 bytes. Well, I prefer slightly penalized but working properly git-rebase. Attached test case demonstrates how current git-rebase can just run successfully but produce a wrong result. P.S. The real life example where you can hit this git-rebase problem is GNU .info files. -- ldv [-- Attachment #1.2: t3408-rebase-binary-correctness.sh --] [-- Type: application/x-sh, Size: 506 bytes --] [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] xdiff-interface.c (buffer_is_binary): Remove buffer size limitation 2007-12-03 21:50 ` Dmitry V. Levin @ 2007-12-03 23:24 ` Junio C Hamano 2007-12-04 0:00 ` Linus Torvalds 1 sibling, 0 replies; 7+ messages in thread From: Junio C Hamano @ 2007-12-03 23:24 UTC (permalink / raw) To: Dmitry V. Levin; +Cc: Git Mailing List "Dmitry V. Levin" <ldv@altlinux.org> writes: > On Sat, Dec 01, 2007 at 11:46:52AM -0800, Junio C Hamano wrote: >> On Sat, Dec 01, 2007 at 07:01:13PM +0300, Dmitry V. Levin wrote: >> >> > When checking buffer for NUL byte, do not limit size of buffer we check. >> > Otherwise we break git-rebase: git-format-patch may generate output which >> > git-mailinfo cannot handle properly. >> >> I think this is tackling a valid problem but it is a wrong solution. >> The change penalizes text changes which is the majority, just in case >> there is an unusual change that has an embedded NUL far into the file >> (iow, exception). > > Penalizes? > Average file size in the linux-2.6.23.9 kernel tree is 10944 bytes, > FIRST_FEW_BYTES limit is 8000 bytes. I really wish we were living in a simpler time, back when I could just say "we optimize for the kernel" and did not have to be worried about getting laughed at. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] xdiff-interface.c (buffer_is_binary): Remove buffer size limitation 2007-12-03 21:50 ` Dmitry V. Levin 2007-12-03 23:24 ` Junio C Hamano @ 2007-12-04 0:00 ` Linus Torvalds 2007-12-04 1:00 ` Johannes Schindelin 1 sibling, 1 reply; 7+ messages in thread From: Linus Torvalds @ 2007-12-04 0:00 UTC (permalink / raw) To: Dmitry V. Levin; +Cc: Junio C Hamano, Git Mailing List On Tue, 4 Dec 2007, Dmitry V. Levin wrote: > > Average file size in the linux-2.6.23.9 kernel tree is 10944 bytes, Don't do "average" sizes. That's an almost totally meaningless number. "Average" makes sense if you have some kind of gaussian distribution or similar. File sizes tend to be exponential distributions, and what makes much more sense is to look at the median. That doesn't show the effect of a few larger files, and also gives you a much better "half the files are smaller than x" idea. And the median filesize for the kernel is just a few bytes over 4k. Of the 23,000+ files in the current kernel, about 15,500 are less than 8kB. And 17,179 are smaller than the 10944 bytes you mention. I'd argue that 8kB (or even 4kB) is probably a good number for things like that: it catches the bulk of all files in their entirety, but it *avoids* spending tons of time on the (few) really large files. Linus ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] xdiff-interface.c (buffer_is_binary): Remove buffer size limitation 2007-12-04 0:00 ` Linus Torvalds @ 2007-12-04 1:00 ` Johannes Schindelin 2007-12-05 10:47 ` David Kastrup 0 siblings, 1 reply; 7+ messages in thread From: Johannes Schindelin @ 2007-12-04 1:00 UTC (permalink / raw) To: Linus Torvalds; +Cc: Dmitry V. Levin, Junio C Hamano, Git Mailing List Hi, On Mon, 3 Dec 2007, Linus Torvalds wrote: > On Tue, 4 Dec 2007, Dmitry V. Levin wrote: > > > > Average file size in the linux-2.6.23.9 kernel tree is 10944 bytes, > > Don't do "average" sizes. That's an almost totally meaningless number. > > "Average" makes sense if you have some kind of gaussian distribution or > similar. To enhance on that: Gaussian is symmetric, which cannot be the proper distribution for anything that is non-negative. I see so many mis-applications of statistics/probability theory in my day job that I cannot resist pointing people to the Poisson distribution here (in whose context "average" actually makes kind of sense). But back to the problem: if you have a truly binary file, then _every_ byte (absent further information, of course) has a probability of 1/256 of being 0. Which means that if a file is binary, but is unusual enough to have that property only for half of the first 8192 bytes, you get a probability of 1 - 1 / 256^4096 = 1 - 1 / 2 ^ 32768 that the current test succeeds. I fail to see how this test can possibly fail for the average case. So if it fails only for special cases, we are probably (in the common, not the mathematical, sense) better off asking those people encountering them to add git-attributes for the files. IMHO that is not asking for too much. Ciao, Dscho ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] xdiff-interface.c (buffer_is_binary): Remove buffer size limitation 2007-12-04 1:00 ` Johannes Schindelin @ 2007-12-05 10:47 ` David Kastrup 0 siblings, 0 replies; 7+ messages in thread From: David Kastrup @ 2007-12-05 10:47 UTC (permalink / raw) To: Johannes Schindelin Cc: Linus Torvalds, Dmitry V. Levin, Junio C Hamano, Git Mailing List Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: > On Mon, 3 Dec 2007, Linus Torvalds wrote: > >> On Tue, 4 Dec 2007, Dmitry V. Levin wrote: >> > >> > Average file size in the linux-2.6.23.9 kernel tree is 10944 bytes, >> >> Don't do "average" sizes. That's an almost totally meaningless number. >> >> "Average" makes sense if you have some kind of gaussian distribution or >> similar. > > To enhance on that: Gaussian is symmetric, which cannot be the proper > distribution for anything that is non-negative. This reasoning is nonsense, since Gaussians are not necessarily symmetric about zero, and for example the equally distributed probability between 0 and 1 is both symmetric and non-negative. And the trivial distribution of "always zero" is even _both_ symmetric around zero and non-negative. What is true for Gaussians is that their probability is non-zero everywhere. But with a meaning of "non-zero" that should let the kind of people using hashes for unique file identification sleep well. > I see so many mis-applications of statistics/probability theory in my > day job that I cannot resist pointing people to the Poisson > distribution here (in whose context "average" actually makes kind of > sense). The main point of Gaussians is that they approximate a distribution coming from a sum of independent random sources pretty well, even if the individual sources are not Gaussians themselves. Poisson distributions come about as the sum of independent exponential distributions, and yes, when the number of summands grows, the result is quite well modeled by a Gaussian: the impossible outliers predicted by the Gaussian approximation take a negligible probability of the total. If the distribution is more like coming from a product of independent sources, the results will be better modeled by log-Gaussian distributions (which happen to be non-negative). The exponential distribution happens to be the log-distribution of the (0,1) equal probability distribution, so for lower order Poisson distributions, approximation with log-Gaussians may seem more straightforward. > But back to the problem: if you have a truly binary file, then _every_ > byte (absent further information, of course) has a probability of > 1/256 of being 0. Absent any information, there is no reason to assume equal probability as more likely than other probabilities. > Which means that if a file is binary, is _random_ binary. Few people version random binary files. It is rather pointless. > but is unusual enough to have that property only for half of the first > 8192 bytes, you get a probability of 1 - 1 / 256^4096 = 1 - 1 / 2 ^ > 32768 that the current test succeeds. Rather (1-1/256)^4096 ~= exp(-4096/256) = about 1 in 10 million Which means that the test would yield a false negative (with regard to the data being binary) about once in 10000000 times when done on 4096 bytes, even given your rather absurd assumption of random binary data being versioned. To make this somewhat more obvious: let's take a look at the probability of 128 random bytes being all non-zero. According to your model, that should be (1-1/256^128)=1-1/32768. According to mine, (1-1/256)^128 ~= exp(-0.5) ~= 60%. Why is this even more than 50%, which would be the naive assumption? Because it is likely that we will have duplicate bytes among our 128 bytes, so the probability for an occurence of 0 becomes less. > I fail to see how this test can possibly fail for the average case. You start with a nonsensical mathematical model, and don't even do the math right that would follow from it. The "average case" is not random equidistributed uncorrelated data, anyway. > So if it fails only for special cases, we are probably (in the common, > not the mathematical, sense) better off asking those people > encountering them to add git-attributes for the files. > > IMHO that is not asking for too much. That the problem has actually been encountered in real life does not exactly do much to support your assumptions and your conclusions, does it? -- David Kastrup, Kriemhildstr. 15, 44793 Bochum ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2007-12-05 10:47 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2007-12-01 16:01 [PATCH] xdiff-interface.c (buffer_is_binary): Remove buffer size limitation Dmitry V. Levin 2007-12-01 19:46 ` Junio C Hamano 2007-12-03 21:50 ` Dmitry V. Levin 2007-12-03 23:24 ` Junio C Hamano 2007-12-04 0:00 ` Linus Torvalds 2007-12-04 1:00 ` Johannes Schindelin 2007-12-05 10:47 ` David Kastrup
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).