From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Jeff King <peff@peff.net>
Cc: phillip.wood@dunelm.org.uk,
Johannes Schindelin <Johannes.Schindelin@gmx.de>,
Derrick Stolee <stolee@gmail.com>,
Phillip Wood via GitGitGadget <gitgitgadget@gmail.com>,
git@vger.kernel.org
Subject: Re: [PATCH 1/3] diff histogram: intern strings
Date: Fri, 19 Nov 2021 22:22:04 +0100
Message-ID: <211119.86v90n25cv.gmgdl@evledraar.gmail.com>
In-Reply-To: <YZe4hqF6Jf14L5tb@coredump.intra.peff.net>
On Fri, Nov 19 2021, Jeff King wrote:
> On Fri, Nov 19, 2021 at 10:05:32AM +0000, Phillip Wood wrote:
>
>> On 18/11/2021 15:42, Jeff King wrote:
>> > On Thu, Nov 18, 2021 at 04:35:48PM +0100, Johannes Schindelin wrote:
>> >
>> > > I think the really important thing to point out is that
>> > > `xdl_classify_record()` ensures that the `ha` attribute is different for
>> > > different text. AFAIR it even "linearizes" the `ha` values, i.e. they
>> > > won't be all over the place but start at 0 (or 1).
>> > >
>> > > So no, I'm not worried about collisions. That would be a bug in
>> > > `xdl_classify_record()` and I think we would have caught this bug by now.
>> >
>> > Ah, thanks for that explanation. That addresses my collision concern from
>> > earlier in the thread completely.
>>
>> Yes, thanks for clarifying I should have been clearer in my reply to Stolee.
>> The reason I was waffling on about file sizes is that there can only be a
>> collision if there are more than 2^32 unique lines. I think the minimum file
>> size where that happens is just below 10GB when one side of the diff has
>> 2^31 lines and the other has 2^31 + 1 lines and all the lines are unique.
>
> Right, that makes more sense (and we are not likely to lift the 1GB
> limit anytime soon; there are tons of 32-bit variables and potential
> integer overflows all through the xdiff code).
Interestingly:
$ du -sh 8gb*
8.1G 8gb
8.1G 8gb.cp
$ ~/g/git/git -P -c core.bigFileThreshold=10g diff -U0 --no-index --no-color-moved 8gb 8gb.cp
diff --git a/8gb b/8gb.cp
index a886cdfe5ce..4965a132d44 100644
--- a/8gb
+++ b/8gb.cp
@@ -17,0 +18 @@ more
+blah
And the only change I made was:
diff --git a/xdiff-interface.c b/xdiff-interface.c
index 75b32aef51d..cb8ca5f5d0a 100644
--- a/xdiff-interface.c
+++ b/xdiff-interface.c
@@ -117,9 +117,6 @@ int xdi_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, xdemitconf_t co
mmfile_t a = *mf1;
mmfile_t b = *mf2;
- if (mf1->size > MAX_XDIFF_SIZE || mf2->size > MAX_XDIFF_SIZE)
- return -1;
-
if (!xecfg->ctxlen && !(xecfg->flags & XDL_EMIT_FUNCCONTEXT))
trim_common_tail(&a, &b);
Perhaps we're being overly conservative with these hardcoded limits, at
least on some platforms? This is Linux x86_64.
I understand from skimming the above that the limit is about the
pathological case, whereas these two files are identical except for a
trailer at the end.
I wonder how far you could get with #define int size_t & the like ... :)