* [PATCH 0/3] xdiff: speedup histogram diff @ 2021-11-17 11:20 Phillip Wood via GitGitGadget 2021-11-17 11:20 ` [PATCH 1/3] diff histogram: intern strings Phillip Wood via GitGitGadget ` (3 more replies) 0 siblings, 4 replies; 15+ messages in thread From: Phillip Wood via GitGitGadget @ 2021-11-17 11:20 UTC (permalink / raw) To: git; +Cc: Phillip Wood Histogram is the only diff algorithm not to call xdl_classify_record(). Calling xdl_classify_record() means that it is not necessary to use xdl_recmatch() when comparing lines, all that is necessary is to compare the hash values. This gives a 7% reduction in the runtime of "git log --patch" when using the histogram diff algorithm. Phillip Wood (3): diff histogram: intern strings xdiff: avoid unnecessary memory allocations xdiff: simplify comparison xdiff/xdiffi.c | 5 +---- xdiff/xhistogram.c | 5 ++--- xdiff/xprepare.c | 35 +++++++++++++++-------------------- 3 files changed, 18 insertions(+), 27 deletions(-) base-commit: cd3e606211bb1cf8bc57f7d76bab98cc17a150bc Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1079%2Fphillipwood%2Fwip%2Fhistogram-speedup-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1079/phillipwood/wip/histogram-speedup-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/1079 -- gitgitgadget ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 1/3] diff histogram: intern strings 2021-11-17 11:20 [PATCH 0/3] xdiff: speedup histogram diff Phillip Wood via GitGitGadget @ 2021-11-17 11:20 ` Phillip Wood via GitGitGadget 2021-11-17 15:55 ` Derrick Stolee 2021-11-17 11:20 ` [PATCH 2/3] xdiff: avoid unnecessary memory allocations Phillip Wood via GitGitGadget ` (2 subsequent siblings) 3 siblings, 1 reply; 15+ messages in thread From: Phillip Wood via GitGitGadget @ 2021-11-17 11:20 UTC (permalink / raw) To: git; +Cc: Phillip Wood, Phillip Wood From: Phillip Wood <phillip.wood@dunelm.org.uk> Histogram is the only diff algorithm not to call xdl_classify_record(). xdl_classify_record() ensures that the hash values of two strings that are not equal differ which means that it is not necessary to use xdl_recmatch() when comparing lines, all that is necessary is to compare the hash values. This gives a 7% reduction in the runtime of "git log --patch" when using the histogram diff algorithm. Test HEAD^ HEAD ----------------------------------------------------------------------------- 4000.1: log -3000 (baseline) 0.18(0.14+0.04) 0.19(0.17+0.02) +5.6% 4000.2: log --raw -3000 (tree-only) 0.99(0.77+0.21) 0.98(0.78+0.20) -1.0% 4000.3: log -p -3000 (Myers) 4.84(4.31+0.51) 4.81(4.15+0.64) -0.6% 4000.4: log -p -3000 --histogram 6.34(5.86+0.46) 5.87(5.19+0.66) -7.4% 4000.5: log -p -3000 --patience 5.39(4.60+0.76) 5.35(4.60+0.73) -0.7% Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk> --- xdiff/xhistogram.c | 5 ++--- xdiff/xprepare.c | 24 ++++++++---------------- 2 files changed, 10 insertions(+), 19 deletions(-) diff --git a/xdiff/xhistogram.c b/xdiff/xhistogram.c index e694bfd9e31..6c1c88a69a1 100644 --- a/xdiff/xhistogram.c +++ b/xdiff/xhistogram.c @@ -91,9 +91,8 @@ struct region { static int cmp_recs(xpparam_t const *xpp, xrecord_t *r1, xrecord_t *r2) { - return r1->ha == r2->ha && - xdl_recmatch(r1->ptr, r1->size, r2->ptr, r2->size, - xpp->flags); + return r1->ha == r2->ha; + } #define CMP_ENV(xpp, env, s1, l1, s2, l2) \ diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c index abeb8fb84e6..7fae0727a02 100644 --- a/xdiff/xprepare.c +++ b/xdiff/xprepare.c @@ -181,15 +181,11 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_ if (!(recs = (xrecord_t **) xdl_malloc(narec * sizeof(xrecord_t *)))) goto abort; - if (XDF_DIFF_ALG(xpp->flags) == XDF_HISTOGRAM_DIFF) - hbits = hsize = 0; - else { - hbits = xdl_hashbits((unsigned int) narec); - hsize = 1 << hbits; - if (!(rhash = (xrecord_t **) xdl_malloc(hsize * sizeof(xrecord_t *)))) - goto abort; - memset(rhash, 0, hsize * sizeof(xrecord_t *)); - } + hbits = xdl_hashbits((unsigned int) narec); + hsize = 1 << hbits; + if (!(rhash = (xrecord_t **) xdl_malloc(hsize * sizeof(xrecord_t *)))) + goto abort; + memset(rhash, 0, hsize * sizeof(xrecord_t *)); nrec = 0; if ((cur = blk = xdl_mmfile_first(mf, &bsize)) != NULL) { @@ -208,9 +204,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_ crec->size = (long) (cur - prev); crec->ha = hav; recs[nrec++] = crec; - - if ((XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF) && - xdl_classify_record(pass, cf, rhash, hbits, crec) < 0) + if (xdl_classify_record(pass, cf, rhash, hbits, crec) < 0) goto abort; } } @@ -279,8 +273,7 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, enl1 = xdl_guess_lines(mf1, sample) + 1; enl2 = xdl_guess_lines(mf2, sample) + 1; - if (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF && - xdl_init_classifier(&cf, enl1 + enl2 + 1, xpp->flags) < 0) + if (xdl_init_classifier(&cf, enl1 + enl2 + 1, xpp->flags) < 0) return -1; if (xdl_prepare_ctx(1, mf1, enl1, xpp, &cf, &xe->xdf1) < 0) { @@ -305,8 +298,7 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, return -1; } - if (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF) - xdl_free_classifier(&cf); + xdl_free_classifier(&cf); return 0; } -- gitgitgadget ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 1/3] diff histogram: intern strings 2021-11-17 11:20 ` [PATCH 1/3] diff histogram: intern strings Phillip Wood via GitGitGadget @ 2021-11-17 15:55 ` Derrick Stolee 2021-11-17 16:46 ` Jeff King 2021-11-17 16:52 ` Phillip Wood 0 siblings, 2 replies; 15+ messages in thread From: Derrick Stolee @ 2021-11-17 15:55 UTC (permalink / raw) To: Phillip Wood via GitGitGadget, git; +Cc: Phillip Wood On 11/17/2021 6:20 AM, Phillip Wood via GitGitGadget wrote: > From: Phillip Wood <phillip.wood@dunelm.org.uk> > > Histogram is the only diff algorithm not to call > xdl_classify_record(). xdl_classify_record() ensures that the hash > values of two strings that are not equal differ which means that it is > not necessary to use xdl_recmatch() when comparing lines, all that is > necessary is to compare the hash values. This gives a 7% reduction in > the runtime of "git log --patch" when using the histogram diff > algorithm. > > Test HEAD^ HEAD > ----------------------------------------------------------------------------- > 4000.1: log -3000 (baseline) 0.18(0.14+0.04) 0.19(0.17+0.02) +5.6% > 4000.2: log --raw -3000 (tree-only) 0.99(0.77+0.21) 0.98(0.78+0.20) -1.0% > 4000.3: log -p -3000 (Myers) 4.84(4.31+0.51) 4.81(4.15+0.64) -0.6% > 4000.4: log -p -3000 --histogram 6.34(5.86+0.46) 5.87(5.19+0.66) -7.4% > 4000.5: log -p -3000 --patience 5.39(4.60+0.76) 5.35(4.60+0.73) -0.7% > > Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk> > --- > xdiff/xhistogram.c | 5 ++--- > xdiff/xprepare.c | 24 ++++++++---------------- > 2 files changed, 10 insertions(+), 19 deletions(-) > > diff --git a/xdiff/xhistogram.c b/xdiff/xhistogram.c > index e694bfd9e31..6c1c88a69a1 100644 > --- a/xdiff/xhistogram.c > +++ b/xdiff/xhistogram.c > @@ -91,9 +91,8 @@ struct region { > static int cmp_recs(xpparam_t const *xpp, > xrecord_t *r1, xrecord_t *r2) > { > - return r1->ha == r2->ha && > - xdl_recmatch(r1->ptr, r1->size, r2->ptr, r2->size, > - xpp->flags); > + return r1->ha == r2->ha; > + nit: stray newline. The only meaningful change here is that you are relying entirely on the hash and not checking the content again. This means that hash collisions on this 32-bit hash could start introducing different results. Are we worried about that? I see that a similar hash-comparison is done in xpatience.c without further checking the contents, but xdiffi.c compares the hashes and then checks with xdl_recmatch(). So, we are still not reaching full consistency across all diff algorithms with how we handle these comparisons. I think it is good to have at least one that can be used if/when we hit these hash collisions within a diff, but it can be hard to communicate to a user why they need to change a diff algorithm for such an internal reason. The following bits looked scary at first, but you are just removing the special-casing of XDF_HISTOGRAM_DIFF from the preparation stage. > - if (XDF_DIFF_ALG(xpp->flags) == XDF_HISTOGRAM_DIFF) > - hbits = hsize = 0; > - else { > - hbits = xdl_hashbits((unsigned int) narec); > - hsize = 1 << hbits; > - if (!(rhash = (xrecord_t **) xdl_malloc(hsize * sizeof(xrecord_t *)))) > - goto abort; > - memset(rhash, 0, hsize * sizeof(xrecord_t *)); > - } > + hbits = xdl_hashbits((unsigned int) narec); > + hsize = 1 << hbits; > + if (!(rhash = (xrecord_t **) xdl_malloc(hsize * sizeof(xrecord_t *)))) > + goto abort; > + memset(rhash, 0, hsize * sizeof(xrecord_t *)); > - if ((XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF) && > - xdl_classify_record(pass, cf, rhash, hbits, crec) < 0) > + if (xdl_classify_record(pass, cf, rhash, hbits, crec) < 0) > - if (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF && > - xdl_init_classifier(&cf, enl1 + enl2 + 1, xpp->flags) < 0) > + if (xdl_init_classifier(&cf, enl1 + enl2 + 1, xpp->flags) < 0) > - if (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF) > - xdl_free_classifier(&cf); > + xdl_free_classifier(&cf); The existence of these conditions gave me pause, so I went to look for where they were inserted. They were made in 9f37c27 (xdiff/xprepare: skip classification, 2011-07-12) with the justification that We don't need any of that in histogram diff, so we omit calls to these functions. We also skip allocating memory to the hash table, rhash, as it is no longer used. This gives us a small boost in performance. But you are actually _using_ these preparation steps, which means you are re-adding the cost of hashing but overall improving because you use the data correctly. Excellent. Thanks, -Stolee ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/3] diff histogram: intern strings 2021-11-17 15:55 ` Derrick Stolee @ 2021-11-17 16:46 ` Jeff King 2021-11-17 16:52 ` Phillip Wood 1 sibling, 0 replies; 15+ messages in thread From: Jeff King @ 2021-11-17 16:46 UTC (permalink / raw) To: Derrick Stolee; +Cc: Phillip Wood via GitGitGadget, git, Phillip Wood On Wed, Nov 17, 2021 at 10:55:02AM -0500, Derrick Stolee wrote: > > diff --git a/xdiff/xhistogram.c b/xdiff/xhistogram.c > > index e694bfd9e31..6c1c88a69a1 100644 > > --- a/xdiff/xhistogram.c > > +++ b/xdiff/xhistogram.c > > @@ -91,9 +91,8 @@ struct region { > > static int cmp_recs(xpparam_t const *xpp, > > xrecord_t *r1, xrecord_t *r2) > > { > > - return r1->ha == r2->ha && > > - xdl_recmatch(r1->ptr, r1->size, r2->ptr, r2->size, > > - xpp->flags); > > + return r1->ha == r2->ha; > > + > > nit: stray newline. > > The only meaningful change here is that you are relying entirely on > the hash and not checking the content again. This means that hash > collisions on this 32-bit hash could start introducing different > results. Are we worried about that? I had the same thought. But running "git log --histogram -p" on git.git does not seem to produce any differences between the two. So perhaps collisions are removed elsewhere. It would be nice to have a better understanding of this before proceeding with this change. Curiously, I got a much smaller improvement in my test, which did: git log --no-merges -p --histogram :^po My assumption being that "po/" diffs are big and uninteresting and so bloat the output. But that turned out to be interesting timing-wise. Excluding "po/" means that patch produces only a 0.6% improvement in speed. But _just_ running the diffs for po shows a 24% speedup! I guess this is just because those files are much larger than average (and changed in fewer commits), meaning that the time spent hashing and comparing lines will show up as a larger percentage of the total work. But I wondered... > The existence of these conditions gave me pause, so I went to look for where they > were inserted. They were made in 9f37c27 (xdiff/xprepare: skip classification, > 2011-07-12) with the justification that > > We don't need any of that in histogram diff, so we omit calls to these > functions. We also skip allocating memory to the hash table, rhash, as > it is no longer used. > > This gives us a small boost in performance. > > But you are actually _using_ these preparation steps, which means you are > re-adding the cost of hashing but overall improving because you use the > data correctly. Excellent. Are we making a tradeoff here based on the data patterns? That is, it seems like we are spending extra time upfront to do classification in order to get quicker comparisons later. Presumably the upfront work is O(n) in the length of the file. How many comparisons do we expect to save? Is it also linear in the number of lines, or could it be super- or sub-linear depending on the actual diff? -Peff ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/3] diff histogram: intern strings 2021-11-17 15:55 ` Derrick Stolee 2021-11-17 16:46 ` Jeff King @ 2021-11-17 16:52 ` Phillip Wood 2021-11-18 15:35 ` Johannes Schindelin 1 sibling, 1 reply; 15+ messages in thread From: Phillip Wood @ 2021-11-17 16:52 UTC (permalink / raw) To: Derrick Stolee, Phillip Wood via GitGitGadget, git; +Cc: Phillip Wood Hi Stolee On 17/11/2021 15:55, Derrick Stolee wrote: > > > On 11/17/2021 6:20 AM, Phillip Wood via GitGitGadget wrote: >> From: Phillip Wood <phillip.wood@dunelm.org.uk> >> >> Histogram is the only diff algorithm not to call >> xdl_classify_record(). xdl_classify_record() ensures that the hash >> values of two strings that are not equal differ which means that it is >> not necessary to use xdl_recmatch() when comparing lines, all that is >> necessary is to compare the hash values. This gives a 7% reduction in >> the runtime of "git log --patch" when using the histogram diff >> algorithm. >> >> Test HEAD^ HEAD >> ----------------------------------------------------------------------------- >> 4000.1: log -3000 (baseline) 0.18(0.14+0.04) 0.19(0.17+0.02) +5.6% >> 4000.2: log --raw -3000 (tree-only) 0.99(0.77+0.21) 0.98(0.78+0.20) -1.0% >> 4000.3: log -p -3000 (Myers) 4.84(4.31+0.51) 4.81(4.15+0.64) -0.6% >> 4000.4: log -p -3000 --histogram 6.34(5.86+0.46) 5.87(5.19+0.66) -7.4% >> 4000.5: log -p -3000 --patience 5.39(4.60+0.76) 5.35(4.60+0.73) -0.7% >> >> Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk> >> --- >> xdiff/xhistogram.c | 5 ++--- >> xdiff/xprepare.c | 24 ++++++++---------------- >> 2 files changed, 10 insertions(+), 19 deletions(-) >> >> diff --git a/xdiff/xhistogram.c b/xdiff/xhistogram.c >> index e694bfd9e31..6c1c88a69a1 100644 >> --- a/xdiff/xhistogram.c >> +++ b/xdiff/xhistogram.c >> @@ -91,9 +91,8 @@ struct region { >> static int cmp_recs(xpparam_t const *xpp, >> xrecord_t *r1, xrecord_t *r2) >> { >> - return r1->ha == r2->ha && >> - xdl_recmatch(r1->ptr, r1->size, r2->ptr, r2->size, >> - xpp->flags); >> + return r1->ha == r2->ha; >> + > > nit: stray newline. > > The only meaningful change here is that you are relying entirely on > the hash and not checking the content again. This means that hash > collisions on this 32-bit hash could start introducing different > results. Are we worried about that? xdiff-interface.c limits the size of the file that can be passed to just below 1GB so we are safe. The other diff algorithms are already using this optimization. (the hash is 64 bits on most platforms, the xdiff code could really benefit from a unsigned long -> size_t cleanup) > I see that a similar hash-comparison is done in xpatience.c without > further checking the contents, but xdiffi.c compares the hashes and > then checks with xdl_recmatch(). So, we are still not reaching full > consistency across all diff algorithms with how we handle these > comparisons. I think it is good to have at least one that can be used > if/when we hit these hash collisions within a diff, but it can be hard > to communicate to a user why they need to change a diff algorithm for > such an internal reason. I think that code in xdiffi.c is only used by the diff slider code that implements diff.indentHeuristic, the Myers diff implementation just compares the hash values. Before this change the diff slider code needed to do the full check to be correct when processing the output of the histogram algorithm. > The following bits looked scary at first, but you are just removing the > special-casing of XDF_HISTOGRAM_DIFF from the preparation stage. > >> - if (XDF_DIFF_ALG(xpp->flags) == XDF_HISTOGRAM_DIFF) >> - hbits = hsize = 0; >> - else { >> - hbits = xdl_hashbits((unsigned int) narec); >> - hsize = 1 << hbits; >> - if (!(rhash = (xrecord_t **) xdl_malloc(hsize * sizeof(xrecord_t *)))) >> - goto abort; >> - memset(rhash, 0, hsize * sizeof(xrecord_t *)); >> - } >> + hbits = xdl_hashbits((unsigned int) narec); >> + hsize = 1 << hbits; >> + if (!(rhash = (xrecord_t **) xdl_malloc(hsize * sizeof(xrecord_t *)))) >> + goto abort; >> + memset(rhash, 0, hsize * sizeof(xrecord_t *)); > >> - if ((XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF) && >> - xdl_classify_record(pass, cf, rhash, hbits, crec) < 0) >> + if (xdl_classify_record(pass, cf, rhash, hbits, crec) < 0) > >> - if (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF && >> - xdl_init_classifier(&cf, enl1 + enl2 + 1, xpp->flags) < 0) >> + if (xdl_init_classifier(&cf, enl1 + enl2 + 1, xpp->flags) < 0) > >> - if (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF) >> - xdl_free_classifier(&cf); >> + xdl_free_classifier(&cf); > > The existence of these conditions gave me pause, so I went to look for where they > were inserted. They were made in 9f37c27 (xdiff/xprepare: skip classification, > 2011-07-12) with the justification that > > We don't need any of that in histogram diff, so we omit calls to these > functions. We also skip allocating memory to the hash table, rhash, as > it is no longer used. > > This gives us a small boost in performance. > > But you are actually _using_ these preparation steps, which means you are > re-adding the cost of hashing but overall improving because you use the > data correctly. Excellent. Thanks for looking at this Best Wishes Phillip > Thanks, > -Stolee > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/3] diff histogram: intern strings 2021-11-17 16:52 ` Phillip Wood @ 2021-11-18 15:35 ` Johannes Schindelin 2021-11-18 15:42 ` Jeff King 0 siblings, 1 reply; 15+ messages in thread From: Johannes Schindelin @ 2021-11-18 15:35 UTC (permalink / raw) To: Phillip Wood; +Cc: Derrick Stolee, Phillip Wood via GitGitGadget, git Hi, On Wed, 17 Nov 2021, Phillip Wood wrote: > On 17/11/2021 15:55, Derrick Stolee wrote: > > > > On 11/17/2021 6:20 AM, Phillip Wood via GitGitGadget wrote: > > > From: Phillip Wood <phillip.wood@dunelm.org.uk> > > > > > > Histogram is the only diff algorithm not to call > > > xdl_classify_record(). xdl_classify_record() ensures that the hash > > > values of two strings that are not equal differ which means that it is > > > not necessary to use xdl_recmatch() when comparing lines, all that is > > > necessary is to compare the hash values. This gives a 7% reduction in > > > the runtime of "git log --patch" when using the histogram diff > > > algorithm. > > > > > > Test HEAD^ HEAD > > > ----------------------------------------------------------------------------- > > > 4000.1: log -3000 (baseline) 0.18(0.14+0.04) 0.19(0.17+0.02) > > > +5.6% > > > 4000.2: log --raw -3000 (tree-only) 0.99(0.77+0.21) 0.98(0.78+0.20) > > > -1.0% > > > 4000.3: log -p -3000 (Myers) 4.84(4.31+0.51) 4.81(4.15+0.64) > > > -0.6% > > > 4000.4: log -p -3000 --histogram 6.34(5.86+0.46) 5.87(5.19+0.66) > > > -7.4% > > > 4000.5: log -p -3000 --patience 5.39(4.60+0.76) 5.35(4.60+0.73) > > > -0.7% > > > > > > Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk> > > > --- > > > xdiff/xhistogram.c | 5 ++--- > > > xdiff/xprepare.c | 24 ++++++++---------------- > > > 2 files changed, 10 insertions(+), 19 deletions(-) > > > > > > diff --git a/xdiff/xhistogram.c b/xdiff/xhistogram.c > > > index e694bfd9e31..6c1c88a69a1 100644 > > > --- a/xdiff/xhistogram.c > > > +++ b/xdiff/xhistogram.c > > > @@ -91,9 +91,8 @@ struct region { > > > static int cmp_recs(xpparam_t const *xpp, > > > xrecord_t *r1, xrecord_t *r2) > > > { > > > - return r1->ha == r2->ha && > > > - xdl_recmatch(r1->ptr, r1->size, r2->ptr, r2->size, > > > - xpp->flags); > > > + return r1->ha == r2->ha; > > > + > > > > nit: stray newline. > > > > The only meaningful change here is that you are relying entirely on > > the hash and not checking the content again. This means that hash > > collisions on this 32-bit hash could start introducing different > > results. Are we worried about that? > > xdiff-interface.c limits the size of the file that can be passed to just below > 1GB so we are safe. The other diff algorithms are already using this > optimization. (the hash is 64 bits on most platforms, the xdiff code could > really benefit from a unsigned long -> size_t cleanup) I think the really important thing to point out is that `xdl_classify_record()` ensures that the `ha` attribute is different for different text. AFAIR it even "linearizes" the `ha` values, i.e. they won't be all over the place but start at 0 (or 1). So no, I'm not worried about collisions. That would be a bug in `xdl_classify_record()` and I think we would have caught this bug by now. Ciao, Dscho ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/3] diff histogram: intern strings 2021-11-18 15:35 ` Johannes Schindelin @ 2021-11-18 15:42 ` Jeff King 2021-11-19 10:05 ` Phillip Wood 0 siblings, 1 reply; 15+ messages in thread From: Jeff King @ 2021-11-18 15:42 UTC (permalink / raw) To: Johannes Schindelin Cc: Phillip Wood, Derrick Stolee, Phillip Wood via GitGitGadget, git On Thu, Nov 18, 2021 at 04:35:48PM +0100, Johannes Schindelin wrote: > I think the really important thing to point out is that > `xdl_classify_record()` ensures that the `ha` attribute is different for > different text. AFAIR it even "linearizes" the `ha` values, i.e. they > won't be all over the place but start at 0 (or 1). > > So no, I'm not worried about collisions. That would be a bug in > `xdl_classify_record()` and I think we would have caught this bug by now. Ah, thanks for that explanation. That addresses my collision concern from earlier in the thread completely. -Peff ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/3] diff histogram: intern strings 2021-11-18 15:42 ` Jeff King @ 2021-11-19 10:05 ` Phillip Wood 2021-11-19 14:45 ` Jeff King 2021-11-19 15:49 ` Johannes Schindelin 0 siblings, 2 replies; 15+ messages in thread From: Phillip Wood @ 2021-11-19 10:05 UTC (permalink / raw) To: Jeff King, Johannes Schindelin Cc: Phillip Wood, Derrick Stolee, Phillip Wood via GitGitGadget, git On 18/11/2021 15:42, Jeff King wrote: > On Thu, Nov 18, 2021 at 04:35:48PM +0100, Johannes Schindelin wrote: > >> I think the really important thing to point out is that >> `xdl_classify_record()` ensures that the `ha` attribute is different for >> different text. AFAIR it even "linearizes" the `ha` values, i.e. they >> won't be all over the place but start at 0 (or 1). >> >> So no, I'm not worried about collisions. That would be a bug in >> `xdl_classify_record()` and I think we would have caught this bug by now. > > Ah, thanks for that explanation. That addresses my collision concern from > earlier in the thread completely. Yes, thanks for clarifying I should have been clearer in my reply to Stolee. The reason I was waffling on about file sizes is that there can only be a collision if there are more than 2^32 unique lines. I think the minimum file size where that happens is just below 10GB when one side of the diff has 2^31 lines and the other has 2^31 + 1 lines and all the lines are unique. Best Wishes Phillip ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/3] diff histogram: intern strings 2021-11-19 10:05 ` Phillip Wood @ 2021-11-19 14:45 ` Jeff King 2021-11-19 21:22 ` Ævar Arnfjörð Bjarmason 2021-11-19 15:49 ` Johannes Schindelin 1 sibling, 1 reply; 15+ messages in thread From: Jeff King @ 2021-11-19 14:45 UTC (permalink / raw) To: phillip.wood Cc: Johannes Schindelin, Derrick Stolee, Phillip Wood via GitGitGadget, git On Fri, Nov 19, 2021 at 10:05:32AM +0000, Phillip Wood wrote: > On 18/11/2021 15:42, Jeff King wrote: > > On Thu, Nov 18, 2021 at 04:35:48PM +0100, Johannes Schindelin wrote: > > > > > I think the really important thing to point out is that > > > `xdl_classify_record()` ensures that the `ha` attribute is different for > > > different text. AFAIR it even "linearizes" the `ha` values, i.e. they > > > won't be all over the place but start at 0 (or 1). > > > > > > So no, I'm not worried about collisions. That would be a bug in > > > `xdl_classify_record()` and I think we would have caught this bug by now. > > > > Ah, thanks for that explanation. That addresses my collision concern from > > earlier in the thread completely. > > Yes, thanks for clarifying I should have been clearer in my reply to Stolee. > The reason I was waffling on about file sizes is that there can only be a > collision if there are more than 2^32 unique lines. I think the minimum file > size where that happens is just below 10GB when one side of the diff has > 2^31 lines and the other has 2^31 + 1 lines and all the lines are unique. Right, that makes more sense (and we are not likely to lift the 1GB limit anytime soon; there are tons of 32-bit variables and potential integer overflows all through the xdiff code). It's probably worth explaining this a bit in the commit message. I also, FWIW, found the subject confusing. I expected "intern" to refer to keeping a single copy of some strings. Maybe: Subject: diff histogram: skip xdl_recmatch for comparing records or something? -Peff ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/3] diff histogram: intern strings 2021-11-19 14:45 ` Jeff King @ 2021-11-19 21:22 ` Ævar Arnfjörð Bjarmason 2021-11-19 22:19 ` Jeff King 0 siblings, 1 reply; 15+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2021-11-19 21:22 UTC (permalink / raw) To: Jeff King Cc: phillip.wood, Johannes Schindelin, Derrick Stolee, Phillip Wood via GitGitGadget, git On Fri, Nov 19 2021, Jeff King wrote: > On Fri, Nov 19, 2021 at 10:05:32AM +0000, Phillip Wood wrote: > >> On 18/11/2021 15:42, Jeff King wrote: >> > On Thu, Nov 18, 2021 at 04:35:48PM +0100, Johannes Schindelin wrote: >> > >> > > I think the really important thing to point out is that >> > > `xdl_classify_record()` ensures that the `ha` attribute is different for >> > > different text. AFAIR it even "linearizes" the `ha` values, i.e. they >> > > won't be all over the place but start at 0 (or 1). >> > > >> > > So no, I'm not worried about collisions. That would be a bug in >> > > `xdl_classify_record()` and I think we would have caught this bug by now. >> > >> > Ah, thanks for that explanation. That addresses my collision concern from >> > earlier in the thread completely. >> >> Yes, thanks for clarifying I should have been clearer in my reply to Stolee. >> The reason I was waffling on about file sizes is that there can only be a >> collision if there are more than 2^32 unique lines. I think the minimum file >> size where that happens is just below 10GB when one side of the diff has >> 2^31 lines and the other has 2^31 + 1 lines and all the lines are unique. > > Right, that makes more sense (and we are not likely to lift the 1GB > limit anytime soon; there are tons of 32-bit variables and potential > integer overflows all through the xdiff code). Interestingly: $ du -sh 8gb* 8.1G 8gb 8.1G 8gb.cp $ ~/g/git/git -P -c core.bigFileThreshold=10g diff -U0 --no-index --no-color-moved 2gb 2gb.cp diff --git a/8gb b/8gb.cp index a886cdfe5ce..4965a132d44 100644 --- a/8gb +++ b/8gb.cp @@ -17,0 +18 @@ more +blah And the only change I made was: diff --git a/xdiff-interface.c b/xdiff-interface.c index 75b32aef51d..cb8ca5f5d0a 100644 --- a/xdiff-interface.c +++ b/xdiff-interface.c @@ -117,9 +117,6 @@ int xdi_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, xdemitconf_t co mmfile_t a = *mf1; mmfile_t b = *mf2; - if (mf1->size > MAX_XDIFF_SIZE || mf2->size > MAX_XDIFF_SIZE) - return -1; - if (!xecfg->ctxlen && !(xecfg->flags & XDL_EMIT_FUNCCONTEXT)) trim_common_tail(&a, &b); Perhaps we're being overly concervative with these hardcoded limits, at least on some platforms? This is Linux x86_64. I understand from skimming the above that it's about the pathological case, these two files are the same except for a trailer at the end. I wonder how far you could get with #define int size_t & the like ... :) ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/3] diff histogram: intern strings 2021-11-19 21:22 ` Ævar Arnfjörð Bjarmason @ 2021-11-19 22:19 ` Jeff King 0 siblings, 0 replies; 15+ messages in thread From: Jeff King @ 2021-11-19 22:19 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: phillip.wood, Johannes Schindelin, Derrick Stolee, Phillip Wood via GitGitGadget, git On Fri, Nov 19, 2021 at 10:22:04PM +0100, Ævar Arnfjörð Bjarmason wrote: > > Right, that makes more sense (and we are not likely to lift the 1GB > > limit anytime soon; there are tons of 32-bit variables and potential > > integer overflows all through the xdiff code). > > Interestingly: > > $ du -sh 8gb* > 8.1G 8gb > 8.1G 8gb.cp > $ ~/g/git/git -P -c core.bigFileThreshold=10g diff -U0 --no-index --no-color-moved 2gb 2gb.cp > diff --git a/8gb b/8gb.cp > index a886cdfe5ce..4965a132d44 100644 > --- a/8gb > +++ b/8gb.cp > @@ -17,0 +18 @@ more > +blah > > And the only change I made was: > > diff --git a/xdiff-interface.c b/xdiff-interface.c > index 75b32aef51d..cb8ca5f5d0a 100644 > --- a/xdiff-interface.c > +++ b/xdiff-interface.c > @@ -117,9 +117,6 @@ int xdi_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, xdemitconf_t co > mmfile_t a = *mf1; > mmfile_t b = *mf2; > > - if (mf1->size > MAX_XDIFF_SIZE || mf2->size > MAX_XDIFF_SIZE) > - return -1; > - > if (!xecfg->ctxlen && !(xecfg->flags & XDL_EMIT_FUNCCONTEXT)) > trim_common_tail(&a, &b); > > Perhaps we're being overly concervative with these hardcoded limits, at > least on some platforms? This is Linux x86_64. It's been a couple of years since I looked, but I'm fairly certain there are triggerable heap overflows. You probably need fewer than 2^31 lines, but more than 2^30, as that will overflow the size computation of an array whose elements are themselves 32-bit integers. For instance, this: perl -e 'print "x\n" x (2**30 + 10)' >gigaline cp gigaline gigaline.cp echo foo >>gigaline results in: $ git.compile -c core.bigfilethreshold=10g --no-pager diff --no-index gigaline gigaline.cp fatal: Out of memory, malloc failed (tried to allocate 18446744056529682432 bytes) so at some point we went negative with our allocation (and then it was cast to size_t when we passed it xmalloc). There's probably a value somewhere in the middle where it wraps but stays positive, and you'd get a heap overflow. > I understand from skimming the above that it's about the pathological > case, these two files are the same except for a trailer at the end. The real danger here is not producing a wrong answer for some dumb cases, but introducing an exploitable heap overflow. Switching to size_t, or at the very least using st_mult(), etc, everywhere in xdiff would help. I looked at that long ago, but eventually decided it was safer and less work to just stick the 1GB limit, since it practice nobody really cares about diffing beyond that level. (And the limit is really about number of lines, but 1GB of bytes is an easy proxy for that). It would be OK for somebody to fix it if they really want bigger diffs, but I think it has to be done carefully. -Peff ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/3] diff histogram: intern strings 2021-11-19 10:05 ` Phillip Wood 2021-11-19 14:45 ` Jeff King @ 2021-11-19 15:49 ` Johannes Schindelin 1 sibling, 0 replies; 15+ messages in thread From: Johannes Schindelin @ 2021-11-19 15:49 UTC (permalink / raw) To: Phillip Wood Cc: Jeff King, Derrick Stolee, Phillip Wood via GitGitGadget, git Hi Phillip, On Fri, 19 Nov 2021, Phillip Wood wrote: > On 18/11/2021 15:42, Jeff King wrote: > > On Thu, Nov 18, 2021 at 04:35:48PM +0100, Johannes Schindelin wrote: > > > > > I think the really important thing to point out is that > > > `xdl_classify_record()` ensures that the `ha` attribute is different > > > for different text. AFAIR it even "linearizes" the `ha` values, i.e. > > > they won't be all over the place but start at 0 (or 1). > > > > > > So no, I'm not worried about collisions. That would be a bug in > > > `xdl_classify_record()` and I think we would have caught this bug by > > > now. > > > > Ah, thanks for that explanation. That addresses my collision concern > > from earlier in the thread completely. > > Yes, thanks for clarifying I should have been clearer in my reply to > Stolee. The reason I was waffling on about file sizes is that there can > only be a collision if there are more than 2^32 unique lines. I think > the minimum file size where that happens is just below 10GB when one > side of the diff has 2^31 lines and the other has 2^31 + 1 lines and all > the lines are unique. Indeed, and as you pointed out, we already refuse to generate diffs for such large amounts of data. (For what it's worth, I totally agree with punting on such large data, it would also take too long a time to generate diffs on such large data to be reasonable.) Ciao, Dscho ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 2/3] xdiff: avoid unnecessary memory allocations 2021-11-17 11:20 [PATCH 0/3] xdiff: speedup histogram diff Phillip Wood via GitGitGadget 2021-11-17 11:20 ` [PATCH 1/3] diff histogram: intern strings Phillip Wood via GitGitGadget @ 2021-11-17 11:20 ` Phillip Wood via GitGitGadget 2021-11-17 11:20 ` [PATCH 3/3] xdiff: simplify comparison Phillip Wood via GitGitGadget 2021-11-18 15:40 ` [PATCH 0/3] xdiff: speedup histogram diff Johannes Schindelin 3 siblings, 0 replies; 15+ messages in thread From: Phillip Wood via GitGitGadget @ 2021-11-17 11:20 UTC (permalink / raw) To: git; +Cc: Phillip Wood, Phillip Wood From: Phillip Wood <phillip.wood@dunelm.org.uk> rindex and ha are only used by xdl_cleanup_records() which is not called by the histogram or patience algorithms. The perf test results show a small reduction in run time but that is probably within the noise. Test HEAD^ HEAD ----------------------------------------------------------------------------- 4000.1: log -3000 (baseline) 0.19(0.17+0.02) 0.19(0.12+0.07) +0.0% 4000.2: log --raw -3000 (tree-only) 0.98(0.78+0.20) 0.98(0.81+0.16) +0.0% 4000.3: log -p -3000 (Myers) 4.81(4.15+0.64) 4.81(4.23+0.56) +0.0% 4000.4: log -p -3000 --histogram 5.87(5.19+0.66) 5.83(5.11+0.70) -0.7% 4000.5: log -p -3000 --patience 5.35(4.60+0.73) 5.31(4.61+0.69) -0.7% Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk> --- xdiff/xprepare.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c index 7fae0727a02..4527a4a07c4 100644 --- a/xdiff/xprepare.c +++ b/xdiff/xprepare.c @@ -213,10 +213,13 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_ goto abort; memset(rchg, 0, (nrec + 2) * sizeof(char)); - if (!(rindex = (long *) xdl_malloc((nrec + 1) * sizeof(long)))) - goto abort; - if (!(ha = (unsigned long *) xdl_malloc((nrec + 1) * sizeof(unsigned long)))) - goto abort; + if ((XDF_DIFF_ALG(xpp->flags) != XDF_PATIENCE_DIFF) && + (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF)) { + if (!(rindex = xdl_malloc((nrec + 1) * sizeof(*rindex)))) + goto abort; + if (!(ha = xdl_malloc((nrec + 1) * sizeof(*ha)))) + goto abort; + } xdf->nrec = nrec; xdf->recs = recs; -- gitgitgadget ^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 3/3] xdiff: simplify comparison 2021-11-17 11:20 [PATCH 0/3] xdiff: speedup histogram diff Phillip Wood via GitGitGadget 2021-11-17 11:20 ` [PATCH 1/3] diff histogram: intern strings Phillip Wood via GitGitGadget 2021-11-17 11:20 ` [PATCH 2/3] xdiff: avoid unnecessary memory allocations Phillip Wood via GitGitGadget @ 2021-11-17 11:20 ` Phillip Wood via GitGitGadget 2021-11-18 15:40 ` [PATCH 0/3] xdiff: speedup histogram diff Johannes Schindelin 3 siblings, 0 replies; 15+ messages in thread From: Phillip Wood via GitGitGadget @ 2021-11-17 11:20 UTC (permalink / raw) To: git; +Cc: Phillip Wood, Phillip Wood From: Phillip Wood <phillip.wood@dunelm.org.uk> Now that the histogram algorithm calls xdl_classify_record() it is no longer necessary to use xdl_recmatch() to compare lines, it is sufficient just to compare the hash values. This has a negligible effect on performance. Test HEAD~1 HEAD ----------------------------------------------------------------------------- 4000.1: log -3000 (baseline) 0.19(0.12+0.07) 0.18(0.14+0.04) -5.3% 4000.2: log --raw -3000 (tree-only) 0.98(0.81+0.16) 0.98(0.79+0.18) +0.0% 4000.3: log -p -3000 (Myers) 4.81(4.23+0.56) 4.80(4.26+0.53) -0.2% 4000.4: log -p -3000 --histogram 5.83(5.11+0.70) 5.82(5.15+0.65) -0.2% 4000.5: log -p -3000 --patience 5.31(4.61+0.69) 5.30(4.54+0.75) -0.2% Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk> --- xdiff/xdiffi.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c index a4542c05b61..6a3b9280beb 100644 --- a/xdiff/xdiffi.c +++ b/xdiff/xdiffi.c @@ -392,10 +392,7 @@ static xdchange_t *xdl_add_change(xdchange_t *xscr, long i1, long i2, long chg1, static int recs_match(xrecord_t *rec1, xrecord_t *rec2, long flags) { - return (rec1->ha == rec2->ha && - xdl_recmatch(rec1->ptr, rec1->size, - rec2->ptr, rec2->size, - flags)); + return (rec1->ha == rec2->ha); } /* -- gitgitgadget ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 0/3] xdiff: speedup histogram diff 2021-11-17 11:20 [PATCH 0/3] xdiff: speedup histogram diff Phillip Wood via GitGitGadget ` (2 preceding siblings ...) 2021-11-17 11:20 ` [PATCH 3/3] xdiff: simplify comparison Phillip Wood via GitGitGadget @ 2021-11-18 15:40 ` Johannes Schindelin 3 siblings, 0 replies; 15+ messages in thread From: Johannes Schindelin @ 2021-11-18 15:40 UTC (permalink / raw) To: Phillip Wood via GitGitGadget; +Cc: git, Phillip Wood Hi Phillip, On Wed, 17 Nov 2021, Phillip Wood via GitGitGadget wrote: > Histogram is the only diff algorithm not to call xdl_classify_record(). > Calling xdl_classify_record() means that it is not necessary to use > xdl_recmatch() when comparing lines, all that is necessary is to compare the > hash values. This gives a 7% reduction in the runtime of "git log --patch" > when using the histogram diff algorithm. Thanks for this! I had a look over the patches (and refreshed my memory of what `xdl_classify_record()` does), and they look good. Having said that, I would love to increase my confidence by backing up the patches with a sort of stress test. Do we have anything like that? I guess not, a `git grep histogram t/` did not really turn up anything promising, p4000 does not even validate the output... Thanks, Dscho ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2021-11-19 22:19 UTC | newest] Thread overview: 15+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2021-11-17 11:20 [PATCH 0/3] xdiff: speedup histogram diff Phillip Wood via GitGitGadget 2021-11-17 11:20 ` [PATCH 1/3] diff histogram: intern strings Phillip Wood via GitGitGadget 2021-11-17 15:55 ` Derrick Stolee 2021-11-17 16:46 ` Jeff King 2021-11-17 16:52 ` Phillip Wood 2021-11-18 15:35 ` Johannes Schindelin 2021-11-18 15:42 ` Jeff King 2021-11-19 10:05 ` Phillip Wood 2021-11-19 14:45 ` Jeff King 2021-11-19 21:22 ` Ævar Arnfjörð Bjarmason 2021-11-19 22:19 ` Jeff King 2021-11-19 15:49 ` Johannes Schindelin 2021-11-17 11:20 ` [PATCH 2/3] xdiff: avoid unnecessary memory allocations Phillip Wood via GitGitGadget 2021-11-17 11:20 ` [PATCH 3/3] xdiff: simplify comparison Phillip Wood via GitGitGadget 2021-11-18 15:40 ` [PATCH 0/3] xdiff: speedup histogram diff Johannes Schindelin
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).