[PATCH] xdiff: re-diff shifted change groups when using histogram algorithm

public inbox for git@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] xdiff: re-diff shifted change groups when using histogram algorithm
@ 2025-12-06 20:51 Yee Cheng Chin via GitGitGadget
  2026-01-21 20:51 ` Junio C Hamano
  2026-03-02 14:54 ` [PATCH v2] " Yee Cheng Chin via GitGitGadget
  0 siblings, 2 replies; 17+ messages in thread
From: Yee Cheng Chin via GitGitGadget @ 2025-12-06 20:51 UTC (permalink / raw)
  To: git; +Cc: Yee Cheng Chin, Yee Cheng Chin

From: Yee Cheng Chin <ychin.git@gmail.com>

After a diff algorithm has been run, the compaction phase
(xdl_change_compact()) shifts and merges change groups to produce a
cleaner output. However, this shifting could create a new matched group
where both sides now have matching lines. This results in a
wrong-looking diff output which contains redundant lines that are the
same on both files.

Fix this by detecting this situation, and re-diff the texts on each side
to find similar lines, using the fall-back Myer's diff. Only do this for
histogram diff as it's the only algorithm where this is relevant. Below
contains an example, and more details.

For an example, consider two files below:

    file1:
        A

        A
        A
        A

        A
        A
        A

    file2:
        A

        A
        x
        A

        A
        A
        A

When using Myer's diff, the algorithm finds that only the "x" has been
changed, and produces a final diff result (these are line diffs, but
using word-diff syntax for ease of presentation):

        A A[-A-]{+x+}A AAA

When using histogram diff, the algorithm first discovers the LCS "A
AAA", which it uses as anchor, then produces an intermediate diff:

        {+A Ax+}A AAA[- AAA-].

This is a longer diff than Myer's, but it's still self-consistent.
However, the compaction phase attempts to shift the first file's diff
group upwards (note that this shift crosses the anchor that histogram
had used), leading to the final results for histogram diff:

        [-A AA-]{+A Ax+}A AAA

This is a technically correct patch but looks clearly redundant to a
human as the first 3 lines should not be in the diff.

The fix would detect that a shift has caused matching to a new group,
and re-diff the "A AA" and "A Ax" parts, which results in "A A"
correctly re-marked as unchanged. This creates the now correct histogram
diff:

        A A[-A-]{+x+}A AAA

This issue is not applicable to Myer's diff algorithm as it already
generates a minimal diff, which means a shift cannot result in a smaller
diff output (the default Myer's diff in xdiff is not guaranteed to be
minimal for performance reasons, but it typically does a good enough
job).

It's also not applicable to patience diff, because it uses only unique
lines as anchor for its splits, and falls back to Myer's diff within
each split. Shifting requires both ends having the same lines, and
therefore cannot cross the unique line boundaries established by the
patience algorithm. In contrast histogram diff uses non-unique lines as
anchors, and therefore shifting can cross over them.

This issue is rare in a normal repository. Below is a table of
repositories (`git log --no-merges -p --histogram -1000`), showing how
many times a re-diff was done and how many times it resulted in finding
matching lines (therefore addressing this issue) with the fix. In
general it is fewer than 1% of diff's that exhibit this offending
behavior:

| Repo (1k commits)  | Re-diff | Found matching lines |
|--------------------|---------|----------------------|
| llvm-project       |  45     | 11                   |
| vim                | 110     |  9                   |
| git                |  18     |  2                   |
| WebKit             | 168     |  1                   |
| ripgrep            |  22     |  1                   |
| cpython            |  32     |  0                   |
| vscode             |  13     |  0                   |

Signed-off-by: Yee Cheng Chin <ychin.git@gmail.com>
---
    xdiff: re-diff shifted change groups when using histogram algorithm
    
    This is a somewhat rare issue when using histogram to diff files, as the
    algorithm will generate a diff output that looks redundant and wrong to
    a human. I provided a synthetic example in the commit message, but for
    one from the real world, do the following command in the Git repo:
    
    git show -U0 --diff-algorithm=histogram 2c8999027c -- po/ga.po
    
    
    Scroll to the line "@@ -7239,3 +5831,5 @@", and we can see the following
    diff hunk:
    
    -#: builtin/diff.c
    -msgid "Not a git repository"
    -msgstr "Ní stór git"
    +msgid "cannot come back to cwd"
    +msgstr "ní féidir teacht ar ais chuig cwd"
    +
    +msgid "Not a git repository"
    +msgstr "Ní stór git é"
    
    
    We can see that the "Not a git repository" line is identical on both
    sides, which means it should not have been in the diff results to begin
    with. Under other diff algorithms (or histogram diff with this fix),
    said line is not considered to be part of the diff.
    
    Also, when I was implementing this, an alternative I was considering was
    to add a bespoke linear-time algorithm to remove matching lines on both
    sides. Just calling the fall-back diff seems the easiest and cleanest
    and so I went with that.

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-2120%2Fychin%2Fxdiff-fix-compact-remove-redundant-lines-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-2120/ychin/xdiff-fix-compact-remove-redundant-lines-v1
Pull-Request: https://github.com/git/git/pull/2120

 t/meson.build                         |   1 +
 t/t4073-diff-shifted-matched-group.sh | 137 ++++++++++++++++++++++++++
 xdiff/xdiffi.c                        |  43 ++++++++
 3 files changed, 181 insertions(+)
 create mode 100755 t/t4073-diff-shifted-matched-group.sh

diff --git a/t/meson.build b/t/meson.build
index 7c994d4643..ee233e80da 100644
--- a/t/meson.build
+++ b/t/meson.build
@@ -497,6 +497,7 @@ integration_tests = [
   't4070-diff-pairs.sh',
   't4071-diff-minimal.sh',
   't4072-diff-max-depth.sh',
+  't4073-diff-shifted-matched-group.sh',
   't4100-apply-stat.sh',
   't4101-apply-nonl.sh',
   't4102-apply-rename.sh',
diff --git a/t/t4073-diff-shifted-matched-group.sh b/t/t4073-diff-shifted-matched-group.sh
new file mode 100755
index 0000000000..0e915b78a6
--- /dev/null
+++ b/t/t4073-diff-shifted-matched-group.sh
@@ -0,0 +1,137 @@
+#!/bin/sh
+
+test_description='shifted diff groups re-diffing during histogram diff'
+
+. ./test-lib.sh
+
+test_expect_success 'shifted diff group should re-diff to minimize patch' '
+	test_write_lines A x A A A x A A A >file1 &&
+	test_write_lines A x A Z A x A A A >file2 &&
+
+	file1_h=$(git rev-parse --short $(git hash-object file1)) &&
+	file2_h=$(git rev-parse --short $(git hash-object file2)) &&
+
+	cat >expect <<-EOF &&
+	diff --git a/file1 b/file2
+	index $file1_h..$file2_h 100644
+	--- a/file1
+	+++ b/file2
+	@@ -1,7 +1,7 @@
+	 A
+	 x
+	 A
+	-A
+	+Z
+	 A
+	 x
+	 A
+	EOF
+
+	test_expect_code 1 git diff --no-index --histogram file1 file2 >output &&
+	test_cmp expect output
+'
+
+test_expect_success 're-diff should preserve diff flags' '
+	test_write_lines a b c a b c >file1 &&
+	test_write_lines x " b" z a b c >file2 &&
+
+	file1_h=$(git rev-parse --short $(git hash-object file1)) &&
+	file2_h=$(git rev-parse --short $(git hash-object file2)) &&
+
+	cat >expect <<-EOF &&
+	diff --git a/file1 b/file2
+	index $file1_h..$file2_h 100644
+	--- a/file1
+	+++ b/file2
+	@@ -1,6 +1,6 @@
+	-a
+	-b
+	-c
+	+x
+	+ b
+	+z
+	 a
+	 b
+	 c
+	EOF
+
+	test_expect_code 1 git diff --no-index --histogram file1 file2 >output &&
+	test_cmp expect output &&
+
+	cat >expect_iwhite <<-EOF &&
+	diff --git a/file1 b/file2
+	index $file1_h..$file2_h 100644
+	--- a/file1
+	+++ b/file2
+	@@ -1,6 +1,6 @@
+	-a
+	+x
+	  b
+	-c
+	+z
+	 a
+	 b
+	 c
+	EOF
+
+	test_expect_code 1 git diff --no-index --histogram --ignore-all-space file1 file2 >output_iwhite &&
+	test_cmp expect_iwhite output_iwhite
+'
+
+test_expect_success 'shifting on either side should trigger re-diff properly' '
+	test_write_lines a b c a b c a b c >file1 &&
+	test_write_lines a b c a1 a2 a3 b c1 a b c >file2 &&
+
+	file1_h=$(git rev-parse --short $(git hash-object file1)) &&
+	file2_h=$(git rev-parse --short $(git hash-object file2)) &&
+
+	cat >expect1 <<-EOF &&
+	diff --git a/file1 b/file2
+	index $file1_h..$file2_h 100644
+	--- a/file1
+	+++ b/file2
+	@@ -1,9 +1,11 @@
+	 a
+	 b
+	 c
+	-a
+	+a1
+	+a2
+	+a3
+	 b
+	-c
+	+c1
+	 a
+	 b
+	 c
+	EOF
+
+	test_expect_code 1 git diff --no-index --histogram file1 file2 >output1 &&
+	test_cmp expect1 output1 &&
+
+	cat >expect2 <<-EOF &&
+	diff --git a/file2 b/file1
+	index $file2_h..$file1_h 100644
+	--- a/file2
+	+++ b/file1
+	@@ -1,11 +1,9 @@
+	 a
+	 b
+	 c
+	-a1
+	-a2
+	-a3
+	+a
+	 b
+	-c1
+	+c
+	 a
+	 b
+	 c
+	EOF
+
+	test_expect_code 1 git diff --no-index --histogram file2 file1 >output2 &&
+	test_cmp expect2 output2
+'
+
+test_done
diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 6f3998ee54..5d9c7b5434 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -793,6 +793,7 @@ static int group_slide_up(xdfile_t *xdf, struct xdlgroup *g)
  */
 int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 	struct xdlgroup g, go;
+	struct xdlgroup g_orig, go_orig;
 	long earliest_end, end_matching_other;
 	long groupsize;
 
@@ -806,6 +807,9 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 		if (g.end == g.start)
 			goto next;
 
+		g_orig = g;
+		go_orig = go;
+
 		/*
 		 * Now shift the change up and then down as far as possible in
 		 * each direction. If it bumps into any other changes, merge
@@ -915,6 +919,45 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 			}
 		}
 
+		/*
+		 * If this has a matching group from the other file, it could
+		 * either be the original match from the diff algorithm, or
+		 * arrived at by shifting and joining groups. When it's the
+		 * latter, it's possible for the two newly joined sides to have
+		 * matching lines. Re-diff the group to mark these matching
+		 * lines as unchanged and remove from the diff output.
+		 *
+		 * Only do this for histogram diff as its LCS algorithm makes
+		 * this scenario possible. In contrast, patience diff finds LCS
+		 * of unique lines that groups cannot be shifted across.
+		 * Myer's diff (standalone or used as fall-back in patience
+		 * diff) already finds minimal edits so it is not possible for
+		 * shifted groups to result in a smaller diff. (Without
+		 * XDF_NEED_MINIMAL, Myer's isn't technically guaranteed to be
+		 * minimal, but it should be so most of the time)
+		 */
+		if (end_matching_other != -1 &&
+				XDF_DIFF_ALG(flags) == XDF_HISTOGRAM_DIFF &&
+				(g.start != g_orig.start ||
+				 g.end != g_orig.end ||
+				 go.start != go_orig.start ||
+				 go.end != go_orig.end)) {
+			xpparam_t xpp;
+			xdfenv_t xe;
+
+			memset(&xpp, 0, sizeof(xpp));
+			xpp.flags = flags & ~XDF_DIFF_ALGORITHM_MASK;
+
+			memcpy(&xe.xdf1, xdf, sizeof(xdfile_t));
+			memcpy(&xe.xdf2, xdfo, sizeof(xdfile_t));
+
+			if (xdl_fall_back_diff(&xe, &xpp,
+					       g.start + 1, g.end - g.start,
+					       go.start + 1, go.end - go.start)) {
+				return -1;
+			}
+		}
+
 	next:
 		/* Move past the just-processed group: */
 		if (group_next(xdf, &g))

base-commit: f0ef5b6d9bcc258e4cbef93839d1b7465d5212b9
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] xdiff: re-diff shifted change groups when using histogram algorithm
  2025-12-06 20:51 [PATCH] xdiff: re-diff shifted change groups when using histogram algorithm Yee Cheng Chin via GitGitGadget
@ 2026-01-21 20:51 ` Junio C Hamano
  2026-01-24 10:54   ` Phillip Wood
  2026-03-02 14:54 ` [PATCH v2] " Yee Cheng Chin via GitGitGadget
  1 sibling, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2026-01-21 20:51 UTC (permalink / raw)
  To: Yee Cheng Chin via GitGitGadget; +Cc: git, Yee Cheng Chin

"Yee Cheng Chin via GitGitGadget" <gitgitgadget@gmail.com> writes:

> When using Myer's diff, the algorithm finds that only the "x" has been
> changed, and produces a final diff result (these are line diffs, but
> using word-diff syntax for ease of presentation):
>
>         A A[-A-]{+x+}A AAA

And patience gives the same result; as you noted, it uses a unique
line as the anchoring point.

> When using histogram diff, the algorithm first discovers the LCS "A
> AAA", which it uses as anchor, then produces an intermediate diff:
>
>         {+A Ax+}A AAA[- AAA-].
>
> This is a longer diff than Myer's, but it's still self-consistent.
> However, the compaction phase attempts to shift the first file's diff
> group upwards (note that this shift crosses the anchor that histogram
> had used), leading to the final results for histogram diff:
>
>         [-A AA-]{+A Ax+}A AAA
>
> This is a technically correct patch but looks clearly redundant to a
> human as the first 3 lines should not be in the diff.

So true.

> The fix would detect that a shift has caused matching to a new group,
> and re-diff the "A AA" and "A Ax" parts, which results in "A A"
> correctly re-marked as unchanged. This creates the now correct histogram
> diff:
>
>         A A[-A-]{+x+}A AAA

OK.

> This issue is rare in a normal repository. Below is a table of
> repositories (`git log --no-merges -p --histogram -1000`), showing how
> many times a re-diff was done and how many times it resulted in finding
> matching lines (therefore addressing this issue) with the fix. In
> general it is fewer than 1% of diff's that exhibit this offending
> behavior:

In other words, without the fix, we'd see 1% or so commits with
suboptimal (or "funny looking") diff that will trigger bug reports,
which sounds like an unacceptably high failure rate.

> @@ -915,6 +919,45 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>  			}
>  		}
>  
> +		/*
> +		 * If this has a matching group from the other file, it could
> +		 * either be the original match from the diff algorithm, or
> +		 * arrived at by shifting and joining groups. When it's the
> +		 * latter, it's possible for the two newly joined sides to have
> +		 * matching lines. Re-diff the group to mark these matching
> +		 * lines as unchanged and remove from the diff output.
> +		 *
> +		 * Only do this for histogram diff as its LCS algorithm makes
> +		 * this scenario possible. In contrast, patience diff finds LCS
> +		 * of unique lines that groups cannot be shifted across.
> +		 * Myer's diff (standalone or used as fall-back in patience
> +		 * diff) already finds minimal edits so it is not possible for
> +		 * shifted groups to result in a smaller diff. (Without
> +		 * XDF_NEED_MINIMAL, Myer's isn't technically guaranteed to be
> +		 * minimal, but it should be so most of the time)
> +		 */
> +		if (end_matching_other != -1 &&
> +				XDF_DIFF_ALG(flags) == XDF_HISTOGRAM_DIFF &&
> +				(g.start != g_orig.start ||
> +				 g.end != g_orig.end ||
> +				 go.start != go_orig.start ||
> +				 go.end != go_orig.end)) {

So the idea is to remember the original values in g and go (the
location of the group in the file and the other file) and if
shifting up and down changed any one of the four ends from the
original locations, we always take the fall-back route (if we are
doing histogram)?

By the way, this appears after the if/else if/ cascade that has:

	if (g.end == earliest_end) {
		... do nothing case (case #1)
	} else if (end_matching_other != -1) {
		... do the slide-up thing (case #2)
	} else if (flags & XDF_INDENT_HEIRISTIC) {
		... do the indent heuristic thing (case #3)
	}

Am I reading the code correctly that, even though this new block
appears as if it is a post-clean-up phase that is independent from
which one of the three choices are taken in the previous if/elseif
cascade, it only is relevant to the second case?  I am wondering if
it would make it easier to follow if the new code were made into a
small helper function that is called from the (case #2) arm of the
existing if/else if cascade.

Thanks.


> +			xpparam_t xpp;
> +			xdfenv_t xe;
> +
> +			memset(&xpp, 0, sizeof(xpp));
> +			xpp.flags = flags & ~XDF_DIFF_ALGORITHM_MASK;
> +
> +			memcpy(&xe.xdf1, xdf, sizeof(xdfile_t));
> +			memcpy(&xe.xdf2, xdfo, sizeof(xdfile_t));
> +
> +			if (xdl_fall_back_diff(&xe, &xpp,
> +					       g.start + 1, g.end - g.start,
> +					       go.start + 1, go.end - go.start)) {
> +				return -1;
> +			}
> +		}
> +
>  	next:
>  		/* Move past the just-processed group: */
>  		if (group_next(xdf, &g))
>
> base-commit: f0ef5b6d9bcc258e4cbef93839d1b7465d5212b9

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] xdiff: re-diff shifted change groups when using histogram algorithm
  2026-01-21 20:51 ` Junio C Hamano
@ 2026-01-24 10:54   ` Phillip Wood
  2026-01-25 17:34     ` Junio C Hamano
  0 siblings, 1 reply; 17+ messages in thread
From: Phillip Wood @ 2026-01-24 10:54 UTC (permalink / raw)
  To: Junio C Hamano, Yee Cheng Chin via GitGitGadget; +Cc: git, Yee Cheng Chin

On 21/01/2026 20:51, Junio C Hamano wrote:
> "Yee Cheng Chin via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> @@ -915,6 +919,45 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>>   			}
>>   		}
>>   
>> +		/*
>> +		 * If this has a matching group from the other file, it could
>> +		 * either be the original match from the diff algorithm, or
>> +		 * arrived at by shifting and joining groups. When it's the
>> +		 * latter, it's possible for the two newly joined sides to have
>> +		 * matching lines. Re-diff the group to mark these matching
>> +		 * lines as unchanged and remove from the diff output.
>> +		 *
>> +		 * Only do this for histogram diff as its LCS algorithm makes
>> +		 * this scenario possible. In contrast, patience diff finds LCS
>> +		 * of unique lines that groups cannot be shifted across.
>> +		 * Myer's diff (standalone or used as fall-back in patience
>> +		 * diff) already finds minimal edits so it is not possible for
>> +		 * shifted groups to result in a smaller diff. (Without
>> +		 * XDF_NEED_MINIMAL, Myer's isn't technically guaranteed to be
>> +		 * minimal, but it should be so most of the time)
>> +		 */
>> +		if (end_matching_other != -1 &&
>> +				XDF_DIFF_ALG(flags) == XDF_HISTOGRAM_DIFF &&
>> +				(g.start != g_orig.start ||
>> +				 g.end != g_orig.end ||
>> +				 go.start != go_orig.start ||
>> +				 go.end != go_orig.end)) {
> 
> So the idea is to remember the original values in g and go (the
> location of the group in the file and the other file) and if
> shifting up and down changed any one of the four ends from the
> original locations, we always take the fall-back route (if we are
> doing histogram)?

I'm a bit confused why we need to check both groups. I think they're 
supposed to move together (if we move "g" by n context lines we also 
move "go" by n context lines) so I can't see how we can have

	g.start == g_orig.start && g.end == g_orig.end

when

	go.start != go.orig.start || go.end != go_orig.end

> By the way, this appears after the if/else if/ cascade that has:
> 
> 	if (g.end == earliest_end) {
> 		... do nothing case (case #1)
> 	} else if (end_matching_other != -1) {
> 		... do the slide-up thing (case #2)
> 	} else if (flags & XDF_INDENT_HEIRISTIC) {
> 		... do the indent heuristic thing (case #3)
> 	}
> 
> Am I reading the code correctly that, even though this new block
> appears as if it is a post-clean-up phase that is independent from
> which one of the three choices are taken in the previous if/elseif
> cascade, it only is relevant to the second case?  I am wondering if
> it would make it easier to follow if the new code were made into a
> small helper function that is called from the (case #2) arm of the
> existing if/else if cascade.

That's a good point

>> +			xpparam_t xpp;
>> +			xdfenv_t xe;
>> +
>> +			memset(&xpp, 0, sizeof(xpp));
>> +			xpp.flags = flags & ~XDF_DIFF_ALGORITHM_MASK;
>> +
>> +			memcpy(&xe.xdf1, xdf, sizeof(xdfile_t));
>> +			memcpy(&xe.xdf2, xdfo, sizeof(xdfile_t));

These would be safer as "xe.xdf1 = *xdf" so we don't have to worry about 
getting the size correct (sizeof(*xdf) would also be safer but there is 
no need for memcpy() here).

I also wondered if we need to do a diff or if we can just mark the 
common prefix and suffix as unchanged but I suspect that wont will work 
for more complicated examples.

Thanks

Phillip

>> +
>> +			if (xdl_fall_back_diff(&xe, &xpp,
>> +					       g.start + 1, g.end - g.start,
>> +					       go.start + 1, go.end - go.start)) {
>> +				return -1;
>> +			}
>> +		}
>> +
>>   	next:
>>   		/* Move past the just-processed group: */
>>   		if (group_next(xdf, &g))
>>
>> base-commit: f0ef5b6d9bcc258e4cbef93839d1b7465d5212b9
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] xdiff: re-diff shifted change groups when using histogram algorithm
  2026-01-24 10:54   ` Phillip Wood
@ 2026-01-25 17:34     ` Junio C Hamano
  2026-01-26  9:37       ` Phillip Wood
  0 siblings, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2026-01-25 17:34 UTC (permalink / raw)
  To: Phillip Wood; +Cc: Yee Cheng Chin via GitGitGadget, git, Yee Cheng Chin

Phillip Wood <phillip.wood123@gmail.com> writes:

> On 21/01/2026 20:51, Junio C Hamano wrote:
>> "Yee Cheng Chin via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> 
>>> @@ -915,6 +919,45 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>>>   			}
>>>   		}
>>>   
>>> +		/*
>>> +		 * If this has a matching group from the other file, it could
>>> +		 * either be the original match from the diff algorithm, or
>>> +		 * arrived at by shifting and joining groups. When it's the
>>> +		 * latter, it's possible for the two newly joined sides to have
>>> +		 * matching lines. Re-diff the group to mark these matching
>>> +		 * lines as unchanged and remove from the diff output.

Also, after reading the first paragraph of the big comment again, it
makes me wonder if it is saying the same thing as "When histogram is
being used, we shouldn't bother shifting up and down to join groups,
as the result will always worse than the fallback", but is it that
bad?

>>> +		if (end_matching_other != -1 &&
>>> +				XDF_DIFF_ALG(flags) == XDF_HISTOGRAM_DIFF &&
>>> +				(g.start != g_orig.start ||
>>> +				 g.end != g_orig.end ||
>>> +				 go.start != go_orig.start ||
>>> +				 go.end != go_orig.end)) {
>> 
>> So the idea is to remember the original values in g and go (the
>> location of the group in the file and the other file) and if
>> shifting up and down changed any one of the four ends from the
>> original locations, we always take the fall-back route (if we are
>> doing histogram)?
>
> I'm a bit confused why we need to check both groups. I think they're 
> supposed to move together (if we move "g" by n context lines we also 
> move "go" by n context lines) so I can't see how we can have
>
> 	g.start == g_orig.start && g.end == g_orig.end
>
> when
>
> 	go.start != go.orig.start || go.end != go_orig.end

Interesting.

>> By the way, this appears after the if/else if/ cascade that has:
>> 
>> 	if (g.end == earliest_end) {
>> 		... do nothing case (case #1)
>> 	} else if (end_matching_other != -1) {
>> 		... do the slide-up thing (case #2)
>> 	} else if (flags & XDF_INDENT_HEIRISTIC) {
>> 		... do the indent heuristic thing (case #3)
>> 	}
>> 
>> Am I reading the code correctly that, even though this new block
>> appears as if it is a post-clean-up phase that is independent from
>> which one of the three choices are taken in the previous if/elseif
>> cascade, it only is relevant to the second case?  I am wondering if
>> it would make it easier to follow if the new code were made into a
>> small helper function that is called from the (case #2) arm of the
>> existing if/else if cascade.
>
> That's a good point
>
>>> +			xpparam_t xpp;
>>> +			xdfenv_t xe;
>>> +
>>> +			memset(&xpp, 0, sizeof(xpp));
>>> +			xpp.flags = flags & ~XDF_DIFF_ALGORITHM_MASK;
>>> +
>>> +			memcpy(&xe.xdf1, xdf, sizeof(xdfile_t));
>>> +			memcpy(&xe.xdf2, xdfo, sizeof(xdfile_t));
>
> These would be safer as "xe.xdf1 = *xdf" so we don't have to worry about 
> getting the size correct (sizeof(*xdf) would also be safer but there is 
> no need for memcpy() here).

Very good readability enhancement suggestion.

Thanks.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] xdiff: re-diff shifted change groups when using histogram algorithm
  2026-01-25 17:34     ` Junio C Hamano
@ 2026-01-26  9:37       ` Phillip Wood
  2026-01-26 17:32         ` Junio C Hamano
  0 siblings, 1 reply; 17+ messages in thread
From: Phillip Wood @ 2026-01-26  9:37 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Yee Cheng Chin via GitGitGadget, git, Yee Cheng Chin

On 25/01/2026 17:34, Junio C Hamano wrote:
> Phillip Wood <phillip.wood123@gmail.com> writes:
> 
>> On 21/01/2026 20:51, Junio C Hamano wrote:
>>> "Yee Cheng Chin via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>>
>>>> @@ -915,6 +919,45 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>>>>    			}
>>>>    		}
>>>>    
>>>> +		/*
>>>> +		 * If this has a matching group from the other file, it could
>>>> +		 * either be the original match from the diff algorithm, or
>>>> +		 * arrived at by shifting and joining groups. When it's the
>>>> +		 * latter, it's possible for the two newly joined sides to have
>>>> +		 * matching lines. Re-diff the group to mark these matching
>>>> +		 * lines as unchanged and remove from the diff output.
> 
> Also, after reading the first paragraph of the big comment again, it
> makes me wonder if it is saying the same thing as "When histogram is
> being used, we shouldn't bother shifting up and down to join groups,
> as the result will always worse than the fallback", but is it that
> bad?

Looking at the example in the commit message the result of shifting up 
and down and then calling the fallback is better than either the 
unshifted diff or shifting without the fallback, so I don't think just 
disabling shifting improves things. It would also stop us coalescing 
changed lines, for example

-A             A
  A     ->     -A
-B            -B

The indent heuristic seems to assume that we've shifted down as far as 
possible before trying it so that would probably get messed up as well. 
To me the problem is that the histogram diff does not always generate 
particularly good diffs (maybe I'm biased - whenever I've tried 
switching the default to "histogram" I've always switched back 
"patience" fairly quickly after being presented with a diff that I found 
hard to comprehend)

Thanks

Phillip

>>>> +		if (end_matching_other != -1 &&
>>>> +				XDF_DIFF_ALG(flags) == XDF_HISTOGRAM_DIFF &&
>>>> +				(g.start != g_orig.start ||
>>>> +				 g.end != g_orig.end ||
>>>> +				 go.start != go_orig.start ||
>>>> +				 go.end != go_orig.end)) {
>>>
>>> So the idea is to remember the original values in g and go (the
>>> location of the group in the file and the other file) and if
>>> shifting up and down changed any one of the four ends from the
>>> original locations, we always take the fall-back route (if we are
>>> doing histogram)?
>>
>> I'm a bit confused why we need to check both groups. I think they're
>> supposed to move together (if we move "g" by n context lines we also
>> move "go" by n context lines) so I can't see how we can have
>>
>> 	g.start == g_orig.start && g.end == g_orig.end
>>
>> when
>>
>> 	go.start != go.orig.start || go.end != go_orig.end
> 
> Interesting.
> 
>>> By the way, this appears after the if/else if/ cascade that has:
>>>
>>> 	if (g.end == earliest_end) {
>>> 		... do nothing case (case #1)
>>> 	} else if (end_matching_other != -1) {
>>> 		... do the slide-up thing (case #2)
>>> 	} else if (flags & XDF_INDENT_HEIRISTIC) {
>>> 		... do the indent heuristic thing (case #3)
>>> 	}
>>>
>>> Am I reading the code correctly that, even though this new block
>>> appears as if it is a post-clean-up phase that is independent from
>>> which one of the three choices are taken in the previous if/elseif
>>> cascade, it only is relevant to the second case?  I am wondering if
>>> it would make it easier to follow if the new code were made into a
>>> small helper function that is called from the (case #2) arm of the
>>> existing if/else if cascade.
>>
>> That's a good point
>>
>>>> +			xpparam_t xpp;
>>>> +			xdfenv_t xe;
>>>> +
>>>> +			memset(&xpp, 0, sizeof(xpp));
>>>> +			xpp.flags = flags & ~XDF_DIFF_ALGORITHM_MASK;
>>>> +
>>>> +			memcpy(&xe.xdf1, xdf, sizeof(xdfile_t));
>>>> +			memcpy(&xe.xdf2, xdfo, sizeof(xdfile_t));
>>
>> These would be safer as "xe.xdf1 = *xdf" so we don't have to worry about
>> getting the size correct (sizeof(*xdf) would also be safer but there is
>> no need for memcpy() here).
> 
> Very good readability enhancement suggestion.
> 
> Thanks.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] xdiff: re-diff shifted change groups when using histogram algorithm
  2026-01-26  9:37       ` Phillip Wood
@ 2026-01-26 17:32         ` Junio C Hamano
  2026-01-29 16:53           ` Yee Cheng Chin
  0 siblings, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2026-01-26 17:32 UTC (permalink / raw)
  To: Phillip Wood; +Cc: Yee Cheng Chin via GitGitGadget, git, Yee Cheng Chin

Phillip Wood <phillip.wood123@gmail.com> writes:

> On 25/01/2026 17:34, Junio C Hamano wrote:
>> Phillip Wood <phillip.wood123@gmail.com> writes:
>> 
>>> On 21/01/2026 20:51, Junio C Hamano wrote:
>>>> "Yee Cheng Chin via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>>>
>>>>> @@ -915,6 +919,45 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>>>>>    			}
>>>>>    		}
>>>>>    
>>>>> +		/*
>>>>> +		 * If this has a matching group from the other file, it could
>>>>> +		 * either be the original match from the diff algorithm, or
>>>>> +		 * arrived at by shifting and joining groups. When it's the
>>>>> +		 * latter, it's possible for the two newly joined sides to have
>>>>> +		 * matching lines. Re-diff the group to mark these matching
>>>>> +		 * lines as unchanged and remove from the diff output.
>> 
>> Also, after reading the first paragraph of the big comment again, it
>> makes me wonder if it is saying the same thing as "When histogram is
>> being used, we shouldn't bother shifting up and down to join groups,
>> as the result will always worse than the fallback", but is it that
>> bad?
>
> Looking at the example in the commit message the result of shifting up 
> and down and then calling the fallback is better than either the 
> unshifted diff or shifting without the fallback, so I don't think just 
> disabling shifting improves things. It would also stop us coalescing 
> changed lines, for example
>
> -A             A
>   A     ->     -A
> -B            -B
>
> The indent heuristic seems to assume that we've shifted down as far as 
> possible before trying it so that would probably get messed up as well. 

I see.  Thanks for a good explanation.

> To me the problem is that the histogram diff does not always generate 
> particularly good diffs (maybe I'm biased - whenever I've tried 
> switching the default to "histogram" I've always switched back 
> "patience" fairly quickly after being presented with a diff that I found 
> hard to comprehend)
>
> Thanks
>
> Phillip

Thanks.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] xdiff: re-diff shifted change groups when using histogram algorithm
  2026-01-26 17:32         ` Junio C Hamano
@ 2026-01-29 16:53           ` Yee Cheng Chin
  2026-01-29 20:58             ` Junio C Hamano
  0 siblings, 1 reply; 17+ messages in thread
From: Yee Cheng Chin @ 2026-01-29 16:53 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Phillip Wood, Yee Cheng Chin via GitGitGadget, git

Thanks for the review and sorry for being a little late in replying.
Aggregating all my inline replies in one email if that's ok.

On Wed, Jan 21, 2026 at 12:51 PM Junio C Hamano <gitster@pobox.com> wrote:
> So the idea is to remember the original values in g and go (the
> location of the group in the file and the other file) and if
> shifting up and down changed any one of the four ends from the
> original locations, we always take the fall-back route (if we are
> doing histogram)?
>
> By the way, this appears after the if/else if/ cascade that has:
>
>         if (g.end == earliest_end) {
>                 ... do nothing case (case #1)
>         } else if (end_matching_other != -1) {
>                 ... do the slide-up thing (case #2)
>         } else if (flags & XDF_INDENT_HEIRISTIC) {
>                 ... do the indent heuristic thing (case #3)
>         }
>
> Am I reading the code correctly that, even though this new block
> appears as if it is a post-clean-up phase that is independent from
> which one of the three choices are taken in the previous if/elseif
> cascade, it only is relevant to the second case?  I am wondering if
> it would make it easier to follow if the new code were made into a
> small helper function that is called from the (case #2) arm of the
> existing if/else if cascade.

That's correct. This condition happens only in the 2nd case. The
problematic scenario here only happens when the opposite side is
non-empty. If the opposite is empty (case #3, where we run the indent
heuristic algorithm), there's simply no need to re-diff anything
because diff'ing against an empty hunk is pointless.

You made a good point about placing it in the if block itself. The
existing code was a little confusing and took me re-reading the code
before I remember the condition. I'll fix it in v2.

On Sat, Jan 24, 2026 at 2:54 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
> I'm a bit confused why we need to check both groups. I think they're
> supposed to move together (if we move "g" by n context lines we also
> move "go" by n context lines) so I can't see how we can have
>
>         g.start == g_orig.start && g.end == g_orig.end
>
> when
>
>         go.start != go.orig.start || go.end != go_orig.end
>

You are right. It was an over-specification. Looking through the code
we should be able to just use "g" and there is no need to test for
"g_orig". Will fix in v2.

> >> +                    xpparam_t xpp;
> >> +                    xdfenv_t xe;
> >> +
> >> +                    memset(&xpp, 0, sizeof(xpp));
> >> +                    xpp.flags = flags & ~XDF_DIFF_ALGORITHM_MASK;
> >> +
> >> +                    memcpy(&xe.xdf1, xdf, sizeof(xdfile_t));
> >> +                    memcpy(&xe.xdf2, xdfo, sizeof(xdfile_t));
>
> These would be safer as "xe.xdf1 = *xdf" so we don't have to worry about
> getting the size correct (sizeof(*xdf) would also be safer but there is
> no need for memcpy() here).

Will fix in v2.

> I also wondered if we need to do a diff or if we can just mark the
> common prefix and suffix as unchanged but I suspect that wont will work
> for more complicated examples.

Common prefix/suffix would not work for more complicated examples.
Here's an example (imagine each character to be its own line):

File 1:
A AAyz AAA
File 2:
A xAA AAA

The current Git histogram diff generates the following:
A [-AAyz -]{+xAA +}AAA

After the fix, we have:
A {+x+}AA[-yz-] AA

Note that there is no common prefix here, and we need a real diff
algorithm if we want to solve this issue in a generic fashion. As I
mentioned in the cover letter, I thought about implementing a "bespoke
linear-time algorithm" but decided against it. What I meant was we
could implement a simple diff algorithm that finds the common lines in
both hunks that would run faster than Myer's, but isn't guaranteed to
be a optimal minimal diff. I decided that it is unnecessary to
overcomplicate things given that we can just call the fallback diff.

On Mon, Jan 26, 2026 at 1:37 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>
> On 25/01/2026 17:34, Junio C Hamano wrote:
> > Also, after reading the first paragraph of the big comment again, it
> > makes me wonder if it is saying the same thing as "When histogram is
> > being used, we shouldn't bother shifting up and down to join groups,
> > as the result will always worse than the fallback", but is it that
> > bad?
>
> Looking at the example in the commit message the result of shifting up
> and down and then calling the fallback is better than either the
> unshifted diff or shifting without the fallback, so I don't think just
> disabling shifting improves things. It would also stop us coalescing
> changed lines, for example
>
> -A             A
>   A     ->     -A
> -B            -B
>

I agree with you, but I think it is actually a nuanced decision. The
histogram diff algorithm explicitly chose the specific
alignment/anchor points to align both files due to the frequency of
the lines. When we do the sliding / compaction step, we are
essentially ignoring and overriding the algorithmic decision made by
histogram, for the sake of other metrics that we value (compaction
values fewer diff hunks, and indent heuristics values aligning by
semantics approximated by indentation). I think those metrics do help
which is why we added them, but there's a bit of design tension
between the underlying algorithm and the cleanup step.

> To me the problem is that the histogram diff does not always generate
> particularly good diffs (maybe I'm biased - whenever I've tried
> switching the default to "histogram" I've always switched back
> "patience" fairly quickly after being presented with a diff that I found
> hard to comprehend)

FWIW I personally feel that way as well. I think the documentation and
narrative that histogram diff is a "more advanced/extended version" of
patience diff is sometimes problematic, as both algorithms are fairly
different and have their own weaknesses. The Longest Common
Subsequence (LCS) used for alignment in patience diff is global for
the file and allows gaps, whereas the LCS in histogram diff requires
consecutive lines. This means even if the diff has unique lines across
both files the diff results could be quite different between histogram
and patience. This consecutive requirement for a subsequence is why
histogram diff runs faster than patience diff most of the time, but it
does mean the patience algorithm is better at discovering a global
"spine" across a file.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] xdiff: re-diff shifted change groups when using histogram algorithm
  2026-01-29 16:53           ` Yee Cheng Chin
@ 2026-01-29 20:58             ` Junio C Hamano
  2026-01-30  1:58               ` Yee Cheng Chin
  0 siblings, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2026-01-29 20:58 UTC (permalink / raw)
  To: Yee Cheng Chin; +Cc: Phillip Wood, Yee Cheng Chin via GitGitGadget, git

Yee Cheng Chin <ychin.git@gmail.com> writes:

> Thanks for the review and sorry for being a little late in replying.
> Aggregating all my inline replies in one email if that's ok.
>
> On Wed, Jan 21, 2026 at 12:51 PM Junio C Hamano <gitster@pobox.com> wrote:
>> So the idea is to remember the original values in g and go (the
>> location of the group in the file and the other file) and if
>> shifting up and down changed any one of the four ends from the
>> original locations, we always take the fall-back route (if we are
>> doing histogram)?
>>
>> By the way, this appears after the if/else if/ cascade that has:
>>
>>         if (g.end == earliest_end) {
>>                 ... do nothing case (case #1)
>>         } else if (end_matching_other != -1) {
>>                 ... do the slide-up thing (case #2)
>>         } else if (flags & XDF_INDENT_HEIRISTIC) {
>>                 ... do the indent heuristic thing (case #3)
>>         }
>>
>> Am I reading the code correctly that, even though this new block
>> appears as if it is a post-clean-up phase that is independent from
>> which one of the three choices are taken in the previous if/elseif
>> cascade, it only is relevant to the second case?  I am wondering if
>> it would make it easier to follow if the new code were made into a
>> small helper function that is called from the (case #2) arm of the
>> existing if/else if cascade.
>
> That's correct. This condition happens only in the 2nd case. The
> problematic scenario here only happens when the opposite side is
> non-empty. If the opposite is empty (case #3, where we run the indent
> heuristic algorithm), there's simply no need to re-diff anything
> because diff'ing against an empty hunk is pointless.

OK.  In the version posted, it appeard that it is possible, after
not doing the slide-up thing but using indent heuristic thing, to
fall into this compensation codepath because the new code was placed
after the above if-else-if cascade as if it is an independent
clean-up phase.  Encapsulating that new code in a helper function
and calling it at the end of "do the slide-up thing" block will make
the intent clearer.

Thanks.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] xdiff: re-diff shifted change groups when using histogram algorithm
  2026-01-29 20:58             ` Junio C Hamano
@ 2026-01-30  1:58               ` Yee Cheng Chin
  2026-01-30  5:43                 ` Junio C Hamano
                                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Yee Cheng Chin @ 2026-01-30  1:58 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Phillip Wood, Yee Cheng Chin via GitGitGadget, git

On Thu, Jan 29, 2026 at 12:58 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Yee Cheng Chin <ychin.git@gmail.com> writes:
> >
> > On Wed, Jan 21, 2026 at 12:51 PM Junio C Hamano <gitster@pobox.com> wrote:
> >> By the way, this appears after the if/else if/ cascade that has:
> >>
> >>         if (g.end == earliest_end) {
> >>                 ... do nothing case (case #1)
> >>         } else if (end_matching_other != -1) {
> >>                 ... do the slide-up thing (case #2)
> >>         } else if (flags & XDF_INDENT_HEIRISTIC) {
> >>                 ... do the indent heuristic thing (case #3)
> >>         }
> >>
> >> Am I reading the code correctly that, even though this new block
> >> appears as if it is a post-clean-up phase that is independent from
> >> which one of the three choices are taken in the previous if/elseif
> >> cascade, it only is relevant to the second case?  I am wondering if
> >> it would make it easier to follow if the new code were made into a
> >> small helper function that is called from the (case #2) arm of the
> >> existing if/else if cascade.
> >
> > That's correct. This condition happens only in the 2nd case. The
> > problematic scenario here only happens when the opposite side is
> > non-empty. If the opposite is empty (case #3, where we run the indent
> > heuristic algorithm), there's simply no need to re-diff anything
> > because diff'ing against an empty hunk is pointless.
>
> OK.  In the version posted, it appeard that it is possible, after
> not doing the slide-up thing but using indent heuristic thing, to
> fall into this compensation codepath because the new code was placed
> after the above if-else-if cascade as if it is an independent
> clean-up phase.  Encapsulating that new code in a helper function
> and calling it at the end of "do the slide-up thing" block will make
> the intent clearer.

Sorry, I actually misspoke. I forgot that re-diff is actually needed
in both case #1 and #2. Note that even in #1, it's possible for
`end_matching_other != -1` to be true. In case #3, it only cannot
happen because `end_matching_other` has to be -1 by then (meaning that
this diff hunk only has content on this side and is empty on the
other).

Case #1 happens when no *remaining* shifting was necessary, but note
that this happens after the do/while loop above, where previous loops
could have shifted and compacted the diff blocks already. Case #2 just
means there's some remaining clean up work to be done.

Just for a concrete test case that will illustrate this in case
someone is running the code and want a demonstration:

File 1:
AXB*

File 2:
CD*XE*

The first "*" is used as the histogram alignment anchor, which will be
shifted resulting in a compaction, and therefore needs to trigger a
re-diff. The correct output is as follows (which will only happen if
we also run the re-diff in case #1):

{-A-}[+CD*+]X{-B-}[+E+]*

Otherwise we will get the wrong output (note how the "X" is
erroneuously included on both sides):

{-AXB-}[+CD*XE+]*

Because of that, I'm leaning on keeping the current code structure,
because it *is* indeed a cleanup step to be run after the previous
one. I could still refactor it into a separate function and put it
into the the case #1/#2 if blocks if you think that's cleaner.

I will also add the above to the test case in v2.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] xdiff: re-diff shifted change groups when using histogram algorithm
  2026-01-30  1:58               ` Yee Cheng Chin
@ 2026-01-30  5:43                 ` Junio C Hamano
  2026-01-30 16:06                 ` Phillip Wood
  2026-02-20 23:07                 ` Junio C Hamano
  2 siblings, 0 replies; 17+ messages in thread
From: Junio C Hamano @ 2026-01-30  5:43 UTC (permalink / raw)
  To: Yee Cheng Chin; +Cc: Phillip Wood, Yee Cheng Chin via GitGitGadget, git

Yee Cheng Chin <ychin.git@gmail.com> writes:

> Because of that, I'm leaning on keeping the current code structure,
> because it *is* indeed a cleanup step to be run after the previous
> one. I could still refactor it into a separate function and put it
> into the the case #1/#2 if blocks if you think that's cleaner.
>
> I will also add the above to the test case in v2.

As long as the resulting code is explained (perhaps in the comment
and/or with the code structure) well enough so that when read by
somebody else in two months, it won't have to invite the same
question as I asked in this thread, it would be OK.  I do not know
offhand if a comment with the current code structure is good enough,
or calling the same helper function from two out of three arms of
if/else-if cascade would make it even clearer.

I agree that the case you gave is tricky enough that it would be a
good idea to add it as a test.

Thanks.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] xdiff: re-diff shifted change groups when using histogram algorithm
  2026-01-30  1:58               ` Yee Cheng Chin
  2026-01-30  5:43                 ` Junio C Hamano
@ 2026-01-30 16:06                 ` Phillip Wood
  2026-02-20 23:07                 ` Junio C Hamano
  2 siblings, 0 replies; 17+ messages in thread
From: Phillip Wood @ 2026-01-30 16:06 UTC (permalink / raw)
  To: Yee Cheng Chin, Junio C Hamano; +Cc: Yee Cheng Chin via GitGitGadget, git

On 30/01/2026 01:58, Yee Cheng Chin wrote:
> 
> Case #1 happens when no *remaining* shifting was necessary, but note
> that this happens after the do/while loop above, where previous loops
> could have shifted and compacted the diff blocks already. Case #2 just
> means there's some remaining clean up work to be done.

That's a good point - as well as commenting the new code, it would be 
helpful to update the comment in case #1 to make it clear that we don't 
need to shift back up to align with a matching block, not there there 
was no shift possible. I agree with Junio that it would be useful to add 
the example below as a test

Thanks

Phillip

> Just for a concrete test case that will illustrate this in case
> someone is running the code and want a demonstration:
> 
> File 1:
> AXB*
> 
> File 2:
> CD*XE*
> 
> The first "*" is used as the histogram alignment anchor, which will be
> shifted resulting in a compaction, and therefore needs to trigger a
> re-diff. The correct output is as follows (which will only happen if
> we also run the re-diff in case #1):
> 
> {-A-}[+CD*+]X{-B-}[+E+]*
> 
> Otherwise we will get the wrong output (note how the "X" is
> erroneuously included on both sides):
> 
> {-AXB-}[+CD*XE+]*
> 
> Because of that, I'm leaning on keeping the current code structure,
> because it *is* indeed a cleanup step to be run after the previous
> one. I could still refactor it into a separate function and put it
> into the the case #1/#2 if blocks if you think that's cleaner.
> 
> I will also add the above to the test case in v2.
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] xdiff: re-diff shifted change groups when using histogram algorithm
  2026-01-30  1:58               ` Yee Cheng Chin
  2026-01-30  5:43                 ` Junio C Hamano
  2026-01-30 16:06                 ` Phillip Wood
@ 2026-02-20 23:07                 ` Junio C Hamano
  2026-02-21  9:56                   ` Yee Cheng Chin
  2 siblings, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2026-02-20 23:07 UTC (permalink / raw)
  To: Yee Cheng Chin; +Cc: Phillip Wood, Yee Cheng Chin via GitGitGadget, git

Yee Cheng Chin <ychin.git@gmail.com> writes:

> ...
> {-AXB-}[+CD*XE+]*
>
> Because of that, I'm leaning on keeping the current code structure,
> because it *is* indeed a cleanup step to be run after the previous
> one. I could still refactor it into a separate function and put it
> into the the case #1/#2 if blocks if you think that's cleaner.
>
> I will also add the above to the test case in v2.

OK, it has been a few weeks since we had this message.  Will we see
an update sometime soon?  No rush, but just pinging.

Thanks.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] xdiff: re-diff shifted change groups when using histogram algorithm
  2026-02-20 23:07                 ` Junio C Hamano
@ 2026-02-21  9:56                   ` Yee Cheng Chin
  0 siblings, 0 replies; 17+ messages in thread
From: Yee Cheng Chin @ 2026-02-21  9:56 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Phillip Wood, Yee Cheng Chin via GitGitGadget, git

Hi, yes, I'm still on it. Sorry for the delay. I will push an update
out this weekend.

On Fri, Feb 20, 2026 at 3:07 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Yee Cheng Chin <ychin.git@gmail.com> writes:
>
> > ...
> > {-AXB-}[+CD*XE+]*
> >
> > Because of that, I'm leaning on keeping the current code structure,
> > because it *is* indeed a cleanup step to be run after the previous
> > one. I could still refactor it into a separate function and put it
> > into the the case #1/#2 if blocks if you think that's cleaner.
> >
> > I will also add the above to the test case in v2.
>
> OK, it has been a few weeks since we had this message.  Will we see
> an update sometime soon?  No rush, but just pinging.
>
> Thanks.
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2] xdiff: re-diff shifted change groups when using histogram algorithm
  2025-12-06 20:51 [PATCH] xdiff: re-diff shifted change groups when using histogram algorithm Yee Cheng Chin via GitGitGadget
  2026-01-21 20:51 ` Junio C Hamano
@ 2026-03-02 14:54 ` Yee Cheng Chin via GitGitGadget
  2026-03-13  7:07   ` Junio C Hamano
  1 sibling, 1 reply; 17+ messages in thread
From: Yee Cheng Chin via GitGitGadget @ 2026-03-02 14:54 UTC (permalink / raw)
  To: git; +Cc: Phillip Wood, Yee Cheng Chin, Yee Cheng Chin

From: Yee Cheng Chin <ychin.git@gmail.com>

After a diff algorithm has been run, the compaction phase
(xdl_change_compact()) shifts and merges change groups to produce a
cleaner output. However, this shifting could create a new matched group
where both sides now have matching lines. This results in a
wrong-looking diff output which contains redundant lines that are the
same on both files.

Fix this by detecting this situation, and re-diff the texts on each side
to find similar lines, using the fall-back Myer's diff. Only do this for
histogram diff as it's the only algorithm where this is relevant. Below
contains an example, and more details.

For an example, consider two files below:

    file1:
        A

        A
        A
        A

        A
        A
        A

    file2:
        A

        A
        x
        A

        A
        A
        A

When using Myer's diff, the algorithm finds that only the "x" has been
changed, and produces a final diff result (these are line diffs, but
using word-diff syntax for ease of presentation):

        A A[-A-]{+x+}A AAA

When using histogram diff, the algorithm first discovers the LCS "A
AAA", which it uses as anchor, then produces an intermediate diff:

        {+A Ax+}A AAA[- AAA-].

This is a longer diff than Myer's, but it's still self-consistent.
However, the compaction phase attempts to shift the first file's diff
group upwards (note that this shift crosses the anchor that histogram
had used), leading to the final results for histogram diff:

        [-A AA-]{+A Ax+}A AAA

This is a technically correct patch but looks clearly redundant to a
human as the first 3 lines should not be in the diff.

The fix would detect that a shift has caused matching to a new group,
and re-diff the "A AA" and "A Ax" parts, which results in "A A"
correctly re-marked as unchanged. This creates the now correct histogram
diff:

        A A[-A-]{+x+}A AAA

This issue is not applicable to Myer's diff algorithm as it already
generates a minimal diff, which means a shift cannot result in a smaller
diff output (the default Myer's diff in xdiff is not guaranteed to be
minimal for performance reasons, but it typically does a good enough
job).

It's also not applicable to patience diff, because it uses only unique
lines as anchor for its splits, and falls back to Myer's diff within
each split. Shifting requires both ends having the same lines, and
therefore cannot cross the unique line boundaries established by the
patience algorithm. In contrast histogram diff uses non-unique lines as
anchors, and therefore shifting can cross over them.

This issue is rare in a normal repository. Below is a table of
repositories (`git log --no-merges -p --histogram -1000`), showing how
many times a re-diff was done and how many times it resulted in finding
matching lines (therefore addressing this issue) with the fix. In
general it is fewer than 1% of diff's that exhibit this offending
behavior:

| Repo (1k commits)  | Re-diff | Found matching lines |
|--------------------|---------|----------------------|
| llvm-project       |  45     | 11                   |
| vim                | 110     |  9                   |
| git                |  18     |  2                   |
| WebKit             | 168     |  1                   |
| ripgrep            |  22     |  1                   |
| cpython            |  32     |  0                   |
| vscode             |  13     |  0                   |

Signed-off-by: Yee Cheng Chin <ychin.git@gmail.com>
---
    xdiff: re-diff shifted change groups when using histogram algorithm
    
    Changes since v1:
    
     * Fix the entry condition to be easier to understand by checking for
       go.end!=go.start, which makes it clear that this is just a triviality
       test (if one side is empty there is no point in diff'ing anything)
     * Remove go_orig, which was redundant as it was tracking the same thing
       as g_orig.
     * Use assignment instead of memcpy()
     * Clean up comments
     * Per discussed, add test to show that we need to re-diff even if we
       entere the first condition "no shifting was possible".
    
    This is a somewhat rare issue when using histogram to diff files, as the
    algorithm will generate a diff output that looks redundant and wrong to
    a human. I provided a synthetic example in the commit message, but for
    one from the real world, do the following command in the Git repo:
    
    git show -U0 --diff-algorithm=histogram 2c8999027c -- po/ga.po
    
    
    Scroll to the line "@@ -7239,3 +5831,5 @@", and we can see the following
    diff hunk:
    
    -#: builtin/diff.c
    -msgid "Not a git repository"
    -msgstr "Ní stór git"
    +msgid "cannot come back to cwd"
    +msgstr "ní féidir teacht ar ais chuig cwd"
    +
    +msgid "Not a git repository"
    +msgstr "Ní stór git é"
    
    
    We can see that the "Not a git repository" line is identical on both
    sides, which means it should not have been in the diff results to begin
    with. Under other diff algorithms (or histogram diff with this fix),
    said line is not considered to be part of the diff.
    
    Also, when I was implementing this, an alternative I was considering was
    to add a bespoke linear-time algorithm to remove matching lines on both
    sides. Just calling the fall-back diff seems the easiest and cleanest
    and so I went with that.

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-2120%2Fychin%2Fxdiff-fix-compact-remove-redundant-lines-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-2120/ychin/xdiff-fix-compact-remove-redundant-lines-v2
Pull-Request: https://github.com/git/git/pull/2120

Range-diff vs v1:

 1:  34a370e59e ! 1:  bd63c6866b xdiff: re-diff shifted change groups when using histogram algorithm
     @@ Commit message
      
       ## t/meson.build ##
      @@ t/meson.build: integration_tests = [
     -   't4070-diff-pairs.sh',
         't4071-diff-minimal.sh',
         't4072-diff-max-depth.sh',
     -+  't4073-diff-shifted-matched-group.sh',
     +   't4073-diff-stat-name-width.sh',
     ++  't4074-diff-shifted-matched-group.sh',
         't4100-apply-stat.sh',
         't4101-apply-nonl.sh',
         't4102-apply-rename.sh',
      
     - ## t/t4073-diff-shifted-matched-group.sh (new) ##
     + ## t/t4074-diff-shifted-matched-group.sh (new) ##
      @@
      +#!/bin/sh
      +
     @@ t/t4073-diff-shifted-matched-group.sh (new)
      +
      +. ./test-lib.sh
      +
     -+test_expect_success 'shifted diff group should re-diff to minimize patch' '
     ++test_expect_success 'shifted/merged diff group should re-diff to minimize patch' '
      +	test_write_lines A x A A A x A A A >file1 &&
      +	test_write_lines A x A Z A x A A A >file2 &&
      +
     @@ t/t4073-diff-shifted-matched-group.sh (new)
      +	test_cmp expect output
      +'
      +
     ++test_expect_success 'merged diff group with no shift' '
     ++	test_write_lines A Z B x >file1 &&
     ++	test_write_lines C D x Z E x >file2 &&
     ++
     ++	file1_h=$(git rev-parse --short $(git hash-object file1)) &&
     ++	file2_h=$(git rev-parse --short $(git hash-object file2)) &&
     ++
     ++	cat >expect <<-EOF &&
     ++	diff --git a/file1 b/file2
     ++	index $file1_h..$file2_h 100644
     ++	--- a/file1
     ++	+++ b/file2
     ++	@@ -1,4 +1,6 @@
     ++	-A
     ++	+C
     ++	+D
     ++	+x
     ++	 Z
     ++	-B
     ++	+E
     ++	 x
     ++	EOF
     ++
     ++	test_expect_code 1 git diff --no-index --histogram file1 file2 >output &&
     ++	test_cmp expect output
     ++'
     ++
      +test_expect_success 're-diff should preserve diff flags' '
      +	test_write_lines a b c a b c >file1 &&
      +	test_write_lines x " b" z a b c >file2 &&
     @@ xdiff/xdiffi.c: static int group_slide_up(xdfile_t *xdf, struct xdlgroup *g)
        */
       int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
       	struct xdlgroup g, go;
     -+	struct xdlgroup g_orig, go_orig;
     ++	struct xdlgroup g_orig;
       	long earliest_end, end_matching_other;
       	long groupsize;
       
     @@ xdiff/xdiffi.c: int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags
       			goto next;
       
      +		g_orig = g;
     -+		go_orig = go;
      +
       		/*
       		 * Now shift the change up and then down as far as possible in
       		 * each direction. If it bumps into any other changes, merge
     +-		 * them.
     ++		 * them and restart the process.
     + 		 */
     + 		do {
     + 			groupsize = g.end - g.start;
     +@@ xdiff/xdiffi.c: int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
     + 			/*
     + 			 * Move the possibly merged group of changes back to
     + 			 * line up with the last group of changes from the
     +-			 * other file that it can align with.
     ++			 * other file that it can align with. This avoids breaking
     ++			 * a single change into a separate addition/deletion.
     + 			 */
     + 			while (go.end == go.start) {
     + 				if (group_slide_up(xdf, &g))
      @@ xdiff/xdiffi.c: int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
       			}
       		}
       
      +		/*
     -+		 * If this has a matching group from the other file, it could
     -+		 * either be the original match from the diff algorithm, or
     -+		 * arrived at by shifting and joining groups. When it's the
     -+		 * latter, it's possible for the two newly joined sides to have
     -+		 * matching lines. Re-diff the group to mark these matching
     -+		 * lines as unchanged and remove from the diff output.
     ++		 * If we merged change groups during shifting, the new
     ++		 * combined group could now have matching lines in both files,
     ++		 * even if the original separate groups did not. Re-diff the
     ++		 * new group to find these matching lines to mark them as
     ++		 * unchanged.
     ++		 *
     ++		 * Only do this if the corresponding group in the other file is
     ++		 * non-empty, as it's trivial otherwise.
      +		 *
     -+		 * Only do this for histogram diff as its LCS algorithm makes
     -+		 * this scenario possible. In contrast, patience diff finds LCS
     ++		 * Only do this for histogram diff as its LCS algorithm allows
     ++		 * for this scenario. In contrast, patience diff finds LCS
      +		 * of unique lines that groups cannot be shifted across.
      +		 * Myer's diff (standalone or used as fall-back in patience
      +		 * diff) already finds minimal edits so it is not possible for
     @@ xdiff/xdiffi.c: int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags
      +		 * XDF_NEED_MINIMAL, Myer's isn't technically guaranteed to be
      +		 * minimal, but it should be so most of the time)
      +		 */
     -+		if (end_matching_other != -1 &&
     ++		if (go.end != go.start &&
      +				XDF_DIFF_ALG(flags) == XDF_HISTOGRAM_DIFF &&
      +				(g.start != g_orig.start ||
     -+				 g.end != g_orig.end ||
     -+				 go.start != go_orig.start ||
     -+				 go.end != go_orig.end)) {
     ++				 g.end != g_orig.end)) {
      +			xpparam_t xpp;
      +			xdfenv_t xe;
      +
      +			memset(&xpp, 0, sizeof(xpp));
      +			xpp.flags = flags & ~XDF_DIFF_ALGORITHM_MASK;
      +
     -+			memcpy(&xe.xdf1, xdf, sizeof(xdfile_t));
     -+			memcpy(&xe.xdf2, xdfo, sizeof(xdfile_t));
     ++			xe.xdf1 = *xdf;
     ++			xe.xdf2 = *xdfo;
      +
      +			if (xdl_fall_back_diff(&xe, &xpp,
      +					       g.start + 1, g.end - g.start,


 t/meson.build                         |   1 +
 t/t4074-diff-shifted-matched-group.sh | 164 ++++++++++++++++++++++++++
 xdiff/xdiffi.c                        |  47 +++++++-
 3 files changed, 210 insertions(+), 2 deletions(-)
 create mode 100755 t/t4074-diff-shifted-matched-group.sh

diff --git a/t/meson.build b/t/meson.build
index 6d91470ebc..dfd0a5a7d9 100644
--- a/t/meson.build
+++ b/t/meson.build
@@ -504,6 +504,7 @@ integration_tests = [
   't4071-diff-minimal.sh',
   't4072-diff-max-depth.sh',
   't4073-diff-stat-name-width.sh',
+  't4074-diff-shifted-matched-group.sh',
   't4100-apply-stat.sh',
   't4101-apply-nonl.sh',
   't4102-apply-rename.sh',
diff --git a/t/t4074-diff-shifted-matched-group.sh b/t/t4074-diff-shifted-matched-group.sh
new file mode 100755
index 0000000000..d77fa3b79d
--- /dev/null
+++ b/t/t4074-diff-shifted-matched-group.sh
@@ -0,0 +1,164 @@
+#!/bin/sh
+
+test_description='shifted diff groups re-diffing during histogram diff'
+
+. ./test-lib.sh
+
+test_expect_success 'shifted/merged diff group should re-diff to minimize patch' '
+	test_write_lines A x A A A x A A A >file1 &&
+	test_write_lines A x A Z A x A A A >file2 &&
+
+	file1_h=$(git rev-parse --short $(git hash-object file1)) &&
+	file2_h=$(git rev-parse --short $(git hash-object file2)) &&
+
+	cat >expect <<-EOF &&
+	diff --git a/file1 b/file2
+	index $file1_h..$file2_h 100644
+	--- a/file1
+	+++ b/file2
+	@@ -1,7 +1,7 @@
+	 A
+	 x
+	 A
+	-A
+	+Z
+	 A
+	 x
+	 A
+	EOF
+
+	test_expect_code 1 git diff --no-index --histogram file1 file2 >output &&
+	test_cmp expect output
+'
+
+test_expect_success 'merged diff group with no shift' '
+	test_write_lines A Z B x >file1 &&
+	test_write_lines C D x Z E x >file2 &&
+
+	file1_h=$(git rev-parse --short $(git hash-object file1)) &&
+	file2_h=$(git rev-parse --short $(git hash-object file2)) &&
+
+	cat >expect <<-EOF &&
+	diff --git a/file1 b/file2
+	index $file1_h..$file2_h 100644
+	--- a/file1
+	+++ b/file2
+	@@ -1,4 +1,6 @@
+	-A
+	+C
+	+D
+	+x
+	 Z
+	-B
+	+E
+	 x
+	EOF
+
+	test_expect_code 1 git diff --no-index --histogram file1 file2 >output &&
+	test_cmp expect output
+'
+
+test_expect_success 're-diff should preserve diff flags' '
+	test_write_lines a b c a b c >file1 &&
+	test_write_lines x " b" z a b c >file2 &&
+
+	file1_h=$(git rev-parse --short $(git hash-object file1)) &&
+	file2_h=$(git rev-parse --short $(git hash-object file2)) &&
+
+	cat >expect <<-EOF &&
+	diff --git a/file1 b/file2
+	index $file1_h..$file2_h 100644
+	--- a/file1
+	+++ b/file2
+	@@ -1,6 +1,6 @@
+	-a
+	-b
+	-c
+	+x
+	+ b
+	+z
+	 a
+	 b
+	 c
+	EOF
+
+	test_expect_code 1 git diff --no-index --histogram file1 file2 >output &&
+	test_cmp expect output &&
+
+	cat >expect_iwhite <<-EOF &&
+	diff --git a/file1 b/file2
+	index $file1_h..$file2_h 100644
+	--- a/file1
+	+++ b/file2
+	@@ -1,6 +1,6 @@
+	-a
+	+x
+	  b
+	-c
+	+z
+	 a
+	 b
+	 c
+	EOF
+
+	test_expect_code 1 git diff --no-index --histogram --ignore-all-space file1 file2 >output_iwhite &&
+	test_cmp expect_iwhite output_iwhite
+'
+
+test_expect_success 'shifting on either side should trigger re-diff properly' '
+	test_write_lines a b c a b c a b c >file1 &&
+	test_write_lines a b c a1 a2 a3 b c1 a b c >file2 &&
+
+	file1_h=$(git rev-parse --short $(git hash-object file1)) &&
+	file2_h=$(git rev-parse --short $(git hash-object file2)) &&
+
+	cat >expect1 <<-EOF &&
+	diff --git a/file1 b/file2
+	index $file1_h..$file2_h 100644
+	--- a/file1
+	+++ b/file2
+	@@ -1,9 +1,11 @@
+	 a
+	 b
+	 c
+	-a
+	+a1
+	+a2
+	+a3
+	 b
+	-c
+	+c1
+	 a
+	 b
+	 c
+	EOF
+
+	test_expect_code 1 git diff --no-index --histogram file1 file2 >output1 &&
+	test_cmp expect1 output1 &&
+
+	cat >expect2 <<-EOF &&
+	diff --git a/file2 b/file1
+	index $file2_h..$file1_h 100644
+	--- a/file2
+	+++ b/file1
+	@@ -1,11 +1,9 @@
+	 a
+	 b
+	 c
+	-a1
+	-a2
+	-a3
+	+a
+	 b
+	-c1
+	+c
+	 a
+	 b
+	 c
+	EOF
+
+	test_expect_code 1 git diff --no-index --histogram file2 file1 >output2 &&
+	test_cmp expect2 output2
+'
+
+test_done
diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 4376f943db..5455b4690d 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -792,6 +792,7 @@ static int group_slide_up(xdfile_t *xdf, struct xdlgroup *g)
  */
 int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 	struct xdlgroup g, go;
+	struct xdlgroup g_orig;
 	long earliest_end, end_matching_other;
 	long groupsize;
 
@@ -805,10 +806,12 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 		if (g.end == g.start)
 			goto next;
 
+		g_orig = g;
+
 		/*
 		 * Now shift the change up and then down as far as possible in
 		 * each direction. If it bumps into any other changes, merge
-		 * them.
+		 * them and restart the process.
 		 */
 		do {
 			groupsize = g.end - g.start;
@@ -861,7 +864,8 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 			/*
 			 * Move the possibly merged group of changes back to
 			 * line up with the last group of changes from the
-			 * other file that it can align with.
+			 * other file that it can align with. This avoids breaking
+			 * a single change into a separate addition/deletion.
 			 */
 			while (go.end == go.start) {
 				if (group_slide_up(xdf, &g))
@@ -914,6 +918,45 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 			}
 		}
 
+		/*
+		 * If we merged change groups during shifting, the new
+		 * combined group could now have matching lines in both files,
+		 * even if the original separate groups did not. Re-diff the
+		 * new group to find these matching lines to mark them as
+		 * unchanged.
+		 *
+		 * Only do this if the corresponding group in the other file is
+		 * non-empty, as it's trivial otherwise.
+		 *
+		 * Only do this for histogram diff as its LCS algorithm allows
+		 * for this scenario. In contrast, patience diff finds LCS
+		 * of unique lines that groups cannot be shifted across.
+		 * Myer's diff (standalone or used as fall-back in patience
+		 * diff) already finds minimal edits so it is not possible for
+		 * shifted groups to result in a smaller diff. (Without
+		 * XDF_NEED_MINIMAL, Myer's isn't technically guaranteed to be
+		 * minimal, but it should be so most of the time)
+		 */
+		if (go.end != go.start &&
+				XDF_DIFF_ALG(flags) == XDF_HISTOGRAM_DIFF &&
+				(g.start != g_orig.start ||
+				 g.end != g_orig.end)) {
+			xpparam_t xpp;
+			xdfenv_t xe;
+
+			memset(&xpp, 0, sizeof(xpp));
+			xpp.flags = flags & ~XDF_DIFF_ALGORITHM_MASK;
+
+			xe.xdf1 = *xdf;
+			xe.xdf2 = *xdfo;
+
+			if (xdl_fall_back_diff(&xe, &xpp,
+					       g.start + 1, g.end - g.start,
+					       go.start + 1, go.end - go.start)) {
+				return -1;
+			}
+		}
+
 	next:
 		/* Move past the just-processed group: */
 		if (group_next(xdf, &g))

base-commit: 2cc71917514657b93014134350864f4849edfc83
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] xdiff: re-diff shifted change groups when using histogram algorithm
  2026-03-02 14:54 ` [PATCH v2] " Yee Cheng Chin via GitGitGadget
@ 2026-03-13  7:07   ` Junio C Hamano
  2026-03-13 10:23     ` Phillip Wood
  0 siblings, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2026-03-13  7:07 UTC (permalink / raw)
  To: Yee Cheng Chin via GitGitGadget; +Cc: git, Phillip Wood, Yee Cheng Chin

"Yee Cheng Chin via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Yee Cheng Chin <ychin.git@gmail.com>
>
> After a diff algorithm has been run, the compaction phase
> (xdl_change_compact()) shifts and merges change groups to produce a
> cleaner output. However, this shifting could create a new matched group
> where both sides now have matching lines. This results in a
> wrong-looking diff output which contains redundant lines that are the
> same on both files.
>
> Fix this by detecting this situation, and re-diff the texts on each side
> to find similar lines, using the fall-back Myer's diff. Only do this for
> histogram diff as it's the only algorithm where this is relevant. Below
> contains an example, and more details.
> ...
> This issue is rare in a normal repository. Below is a table of
> repositories (`git log --no-merges -p --histogram -1000`), showing how
> many times a re-diff was done and how many times it resulted in finding
> matching lines (therefore addressing this issue) with the fix. In
> general it is fewer than 1% of diff's that exhibit this offending
> behavior:
>
> | Repo (1k commits)  | Re-diff | Found matching lines |
> |--------------------|---------|----------------------|
> | llvm-project       |  45     | 11                   |
> | vim                | 110     |  9                   |
> | git                |  18     |  2                   |
> | WebKit             | 168     |  1                   |
> | ripgrep            |  22     |  1                   |
> | cpython            |  32     |  0                   |
> | vscode             |  13     |  0                   |
>
> Signed-off-by: Yee Cheng Chin <ychin.git@gmail.com>
> ---

Thanks for the updated patch, and sorry for nobody responding to the
patch for over a week.

The detailed explanation of the issue and the inclusion of the
repository analysis results are very helpful; they clearly show that
while this is a rare edge case, it significantly improves the
quality of histogram diffs when it does occur.

 - The removal of go_orig is correct since g and go are kept in sync 
   throughout the slide loops.

 - Clearing the algorithm mask while preserving other flags ensures that 
   user-provided options like --ignore-all-space are correctly applied 
   during the re-diff.

 - While ignore_regex and anchors are not passed to the sub-diff, they 
   aren't currently available to xdl_change_compact anyway. Given that 
   compaction happens before regex filtering in the main pipeline, this
   is OK, I guess.

Let me mark the topic for 'next'.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] xdiff: re-diff shifted change groups when using histogram algorithm
  2026-03-13  7:07   ` Junio C Hamano
@ 2026-03-13 10:23     ` Phillip Wood
  2026-03-19 23:30       ` Yee Cheng Chin
  0 siblings, 1 reply; 17+ messages in thread
From: Phillip Wood @ 2026-03-13 10:23 UTC (permalink / raw)
  To: Junio C Hamano, Yee Cheng Chin via GitGitGadget; +Cc: git, Yee Cheng Chin

On 13/03/2026 07:07, Junio C Hamano wrote:
> "Yee Cheng Chin via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Yee Cheng Chin <ychin.git@gmail.com>
>>
>> After a diff algorithm has been run, the compaction phase
>> (xdl_change_compact()) shifts and merges change groups to produce a
>> cleaner output. However, this shifting could create a new matched group
>> where both sides now have matching lines. This results in a
>> wrong-looking diff output which contains redundant lines that are the
>> same on both files.
>>
>> Fix this by detecting this situation, and re-diff the texts on each side
>> to find similar lines, using the fall-back Myer's diff. Only do this for
>> histogram diff as it's the only algorithm where this is relevant. Below
>> contains an example, and more details.
>> ...
>> This issue is rare in a normal repository. Below is a table of
>> repositories (`git log --no-merges -p --histogram -1000`), showing how
>> many times a re-diff was done and how many times it resulted in finding
>> matching lines (therefore addressing this issue) with the fix. In
>> general it is fewer than 1% of diff's that exhibit this offending
>> behavior:
>>
>> | Repo (1k commits)  | Re-diff | Found matching lines |
>> |--------------------|---------|----------------------|
>> | llvm-project       |  45     | 11                   |
>> | vim                | 110     |  9                   |
>> | git                |  18     |  2                   |
>> | WebKit             | 168     |  1                   |
>> | ripgrep            |  22     |  1                   |
>> | cpython            |  32     |  0                   |
>> | vscode             |  13     |  0                   |
>>
>> Signed-off-by: Yee Cheng Chin <ychin.git@gmail.com>
>> ---
> 
> Thanks for the updated patch, and sorry for nobody responding to the
> patch for over a week.

Yes, sorry for the slow response. I agree with Junio that this is 
explained well and looks good

Thanks

Phillip

> The detailed explanation of the issue and the inclusion of the
> repository analysis results are very helpful; they clearly show that
> while this is a rare edge case, it significantly improves the
> quality of histogram diffs when it does occur.
> 
>   - The removal of go_orig is correct since g and go are kept in sync
>     throughout the slide loops.
> 
>   - Clearing the algorithm mask while preserving other flags ensures that
>     user-provided options like --ignore-all-space are correctly applied
>     during the re-diff.
> 
>   - While ignore_regex and anchors are not passed to the sub-diff, they
>     aren't currently available to xdl_change_compact anyway. Given that
>     compaction happens before regex filtering in the main pipeline, this
>     is OK, I guess.
> 
> Let me mark the topic for 'next'.
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] xdiff: re-diff shifted change groups when using histogram algorithm
  2026-03-13 10:23     ` Phillip Wood
@ 2026-03-19 23:30       ` Yee Cheng Chin
  0 siblings, 0 replies; 17+ messages in thread
From: Yee Cheng Chin @ 2026-03-19 23:30 UTC (permalink / raw)
  To: phillip.wood; +Cc: Junio C Hamano, Yee Cheng Chin via GitGitGadget, git

Great. Sounds good, thanks!


On Fri, Mar 13, 2026 at 3:23 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>
> On 13/03/2026 07:07, Junio C Hamano wrote:
> > "Yee Cheng Chin via GitGitGadget" <gitgitgadget@gmail.com> writes:
> >
> >> From: Yee Cheng Chin <ychin.git@gmail.com>
> >>
> >> After a diff algorithm has been run, the compaction phase
> >> (xdl_change_compact()) shifts and merges change groups to produce a
> >> cleaner output. However, this shifting could create a new matched group
> >> where both sides now have matching lines. This results in a
> >> wrong-looking diff output which contains redundant lines that are the
> >> same on both files.
> >>
> >> Fix this by detecting this situation, and re-diff the texts on each side
> >> to find similar lines, using the fall-back Myer's diff. Only do this for
> >> histogram diff as it's the only algorithm where this is relevant. Below
> >> contains an example, and more details.
> >> ...
> >> This issue is rare in a normal repository. Below is a table of
> >> repositories (`git log --no-merges -p --histogram -1000`), showing how
> >> many times a re-diff was done and how many times it resulted in finding
> >> matching lines (therefore addressing this issue) with the fix. In
> >> general it is fewer than 1% of diff's that exhibit this offending
> >> behavior:
> >>
> >> | Repo (1k commits)  | Re-diff | Found matching lines |
> >> |--------------------|---------|----------------------|
> >> | llvm-project       |  45     | 11                   |
> >> | vim                | 110     |  9                   |
> >> | git                |  18     |  2                   |
> >> | WebKit             | 168     |  1                   |
> >> | ripgrep            |  22     |  1                   |
> >> | cpython            |  32     |  0                   |
> >> | vscode             |  13     |  0                   |
> >>
> >> Signed-off-by: Yee Cheng Chin <ychin.git@gmail.com>
> >> ---
> >
> > Thanks for the updated patch, and sorry for nobody responding to the
> > patch for over a week.
>
> Yes, sorry for the slow response. I agree with Junio that this is
> explained well and looks good
>
> Thanks
>
> Phillip
>
> > The detailed explanation of the issue and the inclusion of the
> > repository analysis results are very helpful; they clearly show that
> > while this is a rare edge case, it significantly improves the
> > quality of histogram diffs when it does occur.
> >
> >   - The removal of go_orig is correct since g and go are kept in sync
> >     throughout the slide loops.
> >
> >   - Clearing the algorithm mask while preserving other flags ensures that
> >     user-provided options like --ignore-all-space are correctly applied
> >     during the re-diff.
> >
> >   - While ignore_regex and anchors are not passed to the sub-diff, they
> >     aren't currently available to xdl_change_compact anyway. Given that
> >     compaction happens before regex filtering in the main pipeline, this
> >     is OK, I guess.
> >
> > Let me mark the topic for 'next'.
> >
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2026-03-19 23:30 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-12-06 20:51 [PATCH] xdiff: re-diff shifted change groups when using histogram algorithm Yee Cheng Chin via GitGitGadget
2026-01-21 20:51 ` Junio C Hamano
2026-01-24 10:54   ` Phillip Wood
2026-01-25 17:34     ` Junio C Hamano
2026-01-26  9:37       ` Phillip Wood
2026-01-26 17:32         ` Junio C Hamano
2026-01-29 16:53           ` Yee Cheng Chin
2026-01-29 20:58             ` Junio C Hamano
2026-01-30  1:58               ` Yee Cheng Chin
2026-01-30  5:43                 ` Junio C Hamano
2026-01-30 16:06                 ` Phillip Wood
2026-02-20 23:07                 ` Junio C Hamano
2026-02-21  9:56                   ` Yee Cheng Chin
2026-03-02 14:54 ` [PATCH v2] " Yee Cheng Chin via GitGitGadget
2026-03-13  7:07   ` Junio C Hamano
2026-03-13 10:23     ` Phillip Wood
2026-03-19 23:30       ` Yee Cheng Chin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox