From: Junio C Hamano <gitster@pobox.com>
To: "Yee Cheng Chin via GitGitGadget" <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, Yee Cheng Chin <ychin.git@gmail.com>
Subject: Re: [PATCH] xdiff: re-diff shifted change groups when using histogram algorithm
Date: Wed, 21 Jan 2026 12:51:34 -0800 [thread overview]
Message-ID: <xmqqikcusn8p.fsf@gitster.g> (raw)
In-Reply-To: <pull.2120.git.git.1765054287938.gitgitgadget@gmail.com> (Yee Cheng Chin via GitGitGadget's message of "Sat, 06 Dec 2025 20:51:27 +0000")
"Yee Cheng Chin via GitGitGadget" <gitgitgadget@gmail.com> writes:
> When using Myer's diff, the algorithm finds that only the "x" has been
> changed, and produces a final diff result (these are line diffs, but
> using word-diff syntax for ease of presentation):
>
> A A[-A-]{+x+}A AAA
And patience gives the same result; as you noted, it uses a unique
line as the anchoring point.
> When using histogram diff, the algorithm first discovers the LCS "A
> AAA", which it uses as anchor, then produces an intermediate diff:
>
> {+A Ax+}A AAA[- AAA-].
>
> This is a longer diff than Myer's, but it's still self-consistent.
> However, the compaction phase attempts to shift the first file's diff
> group upwards (note that this shift crosses the anchor that histogram
> had used), leading to the final results for histogram diff:
>
> [-A AA-]{+A Ax+}A AAA
>
> This is a technically correct patch but looks clearly redundant to a
> human as the first 3 lines should not be in the diff.
So true.
> The fix would detect that a shift has caused matching to a new group,
> and re-diff the "A AA" and "A Ax" parts, which results in "A A"
> correctly re-marked as unchanged. This creates the now correct histogram
> diff:
>
> A A[-A-]{+x+}A AAA
OK.
> This issue is rare in a normal repository. Below is a table of
> repositories (`git log --no-merges -p --histogram -1000`), showing how
> many times a re-diff was done and how many times it resulted in finding
> matching lines (therefore addressing this issue) with the fix. In
> general it is fewer than 1% of diff's that exhibit this offending
> behavior:
In other words, without the fix, we'd see 1% or so commits with
suboptimal (or "funny looking") diff that will trigger bug reports,
which sounds like an unacceptably high failure rate.
> @@ -915,6 +919,45 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
> }
> }
>
> + /*
> + * If this has a matching group from the other file, it could
> + * either be the original match from the diff algorithm, or
> + * arrived at by shifting and joining groups. When it's the
> + * latter, it's possible for the two newly joined sides to have
> + * matching lines. Re-diff the group to mark these matching
> + * lines as unchanged and remove from the diff output.
> + *
> + * Only do this for histogram diff as its LCS algorithm makes
> + * this scenario possible. In contrast, patience diff finds LCS
> + * of unique lines that groups cannot be shifted across.
> + * Myer's diff (standalone or used as fall-back in patience
> + * diff) already finds minimal edits so it is not possible for
> + * shifted groups to result in a smaller diff. (Without
> + * XDF_NEED_MINIMAL, Myer's isn't technically guaranteed to be
> + * minimal, but it should be so most of the time)
> + */
> + if (end_matching_other != -1 &&
> + XDF_DIFF_ALG(flags) == XDF_HISTOGRAM_DIFF &&
> + (g.start != g_orig.start ||
> + g.end != g_orig.end ||
> + go.start != go_orig.start ||
> + go.end != go_orig.end)) {
So the idea is to remember the original values in g and go (the
location of the group in the file and the other file) and if
shifting up and down changed any one of the four ends from the
original locations, we always take the fall-back route (if we are
doing histogram)?
By the way, this appears after the if/else if/ cascade that has:
if (g.end == earliest_end) {
... do nothing case (case #1)
} else if (end_matching_other != -1) {
... do the slide-up thing (case #2)
} else if (flags & XDF_INDENT_HEIRISTIC) {
... do the indent heuristic thing (case #3)
}
Am I reading the code correctly that, even though this new block
appears as if it is a post-clean-up phase that is independent from
which one of the three choices are taken in the previous if/elseif
cascade, it only is relevant to the second case? I am wondering if
it would make it easier to follow if the new code were made into a
small helper function that is called from the (case #2) arm of the
existing if/else if cascade.
Thanks.
> + xpparam_t xpp;
> + xdfenv_t xe;
> +
> + memset(&xpp, 0, sizeof(xpp));
> + xpp.flags = flags & ~XDF_DIFF_ALGORITHM_MASK;
> +
> + memcpy(&xe.xdf1, xdf, sizeof(xdfile_t));
> + memcpy(&xe.xdf2, xdfo, sizeof(xdfile_t));
> +
> + if (xdl_fall_back_diff(&xe, &xpp,
> + g.start + 1, g.end - g.start,
> + go.start + 1, go.end - go.start)) {
> + return -1;
> + }
> + }
> +
> next:
> /* Move past the just-processed group: */
> if (group_next(xdf, &g))
>
> base-commit: f0ef5b6d9bcc258e4cbef93839d1b7465d5212b9
next prev parent reply other threads:[~2026-01-21 20:51 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-12-06 20:51 [PATCH] xdiff: re-diff shifted change groups when using histogram algorithm Yee Cheng Chin via GitGitGadget
2026-01-21 20:51 ` Junio C Hamano [this message]
2026-01-24 10:54 ` Phillip Wood
2026-01-25 17:34 ` Junio C Hamano
2026-01-26 9:37 ` Phillip Wood
2026-01-26 17:32 ` Junio C Hamano
2026-01-29 16:53 ` Yee Cheng Chin
2026-01-29 20:58 ` Junio C Hamano
2026-01-30 1:58 ` Yee Cheng Chin
2026-01-30 5:43 ` Junio C Hamano
2026-01-30 16:06 ` Phillip Wood
2026-02-20 23:07 ` Junio C Hamano
2026-02-21 9:56 ` Yee Cheng Chin
2026-03-02 14:54 ` [PATCH v2] " Yee Cheng Chin via GitGitGadget
2026-03-13 7:07 ` Junio C Hamano
2026-03-13 10:23 ` Phillip Wood
2026-03-19 23:30 ` Yee Cheng Chin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=xmqqikcusn8p.fsf@gitster.g \
--to=gitster@pobox.com \
--cc=git@vger.kernel.org \
--cc=gitgitgadget@gmail.com \
--cc=ychin.git@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox