* [PATCH 0/5] [RFC] diff: add diff.<driver>.process for external hunk providers
@ 2026-05-22 2:11 Michael Montalbo via GitGitGadget
2026-05-22 2:11 ` [PATCH 1/5] xdiff: support external hunks via xpparam_t Michael Montalbo via GitGitGadget
` (6 more replies)
0 siblings, 7 replies; 30+ messages in thread
From: Michael Montalbo via GitGitGadget @ 2026-05-22 2:11 UTC (permalink / raw)
To: git; +Cc: Michael Montalbo
This series adds diff.<driver>.process, a long-running subprocess protocol
that lets external tools provide hunks to git's diff and blame pipelines.
Over the past 18 years, git's diff pipeline accumulated many features that
operate on hunks: word diff, function context, color-moved, indent
heuristic, blame. External tools can replace the pipeline entirely
(diff.<driver>.command) or select among builtin algorithms
(diff.<driver>.algorithm), but there is no way for a tool to provide
line-change information into the pipeline. Tools that understand code
structure (tree-sitter parsers, format-aware analyzers, tools like
Difftastic and Mergiraf) must bypass git's pipeline and lose access to
everything downstream.
The protocol follows filter.<driver>.process: pkt-line over stdin/stdout,
capability negotiation, one tool invocation per git command. The tool
receives file pairs and returns hunk descriptors that git feeds into the
standard xdiff pipeline. All output features work normally.
Zero hunks with status=success means the tool considers the files
equivalent. git diff shows no output for the file, and git blame skips the
commit, attributing lines to earlier commits.
On error or tool crash, git falls back silently to the builtin diff
algorithm. The feature is opt-in via diff.<driver>.process and
.gitattributes; unconfigured files are unaffected.
The series includes git diff-process-normalize, a built-in tool that
compares files line by line ignoring whitespace (same logic as "git diff -w"
via xdiff_compare_lines):
[diff "cdiff"]
process = git diff-process-normalize
A whitespace-only boolean flag could serve this specific case. The
subprocess protocol is more general, allowing any tool to participate
without further changes to git.
Michael Montalbo (5):
xdiff: support external hunks via xpparam_t
userdiff: add diff.<driver>.process config
diff: add long-running diff process via diff.<driver>.process
blame: consult diff process for zero-hunk detection
diff-process-normalize: add built-in whitespace normalizer
Documentation/config/diff.adoc | 10 +
Documentation/gitattributes.adoc | 58 +++++
Makefile | 2 +
blame.c | 43 +++-
builtin.h | 1 +
builtin/diff-process-normalize.c | 143 ++++++++++
diff-process.c | 203 +++++++++++++++
diff-process.h | 28 ++
diff.c | 25 ++
git.c | 1 +
t/t4080-diff-process.sh | 430 +++++++++++++++++++++++++++++++
userdiff.c | 7 +
userdiff.h | 2 +
xdiff-interface.c | 7 +-
xdiff/xdiff.h | 13 +
xdiff/xdiffi.c | 98 ++++++-
16 files changed, 1063 insertions(+), 8 deletions(-)
create mode 100644 builtin/diff-process-normalize.c
create mode 100644 diff-process.c
create mode 100644 diff-process.h
create mode 100755 t/t4080-diff-process.sh
base-commit: 94f057755b7941b321fd11fec1b2e3ca5313a4e0
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-2120%2Fmmontalbo%2Fmm%2Fstructural-diff-backend-clean-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-2120/mmontalbo/mm/structural-diff-backend-clean-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/2120
--
gitgitgadget
^ permalink raw reply [flat|nested] 30+ messages in thread* [PATCH 1/5] xdiff: support external hunks via xpparam_t 2026-05-22 2:11 [PATCH 0/5] [RFC] diff: add diff.<driver>.process for external hunk providers Michael Montalbo via GitGitGadget @ 2026-05-22 2:11 ` Michael Montalbo via GitGitGadget 2026-05-22 5:29 ` Junio C Hamano 2026-05-22 2:11 ` [PATCH 2/5] userdiff: add diff.<driver>.process config Michael Montalbo via GitGitGadget ` (5 subsequent siblings) 6 siblings, 1 reply; 30+ messages in thread From: Michael Montalbo via GitGitGadget @ 2026-05-22 2:11 UTC (permalink / raw) To: git; +Cc: Michael Montalbo, Michael Montalbo From: Michael Montalbo <mmontalbo@gmail.com> Add two new xpparam_t fields (external_hunks, external_hunks_nr) that let callers supply pre-computed hunks. When set, xdl_diff() populates the changed[] arrays from these hunks instead of running the diff algorithm, then continues through compaction and emission as usual. Validate supplied hunks before use: reject out-of-bounds line numbers, overlapping or out-of-order hunks, negative counts, and violations of the synchronization invariant (unchanged line counts must match between files). On validation failure, fall back to the builtin diff algorithm. Skip trim_common_tail() in xdi_diff() when external hunks are present, since external hunks reference line numbers in the original content. Signed-off-by: Michael Montalbo <mmontalbo@gmail.com> --- xdiff-interface.c | 7 +++- xdiff/xdiff.h | 13 +++++++ xdiff/xdiffi.c | 98 +++++++++++++++++++++++++++++++++++++++++++++-- 3 files changed, 114 insertions(+), 4 deletions(-) diff --git a/xdiff-interface.c b/xdiff-interface.c index f043330f2a..9542c0bcc2 100644 --- a/xdiff-interface.c +++ b/xdiff-interface.c @@ -124,7 +124,12 @@ int xdi_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, xdemitconf_t co if (mf1->size > MAX_XDIFF_SIZE || mf2->size > MAX_XDIFF_SIZE) return -1; - if (!xecfg->ctxlen && !(xecfg->flags & XDL_EMIT_FUNCCONTEXT)) + /* + * External hunks reference line numbers in the original content; + * trimming the tail would change line counts and invalidate them. + */ + if (!xpp->external_hunks && + !xecfg->ctxlen && !(xecfg->flags & XDL_EMIT_FUNCCONTEXT)) trim_common_tail(&a, &b); return xdl_diff(&a, &b, xpp, xecfg, xecb); diff --git a/xdiff/xdiff.h b/xdiff/xdiff.h index dc370712e9..2ee6f1aae3 100644 --- a/xdiff/xdiff.h +++ b/xdiff/xdiff.h @@ -78,6 +78,15 @@ typedef struct s_mmbuffer { long size; } mmbuffer_t; +/* + * Hunk descriptor for externally computed diffs. + * Line numbers are 1-based, matching unified diff convention. + */ +struct xdl_hunk { + long old_start, old_count; + long new_start, new_count; +}; + typedef struct s_xpparam { unsigned long flags; @@ -88,6 +97,10 @@ typedef struct s_xpparam { /* See Documentation/diff-options.adoc. */ char **anchors; size_t anchors_nr; + + /* Externally computed hunks: bypass the diff algorithm. */ + const struct xdl_hunk *external_hunks; + size_t external_hunks_nr; } xpparam_t; typedef struct s_xdemitcb { diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c index 5455b4690d..7eca4ab4a1 100644 --- a/xdiff/xdiffi.c +++ b/xdiff/xdiffi.c @@ -1085,16 +1085,108 @@ static void xdl_mark_ignorable_regex(xdchange_t *xscr, const xdfenv_t *xe, } } +/* + * Populate the changed[] arrays from externally supplied hunks, + * bypassing the diff algorithm. Validates that hunks are in order, + * non-overlapping, and within bounds. + * + * Returns 0 on success, -1 on validation failure. + */ +static int xdl_populate_hunks_from_external(xdfenv_t *xe, + const struct xdl_hunk *hunks, + size_t nr_hunks) +{ + size_t i; + long j, prev_old_end = 0, prev_new_end = 0; + long total_old = 0, total_new = 0; + + /* + * Clear changed[] arrays. xdl_prepare_env() may have dirtied + * them via xdl_cleanup_records(). The allocation is nrec + 2 + * elements; changed points one past the start (see xprepare.c). + */ + memset(xe->xdf1.changed - 1, 0, + (xe->xdf1.nrec + 2) * sizeof(bool)); + memset(xe->xdf2.changed - 1, 0, + (xe->xdf2.nrec + 2) * sizeof(bool)); + + for (i = 0; i < nr_hunks; i++) { + const struct xdl_hunk *h = &hunks[i]; + + if (h->old_count < 0 || h->new_count < 0) + return -1; + + /* Bounds check (1-based line numbers) */ + if (h->old_count > 0 && + (h->old_start < 1 || + h->old_start + h->old_count - 1 > xe->xdf1.nrec)) + return -1; + if (h->new_count > 0 && + (h->new_start < 1 || + h->new_start + h->new_count - 1 > xe->xdf2.nrec)) + return -1; + + /* Zero-count hunks: start must still be in [1, nrec+1] */ + if (h->old_count == 0 && + (h->old_start < 1 || h->old_start > xe->xdf1.nrec + 1)) + return -1; + if (h->new_count == 0 && + (h->new_start < 1 || h->new_start > xe->xdf2.nrec + 1)) + return -1; + + /* Ordering: no overlap with previous hunk */ + if (h->old_start < prev_old_end || + h->new_start < prev_new_end) + return -1; + + for (j = 0; j < h->old_count; j++) + xe->xdf1.changed[h->old_start - 1 + j] = true; + for (j = 0; j < h->new_count; j++) + xe->xdf2.changed[h->new_start - 1 + j] = true; + + prev_old_end = h->old_start + h->old_count; + prev_new_end = h->new_start + h->new_count; + total_old += h->old_count; + total_new += h->new_count; + } + + /* + * Synchronization invariant: unchanged line counts must match. + * Otherwise xdl_build_script() would walk off one array. + */ + if ((long)xe->xdf1.nrec - total_old != + (long)xe->xdf2.nrec - total_new) + return -1; + + return 0; +} + int xdl_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, xdemitconf_t const *xecfg, xdemitcb_t *ecb) { xdchange_t *xscr; xdfenv_t xe; emit_func_t ef = xecfg->hunk_func ? xdl_call_hunk_func : xdl_emit_diff; - if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0) { - - return -1; + if (xpp->external_hunks) { + if (xdl_prepare_env(mf1, mf2, xpp, &xe) < 0) + return -1; + if (xdl_populate_hunks_from_external(&xe, + xpp->external_hunks, + xpp->external_hunks_nr) < 0) { + /* + * Invalid external hunks; fall back to the + * builtin diff algorithm. Re-runs + * xdl_prepare_env() via xdl_do_diff(). + */ + xdl_free_env(&xe); + if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0) + return -1; + } + } else { + if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0) + return -1; } + if (xdl_change_compact(&xe.xdf1, &xe.xdf2, xpp->flags) < 0 || xdl_change_compact(&xe.xdf2, &xe.xdf1, xpp->flags) < 0 || xdl_build_script(&xe, &xscr) < 0) { -- gitgitgadget ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH 1/5] xdiff: support external hunks via xpparam_t 2026-05-22 2:11 ` [PATCH 1/5] xdiff: support external hunks via xpparam_t Michael Montalbo via GitGitGadget @ 2026-05-22 5:29 ` Junio C Hamano 2026-05-22 19:06 ` Michael Montalbo 0 siblings, 1 reply; 30+ messages in thread From: Junio C Hamano @ 2026-05-22 5:29 UTC (permalink / raw) To: Michael Montalbo via GitGitGadget; +Cc: git, Michael Montalbo "Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes: > +/* > + * Populate the changed[] arrays from externally supplied hunks, > + * bypassing the diff algorithm. Validates that hunks are in order, > + * non-overlapping, and within bounds. > + * > + * Returns 0 on success, -1 on validation failure. > + */ > +static int xdl_populate_hunks_from_external(xdfenv_t *xe, > + const struct xdl_hunk *hunks, > + size_t nr_hunks) > +{ > + size_t i; > + long j, prev_old_end = 0, prev_new_end = 0; > + long total_old = 0, total_new = 0; > + > + /* > + * Clear changed[] arrays. xdl_prepare_env() may have dirtied > + * them via xdl_cleanup_records(). The allocation is nrec + 2 > + * elements; changed points one past the start (see xprepare.c). > + */ > + memset(xe->xdf1.changed - 1, 0, > + (xe->xdf1.nrec + 2) * sizeof(bool)); > + memset(xe->xdf2.changed - 1, 0, > + (xe->xdf2.nrec + 2) * sizeof(bool)); This, especially the starting offset of -1, looks horrible. The internal layout of xdfenv_t might happen to match the way the above code expects, which is how xdl_prepare_ctx() may have give you, but it somehow feels brittle. I guess the assumption that changed[] does not point at the beginning of the allocated area (e.g., it is a no-no to free(xe->xdf1.changed) or realloc() it) is so pervasive that it cannot be helped. Sigh. > int xdl_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, > xdemitconf_t const *xecfg, xdemitcb_t *ecb) { > xdchange_t *xscr; > xdfenv_t xe; > emit_func_t ef = xecfg->hunk_func ? xdl_call_hunk_func : xdl_emit_diff; > > - if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0) { > - > - return -1; > + if (xpp->external_hunks) { > + if (xdl_prepare_env(mf1, mf2, xpp, &xe) < 0) > + return -1; > + if (xdl_populate_hunks_from_external(&xe, > + xpp->external_hunks, > + xpp->external_hunks_nr) < 0) { > + /* > + * Invalid external hunks; fall back to the > + * builtin diff algorithm. Re-runs > + * xdl_prepare_env() via xdl_do_diff(). > + */ > + xdl_free_env(&xe); > + if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0) > + return -1; If the external tool keeps sending bogus hunks, silently falling back to what we would have done if there weren't any external stuff may be necessary to pleasantly keep using Git, but two and a half short comments here. (1) "What we would have done" is exactly the same as what appears in the corresponding "else" block. Can we make sure that we do not have to keep updating both copies in the future with some code rearrangement? (2) The writer of the external tool may want to see some trace of warning under certain flags when a failure of the tool forces the receiving end to fallback. (3) If the tool throws too many broken replies, perhaps we want to disable it automatically? > + } > + } else { > + if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0) > + return -1; > } > + > if (xdl_change_compact(&xe.xdf1, &xe.xdf2, xpp->flags) < 0 || > xdl_change_compact(&xe.xdf2, &xe.xdf1, xpp->flags) < 0 || > xdl_build_script(&xe, &xscr) < 0) { ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 1/5] xdiff: support external hunks via xpparam_t 2026-05-22 5:29 ` Junio C Hamano @ 2026-05-22 19:06 ` Michael Montalbo 2026-05-24 8:50 ` Junio C Hamano 0 siblings, 1 reply; 30+ messages in thread From: Michael Montalbo @ 2026-05-22 19:06 UTC (permalink / raw) To: Junio C Hamano; +Cc: Michael Montalbo via GitGitGadget, git On Thu, May 21, 2026 at 10:29 PM Junio C Hamano <gitster@pobox.com> wrote: > > "Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes: > > > +/* > > + * Populate the changed[] arrays from externally supplied hunks, > > + * bypassing the diff algorithm. Validates that hunks are in order, > > + * non-overlapping, and within bounds. > > + * > > + * Returns 0 on success, -1 on validation failure. > > + */ > > +static int xdl_populate_hunks_from_external(xdfenv_t *xe, > > + const struct xdl_hunk *hunks, > > + size_t nr_hunks) > > +{ > > + size_t i; > > + long j, prev_old_end = 0, prev_new_end = 0; > > + long total_old = 0, total_new = 0; > > + > > + /* > > + * Clear changed[] arrays. xdl_prepare_env() may have dirtied > > + * them via xdl_cleanup_records(). The allocation is nrec + 2 > > + * elements; changed points one past the start (see xprepare.c). > > + */ > > + memset(xe->xdf1.changed - 1, 0, > > + (xe->xdf1.nrec + 2) * sizeof(bool)); > > + memset(xe->xdf2.changed - 1, 0, > > + (xe->xdf2.nrec + 2) * sizeof(bool)); > > This, especially the starting offset of -1, looks horrible. The > internal layout of xdfenv_t might happen to match the way the above > code expects, which is how xdl_prepare_ctx() may have give you, but > it somehow feels brittle. I guess the assumption that changed[] > does not point at the beginning of the allocated area (e.g., it is a > no-no to free(xe->xdf1.changed) or realloc() it) is so pervasive that > it cannot be helped. Sigh. > Agreed it is ugly. I wanted to make sure the entire changed[] including sentinels were clear as a defensive measure for downstream callers (xdl_change_compact). I agree this results in something that is ugly and brittle, but in the end I thought it was superior to relying on the fact that upstream zeroes the entire changed[] array. Maybe if the comment was more explicit about why this is happening it would be helpful? /* * Clear changed[] arrays including sentinels. * xdl_prepare_env() may have dirtied them via * xdl_cleanup_records(), and xdl_change_compact() reads * the sentinel at changed[-1] during backward scans. */ > > int xdl_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, > > xdemitconf_t const *xecfg, xdemitcb_t *ecb) { > > xdchange_t *xscr; > > xdfenv_t xe; > > emit_func_t ef = xecfg->hunk_func ? xdl_call_hunk_func : xdl_emit_diff; > > > > - if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0) { > > - > > - return -1; > > + if (xpp->external_hunks) { > > + if (xdl_prepare_env(mf1, mf2, xpp, &xe) < 0) > > + return -1; > > + if (xdl_populate_hunks_from_external(&xe, > > + xpp->external_hunks, > > + xpp->external_hunks_nr) < 0) { > > + /* > > + * Invalid external hunks; fall back to the > > + * builtin diff algorithm. Re-runs > > + * xdl_prepare_env() via xdl_do_diff(). > > + */ > > + xdl_free_env(&xe); > > + if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0) > > + return -1; > > If the external tool keeps sending bogus hunks, silently falling > back to what we would have done if there weren't any external stuff > may be necessary to pleasantly keep using Git, but two and a half > short comments here. > > (1) "What we would have done" is exactly the same as what appears > in the corresponding "else" block. Can we make sure that we do > not have to keep updating both copies in the future with some > code rearrangement? > How about something like this: if (xpp->external_hunks) { if (xdl_prepare_env(mf1, mf2, xpp, &xe) < 0) return -1; if (xdl_populate_hunks_from_external(&xe, xpp->external_hunks, xpp->external_hunks_nr) == 0) goto diff_done; xdl_free_env(&xe); } if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0) return -1; diff_done: > (2) The writer of the external tool may want to see some trace of > warning under certain flags when a failure of the tool forces > the receiving end to fallback. > In diff.c how about we emit a warning rather than a trace on fallback: warning(_("diff process failed for '%s'," " falling back to builtin diff"), name_a); > (3) If the tool throws too many broken replies, perhaps we want to > disable it automatically? > For the RFC I wanted to keep it simple, but I definitely agree. A configurable failure policy makes a lot of sense to me (e.g., disable after N failures). > > + } > > + } else { > > + if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0) > > + return -1; > > } > > + > > if (xdl_change_compact(&xe.xdf1, &xe.xdf2, xpp->flags) < 0 || > > xdl_change_compact(&xe.xdf2, &xe.xdf1, xpp->flags) < 0 || > > xdl_build_script(&xe, &xscr) < 0) { ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 1/5] xdiff: support external hunks via xpparam_t 2026-05-22 19:06 ` Michael Montalbo @ 2026-05-24 8:50 ` Junio C Hamano 2026-05-24 18:01 ` Michael Montalbo 0 siblings, 1 reply; 30+ messages in thread From: Junio C Hamano @ 2026-05-24 8:50 UTC (permalink / raw) To: Michael Montalbo; +Cc: Michael Montalbo via GitGitGadget, git Michael Montalbo <mmontalbo@gmail.com> writes: >> > + * Clear changed[] arrays. xdl_prepare_env() may have dirtied >> > + * them via xdl_cleanup_records(). The allocation is nrec + 2 >> > + * elements; changed points one past the start (see xprepare.c). >> > + */ >> > + memset(xe->xdf1.changed - 1, 0, >> > + (xe->xdf1.nrec + 2) * sizeof(bool)); >> > + memset(xe->xdf2.changed - 1, 0, >> > + (xe->xdf2.nrec + 2) * sizeof(bool)); >> >> This, especially the starting offset of -1, looks horrible. The >> internal layout of xdfenv_t might happen to match the way the above >> code expects, which is how xdl_prepare_ctx() may have give you, but >> it somehow feels brittle. I guess the assumption that changed[] >> does not point at the beginning of the allocated area (e.g., it is a >> no-no to free(xe->xdf1.changed) or realloc() it) is so pervasive that >> it cannot be helped. Sigh. >> > > Agreed it is ugly. I wanted to make sure the entire changed[] including > sentinels were clear as a defensive measure for downstream callers > (xdl_change_compact). I agree this results in something that is ugly > and brittle, but in the end I thought it was superior to relying on the > fact that upstream zeroes the entire changed[] array. Maybe if the > comment was more explicit about why this is happening it would be > helpful? Perhaps make these memset() into calls to a helper function that is defined in xdiff/xprepare.c with a descriptive name and placed near where xdl_prepare_ctx() is. That way, the patch in question does not even have to expose the strangeness of changed[] (i.e., it has 2 more elements than it would normally contain to make the memory region for changed[-1] and changed[N] valid, and freeing it requires free(changed-1)) to the code path. It only needs to say "Hey, I am clearing changed[] arrays because of XXX" without having to say "by the way, the memory layout of changed[] is strange this way", the latter of which is not exactly of interest for readers of this code. > /* > * Clear changed[] arrays including sentinels. > * xdl_prepare_env() may have dirtied them via > * xdl_cleanup_records(), and xdl_change_compact() reads > * the sentinel at changed[-1] during backward scans. > */ And this belongs in xdiff/xprepare.c near that new helper function. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 1/5] xdiff: support external hunks via xpparam_t 2026-05-24 8:50 ` Junio C Hamano @ 2026-05-24 18:01 ` Michael Montalbo 0 siblings, 0 replies; 30+ messages in thread From: Michael Montalbo @ 2026-05-24 18:01 UTC (permalink / raw) To: Junio C Hamano; +Cc: Michael Montalbo via GitGitGadget, git On Sun, May 24, 2026 at 1:50 AM Junio C Hamano <gitster@pobox.com> wrote: > > Michael Montalbo <mmontalbo@gmail.com> writes: > > >> > + * Clear changed[] arrays. xdl_prepare_env() may have dirtied > >> > + * them via xdl_cleanup_records(). The allocation is nrec + 2 > >> > + * elements; changed points one past the start (see xprepare.c). > >> > + */ > >> > + memset(xe->xdf1.changed - 1, 0, > >> > + (xe->xdf1.nrec + 2) * sizeof(bool)); > >> > + memset(xe->xdf2.changed - 1, 0, > >> > + (xe->xdf2.nrec + 2) * sizeof(bool)); > >> > >> This, especially the starting offset of -1, looks horrible. The > >> internal layout of xdfenv_t might happen to match the way the above > >> code expects, which is how xdl_prepare_ctx() may have give you, but > >> it somehow feels brittle. I guess the assumption that changed[] > >> does not point at the beginning of the allocated area (e.g., it is a > >> no-no to free(xe->xdf1.changed) or realloc() it) is so pervasive that > >> it cannot be helped. Sigh. > >> > > > > Agreed it is ugly. I wanted to make sure the entire changed[] including > > sentinels were clear as a defensive measure for downstream callers > > (xdl_change_compact). I agree this results in something that is ugly > > and brittle, but in the end I thought it was superior to relying on the > > fact that upstream zeroes the entire changed[] array. Maybe if the > > comment was more explicit about why this is happening it would be > > helpful? > > Perhaps make these memset() into calls to a helper function that is > defined in xdiff/xprepare.c with a descriptive name and placed near > where xdl_prepare_ctx() is. That way, the patch in question does > not even have to expose the strangeness of changed[] (i.e., it has 2 > more elements than it would normally contain to make the memory > region for changed[-1] and changed[N] valid, and freeing it requires > free(changed-1)) to the code path. It only needs to say "Hey, I am > clearing changed[] arrays because of XXX" without having to say "by > the way, the memory layout of changed[] is strange this way", the > latter of which is not exactly of interest for readers of this code. > > > /* > > * Clear changed[] arrays including sentinels. > > * xdl_prepare_env() may have dirtied them via > > * xdl_cleanup_records(), and xdl_change_compact() reads > > * the sentinel at changed[-1] during backward scans. > > */ > > And this belongs in xdiff/xprepare.c near that new helper function. That sounds a lot nicer. Will update. ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH 2/5] userdiff: add diff.<driver>.process config 2026-05-22 2:11 [PATCH 0/5] [RFC] diff: add diff.<driver>.process for external hunk providers Michael Montalbo via GitGitGadget 2026-05-22 2:11 ` [PATCH 1/5] xdiff: support external hunks via xpparam_t Michael Montalbo via GitGitGadget @ 2026-05-22 2:11 ` Michael Montalbo via GitGitGadget 2026-05-22 2:11 ` [PATCH 3/5] diff: add long-running diff process via diff.<driver>.process Michael Montalbo via GitGitGadget ` (4 subsequent siblings) 6 siblings, 0 replies; 30+ messages in thread From: Michael Montalbo via GitGitGadget @ 2026-05-22 2:11 UTC (permalink / raw) To: git; +Cc: Michael Montalbo, Michael Montalbo From: Michael Montalbo <mmontalbo@gmail.com> Add a new per-driver configuration key that specifies the command for a long-running diff process. The name follows filter.<driver>.process: a long-running subprocess that stays alive across files within a single git invocation. Signed-off-by: Michael Montalbo <mmontalbo@gmail.com> --- userdiff.c | 7 +++++++ userdiff.h | 2 ++ 2 files changed, 9 insertions(+) diff --git a/userdiff.c b/userdiff.c index fe710a68bf..81c0bebcce 100644 --- a/userdiff.c +++ b/userdiff.c @@ -499,6 +499,13 @@ int userdiff_config(const char *k, const char *v) drv->algorithm = drv->algorithm_owned; return ret; } + if (!strcmp(type, "process")) { + int ret; + FREE_AND_NULL(drv->process_owned); + ret = git_config_string(&drv->process_owned, k, v); + drv->process = drv->process_owned; + return ret; + } return 0; } diff --git a/userdiff.h b/userdiff.h index 827361b0bc..51c26e0d41 100644 --- a/userdiff.h +++ b/userdiff.h @@ -31,6 +31,8 @@ struct userdiff_driver { char *textconv_owned; struct notes_cache *textconv_cache; int textconv_want_cache; + const char *process; + char *process_owned; }; enum userdiff_driver_type { USERDIFF_DRIVER_TYPE_BUILTIN = 1<<0, -- gitgitgadget ^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH 3/5] diff: add long-running diff process via diff.<driver>.process 2026-05-22 2:11 [PATCH 0/5] [RFC] diff: add diff.<driver>.process for external hunk providers Michael Montalbo via GitGitGadget 2026-05-22 2:11 ` [PATCH 1/5] xdiff: support external hunks via xpparam_t Michael Montalbo via GitGitGadget 2026-05-22 2:11 ` [PATCH 2/5] userdiff: add diff.<driver>.process config Michael Montalbo via GitGitGadget @ 2026-05-22 2:11 ` Michael Montalbo via GitGitGadget 2026-05-22 2:11 ` [PATCH 4/5] blame: consult diff process for zero-hunk detection Michael Montalbo via GitGitGadget ` (3 subsequent siblings) 6 siblings, 0 replies; 30+ messages in thread From: Michael Montalbo via GitGitGadget @ 2026-05-22 2:11 UTC (permalink / raw) To: git; +Cc: Michael Montalbo, Michael Montalbo From: Michael Montalbo <mmontalbo@gmail.com> Add support for external diff processes that communicate via the long-running process protocol (pkt-line over stdin/stdout). A diff process is configured per userdiff driver: [diff "cdiff"] process = /path/to/diff-tool The tool receives file pairs and returns hunks describing which lines changed. Git feeds these hunks into the standard xdiff pipeline, so all output features (word diff, function context, color) work normally. The handshake negotiates version=1 and capability=hunks. Per-file requests send command=hunks, pathname, and both file contents as packetized data. The tool responds with hunk lines and a status packet. On error, git falls back to the builtin diff algorithm. Zero hunks with status=success means the tool considers the files equivalent. Git skips diff output for that file entirely. Signed-off-by: Michael Montalbo <mmontalbo@gmail.com> --- Documentation/config/diff.adoc | 8 + Documentation/gitattributes.adoc | 40 ++++ Makefile | 1 + diff-process.c | 203 +++++++++++++++++++ diff-process.h | 28 +++ diff.c | 25 +++ t/t4080-diff-process.sh | 338 +++++++++++++++++++++++++++++++ 7 files changed, 643 insertions(+) create mode 100644 diff-process.c create mode 100644 diff-process.h create mode 100755 t/t4080-diff-process.sh diff --git a/Documentation/config/diff.adoc b/Documentation/config/diff.adoc index 1135a62a0a..4ab5f60df6 100644 --- a/Documentation/config/diff.adoc +++ b/Documentation/config/diff.adoc @@ -218,6 +218,14 @@ endif::git-diff[] Set this option to `true` to make the diff driver cache the text conversion outputs. See linkgit:gitattributes[5] for details. +`diff.<driver>.process`:: + The command to run as a long-running diff process. + The tool communicates via the pkt-line protocol and returns + hunks that are fed into Git's diff and blame pipelines. + If the tool returns zero hunks, the file is treated as + unchanged for both diff output and blame attribution. + See linkgit:gitattributes[5] for details. + `diff.indentHeuristic`:: Set this option to `false` to disable the default heuristics that shift diff hunk boundaries to make patches easier to read. diff --git a/Documentation/gitattributes.adoc b/Documentation/gitattributes.adoc index f20041a323..cc724f8c63 100644 --- a/Documentation/gitattributes.adoc +++ b/Documentation/gitattributes.adoc @@ -821,6 +821,46 @@ NOTE: If `diff.<name>.command` is defined for path with the (see above), and adding `diff.<name>.algorithm` has no effect, as the algorithm is not passed to the external diff driver. +Using an external diff process +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +An external tool can provide content-aware line matching by +setting `diff.<name>.process` to the command that runs +the tool. The tool is a long-running process that communicates via +the pkt-line protocol (see +linkgit:gitprotocol-long-running-process[5]). + +------------------------ +*.c diff=cdiff +------------------------ + +---------------------------------------------------------------- +[diff "cdiff"] + process = /path/to/diff-process-tool +---------------------------------------------------------------- + +The tool receives file pairs and returns hunk descriptors indicating +which lines changed. Git feeds these hunks into its standard diff +pipeline, so all output features (word diff, function context, +color) work normally. + +If the tool fails or returns an error, Git silently falls back to +the builtin diff algorithm. If the tool returns invalid hunks +(out of bounds, overlapping), Git also falls back silently. + +The handshake negotiates `version=1` and `capability=hunks`. +Per-file requests send `command=hunks` and `pathname=<path>`, +followed by the old and new file content as packetized data. +The tool responds with lines of the form +`hunk <old_start> <old_count> <new_start> <new_count>` +(1-based line numbers), a flush packet, and `status=success`. + +If the tool returns zero hunks with `status=success`, Git treats +the file as having no changes and produces no diff output. + +Tools should ignore unknown keys in the per-file request to +remain forward-compatible. + Defining a custom hunk-header ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ diff --git a/Makefile b/Makefile index cedc234173..22900368dd 100644 --- a/Makefile +++ b/Makefile @@ -1142,6 +1142,7 @@ LIB_OBJS += diff-delta.o LIB_OBJS += diff-merges.o LIB_OBJS += diff-lib.o LIB_OBJS += diff-no-index.o +LIB_OBJS += diff-process.o LIB_OBJS += diff.o LIB_OBJS += diffcore-break.o LIB_OBJS += diffcore-delta.o diff --git a/diff-process.c b/diff-process.c new file mode 100644 index 0000000000..7b0f0e1f7e --- /dev/null +++ b/diff-process.c @@ -0,0 +1,203 @@ +/* + * Diff process backend: communicates with a long-running external + * tool via the pkt-line protocol to obtain content-aware hunks. + * + * Protocol: pkt-line over stdin/stdout, following the pattern of + * the long-running filter process protocol (see convert.c). + * + * Handshake: + * git> git-diff-client / version=1 / flush + * tool< git-diff-server / version=1 / flush + * git> capability=hunks / flush + * tool< capability=hunks / flush + * + * Per-file: + * git> command=hunks / pathname=<path> / flush + * git> <old content packetized> / flush + * git> <new content packetized> / flush + * tool< hunk <old_start> <old_count> <new_start> <new_count> + * tool< ... / flush + * tool< status=success / flush + * + * Zero hunks with status=success means the tool considers the + * files equivalent. Git will skip the diff for that file. + */ + +#include "git-compat-util.h" +#include "diff-process.h" +#include "userdiff.h" +#include "sub-process.h" +#include "pkt-line.h" +#include "strbuf.h" +#include "xdiff/xdiff.h" + +#define CAP_HUNKS (1u << 0) + +struct diff_subprocess { + struct subprocess_entry subprocess; + unsigned int supported_capabilities; +}; + +static int subprocess_map_initialized; +static struct hashmap subprocess_map; + +static int start_diff_process_fn(struct subprocess_entry *subprocess) +{ + static int versions[] = { 1, 0 }; + static struct subprocess_capability capabilities[] = { + { "hunks", CAP_HUNKS }, + { NULL, 0 } + }; + struct diff_subprocess *entry = + (struct diff_subprocess *)subprocess; + + /* Uses dying pkt-line variant, same as convert.c filters. */ + return subprocess_handshake(subprocess, "git-diff", + versions, NULL, + capabilities, + &entry->supported_capabilities); +} + +static struct diff_subprocess *find_or_start_process(const char *cmd) +{ + struct diff_subprocess *entry; + + if (!subprocess_map_initialized) { + subprocess_map_initialized = 1; + hashmap_init(&subprocess_map, cmd2process_cmp, NULL, 0); + } + + entry = (struct diff_subprocess *) + subprocess_find_entry(&subprocess_map, cmd); + if (entry) + return entry; + + entry = xcalloc(1, sizeof(*entry)); + if (subprocess_start(&subprocess_map, &entry->subprocess, + cmd, start_diff_process_fn)) { + free(entry); + return NULL; + } + + return entry; +} + +static int send_file_content(int fd, const char *buf, long size) +{ + int ret; + + if (size > 0) + ret = write_packetized_from_buf_no_flush(buf, size, fd); + else + ret = 0; + if (ret) + return ret; + return packet_flush_gently(fd); +} + +static int parse_hunk_line(const char *line, struct xdl_hunk *hunk) +{ + char *end; + + /* Format: "hunk <old_start> <old_count> <new_start> <new_count>" */ + if (!skip_prefix(line, "hunk ", &line)) + return -1; + + hunk->old_start = strtol(line, &end, 10); + if (end == line || *end != ' ') + return -1; + line = end; + + hunk->old_count = strtol(line, &end, 10); + if (end == line || *end != ' ') + return -1; + line = end; + + hunk->new_start = strtol(line, &end, 10); + if (end == line || *end != ' ') + return -1; + line = end; + + hunk->new_count = strtol(line, &end, 10); + if (end == line || *end != '\0') + return -1; + + return 0; +} + +int diff_process_get_hunks(struct userdiff_driver *drv, + const char *path, + const char *old_buf, long old_size, + const char *new_buf, long new_size, + struct xdl_hunk **hunks_out, + size_t *nr_hunks_out) +{ + struct diff_subprocess *backend; + struct child_process *process; + int fd_in, fd_out; + struct strbuf status = STRBUF_INIT; + struct xdl_hunk *hunks = NULL; + struct xdl_hunk hunk; + size_t nr_hunks = 0, alloc_hunks = 0; + int len; + char *line; + + if (!drv || !drv->process) + return -1; + + backend = find_or_start_process(drv->process); + if (!backend) + return -1; + + if (!(backend->supported_capabilities & CAP_HUNKS)) + return -1; + + process = subprocess_get_child_process(&backend->subprocess); + fd_in = process->in; + fd_out = process->out; + + /* Send request */ + if (packet_write_fmt_gently(fd_in, "command=hunks\n") || + packet_write_fmt_gently(fd_in, "pathname=%s\n", path) || + packet_flush_gently(fd_in)) + goto error; + + /* Send old file content */ + if (send_file_content(fd_in, old_buf, old_size)) + goto error; + + /* Send new file content */ + if (send_file_content(fd_in, new_buf, new_size)) + goto error; + + /* Read hunks until flush packet */ + while ((len = packet_read_line_gently(fd_out, NULL, &line)) >= 0 && + line) { + if (parse_hunk_line(line, &hunk) < 0) + goto error; + ALLOC_GROW(hunks, nr_hunks + 1, alloc_hunks); + hunks[nr_hunks++] = hunk; + } + if (len < 0) + goto error; + + /* Read status */ + if (subprocess_read_status(fd_out, &status)) + goto error; + + if (strcmp(status.buf, "success")) { + if (!strcmp(status.buf, "abort")) + backend->supported_capabilities &= ~CAP_HUNKS; + goto error; + } + + *hunks_out = hunks; + *nr_hunks_out = nr_hunks; + strbuf_release(&status); + return 0; + +error: + free(hunks); + strbuf_release(&status); + return -1; +} diff --git a/diff-process.h b/diff-process.h new file mode 100644 index 0000000000..4c84951e02 --- /dev/null +++ b/diff-process.h @@ -0,0 +1,28 @@ +#ifndef DIFF_PROCESS_H +#define DIFF_PROCESS_H + +struct userdiff_driver; +struct xdl_hunk; + +/* + * Query a diff process for hunks describing the changes + * between old_buf and new_buf. + * + * The backend is a long-running subprocess configured via + * diff.<driver>.process. It receives file content via + * pkt-line and returns hunks with 1-based line numbers. + * + * On success, sets *hunks_out and *nr_hunks_out to a newly allocated + * array (caller must free) and returns 0. + * + * On failure, returns -1. The caller should fall back to the + * builtin diff algorithm. + */ +int diff_process_get_hunks(struct userdiff_driver *drv, + const char *path, + const char *old_buf, long old_size, + const char *new_buf, long new_size, + struct xdl_hunk **hunks_out, + size_t *nr_hunks_out); + +#endif /* DIFF_PROCESS_H */ diff --git a/diff.c b/diff.c index 397e38b41c..c5e7c329b2 100644 --- a/diff.c +++ b/diff.c @@ -25,7 +25,9 @@ #include "utf8.h" #include "odb.h" #include "userdiff.h" +#include "diff-process.h" #include "submodule.h" +#include "trace2.h" #include "hashmap.h" #include "mem-pool.h" #include "merge-ll.h" @@ -3991,6 +3993,7 @@ static void builtin_diff(const char *name_a, xpparam_t xpp; xdemitconf_t xecfg; struct emit_callback ecbdata; + struct xdl_hunk *ext_hunks = NULL; unsigned ws_rule; const struct userdiff_funcname *pe; @@ -4031,6 +4034,27 @@ static void builtin_diff(const char *name_a, xpp.ignore_regex_nr = o->ignore_regex_nr; xpp.anchors = o->anchors; xpp.anchors_nr = o->anchors_nr; + + if (!o->ignore_driver_algorithm && + one->driver && one->driver->process) { + size_t ext_hunks_nr = 0; + if (!diff_process_get_hunks( + one->driver, name_a, + mf1.ptr, mf1.size, + mf2.ptr, mf2.size, + &ext_hunks, &ext_hunks_nr)) { + if (!ext_hunks_nr) + goto free_ab_and_return; + xpp.external_hunks = ext_hunks; + xpp.external_hunks_nr = ext_hunks_nr; + } else { + trace2_data_string("diff", + o->repo, + "diff-process-fallback", + name_a); + } + } + xecfg.ctxlen = o->context; xecfg.interhunkctxlen = o->interhunkcontext; xecfg.flags = XDL_EMIT_FUNCNAMES; @@ -4111,6 +4135,7 @@ static void builtin_diff(const char *name_a, } else if (xdi_diff_outf(&mf1, &mf2, NULL, fn_out_consume, &ecbdata, &xpp, &xecfg)) die("unable to generate diff for %s", one->path); + free(ext_hunks); if (o->word_diff) free_diff_words_data(&ecbdata); if (textconv_one) diff --git a/t/t4080-diff-process.sh b/t/t4080-diff-process.sh new file mode 100755 index 0000000000..6f49f4e66b --- /dev/null +++ b/t/t4080-diff-process.sh @@ -0,0 +1,338 @@ +#!/bin/sh + +test_description='diff process via long-running process' + +. ./test-lib.sh + +if test_have_prereq PYTHON +then + PYTHON_PATH=$(command -v python3) || PYTHON_PATH=$(command -v python) +fi + +# +# A single parametric diff process. +# Usage: diff-process-backend --mode=<mode> [--log=<path>] +# +# Modes: +# whole-file - report all lines as changed (default) +# fixed-hunk - always report hunk 5 2 5 2 +# bad-hunk - report out-of-bounds hunk 999 1 999 1 +# zero-hunk - return zero hunks (files considered equivalent) +# error - return status=error for every request +# abort - return status=abort for every request +# crash - read one request then exit without responding +# +setup_backend () { + cat >"$TRASH_DIRECTORY/diff-process-backend.py" <<-\PYEOF + import sys, os + + def read_pkt(): + hdr = sys.stdin.buffer.read(4) + if len(hdr) < 4: return None + length = int(hdr, 16) + if length == 0: return "" + data = sys.stdin.buffer.read(length - 4) + return data.decode().rstrip("\n") + + def write_pkt(line): + data = (line + "\n").encode() + sys.stdout.buffer.write(f"{len(data)+4:04x}".encode() + data) + sys.stdout.buffer.flush() + + def write_flush(): + sys.stdout.buffer.write(b"0000") + sys.stdout.buffer.flush() + + def read_content(): + chunks = [] + while True: + hdr = sys.stdin.buffer.read(4) + if len(hdr) < 4: break + length = int(hdr, 16) + if length == 0: break + chunks.append(sys.stdin.buffer.read(length - 4)) + return b"".join(chunks) + + mode = "whole-file" + logfile = None + for arg in sys.argv[1:]: + if arg.startswith("--mode="): + mode = arg[7:] + elif arg.startswith("--log="): + logfile = open(arg[6:], "a") + + def log(msg): + if logfile: + logfile.write(msg + "\n") + logfile.flush() + + # Handshake + assert read_pkt() == "git-diff-client" + assert read_pkt() == "version=1" + read_pkt() + write_pkt("git-diff-server") + write_pkt("version=1") + write_flush() + while True: + p = read_pkt() + if p == "": break + write_pkt("capability=hunks") + write_flush() + + log("ready") + + while True: + cmd = None + pathname = None + while True: + p = read_pkt() + if p is None: sys.exit(0) + if p == "": break + if p.startswith("command="): cmd = p.split("=",1)[1] + if p.startswith("pathname="): pathname = p.split("=",1)[1] + if cmd is None: sys.exit(0) + old = read_content() + new = read_content() + log(f"command={cmd} pathname={pathname}") + + if mode == "error": + write_flush() + write_pkt("status=error") + write_flush() + continue + + if mode == "abort": + write_flush() + write_pkt("status=abort") + write_flush() + continue + + if mode == "crash": + sys.exit(1) + + if cmd == "hunks": + if mode == "fixed-hunk": + write_pkt("hunk 5 2 5 2") + elif mode == "bad-hunk": + write_pkt("hunk 999 1 999 1") + elif mode == "zero-hunk": + pass + else: + ol = len(old.split(b"\n")) + nl = len(new.split(b"\n")) + write_pkt(f"hunk 1 {ol} 1 {nl}") + write_flush() + write_pkt("status=success") + write_flush() + else: + write_flush() + write_pkt("status=error") + write_flush() + PYEOF + write_script diff-process-backend <<-SHEOF + exec "$PYTHON_PATH" "$TRASH_DIRECTORY/diff-process-backend.py" "\$@" + SHEOF +} + +BACKEND="./diff-process-backend" + +test_expect_success PYTHON 'setup' ' + setup_backend && + echo "*.c diff=cdiff" >.gitattributes && + git add .gitattributes && + git commit -m "initial" +' + +test_expect_success PYTHON 'diff process hunk boundaries affect output' ' + cat >boundary.c <<-\EOF && + line1 + line2 + line3 + line4 + OLD5 + OLD6 + line7 + line8 + OLD9 + OLD10 + EOF + git add boundary.c && + git commit -m "add boundary.c" && + + cat >boundary.c <<-\EOF && + line1 + line2 + line3 + line4 + NEW5 + NEW6 + line7 + line8 + NEW9 + NEW10 + EOF + + # The file has changes at lines 5-6 and 9-10, but fixed-hunk + # only reports lines 5-6 as changed. Lines 9-10 should not + # appear as changed in the output. + git -c diff.cdiff.process="$BACKEND --mode=fixed-hunk" \ + diff boundary.c >actual && + grep "^-OLD5" actual && + grep "^-OLD6" actual && + grep "^+NEW5" actual && + grep "^+NEW6" actual && + ! grep "^-OLD9" actual && + ! grep "^-OLD10" actual && + ! grep "^+NEW9" actual && + ! grep "^+NEW10" actual +' + +test_expect_success PYTHON 'diff process fallback on tool error status' ' + rm -f backend.log && + git -c diff.cdiff.process="$BACKEND --mode=error --log=backend.log" \ + diff boundary.c >actual && + # Fallback produces the full builtin diff (both change regions). + grep "^-OLD5" actual && + grep "^+NEW5" actual && + grep "^-OLD9" actual && + grep "^+NEW9" actual && + # Tool was contacted (it replied with error, not crash). + grep "command=hunks pathname=boundary.c" backend.log +' + +test_expect_success PYTHON 'diff process fallback on bad hunks' ' + git -c diff.cdiff.process="$BACKEND --mode=bad-hunk" \ + diff boundary.c >actual && + grep "^-OLD5" actual && + grep "^+NEW5" actual && + grep "^-OLD9" actual && + grep "^+NEW9" actual +' + +test_expect_success PYTHON 'diff process fallback on tool crash' ' + git -c diff.cdiff.process="$BACKEND --mode=crash" \ + diff boundary.c >actual && + grep "^-OLD5" actual && + grep "^+NEW5" actual && + grep "^-OLD9" actual && + grep "^+NEW9" actual +' + +test_expect_success PYTHON 'diff process abort disables for session' ' + cat >abort1.c <<-\EOF && + int first(void) { return 1; } + EOF + cat >abort2.c <<-\EOF && + int second(void) { return 2; } + EOF + git add abort1.c abort2.c && + git commit -m "add abort files" && + + cat >abort1.c <<-\EOF && + int first(void) { return 10; } + EOF + cat >abort2.c <<-\EOF && + int second(void) { return 20; } + EOF + + rm -f backend.log && + git -c diff.cdiff.process="$BACKEND --mode=abort --log=backend.log" \ + diff -- abort1.c abort2.c >actual && + # Both files should still produce diff output via fallback. + grep "return 10" actual && + grep "return 20" actual && + # The tool aborts on the first file and git clears its + # capability. The second file never contacts the tool, + # so the log should have exactly one entry, not two. + grep "command=hunks" backend.log >matches && + test_line_count = 1 matches +' + +test_expect_success PYTHON 'diff process handles multiple files' ' + cat >multi1.c <<-\EOF && + int one(void) { return 1; } + EOF + cat >multi2.c <<-\EOF && + int two(void) { return 2; } + EOF + git add multi1.c multi2.c && + git commit -m "add multi files" && + + cat >multi1.c <<-\EOF && + int one(void) { return 10; } + EOF + cat >multi2.c <<-\EOF && + int two(void) { return 20; } + EOF + + rm -f backend.log && + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ + diff -- multi1.c multi2.c >actual && + grep "return 10" actual && + grep "return 20" actual && + grep "pathname=multi1.c" backend.log && + grep "pathname=multi2.c" backend.log +' + +test_expect_success PYTHON 'diff process with --word-diff' ' + cat >worddiff.c <<-\EOF && + int value(void) { return 1; } + EOF + git add worddiff.c && + git commit -m "add worddiff.c" && + + cat >worddiff.c <<-\EOF && + int value(void) { return 999; } + EOF + + git -c diff.cdiff.process="$BACKEND" \ + diff --word-diff worddiff.c >actual && + grep "\[-1;-\]" actual && + grep "{+999;+}" actual +' + +test_expect_success PYTHON 'diff process bypassed by --diff-algorithm' ' + rm -f backend.log && + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ + diff --diff-algorithm=patience worddiff.c >actual && + grep "return 999" actual && + test_path_is_missing backend.log +' + +test_expect_success PYTHON 'diff process works with git log -p' ' + cat >logtest.c <<-\EOF && + int logfunc(void) { return 1; } + EOF + git add logtest.c && + git commit -m "add logtest.c" && + + cat >logtest.c <<-\EOF && + int logfunc(void) { return 2; } + EOF + git add logtest.c && + git commit -m "change logtest.c" && + + rm -f backend.log && + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ + log -1 -p -- logtest.c >actual && + grep "return 2" actual && + grep "command=hunks pathname=logtest.c" backend.log +' + +test_expect_success PYTHON 'diff process zero hunks suppresses diff output' ' + cat >zerohunk.c <<-\EOF && + int zero(void) { return 0; } + EOF + git add zerohunk.c && + git commit -m "add zerohunk.c" && + + cat >zerohunk.c <<-\EOF && + int zero(void) { return 999; } + EOF + + git -c diff.cdiff.process="$BACKEND --mode=zero-hunk" \ + diff zerohunk.c >actual && + test_must_be_empty actual +' + +test_done -- gitgitgadget ^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH 4/5] blame: consult diff process for zero-hunk detection 2026-05-22 2:11 [PATCH 0/5] [RFC] diff: add diff.<driver>.process for external hunk providers Michael Montalbo via GitGitGadget ` (2 preceding siblings ...) 2026-05-22 2:11 ` [PATCH 3/5] diff: add long-running diff process via diff.<driver>.process Michael Montalbo via GitGitGadget @ 2026-05-22 2:11 ` Michael Montalbo via GitGitGadget 2026-05-22 2:11 ` [PATCH 5/5] diff-process-normalize: add built-in whitespace normalizer Michael Montalbo via GitGitGadget ` (2 subsequent siblings) 6 siblings, 0 replies; 30+ messages in thread From: Michael Montalbo via GitGitGadget @ 2026-05-22 2:11 UTC (permalink / raw) To: git; +Cc: Michael Montalbo, Michael Montalbo From: Michael Montalbo <mmontalbo@gmail.com> When a diff process is configured via diff.<driver>.process, consult it during blame's per-commit diffing. If the process returns zero hunks for a commit's changes to a file, treat the commit as having no changes, causing blame to attribute lines to earlier commits. The subprocess is long-running (one startup cost amortized across the blame traversal), but each commit in the file's history incurs a round-trip to the tool. Signed-off-by: Michael Montalbo <mmontalbo@gmail.com> --- Documentation/gitattributes.adoc | 3 +++ blame.c | 43 +++++++++++++++++++++++++++++--- t/t4080-diff-process.sh | 32 ++++++++++++++++++++++++ 3 files changed, 74 insertions(+), 4 deletions(-) diff --git a/Documentation/gitattributes.adoc b/Documentation/gitattributes.adoc index cc724f8c63..7d66fa3aa1 100644 --- a/Documentation/gitattributes.adoc +++ b/Documentation/gitattributes.adoc @@ -857,6 +857,9 @@ The tool responds with lines of the form If the tool returns zero hunks with `status=success`, Git treats the file as having no changes and produces no diff output. +`git blame` also consults the diff process and skips commits +where it reports zero hunks, attributing lines to earlier commits +instead. Tools should ignore unknown keys in the per-file request to remain forward-compatible. diff --git a/blame.c b/blame.c index a3c49d132e..8a5f14db7a 100644 --- a/blame.c +++ b/blame.c @@ -19,6 +19,8 @@ #include "tag.h" #include "trace2.h" #include "blame.h" +#include "diff-process.h" +#include "userdiff.h" #include "alloc.h" #include "commit-slab.h" #include "bloom.h" @@ -315,16 +317,47 @@ static struct commit *fake_working_tree_commit(struct repository *r, static int diff_hunks(mmfile_t *file_a, mmfile_t *file_b, - xdl_emit_hunk_consume_func_t hunk_func, void *cb_data, int xdl_opts) + xdl_emit_hunk_consume_func_t hunk_func, void *cb_data, + int xdl_opts, struct index_state *istate, + const char *path) { xpparam_t xpp = {0}; xdemitconf_t xecfg = {0}; xdemitcb_t ecb = {NULL}; + struct xdl_hunk *ext_hunks = NULL; + int ret; xpp.flags = xdl_opts; xecfg.hunk_func = hunk_func; ecb.priv = cb_data; - return xdi_diff(file_a, file_b, &xpp, &xecfg, &ecb); + + if (path && istate) { + struct userdiff_driver *drv; + drv = userdiff_find_by_path(istate, path); + if (drv && drv->process) { + size_t nr = 0; + if (!diff_process_get_hunks(drv, path, + file_a->ptr, file_a->size, + file_b->ptr, file_b->size, + &ext_hunks, &nr)) { + if (!nr) { + /* + * Zero hunks: the diff process + * considers these files equivalent. + * Skip so blame looks past this + * commit. + */ + return 0; + } + xpp.external_hunks = ext_hunks; + xpp.external_hunks_nr = nr; + } + } + } + + ret = xdi_diff(file_a, file_b, &xpp, &xecfg, &ecb); + free(ext_hunks); + return ret; } static const char *get_next_line(const char *start, const char *end) @@ -1961,7 +1994,8 @@ static void pass_blame_to_parent(struct blame_scoreboard *sb, &sb->num_read_blob, ignore_diffs); sb->num_get_patch++; - if (diff_hunks(&file_p, &file_o, blame_chunk_cb, &d, sb->xdl_opts)) + if (diff_hunks(&file_p, &file_o, blame_chunk_cb, &d, sb->xdl_opts, + sb->revs->diffopt.repo->index, target->path)) die("unable to generate diff (%s -> %s)", oid_to_hex(&parent->commit->object.oid), oid_to_hex(&target->commit->object.oid)); @@ -2114,7 +2148,8 @@ static void find_copy_in_blob(struct blame_scoreboard *sb, * file_p partially may match that image. */ memset(split, 0, sizeof(struct blame_entry [3])); - if (diff_hunks(file_p, &file_o, handle_split_cb, &d, sb->xdl_opts)) + if (diff_hunks(file_p, &file_o, handle_split_cb, &d, sb->xdl_opts, + NULL, NULL)) die("unable to generate diff (%s)", oid_to_hex(&parent->commit->object.oid)); /* remainder, if any, all match the preimage */ diff --git a/t/t4080-diff-process.sh b/t/t4080-diff-process.sh index 6f49f4e66b..5ed644b786 100755 --- a/t/t4080-diff-process.sh +++ b/t/t4080-diff-process.sh @@ -335,4 +335,36 @@ test_expect_success PYTHON 'diff process zero hunks suppresses diff output' ' test_must_be_empty actual ' +test_expect_success PYTHON 'blame skips commits with zero hunks from diff process' ' + cat >blame.c <<-\EOF && + int main(void) + { + return 0; + } + EOF + git add blame.c && + git commit -m "add blame.c" && + + cat >blame.c <<-\EOF && + int main(void) + { + return 0; + } + EOF + git add blame.c && + git commit -m "reformat blame.c" && + BLAME_COMMIT=$(git rev-parse --short HEAD) && + + # Without zero-hunk mode, blame attributes the change. + git blame blame.c >without && + grep "$BLAME_COMMIT" without && + + # With zero-hunk mode, the process considers the files equivalent + # and blame skips the reformat commit. + git -c diff.cdiff.process="$BACKEND --mode=zero-hunk" \ + blame blame.c >with && + ! grep "$BLAME_COMMIT" with +' + + test_done -- gitgitgadget ^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH 5/5] diff-process-normalize: add built-in whitespace normalizer 2026-05-22 2:11 [PATCH 0/5] [RFC] diff: add diff.<driver>.process for external hunk providers Michael Montalbo via GitGitGadget ` (3 preceding siblings ...) 2026-05-22 2:11 ` [PATCH 4/5] blame: consult diff process for zero-hunk detection Michael Montalbo via GitGitGadget @ 2026-05-22 2:11 ` Michael Montalbo via GitGitGadget 2026-05-22 5:29 ` [PATCH 0/5] [RFC] diff: add diff.<driver>.process for external hunk providers Junio C Hamano 2026-05-25 18:29 ` [PATCH v2 0/4] " Michael Montalbo via GitGitGadget 6 siblings, 0 replies; 30+ messages in thread From: Michael Montalbo via GitGitGadget @ 2026-05-22 2:11 UTC (permalink / raw) To: git; +Cc: Michael Montalbo, Michael Montalbo From: Michael Montalbo <mmontalbo@gmail.com> Add git diff-process-normalize, a built-in diff process that detects whitespace-only changes. It compares files line by line using xdiff_compare_lines() with XDF_IGNORE_WHITESPACE (same logic as "git diff -w"). If all lines match, it returns zero hunks; otherwise it returns an error so git falls back to the builtin diff algorithm. [diff "cdiff"] process = git diff-process-normalize Update documentation to describe zero-hunk behavior for diff and blame, and document the built-in normalize tool. Signed-off-by: Michael Montalbo <mmontalbo@gmail.com> --- Documentation/config/diff.adoc | 2 + Documentation/gitattributes.adoc | 15 ++++ Makefile | 1 + builtin.h | 1 + builtin/diff-process-normalize.c | 143 +++++++++++++++++++++++++++++++ git.c | 1 + t/t4080-diff-process.sh | 60 +++++++++++++ 7 files changed, 223 insertions(+) create mode 100644 builtin/diff-process-normalize.c diff --git a/Documentation/config/diff.adoc b/Documentation/config/diff.adoc index 4ab5f60df6..475736c6ed 100644 --- a/Documentation/config/diff.adoc +++ b/Documentation/config/diff.adoc @@ -224,6 +224,8 @@ endif::git-diff[] hunks that are fed into Git's diff and blame pipelines. If the tool returns zero hunks, the file is treated as unchanged for both diff output and blame attribution. + Git provides `git diff-process-normalize` as a built-in + tool that detects whitespace-only changes. See linkgit:gitattributes[5] for details. `diff.indentHeuristic`:: diff --git a/Documentation/gitattributes.adoc b/Documentation/gitattributes.adoc index 7d66fa3aa1..3f1d7affd8 100644 --- a/Documentation/gitattributes.adoc +++ b/Documentation/gitattributes.adoc @@ -861,6 +861,21 @@ the file as having no changes and produces no diff output. where it reports zero hunks, attributing lines to earlier commits instead. +Git ships with a built-in diff process, `git diff-process-normalize`, +that detects whitespace-only changes. Files whose only differences +are whitespace produce zero hunks; files with non-whitespace changes +fall back to the builtin diff algorithm. To use it: + +---------------------------------------------------------------- +[diff "cdiff"] + process = git diff-process-normalize +---------------------------------------------------------------- + +This is useful after running a code formatter: `git diff` shows +no output for files that only had whitespace changes, +`git blame` skips whitespace-only commits automatically without +requiring a `.git-blame-ignore-revs` file. + Tools should ignore unknown keys in the per-file request to remain forward-compatible. diff --git a/Makefile b/Makefile index 22900368dd..01acfaf7b8 100644 --- a/Makefile +++ b/Makefile @@ -1409,6 +1409,7 @@ BUILTIN_OBJS += builtin/diagnose.o BUILTIN_OBJS += builtin/diff-files.o BUILTIN_OBJS += builtin/diff-index.o BUILTIN_OBJS += builtin/diff-pairs.o +BUILTIN_OBJS += builtin/diff-process-normalize.o BUILTIN_OBJS += builtin/diff-tree.o BUILTIN_OBJS += builtin/diff.o BUILTIN_OBJS += builtin/difftool.o diff --git a/builtin.h b/builtin.h index 235c51f30e..c713a0417f 100644 --- a/builtin.h +++ b/builtin.h @@ -178,6 +178,7 @@ int cmd_diff_files(int argc, const char **argv, const char *prefix, struct repos int cmd_diff_index(int argc, const char **argv, const char *prefix, struct repository *repo); int cmd_diff(int argc, const char **argv, const char *prefix, struct repository *repo); int cmd_diff_pairs(int argc, const char **argv, const char *prefix, struct repository *repo); +int cmd_diff_process_normalize(int argc, const char **argv, const char *prefix, struct repository *repo); int cmd_diff_tree(int argc, const char **argv, const char *prefix, struct repository *repo); int cmd_difftool(int argc, const char **argv, const char *prefix, struct repository *repo); int cmd_env__helper(int argc, const char **argv, const char *prefix, struct repository *repo); diff --git a/builtin/diff-process-normalize.c b/builtin/diff-process-normalize.c new file mode 100644 index 0000000000..1580f6b7d9 --- /dev/null +++ b/builtin/diff-process-normalize.c @@ -0,0 +1,143 @@ +/* + * Built-in diff process that returns zero hunks for files whose + * only differences are whitespace, and status=error otherwise. + * See diff-process.c for the protocol and gitattributes(5) for usage. + * + * Uses xdiff_compare_lines() with XDF_IGNORE_WHITESPACE to compare + * lines, giving the same whitespace handling as "git diff -w". + */ + +#include "builtin.h" +#include "pkt-line.h" +#include "strbuf.h" +#include "xdiff-interface.h" + +/* + * Read a single pkt-line. Returns 1 for data, 0 for flush, -1 for EOF. + */ +static int read_pkt(int fd, struct strbuf *line) +{ + int len; + char *data; + + if (packet_read_line_gently(fd, &len, &data) < 0) + return -1; + if (!data || !len) + return 0; /* flush */ + strbuf_reset(line); + strbuf_add(line, data, len); + strbuf_rtrim(line); + return 1; +} + +/* + * Read packetized content until a flush packet. + */ +static int read_content(int fd, struct strbuf *out) +{ + strbuf_reset(out); + if (read_packetized_to_strbuf(fd, out, PACKET_READ_GENTLE_ON_EOF) < 0) + return -1; + return 0; +} + +/* + * Compare two buffers line by line using xdiff_compare_lines() with + * XDF_IGNORE_WHITESPACE (same logic as "git diff -w"). + * Returns 1 if all lines match, 0 otherwise. + */ +static int whitespace_equivalent(const char *a, long size_a, + const char *b, long size_b) +{ + const char *ea = a + size_a; + const char *eb = b + size_b; + + while (a < ea && b < eb) { + const char *eol_a = memchr(a, '\n', ea - a); + const char *eol_b = memchr(b, '\n', eb - b); + long len_a = (eol_a ? eol_a : ea) - a; + long len_b = (eol_b ? eol_b : eb) - b; + + if (!xdiff_compare_lines(a, len_a, b, len_b, + XDF_IGNORE_WHITESPACE)) + return 0; + + a += len_a + (eol_a ? 1 : 0); + b += len_b + (eol_b ? 1 : 0); + } + + /* Both sides must be exhausted */ + return a >= ea && b >= eb; +} + +int cmd_diff_process_normalize(int argc UNUSED, const char **argv UNUSED, + const char *prefix UNUSED, + struct repository *repo UNUSED) +{ + struct strbuf line = STRBUF_INIT; + struct strbuf old_content = STRBUF_INIT; + struct strbuf new_content = STRBUF_INIT; + int ret; + + /* Handshake: read client greeting */ + ret = read_pkt(0, &line); + if (ret <= 0 || strcmp(line.buf, "git-diff-client")) + return 1; + ret = read_pkt(0, &line); + if (ret <= 0 || strcmp(line.buf, "version=1")) + return 1; + read_pkt(0, &line); /* flush */ + + /* Send server greeting */ + packet_write_fmt(1, "git-diff-server\n"); + packet_write_fmt(1, "version=1\n"); + packet_flush(1); + + /* Read client capabilities until flush */ + while ((ret = read_pkt(0, &line)) > 0) + ; /* consume */ + + /* Send our capabilities */ + packet_write_fmt(1, "capability=hunks\n"); + packet_flush(1); + + /* Main loop: process file pairs */ + for (;;) { + int have_command = 0; + + /* Read request headers until flush */ + while ((ret = read_pkt(0, &line)) > 0) { + if (starts_with(line.buf, "command=")) + have_command = 1; + } + if (ret < 0) + break; /* EOF: client closed connection */ + if (!have_command) + break; + + /* Read old file content */ + if (read_content(0, &old_content) < 0) + break; + /* Read new file content */ + if (read_content(0, &new_content) < 0) + break; + + if (whitespace_equivalent(old_content.buf, old_content.len, + new_content.buf, new_content.len)) { + /* Whitespace-only differences */ + packet_flush(1); /* zero hunks */ + packet_write_fmt(1, "status=success\n"); + packet_flush(1); + } else { + /* Non-whitespace differences: fall back */ + packet_flush(1); + packet_write_fmt(1, "status=error\n"); + packet_flush(1); + } + } + + strbuf_release(&line); + strbuf_release(&old_content); + strbuf_release(&new_content); + return 0; +} diff --git a/git.c b/git.c index 5a40eab8a2..6239240b02 100644 --- a/git.c +++ b/git.c @@ -568,6 +568,7 @@ static struct cmd_struct commands[] = { { "diff-files", cmd_diff_files, RUN_SETUP | NEED_WORK_TREE | NO_PARSEOPT }, { "diff-index", cmd_diff_index, RUN_SETUP | NO_PARSEOPT }, { "diff-pairs", cmd_diff_pairs, RUN_SETUP | NO_PARSEOPT }, + { "diff-process-normalize", cmd_diff_process_normalize, NO_PARSEOPT }, { "diff-tree", cmd_diff_tree, RUN_SETUP | NO_PARSEOPT }, { "difftool", cmd_difftool, RUN_SETUP_GENTLY }, { "fast-export", cmd_fast_export, RUN_SETUP }, diff --git a/t/t4080-diff-process.sh b/t/t4080-diff-process.sh index 5ed644b786..a6fa1df456 100755 --- a/t/t4080-diff-process.sh +++ b/t/t4080-diff-process.sh @@ -366,5 +366,65 @@ test_expect_success PYTHON 'blame skips commits with zero hunks from diff proces ! grep "$BLAME_COMMIT" with ' +NORMALIZE="git diff-process-normalize" + +test_expect_success 'diff-process-normalize setup' ' + echo "*.c diff=cdiff" >.gitattributes && + git add .gitattributes && + test_commit normalize-base +' + +test_expect_success 'diff-process-normalize suppresses whitespace-only changes' ' + cat >ws.c <<-\EOF && + int main(void) + { + return 0; + } + EOF + git add ws.c && + git commit -m "add ws.c" && + + cat >ws.c <<-\EOF && + int main(void) + { + return 0; + } + EOF + + git -c diff.cdiff.process="$NORMALIZE" \ + diff ws.c >actual && + test_must_be_empty actual +' + +test_expect_success 'diff-process-normalize falls back on non-whitespace changes' ' + cat >ws.c <<-\EOF && + int main(void) + { + return 0; + } + + int added_function(void) + { + return 99; + } + EOF + + git -c diff.cdiff.process="$NORMALIZE" \ + diff ws.c >actual && + grep "added_function" actual +' + +test_expect_success 'diff-process-normalize falls back on mixed whitespace and real changes' ' + cat >ws.c <<-\EOF && + int main(void) + { + return 42; + } + EOF + + git -c diff.cdiff.process="$NORMALIZE" \ + diff ws.c >actual && + grep "return 42" actual +' test_done -- gitgitgadget ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH 0/5] [RFC] diff: add diff.<driver>.process for external hunk providers 2026-05-22 2:11 [PATCH 0/5] [RFC] diff: add diff.<driver>.process for external hunk providers Michael Montalbo via GitGitGadget ` (4 preceding siblings ...) 2026-05-22 2:11 ` [PATCH 5/5] diff-process-normalize: add built-in whitespace normalizer Michael Montalbo via GitGitGadget @ 2026-05-22 5:29 ` Junio C Hamano 2026-05-22 17:19 ` Michael Montalbo 2026-05-25 18:29 ` [PATCH v2 0/4] " Michael Montalbo via GitGitGadget 6 siblings, 1 reply; 30+ messages in thread From: Junio C Hamano @ 2026-05-22 5:29 UTC (permalink / raw) To: Michael Montalbo via GitGitGadget; +Cc: git, Michael Montalbo "Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes: > This series adds diff.<driver>.process, a long-running subprocess protocol > that lets external tools provide hunks to git's diff and blame pipelines. > > Over the past 18 years, git's diff pipeline accumulated many features that > operate on hunks: word diff, function context, color-moved, indent > heuristic, blame. External tools can replace the pipeline entirely > (diff.<driver>.command) or select among builtin algorithms > (diff.<driver>.algorithm), but there is no way for a tool to provide > line-change information into the pipeline. Tools that understand code > structure (tree-sitter parsers, format-aware analyzers, tools like > Difftastic and Mergiraf) must bypass git's pipeline and lose access to > everything downstream. > > The protocol follows filter.<driver>.process: pkt-line over stdin/stdout, > capability negotiation, one tool invocation per git command. The tool > receives file pairs and returns hunk descriptors that git feeds into the > standard xdiff pipeline. All output features work normally. > > Zero hunks with status=success means the tool considers the files > equivalent. git diff shows no output for the file, and git blame skips the > commit, attributing lines to earlier commits. > > On error or tool crash, git falls back silently to the builtin diff > algorithm. The feature is opt-in via diff.<driver>.process and > .gitattributes; unconfigured files are unaffected. > > The series includes git diff-process-normalize, a built-in tool that > compares files line by line ignoring whitespace (same logic as "git diff -w" > via xdiff_compare_lines): Interesting. If the goal is purely to normalize content before comparison (e.g. stripping comments or canonicalizing formatting), we already have the `textconv` mechanism. While `textconv` is a "one-shot" per-file process, it is significantly simpler. I suspect, however, that the primary focus here is to allow external tools to provide structural alignment (e.g. for AST- aware diffs like Difftastic or Mergiraf) without losing the original content in the display. Unlike `textconv`, which transforms the text the user sees, this protocol lets the display remain identical to the source while using a custom engine for the line-matching logic. If that is the intent, it should be stated more explicitly in the documentation and commit messages. The "whitespace-normalize" demonstration in [PATCH 5/5] is misleading because it's exactly the case where `textconv` would be sufficient. I am afraid that the use of a long-running subprocess for every diff/blame invocation adds significant complexity and overhead. In particular, wouldn't the `blame` implementation performs a round-trip to the subprocess for every commit in the history? Even with a persistent process, the overhead of serializing and deserializing the entire file content twice (old and new) for every commit could be prohibitive for large files or deep histories. So, I dunno. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/5] [RFC] diff: add diff.<driver>.process for external hunk providers 2026-05-22 5:29 ` [PATCH 0/5] [RFC] diff: add diff.<driver>.process for external hunk providers Junio C Hamano @ 2026-05-22 17:19 ` Michael Montalbo 0 siblings, 0 replies; 30+ messages in thread From: Michael Montalbo @ 2026-05-22 17:19 UTC (permalink / raw) To: Junio C Hamano; +Cc: Michael Montalbo via GitGitGadget, git On Thu, May 21, 2026 at 10:29 PM Junio C Hamano <gitster@pobox.com> wrote: > > "Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes: > > > This series adds diff.<driver>.process, a long-running subprocess protocol > > that lets external tools provide hunks to git's diff and blame pipelines. > > > > Over the past 18 years, git's diff pipeline accumulated many features that > > operate on hunks: word diff, function context, color-moved, indent > > heuristic, blame. External tools can replace the pipeline entirely > > (diff.<driver>.command) or select among builtin algorithms > > (diff.<driver>.algorithm), but there is no way for a tool to provide > > line-change information into the pipeline. Tools that understand code > > structure (tree-sitter parsers, format-aware analyzers, tools like > > Difftastic and Mergiraf) must bypass git's pipeline and lose access to > > everything downstream. > > > > The protocol follows filter.<driver>.process: pkt-line over stdin/stdout, > > capability negotiation, one tool invocation per git command. The tool > > receives file pairs and returns hunk descriptors that git feeds into the > > standard xdiff pipeline. All output features work normally. > > > > Zero hunks with status=success means the tool considers the files > > equivalent. git diff shows no output for the file, and git blame skips the > > commit, attributing lines to earlier commits. > > > > On error or tool crash, git falls back silently to the builtin diff > > algorithm. The feature is opt-in via diff.<driver>.process and > > .gitattributes; unconfigured files are unaffected. > > > > The series includes git diff-process-normalize, a built-in tool that > > compares files line by line ignoring whitespace (same logic as "git diff -w" > > via xdiff_compare_lines): > > Interesting. > > If the goal is purely to normalize content before comparison > (e.g. stripping comments or canonicalizing formatting), we already > have the `textconv` mechanism. While `textconv` is a "one-shot" > per-file process, it is significantly simpler. > > I suspect, however, that the primary focus here is to allow external > tools to provide structural alignment (e.g. for AST- aware diffs > like Difftastic or Mergiraf) without losing the original content in > the display. Unlike `textconv`, which transforms the text the user > sees, this protocol lets the display remain identical to the source > while using a custom engine for the line-matching logic. > > If that is the intent, it should be stated more explicitly in the > documentation and commit messages. The "whitespace-normalize" > demonstration in [PATCH 5/5] is misleading because it's exactly the > case where `textconv` would be sufficient. > Thank you for looking at this. Yes, you have correctly identified the primary focus. My intention with the whitespace normalization example was to provide a kind of "hello world" diff process that would showcase how such a tool could interact with the pipeline further down (i.e., blame vs diff output). However, I do agree that it is a confusing example because it seems to clash with something `textconv` already provides. I will update messaging across the series to make the true intention of these changes more clear. > I am afraid that the use of a long-running subprocess for every > diff/blame invocation adds significant complexity and overhead. In > particular, wouldn't the `blame` implementation performs a > round-trip to the subprocess for every commit in the history? Even > with a persistent process, the overhead of serializing and > deserializing the entire file content twice (old and new) for every > commit could be prohibitive for large files or deep histories. > > So, I dunno. I hear you on this point. Anecdotally, my measurement of running blame on diff.c: Performance: without process: 0.57s with process: 0.67s Blame attribution: without process: 726 unique commits with process: 721 unique commits (5 whitespace-only skipped) Skipped commits: 0ea7d5b6f8 diff.c: fix some recent whitespace style violations 2775d92c53 diff.c: stylefix 4b25d091ba Fix a bunch of pointer declarations (codestyle) a6080a0a44 War on whitespace eeefa7c90e Style fixes, add a space after if/for/while. I was imagining we could potentially optimize performance by extending the protocol to enable passing OIDs as a capability so tools could read objects directly without needing any serialization/deserialization. ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH v2 0/4] [RFC] diff: add diff.<driver>.process for external hunk providers 2026-05-22 2:11 [PATCH 0/5] [RFC] diff: add diff.<driver>.process for external hunk providers Michael Montalbo via GitGitGadget ` (5 preceding siblings ...) 2026-05-22 5:29 ` [PATCH 0/5] [RFC] diff: add diff.<driver>.process for external hunk providers Junio C Hamano @ 2026-05-25 18:29 ` Michael Montalbo via GitGitGadget 2026-05-25 18:29 ` [PATCH v2 1/4] xdiff: support external hunks via xpparam_t Michael Montalbo via GitGitGadget ` (4 more replies) 6 siblings, 5 replies; 30+ messages in thread From: Michael Montalbo via GitGitGadget @ 2026-05-25 18:29 UTC (permalink / raw) To: git; +Cc: Michael Montalbo This series adds diff.<driver>.process, a long-running subprocess protocol that lets external tools provide hunks into git's diff and blame pipelines. diff.<driver>.process opens the door to integrating content-aware logic that don't exist inside git: structural diff tools, format-aware analyzers, and other tools that understand the semantics of the content being tracked. Where diff.<driver>.command replaces the diff pipeline entirely, diff.<driver>.process feeds hunks into it, so all downstream features (word diff, function context, color-moved, stat, blame) work normally. The protocol follows filter.<driver>.process: pkt-line over stdin/stdout, capability negotiation, one tool invocation per git command. The tool receives file pairs and returns hunk descriptors that git feeds into the standard xdiff pipeline. Zero hunks with status=success means the tool considers the files equivalent. git diff shows no output for the file, and git blame skips the commit, attributing lines to earlier commits. On error or tool crash, git falls back silently to the builtin diff algorithm. The feature is opt-in via diff.<driver>.process and .gitattributes; unconfigured files are unaffected. The blame integration calls the diff process for each commit in the file's history. The subprocess is long-running (one startup amortized across the traversal), but per-commit round-trips add latency. A natural follow-up would be a capability that sends blob OIDs instead of content, allowing tools that can read the object store directly to avoid the cost of serializing and deserializing blob content over the pipe for each file pair. Changes since v1: * Dropped the built-in diff-process-normalize tool since it obscured the main use case for the RFC and update series messaging accordingly. * Encapsulated changed[] memory layout behind xdl_clear_changed() helper in xprepare.c. * Restructured the external hunks path in xdl_diff() to fall through to the regular diff algorithm on validation failure instead of duplicating diff logic on fallback then returning. * Changed subprocess error reporting from trace to warning. * Fixed whitespace issues in the test's embedded Python. * Changed instances of grep to test_grep in tests. Michael Montalbo (4): xdiff: support external hunks via xpparam_t userdiff: add diff.<driver>.process config diff: add long-running diff process via diff.<driver>.process blame: consult diff process for zero-hunk detection Documentation/config/diff.adoc | 8 + Documentation/gitattributes.adoc | 43 ++++ Makefile | 1 + blame.c | 43 +++- diff-process.c | 206 +++++++++++++++++ diff-process.h | 28 +++ diff.c | 23 ++ t/.gitattributes | 1 + t/t4080-diff-process.sh | 370 +++++++++++++++++++++++++++++++ userdiff.c | 7 + userdiff.h | 2 + xdiff-interface.c | 7 +- xdiff/xdiff.h | 13 ++ xdiff/xdiffi.c | 84 ++++++- xdiff/xprepare.c | 10 + xdiff/xprepare.h | 1 + 16 files changed, 840 insertions(+), 7 deletions(-) create mode 100644 diff-process.c create mode 100644 diff-process.h create mode 100755 t/t4080-diff-process.sh base-commit: 94f057755b7941b321fd11fec1b2e3ca5313a4e0 Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-2120%2Fmmontalbo%2Fmm%2Fstructural-diff-backend-clean-v2 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-2120/mmontalbo/mm/structural-diff-backend-clean-v2 Pull-Request: https://github.com/gitgitgadget/git/pull/2120 Range-diff vs v1: 1: 8c0ea0bc07 ! 1: f887a7e2ba xdiff: support external hunks via xpparam_t @@ xdiff/xdiffi.c: static void xdl_mark_ignorable_regex(xdchange_t *xscr, const xdf + long j, prev_old_end = 0, prev_new_end = 0; + long total_old = 0, total_new = 0; + -+ /* -+ * Clear changed[] arrays. xdl_prepare_env() may have dirtied -+ * them via xdl_cleanup_records(). The allocation is nrec + 2 -+ * elements; changed points one past the start (see xprepare.c). -+ */ -+ memset(xe->xdf1.changed - 1, 0, -+ (xe->xdf1.nrec + 2) * sizeof(bool)); -+ memset(xe->xdf2.changed - 1, 0, -+ (xe->xdf2.nrec + 2) * sizeof(bool)); ++ xdl_clear_changed(&xe->xdf1); ++ xdl_clear_changed(&xe->xdf2); + + for (i = 0; i < nr_hunks; i++) { + const struct xdl_hunk *h = &hunks[i]; @@ xdiff/xdiffi.c: static void xdl_mark_ignorable_regex(xdchange_t *xscr, const xdf + /* Bounds check (1-based line numbers) */ + if (h->old_count > 0 && + (h->old_start < 1 || -+ h->old_start + h->old_count - 1 > xe->xdf1.nrec)) ++ h->old_start + h->old_count - 1 > (long)xe->xdf1.nrec)) + return -1; + if (h->new_count > 0 && + (h->new_start < 1 || -+ h->new_start + h->new_count - 1 > xe->xdf2.nrec)) ++ h->new_start + h->new_count - 1 > (long)xe->xdf2.nrec)) + return -1; + + /* Zero-count hunks: start must still be in [1, nrec+1] */ + if (h->old_count == 0 && -+ (h->old_start < 1 || h->old_start > xe->xdf1.nrec + 1)) ++ (h->old_start < 1 || h->old_start > (long)xe->xdf1.nrec + 1)) + return -1; + if (h->new_count == 0 && -+ (h->new_start < 1 || h->new_start > xe->xdf2.nrec + 1)) ++ (h->new_start < 1 || h->new_start > (long)xe->xdf2.nrec + 1)) + return -1; + + /* Ordering: no overlap with previous hunk */ @@ xdiff/xdiffi.c: static void xdl_mark_ignorable_regex(xdchange_t *xscr, const xdf emit_func_t ef = xecfg->hunk_func ? xdl_call_hunk_func : xdl_emit_diff; - if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0) { -- -- return -1; + if (xpp->external_hunks) { + if (xdl_prepare_env(mf1, mf2, xpp, &xe) < 0) + return -1; + if (xdl_populate_hunks_from_external(&xe, + xpp->external_hunks, -+ xpp->external_hunks_nr) < 0) { -+ /* -+ * Invalid external hunks; fall back to the -+ * builtin diff algorithm. Re-runs -+ * xdl_prepare_env() via xdl_do_diff(). -+ */ -+ xdl_free_env(&xe); -+ if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0) -+ return -1; -+ } -+ } else { -+ if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0) -+ return -1; - } ++ xpp->external_hunks_nr) == 0) ++ goto diff_done; ++ xdl_free_env(&xe); ++ } + ++ if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0) + return -1; +- } ++ ++diff_done: + if (xdl_change_compact(&xe.xdf1, &xe.xdf2, xpp->flags) < 0 || xdl_change_compact(&xe.xdf2, &xe.xdf1, xpp->flags) < 0 || xdl_build_script(&xe, &xscr) < 0) { + + ## xdiff/xprepare.c ## +@@ xdiff/xprepare.c: int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, + + return 0; + } ++ ++/* ++ * Reset the changed[] array so that no lines are marked as changed. ++ * Also clears the sentinel slots at changed[-1] and changed[nrec] ++ * that xdl_change_compact() relies on during backward scans. ++ */ ++void xdl_clear_changed(xdfile_t *xdf) ++{ ++ memset(xdf->changed - 1, 0, (xdf->nrec + 2) * sizeof(bool)); ++} + + ## xdiff/xprepare.h ## +@@ + int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, + xdfenv_t *xe); + void xdl_free_env(xdfenv_t *xe); ++void xdl_clear_changed(xdfile_t *xdf); + + + 2: 3bc127c800 = 2: de6d85f9d7 userdiff: add diff.<driver>.process config 3: f9976fc6aa ! 3: c25647c6e5 diff: add long-running diff process via diff.<driver>.process @@ Commit message [diff "cdiff"] process = /path/to/diff-tool - The tool receives file pairs and returns hunks describing which - lines changed. Git feeds these hunks into the standard xdiff - pipeline, so all output features (word diff, function context, - color) work normally. + The tool provides custom line-matching: it receives file pairs + and returns hunks that reference original line numbers. Unlike + textconv, which transforms the displayed content, the diff + output shows the actual file while the tool controls which + lines are marked as changed. The handshake negotiates version=1 and capability=hunks. Per-file requests send command=hunks, pathname, and both file contents as packetized data. The tool responds with hunk lines and a status - packet. On error, git falls back to the builtin diff algorithm. + packet. On error, git falls back to the builtin diff algorithm + with a warning. - Zero hunks with status=success means the tool considers the files - equivalent. Git skips diff output for that file entirely. + Zero hunks with status=success means the tool considers the + files equivalent. Git skips diff output for that file. Signed-off-by: Michael Montalbo <mmontalbo@gmail.com> @@ Documentation/gitattributes.adoc: NOTE: If `diff.<name>.command` is defined for +An external tool can provide content-aware line matching by +setting `diff.<name>.process` to the command that runs +the tool. The tool is a long-running process that communicates via -+the pkt-line protocol (see -+linkgit:gitprotocol-long-running-process[5]). ++the pkt-line protocol (described in ++Documentation/technical/long-running-process-protocol.adoc). + +------------------------ +*.c diff=cdiff @@ diff-process.c (new) @@ +/* + * Diff process backend: communicates with a long-running external -+ * tool via the pkt-line protocol to obtain content-aware hunks. ++ * tool via the pkt-line protocol to obtain custom line-matching ++ * results. Unlike textconv, which transforms the displayed content, ++ * hunks from a diff process reference original line numbers and ++ * the display shows the actual file content. + * + * Protocol: pkt-line over stdin/stdout, following the pattern of + * the long-running filter process protocol (see convert.c). @@ diff.c #include "userdiff.h" +#include "diff-process.h" #include "submodule.h" -+#include "trace2.h" #include "hashmap.h" #include "mem-pool.h" - #include "merge-ll.h" @@ diff.c: static void builtin_diff(const char *name_a, xpparam_t xpp; xdemitconf_t xecfg; @@ diff.c: static void builtin_diff(const char *name_a, + xpp.external_hunks = ext_hunks; + xpp.external_hunks_nr = ext_hunks_nr; + } else { -+ trace2_data_string("diff", -+ o->repo, -+ "diff-process-fallback", -+ name_a); ++ warning(_("diff process failed for '%s'," ++ " falling back to builtin diff"), ++ name_a); + } + } + @@ diff.c: static void builtin_diff(const char *name_a, free_diff_words_data(&ecbdata); if (textconv_one) + ## t/.gitattributes ## +@@ t/.gitattributes: t[0-9][0-9][0-9][0-9]/* -whitespace + /t8005/*.txt eol=lf + /t9*/*.dump eol=lf + /t0040*.sh whitespace=-indent-with-non-tab ++/t4080-diff-process.sh whitespace=-indent-with-non-tab + ## t/t4080-diff-process.sh (new) ## @@ +#!/bin/sh @@ t/t4080-diff-process.sh (new) + # appear as changed in the output. + git -c diff.cdiff.process="$BACKEND --mode=fixed-hunk" \ + diff boundary.c >actual && -+ grep "^-OLD5" actual && -+ grep "^-OLD6" actual && -+ grep "^+NEW5" actual && -+ grep "^+NEW6" actual && -+ ! grep "^-OLD9" actual && -+ ! grep "^-OLD10" actual && -+ ! grep "^+NEW9" actual && -+ ! grep "^+NEW10" actual ++ test_grep "^-OLD5" actual && ++ test_grep "^-OLD6" actual && ++ test_grep "^+NEW5" actual && ++ test_grep "^+NEW6" actual && ++ ! test_grep "^-OLD9" actual && ++ ! test_grep "^-OLD10" actual && ++ ! test_grep "^+NEW9" actual && ++ ! test_grep "^+NEW10" actual +' + +test_expect_success PYTHON 'diff process fallback on tool error status' ' @@ t/t4080-diff-process.sh (new) + git -c diff.cdiff.process="$BACKEND --mode=error --log=backend.log" \ + diff boundary.c >actual && + # Fallback produces the full builtin diff (both change regions). -+ grep "^-OLD5" actual && -+ grep "^+NEW5" actual && -+ grep "^-OLD9" actual && -+ grep "^+NEW9" actual && ++ test_grep "^-OLD5" actual && ++ test_grep "^+NEW5" actual && ++ test_grep "^-OLD9" actual && ++ test_grep "^+NEW9" actual && + # Tool was contacted (it replied with error, not crash). -+ grep "command=hunks pathname=boundary.c" backend.log ++ test_grep "command=hunks pathname=boundary.c" backend.log +' + +test_expect_success PYTHON 'diff process fallback on bad hunks' ' + git -c diff.cdiff.process="$BACKEND --mode=bad-hunk" \ + diff boundary.c >actual && -+ grep "^-OLD5" actual && -+ grep "^+NEW5" actual && -+ grep "^-OLD9" actual && -+ grep "^+NEW9" actual ++ test_grep "^-OLD5" actual && ++ test_grep "^+NEW5" actual && ++ test_grep "^-OLD9" actual && ++ test_grep "^+NEW9" actual +' + +test_expect_success PYTHON 'diff process fallback on tool crash' ' + git -c diff.cdiff.process="$BACKEND --mode=crash" \ + diff boundary.c >actual && -+ grep "^-OLD5" actual && -+ grep "^+NEW5" actual && -+ grep "^-OLD9" actual && -+ grep "^+NEW9" actual ++ test_grep "^-OLD5" actual && ++ test_grep "^+NEW5" actual && ++ test_grep "^-OLD9" actual && ++ test_grep "^+NEW9" actual +' + +test_expect_success PYTHON 'diff process abort disables for session' ' @@ t/t4080-diff-process.sh (new) + git -c diff.cdiff.process="$BACKEND --mode=abort --log=backend.log" \ + diff -- abort1.c abort2.c >actual && + # Both files should still produce diff output via fallback. -+ grep "return 10" actual && -+ grep "return 20" actual && ++ test_grep "return 10" actual && ++ test_grep "return 20" actual && + # The tool aborts on the first file and git clears its + # capability. The second file never contacts the tool, + # so the log should have exactly one entry, not two. -+ grep "command=hunks" backend.log >matches && ++ test_grep "command=hunks" backend.log >matches && + test_line_count = 1 matches +' + @@ t/t4080-diff-process.sh (new) + rm -f backend.log && + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ + diff -- multi1.c multi2.c >actual && -+ grep "return 10" actual && -+ grep "return 20" actual && -+ grep "pathname=multi1.c" backend.log && -+ grep "pathname=multi2.c" backend.log ++ test_grep "return 10" actual && ++ test_grep "return 20" actual && ++ test_grep "pathname=multi1.c" backend.log && ++ test_grep "pathname=multi2.c" backend.log +' + +test_expect_success PYTHON 'diff process with --word-diff' ' @@ t/t4080-diff-process.sh (new) + + git -c diff.cdiff.process="$BACKEND" \ + diff --word-diff worddiff.c >actual && -+ grep "\[-1;-\]" actual && -+ grep "{+999;+}" actual ++ test_grep "\[-1;-\]" actual && ++ test_grep "{+999;+}" actual +' + +test_expect_success PYTHON 'diff process bypassed by --diff-algorithm' ' + rm -f backend.log && + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ + diff --diff-algorithm=patience worddiff.c >actual && -+ grep "return 999" actual && ++ test_grep "return 999" actual && + test_path_is_missing backend.log +' + @@ t/t4080-diff-process.sh (new) + rm -f backend.log && + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ + log -1 -p -- logtest.c >actual && -+ grep "return 2" actual && -+ grep "command=hunks pathname=logtest.c" backend.log ++ test_grep "return 2" actual && ++ test_grep "command=hunks pathname=logtest.c" backend.log +' + +test_expect_success PYTHON 'diff process zero hunks suppresses diff output' ' 4: 4e6ea6d518 ! 4: 39ff53acef blame: consult diff process for zero-hunk detection @@ t/t4080-diff-process.sh: test_expect_success PYTHON 'diff process zero hunks sup + + # Without zero-hunk mode, blame attributes the change. + git blame blame.c >without && -+ grep "$BLAME_COMMIT" without && ++ test_grep "$BLAME_COMMIT" without && + + # With zero-hunk mode, the process considers the files equivalent + # and blame skips the reformat commit. + git -c diff.cdiff.process="$BACKEND --mode=zero-hunk" \ + blame blame.c >with && -+ ! grep "$BLAME_COMMIT" with ++ ! test_grep "$BLAME_COMMIT" with +' + + 5: 8c7359b8a1 < -: ---------- diff-process-normalize: add built-in whitespace normalizer -- gitgitgadget ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH v2 1/4] xdiff: support external hunks via xpparam_t 2026-05-25 18:29 ` [PATCH v2 0/4] " Michael Montalbo via GitGitGadget @ 2026-05-25 18:29 ` Michael Montalbo via GitGitGadget 2026-05-25 18:29 ` [PATCH v2 2/4] userdiff: add diff.<driver>.process config Michael Montalbo via GitGitGadget ` (3 subsequent siblings) 4 siblings, 0 replies; 30+ messages in thread From: Michael Montalbo via GitGitGadget @ 2026-05-25 18:29 UTC (permalink / raw) To: git; +Cc: Michael Montalbo, Michael Montalbo From: Michael Montalbo <mmontalbo@gmail.com> Add two new xpparam_t fields (external_hunks, external_hunks_nr) that let callers supply pre-computed hunks. When set, xdl_diff() populates the changed[] arrays from these hunks instead of running the diff algorithm, then continues through compaction and emission as usual. Validate supplied hunks before use: reject out-of-bounds line numbers, overlapping or out-of-order hunks, negative counts, and violations of the synchronization invariant (unchanged line counts must match between files). On validation failure, fall back to the builtin diff algorithm. Skip trim_common_tail() in xdi_diff() when external hunks are present, since external hunks reference line numbers in the original content. Signed-off-by: Michael Montalbo <mmontalbo@gmail.com> --- xdiff-interface.c | 7 +++- xdiff/xdiff.h | 13 ++++++++ xdiff/xdiffi.c | 84 +++++++++++++++++++++++++++++++++++++++++++++-- xdiff/xprepare.c | 10 ++++++ xdiff/xprepare.h | 1 + 5 files changed, 112 insertions(+), 3 deletions(-) diff --git a/xdiff-interface.c b/xdiff-interface.c index f043330f2a..9542c0bcc2 100644 --- a/xdiff-interface.c +++ b/xdiff-interface.c @@ -124,7 +124,12 @@ int xdi_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, xdemitconf_t co if (mf1->size > MAX_XDIFF_SIZE || mf2->size > MAX_XDIFF_SIZE) return -1; - if (!xecfg->ctxlen && !(xecfg->flags & XDL_EMIT_FUNCCONTEXT)) + /* + * External hunks reference line numbers in the original content; + * trimming the tail would change line counts and invalidate them. + */ + if (!xpp->external_hunks && + !xecfg->ctxlen && !(xecfg->flags & XDL_EMIT_FUNCCONTEXT)) trim_common_tail(&a, &b); return xdl_diff(&a, &b, xpp, xecfg, xecb); diff --git a/xdiff/xdiff.h b/xdiff/xdiff.h index dc370712e9..2ee6f1aae3 100644 --- a/xdiff/xdiff.h +++ b/xdiff/xdiff.h @@ -78,6 +78,15 @@ typedef struct s_mmbuffer { long size; } mmbuffer_t; +/* + * Hunk descriptor for externally computed diffs. + * Line numbers are 1-based, matching unified diff convention. + */ +struct xdl_hunk { + long old_start, old_count; + long new_start, new_count; +}; + typedef struct s_xpparam { unsigned long flags; @@ -88,6 +97,10 @@ typedef struct s_xpparam { /* See Documentation/diff-options.adoc. */ char **anchors; size_t anchors_nr; + + /* Externally computed hunks: bypass the diff algorithm. */ + const struct xdl_hunk *external_hunks; + size_t external_hunks_nr; } xpparam_t; typedef struct s_xdemitcb { diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c index 5455b4690d..e7d6190d37 100644 --- a/xdiff/xdiffi.c +++ b/xdiff/xdiffi.c @@ -1085,16 +1085,96 @@ static void xdl_mark_ignorable_regex(xdchange_t *xscr, const xdfenv_t *xe, } } +/* + * Populate the changed[] arrays from externally supplied hunks, + * bypassing the diff algorithm. Validates that hunks are in order, + * non-overlapping, and within bounds. + * + * Returns 0 on success, -1 on validation failure. + */ +static int xdl_populate_hunks_from_external(xdfenv_t *xe, + const struct xdl_hunk *hunks, + size_t nr_hunks) +{ + size_t i; + long j, prev_old_end = 0, prev_new_end = 0; + long total_old = 0, total_new = 0; + + xdl_clear_changed(&xe->xdf1); + xdl_clear_changed(&xe->xdf2); + + for (i = 0; i < nr_hunks; i++) { + const struct xdl_hunk *h = &hunks[i]; + + if (h->old_count < 0 || h->new_count < 0) + return -1; + + /* Bounds check (1-based line numbers) */ + if (h->old_count > 0 && + (h->old_start < 1 || + h->old_start + h->old_count - 1 > (long)xe->xdf1.nrec)) + return -1; + if (h->new_count > 0 && + (h->new_start < 1 || + h->new_start + h->new_count - 1 > (long)xe->xdf2.nrec)) + return -1; + + /* Zero-count hunks: start must still be in [1, nrec+1] */ + if (h->old_count == 0 && + (h->old_start < 1 || h->old_start > (long)xe->xdf1.nrec + 1)) + return -1; + if (h->new_count == 0 && + (h->new_start < 1 || h->new_start > (long)xe->xdf2.nrec + 1)) + return -1; + + /* Ordering: no overlap with previous hunk */ + if (h->old_start < prev_old_end || + h->new_start < prev_new_end) + return -1; + + for (j = 0; j < h->old_count; j++) + xe->xdf1.changed[h->old_start - 1 + j] = true; + for (j = 0; j < h->new_count; j++) + xe->xdf2.changed[h->new_start - 1 + j] = true; + + prev_old_end = h->old_start + h->old_count; + prev_new_end = h->new_start + h->new_count; + total_old += h->old_count; + total_new += h->new_count; + } + + /* + * Synchronization invariant: unchanged line counts must match. + * Otherwise xdl_build_script() would walk off one array. + */ + if ((long)xe->xdf1.nrec - total_old != + (long)xe->xdf2.nrec - total_new) + return -1; + + return 0; +} + int xdl_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, xdemitconf_t const *xecfg, xdemitcb_t *ecb) { xdchange_t *xscr; xdfenv_t xe; emit_func_t ef = xecfg->hunk_func ? xdl_call_hunk_func : xdl_emit_diff; - if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0) { + if (xpp->external_hunks) { + if (xdl_prepare_env(mf1, mf2, xpp, &xe) < 0) + return -1; + if (xdl_populate_hunks_from_external(&xe, + xpp->external_hunks, + xpp->external_hunks_nr) == 0) + goto diff_done; + xdl_free_env(&xe); + } + if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0) return -1; - } + +diff_done: + if (xdl_change_compact(&xe.xdf1, &xe.xdf2, xpp->flags) < 0 || xdl_change_compact(&xe.xdf2, &xe.xdf1, xpp->flags) < 0 || xdl_build_script(&xe, &xscr) < 0) { diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c index cd4fc405eb..4645a9a746 100644 --- a/xdiff/xprepare.c +++ b/xdiff/xprepare.c @@ -432,3 +432,13 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, return 0; } + +/* + * Reset the changed[] array so that no lines are marked as changed. + * Also clears the sentinel slots at changed[-1] and changed[nrec] + * that xdl_change_compact() relies on during backward scans. + */ +void xdl_clear_changed(xdfile_t *xdf) +{ + memset(xdf->changed - 1, 0, (xdf->nrec + 2) * sizeof(bool)); +} diff --git a/xdiff/xprepare.h b/xdiff/xprepare.h index 947d9fc1bb..0413baf07b 100644 --- a/xdiff/xprepare.h +++ b/xdiff/xprepare.h @@ -28,6 +28,7 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, xdfenv_t *xe); void xdl_free_env(xdfenv_t *xe); +void xdl_clear_changed(xdfile_t *xdf); -- gitgitgadget ^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH v2 2/4] userdiff: add diff.<driver>.process config 2026-05-25 18:29 ` [PATCH v2 0/4] " Michael Montalbo via GitGitGadget 2026-05-25 18:29 ` [PATCH v2 1/4] xdiff: support external hunks via xpparam_t Michael Montalbo via GitGitGadget @ 2026-05-25 18:29 ` Michael Montalbo via GitGitGadget 2026-05-25 18:29 ` [PATCH v2 3/4] diff: add long-running diff process via diff.<driver>.process Michael Montalbo via GitGitGadget ` (2 subsequent siblings) 4 siblings, 0 replies; 30+ messages in thread From: Michael Montalbo via GitGitGadget @ 2026-05-25 18:29 UTC (permalink / raw) To: git; +Cc: Michael Montalbo, Michael Montalbo From: Michael Montalbo <mmontalbo@gmail.com> Add a new per-driver configuration key that specifies the command for a long-running diff process. The name follows filter.<driver>.process: a long-running subprocess that stays alive across files within a single git invocation. Signed-off-by: Michael Montalbo <mmontalbo@gmail.com> --- userdiff.c | 7 +++++++ userdiff.h | 2 ++ 2 files changed, 9 insertions(+) diff --git a/userdiff.c b/userdiff.c index fe710a68bf..81c0bebcce 100644 --- a/userdiff.c +++ b/userdiff.c @@ -499,6 +499,13 @@ int userdiff_config(const char *k, const char *v) drv->algorithm = drv->algorithm_owned; return ret; } + if (!strcmp(type, "process")) { + int ret; + FREE_AND_NULL(drv->process_owned); + ret = git_config_string(&drv->process_owned, k, v); + drv->process = drv->process_owned; + return ret; + } return 0; } diff --git a/userdiff.h b/userdiff.h index 827361b0bc..51c26e0d41 100644 --- a/userdiff.h +++ b/userdiff.h @@ -31,6 +31,8 @@ struct userdiff_driver { char *textconv_owned; struct notes_cache *textconv_cache; int textconv_want_cache; + const char *process; + char *process_owned; }; enum userdiff_driver_type { USERDIFF_DRIVER_TYPE_BUILTIN = 1<<0, -- gitgitgadget ^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH v2 3/4] diff: add long-running diff process via diff.<driver>.process 2026-05-25 18:29 ` [PATCH v2 0/4] " Michael Montalbo via GitGitGadget 2026-05-25 18:29 ` [PATCH v2 1/4] xdiff: support external hunks via xpparam_t Michael Montalbo via GitGitGadget 2026-05-25 18:29 ` [PATCH v2 2/4] userdiff: add diff.<driver>.process config Michael Montalbo via GitGitGadget @ 2026-05-25 18:29 ` Michael Montalbo via GitGitGadget 2026-05-26 1:56 ` Junio C Hamano 2026-05-26 2:26 ` Junio C Hamano 2026-05-25 18:29 ` [PATCH v2 4/4] blame: consult diff process for zero-hunk detection Michael Montalbo via GitGitGadget 2026-05-29 20:48 ` [PATCH v3 0/6] [RFC] diff: add diff.<driver>.process for external hunk providers Michael Montalbo via GitGitGadget 4 siblings, 2 replies; 30+ messages in thread From: Michael Montalbo via GitGitGadget @ 2026-05-25 18:29 UTC (permalink / raw) To: git; +Cc: Michael Montalbo, Michael Montalbo From: Michael Montalbo <mmontalbo@gmail.com> Add support for external diff processes that communicate via the long-running process protocol (pkt-line over stdin/stdout). A diff process is configured per userdiff driver: [diff "cdiff"] process = /path/to/diff-tool The tool provides custom line-matching: it receives file pairs and returns hunks that reference original line numbers. Unlike textconv, which transforms the displayed content, the diff output shows the actual file while the tool controls which lines are marked as changed. The handshake negotiates version=1 and capability=hunks. Per-file requests send command=hunks, pathname, and both file contents as packetized data. The tool responds with hunk lines and a status packet. On error, git falls back to the builtin diff algorithm with a warning. Zero hunks with status=success means the tool considers the files equivalent. Git skips diff output for that file. Signed-off-by: Michael Montalbo <mmontalbo@gmail.com> --- Documentation/config/diff.adoc | 8 + Documentation/gitattributes.adoc | 40 ++++ Makefile | 1 + diff-process.c | 206 +++++++++++++++++++ diff-process.h | 28 +++ diff.c | 23 +++ t/.gitattributes | 1 + t/t4080-diff-process.sh | 338 +++++++++++++++++++++++++++++++ 8 files changed, 645 insertions(+) create mode 100644 diff-process.c create mode 100644 diff-process.h create mode 100755 t/t4080-diff-process.sh diff --git a/Documentation/config/diff.adoc b/Documentation/config/diff.adoc index 1135a62a0a..4ab5f60df6 100644 --- a/Documentation/config/diff.adoc +++ b/Documentation/config/diff.adoc @@ -218,6 +218,14 @@ endif::git-diff[] Set this option to `true` to make the diff driver cache the text conversion outputs. See linkgit:gitattributes[5] for details. +`diff.<driver>.process`:: + The command to run as a long-running diff process. + The tool communicates via the pkt-line protocol and returns + hunks that are fed into Git's diff and blame pipelines. + If the tool returns zero hunks, the file is treated as + unchanged for both diff output and blame attribution. + See linkgit:gitattributes[5] for details. + `diff.indentHeuristic`:: Set this option to `false` to disable the default heuristics that shift diff hunk boundaries to make patches easier to read. diff --git a/Documentation/gitattributes.adoc b/Documentation/gitattributes.adoc index f20041a323..962896a0b4 100644 --- a/Documentation/gitattributes.adoc +++ b/Documentation/gitattributes.adoc @@ -821,6 +821,46 @@ NOTE: If `diff.<name>.command` is defined for path with the (see above), and adding `diff.<name>.algorithm` has no effect, as the algorithm is not passed to the external diff driver. +Using an external diff process +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +An external tool can provide content-aware line matching by +setting `diff.<name>.process` to the command that runs +the tool. The tool is a long-running process that communicates via +the pkt-line protocol (described in +Documentation/technical/long-running-process-protocol.adoc). + +------------------------ +*.c diff=cdiff +------------------------ + +---------------------------------------------------------------- +[diff "cdiff"] + process = /path/to/diff-process-tool +---------------------------------------------------------------- + +The tool receives file pairs and returns hunk descriptors indicating +which lines changed. Git feeds these hunks into its standard diff +pipeline, so all output features (word diff, function context, +color) work normally. + +If the tool fails or returns an error, Git silently falls back to +the builtin diff algorithm. If the tool returns invalid hunks +(out of bounds, overlapping), Git also falls back silently. + +The handshake negotiates `version=1` and `capability=hunks`. +Per-file requests send `command=hunks` and `pathname=<path>`, +followed by the old and new file content as packetized data. +The tool responds with lines of the form +`hunk <old_start> <old_count> <new_start> <new_count>` +(1-based line numbers), a flush packet, and `status=success`. + +If the tool returns zero hunks with `status=success`, Git treats +the file as having no changes and produces no diff output. + +Tools should ignore unknown keys in the per-file request to +remain forward-compatible. + Defining a custom hunk-header ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ diff --git a/Makefile b/Makefile index cedc234173..22900368dd 100644 --- a/Makefile +++ b/Makefile @@ -1142,6 +1142,7 @@ LIB_OBJS += diff-delta.o LIB_OBJS += diff-merges.o LIB_OBJS += diff-lib.o LIB_OBJS += diff-no-index.o +LIB_OBJS += diff-process.o LIB_OBJS += diff.o LIB_OBJS += diffcore-break.o LIB_OBJS += diffcore-delta.o diff --git a/diff-process.c b/diff-process.c new file mode 100644 index 0000000000..801ac9e22e --- /dev/null +++ b/diff-process.c @@ -0,0 +1,206 @@ +/* + * Diff process backend: communicates with a long-running external + * tool via the pkt-line protocol to obtain custom line-matching + * results. Unlike textconv, which transforms the displayed content, + * hunks from a diff process reference original line numbers and + * the display shows the actual file content. + * + * Protocol: pkt-line over stdin/stdout, following the pattern of + * the long-running filter process protocol (see convert.c). + * + * Handshake: + * git> git-diff-client / version=1 / flush + * tool< git-diff-server / version=1 / flush + * git> capability=hunks / flush + * tool< capability=hunks / flush + * + * Per-file: + * git> command=hunks / pathname=<path> / flush + * git> <old content packetized> / flush + * git> <new content packetized> / flush + * tool< hunk <old_start> <old_count> <new_start> <new_count> + * tool< ... / flush + * tool< status=success / flush + * + * Zero hunks with status=success means the tool considers the + * files equivalent. Git will skip the diff for that file. + */ + +#include "git-compat-util.h" +#include "diff-process.h" +#include "userdiff.h" +#include "sub-process.h" +#include "pkt-line.h" +#include "strbuf.h" +#include "xdiff/xdiff.h" + +#define CAP_HUNKS (1u << 0) + +struct diff_subprocess { + struct subprocess_entry subprocess; + unsigned int supported_capabilities; +}; + +static int subprocess_map_initialized; +static struct hashmap subprocess_map; + +static int start_diff_process_fn(struct subprocess_entry *subprocess) +{ + static int versions[] = { 1, 0 }; + static struct subprocess_capability capabilities[] = { + { "hunks", CAP_HUNKS }, + { NULL, 0 } + }; + struct diff_subprocess *entry = + (struct diff_subprocess *)subprocess; + + /* Uses dying pkt-line variant, same as convert.c filters. */ + return subprocess_handshake(subprocess, "git-diff", + versions, NULL, + capabilities, + &entry->supported_capabilities); +} + +static struct diff_subprocess *find_or_start_process(const char *cmd) +{ + struct diff_subprocess *entry; + + if (!subprocess_map_initialized) { + subprocess_map_initialized = 1; + hashmap_init(&subprocess_map, cmd2process_cmp, NULL, 0); + } + + entry = (struct diff_subprocess *) + subprocess_find_entry(&subprocess_map, cmd); + if (entry) + return entry; + + entry = xcalloc(1, sizeof(*entry)); + if (subprocess_start(&subprocess_map, &entry->subprocess, + cmd, start_diff_process_fn)) { + free(entry); + return NULL; + } + + return entry; +} + +static int send_file_content(int fd, const char *buf, long size) +{ + int ret; + + if (size > 0) + ret = write_packetized_from_buf_no_flush(buf, size, fd); + else + ret = 0; + if (ret) + return ret; + return packet_flush_gently(fd); +} + +static int parse_hunk_line(const char *line, struct xdl_hunk *hunk) +{ + char *end; + + /* Format: "hunk <old_start> <old_count> <new_start> <new_count>" */ + if (!skip_prefix(line, "hunk ", &line)) + return -1; + + hunk->old_start = strtol(line, &end, 10); + if (end == line || *end != ' ') + return -1; + line = end; + + hunk->old_count = strtol(line, &end, 10); + if (end == line || *end != ' ') + return -1; + line = end; + + hunk->new_start = strtol(line, &end, 10); + if (end == line || *end != ' ') + return -1; + line = end; + + hunk->new_count = strtol(line, &end, 10); + if (end == line || *end != '\0') + return -1; + + return 0; +} + +int diff_process_get_hunks(struct userdiff_driver *drv, + const char *path, + const char *old_buf, long old_size, + const char *new_buf, long new_size, + struct xdl_hunk **hunks_out, + size_t *nr_hunks_out) +{ + struct diff_subprocess *backend; + struct child_process *process; + int fd_in, fd_out; + struct strbuf status = STRBUF_INIT; + struct xdl_hunk *hunks = NULL; + struct xdl_hunk hunk; + size_t nr_hunks = 0, alloc_hunks = 0; + int len; + char *line; + + if (!drv || !drv->process) + return -1; + + backend = find_or_start_process(drv->process); + if (!backend) + return -1; + + if (!(backend->supported_capabilities & CAP_HUNKS)) + return -1; + + process = subprocess_get_child_process(&backend->subprocess); + fd_in = process->in; + fd_out = process->out; + + /* Send request */ + if (packet_write_fmt_gently(fd_in, "command=hunks\n") || + packet_write_fmt_gently(fd_in, "pathname=%s\n", path) || + packet_flush_gently(fd_in)) + goto error; + + /* Send old file content */ + if (send_file_content(fd_in, old_buf, old_size)) + goto error; + + /* Send new file content */ + if (send_file_content(fd_in, new_buf, new_size)) + goto error; + + /* Read hunks until flush packet */ + while ((len = packet_read_line_gently(fd_out, NULL, &line)) >= 0 && + line) { + if (parse_hunk_line(line, &hunk) < 0) + goto error; + ALLOC_GROW(hunks, nr_hunks + 1, alloc_hunks); + hunks[nr_hunks++] = hunk; + } + if (len < 0) + goto error; + + /* Read status */ + if (subprocess_read_status(fd_out, &status)) + goto error; + + if (strcmp(status.buf, "success")) { + if (!strcmp(status.buf, "abort")) + backend->supported_capabilities &= ~CAP_HUNKS; + goto error; + } + + *hunks_out = hunks; + *nr_hunks_out = nr_hunks; + strbuf_release(&status); + return 0; + +error: + free(hunks); + strbuf_release(&status); + return -1; +} diff --git a/diff-process.h b/diff-process.h new file mode 100644 index 0000000000..4c84951e02 --- /dev/null +++ b/diff-process.h @@ -0,0 +1,28 @@ +#ifndef DIFF_PROCESS_H +#define DIFF_PROCESS_H + +struct userdiff_driver; +struct xdl_hunk; + +/* + * Query a diff process for hunks describing the changes + * between old_buf and new_buf. + * + * The backend is a long-running subprocess configured via + * diff.<driver>.process. It receives file content via + * pkt-line and returns hunks with 1-based line numbers. + * + * On success, sets *hunks_out and *nr_hunks_out to a newly allocated + * array (caller must free) and returns 0. + * + * On failure, returns -1. The caller should fall back to the + * builtin diff algorithm. + */ +int diff_process_get_hunks(struct userdiff_driver *drv, + const char *path, + const char *old_buf, long old_size, + const char *new_buf, long new_size, + struct xdl_hunk **hunks_out, + size_t *nr_hunks_out); + +#endif /* DIFF_PROCESS_H */ diff --git a/diff.c b/diff.c index 397e38b41c..1aeb0f319e 100644 --- a/diff.c +++ b/diff.c @@ -25,6 +25,7 @@ #include "utf8.h" #include "odb.h" #include "userdiff.h" +#include "diff-process.h" #include "submodule.h" #include "hashmap.h" #include "mem-pool.h" @@ -3991,6 +3992,7 @@ static void builtin_diff(const char *name_a, xpparam_t xpp; xdemitconf_t xecfg; struct emit_callback ecbdata; + struct xdl_hunk *ext_hunks = NULL; unsigned ws_rule; const struct userdiff_funcname *pe; @@ -4031,6 +4033,26 @@ static void builtin_diff(const char *name_a, xpp.ignore_regex_nr = o->ignore_regex_nr; xpp.anchors = o->anchors; xpp.anchors_nr = o->anchors_nr; + + if (!o->ignore_driver_algorithm && + one->driver && one->driver->process) { + size_t ext_hunks_nr = 0; + if (!diff_process_get_hunks( + one->driver, name_a, + mf1.ptr, mf1.size, + mf2.ptr, mf2.size, + &ext_hunks, &ext_hunks_nr)) { + if (!ext_hunks_nr) + goto free_ab_and_return; + xpp.external_hunks = ext_hunks; + xpp.external_hunks_nr = ext_hunks_nr; + } else { + warning(_("diff process failed for '%s'," + " falling back to builtin diff"), + name_a); + } + } + xecfg.ctxlen = o->context; xecfg.interhunkctxlen = o->interhunkcontext; xecfg.flags = XDL_EMIT_FUNCNAMES; @@ -4111,6 +4133,7 @@ static void builtin_diff(const char *name_a, } else if (xdi_diff_outf(&mf1, &mf2, NULL, fn_out_consume, &ecbdata, &xpp, &xecfg)) die("unable to generate diff for %s", one->path); + free(ext_hunks); if (o->word_diff) free_diff_words_data(&ecbdata); if (textconv_one) diff --git a/t/.gitattributes b/t/.gitattributes index 7664c6e027..de97920cab 100644 --- a/t/.gitattributes +++ b/t/.gitattributes @@ -23,3 +23,4 @@ t[0-9][0-9][0-9][0-9]/* -whitespace /t8005/*.txt eol=lf /t9*/*.dump eol=lf /t0040*.sh whitespace=-indent-with-non-tab +/t4080-diff-process.sh whitespace=-indent-with-non-tab diff --git a/t/t4080-diff-process.sh b/t/t4080-diff-process.sh new file mode 100755 index 0000000000..083e48e872 --- /dev/null +++ b/t/t4080-diff-process.sh @@ -0,0 +1,338 @@ +#!/bin/sh + +test_description='diff process via long-running process' + +. ./test-lib.sh + +if test_have_prereq PYTHON +then + PYTHON_PATH=$(command -v python3) || PYTHON_PATH=$(command -v python) +fi + +# +# A single parametric diff process. +# Usage: diff-process-backend --mode=<mode> [--log=<path>] +# +# Modes: +# whole-file - report all lines as changed (default) +# fixed-hunk - always report hunk 5 2 5 2 +# bad-hunk - report out-of-bounds hunk 999 1 999 1 +# zero-hunk - return zero hunks (files considered equivalent) +# error - return status=error for every request +# abort - return status=abort for every request +# crash - read one request then exit without responding +# +setup_backend () { + cat >"$TRASH_DIRECTORY/diff-process-backend.py" <<-\PYEOF + import sys, os + + def read_pkt(): + hdr = sys.stdin.buffer.read(4) + if len(hdr) < 4: return None + length = int(hdr, 16) + if length == 0: return "" + data = sys.stdin.buffer.read(length - 4) + return data.decode().rstrip("\n") + + def write_pkt(line): + data = (line + "\n").encode() + sys.stdout.buffer.write(f"{len(data)+4:04x}".encode() + data) + sys.stdout.buffer.flush() + + def write_flush(): + sys.stdout.buffer.write(b"0000") + sys.stdout.buffer.flush() + + def read_content(): + chunks = [] + while True: + hdr = sys.stdin.buffer.read(4) + if len(hdr) < 4: break + length = int(hdr, 16) + if length == 0: break + chunks.append(sys.stdin.buffer.read(length - 4)) + return b"".join(chunks) + + mode = "whole-file" + logfile = None + for arg in sys.argv[1:]: + if arg.startswith("--mode="): + mode = arg[7:] + elif arg.startswith("--log="): + logfile = open(arg[6:], "a") + + def log(msg): + if logfile: + logfile.write(msg + "\n") + logfile.flush() + + # Handshake + assert read_pkt() == "git-diff-client" + assert read_pkt() == "version=1" + read_pkt() + write_pkt("git-diff-server") + write_pkt("version=1") + write_flush() + while True: + p = read_pkt() + if p == "": break + write_pkt("capability=hunks") + write_flush() + + log("ready") + + while True: + cmd = None + pathname = None + while True: + p = read_pkt() + if p is None: sys.exit(0) + if p == "": break + if p.startswith("command="): cmd = p.split("=",1)[1] + if p.startswith("pathname="): pathname = p.split("=",1)[1] + if cmd is None: sys.exit(0) + old = read_content() + new = read_content() + log(f"command={cmd} pathname={pathname}") + + if mode == "error": + write_flush() + write_pkt("status=error") + write_flush() + continue + + if mode == "abort": + write_flush() + write_pkt("status=abort") + write_flush() + continue + + if mode == "crash": + sys.exit(1) + + if cmd == "hunks": + if mode == "fixed-hunk": + write_pkt("hunk 5 2 5 2") + elif mode == "bad-hunk": + write_pkt("hunk 999 1 999 1") + elif mode == "zero-hunk": + pass + else: + ol = len(old.split(b"\n")) + nl = len(new.split(b"\n")) + write_pkt(f"hunk 1 {ol} 1 {nl}") + write_flush() + write_pkt("status=success") + write_flush() + else: + write_flush() + write_pkt("status=error") + write_flush() + PYEOF + write_script diff-process-backend <<-SHEOF + exec "$PYTHON_PATH" "$TRASH_DIRECTORY/diff-process-backend.py" "\$@" + SHEOF +} + +BACKEND="./diff-process-backend" + +test_expect_success PYTHON 'setup' ' + setup_backend && + echo "*.c diff=cdiff" >.gitattributes && + git add .gitattributes && + git commit -m "initial" +' + +test_expect_success PYTHON 'diff process hunk boundaries affect output' ' + cat >boundary.c <<-\EOF && + line1 + line2 + line3 + line4 + OLD5 + OLD6 + line7 + line8 + OLD9 + OLD10 + EOF + git add boundary.c && + git commit -m "add boundary.c" && + + cat >boundary.c <<-\EOF && + line1 + line2 + line3 + line4 + NEW5 + NEW6 + line7 + line8 + NEW9 + NEW10 + EOF + + # The file has changes at lines 5-6 and 9-10, but fixed-hunk + # only reports lines 5-6 as changed. Lines 9-10 should not + # appear as changed in the output. + git -c diff.cdiff.process="$BACKEND --mode=fixed-hunk" \ + diff boundary.c >actual && + test_grep "^-OLD5" actual && + test_grep "^-OLD6" actual && + test_grep "^+NEW5" actual && + test_grep "^+NEW6" actual && + ! test_grep "^-OLD9" actual && + ! test_grep "^-OLD10" actual && + ! test_grep "^+NEW9" actual && + ! test_grep "^+NEW10" actual +' + +test_expect_success PYTHON 'diff process fallback on tool error status' ' + rm -f backend.log && + git -c diff.cdiff.process="$BACKEND --mode=error --log=backend.log" \ + diff boundary.c >actual && + # Fallback produces the full builtin diff (both change regions). + test_grep "^-OLD5" actual && + test_grep "^+NEW5" actual && + test_grep "^-OLD9" actual && + test_grep "^+NEW9" actual && + # Tool was contacted (it replied with error, not crash). + test_grep "command=hunks pathname=boundary.c" backend.log +' + +test_expect_success PYTHON 'diff process fallback on bad hunks' ' + git -c diff.cdiff.process="$BACKEND --mode=bad-hunk" \ + diff boundary.c >actual && + test_grep "^-OLD5" actual && + test_grep "^+NEW5" actual && + test_grep "^-OLD9" actual && + test_grep "^+NEW9" actual +' + +test_expect_success PYTHON 'diff process fallback on tool crash' ' + git -c diff.cdiff.process="$BACKEND --mode=crash" \ + diff boundary.c >actual && + test_grep "^-OLD5" actual && + test_grep "^+NEW5" actual && + test_grep "^-OLD9" actual && + test_grep "^+NEW9" actual +' + +test_expect_success PYTHON 'diff process abort disables for session' ' + cat >abort1.c <<-\EOF && + int first(void) { return 1; } + EOF + cat >abort2.c <<-\EOF && + int second(void) { return 2; } + EOF + git add abort1.c abort2.c && + git commit -m "add abort files" && + + cat >abort1.c <<-\EOF && + int first(void) { return 10; } + EOF + cat >abort2.c <<-\EOF && + int second(void) { return 20; } + EOF + + rm -f backend.log && + git -c diff.cdiff.process="$BACKEND --mode=abort --log=backend.log" \ + diff -- abort1.c abort2.c >actual && + # Both files should still produce diff output via fallback. + test_grep "return 10" actual && + test_grep "return 20" actual && + # The tool aborts on the first file and git clears its + # capability. The second file never contacts the tool, + # so the log should have exactly one entry, not two. + test_grep "command=hunks" backend.log >matches && + test_line_count = 1 matches +' + +test_expect_success PYTHON 'diff process handles multiple files' ' + cat >multi1.c <<-\EOF && + int one(void) { return 1; } + EOF + cat >multi2.c <<-\EOF && + int two(void) { return 2; } + EOF + git add multi1.c multi2.c && + git commit -m "add multi files" && + + cat >multi1.c <<-\EOF && + int one(void) { return 10; } + EOF + cat >multi2.c <<-\EOF && + int two(void) { return 20; } + EOF + + rm -f backend.log && + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ + diff -- multi1.c multi2.c >actual && + test_grep "return 10" actual && + test_grep "return 20" actual && + test_grep "pathname=multi1.c" backend.log && + test_grep "pathname=multi2.c" backend.log +' + +test_expect_success PYTHON 'diff process with --word-diff' ' + cat >worddiff.c <<-\EOF && + int value(void) { return 1; } + EOF + git add worddiff.c && + git commit -m "add worddiff.c" && + + cat >worddiff.c <<-\EOF && + int value(void) { return 999; } + EOF + + git -c diff.cdiff.process="$BACKEND" \ + diff --word-diff worddiff.c >actual && + test_grep "\[-1;-\]" actual && + test_grep "{+999;+}" actual +' + +test_expect_success PYTHON 'diff process bypassed by --diff-algorithm' ' + rm -f backend.log && + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ + diff --diff-algorithm=patience worddiff.c >actual && + test_grep "return 999" actual && + test_path_is_missing backend.log +' + +test_expect_success PYTHON 'diff process works with git log -p' ' + cat >logtest.c <<-\EOF && + int logfunc(void) { return 1; } + EOF + git add logtest.c && + git commit -m "add logtest.c" && + + cat >logtest.c <<-\EOF && + int logfunc(void) { return 2; } + EOF + git add logtest.c && + git commit -m "change logtest.c" && + + rm -f backend.log && + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ + log -1 -p -- logtest.c >actual && + test_grep "return 2" actual && + test_grep "command=hunks pathname=logtest.c" backend.log +' + +test_expect_success PYTHON 'diff process zero hunks suppresses diff output' ' + cat >zerohunk.c <<-\EOF && + int zero(void) { return 0; } + EOF + git add zerohunk.c && + git commit -m "add zerohunk.c" && + + cat >zerohunk.c <<-\EOF && + int zero(void) { return 999; } + EOF + + git -c diff.cdiff.process="$BACKEND --mode=zero-hunk" \ + diff zerohunk.c >actual && + test_must_be_empty actual +' + +test_done -- gitgitgadget ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH v2 3/4] diff: add long-running diff process via diff.<driver>.process 2026-05-25 18:29 ` [PATCH v2 3/4] diff: add long-running diff process via diff.<driver>.process Michael Montalbo via GitGitGadget @ 2026-05-26 1:56 ` Junio C Hamano 2026-05-29 0:51 ` Michael Montalbo 2026-05-26 2:26 ` Junio C Hamano 1 sibling, 1 reply; 30+ messages in thread From: Junio C Hamano @ 2026-05-26 1:56 UTC (permalink / raw) To: Michael Montalbo via GitGitGadget; +Cc: git, Michael Montalbo "Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes: > +struct diff_subprocess { > + struct subprocess_entry subprocess; > + unsigned int supported_capabilities; > +}; > + > +static int subprocess_map_initialized; > +static struct hashmap subprocess_map; Can we avoid introducing new global variables like these? Would "struct userdiff_driver" or "struct diff_options" be a good place to hang this hashmap, perhaps? > +static int send_file_content(int fd, const char *buf, long size) > +{ > + int ret; > + > + if (size > 0) > + ret = write_packetized_from_buf_no_flush(buf, size, fd); > + else > + ret = 0; Shouldn't "size == -24" be flagged as an invalid input? > + if (ret) > + return ret; > + return packet_flush_gently(fd); > +} > +static int parse_hunk_line(const char *line, struct xdl_hunk *hunk) > +{ > +... > +} This gives a silent error diagnosis, which is good for a lower level helper. > +int diff_process_get_hunks(struct userdiff_driver *drv, > + const char *path, > + const char *old_buf, long old_size, > + const char *new_buf, long new_size, > + struct xdl_hunk **hunks_out, > + size_t *nr_hunks_out) > +{ > + struct diff_subprocess *backend; > + struct child_process *process; > + int fd_in, fd_out; > + struct strbuf status = STRBUF_INIT; > + struct xdl_hunk *hunks = NULL; > + struct xdl_hunk hunk; > + size_t nr_hunks = 0, alloc_hunks = 0; > + int len; > + char *line; > + > + if (!drv || !drv->process) > + return -1; A driver that does not define process is not an error; it is perfectly normal in the current world order where nobody has such an external process and even fi this patch lands, external processes are optional. So here "return -1" does not mean an error, and silent return is perfectly fine. > + backend = find_or_start_process(drv->process); > + if (!backend) > + return -1; This is probably an error; the user specified drv->process, we either tried to find or start the process and failed. Isn't it an event that deserves to be reported in an error message? > + if (!(backend->supported_capabilities & CAP_HUNKS)) > + return -1; Backend started, but the "hunks" feature is not supported. Perhaps in a year or two, this external process protocol may have become so popular that it gained more capabilities, possibly making get_hunks obsolete. We may be looking at such an external process that uses other capabilities but not this one. This is not an error, so silent return is perfectly fine. > + process = subprocess_get_child_process(&backend->subprocess); > + fd_in = process->in; > + fd_out = process->out; > + > + /* Send request */ > + if (packet_write_fmt_gently(fd_in, "command=hunks\n") || > + packet_write_fmt_gently(fd_in, "pathname=%s\n", path) || > + packet_flush_gently(fd_in)) > + goto error; > + > + /* Send old file content */ > + if (send_file_content(fd_in, old_buf, old_size)) > + goto error; > + > + /* Send new file content */ > + if (send_file_content(fd_in, new_buf, new_size)) > + goto error; > + > + /* Read hunks until flush packet */ > + while ((len = packet_read_line_gently(fd_out, NULL, &line)) >= 0 && > + line) { > + if (parse_hunk_line(line, &hunk) < 0) > + goto error; > + ALLOC_GROW(hunks, nr_hunks + 1, alloc_hunks); > + hunks[nr_hunks++] = hunk; > + } > + if (len < 0) > + goto error; > + > + /* Read status */ > + if (subprocess_read_status(fd_out, &status)) > + goto error; > + > + if (strcmp(status.buf, "success")) { > + if (!strcmp(status.buf, "abort")) > + backend->supported_capabilities &= ~CAP_HUNKS; > + goto error; > + } > + > + *hunks_out = hunks; > + *nr_hunks_out = nr_hunks; > + strbuf_release(&status); > + return 0; > + > +error: All exceptions that lead here look like events that should be reported to the end-user. > + free(hunks); > + strbuf_release(&status); > + return -1; > +} > +/* > + * Query a diff process for hunks describing the changes > + * between old_buf and new_buf. > + * > + * The backend is a long-running subprocess configured via > + * diff.<driver>.process. It receives file content via > + * pkt-line and returns hunks with 1-based line numbers. > + * > + * On success, sets *hunks_out and *nr_hunks_out to a newly allocated > + * array (caller must free) and returns 0. > + * > + * On failure, returns -1. The caller should fall back to the > + * builtin diff algorithm. > + */ I do not agree with this. If it is a failure, the user should fix the external process (or disable). It shouldn't be hidden behind a fallback. As I left comments, in this round of implementation, there are conditions that returns -1 for soemthing that is not an error (i.e., not configured, or process not supporting the particular capability) *and* in those cases the caller should fall back as if nothing happened. But some error cases, the caller should't hide them. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 3/4] diff: add long-running diff process via diff.<driver>.process 2026-05-26 1:56 ` Junio C Hamano @ 2026-05-29 0:51 ` Michael Montalbo 0 siblings, 0 replies; 30+ messages in thread From: Michael Montalbo @ 2026-05-29 0:51 UTC (permalink / raw) To: Junio C Hamano; +Cc: Michael Montalbo via GitGitGadget, git On Mon, May 25, 2026 at 6:56 PM Junio C Hamano <gitster@pobox.com> wrote: > > "Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes: > > > +struct diff_subprocess { > > + struct subprocess_entry subprocess; > > + unsigned int supported_capabilities; > > +}; > > + > > +static int subprocess_map_initialized; > > +static struct hashmap subprocess_map; > > Can we avoid introducing new global variables like these? Would > "struct userdiff_driver" or "struct diff_options" be a good place to > hang this hashmap, perhaps? > Will clean this up. > > +static int send_file_content(int fd, const char *buf, long size) > > +{ > > + int ret; > > + > > + if (size > 0) > > + ret = write_packetized_from_buf_no_flush(buf, size, fd); > > + else > > + ret = 0; > > Shouldn't "size == -24" be flagged as an invalid input? > Will fix and do a broader audit of input validation and bounds checking. > > + if (ret) > > + return ret; > > + return packet_flush_gently(fd); > > +} > > > +static int parse_hunk_line(const char *line, struct xdl_hunk *hunk) > > +{ > > +... > > +} > > This gives a silent error diagnosis, which is good for a lower level > helper. > > > +int diff_process_get_hunks(struct userdiff_driver *drv, > > + const char *path, > > + const char *old_buf, long old_size, > > + const char *new_buf, long new_size, > > + struct xdl_hunk **hunks_out, > > + size_t *nr_hunks_out) > > +{ > > + struct diff_subprocess *backend; > > + struct child_process *process; > > + int fd_in, fd_out; > > + struct strbuf status = STRBUF_INIT; > > + struct xdl_hunk *hunks = NULL; > > + struct xdl_hunk hunk; > > + size_t nr_hunks = 0, alloc_hunks = 0; > > + int len; > > + char *line; > > + > > + if (!drv || !drv->process) > > + return -1; > > A driver that does not define process is not an error; it is > perfectly normal in the current world order where nobody has such an > external process and even fi this patch lands, external processes > are optional. So here "return -1" does not mean an error, and > silent return is perfectly fine. > > > + backend = find_or_start_process(drv->process); > > + if (!backend) > > + return -1; > > This is probably an error; the user specified drv->process, we > either tried to find or start the process and failed. Isn't it an > event that deserves to be reported in an error message? > > > + if (!(backend->supported_capabilities & CAP_HUNKS)) > > + return -1; > > Backend started, but the "hunks" feature is not supported. Perhaps > in a year or two, this external process protocol may have become so > popular that it gained more capabilities, possibly making get_hunks > obsolete. We may be looking at such an external process that uses > other capabilities but not this one. This is not an error, so > silent return is perfectly fine. > > > + process = subprocess_get_child_process(&backend->subprocess); > > + fd_in = process->in; > > + fd_out = process->out; > > + > > + /* Send request */ > > + if (packet_write_fmt_gently(fd_in, "command=hunks\n") || > > + packet_write_fmt_gently(fd_in, "pathname=%s\n", path) || > > + packet_flush_gently(fd_in)) > > + goto error; > > + > > + /* Send old file content */ > > + if (send_file_content(fd_in, old_buf, old_size)) > > + goto error; > > + > > + /* Send new file content */ > > + if (send_file_content(fd_in, new_buf, new_size)) > > + goto error; > > + > > + /* Read hunks until flush packet */ > > + while ((len = packet_read_line_gently(fd_out, NULL, &line)) >= 0 && > > + line) { > > + if (parse_hunk_line(line, &hunk) < 0) > > + goto error; > > + ALLOC_GROW(hunks, nr_hunks + 1, alloc_hunks); > > + hunks[nr_hunks++] = hunk; > > + } > > + if (len < 0) > > + goto error; > > + > > + /* Read status */ > > + if (subprocess_read_status(fd_out, &status)) > > + goto error; > > + > > + if (strcmp(status.buf, "success")) { > > + if (!strcmp(status.buf, "abort")) > > + backend->supported_capabilities &= ~CAP_HUNKS; > > + goto error; > > + } > > + > > + *hunks_out = hunks; > > + *nr_hunks_out = nr_hunks; > > + strbuf_release(&status); > > + return 0; > > + > > +error: > > All exceptions that lead here look like events that should be > reported to the end-user. > Agreed on all points. I will restructure things so errors are flagged when appropriate (i.e., user specified a process but one was not found / couldn't start and exceptions) and non-errors are treated as they should be. > > + free(hunks); > > + strbuf_release(&status); > > + return -1; > > +} > > > +/* > > + * Query a diff process for hunks describing the changes > > + * between old_buf and new_buf. > > + * > > + * The backend is a long-running subprocess configured via > > + * diff.<driver>.process. It receives file content via > > + * pkt-line and returns hunks with 1-based line numbers. > > + * > > + * On success, sets *hunks_out and *nr_hunks_out to a newly allocated > > + * array (caller must free) and returns 0. > > + * > > + * On failure, returns -1. The caller should fall back to the > > + * builtin diff algorithm. > > + */ > > I do not agree with this. If it is a failure, the user should fix > the external process (or disable). It shouldn't be hidden behind a > fallback. As I left comments, in this round of implementation, > there are conditions that returns -1 for soemthing that is not an > error (i.e., not configured, or process not supporting the > particular capability) *and* in those cases the caller should fall > back as if nothing happened. But some error cases, the caller > should't hide them. Will address in a follow-up. Thank you for the feedback! ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 3/4] diff: add long-running diff process via diff.<driver>.process 2026-05-25 18:29 ` [PATCH v2 3/4] diff: add long-running diff process via diff.<driver>.process Michael Montalbo via GitGitGadget 2026-05-26 1:56 ` Junio C Hamano @ 2026-05-26 2:26 ` Junio C Hamano 2026-05-29 0:55 ` Michael Montalbo 1 sibling, 1 reply; 30+ messages in thread From: Junio C Hamano @ 2026-05-26 2:26 UTC (permalink / raw) To: Michael Montalbo via GitGitGadget; +Cc: git, Michael Montalbo "Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes: > Zero hunks with status=success means the tool considers the > files equivalent. Git skips diff output for that file. Is "zero hunk" a common word or some random string you invented? If the latter, which is I am assuming it to be, you should define what it means at/before the first use. Here in the proposed log message, and ... > > Signed-off-by: Michael Montalbo <mmontalbo@gmail.com> > --- > Documentation/config/diff.adoc | 8 + > Documentation/gitattributes.adoc | 40 ++++ > Makefile | 1 + > diff-process.c | 206 +++++++++++++++++++ > diff-process.h | 28 +++ > diff.c | 23 +++ > t/.gitattributes | 1 + > t/t4080-diff-process.sh | 338 +++++++++++++++++++++++++++++++ > 8 files changed, 645 insertions(+) > create mode 100644 diff-process.c > create mode 100644 diff-process.h > create mode 100755 t/t4080-diff-process.sh > > diff --git a/Documentation/config/diff.adoc b/Documentation/config/diff.adoc > index 1135a62a0a..4ab5f60df6 100644 > --- a/Documentation/config/diff.adoc > +++ b/Documentation/config/diff.adoc > @@ -218,6 +218,14 @@ endif::git-diff[] > Set this option to `true` to make the diff driver cache the text > conversion outputs. See linkgit:gitattributes[5] for details. > > +`diff.<driver>.process`:: > + The command to run as a long-running diff process. > + The tool communicates via the pkt-line protocol and returns > + hunks that are fed into Git's diff and blame pipelines. > + If the tool returns zero hunks, the file is treated as > + unchanged for both diff output and blame attribution. > + See linkgit:gitattributes[5] for details. ... also here. I do not know if you mean "the tool returns no hunks" (there is no "hunk <old_start> <old_count> <new_start> <new_count>" line passed from the tool over the protocol) or "the tool returns zero-hunk" (there is a special "zero-hunk" message to signal this particular condition sent over the protocol), and this description does not quite help disambiguating between the two. If the former, then avoid "zero hunks" as it sounds like a noun with special meaning. Yes, we can say "tool returns one hunk", "tool returns 31 hunks", etc., so "tool returns zero hunks" may logically be correct, but "when the tool returns no hunks with status=success" is much less confusing, I think. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 3/4] diff: add long-running diff process via diff.<driver>.process 2026-05-26 2:26 ` Junio C Hamano @ 2026-05-29 0:55 ` Michael Montalbo 0 siblings, 0 replies; 30+ messages in thread From: Michael Montalbo @ 2026-05-29 0:55 UTC (permalink / raw) To: Junio C Hamano; +Cc: Michael Montalbo via GitGitGadget, git On Mon, May 25, 2026 at 7:26 PM Junio C Hamano <gitster@pobox.com> wrote: > > "Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes: > > > Zero hunks with status=success means the tool considers the > > files equivalent. Git skips diff output for that file. > > Is "zero hunk" a common word or some random string you invented? If > the latter, which is I am assuming it to be, you should define what > it means at/before the first use. Here in the proposed log message, > and ... > > > > > Signed-off-by: Michael Montalbo <mmontalbo@gmail.com> > > --- > > Documentation/config/diff.adoc | 8 + > > Documentation/gitattributes.adoc | 40 ++++ > > Makefile | 1 + > > diff-process.c | 206 +++++++++++++++++++ > > diff-process.h | 28 +++ > > diff.c | 23 +++ > > t/.gitattributes | 1 + > > t/t4080-diff-process.sh | 338 +++++++++++++++++++++++++++++++ > > 8 files changed, 645 insertions(+) > > create mode 100644 diff-process.c > > create mode 100644 diff-process.h > > create mode 100755 t/t4080-diff-process.sh > > > > diff --git a/Documentation/config/diff.adoc b/Documentation/config/diff.adoc > > index 1135a62a0a..4ab5f60df6 100644 > > --- a/Documentation/config/diff.adoc > > +++ b/Documentation/config/diff.adoc > > @@ -218,6 +218,14 @@ endif::git-diff[] > > Set this option to `true` to make the diff driver cache the text > > conversion outputs. See linkgit:gitattributes[5] for details. > > > > +`diff.<driver>.process`:: > > + The command to run as a long-running diff process. > > + The tool communicates via the pkt-line protocol and returns > > + hunks that are fed into Git's diff and blame pipelines. > > + If the tool returns zero hunks, the file is treated as > > + unchanged for both diff output and blame attribution. > > + See linkgit:gitattributes[5] for details. > > ... also here. > > I do not know if you mean "the tool returns no hunks" (there is no > "hunk <old_start> <old_count> <new_start> <new_count>" line passed > from the tool over the protocol) or "the tool returns zero-hunk" > (there is a special "zero-hunk" message to signal this particular > condition sent over the protocol), and this description does not > quite help disambiguating between the two. > > If the former, then avoid "zero hunks" as it sounds like a noun with > special meaning. Yes, we can say "tool returns one hunk", "tool > returns 31 hunks", etc., so "tool returns zero hunks" may logically > be correct, but "when the tool returns no hunks with status=success" > is much less confusing, I think. Yes, "zero hunks" was my own invention and I see why it's confusing. Will update the messaging to use "no hunks" instead and do a broader sweep of the documentation to clarify the protocol and expected tool behavior. ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH v2 4/4] blame: consult diff process for zero-hunk detection 2026-05-25 18:29 ` [PATCH v2 0/4] " Michael Montalbo via GitGitGadget ` (2 preceding siblings ...) 2026-05-25 18:29 ` [PATCH v2 3/4] diff: add long-running diff process via diff.<driver>.process Michael Montalbo via GitGitGadget @ 2026-05-25 18:29 ` Michael Montalbo via GitGitGadget 2026-05-29 20:48 ` [PATCH v3 0/6] [RFC] diff: add diff.<driver>.process for external hunk providers Michael Montalbo via GitGitGadget 4 siblings, 0 replies; 30+ messages in thread From: Michael Montalbo via GitGitGadget @ 2026-05-25 18:29 UTC (permalink / raw) To: git; +Cc: Michael Montalbo, Michael Montalbo From: Michael Montalbo <mmontalbo@gmail.com> When a diff process is configured via diff.<driver>.process, consult it during blame's per-commit diffing. If the process returns zero hunks for a commit's changes to a file, treat the commit as having no changes, causing blame to attribute lines to earlier commits. The subprocess is long-running (one startup cost amortized across the blame traversal), but each commit in the file's history incurs a round-trip to the tool. Signed-off-by: Michael Montalbo <mmontalbo@gmail.com> --- Documentation/gitattributes.adoc | 3 +++ blame.c | 43 +++++++++++++++++++++++++++++--- t/t4080-diff-process.sh | 32 ++++++++++++++++++++++++ 3 files changed, 74 insertions(+), 4 deletions(-) diff --git a/Documentation/gitattributes.adoc b/Documentation/gitattributes.adoc index 962896a0b4..c087b4b265 100644 --- a/Documentation/gitattributes.adoc +++ b/Documentation/gitattributes.adoc @@ -857,6 +857,9 @@ The tool responds with lines of the form If the tool returns zero hunks with `status=success`, Git treats the file as having no changes and produces no diff output. +`git blame` also consults the diff process and skips commits +where it reports zero hunks, attributing lines to earlier commits +instead. Tools should ignore unknown keys in the per-file request to remain forward-compatible. diff --git a/blame.c b/blame.c index a3c49d132e..8a5f14db7a 100644 --- a/blame.c +++ b/blame.c @@ -19,6 +19,8 @@ #include "tag.h" #include "trace2.h" #include "blame.h" +#include "diff-process.h" +#include "userdiff.h" #include "alloc.h" #include "commit-slab.h" #include "bloom.h" @@ -315,16 +317,47 @@ static struct commit *fake_working_tree_commit(struct repository *r, static int diff_hunks(mmfile_t *file_a, mmfile_t *file_b, - xdl_emit_hunk_consume_func_t hunk_func, void *cb_data, int xdl_opts) + xdl_emit_hunk_consume_func_t hunk_func, void *cb_data, + int xdl_opts, struct index_state *istate, + const char *path) { xpparam_t xpp = {0}; xdemitconf_t xecfg = {0}; xdemitcb_t ecb = {NULL}; + struct xdl_hunk *ext_hunks = NULL; + int ret; xpp.flags = xdl_opts; xecfg.hunk_func = hunk_func; ecb.priv = cb_data; - return xdi_diff(file_a, file_b, &xpp, &xecfg, &ecb); + + if (path && istate) { + struct userdiff_driver *drv; + drv = userdiff_find_by_path(istate, path); + if (drv && drv->process) { + size_t nr = 0; + if (!diff_process_get_hunks(drv, path, + file_a->ptr, file_a->size, + file_b->ptr, file_b->size, + &ext_hunks, &nr)) { + if (!nr) { + /* + * Zero hunks: the diff process + * considers these files equivalent. + * Skip so blame looks past this + * commit. + */ + return 0; + } + xpp.external_hunks = ext_hunks; + xpp.external_hunks_nr = nr; + } + } + } + + ret = xdi_diff(file_a, file_b, &xpp, &xecfg, &ecb); + free(ext_hunks); + return ret; } static const char *get_next_line(const char *start, const char *end) @@ -1961,7 +1994,8 @@ static void pass_blame_to_parent(struct blame_scoreboard *sb, &sb->num_read_blob, ignore_diffs); sb->num_get_patch++; - if (diff_hunks(&file_p, &file_o, blame_chunk_cb, &d, sb->xdl_opts)) + if (diff_hunks(&file_p, &file_o, blame_chunk_cb, &d, sb->xdl_opts, + sb->revs->diffopt.repo->index, target->path)) die("unable to generate diff (%s -> %s)", oid_to_hex(&parent->commit->object.oid), oid_to_hex(&target->commit->object.oid)); @@ -2114,7 +2148,8 @@ static void find_copy_in_blob(struct blame_scoreboard *sb, * file_p partially may match that image. */ memset(split, 0, sizeof(struct blame_entry [3])); - if (diff_hunks(file_p, &file_o, handle_split_cb, &d, sb->xdl_opts)) + if (diff_hunks(file_p, &file_o, handle_split_cb, &d, sb->xdl_opts, + NULL, NULL)) die("unable to generate diff (%s)", oid_to_hex(&parent->commit->object.oid)); /* remainder, if any, all match the preimage */ diff --git a/t/t4080-diff-process.sh b/t/t4080-diff-process.sh index 083e48e872..50f49a9b02 100755 --- a/t/t4080-diff-process.sh +++ b/t/t4080-diff-process.sh @@ -335,4 +335,36 @@ test_expect_success PYTHON 'diff process zero hunks suppresses diff output' ' test_must_be_empty actual ' +test_expect_success PYTHON 'blame skips commits with zero hunks from diff process' ' + cat >blame.c <<-\EOF && + int main(void) + { + return 0; + } + EOF + git add blame.c && + git commit -m "add blame.c" && + + cat >blame.c <<-\EOF && + int main(void) + { + return 0; + } + EOF + git add blame.c && + git commit -m "reformat blame.c" && + BLAME_COMMIT=$(git rev-parse --short HEAD) && + + # Without zero-hunk mode, blame attributes the change. + git blame blame.c >without && + test_grep "$BLAME_COMMIT" without && + + # With zero-hunk mode, the process considers the files equivalent + # and blame skips the reformat commit. + git -c diff.cdiff.process="$BACKEND --mode=zero-hunk" \ + blame blame.c >with && + ! test_grep "$BLAME_COMMIT" with +' + + test_done -- gitgitgadget ^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH v3 0/6] [RFC] diff: add diff.<driver>.process for external hunk providers 2026-05-25 18:29 ` [PATCH v2 0/4] " Michael Montalbo via GitGitGadget ` (3 preceding siblings ...) 2026-05-25 18:29 ` [PATCH v2 4/4] blame: consult diff process for zero-hunk detection Michael Montalbo via GitGitGadget @ 2026-05-29 20:48 ` Michael Montalbo via GitGitGadget 2026-05-29 20:48 ` [PATCH v3 1/6] xdiff: support external hunks via xpparam_t Michael Montalbo via GitGitGadget ` (6 more replies) 4 siblings, 7 replies; 30+ messages in thread From: Michael Montalbo via GitGitGadget @ 2026-05-29 20:48 UTC (permalink / raw) To: git; +Cc: Michael Montalbo Language-aware diff tools (e.g., Difftastic) and format-specific analyzers can produce better line matching than Git's builtin diff algorithm, but diff.<driver>.command replaces Git's output entirely, losing downstream features like word diff, function context, color, and blame. This series adds diff.<driver>.process, a long-running subprocess protocol that lets an external tool control which lines Git considers changed while Git handles all output formatting. The protocol follows filter.<driver>.process: pkt-line over stdin/stdout, capability negotiation, one process per Git invocation. The tool receives both file versions and returns changed regions (line ranges in the old and new file). Git validates and feeds them into the xdiff pipeline in place of the builtin diff algorithm. When the tool returns no hunks, Git treats the files as having no changes. The first two patches add xdiff plumbing for externally supplied hunks and the diff.<driver>.process config key. Patch 3 refactors the subprocess API to separate process lifecycle from hashmap management, since the diff process stores its subprocess on the userdiff driver rather than in a hashmap. The main feature lands in patch 4. Patch 5 wires up bypass knobs (--no-ext-diff, format-patch). Patch 6 integrates with blame so the tool can declare commits as having no changes. Changes since v2: * Extracted subprocess_start_command() from subprocess_start() so the diff process can reuse the startup machinery without a hashmap (new patch 3). * Subprocess stored on userdiff_driver, no global variables. * Differentiated error handling: tool failure warns (with tool name), "not configured" is silent, abort is voluntary. * Single public entry point: diff_process_fill_hunks() handles driver lookup, flag checks, subprocess management, and error reporting for both diff.c and blame.c. * Rewrote gitattributes protocol section to match filter process convention with packet diagrams and fully specified hunk format. * Split bypass knobs (--no-ext-diff, format-patch) into a separate commit. * SIGPIPE protection, stricter input validation, const correctness. * Changed "zero hunks" to "no hunks" throughout. Michael Montalbo (6): xdiff: support external hunks via xpparam_t userdiff: add diff.<driver>.process config sub-process: separate process lifecycle from hashmap management diff: add long-running diff process via diff.<driver>.process diff: bypass diff process with --no-ext-diff and in format-patch blame: consult diff process for no-hunk detection Documentation/config/diff.adoc | 5 + Documentation/diff-algorithm-option.adoc | 3 + Documentation/diff-options.adoc | 4 +- Documentation/gitattributes.adoc | 139 +++++ Makefile | 1 + blame.c | 40 +- builtin/log.c | 7 + diff-process.c | 288 ++++++++++ diff-process.h | 39 ++ diff.c | 29 +- diff.h | 5 + meson.build | 1 + sub-process.c | 29 +- sub-process.h | 9 +- t/.gitattributes | 1 + t/meson.build | 1 + t/t4080-diff-process.sh | 660 +++++++++++++++++++++++ userdiff.c | 7 + userdiff.h | 5 + xdiff-interface.c | 7 +- xdiff/xdiff.h | 13 + xdiff/xdiffi.c | 85 ++- xdiff/xprepare.c | 10 + xdiff/xprepare.h | 1 + 24 files changed, 1368 insertions(+), 21 deletions(-) create mode 100644 diff-process.c create mode 100644 diff-process.h create mode 100755 t/t4080-diff-process.sh base-commit: 94f057755b7941b321fd11fec1b2e3ca5313a4e0 Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-2120%2Fmmontalbo%2Fmm%2Fstructural-diff-backend-clean-v3 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-2120/mmontalbo/mm/structural-diff-backend-clean-v3 Pull-Request: https://github.com/gitgitgadget/git/pull/2120 Range-diff vs v2: 1: f887a7e2ba ! 1: 13eb201d63 xdiff: support external hunks via xpparam_t @@ Commit message numbers, overlapping or out-of-order hunks, negative counts, and violations of the synchronization invariant (unchanged line counts must match between files). On validation failure, fall back to - the builtin diff algorithm. + the builtin diff algorithm; this re-runs xdl_prepare_env() since + the first call may have dirtied the changed[] arrays. Skip trim_common_tail() in xdi_diff() when external hunks are present, since external hunks reference line numbers in the @@ xdiff/xdiff.h: typedef struct s_xpparam { char **anchors; size_t anchors_nr; + -+ /* Externally computed hunks: bypass the diff algorithm. */ -+ const struct xdl_hunk *external_hunks; ++ /* Externally computed hunks: bypass the diff algorithm. Owned by caller. */ ++ struct xdl_hunk *external_hunks; + size_t external_hunks_nr; } xpparam_t; @@ xdiff/xdiffi.c: static void xdl_mark_ignorable_regex(xdchange_t *xscr, const xdf + * Returns 0 on success, -1 on validation failure. + */ +static int xdl_populate_hunks_from_external(xdfenv_t *xe, -+ const struct xdl_hunk *hunks, ++ struct xdl_hunk *hunks, + size_t nr_hunks) +{ + size_t i; + long j, prev_old_end = 0, prev_new_end = 0; + long total_old = 0, total_new = 0; + ++ /* ++ * xdl_prepare_env() may dirty changed[] via xdl_cleanup_records(). ++ * Clear them so only the external hunks are marked. ++ */ + xdl_clear_changed(&xe->xdf1); + xdl_clear_changed(&xe->xdf2); + + for (i = 0; i < nr_hunks; i++) { -+ const struct xdl_hunk *h = &hunks[i]; ++ struct xdl_hunk *h = &hunks[i]; + + if (h->old_count < 0 || h->new_count < 0) + return -1; -+ -+ /* Bounds check (1-based line numbers) */ -+ if (h->old_count > 0 && -+ (h->old_start < 1 || -+ h->old_start + h->old_count - 1 > (long)xe->xdf1.nrec)) -+ return -1; -+ if (h->new_count > 0 && -+ (h->new_start < 1 || -+ h->new_start + h->new_count - 1 > (long)xe->xdf2.nrec)) ++ if (h->old_start < 1 || h->new_start < 1) + return -1; + -+ /* Zero-count hunks: start must still be in [1, nrec+1] */ -+ if (h->old_count == 0 && -+ (h->old_start < 1 || h->old_start > (long)xe->xdf1.nrec + 1)) ++ /* ++ * Range must fit: start + count - 1 <= nrec, ++ * rewritten to avoid overflow. Same for both sides. ++ * ++ * When count is 0 (pure insert/delete) the check ++ * reduces to 0 > nrec - start + 1, which rejects ++ * start > nrec + 1 and allows start == nrec + 1 ++ * (the position after the last line). ++ */ ++ if (h->old_count > (long)xe->xdf1.nrec - h->old_start + 1) + return -1; -+ if (h->new_count == 0 && -+ (h->new_start < 1 || h->new_start > (long)xe->xdf2.nrec + 1)) ++ if (h->new_count > (long)xe->xdf2.nrec - h->new_start + 1) + return -1; + -+ /* Ordering: no overlap with previous hunk */ ++ /* Ordering: no overlap with previous hunk (adjacent is OK) */ + if (h->old_start < prev_old_end || + h->new_start < prev_new_end) + return -1; @@ xdiff/xdiffi.c: static void xdl_mark_ignorable_regex(xdchange_t *xscr, const xdf - } + +diff_done: -+ if (xdl_change_compact(&xe.xdf1, &xe.xdf2, xpp->flags) < 0 || xdl_change_compact(&xe.xdf2, &xe.xdf1, xpp->flags) < 0 || xdl_build_script(&xe, &xscr) < 0) { 2: de6d85f9d7 ! 2: 58f4763c63 userdiff: add diff.<driver>.process config @@ Metadata ## Commit message ## userdiff: add diff.<driver>.process config - Add a new per-driver configuration key that specifies the command - for a long-running diff process. - - The name follows filter.<driver>.process: a long-running subprocess - that stays alive across files within a single git invocation. + Add the process field to struct userdiff_driver and teach the + config parser to populate it from diff.<driver>.process. Signed-off-by: Michael Montalbo <mmontalbo@gmail.com> -: ---------- > 3: d6c833dd42 sub-process: separate process lifecycle from hashmap management 3: c25647c6e5 ! 4: d044fa0ee5 diff: add long-running diff process via diff.<driver>.process @@ Commit message process = /path/to/diff-tool The tool provides custom line-matching: it receives file pairs - and returns hunks that reference original line numbers. Unlike - textconv, which transforms the displayed content, the diff - output shows the actual file while the tool controls which - lines are marked as changed. + and returns hunks that reference line numbers in the content. + When textconv is also configured, the tool receives the + textconv-transformed content. The tool controls which lines + are marked as changed while the display shows the file content. + Patch output features (word diff, function context, color) work + normally; summary formats like --stat use their own diff path + and are not affected. The handshake negotiates version=1 and capability=hunks. Per-file requests send command=hunks, pathname, and both file contents as packetized data. The tool responds with hunk lines and a status - packet. On error, git falls back to the builtin diff algorithm - with a warning. + packet (success, error, or abort). On error, Git warns and falls + back to the builtin diff algorithm for that file. On abort, Git + silently falls back for the current file and stops sending further + requests to the tool for the remainder of the session. - Zero hunks with status=success means the tool considers the - files equivalent. Git skips diff output for that file. + When the tool returns no hunks followed by status=success, Git + treats the file as having no changes and produces no diff output. + This also means --exit-code reports no changes for that file. + + The subprocess is stored on the userdiff_driver struct and + launched on first use. If the process fails to start, the + handshake fails, or a communication error occurs mid-stream, + the failure is cached on the driver to avoid retrying and + re-warning on every subsequent file. + + diff_process_fill_hunks() is the sole public entry point. It + handles driver lookup, flag checks, subprocess management, and + error reporting, returning an enum that lets callers distinguish + "hunks populated" from "files equivalent" from "not applicable" + from "tool failure." Signed-off-by: Michael Montalbo <mmontalbo@gmail.com> @@ Documentation/config/diff.adoc: endif::git-diff[] conversion outputs. See linkgit:gitattributes[5] for details. +`diff.<driver>.process`:: -+ The command to run as a long-running diff process. -+ The tool communicates via the pkt-line protocol and returns -+ hunks that are fed into Git's diff and blame pipelines. -+ If the tool returns zero hunks, the file is treated as -+ unchanged for both diff output and blame attribution. ++ The command to run as a long-running diff process that ++ provides hunks to Git's diff pipeline. + See linkgit:gitattributes[5] for details. + `diff.indentHeuristic`:: @@ Documentation/gitattributes.adoc: NOTE: If `diff.<name>.command` is defined for algorithm is not passed to the external diff driver. +Using an external diff process -+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -+ -+An external tool can provide content-aware line matching by -+setting `diff.<name>.process` to the command that runs -+the tool. The tool is a long-running process that communicates via -+the pkt-line protocol (described in ++^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ++ ++If `diff.<name>.process` is defined, Git sends the old and new file ++content to an external tool and receives back a list of changed ++regions (pairs of line ranges in the old and new file). Git uses ++these instead of its builtin diff algorithm, but still controls ++all output formatting, so features like word diff, function context, ++color, and blame work normally. This is achieved by using the ++long-running process protocol (described in +Documentation/technical/long-running-process-protocol.adoc). ++Unlike `diff.<name>.command`, which replaces Git's output entirely, ++the diff process feeds results back into the standard pipeline. ++ ++First, in `.gitattributes`, assign the `diff` attribute for paths. + +------------------------ +*.c diff=cdiff +------------------------ + ++Then, define a "diff.<name>.process" configuration to specify ++the diff process command. ++ +---------------------------------------------------------------- +[diff "cdiff"] + process = /path/to/diff-process-tool +---------------------------------------------------------------- + -+The tool receives file pairs and returns hunk descriptors indicating -+which lines changed. Git feeds these hunks into its standard diff -+pipeline, so all output features (word diff, function context, -+color) work normally. -+ -+If the tool fails or returns an error, Git silently falls back to -+the builtin diff algorithm. If the tool returns invalid hunks -+(out of bounds, overlapping), Git also falls back silently. -+ -+The handshake negotiates `version=1` and `capability=hunks`. -+Per-file requests send `command=hunks` and `pathname=<path>`, -+followed by the old and new file content as packetized data. -+The tool responds with lines of the form -+`hunk <old_start> <old_count> <new_start> <new_count>` -+(1-based line numbers), a flush packet, and `status=success`. -+ -+If the tool returns zero hunks with `status=success`, Git treats -+the file as having no changes and produces no diff output. -+ -+Tools should ignore unknown keys in the per-file request to -+remain forward-compatible. ++When Git encounters the first file that needs to be diffed, it starts ++the process and performs the handshake. In the handshake, the welcome ++message sent by Git is "git-diff-client", only version 1 is supported, ++and the supported capability is "hunks" (the changed regions ++described below). ++ ++For each file, Git sends a list of "key=value" pairs terminated with ++a flush packet, followed by the old and new file content as packetized ++data, each terminated with a flush packet. The pathname is relative ++to the repository root. When `diff.<name>.textconv` is also set, ++the tool receives the textconv-transformed content rather than the ++raw blob. Git does not send binary files to the diff process. ++ ++----------------------- ++packet: git> command=hunks ++packet: git> pathname=path/file.c ++packet: git> 0000 ++packet: git> OLD_CONTENT ++packet: git> 0000 ++packet: git> NEW_CONTENT ++packet: git> 0000 ++----------------------- ++ ++The tool is expected to respond with zero or more hunk lines, ++a flush packet, and a status packet terminated with a flush packet. ++Each hunk line has the form: ++ ++ `hunk <old_start> <old_count> <new_start> <new_count>` ++ ++where `<old_start>` and `<old_count>` identify a range of lines in ++the old file, and `<new_start>` and `<new_count>` identify the ++replacement range in the new file. Start values are 1-based and ++counts are non-negative. Ranges must not extend beyond the end of ++the file. For example, `hunk 3 2 3 4` means that 2 lines starting ++at line 3 in the old file were replaced by 4 lines starting at ++line 3 in the new file. An `<old_count>` of 0 means no lines were ++removed (pure insertion); a `<new_count>` of 0 means no lines were ++added (pure deletion). ++ ++Lines are delimited by newlines. A file `"foo\nbar\n"` and a ++file `"foo\nbar"` both have 2 lines. ++ ++Hunks must be listed in order and must not overlap. Any line ++not covered by a hunk is treated as unchanged, so the total ++number of unchanged lines must be the same on both sides. ++For example, if the old file has 10 lines and the hunks cover ++4 of them (`old_count` values summing to 4), then 6 old lines ++are unchanged. The new file must also have exactly 6 lines ++not covered by hunks, so the `new_count` values must sum to ++`new_file_lines - 6`. ++ ++----------------------- ++packet: git< hunk 1 3 1 5 ++packet: git< hunk 10 2 12 2 ++packet: git< 0000 ++packet: git< status=success ++packet: git< 0000 ++----------------------- ++ ++If the tool responds with hunks and "success", Git marks those lines ++as changed and feeds them into the standard diff pipeline. Patch ++output features (word diff, function context, color) work normally. ++Note that `--stat` and other summary formats use their own diff path ++and are not affected by the diff process. ++ ++If no hunk lines precede the flush, followed by "success", Git ++treats the files as having no changes: `git diff` produces no output ++and `git blame` skips the commit, attributing lines to earlier commits. ++ ++----------------------- ++packet: git< 0000 ++packet: git< status=success ++packet: git< 0000 ++----------------------- ++ ++If the tool returns invalid hunks (out of bounds, overlapping), Git ++silently falls back to the builtin diff algorithm. ++ ++In case the tool cannot or does not want to process the content, ++it is expected to respond with an "error" status. Git warns and ++falls back to the builtin diff algorithm for this file. The tool ++remains available for subsequent files. ++ ++----------------------- ++packet: git< 0000 ++packet: git< status=error ++packet: git< 0000 ++----------------------- ++ ++In case the tool cannot or does not want to process the content as ++well as any future content for the lifetime of the Git process, it ++is expected to respond with an "abort" status. Git silently falls ++back to the builtin diff algorithm for this file and does not send ++further requests to the tool. ++ ++----------------------- ++packet: git< 0000 ++packet: git< status=abort ++packet: git< 0000 ++----------------------- ++ ++If the tool dies during the communication or does not adhere to the ++protocol then Git will stop the process and fall back to the builtin ++diff algorithm. Git warns once and does not restart the process for ++subsequent files. ++ ++Tools should ignore unknown keys in the per-file request to remain ++forward-compatible. Future versions of Git may send additional ++`command=` values; tools that receive an unrecognized command should ++respond with `status=error` rather than terminating. + Defining a custom hunk-header ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ diff-process.c (new) +/* + * Diff process backend: communicates with a long-running external + * tool via the pkt-line protocol to obtain custom line-matching -+ * results. Unlike textconv, which transforms the displayed content, -+ * hunks from a diff process reference original line numbers and -+ * the display shows the actual file content. ++ * results. The tool controls which lines are marked as changed ++ * while the display shows the file content (after any textconv ++ * transformation, if configured). + * + * Protocol: pkt-line over stdin/stdout, following the pattern of + * the long-running filter process protocol (see convert.c). @@ diff-process.c (new) + * tool< ... / flush + * tool< status=success / flush + * -+ * Zero hunks with status=success means the tool considers the -+ * files equivalent. Git will skip the diff for that file. ++ * When the tool returns no hunks with status=success, it considers ++ * the files equivalent. Git will skip the diff for that file. + */ + +#include "git-compat-util.h" +#include "diff-process.h" ++#include "diff.h" ++#include "gettext.h" ++#include "repository.h" ++#include "sigchain.h" +#include "userdiff.h" +#include "sub-process.h" +#include "pkt-line.h" @@ diff-process.c (new) + unsigned int supported_capabilities; +}; + -+static int subprocess_map_initialized; -+static struct hashmap subprocess_map; -+ +static int start_diff_process_fn(struct subprocess_entry *subprocess) +{ + static int versions[] = { 1, 0 }; @@ diff-process.c (new) + { NULL, 0 } + }; + struct diff_subprocess *entry = -+ (struct diff_subprocess *)subprocess; ++ container_of(subprocess, struct diff_subprocess, subprocess); + -+ /* Uses dying pkt-line variant, same as convert.c filters. */ + return subprocess_handshake(subprocess, "git-diff", + versions, NULL, + capabilities, + &entry->supported_capabilities); +} + -+static struct diff_subprocess *find_or_start_process(const char *cmd) ++static struct diff_subprocess *get_or_launch_process( ++ struct userdiff_driver *drv) +{ + struct diff_subprocess *entry; + -+ if (!subprocess_map_initialized) { -+ subprocess_map_initialized = 1; -+ hashmap_init(&subprocess_map, cmd2process_cmp, NULL, 0); -+ } -+ -+ entry = (struct diff_subprocess *) -+ subprocess_find_entry(&subprocess_map, cmd); -+ if (entry) -+ return entry; ++ if (drv->diff_subprocess) ++ return drv->diff_subprocess; + + entry = xcalloc(1, sizeof(*entry)); -+ if (subprocess_start(&subprocess_map, &entry->subprocess, -+ cmd, start_diff_process_fn)) { ++ if (subprocess_start_command(&entry->subprocess, drv->process, ++ start_diff_process_fn)) { + free(entry); ++ drv->diff_process_failed = 1; + return NULL; + } + ++ drv->diff_subprocess = entry; + return entry; +} + +static int send_file_content(int fd, const char *buf, long size) +{ -+ int ret; ++ int ret = 0; + ++ if (size < 0) ++ return -1; + if (size > 0) + ret = write_packetized_from_buf_no_flush(buf, size, fd); -+ else -+ ret = 0; + if (ret) + return ret; + return packet_flush_gently(fd); @@ diff-process.c (new) +{ + char *end; + -+ /* Format: "hunk <old_start> <old_count> <new_start> <new_count>" */ ++ /* ++ * Format: "hunk <old_start> <old_count> <new_start> <new_count>" ++ * All numbers must be non-negative decimal with no leading ++ * whitespace or sign characters. ++ */ + if (!skip_prefix(line, "hunk ", &line)) + return -1; + ++ if (!isdigit(*line)) ++ return -1; ++ errno = 0; + hunk->old_start = strtol(line, &end, 10); -+ if (end == line || *end != ' ') ++ if (errno || end == line || *end++ != ' ') + return -1; + line = end; + ++ if (!isdigit(*line)) ++ return -1; ++ errno = 0; + hunk->old_count = strtol(line, &end, 10); -+ if (end == line || *end != ' ') ++ if (errno || end == line || *end++ != ' ') + return -1; + line = end; + ++ if (!isdigit(*line)) ++ return -1; ++ errno = 0; + hunk->new_start = strtol(line, &end, 10); -+ if (end == line || *end != ' ') ++ if (errno || end == line || *end++ != ' ') + return -1; + line = end; + ++ if (!isdigit(*line)) ++ return -1; ++ errno = 0; + hunk->new_count = strtol(line, &end, 10); -+ if (end == line || *end != '\0') ++ if (errno || end == line || *end != '\0') + return -1; + + return 0; +} + -+int diff_process_get_hunks(struct userdiff_driver *drv, -+ const char *path, -+ const char *old_buf, long old_size, -+ const char *new_buf, long new_size, -+ struct xdl_hunk **hunks_out, -+ size_t *nr_hunks_out) ++static enum diff_process_result get_hunks( ++ struct userdiff_driver *drv, ++ const char *path, ++ const char *old_buf, long old_size, ++ const char *new_buf, long new_size, ++ struct xdl_hunk **hunks_out, ++ size_t *nr_hunks_out) +{ + struct diff_subprocess *backend; + struct child_process *process; @@ diff-process.c (new) + int len; + char *line; + -+ if (!drv || !drv->process) -+ return -1; -+ -+ backend = find_or_start_process(drv->process); ++ backend = get_or_launch_process(drv); + if (!backend) -+ return -1; ++ return DIFF_PROCESS_ERROR; + + if (!(backend->supported_capabilities & CAP_HUNKS)) -+ return -1; ++ return DIFF_PROCESS_SKIP; + + process = subprocess_get_child_process(&backend->subprocess); + fd_in = process->in; + fd_out = process->out; + ++ sigchain_push(SIGPIPE, SIG_IGN); ++ + /* Send request */ + if (packet_write_fmt_gently(fd_in, "command=hunks\n") || + packet_write_fmt_gently(fd_in, "pathname=%s\n", path) || + packet_flush_gently(fd_in)) -+ goto error; ++ goto comm_error; + + /* Send old file content */ + if (send_file_content(fd_in, old_buf, old_size)) -+ goto error; ++ goto comm_error; + + /* Send new file content */ + if (send_file_content(fd_in, new_buf, new_size)) -+ goto error; ++ goto comm_error; + + /* Read hunks until flush packet */ + while ((len = packet_read_line_gently(fd_out, NULL, &line)) >= 0 && + line) { + if (parse_hunk_line(line, &hunk) < 0) -+ goto error; ++ goto comm_error; + ALLOC_GROW(hunks, nr_hunks + 1, alloc_hunks); + hunks[nr_hunks++] = hunk; + } + if (len < 0) -+ goto error; ++ goto comm_error; + + /* Read status */ + if (subprocess_read_status(fd_out, &status)) -+ goto error; ++ goto comm_error; ++ ++ if (!strcmp(status.buf, "success")) { ++ *hunks_out = hunks; ++ *nr_hunks_out = nr_hunks; ++ strbuf_release(&status); ++ sigchain_pop(SIGPIPE); ++ return DIFF_PROCESS_OK; ++ } + -+ if (strcmp(status.buf, "success")) { -+ if (!strcmp(status.buf, "abort")) -+ backend->supported_capabilities &= ~CAP_HUNKS; -+ goto error; ++ if (!strcmp(status.buf, "abort")) { ++ /* ++ * The tool voluntarily withdrew: stop sending requests ++ * but do not warn (this is not a failure). ++ */ ++ backend->supported_capabilities &= ~CAP_HUNKS; ++ free(hunks); ++ strbuf_release(&status); ++ sigchain_pop(SIGPIPE); ++ return DIFF_PROCESS_SKIP; + } + -+ *hunks_out = hunks; -+ *nr_hunks_out = nr_hunks; ++ /* status=error or unknown status */ ++ free(hunks); + strbuf_release(&status); -+ return 0; -+ -+error: ++ sigchain_pop(SIGPIPE); ++ return DIFF_PROCESS_ERROR; ++ ++comm_error: ++ /* ++ * Communication failure (broken pipe, malformed response). ++ * Tear down the process and mark as failed so we do not ++ * retry on every subsequent file. ++ */ ++ drv->diff_process_failed = 1; ++ drv->diff_subprocess = NULL; ++ subprocess_stop_command(&backend->subprocess); ++ free(backend); + free(hunks); + strbuf_release(&status); -+ return -1; ++ sigchain_pop(SIGPIPE); ++ return DIFF_PROCESS_ERROR; ++} ++ ++enum diff_process_result diff_process_fill_hunks( ++ struct diff_options *diffopt, ++ const char *path, ++ const mmfile_t *file_a, ++ const mmfile_t *file_b, ++ xpparam_t *xpp) ++{ ++ struct userdiff_driver *drv; ++ struct xdl_hunk *ext_hunks = NULL; ++ size_t nr = 0; ++ enum diff_process_result res; ++ ++ if (!diffopt || !path) ++ return DIFF_PROCESS_SKIP; ++ if (diffopt->flags.no_diff_process || diffopt->ignore_driver_algorithm) ++ return DIFF_PROCESS_SKIP; ++ ++ drv = userdiff_find_by_path(diffopt->repo->index, path); ++ if (!drv || !drv->process) ++ return DIFF_PROCESS_SKIP; ++ if (drv->diff_process_failed) ++ return DIFF_PROCESS_SKIP; ++ ++ res = get_hunks(drv, path, ++ file_a->ptr, file_a->size, ++ file_b->ptr, file_b->size, ++ &ext_hunks, &nr); ++ if (res == DIFF_PROCESS_OK) { ++ if (!nr) { ++ free(ext_hunks); ++ return DIFF_PROCESS_EQUIVALENT; ++ } ++ xpp->external_hunks = ext_hunks; ++ xpp->external_hunks_nr = nr; ++ return DIFF_PROCESS_OK; ++ } ++ if (res == DIFF_PROCESS_ERROR) { ++ warning(_("diff process '%s' failed for '%s'," ++ " falling back to builtin diff"), ++ drv->process, path); ++ return DIFF_PROCESS_ERROR; ++ } ++ return DIFF_PROCESS_SKIP; +} ## diff-process.h (new) ## @@ diff-process.h (new) +#ifndef DIFF_PROCESS_H +#define DIFF_PROCESS_H + -+struct userdiff_driver; -+struct xdl_hunk; ++#include "xdiff/xdiff.h" ++ ++struct diff_options; ++ ++enum diff_process_result { ++ DIFF_PROCESS_ERROR = -1, /* tool failure: warned, fell back */ ++ DIFF_PROCESS_OK = 0, /* hunks populated in xpp */ ++ DIFF_PROCESS_SKIP, /* no process configured: use builtin */ ++ DIFF_PROCESS_EQUIVALENT, /* tool says files are equivalent */ ++}; + +/* -+ * Query a diff process for hunks describing the changes -+ * between old_buf and new_buf. ++ * Consult the diff process configured for 'path' and populate ++ * xpp->external_hunks with the returned hunks. + * -+ * The backend is a long-running subprocess configured via -+ * diff.<driver>.process. It receives file content via -+ * pkt-line and returns hunks with 1-based line numbers. ++ * Handles driver lookup, flag checks (--no-ext-diff, ++ * --diff-algorithm), subprocess management, and error reporting. + * -+ * On success, sets *hunks_out and *nr_hunks_out to a newly allocated -+ * array (caller must free) and returns 0. ++ * Returns DIFF_PROCESS_OK when hunks are populated in xpp. ++ * The caller owns xpp->external_hunks and must free() it. + * -+ * On failure, returns -1. The caller should fall back to the -+ * builtin diff algorithm. ++ * Returns DIFF_PROCESS_EQUIVALENT when the tool returns no hunks ++ * (files are considered identical); caller should skip diff/blame. ++ * Returns DIFF_PROCESS_SKIP when no process applies; caller ++ * should use the builtin diff algorithm. ++ * Returns DIFF_PROCESS_ERROR on tool failure (already warned); ++ * caller should fall back to the builtin diff algorithm. + */ -+int diff_process_get_hunks(struct userdiff_driver *drv, -+ const char *path, -+ const char *old_buf, long old_size, -+ const char *new_buf, long new_size, -+ struct xdl_hunk **hunks_out, -+ size_t *nr_hunks_out); ++enum diff_process_result diff_process_fill_hunks( ++ struct diff_options *diffopt, ++ const char *path, ++ const mmfile_t *file_a, ++ const mmfile_t *file_b, ++ xpparam_t *xpp); + +#endif /* DIFF_PROCESS_H */ @@ diff.c #include "submodule.h" #include "hashmap.h" #include "mem-pool.h" -@@ diff.c: static void builtin_diff(const char *name_a, - xpparam_t xpp; - xdemitconf_t xecfg; - struct emit_callback ecbdata; -+ struct xdl_hunk *ext_hunks = NULL; - unsigned ws_rule; - const struct userdiff_funcname *pe; - @@ diff.c: static void builtin_diff(const char *name_a, xpp.ignore_regex_nr = o->ignore_regex_nr; xpp.anchors = o->anchors; xpp.anchors_nr = o->anchors_nr; + -+ if (!o->ignore_driver_algorithm && -+ one->driver && one->driver->process) { -+ size_t ext_hunks_nr = 0; -+ if (!diff_process_get_hunks( -+ one->driver, name_a, -+ mf1.ptr, mf1.size, -+ mf2.ptr, mf2.size, -+ &ext_hunks, &ext_hunks_nr)) { -+ if (!ext_hunks_nr) -+ goto free_ab_and_return; -+ xpp.external_hunks = ext_hunks; -+ xpp.external_hunks_nr = ext_hunks_nr; -+ } else { -+ warning(_("diff process failed for '%s'," -+ " falling back to builtin diff"), -+ name_a); -+ } ++ if (diff_process_fill_hunks(o, name_a, ++ &mf1, &mf2, &xpp) ++ == DIFF_PROCESS_EQUIVALENT) { ++ if (textconv_one) ++ free(mf1.ptr); ++ if (textconv_two) ++ free(mf2.ptr); ++ goto free_ab_and_return; + } + xecfg.ctxlen = o->context; @@ diff.c: static void builtin_diff(const char *name_a, } else if (xdi_diff_outf(&mf1, &mf2, NULL, fn_out_consume, &ecbdata, &xpp, &xecfg)) die("unable to generate diff for %s", one->path); -+ free(ext_hunks); ++ free(xpp.external_hunks); if (o->word_diff) free_diff_words_data(&ecbdata); if (textconv_one) + ## diff.h ## +@@ diff.h: struct diff_flags { + */ + unsigned allow_external; + ++ /** Disables diff.<driver>.process. */ ++ unsigned no_diff_process; ++ + /** + * For communication between the calling program and the options parser; + * tell the calling program to signal the presence of difference using + + ## meson.build ## +@@ meson.build: libgit_sources = [ + 'diff-merges.c', + 'diff-lib.c', + 'diff-no-index.c', ++ 'diff-process.c', + 'diff.c', + 'diffcore-break.c', + 'diffcore-delta.c', + ## t/.gitattributes ## @@ t/.gitattributes: t[0-9][0-9][0-9][0-9]/* -whitespace /t8005/*.txt eol=lf @@ t/.gitattributes: t[0-9][0-9][0-9][0-9]/* -whitespace /t0040*.sh whitespace=-indent-with-non-tab +/t4080-diff-process.sh whitespace=-indent-with-non-tab + ## t/meson.build ## +@@ t/meson.build: integration_tests = [ + 't4072-diff-max-depth.sh', + 't4073-diff-stat-name-width.sh', + 't4074-diff-shifted-matched-group.sh', ++ 't4080-diff-process.sh', + 't4100-apply-stat.sh', + 't4101-apply-nonl.sh', + 't4102-apply-rename.sh', + ## t/t4080-diff-process.sh (new) ## @@ +#!/bin/sh @@ t/t4080-diff-process.sh (new) +# whole-file - report all lines as changed (default) +# fixed-hunk - always report hunk 5 2 5 2 +# bad-hunk - report out-of-bounds hunk 999 1 999 1 -+# zero-hunk - return zero hunks (files considered equivalent) ++# bad-sync - report hunk with mismatched unchanged totals ++# overlap - report two overlapping hunks ++# no-hunks - return no hunks (files considered equivalent) +# error - return status=error for every request +# abort - return status=abort for every request +# crash - read one request then exit without responding @@ t/t4080-diff-process.sh (new) + if cmd is None: sys.exit(0) + old = read_content() + new = read_content() -+ log(f"command={cmd} pathname={pathname}") ++ old_first = old.split(b"\n")[0].decode(errors="replace") if old else "" ++ new_first = new.split(b"\n")[0].decode(errors="replace") if new else "" ++ log(f"command={cmd} pathname={pathname} old={old_first} new={new_first}") + + if mode == "error": + write_flush() @@ t/t4080-diff-process.sh (new) + write_pkt("hunk 5 2 5 2") + elif mode == "bad-hunk": + write_pkt("hunk 999 1 999 1") -+ elif mode == "zero-hunk": ++ elif mode == "bad-sync": ++ write_pkt("hunk 1 2 1 1") ++ elif mode == "overlap": ++ write_pkt("hunk 1 5 1 5") ++ write_pkt("hunk 3 2 3 2") ++ elif mode == "no-hunks": + pass + else: -+ ol = len(old.split(b"\n")) -+ nl = len(new.split(b"\n")) ++ ol = old.count(b"\n") ++ nl = new.count(b"\n") + write_pkt(f"hunk 1 {ol} 1 {nl}") + write_flush() + write_pkt("status=success") @@ t/t4080-diff-process.sh (new) + setup_backend && + echo "*.c diff=cdiff" >.gitattributes && + git add .gitattributes && -+ git commit -m "initial" -+' + -+test_expect_success PYTHON 'diff process hunk boundaries affect output' ' ++ # boundary.c: 10 lines, changes at 5-6 and 9-10. ++ # Used by: hunk boundaries, error fallback, crash, bad hunks, overlap. + cat >boundary.c <<-\EOF && + line1 + line2 @@ t/t4080-diff-process.sh (new) + OLD10 + EOF + git add boundary.c && -+ git commit -m "add boundary.c" && + ++ # worddiff.c: single-line function, value changes 1 -> 999. ++ # Used by: word-diff, --diff-algorithm, --no-ext-diff, --stat. ++ cat >worddiff.c <<-\EOF && ++ int value(void) { return 1; } ++ EOF ++ git add worddiff.c && ++ ++ # newfile.c: single-line function, value changes 42 -> 99. ++ # Used by: new file, --exit-code, multiple drivers. ++ cat >newfile.c <<-\EOF && ++ int new_func(void) { return 42; } ++ EOF ++ git add newfile.c && ++ ++ # logtest.c: single-line function for log/format-patch tests. ++ # Needs two commits so log -1 has a diff. ++ cat >logtest.c <<-\EOF && ++ int logfunc(void) { return 1; } ++ EOF ++ git add logtest.c && ++ ++ # two.c/one.c: two-file pair for error/abort/startup-failure tests. ++ cat >one.c <<-\EOF && ++ int first(void) { return 1; } ++ EOF ++ cat >two.c <<-\EOF && ++ int second(void) { return 2; } ++ EOF ++ git add one.c two.c && ++ ++ git commit -m "initial" && ++ ++ # Second commit for logtest.c (so log -1 has something to show). ++ cat >logtest.c <<-\EOF && ++ int logfunc(void) { return 2; } ++ EOF ++ git add logtest.c && ++ git commit -m "change logtest.c" && ++ ++ # Working tree modifications (not committed). + cat >boundary.c <<-\EOF && + line1 + line2 @@ t/t4080-diff-process.sh (new) + NEW10 + EOF + ++ cat >worddiff.c <<-\EOF && ++ int value(void) { return 999; } ++ EOF ++ ++ cat >newfile.c <<-\EOF && ++ int new_func(void) { return 99; } ++ EOF ++ ++ cat >one.c <<-\EOF && ++ int first(void) { return 10; } ++ EOF ++ ++ cat >two.c <<-\EOF ++ int second(void) { return 20; } ++ EOF ++' ++ ++# ++# Core behavior: the tool controls which lines are marked as changed. ++# ++ ++test_expect_success PYTHON 'diff process hunk boundaries affect output' ' + # The file has changes at lines 5-6 and 9-10, but fixed-hunk + # only reports lines 5-6 as changed. Lines 9-10 should not + # appear as changed in the output. @@ t/t4080-diff-process.sh (new) + test_grep "^-OLD6" actual && + test_grep "^+NEW5" actual && + test_grep "^+NEW6" actual && -+ ! test_grep "^-OLD9" actual && -+ ! test_grep "^-OLD10" actual && -+ ! test_grep "^+NEW9" actual && -+ ! test_grep "^+NEW10" actual ++ test_grep ! "^-OLD9" actual && ++ test_grep ! "^-OLD10" actual && ++ test_grep ! "^+NEW9" actual && ++ test_grep ! "^+NEW10" actual +' + -+test_expect_success PYTHON 'diff process fallback on tool error status' ' ++test_expect_success PYTHON 'diff process works with new file' ' + rm -f backend.log && -+ git -c diff.cdiff.process="$BACKEND --mode=error --log=backend.log" \ -+ diff boundary.c >actual && -+ # Fallback produces the full builtin diff (both change regions). -+ test_grep "^-OLD5" actual && -+ test_grep "^+NEW5" actual && -+ test_grep "^-OLD9" actual && -+ test_grep "^+NEW9" actual && -+ # Tool was contacted (it replied with error, not crash). -+ test_grep "command=hunks pathname=boundary.c" backend.log ++ git -c diff.cdiff.process="$BACKEND --log=backend.log" \ ++ diff -- newfile.c >actual 2>stderr && ++ test_grep "return 99" actual && ++ test_grep "pathname=newfile.c" backend.log && ++ test_must_be_empty stderr +' + -+test_expect_success PYTHON 'diff process fallback on bad hunks' ' -+ git -c diff.cdiff.process="$BACKEND --mode=bad-hunk" \ -+ diff boundary.c >actual && -+ test_grep "^-OLD5" actual && -+ test_grep "^+NEW5" actual && -+ test_grep "^-OLD9" actual && -+ test_grep "^+NEW9" actual ++test_expect_success PYTHON 'diff process works with added file (empty old side)' ' ++ cat >added.c <<-\EOF && ++ int added(void) { return 1; } ++ EOF ++ git add added.c && ++ ++ rm -f backend.log && ++ git -c diff.cdiff.process="$BACKEND --log=backend.log" \ ++ diff --cached -- added.c >actual 2>stderr && ++ test_grep "added" actual && ++ test_grep "pathname=added.c" backend.log && ++ test_must_be_empty stderr +' + -+test_expect_success PYTHON 'diff process fallback on tool crash' ' -+ git -c diff.cdiff.process="$BACKEND --mode=crash" \ -+ diff boundary.c >actual && -+ test_grep "^-OLD5" actual && -+ test_grep "^+NEW5" actual && -+ test_grep "^-OLD9" actual && -+ test_grep "^+NEW9" actual ++test_expect_success PYTHON 'diff process skipped for binary files' ' ++ printf "\\0binary" >binary.c && ++ git add binary.c && ++ git commit -m "add binary" && ++ printf "\\0changed" >binary.c && ++ ++ rm -f backend.log && ++ git -c diff.cdiff.process="$BACKEND --log=backend.log" \ ++ diff -- binary.c >actual && ++ test_grep "Binary files" actual && ++ test_path_is_missing backend.log +' + -+test_expect_success PYTHON 'diff process abort disables for session' ' -+ cat >abort1.c <<-\EOF && -+ int first(void) { return 1; } -+ EOF -+ cat >abort2.c <<-\EOF && -+ int second(void) { return 2; } -+ EOF -+ git add abort1.c abort2.c && -+ git commit -m "add abort files" && ++test_expect_success PYTHON 'diff process not consulted for unmatched driver' ' ++ echo "not tracked by cdiff" >unmatched.txt && ++ git add unmatched.txt && ++ git commit -m "add unmatched.txt" && + -+ cat >abort1.c <<-\EOF && -+ int first(void) { return 10; } -+ EOF -+ cat >abort2.c <<-\EOF && -+ int second(void) { return 20; } -+ EOF ++ echo "modified" >unmatched.txt && + + rm -f backend.log && -+ git -c diff.cdiff.process="$BACKEND --mode=abort --log=backend.log" \ -+ diff -- abort1.c abort2.c >actual && -+ # Both files should still produce diff output via fallback. -+ test_grep "return 10" actual && -+ test_grep "return 20" actual && -+ # The tool aborts on the first file and git clears its -+ # capability. The second file never contacts the tool, -+ # so the log should have exactly one entry, not two. -+ test_grep "command=hunks" backend.log >matches && -+ test_line_count = 1 matches ++ git -c diff.cdiff.process="$BACKEND --log=backend.log" \ ++ diff -- unmatched.txt >actual && ++ test_grep "modified" actual && ++ test_path_is_missing backend.log +' + -+test_expect_success PYTHON 'diff process handles multiple files' ' -+ cat >multi1.c <<-\EOF && -+ int one(void) { return 1; } ++test_expect_success PYTHON 'multiple drivers use separate processes' ' ++ echo "*.h diff=hdiff" >>.gitattributes && ++ git add .gitattributes && ++ ++ cat >multi.h <<-\EOF && ++ int header(void) { return 1; } + EOF -+ cat >multi2.c <<-\EOF && -+ int two(void) { return 2; } ++ git add multi.h && ++ git commit -m "add multi.h" && ++ ++ cat >multi.h <<-\EOF && ++ int header(void) { return 2; } + EOF -+ git add multi1.c multi2.c && -+ git commit -m "add multi files" && + -+ cat >multi1.c <<-\EOF && -+ int one(void) { return 10; } ++ rm -f backend-c.log backend-h.log && ++ git -c diff.cdiff.process="$BACKEND --log=backend-c.log" \ ++ -c diff.hdiff.process="$BACKEND --log=backend-h.log" \ ++ diff -- newfile.c multi.h >actual 2>stderr && ++ test_grep "pathname=newfile.c" backend-c.log && ++ test_grep "pathname=multi.h" backend-h.log && ++ test_must_be_empty stderr ++' ++ ++test_expect_success PYTHON 'diff process works alongside textconv' ' ++ write_script uppercase-filter <<-\EOF && ++ tr "a-z" "A-Z" <"$1" + EOF -+ cat >multi2.c <<-\EOF && -+ int two(void) { return 20; } ++ ++ cat >textconv.c <<-\EOF && ++ hello world ++ EOF ++ git add textconv.c && ++ git commit -m "add textconv.c" && ++ ++ cat >textconv.c <<-\EOF && ++ goodbye world + EOF + + rm -f backend.log && -+ git -c diff.cdiff.process="$BACKEND --log=backend.log" \ -+ diff -- multi1.c multi2.c >actual && -+ test_grep "return 10" actual && -+ test_grep "return 20" actual && -+ test_grep "pathname=multi1.c" backend.log && -+ test_grep "pathname=multi2.c" backend.log ++ git -c diff.cdiff.textconv="./uppercase-filter" \ ++ -c diff.cdiff.process="$BACKEND --log=backend.log" \ ++ diff -- textconv.c >actual 2>stderr && ++ # The diff process receives textconv-transformed (uppercase) content. ++ test_grep "pathname=textconv.c" backend.log && ++ test_grep "old=HELLO WORLD" backend.log && ++ test_grep "new=GOODBYE WORLD" backend.log && ++ test_must_be_empty stderr +' + ++# ++# Downstream features: word diff, log, equivalent files, exit code. ++# ++ +test_expect_success PYTHON 'diff process with --word-diff' ' -+ cat >worddiff.c <<-\EOF && -+ int value(void) { return 1; } ++ rm -f backend.log && ++ git -c diff.cdiff.process="$BACKEND --log=backend.log" \ ++ diff --word-diff worddiff.c >actual 2>stderr && ++ test_grep "\[-1;-\]" actual && ++ test_grep "{+999;+}" actual && ++ test_grep "pathname=worddiff.c" backend.log && ++ test_must_be_empty stderr ++' ++ ++test_expect_success PYTHON 'diff process works with git log -p' ' ++ # With no-hunks mode, the tool says the files are equivalent, ++ # so log -p should show the commit but no diff content. ++ rm -f backend.log && ++ git -c diff.cdiff.process="$BACKEND --mode=no-hunks --log=backend.log" \ ++ log -1 -p -- logtest.c >actual 2>stderr && ++ test_grep "change logtest.c" actual && ++ test_grep ! "return 2" actual && ++ test_grep "command=hunks pathname=logtest.c" backend.log && ++ test_must_be_empty stderr ++' ++ ++test_expect_success PYTHON 'diff process no hunks suppresses diff output' ' ++ cat >nohunks.c <<-\EOF && ++ int zero(void) { return 0; } + EOF -+ git add worddiff.c && -+ git commit -m "add worddiff.c" && ++ git add nohunks.c && ++ git commit -m "add nohunks.c" && + -+ cat >worddiff.c <<-\EOF && -+ int value(void) { return 999; } ++ cat >nohunks.c <<-\EOF && ++ int zero(void) { return 999; } + EOF + -+ git -c diff.cdiff.process="$BACKEND" \ -+ diff --word-diff worddiff.c >actual && -+ test_grep "\[-1;-\]" actual && -+ test_grep "{+999;+}" actual ++ git -c diff.cdiff.process="$BACKEND --mode=no-hunks" \ ++ diff nohunks.c >actual && ++ test_must_be_empty actual ++' ++ ++test_expect_success PYTHON 'diff process no hunks with --exit-code returns success' ' ++ git -c diff.cdiff.process="$BACKEND --mode=no-hunks" \ ++ diff --exit-code nohunks.c ++' ++ ++test_expect_success PYTHON 'diff process with --exit-code and hunks returns failure' ' ++ test_expect_code 1 git -c diff.cdiff.process="$BACKEND" \ ++ diff --exit-code newfile.c +' + ++# ++# Bypass mechanisms: flags and commands that skip the diff process. ++# ++ +test_expect_success PYTHON 'diff process bypassed by --diff-algorithm' ' + rm -f backend.log && + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ @@ t/t4080-diff-process.sh (new) + test_path_is_missing backend.log +' + -+test_expect_success PYTHON 'diff process works with git log -p' ' -+ cat >logtest.c <<-\EOF && -+ int logfunc(void) { return 1; } -+ EOF -+ git add logtest.c && -+ git commit -m "add logtest.c" && ++test_expect_success PYTHON 'diff process not used by --stat' ' ++ rm -f backend.log && ++ git -c diff.cdiff.process="$BACKEND --log=backend.log" \ ++ diff --stat worddiff.c >actual && ++ test_grep "worddiff.c" actual && ++ test_path_is_missing backend.log ++' + -+ cat >logtest.c <<-\EOF && -+ int logfunc(void) { return 2; } -+ EOF -+ git add logtest.c && -+ git commit -m "change logtest.c" && ++# ++# Error handling and fallback. ++# + ++test_expect_success PYTHON 'diff process fallback on tool error status' ' + rm -f backend.log && -+ git -c diff.cdiff.process="$BACKEND --log=backend.log" \ -+ log -1 -p -- logtest.c >actual && -+ test_grep "return 2" actual && -+ test_grep "command=hunks pathname=logtest.c" backend.log ++ git -c diff.cdiff.process="$BACKEND --mode=error --log=backend.log" \ ++ diff boundary.c >actual 2>stderr && ++ # Fallback produces the full builtin diff (both change regions). ++ test_grep "^-OLD5" actual && ++ test_grep "^+NEW5" actual && ++ test_grep "^-OLD9" actual && ++ test_grep "^+NEW9" actual && ++ # Tool was contacted (it replied with error, not crash). ++ test_grep "command=hunks pathname=boundary.c" backend.log && ++ test_grep "diff process.*failed" stderr +' + -+test_expect_success PYTHON 'diff process zero hunks suppresses diff output' ' -+ cat >zerohunk.c <<-\EOF && -+ int zero(void) { return 0; } ++test_expect_success PYTHON 'diff process error keeps tool available for next file' ' ++ rm -f backend.log && ++ git -c diff.cdiff.process="$BACKEND --mode=error --log=backend.log" \ ++ diff -- one.c two.c >actual 2>stderr && ++ # Unlike abort, error keeps the tool available: both files ++ # are sent to the tool (and both fall back). ++ test_grep "pathname=one.c" backend.log && ++ test_grep "pathname=two.c" backend.log && ++ test_grep "return 10" actual && ++ test_grep "return 20" actual ++' ++ ++test_expect_success PYTHON 'diff process abort disables for session' ' ++ rm -f backend.log && ++ git -c diff.cdiff.process="$BACKEND --mode=abort --log=backend.log" \ ++ diff -- one.c two.c >actual && ++ # Both files should still produce diff output via fallback. ++ test_grep "return 10" actual && ++ test_grep "return 20" actual && ++ # The tool aborts on the first file and git clears its ++ # capability. The second file never contacts the tool. ++ test_grep "pathname=one.c" backend.log && ++ test_grep ! "pathname=two.c" backend.log ++' ++ ++test_expect_success PYTHON 'diff process fallback on tool crash' ' ++ git -c diff.cdiff.process="$BACKEND --mode=crash" \ ++ diff boundary.c >actual 2>stderr && ++ test_grep "^-OLD5" actual && ++ test_grep "^+NEW5" actual && ++ test_grep "^-OLD9" actual && ++ test_grep "^+NEW9" actual && ++ # Crash is a communication failure, so a warning is emitted. ++ test_grep "diff process.*failed" stderr ++' ++ ++test_expect_success PYTHON 'diff process startup failure only warns once' ' ++ git -c diff.cdiff.process="/nonexistent/tool" \ ++ diff -- one.c two.c >actual 2>stderr && ++ # Both files produce diff output via fallback. ++ test_grep "return 10" actual && ++ test_grep "return 20" actual && ++ # Sentinel prevents repeated warnings: only one, not one per file. ++ test_grep "diff process.*failed" stderr >warnings && ++ test_line_count = 1 warnings ++' ++ ++test_expect_success PYTHON 'diff process fallback on bad hunks' ' ++ git -c diff.cdiff.process="$BACKEND --mode=bad-hunk" \ ++ diff boundary.c >actual 2>stderr && ++ test_grep "^-OLD5" actual && ++ test_grep "^+NEW5" actual && ++ test_grep "^-OLD9" actual && ++ test_grep "^+NEW9" actual && ++ # Invalid hunks are caught by xdiff validation, not the ++ # protocol layer, so no warning is emitted. ++ test_must_be_empty stderr ++' ++ ++test_expect_success PYTHON 'diff process fallback on mismatched unchanged totals' ' ++ cat >synctest.c <<-\EOF && ++ line1 ++ line2 ++ line3 + EOF -+ git add zerohunk.c && -+ git commit -m "add zerohunk.c" && ++ git add synctest.c && ++ git commit -m "add synctest.c" && + -+ cat >zerohunk.c <<-\EOF && -+ int zero(void) { return 999; } ++ cat >synctest.c <<-\EOF && ++ line1 ++ changed ++ line3 + EOF + -+ git -c diff.cdiff.process="$BACKEND --mode=zero-hunk" \ -+ diff zerohunk.c >actual && -+ test_must_be_empty actual ++ # bad-sync reports hunk 1 2 1 1: marks 2 old lines and 1 new ++ # line as changed, leaving 1 unchanged old vs 2 unchanged new. ++ # The synchronization invariant fails and git falls back. ++ git -c diff.cdiff.process="$BACKEND --mode=bad-sync" \ ++ diff synctest.c >actual 2>stderr && ++ test_grep "changed" actual ++' ++ ++test_expect_success PYTHON 'diff process fallback on overlapping hunks' ' ++ # boundary.c has 10 lines, so both hunks are in bounds ++ # but they overlap at lines 3-5, triggering the ordering check. ++ git -c diff.cdiff.process="$BACKEND --mode=overlap" \ ++ diff boundary.c >actual 2>stderr && ++ test_grep "NEW5" actual +' + +test_done + + ## userdiff.h ## +@@ + + #include "notes-cache.h" + ++struct diff_subprocess; + struct index_state; + struct repository; + +@@ userdiff.h: struct userdiff_driver { + int textconv_want_cache; + const char *process; + char *process_owned; ++ struct diff_subprocess *diff_subprocess; ++ unsigned diff_process_failed : 1; + }; + enum userdiff_driver_type { + USERDIFF_DRIVER_TYPE_BUILTIN = 1<<0, -: ---------- > 5: f4fd9aa682 diff: bypass diff process with --no-ext-diff and in format-patch 4: 39ff53acef ! 6: 370e766978 blame: consult diff process for zero-hunk detection @@ Metadata Author: Michael Montalbo <mmontalbo@gmail.com> ## Commit message ## - blame: consult diff process for zero-hunk detection + blame: consult diff process for no-hunk detection When a diff process is configured via diff.<driver>.process, consult it during blame's per-commit diffing. If the process - returns zero hunks for a commit's changes to a file, treat the + returns no hunks for a commit's changes to a file, treat the commit as having no changes, causing blame to attribute lines to earlier commits. + The consultation happens at the pass_blame_to_parent() callsite + using diff_process_fill_hunks(), matching how builtin_diff() in + diff.c uses the same function. A new diff_hunks_xpp() variant + accepts a pre-populated xpparam_t for this callsite, while the + existing diff_hunks() retains its original signature and behavior. + The copy-detection callsite is unaffected since it does not use + the diff process. + The subprocess is long-running (one startup cost amortized across the blame traversal), but each commit in the file's history incurs a round-trip to the tool. Signed-off-by: Michael Montalbo <mmontalbo@gmail.com> - ## Documentation/gitattributes.adoc ## -@@ Documentation/gitattributes.adoc: The tool responds with lines of the form - - If the tool returns zero hunks with `status=success`, Git treats - the file as having no changes and produces no diff output. -+`git blame` also consults the diff process and skips commits -+where it reports zero hunks, attributing lines to earlier commits -+instead. - - Tools should ignore unknown keys in the per-file request to - remain forward-compatible. - ## blame.c ## @@ #include "tag.h" #include "trace2.h" #include "blame.h" +#include "diff-process.h" -+#include "userdiff.h" ++#include "xdiff-interface.h" #include "alloc.h" #include "commit-slab.h" #include "bloom.h" @@ blame.c: static struct commit *fake_working_tree_commit(struct repository *r, - static int diff_hunks(mmfile_t *file_a, mmfile_t *file_b, + +-static int diff_hunks(mmfile_t *file_a, mmfile_t *file_b, - xdl_emit_hunk_consume_func_t hunk_func, void *cb_data, int xdl_opts) -+ xdl_emit_hunk_consume_func_t hunk_func, void *cb_data, -+ int xdl_opts, struct index_state *istate, -+ const char *path) ++static int diff_hunks_xpp(mmfile_t *file_a, mmfile_t *file_b, ++ xdl_emit_hunk_consume_func_t hunk_func, ++ void *cb_data, xpparam_t *xpp) { - xpparam_t xpp = {0}; +- xpparam_t xpp = {0}; xdemitconf_t xecfg = {0}; xdemitcb_t ecb = {NULL}; -+ struct xdl_hunk *ext_hunks = NULL; -+ int ret; - xpp.flags = xdl_opts; +- xpp.flags = xdl_opts; xecfg.hunk_func = hunk_func; ecb.priv = cb_data; - return xdi_diff(file_a, file_b, &xpp, &xecfg, &ecb); ++ return xdi_diff(file_a, file_b, xpp, &xecfg, &ecb); ++} + -+ if (path && istate) { -+ struct userdiff_driver *drv; -+ drv = userdiff_find_by_path(istate, path); -+ if (drv && drv->process) { -+ size_t nr = 0; -+ if (!diff_process_get_hunks(drv, path, -+ file_a->ptr, file_a->size, -+ file_b->ptr, file_b->size, -+ &ext_hunks, &nr)) { -+ if (!nr) { -+ /* -+ * Zero hunks: the diff process -+ * considers these files equivalent. -+ * Skip so blame looks past this -+ * commit. -+ */ -+ return 0; -+ } -+ xpp.external_hunks = ext_hunks; -+ xpp.external_hunks_nr = nr; -+ } -+ } -+ } ++static int diff_hunks(mmfile_t *file_a, mmfile_t *file_b, ++ xdl_emit_hunk_consume_func_t hunk_func, void *cb_data, int xdl_opts) ++{ ++ xpparam_t xpp = {0}; + -+ ret = xdi_diff(file_a, file_b, &xpp, &xecfg, &ecb); -+ free(ext_hunks); -+ return ret; ++ xpp.flags = xdl_opts; ++ return diff_hunks_xpp(file_a, file_b, hunk_func, cb_data, &xpp); } static const char *get_next_line(const char *start, const char *end) +@@ blame.c: static void pass_blame_to_parent(struct blame_scoreboard *sb, + struct blame_origin *parent, int ignore_diffs) + { + mmfile_t file_p, file_o; ++ xpparam_t xpp = {0}; + struct blame_chunk_cb_data d; + struct blame_entry *newdest = NULL; + @@ blame.c: static void pass_blame_to_parent(struct blame_scoreboard *sb, &sb->num_read_blob, ignore_diffs); sb->num_get_patch++; - if (diff_hunks(&file_p, &file_o, blame_chunk_cb, &d, sb->xdl_opts)) -+ if (diff_hunks(&file_p, &file_o, blame_chunk_cb, &d, sb->xdl_opts, -+ sb->revs->diffopt.repo->index, target->path)) - die("unable to generate diff (%s -> %s)", - oid_to_hex(&parent->commit->object.oid), - oid_to_hex(&target->commit->object.oid)); -@@ blame.c: static void find_copy_in_blob(struct blame_scoreboard *sb, - * file_p partially may match that image. - */ - memset(split, 0, sizeof(struct blame_entry [3])); -- if (diff_hunks(file_p, &file_o, handle_split_cb, &d, sb->xdl_opts)) -+ if (diff_hunks(file_p, &file_o, handle_split_cb, &d, sb->xdl_opts, -+ NULL, NULL)) - die("unable to generate diff (%s)", - oid_to_hex(&parent->commit->object.oid)); - /* remainder, if any, all match the preimage */ +- die("unable to generate diff (%s -> %s)", +- oid_to_hex(&parent->commit->object.oid), +- oid_to_hex(&target->commit->object.oid)); ++ xpp.flags = sb->xdl_opts; ++ /* ++ * If the diff process considers the files equivalent, ++ * skip the diff so blame looks past this commit. ++ */ ++ if (diff_process_fill_hunks(&sb->revs->diffopt, target->path, ++ &file_p, &file_o, &xpp) ++ != DIFF_PROCESS_EQUIVALENT) { ++ if (diff_hunks_xpp(&file_p, &file_o, blame_chunk_cb, ++ &d, &xpp)) ++ die("unable to generate diff (%s -> %s)", ++ oid_to_hex(&parent->commit->object.oid), ++ oid_to_hex(&target->commit->object.oid)); ++ } ++ free(xpp.external_hunks); + /* The rest are the same as the parent */ + blame_chunk(&d.dstq, &d.srcq, INT_MAX, d.offset, INT_MAX, 0, + parent, target, 0); ## t/t4080-diff-process.sh ## -@@ t/t4080-diff-process.sh: test_expect_success PYTHON 'diff process zero hunks suppresses diff output' ' - test_must_be_empty actual +@@ t/t4080-diff-process.sh: test_expect_success PYTHON 'diff process fallback on overlapping hunks' ' + test_grep "NEW5" actual ' -+test_expect_success PYTHON 'blame skips commits with zero hunks from diff process' ' ++# ++# Blame integration. ++# ++ ++test_expect_success PYTHON 'blame uses tool-provided hunks' ' ++ cat >blame-hunk.c <<-\EOF && ++ line1 ++ line2 ++ line3 ++ line4 ++ original5 ++ original6 ++ line7 ++ line8 ++ line9 ++ line10 ++ EOF ++ git add blame-hunk.c && ++ git commit -m "add blame-hunk.c" && ++ ORIG=$(git rev-parse --short HEAD) && ++ ++ cat >blame-hunk.c <<-\EOF && ++ line1 ++ line2 ++ line3 ++ line4 ++ changed5 ++ changed6 ++ line7 ++ line8 ++ changed9 ++ changed10 ++ EOF ++ git add blame-hunk.c && ++ git commit -m "change blame-hunk.c" && ++ CHANGE=$(git rev-parse --short HEAD) && ++ ++ # With fixed-hunk mode the tool reports only lines 5-6 as changed, ++ # so blame should attribute lines 9-10 to the original commit ++ # even though the builtin diff would show them as changed. ++ git -c diff.cdiff.process="$BACKEND --mode=fixed-hunk" \ ++ blame blame-hunk.c >actual && ++ sed -n "9p" actual >line9 && ++ sed -n "10p" actual >line10 && ++ test_grep "$ORIG" line9 && ++ test_grep "$ORIG" line10 && ++ sed -n "5p" actual >line5 && ++ sed -n "6p" actual >line6 && ++ test_grep "$CHANGE" line5 && ++ test_grep "$CHANGE" line6 ++' ++ ++test_expect_success PYTHON 'blame skips commits with no hunks from diff process' ' + cat >blame.c <<-\EOF && + int main(void) + { @@ t/t4080-diff-process.sh: test_expect_success PYTHON 'diff process zero hunks sup + EOF + git add blame.c && + git commit -m "add blame.c" && ++ ORIG_COMMIT=$(git rev-parse --short HEAD) && + + cat >blame.c <<-\EOF && + int main(void) @@ t/t4080-diff-process.sh: test_expect_success PYTHON 'diff process zero hunks sup + git commit -m "reformat blame.c" && + BLAME_COMMIT=$(git rev-parse --short HEAD) && + -+ # Without zero-hunk mode, blame attributes the change. ++ # Without no-hunks mode, blame attributes the change. + git blame blame.c >without && + test_grep "$BLAME_COMMIT" without && + -+ # With zero-hunk mode, the process considers the files equivalent -+ # and blame skips the reformat commit. -+ git -c diff.cdiff.process="$BACKEND --mode=zero-hunk" \ ++ # With no-hunks mode, the process considers the files equivalent ++ # and blame skips the reformat commit, attributing to the original. ++ git -c diff.cdiff.process="$BACKEND --mode=no-hunks" \ + blame blame.c >with && -+ ! test_grep "$BLAME_COMMIT" with ++ test_grep ! "$BLAME_COMMIT" with && ++ test_grep "$ORIG_COMMIT" with +' + ++test_expect_success PYTHON 'blame --no-ext-diff bypasses diff process' ' ++ rm -f backend.log && ++ git -c diff.cdiff.process="$BACKEND --mode=no-hunks --log=backend.log" \ ++ blame --no-ext-diff blame.c >actual && ++ # Without the process, blame attributes the reformat commit normally. ++ test_grep "$BLAME_COMMIT" actual && ++ test_path_is_missing backend.log ++' ++ ++test_expect_success PYTHON 'blame --no-ext-diff uses builtin hunks' ' ++ # fixed-hunk mode would narrow blame to lines 5-6, but ++ # --no-ext-diff should bypass it and use the builtin diff. ++ rm -f backend.log && ++ git -c diff.cdiff.process="$BACKEND --mode=fixed-hunk --log=backend.log" \ ++ blame --no-ext-diff blame-hunk.c >actual && ++ # Builtin diff attributes lines 9-10 to the change commit. ++ sed -n "9p" actual >line9 && ++ test_grep "$CHANGE" line9 && ++ test_path_is_missing backend.log ++' + test_done -- gitgitgadget ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH v3 1/6] xdiff: support external hunks via xpparam_t 2026-05-29 20:48 ` [PATCH v3 0/6] [RFC] diff: add diff.<driver>.process for external hunk providers Michael Montalbo via GitGitGadget @ 2026-05-29 20:48 ` Michael Montalbo via GitGitGadget 2026-05-29 20:48 ` [PATCH v3 2/6] userdiff: add diff.<driver>.process config Michael Montalbo via GitGitGadget ` (5 subsequent siblings) 6 siblings, 0 replies; 30+ messages in thread From: Michael Montalbo via GitGitGadget @ 2026-05-29 20:48 UTC (permalink / raw) To: git; +Cc: Michael Montalbo, Michael Montalbo From: Michael Montalbo <mmontalbo@gmail.com> Add two new xpparam_t fields (external_hunks, external_hunks_nr) that let callers supply pre-computed hunks. When set, xdl_diff() populates the changed[] arrays from these hunks instead of running the diff algorithm, then continues through compaction and emission as usual. Validate supplied hunks before use: reject out-of-bounds line numbers, overlapping or out-of-order hunks, negative counts, and violations of the synchronization invariant (unchanged line counts must match between files). On validation failure, fall back to the builtin diff algorithm; this re-runs xdl_prepare_env() since the first call may have dirtied the changed[] arrays. Skip trim_common_tail() in xdi_diff() when external hunks are present, since external hunks reference line numbers in the original content. Signed-off-by: Michael Montalbo <mmontalbo@gmail.com> --- xdiff-interface.c | 7 +++- xdiff/xdiff.h | 13 ++++++++ xdiff/xdiffi.c | 85 +++++++++++++++++++++++++++++++++++++++++++++-- xdiff/xprepare.c | 10 ++++++ xdiff/xprepare.h | 1 + 5 files changed, 113 insertions(+), 3 deletions(-) diff --git a/xdiff-interface.c b/xdiff-interface.c index f043330f2a..9542c0bcc2 100644 --- a/xdiff-interface.c +++ b/xdiff-interface.c @@ -124,7 +124,12 @@ int xdi_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, xdemitconf_t co if (mf1->size > MAX_XDIFF_SIZE || mf2->size > MAX_XDIFF_SIZE) return -1; - if (!xecfg->ctxlen && !(xecfg->flags & XDL_EMIT_FUNCCONTEXT)) + /* + * External hunks reference line numbers in the original content; + * trimming the tail would change line counts and invalidate them. + */ + if (!xpp->external_hunks && + !xecfg->ctxlen && !(xecfg->flags & XDL_EMIT_FUNCCONTEXT)) trim_common_tail(&a, &b); return xdl_diff(&a, &b, xpp, xecfg, xecb); diff --git a/xdiff/xdiff.h b/xdiff/xdiff.h index dc370712e9..a7e8502e4c 100644 --- a/xdiff/xdiff.h +++ b/xdiff/xdiff.h @@ -78,6 +78,15 @@ typedef struct s_mmbuffer { long size; } mmbuffer_t; +/* + * Hunk descriptor for externally computed diffs. + * Line numbers are 1-based, matching unified diff convention. + */ +struct xdl_hunk { + long old_start, old_count; + long new_start, new_count; +}; + typedef struct s_xpparam { unsigned long flags; @@ -88,6 +97,10 @@ typedef struct s_xpparam { /* See Documentation/diff-options.adoc. */ char **anchors; size_t anchors_nr; + + /* Externally computed hunks: bypass the diff algorithm. Owned by caller. */ + struct xdl_hunk *external_hunks; + size_t external_hunks_nr; } xpparam_t; typedef struct s_xdemitcb { diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c index 5455b4690d..b0a01ca856 100644 --- a/xdiff/xdiffi.c +++ b/xdiff/xdiffi.c @@ -1085,16 +1085,97 @@ static void xdl_mark_ignorable_regex(xdchange_t *xscr, const xdfenv_t *xe, } } +/* + * Populate the changed[] arrays from externally supplied hunks, + * bypassing the diff algorithm. Validates that hunks are in order, + * non-overlapping, and within bounds. + * + * Returns 0 on success, -1 on validation failure. + */ +static int xdl_populate_hunks_from_external(xdfenv_t *xe, + struct xdl_hunk *hunks, + size_t nr_hunks) +{ + size_t i; + long j, prev_old_end = 0, prev_new_end = 0; + long total_old = 0, total_new = 0; + + /* + * xdl_prepare_env() may dirty changed[] via xdl_cleanup_records(). + * Clear them so only the external hunks are marked. + */ + xdl_clear_changed(&xe->xdf1); + xdl_clear_changed(&xe->xdf2); + + for (i = 0; i < nr_hunks; i++) { + struct xdl_hunk *h = &hunks[i]; + + if (h->old_count < 0 || h->new_count < 0) + return -1; + if (h->old_start < 1 || h->new_start < 1) + return -1; + + /* + * Range must fit: start + count - 1 <= nrec, + * rewritten to avoid overflow. Same for both sides. + * + * When count is 0 (pure insert/delete) the check + * reduces to 0 > nrec - start + 1, which rejects + * start > nrec + 1 and allows start == nrec + 1 + * (the position after the last line). + */ + if (h->old_count > (long)xe->xdf1.nrec - h->old_start + 1) + return -1; + if (h->new_count > (long)xe->xdf2.nrec - h->new_start + 1) + return -1; + + /* Ordering: no overlap with previous hunk (adjacent is OK) */ + if (h->old_start < prev_old_end || + h->new_start < prev_new_end) + return -1; + + for (j = 0; j < h->old_count; j++) + xe->xdf1.changed[h->old_start - 1 + j] = true; + for (j = 0; j < h->new_count; j++) + xe->xdf2.changed[h->new_start - 1 + j] = true; + + prev_old_end = h->old_start + h->old_count; + prev_new_end = h->new_start + h->new_count; + total_old += h->old_count; + total_new += h->new_count; + } + + /* + * Synchronization invariant: unchanged line counts must match. + * Otherwise xdl_build_script() would walk off one array. + */ + if ((long)xe->xdf1.nrec - total_old != + (long)xe->xdf2.nrec - total_new) + return -1; + + return 0; +} + int xdl_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, xdemitconf_t const *xecfg, xdemitcb_t *ecb) { xdchange_t *xscr; xdfenv_t xe; emit_func_t ef = xecfg->hunk_func ? xdl_call_hunk_func : xdl_emit_diff; - if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0) { + if (xpp->external_hunks) { + if (xdl_prepare_env(mf1, mf2, xpp, &xe) < 0) + return -1; + if (xdl_populate_hunks_from_external(&xe, + xpp->external_hunks, + xpp->external_hunks_nr) == 0) + goto diff_done; + xdl_free_env(&xe); + } + if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0) return -1; - } + +diff_done: if (xdl_change_compact(&xe.xdf1, &xe.xdf2, xpp->flags) < 0 || xdl_change_compact(&xe.xdf2, &xe.xdf1, xpp->flags) < 0 || xdl_build_script(&xe, &xscr) < 0) { diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c index cd4fc405eb..4645a9a746 100644 --- a/xdiff/xprepare.c +++ b/xdiff/xprepare.c @@ -432,3 +432,13 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, return 0; } + +/* + * Reset the changed[] array so that no lines are marked as changed. + * Also clears the sentinel slots at changed[-1] and changed[nrec] + * that xdl_change_compact() relies on during backward scans. + */ +void xdl_clear_changed(xdfile_t *xdf) +{ + memset(xdf->changed - 1, 0, (xdf->nrec + 2) * sizeof(bool)); +} diff --git a/xdiff/xprepare.h b/xdiff/xprepare.h index 947d9fc1bb..0413baf07b 100644 --- a/xdiff/xprepare.h +++ b/xdiff/xprepare.h @@ -28,6 +28,7 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, xdfenv_t *xe); void xdl_free_env(xdfenv_t *xe); +void xdl_clear_changed(xdfile_t *xdf); -- gitgitgadget ^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH v3 2/6] userdiff: add diff.<driver>.process config 2026-05-29 20:48 ` [PATCH v3 0/6] [RFC] diff: add diff.<driver>.process for external hunk providers Michael Montalbo via GitGitGadget 2026-05-29 20:48 ` [PATCH v3 1/6] xdiff: support external hunks via xpparam_t Michael Montalbo via GitGitGadget @ 2026-05-29 20:48 ` Michael Montalbo via GitGitGadget 2026-05-29 20:48 ` [PATCH v3 3/6] sub-process: separate process lifecycle from hashmap management Michael Montalbo via GitGitGadget ` (4 subsequent siblings) 6 siblings, 0 replies; 30+ messages in thread From: Michael Montalbo via GitGitGadget @ 2026-05-29 20:48 UTC (permalink / raw) To: git; +Cc: Michael Montalbo, Michael Montalbo From: Michael Montalbo <mmontalbo@gmail.com> Add the process field to struct userdiff_driver and teach the config parser to populate it from diff.<driver>.process. Signed-off-by: Michael Montalbo <mmontalbo@gmail.com> --- userdiff.c | 7 +++++++ userdiff.h | 2 ++ 2 files changed, 9 insertions(+) diff --git a/userdiff.c b/userdiff.c index fe710a68bf..81c0bebcce 100644 --- a/userdiff.c +++ b/userdiff.c @@ -499,6 +499,13 @@ int userdiff_config(const char *k, const char *v) drv->algorithm = drv->algorithm_owned; return ret; } + if (!strcmp(type, "process")) { + int ret; + FREE_AND_NULL(drv->process_owned); + ret = git_config_string(&drv->process_owned, k, v); + drv->process = drv->process_owned; + return ret; + } return 0; } diff --git a/userdiff.h b/userdiff.h index 827361b0bc..51c26e0d41 100644 --- a/userdiff.h +++ b/userdiff.h @@ -31,6 +31,8 @@ struct userdiff_driver { char *textconv_owned; struct notes_cache *textconv_cache; int textconv_want_cache; + const char *process; + char *process_owned; }; enum userdiff_driver_type { USERDIFF_DRIVER_TYPE_BUILTIN = 1<<0, -- gitgitgadget ^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH v3 3/6] sub-process: separate process lifecycle from hashmap management 2026-05-29 20:48 ` [PATCH v3 0/6] [RFC] diff: add diff.<driver>.process for external hunk providers Michael Montalbo via GitGitGadget 2026-05-29 20:48 ` [PATCH v3 1/6] xdiff: support external hunks via xpparam_t Michael Montalbo via GitGitGadget 2026-05-29 20:48 ` [PATCH v3 2/6] userdiff: add diff.<driver>.process config Michael Montalbo via GitGitGadget @ 2026-05-29 20:48 ` Michael Montalbo via GitGitGadget 2026-05-29 20:48 ` [PATCH v3 4/6] diff: add long-running diff process via diff.<driver>.process Michael Montalbo via GitGitGadget ` (3 subsequent siblings) 6 siblings, 0 replies; 30+ messages in thread From: Michael Montalbo via GitGitGadget @ 2026-05-29 20:48 UTC (permalink / raw) To: git; +Cc: Michael Montalbo, Michael Montalbo From: Michael Montalbo <mmontalbo@gmail.com> subprocess_start() and subprocess_stop() couple two concerns: managing a child process (setup, handshake, teardown) and managing a hashmap that indexes running processes by command string. The hashmap suits callers like convert.c where many files may share one filter process looked up by name, but callers that manage process lifetime through their own data structures do not need it. Extract subprocess_start_command() and subprocess_stop_command() so callers can reuse the child process setup and handshake machinery without maintaining a hashmap. subprocess_start() and subprocess_stop() become thin wrappers that add hashmap operations on top. No functional change for existing callers. Signed-off-by: Michael Montalbo <mmontalbo@gmail.com> --- sub-process.c | 29 ++++++++++++++++++++++++----- sub-process.h | 9 ++++++++- 2 files changed, 32 insertions(+), 6 deletions(-) diff --git a/sub-process.c b/sub-process.c index 83bf0a0e82..33b0bbc831 100644 --- a/sub-process.c +++ b/sub-process.c @@ -49,7 +49,7 @@ int subprocess_read_status(int fd, struct strbuf *status) return (len < 0) ? len : 0; } -void subprocess_stop(struct hashmap *hashmap, struct subprocess_entry *entry) +void subprocess_stop_command(struct subprocess_entry *entry) { if (!entry) return; @@ -57,7 +57,14 @@ void subprocess_stop(struct hashmap *hashmap, struct subprocess_entry *entry) entry->process.clean_on_exit = 0; kill(entry->process.pid, SIGTERM); finish_command(&entry->process); +} + +void subprocess_stop(struct hashmap *hashmap, struct subprocess_entry *entry) +{ + if (!entry) + return; + subprocess_stop_command(entry); hashmap_remove(hashmap, &entry->ent, NULL); } @@ -72,7 +79,7 @@ static void subprocess_exit_handler(struct child_process *process) finish_command(process); } -int subprocess_start(struct hashmap *hashmap, struct subprocess_entry *entry, const char *cmd, +int subprocess_start_command(struct subprocess_entry *entry, const char *cmd, subprocess_start_fn startfn) { int err; @@ -96,15 +103,27 @@ int subprocess_start(struct hashmap *hashmap, struct subprocess_entry *entry, co return err; } - hashmap_entry_init(&entry->ent, strhash(cmd)); - err = startfn(entry); if (err) { error("initialization for subprocess '%s' failed", cmd); - subprocess_stop(hashmap, entry); + subprocess_stop_command(entry); return err; } + return 0; +} + +int subprocess_start(struct hashmap *hashmap, struct subprocess_entry *entry, const char *cmd, + subprocess_start_fn startfn) +{ + int err; + + err = subprocess_start_command(entry, cmd, startfn); + if (err) { + return err; + } + + hashmap_entry_init(&entry->ent, strhash(cmd)); hashmap_add(hashmap, &entry->ent); return 0; } diff --git a/sub-process.h b/sub-process.h index bfc3959a1b..45f1b8e5e3 100644 --- a/sub-process.h +++ b/sub-process.h @@ -52,10 +52,17 @@ int cmd2process_cmp(const void *unused_cmp_data, */ typedef int(*subprocess_start_fn)(struct subprocess_entry *entry); -/* Start a subprocess and add it to the subprocess hashmap. */ +/* Start a subprocess and run the startfn (typically handshake). */ +int subprocess_start_command(struct subprocess_entry *entry, const char *cmd, + subprocess_start_fn startfn); + +/* Start a subprocess, run startfn, and add it to the subprocess hashmap. */ int subprocess_start(struct hashmap *hashmap, struct subprocess_entry *entry, const char *cmd, subprocess_start_fn startfn); +/* Kill a subprocess. */ +void subprocess_stop_command(struct subprocess_entry *entry); + /* Kill a subprocess and remove it from the subprocess hashmap. */ void subprocess_stop(struct hashmap *hashmap, struct subprocess_entry *entry); -- gitgitgadget ^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH v3 4/6] diff: add long-running diff process via diff.<driver>.process 2026-05-29 20:48 ` [PATCH v3 0/6] [RFC] diff: add diff.<driver>.process for external hunk providers Michael Montalbo via GitGitGadget ` (2 preceding siblings ...) 2026-05-29 20:48 ` [PATCH v3 3/6] sub-process: separate process lifecycle from hashmap management Michael Montalbo via GitGitGadget @ 2026-05-29 20:48 ` Michael Montalbo via GitGitGadget 2026-05-29 20:48 ` [PATCH v3 5/6] diff: bypass diff process with --no-ext-diff and in format-patch Michael Montalbo via GitGitGadget ` (2 subsequent siblings) 6 siblings, 0 replies; 30+ messages in thread From: Michael Montalbo via GitGitGadget @ 2026-05-29 20:48 UTC (permalink / raw) To: git; +Cc: Michael Montalbo, Michael Montalbo From: Michael Montalbo <mmontalbo@gmail.com> Add support for external diff processes that communicate via the long-running process protocol (pkt-line over stdin/stdout). A diff process is configured per userdiff driver: [diff "cdiff"] process = /path/to/diff-tool The tool provides custom line-matching: it receives file pairs and returns hunks that reference line numbers in the content. When textconv is also configured, the tool receives the textconv-transformed content. The tool controls which lines are marked as changed while the display shows the file content. Patch output features (word diff, function context, color) work normally; summary formats like --stat use their own diff path and are not affected. The handshake negotiates version=1 and capability=hunks. Per-file requests send command=hunks, pathname, and both file contents as packetized data. The tool responds with hunk lines and a status packet (success, error, or abort). On error, Git warns and falls back to the builtin diff algorithm for that file. On abort, Git silently falls back for the current file and stops sending further requests to the tool for the remainder of the session. When the tool returns no hunks followed by status=success, Git treats the file as having no changes and produces no diff output. This also means --exit-code reports no changes for that file. The subprocess is stored on the userdiff_driver struct and launched on first use. If the process fails to start, the handshake fails, or a communication error occurs mid-stream, the failure is cached on the driver to avoid retrying and re-warning on every subsequent file. diff_process_fill_hunks() is the sole public entry point. It handles driver lookup, flag checks, subprocess management, and error reporting, returning an enum that lets callers distinguish "hunks populated" from "files equivalent" from "not applicable" from "tool failure." Signed-off-by: Michael Montalbo <mmontalbo@gmail.com> --- Documentation/config/diff.adoc | 5 + Documentation/gitattributes.adoc | 139 ++++++++ Makefile | 1 + diff-process.c | 288 +++++++++++++++++ diff-process.h | 39 +++ diff.c | 13 + diff.h | 3 + meson.build | 1 + t/.gitattributes | 1 + t/meson.build | 1 + t/t4080-diff-process.sh | 538 +++++++++++++++++++++++++++++++ userdiff.h | 3 + 12 files changed, 1032 insertions(+) create mode 100644 diff-process.c create mode 100644 diff-process.h create mode 100755 t/t4080-diff-process.sh diff --git a/Documentation/config/diff.adoc b/Documentation/config/diff.adoc index 1135a62a0a..ac0635bb3b 100644 --- a/Documentation/config/diff.adoc +++ b/Documentation/config/diff.adoc @@ -218,6 +218,11 @@ endif::git-diff[] Set this option to `true` to make the diff driver cache the text conversion outputs. See linkgit:gitattributes[5] for details. +`diff.<driver>.process`:: + The command to run as a long-running diff process that + provides hunks to Git's diff pipeline. + See linkgit:gitattributes[5] for details. + `diff.indentHeuristic`:: Set this option to `false` to disable the default heuristics that shift diff hunk boundaries to make patches easier to read. diff --git a/Documentation/gitattributes.adoc b/Documentation/gitattributes.adoc index f20041a323..1700bd8e97 100644 --- a/Documentation/gitattributes.adoc +++ b/Documentation/gitattributes.adoc @@ -821,6 +821,145 @@ NOTE: If `diff.<name>.command` is defined for path with the (see above), and adding `diff.<name>.algorithm` has no effect, as the algorithm is not passed to the external diff driver. +Using an external diff process +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +If `diff.<name>.process` is defined, Git sends the old and new file +content to an external tool and receives back a list of changed +regions (pairs of line ranges in the old and new file). Git uses +these instead of its builtin diff algorithm, but still controls +all output formatting, so features like word diff, function context, +color, and blame work normally. This is achieved by using the +long-running process protocol (described in +Documentation/technical/long-running-process-protocol.adoc). +Unlike `diff.<name>.command`, which replaces Git's output entirely, +the diff process feeds results back into the standard pipeline. + +First, in `.gitattributes`, assign the `diff` attribute for paths. + +------------------------ +*.c diff=cdiff +------------------------ + +Then, define a "diff.<name>.process" configuration to specify +the diff process command. + +---------------------------------------------------------------- +[diff "cdiff"] + process = /path/to/diff-process-tool +---------------------------------------------------------------- + +When Git encounters the first file that needs to be diffed, it starts +the process and performs the handshake. In the handshake, the welcome +message sent by Git is "git-diff-client", only version 1 is supported, +and the supported capability is "hunks" (the changed regions +described below). + +For each file, Git sends a list of "key=value" pairs terminated with +a flush packet, followed by the old and new file content as packetized +data, each terminated with a flush packet. The pathname is relative +to the repository root. When `diff.<name>.textconv` is also set, +the tool receives the textconv-transformed content rather than the +raw blob. Git does not send binary files to the diff process. + +----------------------- +packet: git> command=hunks +packet: git> pathname=path/file.c +packet: git> 0000 +packet: git> OLD_CONTENT +packet: git> 0000 +packet: git> NEW_CONTENT +packet: git> 0000 +----------------------- + +The tool is expected to respond with zero or more hunk lines, +a flush packet, and a status packet terminated with a flush packet. +Each hunk line has the form: + + `hunk <old_start> <old_count> <new_start> <new_count>` + +where `<old_start>` and `<old_count>` identify a range of lines in +the old file, and `<new_start>` and `<new_count>` identify the +replacement range in the new file. Start values are 1-based and +counts are non-negative. Ranges must not extend beyond the end of +the file. For example, `hunk 3 2 3 4` means that 2 lines starting +at line 3 in the old file were replaced by 4 lines starting at +line 3 in the new file. An `<old_count>` of 0 means no lines were +removed (pure insertion); a `<new_count>` of 0 means no lines were +added (pure deletion). + +Lines are delimited by newlines. A file `"foo\nbar\n"` and a +file `"foo\nbar"` both have 2 lines. + +Hunks must be listed in order and must not overlap. Any line +not covered by a hunk is treated as unchanged, so the total +number of unchanged lines must be the same on both sides. +For example, if the old file has 10 lines and the hunks cover +4 of them (`old_count` values summing to 4), then 6 old lines +are unchanged. The new file must also have exactly 6 lines +not covered by hunks, so the `new_count` values must sum to +`new_file_lines - 6`. + +----------------------- +packet: git< hunk 1 3 1 5 +packet: git< hunk 10 2 12 2 +packet: git< 0000 +packet: git< status=success +packet: git< 0000 +----------------------- + +If the tool responds with hunks and "success", Git marks those lines +as changed and feeds them into the standard diff pipeline. Patch +output features (word diff, function context, color) work normally. +Note that `--stat` and other summary formats use their own diff path +and are not affected by the diff process. + +If no hunk lines precede the flush, followed by "success", Git +treats the files as having no changes: `git diff` produces no output +and `git blame` skips the commit, attributing lines to earlier commits. + +----------------------- +packet: git< 0000 +packet: git< status=success +packet: git< 0000 +----------------------- + +If the tool returns invalid hunks (out of bounds, overlapping), Git +silently falls back to the builtin diff algorithm. + +In case the tool cannot or does not want to process the content, +it is expected to respond with an "error" status. Git warns and +falls back to the builtin diff algorithm for this file. The tool +remains available for subsequent files. + +----------------------- +packet: git< 0000 +packet: git< status=error +packet: git< 0000 +----------------------- + +In case the tool cannot or does not want to process the content as +well as any future content for the lifetime of the Git process, it +is expected to respond with an "abort" status. Git silently falls +back to the builtin diff algorithm for this file and does not send +further requests to the tool. + +----------------------- +packet: git< 0000 +packet: git< status=abort +packet: git< 0000 +----------------------- + +If the tool dies during the communication or does not adhere to the +protocol then Git will stop the process and fall back to the builtin +diff algorithm. Git warns once and does not restart the process for +subsequent files. + +Tools should ignore unknown keys in the per-file request to remain +forward-compatible. Future versions of Git may send additional +`command=` values; tools that receive an unrecognized command should +respond with `status=error` rather than terminating. + Defining a custom hunk-header ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ diff --git a/Makefile b/Makefile index cedc234173..22900368dd 100644 --- a/Makefile +++ b/Makefile @@ -1142,6 +1142,7 @@ LIB_OBJS += diff-delta.o LIB_OBJS += diff-merges.o LIB_OBJS += diff-lib.o LIB_OBJS += diff-no-index.o +LIB_OBJS += diff-process.o LIB_OBJS += diff.o LIB_OBJS += diffcore-break.o LIB_OBJS += diffcore-delta.o diff --git a/diff-process.c b/diff-process.c new file mode 100644 index 0000000000..d2ef9463d7 --- /dev/null +++ b/diff-process.c @@ -0,0 +1,288 @@ +/* + * Diff process backend: communicates with a long-running external + * tool via the pkt-line protocol to obtain custom line-matching + * results. The tool controls which lines are marked as changed + * while the display shows the file content (after any textconv + * transformation, if configured). + * + * Protocol: pkt-line over stdin/stdout, following the pattern of + * the long-running filter process protocol (see convert.c). + * + * Handshake: + * git> git-diff-client / version=1 / flush + * tool< git-diff-server / version=1 / flush + * git> capability=hunks / flush + * tool< capability=hunks / flush + * + * Per-file: + * git> command=hunks / pathname=<path> / flush + * git> <old content packetized> / flush + * git> <new content packetized> / flush + * tool< hunk <old_start> <old_count> <new_start> <new_count> + * tool< ... / flush + * tool< status=success / flush + * + * When the tool returns no hunks with status=success, it considers + * the files equivalent. Git will skip the diff for that file. + */ + +#include "git-compat-util.h" +#include "diff-process.h" +#include "diff.h" +#include "gettext.h" +#include "repository.h" +#include "sigchain.h" +#include "userdiff.h" +#include "sub-process.h" +#include "pkt-line.h" +#include "strbuf.h" +#include "xdiff/xdiff.h" + +#define CAP_HUNKS (1u << 0) + +struct diff_subprocess { + struct subprocess_entry subprocess; + unsigned int supported_capabilities; +}; + +static int start_diff_process_fn(struct subprocess_entry *subprocess) +{ + static int versions[] = { 1, 0 }; + static struct subprocess_capability capabilities[] = { + { "hunks", CAP_HUNKS }, + { NULL, 0 } + }; + struct diff_subprocess *entry = + container_of(subprocess, struct diff_subprocess, subprocess); + + return subprocess_handshake(subprocess, "git-diff", + versions, NULL, + capabilities, + &entry->supported_capabilities); +} + +static struct diff_subprocess *get_or_launch_process( + struct userdiff_driver *drv) +{ + struct diff_subprocess *entry; + + if (drv->diff_subprocess) + return drv->diff_subprocess; + + entry = xcalloc(1, sizeof(*entry)); + if (subprocess_start_command(&entry->subprocess, drv->process, + start_diff_process_fn)) { + free(entry); + drv->diff_process_failed = 1; + return NULL; + } + + drv->diff_subprocess = entry; + return entry; +} + +static int send_file_content(int fd, const char *buf, long size) +{ + int ret = 0; + + if (size < 0) + return -1; + if (size > 0) + ret = write_packetized_from_buf_no_flush(buf, size, fd); + if (ret) + return ret; + return packet_flush_gently(fd); +} + +static int parse_hunk_line(const char *line, struct xdl_hunk *hunk) +{ + char *end; + + /* + * Format: "hunk <old_start> <old_count> <new_start> <new_count>" + * All numbers must be non-negative decimal with no leading + * whitespace or sign characters. + */ + if (!skip_prefix(line, "hunk ", &line)) + return -1; + + if (!isdigit(*line)) + return -1; + errno = 0; + hunk->old_start = strtol(line, &end, 10); + if (errno || end == line || *end++ != ' ') + return -1; + line = end; + + if (!isdigit(*line)) + return -1; + errno = 0; + hunk->old_count = strtol(line, &end, 10); + if (errno || end == line || *end++ != ' ') + return -1; + line = end; + + if (!isdigit(*line)) + return -1; + errno = 0; + hunk->new_start = strtol(line, &end, 10); + if (errno || end == line || *end++ != ' ') + return -1; + line = end; + + if (!isdigit(*line)) + return -1; + errno = 0; + hunk->new_count = strtol(line, &end, 10); + if (errno || end == line || *end != '\0') + return -1; + + return 0; +} + +static enum diff_process_result get_hunks( + struct userdiff_driver *drv, + const char *path, + const char *old_buf, long old_size, + const char *new_buf, long new_size, + struct xdl_hunk **hunks_out, + size_t *nr_hunks_out) +{ + struct diff_subprocess *backend; + struct child_process *process; + int fd_in, fd_out; + struct strbuf status = STRBUF_INIT; + struct xdl_hunk *hunks = NULL; + struct xdl_hunk hunk; + size_t nr_hunks = 0, alloc_hunks = 0; + int len; + char *line; + + backend = get_or_launch_process(drv); + if (!backend) + return DIFF_PROCESS_ERROR; + + if (!(backend->supported_capabilities & CAP_HUNKS)) + return DIFF_PROCESS_SKIP; + + process = subprocess_get_child_process(&backend->subprocess); + fd_in = process->in; + fd_out = process->out; + + sigchain_push(SIGPIPE, SIG_IGN); + + /* Send request */ + if (packet_write_fmt_gently(fd_in, "command=hunks\n") || + packet_write_fmt_gently(fd_in, "pathname=%s\n", path) || + packet_flush_gently(fd_in)) + goto comm_error; + + /* Send old file content */ + if (send_file_content(fd_in, old_buf, old_size)) + goto comm_error; + + /* Send new file content */ + if (send_file_content(fd_in, new_buf, new_size)) + goto comm_error; + + /* Read hunks until flush packet */ + while ((len = packet_read_line_gently(fd_out, NULL, &line)) >= 0 && + line) { + if (parse_hunk_line(line, &hunk) < 0) + goto comm_error; + ALLOC_GROW(hunks, nr_hunks + 1, alloc_hunks); + hunks[nr_hunks++] = hunk; + } + if (len < 0) + goto comm_error; + + /* Read status */ + if (subprocess_read_status(fd_out, &status)) + goto comm_error; + + if (!strcmp(status.buf, "success")) { + *hunks_out = hunks; + *nr_hunks_out = nr_hunks; + strbuf_release(&status); + sigchain_pop(SIGPIPE); + return DIFF_PROCESS_OK; + } + + if (!strcmp(status.buf, "abort")) { + /* + * The tool voluntarily withdrew: stop sending requests + * but do not warn (this is not a failure). + */ + backend->supported_capabilities &= ~CAP_HUNKS; + free(hunks); + strbuf_release(&status); + sigchain_pop(SIGPIPE); + return DIFF_PROCESS_SKIP; + } + + /* status=error or unknown status */ + free(hunks); + strbuf_release(&status); + sigchain_pop(SIGPIPE); + return DIFF_PROCESS_ERROR; + +comm_error: + /* + * Communication failure (broken pipe, malformed response). + * Tear down the process and mark as failed so we do not + * retry on every subsequent file. + */ + drv->diff_process_failed = 1; + drv->diff_subprocess = NULL; + subprocess_stop_command(&backend->subprocess); + free(backend); + free(hunks); + strbuf_release(&status); + sigchain_pop(SIGPIPE); + return DIFF_PROCESS_ERROR; +} + +enum diff_process_result diff_process_fill_hunks( + struct diff_options *diffopt, + const char *path, + const mmfile_t *file_a, + const mmfile_t *file_b, + xpparam_t *xpp) +{ + struct userdiff_driver *drv; + struct xdl_hunk *ext_hunks = NULL; + size_t nr = 0; + enum diff_process_result res; + + if (!diffopt || !path) + return DIFF_PROCESS_SKIP; + if (diffopt->flags.no_diff_process || diffopt->ignore_driver_algorithm) + return DIFF_PROCESS_SKIP; + + drv = userdiff_find_by_path(diffopt->repo->index, path); + if (!drv || !drv->process) + return DIFF_PROCESS_SKIP; + if (drv->diff_process_failed) + return DIFF_PROCESS_SKIP; + + res = get_hunks(drv, path, + file_a->ptr, file_a->size, + file_b->ptr, file_b->size, + &ext_hunks, &nr); + if (res == DIFF_PROCESS_OK) { + if (!nr) { + free(ext_hunks); + return DIFF_PROCESS_EQUIVALENT; + } + xpp->external_hunks = ext_hunks; + xpp->external_hunks_nr = nr; + return DIFF_PROCESS_OK; + } + if (res == DIFF_PROCESS_ERROR) { + warning(_("diff process '%s' failed for '%s'," + " falling back to builtin diff"), + drv->process, path); + return DIFF_PROCESS_ERROR; + } + return DIFF_PROCESS_SKIP; +} diff --git a/diff-process.h b/diff-process.h new file mode 100644 index 0000000000..d34b42f811 --- /dev/null +++ b/diff-process.h @@ -0,0 +1,39 @@ +#ifndef DIFF_PROCESS_H +#define DIFF_PROCESS_H + +#include "xdiff/xdiff.h" + +struct diff_options; + +enum diff_process_result { + DIFF_PROCESS_ERROR = -1, /* tool failure: warned, fell back */ + DIFF_PROCESS_OK = 0, /* hunks populated in xpp */ + DIFF_PROCESS_SKIP, /* no process configured: use builtin */ + DIFF_PROCESS_EQUIVALENT, /* tool says files are equivalent */ +}; + +/* + * Consult the diff process configured for 'path' and populate + * xpp->external_hunks with the returned hunks. + * + * Handles driver lookup, flag checks (--no-ext-diff, + * --diff-algorithm), subprocess management, and error reporting. + * + * Returns DIFF_PROCESS_OK when hunks are populated in xpp. + * The caller owns xpp->external_hunks and must free() it. + * + * Returns DIFF_PROCESS_EQUIVALENT when the tool returns no hunks + * (files are considered identical); caller should skip diff/blame. + * Returns DIFF_PROCESS_SKIP when no process applies; caller + * should use the builtin diff algorithm. + * Returns DIFF_PROCESS_ERROR on tool failure (already warned); + * caller should fall back to the builtin diff algorithm. + */ +enum diff_process_result diff_process_fill_hunks( + struct diff_options *diffopt, + const char *path, + const mmfile_t *file_a, + const mmfile_t *file_b, + xpparam_t *xpp); + +#endif /* DIFF_PROCESS_H */ diff --git a/diff.c b/diff.c index 397e38b41c..2d5ed6ea8c 100644 --- a/diff.c +++ b/diff.c @@ -25,6 +25,7 @@ #include "utf8.h" #include "odb.h" #include "userdiff.h" +#include "diff-process.h" #include "submodule.h" #include "hashmap.h" #include "mem-pool.h" @@ -4031,6 +4032,17 @@ static void builtin_diff(const char *name_a, xpp.ignore_regex_nr = o->ignore_regex_nr; xpp.anchors = o->anchors; xpp.anchors_nr = o->anchors_nr; + + if (diff_process_fill_hunks(o, name_a, + &mf1, &mf2, &xpp) + == DIFF_PROCESS_EQUIVALENT) { + if (textconv_one) + free(mf1.ptr); + if (textconv_two) + free(mf2.ptr); + goto free_ab_and_return; + } + xecfg.ctxlen = o->context; xecfg.interhunkctxlen = o->interhunkcontext; xecfg.flags = XDL_EMIT_FUNCNAMES; @@ -4111,6 +4123,7 @@ static void builtin_diff(const char *name_a, } else if (xdi_diff_outf(&mf1, &mf2, NULL, fn_out_consume, &ecbdata, &xpp, &xecfg)) die("unable to generate diff for %s", one->path); + free(xpp.external_hunks); if (o->word_diff) free_diff_words_data(&ecbdata); if (textconv_one) diff --git a/diff.h b/diff.h index 7eb84aadf4..d1e5a13e9e 100644 --- a/diff.h +++ b/diff.h @@ -173,6 +173,9 @@ struct diff_flags { */ unsigned allow_external; + /** Disables diff.<driver>.process. */ + unsigned no_diff_process; + /** * For communication between the calling program and the options parser; * tell the calling program to signal the presence of difference using diff --git a/meson.build b/meson.build index 11488623bf..8a7370b38f 100644 --- a/meson.build +++ b/meson.build @@ -328,6 +328,7 @@ libgit_sources = [ 'diff-merges.c', 'diff-lib.c', 'diff-no-index.c', + 'diff-process.c', 'diff.c', 'diffcore-break.c', 'diffcore-delta.c', diff --git a/t/.gitattributes b/t/.gitattributes index 7664c6e027..de97920cab 100644 --- a/t/.gitattributes +++ b/t/.gitattributes @@ -23,3 +23,4 @@ t[0-9][0-9][0-9][0-9]/* -whitespace /t8005/*.txt eol=lf /t9*/*.dump eol=lf /t0040*.sh whitespace=-indent-with-non-tab +/t4080-diff-process.sh whitespace=-indent-with-non-tab diff --git a/t/meson.build b/t/meson.build index 7528e5cda5..f67208d7ee 100644 --- a/t/meson.build +++ b/t/meson.build @@ -510,6 +510,7 @@ integration_tests = [ 't4072-diff-max-depth.sh', 't4073-diff-stat-name-width.sh', 't4074-diff-shifted-matched-group.sh', + 't4080-diff-process.sh', 't4100-apply-stat.sh', 't4101-apply-nonl.sh', 't4102-apply-rename.sh', diff --git a/t/t4080-diff-process.sh b/t/t4080-diff-process.sh new file mode 100755 index 0000000000..f159cd86d8 --- /dev/null +++ b/t/t4080-diff-process.sh @@ -0,0 +1,538 @@ +#!/bin/sh + +test_description='diff process via long-running process' + +. ./test-lib.sh + +if test_have_prereq PYTHON +then + PYTHON_PATH=$(command -v python3) || PYTHON_PATH=$(command -v python) +fi + +# +# A single parametric diff process. +# Usage: diff-process-backend --mode=<mode> [--log=<path>] +# +# Modes: +# whole-file - report all lines as changed (default) +# fixed-hunk - always report hunk 5 2 5 2 +# bad-hunk - report out-of-bounds hunk 999 1 999 1 +# bad-sync - report hunk with mismatched unchanged totals +# overlap - report two overlapping hunks +# no-hunks - return no hunks (files considered equivalent) +# error - return status=error for every request +# abort - return status=abort for every request +# crash - read one request then exit without responding +# +setup_backend () { + cat >"$TRASH_DIRECTORY/diff-process-backend.py" <<-\PYEOF + import sys, os + + def read_pkt(): + hdr = sys.stdin.buffer.read(4) + if len(hdr) < 4: return None + length = int(hdr, 16) + if length == 0: return "" + data = sys.stdin.buffer.read(length - 4) + return data.decode().rstrip("\n") + + def write_pkt(line): + data = (line + "\n").encode() + sys.stdout.buffer.write(f"{len(data)+4:04x}".encode() + data) + sys.stdout.buffer.flush() + + def write_flush(): + sys.stdout.buffer.write(b"0000") + sys.stdout.buffer.flush() + + def read_content(): + chunks = [] + while True: + hdr = sys.stdin.buffer.read(4) + if len(hdr) < 4: break + length = int(hdr, 16) + if length == 0: break + chunks.append(sys.stdin.buffer.read(length - 4)) + return b"".join(chunks) + + mode = "whole-file" + logfile = None + for arg in sys.argv[1:]: + if arg.startswith("--mode="): + mode = arg[7:] + elif arg.startswith("--log="): + logfile = open(arg[6:], "a") + + def log(msg): + if logfile: + logfile.write(msg + "\n") + logfile.flush() + + # Handshake + assert read_pkt() == "git-diff-client" + assert read_pkt() == "version=1" + read_pkt() + write_pkt("git-diff-server") + write_pkt("version=1") + write_flush() + while True: + p = read_pkt() + if p == "": break + write_pkt("capability=hunks") + write_flush() + + log("ready") + + while True: + cmd = None + pathname = None + while True: + p = read_pkt() + if p is None: sys.exit(0) + if p == "": break + if p.startswith("command="): cmd = p.split("=",1)[1] + if p.startswith("pathname="): pathname = p.split("=",1)[1] + if cmd is None: sys.exit(0) + old = read_content() + new = read_content() + old_first = old.split(b"\n")[0].decode(errors="replace") if old else "" + new_first = new.split(b"\n")[0].decode(errors="replace") if new else "" + log(f"command={cmd} pathname={pathname} old={old_first} new={new_first}") + + if mode == "error": + write_flush() + write_pkt("status=error") + write_flush() + continue + + if mode == "abort": + write_flush() + write_pkt("status=abort") + write_flush() + continue + + if mode == "crash": + sys.exit(1) + + if cmd == "hunks": + if mode == "fixed-hunk": + write_pkt("hunk 5 2 5 2") + elif mode == "bad-hunk": + write_pkt("hunk 999 1 999 1") + elif mode == "bad-sync": + write_pkt("hunk 1 2 1 1") + elif mode == "overlap": + write_pkt("hunk 1 5 1 5") + write_pkt("hunk 3 2 3 2") + elif mode == "no-hunks": + pass + else: + ol = old.count(b"\n") + nl = new.count(b"\n") + write_pkt(f"hunk 1 {ol} 1 {nl}") + write_flush() + write_pkt("status=success") + write_flush() + else: + write_flush() + write_pkt("status=error") + write_flush() + PYEOF + write_script diff-process-backend <<-SHEOF + exec "$PYTHON_PATH" "$TRASH_DIRECTORY/diff-process-backend.py" "\$@" + SHEOF +} + +BACKEND="./diff-process-backend" + +test_expect_success PYTHON 'setup' ' + setup_backend && + echo "*.c diff=cdiff" >.gitattributes && + git add .gitattributes && + + # boundary.c: 10 lines, changes at 5-6 and 9-10. + # Used by: hunk boundaries, error fallback, crash, bad hunks, overlap. + cat >boundary.c <<-\EOF && + line1 + line2 + line3 + line4 + OLD5 + OLD6 + line7 + line8 + OLD9 + OLD10 + EOF + git add boundary.c && + + # worddiff.c: single-line function, value changes 1 -> 999. + # Used by: word-diff, --diff-algorithm, --no-ext-diff, --stat. + cat >worddiff.c <<-\EOF && + int value(void) { return 1; } + EOF + git add worddiff.c && + + # newfile.c: single-line function, value changes 42 -> 99. + # Used by: new file, --exit-code, multiple drivers. + cat >newfile.c <<-\EOF && + int new_func(void) { return 42; } + EOF + git add newfile.c && + + # logtest.c: single-line function for log/format-patch tests. + # Needs two commits so log -1 has a diff. + cat >logtest.c <<-\EOF && + int logfunc(void) { return 1; } + EOF + git add logtest.c && + + # two.c/one.c: two-file pair for error/abort/startup-failure tests. + cat >one.c <<-\EOF && + int first(void) { return 1; } + EOF + cat >two.c <<-\EOF && + int second(void) { return 2; } + EOF + git add one.c two.c && + + git commit -m "initial" && + + # Second commit for logtest.c (so log -1 has something to show). + cat >logtest.c <<-\EOF && + int logfunc(void) { return 2; } + EOF + git add logtest.c && + git commit -m "change logtest.c" && + + # Working tree modifications (not committed). + cat >boundary.c <<-\EOF && + line1 + line2 + line3 + line4 + NEW5 + NEW6 + line7 + line8 + NEW9 + NEW10 + EOF + + cat >worddiff.c <<-\EOF && + int value(void) { return 999; } + EOF + + cat >newfile.c <<-\EOF && + int new_func(void) { return 99; } + EOF + + cat >one.c <<-\EOF && + int first(void) { return 10; } + EOF + + cat >two.c <<-\EOF + int second(void) { return 20; } + EOF +' + +# +# Core behavior: the tool controls which lines are marked as changed. +# + +test_expect_success PYTHON 'diff process hunk boundaries affect output' ' + # The file has changes at lines 5-6 and 9-10, but fixed-hunk + # only reports lines 5-6 as changed. Lines 9-10 should not + # appear as changed in the output. + git -c diff.cdiff.process="$BACKEND --mode=fixed-hunk" \ + diff boundary.c >actual && + test_grep "^-OLD5" actual && + test_grep "^-OLD6" actual && + test_grep "^+NEW5" actual && + test_grep "^+NEW6" actual && + test_grep ! "^-OLD9" actual && + test_grep ! "^-OLD10" actual && + test_grep ! "^+NEW9" actual && + test_grep ! "^+NEW10" actual +' + +test_expect_success PYTHON 'diff process works with new file' ' + rm -f backend.log && + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ + diff -- newfile.c >actual 2>stderr && + test_grep "return 99" actual && + test_grep "pathname=newfile.c" backend.log && + test_must_be_empty stderr +' + +test_expect_success PYTHON 'diff process works with added file (empty old side)' ' + cat >added.c <<-\EOF && + int added(void) { return 1; } + EOF + git add added.c && + + rm -f backend.log && + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ + diff --cached -- added.c >actual 2>stderr && + test_grep "added" actual && + test_grep "pathname=added.c" backend.log && + test_must_be_empty stderr +' + +test_expect_success PYTHON 'diff process skipped for binary files' ' + printf "\\0binary" >binary.c && + git add binary.c && + git commit -m "add binary" && + printf "\\0changed" >binary.c && + + rm -f backend.log && + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ + diff -- binary.c >actual && + test_grep "Binary files" actual && + test_path_is_missing backend.log +' + +test_expect_success PYTHON 'diff process not consulted for unmatched driver' ' + echo "not tracked by cdiff" >unmatched.txt && + git add unmatched.txt && + git commit -m "add unmatched.txt" && + + echo "modified" >unmatched.txt && + + rm -f backend.log && + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ + diff -- unmatched.txt >actual && + test_grep "modified" actual && + test_path_is_missing backend.log +' + +test_expect_success PYTHON 'multiple drivers use separate processes' ' + echo "*.h diff=hdiff" >>.gitattributes && + git add .gitattributes && + + cat >multi.h <<-\EOF && + int header(void) { return 1; } + EOF + git add multi.h && + git commit -m "add multi.h" && + + cat >multi.h <<-\EOF && + int header(void) { return 2; } + EOF + + rm -f backend-c.log backend-h.log && + git -c diff.cdiff.process="$BACKEND --log=backend-c.log" \ + -c diff.hdiff.process="$BACKEND --log=backend-h.log" \ + diff -- newfile.c multi.h >actual 2>stderr && + test_grep "pathname=newfile.c" backend-c.log && + test_grep "pathname=multi.h" backend-h.log && + test_must_be_empty stderr +' + +test_expect_success PYTHON 'diff process works alongside textconv' ' + write_script uppercase-filter <<-\EOF && + tr "a-z" "A-Z" <"$1" + EOF + + cat >textconv.c <<-\EOF && + hello world + EOF + git add textconv.c && + git commit -m "add textconv.c" && + + cat >textconv.c <<-\EOF && + goodbye world + EOF + + rm -f backend.log && + git -c diff.cdiff.textconv="./uppercase-filter" \ + -c diff.cdiff.process="$BACKEND --log=backend.log" \ + diff -- textconv.c >actual 2>stderr && + # The diff process receives textconv-transformed (uppercase) content. + test_grep "pathname=textconv.c" backend.log && + test_grep "old=HELLO WORLD" backend.log && + test_grep "new=GOODBYE WORLD" backend.log && + test_must_be_empty stderr +' + +# +# Downstream features: word diff, log, equivalent files, exit code. +# + +test_expect_success PYTHON 'diff process with --word-diff' ' + rm -f backend.log && + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ + diff --word-diff worddiff.c >actual 2>stderr && + test_grep "\[-1;-\]" actual && + test_grep "{+999;+}" actual && + test_grep "pathname=worddiff.c" backend.log && + test_must_be_empty stderr +' + +test_expect_success PYTHON 'diff process works with git log -p' ' + # With no-hunks mode, the tool says the files are equivalent, + # so log -p should show the commit but no diff content. + rm -f backend.log && + git -c diff.cdiff.process="$BACKEND --mode=no-hunks --log=backend.log" \ + log -1 -p -- logtest.c >actual 2>stderr && + test_grep "change logtest.c" actual && + test_grep ! "return 2" actual && + test_grep "command=hunks pathname=logtest.c" backend.log && + test_must_be_empty stderr +' + +test_expect_success PYTHON 'diff process no hunks suppresses diff output' ' + cat >nohunks.c <<-\EOF && + int zero(void) { return 0; } + EOF + git add nohunks.c && + git commit -m "add nohunks.c" && + + cat >nohunks.c <<-\EOF && + int zero(void) { return 999; } + EOF + + git -c diff.cdiff.process="$BACKEND --mode=no-hunks" \ + diff nohunks.c >actual && + test_must_be_empty actual +' + +test_expect_success PYTHON 'diff process no hunks with --exit-code returns success' ' + git -c diff.cdiff.process="$BACKEND --mode=no-hunks" \ + diff --exit-code nohunks.c +' + +test_expect_success PYTHON 'diff process with --exit-code and hunks returns failure' ' + test_expect_code 1 git -c diff.cdiff.process="$BACKEND" \ + diff --exit-code newfile.c +' + +# +# Bypass mechanisms: flags and commands that skip the diff process. +# + +test_expect_success PYTHON 'diff process bypassed by --diff-algorithm' ' + rm -f backend.log && + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ + diff --diff-algorithm=patience worddiff.c >actual && + test_grep "return 999" actual && + test_path_is_missing backend.log +' + +test_expect_success PYTHON 'diff process not used by --stat' ' + rm -f backend.log && + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ + diff --stat worddiff.c >actual && + test_grep "worddiff.c" actual && + test_path_is_missing backend.log +' + +# +# Error handling and fallback. +# + +test_expect_success PYTHON 'diff process fallback on tool error status' ' + rm -f backend.log && + git -c diff.cdiff.process="$BACKEND --mode=error --log=backend.log" \ + diff boundary.c >actual 2>stderr && + # Fallback produces the full builtin diff (both change regions). + test_grep "^-OLD5" actual && + test_grep "^+NEW5" actual && + test_grep "^-OLD9" actual && + test_grep "^+NEW9" actual && + # Tool was contacted (it replied with error, not crash). + test_grep "command=hunks pathname=boundary.c" backend.log && + test_grep "diff process.*failed" stderr +' + +test_expect_success PYTHON 'diff process error keeps tool available for next file' ' + rm -f backend.log && + git -c diff.cdiff.process="$BACKEND --mode=error --log=backend.log" \ + diff -- one.c two.c >actual 2>stderr && + # Unlike abort, error keeps the tool available: both files + # are sent to the tool (and both fall back). + test_grep "pathname=one.c" backend.log && + test_grep "pathname=two.c" backend.log && + test_grep "return 10" actual && + test_grep "return 20" actual +' + +test_expect_success PYTHON 'diff process abort disables for session' ' + rm -f backend.log && + git -c diff.cdiff.process="$BACKEND --mode=abort --log=backend.log" \ + diff -- one.c two.c >actual && + # Both files should still produce diff output via fallback. + test_grep "return 10" actual && + test_grep "return 20" actual && + # The tool aborts on the first file and git clears its + # capability. The second file never contacts the tool. + test_grep "pathname=one.c" backend.log && + test_grep ! "pathname=two.c" backend.log +' + +test_expect_success PYTHON 'diff process fallback on tool crash' ' + git -c diff.cdiff.process="$BACKEND --mode=crash" \ + diff boundary.c >actual 2>stderr && + test_grep "^-OLD5" actual && + test_grep "^+NEW5" actual && + test_grep "^-OLD9" actual && + test_grep "^+NEW9" actual && + # Crash is a communication failure, so a warning is emitted. + test_grep "diff process.*failed" stderr +' + +test_expect_success PYTHON 'diff process startup failure only warns once' ' + git -c diff.cdiff.process="/nonexistent/tool" \ + diff -- one.c two.c >actual 2>stderr && + # Both files produce diff output via fallback. + test_grep "return 10" actual && + test_grep "return 20" actual && + # Sentinel prevents repeated warnings: only one, not one per file. + test_grep "diff process.*failed" stderr >warnings && + test_line_count = 1 warnings +' + +test_expect_success PYTHON 'diff process fallback on bad hunks' ' + git -c diff.cdiff.process="$BACKEND --mode=bad-hunk" \ + diff boundary.c >actual 2>stderr && + test_grep "^-OLD5" actual && + test_grep "^+NEW5" actual && + test_grep "^-OLD9" actual && + test_grep "^+NEW9" actual && + # Invalid hunks are caught by xdiff validation, not the + # protocol layer, so no warning is emitted. + test_must_be_empty stderr +' + +test_expect_success PYTHON 'diff process fallback on mismatched unchanged totals' ' + cat >synctest.c <<-\EOF && + line1 + line2 + line3 + EOF + git add synctest.c && + git commit -m "add synctest.c" && + + cat >synctest.c <<-\EOF && + line1 + changed + line3 + EOF + + # bad-sync reports hunk 1 2 1 1: marks 2 old lines and 1 new + # line as changed, leaving 1 unchanged old vs 2 unchanged new. + # The synchronization invariant fails and git falls back. + git -c diff.cdiff.process="$BACKEND --mode=bad-sync" \ + diff synctest.c >actual 2>stderr && + test_grep "changed" actual +' + +test_expect_success PYTHON 'diff process fallback on overlapping hunks' ' + # boundary.c has 10 lines, so both hunks are in bounds + # but they overlap at lines 3-5, triggering the ordering check. + git -c diff.cdiff.process="$BACKEND --mode=overlap" \ + diff boundary.c >actual 2>stderr && + test_grep "NEW5" actual +' + +test_done diff --git a/userdiff.h b/userdiff.h index 51c26e0d41..a98eabe377 100644 --- a/userdiff.h +++ b/userdiff.h @@ -3,6 +3,7 @@ #include "notes-cache.h" +struct diff_subprocess; struct index_state; struct repository; @@ -33,6 +34,8 @@ struct userdiff_driver { int textconv_want_cache; const char *process; char *process_owned; + struct diff_subprocess *diff_subprocess; + unsigned diff_process_failed : 1; }; enum userdiff_driver_type { USERDIFF_DRIVER_TYPE_BUILTIN = 1<<0, -- gitgitgadget ^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH v3 5/6] diff: bypass diff process with --no-ext-diff and in format-patch 2026-05-29 20:48 ` [PATCH v3 0/6] [RFC] diff: add diff.<driver>.process for external hunk providers Michael Montalbo via GitGitGadget ` (3 preceding siblings ...) 2026-05-29 20:48 ` [PATCH v3 4/6] diff: add long-running diff process via diff.<driver>.process Michael Montalbo via GitGitGadget @ 2026-05-29 20:48 ` Michael Montalbo via GitGitGadget 2026-05-29 20:48 ` [PATCH v3 6/6] blame: consult diff process for no-hunk detection Michael Montalbo via GitGitGadget 2026-05-31 10:44 ` [PATCH v3 0/6] [RFC] diff: add diff.<driver>.process for external hunk providers Junio C Hamano 6 siblings, 0 replies; 30+ messages in thread From: Michael Montalbo via GitGitGadget @ 2026-05-29 20:48 UTC (permalink / raw) To: git; +Cc: Michael Montalbo, Michael Montalbo From: Michael Montalbo <mmontalbo@gmail.com> Make --no-ext-diff disable diff.<driver>.process in addition to diff.<driver>.command. Although the two mechanisms work differently (command replaces Git's output, process feeds hunks back into the pipeline), both invoke external tools and --no-ext-diff means "no external tools." Replace the OPT_BOOL for --ext-diff with an OPT_CALLBACK that sets both allow_external and no_diff_process, so a single option controls both. Passing --ext-diff explicitly clears no_diff_process, so a later --ext-diff overrides an earlier --no-ext-diff. Disable the diff process unconditionally in format-patch so that generated patches are always based on the builtin diff algorithm and can be applied reliably by recipients who do not have the external tool. Document that --diff-algorithm also bypasses the diff process, since it sets ignore_driver_algorithm which diff_process_fill_hunks already checks. Signed-off-by: Michael Montalbo <mmontalbo@gmail.com> --- Documentation/diff-algorithm-option.adoc | 3 +++ Documentation/diff-options.adoc | 4 +++- builtin/log.c | 7 +++++++ diff.c | 16 ++++++++++++++-- diff.h | 4 +++- t/t4080-diff-process.sh | 16 ++++++++++++++++ 6 files changed, 46 insertions(+), 4 deletions(-) diff --git a/Documentation/diff-algorithm-option.adoc b/Documentation/diff-algorithm-option.adoc index 8e3a0b63d7..4d7e2ec35f 100644 --- a/Documentation/diff-algorithm-option.adoc +++ b/Documentation/diff-algorithm-option.adoc @@ -18,3 +18,6 @@ For instance, if you configured the `diff.algorithm` variable to a non-default value and want to use the default one, then you have to use `--diff-algorithm=default` option. ++ +If you explicitly choose a diff algorithm, it also bypasses +`diff.<driver>.process` (see linkgit:gitattributes[5]). diff --git a/Documentation/diff-options.adoc b/Documentation/diff-options.adoc index 8a63b5e164..18b8b0ed24 100644 --- a/Documentation/diff-options.adoc +++ b/Documentation/diff-options.adoc @@ -825,7 +825,9 @@ endif::git-format-patch[] to use this option with linkgit:git-log[1] and friends. `--no-ext-diff`:: - Disallow external diff drivers. + Disallow external diff helpers, including + `diff.<driver>.command` and `diff.<driver>.process` + (see linkgit:gitattributes[5]). `--textconv`:: `--no-textconv`:: diff --git a/builtin/log.c b/builtin/log.c index 8c0939dd42..1ea520c12d 100644 --- a/builtin/log.c +++ b/builtin/log.c @@ -2213,6 +2213,13 @@ int cmd_format_patch(int argc, if (argc > 1) die(_("unrecognized argument: %s"), argv[1]); + /* + * Disable diff.<driver>.process so that patches generated by + * format-patch are always based on the builtin diff algorithm + * and can be applied reliably. + */ + rev.diffopt.flags.no_diff_process = 1; + if (rev.diffopt.output_format & DIFF_FORMAT_NAME) die(_("--name-only does not make sense")); if (rev.diffopt.output_format & DIFF_FORMAT_NAME_STATUS) diff --git a/diff.c b/diff.c index 2d5ed6ea8c..38932db084 100644 --- a/diff.c +++ b/diff.c @@ -5913,6 +5913,17 @@ static int diff_opt_submodule(const struct option *opt, return 0; } +static int diff_opt_ext_diff(const struct option *opt, + const char *arg, int unset) +{ + struct diff_options *options = opt->value; + + BUG_ON_OPT_ARG(arg); + options->flags.allow_external = !unset; + options->flags.no_diff_process = unset; + return 0; +} + static int diff_opt_textconv(const struct option *opt, const char *arg, int unset) { @@ -6241,8 +6252,9 @@ struct option *add_diff_options(const struct option *opts, N_("exit with 1 if there were differences, 0 otherwise")), OPT_BOOL(0, "quiet", &options->flags.quick, N_("disable all output of the program")), - OPT_BOOL(0, "ext-diff", &options->flags.allow_external, - N_("allow an external diff helper to be executed")), + OPT_CALLBACK_F(0, "ext-diff", options, NULL, + N_("allow an external diff helper to be executed"), + PARSE_OPT_NOARG, diff_opt_ext_diff), OPT_CALLBACK_F(0, "textconv", options, NULL, N_("run external text conversion filters when comparing binary files"), PARSE_OPT_NOARG, diff_opt_textconv), diff --git a/diff.h b/diff.h index d1e5a13e9e..71ba389f03 100644 --- a/diff.h +++ b/diff.h @@ -173,7 +173,9 @@ struct diff_flags { */ unsigned allow_external; - /** Disables diff.<driver>.process. */ + /** + * Disables diff.<driver>.process. Set by --no-ext-diff. + */ unsigned no_diff_process; /** diff --git a/t/t4080-diff-process.sh b/t/t4080-diff-process.sh index f159cd86d8..ee0c306abd 100755 --- a/t/t4080-diff-process.sh +++ b/t/t4080-diff-process.sh @@ -419,6 +419,22 @@ test_expect_success PYTHON 'diff process bypassed by --diff-algorithm' ' test_path_is_missing backend.log ' +test_expect_success PYTHON 'diff process bypassed by --no-ext-diff' ' + rm -f backend.log && + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ + diff --no-ext-diff worddiff.c >actual && + test_grep "return 999" actual && + test_path_is_missing backend.log +' + +test_expect_success PYTHON 'diff process not used by format-patch' ' + rm -f backend.log && + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ + format-patch -1 --stdout -- logtest.c >actual && + test_grep "return 2" actual && + test_path_is_missing backend.log +' + test_expect_success PYTHON 'diff process not used by --stat' ' rm -f backend.log && git -c diff.cdiff.process="$BACKEND --log=backend.log" \ -- gitgitgadget ^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH v3 6/6] blame: consult diff process for no-hunk detection 2026-05-29 20:48 ` [PATCH v3 0/6] [RFC] diff: add diff.<driver>.process for external hunk providers Michael Montalbo via GitGitGadget ` (4 preceding siblings ...) 2026-05-29 20:48 ` [PATCH v3 5/6] diff: bypass diff process with --no-ext-diff and in format-patch Michael Montalbo via GitGitGadget @ 2026-05-29 20:48 ` Michael Montalbo via GitGitGadget 2026-05-31 10:44 ` [PATCH v3 0/6] [RFC] diff: add diff.<driver>.process for external hunk providers Junio C Hamano 6 siblings, 0 replies; 30+ messages in thread From: Michael Montalbo via GitGitGadget @ 2026-05-29 20:48 UTC (permalink / raw) To: git; +Cc: Michael Montalbo, Michael Montalbo From: Michael Montalbo <mmontalbo@gmail.com> When a diff process is configured via diff.<driver>.process, consult it during blame's per-commit diffing. If the process returns no hunks for a commit's changes to a file, treat the commit as having no changes, causing blame to attribute lines to earlier commits. The consultation happens at the pass_blame_to_parent() callsite using diff_process_fill_hunks(), matching how builtin_diff() in diff.c uses the same function. A new diff_hunks_xpp() variant accepts a pre-populated xpparam_t for this callsite, while the existing diff_hunks() retains its original signature and behavior. The copy-detection callsite is unaffected since it does not use the diff process. The subprocess is long-running (one startup cost amortized across the blame traversal), but each commit in the file's history incurs a round-trip to the tool. Signed-off-by: Michael Montalbo <mmontalbo@gmail.com> --- blame.c | 40 +++++++++++---- t/t4080-diff-process.sh | 106 ++++++++++++++++++++++++++++++++++++++++ 2 files changed, 137 insertions(+), 9 deletions(-) diff --git a/blame.c b/blame.c index a3c49d132e..79f02735a4 100644 --- a/blame.c +++ b/blame.c @@ -19,6 +19,8 @@ #include "tag.h" #include "trace2.h" #include "blame.h" +#include "diff-process.h" +#include "xdiff-interface.h" #include "alloc.h" #include "commit-slab.h" #include "bloom.h" @@ -314,17 +316,25 @@ static struct commit *fake_working_tree_commit(struct repository *r, -static int diff_hunks(mmfile_t *file_a, mmfile_t *file_b, - xdl_emit_hunk_consume_func_t hunk_func, void *cb_data, int xdl_opts) +static int diff_hunks_xpp(mmfile_t *file_a, mmfile_t *file_b, + xdl_emit_hunk_consume_func_t hunk_func, + void *cb_data, xpparam_t *xpp) { - xpparam_t xpp = {0}; xdemitconf_t xecfg = {0}; xdemitcb_t ecb = {NULL}; - xpp.flags = xdl_opts; xecfg.hunk_func = hunk_func; ecb.priv = cb_data; - return xdi_diff(file_a, file_b, &xpp, &xecfg, &ecb); + return xdi_diff(file_a, file_b, xpp, &xecfg, &ecb); +} + +static int diff_hunks(mmfile_t *file_a, mmfile_t *file_b, + xdl_emit_hunk_consume_func_t hunk_func, void *cb_data, int xdl_opts) +{ + xpparam_t xpp = {0}; + + xpp.flags = xdl_opts; + return diff_hunks_xpp(file_a, file_b, hunk_func, cb_data, &xpp); } static const char *get_next_line(const char *start, const char *end) @@ -1943,6 +1953,7 @@ static void pass_blame_to_parent(struct blame_scoreboard *sb, struct blame_origin *parent, int ignore_diffs) { mmfile_t file_p, file_o; + xpparam_t xpp = {0}; struct blame_chunk_cb_data d; struct blame_entry *newdest = NULL; @@ -1961,10 +1972,21 @@ static void pass_blame_to_parent(struct blame_scoreboard *sb, &sb->num_read_blob, ignore_diffs); sb->num_get_patch++; - if (diff_hunks(&file_p, &file_o, blame_chunk_cb, &d, sb->xdl_opts)) - die("unable to generate diff (%s -> %s)", - oid_to_hex(&parent->commit->object.oid), - oid_to_hex(&target->commit->object.oid)); + xpp.flags = sb->xdl_opts; + /* + * If the diff process considers the files equivalent, + * skip the diff so blame looks past this commit. + */ + if (diff_process_fill_hunks(&sb->revs->diffopt, target->path, + &file_p, &file_o, &xpp) + != DIFF_PROCESS_EQUIVALENT) { + if (diff_hunks_xpp(&file_p, &file_o, blame_chunk_cb, + &d, &xpp)) + die("unable to generate diff (%s -> %s)", + oid_to_hex(&parent->commit->object.oid), + oid_to_hex(&target->commit->object.oid)); + } + free(xpp.external_hunks); /* The rest are the same as the parent */ blame_chunk(&d.dstq, &d.srcq, INT_MAX, d.offset, INT_MAX, 0, parent, target, 0); diff --git a/t/t4080-diff-process.sh b/t/t4080-diff-process.sh index ee0c306abd..fdf6da1c34 100755 --- a/t/t4080-diff-process.sh +++ b/t/t4080-diff-process.sh @@ -551,4 +551,110 @@ test_expect_success PYTHON 'diff process fallback on overlapping hunks' ' test_grep "NEW5" actual ' +# +# Blame integration. +# + +test_expect_success PYTHON 'blame uses tool-provided hunks' ' + cat >blame-hunk.c <<-\EOF && + line1 + line2 + line3 + line4 + original5 + original6 + line7 + line8 + line9 + line10 + EOF + git add blame-hunk.c && + git commit -m "add blame-hunk.c" && + ORIG=$(git rev-parse --short HEAD) && + + cat >blame-hunk.c <<-\EOF && + line1 + line2 + line3 + line4 + changed5 + changed6 + line7 + line8 + changed9 + changed10 + EOF + git add blame-hunk.c && + git commit -m "change blame-hunk.c" && + CHANGE=$(git rev-parse --short HEAD) && + + # With fixed-hunk mode the tool reports only lines 5-6 as changed, + # so blame should attribute lines 9-10 to the original commit + # even though the builtin diff would show them as changed. + git -c diff.cdiff.process="$BACKEND --mode=fixed-hunk" \ + blame blame-hunk.c >actual && + sed -n "9p" actual >line9 && + sed -n "10p" actual >line10 && + test_grep "$ORIG" line9 && + test_grep "$ORIG" line10 && + sed -n "5p" actual >line5 && + sed -n "6p" actual >line6 && + test_grep "$CHANGE" line5 && + test_grep "$CHANGE" line6 +' + +test_expect_success PYTHON 'blame skips commits with no hunks from diff process' ' + cat >blame.c <<-\EOF && + int main(void) + { + return 0; + } + EOF + git add blame.c && + git commit -m "add blame.c" && + ORIG_COMMIT=$(git rev-parse --short HEAD) && + + cat >blame.c <<-\EOF && + int main(void) + { + return 0; + } + EOF + git add blame.c && + git commit -m "reformat blame.c" && + BLAME_COMMIT=$(git rev-parse --short HEAD) && + + # Without no-hunks mode, blame attributes the change. + git blame blame.c >without && + test_grep "$BLAME_COMMIT" without && + + # With no-hunks mode, the process considers the files equivalent + # and blame skips the reformat commit, attributing to the original. + git -c diff.cdiff.process="$BACKEND --mode=no-hunks" \ + blame blame.c >with && + test_grep ! "$BLAME_COMMIT" with && + test_grep "$ORIG_COMMIT" with +' + +test_expect_success PYTHON 'blame --no-ext-diff bypasses diff process' ' + rm -f backend.log && + git -c diff.cdiff.process="$BACKEND --mode=no-hunks --log=backend.log" \ + blame --no-ext-diff blame.c >actual && + # Without the process, blame attributes the reformat commit normally. + test_grep "$BLAME_COMMIT" actual && + test_path_is_missing backend.log +' + +test_expect_success PYTHON 'blame --no-ext-diff uses builtin hunks' ' + # fixed-hunk mode would narrow blame to lines 5-6, but + # --no-ext-diff should bypass it and use the builtin diff. + rm -f backend.log && + git -c diff.cdiff.process="$BACKEND --mode=fixed-hunk --log=backend.log" \ + blame --no-ext-diff blame-hunk.c >actual && + # Builtin diff attributes lines 9-10 to the change commit. + sed -n "9p" actual >line9 && + test_grep "$CHANGE" line9 && + test_path_is_missing backend.log +' + test_done -- gitgitgadget ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH v3 0/6] [RFC] diff: add diff.<driver>.process for external hunk providers 2026-05-29 20:48 ` [PATCH v3 0/6] [RFC] diff: add diff.<driver>.process for external hunk providers Michael Montalbo via GitGitGadget ` (5 preceding siblings ...) 2026-05-29 20:48 ` [PATCH v3 6/6] blame: consult diff process for no-hunk detection Michael Montalbo via GitGitGadget @ 2026-05-31 10:44 ` Junio C Hamano 2026-06-01 4:28 ` Michael Montalbo 6 siblings, 1 reply; 30+ messages in thread From: Junio C Hamano @ 2026-05-31 10:44 UTC (permalink / raw) To: Michael Montalbo via GitGitGadget; +Cc: git, Michael Montalbo "Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes: > Language-aware diff tools (e.g., Difftastic) and format-specific analyzers > can produce better line matching than Git's builtin diff algorithm, but > diff.<driver>.command replaces Git's output entirely, losing downstream > features like word diff, function context, color, and blame. This seems to break CI on Windows; take a look at https://github.com/git/git/actions/runs/26709491830/job/78717295153 for an example. Thanks. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v3 0/6] [RFC] diff: add diff.<driver>.process for external hunk providers 2026-05-31 10:44 ` [PATCH v3 0/6] [RFC] diff: add diff.<driver>.process for external hunk providers Junio C Hamano @ 2026-06-01 4:28 ` Michael Montalbo 0 siblings, 0 replies; 30+ messages in thread From: Michael Montalbo @ 2026-06-01 4:28 UTC (permalink / raw) To: Junio C Hamano; +Cc: Michael Montalbo via GitGitGadget, git On Sun, May 31, 2026 at 3:44 AM Junio C Hamano <gitster@pobox.com> wrote: > > "Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes: > > > Language-aware diff tools (e.g., Difftastic) and format-specific analyzers > > can produce better line matching than Git's builtin diff algorithm, but > > diff.<driver>.command replaces Git's output entirely, losing downstream > > features like word diff, function context, color, and blame. > > This seems to break CI on Windows; take a look at > > https://github.com/git/git/actions/runs/26709491830/job/78717295153 > > for an example. > > Thanks. Thanks for the heads up. Interestingly, I think Windows exposed a latent issue with sub-process startup dying instead of erroring out when there is a space in the path. I've submitted https://lore.kernel.org/git/pull.2133.git.1780287309846.gitgitgadget@gmail.com/T/#u which I believe addresses the issue, however it seems like there are some potentially unrelated CI failures going on right now which is making verification a bit harder. ^ permalink raw reply [flat|nested] 30+ messages in thread
end of thread, other threads:[~2026-06-01 4:28 UTC | newest] Thread overview: 30+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-05-22 2:11 [PATCH 0/5] [RFC] diff: add diff.<driver>.process for external hunk providers Michael Montalbo via GitGitGadget 2026-05-22 2:11 ` [PATCH 1/5] xdiff: support external hunks via xpparam_t Michael Montalbo via GitGitGadget 2026-05-22 5:29 ` Junio C Hamano 2026-05-22 19:06 ` Michael Montalbo 2026-05-24 8:50 ` Junio C Hamano 2026-05-24 18:01 ` Michael Montalbo 2026-05-22 2:11 ` [PATCH 2/5] userdiff: add diff.<driver>.process config Michael Montalbo via GitGitGadget 2026-05-22 2:11 ` [PATCH 3/5] diff: add long-running diff process via diff.<driver>.process Michael Montalbo via GitGitGadget 2026-05-22 2:11 ` [PATCH 4/5] blame: consult diff process for zero-hunk detection Michael Montalbo via GitGitGadget 2026-05-22 2:11 ` [PATCH 5/5] diff-process-normalize: add built-in whitespace normalizer Michael Montalbo via GitGitGadget 2026-05-22 5:29 ` [PATCH 0/5] [RFC] diff: add diff.<driver>.process for external hunk providers Junio C Hamano 2026-05-22 17:19 ` Michael Montalbo 2026-05-25 18:29 ` [PATCH v2 0/4] " Michael Montalbo via GitGitGadget 2026-05-25 18:29 ` [PATCH v2 1/4] xdiff: support external hunks via xpparam_t Michael Montalbo via GitGitGadget 2026-05-25 18:29 ` [PATCH v2 2/4] userdiff: add diff.<driver>.process config Michael Montalbo via GitGitGadget 2026-05-25 18:29 ` [PATCH v2 3/4] diff: add long-running diff process via diff.<driver>.process Michael Montalbo via GitGitGadget 2026-05-26 1:56 ` Junio C Hamano 2026-05-29 0:51 ` Michael Montalbo 2026-05-26 2:26 ` Junio C Hamano 2026-05-29 0:55 ` Michael Montalbo 2026-05-25 18:29 ` [PATCH v2 4/4] blame: consult diff process for zero-hunk detection Michael Montalbo via GitGitGadget 2026-05-29 20:48 ` [PATCH v3 0/6] [RFC] diff: add diff.<driver>.process for external hunk providers Michael Montalbo via GitGitGadget 2026-05-29 20:48 ` [PATCH v3 1/6] xdiff: support external hunks via xpparam_t Michael Montalbo via GitGitGadget 2026-05-29 20:48 ` [PATCH v3 2/6] userdiff: add diff.<driver>.process config Michael Montalbo via GitGitGadget 2026-05-29 20:48 ` [PATCH v3 3/6] sub-process: separate process lifecycle from hashmap management Michael Montalbo via GitGitGadget 2026-05-29 20:48 ` [PATCH v3 4/6] diff: add long-running diff process via diff.<driver>.process Michael Montalbo via GitGitGadget 2026-05-29 20:48 ` [PATCH v3 5/6] diff: bypass diff process with --no-ext-diff and in format-patch Michael Montalbo via GitGitGadget 2026-05-29 20:48 ` [PATCH v3 6/6] blame: consult diff process for no-hunk detection Michael Montalbo via GitGitGadget 2026-05-31 10:44 ` [PATCH v3 0/6] [RFC] diff: add diff.<driver>.process for external hunk providers Junio C Hamano 2026-06-01 4:28 ` Michael Montalbo
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox