public inbox for git@vger.kernel.org
* [PATCH 00/10] Xdiff cleanup part 3
@ 2026-01-02 18:52 Ezekiel Newren via GitGitGadget
  2026-01-02 18:52 ` [PATCH 01/10] ivec: introduce the C side of ivec Ezekiel Newren via GitGitGadget
                   ` (14 more replies)
  0 siblings, 15 replies; 78+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2026-01-02 18:52 UTC (permalink / raw)
  To: git; +Cc: Ezekiel Newren

Patch series summary:

 * patch 1: Introduce the ivec type
 * patch 2: Create the function xdl_do_classic_diff()
 * patches 3-4: generic cleanup
 * patches 5-8: convert from dstart/dend (in xdfile_t) to
   delta_start/delta_end (in xdfenv_t)
 * patches 9-10: move xdl_cleanup_records(), and related, from xprepare.c to
   xdiffi.c

Things that will be addressed in future patch series:

 * Make xdl_cleanup_records() easier to read
 * convert recs/nrec into an ivec
 * convert changed to an ivec
 * remove reference_index/nreff from xdfile_t and turn it into an ivec
 * splitting minimal_perfect_hash out as its own ivec
 * improve the performance of the classifier and parsing/hashing lines

=== before this patch series

    typedef struct s_xdfile {
        xrecord_t *recs;
        size_t nrec;
        ptrdiff_t dstart, dend;
        bool *changed;
        size_t *reference_index;
        size_t nreff;
    } xdfile_t;

    typedef struct s_xdfenv {
        xdfile_t xdf1, xdf2;
    } xdfenv_t;

=== after this patch series

    typedef struct s_xdfile {
        xrecord_t *recs;
        size_t nrec;
        bool *changed;
        size_t *reference_index;
        size_t nreff;
    } xdfile_t;

    typedef struct s_xdfenv {
        xdfile_t xdf1, xdf2;
        size_t delta_start, delta_end;
        size_t mph_size;
    } xdfenv_t;

Ezekiel Newren (10):
  ivec: introduce the C side of ivec
  xdiff: make classic diff explicit by creating xdl_do_classic_diff()
  xdiff: don't waste time guessing the number of lines
  xdiff: let patience and histogram benefit from xdl_trim_ends()
  xdiff: use xdfenv_t in xdl_trim_ends() and xdl_cleanup_records()
  xdiff: cleanup xdl_trim_ends()
  xdiff: replace xdfile_t.dstart with xdfenv_t.delta_start
  xdiff: replace xdfile_t.dend with xdfenv_t.delta_end
  xdiff: remove dependence on xdlclassifier from xdl_cleanup_records()
  xdiff: move xdl_cleanup_records() from xprepare.c to xdiffi.c

 Makefile           |   1 +
 compat/ivec.c      | 113 ++++++++++++++++++
 compat/ivec.h      |  52 +++++++++
 meson.build        |   1 +
 xdiff/xdiffi.c     | 221 +++++++++++++++++++++++++++++++++---
 xdiff/xdiffi.h     |   1 +
 xdiff/xhistogram.c |   7 +-
 xdiff/xpatience.c  |   7 +-
 xdiff/xprepare.c   | 277 ++++++++-------------------------------------
 xdiff/xtypes.h     |   3 +-
 xdiff/xutils.c     |  20 ----
 xdiff/xutils.h     |   1 -
 12 files changed, 432 insertions(+), 272 deletions(-)
 create mode 100644 compat/ivec.c
 create mode 100644 compat/ivec.h


base-commit: 66ce5f8e8872f0183bb137911c52b07f1f242d13
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-2156%2Fezekielnewren%2Fxdiff-cleanup-3-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-2156/ezekielnewren/xdiff-cleanup-3-v1
Pull-Request: https://github.com/git/git/pull/2156
-- 
gitgitgadget


* [PATCH 01/10] ivec: introduce the C side of ivec
  2026-01-02 18:52 [PATCH 00/10] Xdiff cleanup part 3 Ezekiel Newren via GitGitGadget
@ 2026-01-02 18:52 ` Ezekiel Newren via GitGitGadget
  2026-01-04  5:32   ` Junio C Hamano
                     ` (2 more replies)
  2026-01-02 18:52 ` [PATCH 02/10] xdiff: make classic diff explicit by creating xdl_do_classic_diff() Ezekiel Newren via GitGitGadget
                   ` (13 subsequent siblings)
  14 siblings, 3 replies; 78+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2026-01-02 18:52 UTC (permalink / raw)
  To: git; +Cc: Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

Trying to use Rust's Vec in C, or git's ALLOC_GROW() macros (via
wrapper functions) in Rust is painful because:

  * C doesn't define its own vector type, and even though Rust does
    have Vec, it's painful to use on the C side (more on that below).
    However, it's still not viable to use Rust's Vec type because Git
    needs to be able to compile without Rust. So ivec was created
    expressly to be interoperable between C and Rust without needing
    Rust.
  * C doing vector things the Rust way would require wrapper functions,
    and Rust doing vector things the C way would require wrapper
    functions, so ivec was created to ensure a consistent contract
    between the 2 languages for how to manipulate a vector.
  * Currently, Rust defines its own 'Vec' type that is generic, but its
    memory allocator and struct layout weren't designed for
    interoperability with C (or any language for that matter), meaning
    that the C side cannot push to or expand a 'Vec' without defining
    wrapper functions in Rust that C can call. Without special care,
    the two languages might use different allocators (malloc/free on
    the C side, and possibly something else in Rust), which would make
    it difficult for a function in one language to free elements
    allocated by a call from a function in the other language.
  * Similarly, git defines ALLOC_GROW() and related macros in
    git-compat-util.h. While we could add functions allowing Rust to
    invoke something similar to those macros, passing three variables
    (pointer, length, allocated_size) instead of a single variable
    (vector) across the language boundary requires more cognitive
    overhead for readers to keep track of and makes it easier to make
    mistakes. Further, for low-level components that we want to
    eventually convert to pure Rust, such triplets would feel very out
    of place.

To address these issues, introduce a new type, ivec -- short for
interoperable vector. (We refer to it as 'ivec' generally, though on
the Rust side the struct is called IVec to match Rust style.)  This new
type is specifically designed for FFI purposes, so that both languages
handle the vector in the same way, though it could be used on either
side independently. This type is designed such that it can easily be
replaced by a Rust 'Vec' once interoperability is no longer a concern.

One particular item to note is that Git's macros to handle vec
operations infer the amount that a vec needs to grow from the size of
a pointer, but that makes it somewhat specific to the macros used in C.
To avoid defining every ivec function as a macro I opted to also
include an element_size field that allows concrete functions like
push() to know how much to grow the memory. This element_size also
helps in verifying that the ivec is correct when passing from C to
Rust.
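
As an illustration of the element_size point above, here is a minimal,
self-contained sketch of a type-erased push in plain C. It is not the code
from this patch: struct demo_vec, demo_init(), demo_push(), demo_free() and
demo_sum_first() are hypothetical names, the doubling growth policy is an
assumption chosen for brevity, and unlike the patch the sketch checks the
result of realloc().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type-erased vector; same four fields as the ivec layout. */
struct demo_vec {
	void *ptr;
	size_t length;
	size_t capacity;
	size_t element_size;
};

void demo_init(struct demo_vec *v, size_t element_size)
{
	v->ptr = NULL;
	v->length = 0;
	v->capacity = 0;
	v->element_size = element_size;
}

/* Append one element; returns 0 on success, -1 if allocation failed. */
int demo_push(struct demo_vec *v, const void *value)
{
	if (v->length == v->capacity) {
		/* Assumed growth policy: start at 8 slots, then double. */
		size_t new_capacity = v->capacity ? v->capacity * 2 : 8;
		void *p = realloc(v->ptr, new_capacity * v->element_size);
		if (!p)
			return -1;
		v->ptr = p;
		v->capacity = new_capacity;
	}
	/* element_size tells a non-macro function where the next slot is. */
	memcpy((uint8_t *)v->ptr + v->length * v->element_size,
	       value, v->element_size);
	v->length++;
	return 0;
}

void demo_free(struct demo_vec *v)
{
	free(v->ptr);
	v->ptr = NULL;
	v->length = 0;
	v->capacity = 0;
	/* element_size is left untouched so the vector can be reused. */
}

/* Push 0..n-1 as uint32_t and return their sum (UINT64_MAX on failure). */
uint64_t demo_sum_first(uint32_t n)
{
	struct demo_vec v;
	uint64_t sum = 0;

	demo_init(&v, sizeof(uint32_t));
	for (uint32_t i = 0; i < n; i++) {
		if (demo_push(&v, &i) < 0) {
			demo_free(&v);
			return UINT64_MAX;
		}
	}
	for (size_t i = 0; i < v.length; i++)
		sum += ((uint32_t *)v.ptr)[i];
	demo_free(&v);
	return sum;
}
```

Because the element width travels with the vector, demo_push() can be an
ordinary function rather than a macro that infers the width from a pointer
type, which is the trade-off described above.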

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 Makefile      |   1 +
 compat/ivec.c | 113 ++++++++++++++++++++++++++++++++++++++++++++++++++
 compat/ivec.h |  52 +++++++++++++++++++++++
 meson.build   |   1 +
 4 files changed, 167 insertions(+)
 create mode 100644 compat/ivec.c
 create mode 100644 compat/ivec.h

diff --git a/Makefile b/Makefile
index 89d8d73ec0..f923b307d6 100644
--- a/Makefile
+++ b/Makefile
@@ -1107,6 +1107,7 @@ LIB_OBJS += commit-reach.o
 LIB_OBJS += commit.o
 LIB_OBJS += common-exit.o
 LIB_OBJS += common-init.o
+LIB_OBJS += compat/ivec.o
 LIB_OBJS += compat/nonblock.o
 LIB_OBJS += compat/obstack.o
 LIB_OBJS += compat/open.o
diff --git a/compat/ivec.c b/compat/ivec.c
new file mode 100644
index 0000000000..0a777e78dc
--- /dev/null
+++ b/compat/ivec.c
@@ -0,0 +1,113 @@
+#include "ivec.h"
+
+struct IVec_c_void {
+	void *ptr;
+	size_t length;
+	size_t capacity;
+	size_t element_size;
+};
+
+static void _set_capacity(void *self_, size_t new_capacity)
+{
+	struct IVec_c_void *self = self_;
+
+	if (new_capacity == self->capacity) {
+		return;
+	}
+	if (new_capacity == 0) {
+		free(self->ptr);
+		self->ptr = NULL;
+	} else {
+		self->ptr = realloc(self->ptr, new_capacity * self->element_size);
+	}
+	self->capacity = new_capacity;
+}
+
+
+void ivec_init(void *self_, size_t element_size)
+{
+	struct IVec_c_void *self = self_;
+
+	self->ptr = NULL;
+	self->length = 0;
+	self->capacity = 0;
+	self->element_size = element_size;
+}
+
+void ivec_zero(void *self_, size_t capacity)
+{
+	struct IVec_c_void *self = self_;
+
+	self->ptr = calloc(capacity, self->element_size);
+	self->length = capacity;
+	self->capacity = capacity;
+	// DO NOT MODIFY element_size!!!
+}
+
+void ivec_reserve_exact(void *self_, size_t additional)
+{
+	struct IVec_c_void *self = self_;
+
+	_set_capacity(self, self->capacity + additional);
+}
+
+void ivec_reserve(void *self_, size_t additional)
+{
+	struct IVec_c_void *self = self_;
+
+	size_t growby = 128;
+	if (self->capacity > growby)
+		growby = self->capacity;
+	if (additional > growby)
+		growby = additional;
+
+	_set_capacity(self, self->capacity + growby);
+}
+
+void ivec_shrink_to_fit(void *self_)
+{
+	struct IVec_c_void *self = self_;
+
+	_set_capacity(self, self->length);
+}
+
+void ivec_push(void *self_, const void *value)
+{
+	struct IVec_c_void *self = self_;
+	void *dst = NULL;
+
+	if (self->length == self->capacity)
+		ivec_reserve(self, 1);
+
+	dst = (uint8_t*)self->ptr + self->length * self->element_size;
+	memcpy(dst, value, self->element_size);
+	self->length++;
+}
+
+void ivec_free(void *self_)
+{
+	struct IVec_c_void *self = self_;
+
+	free(self->ptr);
+	self->ptr = NULL;
+	self->length = 0;
+	self->capacity = 0;
+	// DO NOT MODIFY element_size!!!
+}
+
+void ivec_move(void *src_, void *dst_)
+{
+	struct IVec_c_void *src = src_;
+	struct IVec_c_void *dst = dst_;
+
+	ivec_free(dst);
+	dst->ptr = src->ptr;
+	dst->length = src->length;
+	dst->capacity = src->capacity;
+	// DO NOT MODIFY element_size!!!
+
+	src->ptr = NULL;
+	src->length = 0;
+	src->capacity = 0;
+	// DO NOT MODIFY element_size!!!
+}
diff --git a/compat/ivec.h b/compat/ivec.h
new file mode 100644
index 0000000000..654a05c506
--- /dev/null
+++ b/compat/ivec.h
@@ -0,0 +1,52 @@
+#ifndef IVEC_H
+#define IVEC_H
+
+#include <git-compat-util.h>
+
+#define IVEC_INIT(variable) ivec_init(&(variable), sizeof(*(variable).ptr))
+
+#ifndef CBINDGEN
+#define DEFINE_IVEC_TYPE(type, suffix) \
+struct IVec_##suffix { \
+	type* ptr; \
+	size_t length; \
+	size_t capacity; \
+	size_t element_size; \
+}
+
+DEFINE_IVEC_TYPE(bool, bool);
+
+DEFINE_IVEC_TYPE(uint8_t, u8);
+DEFINE_IVEC_TYPE(uint16_t, u16);
+DEFINE_IVEC_TYPE(uint32_t, u32);
+DEFINE_IVEC_TYPE(uint64_t, u64);
+
+DEFINE_IVEC_TYPE(int8_t, i8);
+DEFINE_IVEC_TYPE(int16_t, i16);
+DEFINE_IVEC_TYPE(int32_t, i32);
+DEFINE_IVEC_TYPE(int64_t, i64);
+
+DEFINE_IVEC_TYPE(float, f32);
+DEFINE_IVEC_TYPE(double, f64);
+
+DEFINE_IVEC_TYPE(size_t, usize);
+DEFINE_IVEC_TYPE(ssize_t, isize);
+#endif
+
+void ivec_init(void *self_, size_t element_size);
+
+void ivec_zero(void *self_, size_t capacity);
+
+void ivec_reserve_exact(void *self_, size_t additional);
+
+void ivec_reserve(void *self_, size_t additional);
+
+void ivec_shrink_to_fit(void *self_);
+
+void ivec_push(void *self_, const void *value);
+
+void ivec_free(void *self_);
+
+void ivec_move(void *src, void *dst);
+
+#endif /* IVEC_H */
diff --git a/meson.build b/meson.build
index dd52efd1c8..42ac0c8c42 100644
--- a/meson.build
+++ b/meson.build
@@ -302,6 +302,7 @@ libgit_sources = [
   'commit.c',
   'common-exit.c',
   'common-init.c',
+  'compat/ivec.c',
   'compat/nonblock.c',
   'compat/obstack.c',
   'compat/open.c',
-- 
gitgitgadget



* [PATCH 02/10] xdiff: make classic diff explicit by creating xdl_do_classic_diff()
  2026-01-02 18:52 [PATCH 00/10] Xdiff cleanup part 3 Ezekiel Newren via GitGitGadget
  2026-01-02 18:52 ` [PATCH 01/10] ivec: introduce the C side of ivec Ezekiel Newren via GitGitGadget
@ 2026-01-02 18:52 ` Ezekiel Newren via GitGitGadget
  2026-01-20 15:01   ` Phillip Wood
  2026-01-02 18:52 ` [PATCH 03/10] xdiff: don't waste time guessing the number of lines Ezekiel Newren via GitGitGadget
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 78+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2026-01-02 18:52 UTC (permalink / raw)
  To: git; +Cc: Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

Later patches will prepare xdl_cleanup_records() to be moved into xdiffi.c
since only the classic diff uses that function.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xdiffi.c | 43 +++++++++++++++++++++++++++----------------
 xdiff/xdiffi.h |  1 +
 2 files changed, 28 insertions(+), 16 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 4376f943db..e3196c7245 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -311,26 +311,13 @@ int xdl_recs_cmp(xdfile_t *xdf1, long off1, long lim1,
 }
 
 
-int xdl_do_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
-		xdfenv_t *xe) {
+int xdl_do_classic_diff(xdfenv_t *xe, uint64_t flags)
+{
 	long ndiags;
 	long *kvd, *kvdf, *kvdb;
 	xdalgoenv_t xenv;
 	int res;
 
-	if (xdl_prepare_env(mf1, mf2, xpp, xe) < 0)
-		return -1;
-
-	if (XDF_DIFF_ALG(xpp->flags) == XDF_PATIENCE_DIFF) {
-		res = xdl_do_patience_diff(xpp, xe);
-		goto out;
-	}
-
-	if (XDF_DIFF_ALG(xpp->flags) == XDF_HISTOGRAM_DIFF) {
-		res = xdl_do_histogram_diff(xpp, xe);
-		goto out;
-	}
-
 	/*
 	 * Allocate and setup K vectors to be used by the differential
 	 * algorithm.
@@ -355,9 +342,33 @@ int xdl_do_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
 	xenv.heur_min = XDL_HEUR_MIN_COST;
 
 	res = xdl_recs_cmp(&xe->xdf1, 0, xe->xdf1.nreff, &xe->xdf2, 0, xe->xdf2.nreff,
-			   kvdf, kvdb, (xpp->flags & XDF_NEED_MINIMAL) != 0,
+			   kvdf, kvdb, (flags & XDF_NEED_MINIMAL) != 0,
 			   &xenv);
+
 	xdl_free(kvd);
+
+	return res;
+}
+
+
+int xdl_do_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
+		xdfenv_t *xe) {
+	int res;
+
+	if (xdl_prepare_env(mf1, mf2, xpp, xe) < 0)
+		return -1;
+
+	if (XDF_DIFF_ALG(xpp->flags) == XDF_PATIENCE_DIFF) {
+		res = xdl_do_patience_diff(xpp, xe);
+		goto out;
+	}
+
+	if (XDF_DIFF_ALG(xpp->flags) == XDF_HISTOGRAM_DIFF) {
+		res = xdl_do_histogram_diff(xpp, xe);
+		goto out;
+	}
+
+	res = xdl_do_classic_diff(xe, xpp->flags);
  out:
 	if (res < 0)
 		xdl_free_env(xe);
diff --git a/xdiff/xdiffi.h b/xdiff/xdiffi.h
index 49e52c67f9..8bf4c20373 100644
--- a/xdiff/xdiffi.h
+++ b/xdiff/xdiffi.h
@@ -42,6 +42,7 @@ typedef struct s_xdchange {
 int xdl_recs_cmp(xdfile_t *xdf1, long off1, long lim1,
 		 xdfile_t *xdf2, long off2, long lim2,
 		 long *kvdf, long *kvdb, int need_min, xdalgoenv_t *xenv);
+int xdl_do_classic_diff(xdfenv_t *xe, uint64_t flags);
 int xdl_do_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
 		xdfenv_t *xe);
 int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags);
-- 
gitgitgadget



* [PATCH 03/10] xdiff: don't waste time guessing the number of lines
  2026-01-02 18:52 [PATCH 00/10] Xdiff cleanup part 3 Ezekiel Newren via GitGitGadget
  2026-01-02 18:52 ` [PATCH 01/10] ivec: introduce the C side of ivec Ezekiel Newren via GitGitGadget
  2026-01-02 18:52 ` [PATCH 02/10] xdiff: make classic diff explicit by creating xdl_do_classic_diff() Ezekiel Newren via GitGitGadget
@ 2026-01-02 18:52 ` Ezekiel Newren via GitGitGadget
  2026-01-20 15:02   ` Phillip Wood
  2026-01-02 18:52 ` [PATCH 04/10] xdiff: let patience and histogram benefit from xdl_trim_ends() Ezekiel Newren via GitGitGadget
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 78+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2026-01-02 18:52 UTC (permalink / raw)
  To: git; +Cc: Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

All lines must be read anyway, so classify them after they're read in.
Also move the memset() into xdl_init_classifier().
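
A rough sketch of the idea, under stated assumptions: once every line has
been read, the exact line count is known, so a hash table can be sized from
it instead of from a sampled estimate. The helpers demo_count_lines() and
demo_table_bits() below are hypothetical and only illustrate the principle;
they are not xdiff functions, and the cap of 30 bits is an arbitrary choice
for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count lines in buf; a final line without '\n' still counts as a line. */
size_t demo_count_lines(const char *buf, size_t size)
{
	const char *cur = buf, *top = buf + size;
	size_t n = 0;

	while (cur < top) {
		const char *nl = memchr(cur, '\n', (size_t)(top - cur));
		cur = nl ? nl + 1 : top;
		n++;
	}
	return n;
}

/* Smallest bits such that (1 << bits) >= n, capped at 30. */
unsigned int demo_table_bits(size_t n)
{
	unsigned int bits = 0;

	while (bits < 30 && ((size_t)1 << bits) < n)
		bits++;
	return bits;
}
```

With both files read first, the table size would be derived from the sum of
the two exact counts, which is what removes the need for a guessing pass.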

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xprepare.c | 52 +++++++++++++++++++-----------------------------
 xdiff/xutils.c   | 20 -------------------
 xdiff/xutils.h   |  1 -
 3 files changed, 21 insertions(+), 52 deletions(-)

diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 34c82e4f8e..96a32cc5e9 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -26,8 +26,6 @@
 #define XDL_KPDIS_RUN 4
 #define XDL_MAX_EQLIMIT 1024
 #define XDL_SIMSCAN_WINDOW 100
-#define XDL_GUESS_NLINES1 256
-#define XDL_GUESS_NLINES2 20
 
 #define DISCARD 0
 #define KEEP 1
@@ -55,6 +53,8 @@ typedef struct s_xdlclassifier {
 
 
 static int xdl_init_classifier(xdlclassifier_t *cf, long size, long flags) {
+	memset(cf, 0, sizeof(xdlclassifier_t));
+
 	cf->flags = flags;
 
 	cf->hbits = xdl_hashbits((unsigned int) size);
@@ -134,12 +134,12 @@ static void xdl_free_ctx(xdfile_t *xdf)
 }
 
 
-static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_t const *xpp,
-			   xdlclassifier_t *cf, xdfile_t *xdf) {
+static int xdl_prepare_ctx(mmfile_t *mf, xdfile_t *xdf, uint64_t flags) {
 	long bsize;
 	uint64_t hav;
 	uint8_t const *blk, *cur, *top, *prev;
 	xrecord_t *crec;
+	long narec = 8;
 
 	xdf->reference_index = NULL;
 	xdf->changed = NULL;
@@ -152,23 +152,21 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 	if ((cur = blk = xdl_mmfile_first(mf, &bsize))) {
 		for (top = blk + bsize; cur < top; ) {
 			prev = cur;
-			hav = xdl_hash_record(&cur, top, xpp->flags);
+			hav = xdl_hash_record(&cur, top, flags);
 			if (XDL_ALLOC_GROW(xdf->recs, (long)xdf->nrec + 1, narec))
 				goto abort;
 			crec = &xdf->recs[xdf->nrec++];
 			crec->ptr = prev;
 			crec->size = cur - prev;
 			crec->line_hash = hav;
-			if (xdl_classify_record(pass, cf, crec) < 0)
-				goto abort;
 		}
 	}
 
 	if (!XDL_CALLOC_ARRAY(xdf->changed, xdf->nrec + 2))
 		goto abort;
 
-	if ((XDF_DIFF_ALG(xpp->flags) != XDF_PATIENCE_DIFF) &&
-	    (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF)) {
+	if ((XDF_DIFF_ALG(flags) != XDF_PATIENCE_DIFF) &&
+	    (XDF_DIFF_ALG(flags) != XDF_HISTOGRAM_DIFF)) {
 		if (!XDL_ALLOC_ARRAY(xdf->reference_index, xdf->nrec + 1))
 			goto abort;
 	}
@@ -381,37 +379,29 @@ static int xdl_optimize_ctxs(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xdf2
 
 int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
 		    xdfenv_t *xe) {
-	long enl1, enl2, sample;
 	xdlclassifier_t cf;
 
-	memset(&cf, 0, sizeof(cf));
-
-	/*
-	 * For histogram diff, we can afford a smaller sample size and
-	 * thus a poorer estimate of the number of lines, as the hash
-	 * table (rhash) won't be filled up/grown. The number of lines
-	 * (nrecs) will be updated correctly anyway by
-	 * xdl_prepare_ctx().
-	 */
-	sample = (XDF_DIFF_ALG(xpp->flags) == XDF_HISTOGRAM_DIFF
-		  ? XDL_GUESS_NLINES2 : XDL_GUESS_NLINES1);
+	if (xdl_prepare_ctx(mf1, &xe->xdf1, xpp->flags) < 0) {
 
-	enl1 = xdl_guess_lines(mf1, sample) + 1;
-	enl2 = xdl_guess_lines(mf2, sample) + 1;
-
-	if (xdl_init_classifier(&cf, enl1 + enl2 + 1, xpp->flags) < 0)
 		return -1;
+	}
+	if (xdl_prepare_ctx(mf2, &xe->xdf2, xpp->flags) < 0) {
 
-	if (xdl_prepare_ctx(1, mf1, enl1, xpp, &cf, &xe->xdf1) < 0) {
-
-		xdl_free_classifier(&cf);
+		xdl_free_ctx(&xe->xdf1);
 		return -1;
 	}
-	if (xdl_prepare_ctx(2, mf2, enl2, xpp, &cf, &xe->xdf2) < 0) {
 
-		xdl_free_ctx(&xe->xdf1);
-		xdl_free_classifier(&cf);
+	if (xdl_init_classifier(&cf, xe->xdf1.nrec + xe->xdf2.nrec + 1, xpp->flags) < 0)
 		return -1;
+
+	for (size_t i = 0; i < xe->xdf1.nrec; i++) {
+		xrecord_t *rec = &xe->xdf1.recs[i];
+		xdl_classify_record(1, &cf, rec);
+	}
+
+	for (size_t i = 0; i < xe->xdf2.nrec; i++) {
+		xrecord_t *rec = &xe->xdf2.recs[i];
+		xdl_classify_record(2, &cf, rec);
 	}
 
 	if ((XDF_DIFF_ALG(xpp->flags) != XDF_PATIENCE_DIFF) &&
diff --git a/xdiff/xutils.c b/xdiff/xutils.c
index 77ee1ad9c8..b3d51197c1 100644
--- a/xdiff/xutils.c
+++ b/xdiff/xutils.c
@@ -118,26 +118,6 @@ void *xdl_cha_alloc(chastore_t *cha) {
 	return data;
 }
 
-long xdl_guess_lines(mmfile_t *mf, long sample) {
-	long nl = 0, size, tsize = 0;
-	char const *data, *cur, *top;
-
-	if ((cur = data = xdl_mmfile_first(mf, &size))) {
-		for (top = data + size; nl < sample && cur < top; ) {
-			nl++;
-			if (!(cur = memchr(cur, '\n', top - cur)))
-				cur = top;
-			else
-				cur++;
-		}
-		tsize += (long) (cur - data);
-	}
-
-	if (nl && tsize)
-		nl = xdl_mmfile_size(mf) / (tsize / nl);
-
-	return nl + 1;
-}
 
 int xdl_blankline(const char *line, long size, long flags)
 {
diff --git a/xdiff/xutils.h b/xdiff/xutils.h
index 615b4a9d35..d800840dd0 100644
--- a/xdiff/xutils.h
+++ b/xdiff/xutils.h
@@ -31,7 +31,6 @@ int xdl_emit_diffrec(char const *rec, long size, char const *pre, long psize,
 int xdl_cha_init(chastore_t *cha, long isize, long icount);
 void xdl_cha_free(chastore_t *cha);
 void *xdl_cha_alloc(chastore_t *cha);
-long xdl_guess_lines(mmfile_t *mf, long sample);
 int xdl_blankline(const char *line, long size, long flags);
 int xdl_recmatch(const char *l1, long s1, const char *l2, long s2, long flags);
 uint64_t xdl_hash_record_verbatim(uint8_t const **data, uint8_t const *top);
-- 
gitgitgadget



* [PATCH 04/10] xdiff: let patience and histogram benefit from xdl_trim_ends()
  2026-01-02 18:52 [PATCH 00/10] Xdiff cleanup part 3 Ezekiel Newren via GitGitGadget
                   ` (2 preceding siblings ...)
  2026-01-02 18:52 ` [PATCH 03/10] xdiff: don't waste time guessing the number of lines Ezekiel Newren via GitGitGadget
@ 2026-01-02 18:52 ` Ezekiel Newren via GitGitGadget
  2026-01-20 15:02   ` Phillip Wood
  2026-01-02 18:52 ` [PATCH 05/10] xdiff: use xdfenv_t in xdl_trim_ends() and xdl_cleanup_records() Ezekiel Newren via GitGitGadget
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 78+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2026-01-02 18:52 UTC (permalink / raw)
  To: git; +Cc: Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

The patience diff is set up the exact same way as histogram, see
xdl_do_histogram_diff() in xhistogram.c. xdl_optimize_ctxs() is
redundant now, so delete it.
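
To make the index arithmetic concrete, here is a small sketch of how an
inclusive, 0-based window (dstart, dend) maps to a 1-based first line and a
line count. It assumes that patience_diff() takes a first line and a count
for each side, as the call being replaced (1, nrec) suggests; struct
demo_range and demo_window_to_range() are hypothetical names used only for
this example.

```c
#include <assert.h>

/* Hypothetical pair: 1-based first line and number of lines. */
struct demo_range {
	long line1;
	long count;
};

/* Convert an inclusive 0-based window [dstart, dend] to (line1, count). */
struct demo_range demo_window_to_range(long dstart, long dend)
{
	struct demo_range r;

	r.line1 = dstart + 1;        /* 0-based index -> 1-based line */
	r.count = dend - dstart + 1; /* inclusive bounds, so add one */
	return r;
}
```

For an untrimmed file of 10 lines the window is (0, 9), which gives line 1
and a count of 10, matching the old call. An empty window, where dend is
dstart - 1, gives a count of 0.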

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xpatience.c |  4 +++-
 xdiff/xprepare.c  | 14 ++------------
 2 files changed, 5 insertions(+), 13 deletions(-)

diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
index 9580d18032..2bce07cf48 100644
--- a/xdiff/xpatience.c
+++ b/xdiff/xpatience.c
@@ -373,5 +373,7 @@ static int patience_diff(xpparam_t const *xpp, xdfenv_t *env,
 
 int xdl_do_patience_diff(xpparam_t const *xpp, xdfenv_t *env)
 {
-	return patience_diff(xpp, env, 1, (int)env->xdf1.nrec, 1, (int)env->xdf2.nrec);
+	return patience_diff(xpp, env,
+		env->xdf1.dstart + 1, env->xdf1.dend - env->xdf1.dstart + 1,
+		env->xdf2.dstart + 1, env->xdf2.dend - env->xdf2.dstart + 1);
 }
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 96a32cc5e9..0d7d9f6146 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -366,17 +366,6 @@ static int xdl_trim_ends(xdfile_t *xdf1, xdfile_t *xdf2) {
 }
 
 
-static int xdl_optimize_ctxs(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xdf2) {
-
-	if (xdl_trim_ends(xdf1, xdf2) < 0 ||
-	    xdl_cleanup_records(cf, xdf1, xdf2) < 0) {
-
-		return -1;
-	}
-
-	return 0;
-}
-
 int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
 		    xdfenv_t *xe) {
 	xdlclassifier_t cf;
@@ -404,9 +393,10 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
 		xdl_classify_record(2, &cf, rec);
 	}
 
+	xdl_trim_ends(&xe->xdf1, &xe->xdf2);
 	if ((XDF_DIFF_ALG(xpp->flags) != XDF_PATIENCE_DIFF) &&
 	    (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF) &&
-	    xdl_optimize_ctxs(&cf, &xe->xdf1, &xe->xdf2) < 0) {
+	    xdl_cleanup_records(&cf, &xe->xdf1, &xe->xdf2) < 0) {
 
 		xdl_free_ctx(&xe->xdf2);
 		xdl_free_ctx(&xe->xdf1);
-- 
gitgitgadget



* [PATCH 05/10] xdiff: use xdfenv_t in xdl_trim_ends() and xdl_cleanup_records()
  2026-01-02 18:52 [PATCH 00/10] Xdiff cleanup part 3 Ezekiel Newren via GitGitGadget
                   ` (3 preceding siblings ...)
  2026-01-02 18:52 ` [PATCH 04/10] xdiff: let patience and histogram benefit from xdl_trim_ends() Ezekiel Newren via GitGitGadget
@ 2026-01-02 18:52 ` Ezekiel Newren via GitGitGadget
  2026-01-20 16:32   ` Phillip Wood
  2026-01-02 18:52 ` [PATCH 06/10] xdiff: cleanup xdl_trim_ends() Ezekiel Newren via GitGitGadget
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 78+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2026-01-02 18:52 UTC (permalink / raw)
  To: git; +Cc: Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

View with --color-words. Prepare these functions to use the fields:
delta_start, delta_end. A future patch will add these fields to
xdfenv_t.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xprepare.c | 60 ++++++++++++++++++++++++------------------------
 1 file changed, 30 insertions(+), 30 deletions(-)

diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 0d7d9f6146..0acb3437d4 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -261,7 +261,7 @@ static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
  * matches on the other file. Also, lines that have multiple matches
  * might be potentially discarded if they appear in a run of discardable.
  */
-static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xdf2) {
+static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
 	long i, nm, mlim;
 	xrecord_t *recs;
 	xdlclass_t *rcrec;
@@ -273,11 +273,11 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	 * Create temporary arrays that will help us decide if
 	 * changed[i] should remain false, or become true.
 	 */
-	if (!XDL_CALLOC_ARRAY(action1, xdf1->nrec + 1)) {
+	if (!XDL_CALLOC_ARRAY(action1, xe->xdf1.nrec + 1)) {
 		ret = -1;
 		goto cleanup;
 	}
-	if (!XDL_CALLOC_ARRAY(action2, xdf2->nrec + 1)) {
+	if (!XDL_CALLOC_ARRAY(action2, xe->xdf2.nrec + 1)) {
 		ret = -1;
 		goto cleanup;
 	}
@@ -285,17 +285,17 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	/*
 	 * Initialize temporary arrays with DISCARD, KEEP, or INVESTIGATE.
 	 */
-	if ((mlim = xdl_bogosqrt((long)xdf1->nrec)) > XDL_MAX_EQLIMIT)
+	if ((mlim = xdl_bogosqrt((long)xe->xdf1.nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
-	for (i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart]; i <= xdf1->dend; i++, recs++) {
+	for (i = xe->xdf1.dstart, recs = &xe->xdf1.recs[xe->xdf1.dstart]; i <= xe->xdf1.dend; i++, recs++) {
 		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
 		nm = rcrec ? rcrec->len2 : 0;
 		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
 	}
 
-	if ((mlim = xdl_bogosqrt((long)xdf2->nrec)) > XDL_MAX_EQLIMIT)
+	if ((mlim = xdl_bogosqrt((long)xe->xdf2.nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
-	for (i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart]; i <= xdf2->dend; i++, recs++) {
+	for (i = xe->xdf2.dstart, recs = &xe->xdf2.recs[xe->xdf2.dstart]; i <= xe->xdf2.dend; i++, recs++) {
 		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
 		nm = rcrec ? rcrec->len1 : 0;
 		action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
@@ -305,27 +305,27 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	 * Use temporary arrays to decide if changed[i] should remain
 	 * false, or become true.
 	 */
-	xdf1->nreff = 0;
-	for (i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart];
-	     i <= xdf1->dend; i++, recs++) {
+	xe->xdf1.nreff = 0;
+	for (i = xe->xdf1.dstart, recs = &xe->xdf1.recs[xe->xdf1.dstart];
+	     i <= xe->xdf1.dend; i++, recs++) {
 		if (action1[i] == KEEP ||
-		    (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xdf1->dstart, xdf1->dend))) {
-			xdf1->reference_index[xdf1->nreff++] = i;
+		    (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xe->xdf1.dstart, xe->xdf1.dend))) {
+			xe->xdf1.reference_index[xe->xdf1.nreff++] = i;
 			/* changed[i] remains false, i.e. keep */
 		} else
-			xdf1->changed[i] = true;
+			xe->xdf1.changed[i] = true;
 			/* i.e. discard */
 	}
 
-	xdf2->nreff = 0;
-	for (i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart];
-	     i <= xdf2->dend; i++, recs++) {
+	xe->xdf2.nreff = 0;
+	for (i = xe->xdf2.dstart, recs = &xe->xdf2.recs[xe->xdf2.dstart];
+	     i <= xe->xdf2.dend; i++, recs++) {
 		if (action2[i] == KEEP ||
-		    (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xdf2->dstart, xdf2->dend))) {
-			xdf2->reference_index[xdf2->nreff++] = i;
+		    (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xe->xdf2.dstart, xe->xdf2.dend))) {
+			xe->xdf2.reference_index[xe->xdf2.nreff++] = i;
 			/* changed[i] remains false, i.e. keep */
 		} else
-			xdf2->changed[i] = true;
+			xe->xdf2.changed[i] = true;
 			/* i.e. discard */
 	}
 
@@ -340,27 +340,27 @@ cleanup:
 /*
  * Early trim initial and terminal matching records.
  */
-static int xdl_trim_ends(xdfile_t *xdf1, xdfile_t *xdf2) {
+static int xdl_trim_ends(xdfenv_t *xe) {
 	long i, lim;
 	xrecord_t *recs1, *recs2;
 
-	recs1 = xdf1->recs;
-	recs2 = xdf2->recs;
-	for (i = 0, lim = (long)XDL_MIN(xdf1->nrec, xdf2->nrec); i < lim;
+	recs1 = xe->xdf1.recs;
+	recs2 = xe->xdf2.recs;
+	for (i = 0, lim = (long)XDL_MIN(xe->xdf1.nrec, xe->xdf2.nrec); i < lim;
 	     i++, recs1++, recs2++)
 		if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
 			break;
 
-	xdf1->dstart = xdf2->dstart = i;
+	xe->xdf1.dstart = xe->xdf2.dstart = i;
 
-	recs1 = xdf1->recs + xdf1->nrec - 1;
-	recs2 = xdf2->recs + xdf2->nrec - 1;
+	recs1 = xe->xdf1.recs + xe->xdf1.nrec - 1;
+	recs2 = xe->xdf2.recs + xe->xdf2.nrec - 1;
 	for (lim -= i, i = 0; i < lim; i++, recs1--, recs2--)
 		if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
 			break;
 
-	xdf1->dend = (long)xdf1->nrec - i - 1;
-	xdf2->dend = (long)xdf2->nrec - i - 1;
+	xe->xdf1.dend = (long)xe->xdf1.nrec - i - 1;
+	xe->xdf2.dend = (long)xe->xdf2.nrec - i - 1;
 
 	return 0;
 }
@@ -393,10 +393,10 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
 		xdl_classify_record(2, &cf, rec);
 	}
 
-	xdl_trim_ends(&xe->xdf1, &xe->xdf2);
+	xdl_trim_ends(xe);
 	if ((XDF_DIFF_ALG(xpp->flags) != XDF_PATIENCE_DIFF) &&
 	    (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF) &&
-	    xdl_cleanup_records(&cf, &xe->xdf1, &xe->xdf2) < 0) {
+	    xdl_cleanup_records(&cf, xe) < 0) {
 
 		xdl_free_ctx(&xe->xdf2);
 		xdl_free_ctx(&xe->xdf1);
-- 
gitgitgadget



* [PATCH 06/10] xdiff: cleanup xdl_trim_ends()
  2026-01-02 18:52 [PATCH 00/10] Xdiff cleanup part 3 Ezekiel Newren via GitGitGadget
                   ` (4 preceding siblings ...)
  2026-01-02 18:52 ` [PATCH 05/10] xdiff: use xdfenv_t in xdl_trim_ends() and xdl_cleanup_records() Ezekiel Newren via GitGitGadget
@ 2026-01-02 18:52 ` Ezekiel Newren via GitGitGadget
  2026-01-20 16:32   ` Phillip Wood
  2026-01-02 18:52 ` [PATCH 07/10] xdiff: replace xdfile_t.dstart with xdfenv_t.delta_start Ezekiel Newren via GitGitGadget
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 78+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2026-01-02 18:52 UTC (permalink / raw)
  To: git; +Cc: Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

This patch is best viewed as a before and after of the whole
function.

Rather than walking two pointers, use direct indexing with local
variables for the values being compared, which makes the function
easier to follow.
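
For readers who want the before/after idea in isolation, here is a
self-contained sketch of trimming a common prefix and suffix with direct
indexing. It is an illustration only, not the function in this patch: it
works on plain arrays of size_t hash values, struct demo_window and
demo_trim_ends() are hypothetical names, and it reports a half-open window
(end is one past the last differing index) so that an empty window can be
expressed with unsigned indices.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result: the window of lines that still need diffing. */
struct demo_window {
	size_t start; /* first differing index, shared by both sides */
	size_t end1;  /* one past the last differing index in a */
	size_t end2;  /* one past the last differing index in b */
};

struct demo_window demo_trim_ends(const size_t *a, size_t na,
				  const size_t *b, size_t nb)
{
	struct demo_window w;
	size_t lim = na < nb ? na : nb;
	size_t prefix = 0, suffix = 0;

	/* Skip leading entries that are equal on both sides. */
	while (prefix < lim && a[prefix] == b[prefix])
		prefix++;

	/* The suffix scan must not run back into the matched prefix. */
	lim -= prefix;
	while (suffix < lim && a[na - 1 - suffix] == b[nb - 1 - suffix])
		suffix++;

	/* Results are assigned unconditionally, even if nothing differs. */
	w.start = prefix;
	w.end1 = na - suffix;
	w.end2 = nb - suffix;
	return w;
}
```

Assigning the window after the loops, rather than inside a mismatch branch,
means identical inputs still produce a well-defined (empty) window.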

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xprepare.c | 40 ++++++++++++++++++++--------------------
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 0acb3437d4..06b6a6f804 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -340,29 +340,29 @@ cleanup:
 /*
  * Early trim initial and terminal matching records.
  */
-static int xdl_trim_ends(xdfenv_t *xe) {
-	long i, lim;
-	xrecord_t *recs1, *recs2;
-
-	recs1 = xe->xdf1.recs;
-	recs2 = xe->xdf2.recs;
-	for (i = 0, lim = (long)XDL_MIN(xe->xdf1.nrec, xe->xdf2.nrec); i < lim;
-	     i++, recs1++, recs2++)
-		if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
+static void xdl_trim_ends(xdfenv_t *xe)
+{
+	size_t lim = XDL_MIN(xe->xdf1.nrec, xe->xdf2.nrec);
+
+	for (size_t i = 0; i < lim; i++) {
+		size_t mph1 = xe->xdf1.recs[i].minimal_perfect_hash;
+		size_t mph2 = xe->xdf2.recs[i].minimal_perfect_hash;
+		if (mph1 != mph2) {
+			xe->xdf1.dstart = xe->xdf2.dstart = (ssize_t)i;
+			lim -= i;
 			break;
+		}
+	}
 
-	xe->xdf1.dstart = xe->xdf2.dstart = i;
-
-	recs1 = xe->xdf1.recs + xe->xdf1.nrec - 1;
-	recs2 = xe->xdf2.recs + xe->xdf2.nrec - 1;
-	for (lim -= i, i = 0; i < lim; i++, recs1--, recs2--)
-		if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
+	for (size_t i = 0; i < lim; i++) {
+		size_t mph1 = xe->xdf1.recs[xe->xdf1.nrec - 1 - i].minimal_perfect_hash;
+		size_t mph2 = xe->xdf2.recs[xe->xdf2.nrec - 1 - i].minimal_perfect_hash;
+		if (mph1 != mph2) {
+			xe->xdf1.dend = xe->xdf1.nrec - 1 - i;
+			xe->xdf2.dend = xe->xdf2.nrec - 1 - i;
 			break;
-
-	xe->xdf1.dend = (long)xe->xdf1.nrec - i - 1;
-	xe->xdf2.dend = (long)xe->xdf2.nrec - i - 1;
-
-	return 0;
+		}
+	}
 }
 
 
-- 
gitgitgadget



* [PATCH 07/10] xdiff: replace xdfile_t.dstart with xdfenv_t.delta_start
  2026-01-02 18:52 [PATCH 00/10] Xdiff cleanup part 3 Ezekiel Newren via GitGitGadget
                   ` (5 preceding siblings ...)
  2026-01-02 18:52 ` [PATCH 06/10] xdiff: cleanup xdl_trim_ends() Ezekiel Newren via GitGitGadget
@ 2026-01-02 18:52 ` Ezekiel Newren via GitGitGadget
  2026-01-20 16:32   ` Phillip Wood
  2026-01-02 18:52 ` [PATCH 08/10] xdiff: replace xdfile_t.dend with xdfenv_t.delta_end Ezekiel Newren via GitGitGadget
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 78+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2026-01-02 18:52 UTC (permalink / raw)
  To: git; +Cc: Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

Placing delta_start in xdfenv_t instead of xdfile_t provides a more
appropriate context since this variable only makes sense with a pair
of files. View with --color-words.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xhistogram.c |  4 ++--
 xdiff/xpatience.c  |  4 ++--
 xdiff/xprepare.c   | 17 +++++++++--------
 xdiff/xtypes.h     |  3 ++-
 4 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/xdiff/xhistogram.c b/xdiff/xhistogram.c
index 5ae1282c27..eb6a52d9ba 100644
--- a/xdiff/xhistogram.c
+++ b/xdiff/xhistogram.c
@@ -365,6 +365,6 @@ out:
 int xdl_do_histogram_diff(xpparam_t const *xpp, xdfenv_t *env)
 {
 	return histogram_diff(xpp, env,
-		env->xdf1.dstart + 1, env->xdf1.dend - env->xdf1.dstart + 1,
-		env->xdf2.dstart + 1, env->xdf2.dend - env->xdf2.dstart + 1);
+		env->delta_start + 1, env->xdf1.dend - env->delta_start + 1,
+		env->delta_start + 1, env->xdf2.dend - env->delta_start + 1);
 }
diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
index 2bce07cf48..bd0ffbb417 100644
--- a/xdiff/xpatience.c
+++ b/xdiff/xpatience.c
@@ -374,6 +374,6 @@ static int patience_diff(xpparam_t const *xpp, xdfenv_t *env,
 int xdl_do_patience_diff(xpparam_t const *xpp, xdfenv_t *env)
 {
 	return patience_diff(xpp, env,
-		env->xdf1.dstart + 1, env->xdf1.dend - env->xdf1.dstart + 1,
-		env->xdf2.dstart + 1, env->xdf2.dend - env->xdf2.dstart + 1);
+		env->delta_start + 1, env->xdf1.dend - env->delta_start + 1,
+		env->delta_start + 1, env->xdf2.dend - env->delta_start + 1);
 }
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 06b6a6f804..e88468e74c 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -173,7 +173,6 @@ static int xdl_prepare_ctx(mmfile_t *mf, xdfile_t *xdf, uint64_t flags) {
 
 	xdf->changed += 1;
 	xdf->nreff = 0;
-	xdf->dstart = 0;
 	xdf->dend = xdf->nrec - 1;
 
 	return 0;
@@ -287,7 +286,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
 	 */
 	if ((mlim = xdl_bogosqrt((long)xe->xdf1.nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
-	for (i = xe->xdf1.dstart, recs = &xe->xdf1.recs[xe->xdf1.dstart]; i <= xe->xdf1.dend; i++, recs++) {
+	for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start]; i <= xe->xdf1.dend; i++, recs++) {
 		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
 		nm = rcrec ? rcrec->len2 : 0;
 		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
@@ -295,7 +294,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
 
 	if ((mlim = xdl_bogosqrt((long)xe->xdf2.nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
-	for (i = xe->xdf2.dstart, recs = &xe->xdf2.recs[xe->xdf2.dstart]; i <= xe->xdf2.dend; i++, recs++) {
+	for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start]; i <= xe->xdf2.dend; i++, recs++) {
 		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
 		nm = rcrec ? rcrec->len1 : 0;
 		action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
@@ -306,10 +305,10 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
 	 * false, or become true.
 	 */
 	xe->xdf1.nreff = 0;
-	for (i = xe->xdf1.dstart, recs = &xe->xdf1.recs[xe->xdf1.dstart];
+	for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start];
 	     i <= xe->xdf1.dend; i++, recs++) {
 		if (action1[i] == KEEP ||
-		    (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xe->xdf1.dstart, xe->xdf1.dend))) {
+		    (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xe->delta_start, xe->xdf1.dend))) {
 			xe->xdf1.reference_index[xe->xdf1.nreff++] = i;
 			/* changed[i] remains false, i.e. keep */
 		} else
@@ -318,10 +317,10 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
 	}
 
 	xe->xdf2.nreff = 0;
-	for (i = xe->xdf2.dstart, recs = &xe->xdf2.recs[xe->xdf2.dstart];
+	for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start];
 	     i <= xe->xdf2.dend; i++, recs++) {
 		if (action2[i] == KEEP ||
-		    (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xe->xdf2.dstart, xe->xdf2.dend))) {
+		    (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xe->delta_start, xe->xdf2.dend))) {
 			xe->xdf2.reference_index[xe->xdf2.nreff++] = i;
 			/* changed[i] remains false, i.e. keep */
 		} else
@@ -348,7 +347,7 @@ static void xdl_trim_ends(xdfenv_t *xe)
 		size_t mph1 = xe->xdf1.recs[i].minimal_perfect_hash;
 		size_t mph2 = xe->xdf2.recs[i].minimal_perfect_hash;
 		if (mph1 != mph2) {
-			xe->xdf1.dstart = xe->xdf2.dstart = (ssize_t)i;
+			xe->delta_start = (ssize_t)i;
 			lim -= i;
 			break;
 		}
@@ -370,6 +369,8 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
 		    xdfenv_t *xe) {
 	xdlclassifier_t cf;
 
+	xe->delta_start = 0;
+
 	if (xdl_prepare_ctx(mf1, &xe->xdf1, xpp->flags) < 0) {
 
 		return -1;
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index 979586f20a..bda1f85eb0 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -48,7 +48,7 @@ typedef struct s_xrecord {
 typedef struct s_xdfile {
 	xrecord_t *recs;
 	size_t nrec;
-	ptrdiff_t dstart, dend;
+	ptrdiff_t dend;
 	bool *changed;
 	size_t *reference_index;
 	size_t nreff;
@@ -56,6 +56,7 @@ typedef struct s_xdfile {
 
 typedef struct s_xdfenv {
 	xdfile_t xdf1, xdf2;
+	size_t delta_start;
 } xdfenv_t;
 
 
-- 
gitgitgadget



* [PATCH 08/10] xdiff: replace xdfile_t.dend with xdfenv_t.delta_end
  2026-01-02 18:52 [PATCH 00/10] Xdiff cleanup part 3 Ezekiel Newren via GitGitGadget
                   ` (6 preceding siblings ...)
  2026-01-02 18:52 ` [PATCH 07/10] xdiff: replace xdfile_t.dstart with xdfenv_t.delta_start Ezekiel Newren via GitGitGadget
@ 2026-01-02 18:52 ` Ezekiel Newren via GitGitGadget
  2026-01-02 18:52 ` [PATCH 09/10] xdiff: remove dependence on xdlclassifier from xdl_cleanup_records() Ezekiel Newren via GitGitGadget
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 78+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2026-01-02 18:52 UTC (permalink / raw)
  To: git; +Cc: Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

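View with --color-words. The rationale is the same as for delta_start:
the value only makes sense for a pair of files. Unlike dend, delta_end
counts matching records from the end of the files, so the per-file end
index is recomputed as nrec - 1 - delta_end where it is needed.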
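
The conversion from a shared delta_end to a per-file end index, as the
diff below computes it (nrec - 1 - delta_end), can be sketched like
this. The helper name last_changed_index() is hypothetical, and the
sketch casts to ptrdiff_t before subtracting so that an empty range
yields -1 without relying on unsigned wraparound; the patch itself
does the subtraction in size_t.

```c
#include <stddef.h>

/*
 * Index of the last record that is not part of the common suffix.
 * Returns -1 when every record belongs to the suffix.
 */
static ptrdiff_t last_changed_index(size_t nrec, size_t delta_end)
{
	return (ptrdiff_t)nrec - 1 - (ptrdiff_t)delta_end;
}
```

With files of 5 and 4 records and delta_end == 1, this gives end
indices 3 and 2 respectively.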

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xhistogram.c |  7 +++++--
 xdiff/xpatience.c  |  7 +++++--
 xdiff/xprepare.c   | 19 ++++++++++---------
 xdiff/xtypes.h     |  3 +--
 4 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/xdiff/xhistogram.c b/xdiff/xhistogram.c
index eb6a52d9ba..b4d6f88748 100644
--- a/xdiff/xhistogram.c
+++ b/xdiff/xhistogram.c
@@ -364,7 +364,10 @@ out:
 
 int xdl_do_histogram_diff(xpparam_t const *xpp, xdfenv_t *env)
 {
+	ptrdiff_t dend1 = env->xdf1.nrec - 1 - env->delta_end;
+	ptrdiff_t dend2 = env->xdf2.nrec - 1 - env->delta_end;
+
 	return histogram_diff(xpp, env,
-		env->delta_start + 1, env->xdf1.dend - env->delta_start + 1,
-		env->delta_start + 1, env->xdf2.dend - env->delta_start + 1);
+		env->delta_start + 1, dend1 - env->delta_start + 1,
+		env->delta_start + 1, dend2 - env->delta_start + 1);
 }
diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
index bd0ffbb417..5b8bb34d2b 100644
--- a/xdiff/xpatience.c
+++ b/xdiff/xpatience.c
@@ -373,7 +373,10 @@ static int patience_diff(xpparam_t const *xpp, xdfenv_t *env,
 
 int xdl_do_patience_diff(xpparam_t const *xpp, xdfenv_t *env)
 {
+	ptrdiff_t dend1 = env->xdf1.nrec - 1 - env->delta_end;
+	ptrdiff_t dend2 = env->xdf2.nrec - 1 - env->delta_end;
+
 	return patience_diff(xpp, env,
-		env->delta_start + 1, env->xdf1.dend - env->delta_start + 1,
-		env->delta_start + 1, env->xdf2.dend - env->delta_start + 1);
+		env->delta_start + 1, dend1 - env->delta_start + 1,
+		env->delta_start + 1, dend2 - env->delta_start + 1);
 }
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index e88468e74c..d3cdb6ac02 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -173,7 +173,6 @@ static int xdl_prepare_ctx(mmfile_t *mf, xdfile_t *xdf, uint64_t flags) {
 
 	xdf->changed += 1;
 	xdf->nreff = 0;
-	xdf->dend = xdf->nrec - 1;
 
 	return 0;
 
@@ -267,6 +266,8 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
 	uint8_t *action1 = NULL, *action2 = NULL;
 	bool need_min = !!(cf->flags & XDF_NEED_MINIMAL);
 	int ret = 0;
+	ptrdiff_t dend1 = xe->xdf1.nrec - 1 - xe->delta_end;
+	ptrdiff_t dend2 = xe->xdf2.nrec - 1 - xe->delta_end;
 
 	/*
 	 * Create temporary arrays that will help us decide if
@@ -286,7 +287,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
 	 */
 	if ((mlim = xdl_bogosqrt((long)xe->xdf1.nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
-	for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start]; i <= xe->xdf1.dend; i++, recs++) {
+	for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start]; i <= dend1; i++, recs++) {
 		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
 		nm = rcrec ? rcrec->len2 : 0;
 		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
@@ -294,7 +295,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
 
 	if ((mlim = xdl_bogosqrt((long)xe->xdf2.nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
-	for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start]; i <= xe->xdf2.dend; i++, recs++) {
+	for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start]; i <= dend2; i++, recs++) {
 		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
 		nm = rcrec ? rcrec->len1 : 0;
 		action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
@@ -306,9 +307,9 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
 	 */
 	xe->xdf1.nreff = 0;
 	for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start];
-	     i <= xe->xdf1.dend; i++, recs++) {
+	     i <= dend1; i++, recs++) {
 		if (action1[i] == KEEP ||
-		    (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xe->delta_start, xe->xdf1.dend))) {
+		    (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xe->delta_start, dend1))) {
 			xe->xdf1.reference_index[xe->xdf1.nreff++] = i;
 			/* changed[i] remains false, i.e. keep */
 		} else
@@ -318,9 +319,9 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
 
 	xe->xdf2.nreff = 0;
 	for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start];
-	     i <= xe->xdf2.dend; i++, recs++) {
+	     i <= dend2; i++, recs++) {
 		if (action2[i] == KEEP ||
-		    (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xe->delta_start, xe->xdf2.dend))) {
+		    (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xe->delta_start, dend2))) {
 			xe->xdf2.reference_index[xe->xdf2.nreff++] = i;
 			/* changed[i] remains false, i.e. keep */
 		} else
@@ -357,8 +358,7 @@ static void xdl_trim_ends(xdfenv_t *xe)
 		size_t mph1 = xe->xdf1.recs[xe->xdf1.nrec - 1 - i].minimal_perfect_hash;
 		size_t mph2 = xe->xdf2.recs[xe->xdf2.nrec - 1 - i].minimal_perfect_hash;
 		if (mph1 != mph2) {
-			xe->xdf1.dend = xe->xdf1.nrec - 1 - i;
-			xe->xdf2.dend = xe->xdf2.nrec - 1 - i;
+			xe->delta_end = i;
 			break;
 		}
 	}
@@ -370,6 +370,7 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
 	xdlclassifier_t cf;
 
 	xe->delta_start = 0;
+	xe->delta_end = 0;
 
 	if (xdl_prepare_ctx(mf1, &xe->xdf1, xpp->flags) < 0) {
 
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index bda1f85eb0..a939396064 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -48,7 +48,6 @@ typedef struct s_xrecord {
 typedef struct s_xdfile {
 	xrecord_t *recs;
 	size_t nrec;
-	ptrdiff_t dend;
 	bool *changed;
 	size_t *reference_index;
 	size_t nreff;
@@ -56,7 +55,7 @@ typedef struct s_xdfile {
 
 typedef struct s_xdfenv {
 	xdfile_t xdf1, xdf2;
-	size_t delta_start;
+	size_t delta_start, delta_end;
 } xdfenv_t;
 
 
-- 
gitgitgadget



* [PATCH 09/10] xdiff: remove dependence on xdlclassifier from xdl_cleanup_records()
  2026-01-02 18:52 [PATCH 00/10] Xdiff cleanup part 3 Ezekiel Newren via GitGitGadget
                   ` (7 preceding siblings ...)
  2026-01-02 18:52 ` [PATCH 08/10] xdiff: replace xdfile_t.dend with xdfenv_t.delta_end Ezekiel Newren via GitGitGadget
@ 2026-01-02 18:52 ` Ezekiel Newren via GitGitGadget
  2026-01-16 20:19   ` René Scharfe
  2026-01-21 15:01   ` Phillip Wood
  2026-01-02 18:52 ` [PATCH 10/10] xdiff: move xdl_cleanup_records() from xprepare.c to xdiffi.c Ezekiel Newren via GitGitGadget
                   ` (5 subsequent siblings)
  14 siblings, 2 replies; 78+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2026-01-02 18:52 UTC (permalink / raw)
  To: git; +Cc: Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

Disentangle xdl_cleanup_records() from the classifier so that it can be
moved from xprepare.c into xdiffi.c.

The classic diff is the only algorithm that needs to know how many
times each line occurs in each file. Have xdl_cleanup_records() do
that counting instead of the classifier, so that patience and
histogram no longer pay for it.
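
A minimal sketch of the counting step, assuming every record already
carries a hash in the range [0, nhash): struct occurrence and
count_occurrences() are hypothetical names, and plain calloc() stands
in for the ivec used by the patch.

```c
#include <stddef.h>
#include <stdlib.h>

/* Per-hash occurrence counts; hypothetical stand-in for the patch's struct. */
struct occurrence {
	size_t file1, file2;
};

/*
 * Count how often each hash value in [0, nhash) appears in each file.
 * Returns a calloc()ed array of nhash entries (at least one), or NULL
 * on allocation failure. The caller frees the result.
 */
static struct occurrence *count_occurrences(const size_t *h1, size_t n1,
					    const size_t *h2, size_t n2,
					    size_t nhash)
{
	struct occurrence *occ = calloc(nhash ? nhash : 1, sizeof(*occ));

	if (!occ)
		return NULL;
	for (size_t i = 0; i < n1; i++)
		occ[h1[i]].file1++;
	for (size_t i = 0; i < n2; i++)
		occ[h2[i]].file2++;
	return occ;
}
```

A record of file 1 whose hash has file2 == 0 has no match in the other
file, which is the case the diff below maps to DISCARD.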

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xprepare.c | 52 +++++++++++++++++++++++++++++++++---------------
 xdiff/xtypes.h   |  1 +
 2 files changed, 37 insertions(+), 16 deletions(-)

diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index d3cdb6ac02..b53a3b80c4 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -21,6 +21,7 @@
  */
 
 #include "xinclude.h"
+#include "compat/ivec.h"
 
 
 #define XDL_KPDIS_RUN 4
@@ -35,7 +36,6 @@ typedef struct s_xdlclass {
 	struct s_xdlclass *next;
 	xrecord_t rec;
 	long idx;
-	long len1, len2;
 } xdlclass_t;
 
 typedef struct s_xdlclassifier {
@@ -92,7 +92,7 @@ static void xdl_free_classifier(xdlclassifier_t *cf) {
 }
 
 
-static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t *rec) {
+static int xdl_classify_record(xdlclassifier_t *cf, xrecord_t *rec) {
 	size_t hi;
 	xdlclass_t *rcrec;
 
@@ -113,13 +113,10 @@ static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
 				return -1;
 		cf->rcrecs[rcrec->idx] = rcrec;
 		rcrec->rec = *rec;
-		rcrec->len1 = rcrec->len2 = 0;
 		rcrec->next = cf->rchash[hi];
 		cf->rchash[hi] = rcrec;
 	}
 
-	(pass == 1) ? rcrec->len1++ : rcrec->len2++;
-
 	rec->minimal_perfect_hash = (size_t)rcrec->idx;
 
 	return 0;
@@ -253,22 +250,44 @@ static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
 	return rpdis1 * XDL_KPDIS_RUN < (rpdis1 + rdis1);
 }
 
+struct xoccurrence
+{
+	size_t file1, file2;
+};
+
+
+DEFINE_IVEC_TYPE(struct xoccurrence, xoccurrence);
+
 
 /*
  * Try to reduce the problem complexity, discard records that have no
  * matches on the other file. Also, lines that have multiple matches
  * might be potentially discarded if they appear in a run of discardable.
  */
-static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
-	long i, nm, mlim;
+static int xdl_cleanup_records(xdfenv_t *xe, uint64_t flags) {
+	long i;
+	size_t nm, mlim;
 	xrecord_t *recs;
-	xdlclass_t *rcrec;
 	uint8_t *action1 = NULL, *action2 = NULL;
-	bool need_min = !!(cf->flags & XDF_NEED_MINIMAL);
+	struct IVec_xoccurrence occ;
+	bool need_min = !!(flags & XDF_NEED_MINIMAL);
 	int ret = 0;
 	ptrdiff_t dend1 = xe->xdf1.nrec - 1 - xe->delta_end;
 	ptrdiff_t dend2 = xe->xdf2.nrec - 1 - xe->delta_end;
 
+	IVEC_INIT(occ);
+	ivec_zero(&occ, xe->mph_size);
+
+	for (size_t j = 0; j < xe->xdf1.nrec; j++) {
+		size_t mph1 = xe->xdf1.recs[j].minimal_perfect_hash;
+		occ.ptr[mph1].file1 += 1;
+	}
+
+	for (size_t j = 0; j < xe->xdf2.nrec; j++) {
+		size_t mph2 = xe->xdf2.recs[j].minimal_perfect_hash;
+		occ.ptr[mph2].file2 += 1;
+	}
+
 	/*
 	 * Create temporary arrays that will help us decide if
 	 * changed[i] should remain false, or become true.
@@ -288,16 +307,14 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
 	if ((mlim = xdl_bogosqrt((long)xe->xdf1.nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
 	for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start]; i <= dend1; i++, recs++) {
-		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
-		nm = rcrec ? rcrec->len2 : 0;
+		nm = occ.ptr[recs->minimal_perfect_hash].file2;
 		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
 	}
 
 	if ((mlim = xdl_bogosqrt((long)xe->xdf2.nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
 	for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start]; i <= dend2; i++, recs++) {
-		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
-		nm = rcrec ? rcrec->len1 : 0;
+		nm = occ.ptr[recs->minimal_perfect_hash].file1;
 		action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
 	}
 
@@ -332,6 +349,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
 cleanup:
 	xdl_free(action1);
 	xdl_free(action2);
+	ivec_free(&occ);
 
 	return ret;
 }
@@ -387,18 +405,20 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
 
 	for (size_t i = 0; i < xe->xdf1.nrec; i++) {
 		xrecord_t *rec = &xe->xdf1.recs[i];
-		xdl_classify_record(1, &cf, rec);
+		xdl_classify_record(&cf, rec);
 	}
 
 	for (size_t i = 0; i < xe->xdf2.nrec; i++) {
 		xrecord_t *rec = &xe->xdf2.recs[i];
-		xdl_classify_record(2, &cf, rec);
+		xdl_classify_record(&cf, rec);
 	}
 
+	xe->mph_size = cf.count;
+
 	xdl_trim_ends(xe);
 	if ((XDF_DIFF_ALG(xpp->flags) != XDF_PATIENCE_DIFF) &&
 	    (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF) &&
-	    xdl_cleanup_records(&cf, xe) < 0) {
+	    xdl_cleanup_records(xe, xpp->flags) < 0) {
 
 		xdl_free_ctx(&xe->xdf2);
 		xdl_free_ctx(&xe->xdf1);
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index a939396064..2528bd37e8 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -56,6 +56,7 @@ typedef struct s_xdfile {
 typedef struct s_xdfenv {
 	xdfile_t xdf1, xdf2;
 	size_t delta_start, delta_end;
+	size_t mph_size;
 } xdfenv_t;
 
 
-- 
gitgitgadget



* [PATCH 10/10] xdiff: move xdl_cleanup_records() from xprepare.c to xdiffi.c
  2026-01-02 18:52 [PATCH 00/10] Xdiff cleanup part 3 Ezekiel Newren via GitGitGadget
                   ` (8 preceding siblings ...)
  2026-01-02 18:52 ` [PATCH 09/10] xdiff: remove dependence on xdlclassifier from xdl_cleanup_records() Ezekiel Newren via GitGitGadget
@ 2026-01-02 18:52 ` Ezekiel Newren via GitGitGadget
  2026-01-21 15:01   ` Phillip Wood
  2026-01-04  2:44 ` [PATCH 00/10] Xdiff cleanup part 3 Junio C Hamano
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 78+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2026-01-02 18:52 UTC (permalink / raw)
  To: git; +Cc: Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

Only the classic diff uses xdl_cleanup_records(). Move it,
xdl_clean_mmatch(), and the macros to xdiffi.c and call
xdl_cleanup_records() inside of xdl_do_classic_diff(). This better
organizes the code related to the classic diff.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xdiffi.c   | 180 ++++++++++++++++++++++++++++++++++++++++++++
 xdiff/xprepare.c | 191 +----------------------------------------------
 2 files changed, 181 insertions(+), 190 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index e3196c7245..0f1fd7cf80 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -21,6 +21,7 @@
  */
 
 #include "xinclude.h"
+#include "compat/ivec.h"
 
 static size_t get_hash(xdfile_t *xdf, long index)
 {
@@ -33,6 +34,14 @@ static size_t get_hash(xdfile_t *xdf, long index)
 #define XDL_SNAKE_CNT 20
 #define XDL_K_HEUR 4
 
+#define XDL_KPDIS_RUN 4
+#define XDL_MAX_EQLIMIT 1024
+#define XDL_SIMSCAN_WINDOW 100
+
+#define DISCARD 0
+#define KEEP 1
+#define INVESTIGATE 2
+
 typedef struct s_xdpsplit {
 	long i1, i2;
 	int min_lo, min_hi;
@@ -311,6 +320,175 @@ int xdl_recs_cmp(xdfile_t *xdf1, long off1, long lim1,
 }
 
 
+static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
+	long r, rdis0, rpdis0, rdis1, rpdis1;
+
+	/*
+	 * Limits the window that is examined during the similar-lines
+	 * scan. The loops below stops when action[i - r] == KEEP
+	 * (line that has no match), but there are corner cases where
+	 * the loop proceed all the way to the extremities by causing
+	 * huge performance penalties in case of big files.
+	 */
+	if (i - s > XDL_SIMSCAN_WINDOW)
+		s = i - XDL_SIMSCAN_WINDOW;
+	if (e - i > XDL_SIMSCAN_WINDOW)
+		e = i + XDL_SIMSCAN_WINDOW;
+
+	/*
+	 * Scans the lines before 'i' to find a run of lines that either
+	 * have no match (action[j] == DISCARD) or have multiple matches
+	 * (action[j] == INVESTIGATE). Note that we always call this
+	 * function with action[i] == INVESTIGATE, so the current line
+	 * (i) is already a multimatch line.
+	 */
+	for (r = 1, rdis0 = 0, rpdis0 = 1; (i - r) >= s; r++) {
+		if (action[i - r] == DISCARD)
+			rdis0++;
+		else if (action[i - r] == INVESTIGATE)
+			rpdis0++;
+		else if (action[i - r] == KEEP)
+			break;
+		else
+			BUG("Illegal value for action[i - r]");
+	}
+	/*
+	 * If the run before the line 'i' found only multimatch lines,
+	 * we return false and hence we don't make the current line (i)
+	 * discarded. We want to discard multimatch lines only when
+	 * they appear in the middle of runs with nomatch lines
+	 * (action[j] == DISCARD).
+	 */
+	if (rdis0 == 0)
+		return 0;
+	for (r = 1, rdis1 = 0, rpdis1 = 1; (i + r) <= e; r++) {
+		if (action[i + r] == DISCARD)
+			rdis1++;
+		else if (action[i + r] == INVESTIGATE)
+			rpdis1++;
+		else if (action[i + r] == KEEP)
+			break;
+		else
+			BUG("Illegal value for action[i + r]");
+	}
+	/*
+	 * If the run after the line 'i' found only multimatch lines,
+	 * we return false and hence we don't make the current line (i)
+	 * discarded.
+	 */
+	if (rdis1 == 0)
+		return false;
+	rdis1 += rdis0;
+	rpdis1 += rpdis0;
+
+	return rpdis1 * XDL_KPDIS_RUN < (rpdis1 + rdis1);
+}
+
+struct xoccurrence
+{
+	size_t file1, file2;
+};
+
+
+DEFINE_IVEC_TYPE(struct xoccurrence, xoccurrence);
+
+
+/*
+ * Try to reduce the problem complexity, discard records that have no
+ * matches on the other file. Also, lines that have multiple matches
+ * might be potentially discarded if they appear in a run of discardable.
+ */
+static int xdl_cleanup_records(xdfenv_t *xe, uint64_t flags) {
+	long i;
+	size_t nm, mlim;
+	xrecord_t *recs;
+	uint8_t *action1 = NULL, *action2 = NULL;
+	struct IVec_xoccurrence occ;
+	bool need_min = !!(flags & XDF_NEED_MINIMAL);
+	int ret = 0;
+	ptrdiff_t dend1 = xe->xdf1.nrec - 1 - xe->delta_end;
+	ptrdiff_t dend2 = xe->xdf2.nrec - 1 - xe->delta_end;
+
+	IVEC_INIT(occ);
+	ivec_zero(&occ, xe->mph_size);
+
+	for (size_t j = 0; j < xe->xdf1.nrec; j++) {
+		size_t mph1 = xe->xdf1.recs[j].minimal_perfect_hash;
+		occ.ptr[mph1].file1 += 1;
+	}
+
+	for (size_t j = 0; j < xe->xdf2.nrec; j++) {
+		size_t mph2 = xe->xdf2.recs[j].minimal_perfect_hash;
+		occ.ptr[mph2].file2 += 1;
+	}
+
+	/*
+	 * Create temporary arrays that will help us decide if
+	 * changed[i] should remain false, or become true.
+	 */
+	if (!XDL_CALLOC_ARRAY(action1, xe->xdf1.nrec + 1)) {
+		ret = -1;
+		goto cleanup;
+	}
+	if (!XDL_CALLOC_ARRAY(action2, xe->xdf2.nrec + 1)) {
+		ret = -1;
+		goto cleanup;
+	}
+
+	/*
+	 * Initialize temporary arrays with DISCARD, KEEP, or INVESTIGATE.
+	 */
+	if ((mlim = xdl_bogosqrt((long)xe->xdf1.nrec)) > XDL_MAX_EQLIMIT)
+		mlim = XDL_MAX_EQLIMIT;
+	for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start]; i <= dend1; i++, recs++) {
+		nm = occ.ptr[recs->minimal_perfect_hash].file2;
+		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
+	}
+
+	if ((mlim = xdl_bogosqrt((long)xe->xdf2.nrec)) > XDL_MAX_EQLIMIT)
+		mlim = XDL_MAX_EQLIMIT;
+	for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start]; i <= dend2; i++, recs++) {
+		nm = occ.ptr[recs->minimal_perfect_hash].file1;
+		action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
+	}
+
+	/*
+	 * Use temporary arrays to decide if changed[i] should remain
+	 * false, or become true.
+	 */
+	xe->xdf1.nreff = 0;
+	for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start];
+	     i <= dend1; i++, recs++) {
+		if (action1[i] == KEEP ||
+		    (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xe->delta_start, dend1))) {
+			xe->xdf1.reference_index[xe->xdf1.nreff++] = i;
+			/* changed[i] remains false, i.e. keep */
+		} else
+			xe->xdf1.changed[i] = true;
+			/* i.e. discard */
+	}
+
+	xe->xdf2.nreff = 0;
+	for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start];
+	     i <= dend2; i++, recs++) {
+		if (action2[i] == KEEP ||
+		    (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xe->delta_start, dend2))) {
+			xe->xdf2.reference_index[xe->xdf2.nreff++] = i;
+			/* changed[i] remains false, i.e. keep */
+		} else
+			xe->xdf2.changed[i] = true;
+			/* i.e. discard */
+	}
+
+cleanup:
+	xdl_free(action1);
+	xdl_free(action2);
+	ivec_free(&occ);
+
+	return ret;
+}
+
+
 int xdl_do_classic_diff(xdfenv_t *xe, uint64_t flags)
 {
 	long ndiags;
@@ -318,6 +496,8 @@ int xdl_do_classic_diff(xdfenv_t *xe, uint64_t flags)
 	xdalgoenv_t xenv;
 	int res;
 
+	xdl_cleanup_records(xe, flags);
+
 	/*
 	 * Allocate and setup K vectors to be used by the differential
 	 * algorithm.
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index b53a3b80c4..3f555e29f4 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -24,14 +24,6 @@
 #include "compat/ivec.h"
 
 
-#define XDL_KPDIS_RUN 4
-#define XDL_MAX_EQLIMIT 1024
-#define XDL_SIMSCAN_WINDOW 100
-
-#define DISCARD 0
-#define KEEP 1
-#define INVESTIGATE 2
-
 typedef struct s_xdlclass {
 	struct s_xdlclass *next;
 	xrecord_t rec;
@@ -50,8 +42,6 @@ typedef struct s_xdlclassifier {
 } xdlclassifier_t;
 
 
-
-
 static int xdl_init_classifier(xdlclassifier_t *cf, long size, long flags) {
 	memset(cf, 0, sizeof(xdlclassifier_t));
 
@@ -186,175 +176,6 @@ void xdl_free_env(xdfenv_t *xe) {
 }
 
 
-static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
-	long r, rdis0, rpdis0, rdis1, rpdis1;
-
-	/*
-	 * Limits the window that is examined during the similar-lines
-	 * scan. The loops below stops when action[i - r] == KEEP
-	 * (line that has no match), but there are corner cases where
-	 * the loop proceed all the way to the extremities by causing
-	 * huge performance penalties in case of big files.
-	 */
-	if (i - s > XDL_SIMSCAN_WINDOW)
-		s = i - XDL_SIMSCAN_WINDOW;
-	if (e - i > XDL_SIMSCAN_WINDOW)
-		e = i + XDL_SIMSCAN_WINDOW;
-
-	/*
-	 * Scans the lines before 'i' to find a run of lines that either
-	 * have no match (action[j] == DISCARD) or have multiple matches
-	 * (action[j] == INVESTIGATE). Note that we always call this
-	 * function with action[i] == INVESTIGATE, so the current line
-	 * (i) is already a multimatch line.
-	 */
-	for (r = 1, rdis0 = 0, rpdis0 = 1; (i - r) >= s; r++) {
-		if (action[i - r] == DISCARD)
-			rdis0++;
-		else if (action[i - r] == INVESTIGATE)
-			rpdis0++;
-		else if (action[i - r] == KEEP)
-			break;
-		else
-			BUG("Illegal value for action[i - r]");
-	}
-	/*
-	 * If the run before the line 'i' found only multimatch lines,
-	 * we return false and hence we don't make the current line (i)
-	 * discarded. We want to discard multimatch lines only when
-	 * they appear in the middle of runs with nomatch lines
-	 * (action[j] == DISCARD).
-	 */
-	if (rdis0 == 0)
-		return 0;
-	for (r = 1, rdis1 = 0, rpdis1 = 1; (i + r) <= e; r++) {
-		if (action[i + r] == DISCARD)
-			rdis1++;
-		else if (action[i + r] == INVESTIGATE)
-			rpdis1++;
-		else if (action[i + r] == KEEP)
-			break;
-		else
-			BUG("Illegal value for action[i + r]");
-	}
-	/*
-	 * If the run after the line 'i' found only multimatch lines,
-	 * we return false and hence we don't make the current line (i)
-	 * discarded.
-	 */
-	if (rdis1 == 0)
-		return false;
-	rdis1 += rdis0;
-	rpdis1 += rpdis0;
-
-	return rpdis1 * XDL_KPDIS_RUN < (rpdis1 + rdis1);
-}
-
-struct xoccurrence
-{
-	size_t file1, file2;
-};
-
-
-DEFINE_IVEC_TYPE(struct xoccurrence, xoccurrence);
-
-
-/*
- * Try to reduce the problem complexity, discard records that have no
- * matches on the other file. Also, lines that have multiple matches
- * might be potentially discarded if they appear in a run of discardable.
- */
-static int xdl_cleanup_records(xdfenv_t *xe, uint64_t flags) {
-	long i;
-	size_t nm, mlim;
-	xrecord_t *recs;
-	uint8_t *action1 = NULL, *action2 = NULL;
-	struct IVec_xoccurrence occ;
-	bool need_min = !!(flags & XDF_NEED_MINIMAL);
-	int ret = 0;
-	ptrdiff_t dend1 = xe->xdf1.nrec - 1 - xe->delta_end;
-	ptrdiff_t dend2 = xe->xdf2.nrec - 1 - xe->delta_end;
-
-	IVEC_INIT(occ);
-	ivec_zero(&occ, xe->mph_size);
-
-	for (size_t j = 0; j < xe->xdf1.nrec; j++) {
-		size_t mph1 = xe->xdf1.recs[j].minimal_perfect_hash;
-		occ.ptr[mph1].file1 += 1;
-	}
-
-	for (size_t j = 0; j < xe->xdf2.nrec; j++) {
-		size_t mph2 = xe->xdf2.recs[j].minimal_perfect_hash;
-		occ.ptr[mph2].file2 += 1;
-	}
-
-	/*
-	 * Create temporary arrays that will help us decide if
-	 * changed[i] should remain false, or become true.
-	 */
-	if (!XDL_CALLOC_ARRAY(action1, xe->xdf1.nrec + 1)) {
-		ret = -1;
-		goto cleanup;
-	}
-	if (!XDL_CALLOC_ARRAY(action2, xe->xdf2.nrec + 1)) {
-		ret = -1;
-		goto cleanup;
-	}
-
-	/*
-	 * Initialize temporary arrays with DISCARD, KEEP, or INVESTIGATE.
-	 */
-	if ((mlim = xdl_bogosqrt((long)xe->xdf1.nrec)) > XDL_MAX_EQLIMIT)
-		mlim = XDL_MAX_EQLIMIT;
-	for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start]; i <= dend1; i++, recs++) {
-		nm = occ.ptr[recs->minimal_perfect_hash].file2;
-		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
-	}
-
-	if ((mlim = xdl_bogosqrt((long)xe->xdf2.nrec)) > XDL_MAX_EQLIMIT)
-		mlim = XDL_MAX_EQLIMIT;
-	for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start]; i <= dend2; i++, recs++) {
-		nm = occ.ptr[recs->minimal_perfect_hash].file1;
-		action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
-	}
-
-	/*
-	 * Use temporary arrays to decide if changed[i] should remain
-	 * false, or become true.
-	 */
-	xe->xdf1.nreff = 0;
-	for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start];
-	     i <= dend1; i++, recs++) {
-		if (action1[i] == KEEP ||
-		    (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xe->delta_start, dend1))) {
-			xe->xdf1.reference_index[xe->xdf1.nreff++] = i;
-			/* changed[i] remains false, i.e. keep */
-		} else
-			xe->xdf1.changed[i] = true;
-			/* i.e. discard */
-	}
-
-	xe->xdf2.nreff = 0;
-	for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start];
-	     i <= dend2; i++, recs++) {
-		if (action2[i] == KEEP ||
-		    (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xe->delta_start, dend2))) {
-			xe->xdf2.reference_index[xe->xdf2.nreff++] = i;
-			/* changed[i] remains false, i.e. keep */
-		} else
-			xe->xdf2.changed[i] = true;
-			/* i.e. discard */
-	}
-
-cleanup:
-	xdl_free(action1);
-	xdl_free(action2);
-	ivec_free(&occ);
-
-	return ret;
-}
-
-
 /*
  * Early trim initial and terminal matching records.
  */
@@ -414,19 +235,9 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
 	}
 
 	xe->mph_size = cf.count;
+	xdl_free_classifier(&cf);
 
 	xdl_trim_ends(xe);
-	if ((XDF_DIFF_ALG(xpp->flags) != XDF_PATIENCE_DIFF) &&
-	    (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF) &&
-	    xdl_cleanup_records(xe, xpp->flags) < 0) {
-
-		xdl_free_ctx(&xe->xdf2);
-		xdl_free_ctx(&xe->xdf1);
-		xdl_free_classifier(&cf);
-		return -1;
-	}
-
-	xdl_free_classifier(&cf);
 
 	return 0;
 }
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH 00/10] Xdiff cleanup part 3
  2026-01-02 18:52 [PATCH 00/10] Xdiff cleanup part 3 Ezekiel Newren via GitGitGadget
                   ` (9 preceding siblings ...)
  2026-01-02 18:52 ` [PATCH 10/10] xdiff: move xdl_cleanup_records() from xprepare.c to xdiffi.c Ezekiel Newren via GitGitGadget
@ 2026-01-04  2:44 ` Junio C Hamano
  2026-01-04  6:01 ` Yee Cheng Chin
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 78+ messages in thread
From: Junio C Hamano @ 2026-01-04  2:44 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget; +Cc: git, Ezekiel Newren

"Ezekiel Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

>  compat/ivec.c      | 113 ++++++++++++++++++
>  compat/ivec.h      |  52 +++++++++

I very much like the general direction, but I wonder if we expect
many more "rust-to-C interface layer" files to come, which I suspect
is generally true, and in which case I think it is a good idea to
rethink the use of "compat/" for this purpose from early days, as
"compat/" is not about "compat between C and something else", but is
about "compat between platform peculiarity and (idealized) POSIX
environment our code assumes".


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 01/10] ivec: introduce the C side of ivec
  2026-01-02 18:52 ` [PATCH 01/10] ivec: introduce the C side of ivec Ezekiel Newren via GitGitGadget
@ 2026-01-04  5:32   ` Junio C Hamano
  2026-01-17 16:06     ` Ezekiel Newren
  2026-01-08 14:34   ` Phillip Wood
  2026-01-16 20:19   ` René Scharfe
  2 siblings, 1 reply; 78+ messages in thread
From: Junio C Hamano @ 2026-01-04  5:32 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget; +Cc: git, Ezekiel Newren

"Ezekiel Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +	if (new_capacity == 0) {
> +		free(self->ptr);
> +		self->ptr = NULL;

	if (!new_capacity)
		FREE_AND_NULL(self->ptr);
	else
		...;

> +void ivec_free(void *self_)
> +{
> +	struct IVec_c_void *self = self_;
> +
> +	free(self->ptr);
> +	self->ptr = NULL;

Likewise.  Otherwise the code will fail coccicheck.

> +	self->length = 0;
> +	self->capacity = 0;
> +	// DO NOT MODIFY element_size!!!

	/* A single-liner comment in our codebase looks like this */


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 00/10] Xdiff cleanup part 3
  2026-01-02 18:52 [PATCH 00/10] Xdiff cleanup part 3 Ezekiel Newren via GitGitGadget
                   ` (10 preceding siblings ...)
  2026-01-04  2:44 ` [PATCH 00/10] Xdiff cleanup part 3 Junio C Hamano
@ 2026-01-04  6:01 ` Yee Cheng Chin
  2026-01-28 14:40 ` Phillip Wood
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 78+ messages in thread
From: Yee Cheng Chin @ 2026-01-04  6:01 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget; +Cc: git, Ezekiel Newren

Hi Ezekiel, I wonder if you saw my proposed patch "xdiff: fix outdated
xpatience comments referring to "ha" member var"?
(https://lore.kernel.org/pull.2139.git.git.1766464905719.gitgitgadget@gmail.com)
from 2 weeks ago? It simply cleans up a stale comment after a previous
xdiff cleanup when the "ha" member variable was split. I don't think
it conflicts with this part 3 (it's a small comments clean up) but I
wonder if you could take a look? Just to avoid future conflicts.

On Fri, Jan 2, 2026 at 10:52 AM Ezekiel Newren via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> Patch series summary:
>
>  * patch 1: Introduce the ivec type
>  * patch 2: Create the function xdl_do_classic_diff()
>  * patches 3-4: generic cleanup
>  * patches 5-8: convert from dstart/dend (in xdfile_t) to
>    delta_start/delta_end (in xdfenv_t)
>  * patches 9-10: move xdl_cleanup_records(), and related, from xprepare.c to
>    xdiffi.c
>
> Things that will be addressed in future patch series:
>
>  * Make xdl_cleanup_records() easier to read
>  * convert recs/nrec into an ivec
>  * convert changed to an ivec
>  * remove reference_index/nreff from xdfile_t and turn it into an ivec
>  * splitting minimal_perfect_hash out as its own ivec
>  * improve the performance of the classifier and parsing/hashing lines
>
> === before this patch series typedef struct s_xdfile { xrecord_t *recs;
> size_t nrec; ptrdiff_t dstart, dend; bool *changed; size_t *reference_index;
> size_t nreff; } xdfile_t;
>
> typedef struct s_xdfenv { xdfile_t xdf1, xdf2; } xdfenv_t;
>
> === after this patch series typedef struct s_xdfile { xrecord_t *recs;
> size_t nrec; bool *changed; size_t *reference_index; size_t nreff; }
> xdfile_t;
>
> typedef struct s_xdfenv { xdfile_t xdf1, xdf2; size_t delta_start,
> delta_end; size_t mph_size; } xdfenv_t;
>
> Ezekiel Newren (10):
>   ivec: introduce the C side of ivec
>   xdiff: make classic diff explicit by creating xdl_do_classic_diff()
>   xdiff: don't waste time guessing the number of lines
>   xdiff: let patience and histogram benefit from xdl_trim_ends()
>   xdiff: use xdfenv_t in xdl_trim_ends() and xdl_cleanup_records()
>   xdiff: cleanup xdl_trim_ends()
>   xdiff: replace xdfile_t.dstart with xdfenv_t.delta_start
>   xdiff: replace xdfile_t.dend with xdfenv_t.delta_end
>   xdiff: remove dependence on xdlclassifier from xdl_cleanup_records()
>   xdiff: move xdl_cleanup_records() from xprepare.c to xdiffi.c
>
>  Makefile           |   1 +
>  compat/ivec.c      | 113 ++++++++++++++++++
>  compat/ivec.h      |  52 +++++++++
>  meson.build        |   1 +
>  xdiff/xdiffi.c     | 221 +++++++++++++++++++++++++++++++++---
>  xdiff/xdiffi.h     |   1 +
>  xdiff/xhistogram.c |   7 +-
>  xdiff/xpatience.c  |   7 +-
>  xdiff/xprepare.c   | 277 ++++++++-------------------------------------
>  xdiff/xtypes.h     |   3 +-
>  xdiff/xutils.c     |  20 ----
>  xdiff/xutils.h     |   1 -
>  12 files changed, 432 insertions(+), 272 deletions(-)
>  create mode 100644 compat/ivec.c
>  create mode 100644 compat/ivec.h
>
>
> base-commit: 66ce5f8e8872f0183bb137911c52b07f1f242d13
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-2156%2Fezekielnewren%2Fxdiff-cleanup-3-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-2156/ezekielnewren/xdiff-cleanup-3-v1
> Pull-Request: https://github.com/git/git/pull/2156
> --
> gitgitgadget
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 01/10] ivec: introduce the C side of ivec
  2026-01-02 18:52 ` [PATCH 01/10] ivec: introduce the C side of ivec Ezekiel Newren via GitGitGadget
  2026-01-04  5:32   ` Junio C Hamano
@ 2026-01-08 14:34   ` Phillip Wood
  2026-01-15 15:55     ` Ezekiel Newren
  2026-01-16 20:19   ` René Scharfe
  2 siblings, 1 reply; 78+ messages in thread
From: Phillip Wood @ 2026-01-08 14:34 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget, git; +Cc: Ezekiel Newren

Hi Ezekiel

On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
> 
> Trying to use Rust's Vec in C, or git's ALLOC_GROW() macros (via
> wrapper functions) in Rust is painful because:
> 
>    * C doesn't define its own vector type, and even though Rust does
>      have Vec it's painful to use on the C side (more on that below).
>      However it's still not viable to use Rust's Vec type because Git
>      needs to be able to compile without Rust. So ivec was created
>      expressly to be interoperable between C and Rust without needing
>      Rust.
>    * C doing vector things the Rust way would require wrapper functions,
>      and Rust doing vector things the C way would require wrapper
>      functions, so ivec was created to ensure a consistent contract
>      between the 2 languages for how to manipulate a vector.
>    * Currently, Rust defines its own 'Vec' type that is generic, but its
>      memory allocator and struct layout weren't designed for
>      interoperability with C (or any language for that matter), meaning
>      that the C side cannot push to or expand a 'Vec' without defining
>      wrapper functions in Rust that C can call. Without special care,
>      the two languages might use different allocators (malloc/free on
>      the C side, and possibly something else in Rust), which would make
>      it difficult for a function in one language to free elements
>      allocated by a call from a function in the other language.
>    * Similarly, git defines ALLOC_GROW() and related macros in
>      git-compat-util.h. While we could add functions allowing Rust to
>      invoke something similar to those macros, passing three variables
>      (pointer, length, allocated_size) instead of a single variable
>      (vector) across the language boundary requires more cognitive
>      overhead for readers to keep track of and makes it easier to make
>      mistakes. Further, for low-level components that we want to
>      eventually convert to pure Rust, such triplets would feel very out
>      of place.
> 
> To address these issues, introduce a new type, ivec -- short for
> interoperable vector. (We refer to it as 'ivec' generally, though on
> the Rust side the struct is called IVec to match Rust style.)  This new
> type is specifically designed for FFI purposes, so that both languages
> handle the vector in the same way, though it could be used on either
> side independently. This type is designed such that it can easily be
> replaced by a Rust 'Vec' once interoperability is no longer a concern.
> 
> One particular item to note is that Git's macros to handle vec
> operations infer the amount that a vec needs to grow from the size of
> a pointer, but that makes it somewhat specific to the macros used in C.
> To avoid defining every ivec function as a macro I opted to also
> include an element_size field that allows concrete functions like
> push() to know how much to grow the memory. This element_size also
> helps in verifying that the ivec is correct when passing from C to
> Rust.

I've left some comments below but I think this is a sensible direction.

> diff --git a/compat/ivec.c b/compat/ivec.c
> new file mode 100644
> index 0000000000..0a777e78dc
> --- /dev/null
> +++ b/compat/ivec.c
> @@ -0,0 +1,113 @@
> +#include "ivec.h"
> +
> +struct IVec_c_void {

We normally use all lower case names for structs but as this is shared 
with rust it maybe makes sense to use CamelCase so the names are the 
same in both languages.

> +	void *ptr;
> +	size_t length;
> +	size_t capacity;
> +	size_t element_size;
> +};
> +
> +static void _set_capacity(void *self_, size_t new_capacity)
> +{
> +	struct IVec_c_void *self = self_;

Passing any of the ivec variants defined below to this function invokes 
undefined behavior because we're not casting the pointer back to the 
original type. However I think on the platforms we care about 
sizeof(void*) == sizeof(T*) for all T so maybe we can look the other way.

> +
> +	if (new_capacity == self->capacity) {
> +		return;
> +	}
> +	if (new_capacity == 0) {
> +		free(self->ptr);
> +		self->ptr = NULL;
> +	} else {
> +		self->ptr = realloc(self->ptr, new_capacity * self->element_size);
> +	}
> +	self->capacity = new_capacity;

Not if realloc() returns NULL. We should check for that, probably by 
using xrealloc().
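
To make the failure mode concrete, here is a minimal standalone sketch of a capacity setter that keeps the old buffer when realloc() fails and also guards the size multiplication. The names ivec_sketch and sketch_set_capacity() are hypothetical, and returning -1 is an assumption for illustration; inside git, xrealloc() would die instead of returning an error.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the struct in the patch. */
struct ivec_sketch {
	void *ptr;
	size_t length;
	size_t capacity;
	size_t element_size;
};

/*
 * Returns 0 on success, -1 on overflow or allocation failure.
 * On failure the vector is left exactly as it was.
 */
static int sketch_set_capacity(struct ivec_sketch *v, size_t new_capacity)
{
	void *p;

	if (new_capacity == v->capacity)
		return 0;
	if (!new_capacity) {
		free(v->ptr);
		v->ptr = NULL;
		v->length = 0;
		v->capacity = 0;
		return 0;
	}
	/* Reject a zero element size and a byte count that would wrap. */
	if (!v->element_size || new_capacity > SIZE_MAX / v->element_size)
		return -1;
	/* Assign through a temporary so the old buffer is not lost. */
	p = realloc(v->ptr, new_capacity * v->element_size);
	if (!p)
		return -1;
	v->ptr = p;
	v->capacity = new_capacity;
	if (v->length > new_capacity)
		v->length = new_capacity;
	return 0;
}
```

The point is only that capacity must not be updated unless the allocation succeeded; which error policy to use is a separate choice.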

> +void ivec_zero(void *self_, size_t capacity)
> +{
> +	struct IVec_c_void *self = self_;
> +
> +	self->ptr = calloc(capacity, self->element_size);

We should be handling allocation failures here probably by using xcalloc().

> +void ivec_reserve(void *self_, size_t additional)
> +{
> +	struct IVec_c_void *self = self_;
> +
> +	size_t growby = 128;
> +	if (self->capacity > growby)
> +		growby = self->capacity;
> +	if (additional > growby)
> +		growby = additional;

This growth strategy differs from both ALLOC_GROW() and 
XDL_ALLOC_GROW(), if there isn't a good reason for that we should 
perhaps just use ALLOC_GROW() here.
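
For comparison, a small sketch of the two policies as capacity functions. The first copies the growby computation quoted above and assumes (the rest of ivec_reserve() is not quoted here) that the new capacity is capacity + growby. The second follows the alloc_nr() rule behind ALLOC_GROW(), (alloc + 16) * 3 / 2. Both function names are made up for illustration.

```c
#include <stddef.h>

/*
 * Growth amount as in the quoted ivec_reserve():
 * max(128, capacity, additional).
 * Assumption: the resulting capacity is capacity + growby.
 */
static size_t ivec_next_capacity(size_t capacity, size_t additional)
{
	size_t growby = 128;

	if (capacity > growby)
		growby = capacity;
	if (additional > growby)
		growby = additional;
	return capacity + growby;
}

/*
 * ALLOC_GROW()-style policy: grow to (alloc + 16) * 3 / 2, or to
 * exactly `needed` if that is still too small. No growth happens
 * when `needed` already fits.
 */
static size_t alloc_grow_next_capacity(size_t alloc, size_t needed)
{
	size_t next = (alloc + 16) * 3 / 2;

	if (needed <= alloc)
		return alloc;
	return next < needed ? needed : next;
}
```

Under that assumption the ivec policy starts at 128 elements and then doubles, while the ALLOC_GROW() policy starts at 24 and grows by roughly one half each time.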

> +void ivec_push(void *self_, const void *value)
> +{
> +	struct IVec_c_void *self = self_;
> +	void *dst = NULL;
> +
> +	if (self->length == self->capacity)
> +		ivec_reserve(self, 1);
> +
> +	dst = (uint8_t*)self->ptr + self->length * self->element_size;
> +	memcpy(dst, value, self->element_size);

If self->element_size was a compile time constant the compiler could 
easily optimize this call away. I'm not sure that is easy to achieve though.

> +	self->length++;
> +}
> +
> +void ivec_free(void *self_)

Normally we'd call a function like this, which frees the allocations 
and re-initializes the members, ivec_clear().

> +{
> +	struct IVec_c_void *self = self_;
> +
> +	free(self->ptr);
> +	self->ptr = NULL;
> +	self->length = 0;
> +	self->capacity = 0;
> +	// DO NOT MODIFY element_size!!!
> +}
> +
> +void ivec_move(void *src_, void *dst_)
> +{
> +	struct IVec_c_void *src = src_;
> +	struct IVec_c_void *dst = dst_;

Maybe we should add

	if (src->element_size != dst->element_size)
		BUG("moving incompatible arrays");
> +
> +	ivec_free(dst);
> +	dst->ptr = src->ptr;
> +	dst->length = src->length;
> +	dst->capacity = src->capacity;
> +	// DO NOT MODIFY element_size!!!

As the element sizes must match maybe *dst = *src would be clearer?

> +
> +	src->ptr = NULL;
> +	src->length = 0;
> +	src->capacity = 0;
> +	// DO NOT MODIFY element_size!!!
> +}
> diff --git a/compat/ivec.h b/compat/ivec.h
> new file mode 100644
> index 0000000000..654a05c506
> --- /dev/null
> +++ b/compat/ivec.h
> @@ -0,0 +1,52 @@
> +#ifndef IVEC_H
> +#define IVEC_H
> +
> +#include <git-compat-util.h>

It would be nice to have some documentation in this header, see the 
examples in strvec.h and hashmap.h

> +#define IVEC_INIT(variable) ivec_init(&(variable), sizeof(*(variable).ptr))

This is a bit cumbersome to use compared to our usual *_INIT macros. I'm 
struggling to see how we can make it nicer though as DEFINE_IVEC_TYPE 
cannot define a per-type initializer macro and we cannot initialize 
the element size without knowing the type.

> +
> +#ifndef CBINDGEN
> +#define DEFINE_IVEC_TYPE(type, suffix) \
> +struct IVec_##suffix { \
> +	type* ptr; \
> +	size_t length; \
> +	size_t capacity; \
> +	size_t element_size; \
> +}

I wonder if we want to define type-safe inline wrappers for the 
ivec_* functions here. I think the only functions where the element type 
matters are ivec_move() and ivec_push(); for the others, like 
ivec_zero(), ivec_reserve() and ivec_free(), the element type does not 
matter. ivec_push() would certainly be easier to use with a wrapper, as 
it means we can avoid forcing the caller to take the address of the value.

static inline void ivec_##suffix##_push(struct IVec_##suffix *self, \
					 type value) { \
	const void *ptr = &value; \
	ivec_push(self, ptr); \
}

I'll try and take a look at the rest of this series next week

Thanks

Phillip

> +
> +DEFINE_IVEC_TYPE(bool, bool);
> +
> +DEFINE_IVEC_TYPE(uint8_t, u8);
> +DEFINE_IVEC_TYPE(uint16_t, u16);
> +DEFINE_IVEC_TYPE(uint32_t, u32);
> +DEFINE_IVEC_TYPE(uint64_t, u64);
> +
> +DEFINE_IVEC_TYPE(int8_t, i8);
> +DEFINE_IVEC_TYPE(int16_t, i16);
> +DEFINE_IVEC_TYPE(int32_t, i32);
> +DEFINE_IVEC_TYPE(int64_t, i64);
> +
> +DEFINE_IVEC_TYPE(float, f32);
> +DEFINE_IVEC_TYPE(double, f64);
> +
> +DEFINE_IVEC_TYPE(size_t, usize);
> +DEFINE_IVEC_TYPE(ssize_t, isize);
> +#endif
> +
> +void ivec_init(void *self_, size_t element_size);
> +
> +void ivec_zero(void *self_, size_t capacity);
> +
> +void ivec_reserve_exact(void *self_, size_t additional);
> +
> +void ivec_reserve(void *self_, size_t additional);
> +
> +void ivec_shrink_to_fit(void *self_);
> +
> +void ivec_push(void *self_, const void *value);
> +
> +void ivec_free(void *self_);
> +
> +void ivec_move(void *src, void *dst);
> +
> +#endif /* IVEC_H */
> diff --git a/meson.build b/meson.build
> index dd52efd1c8..42ac0c8c42 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -302,6 +302,7 @@ libgit_sources = [
>     'commit.c',
>     'common-exit.c',
>     'common-init.c',
> +  'compat/ivec.c',
>     'compat/nonblock.c',
>     'compat/obstack.c',
>     'compat/open.c',


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 01/10] ivec: introduce the C side of ivec
  2026-01-08 14:34   ` Phillip Wood
@ 2026-01-15 15:55     ` Ezekiel Newren
  2026-01-16 10:39       ` Phillip Wood
  2026-01-20 14:06       ` Phillip Wood
  0 siblings, 2 replies; 78+ messages in thread
From: Ezekiel Newren @ 2026-01-15 15:55 UTC (permalink / raw)
  To: phillip.wood; +Cc: Ezekiel Newren via GitGitGadget, git

On Thu, Jan 8, 2026 at 7:34 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
> > diff --git a/compat/ivec.c b/compat/ivec.c
> > new file mode 100644
> > index 0000000000..0a777e78dc
> > --- /dev/null
> > +++ b/compat/ivec.c
> > @@ -0,0 +1,113 @@
> > +#include "ivec.h"
> > +
> > +struct IVec_c_void {
>
> We normally use all lower case names for structs but as this is shared
> with rust it maybe makes sense to use CamelCase so the names are the
> same in both languages.

My preference would be all lowercase, but cbindgen insists on using
the same casing as was used in Rust. I don't think there's a way to
make cbindgen use all lowercase for structs.

> > +     void *ptr;
> > +     size_t length;
> > +     size_t capacity;
> > +     size_t element_size;
> > +};
> > +
> > +static void _set_capacity(void *self_, size_t new_capacity)
> > +{
> > +     struct IVec_c_void *self = self_;
>
> Passing any of the ivec variants defined below to this function invokes
> undefined behavior because we're not casting the pointer back to the
> original type. However I think on the platforms we care about
> sizeof(void*) == sizeof(T*) for all T so maybe we can look the other way.

If someone finds that this code does not work because of this
assumption I'd like to know. But I can't fathom a case where it
wouldn't work.

> > +
> > +     if (new_capacity == self->capacity) {
> > +             return;
> > +     }
> > +     if (new_capacity == 0) {
> > +             free(self->ptr);
> > +             self->ptr = NULL;
> > +     } else {
> > +             self->ptr = realloc(self->ptr, new_capacity * self->element_size);
> > +     }
> > +     self->capacity = new_capacity;
>
> Not if realloc() returns NULL. We should check for that, probably by
> using xrealloc().
>
> > +void ivec_zero(void *self_, size_t capacity)
> > +{
> > +     struct IVec_c_void *self = self_;
> > +
> > +     self->ptr = calloc(capacity, self->element_size);
>
> We should be handling allocation failures here probably by using xcalloc().

I've changed it to xrealloc(), and made the analogous change for the
calloc() call.


> > +void ivec_reserve(void *self_, size_t additional)
> > +{
> > +     struct IVec_c_void *self = self_;
> > +
> > +     size_t growby = 128;
> > +     if (self->capacity > growby)
> > +             growby = self->capacity;
> > +     if (additional > growby)
> > +             growby = additional;
>
> This growth strategy differs from both ALLOC_GROW() and
> XDL_ALLOC_GROW(), if there isn't a good reason for that we should
> perhaps just use ALLOC_GROW() here.

XDL_ALLOC_GROW() can't be used because the pointer is always a void*
in this function.

> > +void ivec_push(void *self_, const void *value)
> > +{
> > +     struct IVec_c_void *self = self_;
> > +     void *dst = NULL;
> > +
> > +     if (self->length == self->capacity)
> > +             ivec_reserve(self, 1);
> > +
> > +     dst = (uint8_t*)self->ptr + self->length * self->element_size;
> > +     memcpy(dst, value, self->element_size);
>
> If self->element_size was a compile time constant the compiler could
> easily optimize this call away. I'm not sure that is easy to achieve though.

The problem is that I didn't want all of ivec to be macros that looked
like function calls. I wanted to minimize use of macros so that it was
easier to port and verify that the Rust implementation matches the
behavior of the C implementation.

> > +void ivec_free(void *self_)
>
> Normally we'd call a function like this, which frees the allocations
> and re-initializes the members, ivec_clear().

In Rust Vec.clear() means to set length to zero, but leaves the
allocation alone. The reason why I'm zeroing the struct is to help
avoid FFI issues. If not zero then what should the members be set to,
to indicate that using the struct is not valid anymore? In Rust an
object is freed when it goes out of scope and _cannot_ be accessed
afterward.
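
To illustrate the distinction, here is a minimal sketch using a hypothetical struct ivec_sketch (not the type from the patch): a Rust-style clear only resets the length, while free releases the buffer and zeroes everything except element_size.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the struct in the patch. */
struct ivec_sketch {
	void *ptr;
	size_t length;
	size_t capacity;
	size_t element_size;
};

/* Rust's Vec::clear() semantics: drop the elements, keep the buffer. */
static void sketch_clear(struct ivec_sketch *v)
{
	v->length = 0;
}

/* What ivec_free() does in the patch: release and zero the struct. */
static void sketch_free(struct ivec_sketch *v)
{
	free(v->ptr);
	v->ptr = NULL;
	v->length = 0;
	v->capacity = 0;
	/* element_size is deliberately left untouched */
}
```

After sketch_clear() the vector can be pushed to again without reallocating; after sketch_free() it is back in the same state as right after initialization.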

> > +{
> > +     struct IVec_c_void *self = self_;
> > +
> > +     free(self->ptr);
> > +     self->ptr = NULL;
> > +     self->length = 0;
> > +     self->capacity = 0;
> > +     // DO NOT MODIFY element_size!!!
> > +}
> > +
> > +void ivec_move(void *src_, void *dst_)
> > +{
> > +     struct IVec_c_void *src = src_;
> > +     struct IVec_c_void *dst = dst_;
>
> Maybe we should add
>
>         if (src->element_size != dst->element_size)
>                 BUG("moving incompatible arrays");

I'll do that.

> > +
> > +     ivec_free(dst);
> > +     dst->ptr = src->ptr;
> > +     dst->length = src->length;
> > +     dst->capacity = src->capacity;
> > +     // DO NOT MODIFY element_size!!!
>
> As the element sizes must match maybe *dst = *src would be clearer?

That seems fine.

> > +
> > +     src->ptr = NULL;
> > +     src->length = 0;
> > +     src->capacity = 0;
> > +     // DO NOT MODIFY element_size!!!
> > +}
> > diff --git a/compat/ivec.h b/compat/ivec.h
> > new file mode 100644
> > index 0000000000..654a05c506
> > --- /dev/null
> > +++ b/compat/ivec.h
> > @@ -0,0 +1,52 @@
> > +#ifndef IVEC_H
> > +#define IVEC_H
> > +
> > +#include <git-compat-util.h>
>
> It would be nice to have some documentation in this header, see the
> examples in strvec.h and hashmap.h
>
> > +#define IVEC_INIT(variable) ivec_init(&(variable), sizeof(*(variable).ptr))
>
> This is a bit cumbersome to use compared to our usual *_INIT macros. I'm
> struggling to see how we can make it nicer though as DEFINE_IVEC_TYPE
> cannot define a per-type initializer macro and we cannot initialize
> the element size without knowing the type.

I don't see what's cumbersome about it. Maybe an example use case
would clarify things.

```
DEFINE_IVEC_TYPE(xrecord_t, xrecord);

void some_function(void) {
    struct IVec_xrecord rec;
    IVEC_INIT(rec);  /* i.e. ivec_init(&rec, sizeof(*rec.ptr)); */

    /*
     * Use concrete functions to manipulate the vector, or access the
     * array directly via ptr.
     */
}
```

IVEC_INIT() should be used on the concrete type.

> > +
> > +#ifndef CBINDGEN
> > +#define DEFINE_IVEC_TYPE(type, suffix) \
> > +struct IVec_##suffix { \
> > +     type* ptr; \
> > +     size_t length; \
> > +     size_t capacity; \
> > +     size_t element_size; \
> > +}
>
> I wonder if we want to define type-safe inline wrappers for the
> ivec_* functions here. I think the only functions where the element type
> matters are ivec_move() and ivec_push(); for the others, like
> ivec_zero(), ivec_reserve() and ivec_free(), the element type does not
> matter. ivec_push() would certainly be easier to use with a wrapper, as
> it means we can avoid forcing the caller to take the address of the value.
>
> static inline ivec_##suffix##_push(struct IVec_##suffix *self, type
> value) { \
>         const void *ptr = &value; \
>         ivec_push(self, ptr); \
> }

I turned ivec_push() into a macro, but the rest will remain as
concrete functions.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 01/10] ivec: introduce the C side of ivec
  2026-01-15 15:55     ` Ezekiel Newren
@ 2026-01-16 10:39       ` Phillip Wood
  2026-01-16 20:19         ` René Scharfe
  2026-01-17 16:14         ` Ezekiel Newren
  2026-01-20 14:06       ` Phillip Wood
  1 sibling, 2 replies; 78+ messages in thread
From: Phillip Wood @ 2026-01-16 10:39 UTC (permalink / raw)
  To: Ezekiel Newren, phillip.wood
  Cc: Ezekiel Newren via GitGitGadget, git, Jeff King,
	René Scharfe

I've Cc'd Peff and René for a second opinion if you have time please.

On 15/01/2026 15:55, Ezekiel Newren wrote:
> On Thu, Jan 8, 2026 at 7:34 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
 >
>>> +static void _set_capacity(void *self_, size_t new_capacity)
>>> +{
>>> +     struct IVec_c_void *self = self_;
>>
>> Passing any of the ivec variants defined below to this function invokes
>> undefined behavior because we're not casting the pointer back to the
>> original type. However I think on the platforms we care about
>> sizeof(void*) == sizeof(T*) for all T so maybe we can look the other way.
> 
> If someone finds that this code does not work because of this
> assumption I'd like to know. But I can't fathom a case where it
> wouldn't work.

So we have two different structs

struct IVec_c_void {
	void *ptr;
	size_t length;
	size_t capacity;
	size_t element_size;
}

and

struct IVec_u8 {
	uint8_t *ptr;
	size_t length;
	size_t capacity;
	size_t element_size;
}

On the platforms we care about they will have the same memory layout as 
all pointers have the same representation. However I don't think they 
are "compatible types" in the language of the C standard because the 
type of the "ptr" member differs. That means casting IVec_u8* to 
IVec_c_void* either directly or via void* is undefined and so

	struct IVec_u8 vec;
	ivec_init(&vec, sizeof(*vec.ptr));

is undefined. For the compiler to see the undefined cast it needs to 
look across translation units because the implementation of ivec_init() 
will be in a separate file to where it is called. Maybe that and the 
fact they have the same memory layout saves us from having to worry too 
much though I'm always nervous of undefined behavior.

An alternative would be to pass the individual struct members as 
function parameters

	void ivec_init(void **vec, size_t &length, size_t &capacity,
		       size_t &element_size_, size_t element_size)
	{
		*vec = NULL;
		*length = 0;
		*capacity = 0;
		*element_size_ = element_size;
	}

and have DEFINE_IVEC_TYPE create typesafe wrappers

	static inline void ivec_u8_init(struct IVec_u8 *vec)
	{
		void *ptr = vec->ptr;
		ivec_init(&ptr, &v->length, &v->capacity,
			  &v->element_size, sizeof(*(v->ptr));
		vec->ptr = ptr;
	}

That's safe because we cast the "ptr" member to "void*" and then back to 
the original type. On the rust side the implementation of IVec<T> would 
also need to split out the individual struct members when it calls 
ivec_init() etc. It's all a bit more effort but the benefit is that we 
don't have any undefined behavior and we have a nice typesafe C 
interface to 'struct IVec_*'.

Thanks

Phillip


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 01/10] ivec: introduce the C side of ivec
  2026-01-16 10:39       ` Phillip Wood
@ 2026-01-16 20:19         ` René Scharfe
  2026-01-17 13:55           ` Phillip Wood
  2026-01-17 16:14         ` Ezekiel Newren
  1 sibling, 1 reply; 78+ messages in thread
From: René Scharfe @ 2026-01-16 20:19 UTC (permalink / raw)
  To: phillip.wood, Ezekiel Newren
  Cc: Ezekiel Newren via GitGitGadget, git, Jeff King

On 1/16/26 11:39 AM, Phillip Wood wrote:
> I've Cc'd Peff and René for a second opinion if you have time please.
> 
> On 15/01/2026 15:55, Ezekiel Newren wrote:
>> On Thu, Jan 8, 2026 at 7:34 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>>
>>>> +static void _set_capacity(void *self_, size_t new_capacity)
>>>> +{
>>>> +     struct IVec_c_void *self = self_;
>>>
>>> Passing any of the ivec variants defined below to this function invokes
>>> undefined behavior because we're not casting the pointer back to the
>>> original type. However I think on the platforms we care about
>>> sizeof(void*) == sizeof(T*) for all T so maybe we can look the other way.
>>
>> If someone finds that this code does not work because of this
>> assumption I'd like to know. But I can't fathom a case where it
>> wouldn't work.
> 
> So we have two different structs
> 
> struct IVec_c_void {
>     void *ptr;
>     size_t length;
>     size_t capacity;
>     size_t element_size;
> }
> 
> and
> 
> struct IVec_u8 {
>     uint8_t *ptr;
>     size_t length;
>     size_t capacity;
>     size_t element_size;
> }
> 
> On the platforms we care about they will have the same memory
> layout as all pointers have the same representation. However I don't
> think they are "compatible types" in the language of the C standard
> because the type of the "ptr" member differs. That means casting
> IVec_u8* to IVec_c_void* either directly or via void* is undefined
> and so
> 
>     struct IVec_u8 vec;
>     ivec_init(&vec, sizeof(*vec.ptr));
> 
> is undefined. For the compiler to see the undefined cast it needs to
> look across translation units because the implementation of
> ivec_init() will be in a separate file to where it is called. Maybe
> that and the fact they have the same memory layout saves us from
> having to worry too much though I'm always nervous of undefined
> behavior.

True.  The GCC docs give a fun example of what a compiler might do
when using different struct types to access the same memory:

https://www.gnu.org/software/c-intro-and-ref/manual/html_node/Aliasing-Type-Rules.html

Not sure it applies to this case, but the point is that compilers
can and will do terrifying things when they smell UB, with little
concern for safety or original intent.

> An alternative would be to pass the individual struct members as function parameters
> 
>     void ivec_init(void **vec, size_t &length, size_t &capacity,
>                size_t &element_size_, size_t element_size)
>     {
>         *vec = NULL;
>         *length = 0;
>         *capacity = 0;
>         *element_size_ = element_size;
>     }

The ampersands (&) should be asterisks (*), right?

> and have DEFINE_IVEC_TYPE create typesafe wrappers
> 
>     static inline void ivec_u8_init(struct IVec_u8 *vec)
>     {
>         void *ptr = vec->ptr;
>         ivec_init(&ptr, &v->length, &v->capacity,
>               &v->element_size, sizeof(*(v->ptr));
>         vec->ptr = ptr;
>     }

Mixes "v" and "vec", misses a closing parenthesis.  Looks viable,
though, and this method should be applicable to the rest of the
functions as well (on the C side).

I guess this doesn't require an element_size member anymore as
each wrapper can pass in the sizeof value.
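
With those typos fixed and the element_size member dropped, the idea
could look roughly like the sketch below.  The three-member struct
layout, the int return values and the overflow check are my
assumptions for illustration, not something taken from the patch:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical layout: no element_size member, each wrapper supplies it. */
struct IVec_u8 {
	uint8_t *ptr;
	size_t length;
	size_t capacity;
};

/* Generic part: only ever sees a void * and the size_t members. */
static void ivec_init(void **ptr, size_t *length, size_t *capacity)
{
	*ptr = NULL;
	*length = 0;
	*capacity = 0;
}

/* Grow the allocation by exactly `additional` elements; 0 on success. */
static int ivec_reserve_exact(void **ptr, size_t *capacity,
			      size_t element_size, size_t additional)
{
	size_t new_capacity = *capacity + additional;
	void *p;

	if (!additional)
		return 0;
	/* Reject size_t overflow in the element count or the byte count. */
	if (new_capacity < *capacity || new_capacity > SIZE_MAX / element_size)
		return -1;
	p = realloc(*ptr, new_capacity * element_size);
	if (!p)
		return -1;
	*ptr = p;
	*capacity = new_capacity;
	return 0;
}

/* Typesafe wrappers, as DEFINE_IVEC_TYPE might generate them. */
static inline void ivec_u8_init(struct IVec_u8 *vec)
{
	void *ptr;

	ivec_init(&ptr, &vec->length, &vec->capacity);
	vec->ptr = ptr;
}

static inline int ivec_u8_reserve_exact(struct IVec_u8 *vec, size_t additional)
{
	void *ptr = vec->ptr;
	int ret = ivec_reserve_exact(&ptr, &vec->capacity,
				     sizeof(*vec->ptr), additional);

	vec->ptr = ptr;
	return ret;
}
```

The struct pointer is never reinterpreted as another struct type; only
the "ptr" member makes the round trip through void *.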

> That's safe because we cast the "ptr" member to "void*" and then
> back to the original type. On the rust side the implementation of
> IVec<T> would also need to split out the individual struct members
> when it calls ivec_init() etc. It's all a bit more effort but the
> benefit is that we don't have any undefined behavior and we have a
> nice typesafe C interface to 'struct IVec_*'.

Right.  No idea how ugly this would be on the Rust side, though.

René


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 01/10] ivec: introduce the C side of ivec
  2026-01-02 18:52 ` [PATCH 01/10] ivec: introduce the C side of ivec Ezekiel Newren via GitGitGadget
  2026-01-04  5:32   ` Junio C Hamano
  2026-01-08 14:34   ` Phillip Wood
@ 2026-01-16 20:19   ` René Scharfe
  2026-01-17 15:58     ` Ezekiel Newren
  2 siblings, 1 reply; 78+ messages in thread
From: René Scharfe @ 2026-01-16 20:19 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget, git; +Cc: Ezekiel Newren

On 1/2/26 7:52 PM, Ezekiel Newren via GitGitGadget wrote:
> diff --git a/compat/ivec.c b/compat/ivec.c
> new file mode 100644
> index 0000000000..0a777e78dc
> --- /dev/null
> +++ b/compat/ivec.c
> @@ -0,0 +1,113 @@
> +#include "ivec.h"
> +
> +struct IVec_c_void {
> +	void *ptr;
> +	size_t length;
> +	size_t capacity;
> +	size_t element_size;
> +};
> +
> +static void _set_capacity(void *self_, size_t new_capacity)
> +{
> +	struct IVec_c_void *self = self_;
> +
> +	if (new_capacity == self->capacity) {
> +		return;
> +	}
> +	if (new_capacity == 0) {
> +		free(self->ptr);
> +		self->ptr = NULL;
> +	} else {
> +		self->ptr = realloc(self->ptr, new_capacity * self->element_size);
> +	}
> +	self->capacity = new_capacity;
> +}
> +
> +
> +void ivec_init(void *self_, size_t element_size)
> +{
> +	struct IVec_c_void *self = self_;
> +
> +	self->ptr = NULL;
> +	self->length = 0;
> +	self->capacity = 0;
> +	self->element_size = element_size;
> +}
> +
> +void ivec_zero(void *self_, size_t capacity)
> +{
> +	struct IVec_c_void *self = self_;
> +
> +	self->ptr = calloc(capacity, self->element_size);
> +	self->length = capacity;
> +	self->capacity = capacity;
> +	// DO NOT MODIFY element_size!!!
> +}
> +
> +void ivec_reserve_exact(void *self_, size_t additional)
> +{
> +	struct IVec_c_void *self = self_;
> +
> +	_set_capacity(self, self->capacity + additional);
> +}
> +
> +void ivec_reserve(void *self_, size_t additional)
> +{
> +	struct IVec_c_void *self = self_;
> +
> +	size_t growby = 128;
> +	if (self->capacity > growby)
> +		growby = self->capacity;
> +	if (additional > growby)
> +		growby = additional;
> +
> +	_set_capacity(self, self->capacity + growby);
> +}

Constant growth steps like these cause linear growth and quadratic
complexity.  ALLOC_GROW does exponential growth with factor 1.5 to
get linear complexity.  Here's an old plea to do the same:
https://blog.mozilla.org/nnethercote/2014/11/04/please-grow-your-buffers-exponentially/

René


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/10] xdiff: remove dependence on xdlclassifier from xdl_cleanup_records()
  2026-01-02 18:52 ` [PATCH 09/10] xdiff: remove dependence on xdlclassifier from xdl_cleanup_records() Ezekiel Newren via GitGitGadget
@ 2026-01-16 20:19   ` René Scharfe
  2026-01-17 16:34     ` Ezekiel Newren
  2026-01-21 15:01   ` Phillip Wood
  1 sibling, 1 reply; 78+ messages in thread
From: René Scharfe @ 2026-01-16 20:19 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget, git; +Cc: Ezekiel Newren

On 1/2/26 7:52 PM, Ezekiel Newren via GitGitGadget wrote:
> @@ -253,22 +250,44 @@ static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
>  	return rpdis1 * XDL_KPDIS_RUN < (rpdis1 + rdis1);
>  }
>  
> +struct xoccurrence
> +{
> +	size_t file1, file2;
> +};
> +
> +
> +DEFINE_IVEC_TYPE(struct xoccurrence, xoccurrence);
> +
>  
>  /*
>   * Try to reduce the problem complexity, discard records that have no
>   * matches on the other file. Also, lines that have multiple matches
>   * might be potentially discarded if they appear in a run of discardable.
>   */
> -static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
> -	long i, nm, mlim;
> +static int xdl_cleanup_records(xdfenv_t *xe, uint64_t flags) {
> +	long i;
> +	size_t nm, mlim;
>  	xrecord_t *recs;
> -	xdlclass_t *rcrec;
>  	uint8_t *action1 = NULL, *action2 = NULL;
> -	bool need_min = !!(cf->flags & XDF_NEED_MINIMAL);
> +	struct IVec_xoccurrence occ;
> +	bool need_min = !!(flags & XDF_NEED_MINIMAL);
>  	int ret = 0;
>  	ptrdiff_t dend1 = xe->xdf1.nrec - 1 - xe->delta_end;
>  	ptrdiff_t dend2 = xe->xdf2.nrec - 1 - xe->delta_end;
>  
> +	IVEC_INIT(occ);
> +	ivec_zero(&occ, xe->mph_size);

This array is presized here.  It is never grown or shrunk afterwards.
CALLOC_ARRAY would work just as well, at least at this point, no?

> +
> +	for (size_t j = 0; j < xe->xdf1.nrec; j++) {
> +		size_t mph1 = xe->xdf1.recs[j].minimal_perfect_hash;
> +		occ.ptr[mph1].file1 += 1;
> +	}
> +
> +	for (size_t j = 0; j < xe->xdf2.nrec; j++) {
> +		size_t mph2 = xe->xdf2.recs[j].minimal_perfect_hash;
> +		occ.ptr[mph2].file2 += 1;
> +	}
> +
>  	/*
>  	 * Create temporary arrays that will help us decide if
>  	 * changed[i] should remain false, or become true.
> @@ -288,16 +307,14 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
>  	if ((mlim = xdl_bogosqrt((long)xe->xdf1.nrec)) > XDL_MAX_EQLIMIT)
>  		mlim = XDL_MAX_EQLIMIT;
>  	for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start]; i <= dend1; i++, recs++) {
> -		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
> -		nm = rcrec ? rcrec->len2 : 0;
> +		nm = occ.ptr[recs->minimal_perfect_hash].file2;
>  		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
>  	}
>  
>  	if ((mlim = xdl_bogosqrt((long)xe->xdf2.nrec)) > XDL_MAX_EQLIMIT)
>  		mlim = XDL_MAX_EQLIMIT;
>  	for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start]; i <= dend2; i++, recs++) {
> -		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
> -		nm = rcrec ? rcrec->len1 : 0;
> +		nm = occ.ptr[recs->minimal_perfect_hash].file1;
>  		action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
>  	}
>  
> @@ -332,6 +349,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
>  cleanup:
>  	xdl_free(action1);
>  	xdl_free(action2);
> +	ivec_free(&occ);
>  
>  	return ret;
>  }

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 01/10] ivec: introduce the C side of ivec
  2026-01-16 20:19         ` René Scharfe
@ 2026-01-17 13:55           ` Phillip Wood
  2026-01-17 16:04             ` Ezekiel Newren
  0 siblings, 1 reply; 78+ messages in thread
From: Phillip Wood @ 2026-01-17 13:55 UTC (permalink / raw)
  To: René Scharfe, phillip.wood, Ezekiel Newren
  Cc: Ezekiel Newren via GitGitGadget, git, Jeff King

On 16/01/2026 20:19, René Scharfe wrote:
> On 1/16/26 11:39 AM, Phillip Wood wrote:
>> I've Cc'd Peff and René for a second opinion if you have time please.
>>
>> On 15/01/2026 15:55, Ezekiel Newren wrote:
>>> On Thu, Jan 8, 2026 at 7:34 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>>>
>>>>> +static void _set_capacity(void *self_, size_t new_capacity)
>>>>> +{
>>>>> +     struct IVec_c_void *self = self_;
>>>>
>>>> Passing any of the ivec variants defined below to this function invokes
>>>> undefined behavior because we're not casting the pointer back to the
>>>> original type. However I think on the platforms we care about
>>>> sizeof(void*) == sizeof(T*) for all T so maybe we can look the other way.
>>>
>>> If someone finds that this code does not work because of this
>>> assumption I'd like to know. But I can't fathom a case where it
>>> wouldn't work.
>>
>> So we have two different structs
>>
>> struct IVec_c_void {
>>      void *ptr;
>>      size_t length;
>>      size_t capacity;
>>      size_t element_size;
>> }
>>
>> and
>>
>> struct Ivec_u8 {
>>      uint8_t *ptr;
>>      size_t length;
>>      size_t capacity;
>>      size_t element_size;
>> }
>>
>> On the platforms we care about they will have the same memory
>> layout as all pointers have the same representation. However I don't
>> think they are "compatible types" in the language of the C standard
>> because the type of the "ptr" member differs. That means casting
>> IVec_u8* to IVec_c_void* either directly or via void* is undefined
>> and so
>>
>>      struct IVec_u8 vec;
>>      ivec_init(&vec, sizeof(*vec.ptr));
>>
>> is undefined. For the compiler to see the undefined cast it needs to
>> look across translation units because the implementation of
>> ivec_init() will be in a separate file to where it is called. Maybe
>> that and the fact they have the same memory layout saves us from
>> having to worry too much though I'm always nervous of undefined
>> behavior.
> 
> True.  The GCC docs give a fun example of what a compiler might do
> when using different struct types to access the same memory:
> 
> https://www.gnu.org/software/c-intro-and-ref/manual/html_node/Aliasing-Type-Rules.html

Thanks for the link

> Not sure it applies to this case, but the point is that compilers
> can and will do terrifying things when they smell UB, with little
> concern for safety or original intent.
> 
>> An alternative would be to pass the individual struct members as function parameters
>>
>>      void ivec_init(void **vec, size_t &length, size_t &capacity,
>>                 size_t &element_size_, size_t element_size)
>>      {
>>          *vec = NULL;
>>          *length = 0;
>>          *capacity = 0;
>>          *element_size_ = element_size;
>>      }
> 
> The ampersands (&) should be asterisks (*), right?

Indeed, that's embarrassing - I must have been thinking of the caller.

>> and have DEFINE_IVEC_TYPE create typesafe wrappers
>>
>>      static inline void ivec_u8_init(struct IVec_u8 *vec)
>>      {
>>          void *ptr = vec->ptr;
>>          ivec_init(&ptr, &v->length, &v->capacity,
>>                &v->element_size, sizeof(*(v->ptr));
>>          vec->ptr = ptr;
>>      }
> 
> Mixes "v" and "vec", misses a closing parenthesis.  Looks viable,
> though, and this method should be applicable to the rest of the
> functions as well (on the C side).
> 
> I guess this doesn't require an element_size member anymore as
> each wrapper can pass in the sizeof value.

Good point

>> That's safe because we cast the "ptr" member to "void*" and then
>> back to the original type. On the rust side the implementation of
>> IVec<T> would also need to split out the individual struct members
>> when it calls ivec_init() etc. It's all a bit more effort but the
>> benefit is that we don't have any undefined behavior and we have a
>> nice typesafe C interface to 'struct IVec_*'.
> Right.  No idea how ugly this would be on the Rust side, though.

I'm hoping it's not too bad and `impl IVec<T>` just contains the 
equivalent of the wrappers generated by DEFINE_IVEC_TYPE()

Thanks

Phillip
> 
> René
> 


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 01/10] ivec: introduce the C side of ivec
  2026-01-16 20:19   ` René Scharfe
@ 2026-01-17 15:58     ` Ezekiel Newren
  2026-01-18 14:55       ` René Scharfe
  0 siblings, 1 reply; 78+ messages in thread
From: Ezekiel Newren @ 2026-01-17 15:58 UTC (permalink / raw)
  To: René Scharfe; +Cc: Ezekiel Newren via GitGitGadget, git

On Fri, Jan 16, 2026 at 1:19 PM René Scharfe <l.s.r@web.de> wrote:
>
> On 1/2/26 7:52 PM, Ezekiel Newren via GitGitGadget wrote:
> > diff --git a/compat/ivec.c b/compat/ivec.c
> > new file mode 100644
> > index 0000000000..0a777e78dc
> > --- /dev/null
> > +++ b/compat/ivec.c
> > @@ -0,0 +1,113 @@
> > +#include "ivec.h"
> > +
> > +struct IVec_c_void {
> > +     void *ptr;
> > +     size_t length;
> > +     size_t capacity;
> > +     size_t element_size;
> > +};
> > +
> > +static void _set_capacity(void *self_, size_t new_capacity)
> > +{
> > +     struct IVec_c_void *self = self_;
> > +
> > +     if (new_capacity == self->capacity) {
> > +             return;
> > +     }
> > +     if (new_capacity == 0) {
> > +             free(self->ptr);
> > +             self->ptr = NULL;
> > +     } else {
> > +             self->ptr = realloc(self->ptr, new_capacity * self->element_size);
> > +     }
> > +     self->capacity = new_capacity;
> > +}
> > +
> > +
> > +void ivec_init(void *self_, size_t element_size)
> > +{
> > +     struct IVec_c_void *self = self_;
> > +
> > +     self->ptr = NULL;
> > +     self->length = 0;
> > +     self->capacity = 0;
> > +     self->element_size = element_size;
> > +}
> > +
> > +void ivec_zero(void *self_, size_t capacity)
> > +{
> > +     struct IVec_c_void *self = self_;
> > +
> > +     self->ptr = calloc(capacity, self->element_size);
> > +     self->length = capacity;
> > +     self->capacity = capacity;
> > +     // DO NOT MODIFY element_size!!!
> > +}
> > +
> > +void ivec_reserve_exact(void *self_, size_t additional)
> > +{
> > +     struct IVec_c_void *self = self_;
> > +
> > +     _set_capacity(self, self->capacity + additional);
> > +}
> > +
> > +void ivec_reserve(void *self_, size_t additional)
> > +{
> > +     struct IVec_c_void *self = self_;
> > +
> > +     size_t growby = 128;
> > +     if (self->capacity > growby)
> > +             growby = self->capacity;
> > +     if (additional > growby)
> > +             growby = additional;
> > +
> > +     _set_capacity(self, self->capacity + growby);
> > +}
>
> Constant growth steps like these cause linear growth and quadratic
> complexity.  ALLOC_GROW does exponential growth with factor 1.5 to
> get linear complexity.  Here's an old plea to do the same:
> https://blog.mozilla.org/nnethercote/2014/11/04/please-grow-your-buffers-exponentially/
>
> René

It _is_ exponential. ivec_reserve(&vec, 1) means grow by _at least_ 1.
growby is never smaller than the current capacity, so each call at
least doubles the allocation. I'm not using the usual memory
management macros from git-compat-util.h because I want ivec to behave
very much like Rust's Vec. That way C programmers will already be
familiar with how Vec operates when Rust is introduced into the code,
_and_ converting from IVec to Vec is as simple as refactoring IVec
declarations to Vec.

Since C does not support generics there is no _proper_ solution. What
I have come up with on the C side for ivec is my best effort
compromise.
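
To make the growth pattern concrete, here is a small sketch that
isolates the growby computation from the quoted ivec_reserve().
next_capacity() and count_reallocs() are hypothetical helper names
used only for this illustration:

```c
#include <stddef.h>

/* Mirrors the growby computation in the quoted ivec_reserve(). */
static size_t next_capacity(size_t capacity, size_t additional)
{
	size_t growby = 128;

	if (capacity > growby)
		growby = capacity;
	if (additional > growby)
		growby = additional;
	return capacity + growby;
}

/*
 * Number of reallocations needed to reach room for `target` elements
 * when every request asks for just one more element.
 */
static size_t count_reallocs(size_t target)
{
	size_t capacity = 0, n = 0;

	while (capacity < target) {
		capacity = next_capacity(capacity, 1);
		n++;
	}
	return n;
}
```

Starting from an empty vector the capacity goes 128, 256, 512, 1024,
and so on, so room for a million elements is reached after 14
reallocations rather than thousands.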

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 01/10] ivec: introduce the C side of ivec
  2026-01-17 13:55           ` Phillip Wood
@ 2026-01-17 16:04             ` Ezekiel Newren
  2026-01-18 14:58               ` René Scharfe
  0 siblings, 1 reply; 78+ messages in thread
From: Ezekiel Newren @ 2026-01-17 16:04 UTC (permalink / raw)
  To: Phillip Wood
  Cc: René Scharfe, phillip.wood, Ezekiel Newren via GitGitGadget,
	git, Jeff King

On Sat, Jan 17, 2026 at 6:55 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>
> On 16/01/2026 20:19, René Scharfe wrote:
> > On 1/16/26 11:39 AM, Phillip Wood wrote:
> >> I've Cc'd Peff and René for a second opinion if you have time please.
> >>
> >> On 15/01/2026 15:55, Ezekiel Newren wrote:
> >>> On Thu, Jan 8, 2026 at 7:34 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
> >>>
> >>>>> +static void _set_capacity(void *self_, size_t new_capacity)
> >>>>> +{
> >>>>> +     struct IVec_c_void *self = self_;
> >>>>
> >>>> Passing any of the ivec variants defined below to this function invokes
> >>>> undefined behavior because we're not casting the pointer back to the
> >>>> original type. However I think on the platforms we care about
> >>>> sizeof(void*) == sizeof(T*) for all T so maybe we can look the other way.
> >>>
> >>> If someone finds that this code does not work because of this
> >>> assumption I'd like to know. But I can't fathom a case where it
> >>> wouldn't work.
> >>
> >> So we have two different structs
> >>
> >> struct IVec_c_void {
> >>      void *ptr;
> >>      size_t length;
> >>      size_t capacity;
> >>      size_t element_size;
> >> }
> >>
> >> and
> >>
> >> struct Ivec_u8 {
> >>      uint8_t *ptr;
> >>      size_t length;
> >>      size_t capacity;
> >>      size_t element_size;
> >> }
> >>
> >> On the platforms we care about they will have the same memory
> >> layout as all pointers have the same representation. However I don't
> >> think they are "compatible types" in the language of the C standard
> >> because the type of the "ptr" member differs. That means casting
> >> IVec_u8* to IVec_c_void* either directly or via void* is undefined
> >> and so
> >>
> >>      struct IVec_u8 vec;
> >>      ivec_init(&vec, sizeof(*vec.ptr));
> >>
> >> is undefined. For the compiler to see the undefined cast it needs to
> >> look across translation units because the implementation of
> >> ivec_init() will be in a separate file to where it is called. Maybe
> >> that and the fact they have the same memory layout saves us from
> >> having to worry too much though I'm always nervous of undefined
> >> behavior.
> >
> > True.  The GCC docs give a fun example of what a compiler might do
> > when using different struct types to access the same memory:
> >
> > https://www.gnu.org/software/c-intro-and-ref/manual/html_node/Aliasing-Type-Rules.html
>
> Thanks for the link
>
> > Not sure it applies to this case, but the point is that compilers
> > can and will do terrifying things when they smell UB, with little
> > concern for safety or original intent.
> >
> >> An alternative would be to pass the individual struct members as function parameters
> >>
> >>      void ivec_init(void **vec, size_t &length, size_t &capacity,
> >>                 size_t &element_size_, size_t element_size)
> >>      {
> >>          *vec = NULL;
> >>          *length = 0;
> >>          *capacity = 0;
> >>          *element_size_ = element_size;
> >>      }
> >
> > The ampersands (&) should be asterisks (*), right?
>
> Indeed, that's embarrassing - I must have been thinking of the caller.
>
> >> and have DEFINE_IVEC_TYPE create typesafe wrappers
> >>
> >>      static inline void ivec_u8_init(struct IVec_u8 *vec)
> >>      {
> >>          void *ptr = vec->ptr;
> >>          ivec_init(&ptr, &v->length, &v->capacity,
> >>                &v->element_size, sizeof(*(v->ptr));
> >>          vec->ptr = ptr;
> >>      }
> >
> > Mixes "v" and "vec", misses a closing parenthesis.  Looks viable,
> > though, and this method should be applicable to the rest of the
> > functions as well (on the C side).
> >
> > I guess this doesn't require an element_size member anymore as
> > each wrapper can pass in the sizeof value.
>
> Good point
>
> >> That's safe because we cast the "ptr" member to "void*" and then
> >> back to the original type. On the rust side the implementation of
> >> IVec<T> would also need to split out the individual struct members
> >> when it calls ivec_init() etc. It's all a bit more effort but the
> >> benefit is that we don't have any undefined behavior and we have a
> >> nice typesafe C interface to 'struct IVec_*'.
> > Right.  No idea how ugly this would be on the Rust side, though.
>
> I'm hoping it's not too bad and `impl IVec<T>` just contains the
> equivalent of the wrappers generated by DEFINE_IVEC_TYPE()
>
> Thanks
>
> Phillip
> >
> > René
> >
>

I don't like this solution. ivec_push() is the only function that
deals with actual values. The rest are just generic memory management
functions. What if we used:

#define ivec_init(vec) { \
    (vec)->ptr = NULL; \
    (vec)->length = 0; \
    (vec)->capacity = 0; \
    (vec)->element_size = sizeof(*(vec)->ptr); \
}

#define ivec_push_unsafe(vec, value) (vec)->ptr[(vec)->length++] = (value)

/*
 * grow by at least 1
 */
#define ivec_push(vec, value) { \
    if ((vec)->length == (vec)->capacity) \
       ivec_reserve(vec, 1); \
    ivec_push_unsafe(vec, value); \
}

Instead of concrete functions?
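
A slightly more defensive sketch of the same idea follows. It wraps
the bodies in do { } while (0) so the macros behave as single
statements inside if/else, and it grows through a helper that only
sees the "ptr" member as void *. struct IVec_int and ivec_grow() are
hypothetical names for this example, and aborting on allocation
failure is an assumption:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical concrete instance of the ivec layout. */
struct IVec_int {
	int *ptr;
	size_t length;
	size_t capacity;
	size_t element_size;
};

/* Grow by at least `additional` elements, doubling once past 128. */
static void *ivec_grow(void *ptr, size_t *capacity, size_t element_size,
		       size_t additional)
{
	size_t growby = 128;
	void *p;

	if (*capacity > growby)
		growby = *capacity;
	if (additional > growby)
		growby = additional;
	p = realloc(ptr, (*capacity + growby) * element_size);
	if (!p)
		abort();
	*capacity += growby;
	return p;
}

#define ivec_init(vec) do { \
	(vec)->ptr = NULL; \
	(vec)->length = 0; \
	(vec)->capacity = 0; \
	(vec)->element_size = sizeof(*(vec)->ptr); \
} while (0)

/* Note: evaluates `vec` more than once, so it must be side-effect free. */
#define ivec_push(vec, value) do { \
	if ((vec)->length == (vec)->capacity) \
		(vec)->ptr = ivec_grow((vec)->ptr, &(vec)->capacity, \
				       (vec)->element_size, 1); \
	(vec)->ptr[(vec)->length++] = (value); \
} while (0)
```

With this shape "if (cond) ivec_push(&v, x); else ..." compiles as
expected, which the bare { } form would not.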

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 01/10] ivec: introduce the C side of ivec
  2026-01-04  5:32   ` Junio C Hamano
@ 2026-01-17 16:06     ` Ezekiel Newren
  0 siblings, 0 replies; 78+ messages in thread
From: Ezekiel Newren @ 2026-01-17 16:06 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Ezekiel Newren via GitGitGadget, git

On Sat, Jan 3, 2026 at 10:32 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Ezekiel Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > +     if (new_capacity == 0) {
> > +             free(self->ptr);
> > +             self->ptr = NULL;
>
>         if (!new_capacity)
>                 FREE_AND_NULL(self->ptr);
>         else
>                 ...;
>
> > +void ivec_free(void *self_)
> > +{
> > +     struct IVec_c_void *self = self_;
> > +
> > +     free(self->ptr);
> > +     self->ptr = NULL;
>
> Likewise.  Otherwise the code will fail coccicheck.
>
> > +     self->length = 0;
> > +     self->capacity = 0;
> > +     // DO NOT MODIFY element_size!!!
>
>         /* A single-liner comment in our codebase looks like this */
>

I will make these changes.
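
For reference, a minimal sketch of what the adjusted free path might
look like. FREE_AND_NULL is repeated locally so the snippet stands
alone (assumed to match the definition in git-compat-util.h), and the
concrete struct IVec_u8 with a typed ivec_u8_free() is a hypothetical
stand-in for the void * signature in the patch:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Local copy for illustration; git provides this in git-compat-util.h. */
#define FREE_AND_NULL(p) do { free(p); (p) = NULL; } while (0)

struct IVec_u8 {
	uint8_t *ptr;
	size_t length;
	size_t capacity;
	size_t element_size;
};

static void ivec_u8_free(struct IVec_u8 *vec)
{
	FREE_AND_NULL(vec->ptr);
	vec->length = 0;
	vec->capacity = 0;
	/* element_size is deliberately left untouched */
}
```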

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 01/10] ivec: introduce the C side of ivec
  2026-01-16 10:39       ` Phillip Wood
  2026-01-16 20:19         ` René Scharfe
@ 2026-01-17 16:14         ` Ezekiel Newren
  2026-01-17 16:16           ` Ezekiel Newren
  2026-01-17 17:40           ` Phillip Wood
  1 sibling, 2 replies; 78+ messages in thread
From: Ezekiel Newren @ 2026-01-17 16:14 UTC (permalink / raw)
  To: phillip.wood
  Cc: Ezekiel Newren via GitGitGadget, git, Jeff King,
	René Scharfe

On Fri, Jan 16, 2026 at 3:39 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>
> I've Cc'd Peff and René for a second opinion if you have time please.
>
> On 15/01/2026 15:55, Ezekiel Newren wrote:
> > On Thu, Jan 8, 2026 at 7:34 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>  >
> >>> +static void _set_capacity(void *self_, size_t new_capacity)
> >>> +{
> >>> +     struct IVec_c_void *self = self_;
> >>
> >> Passing any of the ivec variants defined below to this function invokes
> >> undefined behavior because we're not casting the pointer back to the
> >> original type. However I think on the platforms we care about
> >> sizeof(void*) == sizeof(T*) for all T so maybe we can look the other way.
> >
> > If someone finds that this code does not work because of this
> > assumption I'd like to know. But I can't fathom a case where it
> > wouldn't work.
>
> So we have two different structs
>
> struct IVec_c_void {
>         void *ptr;
>         size_t length;
>         size_t capacity;
>         size_t element_size;
> }
>
> and
>
> struct Ivec_u8 {
>         uint8_t *ptr;
>         size_t length;
>         size_t capacity;
>         size_t element_size;
> }
>
> On the platforms we care about they will have the same memory layout as
> all pointers have the same representation. However I don't think they
> are "compatible types" in the language of the C standard because the
> type of the "ptr" member differs. That means casting IVec_u8* to
> IVec_c_void* either directly or via void* is undefined and so
>
>         struct IVec_u8 vec;
>         ivec_init(&vec, sizeof(*vec.ptr));
>
> is undefined. For the compiler to see the undefined cast it needs to
> look across translation units because the implementation of ivec_init()
> will be in a separate file to where it is called. Maybe that and the
> fact they have the same memory layout saves us from having to worry too
> much though I'm always nervous of undefined behavior.
>
> An alternative would be to pass the individual struct members as
> function parameters
>
>         void ivec_init(void **vec, size_t &length, size_t &capacity,
>                        size_t &element_size_, size_t element_size)
>         {
>                 *vec = NULL;
>                 *length = 0;
>                 *capacity = 0;
>                 *element_size_ = element_size;
>         }
>
> and have DEFINE_IVEC_TYPE create typesafe wrappers
>
>         static inline void ivec_u8_init(struct IVec_u8 *vec)
>         {
>                 void *ptr = vec->ptr;
>                 ivec_init(&ptr, &v->length, &v->capacity,
>                           &v->element_size, sizeof(*(v->ptr));
>                 vec->ptr = ptr;
>         }
>
> That's safe because we cast the "ptr" member to "void*" and then back to
> the original type. On the rust side the implementation of IVec<T> would
> also need to split out the individual struct members when it calls
> ivec_init() etc. It's all a bit more effort but the benefit is that we
> don't have any undefined behavior and we have a nice typesafe C
> interface to 'struct IVec_*'.
>
> Thanks
>
> Phillip
>

If the size of different kinds of pointers ever differed from the size
of void* then wouldn't that make all calls to malloc undefined? I
don't see this as a problem since I'm not casting between structs with
different members that are not pointers. I could use void* for
everything, but then we'd need an accessor like *(T*)ivec_at(&vec, i),
but this is much more painful and error prone than simply vec.ptr[i].

I agree that the example referenced by René is problematic, but
irrelevant to ivec in my opinion.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 01/10] ivec: introduce the C side of ivec
  2026-01-17 16:14         ` Ezekiel Newren
@ 2026-01-17 16:16           ` Ezekiel Newren
  2026-01-17 17:40           ` Phillip Wood
  1 sibling, 0 replies; 78+ messages in thread
From: Ezekiel Newren @ 2026-01-17 16:16 UTC (permalink / raw)
  To: phillip.wood
  Cc: Ezekiel Newren via GitGitGadget, git, Jeff King,
	René Scharfe

> If the size of different kinds of pointers ever differed from the size
> of void* then wouldn't that make all calls to malloc undefined? I

I meant to say undefined behavior, not simply undefined.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/10] xdiff: remove dependence on xdlclassifier from xdl_cleanup_records()
  2026-01-16 20:19   ` René Scharfe
@ 2026-01-17 16:34     ` Ezekiel Newren
  2026-01-18 18:23       ` René Scharfe
  0 siblings, 1 reply; 78+ messages in thread
From: Ezekiel Newren @ 2026-01-17 16:34 UTC (permalink / raw)
  To: René Scharfe; +Cc: Ezekiel Newren via GitGitGadget, git

On Fri, Jan 16, 2026 at 1:19 PM René Scharfe <l.s.r@web.de> wrote:
>
> On 1/2/26 7:52 PM, Ezekiel Newren via GitGitGadget wrote:
> > @@ -253,22 +250,44 @@ static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
> >       return rpdis1 * XDL_KPDIS_RUN < (rpdis1 + rdis1);
> >  }
> >
> > +struct xoccurrence
> > +{
> > +     size_t file1, file2;
> > +};
> > +
> > +
> > +DEFINE_IVEC_TYPE(struct xoccurrence, xoccurrence);
> > +
> >
> >  /*
> >   * Try to reduce the problem complexity, discard records that have no
> >   * matches on the other file. Also, lines that have multiple matches
> >   * might be potentially discarded if they appear in a run of discardable.
> >   */
> > -static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
> > -     long i, nm, mlim;
> > +static int xdl_cleanup_records(xdfenv_t *xe, uint64_t flags) {
> > +     long i;
> > +     size_t nm, mlim;
> >       xrecord_t *recs;
> > -     xdlclass_t *rcrec;
> >       uint8_t *action1 = NULL, *action2 = NULL;
> > -     bool need_min = !!(cf->flags & XDF_NEED_MINIMAL);
> > +     struct IVec_xoccurrence occ;
> > +     bool need_min = !!(flags & XDF_NEED_MINIMAL);
> >       int ret = 0;
> >       ptrdiff_t dend1 = xe->xdf1.nrec - 1 - xe->delta_end;
> >       ptrdiff_t dend2 = xe->xdf2.nrec - 1 - xe->delta_end;
> >
> > +     IVEC_INIT(occ);
> > +     ivec_zero(&occ, xe->mph_size);
>
> This array is presized here.  It is neither grown nor shrunken.
> CALLOC_ARRAY would work just as well, at least at this point, no?
>
> > +
> > +     for (size_t j = 0; j < xe->xdf1.nrec; j++) {
> > +             size_t mph1 = xe->xdf1.recs[j].minimal_perfect_hash;
> > +             occ.ptr[mph1].file1 += 1;
> > +     }
> > +
> > +     for (size_t j = 0; j < xe->xdf2.nrec; j++) {
> > +             size_t mph2 = xe->xdf2.recs[j].minimal_perfect_hash;
> > +             occ.ptr[mph2].file2 += 1;
> > +     }
> > +
> >       /*
> >        * Create temporary arrays that will help us decide if
> >        * changed[i] should remain false, or become true.
> > @@ -288,16 +307,14 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
> >       if ((mlim = xdl_bogosqrt((long)xe->xdf1.nrec)) > XDL_MAX_EQLIMIT)
> >               mlim = XDL_MAX_EQLIMIT;
> >       for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start]; i <= dend1; i++, recs++) {
> > -             rcrec = cf->rcrecs[recs->minimal_perfect_hash];
> > -             nm = rcrec ? rcrec->len2 : 0;
> > +             nm = occ.ptr[recs->minimal_perfect_hash].file2;
> >               action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
> >       }
> >
> >       if ((mlim = xdl_bogosqrt((long)xe->xdf2.nrec)) > XDL_MAX_EQLIMIT)
> >               mlim = XDL_MAX_EQLIMIT;
> >       for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start]; i <= dend2; i++, recs++) {
> > -             rcrec = cf->rcrecs[recs->minimal_perfect_hash];
> > -             nm = rcrec ? rcrec->len1 : 0;
> > +             nm = occ.ptr[recs->minimal_perfect_hash].file1;
> >               action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
> >       }
> >
> > @@ -332,6 +349,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
> >  cleanup:
> >       xdl_free(action1);
> >       xdl_free(action2);
> > +     ivec_free(&occ);
> >
> >       return ret;
> >  }

In Rust the memory management macros defined in git-compat-util.h will
not be available. ivec was built expressly to bridge the gap between C
and Rust. I'm avoiding using those macros because I'm trying to get C
programmers familiar with how Rust's Vec operates without forcing them
to read and write in Rust. Also, it makes converting from IVec to Vec
super easy.

ivec_zero() also sets length and capacity. Also CALLOC_ARRAY needs to
know the type of the pointer which ivec_zero() does not have access
to. This is one of the few ivec functions that does not have a direct
equivalent in Rust's Vec, but is faster than what is logically
equivalent in Rust.

In Rust the closest safe equivalent would look like:

let size = 35;
let mut vec = Vec::<u64>::new();
vec.reserve_exact(size);
vec.resize(size, 0);  // requires that T implements the `Clone` trait

The unsafe version would look like:
let size = 35;
let mut vec = Vec::<u64>::new();
vec.reserve_exact(size);
unsafe {
    // the count argument is in units of T, not bytes
    std::ptr::write_bytes(vec.as_mut_ptr(), 0, size);
    vec.set_len(size);
}
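
Back on the C side, the counting step that the zeroed vector serves
only needs a zero-filled array indexed by minimal_perfect_hash. Here
is a standalone sketch with plain calloc(); count_occurrences() is a
hypothetical helper rather than code from the patch, and it assumes
every hash value is smaller than mph_size:

```c
#include <stddef.h>
#include <stdlib.h>

struct xoccurrence {
	size_t file1, file2;
};

/*
 * Count how often each hash value occurs in each file.
 * Returns a calloc()ed array of mph_size entries, or NULL on failure;
 * the caller frees it.
 */
static struct xoccurrence *count_occurrences(const size_t *mph1, size_t n1,
					     const size_t *mph2, size_t n2,
					     size_t mph_size)
{
	struct xoccurrence *occ = calloc(mph_size, sizeof(*occ));
	size_t j;

	if (!occ)
		return NULL;
	for (j = 0; j < n1; j++)
		occ[mph1[j]].file1++;
	for (j = 0; j < n2; j++)
		occ[mph2[j]].file2++;
	return occ;
}
```

A line from file 1 whose entry has file2 == 0 has no match in the
other file, which is the DISCARD case in the quoted hunk.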

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 01/10] ivec: introduce the C side of ivec
  2026-01-17 16:14         ` Ezekiel Newren
  2026-01-17 16:16           ` Ezekiel Newren
@ 2026-01-17 17:40           ` Phillip Wood
  2026-01-19  5:59             ` Jeff King
  1 sibling, 1 reply; 78+ messages in thread
From: Phillip Wood @ 2026-01-17 17:40 UTC (permalink / raw)
  To: Ezekiel Newren, phillip.wood
  Cc: Ezekiel Newren via GitGitGadget, git, Jeff King,
	René Scharfe

On 17/01/2026 16:14, Ezekiel Newren wrote:
> 
> If the size of different kinds of pointers ever differed from the size
> of void* then wouldn't that make all calls to malloc undefined?

I believe there are (Harvard architecture?) platforms where function
pointers are a different width to data pointers, and that's why you
cannot store a function pointer in void*. I agree it would be weird for
char* to have a different width to int*; I suspect the restrictions on
casting from one type to another are about alignment.

> I
> don't see this as a problem since I'm not casting between structs with
> different members that are not pointers.

But isn't that still undefined behavior as far as the C standard is
concerned? It might make sense for it to work, but common sense has 
little to do with undefined behavior.

> I could use void* for
> everything, but then we'd need an accessor like *(T*)ivec_at(&vec, i),
> but this is much more painful and error prone than simply vec.ptr[i].

Yeah that's horrible

Thanks

Phillip



* Re: [PATCH 01/10] ivec: introduce the C side of ivec
  2026-01-17 15:58     ` Ezekiel Newren
@ 2026-01-18 14:55       ` René Scharfe
  0 siblings, 0 replies; 78+ messages in thread
From: René Scharfe @ 2026-01-18 14:55 UTC (permalink / raw)
  To: Ezekiel Newren; +Cc: Ezekiel Newren via GitGitGadget, git

On 1/17/26 4:58 PM, Ezekiel Newren wrote:
> On Fri, Jan 16, 2026 at 1:19 PM René Scharfe <l.s.r@web.de> wrote:
>>
>>> +void ivec_reserve(void *self_, size_t additional)
>>> +{
>>> +     struct IVec_c_void *self = self_;
>>> +
>>> +     size_t growby = 128;
>>> +     if (self->capacity > growby)
>>> +             growby = self->capacity;
>>> +     if (additional > growby)
>>> +             growby = additional;
>>> +
>>> +     _set_capacity(self, self->capacity + growby);
>>> +}
>>
>> Constant growth steps like these cause linear growth and quadratic
>> complexity.  ALLOC_GROW does exponential growth with factor 1.5 to
>> get linear complexity.  Here's an old plea to do the same:
>> https://blog.mozilla.org/nnethercote/2014/11/04/please-grow-your-buffers-exponentially/
>>
>> René
> 
> It _is_ exponential. ivec_reserve(&vec, 1) means grow by _at least_ 1.
D'oh!  Right, it grows with factor 2, as growby is at least as big as
->capacity.  I can't read.
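
To make the doubling concrete, here is a minimal sketch that pulls the
size computation of the quoted ivec_reserve() into a standalone helper.
The name ivec_next_capacity() is hypothetical and not part of the patch;
only the arithmetic is taken from the quoted code.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not in the patch): the capacity the quoted
 * ivec_reserve() would request, given the current capacity and the
 * number of additional elements asked for.
 */
static size_t ivec_next_capacity(size_t capacity, size_t additional)
{
	size_t growby = 128;

	/* growby ends up as max(128, capacity, additional) */
	if (capacity > growby)
		growby = capacity;
	if (additional > growby)
		growby = additional;

	return capacity + growby;
}
```

Starting from an empty vector, repeated requests for one more element
give capacities 128, 256, 512 and so on, so the number of reallocations
grows with the logarithm of the final length.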

René



* Re: [PATCH 01/10] ivec: introduce the C side of ivec
  2026-01-17 16:04             ` Ezekiel Newren
@ 2026-01-18 14:58               ` René Scharfe
  0 siblings, 0 replies; 78+ messages in thread
From: René Scharfe @ 2026-01-18 14:58 UTC (permalink / raw)
  To: Ezekiel Newren, Phillip Wood
  Cc: phillip.wood, Ezekiel Newren via GitGitGadget, git, Jeff King

On 1/17/26 5:04 PM, Ezekiel Newren wrote:
> 
> I don't like this solution. ivec_push() is the only function that
> deals with actual values. The rest are just generic memory management
> functions. What if we used:
> 
> #define ivec_init(vec) { \
>     (vec)->ptr = NULL; \
>     (vec)->length = 0; \
>     (vec)->capacity = 0; \
>     (vec)->element_size = sizeof(*(vec)->ptr); \
> }
> 
> #define ivec_push_unsafe(vec, value) (vec)->ptr[(vec)->length++] = (value)
> 
> /*
>  * grow by at least 1
>  */
> #define ivec_push(vec, value) { \
>     if ((vec)->length == (vec)->capacity) \
>        ivec_reserve(vec, 1); \
>     ivec_push_unsafe(vec, value); \
> }
> 
> Instead of concrete functions?

These macros are OK on the C side with respect to type safety.

I guess they would have to be duplicated somehow in Rust?

What would ivec_reserve() look like?

The macros use their parameter "vec" multiple times, though, so callers
must not pass in an expression with a side-effect, as it would be
evaluated more than once.  We have a few of those already.  You have to
be careful not to do stuff like this (example of calling a _push-like
function with an argument with a side-effect from
strbuf.c::strbuf_join_argv()):

	while (--argc)
		strbuf_addstr(buf, *(++argv));

Also they can't be used like a function -- you'd have to call them
without a trailing semicolon.  That's a small issue and easily
overcome by wrapping their body in "do { } while (0)".
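
A minimal sketch of that wrapping, assuming a vector of int.  The struct
layout and the ivec_int_reserve() helper are simplified stand-ins made
up for this example; they are not the code from the patch.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for what DEFINE_IVEC_TYPE(int, int) might produce (assumed). */
struct IVec_int {
	int *ptr;
	size_t length;
	size_t capacity;
};

/* Hypothetical reserve: grow by at least `additional`, abort on failure. */
static void ivec_int_reserve(struct IVec_int *vec, size_t additional)
{
	size_t growby = vec->capacity > 128 ? vec->capacity : 128;
	int *p;

	if (additional > growby)
		growby = additional;
	p = realloc(vec->ptr, (vec->capacity + growby) * sizeof(*p));
	if (!p)
		abort();
	vec->ptr = p;
	vec->capacity += growby;
}

/* The do/while (0) wrapper turns the body into a single statement. */
#define ivec_push(vec, value) do { \
	if ((vec)->length == (vec)->capacity) \
		ivec_int_reserve((vec), 1); \
	(vec)->ptr[(vec)->length++] = (value); \
} while (0)
```

With the wrapper, "if (x) ivec_push(&v, 1); else ivec_push(&v, 2);"
compiles as expected.  The multiple evaluation of "vec" remains, so the
caveat about side effects still applies.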

René



* Re: [PATCH 09/10] xdiff: remove dependence on xdlclassifier from xdl_cleanup_records()
  2026-01-17 16:34     ` Ezekiel Newren
@ 2026-01-18 18:23       ` René Scharfe
  0 siblings, 0 replies; 78+ messages in thread
From: René Scharfe @ 2026-01-18 18:23 UTC (permalink / raw)
  To: Ezekiel Newren; +Cc: Ezekiel Newren via GitGitGadget, git

On 1/17/26 5:34 PM, Ezekiel Newren wrote:
> On Fri, Jan 16, 2026 at 1:19 PM René Scharfe <l.s.r@web.de> wrote:
>>
>> On 1/2/26 7:52 PM, Ezekiel Newren via GitGitGadget wrote:
>>> @@ -253,22 +250,44 @@ static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
>>>       return rpdis1 * XDL_KPDIS_RUN < (rpdis1 + rdis1);
>>>  }
>>>
>>> +struct xoccurrence
>>> +{
>>> +     size_t file1, file2;
>>> +};
>>> +
>>> +
>>> +DEFINE_IVEC_TYPE(struct xoccurrence, xoccurrence);
>>> +
>>>
>>>  /*
>>>   * Try to reduce the problem complexity, discard records that have no
>>>   * matches on the other file. Also, lines that have multiple matches
>>>   * might be potentially discarded if they appear in a run of discardable.
>>>   */
>>> -static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
>>> -     long i, nm, mlim;
>>> +static int xdl_cleanup_records(xdfenv_t *xe, uint64_t flags) {
>>> +     long i;
>>> +     size_t nm, mlim;
>>>       xrecord_t *recs;
>>> -     xdlclass_t *rcrec;
>>>       uint8_t *action1 = NULL, *action2 = NULL;
>>> -     bool need_min = !!(cf->flags & XDF_NEED_MINIMAL);
>>> +     struct IVec_xoccurrence occ;
>>> +     bool need_min = !!(flags & XDF_NEED_MINIMAL);
>>>       int ret = 0;
>>>       ptrdiff_t dend1 = xe->xdf1.nrec - 1 - xe->delta_end;
>>>       ptrdiff_t dend2 = xe->xdf2.nrec - 1 - xe->delta_end;
>>>
>>> +     IVEC_INIT(occ);
>>> +     ivec_zero(&occ, xe->mph_size);
>>
>> This array is presized here.  It is neither grown nor shrunken.
>> CALLOC_ARRAY would work just as well, at least at this point, no?
>>
>>> +
>>> +     for (size_t j = 0; j < xe->xdf1.nrec; j++) {
>>> +             size_t mph1 = xe->xdf1.recs[j].minimal_perfect_hash;
>>> +             occ.ptr[mph1].file1 += 1;
>>> +     }
>>> +
>>> +     for (size_t j = 0; j < xe->xdf2.nrec; j++) {
>>> +             size_t mph2 = xe->xdf2.recs[j].minimal_perfect_hash;
>>> +             occ.ptr[mph2].file2 += 1;
>>> +     }
>>> +
>>>       /*
>>>        * Create temporary arrays that will help us decide if
>>>        * changed[i] should remain false, or become true.
>>> @@ -288,16 +307,14 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
>>>       if ((mlim = xdl_bogosqrt((long)xe->xdf1.nrec)) > XDL_MAX_EQLIMIT)
>>>               mlim = XDL_MAX_EQLIMIT;
>>>       for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start]; i <= dend1; i++, recs++) {
>>> -             rcrec = cf->rcrecs[recs->minimal_perfect_hash];
>>> -             nm = rcrec ? rcrec->len2 : 0;
>>> +             nm = occ.ptr[recs->minimal_perfect_hash].file2;
>>>               action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
>>>       }
>>>
>>>       if ((mlim = xdl_bogosqrt((long)xe->xdf2.nrec)) > XDL_MAX_EQLIMIT)
>>>               mlim = XDL_MAX_EQLIMIT;
>>>       for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start]; i <= dend2; i++, recs++) {
>>> -             rcrec = cf->rcrecs[recs->minimal_perfect_hash];
>>> -             nm = rcrec ? rcrec->len1 : 0;
>>> +             nm = occ.ptr[recs->minimal_perfect_hash].file1;
>>>               action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
>>>       }
>>>
>>> @@ -332,6 +349,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
>>>  cleanup:
>>>       xdl_free(action1);
>>>       xdl_free(action2);
>>> +     ivec_free(&occ);
>>>
>>>       return ret;
>>>  }
> 
> In Rust the memory management macros defined in git-compat-util.h will
> not be available. ivec was built expressly to bridge the gap between C
> and Rust. I'm avoiding using those macros because I'm trying to get C
> programmers familiar with how Rust's Vec operates without forcing them
> to read and write in Rust. Also, it makes converting from IVec to Vec
> super easy.
> 
> ivec_zero() also sets length and capacity. Also CALLOC_ARRAY needs to
> know the type of the pointer which ivec_zero() does not have access
> to. This is one of the few ivec functions that does not have a direct
> equivalent in Rust's Vec, but is faster than what is logically
> equivalent in Rust.
> 
> In Rust the closest safe equivalent would look like:
> 
> let size = 35;
> let mut vec = Vec::<u64>::new();
> vec.reserve_exact(size);
> vec.resize(size, 0);  // requires that T implements the `Clone` trait
> 
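> The unsafe version would look like:
> let size = 35;
> let mut vec = Vec::<u64>::new();
> vec.reserve_exact(size);
> unsafe {
>     // write_bytes() counts in units of T (u64 here), not in bytes
>     std::ptr::write_bytes(vec.as_mut_ptr(), 0, size);
>     vec.set_len(size);  // the zeroed elements are now initialized
> }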

I was being unclear and made a few assumptions here.  My point was just
that this is a fixed-size array and doesn't need to be stored in a
variable-sized container.  This is the first Ivec user, and I would have
expected it to exercise the push function.  I assume accessing a
fixed-size array via FFI would be a lot easier since allocation and
growth are out of the picture.
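
For illustration, a minimal sketch of the counting step with a plain
zero-initialized array.  The helper name count_occurrences() and its
parameters are hypothetical; it uses calloc() directly so the example
is self-contained, where Git code would use CALLOC_ARRAY.

```c
#include <stddef.h>
#include <stdlib.h>

struct xoccurrence {
	size_t file1, file2;
};

/*
 * Hypothetical helper: count how often each hash value occurs in each
 * file.  mph1 and mph2 hold one minimal perfect hash per line; every
 * value must be smaller than mph_size.  Returns NULL on allocation
 * failure; the caller releases the result with free().
 */
static struct xoccurrence *count_occurrences(const size_t *mph1, size_t n1,
					     const size_t *mph2, size_t n2,
					     size_t mph_size)
{
	struct xoccurrence *occ = calloc(mph_size, sizeof(*occ));
	size_t i;

	if (!occ)
		return NULL;
	for (i = 0; i < n1; i++)
		occ[mph1[i]].file1++;
	for (i = 0; i < n2; i++)
		occ[mph2[i]].file2++;
	return occ;
}
```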

René



* Re: [PATCH 01/10] ivec: introduce the C side of ivec
  2026-01-17 17:40           ` Phillip Wood
@ 2026-01-19  5:59             ` Jeff King
  2026-01-19 20:21               ` Ezekiel Newren
  2026-01-20 13:46               ` Phillip Wood
  0 siblings, 2 replies; 78+ messages in thread
From: Jeff King @ 2026-01-19  5:59 UTC (permalink / raw)
  To: Phillip Wood
  Cc: Ezekiel Newren, phillip.wood, Ezekiel Newren via GitGitGadget,
	git, René Scharfe

On Sat, Jan 17, 2026 at 05:40:08PM +0000, Phillip Wood wrote:

> On 17/01/2026 16:14, Ezekiel Newren wrote:
> > 
> > If the size of different kinds of pointers ever differed from the size
> > of void* then wouldn't that make all calls to malloc undefined?
> 
> I believe there are (Harvard architecture?) platforms where function pointers
> are a different width to data pointers, and that's why you cannot store a
> function pointer in void*. I agree it would be weird for char* to have a
> different width to int*, I suspect the restrictions on casting from one type
> to another are about alignment.

The standard does allow for different pointer sizes for char and int.
The key thing is that a void pointer has to be able to represent any of
them. So you can cast a smaller pointer to void and vice versa (and the
latter would presumably throw away some of the bits, which is OK as long
as the void was made from one of those smaller pointers originally).

More discussion at:

  https://c-faq.com/null/machexamp.html

I don't know how malloc worked on those platforms, though. The caller
knows that malloc returns a void pointer, so it could cast to the
smaller format in the usual way at the call-site. But I don't know how
you would tell malloc() in a standard way what type of pointer you
wanted to get out of it. I suspect they may have had specialized
allocation functions. Or maybe it was enough to just throw away the low
bits if you only cared about a word-addressable pointer.

At any rate, yeah, I agree with your original concern that the two
structs are not compatible. The layouts could be totally different. And
not just due to pointer size, but IIRC pointers to different types could
have different alignment requirements. So:

  struct foo_void {
	size_t len;
	void *ptr;
  };

  struct foo_u8 {
	size_t len;
	uint8_t *ptr;
  };

might need different padding to properly align the pointers. In the case
under discussion the pointers are always at the start, though, so I
think it wouldn't matter.
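
A minimal sketch of how one could probe this on a given platform.  The
function name foo_layouts_match() is made up for this example.  A result
of 1 only says that the two layouts happen to coincide here; it does not
make accessing one struct type through a pointer to the other defined
behavior.

```c
#include <stddef.h>
#include <stdint.h>

struct foo_void {
	size_t len;
	void *ptr;
};

struct foo_u8 {
	size_t len;
	uint8_t *ptr;
};

/*
 * Returns 1 when both structs have the same size, the same offset for
 * "ptr" and the same pointer width on this platform, 0 otherwise.
 */
static int foo_layouts_match(void)
{
	return sizeof(struct foo_void) == sizeof(struct foo_u8) &&
	       offsetof(struct foo_void, ptr) == offsetof(struct foo_u8, ptr) &&
	       sizeof(void *) == sizeof(uint8_t *);
}
```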

-Peff


* Re: [PATCH 01/10] ivec: introduce the C side of ivec
  2026-01-19  5:59             ` Jeff King
@ 2026-01-19 20:21               ` Ezekiel Newren
  2026-01-19 20:40                 ` Jeff King
  2026-01-20 13:46               ` Phillip Wood
  1 sibling, 1 reply; 78+ messages in thread
From: Ezekiel Newren @ 2026-01-19 20:21 UTC (permalink / raw)
  To: Jeff King
  Cc: Phillip Wood, phillip.wood, Ezekiel Newren via GitGitGadget, git,
	René Scharfe

On Sun, Jan 18, 2026 at 10:59 PM Jeff King <peff@peff.net> wrote:
>
> On Sat, Jan 17, 2026 at 05:40:08PM +0000, Phillip Wood wrote:
>
> > On 17/01/2026 16:14, Ezekiel Newren wrote:
> > >
> > > If the size of different kinds of pointers ever differed from the size
> > > of void* then wouldn't that make all calls to malloc undefined?
> >
> > I believe there are (Harvard architecture?) platforms where function pointers
> > are a different width to data pointers, and that's why you cannot store a
> > function pointer in void*. I agree it would be weird for char* to have a
> > different width to int*, I suspect the restrictions on casting from one type
> > to another are about alignment.
>
> The standard does allow for different pointer sizes for char and int.
> The key thing is that a void pointer has to be able to represent any. So
> you can cast a smaller pointer to void and vice versa (and the latter
> would presumably throw away some of the bits, which is OK as long as the
> void was made from one of those smaller pointers originally).
>
> More discussion at:
>
>   https://c-faq.com/null/machexamp.html
>
> I don't know how malloc worked on those platforms, though. The caller
> knows that malloc returns a void pointer, so it could cast to the
> smaller format in the usual way at the call-site. But I don't know how
> you would tell malloc() in a standard way what type of pointer you
> wanted to get out of it. I suspect they may have had specialized
> allocation functions. Or maybe it was enough to just throw away the low
> bits if you only cared about a word-addressable pointer.
>
> At any rate, yeah, I agree with your original concern that the two
> structs are not compatible. The layouts could be totally different. And
> not just due to pointer size, but IIRC pointers to different types could
> have different alignment requirements. So:
>
>   struct foo_void {
>         size_t len;
>         void *ptr;
>   };
>
>   struct foo_u8 {
>         size_t len;
>         uint8_t *ptr;
>   };
>
> might need different padding to properly align the pointers. In the case
> under discussion the pointers are always at the start, though, so I
> think it wouldn't matter.
>
> -Peff

Ok..., is there a way to pad a field to the largest size needed so
that this also works on the Harvard architecture? If C isn't even
self-consistent then how are these structs going to be passed between C
and Rust (which is THE point of ivec)?

Or do we just tell the arcane Harvard architecture "too bad" Git won't
run on it anymore?


* Re: [PATCH 01/10] ivec: introduce the C side of ivec
  2026-01-19 20:21               ` Ezekiel Newren
@ 2026-01-19 20:40                 ` Jeff King
  2026-01-20  2:36                   ` D. Ben Knoble
  2026-01-21 21:00                   ` Ezekiel Newren
  0 siblings, 2 replies; 78+ messages in thread
From: Jeff King @ 2026-01-19 20:40 UTC (permalink / raw)
  To: Ezekiel Newren
  Cc: Phillip Wood, phillip.wood, Ezekiel Newren via GitGitGadget, git,
	René Scharfe

On Mon, Jan 19, 2026 at 01:21:04PM -0700, Ezekiel Newren wrote:

> Ok..., is there a way to pad a field to the largest size needed so
> that this also works on the Harvard architecture? If C isn't even
> self-consistent then how are these structs going to be passed between C
> and Rust (which is THE point of ivec)?

If you make a union of the pointers, it will require the largest size
and the strictest alignment requirement. So:

  struct foo {
	union {
		void *v;
		uint8_t *u8;
	} ptr;
	size_t len;
  };

would be a single struct you could use to store a void pointer _or_ a u8
pointer. The one thing you shouldn't do there, though, is assign via one
union member and read from the other. So I don't know if that helps you
or not (I confess I have not followed this rust discussion at all, and
know nothing about rust/c ABI compatibility, and just got roped in on C
esoterica).

> Or do we just tell the arcane Harvard architecture "too bad" Git won't
> run on it anymore?

Minor nit: the Harvard architecture is one where function pointers are
not the same as data pointers. An int/char distinction can happen even
on more common (von Neumann) machines.

But I think we can rephrase your question as: are there real-world
machines we care about that will have different pointer sizes, or can we
ignore this issue for practical purposes?

I don't know the answer. I suspect it probably is OK for Git not to run
on the machines mentioned in that C faq. But:

  1. Sometimes there are subtle implications of undefined behavior that
     may cause a compiler (even for a sensible machine) to do unexpected
     things. I don't know offhand if that is the case here.

  2. There are some modern platforms in which pointers are a bit more
     opaque than just numeric addresses. For example, we've had a few
     patches dealing with questionable pointer usage to make things work
     on CHERI Arm systems. I'm not sure if any of that would matter
     here, though (IIRC, it was mostly that pointers were unexpectedly
     large and had matching alignment requirements, but all of them
     equally so).

-Peff


* Re: [PATCH 01/10] ivec: introduce the C side of ivec
  2026-01-19 20:40                 ` Jeff King
@ 2026-01-20  2:36                   ` D. Ben Knoble
  2026-01-21 21:00                   ` Ezekiel Newren
  1 sibling, 0 replies; 78+ messages in thread
From: D. Ben Knoble @ 2026-01-20  2:36 UTC (permalink / raw)
  To: Jeff King
  Cc: Ezekiel Newren, Phillip Wood, phillip.wood,
	Ezekiel Newren via GitGitGadget, git, René Scharfe

On Mon, Jan 19, 2026 at 3:41 PM Jeff King <peff@peff.net> wrote:
>
>   2. There are some modern platforms in which pointers are a bit more
>      opaque than just numeric addresses. For example, we've had a few
>      patches dealing with questionable pointer usage to make things work
>      on CHERI Arm systems. I'm not sure if any of that would matter
>      here, though (IIRC, it was mostly that pointers were unexpectedly
>      large and had matching alignment requirements, but all of them
>      equally so).

Arguably on all modern platforms, pointers are more than just numeric
addresses, due to provenance ;)

- https://www.ralfj.de/blog/2018/07/24/pointers-and-bytes.html
- https://www.ralfj.de/blog/2020/12/14/provenance.html
- https://www.ralfj.de/blog/2022/04/11/provenance-exposed.html

-- 
D. Ben Knoble


* Re: [PATCH 01/10] ivec: introduce the C side of ivec
  2026-01-19  5:59             ` Jeff King
  2026-01-19 20:21               ` Ezekiel Newren
@ 2026-01-20 13:46               ` Phillip Wood
  1 sibling, 0 replies; 78+ messages in thread
From: Phillip Wood @ 2026-01-20 13:46 UTC (permalink / raw)
  To: Jeff King
  Cc: Ezekiel Newren, phillip.wood, Ezekiel Newren via GitGitGadget,
	git, René Scharfe

Hi Peff

On 19/01/2026 05:59, Jeff King wrote:
> On Sat, Jan 17, 2026 at 05:40:08PM +0000, Phillip Wood wrote:
> 
>> On 17/01/2026 16:14, Ezekiel Newren wrote:
>>>
>>> If the size of different kinds of pointers ever differed from the size
>>> of void* then wouldn't that make all calls to malloc undefined?
>>
>> I believe there are (Harvard architecture?) platforms where function pointers
>> are a different width to data pointers, and that's why you cannot store a
>> function pointer in void*. I agree it would be weird for char* to have a
>> different width to int*, I suspect the restrictions on casting from one type
>> to another are about alignment.
> 
> The standard does allow for different pointer sizes for char and int.
> The key thing is that a void pointer has to be able to represent any. So
> you can cast a smaller pointer to void and vice versa (and the latter
> would presumably throw away some of the bits, which is OK as long as the
> void was made from one of those smaller pointers originally).
> 
> More discussion at:
> 
>    https://c-faq.com/null/machexamp.html

Thanks for the clarification and the link - the C FAQ is always an 
interesting read.

Phillip

> I don't know how malloc worked on those platforms, though. The caller
> knows that malloc returns a void pointer, so it could cast to the
> smaller format in the usual way at the call-site. But I don't know how
> you would tell malloc() in a standard way what type of pointer you
> wanted to get out of it. I suspect they may have had specialized
> allocation functions. Or maybe it was enough to just throw away the low
> bits if you only cared about a word-addressable pointer.
> 
> At any rate, yeah, I agree with your original concern that the two
> structs are not compatible. The layouts could be totally different. And
> not just due to pointer size, but IIRC pointers to different types could
> have different alignment requirements. So:
> 
>    struct foo_void {
> 	size_t len;
> 	void *ptr;
>    };
> 
>    struct foo_u8 {
> 	size_t len;
> 	uint8_t *ptr;
>    };
> 
> might need different padding to properly align the pointers. In the case
> under discussion the pointers are always at the start, though, so I
> think it wouldn't matter.
> 
> -Peff



* Re: [PATCH 01/10] ivec: introduce the C side of ivec
  2026-01-15 15:55     ` Ezekiel Newren
  2026-01-16 10:39       ` Phillip Wood
@ 2026-01-20 14:06       ` Phillip Wood
  2026-01-21 21:39         ` Ezekiel Newren
  1 sibling, 1 reply; 78+ messages in thread
From: Phillip Wood @ 2026-01-20 14:06 UTC (permalink / raw)
  To: Ezekiel Newren, phillip.wood; +Cc: Ezekiel Newren via GitGitGadget, git

Hi Ezekiel

On 15/01/2026 15:55, Ezekiel Newren wrote:
> On Thu, Jan 8, 2026 at 7:34 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>>> +void ivec_reserve(void *self_, size_t additional)
>>> +{
>>> +     struct IVec_c_void *self = self_;
>>> +
>>> +     size_t growby = 128;
>>> +     if (self->capacity > growby)
>>> +             growby = self->capacity;
>>> +     if (additional > growby)
>>> +             growby = additional;
>>
>> This growth strategy differs from both ALLOC_GROW() and
>> XDL_ALLOC_GROW(), if there isn't a good reason for that we should
>> perhaps just use ALLOC_GROW() here.
> 
> XDL_ALLOC_GROW() can't be used because the pointer is always a void*
> in this function.

Oh right. I'm not sure that's a reason to use a different growth
strategy though. The minimum size of 128 elements is probably good for
the xdiff code that creates arrays with one element per line, but if this
is supposed to be for general use it is going to waste space when we're
allocating a lot of small arrays. ALLOC_GROW() uses alloc_nr() to
calculate the new size so perhaps we could use that here?
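
A minimal sketch of what that could look like.  The macro below copies
the formula that I believe alloc_nr() uses, under a different name so
the example stands alone; ivec_grown_capacity() is a hypothetical helper
and not part of the patch.

```c
#include <stddef.h>

/* Assumed to match the formula of Git's alloc_nr(): 1.5x plus a bit. */
#define alloc_nr_sketch(x) (((x) + 16) * 3 / 2)

/*
 * Hypothetical helper: the new capacity when at least `additional`
 * more elements are needed.  Like ALLOC_GROW(), take the grown size
 * unless the request is larger than that.
 */
static size_t ivec_grown_capacity(size_t capacity, size_t additional)
{
	size_t needed = capacity + additional;
	size_t grown = alloc_nr_sketch(capacity);

	return grown < needed ? needed : grown;
}
```

An empty vector would then start at 24 elements instead of 128, which
wastes less space for small arrays while keeping the growth exponential.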

>>> +void ivec_push(void *self_, const void *value)
>>> +{
>>> +     struct IVec_c_void *self = self_;
>>> +     void *dst = NULL;
>>> +
>>> +     if (self->length == self->capacity)
>>> +             ivec_reserve(self, 1);
>>> +
>>> +     dst = (uint8_t*)self->ptr + self->length * self->element_size;
>>> +     memcpy(dst, value, self->element_size);
>>
>> If self->element_size was a compile time constant the compiler could
>> easily optimize this call away. I'm not sure that is easy to achieve though.
> 
> The problem is that I didn't want all of ivec to be macros that looked
> like function calls. I wanted to minimize use of macros so that it was
> easier to port and verify that the Rust implementation matches the
> behavior of the C implementation.

I think that's a reasonable concern. So is the plan to have a parallel 
rust implementation of these functions rather than call the C 
implementation from rust?

>>> +void ivec_free(void *self_)
>>
>> Normally we'd call a function like this, which frees the allocations
>> and re-initializes the members, ivec_clear()
> 
> In Rust Vec.clear() means to set length to zero, but leaves the
> allocation alone. The reason why I'm zeroing the struct is to help
> avoid FFI issues. If not zero then what should the members be set to,
> to indicate that using the struct is not valid anymore? In Rust an
> object is freed when it goes out of scope and _cannot_ be accessed
> afterward.

I'm aware that Vec::clear() has different semantics (it does what 
strbuf_reset() does). That's unfortunate but this function has different 
semantics to all the other *_free() functions in git. Our coding 
guidelines say

  - There are several common idiomatic names for functions performing
    specific tasks on a structure `S`:

     - `S_init()` initializes a structure without allocating the
       structure itself.

     - `S_release()` releases a structure's contents without freeing the
       structure.

     - `S_clear()` is equivalent to `S_release()` followed by `S_init()`
       such that the structure is directly usable after clearing it. When
       `S_clear()` is provided, `S_init()` shall not allocate resources
       that need to be released again.

     - `S_free()` releases a structure's contents and frees the
       structure.

As we write more rust code and so wrap more of our existing structs 
we're going to be wrapping C code that uses the definitions above so I 
think we should do the same with struct IVec_*.
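
A minimal sketch of how those names could map onto ivec.  The struct
layout follows the members mentioned in this thread (ptr, length,
capacity, element_size); the function bodies are simplified assumptions
written for this example, not the code from the patch.

```c
#include <stddef.h>
#include <stdlib.h>

struct IVec_c_void {
	void *ptr;
	size_t length;
	size_t capacity;
	size_t element_size;
};

/* S_init(): set up the struct without allocating anything. */
static void ivec_init(struct IVec_c_void *vec, size_t element_size)
{
	vec->ptr = NULL;
	vec->length = 0;
	vec->capacity = 0;
	vec->element_size = element_size;
}

/* S_release(): free the contents; the struct itself is not freed. */
static void ivec_release(struct IVec_c_void *vec)
{
	free(vec->ptr);
	vec->ptr = NULL;
}

/* S_clear(): release followed by init, so the vector is usable again. */
static void ivec_clear(struct IVec_c_void *vec)
{
	size_t element_size = vec->element_size;

	ivec_release(vec);
	ivec_init(vec, element_size);
}
```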

>>> diff --git a/compat/ivec.h b/compat/ivec.h
>>> new file mode 100644
>>> index 0000000000..654a05c506
>>> --- /dev/null
>>> +++ b/compat/ivec.h
>>> @@ -0,0 +1,52 @@
>>> +#ifndef IVEC_H
>>> +#define IVEC_H
>>> +
>>> +#include <git-compat-util.h>
>>
>> It would be nice to have some documentation in this header, see the
>> examples in strvec.h and hashmap.h
>>
>>> +#define IVEC_INIT(variable) ivec_init(&(variable), sizeof(*(variable).ptr))
>>
>> This is a bit cumbersome to use compared to our usual *_INIT macros. I'm
>> struggling to see how we can make it nicer though as DEFINE_IVEC_TYPE
>> cannot define a per-type initializer macro and we cannot initialize
>> the element size without knowing the type.
> 
> I don't see what's cumbersome about it. Maybe an example use case
> would clarify things.

It is cumbersome because it separates the initialization from the 
declaration. Normally our *_INIT macros are initializer lists so we can 
write

	struct strbuf buf = STRBUF_INIT;

which keeps the declaration and initialization together. Although
they're on adjacent lines in your example, in real code the
initialization is likely to be separated from the declaration by other
variable declarations.
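
One possible shape for an initializer-list variant, as a rough sketch
only.  IVEC_INIT_FOR() is a made-up name and the struct is a stand-in
for what DEFINE_IVEC_TYPE(int, int) might generate.  The cost is that
the element type has to be spelled out a second time and nothing checks
that it matches the struct.

```c
#include <stddef.h>

/* Stand-in for a struct generated by DEFINE_IVEC_TYPE(int, int) (assumed). */
struct IVec_int {
	int *ptr;
	size_t length;
	size_t capacity;
	size_t element_size;
};

/*
 * Hypothetical initializer list: members that are not named are
 * zero-initialized, so only element_size needs to be spelled out.
 */
#define IVEC_INIT_FOR(type) { .element_size = sizeof(type) }
```

With that, "struct IVec_int v = IVEC_INIT_FOR(int);" keeps the
declaration and the initialization on one line.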

> ```
> DEFINE_IVEC_TYPE(xrecord_t, xrecord);
> 
> void some_function() {
>      struct IVec_xrecord rec;
>      IVEC_INIT(rec);  // i.e. ivec_init(&rec, sizeof(*rec.ptr));

Thanks

Phillip


* Re: [PATCH 02/10] xdiff: make classic diff explicit by creating xdl_do_classic_diff()
  2026-01-02 18:52 ` [PATCH 02/10] xdiff: make classic diff explicit by creating xdl_do_classic_diff() Ezekiel Newren via GitGitGadget
@ 2026-01-20 15:01   ` Phillip Wood
  2026-01-21 21:05     ` Ezekiel Newren
  0 siblings, 1 reply; 78+ messages in thread
From: Phillip Wood @ 2026-01-20 15:01 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget, git; +Cc: Ezekiel Newren

On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
> 
> Later patches will prepare xdl_cleanup_records() to be moved into xdiffi.c
> since only the classic diff uses that function.

I assume that's to make it easier to convert the Myers implementation to
Rust without affecting the rest of the code? If so it would be nice to
say that.

> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>

> +int xdl_do_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
> +		xdfenv_t *xe) {
> +	int res;
> +
> +	if (xdl_prepare_env(mf1, mf2, xpp, xe) < 0)
> +		return -1;
> +
> +	if (XDF_DIFF_ALG(xpp->flags) == XDF_PATIENCE_DIFF) {
> +		res = xdl_do_patience_diff(xpp, xe);
> +		goto out;
> +	}
> +
> +	if (XDF_DIFF_ALG(xpp->flags) == XDF_HISTOGRAM_DIFF) {
> +		res = xdl_do_histogram_diff(xpp, xe);
> +		goto out;
> +	}
> +
> +	res = xdl_do_classic_diff(xe, xpp->flags);

It might be clearer that we're calling only one of the three functions
if we wrote this as

	if (XDF_DIFF_ALG(xpp->flags) == XDF_PATIENCE_DIFF)
		res = xdl_do_patience_diff(xpp, xe);
	else if (XDF_DIFF_ALG(xpp->flags) == XDF_HISTOGRAM_DIFF)
		res = xdl_do_histogram_diff(xpp, xe);
	else
		res = xdl_do_classic_diff(xe, xpp->flags);

and then we can drop the out: label

Thanks

Phillip

>    out:
>   	if (res < 0)
>   		xdl_free_env(xe);
> diff --git a/xdiff/xdiffi.h b/xdiff/xdiffi.h
> index 49e52c67f9..8bf4c20373 100644
> --- a/xdiff/xdiffi.h
> +++ b/xdiff/xdiffi.h
> @@ -42,6 +42,7 @@ typedef struct s_xdchange {
>   int xdl_recs_cmp(xdfile_t *xdf1, long off1, long lim1,
>   		 xdfile_t *xdf2, long off2, long lim2,
>   		 long *kvdf, long *kvdb, int need_min, xdalgoenv_t *xenv);
> +int xdl_do_classic_diff(xdfenv_t *xe, uint64_t flags);
>   int xdl_do_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
>   		xdfenv_t *xe);
>   int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags);



* Re: [PATCH 03/10] xdiff: don't waste time guessing the number of lines
  2026-01-02 18:52 ` [PATCH 03/10] xdiff: don't waste time guessing the number of lines Ezekiel Newren via GitGitGadget
@ 2026-01-20 15:02   ` Phillip Wood
  2026-01-21 21:12     ` Ezekiel Newren
  0 siblings, 1 reply; 78+ messages in thread
From: Phillip Wood @ 2026-01-20 15:02 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget, git; +Cc: Ezekiel Newren

On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
> 
> All lines must be read anyway, so classify them after they're read in.
> Also move the memset() into xdl_init_classifier().

So instead of looping over the input lines one and a bit times (the bit 
being from xdl_guess_lines) we now loop over them twice as we split them 
first and then classify them in a separate loop. It does save some work 
not to call xdl_guess_lines but it is unclear if that offsets 
classifying them in a separate loop.

> +	for (size_t i = 0; i < xe->xdf1.nrec; i++) {
> +		xrecord_t *rec = &xe->xdf1.recs[i];
> +		xdl_classify_record(1, &cf, rec);

We seem to have lost the error handling if xdl_classify_record() fails.

Thanks

Phillip

> +	}
> +
> +	for (size_t i = 0; i < xe->xdf2.nrec; i++) {
> +		xrecord_t *rec = &xe->xdf2.recs[i];
> +		xdl_classify_record(2, &cf, rec);
>   	}
>   
>   	if ((XDF_DIFF_ALG(xpp->flags) != XDF_PATIENCE_DIFF) &&
> diff --git a/xdiff/xutils.c b/xdiff/xutils.c
> index 77ee1ad9c8..b3d51197c1 100644
> --- a/xdiff/xutils.c
> +++ b/xdiff/xutils.c
> @@ -118,26 +118,6 @@ void *xdl_cha_alloc(chastore_t *cha) {
>   	return data;
>   }
>   
> -long xdl_guess_lines(mmfile_t *mf, long sample) {
> -	long nl = 0, size, tsize = 0;
> -	char const *data, *cur, *top;
> -
> -	if ((cur = data = xdl_mmfile_first(mf, &size))) {
> -		for (top = data + size; nl < sample && cur < top; ) {
> -			nl++;
> -			if (!(cur = memchr(cur, '\n', top - cur)))
> -				cur = top;
> -			else
> -				cur++;
> -		}
> -		tsize += (long) (cur - data);
> -	}
> -
> -	if (nl && tsize)
> -		nl = xdl_mmfile_size(mf) / (tsize / nl);
> -
> -	return nl + 1;
> -}
>   
>   int xdl_blankline(const char *line, long size, long flags)
>   {
> diff --git a/xdiff/xutils.h b/xdiff/xutils.h
> index 615b4a9d35..d800840dd0 100644
> --- a/xdiff/xutils.h
> +++ b/xdiff/xutils.h
> @@ -31,7 +31,6 @@ int xdl_emit_diffrec(char const *rec, long size, char const *pre, long psize,
>   int xdl_cha_init(chastore_t *cha, long isize, long icount);
>   void xdl_cha_free(chastore_t *cha);
>   void *xdl_cha_alloc(chastore_t *cha);
> -long xdl_guess_lines(mmfile_t *mf, long sample);
>   int xdl_blankline(const char *line, long size, long flags);
>   int xdl_recmatch(const char *l1, long s1, const char *l2, long s2, long flags);
>   uint64_t xdl_hash_record_verbatim(uint8_t const **data, uint8_t const *top);



* Re: [PATCH 04/10] xdiff: let patience and histogram benefit from xdl_trim_ends()
  2026-01-02 18:52 ` [PATCH 04/10] xdiff: let patience and histogram benefit from xdl_trim_ends() Ezekiel Newren via GitGitGadget
@ 2026-01-20 15:02   ` Phillip Wood
  2026-01-21 14:49     ` Phillip Wood
  0 siblings, 1 reply; 78+ messages in thread
From: Phillip Wood @ 2026-01-20 15:02 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget, git; +Cc: Ezekiel Newren

On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
> 
> The patience diff is set up the exact same way as histogram, see
> xdl_do_histogram_diff() in xhistogram.c. xdl_optimize_ctxs() is
> redundant now, delete it.

Does this change the output? The patience diff looks for unique context 
lines and builds the context out from those. For files that look like

Old	New
A	A
B	B
C	A
B	B
A	C
	B
	A

That will give a hunk

@@ -1,3 +1,5 @@
+A
+B
  A
  B
  C

but trimming the common prefix first would give

@@ -1,5 +1,7 @@
  A
  B
+A
+B
  C
  B
  A

Though it seems like the diff slider causes us to output the same diff
in both cases for that simple test so maybe it is not an issue. It would 
certainly be helpful to comment on any possible changes in the commit 
message as it could have been a deliberate choice not to trim the ends 
for those algorithms.
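
The effect of trimming on the example above can be checked by hand. A
minimal sketch, assuming one character per line and a hypothetical
trim_ends() helper (not the xdiff function of similar name):

```c
#include <stddef.h>

/* Result of trimming: lines matched at the start and at the end. */
struct trim {
	size_t prefix;
	size_t suffix;
};

/*
 * Count the common prefix and then the common suffix of two line
 * sequences, never letting the two overlap. Each "line" is a single
 * character to keep the sketch short.
 */
struct trim trim_ends(const char *a, size_t na, const char *b, size_t nb)
{
	struct trim t = { 0, 0 };
	size_t lim = na < nb ? na : nb;

	while (t.prefix < lim && a[t.prefix] == b[t.prefix])
		t.prefix++;

	/* Only the lines not already claimed by the prefix may match. */
	lim -= t.prefix;
	while (t.suffix < lim &&
	       a[na - 1 - t.suffix] == b[nb - 1 - t.suffix])
		t.suffix++;

	return t;
}
```

For "ABCBA" against "ABABCBA" this reports a prefix of 2 and a suffix of
3, which consumes all of the old file and leaves only the A and B at
lines 3-4 of the new file, i.e. the second hunk. The unique line C is
then never seen by the patience algorithm.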

> -static int xdl_optimize_ctxs(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xdf2) {
> -
> -	if (xdl_trim_ends(xdf1, xdf2) < 0 ||
> -	    xdl_cleanup_records(cf, xdf1, xdf2) < 0) {
> -
> -		return -1;
> -	}
> -
> -	return 0;
> -}
> -
>   int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
>   		    xdfenv_t *xe) {
>   	xdlclassifier_t cf;
> @@ -404,9 +393,10 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
>   		xdl_classify_record(2, &cf, rec);
>   	}
>   
> +	xdl_trim_ends(&xe->xdf1, &xe->xdf2);

It would be clear that this was safe if you changed the function 
signature to return void as the way it is called in xdl_optimize_ctxs() 
makes it look like it can return an error.

Thanks

Phillip

>   	if ((XDF_DIFF_ALG(xpp->flags) != XDF_PATIENCE_DIFF) &&
>   	    (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF) &&
> -	    xdl_optimize_ctxs(&cf, &xe->xdf1, &xe->xdf2) < 0) {
> +	    xdl_cleanup_records(&cf, &xe->xdf1, &xe->xdf2) < 0) {
>   
>   		xdl_free_ctx(&xe->xdf2);
>   		xdl_free_ctx(&xe->xdf1);



* Re: [PATCH 05/10] xdiff: use xdfenv_t in xdl_trim_ends() and xdl_cleanup_records()
  2026-01-02 18:52 ` [PATCH 05/10] xdiff: use xdfenv_t in xdl_trim_ends() and xdl_cleanup_records() Ezekiel Newren via GitGitGadget
@ 2026-01-20 16:32   ` Phillip Wood
  0 siblings, 0 replies; 78+ messages in thread
From: Phillip Wood @ 2026-01-20 16:32 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget, git; +Cc: Ezekiel Newren

On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
> 
> View with --color-words. Prepare these functions to use the fields:
> delta_start, delta_end. A future patch will add these fields to
> xdfenv_t.

I'm afraid this message doesn't make much sense to me. What are these
new fields? I think it would help to explain what this upcoming change
is going to do and why.

Oh, having read patch 7 we're removing dstart and dend from xdfile_t and 
replacing them with delta_start and delta_end in xdfenv_t. It would be 
useful to say that here.

> -static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xdf2) {
> +static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {

Maybe we could add xdf1 and xdf2 as local variables to avoid having to 
change the code that accesses the members of xdfile_t that are not going 
to be moved to xdfenv_t.
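
A minimal sketch of that idea, using hypothetical cut-down structs
(file_t and env_t stand in for xdfile_t and xdfenv_t, and reset_nreff()
is an illustrative function, not part of the patch):

```c
#include <stddef.h>

/* Hypothetical, cut-down versions of the xdiff structs. */
typedef struct { size_t nrec; size_t nreff; } file_t;
typedef struct { file_t xdf1, xdf2; } env_t;

/*
 * Takes the environment, but aliases the two files once at the top so
 * the body can keep the original xdf1->member spelling.
 */
void reset_nreff(env_t *xe)
{
	file_t *xdf1 = &xe->xdf1;
	file_t *xdf2 = &xe->xdf2;

	xdf1->nreff = 0;
	xdf2->nreff = 0;
}
```

With the aliases in place, only the lines that touch the fields moving
to the environment would need to change in the diff.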

Thanks

Phillip

>   	long i, nm, mlim;
>   	xrecord_t *recs;
>   	xdlclass_t *rcrec;
> @@ -273,11 +273,11 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
>   	 * Create temporary arrays that will help us decide if
>   	 * changed[i] should remain false, or become true.
>   	 */
> -	if (!XDL_CALLOC_ARRAY(action1, xdf1->nrec + 1)) {
> +	if (!XDL_CALLOC_ARRAY(action1, xe->xdf1.nrec + 1)) {
>   		ret = -1;
>   		goto cleanup;
>   	}
> -	if (!XDL_CALLOC_ARRAY(action2, xdf2->nrec + 1)) {
> +	if (!XDL_CALLOC_ARRAY(action2, xe->xdf2.nrec + 1)) {
>   		ret = -1;
>   		goto cleanup;
>   	}
> @@ -285,17 +285,17 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
>   	/*
>   	 * Initialize temporary arrays with DISCARD, KEEP, or INVESTIGATE.
>   	 */
> -	if ((mlim = xdl_bogosqrt((long)xdf1->nrec)) > XDL_MAX_EQLIMIT)
> +	if ((mlim = xdl_bogosqrt((long)xe->xdf1.nrec)) > XDL_MAX_EQLIMIT)
>   		mlim = XDL_MAX_EQLIMIT;
> -	for (i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart]; i <= xdf1->dend; i++, recs++) {
> +	for (i = xe->xdf1.dstart, recs = &xe->xdf1.recs[xe->xdf1.dstart]; i <= xe->xdf1.dend; i++, recs++) {
>   		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
>   		nm = rcrec ? rcrec->len2 : 0;
>   		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
>   	}
>   
> -	if ((mlim = xdl_bogosqrt((long)xdf2->nrec)) > XDL_MAX_EQLIMIT)
> +	if ((mlim = xdl_bogosqrt((long)xe->xdf2.nrec)) > XDL_MAX_EQLIMIT)
>   		mlim = XDL_MAX_EQLIMIT;
> -	for (i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart]; i <= xdf2->dend; i++, recs++) {
> +	for (i = xe->xdf2.dstart, recs = &xe->xdf2.recs[xe->xdf2.dstart]; i <= xe->xdf2.dend; i++, recs++) {
>   		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
>   		nm = rcrec ? rcrec->len1 : 0;
>   		action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
> @@ -305,27 +305,27 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
>   	 * Use temporary arrays to decide if changed[i] should remain
>   	 * false, or become true.
>   	 */
> -	xdf1->nreff = 0;
> -	for (i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart];
> -	     i <= xdf1->dend; i++, recs++) {
> +	xe->xdf1.nreff = 0;
> +	for (i = xe->xdf1.dstart, recs = &xe->xdf1.recs[xe->xdf1.dstart];
> +	     i <= xe->xdf1.dend; i++, recs++) {
>   		if (action1[i] == KEEP ||
> -		    (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xdf1->dstart, xdf1->dend))) {
> -			xdf1->reference_index[xdf1->nreff++] = i;
> +		    (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xe->xdf1.dstart, xe->xdf1.dend))) {
> +			xe->xdf1.reference_index[xe->xdf1.nreff++] = i;
>   			/* changed[i] remains false, i.e. keep */
>   		} else
> -			xdf1->changed[i] = true;
> +			xe->xdf1.changed[i] = true;
>   			/* i.e. discard */
>   	}
>   
> -	xdf2->nreff = 0;
> -	for (i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart];
> -	     i <= xdf2->dend; i++, recs++) {
> +	xe->xdf2.nreff = 0;
> +	for (i = xe->xdf2.dstart, recs = &xe->xdf2.recs[xe->xdf2.dstart];
> +	     i <= xe->xdf2.dend; i++, recs++) {
>   		if (action2[i] == KEEP ||
> -		    (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xdf2->dstart, xdf2->dend))) {
> -			xdf2->reference_index[xdf2->nreff++] = i;
> +		    (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xe->xdf2.dstart, xe->xdf2.dend))) {
> +			xe->xdf2.reference_index[xe->xdf2.nreff++] = i;
>   			/* changed[i] remains false, i.e. keep */
>   		} else
> -			xdf2->changed[i] = true;
> +			xe->xdf2.changed[i] = true;
>   			/* i.e. discard */
>   	}
>   
> @@ -340,27 +340,27 @@ cleanup:
>   /*
>    * Early trim initial and terminal matching records.
>    */
> -static int xdl_trim_ends(xdfile_t *xdf1, xdfile_t *xdf2) {
> +static int xdl_trim_ends(xdfenv_t *xe) {
>   	long i, lim;
>   	xrecord_t *recs1, *recs2;
>   
> -	recs1 = xdf1->recs;
> -	recs2 = xdf2->recs;
> -	for (i = 0, lim = (long)XDL_MIN(xdf1->nrec, xdf2->nrec); i < lim;
> +	recs1 = xe->xdf1.recs;
> +	recs2 = xe->xdf2.recs;
> +	for (i = 0, lim = (long)XDL_MIN(xe->xdf1.nrec, xe->xdf2.nrec); i < lim;
>   	     i++, recs1++, recs2++)
>   		if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
>   			break;
>   
> -	xdf1->dstart = xdf2->dstart = i;
> +	xe->xdf1.dstart = xe->xdf2.dstart = i;
>   
> -	recs1 = xdf1->recs + xdf1->nrec - 1;
> -	recs2 = xdf2->recs + xdf2->nrec - 1;
> +	recs1 = xe->xdf1.recs + xe->xdf1.nrec - 1;
> +	recs2 = xe->xdf2.recs + xe->xdf2.nrec - 1;
>   	for (lim -= i, i = 0; i < lim; i++, recs1--, recs2--)
>   		if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
>   			break;
>   
> -	xdf1->dend = (long)xdf1->nrec - i - 1;
> -	xdf2->dend = (long)xdf2->nrec - i - 1;
> +	xe->xdf1.dend = (long)xe->xdf1.nrec - i - 1;
> +	xe->xdf2.dend = (long)xe->xdf2.nrec - i - 1;
>   
>   	return 0;
>   }
> @@ -393,10 +393,10 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
>   		xdl_classify_record(2, &cf, rec);
>   	}
>   
> -	xdl_trim_ends(&xe->xdf1, &xe->xdf2);
> +	xdl_trim_ends(xe);
>   	if ((XDF_DIFF_ALG(xpp->flags) != XDF_PATIENCE_DIFF) &&
>   	    (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF) &&
> -	    xdl_cleanup_records(&cf, &xe->xdf1, &xe->xdf2) < 0) {
> +	    xdl_cleanup_records(&cf, xe) < 0) {
>   
>   		xdl_free_ctx(&xe->xdf2);
>   		xdl_free_ctx(&xe->xdf1);



* Re: [PATCH 06/10] xdiff: cleanup xdl_trim_ends()
  2026-01-02 18:52 ` [PATCH 06/10] xdiff: cleanup xdl_trim_ends() Ezekiel Newren via GitGitGadget
@ 2026-01-20 16:32   ` Phillip Wood
  0 siblings, 0 replies; 78+ messages in thread
From: Phillip Wood @ 2026-01-20 16:32 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget, git; +Cc: Ezekiel Newren



On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
> 
> This patch is best viewed with a before and after of the whole
> function.
> 
> Rather than using 2 pointers and walking them, use direct indexing with
> local variables of what is being compared to make it easier to follow
> along.

I think using direct indexing makes things clearer, but I'm not sure 
this is a faithful conversion (see below).

> diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
> index 0acb3437d4..06b6a6f804 100644
> --- a/xdiff/xprepare.c
> +++ b/xdiff/xprepare.c
> @@ -340,29 +340,29 @@ cleanup:
>   /*
>    * Early trim initial and terminal matching records.
>    */
> -static int xdl_trim_ends(xdfenv_t *xe) {
> -	long i, lim;
> -	xrecord_t *recs1, *recs2;
> -
> -	recs1 = xe->xdf1.recs;
> -	recs2 = xe->xdf2.recs;
> -	for (i = 0, lim = (long)XDL_MIN(xe->xdf1.nrec, xe->xdf2.nrec); i < lim;
> -	     i++, recs1++, recs2++)
> -		if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
> +static void xdl_trim_ends(xdfenv_t *xe)
> +{
> +	size_t lim = XDL_MIN(xe->xdf1.nrec, xe->xdf2.nrec);
> +
> +	for (size_t i = 0; i < lim; i++) {
> +		size_t mph1 = xe->xdf1.recs[i].minimal_perfect_hash;
> +		size_t mph2 = xe->xdf2.recs[i].minimal_perfect_hash;
> +		if (mph1 != mph2) {
> +			xe->xdf1.dstart = xe->xdf2.dstart = (ssize_t)i;

The type of dstart is ptrdiff_t, not ssize_t.

The original set dstart and dend unconditionally, but here each is only
assigned when its loop finds a mismatch, so they are not set if all the
lines match.
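
A minimal sketch of an index-based version that keeps the unconditional
assignments, assuming a hypothetical cut-down tfile_t in which a plain
array of hashes stands in for recs[i].minimal_perfect_hash:

```c
#include <stddef.h>

/* Hypothetical, cut-down file: mph[] stands in for the record hashes. */
typedef struct {
	const size_t *mph;
	size_t nrec;
	ptrdiff_t dstart, dend;
} tfile_t;

void trim_ends_indexed(tfile_t *f1, tfile_t *f2)
{
	size_t lim = f1->nrec < f2->nrec ? f1->nrec : f2->nrec;
	size_t head, tail;

	for (head = 0; head < lim; head++)
		if (f1->mph[head] != f2->mph[head])
			break;

	/* Assigned after the loop, so it is set even when nothing differs. */
	f1->dstart = f2->dstart = (ptrdiff_t)head;

	/* The suffix scan must not overlap the prefix already matched. */
	lim -= head;
	for (tail = 0; tail < lim; tail++)
		if (f1->mph[f1->nrec - 1 - tail] != f2->mph[f2->nrec - 1 - tail])
			break;

	f1->dend = (ptrdiff_t)f1->nrec - 1 - (ptrdiff_t)tail;
	f2->dend = (ptrdiff_t)f2->nrec - 1 - (ptrdiff_t)tail;
}
```

For two identical three-line files this yields dstart == 3 and
dend == 2, the same empty range the pointer-walking original produced.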

Thanks

Phillip

> +			lim -= i;
>   			break;
> +		}
> +	}
>   
> -	xe->xdf1.dstart = xe->xdf2.dstart = i;
> -
> -	recs1 = xe->xdf1.recs + xe->xdf1.nrec - 1;
> -	recs2 = xe->xdf2.recs + xe->xdf2.nrec - 1;
> -	for (lim -= i, i = 0; i < lim; i++, recs1--, recs2--)
> -		if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
> +	for (size_t i = 0; i < lim; i++) {
> +		size_t mph1 = xe->xdf1.recs[xe->xdf1.nrec - 1 - i].minimal_perfect_hash;
> +		size_t mph2 = xe->xdf2.recs[xe->xdf2.nrec - 1 - i].minimal_perfect_hash;
> +		if (mph1 != mph2) {
> +			xe->xdf1.dend = xe->xdf1.nrec - 1 - i;
> +			xe->xdf2.dend = xe->xdf2.nrec - 1 - i;
>   			break;
> -
> -	xe->xdf1.dend = (long)xe->xdf1.nrec - i - 1;
> -	xe->xdf2.dend = (long)xe->xdf2.nrec - i - 1;
> -
> -	return 0;
> +		}
> +	}
>   }
>   
>   



* Re: [PATCH 07/10] xdiff: replace xdfile_t.dstart with xdfenv_t.delta_start
  2026-01-02 18:52 ` [PATCH 07/10] xdiff: replace xdfile_t.dstart with xdfenv_t.delta_start Ezekiel Newren via GitGitGadget
@ 2026-01-20 16:32   ` Phillip Wood
  2026-01-28 10:51     ` Phillip Wood
  0 siblings, 1 reply; 78+ messages in thread
From: Phillip Wood @ 2026-01-20 16:32 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget, git; +Cc: Ezekiel Newren

On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
> 
> Placing delta_start in xdfenv_t instead of xdfile_t provides a more
> appropriate context since this variable only makes sense with a pair
> of files. View with --color-words.

So as dstart and dend must be the same for both files we now store the 
values once in xdfenv_t. That explains why we start passing xdfenv_t 
around rather than xdfile_t in patch 5.
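
For the end of the range, what the two files share is the number of
records trimmed from the tail rather than an absolute index. A minimal
sketch, assuming delta_end is that count (patch 9 in this series
computes dend1 as nrec - 1 - delta_end); last_index() is a hypothetical
helper:

```c
#include <stddef.h>

/* Index of the last record still in play for a file of nrec records. */
ptrdiff_t last_index(size_t nrec, size_t delta_end)
{
	return (ptrdiff_t)nrec - 1 - (ptrdiff_t)delta_end;
}
```

With three records trimmed from the end, a 5-record file ends at index 1
and a 7-record file at index 3, so a single stored count serves both.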

Thanks

Phillip

> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
> ---
>   xdiff/xhistogram.c |  4 ++--
>   xdiff/xpatience.c  |  4 ++--
>   xdiff/xprepare.c   | 17 +++++++++--------
>   xdiff/xtypes.h     |  3 ++-
>   4 files changed, 15 insertions(+), 13 deletions(-)
> 
> diff --git a/xdiff/xhistogram.c b/xdiff/xhistogram.c
> index 5ae1282c27..eb6a52d9ba 100644
> --- a/xdiff/xhistogram.c
> +++ b/xdiff/xhistogram.c
> @@ -365,6 +365,6 @@ out:
>   int xdl_do_histogram_diff(xpparam_t const *xpp, xdfenv_t *env)
>   {
>   	return histogram_diff(xpp, env,
> -		env->xdf1.dstart + 1, env->xdf1.dend - env->xdf1.dstart + 1,
> -		env->xdf2.dstart + 1, env->xdf2.dend - env->xdf2.dstart + 1);
> +		env->delta_start + 1, env->xdf1.dend - env->delta_start + 1,
> +		env->delta_start + 1, env->xdf2.dend - env->delta_start + 1);
>   }
> diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
> index 2bce07cf48..bd0ffbb417 100644
> --- a/xdiff/xpatience.c
> +++ b/xdiff/xpatience.c
> @@ -374,6 +374,6 @@ static int patience_diff(xpparam_t const *xpp, xdfenv_t *env,
>   int xdl_do_patience_diff(xpparam_t const *xpp, xdfenv_t *env)
>   {
>   	return patience_diff(xpp, env,
> -		env->xdf1.dstart + 1, env->xdf1.dend - env->xdf1.dstart + 1,
> -		env->xdf2.dstart + 1, env->xdf2.dend - env->xdf2.dstart + 1);
> +		env->delta_start + 1, env->xdf1.dend - env->delta_start + 1,
> +		env->delta_start + 1, env->xdf2.dend - env->delta_start + 1);
>   }
> diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
> index 06b6a6f804..e88468e74c 100644
> --- a/xdiff/xprepare.c
> +++ b/xdiff/xprepare.c
> @@ -173,7 +173,6 @@ static int xdl_prepare_ctx(mmfile_t *mf, xdfile_t *xdf, uint64_t flags) {
>   
>   	xdf->changed += 1;
>   	xdf->nreff = 0;
> -	xdf->dstart = 0;
>   	xdf->dend = xdf->nrec - 1;
>   
>   	return 0;
> @@ -287,7 +286,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
>   	 */
>   	if ((mlim = xdl_bogosqrt((long)xe->xdf1.nrec)) > XDL_MAX_EQLIMIT)
>   		mlim = XDL_MAX_EQLIMIT;
> -	for (i = xe->xdf1.dstart, recs = &xe->xdf1.recs[xe->xdf1.dstart]; i <= xe->xdf1.dend; i++, recs++) {
> +	for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start]; i <= xe->xdf1.dend; i++, recs++) {
>   		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
>   		nm = rcrec ? rcrec->len2 : 0;
>   		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
> @@ -295,7 +294,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
>   
>   	if ((mlim = xdl_bogosqrt((long)xe->xdf2.nrec)) > XDL_MAX_EQLIMIT)
>   		mlim = XDL_MAX_EQLIMIT;
> -	for (i = xe->xdf2.dstart, recs = &xe->xdf2.recs[xe->xdf2.dstart]; i <= xe->xdf2.dend; i++, recs++) {
> +	for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start]; i <= xe->xdf2.dend; i++, recs++) {
>   		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
>   		nm = rcrec ? rcrec->len1 : 0;
>   		action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
> @@ -306,10 +305,10 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
>   	 * false, or become true.
>   	 */
>   	xe->xdf1.nreff = 0;
> -	for (i = xe->xdf1.dstart, recs = &xe->xdf1.recs[xe->xdf1.dstart];
> +	for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start];
>   	     i <= xe->xdf1.dend; i++, recs++) {
>   		if (action1[i] == KEEP ||
> -		    (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xe->xdf1.dstart, xe->xdf1.dend))) {
> +		    (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xe->delta_start, xe->xdf1.dend))) {
>   			xe->xdf1.reference_index[xe->xdf1.nreff++] = i;
>   			/* changed[i] remains false, i.e. keep */
>   		} else
> @@ -318,10 +317,10 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
>   	}
>   
>   	xe->xdf2.nreff = 0;
> -	for (i = xe->xdf2.dstart, recs = &xe->xdf2.recs[xe->xdf2.dstart];
> +	for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start];
>   	     i <= xe->xdf2.dend; i++, recs++) {
>   		if (action2[i] == KEEP ||
> -		    (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xe->xdf2.dstart, xe->xdf2.dend))) {
> +		    (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xe->delta_start, xe->xdf2.dend))) {
>   			xe->xdf2.reference_index[xe->xdf2.nreff++] = i;
>   			/* changed[i] remains false, i.e. keep */
>   		} else
> @@ -348,7 +347,7 @@ static void xdl_trim_ends(xdfenv_t *xe)
>   		size_t mph1 = xe->xdf1.recs[i].minimal_perfect_hash;
>   		size_t mph2 = xe->xdf2.recs[i].minimal_perfect_hash;
>   		if (mph1 != mph2) {
> -			xe->xdf1.dstart = xe->xdf2.dstart = (ssize_t)i;
> +			xe->delta_start = (ssize_t)i;
>   			lim -= i;
>   			break;
>   		}
> @@ -370,6 +369,8 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
>   		    xdfenv_t *xe) {
>   	xdlclassifier_t cf;
>   
> +	xe->delta_start = 0;
> +
>   	if (xdl_prepare_ctx(mf1, &xe->xdf1, xpp->flags) < 0) {
>   
>   		return -1;
> diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
> index 979586f20a..bda1f85eb0 100644
> --- a/xdiff/xtypes.h
> +++ b/xdiff/xtypes.h
> @@ -48,7 +48,7 @@ typedef struct s_xrecord {
>   typedef struct s_xdfile {
>   	xrecord_t *recs;
>   	size_t nrec;
> -	ptrdiff_t dstart, dend;
> +	ptrdiff_t dend;
>   	bool *changed;
>   	size_t *reference_index;
>   	size_t nreff;
> @@ -56,6 +56,7 @@ typedef struct s_xdfile {
>   
>   typedef struct s_xdfenv {
>   	xdfile_t xdf1, xdf2;
> +	size_t delta_start;
>   } xdfenv_t;
>   
>   



* Re: [PATCH 04/10] xdiff: let patience and histogram benefit from xdl_trim_ends()
  2026-01-20 15:02   ` Phillip Wood
@ 2026-01-21 14:49     ` Phillip Wood
  0 siblings, 0 replies; 78+ messages in thread
From: Phillip Wood @ 2026-01-21 14:49 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget, git; +Cc: Ezekiel Newren

On 20/01/2026 15:02, Phillip Wood wrote:
> On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
>> From: Ezekiel Newren <ezekielnewren@gmail.com>
>>
>> The patience diff is set up the exact same way as histogram, see
>> xdl_do_histogram_diff() in xhistogram.c. xdl_optimize_ctxs() is
>> redundant now, delete it.
> 
> Does this change the output? The patience diff looks for unique context 
> lines and builds the context out from those. For files that look like
> 
> Old    New
> A    A
> B    B
> C    A
> B    B
> A    C
>      B
>      A
> 
> That will give a hunk
> 
> @@ -1,3 +1,5 @@
> +A
> +B
>   A
>   B
>   C
> 
> but trimming the common prefix first would give
> 
> @@ -1,5 +1,7 @@
>   A
>   B
> +A
> +B
>   C
>   B
>   A
> 
> Though it seems like the diff slider causes us to output the same diff
> in both cases for that simple test so maybe it is not an issue.

It does change larger diffs. If you run

git show --diff-algorithm=patience --diff-merges=first-parent f406b89552

you get a different diff with this series applied.

Thanks

Phillip

> It would 
> certainly be helpful to comment on any possible changes in the commit 
> message as it could have been a deliberate choice not to trim the ends 
> for those algorithms.
> 
>> -static int xdl_optimize_ctxs(xdlclassifier_t *cf, xdfile_t *xdf1, 
>> xdfile_t *xdf2) {
>> -
>> -    if (xdl_trim_ends(xdf1, xdf2) < 0 ||
>> -        xdl_cleanup_records(cf, xdf1, xdf2) < 0) {
>> -
>> -        return -1;
>> -    }
>> -
>> -    return 0;
>> -}
>> -
>>   int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
>>               xdfenv_t *xe) {
>>       xdlclassifier_t cf;
>> @@ -404,9 +393,10 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, 
>> xpparam_t const *xpp,
>>           xdl_classify_record(2, &cf, rec);
>>       }
>> +    xdl_trim_ends(&xe->xdf1, &xe->xdf2);
> 
> It would be clear that this was safe if you changed the function 
> signature to return void as the way it is called in xdl_optimize_ctxs() 
> makes it look like it can return an error.
> 
> Thanks
> 
> Phillip
> 
>>       if ((XDF_DIFF_ALG(xpp->flags) != XDF_PATIENCE_DIFF) &&
>>           (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF) &&
>> -        xdl_optimize_ctxs(&cf, &xe->xdf1, &xe->xdf2) < 0) {
>> +        xdl_cleanup_records(&cf, &xe->xdf1, &xe->xdf2) < 0) {
>>           xdl_free_ctx(&xe->xdf2);
>>           xdl_free_ctx(&xe->xdf1);
> 
> 



* Re: [PATCH 09/10] xdiff: remove dependence on xdlclassifier from xdl_cleanup_records()
  2026-01-02 18:52 ` [PATCH 09/10] xdiff: remove dependence on xdlclassifier from xdl_cleanup_records() Ezekiel Newren via GitGitGadget
  2026-01-16 20:19   ` René Scharfe
@ 2026-01-21 15:01   ` Phillip Wood
  1 sibling, 0 replies; 78+ messages in thread
From: Phillip Wood @ 2026-01-21 15:01 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget, git; +Cc: Ezekiel Newren

Hi Ezekiel

On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
> 
> Disentangle xdl_cleanup_records() from the classifier so that it can be
> moved from xprepare.c into xdiffi.c.
> 
> The classic diff is the only algorithm that needs to count the number
> of times each line occurs in each file. Make xdl_cleanup_records()
> count the number of lines instead of the classifier so it won't slow
> down patience or histogram.

Have you measured the speedup that this gives? It looks like it saves
very little work for the patience or histogram algorithms and means we
now make a second pass over the data in the Myers case. If there is a
reason to do this related to the Rust conversion then that might be a
more convincing argument. As René has already said, this isn't a
particularly interesting demonstration of struct IVec; it would be nice
to see more of the API exercised.
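
For reference, the extra pass amounts to the following. This is a
minimal sketch that uses plain calloc() instead of the ivec API, with a
hypothetical count_occurrences() helper and arrays of hashes standing in
for the records:

```c
#include <stddef.h>
#include <stdlib.h>

struct occurrence {
	size_t file1, file2;
};

/*
 * Returns a zero-initialized table of nclass entries filled with
 * per-file counts, or NULL on allocation failure. Every hash must be
 * smaller than nclass. The caller frees the table.
 */
struct occurrence *count_occurrences(const size_t *h1, size_t n1,
				     const size_t *h2, size_t n2,
				     size_t nclass)
{
	struct occurrence *occ = calloc(nclass ? nclass : 1, sizeof(*occ));

	if (!occ)
		return NULL;
	for (size_t i = 0; i < n1; i++)
		occ[h1[i]].file1++;
	for (size_t i = 0; i < n2; i++)
		occ[h2[i]].file2++;
	return occ;
}
```

Each record is visited once more here, after it has already been hashed
and classified, which is the second pass in question.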

Thanks

Phillip

> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
> ---
>   xdiff/xprepare.c | 52 +++++++++++++++++++++++++++++++++---------------
>   xdiff/xtypes.h   |  1 +
>   2 files changed, 37 insertions(+), 16 deletions(-)
> 
> diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
> index d3cdb6ac02..b53a3b80c4 100644
> --- a/xdiff/xprepare.c
> +++ b/xdiff/xprepare.c
> @@ -21,6 +21,7 @@
>    */
>   
>   #include "xinclude.h"
> +#include "compat/ivec.h"
>   
>   
>   #define XDL_KPDIS_RUN 4
> @@ -35,7 +36,6 @@ typedef struct s_xdlclass {
>   	struct s_xdlclass *next;
>   	xrecord_t rec;
>   	long idx;
> -	long len1, len2;
>   } xdlclass_t;
>   
>   typedef struct s_xdlclassifier {
> @@ -92,7 +92,7 @@ static void xdl_free_classifier(xdlclassifier_t *cf) {
>   }
>   
>   
> -static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t *rec) {
> +static int xdl_classify_record(xdlclassifier_t *cf, xrecord_t *rec) {
>   	size_t hi;
>   	xdlclass_t *rcrec;
>   
> @@ -113,13 +113,10 @@ static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
>   				return -1;
>   		cf->rcrecs[rcrec->idx] = rcrec;
>   		rcrec->rec = *rec;
> -		rcrec->len1 = rcrec->len2 = 0;
>   		rcrec->next = cf->rchash[hi];
>   		cf->rchash[hi] = rcrec;
>   	}
>   
> -	(pass == 1) ? rcrec->len1++ : rcrec->len2++;
> -
>   	rec->minimal_perfect_hash = (size_t)rcrec->idx;
>   
>   	return 0;
> @@ -253,22 +250,44 @@ static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
>   	return rpdis1 * XDL_KPDIS_RUN < (rpdis1 + rdis1);
>   }
>   
> +struct xoccurrence
> +{
> +	size_t file1, file2;
> +};
> +
> +
> +DEFINE_IVEC_TYPE(struct xoccurrence, xoccurrence);
> +
>   
>   /*
>    * Try to reduce the problem complexity, discard records that have no
>    * matches on the other file. Also, lines that have multiple matches
>    * might be potentially discarded if they appear in a run of discardable.
>    */
> -static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
> -	long i, nm, mlim;
> +static int xdl_cleanup_records(xdfenv_t *xe, uint64_t flags) {
> +	long i;
> +	size_t nm, mlim;
>   	xrecord_t *recs;
> -	xdlclass_t *rcrec;
>   	uint8_t *action1 = NULL, *action2 = NULL;
> -	bool need_min = !!(cf->flags & XDF_NEED_MINIMAL);
> +	struct IVec_xoccurrence occ;
> +	bool need_min = !!(flags & XDF_NEED_MINIMAL);
>   	int ret = 0;
>   	ptrdiff_t dend1 = xe->xdf1.nrec - 1 - xe->delta_end;
>   	ptrdiff_t dend2 = xe->xdf2.nrec - 1 - xe->delta_end;
>   
> +	IVEC_INIT(occ);
> +	ivec_zero(&occ, xe->mph_size);
> +
> +	for (size_t j = 0; j < xe->xdf1.nrec; j++) {
> +		size_t mph1 = xe->xdf1.recs[j].minimal_perfect_hash;
> +		occ.ptr[mph1].file1 += 1;
> +	}
> +
> +	for (size_t j = 0; j < xe->xdf2.nrec; j++) {
> +		size_t mph2 = xe->xdf2.recs[j].minimal_perfect_hash;
> +		occ.ptr[mph2].file2 += 1;
> +	}
> +
>   	/*
>   	 * Create temporary arrays that will help us decide if
>   	 * changed[i] should remain false, or become true.
> @@ -288,16 +307,14 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
>   	if ((mlim = xdl_bogosqrt((long)xe->xdf1.nrec)) > XDL_MAX_EQLIMIT)
>   		mlim = XDL_MAX_EQLIMIT;
>   	for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start]; i <= dend1; i++, recs++) {
> -		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
> -		nm = rcrec ? rcrec->len2 : 0;
> +		nm = occ.ptr[recs->minimal_perfect_hash].file2;
>   		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
>   	}
>   
>   	if ((mlim = xdl_bogosqrt((long)xe->xdf2.nrec)) > XDL_MAX_EQLIMIT)
>   		mlim = XDL_MAX_EQLIMIT;
>   	for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start]; i <= dend2; i++, recs++) {
> -		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
> -		nm = rcrec ? rcrec->len1 : 0;
> +		nm = occ.ptr[recs->minimal_perfect_hash].file1;
>   		action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
>   	}
>   
> @@ -332,6 +349,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
>   cleanup:
>   	xdl_free(action1);
>   	xdl_free(action2);
> +	ivec_free(&occ);
>   
>   	return ret;
>   }
> @@ -387,18 +405,20 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
>   
>   	for (size_t i = 0; i < xe->xdf1.nrec; i++) {
>   		xrecord_t *rec = &xe->xdf1.recs[i];
> -		xdl_classify_record(1, &cf, rec);
> +		xdl_classify_record(&cf, rec);
>   	}
>   
>   	for (size_t i = 0; i < xe->xdf2.nrec; i++) {
>   		xrecord_t *rec = &xe->xdf2.recs[i];
> -		xdl_classify_record(2, &cf, rec);
> +		xdl_classify_record(&cf, rec);
>   	}
>   
> +	xe->mph_size = cf.count;
> +
>   	xdl_trim_ends(xe);
>   	if ((XDF_DIFF_ALG(xpp->flags) != XDF_PATIENCE_DIFF) &&
>   	    (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF) &&
> -	    xdl_cleanup_records(&cf, xe) < 0) {
> +	    xdl_cleanup_records(xe, xpp->flags) < 0) {
>   
>   		xdl_free_ctx(&xe->xdf2);
>   		xdl_free_ctx(&xe->xdf1);
> diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
> index a939396064..2528bd37e8 100644
> --- a/xdiff/xtypes.h
> +++ b/xdiff/xtypes.h
> @@ -56,6 +56,7 @@ typedef struct s_xdfile {
>   typedef struct s_xdfenv {
>   	xdfile_t xdf1, xdf2;
>   	size_t delta_start, delta_end;
> +	size_t mph_size;
>   } xdfenv_t;
>   
>   



* Re: [PATCH 10/10] xdiff: move xdl_cleanup_records() from xprepare.c to xdiffi.c
  2026-01-02 18:52 ` [PATCH 10/10] xdiff: move xdl_cleanup_records() from xprepare.c to xdiffi.c Ezekiel Newren via GitGitGadget
@ 2026-01-21 15:01   ` Phillip Wood
  2026-01-28 10:56     ` Phillip Wood
  0 siblings, 1 reply; 78+ messages in thread
From: Phillip Wood @ 2026-01-21 15:01 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget, git; +Cc: Ezekiel Newren

Hi Ezekiel

On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
> 
> Only the classic diff uses xdl_cleanup_records(). Move it,
> xdl_clean_mmatch(), and the macros to xdiffi.c and call
> xdl_cleanup_records() inside of xdl_do_classic_diff(). This better
> organizes the code related to the classic diff.

I think calling xdl_cleanup_records() from inside xdl_do_classic_diff() 
makes sense. I don't have a strong opinion either way on the code 
movement. You should remove '#include "compat/ivec.h"' from xprepare.c 
if you're moving the only code that uses it out of that file.

Thanks

Phillip

> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
> ---
>   xdiff/xdiffi.c   | 180 ++++++++++++++++++++++++++++++++++++++++++++
>   xdiff/xprepare.c | 191 +----------------------------------------------
>   2 files changed, 181 insertions(+), 190 deletions(-)
> 
> diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
> index e3196c7245..0f1fd7cf80 100644
> --- a/xdiff/xdiffi.c
> +++ b/xdiff/xdiffi.c
> @@ -21,6 +21,7 @@
>    */
>   
>   #include "xinclude.h"
> +#include "compat/ivec.h"
>   
>   static size_t get_hash(xdfile_t *xdf, long index)
>   {
> @@ -33,6 +34,14 @@ static size_t get_hash(xdfile_t *xdf, long index)
>   #define XDL_SNAKE_CNT 20
>   #define XDL_K_HEUR 4
>   
> +#define XDL_KPDIS_RUN 4
> +#define XDL_MAX_EQLIMIT 1024
> +#define XDL_SIMSCAN_WINDOW 100
> +
> +#define DISCARD 0
> +#define KEEP 1
> +#define INVESTIGATE 2
> +
>   typedef struct s_xdpsplit {
>   	long i1, i2;
>   	int min_lo, min_hi;
> @@ -311,6 +320,175 @@ int xdl_recs_cmp(xdfile_t *xdf1, long off1, long lim1,
>   }
>   
>   
> +static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
> +	long r, rdis0, rpdis0, rdis1, rpdis1;
> +
> +	/*
> +	 * Limits the window that is examined during the similar-lines
> +	 * scan. The loops below stops when action[i - r] == KEEP
> +	 * (line that has no match), but there are corner cases where
> +	 * the loop proceed all the way to the extremities by causing
> +	 * huge performance penalties in case of big files.
> +	 */
> +	if (i - s > XDL_SIMSCAN_WINDOW)
> +		s = i - XDL_SIMSCAN_WINDOW;
> +	if (e - i > XDL_SIMSCAN_WINDOW)
> +		e = i + XDL_SIMSCAN_WINDOW;
> +
> +	/*
> +	 * Scans the lines before 'i' to find a run of lines that either
> +	 * have no match (action[j] == DISCARD) or have multiple matches
> +	 * (action[j] == INVESTIGATE). Note that we always call this
> +	 * function with action[i] == INVESTIGATE, so the current line
> +	 * (i) is already a multimatch line.
> +	 */
> +	for (r = 1, rdis0 = 0, rpdis0 = 1; (i - r) >= s; r++) {
> +		if (action[i - r] == DISCARD)
> +			rdis0++;
> +		else if (action[i - r] == INVESTIGATE)
> +			rpdis0++;
> +		else if (action[i - r] == KEEP)
> +			break;
> +		else
> +			BUG("Illegal value for action[i - r]");
> +	}
> +	/*
> +	 * If the run before the line 'i' found only multimatch lines,
> +	 * we return false and hence we don't make the current line (i)
> +	 * discarded. We want to discard multimatch lines only when
> +	 * they appear in the middle of runs with nomatch lines
> +	 * (action[j] == DISCARD).
> +	 */
> +	if (rdis0 == 0)
> +		return 0;
> +	for (r = 1, rdis1 = 0, rpdis1 = 1; (i + r) <= e; r++) {
> +		if (action[i + r] == DISCARD)
> +			rdis1++;
> +		else if (action[i + r] == INVESTIGATE)
> +			rpdis1++;
> +		else if (action[i + r] == KEEP)
> +			break;
> +		else
> +			BUG("Illegal value for action[i + r]");
> +	}
> +	/*
> +	 * If the run after the line 'i' found only multimatch lines,
> +	 * we return false and hence we don't make the current line (i)
> +	 * discarded.
> +	 */
> +	if (rdis1 == 0)
> +		return false;
> +	rdis1 += rdis0;
> +	rpdis1 += rpdis0;
> +
> +	return rpdis1 * XDL_KPDIS_RUN < (rpdis1 + rdis1);
> +}
> +
> +struct xoccurrence
> +{
> +	size_t file1, file2;
> +};
> +
> +
> +DEFINE_IVEC_TYPE(struct xoccurrence, xoccurrence);
> +
> +
> +/*
> + * Try to reduce the problem complexity, discard records that have no
> + * matches on the other file. Also, lines that have multiple matches
> + * might be potentially discarded if they appear in a run of discardable.
> + */
> +static int xdl_cleanup_records(xdfenv_t *xe, uint64_t flags) {
> +	long i;
> +	size_t nm, mlim;
> +	xrecord_t *recs;
> +	uint8_t *action1 = NULL, *action2 = NULL;
> +	struct IVec_xoccurrence occ;
> +	bool need_min = !!(flags & XDF_NEED_MINIMAL);
> +	int ret = 0;
> +	ptrdiff_t dend1 = xe->xdf1.nrec - 1 - xe->delta_end;
> +	ptrdiff_t dend2 = xe->xdf2.nrec - 1 - xe->delta_end;
> +
> +	IVEC_INIT(occ);
> +	ivec_zero(&occ, xe->mph_size);
> +
> +	for (size_t j = 0; j < xe->xdf1.nrec; j++) {
> +		size_t mph1 = xe->xdf1.recs[j].minimal_perfect_hash;
> +		occ.ptr[mph1].file1 += 1;
> +	}
> +
> +	for (size_t j = 0; j < xe->xdf2.nrec; j++) {
> +		size_t mph2 = xe->xdf2.recs[j].minimal_perfect_hash;
> +		occ.ptr[mph2].file2 += 1;
> +	}
> +
> +	/*
> +	 * Create temporary arrays that will help us decide if
> +	 * changed[i] should remain false, or become true.
> +	 */
> +	if (!XDL_CALLOC_ARRAY(action1, xe->xdf1.nrec + 1)) {
> +		ret = -1;
> +		goto cleanup;
> +	}
> +	if (!XDL_CALLOC_ARRAY(action2, xe->xdf2.nrec + 1)) {
> +		ret = -1;
> +		goto cleanup;
> +	}
> +
> +	/*
> +	 * Initialize temporary arrays with DISCARD, KEEP, or INVESTIGATE.
> +	 */
> +	if ((mlim = xdl_bogosqrt((long)xe->xdf1.nrec)) > XDL_MAX_EQLIMIT)
> +		mlim = XDL_MAX_EQLIMIT;
> +	for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start]; i <= dend1; i++, recs++) {
> +		nm = occ.ptr[recs->minimal_perfect_hash].file2;
> +		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
> +	}
> +
> +	if ((mlim = xdl_bogosqrt((long)xe->xdf2.nrec)) > XDL_MAX_EQLIMIT)
> +		mlim = XDL_MAX_EQLIMIT;
> +	for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start]; i <= dend2; i++, recs++) {
> +		nm = occ.ptr[recs->minimal_perfect_hash].file1;
> +		action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
> +	}
> +
> +	/*
> +	 * Use temporary arrays to decide if changed[i] should remain
> +	 * false, or become true.
> +	 */
> +	xe->xdf1.nreff = 0;
> +	for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start];
> +	     i <= dend1; i++, recs++) {
> +		if (action1[i] == KEEP ||
> +		    (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xe->delta_start, dend1))) {
> +			xe->xdf1.reference_index[xe->xdf1.nreff++] = i;
> +			/* changed[i] remains false, i.e. keep */
> +		} else
> +			xe->xdf1.changed[i] = true;
> +			/* i.e. discard */
> +	}
> +
> +	xe->xdf2.nreff = 0;
> +	for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start];
> +	     i <= dend2; i++, recs++) {
> +		if (action2[i] == KEEP ||
> +		    (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xe->delta_start, dend2))) {
> +			xe->xdf2.reference_index[xe->xdf2.nreff++] = i;
> +			/* changed[i] remains false, i.e. keep */
> +		} else
> +			xe->xdf2.changed[i] = true;
> +			/* i.e. discard */
> +	}
> +
> +cleanup:
> +	xdl_free(action1);
> +	xdl_free(action2);
> +	ivec_free(&occ);
> +
> +	return ret;
> +}
> +
> +
>   int xdl_do_classic_diff(xdfenv_t *xe, uint64_t flags)
>   {
>   	long ndiags;
> @@ -318,6 +496,8 @@ int xdl_do_classic_diff(xdfenv_t *xe, uint64_t flags)
>   	xdalgoenv_t xenv;
>   	int res;
>   
> +	xdl_cleanup_records(xe, flags);
> +
>   	/*
>   	 * Allocate and setup K vectors to be used by the differential
>   	 * algorithm.
> diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
> index b53a3b80c4..3f555e29f4 100644
> --- a/xdiff/xprepare.c
> +++ b/xdiff/xprepare.c
> @@ -24,14 +24,6 @@
>   #include "compat/ivec.h"
>   
>   
> -#define XDL_KPDIS_RUN 4
> -#define XDL_MAX_EQLIMIT 1024
> -#define XDL_SIMSCAN_WINDOW 100
> -
> -#define DISCARD 0
> -#define KEEP 1
> -#define INVESTIGATE 2
> -
>   typedef struct s_xdlclass {
>   	struct s_xdlclass *next;
>   	xrecord_t rec;
> @@ -50,8 +42,6 @@ typedef struct s_xdlclassifier {
>   } xdlclassifier_t;
>   
>   
> -
> -
>   static int xdl_init_classifier(xdlclassifier_t *cf, long size, long flags) {
>   	memset(cf, 0, sizeof(xdlclassifier_t));
>   
> @@ -186,175 +176,6 @@ void xdl_free_env(xdfenv_t *xe) {
>   }
>   
>   
> -static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
> -	long r, rdis0, rpdis0, rdis1, rpdis1;
> -
> -	/*
> -	 * Limits the window that is examined during the similar-lines
> -	 * scan. The loops below stops when action[i - r] == KEEP
> -	 * (line that has no match), but there are corner cases where
> -	 * the loop proceed all the way to the extremities by causing
> -	 * huge performance penalties in case of big files.
> -	 */
> -	if (i - s > XDL_SIMSCAN_WINDOW)
> -		s = i - XDL_SIMSCAN_WINDOW;
> -	if (e - i > XDL_SIMSCAN_WINDOW)
> -		e = i + XDL_SIMSCAN_WINDOW;
> -
> -	/*
> -	 * Scans the lines before 'i' to find a run of lines that either
> -	 * have no match (action[j] == DISCARD) or have multiple matches
> -	 * (action[j] == INVESTIGATE). Note that we always call this
> -	 * function with action[i] == INVESTIGATE, so the current line
> -	 * (i) is already a multimatch line.
> -	 */
> -	for (r = 1, rdis0 = 0, rpdis0 = 1; (i - r) >= s; r++) {
> -		if (action[i - r] == DISCARD)
> -			rdis0++;
> -		else if (action[i - r] == INVESTIGATE)
> -			rpdis0++;
> -		else if (action[i - r] == KEEP)
> -			break;
> -		else
> -			BUG("Illegal value for action[i - r]");
> -	}
> -	/*
> -	 * If the run before the line 'i' found only multimatch lines,
> -	 * we return false and hence we don't make the current line (i)
> -	 * discarded. We want to discard multimatch lines only when
> -	 * they appear in the middle of runs with nomatch lines
> -	 * (action[j] == DISCARD).
> -	 */
> -	if (rdis0 == 0)
> -		return 0;
> -	for (r = 1, rdis1 = 0, rpdis1 = 1; (i + r) <= e; r++) {
> -		if (action[i + r] == DISCARD)
> -			rdis1++;
> -		else if (action[i + r] == INVESTIGATE)
> -			rpdis1++;
> -		else if (action[i + r] == KEEP)
> -			break;
> -		else
> -			BUG("Illegal value for action[i + r]");
> -	}
> -	/*
> -	 * If the run after the line 'i' found only multimatch lines,
> -	 * we return false and hence we don't make the current line (i)
> -	 * discarded.
> -	 */
> -	if (rdis1 == 0)
> -		return false;
> -	rdis1 += rdis0;
> -	rpdis1 += rpdis0;
> -
> -	return rpdis1 * XDL_KPDIS_RUN < (rpdis1 + rdis1);
> -}
> -
> -struct xoccurrence
> -{
> -	size_t file1, file2;
> -};
> -
> -
> -DEFINE_IVEC_TYPE(struct xoccurrence, xoccurrence);
> -
> -
> -/*
> - * Try to reduce the problem complexity, discard records that have no
> - * matches on the other file. Also, lines that have multiple matches
> - * might be potentially discarded if they appear in a run of discardable.
> - */
> -static int xdl_cleanup_records(xdfenv_t *xe, uint64_t flags) {
> -	long i;
> -	size_t nm, mlim;
> -	xrecord_t *recs;
> -	uint8_t *action1 = NULL, *action2 = NULL;
> -	struct IVec_xoccurrence occ;
> -	bool need_min = !!(flags & XDF_NEED_MINIMAL);
> -	int ret = 0;
> -	ptrdiff_t dend1 = xe->xdf1.nrec - 1 - xe->delta_end;
> -	ptrdiff_t dend2 = xe->xdf2.nrec - 1 - xe->delta_end;
> -
> -	IVEC_INIT(occ);
> -	ivec_zero(&occ, xe->mph_size);
> -
> -	for (size_t j = 0; j < xe->xdf1.nrec; j++) {
> -		size_t mph1 = xe->xdf1.recs[j].minimal_perfect_hash;
> -		occ.ptr[mph1].file1 += 1;
> -	}
> -
> -	for (size_t j = 0; j < xe->xdf2.nrec; j++) {
> -		size_t mph2 = xe->xdf2.recs[j].minimal_perfect_hash;
> -		occ.ptr[mph2].file2 += 1;
> -	}
> -
> -	/*
> -	 * Create temporary arrays that will help us decide if
> -	 * changed[i] should remain false, or become true.
> -	 */
> -	if (!XDL_CALLOC_ARRAY(action1, xe->xdf1.nrec + 1)) {
> -		ret = -1;
> -		goto cleanup;
> -	}
> -	if (!XDL_CALLOC_ARRAY(action2, xe->xdf2.nrec + 1)) {
> -		ret = -1;
> -		goto cleanup;
> -	}
> -
> -	/*
> -	 * Initialize temporary arrays with DISCARD, KEEP, or INVESTIGATE.
> -	 */
> -	if ((mlim = xdl_bogosqrt((long)xe->xdf1.nrec)) > XDL_MAX_EQLIMIT)
> -		mlim = XDL_MAX_EQLIMIT;
> -	for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start]; i <= dend1; i++, recs++) {
> -		nm = occ.ptr[recs->minimal_perfect_hash].file2;
> -		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
> -	}
> -
> -	if ((mlim = xdl_bogosqrt((long)xe->xdf2.nrec)) > XDL_MAX_EQLIMIT)
> -		mlim = XDL_MAX_EQLIMIT;
> -	for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start]; i <= dend2; i++, recs++) {
> -		nm = occ.ptr[recs->minimal_perfect_hash].file1;
> -		action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
> -	}
> -
> -	/*
> -	 * Use temporary arrays to decide if changed[i] should remain
> -	 * false, or become true.
> -	 */
> -	xe->xdf1.nreff = 0;
> -	for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start];
> -	     i <= dend1; i++, recs++) {
> -		if (action1[i] == KEEP ||
> -		    (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xe->delta_start, dend1))) {
> -			xe->xdf1.reference_index[xe->xdf1.nreff++] = i;
> -			/* changed[i] remains false, i.e. keep */
> -		} else
> -			xe->xdf1.changed[i] = true;
> -			/* i.e. discard */
> -	}
> -
> -	xe->xdf2.nreff = 0;
> -	for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start];
> -	     i <= dend2; i++, recs++) {
> -		if (action2[i] == KEEP ||
> -		    (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xe->delta_start, dend2))) {
> -			xe->xdf2.reference_index[xe->xdf2.nreff++] = i;
> -			/* changed[i] remains false, i.e. keep */
> -		} else
> -			xe->xdf2.changed[i] = true;
> -			/* i.e. discard */
> -	}
> -
> -cleanup:
> -	xdl_free(action1);
> -	xdl_free(action2);
> -	ivec_free(&occ);
> -
> -	return ret;
> -}
> -
> -
>   /*
>    * Early trim initial and terminal matching records.
>    */
> @@ -414,19 +235,9 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
>   	}
>   
>   	xe->mph_size = cf.count;
> +	xdl_free_classifier(&cf);
>   
>   	xdl_trim_ends(xe);
> -	if ((XDF_DIFF_ALG(xpp->flags) != XDF_PATIENCE_DIFF) &&
> -	    (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF) &&
> -	    xdl_cleanup_records(xe, xpp->flags) < 0) {
> -
> -		xdl_free_ctx(&xe->xdf2);
> -		xdl_free_ctx(&xe->xdf1);
> -		xdl_free_classifier(&cf);
> -		return -1;
> -	}
> -
> -	xdl_free_classifier(&cf);
>   
>   	return 0;
>   }


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 01/10] ivec: introduce the C side of ivec
  2026-01-19 20:40                 ` Jeff King
  2026-01-20  2:36                   ` D. Ben Knoble
@ 2026-01-21 21:00                   ` Ezekiel Newren
  2026-01-21 21:20                     ` Jeff King
  1 sibling, 1 reply; 78+ messages in thread
From: Ezekiel Newren @ 2026-01-21 21:00 UTC (permalink / raw)
  To: Jeff King
  Cc: Phillip Wood, phillip.wood, Ezekiel Newren via GitGitGadget, git,
	René Scharfe

On Mon, Jan 19, 2026 at 1:40 PM Jeff King <peff@peff.net> wrote:
>
> On Mon, Jan 19, 2026 at 01:21:04PM -0700, Ezekiel Newren wrote:
>
> > Ok..., is there a way to pad a field to the largest size needed so
> > that this also works on the harvard architecture? If C isn't even self
> > consistent then how are these structs going to be passed between C and
> > Rust (which is THE point of ivec)?
>
> If you make a union of the pointers, it will require the largest size
> and the strictest alignment requirement. So:
>
>   struct foo {
>         union {
>                 void *v;
>                 uint8_t *u8;
>         } ptr;
>         size_t len;
>   };
>
> would be a single struct you could use to store a void pointer _or_ a u8
> pointer. The one thing you shouldn't do there, though, is assign via one
> union member and read from the other. So I don't know if that helps you
> or not (I confess I have not followed this rust discussion at all, and
> know nothing about rust/c ABI compatibility, and just got roped in on C
> esoterica).
>
> > Or do we just tell the arcane Harvard architecture "too bad" Git won't
> > run on it anymore?
>
> Minor nit: the Harvard architecture is one where function pointers are
> not the same as data pointers. An int/char distinction can happen even
> on more common (von Neumann) machines.
>
> But I think we can rephrase your question as: are there real-world
> machines we care about that will have different pointer sizes, or can we
> ignore this issue for practical purposes?
>
> I don't know the answer. I suspect it probably is OK for Git not to run
> on the machines mentioned in that C faq. But:
>
>   1. Sometimes there are subtle implications of undefined behavior that
>      may cause a compiler (even for a sensible machine) to do unexpected
>      things. I don't know offhand if that is the case here.
>
>   2. There are some modern platforms in which pointers are a bit more
>      opaque than just numeric addresses. For example, we've had a few
>      patches dealing with questionable pointer usage to make things work
>      on CHERI Arm systems. I'm not sure if any of that would matter
>      here, though (IIRC, it was mostly that pointers were unexpectedly
>      large and had matching alignment requirements, but all of them
>      equally so).
>
> -Peff

What about adding clar unit tests to make sure that different ivec
types have the same size and layout? e.g.

  sizeof(IVec_c_void) == sizeof(IVec_u8);
  sizeof(IVec_c_void) == sizeof(IVec_u16);
  sizeof(IVec_c_void) == sizeof(IVec_u32);
  sizeof(IVec_c_void) == sizeof(IVec_u64);
  ...

As well as other tests for ivec.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 02/10] xdiff: make classic diff explicit by creating xdl_do_classic_diff()
  2026-01-20 15:01   ` Phillip Wood
@ 2026-01-21 21:05     ` Ezekiel Newren
  0 siblings, 0 replies; 78+ messages in thread
From: Ezekiel Newren @ 2026-01-21 21:05 UTC (permalink / raw)
  To: phillip.wood; +Cc: Ezekiel Newren via GitGitGadget, git

On Tue, Jan 20, 2026 at 8:01 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>
> On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
> > From: Ezekiel Newren <ezekielnewren@gmail.com>
> >
> > Later patches will prepare xdl_cleanup_records() to be moved into xdiffi.c
> > since only the classic diff uses that function.
>
> I assume that's to make it easier to convert the myers implementation to
> rust without affecting the rest of the code? If so it would be nice to
> say that.

Making it easier to port to Rust is a side effect. The primary goal is
to simplify the job of xprepare to only parsing and hashing lines in a
file. xdl_cleanup_records() is only used by classic diff
(myers/minimal) which means it doesn't belong in xprepare because it's
part of a diff algorithm and isn't relevant to preparing the file for
a diff algorithm. Perhaps xdl_trim_ends() should be moved into
xdl_do_diff() too...

> > Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
>
> > +int xdl_do_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
> > +             xdfenv_t *xe) {
> > +     int res;
> > +
> > +     if (xdl_prepare_env(mf1, mf2, xpp, xe) < 0)
> > +             return -1;
> > +
> > +     if (XDF_DIFF_ALG(xpp->flags) == XDF_PATIENCE_DIFF) {
> > +             res = xdl_do_patience_diff(xpp, xe);
> > +             goto out;
> > +     }
> > +
> > +     if (XDF_DIFF_ALG(xpp->flags) == XDF_HISTOGRAM_DIFF) {
> > +             res = xdl_do_histogram_diff(xpp, xe);
> > +             goto out;
> > +     }
> > +
> > +     res = xdl_do_classic_diff(xe, xpp->flags);
>
> This might be clearer that we're calling only one of the three functions
> if we wrote this as
>
>         if (XDF_DIFF_ALG(xpp->flags) == XDF_PATIENCE_DIFF)
>                 res = xdl_do_patience_diff(xpp, xe);
>         else if (XDF_DIFF_ALG(xpp->flags) == XDF_HISTOGRAM_DIFF)
>                 res = xdl_do_histogram_diff(xpp, xe);
>         else
>                 res = xdl_do_classic_diff(xe, xpp->flags);
>
> and then we can drop the out: label

In a later cleanup, I make this exact change :)

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 03/10] xdiff: don't waste time guessing the number of lines
  2026-01-20 15:02   ` Phillip Wood
@ 2026-01-21 21:12     ` Ezekiel Newren
  2026-01-22 10:16       ` Phillip Wood
  0 siblings, 1 reply; 78+ messages in thread
From: Ezekiel Newren @ 2026-01-21 21:12 UTC (permalink / raw)
  To: phillip.wood; +Cc: Ezekiel Newren via GitGitGadget, git

On Tue, Jan 20, 2026 at 8:02 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>
> On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
> > From: Ezekiel Newren <ezekielnewren@gmail.com>
> >
> > All lines must be read anyway, so classify them after they're read in.
> > Also move the memset() into xdl_init_classifier().
>
> So instead of looping over the input lines one and a bit times (the bit
> being from xdl_guess_lines) we now loop over them twice as we split them
> first and then classify them in a separate loop. It does save some work
> not to call xdl_guess_lines but it is unclear if that offsets
> classifying them in a separate loop.
>
> > +     for (size_t i = 0; i < xe->xdf1.nrec; i++) {
> > +             xrecord_t *rec = &xe->xdf1.recs[i];
> > +             xdl_classify_record(1, &cf, rec);
>
> We seem to have lost the error handling if xdl_classify_record() fails.

The error handling was not "lost"; it was deliberately removed. The
only way in which xdl_classify_record() could fail is by a failed
memory allocation. On the Rust side this would result in a panic
(panic means something different in Rust vs C), in which case C could
not possibly recover. Also, for operations like Vec.push() Rust
assumes that memory management functions will never fail; if they
do, they crash the program with no chance of recovery (unless you
account for panic unwinding, which is really ugly). It seems a lot of
the arguments about ivec and my xdiff cleanups are "We don't do things
this way in Git/C". I'm aware of many of these arguments, and I'm trying
to address them with a more specific answer: "Yes, but that's not
how things are done in Rust. All of this is to prepare the code for
conversion to Rust, and some things shouldn't, or even cannot, be done
the C way in Rust."

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 01/10] ivec: introduce the C side of ivec
  2026-01-21 21:00                   ` Ezekiel Newren
@ 2026-01-21 21:20                     ` Jeff King
  2026-01-21 21:31                       ` Junio C Hamano
  0 siblings, 1 reply; 78+ messages in thread
From: Jeff King @ 2026-01-21 21:20 UTC (permalink / raw)
  To: Ezekiel Newren
  Cc: Phillip Wood, phillip.wood, Ezekiel Newren via GitGitGadget, git,
	René Scharfe

On Wed, Jan 21, 2026 at 02:00:15PM -0700, Ezekiel Newren wrote:

> What about adding clar unit tests to make sure that different ivec
> types have the same size and layout? e.g. sizeof(IVec_c_void) ==
> sizeof(IVec_u8);
> sizeof(IVec_c_void) == sizeof(IVec_u16);
> sizeof(IVec_c_void) == sizeof(IVec_u32);
> sizeof(IVec_c_void) == sizeof(IVec_u64);
> ...
> 
> As well as other tests for ivec.

I'm a little hesitant in general to have run-time tests for properties
around undefined behavior, just because the compiler is allowed to do a
lot of tricky things when we get into that territory. Plus it is not
really _solving_ the problem, but perhaps just alerting us slightly
sooner than the production code itself crashing and burning.

You'd also need to check the pointer field sizes directly. I don't
think comparing struct sizes is sufficient, due to padding. If one pointer
is 4 bytes and another is 8 (for example), but the element afterwards
requires 8-byte alignment, then the compiler will have to insert 4 bytes
of padding, and the resulting struct size will be the same. You'd have
to more directly check that sizeof(uint8_t *) == sizeof(void *), I think.
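
To illustrate the padding point, here is a minimal sketch. The struct
names are hypothetical, and fixed-width integers stand in for a 4-byte
and an 8-byte pointer, since mainstream platforms do not offer both. On
a typical LP64 ABI the two structs come out the same size even though
their first fields differ, so a struct-level sizeof comparison would
not notice the difference:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-ins: uint32_t plays a 4-byte pointer,
 * uint64_t plays an 8-byte pointer and an 8-byte length.
 */
struct narrow_vec {
	uint32_t ptr;
	uint64_t len;
};

struct wide_vec {
	uint64_t ptr;
	uint64_t len;
};

/* Nonzero when the struct sizes match although the first fields differ. */
static int padding_hides_difference(void)
{
	return sizeof(struct narrow_vec) == sizeof(struct wide_vec) &&
	       sizeof(((struct narrow_vec *)0)->ptr) !=
	       sizeof(((struct wide_vec *)0)->ptr);
}

/* Direct field-level check; refuses to compile where the sizes differ. */
static_assert(sizeof(uint8_t *) == sizeof(void *),
	      "uint8_t * and void * differ in size");
```

The compile-time assertion only reports a size mismatch; it does not
make accessing one pointer type through another well defined.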

So I dunno. I am not a compiler expert, nor a rust expert, nor really
know anything about rust/C ABI boundaries. There might be no problem at
all here, and I'm only commenting on what I know is possible (albeit
unlikely) from the C side.

-Peff

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 01/10] ivec: introduce the C side of ivec
  2026-01-21 21:20                     ` Jeff King
@ 2026-01-21 21:31                       ` Junio C Hamano
  2026-01-21 21:45                         ` Ezekiel Newren
  0 siblings, 1 reply; 78+ messages in thread
From: Junio C Hamano @ 2026-01-21 21:31 UTC (permalink / raw)
  To: Jeff King
  Cc: Ezekiel Newren, Phillip Wood, phillip.wood,
	Ezekiel Newren via GitGitGadget, git, René Scharfe

Jeff King <peff@peff.net> writes:

> On Wed, Jan 21, 2026 at 02:00:15PM -0700, Ezekiel Newren wrote:
>
>> What about adding clar unit tests to make sure that different ivec
>> types have the same size and layout? e.g. sizeof(IVec_c_void) ==
>> sizeof(IVec_u8);
>> sizeof(IVec_c_void) == sizeof(IVec_u16);
>> sizeof(IVec_c_void) == sizeof(IVec_u32);
>> sizeof(IVec_c_void) == sizeof(IVec_u64);
>> ...
>> 
>> As well as other tests for ivec.
>
> I'm a little hesitant in general to have run-time tests for properties
> around undefined behavior, just because the compiler is allowed to do a
> lot of tricky things when we get into that territory. Plus it is not
> really _solving_ the problem, but perhaps just alerting us slightly
> sooner than the production code itself crashing and burning.

Yup, by definition, testing undefined behaviour with code is more or
less pointless.  Implementation defined behaviour, maybe, but not
undefined ones, please.

I thought you already gave them that having different possibilities
in a union would work correctly, but perhaps I was reading a
different thread?  I dunno...




^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 01/10] ivec: introduce the C side of ivec
  2026-01-20 14:06       ` Phillip Wood
@ 2026-01-21 21:39         ` Ezekiel Newren
  2026-01-28 11:15           ` Phillip Wood
  0 siblings, 1 reply; 78+ messages in thread
From: Ezekiel Newren @ 2026-01-21 21:39 UTC (permalink / raw)
  To: Phillip Wood; +Cc: phillip.wood, Ezekiel Newren via GitGitGadget, git

On Tue, Jan 20, 2026 at 7:06 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>
> Hi Ezekiel
>
> On 15/01/2026 15:55, Ezekiel Newren wrote:
> > On Thu, Jan 8, 2026 at 7:34 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
> >>> +void ivec_reserve(void *self_, size_t additional)
> >>> +{
> >>> +     struct IVec_c_void *self = self_;
> >>> +
> >>> +     size_t growby = 128;
> >>> +     if (self->capacity > growby)
> >>> +             growby = self->capacity;
> >>> +     if (additional > growby)
> >>> +             growby = additional;
> >>
> >> This growth strategy differs from both ALLOC_GROW() and
> >> XDL_ALLOC_GROW(), if there isn't a good reason for that we should
> >> perhaps just use ALLOC_GROW() here.
> >
> > XDL_ALLOC_GROW() can't be used because the pointer is always a void*
> > in this function.
>
> Oh right. I'm not sure that's a reason to use a different growth
> strategy though. The minimum size of 128 elements is probably good for
> the xdiff code that creates arrays with one element per line but if this
> is supposed to be for general use it is going to waste space when we're
> allocating a lot of small arrays. ALLOC_GROW() uses alloc_nr() to
> calculate the new size so perhaps we could use that here?

If ivec_reserve() isn't suitable then ivec_reserve_exact() should be
used instead.
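
To compare the two growth policies under discussion, a minimal sketch:
ivec_growby_sketch() mirrors only the growby computation quoted from
ivec_reserve() (what the function does with that value afterwards is
not shown in this thread), and alloc_nr_sketch() restates the
alloc_nr() formula that ALLOC_GROW() uses. Both helper names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the quoted growby computation: max(128, capacity, additional). */
static size_t ivec_growby_sketch(size_t capacity, size_t additional)
{
	size_t growby = 128;

	if (capacity > growby)
		growby = capacity;
	if (additional > growby)
		growby = additional;
	return growby;
}

/* Same formula as git's alloc_nr(): (x + 16) * 3 / 2. */
static size_t alloc_nr_sketch(size_t alloc)
{
	return (alloc + 16) * 3 / 2;
}
```

For an empty vector that needs one slot, the first asks for 128 more
elements while the second yields 24, which is the small-array overhead
being discussed.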

> >>> +void ivec_push(void *self_, const void *value)
> >>> +{
> >>> +     struct IVec_c_void *self = self_;
> >>> +     void *dst = NULL;
> >>> +
> >>> +     if (self->length == self->capacity)
> >>> +             ivec_reserve(self, 1);
> >>> +
> >>> +     dst = (uint8_t*)self->ptr + self->length * self->element_size;
> >>> +     memcpy(dst, value, self->element_size);
> >>
> >> If self->element_size was a compile time constant the compiler could
> >> easily optimize this call away. I'm not sure that is easy to achieve though.
> >
> > The problem is that I didn't want all of ivec to be macros that looked
> > like function calls. I wanted to minimize use of macros so that it was
> > easier to port and verify that the Rust implementation matches the
> > behavior of the C implementation.
>
> I think that's a reasonable concern. So is the plan to have a parallel
> rust implementation of these functions rather than call the C
> implementation from rust?

Yes, the Rust implementation will be independent of the C
implementation, but will behave the same way. That's why I'm calling
it an interoperable vec as opposed to a compatible vec. Rust can't
call the C ivec functions and C can't call the Rust ivec functions,
but they'll behave the same way.

> >>> +void ivec_free(void *self_)
> >>
> >> Normally we'd call a function like this, which frees the allocations
> >> and re-initializes the members, ivec_clear()
> >
> > In Rust Vec.clear() means to set length to zero, but leaves the
> > allocation alone. The reason why I'm zeroing the struct is to help
> > avoid FFI issues. If not zero then what should the members be set to,
> > to indicate that using the struct is not valid anymore? In Rust an
> > object is freed when it goes out of scope and _cannot_ be accessed
> > afterward.

Maybe I should call this ivec_drop(). Though the notion of explicitly
freeing an object in Rust is _almost_ nonsense. The way you free
something in Rust is to let it go out of scope.

> I'm aware that Vec::clear() has different semantics (it does what
> strbuf_reset() does). That's unfortunate but this function has different
> semantics to all the other *_free() functions in git. Our coding
> guidelines say
>
>   - There are several common idiomatic names for functions performing
>     specific tasks on a structure `S`:
>
>      - `S_init()` initializes a structure without allocating the
>        structure itself.
>
>      - `S_release()` releases a structure's contents without freeing the
>        structure.
>
>      - `S_clear()` is equivalent to `S_release()` followed by `S_init()`
>        such that the structure is directly usable after clearing it. When
>        `S_clear()` is provided, `S_init()` shall not allocate resources
>        that need to be released again.
>
>      - `S_free()` releases a structure's contents and frees the
>        structure.
>
> As we write more rust code and so wrap more of our existing structs
> we're going to be wrapping C code that uses the definitions above so I
> think we should do the same with struct IVec_*.

I disagree. IVec isn't a wrapper around an existing struct. ivec is
meant to very closely mimic Rust's Vec while guaranteeing
interoperability. For things like strbuf I haven't conceived of a
solution for that yet. Making ivec diverge from Rust's Vec will result
in POLA violations due to different behavior when refactoring an
IVec<your_type_here> to Vec<your_type_here>.

> >>> diff --git a/compat/ivec.h b/compat/ivec.h
> >>> new file mode 100644
> >>> index 0000000000..654a05c506
> >>> --- /dev/null
> >>> +++ b/compat/ivec.h
> >>> @@ -0,0 +1,52 @@
> >>> +#ifndef IVEC_H
> >>> +#define IVEC_H
> >>> +
> >>> +#include <git-compat-util.h>
> >>
> >> It would be nice to have some documentation in this header, see the
> >> examples in strvec.h and hashmap.h
> >>
> >>> +#define IVEC_INIT(variable) ivec_init(&(variable), sizeof(*(variable).ptr))
> >>
> >> This is a bit cumbersome to use compared to our usual *_INIT macros. I'm
> >> struggling to see how we can make it nicer though as DEFINE_IVEC_TYPE
> >> cannot define a per-type initializer macro and we cannot initialize
> >> the element size without knowing the type.
> >
> > I don't see what's cumbersome about it. Maybe an example use case
> > would clarify things.
>
> It is cumbersome because it separates the initialization from the
> declaration. Normally our *_INIT macros are initializer lists so we can
> write
>
>         struct strbuf buf = STRBUF_INIT;
>
> which keeps the declaration and initialization together. Although
> they're on adjacent lines in your example, in real code the
> initialization is likely to be separated from the declaration by other
> variable declarations.

Ah I see what you mean now. I'll experiment with making IVEC_INIT()
work like that. One wrinkle is that STRBUF_INIT is a single concrete
type whereas IVEC_INIT() is meant for generic types.
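
One possible shape for that experiment, as a sketch only: the struct
below is a hypothetical stand-in for what DEFINE_IVEC_TYPE() generates
(the field names follow the ivec_push() excerpt quoted above; their
order and types are assumptions), and IVEC_INIT_OF() is an invented
name. Passing the element type to the macro lets it expand to an
initializer list, so declaration and initialization stay together:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a DEFINE_IVEC_TYPE(uint8_t, u8) expansion. */
struct IVec_u8_sketch {
	uint8_t *ptr;
	size_t length;
	size_t capacity;
	size_t element_size;
};

/* Invented macro: designated initializer; other members become 0/NULL. */
#define IVEC_INIT_OF(type) { .element_size = sizeof(type) }

static struct IVec_u8_sketch make_empty_u8(void)
{
	struct IVec_u8_sketch v = IVEC_INIT_OF(uint8_t);

	return v;
}
```

The cost is that the element type has to be repeated at the
declaration site, which STRBUF_INIT does not need.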

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 01/10] ivec: introduce the C side of ivec
  2026-01-21 21:31                       ` Junio C Hamano
@ 2026-01-21 21:45                         ` Ezekiel Newren
  0 siblings, 0 replies; 78+ messages in thread
From: Ezekiel Newren @ 2026-01-21 21:45 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Phillip Wood, phillip.wood,
	Ezekiel Newren via GitGitGadget, git, René Scharfe

On Wed, Jan 21, 2026 at 2:31 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Jeff King <peff@peff.net> writes:
>
> > On Wed, Jan 21, 2026 at 02:00:15PM -0700, Ezekiel Newren wrote:
> >
> >> What about adding clar unit tests to make sure that different ivec
> >> types have the same size and layout? e.g. sizeof(IVec_c_void) ==
> >> sizeof(IVec_u8);
> >> sizeof(IVec_c_void) == sizeof(IVec_u16);
> >> sizeof(IVec_c_void) == sizeof(IVec_u32);
> >> sizeof(IVec_c_void) == sizeof(IVec_u64);
> >> ...
> >>
> >> As well as other tests for ivec.
> >
> > I'm a little hesitant in general to have run-time tests for properties
> > around undefined behavior, just because the compiler is allowed to do a
> > lot of tricky things when we get into that territory. Plus it is not
> > really _solving_ the problem, but perhaps just alerting us slightly
> > sooner than the production code itself crashing and burning.
>
> Yup, by definition, testing undefined behaviour with code is more or
> less pointless.  Implementation defined behaviour, maybe, but not
> undefined ones, please.
>
> I thought you already gave them that having different possibilities
> in a union would work correctly, but perhaps I was reading a
> different thread?  I dunno...
>

In my opinion the proper solution to this is to document that any
platform with different size pointers for different types is not
supported by Git. Which would make using Git on those platforms "use
at your own risk".

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 03/10] xdiff: don't waste time guessing the number of lines
  2026-01-21 21:12     ` Ezekiel Newren
@ 2026-01-22 10:16       ` Phillip Wood
  0 siblings, 0 replies; 78+ messages in thread
From: Phillip Wood @ 2026-01-22 10:16 UTC (permalink / raw)
  To: Ezekiel Newren, phillip.wood; +Cc: Ezekiel Newren via GitGitGadget, git

On 21/01/2026 21:12, Ezekiel Newren wrote:
> On Tue, Jan 20, 2026 at 8:02 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>>
>> On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
>>> From: Ezekiel Newren <ezekielnewren@gmail.com>
>>>
>>> All lines must be read anyway, so classify them after they're read in.
>>> Also move the memset() into xdl_init_classifier().
>>
>> So instead of looping over the input lines one and a bit times (the bit
>> being from xdl_guess_lines) we now loop over them twice as we split them
>> first and then classify them in a separate loop. It does save some work
>> not to call xdl_guess_lines but it is unclear if that offsets
>> classifying them in a separate loop.
>>
>>> +     for (size_t i = 0; i < xe->xdf1.nrec; i++) {
>>> +             xrecord_t *rec = &xe->xdf1.recs[i];
>>> +             xdl_classify_record(1, &cf, rec);
>>
>> We seem to have lost the error handling if xdl_classify_record() fails.
> 
> The error handling was not "lost" it was deliberately removed. 

That's the sort of thing that needs to be explained in the commit message.

> The
> only way in which xdl_classify_record() could fail is by a failed
> memory allocation. On the Rust side this would result in a panic
> (panic means something different in Rust vs C) in which case C could
> not possibly recover.

There is no Rust code in xdiff at the moment, so we don't panic on
failure. In git we'll die() because xdl_malloc() and friends are defined
as xmalloc() etc., which die on allocation failure. However, anyone else
picking up this code and using a different allocator that does not die
on allocation failure will expect the error to be propagated.

If you want to stop supporting other allocators then you should propose 
a patch to do so, not silently slip the change into this patch.
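
To illustrate what propagating the error looks like for a caller whose
allocator can return NULL, here is a minimal sketch. The types and the
names classify_record() and classify_all() are hypothetical stand-ins,
not the real xdiff code; only the shape of the loop matters.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; not the real xdiff types. */
struct record { size_t hash; };
struct classifier {
	size_t *seen;
	size_t count, alloc;
};

/* Returns 0 on success, -1 if the allocator fails. */
static int classify_record(struct classifier *cf, struct record *rec)
{
	if (cf->count == cf->alloc) {
		size_t nalloc = cf->alloc ? cf->alloc * 2 : 16;
		size_t *p = realloc(cf->seen, nalloc * sizeof(*p));
		if (!p)
			return -1; /* a non-dying allocator reports failure here */
		cf->seen = p;
		cf->alloc = nalloc;
	}
	cf->seen[cf->count++] = rec->hash;
	return 0;
}

/* Classify every record, passing an allocation failure up to the caller. */
static int classify_all(struct classifier *cf, struct record *recs, size_t nrec)
{
	for (size_t i = 0; i < nrec; i++) {
		if (classify_record(cf, &recs[i]) < 0)
			return -1;
	}
	return 0;
}
```

With an allocator that dies on failure the `< 0` branch is never taken,
so keeping the check costs nothing for git while still serving other
users of the code.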

Thanks

Phillip

> Also for operations like Vec.push() in Rust it's
> assumed that memory management functions will never fail and if they
> do they crash the program with no chance of recovery (unless you
> account for panic unwinding which is really ugly). It seems a lot of
> arguments about ivec and my xdiff cleanups are "We don't do things
> this way in Git/C" I'm aware of many of these arguments and I'm trying
> to address them with a more specific answer of "Yes, but that's not
> how things are done in Rust and all of this is to prepare the code for
> conversion to Rust and some things shouldn't, or even, cannot be done
> the C way in Rust."



* Re: [PATCH 07/10] xdiff: replace xdfile_t.dstart with xdfenv_t.delta_start
  2026-01-20 16:32   ` Phillip Wood
@ 2026-01-28 10:51     ` Phillip Wood
  0 siblings, 0 replies; 78+ messages in thread
From: Phillip Wood @ 2026-01-28 10:51 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget, git; +Cc: Ezekiel Newren

On 20/01/2026 16:32, Phillip Wood wrote:
> On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
>> From: Ezekiel Newren <ezekielnewren@gmail.com>
>>
>> Placing delta_start in xdfenv_t instead of xdfile_t provides a more
>> appropriate context since this variable only makes sense with a pair
>> of files. View with --color-words.
> 
> So as dstart and dend must be the same for both files we now store the 
> values once in xdfenv_t. 

Except it's only dstart that's the same; dend is different because it
conveniently stores an index, not an offset from the end. Having realized
that, moving them to xdfenv_t makes less sense, as having to calculate
the dend index from an offset from the end of the array each time is a
pain and sooner or later we'll make a mistake.
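
For reference, this is the conversion every user of a shared delta_end
would have to repeat. It is a minimal sketch using simplified,
hypothetical stand-ins for xdfile_t and xdfenv_t; the arithmetic mirrors
the dend1/dend2 computation in patch 10, and dend_index() is a made-up
helper name.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for xdfile_t and xdfenv_t. */
struct file { size_t nrec; };
struct env {
	struct file f1, f2;
	size_t delta_start; /* index of the first differing record, same in both files */
	size_t delta_end;   /* number of matching records trimmed from the end */
};

/*
 * Index of the last record still in play for one file. The result
 * differs per file because nrec differs, and it is -1 when every
 * record has been trimmed from the end.
 */
static ptrdiff_t dend_index(const struct file *f, const struct env *e)
{
	return (ptrdiff_t)f->nrec - 1 - (ptrdiff_t)e->delta_end;
}
```

Centralizing the subtraction in one helper would at least keep the
off-by-one in a single place.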

Thanks

Phillip

> That explains why we start passing xdfenv_t 
> around rather than xdfile_t in patch 5.
> 
> Thanks
> 
> Phillip
> 
>> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
>> ---
>>   xdiff/xhistogram.c |  4 ++--
>>   xdiff/xpatience.c  |  4 ++--
>>   xdiff/xprepare.c   | 17 +++++++++--------
>>   xdiff/xtypes.h     |  3 ++-
>>   4 files changed, 15 insertions(+), 13 deletions(-)
>>
>> diff --git a/xdiff/xhistogram.c b/xdiff/xhistogram.c
>> index 5ae1282c27..eb6a52d9ba 100644
>> --- a/xdiff/xhistogram.c
>> +++ b/xdiff/xhistogram.c
>> @@ -365,6 +365,6 @@ out:
>>   int xdl_do_histogram_diff(xpparam_t const *xpp, xdfenv_t *env)
>>   {
>>       return histogram_diff(xpp, env,
>> -        env->xdf1.dstart + 1, env->xdf1.dend - env->xdf1.dstart + 1,
>> -        env->xdf2.dstart + 1, env->xdf2.dend - env->xdf2.dstart + 1);
>> +        env->delta_start + 1, env->xdf1.dend - env->delta_start + 1,
>> +        env->delta_start + 1, env->xdf2.dend - env->delta_start + 1);
>>   }
>> diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
>> index 2bce07cf48..bd0ffbb417 100644
>> --- a/xdiff/xpatience.c
>> +++ b/xdiff/xpatience.c
>> @@ -374,6 +374,6 @@ static int patience_diff(xpparam_t const *xpp, 
>> xdfenv_t *env,
>>   int xdl_do_patience_diff(xpparam_t const *xpp, xdfenv_t *env)
>>   {
>>       return patience_diff(xpp, env,
>> -        env->xdf1.dstart + 1, env->xdf1.dend - env->xdf1.dstart + 1,
>> -        env->xdf2.dstart + 1, env->xdf2.dend - env->xdf2.dstart + 1);
>> +        env->delta_start + 1, env->xdf1.dend - env->delta_start + 1,
>> +        env->delta_start + 1, env->xdf2.dend - env->delta_start + 1);
>>   }
>> diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
>> index 06b6a6f804..e88468e74c 100644
>> --- a/xdiff/xprepare.c
>> +++ b/xdiff/xprepare.c
>> @@ -173,7 +173,6 @@ static int xdl_prepare_ctx(mmfile_t *mf, xdfile_t 
>> *xdf, uint64_t flags) {
>>       xdf->changed += 1;
>>       xdf->nreff = 0;
>> -    xdf->dstart = 0;
>>       xdf->dend = xdf->nrec - 1;
>>       return 0;
>> @@ -287,7 +286,7 @@ static int xdl_cleanup_records(xdlclassifier_t 
>> *cf, xdfenv_t *xe) {
>>        */
>>       if ((mlim = xdl_bogosqrt((long)xe->xdf1.nrec)) > XDL_MAX_EQLIMIT)
>>           mlim = XDL_MAX_EQLIMIT;
>> -    for (i = xe->xdf1.dstart, recs = &xe->xdf1.recs[xe->xdf1.dstart]; 
>> i <= xe->xdf1.dend; i++, recs++) {
>> +    for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start]; 
>> i <= xe->xdf1.dend; i++, recs++) {
>>           rcrec = cf->rcrecs[recs->minimal_perfect_hash];
>>           nm = rcrec ? rcrec->len2 : 0;
>>           action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && ! 
>> need_min) ? INVESTIGATE: KEEP;
>> @@ -295,7 +294,7 @@ static int xdl_cleanup_records(xdlclassifier_t 
>> *cf, xdfenv_t *xe) {
>>       if ((mlim = xdl_bogosqrt((long)xe->xdf2.nrec)) > XDL_MAX_EQLIMIT)
>>           mlim = XDL_MAX_EQLIMIT;
>> -    for (i = xe->xdf2.dstart, recs = &xe->xdf2.recs[xe->xdf2.dstart]; 
>> i <= xe->xdf2.dend; i++, recs++) {
>> +    for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start]; 
>> i <= xe->xdf2.dend; i++, recs++) {
>>           rcrec = cf->rcrecs[recs->minimal_perfect_hash];
>>           nm = rcrec ? rcrec->len1 : 0;
>>           action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && ! 
>> need_min) ? INVESTIGATE: KEEP;
>> @@ -306,10 +305,10 @@ static int xdl_cleanup_records(xdlclassifier_t 
>> *cf, xdfenv_t *xe) {
>>        * false, or become true.
>>        */
>>       xe->xdf1.nreff = 0;
>> -    for (i = xe->xdf1.dstart, recs = &xe->xdf1.recs[xe->xdf1.dstart];
>> +    for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start];
>>            i <= xe->xdf1.dend; i++, recs++) {
>>           if (action1[i] == KEEP ||
>> -            (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, 
>> i, xe->xdf1.dstart, xe->xdf1.dend))) {
>> +            (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, 
>> i, xe->delta_start, xe->xdf1.dend))) {
>>               xe->xdf1.reference_index[xe->xdf1.nreff++] = i;
>>               /* changed[i] remains false, i.e. keep */
>>           } else
>> @@ -318,10 +317,10 @@ static int xdl_cleanup_records(xdlclassifier_t 
>> *cf, xdfenv_t *xe) {
>>       }
>>       xe->xdf2.nreff = 0;
>> -    for (i = xe->xdf2.dstart, recs = &xe->xdf2.recs[xe->xdf2.dstart];
>> +    for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start];
>>            i <= xe->xdf2.dend; i++, recs++) {
>>           if (action2[i] == KEEP ||
>> -            (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, 
>> i, xe->xdf2.dstart, xe->xdf2.dend))) {
>> +            (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, 
>> i, xe->delta_start, xe->xdf2.dend))) {
>>               xe->xdf2.reference_index[xe->xdf2.nreff++] = i;
>>               /* changed[i] remains false, i.e. keep */
>>           } else
>> @@ -348,7 +347,7 @@ static void xdl_trim_ends(xdfenv_t *xe)
>>           size_t mph1 = xe->xdf1.recs[i].minimal_perfect_hash;
>>           size_t mph2 = xe->xdf2.recs[i].minimal_perfect_hash;
>>           if (mph1 != mph2) {
>> -            xe->xdf1.dstart = xe->xdf2.dstart = (ssize_t)i;
>> +            xe->delta_start = (ssize_t)i;
>>               lim -= i;
>>               break;
>>           }
>> @@ -370,6 +369,8 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, 
>> xpparam_t const *xpp,
>>               xdfenv_t *xe) {
>>       xdlclassifier_t cf;
>> +    xe->delta_start = 0;
>> +
>>       if (xdl_prepare_ctx(mf1, &xe->xdf1, xpp->flags) < 0) {
>>           return -1;
>> diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
>> index 979586f20a..bda1f85eb0 100644
>> --- a/xdiff/xtypes.h
>> +++ b/xdiff/xtypes.h
>> @@ -48,7 +48,7 @@ typedef struct s_xrecord {
>>   typedef struct s_xdfile {
>>       xrecord_t *recs;
>>       size_t nrec;
>> -    ptrdiff_t dstart, dend;
>> +    ptrdiff_t dend;
>>       bool *changed;
>>       size_t *reference_index;
>>       size_t nreff;
>> @@ -56,6 +56,7 @@ typedef struct s_xdfile {
>>   typedef struct s_xdfenv {
>>       xdfile_t xdf1, xdf2;
>> +    size_t delta_start;
>>   } xdfenv_t;
> 
> 



* Re: [PATCH 10/10] xdiff: move xdl_cleanup_records() from xprepare.c to xdiffi.c
  2026-01-21 15:01   ` Phillip Wood
@ 2026-01-28 10:56     ` Phillip Wood
  0 siblings, 0 replies; 78+ messages in thread
From: Phillip Wood @ 2026-01-28 10:56 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget, git; +Cc: Ezekiel Newren

On 21/01/2026 15:01, Phillip Wood wrote:
> Hi Ezekiel
> 
> On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
>> From: Ezekiel Newren <ezekielnewren@gmail.com>
>>
>> Only the classic diff uses xdl_cleanup_records(). Move it,
>> xdl_clean_mmatch(), and the macros to xdiffi.c and call
>> xdl_cleanup_records() inside of xdl_do_classic_diff(). This better
>> organizes the code related to the classic diff.
> 
> I think calling xdl_cleanup_records() from inside xdl_do_classic_diff() 
> makes sense. I don't have a strong opinion either way on the code 
> movement.

Having thought about it, I'm not so sure the code movement here makes
sense. Having utility functions in a separate file is perfectly
reasonable (after all, xprepare.c existed before the histogram and
patience algorithms were added). It's not as if the code in xdiffi.c is
only about the Myers diff; there is generic code for diff sliders in
there as well.

Thanks

Phillip

> You should remove '#include "compat/ivec.h"' from xprepare.c
> if you're moving the only code that uses it out of that file.
> 
> Thanks
> 
> Phillip
> 
>> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
>> ---
>>   xdiff/xdiffi.c   | 180 ++++++++++++++++++++++++++++++++++++++++++++
>>   xdiff/xprepare.c | 191 +----------------------------------------------
>>   2 files changed, 181 insertions(+), 190 deletions(-)
>>
>> diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
>> index e3196c7245..0f1fd7cf80 100644
>> --- a/xdiff/xdiffi.c
>> +++ b/xdiff/xdiffi.c
>> @@ -21,6 +21,7 @@
>>    */
>>   #include "xinclude.h"
>> +#include "compat/ivec.h"
>>   static size_t get_hash(xdfile_t *xdf, long index)
>>   {
>> @@ -33,6 +34,14 @@ static size_t get_hash(xdfile_t *xdf, long index)
>>   #define XDL_SNAKE_CNT 20
>>   #define XDL_K_HEUR 4
>> +#define XDL_KPDIS_RUN 4
>> +#define XDL_MAX_EQLIMIT 1024
>> +#define XDL_SIMSCAN_WINDOW 100
>> +
>> +#define DISCARD 0
>> +#define KEEP 1
>> +#define INVESTIGATE 2
>> +
>>   typedef struct s_xdpsplit {
>>       long i1, i2;
>>       int min_lo, min_hi;
>> @@ -311,6 +320,175 @@ int xdl_recs_cmp(xdfile_t *xdf1, long off1, long 
>> lim1,
>>   }
>> +static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, 
>> long e) {
>> +    long r, rdis0, rpdis0, rdis1, rpdis1;
>> +
>> +    /*
>> +     * Limits the window that is examined during the similar-lines
>> +     * scan. The loops below stops when action[i - r] == KEEP
>> +     * (line that has no match), but there are corner cases where
>> +     * the loop proceed all the way to the extremities by causing
>> +     * huge performance penalties in case of big files.
>> +     */
>> +    if (i - s > XDL_SIMSCAN_WINDOW)
>> +        s = i - XDL_SIMSCAN_WINDOW;
>> +    if (e - i > XDL_SIMSCAN_WINDOW)
>> +        e = i + XDL_SIMSCAN_WINDOW;
>> +
>> +    /*
>> +     * Scans the lines before 'i' to find a run of lines that either
>> +     * have no match (action[j] == DISCARD) or have multiple matches
>> +     * (action[j] == INVESTIGATE). Note that we always call this
>> +     * function with action[i] == INVESTIGATE, so the current line
>> +     * (i) is already a multimatch line.
>> +     */
>> +    for (r = 1, rdis0 = 0, rpdis0 = 1; (i - r) >= s; r++) {
>> +        if (action[i - r] == DISCARD)
>> +            rdis0++;
>> +        else if (action[i - r] == INVESTIGATE)
>> +            rpdis0++;
>> +        else if (action[i - r] == KEEP)
>> +            break;
>> +        else
>> +            BUG("Illegal value for action[i - r]");
>> +    }
>> +    /*
>> +     * If the run before the line 'i' found only multimatch lines,
>> +     * we return false and hence we don't make the current line (i)
>> +     * discarded. We want to discard multimatch lines only when
>> +     * they appear in the middle of runs with nomatch lines
>> +     * (action[j] == DISCARD).
>> +     */
>> +    if (rdis0 == 0)
>> +        return 0;
>> +    for (r = 1, rdis1 = 0, rpdis1 = 1; (i + r) <= e; r++) {
>> +        if (action[i + r] == DISCARD)
>> +            rdis1++;
>> +        else if (action[i + r] == INVESTIGATE)
>> +            rpdis1++;
>> +        else if (action[i + r] == KEEP)
>> +            break;
>> +        else
>> +            BUG("Illegal value for action[i + r]");
>> +    }
>> +    /*
>> +     * If the run after the line 'i' found only multimatch lines,
>> +     * we return false and hence we don't make the current line (i)
>> +     * discarded.
>> +     */
>> +    if (rdis1 == 0)
>> +        return false;
>> +    rdis1 += rdis0;
>> +    rpdis1 += rpdis0;
>> +
>> +    return rpdis1 * XDL_KPDIS_RUN < (rpdis1 + rdis1);
>> +}
>> +
>> +struct xoccurrence
>> +{
>> +    size_t file1, file2;
>> +};
>> +
>> +
>> +DEFINE_IVEC_TYPE(struct xoccurrence, xoccurrence);
>> +
>> +
>> +/*
>> + * Try to reduce the problem complexity, discard records that have no
>> + * matches on the other file. Also, lines that have multiple matches
>> + * might be potentially discarded if they appear in a run of 
>> discardable.
>> + */
>> +static int xdl_cleanup_records(xdfenv_t *xe, uint64_t flags) {
>> +    long i;
>> +    size_t nm, mlim;
>> +    xrecord_t *recs;
>> +    uint8_t *action1 = NULL, *action2 = NULL;
>> +    struct IVec_xoccurrence occ;
>> +    bool need_min = !!(flags & XDF_NEED_MINIMAL);
>> +    int ret = 0;
>> +    ptrdiff_t dend1 = xe->xdf1.nrec - 1 - xe->delta_end;
>> +    ptrdiff_t dend2 = xe->xdf2.nrec - 1 - xe->delta_end;
>> +
>> +    IVEC_INIT(occ);
>> +    ivec_zero(&occ, xe->mph_size);
>> +
>> +    for (size_t j = 0; j < xe->xdf1.nrec; j++) {
>> +        size_t mph1 = xe->xdf1.recs[j].minimal_perfect_hash;
>> +        occ.ptr[mph1].file1 += 1;
>> +    }
>> +
>> +    for (size_t j = 0; j < xe->xdf2.nrec; j++) {
>> +        size_t mph2 = xe->xdf2.recs[j].minimal_perfect_hash;
>> +        occ.ptr[mph2].file2 += 1;
>> +    }
>> +
>> +    /*
>> +     * Create temporary arrays that will help us decide if
>> +     * changed[i] should remain false, or become true.
>> +     */
>> +    if (!XDL_CALLOC_ARRAY(action1, xe->xdf1.nrec + 1)) {
>> +        ret = -1;
>> +        goto cleanup;
>> +    }
>> +    if (!XDL_CALLOC_ARRAY(action2, xe->xdf2.nrec + 1)) {
>> +        ret = -1;
>> +        goto cleanup;
>> +    }
>> +
>> +    /*
>> +     * Initialize temporary arrays with DISCARD, KEEP, or INVESTIGATE.
>> +     */
>> +    if ((mlim = xdl_bogosqrt((long)xe->xdf1.nrec)) > XDL_MAX_EQLIMIT)
>> +        mlim = XDL_MAX_EQLIMIT;
>> +    for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start]; 
>> i <= dend1; i++, recs++) {
>> +        nm = occ.ptr[recs->minimal_perfect_hash].file2;
>> +        action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? 
>> INVESTIGATE: KEEP;
>> +    }
>> +
>> +    if ((mlim = xdl_bogosqrt((long)xe->xdf2.nrec)) > XDL_MAX_EQLIMIT)
>> +        mlim = XDL_MAX_EQLIMIT;
>> +    for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start]; 
>> i <= dend2; i++, recs++) {
>> +        nm = occ.ptr[recs->minimal_perfect_hash].file1;
>> +        action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? 
>> INVESTIGATE: KEEP;
>> +    }
>> +
>> +    /*
>> +     * Use temporary arrays to decide if changed[i] should remain
>> +     * false, or become true.
>> +     */
>> +    xe->xdf1.nreff = 0;
>> +    for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start];
>> +         i <= dend1; i++, recs++) {
>> +        if (action1[i] == KEEP ||
>> +            (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, 
>> i, xe->delta_start, dend1))) {
>> +            xe->xdf1.reference_index[xe->xdf1.nreff++] = i;
>> +            /* changed[i] remains false, i.e. keep */
>> +        } else
>> +            xe->xdf1.changed[i] = true;
>> +            /* i.e. discard */
>> +    }
>> +
>> +    xe->xdf2.nreff = 0;
>> +    for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start];
>> +         i <= dend2; i++, recs++) {
>> +        if (action2[i] == KEEP ||
>> +            (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, 
>> i, xe->delta_start, dend2))) {
>> +            xe->xdf2.reference_index[xe->xdf2.nreff++] = i;
>> +            /* changed[i] remains false, i.e. keep */
>> +        } else
>> +            xe->xdf2.changed[i] = true;
>> +            /* i.e. discard */
>> +    }
>> +
>> +cleanup:
>> +    xdl_free(action1);
>> +    xdl_free(action2);
>> +    ivec_free(&occ);
>> +
>> +    return ret;
>> +}
>> +
>> +
>>   int xdl_do_classic_diff(xdfenv_t *xe, uint64_t flags)
>>   {
>>       long ndiags;
>> @@ -318,6 +496,8 @@ int xdl_do_classic_diff(xdfenv_t *xe, uint64_t flags)
>>       xdalgoenv_t xenv;
>>       int res;
>> +    xdl_cleanup_records(xe, flags);
>> +
>>       /*
>>        * Allocate and setup K vectors to be used by the differential
>>        * algorithm.
>> diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
>> index b53a3b80c4..3f555e29f4 100644
>> --- a/xdiff/xprepare.c
>> +++ b/xdiff/xprepare.c
>> @@ -24,14 +24,6 @@
>>   #include "compat/ivec.h"
>> -#define XDL_KPDIS_RUN 4
>> -#define XDL_MAX_EQLIMIT 1024
>> -#define XDL_SIMSCAN_WINDOW 100
>> -
>> -#define DISCARD 0
>> -#define KEEP 1
>> -#define INVESTIGATE 2
>> -
>>   typedef struct s_xdlclass {
>>       struct s_xdlclass *next;
>>       xrecord_t rec;
>> @@ -50,8 +42,6 @@ typedef struct s_xdlclassifier {
>>   } xdlclassifier_t;
>> -
>> -
>>   static int xdl_init_classifier(xdlclassifier_t *cf, long size, long 
>> flags) {
>>       memset(cf, 0, sizeof(xdlclassifier_t));
>> @@ -186,175 +176,6 @@ void xdl_free_env(xdfenv_t *xe) {
>>   }
>> -static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, 
>> long e) {
>> -    long r, rdis0, rpdis0, rdis1, rpdis1;
>> -
>> -    /*
>> -     * Limits the window that is examined during the similar-lines
>> -     * scan. The loops below stops when action[i - r] == KEEP
>> -     * (line that has no match), but there are corner cases where
>> -     * the loop proceed all the way to the extremities by causing
>> -     * huge performance penalties in case of big files.
>> -     */
>> -    if (i - s > XDL_SIMSCAN_WINDOW)
>> -        s = i - XDL_SIMSCAN_WINDOW;
>> -    if (e - i > XDL_SIMSCAN_WINDOW)
>> -        e = i + XDL_SIMSCAN_WINDOW;
>> -
>> -    /*
>> -     * Scans the lines before 'i' to find a run of lines that either
>> -     * have no match (action[j] == DISCARD) or have multiple matches
>> -     * (action[j] == INVESTIGATE). Note that we always call this
>> -     * function with action[i] == INVESTIGATE, so the current line
>> -     * (i) is already a multimatch line.
>> -     */
>> -    for (r = 1, rdis0 = 0, rpdis0 = 1; (i - r) >= s; r++) {
>> -        if (action[i - r] == DISCARD)
>> -            rdis0++;
>> -        else if (action[i - r] == INVESTIGATE)
>> -            rpdis0++;
>> -        else if (action[i - r] == KEEP)
>> -            break;
>> -        else
>> -            BUG("Illegal value for action[i - r]");
>> -    }
>> -    /*
>> -     * If the run before the line 'i' found only multimatch lines,
>> -     * we return false and hence we don't make the current line (i)
>> -     * discarded. We want to discard multimatch lines only when
>> -     * they appear in the middle of runs with nomatch lines
>> -     * (action[j] == DISCARD).
>> -     */
>> -    if (rdis0 == 0)
>> -        return 0;
>> -    for (r = 1, rdis1 = 0, rpdis1 = 1; (i + r) <= e; r++) {
>> -        if (action[i + r] == DISCARD)
>> -            rdis1++;
>> -        else if (action[i + r] == INVESTIGATE)
>> -            rpdis1++;
>> -        else if (action[i + r] == KEEP)
>> -            break;
>> -        else
>> -            BUG("Illegal value for action[i + r]");
>> -    }
>> -    /*
>> -     * If the run after the line 'i' found only multimatch lines,
>> -     * we return false and hence we don't make the current line (i)
>> -     * discarded.
>> -     */
>> -    if (rdis1 == 0)
>> -        return false;
>> -    rdis1 += rdis0;
>> -    rpdis1 += rpdis0;
>> -
>> -    return rpdis1 * XDL_KPDIS_RUN < (rpdis1 + rdis1);
>> -}
>> -
>> -struct xoccurrence
>> -{
>> -    size_t file1, file2;
>> -};
>> -
>> -
>> -DEFINE_IVEC_TYPE(struct xoccurrence, xoccurrence);
>> -
>> -
>> -/*
>> - * Try to reduce the problem complexity, discard records that have no
>> - * matches on the other file. Also, lines that have multiple matches
>> - * might be potentially discarded if they appear in a run of 
>> discardable.
>> - */
>> -static int xdl_cleanup_records(xdfenv_t *xe, uint64_t flags) {
>> -    long i;
>> -    size_t nm, mlim;
>> -    xrecord_t *recs;
>> -    uint8_t *action1 = NULL, *action2 = NULL;
>> -    struct IVec_xoccurrence occ;
>> -    bool need_min = !!(flags & XDF_NEED_MINIMAL);
>> -    int ret = 0;
>> -    ptrdiff_t dend1 = xe->xdf1.nrec - 1 - xe->delta_end;
>> -    ptrdiff_t dend2 = xe->xdf2.nrec - 1 - xe->delta_end;
>> -
>> -    IVEC_INIT(occ);
>> -    ivec_zero(&occ, xe->mph_size);
>> -
>> -    for (size_t j = 0; j < xe->xdf1.nrec; j++) {
>> -        size_t mph1 = xe->xdf1.recs[j].minimal_perfect_hash;
>> -        occ.ptr[mph1].file1 += 1;
>> -    }
>> -
>> -    for (size_t j = 0; j < xe->xdf2.nrec; j++) {
>> -        size_t mph2 = xe->xdf2.recs[j].minimal_perfect_hash;
>> -        occ.ptr[mph2].file2 += 1;
>> -    }
>> -
>> -    /*
>> -     * Create temporary arrays that will help us decide if
>> -     * changed[i] should remain false, or become true.
>> -     */
>> -    if (!XDL_CALLOC_ARRAY(action1, xe->xdf1.nrec + 1)) {
>> -        ret = -1;
>> -        goto cleanup;
>> -    }
>> -    if (!XDL_CALLOC_ARRAY(action2, xe->xdf2.nrec + 1)) {
>> -        ret = -1;
>> -        goto cleanup;
>> -    }
>> -
>> -    /*
>> -     * Initialize temporary arrays with DISCARD, KEEP, or INVESTIGATE.
>> -     */
>> -    if ((mlim = xdl_bogosqrt((long)xe->xdf1.nrec)) > XDL_MAX_EQLIMIT)
>> -        mlim = XDL_MAX_EQLIMIT;
>> -    for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start]; 
>> i <= dend1; i++, recs++) {
>> -        nm = occ.ptr[recs->minimal_perfect_hash].file2;
>> -        action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? 
>> INVESTIGATE: KEEP;
>> -    }
>> -
>> -    if ((mlim = xdl_bogosqrt((long)xe->xdf2.nrec)) > XDL_MAX_EQLIMIT)
>> -        mlim = XDL_MAX_EQLIMIT;
>> -    for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start]; 
>> i <= dend2; i++, recs++) {
>> -        nm = occ.ptr[recs->minimal_perfect_hash].file1;
>> -        action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? 
>> INVESTIGATE: KEEP;
>> -    }
>> -
>> -    /*
>> -     * Use temporary arrays to decide if changed[i] should remain
>> -     * false, or become true.
>> -     */
>> -    xe->xdf1.nreff = 0;
>> -    for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start];
>> -         i <= dend1; i++, recs++) {
>> -        if (action1[i] == KEEP ||
>> -            (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, 
>> i, xe->delta_start, dend1))) {
>> -            xe->xdf1.reference_index[xe->xdf1.nreff++] = i;
>> -            /* changed[i] remains false, i.e. keep */
>> -        } else
>> -            xe->xdf1.changed[i] = true;
>> -            /* i.e. discard */
>> -    }
>> -
>> -    xe->xdf2.nreff = 0;
>> -    for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start];
>> -         i <= dend2; i++, recs++) {
>> -        if (action2[i] == KEEP ||
>> -            (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, 
>> i, xe->delta_start, dend2))) {
>> -            xe->xdf2.reference_index[xe->xdf2.nreff++] = i;
>> -            /* changed[i] remains false, i.e. keep */
>> -        } else
>> -            xe->xdf2.changed[i] = true;
>> -            /* i.e. discard */
>> -    }
>> -
>> -cleanup:
>> -    xdl_free(action1);
>> -    xdl_free(action2);
>> -    ivec_free(&occ);
>> -
>> -    return ret;
>> -}
>> -
>> -
>>   /*
>>    * Early trim initial and terminal matching records.
>>    */
>> @@ -414,19 +235,9 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, 
>> xpparam_t const *xpp,
>>       }
>>       xe->mph_size = cf.count;
>> +    xdl_free_classifier(&cf);
>>       xdl_trim_ends(xe);
>> -    if ((XDF_DIFF_ALG(xpp->flags) != XDF_PATIENCE_DIFF) &&
>> -        (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF) &&
>> -        xdl_cleanup_records(xe, xpp->flags) < 0) {
>> -
>> -        xdl_free_ctx(&xe->xdf2);
>> -        xdl_free_ctx(&xe->xdf1);
>> -        xdl_free_classifier(&cf);
>> -        return -1;
>> -    }
>> -
>> -    xdl_free_classifier(&cf);
>>       return 0;
>>   }
> 
> 



* Re: [PATCH 01/10] ivec: introduce the C side of ivec
  2026-01-21 21:39         ` Ezekiel Newren
@ 2026-01-28 11:15           ` Phillip Wood
  0 siblings, 0 replies; 78+ messages in thread
From: Phillip Wood @ 2026-01-28 11:15 UTC (permalink / raw)
  To: Ezekiel Newren; +Cc: phillip.wood, Ezekiel Newren via GitGitGadget, git

On 21/01/2026 21:39, Ezekiel Newren wrote:
> On Tue, Jan 20, 2026 at 7:06 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>>
>> Hi Ezekiel
>>
>> On 15/01/2026 15:55, Ezekiel Newren wrote:
>>> On Thu, Jan 8, 2026 at 7:34 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>>>>> +void ivec_reserve(void *self_, size_t additional)
>>>>> +{
>>>>> +     struct IVec_c_void *self = self_;
>>>>> +
>>>>> +     size_t growby = 128;
>>>>> +     if (self->capacity > growby)
>>>>> +             growby = self->capacity;
>>>>> +     if (additional > growby)
>>>>> +             growby = additional;
>>>>
>>>> This growth strategy differs from both ALLOC_GROW() and
>>>> XDL_ALLOC_GROW(), if there isn't a good reason for that we should
>>>> perhaps just use ALLOC_GROW() here.
>>>
>>> XDL_ALLOC_GROW() can't be used because the pointer is always a void*
>>> in this function.
>>
>> Oh right. I'm not sure that's a reason to use a different growth
>> strategy though. The minimum size of 128 elements is probably good for
>> the xdiff code that creates arrays with one element per line but if this
>> is supposed to be for general use it is going to waste space when we're
>> allocating a lot of small arrays. ALLOC_GROW() uses alloc_nr() to
>> calculate the new size so perhaps we could use that here?
> 
> If ivec_reserve() isn't suitable then ivec_reserve_exact() should be
> used instead.

If some C code that pushes one element at a time to an array using
ALLOC_GROW() is converted to use an ivec then we don't want to change
how the code behaves - that means it should grow the array in the same way.
I don't see how the suggestion to use ivec_reserve_exact() helps in that
situation. What is the advantage in having a different growth
characteristic?
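
To make the difference concrete, here is a small sketch of the two
growth rules as plain functions. alloc_nr_capacity() uses the same
formula as git's alloc_nr() macro and alloc_grow_capacity() follows the
logic of ALLOC_GROW(). ivec_reserve_capacity() reproduces the quoted
growby computation, but it ASSUMES that growby is then added to the
current capacity, which the quoted hunk does not show. All three
function names are made up for this example.

```c
#include <stddef.h>

/* Same formula as git's alloc_nr() macro. */
static size_t alloc_nr_capacity(size_t alloc)
{
	return (alloc + 16) * 3 / 2;
}

/* Capacity after ALLOC_GROW()-style growth to hold at least nr elements. */
static size_t alloc_grow_capacity(size_t alloc, size_t nr)
{
	if (nr <= alloc)
		return alloc;
	return alloc_nr_capacity(alloc) < nr ? nr : alloc_nr_capacity(alloc);
}

/*
 * Capacity after the quoted ivec_reserve() logic.
 * ASSUMPTION: the computed growby is added to the current capacity.
 */
static size_t ivec_reserve_capacity(size_t capacity, size_t additional)
{
	size_t growby = 128;

	if (capacity > growby)
		growby = capacity;
	if (additional > growby)
		growby = additional;
	return capacity + growby;
}
```

Under that assumption, pushing a single element to an empty array
allocates 24 slots with the ALLOC_GROW() rule and 128 slots with the
ivec rule, which is the overhead for many small arrays mentioned
earlier in the thread.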

>>>>> +void ivec_push(void *self_, const void *value)
>>>>> +{
>>>>> +     struct IVec_c_void *self = self_;
>>>>> +     void *dst = NULL;
>>>>> +
>>>>> +     if (self->length == self->capacity)
>>>>> +             ivec_reserve(self, 1);
>>>>> +
>>>>> +     dst = (uint8_t*)self->ptr + self->length * self->element_size;
>>>>> +     memcpy(dst, value, self->element_size);
>>>>
>>>> If self->element_size was a compile time constant the compiler could
>>>> easily optimize this call away. I'm not sure that is easy to achieve though.
>>>
>>> The problem is that I didn't want all of ivec to be macros that looked
>>> like function calls. I wanted to minimize use of macros so that it was
>>> easier to port and verify that the Rust implementation matches the
>>> behavior of the C implementation.
>>
>> I think that's a reasonable concern. So is the plan to have a parallel
>> rust implementation of these functions rather than call the C
>> implementation from rust?
> 
> Yes, the Rust implementation will be independent of the C
> implementation, but will behave the same way. That's why I'm calling
> it an interoperable vec as opposed to a compatible vec. Rust can't
> call the C ivec functions and C can't call the Rust ivec functions,
> but they'll behave the same way.

Interesting - I'm curious what the advantage of that is over having Rust
call the C implementation? I can see you wouldn't want to be calling
into C for each ivec.push() call, but checking if there is room to push
the new element in Rust and calling into C to extend the vector if not
should be reasonable and then you don't have to re-implement everything
in Rust.
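
The split I have in mind looks roughly like this, written in C purely
for illustration. struct vec, vec_grow() and vec_push() are hypothetical
names and the doubling policy is just a placeholder: vec_push() stands
for the cheap check the caller's side would do itself, vec_grow() for
the one routine that would stay shared.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical vector of ints; not the IVec layout from the patch. */
struct vec {
	int *ptr;
	size_t length, capacity;
};

/* Slow path: the only part that reallocates. Returns 0 or -1. */
static int vec_grow(struct vec *v, size_t additional)
{
	size_t growby = additional > v->capacity ? additional : v->capacity;
	size_t ncap = v->capacity + growby;
	int *p = realloc(v->ptr, ncap * sizeof(*p));

	if (!p)
		return -1;
	v->ptr = p;
	v->capacity = ncap;
	return 0;
}

/* Fast path: check for room locally, call the slow path only when full. */
static int vec_push(struct vec *v, int value)
{
	if (v->length == v->capacity && vec_grow(v, 1) < 0)
		return -1;
	v->ptr[v->length++] = value;
	return 0;
}
```

Most pushes never leave the fast path, so the cost of crossing the
language boundary would only be paid when the array actually has to
grow.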

>>>>> +void ivec_free(void *self_)
>>>>
>>>> Normally we'd call a function like this that frees the allocations and
>>>> re-initializes the members ivec_clear()
>>>
>>> In Rust Vec.clear() means to set length to zero, but leaves the
>>> allocation alone. The reason why I'm zeroing the struct is to help
>>> avoid FFI issues. If not zero then what should the members be set to,
>>> to indicate that using the struct is not valid anymore? In Rust an
>>> object is freed when it goes out of scope and _cannot_ be accessed
>>> afterward.
> 
> Maybe I should call this ivec_drop(). Though the notion of explicitly
> freeing an object in Rust is _almost_ nonsense. The way you free
> something in Rust is to let it go out of scope.

Indeed - which means this won't be a public function in Rust and so why
do we worry about naming it ivec_clear()? At least ivec_drop() does not
conflict with any of the standard function suffixes that we're already
using in git.

>> I'm aware that Vec::clear() has different semantics (it does what
>> strbuf_reset() does). That's unfortunate but this function has different
>> semantics to all the other *_free() functions in git. Our coding
>> guidelines say
>>
>>    - There are several common idiomatic names for functions performing
>>      specific tasks on a structure `S`:
>>
>>       - `S_init()` initializes a structure without allocating the
>>         structure itself.
>>
>>       - `S_release()` releases a structure's contents without freeing the
>>         structure.
>>
>>       - `S_clear()` is equivalent to `S_release()` followed by `S_init()`
>>         such that the structure is directly usable after clearing it. When
>>         `S_clear()` is provided, `S_init()` shall not allocate resources
>>         that need to be released again.
>>
>>       - `S_free()` releases a structure's contents and frees the
>>         structure.
>>
>> As we write more rust code and so wrap more of our existing structs
>> we're going to be wrapping C code that uses the definitions above so I
>> think we should do the same with struct IVec_*.
> 
> I disagree. IVec isn't a wrapper around an existing struct.

So just because it is a new struct it shouldn't have to follow the
existing naming conventions?

> ivec is
> meant to very closely mimic Rust's Vec while guaranteeing
> interoperability. For things like strbuf I haven't conceived of a
> solution for that yet. Making ivec diverge from Rust's Vec will result
> in POLA violations due to different behavior when refactoring an
> IVec<your_type_here> to Vec<your_type_here>.

On the other hand, vec.reset() does not exist so you'd get a compiler 
error if you forgot to rename those calls when changing from IVec to Vec 
and the rust code wouldn't be calling ivec.clear(). I'm not sure citing 
POLA concerns is very convincing as ivec_free() in C is a POLA violation 
for anyone familiar with git's code base so it's not like there's a 
choice that avoids that concern.
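
To make the four suffixes from the guidelines concrete, here is a minimal 
sketch in C. The struct `intlist` and all of its functions are invented for 
this illustration (they are not part of git or of the ivec patch); only the 
suffix semantics are taken from the guidelines quoted above.

```c
#include <stdlib.h>

/* Hypothetical structure, used only to illustrate the naming convention. */
struct intlist {
	int *items;
	size_t nr, alloc;
};

/* S_init(): initialize the structure without allocating the structure itself. */
void intlist_init(struct intlist *l)
{
	l->items = NULL;
	l->nr = l->alloc = 0;
}

/* Helper for the demo: append a value, growing the buffer as needed. */
int intlist_push(struct intlist *l, int v)
{
	if (l->nr == l->alloc) {
		size_t n = l->alloc ? 2 * l->alloc : 8;
		int *p = realloc(l->items, n * sizeof(*p));
		if (!p)
			return -1;
		l->items = p;
		l->alloc = n;
	}
	l->items[l->nr++] = v;
	return 0;
}

/*
 * S_release(): release the contents without freeing the structure.
 * In this sketch the structure must be re-initialized before reuse.
 */
void intlist_release(struct intlist *l)
{
	free(l->items);
}

/* S_clear(): S_release() followed by S_init(); directly usable afterwards. */
void intlist_clear(struct intlist *l)
{
	intlist_release(l);
	intlist_init(l);
}

/* S_free(): release the contents and free the structure itself. */
void intlist_free(struct intlist *l)
{
	if (!l)
		return;
	intlist_release(l);
	free(l);
}
```

Under these definitions a function that leaves a stack-allocated struct 
reusable would be spelled `*_clear()`, while `*_free()` would imply that the 
struct itself was heap-allocated and is gone afterwards.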

Thanks

Phillip

>>>>> diff --git a/compat/ivec.h b/compat/ivec.h
>>>>> new file mode 100644
>>>>> index 0000000000..654a05c506
>>>>> --- /dev/null
>>>>> +++ b/compat/ivec.h
>>>>> @@ -0,0 +1,52 @@
>>>>> +#ifndef IVEC_H
>>>>> +#define IVEC_H
>>>>> +
>>>>> +#include <git-compat-util.h>
>>>>
>>>> It would be nice to have some documentation in this header, see the
>>>> examples in strvec.h and hashmap.h
>>>>
>>>>> +#define IVEC_INIT(variable) ivec_init(&(variable), sizeof(*(variable).ptr))
>>>>
>>>> This is a bit cumbersome to use compared to our usual *_INIT macros. I'm
>>>> struggling to see how we can make it nicer though as DEFINE_IVEC_TYPE
>>>> cannot define a per-type initializer macro and we cannot initialize
>>>> the element size without knowing the type.
>>>
>>> I don't see what's cumbersome about it. Maybe an example use case
>>> would clarify things.
>>
>> It is cumbersome because it separates the initialization from the
>> declaration. Normally our *_INIT macros are initializer lists so we can
>> write
>>
>>          struct strbuf buf = STRBUF_INIT;
>>
>> which keeps the declaration and initialization together. Although
>> they're on adjacent lines in your example in real code the
>> initialization is likely to be separated from the declaration by other
>> variable declarations.
> 
> Ah I see what you mean now. I'll experiment with making IVEC_INIT()
> work like that. One wrinkle is that STRBUF_INIT is a single concrete
> type whereas IVEC_INIT() is meant for generic types.

If you can get it to work that would be great, but I can't think of a 
way of getting it to work for a generic type.
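
For readers following along, the difference between the two styles can be 
sketched as below. Everything here is a toy: `struct demo_vec_int`, 
`struct demo_buf`, their field names and the `DEMO_*` macros are assumptions 
made up for the example, and only the shape of the statement-style macro 
mirrors the IVEC_INIT() quoted above.

```c
#include <stddef.h>

/* Toy stand-in for one generated vector type; the field names are assumed. */
struct demo_vec_int {
	int *ptr;
	size_t length;
	size_t capacity;
	size_t element_size;
};

/* Toy stand-in for a single concrete type in the spirit of strbuf. */
struct demo_buf {
	char *buf;
	size_t len;
	size_t alloc;
};

void demo_vec_init(struct demo_vec_int *v, size_t element_size)
{
	v->ptr = NULL;
	v->length = 0;
	v->capacity = 0;
	v->element_size = element_size;
}

/* Statement style: needs an existing variable to take sizeof() from. */
#define DEMO_VEC_INIT(variable) \
	demo_vec_init(&(variable), sizeof(*(variable).ptr))

/* Initializer-list style: only possible because the type is fixed. */
#define DEMO_BUF_INIT { NULL, 0, 0 }

size_t demo_statement_style(void)
{
	struct demo_vec_int v;	/* declared here ... */
	int unrelated = 0;	/* ... other declarations may intervene ... */

	DEMO_VEC_INIT(v);	/* ... and initialized only here */
	(void)unrelated;
	return v.element_size;
}

size_t demo_initializer_style(void)
{
	struct demo_buf b = DEMO_BUF_INIT; /* declared and initialized together */

	return b.len;
}
```

The initializer-list form works for `demo_buf` because nothing in it depends 
on an element type, which is exactly the information a generic vector needs 
at initialization time.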

Thanks

Phillip



* Re: [PATCH 00/10] Xdiff cleanup part 3
  2026-01-02 18:52 [PATCH 00/10] Xdiff cleanup part 3 Ezekiel Newren via GitGitGadget
                   ` (11 preceding siblings ...)
  2026-01-04  6:01 ` Yee Cheng Chin
@ 2026-01-28 14:40 ` Phillip Wood
  2026-03-06 23:03 ` Junio C Hamano
  2026-03-25 21:11 ` [PATCH v2 0/5] " Ezekiel Newren via GitGitGadget
  14 siblings, 0 replies; 78+ messages in thread
From: Phillip Wood @ 2026-01-28 14:40 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget, git; +Cc: Ezekiel Newren

The discussion of this series has got rather spread out so I thought it 
might be helpful to write a summary of my thoughts here.

On 02/01/2026 18:52, Ezekiel Newren via GitGitGadget wrote:
> Patch series summary:
> 
>   * patch 1: Introduce the ivec type

I agree this is a good idea to allow Rust and C code to operate on the 
same data structure. The implementation needs a bit of work to avoid 
undefined behavior.

>   * patch 2: Create the function xdl_do_classic_diff()

This is sensible

>   * patches 3-4: generic cleanup

Patch 3 claims to "stop wasting time" but it introduces an extra pass 
over the input records without any explanation of why that is more 
efficient.

Patch 4 removes the common lines from the beginning and end of the input 
files before passing them on to the patience or histogram algorithms. 
That should speed things up (though we should measure by how much). It 
changes the output because excluding the common lines at beginning and 
end of the file changes the longest sequence of unique context lines in 
the lines that remain. If the different output is easier to read then 
that's clearly a good thing but you would need to do some analysis to 
show that.

>   * patches 5-8: convert from dstart/dend (in xdfile_t) to
>     delta_start/delta_end (in xdfenv_t)

dstart and dend in xdfile_t are the indices of the first and last lines 
that remain after removing any common lines from the beginning and end. 
The proposal is to store the offsets from the beginning and end in 
xdfenv_t instead. Looking at where dstart and dend are used, I think 
storing the indices is more convenient: if we store offsets we end up 
calculating the indices from them, which is a pain and introduces an 
opportunity to make an error.
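
The relationship between the two representations can be sketched as follows. 
This is a simplified stand-alone illustration rather than the code from the 
series: `struct trim_result` and `trim_ends()` are hypothetical names, and 
the inputs are plain arrays of per-line hash values instead of xrecord_t.

```c
#include <stddef.h>

/* Result of trimming, expressed both ways for comparison. */
struct trim_result {
	ptrdiff_t dstart1, dend1;	/* first/last differing index, file 1 */
	ptrdiff_t dstart2, dend2;	/* first/last differing index, file 2 */
	size_t delta_start, delta_end;	/* common lines at the head and tail */
};

struct trim_result trim_ends(const size_t *h1, size_t n1,
			     const size_t *h2, size_t n2)
{
	struct trim_result r;
	size_t lim = n1 < n2 ? n1 : n2;
	size_t head = 0, tail = 0;

	/* Count matching lines at the beginning of both files. */
	while (head < lim && h1[head] == h2[head])
		head++;

	/* Count matching lines at the end, never overlapping the head. */
	lim -= head;
	while (tail < lim && h1[n1 - 1 - tail] == h2[n2 - 1 - tail])
		tail++;

	/* Offset form: one pair of values shared by both files. */
	r.delta_start = head;
	r.delta_end = tail;

	/* Index form: the end index has to be derived per file. */
	r.dstart1 = r.dstart2 = (ptrdiff_t)head;
	r.dend1 = (ptrdiff_t)n1 - 1 - (ptrdiff_t)tail;
	r.dend2 = (ptrdiff_t)n2 - 1 - (ptrdiff_t)tail;
	return r;
}
```

With offsets, every loop that walks the differing region has to repeat the 
`nrec - 1 - delta_end` computation for the file it is looking at, which is 
the extra step the paragraph above is concerned about.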

>   * patches 9-10: move xdl_cleanup_records(), and related, from xprepare.c to
>     xdiffi.c

Here we finally get to use the ivec data structure introduced in patch 1. 
However it is just replacing a fixed size array and so does not 
demonstrate the more interesting parts of the API which concern growing 
the array as we push more elements to it. I'm also not convinced by the 
claim that this change saves time as it introduces an extra pass over 
the input records.

Overall I struggled to see how the cleanups proposed here linked to the 
introduction of the ivec data structure.

Thanks

Phillip

> Things that will be addressed in future patch series:
> 
>   * Make xdl_cleanup_records() easier to read
>   * convert recs/nrec into an ivec
>   * convert changed to an ivec
>   * remove reference_index/nreff from xdfile_t and turn it into an ivec
>   * splitting minimal_perfect_hash out as its own ivec
>   * improve the performance of the classifier and parsing/hashing lines
> 
> === before this patch series
>
> typedef struct s_xdfile {
> 	xrecord_t *recs;
> 	size_t nrec;
> 	ptrdiff_t dstart, dend;
> 	bool *changed;
> 	size_t *reference_index;
> 	size_t nreff;
> } xdfile_t;
>
> typedef struct s_xdfenv {
> 	xdfile_t xdf1, xdf2;
> } xdfenv_t;
>
> === after this patch series
>
> typedef struct s_xdfile {
> 	xrecord_t *recs;
> 	size_t nrec;
> 	bool *changed;
> 	size_t *reference_index;
> 	size_t nreff;
> } xdfile_t;
>
> typedef struct s_xdfenv {
> 	xdfile_t xdf1, xdf2;
> 	size_t delta_start, delta_end;
> 	size_t mph_size;
> } xdfenv_t;
> 
> Ezekiel Newren (10):
>    ivec: introduce the C side of ivec
>    xdiff: make classic diff explicit by creating xdl_do_classic_diff()
>    xdiff: don't waste time guessing the number of lines
>    xdiff: let patience and histogram benefit from xdl_trim_ends()
>    xdiff: use xdfenv_t in xdl_trim_ends() and xdl_cleanup_records()
>    xdiff: cleanup xdl_trim_ends()
>    xdiff: replace xdfile_t.dstart with xdfenv_t.delta_start
>    xdiff: replace xdfile_t.dend with xdfenv_t.delta_end
>    xdiff: remove dependence on xdlclassifier from xdl_cleanup_records()
>    xdiff: move xdl_cleanup_records() from xprepare.c to xdiffi.c
> 
>   Makefile           |   1 +
>   compat/ivec.c      | 113 ++++++++++++++++++
>   compat/ivec.h      |  52 +++++++++
>   meson.build        |   1 +
>   xdiff/xdiffi.c     | 221 +++++++++++++++++++++++++++++++++---
>   xdiff/xdiffi.h     |   1 +
>   xdiff/xhistogram.c |   7 +-
>   xdiff/xpatience.c  |   7 +-
>   xdiff/xprepare.c   | 277 ++++++++-------------------------------------
>   xdiff/xtypes.h     |   3 +-
>   xdiff/xutils.c     |  20 ----
>   xdiff/xutils.h     |   1 -
>   12 files changed, 432 insertions(+), 272 deletions(-)
>   create mode 100644 compat/ivec.c
>   create mode 100644 compat/ivec.h
> 
> 
> base-commit: 66ce5f8e8872f0183bb137911c52b07f1f242d13
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-2156%2Fezekielnewren%2Fxdiff-cleanup-3-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-2156/ezekielnewren/xdiff-cleanup-3-v1
> Pull-Request: https://github.com/git/git/pull/2156



* Re: [PATCH 00/10] Xdiff cleanup part 3
  2026-01-02 18:52 [PATCH 00/10] Xdiff cleanup part 3 Ezekiel Newren via GitGitGadget
                   ` (12 preceding siblings ...)
  2026-01-28 14:40 ` Phillip Wood
@ 2026-03-06 23:03 ` Junio C Hamano
  2026-03-09 19:06   ` Ezekiel Newren
  2026-03-25 21:11 ` [PATCH v2 0/5] " Ezekiel Newren via GitGitGadget
  14 siblings, 1 reply; 78+ messages in thread
From: Junio C Hamano @ 2026-03-06 23:03 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget; +Cc: git, Ezekiel Newren

"Ezekiel Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Patch series summary:
>
>  * patch 1: Introduce the ivec type
>  * patch 2: Create the function xdl_do_classic_diff()
>  * patches 3-4: generic cleanup
>  * patches 5-8: convert from dstart/dend (in xdfile_t) to
>    delta_start/delta_end (in xdfenv_t)
>  * patches 9-10: move xdl_cleanup_records(), and related, from xprepare.c to
>    xdiffi.c

Is this topic still viable?

We had to stop merging this series to the integration branches as
another topic <cover.1769424529.git.phillip.wood@dunelm.org.uk> with
smaller footprint was making conflicting clean-up.  Since the other
topic was merged at 5465d368 (Merge branch 'pw/xdiff-cleanups',
2026-02-20) a few weeks ago, we may want to resurrect this topic by
rebasing on top of a more recent 'master' branch.

Thanks.


* Re: [PATCH 00/10] Xdiff cleanup part 3
  2026-03-06 23:03 ` Junio C Hamano
@ 2026-03-09 19:06   ` Ezekiel Newren
  2026-03-09 23:31     ` Junio C Hamano
  0 siblings, 1 reply; 78+ messages in thread
From: Ezekiel Newren @ 2026-03-09 19:06 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Ezekiel Newren via GitGitGadget, git

On Fri, Mar 6, 2026 at 4:03 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Ezekiel Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > Patch series summary:
> >
> >  * patch 1: Introduce the ivec type
> >  * patch 2: Create the function xdl_do_classic_diff()
> >  * patches 3-4: generic cleanup
> >  * patches 5-8: convert from dstart/dend (in xdfile_t) to
> >    delta_start/delta_end (in xdfenv_t)
> >  * patches 9-10: move xdl_cleanup_records(), and related, from xprepare.c to
> >    xdiffi.c
>
> Is this topic still viable?
>
> We had to stop merging this series to the integration branches as
> another topic <cover.1769424529.git.phillip.wood@dunelm.org.uk> with
> smaller footprint was making conflicting clean-up.  Since the other
> topic was merged at 5465d368 (Merge branch 'pw/xdiff-cleanups',
> 2026-02-20) a few weeks ago, we may want to resurrect this topic by
> rebasing on top of a more recent 'master' branch.
>
> Thanks.

I plan on rebasing on top of master with lots of changes. v2 will be
quite different from v1.


* Re: [PATCH 00/10] Xdiff cleanup part 3
  2026-03-09 19:06   ` Ezekiel Newren
@ 2026-03-09 23:31     ` Junio C Hamano
  0 siblings, 0 replies; 78+ messages in thread
From: Junio C Hamano @ 2026-03-09 23:31 UTC (permalink / raw)
  To: Ezekiel Newren; +Cc: Ezekiel Newren via GitGitGadget, git

Ezekiel Newren <ezekielnewren@gmail.com> writes:

> I plan on rebasing on top of master with lots of changes. v2 will be
> quite different from v1.

Thanks.


* [PATCH v2 0/5] Xdiff cleanup part 3
  2026-01-02 18:52 [PATCH 00/10] Xdiff cleanup part 3 Ezekiel Newren via GitGitGadget
                   ` (13 preceding siblings ...)
  2026-03-06 23:03 ` Junio C Hamano
@ 2026-03-25 21:11 ` Ezekiel Newren via GitGitGadget
  2026-03-25 21:11   ` [PATCH v2 1/5] xdiff/xdl_cleanup_records: delete local recs pointer Ezekiel Newren via GitGitGadget
                     ` (6 more replies)
  14 siblings, 7 replies; 78+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2026-03-25 21:11 UTC (permalink / raw)
  To: git
  Cc: Yee Cheng Chin, Phillip Wood, René Scharfe, Jeff King,
	D. Ben Knoble, Ezekiel Newren

v2 is a radical departure from v1. Changes in v2:

 * make the flow of xdl_cleanup_records() easier to follow

There is no performance or behavioral change introduced in this patch
series.
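
As a rough spot check of that claim for patch 2 ("make limits more clear"), 
the old and new ways of choosing an action can be compared side by side. 
This is a stand-alone sketch, not the xdiff code: `demo_bogosqrt()`, 
`DEMO_MAX_EQLIMIT` (assumed to be 1024 here), the numeric values of 
DISCARD/KEEP/INVESTIGATE and both `classify_*()` helpers are stand-ins 
written for the illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values; only the three names come from the patch. */
enum { DISCARD = 0, KEEP = 1, INVESTIGATE = 2 };

#define DEMO_MAX_EQLIMIT 1024	/* assumed cap on the match limit */

/* Stand-in for xdl_bogosqrt(): a shift-based square root approximation. */
long demo_bogosqrt(long n)
{
	long i;

	for (i = 1; n > 0; n >>= 2)
		i <<= 1;
	return i;
}

/* Shape before patch 2: need_min is tested for every record. */
int classify_old(long nm, long nrec, bool need_min)
{
	long mlim = demo_bogosqrt(nrec);

	if (mlim > DEMO_MAX_EQLIMIT)
		mlim = DEMO_MAX_EQLIMIT;
	return (nm == 0) ? DISCARD :
	       (nm >= mlim && !need_min) ? INVESTIGATE : KEEP;
}

/* Shape after patch 2: need_min is folded into an "infinite" limit. */
int classify_new(long nm, long nrec, bool need_min)
{
	size_t mlim;

	if (need_min) {
		mlim = SIZE_MAX;	/* i.e. infinity: never INVESTIGATE */
	} else {
		long s = demo_bogosqrt(nrec);

		mlim = (size_t)(s < DEMO_MAX_EQLIMIT ? s : DEMO_MAX_EQLIMIT);
	}
	/* nm is a non-negative count, so the cast is safe in this sketch. */
	return (nm == 0) ? DISCARD :
	       ((size_t)nm >= mlim) ? INVESTIGATE : KEEP;
}
```

The two helpers agree as long as the match count stays below SIZE_MAX, which 
holds for any count that fits in a non-negative long on common platforms.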

=== original cover letter below ===

Patch series summary:

 * patch 1: Introduce the ivec type
 * patch 2: Create the function xdl_do_classic_diff()
 * patches 3-4: generic cleanup
 * patches 5-8: convert from dstart/dend (in xdfile_t) to
   delta_start/delta_end (in xdfenv_t)
 * patches 9-10: move xdl_cleanup_records(), and related, from xprepare.c to
   xdiffi.c

Things that will be addressed in future patch series:

 * Make xdl_cleanup_records() easier to read
 * convert recs/nrec into an ivec
 * convert changed to an ivec
 * remove reference_index/nreff from xdfile_t and turn it into an ivec
 * splitting minimal_perfect_hash out as its own ivec
 * improve the performance of the classifier and parsing/hashing lines

=== before this patch series

typedef struct s_xdfile {
	xrecord_t *recs;
	size_t nrec;
	ptrdiff_t dstart, dend;
	bool *changed;
	size_t *reference_index;
	size_t nreff;
} xdfile_t;

typedef struct s_xdfenv {
	xdfile_t xdf1, xdf2;
} xdfenv_t;

=== after this patch series

typedef struct s_xdfile {
	xrecord_t *recs;
	size_t nrec;
	bool *changed;
	size_t *reference_index;
	size_t nreff;
} xdfile_t;

typedef struct s_xdfenv {
	xdfile_t xdf1, xdf2;
	size_t delta_start, delta_end;
	size_t mph_size;
} xdfenv_t;

Ezekiel Newren (5):
  xdiff/xdl_cleanup_records: delete local recs pointer
  xdiff/xdl_cleanup_records: make limits more clear
  xdiff/xdl_cleanup_records: make setting action easier to follow
  xdiff/xdl_cleanup_records: simplify INVESTIGATE handling for clarity
  xdiff/xdl_cleanup_records: use unambiguous types

 xdiff/xprepare.c | 89 ++++++++++++++++++++++++++++++++----------------
 1 file changed, 59 insertions(+), 30 deletions(-)


base-commit: ca1db8a0f7dc0dbea892e99f5b37c5fe5861be71
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-2156%2Fezekielnewren%2Fxdiff-cleanup-3-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-2156/ezekielnewren/xdiff-cleanup-3-v2
Pull-Request: https://github.com/git/git/pull/2156

Range-diff vs v1:

  1:  adf1395d20 <  -:  ---------- ivec: introduce the C side of ivec
  2:  9bd01bce9f <  -:  ---------- xdiff: make classic diff explicit by creating xdl_do_classic_diff()
  3:  53e4840c16 <  -:  ---------- xdiff: don't waste time guessing the number of lines
  4:  70040ea135 <  -:  ---------- xdiff: let patience and histogram benefit from xdl_trim_ends()
  5:  742f2d381a !  1:  8f9165d477 xdiff: use xdfenv_t in xdl_trim_ends() and xdl_cleanup_records()
     @@ Metadata
      Author: Ezekiel Newren <ezekielnewren@gmail.com>
      
       ## Commit message ##
     -    xdiff: use xdfenv_t in xdl_trim_ends() and xdl_cleanup_records()
     +    xdiff/xdl_cleanup_records: delete local recs pointer
      
     -    View with --color-words. Prepare these functions to use the fields:
     -    delta_start, delta_end. A future patch will add these fields to
     -    xdfenv_t.
     +    Simplify the first 2 for loops by directly indexing the xdfile.recs.
     +    recs is unused in the last 2 for loops, remove it. Best viewed with
     +    --color-words.
      
          Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
      
       ## xdiff/xprepare.c ##
      @@ xdiff/xprepare.c: static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
     -  * matches on the other file. Also, lines that have multiple matches
     -  * might be potentially discarded if they appear in a run of discardable.
        */
     --static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xdf2) {
     -+static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
     + static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xdf2) {
       	long i, nm, mlim;
     - 	xrecord_t *recs;
     +-	xrecord_t *recs;
       	xdlclass_t *rcrec;
     + 	uint8_t *action1 = NULL, *action2 = NULL;
     + 	bool need_min = !!(cf->flags & XDF_NEED_MINIMAL);
      @@ xdiff/xprepare.c: static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
     - 	 * Create temporary arrays that will help us decide if
     - 	 * changed[i] should remain false, or become true.
       	 */
     --	if (!XDL_CALLOC_ARRAY(action1, xdf1->nrec + 1)) {
     -+	if (!XDL_CALLOC_ARRAY(action1, xe->xdf1.nrec + 1)) {
     - 		ret = -1;
     - 		goto cleanup;
     - 	}
     --	if (!XDL_CALLOC_ARRAY(action2, xdf2->nrec + 1)) {
     -+	if (!XDL_CALLOC_ARRAY(action2, xe->xdf2.nrec + 1)) {
     - 		ret = -1;
     - 		goto cleanup;
     - 	}
     -@@ xdiff/xprepare.c: static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
     - 	/*
     - 	 * Initialize temporary arrays with DISCARD, KEEP, or INVESTIGATE.
     - 	 */
     --	if ((mlim = xdl_bogosqrt((long)xdf1->nrec)) > XDL_MAX_EQLIMIT)
     -+	if ((mlim = xdl_bogosqrt((long)xe->xdf1.nrec)) > XDL_MAX_EQLIMIT)
     + 	if ((mlim = xdl_bogosqrt((long)xdf1->nrec)) > XDL_MAX_EQLIMIT)
       		mlim = XDL_MAX_EQLIMIT;
      -	for (i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart]; i <= xdf1->dend; i++, recs++) {
     -+	for (i = xe->xdf1.dstart, recs = &xe->xdf1.recs[xe->xdf1.dstart]; i <= xe->xdf1.dend; i++, recs++) {
     - 		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
     +-		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
     ++	for (i = xdf1->dstart; i <= xdf1->dend; i++) {
     ++		size_t mph1 = xdf1->recs[i].minimal_perfect_hash;
     ++		rcrec = cf->rcrecs[mph1];
       		nm = rcrec ? rcrec->len2 : 0;
       		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
       	}
       
     --	if ((mlim = xdl_bogosqrt((long)xdf2->nrec)) > XDL_MAX_EQLIMIT)
     -+	if ((mlim = xdl_bogosqrt((long)xe->xdf2.nrec)) > XDL_MAX_EQLIMIT)
     + 	if ((mlim = xdl_bogosqrt((long)xdf2->nrec)) > XDL_MAX_EQLIMIT)
       		mlim = XDL_MAX_EQLIMIT;
      -	for (i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart]; i <= xdf2->dend; i++, recs++) {
     -+	for (i = xe->xdf2.dstart, recs = &xe->xdf2.recs[xe->xdf2.dstart]; i <= xe->xdf2.dend; i++, recs++) {
     - 		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
     +-		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
     ++	for (i = xdf2->dstart; i <= xdf2->dend; i++) {
     ++		size_t mph2 = xdf2->recs[i].minimal_perfect_hash;
     ++		rcrec = cf->rcrecs[mph2];
       		nm = rcrec ? rcrec->len1 : 0;
       		action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
     + 	}
      @@ xdiff/xprepare.c: static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
     - 	 * Use temporary arrays to decide if changed[i] should remain
       	 * false, or become true.
       	 */
     --	xdf1->nreff = 0;
     + 	xdf1->nreff = 0;
      -	for (i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart];
      -	     i <= xdf1->dend; i++, recs++) {
     -+	xe->xdf1.nreff = 0;
     -+	for (i = xe->xdf1.dstart, recs = &xe->xdf1.recs[xe->xdf1.dstart];
     -+	     i <= xe->xdf1.dend; i++, recs++) {
     ++	for (i = xdf1->dstart; i <= xdf1->dend; i++) {
       		if (action1[i] == KEEP ||
     --		    (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xdf1->dstart, xdf1->dend))) {
     --			xdf1->reference_index[xdf1->nreff++] = i;
     -+		    (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xe->xdf1.dstart, xe->xdf1.dend))) {
     -+			xe->xdf1.reference_index[xe->xdf1.nreff++] = i;
     - 			/* changed[i] remains false, i.e. keep */
     - 		} else
     --			xdf1->changed[i] = true;
     -+			xe->xdf1.changed[i] = true;
     - 			/* i.e. discard */
     + 		    (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xdf1->dstart, xdf1->dend))) {
     + 			xdf1->reference_index[xdf1->nreff++] = i;
     +@@ xdiff/xprepare.c: static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
       	}
       
     --	xdf2->nreff = 0;
     + 	xdf2->nreff = 0;
      -	for (i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart];
      -	     i <= xdf2->dend; i++, recs++) {
     -+	xe->xdf2.nreff = 0;
     -+	for (i = xe->xdf2.dstart, recs = &xe->xdf2.recs[xe->xdf2.dstart];
     -+	     i <= xe->xdf2.dend; i++, recs++) {
     ++	for (i = xdf2->dstart; i <= xdf2->dend; i++) {
       		if (action2[i] == KEEP ||
     --		    (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xdf2->dstart, xdf2->dend))) {
     --			xdf2->reference_index[xdf2->nreff++] = i;
     -+		    (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xe->xdf2.dstart, xe->xdf2.dend))) {
     -+			xe->xdf2.reference_index[xe->xdf2.nreff++] = i;
     - 			/* changed[i] remains false, i.e. keep */
     - 		} else
     --			xdf2->changed[i] = true;
     -+			xe->xdf2.changed[i] = true;
     - 			/* i.e. discard */
     - 	}
     - 
     -@@ xdiff/xprepare.c: cleanup:
     - /*
     -  * Early trim initial and terminal matching records.
     -  */
     --static int xdl_trim_ends(xdfile_t *xdf1, xdfile_t *xdf2) {
     -+static int xdl_trim_ends(xdfenv_t *xe) {
     - 	long i, lim;
     - 	xrecord_t *recs1, *recs2;
     - 
     --	recs1 = xdf1->recs;
     --	recs2 = xdf2->recs;
     --	for (i = 0, lim = (long)XDL_MIN(xdf1->nrec, xdf2->nrec); i < lim;
     -+	recs1 = xe->xdf1.recs;
     -+	recs2 = xe->xdf2.recs;
     -+	for (i = 0, lim = (long)XDL_MIN(xe->xdf1.nrec, xe->xdf2.nrec); i < lim;
     - 	     i++, recs1++, recs2++)
     - 		if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
     - 			break;
     - 
     --	xdf1->dstart = xdf2->dstart = i;
     -+	xe->xdf1.dstart = xe->xdf2.dstart = i;
     - 
     --	recs1 = xdf1->recs + xdf1->nrec - 1;
     --	recs2 = xdf2->recs + xdf2->nrec - 1;
     -+	recs1 = xe->xdf1.recs + xe->xdf1.nrec - 1;
     -+	recs2 = xe->xdf2.recs + xe->xdf2.nrec - 1;
     - 	for (lim -= i, i = 0; i < lim; i++, recs1--, recs2--)
     - 		if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
     - 			break;
     - 
     --	xdf1->dend = (long)xdf1->nrec - i - 1;
     --	xdf2->dend = (long)xdf2->nrec - i - 1;
     -+	xe->xdf1.dend = (long)xe->xdf1.nrec - i - 1;
     -+	xe->xdf2.dend = (long)xe->xdf2.nrec - i - 1;
     - 
     - 	return 0;
     - }
     -@@ xdiff/xprepare.c: int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
     - 		xdl_classify_record(2, &cf, rec);
     - 	}
     - 
     --	xdl_trim_ends(&xe->xdf1, &xe->xdf2);
     -+	xdl_trim_ends(xe);
     - 	if ((XDF_DIFF_ALG(xpp->flags) != XDF_PATIENCE_DIFF) &&
     - 	    (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF) &&
     --	    xdl_cleanup_records(&cf, &xe->xdf1, &xe->xdf2) < 0) {
     -+	    xdl_cleanup_records(&cf, xe) < 0) {
     - 
     - 		xdl_free_ctx(&xe->xdf2);
     - 		xdl_free_ctx(&xe->xdf1);
     + 		    (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xdf2->dstart, xdf2->dend))) {
     + 			xdf2->reference_index[xdf2->nreff++] = i;
  6:  65da408da9 <  -:  ---------- xdiff: cleanup xdl_trim_ends()
  7:  d74722538b <  -:  ---------- xdiff: replace xdfile_t.dstart with xdfenv_t.delta_start
  8:  d0ef5b23c4 <  -:  ---------- xdiff: replace xdfile_t.dend with xdfenv_t.delta_end
  9:  f9b10e71d2 !  2:  62adaa8e5a xdiff: remove dependence on xdlclassifier from xdl_cleanup_records()
     @@ Metadata
      Author: Ezekiel Newren <ezekielnewren@gmail.com>
      
       ## Commit message ##
     -    xdiff: remove dependence on xdlclassifier from xdl_cleanup_records()
     +    xdiff/xdl_cleanup_records: make limits more clear
      
     -    Disentangle xdl_cleanup_records() from the classifier so that it can be
     -    moved from xprepare.c into xdiffi.c.
     -
     -    The classic diff is the only algorithm that needs to count the number
     -    of times each line occurs in each file. Make xdl_cleanup_records()
     -    count the number of lines instead of the classifier so it won't slow
     -    down patience or histogram.
     +    Make the handling of per-file limits and the minimal-case clearer.
     +      * Use explicit per-file limit variables (mlim1, mlim2) and initialize
     +        them.
     +      * The additional condition `!need_min` is redundant now, remove it.
     +    Best viewed with --color-words.
      
          Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
      
       ## xdiff/xprepare.c ##
     -@@
     -  */
     - 
     - #include "xinclude.h"
     -+#include "compat/ivec.h"
     - 
     - 
     - #define XDL_KPDIS_RUN 4
     -@@ xdiff/xprepare.c: typedef struct s_xdlclass {
     - 	struct s_xdlclass *next;
     - 	xrecord_t rec;
     - 	long idx;
     --	long len1, len2;
     - } xdlclass_t;
     - 
     - typedef struct s_xdlclassifier {
     -@@ xdiff/xprepare.c: static void xdl_free_classifier(xdlclassifier_t *cf) {
     - }
     - 
     - 
     --static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t *rec) {
     -+static int xdl_classify_record(xdlclassifier_t *cf, xrecord_t *rec) {
     - 	size_t hi;
     - 	xdlclass_t *rcrec;
     - 
     -@@ xdiff/xprepare.c: static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
     - 				return -1;
     - 		cf->rcrecs[rcrec->idx] = rcrec;
     - 		rcrec->rec = *rec;
     --		rcrec->len1 = rcrec->len2 = 0;
     - 		rcrec->next = cf->rchash[hi];
     - 		cf->rchash[hi] = rcrec;
     - 	}
     - 
     --	(pass == 1) ? rcrec->len1++ : rcrec->len2++;
     --
     - 	rec->minimal_perfect_hash = (size_t)rcrec->idx;
     - 
     - 	return 0;
      @@ xdiff/xprepare.c: static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
     - 	return rpdis1 * XDL_KPDIS_RUN < (rpdis1 + rdis1);
     - }
     - 
     -+struct xoccurrence
     -+{
     -+	size_t file1, file2;
     -+};
     -+
     -+
     -+DEFINE_IVEC_TYPE(struct xoccurrence, xoccurrence);
     -+
     - 
     - /*
     -  * Try to reduce the problem complexity, discard records that have no
     -  * matches on the other file. Also, lines that have multiple matches
        * might be potentially discarded if they appear in a run of discardable.
        */
     --static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
     + static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xdf2) {
      -	long i, nm, mlim;
     -+static int xdl_cleanup_records(xdfenv_t *xe, uint64_t flags) {
     -+	long i;
     -+	size_t nm, mlim;
     - 	xrecord_t *recs;
     --	xdlclass_t *rcrec;
     ++	long i, nm;
     ++	size_t mlim1, mlim2;
     + 	xdlclass_t *rcrec;
       	uint8_t *action1 = NULL, *action2 = NULL;
     --	bool need_min = !!(cf->flags & XDF_NEED_MINIMAL);
     -+	struct IVec_xoccurrence occ;
     -+	bool need_min = !!(flags & XDF_NEED_MINIMAL);
     - 	int ret = 0;
     - 	ptrdiff_t dend1 = xe->xdf1.nrec - 1 - xe->delta_end;
     - 	ptrdiff_t dend2 = xe->xdf2.nrec - 1 - xe->delta_end;
     + 	bool need_min = !!(cf->flags & XDF_NEED_MINIMAL);
     +@@ xdiff/xprepare.c: static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
     + 		goto cleanup;
     + 	}
       
     -+	IVEC_INIT(occ);
     -+	ivec_zero(&occ, xe->mph_size);
     -+
     -+	for (size_t j = 0; j < xe->xdf1.nrec; j++) {
     -+		size_t mph1 = xe->xdf1.recs[j].minimal_perfect_hash;
     -+		occ.ptr[mph1].file1 += 1;
     -+	}
     -+
     -+	for (size_t j = 0; j < xe->xdf2.nrec; j++) {
     -+		size_t mph2 = xe->xdf2.recs[j].minimal_perfect_hash;
     -+		occ.ptr[mph2].file2 += 1;
     ++	if (need_min) {
     ++		/* i.e. infinity */
     ++		mlim1 = SIZE_MAX;
     ++		mlim2 = SIZE_MAX;
     ++	} else {
     ++		mlim1 = XDL_MIN(xdl_bogosqrt(xdf1->nrec), XDL_MAX_EQLIMIT);
     ++		mlim2 = XDL_MIN(xdl_bogosqrt(xdf2->nrec), XDL_MAX_EQLIMIT);
      +	}
      +
       	/*
     - 	 * Create temporary arrays that will help us decide if
     - 	 * changed[i] should remain false, or become true.
     -@@ xdiff/xprepare.c: static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
     - 	if ((mlim = xdl_bogosqrt((long)xe->xdf1.nrec)) > XDL_MAX_EQLIMIT)
     - 		mlim = XDL_MAX_EQLIMIT;
     - 	for (i = xe->delta_start, recs = &xe->xdf1.recs[xe->delta_start]; i <= dend1; i++, recs++) {
     --		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
     --		nm = rcrec ? rcrec->len2 : 0;
     -+		nm = occ.ptr[recs->minimal_perfect_hash].file2;
     - 		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
     + 	 * Initialize temporary arrays with DISCARD, KEEP, or INVESTIGATE.
     + 	 */
     +-	if ((mlim = xdl_bogosqrt((long)xdf1->nrec)) > XDL_MAX_EQLIMIT)
     +-		mlim = XDL_MAX_EQLIMIT;
     + 	for (i = xdf1->dstart; i <= xdf1->dend; i++) {
     + 		size_t mph1 = xdf1->recs[i].minimal_perfect_hash;
     + 		rcrec = cf->rcrecs[mph1];
     + 		nm = rcrec ? rcrec->len2 : 0;
     +-		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
     ++		action1[i] = (nm == 0) ? DISCARD: nm >= mlim1 ? INVESTIGATE: KEEP;
       	}
       
     - 	if ((mlim = xdl_bogosqrt((long)xe->xdf2.nrec)) > XDL_MAX_EQLIMIT)
     - 		mlim = XDL_MAX_EQLIMIT;
     - 	for (i = xe->delta_start, recs = &xe->xdf2.recs[xe->delta_start]; i <= dend2; i++, recs++) {
     --		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
     --		nm = rcrec ? rcrec->len1 : 0;
     -+		nm = occ.ptr[recs->minimal_perfect_hash].file1;
     - 		action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
     - 	}
     - 
     -@@ xdiff/xprepare.c: static int xdl_cleanup_records(xdlclassifier_t *cf, xdfenv_t *xe) {
     - cleanup:
     - 	xdl_free(action1);
     - 	xdl_free(action2);
     -+	ivec_free(&occ);
     - 
     - 	return ret;
     - }
     -@@ xdiff/xprepare.c: int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
     - 
     - 	for (size_t i = 0; i < xe->xdf1.nrec; i++) {
     - 		xrecord_t *rec = &xe->xdf1.recs[i];
     --		xdl_classify_record(1, &cf, rec);
     -+		xdl_classify_record(&cf, rec);
     +-	if ((mlim = xdl_bogosqrt((long)xdf2->nrec)) > XDL_MAX_EQLIMIT)
     +-		mlim = XDL_MAX_EQLIMIT;
     + 	for (i = xdf2->dstart; i <= xdf2->dend; i++) {
     + 		size_t mph2 = xdf2->recs[i].minimal_perfect_hash;
     + 		rcrec = cf->rcrecs[mph2];
     + 		nm = rcrec ? rcrec->len1 : 0;
     +-		action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
     ++		action2[i] = (nm == 0) ? DISCARD: nm >= mlim2 ? INVESTIGATE: KEEP;
       	}
       
     - 	for (size_t i = 0; i < xe->xdf2.nrec; i++) {
     - 		xrecord_t *rec = &xe->xdf2.recs[i];
     --		xdl_classify_record(2, &cf, rec);
     -+		xdl_classify_record(&cf, rec);
     - 	}
     - 
     -+	xe->mph_size = cf.count;
     -+
     - 	xdl_trim_ends(xe);
     - 	if ((XDF_DIFF_ALG(xpp->flags) != XDF_PATIENCE_DIFF) &&
     - 	    (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF) &&
     --	    xdl_cleanup_records(&cf, xe) < 0) {
     -+	    xdl_cleanup_records(xe, xpp->flags) < 0) {
     - 
     - 		xdl_free_ctx(&xe->xdf2);
     - 		xdl_free_ctx(&xe->xdf1);
     -
     - ## xdiff/xtypes.h ##
     -@@ xdiff/xtypes.h: typedef struct s_xdfile {
     - typedef struct s_xdfenv {
     - 	xdfile_t xdf1, xdf2;
     - 	size_t delta_start, delta_end;
     -+	size_t mph_size;
     - } xdfenv_t;
     - 
     - 
     + 	/*
 10:  1dba6b34aa <  -:  ---------- xdiff: move xdl_cleanup_records() from xprepare.c to xdiffi.c
  -:  ---------- >  3:  8be7e4781a xdiff/xdl_cleanup_records: make setting action easier to follow
  -:  ---------- >  4:  6abd052c34 xdiff/xdl_cleanup_records: simplify INVESTIGATE handling for clarity
  -:  ---------- >  5:  a52787f019 xdiff/xdl_cleanup_records: use unambiguous types

-- 
gitgitgadget


* [PATCH v2 1/5] xdiff/xdl_cleanup_records: delete local recs pointer
  2026-03-25 21:11 ` [PATCH v2 0/5] " Ezekiel Newren via GitGitGadget
@ 2026-03-25 21:11   ` Ezekiel Newren via GitGitGadget
  2026-03-25 21:11   ` [PATCH v2 2/5] xdiff/xdl_cleanup_records: make limits more clear Ezekiel Newren via GitGitGadget
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 78+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2026-03-25 21:11 UTC (permalink / raw)
  To: git
  Cc: Yee Cheng Chin, Phillip Wood, René Scharfe, Jeff King,
	D. Ben Knoble, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

Simplify the first two for loops by indexing xdfile.recs directly.
The local recs pointer is unused in the last two for loops, so remove
it. Best viewed with --color-words.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xprepare.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index cd4fc405eb..d6e1901d2d 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -269,7 +269,6 @@ static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
  */
 static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xdf2) {
 	long i, nm, mlim;
-	xrecord_t *recs;
 	xdlclass_t *rcrec;
 	uint8_t *action1 = NULL, *action2 = NULL;
 	bool need_min = !!(cf->flags & XDF_NEED_MINIMAL);
@@ -293,16 +292,18 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	 */
 	if ((mlim = xdl_bogosqrt((long)xdf1->nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
-	for (i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart]; i <= xdf1->dend; i++, recs++) {
-		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
+	for (i = xdf1->dstart; i <= xdf1->dend; i++) {
+		size_t mph1 = xdf1->recs[i].minimal_perfect_hash;
+		rcrec = cf->rcrecs[mph1];
 		nm = rcrec ? rcrec->len2 : 0;
 		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
 	}
 
 	if ((mlim = xdl_bogosqrt((long)xdf2->nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
-	for (i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart]; i <= xdf2->dend; i++, recs++) {
-		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
+	for (i = xdf2->dstart; i <= xdf2->dend; i++) {
+		size_t mph2 = xdf2->recs[i].minimal_perfect_hash;
+		rcrec = cf->rcrecs[mph2];
 		nm = rcrec ? rcrec->len1 : 0;
 		action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
 	}
@@ -312,8 +313,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	 * false, or become true.
 	 */
 	xdf1->nreff = 0;
-	for (i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart];
-	     i <= xdf1->dend; i++, recs++) {
+	for (i = xdf1->dstart; i <= xdf1->dend; i++) {
 		if (action1[i] == KEEP ||
 		    (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xdf1->dstart, xdf1->dend))) {
 			xdf1->reference_index[xdf1->nreff++] = i;
@@ -324,8 +324,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	}
 
 	xdf2->nreff = 0;
-	for (i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart];
-	     i <= xdf2->dend; i++, recs++) {
+	for (i = xdf2->dstart; i <= xdf2->dend; i++) {
 		if (action2[i] == KEEP ||
 		    (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xdf2->dstart, xdf2->dend))) {
 			xdf2->reference_index[xdf2->nreff++] = i;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v2 2/5] xdiff/xdl_cleanup_records: make limits more clear
  2026-03-25 21:11 ` [PATCH v2 0/5] " Ezekiel Newren via GitGitGadget
  2026-03-25 21:11   ` [PATCH v2 1/5] xdiff/xdl_cleanup_records: delete local recs pointer Ezekiel Newren via GitGitGadget
@ 2026-03-25 21:11   ` Ezekiel Newren via GitGitGadget
  2026-03-25 21:11   ` [PATCH v2 3/5] xdiff/xdl_cleanup_records: make setting action easier to follow Ezekiel Newren via GitGitGadget
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 78+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2026-03-25 21:11 UTC (permalink / raw)
  To: git
  Cc: Yee Cheng Chin, Phillip Wood, René Scharfe, Jeff King,
	D. Ben Knoble, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

Make the handling of per-file limits and the minimal-case clearer.
  * Use explicit per-file limit variables (mlim1, mlim2) and initialize
    them.
  * The additional condition `!need_min` is now redundant; remove it.
Best viewed with --color-words.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xprepare.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index d6e1901d2d..756a5b8dcc 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -268,7 +268,8 @@ static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
  * might be potentially discarded if they appear in a run of discardable.
  */
 static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xdf2) {
-	long i, nm, mlim;
+	long i, nm;
+	size_t mlim1, mlim2;
 	xdlclass_t *rcrec;
 	uint8_t *action1 = NULL, *action2 = NULL;
 	bool need_min = !!(cf->flags & XDF_NEED_MINIMAL);
@@ -287,25 +288,30 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 		goto cleanup;
 	}
 
+	if (need_min) {
+		/* i.e. infinity */
+		mlim1 = SIZE_MAX;
+		mlim2 = SIZE_MAX;
+	} else {
+		mlim1 = XDL_MIN(xdl_bogosqrt(xdf1->nrec), XDL_MAX_EQLIMIT);
+		mlim2 = XDL_MIN(xdl_bogosqrt(xdf2->nrec), XDL_MAX_EQLIMIT);
+	}
+
 	/*
 	 * Initialize temporary arrays with DISCARD, KEEP, or INVESTIGATE.
 	 */
-	if ((mlim = xdl_bogosqrt((long)xdf1->nrec)) > XDL_MAX_EQLIMIT)
-		mlim = XDL_MAX_EQLIMIT;
 	for (i = xdf1->dstart; i <= xdf1->dend; i++) {
 		size_t mph1 = xdf1->recs[i].minimal_perfect_hash;
 		rcrec = cf->rcrecs[mph1];
 		nm = rcrec ? rcrec->len2 : 0;
-		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
+		action1[i] = (nm == 0) ? DISCARD: nm >= mlim1 ? INVESTIGATE: KEEP;
 	}
 
-	if ((mlim = xdl_bogosqrt((long)xdf2->nrec)) > XDL_MAX_EQLIMIT)
-		mlim = XDL_MAX_EQLIMIT;
 	for (i = xdf2->dstart; i <= xdf2->dend; i++) {
 		size_t mph2 = xdf2->recs[i].minimal_perfect_hash;
 		rcrec = cf->rcrecs[mph2];
 		nm = rcrec ? rcrec->len1 : 0;
-		action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
+		action2[i] = (nm == 0) ? DISCARD: nm >= mlim2 ? INVESTIGATE: KEEP;
 	}
 
 	/*
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v2 3/5] xdiff/xdl_cleanup_records: make setting action easier to follow
  2026-03-25 21:11 ` [PATCH v2 0/5] " Ezekiel Newren via GitGitGadget
  2026-03-25 21:11   ` [PATCH v2 1/5] xdiff/xdl_cleanup_records: delete local recs pointer Ezekiel Newren via GitGitGadget
  2026-03-25 21:11   ` [PATCH v2 2/5] xdiff/xdl_cleanup_records: make limits more clear Ezekiel Newren via GitGitGadget
@ 2026-03-25 21:11   ` Ezekiel Newren via GitGitGadget
  2026-03-25 21:11   ` [PATCH v2 4/5] xdiff/xdl_cleanup_records: simplify INVESTIGATE handling for clarity Ezekiel Newren via GitGitGadget
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 78+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2026-03-25 21:11 UTC (permalink / raw)
  To: git
  Cc: Yee Cheng Chin, Phillip Wood, René Scharfe, Jeff King,
	D. Ben Knoble, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

Rewrite nested ternaries with a clear if/else ladder for
action1/action2 to improve readability while preserving
behavior.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xprepare.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 756a5b8dcc..127848b764 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -304,14 +304,24 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 		size_t mph1 = xdf1->recs[i].minimal_perfect_hash;
 		rcrec = cf->rcrecs[mph1];
 		nm = rcrec ? rcrec->len2 : 0;
-		action1[i] = (nm == 0) ? DISCARD: nm >= mlim1 ? INVESTIGATE: KEEP;
+		if (nm == 0)
+			action1[i] = DISCARD;
+		else if (nm < mlim1)
+			action1[i] = KEEP;
+		else /* nm >= mlim1 */
+			action1[i] = INVESTIGATE;
 	}
 
 	for (i = xdf2->dstart; i <= xdf2->dend; i++) {
 		size_t mph2 = xdf2->recs[i].minimal_perfect_hash;
 		rcrec = cf->rcrecs[mph2];
 		nm = rcrec ? rcrec->len1 : 0;
-		action2[i] = (nm == 0) ? DISCARD: nm >= mlim2 ? INVESTIGATE: KEEP;
+		if (nm == 0)
+			action2[i] = DISCARD;
+		else if (nm < mlim2)
+			action2[i] = KEEP;
+		else /* nm >= mlim2 */
+			action2[i] = INVESTIGATE;
 	}
 
 	/*
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v2 4/5] xdiff/xdl_cleanup_records: simplify INVESTIGATE handling for clarity
  2026-03-25 21:11 ` [PATCH v2 0/5] " Ezekiel Newren via GitGitGadget
                     ` (2 preceding siblings ...)
  2026-03-25 21:11   ` [PATCH v2 3/5] xdiff/xdl_cleanup_records: make setting action easier to follow Ezekiel Newren via GitGitGadget
@ 2026-03-25 21:11   ` Ezekiel Newren via GitGitGadget
  2026-03-25 21:11   ` [PATCH v2 5/5] xdiff/xdl_cleanup_records: use unambiguous types Ezekiel Newren via GitGitGadget
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 78+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2026-03-25 21:11 UTC (permalink / raw)
  To: git
  Cc: Yee Cheng Chin, Phillip Wood, René Scharfe, Jeff King,
	D. Ben Knoble, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

Make it clear that INVESTIGATE is turned into KEEP or DISCARD based on
the result of xdl_clean_mmatch(), which reduces actionX[i] to a
boolean value.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xprepare.c | 34 ++++++++++++++++++++++++----------
 1 file changed, 24 insertions(+), 10 deletions(-)

diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 127848b764..dd595cf8a1 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -330,24 +330,38 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	 */
 	xdf1->nreff = 0;
 	for (i = xdf1->dstart; i <= xdf1->dend; i++) {
-		if (action1[i] == KEEP ||
-		    (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xdf1->dstart, xdf1->dend))) {
+		if (action1[i] == INVESTIGATE) {
+			if (!xdl_clean_mmatch(action1, i, xdf1->dstart, xdf1->dend))
+				action1[i] = KEEP;
+			else
+				action1[i] = DISCARD;
+		}
+
+		if (action1[i] == KEEP) {
 			xdf1->reference_index[xdf1->nreff++] = i;
-			/* changed[i] remains false, i.e. keep */
-		} else
+			/* changed[i] remains false */
+		} else if (action1[i] == DISCARD)
 			xdf1->changed[i] = true;
-			/* i.e. discard */
+		else
+			BUG("Illegal state for action1[i]");
 	}
 
 	xdf2->nreff = 0;
 	for (i = xdf2->dstart; i <= xdf2->dend; i++) {
-		if (action2[i] == KEEP ||
-		    (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xdf2->dstart, xdf2->dend))) {
+		if (action2[i] == INVESTIGATE) {
+			if (!xdl_clean_mmatch(action2, i, xdf2->dstart, xdf2->dend))
+				action2[i] = KEEP;
+			else
+				action2[i] = DISCARD;
+		}
+
+		if (action2[i] == KEEP) {
 			xdf2->reference_index[xdf2->nreff++] = i;
-			/* changed[i] remains false, i.e. keep */
-		} else
+			/* changed[i] remains false */
+		} else if (action2[i] == DISCARD)
 			xdf2->changed[i] = true;
-			/* i.e. discard */
+		else
+			BUG("Illegal state for action2[i]");
 	}
 
 cleanup:
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v2 5/5] xdiff/xdl_cleanup_records: use unambiguous types
  2026-03-25 21:11 ` [PATCH v2 0/5] " Ezekiel Newren via GitGitGadget
                     ` (3 preceding siblings ...)
  2026-03-25 21:11   ` [PATCH v2 4/5] xdiff/xdl_cleanup_records: simplify INVESTIGATE handling for clarity Ezekiel Newren via GitGitGadget
@ 2026-03-25 21:11   ` Ezekiel Newren via GitGitGadget
  2026-03-25 21:58     ` Junio C Hamano
  2026-03-26  6:26   ` [PATCH v2 0/5] Xdiff cleanup part 3 SZEDER Gábor
  2026-03-27 19:23   ` [PATCH v3 0/6] " Ezekiel Newren via GitGitGadget
  6 siblings, 1 reply; 78+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2026-03-25 21:11 UTC (permalink / raw)
  To: git
  Cc: Yee Cheng Chin, Phillip Wood, René Scharfe, Jeff King,
	D. Ben Knoble, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

Change the parameters of xdl_clean_mmatch() and the local variables
i, nm in xdl_cleanup_records() to use unambiguous types. Best viewed
with --color-words.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xprepare.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index dd595cf8a1..39e48ad33a 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -197,8 +197,8 @@ void xdl_free_env(xdfenv_t *xe) {
 }
 
 
-static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
-	long r, rdis0, rpdis0, rdis1, rpdis1;
+static bool xdl_clean_mmatch(uint8_t const *action, ptrdiff_t i, ptrdiff_t s, ptrdiff_t e) {
+	ptrdiff_t r, rdis0, rpdis0, rdis1, rpdis1;
 
 	/*
 	 * Limits the window that is examined during the similar-lines
@@ -268,8 +268,8 @@ static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
  * might be potentially discarded if they appear in a run of discardable.
  */
 static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xdf2) {
-	long i, nm;
-	size_t mlim1, mlim2;
+	ptrdiff_t i;
+	size_t nm, mlim1, mlim2;
 	xdlclass_t *rcrec;
 	uint8_t *action1 = NULL, *action2 = NULL;
 	bool need_min = !!(cf->flags & XDF_NEED_MINIMAL);
@@ -303,7 +303,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	for (i = xdf1->dstart; i <= xdf1->dend; i++) {
 		size_t mph1 = xdf1->recs[i].minimal_perfect_hash;
 		rcrec = cf->rcrecs[mph1];
-		nm = rcrec ? rcrec->len2 : 0;
+		nm = rcrec ? (size_t)rcrec->len2 : 0;
 		if (nm == 0)
 			action1[i] = DISCARD;
 		else if (nm < mlim1)
@@ -315,7 +315,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	for (i = xdf2->dstart; i <= xdf2->dend; i++) {
 		size_t mph2 = xdf2->recs[i].minimal_perfect_hash;
 		rcrec = cf->rcrecs[mph2];
-		nm = rcrec ? rcrec->len1 : 0;
+		nm = rcrec ? (size_t)rcrec->len1 : 0;
 		if (nm == 0)
 			action2[i] = DISCARD;
 		else if (nm < mlim2)
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 5/5] xdiff/xdl_cleanup_records: use unambiguous types
  2026-03-25 21:11   ` [PATCH v2 5/5] xdiff/xdl_cleanup_records: use unambiguous types Ezekiel Newren via GitGitGadget
@ 2026-03-25 21:58     ` Junio C Hamano
  0 siblings, 0 replies; 78+ messages in thread
From: Junio C Hamano @ 2026-03-25 21:58 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget
  Cc: git, Yee Cheng Chin, Phillip Wood, René Scharfe, Jeff King,
	D. Ben Knoble, Ezekiel Newren

"Ezekiel Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Ezekiel Newren <ezekielnewren@gmail.com>
>
> Change the parameters of xdl_clean_mmatch() and the local variables
> i, nm in xdl_cleanup_records() to use unambiguous types. Best viewed
> with --color-words.
>
> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
> ---
>  xdiff/xprepare.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
> index dd595cf8a1..39e48ad33a 100644
> --- a/xdiff/xprepare.c
> +++ b/xdiff/xprepare.c
> @@ -197,8 +197,8 @@ void xdl_free_env(xdfenv_t *xe) {
>  }
>  
>  
> -static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
> -	long r, rdis0, rpdis0, rdis1, rpdis1;
> +static bool xdl_clean_mmatch(uint8_t const *action, ptrdiff_t i, ptrdiff_t s, ptrdiff_t e) {
> +	ptrdiff_t r, rdis0, rpdis0, rdis1, rpdis1;
>  
>  	/*
>  	 * Limits the window that is examined during the similar-lines
> @@ -268,8 +268,8 @@ static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
>   * might be potentially discarded if they appear in a run of discardable.
>   */
>  static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xdf2) {
> -	long i, nm;
> -	size_t mlim1, mlim2;
> +	ptrdiff_t i;
> +	size_t nm, mlim1, mlim2;

Looking good.  Moving away from platform native "long" and to types
that have more specific meaning makes sense.

>  	xdlclass_t *rcrec;
>  	uint8_t *action1 = NULL, *action2 = NULL;
>  	bool need_min = !!(cf->flags & XDF_NEED_MINIMAL);
> @@ -303,7 +303,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
>  	for (i = xdf1->dstart; i <= xdf1->dend; i++) {
>  		size_t mph1 = xdf1->recs[i].minimal_perfect_hash;
>  		rcrec = cf->rcrecs[mph1];
> -		nm = rcrec ? rcrec->len2 : 0;
> +		nm = rcrec ? (size_t)rcrec->len2 : 0;
>  		if (nm == 0)
>  			action1[i] = DISCARD;
>  		else if (nm < mlim1)
> @@ -315,7 +315,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
>  	for (i = xdf2->dstart; i <= xdf2->dend; i++) {
>  		size_t mph2 = xdf2->recs[i].minimal_perfect_hash;
>  		rcrec = cf->rcrecs[mph2];
> -		nm = rcrec ? rcrec->len1 : 0;
> +		nm = rcrec ? (size_t)rcrec->len1 : 0;
>  		if (nm == 0)
>  			action2[i] = DISCARD;
>  		else if (nm < mlim2)

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 0/5] Xdiff cleanup part 3
  2026-03-25 21:11 ` [PATCH v2 0/5] " Ezekiel Newren via GitGitGadget
                     ` (4 preceding siblings ...)
  2026-03-25 21:11   ` [PATCH v2 5/5] xdiff/xdl_cleanup_records: use unambiguous types Ezekiel Newren via GitGitGadget
@ 2026-03-26  6:26   ` SZEDER Gábor
  2026-03-27 19:23   ` [PATCH v3 0/6] " Ezekiel Newren via GitGitGadget
  6 siblings, 0 replies; 78+ messages in thread
From: SZEDER Gábor @ 2026-03-26  6:26 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget
  Cc: git, Yee Cheng Chin, Phillip Wood, René Scharfe, Jeff King,
	D. Ben Knoble, Ezekiel Newren

On Wed, Mar 25, 2026 at 09:11:00PM +0000, Ezekiel Newren via GitGitGadget wrote:
> v2 is a radical departure from v1.
> 
> Changes in v2:
> 
>  * make the flow of xdl_cleanup_records() easier to follow
> 
> There is no performance or behavioral change introduced in this patch
> series.
> 
> === original cover letter below ===
> 
> Patch series summary:
> 
>  * patch 1: Introduce the ivec type
>  * patch 2: Create the function xdl_do_classic_diff()
>  * patches 3-4: generic cleanup
>  * patches 5-8: convert from dstart/dend (in xdfile_t) to
>    delta_start/delta_end (in xdfenv_t)
>  * patches 9-10: move xdl_cleanup_records(), and related, from xprepare.c to
>    xdiffi.c
> 
> Things that will be addressed in future patch series:
> 
>  * Make xdl_cleanup_records() easier to read
>  * convert recs/nrec into an ivec
>  * convert changed to an ivec
>  * remove reference_index/nreff from xdfile_t and turn it into an ivec
>  * splitting minimal_perfect_hash out as its own ivec
>  * improve the performance of the classifier and parsing/hashing lines
> 
> === before this patch series
> 
> typedef struct s_xdfile {
> 	xrecord_t *recs;
> 	size_t nrec;
> 	ptrdiff_t dstart, dend;
> 	bool *changed;
> 	size_t *reference_index;
> 	size_t nreff;
> } xdfile_t;
> 
> typedef struct s_xdfenv {
> 	xdfile_t xdf1, xdf2;
> } xdfenv_t;
> 
> === after this patch series
> 
> typedef struct s_xdfile {
> 	xrecord_t *recs;
> 	size_t nrec;
> 	bool *changed;
> 	size_t *reference_index;
> 	size_t nreff;
> } xdfile_t;
> 
> typedef struct s_xdfenv {
> 	xdfile_t xdf1, xdf2;
> 	size_t delta_start, delta_end;
> 	size_t mph_size;
> } xdfenv_t;

Please make sure that each commit in this series can be built with
DEVELOPER=1, which enables a bunch of additional compiler warnings.
While the last commit can be built with all those warnings, the three
in the middle fail with sign comparison errors.
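For readers unfamiliar with this class of warning, here is a minimal
sketch of why a signed/unsigned comparison is flagged. It is not taken
from the patches: the helper names are hypothetical, and the explicit
(size_t) cast only mirrors the implicit conversion the compiler applies
on common platforms where size_t is at least as wide as long.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of xdiff: a signed count compared with
 * an unsigned limit. The signed operand is converted to unsigned first,
 * so a negative count turns into a very large value.
 */
static bool below_limit_mixed(long nm, size_t mlim)
{
	return (size_t)nm < mlim;
}

/* Same check with both operands unsigned; nothing is converted. */
static bool below_limit_unsigned(size_t nm, size_t mlim)
{
	return nm < mlim;
}

/* Small self-check showing the surprising case. */
static void demo_sign_compare(void)
{
	assert(below_limit_mixed(3, 10));
	assert(!below_limit_mixed(-1, 10)); /* -1 is not "below" 10 here */
	assert(below_limit_unsigned(3, 10));
}
```

Calling demo_sign_compare() from a test program exercises all three
cases. Giving both operands the same signedness removes the hidden
conversion that -Wsign-compare warns about.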

> Ezekiel Newren (5):
>   xdiff/xdl_cleanup_records: delete local recs pointer
>   xdiff/xdl_cleanup_records: make limits more clear

        CC xdiff/xprepare.o
    xdiff/xprepare.c: In function ‘xdl_cleanup_records’:
    xdiff/xprepare.c:307:54: error: comparison of integer expressions of different signedness: ‘long int’ and ‘size_t’ {aka ‘long unsigned int’} [-Werror=sign-compare]
      307 |                 action1[i] = (nm == 0) ? DISCARD: nm >= mlim1 ? INVESTIGATE: KEEP;
          |                                                      ^~
    xdiff/xprepare.c:314:54: error: comparison of integer expressions of different signedness: ‘long int’ and ‘size_t’ {aka ‘long unsigned int’} [-Werror=sign-compare]
      314 |                 action2[i] = (nm == 0) ? DISCARD: nm >= mlim2 ? INVESTIGATE: KEEP;
          |                                                      ^~
    cc1: all warnings being treated as errors
    make: *** [Makefile:2923: xdiff/xprepare.o] Error 1

>   xdiff/xdl_cleanup_records: make setting action easier to follow

      CC xdiff/xprepare.o
  xdiff/xprepare.c: In function ‘xdl_cleanup_records’:
  xdiff/xprepare.c:309:29: error: comparison of integer expressions of different signedness: ‘long int’ and ‘size_t’ {aka ‘long unsigned int’} [-Werror=sign-compare]
    309 |                 else if (nm < mlim1)
        |                             ^
  xdiff/xprepare.c:321:29: error: comparison of integer expressions of different signedness: ‘long int’ and ‘size_t’ {aka ‘long unsigned int’} [-Werror=sign-compare]
    321 |                 else if (nm < mlim2)
        |                             ^
  cc1: all warnings being treated as errors
  make: *** [Makefile:2923: xdiff/xprepare.o] Error 1

>   xdiff/xdl_cleanup_records: simplify INVESTIGATE handling for clarity

Same error as the last one.

>   xdiff/xdl_cleanup_records: use unambiguous types

Good.

> 
>  xdiff/xprepare.c | 89 ++++++++++++++++++++++++++++++++----------------
>  1 file changed, 59 insertions(+), 30 deletions(-)

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH v3 0/6] Xdiff cleanup part 3
  2026-03-25 21:11 ` [PATCH v2 0/5] " Ezekiel Newren via GitGitGadget
                     ` (5 preceding siblings ...)
  2026-03-26  6:26   ` [PATCH v2 0/5] Xdiff cleanup part 3 SZEDER Gábor
@ 2026-03-27 19:23   ` Ezekiel Newren via GitGitGadget
  2026-03-27 19:23     ` [PATCH v3 1/6] xdiff/xdl_cleanup_records: delete local recs pointer Ezekiel Newren via GitGitGadget
                       ` (5 more replies)
  6 siblings, 6 replies; 78+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2026-03-27 19:23 UTC (permalink / raw)
  To: git
  Cc: Yee Cheng Chin, Phillip Wood, René Scharfe, Jeff King,
	D. Ben Knoble, Ezekiel Newren

Changes in v3:

 * run make DEVELOPER=1 on each commit and fix all compiler issues

v2 is a radical departure from v1.

Changes in v2:

 * make the flow of xdl_cleanup_records() easier to follow

There is no performance or behavioral change introduced in this patch
series.
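
As an illustration of the "no behavioral change" claim, the sketch
below compares the two shapes of the action selection: the nested
ternary with the extra `!need_min` test, and the if/else ladder with
the limit folded into a sentinel. It is a standalone approximation, not
the xdiff code: the enum values, the function names, and the use of
size_t with SIZE_MAX as "infinity" (as in v2 patch 2/5) are assumptions
made for the example, and the types in v3 differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative values only; the real constants live in xprepare.c. */
enum action { DISCARD, KEEP, INVESTIGATE };

/* Old shape: nested ternary with the extra !need_min test. */
static enum action classify_old(size_t nm, size_t mlim, bool need_min)
{
	return (nm == 0) ? DISCARD : (nm >= mlim && !need_min) ? INVESTIGATE : KEEP;
}

/* New shape: need_min is already folded into the limit. */
static enum action classify_new(size_t nm, size_t mlim)
{
	if (nm == 0)
		return DISCARD;
	else if (nm < mlim)
		return KEEP;
	else /* nm >= mlim */
		return INVESTIGATE;
}

/* With need_min set, the limit becomes "infinity". */
static size_t effective_limit(size_t mlim, bool need_min)
{
	return need_min ? SIZE_MAX : mlim;
}

/* Check that both shapes agree for every nm below count. */
static bool shapes_agree(size_t count, size_t mlim)
{
	for (size_t nm = 0; nm < count; nm++) {
		for (int need_min = 0; need_min <= 1; need_min++) {
			size_t lim = effective_limit(mlim, need_min);
			if (classify_old(nm, mlim, need_min) != classify_new(nm, lim))
				return false;
		}
	}
	return true;
}
```

In this sketch the two shapes differ only when nm equals SIZE_MAX
itself, a value that a count of matching lines should not reach in
practice.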

=== original cover letter below ===

Patch series summary:

 * patch 1: Introduce the ivec type
 * patch 2: Create the function xdl_do_classic_diff()
 * patches 3-4: generic cleanup
 * patches 5-8: convert from dstart/dend (in xdfile_t) to
   delta_start/delta_end (in xdfenv_t)
 * patches 9-10: move xdl_cleanup_records(), and related, from xprepare.c to
   xdiffi.c

Things that will be addressed in future patch series:

 * Make xdl_cleanup_records() easier to read
 * convert recs/nrec into an ivec
 * convert changed to an ivec
 * remove reference_index/nreff from xdfile_t and turn it into an ivec
 * splitting minimal_perfect_hash out as its own ivec
 * improve the performance of the classifier and parsing/hashing lines

=== before this patch series

typedef struct s_xdfile {
	xrecord_t *recs;
	size_t nrec;
	ptrdiff_t dstart, dend;
	bool *changed;
	size_t *reference_index;
	size_t nreff;
} xdfile_t;

typedef struct s_xdfenv {
	xdfile_t xdf1, xdf2;
} xdfenv_t;

=== after this patch series

typedef struct s_xdfile {
	xrecord_t *recs;
	size_t nrec;
	bool *changed;
	size_t *reference_index;
	size_t nreff;
} xdfile_t;

typedef struct s_xdfenv {
	xdfile_t xdf1, xdf2;
	size_t delta_start, delta_end;
	size_t mph_size;
} xdfenv_t;

Ezekiel Newren (6):
  xdiff/xdl_cleanup_records: delete local recs pointer
  xdiff: use unambiguous types in xdl_bogo_sqrt()
  xdiff/xdl_cleanup_records: use unambiguous types
  xdiff/xdl_cleanup_records: make limits more clear
  xdiff/xdl_cleanup_records: make setting action easier to follow
  xdiff/xdl_cleanup_records: simplify INVESTIGATE handling for clarity

 xdiff/xdiffi.c   |  2 +-
 xdiff/xprepare.c | 84 ++++++++++++++++++++++++++++++++----------------
 xdiff/xutils.c   |  4 +--
 xdiff/xutils.h   |  2 +-
 4 files changed, 60 insertions(+), 32 deletions(-)


base-commit: ca1db8a0f7dc0dbea892e99f5b37c5fe5861be71
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-2156%2Fezekielnewren%2Fxdiff-cleanup-3-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-2156/ezekielnewren/xdiff-cleanup-3-v3
Pull-Request: https://github.com/git/git/pull/2156

Range-diff vs v2:

 1:  8f9165d477 = 1:  da32a9747c xdiff/xdl_cleanup_records: delete local recs pointer
 -:  ---------- > 2:  86b0ad100c xdiff: use unambiguous types in xdl_bogo_sqrt()
 5:  a52787f019 ! 3:  39a35365ae xdiff/xdl_cleanup_records: use unambiguous types
     @@ Commit message
          xdiff/xdl_cleanup_records: use unambiguous types
      
          Change the parameters of xdl_clean_mmatch() and the local variables
     -    i, nm in xdl_cleanup_records() to use unambiguous types. Best viewed
     -    with --color-words.
     +    i, nm, mlim in xdl_cleanup_records() to use unambiguous types. Best
     +    viewed with --color-words.
      
          Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
      
     @@ xdiff/xprepare.c: static bool xdl_clean_mmatch(uint8_t const *action, long i, lo
        * might be potentially discarded if they appear in a run of discardable.
        */
       static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xdf2) {
     --	long i, nm;
     --	size_t mlim1, mlim2;
     -+	ptrdiff_t i;
     -+	size_t nm, mlim1, mlim2;
     +-	long i, nm, mlim;
     ++	ptrdiff_t i, nm, mlim;
       	xdlclass_t *rcrec;
       	uint8_t *action1 = NULL, *action2 = NULL;
       	bool need_min = !!(cf->flags & XDF_NEED_MINIMAL);
     -@@ xdiff/xprepare.c: static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
     - 	for (i = xdf1->dstart; i <= xdf1->dend; i++) {
     - 		size_t mph1 = xdf1->recs[i].minimal_perfect_hash;
     - 		rcrec = cf->rcrecs[mph1];
     --		nm = rcrec ? rcrec->len2 : 0;
     -+		nm = rcrec ? (size_t)rcrec->len2 : 0;
     - 		if (nm == 0)
     - 			action1[i] = DISCARD;
     - 		else if (nm < mlim1)
     -@@ xdiff/xprepare.c: static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
     - 	for (i = xdf2->dstart; i <= xdf2->dend; i++) {
     - 		size_t mph2 = xdf2->recs[i].minimal_perfect_hash;
     - 		rcrec = cf->rcrecs[mph2];
     --		nm = rcrec ? rcrec->len1 : 0;
     -+		nm = rcrec ? (size_t)rcrec->len1 : 0;
     - 		if (nm == 0)
     - 			action2[i] = DISCARD;
     - 		else if (nm < mlim2)
 2:  62adaa8e5a ! 4:  86dd98db9b xdiff/xdl_cleanup_records: make limits more clear
     @@ Commit message
          Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
      
       ## xdiff/xprepare.c ##
     -@@ xdiff/xprepare.c: static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
     +@@ xdiff/xprepare.c: static bool xdl_clean_mmatch(uint8_t const *action, ptrdiff_t i, ptrdiff_t s, pt
        * might be potentially discarded if they appear in a run of discardable.
        */
       static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xdf2) {
     --	long i, nm, mlim;
     -+	long i, nm;
     -+	size_t mlim1, mlim2;
     +-	ptrdiff_t i, nm, mlim;
     ++	ptrdiff_t i, nm, mlim1, mlim2;
       	xdlclass_t *rcrec;
       	uint8_t *action1 = NULL, *action2 = NULL;
       	bool need_min = !!(cf->flags & XDF_NEED_MINIMAL);
     @@ xdiff/xprepare.c: static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *
       	/*
       	 * Initialize temporary arrays with DISCARD, KEEP, or INVESTIGATE.
       	 */
     --	if ((mlim = xdl_bogosqrt((long)xdf1->nrec)) > XDL_MAX_EQLIMIT)
     +-	if ((mlim = (long)xdl_bogosqrt((uint64_t)xdf1->nrec)) > XDL_MAX_EQLIMIT)
      -		mlim = XDL_MAX_EQLIMIT;
       	for (i = xdf1->dstart; i <= xdf1->dend; i++) {
       		size_t mph1 = xdf1->recs[i].minimal_perfect_hash;
     @@ xdiff/xprepare.c: static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *
      +		action1[i] = (nm == 0) ? DISCARD: nm >= mlim1 ? INVESTIGATE: KEEP;
       	}
       
     --	if ((mlim = xdl_bogosqrt((long)xdf2->nrec)) > XDL_MAX_EQLIMIT)
     +-	if ((mlim = (long)xdl_bogosqrt((uint64_t)xdf2->nrec)) > XDL_MAX_EQLIMIT)
      -		mlim = XDL_MAX_EQLIMIT;
       	for (i = xdf2->dstart; i <= xdf2->dend; i++) {
       		size_t mph2 = xdf2->recs[i].minimal_perfect_hash;
 3:  8be7e4781a = 5:  ecc25be32f xdiff/xdl_cleanup_records: make setting action easier to follow
 4:  6abd052c34 = 6:  8f4def8814 xdiff/xdl_cleanup_records: simplify INVESTIGATE handling for clarity

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH v3 1/6] xdiff/xdl_cleanup_records: delete local recs pointer
  2026-03-27 19:23   ` [PATCH v3 0/6] " Ezekiel Newren via GitGitGadget
@ 2026-03-27 19:23     ` Ezekiel Newren via GitGitGadget
  2026-03-27 19:23     ` [PATCH v3 2/6] xdiff: use unambiguous types in xdl_bogo_sqrt() Ezekiel Newren via GitGitGadget
                       ` (4 subsequent siblings)
  5 siblings, 0 replies; 78+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2026-03-27 19:23 UTC (permalink / raw)
  To: git
  Cc: Yee Cheng Chin, Phillip Wood, René Scharfe, Jeff King,
	D. Ben Knoble, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

Simplify the first two for loops by indexing xdfile.recs directly.
The local recs pointer is unused in the last two for loops, so remove
it. Best viewed with --color-words.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xprepare.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index cd4fc405eb..d6e1901d2d 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -269,7 +269,6 @@ static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
  */
 static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xdf2) {
 	long i, nm, mlim;
-	xrecord_t *recs;
 	xdlclass_t *rcrec;
 	uint8_t *action1 = NULL, *action2 = NULL;
 	bool need_min = !!(cf->flags & XDF_NEED_MINIMAL);
@@ -293,16 +292,18 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	 */
 	if ((mlim = xdl_bogosqrt((long)xdf1->nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
-	for (i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart]; i <= xdf1->dend; i++, recs++) {
-		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
+	for (i = xdf1->dstart; i <= xdf1->dend; i++) {
+		size_t mph1 = xdf1->recs[i].minimal_perfect_hash;
+		rcrec = cf->rcrecs[mph1];
 		nm = rcrec ? rcrec->len2 : 0;
 		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
 	}
 
 	if ((mlim = xdl_bogosqrt((long)xdf2->nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
-	for (i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart]; i <= xdf2->dend; i++, recs++) {
-		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
+	for (i = xdf2->dstart; i <= xdf2->dend; i++) {
+		size_t mph2 = xdf2->recs[i].minimal_perfect_hash;
+		rcrec = cf->rcrecs[mph2];
 		nm = rcrec ? rcrec->len1 : 0;
 		action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
 	}
@@ -312,8 +313,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	 * false, or become true.
 	 */
 	xdf1->nreff = 0;
-	for (i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart];
-	     i <= xdf1->dend; i++, recs++) {
+	for (i = xdf1->dstart; i <= xdf1->dend; i++) {
 		if (action1[i] == KEEP ||
 		    (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xdf1->dstart, xdf1->dend))) {
 			xdf1->reference_index[xdf1->nreff++] = i;
@@ -324,8 +324,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	}
 
 	xdf2->nreff = 0;
-	for (i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart];
-	     i <= xdf2->dend; i++, recs++) {
+	for (i = xdf2->dstart; i <= xdf2->dend; i++) {
 		if (action2[i] == KEEP ||
 		    (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xdf2->dstart, xdf2->dend))) {
 			xdf2->reference_index[xdf2->nreff++] = i;
-- 
gitgitgadget



* [PATCH v3 2/6] xdiff: use unambiguous types in xdl_bogo_sqrt()
  2026-03-27 19:23   ` [PATCH v3 0/6] " Ezekiel Newren via GitGitGadget
  2026-03-27 19:23     ` [PATCH v3 1/6] xdiff/xdl_cleanup_records: delete local recs pointer Ezekiel Newren via GitGitGadget
@ 2026-03-27 19:23     ` Ezekiel Newren via GitGitGadget
  2026-03-27 19:23     ` [PATCH v3 3/6] xdiff/xdl_cleanup_records: use unambiguous types Ezekiel Newren via GitGitGadget
                       ` (3 subsequent siblings)
  5 siblings, 0 replies; 78+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2026-03-27 19:23 UTC (permalink / raw)
  To: git
  Cc: Yee Cheng Chin, Phillip Wood, René Scharfe, Jeff King,
	D. Ben Knoble, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

There is no real square root for a negative number, and size_t may not
be large enough for certain applications, so replace long with uint64_t.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xdiffi.c   | 2 +-
 xdiff/xprepare.c | 4 ++--
 xdiff/xutils.c   | 4 ++--
 xdiff/xutils.h   | 2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 4376f943db..88708c12a3 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -348,7 +348,7 @@ int xdl_do_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
 	kvdf += xe->xdf2.nreff + 1;
 	kvdb += xe->xdf2.nreff + 1;
 
-	xenv.mxcost = xdl_bogosqrt(ndiags);
+	xenv.mxcost = (long)xdl_bogosqrt((uint64_t)ndiags);
 	if (xenv.mxcost < XDL_MAX_COST_MIN)
 		xenv.mxcost = XDL_MAX_COST_MIN;
 	xenv.snake_cnt = XDL_SNAKE_CNT;
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index d6e1901d2d..48fb5ce6fe 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -290,7 +290,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	/*
 	 * Initialize temporary arrays with DISCARD, KEEP, or INVESTIGATE.
 	 */
-	if ((mlim = xdl_bogosqrt((long)xdf1->nrec)) > XDL_MAX_EQLIMIT)
+	if ((mlim = (long)xdl_bogosqrt((uint64_t)xdf1->nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
 	for (i = xdf1->dstart; i <= xdf1->dend; i++) {
 		size_t mph1 = xdf1->recs[i].minimal_perfect_hash;
@@ -299,7 +299,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
 	}
 
-	if ((mlim = xdl_bogosqrt((long)xdf2->nrec)) > XDL_MAX_EQLIMIT)
+	if ((mlim = (long)xdl_bogosqrt((uint64_t)xdf2->nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
 	for (i = xdf2->dstart; i <= xdf2->dend; i++) {
 		size_t mph2 = xdf2->recs[i].minimal_perfect_hash;
diff --git a/xdiff/xutils.c b/xdiff/xutils.c
index 77ee1ad9c8..9a999acdc0 100644
--- a/xdiff/xutils.c
+++ b/xdiff/xutils.c
@@ -23,8 +23,8 @@
 #include "xinclude.h"
 
 
-long xdl_bogosqrt(long n) {
-	long i;
+uint64_t xdl_bogosqrt(uint64_t n) {
+	uint64_t i;
 
 	/*
 	 * Classical integer square root approximation using shifts.
diff --git a/xdiff/xutils.h b/xdiff/xutils.h
index 615b4a9d35..58f9d74cda 100644
--- a/xdiff/xutils.h
+++ b/xdiff/xutils.h
@@ -25,7 +25,7 @@
 
 
 
-long xdl_bogosqrt(long n);
+uint64_t xdl_bogosqrt(uint64_t n);
 int xdl_emit_diffrec(char const *rec, long size, char const *pre, long psize,
 		     xdemitcb_t *ecb);
 int xdl_cha_init(chastore_t *cha, long isize, long icount);
-- 
gitgitgadget



* [PATCH v3 3/6] xdiff/xdl_cleanup_records: use unambiguous types
  2026-03-27 19:23   ` [PATCH v3 0/6] " Ezekiel Newren via GitGitGadget
  2026-03-27 19:23     ` [PATCH v3 1/6] xdiff/xdl_cleanup_records: delete local recs pointer Ezekiel Newren via GitGitGadget
  2026-03-27 19:23     ` [PATCH v3 2/6] xdiff: use unambiguous types in xdl_bogo_sqrt() Ezekiel Newren via GitGitGadget
@ 2026-03-27 19:23     ` Ezekiel Newren via GitGitGadget
  2026-03-27 19:23     ` [PATCH v3 4/6] xdiff/xdl_cleanup_records: make limits more clear Ezekiel Newren via GitGitGadget
                       ` (2 subsequent siblings)
  5 siblings, 0 replies; 78+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2026-03-27 19:23 UTC (permalink / raw)
  To: git
  Cc: Yee Cheng Chin, Phillip Wood, René Scharfe, Jeff King,
	D. Ben Knoble, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

Change the parameters of xdl_clean_mmatch() and the local variables
i, nm, mlim in xdl_cleanup_records() to use unambiguous types. Best
viewed with --color-words.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xprepare.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 48fb5ce6fe..386668a92d 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -197,8 +197,8 @@ void xdl_free_env(xdfenv_t *xe) {
 }
 
 
-static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
-	long r, rdis0, rpdis0, rdis1, rpdis1;
+static bool xdl_clean_mmatch(uint8_t const *action, ptrdiff_t i, ptrdiff_t s, ptrdiff_t e) {
+	ptrdiff_t r, rdis0, rpdis0, rdis1, rpdis1;
 
 	/*
 	 * Limits the window that is examined during the similar-lines
@@ -268,7 +268,7 @@ static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
  * might be potentially discarded if they appear in a run of discardable.
  */
 static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xdf2) {
-	long i, nm, mlim;
+	ptrdiff_t i, nm, mlim;
 	xdlclass_t *rcrec;
 	uint8_t *action1 = NULL, *action2 = NULL;
 	bool need_min = !!(cf->flags & XDF_NEED_MINIMAL);
-- 
gitgitgadget



* [PATCH v3 4/6] xdiff/xdl_cleanup_records: make limits more clear
  2026-03-27 19:23   ` [PATCH v3 0/6] " Ezekiel Newren via GitGitGadget
                       ` (2 preceding siblings ...)
  2026-03-27 19:23     ` [PATCH v3 3/6] xdiff/xdl_cleanup_records: use unambiguous types Ezekiel Newren via GitGitGadget
@ 2026-03-27 19:23     ` Ezekiel Newren via GitGitGadget
  2026-03-27 21:09       ` Junio C Hamano
  2026-03-27 19:23     ` [PATCH v3 5/6] xdiff/xdl_cleanup_records: make setting action easier to follow Ezekiel Newren via GitGitGadget
  2026-03-27 19:23     ` [PATCH v3 6/6] xdiff/xdl_cleanup_records: simplify INVESTIGATE handling for clarity Ezekiel Newren via GitGitGadget
  5 siblings, 1 reply; 78+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2026-03-27 19:23 UTC (permalink / raw)
  To: git
  Cc: Yee Cheng Chin, Phillip Wood, René Scharfe, Jeff King,
	D. Ben Knoble, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

Make the handling of per-file limits and the minimal-case clearer.
  * Use explicit per-file limit variables (mlim1, mlim2) and initialize
    them.
  * The additional condition `!need_min` is now redundant; remove it.
Best viewed with --color-words.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xprepare.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 386668a92d..2cf1f8d1a8 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -268,7 +268,7 @@ static bool xdl_clean_mmatch(uint8_t const *action, ptrdiff_t i, ptrdiff_t s, pt
  * might be potentially discarded if they appear in a run of discardable.
  */
 static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xdf2) {
-	ptrdiff_t i, nm, mlim;
+	ptrdiff_t i, nm, mlim1, mlim2;
 	xdlclass_t *rcrec;
 	uint8_t *action1 = NULL, *action2 = NULL;
 	bool need_min = !!(cf->flags & XDF_NEED_MINIMAL);
@@ -287,25 +287,30 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 		goto cleanup;
 	}
 
+	if (need_min) {
+		/* i.e. infinity */
+		mlim1 = SIZE_MAX;
+		mlim2 = SIZE_MAX;
+	} else {
+		mlim1 = XDL_MIN(xdl_bogosqrt(xdf1->nrec), XDL_MAX_EQLIMIT);
+		mlim2 = XDL_MIN(xdl_bogosqrt(xdf2->nrec), XDL_MAX_EQLIMIT);
+	}
+
 	/*
 	 * Initialize temporary arrays with DISCARD, KEEP, or INVESTIGATE.
 	 */
-	if ((mlim = (long)xdl_bogosqrt((uint64_t)xdf1->nrec)) > XDL_MAX_EQLIMIT)
-		mlim = XDL_MAX_EQLIMIT;
 	for (i = xdf1->dstart; i <= xdf1->dend; i++) {
 		size_t mph1 = xdf1->recs[i].minimal_perfect_hash;
 		rcrec = cf->rcrecs[mph1];
 		nm = rcrec ? rcrec->len2 : 0;
-		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
+		action1[i] = (nm == 0) ? DISCARD: nm >= mlim1 ? INVESTIGATE: KEEP;
 	}
 
-	if ((mlim = (long)xdl_bogosqrt((uint64_t)xdf2->nrec)) > XDL_MAX_EQLIMIT)
-		mlim = XDL_MAX_EQLIMIT;
 	for (i = xdf2->dstart; i <= xdf2->dend; i++) {
 		size_t mph2 = xdf2->recs[i].minimal_perfect_hash;
 		rcrec = cf->rcrecs[mph2];
 		nm = rcrec ? rcrec->len1 : 0;
-		action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
+		action2[i] = (nm == 0) ? DISCARD: nm >= mlim2 ? INVESTIGATE: KEEP;
 	}
 
 	/*
-- 
gitgitgadget



* [PATCH v3 5/6] xdiff/xdl_cleanup_records: make setting action easier to follow
  2026-03-27 19:23   ` [PATCH v3 0/6] " Ezekiel Newren via GitGitGadget
                       ` (3 preceding siblings ...)
  2026-03-27 19:23     ` [PATCH v3 4/6] xdiff/xdl_cleanup_records: make limits more clear Ezekiel Newren via GitGitGadget
@ 2026-03-27 19:23     ` Ezekiel Newren via GitGitGadget
  2026-03-27 19:23     ` [PATCH v3 6/6] xdiff/xdl_cleanup_records: simplify INVESTIGATE handling for clarity Ezekiel Newren via GitGitGadget
  5 siblings, 0 replies; 78+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2026-03-27 19:23 UTC (permalink / raw)
  To: git
  Cc: Yee Cheng Chin, Phillip Wood, René Scharfe, Jeff King,
	D. Ben Knoble, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

Rewrite nested ternaries with a clear if/else ladder for
action1/action2 to improve readability while preserving
behavior.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xprepare.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 2cf1f8d1a8..3d5c61249f 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -303,14 +303,24 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 		size_t mph1 = xdf1->recs[i].minimal_perfect_hash;
 		rcrec = cf->rcrecs[mph1];
 		nm = rcrec ? rcrec->len2 : 0;
-		action1[i] = (nm == 0) ? DISCARD: nm >= mlim1 ? INVESTIGATE: KEEP;
+		if (nm == 0)
+			action1[i] = DISCARD;
+		else if (nm < mlim1)
+			action1[i] = KEEP;
+		else /* nm >= mlim1 */
+			action1[i] = INVESTIGATE;
 	}
 
 	for (i = xdf2->dstart; i <= xdf2->dend; i++) {
 		size_t mph2 = xdf2->recs[i].minimal_perfect_hash;
 		rcrec = cf->rcrecs[mph2];
 		nm = rcrec ? rcrec->len1 : 0;
-		action2[i] = (nm == 0) ? DISCARD: nm >= mlim2 ? INVESTIGATE: KEEP;
+		if (nm == 0)
+			action2[i] = DISCARD;
+		else if (nm < mlim2)
+			action2[i] = KEEP;
+		else /* nm >= mlim2 */
+			action2[i] = INVESTIGATE;
 	}
 
 	/*
-- 
gitgitgadget



* [PATCH v3 6/6] xdiff/xdl_cleanup_records: simplify INVESTIGATE handling for clarity
  2026-03-27 19:23   ` [PATCH v3 0/6] " Ezekiel Newren via GitGitGadget
                       ` (4 preceding siblings ...)
  2026-03-27 19:23     ` [PATCH v3 5/6] xdiff/xdl_cleanup_records: make setting action easier to follow Ezekiel Newren via GitGitGadget
@ 2026-03-27 19:23     ` Ezekiel Newren via GitGitGadget
  5 siblings, 0 replies; 78+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2026-03-27 19:23 UTC (permalink / raw)
  To: git
  Cc: Yee Cheng Chin, Phillip Wood, René Scharfe, Jeff King,
	D. Ben Knoble, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

Make it clear that INVESTIGATE is turned into KEEP or DISCARD based on
the result of xdl_clean_mmatch(), which reduces actionX[i] to a
boolean value.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xprepare.c | 34 ++++++++++++++++++++++++----------
 1 file changed, 24 insertions(+), 10 deletions(-)

diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 3d5c61249f..195148442b 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -329,24 +329,38 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	 */
 	xdf1->nreff = 0;
 	for (i = xdf1->dstart; i <= xdf1->dend; i++) {
-		if (action1[i] == KEEP ||
-		    (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xdf1->dstart, xdf1->dend))) {
+		if (action1[i] == INVESTIGATE) {
+			if (!xdl_clean_mmatch(action1, i, xdf1->dstart, xdf1->dend))
+				action1[i] = KEEP;
+			else
+				action1[i] = DISCARD;
+		}
+
+		if (action1[i] == KEEP) {
 			xdf1->reference_index[xdf1->nreff++] = i;
-			/* changed[i] remains false, i.e. keep */
-		} else
+			/* changed[i] remains false */
+		} else if (action1[i] == DISCARD)
 			xdf1->changed[i] = true;
-			/* i.e. discard */
+		else
+			BUG("Illegal state for action1[i]");
 	}
 
 	xdf2->nreff = 0;
 	for (i = xdf2->dstart; i <= xdf2->dend; i++) {
-		if (action2[i] == KEEP ||
-		    (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xdf2->dstart, xdf2->dend))) {
+		if (action2[i] == INVESTIGATE) {
+			if (!xdl_clean_mmatch(action2, i, xdf2->dstart, xdf2->dend))
+				action2[i] = KEEP;
+			else
+				action2[i] = DISCARD;
+		}
+
+		if (action2[i] == KEEP) {
 			xdf2->reference_index[xdf2->nreff++] = i;
-			/* changed[i] remains false, i.e. keep */
-		} else
+			/* changed[i] remains false */
+		} else if (action2[i] == DISCARD)
 			xdf2->changed[i] = true;
-			/* i.e. discard */
+		else
+			BUG("Illegal state for action2[i]");
 	}
 
 cleanup:
-- 
gitgitgadget


* Re: [PATCH v3 4/6] xdiff/xdl_cleanup_records: make limits more clear
  2026-03-27 19:23     ` [PATCH v3 4/6] xdiff/xdl_cleanup_records: make limits more clear Ezekiel Newren via GitGitGadget
@ 2026-03-27 21:09       ` Junio C Hamano
  2026-03-27 23:01         ` Junio C Hamano
  0 siblings, 1 reply; 78+ messages in thread
From: Junio C Hamano @ 2026-03-27 21:09 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget
  Cc: git, Yee Cheng Chin, Phillip Wood, René Scharfe, Jeff King,
	D. Ben Knoble, Ezekiel Newren

"Ezekiel Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Ezekiel Newren <ezekielnewren@gmail.com>
>
> Make the handling of per-file limits and the minimal-case clearer.
>   * Use explicit per-file limit variables (mlim1, mlim2) and initialize
>     them.
>   * The additional condition `!need_min` is now redundant; remove it.
> Best viewed with --color-words.
>
> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
> ---
>  xdiff/xprepare.c | 19 ++++++++++++-------
>  1 file changed, 12 insertions(+), 7 deletions(-)

t4071 and t8015 do not like this step, even though they are happy
with 1-3/6 applied.


> diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
> index 386668a92d..2cf1f8d1a8 100644
> --- a/xdiff/xprepare.c
> +++ b/xdiff/xprepare.c
> @@ -268,7 +268,7 @@ static bool xdl_clean_mmatch(uint8_t const *action, ptrdiff_t i, ptrdiff_t s, pt
>   * might be potentially discarded if they appear in a run of discardable.
>   */
>  static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xdf2) {
> -	ptrdiff_t i, nm, mlim;
> +	ptrdiff_t i, nm, mlim1, mlim2;
>  	xdlclass_t *rcrec;
>  	uint8_t *action1 = NULL, *action2 = NULL;
>  	bool need_min = !!(cf->flags & XDF_NEED_MINIMAL);
> @@ -287,25 +287,30 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
>  		goto cleanup;
>  	}
>  
> +	if (need_min) {
> +		/* i.e. infinity */
> +		mlim1 = SIZE_MAX;
> +		mlim2 = SIZE_MAX;
> +	} else {
> +		mlim1 = XDL_MIN(xdl_bogosqrt(xdf1->nrec), XDL_MAX_EQLIMIT);
> +		mlim2 = XDL_MIN(xdl_bogosqrt(xdf2->nrec), XDL_MAX_EQLIMIT);
> +	}
> +
>  	/*
>  	 * Initialize temporary arrays with DISCARD, KEEP, or INVESTIGATE.
>  	 */
> -	if ((mlim = (long)xdl_bogosqrt((uint64_t)xdf1->nrec)) > XDL_MAX_EQLIMIT)
> -		mlim = XDL_MAX_EQLIMIT;
>  	for (i = xdf1->dstart; i <= xdf1->dend; i++) {
>  		size_t mph1 = xdf1->recs[i].minimal_perfect_hash;
>  		rcrec = cf->rcrecs[mph1];
>  		nm = rcrec ? rcrec->len2 : 0;
> -		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
> +		action1[i] = (nm == 0) ? DISCARD: nm >= mlim1 ? INVESTIGATE: KEEP;
>  	}
>  
> -	if ((mlim = (long)xdl_bogosqrt((uint64_t)xdf2->nrec)) > XDL_MAX_EQLIMIT)
> -		mlim = XDL_MAX_EQLIMIT;
>  	for (i = xdf2->dstart; i <= xdf2->dend; i++) {
>  		size_t mph2 = xdf2->recs[i].minimal_perfect_hash;
>  		rcrec = cf->rcrecs[mph2];
>  		nm = rcrec ? rcrec->len1 : 0;
> -		action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
> +		action2[i] = (nm == 0) ? DISCARD: nm >= mlim2 ? INVESTIGATE: KEEP;
>  	}
>  
>  	/*


* Re: [PATCH v3 4/6] xdiff/xdl_cleanup_records: make limits more clear
  2026-03-27 21:09       ` Junio C Hamano
@ 2026-03-27 23:01         ` Junio C Hamano
  0 siblings, 0 replies; 78+ messages in thread
From: Junio C Hamano @ 2026-03-27 23:01 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget
  Cc: git, Yee Cheng Chin, Phillip Wood, René Scharfe, Jeff King,
	D. Ben Knoble, Ezekiel Newren

Junio C Hamano <gitster@pobox.com> writes:

> "Ezekiel Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> From: Ezekiel Newren <ezekielnewren@gmail.com>
>>
>> Make the handling of per-file limits and the minimal-case clearer.
>>   * Use explicit per-file limit variables (mlim1, mlim2) and initialize
>>     them.
>>   * The additional condition `!need_min` is now redundant; remove it.
>> Best viewed with --color-words.
>>
>> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
>> ---
>>  xdiff/xprepare.c | 19 ++++++++++++-------
>>  1 file changed, 12 insertions(+), 7 deletions(-)
>
> t4071 and t8015 do not like this step, even though they are happy
> with 1-3/6 applied.
>
>
>> diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
>> index 386668a92d..2cf1f8d1a8 100644
>> --- a/xdiff/xprepare.c
>> +++ b/xdiff/xprepare.c
>> @@ -268,7 +268,7 @@ static bool xdl_clean_mmatch(uint8_t const *action, ptrdiff_t i, ptrdiff_t s, pt
>>   * might be potentially discarded if they appear in a run of discardable.
>>   */
>>  static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xdf2) {
>> -	ptrdiff_t i, nm, mlim;
>> +	ptrdiff_t i, nm, mlim1, mlim2;

Ah, the problem may manifest itself in this step in the series, but
the root cause might be before this step.  ptrdiff_t is signed and
that is the type used for mlim/mlim1/mlim2 here, and before this
series these counters were of type "long", which is also signed.

>> +	if (need_min) {
>> +		/* i.e. infinity */
>> +		mlim1 = SIZE_MAX;
>> +		mlim2 = SIZE_MAX;

But SIZE_MAX is the maximum that a size_t (unsigned) can take.  No
wonder assigning it to ptrdiff_t and assuming that no other
sensible ptrdiff_t value can ever reach it does not work.  Instead,
this essentially assigns -1 to mlim1 and mlim2 when need_min is true.
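
The conversion described above can be checked in isolation.  What
follows is a minimal sketch, assuming a common platform where size_t
and ptrdiff_t have the same width and use two's complement (the C
standard leaves the result of this out-of-range conversion
implementation-defined).  The helper names limit_from_size_max() and
reaches_limit() are hypothetical and are not part of xdiff.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: mimics `mlim1 = SIZE_MAX;` where mlim1 is a
 * ptrdiff_t.  SIZE_MAX does not fit in the signed type, so the result
 * is implementation-defined; common platforms yield -1.
 */
ptrdiff_t limit_from_size_max(void)
{
	return (ptrdiff_t)SIZE_MAX;
}

/* Returns 1 when the count nm satisfies nm >= limit, 0 otherwise. */
int reaches_limit(ptrdiff_t nm, ptrdiff_t limit)
{
	return nm >= limit;
}
```

With such a limit, reaches_limit() is true for every non-negative
count, which is the opposite of the intended "infinity".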

>> +	} else {
>> +		mlim1 = XDL_MIN(xdl_bogosqrt(xdf1->nrec), XDL_MAX_EQLIMIT);
>> +		mlim2 = XDL_MIN(xdl_bogosqrt(xdf2->nrec), XDL_MAX_EQLIMIT);

This side I do not think has much to do with the breakage, but the
way XDL_MIN() is implemented, it must be noted that xdl_bogosqrt()
is called twice on the same value with this rewrite ...

>> +	}
>> +
>>  	/*
>>  	 * Initialize temporary arrays with DISCARD, KEEP, or INVESTIGATE.
>>  	 */
>> -	if ((mlim = (long)xdl_bogosqrt((uint64_t)xdf1->nrec)) > XDL_MAX_EQLIMIT)
>> -		mlim = XDL_MAX_EQLIMIT;

... as opposed to computing the value only once, in the original.
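
The double evaluation can be illustrated with a small sketch.  It
assumes XDL_MIN() is a plain ternary-style macro; SKETCH_MIN(),
counted_sqrt() and the two limit_via_*() helpers below are
hypothetical stand-ins, not xdiff code.

```c
#include <stdint.h>

/* Assumed shape of a ternary-style minimum macro; illustrative only. */
#define SKETCH_MIN(a, b) ((a) < (b) ? (a) : (b))

unsigned calls; /* counts invocations of counted_sqrt() */

/* Stand-in for a costly function; this is not the real xdl_bogosqrt(). */
uint64_t counted_sqrt(uint64_t n)
{
	uint64_t i;

	calls++;
	/* Largest i with i * i <= n, found by simple linear search. */
	for (i = 0; (i + 1) * (i + 1) <= n; i++)
		;
	return i;
}

/* Macro form: the call is expanded twice and runs twice when it wins. */
uint64_t limit_via_macro(uint64_t n, uint64_t cap)
{
	return SKETCH_MIN(counted_sqrt(n), cap);
}

/* Evaluate once into a local, then clamp. */
uint64_t limit_via_local(uint64_t n, uint64_t cap)
{
	uint64_t v = counted_sqrt(n);

	return v < cap ? v : cap;
}
```

In this sketch the second call only happens when the computed value
is the smaller operand; when the cap is smaller, the macro evaluates
the function once.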

>>  	for (i = xdf1->dstart; i <= xdf1->dend; i++) {
>>  		size_t mph1 = xdf1->recs[i].minimal_perfect_hash;
>>  		rcrec = cf->rcrecs[mph1];
>>  		nm = rcrec ? rcrec->len2 : 0;
>> -		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;

So the original said, "if nm is not zero and need_min is true, do
not bother comparing nm with anything, and always use KEEP.  If
need_min is false, we use INVESTIGATE only when nm is large enough,
otherwise KEEP."

>> +		action1[i] = (nm == 0) ? DISCARD: nm >= mlim1 ? INVESTIGATE: KEEP;

Updated code, when nm is not zero, does something different.  If
need_min is true, mlim1 is set to -1 and presumably nm is a count or
length that is bounded on its lower end with 0, so it is larger than
mlim1 (== -1), and we always take INVESTIGATE and never KEEP.

So the rewritten code is broken when need_min is true?

I suspect the remainder of the patch is broken exactly the same way,
so the remedy would be similar?
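
The difference between the two rules can be modelled outside of
xdiff.  This is only a sketch: the enum values, classify_old(),
classify_new(), limit_patch() and limit_remedy() are hypothetical
names, and using PTRDIFF_MAX as the "infinite" limit is one possible
remedy assumed here for illustration, not something the series
proposes.  It also assumes a platform where (ptrdiff_t)SIZE_MAX is -1.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum action { DISCARD, KEEP, INVESTIGATE }; /* values are illustrative */

/* Pre-patch rule: need_min short-circuits the limit comparison. */
enum action classify_old(ptrdiff_t nm, ptrdiff_t mlim, bool need_min)
{
	return (nm == 0) ? DISCARD :
	       (nm >= mlim && !need_min) ? INVESTIGATE : KEEP;
}

/* Post-patch rule: relies on the limit alone. */
enum action classify_new(ptrdiff_t nm, ptrdiff_t mlim)
{
	return (nm == 0) ? DISCARD : (nm >= mlim) ? INVESTIGATE : KEEP;
}

/* Limit as the patch sets it when need_min is true. */
ptrdiff_t limit_patch(void)
{
	return (ptrdiff_t)SIZE_MAX;
}

/* Assumed remedy: the largest value the signed type can hold. */
ptrdiff_t limit_remedy(void)
{
	return PTRDIFF_MAX;
}
```

For a non-zero nm with need_min set, classify_old() yields KEEP,
classify_new() with limit_patch() yields INVESTIGATE, and
classify_new() with limit_remedy() yields KEEP again for any count
below PTRDIFF_MAX.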

>>  	}
>>  
>> -	if ((mlim = (long)xdl_bogosqrt((uint64_t)xdf2->nrec)) > XDL_MAX_EQLIMIT)
>> -		mlim = XDL_MAX_EQLIMIT;
>>  	for (i = xdf2->dstart; i <= xdf2->dend; i++) {
>>  		size_t mph2 = xdf2->recs[i].minimal_perfect_hash;
>>  		rcrec = cf->rcrecs[mph2];
>>  		nm = rcrec ? rcrec->len1 : 0;
>> -		action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
>> +		action2[i] = (nm == 0) ? DISCARD: nm >= mlim2 ? INVESTIGATE: KEEP;
>>  	}
>>  
>>  	/*


Thread overview: 78+ messages
2026-01-02 18:52 [PATCH 00/10] Xdiff cleanup part 3 Ezekiel Newren via GitGitGadget
2026-01-02 18:52 ` [PATCH 01/10] ivec: introduce the C side of ivec Ezekiel Newren via GitGitGadget
2026-01-04  5:32   ` Junio C Hamano
2026-01-17 16:06     ` Ezekiel Newren
2026-01-08 14:34   ` Phillip Wood
2026-01-15 15:55     ` Ezekiel Newren
2026-01-16 10:39       ` Phillip Wood
2026-01-16 20:19         ` René Scharfe
2026-01-17 13:55           ` Phillip Wood
2026-01-17 16:04             ` Ezekiel Newren
2026-01-18 14:58               ` René Scharfe
2026-01-17 16:14         ` Ezekiel Newren
2026-01-17 16:16           ` Ezekiel Newren
2026-01-17 17:40           ` Phillip Wood
2026-01-19  5:59             ` Jeff King
2026-01-19 20:21               ` Ezekiel Newren
2026-01-19 20:40                 ` Jeff King
2026-01-20  2:36                   ` D. Ben Knoble
2026-01-21 21:00                   ` Ezekiel Newren
2026-01-21 21:20                     ` Jeff King
2026-01-21 21:31                       ` Junio C Hamano
2026-01-21 21:45                         ` Ezekiel Newren
2026-01-20 13:46               ` Phillip Wood
2026-01-20 14:06       ` Phillip Wood
2026-01-21 21:39         ` Ezekiel Newren
2026-01-28 11:15           ` Phillip Wood
2026-01-16 20:19   ` René Scharfe
2026-01-17 15:58     ` Ezekiel Newren
2026-01-18 14:55       ` René Scharfe
2026-01-02 18:52 ` [PATCH 02/10] xdiff: make classic diff explicit by creating xdl_do_classic_diff() Ezekiel Newren via GitGitGadget
2026-01-20 15:01   ` Phillip Wood
2026-01-21 21:05     ` Ezekiel Newren
2026-01-02 18:52 ` [PATCH 03/10] xdiff: don't waste time guessing the number of lines Ezekiel Newren via GitGitGadget
2026-01-20 15:02   ` Phillip Wood
2026-01-21 21:12     ` Ezekiel Newren
2026-01-22 10:16       ` Phillip Wood
2026-01-02 18:52 ` [PATCH 04/10] xdiff: let patience and histogram benefit from xdl_trim_ends() Ezekiel Newren via GitGitGadget
2026-01-20 15:02   ` Phillip Wood
2026-01-21 14:49     ` Phillip Wood
2026-01-02 18:52 ` [PATCH 05/10] xdiff: use xdfenv_t in xdl_trim_ends() and xdl_cleanup_records() Ezekiel Newren via GitGitGadget
2026-01-20 16:32   ` Phillip Wood
2026-01-02 18:52 ` [PATCH 06/10] xdiff: cleanup xdl_trim_ends() Ezekiel Newren via GitGitGadget
2026-01-20 16:32   ` Phillip Wood
2026-01-02 18:52 ` [PATCH 07/10] xdiff: replace xdfile_t.dstart with xdfenv_t.delta_start Ezekiel Newren via GitGitGadget
2026-01-20 16:32   ` Phillip Wood
2026-01-28 10:51     ` Phillip Wood
2026-01-02 18:52 ` [PATCH 08/10] xdiff: replace xdfile_t.dend with xdfenv_t.delta_end Ezekiel Newren via GitGitGadget
2026-01-02 18:52 ` [PATCH 09/10] xdiff: remove dependence on xdlclassifier from xdl_cleanup_records() Ezekiel Newren via GitGitGadget
2026-01-16 20:19   ` René Scharfe
2026-01-17 16:34     ` Ezekiel Newren
2026-01-18 18:23       ` René Scharfe
2026-01-21 15:01   ` Phillip Wood
2026-01-02 18:52 ` [PATCH 10/10] xdiff: move xdl_cleanup_records() from xprepare.c to xdiffi.c Ezekiel Newren via GitGitGadget
2026-01-21 15:01   ` Phillip Wood
2026-01-28 10:56     ` Phillip Wood
2026-01-04  2:44 ` [PATCH 00/10] Xdiff cleanup part 3 Junio C Hamano
2026-01-04  6:01 ` Yee Cheng Chin
2026-01-28 14:40 ` Phillip Wood
2026-03-06 23:03 ` Junio C Hamano
2026-03-09 19:06   ` Ezekiel Newren
2026-03-09 23:31     ` Junio C Hamano
2026-03-25 21:11 ` [PATCH v2 0/5] " Ezekiel Newren via GitGitGadget
2026-03-25 21:11   ` [PATCH v2 1/5] xdiff/xdl_cleanup_records: delete local recs pointer Ezekiel Newren via GitGitGadget
2026-03-25 21:11   ` [PATCH v2 2/5] xdiff/xdl_cleanup_records: make limits more clear Ezekiel Newren via GitGitGadget
2026-03-25 21:11   ` [PATCH v2 3/5] xdiff/xdl_cleanup_records: make setting action easier to follow Ezekiel Newren via GitGitGadget
2026-03-25 21:11   ` [PATCH v2 4/5] xdiff/xdl_cleanup_records: simplify INVESTIGATE handling for clarity Ezekiel Newren via GitGitGadget
2026-03-25 21:11   ` [PATCH v2 5/5] xdiff/xdl_cleanup_records: use unambiguous types Ezekiel Newren via GitGitGadget
2026-03-25 21:58     ` Junio C Hamano
2026-03-26  6:26   ` [PATCH v2 0/5] Xdiff cleanup part 3 SZEDER Gábor
2026-03-27 19:23   ` [PATCH v3 0/6] " Ezekiel Newren via GitGitGadget
2026-03-27 19:23     ` [PATCH v3 1/6] xdiff/xdl_cleanup_records: delete local recs pointer Ezekiel Newren via GitGitGadget
2026-03-27 19:23     ` [PATCH v3 2/6] xdiff: use unambiguous types in xdl_bogo_sqrt() Ezekiel Newren via GitGitGadget
2026-03-27 19:23     ` [PATCH v3 3/6] xdiff/xdl_cleanup_records: use unambiguous types Ezekiel Newren via GitGitGadget
2026-03-27 19:23     ` [PATCH v3 4/6] xdiff/xdl_cleanup_records: make limits more clear Ezekiel Newren via GitGitGadget
2026-03-27 21:09       ` Junio C Hamano
2026-03-27 23:01         ` Junio C Hamano
2026-03-27 19:23     ` [PATCH v3 5/6] xdiff/xdl_cleanup_records: make setting action easier to follow Ezekiel Newren via GitGitGadget
2026-03-27 19:23     ` [PATCH v3 6/6] xdiff/xdl_cleanup_records: simplify INVESTIGATE handling for clarity Ezekiel Newren via GitGitGadget
