[PATCH 0/9] Xdiff cleanup part2

* [PATCH 0/9] Xdiff cleanup part2
@ 2025-10-15 21:18 Ezekiel Newren via GitGitGadget
  2025-10-15 21:18 ` [PATCH 1/9] xdiff: use ssize_t for dstart/dend, make them last in xdfile_t Ezekiel Newren via GitGitGadget
                   ` (11 more replies)
  0 siblings, 12 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-10-15 21:18 UTC (permalink / raw)
  To: git; +Cc: Ezekiel Newren

Maintainer note: This patch series builds on top of en/xdiff-cleanup and
am/xdiff-hash-tweak (both of which are now in master).

The primary goal of this patch series is to convert every field's type in
xrecord_t and xdfile_t to be unambiguous, in preparation for making them
more Rust FFI friendly. Additionally, the ha field in xrecord_t is split
into line_hash and minimal_perfect_hash.

The order of some of the fields has changed as called out by the commit
messages.

Before:

typedef struct s_xrecord {
	char const *ptr;
	long size;
	unsigned long ha;
} xrecord_t;

typedef struct s_xdfile {
	xrecord_t *recs;
	long nrec;
	long dstart, dend;
	bool *changed;
	long *rindex;
	long nreff;
} xdfile_t;


After part 2:

typedef struct s_xrecord {
	uint8_t const *ptr;
	size_t size;
	uint64_t line_hash;
	size_t minimal_perfect_hash;
} xrecord_t;

typedef struct s_xdfile {
	xrecord_t *recs;
	size_t nrec;
	bool *changed;
	size_t *reference_index;
	size_t nreff;
	ssize_t dstart, dend;
} xdfile_t;


Ezekiel Newren (9):
  xdiff: use ssize_t for dstart/dend, make them last in xdfile_t
  xdiff: make xrecord_t.ptr a uint8_t instead of char
  xdiff: use size_t for xrecord_t.size
  xdiff: use unambiguous types in xdl_hash_record()
  xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash
  xdiff: make xdfile_t.nrec a size_t instead of long
  xdiff: make xdfile_t.nreff a size_t instead of long
  xdiff: change rindex from long to size_t in xdfile_t
  xdiff: rename rindex -> reference_index

 xdiff-interface.c  |  2 +-
 xdiff/xdiffi.c     | 29 +++++++++++------------
 xdiff/xemit.c      | 28 +++++++++++-----------
 xdiff/xhistogram.c |  4 ++--
 xdiff/xmerge.c     | 30 ++++++++++++------------
 xdiff/xpatience.c  | 14 +++++------
 xdiff/xprepare.c   | 58 +++++++++++++++++++++++-----------------------
 xdiff/xtypes.h     | 15 ++++++------
 xdiff/xutils.c     | 32 ++++++++++++-------------
 xdiff/xutils.h     |  6 ++---
 10 files changed, 109 insertions(+), 109 deletions(-)


base-commit: 143f58ef7535f8f8a80d810768a18bdf3807de26
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-2070%2Fezekielnewren%2Fxdiff_cleanup_part2-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-2070/ezekielnewren/xdiff_cleanup_part2-v1
Pull-Request: https://github.com/git/git/pull/2070
-- 
gitgitgadget


* [PATCH 1/9] xdiff: use ssize_t for dstart/dend, make them last in xdfile_t
  2025-10-15 21:18 [PATCH 0/9] Xdiff cleanup part2 Ezekiel Newren via GitGitGadget
@ 2025-10-15 21:18 ` Ezekiel Newren via GitGitGadget
  2025-10-21 11:32   ` Phillip Wood
  2025-10-15 21:18 ` [PATCH 2/9] xdiff: make xrecord_t.ptr a uint8_t instead of char Ezekiel Newren via GitGitGadget
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-10-15 21:18 UTC (permalink / raw)
  To: git; +Cc: Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

ssize_t is appropriate for dstart and dend because they both describe
positive or negative offsets relative to a pointer.

A future patch will move these fields to a different struct. Moving
them to the end of xdfile_t now means the field order of xdfile_t will
be disturbed less later.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xtypes.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index f145abba3e..3514bb1684 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -47,10 +47,10 @@ typedef struct s_xrecord {
 typedef struct s_xdfile {
 	xrecord_t *recs;
 	long nrec;
-	long dstart, dend;
 	bool *changed;
 	long *rindex;
 	long nreff;
+	ssize_t dstart, dend;
 } xdfile_t;
 
 typedef struct s_xdfenv {
-- 
gitgitgadget



* Re: [PATCH 1/9] xdiff: use ssize_t for dstart/dend, make them last in xdfile_t
  2025-10-15 21:18 ` [PATCH 1/9] xdiff: use ssize_t for dstart/dend, make them last in xdfile_t Ezekiel Newren via GitGitGadget
@ 2025-10-21 11:32   ` Phillip Wood
  2025-10-21 17:18     ` Junio C Hamano
  0 siblings, 1 reply; 118+ messages in thread
From: Phillip Wood @ 2025-10-21 11:32 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget, git; +Cc: Ezekiel Newren

On 15/10/2025 22:18, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
> 
> ssize_t is appropriate for dstart and dend because they both describe
> positive or negative offsets relative to a pointer.

Isn't ptrdiff_t the appropriate type for an offset to a pointer? ssize_t 
is not guaranteed to be the same width as size_t (this has caused 
problems in the past[1]) and is only defined by POSIX, not the C standard.

Thanks

Phillip

[1] https://lore.kernel.org/git/loom.20150207T174514-727@post.gmane.org/
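
To make the distinction concrete, here is a minimal sketch (not code from
xdiff; the helper name offset_between is hypothetical). Subtracting two
pointers into the same array yields a ptrdiff_t, which ISO C declares in
<stddef.h>, whereas ssize_t only comes from POSIX <sys/types.h>.

```c
#include <assert.h>
#include <stddef.h>	/* ptrdiff_t is ISO C */
#include <stdint.h>

/*
 * Hypothetical helper, not part of xdiff: the signed distance in
 * bytes from `base` to `p`. Both pointers must point into (or one
 * past the end of) the same array; otherwise the subtraction is
 * undefined behaviour.
 */
static ptrdiff_t offset_between(const uint8_t *base, const uint8_t *p)
{
	assert(base != NULL && p != NULL);
	/* The result type of pointer subtraction is ptrdiff_t. */
	return p - base;
}
```

The result is negative when p precedes base, which is the "positive or
negative offset relative to a pointer" case the commit message describes.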

> A future patch will move these fields to a different struct. Moving
> them to the end of xdfile_t now, means the field order of xdfile_t will
> be disturbed less.
> 
> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
> ---
>   xdiff/xtypes.h | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
> index f145abba3e..3514bb1684 100644
> --- a/xdiff/xtypes.h
> +++ b/xdiff/xtypes.h
> @@ -47,10 +47,10 @@ typedef struct s_xrecord {
>   typedef struct s_xdfile {
>   	xrecord_t *recs;
>   	long nrec;
> -	long dstart, dend;
>   	bool *changed;
>   	long *rindex;
>   	long nreff;
> +	ssize_t dstart, dend;
>   } xdfile_t;
>   
>   typedef struct s_xdfenv {



* Re: [PATCH 1/9] xdiff: use ssize_t for dstart/dend, make them last in xdfile_t
  2025-10-21 11:32   ` Phillip Wood
@ 2025-10-21 17:18     ` Junio C Hamano
  2025-10-22 21:07       ` Ezekiel Newren
  0 siblings, 1 reply; 118+ messages in thread
From: Junio C Hamano @ 2025-10-21 17:18 UTC (permalink / raw)
  To: Phillip Wood; +Cc: Ezekiel Newren via GitGitGadget, git, Ezekiel Newren

Phillip Wood <phillip.wood123@gmail.com> writes:

> On 15/10/2025 22:18, Ezekiel Newren via GitGitGadget wrote:
>> From: Ezekiel Newren <ezekielnewren@gmail.com>
>> 
>> ssize_t is appropriate for dstart and dend because they both describe
>> positive or negative offsets relative to a pointer.
>
> Isn't ptrdiff_t the appropriate type for an offset to a pointer? ssize_t 
> is not guaranteed to be the same width as size_t (this has caused 
> problems in the past[1]) and is only defined by POSIX, not the C standard.
>
> Thanks
>
> Phillip
>
> [1] https://lore.kernel.org/git/loom.20150207T174514-727@post.gmane.org/

Thanks for bringing up a very good point.

We often assume that a function yielding what we would normally put in
a size_t variable can instead return ssize_t and use the negative half
of the range to signal error conditions, as long as we _know_ that the
return value would never be so big as to exceed half the range of
size_t. As the cited incident shows, though, that assumption is an easy
mistake to make.
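
The convention described above might look like the following sketch. The
function find_byte is hypothetical and only illustrates the pattern; the
range check uses SSIZE_MAX from POSIX <limits.h> instead of assuming that
ssize_t is as wide as size_t.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <limits.h>	/* SSIZE_MAX (POSIX) */
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>	/* ssize_t is POSIX, not ISO C */

/*
 * Hypothetical helper: return the index of the first `needle` byte
 * in `buf`, or -1 if it is absent or if an index might not fit in
 * the non-negative range of ssize_t.
 */
static ssize_t find_byte(const uint8_t *buf, size_t len, uint8_t needle)
{
	size_t i;

	assert(buf != NULL || len == 0);
	if (len > (size_t)SSIZE_MAX)
		return -1;	/* do not assume ssize_t mirrors size_t */
	for (i = 0; i < len; i++)
		if (buf[i] == needle)
			return (ssize_t)i;
	return -1;
}
```

A caller would test the result for `< 0` before converting it back to
size_t.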


* Re: [PATCH 1/9] xdiff: use ssize_t for dstart/dend, make them last in xdfile_t
  2025-10-21 17:18     ` Junio C Hamano
@ 2025-10-22 21:07       ` Ezekiel Newren
  2025-10-22 21:38         ` Junio C Hamano
  0 siblings, 1 reply; 118+ messages in thread
From: Ezekiel Newren @ 2025-10-22 21:07 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Phillip Wood, Ezekiel Newren via GitGitGadget, git

On Tue, Oct 21, 2025 at 11:18 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Phillip Wood <phillip.wood123@gmail.com> writes:
>
> > On 15/10/2025 22:18, Ezekiel Newren via GitGitGadget wrote:
> >> From: Ezekiel Newren <ezekielnewren@gmail.com>
> >>
> >> ssize_t is appropriate for dstart and dend because they both describe
> >> positive or negative offsets relative to a pointer.
> >
> > Isn't ptrdiff_t the appropriate type for an offset to a pointer? ssize_t
> > is not guaranteed to be the same width as size_t (this has caused
> > problems in the past[1]) and is only defined by POSIX, not the C standard.
> >
> > Thanks
> >
> > Phillip
> >
> > [1] https://lore.kernel.org/git/loom.20150207T174514-727@post.gmane.org/
>
> Thanks for bringing up a very good point.
>
> We often consider that a function that yields what we would normally
> put in a size_t variable, when we _know_ that the return value would
> not be so big to exceed half the range of size_t, can instead return
> ssize_t and use the negative half of the range to signal error
> conditions, but as the cited incident shows that it is an easy
> mistake to make.

In my compat/rust_types.h file [1] (which was dropped) I defined isize
using ptrdiff_t rather than ssize_t. Maybe that file should be revived
so that we don't have confusion in code reviews when structs are being
expressly converted for the purpose of Rust FFI? I'd really like to
bring that file back so that everyone has a clear reference for how C
types map to Rust, but no one seemed to like it except me. Maybe it
should be an adoc file rather than a header?

[1] compat/rust_types.h
https://lore.kernel.org/git/2a7d5b05c18d4a96f1905b7043d47c62d367cd2a.1757274320.git.gitgitgadget@gmail.com/


* Re: [PATCH 1/9] xdiff: use ssize_t for dstart/dend, make them last in xdfile_t
  2025-10-22 21:07       ` Ezekiel Newren
@ 2025-10-22 21:38         ` Junio C Hamano
  2025-10-22 21:51           ` Ezekiel Newren
  0 siblings, 1 reply; 118+ messages in thread
From: Junio C Hamano @ 2025-10-22 21:38 UTC (permalink / raw)
  To: Ezekiel Newren; +Cc: Phillip Wood, Ezekiel Newren via GitGitGadget, git

Ezekiel Newren <ezekielnewren@gmail.com> writes:

> In my compat/rust_types.h file (which was dropped) I defined isize
> using ptrdiff_t rather than ssize_t. Maybe that file should be revived
> so that we don't have confusion in code reviews when structs are being
> expressly converted for the purpose of Rust FFI? I'd really like to
> bring that file back so that everyone has a clear reference for how C
> types map to Rust, but no one seemed to like it except me. Maybe it
> should be an adoc file rather than a header?

I may be mistaken, but I thought that the latest agreement was to
use conceptually the "same" type in each language, have each
language call that type in its native way, and if needed convert at
the FFI boundary.  So if we agree to use, for example, 64-bit signed
integer type for counting things plus returning error conditions via
negative values, maybe C-side can agree to use i64 for it, without
having to worry about how that thing is called in Rust side.

I am not sure in what way <compat/rust_types.h> should be used, and
perhaps a documentation file may be sufficient as you suggest, but
in any case, I agree that it should be made clear to everybody what
C-types are to be mapped to what Rust types and vice versa, and if
some C-types have no corresponding Rust type in that mapping, or if
some Rust types have no corresponding C-type, that type needs to be
converted before they reach the FFI boundary.
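
One way to picture "the same type, named natively on each side" is the
sketch below. It is only an illustration: count_newlines is a hypothetical
function, and the static assertions are assumptions about what a Rust
caller would rely on (8-bit bytes, and size_t, ptrdiff_t and uintptr_t all
being pointer-sized), not checks that exist in git today.

```c
#include <assert.h>	/* static_assert (C11) */
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed preconditions for mapping C types onto fixed Rust types. */
static_assert(CHAR_BIT == 8, "uint8_t <-> u8 needs 8-bit bytes");
static_assert(sizeof(size_t) == sizeof(uintptr_t), "size_t is not pointer-sized");
static_assert(sizeof(ptrdiff_t) == sizeof(size_t), "ptrdiff_t and size_t differ in width");

/*
 * Hypothetical example of a 64-bit signed result: the number of
 * '\n' bytes in `buf`, or a negative value on error. The C side
 * spells the type int64_t; a Rust caller would spell it i64.
 */
static int64_t count_newlines(const uint8_t *buf, size_t len)
{
	int64_t n = 0;
	size_t i;

	assert(buf != NULL || len == 0);
	if ((uint64_t)len > (uint64_t)INT64_MAX)
		return -1;	/* count could not be represented */
	for (i = 0; i < len; i++)
		if (buf[i] == '\n')
			n++;
	return n;
}
```

If one of the assertions failed on some platform, that would be the place
where a value has to be converted before it reaches the FFI boundary.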


* Re: [PATCH 1/9] xdiff: use ssize_t for dstart/dend, make them last in xdfile_t
  2025-10-22 21:38         ` Junio C Hamano
@ 2025-10-22 21:51           ` Ezekiel Newren
  0 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren @ 2025-10-22 21:51 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Phillip Wood, Ezekiel Newren via GitGitGadget, git

On Wed, Oct 22, 2025 at 3:38 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Ezekiel Newren <ezekielnewren@gmail.com> writes:
>
> > In my compat/rust_types.h file (which was dropped) I defined isize
> > using ptrdiff_t rather than ssize_t. Maybe that file should be revived
> > so that we don't have confusion in code reviews when structs are being
> > expressly converted for the purpose of Rust FFI? I'd really like to
> > bring that file back so that everyone has a clear reference for how C
> > types map to Rust, but no one seemed to like it except me. Maybe it
> > should be an adoc file rather than a header?
>
> I may be mistaken, but I thought that the latest agreement was to
> use conceptually the "same" type in each language, have each
> language call that type in its native way, and if needed convert at
> the FFI boundary.  So if we agree to use, for example, 64-bit signed
> integer type for counting things plus returning error conditions via
> negative values, maybe C-side can agree to use i64 for it, without
> having to worry about how that thing is called in Rust side.

Your understanding is correct. Would
Documentation/unambiguous_types.adoc be an appropriate place for this
documentation?

> I am not sure in what way <compat/rust_types.h> should be used, and
> perhaps a documentation file may be sufficient as you suggest, but
> in any case, I agree that it should be made clear to everybody what
> C-types are to be mapped to what Rust types and vice versa, and if
> some C-types have no corresponding Rust type in that mapping, or if
> some Rust types have no corresponding C-type, that type needs to be
> converted before they reach the FFI boundary.

Alright. I guess I'll drop the idea of compat/rust_types.h permanently.


* [PATCH 2/9] xdiff: make xrecord_t.ptr a uint8_t instead of char
  2025-10-15 21:18 [PATCH 0/9] Xdiff cleanup part2 Ezekiel Newren via GitGitGadget
  2025-10-15 21:18 ` [PATCH 1/9] xdiff: use ssize_t for dstart/dend, make them last in xdfile_t Ezekiel Newren via GitGitGadget
@ 2025-10-15 21:18 ` Ezekiel Newren via GitGitGadget
  2025-10-16 21:51   ` Kristoffer Haugsbakk
                     ` (2 more replies)
  2025-10-15 21:18 ` [PATCH 3/9] xdiff: use size_t for xrecord_t.size Ezekiel Newren via GitGitGadget
                   ` (9 subsequent siblings)
  11 siblings, 3 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-10-15 21:18 UTC (permalink / raw)
  To: git; +Cc: Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

Rust uses u8 to refer to bytes in memory. Since xrecord_t.ptr is also
referring to bytes in memory, rather than unicode code points, use
uint8_t instead of char.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xdiffi.c    |  8 ++++----
 xdiff/xemit.c     |  6 +++---
 xdiff/xmerge.c    | 14 +++++++-------
 xdiff/xpatience.c |  2 +-
 xdiff/xprepare.c  |  8 ++++----
 xdiff/xtypes.h    |  2 +-
 xdiff/xutils.c    |  4 ++--
 7 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 6f3998ee54..411a8aa69f 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -407,7 +407,7 @@ static int get_indent(xrecord_t *rec)
 	int ret = 0;
 
 	for (i = 0; i < rec->size; i++) {
-		char c = rec->ptr[i];
+		uint8_t c = rec->ptr[i];
 
 		if (!XDL_ISSPACE(c))
 			return ret;
@@ -993,11 +993,11 @@ static void xdl_mark_ignorable_lines(xdchange_t *xscr, xdfenv_t *xe, long flags)
 
 		rec = &xe->xdf1.recs[xch->i1];
 		for (i = 0; i < xch->chg1 && ignore; i++)
-			ignore = xdl_blankline(rec[i].ptr, rec[i].size, flags);
+			ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
 
 		rec = &xe->xdf2.recs[xch->i2];
 		for (i = 0; i < xch->chg2 && ignore; i++)
-			ignore = xdl_blankline(rec[i].ptr, rec[i].size, flags);
+			ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
 
 		xch->ignore = ignore;
 	}
@@ -1008,7 +1008,7 @@ static int record_matches_regex(xrecord_t *rec, xpparam_t const *xpp) {
 	size_t i;
 
 	for (i = 0; i < xpp->ignore_regex_nr; i++)
-		if (!regexec_buf(xpp->ignore_regex[i], rec->ptr, rec->size, 1,
+		if (!regexec_buf(xpp->ignore_regex[i], (const char *)rec->ptr, rec->size, 1,
 				 &regmatch, 0))
 			return 1;
 
diff --git a/xdiff/xemit.c b/xdiff/xemit.c
index b2f1f30cd3..ead930088a 100644
--- a/xdiff/xemit.c
+++ b/xdiff/xemit.c
@@ -27,7 +27,7 @@ static int xdl_emit_record(xdfile_t *xdf, long ri, char const *pre, xdemitcb_t *
 {
 	xrecord_t *rec = &xdf->recs[ri];
 
-	if (xdl_emit_diffrec(rec->ptr, rec->size, pre, strlen(pre), ecb) < 0)
+	if (xdl_emit_diffrec((char const *)rec->ptr, rec->size, pre, strlen(pre), ecb) < 0)
 		return -1;
 
 	return 0;
@@ -113,8 +113,8 @@ static long match_func_rec(xdfile_t *xdf, xdemitconf_t const *xecfg, long ri,
 	xrecord_t *rec = &xdf->recs[ri];
 
 	if (!xecfg->find_func)
-		return def_ff(rec->ptr, rec->size, buf, sz);
-	return xecfg->find_func(rec->ptr, rec->size, buf, sz, xecfg->find_func_priv);
+		return def_ff((const char *)rec->ptr, rec->size, buf, sz);
+	return xecfg->find_func((const char *)rec->ptr, rec->size, buf, sz, xecfg->find_func_priv);
 }
 
 static int is_func_rec(xdfile_t *xdf, xdemitconf_t const *xecfg, long ri)
diff --git a/xdiff/xmerge.c b/xdiff/xmerge.c
index fd600cbb5d..75cb3e76a2 100644
--- a/xdiff/xmerge.c
+++ b/xdiff/xmerge.c
@@ -101,8 +101,8 @@ static int xdl_merge_cmp_lines(xdfenv_t *xe1, int i1, xdfenv_t *xe2, int i2,
 	xrecord_t *rec2 = xe2->xdf2.recs + i2;
 
 	for (i = 0; i < line_count; i++) {
-		int result = xdl_recmatch(rec1[i].ptr, rec1[i].size,
-			rec2[i].ptr, rec2[i].size, flags);
+		int result = xdl_recmatch((const char *)rec1[i].ptr, rec1[i].size,
+			(const char *)rec2[i].ptr, rec2[i].size, flags);
 		if (!result)
 			return -1;
 	}
@@ -324,8 +324,8 @@ static int xdl_fill_merge_buffer(xdfenv_t *xe1, const char *name1,
 
 static int recmatch(xrecord_t *rec1, xrecord_t *rec2, unsigned long flags)
 {
-	return xdl_recmatch(rec1->ptr, rec1->size,
-			    rec2->ptr, rec2->size, flags);
+	return xdl_recmatch((const char *)rec1->ptr, rec1->size,
+			    (const char *)rec2->ptr, rec2->size, flags);
 }
 
 /*
@@ -382,10 +382,10 @@ static int xdl_refine_conflicts(xdfenv_t *xe1, xdfenv_t *xe2, xdmerge_t *m,
 		 * we have a very simple mmfile structure.
 		 */
 		t1.ptr = (char *)xe1->xdf2.recs[m->i1].ptr;
-		t1.size = xe1->xdf2.recs[m->i1 + m->chg1 - 1].ptr
+		t1.size = (char *)xe1->xdf2.recs[m->i1 + m->chg1 - 1].ptr
 			+ xe1->xdf2.recs[m->i1 + m->chg1 - 1].size - t1.ptr;
 		t2.ptr = (char *)xe2->xdf2.recs[m->i2].ptr;
-		t2.size = xe2->xdf2.recs[m->i2 + m->chg2 - 1].ptr
+		t2.size = (char *)xe2->xdf2.recs[m->i2 + m->chg2 - 1].ptr
 			+ xe2->xdf2.recs[m->i2 + m->chg2 - 1].size - t2.ptr;
 		if (xdl_do_diff(&t1, &t2, xpp, &xe) < 0)
 			return -1;
@@ -440,7 +440,7 @@ static int line_contains_alnum(const char *ptr, long size)
 static int lines_contain_alnum(xdfenv_t *xe, int i, int chg)
 {
 	for (; chg; chg--, i++)
-		if (line_contains_alnum(xe->xdf2.recs[i].ptr,
+		if (line_contains_alnum((const char *)xe->xdf2.recs[i].ptr,
 				xe->xdf2.recs[i].size))
 			return 1;
 	return 0;
diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
index 669b653580..bb61354f22 100644
--- a/xdiff/xpatience.c
+++ b/xdiff/xpatience.c
@@ -121,7 +121,7 @@ static void insert_record(xpparam_t const *xpp, int line, struct hashmap *map,
 		return;
 	map->entries[index].line1 = line;
 	map->entries[index].hash = record->ha;
-	map->entries[index].anchor = is_anchor(xpp, map->env->xdf1.recs[line - 1].ptr);
+	map->entries[index].anchor = is_anchor(xpp, (const char *)map->env->xdf1.recs[line - 1].ptr);
 	if (!map->first)
 		map->first = map->entries + index;
 	if (map->last) {
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 192334f1b7..4cb18b2b88 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -99,8 +99,8 @@ static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
 	hi = (long) XDL_HASHLONG(rec->ha, cf->hbits);
 	for (rcrec = cf->rchash[hi]; rcrec; rcrec = rcrec->next)
 		if (rcrec->rec.ha == rec->ha &&
-				xdl_recmatch(rcrec->rec.ptr, rcrec->rec.size,
-					rec->ptr, rec->size, cf->flags))
+				xdl_recmatch((const char *)rcrec->rec.ptr, rcrec->rec.size,
+					(const char *)rec->ptr, rec->size, cf->flags))
 			break;
 
 	if (!rcrec) {
@@ -156,8 +156,8 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 			if (XDL_ALLOC_GROW(xdf->recs, xdf->nrec + 1, narec))
 				goto abort;
 			crec = &xdf->recs[xdf->nrec++];
-			crec->ptr = prev;
-			crec->size = (long) (cur - prev);
+			crec->ptr = (uint8_t const *)prev;
+			crec->size =(long) ( cur - prev);
 			crec->ha = hav;
 			if (xdl_classify_record(pass, cf, crec) < 0)
 				goto abort;
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index 3514bb1684..57983627f5 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -39,7 +39,7 @@ typedef struct s_chastore {
 } chastore_t;
 
 typedef struct s_xrecord {
-	char const *ptr;
+	uint8_t const *ptr;
 	long size;
 	unsigned long ha;
 } xrecord_t;
diff --git a/xdiff/xutils.c b/xdiff/xutils.c
index 447e66c719..7be063bfb6 100644
--- a/xdiff/xutils.c
+++ b/xdiff/xutils.c
@@ -465,10 +465,10 @@ int xdl_fall_back_diff(xdfenv_t *diff_env, xpparam_t const *xpp,
 	xdfenv_t env;
 
 	subfile1.ptr = (char *)diff_env->xdf1.recs[line1 - 1].ptr;
-	subfile1.size = diff_env->xdf1.recs[line1 + count1 - 2].ptr +
+	subfile1.size = (char *)diff_env->xdf1.recs[line1 + count1 - 2].ptr +
 		diff_env->xdf1.recs[line1 + count1 - 2].size - subfile1.ptr;
 	subfile2.ptr = (char *)diff_env->xdf2.recs[line2 - 1].ptr;
-	subfile2.size = diff_env->xdf2.recs[line2 + count2 - 2].ptr +
+	subfile2.size = (char *)diff_env->xdf2.recs[line2 + count2 - 2].ptr +
 		diff_env->xdf2.recs[line2 + count2 - 2].size - subfile2.ptr;
 	if (xdl_do_diff(&subfile1, &subfile2, xpp, &env) < 0)
 		return -1;
-- 
gitgitgadget



* Re: [PATCH 2/9] xdiff: make xrecord_t.ptr a uint8_t instead of char
  2025-10-15 21:18 ` [PATCH 2/9] xdiff: make xrecord_t.ptr a uint8_t instead of char Ezekiel Newren via GitGitGadget
@ 2025-10-16 21:51   ` Kristoffer Haugsbakk
  2025-10-21  8:33   ` Patrick Steinhardt
  2025-10-21 13:13   ` Phillip Wood
  2 siblings, 0 replies; 118+ messages in thread
From: Kristoffer Haugsbakk @ 2025-10-16 21:51 UTC (permalink / raw)
  To: Josh Soref, git; +Cc: Ezekiel Newren

On Wed, Oct 15, 2025, at 23:18, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
>
> Rust uses u8 to refer to bytes in memory. Since xrecord_t.ptr is also
> referring to bytes in memory, rather than unicode code points, use

s/unicode/Unicode/

> uint8_t instead of char.
>
> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
> ---
>[snip]


* Re: [PATCH 2/9] xdiff: make xrecord_t.ptr a uint8_t instead of char
  2025-10-15 21:18 ` [PATCH 2/9] xdiff: make xrecord_t.ptr a uint8_t instead of char Ezekiel Newren via GitGitGadget
  2025-10-16 21:51   ` Kristoffer Haugsbakk
@ 2025-10-21  8:33   ` Patrick Steinhardt
  2025-10-22 21:12     ` Ezekiel Newren
  2025-10-21 13:13   ` Phillip Wood
  2 siblings, 1 reply; 118+ messages in thread
From: Patrick Steinhardt @ 2025-10-21  8:33 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget; +Cc: git, Ezekiel Newren

On Wed, Oct 15, 2025 at 09:18:14PM +0000, Ezekiel Newren via GitGitGadget wrote:
> diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
> index 6f3998ee54..411a8aa69f 100644
> --- a/xdiff/xdiffi.c
> +++ b/xdiff/xdiffi.c
> @@ -993,11 +993,11 @@ static void xdl_mark_ignorable_lines(xdchange_t *xscr, xdfenv_t *xe, long flags)
>  
>  		rec = &xe->xdf1.recs[xch->i1];
>  		for (i = 0; i < xch->chg1 && ignore; i++)
> -			ignore = xdl_blankline(rec[i].ptr, rec[i].size, flags);
> +			ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
>  
>  		rec = &xe->xdf2.recs[xch->i2];
>  		for (i = 0; i < xch->chg2 && ignore; i++)
> -			ignore = xdl_blankline(rec[i].ptr, rec[i].size, flags);
> +			ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
>  
>  		xch->ignore = ignore;
>  	}

Okay. Seemingly, we convert the structure itself, but we don't convert
any of the functions to accept a `uint8_t`. I guess you drew the line
here so that we don't have to also touch up dozens of function
signatures?

And how did you end up verifying that you added all casts? Does the
compiler flag those as warnings?

In any case, it might be nice to explain both of these details in the
commit message.

Patrick


* Re: [PATCH 2/9] xdiff: make xrecord_t.ptr a uint8_t instead of char
  2025-10-21  8:33   ` Patrick Steinhardt
@ 2025-10-22 21:12     ` Ezekiel Newren
  0 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren @ 2025-10-22 21:12 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Ezekiel Newren via GitGitGadget, git

On Tue, Oct 21, 2025 at 2:33 AM Patrick Steinhardt <ps@pks.im> wrote:
>
> On Wed, Oct 15, 2025 at 09:18:14PM +0000, Ezekiel Newren via GitGitGadget wrote:
> > diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
> > index 6f3998ee54..411a8aa69f 100644
> > --- a/xdiff/xdiffi.c
> > +++ b/xdiff/xdiffi.c
> > @@ -993,11 +993,11 @@ static void xdl_mark_ignorable_lines(xdchange_t *xscr, xdfenv_t *xe, long flags)
> >
> >               rec = &xe->xdf1.recs[xch->i1];
> >               for (i = 0; i < xch->chg1 && ignore; i++)
> > -                     ignore = xdl_blankline(rec[i].ptr, rec[i].size, flags);
> > +                     ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
> >
> >               rec = &xe->xdf2.recs[xch->i2];
> >               for (i = 0; i < xch->chg2 && ignore; i++)
> > -                     ignore = xdl_blankline(rec[i].ptr, rec[i].size, flags);
> > +                     ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
> >
> >               xch->ignore = ignore;
> >       }
>
> Okay. Seemingly, we convert the structure itself, but we don't convert
> any of the functions to accept an `uint8_t`. I guess you drew the line
> here so that we don't have to also touch up dozens of function
> signatures?

That is correct. I wanted to avoid _boiling the ocean_ just to change
the type of ptr.

> And how did you end up verifying that you added all casts? Does the
> compiler flag those as warnings?

I used CLion to search for all uses of that field and then added casts
where the types differ. Another way to do that is to run `make
DEVELOPER=1` and address all of the `uint8_t differs in signedness
from char` errors that are spat out.
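
A reduced sketch of the situation follows. The struct and both functions
are hypothetical stand-ins rather than xdiff code: the record stores a
uint8_t pointer while the callee still takes a char pointer, so the call
site needs a cast. Without it, GCC's -Wpointer-sign (implied by -Wall)
reports that the pointer targets differ in signedness, and a build that
treats warnings as errors stops there.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a record whose ptr is now uint8_t. */
struct rec {
	uint8_t const *ptr;
	size_t size;
};

/* Hypothetical callee that was not converted and still takes char. */
static int is_blank(const char *s, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (!isspace((unsigned char)s[i]))
			return 0;
	return 1;
}

static int rec_is_blank(const struct rec *r)
{
	assert(r != NULL);
	/* The cast bridges uint8_t const * to const char *. */
	return is_blank((const char *)r->ptr, r->size);
}
```

Dropping the cast in rec_is_blank() is a quick way to see which call sites
the compiler would flag.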

> In any case, it might be nice to explain both of these details in the
> commit message.

I will update it.

Thanks.


* Re: [PATCH 2/9] xdiff: make xrecord_t.ptr a uint8_t instead of char
  2025-10-15 21:18 ` [PATCH 2/9] xdiff: make xrecord_t.ptr a uint8_t instead of char Ezekiel Newren via GitGitGadget
  2025-10-16 21:51   ` Kristoffer Haugsbakk
  2025-10-21  8:33   ` Patrick Steinhardt
@ 2025-10-21 13:13   ` Phillip Wood
  2025-10-21 18:15     ` Junio C Hamano
  2 siblings, 1 reply; 118+ messages in thread
From: Phillip Wood @ 2025-10-21 13:13 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget, git; +Cc: Ezekiel Newren, Patrick Steinhardt

On 15/10/2025 22:18, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
> 
> Rust uses u8 to refer to bytes in memory. Since xrecord_t.ptr is also
> referring to bytes in memory, rather than unicode code points, use
> uint8_t instead of char.

In C "char" never refers to a Unicode code point so I don't follow the 
reasoning here. Isn't the reason you want to change from "char" to 
"uint8_t" to match Rust? Given "char" and "uint8_t" are the same width 
why can't we use "char" in the C struct and "u8" in the Rust struct as 
the two structs would still have the same layout?

I agree with Patrick's comments on this patch - it would be nice to know 
how you decided where to add casts. Given that Rust is going to be 
optional for at least a year we should take care to leave the C code in 
good shape with a minimum number of casts.

Thanks

Phillip
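
The layout claim can be checked at compile time with a sketch like this
one. The two structs are hypothetical copies of xrecord_t before and after
this patch, written out only for the comparison.

```c
#include <assert.h>	/* static_assert (C11) */
#include <stddef.h>	/* offsetof */
#include <stdint.h>

/* Hypothetical copy of xrecord_t before this patch. */
struct rec_char {
	char const *ptr;
	long size;
	unsigned long ha;
};

/* Hypothetical copy of xrecord_t after this patch. */
struct rec_u8 {
	uint8_t const *ptr;
	long size;
	unsigned long ha;
};

static_assert(sizeof(char) == sizeof(uint8_t),
	      "char and uint8_t differ in width");
static_assert(sizeof(struct rec_char) == sizeof(struct rec_u8),
	      "record size changed");
static_assert(offsetof(struct rec_char, size) == offsetof(struct rec_u8, size),
	      "size field moved");
static_assert(offsetof(struct rec_char, ha) == offsetof(struct rec_u8, ha),
	      "ha field moved");
```

This only speaks to layout. Whether plain char is signed remains
implementation-defined, which is the one difference the uint8_t spelling
removes on the C side.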

> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
> ---
>   xdiff/xdiffi.c    |  8 ++++----
>   xdiff/xemit.c     |  6 +++---
>   xdiff/xmerge.c    | 14 +++++++-------
>   xdiff/xpatience.c |  2 +-
>   xdiff/xprepare.c  |  8 ++++----
>   xdiff/xtypes.h    |  2 +-
>   xdiff/xutils.c    |  4 ++--
>   7 files changed, 22 insertions(+), 22 deletions(-)
> 
> diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
> index 6f3998ee54..411a8aa69f 100644
> --- a/xdiff/xdiffi.c
> +++ b/xdiff/xdiffi.c
> @@ -407,7 +407,7 @@ static int get_indent(xrecord_t *rec)
>   	int ret = 0;
>   
>   	for (i = 0; i < rec->size; i++) {
> -		char c = rec->ptr[i];
> +		uint8_t c = rec->ptr[i];
>   
>   		if (!XDL_ISSPACE(c))
>   			return ret;
> @@ -993,11 +993,11 @@ static void xdl_mark_ignorable_lines(xdchange_t *xscr, xdfenv_t *xe, long flags)
>   
>   		rec = &xe->xdf1.recs[xch->i1];
>   		for (i = 0; i < xch->chg1 && ignore; i++)
> -			ignore = xdl_blankline(rec[i].ptr, rec[i].size, flags);
> +			ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
>   
>   		rec = &xe->xdf2.recs[xch->i2];
>   		for (i = 0; i < xch->chg2 && ignore; i++)
> -			ignore = xdl_blankline(rec[i].ptr, rec[i].size, flags);
> +			ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
>   
>   		xch->ignore = ignore;
>   	}
> @@ -1008,7 +1008,7 @@ static int record_matches_regex(xrecord_t *rec, xpparam_t const *xpp) {
>   	size_t i;
>   
>   	for (i = 0; i < xpp->ignore_regex_nr; i++)
> -		if (!regexec_buf(xpp->ignore_regex[i], rec->ptr, rec->size, 1,
> +		if (!regexec_buf(xpp->ignore_regex[i], (const char *)rec->ptr, rec->size, 1,
>   				 &regmatch, 0))
>   			return 1;
>   
> diff --git a/xdiff/xemit.c b/xdiff/xemit.c
> index b2f1f30cd3..ead930088a 100644
> --- a/xdiff/xemit.c
> +++ b/xdiff/xemit.c
> @@ -27,7 +27,7 @@ static int xdl_emit_record(xdfile_t *xdf, long ri, char const *pre, xdemitcb_t *
>   {
>   	xrecord_t *rec = &xdf->recs[ri];
>   
> -	if (xdl_emit_diffrec(rec->ptr, rec->size, pre, strlen(pre), ecb) < 0)
> +	if (xdl_emit_diffrec((char const *)rec->ptr, rec->size, pre, strlen(pre), ecb) < 0)
>   		return -1;
>   
>   	return 0;
> @@ -113,8 +113,8 @@ static long match_func_rec(xdfile_t *xdf, xdemitconf_t const *xecfg, long ri,
>   	xrecord_t *rec = &xdf->recs[ri];
>   
>   	if (!xecfg->find_func)
> -		return def_ff(rec->ptr, rec->size, buf, sz);
> -	return xecfg->find_func(rec->ptr, rec->size, buf, sz, xecfg->find_func_priv);
> +		return def_ff((const char *)rec->ptr, rec->size, buf, sz);
> +	return xecfg->find_func((const char *)rec->ptr, rec->size, buf, sz, xecfg->find_func_priv);
>   }
>   
>   static int is_func_rec(xdfile_t *xdf, xdemitconf_t const *xecfg, long ri)
> diff --git a/xdiff/xmerge.c b/xdiff/xmerge.c
> index fd600cbb5d..75cb3e76a2 100644
> --- a/xdiff/xmerge.c
> +++ b/xdiff/xmerge.c
> @@ -101,8 +101,8 @@ static int xdl_merge_cmp_lines(xdfenv_t *xe1, int i1, xdfenv_t *xe2, int i2,
>   	xrecord_t *rec2 = xe2->xdf2.recs + i2;
>   
>   	for (i = 0; i < line_count; i++) {
> -		int result = xdl_recmatch(rec1[i].ptr, rec1[i].size,
> -			rec2[i].ptr, rec2[i].size, flags);
> +		int result = xdl_recmatch((const char *)rec1[i].ptr, rec1[i].size,
> +			(const char *)rec2[i].ptr, rec2[i].size, flags);
>   		if (!result)
>   			return -1;
>   	}
> @@ -324,8 +324,8 @@ static int xdl_fill_merge_buffer(xdfenv_t *xe1, const char *name1,
>   
>   static int recmatch(xrecord_t *rec1, xrecord_t *rec2, unsigned long flags)
>   {
> -	return xdl_recmatch(rec1->ptr, rec1->size,
> -			    rec2->ptr, rec2->size, flags);
> +	return xdl_recmatch((const char *)rec1->ptr, rec1->size,
> +			    (const char *)rec2->ptr, rec2->size, flags);
>   }
>   
>   /*
> @@ -382,10 +382,10 @@ static int xdl_refine_conflicts(xdfenv_t *xe1, xdfenv_t *xe2, xdmerge_t *m,
>   		 * we have a very simple mmfile structure.
>   		 */
>   		t1.ptr = (char *)xe1->xdf2.recs[m->i1].ptr;
> -		t1.size = xe1->xdf2.recs[m->i1 + m->chg1 - 1].ptr
> +		t1.size = (char *)xe1->xdf2.recs[m->i1 + m->chg1 - 1].ptr
>   			+ xe1->xdf2.recs[m->i1 + m->chg1 - 1].size - t1.ptr;
>   		t2.ptr = (char *)xe2->xdf2.recs[m->i2].ptr;
> -		t2.size = xe2->xdf2.recs[m->i2 + m->chg2 - 1].ptr
> +		t2.size = (char *)xe2->xdf2.recs[m->i2 + m->chg2 - 1].ptr
>   			+ xe2->xdf2.recs[m->i2 + m->chg2 - 1].size - t2.ptr;
>   		if (xdl_do_diff(&t1, &t2, xpp, &xe) < 0)
>   			return -1;
> @@ -440,7 +440,7 @@ static int line_contains_alnum(const char *ptr, long size)
>   static int lines_contain_alnum(xdfenv_t *xe, int i, int chg)
>   {
>   	for (; chg; chg--, i++)
> -		if (line_contains_alnum(xe->xdf2.recs[i].ptr,
> +		if (line_contains_alnum((const char *)xe->xdf2.recs[i].ptr,
>   				xe->xdf2.recs[i].size))
>   			return 1;
>   	return 0;
> diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
> index 669b653580..bb61354f22 100644
> --- a/xdiff/xpatience.c
> +++ b/xdiff/xpatience.c
> @@ -121,7 +121,7 @@ static void insert_record(xpparam_t const *xpp, int line, struct hashmap *map,
>   		return;
>   	map->entries[index].line1 = line;
>   	map->entries[index].hash = record->ha;
> -	map->entries[index].anchor = is_anchor(xpp, map->env->xdf1.recs[line - 1].ptr);
> +	map->entries[index].anchor = is_anchor(xpp, (const char *)map->env->xdf1.recs[line - 1].ptr);
>   	if (!map->first)
>   		map->first = map->entries + index;
>   	if (map->last) {
> diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
> index 192334f1b7..4cb18b2b88 100644
> --- a/xdiff/xprepare.c
> +++ b/xdiff/xprepare.c
> @@ -99,8 +99,8 @@ static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
>   	hi = (long) XDL_HASHLONG(rec->ha, cf->hbits);
>   	for (rcrec = cf->rchash[hi]; rcrec; rcrec = rcrec->next)
>   		if (rcrec->rec.ha == rec->ha &&
> -				xdl_recmatch(rcrec->rec.ptr, rcrec->rec.size,
> -					rec->ptr, rec->size, cf->flags))
> +				xdl_recmatch((const char *)rcrec->rec.ptr, rcrec->rec.size,
> +					(const char *)rec->ptr, rec->size, cf->flags))
>   			break;
>   
>   	if (!rcrec) {
> @@ -156,8 +156,8 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
>   			if (XDL_ALLOC_GROW(xdf->recs, xdf->nrec + 1, narec))
>   				goto abort;
>   			crec = &xdf->recs[xdf->nrec++];
> -			crec->ptr = prev;
> -			crec->size = (long) (cur - prev);
> +			crec->ptr = (uint8_t const *)prev;
> +			crec->size =(long) ( cur - prev);
>   			crec->ha = hav;
>   			if (xdl_classify_record(pass, cf, crec) < 0)
>   				goto abort;
> diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
> index 3514bb1684..57983627f5 100644
> --- a/xdiff/xtypes.h
> +++ b/xdiff/xtypes.h
> @@ -39,7 +39,7 @@ typedef struct s_chastore {
>   } chastore_t;
>   
>   typedef struct s_xrecord {
> -	char const *ptr;
> +	uint8_t const *ptr;
>   	long size;
>   	unsigned long ha;
>   } xrecord_t;
> diff --git a/xdiff/xutils.c b/xdiff/xutils.c
> index 447e66c719..7be063bfb6 100644
> --- a/xdiff/xutils.c
> +++ b/xdiff/xutils.c
> @@ -465,10 +465,10 @@ int xdl_fall_back_diff(xdfenv_t *diff_env, xpparam_t const *xpp,
>   	xdfenv_t env;
>   
>   	subfile1.ptr = (char *)diff_env->xdf1.recs[line1 - 1].ptr;
> -	subfile1.size = diff_env->xdf1.recs[line1 + count1 - 2].ptr +
> +	subfile1.size = (char *)diff_env->xdf1.recs[line1 + count1 - 2].ptr +
>   		diff_env->xdf1.recs[line1 + count1 - 2].size - subfile1.ptr;
>   	subfile2.ptr = (char *)diff_env->xdf2.recs[line2 - 1].ptr;
> -	subfile2.size = diff_env->xdf2.recs[line2 + count2 - 2].ptr +
> +	subfile2.size = (char *)diff_env->xdf2.recs[line2 + count2 - 2].ptr +
>   		diff_env->xdf2.recs[line2 + count2 - 2].size - subfile2.ptr;
>   	if (xdl_do_diff(&subfile1, &subfile2, xpp, &env) < 0)
>   		return -1;



* Re: [PATCH 2/9] xdiff: make xrecord_t.ptr a uint8_t instead of char
  2025-10-21 13:13   ` Phillip Wood
@ 2025-10-21 18:15     ` Junio C Hamano
  2025-10-22 13:27       ` Phillip Wood
  0 siblings, 1 reply; 118+ messages in thread
From: Junio C Hamano @ 2025-10-21 18:15 UTC (permalink / raw)
  To: Phillip Wood
  Cc: Ezekiel Newren via GitGitGadget, git, Ezekiel Newren,
	Patrick Steinhardt

Phillip Wood <phillip.wood123@gmail.com> writes:

> In C "char" never refers to a Unicode code point so I don't follow the 
> reasoning here. Isn't the reason you want to change from "char" to 
> "uint8_t" to match rust? Given "char" and "uint8_t" are the same width 
> why can't we use "char" in the C struct and "u8" in the rust struct as 
> the two structs would still have the same layout?

And forcing u8 makes sure both sides of the FFI agree on the
signedness (C "char"'s signedness is implementation-defined),
which is a good thing.

I 100% agree that being honest about the motivation to sell this
change would be a good thing to do here.  I do not think "in this
series, I want to match the types used at the interface to be of
Rust's" is a position to be ashamed of ;-)

> I agree with Patrick's comments on this patch - it would be nice to know 
> how you decided where to add casts. Given that rust is going to be 
> optional for at least a year we should take care to leave the C code in 
> good shape with a minimum number of casts.

Thanks.


* Re: [PATCH 2/9] xdiff: make xrecord_t.ptr a uint8_t instead of char
  2025-10-21 18:15     ` Junio C Hamano
@ 2025-10-22 13:27       ` Phillip Wood
  2025-10-22 20:55         ` Ezekiel Newren
  0 siblings, 1 reply; 118+ messages in thread
From: Phillip Wood @ 2025-10-22 13:27 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ezekiel Newren via GitGitGadget, git, Ezekiel Newren,
	Patrick Steinhardt

On 21/10/2025 19:15, Junio C Hamano wrote:
> Phillip Wood <phillip.wood123@gmail.com> writes:
> 
>> In C "char" never refers to a Unicode code point so I don't follow the
>> reasoning here. Isn't the reason you want to change from "char" to
>> "uint8_t" to match rust? Given "char" and "uint8_t" are the same width
>> why can't we use "char" in the C struct and "u8" in the rust struct as
>> the two structs would still have the same layout?
> 
> And forcing u8 makes sure both sides of the FFI agree on the
> signedness (C "char"'s signedness is implementation-defined),
> which is a good thing.

That's true, and ignoring the signedness would be hacky, but I'm not sure 
it matters in practice. Both C and Rust would use the same bit patterns 
for "abc" and b"abc\0", and in general C plays fast and loose with the 
signedness of variables all over the place. The trade-off for respecting 
the signedness is that we either have casts all over the place or 
massive churn converting the rest of the code to use uint8_t. This 
problem isn't limited to xdiff; it will be true wherever we share 
bytestrings, such as the contents of objects, between C and Rust, as we 
tend to use char rather than uint8_t in our code.
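
To illustrate the point about bit patterns versus signedness, here is a
minimal C sketch. It is not part of the patch; the helper names
(widen_via_char, widen_via_u8, same_bits) are made up for this example.
It shows that a byte such as 0xE9 has the same representation through
either pointer type, while the value obtained by widening it to int
depends on whether the platform's char is signed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: widen one byte read through a plain char pointer.
 * The result depends on whether the platform's char is signed.
 */
static int widen_via_char(const char *p)
{
	return (int)*p;
}

/* Hypothetical helper: the same read through uint8_t is always 0..255. */
static int widen_via_u8(const uint8_t *p)
{
	return (int)*p;
}

/* Both views share one bit pattern, so memcmp() sees no difference. */
static int same_bits(const char *c, const uint8_t *u, size_t n)
{
	assert(c != NULL && u != NULL);
	return memcmp(c, u, n) == 0;
}
```

For the byte 0xE9, widen_via_u8() returns 233 on every platform, while
widen_via_char() returns 233 where char is unsigned and, on the usual
two's complement platforms with a signed 8-bit char, -23. same_bits()
reports a match in both cases.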

Thanks

Phillip

> I 100% agree that being honest about the motivation to sell this
> change would be a good thing to do here.  I do not think "in this
> series, I want to match the types used at the interface to be of
> Rust's" is a position to be ashamed of ;-)
> 
>> I agree with Patrick's comments on this patch - it would be nice to know
>> how you decided where to add casts. Given that rust is going to be
>> optional for at least a year we should take care to leave the C code in
>> good shape with a minimum number of casts.
> 
> Thanks.



* Re: [PATCH 2/9] xdiff: make xrecord_t.ptr a uint8_t instead of char
  2025-10-22 13:27       ` Phillip Wood
@ 2025-10-22 20:55         ` Ezekiel Newren
  0 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren @ 2025-10-22 20:55 UTC (permalink / raw)
  To: phillip.wood
  Cc: Junio C Hamano, Ezekiel Newren via GitGitGadget, git,
	Patrick Steinhardt

On Wed, Oct 22, 2025 at 7:27 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
> > I 100% agree that being honest about the motivation to sell this
> > change would be a good thing to do here.  I do not think "in this
> > series, I want to match the types used at the interface to be of
> > Rust's" is a position to be ashamed of ;-)
> >
> >> I agree with Patrick's comments on this patch - it would be nice to know
> >> how you decided where to add casts. Given that rust is going to be
> >> optional for at least a year we should take care to leave the C code in
> >> good shape with a minimum number of casts.
> >
> > Thanks.

I'm not arguing that uint8_t should be used everywhere in Git, only
that it is used everywhere in xdiff. xrecord_t and xdfile_t are
fundamental to how xdiff passes data around and they need to be
transparent to both sides. I'm trying to leave the rest of the data
structures alone in order to avoid refactor churn. Refactoring C to
use unambiguous types, outside of xdiff, is outside the scope of this
patch series.

Another problem with using char instead of uint8_t is that tools like
cbindgen and bindgen don't translate char to u8. Bindgen will see char
and will produce std::ffi::c_char on the Rust side, see [1] for why
that's a problem. The other way around is a problem too. When cbindgen
sees u8 it will generate uint8_t on the C side and then `make
DEVELOPER=1` won't compile because uint8_t and char differ in
signedness.
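
The pointer-signedness problem can be sketched in plain C. This is an
illustration only, not code from the series: demo_record and both
functions are hypothetical stand-ins, loosely modelled on
line_contains_alnum() and the casts added in patch 2. The assumption
here is that the build failure comes from the compiler's pointer-sign
diagnostic being treated as an error; the message above does not name
the exact warning.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for an older helper that still takes text as const char *. */
static int demo_contains_alnum(const char *ptr, long size)
{
	assert(ptr != NULL);
	for (long i = 0; i < size; i++)
		if (isalnum((unsigned char)ptr[i]))
			return 1;
	return 0;
}

/* Hypothetical record type using the fixed-width byte pointer. */
struct demo_record {
	uint8_t const *ptr;
	size_t size;
};

static int demo_record_contains_alnum(const struct demo_record *rec)
{
	/*
	 * Passing rec->ptr directly would mix uint8_t const * with
	 * const char *, which GCC and Clang flag as a pointer-sign
	 * mismatch; the explicit cast keeps the call warning-free.
	 */
	return demo_contains_alnum((const char *)rec->ptr, (long)rec->size);
}
```

The cast is needed only at the boundary with helpers that keep their
char-based signatures, which is the pattern the xdiff call sites in
this series follow.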

[1] Problems with C types
https://lore.kernel.org/git/CAH=ZcbA_8JM1hdUAfFe3ho0ShuniguEpV1308S0nCkCHOCsmmg@mail.gmail.com/


* [PATCH 3/9] xdiff: use size_t for xrecord_t.size
  2025-10-15 21:18 [PATCH 0/9] Xdiff cleanup part2 Ezekiel Newren via GitGitGadget
  2025-10-15 21:18 ` [PATCH 1/9] xdiff: use ssize_t for dstart/dend, make them last in xdfile_t Ezekiel Newren via GitGitGadget
  2025-10-15 21:18 ` [PATCH 2/9] xdiff: make xrecord_t.ptr a uint8_t instead of char Ezekiel Newren via GitGitGadget
@ 2025-10-15 21:18 ` Ezekiel Newren via GitGitGadget
  2025-10-15 21:18 ` [PATCH 4/9] xdiff: use unambiguous types in xdl_hash_record() Ezekiel Newren via GitGitGadget
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-10-15 21:18 UTC (permalink / raw)
  To: git; +Cc: Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

size_t is the appropriate type because size describes the number of
elements (bytes, in this case) in memory.
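
As a rough illustration of what this change means for callers (a
sketch, not code from this patch; demo_record, leading_blanks and
size_as_long are hypothetical names): loop counters compared against
size become size_t, and call sites that still feed a long-taking helper
need an explicit narrowing. The patch uses plain (long) casts; the
range check below is an extra assumption added for the example.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record type with an unsigned byte count. */
struct demo_record {
	uint8_t const *ptr;
	size_t size;
};

/* Count leading spaces and tabs; the index has the same type as size. */
static size_t leading_blanks(const struct demo_record *rec)
{
	size_t i;

	assert(rec != NULL);
	for (i = 0; i < rec->size; i++)
		if (rec->ptr[i] != ' ' && rec->ptr[i] != '\t')
			break;
	return i;
}

/*
 * Checked narrowing for a helper that still takes a long.
 * Returns -1 when the value does not fit.
 */
static long size_as_long(size_t n)
{
	return n > (size_t)LONG_MAX ? -1 : (long)n;
}
```

Keeping the index unsigned avoids a signed/unsigned comparison in the
loop condition, at the cost of an explicit conversion wherever the
older long-based interfaces remain.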

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xdiffi.c   |  7 +++----
 xdiff/xemit.c    |  8 ++++----
 xdiff/xmerge.c   | 16 ++++++++--------
 xdiff/xprepare.c |  6 +++---
 xdiff/xtypes.h   |  2 +-
 5 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 411a8aa69f..edd05466df 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -403,10 +403,9 @@ static int recs_match(xrecord_t *rec1, xrecord_t *rec2)
  */
 static int get_indent(xrecord_t *rec)
 {
-	long i;
 	int ret = 0;
 
-	for (i = 0; i < rec->size; i++) {
+	for (size_t i = 0; i < rec->size; i++) {
 		uint8_t c = rec->ptr[i];
 
 		if (!XDL_ISSPACE(c))
@@ -993,11 +992,11 @@ static void xdl_mark_ignorable_lines(xdchange_t *xscr, xdfenv_t *xe, long flags)
 
 		rec = &xe->xdf1.recs[xch->i1];
 		for (i = 0; i < xch->chg1 && ignore; i++)
-			ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
+			ignore = xdl_blankline((const char *)rec[i].ptr, (long)rec[i].size, flags);
 
 		rec = &xe->xdf2.recs[xch->i2];
 		for (i = 0; i < xch->chg2 && ignore; i++)
-			ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
+			ignore = xdl_blankline((const char *)rec[i].ptr, (long)rec[i].size, flags);
 
 		xch->ignore = ignore;
 	}
diff --git a/xdiff/xemit.c b/xdiff/xemit.c
index ead930088a..2f8007753c 100644
--- a/xdiff/xemit.c
+++ b/xdiff/xemit.c
@@ -27,7 +27,7 @@ static int xdl_emit_record(xdfile_t *xdf, long ri, char const *pre, xdemitcb_t *
 {
 	xrecord_t *rec = &xdf->recs[ri];
 
-	if (xdl_emit_diffrec((char const *)rec->ptr, rec->size, pre, strlen(pre), ecb) < 0)
+	if (xdl_emit_diffrec((char const *)rec->ptr, (long)rec->size, pre, strlen(pre), ecb) < 0)
 		return -1;
 
 	return 0;
@@ -113,8 +113,8 @@ static long match_func_rec(xdfile_t *xdf, xdemitconf_t const *xecfg, long ri,
 	xrecord_t *rec = &xdf->recs[ri];
 
 	if (!xecfg->find_func)
-		return def_ff((const char *)rec->ptr, rec->size, buf, sz);
-	return xecfg->find_func((const char *)rec->ptr, rec->size, buf, sz, xecfg->find_func_priv);
+		return def_ff((const char *)rec->ptr, (long)rec->size, buf, sz);
+	return xecfg->find_func((const char *)rec->ptr, (long)rec->size, buf, sz, xecfg->find_func_priv);
 }
 
 static int is_func_rec(xdfile_t *xdf, xdemitconf_t const *xecfg, long ri)
@@ -151,7 +151,7 @@ static long get_func_line(xdfenv_t *xe, xdemitconf_t const *xecfg,
 static int is_empty_rec(xdfile_t *xdf, long ri)
 {
 	xrecord_t *rec = &xdf->recs[ri];
-	long i = 0;
+	size_t i = 0;
 
 	for (; i < rec->size && XDL_ISSPACE(rec->ptr[i]); i++);
 
diff --git a/xdiff/xmerge.c b/xdiff/xmerge.c
index 75cb3e76a2..0dd4558a32 100644
--- a/xdiff/xmerge.c
+++ b/xdiff/xmerge.c
@@ -101,8 +101,8 @@ static int xdl_merge_cmp_lines(xdfenv_t *xe1, int i1, xdfenv_t *xe2, int i2,
 	xrecord_t *rec2 = xe2->xdf2.recs + i2;
 
 	for (i = 0; i < line_count; i++) {
-		int result = xdl_recmatch((const char *)rec1[i].ptr, rec1[i].size,
-			(const char *)rec2[i].ptr, rec2[i].size, flags);
+		int result = xdl_recmatch((const char *)rec1[i].ptr, (long)rec1[i].size,
+			(const char *)rec2[i].ptr, (long)rec2[i].size, flags);
 		if (!result)
 			return -1;
 	}
@@ -119,11 +119,11 @@ static int xdl_recs_copy_0(int use_orig, xdfenv_t *xe, int i, int count, int nee
 	if (count < 1)
 		return 0;
 
-	for (i = 0; i < count; size += recs[i++].size)
+	for (i = 0; i < count; size += (int)recs[i++].size)
 		if (dest)
 			memcpy(dest + size, recs[i].ptr, recs[i].size);
 	if (add_nl) {
-		i = recs[count - 1].size;
+		i = (int)recs[count - 1].size;
 		if (i == 0 || recs[count - 1].ptr[i - 1] != '\n') {
 			if (needs_cr) {
 				if (dest)
@@ -156,7 +156,7 @@ static int xdl_orig_copy(xdfenv_t *xe, int i, int count, int needs_cr, int add_n
  */
 static int is_eol_crlf(xdfile_t *file, int i)
 {
-	long size;
+	size_t size;
 
 	if (i < file->nrec - 1)
 		/* All lines before the last *must* end in LF */
@@ -324,8 +324,8 @@ static int xdl_fill_merge_buffer(xdfenv_t *xe1, const char *name1,
 
 static int recmatch(xrecord_t *rec1, xrecord_t *rec2, unsigned long flags)
 {
-	return xdl_recmatch((const char *)rec1->ptr, rec1->size,
-			    (const char *)rec2->ptr, rec2->size, flags);
+	return xdl_recmatch((const char *)rec1->ptr, (long)rec1->size,
+			    (const char *)rec2->ptr, (long)rec2->size, flags);
 }
 
 /*
@@ -441,7 +441,7 @@ static int lines_contain_alnum(xdfenv_t *xe, int i, int chg)
 {
 	for (; chg; chg--, i++)
 		if (line_contains_alnum((const char *)xe->xdf2.recs[i].ptr,
-				xe->xdf2.recs[i].size))
+				(long)xe->xdf2.recs[i].size))
 			return 1;
 	return 0;
 }
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 4cb18b2b88..b3219aed3e 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -99,8 +99,8 @@ static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
 	hi = (long) XDL_HASHLONG(rec->ha, cf->hbits);
 	for (rcrec = cf->rchash[hi]; rcrec; rcrec = rcrec->next)
 		if (rcrec->rec.ha == rec->ha &&
-				xdl_recmatch((const char *)rcrec->rec.ptr, rcrec->rec.size,
-					(const char *)rec->ptr, rec->size, cf->flags))
+				xdl_recmatch((const char *)rcrec->rec.ptr, (long)rcrec->rec.size,
+					(const char *)rec->ptr, (long)rec->size, cf->flags))
 			break;
 
 	if (!rcrec) {
@@ -157,7 +157,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 				goto abort;
 			crec = &xdf->recs[xdf->nrec++];
 			crec->ptr = (uint8_t const *)prev;
-			crec->size =(long) ( cur - prev);
+			crec->size = cur - prev;
 			crec->ha = hav;
 			if (xdl_classify_record(pass, cf, crec) < 0)
 				goto abort;
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index 57983627f5..00d2d8c8cd 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -40,7 +40,7 @@ typedef struct s_chastore {
 
 typedef struct s_xrecord {
 	uint8_t const *ptr;
-	long size;
+	size_t size;
 	unsigned long ha;
 } xrecord_t;
 
-- 
gitgitgadget



* [PATCH 4/9] xdiff: use unambiguous types in xdl_hash_record()
  2025-10-15 21:18 [PATCH 0/9] Xdiff cleanup part2 Ezekiel Newren via GitGitGadget
                   ` (2 preceding siblings ...)
  2025-10-15 21:18 ` [PATCH 3/9] xdiff: use size_t for xrecord_t.size Ezekiel Newren via GitGitGadget
@ 2025-10-15 21:18 ` Ezekiel Newren via GitGitGadget
  2025-10-21  8:33   ` Patrick Steinhardt
  2025-10-15 21:18 ` [PATCH 5/9] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash Ezekiel Newren via GitGitGadget
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-10-15 21:18 UTC (permalink / raw)
  To: git; +Cc: Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff-interface.c |  2 +-
 xdiff/xprepare.c  |  6 +++---
 xdiff/xutils.c    | 28 ++++++++++++++--------------
 xdiff/xutils.h    |  6 +++---
 4 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/xdiff-interface.c b/xdiff-interface.c
index 4971f722b3..1a35556380 100644
--- a/xdiff-interface.c
+++ b/xdiff-interface.c
@@ -300,7 +300,7 @@ void xdiff_clear_find_func(xdemitconf_t *xecfg)
 
 unsigned long xdiff_hash_string(const char *s, size_t len, long flags)
 {
-	return xdl_hash_record(&s, s + len, flags);
+	return xdl_hash_record((uint8_t const**)&s, (uint8_t const*)s + len, flags);
 }
 
 int xdiff_compare_lines(const char *l1, long s1,
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index b3219aed3e..85e56021da 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -137,8 +137,8 @@ static void xdl_free_ctx(xdfile_t *xdf)
 static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_t const *xpp,
 			   xdlclassifier_t *cf, xdfile_t *xdf) {
 	long bsize;
-	unsigned long hav;
-	char const *blk, *cur, *top, *prev;
+	uint64_t hav;
+	uint8_t const *blk, *cur, *top, *prev;
 	xrecord_t *crec;
 
 	xdf->rindex = NULL;
@@ -156,7 +156,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 			if (XDL_ALLOC_GROW(xdf->recs, xdf->nrec + 1, narec))
 				goto abort;
 			crec = &xdf->recs[xdf->nrec++];
-			crec->ptr = (uint8_t const *)prev;
+			crec->ptr = prev;
 			crec->size = cur - prev;
 			crec->ha = hav;
 			if (xdl_classify_record(pass, cf, crec) < 0)
diff --git a/xdiff/xutils.c b/xdiff/xutils.c
index 7be063bfb6..77ee1ad9c8 100644
--- a/xdiff/xutils.c
+++ b/xdiff/xutils.c
@@ -249,11 +249,11 @@ int xdl_recmatch(const char *l1, long s1, const char *l2, long s2, long flags)
 	return 1;
 }
 
-unsigned long xdl_hash_record_with_whitespace(char const **data,
-		char const *top, long flags) {
-	unsigned long ha = 5381;
-	char const *ptr = *data;
-	int cr_at_eol_only = (flags & XDF_WHITESPACE_FLAGS) == XDF_IGNORE_CR_AT_EOL;
+uint64_t xdl_hash_record_with_whitespace(uint8_t const **data,
+		uint8_t const *top, uint64_t flags) {
+	uint64_t ha = 5381;
+	uint8_t const *ptr = *data;
+	bool cr_at_eol_only = (flags & XDF_WHITESPACE_FLAGS) == XDF_IGNORE_CR_AT_EOL;
 
 	for (; ptr < top && *ptr != '\n'; ptr++) {
 		if (cr_at_eol_only) {
@@ -263,8 +263,8 @@ unsigned long xdl_hash_record_with_whitespace(char const **data,
 				continue;
 		}
 		else if (XDL_ISSPACE(*ptr)) {
-			const char *ptr2 = ptr;
-			int at_eol;
+			const uint8_t *ptr2 = ptr;
+			bool at_eol;
 			while (ptr + 1 < top && XDL_ISSPACE(ptr[1])
 					&& ptr[1] != '\n')
 				ptr++;
@@ -274,20 +274,20 @@ unsigned long xdl_hash_record_with_whitespace(char const **data,
 			else if (flags & XDF_IGNORE_WHITESPACE_CHANGE
 				 && !at_eol) {
 				ha += (ha << 5);
-				ha ^= (unsigned long) ' ';
+				ha ^= (uint64_t) ' ';
 			}
 			else if (flags & XDF_IGNORE_WHITESPACE_AT_EOL
 				 && !at_eol) {
 				while (ptr2 != ptr + 1) {
 					ha += (ha << 5);
-					ha ^= (unsigned long) *ptr2;
+					ha ^= (uint64_t) *ptr2;
 					ptr2++;
 				}
 			}
 			continue;
 		}
 		ha += (ha << 5);
-		ha ^= (unsigned long) *ptr;
+		ha ^= (uint64_t) *ptr;
 	}
 	*data = ptr < top ? ptr + 1: ptr;
 
@@ -304,9 +304,9 @@ unsigned long xdl_hash_record_with_whitespace(char const **data,
 #define REASSOC_FENCE(x, y)
 #endif
 
-unsigned long xdl_hash_record_verbatim(char const **data, char const *top) {
-	unsigned long ha = 5381, c0, c1;
-	char const *ptr = *data;
+uint64_t xdl_hash_record_verbatim(uint8_t const **data, uint8_t const *top) {
+	uint64_t ha = 5381, c0, c1;
+	uint8_t const *ptr = *data;
 #if 0
 	/*
 	 * The baseline form of the optimized loop below. This is the djb2
@@ -314,7 +314,7 @@ unsigned long xdl_hash_record_verbatim(char const **data, char const *top) {
 	 */
 	for (; ptr < top && *ptr != '\n'; ptr++) {
 		ha += (ha << 5);
-		ha += (unsigned long) *ptr;
+		ha += (uint64_t) *ptr;
 	}
 	*data = ptr < top ? ptr + 1: ptr;
 #else
diff --git a/xdiff/xutils.h b/xdiff/xutils.h
index 13f6831047..615b4a9d35 100644
--- a/xdiff/xutils.h
+++ b/xdiff/xutils.h
@@ -34,9 +34,9 @@ void *xdl_cha_alloc(chastore_t *cha);
 long xdl_guess_lines(mmfile_t *mf, long sample);
 int xdl_blankline(const char *line, long size, long flags);
 int xdl_recmatch(const char *l1, long s1, const char *l2, long s2, long flags);
-unsigned long xdl_hash_record_verbatim(char const **data, char const *top);
-unsigned long xdl_hash_record_with_whitespace(char const **data, char const *top, long flags);
-static inline unsigned long xdl_hash_record(char const **data, char const *top, long flags)
+uint64_t xdl_hash_record_verbatim(uint8_t const **data, uint8_t const *top);
+uint64_t xdl_hash_record_with_whitespace(uint8_t const **data, uint8_t const *top, uint64_t flags);
+static inline uint64_t xdl_hash_record(uint8_t const **data, uint8_t const *top, uint64_t flags)
 {
 	if (flags & XDF_WHITESPACE_FLAGS)
 		return xdl_hash_record_with_whitespace(data, top, flags);
-- 
gitgitgadget



* Re: [PATCH 4/9] xdiff: use unambiguous types in xdl_hash_record()
  2025-10-15 21:18 ` [PATCH 4/9] xdiff: use unambiguous types in xdl_hash_record() Ezekiel Newren via GitGitGadget
@ 2025-10-21  8:33   ` Patrick Steinhardt
  2025-10-22 21:20     ` Ezekiel Newren
  0 siblings, 1 reply; 118+ messages in thread
From: Patrick Steinhardt @ 2025-10-21  8:33 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget; +Cc: git, Ezekiel Newren

On Wed, Oct 15, 2025 at 09:18:16PM +0000, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>

This should have a commit message explaining what exactly you're doing
here.

Patrick


* Re: [PATCH 4/9] xdiff: use unambiguous types in xdl_hash_record()
  2025-10-21  8:33   ` Patrick Steinhardt
@ 2025-10-22 21:20     ` Ezekiel Newren
  2025-10-23  5:49       ` Patrick Steinhardt
  0 siblings, 1 reply; 118+ messages in thread
From: Ezekiel Newren @ 2025-10-22 21:20 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Ezekiel Newren via GitGitGadget, git

On Tue, Oct 21, 2025 at 2:33 AM Patrick Steinhardt <ps@pks.im> wrote:
>
> On Wed, Oct 15, 2025 at 09:18:16PM +0000, Ezekiel Newren via GitGitGadget wrote:
> > From: Ezekiel Newren <ezekielnewren@gmail.com>
>
> This should have a commit message explaining what exactly you're doing
> here.

I thought I did have a commit message justifying my changes. Maybe it
got deleted through a rebase. How about a message like:

Convert the function signature and body to use unambiguous types. char
is changed to uint8_t because this function processes bytes in memory.
unsigned long is changed to uint64_t so that the hash output is
consistent across platforms. `flags` is changed from long to uint64_t
to ensure the high-order bits are not dropped on platforms that treat
long as 32 bits.
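
A small sketch of why the width matters for the hash, assuming a
simplified form of the shift-add-xor loop shown in the diff (no
whitespace handling, no newline detection). demo_hash64 and demo_hash32
are hypothetical names, with uint32_t standing in for a 32-bit
unsigned long:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified hash loop with a fixed 64-bit accumulator. */
static uint64_t demo_hash64(const uint8_t *p, size_t n)
{
	uint64_t ha = 5381;

	assert(p != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		ha += (ha << 5);
		ha ^= (uint64_t)p[i];
	}
	return ha;
}

/* Same loop with a 32-bit accumulator, as with a 32-bit unsigned long. */
static uint32_t demo_hash32(const uint8_t *p, size_t n)
{
	uint32_t ha = 5381;

	assert(p != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		ha += (ha << 5);
		ha ^= (uint32_t)p[i];
	}
	return ha;
}
```

The low 32 bits of both results agree, but the 64-bit accumulator keeps
high-order bits that a 32-bit unsigned long would discard, so the same
line can hash to different values on the two kinds of platform.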


* Re: [PATCH 4/9] xdiff: use unambiguous types in xdl_hash_record()
  2025-10-22 21:20     ` Ezekiel Newren
@ 2025-10-23  5:49       ` Patrick Steinhardt
  0 siblings, 0 replies; 118+ messages in thread
From: Patrick Steinhardt @ 2025-10-23  5:49 UTC (permalink / raw)
  To: Ezekiel Newren; +Cc: Ezekiel Newren via GitGitGadget, git

On Wed, Oct 22, 2025 at 03:20:32PM -0600, Ezekiel Newren wrote:
> On Tue, Oct 21, 2025 at 2:33 AM Patrick Steinhardt <ps@pks.im> wrote:
> >
> > On Wed, Oct 15, 2025 at 09:18:16PM +0000, Ezekiel Newren via GitGitGadget wrote:
> > > From: Ezekiel Newren <ezekielnewren@gmail.com>
> >
> > This should have a commit message explaining what exactly you're doing
> > here.
> 
> I thought I did have a commit message justifying my changes. Maybe it
> got deleted through a rebase. How about a message like:
> 
> Convert the function signature and body to use unambiguous types. char
> is changed to uint8_t because this function processes bytes in memory.
> unsigned long is changed to uint64_t so that the hash output is
> consistent across platforms. `flags` is changed from long to uint64_t
> to ensure the high-order bits are not dropped on platforms that treat
> long as 32 bits.

Works for me, I guess. Thanks!

Patrick


* [PATCH 5/9] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash
  2025-10-15 21:18 [PATCH 0/9] Xdiff cleanup part2 Ezekiel Newren via GitGitGadget
                   ` (3 preceding siblings ...)
  2025-10-15 21:18 ` [PATCH 4/9] xdiff: use unambiguous types in xdl_hash_record() Ezekiel Newren via GitGitGadget
@ 2025-10-15 21:18 ` Ezekiel Newren via GitGitGadget
  2025-10-20 23:29   ` Ezekiel Newren
  2025-10-15 21:18 ` [PATCH 6/9] xdiff: make xdfile_t.nrec a size_t instead of long Ezekiel Newren via GitGitGadget
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-10-15 21:18 UTC (permalink / raw)
  To: git; +Cc: Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

The ha field is serving two different purposes, which makes the code
harder to read. At first glance it looks like many places assume
there could never be hash collisions between lines of the two input
files. In reality, line_hash is used together with xdl_recmatch() to
ensure correct comparisons of lines, even when collisions occur.

To make this clearer, the old ha field has been split:
  * line_hash: The straightforward hash of a line, requiring no
    additional context.
  * minimal_perfect_hash: Not a new concept, but now a separate
    field. It comes from the classifier's general-purpose hash table,
    which assigns each line a unique and minimal hash across the two
    files.
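
The relationship between the two fields can be sketched as follows.
This is a simplified illustration, not the classifier in xprepare.c:
the names demo_line, demo_classifier and demo_classify are
hypothetical, a linear scan stands in for the real hash-bucket chains,
and a plain memcmp() stands in for xdl_recmatch().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_CLASSES 16

struct demo_line {
	const uint8_t *ptr;
	size_t size;
	uint64_t line_hash;          /* context-free hash of the bytes */
	size_t minimal_perfect_hash; /* dense id assigned by the classifier */
};

struct demo_classifier {
	struct demo_line seen[DEMO_MAX_CLASSES];
	size_t count;
};

static uint64_t demo_line_hash(const uint8_t *p, size_t n)
{
	uint64_t ha = 5381;

	for (size_t i = 0; i < n; i++) {
		ha += (ha << 5);
		ha ^= (uint64_t)p[i];
	}
	return ha;
}

/* Returns 0 on success, -1 if the (tiny, fixed-size) table is full. */
static int demo_classify(struct demo_classifier *cf, struct demo_line *line)
{
	size_t i;

	assert(cf != NULL && line != NULL && line->ptr != NULL);
	line->line_hash = demo_line_hash(line->ptr, line->size);

	/* Equal hashes alone are not trusted; the bytes are compared too. */
	for (i = 0; i < cf->count; i++) {
		const struct demo_line *c = &cf->seen[i];
		if (c->line_hash == line->line_hash &&
		    c->size == line->size &&
		    !memcmp(c->ptr, line->ptr, line->size))
			break;
	}
	if (i == cf->count && cf->count == DEMO_MAX_CLASSES)
		return -1;

	/* Known content reuses its id; new content gets the next one. */
	line->minimal_perfect_hash = i;
	if (i == cf->count)
		cf->seen[cf->count++] = *line;
	return 0;
}
```

Once every line has been classified this way, two lines have equal
content exactly when their minimal_perfect_hash values are equal, which
is why later stages can compare that field alone.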

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xdiffi.c     |  6 +++---
 xdiff/xhistogram.c |  4 ++--
 xdiff/xpatience.c  | 10 +++++-----
 xdiff/xprepare.c   | 16 ++++++++--------
 xdiff/xtypes.h     |  3 ++-
 5 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index edd05466df..436c34697d 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -22,9 +22,9 @@
 
 #include "xinclude.h"
 
-static unsigned long get_hash(xdfile_t *xdf, long index)
+static size_t get_hash(xdfile_t *xdf, long index)
 {
-	return xdf->recs[xdf->rindex[index]].ha;
+	return xdf->recs[xdf->rindex[index]].minimal_perfect_hash;
 }
 
 #define XDL_MAX_COST_MIN 256
@@ -385,7 +385,7 @@ static xdchange_t *xdl_add_change(xdchange_t *xscr, long i1, long i2, long chg1,
 
 static int recs_match(xrecord_t *rec1, xrecord_t *rec2)
 {
-	return (rec1->ha == rec2->ha);
+	return rec1->minimal_perfect_hash == rec2->minimal_perfect_hash;
 }
 
 /*
diff --git a/xdiff/xhistogram.c b/xdiff/xhistogram.c
index 6dc450b1fe..5ae1282c27 100644
--- a/xdiff/xhistogram.c
+++ b/xdiff/xhistogram.c
@@ -90,7 +90,7 @@ struct region {
 
 static int cmp_recs(xrecord_t *r1, xrecord_t *r2)
 {
-	return r1->ha == r2->ha;
+	return r1->minimal_perfect_hash == r2->minimal_perfect_hash;
 
 }
 
@@ -98,7 +98,7 @@ static int cmp_recs(xrecord_t *r1, xrecord_t *r2)
 	(cmp_recs(REC(i->env, s1, l1), REC(i->env, s2, l2)))
 
 #define TABLE_HASH(index, side, line) \
-	XDL_HASHLONG((REC(index->env, side, line))->ha, index->table_bits)
+	XDL_HASHLONG((REC(index->env, side, line))->minimal_perfect_hash, index->table_bits)
 
 static int scanA(struct histindex *index, int line1, int count1)
 {
diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
index bb61354f22..cc53266f3b 100644
--- a/xdiff/xpatience.c
+++ b/xdiff/xpatience.c
@@ -48,7 +48,7 @@
 struct hashmap {
 	int nr, alloc;
 	struct entry {
-		unsigned long hash;
+		size_t minimal_perfect_hash;
 		/*
 		 * 0 = unused entry, 1 = first line, 2 = second, etc.
 		 * line2 is NON_UNIQUE if the line is not unique
@@ -101,10 +101,10 @@ static void insert_record(xpparam_t const *xpp, int line, struct hashmap *map,
 	 * So we multiply ha by 2 in the hope that the hashing was
 	 * "unique enough".
 	 */
-	int index = (int)((record->ha << 1) % map->alloc);
+	int index = (int)((record->minimal_perfect_hash << 1) % map->alloc);
 
 	while (map->entries[index].line1) {
-		if (map->entries[index].hash != record->ha) {
+		if (map->entries[index].minimal_perfect_hash != record->minimal_perfect_hash) {
 			if (++index >= map->alloc)
 				index = 0;
 			continue;
@@ -120,7 +120,7 @@ static void insert_record(xpparam_t const *xpp, int line, struct hashmap *map,
 	if (pass == 2)
 		return;
 	map->entries[index].line1 = line;
-	map->entries[index].hash = record->ha;
+	map->entries[index].minimal_perfect_hash = record->minimal_perfect_hash;
 	map->entries[index].anchor = is_anchor(xpp, (const char *)map->env->xdf1.recs[line - 1].ptr);
 	if (!map->first)
 		map->first = map->entries + index;
@@ -248,7 +248,7 @@ static int match(struct hashmap *map, int line1, int line2)
 {
 	xrecord_t *record1 = &map->env->xdf1.recs[line1 - 1];
 	xrecord_t *record2 = &map->env->xdf2.recs[line2 - 1];
-	return record1->ha == record2->ha;
+	return record1->minimal_perfect_hash == record2->minimal_perfect_hash;
 }
 
 static int patience_diff(xpparam_t const *xpp, xdfenv_t *env,
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 85e56021da..16236bd045 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -96,9 +96,9 @@ static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
 	long hi;
 	xdlclass_t *rcrec;
 
-	hi = (long) XDL_HASHLONG(rec->ha, cf->hbits);
+	hi = (long) XDL_HASHLONG(rec->line_hash, cf->hbits);
 	for (rcrec = cf->rchash[hi]; rcrec; rcrec = rcrec->next)
-		if (rcrec->rec.ha == rec->ha &&
+		if (rcrec->rec.line_hash == rec->line_hash &&
 				xdl_recmatch((const char *)rcrec->rec.ptr, (long)rcrec->rec.size,
 					(const char *)rec->ptr, (long)rec->size, cf->flags))
 			break;
@@ -120,7 +120,7 @@ static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
 
 	(pass == 1) ? rcrec->len1++ : rcrec->len2++;
 
-	rec->ha = (unsigned long) rcrec->idx;
+	rec->minimal_perfect_hash = (size_t)rcrec->idx;
 
 	return 0;
 }
@@ -158,7 +158,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 			crec = &xdf->recs[xdf->nrec++];
 			crec->ptr = prev;
 			crec->size = cur - prev;
-			crec->ha = hav;
+			crec->line_hash = hav;
 			if (xdl_classify_record(pass, cf, crec) < 0)
 				goto abort;
 		}
@@ -290,7 +290,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	if ((mlim = xdl_bogosqrt(xdf1->nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
 	for (i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart]; i <= xdf1->dend; i++, recs++) {
-		rcrec = cf->rcrecs[recs->ha];
+		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
 		nm = rcrec ? rcrec->len2 : 0;
 		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
 	}
@@ -298,7 +298,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	if ((mlim = xdl_bogosqrt(xdf2->nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
 	for (i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart]; i <= xdf2->dend; i++, recs++) {
-		rcrec = cf->rcrecs[recs->ha];
+		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
 		nm = rcrec ? rcrec->len1 : 0;
 		action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
 	}
@@ -350,7 +350,7 @@ static int xdl_trim_ends(xdfile_t *xdf1, xdfile_t *xdf2) {
 	recs2 = xdf2->recs;
 	for (i = 0, lim = XDL_MIN(xdf1->nrec, xdf2->nrec); i < lim;
 	     i++, recs1++, recs2++)
-		if (recs1->ha != recs2->ha)
+		if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
 			break;
 
 	xdf1->dstart = xdf2->dstart = i;
@@ -358,7 +358,7 @@ static int xdl_trim_ends(xdfile_t *xdf1, xdfile_t *xdf2) {
 	recs1 = xdf1->recs + xdf1->nrec - 1;
 	recs2 = xdf2->recs + xdf2->nrec - 1;
 	for (lim -= i, i = 0; i < lim; i++, recs1--, recs2--)
-		if (recs1->ha != recs2->ha)
+		if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
 			break;
 
 	xdf1->dend = xdf1->nrec - i - 1;
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index 00d2d8c8cd..a57a8c2c12 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -41,7 +41,8 @@ typedef struct s_chastore {
 typedef struct s_xrecord {
 	uint8_t const *ptr;
 	size_t size;
-	unsigned long ha;
+	uint64_t line_hash;
+	size_t minimal_perfect_hash;
 } xrecord_t;
 
 typedef struct s_xdfile {
-- 
gitgitgadget


* Re: [PATCH 5/9] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash
  2025-10-15 21:18 ` [PATCH 5/9] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash Ezekiel Newren via GitGitGadget
@ 2025-10-20 23:29   ` Ezekiel Newren
  2025-10-21  5:10     ` Junio C Hamano
                       ` (2 more replies)
  0 siblings, 3 replies; 118+ messages in thread
From: Ezekiel Newren @ 2025-10-20 23:29 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget; +Cc: git

On Wed, Oct 15, 2025 at 3:18 PM Ezekiel Newren via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Ezekiel Newren <ezekielnewren@gmail.com>
>
> The ha field is serving two different purposes, which makes the code
> harder to read. At first glance it looks like many places assume
> there could never be hash collisions between lines of the two input
> files. In reality, line_hash is used together with xdl_recmatch() to
> ensure correct comparisons of lines, even when collisions occur.
>
> To make this clearer, the old ha field has been split:
>   * line_hash: The straightforward hash of a line, requiring no
>     additional context.
>   * minimal_perfect_hash: Not a new concept, but now a separate
>     field. It comes from the classifier's general-purpose hash table,
>     which assigns each line a unique and minimal hash across the two
>     files.
>
> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>

I'm a bit surprised that nobody has commented on this patch. I thought
that someone would have criticized the length of the name
"minimal_perfect_hash" or asked me why I was splitting one field into
two.

I don't see any reason why this patch series shouldn't move forward.

* Re: [PATCH 5/9] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash
  2025-10-20 23:29   ` Ezekiel Newren
@ 2025-10-21  5:10     ` Junio C Hamano
  2025-10-21  8:33     ` Patrick Steinhardt
  2025-10-21 10:03     ` Phillip Wood
  2 siblings, 0 replies; 118+ messages in thread
From: Junio C Hamano @ 2025-10-21  5:10 UTC (permalink / raw)
  To: Ezekiel Newren; +Cc: Ezekiel Newren via GitGitGadget, git

Ezekiel Newren <ezekielnewren@gmail.com> writes:

> I'm a bit surprised that nobody has commented on this patch. I thought
> that someone would have criticized the length of the name
> "minimal_perfect_hash" or asked me why I was splitting one field into
> two.

Sometimes there aren't enough round tuits to go around, and when
people have been too busy to review it, we see no comment, either
positive ones or negative ones.

> I don't see any reason why this patch series shouldn't move forward.

A patch series needs a positive reason to move forward;
unfortunately we cannot tell much from lack of negative comments.

* Re: [PATCH 5/9] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash
  2025-10-20 23:29   ` Ezekiel Newren
  2025-10-21  5:10     ` Junio C Hamano
@ 2025-10-21  8:33     ` Patrick Steinhardt
  2025-10-21 10:03     ` Phillip Wood
  2 siblings, 0 replies; 118+ messages in thread
From: Patrick Steinhardt @ 2025-10-21  8:33 UTC (permalink / raw)
  To: Ezekiel Newren; +Cc: Ezekiel Newren via GitGitGadget, git

On Mon, Oct 20, 2025 at 05:29:25PM -0600, Ezekiel Newren wrote:
> On Wed, Oct 15, 2025 at 3:18 PM Ezekiel Newren via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
> >
> > From: Ezekiel Newren <ezekielnewren@gmail.com>
> >
> > The ha field is serving two different purposes, which makes the code
> > harder to read. At first glance it looks like many places assume
> > there could never be hash collisions between lines of the two input
> > files. In reality, line_hash is used together with xdl_recmatch() to
> > ensure correct comparisons of lines, even when collisions occur.
> >
> > To make this clearer, the old ha field has been split:
> >   * line_hash: The straightforward hash of a line, requiring no
> >     additional context.
> >   * minimal_perfect_hash: Not a new concept, but now a separate
> >     field. It comes from the classifier's general-purpose hash table,
> >     which assigns each line a unique and minimal hash across the two
> >     files.
> >
> > Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
> 
> I'm a bit surprised that nobody has commented on this patch. I thought
> that someone would have criticized the length of the name
> "minimal_perfect_hash" or asked me why I was splitting one field into
> two.

I actually appreciate the longer name. I'm not a fan of abbreviations
that are hard to understand myself. Sure, they are easier to type, but
in many cases they end up making the code way harder to understand if
you are not deeply familiar with it. There are of course exceptions to
this, but I don't really think that your patch falls into them.

Patrick

* Re: [PATCH 5/9] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash
  2025-10-20 23:29   ` Ezekiel Newren
  2025-10-21  5:10     ` Junio C Hamano
  2025-10-21  8:33     ` Patrick Steinhardt
@ 2025-10-21 10:03     ` Phillip Wood
  2025-10-21 11:16       ` Chris Torek
  2025-10-22 21:31       ` Ezekiel Newren
  2 siblings, 2 replies; 118+ messages in thread
From: Phillip Wood @ 2025-10-21 10:03 UTC (permalink / raw)
  To: Ezekiel Newren, Ezekiel Newren via GitGitGadget
  Cc: git, Patrick Steinhardt, Junio C Hamano

Hi Ezekiel

On 21/10/2025 00:29, Ezekiel Newren wrote:
> On Wed, Oct 15, 2025 at 3:18 PM Ezekiel Newren via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> From: Ezekiel Newren <ezekielnewren@gmail.com>
>>
>> The ha field is serving two different purposes, which makes the code
>> harder to read. At first glance it looks like many places assume
>> there could never be hash collisions between lines of the two input
>> files. In reality, line_hash is used together with xdl_recmatch() to
>> ensure correct comparisons of lines, even when collisions occur.
>>
>> To make this clearer, the old ha field has been split:
>>    * line_hash: The straightforward hash of a line, requiring no
>>      additional context.
>>    * minimal_perfect_hash: Not a new concept, but now a separate
>>      field. It comes from the classifier's general-purpose hash table,
>>      which assigns each line a unique and minimal hash across the two
>>      files.
>>
>> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
> 
> I'm a bit surprised that nobody has commented on this patch.

I've been off the list and I haven't caught up with this series yet.

> I thought
> that someone would have criticized the length of the name
> "minimal_perfect_hash" or asked me why I was splitting one field into
> two.

I think "perfect_hash" would be fine if we want a shorter name. More 
importantly it would be helpful to explain why the two fields have 
different types. I assume it is because the perfect_hash is used as an 
array index and therefore size_t is a better match for rust's usize than 
uint64_t. How much more memory do we end up using by adding a second hash 
member to the struct? If the aim is to show that only one of them is 
used at a time then a union might be more appropriate but I doubt that 
plays well with rust.

I'll try and have a look at the other patches later this week. I think 
the type changes are going to need careful review.

Thanks

Phillip

* Re: [PATCH 5/9] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash
  2025-10-21 10:03     ` Phillip Wood
@ 2025-10-21 11:16       ` Chris Torek
  2025-10-22 21:31       ` Ezekiel Newren
  1 sibling, 0 replies; 118+ messages in thread
From: Chris Torek @ 2025-10-21 11:16 UTC (permalink / raw)
  To: phillip.wood
  Cc: Ezekiel Newren, Ezekiel Newren via GitGitGadget, git,
	Patrick Steinhardt, Junio C Hamano

On Tue, Oct 21, 2025 at 3:04 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
...
> uint64_t. How much more memory do we end up using by adding a second hash
> member to the struct?

As in any string-to-string algorithm of this sort, there's one per "symbol",
but in this case a "symbol" is a line in a file. So if files are M and N lines
long, there are M+N symbols. Take the difference of the size of the two
records and multiply by this.

Assuming "sane" input file sizes (under a million lines each) it's a few
megabytes maximum...

Chris

* Re: [PATCH 5/9] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash
  2025-10-21 10:03     ` Phillip Wood
  2025-10-21 11:16       ` Chris Torek
@ 2025-10-22 21:31       ` Ezekiel Newren
  1 sibling, 0 replies; 118+ messages in thread
From: Ezekiel Newren @ 2025-10-22 21:31 UTC (permalink / raw)
  To: phillip.wood
  Cc: Ezekiel Newren via GitGitGadget, git, Patrick Steinhardt,
	Junio C Hamano

On Tue, Oct 21, 2025 at 4:03 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>
> Hi Ezekiel
>
> On 21/10/2025 00:29, Ezekiel Newren wrote:
> > On Wed, Oct 15, 2025 at 3:18 PM Ezekiel Newren via GitGitGadget
> > <gitgitgadget@gmail.com> wrote:
> >>
> >> From: Ezekiel Newren <ezekielnewren@gmail.com>
> >>
> >> The ha field is serving two different purposes, which makes the code
> >> harder to read. At first glance it looks like many places assume
> >> there could never be hash collisions between lines of the two input
> >> files. In reality, line_hash is used together with xdl_recmatch() to
> >> ensure correct comparisons of lines, even when collisions occur.
> >>
> >> To make this clearer, the old ha field has been split:
> >>    * line_hash: The straightforward hash of a line, requiring no
> >>      additional context.
> >>    * minimal_perfect_hash: Not a new concept, but now a separate
> >>      field. It comes from the classifier's general-purpose hash table,
> >>      which assigns each line a unique and minimal hash across the two
> >>      files.
> >>
> >> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
> >
> > I'm a bit surprised that nobody has commented on this patch.
>
> I've been off the list and I haven't caught up with this series yet.
>
> > I thought
> > that someone would have criticized the length of the name
> > "minimal_perfect_hash" or asked me why I was splitting one field into
> > two.
>
> I think "perfect_hash" would be fine if we want a shorter name. More
> importantly it would be helpful to explain why the two fields have
> different types. I assume it is because the perfect_hash is used as an
> array index and therefore size_t is a better match for rust's usize than
> uint64_t.

Your understanding is correct. line_hash is fixed width while
minimal_perfect_hash is meant to be used as an array index into
memory. I'll update my commit message to make this more clear.

> How much more memory do we end up using by adding a second hash
> member to the struct? If the aim is to show that only one of them is
> used at a time then a union might be more appropriate but I doubt that
> plays well with rust.

xrecord_t used to be defined with a pointer, so we're at the same
size. But more importantly I plan on splitting minimal_perfect_hash
out of xrecord_t into its own array. I think the diff algorithms end
up being a little bit faster with a separate array because each
element is only 8 bytes instead of 32.

In v2.51.0:
typedef struct s_xrecord {
       struct s_xrecord *next;
       char const *ptr;
       long size;
       unsigned long ha;
} xrecord_t;

This patch series:
typedef struct s_xrecord {
       uint8_t const *ptr;
       size_t size;
       uint64_t line_hash;
       size_t minimal_perfect_hash;
} xrecord_t;

> I'll try and have a look at the other patches later this week. I think
> the type changes are going to need careful review.

I appreciate the careful review. I figured it would be best to limit
the scope of this patch series to type changes, so that it wasn't
bogged down by other stuff.

* [PATCH 6/9] xdiff: make xdfile_t.nrec a size_t instead of long
  2025-10-15 21:18 [PATCH 0/9] Xdiff cleanup part2 Ezekiel Newren via GitGitGadget
                   ` (4 preceding siblings ...)
  2025-10-15 21:18 ` [PATCH 5/9] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash Ezekiel Newren via GitGitGadget
@ 2025-10-15 21:18 ` Ezekiel Newren via GitGitGadget
  2025-10-15 21:18 ` [PATCH 7/9] xdiff: make xdfile_t.nreff " Ezekiel Newren via GitGitGadget
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-10-15 21:18 UTC (permalink / raw)
  To: git; +Cc: Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

size_t is used because nrec describes the number of elements in memory
for recs, and the number of elements in memory for 'changed' + 2.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xdiffi.c    |  8 ++++----
 xdiff/xemit.c     | 20 ++++++++++----------
 xdiff/xmerge.c    |  8 ++++----
 xdiff/xpatience.c |  2 +-
 xdiff/xprepare.c  | 12 ++++++------
 xdiff/xtypes.h    |  2 +-
 6 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 436c34697d..759193fe5d 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -483,7 +483,7 @@ static void measure_split(const xdfile_t *xdf, long split,
 {
 	long i;
 
-	if (split >= xdf->nrec) {
+	if (split >= (long)xdf->nrec) {
 		m->end_of_file = 1;
 		m->indent = -1;
 	} else {
@@ -506,7 +506,7 @@ static void measure_split(const xdfile_t *xdf, long split,
 
 	m->post_blank = 0;
 	m->post_indent = -1;
-	for (i = split + 1; i < xdf->nrec; i++) {
+	for (i = split + 1; i < (long)xdf->nrec; i++) {
 		m->post_indent = get_indent(&xdf->recs[i]);
 		if (m->post_indent != -1)
 			break;
@@ -717,7 +717,7 @@ static void group_init(xdfile_t *xdf, struct xdlgroup *g)
  */
 static inline int group_next(xdfile_t *xdf, struct xdlgroup *g)
 {
-	if (g->end == xdf->nrec)
+	if (g->end == (long)xdf->nrec)
 		return -1;
 
 	g->start = g->end + 1;
@@ -750,7 +750,7 @@ static inline int group_previous(xdfile_t *xdf, struct xdlgroup *g)
  */
 static int group_slide_down(xdfile_t *xdf, struct xdlgroup *g)
 {
-	if (g->end < xdf->nrec &&
+	if (g->end < (long)xdf->nrec &&
 	    recs_match(&xdf->recs[g->start], &xdf->recs[g->end])) {
 		xdf->changed[g->start++] = false;
 		xdf->changed[g->end++] = true;
diff --git a/xdiff/xemit.c b/xdiff/xemit.c
index 2f8007753c..04f7e9193b 100644
--- a/xdiff/xemit.c
+++ b/xdiff/xemit.c
@@ -137,7 +137,7 @@ static long get_func_line(xdfenv_t *xe, xdemitconf_t const *xecfg,
 	buf = func_line ? func_line->buf : dummy;
 	size = func_line ? sizeof(func_line->buf) : sizeof(dummy);
 
-	for (l = start; l != limit && 0 <= l && l < xe->xdf1.nrec; l += step) {
+	for (l = start; l != limit && 0 <= l && l < (long)xe->xdf1.nrec; l += step) {
 		long len = match_func_rec(&xe->xdf1, xecfg, l, buf, size);
 		if (len >= 0) {
 			if (func_line)
@@ -179,14 +179,14 @@ pre_context_calculation:
 			long fs1, i1 = xch->i1;
 
 			/* Appended chunk? */
-			if (i1 >= xe->xdf1.nrec) {
+			if (i1 >= (long)xe->xdf1.nrec) {
 				long i2 = xch->i2;
 
 				/*
 				 * We don't need additional context if
 				 * a whole function was added.
 				 */
-				while (i2 < xe->xdf2.nrec) {
+				while (i2 < (long)xe->xdf2.nrec) {
 					if (is_func_rec(&xe->xdf2, xecfg, i2))
 						goto post_context_calculation;
 					i2++;
@@ -196,7 +196,7 @@ pre_context_calculation:
 				 * Otherwise get more context from the
 				 * pre-image.
 				 */
-				i1 = xe->xdf1.nrec - 1;
+				i1 = (long)xe->xdf1.nrec - 1;
 			}
 
 			fs1 = get_func_line(xe, xecfg, NULL, i1, -1);
@@ -228,8 +228,8 @@ pre_context_calculation:
 
  post_context_calculation:
 		lctx = xecfg->ctxlen;
-		lctx = XDL_MIN(lctx, xe->xdf1.nrec - (xche->i1 + xche->chg1));
-		lctx = XDL_MIN(lctx, xe->xdf2.nrec - (xche->i2 + xche->chg2));
+		lctx = XDL_MIN(lctx, (long)xe->xdf1.nrec - (xche->i1 + xche->chg1));
+		lctx = XDL_MIN(lctx, (long)xe->xdf2.nrec - (xche->i2 + xche->chg2));
 
 		e1 = xche->i1 + xche->chg1 + lctx;
 		e2 = xche->i2 + xche->chg2 + lctx;
@@ -237,13 +237,13 @@ pre_context_calculation:
 		if (xecfg->flags & XDL_EMIT_FUNCCONTEXT) {
 			long fe1 = get_func_line(xe, xecfg, NULL,
 						 xche->i1 + xche->chg1,
-						 xe->xdf1.nrec);
+						 (long)xe->xdf1.nrec);
 			while (fe1 > 0 && is_empty_rec(&xe->xdf1, fe1 - 1))
 				fe1--;
 			if (fe1 < 0)
-				fe1 = xe->xdf1.nrec;
+				fe1 = (long)xe->xdf1.nrec;
 			if (fe1 > e1) {
-				e2 = XDL_MIN(e2 + (fe1 - e1), xe->xdf2.nrec);
+				e2 = XDL_MIN(e2 + (fe1 - e1), (long)xe->xdf2.nrec);
 				e1 = fe1;
 			}
 
@@ -254,7 +254,7 @@ pre_context_calculation:
 			 */
 			if (xche->next) {
 				long l = XDL_MIN(xche->next->i1,
-						 xe->xdf1.nrec - 1);
+						 (long)xe->xdf1.nrec - 1);
 				if (l - xecfg->ctxlen <= e1 ||
 				    get_func_line(xe, xecfg, NULL, l, e1) < 0) {
 					xche = xche->next;
diff --git a/xdiff/xmerge.c b/xdiff/xmerge.c
index 0dd4558a32..29dad98c49 100644
--- a/xdiff/xmerge.c
+++ b/xdiff/xmerge.c
@@ -158,7 +158,7 @@ static int is_eol_crlf(xdfile_t *file, int i)
 {
 	size_t size;
 
-	if (i < file->nrec - 1)
+	if (i < (long)file->nrec - 1)
 		/* All lines before the last *must* end in LF */
 		return (size = file->recs[i].size) > 1 &&
 			file->recs[i].ptr[size - 2] == '\r';
@@ -317,7 +317,7 @@ static int xdl_fill_merge_buffer(xdfenv_t *xe1, const char *name1,
 			continue;
 		i = m->i1 + m->chg1;
 	}
-	size += xdl_recs_copy(xe1, i, xe1->xdf2.nrec - i, 0, 0,
+	size += xdl_recs_copy(xe1, i, (int)xe1->xdf2.nrec - i, 0, 0,
 			      dest ? dest + size : NULL);
 	return size;
 }
@@ -622,7 +622,7 @@ static int xdl_do_merge(xdfenv_t *xe1, xdchange_t *xscr1,
 			changes = c;
 		i0 = xscr1->i1;
 		i1 = xscr1->i2;
-		i2 = xscr1->i1 + xe2->xdf2.nrec - xe2->xdf1.nrec;
+		i2 = xscr1->i1 + (long)xe2->xdf2.nrec - (long)xe2->xdf1.nrec;
 		chg0 = xscr1->chg1;
 		chg1 = xscr1->chg2;
 		chg2 = xscr1->chg1;
@@ -637,7 +637,7 @@ static int xdl_do_merge(xdfenv_t *xe1, xdchange_t *xscr1,
 		if (!changes)
 			changes = c;
 		i0 = xscr2->i1;
-		i1 = xscr2->i1 + xe1->xdf2.nrec - xe1->xdf1.nrec;
+		i1 = xscr2->i1 + (long)xe1->xdf2.nrec - (long)xe1->xdf1.nrec;
 		i2 = xscr2->i2;
 		chg0 = xscr2->chg1;
 		chg1 = xscr2->chg1;
diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
index cc53266f3b..a0b31eb5d8 100644
--- a/xdiff/xpatience.c
+++ b/xdiff/xpatience.c
@@ -370,5 +370,5 @@ static int patience_diff(xpparam_t const *xpp, xdfenv_t *env,
 
 int xdl_do_patience_diff(xpparam_t const *xpp, xdfenv_t *env)
 {
-	return patience_diff(xpp, env, 1, env->xdf1.nrec, 1, env->xdf2.nrec);
+	return patience_diff(xpp, env, 1, (int)env->xdf1.nrec, 1, (int)env->xdf2.nrec);
 }
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 16236bd045..4ee9fb60cd 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -153,7 +153,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 		for (top = blk + bsize; cur < top; ) {
 			prev = cur;
 			hav = xdl_hash_record(&cur, top, xpp->flags);
-			if (XDL_ALLOC_GROW(xdf->recs, xdf->nrec + 1, narec))
+			if (XDL_ALLOC_GROW(xdf->recs, (long)xdf->nrec + 1, narec))
 				goto abort;
 			crec = &xdf->recs[xdf->nrec++];
 			crec->ptr = prev;
@@ -287,7 +287,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	/*
 	 * Initialize temporary arrays with DISCARD, KEEP, or INVESTIGATE.
 	 */
-	if ((mlim = xdl_bogosqrt(xdf1->nrec)) > XDL_MAX_EQLIMIT)
+	if ((mlim = xdl_bogosqrt((long)xdf1->nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
 	for (i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart]; i <= xdf1->dend; i++, recs++) {
 		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
@@ -295,7 +295,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
 	}
 
-	if ((mlim = xdl_bogosqrt(xdf2->nrec)) > XDL_MAX_EQLIMIT)
+	if ((mlim = xdl_bogosqrt((long)xdf2->nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
 	for (i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart]; i <= xdf2->dend; i++, recs++) {
 		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
@@ -348,7 +348,7 @@ static int xdl_trim_ends(xdfile_t *xdf1, xdfile_t *xdf2) {
 
 	recs1 = xdf1->recs;
 	recs2 = xdf2->recs;
-	for (i = 0, lim = XDL_MIN(xdf1->nrec, xdf2->nrec); i < lim;
+	for (i = 0, lim = (long)XDL_MIN(xdf1->nrec, xdf2->nrec); i < lim;
 	     i++, recs1++, recs2++)
 		if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
 			break;
@@ -361,8 +361,8 @@ static int xdl_trim_ends(xdfile_t *xdf1, xdfile_t *xdf2) {
 		if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
 			break;
 
-	xdf1->dend = xdf1->nrec - i - 1;
-	xdf2->dend = xdf2->nrec - i - 1;
+	xdf1->dend = (long)xdf1->nrec - i - 1;
+	xdf2->dend = (long)xdf2->nrec - i - 1;
 
 	return 0;
 }
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index a57a8c2c12..179ae2ae89 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -47,7 +47,7 @@ typedef struct s_xrecord {
 
 typedef struct s_xdfile {
 	xrecord_t *recs;
-	long nrec;
+	size_t nrec;
 	bool *changed;
 	long *rindex;
 	long nreff;
-- 
gitgitgadget


* [PATCH 7/9] xdiff: make xdfile_t.nreff a size_t instead of long
  2025-10-15 21:18 [PATCH 0/9] Xdiff cleanup part2 Ezekiel Newren via GitGitGadget
                   ` (5 preceding siblings ...)
  2025-10-15 21:18 ` [PATCH 6/9] xdiff: make xdfile_t.nrec a size_t instead of long Ezekiel Newren via GitGitGadget
@ 2025-10-15 21:18 ` Ezekiel Newren via GitGitGadget
  2025-10-15 21:18 ` [PATCH 8/9] xdiff: change rindex from long to size_t in xdfile_t Ezekiel Newren via GitGitGadget
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-10-15 21:18 UTC (permalink / raw)
  To: git; +Cc: Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

size_t is used because nreff describes the number of elements in memory
for rindex.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xprepare.c | 14 +++++++-------
 xdiff/xtypes.h   |  2 +-
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 4ee9fb60cd..c690bafeb1 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -264,7 +264,7 @@ static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
  * might be potentially discarded if they appear in a run of discardable.
  */
 static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xdf2) {
-	long i, nm, nreff, mlim;
+	long i, nm, mlim;
 	xrecord_t *recs;
 	xdlclass_t *rcrec;
 	uint8_t *action1 = NULL, *action2 = NULL;
@@ -307,29 +307,29 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	 * Use temporary arrays to decide if changed[i] should remain
 	 * false, or become true.
 	 */
-	for (nreff = 0, i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart];
+	xdf1->nreff = 0;
+	for (i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart];
 	     i <= xdf1->dend; i++, recs++) {
 		if (action1[i] == KEEP ||
 		    (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xdf1->dstart, xdf1->dend))) {
-			xdf1->rindex[nreff++] = i;
+			xdf1->rindex[xdf1->nreff++] = i;
 			/* changed[i] remains false, i.e. keep */
 		} else
 			xdf1->changed[i] = true;
 			/* i.e. discard */
 	}
-	xdf1->nreff = nreff;
 
-	for (nreff = 0, i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart];
+	xdf2->nreff = 0;
+	for (i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart];
 	     i <= xdf2->dend; i++, recs++) {
 		if (action2[i] == KEEP ||
 		    (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xdf2->dstart, xdf2->dend))) {
-			xdf2->rindex[nreff++] = i;
+			xdf2->rindex[xdf2->nreff++] = i;
 			/* changed[i] remains false, i.e. keep */
 		} else
 			xdf2->changed[i] = true;
 			/* i.e. discard */
 	}
-	xdf2->nreff = nreff;
 
 cleanup:
 	xdl_free(action1);
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index 179ae2ae89..e9473bfd45 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -50,7 +50,7 @@ typedef struct s_xdfile {
 	size_t nrec;
 	bool *changed;
 	long *rindex;
-	long nreff;
+	size_t nreff;
 	ssize_t dstart, dend;
 } xdfile_t;
 
-- 
gitgitgadget


* [PATCH 8/9] xdiff: change rindex from long to size_t in xdfile_t
  2025-10-15 21:18 [PATCH 0/9] Xdiff cleanup part2 Ezekiel Newren via GitGitGadget
                   ` (6 preceding siblings ...)
  2025-10-15 21:18 ` [PATCH 7/9] xdiff: make xdfile_t.nreff " Ezekiel Newren via GitGitGadget
@ 2025-10-15 21:18 ` Ezekiel Newren via GitGitGadget
  2025-10-21  8:34   ` Patrick Steinhardt
  2025-10-15 21:18 ` [PATCH 9/9] xdiff: rename rindex -> reference_index Ezekiel Newren via GitGitGadget
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-10-15 21:18 UTC (permalink / raw)
  To: git; +Cc: Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

rindex describes an index offset which means it's an index into memory
which should use size_t. dstart and dend will be deleted in a future
patch series. Move them to the end to help avoid refactor conflicts.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xtypes.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index e9473bfd45..8016222de9 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -49,7 +49,7 @@ typedef struct s_xdfile {
 	xrecord_t *recs;
 	size_t nrec;
 	bool *changed;
-	long *rindex;
+	size_t *rindex;
 	size_t nreff;
 	ssize_t dstart, dend;
 } xdfile_t;
-- 
gitgitgadget


* Re: [PATCH 8/9] xdiff: change rindex from long to size_t in xdfile_t
  2025-10-15 21:18 ` [PATCH 8/9] xdiff: change rindex from long to size_t in xdfile_t Ezekiel Newren via GitGitGadget
@ 2025-10-21  8:34   ` Patrick Steinhardt
  2025-10-22 22:14     ` Ezekiel Newren
  0 siblings, 1 reply; 118+ messages in thread
From: Patrick Steinhardt @ 2025-10-21  8:34 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget; +Cc: git, Ezekiel Newren

On Wed, Oct 15, 2025 at 09:18:20PM +0000, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
> 
> rindex describes an index offset which means it's an index into memory
> which should use size_t. dstart and dend will be deleted in a future
> patch series. Move them to the end to help avoid refactor conflicts.

In a patch like this I would appreciate some explanation why we can
change the type without adapting any of its users. So basically explain
why this refactoring is safe to do and won't cause any issues.

Patrick

* Re: [PATCH 8/9] xdiff: change rindex from long to size_t in xdfile_t
  2025-10-21  8:34   ` Patrick Steinhardt
@ 2025-10-22 22:14     ` Ezekiel Newren
  2025-10-23  5:49       ` Patrick Steinhardt
  0 siblings, 1 reply; 118+ messages in thread
From: Ezekiel Newren @ 2025-10-22 22:14 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Ezekiel Newren via GitGitGadget, git

On Tue, Oct 21, 2025 at 2:34 AM Patrick Steinhardt <ps@pks.im> wrote:
>
> On Wed, Oct 15, 2025 at 09:18:20PM +0000, Ezekiel Newren via GitGitGadget wrote:
> > From: Ezekiel Newren <ezekielnewren@gmail.com>
> >
> > rindex describes an index offset which means it's an index into memory
> > which should use size_t. dstart and dend will be deleted in a future
> > patch series. Move them to the end to help avoid refactor conflicts.
>
> In a patch like this I would appreciate some explanation why we can
> change the type without adapting any of its users. So basically explain
> why this refactoring is safe to do and won't cause any issues.

The values of rindex are only used in 3 places: get_hash(), which was
created in [1], and 2 places in xdl_recs_cmp(). All of them use rindex
as an index into another array directly so there's no cascading
refactor impact. get_hash() was created precisely to reduce refactor
churn. How about a commit message like:

Changing the type of rindex from long to size_t has no cascading
refactor impact because it is only ever used to directly index other
arrays.

[1] create get_hash()
https://lore.kernel.org/git/637d1032abbd33b7673d3c101267816fbf1a343c.1758926520.git.gitgitgadget@gmail.com/

* Re: [PATCH 8/9] xdiff: change rindex from long to size_t in xdfile_t
  2025-10-22 22:14     ` Ezekiel Newren
@ 2025-10-23  5:49       ` Patrick Steinhardt
  0 siblings, 0 replies; 118+ messages in thread
From: Patrick Steinhardt @ 2025-10-23  5:49 UTC (permalink / raw)
  To: Ezekiel Newren; +Cc: Ezekiel Newren via GitGitGadget, git

On Wed, Oct 22, 2025 at 04:14:42PM -0600, Ezekiel Newren wrote:
> On Tue, Oct 21, 2025 at 2:34 AM Patrick Steinhardt <ps@pks.im> wrote:
> >
> > On Wed, Oct 15, 2025 at 09:18:20PM +0000, Ezekiel Newren via GitGitGadget wrote:
> > > From: Ezekiel Newren <ezekielnewren@gmail.com>
> > >
> > > rindex describes an index offset which means it's an index into memory
> > > which should use size_t. dstart and dend will be deleted in a future
> > > patch series. Move them to the end to help avoid refactor conflicts.
> >
> > In a patch like this I would appreciate some explanation why we can
> > change the type without adapting any of its users. So basically explain
> > why this refactoring is safe to do and won't cause any issues.
> 
> The values of rindex are only used in 3 places: get_hash(), which was
> created in [1], and 2 places in xdl_recs_cmp(). All of them use rindex
> as an index into another array directly so there's no cascading
> refactor impact. get_hash() was created precisely to reduce refactor
> churn. How about a commit message like:
> 
> Changing the type of rindex from long to size_t has no cascading
> refactor impact because it is only ever used to directly index other
> arrays.

Sounds good to me, thanks!

Patrick


* [PATCH 9/9] xdiff: rename rindex -> reference_index
  2025-10-15 21:18 [PATCH 0/9] Xdiff cleanup part2 Ezekiel Newren via GitGitGadget
                   ` (7 preceding siblings ...)
  2025-10-15 21:18 ` [PATCH 8/9] xdiff: change rindex from long to size_t in xdfile_t Ezekiel Newren via GitGitGadget
@ 2025-10-15 21:18 ` Ezekiel Newren via GitGitGadget
  2025-10-15 21:28 ` [PATCH 0/9] Xdiff cleanup part2 Junio C Hamano
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-10-15 21:18 UTC (permalink / raw)
  To: git; +Cc: Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

The classic diff adds only the lines that it is going to consider
during the diff to a compacted array. The mapping between that
compacted array and the lines of the file that its entries reference
is facilitated by this array.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xdiffi.c   |  6 +++---
 xdiff/xprepare.c | 10 +++++-----
 xdiff/xtypes.h   |  2 +-
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 759193fe5d..8eb664be3e 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -24,7 +24,7 @@
 
 static size_t get_hash(xdfile_t *xdf, long index)
 {
-	return xdf->recs[xdf->rindex[index]].minimal_perfect_hash;
+	return xdf->recs[xdf->reference_index[index]].minimal_perfect_hash;
 }
 
 #define XDL_MAX_COST_MIN 256
@@ -278,10 +278,10 @@ int xdl_recs_cmp(xdfile_t *xdf1, long off1, long lim1,
 	 */
 	if (off1 == lim1) {
 		for (; off2 < lim2; off2++)
-			xdf2->changed[xdf2->rindex[off2]] = true;
+			xdf2->changed[xdf2->reference_index[off2]] = true;
 	} else if (off2 == lim2) {
 		for (; off1 < lim1; off1++)
-			xdf1->changed[xdf1->rindex[off1]] = true;
+			xdf1->changed[xdf1->reference_index[off1]] = true;
 	} else {
 		xdpsplit_t spl;
 		spl.i1 = spl.i2 = 0;
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index c690bafeb1..1dd420a2ff 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -128,7 +128,7 @@ static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
 
 static void xdl_free_ctx(xdfile_t *xdf)
 {
-	xdl_free(xdf->rindex);
+	xdl_free(xdf->reference_index);
 	xdl_free(xdf->changed - 1);
 	xdl_free(xdf->recs);
 }
@@ -141,7 +141,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 	uint8_t const *blk, *cur, *top, *prev;
 	xrecord_t *crec;
 
-	xdf->rindex = NULL;
+	xdf->reference_index = NULL;
 	xdf->changed = NULL;
 	xdf->recs = NULL;
 
@@ -169,7 +169,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 
 	if ((XDF_DIFF_ALG(xpp->flags) != XDF_PATIENCE_DIFF) &&
 	    (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF)) {
-		if (!XDL_ALLOC_ARRAY(xdf->rindex, xdf->nrec + 1))
+		if (!XDL_ALLOC_ARRAY(xdf->reference_index, xdf->nrec + 1))
 			goto abort;
 	}
 
@@ -312,7 +312,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	     i <= xdf1->dend; i++, recs++) {
 		if (action1[i] == KEEP ||
 		    (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xdf1->dstart, xdf1->dend))) {
-			xdf1->rindex[xdf1->nreff++] = i;
+			xdf1->reference_index[xdf1->nreff++] = i;
 			/* changed[i] remains false, i.e. keep */
 		} else
 			xdf1->changed[i] = true;
@@ -324,7 +324,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	     i <= xdf2->dend; i++, recs++) {
 		if (action2[i] == KEEP ||
 		    (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xdf2->dstart, xdf2->dend))) {
-			xdf2->rindex[xdf2->nreff++] = i;
+			xdf2->reference_index[xdf2->nreff++] = i;
 			/* changed[i] remains false, i.e. keep */
 		} else
 			xdf2->changed[i] = true;
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index 8016222de9..373ccefa28 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -49,7 +49,7 @@ typedef struct s_xdfile {
 	xrecord_t *recs;
 	size_t nrec;
 	bool *changed;
-	size_t *rindex;
+	size_t *reference_index;
 	size_t nreff;
 	ssize_t dstart, dend;
 } xdfile_t;
-- 
gitgitgadget


* Re: [PATCH 0/9] Xdiff cleanup part2
  2025-10-15 21:18 [PATCH 0/9] Xdiff cleanup part2 Ezekiel Newren via GitGitGadget
                   ` (8 preceding siblings ...)
  2025-10-15 21:18 ` [PATCH 9/9] xdiff: rename rindex -> reference_index Ezekiel Newren via GitGitGadget
@ 2025-10-15 21:28 ` Junio C Hamano
  2025-10-21 13:28 ` Phillip Wood
  2025-10-29 22:19 ` [PATCH v2 00/10] " Ezekiel Newren via GitGitGadget
  11 siblings, 0 replies; 118+ messages in thread
From: Junio C Hamano @ 2025-10-15 21:28 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget; +Cc: git, Ezekiel Newren

"Ezekiel Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> The primary goal of this patch series is to convert every field's type in
> xrecord_t and xdfile_t to be unambiguous, in preparation to make it more
> Rust FFI friendly. Additionally the ha field in xrecord_t is split into
> line_hash and minimal_perfect_hash.
>
> The order of some of the fields has changed as called out by the commit
> messages.
>
> Before:
>
> typedef struct s_xrecord {
> 	char const *ptr;
> 	long size;
> 	unsigned long ha;
> } xrecord_t;
>
> typedef struct s_xdfile {
> 	xrecord_t *recs;
> 	long nrec;
> 	long dstart, dend;
> 	bool *changed;
> 	long *rindex;
> 	long nreff;
> } xdfile_t;
>
>
> After part 2
>
> typedef struct s_xrecord {
> 	uint8_t const *ptr;
> 	size_t size;
> 	uint64_t line_hash;
> 	size_t minimal_perfect_hash;
> } xrecord_t;
>
> typedef struct s_xdfile {
> 	xrecord_t *recs;
> 	size_t nrec;
> 	bool *changed;
> 	size_t *reference_index;
> 	size_t nreff;
> 	ssize_t dstart, dend;
> } xdfile_t;

Excellent summary.

>
>
> Ezekiel Newren (9):
>   xdiff: use ssize_t for dstart/dend, make them last in xdfile_t
>   xdiff: make xrecord_t.ptr a uint8_t instead of char
>   xdiff: use size_t for xrecord_t.size
>   xdiff: use unambiguous types in xdl_hash_record()
>   xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash
>   xdiff: make xdfile_t.nrec a size_t instead of long
>   xdiff: make xdfile_t.nreff a size_t instead of long
>   xdiff: change rindex from long to size_t in xdfile_t
>   xdiff: rename rindex -> reference_index
>
>  xdiff-interface.c  |  2 +-
>  xdiff/xdiffi.c     | 29 +++++++++++------------
>  xdiff/xemit.c      | 28 +++++++++++-----------
>  xdiff/xhistogram.c |  4 ++--
>  xdiff/xmerge.c     | 30 ++++++++++++------------
>  xdiff/xpatience.c  | 14 +++++------
>  xdiff/xprepare.c   | 58 +++++++++++++++++++++++-----------------------
>  xdiff/xtypes.h     | 15 ++++++------
>  xdiff/xutils.c     | 32 ++++++++++++-------------
>  xdiff/xutils.h     |  6 ++---
>  10 files changed, 109 insertions(+), 109 deletions(-)
>
>
> base-commit: 143f58ef7535f8f8a80d810768a18bdf3807de26
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-2070%2Fezekielnewren%2Fxdiff_cleanup_part2-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-2070/ezekielnewren/xdiff_cleanup_part2-v1
> Pull-Request: https://github.com/git/git/pull/2070


* Re: [PATCH 0/9] Xdiff cleanup part2
  2025-10-15 21:18 [PATCH 0/9] Xdiff cleanup part2 Ezekiel Newren via GitGitGadget
                   ` (9 preceding siblings ...)
  2025-10-15 21:28 ` [PATCH 0/9] Xdiff cleanup part2 Junio C Hamano
@ 2025-10-21 13:28 ` Phillip Wood
  2025-10-21 13:41   ` Junio C Hamano
  2025-10-29 22:19 ` [PATCH v2 00/10] " Ezekiel Newren via GitGitGadget
  11 siblings, 1 reply; 118+ messages in thread
From: Phillip Wood @ 2025-10-21 13:28 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget, git; +Cc: Ezekiel Newren, Patrick Steinhardt

Hi Ezekiel

On 15/10/2025 22:18, Ezekiel Newren via GitGitGadget wrote:
> Maintainer note: This patch series builds on top of en/xdiff-cleanup and
> am/xdiff-hash-tweak (both of which are now in master).
> 
> The primary goal of this patch series is to convert every field's type in
> xrecord_t and xdfile_t to be unambiguous, in preparation to make it more
> Rust FFI friendly. Additionally the ha field in xrecord_t is split into
> line_hash and minimal_perfect_hash.

Given that this series changes the types of all the "long" struct 
members to "size_t" I was surprised to see that it adds so many "(long)" 
casts. At the end of this series there are 38 lines in xdiff/ that 
contain "(long)" compared to just 4 in master. I had expected that as 
we'd converted all the members to "size_t" there would be no need to 
keep using "long" in the code. As Rust is going to be optional for quite
a while I think we should clean up the C code to avoid casting between
"long" and "size_t".

Thanks

Phillip




* Re: [PATCH 0/9] Xdiff cleanup part2
  2025-10-21 13:28 ` Phillip Wood
@ 2025-10-21 13:41   ` Junio C Hamano
  0 siblings, 0 replies; 118+ messages in thread
From: Junio C Hamano @ 2025-10-21 13:41 UTC (permalink / raw)
  To: Phillip Wood
  Cc: Ezekiel Newren via GitGitGadget, git, Ezekiel Newren,
	Patrick Steinhardt

Phillip Wood <phillip.wood123@gmail.com> writes:

> Given that this series changes the types of all the "long" struct 
> members to "size_t" I was surprised to see that it adds so many "(long)" 
> casts. At the end of this series there are 38 lines in xdiff/ that 
> contain "(long)" compared to just 4 in master. I had expected that as 
> we'd converted all the members to "size_t" there would be no need to 
> keep using "long" in the code. As Rust is going to be optional for quite
> a while I think we should clean up the C code to avoid casting between
> "long" and "size_t".

Either we cast here, or we convert existing code that used to use
long to use another type, which needs to be done carefully, as we
would be moving code that used a signed type to now use an unsigned
one.  While I agree with you in principle that we shouldn't try to
interface between code pieces with impedance mismatch (for which the
need to cast is an indication), we'd need to draw a line somewhere.
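
As an illustration of why the signed-to-unsigned move needs care, here
is a small sketch in plain C. It is not taken from xdiff; the function
names count_down_signed(), count_down_unsigned() and difference_wraps()
are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Count down with a signed index: the loop ends once i becomes -1. */
static long count_down_signed(long n)
{
	long i, visited = 0;

	assert(n >= 0);
	for (i = n - 1; i >= 0; i--)
		visited++;
	return visited;
}

/*
 * The same loop with size_t. Writing "i >= 0" here would always be
 * true, so the condition is rewritten as "i-- > 0", which tests the
 * index before decrementing it.
 */
static size_t count_down_unsigned(size_t n)
{
	size_t i, visited = 0;

	for (i = n; i-- > 0; )
		visited++;
	return visited;
}

/*
 * With unsigned operands "a - b" wraps around instead of becoming
 * negative. The wrapped result is larger than a exactly when b > a.
 */
static int difference_wraps(size_t a, size_t b)
{
	return (a - b) > a;
}
```

Loops and subtractions of this shape are the kind of code that has to
be audited by hand when a "long" becomes a "size_t".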

Thanks.


* [PATCH v2 00/10] Xdiff cleanup part2
  2025-10-15 21:18 [PATCH 0/9] Xdiff cleanup part2 Ezekiel Newren via GitGitGadget
                   ` (10 preceding siblings ...)
  2025-10-21 13:28 ` Phillip Wood
@ 2025-10-29 22:19 ` Ezekiel Newren via GitGitGadget
  2025-10-29 22:19   ` [PATCH v2 01/10] doc: define unambiguous type mappings across C and Rust Ezekiel Newren via GitGitGadget
                     ` (11 more replies)
  11 siblings, 12 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-10-29 22:19 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren

Changes in v2:

 * Added documentation about unambiguous types and FFI
 * Addressed comments on the mailing list


Original cover letter below:
============================

Maintainer note: This patch series builds on top of en/xdiff-cleanup and
am/xdiff-hash-tweak (both of which are now in master).

The primary goal of this patch series is to convert every field's type in
xrecord_t and xdfile_t to be unambiguous, in preparation to make it more
Rust FFI friendly. Additionally the ha field in xrecord_t is split into
line_hash and minimal_perfect_hash.

The order of some of the fields has changed as called out by the commit
messages.

Before:

typedef struct s_xrecord {
	char const *ptr;
	long size;
	unsigned long ha;
} xrecord_t;

typedef struct s_xdfile {
	xrecord_t *recs;
	long nrec;
	long dstart, dend;
	bool *changed;
	long *rindex;
	long nreff;
} xdfile_t;


After part 2

typedef struct s_xrecord {
	uint8_t const *ptr;
	size_t size;
	uint64_t line_hash;
	size_t minimal_perfect_hash;
} xrecord_t;

typedef struct s_xdfile {
	xrecord_t *recs;
	size_t nrec;
	bool *changed;
	size_t *reference_index;
	size_t nreff;
	ssize_t dstart, dend;
} xdfile_t;


Ezekiel Newren (10):
  doc: define unambiguous type mappings across C and Rust
  xdiff: use ssize_t for dstart/dend, make them last in xdfile_t
  xdiff: make xrecord_t.ptr a uint8_t instead of char
  xdiff: use size_t for xrecord_t.size
  xdiff: use unambiguous types in xdl_hash_record()
  xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash
  xdiff: make xdfile_t.nrec a size_t instead of long
  xdiff: make xdfile_t.nreff a size_t instead of long
  xdiff: change rindex from long to size_t in xdfile_t
  xdiff: rename rindex -> reference_index

 .../technical/unambiguous-types.adoc          | 229 ++++++++++++++++++
 xdiff-interface.c                             |   2 +-
 xdiff/xdiffi.c                                |  29 ++-
 xdiff/xemit.c                                 |  28 +--
 xdiff/xhistogram.c                            |   4 +-
 xdiff/xmerge.c                                |  30 +--
 xdiff/xpatience.c                             |  14 +-
 xdiff/xprepare.c                              |  58 ++---
 xdiff/xtypes.h                                |  15 +-
 xdiff/xutils.c                                |  32 +--
 xdiff/xutils.h                                |   6 +-
 11 files changed, 338 insertions(+), 109 deletions(-)
 create mode 100644 Documentation/technical/unambiguous-types.adoc


base-commit: 143f58ef7535f8f8a80d810768a18bdf3807de26
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-2070%2Fezekielnewren%2Fxdiff_cleanup_part2-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-2070/ezekielnewren/xdiff_cleanup_part2-v2
Pull-Request: https://github.com/git/git/pull/2070

Range-diff vs v1:

  -:  ---------- >  1:  88133848d1 doc: define unambiguous type mappings across C and Rust
  1:  1fa9a7d7d1 !  2:  9197903add xdiff: use ssize_t for dstart/dend, make them last in xdfile_t
     @@ xdiff/xtypes.h: typedef struct s_xrecord {
       	bool *changed;
       	long *rindex;
       	long nreff;
     -+	ssize_t dstart, dend;
     ++	ptrdiff_t dstart, dend;
       } xdfile_t;
       
       typedef struct s_xdfenv {
  2:  7b9e8961d4 !  3:  46bc1b3e25 xdiff: make xrecord_t.ptr a uint8_t instead of char
     @@ Commit message
          xdiff: make xrecord_t.ptr a uint8_t instead of char
      
          Rust uses u8 to refer to bytes in memory. Since xrecord_t.ptr is also
     -    referring to bytes in memory, rather than unicode code points, use
     +    referring to bytes in memory, rather than Unicode code points, use
          uint8_t instead of char.
      
     +    Every usage of this field was inspected and cast to char*, or similar,
     +    to avoid signedness warnings/errors from the compiler. Casting was used
     +    so that the whole of xdiff doesn't need to be refactored in order to
     +    change the type of this field.
     +
          Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
      
       ## xdiff/xdiffi.c ##
  3:  ae15ed7121 =  4:  07e28aad3b xdiff: use size_t for xrecord_t.size
  4:  7fcd83c990 !  5:  1ade7d8165 xdiff: use unambiguous types in xdl_hash_record()
     @@ Metadata
       ## Commit message ##
          xdiff: use unambiguous types in xdl_hash_record()
      
     +    Convert the function signature and body to use unambiguous types. char
     +    is changed to uint8_t because this function processes bytes in memory.
     +    unsigned long to uint64_t so that the hash output is consistent across
     +    platforms. `flags` was changed from long to uint64_t to ensure the
     +    high order bits are not dropped on platforms that treat long as 32
     +    bits.
     +
          Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
      
       ## xdiff-interface.c ##
  5:  a3e706ecda =  6:  59054ea0cb xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash
  6:  5767ba4ee8 =  7:  f91be17858 xdiff: make xdfile_t.nrec a size_t instead of long
  7:  4caa6a4669 !  8:  e2a6a23cc4 xdiff: make xdfile_t.nreff a size_t instead of long
     @@ xdiff/xtypes.h: typedef struct s_xdfile {
       	long *rindex;
      -	long nreff;
      +	size_t nreff;
     - 	ssize_t dstart, dend;
     + 	ptrdiff_t dstart, dend;
       } xdfile_t;
       
  8:  6dca5e6222 !  9:  3b6054945f xdiff: change rindex from long to size_t in xdfile_t
     @@ Commit message
          xdiff: change rindex from long to size_t in xdfile_t
      
          rindex describes a index offset which means it's an index into memory
     -    which should use size_t. dstart and dend will be deleted in a future
     -    patch series. Move them to the end to help avoid refactor conflicts.
     +    which should use size_t.
     +
     +    Changing the type of rindex from long to size_t has no cascading
     +    refactor impact because it is only ever used to directly index other
     +    arrays.
      
          Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
      
     @@ xdiff/xtypes.h: typedef struct s_xdfile {
      -	long *rindex;
      +	size_t *rindex;
       	size_t nreff;
     - 	ssize_t dstart, dend;
     + 	ptrdiff_t dstart, dend;
       } xdfile_t;
  9:  518e5f5557 ! 10:  1856a29026 xdiff: rename rindex -> reference_index
     @@ xdiff/xtypes.h: typedef struct s_xdfile {
      -	size_t *rindex;
      +	size_t *reference_index;
       	size_t nreff;
     - 	ssize_t dstart, dend;
     + 	ptrdiff_t dstart, dend;
       } xdfile_t;
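
The commit message added in this range-diff for "xdiff: use unambiguous
types in xdl_hash_record()" mentions high order bits being dropped where
long is 32 bits. A minimal sketch of that effect follows. The flag
values DEMO_HIGH_FLAG and DEMO_LOW_FLAG and both helper functions are
hypothetical and are not real xdiff flags or functions; uint32_t stands
in for a 32-bit long.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flags for illustration only; not real xdiff flags. */
#define DEMO_HIGH_FLAG (UINT64_C(1) << 40)
#define DEMO_LOW_FLAG  (UINT64_C(1) << 3)

/* Simulates passing flags through a parameter that is 32 bits wide. */
static int has_flag_via_32bit(uint64_t flags, uint64_t flag)
{
	uint32_t narrow = (uint32_t)flags; /* upper 32 bits are discarded */

	assert(flag != 0);
	return (narrow & flag) != 0;
}

/* Keeps the full width, so every bit survives on every platform. */
static int has_flag_via_64bit(uint64_t flags, uint64_t flag)
{
	assert(flag != 0);
	return (flags & flag) != 0;
}
```

With a fixed-width uint64_t parameter the result no longer depends on
how wide the platform makes long.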

-- 
gitgitgadget


* [PATCH v2 01/10] doc: define unambiguous type mappings across C and Rust
  2025-10-29 22:19 ` [PATCH v2 00/10] " Ezekiel Newren via GitGitGadget
@ 2025-10-29 22:19   ` Ezekiel Newren via GitGitGadget
  2025-11-06  9:55     ` Phillip Wood
  2025-10-29 22:19   ` [PATCH v2 02/10] xdiff: use ssize_t for dstart/dend, make them last in xdfile_t Ezekiel Newren via GitGitGadget
                     ` (10 subsequent siblings)
  11 siblings, 1 reply; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-10-29 22:19 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

Document other nuances with crossing the FFI boundary. Other language
mappings may be added in the future.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 .../technical/unambiguous-types.adoc          | 229 ++++++++++++++++++
 1 file changed, 229 insertions(+)
 create mode 100644 Documentation/technical/unambiguous-types.adoc

diff --git a/Documentation/technical/unambiguous-types.adoc b/Documentation/technical/unambiguous-types.adoc
new file mode 100644
index 0000000000..658a5b578e
--- /dev/null
+++ b/Documentation/technical/unambiguous-types.adoc
@@ -0,0 +1,229 @@
+= Unambiguous types
+
+Most of these mappings are obvious, but there are some nuances and gotchas with
+Rust FFI (Foreign Function Interface).
+
+This document defines clear, one-to-one mappings between primitive types in C,
+Rust (and possibly other languages in the future). Its purpose is to eliminate
+ambiguity in type widths, signedness, and binary representation across
+platforms and languages.
+
+For Git, the only header required to use these unambiguous types in C is
+`git-compat-util.h`.
+
+== Boolean types
+[cols="1,1", options="header"]
+|===
+| C Type | Rust Type
+| bool^1^       | bool
+|===
+
+== Integer types
+
+In C, `<stdint.h>` (or an equivalent) must be included.
+
+[cols="1,1", options="header"]
+|===
+| C Type | Rust Type
+| uint8_t    | u8
+| uint16_t   | u16
+| uint32_t   | u32
+| uint64_t   | u64
+
+| int8_t     | i8
+| int16_t    | i16
+| int32_t    | i32
+| int64_t    | i64
+|===
+
+== Floating-point types
+
+Rust requires IEEE-754 semantics.
+In C, that is typically true, but not guaranteed by the standard.
+
+[cols="1,1", options="header"]
+|===
+| C Type | Rust Type
+| float^2^      | f32
+| double^2^     | f64
+|===
+
+== Size types
+
+These types represent pointer-sized integers and are typically defined in
+`<stddef.h>` or an equivalent header.
+
+Size types should be used any time pointer arithmetic is performed e.g.
+indexing an array, describing the number of elements in memory, etc...
+
+[cols="1,1", options="header"]
+|===
+| C Type | Rust Type
+| size_t^3^     | usize
+| ptrdiff_t^4^  | isize
+|===
+
+== Character types
+
+This is where C and Rust don't have a clean one-to-one mapping. A C `char` is
+an 8-bit type that is signless (neither signed nor unsigned) which causes
+problems with e.g. `make DEVELOPER=1`. Rust's `char` type is an unsigned 32-bit
+integer that is used to describe Unicode code points. Even though a C `char`
+is the same width as `u8`, `char` should be converted to u8 where it is
+describing bytes in memory. If a C `char` is not describing bytes, then it
+should be converted to a more accurate unambiguous type.
+
+While you could specify `char` in the C code and `u8` in Rust code, it's not as
+clear what the appropriate type is, but it would work across the FFI boundary.
+However the bigger problem comes from code generation tools like cbindgen and
+bindgen. When cbindgen sees u8 in Rust it will generate uint8_t on the C side
+which will cause "differ in signedness" warnings/errors. Similarly if bindgen
+sees `char` on the C side it will generate `std::ffi::c_char` which has its own
+problems.
+
+=== Notes
+^1^ This is only true if stdbool.h (or equivalent) is used. +
+^2^ C does not enforce IEEE-754 compatibility, but Rust expects it. If the
+platform/arch for C does not follow IEEE-754 then this equivalence does not
+hold. Also, it's assumed that `float` is 32 bits and `double` is 64, but
+there may be a strange platform/arch where even this isn't true. +
+^3^ C also defines uintptr_t, but this should not be used in Git. +
+^4^ C also defines ssize_t and intptr_t, but these should not be used in Git. +
+
+== Problems with std::ffi::c_* types in Rust
+TL;DR: They're not guaranteed to match C types for all possible C
+compilers/platforms/architectures.
+
+Only a few of Rust's C FFI types are considered safe and semantically clear to
+use: +
+
+* `c_void`
+* `CStr`
+* `CString`
+
+Even then, they should be used sparingly, and only where the semantics match
+exactly.
+
+The std::os::raw::c_* (which is deprecated) directly inherits the problems of
+core::ffi, which changes over time and seems to make a best guess at the
+correct definition for a given platform/target. This probably isn't a problem
+for all platforms that Rust supports currently, but can anyone say that Rust
+got it right for all C compilers of all platforms/targets?
+
+On top of all of that we're targeting an older version of Rust which doesn't
+have the latest mappings.
+
+To give an example: c_long is defined in
+footnote:[https://doc.rust-lang.org/1.63.0/src/core/ffi/mod.rs.html#175-189[c_long in 1.63.0]]
+footnote:[https://doc.rust-lang.org/1.89.0/src/core/ffi/primitives.rs.html#135-151[c_long in 1.89.0]]
+
+=== Rust version 1.63.0
+
+[source]
+----
+mod c_long_definition {
+    cfg_if! {
+        if #[cfg(all(target_pointer_width = "64", not(windows)))] {
+            pub type c_long = i64;
+            pub type NonZero_c_long = crate::num::NonZeroI64;
+            pub type c_ulong = u64;
+            pub type NonZero_c_ulong = crate::num::NonZeroU64;
+        } else {
+            // The minimal size of `long` in the C standard is 32 bits
+            pub type c_long = i32;
+            pub type NonZero_c_long = crate::num::NonZeroI32;
+            pub type c_ulong = u32;
+            pub type NonZero_c_ulong = crate::num::NonZeroU32;
+        }
+    }
+}
+----
+
+=== Rust version 1.89.0
+
+[source]
+----
+mod c_long_definition {
+    crate::cfg_select! {
+        any(
+            all(target_pointer_width = "64", not(windows)),
+            // wasm32 Linux ABI uses 64-bit long
+            all(target_arch = "wasm32", target_os = "linux")
+        ) => {
+            pub(super) type c_long = i64;
+            pub(super) type c_ulong = u64;
+        }
+        _ => {
+            // The minimal size of `long` in the C standard is 32 bits
+            pub(super) type c_long = i32;
+            pub(super) type c_ulong = u32;
+        }
+    }
+}
+----
+
+Even for the cases where C types are correctly mapped to Rust types via
+std::ffi::c_* there are still problems. Let's take c_char for example. On some
+platforms it's u8 on others it's i8.
+
+=== Subtraction underflow in debug mode
+
+The following code will panic in debug on platforms that define c_char as u8,
+but won't if it's an i8.
+
+[source]
+----
+let mut x: std::ffi::c_char = 0;
+x -= 1;
+----
+
+=== Inconsistent shift behavior
+
+`x` will be 0xC0 for platforms that use i8, but will be 0x40 where it's u8.
+
+[source]
+----
+let mut x: std::ffi::c_char = 0x80;
+x >>= 1;
+----
+
+=== Equality fails to compile on some platforms
+
+The following will not compile on platforms that define c_char as i8, but will
+if it's u8. You can cast x e.g. `assert_eq!(x as u8, b'a');`, but then you get
+a warning on platforms that use u8 and a clean compilation where i8 is used.
+
+[source]
+----
+let mut x: std::ffi::c_char = 0x61;
+assert_eq!(x, b'a');
+----
+
+== Enum types
+Rust enum types should not be used as FFI types. Rust enum types are more like
+C union types than C enums. For something like:
+
+[source]
+----
+#[repr(u8)]
+enum Fruit {
+    Apple,
+    Banana,
+    Cherry,
+}
+----
+
+It's easy enough to make sure the Rust enum matches what C would expect, but
+the same is not true of a more complex type like:
+
+[source]
+----
+enum HashResult {
+    SHA1([u8; 20]),
+    SHA256([u8; 32]),
+}
+----
+
+The Rust compiler has to add a discriminant to the enum to distinguish between
+the variants. The width, location, and values for that discriminant are up to
+the Rust compiler and are not ABI stable.
-- 
gitgitgadget



* Re: [PATCH v2 01/10] doc: define unambiguous type mappings across C and Rust
  2025-10-29 22:19   ` [PATCH v2 01/10] doc: define unambiguous type mappings across C and Rust Ezekiel Newren via GitGitGadget
@ 2025-11-06  9:55     ` Phillip Wood
  2025-11-06 22:52       ` Ezekiel Newren
  0 siblings, 1 reply; 118+ messages in thread
From: Phillip Wood @ 2025-11-06  9:55 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget, git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Chris Torek,
	Ezekiel Newren

Hi Ezekiel

On 29/10/2025 22:19, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
> 
> Document other nuances with crossing the FFI boundary. Other language
> mappings may be added in the future.

Thanks for adding this, I've left a few comments below. Overall I 
thought it was very well written. I tried building an HTML version of 
this but even after adding it to the list of TECH_DOCS in 
Documentation/Makefile with

diff --git a/Documentation/Makefile b/Documentation/Makefile
index 47208269a2e..2699f0b24af 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -143,6 +143,7 @@ TECH_DOCS += technical/shallow
  TECH_DOCS += technical/sparse-checkout
  TECH_DOCS += technical/sparse-index
  TECH_DOCS += technical/trivial-merge
+TECH_DOCS += technical/unambiguous-types
  TECH_DOCS += technical/unit-tests
  SP_ARTICLES += $(TECH_DOCS)
  SP_ARTICLES += technical/api-index

it fails with

$ make -C Documentation/ technical/unambiguous-types.html
make: Entering directory '/home/phil/src/git/Documentation'
     GEN asciidoc.conf
     * new asciidoc flags
     ASCIIDOC technical/unambiguous-types.html
asciidoc: ERROR: unambiguous-types.adoc: line 139: undefined filter attribute in command: source-highlight --gen-version -f xhtml -s {language} {src_numbered?--line-number=' '} {src_tab?--tab={src_tab}} {args=}
asciidoc: ERROR: unambiguous-types.adoc: line 162: undefined filter attribute in command: source-highlight --gen-version -f xhtml -s {language} {src_numbered?--line-number=' '} {src_tab?--tab={src_tab}} {args=}
asciidoc: ERROR: unambiguous-types.adoc: line 177: undefined filter attribute in command: source-highlight --gen-version -f xhtml -s {language} {src_numbered?--line-number=' '} {src_tab?--tab={src_tab}} {args=}
asciidoc: ERROR: unambiguous-types.adoc: line 187: undefined filter attribute in command: source-highlight --gen-version -f xhtml -s {language} {src_numbered?--line-number=' '} {src_tab?--tab={src_tab}} {args=}
asciidoc: ERROR: unambiguous-types.adoc: line 199: undefined filter attribute in command: source-highlight --gen-version -f xhtml -s {language} {src_numbered?--line-number=' '} {src_tab?--tab={src_tab}} {args=}
asciidoc: ERROR: unambiguous-types.adoc: line 213: undefined filter attribute in command: source-highlight --gen-version -f xhtml -s {language} {src_numbered?--line-number=' '} {src_tab?--tab={src_tab}} {args=}
asciidoc: ERROR: unambiguous-types.adoc: line 224: undefined filter attribute in command: source-highlight --gen-version -f xhtml -s {language} {src_numbered?--line-number=' '} {src_tab?--tab={src_tab}} {args=}
make: *** [Makefile:396: technical/unambiguous-types.html] Error 1
make: *** Deleting file 'technical/unambiguous-types.html'
make: Leaving directory '/home/phil/src/git/Documentation'

> +== Character types
> +
> +This is where C and Rust don't have a clean one-to-one mapping. A C `char` is
> +an 8-bit type that is signless (neither signed nor unsigned) 

I found this a bit confusing. Isn't the signedness of "char" 
implementation defined rather than it being "signless"?

> which causes
> +problems with e.g. `make DEVELOPER=1`.

I'm not sure what this is referring to - maybe -Wsign-compare?

> Rust's `char` type is an unsigned 32-bit
> +integer that is used to describe Unicode code points. Even though a C `char`
> +is the same width as `u8`, `char` should be converted to u8 where it is
> +describing bytes in memory. 

I'm dreading the point where we start sharing "struct strbuf" with rust 
and have to change the "buf" member from "char*" to "uint8_t*". While it 
is not used in the xdiff code it is ubiquitous everywhere else and there 
are lots of places where we pass the "buf" member to functions expecting 
a "char*".

	git grep -E '(\.|->)buf\W'

has over 4000 matches

> If a C `char` is not describing bytes, then it
> +should be converted to a more accurate unambiguous type.

That's a good point.

> +While you could specify `char` in the C code and `u8` in Rust code, it's not as
> +clear what the appropriate type is, but it would work across the FFI boundary.
> +However the bigger problem comes from code generation tools like cbindgen and
> +bindgen. When cbindgen sees u8 in Rust it will generate uint8_t on the C side,
> +which will cause "differ in signedness" warnings/errors. Similarly, if bindgen
> +sees `char` on the C side it will generate `std::ffi::c_char`, which has its own
> +problems.

Yeah, we definitely don't want to be using "std::ffi::c_char" in our 
rust implementations. I do wonder if we might want to use it (or CStr) 
judiciously in function parameters and immediately convert it to u8 in 
the function body where the function is called from C though.
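
A rough sketch of what I mean, in case it helps. The function names are
made up for illustration and are not from this series; the only place
`c_char` appears is the signature of the entry point, and everything
after the first line of the body works on `u8`.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

/// Safe implementation: works purely on bytes, never on `c_char`.
fn count_slashes(path: &[u8]) -> usize {
    path.iter().filter(|&&b| b == b'/').count()
}

/// Hypothetical FFI entry point. A real export would also need a
/// `no_mangle` attribute; it is left out to keep the sketch minimal.
///
/// # Safety
/// `path` must be NULL or point to a NUL-terminated string that stays
/// valid for the duration of the call.
pub unsafe extern "C" fn example_count_slashes(path: *const c_char) -> usize {
    if path.is_null() {
        return 0;
    }
    // `c_char` stops here: `to_bytes()` yields `&[u8]` without the NUL.
    let bytes: &[u8] = unsafe { CStr::from_ptr(path) }.to_bytes();
    count_slashes(bytes)
}
```

The C caller keeps passing a plain "char*" and the rust implementation
never has to care whether `c_char` is `i8` or `u8` on the platform.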

> +=== Notes
> +^1^ This is only true if stdbool.h (or equivalent) is used. +
> +^2^ C does not enforce IEEE-754 compatibility, but Rust expects it. If the
> +platform/arch for C does not follow IEEE-754 then this equivalence does not
> +hold. Also, it's assumed that `float` is 32 bits and `double` is 64, but
> +there may be a strange platform/arch where even this isn't true. +
> +^3^ C also defines uintptr_t, but this should not be used in Git. +
> +^4^ C also defines ssize_t and intptr_t, but these should not be used in Git. +

[u]intptr_t and ssize_t are used in git already. As Junio has pointed 
out there are sane uses for these types but we don't want to use them in 
structs or function parameters where the struct or function is shared 
with rust.

> +
> +== Problems with std::ffi::c_* types in Rust
> +TL;DR: They're not guaranteed to match C types for all possible C
> +compilers/platforms/architectures.

Is this official policy of the rust project?

Thanks

Phillip

> +Only a few of Rust's C FFI types are considered safe and semantically clear to
> +use: +
> +
> +* `c_void`
> +* `CStr`
> +* `CString`
> +
> +Even then, they should be used sparingly, and only where the semantics match
> +exactly.
> +
> +The std::os::raw::c_* (which is deprecated) directly inherits the problems of
> +core::ffi, which changes over time and seems to make a best guess at the
> +correct definition for a given platform/target. This probably isn't a problem
> +for all platforms that Rust supports currently, but can anyone say that Rust
> +got it right for all C compilers of all platforms/targets?
> +
> +On top of all of that we're targeting an older version of Rust which doesn't
> +have the latest mappings.
> +
> +To give an example: c_long is defined in
> +footnote:[https://doc.rust-lang.org/1.63.0/src/core/ffi/mod.rs.html#175-189[c_long in 1.63.0]]
> +footnote:[https://doc.rust-lang.org/1.89.0/src/core/ffi/primitives.rs.html#135-151[c_long in 1.89.0]]
> +
> +=== Rust version 1.63.0
> +
> +[source]
> +----
> +mod c_long_definition {
> +    cfg_if! {
> +        if #[cfg(all(target_pointer_width = "64", not(windows)))] {
> +            pub type c_long = i64;
> +            pub type NonZero_c_long = crate::num::NonZeroI64;
> +            pub type c_ulong = u64;
> +            pub type NonZero_c_ulong = crate::num::NonZeroU64;
> +        } else {
> +            // The minimal size of `long` in the C standard is 32 bits
> +            pub type c_long = i32;
> +            pub type NonZero_c_long = crate::num::NonZeroI32;
> +            pub type c_ulong = u32;
> +            pub type NonZero_c_ulong = crate::num::NonZeroU32;
> +        }
> +    }
> +}
> +----
> +
> +=== Rust version 1.89.0
> +
> +[source]
> +----
> +mod c_long_definition {
> +    crate::cfg_select! {
> +        any(
> +            all(target_pointer_width = "64", not(windows)),
> +            // wasm32 Linux ABI uses 64-bit long
> +            all(target_arch = "wasm32", target_os = "linux")
> +        ) => {
> +            pub(super) type c_long = i64;
> +            pub(super) type c_ulong = u64;
> +        }
> +        _ => {
> +            // The minimal size of `long` in the C standard is 32 bits
> +            pub(super) type c_long = i32;
> +            pub(super) type c_ulong = u32;
> +        }
> +    }
> +}
> +----
> +
> +Even for the cases where C types are correctly mapped to Rust types via
> +std::ffi::c_* there are still problems. Let's take c_char for example. On some
> +platforms it's u8 on others it's i8.
> +
> +=== Subtraction underflow in debug mode
> +
> +The following code will panic in debug on platforms that define c_char as u8,
> +but won't if it's an i8.
> +
> +[source]
> +----
> +let mut x: std::ffi::c_char = 0;
> +x -= 1;
> +----
> +
> +=== Inconsistent shift behavior
> +
> +`x` will be 0xC0 for platforms that use i8, but will be 0x40 where it's u8.
> +
> +[source]
> +----
> +let mut x: std::ffi::c_char = 0x80;
> +x >>= 1;
> +----
> +
> +=== Equality fails to compile on some platforms
> +
> +The following will not compile on platforms that define c_char as i8, but will
> +if it's u8. You can cast x e.g. `assert_eq!(x as u8, b'a');`, but then you get
> +a warning on platforms that use u8 and a clean compilation where i8 is used.
> +
> +[source]
> +----
> +let mut x: std::ffi::c_char = 0x61;
> +assert_eq!(x, b'a');
> +----
> +
> +== Enum types
> +Rust enum types should not be used as FFI types. Rust enum types are more like
> +C union types than C enums. For something like:
> +
> +[source]
> +----
> +#[repr(C, u8)]
> +enum Fruit {
> +    Apple,
> +    Banana,
> +    Cherry,
> +}
> +----
> +
> +It's easy enough to make sure the Rust enum matches what C would expect.
> +That is not the case for a more complex type like:
> +
> +[source]
> +----
> +enum HashResult {
> +    SHA1([u8; 20]),
> +    SHA256([u8; 32]),
> +}
> +----
> +
> +The Rust compiler has to add a discriminant to the enum to distinguish between
> +the variants. The width, location, and values of that discriminant are up to
> +the Rust compiler and are not ABI stable.


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* Re: [PATCH v2 01/10] doc: define unambiguous type mappings across C and Rust
  2025-11-06  9:55     ` Phillip Wood
@ 2025-11-06 22:52       ` Ezekiel Newren
  2025-11-09 14:14         ` Phillip Wood
  0 siblings, 1 reply; 118+ messages in thread
From: Ezekiel Newren @ 2025-11-06 22:52 UTC (permalink / raw)
  To: phillip.wood
  Cc: Ezekiel Newren via GitGitGadget, git, Kristoffer Haugsbakk,
	Patrick Steinhardt, Chris Torek

On Thu, Nov 6, 2025 at 2:55 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>
> Hi Ezekiel
>
> On 29/10/2025 22:19, Ezekiel Newren via GitGitGadget wrote:
> > From: Ezekiel Newren <ezekielnewren@gmail.com>
> >
> > Document other nuances with crossing the FFI boundary. Other language
> > mappings may be added in the future.
>
> Thanks for adding this, I've left a few comments below. Overall I
> thought it was very well written.

Thanks.

I felt it was necessary since C vs Rust types keep coming up over and
over again. I'm flexible with the wording of this document. I was just
trying to convey a firm and clear stance on what is and isn't proper
in Git.

> I tried building an html version of
> this but even after adding it to the list of TECH_DOCS in
> Documentation/Makefile with
>
> diff --git a/Documentation/Makefile b/Documentation/Makefile
> index 47208269a2e..2699f0b24af 100644
> --- a/Documentation/Makefile
> +++ b/Documentation/Makefile
> @@ -143,6 +143,7 @@ TECH_DOCS += technical/shallow
>   TECH_DOCS += technical/sparse-checkout
>   TECH_DOCS += technical/sparse-index
>   TECH_DOCS += technical/trivial-merge
> +TECH_DOCS += technical/unambiguous-types
>   TECH_DOCS += technical/unit-tests
>   SP_ARTICLES += $(TECH_DOCS)
>   SP_ARTICLES += technical/api-index
>
> it fails with
>
> $ make -C Documentation/ technical/unambiguous-types.html
>                                        Merge branch
> 'ps/object-source-loose' into seen
> make: Entering directory '/home/phil/src/git/Documentation'
>      GEN asciidoc.conf
>      * new asciidoc flags
>      ASCIIDOC technical/unambiguous-types.html
> asciidoc: ERROR: unambiguous-types.adoc: line 139: undefined filter
> attribute in command: source-highlight --gen-version -f xhtml -s
> {language} {src_numbered?--line-number=' '} {src_tab?--tab={src_tab}}
> {args=}
> asciidoc: ERROR: unambiguous-types.adoc: line 162: undefined filter
> attribute in command: source-highlight --gen-version -f xhtml -s
> {language} {src_numbered?--line-number=' '} {src_tab?--tab={src_tab}}
> {args=}
> asciidoc: ERROR: unambiguous-types.adoc: line 177: undefined filter
> attribute in command: source-highlight --gen-version -f xhtml -s
> {language} {src_numbered?--line-number=' '} {src_tab?--tab={src_tab}}
> {args=}
> asciidoc: ERROR: unambiguous-types.adoc: line 187: undefined filter
> attribute in command: source-highlight --gen-version -f xhtml -s
> {language} {src_numbered?--line-number=' '} {src_tab?--tab={src_tab}}
> {args=}
> asciidoc: ERROR: unambiguous-types.adoc: line 199: undefined filter
> attribute in command: source-highlight --gen-version -f xhtml -s
> {language} {src_numbered?--line-number=' '} {src_tab?--tab={src_tab}}
> {args=}
> asciidoc: ERROR: unambiguous-types.adoc: line 213: undefined filter
> attribute in command: source-highlight --gen-version -f xhtml -s
> {language} {src_numbered?--line-number=' '} {src_tab?--tab={src_tab}}
> {args=}
> asciidoc: ERROR: unambiguous-types.adoc: line 224: undefined filter
> attribute in command: source-highlight --gen-version -f xhtml -s
> {language} {src_numbered?--line-number=' '} {src_tab?--tab={src_tab}}
> {args=}
> make: *** [Makefile:396: technical/unambiguous-types.html] Error 1
> make: *** Deleting file 'technical/unambiguous-types.html'
> make: Leaving directory '/home/phil/src/git/Documentation'

I've never created documentation for Git before, so this helps. I'll
incorporate your suggestions.

> > +== Character types
> > +
> > +This is where C and Rust don't have a clean one-to-one mapping. A C `char` is
> > +an 8-bit type that is signless (neither signed nor unsigned)
>
> I found this a bit confusing. Isn't the signedness of "char"
> implementation defined rather than it being "signless"
>
> > which causes
> > +problems with e.g. `make DEVELOPER=1`.
>
> I'm not sure what this is referring to - maybe -Wsign-compare?

When I build Git with `make DEVELOPER=1` and I compare uint8_t with
char it complains about a difference in signedness. When I compare
int8_t with char it also complains about a difference in signedness.
So it is implementation defined, but it's also neither signed nor
unsigned according to DEVELOPER=1 since it complains either way.
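
The same ambiguity is visible from the Rust side. As a small sketch
(the helper names are hypothetical, not part of this series), the only
portable thing to do with a `c_char` is to reinterpret it as a byte
right away, because its signedness depends on the target:

```rust
use std::os::raw::c_char;

/// Reinterpret a C `char` as a raw byte. The cast keeps the bit pattern
/// whether the platform defines `c_char` as `i8` or as `u8`.
fn byte_of(c: c_char) -> u8 {
    c as u8
}

/// Report whether `c_char` is a signed type on the current target.
/// `MIN` is -128 for `i8` and 0 for `u8`.
fn c_char_is_signed() -> bool {
    c_char::MIN != 0
}
```

`c_char_is_signed()` returns different answers on different targets,
which is the Rust equivalent of the compiler complaining either way.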

> > Rust's `char` type is an unsigned 32-bit
> > +integer that is used to describe Unicode code points. Even though a C `char`
> > +is the same width as `u8`, `char` should be converted to u8 where it is
> > +describing bytes in memory.
>
> I'm dreading the point where we start sharing "struct strbuf" with rust
> and have to change the "buf" member from "char*" to "uint8_t*". While it
> is not used in the xdiff code it is ubiquitous everywhere else and there
> are lots of places where we pass the "buf" member to functions expecting
> a "char*".
>
>         git grep -E '(\.|->)buf\W'
>
> has over 4000 matches

This is why I started in Xdiff since its code is mostly isolated. I
think that we might have to bite the bullet and deal with the ugly
mapping of char on the C side and u8 on the Rust side when dealing
with strbuf. Maybe as we translate more of C into Rust someone will
have a better suggestion. I think my ivec type would be better since
strbuf is almost a special case of my ivec type, but dealing with
strbuf is outside the scope of this patch series.

> > If a C `char` is not describing bytes, then it
> > +should be converted to a more accurate unambiguous type.
>
> That's a good point.
>
> > +While you could specify `char` in the C code and `u8` in Rust code, it's not as
> > +clear what the appropriate type is, but it would work across the FFI boundary.
> > +However the bigger problem comes from code generation tools like cbindgen and
> > +bindgen. When cbindgen sees u8 in Rust it will generate uint8_t on the C side,
> > +which will cause "differ in signedness" warnings/errors. Similarly, if bindgen
> > +sees `char` on the C side it will generate `std::ffi::c_char`, which has its own
> > +problems.
>
> Yeah, we definitely don't want to be using "std::ffi::c_char" in our
> rust implementations. I do wonder if we might want to use it (or CStr)
> judiciously in function parameters and immediately convert it to u8 in
> the function body where the function is called from C though.

That's basically the design pattern I've been using.

In many of my translations from C to Rust I create a Rust stub
function that takes pointer types and wraps them into safe types which
then get handed off to a safe Rust function. I think that in the cases
where CString/CStr is required the Rust stub function would create a
&[u8] slice for the safe function to operate on.
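
Roughly, the pattern looks like the sketch below. The names are
invented for this email and are not functions from the series; the
pointer/length pair stands in for something like xrecord_t's ptr and
size fields.

```rust
use std::slice;

/// Safe implementation: counts leading spaces and tabs in a line.
fn leading_blank_bytes(line: &[u8]) -> usize {
    line.iter()
        .take_while(|&&b| b == b' ' || b == b'\t')
        .count()
}

/// Hypothetical stub called from C. A real export would also carry a
/// `no_mangle` attribute; it is omitted to keep the sketch minimal.
///
/// # Safety
/// `ptr` must be NULL or point to `len` readable bytes that stay valid
/// for the duration of the call.
pub unsafe extern "C" fn example_leading_blank_bytes(ptr: *const u8, len: usize) -> usize {
    if ptr.is_null() || len == 0 {
        return 0;
    }
    // Wrap the raw pointer once, then hand off to the safe function.
    let line = unsafe { slice::from_raw_parts(ptr, len) };
    leading_blank_bytes(line)
}
```

All of the unsafe code is confined to the stub, so the safe function
can be unit tested without going through FFI at all.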

> > +=== Notes
> > +^1^ This is only true if stdbool.h (or equivalent) is used. +
> > +^2^ C does not enforce IEEE-754 compatibility, but Rust expects it. If the
> > +platform/arch for C does not follow IEEE-754 then this equivalence does not
> > +hold. Also, it's assumed that `float` is 32 bits and `double` is 64, but
> > +there may be a strange platform/arch where even this isn't true. +
> > +^3^ C also defines uintptr_t, but this should not be used in Git. +
> > +^4^ C also defines ssize_t and intptr_t, but these should not be used in Git. +
>
> [u]intptr_t and ssize_t are used in git already. As Junio has pointed
> out there are sane uses for these types but we don't want to use them in
> structs or function parameters where the struct or function is shared
> with rust.

You're right, I should update the phrasing. Something like: "These
types shouldn't be used where their explicit purpose is FFI, whether
as a field in a struct or as part of a function signature." I'll
update the wording.

> > +
> > +== Problems with std::ffi::c_* types in Rust
> > +TL;DR: They're not guaranteed to match C types for all possible C
> > +compilers/platforms/architectures.
>
> Is this official policy of the rust project?

No, this is a personal inference based on logical deduction. The c_*
definitions have changed over time with new Rust version releases, and
Git targets more platforms/architectures than what Rust officially
supports. It isn't guaranteed to break anywhere, but it isn't
guaranteed to work everywhere either. On top of that we're targeting
1.63.0, whose c_* definitions differ from those in 1.89.0, as I show
in the c_long_definition example. Can anyone say with certainty that
Rust got these mappings right or wrong for all possible C
compilers/architectures/platforms? If so (which I highly doubt) could
someone provide a link?
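
For contrast, here is a sketch of what the fixed-width approach buys
us. The struct mirrors xrecord_t as shown in the cover letter; the Rust
names `XRecord` and `record_bytes` are hypothetical, and the sketch
assumes `size_t` and `usize` have the same width on the target.

```rust
use std::slice;

/// Rust mirror of the C xrecord_t from the cover letter. No field
/// depends on a std::ffi::c_* definition.
#[repr(C)]
pub struct XRecord {
    pub ptr: *const u8,              // uint8_t const *ptr
    pub size: usize,                 // size_t size
    pub line_hash: u64,              // uint64_t line_hash
    pub minimal_perfect_hash: usize, // size_t minimal_perfect_hash
}

/// View the record's bytes as a slice.
///
/// # Safety
/// `rec.ptr` must be NULL or point to `rec.size` readable bytes that
/// outlive the returned slice.
pub unsafe fn record_bytes<'a>(rec: &'a XRecord) -> &'a [u8] {
    if rec.ptr.is_null() || rec.size == 0 {
        return &[];
    }
    unsafe { slice::from_raw_parts(rec.ptr, rec.size) }
}
```

Had `size` stayed a `long`, the Rust side would need `c_long`, and the
width of that field would then depend on both the target and the Rust
version used to build.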

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v2 01/10] doc: define unambiguous type mappings across C and Rust
  2025-11-06 22:52       ` Ezekiel Newren
@ 2025-11-09 14:14         ` Phillip Wood
  0 siblings, 0 replies; 118+ messages in thread
From: Phillip Wood @ 2025-11-09 14:14 UTC (permalink / raw)
  To: Ezekiel Newren, phillip.wood
  Cc: Ezekiel Newren via GitGitGadget, git, Kristoffer Haugsbakk,
	Patrick Steinhardt, Chris Torek

On 06/11/2025 22:52, Ezekiel Newren wrote:
> On Thu, Nov 6, 2025 at 2:55 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>> On 29/10/2025 22:19, Ezekiel Newren via GitGitGadget wrote:
>>> From: Ezekiel Newren <ezekielnewren@gmail.com>
>>>
>>> Document other nuances with crossing the FFI boundary. Other language
>>> mappings may be added in the future.
>>
>> Thanks for adding this, I've left a few comments below. Overall I
>> thought it was very well written.
> 
> Thanks.
> 
> I felt it was necessary since C vs Rust types keep coming up over and
> over again. I'm flexible with the wording of this document. I was just
> trying to convey a firm and clear stance on what is and isn't proper
> in Git.

That will definitely be useful as we add more rust code. In the future 
we may want to add a summary of which types to use to 
Documentation/CodingGuidelines but that doesn't need to be done in this 
series.

>> I tried building an html version of
>> this but even after adding it to the list of TECH_DOCS in
>> Documentation/Makefile with
>>
>> diff --git a/Documentation/Makefile b/Documentation/Makefile
>> index 47208269a2e..2699f0b24af 100644
>> --- a/Documentation/Makefile
>> +++ b/Documentation/Makefile
>> @@ -143,6 +143,7 @@ TECH_DOCS += technical/shallow
>>    TECH_DOCS += technical/sparse-checkout
>>    TECH_DOCS += technical/sparse-index
>>    TECH_DOCS += technical/trivial-merge
>> +TECH_DOCS += technical/unambiguous-types
>>    TECH_DOCS += technical/unit-tests
>>    SP_ARTICLES += $(TECH_DOCS)
>>    SP_ARTICLES += technical/api-index
>>
>> it fails with
>>
>> $ make -C Documentation/ technical/unambiguous-types.html
>>                                         Merge branch
>> 'ps/object-source-loose' into seen
>> make: Entering directory '/home/phil/src/git/Documentation'
>>       GEN asciidoc.conf
>>       * new asciidoc flags
>>       ASCIIDOC technical/unambiguous-types.html
>> asciidoc: ERROR: unambiguous-types.adoc: line 139: undefined filter
>> attribute in command: source-highlight --gen-version -f xhtml -s
>> {language} {src_numbered?--line-number=' '} {src_tab?--tab={src_tab}}
>> {args=}
>> asciidoc: ERROR: unambiguous-types.adoc: line 162: undefined filter
>> attribute in command: source-highlight --gen-version -f xhtml -s
>> {language} {src_numbered?--line-number=' '} {src_tab?--tab={src_tab}}
>> {args=}
>> asciidoc: ERROR: unambiguous-types.adoc: line 177: undefined filter
>> attribute in command: source-highlight --gen-version -f xhtml -s
>> {language} {src_numbered?--line-number=' '} {src_tab?--tab={src_tab}}
>> {args=}
>> asciidoc: ERROR: unambiguous-types.adoc: line 187: undefined filter
>> attribute in command: source-highlight --gen-version -f xhtml -s
>> {language} {src_numbered?--line-number=' '} {src_tab?--tab={src_tab}}
>> {args=}
>> asciidoc: ERROR: unambiguous-types.adoc: line 199: undefined filter
>> attribute in command: source-highlight --gen-version -f xhtml -s
>> {language} {src_numbered?--line-number=' '} {src_tab?--tab={src_tab}}
>> {args=}
>> asciidoc: ERROR: unambiguous-types.adoc: line 213: undefined filter
>> attribute in command: source-highlight --gen-version -f xhtml -s
>> {language} {src_numbered?--line-number=' '} {src_tab?--tab={src_tab}}
>> {args=}
>> asciidoc: ERROR: unambiguous-types.adoc: line 224: undefined filter
>> attribute in command: source-highlight --gen-version -f xhtml -s
>> {language} {src_numbered?--line-number=' '} {src_tab?--tab={src_tab}}
>> {args=}
>> make: *** [Makefile:396: technical/unambiguous-types.html] Error 1
>> make: *** Deleting file 'technical/unambiguous-types.html'
>> make: Leaving directory '/home/phil/src/git/Documentation'
> 
> I've never created documentation for Git before, so this helps. I'll
> incorporate your suggestions.

We should also add this file to Documentation/technical/meson.build. It 
seems those errors above are due to some incompatibility between 
asciidoc and asciidoctor as I just tried running

     make -C Documentation/ USE_ASCIIDOCTOR=1 technical/unambiguous-types.html

and it worked just fine. I'm afraid I don't know enough asciidoc to make 
any helpful suggestions on how to fix it.

>>> +== Character types
>>> +
>>> +This is where C and Rust don't have a clean one-to-one mapping. A C `char` is
>>> +an 8-bit type that is signless (neither signed nor unsigned)
>>
>> I found this a bit confusing. Isn't the signedness of "char"
>> implementation defined rather than it being "signless"
>>
>>> which causes
>>> +problems with e.g. `make DEVELOPER=1`.
>>
>> I'm not sure what this is referring to - maybe -Wsign-compare?
> 
> When I build Git with `make DEVELOPER=1` and I compare uint8_t with
> char it complains about a difference in signedness. When I compare
> int8_t with char it also complains about a difference in signedness.
> So it is implementation defined, but it's also neither signed nor
> unsigned according to DEVELOPER=1 since it complains either way.

Oh, I see - this is saying mixing "char" and "uint8_t" causes problems. 
I agree, perhaps we could expand this slightly to mention comparison 
with uint8_t to make it clearer.

>>> Rust's `char` type is an unsigned 32-bit
>>> +integer that is used to describe Unicode code points. Even though a C `char`
>>> +is the same width as `u8`, `char` should be converted to u8 where it is
>>> +describing bytes in memory.
>>
>> I'm dreading the point where we start sharing "struct strbuf" with rust
>> and have to change the "buf" member from "char*" to "uint8_t*". While it
>> is not used in the xdiff code it is ubiquitous everywhere else and there
>> are lots of places where we pass the "buf" member to functions expecting
>> a "char*".
>>
>>          git grep -E '(\.|->)buf\W'
>>
>> has over 4000 matches
> 
> This is why I started in Xdiff since its code is mostly isolated.

Good plan!

> I
> think that we might have to bite the bullet and deal with the ugly
> mapping of char on the C side and u8 on the Rust side when dealing
> with strbuf. Maybe as we translate more of C into Rust someone will
> have a better suggestion. I think my ivec type would be better since
> strbuf is almost a special case of my ivec type, but dealing with
> strbuf is outside the scope of this patch series.

Yes, hopefully it will become clearer what the least painful route 
forward is as we get more experience with rust <=> C interop.

>>> +While you could specify `char` in the C code and `u8` in Rust code, it's not as
>>> +clear what the appropriate type is, but it would work across the FFI boundary.
>>> +However the bigger problem comes from code generation tools like cbindgen and
>>> +bindgen. When cbindgen sees u8 in Rust it will generate uint8_t on the C side,
>>> +which will cause "differ in signedness" warnings/errors. Similarly, if bindgen
>>> +sees `char` on the C side it will generate `std::ffi::c_char`, which has its own
>>> +problems.
>>
>> Yeah, we definitely don't want to be using "std::ffi::c_char" in our
>> rust implementations. I do wonder if we might want to use it (or CStr)
>> judiciously in function parameters and immediately convert it to u8 in
>> the function body where the function is called from C though.
> 
> That's basically the design pattern I've been using.
> 
> In many of my translations from C to Rust I create a Rust stub
> function that takes pointer types and wraps them into safe types which
> then get handed off to a safe Rust function. I think that in the cases
> where CString/CStr is required the Rust stub function would create a
> &[u8] slice for the safe function to operate on.

That sounds like a good pattern - we get a nice interface for the C code 
and the rust implementation uses the idiomatic rust types.

Thanks

Phillip

>>> +=== Notes
>>> +^1^ This is only true if stdbool.h (or equivalent) is used. +
>>> +^2^ C does not enforce IEEE-754 compatibility, but Rust expects it. If the
>>> +platform/arch for C does not follow IEEE-754 then this equivalence does not
>>> +hold. Also, it's assumed that `float` is 32 bits and `double` is 64, but
>>> +there may be a strange platform/arch where even this isn't true. +
>>> +^3^ C also defines uintptr_t, but this should not be used in Git. +
>>> +^4^ C also defines ssize_t and intptr_t, but these should not be used in Git. +
>>
>> [u]intptr_t and ssize_t are used in git already. As Junio has pointed
>> out there are sane uses for these types but we don't want to use them in
>> structs or function parameters where the struct or function is shared
>> with rust.
> 
> You're right, I should update the phrasing. Something like: "These
> types shouldn't be used if their explicit purpose is for FFI. Whether
> as a field in a struct or part of a function signature." I'll update
> the wording.
> 
>>> +
>>> +== Problems with std::ffi::c_* types in Rust
>>> +TL;DR: They're not guaranteed to match C types for all possible C
>>> +compilers/platforms/architectures.
>>
>> Is this official policy of the rust project?
> 
> No, this is a personal inference based on logical deduction. The c_*
> definitions have changed over time with new Rust version releases, and
> Git targets more platforms/architectures than what Rust officially
> supports. It isn't guaranteed to break anywhere, but it isn't
> guaranteed to work everywhere either. On top of that we're targeting
> 1.63.0, whose c_* definitions differ from those in 1.89.0, as I show
> in the c_long_definition example. Can anyone say with certainty that
> Rust got these mappings right or wrong for all possible C
> compilers/architectures/platforms? If so (which I highly doubt) could
> someone provide a link?
> 


^ permalink raw reply	[flat|nested] 118+ messages in thread

* [PATCH v2 02/10] xdiff: use ssize_t for dstart/dend, make them last in xdfile_t
  2025-10-29 22:19 ` [PATCH v2 00/10] " Ezekiel Newren via GitGitGadget
  2025-10-29 22:19   ` [PATCH v2 01/10] doc: define unambiguous type mappings across C and Rust Ezekiel Newren via GitGitGadget
@ 2025-10-29 22:19   ` Ezekiel Newren via GitGitGadget
  2025-11-06  9:55     ` Phillip Wood
  2025-10-29 22:19   ` [PATCH v2 03/10] xdiff: make xrecord_t.ptr a uint8_t instead of char Ezekiel Newren via GitGitGadget
                     ` (9 subsequent siblings)
  11 siblings, 1 reply; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-10-29 22:19 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

ssize_t is appropriate for dstart and dend because they both describe
positive or negative offsets relative to a pointer.

A future patch will move these fields to a different struct. Moving
them to the end of xdfile_t now means the field order of xdfile_t will
be disturbed less.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xtypes.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index f145abba3e..7c8c057bca 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -47,10 +47,10 @@ typedef struct s_xrecord {
 typedef struct s_xdfile {
 	xrecord_t *recs;
 	long nrec;
-	long dstart, dend;
 	bool *changed;
 	long *rindex;
 	long nreff;
+	ptrdiff_t dstart, dend;
 } xdfile_t;
 
 typedef struct s_xdfenv {
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* Re: [PATCH v2 02/10] xdiff: use ssize_t for dstart/dend, make them last in xdfile_t
  2025-10-29 22:19   ` [PATCH v2 02/10] xdiff: use ssize_t for dstart/dend, make them last in xdfile_t Ezekiel Newren via GitGitGadget
@ 2025-11-06  9:55     ` Phillip Wood
  2025-11-06 22:56       ` Ezekiel Newren
  0 siblings, 1 reply; 118+ messages in thread
From: Phillip Wood @ 2025-11-06  9:55 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget, git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Chris Torek,
	Ezekiel Newren

Hi Ezekiel

On 29/10/2025 22:19, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
> 
> ssize_t is appropriate for dstart and dend because they both describe
> positive or negative offsets relative to a pointer.

This paragraph and the subject need updating to match the change from 
ssize_t to ptrdiff_t.

> A future patch will move these fields to a different struct. Moving
> them to the end of xdfile_t now, means the field order of xdfile_t will
> be disturbed less.

I'm not sure why that matters but I also don't object

Thanks

Phillip

> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
> ---
>   xdiff/xtypes.h | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
> index f145abba3e..7c8c057bca 100644
> --- a/xdiff/xtypes.h
> +++ b/xdiff/xtypes.h
> @@ -47,10 +47,10 @@ typedef struct s_xrecord {
>   typedef struct s_xdfile {
>   	xrecord_t *recs;
>   	long nrec;
> -	long dstart, dend;
>   	bool *changed;
>   	long *rindex;
>   	long nreff;
> +	ptrdiff_t dstart, dend;
>   } xdfile_t;
>   
>   typedef struct s_xdfenv {


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v2 02/10] xdiff: use ssize_t for dstart/dend, make them last in xdfile_t
  2025-11-06  9:55     ` Phillip Wood
@ 2025-11-06 22:56       ` Ezekiel Newren
  0 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren @ 2025-11-06 22:56 UTC (permalink / raw)
  To: phillip.wood
  Cc: Ezekiel Newren via GitGitGadget, git, Kristoffer Haugsbakk,
	Patrick Steinhardt, Chris Torek

On Thu, Nov 6, 2025 at 2:55 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>
> Hi Ezekiel
>
> On 29/10/2025 22:19, Ezekiel Newren via GitGitGadget wrote:
> > From: Ezekiel Newren <ezekielnewren@gmail.com>
> >
> > ssize_t is appropriate for dstart and dend because they both describe
> > positive or negative offsets relative to a pointer.
>
> This paragraph and the subject need updating to match the change from
> ssize_t to ptrdiff_t.

You're right. I thought I updated that. I'll make that change for the
next version.

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [PATCH v2 03/10] xdiff: make xrecord_t.ptr a uint8_t instead of char
  2025-10-29 22:19 ` [PATCH v2 00/10] " Ezekiel Newren via GitGitGadget
  2025-10-29 22:19   ` [PATCH v2 01/10] doc: define unambiguous type mappings across C and Rust Ezekiel Newren via GitGitGadget
  2025-10-29 22:19   ` [PATCH v2 02/10] xdiff: use ssize_t for dstart/dend, make them last in xdfile_t Ezekiel Newren via GitGitGadget
@ 2025-10-29 22:19   ` Ezekiel Newren via GitGitGadget
  2025-11-06 10:49     ` Phillip Wood
  2025-11-06 10:55     ` Phillip Wood
  2025-10-29 22:19   ` [PATCH v2 04/10] xdiff: use size_t for xrecord_t.size Ezekiel Newren via GitGitGadget
                     ` (8 subsequent siblings)
  11 siblings, 2 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-10-29 22:19 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

Rust uses u8 to refer to bytes in memory. Since xrecord_t.ptr is also
referring to bytes in memory, rather than Unicode code points, use
uint8_t instead of char.

Every usage of this field was inspected and cast to char*, or similar,
to avoid signedness warnings/errors from the compiler. Casting was used
so that the whole of xdiff doesn't need to be refactored in order to
change the type of this field.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xdiffi.c    |  8 ++++----
 xdiff/xemit.c     |  6 +++---
 xdiff/xmerge.c    | 14 +++++++-------
 xdiff/xpatience.c |  2 +-
 xdiff/xprepare.c  |  8 ++++----
 xdiff/xtypes.h    |  2 +-
 xdiff/xutils.c    |  4 ++--
 7 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 6f3998ee54..411a8aa69f 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -407,7 +407,7 @@ static int get_indent(xrecord_t *rec)
 	int ret = 0;
 
 	for (i = 0; i < rec->size; i++) {
-		char c = rec->ptr[i];
+		uint8_t c = rec->ptr[i];
 
 		if (!XDL_ISSPACE(c))
 			return ret;
@@ -993,11 +993,11 @@ static void xdl_mark_ignorable_lines(xdchange_t *xscr, xdfenv_t *xe, long flags)
 
 		rec = &xe->xdf1.recs[xch->i1];
 		for (i = 0; i < xch->chg1 && ignore; i++)
-			ignore = xdl_blankline(rec[i].ptr, rec[i].size, flags);
+			ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
 
 		rec = &xe->xdf2.recs[xch->i2];
 		for (i = 0; i < xch->chg2 && ignore; i++)
-			ignore = xdl_blankline(rec[i].ptr, rec[i].size, flags);
+			ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
 
 		xch->ignore = ignore;
 	}
@@ -1008,7 +1008,7 @@ static int record_matches_regex(xrecord_t *rec, xpparam_t const *xpp) {
 	size_t i;
 
 	for (i = 0; i < xpp->ignore_regex_nr; i++)
-		if (!regexec_buf(xpp->ignore_regex[i], rec->ptr, rec->size, 1,
+		if (!regexec_buf(xpp->ignore_regex[i], (const char *)rec->ptr, rec->size, 1,
 				 &regmatch, 0))
 			return 1;
 
diff --git a/xdiff/xemit.c b/xdiff/xemit.c
index b2f1f30cd3..ead930088a 100644
--- a/xdiff/xemit.c
+++ b/xdiff/xemit.c
@@ -27,7 +27,7 @@ static int xdl_emit_record(xdfile_t *xdf, long ri, char const *pre, xdemitcb_t *
 {
 	xrecord_t *rec = &xdf->recs[ri];
 
-	if (xdl_emit_diffrec(rec->ptr, rec->size, pre, strlen(pre), ecb) < 0)
+	if (xdl_emit_diffrec((char const *)rec->ptr, rec->size, pre, strlen(pre), ecb) < 0)
 		return -1;
 
 	return 0;
@@ -113,8 +113,8 @@ static long match_func_rec(xdfile_t *xdf, xdemitconf_t const *xecfg, long ri,
 	xrecord_t *rec = &xdf->recs[ri];
 
 	if (!xecfg->find_func)
-		return def_ff(rec->ptr, rec->size, buf, sz);
-	return xecfg->find_func(rec->ptr, rec->size, buf, sz, xecfg->find_func_priv);
+		return def_ff((const char *)rec->ptr, rec->size, buf, sz);
+	return xecfg->find_func((const char *)rec->ptr, rec->size, buf, sz, xecfg->find_func_priv);
 }
 
 static int is_func_rec(xdfile_t *xdf, xdemitconf_t const *xecfg, long ri)
diff --git a/xdiff/xmerge.c b/xdiff/xmerge.c
index fd600cbb5d..75cb3e76a2 100644
--- a/xdiff/xmerge.c
+++ b/xdiff/xmerge.c
@@ -101,8 +101,8 @@ static int xdl_merge_cmp_lines(xdfenv_t *xe1, int i1, xdfenv_t *xe2, int i2,
 	xrecord_t *rec2 = xe2->xdf2.recs + i2;
 
 	for (i = 0; i < line_count; i++) {
-		int result = xdl_recmatch(rec1[i].ptr, rec1[i].size,
-			rec2[i].ptr, rec2[i].size, flags);
+		int result = xdl_recmatch((const char *)rec1[i].ptr, rec1[i].size,
+			(const char *)rec2[i].ptr, rec2[i].size, flags);
 		if (!result)
 			return -1;
 	}
@@ -324,8 +324,8 @@ static int xdl_fill_merge_buffer(xdfenv_t *xe1, const char *name1,
 
 static int recmatch(xrecord_t *rec1, xrecord_t *rec2, unsigned long flags)
 {
-	return xdl_recmatch(rec1->ptr, rec1->size,
-			    rec2->ptr, rec2->size, flags);
+	return xdl_recmatch((const char *)rec1->ptr, rec1->size,
+			    (const char *)rec2->ptr, rec2->size, flags);
 }
 
 /*
@@ -382,10 +382,10 @@ static int xdl_refine_conflicts(xdfenv_t *xe1, xdfenv_t *xe2, xdmerge_t *m,
 		 * we have a very simple mmfile structure.
 		 */
 		t1.ptr = (char *)xe1->xdf2.recs[m->i1].ptr;
-		t1.size = xe1->xdf2.recs[m->i1 + m->chg1 - 1].ptr
+		t1.size = (char *)xe1->xdf2.recs[m->i1 + m->chg1 - 1].ptr
 			+ xe1->xdf2.recs[m->i1 + m->chg1 - 1].size - t1.ptr;
 		t2.ptr = (char *)xe2->xdf2.recs[m->i2].ptr;
-		t2.size = xe2->xdf2.recs[m->i2 + m->chg2 - 1].ptr
+		t2.size = (char *)xe2->xdf2.recs[m->i2 + m->chg2 - 1].ptr
 			+ xe2->xdf2.recs[m->i2 + m->chg2 - 1].size - t2.ptr;
 		if (xdl_do_diff(&t1, &t2, xpp, &xe) < 0)
 			return -1;
@@ -440,7 +440,7 @@ static int line_contains_alnum(const char *ptr, long size)
 static int lines_contain_alnum(xdfenv_t *xe, int i, int chg)
 {
 	for (; chg; chg--, i++)
-		if (line_contains_alnum(xe->xdf2.recs[i].ptr,
+		if (line_contains_alnum((const char *)xe->xdf2.recs[i].ptr,
 				xe->xdf2.recs[i].size))
 			return 1;
 	return 0;
diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
index 669b653580..bb61354f22 100644
--- a/xdiff/xpatience.c
+++ b/xdiff/xpatience.c
@@ -121,7 +121,7 @@ static void insert_record(xpparam_t const *xpp, int line, struct hashmap *map,
 		return;
 	map->entries[index].line1 = line;
 	map->entries[index].hash = record->ha;
-	map->entries[index].anchor = is_anchor(xpp, map->env->xdf1.recs[line - 1].ptr);
+	map->entries[index].anchor = is_anchor(xpp, (const char *)map->env->xdf1.recs[line - 1].ptr);
 	if (!map->first)
 		map->first = map->entries + index;
 	if (map->last) {
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 192334f1b7..4cb18b2b88 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -99,8 +99,8 @@ static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
 	hi = (long) XDL_HASHLONG(rec->ha, cf->hbits);
 	for (rcrec = cf->rchash[hi]; rcrec; rcrec = rcrec->next)
 		if (rcrec->rec.ha == rec->ha &&
-				xdl_recmatch(rcrec->rec.ptr, rcrec->rec.size,
-					rec->ptr, rec->size, cf->flags))
+				xdl_recmatch((const char *)rcrec->rec.ptr, rcrec->rec.size,
+					(const char *)rec->ptr, rec->size, cf->flags))
 			break;
 
 	if (!rcrec) {
@@ -156,8 +156,8 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 			if (XDL_ALLOC_GROW(xdf->recs, xdf->nrec + 1, narec))
 				goto abort;
 			crec = &xdf->recs[xdf->nrec++];
-			crec->ptr = prev;
-			crec->size = (long) (cur - prev);
+			crec->ptr = (uint8_t const *)prev;
+			crec->size =(long) ( cur - prev);
 			crec->ha = hav;
 			if (xdl_classify_record(pass, cf, crec) < 0)
 				goto abort;
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index 7c8c057bca..b1c520a378 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -39,7 +39,7 @@ typedef struct s_chastore {
 } chastore_t;
 
 typedef struct s_xrecord {
-	char const *ptr;
+	uint8_t const *ptr;
 	long size;
 	unsigned long ha;
 } xrecord_t;
diff --git a/xdiff/xutils.c b/xdiff/xutils.c
index 447e66c719..7be063bfb6 100644
--- a/xdiff/xutils.c
+++ b/xdiff/xutils.c
@@ -465,10 +465,10 @@ int xdl_fall_back_diff(xdfenv_t *diff_env, xpparam_t const *xpp,
 	xdfenv_t env;
 
 	subfile1.ptr = (char *)diff_env->xdf1.recs[line1 - 1].ptr;
-	subfile1.size = diff_env->xdf1.recs[line1 + count1 - 2].ptr +
+	subfile1.size = (char *)diff_env->xdf1.recs[line1 + count1 - 2].ptr +
 		diff_env->xdf1.recs[line1 + count1 - 2].size - subfile1.ptr;
 	subfile2.ptr = (char *)diff_env->xdf2.recs[line2 - 1].ptr;
-	subfile2.size = diff_env->xdf2.recs[line2 + count2 - 2].ptr +
+	subfile2.size = (char *)diff_env->xdf2.recs[line2 + count2 - 2].ptr +
 		diff_env->xdf2.recs[line2 + count2 - 2].size - subfile2.ptr;
 	if (xdl_do_diff(&subfile1, &subfile2, xpp, &env) < 0)
 		return -1;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* Re: [PATCH v2 03/10] xdiff: make xrecord_t.ptr a uint8_t instead of char
  2025-10-29 22:19   ` [PATCH v2 03/10] xdiff: make xrecord_t.ptr a uint8_t instead of char Ezekiel Newren via GitGitGadget
@ 2025-11-06 10:49     ` Phillip Wood
  2025-11-06 23:13       ` Ezekiel Newren
  2025-11-06 10:55     ` Phillip Wood
  1 sibling, 1 reply; 118+ messages in thread
From: Phillip Wood @ 2025-11-06 10:49 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget, git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Chris Torek,
	Ezekiel Newren

Hi Ezekiel

On 29/10/2025 22:19, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
> 
> Rust uses u8 to refer to bytes in memory. Since xrecord_t.ptr is also
> referring to bytes in memory, rather than Unicode code points, use
> uint8_t instead of char.

The reference to Unicode code points here still makes no sense to me. I 
thought the reason for the conversion was to match Rust's u8.

> Every usage of this field was inspected and cast to char*, or similar,
> to avoid signedness warnings/errors from the compiler. Casting was used
> so that the whole of xdiff doesn't need to be refactored in order to
> change the type of this field.

Thanks for adding this. Having played a little with changing some 
function parameters to avoid adding these casts I agree this patch is a 
good place to stop as the number of changes required quickly spiraled 
out of control.

Thanks

Phillip



^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v2 03/10] xdiff: make xrecord_t.ptr a uint8_t instead of char
  2025-11-06 10:49     ` Phillip Wood
@ 2025-11-06 23:13       ` Ezekiel Newren
  0 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren @ 2025-11-06 23:13 UTC (permalink / raw)
  To: phillip.wood
  Cc: Ezekiel Newren via GitGitGadget, git, Kristoffer Haugsbakk,
	Patrick Steinhardt, Chris Torek

On Thu, Nov 6, 2025 at 3:49 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>
> Hi Ezekiel
>
> On 29/10/2025 22:19, Ezekiel Newren via GitGitGadget wrote:
> > From: Ezekiel Newren <ezekielnewren@gmail.com>
> >
> > Rust uses u8 to refer to bytes in memory. Since xrecord_t.ptr is also
> > referring to bytes in memory, rather than Unicode code points, use
> > uint8_t instead of char.
>
> The reference to unicode code points here still makes no sense to me. I
> thought the reason for the conversion was to match rust's u8.

It is to match Rust's u8 type, but I was also trying to convey that
ptr is referring to bytes and not characters _because_ xdiff performs
textual differences. It's not spelled out anywhere in Xdiff that it
does or doesn't take Unicode into consideration. Would comparing
Unicode code points change how Xdiff behaves? Should it behave
differently? I don't know. My understanding is that whether the bytes
are UTF-8, UTF-16LE, UTF-16BE, or some other encoding of Unicode,
Xdiff doesn't care and treats the lines in a file as raw byte strings.

There's also the question of "Should the Rust side of Xdiff treat
lines in a file as &[u8] or &str?" The reason why this matters is
because in order to get a &str from &[u8] in Rust you need to call a
function like:

```
let raw_bytes = b"abc\n";
let result = std::str::from_utf8(raw_bytes);
if let Ok(line) = result {
    // do something
}
```

What happens if it's not UTF-8 encoded? What if it's malformed UTF-8? To
avoid these problems I only use &[u8] in xdiff and diff raw byte
strings without considering Unicode at all, which is how Xdiff already
behaves.
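
To make the trade-off concrete, here is a minimal sketch (not code from
this series) of why comparing lines as &[u8] sidesteps the encoding
question. The helper name lines_equal and the sample bytes are
assumptions made up for illustration:

```
/// Hypothetical helper: compare two lines byte for byte, with no
/// assumption about their encoding.
fn lines_equal(a: &[u8], b: &[u8]) -> bool {
    a == b
}

fn main() {
    let valid: &[u8] = b"abc\n";
    // 0xff can never appear in well-formed UTF-8.
    let malformed: &[u8] = &[0x66, 0x6f, 0xff, 0x0a];

    // Converting to &str succeeds only for well-formed UTF-8 ...
    assert!(std::str::from_utf8(valid).is_ok());
    assert!(std::str::from_utf8(malformed).is_err());

    // ... while a raw byte comparison works for any input.
    assert!(lines_equal(malformed, malformed));
    assert!(!lines_equal(valid, malformed));
}
```

With &str the malformed line would need an error path before any
comparison could happen; with &[u8] there is nothing to validate.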

Does that explain my comment about Unicode or does it still seem out
of place to you? I can remove the mention of Unicode from the commit
message if this still doesn't make any sense to you.

> > Every usage of this field was inspected and cast to char*, or similar,
> > to avoid signedness warnings/errors from the compiler. Casting was used
> > so that the whole of xdiff doesn't need to be refactored in order to
> > change the type of this field.
>
> Thanks for adding this. Having played a little with changing some
> function parameters to avoid adding these casts I agree this patch is a
> good place to stop as the number of changes required quickly spiraled
> out of control.

I'm not excited about the casts either, but these 2 structs are
fundamental to how Xdiff passes data around, and so they need to be
FFI friendly. I don't plan on converting other structs or function
signatures in Xdiff unless I really have to.
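
As a rough illustration of what "FFI friendly" buys on the Rust side,
the following sketch mirrors the xrecord_t layout listed under "After
part 2" in the cover letter. The Rust type name XRecord, the method
as_bytes, and the mapping of size_t to usize are assumptions for this
example, not part of the series:

```
use std::slice;

/// Hypothetical Rust mirror of the C xrecord_t from the cover letter.
/// Assumes size_t and usize have the same width on the target.
#[repr(C)]
pub struct XRecord {
    pub ptr: *const u8,
    pub size: usize,
    pub line_hash: u64,
    pub minimal_perfect_hash: usize,
}

impl XRecord {
    /// View the record as a byte slice.
    ///
    /// Safety: `ptr` must point to `size` readable bytes that stay
    /// valid for as long as the returned slice is used.
    pub unsafe fn as_bytes(&self) -> &[u8] {
        unsafe { slice::from_raw_parts(self.ptr, self.size) }
    }
}

fn main() {
    let line = b"hello\n";
    let rec = XRecord {
        ptr: line.as_ptr(),
        size: line.len(),
        line_hash: 0,
        minimal_perfect_hash: 0,
    };
    // No cast is needed: the field types already match &[u8].
    let bytes = unsafe { rec.as_bytes() };
    assert_eq!(bytes, &b"hello\n"[..]);
}
```

Because every field has a fixed, unambiguous type, the struct can be
declared once with #[repr(C)] and read from either language without
per-platform adjustments.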

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v2 03/10] xdiff: make xrecord_t.ptr a uint8_t instead of char
  2025-10-29 22:19   ` [PATCH v2 03/10] xdiff: make xrecord_t.ptr a uint8_t instead of char Ezekiel Newren via GitGitGadget
  2025-11-06 10:49     ` Phillip Wood
@ 2025-11-06 10:55     ` Phillip Wood
  2025-11-06 23:14       ` Ezekiel Newren
  1 sibling, 1 reply; 118+ messages in thread
From: Phillip Wood @ 2025-11-06 10:55 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget, git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Chris Torek,
	Ezekiel Newren

On 29/10/2025 22:19, Ezekiel Newren via GitGitGadget wrote:
> @@ -156,8 +156,8 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
>   			if (XDL_ALLOC_GROW(xdf->recs, xdf->nrec + 1, narec))
>   				goto abort;
>   			crec = &xdf->recs[xdf->nrec++];
> -			crec->ptr = prev;
> -			crec->size = (long) (cur - prev);
> +			crec->ptr = (uint8_t const *)prev;
> +			crec->size =(long) ( cur - prev);

The changes to crec->size here look unintentional

Thanks

Phillip


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v2 03/10] xdiff: make xrecord_t.ptr a uint8_t instead of char
  2025-11-06 10:55     ` Phillip Wood
@ 2025-11-06 23:14       ` Ezekiel Newren
  0 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren @ 2025-11-06 23:14 UTC (permalink / raw)
  To: phillip.wood
  Cc: Ezekiel Newren via GitGitGadget, git, Kristoffer Haugsbakk,
	Patrick Steinhardt, Chris Torek

On Thu, Nov 6, 2025 at 3:55 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>
> On 29/10/2025 22:19, Ezekiel Newren via GitGitGadget wrote:
> > @@ -156,8 +156,8 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
> >                       if (XDL_ALLOC_GROW(xdf->recs, xdf->nrec + 1, narec))
> >                               goto abort;
> >                       crec = &xdf->recs[xdf->nrec++];
> > -                     crec->ptr = prev;
> > -                     crec->size = (long) (cur - prev);
> > +                     crec->ptr = (uint8_t const *)prev;
> > +                     crec->size =(long) ( cur - prev);
>
> The changes to crec->size here look unintentional

I agree. I'll change that.

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [PATCH v2 04/10] xdiff: use size_t for xrecord_t.size
  2025-10-29 22:19 ` [PATCH v2 00/10] " Ezekiel Newren via GitGitGadget
                     ` (2 preceding siblings ...)
  2025-10-29 22:19   ` [PATCH v2 03/10] xdiff: make xrecord_t.ptr a uint8_t instead of char Ezekiel Newren via GitGitGadget
@ 2025-10-29 22:19   ` Ezekiel Newren via GitGitGadget
  2025-10-29 22:19   ` [PATCH v2 05/10] xdiff: use unambiguous types in xdl_hash_record() Ezekiel Newren via GitGitGadget
                     ` (7 subsequent siblings)
  11 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-10-29 22:19 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

size_t is the appropriate type because size is describing the number of
elements, bytes in this case, in memory.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xdiffi.c   |  7 +++----
 xdiff/xemit.c    |  8 ++++----
 xdiff/xmerge.c   | 16 ++++++++--------
 xdiff/xprepare.c |  6 +++---
 xdiff/xtypes.h   |  2 +-
 5 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 411a8aa69f..edd05466df 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -403,10 +403,9 @@ static int recs_match(xrecord_t *rec1, xrecord_t *rec2)
  */
 static int get_indent(xrecord_t *rec)
 {
-	long i;
 	int ret = 0;
 
-	for (i = 0; i < rec->size; i++) {
+	for (size_t i = 0; i < rec->size; i++) {
 		uint8_t c = rec->ptr[i];
 
 		if (!XDL_ISSPACE(c))
@@ -993,11 +992,11 @@ static void xdl_mark_ignorable_lines(xdchange_t *xscr, xdfenv_t *xe, long flags)
 
 		rec = &xe->xdf1.recs[xch->i1];
 		for (i = 0; i < xch->chg1 && ignore; i++)
-			ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
+			ignore = xdl_blankline((const char *)rec[i].ptr, (long)rec[i].size, flags);
 
 		rec = &xe->xdf2.recs[xch->i2];
 		for (i = 0; i < xch->chg2 && ignore; i++)
-			ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
+			ignore = xdl_blankline((const char *)rec[i].ptr, (long)rec[i].size, flags);
 
 		xch->ignore = ignore;
 	}
diff --git a/xdiff/xemit.c b/xdiff/xemit.c
index ead930088a..2f8007753c 100644
--- a/xdiff/xemit.c
+++ b/xdiff/xemit.c
@@ -27,7 +27,7 @@ static int xdl_emit_record(xdfile_t *xdf, long ri, char const *pre, xdemitcb_t *
 {
 	xrecord_t *rec = &xdf->recs[ri];
 
-	if (xdl_emit_diffrec((char const *)rec->ptr, rec->size, pre, strlen(pre), ecb) < 0)
+	if (xdl_emit_diffrec((char const *)rec->ptr, (long)rec->size, pre, strlen(pre), ecb) < 0)
 		return -1;
 
 	return 0;
@@ -113,8 +113,8 @@ static long match_func_rec(xdfile_t *xdf, xdemitconf_t const *xecfg, long ri,
 	xrecord_t *rec = &xdf->recs[ri];
 
 	if (!xecfg->find_func)
-		return def_ff((const char *)rec->ptr, rec->size, buf, sz);
-	return xecfg->find_func((const char *)rec->ptr, rec->size, buf, sz, xecfg->find_func_priv);
+		return def_ff((const char *)rec->ptr, (long)rec->size, buf, sz);
+	return xecfg->find_func((const char *)rec->ptr, (long)rec->size, buf, sz, xecfg->find_func_priv);
 }
 
 static int is_func_rec(xdfile_t *xdf, xdemitconf_t const *xecfg, long ri)
@@ -151,7 +151,7 @@ static long get_func_line(xdfenv_t *xe, xdemitconf_t const *xecfg,
 static int is_empty_rec(xdfile_t *xdf, long ri)
 {
 	xrecord_t *rec = &xdf->recs[ri];
-	long i = 0;
+	size_t i = 0;
 
 	for (; i < rec->size && XDL_ISSPACE(rec->ptr[i]); i++);
 
diff --git a/xdiff/xmerge.c b/xdiff/xmerge.c
index 75cb3e76a2..0dd4558a32 100644
--- a/xdiff/xmerge.c
+++ b/xdiff/xmerge.c
@@ -101,8 +101,8 @@ static int xdl_merge_cmp_lines(xdfenv_t *xe1, int i1, xdfenv_t *xe2, int i2,
 	xrecord_t *rec2 = xe2->xdf2.recs + i2;
 
 	for (i = 0; i < line_count; i++) {
-		int result = xdl_recmatch((const char *)rec1[i].ptr, rec1[i].size,
-			(const char *)rec2[i].ptr, rec2[i].size, flags);
+		int result = xdl_recmatch((const char *)rec1[i].ptr, (long)rec1[i].size,
+			(const char *)rec2[i].ptr, (long)rec2[i].size, flags);
 		if (!result)
 			return -1;
 	}
@@ -119,11 +119,11 @@ static int xdl_recs_copy_0(int use_orig, xdfenv_t *xe, int i, int count, int nee
 	if (count < 1)
 		return 0;
 
-	for (i = 0; i < count; size += recs[i++].size)
+	for (i = 0; i < count; size += (int)recs[i++].size)
 		if (dest)
 			memcpy(dest + size, recs[i].ptr, recs[i].size);
 	if (add_nl) {
-		i = recs[count - 1].size;
+		i = (int)recs[count - 1].size;
 		if (i == 0 || recs[count - 1].ptr[i - 1] != '\n') {
 			if (needs_cr) {
 				if (dest)
@@ -156,7 +156,7 @@ static int xdl_orig_copy(xdfenv_t *xe, int i, int count, int needs_cr, int add_n
  */
 static int is_eol_crlf(xdfile_t *file, int i)
 {
-	long size;
+	size_t size;
 
 	if (i < file->nrec - 1)
 		/* All lines before the last *must* end in LF */
@@ -324,8 +324,8 @@ static int xdl_fill_merge_buffer(xdfenv_t *xe1, const char *name1,
 
 static int recmatch(xrecord_t *rec1, xrecord_t *rec2, unsigned long flags)
 {
-	return xdl_recmatch((const char *)rec1->ptr, rec1->size,
-			    (const char *)rec2->ptr, rec2->size, flags);
+	return xdl_recmatch((const char *)rec1->ptr, (long)rec1->size,
+			    (const char *)rec2->ptr, (long)rec2->size, flags);
 }
 
 /*
@@ -441,7 +441,7 @@ static int lines_contain_alnum(xdfenv_t *xe, int i, int chg)
 {
 	for (; chg; chg--, i++)
 		if (line_contains_alnum((const char *)xe->xdf2.recs[i].ptr,
-				xe->xdf2.recs[i].size))
+				(long)xe->xdf2.recs[i].size))
 			return 1;
 	return 0;
 }
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 4cb18b2b88..b3219aed3e 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -99,8 +99,8 @@ static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
 	hi = (long) XDL_HASHLONG(rec->ha, cf->hbits);
 	for (rcrec = cf->rchash[hi]; rcrec; rcrec = rcrec->next)
 		if (rcrec->rec.ha == rec->ha &&
-				xdl_recmatch((const char *)rcrec->rec.ptr, rcrec->rec.size,
-					(const char *)rec->ptr, rec->size, cf->flags))
+				xdl_recmatch((const char *)rcrec->rec.ptr, (long)rcrec->rec.size,
+					(const char *)rec->ptr, (long)rec->size, cf->flags))
 			break;
 
 	if (!rcrec) {
@@ -157,7 +157,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 				goto abort;
 			crec = &xdf->recs[xdf->nrec++];
 			crec->ptr = (uint8_t const *)prev;
-			crec->size =(long) ( cur - prev);
+			crec->size = cur - prev;
 			crec->ha = hav;
 			if (xdl_classify_record(pass, cf, crec) < 0)
 				goto abort;
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index b1c520a378..88b1fe4649 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -40,7 +40,7 @@ typedef struct s_chastore {
 
 typedef struct s_xrecord {
 	uint8_t const *ptr;
-	long size;
+	size_t size;
 	unsigned long ha;
 } xrecord_t;
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* [PATCH v2 05/10] xdiff: use unambiguous types in xdl_hash_record()
  2025-10-29 22:19 ` [PATCH v2 00/10] " Ezekiel Newren via GitGitGadget
                     ` (3 preceding siblings ...)
  2025-10-29 22:19   ` [PATCH v2 04/10] xdiff: use size_t for xrecord_t.size Ezekiel Newren via GitGitGadget
@ 2025-10-29 22:19   ` Ezekiel Newren via GitGitGadget
  2025-10-29 22:19   ` [PATCH v2 06/10] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash Ezekiel Newren via GitGitGadget
                     ` (6 subsequent siblings)
  11 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-10-29 22:19 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

Convert the function signature and body to use unambiguous types. char
is changed to uint8_t because this function processes bytes in memory.
unsigned long is changed to uint64_t so that the hash output is
consistent across platforms. `flags` is changed from long to uint64_t
to ensure the high order bits are not dropped on platforms that treat
long as 32 bits.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff-interface.c |  2 +-
 xdiff/xprepare.c  |  6 +++---
 xdiff/xutils.c    | 28 ++++++++++++++--------------
 xdiff/xutils.h    |  6 +++---
 4 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/xdiff-interface.c b/xdiff-interface.c
index 4971f722b3..1a35556380 100644
--- a/xdiff-interface.c
+++ b/xdiff-interface.c
@@ -300,7 +300,7 @@ void xdiff_clear_find_func(xdemitconf_t *xecfg)
 
 unsigned long xdiff_hash_string(const char *s, size_t len, long flags)
 {
-	return xdl_hash_record(&s, s + len, flags);
+	return xdl_hash_record((uint8_t const**)&s, (uint8_t const*)s + len, flags);
 }
 
 int xdiff_compare_lines(const char *l1, long s1,
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index b3219aed3e..85e56021da 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -137,8 +137,8 @@ static void xdl_free_ctx(xdfile_t *xdf)
 static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_t const *xpp,
 			   xdlclassifier_t *cf, xdfile_t *xdf) {
 	long bsize;
-	unsigned long hav;
-	char const *blk, *cur, *top, *prev;
+	uint64_t hav;
+	uint8_t const *blk, *cur, *top, *prev;
 	xrecord_t *crec;
 
 	xdf->rindex = NULL;
@@ -156,7 +156,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 			if (XDL_ALLOC_GROW(xdf->recs, xdf->nrec + 1, narec))
 				goto abort;
 			crec = &xdf->recs[xdf->nrec++];
-			crec->ptr = (uint8_t const *)prev;
+			crec->ptr = prev;
 			crec->size = cur - prev;
 			crec->ha = hav;
 			if (xdl_classify_record(pass, cf, crec) < 0)
diff --git a/xdiff/xutils.c b/xdiff/xutils.c
index 7be063bfb6..77ee1ad9c8 100644
--- a/xdiff/xutils.c
+++ b/xdiff/xutils.c
@@ -249,11 +249,11 @@ int xdl_recmatch(const char *l1, long s1, const char *l2, long s2, long flags)
 	return 1;
 }
 
-unsigned long xdl_hash_record_with_whitespace(char const **data,
-		char const *top, long flags) {
-	unsigned long ha = 5381;
-	char const *ptr = *data;
-	int cr_at_eol_only = (flags & XDF_WHITESPACE_FLAGS) == XDF_IGNORE_CR_AT_EOL;
+uint64_t xdl_hash_record_with_whitespace(uint8_t const **data,
+		uint8_t const *top, uint64_t flags) {
+	uint64_t ha = 5381;
+	uint8_t const *ptr = *data;
+	bool cr_at_eol_only = (flags & XDF_WHITESPACE_FLAGS) == XDF_IGNORE_CR_AT_EOL;
 
 	for (; ptr < top && *ptr != '\n'; ptr++) {
 		if (cr_at_eol_only) {
@@ -263,8 +263,8 @@ unsigned long xdl_hash_record_with_whitespace(char const **data,
 				continue;
 		}
 		else if (XDL_ISSPACE(*ptr)) {
-			const char *ptr2 = ptr;
-			int at_eol;
+			const uint8_t *ptr2 = ptr;
+			bool at_eol;
 			while (ptr + 1 < top && XDL_ISSPACE(ptr[1])
 					&& ptr[1] != '\n')
 				ptr++;
@@ -274,20 +274,20 @@ unsigned long xdl_hash_record_with_whitespace(char const **data,
 			else if (flags & XDF_IGNORE_WHITESPACE_CHANGE
 				 && !at_eol) {
 				ha += (ha << 5);
-				ha ^= (unsigned long) ' ';
+				ha ^= (uint64_t) ' ';
 			}
 			else if (flags & XDF_IGNORE_WHITESPACE_AT_EOL
 				 && !at_eol) {
 				while (ptr2 != ptr + 1) {
 					ha += (ha << 5);
-					ha ^= (unsigned long) *ptr2;
+					ha ^= (uint64_t) *ptr2;
 					ptr2++;
 				}
 			}
 			continue;
 		}
 		ha += (ha << 5);
-		ha ^= (unsigned long) *ptr;
+		ha ^= (uint64_t) *ptr;
 	}
 	*data = ptr < top ? ptr + 1: ptr;
 
@@ -304,9 +304,9 @@ unsigned long xdl_hash_record_with_whitespace(char const **data,
 #define REASSOC_FENCE(x, y)
 #endif
 
-unsigned long xdl_hash_record_verbatim(char const **data, char const *top) {
-	unsigned long ha = 5381, c0, c1;
-	char const *ptr = *data;
+uint64_t xdl_hash_record_verbatim(uint8_t const **data, uint8_t const *top) {
+	uint64_t ha = 5381, c0, c1;
+	uint8_t const *ptr = *data;
 #if 0
 	/*
 	 * The baseline form of the optimized loop below. This is the djb2
@@ -314,7 +314,7 @@ unsigned long xdl_hash_record_verbatim(char const **data, char const *top) {
 	 */
 	for (; ptr < top && *ptr != '\n'; ptr++) {
 		ha += (ha << 5);
-		ha += (unsigned long) *ptr;
+		ha += (uint64_t) *ptr;
 	}
 	*data = ptr < top ? ptr + 1: ptr;
 #else
diff --git a/xdiff/xutils.h b/xdiff/xutils.h
index 13f6831047..615b4a9d35 100644
--- a/xdiff/xutils.h
+++ b/xdiff/xutils.h
@@ -34,9 +34,9 @@ void *xdl_cha_alloc(chastore_t *cha);
 long xdl_guess_lines(mmfile_t *mf, long sample);
 int xdl_blankline(const char *line, long size, long flags);
 int xdl_recmatch(const char *l1, long s1, const char *l2, long s2, long flags);
-unsigned long xdl_hash_record_verbatim(char const **data, char const *top);
-unsigned long xdl_hash_record_with_whitespace(char const **data, char const *top, long flags);
-static inline unsigned long xdl_hash_record(char const **data, char const *top, long flags)
+uint64_t xdl_hash_record_verbatim(uint8_t const **data, uint8_t const *top);
+uint64_t xdl_hash_record_with_whitespace(uint8_t const **data, uint8_t const *top, uint64_t flags);
+static inline uint64_t xdl_hash_record(uint8_t const **data, uint8_t const *top, uint64_t flags)
 {
 	if (flags & XDF_WHITESPACE_FLAGS)
 		return xdl_hash_record_with_whitespace(data, top, flags);
-- 
gitgitgadget



* [PATCH v2 06/10] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash
  2025-10-29 22:19 ` [PATCH v2 00/10] " Ezekiel Newren via GitGitGadget
                     ` (4 preceding siblings ...)
  2025-10-29 22:19   ` [PATCH v2 05/10] xdiff: use unambiguous types in xdl_hash_record() Ezekiel Newren via GitGitGadget
@ 2025-10-29 22:19   ` Ezekiel Newren via GitGitGadget
  2025-11-06 11:00     ` Phillip Wood
  2025-10-29 22:19   ` [PATCH v2 07/10] xdiff: make xdfile_t.nrec a size_t instead of long Ezekiel Newren via GitGitGadget
                     ` (5 subsequent siblings)
  11 siblings, 1 reply; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-10-29 22:19 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

The ha field is serving two different purposes, which makes the code
harder to read. At first glance it looks like many places assume
there could never be hash collisions between lines of the two input
files. In reality, line_hash is used together with xdl_recmatch() to
ensure correct comparisons of lines, even when collisions occur.

To make this clearer, the old ha field has been split:
  * line_hash: The straightforward hash of a line, requiring no
    additional context.
  * minimal_perfect_hash: Not a new concept, but now a separate
    field. It comes from the classifier's general-purpose hash table,
    which assigns each line a unique and minimal hash across the two
    files.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
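To illustrate the split described above, here is a minimal sketch that is
not taken from the patch. All toy_* names and the fixed-size table are
assumptions made up for illustration, and a linear scan stands in for the
classifier's hash table. It shows line_hash used only as a cheap filter
before a full byte comparison, and minimal_perfect_hash as the dense index
of the class a line belongs to.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical toy record; the two hash field names follow the patch. */
struct toy_record {
	const char *ptr;
	size_t size;
	uint64_t line_hash;            /* hash of the bytes, collisions possible */
	size_t minimal_perfect_hash;   /* dense class index, unique per distinct line */
};

#define TOY_MAX_CLASSES 64

struct toy_classifier {
	struct toy_record classes[TOY_MAX_CLASSES]; /* one representative per class */
	size_t count;
};

/* djb2-style hash over the bytes of one line (illustrative only). */
static uint64_t toy_line_hash(const char *s, size_t len)
{
	uint64_t ha = 5381;

	for (size_t i = 0; i < len; i++) {
		ha += (ha << 5);
		ha += (uint64_t)(unsigned char)s[i];
	}
	return ha;
}

/* Returns 0 on success, -1 if the toy table is full. */
static int toy_classify(struct toy_classifier *cf, struct toy_record *rec)
{
	size_t i;

	rec->line_hash = toy_line_hash(rec->ptr, rec->size);
	for (i = 0; i < cf->count; i++) {
		const struct toy_record *c = &cf->classes[i];

		/* The hash only filters; the byte comparison decides equality. */
		if (c->line_hash == rec->line_hash && c->size == rec->size &&
		    !memcmp(c->ptr, rec->ptr, rec->size))
			break;
	}
	if (i == cf->count && cf->count == TOY_MAX_CLASSES)
		return -1;
	rec->minimal_perfect_hash = i;
	if (i == cf->count)
		cf->classes[cf->count++] = *rec; /* first occurrence opens a new class */
	return 0;
}
```

In this sketch two records holding the same bytes end up with the same
minimal_perfect_hash, so later stages can compare lines with a single
integer comparison even when line_hash values collide.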
 xdiff/xdiffi.c     |  6 +++---
 xdiff/xhistogram.c |  4 ++--
 xdiff/xpatience.c  | 10 +++++-----
 xdiff/xprepare.c   | 16 ++++++++--------
 xdiff/xtypes.h     |  3 ++-
 5 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index edd05466df..436c34697d 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -22,9 +22,9 @@
 
 #include "xinclude.h"
 
-static unsigned long get_hash(xdfile_t *xdf, long index)
+static size_t get_hash(xdfile_t *xdf, long index)
 {
-	return xdf->recs[xdf->rindex[index]].ha;
+	return xdf->recs[xdf->rindex[index]].minimal_perfect_hash;
 }
 
 #define XDL_MAX_COST_MIN 256
@@ -385,7 +385,7 @@ static xdchange_t *xdl_add_change(xdchange_t *xscr, long i1, long i2, long chg1,
 
 static int recs_match(xrecord_t *rec1, xrecord_t *rec2)
 {
-	return (rec1->ha == rec2->ha);
+	return rec1->minimal_perfect_hash == rec2->minimal_perfect_hash;
 }
 
 /*
diff --git a/xdiff/xhistogram.c b/xdiff/xhistogram.c
index 6dc450b1fe..5ae1282c27 100644
--- a/xdiff/xhistogram.c
+++ b/xdiff/xhistogram.c
@@ -90,7 +90,7 @@ struct region {
 
 static int cmp_recs(xrecord_t *r1, xrecord_t *r2)
 {
-	return r1->ha == r2->ha;
+	return r1->minimal_perfect_hash == r2->minimal_perfect_hash;
 
 }
 
@@ -98,7 +98,7 @@ static int cmp_recs(xrecord_t *r1, xrecord_t *r2)
 	(cmp_recs(REC(i->env, s1, l1), REC(i->env, s2, l2)))
 
 #define TABLE_HASH(index, side, line) \
-	XDL_HASHLONG((REC(index->env, side, line))->ha, index->table_bits)
+	XDL_HASHLONG((REC(index->env, side, line))->minimal_perfect_hash, index->table_bits)
 
 static int scanA(struct histindex *index, int line1, int count1)
 {
diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
index bb61354f22..cc53266f3b 100644
--- a/xdiff/xpatience.c
+++ b/xdiff/xpatience.c
@@ -48,7 +48,7 @@
 struct hashmap {
 	int nr, alloc;
 	struct entry {
-		unsigned long hash;
+		size_t minimal_perfect_hash;
 		/*
 		 * 0 = unused entry, 1 = first line, 2 = second, etc.
 		 * line2 is NON_UNIQUE if the line is not unique
@@ -101,10 +101,10 @@ static void insert_record(xpparam_t const *xpp, int line, struct hashmap *map,
 	 * So we multiply ha by 2 in the hope that the hashing was
 	 * "unique enough".
 	 */
-	int index = (int)((record->ha << 1) % map->alloc);
+	int index = (int)((record->minimal_perfect_hash << 1) % map->alloc);
 
 	while (map->entries[index].line1) {
-		if (map->entries[index].hash != record->ha) {
+		if (map->entries[index].minimal_perfect_hash != record->minimal_perfect_hash) {
 			if (++index >= map->alloc)
 				index = 0;
 			continue;
@@ -120,7 +120,7 @@ static void insert_record(xpparam_t const *xpp, int line, struct hashmap *map,
 	if (pass == 2)
 		return;
 	map->entries[index].line1 = line;
-	map->entries[index].hash = record->ha;
+	map->entries[index].minimal_perfect_hash = record->minimal_perfect_hash;
 	map->entries[index].anchor = is_anchor(xpp, (const char *)map->env->xdf1.recs[line - 1].ptr);
 	if (!map->first)
 		map->first = map->entries + index;
@@ -248,7 +248,7 @@ static int match(struct hashmap *map, int line1, int line2)
 {
 	xrecord_t *record1 = &map->env->xdf1.recs[line1 - 1];
 	xrecord_t *record2 = &map->env->xdf2.recs[line2 - 1];
-	return record1->ha == record2->ha;
+	return record1->minimal_perfect_hash == record2->minimal_perfect_hash;
 }
 
 static int patience_diff(xpparam_t const *xpp, xdfenv_t *env,
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 85e56021da..16236bd045 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -96,9 +96,9 @@ static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
 	long hi;
 	xdlclass_t *rcrec;
 
-	hi = (long) XDL_HASHLONG(rec->ha, cf->hbits);
+	hi = (long) XDL_HASHLONG(rec->line_hash, cf->hbits);
 	for (rcrec = cf->rchash[hi]; rcrec; rcrec = rcrec->next)
-		if (rcrec->rec.ha == rec->ha &&
+		if (rcrec->rec.line_hash == rec->line_hash &&
 				xdl_recmatch((const char *)rcrec->rec.ptr, (long)rcrec->rec.size,
 					(const char *)rec->ptr, (long)rec->size, cf->flags))
 			break;
@@ -120,7 +120,7 @@ static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
 
 	(pass == 1) ? rcrec->len1++ : rcrec->len2++;
 
-	rec->ha = (unsigned long) rcrec->idx;
+	rec->minimal_perfect_hash = (size_t)rcrec->idx;
 
 	return 0;
 }
@@ -158,7 +158,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 			crec = &xdf->recs[xdf->nrec++];
 			crec->ptr = prev;
 			crec->size = cur - prev;
-			crec->ha = hav;
+			crec->line_hash = hav;
 			if (xdl_classify_record(pass, cf, crec) < 0)
 				goto abort;
 		}
@@ -290,7 +290,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	if ((mlim = xdl_bogosqrt(xdf1->nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
 	for (i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart]; i <= xdf1->dend; i++, recs++) {
-		rcrec = cf->rcrecs[recs->ha];
+		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
 		nm = rcrec ? rcrec->len2 : 0;
 		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
 	}
@@ -298,7 +298,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	if ((mlim = xdl_bogosqrt(xdf2->nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
 	for (i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart]; i <= xdf2->dend; i++, recs++) {
-		rcrec = cf->rcrecs[recs->ha];
+		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
 		nm = rcrec ? rcrec->len1 : 0;
 		action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
 	}
@@ -350,7 +350,7 @@ static int xdl_trim_ends(xdfile_t *xdf1, xdfile_t *xdf2) {
 	recs2 = xdf2->recs;
 	for (i = 0, lim = XDL_MIN(xdf1->nrec, xdf2->nrec); i < lim;
 	     i++, recs1++, recs2++)
-		if (recs1->ha != recs2->ha)
+		if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
 			break;
 
 	xdf1->dstart = xdf2->dstart = i;
@@ -358,7 +358,7 @@ static int xdl_trim_ends(xdfile_t *xdf1, xdfile_t *xdf2) {
 	recs1 = xdf1->recs + xdf1->nrec - 1;
 	recs2 = xdf2->recs + xdf2->nrec - 1;
 	for (lim -= i, i = 0; i < lim; i++, recs1--, recs2--)
-		if (recs1->ha != recs2->ha)
+		if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
 			break;
 
 	xdf1->dend = xdf1->nrec - i - 1;
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index 88b1fe4649..742b81bf3b 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -41,7 +41,8 @@ typedef struct s_chastore {
 typedef struct s_xrecord {
 	uint8_t const *ptr;
 	size_t size;
-	unsigned long ha;
+	uint64_t line_hash;
+	size_t minimal_perfect_hash;
 } xrecord_t;
 
 typedef struct s_xdfile {
-- 
gitgitgadget



* Re: [PATCH v2 06/10] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash
  2025-10-29 22:19   ` [PATCH v2 06/10] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash Ezekiel Newren via GitGitGadget
@ 2025-11-06 11:00     ` Phillip Wood
  2025-11-06 23:20       ` Ezekiel Newren
  0 siblings, 1 reply; 118+ messages in thread
From: Phillip Wood @ 2025-11-06 11:00 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget, git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Chris Torek,
	Ezekiel Newren

Hi Ezekiel

On 29/10/2025 22:19, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
> 
> The ha field is serving two different purposes, which makes the code
> harder to read. At first glance it looks like many places assume
> there could never be hash collisions between lines of the two input
> files. In reality, line_hash is used together with xdl_recmatch() to
> ensure correct comparisons of lines, even when collisions occur.
> 
> To make this clearer, the old ha field has been split:
>    * line_hash: The straightforward hash of a line, requiring no
>      additional context.
>    * minimal_perfect_hash: Not a new concept, but now a separate
>      field. It comes from the classifier's general-purpose hash table,
>      which assigns each line a unique and minimal hash across the two
>      files.

It would be nice to explain the differing types for the two fields in 
the commit message.
> diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
> index 85e56021da..16236bd045 100644
> --- a/xdiff/xprepare.c
> +++ b/xdiff/xprepare.c
> @@ -96,9 +96,9 @@ static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
>   	long hi;
>   	xdlclass_t *rcrec;
>   
> -	hi = (long) XDL_HASHLONG(rec->ha, cf->hbits);
> +	hi = (long) XDL_HASHLONG(rec->line_hash, cf->hbits);

"hi" is only used as an array index so it might be nicer to change it to 
size_t and avoid this cast instead.

Thanks

Phillip



* Re: [PATCH v2 06/10] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash
  2025-11-06 11:00     ` Phillip Wood
@ 2025-11-06 23:20       ` Ezekiel Newren
  0 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren @ 2025-11-06 23:20 UTC (permalink / raw)
  To: phillip.wood
  Cc: Ezekiel Newren via GitGitGadget, git, Kristoffer Haugsbakk,
	Patrick Steinhardt, Chris Torek

On Thu, Nov 6, 2025 at 4:00 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>
> Hi Ezekiel
>
> On 29/10/2025 22:19, Ezekiel Newren via GitGitGadget wrote:
> > From: Ezekiel Newren <ezekielnewren@gmail.com>
> >
> > The ha field is serving two different purposes, which makes the code
> > harder to read. At first glance it looks like many places assume
> > there could never be hash collisions between lines of the two input
> > files. In reality, line_hash is used together with xdl_recmatch() to
> > ensure correct comparisons of lines, even when collisions occur.
> >
> > To make this clearer, the old ha field has been split:
> >    * line_hash: The straightforward hash of a line, requiring no
> >      additional context.
> >    * minimal_perfect_hash: Not a new concept, but now a separate
> >      field. It comes from the classifier's general-purpose hash table,
> >      which assigns each line a unique and minimal hash across the two
> >      files.
>
> It would be nice to explain the differing types for the two fields in
> the commit message.

I'll add something like:
line_hash is a uint64_t because it is the output of a fixed width hash
function. minimal_perfect_hash is size_t because its purpose is to
index into an array. This also avoids the problem of having to cast to
usize on the Rust side every time minimal_perfect_hash is used to
index a slice.

> > diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
> > index 85e56021da..16236bd045 100644
> > --- a/xdiff/xprepare.c
> > +++ b/xdiff/xprepare.c
> > @@ -96,9 +96,9 @@ static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
> >       long hi;
> >       xdlclass_t *rcrec;
> >
> > -     hi = (long) XDL_HASHLONG(rec->ha, cf->hbits);
> > +     hi = (long) XDL_HASHLONG(rec->line_hash, cf->hbits);
>
> "hi" is only used as an array index so it might be nicer to change it to
> size_t and avoid this cast instead.

I agree. I'll make that change.
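
A minimal sketch of both points, assuming hypothetical toy_* helpers
(toy_bucket is not XDL_HASHLONG, only an illustrative stand-in): the
64-bit hash is reduced to a bucket number that is already a size_t, and
the class index is used to subscript an array without any cast.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical bucket selection: fold the fixed-width hash down to
 * 'bits' bits and return it as size_t, ready to be used as an index.
 * Requires bits < 64.
 */
static size_t toy_bucket(uint64_t line_hash, unsigned int bits)
{
	uint64_t mask = (UINT64_C(1) << bits) - 1;

	return (size_t)((line_hash + (line_hash >> bits)) & mask);
}

/*
 * counts[] has one slot per distinct line; minimal_perfect_hash is a
 * size_t, so it subscripts the array directly.
 */
static size_t toy_occurrences(const size_t *counts, size_t minimal_perfect_hash)
{
	return counts[minimal_perfect_hash];
}
```

The same reasoning carries over to a Rust caller, where a size_t field
maps to usize and can index a slice as-is.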


* [PATCH v2 07/10] xdiff: make xdfile_t.nrec a size_t instead of long
  2025-10-29 22:19 ` [PATCH v2 00/10] " Ezekiel Newren via GitGitGadget
                     ` (5 preceding siblings ...)
  2025-10-29 22:19   ` [PATCH v2 06/10] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash Ezekiel Newren via GitGitGadget
@ 2025-10-29 22:19   ` Ezekiel Newren via GitGitGadget
  2025-10-29 22:19   ` [PATCH v2 08/10] xdiff: make xdfile_t.nreff " Ezekiel Newren via GitGitGadget
                     ` (4 subsequent siblings)
  11 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-10-29 22:19 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

size_t is used because nrec is the number of elements allocated for
recs, and 'changed' is sized from it as well (nrec + 2 elements).

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
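The (long) casts in the diff below appear to keep the existing signed
comparisons and arithmetic signed now that nrec is unsigned. A minimal
sketch of the hazard, using hypothetical helper names that are not part
of xdiff:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * What a mixed comparison effectively does: the signed operand is
 * converted to an unsigned type first, so a negative 'split' turns
 * into a huge value. The cast is written out to make that visible.
 */
static bool toy_past_end_unsigned(long split, size_t nrec)
{
	return (size_t)split >= nrec;
}

/* The form used in the patch: convert nrec and compare as signed. */
static bool toy_past_end_signed(long split, size_t nrec)
{
	return split >= (long)nrec;
}

/*
 * Same idea for arithmetic: with nrec == 0 the signed form yields -1,
 * whereas an unsigned subtraction would wrap around.
 */
static long toy_last_index(size_t nrec, long trimmed)
{
	return (long)nrec - trimmed - 1;
}
```

With split == -1 and nrec == 10, the unsigned form reports "past the
end" while the signed form does not, which is why the cast sits on the
nrec side. The sketch assumes nrec fits in a long.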
 xdiff/xdiffi.c    |  8 ++++----
 xdiff/xemit.c     | 20 ++++++++++----------
 xdiff/xmerge.c    |  8 ++++----
 xdiff/xpatience.c |  2 +-
 xdiff/xprepare.c  | 12 ++++++------
 xdiff/xtypes.h    |  2 +-
 6 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 436c34697d..759193fe5d 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -483,7 +483,7 @@ static void measure_split(const xdfile_t *xdf, long split,
 {
 	long i;
 
-	if (split >= xdf->nrec) {
+	if (split >= (long)xdf->nrec) {
 		m->end_of_file = 1;
 		m->indent = -1;
 	} else {
@@ -506,7 +506,7 @@ static void measure_split(const xdfile_t *xdf, long split,
 
 	m->post_blank = 0;
 	m->post_indent = -1;
-	for (i = split + 1; i < xdf->nrec; i++) {
+	for (i = split + 1; i < (long)xdf->nrec; i++) {
 		m->post_indent = get_indent(&xdf->recs[i]);
 		if (m->post_indent != -1)
 			break;
@@ -717,7 +717,7 @@ static void group_init(xdfile_t *xdf, struct xdlgroup *g)
  */
 static inline int group_next(xdfile_t *xdf, struct xdlgroup *g)
 {
-	if (g->end == xdf->nrec)
+	if (g->end == (long)xdf->nrec)
 		return -1;
 
 	g->start = g->end + 1;
@@ -750,7 +750,7 @@ static inline int group_previous(xdfile_t *xdf, struct xdlgroup *g)
  */
 static int group_slide_down(xdfile_t *xdf, struct xdlgroup *g)
 {
-	if (g->end < xdf->nrec &&
+	if (g->end < (long)xdf->nrec &&
 	    recs_match(&xdf->recs[g->start], &xdf->recs[g->end])) {
 		xdf->changed[g->start++] = false;
 		xdf->changed[g->end++] = true;
diff --git a/xdiff/xemit.c b/xdiff/xemit.c
index 2f8007753c..04f7e9193b 100644
--- a/xdiff/xemit.c
+++ b/xdiff/xemit.c
@@ -137,7 +137,7 @@ static long get_func_line(xdfenv_t *xe, xdemitconf_t const *xecfg,
 	buf = func_line ? func_line->buf : dummy;
 	size = func_line ? sizeof(func_line->buf) : sizeof(dummy);
 
-	for (l = start; l != limit && 0 <= l && l < xe->xdf1.nrec; l += step) {
+	for (l = start; l != limit && 0 <= l && l < (long)xe->xdf1.nrec; l += step) {
 		long len = match_func_rec(&xe->xdf1, xecfg, l, buf, size);
 		if (len >= 0) {
 			if (func_line)
@@ -179,14 +179,14 @@ pre_context_calculation:
 			long fs1, i1 = xch->i1;
 
 			/* Appended chunk? */
-			if (i1 >= xe->xdf1.nrec) {
+			if (i1 >= (long)xe->xdf1.nrec) {
 				long i2 = xch->i2;
 
 				/*
 				 * We don't need additional context if
 				 * a whole function was added.
 				 */
-				while (i2 < xe->xdf2.nrec) {
+				while (i2 < (long)xe->xdf2.nrec) {
 					if (is_func_rec(&xe->xdf2, xecfg, i2))
 						goto post_context_calculation;
 					i2++;
@@ -196,7 +196,7 @@ pre_context_calculation:
 				 * Otherwise get more context from the
 				 * pre-image.
 				 */
-				i1 = xe->xdf1.nrec - 1;
+				i1 = (long)xe->xdf1.nrec - 1;
 			}
 
 			fs1 = get_func_line(xe, xecfg, NULL, i1, -1);
@@ -228,8 +228,8 @@ pre_context_calculation:
 
  post_context_calculation:
 		lctx = xecfg->ctxlen;
-		lctx = XDL_MIN(lctx, xe->xdf1.nrec - (xche->i1 + xche->chg1));
-		lctx = XDL_MIN(lctx, xe->xdf2.nrec - (xche->i2 + xche->chg2));
+		lctx = XDL_MIN(lctx, (long)xe->xdf1.nrec - (xche->i1 + xche->chg1));
+		lctx = XDL_MIN(lctx, (long)xe->xdf2.nrec - (xche->i2 + xche->chg2));
 
 		e1 = xche->i1 + xche->chg1 + lctx;
 		e2 = xche->i2 + xche->chg2 + lctx;
@@ -237,13 +237,13 @@ pre_context_calculation:
 		if (xecfg->flags & XDL_EMIT_FUNCCONTEXT) {
 			long fe1 = get_func_line(xe, xecfg, NULL,
 						 xche->i1 + xche->chg1,
-						 xe->xdf1.nrec);
+						 (long)xe->xdf1.nrec);
 			while (fe1 > 0 && is_empty_rec(&xe->xdf1, fe1 - 1))
 				fe1--;
 			if (fe1 < 0)
-				fe1 = xe->xdf1.nrec;
+				fe1 = (long)xe->xdf1.nrec;
 			if (fe1 > e1) {
-				e2 = XDL_MIN(e2 + (fe1 - e1), xe->xdf2.nrec);
+				e2 = XDL_MIN(e2 + (fe1 - e1), (long)xe->xdf2.nrec);
 				e1 = fe1;
 			}
 
@@ -254,7 +254,7 @@ pre_context_calculation:
 			 */
 			if (xche->next) {
 				long l = XDL_MIN(xche->next->i1,
-						 xe->xdf1.nrec - 1);
+						 (long)xe->xdf1.nrec - 1);
 				if (l - xecfg->ctxlen <= e1 ||
 				    get_func_line(xe, xecfg, NULL, l, e1) < 0) {
 					xche = xche->next;
diff --git a/xdiff/xmerge.c b/xdiff/xmerge.c
index 0dd4558a32..29dad98c49 100644
--- a/xdiff/xmerge.c
+++ b/xdiff/xmerge.c
@@ -158,7 +158,7 @@ static int is_eol_crlf(xdfile_t *file, int i)
 {
 	size_t size;
 
-	if (i < file->nrec - 1)
+	if (i < (long)file->nrec - 1)
 		/* All lines before the last *must* end in LF */
 		return (size = file->recs[i].size) > 1 &&
 			file->recs[i].ptr[size - 2] == '\r';
@@ -317,7 +317,7 @@ static int xdl_fill_merge_buffer(xdfenv_t *xe1, const char *name1,
 			continue;
 		i = m->i1 + m->chg1;
 	}
-	size += xdl_recs_copy(xe1, i, xe1->xdf2.nrec - i, 0, 0,
+	size += xdl_recs_copy(xe1, i, (int)xe1->xdf2.nrec - i, 0, 0,
 			      dest ? dest + size : NULL);
 	return size;
 }
@@ -622,7 +622,7 @@ static int xdl_do_merge(xdfenv_t *xe1, xdchange_t *xscr1,
 			changes = c;
 		i0 = xscr1->i1;
 		i1 = xscr1->i2;
-		i2 = xscr1->i1 + xe2->xdf2.nrec - xe2->xdf1.nrec;
+		i2 = xscr1->i1 + (long)xe2->xdf2.nrec - (long)xe2->xdf1.nrec;
 		chg0 = xscr1->chg1;
 		chg1 = xscr1->chg2;
 		chg2 = xscr1->chg1;
@@ -637,7 +637,7 @@ static int xdl_do_merge(xdfenv_t *xe1, xdchange_t *xscr1,
 		if (!changes)
 			changes = c;
 		i0 = xscr2->i1;
-		i1 = xscr2->i1 + xe1->xdf2.nrec - xe1->xdf1.nrec;
+		i1 = xscr2->i1 + (long)xe1->xdf2.nrec - (long)xe1->xdf1.nrec;
 		i2 = xscr2->i2;
 		chg0 = xscr2->chg1;
 		chg1 = xscr2->chg1;
diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
index cc53266f3b..a0b31eb5d8 100644
--- a/xdiff/xpatience.c
+++ b/xdiff/xpatience.c
@@ -370,5 +370,5 @@ static int patience_diff(xpparam_t const *xpp, xdfenv_t *env,
 
 int xdl_do_patience_diff(xpparam_t const *xpp, xdfenv_t *env)
 {
-	return patience_diff(xpp, env, 1, env->xdf1.nrec, 1, env->xdf2.nrec);
+	return patience_diff(xpp, env, 1, (int)env->xdf1.nrec, 1, (int)env->xdf2.nrec);
 }
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 16236bd045..4ee9fb60cd 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -153,7 +153,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 		for (top = blk + bsize; cur < top; ) {
 			prev = cur;
 			hav = xdl_hash_record(&cur, top, xpp->flags);
-			if (XDL_ALLOC_GROW(xdf->recs, xdf->nrec + 1, narec))
+			if (XDL_ALLOC_GROW(xdf->recs, (long)xdf->nrec + 1, narec))
 				goto abort;
 			crec = &xdf->recs[xdf->nrec++];
 			crec->ptr = prev;
@@ -287,7 +287,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	/*
 	 * Initialize temporary arrays with DISCARD, KEEP, or INVESTIGATE.
 	 */
-	if ((mlim = xdl_bogosqrt(xdf1->nrec)) > XDL_MAX_EQLIMIT)
+	if ((mlim = xdl_bogosqrt((long)xdf1->nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
 	for (i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart]; i <= xdf1->dend; i++, recs++) {
 		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
@@ -295,7 +295,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
 	}
 
-	if ((mlim = xdl_bogosqrt(xdf2->nrec)) > XDL_MAX_EQLIMIT)
+	if ((mlim = xdl_bogosqrt((long)xdf2->nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
 	for (i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart]; i <= xdf2->dend; i++, recs++) {
 		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
@@ -348,7 +348,7 @@ static int xdl_trim_ends(xdfile_t *xdf1, xdfile_t *xdf2) {
 
 	recs1 = xdf1->recs;
 	recs2 = xdf2->recs;
-	for (i = 0, lim = XDL_MIN(xdf1->nrec, xdf2->nrec); i < lim;
+	for (i = 0, lim = (long)XDL_MIN(xdf1->nrec, xdf2->nrec); i < lim;
 	     i++, recs1++, recs2++)
 		if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
 			break;
@@ -361,8 +361,8 @@ static int xdl_trim_ends(xdfile_t *xdf1, xdfile_t *xdf2) {
 		if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
 			break;
 
-	xdf1->dend = xdf1->nrec - i - 1;
-	xdf2->dend = xdf2->nrec - i - 1;
+	xdf1->dend = (long)xdf1->nrec - i - 1;
+	xdf2->dend = (long)xdf2->nrec - i - 1;
 
 	return 0;
 }
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index 742b81bf3b..17cafd8b6e 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -47,7 +47,7 @@ typedef struct s_xrecord {
 
 typedef struct s_xdfile {
 	xrecord_t *recs;
-	long nrec;
+	size_t nrec;
 	bool *changed;
 	long *rindex;
 	long nreff;
-- 
gitgitgadget



* [PATCH v2 08/10] xdiff: make xdfile_t.nreff a size_t instead of long
  2025-10-29 22:19 ` [PATCH v2 00/10] " Ezekiel Newren via GitGitGadget
                     ` (6 preceding siblings ...)
  2025-10-29 22:19   ` [PATCH v2 07/10] xdiff: make xdfile_t.nrec a size_t instead of long Ezekiel Newren via GitGitGadget
@ 2025-10-29 22:19   ` Ezekiel Newren via GitGitGadget
  2025-10-29 22:19   ` [PATCH v2 09/10] xdiff: change rindex from long to size_t in xdfile_t Ezekiel Newren via GitGitGadget
                     ` (3 subsequent siblings)
  11 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-10-29 22:19 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

size_t is used because nreff describes the number of elements in memory
for rindex.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xprepare.c | 14 +++++++-------
 xdiff/xtypes.h   |  2 +-
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 4ee9fb60cd..c690bafeb1 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -264,7 +264,7 @@ static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
  * might be potentially discarded if they appear in a run of discardable.
  */
 static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xdf2) {
-	long i, nm, nreff, mlim;
+	long i, nm, mlim;
 	xrecord_t *recs;
 	xdlclass_t *rcrec;
 	uint8_t *action1 = NULL, *action2 = NULL;
@@ -307,29 +307,29 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	 * Use temporary arrays to decide if changed[i] should remain
 	 * false, or become true.
 	 */
-	for (nreff = 0, i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart];
+	xdf1->nreff = 0;
+	for (i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart];
 	     i <= xdf1->dend; i++, recs++) {
 		if (action1[i] == KEEP ||
 		    (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xdf1->dstart, xdf1->dend))) {
-			xdf1->rindex[nreff++] = i;
+			xdf1->rindex[xdf1->nreff++] = i;
 			/* changed[i] remains false, i.e. keep */
 		} else
 			xdf1->changed[i] = true;
 			/* i.e. discard */
 	}
-	xdf1->nreff = nreff;
 
-	for (nreff = 0, i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart];
+	xdf2->nreff = 0;
+	for (i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart];
 	     i <= xdf2->dend; i++, recs++) {
 		if (action2[i] == KEEP ||
 		    (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xdf2->dstart, xdf2->dend))) {
-			xdf2->rindex[nreff++] = i;
+			xdf2->rindex[xdf2->nreff++] = i;
 			/* changed[i] remains false, i.e. keep */
 		} else
 			xdf2->changed[i] = true;
 			/* i.e. discard */
 	}
-	xdf2->nreff = nreff;
 
 cleanup:
 	xdl_free(action1);
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index 17cafd8b6e..df4c5cab1a 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -50,7 +50,7 @@ typedef struct s_xdfile {
 	size_t nrec;
 	bool *changed;
 	long *rindex;
-	long nreff;
+	size_t nreff;
 	ptrdiff_t dstart, dend;
 } xdfile_t;
 
-- 
gitgitgadget



* [PATCH v2 09/10] xdiff: change rindex from long to size_t in xdfile_t
  2025-10-29 22:19 ` [PATCH v2 00/10] " Ezekiel Newren via GitGitGadget
                     ` (7 preceding siblings ...)
  2025-10-29 22:19   ` [PATCH v2 08/10] xdiff: make xdfile_t.nreff " Ezekiel Newren via GitGitGadget
@ 2025-10-29 22:19   ` Ezekiel Newren via GitGitGadget
  2025-10-29 22:19   ` [PATCH v2 10/10] xdiff: rename rindex -> reference_index Ezekiel Newren via GitGitGadget
                     ` (2 subsequent siblings)
  11 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-10-29 22:19 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

rindex holds index offsets into other arrays, i.e. indices into memory,
so its elements should be size_t.

Changing the type of rindex from long to size_t has no cascading
refactor impact because it is only ever used to directly index other
arrays.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xtypes.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index df4c5cab1a..3bcc0920e0 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -49,7 +49,7 @@ typedef struct s_xdfile {
 	xrecord_t *recs;
 	size_t nrec;
 	bool *changed;
-	long *rindex;
+	size_t *rindex;
 	size_t nreff;
 	ptrdiff_t dstart, dend;
 } xdfile_t;
-- 
gitgitgadget



* [PATCH v2 10/10] xdiff: rename rindex -> reference_index
  2025-10-29 22:19 ` [PATCH v2 00/10] " Ezekiel Newren via GitGitGadget
                     ` (8 preceding siblings ...)
  2025-10-29 22:19   ` [PATCH v2 09/10] xdiff: change rindex from long to size_t in xdfile_t Ezekiel Newren via GitGitGadget
@ 2025-10-29 22:19   ` Ezekiel Newren via GitGitGadget
  2025-10-30 14:26   ` [PATCH v2 00/10] Xdiff cleanup part2 Junio C Hamano
  2025-11-11 19:42   ` [PATCH v3 " Ezekiel Newren via GitGitGadget
  11 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-10-29 22:19 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

The classic diff adds only the lines that it is going to consider
during the diff to an array. That array, now named reference_index,
maps each position in the compacted sequence to the line of the file
that it references.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
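A minimal sketch of the mapping described above, not taken from the
patch. The helper name and the keep[] input are assumptions for
illustration; in xdiff the keep/discard decision comes from the cleanup
pass, which is not reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: collect the indices of the lines worth diffing.
 * keep[i] says whether line i survives the cleanup. reference_index
 * must have room for nrec entries. Returns the number of entries
 * written, which is the role nreff plays in xdfile_t.
 */
static size_t toy_build_reference_index(const bool *keep, size_t nrec,
					size_t *reference_index, bool *changed)
{
	size_t nreff = 0;

	for (size_t i = 0; i < nrec; i++) {
		if (keep[i])
			reference_index[nreff++] = i; /* compacted position -> file line */
		else
			changed[i] = true;            /* discarded lines count as changed */
	}
	return nreff;
}
```

After the call, reference_index[k] is the file line behind the k-th
entry of the compacted sequence, which is how the get_hash() change in
the diff below reads it.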
 xdiff/xdiffi.c   |  6 +++---
 xdiff/xprepare.c | 10 +++++-----
 xdiff/xtypes.h   |  2 +-
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 759193fe5d..8eb664be3e 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -24,7 +24,7 @@
 
 static size_t get_hash(xdfile_t *xdf, long index)
 {
-	return xdf->recs[xdf->rindex[index]].minimal_perfect_hash;
+	return xdf->recs[xdf->reference_index[index]].minimal_perfect_hash;
 }
 
 #define XDL_MAX_COST_MIN 256
@@ -278,10 +278,10 @@ int xdl_recs_cmp(xdfile_t *xdf1, long off1, long lim1,
 	 */
 	if (off1 == lim1) {
 		for (; off2 < lim2; off2++)
-			xdf2->changed[xdf2->rindex[off2]] = true;
+			xdf2->changed[xdf2->reference_index[off2]] = true;
 	} else if (off2 == lim2) {
 		for (; off1 < lim1; off1++)
-			xdf1->changed[xdf1->rindex[off1]] = true;
+			xdf1->changed[xdf1->reference_index[off1]] = true;
 	} else {
 		xdpsplit_t spl;
 		spl.i1 = spl.i2 = 0;
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index c690bafeb1..1dd420a2ff 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -128,7 +128,7 @@ static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
 
 static void xdl_free_ctx(xdfile_t *xdf)
 {
-	xdl_free(xdf->rindex);
+	xdl_free(xdf->reference_index);
 	xdl_free(xdf->changed - 1);
 	xdl_free(xdf->recs);
 }
@@ -141,7 +141,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 	uint8_t const *blk, *cur, *top, *prev;
 	xrecord_t *crec;
 
-	xdf->rindex = NULL;
+	xdf->reference_index = NULL;
 	xdf->changed = NULL;
 	xdf->recs = NULL;
 
@@ -169,7 +169,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 
 	if ((XDF_DIFF_ALG(xpp->flags) != XDF_PATIENCE_DIFF) &&
 	    (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF)) {
-		if (!XDL_ALLOC_ARRAY(xdf->rindex, xdf->nrec + 1))
+		if (!XDL_ALLOC_ARRAY(xdf->reference_index, xdf->nrec + 1))
 			goto abort;
 	}
 
@@ -312,7 +312,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	     i <= xdf1->dend; i++, recs++) {
 		if (action1[i] == KEEP ||
 		    (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xdf1->dstart, xdf1->dend))) {
-			xdf1->rindex[xdf1->nreff++] = i;
+			xdf1->reference_index[xdf1->nreff++] = i;
 			/* changed[i] remains false, i.e. keep */
 		} else
 			xdf1->changed[i] = true;
@@ -324,7 +324,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	     i <= xdf2->dend; i++, recs++) {
 		if (action2[i] == KEEP ||
 		    (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xdf2->dstart, xdf2->dend))) {
-			xdf2->rindex[xdf2->nreff++] = i;
+			xdf2->reference_index[xdf2->nreff++] = i;
 			/* changed[i] remains false, i.e. keep */
 		} else
 			xdf2->changed[i] = true;
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index 3bcc0920e0..5accbec284 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -49,7 +49,7 @@ typedef struct s_xdfile {
 	xrecord_t *recs;
 	size_t nrec;
 	bool *changed;
-	size_t *rindex;
+	size_t *reference_index;
 	size_t nreff;
 	ptrdiff_t dstart, dend;
 } xdfile_t;
-- 
gitgitgadget


* Re: [PATCH v2 00/10] Xdiff cleanup part2
  2025-10-29 22:19 ` [PATCH v2 00/10] " Ezekiel Newren via GitGitGadget
                     ` (9 preceding siblings ...)
  2025-10-29 22:19   ` [PATCH v2 10/10] xdiff: rename rindex -> reference_index Ezekiel Newren via GitGitGadget
@ 2025-10-30 14:26   ` Junio C Hamano
  2025-11-11 19:42   ` [PATCH v3 " Ezekiel Newren via GitGitGadget
  11 siblings, 0 replies; 118+ messages in thread
From: Junio C Hamano @ 2025-10-30 14:26 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget
  Cc: git, Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren

"Ezekiel Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

>  * Added documentation about unambiguous types and FFI

Nicely written; a few footnote entries may be a bit too strict,
misleading, and may need rephrasing, though.  For example, we may
want to be suspicious when we see code that uses ssize_t as if it is
half the size_t plus error indication, it does not immediately mean
that the type "should not be used in Git". It is perfectly sensible
to assign to or compare with returned value from write(2), for
example.
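
As a rough illustration of that last point (a sketch under assumptions,
not code from Git): write_all() below is a hypothetical helper, not
Git's write_in_full(). It uses ssize_t only to receive the return value
of write(2) and reports its result through an int status and a size_t
count, which are the only values a caller would see.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write all len bytes of buf to fd.
 * Returns 0 on success and -1 on failure; *written always holds the
 * number of bytes that were actually written.
 */
static int write_all(int fd, const uint8_t *buf, size_t len, size_t *written)
{
	size_t done = 0;

	while (done < len) {
		/* ssize_t is exactly what write(2) returns. */
		ssize_t n = write(fd, buf + done, len - done);

		if (n < 0 && errno == EINTR)
			continue; /* interrupted before writing; retry */
		if (n <= 0) {
			*written = done;
			return -1;
		}
		/* n > 0 here, so the conversion to size_t is lossless. */
		done += (size_t)n;
	}
	*written = done;
	return 0;
}
```

The signed value never leaves the function, so nothing ambiguous would
need to cross an FFI boundary.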

Will queue.  Thanks.


* [PATCH v3 00/10] Xdiff cleanup part2
  2025-10-29 22:19 ` [PATCH v2 00/10] " Ezekiel Newren via GitGitGadget
                     ` (10 preceding siblings ...)
  2025-10-30 14:26   ` [PATCH v2 00/10] Xdiff cleanup part2 Junio C Hamano
@ 2025-11-11 19:42   ` Ezekiel Newren via GitGitGadget
  2025-11-11 19:42     ` [PATCH v3 01/10] doc: define unambiguous type mappings across C and Rust Ezekiel Newren via GitGitGadget
                       ` (11 more replies)
  11 siblings, 12 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-11 19:42 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren

Changes in v3:

 * Address comments about commit messages and documentation
 * Add unambiguous-types.adoc to Makefile and Meson
 * Use markdown style to avoid asciidoc issues

Changes in v2:

 * Added documentation about unambiguous types and FFI
 * Addressed comments on the mailing list


Original cover letter below:
============================

Maintainer note: This patch series builds on top of en/xdiff-cleanup and
am/xdiff-hash-tweak (both of which are now in master).

The primary goal of this patch series is to convert every field's type in
xrecord_t and xdfile_t to be unambiguous, in preparation for making them more
Rust FFI friendly. Additionally, the ha field in xrecord_t is split into
line_hash and minimal_perfect_hash.

The order of some of the fields has changed as called out by the commit
messages.

Before:

typedef struct s_xrecord {
	char const *ptr;
	long size;
	unsigned long ha;
} xrecord_t;

typedef struct s_xdfile {
	xrecord_t *recs;
	long nrec;
	long dstart, dend;
	bool *changed;
	long *rindex;
	long nreff;
} xdfile_t;


After part 2

typedef struct s_xrecord {
	uint8_t const *ptr;
	size_t size;
	uint64_t line_hash;
	size_t minimal_perfect_hash;
} xrecord_t;

typedef struct s_xdfile {
	xrecord_t *recs;
	size_t nrec;
	bool *changed;
	size_t *reference_index;
	size_t nreff;
	ptrdiff_t dstart, dend;
} xdfile_t;


Ezekiel Newren (10):
  doc: define unambiguous type mappings across C and Rust
  xdiff: use ptrdiff_t for dstart/dend
  xdiff: make xrecord_t.ptr a uint8_t instead of char
  xdiff: use size_t for xrecord_t.size
  xdiff: use unambiguous types in xdl_hash_record()
  xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash
  xdiff: make xdfile_t.nrec a size_t instead of long
  xdiff: make xdfile_t.nreff a size_t instead of long
  xdiff: change rindex from long to size_t in xdfile_t
  xdiff: rename rindex -> reference_index

 Documentation/Makefile                        |   1 +
 Documentation/technical/meson.build           |   1 +
 .../technical/unambiguous-types.adoc          | 239 ++++++++++++++++++
 xdiff-interface.c                             |   2 +-
 xdiff/xdiffi.c                                |  29 +--
 xdiff/xemit.c                                 |  28 +-
 xdiff/xhistogram.c                            |   4 +-
 xdiff/xmerge.c                                |  30 +--
 xdiff/xpatience.c                             |  14 +-
 xdiff/xprepare.c                              |  60 ++---
 xdiff/xtypes.h                                |  15 +-
 xdiff/xutils.c                                |  32 +--
 xdiff/xutils.h                                |   6 +-
 13 files changed, 351 insertions(+), 110 deletions(-)
 create mode 100644 Documentation/technical/unambiguous-types.adoc


base-commit: a99f379adf116d53eb11957af5bab5214915f91d
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-2070%2Fezekielnewren%2Fxdiff_cleanup_part2-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-2070/ezekielnewren/xdiff_cleanup_part2-v3
Pull-Request: https://github.com/git/git/pull/2070

Range-diff vs v2:

  1:  88133848d1 !  1:  e5d084d340 doc: define unambiguous type mappings across C and Rust
     @@ Commit message
      
          Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
      
     + ## Documentation/Makefile ##
     +@@ Documentation/Makefile: TECH_DOCS += technical/shallow
     + TECH_DOCS += technical/sparse-checkout
     + TECH_DOCS += technical/sparse-index
     + TECH_DOCS += technical/trivial-merge
     ++TECH_DOCS += technical/unambiguous-types
     + TECH_DOCS += technical/unit-tests
     + SP_ARTICLES += $(TECH_DOCS)
     + SP_ARTICLES += technical/api-index
     +
     + ## Documentation/technical/meson.build ##
     +@@ Documentation/technical/meson.build: articles = [
     +   'sparse-checkout.adoc',
     +   'sparse-index.adoc',
     +   'trivial-merge.adoc',
     ++  'unambiguous-types.adoc',
     +   'unit-tests.adoc',
     + ]
     + 
     +
       ## Documentation/technical/unambiguous-types.adoc (new) ##
      @@
      += Unambiguous types
     @@ Documentation/technical/unambiguous-types.adoc (new)
      +|===
      +| C Type | Rust Type
      +| size_t^3^     | usize
     -+| ptrdiff_t^4^  | isize
     ++| ptrdiff_t^3^  | isize
      +|===
      +
      +== Character types
      +
     -+This is where C and Rust don't have a clean one-to-one mapping. A C `char` is
     -+an 8-bit type that is signless (neither signed nor unsigned) which causes
     -+problems with e.g. `make DEVELOPER=1`. Rust's `char` type is an unsigned 32-bit
     -+integer that is used to describe Unicode code points. Even though a C `char`
     -+is the same width as `u8`, `char` should be converted to u8 where it is
     -+describing bytes in memory. If a C `char` is not describing bytes, then it
     -+should be converted to a more accurate unambiguous type.
     ++This is where C and Rust don't have a clean one-to-one mapping.
     ++
     ++C comparison problem: `char` is a distinct type from both `signed char` and
     ++`unsigned char`, and whether it is signed is implementation defined. Building
     ++with `make DEVELOPER=1` produces "differ in signedness" complaints when `char`
     ++is compared with `uint8_t` or `int8_t`.
     ++
     ++Rust's `char` type is a 32-bit type that represents a Unicode scalar value.
     ++Even though a C `char` is the same width as `u8`, `char` should be converted
     ++to `u8` where it is describing bytes in memory. If a C `char` is not
     ++describing bytes, then it should be converted to a more accurate unambiguous
     ++type. Unicode is mentioned here because of how `&str` is defined in Rust and
     ++how a `&str` is created from `&[u8]`. Rust assumes that a `&str` is correctly
     ++encoded UTF-8, i.e. text in memory, whereas a C `char` makes no assumption
     ++about the bytes that it represents.
     ++
     ++```
     ++let raw_bytes = b"abc\n";
     ++let result = std::str::from_utf8(raw_bytes);
     ++if let Ok(line) = result {
     ++    // do something with text
     ++}
     ++```
      +
      +While you could specify `char` in the C code and `u8` in Rust code, it's not as
      +clear what the appropriate type is, but it would work across the FFI boundary.
     -+However the bigger problem comes from code generation tools like cbindgen and
     -+bindgen. When cbindgen see u8 in Rust it will generate uint8_t on the C side
     -+which will cause differ in signedness warnings/errors. Similaraly if bindgen
     -+see `char` on the C side it will generate `std::ffi::c_char` which has its own
     ++However, the bigger problem comes from code generation tools like cbindgen and
     ++bindgen. When cbindgen sees u8 in Rust it will generate uint8_t on the C side
     ++which will cause "differ in signedness" warnings/errors. Similarly, if bindgen
     ++sees `char` on the C side it will generate `std::ffi::c_char` which has its own
      +problems.
      +
      +=== Notes
     @@ Documentation/technical/unambiguous-types.adoc (new)
      +platform/arch for C does not follow IEEE-754 then this equivalence does not
      +hold. Also, it's assumed that `float` is 32 bits and `double` is 64, but
      +there may be a strange platform/arch where even this isn't true. +
     -+^3^ C also defines uintptr_t, but this should not be used in Git. +
     -+^4^ C also defines ssize_t and intptr_t, but these should not be used in Git. +
     ++^3^ C also defines uintptr_t, ssize_t and intptr_t, but these types are
     ++discouraged for FFI purposes. For functions like `read()` and `write()` ssize_t
     ++should be cast to a different, and unambiguous, type before being passed over
     ++the FFI boundary. +
      +
      +== Problems with std::ffi::c_* types in Rust
     -+TL;DR: They're not guaranteed to match C types for all possible C
     -+compilers/platforms/architectures.
     ++TL;DR: In practice, Rust's `c_*` types aren't guaranteed to match C types for
     ++all possible C compilers, platforms, or architectures, because Rust only
     ++ensures correctness of C types on officially supported targets. These
     ++definitions have changed over time to match more targets which means that the
     ++c_* definitions will differ based on which Rust version Git chooses to use.
      +
     -+Only a few of Rust's C FFI types are considered safe and semantically clear to
     -+use: +
     ++Current list of safe, Rust side, FFI types in Git: +
      +
      +* `c_void`
      +* `CStr`
     @@ Documentation/technical/unambiguous-types.adoc (new)
      +Even then, they should be used sparingly, and only where the semantics match
      +exactly.
      +
     -+The std::os::raw::c_* (which is deprecated) directly inherits the problems of
     -+core::ffi, which changes over time and seems to make a best guess at the
     -+correct definition for a given platform/target. This probably isn't a problem
     -+for all platforms that Rust supports currently, but can anyone say that Rust
     -+got it right for all C compilers of all platforms/targets?
     -+
     -+On top of all of that we're targeting an older version of Rust which doesn't
     -+have the latest mappings.
     ++The std::os::raw::c_* types directly inherit the problems of core::ffi, which
     ++changes over time and seems to make a best guess at the correct definition for
     ++a given platform/target. This probably isn't a problem for the platforms that
     ++Rust currently supports, but can anyone say that Rust got it right for all
     ++C compilers of all platforms/targets?
      +
      +To give an example: c_long is defined in
      +footnote:[https://doc.rust-lang.org/1.63.0/src/core/ffi/mod.rs.html#175-189[c_long in 1.63.0]]
     @@ Documentation/technical/unambiguous-types.adoc (new)
      +
      +=== Rust version 1.63.0
      +
     -+[source]
     -+----
     ++```
      +mod c_long_definition {
      +    cfg_if! {
      +        if #[cfg(all(target_pointer_width = "64", not(windows)))] {
     @@ Documentation/technical/unambiguous-types.adoc (new)
      +        }
      +    }
      +}
     -+----
     ++```
      +
      +=== Rust version 1.89.0
      +
     -+[source]
     -+----
     ++```
      +mod c_long_definition {
      +    crate::cfg_select! {
      +        any(
     @@ Documentation/technical/unambiguous-types.adoc (new)
      +        }
      +    }
      +}
     -+----
     ++```
      +
      +Even for the cases where C types are correctly mapped to Rust types via
      +std::ffi::c_* there are still problems. Let's take c_char for example. On some
     @@ Documentation/technical/unambiguous-types.adoc (new)
      +The following code will panic in debug on platforms that define c_char as u8,
      +but won't if it's an i8.
      +
     -+[source]
     -+----
     ++```
      +let mut x: std::ffi::c_char = 0;
      +x -= 1;
     -+----
     ++```
      +
      +=== Inconsistent shift behavior
      +
      +`x` will be 0xC0 for platforms that use i8, but will be 0x40 where it's u8.
      +
     -+[source]
     -+----
     ++```
      +let mut x: std::ffi::c_char = 0x80;
      +x >>= 1;
     -+----
     ++```
      +
      +=== Equality fails to compile on some platforms
      +
     @@ Documentation/technical/unambiguous-types.adoc (new)
      +if it's u8. You can cast x e.g. `assert_eq!(x as u8, b'a');`, but then you get
      +a warning on platforms that use u8 and a clean compilation where i8 is used.
      +
     -+[source]
     -+----
     ++```
      +let mut x: std::ffi::c_char = 0x61;
      +assert_eq!(x, b'a');
     -+----
     ++```
      +
      +== Enum types
      +Rust enum types should not be used as FFI types. Rust enum types are more like
      +C union types than C enums. For something like:
      +
     -+[source]
     -+----
     ++```
      +#[repr(u8)]
      +enum Fruit {
      +    Apple,
      +    Banana,
      +    Cherry,
      +}
     -+----
     ++```
      +
      +It's easy enough to make sure the Rust enum matches what C would expect, but
      +that is not the case for a more complex type like:
      +
     -+[source]
     -+----
     ++```
      +enum HashResult {
      +    SHA1([u8; 20]),
      +    SHA256([u8; 32]),
      +}
     -+----
     ++```
      +
      +The Rust compiler has to add a discriminant to the enum to distinguish between
      +the variants. The width, location, and values of that discriminant are up to
  2:  9197903add !  2:  52e3f589b1 xdiff: use ssize_t for dstart/dend, make them last in xdfile_t
     @@ Metadata
      Author: Ezekiel Newren <ezekielnewren@gmail.com>
      
       ## Commit message ##
     -    xdiff: use ssize_t for dstart/dend, make them last in xdfile_t
     +    xdiff: use ptrdiff_t for dstart/dend
      
     -    ssize_t is appropriate for dstart and dend because they both describe
     +    ptrdiff_t is appropriate for dstart and dend because they both describe
          positive or negative offsets relative to a pointer.
      
          A future patch will move these fields to a different struct. Moving
  3:  46bc1b3e25 !  3:  83e7bf180a xdiff: make xrecord_t.ptr a uint8_t instead of char
     @@ Metadata
       ## Commit message ##
          xdiff: make xrecord_t.ptr a uint8_t instead of char
      
     -    Rust uses u8 to refer to bytes in memory. Since xrecord_t.ptr is also
     -    referring to bytes in memory, rather than Unicode code points, use
     -    uint8_t instead of char.
     +    Make xrecord_t.ptr uint8_t because it's referring to bytes in memory.
      
          Every usage of this field was inspected and cast to char*, or similar,
          to avoid signedness warnings/errors from the compiler. Casting was used
     @@ xdiff/xprepare.c: static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, lo
       				goto abort;
       			crec = &xdf->recs[xdf->nrec++];
      -			crec->ptr = prev;
     --			crec->size = (long) (cur - prev);
      +			crec->ptr = (uint8_t const *)prev;
     -+			crec->size =(long) ( cur - prev);
     + 			crec->size = (long) (cur - prev);
       			crec->ha = hav;
       			if (xdl_classify_record(pass, cf, crec) < 0)
     - 				goto abort;
      
       ## xdiff/xtypes.h ##
      @@ xdiff/xtypes.h: typedef struct s_chastore {
  4:  07e28aad3b !  4:  da2b80ea0b xdiff: use size_t for xrecord_t.size
     @@ xdiff/xprepare.c: static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, lo
       				goto abort;
       			crec = &xdf->recs[xdf->nrec++];
       			crec->ptr = (uint8_t const *)prev;
     --			crec->size =(long) ( cur - prev);
     +-			crec->size = (long) (cur - prev);
      +			crec->size = cur - prev;
       			crec->ha = hav;
       			if (xdl_classify_record(pass, cf, crec) < 0)
  5:  1ade7d8165 =  5:  c6ba630ac5 xdiff: use unambiguous types in xdl_hash_record()
  6:  59054ea0cb !  6:  3834ea8f9b xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash
     @@ Commit message
          xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash
      
          The ha field is serving two different purposes, which makes the code
     -    harder to read. At first glance it looks like many places assume
     +    harder to read. At first glance, it looks like many places assume
          there could never be hash collisions between lines of the two input
          files. In reality, line_hash is used together with xdl_recmatch() to
          ensure correct comparisons of lines, even when collisions occur.
      
          To make this clearer, the old ha field has been split:
     -      * line_hash: The straightforward hash of a line, requiring no
     -        additional context.
     +      * line_hash: a straightforward hash of a line, independent of any
     +        external context. Its type is uint64_t, as it comes from a fixed
     +        width hash function.
            * minimal_perfect_hash: Not a new concept, but now a separate
              field. It comes from the classifier's general-purpose hash table,
              which assigns each line a unique and minimal hash across the two
     -        files.
     +        files. A size_t is used here because it's meant to be used to
     +        index an array. This also avoids ` as usize` casts on the Rust
     +        side when using it to index a slice.
      
          Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
      
     @@ xdiff/xpatience.c: static int match(struct hashmap *map, int line1, int line2)
       static int patience_diff(xpparam_t const *xpp, xdfenv_t *env,
      
       ## xdiff/xprepare.c ##
     -@@ xdiff/xprepare.c: static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
     - 	long hi;
     +@@ xdiff/xprepare.c: static void xdl_free_classifier(xdlclassifier_t *cf) {
     + 
     + 
     + static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t *rec) {
     +-	long hi;
     ++	size_t hi;
       	xdlclass_t *rcrec;
       
      -	hi = (long) XDL_HASHLONG(rec->ha, cf->hbits);
     -+	hi = (long) XDL_HASHLONG(rec->line_hash, cf->hbits);
     ++	hi = XDL_HASHLONG(rec->line_hash, cf->hbits);
       	for (rcrec = cf->rchash[hi]; rcrec; rcrec = rcrec->next)
      -		if (rcrec->rec.ha == rec->ha &&
      +		if (rcrec->rec.line_hash == rec->line_hash &&
  7:  f91be17858 !  7:  e2a2c7530c xdiff: make xdfile_t.nrec a size_t instead of long
     @@ Metadata
       ## Commit message ##
          xdiff: make xdfile_t.nrec a size_t instead of long
      
     -    size_t is used because nrec describes the number of elements in memory
     -    for recs, and the number of elements in memory for 'changed' + 2.
     +    size_t is used because nrec describes the number of elements for both
     +    recs, and for 'changed' + 2.
      
          Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
      
  8:  e2a6a23cc4 =  8:  31cd2a1aa4 xdiff: make xdfile_t.nreff a size_t instead of long
  9:  3b6054945f !  9:  aee0d3958b xdiff: change rindex from long to size_t in xdfile_t
     @@ Metadata
       ## Commit message ##
          xdiff: change rindex from long to size_t in xdfile_t
      
     -    rindex describes a index offset which means it's an index into memory
     -    which should use size_t.
     +    The field rindex describes an index offset for other arrays. Change it
     +    to size_t.
      
          Changing the type of rindex from long to size_t has no cascading
          refactor impact because it is only ever used to directly index other
 10:  1856a29026 ! 10:  75c26fe160 xdiff: rename rindex -> reference_index
     @@ Commit message
      
          The classic diff adds only the lines that it's going to consider,
          during the diff, to an array. A mapping between the compacted
     -    array, and the lines of the file that they reference, are
     +    array, and the lines of the file that they reference, is
          facilitated by this array.
      
          Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
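
A rough sketch of the line_hash / minimal_perfect_hash split from patch 6,
for illustration only. Everything here is hypothetical: struct line is a
simplified stand-in for xrecord_t, hash_line() uses FNV-1a as a stand-in
for xdl_hash_record(), and classify() replaces the classifier's hash table
with a quadratic scan over a single array to stay short.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for xrecord_t. */
struct line {
	const uint8_t *ptr;
	size_t size;
	uint64_t line_hash;
	size_t minimal_perfect_hash;
};

/* FNV-1a; an assumed stand-in, not the hash xdiff actually uses. */
static uint64_t hash_line(const uint8_t *p, size_t n)
{
	uint64_t h = 0xcbf29ce484222325ULL;

	for (size_t i = 0; i < n; i++) {
		h ^= p[i];
		h *= 0x100000001b3ULL;
	}
	return h;
}

/*
 * Give every line a dense ID: identical lines share an ID and each new
 * distinct line gets the next unused one. Returns the number of
 * distinct lines, so every ID is a valid index into an array of that
 * length.
 */
static size_t classify(struct line *lines, size_t n)
{
	size_t next_id = 0;

	for (size_t i = 0; i < n; i++) {
		size_t j;

		lines[i].line_hash = hash_line(lines[i].ptr, lines[i].size);
		for (j = 0; j < i; j++) {
			/* The hash is only a filter; memcmp() settles collisions. */
			if (lines[j].line_hash == lines[i].line_hash &&
			    lines[j].size == lines[i].size &&
			    !memcmp(lines[j].ptr, lines[i].ptr, lines[i].size))
				break;
		}
		lines[i].minimal_perfect_hash =
			j < i ? lines[j].minimal_perfect_hash : next_id++;
	}
	return next_id;
}
```

Two lines may share a line_hash by accident, which is why the byte
comparison is still needed, but after classification two lines are equal
exactly when their minimal_perfect_hash values are equal.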

-- 
gitgitgadget


* [PATCH v3 01/10] doc: define unambiguous type mappings across C and Rust
  2025-11-11 19:42   ` [PATCH v3 " Ezekiel Newren via GitGitGadget
@ 2025-11-11 19:42     ` Ezekiel Newren via GitGitGadget
  2025-11-11 20:52       ` Junio C Hamano
  2025-11-11 21:05       ` Junio C Hamano
  2025-11-11 19:42     ` [PATCH v3 02/10] xdiff: use ptrdiff_t for dstart/dend Ezekiel Newren via GitGitGadget
                       ` (10 subsequent siblings)
  11 siblings, 2 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-11 19:42 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

Document other nuances with crossing the FFI boundary. Other language
mappings may be added in the future.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
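A small sketch of the `char` ambiguity that the "Character types" section
of the new document talks about, for illustration only; it is not part of
this patch. All three functions are hypothetical names made up for this
note.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* True where plain char behaves like signed char on this platform. */
static bool plain_char_is_signed(void)
{
	return CHAR_MIN < 0;
}

/*
 * Widening a plain char: a byte with all bits set becomes -1 where char
 * is signed and 255 where it is unsigned.
 */
static int widen_char(char c)
{
	return c;
}

/* Widening a uint8_t: a byte with all bits set is 255 everywhere. */
static int widen_u8(uint8_t b)
{
	return b;
}
```

The same byte therefore yields two different integers through `char`
depending on the platform, while the `uint8_t` path has a single answer,
matching Rust's `u8`.
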
 Documentation/Makefile                        |   1 +
 Documentation/technical/meson.build           |   1 +
 .../technical/unambiguous-types.adoc          | 239 ++++++++++++++++++
 3 files changed, 241 insertions(+)
 create mode 100644 Documentation/technical/unambiguous-types.adoc

diff --git a/Documentation/Makefile b/Documentation/Makefile
index 04e9e10b27..bc1adb2d9d 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -142,6 +142,7 @@ TECH_DOCS += technical/shallow
 TECH_DOCS += technical/sparse-checkout
 TECH_DOCS += technical/sparse-index
 TECH_DOCS += technical/trivial-merge
+TECH_DOCS += technical/unambiguous-types
 TECH_DOCS += technical/unit-tests
 SP_ARTICLES += $(TECH_DOCS)
 SP_ARTICLES += technical/api-index
diff --git a/Documentation/technical/meson.build b/Documentation/technical/meson.build
index be698ef22a..89a6e26821 100644
--- a/Documentation/technical/meson.build
+++ b/Documentation/technical/meson.build
@@ -32,6 +32,7 @@ articles = [
   'sparse-checkout.adoc',
   'sparse-index.adoc',
   'trivial-merge.adoc',
+  'unambiguous-types.adoc',
   'unit-tests.adoc',
 ]
 
diff --git a/Documentation/technical/unambiguous-types.adoc b/Documentation/technical/unambiguous-types.adoc
new file mode 100644
index 0000000000..6bca39209b
--- /dev/null
+++ b/Documentation/technical/unambiguous-types.adoc
@@ -0,0 +1,239 @@
+= Unambiguous types
+
+Most of these mappings are obvious, but there are some nuances and gotchas with
+Rust FFI (Foreign Function Interface).
+
+This document defines clear, one-to-one mappings between primitive types in C
+and Rust (and possibly other languages in the future). Its purpose is to
+eliminate ambiguity in type widths, signedness, and binary representation
+across platforms and languages.
+
+For Git, the only header required to use these unambiguous types in C is
+`git-compat-util.h`.
+
+== Boolean types
+[cols="1,1", options="header"]
+|===
+| C Type | Rust Type
+| bool^1^       | bool
+|===
+
+== Integer types
+
+In C, `<stdint.h>` (or an equivalent) must be included.
+
+[cols="1,1", options="header"]
+|===
+| C Type | Rust Type
+| uint8_t    | u8
+| uint16_t   | u16
+| uint32_t   | u32
+| uint64_t   | u64
+
+| int8_t     | i8
+| int16_t    | i16
+| int32_t    | i32
+| int64_t    | i64
+|===
+
+== Floating-point types
+
+Rust requires IEEE-754 semantics.
+In C, that is typically true, but not guaranteed by the standard.
+
+[cols="1,1", options="header"]
+|===
+| C Type | Rust Type
+| float^2^      | f32
+| double^2^     | f64
+|===
+
+== Size types
+
+These types represent pointer-sized integers and are typically defined in
+`<stddef.h>` or an equivalent header.
+
+Size types should be used any time pointer arithmetic is performed, e.g. when
+indexing an array or describing the number of elements in memory.
+
+[cols="1,1", options="header"]
+|===
+| C Type | Rust Type
+| size_t^3^     | usize
+| ptrdiff_t^3^  | isize
+|===
+
+== Character types
+
+This is where C and Rust don't have a clean one-to-one mapping.
+
+C comparison problem: `char` is a distinct type from both `signed char` and
+`unsigned char`, and whether it is signed is implementation defined. Building
+with `make DEVELOPER=1` produces "differ in signedness" complaints when `char`
+is compared with `uint8_t` or `int8_t`.
+
+Rust's `char` type is a 32-bit type that represents a Unicode scalar value.
+Even though a C `char` is the same width as `u8`, `char` should be converted
+to `u8` where it is describing bytes in memory. If a C `char` is not
+describing bytes, then it should be converted to a more accurate unambiguous
+type. Unicode is mentioned here because of how `&str` is defined in Rust and
+how a `&str` is created from `&[u8]`. Rust assumes that a `&str` is correctly
+encoded UTF-8, i.e. text in memory, whereas a C `char` makes no assumption
+about the bytes that it represents.
+
+```
+let raw_bytes = b"abc\n";
+let result = std::str::from_utf8(raw_bytes);
+if let Ok(line) = result {
+    // do something with text
+}
+```
+
+While you could specify `char` in the C code and `u8` in Rust code, it's not as
+clear what the appropriate type is, but it would work across the FFI boundary.
+However, the bigger problem comes from code generation tools like cbindgen and
+bindgen. When cbindgen sees u8 in Rust it will generate uint8_t on the C side
+which will cause "differ in signedness" warnings/errors. Similarly, if bindgen
+sees `char` on the C side it will generate `std::ffi::c_char` which has its own
+problems.
+
+=== Notes
+^1^ This is only true if stdbool.h (or equivalent) is used. +
+^2^ C does not enforce IEEE-754 compatibility, but Rust expects it. If the
+platform/arch for C does not follow IEEE-754 then this equivalence does not
+hold. Also, it's assumed that `float` is 32 bits and `double` is 64, but
+there may be a strange platform/arch where even this isn't true. +
+^3^ C also defines uintptr_t, ssize_t and intptr_t, but these types are
+discouraged for FFI purposes. For functions like `read()` and `write()` ssize_t
+should be cast to a different, and unambiguous, type before being passed over
+the FFI boundary. +
+
+== Problems with std::ffi::c_* types in Rust
+TL;DR: In practice, Rust's `c_*` types aren't guaranteed to match C types for
+all possible C compilers, platforms, or architectures, because Rust only
+ensures correctness of C types on officially supported targets. These
+definitions have changed over time to match more targets which means that the
+c_* definitions will differ based on which Rust version Git chooses to use.
+
+Current list of safe, Rust side, FFI types in Git: +
+
+* `c_void`
+* `CStr`
+* `CString`
+
+Even then, they should be used sparingly, and only where the semantics match
+exactly.
+
+The std::os::raw::c_* types directly inherit the problems of core::ffi, which
+changes over time and seems to make a best guess at the correct definition for
+a given platform/target. This probably isn't a problem for the platforms that
+Rust currently supports, but can anyone say that Rust got it right for all
+C compilers of all platforms/targets?
+
+To give an example: c_long is defined in
+footnote:[https://doc.rust-lang.org/1.63.0/src/core/ffi/mod.rs.html#175-189[c_long in 1.63.0]]
+footnote:[https://doc.rust-lang.org/1.89.0/src/core/ffi/primitives.rs.html#135-151[c_long in 1.89.0]]
+
+=== Rust version 1.63.0
+
+```
+mod c_long_definition {
+    cfg_if! {
+        if #[cfg(all(target_pointer_width = "64", not(windows)))] {
+            pub type c_long = i64;
+            pub type NonZero_c_long = crate::num::NonZeroI64;
+            pub type c_ulong = u64;
+            pub type NonZero_c_ulong = crate::num::NonZeroU64;
+        } else {
+            // The minimal size of `long` in the C standard is 32 bits
+            pub type c_long = i32;
+            pub type NonZero_c_long = crate::num::NonZeroI32;
+            pub type c_ulong = u32;
+            pub type NonZero_c_ulong = crate::num::NonZeroU32;
+        }
+    }
+}
+```
+
+=== Rust version 1.89.0
+
+```
+mod c_long_definition {
+    crate::cfg_select! {
+        any(
+            all(target_pointer_width = "64", not(windows)),
+            // wasm32 Linux ABI uses 64-bit long
+            all(target_arch = "wasm32", target_os = "linux")
+        ) => {
+            pub(super) type c_long = i64;
+            pub(super) type c_ulong = u64;
+        }
+        _ => {
+            // The minimal size of `long` in the C standard is 32 bits
+            pub(super) type c_long = i32;
+            pub(super) type c_ulong = u32;
+        }
+    }
+}
+```
+
+Even for the cases where C types are correctly mapped to Rust types via
+std::ffi::c_* there are still problems. Let's take c_char for example. On some
+platforms it's u8, on others it's i8.
+
+=== Subtraction underflow in debug mode
+
+The following code will panic in debug on platforms that define c_char as u8,
+but won't if it's an i8.
+
+```
+let mut x: std::ffi::c_char = 0;
+x -= 1;
+```
+
+=== Inconsistent shift behavior
+
+`x` will be 0xC0 for platforms that use i8, but will be 0x40 where it's u8.
+
+```
+let mut x = 0x80u8 as std::ffi::c_char;
+x >>= 1;
+```
+
+=== Equality fails to compile on some platforms
+
+The following will not compile on platforms that define c_char as i8, but will
+if it's u8. You can cast x e.g. `assert_eq!(x as u8, b'a');`, but then you get
+a warning on platforms that use u8 and a clean compilation where i8 is used.
+
+```
+let mut x: std::ffi::c_char = 0x61;
+assert_eq!(x, b'a');
+```
+
+== Enum types
+Rust enum types should not be used as FFI types. Rust enum types are more like
+C union types than C enums. For something like:
+
+```
+#[repr(u8)]
+enum Fruit {
+    Apple,
+    Banana,
+    Cherry,
+}
+```
+
+It's easy enough to make sure the Rust enum matches what C would expect, but
+not for a more complex type like:
+
+```
+enum HashResult {
+    SHA1([u8; 20]),
+    SHA256([u8; 32]),
+}
+```
+
+The Rust compiler has to add a discriminant to the enum to distinguish between
+the variants. The width, location, and values of that discriminant are up to
+the Rust compiler and are not ABI stable.
-- 
gitgitgadget


* Re: [PATCH v3 01/10] doc: define unambiguous type mappings across C and Rust
  2025-11-11 19:42     ` [PATCH v3 01/10] doc: define unambiguous type mappings across C and Rust Ezekiel Newren via GitGitGadget
@ 2025-11-11 20:52       ` Junio C Hamano
  2025-11-11 21:05       ` Junio C Hamano
  1 sibling, 0 replies; 118+ messages in thread
From: Junio C Hamano @ 2025-11-11 20:52 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget
  Cc: git, Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren

"Ezekiel Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +== Character types
> +
> +This is where C and Rust don't have a clean one-to-one mapping.
> +
> +C comparison problem: While the sign of `char` is implementation defined, it's
> +also signless (neither signed nor unsigned). When building with
> +`make DEVELOPER=1` it will complain about a "differ in signedness" when `char`
> +is compared with `uint8_t` or `int8_t`.
> +
> +Rust's `char` type is an unsigned 32-bit integer that is used to describe
> +Unicode code points. Even though a C `char` is the same width as `u8`, `char`
> +should be converted to u8 where it is describing bytes in memory. If a C
> +`char` is not describing bytes, then it should be converted to a more accurate
> +unambiguous type. The reason for mentioning Unicode here is because of how &str
> +is defined in Rust and how to create a &str from &[u8]. Rust assumes that &str
> +is a correctly encoded utf-8 string, i.e. text in memory. Where as a C `char`
> +makes no assumption about the bytes that it is representing.

Even though you write excuses for bringing up Unicode here, I am
afraid that most of the above is an irrelevant tangent that makes the
point of this documentation muddier.  Anybody who is involved in
this effort would at least know that C's char is not about
representing Unicode codepoints (it is way too narrow for that),
while Rust's char type exactly is, and I do not see much point in
making such an apples-and-oranges comparison to spend extra words
here.

Another thing I found confusing is your mention of &[u8] vs &str.
Surely, Rust will have trouble if we FFI an array of bytes we have on
the C side as an array of u8 and the byte sequence is broken
UTF-8.  But that would not be fixed if you only rewrote C code to
use `uint8_t[]` where it originally used `char[]`, would it?  If we
have on C-side char[] that has iso8859-1 in it, we still would want
to use uint8_t[] when we smuggle the result of passing it to iconv()
to translate that into UTF-8 into Rust.  Or we may pass such an
iso8859-1 encoded string directly as an uint8_t[] byte array to Rust
and let Rust side run an equivalent of iconv() to obtain char array.

The point is that "your byte sequence has to be valid UTF-8" does
not fit well in the narrative here.  If we want to move/interface
the handling of "encoding" header in commit objects with code
written in Rust, this starts to matter.

So even if it is technically correct, it is another irrelevant
tangent when we discuss why we want to use uint8_t on the C side to
help cbindgen/bindgen to map it to u8 on Rust side.

Wouldn't just directly going into

    If a piece of C code uses `char` to represent a byte, it makes
    it easier to interface with Rust to rewrite it to use uint8_t
    and let cbindgen/bindgen map it to u8 on the Rust side.

be clearer?  We never deal with a single Unicode codepoint
or an array of them (we do deal with utf8 encoded array of bytes,
though) on the C side, and I do not think it is likely to change, so
there is nothing lost if we did not talk about how `char` in Rust
behaves at all.
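
A minimal sketch of the point about byte arrays and encodings, written in
Rust because that is the side that consumes the `uint8_t[]`. The helper name
`describe` is hypothetical and not part of the series. Bytes that arrive from
C stay `&[u8]`, and UTF-8 validation happens only where text is actually
needed:

```rust
use std::str;

/// Hypothetical helper: treat bytes received from C as text only on demand.
fn describe(bytes: &[u8]) -> String {
    match str::from_utf8(bytes) {
        Ok(text) => format!("valid UTF-8: {text}"),
        Err(err) => format!("not UTF-8 (valid up to byte {})", err.valid_up_to()),
    }
}

fn main() {
    // The same word encoded as UTF-8 and as ISO-8859-1.
    let utf8: &[u8] = b"caf\xc3\xa9";
    let latin1: &[u8] = b"caf\xe9";

    // Both are perfectly good byte slices; only one is a valid &str.
    println!("{}", describe(utf8));
    println!("{}", describe(latin1));
}
```

Whether the C side declared the buffer as `char[]` or `uint8_t[]` makes no
difference to the validation step; the type change only affects how the
binding generator maps the field.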

And of course, not talking about `char` in Rust does not mean that
we need a rule like "if you want to interface with C, never use
`char` on the Rust side".  `char` may have its uses on Rust side,
just like `char` may have its uses on C side.

Also I do not quite get your precondition "If a C `char` is not
describing bytes".  What `char` in C on modern platforms would
describe something _other_ _than_ bytes?  Even the way things like
varint use `char` is exactly for accessing individual bytes.  Even
when it is used as a space-saver in a structure member whose value
would never exceed 100, i.e., a small integer, we would know and be
implicitly relying on the fact that the member is byte-wide.

Thanks.

* Re: [PATCH v3 01/10] doc: define unambiguous type mappings across C and Rust
  2025-11-11 19:42     ` [PATCH v3 01/10] doc: define unambiguous type mappings across C and Rust Ezekiel Newren via GitGitGadget
  2025-11-11 20:52       ` Junio C Hamano
@ 2025-11-11 21:05       ` Junio C Hamano
  1 sibling, 0 replies; 118+ messages in thread
From: Junio C Hamano @ 2025-11-11 21:05 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget
  Cc: git, Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren

"Ezekiel Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> diff --git a/Documentation/technical/unambiguous-types.adoc b/Documentation/technical/unambiguous-types.adoc
> new file mode 100644
> index 0000000000..6bca39209b
> --- /dev/null
> +++ b/Documentation/technical/unambiguous-types.adoc
> @@ -0,0 +1,239 @@
> += Unambiguous types
> +
> +Most of these mappings are obvious, but there are some nuances and gotchas with
> +Rust FFI (Foreign Function Interface).
> +
> +This document defines clear, one-to-one mappings between primitive types in C,
> +Rust (and possible other languages in the future). Its purpose is to eliminate
> +ambiguity in type widths, signedness, and binary representation across
> +platforms and languages.

This is a laudable goal.  It does a lot more than "to eliminate
ambiguity" at least in some sections.  The section on character
types I already commented on, for example, is full of good points to
list the concerns that developers need to be aware of and careful
about.

> +== Enum types
> +Rust enum types should not be used as FFI types. Rust enum types are more like
> +C union types than C enum's. For something like:
> +
> +```
> +#[repr(C, u8)]
> +enum Fruit {
> +    Apple,
> +    Banana,
> +    Cherry,
> +}
> +```
> +
> +It's easy enough to make sure the Rust enum matches what C would expect, but a
> +more complex type like.
> +
> +```
> +enum HashResult {
> +    SHA1([u8; 20]),
> +    SHA256([u8; 32]),
> +}
> +```
> +
> +The Rust compiler has to add a discriminant to the enum to distinguish between
> +the variants. The width, location, and values for that discriminant is up to
> +the Rust compiler and is not ABI stable.

Good example, as we already do use this one, if I am not mistaken
;-)
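
One way to keep such a type away from the FFI boundary, as a sketch only: the
names `HashResultFfi`, `HASH_KIND_SHA1` and `HASH_KIND_SHA256` are
hypothetical and are not taken from Git's code. The data-carrying enum stays
on the Rust side, and what crosses the boundary is a `#[repr(C)]` struct whose
tag width, position and values are chosen by hand:

```rust
use std::mem::size_of;

/// Hypothetical tag values, fixed by hand so both languages agree on them.
const HASH_KIND_SHA1: u8 = 1;
const HASH_KIND_SHA256: u8 = 2;

/// Hypothetical FFI-facing layout: explicit tag, buffer sized for the
/// largest digest. Unused trailing bytes are zero.
#[repr(C)]
pub struct HashResultFfi {
    pub kind: u8,
    pub bytes: [u8; 32],
}

/// Rust-side convenience type; it never crosses the FFI boundary.
pub enum HashResult {
    Sha1([u8; 20]),
    Sha256([u8; 32]),
}

impl From<&HashResult> for HashResultFfi {
    fn from(h: &HashResult) -> Self {
        match h {
            HashResult::Sha1(d) => {
                let mut bytes = [0u8; 32];
                bytes[..20].copy_from_slice(d);
                HashResultFfi { kind: HASH_KIND_SHA1, bytes }
            }
            HashResult::Sha256(d) => HashResultFfi { kind: HASH_KIND_SHA256, bytes: *d },
        }
    }
}

fn main() {
    let ffi = HashResultFfi::from(&HashResult::Sha1([0xab; 20]));
    // One tag byte plus 32 data bytes, with no padding (alignment is 1).
    println!("kind={} size={}", ffi.kind, size_of::<HashResultFfi>());
}
```

The C side would then declare a struct with a `uint8_t` tag followed by a
`uint8_t[32]` array, which is an assumption of this sketch rather than
something the patch defines.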

* [PATCH v3 02/10] xdiff: use ptrdiff_t for dstart/dend
  2025-11-11 19:42   ` [PATCH v3 " Ezekiel Newren via GitGitGadget
  2025-11-11 19:42     ` [PATCH v3 01/10] doc: define unambiguous type mappings across C and Rust Ezekiel Newren via GitGitGadget
@ 2025-11-11 19:42     ` Ezekiel Newren via GitGitGadget
  2025-11-11 22:23       ` Junio C Hamano
  2025-11-11 19:42     ` [PATCH v3 03/10] xdiff: make xrecord_t.ptr a uint8_t instead of char Ezekiel Newren via GitGitGadget
                       ` (9 subsequent siblings)
  11 siblings, 1 reply; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-11 19:42 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

ptrdiff_t is appropriate for dstart and dend because they both describe
positive or negative offsets relative to a pointer.

A future patch will move these fields to a different struct. Moving
them to the end of xdfile_t now means the field order of xdfile_t will
be disturbed less.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xtypes.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index f145abba3e..7c8c057bca 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -47,10 +47,10 @@ typedef struct s_xrecord {
 typedef struct s_xdfile {
 	xrecord_t *recs;
 	long nrec;
-	long dstart, dend;
 	bool *changed;
 	long *rindex;
 	long nreff;
+	ptrdiff_t dstart, dend;
 } xdfile_t;
 
 typedef struct s_xdfenv {
-- 
gitgitgadget


* Re: [PATCH v3 02/10] xdiff: use ptrdiff_t for dstart/dend
  2025-11-11 19:42     ` [PATCH v3 02/10] xdiff: use ptrdiff_t for dstart/dend Ezekiel Newren via GitGitGadget
@ 2025-11-11 22:23       ` Junio C Hamano
  0 siblings, 0 replies; 118+ messages in thread
From: Junio C Hamano @ 2025-11-11 22:23 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget
  Cc: git, Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren

"Ezekiel Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Ezekiel Newren <ezekielnewren@gmail.com>
>
> ptrdiff_t is appropriate for dstart and dend because they both describe
> positive or negative offsets relative to a pointer.

Makes sense.

> A future patch will move these fields to a different struct. Moving
> them to the end of xdfile_t now, means the field order of xdfile_t will
> be disturbed less.

If these members will be gone from this struct, it wouldn't make any
difference in the end.  I am not sure what you mean by "disturbed
less".  Right now there is a gap between changed and nrec members,
and at some later point, these two members may be adjacent with each
other.  I do not think it would make that much difference if they
become adjacent after this step [02/10], after step [10/10], or in a
separate series (xdiff-cleanup-3?).

> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
> ---
>  xdiff/xtypes.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
> index f145abba3e..7c8c057bca 100644
> --- a/xdiff/xtypes.h
> +++ b/xdiff/xtypes.h
> @@ -47,10 +47,10 @@ typedef struct s_xrecord {
>  typedef struct s_xdfile {
>  	xrecord_t *recs;
>  	long nrec;
> -	long dstart, dend;
>  	bool *changed;
>  	long *rindex;
>  	long nreff;
> +	ptrdiff_t dstart, dend;
>  } xdfile_t;
>  
>  typedef struct s_xdfenv {

* [PATCH v3 03/10] xdiff: make xrecord_t.ptr a uint8_t instead of char
  2025-11-11 19:42   ` [PATCH v3 " Ezekiel Newren via GitGitGadget
  2025-11-11 19:42     ` [PATCH v3 01/10] doc: define unambiguous type mappings across C and Rust Ezekiel Newren via GitGitGadget
  2025-11-11 19:42     ` [PATCH v3 02/10] xdiff: use ptrdiff_t for dstart/dend Ezekiel Newren via GitGitGadget
@ 2025-11-11 19:42     ` Ezekiel Newren via GitGitGadget
  2025-11-11 22:53       ` Junio C Hamano
  2025-11-11 19:42     ` [PATCH v3 04/10] xdiff: use size_t for xrecord_t.size Ezekiel Newren via GitGitGadget
                       ` (8 subsequent siblings)
  11 siblings, 1 reply; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-11 19:42 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

Make xrecord_t.ptr uint8_t because it's referring to bytes in memory.

Every usage of this field was inspected and cast to char*, or similar,
to avoid signedness warnings/errors from the compiler. Casting was used
so that the whole of xdiff doesn't need to be refactored in order to
change the type of this field.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xdiffi.c    |  8 ++++----
 xdiff/xemit.c     |  6 +++---
 xdiff/xmerge.c    | 14 +++++++-------
 xdiff/xpatience.c |  2 +-
 xdiff/xprepare.c  |  6 +++---
 xdiff/xtypes.h    |  2 +-
 xdiff/xutils.c    |  4 ++--
 7 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 6f3998ee54..411a8aa69f 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -407,7 +407,7 @@ static int get_indent(xrecord_t *rec)
 	int ret = 0;
 
 	for (i = 0; i < rec->size; i++) {
-		char c = rec->ptr[i];
+		uint8_t c = rec->ptr[i];
 
 		if (!XDL_ISSPACE(c))
 			return ret;
@@ -993,11 +993,11 @@ static void xdl_mark_ignorable_lines(xdchange_t *xscr, xdfenv_t *xe, long flags)
 
 		rec = &xe->xdf1.recs[xch->i1];
 		for (i = 0; i < xch->chg1 && ignore; i++)
-			ignore = xdl_blankline(rec[i].ptr, rec[i].size, flags);
+			ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
 
 		rec = &xe->xdf2.recs[xch->i2];
 		for (i = 0; i < xch->chg2 && ignore; i++)
-			ignore = xdl_blankline(rec[i].ptr, rec[i].size, flags);
+			ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
 
 		xch->ignore = ignore;
 	}
@@ -1008,7 +1008,7 @@ static int record_matches_regex(xrecord_t *rec, xpparam_t const *xpp) {
 	size_t i;
 
 	for (i = 0; i < xpp->ignore_regex_nr; i++)
-		if (!regexec_buf(xpp->ignore_regex[i], rec->ptr, rec->size, 1,
+		if (!regexec_buf(xpp->ignore_regex[i], (const char *)rec->ptr, rec->size, 1,
 				 &regmatch, 0))
 			return 1;
 
diff --git a/xdiff/xemit.c b/xdiff/xemit.c
index b2f1f30cd3..ead930088a 100644
--- a/xdiff/xemit.c
+++ b/xdiff/xemit.c
@@ -27,7 +27,7 @@ static int xdl_emit_record(xdfile_t *xdf, long ri, char const *pre, xdemitcb_t *
 {
 	xrecord_t *rec = &xdf->recs[ri];
 
-	if (xdl_emit_diffrec(rec->ptr, rec->size, pre, strlen(pre), ecb) < 0)
+	if (xdl_emit_diffrec((char const *)rec->ptr, rec->size, pre, strlen(pre), ecb) < 0)
 		return -1;
 
 	return 0;
@@ -113,8 +113,8 @@ static long match_func_rec(xdfile_t *xdf, xdemitconf_t const *xecfg, long ri,
 	xrecord_t *rec = &xdf->recs[ri];
 
 	if (!xecfg->find_func)
-		return def_ff(rec->ptr, rec->size, buf, sz);
-	return xecfg->find_func(rec->ptr, rec->size, buf, sz, xecfg->find_func_priv);
+		return def_ff((const char *)rec->ptr, rec->size, buf, sz);
+	return xecfg->find_func((const char *)rec->ptr, rec->size, buf, sz, xecfg->find_func_priv);
 }
 
 static int is_func_rec(xdfile_t *xdf, xdemitconf_t const *xecfg, long ri)
diff --git a/xdiff/xmerge.c b/xdiff/xmerge.c
index fd600cbb5d..75cb3e76a2 100644
--- a/xdiff/xmerge.c
+++ b/xdiff/xmerge.c
@@ -101,8 +101,8 @@ static int xdl_merge_cmp_lines(xdfenv_t *xe1, int i1, xdfenv_t *xe2, int i2,
 	xrecord_t *rec2 = xe2->xdf2.recs + i2;
 
 	for (i = 0; i < line_count; i++) {
-		int result = xdl_recmatch(rec1[i].ptr, rec1[i].size,
-			rec2[i].ptr, rec2[i].size, flags);
+		int result = xdl_recmatch((const char *)rec1[i].ptr, rec1[i].size,
+			(const char *)rec2[i].ptr, rec2[i].size, flags);
 		if (!result)
 			return -1;
 	}
@@ -324,8 +324,8 @@ static int xdl_fill_merge_buffer(xdfenv_t *xe1, const char *name1,
 
 static int recmatch(xrecord_t *rec1, xrecord_t *rec2, unsigned long flags)
 {
-	return xdl_recmatch(rec1->ptr, rec1->size,
-			    rec2->ptr, rec2->size, flags);
+	return xdl_recmatch((const char *)rec1->ptr, rec1->size,
+			    (const char *)rec2->ptr, rec2->size, flags);
 }
 
 /*
@@ -382,10 +382,10 @@ static int xdl_refine_conflicts(xdfenv_t *xe1, xdfenv_t *xe2, xdmerge_t *m,
 		 * we have a very simple mmfile structure.
 		 */
 		t1.ptr = (char *)xe1->xdf2.recs[m->i1].ptr;
-		t1.size = xe1->xdf2.recs[m->i1 + m->chg1 - 1].ptr
+		t1.size = (char *)xe1->xdf2.recs[m->i1 + m->chg1 - 1].ptr
 			+ xe1->xdf2.recs[m->i1 + m->chg1 - 1].size - t1.ptr;
 		t2.ptr = (char *)xe2->xdf2.recs[m->i2].ptr;
-		t2.size = xe2->xdf2.recs[m->i2 + m->chg2 - 1].ptr
+		t2.size = (char *)xe2->xdf2.recs[m->i2 + m->chg2 - 1].ptr
 			+ xe2->xdf2.recs[m->i2 + m->chg2 - 1].size - t2.ptr;
 		if (xdl_do_diff(&t1, &t2, xpp, &xe) < 0)
 			return -1;
@@ -440,7 +440,7 @@ static int line_contains_alnum(const char *ptr, long size)
 static int lines_contain_alnum(xdfenv_t *xe, int i, int chg)
 {
 	for (; chg; chg--, i++)
-		if (line_contains_alnum(xe->xdf2.recs[i].ptr,
+		if (line_contains_alnum((const char *)xe->xdf2.recs[i].ptr,
 				xe->xdf2.recs[i].size))
 			return 1;
 	return 0;
diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
index 669b653580..bb61354f22 100644
--- a/xdiff/xpatience.c
+++ b/xdiff/xpatience.c
@@ -121,7 +121,7 @@ static void insert_record(xpparam_t const *xpp, int line, struct hashmap *map,
 		return;
 	map->entries[index].line1 = line;
 	map->entries[index].hash = record->ha;
-	map->entries[index].anchor = is_anchor(xpp, map->env->xdf1.recs[line - 1].ptr);
+	map->entries[index].anchor = is_anchor(xpp, (const char *)map->env->xdf1.recs[line - 1].ptr);
 	if (!map->first)
 		map->first = map->entries + index;
 	if (map->last) {
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 192334f1b7..4c56467076 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -99,8 +99,8 @@ static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
 	hi = (long) XDL_HASHLONG(rec->ha, cf->hbits);
 	for (rcrec = cf->rchash[hi]; rcrec; rcrec = rcrec->next)
 		if (rcrec->rec.ha == rec->ha &&
-				xdl_recmatch(rcrec->rec.ptr, rcrec->rec.size,
-					rec->ptr, rec->size, cf->flags))
+				xdl_recmatch((const char *)rcrec->rec.ptr, rcrec->rec.size,
+					(const char *)rec->ptr, rec->size, cf->flags))
 			break;
 
 	if (!rcrec) {
@@ -156,7 +156,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 			if (XDL_ALLOC_GROW(xdf->recs, xdf->nrec + 1, narec))
 				goto abort;
 			crec = &xdf->recs[xdf->nrec++];
-			crec->ptr = prev;
+			crec->ptr = (uint8_t const *)prev;
 			crec->size = (long) (cur - prev);
 			crec->ha = hav;
 			if (xdl_classify_record(pass, cf, crec) < 0)
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index 7c8c057bca..b1c520a378 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -39,7 +39,7 @@ typedef struct s_chastore {
 } chastore_t;
 
 typedef struct s_xrecord {
-	char const *ptr;
+	uint8_t const *ptr;
 	long size;
 	unsigned long ha;
 } xrecord_t;
diff --git a/xdiff/xutils.c b/xdiff/xutils.c
index 447e66c719..7be063bfb6 100644
--- a/xdiff/xutils.c
+++ b/xdiff/xutils.c
@@ -465,10 +465,10 @@ int xdl_fall_back_diff(xdfenv_t *diff_env, xpparam_t const *xpp,
 	xdfenv_t env;
 
 	subfile1.ptr = (char *)diff_env->xdf1.recs[line1 - 1].ptr;
-	subfile1.size = diff_env->xdf1.recs[line1 + count1 - 2].ptr +
+	subfile1.size = (char *)diff_env->xdf1.recs[line1 + count1 - 2].ptr +
 		diff_env->xdf1.recs[line1 + count1 - 2].size - subfile1.ptr;
 	subfile2.ptr = (char *)diff_env->xdf2.recs[line2 - 1].ptr;
-	subfile2.size = diff_env->xdf2.recs[line2 + count2 - 2].ptr +
+	subfile2.size = (char *)diff_env->xdf2.recs[line2 + count2 - 2].ptr +
 		diff_env->xdf2.recs[line2 + count2 - 2].size - subfile2.ptr;
 	if (xdl_do_diff(&subfile1, &subfile2, xpp, &env) < 0)
 		return -1;
-- 
gitgitgadget


* Re: [PATCH v3 03/10] xdiff: make xrecord_t.ptr a uint8_t instead of char
  2025-11-11 19:42     ` [PATCH v3 03/10] xdiff: make xrecord_t.ptr a uint8_t instead of char Ezekiel Newren via GitGitGadget
@ 2025-11-11 22:53       ` Junio C Hamano
  0 siblings, 0 replies; 118+ messages in thread
From: Junio C Hamano @ 2025-11-11 22:53 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget
  Cc: git, Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren

"Ezekiel Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Ezekiel Newren <ezekielnewren@gmail.com>
>
> Make xrecord_t.ptr uint8_t because it's referring to bytes in memory.
>
> Every usage of this field was inspected and cast to char*, or similar,

"inspected and changed to cast to"?

> to avoid signedness warnings/errors from the compiler. Casting was used
> so that the whole of xdiff doesn't need to be refactored in order to
> change the type of this field.

> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
> ---
>  xdiff/xdiffi.c    |  8 ++++----
>  xdiff/xemit.c     |  6 +++---
>  xdiff/xmerge.c    | 14 +++++++-------
>  xdiff/xpatience.c |  2 +-
>  xdiff/xprepare.c  |  6 +++---
>  xdiff/xtypes.h    |  2 +-
>  xdiff/xutils.c    |  4 ++--
>  7 files changed, 21 insertions(+), 21 deletions(-)
>
> diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
> index 6f3998ee54..411a8aa69f 100644
> --- a/xdiff/xdiffi.c
> +++ b/xdiff/xdiffi.c
> @@ -407,7 +407,7 @@ static int get_indent(xrecord_t *rec)
>  	int ret = 0;
>  
>  	for (i = 0; i < rec->size; i++) {
> -		char c = rec->ptr[i];
> +		uint8_t c = rec->ptr[i];

rec->ptr[] is now an array of uint8_t, so this is not "inspected and
cast to".  It is unclear from limited context lines how 'c' is used
by the existing code, but one example here ...

>  		if (!XDL_ISSPACE(c))
>  			return ret;

... in the post context assumes that XDL_ISSPACE(), which was
designed to work with `char` (of implementation-defined signedness),
would safely accept an `unsigned char` (let's admit it; for all
practical purposes, uint8_t is equivalent to unsigned char while we
are looking at C code) so the updated code should work fine.

The definition of XDL_ISSPACE(c) indeed casts `c` to "unsigned char"
as the first thing it does, and other tests in this if/else if
cascade (hidden in the post context of this hunk) are equality
comparisons with ' ' and '\t', so this conversion is safe.
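
To illustrate why classifying the byte as unsigned is the safe direction,
here is a small Rust sketch. The function `is_space` is a hypothetical
stand-in and is not the definition of XDL_ISSPACE(). A byte with the high bit
set keeps its value as `u8` but turns negative when reinterpreted as a signed
8-bit integer, which is the hazard the cast to `unsigned char` avoids on
platforms where `char` is signed:

```rust
/// Hypothetical stand-in for a C-style isspace() that works on raw bytes.
fn is_space(c: u8) -> bool {
    matches!(c, b' ' | b'\t' | b'\n' | b'\r' | 0x0b | 0x0c)
}

fn main() {
    let byte: u8 = 0xE9; // e.g. an ISO-8859-1 letter in a line of text
    let signed = byte as i8; // same bits, reinterpreted as signed

    // As u8 the value stays in 0..=255; as i8 it becomes negative.
    println!("u8={} i8={} is_space={}", byte, signed, is_space(byte));

    // Equality tests against ASCII characters behave the same either way.
    assert!(is_space(b' '));
    assert!(is_space(b'\t'));
}
```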

Either way would work so it is a minor point, but instead of
changing the type of `c` to u8, casting it to `char` as the
proposed log message explained, i.e.,

		char c = (char)rec->ptr[i];

would have made it much easier to reason about why this code after
the patch is still correct.

> @@ -382,10 +382,10 @@ static int xdl_refine_conflicts(xdfenv_t *xe1, xdfenv_t *xe2, xdmerge_t *m,
>  		 * we have a very simple mmfile structure.
>  		 */
>  		t1.ptr = (char *)xe1->xdf2.recs[m->i1].ptr;
> -		t1.size = xe1->xdf2.recs[m->i1 + m->chg1 - 1].ptr
> +		t1.size = (char *)xe1->xdf2.recs[m->i1 + m->chg1 - 1].ptr
>  			+ xe1->xdf2.recs[m->i1 + m->chg1 - 1].size - t1.ptr;

The ptr member in the t1 and t2 struct is still of type (char *), so
the size computation is performed as ptrdiff between two (char *),
which makes sense.

> @@ -156,7 +156,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
>  			if (XDL_ALLOC_GROW(xdf->recs, xdf->nrec + 1, narec))
>  				goto abort;
>  			crec = &xdf->recs[xdf->nrec++];
> -			crec->ptr = prev;
> +			crec->ptr = (uint8_t const *)prev;
>  			crec->size = (long) (cur - prev);

Hmm, it is tempting to fix this while at it, but I guess the ".size"
member being "long" will be updated to use ptrdiff_t or something
more appropriate in a later step.

Looking sensible.

* [PATCH v3 04/10] xdiff: use size_t for xrecord_t.size
  2025-11-11 19:42   ` [PATCH v3 " Ezekiel Newren via GitGitGadget
                       ` (2 preceding siblings ...)
  2025-11-11 19:42     ` [PATCH v3 03/10] xdiff: make xrecord_t.ptr a uint8_t instead of char Ezekiel Newren via GitGitGadget
@ 2025-11-11 19:42     ` Ezekiel Newren via GitGitGadget
  2025-11-11 23:08       ` Junio C Hamano
  2025-11-11 19:42     ` [PATCH v3 05/10] xdiff: use unambiguous types in xdl_hash_record() Ezekiel Newren via GitGitGadget
                       ` (7 subsequent siblings)
  11 siblings, 1 reply; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-11 19:42 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

size_t is the appropriate type because size is describing the number of
elements, bytes in this case, in memory.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xdiffi.c   |  7 +++----
 xdiff/xemit.c    |  8 ++++----
 xdiff/xmerge.c   | 16 ++++++++--------
 xdiff/xprepare.c |  6 +++---
 xdiff/xtypes.h   |  2 +-
 5 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 411a8aa69f..edd05466df 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -403,10 +403,9 @@ static int recs_match(xrecord_t *rec1, xrecord_t *rec2)
  */
 static int get_indent(xrecord_t *rec)
 {
-	long i;
 	int ret = 0;
 
-	for (i = 0; i < rec->size; i++) {
+	for (size_t i = 0; i < rec->size; i++) {
 		uint8_t c = rec->ptr[i];
 
 		if (!XDL_ISSPACE(c))
@@ -993,11 +992,11 @@ static void xdl_mark_ignorable_lines(xdchange_t *xscr, xdfenv_t *xe, long flags)
 
 		rec = &xe->xdf1.recs[xch->i1];
 		for (i = 0; i < xch->chg1 && ignore; i++)
-			ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
+			ignore = xdl_blankline((const char *)rec[i].ptr, (long)rec[i].size, flags);
 
 		rec = &xe->xdf2.recs[xch->i2];
 		for (i = 0; i < xch->chg2 && ignore; i++)
-			ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
+			ignore = xdl_blankline((const char *)rec[i].ptr, (long)rec[i].size, flags);
 
 		xch->ignore = ignore;
 	}
diff --git a/xdiff/xemit.c b/xdiff/xemit.c
index ead930088a..2f8007753c 100644
--- a/xdiff/xemit.c
+++ b/xdiff/xemit.c
@@ -27,7 +27,7 @@ static int xdl_emit_record(xdfile_t *xdf, long ri, char const *pre, xdemitcb_t *
 {
 	xrecord_t *rec = &xdf->recs[ri];
 
-	if (xdl_emit_diffrec((char const *)rec->ptr, rec->size, pre, strlen(pre), ecb) < 0)
+	if (xdl_emit_diffrec((char const *)rec->ptr, (long)rec->size, pre, strlen(pre), ecb) < 0)
 		return -1;
 
 	return 0;
@@ -113,8 +113,8 @@ static long match_func_rec(xdfile_t *xdf, xdemitconf_t const *xecfg, long ri,
 	xrecord_t *rec = &xdf->recs[ri];
 
 	if (!xecfg->find_func)
-		return def_ff((const char *)rec->ptr, rec->size, buf, sz);
-	return xecfg->find_func((const char *)rec->ptr, rec->size, buf, sz, xecfg->find_func_priv);
+		return def_ff((const char *)rec->ptr, (long)rec->size, buf, sz);
+	return xecfg->find_func((const char *)rec->ptr, (long)rec->size, buf, sz, xecfg->find_func_priv);
 }
 
 static int is_func_rec(xdfile_t *xdf, xdemitconf_t const *xecfg, long ri)
@@ -151,7 +151,7 @@ static long get_func_line(xdfenv_t *xe, xdemitconf_t const *xecfg,
 static int is_empty_rec(xdfile_t *xdf, long ri)
 {
 	xrecord_t *rec = &xdf->recs[ri];
-	long i = 0;
+	size_t i = 0;
 
 	for (; i < rec->size && XDL_ISSPACE(rec->ptr[i]); i++);
 
diff --git a/xdiff/xmerge.c b/xdiff/xmerge.c
index 75cb3e76a2..0dd4558a32 100644
--- a/xdiff/xmerge.c
+++ b/xdiff/xmerge.c
@@ -101,8 +101,8 @@ static int xdl_merge_cmp_lines(xdfenv_t *xe1, int i1, xdfenv_t *xe2, int i2,
 	xrecord_t *rec2 = xe2->xdf2.recs + i2;
 
 	for (i = 0; i < line_count; i++) {
-		int result = xdl_recmatch((const char *)rec1[i].ptr, rec1[i].size,
-			(const char *)rec2[i].ptr, rec2[i].size, flags);
+		int result = xdl_recmatch((const char *)rec1[i].ptr, (long)rec1[i].size,
+			(const char *)rec2[i].ptr, (long)rec2[i].size, flags);
 		if (!result)
 			return -1;
 	}
@@ -119,11 +119,11 @@ static int xdl_recs_copy_0(int use_orig, xdfenv_t *xe, int i, int count, int nee
 	if (count < 1)
 		return 0;
 
-	for (i = 0; i < count; size += recs[i++].size)
+	for (i = 0; i < count; size += (int)recs[i++].size)
 		if (dest)
 			memcpy(dest + size, recs[i].ptr, recs[i].size);
 	if (add_nl) {
-		i = recs[count - 1].size;
+		i = (int)recs[count - 1].size;
 		if (i == 0 || recs[count - 1].ptr[i - 1] != '\n') {
 			if (needs_cr) {
 				if (dest)
@@ -156,7 +156,7 @@ static int xdl_orig_copy(xdfenv_t *xe, int i, int count, int needs_cr, int add_n
  */
 static int is_eol_crlf(xdfile_t *file, int i)
 {
-	long size;
+	size_t size;
 
 	if (i < file->nrec - 1)
 		/* All lines before the last *must* end in LF */
@@ -324,8 +324,8 @@ static int xdl_fill_merge_buffer(xdfenv_t *xe1, const char *name1,
 
 static int recmatch(xrecord_t *rec1, xrecord_t *rec2, unsigned long flags)
 {
-	return xdl_recmatch((const char *)rec1->ptr, rec1->size,
-			    (const char *)rec2->ptr, rec2->size, flags);
+	return xdl_recmatch((const char *)rec1->ptr, (long)rec1->size,
+			    (const char *)rec2->ptr, (long)rec2->size, flags);
 }
 
 /*
@@ -441,7 +441,7 @@ static int lines_contain_alnum(xdfenv_t *xe, int i, int chg)
 {
 	for (; chg; chg--, i++)
 		if (line_contains_alnum((const char *)xe->xdf2.recs[i].ptr,
-				xe->xdf2.recs[i].size))
+				(long)xe->xdf2.recs[i].size))
 			return 1;
 	return 0;
 }
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 4c56467076..b3219aed3e 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -99,8 +99,8 @@ static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
 	hi = (long) XDL_HASHLONG(rec->ha, cf->hbits);
 	for (rcrec = cf->rchash[hi]; rcrec; rcrec = rcrec->next)
 		if (rcrec->rec.ha == rec->ha &&
-				xdl_recmatch((const char *)rcrec->rec.ptr, rcrec->rec.size,
-					(const char *)rec->ptr, rec->size, cf->flags))
+				xdl_recmatch((const char *)rcrec->rec.ptr, (long)rcrec->rec.size,
+					(const char *)rec->ptr, (long)rec->size, cf->flags))
 			break;
 
 	if (!rcrec) {
@@ -157,7 +157,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 				goto abort;
 			crec = &xdf->recs[xdf->nrec++];
 			crec->ptr = (uint8_t const *)prev;
-			crec->size = (long) (cur - prev);
+			crec->size = cur - prev;
 			crec->ha = hav;
 			if (xdl_classify_record(pass, cf, crec) < 0)
 				goto abort;
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index b1c520a378..88b1fe4649 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -40,7 +40,7 @@ typedef struct s_chastore {
 
 typedef struct s_xrecord {
 	uint8_t const *ptr;
-	long size;
+	size_t size;
 	unsigned long ha;
 } xrecord_t;
 
-- 
gitgitgadget


* Re: [PATCH v3 04/10] xdiff: use size_t for xrecord_t.size
  2025-11-11 19:42     ` [PATCH v3 04/10] xdiff: use size_t for xrecord_t.size Ezekiel Newren via GitGitGadget
@ 2025-11-11 23:08       ` Junio C Hamano
  2025-11-14  6:02         ` Ezekiel Newren
  0 siblings, 1 reply; 118+ messages in thread
From: Junio C Hamano @ 2025-11-11 23:08 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget
  Cc: git, Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren

"Ezekiel Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Ezekiel Newren <ezekielnewren@gmail.com>
>
> size_t is the appropriate type because size is describing the number of
> elements, bytes in this case, in memory.
>
> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
> ---
>  xdiff/xdiffi.c   |  7 +++----
>  xdiff/xemit.c    |  8 ++++----
>  xdiff/xmerge.c   | 16 ++++++++--------
>  xdiff/xprepare.c |  6 +++---
>  xdiff/xtypes.h   |  2 +-
>  5 files changed, 19 insertions(+), 20 deletions(-)

This step looks mostly OK but it is messy in some places.

> diff --git a/xdiff/xemit.c b/xdiff/xemit.c
> index ead930088a..2f8007753c 100644
> --- a/xdiff/xemit.c
> +++ b/xdiff/xemit.c
> @@ -27,7 +27,7 @@ static int xdl_emit_record(xdfile_t *xdf, long ri, char const *pre, xdemitcb_t *
>  {
>  	xrecord_t *rec = &xdf->recs[ri];
>  
> -	if (xdl_emit_diffrec((char const *)rec->ptr, rec->size, pre, strlen(pre), ecb) < 0)
> +	if (xdl_emit_diffrec((char const *)rec->ptr, (long)rec->size, pre, strlen(pre), ecb) < 0)

On platforms where long is narrower than size_t, we'd tentatively
leave things broken until we update xdl_emit_diffrec() to take
size_t, as it would become too noisy to change it in the same patch,
I guess?
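
A minimal sketch of the concern, assuming a hypothetical helper
(size_to_long_checked() is not part of xdiff) that makes the
narrowing explicit instead of relying on a bare (long) cast:

```c
#include <limits.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of xdiff: narrow a size_t to long.
 * Returns 0 and stores the value in *out when it fits; returns -1
 * when it does not, which can only happen where long is narrower
 * than size_t (for example 32-bit long with 64-bit size_t).
 */
static int size_to_long_checked(size_t n, long *out)
{
	if (n > (size_t)LONG_MAX)
		return -1;
	*out = (long)n;
	return 0;
}
```

A plain (long) cast would silently wrap in the failing case; whether
that case is reachable depends on how large a single record can be.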

> @@ -113,8 +113,8 @@ static long match_func_rec(xdfile_t *xdf, xdemitconf_t const *xecfg, long ri,
>  	xrecord_t *rec = &xdf->recs[ri];
>  
>  	if (!xecfg->find_func)
> -		return def_ff((const char *)rec->ptr, rec->size, buf, sz);
> -	return xecfg->find_func((const char *)rec->ptr, rec->size, buf, sz, xecfg->find_func_priv);
> +		return def_ff((const char *)rec->ptr, (long)rec->size, buf, sz);
> +	return xecfg->find_func((const char *)rec->ptr, (long)rec->size, buf, sz, xecfg->find_func_priv);

Ditto.

> diff --git a/xdiff/xmerge.c b/xdiff/xmerge.c
> index 75cb3e76a2..0dd4558a32 100644
> --- a/xdiff/xmerge.c
> +++ b/xdiff/xmerge.c
> @@ -101,8 +101,8 @@ static int xdl_merge_cmp_lines(xdfenv_t *xe1, int i1, xdfenv_t *xe2, int i2,
>  	xrecord_t *rec2 = xe2->xdf2.recs + i2;
>  
>  	for (i = 0; i < line_count; i++) {
> -		int result = xdl_recmatch((const char *)rec1[i].ptr, rec1[i].size,
> -			(const char *)rec2[i].ptr, rec2[i].size, flags);
> +		int result = xdl_recmatch((const char *)rec1[i].ptr, (long)rec1[i].size,
> +			(const char *)rec2[i].ptr, (long)rec2[i].size, flags);

Ditto.

> @@ -119,11 +119,11 @@ static int xdl_recs_copy_0(int use_orig, xdfenv_t *xe, int i, int count, int nee
>  	if (count < 1)
>  		return 0;
>  
> -	for (i = 0; i < count; size += recs[i++].size)
> +	for (i = 0; i < count; size += (int)recs[i++].size)
>  		if (dest)
>  			memcpy(dest + size, recs[i].ptr, recs[i].size);
>  	if (add_nl) {
> -		i = recs[count - 1].size;
> +		i = (int)recs[count - 1].size;
>  		if (i == 0 || recs[count - 1].ptr[i - 1] != '\n') {
>  			if (needs_cr) {
>  				if (dest)

This is messier than I expected.  Before the precontext of this
hunk, "i" and "count" are both incoming parameters of type "int", so
the same "what if size_t is wider?" puzzlement applies here.  At
least, the reason why "i" and "count" are "int" is not because they
want to be able to express negative values, so it shouldn't involve
too much hassle if we later want to change them to size_t to lose
these casts.
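
To illustrate what losing the casts could look like, here is a
simplified sketch of the copy loop with the index, count and running
size all as size_t.  The struct and function names are hypothetical
stand-ins, not the real xdl_recs_copy_0():

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for xrecord_t. */
struct rec_sketch {
	const uint8_t *ptr;
	size_t size;
};

/*
 * Copy `count` records into dest (when dest is non-NULL) and return
 * the total number of bytes.  With a NULL dest only the total is
 * computed.  Every quantity is a size_t, so no casts are needed.
 */
static size_t copy_records_sketch(const struct rec_sketch *recs,
				  size_t count, uint8_t *dest)
{
	size_t i, size = 0;

	for (i = 0; i < count; size += recs[i++].size)
		if (dest)
			memcpy(dest + size, recs[i].ptr, recs[i].size);
	return size;
}
```

The real function also handles add_nl and needs_cr and returns int,
so its callers would have to change along with it.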

> @@ -324,8 +324,8 @@ static int xdl_fill_merge_buffer(xdfenv_t *xe1, const char *name1,
>  
>  static int recmatch(xrecord_t *rec1, xrecord_t *rec2, unsigned long flags)
>  {
> -	return xdl_recmatch((const char *)rec1->ptr, rec1->size,
> -			    (const char *)rec2->ptr, rec2->size, flags);
> +	return xdl_recmatch((const char *)rec1->ptr, (long)rec1->size,
> +			    (const char *)rec2->ptr, (long)rec2->size, flags);
>  }

Same "long may not be wide enough, in which case we'd need further
fixes" applies here.

> @@ -441,7 +441,7 @@ static int lines_contain_alnum(xdfenv_t *xe, int i, int chg)
>  {
>  	for (; chg; chg--, i++)
>  		if (line_contains_alnum((const char *)xe->xdf2.recs[i].ptr,
> -				xe->xdf2.recs[i].size))
> +				(long)xe->xdf2.recs[i].size))
>  			return 1;
>  	return 0;
>  }

Ditto.

> diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
> index 4c56467076..b3219aed3e 100644
> --- a/xdiff/xprepare.c
> +++ b/xdiff/xprepare.c
> @@ -99,8 +99,8 @@ static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
>  	hi = (long) XDL_HASHLONG(rec->ha, cf->hbits);
>  	for (rcrec = cf->rchash[hi]; rcrec; rcrec = rcrec->next)
>  		if (rcrec->rec.ha == rec->ha &&
> -				xdl_recmatch((const char *)rcrec->rec.ptr, rcrec->rec.size,
> -					(const char *)rec->ptr, rec->size, cf->flags))
> +				xdl_recmatch((const char *)rcrec->rec.ptr, (long)rcrec->rec.size,
> +					(const char *)rec->ptr, (long)rec->size, cf->flags))

Ditto.

> @@ -157,7 +157,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
>  				goto abort;
>  			crec = &xdf->recs[xdf->nrec++];
>  			crec->ptr = (uint8_t const *)prev;
> -			crec->size = (long) (cur - prev);
> +			crec->size = cur - prev;

Yay!

> diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
> index b1c520a378..88b1fe4649 100644
> --- a/xdiff/xtypes.h
> +++ b/xdiff/xtypes.h
> @@ -40,7 +40,7 @@ typedef struct s_chastore {
>  
>  typedef struct s_xrecord {
>  	uint8_t const *ptr;
> -	long size;
> +	size_t size;

Yay, too!

>  	unsigned long ha;
>  } xrecord_t;



* Re: [PATCH v3 04/10] xdiff: use size_t for xrecord_t.size
  2025-11-11 23:08       ` Junio C Hamano
@ 2025-11-14  6:02         ` Ezekiel Newren
  2025-11-14 16:31           ` Junio C Hamano
  0 siblings, 1 reply; 118+ messages in thread
From: Ezekiel Newren @ 2025-11-14  6:02 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ezekiel Newren via GitGitGadget, git, Kristoffer Haugsbakk,
	Patrick Steinhardt, Phillip Wood, Chris Torek

On Tue, Nov 11, 2025 at 4:08 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Ezekiel Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Ezekiel Newren <ezekielnewren@gmail.com>
> >
> > size_t is the appropriate type because size is describing the number of
> > elements, bytes in this case, in memory.
> >
> > Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
> > ---
> >  xdiff/xdiffi.c   |  7 +++----
> >  xdiff/xemit.c    |  8 ++++----
> >  xdiff/xmerge.c   | 16 ++++++++--------
> >  xdiff/xprepare.c |  6 +++---
> >  xdiff/xtypes.h   |  2 +-
> >  5 files changed, 19 insertions(+), 20 deletions(-)
>
> This step looks mostly OK but it is messy in some places.
>
> > diff --git a/xdiff/xemit.c b/xdiff/xemit.c
> > index ead930088a..2f8007753c 100644
> > --- a/xdiff/xemit.c
> > +++ b/xdiff/xemit.c
> > @@ -27,7 +27,7 @@ static int xdl_emit_record(xdfile_t *xdf, long ri, char const *pre, xdemitcb_t *
> >  {
> >       xrecord_t *rec = &xdf->recs[ri];
> >
> > -     if (xdl_emit_diffrec((char const *)rec->ptr, rec->size, pre, strlen(pre), ecb) < 0)
> > +     if (xdl_emit_diffrec((char const *)rec->ptr, (long)rec->size, pre, strlen(pre), ecb) < 0)
>
> On platforms where long is narrower than size_t, we'd tentatively
> leave things broken until we update xdl_emit_diffrec() to take
> size_t, as it would become too noisy to change it in the same patch,
> I guess?
>
> > @@ -113,8 +113,8 @@ static long match_func_rec(xdfile_t *xdf, xdemitconf_t const *xecfg, long ri,
> >       xrecord_t *rec = &xdf->recs[ri];
> >
> >       if (!xecfg->find_func)
> > -             return def_ff((const char *)rec->ptr, rec->size, buf, sz);
> > -     return xecfg->find_func((const char *)rec->ptr, rec->size, buf, sz, xecfg->find_func_priv);
> > +             return def_ff((const char *)rec->ptr, (long)rec->size, buf, sz);
> > +     return xecfg->find_func((const char *)rec->ptr, (long)rec->size, buf, sz, xecfg->find_func_priv);
>
> Ditto.
>
> > diff --git a/xdiff/xmerge.c b/xdiff/xmerge.c
> > index 75cb3e76a2..0dd4558a32 100644
> > --- a/xdiff/xmerge.c
> > +++ b/xdiff/xmerge.c
> > @@ -101,8 +101,8 @@ static int xdl_merge_cmp_lines(xdfenv_t *xe1, int i1, xdfenv_t *xe2, int i2,
> >       xrecord_t *rec2 = xe2->xdf2.recs + i2;
> >
> >       for (i = 0; i < line_count; i++) {
> > -             int result = xdl_recmatch((const char *)rec1[i].ptr, rec1[i].size,
> > -                     (const char *)rec2[i].ptr, rec2[i].size, flags);
> > +             int result = xdl_recmatch((const char *)rec1[i].ptr, (long)rec1[i].size,
> > +                     (const char *)rec2[i].ptr, (long)rec2[i].size, flags);
>
> Ditto.
>
> > @@ -119,11 +119,11 @@ static int xdl_recs_copy_0(int use_orig, xdfenv_t *xe, int i, int count, int nee
> >       if (count < 1)
> >               return 0;
> >
> > -     for (i = 0; i < count; size += recs[i++].size)
> > +     for (i = 0; i < count; size += (int)recs[i++].size)
> >               if (dest)
> >                       memcpy(dest + size, recs[i].ptr, recs[i].size);
> >       if (add_nl) {
> > -             i = recs[count - 1].size;
> > +             i = (int)recs[count - 1].size;
> >               if (i == 0 || recs[count - 1].ptr[i - 1] != '\n') {
> >                       if (needs_cr) {
> >                               if (dest)
>
> This is messier than I expected.  Before the precontext of this
> hunk, "i" and "count" are both incoming parameters of type "int", so
> the same "what if size_t is wider?" puzzlement applies here.  At
> > least, the reason why "i" and "count" are "int" is not because they
> want to be able to express negative values, so it shouldn't involve
> too much hassle if we later want to change them to size_t to lose
> these casts.
>
> > @@ -324,8 +324,8 @@ static int xdl_fill_merge_buffer(xdfenv_t *xe1, const char *name1,
> >
> >  static int recmatch(xrecord_t *rec1, xrecord_t *rec2, unsigned long flags)
> >  {
> > -     return xdl_recmatch((const char *)rec1->ptr, rec1->size,
> > -                         (const char *)rec2->ptr, rec2->size, flags);
> > +     return xdl_recmatch((const char *)rec1->ptr, (long)rec1->size,
> > +                         (const char *)rec2->ptr, (long)rec2->size, flags);
> >  }
>
> Same "long may not be wide enough, in which case we'd need further
> fixes" applies here.
>
> > @@ -441,7 +441,7 @@ static int lines_contain_alnum(xdfenv_t *xe, int i, int chg)
> >  {
> >       for (; chg; chg--, i++)
> >               if (line_contains_alnum((const char *)xe->xdf2.recs[i].ptr,
> > -                             xe->xdf2.recs[i].size))
> > +                             (long)xe->xdf2.recs[i].size))
> >                       return 1;
> >       return 0;
> >  }
>
> Ditto.
>
> > diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
> > index 4c56467076..b3219aed3e 100644
> > --- a/xdiff/xprepare.c
> > +++ b/xdiff/xprepare.c
> > @@ -99,8 +99,8 @@ static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
> >       hi = (long) XDL_HASHLONG(rec->ha, cf->hbits);
> >       for (rcrec = cf->rchash[hi]; rcrec; rcrec = rcrec->next)
> >               if (rcrec->rec.ha == rec->ha &&
> > -                             xdl_recmatch((const char *)rcrec->rec.ptr, rcrec->rec.size,
> > -                                     (const char *)rec->ptr, rec->size, cf->flags))
> > +                             xdl_recmatch((const char *)rcrec->rec.ptr, (long)rcrec->rec.size,
> > +                                     (const char *)rec->ptr, (long)rec->size, cf->flags))
>
> Ditto.

mmbuffer_t holds all of the bytes of the file in memory, so the number
of lines referenced in mmbuffer_t has to be less than or equal to
that, which makes the point about long vs size_t moot for this patch
series. Maybe int vs size_t is a different story, but there are many
other places that use `int` that limit the number of lines in a file
that aren't touched at all in this patch series. I will update these
types, but in a future patch series because they cause a refactor
avalanche in many places.
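
A small sketch of that bound, assuming a hypothetical buffer struct
whose length is a long (struct buf_sketch and longest_line_sketch()
are illustrative names, not xdiff code).  Every line lies inside the
buffer, so no line can be longer than the buffer's size:

```c
#include <stddef.h>

/* Hypothetical stand-in for the input buffer: length is a long. */
struct buf_sketch {
	const char *ptr;
	long size;
};

/*
 * Return the length of the longest line in b, counting the trailing
 * '\n' when present.  The result can never exceed b->size, so it
 * always fits in a long as long as the buffer size itself is a long.
 */
static size_t longest_line_sketch(const struct buf_sketch *b)
{
	size_t best = 0, cur = 0;
	long i;

	for (i = 0; i < b->size; i++) {
		cur++;
		if (b->ptr[i] == '\n') {
			if (cur > best)
				best = cur;
			cur = 0;
		}
	}
	if (cur > best)
		best = cur;
	return best;
}
```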

I don't like the current state that xdiff is in either. That's why I
intend to keep going with my xdiff cleanup series.

> > @@ -157,7 +157,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
> >                               goto abort;
> >                       crec = &xdf->recs[xdf->nrec++];
> >                       crec->ptr = (uint8_t const *)prev;
> > -                     crec->size = (long) (cur - prev);
> > +                     crec->size = cur - prev;
>
> Yay!
>
> > diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
> > index b1c520a378..88b1fe4649 100644
> > --- a/xdiff/xtypes.h
> > +++ b/xdiff/xtypes.h
> > @@ -40,7 +40,7 @@ typedef struct s_chastore {
> >
> >  typedef struct s_xrecord {
> >       uint8_t const *ptr;
> > -     long size;
> > +     size_t size;
>
> Yay, too!
>
> >       unsigned long ha;
> >  } xrecord_t;

I agree. It's nice to see some clean code in this patch series.


* Re: [PATCH v3 04/10] xdiff: use size_t for xrecord_t.size
  2025-11-14  6:02         ` Ezekiel Newren
@ 2025-11-14 16:31           ` Junio C Hamano
  0 siblings, 0 replies; 118+ messages in thread
From: Junio C Hamano @ 2025-11-14 16:31 UTC (permalink / raw)
  To: Ezekiel Newren
  Cc: Ezekiel Newren via GitGitGadget, git, Kristoffer Haugsbakk,
	Patrick Steinhardt, Phillip Wood, Chris Torek

Ezekiel Newren <ezekielnewren@gmail.com> writes:

> On Tue, Nov 11, 2025 at 4:08 PM Junio C Hamano <gitster@pobox.com> wrote:
>>
>> "Ezekiel Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
> ...
> mmbuffer_t holds all of the bytes of the file in memory, so the number
> of lines referenced in mmbuffer_t has to be less than or equal to
> that, which makes the point about long vs size_t moot for this patch
> series.

... because size there is still "long"?

> I don't like the current state that xdiff is in either. That's why I
> intend to keep going with my xdiff cleanup series.

Great, and we already have seen improvements; an intermediate state,
as we already discussed in this thread, may be noisier with casts
but that cannot be avoided.

> I agree. It's nice to see some clean code in this patch series.

Thanks.


* [PATCH v3 05/10] xdiff: use unambiguous types in xdl_hash_record()
  2025-11-11 19:42   ` [PATCH v3 " Ezekiel Newren via GitGitGadget
                       ` (3 preceding siblings ...)
  2025-11-11 19:42     ` [PATCH v3 04/10] xdiff: use size_t for xrecord_t.size Ezekiel Newren via GitGitGadget
@ 2025-11-11 19:42     ` Ezekiel Newren via GitGitGadget
  2025-11-11 19:42     ` [PATCH v3 06/10] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash Ezekiel Newren via GitGitGadget
                       ` (6 subsequent siblings)
  11 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-11 19:42 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

Convert the function signature and body to use unambiguous types. char
is changed to uint8_t because this function processes bytes in memory.
unsigned long is changed to uint64_t so that the hash output is
consistent across platforms. `flags` is changed from long to uint64_t
to ensure the high order bits are not dropped on platforms that treat
long as 32 bits.
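
As a rough illustration of why fixed-width types matter here, the
following is a simplified sketch of the baseline djb2 loop using
uint8_t input and a uint64_t accumulator. hash_line_sketch() is a
hypothetical name; it is not the optimized xdl_hash_record_verbatim():

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hash the bytes in [ptr, top) up to, but not including, the first
 * '\n'.  With a uint64_t accumulator the arithmetic wraps at 64 bits
 * on every platform, so the result does not depend on the width of
 * unsigned long.
 */
static uint64_t hash_line_sketch(const uint8_t *ptr, const uint8_t *top)
{
	uint64_t ha = 5381;

	for (; ptr < top && *ptr != '\n'; ptr++) {
		ha += (ha << 5);	/* ha = ha * 33 */
		ha += (uint64_t)*ptr;
	}
	return ha;
}
```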

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff-interface.c |  2 +-
 xdiff/xprepare.c  |  6 +++---
 xdiff/xutils.c    | 28 ++++++++++++++--------------
 xdiff/xutils.h    |  6 +++---
 4 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/xdiff-interface.c b/xdiff-interface.c
index 4971f722b3..1a35556380 100644
--- a/xdiff-interface.c
+++ b/xdiff-interface.c
@@ -300,7 +300,7 @@ void xdiff_clear_find_func(xdemitconf_t *xecfg)
 
 unsigned long xdiff_hash_string(const char *s, size_t len, long flags)
 {
-	return xdl_hash_record(&s, s + len, flags);
+	return xdl_hash_record((uint8_t const**)&s, (uint8_t const*)s + len, flags);
 }
 
 int xdiff_compare_lines(const char *l1, long s1,
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index b3219aed3e..85e56021da 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -137,8 +137,8 @@ static void xdl_free_ctx(xdfile_t *xdf)
 static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_t const *xpp,
 			   xdlclassifier_t *cf, xdfile_t *xdf) {
 	long bsize;
-	unsigned long hav;
-	char const *blk, *cur, *top, *prev;
+	uint64_t hav;
+	uint8_t const *blk, *cur, *top, *prev;
 	xrecord_t *crec;
 
 	xdf->rindex = NULL;
@@ -156,7 +156,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 			if (XDL_ALLOC_GROW(xdf->recs, xdf->nrec + 1, narec))
 				goto abort;
 			crec = &xdf->recs[xdf->nrec++];
-			crec->ptr = (uint8_t const *)prev;
+			crec->ptr = prev;
 			crec->size = cur - prev;
 			crec->ha = hav;
 			if (xdl_classify_record(pass, cf, crec) < 0)
diff --git a/xdiff/xutils.c b/xdiff/xutils.c
index 7be063bfb6..77ee1ad9c8 100644
--- a/xdiff/xutils.c
+++ b/xdiff/xutils.c
@@ -249,11 +249,11 @@ int xdl_recmatch(const char *l1, long s1, const char *l2, long s2, long flags)
 	return 1;
 }
 
-unsigned long xdl_hash_record_with_whitespace(char const **data,
-		char const *top, long flags) {
-	unsigned long ha = 5381;
-	char const *ptr = *data;
-	int cr_at_eol_only = (flags & XDF_WHITESPACE_FLAGS) == XDF_IGNORE_CR_AT_EOL;
+uint64_t xdl_hash_record_with_whitespace(uint8_t const **data,
+		uint8_t const *top, uint64_t flags) {
+	uint64_t ha = 5381;
+	uint8_t const *ptr = *data;
+	bool cr_at_eol_only = (flags & XDF_WHITESPACE_FLAGS) == XDF_IGNORE_CR_AT_EOL;
 
 	for (; ptr < top && *ptr != '\n'; ptr++) {
 		if (cr_at_eol_only) {
@@ -263,8 +263,8 @@ unsigned long xdl_hash_record_with_whitespace(char const **data,
 				continue;
 		}
 		else if (XDL_ISSPACE(*ptr)) {
-			const char *ptr2 = ptr;
-			int at_eol;
+			const uint8_t *ptr2 = ptr;
+			bool at_eol;
 			while (ptr + 1 < top && XDL_ISSPACE(ptr[1])
 					&& ptr[1] != '\n')
 				ptr++;
@@ -274,20 +274,20 @@ unsigned long xdl_hash_record_with_whitespace(char const **data,
 			else if (flags & XDF_IGNORE_WHITESPACE_CHANGE
 				 && !at_eol) {
 				ha += (ha << 5);
-				ha ^= (unsigned long) ' ';
+				ha ^= (uint64_t) ' ';
 			}
 			else if (flags & XDF_IGNORE_WHITESPACE_AT_EOL
 				 && !at_eol) {
 				while (ptr2 != ptr + 1) {
 					ha += (ha << 5);
-					ha ^= (unsigned long) *ptr2;
+					ha ^= (uint64_t) *ptr2;
 					ptr2++;
 				}
 			}
 			continue;
 		}
 		ha += (ha << 5);
-		ha ^= (unsigned long) *ptr;
+		ha ^= (uint64_t) *ptr;
 	}
 	*data = ptr < top ? ptr + 1: ptr;
 
@@ -304,9 +304,9 @@ unsigned long xdl_hash_record_with_whitespace(char const **data,
 #define REASSOC_FENCE(x, y)
 #endif
 
-unsigned long xdl_hash_record_verbatim(char const **data, char const *top) {
-	unsigned long ha = 5381, c0, c1;
-	char const *ptr = *data;
+uint64_t xdl_hash_record_verbatim(uint8_t const **data, uint8_t const *top) {
+	uint64_t ha = 5381, c0, c1;
+	uint8_t const *ptr = *data;
 #if 0
 	/*
 	 * The baseline form of the optimized loop below. This is the djb2
@@ -314,7 +314,7 @@ unsigned long xdl_hash_record_verbatim(char const **data, char const *top) {
 	 */
 	for (; ptr < top && *ptr != '\n'; ptr++) {
 		ha += (ha << 5);
-		ha += (unsigned long) *ptr;
+		ha += (uint64_t) *ptr;
 	}
 	*data = ptr < top ? ptr + 1: ptr;
 #else
diff --git a/xdiff/xutils.h b/xdiff/xutils.h
index 13f6831047..615b4a9d35 100644
--- a/xdiff/xutils.h
+++ b/xdiff/xutils.h
@@ -34,9 +34,9 @@ void *xdl_cha_alloc(chastore_t *cha);
 long xdl_guess_lines(mmfile_t *mf, long sample);
 int xdl_blankline(const char *line, long size, long flags);
 int xdl_recmatch(const char *l1, long s1, const char *l2, long s2, long flags);
-unsigned long xdl_hash_record_verbatim(char const **data, char const *top);
-unsigned long xdl_hash_record_with_whitespace(char const **data, char const *top, long flags);
-static inline unsigned long xdl_hash_record(char const **data, char const *top, long flags)
+uint64_t xdl_hash_record_verbatim(uint8_t const **data, uint8_t const *top);
+uint64_t xdl_hash_record_with_whitespace(uint8_t const **data, uint8_t const *top, uint64_t flags);
+static inline uint64_t xdl_hash_record(uint8_t const **data, uint8_t const *top, uint64_t flags)
 {
 	if (flags & XDF_WHITESPACE_FLAGS)
 		return xdl_hash_record_with_whitespace(data, top, flags);
-- 
gitgitgadget



* [PATCH v3 06/10] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash
  2025-11-11 19:42   ` [PATCH v3 " Ezekiel Newren via GitGitGadget
                       ` (4 preceding siblings ...)
  2025-11-11 19:42     ` [PATCH v3 05/10] xdiff: use unambiguous types in xdl_hash_record() Ezekiel Newren via GitGitGadget
@ 2025-11-11 19:42     ` Ezekiel Newren via GitGitGadget
  2025-11-11 23:21       ` Junio C Hamano
  2025-11-11 19:42     ` [PATCH v3 07/10] xdiff: make xdfile_t.nrec a size_t instead of long Ezekiel Newren via GitGitGadget
                       ` (5 subsequent siblings)
  11 siblings, 1 reply; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-11 19:42 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

The ha field is serving two different purposes, which makes the code
harder to read. At first glance, it looks like many places assume
there could never be hash collisions between lines of the two input
files. In reality, line_hash is used together with xdl_recmatch() to
ensure correct comparisons of lines, even when collisions occur.

To make this clearer, the old ha field has been split:
  * line_hash: a straightforward hash of a line, independent of any
    external context. Its type is uint64_t, as it comes from a fixed
    width hash function.
  * minimal_perfect_hash: Not a new concept, but now a separate
    field. It comes from the classifier's general-purpose hash table,
    which assigns each line a unique and minimal hash across the two
    files. A size_t is used here because it's meant to be used to
    index an array. This also avoids `as usize` casts on the Rust
    side when using it to index a slice.
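
The relationship between the two fields can be sketched as follows.
This is a simplified illustration only: all names are hypothetical,
and a linear search stands in for the classifier's hash table. Lines
are first filtered by line_hash, then confirmed byte for byte, and
each distinct line receives the next dense index:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_CLASSES_SKETCH 64	/* arbitrary limit for this sketch */

struct class_sketch {
	const uint8_t *ptr;
	size_t size;
	uint64_t line_hash;
};

struct classifier_sketch {
	struct class_sketch classes[MAX_CLASSES_SKETCH];
	size_t count;
};

/* djb2-style hash over n bytes; collisions are possible. */
static uint64_t line_hash_sketch(const uint8_t *p, size_t n)
{
	uint64_t ha = 5381;
	size_t i;

	for (i = 0; i < n; i++)
		ha = ha * 33 + p[i];
	return ha;
}

/*
 * Return the dense index of the line, adding a new class if it has
 * not been seen before.  Returns (size_t)-1 when the table is full.
 */
static size_t classify_sketch(struct classifier_sketch *cf,
			      const uint8_t *ptr, size_t size)
{
	uint64_t h = line_hash_sketch(ptr, size);
	size_t i;

	for (i = 0; i < cf->count; i++) {
		const struct class_sketch *c = &cf->classes[i];
		/* The hash only narrows the search; memcmp decides. */
		if (c->line_hash == h && c->size == size &&
		    !memcmp(c->ptr, ptr, size))
			return i;
	}
	if (cf->count == MAX_CLASSES_SKETCH)
		return (size_t)-1;
	cf->classes[cf->count].ptr = ptr;
	cf->classes[cf->count].size = size;
	cf->classes[cf->count].line_hash = h;
	return cf->count++;
}
```

Because equal indices imply equal lines, later stages can compare
the index alone without calling a byte comparison again.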

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xdiffi.c     |  6 +++---
 xdiff/xhistogram.c |  4 ++--
 xdiff/xpatience.c  | 10 +++++-----
 xdiff/xprepare.c   | 18 +++++++++---------
 xdiff/xtypes.h     |  3 ++-
 5 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index edd05466df..436c34697d 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -22,9 +22,9 @@
 
 #include "xinclude.h"
 
-static unsigned long get_hash(xdfile_t *xdf, long index)
+static size_t get_hash(xdfile_t *xdf, long index)
 {
-	return xdf->recs[xdf->rindex[index]].ha;
+	return xdf->recs[xdf->rindex[index]].minimal_perfect_hash;
 }
 
 #define XDL_MAX_COST_MIN 256
@@ -385,7 +385,7 @@ static xdchange_t *xdl_add_change(xdchange_t *xscr, long i1, long i2, long chg1,
 
 static int recs_match(xrecord_t *rec1, xrecord_t *rec2)
 {
-	return (rec1->ha == rec2->ha);
+	return rec1->minimal_perfect_hash == rec2->minimal_perfect_hash;
 }
 
 /*
diff --git a/xdiff/xhistogram.c b/xdiff/xhistogram.c
index 6dc450b1fe..5ae1282c27 100644
--- a/xdiff/xhistogram.c
+++ b/xdiff/xhistogram.c
@@ -90,7 +90,7 @@ struct region {
 
 static int cmp_recs(xrecord_t *r1, xrecord_t *r2)
 {
-	return r1->ha == r2->ha;
+	return r1->minimal_perfect_hash == r2->minimal_perfect_hash;
 
 }
 
@@ -98,7 +98,7 @@ static int cmp_recs(xrecord_t *r1, xrecord_t *r2)
 	(cmp_recs(REC(i->env, s1, l1), REC(i->env, s2, l2)))
 
 #define TABLE_HASH(index, side, line) \
-	XDL_HASHLONG((REC(index->env, side, line))->ha, index->table_bits)
+	XDL_HASHLONG((REC(index->env, side, line))->minimal_perfect_hash, index->table_bits)
 
 static int scanA(struct histindex *index, int line1, int count1)
 {
diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
index bb61354f22..cc53266f3b 100644
--- a/xdiff/xpatience.c
+++ b/xdiff/xpatience.c
@@ -48,7 +48,7 @@
 struct hashmap {
 	int nr, alloc;
 	struct entry {
-		unsigned long hash;
+		size_t minimal_perfect_hash;
 		/*
 		 * 0 = unused entry, 1 = first line, 2 = second, etc.
 		 * line2 is NON_UNIQUE if the line is not unique
@@ -101,10 +101,10 @@ static void insert_record(xpparam_t const *xpp, int line, struct hashmap *map,
 	 * So we multiply ha by 2 in the hope that the hashing was
 	 * "unique enough".
 	 */
-	int index = (int)((record->ha << 1) % map->alloc);
+	int index = (int)((record->minimal_perfect_hash << 1) % map->alloc);
 
 	while (map->entries[index].line1) {
-		if (map->entries[index].hash != record->ha) {
+		if (map->entries[index].minimal_perfect_hash != record->minimal_perfect_hash) {
 			if (++index >= map->alloc)
 				index = 0;
 			continue;
@@ -120,7 +120,7 @@ static void insert_record(xpparam_t const *xpp, int line, struct hashmap *map,
 	if (pass == 2)
 		return;
 	map->entries[index].line1 = line;
-	map->entries[index].hash = record->ha;
+	map->entries[index].minimal_perfect_hash = record->minimal_perfect_hash;
 	map->entries[index].anchor = is_anchor(xpp, (const char *)map->env->xdf1.recs[line - 1].ptr);
 	if (!map->first)
 		map->first = map->entries + index;
@@ -248,7 +248,7 @@ static int match(struct hashmap *map, int line1, int line2)
 {
 	xrecord_t *record1 = &map->env->xdf1.recs[line1 - 1];
 	xrecord_t *record2 = &map->env->xdf2.recs[line2 - 1];
-	return record1->ha == record2->ha;
+	return record1->minimal_perfect_hash == record2->minimal_perfect_hash;
 }
 
 static int patience_diff(xpparam_t const *xpp, xdfenv_t *env,
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 85e56021da..bea0992b5e 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -93,12 +93,12 @@ static void xdl_free_classifier(xdlclassifier_t *cf) {
 
 
 static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t *rec) {
-	long hi;
+	size_t hi;
 	xdlclass_t *rcrec;
 
-	hi = (long) XDL_HASHLONG(rec->ha, cf->hbits);
+	hi = XDL_HASHLONG(rec->line_hash, cf->hbits);
 	for (rcrec = cf->rchash[hi]; rcrec; rcrec = rcrec->next)
-		if (rcrec->rec.ha == rec->ha &&
+		if (rcrec->rec.line_hash == rec->line_hash &&
 				xdl_recmatch((const char *)rcrec->rec.ptr, (long)rcrec->rec.size,
 					(const char *)rec->ptr, (long)rec->size, cf->flags))
 			break;
@@ -120,7 +120,7 @@ static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
 
 	(pass == 1) ? rcrec->len1++ : rcrec->len2++;
 
-	rec->ha = (unsigned long) rcrec->idx;
+	rec->minimal_perfect_hash = (size_t)rcrec->idx;
 
 	return 0;
 }
@@ -158,7 +158,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 			crec = &xdf->recs[xdf->nrec++];
 			crec->ptr = prev;
 			crec->size = cur - prev;
-			crec->ha = hav;
+			crec->line_hash = hav;
 			if (xdl_classify_record(pass, cf, crec) < 0)
 				goto abort;
 		}
@@ -290,7 +290,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	if ((mlim = xdl_bogosqrt(xdf1->nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
 	for (i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart]; i <= xdf1->dend; i++, recs++) {
-		rcrec = cf->rcrecs[recs->ha];
+		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
 		nm = rcrec ? rcrec->len2 : 0;
 		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
 	}
@@ -298,7 +298,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	if ((mlim = xdl_bogosqrt(xdf2->nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
 	for (i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart]; i <= xdf2->dend; i++, recs++) {
-		rcrec = cf->rcrecs[recs->ha];
+		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
 		nm = rcrec ? rcrec->len1 : 0;
 		action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
 	}
@@ -350,7 +350,7 @@ static int xdl_trim_ends(xdfile_t *xdf1, xdfile_t *xdf2) {
 	recs2 = xdf2->recs;
 	for (i = 0, lim = XDL_MIN(xdf1->nrec, xdf2->nrec); i < lim;
 	     i++, recs1++, recs2++)
-		if (recs1->ha != recs2->ha)
+		if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
 			break;
 
 	xdf1->dstart = xdf2->dstart = i;
@@ -358,7 +358,7 @@ static int xdl_trim_ends(xdfile_t *xdf1, xdfile_t *xdf2) {
 	recs1 = xdf1->recs + xdf1->nrec - 1;
 	recs2 = xdf2->recs + xdf2->nrec - 1;
 	for (lim -= i, i = 0; i < lim; i++, recs1--, recs2--)
-		if (recs1->ha != recs2->ha)
+		if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
 			break;
 
 	xdf1->dend = xdf1->nrec - i - 1;
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index 88b1fe4649..742b81bf3b 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -41,7 +41,8 @@ typedef struct s_chastore {
 typedef struct s_xrecord {
 	uint8_t const *ptr;
 	size_t size;
-	unsigned long ha;
+	uint64_t line_hash;
+	size_t minimal_perfect_hash;
 } xrecord_t;
 
 typedef struct s_xdfile {
-- 
gitgitgadget



* Re: [PATCH v3 06/10] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash
  2025-11-11 19:42     ` [PATCH v3 06/10] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash Ezekiel Newren via GitGitGadget
@ 2025-11-11 23:21       ` Junio C Hamano
  2025-11-14  5:41         ` Ezekiel Newren
  0 siblings, 1 reply; 118+ messages in thread
From: Junio C Hamano @ 2025-11-11 23:21 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget
  Cc: git, Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren

"Ezekiel Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> To make this clearer, the old ha field has been split:
>   * line_hash: a straightforward hash of a line, independent of any
>     external context. Its type is uint64_t, as it comes from a fixed
>     width hash function.
>   * minimal_perfect_hash: Not a new concept, but now a separate
>     field. It comes from the classifier's general-purpose hash table,
>     which assigns each line a unique and minimal hash across the two
>     files. A size_t is used here because it's meant to be used to
>     index an array. This also avoids `as usize` casts on the Rust
>     side when using it to index a slice.

How much extra memory pressure does this change cause?  In a single
instance of xrecord_t, we used to have a single ulong plus a pointer
and a size_t; now we replaced the single ulong with two 8-byte words,
so 33% more memory per record, which is not so huge a deal?
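
One way to check that arithmetic is to compare sizeof on the two
layouts.  The struct names below are hypothetical copies made for
the comparison, and the exact sizes depend on the platform's ABI:

```c
#include <stddef.h>
#include <stdint.h>

/* Layout before this patch (copy for comparison only). */
struct xrecord_before_sketch {
	const uint8_t *ptr;
	size_t size;
	unsigned long ha;
};

/* Layout after this patch (copy for comparison only). */
struct xrecord_after_sketch {
	const uint8_t *ptr;
	size_t size;
	uint64_t line_hash;
	size_t minimal_perfect_hash;
};

/* Extra bytes each record costs with the split fields. */
static size_t extra_bytes_per_record_sketch(void)
{
	return sizeof(struct xrecord_after_sketch) -
	       sizeof(struct xrecord_before_sketch);
}
```

On a typical LP64 platform this is 24 bytes before and 32 after,
which matches the 33% estimate.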

>  static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t *rec) {
> -	long hi;
> +	size_t hi;
>  	xdlclass_t *rcrec;
>  
> -	hi = (long) XDL_HASHLONG(rec->ha, cf->hbits);
> +	hi = XDL_HASHLONG(rec->line_hash, cf->hbits);

Very nice that we can lose these random-looking casts.

> diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
> index 88b1fe4649..742b81bf3b 100644
> --- a/xdiff/xtypes.h
> +++ b/xdiff/xtypes.h
> @@ -41,7 +41,8 @@ typedef struct s_chastore {
>  typedef struct s_xrecord {
>  	uint8_t const *ptr;
>  	size_t size;
> -	unsigned long ha;
> +	uint64_t line_hash;
> +	size_t minimal_perfect_hash;
>  } xrecord_t;
>  
>  typedef struct s_xdfile {


* Re: [PATCH v3 06/10] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash
  2025-11-11 23:21       ` Junio C Hamano
@ 2025-11-14  5:41         ` Ezekiel Newren
  2025-11-14 20:06           ` Junio C Hamano
  0 siblings, 1 reply; 118+ messages in thread
From: Ezekiel Newren @ 2025-11-14  5:41 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ezekiel Newren via GitGitGadget, git, Kristoffer Haugsbakk,
	Patrick Steinhardt, Phillip Wood, Chris Torek

On Tue, Nov 11, 2025 at 4:21 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Ezekiel Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > To make this clearer, the old ha field has been split:
> >   * line_hash: a straightforward hash of a line, independent of any
> >     external context. Its type is uint64_t, as it comes from a fixed
> >     width hash function.
> >   * minimal_perfect_hash: Not a new concept, but now a separate
> >     field. It comes from the classifier's general-purpose hash table,
> >     which assigns each line a unique and minimal hash across the two
> >     files. A size_t is used here because it's meant to be used to
> >     index an array. This also avoids ` as usize` casts on the Rust
> >     side when using it to index a slice.
>
> How much extra memory pressure does this change cause?  In a single
> instance of xrecord_t, we used to have a single ulong plus a pointer
> and a size_t; now we replaced the single ulong with two 8-byte words,
> so 33% more memory per record, which is not so huge a deal?

This was asked and answered earlier in this patch series [1].

> >  static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t *rec) {
> > -     long hi;
> > +     size_t hi;
> >       xdlclass_t *rcrec;
> >
> > -     hi = (long) XDL_HASHLONG(rec->ha, cf->hbits);
> > +     hi = XDL_HASHLONG(rec->line_hash, cf->hbits);
>
> Very nice that we can lose these random-looking casts.

This was Phillip's suggestion [2]. Thanks Phillip.

[1] https://lore.kernel.org/git/CAH=ZcbD7FeRHtYvN_4=qHApB-AwK18=KRU2SGWNg8ADkrFM-Fw@mail.gmail.com/
[2] https://lore.kernel.org/git/a66fb440-058e-4cd8-8971-9c320c0387e8@gmail.com/


* Re: [PATCH v3 06/10] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash
  2025-11-14  5:41         ` Ezekiel Newren
@ 2025-11-14 20:06           ` Junio C Hamano
  0 siblings, 0 replies; 118+ messages in thread
From: Junio C Hamano @ 2025-11-14 20:06 UTC (permalink / raw)
  To: Ezekiel Newren
  Cc: Ezekiel Newren via GitGitGadget, git, Kristoffer Haugsbakk,
	Patrick Steinhardt, Phillip Wood, Chris Torek

Ezekiel Newren <ezekielnewren@gmail.com> writes:

>> How much extra memory pressure does this change cause?  In a single
>> instance of xrecord_t, we used to have a single ulong plus a pointer
>> and a size_t; now we replaced the single ulong with two 8-byte words,
>> so 33% more memory per record, which is not so huge a deal?
>
> This was asked and answered earlier in this patch series [1].

In short, this step does bloat, but the memory usage will shrink
when the members are moved elsewhere in future patches?

> [1] https://lore.kernel.org/git/CAH=ZcbD7FeRHtYvN_4=qHApB-AwK18=KRU2SGWNg8ADkrFM-Fw@mail.gmail.com/


* [PATCH v3 07/10] xdiff: make xdfile_t.nrec a size_t instead of long
  2025-11-11 19:42   ` [PATCH v3 " Ezekiel Newren via GitGitGadget
                       ` (5 preceding siblings ...)
  2025-11-11 19:42     ` [PATCH v3 06/10] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash Ezekiel Newren via GitGitGadget
@ 2025-11-11 19:42     ` Ezekiel Newren via GitGitGadget
  2025-11-11 19:42     ` [PATCH v3 08/10] xdiff: make xdfile_t.nreff " Ezekiel Newren via GitGitGadget
                       ` (4 subsequent siblings)
  11 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-11 19:42 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

size_t is used because nrec describes the number of elements in recs
and, with 2 added, the number of elements allocated for 'changed'.
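
As a sketch of where the two extra elements for 'changed' come from: as
far as I read xprepare.c, the array is allocated with nrec + 2 elements
and the stored pointer is advanced by one, which matches xdl_free_ctx()
freeing xdf->changed - 1. The names toy_file, toy_alloc_changed,
toy_changed_at and toy_free_changed are hypothetical, and plain calloc()
stands in for xdiff's allocation macros.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct toy_file {
	size_t nrec;
	bool *changed; /* points one element past the start of the allocation */
};

/* Allocate nrec + 2 zeroed flags; returns 0 on success, -1 on failure. */
static int toy_alloc_changed(struct toy_file *f, size_t nrec)
{
	bool *base;

	if (nrec > SIZE_MAX - 2)
		return -1; /* nrec + 2 would wrap around */
	base = calloc(nrec + 2, sizeof(*base));
	if (!base)
		return -1;
	f->nrec = nrec;
	f->changed = base + 1; /* changed[-1] and changed[nrec] are valid */
	return 0;
}

/* Read a flag, allowing the two sentinel positions -1 and nrec. */
static bool toy_changed_at(const struct toy_file *f, ptrdiff_t i)
{
	assert(i >= -1 && i <= (ptrdiff_t)f->nrec);
	return f->changed[i];
}

static void toy_free_changed(struct toy_file *f)
{
	if (f->changed)
		free(f->changed - 1); /* undo the +1 before freeing */
	f->changed = NULL;
	f->nrec = 0;
}
```

With an unsigned nrec the only overflow to guard against is nrec + 2
wrapping around; a negative element count can no longer be expressed.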

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xdiffi.c    |  8 ++++----
 xdiff/xemit.c     | 20 ++++++++++----------
 xdiff/xmerge.c    |  8 ++++----
 xdiff/xpatience.c |  2 +-
 xdiff/xprepare.c  | 12 ++++++------
 xdiff/xtypes.h    |  2 +-
 6 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 436c34697d..759193fe5d 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -483,7 +483,7 @@ static void measure_split(const xdfile_t *xdf, long split,
 {
 	long i;
 
-	if (split >= xdf->nrec) {
+	if (split >= (long)xdf->nrec) {
 		m->end_of_file = 1;
 		m->indent = -1;
 	} else {
@@ -506,7 +506,7 @@ static void measure_split(const xdfile_t *xdf, long split,
 
 	m->post_blank = 0;
 	m->post_indent = -1;
-	for (i = split + 1; i < xdf->nrec; i++) {
+	for (i = split + 1; i < (long)xdf->nrec; i++) {
 		m->post_indent = get_indent(&xdf->recs[i]);
 		if (m->post_indent != -1)
 			break;
@@ -717,7 +717,7 @@ static void group_init(xdfile_t *xdf, struct xdlgroup *g)
  */
 static inline int group_next(xdfile_t *xdf, struct xdlgroup *g)
 {
-	if (g->end == xdf->nrec)
+	if (g->end == (long)xdf->nrec)
 		return -1;
 
 	g->start = g->end + 1;
@@ -750,7 +750,7 @@ static inline int group_previous(xdfile_t *xdf, struct xdlgroup *g)
  */
 static int group_slide_down(xdfile_t *xdf, struct xdlgroup *g)
 {
-	if (g->end < xdf->nrec &&
+	if (g->end < (long)xdf->nrec &&
 	    recs_match(&xdf->recs[g->start], &xdf->recs[g->end])) {
 		xdf->changed[g->start++] = false;
 		xdf->changed[g->end++] = true;
diff --git a/xdiff/xemit.c b/xdiff/xemit.c
index 2f8007753c..04f7e9193b 100644
--- a/xdiff/xemit.c
+++ b/xdiff/xemit.c
@@ -137,7 +137,7 @@ static long get_func_line(xdfenv_t *xe, xdemitconf_t const *xecfg,
 	buf = func_line ? func_line->buf : dummy;
 	size = func_line ? sizeof(func_line->buf) : sizeof(dummy);
 
-	for (l = start; l != limit && 0 <= l && l < xe->xdf1.nrec; l += step) {
+	for (l = start; l != limit && 0 <= l && l < (long)xe->xdf1.nrec; l += step) {
 		long len = match_func_rec(&xe->xdf1, xecfg, l, buf, size);
 		if (len >= 0) {
 			if (func_line)
@@ -179,14 +179,14 @@ pre_context_calculation:
 			long fs1, i1 = xch->i1;
 
 			/* Appended chunk? */
-			if (i1 >= xe->xdf1.nrec) {
+			if (i1 >= (long)xe->xdf1.nrec) {
 				long i2 = xch->i2;
 
 				/*
 				 * We don't need additional context if
 				 * a whole function was added.
 				 */
-				while (i2 < xe->xdf2.nrec) {
+				while (i2 < (long)xe->xdf2.nrec) {
 					if (is_func_rec(&xe->xdf2, xecfg, i2))
 						goto post_context_calculation;
 					i2++;
@@ -196,7 +196,7 @@ pre_context_calculation:
 				 * Otherwise get more context from the
 				 * pre-image.
 				 */
-				i1 = xe->xdf1.nrec - 1;
+				i1 = (long)xe->xdf1.nrec - 1;
 			}
 
 			fs1 = get_func_line(xe, xecfg, NULL, i1, -1);
@@ -228,8 +228,8 @@ pre_context_calculation:
 
  post_context_calculation:
 		lctx = xecfg->ctxlen;
-		lctx = XDL_MIN(lctx, xe->xdf1.nrec - (xche->i1 + xche->chg1));
-		lctx = XDL_MIN(lctx, xe->xdf2.nrec - (xche->i2 + xche->chg2));
+		lctx = XDL_MIN(lctx, (long)xe->xdf1.nrec - (xche->i1 + xche->chg1));
+		lctx = XDL_MIN(lctx, (long)xe->xdf2.nrec - (xche->i2 + xche->chg2));
 
 		e1 = xche->i1 + xche->chg1 + lctx;
 		e2 = xche->i2 + xche->chg2 + lctx;
@@ -237,13 +237,13 @@ pre_context_calculation:
 		if (xecfg->flags & XDL_EMIT_FUNCCONTEXT) {
 			long fe1 = get_func_line(xe, xecfg, NULL,
 						 xche->i1 + xche->chg1,
-						 xe->xdf1.nrec);
+						 (long)xe->xdf1.nrec);
 			while (fe1 > 0 && is_empty_rec(&xe->xdf1, fe1 - 1))
 				fe1--;
 			if (fe1 < 0)
-				fe1 = xe->xdf1.nrec;
+				fe1 = (long)xe->xdf1.nrec;
 			if (fe1 > e1) {
-				e2 = XDL_MIN(e2 + (fe1 - e1), xe->xdf2.nrec);
+				e2 = XDL_MIN(e2 + (fe1 - e1), (long)xe->xdf2.nrec);
 				e1 = fe1;
 			}
 
@@ -254,7 +254,7 @@ pre_context_calculation:
 			 */
 			if (xche->next) {
 				long l = XDL_MIN(xche->next->i1,
-						 xe->xdf1.nrec - 1);
+						 (long)xe->xdf1.nrec - 1);
 				if (l - xecfg->ctxlen <= e1 ||
 				    get_func_line(xe, xecfg, NULL, l, e1) < 0) {
 					xche = xche->next;
diff --git a/xdiff/xmerge.c b/xdiff/xmerge.c
index 0dd4558a32..29dad98c49 100644
--- a/xdiff/xmerge.c
+++ b/xdiff/xmerge.c
@@ -158,7 +158,7 @@ static int is_eol_crlf(xdfile_t *file, int i)
 {
 	size_t size;
 
-	if (i < file->nrec - 1)
+	if (i < (long)file->nrec - 1)
 		/* All lines before the last *must* end in LF */
 		return (size = file->recs[i].size) > 1 &&
 			file->recs[i].ptr[size - 2] == '\r';
@@ -317,7 +317,7 @@ static int xdl_fill_merge_buffer(xdfenv_t *xe1, const char *name1,
 			continue;
 		i = m->i1 + m->chg1;
 	}
-	size += xdl_recs_copy(xe1, i, xe1->xdf2.nrec - i, 0, 0,
+	size += xdl_recs_copy(xe1, i, (int)xe1->xdf2.nrec - i, 0, 0,
 			      dest ? dest + size : NULL);
 	return size;
 }
@@ -622,7 +622,7 @@ static int xdl_do_merge(xdfenv_t *xe1, xdchange_t *xscr1,
 			changes = c;
 		i0 = xscr1->i1;
 		i1 = xscr1->i2;
-		i2 = xscr1->i1 + xe2->xdf2.nrec - xe2->xdf1.nrec;
+		i2 = xscr1->i1 + (long)xe2->xdf2.nrec - (long)xe2->xdf1.nrec;
 		chg0 = xscr1->chg1;
 		chg1 = xscr1->chg2;
 		chg2 = xscr1->chg1;
@@ -637,7 +637,7 @@ static int xdl_do_merge(xdfenv_t *xe1, xdchange_t *xscr1,
 		if (!changes)
 			changes = c;
 		i0 = xscr2->i1;
-		i1 = xscr2->i1 + xe1->xdf2.nrec - xe1->xdf1.nrec;
+		i1 = xscr2->i1 + (long)xe1->xdf2.nrec - (long)xe1->xdf1.nrec;
 		i2 = xscr2->i2;
 		chg0 = xscr2->chg1;
 		chg1 = xscr2->chg1;
diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
index cc53266f3b..a0b31eb5d8 100644
--- a/xdiff/xpatience.c
+++ b/xdiff/xpatience.c
@@ -370,5 +370,5 @@ static int patience_diff(xpparam_t const *xpp, xdfenv_t *env,
 
 int xdl_do_patience_diff(xpparam_t const *xpp, xdfenv_t *env)
 {
-	return patience_diff(xpp, env, 1, env->xdf1.nrec, 1, env->xdf2.nrec);
+	return patience_diff(xpp, env, 1, (int)env->xdf1.nrec, 1, (int)env->xdf2.nrec);
 }
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index bea0992b5e..705ddd1ae0 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -153,7 +153,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 		for (top = blk + bsize; cur < top; ) {
 			prev = cur;
 			hav = xdl_hash_record(&cur, top, xpp->flags);
-			if (XDL_ALLOC_GROW(xdf->recs, xdf->nrec + 1, narec))
+			if (XDL_ALLOC_GROW(xdf->recs, (long)xdf->nrec + 1, narec))
 				goto abort;
 			crec = &xdf->recs[xdf->nrec++];
 			crec->ptr = prev;
@@ -287,7 +287,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	/*
 	 * Initialize temporary arrays with DISCARD, KEEP, or INVESTIGATE.
 	 */
-	if ((mlim = xdl_bogosqrt(xdf1->nrec)) > XDL_MAX_EQLIMIT)
+	if ((mlim = xdl_bogosqrt((long)xdf1->nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
 	for (i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart]; i <= xdf1->dend; i++, recs++) {
 		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
@@ -295,7 +295,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
 	}
 
-	if ((mlim = xdl_bogosqrt(xdf2->nrec)) > XDL_MAX_EQLIMIT)
+	if ((mlim = xdl_bogosqrt((long)xdf2->nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
 	for (i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart]; i <= xdf2->dend; i++, recs++) {
 		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
@@ -348,7 +348,7 @@ static int xdl_trim_ends(xdfile_t *xdf1, xdfile_t *xdf2) {
 
 	recs1 = xdf1->recs;
 	recs2 = xdf2->recs;
-	for (i = 0, lim = XDL_MIN(xdf1->nrec, xdf2->nrec); i < lim;
+	for (i = 0, lim = (long)XDL_MIN(xdf1->nrec, xdf2->nrec); i < lim;
 	     i++, recs1++, recs2++)
 		if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
 			break;
@@ -361,8 +361,8 @@ static int xdl_trim_ends(xdfile_t *xdf1, xdfile_t *xdf2) {
 		if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
 			break;
 
-	xdf1->dend = xdf1->nrec - i - 1;
-	xdf2->dend = xdf2->nrec - i - 1;
+	xdf1->dend = (long)xdf1->nrec - i - 1;
+	xdf2->dend = (long)xdf2->nrec - i - 1;
 
 	return 0;
 }
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index 742b81bf3b..17cafd8b6e 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -47,7 +47,7 @@ typedef struct s_xrecord {
 
 typedef struct s_xdfile {
 	xrecord_t *recs;
-	long nrec;
+	size_t nrec;
 	bool *changed;
 	long *rindex;
 	long nreff;
-- 
gitgitgadget



* [PATCH v3 08/10] xdiff: make xdfile_t.nreff a size_t instead of long
  2025-11-11 19:42   ` [PATCH v3 " Ezekiel Newren via GitGitGadget
                       ` (6 preceding siblings ...)
  2025-11-11 19:42     ` [PATCH v3 07/10] xdiff: make xdfile_t.nrec a size_t instead of long Ezekiel Newren via GitGitGadget
@ 2025-11-11 19:42     ` Ezekiel Newren via GitGitGadget
  2025-11-11 19:42     ` [PATCH v3 09/10] xdiff: change rindex from long to size_t in xdfile_t Ezekiel Newren via GitGitGadget
                       ` (3 subsequent siblings)
  11 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-11 19:42 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

size_t is used because nreff describes the number of elements in memory
for rindex.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xprepare.c | 14 +++++++-------
 xdiff/xtypes.h   |  2 +-
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 705ddd1ae0..39fd79d9d4 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -264,7 +264,7 @@ static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
  * might be potentially discarded if they appear in a run of discardable.
  */
 static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xdf2) {
-	long i, nm, nreff, mlim;
+	long i, nm, mlim;
 	xrecord_t *recs;
 	xdlclass_t *rcrec;
 	uint8_t *action1 = NULL, *action2 = NULL;
@@ -307,29 +307,29 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	 * Use temporary arrays to decide if changed[i] should remain
 	 * false, or become true.
 	 */
-	for (nreff = 0, i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart];
+	xdf1->nreff = 0;
+	for (i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart];
 	     i <= xdf1->dend; i++, recs++) {
 		if (action1[i] == KEEP ||
 		    (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xdf1->dstart, xdf1->dend))) {
-			xdf1->rindex[nreff++] = i;
+			xdf1->rindex[xdf1->nreff++] = i;
 			/* changed[i] remains false, i.e. keep */
 		} else
 			xdf1->changed[i] = true;
 			/* i.e. discard */
 	}
-	xdf1->nreff = nreff;
 
-	for (nreff = 0, i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart];
+	xdf2->nreff = 0;
+	for (i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart];
 	     i <= xdf2->dend; i++, recs++) {
 		if (action2[i] == KEEP ||
 		    (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xdf2->dstart, xdf2->dend))) {
-			xdf2->rindex[nreff++] = i;
+			xdf2->rindex[xdf2->nreff++] = i;
 			/* changed[i] remains false, i.e. keep */
 		} else
 			xdf2->changed[i] = true;
 			/* i.e. discard */
 	}
-	xdf2->nreff = nreff;
 
 cleanup:
 	xdl_free(action1);
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index 17cafd8b6e..df4c5cab1a 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -50,7 +50,7 @@ typedef struct s_xdfile {
 	size_t nrec;
 	bool *changed;
 	long *rindex;
-	long nreff;
+	size_t nreff;
 	ptrdiff_t dstart, dend;
 } xdfile_t;
 
-- 
gitgitgadget



* [PATCH v3 09/10] xdiff: change rindex from long to size_t in xdfile_t
  2025-11-11 19:42   ` [PATCH v3 " Ezekiel Newren via GitGitGadget
                       ` (7 preceding siblings ...)
  2025-11-11 19:42     ` [PATCH v3 08/10] xdiff: make xdfile_t.nreff " Ezekiel Newren via GitGitGadget
@ 2025-11-11 19:42     ` Ezekiel Newren via GitGitGadget
  2025-11-11 19:42     ` [PATCH v3 10/10] xdiff: rename rindex -> reference_index Ezekiel Newren via GitGitGadget
                       ` (2 subsequent siblings)
  11 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-11 19:42 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

The field rindex describes an index offset for other arrays. Change it
to size_t.

Changing the type of rindex from long to size_t has no cascading
refactor impact because it is only ever used to directly index other
arrays.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xtypes.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index df4c5cab1a..3bcc0920e0 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -49,7 +49,7 @@ typedef struct s_xdfile {
 	xrecord_t *recs;
 	size_t nrec;
 	bool *changed;
-	long *rindex;
+	size_t *rindex;
 	size_t nreff;
 	ptrdiff_t dstart, dend;
 } xdfile_t;
-- 
gitgitgadget



* [PATCH v3 10/10] xdiff: rename rindex -> reference_index
  2025-11-11 19:42   ` [PATCH v3 " Ezekiel Newren via GitGitGadget
                       ` (8 preceding siblings ...)
  2025-11-11 19:42     ` [PATCH v3 09/10] xdiff: change rindex from long to size_t in xdfile_t Ezekiel Newren via GitGitGadget
@ 2025-11-11 19:42     ` Ezekiel Newren via GitGitGadget
  2025-11-11 23:40     ` [PATCH v3 00/10] Xdiff cleanup part2 Junio C Hamano
  2025-11-14 22:36     ` [PATCH v4 " Ezekiel Newren via GitGitGadget
  11 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-11 19:42 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

The classic diff adds only the lines that it is going to consider
during the diff to a compacted array. That array, rindex, provides the
mapping from positions in the compacted array back to the lines of the
file that they reference, so rename it to reference_index.
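
A minimal sketch of the mapping described above, assuming a per-line
"keep" flag has already been computed. The function names and the keep
array are hypothetical; the real code in xdl_cleanup_records() decides
what to keep from the action arrays and the changed flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Collect the indices of the lines in [dstart, dend] whose keep flag is
 * set. reference_index must have room for dend - dstart + 1 entries.
 * Returns the number of entries written (the role nreff plays in
 * xdfile_t).
 */
static size_t toy_build_reference_index(const bool *keep,
					ptrdiff_t dstart, ptrdiff_t dend,
					size_t *reference_index)
{
	size_t nreff = 0;
	ptrdiff_t i;

	assert(dstart >= 0);
	for (i = dstart; i <= dend; i++)
		if (keep[i])
			reference_index[nreff++] = (size_t)i;
	return nreff;
}

/* Map a position in the compacted array back to a line index in the file. */
static size_t toy_file_line(const size_t *reference_index, size_t nreff,
			    size_t pos)
{
	assert(pos < nreff);
	return reference_index[pos];
}
```

The diff core then walks positions 0..nreff-1 and only goes through
reference_index when it needs the record or the changed flag of the
underlying line.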

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xdiffi.c   |  6 +++---
 xdiff/xprepare.c | 10 +++++-----
 xdiff/xtypes.h   |  2 +-
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 759193fe5d..8eb664be3e 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -24,7 +24,7 @@
 
 static size_t get_hash(xdfile_t *xdf, long index)
 {
-	return xdf->recs[xdf->rindex[index]].minimal_perfect_hash;
+	return xdf->recs[xdf->reference_index[index]].minimal_perfect_hash;
 }
 
 #define XDL_MAX_COST_MIN 256
@@ -278,10 +278,10 @@ int xdl_recs_cmp(xdfile_t *xdf1, long off1, long lim1,
 	 */
 	if (off1 == lim1) {
 		for (; off2 < lim2; off2++)
-			xdf2->changed[xdf2->rindex[off2]] = true;
+			xdf2->changed[xdf2->reference_index[off2]] = true;
 	} else if (off2 == lim2) {
 		for (; off1 < lim1; off1++)
-			xdf1->changed[xdf1->rindex[off1]] = true;
+			xdf1->changed[xdf1->reference_index[off1]] = true;
 	} else {
 		xdpsplit_t spl;
 		spl.i1 = spl.i2 = 0;
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 39fd79d9d4..34c82e4f8e 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -128,7 +128,7 @@ static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
 
 static void xdl_free_ctx(xdfile_t *xdf)
 {
-	xdl_free(xdf->rindex);
+	xdl_free(xdf->reference_index);
 	xdl_free(xdf->changed - 1);
 	xdl_free(xdf->recs);
 }
@@ -141,7 +141,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 	uint8_t const *blk, *cur, *top, *prev;
 	xrecord_t *crec;
 
-	xdf->rindex = NULL;
+	xdf->reference_index = NULL;
 	xdf->changed = NULL;
 	xdf->recs = NULL;
 
@@ -169,7 +169,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 
 	if ((XDF_DIFF_ALG(xpp->flags) != XDF_PATIENCE_DIFF) &&
 	    (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF)) {
-		if (!XDL_ALLOC_ARRAY(xdf->rindex, xdf->nrec + 1))
+		if (!XDL_ALLOC_ARRAY(xdf->reference_index, xdf->nrec + 1))
 			goto abort;
 	}
 
@@ -312,7 +312,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	     i <= xdf1->dend; i++, recs++) {
 		if (action1[i] == KEEP ||
 		    (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xdf1->dstart, xdf1->dend))) {
-			xdf1->rindex[xdf1->nreff++] = i;
+			xdf1->reference_index[xdf1->nreff++] = i;
 			/* changed[i] remains false, i.e. keep */
 		} else
 			xdf1->changed[i] = true;
@@ -324,7 +324,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	     i <= xdf2->dend; i++, recs++) {
 		if (action2[i] == KEEP ||
 		    (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xdf2->dstart, xdf2->dend))) {
-			xdf2->rindex[xdf2->nreff++] = i;
+			xdf2->reference_index[xdf2->nreff++] = i;
 			/* changed[i] remains false, i.e. keep */
 		} else
 			xdf2->changed[i] = true;
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index 3bcc0920e0..5accbec284 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -49,7 +49,7 @@ typedef struct s_xdfile {
 	xrecord_t *recs;
 	size_t nrec;
 	bool *changed;
-	size_t *rindex;
+	size_t *reference_index;
 	size_t nreff;
 	ptrdiff_t dstart, dend;
 } xdfile_t;
-- 
gitgitgadget


* Re: [PATCH v3 00/10] Xdiff cleanup part2
  2025-11-11 19:42   ` [PATCH v3 " Ezekiel Newren via GitGitGadget
                       ` (9 preceding siblings ...)
  2025-11-11 19:42     ` [PATCH v3 10/10] xdiff: rename rindex -> reference_index Ezekiel Newren via GitGitGadget
@ 2025-11-11 23:40     ` Junio C Hamano
  2025-11-14  5:52       ` Ezekiel Newren
  2025-11-14 22:36     ` [PATCH v4 " Ezekiel Newren via GitGitGadget
  11 siblings, 1 reply; 118+ messages in thread
From: Junio C Hamano @ 2025-11-11 23:40 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget
  Cc: git, Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren

"Ezekiel Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> The primary goal of this patch series is to convert every field's type in
> xrecord_t and xdfile_t to be unambiguous, in preparation to make it more
> Rust FFI friendly. Additionally the ha field in xrecord_t is split into
> line_hash and minimal_perfect_hash.

After having read the series to its end, I am left with this feeling
that it does only half the things that it needs to do.  It does
everything the above paragraph claims it does, sure, in that the relevant
data structures now use not "long" but "size_t", not "char" but
"uint8_t", etc., and I do find the resulting data structures sensibly
described.

But for the code to be truly consistent between the data structures
and the operations that work on them, types of on-stack variables
and function parameters would need to be updated to match these
struct members.  As we convert one structure member at a time, casts
may need to be sprinkled for assignments to these variables and
passing these struct members as parameters to functions (which I
commented on one of these patches) to keep the blast radius of the
changes in each step manageable, but I would have expected that
functions that used to take, say, an "int", would be updated to take
"size_t" if the value coming to the parameter is from these struct
members.
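
To illustrate why the casts matter while long variables still meet
size_t members: on common ABIs, comparing a long with a size_t converts
the long to an unsigned type first, so a negative value compares as a
very large one. The sketch below spells that conversion out explicitly;
both function names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* What an uncast "i < nrec" effectively does on common ABIs. */
static bool lt_as_unsigned(long i, size_t nrec)
{
	return (size_t)i < nrec;
}

/* The style used in this series: bring nrec into the signed domain. */
static bool lt_signed(long i, size_t nrec)
{
	/* Otherwise the cast below would not preserve the value. */
	assert(nrec <= (size_t)LONG_MAX);
	return i < (long)nrec;
}
```

For i = -1 and nrec = 5 the signed form says "less than" while the
unsigned form does not, which is the kind of silent behaviour change a
cast at each call site avoids until the variables themselves are
converted.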

Perhaps that would be the theme for "Xdiff cleanup part 3" series
that we will eventually see after the dust settles from this round?

Thanks.


* Re: [PATCH v3 00/10] Xdiff cleanup part2
  2025-11-11 23:40     ` [PATCH v3 00/10] Xdiff cleanup part2 Junio C Hamano
@ 2025-11-14  5:52       ` Ezekiel Newren
  0 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren @ 2025-11-14  5:52 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ezekiel Newren via GitGitGadget, git, Kristoffer Haugsbakk,
	Patrick Steinhardt, Phillip Wood, Chris Torek

On Tue, Nov 11, 2025 at 4:40 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Ezekiel Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > The primary goal of this patch series is to convert every field's type in
> > xrecord_t and xdfile_t to be unambiguous, in preparation to make it more
> > Rust FFI friendly. Additionally the ha field in xrecord_t is split into
> > line_hash and minimal_perfect_hash.
>
> After having read the series to its end, I am left with this feeling
> that it does only half the things that it needs to do.  It does
> everything the above paragraph claims it does, sure, in that the relevant
> data structures now use not "long" but "size_t", not "char" but
> "uint8_t", etc., and I do find the resulting data structures sensibly
> described.

This patch series is already 10 commits long, and it's been a
challenge to chunk cleanups of Xdiff because its code is so tangled.
I'm hoping that future maintenance of Xdiff (after my xdiff cleanup
series is complete) will be much easier.

> But for the code to be truly consistent between the data structures
> and the operations that work on them, types of on-stack variables
> and function parameters would need to be updated to match these
> struct members.  As we convert one structure member at a time, casts
> may need to be sprinkled for assignments to these variables and
> passing these struct members as parameters to functions (which I
> commented on one of these patches) to keep the blast radius of the
> changes in each step manageable, but I would have expected that
> functions that used to take, say, an "int", would be updated to take
> "size_t" if the value coming to the parameter is from these struct
> members.

I had to draw the line somewhere, and I plan on making more changes to
delete more idiosyncrasies in Xdiff.

> Perhaps that would be the theme for "Xdiff cleanup part 3" series
> that we will eventually see after the dust settles from this round?

Not just part 3, but the entire xdiff cleanup series will be about
correcting types among many other code cleanups. This patch series
alone is unsatisfactory, but it is only 1 of many patch series to
come.


* [PATCH v4 00/10] Xdiff cleanup part2
  2025-11-11 19:42   ` [PATCH v3 " Ezekiel Newren via GitGitGadget
                       ` (10 preceding siblings ...)
  2025-11-11 23:40     ` [PATCH v3 00/10] Xdiff cleanup part2 Junio C Hamano
@ 2025-11-14 22:36     ` Ezekiel Newren via GitGitGadget
  2025-11-14 22:36       ` [PATCH v4 01/10] doc: define unambiguous type mappings across C and Rust Ezekiel Newren via GitGitGadget
                         ` (10 more replies)
  11 siblings, 11 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-14 22:36 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren

Changes in v4:

 * Update documentation to not mention Unicode except once
 * Don't move dstart/dend within the xdfile_t struct
 * Rephrase justification on changing xrecord_t.ptr's type

Changes in v3:

 * Address comments about commit messages and documentation
 * Add unambiguous-types.adoc to Makefile and Meson
 * Use markdown style to avoid asciidoc issues

Changes in v2:

 * Added documentation about unambiguous types and FFI
 * Addressed comments on the mailing list


Original cover letter below:
============================

Maintainer note: This patch series builds on top of en/xdiff-cleanup and
am/xdiff-hash-tweak (both of which are now in master).

The primary goal of this patch series is to convert every field's type in
xrecord_t and xdfile_t to be unambiguous, in preparation to make it more
Rust FFI friendly. Additionally the ha field in xrecord_t is split into
line_hash and minimal_perfect_hash.

The order of some of the fields has changed as called out by the commit
messages.

Before:

typedef struct s_xrecord {
	char const *ptr;
	long size;
	unsigned long ha;
} xrecord_t;

typedef struct s_xdfile {
	xrecord_t *recs;
	long nrec;
	long dstart, dend;
	bool *changed;
	long *rindex;
	long nreff;
} xdfile_t;


After part 2

typedef struct s_xrecord {
	uint8_t const *ptr;
	size_t size;
	uint64_t line_hash;
	size_t minimal_perfect_hash;
} xrecord_t;

typedef struct s_xdfile {
	xrecord_t *recs;
	size_t nrec;
	ptrdiff_t dstart, dend;
	bool *changed;
	size_t *reference_index;
	size_t nreff;
} xdfile_t;


Ezekiel Newren (10):
  doc: define unambiguous type mappings across C and Rust
  xdiff: use ptrdiff_t for dstart/dend
  xdiff: make xrecord_t.ptr a uint8_t instead of char
  xdiff: use size_t for xrecord_t.size
  xdiff: use unambiguous types in xdl_hash_record()
  xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash
  xdiff: make xdfile_t.nrec a size_t instead of long
  xdiff: make xdfile_t.nreff a size_t instead of long
  xdiff: change rindex from long to size_t in xdfile_t
  xdiff: rename rindex -> reference_index

 Documentation/Makefile                        |   1 +
 Documentation/technical/meson.build           |   1 +
 .../technical/unambiguous-types.adoc          | 224 ++++++++++++++++++
 xdiff-interface.c                             |   2 +-
 xdiff/xdiffi.c                                |  29 ++-
 xdiff/xemit.c                                 |  28 +--
 xdiff/xhistogram.c                            |   4 +-
 xdiff/xmerge.c                                |  30 +--
 xdiff/xpatience.c                             |  14 +-
 xdiff/xprepare.c                              |  60 ++---
 xdiff/xtypes.h                                |  15 +-
 xdiff/xutils.c                                |  32 +--
 xdiff/xutils.h                                |   6 +-
 13 files changed, 336 insertions(+), 110 deletions(-)
 create mode 100644 Documentation/technical/unambiguous-types.adoc


base-commit: a99f379adf116d53eb11957af5bab5214915f91d
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-2070%2Fezekielnewren%2Fxdiff_cleanup_part2-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-2070/ezekielnewren/xdiff_cleanup_part2-v4
Pull-Request: https://github.com/git/git/pull/2070

Range-diff vs v3:

  1:  e5d084d340 !  1:  af732beb69 doc: define unambiguous type mappings across C and Rust
     @@ Metadata
       ## Commit message ##
          doc: define unambiguous type mappings across C and Rust
      
     -    Document other nuances with crossing the FFI boundary. Other language
     +    Document other nuances when crossing the FFI boundary. Other language
          mappings may be added in the future.
      
          Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
     @@ Documentation/technical/unambiguous-types.adoc (new)
      +
      +This is where C and Rust don't have a clean one-to-one mapping.
      +
     ++A C `char` and a Rust `u8` share the same bit width, so any C struct containing
     ++a `char` will have the same size as the corresponding Rust struct using `u8`.
     ++In that sense, such structs are safe to pass over the FFI boundary, because
     ++their fields will be laid out identically. However, beyond bit width, C `char`
     ++has additional semantics and platform-dependent behavior that can cause
     ++problems, as discussed below.
     ++
      +C comparison problem: While the sign of `char` is implementation defined, it's
      +also signless (neither signed nor unsigned). When building with
      +`make DEVELOPER=1` it will complain about a "differ in signedness" when `char`
      +is compared with `uint8_t` or `int8_t`.
      +
     -+Rust's `char` type is an unsigned 32-bit integer that is used to describe
     -+Unicode code points. Even though a C `char` is the same width as `u8`, `char`
     -+should be converted to u8 where it is describing bytes in memory. If a C
     -+`char` is not describing bytes, then it should be converted to a more accurate
     -+unambiguous type. The reason for mentioning Unicode here is because of how &str
     -+is defined in Rust and how to create a &str from &[u8]. Rust assumes that &str
     -+is a correctly encoded utf-8 string, i.e. text in memory. Where as a C `char`
     -+makes no assumption about the bytes that it is representing.
     -+
     -+```
     -+let raw_bytes = b"abc\n";
     -+let result = std::str::from_utf8(raw_bytes);
     -+if let Ok(line) = result {
     -+    // do something with text
     -+}
     -+```
     -+
     -+While you could specify `char` in the C code and `u8` in Rust code, it's not as
     -+clear what the appropriate type is, but it would work across the FFI boundary.
     -+However, the bigger problem comes from code generation tools like cbindgen and
     -+bindgen. When cbindgen sees u8 in Rust it will generate uint8_t on the C side
     -+which will cause differ in signedness warnings/errors. Similarly if bindgen
     -+sees `char` on the C side it will generate `std::ffi::c_char` which has its own
     -+problems.
     ++Note: Rust's `char` type is an unsigned 32-bit integer that is used to describe
     ++Unicode code points.
      +
      +=== Notes
      +^1^ This is only true if stdbool.h (or equivalent) is used. +
  2:  52e3f589b1 !  2:  b60a03eb31 xdiff: use ptrdiff_t for dstart/dend
     @@ Commit message
          ptrdiff_t is appropriate for dstart and dend because they both describe
          positive or negative offsets relative to a pointer.
      
     -    A future patch will move these fields to a different struct. Moving
     -    them to the end of xdfile_t now, means the field order of xdfile_t will
     -    be disturbed less.
     -
          Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
      
       ## xdiff/xtypes.h ##
     @@ xdiff/xtypes.h: typedef struct s_xrecord {
       	xrecord_t *recs;
       	long nrec;
      -	long dstart, dend;
     ++	ptrdiff_t dstart, dend;
       	bool *changed;
       	long *rindex;
       	long nreff;
     -+	ptrdiff_t dstart, dend;
     - } xdfile_t;
     - 
     - typedef struct s_xdfenv {
  3:  83e7bf180a !  3:  042fbb11d0 xdiff: make xrecord_t.ptr a uint8_t instead of char
     @@ Commit message
      
          Make xrecord_t.ptr uint8_t because it's referring to bytes in memory.
      
     -    Every usage of this field was inspected and cast to char*, or similar,
     -    to avoid signedness warnings/errors from the compiler. Casting was used
     -    so that the whole of xdiff doesn't need to be refactored in order to
     -    change the type of this field.
     +    In order to avoid a refactor avalanche, many uses of this field were
     +    cast to char* or similar. One exception is in get_indent() where the
     +    local variable `char c` was changed to `uint8_t c`.
     +
     +    Places where casting was unnecessary:
     +    xemit.c:156
     +    xmerge.c:124
     +    xmerge.c:127
     +    xmerge.c:164
     +    xmerge.c:169
     +    xmerge.c:172
     +    xmerge.c:178
      
          Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
      
  4:  da2b80ea0b =  4:  c103fa6bea xdiff: use size_t for xrecord_t.size
  5:  c6ba630ac5 =  5:  2ee9a74653 xdiff: use unambiguous types in xdl_hash_record()
  6:  3834ea8f9b !  6:  f044274bd5 xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash
     @@ Commit message
              field. It comes from the classifier's general-purpose hash table,
              which assigns each line a unique and minimal hash across the two
              files. A size_t is used here because it's meant to be used to
     -        index an array. This also this avoids ` as usize` casts on the Rust
     +        index an array. This also avoids ` as usize` casts on the Rust
              side when using it to index a slice.
      
          Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
  7:  e2a2c7530c !  7:  f7a3731d94 xdiff: make xdfile_t.nrec a size_t instead of long
     @@ xdiff/xtypes.h: typedef struct s_xrecord {
       	xrecord_t *recs;
      -	long nrec;
      +	size_t nrec;
     + 	ptrdiff_t dstart, dend;
       	bool *changed;
       	long *rindex;
     - 	long nreff;
  8:  31cd2a1aa4 !  8:  93f84ae72e xdiff: make xdfile_t.nreff a size_t instead of long
     @@ xdiff/xprepare.c: static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *
      
       ## xdiff/xtypes.h ##
      @@ xdiff/xtypes.h: typedef struct s_xdfile {
     - 	size_t nrec;
     + 	ptrdiff_t dstart, dend;
       	bool *changed;
       	long *rindex;
      -	long nreff;
      +	size_t nreff;
     - 	ptrdiff_t dstart, dend;
       } xdfile_t;
       
     + typedef struct s_xdfenv {
  9:  aee0d3958b !  9:  39369becc8 xdiff: change rindex from long to size_t in xdfile_t
     @@ Commit message
      
       ## xdiff/xtypes.h ##
      @@ xdiff/xtypes.h: typedef struct s_xdfile {
     - 	xrecord_t *recs;
       	size_t nrec;
     + 	ptrdiff_t dstart, dend;
       	bool *changed;
      -	long *rindex;
      +	size_t *rindex;
       	size_t nreff;
     - 	ptrdiff_t dstart, dend;
       } xdfile_t;
     + 
 10:  75c26fe160 ! 10:  950d1e6193 xdiff: rename rindex -> reference_index
     @@ xdiff/xprepare.c: static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *
      
       ## xdiff/xtypes.h ##
      @@ xdiff/xtypes.h: typedef struct s_xdfile {
     - 	xrecord_t *recs;
       	size_t nrec;
     + 	ptrdiff_t dstart, dend;
       	bool *changed;
      -	size_t *rindex;
      +	size_t *reference_index;
       	size_t nreff;
     - 	ptrdiff_t dstart, dend;
       } xdfile_t;
     + 

-- 
gitgitgadget


* [PATCH v4 01/10] doc: define unambiguous type mappings across C and Rust
  2025-11-14 22:36     ` [PATCH v4 " Ezekiel Newren via GitGitGadget
@ 2025-11-14 22:36       ` Ezekiel Newren via GitGitGadget
  2025-11-15  3:06         ` Ramsay Jones
  2025-11-14 22:36       ` [PATCH v4 02/10] xdiff: use ptrdiff_t for dstart/dend Ezekiel Newren via GitGitGadget
                         ` (9 subsequent siblings)
  10 siblings, 1 reply; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-14 22:36 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

Document other nuances when crossing the FFI boundary. Other language
mappings may be added in the future.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 Documentation/Makefile                        |   1 +
 Documentation/technical/meson.build           |   1 +
 .../technical/unambiguous-types.adoc          | 224 ++++++++++++++++++
 3 files changed, 226 insertions(+)
 create mode 100644 Documentation/technical/unambiguous-types.adoc

diff --git a/Documentation/Makefile b/Documentation/Makefile
index 04e9e10b27..bc1adb2d9d 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -142,6 +142,7 @@ TECH_DOCS += technical/shallow
 TECH_DOCS += technical/sparse-checkout
 TECH_DOCS += technical/sparse-index
 TECH_DOCS += technical/trivial-merge
+TECH_DOCS += technical/unambiguous-types
 TECH_DOCS += technical/unit-tests
 SP_ARTICLES += $(TECH_DOCS)
 SP_ARTICLES += technical/api-index
diff --git a/Documentation/technical/meson.build b/Documentation/technical/meson.build
index be698ef22a..89a6e26821 100644
--- a/Documentation/technical/meson.build
+++ b/Documentation/technical/meson.build
@@ -32,6 +32,7 @@ articles = [
   'sparse-checkout.adoc',
   'sparse-index.adoc',
   'trivial-merge.adoc',
+  'unambiguous-types.adoc',
   'unit-tests.adoc',
 ]
 
diff --git a/Documentation/technical/unambiguous-types.adoc b/Documentation/technical/unambiguous-types.adoc
new file mode 100644
index 0000000000..9a42c72890
--- /dev/null
+++ b/Documentation/technical/unambiguous-types.adoc
@@ -0,0 +1,224 @@
+= Unambiguous types
+
+Most of these mappings are obvious, but there are some nuances and gotchas with
+Rust FFI (Foreign Function Interface).
+
+This document defines clear, one-to-one mappings between primitive types in C,
+Rust (and possibly other languages in the future). Its purpose is to eliminate
+ambiguity in type widths, signedness, and binary representation across
+platforms and languages.
+
+For Git, the only header required to use these unambiguous types in C is
+`git-compat-util.h`.
+
+== Boolean types
+[cols="1,1", options="header"]
+|===
+| C Type | Rust Type
+| bool^1^       | bool
+|===
+
+== Integer types
+
+In C, `<stdint.h>` (or an equivalent) must be included.
+
+[cols="1,1", options="header"]
+|===
+| C Type | Rust Type
+| uint8_t    | u8
+| uint16_t   | u16
+| uint32_t   | u32
+| uint64_t   | u64
+
+| int8_t     | i8
+| int16_t    | i16
+| int32_t    | i32
+| int64_t    | i64
+|===
+
+== Floating-point types
+
+Rust requires IEEE-754 semantics.
+In C, that is typically true, but not guaranteed by the standard.
+
+[cols="1,1", options="header"]
+|===
+| C Type | Rust Type
+| float^2^      | f32
+| double^2^     | f64
+|===
+
+== Size types
+
+These types represent pointer-sized integers and are typically defined in
+`<stddef.h>` or an equivalent header.
+
+Size types should be used any time pointer arithmetic is performed, e.g.
+indexing an array, describing the number of elements in memory, etc.
+
+[cols="1,1", options="header"]
+|===
+| C Type | Rust Type
+| size_t^3^     | usize
+| ptrdiff_t^3^  | isize
+|===
+
+== Character types
+
+This is where C and Rust don't have a clean one-to-one mapping.
+
+A C `char` and a Rust `u8` share the same bit width, so any C struct containing
+a `char` will have the same size as the corresponding Rust struct using `u8`.
+In that sense, such structs are safe to pass over the FFI boundary, because
+their fields will be laid out identically. However, beyond bit width, C `char`
+has additional semantics and platform-dependent behavior that can cause
+problems, as discussed below.
+
+C comparison problem: While the sign of `char` is implementation defined, it's
+also signless (neither signed nor unsigned). When building with
+`make DEVELOPER=1` it will complain about a "differ in signedness" when `char`
+is compared with `uint8_t` or `int8_t`.
+
+Note: Rust's `char` type is an unsigned 32-bit integer that is used to describe
+Unicode code points.
+
+=== Notes
+^1^ This is only true if stdbool.h (or equivalent) is used. +
+^2^ C does not enforce IEEE-754 compatibility, but Rust expects it. If the
+platform/arch for C does not follow IEEE-754 then this equivalence does not
+hold. Also, it's assumed that `float` is 32 bits and `double` is 64, but
+there may be a strange platform/arch where even this isn't true. +
+^3^ C also defines uintptr_t, ssize_t and intptr_t, but these types are
+discouraged for FFI purposes. For functions like `read()` and `write()` ssize_t
+should be cast to a different, and unambiguous, type before being passed over
+the FFI boundary. +
+
+== Problems with std::ffi::c_* types in Rust
+TL;DR: In practice, Rust's `c_*` types aren't guaranteed to match C types for
+all possible C compilers, platforms, or architectures, because Rust only
+ensures correctness of C types on officially supported targets. These
+definitions have changed over time to match more targets which means that the
+c_* definitions will differ based on which Rust version Git chooses to use.
+
+Current list of safe, Rust side, FFI types in Git: +
+
+* `c_void`
+* `CStr`
+* `CString`
+
+Even then, they should be used sparingly, and only where the semantics match
+exactly.
+
+The std::os::raw::c_* types directly inherit the problems of core::ffi, which
+changes over time and seems to make a best guess at the correct definition for
+a given platform/target. This probably isn't a problem for all other platforms
+that Rust supports currently, but can anyone say that Rust got it right for all
+C compilers of all platforms/targets?
+
+To give an example: the definition of c_long differs between Rust versions
+footnote:[https://doc.rust-lang.org/1.63.0/src/core/ffi/mod.rs.html#175-189[c_long in 1.63.0]]
+footnote:[https://doc.rust-lang.org/1.89.0/src/core/ffi/primitives.rs.html#135-151[c_long in 1.89.0]]
+
+=== Rust version 1.63.0
+
+```
+mod c_long_definition {
+    cfg_if! {
+        if #[cfg(all(target_pointer_width = "64", not(windows)))] {
+            pub type c_long = i64;
+            pub type NonZero_c_long = crate::num::NonZeroI64;
+            pub type c_ulong = u64;
+            pub type NonZero_c_ulong = crate::num::NonZeroU64;
+        } else {
+            // The minimal size of `long` in the C standard is 32 bits
+            pub type c_long = i32;
+            pub type NonZero_c_long = crate::num::NonZeroI32;
+            pub type c_ulong = u32;
+            pub type NonZero_c_ulong = crate::num::NonZeroU32;
+        }
+    }
+}
+```
+
+=== Rust version 1.89.0
+
+```
+mod c_long_definition {
+    crate::cfg_select! {
+        any(
+            all(target_pointer_width = "64", not(windows)),
+            // wasm32 Linux ABI uses 64-bit long
+            all(target_arch = "wasm32", target_os = "linux")
+        ) => {
+            pub(super) type c_long = i64;
+            pub(super) type c_ulong = u64;
+        }
+        _ => {
+            // The minimal size of `long` in the C standard is 32 bits
+            pub(super) type c_long = i32;
+            pub(super) type c_ulong = u32;
+        }
+    }
+}
+```
+
+Even for the cases where C types are correctly mapped to Rust types via
+std::ffi::c_* there are still problems. Let's take c_char for example. On some
+platforms it's u8, on others it's i8.
+
+=== Subtraction underflow in debug mode
+
+The following code will panic in debug on platforms that define c_char as u8,
+but won't if it's an i8.
+
+```
+let mut x: std::ffi::c_char = 0;
+x -= 1;
+```
+
+=== Inconsistent shift behavior
+
+`x` will be 0xC0 for platforms that use i8, but will be 0x40 where it's u8.
+
+```
+let mut x: std::ffi::c_char = 0x80;
+x >>= 1;
+```
+
+=== Equality fails to compile on some platforms
+
+The following will not compile on platforms that define c_char as i8, but will
+if it's u8. You can cast x e.g. `assert_eq!(x as u8, b'a');`, but then you get
+a warning on platforms that use u8 and a clean compilation where i8 is used.
+
+```
+let mut x: std::ffi::c_char = 0x61;
+assert_eq!(x, b'a');
+```
+
+== Enum types
+Rust enum types should not be used as FFI types. Rust enum types are more like
+C union types than C enums. For something like:
+
+```
+#[repr(C, u8)]
+enum Fruit {
+    Apple,
+    Banana,
+    Cherry,
+}
+```
+
+It's easy enough to make sure the Rust enum matches what C would expect. A
+more complex type is harder to match, for example:
+
+```
+enum HashResult {
+    SHA1([u8; 20]),
+    SHA256([u8; 32]),
+}
+```
+
+The Rust compiler has to add a discriminant to the enum to distinguish between
+the variants. The width, location, and values for that discriminant are up to
+the Rust compiler and is not ABI stable.
-- 
gitgitgadget



* Re: [PATCH v4 01/10] doc: define unambiguous type mappings across C and Rust
  2025-11-14 22:36       ` [PATCH v4 01/10] doc: define unambiguous type mappings across C and Rust Ezekiel Newren via GitGitGadget
@ 2025-11-15  3:06         ` Ramsay Jones
  2025-11-15  3:41           ` Ben Knoble
  0 siblings, 1 reply; 118+ messages in thread
From: Ramsay Jones @ 2025-11-15  3:06 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget, git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren



On 14/11/2025 10:36 pm, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
> 
> Document other nuances when crossing the FFI boundary. Other language
> mappings may be added in the future.
> 
> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
> ---
>  Documentation/Makefile                        |   1 +
>  Documentation/technical/meson.build           |   1 +
>  .../technical/unambiguous-types.adoc          | 224 ++++++++++++++++++
>  3 files changed, 226 insertions(+)
>  create mode 100644 Documentation/technical/unambiguous-types.adoc
> 
[snip]

> +== Character types
> +
> +This is where C and Rust don't have a clean one-to-one mapping.
> +
> +A C `char` and a Rust `u8` share the same bit width, so any C struct containing
> +a `char` will have the same size as the corresponding Rust struct using `u8`.
> +In that sense, such structs are safe to pass over the FFI boundary, because
> +their fields will be laid out identically. However, beyond bit width, C `char`
> +has additional semantics and platform-dependent behavior that can cause
> +problems, as discussed below.
> +
> +C comparison problem: While the sign of `char` is implementation defined, it's
> +also signless (neither signed nor unsigned). When building with

Hmm, this sets my teeth on edge. The C char type is not 'signless' (whatever that is
supposed to mean); its signedness is implementation-defined behaviour. This means
that it is 'unspecified behavior where each implementation documents how the choice
is made'. In particular, it has to document:

  "Which of signed char or unsigned char has the same range, representation, and
   behavior as "plain" char (6.2.5, 6.3.1.1)."

(it is still a distinct type, however). Note that some compilers even allow you to
specify which you want for a given compilation! (see gcc options -f[un]signed-char
and their inverse 'no' options!)
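
The "distinct type" point can be checked with a minimal C11 sketch. The names `CHAR_KIND`, `plain_char_is_signed()` and `check_plain_char_is_distinct()` are hypothetical and exist only for this illustration; they are not part of Git or of the patch under review. The sketch relies on `_Generic` rejecting duplicate compatible types, so the fact that it compiles at all suggests that plain `char`, `signed char` and `unsigned char` are three separate types, whichever signedness the implementation picked.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/*
 * Hypothetical macro: name the character type of an expression.
 * Listing all three types in one _Generic selection is only valid
 * because they are distinct (mutually incompatible) types.
 */
#define CHAR_KIND(x) _Generic((x), \
	char: "char", \
	signed char: "signed char", \
	unsigned char: "unsigned char", \
	default: "other")

/* Hypothetical helper: 1 where plain char is signed, 0 where it is unsigned. */
int plain_char_is_signed(void)
{
	return CHAR_MIN < 0;
}

/* Hypothetical self-check: each variable selects its own association. */
void check_plain_char_is_distinct(void)
{
	char c = 'a';
	signed char sc = 'a';
	unsigned char uc = 'a';

	assert(strcmp(CHAR_KIND(c), "char") == 0);
	assert(strcmp(CHAR_KIND(sc), "signed char") == 0);
	assert(strcmp(CHAR_KIND(uc), "unsigned char") == 0);
}
```

Calling `check_plain_char_is_distinct()` from a test program should pass regardless of whether the build uses `-fsigned-char` or `-funsigned-char`; only the result of `plain_char_is_signed()` is expected to change.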


ATB,
Ramsay Jones




* Re: [PATCH v4 01/10] doc: define unambiguous type mappings across C and Rust
  2025-11-15  3:06         ` Ramsay Jones
@ 2025-11-15  3:41           ` Ben Knoble
  2025-11-15 14:55             ` Ramsay Jones
  0 siblings, 1 reply; 118+ messages in thread
From: Ben Knoble @ 2025-11-15  3:41 UTC (permalink / raw)
  To: Ramsay Jones
  Cc: Ezekiel Newren via GitGitGadget, git, Kristoffer Haugsbakk,
	Patrick Steinhardt, Phillip Wood, Chris Torek, Ezekiel Newren


> Le 14 nov. 2025 à 22:09, Ramsay Jones <ramsay@ramsayjones.plus.com> a écrit :
> 
> 
> 
>> On 14/11/2025 10:36 pm, Ezekiel Newren via GitGitGadget wrote:
>> From: Ezekiel Newren <ezekielnewren@gmail.com>
>> 
>> Document other nuances when crossing the FFI boundary. Other language
>> mappings may be added in the future.
>> 
>> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
>> ---
>> Documentation/Makefile                        |   1 +
>> Documentation/technical/meson.build           |   1 +
>> .../technical/unambiguous-types.adoc          | 224 ++++++++++++++++++
>> 3 files changed, 226 insertions(+)
>> create mode 100644 Documentation/technical/unambiguous-types.adoc
>> 
> [snip]
> 
>> +== Character types
>> +
>> +This is where C and Rust don't have a clean one-to-one mapping.
>> +
>> +A C `char` and a Rust `u8` share the same bit width, so any C struct containing
>> +a `char` will have the same size as the corresponding Rust struct using `u8`.
>> +In that sense, such structs are safe to pass over the FFI boundary, because
>> +their fields will be laid out identically. However, beyond bit width, C `char`
>> +has additional semantics and platform-dependent behavior that can cause
>> +problems, as discussed below.
>> +
>> +C comparison problem: While the sign of `char` is implementation defined, it's
>> +also signless (neither signed nor unsigned). When building with
> 
> Hmm, this sets my teeth on edge. The C char type is not 'signless' (whatever that is
> supposed to mean), it's 'sign-ness' is implementation-defined behaviour. This means
> that it is 'unspecified behavior where each implementation documents how the choice
> is made'. In particular, it has to document:
> 
>  "Which of signed char or unsigned char has the same range, representation, and
>   behavior as "plain" char (6.2.5, 6.3.1.1)."
> 
> (it is still a distinct type, however). Note that some compilers even allow you to
> specify which you want for a given compilation! (see gcc options -f[un]signed-char
> and their inverse 'no' options!)
> 
> 
> ATB,
> Ramsay Jones

This was discussed briefly in replies to v2’s 2/10, where Ezekiel said that DEVELOPER=1 warned about sign issues whether char was compared to int or unsigned. [From mobile I cannot reliably paste the message ID or link and preserve a plain-text email, apologies for the oblique reference.]


* Re: [PATCH v4 01/10] doc: define unambiguous type mappings across C and Rust
  2025-11-15  3:41           ` Ben Knoble
@ 2025-11-15 14:55             ` Ramsay Jones
  2025-11-15 16:42               ` Junio C Hamano
  0 siblings, 1 reply; 118+ messages in thread
From: Ramsay Jones @ 2025-11-15 14:55 UTC (permalink / raw)
  To: Ben Knoble
  Cc: Ezekiel Newren via GitGitGadget, git, Kristoffer Haugsbakk,
	Patrick Steinhardt, Phillip Wood, Chris Torek, Ezekiel Newren



On 15/11/2025 3:41 am, Ben Knoble wrote:
> 
>> Le 14 nov. 2025 à 22:09, Ramsay Jones <ramsay@ramsayjones.plus.com> a écrit :
>>
>> 
>>
>>> On 14/11/2025 10:36 pm, Ezekiel Newren via GitGitGadget wrote:
>>> From: Ezekiel Newren <ezekielnewren@gmail.com>
>>>
>>> Document other nuances when crossing the FFI boundary. Other language
>>> mappings may be added in the future.
>>>
>>> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
>>> ---
>>> Documentation/Makefile                        |   1 +
>>> Documentation/technical/meson.build           |   1 +
>>> .../technical/unambiguous-types.adoc          | 224 ++++++++++++++++++
>>> 3 files changed, 226 insertions(+)
>>> create mode 100644 Documentation/technical/unambiguous-types.adoc
>>>
>> [snip]
>>
>>> +== Character types
>>> +
>>> +This is where C and Rust don't have a clean one-to-one mapping.
>>> +
>>> +A C `char` and a Rust `u8` share the same bit width, so any C struct containing
>>> +a `char` will have the same size as the corresponding Rust struct using `u8`.
>>> +In that sense, such structs are safe to pass over the FFI boundary, because
>>> +their fields will be laid out identically. However, beyond bit width, C `char`
>>> +has additional semantics and platform-dependent behavior that can cause
>>> +problems, as discussed below.
>>> +
>>> +C comparison problem: While the sign of `char` is implementation defined, it's
>>> +also signless (neither signed nor unsigned). When building with
>>
>> Hmm, this sets my teeth on edge. The C char type is not 'signless' (whatever that is
>> supposed to mean), it's 'sign-ness' is implementation-defined behaviour. This means
>> that it is 'unspecified behavior where each implementation documents how the choice
>> is made'. In particular, it has to document:
>>
>>  "Which of signed char or unsigned char has the same range, representation, and
>>   behavior as "plain" char (6.2.5, 6.3.1.1)."
>>
>> (it is still a distinct type, however). Note that some compilers even allow you to
>> specify which you want for a given compilation! (see gcc options -f[un]signed-char
>> and their inverse 'no' options!)
>>
>>
>> ATB,
>> Ramsay Jones
> 
> This was discussed briefly in replies to v2’s 2/10, where Ezekiel said that DEVELOPER=1 warned about sign issues whether char was compared to int or unsigned. [From mobile I cannot reliably paste the message ID or link and preserve a plain-text email, apologies for the oblique reference.]

Err... sorry, but I don't see how this comment relates to my email. puzzled! ;)

ATB,
Ramsay Jones





* Re: [PATCH v4 01/10] doc: define unambiguous type mappings across C and Rust
  2025-11-15 14:55             ` Ramsay Jones
@ 2025-11-15 16:42               ` Junio C Hamano
  2025-11-15 16:59                 ` D. Ben Knoble
  2025-11-17  1:20                 ` Junio C Hamano
  0 siblings, 2 replies; 118+ messages in thread
From: Junio C Hamano @ 2025-11-15 16:42 UTC (permalink / raw)
  To: Ramsay Jones
  Cc: Ben Knoble, Ezekiel Newren via GitGitGadget, git,
	Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren

Ramsay Jones <ramsay@ramsayjones.plus.com> writes:

>> This was discussed briefly in replies to v2’s 2/10, where
>> Ezekiel said that DEVELOPER=1 warned about sign issues whether
>> char was compared to int or unsigned. [From mobile I cannot
>> reliably paste the message ID or link and preserve a plain-text
>> email, apologies for the oblique reference.]
>
> Err... sorry, but I don't see how this comment relates to my
> email. puzzled! ;)

Me neither, but I suspect it may mostly be the use of the non-word "signless"
that is the issue.  It is understandable for the -Wsign-compare
warning (especially given that it very often complains about
perfectly good pieces of code) to complain when you compare a "char"
with a signed integer, saying "on a platform where 'char' is
unsigned, you would be comparing signed and unsigned values with
this expression", and at the same time complain when you compare a
"char" with an unsigned integer, saying "on a platform where 'char'
is signed...".

I'd say it shows more about how garbage -Wsign-compare is than about
how 'char' is ambiguous and should be avoided, but others may have
different opinions.
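
Independent of which diagnostics a given compiler emits, the underlying portability hazard can be sketched in a few lines of C. The helper names `char_equals_byte()` and `signed_char_equals_byte()` are hypothetical, made up for this illustration, and the second one uses `signed char` as a stand-in for a platform where plain `char` is signed. This is a sketch of the general integer-promotion behaviour, not a reproduction of any specific warning from the developer build.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: compare a plain char against a byte value.
 * Converting to unsigned char first gives the same answer whether
 * plain char is signed or unsigned on the target.
 */
int char_equals_byte(char c, uint8_t byte)
{
	return (unsigned char)c == byte;
}

/*
 * Hypothetical helper modelling a signed plain char without the
 * conversion: both operands are promoted to int, so a negative
 * char value such as -23 never equals the byte value 233 (0xE9).
 */
int signed_char_equals_byte(signed char c, uint8_t byte)
{
	return c == byte;
}
```

With these definitions, `char_equals_byte('a', 'a')` is true everywhere, while `signed_char_equals_byte(-23, 0xE9)` is false even though both operands carry the same bit pattern on two's complement machines. That mismatch is the kind of surprise an explicit `uint8_t` field avoids.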


* Re: [PATCH v4 01/10] doc: define unambiguous type mappings across C and Rust
  2025-11-15 16:42               ` Junio C Hamano
@ 2025-11-15 16:59                 ` D. Ben Knoble
  2025-11-15 20:03                   ` Junio C Hamano
  2025-11-17  1:20                 ` Junio C Hamano
  1 sibling, 1 reply; 118+ messages in thread
From: D. Ben Knoble @ 2025-11-15 16:59 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ramsay Jones, Ezekiel Newren via GitGitGadget, git,
	Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren

On Sat, Nov 15, 2025 at 11:42 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Ramsay Jones <ramsay@ramsayjones.plus.com> writes:
>
> >> This was discussed briefly in replies to v2’s 2/10, where
> >> Ezekiel said that DEVELOPER=1 warned about sign issues whether
> >> char was compared to int or unsigned. [From mobile I cannot
> >> reliably paste the message ID or link and preserve a plain-text
> >> email, apologies for the oblique reference.]
> >
> > Err... sorry, but I don't see how this comment relates to my
> > email. puzzled! ;)
>
> Me neither, but I suspect it may mostly use of non-word "signless"
> that is the issue.  It is understandable for the -Wsign-compare
> warning (especially given that it very often complains about
> perfectly good pieces of code) to complain when you compare a "char"
> with a signed integer, saying "on a platform where 'char' is
> unsigned, you would be comparing signed and unsigned values with
> this expression", and at the same time complain when you compare a
> "char" with an unsigned integer, saying "on a platform where 'char'
> is signed...".

Agreed, and I suspect this is roughly the implementation.

My point was that Ezekiel seemed to justify (?) the use of "signless"
by pointing to those warnings (I personally am on the fence for how to
treat the combination of facts, but it seems useful to consider that
char is not easily comparable with integers of various signedness).

-- 
D. Ben Knoble


* Re: [PATCH v4 01/10] doc: define unambiguous type mappings across C and Rust
  2025-11-15 16:59                 ` D. Ben Knoble
@ 2025-11-15 20:03                   ` Junio C Hamano
  0 siblings, 0 replies; 118+ messages in thread
From: Junio C Hamano @ 2025-11-15 20:03 UTC (permalink / raw)
  To: D. Ben Knoble
  Cc: Ramsay Jones, Ezekiel Newren via GitGitGadget, git,
	Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren

"D. Ben Knoble" <ben.knoble@gmail.com> writes:

>> Me neither, but I suspect it may mostly use of non-word "signless"
>> that is the issue.
> ...
> Agreed, and I suspect this is roughly the implementation.
>
> My point was that Ezekiel seemed to justify (?) the use of "signless"
> by pointing to those warnings (I personally am on the fence for how to
> treat the combination of facts, but it seems useful to consider that
> char is not easily comparable with integers of various signedness).

Agreed.  I think your point matches my suspicion that the use of the
non-word "signless" was what Ramsay reacted to.  The 'char' with the
implementation defined signedness is making -Wsign-compare even more
quirky than it already is.


* Re: [PATCH v4 01/10] doc: define unambiguous type mappings across C and Rust
  2025-11-15 16:42               ` Junio C Hamano
  2025-11-15 16:59                 ` D. Ben Knoble
@ 2025-11-17  1:20                 ` Junio C Hamano
  2025-11-17  2:08                   ` Ramsay Jones
  1 sibling, 1 reply; 118+ messages in thread
From: Junio C Hamano @ 2025-11-17  1:20 UTC (permalink / raw)
  To: Ramsay Jones, Ezekiel Newren
  Cc: Ben Knoble, Ezekiel Newren via GitGitGadget, git,
	Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek

Junio C Hamano <gitster@pobox.com> writes:

> Me neither, but I suspect it may mostly use of non-word "signless"
> that is the issue.

So, the patch text that claims C's "char" is "signless" still needs
to be updated, I think.  The problematic paragraph (with a bit of
rewrapping) reads like this:

    C comparison problem: While the sign of `char` is implementation
    defined, it's also signless (neither signed nor unsigned). When
    building with `make DEVELOPER=1` it will complain about a
    "differ in signedness" when `char` is compared with `uint8_t` or
    `int8_t`.

Perhaps

    The C language leaves the signedness of `char` implementation
    defined.  Because our developer build enables -Wsign-compare,
    comparison of a value of `char` type with either signed or
    unsigned integers will trigger warnings from the compiler.
    Avoiding `char` of implementation defined signedness helps us
    be a bit more explicit.

or something is sufficient?



* Re: [PATCH v4 01/10] doc: define unambiguous type mappings across C and Rust
  2025-11-17  1:20                 ` Junio C Hamano
@ 2025-11-17  2:08                   ` Ramsay Jones
  0 siblings, 0 replies; 118+ messages in thread
From: Ramsay Jones @ 2025-11-17  2:08 UTC (permalink / raw)
  To: Junio C Hamano, Ezekiel Newren
  Cc: Ben Knoble, Ezekiel Newren via GitGitGadget, git,
	Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek



On 17/11/2025 1:20 am, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
> 
>> Me neither, but I suspect it may mostly use of non-word "signless"
>> that is the issue.
> 
> So, the patch text that claims C's "char" is "signless" still needs
> to be updated, I think.  The problematic paragraph (with a bit of
> rewrapping) reads like this:

Sorry for being AFK for over a day! :) I didn't think this would
generate so much traffic.

>     C comparison problem: While the sign of `char` is implementation
>     defined, it's also signless (neither signed nor unsigned). When
>     building with `make DEVELOPER=1` it will complain about a
>     "differ in signedness" when `char` is compared with `uint8_t` or
>     `int8_t`.

Yes, the 'signless' nonsense is what 'triggered' me. ;)

> 
> Perhaps
> 
>     The C language leaves the signedness of `char` implementation
>     defined.  Because our developer build enables -Wsign-compare,
>     comparison of a value of `char` type with either signed or
>     unsigned integers will trigger warnings from the compiler.

s/will/may/ - it depends!

>     Avoiding `char` of implementation defined signedness helps us
>     being a bit more explicit.
> 
> or something is sufficient?

Yes, this looks good to me (but then I am not particularly good at
word-smithing).

Thanks.

ATB,
Ramsay Jones





* [PATCH v4 02/10] xdiff: use ptrdiff_t for dstart/dend
  2025-11-14 22:36     ` [PATCH v4 " Ezekiel Newren via GitGitGadget
  2025-11-14 22:36       ` [PATCH v4 01/10] doc: define unambiguous type mappings across C and Rust Ezekiel Newren via GitGitGadget
@ 2025-11-14 22:36       ` Ezekiel Newren via GitGitGadget
  2025-11-14 22:36       ` [PATCH v4 03/10] xdiff: make xrecord_t.ptr a uint8_t instead of char Ezekiel Newren via GitGitGadget
                         ` (8 subsequent siblings)
  10 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-14 22:36 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

ptrdiff_t is appropriate for dstart and dend because they both describe
positive or negative offsets relative to a pointer.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
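A minimal sketch of why a signed offset type suits this kind of field,
assuming two hypothetical helpers that are not part of xdiff:
subtracting two pointers into the same array yields a ptrdiff_t, and
the result can be negative.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: distance in elements from `base` to `p`.
 * Both pointers must point into (or one past) the same array.
 */
static ptrdiff_t offset_between(const int *base, const int *p)
{
	return p - base;
}

/* Hypothetical helper: step from `base` by a possibly negative offset. */
static const int *at_offset(const int *base, ptrdiff_t off)
{
	return base + off;
}
```

With `int a[4]`, `offset_between(&a[3], &a[1])` is -2, a value an
unsigned type such as size_t cannot represent.
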
 xdiff/xtypes.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index f145abba3e..7a2d429ec5 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -47,7 +47,7 @@ typedef struct s_xrecord {
 typedef struct s_xdfile {
 	xrecord_t *recs;
 	long nrec;
-	long dstart, dend;
+	ptrdiff_t dstart, dend;
 	bool *changed;
 	long *rindex;
 	long nreff;
-- 
gitgitgadget



* [PATCH v4 03/10] xdiff: make xrecord_t.ptr a uint8_t instead of char
  2025-11-14 22:36     ` [PATCH v4 " Ezekiel Newren via GitGitGadget
  2025-11-14 22:36       ` [PATCH v4 01/10] doc: define unambiguous type mappings across C and Rust Ezekiel Newren via GitGitGadget
  2025-11-14 22:36       ` [PATCH v4 02/10] xdiff: use ptrdiff_t for dstart/dend Ezekiel Newren via GitGitGadget
@ 2025-11-14 22:36       ` Ezekiel Newren via GitGitGadget
  2025-11-15  8:26         ` Junio C Hamano
  2025-11-14 22:36       ` [PATCH v4 04/10] xdiff: use size_t for xrecord_t.size Ezekiel Newren via GitGitGadget
                         ` (7 subsequent siblings)
  10 siblings, 1 reply; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-14 22:36 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

Make xrecord_t.ptr uint8_t because it's referring to bytes in memory.

In order to avoid a refactor avalanche, many uses of this field were
cast to char* or similar. One exception is in get_indent() where the
local variable `char c` was changed to `uint8_t c`.

Places where casting was unnecessary:
xemit.c:156
xmerge.c:124
xmerge.c:127
xmerge.c:164
xmerge.c:169
xmerge.c:172
xmerge.c:178

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
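The cast-at-the-boundary approach described in the commit message can
be sketched as follows. `struct record`, `count_spaces()` and the other
names are hypothetical stand-ins, not xdiff code: the record stores
bytes as `uint8_t const *`, and the cast to `const char *` happens only
at the call into a char-based API that has not been converted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a record whose data is raw bytes. */
struct record {
	uint8_t const *ptr;
	size_t size;
};

/* Hypothetical char-based API that has not been converted yet. */
static size_t count_spaces(const char *s, size_t len)
{
	size_t n = 0;

	for (size_t i = 0; i < len; i++)
		if (s[i] == ' ')
			n++;
	return n;
}

/* Build a record from a NUL-terminated string (for illustration only). */
static struct record record_from_cstr(const char *s)
{
	struct record rec = { (uint8_t const *)s, strlen(s) };
	return rec;
}

/* The cast lives at the boundary; the struct itself stays byte-typed. */
static size_t record_count_spaces(const struct record *rec)
{
	return count_spaces((const char *)rec->ptr, rec->size);
}
```

Keeping the casts at call sites limits the change to one field, at the
cost of some noise until the callees are converted as well.
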
 xdiff/xdiffi.c    |  8 ++++----
 xdiff/xemit.c     |  6 +++---
 xdiff/xmerge.c    | 14 +++++++-------
 xdiff/xpatience.c |  2 +-
 xdiff/xprepare.c  |  6 +++---
 xdiff/xtypes.h    |  2 +-
 xdiff/xutils.c    |  4 ++--
 7 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 6f3998ee54..411a8aa69f 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -407,7 +407,7 @@ static int get_indent(xrecord_t *rec)
 	int ret = 0;
 
 	for (i = 0; i < rec->size; i++) {
-		char c = rec->ptr[i];
+		uint8_t c = rec->ptr[i];
 
 		if (!XDL_ISSPACE(c))
 			return ret;
@@ -993,11 +993,11 @@ static void xdl_mark_ignorable_lines(xdchange_t *xscr, xdfenv_t *xe, long flags)
 
 		rec = &xe->xdf1.recs[xch->i1];
 		for (i = 0; i < xch->chg1 && ignore; i++)
-			ignore = xdl_blankline(rec[i].ptr, rec[i].size, flags);
+			ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
 
 		rec = &xe->xdf2.recs[xch->i2];
 		for (i = 0; i < xch->chg2 && ignore; i++)
-			ignore = xdl_blankline(rec[i].ptr, rec[i].size, flags);
+			ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
 
 		xch->ignore = ignore;
 	}
@@ -1008,7 +1008,7 @@ static int record_matches_regex(xrecord_t *rec, xpparam_t const *xpp) {
 	size_t i;
 
 	for (i = 0; i < xpp->ignore_regex_nr; i++)
-		if (!regexec_buf(xpp->ignore_regex[i], rec->ptr, rec->size, 1,
+		if (!regexec_buf(xpp->ignore_regex[i], (const char *)rec->ptr, rec->size, 1,
 				 &regmatch, 0))
 			return 1;
 
diff --git a/xdiff/xemit.c b/xdiff/xemit.c
index b2f1f30cd3..ead930088a 100644
--- a/xdiff/xemit.c
+++ b/xdiff/xemit.c
@@ -27,7 +27,7 @@ static int xdl_emit_record(xdfile_t *xdf, long ri, char const *pre, xdemitcb_t *
 {
 	xrecord_t *rec = &xdf->recs[ri];
 
-	if (xdl_emit_diffrec(rec->ptr, rec->size, pre, strlen(pre), ecb) < 0)
+	if (xdl_emit_diffrec((char const *)rec->ptr, rec->size, pre, strlen(pre), ecb) < 0)
 		return -1;
 
 	return 0;
@@ -113,8 +113,8 @@ static long match_func_rec(xdfile_t *xdf, xdemitconf_t const *xecfg, long ri,
 	xrecord_t *rec = &xdf->recs[ri];
 
 	if (!xecfg->find_func)
-		return def_ff(rec->ptr, rec->size, buf, sz);
-	return xecfg->find_func(rec->ptr, rec->size, buf, sz, xecfg->find_func_priv);
+		return def_ff((const char *)rec->ptr, rec->size, buf, sz);
+	return xecfg->find_func((const char *)rec->ptr, rec->size, buf, sz, xecfg->find_func_priv);
 }
 
 static int is_func_rec(xdfile_t *xdf, xdemitconf_t const *xecfg, long ri)
diff --git a/xdiff/xmerge.c b/xdiff/xmerge.c
index fd600cbb5d..75cb3e76a2 100644
--- a/xdiff/xmerge.c
+++ b/xdiff/xmerge.c
@@ -101,8 +101,8 @@ static int xdl_merge_cmp_lines(xdfenv_t *xe1, int i1, xdfenv_t *xe2, int i2,
 	xrecord_t *rec2 = xe2->xdf2.recs + i2;
 
 	for (i = 0; i < line_count; i++) {
-		int result = xdl_recmatch(rec1[i].ptr, rec1[i].size,
-			rec2[i].ptr, rec2[i].size, flags);
+		int result = xdl_recmatch((const char *)rec1[i].ptr, rec1[i].size,
+			(const char *)rec2[i].ptr, rec2[i].size, flags);
 		if (!result)
 			return -1;
 	}
@@ -324,8 +324,8 @@ static int xdl_fill_merge_buffer(xdfenv_t *xe1, const char *name1,
 
 static int recmatch(xrecord_t *rec1, xrecord_t *rec2, unsigned long flags)
 {
-	return xdl_recmatch(rec1->ptr, rec1->size,
-			    rec2->ptr, rec2->size, flags);
+	return xdl_recmatch((const char *)rec1->ptr, rec1->size,
+			    (const char *)rec2->ptr, rec2->size, flags);
 }
 
 /*
@@ -382,10 +382,10 @@ static int xdl_refine_conflicts(xdfenv_t *xe1, xdfenv_t *xe2, xdmerge_t *m,
 		 * we have a very simple mmfile structure.
 		 */
 		t1.ptr = (char *)xe1->xdf2.recs[m->i1].ptr;
-		t1.size = xe1->xdf2.recs[m->i1 + m->chg1 - 1].ptr
+		t1.size = (char *)xe1->xdf2.recs[m->i1 + m->chg1 - 1].ptr
 			+ xe1->xdf2.recs[m->i1 + m->chg1 - 1].size - t1.ptr;
 		t2.ptr = (char *)xe2->xdf2.recs[m->i2].ptr;
-		t2.size = xe2->xdf2.recs[m->i2 + m->chg2 - 1].ptr
+		t2.size = (char *)xe2->xdf2.recs[m->i2 + m->chg2 - 1].ptr
 			+ xe2->xdf2.recs[m->i2 + m->chg2 - 1].size - t2.ptr;
 		if (xdl_do_diff(&t1, &t2, xpp, &xe) < 0)
 			return -1;
@@ -440,7 +440,7 @@ static int line_contains_alnum(const char *ptr, long size)
 static int lines_contain_alnum(xdfenv_t *xe, int i, int chg)
 {
 	for (; chg; chg--, i++)
-		if (line_contains_alnum(xe->xdf2.recs[i].ptr,
+		if (line_contains_alnum((const char *)xe->xdf2.recs[i].ptr,
 				xe->xdf2.recs[i].size))
 			return 1;
 	return 0;
diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
index 669b653580..bb61354f22 100644
--- a/xdiff/xpatience.c
+++ b/xdiff/xpatience.c
@@ -121,7 +121,7 @@ static void insert_record(xpparam_t const *xpp, int line, struct hashmap *map,
 		return;
 	map->entries[index].line1 = line;
 	map->entries[index].hash = record->ha;
-	map->entries[index].anchor = is_anchor(xpp, map->env->xdf1.recs[line - 1].ptr);
+	map->entries[index].anchor = is_anchor(xpp, (const char *)map->env->xdf1.recs[line - 1].ptr);
 	if (!map->first)
 		map->first = map->entries + index;
 	if (map->last) {
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 192334f1b7..4c56467076 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -99,8 +99,8 @@ static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
 	hi = (long) XDL_HASHLONG(rec->ha, cf->hbits);
 	for (rcrec = cf->rchash[hi]; rcrec; rcrec = rcrec->next)
 		if (rcrec->rec.ha == rec->ha &&
-				xdl_recmatch(rcrec->rec.ptr, rcrec->rec.size,
-					rec->ptr, rec->size, cf->flags))
+				xdl_recmatch((const char *)rcrec->rec.ptr, rcrec->rec.size,
+					(const char *)rec->ptr, rec->size, cf->flags))
 			break;
 
 	if (!rcrec) {
@@ -156,7 +156,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 			if (XDL_ALLOC_GROW(xdf->recs, xdf->nrec + 1, narec))
 				goto abort;
 			crec = &xdf->recs[xdf->nrec++];
-			crec->ptr = prev;
+			crec->ptr = (uint8_t const *)prev;
 			crec->size = (long) (cur - prev);
 			crec->ha = hav;
 			if (xdl_classify_record(pass, cf, crec) < 0)
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index 7a2d429ec5..69727fb299 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -39,7 +39,7 @@ typedef struct s_chastore {
 } chastore_t;
 
 typedef struct s_xrecord {
-	char const *ptr;
+	uint8_t const *ptr;
 	long size;
 	unsigned long ha;
 } xrecord_t;
diff --git a/xdiff/xutils.c b/xdiff/xutils.c
index 447e66c719..7be063bfb6 100644
--- a/xdiff/xutils.c
+++ b/xdiff/xutils.c
@@ -465,10 +465,10 @@ int xdl_fall_back_diff(xdfenv_t *diff_env, xpparam_t const *xpp,
 	xdfenv_t env;
 
 	subfile1.ptr = (char *)diff_env->xdf1.recs[line1 - 1].ptr;
-	subfile1.size = diff_env->xdf1.recs[line1 + count1 - 2].ptr +
+	subfile1.size = (char *)diff_env->xdf1.recs[line1 + count1 - 2].ptr +
 		diff_env->xdf1.recs[line1 + count1 - 2].size - subfile1.ptr;
 	subfile2.ptr = (char *)diff_env->xdf2.recs[line2 - 1].ptr;
-	subfile2.size = diff_env->xdf2.recs[line2 + count2 - 2].ptr +
+	subfile2.size = (char *)diff_env->xdf2.recs[line2 + count2 - 2].ptr +
 		diff_env->xdf2.recs[line2 + count2 - 2].size - subfile2.ptr;
 	if (xdl_do_diff(&subfile1, &subfile2, xpp, &env) < 0)
 		return -1;
-- 
gitgitgadget



* Re: [PATCH v4 03/10] xdiff: make xrecord_t.ptr a uint8_t instead of char
  2025-11-14 22:36       ` [PATCH v4 03/10] xdiff: make xrecord_t.ptr a uint8_t instead of char Ezekiel Newren via GitGitGadget
@ 2025-11-15  8:26         ` Junio C Hamano
  2025-11-18 20:55           ` Ezekiel Newren
  0 siblings, 1 reply; 118+ messages in thread
From: Junio C Hamano @ 2025-11-15  8:26 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget
  Cc: git, Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren

"Ezekiel Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> In order to avoid a refactor avalanche, many uses of this field were
> cast to char* or similar. One exception is in get_indent() where the
> local variable `char c` was changed to `uint8_t c`.

I actually think keeping "char c" as in the original is a lot more
logical for that particular case, as the existing uses of that local
variable are _all_ about C's 'char', and not about a very short
unsigned integer.  The variable is compared with C's character
constants like ' ' (space) and '\t' (horizontal tab), or is
given to the XDL_ISSPACE() macro, which is also about C's characters.

But because it is so minor a thing, I do not think that it deserves
a reroll on its own.  Just in case there are other things that
need to change and the series needs a reroll, here is the only
change required for this.

 xdiff/xdiffi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git c/xdiff/xdiffi.c w/xdiff/xdiffi.c
index 8eb664be3e..4376f943db 100644
--- c/xdiff/xdiffi.c
+++ w/xdiff/xdiffi.c
@@ -406,7 +406,7 @@ static int get_indent(xrecord_t *rec)
 	int ret = 0;

 	for (size_t i = 0; i < rec->size; i++) {
-		uint8_t c = rec->ptr[i];
+		char c = (char) rec->ptr[i];

 		if (!XDL_ISSPACE(c))
 			return ret;


* Re: [PATCH v4 03/10] xdiff: make xrecord_t.ptr a uint8_t instead of char
  2025-11-15  8:26         ` Junio C Hamano
@ 2025-11-18 20:55           ` Ezekiel Newren
  0 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren @ 2025-11-18 20:55 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ezekiel Newren via GitGitGadget, git, Kristoffer Haugsbakk,
	Patrick Steinhardt, Phillip Wood, Chris Torek

On Sat, Nov 15, 2025 at 1:26 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Ezekiel Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > In order to avoid a refactor avalanche, many uses of this field were
> > cast to char* or similar. One exception is in get_indent() where the
> > local variable `char c` was changed to `uint8_t c`.
>
> I actually think keeping "char c" as in the original is a lot more
> logical for that particular case, as the existing uses of that local
> variable are _all_ about C's 'char', and not about a very short
> unsigned integer.  The variable is compared with C's character
> constants like ' ' (space) and '\t' (horizontal tab), or is
> given to the XDL_ISSPACE() macro, which is also about C's characters.
>
> But because it is so minor a thing, I do not think that it deserves
> a reroll on its own.  Just in case there are other things that
> need to change and the series needs a reroll, here is the only
> change required for this.
>
>
>  xdiff/xdiffi.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git c/xdiff/xdiffi.c w/xdiff/xdiffi.c
> index 8eb664be3e..4376f943db 100644
> --- c/xdiff/xdiffi.c
> +++ w/xdiff/xdiffi.c
> @@ -406,7 +406,7 @@ static int get_indent(xrecord_t *rec)
>         int ret = 0;
>
>         for (size_t i = 0; i < rec->size; i++) {
> -               uint8_t c = rec->ptr[i];
> +               char c = (char) rec->ptr[i];
>
>                 if (!XDL_ISSPACE(c))
>                         return ret;

I have v5 ready to go, but there seems to be a problem with
gitgitgadget. Once that's resolved I'll post the new version.


* [PATCH v4 04/10] xdiff: use size_t for xrecord_t.size
  2025-11-14 22:36     ` [PATCH v4 " Ezekiel Newren via GitGitGadget
                         ` (2 preceding siblings ...)
  2025-11-14 22:36       ` [PATCH v4 03/10] xdiff: make xrecord_t.ptr a uint8_t instead of char Ezekiel Newren via GitGitGadget
@ 2025-11-14 22:36       ` Ezekiel Newren via GitGitGadget
  2025-11-14 22:36       ` [PATCH v4 05/10] xdiff: use unambiguous types in xdl_hash_record() Ezekiel Newren via GitGitGadget
                         ` (6 subsequent siblings)
  10 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-14 22:36 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

size_t is the appropriate type because size is describing the number of
elements, bytes in this case, in memory.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
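A minimal sketch of two things to watch when a size becomes unsigned,
using hypothetical helpers that are not part of xdiff: a pointer
difference is only safe to store in a size_t when it is known to be
non-negative, and `size - 1` wraps around when size is 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: length in bytes of the range [start, end).
 * The caller guarantees start <= end, so the difference is
 * non-negative and fits in size_t.
 */
static size_t range_size(uint8_t const *start, uint8_t const *end)
{
	assert(start <= end);
	return (size_t)(end - start);
}

/*
 * Hypothetical helper: does the buffer end in a newline?  With an
 * unsigned size, `size - 1` wraps around when size is 0, so the
 * empty case is checked first.
 */
static bool ends_with_newline(uint8_t const *buf, size_t size)
{
	return size > 0 && buf[size - 1] == '\n';
}
```

The same ordering (test for zero, then index with `size - 1`) is what
keeps an "is the last byte a newline" check correct for empty input.
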
 xdiff/xdiffi.c   |  7 +++----
 xdiff/xemit.c    |  8 ++++----
 xdiff/xmerge.c   | 16 ++++++++--------
 xdiff/xprepare.c |  6 +++---
 xdiff/xtypes.h   |  2 +-
 5 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 411a8aa69f..edd05466df 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -403,10 +403,9 @@ static int recs_match(xrecord_t *rec1, xrecord_t *rec2)
  */
 static int get_indent(xrecord_t *rec)
 {
-	long i;
 	int ret = 0;
 
-	for (i = 0; i < rec->size; i++) {
+	for (size_t i = 0; i < rec->size; i++) {
 		uint8_t c = rec->ptr[i];
 
 		if (!XDL_ISSPACE(c))
@@ -993,11 +992,11 @@ static void xdl_mark_ignorable_lines(xdchange_t *xscr, xdfenv_t *xe, long flags)
 
 		rec = &xe->xdf1.recs[xch->i1];
 		for (i = 0; i < xch->chg1 && ignore; i++)
-			ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
+			ignore = xdl_blankline((const char *)rec[i].ptr, (long)rec[i].size, flags);
 
 		rec = &xe->xdf2.recs[xch->i2];
 		for (i = 0; i < xch->chg2 && ignore; i++)
-			ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
+			ignore = xdl_blankline((const char *)rec[i].ptr, (long)rec[i].size, flags);
 
 		xch->ignore = ignore;
 	}
diff --git a/xdiff/xemit.c b/xdiff/xemit.c
index ead930088a..2f8007753c 100644
--- a/xdiff/xemit.c
+++ b/xdiff/xemit.c
@@ -27,7 +27,7 @@ static int xdl_emit_record(xdfile_t *xdf, long ri, char const *pre, xdemitcb_t *
 {
 	xrecord_t *rec = &xdf->recs[ri];
 
-	if (xdl_emit_diffrec((char const *)rec->ptr, rec->size, pre, strlen(pre), ecb) < 0)
+	if (xdl_emit_diffrec((char const *)rec->ptr, (long)rec->size, pre, strlen(pre), ecb) < 0)
 		return -1;
 
 	return 0;
@@ -113,8 +113,8 @@ static long match_func_rec(xdfile_t *xdf, xdemitconf_t const *xecfg, long ri,
 	xrecord_t *rec = &xdf->recs[ri];
 
 	if (!xecfg->find_func)
-		return def_ff((const char *)rec->ptr, rec->size, buf, sz);
-	return xecfg->find_func((const char *)rec->ptr, rec->size, buf, sz, xecfg->find_func_priv);
+		return def_ff((const char *)rec->ptr, (long)rec->size, buf, sz);
+	return xecfg->find_func((const char *)rec->ptr, (long)rec->size, buf, sz, xecfg->find_func_priv);
 }
 
 static int is_func_rec(xdfile_t *xdf, xdemitconf_t const *xecfg, long ri)
@@ -151,7 +151,7 @@ static long get_func_line(xdfenv_t *xe, xdemitconf_t const *xecfg,
 static int is_empty_rec(xdfile_t *xdf, long ri)
 {
 	xrecord_t *rec = &xdf->recs[ri];
-	long i = 0;
+	size_t i = 0;
 
 	for (; i < rec->size && XDL_ISSPACE(rec->ptr[i]); i++);
 
diff --git a/xdiff/xmerge.c b/xdiff/xmerge.c
index 75cb3e76a2..0dd4558a32 100644
--- a/xdiff/xmerge.c
+++ b/xdiff/xmerge.c
@@ -101,8 +101,8 @@ static int xdl_merge_cmp_lines(xdfenv_t *xe1, int i1, xdfenv_t *xe2, int i2,
 	xrecord_t *rec2 = xe2->xdf2.recs + i2;
 
 	for (i = 0; i < line_count; i++) {
-		int result = xdl_recmatch((const char *)rec1[i].ptr, rec1[i].size,
-			(const char *)rec2[i].ptr, rec2[i].size, flags);
+		int result = xdl_recmatch((const char *)rec1[i].ptr, (long)rec1[i].size,
+			(const char *)rec2[i].ptr, (long)rec2[i].size, flags);
 		if (!result)
 			return -1;
 	}
@@ -119,11 +119,11 @@ static int xdl_recs_copy_0(int use_orig, xdfenv_t *xe, int i, int count, int nee
 	if (count < 1)
 		return 0;
 
-	for (i = 0; i < count; size += recs[i++].size)
+	for (i = 0; i < count; size += (int)recs[i++].size)
 		if (dest)
 			memcpy(dest + size, recs[i].ptr, recs[i].size);
 	if (add_nl) {
-		i = recs[count - 1].size;
+		i = (int)recs[count - 1].size;
 		if (i == 0 || recs[count - 1].ptr[i - 1] != '\n') {
 			if (needs_cr) {
 				if (dest)
@@ -156,7 +156,7 @@ static int xdl_orig_copy(xdfenv_t *xe, int i, int count, int needs_cr, int add_n
  */
 static int is_eol_crlf(xdfile_t *file, int i)
 {
-	long size;
+	size_t size;
 
 	if (i < file->nrec - 1)
 		/* All lines before the last *must* end in LF */
@@ -324,8 +324,8 @@ static int xdl_fill_merge_buffer(xdfenv_t *xe1, const char *name1,
 
 static int recmatch(xrecord_t *rec1, xrecord_t *rec2, unsigned long flags)
 {
-	return xdl_recmatch((const char *)rec1->ptr, rec1->size,
-			    (const char *)rec2->ptr, rec2->size, flags);
+	return xdl_recmatch((const char *)rec1->ptr, (long)rec1->size,
+			    (const char *)rec2->ptr, (long)rec2->size, flags);
 }
 
 /*
@@ -441,7 +441,7 @@ static int lines_contain_alnum(xdfenv_t *xe, int i, int chg)
 {
 	for (; chg; chg--, i++)
 		if (line_contains_alnum((const char *)xe->xdf2.recs[i].ptr,
-				xe->xdf2.recs[i].size))
+				(long)xe->xdf2.recs[i].size))
 			return 1;
 	return 0;
 }
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 4c56467076..b3219aed3e 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -99,8 +99,8 @@ static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
 	hi = (long) XDL_HASHLONG(rec->ha, cf->hbits);
 	for (rcrec = cf->rchash[hi]; rcrec; rcrec = rcrec->next)
 		if (rcrec->rec.ha == rec->ha &&
-				xdl_recmatch((const char *)rcrec->rec.ptr, rcrec->rec.size,
-					(const char *)rec->ptr, rec->size, cf->flags))
+				xdl_recmatch((const char *)rcrec->rec.ptr, (long)rcrec->rec.size,
+					(const char *)rec->ptr, (long)rec->size, cf->flags))
 			break;
 
 	if (!rcrec) {
@@ -157,7 +157,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 				goto abort;
 			crec = &xdf->recs[xdf->nrec++];
 			crec->ptr = (uint8_t const *)prev;
-			crec->size = (long) (cur - prev);
+			crec->size = cur - prev;
 			crec->ha = hav;
 			if (xdl_classify_record(pass, cf, crec) < 0)
 				goto abort;
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index 69727fb299..354349b523 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -40,7 +40,7 @@ typedef struct s_chastore {
 
 typedef struct s_xrecord {
 	uint8_t const *ptr;
-	long size;
+	size_t size;
 	unsigned long ha;
 } xrecord_t;
 
-- 
gitgitgadget



* [PATCH v4 05/10] xdiff: use unambiguous types in xdl_hash_record()
  2025-11-14 22:36     ` [PATCH v4 " Ezekiel Newren via GitGitGadget
                         ` (3 preceding siblings ...)
  2025-11-14 22:36       ` [PATCH v4 04/10] xdiff: use size_t for xrecord_t.size Ezekiel Newren via GitGitGadget
@ 2025-11-14 22:36       ` Ezekiel Newren via GitGitGadget
  2025-11-14 22:36       ` [PATCH v4 06/10] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash Ezekiel Newren via GitGitGadget
                         ` (5 subsequent siblings)
  10 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-14 22:36 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

Convert the function signature and body to use unambiguous types. char
is changed to uint8_t because this function processes bytes in memory.
unsigned long is changed to uint64_t so that the hash output is
consistent across platforms. `flags` is changed from long to uint64_t
to ensure the high order bits are not dropped on platforms that treat
long as 32 bits.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
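A minimal sketch of a djb2-style line hash over raw bytes with fixed
width types, modeled on the basic loop visible in the diff below. The
name `hash_line()` is hypothetical, and the sketch covers neither the
whitespace flags nor the optimized verbatim loop.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch of a djb2-style line hash over raw bytes.
 * Hashes up to (not including) the first '\n' or `top`, then
 * advances *data past the newline, if there was one.
 */
static uint64_t hash_line(uint8_t const **data, uint8_t const *top)
{
	uint64_t ha = 5381;
	uint8_t const *ptr = *data;

	for (; ptr < top && *ptr != '\n'; ptr++) {
		ha += (ha << 5);	/* same as ha *= 33 */
		ha ^= (uint64_t)*ptr;
	}
	*data = ptr < top ? ptr + 1 : ptr;

	return ha;
}
```

Because every operand is a uint64_t or a uint8_t, the arithmetic is the
same whether the platform's long is 32 or 64 bits wide.
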
 xdiff-interface.c |  2 +-
 xdiff/xprepare.c  |  6 +++---
 xdiff/xutils.c    | 28 ++++++++++++++--------------
 xdiff/xutils.h    |  6 +++---
 4 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/xdiff-interface.c b/xdiff-interface.c
index 4971f722b3..1a35556380 100644
--- a/xdiff-interface.c
+++ b/xdiff-interface.c
@@ -300,7 +300,7 @@ void xdiff_clear_find_func(xdemitconf_t *xecfg)
 
 unsigned long xdiff_hash_string(const char *s, size_t len, long flags)
 {
-	return xdl_hash_record(&s, s + len, flags);
+	return xdl_hash_record((uint8_t const**)&s, (uint8_t const*)s + len, flags);
 }
 
 int xdiff_compare_lines(const char *l1, long s1,
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index b3219aed3e..85e56021da 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -137,8 +137,8 @@ static void xdl_free_ctx(xdfile_t *xdf)
 static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_t const *xpp,
 			   xdlclassifier_t *cf, xdfile_t *xdf) {
 	long bsize;
-	unsigned long hav;
-	char const *blk, *cur, *top, *prev;
+	uint64_t hav;
+	uint8_t const *blk, *cur, *top, *prev;
 	xrecord_t *crec;
 
 	xdf->rindex = NULL;
@@ -156,7 +156,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 			if (XDL_ALLOC_GROW(xdf->recs, xdf->nrec + 1, narec))
 				goto abort;
 			crec = &xdf->recs[xdf->nrec++];
-			crec->ptr = (uint8_t const *)prev;
+			crec->ptr = prev;
 			crec->size = cur - prev;
 			crec->ha = hav;
 			if (xdl_classify_record(pass, cf, crec) < 0)
diff --git a/xdiff/xutils.c b/xdiff/xutils.c
index 7be063bfb6..77ee1ad9c8 100644
--- a/xdiff/xutils.c
+++ b/xdiff/xutils.c
@@ -249,11 +249,11 @@ int xdl_recmatch(const char *l1, long s1, const char *l2, long s2, long flags)
 	return 1;
 }
 
-unsigned long xdl_hash_record_with_whitespace(char const **data,
-		char const *top, long flags) {
-	unsigned long ha = 5381;
-	char const *ptr = *data;
-	int cr_at_eol_only = (flags & XDF_WHITESPACE_FLAGS) == XDF_IGNORE_CR_AT_EOL;
+uint64_t xdl_hash_record_with_whitespace(uint8_t const **data,
+		uint8_t const *top, uint64_t flags) {
+	uint64_t ha = 5381;
+	uint8_t const *ptr = *data;
+	bool cr_at_eol_only = (flags & XDF_WHITESPACE_FLAGS) == XDF_IGNORE_CR_AT_EOL;
 
 	for (; ptr < top && *ptr != '\n'; ptr++) {
 		if (cr_at_eol_only) {
@@ -263,8 +263,8 @@ unsigned long xdl_hash_record_with_whitespace(char const **data,
 				continue;
 		}
 		else if (XDL_ISSPACE(*ptr)) {
-			const char *ptr2 = ptr;
-			int at_eol;
+			const uint8_t *ptr2 = ptr;
+			bool at_eol;
 			while (ptr + 1 < top && XDL_ISSPACE(ptr[1])
 					&& ptr[1] != '\n')
 				ptr++;
@@ -274,20 +274,20 @@ unsigned long xdl_hash_record_with_whitespace(char const **data,
 			else if (flags & XDF_IGNORE_WHITESPACE_CHANGE
 				 && !at_eol) {
 				ha += (ha << 5);
-				ha ^= (unsigned long) ' ';
+				ha ^= (uint64_t) ' ';
 			}
 			else if (flags & XDF_IGNORE_WHITESPACE_AT_EOL
 				 && !at_eol) {
 				while (ptr2 != ptr + 1) {
 					ha += (ha << 5);
-					ha ^= (unsigned long) *ptr2;
+					ha ^= (uint64_t) *ptr2;
 					ptr2++;
 				}
 			}
 			continue;
 		}
 		ha += (ha << 5);
-		ha ^= (unsigned long) *ptr;
+		ha ^= (uint64_t) *ptr;
 	}
 	*data = ptr < top ? ptr + 1: ptr;
 
@@ -304,9 +304,9 @@ unsigned long xdl_hash_record_with_whitespace(char const **data,
 #define REASSOC_FENCE(x, y)
 #endif
 
-unsigned long xdl_hash_record_verbatim(char const **data, char const *top) {
-	unsigned long ha = 5381, c0, c1;
-	char const *ptr = *data;
+uint64_t xdl_hash_record_verbatim(uint8_t const **data, uint8_t const *top) {
+	uint64_t ha = 5381, c0, c1;
+	uint8_t const *ptr = *data;
 #if 0
 	/*
 	 * The baseline form of the optimized loop below. This is the djb2
@@ -314,7 +314,7 @@ unsigned long xdl_hash_record_verbatim(char const **data, char const *top) {
 	 */
 	for (; ptr < top && *ptr != '\n'; ptr++) {
 		ha += (ha << 5);
-		ha += (unsigned long) *ptr;
+		ha += (uint64_t) *ptr;
 	}
 	*data = ptr < top ? ptr + 1: ptr;
 #else
diff --git a/xdiff/xutils.h b/xdiff/xutils.h
index 13f6831047..615b4a9d35 100644
--- a/xdiff/xutils.h
+++ b/xdiff/xutils.h
@@ -34,9 +34,9 @@ void *xdl_cha_alloc(chastore_t *cha);
 long xdl_guess_lines(mmfile_t *mf, long sample);
 int xdl_blankline(const char *line, long size, long flags);
 int xdl_recmatch(const char *l1, long s1, const char *l2, long s2, long flags);
-unsigned long xdl_hash_record_verbatim(char const **data, char const *top);
-unsigned long xdl_hash_record_with_whitespace(char const **data, char const *top, long flags);
-static inline unsigned long xdl_hash_record(char const **data, char const *top, long flags)
+uint64_t xdl_hash_record_verbatim(uint8_t const **data, uint8_t const *top);
+uint64_t xdl_hash_record_with_whitespace(uint8_t const **data, uint8_t const *top, uint64_t flags);
+static inline uint64_t xdl_hash_record(uint8_t const **data, uint8_t const *top, uint64_t flags)
 {
 	if (flags & XDF_WHITESPACE_FLAGS)
 		return xdl_hash_record_with_whitespace(data, top, flags);
-- 
gitgitgadget



* [PATCH v4 06/10] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash
  2025-11-14 22:36     ` [PATCH v4 " Ezekiel Newren via GitGitGadget
                         ` (4 preceding siblings ...)
  2025-11-14 22:36       ` [PATCH v4 05/10] xdiff: use unambiguous types in xdl_hash_record() Ezekiel Newren via GitGitGadget
@ 2025-11-14 22:36       ` Ezekiel Newren via GitGitGadget
  2025-11-14 22:36       ` [PATCH v4 07/10] xdiff: make xdfile_t.nrec a size_t instead of long Ezekiel Newren via GitGitGadget
                         ` (4 subsequent siblings)
  10 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-14 22:36 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

The ha field is serving two different purposes, which makes the code
harder to read. At first glance, it looks like many places assume
there could never be hash collisions between lines of the two input
files. In reality, line_hash is used together with xdl_recmatch() to
ensure correct comparisons of lines, even when collisions occur.

To make this clearer, the old ha field has been split:
  * line_hash: a straightforward hash of a line, independent of any
    external context. Its type is uint64_t, as it comes from a fixed
    width hash function.
  * minimal_perfect_hash: Not a new concept, but now a separate
    field. It comes from the classifier's general-purpose hash table,
    which assigns each line a unique and minimal hash across the two
    files. A size_t is used here because it's meant to be used to
    index an array. This also avoids ` as usize` casts on the Rust
    side when using it to index a slice.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
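The relationship between the two fields can be sketched as follows.
This is not the xdiff classifier: `struct line`, `struct classifier`
and `classify()` are hypothetical, and a linear scan stands in for the
classifier's hash table. The point is that line_hash only filters
candidates while a byte comparison decides equality, so each distinct
line ends up with its own dense id.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_CLASSES 64	/* arbitrary capacity for this sketch */

/* Hypothetical record: a line plus its two hash fields. */
struct line {
	uint8_t const *ptr;
	size_t size;
	uint64_t line_hash;		/* content hash, may collide */
	size_t minimal_perfect_hash;	/* dense id, unique per distinct line */
};

/* Hypothetical classifier: one representative per distinct line. */
struct classifier {
	struct line classes[MAX_CLASSES];
	size_t count;
};

/*
 * Assign `rec` the id of an equal line seen before, or the next unused
 * id.  The hash only filters candidates; memcmp() decides equality, so
 * collisions in line_hash cannot merge different lines.
 * Returns 0 on success, -1 when the table is full.
 */
static int classify(struct classifier *cf, struct line *rec)
{
	for (size_t i = 0; i < cf->count; i++) {
		struct line const *c = &cf->classes[i];

		if (c->line_hash == rec->line_hash &&
		    c->size == rec->size &&
		    !memcmp(c->ptr, rec->ptr, rec->size)) {
			rec->minimal_perfect_hash = i;
			return 0;
		}
	}
	if (cf->count == MAX_CLASSES)
		return -1;
	rec->minimal_perfect_hash = cf->count;
	cf->classes[cf->count++] = *rec;
	return 0;
}
```

Once ids are assigned this way, comparing two records reduces to
comparing their minimal_perfect_hash values, and the id can index an
array directly.
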
 xdiff/xdiffi.c     |  6 +++---
 xdiff/xhistogram.c |  4 ++--
 xdiff/xpatience.c  | 10 +++++-----
 xdiff/xprepare.c   | 18 +++++++++---------
 xdiff/xtypes.h     |  3 ++-
 5 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index edd05466df..436c34697d 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -22,9 +22,9 @@
 
 #include "xinclude.h"
 
-static unsigned long get_hash(xdfile_t *xdf, long index)
+static size_t get_hash(xdfile_t *xdf, long index)
 {
-	return xdf->recs[xdf->rindex[index]].ha;
+	return xdf->recs[xdf->rindex[index]].minimal_perfect_hash;
 }
 
 #define XDL_MAX_COST_MIN 256
@@ -385,7 +385,7 @@ static xdchange_t *xdl_add_change(xdchange_t *xscr, long i1, long i2, long chg1,
 
 static int recs_match(xrecord_t *rec1, xrecord_t *rec2)
 {
-	return (rec1->ha == rec2->ha);
+	return rec1->minimal_perfect_hash == rec2->minimal_perfect_hash;
 }
 
 /*
diff --git a/xdiff/xhistogram.c b/xdiff/xhistogram.c
index 6dc450b1fe..5ae1282c27 100644
--- a/xdiff/xhistogram.c
+++ b/xdiff/xhistogram.c
@@ -90,7 +90,7 @@ struct region {
 
 static int cmp_recs(xrecord_t *r1, xrecord_t *r2)
 {
-	return r1->ha == r2->ha;
+	return r1->minimal_perfect_hash == r2->minimal_perfect_hash;
 
 }
 
@@ -98,7 +98,7 @@ static int cmp_recs(xrecord_t *r1, xrecord_t *r2)
 	(cmp_recs(REC(i->env, s1, l1), REC(i->env, s2, l2)))
 
 #define TABLE_HASH(index, side, line) \
-	XDL_HASHLONG((REC(index->env, side, line))->ha, index->table_bits)
+	XDL_HASHLONG((REC(index->env, side, line))->minimal_perfect_hash, index->table_bits)
 
 static int scanA(struct histindex *index, int line1, int count1)
 {
diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
index bb61354f22..cc53266f3b 100644
--- a/xdiff/xpatience.c
+++ b/xdiff/xpatience.c
@@ -48,7 +48,7 @@
 struct hashmap {
 	int nr, alloc;
 	struct entry {
-		unsigned long hash;
+		size_t minimal_perfect_hash;
 		/*
 		 * 0 = unused entry, 1 = first line, 2 = second, etc.
 		 * line2 is NON_UNIQUE if the line is not unique
@@ -101,10 +101,10 @@ static void insert_record(xpparam_t const *xpp, int line, struct hashmap *map,
 	 * So we multiply ha by 2 in the hope that the hashing was
 	 * "unique enough".
 	 */
-	int index = (int)((record->ha << 1) % map->alloc);
+	int index = (int)((record->minimal_perfect_hash << 1) % map->alloc);
 
 	while (map->entries[index].line1) {
-		if (map->entries[index].hash != record->ha) {
+		if (map->entries[index].minimal_perfect_hash != record->minimal_perfect_hash) {
 			if (++index >= map->alloc)
 				index = 0;
 			continue;
@@ -120,7 +120,7 @@ static void insert_record(xpparam_t const *xpp, int line, struct hashmap *map,
 	if (pass == 2)
 		return;
 	map->entries[index].line1 = line;
-	map->entries[index].hash = record->ha;
+	map->entries[index].minimal_perfect_hash = record->minimal_perfect_hash;
 	map->entries[index].anchor = is_anchor(xpp, (const char *)map->env->xdf1.recs[line - 1].ptr);
 	if (!map->first)
 		map->first = map->entries + index;
@@ -248,7 +248,7 @@ static int match(struct hashmap *map, int line1, int line2)
 {
 	xrecord_t *record1 = &map->env->xdf1.recs[line1 - 1];
 	xrecord_t *record2 = &map->env->xdf2.recs[line2 - 1];
-	return record1->ha == record2->ha;
+	return record1->minimal_perfect_hash == record2->minimal_perfect_hash;
 }
 
 static int patience_diff(xpparam_t const *xpp, xdfenv_t *env,
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 85e56021da..bea0992b5e 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -93,12 +93,12 @@ static void xdl_free_classifier(xdlclassifier_t *cf) {
 
 
 static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t *rec) {
-	long hi;
+	size_t hi;
 	xdlclass_t *rcrec;
 
-	hi = (long) XDL_HASHLONG(rec->ha, cf->hbits);
+	hi = XDL_HASHLONG(rec->line_hash, cf->hbits);
 	for (rcrec = cf->rchash[hi]; rcrec; rcrec = rcrec->next)
-		if (rcrec->rec.ha == rec->ha &&
+		if (rcrec->rec.line_hash == rec->line_hash &&
 				xdl_recmatch((const char *)rcrec->rec.ptr, (long)rcrec->rec.size,
 					(const char *)rec->ptr, (long)rec->size, cf->flags))
 			break;
@@ -120,7 +120,7 @@ static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
 
 	(pass == 1) ? rcrec->len1++ : rcrec->len2++;
 
-	rec->ha = (unsigned long) rcrec->idx;
+	rec->minimal_perfect_hash = (size_t)rcrec->idx;
 
 	return 0;
 }
@@ -158,7 +158,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 			crec = &xdf->recs[xdf->nrec++];
 			crec->ptr = prev;
 			crec->size = cur - prev;
-			crec->ha = hav;
+			crec->line_hash = hav;
 			if (xdl_classify_record(pass, cf, crec) < 0)
 				goto abort;
 		}
@@ -290,7 +290,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	if ((mlim = xdl_bogosqrt(xdf1->nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
 	for (i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart]; i <= xdf1->dend; i++, recs++) {
-		rcrec = cf->rcrecs[recs->ha];
+		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
 		nm = rcrec ? rcrec->len2 : 0;
 		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
 	}
@@ -298,7 +298,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	if ((mlim = xdl_bogosqrt(xdf2->nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
 	for (i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart]; i <= xdf2->dend; i++, recs++) {
-		rcrec = cf->rcrecs[recs->ha];
+		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
 		nm = rcrec ? rcrec->len1 : 0;
 		action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
 	}
@@ -350,7 +350,7 @@ static int xdl_trim_ends(xdfile_t *xdf1, xdfile_t *xdf2) {
 	recs2 = xdf2->recs;
 	for (i = 0, lim = XDL_MIN(xdf1->nrec, xdf2->nrec); i < lim;
 	     i++, recs1++, recs2++)
-		if (recs1->ha != recs2->ha)
+		if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
 			break;
 
 	xdf1->dstart = xdf2->dstart = i;
@@ -358,7 +358,7 @@ static int xdl_trim_ends(xdfile_t *xdf1, xdfile_t *xdf2) {
 	recs1 = xdf1->recs + xdf1->nrec - 1;
 	recs2 = xdf2->recs + xdf2->nrec - 1;
 	for (lim -= i, i = 0; i < lim; i++, recs1--, recs2--)
-		if (recs1->ha != recs2->ha)
+		if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
 			break;
 
 	xdf1->dend = xdf1->nrec - i - 1;
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index 354349b523..d4e9cd2e76 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -41,7 +41,8 @@ typedef struct s_chastore {
 typedef struct s_xrecord {
 	uint8_t const *ptr;
 	size_t size;
-	unsigned long ha;
+	uint64_t line_hash;
+	size_t minimal_perfect_hash;
 } xrecord_t;
 
 typedef struct s_xdfile {
-- 
gitgitgadget



* [PATCH v4 07/10] xdiff: make xdfile_t.nrec a size_t instead of long
  2025-11-14 22:36     ` [PATCH v4 " Ezekiel Newren via GitGitGadget
                         ` (5 preceding siblings ...)
  2025-11-14 22:36       ` [PATCH v4 06/10] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash Ezekiel Newren via GitGitGadget
@ 2025-11-14 22:36       ` Ezekiel Newren via GitGitGadget
  2025-11-14 22:36       ` [PATCH v4 08/10] xdiff: make xdfile_t.nreff " Ezekiel Newren via GitGitGadget
                         ` (3 subsequent siblings)
  10 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-14 22:36 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

size_t is used because nrec describes the number of elements in recs
and, with 2 added, the number of elements allocated for 'changed'.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xdiffi.c    |  8 ++++----
 xdiff/xemit.c     | 20 ++++++++++----------
 xdiff/xmerge.c    |  8 ++++----
 xdiff/xpatience.c |  2 +-
 xdiff/xprepare.c  | 12 ++++++------
 xdiff/xtypes.h    |  2 +-
 6 files changed, 26 insertions(+), 26 deletions(-)
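
Illustration (not part of the patch): a minimal sketch of why the hunks
below cast nrec to long instead of comparing directly. When a signed
long is compared with a size_t, the signed operand is converted to an
unsigned type on common platforms, so a negative index would look like
a very large one. The helper names are hypothetical, and the sketch
assumes nrec fits in a long.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper mirroring the pattern "index >= (long)nrec".
 * Assumption: nrec is small enough to be represented as a long.
 */
static bool at_or_past_end(long index, size_t nrec)
{
	assert(nrec <= (size_t)LONG_MAX);
	return index >= (long)nrec;
}

/*
 * The same test with the conversion going the other way, which is
 * what an uncast "index >= nrec" would typically do implicitly.
 */
static bool at_or_past_end_uncast(long index, size_t nrec)
{
	return (size_t)index >= nrec;
}
```

With index = -1 and nrec = 10, the first helper reports "inside the
file" while the second reports "past the end", because -1 converts to
the largest size_t value.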

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 436c34697d..759193fe5d 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -483,7 +483,7 @@ static void measure_split(const xdfile_t *xdf, long split,
 {
 	long i;
 
-	if (split >= xdf->nrec) {
+	if (split >= (long)xdf->nrec) {
 		m->end_of_file = 1;
 		m->indent = -1;
 	} else {
@@ -506,7 +506,7 @@ static void measure_split(const xdfile_t *xdf, long split,
 
 	m->post_blank = 0;
 	m->post_indent = -1;
-	for (i = split + 1; i < xdf->nrec; i++) {
+	for (i = split + 1; i < (long)xdf->nrec; i++) {
 		m->post_indent = get_indent(&xdf->recs[i]);
 		if (m->post_indent != -1)
 			break;
@@ -717,7 +717,7 @@ static void group_init(xdfile_t *xdf, struct xdlgroup *g)
  */
 static inline int group_next(xdfile_t *xdf, struct xdlgroup *g)
 {
-	if (g->end == xdf->nrec)
+	if (g->end == (long)xdf->nrec)
 		return -1;
 
 	g->start = g->end + 1;
@@ -750,7 +750,7 @@ static inline int group_previous(xdfile_t *xdf, struct xdlgroup *g)
  */
 static int group_slide_down(xdfile_t *xdf, struct xdlgroup *g)
 {
-	if (g->end < xdf->nrec &&
+	if (g->end < (long)xdf->nrec &&
 	    recs_match(&xdf->recs[g->start], &xdf->recs[g->end])) {
 		xdf->changed[g->start++] = false;
 		xdf->changed[g->end++] = true;
diff --git a/xdiff/xemit.c b/xdiff/xemit.c
index 2f8007753c..04f7e9193b 100644
--- a/xdiff/xemit.c
+++ b/xdiff/xemit.c
@@ -137,7 +137,7 @@ static long get_func_line(xdfenv_t *xe, xdemitconf_t const *xecfg,
 	buf = func_line ? func_line->buf : dummy;
 	size = func_line ? sizeof(func_line->buf) : sizeof(dummy);
 
-	for (l = start; l != limit && 0 <= l && l < xe->xdf1.nrec; l += step) {
+	for (l = start; l != limit && 0 <= l && l < (long)xe->xdf1.nrec; l += step) {
 		long len = match_func_rec(&xe->xdf1, xecfg, l, buf, size);
 		if (len >= 0) {
 			if (func_line)
@@ -179,14 +179,14 @@ pre_context_calculation:
 			long fs1, i1 = xch->i1;
 
 			/* Appended chunk? */
-			if (i1 >= xe->xdf1.nrec) {
+			if (i1 >= (long)xe->xdf1.nrec) {
 				long i2 = xch->i2;
 
 				/*
 				 * We don't need additional context if
 				 * a whole function was added.
 				 */
-				while (i2 < xe->xdf2.nrec) {
+				while (i2 < (long)xe->xdf2.nrec) {
 					if (is_func_rec(&xe->xdf2, xecfg, i2))
 						goto post_context_calculation;
 					i2++;
@@ -196,7 +196,7 @@ pre_context_calculation:
 				 * Otherwise get more context from the
 				 * pre-image.
 				 */
-				i1 = xe->xdf1.nrec - 1;
+				i1 = (long)xe->xdf1.nrec - 1;
 			}
 
 			fs1 = get_func_line(xe, xecfg, NULL, i1, -1);
@@ -228,8 +228,8 @@ pre_context_calculation:
 
  post_context_calculation:
 		lctx = xecfg->ctxlen;
-		lctx = XDL_MIN(lctx, xe->xdf1.nrec - (xche->i1 + xche->chg1));
-		lctx = XDL_MIN(lctx, xe->xdf2.nrec - (xche->i2 + xche->chg2));
+		lctx = XDL_MIN(lctx, (long)xe->xdf1.nrec - (xche->i1 + xche->chg1));
+		lctx = XDL_MIN(lctx, (long)xe->xdf2.nrec - (xche->i2 + xche->chg2));
 
 		e1 = xche->i1 + xche->chg1 + lctx;
 		e2 = xche->i2 + xche->chg2 + lctx;
@@ -237,13 +237,13 @@ pre_context_calculation:
 		if (xecfg->flags & XDL_EMIT_FUNCCONTEXT) {
 			long fe1 = get_func_line(xe, xecfg, NULL,
 						 xche->i1 + xche->chg1,
-						 xe->xdf1.nrec);
+						 (long)xe->xdf1.nrec);
 			while (fe1 > 0 && is_empty_rec(&xe->xdf1, fe1 - 1))
 				fe1--;
 			if (fe1 < 0)
-				fe1 = xe->xdf1.nrec;
+				fe1 = (long)xe->xdf1.nrec;
 			if (fe1 > e1) {
-				e2 = XDL_MIN(e2 + (fe1 - e1), xe->xdf2.nrec);
+				e2 = XDL_MIN(e2 + (fe1 - e1), (long)xe->xdf2.nrec);
 				e1 = fe1;
 			}
 
@@ -254,7 +254,7 @@ pre_context_calculation:
 			 */
 			if (xche->next) {
 				long l = XDL_MIN(xche->next->i1,
-						 xe->xdf1.nrec - 1);
+						 (long)xe->xdf1.nrec - 1);
 				if (l - xecfg->ctxlen <= e1 ||
 				    get_func_line(xe, xecfg, NULL, l, e1) < 0) {
 					xche = xche->next;
diff --git a/xdiff/xmerge.c b/xdiff/xmerge.c
index 0dd4558a32..29dad98c49 100644
--- a/xdiff/xmerge.c
+++ b/xdiff/xmerge.c
@@ -158,7 +158,7 @@ static int is_eol_crlf(xdfile_t *file, int i)
 {
 	size_t size;
 
-	if (i < file->nrec - 1)
+	if (i < (long)file->nrec - 1)
 		/* All lines before the last *must* end in LF */
 		return (size = file->recs[i].size) > 1 &&
 			file->recs[i].ptr[size - 2] == '\r';
@@ -317,7 +317,7 @@ static int xdl_fill_merge_buffer(xdfenv_t *xe1, const char *name1,
 			continue;
 		i = m->i1 + m->chg1;
 	}
-	size += xdl_recs_copy(xe1, i, xe1->xdf2.nrec - i, 0, 0,
+	size += xdl_recs_copy(xe1, i, (int)xe1->xdf2.nrec - i, 0, 0,
 			      dest ? dest + size : NULL);
 	return size;
 }
@@ -622,7 +622,7 @@ static int xdl_do_merge(xdfenv_t *xe1, xdchange_t *xscr1,
 			changes = c;
 		i0 = xscr1->i1;
 		i1 = xscr1->i2;
-		i2 = xscr1->i1 + xe2->xdf2.nrec - xe2->xdf1.nrec;
+		i2 = xscr1->i1 + (long)xe2->xdf2.nrec - (long)xe2->xdf1.nrec;
 		chg0 = xscr1->chg1;
 		chg1 = xscr1->chg2;
 		chg2 = xscr1->chg1;
@@ -637,7 +637,7 @@ static int xdl_do_merge(xdfenv_t *xe1, xdchange_t *xscr1,
 		if (!changes)
 			changes = c;
 		i0 = xscr2->i1;
-		i1 = xscr2->i1 + xe1->xdf2.nrec - xe1->xdf1.nrec;
+		i1 = xscr2->i1 + (long)xe1->xdf2.nrec - (long)xe1->xdf1.nrec;
 		i2 = xscr2->i2;
 		chg0 = xscr2->chg1;
 		chg1 = xscr2->chg1;
diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
index cc53266f3b..a0b31eb5d8 100644
--- a/xdiff/xpatience.c
+++ b/xdiff/xpatience.c
@@ -370,5 +370,5 @@ static int patience_diff(xpparam_t const *xpp, xdfenv_t *env,
 
 int xdl_do_patience_diff(xpparam_t const *xpp, xdfenv_t *env)
 {
-	return patience_diff(xpp, env, 1, env->xdf1.nrec, 1, env->xdf2.nrec);
+	return patience_diff(xpp, env, 1, (int)env->xdf1.nrec, 1, (int)env->xdf2.nrec);
 }
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index bea0992b5e..705ddd1ae0 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -153,7 +153,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 		for (top = blk + bsize; cur < top; ) {
 			prev = cur;
 			hav = xdl_hash_record(&cur, top, xpp->flags);
-			if (XDL_ALLOC_GROW(xdf->recs, xdf->nrec + 1, narec))
+			if (XDL_ALLOC_GROW(xdf->recs, (long)xdf->nrec + 1, narec))
 				goto abort;
 			crec = &xdf->recs[xdf->nrec++];
 			crec->ptr = prev;
@@ -287,7 +287,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	/*
 	 * Initialize temporary arrays with DISCARD, KEEP, or INVESTIGATE.
 	 */
-	if ((mlim = xdl_bogosqrt(xdf1->nrec)) > XDL_MAX_EQLIMIT)
+	if ((mlim = xdl_bogosqrt((long)xdf1->nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
 	for (i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart]; i <= xdf1->dend; i++, recs++) {
 		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
@@ -295,7 +295,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
 	}
 
-	if ((mlim = xdl_bogosqrt(xdf2->nrec)) > XDL_MAX_EQLIMIT)
+	if ((mlim = xdl_bogosqrt((long)xdf2->nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
 	for (i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart]; i <= xdf2->dend; i++, recs++) {
 		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
@@ -348,7 +348,7 @@ static int xdl_trim_ends(xdfile_t *xdf1, xdfile_t *xdf2) {
 
 	recs1 = xdf1->recs;
 	recs2 = xdf2->recs;
-	for (i = 0, lim = XDL_MIN(xdf1->nrec, xdf2->nrec); i < lim;
+	for (i = 0, lim = (long)XDL_MIN(xdf1->nrec, xdf2->nrec); i < lim;
 	     i++, recs1++, recs2++)
 		if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
 			break;
@@ -361,8 +361,8 @@ static int xdl_trim_ends(xdfile_t *xdf1, xdfile_t *xdf2) {
 		if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
 			break;
 
-	xdf1->dend = xdf1->nrec - i - 1;
-	xdf2->dend = xdf2->nrec - i - 1;
+	xdf1->dend = (long)xdf1->nrec - i - 1;
+	xdf2->dend = (long)xdf2->nrec - i - 1;
 
 	return 0;
 }
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index d4e9cd2e76..4c4d9bd147 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -47,7 +47,7 @@ typedef struct s_xrecord {
 
 typedef struct s_xdfile {
 	xrecord_t *recs;
-	long nrec;
+	size_t nrec;
 	ptrdiff_t dstart, dend;
 	bool *changed;
 	long *rindex;
-- 
gitgitgadget



* [PATCH v4 08/10] xdiff: make xdfile_t.nreff a size_t instead of long
  2025-11-14 22:36     ` [PATCH v4 " Ezekiel Newren via GitGitGadget
                         ` (6 preceding siblings ...)
  2025-11-14 22:36       ` [PATCH v4 07/10] xdiff: make xdfile_t.nrec a size_t instead of long Ezekiel Newren via GitGitGadget
@ 2025-11-14 22:36       ` Ezekiel Newren via GitGitGadget
  2025-11-14 22:36       ` [PATCH v4 09/10] xdiff: change rindex from long to size_t in xdfile_t Ezekiel Newren via GitGitGadget
                         ` (2 subsequent siblings)
  10 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-14 22:36 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

size_t is used because nreff describes the number of elements in memory
for rindex.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xprepare.c | 14 +++++++-------
 xdiff/xtypes.h   |  2 +-
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 705ddd1ae0..39fd79d9d4 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -264,7 +264,7 @@ static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
  * might be potentially discarded if they appear in a run of discardable.
  */
 static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xdf2) {
-	long i, nm, nreff, mlim;
+	long i, nm, mlim;
 	xrecord_t *recs;
 	xdlclass_t *rcrec;
 	uint8_t *action1 = NULL, *action2 = NULL;
@@ -307,29 +307,29 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	 * Use temporary arrays to decide if changed[i] should remain
 	 * false, or become true.
 	 */
-	for (nreff = 0, i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart];
+	xdf1->nreff = 0;
+	for (i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart];
 	     i <= xdf1->dend; i++, recs++) {
 		if (action1[i] == KEEP ||
 		    (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xdf1->dstart, xdf1->dend))) {
-			xdf1->rindex[nreff++] = i;
+			xdf1->rindex[xdf1->nreff++] = i;
 			/* changed[i] remains false, i.e. keep */
 		} else
 			xdf1->changed[i] = true;
 			/* i.e. discard */
 	}
-	xdf1->nreff = nreff;
 
-	for (nreff = 0, i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart];
+	xdf2->nreff = 0;
+	for (i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart];
 	     i <= xdf2->dend; i++, recs++) {
 		if (action2[i] == KEEP ||
 		    (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xdf2->dstart, xdf2->dend))) {
-			xdf2->rindex[nreff++] = i;
+			xdf2->rindex[xdf2->nreff++] = i;
 			/* changed[i] remains false, i.e. keep */
 		} else
 			xdf2->changed[i] = true;
 			/* i.e. discard */
 	}
-	xdf2->nreff = nreff;
 
 cleanup:
 	xdl_free(action1);
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index 4c4d9bd147..1f495f987f 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -51,7 +51,7 @@ typedef struct s_xdfile {
 	ptrdiff_t dstart, dend;
 	bool *changed;
 	long *rindex;
-	long nreff;
+	size_t nreff;
 } xdfile_t;
 
 typedef struct s_xdfenv {
-- 
gitgitgadget



* [PATCH v4 09/10] xdiff: change rindex from long to size_t in xdfile_t
  2025-11-14 22:36     ` [PATCH v4 " Ezekiel Newren via GitGitGadget
                         ` (7 preceding siblings ...)
  2025-11-14 22:36       ` [PATCH v4 08/10] xdiff: make xdfile_t.nreff " Ezekiel Newren via GitGitGadget
@ 2025-11-14 22:36       ` Ezekiel Newren via GitGitGadget
  2025-11-14 22:36       ` [PATCH v4 10/10] xdiff: rename rindex -> reference_index Ezekiel Newren via GitGitGadget
  2025-11-18 22:34       ` [PATCH v5 00/10] Xdiff cleanup part2 Ezekiel Newren via GitGitGadget
  10 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-14 22:36 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

The field rindex describes an index offset for other arrays. Change it
to size_t.

Changing the type of rindex from long to size_t has no cascading
refactor impact because it is only ever used to directly index other
arrays.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xtypes.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index 1f495f987f..9074cdadd1 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -50,7 +50,7 @@ typedef struct s_xdfile {
 	size_t nrec;
 	ptrdiff_t dstart, dend;
 	bool *changed;
-	long *rindex;
+	size_t *rindex;
 	size_t nreff;
 } xdfile_t;
 
-- 
gitgitgadget



* [PATCH v4 10/10] xdiff: rename rindex -> reference_index
  2025-11-14 22:36     ` [PATCH v4 " Ezekiel Newren via GitGitGadget
                         ` (8 preceding siblings ...)
  2025-11-14 22:36       ` [PATCH v4 09/10] xdiff: change rindex from long to size_t in xdfile_t Ezekiel Newren via GitGitGadget
@ 2025-11-14 22:36       ` Ezekiel Newren via GitGitGadget
  2025-11-18 22:34       ` [PATCH v5 00/10] Xdiff cleanup part2 Ezekiel Newren via GitGitGadget
  10 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-14 22:36 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ezekiel Newren, Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

The classic diff considers only a subset of each file's lines and
records them in a compacted array. rindex is that array: each entry
holds the index of the line in the file that it references. Rename it
to reference_index to make this mapping explicit.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xdiffi.c   |  6 +++---
 xdiff/xprepare.c | 10 +++++-----
 xdiff/xtypes.h   |  2 +-
 3 files changed, 9 insertions(+), 9 deletions(-)
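
Illustration (not part of the patch): a minimal sketch of the mapping
that reference_index provides. The struct is a hypothetical stand-in
for the relevant xdfile_t fields, and the keep[] array stands in for
the KEEP/DISCARD decision that xdl_cleanup_records() makes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant fields of xdfile_t. */
struct mini_file {
	size_t nrec;             /* total number of lines */
	size_t *reference_index; /* compacted position -> line index */
	size_t nreff;            /* number of entries in reference_index */
};

/* Record the index of every kept line; the caller provides storage. */
static void compact_lines(struct mini_file *f, const bool *keep)
{
	f->nreff = 0;
	for (size_t i = 0; i < f->nrec; i++)
		if (keep[i])
			f->reference_index[f->nreff++] = i;
}

/* Translate a position in the compacted array back to a line index. */
static size_t line_of(const struct mini_file *f, size_t compact_pos)
{
	assert(compact_pos < f->nreff);
	return f->reference_index[compact_pos];
}
```

For five lines of which lines 0, 3 and 4 are kept, nreff ends up as 3
and line_of() maps compacted positions 0, 1, 2 to lines 0, 3, 4.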

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 759193fe5d..8eb664be3e 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -24,7 +24,7 @@
 
 static size_t get_hash(xdfile_t *xdf, long index)
 {
-	return xdf->recs[xdf->rindex[index]].minimal_perfect_hash;
+	return xdf->recs[xdf->reference_index[index]].minimal_perfect_hash;
 }
 
 #define XDL_MAX_COST_MIN 256
@@ -278,10 +278,10 @@ int xdl_recs_cmp(xdfile_t *xdf1, long off1, long lim1,
 	 */
 	if (off1 == lim1) {
 		for (; off2 < lim2; off2++)
-			xdf2->changed[xdf2->rindex[off2]] = true;
+			xdf2->changed[xdf2->reference_index[off2]] = true;
 	} else if (off2 == lim2) {
 		for (; off1 < lim1; off1++)
-			xdf1->changed[xdf1->rindex[off1]] = true;
+			xdf1->changed[xdf1->reference_index[off1]] = true;
 	} else {
 		xdpsplit_t spl;
 		spl.i1 = spl.i2 = 0;
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 39fd79d9d4..34c82e4f8e 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -128,7 +128,7 @@ static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
 
 static void xdl_free_ctx(xdfile_t *xdf)
 {
-	xdl_free(xdf->rindex);
+	xdl_free(xdf->reference_index);
 	xdl_free(xdf->changed - 1);
 	xdl_free(xdf->recs);
 }
@@ -141,7 +141,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 	uint8_t const *blk, *cur, *top, *prev;
 	xrecord_t *crec;
 
-	xdf->rindex = NULL;
+	xdf->reference_index = NULL;
 	xdf->changed = NULL;
 	xdf->recs = NULL;
 
@@ -169,7 +169,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 
 	if ((XDF_DIFF_ALG(xpp->flags) != XDF_PATIENCE_DIFF) &&
 	    (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF)) {
-		if (!XDL_ALLOC_ARRAY(xdf->rindex, xdf->nrec + 1))
+		if (!XDL_ALLOC_ARRAY(xdf->reference_index, xdf->nrec + 1))
 			goto abort;
 	}
 
@@ -312,7 +312,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	     i <= xdf1->dend; i++, recs++) {
 		if (action1[i] == KEEP ||
 		    (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xdf1->dstart, xdf1->dend))) {
-			xdf1->rindex[xdf1->nreff++] = i;
+			xdf1->reference_index[xdf1->nreff++] = i;
 			/* changed[i] remains false, i.e. keep */
 		} else
 			xdf1->changed[i] = true;
@@ -324,7 +324,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	     i <= xdf2->dend; i++, recs++) {
 		if (action2[i] == KEEP ||
 		    (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xdf2->dstart, xdf2->dend))) {
-			xdf2->rindex[xdf2->nreff++] = i;
+			xdf2->reference_index[xdf2->nreff++] = i;
 			/* changed[i] remains false, i.e. keep */
 		} else
 			xdf2->changed[i] = true;
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index 9074cdadd1..979586f20a 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -50,7 +50,7 @@ typedef struct s_xdfile {
 	size_t nrec;
 	ptrdiff_t dstart, dend;
 	bool *changed;
-	size_t *rindex;
+	size_t *reference_index;
 	size_t nreff;
 } xdfile_t;
 
-- 
gitgitgadget


* [PATCH v5 00/10] Xdiff cleanup part2
  2025-11-14 22:36     ` [PATCH v4 " Ezekiel Newren via GitGitGadget
                         ` (9 preceding siblings ...)
  2025-11-14 22:36       ` [PATCH v4 10/10] xdiff: rename rindex -> reference_index Ezekiel Newren via GitGitGadget
@ 2025-11-18 22:34       ` Ezekiel Newren via GitGitGadget
  2025-11-18 22:34         ` [PATCH v5 01/10] doc: define unambiguous type mappings across C and Rust Ezekiel Newren via GitGitGadget
                           ` (10 more replies)
  10 siblings, 11 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-18 22:34 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ramsay Jones, Ben Knoble, Ezekiel Newren

Changes in v5:

 * Remove the non-word 'signless', and rephrase that paragraph in
   unambiguous-types.adoc
 * Cast to char in xdiffi.c:get_indent() rather than changing the local
   variable to uint8_t

Changes in v4:

 * Update documentation to not mention Unicode except once
 * Don't move dstart/dend within the xdfile_t struct
 * Rephrase justification on changing xrecord_t.ptr's type

Changes in v3:

 * Address comments about commit messages and documentation
 * Add unambiguous-types.adoc to Makefile and Meson
 * Use markdown style to avoid asciidoc issues

Changes in v2:

 * Added documentation about unambiguous types and FFI
 * Addressed comments on the mailing list


Original cover letter below:
============================

Maintainer note: This patch series builds on top of en/xdiff-cleanup and
am/xdiff-hash-tweak (both of which are now in master).

The primary goal of this patch series is to convert every field's type in
xrecord_t and xdfile_t to be unambiguous, in preparation for making them
more Rust FFI friendly. Additionally, the ha field in xrecord_t is split
into line_hash and minimal_perfect_hash.

The order of some of the fields has changed as called out by the commit
messages.

Before:

typedef struct s_xrecord {
	char const *ptr;
	long size;
	unsigned long ha;
} xrecord_t;

typedef struct s_xdfile {
	xrecord_t *recs;
	long nrec;
	long dstart, dend;
	bool *changed;
	long *rindex;
	long nreff;
} xdfile_t;


After part 2

typedef struct s_xrecord {
	uint8_t const *ptr;
	size_t size;
	uint64_t line_hash;
	size_t minimal_perfect_hash;
} xrecord_t;

typedef struct s_xdfile {
	xrecord_t *recs;
	size_t nrec;
	ptrdiff_t dstart, dend;
	bool *changed;
	size_t *reference_index;
	size_t nreff;
} xdfile_t;
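
Illustration (not part of the series): a minimal sketch of how the two
hash fields relate. line_hash is a hash of the line's content, while
minimal_perfect_hash is the dense index of the line's equivalence
class, so two lines are equal exactly when their class indices are
equal. Everything below is hypothetical: the toy FNV-1a hash is not
the hash xdiff uses, and the linear scan stands in for the bucketed
lookup in xdl_classify_record().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_CLASSES 16

struct line_class {
	const char *text;
	uint64_t line_hash;
};

struct classifier {
	struct line_class classes[MAX_CLASSES];
	size_t count;
};

/* Toy content hash (FNV-1a); an assumption made for this sketch. */
static uint64_t toy_line_hash(const char *s)
{
	uint64_t h = 0xcbf29ce484222325ULL;

	for (; *s; s++) {
		h ^= (uint8_t)*s;
		h *= 0x100000001b3ULL;
	}
	return h;
}

/*
 * Return the class index for a line, creating a class on first sight.
 * The hash comparison is a cheap filter; the content comparison
 * decides, so hash collisions cannot merge distinct lines.
 */
static size_t classify(struct classifier *cf, const char *line)
{
	uint64_t h = toy_line_hash(line);

	for (size_t i = 0; i < cf->count; i++)
		if (cf->classes[i].line_hash == h &&
		    !strcmp(cf->classes[i].text, line))
			return i;

	assert(cf->count < MAX_CLASSES);
	cf->classes[cf->count].text = line;
	cf->classes[cf->count].line_hash = h;
	return cf->count++;
}
```

Once every line carries such an index, later stages can compare lines
with a single integer comparison, which is what the diff code does
with minimal_perfect_hash.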


Ezekiel Newren (10):
  doc: define unambiguous type mappings across C and Rust
  xdiff: use ptrdiff_t for dstart/dend
  xdiff: make xrecord_t.ptr a uint8_t instead of char
  xdiff: use size_t for xrecord_t.size
  xdiff: use unambiguous types in xdl_hash_record()
  xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash
  xdiff: make xdfile_t.nrec a size_t instead of long
  xdiff: make xdfile_t.nreff a size_t instead of long
  xdiff: change rindex from long to size_t in xdfile_t
  xdiff: rename rindex -> reference_index

 Documentation/Makefile                        |   1 +
 Documentation/technical/meson.build           |   1 +
 .../technical/unambiguous-types.adoc          | 224 ++++++++++++++++++
 xdiff-interface.c                             |   2 +-
 xdiff/xdiffi.c                                |  29 ++-
 xdiff/xemit.c                                 |  28 +--
 xdiff/xhistogram.c                            |   4 +-
 xdiff/xmerge.c                                |  30 +--
 xdiff/xpatience.c                             |  14 +-
 xdiff/xprepare.c                              |  60 ++---
 xdiff/xtypes.h                                |  15 +-
 xdiff/xutils.c                                |  32 +--
 xdiff/xutils.h                                |   6 +-
 13 files changed, 336 insertions(+), 110 deletions(-)
 create mode 100644 Documentation/technical/unambiguous-types.adoc


base-commit: a99f379adf116d53eb11957af5bab5214915f91d
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-2070%2Fezekielnewren%2Fxdiff_cleanup_part2-v5
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-2070/ezekielnewren/xdiff_cleanup_part2-v5
Pull-Request: https://github.com/git/git/pull/2070

Range-diff vs v4:

  1:  af732beb69 !  1:  8b56bf1172 doc: define unambiguous type mappings across C and Rust
     @@ Documentation/technical/unambiguous-types.adoc (new)
      +has additional semantics and platform-dependent behavior that can cause
      +problems, as discussed below.
      +
     -+C comparison problem: While the sign of `char` is implementation defined, it's
     -+also signless (neither signed nor unsigned). When building with
     -+`make DEVELOPER=1` it will complain about a "differ in signedness" when `char`
     -+is compared with `uint8_t` or `int8_t`.
     ++The C language leaves the signedness of `char` implementation defined. Because
     ++our developer build enables -Wsign-compare, comparison of a value of `char`
     ++type with either signed or unsigned integers may trigger warnings from the
     ++compiler.
      +
      +Note: Rust's `char` type is an unsigned 32-bit integer that is used to describe
      +Unicode code points.
  2:  b60a03eb31 =  2:  c4193d11f5 xdiff: use ptrdiff_t for dstart/dend
  3:  042fbb11d0 !  3:  dd76d4f586 xdiff: make xrecord_t.ptr a uint8_t instead of char
     @@ Commit message
          Make xrecord_t.ptr uint8_t because it's referring to bytes in memory.
      
          In order to avoid a refactor avalanche, many uses of this field were
     -    cast to char* or similar. One exception is in get_indent() where the
     -    local variable `char c` was changed to `uint8_t c`.
     +    cast to char* or similar.
      
          Places where casting was unnecessary:
          xemit.c:156
     @@ xdiff/xdiffi.c: static int get_indent(xrecord_t *rec)
       
       	for (i = 0; i < rec->size; i++) {
      -		char c = rec->ptr[i];
     -+		uint8_t c = rec->ptr[i];
     ++		char c = (char) rec->ptr[i];
       
       		if (!XDL_ISSPACE(c))
       			return ret;
  4:  c103fa6bea !  4:  11cec1d2ec xdiff: use size_t for xrecord_t.size
     @@ xdiff/xdiffi.c: static int recs_match(xrecord_t *rec1, xrecord_t *rec2)
       
      -	for (i = 0; i < rec->size; i++) {
      +	for (size_t i = 0; i < rec->size; i++) {
     - 		uint8_t c = rec->ptr[i];
     + 		char c = (char) rec->ptr[i];
       
       		if (!XDL_ISSPACE(c))
      @@ xdiff/xdiffi.c: static void xdl_mark_ignorable_lines(xdchange_t *xscr, xdfenv_t *xe, long flags)
  5:  2ee9a74653 =  5:  6f267360b7 xdiff: use unambiguous types in xdl_hash_record()
  6:  f044274bd5 =  6:  78af0f16f4 xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash
  7:  f7a3731d94 =  7:  5c19f9ded3 xdiff: make xdfile_t.nrec a size_t instead of long
  8:  93f84ae72e =  8:  d1f498edb1 xdiff: make xdfile_t.nreff a size_t instead of long
  9:  39369becc8 =  9:  bc4941c146 xdiff: change rindex from long to size_t in xdfile_t
 10:  950d1e6193 = 10:  dcc9d6bfaf xdiff: rename rindex -> reference_index

-- 
gitgitgadget


* [PATCH v5 01/10] doc: define unambiguous type mappings across C and Rust
  2025-11-18 22:34       ` [PATCH v5 00/10] Xdiff cleanup part2 Ezekiel Newren via GitGitGadget
@ 2025-11-18 22:34         ` Ezekiel Newren via GitGitGadget
  2025-11-18 23:46           ` Ramsay Jones
  2025-11-18 22:34         ` [PATCH v5 02/10] xdiff: use ptrdiff_t for dstart/dend Ezekiel Newren via GitGitGadget
                           ` (9 subsequent siblings)
  10 siblings, 1 reply; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-18 22:34 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ramsay Jones, Ben Knoble, Ezekiel Newren,
	Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

Document other nuances when crossing the FFI boundary. Other language
mappings may be added in the future.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 Documentation/Makefile                        |   1 +
 Documentation/technical/meson.build           |   1 +
 .../technical/unambiguous-types.adoc          | 224 ++++++++++++++++++
 3 files changed, 226 insertions(+)
 create mode 100644 Documentation/technical/unambiguous-types.adoc
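
Illustration (not part of the patch): a minimal sketch of the `char`
signedness issue that the new document describes. Because the
signedness of `char` is implementation defined, a byte value of 0x80
or above compares differently depending on the platform unless it is
first normalized to an unsigned byte. The helper names are
hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* True on platforms where plain char behaves as a signed type. */
static bool char_is_signed(void)
{
	return CHAR_MIN < 0;
}

/* Naive comparison: c is promoted to int and keeps its sign. */
static bool matches_naive(char c, int byte_value)
{
	return c == byte_value;
}

/* Comparison after normalizing to an unsigned byte, as a u8 would see it. */
static bool matches_as_byte(char c, int byte_value)
{
	assert(byte_value >= 0 && byte_value <= UINT8_MAX);
	return (uint8_t)c == byte_value;
}
```

For a char holding the byte 0xE9, matches_as_byte() succeeds on every
platform, whereas matches_naive() succeeds only where char is
unsigned.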

diff --git a/Documentation/Makefile b/Documentation/Makefile
index 04e9e10b27..bc1adb2d9d 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -142,6 +142,7 @@ TECH_DOCS += technical/shallow
 TECH_DOCS += technical/sparse-checkout
 TECH_DOCS += technical/sparse-index
 TECH_DOCS += technical/trivial-merge
+TECH_DOCS += technical/unambiguous-types
 TECH_DOCS += technical/unit-tests
 SP_ARTICLES += $(TECH_DOCS)
 SP_ARTICLES += technical/api-index
diff --git a/Documentation/technical/meson.build b/Documentation/technical/meson.build
index be698ef22a..89a6e26821 100644
--- a/Documentation/technical/meson.build
+++ b/Documentation/technical/meson.build
@@ -32,6 +32,7 @@ articles = [
   'sparse-checkout.adoc',
   'sparse-index.adoc',
   'trivial-merge.adoc',
+  'unambiguous-types.adoc',
   'unit-tests.adoc',
 ]
 
diff --git a/Documentation/technical/unambiguous-types.adoc b/Documentation/technical/unambiguous-types.adoc
new file mode 100644
index 0000000000..9a4990847c
--- /dev/null
+++ b/Documentation/technical/unambiguous-types.adoc
@@ -0,0 +1,224 @@
+= Unambiguous types
+
+Most of these mappings are obvious, but there are some nuances and gotchas with
+Rust FFI (Foreign Function Interface).
+
+This document defines clear, one-to-one mappings between primitive types in C
+and Rust (and possibly other languages in the future). Its purpose is to eliminate
+ambiguity in type widths, signedness, and binary representation across
+platforms and languages.
+
+For Git, the only header required to use these unambiguous types in C is
+`git-compat-util.h`.
+
+== Boolean types
+[cols="1,1", options="header"]
+|===
+| C Type | Rust Type
+| bool^1^       | bool
+|===
+
+== Integer types
+
+In C, `<stdint.h>` (or an equivalent) must be included.
+
+[cols="1,1", options="header"]
+|===
+| C Type | Rust Type
+| uint8_t    | u8
+| uint16_t   | u16
+| uint32_t   | u32
+| uint64_t   | u64
+
+| int8_t     | i8
+| int16_t    | i16
+| int32_t    | i32
+| int64_t    | i64
+|===
+
+== Floating-point types
+
+Rust requires IEEE-754 semantics.
+In C, that is typically true, but not guaranteed by the standard.
+
+[cols="1,1", options="header"]
+|===
+| C Type | Rust Type
+| float^2^      | f32
+| double^2^     | f64
+|===
+
+== Size types
+
+These types represent pointer-sized integers and are typically defined in
+`<stddef.h>` or an equivalent header.
+
+Size types should be used any time pointer arithmetic is performed, e.g.
+indexing an array, describing the number of elements in memory, etc.
+
+[cols="1,1", options="header"]
+|===
+| C Type | Rust Type
+| size_t^3^     | usize
+| ptrdiff_t^3^  | isize
+|===
+
+== Character types
+
+This is where C and Rust don't have a clean one-to-one mapping.
+
+A C `char` and a Rust `u8` share the same bit width, so any C struct containing
+a `char` will have the same size as the corresponding Rust struct using `u8`.
+In that sense, such structs are safe to pass over the FFI boundary, because
+their fields will be laid out identically. However, beyond bit width, C `char`
+has additional semantics and platform-dependent behavior that can cause
+problems, as discussed below.
+
+The C language leaves the signedness of `char` implementation-defined. Because
+our developer build enables -Wsign-compare, comparison of a value of `char`
+type with either signed or unsigned integers may trigger warnings from the
+compiler.
+
+Note: Rust's `char` type is a 32-bit Unicode scalar value, not a byte, so it
+does not correspond to C `char`.
+
+=== Notes
+^1^ This is only true if `<stdbool.h>` (or an equivalent) is used. +
+^2^ C does not enforce IEEE-754 compatibility, but Rust expects it. If the
+platform/arch for C does not follow IEEE-754 then this equivalence does not
+hold. Also, it's assumed that `float` is 32 bits and `double` is 64, but
+there may be a strange platform/arch where even this isn't true. +
+^3^ C also defines uintptr_t and intptr_t, and POSIX defines ssize_t, but these
+types are discouraged for FFI purposes. For functions like `read()` and
+`write()`, ssize_t should be cast to a different, unambiguous type before
+being passed over the FFI boundary. +
+
+== Problems with std::ffi::c_* types in Rust
+TL;DR: In practice, Rust's `c_*` types aren't guaranteed to match C types for
+all possible C compilers, platforms, or architectures, because Rust only
+ensures correctness of C types on officially supported targets. These
+definitions have changed over time to match more targets which means that the
+c_* definitions will differ based on which Rust version Git chooses to use.
+
+Current list of safe, Rust side, FFI types in Git: +
+
+* `c_void`
+* `CStr`
+* `CString`
+
+Even then, they should be used sparingly, and only where the semantics match
+exactly.
+
+The std::os::raw::c_* types directly inherit the problems of core::ffi, which
+changes over time and seems to make a best guess at the correct definition for
+a given platform/target. This probably isn't a problem for the platforms that
+Rust currently supports, but can anyone say that Rust got it right for all C
+compilers of all platforms/targets?
+
+To give an example, the definition of c_long differs between Rust versions:
+footnote:[https://doc.rust-lang.org/1.63.0/src/core/ffi/mod.rs.html#175-189[c_long in 1.63.0]]
+footnote:[https://doc.rust-lang.org/1.89.0/src/core/ffi/primitives.rs.html#135-151[c_long in 1.89.0]]
+
+=== Rust version 1.63.0
+
+```
+mod c_long_definition {
+    cfg_if! {
+        if #[cfg(all(target_pointer_width = "64", not(windows)))] {
+            pub type c_long = i64;
+            pub type NonZero_c_long = crate::num::NonZeroI64;
+            pub type c_ulong = u64;
+            pub type NonZero_c_ulong = crate::num::NonZeroU64;
+        } else {
+            // The minimal size of `long` in the C standard is 32 bits
+            pub type c_long = i32;
+            pub type NonZero_c_long = crate::num::NonZeroI32;
+            pub type c_ulong = u32;
+            pub type NonZero_c_ulong = crate::num::NonZeroU32;
+        }
+    }
+}
+```
+
+=== Rust version 1.89.0
+
+```
+mod c_long_definition {
+    crate::cfg_select! {
+        any(
+            all(target_pointer_width = "64", not(windows)),
+            // wasm32 Linux ABI uses 64-bit long
+            all(target_arch = "wasm32", target_os = "linux")
+        ) => {
+            pub(super) type c_long = i64;
+            pub(super) type c_ulong = u64;
+        }
+        _ => {
+            // The minimal size of `long` in the C standard is 32 bits
+            pub(super) type c_long = i32;
+            pub(super) type c_ulong = u32;
+        }
+    }
+}
+```
+
+Even for the cases where C types are correctly mapped to Rust types via
+std::ffi::c_* there are still problems. Take c_char, for example: on some
+platforms it's u8, on others it's i8.
+
+=== Subtraction underflow in debug mode
+
+The following code will panic in debug on platforms that define c_char as u8,
+but won't if it's an i8.
+
+```
+let mut x: std::ffi::c_char = 0;
+x -= 1;
+```
+
+=== Inconsistent shift behavior
+
+`x` will be 0xC0 for platforms that use i8, but will be 0x40 where it's u8.
+
+```
+let mut x: std::ffi::c_char = 0x80u8 as std::ffi::c_char;
+x >>= 1;
+```
+
+=== Equality fails to compile on some platforms
+
+The following will not compile on platforms that define c_char as i8, but will
+if it's u8. You can cast x e.g. `assert_eq!(x as u8, b'a');`, but then you get
+a warning on platforms that use u8 and a clean compilation where i8 is used.
+
+```
+let mut x: std::ffi::c_char = 0x61;
+assert_eq!(x, b'a');
+```
+
+== Enum types
+Rust enum types should not be used as FFI types. Rust enum types are more like
+C union types than C enums. For something like:
+
+```
+#[repr(C, u8)]
+enum Fruit {
+    Apple,
+    Banana,
+    Cherry,
+}
+```
+
+It's easy enough to make sure the Rust enum matches what C would expect, but
+that is not the case for a more complex type like:
+
+```
+enum HashResult {
+    SHA1([u8; 20]),
+    SHA256([u8; 32]),
+}
+```
+
+The Rust compiler has to add a discriminant to the enum to distinguish between
+the variants. The width, location, and values of that discriminant are up to
+the Rust compiler and are not ABI stable.
-- 
gitgitgadget
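
Editorial sketch (not part of the patch): the width assumptions listed in
the document can be spelled out on the C side as compile-time checks. The
assertions below and the helper name byte_value() are hypothetical and only
illustrate the idea; nothing here is taken from git-compat-util.h.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical compile-time checks mirroring the tables in the document. */
_Static_assert(CHAR_BIT == 8, "bytes must be 8 bits wide");
_Static_assert(sizeof(bool) == 1, "bool must be one byte wide");
_Static_assert(sizeof(float) == 4, "float must be 32 bits to pair with f32");
_Static_assert(sizeof(double) == 8, "double must be 64 bits to pair with f64");
_Static_assert(sizeof(size_t) == sizeof(ptrdiff_t),
	       "size_t and ptrdiff_t must have the same width");

/*
 * Read one byte through uint8_t so that the result does not depend on
 * whether plain char is signed on this platform.
 */
static unsigned int byte_value(const char *p)
{
	assert(p != NULL);
	return *(const uint8_t *)p;
}
```

Reading through uint8_t gives the same value whether plain char is signed
or unsigned, which is the property the document relies on when it pairs
uint8_t with u8.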



* Re: [PATCH v5 01/10] doc: define unambiguous type mappings across C and Rust
  2025-11-18 22:34         ` [PATCH v5 01/10] doc: define unambiguous type mappings across C and Rust Ezekiel Newren via GitGitGadget
@ 2025-11-18 23:46           ` Ramsay Jones
  2025-11-19  4:14             ` Junio C Hamano
  0 siblings, 1 reply; 118+ messages in thread
From: Ramsay Jones @ 2025-11-18 23:46 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget, git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ben Knoble, Ezekiel Newren



On 18/11/2025 10:34 pm, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
> 
> Document other nuances when crossing the FFI boundary. Other language
> mappings may be added in the future.
> 
> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
> ---
>  Documentation/Makefile                        |   1 +
>  Documentation/technical/meson.build           |   1 +
>  .../technical/unambiguous-types.adoc          | 224 ++++++++++++++++++
>  3 files changed, 226 insertions(+)
>  create mode 100644 Documentation/technical/unambiguous-types.adoc
> 
[snip]

> diff --git a/Documentation/technical/unambiguous-types.adoc b/Documentation/technical/unambiguous-types.adoc
> new file mode 100644
> index 0000000000..9a4990847c
> --- /dev/null
> +++ b/Documentation/technical/unambiguous-types.adoc
> @@ -0,0 +1,224 @@
> += Unambiguous types
> +
> +Most of these mappings are obvious, but there are some nuances and gotchas with
> +Rust FFI (Foreign Function Interface).
> +
> +This document defines clear, one-to-one mappings between primitive types in C,
> +Rust (and possible other languages in the future). Its purpose is to eliminate
> +ambiguity in type widths, signedness, and binary representation across
> +platforms and languages.
> +
> +For Git, the only header required to use these unambiguous types in C is
> +`git-compat-util.h`.
> +
> +== Boolean types
> +[cols="1,1", options="header"]
> +|===
> +| C Type | Rust Type
> +| bool^1^       | bool
> +|===
> +
> +== Integer types
> +
> +In C, `<stdint.h>` (or an equivalent) must be included.
> +
> +[cols="1,1", options="header"]
> +|===
> +| C Type | Rust Type
> +| uint8_t    | u8
> +| uint16_t   | u16
> +| uint32_t   | u32
> +| uint64_t   | u64
> +
> +| int8_t     | i8
> +| int16_t    | i16
> +| int32_t    | i32
> +| int64_t    | i64
> +|===
> +
> +== Floating-point types
> +
> +Rust requires IEEE-754 semantics.
> +In C, that is typically true, but not guaranteed by the standard.
> +
> +[cols="1,1", options="header"]
> +|===
> +| C Type | Rust Type
> +| float^2^      | f32
> +| double^2^     | f64
> +|===
> +
> +== Size types
> +
> +These types represent pointer-sized integers and are typically defined in
> +`<stddef.h>` or an equivalent header.
> +
> +Size types should be used any time pointer arithmetic is performed e.g.
> +indexing an array, describing the number of elements in memory, etc...
> +
> +[cols="1,1", options="header"]
> +|===
> +| C Type | Rust Type
> +| size_t^3^     | usize
> +| ptrdiff_t^3^  | isize
> +|===
> +
> +== Character types
> +
> +This is where C and Rust don't have a clean one-to-one mapping.
> +
> +A C `char` and a Rust `u8` share the same bit width, so any C struct containing
> +a `char` will have the same size as the corresponding Rust struct using `u8`.
> +In that sense, such structs are safe to pass over the FFI boundary, because
> +their fields will be laid out identically. However, beyond bit width, C `char`
> +has additional semantics and platform-dependent behavior that can cause
> +problems, as discussed below.
> +
> +The C language leaves the signedness of `char` implementation defined. Because
> +our developer build enables -Wsign-compare, comparison of a value of `char`
> +type with either signed or unsigned integers may trigger warnings from the
> +compiler.

Yep, much better. Thanks!

ATB,
Ramsay Jones




* Re: [PATCH v5 01/10] doc: define unambiguous type mappings across C and Rust
  2025-11-18 23:46           ` Ramsay Jones
@ 2025-11-19  4:14             ` Junio C Hamano
  0 siblings, 0 replies; 118+ messages in thread
From: Junio C Hamano @ 2025-11-19  4:14 UTC (permalink / raw)
  To: Ramsay Jones
  Cc: Ezekiel Newren via GitGitGadget, git, Kristoffer Haugsbakk,
	Patrick Steinhardt, Phillip Wood, Chris Torek, Ben Knoble,
	Ezekiel Newren

Ramsay Jones <ramsay@ramsayjones.plus.com> writes:

>> +== Character types
>> +
>> +This is where C and Rust don't have a clean one-to-one mapping.
>> +
>> +A C `char` and a Rust `u8` share the same bit width, so any C struct containing
>> +a `char` will have the same size as the corresponding Rust struct using `u8`.
>> +In that sense, such structs are safe to pass over the FFI boundary, because
>> +their fields will be laid out identically. However, beyond bit width, C `char`
>> +has additional semantics and platform-dependent behavior that can cause
>> +problems, as discussed below.
>> +
>> +The C language leaves the signedness of `char` implementation defined. Because
>> +our developer build enables -Wsign-compare, comparison of a value of `char`
>> +type with either signed or unsigned integers may trigger warnings from the
>> +compiler.
>
> Yep, much better. Thanks!
>
> ATB,
> Ramsay Jones

Indeed.
Thanks, both.


* [PATCH v5 02/10] xdiff: use ptrdiff_t for dstart/dend
  2025-11-18 22:34       ` [PATCH v5 00/10] Xdiff cleanup part2 Ezekiel Newren via GitGitGadget
  2025-11-18 22:34         ` [PATCH v5 01/10] doc: define unambiguous type mappings across C and Rust Ezekiel Newren via GitGitGadget
@ 2025-11-18 22:34         ` Ezekiel Newren via GitGitGadget
  2025-11-18 22:34         ` [PATCH v5 03/10] xdiff: make xrecord_t.ptr a uint8_t instead of char Ezekiel Newren via GitGitGadget
                           ` (8 subsequent siblings)
  10 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-18 22:34 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ramsay Jones, Ben Knoble, Ezekiel Newren,
	Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

ptrdiff_t is appropriate for dstart and dend because they both describe
positive or negative offsets relative to a pointer.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xtypes.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index f145abba3e..7a2d429ec5 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -47,7 +47,7 @@ typedef struct s_xrecord {
 typedef struct s_xdfile {
 	xrecord_t *recs;
 	long nrec;
-	long dstart, dend;
+	ptrdiff_t dstart, dend;
 	bool *changed;
 	long *rindex;
 	long nreff;
-- 
gitgitgadget
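
Editorial sketch (not part of the patch): a signed offset type matters when
a window boundary can fall before the first element. The example below
assumes, for illustration only, that dstart/dend-style fields delimit what
is left after trimming a common head and tail; the names trim_window and
trim_ends are hypothetical and the code is not taken from xdiff.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type; the names are not taken from xdiff. */
struct trim_window {
	ptrdiff_t dstart;	/* first index left after trimming the head */
	ptrdiff_t dend1;	/* last index left in a; empty if < dstart */
	ptrdiff_t dend2;	/* last index left in b; empty if < dstart */
};

static struct trim_window trim_ends(const int *a, size_t na,
				    const int *b, size_t nb)
{
	struct trim_window w;
	size_t lim = na < nb ? na : nb;
	size_t head = 0, tail = 0;

	assert((a || !na) && (b || !nb));

	/* Skip the common prefix. */
	while (head < lim && a[head] == b[head])
		head++;
	/* Skip the common suffix without overlapping the prefix. */
	lim -= head;
	while (tail < lim && a[na - 1 - tail] == b[nb - 1 - tail])
		tail++;

	w.dstart = (ptrdiff_t)head;
	w.dend1 = (ptrdiff_t)na - (ptrdiff_t)tail - 1;	/* may be -1 */
	w.dend2 = (ptrdiff_t)nb - (ptrdiff_t)tail - 1;	/* may be -1 */
	return w;
}
```

With empty inputs the end index is -1, which an unsigned type such as
size_t cannot represent without wrapping around.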



* [PATCH v5 03/10] xdiff: make xrecord_t.ptr a uint8_t instead of char
  2025-11-18 22:34       ` [PATCH v5 00/10] Xdiff cleanup part2 Ezekiel Newren via GitGitGadget
  2025-11-18 22:34         ` [PATCH v5 01/10] doc: define unambiguous type mappings across C and Rust Ezekiel Newren via GitGitGadget
  2025-11-18 22:34         ` [PATCH v5 02/10] xdiff: use ptrdiff_t for dstart/dend Ezekiel Newren via GitGitGadget
@ 2025-11-18 22:34         ` Ezekiel Newren via GitGitGadget
  2025-11-18 22:34         ` [PATCH v5 04/10] xdiff: use size_t for xrecord_t.size Ezekiel Newren via GitGitGadget
                           ` (7 subsequent siblings)
  10 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-18 22:34 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ramsay Jones, Ben Knoble, Ezekiel Newren,
	Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

Make xrecord_t.ptr uint8_t because it's referring to bytes in memory.

In order to avoid a refactor avalanche, many uses of this field were
cast to char* or similar.

Places where casting was unnecessary:
xemit.c:156
xmerge.c:124
xmerge.c:127
xmerge.c:164
xmerge.c:169
xmerge.c:172
xmerge.c:178

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xdiffi.c    |  8 ++++----
 xdiff/xemit.c     |  6 +++---
 xdiff/xmerge.c    | 14 +++++++-------
 xdiff/xpatience.c |  2 +-
 xdiff/xprepare.c  |  6 +++---
 xdiff/xtypes.h    |  2 +-
 xdiff/xutils.c    |  4 ++--
 7 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 6f3998ee54..95989b6af1 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -407,7 +407,7 @@ static int get_indent(xrecord_t *rec)
 	int ret = 0;
 
 	for (i = 0; i < rec->size; i++) {
-		char c = rec->ptr[i];
+		char c = (char) rec->ptr[i];
 
 		if (!XDL_ISSPACE(c))
 			return ret;
@@ -993,11 +993,11 @@ static void xdl_mark_ignorable_lines(xdchange_t *xscr, xdfenv_t *xe, long flags)
 
 		rec = &xe->xdf1.recs[xch->i1];
 		for (i = 0; i < xch->chg1 && ignore; i++)
-			ignore = xdl_blankline(rec[i].ptr, rec[i].size, flags);
+			ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
 
 		rec = &xe->xdf2.recs[xch->i2];
 		for (i = 0; i < xch->chg2 && ignore; i++)
-			ignore = xdl_blankline(rec[i].ptr, rec[i].size, flags);
+			ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
 
 		xch->ignore = ignore;
 	}
@@ -1008,7 +1008,7 @@ static int record_matches_regex(xrecord_t *rec, xpparam_t const *xpp) {
 	size_t i;
 
 	for (i = 0; i < xpp->ignore_regex_nr; i++)
-		if (!regexec_buf(xpp->ignore_regex[i], rec->ptr, rec->size, 1,
+		if (!regexec_buf(xpp->ignore_regex[i], (const char *)rec->ptr, rec->size, 1,
 				 &regmatch, 0))
 			return 1;
 
diff --git a/xdiff/xemit.c b/xdiff/xemit.c
index b2f1f30cd3..ead930088a 100644
--- a/xdiff/xemit.c
+++ b/xdiff/xemit.c
@@ -27,7 +27,7 @@ static int xdl_emit_record(xdfile_t *xdf, long ri, char const *pre, xdemitcb_t *
 {
 	xrecord_t *rec = &xdf->recs[ri];
 
-	if (xdl_emit_diffrec(rec->ptr, rec->size, pre, strlen(pre), ecb) < 0)
+	if (xdl_emit_diffrec((char const *)rec->ptr, rec->size, pre, strlen(pre), ecb) < 0)
 		return -1;
 
 	return 0;
@@ -113,8 +113,8 @@ static long match_func_rec(xdfile_t *xdf, xdemitconf_t const *xecfg, long ri,
 	xrecord_t *rec = &xdf->recs[ri];
 
 	if (!xecfg->find_func)
-		return def_ff(rec->ptr, rec->size, buf, sz);
-	return xecfg->find_func(rec->ptr, rec->size, buf, sz, xecfg->find_func_priv);
+		return def_ff((const char *)rec->ptr, rec->size, buf, sz);
+	return xecfg->find_func((const char *)rec->ptr, rec->size, buf, sz, xecfg->find_func_priv);
 }
 
 static int is_func_rec(xdfile_t *xdf, xdemitconf_t const *xecfg, long ri)
diff --git a/xdiff/xmerge.c b/xdiff/xmerge.c
index fd600cbb5d..75cb3e76a2 100644
--- a/xdiff/xmerge.c
+++ b/xdiff/xmerge.c
@@ -101,8 +101,8 @@ static int xdl_merge_cmp_lines(xdfenv_t *xe1, int i1, xdfenv_t *xe2, int i2,
 	xrecord_t *rec2 = xe2->xdf2.recs + i2;
 
 	for (i = 0; i < line_count; i++) {
-		int result = xdl_recmatch(rec1[i].ptr, rec1[i].size,
-			rec2[i].ptr, rec2[i].size, flags);
+		int result = xdl_recmatch((const char *)rec1[i].ptr, rec1[i].size,
+			(const char *)rec2[i].ptr, rec2[i].size, flags);
 		if (!result)
 			return -1;
 	}
@@ -324,8 +324,8 @@ static int xdl_fill_merge_buffer(xdfenv_t *xe1, const char *name1,
 
 static int recmatch(xrecord_t *rec1, xrecord_t *rec2, unsigned long flags)
 {
-	return xdl_recmatch(rec1->ptr, rec1->size,
-			    rec2->ptr, rec2->size, flags);
+	return xdl_recmatch((const char *)rec1->ptr, rec1->size,
+			    (const char *)rec2->ptr, rec2->size, flags);
 }
 
 /*
@@ -382,10 +382,10 @@ static int xdl_refine_conflicts(xdfenv_t *xe1, xdfenv_t *xe2, xdmerge_t *m,
 		 * we have a very simple mmfile structure.
 		 */
 		t1.ptr = (char *)xe1->xdf2.recs[m->i1].ptr;
-		t1.size = xe1->xdf2.recs[m->i1 + m->chg1 - 1].ptr
+		t1.size = (char *)xe1->xdf2.recs[m->i1 + m->chg1 - 1].ptr
 			+ xe1->xdf2.recs[m->i1 + m->chg1 - 1].size - t1.ptr;
 		t2.ptr = (char *)xe2->xdf2.recs[m->i2].ptr;
-		t2.size = xe2->xdf2.recs[m->i2 + m->chg2 - 1].ptr
+		t2.size = (char *)xe2->xdf2.recs[m->i2 + m->chg2 - 1].ptr
 			+ xe2->xdf2.recs[m->i2 + m->chg2 - 1].size - t2.ptr;
 		if (xdl_do_diff(&t1, &t2, xpp, &xe) < 0)
 			return -1;
@@ -440,7 +440,7 @@ static int line_contains_alnum(const char *ptr, long size)
 static int lines_contain_alnum(xdfenv_t *xe, int i, int chg)
 {
 	for (; chg; chg--, i++)
-		if (line_contains_alnum(xe->xdf2.recs[i].ptr,
+		if (line_contains_alnum((const char *)xe->xdf2.recs[i].ptr,
 				xe->xdf2.recs[i].size))
 			return 1;
 	return 0;
diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
index 669b653580..bb61354f22 100644
--- a/xdiff/xpatience.c
+++ b/xdiff/xpatience.c
@@ -121,7 +121,7 @@ static void insert_record(xpparam_t const *xpp, int line, struct hashmap *map,
 		return;
 	map->entries[index].line1 = line;
 	map->entries[index].hash = record->ha;
-	map->entries[index].anchor = is_anchor(xpp, map->env->xdf1.recs[line - 1].ptr);
+	map->entries[index].anchor = is_anchor(xpp, (const char *)map->env->xdf1.recs[line - 1].ptr);
 	if (!map->first)
 		map->first = map->entries + index;
 	if (map->last) {
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 192334f1b7..4c56467076 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -99,8 +99,8 @@ static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
 	hi = (long) XDL_HASHLONG(rec->ha, cf->hbits);
 	for (rcrec = cf->rchash[hi]; rcrec; rcrec = rcrec->next)
 		if (rcrec->rec.ha == rec->ha &&
-				xdl_recmatch(rcrec->rec.ptr, rcrec->rec.size,
-					rec->ptr, rec->size, cf->flags))
+				xdl_recmatch((const char *)rcrec->rec.ptr, rcrec->rec.size,
+					(const char *)rec->ptr, rec->size, cf->flags))
 			break;
 
 	if (!rcrec) {
@@ -156,7 +156,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 			if (XDL_ALLOC_GROW(xdf->recs, xdf->nrec + 1, narec))
 				goto abort;
 			crec = &xdf->recs[xdf->nrec++];
-			crec->ptr = prev;
+			crec->ptr = (uint8_t const *)prev;
 			crec->size = (long) (cur - prev);
 			crec->ha = hav;
 			if (xdl_classify_record(pass, cf, crec) < 0)
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index 7a2d429ec5..69727fb299 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -39,7 +39,7 @@ typedef struct s_chastore {
 } chastore_t;
 
 typedef struct s_xrecord {
-	char const *ptr;
+	uint8_t const *ptr;
 	long size;
 	unsigned long ha;
 } xrecord_t;
diff --git a/xdiff/xutils.c b/xdiff/xutils.c
index 447e66c719..7be063bfb6 100644
--- a/xdiff/xutils.c
+++ b/xdiff/xutils.c
@@ -465,10 +465,10 @@ int xdl_fall_back_diff(xdfenv_t *diff_env, xpparam_t const *xpp,
 	xdfenv_t env;
 
 	subfile1.ptr = (char *)diff_env->xdf1.recs[line1 - 1].ptr;
-	subfile1.size = diff_env->xdf1.recs[line1 + count1 - 2].ptr +
+	subfile1.size = (char *)diff_env->xdf1.recs[line1 + count1 - 2].ptr +
 		diff_env->xdf1.recs[line1 + count1 - 2].size - subfile1.ptr;
 	subfile2.ptr = (char *)diff_env->xdf2.recs[line2 - 1].ptr;
-	subfile2.size = diff_env->xdf2.recs[line2 + count2 - 2].ptr +
+	subfile2.size = (char *)diff_env->xdf2.recs[line2 + count2 - 2].ptr +
 		diff_env->xdf2.recs[line2 + count2 - 2].size - subfile2.ptr;
 	if (xdl_do_diff(&subfile1, &subfile2, xpp, &env) < 0)
 		return -1;
-- 
gitgitgadget
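
Editorial sketch (not part of the patch): the pattern used above is to keep
the record pointer as uint8_t and cast only where a char-based helper is
called. The struct and function names below are hypothetical stand-ins, not
the xdiff definitions.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a record whose bytes are uint8_t. */
struct record {
	uint8_t const *ptr;
	long size;
};

/* Hypothetical legacy helper that still takes char pointers. */
static int legacy_indent(const char *line, long size)
{
	int ret = 0;

	for (long i = 0; i < size; i++) {
		/* isspace() needs a value representable as unsigned char */
		if (!isspace((unsigned char)line[i]))
			break;
		ret++;
	}
	return ret;
}

/* Cast only at the boundary to the char-based helper. */
static int record_indent(const struct record *rec)
{
	assert(rec && rec->ptr);
	return legacy_indent((const char *)rec->ptr, rec->size);
}

/* With uint8_t the comparison never depends on char signedness. */
static int record_has_high_byte(const struct record *rec)
{
	for (long i = 0; i < rec->size; i++)
		if (rec->ptr[i] >= 0x80)
			return 1;
	return 0;
}
```

The comparison against 0x80 would always be false on a platform where
plain char is signed, which is one reason to treat the data as bytes.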



* [PATCH v5 04/10] xdiff: use size_t for xrecord_t.size
  2025-11-18 22:34       ` [PATCH v5 00/10] Xdiff cleanup part2 Ezekiel Newren via GitGitGadget
                           ` (2 preceding siblings ...)
  2025-11-18 22:34         ` [PATCH v5 03/10] xdiff: make xrecord_t.ptr a uint8_t instead of char Ezekiel Newren via GitGitGadget
@ 2025-11-18 22:34         ` Ezekiel Newren via GitGitGadget
  2025-11-18 22:34         ` [PATCH v5 05/10] xdiff: use unambiguous types in xdl_hash_record() Ezekiel Newren via GitGitGadget
                           ` (6 subsequent siblings)
  10 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-18 22:34 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ramsay Jones, Ben Knoble, Ezekiel Newren,
	Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

size_t is the appropriate type because size is describing the number of
elements, bytes in this case, in memory.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xdiffi.c   |  7 +++----
 xdiff/xemit.c    |  8 ++++----
 xdiff/xmerge.c   | 16 ++++++++--------
 xdiff/xprepare.c |  6 +++---
 xdiff/xtypes.h   |  2 +-
 5 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 95989b6af1..cb8e412c7b 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -403,10 +403,9 @@ static int recs_match(xrecord_t *rec1, xrecord_t *rec2)
  */
 static int get_indent(xrecord_t *rec)
 {
-	long i;
 	int ret = 0;
 
-	for (i = 0; i < rec->size; i++) {
+	for (size_t i = 0; i < rec->size; i++) {
 		char c = (char) rec->ptr[i];
 
 		if (!XDL_ISSPACE(c))
@@ -993,11 +992,11 @@ static void xdl_mark_ignorable_lines(xdchange_t *xscr, xdfenv_t *xe, long flags)
 
 		rec = &xe->xdf1.recs[xch->i1];
 		for (i = 0; i < xch->chg1 && ignore; i++)
-			ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
+			ignore = xdl_blankline((const char *)rec[i].ptr, (long)rec[i].size, flags);
 
 		rec = &xe->xdf2.recs[xch->i2];
 		for (i = 0; i < xch->chg2 && ignore; i++)
-			ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
+			ignore = xdl_blankline((const char *)rec[i].ptr, (long)rec[i].size, flags);
 
 		xch->ignore = ignore;
 	}
diff --git a/xdiff/xemit.c b/xdiff/xemit.c
index ead930088a..2f8007753c 100644
--- a/xdiff/xemit.c
+++ b/xdiff/xemit.c
@@ -27,7 +27,7 @@ static int xdl_emit_record(xdfile_t *xdf, long ri, char const *pre, xdemitcb_t *
 {
 	xrecord_t *rec = &xdf->recs[ri];
 
-	if (xdl_emit_diffrec((char const *)rec->ptr, rec->size, pre, strlen(pre), ecb) < 0)
+	if (xdl_emit_diffrec((char const *)rec->ptr, (long)rec->size, pre, strlen(pre), ecb) < 0)
 		return -1;
 
 	return 0;
@@ -113,8 +113,8 @@ static long match_func_rec(xdfile_t *xdf, xdemitconf_t const *xecfg, long ri,
 	xrecord_t *rec = &xdf->recs[ri];
 
 	if (!xecfg->find_func)
-		return def_ff((const char *)rec->ptr, rec->size, buf, sz);
-	return xecfg->find_func((const char *)rec->ptr, rec->size, buf, sz, xecfg->find_func_priv);
+		return def_ff((const char *)rec->ptr, (long)rec->size, buf, sz);
+	return xecfg->find_func((const char *)rec->ptr, (long)rec->size, buf, sz, xecfg->find_func_priv);
 }
 
 static int is_func_rec(xdfile_t *xdf, xdemitconf_t const *xecfg, long ri)
@@ -151,7 +151,7 @@ static long get_func_line(xdfenv_t *xe, xdemitconf_t const *xecfg,
 static int is_empty_rec(xdfile_t *xdf, long ri)
 {
 	xrecord_t *rec = &xdf->recs[ri];
-	long i = 0;
+	size_t i = 0;
 
 	for (; i < rec->size && XDL_ISSPACE(rec->ptr[i]); i++);
 
diff --git a/xdiff/xmerge.c b/xdiff/xmerge.c
index 75cb3e76a2..0dd4558a32 100644
--- a/xdiff/xmerge.c
+++ b/xdiff/xmerge.c
@@ -101,8 +101,8 @@ static int xdl_merge_cmp_lines(xdfenv_t *xe1, int i1, xdfenv_t *xe2, int i2,
 	xrecord_t *rec2 = xe2->xdf2.recs + i2;
 
 	for (i = 0; i < line_count; i++) {
-		int result = xdl_recmatch((const char *)rec1[i].ptr, rec1[i].size,
-			(const char *)rec2[i].ptr, rec2[i].size, flags);
+		int result = xdl_recmatch((const char *)rec1[i].ptr, (long)rec1[i].size,
+			(const char *)rec2[i].ptr, (long)rec2[i].size, flags);
 		if (!result)
 			return -1;
 	}
@@ -119,11 +119,11 @@ static int xdl_recs_copy_0(int use_orig, xdfenv_t *xe, int i, int count, int nee
 	if (count < 1)
 		return 0;
 
-	for (i = 0; i < count; size += recs[i++].size)
+	for (i = 0; i < count; size += (int)recs[i++].size)
 		if (dest)
 			memcpy(dest + size, recs[i].ptr, recs[i].size);
 	if (add_nl) {
-		i = recs[count - 1].size;
+		i = (int)recs[count - 1].size;
 		if (i == 0 || recs[count - 1].ptr[i - 1] != '\n') {
 			if (needs_cr) {
 				if (dest)
@@ -156,7 +156,7 @@ static int xdl_orig_copy(xdfenv_t *xe, int i, int count, int needs_cr, int add_n
  */
 static int is_eol_crlf(xdfile_t *file, int i)
 {
-	long size;
+	size_t size;
 
 	if (i < file->nrec - 1)
 		/* All lines before the last *must* end in LF */
@@ -324,8 +324,8 @@ static int xdl_fill_merge_buffer(xdfenv_t *xe1, const char *name1,
 
 static int recmatch(xrecord_t *rec1, xrecord_t *rec2, unsigned long flags)
 {
-	return xdl_recmatch((const char *)rec1->ptr, rec1->size,
-			    (const char *)rec2->ptr, rec2->size, flags);
+	return xdl_recmatch((const char *)rec1->ptr, (long)rec1->size,
+			    (const char *)rec2->ptr, (long)rec2->size, flags);
 }
 
 /*
@@ -441,7 +441,7 @@ static int lines_contain_alnum(xdfenv_t *xe, int i, int chg)
 {
 	for (; chg; chg--, i++)
 		if (line_contains_alnum((const char *)xe->xdf2.recs[i].ptr,
-				xe->xdf2.recs[i].size))
+				(long)xe->xdf2.recs[i].size))
 			return 1;
 	return 0;
 }
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 4c56467076..b3219aed3e 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -99,8 +99,8 @@ static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
 	hi = (long) XDL_HASHLONG(rec->ha, cf->hbits);
 	for (rcrec = cf->rchash[hi]; rcrec; rcrec = rcrec->next)
 		if (rcrec->rec.ha == rec->ha &&
-				xdl_recmatch((const char *)rcrec->rec.ptr, rcrec->rec.size,
-					(const char *)rec->ptr, rec->size, cf->flags))
+				xdl_recmatch((const char *)rcrec->rec.ptr, (long)rcrec->rec.size,
+					(const char *)rec->ptr, (long)rec->size, cf->flags))
 			break;
 
 	if (!rcrec) {
@@ -157,7 +157,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 				goto abort;
 			crec = &xdf->recs[xdf->nrec++];
 			crec->ptr = (uint8_t const *)prev;
-			crec->size = (long) (cur - prev);
+			crec->size = cur - prev;
 			crec->ha = hav;
 			if (xdl_classify_record(pass, cf, crec) < 0)
 				goto abort;
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index 69727fb299..354349b523 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -40,7 +40,7 @@ typedef struct s_chastore {
 
 typedef struct s_xrecord {
 	uint8_t const *ptr;
-	long size;
+	size_t size;
 	unsigned long ha;
 } xrecord_t;
 
-- 
gitgitgadget
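
Editorial sketch (not part of the patch): the patch above narrows size_t to
long with plain casts where callees still take long. A checked conversion
is one possible alternative; the helper size_to_long() below is
hypothetical and is not something the series adds.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: store n in *out and return 0 when it fits in a
 * long, otherwise leave *out untouched and return -1.
 */
static int size_to_long(size_t n, long *out)
{
	assert(out != NULL);
	/* Compare in the widest unsigned type to avoid truncation. */
	if ((uintmax_t)n > (uintmax_t)LONG_MAX)
		return -1;
	*out = (long)n;
	return 0;
}
```

A caller could use this where a size is handed to an interface that still
takes long, and treat the -1 return as an error instead of silently
wrapping.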



* [PATCH v5 05/10] xdiff: use unambiguous types in xdl_hash_record()
  2025-11-18 22:34       ` [PATCH v5 00/10] Xdiff cleanup part2 Ezekiel Newren via GitGitGadget
                           ` (3 preceding siblings ...)
  2025-11-18 22:34         ` [PATCH v5 04/10] xdiff: use size_t for xrecord_t.size Ezekiel Newren via GitGitGadget
@ 2025-11-18 22:34         ` Ezekiel Newren via GitGitGadget
  2025-11-18 22:34         ` [PATCH v5 06/10] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash Ezekiel Newren via GitGitGadget
                           ` (5 subsequent siblings)
  10 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-18 22:34 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ramsay Jones, Ben Knoble, Ezekiel Newren,
	Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

Convert the function signature and body to use unambiguous types. char
is changed to uint8_t because this function processes bytes in memory.
unsigned long is changed to uint64_t so that the hash output is
consistent across platforms. `flags` is changed from long to uint64_t to
ensure the high order bits are not dropped on platforms that treat long
as 32 bits.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff-interface.c |  2 +-
 xdiff/xprepare.c  |  6 +++---
 xdiff/xutils.c    | 28 ++++++++++++++--------------
 xdiff/xutils.h    |  6 +++---
 4 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/xdiff-interface.c b/xdiff-interface.c
index 4971f722b3..1a35556380 100644
--- a/xdiff-interface.c
+++ b/xdiff-interface.c
@@ -300,7 +300,7 @@ void xdiff_clear_find_func(xdemitconf_t *xecfg)
 
 unsigned long xdiff_hash_string(const char *s, size_t len, long flags)
 {
-	return xdl_hash_record(&s, s + len, flags);
+	return xdl_hash_record((uint8_t const**)&s, (uint8_t const*)s + len, flags);
 }
 
 int xdiff_compare_lines(const char *l1, long s1,
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index b3219aed3e..85e56021da 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -137,8 +137,8 @@ static void xdl_free_ctx(xdfile_t *xdf)
 static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_t const *xpp,
 			   xdlclassifier_t *cf, xdfile_t *xdf) {
 	long bsize;
-	unsigned long hav;
-	char const *blk, *cur, *top, *prev;
+	uint64_t hav;
+	uint8_t const *blk, *cur, *top, *prev;
 	xrecord_t *crec;
 
 	xdf->rindex = NULL;
@@ -156,7 +156,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 			if (XDL_ALLOC_GROW(xdf->recs, xdf->nrec + 1, narec))
 				goto abort;
 			crec = &xdf->recs[xdf->nrec++];
-			crec->ptr = (uint8_t const *)prev;
+			crec->ptr = prev;
 			crec->size = cur - prev;
 			crec->ha = hav;
 			if (xdl_classify_record(pass, cf, crec) < 0)
diff --git a/xdiff/xutils.c b/xdiff/xutils.c
index 7be063bfb6..77ee1ad9c8 100644
--- a/xdiff/xutils.c
+++ b/xdiff/xutils.c
@@ -249,11 +249,11 @@ int xdl_recmatch(const char *l1, long s1, const char *l2, long s2, long flags)
 	return 1;
 }
 
-unsigned long xdl_hash_record_with_whitespace(char const **data,
-		char const *top, long flags) {
-	unsigned long ha = 5381;
-	char const *ptr = *data;
-	int cr_at_eol_only = (flags & XDF_WHITESPACE_FLAGS) == XDF_IGNORE_CR_AT_EOL;
+uint64_t xdl_hash_record_with_whitespace(uint8_t const **data,
+		uint8_t const *top, uint64_t flags) {
+	uint64_t ha = 5381;
+	uint8_t const *ptr = *data;
+	bool cr_at_eol_only = (flags & XDF_WHITESPACE_FLAGS) == XDF_IGNORE_CR_AT_EOL;
 
 	for (; ptr < top && *ptr != '\n'; ptr++) {
 		if (cr_at_eol_only) {
@@ -263,8 +263,8 @@ unsigned long xdl_hash_record_with_whitespace(char const **data,
 				continue;
 		}
 		else if (XDL_ISSPACE(*ptr)) {
-			const char *ptr2 = ptr;
-			int at_eol;
+			const uint8_t *ptr2 = ptr;
+			bool at_eol;
 			while (ptr + 1 < top && XDL_ISSPACE(ptr[1])
 					&& ptr[1] != '\n')
 				ptr++;
@@ -274,20 +274,20 @@ unsigned long xdl_hash_record_with_whitespace(char const **data,
 			else if (flags & XDF_IGNORE_WHITESPACE_CHANGE
 				 && !at_eol) {
 				ha += (ha << 5);
-				ha ^= (unsigned long) ' ';
+				ha ^= (uint64_t) ' ';
 			}
 			else if (flags & XDF_IGNORE_WHITESPACE_AT_EOL
 				 && !at_eol) {
 				while (ptr2 != ptr + 1) {
 					ha += (ha << 5);
-					ha ^= (unsigned long) *ptr2;
+					ha ^= (uint64_t) *ptr2;
 					ptr2++;
 				}
 			}
 			continue;
 		}
 		ha += (ha << 5);
-		ha ^= (unsigned long) *ptr;
+		ha ^= (uint64_t) *ptr;
 	}
 	*data = ptr < top ? ptr + 1: ptr;
 
@@ -304,9 +304,9 @@ unsigned long xdl_hash_record_with_whitespace(char const **data,
 #define REASSOC_FENCE(x, y)
 #endif
 
-unsigned long xdl_hash_record_verbatim(char const **data, char const *top) {
-	unsigned long ha = 5381, c0, c1;
-	char const *ptr = *data;
+uint64_t xdl_hash_record_verbatim(uint8_t const **data, uint8_t const *top) {
+	uint64_t ha = 5381, c0, c1;
+	uint8_t const *ptr = *data;
 #if 0
 	/*
 	 * The baseline form of the optimized loop below. This is the djb2
@@ -314,7 +314,7 @@ unsigned long xdl_hash_record_verbatim(char const **data, char const *top) {
 	 */
 	for (; ptr < top && *ptr != '\n'; ptr++) {
 		ha += (ha << 5);
-		ha += (unsigned long) *ptr;
+		ha += (uint64_t) *ptr;
 	}
 	*data = ptr < top ? ptr + 1: ptr;
 #else
diff --git a/xdiff/xutils.h b/xdiff/xutils.h
index 13f6831047..615b4a9d35 100644
--- a/xdiff/xutils.h
+++ b/xdiff/xutils.h
@@ -34,9 +34,9 @@ void *xdl_cha_alloc(chastore_t *cha);
 long xdl_guess_lines(mmfile_t *mf, long sample);
 int xdl_blankline(const char *line, long size, long flags);
 int xdl_recmatch(const char *l1, long s1, const char *l2, long s2, long flags);
-unsigned long xdl_hash_record_verbatim(char const **data, char const *top);
-unsigned long xdl_hash_record_with_whitespace(char const **data, char const *top, long flags);
-static inline unsigned long xdl_hash_record(char const **data, char const *top, long flags)
+uint64_t xdl_hash_record_verbatim(uint8_t const **data, uint8_t const *top);
+uint64_t xdl_hash_record_with_whitespace(uint8_t const **data, uint8_t const *top, uint64_t flags);
+static inline uint64_t xdl_hash_record(uint8_t const **data, uint8_t const *top, uint64_t flags)
 {
 	if (flags & XDF_WHITESPACE_FLAGS)
 		return xdl_hash_record_with_whitespace(data, top, flags);
-- 
gitgitgadget


* [PATCH v5 06/10] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash
  2025-11-18 22:34       ` [PATCH v5 00/10] Xdiff cleanup part2 Ezekiel Newren via GitGitGadget
                           ` (4 preceding siblings ...)
  2025-11-18 22:34         ` [PATCH v5 05/10] xdiff: use unambiguous types in xdl_hash_record() Ezekiel Newren via GitGitGadget
@ 2025-11-18 22:34         ` Ezekiel Newren via GitGitGadget
  2025-11-18 22:34         ` [PATCH v5 07/10] xdiff: make xdfile_t.nrec a size_t instead of long Ezekiel Newren via GitGitGadget
                           ` (4 subsequent siblings)
  10 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-18 22:34 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ramsay Jones, Ben Knoble, Ezekiel Newren,
	Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

The ha field is serving two different purposes, which makes the code
harder to read. At first glance, it looks like many places assume
there could never be hash collisions between lines of the two input
files. In reality, line_hash is used together with xdl_recmatch() to
ensure correct comparisons of lines, even when collisions occur.

To make this clearer, the old ha field has been split:
  * line_hash: a straightforward hash of a line, independent of any
    external context. Its type is uint64_t, as it comes from a fixed
    width hash function.
  * minimal_perfect_hash: Not a new concept, but now a separate
    field. It comes from the classifier's general-purpose hash table,
    which assigns each line a unique and minimal hash across the two
    files. A size_t is used here because it's meant to be used to
    index an array. This also avoids ` as usize` casts on the Rust
    side when using it to index a slice.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
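Illustration only, not part of the patch: a simplified sketch of how a
classifier can turn a possibly colliding line_hash into a dense,
collision-free index. All sketch_* names and the fixed table sizes are
hypothetical, and a plain memcmp() stands in for xdl_recmatch(), so
whitespace flags are ignored.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BUCKETS 16
#define SKETCH_MAX_CLASSES 64

struct sketch_record {
	uint8_t const *ptr;
	size_t size;
	uint64_t line_hash;
	size_t minimal_perfect_hash;
};

struct sketch_class {
	struct sketch_record rec;
	int next; /* next class in the same bucket, -1 ends the chain */
};

struct sketch_classifier {
	int buckets[SKETCH_BUCKETS];
	struct sketch_class classes[SKETCH_MAX_CLASSES];
	size_t count;
};

static void sketch_classifier_init(struct sketch_classifier *cf)
{
	for (size_t i = 0; i < SKETCH_BUCKETS; i++)
		cf->buckets[i] = -1;
	cf->count = 0;
}

/* Returns 0 on success, -1 if the fixed-size table is full. */
static int sketch_classify(struct sketch_classifier *cf,
			   struct sketch_record *rec)
{
	size_t hi = (size_t)(rec->line_hash % SKETCH_BUCKETS);
	int idx;

	/* line_hash narrows the search; the byte compare settles it. */
	for (idx = cf->buckets[hi]; idx >= 0; idx = cf->classes[idx].next) {
		struct sketch_record const *seen = &cf->classes[idx].rec;
		if (seen->line_hash == rec->line_hash &&
		    seen->size == rec->size &&
		    !memcmp(seen->ptr, rec->ptr, rec->size))
			break;
	}
	if (idx < 0) {
		if (cf->count == SKETCH_MAX_CLASSES)
			return -1;
		idx = (int)cf->count++;
		cf->classes[idx].rec = *rec;
		cf->classes[idx].next = cf->buckets[hi];
		cf->buckets[hi] = idx;
	}
	/* Dense id: equal ids imply equal lines, so it can index arrays. */
	rec->minimal_perfect_hash = (size_t)idx;

	return 0;
}
```

Two different lines that happen to share a line_hash end up in the same
bucket chain but still receive distinct minimal_perfect_hash values,
which is why later code can compare that field alone.
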
 xdiff/xdiffi.c     |  6 +++---
 xdiff/xhistogram.c |  4 ++--
 xdiff/xpatience.c  | 10 +++++-----
 xdiff/xprepare.c   | 18 +++++++++---------
 xdiff/xtypes.h     |  3 ++-
 5 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index cb8e412c7b..8d96074414 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -22,9 +22,9 @@
 
 #include "xinclude.h"
 
-static unsigned long get_hash(xdfile_t *xdf, long index)
+static size_t get_hash(xdfile_t *xdf, long index)
 {
-	return xdf->recs[xdf->rindex[index]].ha;
+	return xdf->recs[xdf->rindex[index]].minimal_perfect_hash;
 }
 
 #define XDL_MAX_COST_MIN 256
@@ -385,7 +385,7 @@ static xdchange_t *xdl_add_change(xdchange_t *xscr, long i1, long i2, long chg1,
 
 static int recs_match(xrecord_t *rec1, xrecord_t *rec2)
 {
-	return (rec1->ha == rec2->ha);
+	return rec1->minimal_perfect_hash == rec2->minimal_perfect_hash;
 }
 
 /*
diff --git a/xdiff/xhistogram.c b/xdiff/xhistogram.c
index 6dc450b1fe..5ae1282c27 100644
--- a/xdiff/xhistogram.c
+++ b/xdiff/xhistogram.c
@@ -90,7 +90,7 @@ struct region {
 
 static int cmp_recs(xrecord_t *r1, xrecord_t *r2)
 {
-	return r1->ha == r2->ha;
+	return r1->minimal_perfect_hash == r2->minimal_perfect_hash;
 
 }
 
@@ -98,7 +98,7 @@ static int cmp_recs(xrecord_t *r1, xrecord_t *r2)
 	(cmp_recs(REC(i->env, s1, l1), REC(i->env, s2, l2)))
 
 #define TABLE_HASH(index, side, line) \
-	XDL_HASHLONG((REC(index->env, side, line))->ha, index->table_bits)
+	XDL_HASHLONG((REC(index->env, side, line))->minimal_perfect_hash, index->table_bits)
 
 static int scanA(struct histindex *index, int line1, int count1)
 {
diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
index bb61354f22..cc53266f3b 100644
--- a/xdiff/xpatience.c
+++ b/xdiff/xpatience.c
@@ -48,7 +48,7 @@
 struct hashmap {
 	int nr, alloc;
 	struct entry {
-		unsigned long hash;
+		size_t minimal_perfect_hash;
 		/*
 		 * 0 = unused entry, 1 = first line, 2 = second, etc.
 		 * line2 is NON_UNIQUE if the line is not unique
@@ -101,10 +101,10 @@ static void insert_record(xpparam_t const *xpp, int line, struct hashmap *map,
 	 * So we multiply ha by 2 in the hope that the hashing was
 	 * "unique enough".
 	 */
-	int index = (int)((record->ha << 1) % map->alloc);
+	int index = (int)((record->minimal_perfect_hash << 1) % map->alloc);
 
 	while (map->entries[index].line1) {
-		if (map->entries[index].hash != record->ha) {
+		if (map->entries[index].minimal_perfect_hash != record->minimal_perfect_hash) {
 			if (++index >= map->alloc)
 				index = 0;
 			continue;
@@ -120,7 +120,7 @@ static void insert_record(xpparam_t const *xpp, int line, struct hashmap *map,
 	if (pass == 2)
 		return;
 	map->entries[index].line1 = line;
-	map->entries[index].hash = record->ha;
+	map->entries[index].minimal_perfect_hash = record->minimal_perfect_hash;
 	map->entries[index].anchor = is_anchor(xpp, (const char *)map->env->xdf1.recs[line - 1].ptr);
 	if (!map->first)
 		map->first = map->entries + index;
@@ -248,7 +248,7 @@ static int match(struct hashmap *map, int line1, int line2)
 {
 	xrecord_t *record1 = &map->env->xdf1.recs[line1 - 1];
 	xrecord_t *record2 = &map->env->xdf2.recs[line2 - 1];
-	return record1->ha == record2->ha;
+	return record1->minimal_perfect_hash == record2->minimal_perfect_hash;
 }
 
 static int patience_diff(xpparam_t const *xpp, xdfenv_t *env,
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 85e56021da..bea0992b5e 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -93,12 +93,12 @@ static void xdl_free_classifier(xdlclassifier_t *cf) {
 
 
 static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t *rec) {
-	long hi;
+	size_t hi;
 	xdlclass_t *rcrec;
 
-	hi = (long) XDL_HASHLONG(rec->ha, cf->hbits);
+	hi = XDL_HASHLONG(rec->line_hash, cf->hbits);
 	for (rcrec = cf->rchash[hi]; rcrec; rcrec = rcrec->next)
-		if (rcrec->rec.ha == rec->ha &&
+		if (rcrec->rec.line_hash == rec->line_hash &&
 				xdl_recmatch((const char *)rcrec->rec.ptr, (long)rcrec->rec.size,
 					(const char *)rec->ptr, (long)rec->size, cf->flags))
 			break;
@@ -120,7 +120,7 @@ static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
 
 	(pass == 1) ? rcrec->len1++ : rcrec->len2++;
 
-	rec->ha = (unsigned long) rcrec->idx;
+	rec->minimal_perfect_hash = (size_t)rcrec->idx;
 
 	return 0;
 }
@@ -158,7 +158,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 			crec = &xdf->recs[xdf->nrec++];
 			crec->ptr = prev;
 			crec->size = cur - prev;
-			crec->ha = hav;
+			crec->line_hash = hav;
 			if (xdl_classify_record(pass, cf, crec) < 0)
 				goto abort;
 		}
@@ -290,7 +290,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	if ((mlim = xdl_bogosqrt(xdf1->nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
 	for (i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart]; i <= xdf1->dend; i++, recs++) {
-		rcrec = cf->rcrecs[recs->ha];
+		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
 		nm = rcrec ? rcrec->len2 : 0;
 		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
 	}
@@ -298,7 +298,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	if ((mlim = xdl_bogosqrt(xdf2->nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
 	for (i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart]; i <= xdf2->dend; i++, recs++) {
-		rcrec = cf->rcrecs[recs->ha];
+		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
 		nm = rcrec ? rcrec->len1 : 0;
 		action2[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
 	}
@@ -350,7 +350,7 @@ static int xdl_trim_ends(xdfile_t *xdf1, xdfile_t *xdf2) {
 	recs2 = xdf2->recs;
 	for (i = 0, lim = XDL_MIN(xdf1->nrec, xdf2->nrec); i < lim;
 	     i++, recs1++, recs2++)
-		if (recs1->ha != recs2->ha)
+		if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
 			break;
 
 	xdf1->dstart = xdf2->dstart = i;
@@ -358,7 +358,7 @@ static int xdl_trim_ends(xdfile_t *xdf1, xdfile_t *xdf2) {
 	recs1 = xdf1->recs + xdf1->nrec - 1;
 	recs2 = xdf2->recs + xdf2->nrec - 1;
 	for (lim -= i, i = 0; i < lim; i++, recs1--, recs2--)
-		if (recs1->ha != recs2->ha)
+		if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
 			break;
 
 	xdf1->dend = xdf1->nrec - i - 1;
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index 354349b523..d4e9cd2e76 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -41,7 +41,8 @@ typedef struct s_chastore {
 typedef struct s_xrecord {
 	uint8_t const *ptr;
 	size_t size;
-	unsigned long ha;
+	uint64_t line_hash;
+	size_t minimal_perfect_hash;
 } xrecord_t;
 
 typedef struct s_xdfile {
-- 
gitgitgadget



* [PATCH v5 07/10] xdiff: make xdfile_t.nrec a size_t instead of long
  2025-11-18 22:34       ` [PATCH v5 00/10] Xdiff cleanup part2 Ezekiel Newren via GitGitGadget
                           ` (5 preceding siblings ...)
  2025-11-18 22:34         ` [PATCH v5 06/10] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash Ezekiel Newren via GitGitGadget
@ 2025-11-18 22:34         ` Ezekiel Newren via GitGitGadget
  2025-11-18 22:34         ` [PATCH v5 08/10] xdiff: make xdfile_t.nreff " Ezekiel Newren via GitGitGadget
                           ` (3 subsequent siblings)
  10 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-18 22:34 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ramsay Jones, Ben Knoble, Ezekiel Newren,
	Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

size_t is used because nrec describes the number of elements in recs
and, with 2 added, the number of elements in 'changed'.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
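Illustration only, not part of the patch: a small sketch of why the
comparisons below cast nrec to long rather than comparing a long with a
size_t directly. The sketch_* names are hypothetical. It assumes nrec
fits in a long, and that size_t is at least as wide as long so that a
mixed comparison would convert the signed operand to unsigned; the
second helper spells that conversion out explicitly.

```c
#include <stdbool.h>
#include <stddef.h>

/* The form used in the patch: compare in the signed domain. */
static bool sketch_past_end_cast(long split, size_t nrec)
{
	return split >= (long)nrec;
}

/*
 * What a mixed long/size_t comparison typically amounts to: the signed
 * value is converted to unsigned, so a negative split becomes huge.
 */
static bool sketch_past_end_uncast(long split, size_t nrec)
{
	return (size_t)split >= nrec;
}
```

With split == -1 and nrec == 3, the first helper reports that split is
not past the end, while the second reports that it is.
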
 xdiff/xdiffi.c    |  8 ++++----
 xdiff/xemit.c     | 20 ++++++++++----------
 xdiff/xmerge.c    |  8 ++++----
 xdiff/xpatience.c |  2 +-
 xdiff/xprepare.c  | 12 ++++++------
 xdiff/xtypes.h    |  2 +-
 6 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 8d96074414..21d06bce96 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -483,7 +483,7 @@ static void measure_split(const xdfile_t *xdf, long split,
 {
 	long i;
 
-	if (split >= xdf->nrec) {
+	if (split >= (long)xdf->nrec) {
 		m->end_of_file = 1;
 		m->indent = -1;
 	} else {
@@ -506,7 +506,7 @@ static void measure_split(const xdfile_t *xdf, long split,
 
 	m->post_blank = 0;
 	m->post_indent = -1;
-	for (i = split + 1; i < xdf->nrec; i++) {
+	for (i = split + 1; i < (long)xdf->nrec; i++) {
 		m->post_indent = get_indent(&xdf->recs[i]);
 		if (m->post_indent != -1)
 			break;
@@ -717,7 +717,7 @@ static void group_init(xdfile_t *xdf, struct xdlgroup *g)
  */
 static inline int group_next(xdfile_t *xdf, struct xdlgroup *g)
 {
-	if (g->end == xdf->nrec)
+	if (g->end == (long)xdf->nrec)
 		return -1;
 
 	g->start = g->end + 1;
@@ -750,7 +750,7 @@ static inline int group_previous(xdfile_t *xdf, struct xdlgroup *g)
  */
 static int group_slide_down(xdfile_t *xdf, struct xdlgroup *g)
 {
-	if (g->end < xdf->nrec &&
+	if (g->end < (long)xdf->nrec &&
 	    recs_match(&xdf->recs[g->start], &xdf->recs[g->end])) {
 		xdf->changed[g->start++] = false;
 		xdf->changed[g->end++] = true;
diff --git a/xdiff/xemit.c b/xdiff/xemit.c
index 2f8007753c..04f7e9193b 100644
--- a/xdiff/xemit.c
+++ b/xdiff/xemit.c
@@ -137,7 +137,7 @@ static long get_func_line(xdfenv_t *xe, xdemitconf_t const *xecfg,
 	buf = func_line ? func_line->buf : dummy;
 	size = func_line ? sizeof(func_line->buf) : sizeof(dummy);
 
-	for (l = start; l != limit && 0 <= l && l < xe->xdf1.nrec; l += step) {
+	for (l = start; l != limit && 0 <= l && l < (long)xe->xdf1.nrec; l += step) {
 		long len = match_func_rec(&xe->xdf1, xecfg, l, buf, size);
 		if (len >= 0) {
 			if (func_line)
@@ -179,14 +179,14 @@ pre_context_calculation:
 			long fs1, i1 = xch->i1;
 
 			/* Appended chunk? */
-			if (i1 >= xe->xdf1.nrec) {
+			if (i1 >= (long)xe->xdf1.nrec) {
 				long i2 = xch->i2;
 
 				/*
 				 * We don't need additional context if
 				 * a whole function was added.
 				 */
-				while (i2 < xe->xdf2.nrec) {
+				while (i2 < (long)xe->xdf2.nrec) {
 					if (is_func_rec(&xe->xdf2, xecfg, i2))
 						goto post_context_calculation;
 					i2++;
@@ -196,7 +196,7 @@ pre_context_calculation:
 				 * Otherwise get more context from the
 				 * pre-image.
 				 */
-				i1 = xe->xdf1.nrec - 1;
+				i1 = (long)xe->xdf1.nrec - 1;
 			}
 
 			fs1 = get_func_line(xe, xecfg, NULL, i1, -1);
@@ -228,8 +228,8 @@ pre_context_calculation:
 
  post_context_calculation:
 		lctx = xecfg->ctxlen;
-		lctx = XDL_MIN(lctx, xe->xdf1.nrec - (xche->i1 + xche->chg1));
-		lctx = XDL_MIN(lctx, xe->xdf2.nrec - (xche->i2 + xche->chg2));
+		lctx = XDL_MIN(lctx, (long)xe->xdf1.nrec - (xche->i1 + xche->chg1));
+		lctx = XDL_MIN(lctx, (long)xe->xdf2.nrec - (xche->i2 + xche->chg2));
 
 		e1 = xche->i1 + xche->chg1 + lctx;
 		e2 = xche->i2 + xche->chg2 + lctx;
@@ -237,13 +237,13 @@ pre_context_calculation:
 		if (xecfg->flags & XDL_EMIT_FUNCCONTEXT) {
 			long fe1 = get_func_line(xe, xecfg, NULL,
 						 xche->i1 + xche->chg1,
-						 xe->xdf1.nrec);
+						 (long)xe->xdf1.nrec);
 			while (fe1 > 0 && is_empty_rec(&xe->xdf1, fe1 - 1))
 				fe1--;
 			if (fe1 < 0)
-				fe1 = xe->xdf1.nrec;
+				fe1 = (long)xe->xdf1.nrec;
 			if (fe1 > e1) {
-				e2 = XDL_MIN(e2 + (fe1 - e1), xe->xdf2.nrec);
+				e2 = XDL_MIN(e2 + (fe1 - e1), (long)xe->xdf2.nrec);
 				e1 = fe1;
 			}
 
@@ -254,7 +254,7 @@ pre_context_calculation:
 			 */
 			if (xche->next) {
 				long l = XDL_MIN(xche->next->i1,
-						 xe->xdf1.nrec - 1);
+						 (long)xe->xdf1.nrec - 1);
 				if (l - xecfg->ctxlen <= e1 ||
 				    get_func_line(xe, xecfg, NULL, l, e1) < 0) {
 					xche = xche->next;
diff --git a/xdiff/xmerge.c b/xdiff/xmerge.c
index 0dd4558a32..29dad98c49 100644
--- a/xdiff/xmerge.c
+++ b/xdiff/xmerge.c
@@ -158,7 +158,7 @@ static int is_eol_crlf(xdfile_t *file, int i)
 {
 	size_t size;
 
-	if (i < file->nrec - 1)
+	if (i < (long)file->nrec - 1)
 		/* All lines before the last *must* end in LF */
 		return (size = file->recs[i].size) > 1 &&
 			file->recs[i].ptr[size - 2] == '\r';
@@ -317,7 +317,7 @@ static int xdl_fill_merge_buffer(xdfenv_t *xe1, const char *name1,
 			continue;
 		i = m->i1 + m->chg1;
 	}
-	size += xdl_recs_copy(xe1, i, xe1->xdf2.nrec - i, 0, 0,
+	size += xdl_recs_copy(xe1, i, (int)xe1->xdf2.nrec - i, 0, 0,
 			      dest ? dest + size : NULL);
 	return size;
 }
@@ -622,7 +622,7 @@ static int xdl_do_merge(xdfenv_t *xe1, xdchange_t *xscr1,
 			changes = c;
 		i0 = xscr1->i1;
 		i1 = xscr1->i2;
-		i2 = xscr1->i1 + xe2->xdf2.nrec - xe2->xdf1.nrec;
+		i2 = xscr1->i1 + (long)xe2->xdf2.nrec - (long)xe2->xdf1.nrec;
 		chg0 = xscr1->chg1;
 		chg1 = xscr1->chg2;
 		chg2 = xscr1->chg1;
@@ -637,7 +637,7 @@ static int xdl_do_merge(xdfenv_t *xe1, xdchange_t *xscr1,
 		if (!changes)
 			changes = c;
 		i0 = xscr2->i1;
-		i1 = xscr2->i1 + xe1->xdf2.nrec - xe1->xdf1.nrec;
+		i1 = xscr2->i1 + (long)xe1->xdf2.nrec - (long)xe1->xdf1.nrec;
 		i2 = xscr2->i2;
 		chg0 = xscr2->chg1;
 		chg1 = xscr2->chg1;
diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
index cc53266f3b..a0b31eb5d8 100644
--- a/xdiff/xpatience.c
+++ b/xdiff/xpatience.c
@@ -370,5 +370,5 @@ static int patience_diff(xpparam_t const *xpp, xdfenv_t *env,
 
 int xdl_do_patience_diff(xpparam_t const *xpp, xdfenv_t *env)
 {
-	return patience_diff(xpp, env, 1, env->xdf1.nrec, 1, env->xdf2.nrec);
+	return patience_diff(xpp, env, 1, (int)env->xdf1.nrec, 1, (int)env->xdf2.nrec);
 }
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index bea0992b5e..705ddd1ae0 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -153,7 +153,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 		for (top = blk + bsize; cur < top; ) {
 			prev = cur;
 			hav = xdl_hash_record(&cur, top, xpp->flags);
-			if (XDL_ALLOC_GROW(xdf->recs, xdf->nrec + 1, narec))
+			if (XDL_ALLOC_GROW(xdf->recs, (long)xdf->nrec + 1, narec))
 				goto abort;
 			crec = &xdf->recs[xdf->nrec++];
 			crec->ptr = prev;
@@ -287,7 +287,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	/*
 	 * Initialize temporary arrays with DISCARD, KEEP, or INVESTIGATE.
 	 */
-	if ((mlim = xdl_bogosqrt(xdf1->nrec)) > XDL_MAX_EQLIMIT)
+	if ((mlim = xdl_bogosqrt((long)xdf1->nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
 	for (i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart]; i <= xdf1->dend; i++, recs++) {
 		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
@@ -295,7 +295,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 		action1[i] = (nm == 0) ? DISCARD: (nm >= mlim && !need_min) ? INVESTIGATE: KEEP;
 	}
 
-	if ((mlim = xdl_bogosqrt(xdf2->nrec)) > XDL_MAX_EQLIMIT)
+	if ((mlim = xdl_bogosqrt((long)xdf2->nrec)) > XDL_MAX_EQLIMIT)
 		mlim = XDL_MAX_EQLIMIT;
 	for (i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart]; i <= xdf2->dend; i++, recs++) {
 		rcrec = cf->rcrecs[recs->minimal_perfect_hash];
@@ -348,7 +348,7 @@ static int xdl_trim_ends(xdfile_t *xdf1, xdfile_t *xdf2) {
 
 	recs1 = xdf1->recs;
 	recs2 = xdf2->recs;
-	for (i = 0, lim = XDL_MIN(xdf1->nrec, xdf2->nrec); i < lim;
+	for (i = 0, lim = (long)XDL_MIN(xdf1->nrec, xdf2->nrec); i < lim;
 	     i++, recs1++, recs2++)
 		if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
 			break;
@@ -361,8 +361,8 @@ static int xdl_trim_ends(xdfile_t *xdf1, xdfile_t *xdf2) {
 		if (recs1->minimal_perfect_hash != recs2->minimal_perfect_hash)
 			break;
 
-	xdf1->dend = xdf1->nrec - i - 1;
-	xdf2->dend = xdf2->nrec - i - 1;
+	xdf1->dend = (long)xdf1->nrec - i - 1;
+	xdf2->dend = (long)xdf2->nrec - i - 1;
 
 	return 0;
 }
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index d4e9cd2e76..4c4d9bd147 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -47,7 +47,7 @@ typedef struct s_xrecord {
 
 typedef struct s_xdfile {
 	xrecord_t *recs;
-	long nrec;
+	size_t nrec;
 	ptrdiff_t dstart, dend;
 	bool *changed;
 	long *rindex;
-- 
gitgitgadget



* [PATCH v5 08/10] xdiff: make xdfile_t.nreff a size_t instead of long
  2025-11-18 22:34       ` [PATCH v5 00/10] Xdiff cleanup part2 Ezekiel Newren via GitGitGadget
                           ` (6 preceding siblings ...)
  2025-11-18 22:34         ` [PATCH v5 07/10] xdiff: make xdfile_t.nrec a size_t instead of long Ezekiel Newren via GitGitGadget
@ 2025-11-18 22:34         ` Ezekiel Newren via GitGitGadget
  2025-11-18 22:34         ` [PATCH v5 09/10] xdiff: change rindex from long to size_t in xdfile_t Ezekiel Newren via GitGitGadget
                           ` (2 subsequent siblings)
  10 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-18 22:34 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ramsay Jones, Ben Knoble, Ezekiel Newren,
	Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

size_t is used because nreff describes the number of elements in memory
for rindex.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xprepare.c | 14 +++++++-------
 xdiff/xtypes.h   |  2 +-
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 705ddd1ae0..39fd79d9d4 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -264,7 +264,7 @@ static bool xdl_clean_mmatch(uint8_t const *action, long i, long s, long e) {
  * might be potentially discarded if they appear in a run of discardable.
  */
 static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xdf2) {
-	long i, nm, nreff, mlim;
+	long i, nm, mlim;
 	xrecord_t *recs;
 	xdlclass_t *rcrec;
 	uint8_t *action1 = NULL, *action2 = NULL;
@@ -307,29 +307,29 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	 * Use temporary arrays to decide if changed[i] should remain
 	 * false, or become true.
 	 */
-	for (nreff = 0, i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart];
+	xdf1->nreff = 0;
+	for (i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart];
 	     i <= xdf1->dend; i++, recs++) {
 		if (action1[i] == KEEP ||
 		    (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xdf1->dstart, xdf1->dend))) {
-			xdf1->rindex[nreff++] = i;
+			xdf1->rindex[xdf1->nreff++] = i;
 			/* changed[i] remains false, i.e. keep */
 		} else
 			xdf1->changed[i] = true;
 			/* i.e. discard */
 	}
-	xdf1->nreff = nreff;
 
-	for (nreff = 0, i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart];
+	xdf2->nreff = 0;
+	for (i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart];
 	     i <= xdf2->dend; i++, recs++) {
 		if (action2[i] == KEEP ||
 		    (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xdf2->dstart, xdf2->dend))) {
-			xdf2->rindex[nreff++] = i;
+			xdf2->rindex[xdf2->nreff++] = i;
 			/* changed[i] remains false, i.e. keep */
 		} else
 			xdf2->changed[i] = true;
 			/* i.e. discard */
 	}
-	xdf2->nreff = nreff;
 
 cleanup:
 	xdl_free(action1);
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index 4c4d9bd147..1f495f987f 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -51,7 +51,7 @@ typedef struct s_xdfile {
 	ptrdiff_t dstart, dend;
 	bool *changed;
 	long *rindex;
-	long nreff;
+	size_t nreff;
 } xdfile_t;
 
 typedef struct s_xdfenv {
-- 
gitgitgadget



* [PATCH v5 09/10] xdiff: change rindex from long to size_t in xdfile_t
  2025-11-18 22:34       ` [PATCH v5 00/10] Xdiff cleanup part2 Ezekiel Newren via GitGitGadget
                           ` (7 preceding siblings ...)
  2025-11-18 22:34         ` [PATCH v5 08/10] xdiff: make xdfile_t.nreff " Ezekiel Newren via GitGitGadget
@ 2025-11-18 22:34         ` Ezekiel Newren via GitGitGadget
  2025-11-18 22:34         ` [PATCH v5 10/10] xdiff: rename rindex -> reference_index Ezekiel Newren via GitGitGadget
  2025-11-18 23:11         ` [PATCH v5 00/10] Xdiff cleanup part2 Junio C Hamano
  10 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-18 22:34 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ramsay Jones, Ben Knoble, Ezekiel Newren,
	Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

The field rindex describes an index offset for other arrays. Change it
to size_t.

Changing the type of rindex from long to size_t has no cascading
refactor impact because it is only ever used to directly index other
arrays.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
 xdiff/xtypes.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index 1f495f987f..9074cdadd1 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -50,7 +50,7 @@ typedef struct s_xdfile {
 	size_t nrec;
 	ptrdiff_t dstart, dend;
 	bool *changed;
-	long *rindex;
+	size_t *rindex;
 	size_t nreff;
 } xdfile_t;
 
-- 
gitgitgadget



* [PATCH v5 10/10] xdiff: rename rindex -> reference_index
  2025-11-18 22:34       ` [PATCH v5 00/10] Xdiff cleanup part2 Ezekiel Newren via GitGitGadget
                           ` (8 preceding siblings ...)
  2025-11-18 22:34         ` [PATCH v5 09/10] xdiff: change rindex from long to size_t in xdfile_t Ezekiel Newren via GitGitGadget
@ 2025-11-18 22:34         ` Ezekiel Newren via GitGitGadget
  2025-11-18 23:11         ` [PATCH v5 00/10] Xdiff cleanup part2 Junio C Hamano
  10 siblings, 0 replies; 118+ messages in thread
From: Ezekiel Newren via GitGitGadget @ 2025-11-18 22:34 UTC (permalink / raw)
  To: git
  Cc: Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ramsay Jones, Ben Knoble, Ezekiel Newren,
	Ezekiel Newren

From: Ezekiel Newren <ezekielnewren@gmail.com>

The classic diff adds only the lines that it is going to consider
during the diff to an array. This array maps each entry of that
compacted sequence back to the line of the file that it references.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
---
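Illustration only, not part of the patch: a minimal sketch of the
mapping described above, loosely modelled on the loops in
xdl_cleanup_records(). The function name and the keep[] input are
hypothetical; the real code derives the keep/discard decision from the
action arrays.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: record the index of every kept line in
 * reference_index[] and mark every other line as changed. Returns the
 * number of kept lines (the role nreff plays in xdfile_t).
 */
static size_t sketch_build_reference_index(bool const *keep, size_t nrec,
					   size_t *reference_index,
					   bool *changed)
{
	size_t nreff = 0;

	for (size_t i = 0; i < nrec; i++) {
		if (keep[i])
			reference_index[nreff++] = i; /* compact slot -> line */
		else
			changed[i] = true;            /* i.e. discard */
	}

	return nreff;
}
```

After the call, reference_index[k] gives the original line number of the
k-th line that the diff will actually look at, so an expression such as
recs[reference_index[k]] reaches the corresponding record.
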
 xdiff/xdiffi.c   |  6 +++---
 xdiff/xprepare.c | 10 +++++-----
 xdiff/xtypes.h   |  2 +-
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 21d06bce96..4376f943db 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -24,7 +24,7 @@
 
 static size_t get_hash(xdfile_t *xdf, long index)
 {
-	return xdf->recs[xdf->rindex[index]].minimal_perfect_hash;
+	return xdf->recs[xdf->reference_index[index]].minimal_perfect_hash;
 }
 
 #define XDL_MAX_COST_MIN 256
@@ -278,10 +278,10 @@ int xdl_recs_cmp(xdfile_t *xdf1, long off1, long lim1,
 	 */
 	if (off1 == lim1) {
 		for (; off2 < lim2; off2++)
-			xdf2->changed[xdf2->rindex[off2]] = true;
+			xdf2->changed[xdf2->reference_index[off2]] = true;
 	} else if (off2 == lim2) {
 		for (; off1 < lim1; off1++)
-			xdf1->changed[xdf1->rindex[off1]] = true;
+			xdf1->changed[xdf1->reference_index[off1]] = true;
 	} else {
 		xdpsplit_t spl;
 		spl.i1 = spl.i2 = 0;
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 39fd79d9d4..34c82e4f8e 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -128,7 +128,7 @@ static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
 
 static void xdl_free_ctx(xdfile_t *xdf)
 {
-	xdl_free(xdf->rindex);
+	xdl_free(xdf->reference_index);
 	xdl_free(xdf->changed - 1);
 	xdl_free(xdf->recs);
 }
@@ -141,7 +141,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 	uint8_t const *blk, *cur, *top, *prev;
 	xrecord_t *crec;
 
-	xdf->rindex = NULL;
+	xdf->reference_index = NULL;
 	xdf->changed = NULL;
 	xdf->recs = NULL;
 
@@ -169,7 +169,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 
 	if ((XDF_DIFF_ALG(xpp->flags) != XDF_PATIENCE_DIFF) &&
 	    (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF)) {
-		if (!XDL_ALLOC_ARRAY(xdf->rindex, xdf->nrec + 1))
+		if (!XDL_ALLOC_ARRAY(xdf->reference_index, xdf->nrec + 1))
 			goto abort;
 	}
 
@@ -312,7 +312,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	     i <= xdf1->dend; i++, recs++) {
 		if (action1[i] == KEEP ||
 		    (action1[i] == INVESTIGATE && !xdl_clean_mmatch(action1, i, xdf1->dstart, xdf1->dend))) {
-			xdf1->rindex[xdf1->nreff++] = i;
+			xdf1->reference_index[xdf1->nreff++] = i;
 			/* changed[i] remains false, i.e. keep */
 		} else
 			xdf1->changed[i] = true;
@@ -324,7 +324,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
 	     i <= xdf2->dend; i++, recs++) {
 		if (action2[i] == KEEP ||
 		    (action2[i] == INVESTIGATE && !xdl_clean_mmatch(action2, i, xdf2->dstart, xdf2->dend))) {
-			xdf2->rindex[xdf2->nreff++] = i;
+			xdf2->reference_index[xdf2->nreff++] = i;
 			/* changed[i] remains false, i.e. keep */
 		} else
 			xdf2->changed[i] = true;
diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
index 9074cdadd1..979586f20a 100644
--- a/xdiff/xtypes.h
+++ b/xdiff/xtypes.h
@@ -50,7 +50,7 @@ typedef struct s_xdfile {
 	size_t nrec;
 	ptrdiff_t dstart, dend;
 	bool *changed;
-	size_t *rindex;
+	size_t *reference_index;
 	size_t nreff;
 } xdfile_t;
 
-- 
gitgitgadget

* Re: [PATCH v5 00/10] Xdiff cleanup part2
  2025-11-18 22:34       ` [PATCH v5 00/10] Xdiff cleanup part2 Ezekiel Newren via GitGitGadget
                           ` (9 preceding siblings ...)
  2025-11-18 22:34         ` [PATCH v5 10/10] xdiff: rename rindex -> reference_index Ezekiel Newren via GitGitGadget
@ 2025-11-18 23:11         ` Junio C Hamano
  10 siblings, 0 replies; 118+ messages in thread
From: Junio C Hamano @ 2025-11-18 23:11 UTC (permalink / raw)
  To: Ezekiel Newren via GitGitGadget
  Cc: git, Kristoffer Haugsbakk, Patrick Steinhardt, Phillip Wood,
	Chris Torek, Ramsay Jones, Ben Knoble, Ezekiel Newren

"Ezekiel Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Changes in v5:
>
>  * Remove the non-word 'signless', and rephrase that paragraph in
>    unambiguous-types.adoc
>  * Cast to char in xdiffi.c:get_indent() rather than changing the local
>    variable to uint8_t
> ...
>
> Ezekiel Newren (10):
>   doc: define unambiguous type mappings across C and Rust
>   xdiff: use ptrdiff_t for dstart/dend
>   xdiff: make xrecord_t.ptr a uint8_t instead of char
>   xdiff: use size_t for xrecord_t.size
>   xdiff: use unambiguous types in xdl_hash_record()
>   xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash
>   xdiff: make xdfile_t.nrec a size_t instead of long
>   xdiff: make xdfile_t.nreff a size_t instead of long
>   xdiff: change rindex from long to size_t in xdfile_t
>   xdiff: rename rindex -> reference_index
>
>  Documentation/Makefile                        |   1 +
>  Documentation/technical/meson.build           |   1 +
>  .../technical/unambiguous-types.adoc          | 224 ++++++++++++++++++
>  xdiff-interface.c                             |   2 +-
>  xdiff/xdiffi.c                                |  29 ++-
>  xdiff/xemit.c                                 |  28 +--
>  xdiff/xhistogram.c                            |   4 +-
>  xdiff/xmerge.c                                |  30 +--
>  xdiff/xpatience.c                             |  14 +-
>  xdiff/xprepare.c                              |  60 ++---
>  xdiff/xtypes.h                                |  15 +-
>  xdiff/xutils.c                                |  32 +--
>  xdiff/xutils.h                                |   6 +-
>  13 files changed, 336 insertions(+), 110 deletions(-)
>  create mode 100644 Documentation/technical/unambiguous-types.adoc

This round looks good to me.  Shall we mark it for 'next'?

Thanks.

Thread overview: 118+ messages
2025-10-15 21:18 [PATCH 0/9] Xdiff cleanup part2 Ezekiel Newren via GitGitGadget
2025-10-15 21:18 ` [PATCH 1/9] xdiff: use ssize_t for dstart/dend, make them last in xdfile_t Ezekiel Newren via GitGitGadget
2025-10-21 11:32   ` Phillip Wood
2025-10-21 17:18     ` Junio C Hamano
2025-10-22 21:07       ` Ezekiel Newren
2025-10-22 21:38         ` Junio C Hamano
2025-10-22 21:51           ` Ezekiel Newren
2025-10-15 21:18 ` [PATCH 2/9] xdiff: make xrecord_t.ptr a uint8_t instead of char Ezekiel Newren via GitGitGadget
2025-10-16 21:51   ` Kristoffer Haugsbakk
2025-10-21  8:33   ` Patrick Steinhardt
2025-10-22 21:12     ` Ezekiel Newren
2025-10-21 13:13   ` Phillip Wood
2025-10-21 18:15     ` Junio C Hamano
2025-10-22 13:27       ` Phillip Wood
2025-10-22 20:55         ` Ezekiel Newren
2025-10-15 21:18 ` [PATCH 3/9] xdiff: use size_t for xrecord_t.size Ezekiel Newren via GitGitGadget
2025-10-15 21:18 ` [PATCH 4/9] xdiff: use unambiguous types in xdl_hash_record() Ezekiel Newren via GitGitGadget
2025-10-21  8:33   ` Patrick Steinhardt
2025-10-22 21:20     ` Ezekiel Newren
2025-10-23  5:49       ` Patrick Steinhardt
2025-10-15 21:18 ` [PATCH 5/9] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash Ezekiel Newren via GitGitGadget
2025-10-20 23:29   ` Ezekiel Newren
2025-10-21  5:10     ` Junio C Hamano
2025-10-21  8:33     ` Patrick Steinhardt
2025-10-21 10:03     ` Phillip Wood
2025-10-21 11:16       ` Chris Torek
2025-10-22 21:31       ` Ezekiel Newren
2025-10-15 21:18 ` [PATCH 6/9] xdiff: make xdfile_t.nrec a size_t instead of long Ezekiel Newren via GitGitGadget
2025-10-15 21:18 ` [PATCH 7/9] xdiff: make xdfile_t.nreff " Ezekiel Newren via GitGitGadget
2025-10-15 21:18 ` [PATCH 8/9] xdiff: change rindex from long to size_t in xdfile_t Ezekiel Newren via GitGitGadget
2025-10-21  8:34   ` Patrick Steinhardt
2025-10-22 22:14     ` Ezekiel Newren
2025-10-23  5:49       ` Patrick Steinhardt
2025-10-15 21:18 ` [PATCH 9/9] xdiff: rename rindex -> reference_index Ezekiel Newren via GitGitGadget
2025-10-15 21:28 ` [PATCH 0/9] Xdiff cleanup part2 Junio C Hamano
2025-10-21 13:28 ` Phillip Wood
2025-10-21 13:41   ` Junio C Hamano
2025-10-29 22:19 ` [PATCH v2 00/10] " Ezekiel Newren via GitGitGadget
2025-10-29 22:19   ` [PATCH v2 01/10] doc: define unambiguous type mappings across C and Rust Ezekiel Newren via GitGitGadget
2025-11-06  9:55     ` Phillip Wood
2025-11-06 22:52       ` Ezekiel Newren
2025-11-09 14:14         ` Phillip Wood
2025-10-29 22:19   ` [PATCH v2 02/10] xdiff: use ssize_t for dstart/dend, make them last in xdfile_t Ezekiel Newren via GitGitGadget
2025-11-06  9:55     ` Phillip Wood
2025-11-06 22:56       ` Ezekiel Newren
2025-10-29 22:19   ` [PATCH v2 03/10] xdiff: make xrecord_t.ptr a uint8_t instead of char Ezekiel Newren via GitGitGadget
2025-11-06 10:49     ` Phillip Wood
2025-11-06 23:13       ` Ezekiel Newren
2025-11-06 10:55     ` Phillip Wood
2025-11-06 23:14       ` Ezekiel Newren
2025-10-29 22:19   ` [PATCH v2 04/10] xdiff: use size_t for xrecord_t.size Ezekiel Newren via GitGitGadget
2025-10-29 22:19   ` [PATCH v2 05/10] xdiff: use unambiguous types in xdl_hash_record() Ezekiel Newren via GitGitGadget
2025-10-29 22:19   ` [PATCH v2 06/10] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash Ezekiel Newren via GitGitGadget
2025-11-06 11:00     ` Phillip Wood
2025-11-06 23:20       ` Ezekiel Newren
2025-10-29 22:19   ` [PATCH v2 07/10] xdiff: make xdfile_t.nrec a size_t instead of long Ezekiel Newren via GitGitGadget
2025-10-29 22:19   ` [PATCH v2 08/10] xdiff: make xdfile_t.nreff " Ezekiel Newren via GitGitGadget
2025-10-29 22:19   ` [PATCH v2 09/10] xdiff: change rindex from long to size_t in xdfile_t Ezekiel Newren via GitGitGadget
2025-10-29 22:19   ` [PATCH v2 10/10] xdiff: rename rindex -> reference_index Ezekiel Newren via GitGitGadget
2025-10-30 14:26   ` [PATCH v2 00/10] Xdiff cleanup part2 Junio C Hamano
2025-11-11 19:42   ` [PATCH v3 " Ezekiel Newren via GitGitGadget
2025-11-11 19:42     ` [PATCH v3 01/10] doc: define unambiguous type mappings across C and Rust Ezekiel Newren via GitGitGadget
2025-11-11 20:52       ` Junio C Hamano
2025-11-11 21:05       ` Junio C Hamano
2025-11-11 19:42     ` [PATCH v3 02/10] xdiff: use ptrdiff_t for dstart/dend Ezekiel Newren via GitGitGadget
2025-11-11 22:23       ` Junio C Hamano
2025-11-11 19:42     ` [PATCH v3 03/10] xdiff: make xrecord_t.ptr a uint8_t instead of char Ezekiel Newren via GitGitGadget
2025-11-11 22:53       ` Junio C Hamano
2025-11-11 19:42     ` [PATCH v3 04/10] xdiff: use size_t for xrecord_t.size Ezekiel Newren via GitGitGadget
2025-11-11 23:08       ` Junio C Hamano
2025-11-14  6:02         ` Ezekiel Newren
2025-11-14 16:31           ` Junio C Hamano
2025-11-11 19:42     ` [PATCH v3 05/10] xdiff: use unambiguous types in xdl_hash_record() Ezekiel Newren via GitGitGadget
2025-11-11 19:42     ` [PATCH v3 06/10] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash Ezekiel Newren via GitGitGadget
2025-11-11 23:21       ` Junio C Hamano
2025-11-14  5:41         ` Ezekiel Newren
2025-11-14 20:06           ` Junio C Hamano
2025-11-11 19:42     ` [PATCH v3 07/10] xdiff: make xdfile_t.nrec a size_t instead of long Ezekiel Newren via GitGitGadget
2025-11-11 19:42     ` [PATCH v3 08/10] xdiff: make xdfile_t.nreff " Ezekiel Newren via GitGitGadget
2025-11-11 19:42     ` [PATCH v3 09/10] xdiff: change rindex from long to size_t in xdfile_t Ezekiel Newren via GitGitGadget
2025-11-11 19:42     ` [PATCH v3 10/10] xdiff: rename rindex -> reference_index Ezekiel Newren via GitGitGadget
2025-11-11 23:40     ` [PATCH v3 00/10] Xdiff cleanup part2 Junio C Hamano
2025-11-14  5:52       ` Ezekiel Newren
2025-11-14 22:36     ` [PATCH v4 " Ezekiel Newren via GitGitGadget
2025-11-14 22:36       ` [PATCH v4 01/10] doc: define unambiguous type mappings across C and Rust Ezekiel Newren via GitGitGadget
2025-11-15  3:06         ` Ramsay Jones
2025-11-15  3:41           ` Ben Knoble
2025-11-15 14:55             ` Ramsay Jones
2025-11-15 16:42               ` Junio C Hamano
2025-11-15 16:59                 ` D. Ben Knoble
2025-11-15 20:03                   ` Junio C Hamano
2025-11-17  1:20                 ` Junio C Hamano
2025-11-17  2:08                   ` Ramsay Jones
2025-11-14 22:36       ` [PATCH v4 02/10] xdiff: use ptrdiff_t for dstart/dend Ezekiel Newren via GitGitGadget
2025-11-14 22:36       ` [PATCH v4 03/10] xdiff: make xrecord_t.ptr a uint8_t instead of char Ezekiel Newren via GitGitGadget
2025-11-15  8:26         ` Junio C Hamano
2025-11-18 20:55           ` Ezekiel Newren
2025-11-14 22:36       ` [PATCH v4 04/10] xdiff: use size_t for xrecord_t.size Ezekiel Newren via GitGitGadget
2025-11-14 22:36       ` [PATCH v4 05/10] xdiff: use unambiguous types in xdl_hash_record() Ezekiel Newren via GitGitGadget
2025-11-14 22:36       ` [PATCH v4 06/10] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash Ezekiel Newren via GitGitGadget
2025-11-14 22:36       ` [PATCH v4 07/10] xdiff: make xdfile_t.nrec a size_t instead of long Ezekiel Newren via GitGitGadget
2025-11-14 22:36       ` [PATCH v4 08/10] xdiff: make xdfile_t.nreff " Ezekiel Newren via GitGitGadget
2025-11-14 22:36       ` [PATCH v4 09/10] xdiff: change rindex from long to size_t in xdfile_t Ezekiel Newren via GitGitGadget
2025-11-14 22:36       ` [PATCH v4 10/10] xdiff: rename rindex -> reference_index Ezekiel Newren via GitGitGadget
2025-11-18 22:34       ` [PATCH v5 00/10] Xdiff cleanup part2 Ezekiel Newren via GitGitGadget
2025-11-18 22:34         ` [PATCH v5 01/10] doc: define unambiguous type mappings across C and Rust Ezekiel Newren via GitGitGadget
2025-11-18 23:46           ` Ramsay Jones
2025-11-19  4:14             ` Junio C Hamano
2025-11-18 22:34         ` [PATCH v5 02/10] xdiff: use ptrdiff_t for dstart/dend Ezekiel Newren via GitGitGadget
2025-11-18 22:34         ` [PATCH v5 03/10] xdiff: make xrecord_t.ptr a uint8_t instead of char Ezekiel Newren via GitGitGadget
2025-11-18 22:34         ` [PATCH v5 04/10] xdiff: use size_t for xrecord_t.size Ezekiel Newren via GitGitGadget
2025-11-18 22:34         ` [PATCH v5 05/10] xdiff: use unambiguous types in xdl_hash_record() Ezekiel Newren via GitGitGadget
2025-11-18 22:34         ` [PATCH v5 06/10] xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash Ezekiel Newren via GitGitGadget
2025-11-18 22:34         ` [PATCH v5 07/10] xdiff: make xdfile_t.nrec a size_t instead of long Ezekiel Newren via GitGitGadget
2025-11-18 22:34         ` [PATCH v5 08/10] xdiff: make xdfile_t.nreff " Ezekiel Newren via GitGitGadget
2025-11-18 22:34         ` [PATCH v5 09/10] xdiff: change rindex from long to size_t in xdfile_t Ezekiel Newren via GitGitGadget
2025-11-18 22:34         ` [PATCH v5 10/10] xdiff: rename rindex -> reference_index Ezekiel Newren via GitGitGadget
2025-11-18 23:11         ` [PATCH v5 00/10] Xdiff cleanup part2 Junio C Hamano
