From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-qk1-f172.google.com (mail-qk1-f172.google.com [209.85.222.172]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 93F6832A3D7 for ; Tue, 11 Nov 2025 19:42:35 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.222.172 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1762890157; cv=none; b=jVrLKy7bfdjC2BwmepT1kaQlUtyS5G3lBJxUlsiKqovHT7hxOd0qmXoIrRRzd0EZ2Uw5Bsu8uAnS4daRSZkhqzyGGpjgxwh5TiZ/bwBJwyj2IhxTnfQTJCACph9/PQkB66tmI61nWAgfXGa5K1a8arR6WOypr7jxUz4Z+WOol+8= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1762890157; c=relaxed/simple; bh=3h6qu+Lpd6oDi2duoShCxiWtkUsDt2l2KVK8yKQd3JU=; h=Message-Id:In-Reply-To:References:From:Date:Subject:Content-Type: MIME-Version:To:Cc; b=n49mrh4JWDq7cpLwMNPIBmteS1o+8qZ9Kf2dnuN7Y8fEnDAVaFth1Ovv8lPGM+MhR3UZbobq2p7XudDgKaE4RFAOezTw958UBiEMmi/EQzrtL8oZmKO1B9TAapPm7jxQt2NYVhWX6yS8EwCU6YpqLSDF4MYflwzx3MxPjhNvzhU= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=eFY2FEWT; arc=none smtp.client-ip=209.85.222.172 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="eFY2FEWT" Received: by mail-qk1-f172.google.com with SMTP id af79cd13be357-8a3eac7ca30so8332485a.2 for ; Tue, 11 Nov 2025 11:42:35 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1762890154; x=1763494954; darn=vger.kernel.org; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:from:to:cc:subject:date :message-id:reply-to; bh=ER+AX5SX31xDelYlgeIJ2fRyAAD0RPsl4GGm9QvNAP0=; b=eFY2FEWTE23mK8ZK8CIdxu/jH/HjtC4dU7BMatsYuezAvpmmnThh+NSnGZx6QkpLky A2r7UHRdLgXg921WrfnH9vbJkp9ehTDqI6rfc6QWkRdR9fYJtjy2KYG+1yq3/k9CIwaL +jrHHQh4c0zC5Pl4j9H7D64MrkyfV84keea+TvdP+b+HO0gCr+PEEa8fG9AXkBOqsbmL 6zU0azo66K0LFgy/ESoCAqGQTiQESLbGqwezKXsxDVrg3D8XWUu+c242Obm1icpAsSqs nEUsBDI7gfJs42Tf3XRUXbDskhxo75tZTyopbwvJicIZ56mpofKgdUL0UKXsacNceJm3 284A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1762890154; x=1763494954; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=ER+AX5SX31xDelYlgeIJ2fRyAAD0RPsl4GGm9QvNAP0=; b=LfsgWaTzI0dYgsYZofPy6wsalxorHi8UyaJzL4Grv7CVFjamaEFcRc92F+USeOyzCq uD2UedV6RGSgAs2gV/oyX0o7/T5v5469XmKlggcWGNKRklwm0FSjVGtgm9LE1wKb+0jb fE4xk6yxPoSBAa8VchIP9x7DrgITiBDsW+Bcco8JhQtfOvP8PXBQvMXVg+drV+iIQt6N kKToVRDYhuFUruBQeNwx2d0elys8E/8owwbLFUs+lUvQUGYSlRuDvZcf1d6Sc7jfUVqf ljQsmOevVqF3keZCfBGjzb3FPsvnqz98PFJZIvOe4mEzToII1Jqb37I7EqVMBV+nr7hV 9KIw== X-Gm-Message-State: AOJu0YzuXaQICcxVIgMyUAo7qkq+A5qsrXkxoRNdyCYvqlDtEp8Jh3L3 fOXPm5C/NrrFlhNdNglxkC3+DKNfKArpBVKrltZf7lf/ms9l36+cQaleojczNg== X-Gm-Gg: ASbGncuV47l+oIwubRHxLf45ha1xABNl+OUNV9VFCyc1NO7WXU7Zjyl1iN27VXH/4TZ JV+Yp6iR2uWRxJENyWZEqevhi1UIuErvwtAo2vbyxMOucQSp44ivqe4Pfvi8HU0wtW3pUVytrsx qUe1bp35yVKFwkVYzxNc4WOdMHyZEzx/890iJ0MBzPdeAjS+1HTgN6JYKAweVNoeo07fclFXakY dZ7me2P5SFZk49O0B2ADWZuQ22W2Ith5j0EVyXFyerI+wo/xqeTkQzz1SlMtQmm7cCPCoHjiQQ7 pRYy4I4eRB4v1OF+RDP/517XqV7aEBvRT5CilGv2iU+cMn3hAY3BkOWGkPRKvU88SbRzu9gn7Rs THWK5VqgENK6EMFesNN8qrRtxGdwTjssMG/i7tJZ7s+9T/V2RY1ZLAeYBYux5UCT59iamZs3BoW dm7n4= X-Google-Smtp-Source: AGHT+IEBKFcJwQpODk4sMXS/XXDIDbEVrMapqe+y8VchRJQNGBgjW/g8QVp4Ysnbx9i/4MquxBADig== X-Received: by 2002:a05:620a:2546:b0:8a3:cd9e:e40e with SMTP id af79cd13be357-8b29b84ff51mr56233085a.86.1762890153817; Tue, 11 Nov 2025 11:42:33 -0800 (PST) Received: from [127.0.0.1] ([172.208.127.36]) by smtp.gmail.com with ESMTPSA id af79cd13be357-8b29a84b075sm44653485a.7.2025.11.11.11.42.32 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 11 Nov 2025 11:42:33 -0800 (PST) Message-Id: In-Reply-To: References: From: "Ezekiel Newren via GitGitGadget" Date: Tue, 11 Nov 2025 19:42:22 +0000 Subject: [PATCH v3 00/10] Xdiff cleanup part2 Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: bulk X-Mailing-List: git@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 To: git@vger.kernel.org Cc: Kristoffer Haugsbakk , Patrick Steinhardt , Phillip Wood , Chris Torek , Ezekiel Newren Changes in v3: * Address comments about commit messages and documentation * Add unambiguous-types.adoc to Makefile and Meson * Use markdown style to avoid asciidoc issues Changes in v2: * Added documentation about unambiguous types and FFI * Addressed comments on the mailing list Original cover letter below: ============================ Maintainer note: This patch series builds on top of en/xdiff-cleanup and am/xdiff-hash-tweak (both of which are now in master). The primary goal of this patch series is to convert every field's type in xrecord_t and xdfile_t to be unambiguous, in preparation to make it more Rust FFI friendly. Additionally the ha field in xrecord_t is split into line_hash and minimal_perfect hash. The order of some of the fields has changed as called out by the commit messages. Before: typedef struct s_xrecord { char const *ptr; long size; unsigned long ha; } xrecord_t; typedef struct s_xdfile { xrecord_t *recs; long nrec; long dstart, dend; bool *changed; long *rindex; long nreff; } xdfile_t; After part 2 typedef struct s_xrecord { uint8_t const *ptr; size_t size; uint64_t line_hash; size_t minimal_perfect_hash; } xrecord_t; typedef struct s_xdfile { xrecord_t *recs; size_t nrec; bool *changed; size_t *reference_index; size_t nreff; ssize_t dstart, dend; } xdfile_t; Ezekiel Newren (10): doc: define unambiguous type mappings across C and Rust xdiff: use ptrdiff_t for dstart/dend xdiff: make xrecord_t.ptr a uint8_t instead of char xdiff: use size_t for xrecord_t.size xdiff: use unambiguous types in xdl_hash_record() xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash xdiff: make xdfile_t.nrec a size_t instead of long xdiff: make xdfile_t.nreff a size_t instead of long xdiff: change rindex from long to size_t in xdfile_t xdiff: rename rindex -> reference_index Documentation/Makefile | 1 + Documentation/technical/meson.build | 1 + .../technical/unambiguous-types.adoc | 239 ++++++++++++++++++ xdiff-interface.c | 2 +- xdiff/xdiffi.c | 29 +-- xdiff/xemit.c | 28 +- xdiff/xhistogram.c | 4 +- xdiff/xmerge.c | 30 +-- xdiff/xpatience.c | 14 +- xdiff/xprepare.c | 60 ++--- xdiff/xtypes.h | 15 +- xdiff/xutils.c | 32 +-- xdiff/xutils.h | 6 +- 13 files changed, 351 insertions(+), 110 deletions(-) create mode 100644 Documentation/technical/unambiguous-types.adoc base-commit: a99f379adf116d53eb11957af5bab5214915f91d Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-2070%2Fezekielnewren%2Fxdiff_cleanup_part2-v3 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-2070/ezekielnewren/xdiff_cleanup_part2-v3 Pull-Request: https://github.com/git/git/pull/2070 Range-diff vs v2: 1: 88133848d1 ! 1: e5d084d340 doc: define unambiguous type mappings across C and Rust @@ Commit message Signed-off-by: Ezekiel Newren + ## Documentation/Makefile ## +@@ Documentation/Makefile: TECH_DOCS += technical/shallow + TECH_DOCS += technical/sparse-checkout + TECH_DOCS += technical/sparse-index + TECH_DOCS += technical/trivial-merge ++TECH_DOCS += technical/unambiguous-types + TECH_DOCS += technical/unit-tests + SP_ARTICLES += $(TECH_DOCS) + SP_ARTICLES += technical/api-index + + ## Documentation/technical/meson.build ## +@@ Documentation/technical/meson.build: articles = [ + 'sparse-checkout.adoc', + 'sparse-index.adoc', + 'trivial-merge.adoc', ++ 'unambiguous-types.adoc', + 'unit-tests.adoc', + ] + + ## Documentation/technical/unambiguous-types.adoc (new) ## @@ += Unambiguous types @@ Documentation/technical/unambiguous-types.adoc (new) +|=== +| C Type | Rust Type +| size_t^3^ | usize -+| ptrdiff_t^4^ | isize ++| ptrdiff_t^3^ | isize +|=== + +== Character types + -+This is where C and Rust don't have a clean one-to-one mapping. A C `char` is -+an 8-bit type that is signless (neither signed nor unsigned) which causes -+problems with e.g. `make DEVELOPER=1`. Rust's `char` type is an unsigned 32-bit -+integer that is used to describe Unicode code points. Even though a C `char` -+is the same width as `u8`, `char` should be converted to u8 where it is -+describing bytes in memory. If a C `char` is not describing bytes, then it -+should be converted to a more accurate unambiguous type. ++This is where C and Rust don't have a clean one-to-one mapping. ++ ++C comparison problem: While the sign of `char` is implementation defined, it's ++also signless (neither signed nor unsigned). When building with ++`make DEVELOPER=1` it will complain about a "differ in signedness" when `char` ++is compared with `uint8_t` or `int8_t`. ++ ++Rust's `char` type is an unsigned 32-bit integer that is used to describe ++Unicode code points. Even though a C `char` is the same width as `u8`, `char` ++should be converted to u8 where it is describing bytes in memory. If a C ++`char` is not describing bytes, then it should be converted to a more accurate ++unambiguous type. The reason for mentioning Unicode here is because of how &str ++is defined in Rust and how to create a &str from &[u8]. Rust assumes that &str ++is a correctly encoded utf-8 string, i.e. text in memory. Where as a C `char` ++makes no assumption about the bytes that it is representing. ++ ++``` ++let raw_bytes = b"abc\n"; ++let result = std::str::from_utf8(raw_bytes); ++if let Ok(line) = result { ++ // do something with text ++} ++``` + +While you could specify `char` in the C code and `u8` in Rust code, it's not as +clear what the appropriate type is, but it would work across the FFI boundary. -+However the bigger problem comes from code generation tools like cbindgen and -+bindgen. When cbindgen see u8 in Rust it will generate uint8_t on the C side -+which will cause differ in signedness warnings/errors. Similaraly if bindgen -+see `char` on the C side it will generate `std::ffi::c_char` which has its own ++However, the bigger problem comes from code generation tools like cbindgen and ++bindgen. When cbindgen sees u8 in Rust it will generate uint8_t on the C side ++which will cause differ in signedness warnings/errors. Similarly if bindgen ++sees `char` on the C side it will generate `std::ffi::c_char` which has its own +problems. + +=== Notes @@ Documentation/technical/unambiguous-types.adoc (new) +platform/arch for C does not follow IEEE-754 then this equivalence does not +hold. Also, it's assumed that `float` is 32 bits and `double` is 64, but +there may be a strange platform/arch where even this isn't true. + -+^3^ C also defines uintptr_t, but this should not be used in Git. + -+^4^ C also defines ssize_t and intptr_t, but these should not be used in Git. + ++^3^ C also defines uintptr_t, ssize_t and intptr_t, but these types are ++discouraged for FFI purposes. For functions like `read()` and `write()` ssize_t ++should be cast to a different, and unambiguous, type before being passed over ++the FFI boundary. + + +== Problems with std::ffi::c_* types in Rust -+TL;DR: They're not guaranteed to match C types for all possible C -+compilers/platforms/architectures. ++TL;DR: In practice, Rust's `c_*` types aren't guaranteed to match C types for ++all possible C compilers, platforms, or architectures, because Rust only ++ensures correctness of C types on officially supported targets. These ++definitions have changed over time to match more targets which means that the ++c_* definitions will differ based on which Rust version Git chooses to use. + -+Only a few of Rust's C FFI types are considered safe and semantically clear to -+use: + ++Current list of safe, Rust side, FFI types in Git: + + +* `c_void` +* `CStr` @@ Documentation/technical/unambiguous-types.adoc (new) +Even then, they should be used sparingly, and only where the semantics match +exactly. + -+The std::os::raw::c_* (which is deprecated) directly inherits the problems of -+core::ffi, which changes over time and seems to make a best guess at the -+correct definition for a given platform/target. This probably isn't a problem -+for all platforms that Rust supports currently, but can anyone say that Rust -+got it right for all C compilers of all platforms/targets? -+ -+On top of all of that we're targeting an older version of Rust which doesn't -+have the latest mappings. ++The std::os::raw::c_* directly inherits the problems of core::ffi, which ++changes over time and seems to make a best guess at the correct definition for ++a given platform/target. This probably isn't a problem for all other platforms ++that Rust supports currently, but can anyone say that Rust got it right for all ++C compilers of all platforms/targets? + +To give an example: c_long is defined in +footnote:[https://doc.rust-lang.org/1.63.0/src/core/ffi/mod.rs.html#175-189[c_long in 1.63.0]] @@ Documentation/technical/unambiguous-types.adoc (new) + +=== Rust version 1.63.0 + -+[source] -+---- ++``` +mod c_long_definition { + cfg_if! { + if #[cfg(all(target_pointer_width = "64", not(windows)))] { @@ Documentation/technical/unambiguous-types.adoc (new) + } + } +} -+---- ++``` + +=== Rust version 1.89.0 + -+[source] -+---- ++``` +mod c_long_definition { + crate::cfg_select! { + any( @@ Documentation/technical/unambiguous-types.adoc (new) + } + } +} -+---- ++``` + +Even for the cases where C types are correctly mapped to Rust types via +std::ffi::c_* there are still problems. Let's take c_char for example. On some @@ Documentation/technical/unambiguous-types.adoc (new) +The following code will panic in debug on platforms that define c_char as u8, +but won't if it's an i8. + -+[source] -+---- ++``` +let mut x: std::ffi::c_char = 0; +x -= 1; -+---- ++``` + +=== Inconsistent shift behavior + +`x` will be 0xC0 for platforms that use i8, but will be 0x40 where it's u8. + -+[source] -+---- ++``` +let mut x: std::ffi::c_char = 0x80; +x >>= 1; -+---- ++``` + +=== Equality fails to compile on some platforms + @@ Documentation/technical/unambiguous-types.adoc (new) +if it's u8. You can cast x e.g. `assert_eq!(x as u8, b'a');`, but then you get +a warning on platforms that use u8 and a clean compilation where i8 is used. + -+[source] -+---- ++``` +let mut x: std::ffi::c_char = 0x61; +assert_eq!(x, b'a'); -+---- ++``` + +== Enum types +Rust enum types should not be used as FFI types. Rust enum types are more like +C union types than C enum's. For something like: + -+[source] -+---- ++``` +#[repr(C, u8)] +enum Fruit { + Apple, + Banana, + Cherry, +} -+---- ++``` + +It's easy enough to make sure the Rust enum matches what C would expect, but a +more complex type like. + -+[source] -+---- ++``` +enum HashResult { + SHA1([u8; 20]), + SHA256([u8; 32]), +} -+---- ++``` + +The Rust compiler has to add a discriminant to the enum to distinguish between +the variants. The width, location, and values for that discriminant is up to 2: 9197903add ! 2: 52e3f589b1 xdiff: use ssize_t for dstart/dend, make them last in xdfile_t @@ Metadata Author: Ezekiel Newren ## Commit message ## - xdiff: use ssize_t for dstart/dend, make them last in xdfile_t + xdiff: use ptrdiff_t for dstart/dend - ssize_t is appropriate for dstart and dend because they both describe + ptrdiff_t is appropriate for dstart and dend because they both describe positive or negative offsets relative to a pointer. A future patch will move these fields to a different struct. Moving 3: 46bc1b3e25 ! 3: 83e7bf180a xdiff: make xrecord_t.ptr a uint8_t instead of char @@ Metadata ## Commit message ## xdiff: make xrecord_t.ptr a uint8_t instead of char - Rust uses u8 to refer to bytes in memory. Since xrecord_t.ptr is also - referring to bytes in memory, rather than Unicode code points, use - uint8_t instead of char. + Make xrecord_t.ptr uint8_t because it's referring to bytes in memory. Every usage of this field was inspected and cast to char*, or similar, to avoid signedness warnings/errors from the compiler. Casting was used @@ xdiff/xprepare.c: static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, lo goto abort; crec = &xdf->recs[xdf->nrec++]; - crec->ptr = prev; -- crec->size = (long) (cur - prev); + crec->ptr = (uint8_t const *)prev; -+ crec->size =(long) ( cur - prev); + crec->size = (long) (cur - prev); crec->ha = hav; if (xdl_classify_record(pass, cf, crec) < 0) - goto abort; ## xdiff/xtypes.h ## @@ xdiff/xtypes.h: typedef struct s_chastore { 4: 07e28aad3b ! 4: da2b80ea0b xdiff: use size_t for xrecord_t.size @@ xdiff/xprepare.c: static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, lo goto abort; crec = &xdf->recs[xdf->nrec++]; crec->ptr = (uint8_t const *)prev; -- crec->size =(long) ( cur - prev); +- crec->size = (long) (cur - prev); + crec->size = cur - prev; crec->ha = hav; if (xdl_classify_record(pass, cf, crec) < 0) 5: 1ade7d8165 = 5: c6ba630ac5 xdiff: use unambiguous types in xdl_hash_record() 6: 59054ea0cb ! 6: 3834ea8f9b xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash @@ Commit message xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash The ha field is serving two different purposes, which makes the code - harder to read. At first glance it looks like many places assume + harder to read. At first glance, it looks like many places assume there could never be hash collisions between lines of the two input files. In reality, line_hash is used together with xdl_recmatch() to ensure correct comparisons of lines, even when collisions occur. To make this clearer, the old ha field has been split: - * line_hash: The straightforward hash of a line, requiring no - additional context. + * line_hash: a straightforward hash of a line, independent of any + external context. Its type is uint64_t, as it comes from a fixed + width hash function. * minimal_perfect_hash: Not a new concept, but now a separate field. It comes from the classifier's general-purpose hash table, which assigns each line a unique and minimal hash across the two - files. + files. A size_t is used here because it's meant to be used to + index an array. This also this avoids ` as usize` casts on the Rust + side when using it to index a slice. Signed-off-by: Ezekiel Newren @@ xdiff/xpatience.c: static int match(struct hashmap *map, int line1, int line2) static int patience_diff(xpparam_t const *xpp, xdfenv_t *env, ## xdiff/xprepare.c ## -@@ xdiff/xprepare.c: static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t - long hi; +@@ xdiff/xprepare.c: static void xdl_free_classifier(xdlclassifier_t *cf) { + + + static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t *rec) { +- long hi; ++ size_t hi; xdlclass_t *rcrec; - hi = (long) XDL_HASHLONG(rec->ha, cf->hbits); -+ hi = (long) XDL_HASHLONG(rec->line_hash, cf->hbits); ++ hi = XDL_HASHLONG(rec->line_hash, cf->hbits); for (rcrec = cf->rchash[hi]; rcrec; rcrec = rcrec->next) - if (rcrec->rec.ha == rec->ha && + if (rcrec->rec.line_hash == rec->line_hash && 7: f91be17858 ! 7: e2a2c7530c xdiff: make xdfile_t.nrec a size_t instead of long @@ Metadata ## Commit message ## xdiff: make xdfile_t.nrec a size_t instead of long - size_t is used because nrec describes the number of elements in memory - for recs, and the number of elements in memory for 'changed' + 2. + size_t is used because nrec describes the number of elements for both + recs, and for 'changed' + 2. Signed-off-by: Ezekiel Newren 8: e2a6a23cc4 = 8: 31cd2a1aa4 xdiff: make xdfile_t.nreff a size_t instead of long 9: 3b6054945f ! 9: aee0d3958b xdiff: change rindex from long to size_t in xdfile_t @@ Metadata ## Commit message ## xdiff: change rindex from long to size_t in xdfile_t - rindex describes a index offset which means it's an index into memory - which should use size_t. + The field rindex describes an index offset for other arrays. Change it + to size_t. Changing the type of rindex from long to size_t has no cascading refactor impact because it is only ever used to directly index other 10: 1856a29026 ! 10: 75c26fe160 xdiff: rename rindex -> reference_index @@ Commit message The classic diff adds only the lines that it's going to consider, during the diff, to an array. A mapping between the compacted - array, and the lines of the file that they reference, are + array, and the lines of the file that they reference, is facilitated by this array. Signed-off-by: Ezekiel Newren -- gitgitgadget