From: "Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: "Derrick Stolee" <stolee@gmail.com>,
"Torsten Bögershausen" <tboegi@web.de>,
"Jeff King" <peff@peff.net>, "Patrick Steinhardt" <ps@pks.im>,
"Johannes Schindelin" <johannes.schindelin@gmx.de>
Subject: [PATCH v3 00/11] Handle cloning of objects larger than 4GB on Windows
Date: Fri, 08 May 2026 08:16:38 +0000
Message-ID: <pull.2102.v3.git.1778228209.gitgitgadget@gmail.com>
In-Reply-To: <pull.2102.v2.git.1777914508.gitgitgadget@gmail.com>
On Windows, unsigned long is 32-bit even on 64-bit systems. This causes
multiple problems when Git handles objects larger than 4GB. This patch
series is a deliberately narrow fix for the earliest part of the problem:
it addresses the most fundamental truncation points that prevent a >4GB
object from surviving a clone at all.
Specifically, this fixes:
* zlib's uLong wrapping and triggering BUG() assertions in the git_zstream
wrapper
* Object sizes being truncated in pack streaming, delta headers, and
index-pack/unpack-objects
* pack-objects re-encoding reused pack entries with a truncated size,
producing corrupt packs on the wire
Many other code paths still use unsigned long for object sizes (e.g.,
cat-file -s, object_info.sizep, the delta machinery) and will need their own
conversions. This series does not attempt to fix those.
Based on work by @LordKiRon in git-for-windows/git#6076.
For testing, add a test helper that synthesizes a pack with a >4GB blob and
regression tests that clone it via both the unpack-objects and index-pack
code paths using the file:// transport. Since these test cases are quite
slow (even after optimizing the pack generation, the git clone test has
no choice but to hash 2x4GB of data), they are marked as EXPENSIVE. To
ensure that they pass well in advance of any release, the CI is changed
to run them in the builds triggered by the relatively infrequent
integration-branch updates.
Changes since v2:
* Now uses the proper data type for the varint decoding value (thanks,
Torsten!)
* The callers that would now silently narrow size_t to unsigned long
properly check and error out instead (thanks, Torsten!)
Changes since v1:
* dramatically accelerated the test helper that generates 4GB pack files,
via two separate strategies:
1. using the "unsafe" SHA-1 for the blob OID computation.
2. using pre-computed "Lego blocks" to construct the 4GB packs needed in
the test cases, where the sizes (and therefore the involved OIDs) are
known in advance.
* even with these improvements, the actual git clone is still slow (of
course, because it cannot use any of those shortcuts), therefore the
tests are marked as EXPENSIVE.
* to exercise those tests nevertheless, the last patch lets all EXPENSIVE
test cases run for the integration branches other than `seen`.
Johannes Schindelin (11):
index-pack, unpack-objects: use size_t for object size
git-zlib: handle data streams larger than 4GB
odb, packfile: use size_t for streaming object sizes
delta, packfile: use size_t for delta header sizes
test-tool: add a helper to synthesize large packfiles
t5608: add regression test for >4GB object clone
test-tool synthesize: use the unsafe hash for speed
test-tool synthesize: precompute pack for 4 GiB + 1
test-tool synthesize: add precomputed SHA-256 pack for 4 GiB + 1
t5608: mark >4GB tests as EXPENSIVE
ci: run expensive tests on push builds to integration branches
Makefile | 1 +
builtin/index-pack.c | 8 +-
builtin/pack-objects.c | 34 ++-
builtin/unpack-objects.c | 4 +-
ci/lib.sh | 9 +
compat/zlib-compat.h | 2 +
delta.h | 14 +-
git-zlib.c | 25 +-
git-zlib.h | 4 +-
object-file.c | 12 +-
odb/streaming.c | 13 +-
odb/streaming.h | 2 +-
oss-fuzz/fuzz-pack-headers.c | 2 +-
pack-bitmap.c | 2 +-
pack-check.c | 6 +-
packfile.c | 57 ++--
packfile.h | 4 +-
t/helper/meson.build | 1 +
t/helper/test-synthesize.c | 541 +++++++++++++++++++++++++++++++++++
t/helper/test-tool.c | 1 +
t/helper/test-tool.h | 1 +
t/t5608-clone-2gb.sh | 37 +++
22 files changed, 724 insertions(+), 56 deletions(-)
create mode 100644 t/helper/test-synthesize.c
base-commit: 94f057755b7941b321fd11fec1b2e3ca5313a4e0
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-2102%2Fdscho%2Ffix-large-clones-on-windows-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-2102/dscho/fix-large-clones-on-windows-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/2102
Range-diff vs v2:
1: dc660106ea ! 1: 311cdc601d index-pack, unpack-objects: use size_t for object size
@@ Commit message
64-bit platforms, and ensuring the shift arithmetic occurs in 64-bit
space.
+ Declare the per-byte continuation variable `c` as size_t as well,
+ matching the canonical varint decoder unpack_object_header_buffer()
+ in packfile.c. With c as size_t the expression (c & 0x7f) << shift
+ is naturally size_t-typed, so the explicit cast that an earlier
+ iteration carried at the use site is no longer needed.
+
+ While at it, add the same overflow guard that
+ unpack_object_header_buffer() carries: if the cumulative shift would
+ exceed bitsizeof(size_t) - 7, refuse the input rather than invoking
+ undefined behavior. Unlike unpack_object_header_buffer(), which
+ labels this case "bad object header", report it as the platform
+ limit it actually is: a header may be perfectly well-formed and
+ still encode a size we cannot represent locally (notably on a
+ 32-bit build consuming a packfile produced on a 64-bit host).
+
This was originally authored by LordKiRon <https://github.com/LordKiRon>,
who preferred not to reveal their real name and therefore agreed that I
take over authorship.
+ Helped-by: Torsten Bögershausen <tboegi@web.de>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
## builtin/index-pack.c ##
@@ builtin/index-pack.c: static void *unpack_raw_entry(struct object_entry *obj,
{
unsigned char *p;
- unsigned long size, c;
-+ size_t size;
-+ unsigned long c;
++ size_t size, c;
off_t base_offset;
unsigned shift;
void *data;
@@ builtin/index-pack.c: static void *unpack_raw_entry(struct object_entry *obj,
+ size = (c & 15);
+ shift = 4;
+ while (c & 0x80) {
++ if ((bitsizeof(size_t) - 7) < shift)
++ die(_("object size too large for this platform"));
p = fill(1);
c = *p;
use(1);
-- size += (c & 0x7f) << shift;
-+ size += ((size_t)c & 0x7f) << shift;
- shift += 7;
- }
- obj->size = size;
## builtin/unpack-objects.c ##
@@ builtin/unpack-objects.c: static void unpack_one(unsigned nr)
@@ builtin/unpack-objects.c: static void unpack_one(unsigned nr)
unsigned shift;
unsigned char *pack;
- unsigned long size, c;
-+ size_t size;
-+ unsigned long c;
++ size_t size, c;
enum object_type type;
obj_list[nr].offset = consumed_bytes;
@@ builtin/unpack-objects.c: static void unpack_one(unsigned nr)
+ size = (c & 15);
+ shift = 4;
+ while (c & 0x80) {
++ if ((bitsizeof(size_t) - 7) < shift)
++ die(_("object size too large for this platform"));
pack = fill(1);
c = *pack;
use(1);
-- size += (c & 0x7f) << shift;
-+ size += ((size_t)c & 0x7f) << shift;
- shift += 7;
- }
-
2: 92f4327b1f = 2: c611913194 git-zlib: handle data streams larger than 4GB
3: 3a539061c5 ! 3: b789f57de9 odb, packfile: use size_t for streaming object sizes
@@ Commit message
temporary variables where the types differ, with comments noting the
truncation limitation for code paths that still use unsigned long.
+ Widening the producers to size_t in this way introduces a handful of
+ silent size_t -> unsigned long narrowings on Windows, all in
+ builtin/pack-objects.c, where the consumers are still typed
+ unsigned long. Make those narrowings explicit with
+ cast_size_t_to_ulong() so they assert loudly the moment an object
+ actually exceeds ULONG_MAX bytes:
+
+ - oe_get_size_slow() returns unsigned long but holds a size_t
+ locally; cast at the return.
+ - write_reuse_object() passes a size_t into check_pack_inflate(),
+ whose expect parameter is unsigned long; cast at the call.
+ - check_object() routes a size_t through SET_SIZE() and
+ SET_DELTA_SIZE(), both of which take unsigned long via
+ oe_set_size() / oe_set_delta_size(); cast at the three call
+ sites in the OBJ_OFS_DELTA / OBJ_REF_DELTA branches and in the
+ non-delta default arm.
+
+ The cast-only treatment is deliberately a stop-gap. Properly
+ widening oe_set_size, oe_get_size_slow's return type,
+ check_pack_inflate's expect parameter, object_info.sizep,
+ patch_delta, and the OE_SIZE_BITS bit-fields cascades into a series
+ that is too large to be reviewable, so the proper widening is
+ deferred to a follow-up topic. Until then,
+ cast_size_t_to_ulong() at least makes the truncation explicit at
+ the source: it documents the boundary, and on a 64-bit non-Windows
+ platform it is a no-op.
+
This was originally authored by LordKiRon <https://github.com/LordKiRon>,
who preferred not to reveal their real name and therefore agreed that I
take over authorship.
+ Helped-by: Torsten Bögershausen <tboegi@web.de>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
## builtin/pack-objects.c ##
@@ builtin/pack-objects.c: static off_t write_reuse_object(struct hashfile *f, stru
if (DELTA(entry))
type = (allow_ofs_delta && DELTA(entry)->idx.offset) ?
+@@ builtin/pack-objects.c: static off_t write_reuse_object(struct hashfile *f, struct object_entry *entry,
+ datalen -= entry->in_pack_header_size;
+
+ if (!pack_to_stdout && p->index_version == 1 &&
+- check_pack_inflate(p, &w_curs, offset, datalen, entry_size)) {
++ check_pack_inflate(p, &w_curs, offset, datalen,
++ cast_size_t_to_ulong(entry_size))) {
+ error(_("corrupt packed object for %s"),
+ oid_to_hex(&entry->idx.oid));
+ unuse_pack(&w_curs);
@@ builtin/pack-objects.c: static void write_reused_pack_one(struct packed_git *reuse_packfile,
{
off_t offset, next, cur;
@@ builtin/pack-objects.c: static void check_object(struct object_entry *entry, uin
buf = use_pack(p, &w_curs, entry->in_pack_offset, &avail);
+@@ builtin/pack-objects.c: static void check_object(struct object_entry *entry, uint32_t object_index)
+ default:
+ /* Not a delta hence we've already got all we need. */
+ oe_set_type(entry, entry->in_pack_type);
+- SET_SIZE(entry, in_pack_size);
++ SET_SIZE(entry, cast_size_t_to_ulong(in_pack_size));
+ entry->in_pack_header_size = used;
+ if (oe_type(entry) < OBJ_COMMIT || oe_type(entry) > OBJ_BLOB)
+ goto give_up;
+@@ builtin/pack-objects.c: static void check_object(struct object_entry *entry, uint32_t object_index)
+ if (have_base &&
+ can_reuse_delta(&base_ref, entry, &base_entry)) {
+ oe_set_type(entry, entry->in_pack_type);
+- SET_SIZE(entry, in_pack_size); /* delta size */
+- SET_DELTA_SIZE(entry, in_pack_size);
++ SET_SIZE(entry, cast_size_t_to_ulong(in_pack_size)); /* delta size */
++ SET_DELTA_SIZE(entry, cast_size_t_to_ulong(in_pack_size));
+
+ if (base_entry) {
+ SET_DELTA(entry, base_entry);
@@ builtin/pack-objects.c: unsigned long oe_get_size_slow(struct packing_data *pack,
struct pack_window *w_curs;
unsigned char *buf;
@@ builtin/pack-objects.c: unsigned long oe_get_size_slow(struct packing_data *pack
}
p = oe_in_pack(pack, e);
+@@ builtin/pack-objects.c: unsigned long oe_get_size_slow(struct packing_data *pack,
+
+ unuse_pack(&w_curs);
+ packing_data_unlock(&to_pack);
+- return size;
++ return cast_size_t_to_ulong(size);
+ }
+
+ static int try_delta(struct unpacked *trg, struct unpacked *src,
## object-file.c ##
@@ object-file.c: int odb_source_loose_read_object_stream(struct odb_read_stream **out,
4: 3274cba862 = 4: 8e87a4e71f delta, packfile: use size_t for delta header sizes
5: afa74a3a2b = 5: 34fec4a32d test-tool: add a helper to synthesize large packfiles
6: a3019888d8 = 6: 88f992903f t5608: add regression test for >4GB object clone
7: 859e93e7a9 = 7: 4f207c8a47 test-tool synthesize: use the unsafe hash for speed
8: 29b9a74e91 = 8: 2751c21c6e test-tool synthesize: precompute pack for 4 GiB + 1
9: 8e6e720804 = 9: 3a006d96c3 test-tool synthesize: add precomputed SHA-256 pack for 4 GiB + 1
10: 5b44410b2f = 10: 86c09af4f5 t5608: mark >4GB tests as EXPENSIVE
11: 1eaaa7fad7 = 11: 2159f6a271 ci: run expensive tests on push builds to integration branches
--
gitgitgadget
Thread overview: 60+ messages
2026-04-28 16:26 [PATCH 0/6] Handle cloning of objects larger than 4GB on Windows Johannes Schindelin via GitGitGadget
2026-04-28 16:26 ` [PATCH 1/6] index-pack, unpack-objects: use size_t for object size Johannes Schindelin via GitGitGadget
2026-04-30 14:13 ` Torsten Bögershausen
2026-05-03 14:46 ` Johannes Schindelin
2026-04-28 16:26 ` [PATCH 2/6] git-zlib: handle data streams larger than 4GB Johannes Schindelin via GitGitGadget
2026-04-28 16:26 ` [PATCH 3/6] odb, packfile: use size_t for streaming object sizes Johannes Schindelin via GitGitGadget
2026-04-28 16:26 ` [PATCH 4/6] delta, packfile: use size_t for delta header sizes Johannes Schindelin via GitGitGadget
2026-04-29 13:28 ` Derrick Stolee
2026-05-03 14:49 ` Johannes Schindelin
2026-04-28 16:26 ` [PATCH 5/6] test-tool: add a helper to synthesize large packfiles Johannes Schindelin via GitGitGadget
2026-04-28 16:26 ` [PATCH 6/6] t5608: add regression test for >4GB object clone Johannes Schindelin via GitGitGadget
2026-04-29 13:34 ` Derrick Stolee
2026-05-01 6:38 ` Jeff King
2026-05-01 13:19 ` Derrick Stolee
2026-05-04 17:07 ` Johannes Schindelin
2026-04-29 13:35 ` [PATCH 0/6] Handle cloning of objects larger than 4GB on Windows Derrick Stolee
2026-05-04 17:08 ` [PATCH v2 00/11] " Johannes Schindelin via GitGitGadget
2026-05-04 17:08 ` [PATCH v2 01/11] index-pack, unpack-objects: use size_t for object size Johannes Schindelin via GitGitGadget
2026-05-05 19:11 ` Torsten Bögershausen
2026-05-08 7:36 ` Johannes Schindelin
2026-05-08 19:09 ` Torsten Bögershausen
2026-05-10 2:41 ` Junio C Hamano
2026-05-10 9:14 ` Torsten Bögershausen
2026-05-04 17:08 ` [PATCH v2 02/11] git-zlib: handle data streams larger than 4GB Johannes Schindelin via GitGitGadget
2026-05-04 17:08 ` [PATCH v2 03/11] odb, packfile: use size_t for streaming object sizes Johannes Schindelin via GitGitGadget
2026-05-05 19:27 ` Torsten Bögershausen
2026-05-08 7:38 ` Johannes Schindelin
2026-05-04 17:08 ` [PATCH v2 04/11] delta, packfile: use size_t for delta header sizes Johannes Schindelin via GitGitGadget
2026-05-04 17:08 ` [PATCH v2 05/11] test-tool: add a helper to synthesize large packfiles Johannes Schindelin via GitGitGadget
2026-05-04 17:08 ` [PATCH v2 06/11] t5608: add regression test for >4GB object clone Johannes Schindelin via GitGitGadget
2026-05-04 17:08 ` [PATCH v2 07/11] test-tool synthesize: use the unsafe hash for speed Johannes Schindelin via GitGitGadget
2026-05-04 17:08 ` [PATCH v2 08/11] test-tool synthesize: precompute pack for 4 GiB + 1 Johannes Schindelin via GitGitGadget
2026-05-04 18:27 ` Derrick Stolee
2026-05-05 20:54 ` Johannes Schindelin
2026-05-04 17:08 ` [PATCH v2 09/11] test-tool synthesize: add precomputed SHA-256 " Johannes Schindelin via GitGitGadget
2026-05-04 17:08 ` [PATCH v2 10/11] t5608: mark >4GB tests as EXPENSIVE Johannes Schindelin via GitGitGadget
2026-05-04 17:08 ` [PATCH v2 11/11] ci: run expensive tests on push builds to integration branches Johannes Schindelin via GitGitGadget
2026-05-04 18:35 ` Derrick Stolee
2026-05-05 12:56 ` Junio C Hamano
2026-05-05 23:07 ` Junio C Hamano
2026-05-06 8:33 ` Johannes Schindelin
2026-05-07 9:18 ` Junio C Hamano
2026-05-07 10:24 ` Patrick Steinhardt
2026-05-08 2:50 ` Junio C Hamano
2026-05-08 8:16 ` Johannes Schindelin via GitGitGadget [this message]
2026-05-08 8:16 ` [PATCH v3 01/11] index-pack, unpack-objects: use size_t for object size Johannes Schindelin via GitGitGadget
2026-05-08 8:16 ` [PATCH v3 02/11] git-zlib: handle data streams larger than 4GB Johannes Schindelin via GitGitGadget
2026-05-08 8:16 ` [PATCH v3 03/11] odb, packfile: use size_t for streaming object sizes Johannes Schindelin via GitGitGadget
2026-05-08 8:16 ` [PATCH v3 04/11] delta, packfile: use size_t for delta header sizes Johannes Schindelin via GitGitGadget
2026-05-08 8:16 ` [PATCH v3 05/11] test-tool: add a helper to synthesize large packfiles Johannes Schindelin via GitGitGadget
2026-05-08 8:16 ` [PATCH v3 06/11] t5608: add regression test for >4GB object clone Johannes Schindelin via GitGitGadget
2026-05-08 8:16 ` [PATCH v3 07/11] test-tool synthesize: use the unsafe hash for speed Johannes Schindelin via GitGitGadget
2026-05-08 8:16 ` [PATCH v3 08/11] test-tool synthesize: precompute pack for 4 GiB + 1 Johannes Schindelin via GitGitGadget
2026-05-08 8:16 ` [PATCH v3 09/11] test-tool synthesize: add precomputed SHA-256 " Johannes Schindelin via GitGitGadget
2026-05-08 8:16 ` [PATCH v3 10/11] t5608: mark >4GB tests as EXPENSIVE Johannes Schindelin via GitGitGadget
2026-05-08 8:16 ` [PATCH v3 11/11] ci: run expensive tests on push builds to integration branches Johannes Schindelin via GitGitGadget
2026-05-10 23:51 ` [PATCH] ci: enable EXPENSIVE for contributor builds Junio C Hamano
2026-05-11 7:05 ` Patrick Steinhardt
2026-05-11 8:29 ` Junio C Hamano
2026-05-11 10:02 ` Patrick Steinhardt