From: "Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: "Derrick Stolee" <stolee@gmail.com>,
	"Torsten Bögershausen" <tboegi@web.de>,
	"Jeff King" <peff@peff.net>, "Patrick Steinhardt" <ps@pks.im>,
	"Johannes Schindelin" <johannes.schindelin@gmx.de>,
	"Johannes Schindelin" <johannes.schindelin@gmx.de>
Subject: [PATCH v3 08/11] test-tool synthesize: precompute pack for 4 GiB + 1
Date: Fri, 08 May 2026 08:16:46 +0000
Message-ID: <2751c21c6e730ac7f0ce9c2606bbd8b56dede11e.1778228209.git.gitgitgadget@gmail.com>
In-Reply-To: <pull.2102.v3.git.1778228209.gitgitgadget@gmail.com>

From: Johannes Schindelin <johannes.schindelin@gmx.de>

The synthesize helper hashes roughly 8 GiB of data through SHA-1 to
produce a pack containing a 4 GiB + 1 byte blob (about 4 GiB for the
pack checksum and about 4 GiB for the blob OID). Since the blob
content is all NUL bytes, every byte in the resulting pack file is
deterministic for a given blob size and hash algorithm.

Add a fast path that writes the pack from precomputed constants:
a 25-byte prefix (pack header, object header, zlib header, first
block header), the zero-filled bulk with periodic 5-byte deflate
block headers, an 11-byte final block with the adler32 trailer, and
a 513-byte suffix (tree, two commits, empty tree, pack SHA-1
checksum). This eliminates all SHA-1 and adler32 computation, making
the helper purely I/O-bound.
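
As a rough cross-check of the layout above, the block counts and the
resulting file size can be rederived from the blob size alone. The
sketch below is illustrative only: stored_layout_for(), fast_pack_size()
and the SKETCH_* macros are hypothetical names that do not appear in
the patch, and it assumes that the remainder after the full 0xffff-byte
blocks is nonzero and goes into a single final block, as it does for
this blob size.

```c
#include <stdint.h>

/* Hypothetical names for this sketch; none of them are in the patch. */
#define SKETCH_BLOCK_MAX 0xffffu /* largest LEN of a stored deflate block */
#define SKETCH_BLOCK_HDR 5u      /* BFINAL/BTYPE byte + LEN + NLEN */
#define SKETCH_ADLER_LEN 4u      /* zlib trailer */

struct stored_layout {
	uint64_t full_blocks; /* blocks carrying exactly 0xffff bytes */
	uint16_t final_len;   /* payload bytes in the BFINAL=1 block */
	uint16_t final_nlen;  /* one's complement of final_len */
};

static struct stored_layout stored_layout_for(uint64_t blob_size)
{
	struct stored_layout l;

	l.full_blocks = blob_size / SKETCH_BLOCK_MAX;
	l.final_len = (uint16_t)(blob_size % SKETCH_BLOCK_MAX);
	l.final_nlen = (uint16_t)~l.final_len;
	return l;
}

/*
 * Expected pack size, assuming prefix_len already covers the first
 * block header and there is at least one full block.
 */
static uint64_t fast_pack_size(uint64_t blob_size, uint64_t prefix_len,
			       uint64_t suffix_len)
{
	struct stored_layout l = stored_layout_for(blob_size);

	return prefix_len
	     + l.full_blocks * SKETCH_BLOCK_MAX        /* zero payload */
	     + (l.full_blocks - 1) * SKETCH_BLOCK_HDR  /* remaining headers */
	     + SKETCH_BLOCK_HDR + l.final_len + SKETCH_ADLER_LEN
	     + suffix_len;
}
```

For a blob of 4294967297 bytes this gives 65537 full blocks and a
2-byte final block (NLEN 0xfffd); with the 25-byte prefix and the
513-byte suffix the pack should come to 4295295524 bytes.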

The precomputed constants are stored in a struct fast_pack array
keyed by hash algorithm format_id, so that adding SHA-256 support
later requires only adding another array entry with its suffix.

The constants were generated by running the generic path and
extracting the non-zero bytes from the resulting pack file.
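
One of these constants can also be checked independently of the
generic path: for all-NUL input, Adler-32 has a closed form, because
the low half stays at 1 and the high half simply counts the bytes
modulo 65521. A minimal sketch, with adler32_of_zeros() and put_be32()
being hypothetical helpers that are not part of the patch:

```c
#include <stdint.h>

#define SKETCH_ADLER_MOD 65521u /* largest prime below 2^16 */

/* Adler-32 of n zero bytes, without touching any data. */
static uint32_t adler32_of_zeros(uint64_t n)
{
	uint32_t a = 1; /* 1 + sum of bytes; zeros never change it */
	uint32_t b = (uint32_t)(n % SKETCH_ADLER_MOD); /* a added n times */

	return (b << 16) | a;
}

/* zlib stores the checksum in big-endian byte order. */
static void put_be32(unsigned char out[4], uint32_t v)
{
	out[0] = (unsigned char)(v >> 24);
	out[1] = (unsigned char)(v >> 16);
	out[2] = (unsigned char)(v >> 8);
	out[3] = (unsigned char)v;
}
```

adler32_of_zeros(4294967297) evaluates to 0x00e20001, which matches the
trailer bytes 00 e2 00 01 after the final block; for n = 0 it yields
0x00000001, the trailer of the empty tree's stream in the suffix.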

Benchmarks generating a 4 GiB + 1 pack (3 runs each, SHA1DC on
x86_64):

  generic path:   88s / 81s / 140s
  fast path:      14s / 13s / 15s

On CI, where t5608 currently takes 200-850 seconds depending on the
job, the fast path cuts the pack-generation phase from minutes to
seconds, leaving only the clone operations themselves.

Assisted-by: Claude Opus 4.6
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 t/helper/test-synthesize.c | 202 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 201 insertions(+), 1 deletion(-)
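
For reviewers who want to sanity-check the object headers embedded in
the constants of this patch, the pack format's type-and-size encoding
is short enough to redo by hand. The sketch below follows the encoding
described in Git's pack format documentation; the function name
encode_pack_obj_header() is hypothetical and is not the helper Git
itself uses.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode a pack object header: 3 type bits and the low 4 size bits in
 * the first byte, then 7 size bits per byte, with the high bit set on
 * every byte except the last. "out" must have room for 10 bytes.
 * Hypothetical helper for illustration only.
 */
static size_t encode_pack_obj_header(unsigned char *out, unsigned type,
				     uint64_t size)
{
	size_t n = 0;
	unsigned char c = (unsigned char)((type << 4) | (size & 0x0f));

	size >>= 4;
	while (size) {
		out[n++] = c | 0x80; /* more size bytes follow */
		c = (unsigned char)(size & 0x7f);
		size >>= 7;
	}
	out[n++] = c;
	return n;
}
```

With type 3 (blob) and size 4294967297 this produces b1 80 80 80 80 01,
the bytes in fast_pack_prefix; type 2 (tree) with size 32 gives a0 02
and type 1 (commit) with size 181 gives 95 0b, matching the start of
the tree and first commit entries in the SHA-1 suffix.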

diff --git a/t/helper/test-synthesize.c b/t/helper/test-synthesize.c
index e2faaad7b4..83c40ee02a 100644
--- a/t/helper/test-synthesize.c
+++ b/t/helper/test-synthesize.c
@@ -112,6 +112,201 @@ static void write_pack_object(FILE *f, struct git_hash_ctx *pack_ctx,
 	algo->final_oid_fn(oid, &ctx);
 }
 
+/*
+ * Fast path: precomputed pack data for a 4 GiB + 1 all-NUL blob.
+ *
+ * The generated pack is almost entirely zeros with a small constant
+ * prefix, periodic deflate block headers, and a constant suffix with
+ * the tree, two commits, the empty tree, and the pack checksum.  As
+ * every byte is deterministic for a given blob size and hash algorithm,
+ * we can write the pack without computing any hashes at all, reducing
+ * runtime from minutes of hash computation to seconds of pure I/O.
+ *
+ * The blob is stored as an uncompressed deflate stream: a two-byte
+ * zlib header, then 65538 stored blocks (65537 of 0xffff bytes plus a
+ * final 2-byte one), followed by an adler32 checksum.  The pack header
+ * and deflate framing are shared across hash algorithms; only the
+ * suffix (which contains OIDs and the pack checksum) differs.
+ *
+ * Constants were generated by running the generic path and extracting
+ * the non-zero bytes from the resulting pack file.
+ */
+
+#define FAST_PACK_4G1_BLOB_SIZE ((size_t)4 * 1024 * 1024 * 1024 + 1)
+#define FAST_PACK_4G1_N_FULL_BLOCKS 65537
+
+/*
+ * Per-hash-algorithm constants for the fast path.  The prefix and
+ * deflate block structure are identical across algorithms; only the
+ * suffix (tree, commits, pack checksum) and the commit OID differ.
+ */
+struct fast_pack {
+	uint32_t format_id;
+	const unsigned char *suffix;
+	size_t suffix_len;
+	const char *commit_oid;
+};
+
+/* Pack header + pack object header + zlib header + first block header */
+static const unsigned char fast_pack_prefix[] = {
+	/* PACK header: signature, version 2, 5 objects */
+	0x50, 0x41, 0x43, 0x4b, 0x00, 0x00, 0x00, 0x02,
+	0x00, 0x00, 0x00, 0x05,
+	/* pack object header: blob, size = 4294967297 */
+	0xb1, 0x80, 0x80, 0x80, 0x80, 0x01,
+	/* zlib header: CMF=0x78, FLG=0x01 */
+	0x78, 0x01,
+	/* first non-final block header: BFINAL=0, LEN=0xffff, NLEN=0x0000 */
+	0x00, 0xff, 0xff, 0x00, 0x00
+};
+
+/* Every non-final deflate block header is identical */
+static const unsigned char fast_pack_block_header[] = {
+	0x00, 0xff, 0xff, 0x00, 0x00
+};
+
+/* Final block (2 data bytes) + adler32 of 4294967297 NUL bytes */
+static const unsigned char fast_pack_final_block[] = {
+	/* BFINAL=1, LEN=2, NLEN=0xfffd */
+	0x01, 0x02, 0x00, 0xfd, 0xff,
+	/* 2 NUL data bytes */
+	0x00, 0x00,
+	/* adler32 */
+	0x00, 0xe2, 0x00, 0x01
+};
+
+/*
+ * SHA-1 suffix: tree, commit, empty tree, final commit, pack checksum.
+ */
+static const unsigned char fast_pack_sha1_suffix[] = {
+	0xa0, 0x02, 0x78, 0x01, 0x01, 0x20, 0x00, 0xdf,
+	0xff, 0x31, 0x30, 0x30, 0x36, 0x34, 0x34, 0x20,
+	0x66, 0x69, 0x6c, 0x65, 0x00, 0x3e, 0xb7, 0xfe,
+	0xb1, 0x41, 0x3c, 0x75, 0x7f, 0x0d, 0x81, 0x81,
+	0xde, 0xb2, 0x8d, 0x1d, 0xab, 0x03, 0xd6, 0x48,
+	0x46, 0xb4, 0xb4, 0x0c, 0x60, 0x95, 0x0b, 0x78,
+	0x01, 0x01, 0xb5, 0x00, 0x4a, 0xff, 0x74, 0x72,
+	0x65, 0x65, 0x20, 0x63, 0x36, 0x38, 0x33, 0x66,
+	0x63, 0x63, 0x37, 0x64, 0x31, 0x64, 0x38, 0x33,
+	0x65, 0x66, 0x32, 0x66, 0x65, 0x31, 0x61, 0x66,
+	0x35, 0x35, 0x32, 0x31, 0x35, 0x64, 0x30, 0x31,
+	0x36, 0x38, 0x64, 0x62, 0x35, 0x32, 0x61, 0x33,
+	0x61, 0x33, 0x62, 0x0a, 0x61, 0x75, 0x74, 0x68,
+	0x6f, 0x72, 0x20, 0x41, 0x20, 0x55, 0x20, 0x54,
+	0x68, 0x6f, 0x72, 0x20, 0x3c, 0x61, 0x75, 0x74,
+	0x68, 0x6f, 0x72, 0x40, 0x65, 0x78, 0x61, 0x6d,
+	0x70, 0x6c, 0x65, 0x2e, 0x63, 0x6f, 0x6d, 0x3e,
+	0x20, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37,
+	0x38, 0x39, 0x30, 0x20, 0x2b, 0x30, 0x30, 0x30,
+	0x30, 0x0a, 0x63, 0x6f, 0x6d, 0x6d, 0x69, 0x74,
+	0x74, 0x65, 0x72, 0x20, 0x43, 0x20, 0x4f, 0x20,
+	0x4d, 0x69, 0x74, 0x74, 0x65, 0x72, 0x20, 0x3c,
+	0x63, 0x6f, 0x6d, 0x6d, 0x69, 0x74, 0x74, 0x65,
+	0x72, 0x40, 0x65, 0x78, 0x61, 0x6d, 0x70, 0x6c,
+	0x65, 0x2e, 0x63, 0x6f, 0x6d, 0x3e, 0x20, 0x31,
+	0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39,
+	0x30, 0x20, 0x2b, 0x30, 0x30, 0x30, 0x30, 0x0a,
+	0x0a, 0x4c, 0x61, 0x72, 0x67, 0x65, 0x20, 0x62,
+	0x6c, 0x6f, 0x62, 0x20, 0x63, 0x6f, 0x6d, 0x6d,
+	0x69, 0x74, 0x0a, 0xc6, 0x55, 0x37, 0x6b, 0x20,
+	0x78, 0x01, 0x01, 0x00, 0x00, 0xff, 0xff, 0x00,
+	0x00, 0x00, 0x01, 0x95, 0x0e, 0x78, 0x01, 0x01,
+	0xe5, 0x00, 0x1a, 0xff, 0x74, 0x72, 0x65, 0x65,
+	0x20, 0x34, 0x62, 0x38, 0x32, 0x35, 0x64, 0x63,
+	0x36, 0x34, 0x32, 0x63, 0x62, 0x36, 0x65, 0x62,
+	0x39, 0x61, 0x30, 0x36, 0x30, 0x65, 0x35, 0x34,
+	0x62, 0x66, 0x38, 0x64, 0x36, 0x39, 0x32, 0x38,
+	0x38, 0x66, 0x62, 0x65, 0x65, 0x34, 0x39, 0x30,
+	0x34, 0x0a, 0x70, 0x61, 0x72, 0x65, 0x6e, 0x74,
+	0x20, 0x63, 0x35, 0x62, 0x32, 0x31, 0x63, 0x36,
+	0x31, 0x31, 0x61, 0x61, 0x35, 0x39, 0x34, 0x65,
+	0x63, 0x39, 0x66, 0x64, 0x37, 0x65, 0x39, 0x32,
+	0x63, 0x66, 0x39, 0x36, 0x34, 0x38, 0x39, 0x31,
+	0x34, 0x63, 0x61, 0x34, 0x63, 0x32, 0x34, 0x31,
+	0x32, 0x0a, 0x61, 0x75, 0x74, 0x68, 0x6f, 0x72,
+	0x20, 0x41, 0x20, 0x55, 0x20, 0x54, 0x68, 0x6f,
+	0x72, 0x20, 0x3c, 0x61, 0x75, 0x74, 0x68, 0x6f,
+	0x72, 0x40, 0x65, 0x78, 0x61, 0x6d, 0x70, 0x6c,
+	0x65, 0x2e, 0x63, 0x6f, 0x6d, 0x3e, 0x20, 0x31,
+	0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39,
+	0x30, 0x20, 0x2b, 0x30, 0x30, 0x30, 0x30, 0x0a,
+	0x63, 0x6f, 0x6d, 0x6d, 0x69, 0x74, 0x74, 0x65,
+	0x72, 0x20, 0x43, 0x20, 0x4f, 0x20, 0x4d, 0x69,
+	0x74, 0x74, 0x65, 0x72, 0x20, 0x3c, 0x63, 0x6f,
+	0x6d, 0x6d, 0x69, 0x74, 0x74, 0x65, 0x72, 0x40,
+	0x65, 0x78, 0x61, 0x6d, 0x70, 0x6c, 0x65, 0x2e,
+	0x63, 0x6f, 0x6d, 0x3e, 0x20, 0x31, 0x32, 0x33,
+	0x34, 0x35, 0x36, 0x37, 0x38, 0x39, 0x30, 0x20,
+	0x2b, 0x30, 0x30, 0x30, 0x30, 0x0a, 0x0a, 0x45,
+	0x6d, 0x70, 0x74, 0x79, 0x20, 0x74, 0x72, 0x65,
+	0x65, 0x20, 0x63, 0x6f, 0x6d, 0x6d, 0x69, 0x74,
+	0x0a, 0xaa, 0xb8, 0x45, 0x01, 0x8e, 0xfc, 0xf0,
+	0x2f, 0x9c, 0xc5, 0xcc, 0x4f, 0x6a, 0x1a, 0xc9,
+	0x2b, 0x23, 0xa9, 0xff, 0x91, 0x06, 0xc2, 0x70,
+	0xe3
+};
+
+static const struct fast_pack fast_packs[] = {
+	{
+		.format_id = GIT_SHA1_FORMAT_ID,
+		.suffix = fast_pack_sha1_suffix,
+		.suffix_len = sizeof(fast_pack_sha1_suffix),
+		.commit_oid = "aac43daf40d0377af31aa9c798a4ae8a31b55c1d",
+	},
+};
+
+/*
+ * Try the fast path for known blob sizes.  Returns 1 if the pack was
+ * written from precomputed constants, 0 if the caller should fall
+ * through to the generic path.
+ */
+static int generate_fast_pack(const char *path, size_t blob_size,
+			      const struct git_hash_algo *algo)
+{
+	const struct fast_pack *fp = NULL;
+	FILE *f;
+	size_t i;
+
+	if (blob_size != FAST_PACK_4G1_BLOB_SIZE)
+		return 0;
+
+	for (i = 0; i < ARRAY_SIZE(fast_packs); i++) {
+		if (fast_packs[i].format_id == algo->format_id) {
+			fp = &fast_packs[i];
+			break;
+		}
+	}
+	if (!fp)
+		return 0;
+
+	f = xfopen(path, "wb");
+
+	fwrite_or_die(f, fast_pack_prefix, sizeof(fast_pack_prefix));
+
+	/* First full block: 0xffff zero bytes (header already in prefix) */
+	fwrite_or_die(f, zeros, BLOCK_SIZE);
+
+	/* Remaining non-final full blocks */
+	for (i = 1; i < FAST_PACK_4G1_N_FULL_BLOCKS; i++) {
+		fwrite_or_die(f, fast_pack_block_header,
+			      sizeof(fast_pack_block_header));
+		fwrite_or_die(f, zeros, BLOCK_SIZE);
+	}
+
+	/* Final block (2 data bytes) + adler32 */
+	fwrite_or_die(f, fast_pack_final_block,
+		      sizeof(fast_pack_final_block));
+
+	/* Tree, commits, and pack checksum */
+	fwrite_or_die(f, fp->suffix, fp->suffix_len);
+
+	if (fclose(f))
+		die_errno(_("could not close '%s'"), path);
+
+	printf("%s\n", fp->commit_oid);
+	return 1;
+}
+
 /*
  * Generate a pack file with a single large (>4GB) reachable object.
  *
@@ -127,7 +322,7 @@ static void write_pack_object(FILE *f, struct git_hash_ctx *pack_ctx,
 static int generate_pack_with_large_object(const char *path, size_t blob_size,
 					   const struct git_hash_algo *algo)
 {
-	FILE *f = xfopen(path, "wb");
+	FILE *f;
 	struct git_hash_ctx pack_ctx;
 	unsigned char pack_hash[GIT_MAX_RAWSZ];
 	struct object_id blob_oid, tree_oid, commit_oid, empty_tree_oid, final_commit_oid;
@@ -139,6 +334,11 @@ static int generate_pack_with_large_object(const char *path, size_t blob_size,
 		.hdr_entries = htonl(object_count),
 	};
 
+	if (generate_fast_pack(path, blob_size, algo))
+		return 0;
+
+	f = xfopen(path, "wb");
+
 	algo->init_fn(&pack_ctx);
 
 	/* Write pack header */
-- 
gitgitgadget

