git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/4] midx repack: fix overflow on 32 bit systems
@ 2025-05-20 15:04 Phillip Wood
  2025-05-20 15:04 ` [PATCH 1/4] midx repack: avoid integer " Phillip Wood
                   ` (4 more replies)
  0 siblings, 5 replies; 25+ messages in thread
From: Phillip Wood @ 2025-05-20 15:04 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Taylor Blau

From: Phillip Wood <phillip.wood@dunelm.org.uk>

This series fixes an overflow when running "git multi-pack-index
repack" on an old raspberry pi and a couple of other small issues I
noticed while reading the code. I'm unsure how realistic the example
of integer overflow on 64 bit systems in patch 2 is. I'm happy to drop
it if people who work with large repositories think its not worth
worrying about.

Base-Commit: cb96e1697ad6e54d11fc920c95f82977f8e438f8
Published-As: https://github.com/phillipwood/git/releases/tag/pw%2Fmidx-repack-overflow%2Fv1
View-Changes-At: https://github.com/phillipwood/git/compare/cb96e1697...29769df1c
Fetch-It-Via: git fetch https://github.com/phillipwood/git pw/midx-repack-overflow/v1


Phillip Wood (4):
  midx repack: avoid integer overflow on 32 bit systems
  midx repack: avoid potential integer overflow on 64 bit systems
  midx: avoid negative array index
  midx docs: clarify tie breaking

 Documentation/git-multi-pack-index.adoc |  6 ++++--
 git-compat-util.h                       | 16 ++++++++++++++++
 midx-write.c                            | 22 ++++++++++++++++------
 3 files changed, 36 insertions(+), 8 deletions(-)

-- 
2.49.0.897.gfad3eb7d210


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/4] midx repack: avoid integer overflow on 32 bit systems
  2025-05-20 15:04 [PATCH 0/4] midx repack: fix overflow on 32 bit systems Phillip Wood
@ 2025-05-20 15:04 ` Phillip Wood
  2025-05-20 17:54   ` Taylor Blau
  2025-05-21 13:10   ` D. Ben Knoble
  2025-05-20 15:04 ` [PATCH 2/4] midx repack: avoid potential integer overflow on 64 " Phillip Wood
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 25+ messages in thread
From: Phillip Wood @ 2025-05-20 15:04 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Taylor Blau, Phillip Wood

From: Phillip Wood <phillip.wood@dunelm.org.uk>

On a 32 bit system "git multi-pack-index --repack --batch-size=120M"
failed with

    fatal: size_t overflow: 6038786 * 1289

The calculation to estimated size of the objects in the pack referenced
by the multi-pack-index uses st_mult() to multiply the pack size by the
number of referenced objects before dividing by the total number of
objects in the pack. As size_t is 32 bits on 32 bit systems this
calculation easily overflows. Fix this by using 64bit arithmetic instead.

Also fix a potential overflow when caluculating the total size of the
objects referenced by the multipack index with a batch size larger
than SIZE_MAX / 2. In that case

    total_size += estimated_size

can overflow as both total_size and estimated_size can be greater that
SIZE_MAX / 2. This is addressed by using saturating arithmetic for the
addition.

Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
---
 git-compat-util.h | 16 ++++++++++++++++
 midx-write.c      | 12 ++++++++----
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/git-compat-util.h b/git-compat-util.h
index 36b9577c8d4..4678e21c4cb 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -668,6 +668,22 @@ static inline int cast_size_t_to_int(size_t a)
 	return (int)a;
 }
 
+static inline uint64_t u64_mult(uint64_t a, uint64_t b)
+{
+	if (unsigned_mult_overflows(a, b))
+		die("uint64_t overflow: %"PRIuMAX" * %"PRIuMAX,
+		    (uintmax_t)a, (uintmax_t)b);
+	return a * b;
+}
+
+static inline uint64_t u64_add(uint64_t a, uint64_t b)
+{
+	if (unsigned_add_overflows(a, b))
+		die("uint64_t overflow: %"PRIuMAX" + %"PRIuMAX,
+		    (uintmax_t)a, (uintmax_t)b);
+	return a + b;
+}
+
 /*
  * Limit size of IO chunks, because huge chunks only cause pain.  OS X
  * 64-bit is buggy, returning EINVAL if len >= INT_MAX; and even in
diff --git a/midx-write.c b/midx-write.c
index dd3b3070e55..c7cb2315431 100644
--- a/midx-write.c
+++ b/midx-write.c
@@ -1699,19 +1699,23 @@ static void fill_included_packs_batch(struct repository *r,
 	for (i = 0; total_size < batch_size && i < m->num_packs; i++) {
 		int pack_int_id = pack_info[i].pack_int_id;
 		struct packed_git *p = m->packs[pack_int_id];
-		size_t expected_size;
+		uint64_t expected_size;
 
 		if (!want_included_pack(r, m, pack_kept_objects, pack_int_id))
 			continue;
 
-		expected_size = st_mult(p->pack_size,
-					pack_info[i].referenced_objects);
+		expected_size = uint64_mult(p->pack_size,
+					    pack_info[i].referenced_objects);
 		expected_size /= p->num_objects;
 
 		if (expected_size >= batch_size)
 			continue;
 
-		total_size += expected_size;
+		if (unsigned_add_overflows (total_size, (size_t)expected_size))
+			total_size = SIZE_MAX;
+		else
+			total_size += expected_size;
+
 		include_pack[pack_int_id] = 1;
 	}
 
-- 
2.49.0.897.gfad3eb7d210


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 2/4] midx repack: avoid potential integer overflow on 64 bit systems
  2025-05-20 15:04 [PATCH 0/4] midx repack: fix overflow on 32 bit systems Phillip Wood
  2025-05-20 15:04 ` [PATCH 1/4] midx repack: avoid integer " Phillip Wood
@ 2025-05-20 15:04 ` Phillip Wood
  2025-05-20 17:59   ` Taylor Blau
  2025-05-20 15:04 ` [PATCH 3/4] midx: avoid negative array index Phillip Wood
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 25+ messages in thread
From: Phillip Wood @ 2025-05-20 15:04 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Taylor Blau, Phillip Wood

From: Phillip Wood <phillip.wood@dunelm.org.uk>

On a 64 bit system the calculation

    p->pack_size * pack_info[i].referenced_objects

could overflow. If a pack file contains 2^28 objects with an average
compressed size of 1KB then the pack size will be 2^38B. If all of the
objects are referenced by the multi-pack index the sum above will
overflow. Avoid this by using shifted integer arithmetic and changing
the order of the calculation so that the pack size is divided by the
total number of objects in the pack before multiplying by the number of
objects referenced by the multi-pack index. Using a shift of 14 bits
should give reasonable accuracy while avoiding overflow for pack sizes
less that 1PB.

Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
---
 midx-write.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/midx-write.c b/midx-write.c
index c7cb2315431..2ee381e8fcd 100644
--- a/midx-write.c
+++ b/midx-write.c
@@ -1704,9 +1704,15 @@ static void fill_included_packs_batch(struct repository *r,
 		if (!want_included_pack(r, m, pack_kept_objects, pack_int_id))
 			continue;
 
-		expected_size = uint64_mult(p->pack_size,
-					    pack_info[i].referenced_objects);
+		/*
+		 * Use shifted integer arithmetic to calculate the
+		 * expected pack size to ~4 significant digits without
+		 * overflow for packsizes less that 1PB.
+		 */
+		expected_size = (uint64_t)pack_info[i].referenced_objects << 14;
 		expected_size /= p->num_objects;
+		expected_size = u64_mult(expected_size, p->pack_size);
+		expected_size = u64_add(expected_size, 1u << 13) >> 14;
 
 		if (expected_size >= batch_size)
 			continue;
-- 
2.49.0.897.gfad3eb7d210


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 3/4] midx: avoid negative array index
  2025-05-20 15:04 [PATCH 0/4] midx repack: fix overflow on 32 bit systems Phillip Wood
  2025-05-20 15:04 ` [PATCH 1/4] midx repack: avoid integer " Phillip Wood
  2025-05-20 15:04 ` [PATCH 2/4] midx repack: avoid potential integer overflow on 64 " Phillip Wood
@ 2025-05-20 15:04 ` Phillip Wood
  2025-05-20 17:58   ` Taylor Blau
  2025-05-20 15:04 ` [PATCH 4/4] midx docs: clarify tie breaking Phillip Wood
  2025-05-22 15:55 ` [PATCH v2 0/4] midx repack: fix overflow on 32 bit systems Phillip Wood
  4 siblings, 1 reply; 25+ messages in thread
From: Phillip Wood @ 2025-05-20 15:04 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Taylor Blau, Phillip Wood

From: Phillip Wood <phillip.wood@dunelm.org.uk>

nth_midxed_pack_int_id() returns the index of the pack file in the multi
pack index's list of packfiles that the specified object. The index is
returned as a uint32_t. Storing this in an int will make the index
negative if the most significant bit is set. Fix this by using uint32_t
as the rest of the code does. This is unlikely to be a practical problem
as it requires the multipack index to reference 2^31 packfiles.

Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
---
 midx-write.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/midx-write.c b/midx-write.c
index 2ee381e8fcd..38a458d7322 100644
--- a/midx-write.c
+++ b/midx-write.c
@@ -1566,7 +1566,7 @@ int expire_midx_packs(struct repository *r, const char *object_dir, unsigned fla
 					  _("Counting referenced objects"),
 					  m->num_objects);
 	for (i = 0; i < m->num_objects; i++) {
-		int pack_int_id = nth_midxed_pack_int_id(m, i);
+		uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
 		count[pack_int_id]++;
 		display_progress(progress, i + 1);
 	}
@@ -1697,7 +1697,7 @@ static void fill_included_packs_batch(struct repository *r,
 
 	total_size = 0;
 	for (i = 0; total_size < batch_size && i < m->num_packs; i++) {
-		int pack_int_id = pack_info[i].pack_int_id;
+		uint32_t pack_int_id = pack_info[i].pack_int_id;
 		struct packed_git *p = m->packs[pack_int_id];
 		uint64_t expected_size;
 
-- 
2.49.0.897.gfad3eb7d210


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 4/4] midx docs: clarify tie breaking
  2025-05-20 15:04 [PATCH 0/4] midx repack: fix overflow on 32 bit systems Phillip Wood
                   ` (2 preceding siblings ...)
  2025-05-20 15:04 ` [PATCH 3/4] midx: avoid negative array index Phillip Wood
@ 2025-05-20 15:04 ` Phillip Wood
  2025-05-20 18:07   ` Taylor Blau
  2025-05-21 13:14   ` D. Ben Knoble
  2025-05-22 15:55 ` [PATCH v2 0/4] midx repack: fix overflow on 32 bit systems Phillip Wood
  4 siblings, 2 replies; 25+ messages in thread
From: Phillip Wood @ 2025-05-20 15:04 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Taylor Blau, Phillip Wood

From: Phillip Wood <phillip.wood@dunelm.org.uk>

Clarify what happens when an object exists in more than one pack, but
not in the preferred pack. If the user does not pass a preferred pack
then the pack with the lowest mtime is chosen as the preferred pack. For
objects that are not in the preferred pack the pack with the highest
mtime is used. "git multi-pack-index repack" relies on this behavior. If
ties were resolved in favor of the oldest pack as the current
documentation suggests the multi-pack index would not reference any of
the objects in the pack created by "git multi-pack-index repack".

Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
---
 Documentation/git-multi-pack-index.adoc | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-multi-pack-index.adoc b/Documentation/git-multi-pack-index.adoc
index 631d5c7d15c..1f016b2f682 100644
--- a/Documentation/git-multi-pack-index.adoc
+++ b/Documentation/git-multi-pack-index.adoc
@@ -40,8 +40,10 @@ write::
 	--preferred-pack=<pack>::
 		Optionally specify the tie-breaking pack used when
 		multiple packs contain the same object. `<pack>` must
-		contain at least one object. If not given, ties are
-		broken in favor of the pack with the lowest mtime.
+		contain at least one object. If not given the pack with
+		the lowest mtime is used as the preferred pack. Ties
+		for objects that are not contained in the preferred
+		are resolved in favor of the pack with the newest mtime.
 
 	--[no-]bitmap::
 		Control whether or not a multi-pack bitmap is written.
-- 
2.49.0.897.gfad3eb7d210


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/4] midx repack: avoid integer overflow on 32 bit systems
  2025-05-20 15:04 ` [PATCH 1/4] midx repack: avoid integer " Phillip Wood
@ 2025-05-20 17:54   ` Taylor Blau
  2025-05-21 15:19     ` Phillip Wood
  2025-05-21 13:10   ` D. Ben Knoble
  1 sibling, 1 reply; 25+ messages in thread
From: Taylor Blau @ 2025-05-20 17:54 UTC (permalink / raw)
  To: Phillip Wood; +Cc: git, Derrick Stolee, Phillip Wood

On Tue, May 20, 2025 at 04:04:24PM +0100, Phillip Wood wrote:
> diff --git a/midx-write.c b/midx-write.c
> index dd3b3070e55..c7cb2315431 100644
> --- a/midx-write.c
> +++ b/midx-write.c
> @@ -1699,19 +1699,23 @@ static void fill_included_packs_batch(struct repository *r,
>  	for (i = 0; total_size < batch_size && i < m->num_packs; i++) {
>  		int pack_int_id = pack_info[i].pack_int_id;
>  		struct packed_git *p = m->packs[pack_int_id];
> -		size_t expected_size;
> +		uint64_t expected_size;
>
>  		if (!want_included_pack(r, m, pack_kept_objects, pack_int_id))
>  			continue;
>
> -		expected_size = st_mult(p->pack_size,
> -					pack_info[i].referenced_objects);
> +		expected_size = uint64_mult(p->pack_size,
> +					    pack_info[i].referenced_objects);

Makes sense.

>  		expected_size /= p->num_objects;
>
>  		if (expected_size >= batch_size)
>  			continue;
>
> -		total_size += expected_size;
> +		if (unsigned_add_overflows (total_size, (size_t)expected_size))
> +			total_size = SIZE_MAX;
> +		else
> +			total_size += expected_size;
> +

But this part I am not totally following. Here we have 'total_size'
declared as a size_t, and 'expected_size' as a uint64_t, and (on 32-bit
systems) down-cast to a 32-bit unsigned value.

So if 'expected_size' is larger than SIZE_MAX, we should set
'total_size' to SIZE_MAX. But that may not happen, say if
'expected_size' is (2^32-1<<32). Should total_size also be declared as a
uint64_t here?

I wondered if it might be easier to count down from the given batch_size
instead of adding up to it (requiring the second
unsigned_add_overflows() check). I tried it out and got this instead:

--- 8< ---
diff --git a/midx-write.c b/midx-write.c
index 48a4dc5e94..f81dd9ff6d 100644
--- a/midx-write.c
+++ b/midx-write.c
@@ -1671,7 +1671,7 @@ static void fill_included_packs_batch(struct repository *r,
 				      size_t batch_size)
 {
 	uint32_t i;
-	size_t total_size;
+	uint64_t remaining = batch_size;
 	struct repack_info *pack_info;
 	int pack_kept_objects = 0;

@@ -1695,23 +1695,23 @@ static void fill_included_packs_batch(struct repository *r,

 	QSORT(pack_info, m->num_packs, compare_by_mtime);

-	total_size = 0;
-	for (i = 0; total_size < batch_size && i < m->num_packs; i++) {
+	for (i = 0; i < m->num_packs; i++) {
 		int pack_int_id = pack_info[i].pack_int_id;
 		struct packed_git *p = m->packs[pack_int_id];
-		size_t expected_size;
+		uint64_t expected_size, factor;

 		if (!want_included_pack(r, m, pack_kept_objects, pack_int_id))
 			continue;

-		expected_size = st_mult(p->pack_size,
-					pack_info[i].referenced_objects);
-		expected_size /= p->num_objects;
+		factor = pack_info[i].referenced_objects / p->num_objects;
+		if (p->pack_size > UINT64_MAX / factor)
+			die(...);

-		if (expected_size >= batch_size)
-			continue;
+		expected_size = p->pack_size * factor;
+		if (expected_size > remaining)
+			break;

-		total_size += expected_size;
+		remaining -= expected_size;
 		include_pack[pack_int_id] = 1;
 	}
--- >8 ---

That reduces the two overflow checks down to one, and avoids the need to
introduce a uint64_t-specific variant of the st_add() function.

Thanks,
Taylor

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/4] midx: avoid negative array index
  2025-05-20 15:04 ` [PATCH 3/4] midx: avoid negative array index Phillip Wood
@ 2025-05-20 17:58   ` Taylor Blau
  0 siblings, 0 replies; 25+ messages in thread
From: Taylor Blau @ 2025-05-20 17:58 UTC (permalink / raw)
  To: Phillip Wood; +Cc: git, Derrick Stolee, Phillip Wood

On Tue, May 20, 2025 at 04:04:26PM +0100, Phillip Wood wrote:
> From: Phillip Wood <phillip.wood@dunelm.org.uk>
>
> nth_midxed_pack_int_id() returns the index of the pack file in the multi
> pack index's list of packfiles that the specified object. The index is
> returned as a uint32_t. Storing this in an int will make the index
> negative if the most significant bit is set. Fix this by using uint32_t
> as the rest of the code does. This is unlikely to be a practical problem
> as it requires the multipack index to reference 2^31 packfiles.
>
> Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
> ---
>  midx-write.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/midx-write.c b/midx-write.c
> index 2ee381e8fcd..38a458d7322 100644
> --- a/midx-write.c
> +++ b/midx-write.c
> @@ -1566,7 +1566,7 @@ int expire_midx_packs(struct repository *r, const char *object_dir, unsigned fla
>  					  _("Counting referenced objects"),
>  					  m->num_objects);
>  	for (i = 0; i < m->num_objects; i++) {
> -		int pack_int_id = nth_midxed_pack_int_id(m, i);
> +		uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
>  		count[pack_int_id]++;
>  		display_progress(progress, i + 1);
>  	}
> @@ -1697,7 +1697,7 @@ static void fill_included_packs_batch(struct repository *r,
>
>  	total_size = 0;
>  	for (i = 0; total_size < batch_size && i < m->num_packs; i++) {
> -		int pack_int_id = pack_info[i].pack_int_id;
> +		uint32_t pack_int_id = pack_info[i].pack_int_id;
>  		struct packed_git *p = m->packs[pack_int_id];
>  		uint64_t expected_size;
>
> --
> 2.49.0.897.gfad3eb7d210
>

Thanks for catching these. I was going to comment on it as something I
noticed while reading the first patch, but I'm glad that you addressed
it here.

It looks like these two declarations date back to:

 - 19575c7c8e (multi-pack-index: implement 'expire' subcommand, 2019-06-10)
 - ce1e4a105b (midx: implement midx_repack(), 2019-06-10)

(both of which were merged via 4308d81d45 (Merge branch
'ds/midx-expire-repack', 2019-07-19)). But I think these have always had
the wrong type, since the pack_int_id field comes from commit fe1ed56f5e
(midx: sort and deduplicate objects from packfiles, 2018-07-12), where
it was first introduced as a uint32_t.

I actually think these all should size_t's since they're indexing into
an array, but that can be dealt with outside of this series, since this
is an obvious improvement.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/4] midx repack: avoid potential integer overflow on 64 bit systems
  2025-05-20 15:04 ` [PATCH 2/4] midx repack: avoid potential integer overflow on 64 " Phillip Wood
@ 2025-05-20 17:59   ` Taylor Blau
  2025-05-21 15:20     ` Phillip Wood
  0 siblings, 1 reply; 25+ messages in thread
From: Taylor Blau @ 2025-05-20 17:59 UTC (permalink / raw)
  To: Phillip Wood; +Cc: git, Derrick Stolee, Phillip Wood

On Tue, May 20, 2025 at 04:04:25PM +0100, Phillip Wood wrote:
> From: Phillip Wood <phillip.wood@dunelm.org.uk>
>
> On a 64 bit system the calculation
>
>     p->pack_size * pack_info[i].referenced_objects
>
> could overflow. If a pack file contains 2^28 objects with an average
> compressed size of 1KB then the pack size will be 2^38B. If all of the
> objects are referenced by the multi-pack index the sum above will
> overflow. Avoid this by using shifted integer arithmetic and changing
> the order of the calculation so that the pack size is divided by the
> total number of objects in the pack before multiplying by the number of
> objects referenced by the multi-pack index. Using a shift of 14 bits
> should give reasonable accuracy while avoiding overflow for pack sizes
> less that 1PB.

Ahhh, this renders some of comments on the previous patch moot. I think
that this is a not-unreasonable concern to be addressing even on modern
64-bit systems, since I have definitely encountered packs that have on
the order of ~2^28 objects in them.

I like this approach quite a bit, thanks!

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 4/4] midx docs: clarify tie breaking
  2025-05-20 15:04 ` [PATCH 4/4] midx docs: clarify tie breaking Phillip Wood
@ 2025-05-20 18:07   ` Taylor Blau
  2025-05-21 15:20     ` Phillip Wood
  2025-05-21 13:14   ` D. Ben Knoble
  1 sibling, 1 reply; 25+ messages in thread
From: Taylor Blau @ 2025-05-20 18:07 UTC (permalink / raw)
  To: Phillip Wood; +Cc: git, Derrick Stolee, Phillip Wood

On Tue, May 20, 2025 at 04:04:27PM +0100, Phillip Wood wrote:
> From: Phillip Wood <phillip.wood@dunelm.org.uk>
>
> Clarify what happens when an object exists in more than one pack, but
> not in the preferred pack. If the user does not pass a preferred pack
> then the pack with the lowest mtime is chosen as the preferred pack. For
> objects that are not in the preferred pack the pack with the highest
> mtime is used. "git multi-pack-index repack" relies on this behavior. If
> ties were resolved in favor of the oldest pack as the current
> documentation suggests the multi-pack index would not reference any of
> the objects in the pack created by "git multi-pack-index repack".

This commit message could likely be shortened since it is repeating some
information from the patch content itself, but I don't have a strong
opinion here.

> Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
> ---
>  Documentation/git-multi-pack-index.adoc | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/git-multi-pack-index.adoc b/Documentation/git-multi-pack-index.adoc
> index 631d5c7d15c..1f016b2f682 100644
> --- a/Documentation/git-multi-pack-index.adoc
> +++ b/Documentation/git-multi-pack-index.adoc
> @@ -40,8 +40,10 @@ write::
>  	--preferred-pack=<pack>::
>  		Optionally specify the tie-breaking pack used when
>  		multiple packs contain the same object. `<pack>` must
> -		contain at least one object. If not given, ties are
> -		broken in favor of the pack with the lowest mtime.
> +		contain at least one object. If not given the pack with
> +		the lowest mtime is used as the preferred pack. Ties
> +		for objects that are not contained in the preferred
> +		are resolved in favor of the pack with the newest mtime.

I think the clarification here is good, but the structure makes it a
little difficult to follow. The above reads to me like:

    1. What does --preferred-pack do?
    2. What restrictions are there on the pack?
    3. What happens if --preferred-pack is not given?
    4. What happens if the preferred pack does not contain the object?

But I think it might be clearer to structure this like:

    1. What does --preferred-pack do for objects in the preferred pack?
    2. What happens if the preferred pack does not contain the object?
    3. What happens if --preferred-pack is not given?
    4. What restrictions are there on the pack?

I tried to write something like this below:

    When specified, break ties in favor of this pack when there are
    additional copies of its objects in other packs. Ties for objects
    not found in the preferred pack are resolved in favor of the copy in
    the pack with the highest mtime. If unspecified, the pack with the
    lowest mtime is used by default. The preferred pack must have at
    least one object.

I think that the result here is a little easier to follow than what's
proposed above, but I am obviously biased ;-). If you think the two are
equivalent or mine is less clear than yours, feel free to ignore this.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/4] midx repack: avoid integer overflow on 32 bit systems
  2025-05-20 15:04 ` [PATCH 1/4] midx repack: avoid integer " Phillip Wood
  2025-05-20 17:54   ` Taylor Blau
@ 2025-05-21 13:10   ` D. Ben Knoble
  2025-05-21 15:01     ` Junio C Hamano
  2025-05-21 15:20     ` Phillip Wood
  1 sibling, 2 replies; 25+ messages in thread
From: D. Ben Knoble @ 2025-05-21 13:10 UTC (permalink / raw)
  To: Phillip Wood; +Cc: git, Derrick Stolee, Taylor Blau, Phillip Wood

On Tue, May 20, 2025 at 11:05 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>
> From: Phillip Wood <phillip.wood@dunelm.org.uk>
>
> On a 32 bit system "git multi-pack-index --repack --batch-size=120M"
> failed with
>
>     fatal: size_t overflow: 6038786 * 1289
>
> The calculation to estimated size of the objects in the pack referenced
> by the multi-pack-index uses st_mult() to multiply the pack size by the
> number of referenced objects before dividing by the total number of
> objects in the pack. As size_t is 32 bits on 32 bit systems this
> calculation easily overflows. Fix this by using 64bit arithmetic instead.
>
> Also fix a potential overflow when caluculating the total size of the
> objects referenced by the multipack index with a batch size larger
> than SIZE_MAX / 2. In that case
>
>     total_size += estimated_size
>
> can overflow as both total_size and estimated_size can be greater that
> SIZE_MAX / 2. This is addressed by using saturating arithmetic for the
> addition.
>
> Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
> ---
>  git-compat-util.h | 16 ++++++++++++++++
>  midx-write.c      | 12 ++++++++----
>  2 files changed, 24 insertions(+), 4 deletions(-)
>
> diff --git a/git-compat-util.h b/git-compat-util.h
> index 36b9577c8d4..4678e21c4cb 100644
> --- a/git-compat-util.h
> +++ b/git-compat-util.h
> @@ -668,6 +668,22 @@ static inline int cast_size_t_to_int(size_t a)
>         return (int)a;
>  }
>
> +static inline uint64_t u64_mult(uint64_t a, uint64_t b)
> +{
> +       if (unsigned_mult_overflows(a, b))
> +               die("uint64_t overflow: %"PRIuMAX" * %"PRIuMAX,
> +                   (uintmax_t)a, (uintmax_t)b);
> +       return a * b;
> +}
> +
> +static inline uint64_t u64_add(uint64_t a, uint64_t b)
> +{
> +       if (unsigned_add_overflows(a, b))
> +               die("uint64_t overflow: %"PRIuMAX" + %"PRIuMAX,
> +                   (uintmax_t)a, (uintmax_t)b);
> +       return a + b;
> +}
> +
>  /*
>   * Limit size of IO chunks, because huge chunks only cause pain.  OS X
>   * 64-bit is buggy, returning EINVAL if len >= INT_MAX; and even in
> diff --git a/midx-write.c b/midx-write.c
> index dd3b3070e55..c7cb2315431 100644
> --- a/midx-write.c
> +++ b/midx-write.c
> @@ -1699,19 +1699,23 @@ static void fill_included_packs_batch(struct repository *r,
>         for (i = 0; total_size < batch_size && i < m->num_packs; i++) {
>                 int pack_int_id = pack_info[i].pack_int_id;
>                 struct packed_git *p = m->packs[pack_int_id];
> -               size_t expected_size;
> +               uint64_t expected_size;
>
>                 if (!want_included_pack(r, m, pack_kept_objects, pack_int_id))
>                         continue;
>
> -               expected_size = st_mult(p->pack_size,
> -                                       pack_info[i].referenced_objects);
> +               expected_size = uint64_mult(p->pack_size,
> +                                           pack_info[i].referenced_objects);
>                 expected_size /= p->num_objects;
>
>                 if (expected_size >= batch_size)
>                         continue;
>
> -               total_size += expected_size;
> +               if (unsigned_add_overflows (total_size, (size_t)expected_size))

Style nit (only in case Taylor's approach doesn't prove better): I
wasn't expecting a space between the function and its argument list.

> +                       total_size = SIZE_MAX;
> +               else
> +                       total_size += expected_size;
> +
>                 include_pack[pack_int_id] = 1;
>         }
>
> --
> 2.49.0.897.gfad3eb7d210
>
>


-- 
D. Ben Knoble

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 4/4] midx docs: clarify tie breaking
  2025-05-20 15:04 ` [PATCH 4/4] midx docs: clarify tie breaking Phillip Wood
  2025-05-20 18:07   ` Taylor Blau
@ 2025-05-21 13:14   ` D. Ben Knoble
  1 sibling, 0 replies; 25+ messages in thread
From: D. Ben Knoble @ 2025-05-21 13:14 UTC (permalink / raw)
  To: Phillip Wood; +Cc: git, Derrick Stolee, Taylor Blau, Phillip Wood

On Tue, May 20, 2025 at 11:15 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>
> From: Phillip Wood <phillip.wood@dunelm.org.uk>
>
> Clarify what happens when an object exists in more than one pack, but
> not in the preferred pack. If the user does not pass a preferred pack
> then the pack with the lowest mtime is chosen as the preferred pack. For
> objects that are not in the preferred pack the pack with the highest
> mtime is used. "git multi-pack-index repack" relies on this behavior. If
> ties were resolved in favor of the oldest pack as the current
> documentation suggests the multi-pack index would not reference any of
> the objects in the pack created by "git multi-pack-index repack".
>
> Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
> ---
>  Documentation/git-multi-pack-index.adoc | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/git-multi-pack-index.adoc b/Documentation/git-multi-pack-index.adoc
> index 631d5c7d15c..1f016b2f682 100644
> --- a/Documentation/git-multi-pack-index.adoc
> +++ b/Documentation/git-multi-pack-index.adoc
> @@ -40,8 +40,10 @@ write::
>         --preferred-pack=<pack>::
>                 Optionally specify the tie-breaking pack used when
>                 multiple packs contain the same object. `<pack>` must
> -               contain at least one object. If not given, ties are
> -               broken in favor of the pack with the lowest mtime.
> +               contain at least one object. If not given the pack with
> +               the lowest mtime is used as the preferred pack. Ties
> +               for objects that are not contained in the preferred
> +               are resolved in favor of the pack with the newest mtime.

I think Taylor's reword caught this already: "preferred [what] are
resolved …"? (Probably "pack".)

>
>         --[no-]bitmap::
>                 Control whether or not a multi-pack bitmap is written.
> --
> 2.49.0.897.gfad3eb7d210
>
>


-- 
D. Ben Knoble

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/4] midx repack: avoid integer overflow on 32 bit systems
  2025-05-21 13:10   ` D. Ben Knoble
@ 2025-05-21 15:01     ` Junio C Hamano
  2025-05-21 15:20     ` Phillip Wood
  1 sibling, 0 replies; 25+ messages in thread
From: Junio C Hamano @ 2025-05-21 15:01 UTC (permalink / raw)
  To: D. Ben Knoble
  Cc: Phillip Wood, git, Derrick Stolee, Taylor Blau, Phillip Wood

"D. Ben Knoble" <ben.knoble@gmail.com> writes:

>> +               if (unsigned_add_overflows (total_size, (size_t)expected_size))
>
> Style nit (only in case Taylor's approach doesn't prove better): I
> wasn't expecting a space between the function and its argument list.

Good eyes.  I didn't notice it myself.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/4] midx repack: avoid integer overflow on 32 bit systems
  2025-05-20 17:54   ` Taylor Blau
@ 2025-05-21 15:19     ` Phillip Wood
  2025-05-23  0:34       ` Taylor Blau
  0 siblings, 1 reply; 25+ messages in thread
From: Phillip Wood @ 2025-05-21 15:19 UTC (permalink / raw)
  To: Taylor Blau, Phillip Wood; +Cc: git, Derrick Stolee

Hi Taylor

On 20/05/2025 18:54, Taylor Blau wrote:
> On Tue, May 20, 2025 at 04:04:24PM +0100, Phillip Wood wrote:
>> diff --git a/midx-write.c b/midx-write.c
>> index dd3b3070e55..c7cb2315431 100644
>> --- a/midx-write.c
>> +++ b/midx-write.c
>> @@ -1699,19 +1699,23 @@ static void fill_included_packs_batch(struct repository *r,
>>   	for (i = 0; total_size < batch_size && i < m->num_packs; i++) {
>>   		int pack_int_id = pack_info[i].pack_int_id;
>>   		struct packed_git *p = m->packs[pack_int_id];
>> -		size_t expected_size;
>> +		uint64_t expected_size;
>>
>>   		if (!want_included_pack(r, m, pack_kept_objects, pack_int_id))
>>   			continue;
>>
>> -		expected_size = st_mult(p->pack_size,
>> -					pack_info[i].referenced_objects);
>> +		expected_size = uint64_mult(p->pack_size,
>> +					    pack_info[i].referenced_objects);
> 
> Makes sense.
> 
>>   		expected_size /= p->num_objects;
>>
>>   		if (expected_size >= batch_size)
>>   			continue;
>>
>> -		total_size += expected_size;
>> +		if (unsigned_add_overflows (total_size, (size_t)expected_size))
>> +			total_size = SIZE_MAX;
>> +		else
>> +			total_size += expected_size;
>> +
> 
> But this part I am not totally following. Here we have 'total_size'
> declared as a size_t, and 'expected_size' as a uint64_t, and (on 32-bit
> systems) down-cast to a 32-bit unsigned value.
> 
> So if 'expected_size' is larger than SIZE_MAX, we should set
> 'total_size' to SIZE_MAX. But that may not happen, say if
> 'expected_size' is (2^32-1<<32). Should total_size also be declared as a
> uint64_t here?

By this point we know that expected_size < SIZE_MAX due to the test in 
the context lines above this change. batch_size is declared as size_t 
and to get here expected_size < batch_size. I'll add a sentence to the 
commit message to make that clearer.

> I wondered if it might be easier to count down from the given batch_size
> instead of adding up to it (requiring the second
> unsigned_add_overflows() check). I tried it out and got this instead:

I think you're right that we if we counted down we'd need one less 
comparison but I'm not sure if it is worth the churn. In the diff below

     factor = pack_info[i].referenced_objects / p->num_objects;

can only ever be zero or one as factor is declared as uint64_t so I 
don't think it works as-is. If you're happy with the shifted-integer 
approach in the next patch I'd rather just stick with that.

Thanks

Phillip

> --- 8< ---
> diff --git a/midx-write.c b/midx-write.c
> index 48a4dc5e94..f81dd9ff6d 100644
> --- a/midx-write.c
> +++ b/midx-write.c
> @@ -1671,7 +1671,7 @@ static void fill_included_packs_batch(struct repository *r,
>   				      size_t batch_size)
>   {
>   	uint32_t i;
> -	size_t total_size;
> +	uint64_t remaining = batch_size;
>   	struct repack_info *pack_info;
>   	int pack_kept_objects = 0;
> 
> @@ -1695,23 +1695,23 @@ static void fill_included_packs_batch(struct repository *r,
> 
>   	QSORT(pack_info, m->num_packs, compare_by_mtime);
> 
> -	total_size = 0;
> -	for (i = 0; total_size < batch_size && i < m->num_packs; i++) {
> +	for (i = 0; i < m->num_packs; i++) {
>   		int pack_int_id = pack_info[i].pack_int_id;
>   		struct packed_git *p = m->packs[pack_int_id];
> -		size_t expected_size;
> +		uint64_t expected_size, factor;
> 
>   		if (!want_included_pack(r, m, pack_kept_objects, pack_int_id))
>   			continue;
> 
> -		expected_size = st_mult(p->pack_size,
> -					pack_info[i].referenced_objects);
> -		expected_size /= p->num_objects;
> +		factor = pack_info[i].referenced_objects / p->num_objects;
> +		if (p->pack_size > UINT64_MAX / factor)
> +			die(...);
> 
> -		if (expected_size >= batch_size)
> -			continue;
> +		expected_size = p->pack_size * factor;
> +		if (expected_size > remaining)
> +			break;
> 
> -		total_size += expected_size;
> +		remaining -= expected_size;
>   		include_pack[pack_int_id] = 1;
>   	}
> --- >8 ---
> 
> That reduces the two overflow checks down to one, and avoids the need to
> introduce a uint64_t-specific variant of the st_add() function.
> 
> Thanks,
> Taylor

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/4] midx repack: avoid potential integer overflow on 64 bit systems
  2025-05-20 17:59   ` Taylor Blau
@ 2025-05-21 15:20     ` Phillip Wood
  0 siblings, 0 replies; 25+ messages in thread
From: Phillip Wood @ 2025-05-21 15:20 UTC (permalink / raw)
  To: Taylor Blau, Phillip Wood; +Cc: git, Derrick Stolee

On 20/05/2025 18:59, Taylor Blau wrote:
> On Tue, May 20, 2025 at 04:04:25PM +0100, Phillip Wood wrote:
>> From: Phillip Wood <phillip.wood@dunelm.org.uk>
>>
>> On a 64 bit system the calculation
>>
>>      p->pack_size * pack_info[i].referenced_objects
>>
>> could overflow. If a pack file contains 2^28 objects with an average
>> compressed size of 1KB then the pack size will be 2^38B. If all of the
>> objects are referenced by the multi-pack index the sum above will
>> overflow. Avoid this by using shifted integer arithmetic and changing
>> the order of the calculation so that the pack size is divided by the
>> total number of objects in the pack before multiplying by the number of
>> objects referenced by the multi-pack index. Using a shift of 14 bits
>> should give reasonable accuracy while avoiding overflow for pack sizes
>> less that 1PB.
> 
> Ahhh, this renders some of comments on the previous patch moot. I think
> that this is a not-unreasonable concern to be addressing even on modern
> 64-bit systems, since I have definitely encountered packs that have on
> the order of ~2^28 objects in them.

Thanks, that's good to know

Phillip

> I like this approach quite a bit, thanks!
> 
> Thanks,
> Taylor

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 4/4] midx docs: clarify tie breaking
  2025-05-20 18:07   ` Taylor Blau
@ 2025-05-21 15:20     ` Phillip Wood
  0 siblings, 0 replies; 25+ messages in thread
From: Phillip Wood @ 2025-05-21 15:20 UTC (permalink / raw)
  To: Taylor Blau, Phillip Wood; +Cc: git, Derrick Stolee

On 20/05/2025 19:07, Taylor Blau wrote:
> On Tue, May 20, 2025 at 04:04:27PM +0100, Phillip Wood wrote:
>> From: Phillip Wood <phillip.wood@dunelm.org.uk>
>>
>> Clarify what happens when an object exists in more than one pack, but
>> not in the preferred pack. If the user does not pass a preferred pack
>> then the pack with the lowest mtime is chosen as the preferred pack. For
>> objects that are not in the preferred pack the pack with the highest
>> mtime is used. "git multi-pack-index repack" relies on this behavior. If
>> ties were resolved in favor of the oldest pack as the current
>> documentation suggests the multi-pack index would not reference any of
>> the objects in the pack created by "git multi-pack-index repack".
> 
> This commit message could likely be shortened since it is repeating some
> information from the patch content itself, but I don't have a strong
> opinion here.

Yes there is a bit a repetition. The thing I wanted to get across is 
that "git multi-pack-index repack" relies on favoring the newest pack if 
an object is not in the preferred pack.

>> Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
>> ---
>>   Documentation/git-multi-pack-index.adoc | 6 ++++--
>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/git-multi-pack-index.adoc b/Documentation/git-multi-pack-index.adoc
>> index 631d5c7d15c..1f016b2f682 100644
>> --- a/Documentation/git-multi-pack-index.adoc
>> +++ b/Documentation/git-multi-pack-index.adoc
>> @@ -40,8 +40,10 @@ write::
>>   	--preferred-pack=<pack>::
>>   		Optionally specify the tie-breaking pack used when
>>   		multiple packs contain the same object. `<pack>` must
>> -		contain at least one object. If not given, ties are
>> -		broken in favor of the pack with the lowest mtime.
>> +		contain at least one object. If not given the pack with
>> +		the lowest mtime is used as the preferred pack. Ties
>> +		for objects that are not contained in the preferred
>> +		are resolved in favor of the pack with the newest mtime.
> 
> I think the clarification here is good, but the structure makes it a
> little difficult to follow. The above reads to me like:
> 
>      1. What does --preferred-pack do?
>      2. What restrictions are there on the pack?
>      3. What happens if --preferred-pack is not given?
>      4. What happens if the preferred pack does not contain the object?
> 
> But I think it might be clearer to structure this like:
> 
>      1. What does --preferred-pack do for objects in the preferred pack?
>      2. What happens if the preferred pack does not contain the object?
>      3. What happens if --preferred-pack is not given?
>      4. What restrictions are there on the pack?
> 
> I tried to write something like this below:
> 
>      When specified, break ties in favor of this pack when there are
>      additional copies of its objects in other packs. Ties for objects
>      not found in the preferred pack are resolved in favor of the copy in

I'm tempted to say "are always resolved" here to make it clear that the 
behavior does not depend on --preferred-pack

>      the pack with the highest mtime. If unspecified, the pack with the
>      lowest mtime is used by default. The preferred pack must have at
>      least one object.
> 
> I think that the result here is a little easier to follow than what's
> proposed above, but I am obviously biased ;-). If you think the two are
> equivalent or mine is less clear than yours, feel free to ignore this.

With the tweak above I like your version

Thanks

Phillip

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/4] midx repack: avoid integer overflow on 32 bit systems
  2025-05-21 13:10   ` D. Ben Knoble
  2025-05-21 15:01     ` Junio C Hamano
@ 2025-05-21 15:20     ` Phillip Wood
  1 sibling, 0 replies; 25+ messages in thread
From: Phillip Wood @ 2025-05-21 15:20 UTC (permalink / raw)
  To: D. Ben Knoble, Phillip Wood; +Cc: git, Derrick Stolee, Taylor Blau

Hi Ben

On 21/05/2025 14:10, D. Ben Knoble wrote:
> On Tue, May 20, 2025 at 11:05 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>> +               if (unsigned_add_overflows (total_size, (size_t)expected_size))
> 
> Style nit (only in case Taylor's approach doesn't prove better): I
> wasn't expecting a space between the function and its argument list.

Good catch, I'll fix that

Thanks

Phillip

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v2 0/4] midx repack: fix overflow on 32 bit systems
  2025-05-20 15:04 [PATCH 0/4] midx repack: fix overflow on 32 bit systems Phillip Wood
                   ` (3 preceding siblings ...)
  2025-05-20 15:04 ` [PATCH 4/4] midx docs: clarify tie breaking Phillip Wood
@ 2025-05-22 15:55 ` Phillip Wood
  2025-05-22 15:55   ` [PATCH v2 1/4] midx repack: avoid integer " Phillip Wood
                     ` (4 more replies)
  4 siblings, 5 replies; 25+ messages in thread
From: Phillip Wood @ 2025-05-22 15:55 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Taylor Blau, D . Ben Knoble

From: Phillip Wood <phillip.wood@dunelm.org.uk>

Thanks to Taylor and Ben for their comments on V1.
Changes since V1:

 - Patch 1: style fix and clarified commit message
 - Patch 4: reworded commit message and documentation

Cover Letter for V1:

This series fixes an overflow when running "git multi-pack-index
repack" on an old raspberry pi and a couple of other small issues I
noticed while reading the code. I'm unsure how realistic the example
of integer overflow on 64 bit systems in patch 2 is. I'm happy to drop
it if people who work with large repositories think its not worth
worrying about.

Base-Commit: cb96e1697ad6e54d11fc920c95f82977f8e438f8
Published-As: https://github.com/phillipwood/git/releases/tag/pw%2Fmidx-repack-overflow%2Fv2
View-Changes-At: https://github.com/phillipwood/git/compare/cb96e1697...a140181bd
Fetch-It-Via: git fetch https://github.com/phillipwood/git pw/midx-repack-overflow/v2


Phillip Wood (4):
  midx repack: avoid integer overflow on 32 bit systems
  midx repack: avoid potential integer overflow on 64 bit systems
  midx: avoid negative array index
  midx docs: clarify tie breaking

 Documentation/git-multi-pack-index.adoc | 11 +++++++----
 git-compat-util.h                       | 16 ++++++++++++++++
 midx-write.c                            | 22 ++++++++++++++++------
 3 files changed, 39 insertions(+), 10 deletions(-)

Range-diff against v1:
1:  cbc5e69b908 ! 1:  9a1e6c81688 midx repack: avoid integer overflow on 32 bit systems
    @@ Commit message
     
         can overflow as both total_size and estimated_size can be greater that
         SIZE_MAX / 2. This is addressed by using saturating arithmetic for the
    -    addition.
    +    addition. Although estimated_size is of type uint64_t by the time we
    +    reach this sum it is bounded by the batch size which is of type size_t
    +    and so casting estimated_size to size_t does not truncate the value.
     
         Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
     
    @@ midx-write.c: static void fill_included_packs_batch(struct repository *r,
      			continue;
      
     -		total_size += expected_size;
    -+		if (unsigned_add_overflows (total_size, (size_t)expected_size))
    ++		if (unsigned_add_overflows(total_size, (size_t)expected_size))
     +			total_size = SIZE_MAX;
     +		else
     +			total_size += expected_size;
2:  9f07da4fe71 = 2:  54303d96c31 midx repack: avoid potential integer overflow on 64 bit systems
3:  688b0273604 = 3:  5b6cfb9d212 midx: avoid negative array index
4:  29769df1c60 ! 4:  a140181bd57 midx docs: clarify tie breaking
    @@ Commit message
         midx docs: clarify tie breaking
     
         Clarify what happens when an object exists in more than one pack, but
    -    not in the preferred pack. If the user does not pass a preferred pack
    -    then the pack with the lowest mtime is chosen as the preferred pack. For
    -    objects that are not in the preferred pack the pack with the highest
    -    mtime is used. "git multi-pack-index repack" relies on this behavior. If
    -    ties were resolved in favor of the oldest pack as the current
    -    documentation suggests the multi-pack index would not reference any of
    -    the objects in the pack created by "git multi-pack-index repack".
    +    not in the preferred pack. "git multi-pack-index repack" relies on ties
    +    for objects that are not in the preferred pack being resolved in favor
    +    of the newest pack that contains a copy of the object. If ties were
    +    resolved in favor of the oldest pack as the current documentation
    +    suggests the multi-pack index would not reference any of the objects in
    +    the pack created by "git multi-pack-index repack".
     
    +    Helped-by: Taylor Blau <me@ttaylorr.com>
         Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
     
      ## Documentation/git-multi-pack-index.adoc ##
     @@ Documentation/git-multi-pack-index.adoc: write::
    + +
    + --
      	--preferred-pack=<pack>::
    - 		Optionally specify the tie-breaking pack used when
    - 		multiple packs contain the same object. `<pack>` must
    +-		Optionally specify the tie-breaking pack used when
    +-		multiple packs contain the same object. `<pack>` must
     -		contain at least one object. If not given, ties are
     -		broken in favor of the pack with the lowest mtime.
    -+		contain at least one object. If not given the pack with
    -+		the lowest mtime is used as the preferred pack. Ties
    -+		for objects that are not contained in the preferred
    -+		are resolved in favor of the pack with the newest mtime.
    ++		When specified, break ties in favor of this pack when
    ++		there are additional copies of its objects in other
    ++		packs. Ties for objects not found in the preferred
    ++		pack are always resolved in favor of the copy in the
    ++		pack with the highest mtime. If unspecified, the pack
    ++		with the lowest mtime is used by default. The
    ++		preferred pack must have at least one object.
      
      	--[no-]bitmap::
      		Control whether or not a multi-pack bitmap is written.
-- 
2.49.0.897.gfad3eb7d210


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v2 1/4] midx repack: avoid integer overflow on 32 bit systems
  2025-05-22 15:55 ` [PATCH v2 0/4] midx repack: fix overflow on 32 bit systems Phillip Wood
@ 2025-05-22 15:55   ` Phillip Wood
  2025-05-22 15:55   ` [PATCH v2 2/4] midx repack: avoid potential integer overflow on 64 " Phillip Wood
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 25+ messages in thread
From: Phillip Wood @ 2025-05-22 15:55 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Taylor Blau, D . Ben Knoble, Phillip Wood

From: Phillip Wood <phillip.wood@dunelm.org.uk>

On a 32 bit system "git multi-pack-index --repack --batch-size=120M"
failed with

    fatal: size_t overflow: 6038786 * 1289

The calculation to estimated size of the objects in the pack referenced
by the multi-pack-index uses st_mult() to multiply the pack size by the
number of referenced objects before dividing by the total number of
objects in the pack. As size_t is 32 bits on 32 bit systems this
calculation easily overflows. Fix this by using 64bit arithmetic instead.

Also fix a potential overflow when caluculating the total size of the
objects referenced by the multipack index with a batch size larger
than SIZE_MAX / 2. In that case

    total_size += estimated_size

can overflow as both total_size and estimated_size can be greater that
SIZE_MAX / 2. This is addressed by using saturating arithmetic for the
addition. Although estimated_size is of type uint64_t by the time we
reach this sum it is bounded by the batch size which is of type size_t
and so casting estimated_size to size_t does not truncate the value.

Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
---
 git-compat-util.h | 16 ++++++++++++++++
 midx-write.c      | 12 ++++++++----
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/git-compat-util.h b/git-compat-util.h
index 36b9577c8d4..4678e21c4cb 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -668,6 +668,22 @@ static inline int cast_size_t_to_int(size_t a)
 	return (int)a;
 }
 
+static inline uint64_t u64_mult(uint64_t a, uint64_t b)
+{
+	if (unsigned_mult_overflows(a, b))
+		die("uint64_t overflow: %"PRIuMAX" * %"PRIuMAX,
+		    (uintmax_t)a, (uintmax_t)b);
+	return a * b;
+}
+
+static inline uint64_t u64_add(uint64_t a, uint64_t b)
+{
+	if (unsigned_add_overflows(a, b))
+		die("uint64_t overflow: %"PRIuMAX" + %"PRIuMAX,
+		    (uintmax_t)a, (uintmax_t)b);
+	return a + b;
+}
+
 /*
  * Limit size of IO chunks, because huge chunks only cause pain.  OS X
  * 64-bit is buggy, returning EINVAL if len >= INT_MAX; and even in
diff --git a/midx-write.c b/midx-write.c
index dd3b3070e55..105014a2792 100644
--- a/midx-write.c
+++ b/midx-write.c
@@ -1699,19 +1699,23 @@ static void fill_included_packs_batch(struct repository *r,
 	for (i = 0; total_size < batch_size && i < m->num_packs; i++) {
 		int pack_int_id = pack_info[i].pack_int_id;
 		struct packed_git *p = m->packs[pack_int_id];
-		size_t expected_size;
+		uint64_t expected_size;
 
 		if (!want_included_pack(r, m, pack_kept_objects, pack_int_id))
 			continue;
 
-		expected_size = st_mult(p->pack_size,
-					pack_info[i].referenced_objects);
+		expected_size = uint64_mult(p->pack_size,
+					    pack_info[i].referenced_objects);
 		expected_size /= p->num_objects;
 
 		if (expected_size >= batch_size)
 			continue;
 
-		total_size += expected_size;
+		if (unsigned_add_overflows(total_size, (size_t)expected_size))
+			total_size = SIZE_MAX;
+		else
+			total_size += expected_size;
+
 		include_pack[pack_int_id] = 1;
 	}
 
-- 
2.49.0.897.gfad3eb7d210


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v2 2/4] midx repack: avoid potential integer overflow on 64 bit systems
  2025-05-22 15:55 ` [PATCH v2 0/4] midx repack: fix overflow on 32 bit systems Phillip Wood
  2025-05-22 15:55   ` [PATCH v2 1/4] midx repack: avoid integer " Phillip Wood
@ 2025-05-22 15:55   ` Phillip Wood
  2025-05-22 15:55   ` [PATCH v2 3/4] midx: avoid negative array index Phillip Wood
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 25+ messages in thread
From: Phillip Wood @ 2025-05-22 15:55 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Taylor Blau, D . Ben Knoble, Phillip Wood

From: Phillip Wood <phillip.wood@dunelm.org.uk>

On a 64 bit system the calculation

    p->pack_size * pack_info[i].referenced_objects

could overflow. If a pack file contains 2^28 objects with an average
compressed size of 1KB then the pack size will be 2^38B. If all of the
objects are referenced by the multi-pack index the sum above will
overflow. Avoid this by using shifted integer arithmetic and changing
the order of the calculation so that the pack size is divided by the
total number of objects in the pack before multiplying by the number of
objects referenced by the multi-pack index. Using a shift of 14 bits
should give reasonable accuracy while avoiding overflow for pack sizes
less that 1PB.

Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
---
 midx-write.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/midx-write.c b/midx-write.c
index 105014a2792..8121e96f4fd 100644
--- a/midx-write.c
+++ b/midx-write.c
@@ -1704,9 +1704,15 @@ static void fill_included_packs_batch(struct repository *r,
 		if (!want_included_pack(r, m, pack_kept_objects, pack_int_id))
 			continue;
 
-		expected_size = uint64_mult(p->pack_size,
-					    pack_info[i].referenced_objects);
+		/*
+		 * Use shifted integer arithmetic to calculate the
+		 * expected pack size to ~4 significant digits without
+		 * overflow for packsizes less that 1PB.
+		 */
+		expected_size = (uint64_t)pack_info[i].referenced_objects << 14;
 		expected_size /= p->num_objects;
+		expected_size = u64_mult(expected_size, p->pack_size);
+		expected_size = u64_add(expected_size, 1u << 13) >> 14;
 
 		if (expected_size >= batch_size)
 			continue;
-- 
2.49.0.897.gfad3eb7d210


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v2 3/4] midx: avoid negative array index
  2025-05-22 15:55 ` [PATCH v2 0/4] midx repack: fix overflow on 32 bit systems Phillip Wood
  2025-05-22 15:55   ` [PATCH v2 1/4] midx repack: avoid integer " Phillip Wood
  2025-05-22 15:55   ` [PATCH v2 2/4] midx repack: avoid potential integer overflow on 64 " Phillip Wood
@ 2025-05-22 15:55   ` Phillip Wood
  2025-05-22 15:55   ` [PATCH v2 4/4] midx docs: clarify tie breaking Phillip Wood
  2025-05-23  0:36   ` [PATCH v2 0/4] midx repack: fix overflow on 32 bit systems Taylor Blau
  4 siblings, 0 replies; 25+ messages in thread
From: Phillip Wood @ 2025-05-22 15:55 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Taylor Blau, D . Ben Knoble, Phillip Wood

From: Phillip Wood <phillip.wood@dunelm.org.uk>

nth_midxed_pack_int_id() returns the index of the pack file in the multi
pack index's list of packfiles that the specified object. The index is
returned as a uint32_t. Storing this in an int will make the index
negative if the most significant bit is set. Fix this by using uint32_t
as the rest of the code does. This is unlikely to be a practical problem
as it requires the multipack index to reference 2^31 packfiles.

Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
---
 midx-write.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/midx-write.c b/midx-write.c
index 8121e96f4fd..ba4a94950a8 100644
--- a/midx-write.c
+++ b/midx-write.c
@@ -1566,7 +1566,7 @@ int expire_midx_packs(struct repository *r, const char *object_dir, unsigned fla
 					  _("Counting referenced objects"),
 					  m->num_objects);
 	for (i = 0; i < m->num_objects; i++) {
-		int pack_int_id = nth_midxed_pack_int_id(m, i);
+		uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
 		count[pack_int_id]++;
 		display_progress(progress, i + 1);
 	}
@@ -1697,7 +1697,7 @@ static void fill_included_packs_batch(struct repository *r,
 
 	total_size = 0;
 	for (i = 0; total_size < batch_size && i < m->num_packs; i++) {
-		int pack_int_id = pack_info[i].pack_int_id;
+		uint32_t pack_int_id = pack_info[i].pack_int_id;
 		struct packed_git *p = m->packs[pack_int_id];
 		uint64_t expected_size;
 
-- 
2.49.0.897.gfad3eb7d210


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v2 4/4] midx docs: clarify tie breaking
  2025-05-22 15:55 ` [PATCH v2 0/4] midx repack: fix overflow on 32 bit systems Phillip Wood
                     ` (2 preceding siblings ...)
  2025-05-22 15:55   ` [PATCH v2 3/4] midx: avoid negative array index Phillip Wood
@ 2025-05-22 15:55   ` Phillip Wood
  2025-05-23  0:36   ` [PATCH v2 0/4] midx repack: fix overflow on 32 bit systems Taylor Blau
  4 siblings, 0 replies; 25+ messages in thread
From: Phillip Wood @ 2025-05-22 15:55 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Taylor Blau, D . Ben Knoble, Phillip Wood

From: Phillip Wood <phillip.wood@dunelm.org.uk>

Clarify what happens when an object exists in more than one pack, but
not in the preferred pack. "git multi-pack-index repack" relies on ties
for objects that are not in the preferred pack being resolved in favor
of the newest pack that contains a copy of the object. If ties were
resolved in favor of the oldest pack as the current documentation
suggests the multi-pack index would not reference any of the objects in
the pack created by "git multi-pack-index repack".

Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
---
 Documentation/git-multi-pack-index.adoc | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/Documentation/git-multi-pack-index.adoc b/Documentation/git-multi-pack-index.adoc
index 631d5c7d15c..b6cd0d7f855 100644
--- a/Documentation/git-multi-pack-index.adoc
+++ b/Documentation/git-multi-pack-index.adoc
@@ -38,10 +38,13 @@ write::
 +
 --
 	--preferred-pack=<pack>::
-		Optionally specify the tie-breaking pack used when
-		multiple packs contain the same object. `<pack>` must
-		contain at least one object. If not given, ties are
-		broken in favor of the pack with the lowest mtime.
+		When specified, break ties in favor of this pack when
+		there are additional copies of its objects in other
+		packs. Ties for objects not found in the preferred
+		pack are always resolved in favor of the copy in the
+		pack with the highest mtime. If unspecified, the pack
+		with the lowest mtime is used by default. The
+		preferred pack must have at least one object.
 
 	--[no-]bitmap::
 		Control whether or not a multi-pack bitmap is written.
-- 
2.49.0.897.gfad3eb7d210


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/4] midx repack: avoid integer overflow on 32 bit systems
  2025-05-21 15:19     ` Phillip Wood
@ 2025-05-23  0:34       ` Taylor Blau
  0 siblings, 0 replies; 25+ messages in thread
From: Taylor Blau @ 2025-05-23  0:34 UTC (permalink / raw)
  To: phillip.wood; +Cc: git, Derrick Stolee

On Wed, May 21, 2025 at 04:19:59PM +0100, Phillip Wood wrote:
> > >   		expected_size /= p->num_objects;
> > >
> > >   		if (expected_size >= batch_size)
> > >   			continue;
> > >
> > > -		total_size += expected_size;
> > > +		if (unsigned_add_overflows (total_size, (size_t)expected_size))
> > > +			total_size = SIZE_MAX;
> > > +		else
> > > +			total_size += expected_size;
> > > +
> >
> > But this part I am not totally following. Here we have 'total_size'
> > declared as a size_t, and 'expected_size' as a uint64_t, and (on 32-bit
> > systems) down-cast to a 32-bit unsigned value.
> >
> > So if 'expected_size' is larger than SIZE_MAX, we should set
> > 'total_size' to SIZE_MAX. But that may not happen, say if
> > 'expected_size' is (2^32-1<<32). Should total_size also be declared as a
> > uint64_t here?
>
> By this point we know that expected_size < SIZE_MAX due to the test in the
> context lines above this change. batch_size is declared as size_t and to get
> here expected_size < batch_size. I'll add a sentence to the commit message
> to make that clearer.

Ahh... makes sense. I don't think a comment is necessary, this should
have been obvious. The check you're referring to gives us the fact that

    expected_size < batch_size <= SIZE_MAX

So we're OK here; sorry for missing that!

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 0/4] midx repack: fix overflow on 32 bit systems
  2025-05-22 15:55 ` [PATCH v2 0/4] midx repack: fix overflow on 32 bit systems Phillip Wood
                     ` (3 preceding siblings ...)
  2025-05-22 15:55   ` [PATCH v2 4/4] midx docs: clarify tie breaking Phillip Wood
@ 2025-05-23  0:36   ` Taylor Blau
  2025-05-27  8:26     ` Phillip Wood
  4 siblings, 1 reply; 25+ messages in thread
From: Taylor Blau @ 2025-05-23  0:36 UTC (permalink / raw)
  To: Phillip Wood; +Cc: git, Derrick Stolee, D . Ben Knoble

On Thu, May 22, 2025 at 04:55:19PM +0100, Phillip Wood wrote:
> Phillip Wood (4):
>   midx repack: avoid integer overflow on 32 bit systems
>   midx repack: avoid potential integer overflow on 64 bit systems
>   midx: avoid negative array index
>   midx docs: clarify tie breaking
>
>  Documentation/git-multi-pack-index.adoc | 11 +++++++----
>  git-compat-util.h                       | 16 ++++++++++++++++
>  midx-write.c                            | 22 ++++++++++++++++------
>  3 files changed, 39 insertions(+), 10 deletions(-)
>
> Range-diff against v1:

Thanks, the range-diff and patches look great to me.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 0/4] midx repack: fix overflow on 32 bit systems
  2025-05-23  0:36   ` [PATCH v2 0/4] midx repack: fix overflow on 32 bit systems Taylor Blau
@ 2025-05-27  8:26     ` Phillip Wood
  2025-05-27 15:42       ` Junio C Hamano
  0 siblings, 1 reply; 25+ messages in thread
From: Phillip Wood @ 2025-05-27  8:26 UTC (permalink / raw)
  To: Taylor Blau, Phillip Wood; +Cc: git, Derrick Stolee, D . Ben Knoble

On 23/05/2025 01:36, Taylor Blau wrote:
> On Thu, May 22, 2025 at 04:55:19PM +0100, Phillip Wood wrote:
>> Phillip Wood (4):
>>    midx repack: avoid integer overflow on 32 bit systems
>>    midx repack: avoid potential integer overflow on 64 bit systems
>>    midx: avoid negative array index
>>    midx docs: clarify tie breaking
>>
>>   Documentation/git-multi-pack-index.adoc | 11 +++++++----
>>   git-compat-util.h                       | 16 ++++++++++++++++
>>   midx-write.c                            | 22 ++++++++++++++++------
>>   3 files changed, 39 insertions(+), 10 deletions(-)
>>
>> Range-diff against v1:
> 
> Thanks, the range-diff and patches look great to me.

That's great, thank you for you comments and suggestions especially with 
regard to large repositories.

Thanks

Phillip

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 0/4] midx repack: fix overflow on 32 bit systems
  2025-05-27  8:26     ` Phillip Wood
@ 2025-05-27 15:42       ` Junio C Hamano
  0 siblings, 0 replies; 25+ messages in thread
From: Junio C Hamano @ 2025-05-27 15:42 UTC (permalink / raw)
  To: Phillip Wood
  Cc: Taylor Blau, Phillip Wood, git, Derrick Stolee, D . Ben Knoble

Phillip Wood <phillip.wood123@gmail.com> writes:

> On 23/05/2025 01:36, Taylor Blau wrote:
>> On Thu, May 22, 2025 at 04:55:19PM +0100, Phillip Wood wrote:
>>> Phillip Wood (4):
>>>    midx repack: avoid integer overflow on 32 bit systems
>>>    midx repack: avoid potential integer overflow on 64 bit systems
>>>    midx: avoid negative array index
>>>    midx docs: clarify tie breaking
>>>
>>>   Documentation/git-multi-pack-index.adoc | 11 +++++++----
>>>   git-compat-util.h                       | 16 ++++++++++++++++
>>>   midx-write.c                            | 22 ++++++++++++++++------
>>>   3 files changed, 39 insertions(+), 10 deletions(-)
>>>
>>> Range-diff against v1:
>> Thanks, the range-diff and patches look great to me.
>
> That's great, thank you for you comments and suggestions especially
> with regard to large repositories.

Thanks, both.  Let's mark the topic for next.



^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2025-05-27 15:42 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-20 15:04 [PATCH 0/4] midx repack: fix overflow on 32 bit systems Phillip Wood
2025-05-20 15:04 ` [PATCH 1/4] midx repack: avoid integer " Phillip Wood
2025-05-20 17:54   ` Taylor Blau
2025-05-21 15:19     ` Phillip Wood
2025-05-23  0:34       ` Taylor Blau
2025-05-21 13:10   ` D. Ben Knoble
2025-05-21 15:01     ` Junio C Hamano
2025-05-21 15:20     ` Phillip Wood
2025-05-20 15:04 ` [PATCH 2/4] midx repack: avoid potential integer overflow on 64 " Phillip Wood
2025-05-20 17:59   ` Taylor Blau
2025-05-21 15:20     ` Phillip Wood
2025-05-20 15:04 ` [PATCH 3/4] midx: avoid negative array index Phillip Wood
2025-05-20 17:58   ` Taylor Blau
2025-05-20 15:04 ` [PATCH 4/4] midx docs: clarify tie breaking Phillip Wood
2025-05-20 18:07   ` Taylor Blau
2025-05-21 15:20     ` Phillip Wood
2025-05-21 13:14   ` D. Ben Knoble
2025-05-22 15:55 ` [PATCH v2 0/4] midx repack: fix overflow on 32 bit systems Phillip Wood
2025-05-22 15:55   ` [PATCH v2 1/4] midx repack: avoid integer " Phillip Wood
2025-05-22 15:55   ` [PATCH v2 2/4] midx repack: avoid potential integer overflow on 64 " Phillip Wood
2025-05-22 15:55   ` [PATCH v2 3/4] midx: avoid negative array index Phillip Wood
2025-05-22 15:55   ` [PATCH v2 4/4] midx docs: clarify tie breaking Phillip Wood
2025-05-23  0:36   ` [PATCH v2 0/4] midx repack: fix overflow on 32 bit systems Taylor Blau
2025-05-27  8:26     ` Phillip Wood
2025-05-27 15:42       ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).