* [PATCH v2 17/26] git-compat-util.h: implement checked size_t to uint32_t conversion
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>
In a similar fashion as other checked cast functions in this header
(such as `cast_size_t_to_ulong()` and `cast_size_t_to_int()`), implement
a checked cast function for going from a size_t to a uint32_t value.
This function will be utilized in a future commit which needs to make
such a conversion.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
git-compat-util.h | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/git-compat-util.h b/git-compat-util.h
index 3e7a59b5ff..c3b6c2c226 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -1013,6 +1013,15 @@ static inline unsigned long cast_size_t_to_ulong(size_t a)
return (unsigned long)a;
}
+static inline uint32_t cast_size_t_to_uint32_t(size_t a)
+{
+ if (a != (uint32_t)a)
+ die("object too large to read on this platform: %"
+ PRIuMAX" is cut off to %u",
+ (uintmax_t)a, (uint32_t)a);
+ return (uint32_t)a;
+}
+
static inline int cast_size_t_to_int(size_t a)
{
if (a > INT_MAX)
--
2.43.0.102.ga31d690331.dirty
^ permalink raw reply related
* [PATCH v2 16/26] pack-objects: include number of packs reused in output
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>
In addition to including the number of objects reused verbatim from a
reuse-pack, include the number of packs from which objects were reused.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
builtin/pack-objects.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 31053128fc..7eb035eb7d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -223,6 +223,7 @@ static struct progress *progress_state;
static struct bitmapped_pack *reuse_packfiles;
static size_t reuse_packfiles_nr;
+static size_t reuse_packfiles_used_nr;
static uint32_t reuse_packfile_objects;
static struct bitmap *reuse_packfile_bitmap;
@@ -1265,6 +1266,8 @@ static void write_pack_file(void)
for (j = 0; j < reuse_packfiles_nr; j++) {
reused_chunks_nr = 0;
write_reused_pack(&reuse_packfiles[j], f);
+ if (reused_chunks_nr)
+ reuse_packfiles_used_nr++;
}
offset = hashfile_total(f);
}
@@ -4587,9 +4590,10 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
fprintf_ln(stderr,
_("Total %"PRIu32" (delta %"PRIu32"),"
" reused %"PRIu32" (delta %"PRIu32"),"
- " pack-reused %"PRIu32),
+ " pack-reused %"PRIu32" (from %"PRIuMAX")"),
written, written_delta, reused, reused_delta,
- reuse_packfile_objects);
+ reuse_packfile_objects,
+ (uintmax_t)reuse_packfiles_used_nr);
cleanup:
clear_packing_data(&to_pack);
--
2.43.0.102.ga31d690331.dirty
^ permalink raw reply related
* [PATCH v2 15/26] pack-objects: prepare `write_reused_pack_verbatim()` for multi-pack reuse
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>
The function `write_reused_pack_verbatim()` within
`builtin/pack-objects.c` is responsible for writing out a continuous
set of objects beginning at the start of the reuse packfile.
In the existing implementation, we did something like:
while (pos < reuse_packfile_bitmap->word_alloc &&
reuse_packfile_bitmap->words[pos] == (eword_t)~0)
pos++;
if (pos)
/* write first `pos * BITS_IN_WORD` objects from pack */
as an optimization to record a single chunk for the longest continuous
prefix of objects wanted out of the reuse pack, instead of having a
chunk for each individual object. For more details, see bb514de356
(pack-objects: improve partial packfile reuse, 2019-12-18).
In order to retain this optimization in a multi-pack reuse world, we can
no longer assume that the first object in a pack is on a word boundary
in the bitmap storing the set of reusable objects.
Assuming that all objects from the beginning of the reuse packfile up to
the object corresponding to the first bit on a word boundary are part of
the result, consume whole words at a time until the last whole word
belonging to the reuse packfile. Copy those objects to the resulting
packfile, and track that we reused them by recording a single chunk.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
builtin/pack-objects.c | 73 ++++++++++++++++++++++++++++++++++--------
1 file changed, 60 insertions(+), 13 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 6ce52d88a9..31053128fc 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1097,31 +1097,78 @@ static void write_reused_pack_one(struct packed_git *reuse_packfile,
static size_t write_reused_pack_verbatim(struct bitmapped_pack *reuse_packfile,
struct hashfile *out,
- off_t pack_start UNUSED,
+ off_t pack_start,
struct pack_window **w_curs)
{
- size_t pos = 0;
+ size_t pos = reuse_packfile->bitmap_pos;
+ size_t end;
- while (pos < reuse_packfile_bitmap->word_alloc &&
- reuse_packfile_bitmap->words[pos] == (eword_t)~0)
- pos++;
+ if (pos % BITS_IN_EWORD) {
+ size_t word_pos = (pos / BITS_IN_EWORD);
+ size_t offset = pos % BITS_IN_EWORD;
+ size_t last;
+ eword_t word = reuse_packfile_bitmap->words[word_pos];
- if (pos) {
- off_t to_write;
+ if (offset + reuse_packfile->bitmap_nr < BITS_IN_EWORD)
+ last = offset + reuse_packfile->bitmap_nr;
+ else
+ last = BITS_IN_EWORD;
- written = (pos * BITS_IN_EWORD);
- to_write = pack_pos_to_offset(reuse_packfile->p, written)
- - sizeof(struct pack_header);
+ for (; offset < last; offset++) {
+ if (word >> offset == 0)
+ return word_pos;
+ if (!bitmap_get(reuse_packfile_bitmap,
+ word_pos * BITS_IN_EWORD + offset))
+ return word_pos;
+ }
+
+ pos += BITS_IN_EWORD - (pos % BITS_IN_EWORD);
+ }
+
+ /*
+ * Now we're going to copy as many whole eword_t's as possible.
+ * "end" is the index of the last whole eword_t we copy, but
+ * there may be additional bits to process. Those are handled
+ * individually by write_reused_pack().
+ *
+ * Begin by advancing to the first word boundary in range of the
+ * bit positions occupied by objects in "reuse_packfile". Then
+ * pick the last word boundary in the same range. If we have at
+ * least one word's worth of bits to process, continue on.
+ */
+ end = reuse_packfile->bitmap_pos + reuse_packfile->bitmap_nr;
+ if (end % BITS_IN_EWORD)
+ end -= end % BITS_IN_EWORD;
+ if (pos >= end)
+ return reuse_packfile->bitmap_pos / BITS_IN_EWORD;
+
+ while (pos < end &&
+ reuse_packfile_bitmap->words[pos / BITS_IN_EWORD] == (eword_t)~0)
+ pos += BITS_IN_EWORD;
+
+ if (pos > end)
+ pos = end;
+
+ if (reuse_packfile->bitmap_pos < pos) {
+ off_t pack_start_off = pack_pos_to_offset(reuse_packfile->p, 0);
+ off_t pack_end_off = pack_pos_to_offset(reuse_packfile->p,
+ pos - reuse_packfile->bitmap_pos);
+
+ written += pos - reuse_packfile->bitmap_pos;
/* We're recording one chunk, not one object. */
- record_reused_object(sizeof(struct pack_header), 0);
+ record_reused_object(pack_start_off,
+ pack_start_off - (hashfile_total(out) - pack_start));
hashflush(out);
copy_pack_data(out, reuse_packfile->p, w_curs,
- sizeof(struct pack_header), to_write);
+ pack_start_off, pack_end_off - pack_start_off);
display_progress(progress_state, written);
}
- return pos;
+ if (pos % BITS_IN_EWORD)
+ BUG("attempted to jump past a word boundary to %"PRIuMAX,
+ (uintmax_t)pos);
+ return pos / BITS_IN_EWORD;
}
static void write_reused_pack(struct bitmapped_pack *reuse_packfile,
--
2.43.0.102.ga31d690331.dirty
^ permalink raw reply related
* [PATCH v2 14/26] pack-objects: prepare `write_reused_pack()` for multi-pack reuse
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>
The function `write_reused_pack()` within `builtin/pack-objects.c` is
responsible for performing pack-reuse on a single pack, and has two main
functions:
- it dispatches a call to `write_reused_pack_verbatim()` to see if we
can reuse portions of the packfile in whole-word chunks
- for any remaining objects (that is, any objects that appear after
the first "gap" in the bitmap), call write_reused_pack_one() on that
object to record it for reuse.
Prepare this function for multi-pack reuse by removing the assumption
that the bit position corresponding to the first object being reused
from a given pack must be at bit position zero.
The changes in this function are mostly straightforward. Initialize `i`
to the position of the first word to contain bits corresponding to that
reuse pack. In most situations, we throw the initialized value away,
since we end up replacing it with the return value from
write_reused_pack_verbatim(), moving us past the section of whole words
that we reused.
Likewise, modify the per-object loop to ignore any bits at the beginning
of the first word that do not belong to the pack currently being reused,
as well as skip to the "done" section once we have processed the last
bit corresponding to this pack.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
builtin/pack-objects.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 07c849b5d4..6ce52d88a9 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1127,7 +1127,7 @@ static size_t write_reused_pack_verbatim(struct bitmapped_pack *reuse_packfile,
static void write_reused_pack(struct bitmapped_pack *reuse_packfile,
struct hashfile *f)
{
- size_t i = 0;
+ size_t i = reuse_packfile->bitmap_pos / BITS_IN_EWORD;
uint32_t offset;
off_t pack_start = hashfile_total(f) - sizeof(struct pack_header);
struct pack_window *w_curs = NULL;
@@ -1145,17 +1145,23 @@ static void write_reused_pack(struct bitmapped_pack *reuse_packfile,
break;
offset += ewah_bit_ctz64(word >> offset);
+ if (pos + offset < reuse_packfile->bitmap_pos)
+ continue;
+ if (pos + offset >= reuse_packfile->bitmap_pos + reuse_packfile->bitmap_nr)
+ goto done;
/*
* Can use bit positions directly, even for MIDX
* bitmaps. See comment in try_partial_reuse()
* for why.
*/
- write_reused_pack_one(reuse_packfile->p, pos + offset,
+ write_reused_pack_one(reuse_packfile->p,
+ pos + offset - reuse_packfile->bitmap_pos,
f, pack_start, &w_curs);
display_progress(progress_state, ++written);
}
}
+done:
unuse_pack(&w_curs);
}
--
2.43.0.102.ga31d690331.dirty
^ permalink raw reply related
* [PATCH v2 13/26] pack-objects: pass `bitmapped_pack`'s to pack-reuse functions
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>
Further prepare pack-objects to perform verbatim pack-reuse over
multiple packfiles by converting functions that take in a pointer to a
`struct packed_git` to instead take in a pointer to a `struct
bitmapped_pack`.
The additional information found in the bitmapped_pack struct (such as
the bit position corresponding to the beginning of the pack) will be
necessary in order to perform verbatim pack-reuse.
Note that we don't use any of the extra pieces of information contained
in the bitmapped_pack struct, so this step is merely preparatory and
does not introduce any functional changes.
Note further that we do not change the argument type to
write_reused_pack_one(). That function is responsible for copying
sections of the packfile directly and optionally patching any OFS_DELTAs
to account for not reusing sections of the packfile in between a delta
and its base.
As such, that function is (and should remain) oblivious to multi-pack
reuse, and does not require any of the extra pieces of information
stored in the bitmapped_pack struct.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
builtin/pack-objects.c | 33 +++++++++++++++++----------------
1 file changed, 17 insertions(+), 16 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index f51b86d99f..07c849b5d4 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -221,7 +221,8 @@ static int thin;
static int num_preferred_base;
static struct progress *progress_state;
-static struct packed_git *reuse_packfile;
+static struct bitmapped_pack *reuse_packfiles;
+static size_t reuse_packfiles_nr;
static uint32_t reuse_packfile_objects;
static struct bitmap *reuse_packfile_bitmap;
@@ -1094,7 +1095,7 @@ static void write_reused_pack_one(struct packed_git *reuse_packfile,
copy_pack_data(out, reuse_packfile, w_curs, offset, next - offset);
}
-static size_t write_reused_pack_verbatim(struct packed_git *reuse_packfile,
+static size_t write_reused_pack_verbatim(struct bitmapped_pack *reuse_packfile,
struct hashfile *out,
off_t pack_start UNUSED,
struct pack_window **w_curs)
@@ -1109,13 +1110,13 @@ static size_t write_reused_pack_verbatim(struct packed_git *reuse_packfile,
off_t to_write;
written = (pos * BITS_IN_EWORD);
- to_write = pack_pos_to_offset(reuse_packfile, written)
+ to_write = pack_pos_to_offset(reuse_packfile->p, written)
- sizeof(struct pack_header);
/* We're recording one chunk, not one object. */
record_reused_object(sizeof(struct pack_header), 0);
hashflush(out);
- copy_pack_data(out, reuse_packfile, w_curs,
+ copy_pack_data(out, reuse_packfile->p, w_curs,
sizeof(struct pack_header), to_write);
display_progress(progress_state, written);
@@ -1123,7 +1124,7 @@ static size_t write_reused_pack_verbatim(struct packed_git *reuse_packfile,
return pos;
}
-static void write_reused_pack(struct packed_git *reuse_packfile,
+static void write_reused_pack(struct bitmapped_pack *reuse_packfile,
struct hashfile *f)
{
size_t i = 0;
@@ -1149,8 +1150,8 @@ static void write_reused_pack(struct packed_git *reuse_packfile,
* bitmaps. See comment in try_partial_reuse()
* for why.
*/
- write_reused_pack_one(reuse_packfile, pos + offset, f,
- pack_start, &w_curs);
+ write_reused_pack_one(reuse_packfile->p, pos + offset,
+ f, pack_start, &w_curs);
display_progress(progress_state, ++written);
}
}
@@ -1206,9 +1207,12 @@ static void write_pack_file(void)
offset = write_pack_header(f, nr_remaining);
- if (reuse_packfile) {
+ if (reuse_packfiles_nr) {
assert(pack_to_stdout);
- write_reused_pack(reuse_packfile, f);
+ for (j = 0; j < reuse_packfiles_nr; j++) {
+ reused_chunks_nr = 0;
+ write_reused_pack(&reuse_packfiles[j], f);
+ }
offset = hashfile_total(f);
}
@@ -3949,19 +3953,16 @@ static int pack_options_allow_reuse(void)
static int get_object_list_from_bitmap(struct rev_info *revs)
{
- struct bitmapped_pack *packs = NULL;
- size_t packs_nr = 0;
-
if (!(bitmap_git = prepare_bitmap_walk(revs, 0)))
return -1;
if (pack_options_allow_reuse())
- reuse_partial_packfile_from_bitmap(bitmap_git, &packs,
- &packs_nr,
+ reuse_partial_packfile_from_bitmap(bitmap_git,
+ &reuse_packfiles,
+ &reuse_packfiles_nr,
&reuse_packfile_bitmap);
- if (packs) {
- reuse_packfile = packs[0].p;
+ if (reuse_packfiles) {
reuse_packfile_objects = bitmap_popcount(reuse_packfile_bitmap);
if (!reuse_packfile_objects)
BUG("expected non-empty reuse bitmap");
--
2.43.0.102.ga31d690331.dirty
^ permalink raw reply related
* [PATCH v2 12/26] pack-objects: keep track of `pack_start` for each reuse pack
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>
When reusing objects from a pack, we keep track of a set of one or more
`reused_chunk`s, corresponding to sections of one or more object(s) from
a source pack that we are reusing. Each chunk contains two pieces of
information:
- the offset of the first object in the source pack (relative to the
beginning of the source pack)
- the difference between that offset, and the corresponding offset in
the pack we're generating
The purpose of keeping track of these is so that we can patch an
OFS_DELTAs that cross over a section of the reuse pack that we didn't
take.
For instance, consider a hypothetical pack as shown below:
(chunk #2)
__________...
/
/
+--------+---------+-------------------+---------+
... | <base> | <other> | (unused) | <delta> | ...
+--------+---------+-------------------+---------+
\ /
\______________/
(chunk #1)
Suppose that we are sending objects "base", "other", and "delta", and
that the "delta" object is stored as an OFS_DELTA, and that its base is
"base". If we don't send any objects in the "(unused)" range, we can't
copy the delta'd object directly, since its delta offset includes a
range of the pack that we didn't copy, so we have to account for that
difference when patching and reassembling the delta.
In order to compute this value correctly, we need to know not only where
we are in the packfile we're assembling (with `hashfile_total(f)`) but
also the position of the first byte of the packfile that we are
currently reusing. Currently, this works just fine, since when reusing
only a single pack those two values are always identical (because
verbatim reuse is the first thing pack-objects does when enabled after
writing the pack header).
But when reusing multiple packs which have one or more gaps, we'll need
to account for these two values diverging.
Together, these two allow us to compute the reused chunk's offset
difference relative to the start of the reused pack, as desired.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
builtin/pack-objects.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 102fe9a4f8..f51b86d99f 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1015,6 +1015,7 @@ static off_t find_reused_offset(off_t where)
static void write_reused_pack_one(struct packed_git *reuse_packfile,
size_t pos, struct hashfile *out,
+ off_t pack_start,
struct pack_window **w_curs)
{
off_t offset, next, cur;
@@ -1024,7 +1025,8 @@ static void write_reused_pack_one(struct packed_git *reuse_packfile,
offset = pack_pos_to_offset(reuse_packfile, pos);
next = pack_pos_to_offset(reuse_packfile, pos + 1);
- record_reused_object(offset, offset - hashfile_total(out));
+ record_reused_object(offset,
+ offset - (hashfile_total(out) - pack_start));
cur = offset;
type = unpack_object_header(reuse_packfile, w_curs, &cur, &size);
@@ -1094,6 +1096,7 @@ static void write_reused_pack_one(struct packed_git *reuse_packfile,
static size_t write_reused_pack_verbatim(struct packed_git *reuse_packfile,
struct hashfile *out,
+ off_t pack_start UNUSED,
struct pack_window **w_curs)
{
size_t pos = 0;
@@ -1125,10 +1128,12 @@ static void write_reused_pack(struct packed_git *reuse_packfile,
{
size_t i = 0;
uint32_t offset;
+ off_t pack_start = hashfile_total(f) - sizeof(struct pack_header);
struct pack_window *w_curs = NULL;
if (allow_ofs_delta)
- i = write_reused_pack_verbatim(reuse_packfile, f, &w_curs);
+ i = write_reused_pack_verbatim(reuse_packfile, f, pack_start,
+ &w_curs);
for (; i < reuse_packfile_bitmap->word_alloc; ++i) {
eword_t word = reuse_packfile_bitmap->words[i];
@@ -1145,7 +1150,7 @@ static void write_reused_pack(struct packed_git *reuse_packfile,
* for why.
*/
write_reused_pack_one(reuse_packfile, pos + offset, f,
- &w_curs);
+ pack_start, &w_curs);
display_progress(progress_state, ++written);
}
}
--
2.43.0.102.ga31d690331.dirty
^ permalink raw reply related
* [PATCH v2 11/26] pack-objects: parameterize pack-reuse routines over a single pack
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>
The routines pack-objects uses to perform verbatim pack-reuse are:
- write_reused_pack_one()
- write_reused_pack_verbatim()
- write_reused_pack()
, all of which assume that there is exactly one packfile being reused:
the global constant `reuse_packfile`.
Prepare for reusing objects from multiple packs by making reuse packfile
a parameter of each of the above functions in preparation for calling
these functions in a loop with multiple packfiles.
Note that we still have the global "reuse_packfile", but pass it through
each of the above function's parameter lists, eliminating all but one
direct access (the top-level caller in `write_pack_file()`). Even after
this series, we will still have a global, but it will hold the array of
reusable packfiles, and we'll pass them one at a time to these functions
in a loop.
Note also that we will eventually need to pass a `bitmapped_pack`
instead of a `packed_git` in order to hold onto additional information
required for reuse (such as the bit position of the first object
belonging to that pack). But that change will be made in a future commit
so as to minimize the noise below as much as possible.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
builtin/pack-objects.c | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 87e16636a8..102fe9a4f8 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1013,7 +1013,8 @@ static off_t find_reused_offset(off_t where)
return reused_chunks[lo-1].difference;
}
-static void write_reused_pack_one(size_t pos, struct hashfile *out,
+static void write_reused_pack_one(struct packed_git *reuse_packfile,
+ size_t pos, struct hashfile *out,
struct pack_window **w_curs)
{
off_t offset, next, cur;
@@ -1091,7 +1092,8 @@ static void write_reused_pack_one(size_t pos, struct hashfile *out,
copy_pack_data(out, reuse_packfile, w_curs, offset, next - offset);
}
-static size_t write_reused_pack_verbatim(struct hashfile *out,
+static size_t write_reused_pack_verbatim(struct packed_git *reuse_packfile,
+ struct hashfile *out,
struct pack_window **w_curs)
{
size_t pos = 0;
@@ -1118,14 +1120,15 @@ static size_t write_reused_pack_verbatim(struct hashfile *out,
return pos;
}
-static void write_reused_pack(struct hashfile *f)
+static void write_reused_pack(struct packed_git *reuse_packfile,
+ struct hashfile *f)
{
size_t i = 0;
uint32_t offset;
struct pack_window *w_curs = NULL;
if (allow_ofs_delta)
- i = write_reused_pack_verbatim(f, &w_curs);
+ i = write_reused_pack_verbatim(reuse_packfile, f, &w_curs);
for (; i < reuse_packfile_bitmap->word_alloc; ++i) {
eword_t word = reuse_packfile_bitmap->words[i];
@@ -1141,7 +1144,8 @@ static void write_reused_pack(struct hashfile *f)
* bitmaps. See comment in try_partial_reuse()
* for why.
*/
- write_reused_pack_one(pos + offset, f, &w_curs);
+ write_reused_pack_one(reuse_packfile, pos + offset, f,
+ &w_curs);
display_progress(progress_state, ++written);
}
}
@@ -1199,7 +1203,7 @@ static void write_pack_file(void)
if (reuse_packfile) {
assert(pack_to_stdout);
- write_reused_pack(f);
+ write_reused_pack(reuse_packfile, f);
offset = hashfile_total(f);
}
--
2.43.0.102.ga31d690331.dirty
^ permalink raw reply related
* [PATCH v2 10/26] pack-bitmap: return multiple packs via `reuse_partial_packfile_from_bitmap()`
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>
Further prepare for enabling verbatim pack-reuse over multiple packfiles
by changing the signature of reuse_partial_packfile_from_bitmap() to
populate an array of `struct bitmapped_pack *`'s instead of a pointer to
a single packfile.
Since the array we're filling out is sized dynamically[^1], add an
additional `size_t *` parameter which will hold the number of reusable
packs (equal to the number of elements in the array).
Note that since we still have not implemented true multi-pack reuse,
these changes aren't propagated out to the rest of the caller in
builtin/pack-objects.c.
In the interim state, we expect that the array has a single element, and
we use that element to fill out the static `reuse_packfile` variable
(which is a bog-standard `struct packed_git *`). Future commits will
continue to push this change further out through the pack-objects code.
[^1]: That is, even though we know the number of packs which are
candidates for pack-reuse, we do not know how many of those
candidates we can actually reuse.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
builtin/pack-objects.c | 9 +++++++--
pack-bitmap.c | 6 ++++--
pack-bitmap.h | 5 +++--
3 files changed, 14 insertions(+), 6 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index c3df6d9657..87e16636a8 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3940,14 +3940,19 @@ static int pack_options_allow_reuse(void)
static int get_object_list_from_bitmap(struct rev_info *revs)
{
+ struct bitmapped_pack *packs = NULL;
+ size_t packs_nr = 0;
+
if (!(bitmap_git = prepare_bitmap_walk(revs, 0)))
return -1;
if (pack_options_allow_reuse())
- reuse_partial_packfile_from_bitmap(bitmap_git, &reuse_packfile,
+ reuse_partial_packfile_from_bitmap(bitmap_git, &packs,
+ &packs_nr,
&reuse_packfile_bitmap);
- if (reuse_packfile) {
+ if (packs) {
+ reuse_packfile = packs[0].p;
reuse_packfile_objects = bitmap_popcount(reuse_packfile_bitmap);
if (!reuse_packfile_objects)
BUG("expected non-empty reuse bitmap");
diff --git a/pack-bitmap.c b/pack-bitmap.c
index c75a83e9cc..4d5a484678 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -2001,7 +2001,8 @@ static int bitmapped_pack_cmp(const void *va, const void *vb)
}
void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
- struct packed_git **packfile_out,
+ struct bitmapped_pack **packs_out,
+ size_t *packs_nr_out,
struct bitmap **reuse_out)
{
struct repository *r = the_repository;
@@ -2069,7 +2070,8 @@ void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
* need to be handled separately.
*/
bitmap_and_not(result, reuse);
- *packfile_out = packs[0].p;
+ *packs_out = packs;
+ *packs_nr_out = packs_nr;
*reuse_out = reuse;
}
diff --git a/pack-bitmap.h b/pack-bitmap.h
index ab3fdcde6b..7a12a2ce81 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -78,8 +78,9 @@ int test_bitmap_hashes(struct repository *r);
struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
int filter_provided_objects);
uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git);
-void reuse_partial_packfile_from_bitmap(struct bitmap_index *,
- struct packed_git **packfile,
+void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
+ struct bitmapped_pack **packs_out,
+ size_t *packs_nr_out,
struct bitmap **reuse_out);
int rebuild_existing_bitmaps(struct bitmap_index *, struct packing_data *mapping,
kh_oid_map_t *reused_bitmaps, int show_progress);
--
2.43.0.102.ga31d690331.dirty
^ permalink raw reply related
* [PATCH v2 09/26] pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>
The signature of `reuse_partial_packfile_from_bitmap()` currently takes
in a bitmap, as well as three output parameters (filled through
pointers, and passed as arguments), and also returns an integer result.
The output parameters are filled out with: (a) the packfile used for
pack-reuse, (b) the number of objects from that pack that we can reuse,
and (c) a bitmap indicating which objects we can reuse. The return value
is either -1 (when there are no objects to reuse), or 0 (when there is
at least one object to reuse).
Some of these parameters are redundant. Notably, we can infer from the
bitmap how many objects are reused by calling bitmap_popcount(). And we
can similar compute the return value based on that number as well.
As such, clean up the signature of this function to drop the "*entries"
parameter, as well as the int return value, since the single caller of
this function can infer these values themself.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
builtin/pack-objects.c | 16 +++++++++-------
pack-bitmap.c | 16 +++++++---------
pack-bitmap.h | 7 +++----
3 files changed, 19 insertions(+), 20 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 321d7effb0..c3df6d9657 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3943,13 +3943,15 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
if (!(bitmap_git = prepare_bitmap_walk(revs, 0)))
return -1;
- if (pack_options_allow_reuse() &&
- !reuse_partial_packfile_from_bitmap(
- bitmap_git,
- &reuse_packfile,
- &reuse_packfile_objects,
- &reuse_packfile_bitmap)) {
- assert(reuse_packfile_objects);
+ if (pack_options_allow_reuse())
+ reuse_partial_packfile_from_bitmap(bitmap_git, &reuse_packfile,
+ &reuse_packfile_bitmap);
+
+ if (reuse_packfile) {
+ reuse_packfile_objects = bitmap_popcount(reuse_packfile_bitmap);
+ if (!reuse_packfile_objects)
+ BUG("expected non-empty reuse bitmap");
+
nr_result += reuse_packfile_objects;
nr_seen += reuse_packfile_objects;
display_progress(progress_state, nr_seen);
diff --git a/pack-bitmap.c b/pack-bitmap.c
index d64a80c30c..c75a83e9cc 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -2000,10 +2000,9 @@ static int bitmapped_pack_cmp(const void *va, const void *vb)
return 0;
}
-int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
- struct packed_git **packfile_out,
- uint32_t *entries,
- struct bitmap **reuse_out)
+void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
+ struct packed_git **packfile_out,
+ struct bitmap **reuse_out)
{
struct repository *r = the_repository;
struct bitmapped_pack *packs = NULL;
@@ -2025,7 +2024,7 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
warning(_("unable to load pack: '%s', disabling pack-reuse"),
bitmap_git->midx->pack_names[i]);
free(packs);
- return -1;
+ return;
}
if (!pack.bitmap_nr)
continue; /* no objects from this pack */
@@ -2059,10 +2058,10 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
reuse_partial_packfile_from_bitmap_1(bitmap_git, packs, reuse);
- *entries = bitmap_popcount(reuse);
- if (!*entries) {
+ if (bitmap_is_empty(reuse)) {
+ free(packs);
bitmap_free(reuse);
- return -1;
+ return;
}
/*
@@ -2072,7 +2071,6 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
bitmap_and_not(result, reuse);
*packfile_out = packs[0].p;
*reuse_out = reuse;
- return 0;
}
int bitmap_walk_contains(struct bitmap_index *bitmap_git,
diff --git a/pack-bitmap.h b/pack-bitmap.h
index b68b213388..ab3fdcde6b 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -78,10 +78,9 @@ int test_bitmap_hashes(struct repository *r);
struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
int filter_provided_objects);
uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git);
-int reuse_partial_packfile_from_bitmap(struct bitmap_index *,
- struct packed_git **packfile,
- uint32_t *entries,
- struct bitmap **reuse_out);
+void reuse_partial_packfile_from_bitmap(struct bitmap_index *,
+ struct packed_git **packfile,
+ struct bitmap **reuse_out);
int rebuild_existing_bitmaps(struct bitmap_index *, struct packing_data *mapping,
kh_oid_map_t *reused_bitmaps, int show_progress);
void free_bitmap_index(struct bitmap_index *);
--
2.43.0.102.ga31d690331.dirty
^ permalink raw reply related
* [PATCH v2 08/26] ewah: implement `bitmap_is_empty()`
From: Taylor Blau @ 2023-12-14 22:23 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>
In a future commit, we will want to check whether or not a bitmap has
any bits set in any of its words. The best way to do this (prior to the
existence of this patch) is to call `bitmap_popcount()` and check
whether the result is non-zero.
But this is semi-wasteful, since we do not need to know the exact number
of bits set, only whether or not there is at least one of them.
Implement a new helper function to check just that.
Suggested-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
ewah/bitmap.c | 9 +++++++++
ewah/ewok.h | 1 +
2 files changed, 10 insertions(+)
diff --git a/ewah/bitmap.c b/ewah/bitmap.c
index 7b525b1ecd..ac7e0af622 100644
--- a/ewah/bitmap.c
+++ b/ewah/bitmap.c
@@ -169,6 +169,15 @@ size_t bitmap_popcount(struct bitmap *self)
return count;
}
+int bitmap_is_empty(struct bitmap *self)
+{
+ size_t i;
+ for (i = 0; i < self->word_alloc; i++)
+ if (self->words[i])
+ return 0;
+ return 1;
+}
+
int bitmap_equals(struct bitmap *self, struct bitmap *other)
{
struct bitmap *big, *small;
diff --git a/ewah/ewok.h b/ewah/ewok.h
index 7eb8b9b630..c11d76c6f3 100644
--- a/ewah/ewok.h
+++ b/ewah/ewok.h
@@ -189,5 +189,6 @@ void bitmap_or_ewah(struct bitmap *self, struct ewah_bitmap *other);
void bitmap_or(struct bitmap *self, const struct bitmap *other);
size_t bitmap_popcount(struct bitmap *self);
+int bitmap_is_empty(struct bitmap *self);
#endif
--
2.43.0.102.ga31d690331.dirty
^ permalink raw reply related
* [PATCH v2 07/26] pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions
From: Taylor Blau @ 2023-12-14 22:23 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>
When trying to assemble a pack with bitmaps using `--use-bitmap-index`,
`pack-objects` asks the pack-bitmap machinery for a bitmap which
indicates the set of objects we can "reuse" verbatim from on-disk.
This set is roughly comprised of: a prefix of objects in the bitmapped
pack (or preferred pack, in the case of a multi-pack reachability
bitmap), plus any other objects not included in the prefix, excluding
any deltas whose base we are not sending in the resulting pack.
The pack-bitmap machinery is responsible for computing this bitmap, and
does so with the following functions:
- reuse_partial_packfile_from_bitmap()
- try_partial_reuse()
In the existing implementation, the first function is responsible for
(a) marking the prefix of objects in the reusable pack, and then (b)
calling try_partial_reuse() on any remaining objects to ensure that they
are also reusable (and removing them from the bitmapped set if they are
not).
Likewise, the `try_partial_reuse()` function is responsible for checking
whether an isolated object (that is, an object from the bitmapped
pack/preferred pack not contained in the prefix from earlier) may be
reused, i.e. that it isn't a delta of an object that we are not sending
in the resulting pack.
These functions are based on two core assumptions, which we will unwind
in this and the following commits:
1. There is only a single pack from the bitmap which is eligible for
verbatim pack-reuse. For single-pack bitmaps, this is trivially the
bitmapped pack. For multi-pack bitmaps, this is (currently) the
MIDX's preferred pack.
2. The pack eligible for reuse has its first object in bit position 0,
and all objects from that pack follow in pack-order from that first
bit position.
In order to perform verbatim pack reuse over multiple packs, we must
unwind these two assumptions. Most notably, in order to reuse bits from
a given packfile, we need to know the first bit position occupied by
an object form that packfile. To propagate this information around, pass
a `struct bitmapped_pack *` anywhere we previously passed a `struct
packed_git *`, since the former contains the bitmap position we're
interested in (as well as a pointer to the latter).
As an additional step, factor out a sub-routine from the main
`reuse_partial_packfile_from_bitmap()` function, called
`reuse_partial_packfile_from_bitmap_1()`. This new function will be
responsible for figuring out which objects may be reused from a single
pack, and the existing function will dispatch multiple calls to its new
helper function for each reusable pack.
Consequently, `reuse_partial_packfile_from_bitmap()` will now maintain
an array of reusable packs instead of a single such pack. We currently
expect that array to have only a single element, so this awkward state
is short-lived. It will serve as useful scaffolding in subsequent
commits as we begin to work towards enabling multi-pack reuse.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
pack-bitmap.c | 118 +++++++++++++++++++++++++++++++++++++-------------
1 file changed, 87 insertions(+), 31 deletions(-)
diff --git a/pack-bitmap.c b/pack-bitmap.c
index d2f1306960..d64a80c30c 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1836,7 +1836,7 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
* -1 means "stop trying further objects"; 0 means we may or may not have
* reused, but you can keep feeding bits.
*/
-static int try_partial_reuse(struct packed_git *pack,
+static int try_partial_reuse(struct bitmapped_pack *pack,
size_t pos,
struct bitmap *reuse,
struct pack_window **w_curs)
@@ -1868,11 +1868,11 @@ static int try_partial_reuse(struct packed_git *pack,
* preferred pack precede all bits from other packs.
*/
- if (pos >= pack->num_objects)
+ if (pos >= pack->p->num_objects)
return -1; /* not actually in the pack or MIDX preferred pack */
- offset = delta_obj_offset = pack_pos_to_offset(pack, pos);
- type = unpack_object_header(pack, w_curs, &offset, &size);
+ offset = delta_obj_offset = pack_pos_to_offset(pack->p, pos);
+ type = unpack_object_header(pack->p, w_curs, &offset, &size);
if (type < 0)
return -1; /* broken packfile, punt */
@@ -1888,11 +1888,11 @@ static int try_partial_reuse(struct packed_git *pack,
* and the normal slow path will complain about it in
* more detail.
*/
- base_offset = get_delta_base(pack, w_curs, &offset, type,
+ base_offset = get_delta_base(pack->p, w_curs, &offset, type,
delta_obj_offset);
if (!base_offset)
return 0;
- if (offset_to_pack_pos(pack, base_offset, &base_pos) < 0)
+ if (offset_to_pack_pos(pack->p, base_offset, &base_pos) < 0)
return 0;
/*
@@ -1915,14 +1915,14 @@ static int try_partial_reuse(struct packed_git *pack,
* to REF_DELTA on the fly. Better to just let the normal
* object_entry code path handle it.
*/
- if (!bitmap_get(reuse, base_pos))
+ if (!bitmap_get(reuse, pack->bitmap_pos + base_pos))
return 0;
}
/*
* If we got here, then the object is OK to reuse. Mark it.
*/
- bitmap_set(reuse, pos);
+ bitmap_set(reuse, pack->bitmap_pos + pos);
return 0;
}
@@ -1934,29 +1934,13 @@ uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git)
return nth_midxed_pack_int_id(m, pack_pos_to_midx(bitmap_git->midx, 0));
}
-int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
- struct packed_git **packfile_out,
- uint32_t *entries,
- struct bitmap **reuse_out)
+static void reuse_partial_packfile_from_bitmap_1(struct bitmap_index *bitmap_git,
+ struct bitmapped_pack *pack,
+ struct bitmap *reuse)
{
- struct repository *r = the_repository;
- struct packed_git *pack;
struct bitmap *result = bitmap_git->result;
- struct bitmap *reuse;
struct pack_window *w_curs = NULL;
size_t i = 0;
- uint32_t offset;
- uint32_t objects_nr;
-
- assert(result);
-
- load_reverse_index(r, bitmap_git);
-
- if (bitmap_is_midx(bitmap_git))
- pack = bitmap_git->midx->packs[midx_preferred_pack(bitmap_git)];
- else
- pack = bitmap_git->pack;
- objects_nr = pack->num_objects;
while (i < result->word_alloc && result->words[i] == (eword_t)~0)
i++;
@@ -1969,15 +1953,15 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
* we use it instead of another pack. In single-pack bitmaps, the choice
* is made for us.
*/
- if (i > objects_nr / BITS_IN_EWORD)
- i = objects_nr / BITS_IN_EWORD;
+ if (i > pack->p->num_objects / BITS_IN_EWORD)
+ i = pack->p->num_objects / BITS_IN_EWORD;
- reuse = bitmap_word_alloc(i);
memset(reuse->words, 0xFF, i * sizeof(eword_t));
for (; i < result->word_alloc; ++i) {
eword_t word = result->words[i];
size_t pos = (i * BITS_IN_EWORD);
+ size_t offset;
for (offset = 0; offset < BITS_IN_EWORD; ++offset) {
if ((word >> offset) == 0)
@@ -2002,6 +1986,78 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
done:
unuse_pack(&w_curs);
+}
+
+static int bitmapped_pack_cmp(const void *va, const void *vb)
+{
+ const struct bitmapped_pack *a = va;
+ const struct bitmapped_pack *b = vb;
+
+ if (a->bitmap_pos < b->bitmap_pos)
+ return -1;
+ if (a->bitmap_pos > b->bitmap_pos)
+ return 1;
+ return 0;
+}
+
+int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
+ struct packed_git **packfile_out,
+ uint32_t *entries,
+ struct bitmap **reuse_out)
+{
+ struct repository *r = the_repository;
+ struct bitmapped_pack *packs = NULL;
+ struct bitmap *result = bitmap_git->result;
+ struct bitmap *reuse;
+ size_t i;
+ size_t packs_nr = 0, packs_alloc = 0;
+ size_t word_alloc;
+ uint32_t objects_nr = 0;
+
+ assert(result);
+
+ load_reverse_index(r, bitmap_git);
+
+ if (bitmap_is_midx(bitmap_git)) {
+ for (i = 0; i < bitmap_git->midx->num_packs; i++) {
+ struct bitmapped_pack pack;
+ if (nth_bitmapped_pack(r, bitmap_git->midx, &pack, i) < 0) {
+ warning(_("unable to load pack: '%s', disabling pack-reuse"),
+ bitmap_git->midx->pack_names[i]);
+ free(packs);
+ return -1;
+ }
+ if (!pack.bitmap_nr)
+ continue; /* no objects from this pack */
+ if (pack.bitmap_pos)
+ continue; /* not preferred pack */
+
+ ALLOC_GROW(packs, packs_nr + 1, packs_alloc);
+ memcpy(&packs[packs_nr++], &pack, sizeof(pack));
+
+ objects_nr += pack.p->num_objects;
+ }
+
+ QSORT(packs, packs_nr, bitmapped_pack_cmp);
+ } else {
+ ALLOC_GROW(packs, packs_nr + 1, packs_alloc);
+
+ packs[packs_nr].p = bitmap_git->pack;
+ packs[packs_nr].bitmap_pos = 0;
+ packs[packs_nr].bitmap_nr = bitmap_git->pack->num_objects;
+
+ objects_nr = packs[packs_nr++].p->num_objects;
+ }
+
+ word_alloc = objects_nr / BITS_IN_EWORD;
+ if (objects_nr % BITS_IN_EWORD)
+ word_alloc++;
+ reuse = bitmap_word_alloc(word_alloc);
+
+ if (packs_nr != 1)
+ BUG("pack reuse not yet implemented for multiple packs");
+
+ reuse_partial_packfile_from_bitmap_1(bitmap_git, packs, reuse);
*entries = bitmap_popcount(reuse);
if (!*entries) {
@@ -2014,7 +2070,7 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
* need to be handled separately.
*/
bitmap_and_not(result, reuse);
- *packfile_out = pack;
+ *packfile_out = packs[0].p;
*reuse_out = reuse;
return 0;
}
--
2.43.0.102.ga31d690331.dirty
^ permalink raw reply related
* [PATCH v2 06/26] midx: implement `midx_locate_pack()`
From: Taylor Blau @ 2023-12-14 22:23 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>
The multi-pack index API exposes a `midx_contains_pack()` function that
takes in a string ending in either ".idx" or ".pack" and returns whether
or not the MIDX contains a given pack corresponding to that string.
There is no corresponding function to locate the position of a pack
within the MIDX's pack order (sorted lexically by pack filename).
We could add an optional out parameter to `midx_contains_pack()` that is
filled out with the pack's position when the parameter is non-NULL. To
minimize the amount of fallout from this change, instead introduce a new
function by renaming `midx_contains_pack()` to `midx_locate_pack()`,
adding that output parameter, and then reimplementing
`midx_contains_pack()` in terms of it.
Future patches will make use of this new function.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
midx.c | 13 +++++++++++--
midx.h | 5 ++++-
2 files changed, 15 insertions(+), 3 deletions(-)
diff --git a/midx.c b/midx.c
index de25612b0c..beaf0c0de4 100644
--- a/midx.c
+++ b/midx.c
@@ -428,7 +428,8 @@ static int cmp_idx_or_pack_name(const char *idx_or_pack_name,
return strcmp(idx_or_pack_name, idx_name);
}
-int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name)
+int midx_locate_pack(struct multi_pack_index *m, const char *idx_or_pack_name,
+ uint32_t *pos)
{
uint32_t first = 0, last = m->num_packs;
@@ -439,8 +440,11 @@ int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name)
current = m->pack_names[mid];
cmp = cmp_idx_or_pack_name(idx_or_pack_name, current);
- if (!cmp)
+ if (!cmp) {
+ if (pos)
+ *pos = mid;
return 1;
+ }
if (cmp > 0) {
first = mid + 1;
continue;
@@ -451,6 +455,11 @@ int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name)
return 0;
}
+int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name)
+{
+ return midx_locate_pack(m, idx_or_pack_name, NULL);
+}
+
int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local)
{
struct multi_pack_index *m;
diff --git a/midx.h b/midx.h
index b404235db5..89c5aa637e 100644
--- a/midx.h
+++ b/midx.h
@@ -70,7 +70,10 @@ struct object_id *nth_midxed_object_oid(struct object_id *oid,
struct multi_pack_index *m,
uint32_t n);
int fill_midx_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e, struct multi_pack_index *m);
-int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name);
+int midx_contains_pack(struct multi_pack_index *m,
+ const char *idx_or_pack_name);
+int midx_locate_pack(struct multi_pack_index *m, const char *idx_or_pack_name,
+ uint32_t *pos);
int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local);
/*
--
2.43.0.102.ga31d690331.dirty
^ permalink raw reply related
* [PATCH v2 05/26] midx: implement `BTMP` chunk
From: Taylor Blau @ 2023-12-14 22:23 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>
When a multi-pack bitmap is used to implement verbatim pack reuse (that
is, when verbatim chunks from an on-disk packfile are copied
directly[^1]), it does so by using its "preferred pack" as the source
for pack-reuse.
This allows repositories to pack the majority of their objects into a
single (often large) pack, and then use it as the single source for
verbatim pack reuse. This increases the amount of objects that are
reused verbatim (and consequently, decrease the amount of time it takes
to generate many packs). But this performance comes at a cost, which is
that the preferred packfile must pace its growth with that of the entire
repository in order to maintain the utility of verbatim pack reuse.
As repositories grow beyond what we can reasonably store in a single
packfile, the utility of verbatim pack reuse diminishes. Or, at the very
least, it becomes increasingly more expensive to maintain as the pack
grows larger and larger.
It would be beneficial to be able to perform this same optimization over
multiple packs, provided some modest constraints (most importantly, that
the set of packs eligible for verbatim reuse are disjoint with respect
to the subset of their objects being sent).
If we assume that the packs which we treat as candidates for verbatim
reuse are disjoint with respect to any of their objects we may output,
we need to make only modest modifications to the verbatim pack-reuse
code itself. Most notably, we need to remove the assumption that the
bits in the reachability bitmap corresponding to objects from the single
reuse pack begin at the first bit position.
Future patches will unwind these assumptions and reimplement their
existing functionality as special cases of the more general assumptions
(e.g. that reuse bits can start anywhere within the bitset, but happen
to start at 0 for all existing cases).
This patch does not yet relax any of those assumptions. Instead, it
implements a foundational data-structure, the "Bitampped Packs" (`BTMP`)
chunk of the multi-pack index. The `BTMP` chunk's contents are described
in detail here. Importantly, the `BTMP` chunk contains information to
map regions of a multi-pack index's reachability bitmap to the packs
whose objects they represent.
For now, this chunk is only written, not read (outside of the test-tool
used in this patch to test the new chunk's behavior). Future patches
will begin to make use of this new chunk.
[^1]: Modulo patching any `OFS_DELTA`'s that cross over a region of the
pack that wasn't used verbatim.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
Documentation/gitformat-pack.txt | 76 ++++++++++++++++++++++++++++++++
midx.c | 75 +++++++++++++++++++++++++++++--
midx.h | 5 +++
pack-bitmap.h | 9 ++++
t/helper/test-read-midx.c | 30 ++++++++++++-
t/t5319-multi-pack-index.sh | 35 +++++++++++++++
6 files changed, 226 insertions(+), 4 deletions(-)
diff --git a/Documentation/gitformat-pack.txt b/Documentation/gitformat-pack.txt
index 9fcb29a9c8..d6ae229be5 100644
--- a/Documentation/gitformat-pack.txt
+++ b/Documentation/gitformat-pack.txt
@@ -396,6 +396,15 @@ CHUNK DATA:
is padded at the end with between 0 and 3 NUL bytes to make the
chunk size a multiple of 4 bytes.
+ Bitmapped Packfiles (ID: {'B', 'T', 'M', 'P'})
+ Stores a table of two 4-byte unsigned integers in network order.
+ Each table entry corresponds to a single pack (in the order that
+ they appear above in the `PNAM` chunk). The values for each table
+ entry are as follows:
+ - The first bit position (in pseudo-pack order, see below) to
+ contain an object from that pack.
+ - The number of bits whose objects are selected from that pack.
+
OID Fanout (ID: {'O', 'I', 'D', 'F'})
The ith entry, F[i], stores the number of OIDs with first
byte at most i. Thus F[255] stores the total
@@ -509,6 +518,73 @@ packs arranged in MIDX order (with the preferred pack coming first).
The MIDX's reverse index is stored in the optional 'RIDX' chunk within
the MIDX itself.
+=== `BTMP` chunk
+
+The Bitmapped Packfiles (`BTMP`) chunk encodes additional information
+about the objects in the multi-pack index's reachability bitmap. Recall
+that objects from the MIDX are arranged in "pseudo-pack" order (see
+above) for reachability bitmaps.
+
+From the example above, suppose we have packs "a", "b", and "c", with
+10, 15, and 20 objects, respectively. In pseudo-pack order, those would
+be arranged as follows:
+
+ |a,0|a,1|...|a,9|b,0|b,1|...|b,14|c,0|c,1|...|c,19|
+
+When working with single-pack bitmaps (or, equivalently, multi-pack
+reachability bitmaps with a preferred pack), linkgit:git-pack-objects[1]
+performs ``verbatim'' reuse, attempting to reuse chunks of the bitmapped
+or preferred packfile instead of adding objects to the packing list.
+
+When a chunk of bytes is reused from an existing pack, any objects
+contained therein do not need to be added to the packing list, saving
+memory and CPU time. But a chunk from an existing packfile can only be
+reused when the following conditions are met:
+
+ - The chunk contains only objects which were requested by the caller
+ (i.e. does not contain any objects which the caller didn't ask for
+ explicitly or implicitly).
+
+ - All objects stored in non-thin packs as offset- or reference-deltas
+ also include their base object in the resulting pack.
+
+The `BTMP` chunk encodes the necessary information in order to implement
+multi-pack reuse over a set of packfiles as described above.
+Specifically, the `BTMP` chunk encodes three pieces of information (all
+32-bit unsigned integers in network byte-order) for each packfile `p`
+that is stored in the MIDX, as follows:
+
+`bitmap_pos`:: The first bit position (in pseudo-pack order) in the
+ multi-pack index's reachability bitmap occupied by an object from `p`.
+
+`bitmap_nr`:: The number of bit positions (including the one at
+ `bitmap_pos`) that encode objects from that pack `p`.
+
+For example, the `BTMP` chunk corresponding to the above example (with
+packs ``a'', ``b'', and ``c'') would look like:
+
+[cols="1,2,2"]
+|===
+| |`bitmap_pos` |`bitmap_nr`
+
+|packfile ``a''
+|`0`
+|`10`
+
+|packfile ``b''
+|`10`
+|`15`
+
+|packfile ``c''
+|`25`
+|`20`
+|===
+
+With this information in place, we can treat each packfile as
+individually reusable in the same fashion as verbatim pack reuse is
+performed on individual packs prior to the implementation of the `BTMP`
+chunk.
+
== cruft packs
The cruft packs feature offer an alternative to Git's traditional mechanism of
diff --git a/midx.c b/midx.c
index 8dba67ddbe..de25612b0c 100644
--- a/midx.c
+++ b/midx.c
@@ -33,6 +33,7 @@
#define MIDX_CHUNK_ALIGNMENT 4
#define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
+#define MIDX_CHUNKID_BITMAPPEDPACKS 0x42544d50 /* "BTMP" */
#define MIDX_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
#define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
#define MIDX_CHUNKID_OBJECTOFFSETS 0x4f4f4646 /* "OOFF" */
@@ -41,6 +42,7 @@
#define MIDX_CHUNK_FANOUT_SIZE (sizeof(uint32_t) * 256)
#define MIDX_CHUNK_OFFSET_WIDTH (2 * sizeof(uint32_t))
#define MIDX_CHUNK_LARGE_OFFSET_WIDTH (sizeof(uint64_t))
+#define MIDX_CHUNK_BITMAPPED_PACKS_WIDTH (2 * sizeof(uint32_t))
#define MIDX_LARGE_OFFSET_NEEDED 0x80000000
#define PACK_EXPIRED UINT_MAX
@@ -193,6 +195,9 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
pair_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS, &m->chunk_large_offsets,
&m->chunk_large_offsets_len);
+ pair_chunk(cf, MIDX_CHUNKID_BITMAPPEDPACKS,
+ (const unsigned char **)&m->chunk_bitmapped_packs,
+ &m->chunk_bitmapped_packs_len);
if (git_env_bool("GIT_TEST_MIDX_READ_RIDX", 1))
pair_chunk(cf, MIDX_CHUNKID_REVINDEX, &m->chunk_revindex,
@@ -286,6 +291,26 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t
return 0;
}
+int nth_bitmapped_pack(struct repository *r, struct multi_pack_index *m,
+ struct bitmapped_pack *bp, uint32_t pack_int_id)
+{
+ if (!m->chunk_bitmapped_packs)
+ return error(_("MIDX does not contain the BTMP chunk"));
+
+ if (prepare_midx_pack(r, m, pack_int_id))
+ return error(_("could not load bitmapped pack %"PRIu32), pack_int_id);
+
+ bp->p = m->packs[pack_int_id];
+ bp->bitmap_pos = get_be32((char *)m->chunk_bitmapped_packs +
+ MIDX_CHUNK_BITMAPPED_PACKS_WIDTH * pack_int_id);
+ bp->bitmap_nr = get_be32((char *)m->chunk_bitmapped_packs +
+ MIDX_CHUNK_BITMAPPED_PACKS_WIDTH * pack_int_id +
+ sizeof(uint32_t));
+ bp->pack_int_id = pack_int_id;
+
+ return 0;
+}
+
int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result)
{
return bsearch_hash(oid->hash, m->chunk_oid_fanout, m->chunk_oid_lookup,
@@ -468,10 +493,16 @@ static size_t write_midx_header(struct hashfile *f,
return MIDX_HEADER_SIZE;
}
+#define BITMAP_POS_UNKNOWN (~((uint32_t)0))
+
struct pack_info {
uint32_t orig_pack_int_id;
char *pack_name;
struct packed_git *p;
+
+ uint32_t bitmap_pos;
+ uint32_t bitmap_nr;
+
unsigned expired : 1;
};
@@ -484,6 +515,7 @@ static void fill_pack_info(struct pack_info *info,
info->orig_pack_int_id = orig_pack_int_id;
info->pack_name = xstrdup(pack_name);
info->p = p;
+ info->bitmap_pos = BITMAP_POS_UNKNOWN;
}
static int pack_info_compare(const void *_a, const void *_b)
@@ -824,6 +856,26 @@ static int write_midx_pack_names(struct hashfile *f, void *data)
return 0;
}
+static int write_midx_bitmapped_packs(struct hashfile *f, void *data)
+{
+ struct write_midx_context *ctx = data;
+ size_t i;
+
+ for (i = 0; i < ctx->nr; i++) {
+ struct pack_info *pack = &ctx->info[i];
+ if (pack->expired)
+ continue;
+
+ if (pack->bitmap_pos == BITMAP_POS_UNKNOWN && pack->bitmap_nr)
+ BUG("pack '%s' has no bitmap position, but has %d bitmapped object(s)",
+ pack->pack_name, pack->bitmap_nr);
+
+ hashwrite_be32(f, pack->bitmap_pos);
+ hashwrite_be32(f, pack->bitmap_nr);
+ }
+ return 0;
+}
+
static int write_midx_oid_fanout(struct hashfile *f,
void *data)
{
@@ -991,8 +1043,19 @@ static uint32_t *midx_pack_order(struct write_midx_context *ctx)
QSORT(data, ctx->entries_nr, midx_pack_order_cmp);
ALLOC_ARRAY(pack_order, ctx->entries_nr);
- for (i = 0; i < ctx->entries_nr; i++)
+ for (i = 0; i < ctx->entries_nr; i++) {
+ struct pack_midx_entry *e = &ctx->entries[data[i].nr];
+ struct pack_info *pack = &ctx->info[ctx->pack_perm[e->pack_int_id]];
+ if (pack->bitmap_pos == BITMAP_POS_UNKNOWN)
+ pack->bitmap_pos = i;
+ pack->bitmap_nr++;
pack_order[i] = data[i].nr;
+ }
+ for (i = 0; i < ctx->nr; i++) {
+ struct pack_info *pack = &ctx->info[ctx->pack_perm[i]];
+ if (pack->bitmap_pos == BITMAP_POS_UNKNOWN)
+ pack->bitmap_pos = 0;
+ }
free(data);
trace2_region_leave("midx", "midx_pack_order", the_repository);
@@ -1293,6 +1356,7 @@ static int write_midx_internal(const char *object_dir,
struct hashfile *f = NULL;
struct lock_file lk;
struct write_midx_context ctx = { 0 };
+ int bitmapped_packs_concat_len = 0;
int pack_name_concat_len = 0;
int dropped_packs = 0;
int result = 0;
@@ -1505,8 +1569,10 @@ static int write_midx_internal(const char *object_dir,
}
for (i = 0; i < ctx.nr; i++) {
- if (!ctx.info[i].expired)
- pack_name_concat_len += strlen(ctx.info[i].pack_name) + 1;
+ if (ctx.info[i].expired)
+ continue;
+ pack_name_concat_len += strlen(ctx.info[i].pack_name) + 1;
+ bitmapped_packs_concat_len += 2 * sizeof(uint32_t);
}
/* Check that the preferred pack wasn't expired (if given). */
@@ -1566,6 +1632,9 @@ static int write_midx_internal(const char *object_dir,
add_chunk(cf, MIDX_CHUNKID_REVINDEX,
st_mult(ctx.entries_nr, sizeof(uint32_t)),
write_midx_revindex);
+ add_chunk(cf, MIDX_CHUNKID_BITMAPPEDPACKS,
+ bitmapped_packs_concat_len,
+ write_midx_bitmapped_packs);
}
write_midx_header(f, get_num_chunks(cf), ctx.nr - dropped_packs);
diff --git a/midx.h b/midx.h
index a5d98919c8..b404235db5 100644
--- a/midx.h
+++ b/midx.h
@@ -7,6 +7,7 @@
struct object_id;
struct pack_entry;
struct repository;
+struct bitmapped_pack;
#define GIT_TEST_MULTI_PACK_INDEX "GIT_TEST_MULTI_PACK_INDEX"
#define GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP \
@@ -33,6 +34,8 @@ struct multi_pack_index {
const unsigned char *chunk_pack_names;
size_t chunk_pack_names_len;
+ const uint32_t *chunk_bitmapped_packs;
+ size_t chunk_bitmapped_packs_len;
const uint32_t *chunk_oid_fanout;
const unsigned char *chunk_oid_lookup;
const unsigned char *chunk_object_offsets;
@@ -58,6 +61,8 @@ void get_midx_rev_filename(struct strbuf *out, struct multi_pack_index *m);
struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local);
int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id);
+int nth_bitmapped_pack(struct repository *r, struct multi_pack_index *m,
+ struct bitmapped_pack *bp, uint32_t pack_int_id);
int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result);
off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos);
uint32_t nth_midxed_pack_int_id(struct multi_pack_index *m, uint32_t pos);
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 5273a6a019..b68b213388 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -52,6 +52,15 @@ typedef int (*show_reachable_fn)(
struct bitmap_index;
+struct bitmapped_pack {
+ struct packed_git *p;
+
+ uint32_t bitmap_pos;
+ uint32_t bitmap_nr;
+
+ uint32_t pack_int_id; /* MIDX only */
+};
+
struct bitmap_index *prepare_bitmap_git(struct repository *r);
struct bitmap_index *prepare_midx_bitmap_git(struct multi_pack_index *midx);
void count_bitmap_commit_list(struct bitmap_index *, uint32_t *commits,
diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c
index e9a444ddba..e48557aba1 100644
--- a/t/helper/test-read-midx.c
+++ b/t/helper/test-read-midx.c
@@ -100,10 +100,36 @@ static int read_midx_preferred_pack(const char *object_dir)
return 0;
}
+static int read_midx_bitmapped_packs(const char *object_dir)
+{
+ struct multi_pack_index *midx = NULL;
+ struct bitmapped_pack pack;
+ uint32_t i;
+
+ setup_git_directory();
+
+ midx = load_multi_pack_index(object_dir, 1);
+ if (!midx)
+ return 1;
+
+ for (i = 0; i < midx->num_packs; i++) {
+ if (nth_bitmapped_pack(the_repository, midx, &pack, i) < 0)
+ return 1;
+
+ printf("%s\n", pack_basename(pack.p));
+ printf(" bitmap_pos: %"PRIuMAX"\n", (uintmax_t)pack.bitmap_pos);
+ printf(" bitmap_nr: %"PRIuMAX"\n", (uintmax_t)pack.bitmap_nr);
+ }
+
+ close_midx(midx);
+
+ return 0;
+}
+
int cmd__read_midx(int argc, const char **argv)
{
if (!(argc == 2 || argc == 3))
- usage("read-midx [--show-objects|--checksum|--preferred-pack] <object-dir>");
+ usage("read-midx [--show-objects|--checksum|--preferred-pack|--bitmap] <object-dir>");
if (!strcmp(argv[1], "--show-objects"))
return read_midx_file(argv[2], 1);
@@ -111,5 +137,7 @@ int cmd__read_midx(int argc, const char **argv)
return read_midx_checksum(argv[2]);
else if (!strcmp(argv[1], "--preferred-pack"))
return read_midx_preferred_pack(argv[2]);
+ else if (!strcmp(argv[1], "--bitmap"))
+ return read_midx_bitmapped_packs(argv[2]);
return read_midx_file(argv[1], 0);
}
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index c20aafe99a..dd09134db0 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -1171,4 +1171,39 @@ test_expect_success 'reader notices out-of-bounds fanout' '
test_cmp expect err
'
+test_expect_success 'bitmapped packs are stored via the BTMP chunk' '
+ test_when_finished "rm -fr repo" &&
+ git init repo &&
+ (
+ cd repo &&
+
+ for i in 1 2 3 4 5
+ do
+ test_commit "$i" &&
+ git repack -d || return 1
+ done &&
+
+ find $objdir/pack -type f -name "*.idx" | xargs -n 1 basename |
+ sort >packs &&
+
+ git multi-pack-index write --stdin-packs <packs &&
+ test_must_fail test-tool read-midx --bitmap $objdir 2>err &&
+ cat >expect <<-\EOF &&
+ error: MIDX does not contain the BTMP chunk
+ EOF
+ test_cmp expect err &&
+
+ git multi-pack-index write --stdin-packs --bitmap \
+ --preferred-pack="$(head -n1 <packs)" <packs &&
+ test-tool read-midx --bitmap $objdir >actual &&
+ for i in $(test_seq $(wc -l <packs))
+ do
+ sed -ne "${i}s/\.idx$/\.pack/p" packs &&
+ echo " bitmap_pos: $((($i - 1) * 3))" &&
+ echo " bitmap_nr: 3" || return 1
+ done >expect &&
+ test_cmp expect actual
+ )
+'
+
test_done
--
2.43.0.102.ga31d690331.dirty
^ permalink raw reply related
* [PATCH v2 04/26] midx: factor out `fill_pack_info()`
From: Taylor Blau @ 2023-12-14 22:23 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>
When selecting which packfiles will be written while generating a MIDX,
the MIDX internals fill out a 'struct pack_info' with various pieces of
book-keeping.
Instead of filling out each field of the `pack_info` structure
individually in each of the two spots that modify the array of such
structures (`ctx->info`), extract a common routine that does this for
us.
This reduces the code duplication by a modest amount. But more
importantly, it zero-initializes the structure before assigning values
into it. This hardens us for a future change which will add additional
fields to this structure which (until this patch) was not
zero-initialized.
As a result, any new fields added to the `pack_info` structure need only
be updated in a single location, instead of at each spot within midx.c.
There are no functional changes in this patch.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
midx.c | 38 ++++++++++++++++++++------------------
1 file changed, 20 insertions(+), 18 deletions(-)
diff --git a/midx.c b/midx.c
index 778dd536c8..8dba67ddbe 100644
--- a/midx.c
+++ b/midx.c
@@ -475,6 +475,17 @@ struct pack_info {
unsigned expired : 1;
};
+static void fill_pack_info(struct pack_info *info,
+ struct packed_git *p, const char *pack_name,
+ uint32_t orig_pack_int_id)
+{
+ memset(info, 0, sizeof(struct pack_info));
+
+ info->orig_pack_int_id = orig_pack_int_id;
+ info->pack_name = xstrdup(pack_name);
+ info->p = p;
+}
+
static int pack_info_compare(const void *_a, const void *_b)
{
struct pack_info *a = (struct pack_info *)_a;
@@ -515,6 +526,7 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
const char *file_name, void *data)
{
struct write_midx_context *ctx = data;
+ struct packed_git *p;
if (ends_with(file_name, ".idx")) {
display_progress(ctx->progress, ++ctx->pack_paths_checked);
@@ -541,27 +553,22 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc);
- ctx->info[ctx->nr].p = add_packed_git(full_path,
- full_path_len,
- 0);
-
- if (!ctx->info[ctx->nr].p) {
+ p = add_packed_git(full_path, full_path_len, 0);
+ if (!p) {
warning(_("failed to add packfile '%s'"),
full_path);
return;
}
- if (open_pack_index(ctx->info[ctx->nr].p)) {
+ if (open_pack_index(p)) {
warning(_("failed to open pack-index '%s'"),
full_path);
- close_pack(ctx->info[ctx->nr].p);
- FREE_AND_NULL(ctx->info[ctx->nr].p);
+ close_pack(p);
+ free(p);
return;
}
- ctx->info[ctx->nr].pack_name = xstrdup(file_name);
- ctx->info[ctx->nr].orig_pack_int_id = ctx->nr;
- ctx->info[ctx->nr].expired = 0;
+ fill_pack_info(&ctx->info[ctx->nr], p, file_name, ctx->nr);
ctx->nr++;
}
}
@@ -1321,11 +1328,6 @@ static int write_midx_internal(const char *object_dir,
for (i = 0; i < ctx.m->num_packs; i++) {
ALLOC_GROW(ctx.info, ctx.nr + 1, ctx.alloc);
- ctx.info[ctx.nr].orig_pack_int_id = i;
- ctx.info[ctx.nr].pack_name = xstrdup(ctx.m->pack_names[i]);
- ctx.info[ctx.nr].p = ctx.m->packs[i];
- ctx.info[ctx.nr].expired = 0;
-
if (flags & MIDX_WRITE_REV_INDEX) {
/*
* If generating a reverse index, need to have
@@ -1341,10 +1343,10 @@ static int write_midx_internal(const char *object_dir,
if (open_pack_index(ctx.m->packs[i]))
die(_("could not open index for %s"),
ctx.m->packs[i]->pack_name);
- ctx.info[ctx.nr].p = ctx.m->packs[i];
}
- ctx.nr++;
+ fill_pack_info(&ctx.info[ctx.nr++], ctx.m->packs[i],
+ ctx.m->pack_names[i], i);
}
}
--
2.43.0.102.ga31d690331.dirty
^ permalink raw reply related
* [PATCH v2 03/26] pack-bitmap: plug leak in find_objects()
From: Taylor Blau @ 2023-12-14 22:23 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>
The `find_objects()` function creates an object_list for any tips of the
reachability query which do not have corresponding bitmaps.
The object_list is not used outside of `find_objects()`, but we never
free it with `object_list_free()`, resulting in a leak. Let's plug that
leak by calling `object_list_free()`, which results in t6113 becoming
leak-free.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
pack-bitmap.c | 2 ++
t/t6113-rev-list-bitmap-filters.sh | 2 ++
2 files changed, 4 insertions(+)
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 0260890341..d2f1306960 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1280,6 +1280,8 @@ static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
base = fill_in_bitmap(bitmap_git, revs, base, seen);
}
+ object_list_free(¬_mapped);
+
return base;
}
diff --git a/t/t6113-rev-list-bitmap-filters.sh b/t/t6113-rev-list-bitmap-filters.sh
index 86c70521f1..459f0d7412 100755
--- a/t/t6113-rev-list-bitmap-filters.sh
+++ b/t/t6113-rev-list-bitmap-filters.sh
@@ -4,6 +4,8 @@ test_description='rev-list combining bitmaps and filters'
. ./test-lib.sh
. "$TEST_DIRECTORY"/lib-bitmap.sh
+TEST_PASSES_SANITIZE_LEAK=true
+
test_expect_success 'set up bitmapped repo' '
# one commit will have bitmaps, the other will not
test_commit one &&
--
2.43.0.102.ga31d690331.dirty
^ permalink raw reply related
* [PATCH v2 02/26] pack-bitmap-write: deep-clear the `bb_commit` slab
From: Taylor Blau @ 2023-12-14 22:23 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>
The `bb_commit` commit slab is used by the pack-bitmap-write machinery
to track various pieces of bookkeeping used to generate reachability
bitmaps.
Even though we clear the slab when freeing the bitmap_builder struct
(with `bitmap_builder_clear()`), there are still pointers which point to
locations in memory that have not yet been freed, resulting in a leak.
Plug the leak by introducing a suitable `free_fn` for the `struct
bb_commit` type, and make sure it is called on each member of the slab
via the `deep_clear_bb_data()` function.
Note that it is possible for both of the arguments to `bitmap_free()` to
be NULL, but `bitmap_free()` is a noop for NULL arguments, so it is OK
to pass them unconditionally.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
pack-bitmap-write.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
index f4ecdf8b0e..ae37fb6976 100644
--- a/pack-bitmap-write.c
+++ b/pack-bitmap-write.c
@@ -198,6 +198,13 @@ struct bb_commit {
unsigned idx; /* within selected array */
};
+static void clear_bb_commit(struct bb_commit *commit)
+{
+ free_commit_list(commit->reverse_edges);
+ bitmap_free(commit->commit_mask);
+ bitmap_free(commit->bitmap);
+}
+
define_commit_slab(bb_data, struct bb_commit);
struct bitmap_builder {
@@ -339,7 +346,7 @@ static void bitmap_builder_init(struct bitmap_builder *bb,
static void bitmap_builder_clear(struct bitmap_builder *bb)
{
- clear_bb_data(&bb->data);
+ deep_clear_bb_data(&bb->data, clear_bb_commit);
free(bb->commits);
bb->commits_nr = bb->commits_alloc = 0;
}
--
2.43.0.102.ga31d690331.dirty
^ permalink raw reply related
* [PATCH v2 01/26] pack-objects: free packing_data in more places
From: Taylor Blau @ 2023-12-14 22:23 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1702592603.git.me@ttaylorr.com>
The pack-objects internals use a packing_data struct to track what
objects are part of the pack(s) being formed.
Since these structures contain allocated fields, failing to
appropriately free() them results in a leak. Plug that leak by
introducing a clear_packing_data() function, and call it in the
appropriate spots.
This is a fairly straightforward leak to plug, since none of the callers
expect to read any values or have any references to parts of the address
space being freed.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
builtin/pack-objects.c | 1 +
midx.c | 5 +++++
pack-objects.c | 15 +++++++++++++++
pack-objects.h | 1 +
4 files changed, 22 insertions(+)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 89a8b5a976..321d7effb0 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4522,6 +4522,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
reuse_packfile_objects);
cleanup:
+ clear_packing_data(&to_pack);
list_objects_filter_release(&filter_options);
strvec_clear(&rp);
diff --git a/midx.c b/midx.c
index 1d14661dad..778dd536c8 100644
--- a/midx.c
+++ b/midx.c
@@ -1603,8 +1603,13 @@ static int write_midx_internal(const char *object_dir,
flags) < 0) {
error(_("could not write multi-pack bitmap"));
result = 1;
+ clear_packing_data(&pdata);
+ free(commits);
goto cleanup;
}
+
+ clear_packing_data(&pdata);
+ free(commits);
}
/*
* NOTE: Do not use ctx.entries beyond this point, since it might
diff --git a/pack-objects.c b/pack-objects.c
index f403ca6986..a9d9855063 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -151,6 +151,21 @@ void prepare_packing_data(struct repository *r, struct packing_data *pdata)
init_recursive_mutex(&pdata->odb_lock);
}
+void clear_packing_data(struct packing_data *pdata)
+{
+ if (!pdata)
+ return;
+
+ free(pdata->cruft_mtime);
+ free(pdata->in_pack);
+ free(pdata->in_pack_by_idx);
+ free(pdata->in_pack_pos);
+ free(pdata->index);
+ free(pdata->layer);
+ free(pdata->objects);
+ free(pdata->tree_depth);
+}
+
struct object_entry *packlist_alloc(struct packing_data *pdata,
const struct object_id *oid)
{
diff --git a/pack-objects.h b/pack-objects.h
index 0d78db40cb..b9898a4e64 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -169,6 +169,7 @@ struct packing_data {
};
void prepare_packing_data(struct repository *r, struct packing_data *pdata);
+void clear_packing_data(struct packing_data *pdata);
/* Protect access to object database */
static inline void packing_data_lock(struct packing_data *pdata)
--
2.43.0.102.ga31d690331.dirty
^ permalink raw reply related
* [PATCH v2 00/26] pack-objects: multi-pack verbatim reuse
From: Taylor Blau @ 2023-12-14 22:23 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano
In-Reply-To: <cover.1701198172.git.me@ttaylorr.com>
This is a reroll of my series to implement multi-pack verbatim reuse.
Since last time, I implemented a suggestion from Peff in [1] that
requires reusable packs be disjoint with respect to the objects they are
sending, not their entire set of objects. This allows for us to
implement multi-pack reuse without the DISP chunk or the
`--extend-disjoint`, `--exclude-disjoint`, `--retain-disjoint`, etc
options from the previous round.
Besides that, much is unchanged from last time, and the bulk of the
remaining changes are from Patrick Steinhardt's review of the first 2/3
of this series.
Performance remains mostly unchanged since last time, and I was able to
achieve the following results in hyperfine on my worst-case scenario
test repository (broken into ~100 packs):
$ hyperfine -L v single,multi -n '{v}-pack reuse' \
'git.compile -c pack.allowPackReuse={v} pack-objects --revs --stdout --use-bitmap-index --delta-base-offset <in >/dev/null'
Benchmark 1: single-pack reuse
Time (mean ± σ): 6.234 s ± 0.026 s [User: 43.733 s, System: 0.349 s]
Range (min … max): 6.197 s … 6.293 s 10 runs
Benchmark 2: multi-pack reuse
Time (mean ± σ): 959.5 ms ± 2.4 ms [User: 1133.8 ms, System: 36.3 ms]
Range (min … max): 957.2 ms … 964.8 ms 10 runs
Summary
multi-pack reuse ran
6.50 ± 0.03 times faster than single-pack reuse
As before, this series is still quite large, despite losing a fair bit
of code related to disjoint packs. After thinking on it some more, I
still couldn't come up with a satisfying way to break up this series, so
here it is presented in one big chunk. If others have ideas on how
better to present this series, please let me know.
In the meantime, thanks in advance for your review!
[1]: https://lore.kernel.org/git/20231212080332.GC1117953@coredump.intra.peff.net/
Taylor Blau (26):
pack-objects: free packing_data in more places
pack-bitmap-write: deep-clear the `bb_commit` slab
pack-bitmap: plug leak in find_objects()
midx: factor out `fill_pack_info()`
midx: implement `BTMP` chunk
midx: implement `midx_locate_pack()`
pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions
ewah: implement `bitmap_is_empty()`
pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature
pack-bitmap: return multiple packs via
`reuse_partial_packfile_from_bitmap()`
pack-objects: parameterize pack-reuse routines over a single pack
pack-objects: keep track of `pack_start` for each reuse pack
pack-objects: pass `bitmapped_pack`'s to pack-reuse functions
pack-objects: prepare `write_reused_pack()` for multi-pack reuse
pack-objects: prepare `write_reused_pack_verbatim()` for multi-pack
reuse
pack-objects: include number of packs reused in output
git-compat-util.h: implement checked size_t to uint32_t conversion
midx: implement `midx_preferred_pack()`
pack-revindex: factor out `midx_key_to_pack_pos()` helper
pack-revindex: implement `midx_pair_to_pack_pos()`
pack-bitmap: prepare to mark objects from multiple packs for reuse
pack-objects: add tracing for various packfile metrics
t/test-lib-functions.sh: implement `test_trace2_data` helper
pack-objects: allow setting `pack.allowPackReuse` to "single"
pack-bitmap: enable reuse from all bitmapped packs
t/perf: add performance tests for multi-pack reuse
Documentation/config/pack.txt | 16 +-
Documentation/gitformat-pack.txt | 76 +++++++
builtin/pack-objects.c | 169 +++++++++++----
ewah/bitmap.c | 9 +
ewah/ewok.h | 1 +
git-compat-util.h | 9 +
midx.c | 151 +++++++++++---
midx.h | 12 +-
pack-bitmap-write.c | 9 +-
pack-bitmap.c | 321 +++++++++++++++++++----------
pack-bitmap.h | 19 +-
pack-objects.c | 15 ++
pack-objects.h | 1 +
pack-revindex.c | 50 +++--
pack-revindex.h | 3 +
t/helper/test-read-midx.c | 41 +++-
t/perf/p5332-multi-pack-reuse.sh | 81 ++++++++
t/t5319-multi-pack-index.sh | 35 ++++
t/t5332-multi-pack-reuse.sh | 203 ++++++++++++++++++
t/t6113-rev-list-bitmap-filters.sh | 2 +
t/test-lib-functions.sh | 14 ++
21 files changed, 1039 insertions(+), 198 deletions(-)
create mode 100755 t/perf/p5332-multi-pack-reuse.sh
create mode 100755 t/t5332-multi-pack-reuse.sh
Range-diff against v1:
1: 101d6a2841 ! 1: 7d65abfa1d pack-objects: free packing_data in more places
@@ Commit message
Since these structures contain allocated fields, failing to
appropriately free() them results in a leak. Plug that leak by
- introducing a free_packing_data() function, and call it in the
+ introducing a clear_packing_data() function, and call it in the
appropriate spots.
This is a fairly straightforward leak to plug, since none of the callers
@@ builtin/pack-objects.c: int cmd_pack_objects(int argc, const char **argv, const
reuse_packfile_objects);
cleanup:
-+ free_packing_data(&to_pack);
++ clear_packing_data(&to_pack);
list_objects_filter_release(&filter_options);
strvec_clear(&rp);
@@ midx.c: static int write_midx_internal(const char *object_dir,
flags) < 0) {
error(_("could not write multi-pack bitmap"));
result = 1;
-+ free_packing_data(&pdata);
++ clear_packing_data(&pdata);
+ free(commits);
goto cleanup;
}
+
-+ free_packing_data(&pdata);
++ clear_packing_data(&pdata);
+ free(commits);
}
/*
@@ pack-objects.c: void prepare_packing_data(struct repository *r, struct packing_d
init_recursive_mutex(&pdata->odb_lock);
}
-+void free_packing_data(struct packing_data *pdata)
++void clear_packing_data(struct packing_data *pdata)
+{
+ if (!pdata)
+ return;
@@ pack-objects.h: struct packing_data {
};
void prepare_packing_data(struct repository *r, struct packing_data *pdata);
-+void free_packing_data(struct packing_data *pdata);
++void clear_packing_data(struct packing_data *pdata);
/* Protect access to object database */
static inline void packing_data_lock(struct packing_data *pdata)
2: 6f5ff96998 ! 2: 19cdaf59c5 pack-bitmap-write: deep-clear the `bb_commit` slab
@@ Commit message
bb_commit` type, and make sure it is called on each member of the slab
via the `deep_clear_bb_data()` function.
+ Note that it is possible for both of the arguments to `bitmap_free()` to
+ be NULL, but `bitmap_free()` is a noop for NULL arguments, so it is OK
+ to pass them unconditionally.
+
Signed-off-by: Taylor Blau <me@ttaylorr.com>
## pack-bitmap-write.c ##
@@ pack-bitmap-write.c: struct bb_commit {
+static void clear_bb_commit(struct bb_commit *commit)
+{
-+ free(commit->reverse_edges);
++ free_commit_list(commit->reverse_edges);
+ bitmap_free(commit->commit_mask);
+ bitmap_free(commit->bitmap);
+}
3: bc38fba655 = 3: 477df6c974 pack-bitmap: plug leak in find_objects()
4: ccf1337305 ! 4: a06ed75af1 midx: factor out `fill_pack_info()`
@@ midx.c: struct pack_info {
};
+static void fill_pack_info(struct pack_info *info,
-+ struct packed_git *p, char *pack_name,
++ struct packed_git *p, const char *pack_name,
+ uint32_t orig_pack_int_id)
+{
+ memset(info, 0, sizeof(struct pack_info));
+
+ info->orig_pack_int_id = orig_pack_int_id;
-+ info->pack_name = pack_name;
++ info->pack_name = xstrdup(pack_name);
+ info->p = p;
+}
+
@@ midx.c: static void add_pack_to_midx(const char *full_path, size_t full_path_len
+ if (open_pack_index(p)) {
warning(_("failed to open pack-index '%s'"),
full_path);
- close_pack(ctx->info[ctx->nr].p);
-@@ midx.c: static void add_pack_to_midx(const char *full_path, size_t full_path_len,
+- close_pack(ctx->info[ctx->nr].p);
+- FREE_AND_NULL(ctx->info[ctx->nr].p);
++ close_pack(p);
++ free(p);
return;
}
- ctx->info[ctx->nr].pack_name = xstrdup(file_name);
- ctx->info[ctx->nr].orig_pack_int_id = ctx->nr;
- ctx->info[ctx->nr].expired = 0;
-+ fill_pack_info(&ctx->info[ctx->nr], p, xstrdup(file_name),
-+ ctx->nr);
++ fill_pack_info(&ctx->info[ctx->nr], p, file_name, ctx->nr);
ctx->nr++;
}
}
@@ midx.c: static int write_midx_internal(const char *object_dir,
- ctx.nr++;
+ fill_pack_info(&ctx.info[ctx.nr++], ctx.m->packs[i],
-+ xstrdup(ctx.m->pack_names[i]), i);
++ ctx.m->pack_names[i], i);
}
}
5: c52d7e7b27 ! 5: 6fdc68418f midx: implement `DISP` chunk
@@ Metadata
Author: Taylor Blau <me@ttaylorr.com>
## Commit message ##
- midx: implement `DISP` chunk
+ midx: implement `BTMP` chunk
When a multi-pack bitmap is used to implement verbatim pack reuse (that
is, when verbatim chunks from an on-disk packfile are copied
@@ Commit message
It would be beneficial to be able to perform this same optimization over
multiple packs, provided some modest constraints (most importantly, that
the set of packs eligible for verbatim reuse are disjoint with respect
- to the objects that they contain).
+ to the subset of their objects being sent).
If we assume that the packs which we treat as candidates for verbatim
- reuse are disjoint with respect to their objects, we need to make only
- modest modifications to the verbatim pack-reuse code itself. Most
- notably, we need to remove the assumption that the bits in the
- reachability bitmap corresponding to objects from the single reuse pack
- begin at the first bit position.
+ reuse are disjoint with respect to any of their objects we may output,
+ we need to make only modest modifications to the verbatim pack-reuse
+ code itself. Most notably, we need to remove the assumption that the
+ bits in the reachability bitmap corresponding to objects from the single
+ reuse pack begin at the first bit position.
Future patches will unwind these assumptions and reimplement their
existing functionality as special cases of the more general assumptions
@@ Commit message
to start at 0 for all existing cases).
This patch does not yet relax any of those assumptions. Instead, it
- implements a foundational data-structure, the "Disjoint Packs" (`DISP`)
- chunk of the multi-pack index. The `DISP` chunk's contents are described
- in detail here. Importantly, the `DISP` chunk contains information to
+ implements a foundational data-structure, the "Bitampped Packs" (`BTMP`)
+ chunk of the multi-pack index. The `BTMP` chunk's contents are described
+ in detail here. Importantly, the `BTMP` chunk contains information to
map regions of a multi-pack index's reachability bitmap to the packs
whose objects they represent.
@@ Commit message
used in this patch to test the new chunk's behavior). Future patches
will begin to make use of this new chunk.
- This patch implements reading (though no callers outside of the above
- one perform any reading) and writing this new chunk. It also extends the
- `--stdin-packs` format used by the `git multi-pack-index write` builtin
- to be able to designate that a given pack is to be marked as "disjoint"
- by prefixing it with a '+' character.
-
[^1]: Modulo patching any `OFS_DELTA`'s that cross over a region of the
pack that wasn't used verbatim.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
- ## Documentation/git-multi-pack-index.txt ##
-@@ Documentation/git-multi-pack-index.txt: write::
- --stdin-packs::
- Write a multi-pack index containing only the set of
- line-delimited pack index basenames provided over stdin.
-+ Lines beginning with a '+' character (followed by the
-+ pack index basename as before) have their pack marked as
-+ "disjoint". See the "`DISP` chunk and disjoint packs"
-+ section in linkgit:gitformat-pack[5] for more.
-
- --refs-snapshot=<path>::
- With `--bitmap`, optionally specify a file which
-
## Documentation/gitformat-pack.txt ##
@@ Documentation/gitformat-pack.txt: CHUNK DATA:
is padded at the end with between 0 and 3 NUL bytes to make the
chunk size a multiple of 4 bytes.
-+ Disjoint Packfiles (ID: {'D', 'I', 'S', 'P'})
-+ Stores a table of three 4-byte unsigned integers in network order.
++ Bitmapped Packfiles (ID: {'B', 'T', 'M', 'P'})
++ Stores a table of two 4-byte unsigned integers in network order.
+ Each table entry corresponds to a single pack (in the order that
+ they appear above in the `PNAM` chunk). The values for each table
+ entry are as follows:
-+ - The first bit position (in psuedo-pack order, see below) to
++ - The first bit position (in pseudo-pack order, see below) to
+ contain an object from that pack.
+ - The number of bits whose objects are selected from that pack.
-+ - A "meta" value, whose least-significant bit indicates whether or
-+ not the pack is disjoint with respect to other packs. The
-+ remaining bits are unused.
-+ Two packs are "disjoint" with respect to one another when they have
-+ disjoint sets of objects. In other words, any object found in a pack
-+ contained in the set of disjoint packfiles is guaranteed to be
-+ uniquely located among those packs.
+
OID Fanout (ID: {'O', 'I', 'D', 'F'})
The ith entry, F[i], stores the number of OIDs with first
@@ Documentation/gitformat-pack.txt: packs arranged in MIDX order (with the preferr
The MIDX's reverse index is stored in the optional 'RIDX' chunk within
the MIDX itself.
-+=== `DISP` chunk and disjoint packs
++=== `BTMP` chunk
+
-+The Disjoint Packfiles (`DISP`) chunk encodes additional information
++The Bitmapped Packfiles (`BTMP`) chunk encodes additional information
+about the objects in the multi-pack index's reachability bitmap. Recall
-+that objects from the MIDX are arranged in "pseudo-pack" order (see:
++that objects from the MIDX are arranged in "pseudo-pack" order (see
+above) for reachability bitmaps.
+
+From the example above, suppose we have packs "a", "b", and "c", with
@@ Documentation/gitformat-pack.txt: packs arranged in MIDX order (with the preferr
+ |a,0|a,1|...|a,9|b,0|b,1|...|b,14|c,0|c,1|...|c,19|
+
+When working with single-pack bitmaps (or, equivalently, multi-pack
-+reachability bitmaps without any packs marked as disjoint),
-+linkgit:git-pack-objects[1] performs ``verbatim'' reuse, attempting to
-+reuse chunks of the existing packfile instead of adding objects to the
-+packing list.
++reachability bitmaps with a preferred pack), linkgit:git-pack-objects[1]
++performs ``verbatim'' reuse, attempting to reuse chunks of the bitmapped
++or preferred packfile instead of adding objects to the packing list.
+
-+When a chunk of bytes are reused from an existing pack, any objects
++When a chunk of bytes is reused from an existing pack, any objects
+contained therein do not need to be added to the packing list, saving
+memory and CPU time. But a chunk from an existing packfile can only be
+reused when the following conditions are met:
@@ Documentation/gitformat-pack.txt: packs arranged in MIDX order (with the preferr
+ (i.e. does not contain any objects which the caller didn't ask for
+ explicitly or implicitly).
+
-+ - All objects stored as offset- or reference-deltas also include their
-+ base object in the resulting pack.
++ - All objects stored in non-thin packs as offset- or reference-deltas
++ also include their base object in the resulting pack.
+
-+Additionally, packfiles many not contain more than one copy of any given
-+object. This introduces an additional constraint over the set of packs
-+we may want to reuse. The most straightforward approach is to mandate
-+that the set of packs is disjoint with respect to the set of objects
-+contained in each pack. In other words, for each object `o` in the union
-+of all objects stored by the disjoint set of packs, `o` is contained in
-+exactly one pack from the disjoint set.
-+
-+One alternative design choice for multi-pack reuse might instead involve
-+imposing a chunk-level constraint that allows packs in the reusable set
-+to contain multiple copies across different packs, but restricts each
-+chunk against including more than one copy of such an object. This is in
-+theory possible to implement, but significantly more complicated than
-+forcing packs themselves to be disjoint. Most notably, we would have to
-+keep track of which objects have already been sent during verbatim
-+pack-reuse, defeating the main purpose of verbatim pack reuse (that we
-+don't have to keep track of individual objects).
-+
-+The `DISP` chunk encodes the necessary information in order to implement
-+multi-pack reuse over a disjoint set of packs as described above.
-+Specifically, the `DISP` chunk encodes three pieces of information (all
++The `BTMP` chunk encodes the necessary information in order to implement
++multi-pack reuse over a set of packfiles as described above.
++Specifically, the `BTMP` chunk encodes three pieces of information (all
+32-bit unsigned integers in network byte-order) for each packfile `p`
+that is stored in the MIDX, as follows:
+
@@ Documentation/gitformat-pack.txt: packs arranged in MIDX order (with the preferr
+`bitmap_nr`:: The number of bit positions (including the one at
+ `bitmap_pos`) that encode objects from that pack `p`.
+
-+`disjoint`:: Metadata, including whether or not the pack `p` is
-+ ``disjoint''. The least significant bit stores whether or not the pack
-+ is disjoint. The remaining bits are reserved for future use.
-+
-+For example, the `DISP` chunk corresponding to the above example (with
++For example, the `BTMP` chunk corresponding to the above example (with
+packs ``a'', ``b'', and ``c'') would look like:
+
-+[cols="1,2,2,2"]
++[cols="1,2,2"]
+|===
-+| |`bitmap_pos` |`bitmap_nr` |`disjoint`
++| |`bitmap_pos` |`bitmap_nr`
+
+|packfile ``a''
+|`0`
+|`10`
-+|`0x1`
+
+|packfile ``b''
+|`10`
+|`15`
-+|`0x1`
+
+|packfile ``c''
+|`25`
+|`20`
-+|`0x1`
+|===
+
-+With these constraints and information in place, we can treat each
-+packfile marked as disjoint as individually reusable in the same fashion
-+as verbatim pack reuse is performed on individual packs prior to the
-+implementation of the `DISP` chunk.
++With this information in place, we can treat each packfile as
++individually reusable in the same fashion as verbatim pack reuse is
++performed on individual packs prior to the implementation of the `BTMP`
++chunk.
+
== cruft packs
The cruft packs feature offer an alternative to Git's traditional mechanism of
- ## builtin/multi-pack-index.c ##
-@@ builtin/multi-pack-index.c: static int git_multi_pack_index_write_config(const char *var, const char *value,
- return 0;
- }
-
-+#define DISJOINT ((void*)(uintptr_t)1)
-+
- static void read_packs_from_stdin(struct string_list *to)
- {
- struct strbuf buf = STRBUF_INIT;
-- while (strbuf_getline(&buf, stdin) != EOF)
-- string_list_append(to, buf.buf);
-+ while (strbuf_getline(&buf, stdin) != EOF) {
-+ if (*buf.buf == '+')
-+ string_list_append(to, buf.buf + 1)->util = DISJOINT;
-+ else
-+ string_list_append(to, buf.buf);
-+ }
- string_list_sort(to);
-
- strbuf_release(&buf);
-
## midx.c ##
@@
#define MIDX_CHUNK_ALIGNMENT 4
#define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
-+#define MIDX_CHUNKID_DISJOINTPACKS 0x44495350 /* "DISP" */
++#define MIDX_CHUNKID_BITMAPPEDPACKS 0x42544d50 /* "BTMP" */
#define MIDX_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
#define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
#define MIDX_CHUNKID_OBJECTOFFSETS 0x4f4f4646 /* "OOFF" */
+@@
+ #define MIDX_CHUNK_FANOUT_SIZE (sizeof(uint32_t) * 256)
+ #define MIDX_CHUNK_OFFSET_WIDTH (2 * sizeof(uint32_t))
+ #define MIDX_CHUNK_LARGE_OFFSET_WIDTH (sizeof(uint64_t))
++#define MIDX_CHUNK_BITMAPPED_PACKS_WIDTH (2 * sizeof(uint32_t))
+ #define MIDX_LARGE_OFFSET_NEEDED 0x80000000
+
+ #define PACK_EXPIRED UINT_MAX
@@ midx.c: struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
pair_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS, &m->chunk_large_offsets,
&m->chunk_large_offsets_len);
-+ pair_chunk(cf, MIDX_CHUNKID_DISJOINTPACKS,
-+ (const unsigned char **)&m->chunk_disjoint_packs,
-+ &m->chunk_disjoint_packs_len);
++ pair_chunk(cf, MIDX_CHUNKID_BITMAPPEDPACKS,
++ (const unsigned char **)&m->chunk_bitmapped_packs,
++ &m->chunk_bitmapped_packs_len);
if (git_env_bool("GIT_TEST_MIDX_READ_RIDX", 1))
pair_chunk(cf, MIDX_CHUNKID_REVINDEX, &m->chunk_revindex,
@@ midx.c: int prepare_midx_pack(struct repository *r, struct multi_pack_index *m,
+int nth_bitmapped_pack(struct repository *r, struct multi_pack_index *m,
+ struct bitmapped_pack *bp, uint32_t pack_int_id)
+{
-+ if (!m->chunk_disjoint_packs)
-+ return error(_("MIDX does not contain the DISP chunk"));
++ if (!m->chunk_bitmapped_packs)
++ return error(_("MIDX does not contain the BTMP chunk"));
+
+ if (prepare_midx_pack(r, m, pack_int_id))
-+ return error(_("could not load disjoint pack %"PRIu32), pack_int_id);
++ return error(_("could not load bitmapped pack %"PRIu32), pack_int_id);
+
+ bp->p = m->packs[pack_int_id];
-+ bp->bitmap_pos = get_be32(m->chunk_disjoint_packs + 3 * pack_int_id);
-+ bp->bitmap_nr = get_be32(m->chunk_disjoint_packs + 3 * pack_int_id + 1);
-+ bp->disjoint = !!get_be32(m->chunk_disjoint_packs + 3 * pack_int_id + 2);
++ bp->bitmap_pos = get_be32((char *)m->chunk_bitmapped_packs +
++ MIDX_CHUNK_BITMAPPED_PACKS_WIDTH * pack_int_id);
++ bp->bitmap_nr = get_be32((char *)m->chunk_bitmapped_packs +
++ MIDX_CHUNK_BITMAPPED_PACKS_WIDTH * pack_int_id +
++ sizeof(uint32_t));
++ bp->pack_int_id = pack_int_id;
+
+ return 0;
+}
@@ midx.c: static size_t write_midx_header(struct hashfile *f,
uint32_t orig_pack_int_id;
char *pack_name;
struct packed_git *p;
-- unsigned expired : 1;
+
+ uint32_t bitmap_pos;
+ uint32_t bitmap_nr;
+
-+ unsigned expired : 1,
-+ disjoint : 1;
+ unsigned expired : 1;
};
- static void fill_pack_info(struct pack_info *info,
@@ midx.c: static void fill_pack_info(struct pack_info *info,
info->orig_pack_int_id = orig_pack_int_id;
- info->pack_name = pack_name;
+ info->pack_name = xstrdup(pack_name);
info->p = p;
+ info->bitmap_pos = BITMAP_POS_UNKNOWN;
}
static int pack_info_compare(const void *_a, const void *_b)
-@@ midx.c: static void add_pack_to_midx(const char *full_path, size_t full_path_len,
- {
- struct write_midx_context *ctx = data;
- struct packed_git *p;
-+ struct string_list_item *item = NULL;
-
- if (ends_with(file_name, ".idx")) {
- display_progress(ctx->progress, ++ctx->pack_paths_checked);
-@@ midx.c: static void add_pack_to_midx(const char *full_path, size_t full_path_len,
- * should be performed independently (likely checking
- * to_include before the existing MIDX).
- */
-- if (ctx->m && midx_contains_pack(ctx->m, file_name))
-- return;
-- else if (ctx->to_include &&
-- !string_list_has_string(ctx->to_include, file_name))
-+ if (ctx->m && midx_contains_pack(ctx->m, file_name)) {
- return;
-+ } else if (ctx->to_include) {
-+ item = string_list_lookup(ctx->to_include, file_name);
-+ if (!item)
-+ return;
-+ }
-
- ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc);
-
-@@ midx.c: static void add_pack_to_midx(const char *full_path, size_t full_path_len,
-
- fill_pack_info(&ctx->info[ctx->nr], p, xstrdup(file_name),
- ctx->nr);
-+ if (item)
-+ ctx->info[ctx->nr].disjoint = !!item->util;
- ctx->nr++;
- }
- }
-@@ midx.c: struct pack_midx_entry {
- uint32_t pack_int_id;
- time_t pack_mtime;
- uint64_t offset;
-- unsigned preferred : 1;
-+ unsigned preferred : 1,
-+ disjoint : 1;
- };
-
- static int midx_oid_compare(const void *_a, const void *_b)
-@@ midx.c: static int midx_oid_compare(const void *_a, const void *_b)
- if (a->preferred < b->preferred)
- return 1;
-
-+ /* Sort objects in a disjoint pack last when multiple copies exist. */
-+ if (a->disjoint < b->disjoint)
-+ return -1;
-+ if (a->disjoint > b->disjoint)
-+ return 1;
-+
- if (a->pack_mtime > b->pack_mtime)
- return -1;
- else if (a->pack_mtime < b->pack_mtime)
-@@ midx.c: static void midx_fanout_add_midx_fanout(struct midx_fanout *fanout,
- &fanout->entries[fanout->nr],
- cur_object);
- fanout->entries[fanout->nr].preferred = 0;
-+ fanout->entries[fanout->nr].disjoint = 0;
- fanout->nr++;
- }
- }
-@@ midx.c: static void midx_fanout_add_pack_fanout(struct midx_fanout *fanout,
- cur_object,
- &fanout->entries[fanout->nr],
- preferred);
-+ fanout->entries[fanout->nr].disjoint = info->disjoint;
- fanout->nr++;
- }
- }
-@@ midx.c: static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
- * Take only the first duplicate.
- */
- for (cur_object = 0; cur_object < fanout.nr; cur_object++) {
-- if (cur_object && oideq(&fanout.entries[cur_object - 1].oid,
-- &fanout.entries[cur_object].oid))
-- continue;
-+ struct pack_midx_entry *ours = &fanout.entries[cur_object];
-+ if (cur_object) {
-+ struct pack_midx_entry *prev = &fanout.entries[cur_object - 1];
-+ if (oideq(&prev->oid, &ours->oid)) {
-+ if (prev->disjoint && ours->disjoint)
-+ die(_("duplicate object '%s' among disjoint packs '%s', '%s'"),
-+ oid_to_hex(&prev->oid),
-+ info[prev->pack_int_id].pack_name,
-+ info[ours->pack_int_id].pack_name);
-+ continue;
-+ }
-+ }
-
- ALLOC_GROW(deduplicated_entries, st_add(*nr_objects, 1),
- alloc_objects);
-- memcpy(&deduplicated_entries[*nr_objects],
-- &fanout.entries[cur_object],
-+ memcpy(&deduplicated_entries[*nr_objects], ours,
- sizeof(struct pack_midx_entry));
- (*nr_objects)++;
- }
@@ midx.c: static int write_midx_pack_names(struct hashfile *f, void *data)
return 0;
}
-+static int write_midx_disjoint_packs(struct hashfile *f, void *data)
++static int write_midx_bitmapped_packs(struct hashfile *f, void *data)
+{
+ struct write_midx_context *ctx = data;
+ size_t i;
@@ midx.c: static int write_midx_pack_names(struct hashfile *f, void *data)
+
+ hashwrite_be32(f, pack->bitmap_pos);
+ hashwrite_be32(f, pack->bitmap_nr);
-+ hashwrite_be32(f, !!pack->disjoint);
+ }
+ return 0;
+}
@@ midx.c: static int write_midx_internal(const char *object_dir,
struct hashfile *f = NULL;
struct lock_file lk;
struct write_midx_context ctx = { 0 };
-+ int pack_disjoint_concat_len = 0;
++ int bitmapped_packs_concat_len = 0;
int pack_name_concat_len = 0;
int dropped_packs = 0;
int result = 0;
@@ midx.c: static int write_midx_internal(const char *object_dir,
+ if (ctx.info[i].expired)
+ continue;
+ pack_name_concat_len += strlen(ctx.info[i].pack_name) + 1;
-+ pack_disjoint_concat_len += 3 * sizeof(uint32_t);
++ bitmapped_packs_concat_len += 2 * sizeof(uint32_t);
}
/* Check that the preferred pack wasn't expired (if given). */
@@ midx.c: static int write_midx_internal(const char *object_dir,
add_chunk(cf, MIDX_CHUNKID_REVINDEX,
st_mult(ctx.entries_nr, sizeof(uint32_t)),
write_midx_revindex);
-+ add_chunk(cf, MIDX_CHUNKID_DISJOINTPACKS,
-+ pack_disjoint_concat_len, write_midx_disjoint_packs);
++ add_chunk(cf, MIDX_CHUNKID_BITMAPPEDPACKS,
++ bitmapped_packs_concat_len,
++ write_midx_bitmapped_packs);
}
write_midx_header(f, get_num_chunks(cf), ctx.nr - dropped_packs);
@@ midx.h: struct multi_pack_index {
const unsigned char *chunk_pack_names;
size_t chunk_pack_names_len;
-+ const uint32_t *chunk_disjoint_packs;
-+ size_t chunk_disjoint_packs_len;
++ const uint32_t *chunk_bitmapped_packs;
++ size_t chunk_bitmapped_packs_len;
const uint32_t *chunk_oid_fanout;
const unsigned char *chunk_oid_lookup;
const unsigned char *chunk_object_offsets;
@@ pack-bitmap.h: typedef int (*show_reachable_fn)(
+ uint32_t bitmap_pos;
+ uint32_t bitmap_nr;
+
-+ unsigned disjoint : 1;
++ uint32_t pack_int_id; /* MIDX only */
+};
+
struct bitmap_index *prepare_bitmap_git(struct repository *r);
@@ t/helper/test-read-midx.c: static int read_midx_preferred_pack(const char *objec
+ printf("%s\n", pack_basename(pack.p));
+ printf(" bitmap_pos: %"PRIuMAX"\n", (uintmax_t)pack.bitmap_pos);
+ printf(" bitmap_nr: %"PRIuMAX"\n", (uintmax_t)pack.bitmap_nr);
-+ printf(" disjoint: %s\n", pack.disjoint & 0x1 ? "yes" : "no");
+ }
+
+ close_midx(midx);
@@ t/helper/test-read-midx.c: int cmd__read_midx(int argc, const char **argv)
}
## t/t5319-multi-pack-index.sh ##
-@@ t/t5319-multi-pack-index.sh: test_expect_success 'reader notices too-small revindex chunk' '
- test_cmp expect.err err
+@@ t/t5319-multi-pack-index.sh: test_expect_success 'reader notices out-of-bounds fanout' '
+ test_cmp expect err
'
-+test_expect_success 'disjoint packs are stored via the DISP chunk' '
++test_expect_success 'bitmapped packs are stored via the BTMP chunk' '
+ test_when_finished "rm -fr repo" &&
+ git init repo &&
+ (
@@ t/t5319-multi-pack-index.sh: test_expect_success 'reader notices too-small revin
+ git repack -d || return 1
+ done &&
+
-+ find $objdir/pack -type f -name "*.idx" | xargs -n 1 basename | sort >packs &&
++ find $objdir/pack -type f -name "*.idx" | xargs -n 1 basename |
++ sort >packs &&
+
+ git multi-pack-index write --stdin-packs <packs &&
+ test_must_fail test-tool read-midx --bitmap $objdir 2>err &&
+ cat >expect <<-\EOF &&
-+ error: MIDX does not contain the DISP chunk
++ error: MIDX does not contain the BTMP chunk
+ EOF
+ test_cmp expect err &&
+
-+ sed -e "s/^/+/g" packs >in &&
+ git multi-pack-index write --stdin-packs --bitmap \
-+ --preferred-pack="$(head -n1 <packs)" <in &&
++ --preferred-pack="$(head -n1 <packs)" <packs &&
+ test-tool read-midx --bitmap $objdir >actual &&
+ for i in $(test_seq $(wc -l <packs))
+ do
+ sed -ne "${i}s/\.idx$/\.pack/p" packs &&
-+ echo " bitmap_pos: $(( $(( $i - 1 )) * 3 ))" &&
-+ echo " bitmap_nr: 3" &&
-+ echo " disjoint: yes" || return 1
++ echo " bitmap_pos: $((($i - 1) * 3))" &&
++ echo " bitmap_nr: 3" || return 1
+ done >expect &&
+ test_cmp expect actual
+ )
+'
-+
-+test_expect_success 'non-disjoint packs are detected' '
-+ test_when_finished "rm -fr repo" &&
-+ git init repo &&
-+ (
-+ cd repo &&
-+
-+ test_commit base &&
-+ git repack -d &&
-+ test_commit other &&
-+ git repack -a &&
-+
-+ ls -la .git/objects/pack/ &&
-+
-+ find $objdir/pack -type f -name "*.idx" |
-+ sed -e "s/.*\/\(.*\)$/+\1/g" >in &&
-+
-+ test_must_fail git multi-pack-index write --stdin-packs \
-+ --bitmap <in 2>err &&
-+ grep "duplicate object.* among disjoint packs" err
-+ )
-+'
+
test_done
6: 541fbb442b = 6: 96f397a2b2 midx: implement `midx_locate_pack()`
7: 3019738b52 < -: ---------- midx: implement `--retain-disjoint` mode
8: 0368f7ab37 < -: ---------- pack-objects: implement `--ignore-disjoint` mode
9: b75869befb < -: ---------- repack: implement `--extend-disjoint` mode
10: 970bd9eaea ! 7: df9233cf0f pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions
@@ pack-bitmap.c: int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitma
unuse_pack(&w_curs);
+}
+
++static int bitmapped_pack_cmp(const void *va, const void *vb)
++{
++ const struct bitmapped_pack *a = va;
++ const struct bitmapped_pack *b = vb;
++
++ if (a->bitmap_pos < b->bitmap_pos)
++ return -1;
++ if (a->bitmap_pos > b->bitmap_pos)
++ return 1;
++ return 0;
++}
++
+int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
+ struct packed_git **packfile_out,
+ uint32_t *entries,
@@ pack-bitmap.c: int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitma
+
+ objects_nr += pack.p->num_objects;
+ }
++
++ QSORT(packs, packs_nr, bitmapped_pack_cmp);
+ } else {
+ ALLOC_GROW(packs, packs_nr + 1, packs_alloc);
+
+ packs[packs_nr].p = bitmap_git->pack;
+ packs[packs_nr].bitmap_pos = 0;
+ packs[packs_nr].bitmap_nr = bitmap_git->pack->num_objects;
-+ packs[packs_nr].disjoint = 1;
+
+ objects_nr = packs[packs_nr++].p->num_objects;
+ }
-: ---------- > 8: 595b3b6986 ewah: implement `bitmap_is_empty()`
11: 432854b27c ! 9: d851f821fc pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature
@@ builtin/pack-objects.c: static int get_object_list_from_bitmap(struct rev_info *
display_progress(progress_state, nr_seen);
## pack-bitmap.c ##
-@@ pack-bitmap.c: static void reuse_partial_packfile_from_bitmap_1(struct bitmap_index *bitmap_git
- unuse_pack(&w_curs);
+@@ pack-bitmap.c: static int bitmapped_pack_cmp(const void *va, const void *vb)
+ return 0;
}
-int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
@@ pack-bitmap.c: int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitma
- *entries = bitmap_popcount(reuse);
- if (!*entries) {
-+ if (!bitmap_popcount(reuse)) {
++ if (bitmap_is_empty(reuse)) {
+ free(packs);
bitmap_free(reuse);
- return -1;
12: 36b794d9e1 ! 10: f551892bab pack-bitmap: return multiple packs via `reuse_partial_packfile_from_bitmap()`
@@ builtin/pack-objects.c: static int pack_options_allow_reuse(void)
BUG("expected non-empty reuse bitmap");
## pack-bitmap.c ##
-@@ pack-bitmap.c: static void reuse_partial_packfile_from_bitmap_1(struct bitmap_index *bitmap_git
+@@ pack-bitmap.c: static int bitmapped_pack_cmp(const void *va, const void *vb)
}
void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
13: dca1083d8e = 11: 67e4fd8a06 pack-objects: parameterize pack-reuse routines over a single pack
14: 6f4fba861b ! 12: 9a5c38514b pack-objects: keep track of `pack_start` for each reuse pack
@@ Commit message
In order to compute this value correctly, we need to know not only where
we are in the packfile we're assembling (with `hashfile_total(f)`) but
also the position of the first byte of the packfile that we are
- currently reusing.
+ currently reusing. Currently, this works just fine, since when reusing
+ only a single pack those two values are always identical (because
+ verbatim reuse is the first thing pack-objects does when enabled after
+ writing the pack header).
+
+ But when reusing multiple packs which have one or more gaps, we'll need
+ to account for these two values diverging.
Together, these two allow us to compute the reused chunk's offset
difference relative to the start of the reused pack, as desired.
15: 2bb01e2370 = 13: 5492d11f25 pack-objects: pass `bitmapped_pack`'s to pack-reuse functions
16: 67a8196978 ! 14: b32742ebcb pack-objects: prepare `write_reused_pack()` for multi-pack reuse
@@ Commit message
Prepare this function for multi-pack reuse by removing the assumption
that the bit position corresponding to the first object being reused
- from a given pack may not be at bit position zero.
+ from a given pack must be at bit position zero.
The changes in this function are mostly straightforward. Initialize `i`
to the position of the first word to contain bits corresponding to that
17: 1f766648df = 15: 805c42185a pack-objects: prepare `write_reused_pack_verbatim()` for multi-pack reuse
18: 4cd9f99bfd ! 16: 55696bc1c9 pack-objects: include number of packs reused in output
@@ builtin/pack-objects.c: int cmd_pack_objects(int argc, const char **argv, const
+ (uintmax_t)reuse_packfiles_used_nr);
cleanup:
- free_packing_data(&to_pack);
+ clear_packing_data(&to_pack);
19: 176c4c95ac < -: ---------- pack-bitmap: prepare to mark objects from multiple packs for reuse
-: ---------- > 17: 6ede9e0603 git-compat-util.h: implement checked size_t to uint32_t conversion
-: ---------- > 18: ab0456a71e midx: implement `midx_preferred_pack()`
-: ---------- > 19: 14b054d272 pack-revindex: factor out `midx_key_to_pack_pos()` helper
-: ---------- > 20: e99481014e pack-revindex: implement `midx_pair_to_pack_pos()`
-: ---------- > 21: 3e3625aebe pack-bitmap: prepare to mark objects from multiple packs for reuse
20: 46f1a03b8b ! 22: 1723cd0384 pack-objects: add tracing for various packfile metrics
@@ builtin/pack-objects.c: int cmd_pack_objects(int argc, const char **argv, const
+ trace2_data_intmax("pack-objects", the_repository, "packs-reused", reuse_packfiles_used_nr);
+
cleanup:
- free_packing_data(&to_pack);
+ clear_packing_data(&to_pack);
list_objects_filter_release(&filter_options);
21: fb764fbacc = 23: 79c830e37a t/test-lib-functions.sh: implement `test_trace2_data` helper
22: 3140a1703a ! 24: d06b0961b5 pack-objects: allow setting `pack.allowPackReuse` to "single"
@@ Commit message
"single" implies the same behavior as "true", "1", "yes", and so on. But
it will complement a new "multi" value (to be introduced in a future
commit). When set to "single", we will only perform pack reuse on a
- single pack, regardless of whether or not there are multiple disjoint
+ single pack, regardless of whether or not there are multiple MIDX'd
packs.
This requires no code changes (yet), since we only support single pack
23: 7345e39467 ! 25: 7002cf08fe pack-bitmap: reuse objects from all disjoint packs
@@ Metadata
Author: Taylor Blau <me@ttaylorr.com>
## Commit message ##
- pack-bitmap: reuse objects from all disjoint packs
+ pack-bitmap: enable reuse from all bitmapped packs
Now that both the pack-bitmap and pack-objects code are prepared to
- handle marking and using objects from multiple disjoint packs for
- verbatim reuse, allow marking objects from all disjoint packs as
+ handle marking and using objects from multiple bitmapped packs for
+ verbatim reuse, allow marking objects from all bitmapped packs as
eligible for reuse.
Within the `reuse_partial_packfile_from_bitmap()` function, we no longer
only mark the pack whose first object is at bit position zero for reuse,
- and instead mark any pack which is flagged as disjoint by the MIDX as a
- reuse candidate. If no such packs exist (i.e because we are reading a
- MIDX written before the "DISP" chunk was introduced), then treat the
- preferred pack as disjoint for the purposes of reuse. This is a safe
- assumption to make since all duplicate objects are resolved in favor of
- the preferred pack.
+ and instead mark any pack contained in the MIDX as a reuse candidate.
Provide a handful of test cases in a new script (t5332) exercising
interesting behavior for multi-pack reuse to ensure that we performed
@@ Commit message
Signed-off-by: Taylor Blau <me@ttaylorr.com>
## Documentation/config/pack.txt ##
-@@ Documentation/config/pack.txt: to linkgit:git-repack[1].
+@@ Documentation/config/pack.txt: all existing objects. You can force recompression by passing the -F option
+ to linkgit:git-repack[1].
+
pack.allowPackReuse::
- When true or "single", and when reachability bitmaps are enabled,
- pack-objects will try to send parts of the bitmapped packfile
+- When true or "single", and when reachability bitmaps are enabled,
+- pack-objects will try to send parts of the bitmapped packfile
- verbatim. This can reduce memory and CPU usage to serve fetches,
-+ verbatim. When "multi", and when a multi-pack reachability bitmap is
-+ available, pack-objects will try to send parts of all packs marked as
-+ disjoint by the MIDX. If only a single pack bitmap is available, and
+- but might result in sending a slightly larger pack. Defaults to
+- true.
++ When true or "single", and when reachability bitmaps are
++ enabled, pack-objects will try to send parts of the bitmapped
++ packfile verbatim. When "multi", and when a multi-pack
++ reachability bitmap is available, pack-objects will try to send
++ parts of all packs in the MIDX.
+++
++ If only a single pack bitmap is available, and
+ `pack.allowPackReuse` is set to "multi", reuse parts of just the
-+ bitmapped packfile. This can reduce memory and CPU usage to serve fetches,
- but might result in sending a slightly larger pack. Defaults to
- true.
++ bitmapped packfile. This can reduce memory and CPU usage to
++ serve fetches, but might result in sending a slightly larger
++ pack. Defaults to true.
+ pack.island::
+ An extended regular expression configuring a set of delta
## builtin/pack-objects.c ##
@@ builtin/pack-objects.c: static int use_bitmap_index = -1;
@@ builtin/pack-objects.c: static int get_object_list_from_bitmap(struct rev_info *
reuse_packfile_objects = bitmap_popcount(reuse_packfile_bitmap);
## pack-bitmap.c ##
-@@ pack-bitmap.c: static void reuse_partial_packfile_from_bitmap_1(struct bitmap_index *bitmap_git
- unuse_pack(&w_curs);
- }
-
-+static void make_disjoint_pack(struct bitmapped_pack *out, struct packed_git *p)
-+{
-+ out->p = p;
-+ out->bitmap_pos = 0;
-+ out->bitmap_nr = p->num_objects;
-+ out->disjoint = 1;
-+}
-+
+@@ pack-bitmap.c: static int bitmapped_pack_cmp(const void *va, const void *vb)
void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
struct bitmapped_pack **packs_out,
size_t *packs_nr_out,
@@ pack-bitmap.c: void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitm
free(packs);
return;
}
-- if (!pack.bitmap_nr)
++
+ if (!pack.bitmap_nr)
- continue; /* no objects from this pack */
- if (pack.bitmap_pos)
- continue; /* not preferred pack */
-+
-+ if (!pack.disjoint)
+ continue;
+
+ if (!multi_pack_reuse && pack.bitmap_pos) {
@@ pack-bitmap.c: void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitm
+
+ if (!multi_pack_reuse)
+ break;
-+ }
-+
-+ if (!packs_nr) {
-+ /*
-+ * Old MIDXs (i.e. those written before the "DISP" chunk
-+ * existed) will not have any packs marked as disjoint.
-+ *
-+ * But we still want to perform pack reuse with the
-+ * special "preferred pack" as before. To do this, form
-+ * the singleton set containing just the preferred pack,
-+ * which is trivially disjoint with itself.
-+ *
-+ * Moreover, the MIDX is guaranteed to resolve duplicate
-+ * objects in favor of the copy in the preferred pack
-+ * (if one exists). Thus, we can safely perform pack
-+ * reuse on this pack.
-+ */
-+ uint32_t preferred_pack_pos;
-+ struct packed_git *preferred_pack;
-+
-+ preferred_pack_pos = midx_preferred_pack(bitmap_git);
-+ preferred_pack = bitmap_git->midx->packs[preferred_pack_pos];
-+
-+ ALLOC_GROW(packs, packs_nr + 1, packs_alloc);
-+
-+ make_disjoint_pack(&packs[packs_nr], preferred_pack);
-+ objects_nr = packs[packs_nr++].p->num_objects;
}
- } else {
+
+ QSORT(packs, packs_nr, bitmapped_pack_cmp);
+@@ pack-bitmap.c: void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
ALLOC_GROW(packs, packs_nr + 1, packs_alloc);
-- packs[packs_nr].p = bitmap_git->pack;
+ packs[packs_nr].p = bitmap_git->pack;
- packs[packs_nr].bitmap_pos = 0;
-- packs[packs_nr].bitmap_nr = bitmap_git->pack->num_objects;
-- packs[packs_nr].disjoint = 1;
--
-+ make_disjoint_pack(&packs[packs_nr], bitmap_git->pack);
- objects_nr = packs[packs_nr++].p->num_objects;
+ packs[packs_nr].bitmap_nr = bitmap_git->pack->num_objects;
++ packs[packs_nr].bitmap_pos = 0;
+
+- objects_nr = packs[packs_nr++].p->num_objects;
++ objects_nr = packs[packs_nr++].bitmap_nr;
}
+ word_alloc = objects_nr / BITS_IN_EWORD;
@@ pack-bitmap.c: void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
word_alloc++;
reuse = bitmap_word_alloc(word_alloc);
@@ pack-bitmap.c: void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitm
+ for (i = 0; i < packs_nr; i++)
+ reuse_partial_packfile_from_bitmap_1(bitmap_git, &packs[i], reuse);
- if (!bitmap_popcount(reuse)) {
+ if (bitmap_is_empty(reuse)) {
free(packs);
## pack-bitmap.h ##
-@@ pack-bitmap.h: uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git);
+@@ pack-bitmap.h: struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
struct bitmapped_pack **packs_out,
size_t *packs_nr_out,
@@ t/t5332-multi-pack-reuse.sh (new)
+
+. ./test-lib.sh
+. "$TEST_DIRECTORY"/lib-bitmap.sh
-+. "$TEST_DIRECTORY"/lib-disjoint.sh
+
+objdir=.git/objects
+packdir=$objdir/pack
+
-+all_packs () {
-+ find $packdir -type f -name "*.idx" | sed -e 's/^.*\/\([^\/]\)/\1/g'
-+}
-+
-+all_disjoint () {
-+ all_packs | sed -e 's/^/+/g'
-+}
-+
+test_pack_reused () {
+ test_trace2_data pack-objects pack-reused "$1"
+}
@@ t/t5332-multi-pack-reuse.sh (new)
+ grep "$1" objects | cut -d" " -f1
+}
+
-+test_expect_success 'setup' '
++test_expect_success 'preferred pack is reused for single-pack reuse' '
++ test_config pack.allowPackReuse single &&
++
++ for i in A B
++ do
++ test_commit "$i" &&
++ git repack -d || return 1
++ done &&
++
++ git multi-pack-index write --bitmap &&
++
++ : >trace2.txt &&
++ GIT_TRACE2_EVENT="$PWD/trace2.txt" \
++ git pack-objects --stdout --revs --all >/dev/null &&
++
++ test_pack_reused 3 <trace2.txt &&
++ test_packs_reused 1 <trace2.txt
++'
++
++test_expect_success 'enable multi-pack reuse' '
+ git config pack.allowPackReuse multi
+'
+
-+test_expect_success 'preferred pack is reused without packs marked disjoint' '
-+ test_commit A &&
-+ test_commit B &&
-+
-+ A="$(echo A | git pack-objects --unpacked --delta-base-offset $packdir/pack)" &&
-+ B="$(echo B | git pack-objects --unpacked --delta-base-offset $packdir/pack)" &&
-+
-+ git prune-packed &&
-+
-+ git multi-pack-index write --bitmap &&
-+
-+ test_must_not_be_disjoint "pack-$A.pack" &&
-+ test_must_not_be_disjoint "pack-$B.pack" &&
-+
-+ : >trace2.txt &&
-+ GIT_TRACE2_EVENT="$PWD/trace2.txt" \
-+ git pack-objects --stdout --revs --all >/dev/null &&
-+
-+ test_pack_reused 3 <trace2.txt &&
-+ test_packs_reused 1 <trace2.txt
-+'
-+
-+test_expect_success 'reuse all objects from subset of disjoint packs' '
++test_expect_success 'reuse all objects from subset of bitmapped packs' '
+ test_commit C &&
++ git repack -d &&
+
-+ C="$(echo C | git pack-objects --unpacked --delta-base-offset $packdir/pack)" &&
-+
-+ git prune-packed &&
++ git multi-pack-index write --bitmap &&
+
+ cat >in <<-EOF &&
-+ pack-$A.idx
-+ +pack-$B.idx
-+ +pack-$C.idx
++ $(git rev-parse C)
++ ^$(git rev-parse A)
+ EOF
-+ git multi-pack-index write --bitmap --stdin-packs <in &&
-+
-+ test_must_not_be_disjoint "pack-$A.pack" &&
-+ test_must_be_disjoint "pack-$B.pack" &&
-+ test_must_be_disjoint "pack-$C.pack" &&
+
+ : >trace2.txt &&
+ GIT_TRACE2_EVENT="$PWD/trace2.txt" \
-+ git pack-objects --stdout --revs --all >/dev/null &&
++ git pack-objects --stdout --revs <in >/dev/null &&
+
+ test_pack_reused 6 <trace2.txt &&
+ test_packs_reused 2 <trace2.txt
+'
+
-+test_expect_success 'reuse all objects from all disjoint packs' '
-+ rm -fr $packdir/multi-pack-index* &&
-+
-+ all_disjoint >in &&
-+ git multi-pack-index write --bitmap --stdin-packs <in &&
-+
-+ test_must_be_disjoint "pack-$A.pack" &&
-+ test_must_be_disjoint "pack-$B.pack" &&
-+ test_must_be_disjoint "pack-$C.pack" &&
-+
++test_expect_success 'reuse all objects from all packs' '
+ : >trace2.txt &&
+ GIT_TRACE2_EVENT="$PWD/trace2.txt" \
+ git pack-objects --stdout --revs --all >/dev/null &&
@@ t/t5332-multi-pack-reuse.sh (new)
+ test_packs_reused 3 <trace2.txt
+'
+
-+test_expect_success 'reuse objects from first disjoint pack with middle gap' '
-+ test_commit D &&
-+ test_commit E &&
-+ test_commit F &&
++test_expect_success 'reuse objects from first pack with middle gap' '
++ for i in D E F
++ do
++ test_commit "$i" || return 1
++ done &&
+
+ # Set "pack.window" to zero to ensure that we do not create any
+ # deltas, which could alter the amount of pack reuse we perform
@@ t/t5332-multi-pack-reuse.sh (new)
+ # Ensure that the pack we are constructing sorts ahead of any
+ # other packs in lexical/bitmap order by choosing it as the
+ # preferred pack.
-+ all_disjoint >in &&
-+ git multi-pack-index write --bitmap --preferred-pack="pack-$D.idx" \
-+ --stdin-packs <in &&
-+
-+ test_must_be_disjoint pack-$A.pack &&
-+ test_must_be_disjoint pack-$B.pack &&
-+ test_must_be_disjoint pack-$C.pack &&
-+ test_must_be_disjoint pack-$D.pack &&
++ git multi-pack-index write --bitmap --preferred-pack="pack-$D.idx" &&
+
+ cat >in <<-EOF &&
+ $(git rev-parse E)
@@ t/t5332-multi-pack-reuse.sh (new)
+ test_packs_reused 1 <trace2.txt
+'
+
-+test_expect_success 'reuse objects from middle disjoint pack with middle gap' '
++test_expect_success 'reuse objects from middle pack with middle gap' '
+ rm -fr $packdir/multi-pack-index* &&
+
+ # Ensure that the pack we are constructing sort into any
+ # position *but* the first one, by choosing a different pack as
+ # the preferred one.
-+ all_disjoint >in &&
-+ git multi-pack-index write --bitmap --preferred-pack="pack-$A.idx" \
-+ --stdin-packs <in &&
++ git multi-pack-index write --bitmap --preferred-pack="pack-$A.idx" &&
+
+ cat >in <<-EOF &&
+ $(git rev-parse E)
@@ t/t5332-multi-pack-reuse.sh (new)
+ test_packs_reused 1 <trace2.txt
+'
+
-+test_expect_success 'omit delta with uninteresting base' '
++test_expect_success 'omit delta with uninteresting base (same pack)' '
+ git repack -adk &&
+
+ test_seq 32 >f &&
@@ t/t5332-multi-pack-reuse.sh (new)
+
+ have_delta "$(git rev-parse $delta:f)" "$(git rev-parse $base:f)" &&
+
-+ all_disjoint >in &&
-+ git multi-pack-index write --bitmap --stdin-packs <in &&
++ git multi-pack-index write --bitmap &&
+
+ cat >in <<-EOF &&
+ $(git rev-parse other)
@@ t/t5332-multi-pack-reuse.sh (new)
+ GIT_TRACE2_EVENT="$PWD/trace2.txt" \
+ git pack-objects --stdout --delta-base-offset --revs <in >/dev/null &&
+
-+ # Even though all packs are marked disjoint, we can only reuse
-+ # the 3 objects corresponding to "other" from the latest pack.
++ # We can only reuse the 3 objects corresponding to "other" from
++ # the latest pack.
+ #
+ # This is because even though we want "delta", we do not want
+ # "base", meaning that we have to inflate the delta/base-pair
@@ t/t5332-multi-pack-reuse.sh (new)
+ test_packs_reused 1 <trace2.txt
+'
+
++test_expect_success 'omit delta from uninteresting base (cross pack)' '
++ cat >in <<-EOF &&
++ $(git rev-parse $base)
++ ^$(git rev-parse $delta)
++ EOF
++
++ P="$(git pack-objects --revs $packdir/pack <in)" &&
++
++ git multi-pack-index write --bitmap --preferred-pack="pack-$P.idx" &&
++
++ : >trace2.txt &&
++ GIT_TRACE2_EVENT="$PWD/trace2.txt" \
++ git pack-objects --stdout --delta-base-offset --all >/dev/null &&
++
++ packs_nr="$(find $packdir -type f -name "pack-*.pack" | wc -l)" &&
++ objects_nr="$(git rev-list --count --all --objects)" &&
++
++ test_pack_reused $(($objects_nr - 1)) <trace2.txt &&
++ test_packs_reused $packs_nr <trace2.txt
++'
++
+test_done
24: 980b318f98 = 26: 94e5ae4cf6 t/perf: add performance tests for multi-pack reuse
base-commit: 1a87c842ece327d03d08096395969aca5e0a6996
--
2.43.0.102.ga31d690331.dirty
^ permalink raw reply
* Re: Question regarding Git updater
From: Jeff King @ 2023-12-14 22:12 UTC (permalink / raw)
To: Andreas Scholz; +Cc: git
In-Reply-To: <CAHDWvZyHDbjOnnCYCkfMY+HPWobrcgP6c1kkWFrRgWV4fHED=w@mail.gmail.com>
On Thu, Dec 14, 2023 at 05:44:41PM +0100, Andreas Scholz wrote:
> I hope you can help me with answering my question regarding the update
> mechanism for Git after it has been installed.
You may need to be more specific here about how you're installing Git.
The Git project itself does not ship any binary packages at all, let
alone ones that have an update mechanism.
If you're using the Git for Windows installer, it looks like it does
have an auto-update feature:
https://github.com/git-for-windows/build-extra/blob/15b05c2399f152783d1fe9f167692dd5cd8ae1e1/installer/install.iss#L114
You might get a response on this list from Git for Windows folks, but
there is also a separate Git for Windows list:
https://groups.google.com/g/git-for-windows
You may be more likely to get a response there, or opening an issue on
the GitHub repo:
https://github.com/git-for-windows/git
If you're using something else, you may have to ask the specific
packager.
-Peff
^ permalink raw reply
* Re: [PATCH] Use ^=1 to toggle between 0 and 1
From: Jeff King @ 2023-12-14 22:05 UTC (permalink / raw)
To: René Scharfe; +Cc: AtariDreams via GitGitGadget, git, Seija Kijin
In-Reply-To: <4d0b2a5f-305b-4350-b164-44923cb250d8@web.de>
On Thu, Dec 14, 2023 at 02:08:31PM +0100, René Scharfe wrote:
> > I don't even know that we'd need much of a weather-balloon patch. I
> > think it would be valid to do:
> >
> > #ifndef bool
> > #define bool int
> >
> > to handle pre-C99 compilers (if there even are any these days). Of
> > course we probably need some conditional magic to try to "#include
> > <stdbool.h>" for the actual C99. I guess we could assume C99 by default
> > and then add NO_STDBOOL as an escape hatch if anybody complains.
>
> The semantics are slightly different in edge cases, so that fallback
> would not be fully watertight. E.g. consider:
>
> bool b(bool cond) {return cond == true;}
> bool b2(void) {return b(2);}
Yeah. b2() is wrong for passing "2" to a bool. I assumed that the
compiler would warn of that (at least for people on modern C99
compilers, not the fallback code), but it doesn't seem to. It's been a
long time since I've worked on a code base that made us of "bool", but I
guess that idea is that silently coercing a non-zero int to a bool is
reasonable in many cases (e.g., "bool found_foo = count_foos()").
I guess one could argue that b() is also sub-optimal, as it should just
say "return cond" or "return !cond" rather than explicitly comparing to
true/false. But I won't be surprised if it happens from time to time.
> A coding rule to not compare bools could mitigate that. Or a rule to
> only use the values true and false in bool context and to only use
> logical operators on them.
That seems more complex than we want if our goal is just supporting
legacy systems that may or may not even exist. Given your example, I'd
be more inclined to just do a weather-balloon adding <stdbool.h> to
git-compat-util.h, and using "bool" in a single spot in the code. If
nobody screams after a few releases, we can consider it OK. If they do,
it's a trivial patch to convert back.
-Peff
^ permalink raw reply
* Serving an .html web page from inside a git repository
From: luigi cerbai @ 2023-12-14 21:56 UTC (permalink / raw)
To: git
Hello, I have a technical question to ask regarding serving an .html webpage from inside a git repository.
In this setup, I have to make the .html page contents viewable on a browser open to outside traffic without
using web server programs hosted on my local machine. The role of my local machine is to push newer files
to the git repository while the software located inside that directory serves the content to the outside world
that can be accessed by clicking or typing their own co-respective URLs on their browser.
I want to test this setup in order to learn how websites works but I have no idea how to pull this off, do you
have any directions to provide in order to help me build this setup?
^ permalink raw reply
* [PATCH 2/2] mailinfo: avoid recursion when unquoting From headers
From: Jeff King @ 2023-12-14 21:48 UTC (permalink / raw)
To: Patrick Steinhardt
Cc: git, Taylor Blau, Carlos Andrés Ramírez Cataño
In-Reply-To: <20231214214444.GB2297853@coredump.intra.peff.net>
Our unquote_comment() function is recursive; when it sees a comment
within a comment, like:
(this is an (embedded) comment)
it recurses to handle the inner comment. This is fine for practical use,
but it does mean that you can easily run out of stack space with a
malicious header. For example:
perl -e 'print "From: ", "(" x 2**18;' |
git mailinfo /dev/null /dev/null
segfaults on my system. And since mailinfo is likely to be fed untrusted
input from the Internet (if not by human users, who might recognize a
garbage header, but certainly there are automated systems that apply
patches from a list) it may be possible for an attacker to trigger the
problem.
That said, I don't think there's an interesting security vulnerability
here. All an attacker can do is make it impossible to parse their email
and apply their patch, and there are lots of ways to generate bogus
emails. So it's more of an annoyance than anything.
But it's pretty easy to fix it. The recursion is not helping us preserve
any particular state from each level. The only flag in our parsing is
take_next_literally, and we can never recurse when it is set (since the
start of a new comment implies it was not backslash-escaped). So it is
really only useful for finding the end of the matched pair of
parentheses. We can do that easily with a simple depth counter.
Signed-off-by: Jeff King <peff@peff.net>
---
mailinfo.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/mailinfo.c b/mailinfo.c
index 737b9e5e13..db236f9f9f 100644
--- a/mailinfo.c
+++ b/mailinfo.c
@@ -59,6 +59,7 @@ static void parse_bogus_from(struct mailinfo *mi, const struct strbuf *line)
static const char *unquote_comment(struct strbuf *outbuf, const char *in)
{
int take_next_literally = 0;
+ int depth = 1;
strbuf_addch(outbuf, '(');
@@ -72,11 +73,14 @@ static const char *unquote_comment(struct strbuf *outbuf, const char *in)
take_next_literally = 1;
continue;
case '(':
- in = unquote_comment(outbuf, in);
+ strbuf_addch(outbuf, '(');
+ depth++;
continue;
case ')':
strbuf_addch(outbuf, ')');
- return in;
+ if (!--depth)
+ return in;
+ continue;
}
}
--
2.43.0.363.g1d22a8f302
^ permalink raw reply related
* [PATCH 1/2] t5100: make rfc822 comment test more careful
From: Jeff King @ 2023-12-14 21:47 UTC (permalink / raw)
To: Patrick Steinhardt
Cc: git, Taylor Blau, Carlos Andrés Ramírez Cataño
In-Reply-To: <20231214214444.GB2297853@coredump.intra.peff.net>
When processing "From" headers in an email, mailinfo "unquotes" quoted
strings and rfc822 parenthesized comments. For quoted strings, we
actually remove the double-quotes, so:
From: "A U Thor" <someone@example.com>
become:
Author: A U Thor
Email: someone@example.com
But for comments, we leave the outer parentheses in place, so:
From: A U (this is a comment) Thor <someone@example.com>
becomes:
Author: A U (this is a comment) Thor
Email: someone@example.com
So what is the comment "unquoting" actually doing? In our code, being in
a comment section has exactly two effects:
1. We'll unquote backslash-escaped characters inside a comment
section.
2. We _won't_ unquote double-quoted strings inside a comment section.
Our test for comments in t5100 checks this:
From: "A U Thor" <somebody@example.com> (this is \(really\) a comment (honestly))
So it is covering (1), but not (2). Let's add in a quoted string to
cover this.
Moreover, because the comment appears at the end of the From header,
there's nothing to confirm that we correctly found the end of the
comment section (and not just the end-of-string). Let's instead move it
to the beginning of the header, which means we can confirm that the
existing quoted string is detected (which will only happen if we know
we've left the comment block).
As expected, the test continues to pass, but this will give us more
confidence as we refactor the code in the next patch.
Signed-off-by: Jeff King <peff@peff.net>
---
I did this as a separate patch so it is easy to see that the existing
code already passes the test (i.e., our refactor is a true noop in terms
of behavior, and not fixing or breaking anything).
| 2 +-
| 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
--git a/t/t5100/comment.expect b/t/t5100/comment.expect
index 7228177984..bd71956a47 100644
--- a/t/t5100/comment.expect
+++ b/t/t5100/comment.expect
@@ -1,4 +1,4 @@
-Author: A U Thor (this is (really) a comment (honestly))
+Author: (this is (really) a "comment" (honestly)) A U Thor
Email: somebody@example.com
Subject: testing comments
Date: Sun, 25 May 2008 00:38:18 -0700
--git a/t/t5100/comment.in b/t/t5100/comment.in
index c53a192dfe..0b7e903b06 100644
--- a/t/t5100/comment.in
+++ b/t/t5100/comment.in
@@ -1,5 +1,5 @@
From 1234567890123456789012345678901234567890 Mon Sep 17 00:00:00 2001
-From: "A U Thor" <somebody@example.com> (this is \(really\) a comment (honestly))
+From: (this is \(really\) a "comment" (honestly)) "A U Thor" <somebody@example.com>
Date: Sun, 25 May 2008 00:38:18 -0700
Subject: [PATCH] testing comments
--
2.43.0.363.g1d22a8f302
^ permalink raw reply related
* [PATCH 0/2] avoiding recursion in mailinfo
From: Jeff King @ 2023-12-14 21:44 UTC (permalink / raw)
To: Patrick Steinhardt
Cc: git, Taylor Blau, Carlos Andrés Ramírez Cataño
In-Reply-To: <ZXqxoKLFG19tMFpF@tanuki>
On Thu, Dec 14, 2023 at 08:41:20AM +0100, Patrick Steinhardt wrote:
> > @@ -72,11 +73,14 @@ static const char *unquote_comment(struct strbuf *outbuf, const char *in)
> > take_next_literally = 1;
> > continue;
> > case '(':
> > - in = unquote_comment(outbuf, in);
> > + strbuf_addch(outbuf, '(');
> > + depth++;
> > continue;
> > case ')':
> > strbuf_addch(outbuf, ')');
> > - return in;
> > + if (!--depth)
> > + return in;
> > + continue;
> > }
> > }
> >
> > I doubt it's a big deal either way, but if it's that easy to do it might
> > be worth it.
>
> Isn't this only protecting against unbalanced braces? That might be a
> sensible check to do regardless, but does it protect against recursion
> blowing the stack if you just happen to have many opening braces?
>
> Might be I'm missing something.
It protects against recrusion blowing the stack because it removes the
recursive call to unquote_comment(). ;)
The "depth" stuff is there because without recursion, we have to know
when we've hit the correct closing paren (as opposed to an embedded
one).
Here it is as a patch (actually two patches). I don't think it's high
priority, but I'd sunk enough brain cells into thinking about it that I
wanted to capture the work. ;)
I did it on top of the earlier mailinfo out-of-bounds read fix, but it
could be applied separately.
[1/2]: t5100: make rfc822 comment test more careful
[2/2]: mailinfo: avoid recursion when unquoting From headers
mailinfo.c | 8 ++++++--
t/t5100/comment.expect | 2 +-
t/t5100/comment.in | 2 +-
3 files changed, 8 insertions(+), 4 deletions(-)
-Peff
^ permalink raw reply
* Re: [PATCH] mailinfo: fix out-of-bounds memory reads in unquote_quoted_pair()
From: Jeff King @ 2023-12-14 21:40 UTC (permalink / raw)
To: Junio C Hamano
Cc: Patrick Steinhardt, git, Taylor Blau,
Carlos Andrés Ramírez Cataño
In-Reply-To: <xmqqy1dynofo.fsf@gitster.g>
On Wed, Dec 13, 2023 at 06:54:03AM -0800, Junio C Hamano wrote:
> I actually had trouble with the proposed update, and wondered if
>
> - while ((c = *in++) != 0) {
> + while ((c = *in)) {
> + in++;
>
> is easier to follow, but then was hit by the possibility that the
> same "we have incremented 'in' a bit too early" may exist if such
> a loop wants to use 'in' in its body. Wouldn't it mean that
>
> - while ((c = *in++) != 0) {
> + for (; c = *in; in++) {
>
> would be even a better rewrite?
No, the "for" loop wouldn't work, because the loop body actually depends
on "in" having already been incremented. If we find the end of the
comment or quoted string, we return "in", and the caller is expecting it
to have moved past the closing quote. So that would have to become
"return in+1".
IOW, the issue is that the normal end-of-quote parsing and hitting
end-of-string are fundamentally different. So we either need to
differentiate the returns (either with "+1" on one, or "-1" on the
other). Or we need to choose to increment "in" based on which we found
(which is what my patch does).
-Peff
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox