* [PATCH] repack-promisor: add fake paths to oids when repacking promisor objects
@ 2026-03-27 16:12 Abraham Samuel Adekunle
2026-03-31 6:02 ` Patrick Steinhardt
0 siblings, 1 reply; 4+ messages in thread
From: Abraham Samuel Adekunle @ 2026-03-27 16:12 UTC (permalink / raw)
To: git
Cc: Christian Couder, Karthik Nayak, Justin Tobler, Siddharth Asthana,
Ayush Chandekar, Lucas Seiki Oshiro, Junio C Hamano,
Patrick Steinhardt, Phillip Wood
This change addresses the NEEDSWORK comment added by commit
5d19e81 (repack: repack promisor objects if -a or -A is set).
When 'git-repack' repacks promisor objects, only the raw oids
are sent to 'git-pack-objects'.
This gives 'git-pack-objects' no information about the original
pack order of those objects in the packfile so it must
rely on its default strategy of sorting the objects by type and
then by size over again. This can produce suboptimal packfiles
because the objects that were previously stored in the same
packfile can become separated.
Provide a hint to 'git-pack-objects' when sorting, by using the
packfile basename, and the offset of the object in the existing
packfile as fake paths when writing the oids to 'git-pack-objects'.
This will ensure they can be grouped by the type and existing pack
order which will make them end up close together in the sort, improving
delta compression.
Signed-off-by: Abraham Samuel Adekunle <abrahamadekunle50@gmail.com>
---
repack-promisor.c | 39 +++++++++++++++++++++++++++++----------
1 file changed, 29 insertions(+), 10 deletions(-)
diff --git a/repack-promisor.c b/repack-promisor.c
index 90318ce150..3f3034fb79 100644
--- a/repack-promisor.c
+++ b/repack-promisor.c
@@ -12,25 +12,51 @@ struct write_oid_context {
const struct git_hash_algo *algop;
};
+/**
+ * Build fake path for the objects to give pack-objects
+ * an ordering hint.
+ * For the packed objects: pack-basename/offset-padded
+ */
+
+static void build_ordering_hint(struct object_info *oi, struct strbuf *hint)
+{
+ struct packed_git *pack;
+ unsigned long offset;
+
+ if (oi->whence == OI_PACKED) {
+ pack = oi->u.packed.pack;
+ offset = oi->u.packed.offset;
+ strbuf_addf(hint, "%s/%05lu", pack_basename(pack), (unsigned long)offset);
+ } else
+ strbuf_addstr(hint, "loose");
+}
+
/*
* Write oid to the given struct child_process's stdin, starting it first if
* necessary.
*/
static int write_oid(const struct object_id *oid,
- struct object_info *oi UNUSED,
+ struct object_info *oi,
void *data)
{
struct write_oid_context *ctx = data;
struct child_process *cmd = ctx->cmd;
+ struct strbuf hint = STRBUF_INIT;
if (cmd->in == -1) {
if (start_command(cmd))
die(_("could not start pack-objects to repack promisor objects"));
}
+ build_ordering_hint(oi, &hint);
+
if (write_in_full(cmd->in, oid_to_hex(oid), ctx->algop->hexsz) < 0 ||
+ write_in_full(cmd->in, " ", 1) < 0 ||
+ write_in_full(cmd->in, hint.buf, hint.len) < 0 ||
write_in_full(cmd->in, "\n", 1) < 0)
die(_("failed to feed promisor objects to pack-objects"));
+
+ strbuf_release(&hint);
return 0;
}
@@ -85,20 +111,13 @@ void repack_promisor_objects(struct repository *repo,
{
struct write_oid_context ctx;
struct child_process cmd = CHILD_PROCESS_INIT;
+ struct object_info request = OBJECT_INFO_INIT;
prepare_pack_objects(&cmd, args, packtmp);
cmd.in = -1;
-
- /*
- * NEEDSWORK: Giving pack-objects only the OIDs without any ordering
- * hints may result in suboptimal deltas in the resulting pack. See if
- * the OIDs can be sent with fake paths such that pack-objects can use a
- * {type -> existing pack order} ordering when computing deltas instead
- * of a {type -> size} ordering, which may produce better deltas.
- */
ctx.cmd = &cmd;
ctx.algop = repo->hash_algo;
- odb_for_each_object(repo->objects, NULL, write_oid, &ctx,
+ odb_for_each_object(repo->objects, &request, write_oid, &ctx,
ODB_FOR_EACH_OBJECT_PROMISOR_ONLY);
if (cmd.in == -1) {
--
2.39.5 (Apple Git-154)
^ permalink raw reply related [flat|nested] 4+ messages in thread* Re: [PATCH] repack-promisor: add fake paths to oids when repacking promisor objects
2026-03-27 16:12 [PATCH] repack-promisor: add fake paths to oids when repacking promisor objects Abraham Samuel Adekunle
@ 2026-03-31 6:02 ` Patrick Steinhardt
2026-03-31 9:53 ` Samuel Abraham
2026-03-31 15:51 ` Junio C Hamano
0 siblings, 2 replies; 4+ messages in thread
From: Patrick Steinhardt @ 2026-03-31 6:02 UTC (permalink / raw)
To: Abraham Samuel Adekunle
Cc: git, Christian Couder, Karthik Nayak, Justin Tobler,
Siddharth Asthana, Ayush Chandekar, Lucas Seiki Oshiro,
Junio C Hamano, Phillip Wood
On Fri, Mar 27, 2026 at 05:12:43PM +0100, Abraham Samuel Adekunle wrote:
> This change addresses the NEEDSWORK comment added by commit
> 5d19e81 (repack: repack promisor objects if -a or -A is set).
>
> When 'git-repack' repacks promisor objects, only the raw oids
> are sent to 'git-pack-objects'.
> This gives 'git-pack-objects' no information about the original
> pack order of those objects in the packfile so it must
> rely on its default strategy of sorting the objects by type and
> then by size over again. This can produce suboptimal packfiles
> because the objects that were previously stored in the same
> packfile can become separated.
>
> Provide a hint to 'git-pack-objects' when sorting, by using the
> packfile basename, and the offset of the object in the existing
> packfile as fake paths when writing the oids to 'git-pack-objects'.
>
> This will ensure they can be grouped by the type and existing pack
> order which will make them end up close together in the sort, improving
> delta compression.
I think the general idea may be sound, but ideally we would have some
benchmarks that demonstrate it actually is. Like, can you come up with
scenarios where it will indeed improve the packfile size and show the
advantage of this change? Are there scenarios that are likely to have a
disadvantage because of this new ordering? Which of these scenarios do
we expect to be more likely?
Before answering these questions we basically just claim it's going to
be an improvement without actually verifying.
> diff --git a/repack-promisor.c b/repack-promisor.c
> index 90318ce150..3f3034fb79 100644
> --- a/repack-promisor.c
> +++ b/repack-promisor.c
> @@ -12,25 +12,51 @@ struct write_oid_context {
> const struct git_hash_algo *algop;
> };
>
> +/**
> + * Build fake path for the objects to give pack-objects
> + * an ordering hint.
> + * For the packed objects: pack-basename/offset-padded
> + */
> +
Nit: this empty line can be removed.
> +static void build_ordering_hint(struct object_info *oi, struct strbuf *hint)
> +{
> + struct packed_git *pack;
> + unsigned long offset;
> +
> + if (oi->whence == OI_PACKED) {
> + pack = oi->u.packed.pack;
> + offset = oi->u.packed.offset;
> + strbuf_addf(hint, "%s/%05lu", pack_basename(pack), (unsigned long)offset);
> + } else
> + strbuf_addstr(hint, "loose");
Nit: our coding guidelines say that once an if statement requires curly
braces in one branch, all branches should have them.
I also have to wonder whether it's going to be a benefit to also specify
a hint for loose objects, or whether we should rather not write any hint
at all for them.
> +}
> +
> /*
> * Write oid to the given struct child_process's stdin, starting it first if
> * necessary.
> */
> static int write_oid(const struct object_id *oid,
> - struct object_info *oi UNUSED,
> + struct object_info *oi,
> void *data)
> {
> struct write_oid_context *ctx = data;
> struct child_process *cmd = ctx->cmd;
> + struct strbuf hint = STRBUF_INIT;
>
> if (cmd->in == -1) {
> if (start_command(cmd))
> die(_("could not start pack-objects to repack promisor objects"));
> }
>
> + build_ordering_hint(oi, &hint);
> +
> if (write_in_full(cmd->in, oid_to_hex(oid), ctx->algop->hexsz) < 0 ||
> + write_in_full(cmd->in, " ", 1) < 0 ||
> + write_in_full(cmd->in, hint.buf, hint.len) < 0 ||
> write_in_full(cmd->in, "\n", 1) < 0)
> die(_("failed to feed promisor objects to pack-objects"));
This now translate into at least four write(3p) syscalls per object. Can
we maybe reuse a buffer so that we can reduce the number of syscalls?
Thanks!
Patrick
^ permalink raw reply [flat|nested] 4+ messages in thread* Re: [PATCH] repack-promisor: add fake paths to oids when repacking promisor objects
2026-03-31 6:02 ` Patrick Steinhardt
@ 2026-03-31 9:53 ` Samuel Abraham
2026-03-31 15:51 ` Junio C Hamano
1 sibling, 0 replies; 4+ messages in thread
From: Samuel Abraham @ 2026-03-31 9:53 UTC (permalink / raw)
To: Patrick Steinhardt
Cc: git, Christian Couder, Karthik Nayak, Justin Tobler,
Siddharth Asthana, Ayush Chandekar, Lucas Seiki Oshiro,
Junio C Hamano, Phillip Wood
On Tue, Mar 31, 2026 at 7:02 AM Patrick Steinhardt <ps@pks.im> wrote:
>
> On Fri, Mar 27, 2026 at 05:12:43PM +0100, Abraham Samuel Adekunle wrote:
> > This change addresses the NEEDSWORK comment added by commit
> > 5d19e81 (repack: repack promisor objects if -a or -A is set).
> >
> > When 'git-repack' repacks promisor objects, only the raw oids
> > are sent to 'git-pack-objects'.
> > This gives 'git-pack-objects' no information about the original
> > pack order of those objects in the packfile so it must
> > rely on its default strategy of sorting the objects by type and
> > then by size over again. This can produce suboptimal packfiles
> > because the objects that were previously stored in the same
> > packfile can become separated.
> >
> > Provide a hint to 'git-pack-objects' when sorting, by using the
> > packfile basename, and the offset of the object in the existing
> > packfile as fake paths when writing the oids to 'git-pack-objects'.
> >
> > This will ensure they can be grouped by the type and existing pack
> > order which will make them end up close together in the sort, improving
> > delta compression.
>
> I think the general idea may be sound, but ideally we would have some
> benchmarks that demonstrate it actually is. Like, can you come up with
> scenarios where it will indeed improve the packfile size and show the
> advantage of this change? Are there scenarios that are likely to have a
> disadvantage because of this new ordering? Which of these scenarios do
> we expect to be more likely?
Thank you Patrick for the review.
Okay I will look into this
>
> Before answering these questions we basically just claim it's going to
> be an improvement without actually verifying.
Yes I agree
>
> > diff --git a/repack-promisor.c b/repack-promisor.c
> > index 90318ce150..3f3034fb79 100644
> > --- a/repack-promisor.c
> > +++ b/repack-promisor.c
> > @@ -12,25 +12,51 @@ struct write_oid_context {
> > const struct git_hash_algo *algop;
> > };
> >
> > +/**
> > + * Build fake path for the objects to give pack-objects
> > + * an ordering hint.
> > + * For the packed objects: pack-basename/offset-padded
> > + */
> > +
>
> Nit: this empty line can be removed.
Okay
>
> > +static void build_ordering_hint(struct object_info *oi, struct strbuf *hint)
> > +{
> > + struct packed_git *pack;
> > + unsigned long offset;
> > +
> > + if (oi->whence == OI_PACKED) {
> > + pack = oi->u.packed.pack;
> > + offset = oi->u.packed.offset;
> > + strbuf_addf(hint, "%s/%05lu", pack_basename(pack), (unsigned long)offset);
> > + } else
> > + strbuf_addstr(hint, "loose");
>
> Nit: our coding guidelines say that once an if statement requires curly
> braces in one branch, all branches should have them.
Okay thank you
>
> I also have to wonder whether it's going to be a benefit to also specify
> a hint for loose objects, or whether we should rather not write any hint
> at all for them.
Okay, I think not writing any hints makes sense.
Thanks
>
> > +}
> > +
> > /*
> > * Write oid to the given struct child_process's stdin, starting it first if
> > * necessary.
> > */
> > static int write_oid(const struct object_id *oid,
> > - struct object_info *oi UNUSED,
> > + struct object_info *oi,
> > void *data)
> > {
> > struct write_oid_context *ctx = data;
> > struct child_process *cmd = ctx->cmd;
> > + struct strbuf hint = STRBUF_INIT;
> >
> > if (cmd->in == -1) {
> > if (start_command(cmd))
> > die(_("could not start pack-objects to repack promisor objects"));
> > }
> >
> > + build_ordering_hint(oi, &hint);
> > +
> > if (write_in_full(cmd->in, oid_to_hex(oid), ctx->algop->hexsz) < 0 ||
> > + write_in_full(cmd->in, " ", 1) < 0 ||
> > + write_in_full(cmd->in, hint.buf, hint.len) < 0 ||
> > write_in_full(cmd->in, "\n", 1) < 0)
> > die(_("failed to feed promisor objects to pack-objects"));
>
> This now translate into at least four write(3p) syscalls per object. Can
> we maybe reuse a buffer so that we can reduce the number of syscalls?
Okay noted
Thanks
Abraham
^ permalink raw reply [flat|nested] 4+ messages in thread* Re: [PATCH] repack-promisor: add fake paths to oids when repacking promisor objects
2026-03-31 6:02 ` Patrick Steinhardt
2026-03-31 9:53 ` Samuel Abraham
@ 2026-03-31 15:51 ` Junio C Hamano
1 sibling, 0 replies; 4+ messages in thread
From: Junio C Hamano @ 2026-03-31 15:51 UTC (permalink / raw)
To: Patrick Steinhardt
Cc: Abraham Samuel Adekunle, git, Christian Couder, Karthik Nayak,
Justin Tobler, Siddharth Asthana, Ayush Chandekar,
Lucas Seiki Oshiro, Phillip Wood
Patrick Steinhardt <ps@pks.im> writes:
>> This will ensure they can be grouped by the type and existing pack
>> order which will make them end up close together in the sort, improving
>> delta compression.
>
> I think the general idea may be sound, but ideally we would have some
> benchmarks that demonstrate it actually is. Like, can you come up with
> scenarios where it will indeed improve the packfile size and show the
> advantage of this change? Are there scenarios that are likely to have a
> disadvantage because of this new ordering? Which of these scenarios do
> we expect to be more likely?
>
> Before answering these questions we basically just claim it's going to
> be an improvement without actually verifying.
While it is a very good point that a change that claims to improve
performance must come with verifyable data, because the packfile
size alone is not what you want to optimize for in the first place,
it is quite hard to come up with a useful benchmark in this area.
Back when I was working on packfile generation, we needed to
optimize for two things (luckily they are not competing goals). One
is to choose a good delta-base, which will contribute to an overall
pack size that is smaller. The ordering of the objects in a pack,
on the other hand, does not directly contribute to the size, but has
impact on runtime performance, by keeping related things closer
together to reduce the need to "seek" in the pack stream.
Generally, two objects that appear next to each other in a well
optimized packstream are not expected to be similar with each other.
They are more likely to be two unrelated files that appear in the
same tree object (i.e., they do not delta with each other well, but
at runtime, they are often needed together). So it may even be
detrimental to use the offset in packfile as a clue to choose among
potential delta bases.
Do we have name-hash data for the original pack somewhere available
so that the repacker can take advantage of? If so, it may be more
relevant thing to reuse.
I agree with your other points in your review, too. Thanks for
helping this topic.
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2026-03-31 15:51 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-27 16:12 [PATCH] repack-promisor: add fake paths to oids when repacking promisor objects Abraham Samuel Adekunle
2026-03-31 6:02 ` Patrick Steinhardt
2026-03-31 9:53 ` Samuel Abraham
2026-03-31 15:51 ` Junio C Hamano
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox