Git development
 help / color / mirror / Atom feed
* Re: [PATCH] revision: use priority queue in limit_list()
From: Derrick Stolee @ 2026-05-14 19:57 UTC (permalink / raw)
  To: Kristofer Karlsson via GitGitGadget, git; +Cc: Kristofer Karlsson
In-Reply-To: <pull.2114.git.1778777491939.gitgitgadget@gmail.com>

On 5/14/2026 12:51 PM, Kristofer Karlsson via GitGitGadget wrote:
> From: Kristofer Karlsson <krka@spotify.com>
> 
> limit_list() maintains a date-sorted work queue of commits using a
> linked list with commit_list_insert_by_date() for insertion.  Each
> insertion walks the list to find the right position — O(n) per insert.
> In repositories with merge-heavy histories, the symmetric difference
> can contain thousands of commits, making this O(n) insertion the
> dominant cost.

Linear operations are bad, especially when multiplied by linear-ish
loops, causing quadratic behavior.

> Replace the sorted linked list with a prio_queue (binary heap).  This
> gives O(log n) insertion and O(log n) extraction instead of O(n)
> insertion and O(1) extraction, which is a net win when the queue is
> large.

Yes, much better.

> The still_interesting() and everybody_uninteresting() helpers are
> updated to scan the prio_queue's contiguous array instead of walking a
> linked list.  process_parents() already accepts both a commit_list and
> a prio_queue parameter, so the change in limit_list() simply switches
> which one is passed.
> 
> Benchmark: git rev-list --left-right --count HEAD~N...HEAD
> Repository: 2.3M commits, merge-heavy DAG (monorepo)
> Best of 5 runs, times in seconds:
> 
>   commits in
>   symmetric diff   baseline   patched    speedup
>   --------------   --------   -------    -------
>             10       0.01      0.01       1.0x
>             50       0.01      0.01       1.0x
>           3751      21.23      8.49       2.5x
>           4524      21.70      8.29       2.6x
>          10130      20.10      6.65       3.0x
> 
> No change for small traversals; 2.5-3.0x faster when the queue grows
> to thousands of commits.

This is good. Is there any chance that you could demonstrate this with
any commits in the Git repo? It does have some interesting behavior,
especially around point releases that are independent from the 'master'
branch and thus could have lopsided symmetric differences using well-
established tag names.

>     Switching to a prio_queue (binary heap) reduces insertion cost to O(log
>     w), bringing total cost to O(N·log w). The practical result on the same
>     repository:
>     
>     commits in
>     symmetric diff   before     after      speedup
>     --------------   --------   -------    -------
>             3751      21.2s      8.5s       2.5x
>             4524      21.7s      8.3s       2.6x
>            10130      20.1s      6.6s       3.0x

Very nice! I notice that this data is in your cover letter, but
not the commit message. Is that intentional?

>     This affects any command that triggers limit_list() — i.e., when
>     revs->limited is set — including --left-right, --cherry-mark,
>     --cherry-pick, --ancestry-path, bisect, and rebase's fork-point
>     computation. The practical trigger is git status --ahead-behind on a
>     branch that has diverged from upstream in a merge-heavy repository.

This also impacts 'git log --graph' when there is no serialized
commit-graph file. We are still using limit_list() in that case.

>     The change is minimal (+21/−17 lines, single file) because
>     process_parents() already accepts both a commit_list and a prio_queue
>     parameter — limit_list() just switches which one it passes.

The key logic is turning the initial list into the starting
points for the priority queue and everything else is about
moving types around, it seems.

> @@ -1451,6 +1447,7 @@ static int limit_list(struct rev_info *revs)
>  	struct commit_list *newlist = NULL;
>  	struct commit_list **p = &newlist;
>  	struct commit *interesting_cache = NULL;
> +	struct prio_queue queue = { .compare = compare_commits_by_commit_date };

Here, we are _not_ using generation numbers, which is correct
for this case because we are matching the date-based sorting
of the previous list.

>  	while (original_list) {
>  		struct commit *commit = pop_commit(&original_list);
> +		prio_queue_put(&queue, commit);
> +	}
> +
> +	while (queue.nr) {
> +		struct commit *commit = prio_queue_get(&queue);
>  		struct object *obj = &commit->object;
  
This is a fun reuse of lines to take the old "drain the
list as it is being mutated" loop and turn it into "fill
the priority queue" and "drain the priority queue as it
is being mutated"

This code change looks good. No new tests are needed, since
this is a performance-only change. Do any of the tests in
t/perf/ demonstrate this improvement?

Thanks,
-Stolee


^ permalink raw reply

* Re: [PATCH v4 0/3] Avoid hardcoded "good"/"bad" bisect terms
From: Junio C Hamano @ 2026-05-14 19:56 UTC (permalink / raw)
  To: Jonas Rebmann; +Cc: git, Chris Down, Jeff King, Phillip Wood
In-Reply-To: <20260514-bisect-terms-v4-0-b3e3cf1b06ce@schlaraffenlan.de>

Jonas Rebmann <kernel@schlaraffenlan.de> writes:

> While checking whether all output messages of git bisect were covered by
> [PATCH 1/3] bisect: use selected alternate terms in status output I
> found hardcoded good/bad refs leading to incompatibility of git
> rev-parse --bisect with alternate bisect run terms. This is addressed by
> [PATCH 3/3] rev-parse: use selected alternate terms to look up refs
>
> Signed-off-by: Jonas Rebmann <kernel@schlaraffenlan.de>
> ---
> Changes in v4:
> - Always print bisect terms in single quote (Thanks, Junio)
> - Split out quoting of bisect terms into separate commit
>> - Link to v3: https://patch.msgid.link/20260417-bisect-terms-v3-0-d659fa547261@schlaraffenlan.de

Having this "Link" is better than nothing, but it alone is not
sufficient to help those who mechanise patch consumption with b4.

  https://b4.docs.kernel.org/en/latest/index.html

Please make sure that your cover letter of the (i+1)th iteration
[v(i+1) 0/N] is a reply to the cover letter of the i-th iteration
[v(i) 0/M].  With that, anybody who has the i-th iteration can

 1. Check out the i-th iteration, e.g., 

    $ git checkout jr/bisect-custom-terms-in-output

 2. Peek at any of the commits on the topic branch with notes/amlog
    e.g.,

    $ git notes --ref=amlog show HEAD

 3. Check out the base, e.g.,

    $ git checkout --detach master...

    Note that the branch that holds the i-th iteration can now be
    accessed via @{-1} (i.e., the branch previously checked out).

 4. Tell B4 to fetch the latest round of the same series, by giving
    one/any of the message-ID we learned earlier, and apply them.

    $ b4 am -o- \
      "<20260417-bisect-terms-v3-2-d659fa547261@schlaraffenlan.de>" |
      git am -s

 5. See what changed with range-diff

    $ git range-diff @{-1}...

 6. When satisified, replace the topic with the new iteration.

    $ git checkout -B @{-1}

It is crucial that b4 can find the newer iteration when given a
message-ID from the older iteration in step 4 for this workflow to
work, and for that, [v4 0/3] must be a reply to [v3 0/2].  Otherwise
b4 will say that v3 is still the last iteration and we cannot make
progress.

> Changes in v3:
> - when referencing newly introduced terms, reference them in single
>   quotes (Thanks, Phillip)
> - Prefer test_grep over grep in updated Tests (Thanks, Phillip)
> - Improve commit messages (Thanks, Phillip)
> - Don't leak memory after read_bisect_terms() (Thanks, Phillip)
> - Don't leak memory after xstrfmt() (Thanks, Junio)
> - Add test case to patch 2/2
> - Link to v2: https://patch.msgid.link/20260323-bisect-terms-v2-0-8d6bdb2c9c7e@schlaraffenlan.de

With these improvements, v3 was already in a quite good shape, but
the latest makes it look even better.  Will replace.  Thanks.

^ permalink raw reply

* Re: [PATCH] revision: use priority queue in limit_list()
From: Junio C Hamano @ 2026-05-14 19:40 UTC (permalink / raw)
  To: Kristofer Karlsson via GitGitGadget; +Cc: git, Kristofer Karlsson
In-Reply-To: <pull.2114.git.1778777491939.gitgitgadget@gmail.com>

"Kristofer Karlsson via GitGitGadget" <gitgitgadget@gmail.com>
writes:

> Benchmark: git rev-list --left-right --count HEAD~N...HEAD
> Repository: 2.3M commits, merge-heavy DAG (monorepo)
> Best of 5 runs, times in seconds:
>
>   commits in
>   symmetric diff   baseline   patched    speedup
>   --------------   --------   -------    -------
>             10       0.01      0.01       1.0x
>             50       0.01      0.01       1.0x
>           3751      21.23      8.49       2.5x
>           4524      21.70      8.29       2.6x
>          10130      20.10      6.65       3.0x
>
> No change for small traversals; 2.5-3.0x faster when the queue grows
> to thousands of commits.

Impressive.

> Signed-off-by: Kristofer Karlsson <krka@spotify.com>
> ---
>     revision: use priority queue in limit_list()
> ...
>     This affects any command that triggers limit_list() — i.e., when
>     revs->limited is set — including --left-right, --cherry-mark,
>     --cherry-pick, --ancestry-path, bisect, and rebase's fork-point
>     computation. The practical trigger is git status --ahead-behind on a
>     branch that has diverged from upstream in a merge-heavy repository.

I found this description a bit curious.  Notably missing from the
above list of revs->limited users is a bog standard A..B and it is
unclear the omission is because that case is not improved and if so
why.

I think a major reason of the omission of A..B from the above is,
despite my recollection that such a range (i.e., any presense of
UNINTERESTING commit) computation _always_ worked on a limited list,
these days we conditionally do not when we have commit graph and we
are showing in --topo-order (which is implicitly enabled when many
options other than --topo-order is in effect) since 1b4d8827
(revision: use generation for A..B --topo-order queries, 2019-05-21).

It might be interesting to extend your benchmark over the same
history with the same command line, perhaps with and without an
explicit "--topo-order" added, in a repository _without_
commit-graph enabled.

Thanks.

^ permalink raw reply

* Re: [PATCH 0/3] daemon: fix network address handling bugs
From: Junio C Hamano @ 2026-05-14 19:20 UTC (permalink / raw)
  To: Sebastien Tardif via GitGitGadget; +Cc: git, Sebastien Tardif
In-Reply-To: <pull.2300.git.git.1778773592.gitgitgadget@gmail.com>

"Sebastien Tardif via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Fix three related issues in daemon.c's network address handling:

Thanks for separating patches so that each of them addresses one
specific issue.

It would have been better if you sent this series as [PATCH v2] as a
reply to <pull.2299.git.git.1778291290159.gitgitgadget@gmail.com>,
which is the previous round.  That way, the mailing list archive
will keep the related discussions together on the same page.  If we
visit the page for the cover letter I am responding to,

  https://lore.kernel.org/git/pull.2300.git.git.1778773592.gitgitgadget@gmail.com/

nobody can see that there was a previous iteration so those who
looked at the earlier effort cannot refer back to it and compare.

> IPv6 address corruption in lookup_hostname(): getaddrinfo() is called with
> AF_UNSPEC hints, so it may return IPv6 results. However, the code
> unconditionally casts ai_addr to sockaddr_in and passes AF_INET to
> inet_ntop(). On IPv6-only hosts, this reads from the wrong struct offset,
> producing garbage IP addresses. Fixed by checking ai_family and handling
> both AF_INET and AF_INET6.
>
> IPv6 address truncation in ip2str(): The sockaddr struct size (ai_addrlen)
> is passed as the output buffer size to inet_ntop(). For IPv6,
> sizeof(sockaddr_in6) is 28 bytes but INET6_ADDRSTRLEN is 46, so long IPv6
> addresses are silently truncated. Fixed by passing sizeof(ip) instead, and
> dropping the now-unused len parameter.
>
> NULL pointer in execute() logging: REMOTE_PORT environment variable is used
> in a format string without a NULL check (only REMOTE_ADDR was checked). If
> REMOTE_PORT is unset, NULL is passed to printf's %s, which is undefined
> behavior. Fixed by using a fallback string.
>
> Sebastien Tardif (3):
>   daemon: fix IPv6 address corruption in lookup_hostname()
>   daemon: fix IPv6 address truncation in ip2str()
>   daemon: guard NULL REMOTE_PORT in execute() logging
>
>  daemon.c | 31 +++++++++++++++++++++----------
>  1 file changed, 21 insertions(+), 10 deletions(-)
>
>
> base-commit: 59ff4886a579f4bc91e976fe18590b9ae02c7a08
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-2300%2FSebTardif%2Ffix%2Fdaemon-ipv6-and-null-port-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-2300/SebTardif/fix/daemon-ipv6-and-null-port-v1
> Pull-Request: https://github.com/git/git/pull/2300

^ permalink raw reply

* Re: [PATCH 2/2] use __builtin_add_overflow() in st_add() with Clang
From: Junio C Hamano @ 2026-05-14 19:12 UTC (permalink / raw)
  To: René Scharfe; +Cc: Git List, Jeff King
In-Reply-To: <0ded6062-f66a-4713-af24-d1b5aa654823@web.de>

René Scharfe <l.s.r@web.de> writes:

> Provide a variant of st_add() that wraps __builtin_add_overflow() to
> help Clang optimize it.  Use it on all platforms for simplicity.
> ...
> +/* Help Clang; GCC generates the same code for both variants. */
> +#if defined(__clang__)
> +static inline size_t st_add(size_t a, size_t b)
> +{
> +	size_t sum;
> +	if (__builtin_add_overflow(a, b, &sum))
> +		die("size_t overflow: %"PRIuMAX" + %"PRIuMAX,
> +		    (uintmax_t)a, (uintmax_t)b);
> +	return sum;
> +}
> +#else
>  static inline size_t st_add(size_t a, size_t b)
>  {
>  	if (unsigned_add_overflows(a, b))
> @@ -621,6 +632,7 @@ static inline size_t st_add(size_t a, size_t b)
>  		    (uintmax_t)a, (uintmax_t)b);
>  	return a + b;
>  }
> +#endif

Makes me wonder if we tweaked unsigned_add_overflows() to take an
extra *dst parameter to match __builtin_add_overflow(), which of
course requires us to all of 18 callsites, it might make the whole
thing a bit simpler.  New uses of unsigned_add_overflows(), if we
ever add them, would automatically benefit, right?

^ permalink raw reply

* Re: [PATCH 1/2] strbuf: use st_add3() in strbuf_grow()
From: Junio C Hamano @ 2026-05-14 19:07 UTC (permalink / raw)
  To: René Scharfe; +Cc: Git List
In-Reply-To: <c6e9b337-c4fc-4cbd-ac32-e8d3814749b0@web.de>

René Scharfe <l.s.r@web.de> writes:

> Simplify the code by calling st_add3() to do overflow checks instead of
> open-coding it.  This changes the error message to include the offending
> summands, which can be helpful when tracking down the cause.
>
> Signed-off-by: René Scharfe <l.s.r@web.de>
> ---
>  strbuf.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/strbuf.c b/strbuf.c
> index 3e04addc22..bb04d3910e 100644
> --- a/strbuf.c
> +++ b/strbuf.c
> @@ -106,12 +106,9 @@ void strbuf_attach(struct strbuf *sb, void *buf, size_t len, size_t alloc)
>  void strbuf_grow(struct strbuf *sb, size_t extra)
>  {
>  	int new_buf = !sb->alloc;
> -	if (unsigned_add_overflows(extra, 1) ||
> -	    unsigned_add_overflows(sb->len, extra + 1))
> -		die("you want to use way too much memory");
>  	if (new_buf)
>  		sb->buf = NULL;
> -	ALLOC_GROW(sb->buf, sb->len + extra + 1, sb->alloc);
> +	ALLOC_GROW(sb->buf, st_add3(sb->len, extra, 1), sb->alloc);

ALLOC_GROW() being a macro that references its second argument three
times, doesn't this rewrite rely on the compiler being clever enough
to notice that checking for the same overflow three times is
pointless and does it only once?  I guess the original has the same
issue already, so this may not be so bad but it makes me feel a bit
queasy.

>  	if (new_buf)
>  		sb->buf[0] = '\0';
>  }

^ permalink raw reply

* [PATCH] trailer: change strbuf in-place in unfold_value()
From: René Scharfe @ 2026-05-14 18:40 UTC (permalink / raw)
  To: Git List; +Cc: Jeff King

Avoid an allocation by doing s/\n\s*/ /g (replacing NL and any following
whitespace with a SP) right in the strbuf instead of copying the result
to a temporary one and swapping them in the end.  We can safely do that
because the replacement is never longer than the original string.

Signed-off-by: René Scharfe <l.s.r@web.de>
---
Formatted with --function-context for easier review.
Inspired by https://lore.kernel.org/git/20260513185408.GA147423@coredump.intra.peff.net/

 trailer.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/trailer.c b/trailer.c
index 470f86a4a2..b89fa12fe7 100644
--- a/trailer.c
+++ b/trailer.c
@@ -988,29 +988,25 @@ static int ends_with_blank_line(const char *buf, size_t len)
 
 static void unfold_value(struct strbuf *val)
 {
-	struct strbuf out = STRBUF_INIT;
 	size_t i;
+	size_t pos = 0;
 
-	strbuf_grow(&out, val->len);
 	i = 0;
 	while (i < val->len) {
 		char c = val->buf[i++];
 		if (c == '\n') {
 			/* Collapse continuation down to a single space. */
 			while (i < val->len && isspace(val->buf[i]))
 				i++;
-			strbuf_addch(&out, ' ');
-		} else {
-			strbuf_addch(&out, c);
+			val->buf[pos++] = ' ';
+		} else if (pos != i) {
+			val->buf[pos++] = c;
 		}
 	}
+	strbuf_setlen(val, pos);
 
 	/* Empty lines may have left us with whitespace cruft at the edges */
-	strbuf_trim(&out);
-
-	/* output goes back to val as if we modified it in-place */
-	strbuf_swap(&out, val);
-	strbuf_release(&out);
+	strbuf_trim(val);
 }
 
 static struct trailer_block *trailer_block_new(void)
-- 
2.54.0


^ permalink raw reply related

* [PATCH v4 7/7] odb/transaction: make `write_object_stream()` pluggable
From: Justin Tobler @ 2026-05-14 18:37 UTC (permalink / raw)
  To: git; +Cc: ps, gitster, peff, Justin Tobler
In-Reply-To: <20260514183740.1505171-1-jltobler@gmail.com>

How an ODB transaction handles writing objects is expected to vary
between implementations. Introduce a new `write_object_stream()`
callback in `struct odb_transaction` to make this function pluggable.
Rename `index_blob_packfile_transaction()` to
`odb_transaction_files_write_object_stream()` and wire it up for use
with `struct odb_transaction_files` accordingly.

Signed-off-by: Justin Tobler <jltobler@gmail.com>
---
 object-file.c     | 16 +++++++++-------
 odb/transaction.c |  7 +++++++
 odb/transaction.h | 25 ++++++++++++++++++++++---
 3 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/object-file.c b/object-file.c
index 0d492e6962..23f665df90 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1578,9 +1578,10 @@ static void flush_packfile_transaction(struct odb_transaction_files *transaction
  * binary blobs, they generally do not want to get any conversion, and
  * callers should avoid this code path when filters are requested.
  */
-static int index_blob_packfile_transaction(struct odb_transaction *base,
-					   struct odb_write_stream *stream,
-					   size_t size, struct object_id *result_oid)
+static int odb_transaction_files_write_object_stream(struct odb_transaction *base,
+						     struct odb_write_stream *stream,
+						     size_t size,
+						     struct object_id *result_oid)
 {
 	struct odb_transaction_files *transaction = container_of(base,
 								 struct odb_transaction_files,
@@ -1664,10 +1665,10 @@ int index_fd(struct index_state *istate, struct object_id *oid,
 			struct object_database *odb = the_repository->objects;
 			struct odb_transaction *transaction = odb_transaction_begin(odb);
 
-			ret = index_blob_packfile_transaction(odb->transaction,
-							      &stream,
-							      xsize_t(st->st_size),
-							      oid);
+			ret = odb_transaction_write_object_stream(odb->transaction,
+								  &stream,
+								  xsize_t(st->st_size),
+								  oid);
 			odb_transaction_commit(transaction);
 		} else {
 			ret = hash_blob_stream(&stream,
@@ -2132,6 +2133,7 @@ struct odb_transaction *odb_transaction_files_begin(struct odb_source *source)
 	transaction = xcalloc(1, sizeof(*transaction));
 	transaction->base.source = source;
 	transaction->base.commit = odb_transaction_files_commit;
+	transaction->base.write_object_stream = odb_transaction_files_write_object_stream;
 
 	return &transaction->base;
 }
diff --git a/odb/transaction.c b/odb/transaction.c
index 592ac84075..b16e07aebf 100644
--- a/odb/transaction.c
+++ b/odb/transaction.c
@@ -26,3 +26,10 @@ void odb_transaction_commit(struct odb_transaction *transaction)
 	transaction->source->odb->transaction = NULL;
 	free(transaction);
 }
+
+int odb_transaction_write_object_stream(struct odb_transaction *transaction,
+					struct odb_write_stream *stream,
+					size_t len, struct object_id *oid)
+{
+	return transaction->write_object_stream(transaction, stream, len, oid);
+}
diff --git a/odb/transaction.h b/odb/transaction.h
index a56e392f21..854fda06f5 100644
--- a/odb/transaction.h
+++ b/odb/transaction.h
@@ -12,14 +12,24 @@
  *
  * Each ODB source is expected to implement its own transaction handling.
  */
-struct odb_transaction;
-typedef void (*odb_transaction_commit_fn)(struct odb_transaction *transaction);
 struct odb_transaction {
 	/* The ODB source the transaction is opened against. */
 	struct odb_source *source;
 
 	/* The ODB source specific callback invoked to commit a transaction. */
-	odb_transaction_commit_fn commit;
+	void (*commit)(struct odb_transaction *transaction);
+
+	/*
+	 * This callback is expected to write the given object stream into
+	 * the ODB transaction. Note that for now, only blobs support streaming.
+	 *
+	 * The resulting object ID shall be written into the out pointer. The
+	 * callback is expected to return 0 on success, a negative error code
+	 * otherwise.
+	 */
+	int (*write_object_stream)(struct odb_transaction *transaction,
+				   struct odb_write_stream *stream, size_t len,
+				   struct object_id *oid);
 };
 
 /*
@@ -35,4 +45,13 @@ struct odb_transaction *odb_transaction_begin(struct object_database *odb);
  */
 void odb_transaction_commit(struct odb_transaction *transaction);
 
+/*
+ * Writes the object in the provided stream into the transaction. The resulting
+ * object ID is written into the out pointer. Returns 0 on success, a negative
+ * error code otherwise.
+ */
+int odb_transaction_write_object_stream(struct odb_transaction *transaction,
+					struct odb_write_stream *stream,
+					size_t len, struct object_id *oid);
+
 #endif
-- 
2.54.0.105.g59ff4886a5


^ permalink raw reply related

* [PATCH v4 6/7] object-file: generalize packfile writes to use odb_write_stream
From: Justin Tobler @ 2026-05-14 18:37 UTC (permalink / raw)
  To: git; +Cc: ps, gitster, peff, Justin Tobler
In-Reply-To: <20260514183740.1505171-1-jltobler@gmail.com>

The `index_blob_packfile_transaction()` function streams blob data
directly from an fd. This makes it difficult to reuse as part of a
generic transactional object writing interface.

Refactor the packfile write path to operate on a `struct
odb_write_stream`, allowing callers to supply data from arbitrary
sources.

Signed-off-by: Justin Tobler <jltobler@gmail.com>
---
 object-file.c | 56 +++++++++++++++++++++++++++------------------------
 1 file changed, 30 insertions(+), 26 deletions(-)

diff --git a/object-file.c b/object-file.c
index 6d7afdb723..0d492e6962 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1447,18 +1447,19 @@ static int hash_blob_stream(struct odb_write_stream *stream,
 }
 
 /*
- * Read the contents from fd for size bytes, streaming it to the
+ * Read the contents from the stream provided, streaming it to the
  * packfile in state while updating the hash in ctx.
  */
 static void stream_blob_to_pack(struct transaction_packfile *state,
-				struct git_hash_ctx *ctx, int fd, size_t size,
-				const char *path)
+				struct git_hash_ctx *ctx, size_t size,
+				struct odb_write_stream *stream)
 {
 	git_zstream s;
 	unsigned char ibuf[16384];
 	unsigned char obuf[16384];
 	unsigned hdrlen;
 	int status = Z_OK;
+	size_t bytes_read = 0;
 
 	git_deflate_init(&s, pack_compression_level);
 
@@ -1467,23 +1468,21 @@ static void stream_blob_to_pack(struct transaction_packfile *state,
 	s.avail_out = sizeof(obuf) - hdrlen;
 
 	while (status != Z_STREAM_END) {
-		if (size && !s.avail_in) {
-			size_t rsize = size < sizeof(ibuf) ? size : sizeof(ibuf);
-			ssize_t read_result = read_in_full(fd, ibuf, rsize);
-			if (read_result < 0)
-				die_errno("failed to read from '%s'", path);
-			if ((size_t)read_result != rsize)
-				die("failed to read %u bytes from '%s'",
-				    (unsigned)rsize, path);
+		if (!stream->is_finished && !s.avail_in) {
+			ssize_t rsize = odb_write_stream_read(stream, ibuf,
+							      sizeof(ibuf));
+
+			if (rsize < 0)
+				die("failed to read blob data");
 
 			git_hash_update(ctx, ibuf, rsize);
 
 			s.next_in = ibuf;
 			s.avail_in = rsize;
-			size -= rsize;
+			bytes_read += rsize;
 		}
 
-		status = git_deflate(&s, size ? 0 : Z_FINISH);
+		status = git_deflate(&s, stream->is_finished ? Z_FINISH : 0);
 
 		if (!s.avail_out || status == Z_STREAM_END) {
 			size_t written = s.next_out - obuf;
@@ -1503,6 +1502,11 @@ static void stream_blob_to_pack(struct transaction_packfile *state,
 			die("unexpected deflate failure: %d", status);
 		}
 	}
+
+	if (bytes_read != size)
+		die("read %" PRIuMAX " bytes of blob data, but expected %" PRIuMAX " bytes",
+		    (uintmax_t)bytes_read, (uintmax_t)size);
+
 	git_deflate_end(&s);
 }
 
@@ -1574,10 +1578,13 @@ static void flush_packfile_transaction(struct odb_transaction_files *transaction
  * binary blobs, they generally do not want to get any conversion, and
  * callers should avoid this code path when filters are requested.
  */
-static int index_blob_packfile_transaction(struct odb_transaction_files *transaction,
-					   struct object_id *result_oid, int fd,
-					   size_t size, const char *path)
+static int index_blob_packfile_transaction(struct odb_transaction *base,
+					   struct odb_write_stream *stream,
+					   size_t size, struct object_id *result_oid)
 {
+	struct odb_transaction_files *transaction = container_of(base,
+								 struct odb_transaction_files,
+								 base);
 	struct transaction_packfile *state = &transaction->packfile;
 	struct git_hash_ctx ctx;
 	unsigned char obuf[16384];
@@ -1611,7 +1618,7 @@ static int index_blob_packfile_transaction(struct odb_transaction_files *transac
 	hashfile_checkpoint(state->f, &checkpoint);
 	idx->offset = state->offset;
 	crc32_begin(state->f);
-	stream_blob_to_pack(state, &ctx, fd, size, path);
+	stream_blob_to_pack(state, &ctx, size, stream);
 	git_hash_final_oid(result_oid, &ctx);
 
 	idx->crc32 = crc32_end(state->f);
@@ -1655,15 +1662,12 @@ int index_fd(struct index_state *istate, struct object_id *oid,
 
 		if (flags & INDEX_WRITE_OBJECT) {
 			struct object_database *odb = the_repository->objects;
-			struct odb_transaction_files *files_transaction;
-			struct odb_transaction *transaction;
-
-			transaction = odb_transaction_begin(odb);
-			files_transaction = container_of(odb->transaction,
-							 struct odb_transaction_files,
-							 base);
-			ret = index_blob_packfile_transaction(files_transaction, oid, fd,
-						      xsize_t(st->st_size), path);
+			struct odb_transaction *transaction = odb_transaction_begin(odb);
+
+			ret = index_blob_packfile_transaction(odb->transaction,
+							      &stream,
+							      xsize_t(st->st_size),
+							      oid);
 			odb_transaction_commit(transaction);
 		} else {
 			ret = hash_blob_stream(&stream,
-- 
2.54.0.105.g59ff4886a5


^ permalink raw reply related

* [PATCH v4 5/7] object-file: avoid fd seekback by checking object size upfront
From: Justin Tobler @ 2026-05-14 18:37 UTC (permalink / raw)
  To: git; +Cc: ps, gitster, peff, Justin Tobler
In-Reply-To: <20260514183740.1505171-1-jltobler@gmail.com>

In certain scenarios, Git handles writing blobs that exceed
"core.bigFileThreshold" differently by streaming the object directly
into a packfile. When there is an active ODB transaction, these blobs
are streamed to the same packfile instead of using a separate packfile
for each. If "pack.packSizeLimit" is configured and streaming another
object causes the packfile to exceed the configured limit, the packfile
is truncated back to the previous object and the object write is
restarted in a new packfile.

This works fine, but requires the fd being read from to save a
checkpoint so it becomes possible to rewind the input source via seeking
back to a known offset at the beginning. In a subsequent commit, blob
streaming is converted to use `struct odb_write_stream` as a more
generic input source instead of an fd which doesn't provide a mechanism
for rewinding.

For this use case though, rewinding the fd is not strictly necessary
because the inflated size of the object is known and can be used to
approximate whether writing the object would cause the packfile to
exceed the configured limit prior to writing anything. These blobs
written to the packfile are never deltified thus the size difference
between what is written versus the inflated size is due to zlib
compression. While this does prevent packfiles from being filled to the
potential maximum is some cases, it should be good enough and still
prevents the packfile from exceeding any configured limit.

Use the inflated blob size to determine whether writing an object to a
packfile will exceed the configured "pack.packSizeLimit".

Signed-off-by: Justin Tobler <jltobler@gmail.com>
---
 object-file.c | 86 +++++++++++++++------------------------------------
 1 file changed, 25 insertions(+), 61 deletions(-)

diff --git a/object-file.c b/object-file.c
index a59030911f..6d7afdb723 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1448,29 +1448,17 @@ static int hash_blob_stream(struct odb_write_stream *stream,
 
 /*
  * Read the contents from fd for size bytes, streaming it to the
- * packfile in state while updating the hash in ctx. Signal a failure
- * by returning a negative value when the resulting pack would exceed
- * the pack size limit and this is not the first object in the pack,
- * so that the caller can discard what we wrote from the current pack
- * by truncating it and opening a new one. The caller will then call
- * us again after rewinding the input fd.
- *
- * The already_hashed_to pointer is kept untouched by the caller to
- * make sure we do not hash the same byte when we are called
- * again. This way, the caller does not have to checkpoint its hash
- * status before calling us just in case we ask it to call us again
- * with a new pack.
+ * packfile in state while updating the hash in ctx.
  */
-static int stream_blob_to_pack(struct transaction_packfile *state,
-			       struct git_hash_ctx *ctx, off_t *already_hashed_to,
-			       int fd, size_t size, const char *path)
+static void stream_blob_to_pack(struct transaction_packfile *state,
+				struct git_hash_ctx *ctx, int fd, size_t size,
+				const char *path)
 {
 	git_zstream s;
 	unsigned char ibuf[16384];
 	unsigned char obuf[16384];
 	unsigned hdrlen;
 	int status = Z_OK;
-	off_t offset = 0;
 
 	git_deflate_init(&s, pack_compression_level);
 
@@ -1487,15 +1475,9 @@ static int stream_blob_to_pack(struct transaction_packfile *state,
 			if ((size_t)read_result != rsize)
 				die("failed to read %u bytes from '%s'",
 				    (unsigned)rsize, path);
-			offset += rsize;
-			if (*already_hashed_to < offset) {
-				size_t hsize = offset - *already_hashed_to;
-				if (rsize < hsize)
-					hsize = rsize;
-				if (hsize)
-					git_hash_update(ctx, ibuf, hsize);
-				*already_hashed_to = offset;
-			}
+
+			git_hash_update(ctx, ibuf, rsize);
+
 			s.next_in = ibuf;
 			s.avail_in = rsize;
 			size -= rsize;
@@ -1506,14 +1488,6 @@ static int stream_blob_to_pack(struct transaction_packfile *state,
 		if (!s.avail_out || status == Z_STREAM_END) {
 			size_t written = s.next_out - obuf;
 
-			/* would we bust the size limit? */
-			if (state->nr_written &&
-			    pack_size_limit_cfg &&
-			    pack_size_limit_cfg < state->offset + written) {
-				git_deflate_abort(&s);
-				return -1;
-			}
-
 			hashwrite(state->f, obuf, written);
 			state->offset += written;
 			s.next_out = obuf;
@@ -1530,7 +1504,6 @@ static int stream_blob_to_pack(struct transaction_packfile *state,
 		}
 	}
 	git_deflate_end(&s);
-	return 0;
 }
 
 static void flush_packfile_transaction(struct odb_transaction_files *transaction)
@@ -1606,48 +1579,39 @@ static int index_blob_packfile_transaction(struct odb_transaction_files *transac
 					   size_t size, const char *path)
 {
 	struct transaction_packfile *state = &transaction->packfile;
-	off_t seekback, already_hashed_to;
 	struct git_hash_ctx ctx;
 	unsigned char obuf[16384];
 	unsigned header_len;
 	struct hashfile_checkpoint checkpoint;
 	struct pack_idx_entry *idx;
 
-	seekback = lseek(fd, 0, SEEK_CUR);
-	if (seekback == (off_t)-1)
-		return error("cannot find the current offset");
-
 	header_len = format_object_header((char *)obuf, sizeof(obuf),
 					  OBJ_BLOB, size);
 	transaction->base.source->odb->repo->hash_algo->init_fn(&ctx);
 	git_hash_update(&ctx, obuf, header_len);
 
+	/*
+	 * If writing another object to the packfile could result in it
+	 * exceeding the configured size limit, flush the current packfile
+	 * transaction.
+	 *
+	 * Note that this uses the inflated object size as an approximation.
+	 * Blob objects written in this manner are not delta-compressed, so
+	 * the difference between the inflated and on-disk size is limited
+	 * to zlib compression and is sufficient for this check.
+	 */
+	if (state->nr_written && pack_size_limit_cfg &&
+	    pack_size_limit_cfg < state->offset + size)
+		flush_packfile_transaction(transaction);
+
 	CALLOC_ARRAY(idx, 1);
 	prepare_packfile_transaction(transaction);
 	hashfile_checkpoint_init(state->f, &checkpoint);
 
-	already_hashed_to = 0;
-
-	while (1) {
-		prepare_packfile_transaction(transaction);
-		hashfile_checkpoint(state->f, &checkpoint);
-		idx->offset = state->offset;
-		crc32_begin(state->f);
-
-		if (!stream_blob_to_pack(state, &ctx, &already_hashed_to,
-					 fd, size, path))
-			break;
-		/*
-		 * Writing this object to the current pack will make
-		 * it too big; we need to truncate it, start a new
-		 * pack, and write into it.
-		 */
-		hashfile_truncate(state->f, &checkpoint);
-		state->offset = checkpoint.offset;
-		flush_packfile_transaction(transaction);
-		if (lseek(fd, seekback, SEEK_SET) == (off_t)-1)
-			return error("cannot seek back");
-	}
+	hashfile_checkpoint(state->f, &checkpoint);
+	idx->offset = state->offset;
+	crc32_begin(state->f);
+	stream_blob_to_pack(state, &ctx, fd, size, path);
 	git_hash_final_oid(result_oid, &ctx);
 
 	idx->crc32 = crc32_end(state->f);
-- 
2.54.0.105.g59ff4886a5


^ permalink raw reply related

* [PATCH v4 4/7] object-file: remove flags from transaction packfile writes
From: Justin Tobler @ 2026-05-14 18:37 UTC (permalink / raw)
  To: git; +Cc: ps, gitster, peff, Justin Tobler
In-Reply-To: <20260514183740.1505171-1-jltobler@gmail.com>

The `index_blob_packfile_transaction()` function handles streaming a
blob from an fd to compute its object ID and conditionally writes the
object directly to a packfile if the INDEX_WRITE_OBJECT flag is set. A
subsequent commit will make these packfile object writes part of the
transaction interface. Consequently, having the object write be
conditional on this flag is a bit awkward.

In preparation for this change, introduce a dedicated
`hash_blob_stream()` helper that only computes the OID from a `struct
odb_write_stream`. This is invoked by `index_fd()` instead when the
INDEX_WRITE_OBJECT is not set. The object write performed via
`index_blob_packfile_transaction()` is made unconditional accordingly.

Signed-off-by: Justin Tobler <jltobler@gmail.com>
---
 object-file.c   | 132 +++++++++++++++++++++++++++++-------------------
 odb/streaming.c |  46 +++++++++++++++++
 odb/streaming.h |  12 +++++
 3 files changed, 138 insertions(+), 52 deletions(-)

diff --git a/object-file.c b/object-file.c
index a1afca23c5..a59030911f 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1397,11 +1397,10 @@ static int already_written(struct odb_transaction_files *transaction,
 }
 
 /* Lazily create backing packfile for the state */
-static void prepare_packfile_transaction(struct odb_transaction_files *transaction,
-					 unsigned flags)
+static void prepare_packfile_transaction(struct odb_transaction_files *transaction)
 {
 	struct transaction_packfile *state = &transaction->packfile;
-	if (!(flags & INDEX_WRITE_OBJECT) || state->f)
+	if (state->f)
 		return;
 
 	state->f = create_tmp_packfile(transaction->base.source->odb->repo,
@@ -1414,6 +1413,39 @@ static void prepare_packfile_transaction(struct odb_transaction_files *transacti
 		die_errno("unable to write pack header");
 }
 
+static int hash_blob_stream(struct odb_write_stream *stream,
+			    const struct git_hash_algo *hash_algo,
+			    struct object_id *result_oid, size_t size)
+{
+	unsigned char buf[16384];
+	struct git_hash_ctx ctx;
+	unsigned header_len;
+	size_t bytes_hashed = 0;
+
+	header_len = format_object_header((char *)buf, sizeof(buf),
+					  OBJ_BLOB, size);
+	hash_algo->init_fn(&ctx);
+	git_hash_update(&ctx, buf, header_len);
+
+	while (!stream->is_finished) {
+		ssize_t read_result = odb_write_stream_read(stream, buf,
+							    sizeof(buf));
+
+		if (read_result < 0)
+			return -1;
+
+		git_hash_update(&ctx, buf, read_result);
+		bytes_hashed += read_result;
+	}
+
+	if (bytes_hashed != size)
+		return -1;
+
+	git_hash_final_oid(result_oid, &ctx);
+
+	return 0;
+}
+
 /*
  * Read the contents from fd for size bytes, streaming it to the
  * packfile in state while updating the hash in ctx. Signal a failure
@@ -1431,15 +1463,13 @@ static void prepare_packfile_transaction(struct odb_transaction_files *transacti
  */
 static int stream_blob_to_pack(struct transaction_packfile *state,
 			       struct git_hash_ctx *ctx, off_t *already_hashed_to,
-			       int fd, size_t size, const char *path,
-			       unsigned flags)
+			       int fd, size_t size, const char *path)
 {
 	git_zstream s;
 	unsigned char ibuf[16384];
 	unsigned char obuf[16384];
 	unsigned hdrlen;
 	int status = Z_OK;
-	int write_object = (flags & INDEX_WRITE_OBJECT);
 	off_t offset = 0;
 
 	git_deflate_init(&s, pack_compression_level);
@@ -1474,20 +1504,18 @@ static int stream_blob_to_pack(struct transaction_packfile *state,
 		status = git_deflate(&s, size ? 0 : Z_FINISH);
 
 		if (!s.avail_out || status == Z_STREAM_END) {
-			if (write_object) {
-				size_t written = s.next_out - obuf;
-
-				/* would we bust the size limit? */
-				if (state->nr_written &&
-				    pack_size_limit_cfg &&
-				    pack_size_limit_cfg < state->offset + written) {
-					git_deflate_abort(&s);
-					return -1;
-				}
-
-				hashwrite(state->f, obuf, written);
-				state->offset += written;
+			size_t written = s.next_out - obuf;
+
+			/* would we bust the size limit? */
+			if (state->nr_written &&
+			    pack_size_limit_cfg &&
+			    pack_size_limit_cfg < state->offset + written) {
+				git_deflate_abort(&s);
+				return -1;
 			}
+
+			hashwrite(state->f, obuf, written);
+			state->offset += written;
 			s.next_out = obuf;
 			s.avail_out = sizeof(obuf);
 		}
@@ -1575,8 +1603,7 @@ static void flush_packfile_transaction(struct odb_transaction_files *transaction
  */
 static int index_blob_packfile_transaction(struct odb_transaction_files *transaction,
 					   struct object_id *result_oid, int fd,
-					   size_t size, const char *path,
-					   unsigned flags)
+					   size_t size, const char *path)
 {
 	struct transaction_packfile *state = &transaction->packfile;
 	off_t seekback, already_hashed_to;
@@ -1584,7 +1611,7 @@ static int index_blob_packfile_transaction(struct odb_transaction_files *transac
 	unsigned char obuf[16384];
 	unsigned header_len;
 	struct hashfile_checkpoint checkpoint;
-	struct pack_idx_entry *idx = NULL;
+	struct pack_idx_entry *idx;
 
 	seekback = lseek(fd, 0, SEEK_CUR);
 	if (seekback == (off_t)-1)
@@ -1595,33 +1622,26 @@ static int index_blob_packfile_transaction(struct odb_transaction_files *transac
 	transaction->base.source->odb->repo->hash_algo->init_fn(&ctx);
 	git_hash_update(&ctx, obuf, header_len);
 
-	/* Note: idx is non-NULL when we are writing */
-	if ((flags & INDEX_WRITE_OBJECT) != 0) {
-		CALLOC_ARRAY(idx, 1);
-
-		prepare_packfile_transaction(transaction, flags);
-		hashfile_checkpoint_init(state->f, &checkpoint);
-	}
+	CALLOC_ARRAY(idx, 1);
+	prepare_packfile_transaction(transaction);
+	hashfile_checkpoint_init(state->f, &checkpoint);
 
 	already_hashed_to = 0;
 
 	while (1) {
-		prepare_packfile_transaction(transaction, flags);
-		if (idx) {
-			hashfile_checkpoint(state->f, &checkpoint);
-			idx->offset = state->offset;
-			crc32_begin(state->f);
-		}
+		prepare_packfile_transaction(transaction);
+		hashfile_checkpoint(state->f, &checkpoint);
+		idx->offset = state->offset;
+		crc32_begin(state->f);
+
 		if (!stream_blob_to_pack(state, &ctx, &already_hashed_to,
-					 fd, size, path, flags))
+					 fd, size, path))
 			break;
 		/*
 		 * Writing this object to the current pack will make
 		 * it too big; we need to truncate it, start a new
 		 * pack, and write into it.
 		 */
-		if (!idx)
-			BUG("should not happen");
 		hashfile_truncate(state->f, &checkpoint);
 		state->offset = checkpoint.offset;
 		flush_packfile_transaction(transaction);
@@ -1629,8 +1649,6 @@ static int index_blob_packfile_transaction(struct odb_transaction_files *transac
 			return error("cannot seek back");
 	}
 	git_hash_final_oid(result_oid, &ctx);
-	if (!idx)
-		return 0;
 
 	idx->crc32 = crc32_end(state->f);
 	if (already_written(transaction, result_oid)) {
@@ -1668,18 +1686,28 @@ int index_fd(struct index_state *istate, struct object_id *oid,
 		ret = index_core(istate, oid, fd, xsize_t(st->st_size),
 				 type, path, flags);
 	} else {
-		struct object_database *odb = the_repository->objects;
-		struct odb_transaction_files *files_transaction;
-		struct odb_transaction *transaction;
-
-		transaction = odb_transaction_begin(odb);
-		files_transaction = container_of(odb->transaction,
-						 struct odb_transaction_files,
-						 base);
-		ret = index_blob_packfile_transaction(files_transaction, oid, fd,
-						      xsize_t(st->st_size),
-						      path, flags);
-		odb_transaction_commit(transaction);
+		struct odb_write_stream stream;
+		odb_write_stream_from_fd(&stream, fd, xsize_t(st->st_size));
+
+		if (flags & INDEX_WRITE_OBJECT) {
+			struct object_database *odb = the_repository->objects;
+			struct odb_transaction_files *files_transaction;
+			struct odb_transaction *transaction;
+
+			transaction = odb_transaction_begin(odb);
+			files_transaction = container_of(odb->transaction,
+							 struct odb_transaction_files,
+							 base);
+			ret = index_blob_packfile_transaction(files_transaction, oid, fd,
+						      xsize_t(st->st_size), path);
+			odb_transaction_commit(transaction);
+		} else {
+			ret = hash_blob_stream(&stream,
+					       the_repository->hash_algo, oid,
+					       xsize_t(st->st_size));
+		}
+
+		odb_write_stream_release(&stream);
 	}
 
 	close(fd);
diff --git a/odb/streaming.c b/odb/streaming.c
index a68dd2cbe3..20531e864c 100644
--- a/odb/streaming.c
+++ b/odb/streaming.c
@@ -237,6 +237,11 @@ ssize_t odb_write_stream_read(struct odb_write_stream *st, void *buf, size_t sz)
 	return st->read(st, buf, sz);
 }
 
+void odb_write_stream_release(struct odb_write_stream *st)
+{
+	free(st->data);
+}
+
 int odb_stream_blob_to_fd(struct object_database *odb,
 			  int fd,
 			  const struct object_id *oid,
@@ -292,3 +297,44 @@ int odb_stream_blob_to_fd(struct object_database *odb,
 	odb_read_stream_close(st);
 	return result;
 }
+
+struct read_object_fd_data {
+	int fd;
+	size_t remaining;
+};
+
+static ssize_t read_object_fd(struct odb_write_stream *stream,
+			      unsigned char *buf, size_t len)
+{
+	struct read_object_fd_data *data = stream->data;
+	ssize_t read_result;
+	size_t count;
+
+	if (stream->is_finished)
+		return 0;
+
+	count = data->remaining < len ? data->remaining : len;
+	read_result = read_in_full(data->fd, buf, count);
+	if (read_result < 0 || (size_t)read_result != count)
+		return -1;
+
+	data->remaining -= count;
+	if (!data->remaining)
+		stream->is_finished = 1;
+
+	return read_result;
+}
+
+void odb_write_stream_from_fd(struct odb_write_stream *stream, int fd,
+			      size_t size)
+{
+	struct read_object_fd_data *data;
+
+	CALLOC_ARRAY(data, 1);
+	data->fd = fd;
+	data->remaining = size;
+
+	stream->data = data;
+	stream->read = read_object_fd;
+	stream->is_finished = 0;
+}
diff --git a/odb/streaming.h b/odb/streaming.h
index 65ced911fe..2a8cac19a4 100644
--- a/odb/streaming.h
+++ b/odb/streaming.h
@@ -5,6 +5,7 @@
 #define STREAMING_H 1
 
 #include "object.h"
+#include "odb.h"
 
 struct object_database;
 struct odb_read_stream;
@@ -65,6 +66,11 @@ struct odb_write_stream {
 ssize_t odb_write_stream_read(struct odb_write_stream *stream, void *buf,
 			      size_t len);
 
+/*
+ * Releases memory allocated for underlying stream data.
+ */
+void odb_write_stream_release(struct odb_write_stream *stream);
+
 /*
  * Look up the object by its ID and write the full contents to the file
  * descriptor. The object must be a blob, or the function will fail. When
@@ -82,4 +88,10 @@ int odb_stream_blob_to_fd(struct object_database *odb,
 			  struct stream_filter *filter,
 			  int can_seek);
 
+/*
+ * Sets up an ODB write stream that reads from an fd.
+ */
+void odb_write_stream_from_fd(struct odb_write_stream *stream, int fd,
+			      size_t size);
+
 #endif /* STREAMING_H */
-- 
2.54.0.105.g59ff4886a5


^ permalink raw reply related

* [PATCH v4 3/7] odb: update `struct odb_write_stream` read() callback
From: Justin Tobler @ 2026-05-14 18:37 UTC (permalink / raw)
  To: git; +Cc: ps, gitster, peff, Justin Tobler
In-Reply-To: <20260514183740.1505171-1-jltobler@gmail.com>

The `read()` callback used by `struct odb_write_stream` currently
returns a pointer to an internal buffer along with the number of bytes
read. This makes buffer ownership unclear and provides no way to report
errors.

Update the interface to instead require the caller to provide a buffer,
and have the callback return the number of bytes written to it or a
negative value on error. While at it, also move the `struct
odb_write_stream` definition to "odb/streaming.h". Call sites are
updated accordingly.

Signed-off-by: Justin Tobler <jltobler@gmail.com>
---
 builtin/unpack-objects.c | 20 ++++++++------------
 object-file.c            | 15 ++++++++++++---
 odb.h                    |  6 +-----
 odb/streaming.c          |  5 +++++
 odb/streaming.h          | 18 ++++++++++++++++++
 5 files changed, 44 insertions(+), 20 deletions(-)

diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
index bc9b1e047e..64e58e79fd 100644
--- a/builtin/unpack-objects.c
+++ b/builtin/unpack-objects.c
@@ -9,6 +9,7 @@
 #include "hex.h"
 #include "object-file.h"
 #include "odb.h"
+#include "odb/streaming.h"
 #include "odb/transaction.h"
 #include "object.h"
 #include "delta.h"
@@ -360,24 +361,21 @@ static void unpack_non_delta_entry(enum object_type type, unsigned long size,
 
 struct input_zstream_data {
 	git_zstream *zstream;
-	unsigned char buf[8192];
 	int status;
 };
 
-static const void *feed_input_zstream(struct odb_write_stream *in_stream,
-				      unsigned long *readlen)
+static ssize_t feed_input_zstream(struct odb_write_stream *in_stream,
+				  unsigned char *buf, size_t buf_len)
 {
 	struct input_zstream_data *data = in_stream->data;
 	git_zstream *zstream = data->zstream;
 	void *in = fill(1);
 
-	if (in_stream->is_finished) {
-		*readlen = 0;
-		return NULL;
-	}
+	if (in_stream->is_finished)
+		return 0;
 
-	zstream->next_out = data->buf;
-	zstream->avail_out = sizeof(data->buf);
+	zstream->next_out = buf;
+	zstream->avail_out = buf_len;
 	zstream->next_in = in;
 	zstream->avail_in = len;
 
@@ -385,9 +383,7 @@ static const void *feed_input_zstream(struct odb_write_stream *in_stream,
 
 	in_stream->is_finished = data->status != Z_OK;
 	use(len - zstream->avail_in);
-	*readlen = sizeof(data->buf) - zstream->avail_out;
-
-	return data->buf;
+	return buf_len - zstream->avail_out;
 }
 
 static void stream_blob(unsigned long size, unsigned nr)
diff --git a/object-file.c b/object-file.c
index bfbb632cf8..a1afca23c5 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1066,6 +1066,7 @@ int odb_source_loose_write_stream(struct odb_source *source,
 	struct git_hash_ctx c, compat_c;
 	struct strbuf tmp_file = STRBUF_INIT;
 	struct strbuf filename = STRBUF_INIT;
+	unsigned char buf[8192];
 	int dirlen;
 	char hdr[MAX_HEADER_LEN];
 	int hdrlen;
@@ -1098,9 +1099,17 @@ int odb_source_loose_write_stream(struct odb_source *source,
 		unsigned char *in0 = stream.next_in;
 
 		if (!stream.avail_in && !in_stream->is_finished) {
-			const void *in = in_stream->read(in_stream, &stream.avail_in);
-			stream.next_in = (void *)in;
-			in0 = (unsigned char *)in;
+			ssize_t read_len = odb_write_stream_read(in_stream, buf,
+								 sizeof(buf));
+			if (read_len < 0) {
+				close(fd);
+				err = -1;
+				goto cleanup;
+			}
+
+			stream.avail_in = read_len;
+			stream.next_in = buf;
+			in0 = buf;
 			/* All data has been read. */
 			if (in_stream->is_finished)
 				flush = 1;
diff --git a/odb.h b/odb.h
index ec5367b13e..6faeaa0589 100644
--- a/odb.h
+++ b/odb.h
@@ -529,11 +529,7 @@ static inline int odb_write_object(struct object_database *odb,
 	return odb_write_object_ext(odb, buf, len, type, oid, NULL, 0);
 }
 
-struct odb_write_stream {
-	const void *(*read)(struct odb_write_stream *, unsigned long *len);
-	void *data;
-	int is_finished;
-};
+struct odb_write_stream;
 
 int odb_write_object_stream(struct object_database *odb,
 			    struct odb_write_stream *stream, size_t len,
diff --git a/odb/streaming.c b/odb/streaming.c
index 5927a12954..a68dd2cbe3 100644
--- a/odb/streaming.c
+++ b/odb/streaming.c
@@ -232,6 +232,11 @@ struct odb_read_stream *odb_read_stream_open(struct object_database *odb,
 	return st;
 }
 
+ssize_t odb_write_stream_read(struct odb_write_stream *st, void *buf, size_t sz)
+{
+	return st->read(st, buf, sz);
+}
+
 int odb_stream_blob_to_fd(struct object_database *odb,
 			  int fd,
 			  const struct object_id *oid,
diff --git a/odb/streaming.h b/odb/streaming.h
index c7861f7e13..65ced911fe 100644
--- a/odb/streaming.h
+++ b/odb/streaming.h
@@ -47,6 +47,24 @@ int odb_read_stream_close(struct odb_read_stream *stream);
  */
 ssize_t odb_read_stream_read(struct odb_read_stream *stream, void *buf, size_t len);
 
+/*
+ * A stream that provides an object to be written to the object database without
+ * loading all of it into memory.
+ */
+struct odb_write_stream {
+	ssize_t (*read)(struct odb_write_stream *, unsigned char *, size_t);
+	void *data;
+	int is_finished;
+};
+
+/*
+ * Read data from the stream into the buffer. Returns 0 when finished and the
+ * number of bytes read on success. Returns a negative error code in case
+ * reading from the stream fails.
+ */
+ssize_t odb_write_stream_read(struct odb_write_stream *stream, void *buf,
+			      size_t len);
+
 /*
  * Look up the object by its ID and write the full contents to the file
  * descriptor. The object must be a blob, or the function will fail. When
-- 
2.54.0.105.g59ff4886a5


^ permalink raw reply related

* [PATCH v4 2/7] odb/transaction: use pluggable `begin_transaction()`
From: Justin Tobler @ 2026-05-14 18:37 UTC (permalink / raw)
  To: git; +Cc: ps, gitster, peff, Justin Tobler
In-Reply-To: <20260514183740.1505171-1-jltobler@gmail.com>

Each ODB source is expected to provide an ODB transaction implementation
that should be used when starting a transaction. With d6fc6fe6f8
(odb/source: make `begin_transaction()` function pluggable, 2026-03-05),
the `struct odb_source` now provides a pluggable callback for beginning
transactions. Use the callback provided by the ODB source accordingly.

Signed-off-by: Justin Tobler <jltobler@gmail.com>
---
 odb/transaction.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/odb/transaction.c b/odb/transaction.c
index 9bf3f347dc..592ac84075 100644
--- a/odb/transaction.c
+++ b/odb/transaction.c
@@ -1,5 +1,5 @@
 #include "git-compat-util.h"
-#include "object-file.h"
+#include "odb/source.h"
 #include "odb/transaction.h"
 
 struct odb_transaction *odb_transaction_begin(struct object_database *odb)
@@ -7,7 +7,7 @@ struct odb_transaction *odb_transaction_begin(struct object_database *odb)
 	if (odb->transaction)
 		return NULL;
 
-	odb->transaction = odb_transaction_files_begin(odb->sources);
+	odb_source_begin_transaction(odb->sources, &odb->transaction);
 
 	return odb->transaction;
 }
-- 
2.54.0.105.g59ff4886a5


^ permalink raw reply related

* [PATCH v4 1/7] odb: split `struct odb_transaction` into separate header
From: Justin Tobler @ 2026-05-14 18:37 UTC (permalink / raw)
  To: git; +Cc: ps, gitster, peff, Justin Tobler
In-Reply-To: <20260514183740.1505171-1-jltobler@gmail.com>

The current ODB transaction interface is colocated with other ODB
interfaces in "odb.{c,h}". Subsequent commits will expand `struct
odb_transaction` to support write operations on the transaction
directly. To keep things organized and prevent "odb.{c,h}" from becoming
more unwieldy, split out `struct odb_transaction` into a separate
header.

Signed-off-by: Justin Tobler <jltobler@gmail.com>
---
 Makefile                 |  1 +
 builtin/add.c            |  1 +
 builtin/unpack-objects.c |  1 +
 builtin/update-index.c   |  1 +
 cache-tree.c             |  1 +
 meson.build              |  1 +
 object-file.c            |  1 +
 odb.c                    | 25 -------------------------
 odb.h                    | 31 -------------------------------
 odb/transaction.c        | 28 ++++++++++++++++++++++++++++
 odb/transaction.h        | 38 ++++++++++++++++++++++++++++++++++++++
 read-cache.c             |  1 +
 12 files changed, 74 insertions(+), 56 deletions(-)
 create mode 100644 odb/transaction.c
 create mode 100644 odb/transaction.h

diff --git a/Makefile b/Makefile
index dbf0022054..6342db13e5 100644
--- a/Makefile
+++ b/Makefile
@@ -1219,6 +1219,7 @@ LIB_OBJS += odb.o
 LIB_OBJS += odb/source.o
 LIB_OBJS += odb/source-files.o
 LIB_OBJS += odb/streaming.o
+LIB_OBJS += odb/transaction.o
 LIB_OBJS += oid-array.o
 LIB_OBJS += oidmap.o
 LIB_OBJS += oidset.o
diff --git a/builtin/add.c b/builtin/add.c
index 7737ab878b..c859f66519 100644
--- a/builtin/add.c
+++ b/builtin/add.c
@@ -16,6 +16,7 @@
 #include "run-command.h"
 #include "object-file.h"
 #include "odb.h"
+#include "odb/transaction.h"
 #include "parse-options.h"
 #include "path.h"
 #include "preload-index.h"
diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
index 6fc64e9e4b..bc9b1e047e 100644
--- a/builtin/unpack-objects.c
+++ b/builtin/unpack-objects.c
@@ -9,6 +9,7 @@
 #include "hex.h"
 #include "object-file.h"
 #include "odb.h"
+#include "odb/transaction.h"
 #include "object.h"
 #include "delta.h"
 #include "pack.h"
diff --git a/builtin/update-index.c b/builtin/update-index.c
index 8a5907767b..bcc43852ef 100644
--- a/builtin/update-index.c
+++ b/builtin/update-index.c
@@ -19,6 +19,7 @@
 #include "tree-walk.h"
 #include "object-file.h"
 #include "odb.h"
+#include "odb/transaction.h"
 #include "refs.h"
 #include "resolve-undo.h"
 #include "parse-options.h"
diff --git a/cache-tree.c b/cache-tree.c
index 60bcc07c3b..f056869cfd 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -10,6 +10,7 @@
 #include "cache-tree.h"
 #include "object-file.h"
 #include "odb.h"
+#include "odb/transaction.h"
 #include "read-cache-ll.h"
 #include "replace-object.h"
 #include "repository.h"
diff --git a/meson.build b/meson.build
index 8309942d18..6dc23b3af2 100644
--- a/meson.build
+++ b/meson.build
@@ -405,6 +405,7 @@ libgit_sources = [
   'odb/source.c',
   'odb/source-files.c',
   'odb/streaming.c',
+  'odb/transaction.c',
   'oid-array.c',
   'oidmap.c',
   'oidset.c',
diff --git a/object-file.c b/object-file.c
index f0b029ff0b..bfbb632cf8 100644
--- a/object-file.c
+++ b/object-file.c
@@ -21,6 +21,7 @@
 #include "object-file.h"
 #include "odb.h"
 #include "odb/streaming.h"
+#include "odb/transaction.h"
 #include "oidtree.h"
 #include "pack.h"
 #include "packfile.h"
diff --git a/odb.c b/odb.c
index 350e23f3c0..8c3cbc1b53 100644
--- a/odb.c
+++ b/odb.c
@@ -1069,28 +1069,3 @@ void odb_reprepare(struct object_database *o)
 
 	obj_read_unlock();
 }
-
-struct odb_transaction *odb_transaction_begin(struct object_database *odb)
-{
-	if (odb->transaction)
-		return NULL;
-
-	odb->transaction = odb_transaction_files_begin(odb->sources);
-
-	return odb->transaction;
-}
-
-void odb_transaction_commit(struct odb_transaction *transaction)
-{
-	if (!transaction)
-		return;
-
-	/*
-	 * Ensure the transaction ending matches the pending transaction.
-	 */
-	ASSERT(transaction == transaction->source->odb->transaction);
-
-	transaction->commit(transaction);
-	transaction->source->odb->transaction = NULL;
-	free(transaction);
-}
diff --git a/odb.h b/odb.h
index 9aee260105..ec5367b13e 100644
--- a/odb.h
+++ b/odb.h
@@ -35,24 +35,6 @@ struct packed_git;
 struct packfile_store;
 struct cached_object_entry;
 
-/*
- * A transaction may be started for an object database prior to writing new
- * objects via odb_transaction_begin(). These objects are not committed until
- * odb_transaction_commit() is invoked. Only a single transaction may be pending
- * at a time.
- *
- * Each ODB source is expected to implement its own transaction handling.
- */
-struct odb_transaction;
-typedef void (*odb_transaction_commit_fn)(struct odb_transaction *transaction);
-struct odb_transaction {
-	/* The ODB source the transaction is opened against. */
-	struct odb_source *source;
-
-	/* The ODB source specific callback invoked to commit a transaction. */
-	odb_transaction_commit_fn commit;
-};
-
 /*
  * The object database encapsulates access to objects in a repository. It
  * manages one or more sources that store the actual objects which are
@@ -154,19 +136,6 @@ void odb_close(struct object_database *o);
  */
 void odb_reprepare(struct object_database *o);
 
-/*
- * Starts an ODB transaction. Subsequent objects are written to the transaction
- * and not committed until odb_transaction_commit() is invoked on the
- * transaction. If the ODB already has a pending transaction, NULL is returned.
- */
-struct odb_transaction *odb_transaction_begin(struct object_database *odb);
-
-/*
- * Commits an ODB transaction making the written objects visible. If the
- * specified transaction is NULL, the function is a no-op.
- */
-void odb_transaction_commit(struct odb_transaction *transaction);
-
 /*
  * Find source by its object directory path. Returns a `NULL` pointer in case
  * the source could not be found.
diff --git a/odb/transaction.c b/odb/transaction.c
new file mode 100644
index 0000000000..9bf3f347dc
--- /dev/null
+++ b/odb/transaction.c
@@ -0,0 +1,28 @@
+#include "git-compat-util.h"
+#include "object-file.h"
+#include "odb/transaction.h"
+
+struct odb_transaction *odb_transaction_begin(struct object_database *odb)
+{
+	if (odb->transaction)
+		return NULL;
+
+	odb->transaction = odb_transaction_files_begin(odb->sources);
+
+	return odb->transaction;
+}
+
+void odb_transaction_commit(struct odb_transaction *transaction)
+{
+	if (!transaction)
+		return;
+
+	/*
+	 * Ensure the transaction ending matches the pending transaction.
+	 */
+	ASSERT(transaction == transaction->source->odb->transaction);
+
+	transaction->commit(transaction);
+	transaction->source->odb->transaction = NULL;
+	free(transaction);
+}
diff --git a/odb/transaction.h b/odb/transaction.h
new file mode 100644
index 0000000000..a56e392f21
--- /dev/null
+++ b/odb/transaction.h
@@ -0,0 +1,38 @@
+#ifndef ODB_TRANSACTION_H
+#define ODB_TRANSACTION_H
+
+#include "odb.h"
+#include "odb/source.h"
+
+/*
+ * A transaction may be started for an object database prior to writing new
+ * objects via odb_transaction_begin(). These objects are not committed until
+ * odb_transaction_commit() is invoked. Only a single transaction may be pending
+ * at a time.
+ *
+ * Each ODB source is expected to implement its own transaction handling.
+ */
+struct odb_transaction;
+typedef void (*odb_transaction_commit_fn)(struct odb_transaction *transaction);
+struct odb_transaction {
+	/* The ODB source the transaction is opened against. */
+	struct odb_source *source;
+
+	/* The ODB source specific callback invoked to commit a transaction. */
+	odb_transaction_commit_fn commit;
+};
+
+/*
+ * Starts an ODB transaction. Subsequent objects are written to the transaction
+ * and not committed until odb_transaction_commit() is invoked on the
+ * transaction. If the ODB already has a pending transaction, NULL is returned.
+ */
+struct odb_transaction *odb_transaction_begin(struct object_database *odb);
+
+/*
+ * Commits an ODB transaction making the written objects visible. If the
+ * specified transaction is NULL, the function is a no-op.
+ */
+void odb_transaction_commit(struct odb_transaction *transaction);
+
+#endif
diff --git a/read-cache.c b/read-cache.c
index 5049f9baca..8147c7e94a 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -20,6 +20,7 @@
 #include "dir.h"
 #include "object-file.h"
 #include "odb.h"
+#include "odb/transaction.h"
 #include "oid-array.h"
 #include "tree.h"
 #include "commit.h"
-- 
2.54.0.105.g59ff4886a5


^ permalink raw reply related

* [PATCH v4 0/7] odb: add write operation to ODB transaction interface
From: Justin Tobler @ 2026-05-14 18:37 UTC (permalink / raw)
  To: git; +Cc: ps, gitster, peff, Justin Tobler
In-Reply-To: <20260402213220.2651523-1-jltobler@gmail.com>

Greetings,

This series lays the groundwork for introducing write operations to the
ODB transaction interface. The eventual goal is for all object writes
performed within a transaction to go through this interface explicitly,
rather than implicitly relying on the transaction to reconfigure ODB
sources so that writes are redirected to a temporary location.

For now, only `odb_transaction_write_object_stream()` is implemented and
wires up the existing logic for streaming "large" blobs directly into a
packfile as part of the transaction.

Most of the patches are structural refactorings to enable this, but
patch 4 introduces a behavioral change in how packfiles that would
exceed "pack.packSizeLimit" are handled.

Changes since V3:
- Fixed leak due to an fd not being closed when exiting prior to
  close_loose_object() being invoked.

Changes since V2:
- Renamed some variables to improve clarity
- Make `odb_write_stream_from_fd()` fully initialize the underlying
  `struct odb_write_stream`
- Move `struct odb_write_stream` to "odb/streaming.h"
- Make the `hash_blob_stream()` helper more generic by operating on a
  `struct odb_write_stream` instead of reading from an fd directly.
- Introduce an `odb_write_stream_release()` helper to free the
  underlying stream data.

Changes since V1:
- Fixed some typos
- Improved error handling
- Removed unnecessary guard statement
- Documented in comments why inflated object size is used to approximate
  if object write will exceed "pack.packSizeLimit".
- Updated `struct odb_write_stream` read() callback to support returning
  errors and using caller provided buffer
- Updated the `hash_blob_stream()` function signature to operate on a
  `struct odb_write_stream` instead of an fd directly
- Renamed some variables/functions for better clarity

Thanks,
-Justin

Justin Tobler (7):
  odb: split `struct odb_transaction` into separate header
  odb/transaction: use pluggable `begin_transaction()`
  odb: update `struct odb_write_stream` read() callback
  object-file: remove flags from transaction packfile writes
  object-file: avoid fd seekback by checking object size upfront
  object-file: generalize packfile writes to use odb_write_stream
  odb/transaction: make `write_object_stream()` pluggable

 Makefile                 |   1 +
 builtin/add.c            |   1 +
 builtin/unpack-objects.c |  21 ++--
 builtin/update-index.c   |   1 +
 cache-tree.c             |   1 +
 meson.build              |   1 +
 object-file.c            | 238 ++++++++++++++++++++-------------------
 odb.c                    |  25 ----
 odb.h                    |  37 +-----
 odb/streaming.c          |  51 +++++++++
 odb/streaming.h          |  30 +++++
 odb/transaction.c        |  35 ++++++
 odb/transaction.h        |  57 ++++++++++
 read-cache.c             |   1 +
 14 files changed, 312 insertions(+), 188 deletions(-)
 create mode 100644 odb/transaction.c
 create mode 100644 odb/transaction.h

Range-diff against v3:
1:  eee372b426 = 1:  eee372b426 odb: split `struct odb_transaction` into separate header
2:  57ac075560 = 2:  57ac075560 odb/transaction: use pluggable `begin_transaction()`
3:  11321ad607 ! 3:  d53ad95712 odb: update `struct odb_write_stream` read() callback
    @@ object-file.c: int odb_source_loose_write_stream(struct odb_source *source,
     +			ssize_t read_len = odb_write_stream_read(in_stream, buf,
     +								 sizeof(buf));
     +			if (read_len < 0) {
    ++				close(fd);
     +				err = -1;
     +				goto cleanup;
     +			}
4:  72d4656eee = 4:  fa7a3ad5dc object-file: remove flags from transaction packfile writes
5:  e4896101ff = 5:  1ca08e0590 object-file: avoid fd seekback by checking object size upfront
6:  b3cb0a707c = 6:  a548401057 object-file: generalize packfile writes to use odb_write_stream
7:  e1d292a7ed = 7:  4765b1024a odb/transaction: make `write_object_stream()` pluggable

base-commit: 5361983c075154725be47b65cca9a2421789e410
-- 
2.54.0.105.g59ff4886a5


^ permalink raw reply

* Re: [GSoC RFC PATCH 0/1] graph: add indentation for commits preceded by a root
From: Pablo Sabater @ 2026-05-14 17:45 UTC (permalink / raw)
  To: phillip.wood
  Cc: git, gitster, christian.couder, karthik.188, jltobler,
	ayu.chandekar, siddharthasthana31, chandrapratap3519
In-Reply-To: <26d887d2-6ec2-4af1-b0bd-8e9b017bb4dd@gmail.com>

El jue, 14 may 2026 a las 17:15, Phillip Wood
(<phillip.wood123@gmail.com>) escribió:
>
> Hi Pablo
>
> On 02/04/2026 22:17, Pablo Sabater wrote:
> > When having a history with multiple root commits and drawing the history
> > near the roots, the graphing engine renders the commit one below the other,
> > seeming that they are related, which makes the graph confusing.
> >
> > This issue was reported by Junio at:
> >    https://lore.kernel.org/git/xmqqikaawrpx.fsf@gitster.g/
> >
> > e.g.:
> >
> >    * root-B
> >    * child-A2
> >    * child-A1
> >    * root-A
> >
> > [...]
>  >
> >    * root-B
> >      * child-A2
> >     /
> >    * child-A1
> >    * root-A
>
> I'm rather late to the party here, but personally I find the indentation
> a bit confusing, it would be clearer to me if we had a blank line after
> a root commit

Hi,

>
>      * root-B
>
>      * child-A2
>      * child-A1
>      * root-A
>
> It takes the same amount of vertical space but keeps the children of
> root-A together.

I have mixed feelings about which approach to choose.
The idea of a blank line was thought at
https://lore.kernel.org/git/xmqq8s8vvw9m.fsf@gitster.c.googlers.com/
but Junio argued against it for having an extra row because the
indentation he proposed didn't collapse, however I find indentation +
no collapse the most confusing one.
I'd say that I'm fine with both approaches, blank line or indentation
+ collapse.

> > without the patch:
> >
> >    * A root
> >    * B root
> >    * C root
> >    * D1 child
> >    * D root
> >
> > with the patch, the indentation cascades:
> >
> >    * A root
> >      * B root
> >        * C root
> >          * D1 child
> >       _ /
> >      /
> >     /
> >    * D root

  * A root

  * B root

  * C root

  * D1 child

  * D root

Here I think a blank line looks worse, too much space for just 5
commits and becomes one extra line which if this were like up to 7 or
more parentless commits one after the other would be more noticeable.
But there are cases that blank line might be better:

  * 10_A2
  * 10_A1
  * 10_A
    *   10_M
   /|\
  | | * 10_D
  | * 10_C
  * 10_B

Feels like a shower of commits instead of an indented merge.

Pro to the blank line, the parentless check is the same and it's just
printing a '\n' at the right spot, while indent i'm mimicking like if
there was a commit there.
Anyways, I think in the majority of the cases the indentation +
collapsing looks better.
Sorry for the brief reply, I'm busy today.

Regards,

--
Pablo

>
> Thanks
>
> Phillip
>
> > This is done by adding a is_placeholder flag to the columns, the root commit
> > is actually there but marked as a placeholder
> >
> > e.g.:
> >
> >     * root-B
> >    (B) * child-A2
> >      /
> >     * child-A1
> >     * root-A
> >
> > (B) would be root-B column with the placeholder flag active.
> >
> > Then teaching the rendering function to print a padding ' ' when meeting a
> > placeholder column outputs the second example.
> >
> > There could also be the case where there are multiple roots
> >
> > without the patch:
> >
> >    * A root
> >    * B root
> >    * C root
> >    * D1 child
> >    * D root
> >
> > with the patch, the indentation cascades:
> >
> >    * A root
> >      * B root
> >        * C root
> >          * D1 child
> >       _ /
> >      /
> >     /
> >    * D root
> >
> > the _ / might look weird but that's how the collapsing rendering does it
> > for big gaps, this case being from the 4th column to the 0th column.
> > Another patch could change the collapsing rendering for placeholders ?
> > I haven't done it to keep it minimal, but a follow up could make it
> > to be straight '/'. This would make it bigger but easier for the eye to follow.
> > IMO is not worth it, but opinions are welcome.
> >
> > The patch also adds tests for different cases like a root preceding multiple
> > parents merges and the examples above.
> >
> > There could be some edge cases still so any testing is very welcome.
> >
> > Pablo Sabater (1):
> >    graph: add indentation for commits preceded by a root
> >
> >   graph.c                      |  68 ++++++++++++++++--
> >   t/t4215-log-skewed-merges.sh | 136 +++++++++++++++++++++++++++++++++++
> >   2 files changed, 198 insertions(+), 6 deletions(-)
> >
> >
> > base-commit: 256554692df0685b45e60778b08802b720880c50
>

^ permalink raw reply

* Re: What's cooking in git.git (May 2026, #03)
From: Pushkar Singh @ 2026-05-14 17:41 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <xmqqmry3i9a4.fsf@gitster.g>

Hi Junio,

My thinking was mainly that git stash show normally omits untracked
changes, while --include-untracked consults the additional untracked
parent of the stash commit.

I did not see existing coverage specifically checking that behavior, so
I thought a small test for it could be useful. But I understand your
point that not every observable behavior necessarily needs explicit
coverage, and I'll keep that in mind for future patches.

Thanks,
Pushkar

^ permalink raw reply

* [PATCH] revision: use priority queue in limit_list()
From: Kristofer Karlsson via GitGitGadget @ 2026-05-14 16:51 UTC (permalink / raw)
  To: git; +Cc: Kristofer Karlsson, Kristofer Karlsson

From: Kristofer Karlsson <krka@spotify.com>

limit_list() maintains a date-sorted work queue of commits using a
linked list with commit_list_insert_by_date() for insertion.  Each
insertion walks the list to find the right position — O(n) per insert.
In repositories with merge-heavy histories, the symmetric difference
can contain thousands of commits, making this O(n) insertion the
dominant cost.

Replace the sorted linked list with a prio_queue (binary heap).  This
gives O(log n) insertion and O(log n) extraction instead of O(n)
insertion and O(1) extraction, which is a net win when the queue is
large.

The still_interesting() and everybody_uninteresting() helpers are
updated to scan the prio_queue's contiguous array instead of walking a
linked list.  process_parents() already accepts both a commit_list and
a prio_queue parameter, so the change in limit_list() simply switches
which one is passed.

Benchmark: git rev-list --left-right --count HEAD~N...HEAD
Repository: 2.3M commits, merge-heavy DAG (monorepo)
Best of 5 runs, times in seconds:

  commits in
  symmetric diff   baseline   patched    speedup
  --------------   --------   -------    -------
            10       0.01      0.01       1.0x
            50       0.01      0.01       1.0x
          3751      21.23      8.49       2.5x
          4524      21.70      8.29       2.6x
         10130      20.10      6.65       3.0x

No change for small traversals; 2.5-3.0x faster when the queue grows
to thousands of commits.

Signed-off-by: Kristofer Karlsson <krka@spotify.com>
---
    revision: use priority queue in limit_list()
    
    This patch speeds up limit_list() by 2.5–3x on large, merge-heavy
    repositories by replacing a sorted linked list with a priority queue.
    
    The sorted linked list used as a work queue in limit_list() has O(n)
    insertion cost per commit, where n is the current queue length (the
    "width" of the active walk frontier). In merge-heavy DAGs this frontier
    grows wide — profiling on a 2.3M-commit monorepo showed 59% of total CPU
    time in commit_list_insert_by_date(). Total cost is O(N·w) where N is
    commits walked and w is peak queue width; in merge-heavy histories w
    scales with N, approaching O(N²).
    
    Switching to a prio_queue (binary heap) reduces insertion cost to O(log
    w), bringing total cost to O(N·log w). The practical result on the same
    repository:
    
    commits in
    symmetric diff   before     after      speedup
    --------------   --------   -------    -------
            3751      21.2s      8.5s       2.5x
            4524      21.7s      8.3s       2.6x
           10130      20.1s      6.6s       3.0x
    
    
    This affects any command that triggers limit_list() — i.e., when
    revs->limited is set — including --left-right, --cherry-mark,
    --cherry-pick, --ancestry-path, bisect, and rebase's fork-point
    computation. The practical trigger is git status --ahead-behind on a
    branch that has diverged from upstream in a merge-heavy repository.
    
    The change is minimal (+21/−17 lines, single file) because
    process_parents() already accepts both a commit_list and a prio_queue
    parameter — limit_list() just switches which one it passes.

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-2114%2Fspkrka%2Flimit-list-prio-queue-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-2114/spkrka/limit-list-prio-queue-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/2114

 revision.c | 38 +++++++++++++++++++++-----------------
 1 file changed, 21 insertions(+), 17 deletions(-)

diff --git a/revision.c b/revision.c
index 599b3a66c3..2b1b3bb10e 100644
--- a/revision.c
+++ b/revision.c
@@ -473,10 +473,10 @@ static struct commit *handle_commit(struct rev_info *revs,
 	die("%s is unknown object", name);
 }
 
-static int everybody_uninteresting(struct commit_list *orig,
+static int everybody_uninteresting(struct prio_queue *orig,
 				   struct commit **interesting_cache)
 {
-	struct commit_list *list = orig;
+	size_t i;
 
 	if (*interesting_cache) {
 		struct commit *commit = *interesting_cache;
@@ -484,9 +484,8 @@ static int everybody_uninteresting(struct commit_list *orig,
 			return 0;
 	}
 
-	while (list) {
-		struct commit *commit = list->item;
-		list = list->next;
+	for (i = 0; i < orig->nr; i++) {
+		struct commit *commit = orig->array[i].data;
 		if (commit->object.flags & UNINTERESTING)
 			continue;
 
@@ -1300,20 +1299,17 @@ static void cherry_pick_list(struct commit_list *list, struct rev_info *revs)
 /* How many extra uninteresting commits we want to see.. */
 #define SLOP 5
 
-static int still_interesting(struct commit_list *src, timestamp_t date, int slop,
+static int still_interesting(struct prio_queue *src, timestamp_t date, int slop,
 			     struct commit **interesting_cache)
 {
 	/*
-	 * No source list at all? We're definitely done..
+	 * Since src is sorted by date, it is enough to peek at the
+	 * first entry to compare dates.  No entry at all means done.
 	 */
-	if (!src)
+	struct commit *commit = prio_queue_peek(src);
+	if (!commit)
 		return 0;
-
-	/*
-	 * Does the destination list contain entries with a date
-	 * before the source list? Definitely _not_ done.
-	 */
-	if (date <= src->item->date)
+	if (date <= commit->date)
 		return SLOP;
 
 	/*
@@ -1451,6 +1447,7 @@ static int limit_list(struct rev_info *revs)
 	struct commit_list *newlist = NULL;
 	struct commit_list **p = &newlist;
 	struct commit *interesting_cache = NULL;
+	struct prio_queue queue = { .compare = compare_commits_by_commit_date };
 
 	if (revs->ancestry_path_implicit_bottoms) {
 		collect_bottom_commits(original_list,
@@ -1461,6 +1458,11 @@ static int limit_list(struct rev_info *revs)
 
 	while (original_list) {
 		struct commit *commit = pop_commit(&original_list);
+		prio_queue_put(&queue, commit);
+	}
+
+	while (queue.nr) {
+		struct commit *commit = prio_queue_get(&queue);
 		struct object *obj = &commit->object;
 
 		if (commit == interesting_cache)
@@ -1468,11 +1470,13 @@ static int limit_list(struct rev_info *revs)
 
 		if (revs->max_age != -1 && (commit->date < revs->max_age))
 			obj->flags |= UNINTERESTING;
-		if (process_parents(revs, commit, &original_list, NULL) < 0)
+		if (process_parents(revs, commit, NULL, &queue) < 0) {
+			clear_prio_queue(&queue);
 			return -1;
+		}
 		if (obj->flags & UNINTERESTING) {
 			mark_parents_uninteresting(revs, commit);
-			slop = still_interesting(original_list, date, slop, &interesting_cache);
+			slop = still_interesting(&queue, date, slop, &interesting_cache);
 			if (slop)
 				continue;
 			break;
@@ -1509,7 +1513,7 @@ static int limit_list(struct rev_info *revs)
 		}
 	}
 
-	commit_list_free(original_list);
+	clear_prio_queue(&queue);
 	revs->commits = newlist;
 	return 0;
 }

base-commit: 59ff4886a579f4bc91e976fe18590b9ae02c7a08
-- 
gitgitgadget

^ permalink raw reply related

* [PATCH v3 4/4] grep: prefetch necessary blobs
From: Elijah Newren via GitGitGadget @ 2026-05-14 16:25 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Phillip Wood, Derrick Stolee, Elijah Newren,
	Elijah Newren
In-Reply-To: <pull.2089.v3.git.1778775928.gitgitgadget@gmail.com>

From: Elijah Newren <newren@gmail.com>

In partial clones, `git grep` fetches necessary blobs on-demand one
at a time, which can be very slow.  In partial clones, add an extra
preliminary walk over the tree similar to grep_tree() which collects
the blobs of interest, and then prefetches them.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/grep.c  | 143 ++++++++++++++++++++++++++++++++++++++++++++++++
 t/t7810-grep.sh |  58 ++++++++++++++++++++
 2 files changed, 201 insertions(+)

diff --git a/builtin/grep.c b/builtin/grep.c
index e33285e5e6..85656d8d3f 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -28,9 +28,12 @@
 #include "object-file.h"
 #include "object-name.h"
 #include "odb.h"
+#include "oid-array.h"
+#include "oidset.h"
 #include "packfile.h"
 #include "pager.h"
 #include "path.h"
+#include "promisor-remote.h"
 #include "read-cache-ll.h"
 #include "write-or-die.h"
 
@@ -692,6 +695,144 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 	return hit;
 }
 
+static void collect_blob_oids_for_tree(struct repository *repo,
+				       const struct pathspec *pathspec,
+				       struct tree_desc *tree,
+				       struct strbuf *base,
+				       int tn_len,
+				       struct oidset *blob_oids)
+{
+	struct name_entry entry;
+	int old_baselen = base->len;
+	struct strbuf name = STRBUF_INIT;
+	enum interesting match = entry_not_interesting;
+
+	while (tree_entry(tree, &entry)) {
+		if (match != all_entries_interesting) {
+			strbuf_addstr(&name, base->buf + tn_len);
+			match = tree_entry_interesting(repo->index,
+						       &entry, &name,
+						       pathspec);
+			strbuf_reset(&name);
+
+			if (match == all_entries_not_interesting)
+				break;
+			if (match == entry_not_interesting)
+				continue;
+		}
+
+		strbuf_add(base, entry.path, tree_entry_len(&entry));
+
+		if (S_ISREG(entry.mode)) {
+			if (!odb_has_object(repo->objects, &entry.oid, 0))
+				oidset_insert(blob_oids, &entry.oid);
+		} else if (S_ISDIR(entry.mode)) {
+			enum object_type type;
+			struct tree_desc sub_tree;
+			void *data;
+			unsigned long size;
+
+			data = odb_read_object(repo->objects, &entry.oid,
+					       &type, &size);
+			if (!data)
+				die(_("unable to read tree (%s)"),
+				    oid_to_hex(&entry.oid));
+
+			strbuf_addch(base, '/');
+			init_tree_desc(&sub_tree, &entry.oid, data, size);
+			collect_blob_oids_for_tree(repo, pathspec, &sub_tree,
+						   base, tn_len, blob_oids);
+			free(data);
+		}
+		/*
+		 * ...no else clause for S_ISGITLINK: submodules have their
+		 * own promisor configuration and would need separate fetches
+		 * anyway.
+		 */
+
+		strbuf_setlen(base, old_baselen);
+	}
+
+	strbuf_release(&name);
+}
+
+static void collect_blob_oids_for_treeish(struct grep_opt *opt,
+					  const struct pathspec *pathspec,
+					  const struct object_id *tree_ish_oid,
+					  const char *name,
+					  struct oidset *blob_oids)
+{
+	struct tree_desc tree;
+	void *data;
+	unsigned long size;
+	struct strbuf base = STRBUF_INIT;
+	int len;
+
+	data = odb_read_object_peeled(opt->repo->objects, tree_ish_oid,
+				      OBJ_TREE, &size, NULL);
+
+	if (!data)
+		return;
+
+	len = name ? strlen(name) : 0;
+	if (len) {
+		strbuf_add(&base, name, len);
+		strbuf_addch(&base, ':');
+	}
+	init_tree_desc(&tree, tree_ish_oid, data, size);
+
+	collect_blob_oids_for_tree(opt->repo, pathspec, &tree,
+				   &base, base.len, blob_oids);
+
+	strbuf_release(&base);
+	free(data);
+}
+
+static void prefetch_grep_blobs(struct grep_opt *opt,
+				const struct pathspec *pathspec,
+				const struct object_array *list)
+{
+	struct oidset blob_oids = OIDSET_INIT;
+
+	/* Exit if we're not in a partial clone */
+	if (!repo_has_promisor_remote(opt->repo))
+		return;
+
+	/* For each tree, gather the blobs in it */
+	for (int i = 0; i < list->nr; i++) {
+		struct object *real_obj;
+
+		obj_read_lock();
+		real_obj = deref_tag(opt->repo, list->objects[i].item,
+				     NULL, 0);
+		obj_read_unlock();
+
+		if (real_obj &&
+		    (real_obj->type == OBJ_COMMIT ||
+		     real_obj->type == OBJ_TREE))
+			collect_blob_oids_for_treeish(opt, pathspec,
+						      &real_obj->oid,
+						      list->objects[i].name,
+						      &blob_oids);
+	}
+
+	/* Prefetch the blobs we found */
+	if (oidset_size(&blob_oids)) {
+		struct oid_array to_fetch = OID_ARRAY_INIT;
+		struct oidset_iter iter;
+		const struct object_id *oid;
+
+		oidset_iter_init(&blob_oids, &iter);
+		while ((oid = oidset_iter_next(&iter)))
+			oid_array_append(&to_fetch, oid);
+
+		promisor_remote_get_direct(opt->repo, to_fetch.oid, to_fetch.nr);
+
+		oid_array_clear(&to_fetch);
+	}
+	oidset_clear(&blob_oids);
+}
+
 static int grep_object(struct grep_opt *opt, const struct pathspec *pathspec,
 		       struct object *obj, const char *name, const char *path)
 {
@@ -732,6 +873,8 @@ static int grep_objects(struct grep_opt *opt, const struct pathspec *pathspec,
 	int hit = 0;
 	const unsigned int nr = list->nr;
 
+	prefetch_grep_blobs(opt, pathspec, list);
+
 	for (i = 0; i < nr; i++) {
 		struct object *real_obj;
 
diff --git a/t/t7810-grep.sh b/t/t7810-grep.sh
index 64ac4f04ee..3d08fd2a0c 100755
--- a/t/t7810-grep.sh
+++ b/t/t7810-grep.sh
@@ -1929,4 +1929,62 @@ test_expect_success 'grep does not report i-t-a and assume unchanged with -L' '
 	test_cmp expected actual
 '
 
+test_expect_success 'grep of revision in partial clone batches prefetch and honors pathspec' '
+	test_when_finished "rm -rf grep-partial-src grep-partial" &&
+
+	git init grep-partial-src &&
+	(
+		cd grep-partial-src &&
+		git config uploadpack.allowfilter 1 &&
+		git config uploadpack.allowanysha1inwant 1 &&
+		mkdir a b &&
+		echo "needle in haystack" >a/matches.txt &&
+		echo "nothing to see here" >a/nomatch.txt &&
+		echo "needle again" >b/matches.md &&
+		git add . &&
+		git commit -m "initial"
+	) &&
+
+	git clone --no-checkout --filter=blob:none \
+		"file://$(pwd)/grep-partial-src" grep-partial &&
+
+	# All three blobs are missing immediately after a blobless clone.
+	git -C grep-partial rev-list --quiet --objects \
+		--missing=print HEAD >missing &&
+	test_line_count = 3 missing &&
+
+	# A pathspec-limited grep should prefetch only the two blobs
+	# in a/.  It should fetch both blobs in one batched request.
+	GIT_TRACE2_EVENT="$(pwd)/grep-trace-pathspec" \
+		git -C grep-partial grep -c "needle" HEAD -- "a/*.txt" >result &&
+
+	# Only a/matches.txt contains "needle" among the matched paths.
+	test_line_count = 1 result &&
+
+	# Exactly the two a/*.txt blobs should have been requested, and
+	# the server packed those two objects in the response.
+	test_trace2_data promisor fetch_count 2 <grep-trace-pathspec &&
+	test_trace2_data pack-objects written 2 <grep-trace-pathspec &&
+
+	# b/matches.md should still be missing locally.
+	git -C grep-partial rev-list --quiet --objects \
+		--missing=print HEAD >missing &&
+	test_line_count = 1 missing &&
+
+	# A second grep without a pathspec must recurse into both
+	# subdirectories, but should request only the still-missing blob
+	# from the promisor.
+	GIT_TRACE2_EVENT="$(pwd)/grep-trace-all" \
+		git -C grep-partial grep -c "needle" HEAD >result &&
+
+	test_line_count = 2 result &&
+	test_trace2_data promisor fetch_count 1 <grep-trace-all &&
+	test_trace2_data pack-objects written 1 <grep-trace-all &&
+
+	# Everything is local now.
+	git -C grep-partial rev-list --quiet --objects \
+		--missing=print HEAD >missing &&
+	test_line_count = 0 missing
+'
+
 test_done
-- 
gitgitgadget

^ permalink raw reply related

* [PATCH v3 3/4] builtin/log: prefetch necessary blobs for `git cherry`
From: Elijah Newren via GitGitGadget @ 2026-05-14 16:25 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Phillip Wood, Derrick Stolee, Elijah Newren,
	Elijah Newren
In-Reply-To: <pull.2089.v3.git.1778775928.gitgitgadget@gmail.com>

From: Elijah Newren <newren@gmail.com>

In partial clones, `git cherry` fetches necessary blobs on-demand one
at a time, which can be very slow.  We would like to prefetch all
necessary blobs upfront.  To do so, we need to be able to first figure
out which blobs are needed.

`git cherry` does its work in a two-phase approach: first computing
header-only IDs (based on file paths and modes), then falling back to
full content-based IDs only when header-only IDs collide -- or, more
accurately, whenever the oidhash() of the header-only object_ids
collide.

patch-ids.c handles this by creating an ids->patches hashmap that has
all the data we need, but the problem is that any attempt to query the
hashmap will invoke the patch_id_neq() function on any colliding objects,
which causes the on-demand fetching.

Insert a new prefetch_cherry_blobs() function before checking for
collisions.  Use a temporary replacement on the ids->patches.cmpfn
in order to enumerate the blobs that would be needed without yet
fetching them, and then fetch them all at once, then restore the old
ids->patches.cmpfn.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/log.c     | 131 ++++++++++++++++++++++++++++++++++++++++++++++
 t/t3500-cherry.sh |  27 ++++++++++
 2 files changed, 158 insertions(+)

diff --git a/builtin/log.c b/builtin/log.c
index 8c0939dd42..e464b30af4 100644
--- a/builtin/log.c
+++ b/builtin/log.c
@@ -21,10 +21,12 @@
 #include "color.h"
 #include "commit.h"
 #include "diff.h"
+#include "diffcore.h"
 #include "diff-merges.h"
 #include "revision.h"
 #include "log-tree.h"
 #include "oid-array.h"
+#include "oidset.h"
 #include "tag.h"
 #include "reflog-walk.h"
 #include "patch-ids.h"
@@ -43,9 +45,11 @@
 #include "utf8.h"
 
 #include "commit-reach.h"
+#include "promisor-remote.h"
 #include "range-diff.h"
 #include "tmp-objdir.h"
 #include "tree.h"
+#include "userdiff.h"
 #include "write-or-die.h"
 
 #define MAIL_DEFAULT_WRAP 72
@@ -2602,6 +2606,131 @@ static void print_commit(char sign, struct commit *commit, int verbose,
 	}
 }
 
+/*
+ * Enumerate blob OIDs from a single commit's diff, inserting them into blobs.
+ * Skips files whose userdiff driver explicitly declares binary status
+ * (drv->binary > 0), since patch-ID uses oid_to_hex() for those and
+ * never reads blob content.  Use userdiff_find_by_path() since
+ * diff_filespec_load_driver() is static in diff.c.
+ *
+ * Clean up with diff_queue_clear() (from diffcore.h).
+ */
+static void collect_diff_blob_oids(struct commit *commit,
+				   struct diff_options *opts,
+				   struct oidset *blobs)
+{
+	struct diff_queue_struct *q;
+
+	/*
+	 * Merge commits are filtered out by patch_id_defined() in patch-ids.c,
+	 * so we'll never be called with one.
+	 */
+	assert(!commit->parents || !commit->parents->next);
+
+	if (commit->parents)
+		diff_tree_oid(&commit->parents->item->object.oid,
+			      &commit->object.oid, "", opts);
+	else
+		diff_root_tree_oid(&commit->object.oid, "", opts);
+	diffcore_std(opts);
+
+	q = &diff_queued_diff;
+	for (int i = 0; i < q->nr; i++) {
+		struct diff_filepair *p = q->queue[i];
+		struct userdiff_driver *drv;
+
+		/* Skip binary files */
+		drv = userdiff_find_by_path(opts->repo->index, p->one->path);
+		if (drv && drv->binary > 0)
+			continue;
+
+		if (DIFF_FILE_VALID(p->one) &&
+		    odb_read_object_info_extended(opts->repo->objects,
+						  &p->one->oid, NULL,
+						  OBJECT_INFO_FOR_PREFETCH))
+			oidset_insert(blobs, &p->one->oid);
+		if (DIFF_FILE_VALID(p->two) &&
+		    odb_read_object_info_extended(opts->repo->objects,
+						  &p->two->oid, NULL,
+						  OBJECT_INFO_FOR_PREFETCH))
+			oidset_insert(blobs, &p->two->oid);
+	}
+	diff_queue_clear(q);
+}
+
+static int always_match(const void *cmp_data UNUSED,
+			const struct hashmap_entry *entry1 UNUSED,
+			const struct hashmap_entry *entry2 UNUSED,
+			const void *keydata UNUSED)
+{
+	return 0;
+}
+
+/*
+ * Prefetch blobs for git cherry in partial clones.
+ *
+ * Called between the revision walk (which builds the head-side
+ * commit list) and the has_commit_patch_id() comparison loop.
+ *
+ * Uses a cmpfn-swap trick to avoid reading blobs: temporarily
+ * replaces the hashmap's comparison function with a trivial
+ * always-match function, so hashmap_get()/hashmap_get_next() match
+ * any entry with the same oidhash bucket.  These are the set of oids
+ * that would trigger patch_id_neq() during normal lookup and cause
+ * blobs to be read on demand, and we want to prefetch them all at
+ * once instead.
+ */
+static void prefetch_cherry_blobs(struct repository *repo,
+				  struct commit_list *list,
+				  struct patch_ids *ids)
+{
+	struct oidset blobs = OIDSET_INIT;
+	hashmap_cmp_fn original_cmpfn;
+
+	/* Exit if we're not in a partial clone */
+	if (!repo_has_promisor_remote(repo))
+		return;
+
+	/* Save original cmpfn, replace with always_match */
+	original_cmpfn = ids->patches.cmpfn;
+	ids->patches.cmpfn = always_match;
+
+	/* Find header-only collisions, gather blobs from those commits */
+	for (struct commit_list *l = list; l; l = l->next) {
+		struct commit *c = l->item;
+		bool match_found = false;
+		for (struct patch_id *cur = patch_id_iter_first(c, ids);
+		     cur;
+		     cur = patch_id_iter_next(cur, ids)) {
+			match_found = true;
+			collect_diff_blob_oids(cur->commit, &ids->diffopts,
+					       &blobs);
+		}
+		if (match_found)
+			collect_diff_blob_oids(c, &ids->diffopts, &blobs);
+	}
+
+	/* Restore original cmpfn */
+	ids->patches.cmpfn = original_cmpfn;
+
+	/* If we have any blobs to fetch, fetch them */
+	if (oidset_size(&blobs)) {
+		struct oid_array to_fetch = OID_ARRAY_INIT;
+		struct oidset_iter iter;
+		const struct object_id *oid;
+
+		oidset_iter_init(&blobs, &iter);
+		while ((oid = oidset_iter_next(&iter)))
+			oid_array_append(&to_fetch, oid);
+
+		promisor_remote_get_direct(repo, to_fetch.oid, to_fetch.nr);
+
+		oid_array_clear(&to_fetch);
+	}
+
+	oidset_clear(&blobs);
+}
+
 int cmd_cherry(int argc,
 	       const char **argv,
 	       const char *prefix,
@@ -2673,6 +2802,8 @@ int cmd_cherry(int argc,
 		commit_list_insert(commit, &list);
 	}
 
+	prefetch_cherry_blobs(the_repository, list, &ids);
+
 	for (struct commit_list *l = list; l; l = l->next) {
 		char sign = '+';
 
diff --git a/t/t3500-cherry.sh b/t/t3500-cherry.sh
index 78c3eac54b..3e66827d76 100755
--- a/t/t3500-cherry.sh
+++ b/t/t3500-cherry.sh
@@ -78,4 +78,31 @@ test_expect_success 'cherry ignores whitespace' '
 	test_cmp expect actual
 '
 
+# Reuse the expect file from the previous test, in a partial clone
+test_expect_success 'cherry in partial clone does bulk prefetch' '
+	test_config uploadpack.allowfilter 1 &&
+	test_config uploadpack.allowanysha1inwant 1 &&
+	test_when_finished "rm -rf copy" &&
+
+	git clone --bare --filter=blob:none file://"$(pwd)" copy &&
+	(
+		cd copy &&
+		GIT_TRACE2_EVENT="$(pwd)/trace.output" git cherry upstream-with-space feature-without-space >actual &&
+		test_cmp ../expect actual &&
+
+		grep "child_start.*fetch.negotiationAlgorithm" trace.output >fetches &&
+		test_line_count = 1 fetches &&
+		test_trace2_data promisor fetch_count 4 <trace.output &&
+
+		# A second invocation should not refetch any blobs, since
+		# the prefetch is expected to filter out OIDs that are
+		# already present locally.
+		GIT_TRACE2_EVENT="$(pwd)/trace2.output" git cherry upstream-with-space feature-without-space >actual &&
+		test_cmp ../expect actual &&
+
+		! grep "child_start.*fetch.negotiationAlgorithm" trace2.output &&
+		! grep "\"key\":\"fetch_count\"" trace2.output
+	)
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related

* [PATCH v3 2/4] patch-ids.h: add missing trailing parenthesis in documentation comment
From: Elijah Newren via GitGitGadget @ 2026-05-14 16:25 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Phillip Wood, Derrick Stolee, Elijah Newren,
	Elijah Newren
In-Reply-To: <pull.2089.v3.git.1778775928.gitgitgadget@gmail.com>

From: Elijah Newren <newren@gmail.com>

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 patch-ids.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/patch-ids.h b/patch-ids.h
index 490d739371..57534ee722 100644
--- a/patch-ids.h
+++ b/patch-ids.h
@@ -37,7 +37,7 @@ int has_commit_patch_id(struct commit *commit, struct patch_ids *);
  *   struct patch_id *cur;
  *   for (cur = patch_id_iter_first(commit, ids);
  *        cur;
- *        cur = patch_id_iter_next(cur, ids) {
+ *        cur = patch_id_iter_next(cur, ids)) {
  *           ... look at cur->commit
  *   }
  */
-- 
gitgitgadget


^ permalink raw reply related

* [PATCH v3 1/4] promisor-remote: document caller filtering contract
From: Elijah Newren via GitGitGadget @ 2026-05-14 16:25 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Phillip Wood, Derrick Stolee, Elijah Newren,
	Elijah Newren
In-Reply-To: <pull.2089.v3.git.1778775928.gitgitgadget@gmail.com>

From: Elijah Newren <newren@gmail.com>

promisor_remote_get_direct() does not, on its happy path, filter out
OIDs that are already present in the local object store: every OID
the caller supplies is written to the fetch subprocess's stdin and
ends up in the response pack.  The only filtering it performs is in
remove_fetched_oids(), and that only runs after a fetch failure when
falling back to a different configured promisor remote.

Almost every existing caller already filters locally-present OIDs out
itself (typically with odb_read_object_info_extended() and
OBJECT_INFO_FOR_PREFETCH, or odb_has_object() with no fetch flag).  But
the existing API comment does not state this expectation, so a new
caller is easy to write incorrectly (I missed this originally and wrote
two problematic callers).  Omitting the filter still "works" in the
sense that the desired objects end up local, but it silently makes the
fetch request -- and the response pack -- larger than necessary,
defeating part of the point of batching.

Spell the contract out so future callers know to filter (and
deduplicate) themselves, and point them at the helpers they should
use to check local presence without accidentally triggering a lazy
fetch.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 promisor-remote.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/promisor-remote.h b/promisor-remote.h
index 3d4d2de018..301f5ac5cb 100644
--- a/promisor-remote.h
+++ b/promisor-remote.h
@@ -29,6 +29,17 @@ int repo_has_promisor_remote(struct repository *r);
  * Fetches all requested objects from all promisor remotes, trying them one at
  * a time until all objects are fetched.
  *
+ * Callers are responsible for filtering out OIDs that are already present
+ * locally before calling this function: every supplied OID is sent in the
+ * fetch request, even if the object already exists in the local object
+ * store. (Only after a fetch failure does this function fall back to
+ * stripping already-present OIDs from the list before trying the next
+ * configured promisor remote.) Callers should also deduplicate the OIDs.
+ *
+ * To test for local presence without triggering a lazy fetch (which would
+ * defeat the purpose of batching), use odb_has_object(..., 0) or
+ * odb_read_object_info_extended() with OBJECT_INFO_FOR_PREFETCH.
+ *
  * If oid_nr is 0, this function returns immediately.
  */
 void promisor_remote_get_direct(struct repository *repo,
-- 
gitgitgadget


^ permalink raw reply related

* [PATCH v3 0/4] Batch prefetching
From: Elijah Newren via GitGitGadget @ 2026-05-14 16:25 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Phillip Wood, Derrick Stolee, Elijah Newren
In-Reply-To: <pull.2089.v2.git.1776472347.gitgitgadget@gmail.com>

Changes since v2:

 * Modified the final patch as suggested by Stolee to include pathspec usage
   in the testcase
 * Modified the last two patches to not re-download blobs we already have
   locally, and adjusted the tests to verify
 * Inserted a new first patch, containing a documentation addition that
   would have helped me avoid making the above mistake in the first place.

Note: Stolee also suggest some code sharing or code movement in his review
of v2 2/3, but possibly based on a misunderstanding of v2 2/3 (that patch
isn't about a diff) and it's not clear to me what could be shared or moved,
so that's not part of this round.

Changes since v1:

 * Remove stray file that should have never been added. So embarrassing that
   I didn't catch that before submitting.


Original cover-letter:
======================

Partial clones provide a trade-off for users: avoid downloading blobs
upfront, at the expense of needing to download them later as they run other
commands. This tradeoff can sometimes incur a more severe cost than
expected, particularly if needed blobs are discovered as they are accessed,
resulting in downloading blobs one at a time. Some commands like checkout,
diff, and merge do batch prefetches of necessary blobs, since that can
dramatically reduce the pain of on-demand loading. Extend this ability to
two more commands: cherry and grep.

This series was spurred by a report where git cherry jobs were each doing
hundreds of single-blob fetches, at a cost of 3s each. Batching those
downloads should dramatically speed up their jobs. (And I decided to fix up
git grep similarly while at it.)

I'll also note that git backfill with revisions and/or pathspecs could also
improve things for these users, but since backfill is a manual command users
would have to run and requires users to try to figure out which data is
needed (a challenge in the case of cherry), it still makes sense to provide
smarter behavior for folks who don't choose to manually run backfill.

Also, correct a documentation typo I noticed in patch-ids.h (related to code
I was using for the git cherry fixes) as a preparatory fixup.

Elijah Newren (4):
  promisor-remote: document caller filtering contract
  patch-ids.h: add missing trailing parenthesis in documentation comment
  builtin/log: prefetch necessary blobs for `git cherry`
  grep: prefetch necessary blobs

 builtin/grep.c    | 143 ++++++++++++++++++++++++++++++++++++++++++++++
 builtin/log.c     | 131 ++++++++++++++++++++++++++++++++++++++++++
 patch-ids.h       |   2 +-
 promisor-remote.h |  11 ++++
 t/t3500-cherry.sh |  27 +++++++++
 t/t7810-grep.sh   |  58 +++++++++++++++++++
 6 files changed, 371 insertions(+), 1 deletion(-)


base-commit: 9f223ef1c026d91c7ac68cc0211bde255dda6199
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-2089%2Fnewren%2Fbatch-prefetching-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-2089/newren/batch-prefetching-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/2089

Range-diff vs v2:

 -:  ---------- > 1:  6ad11e2c28 promisor-remote: document caller filtering contract
 1:  663816a344 = 2:  08a2c6517b patch-ids.h: add missing trailing parenthesis in documentation comment
 2:  a705852723 ! 3:  c0655e5d41 builtin/log: prefetch necessary blobs for `git cherry`
     @@ builtin/log.c: static void print_commit(char sign, struct commit *commit, int ve
      +		if (drv && drv->binary > 0)
      +			continue;
      +
     -+		if (DIFF_FILE_VALID(p->one))
     ++		if (DIFF_FILE_VALID(p->one) &&
     ++		    odb_read_object_info_extended(opts->repo->objects,
     ++						  &p->one->oid, NULL,
     ++						  OBJECT_INFO_FOR_PREFETCH))
      +			oidset_insert(blobs, &p->one->oid);
     -+		if (DIFF_FILE_VALID(p->two))
     ++		if (DIFF_FILE_VALID(p->two) &&
     ++		    odb_read_object_info_extended(opts->repo->objects,
     ++						  &p->two->oid, NULL,
     ++						  OBJECT_INFO_FOR_PREFETCH))
      +			oidset_insert(blobs, &p->two->oid);
      +	}
      +	diff_queue_clear(q);
     @@ t/t3500-cherry.sh: test_expect_success 'cherry ignores whitespace' '
      +
      +		grep "child_start.*fetch.negotiationAlgorithm" trace.output >fetches &&
      +		test_line_count = 1 fetches &&
     -+		test_trace2_data promisor fetch_count 4 <trace.output
     ++		test_trace2_data promisor fetch_count 4 <trace.output &&
     ++
     ++		# A second invocation should not refetch any blobs, since
     ++		# the prefetch is expected to filter out OIDs that are
     ++		# already present locally.
     ++		GIT_TRACE2_EVENT="$(pwd)/trace2.output" git cherry upstream-with-space feature-without-space >actual &&
     ++		test_cmp ../expect actual &&
     ++
     ++		! grep "child_start.*fetch.negotiationAlgorithm" trace2.output &&
     ++		! grep "\"key\":\"fetch_count\"" trace2.output
      +	)
      +'
      +
 3:  8fbfe69bc4 ! 4:  75d4ca7cff grep: prefetch necessary blobs
     @@ builtin/grep.c: static int grep_tree(struct grep_opt *opt, const struct pathspec
      +		strbuf_add(base, entry.path, tree_entry_len(&entry));
      +
      +		if (S_ISREG(entry.mode)) {
     -+			oidset_insert(blob_oids, &entry.oid);
     ++			if (!odb_has_object(repo->objects, &entry.oid, 0))
     ++				oidset_insert(blob_oids, &entry.oid);
      +		} else if (S_ISDIR(entry.mode)) {
      +			enum object_type type;
      +			struct tree_desc sub_tree;
     @@ t/t7810-grep.sh: test_expect_success 'grep does not report i-t-a and assume unch
       	test_cmp expected actual
       '
       
     -+test_expect_success 'grep of revision in partial clone does bulk prefetch' '
     ++test_expect_success 'grep of revision in partial clone batches prefetch and honors pathspec' '
      +	test_when_finished "rm -rf grep-partial-src grep-partial" &&
      +
      +	git init grep-partial-src &&
     @@ t/t7810-grep.sh: test_expect_success 'grep does not report i-t-a and assume unch
      +		cd grep-partial-src &&
      +		git config uploadpack.allowfilter 1 &&
      +		git config uploadpack.allowanysha1inwant 1 &&
     -+		echo "needle in haystack" >searchme &&
     -+		echo "no match here" >other &&
     -+		mkdir subdir &&
     -+		echo "needle again" >subdir/deep &&
     ++		mkdir a b &&
     ++		echo "needle in haystack" >a/matches.txt &&
     ++		echo "nothing to see here" >a/nomatch.txt &&
     ++		echo "needle again" >b/matches.md &&
      +		git add . &&
      +		git commit -m "initial"
      +	) &&
     @@ t/t7810-grep.sh: test_expect_success 'grep does not report i-t-a and assume unch
      +	git clone --no-checkout --filter=blob:none \
      +		"file://$(pwd)/grep-partial-src" grep-partial &&
      +
     -+	# All blobs should be missing after a blobless clone.
     ++	# All three blobs are missing immediately after a blobless clone.
      +	git -C grep-partial rev-list --quiet --objects \
      +		--missing=print HEAD >missing &&
      +	test_line_count = 3 missing &&
      +
     -+	# grep HEAD should batch-prefetch all blobs in one request.
     -+	GIT_TRACE2_EVENT="$(pwd)/grep-trace" \
     ++	# A pathspec-limited grep should prefetch only the two blobs
     ++	# in a/.  It should fetch both blobs in one batched request.
     ++	GIT_TRACE2_EVENT="$(pwd)/grep-trace-pathspec" \
     ++		git -C grep-partial grep -c "needle" HEAD -- "a/*.txt" >result &&
     ++
     ++	# Only a/matches.txt contains "needle" among the matched paths.
     ++	test_line_count = 1 result &&
     ++
     ++	# Exactly the two a/*.txt blobs should have been requested, and
     ++	# the server packed those two objects in the response.
     ++	test_trace2_data promisor fetch_count 2 <grep-trace-pathspec &&
     ++	test_trace2_data pack-objects written 2 <grep-trace-pathspec &&
     ++
     ++	# b/matches.md should still be missing locally.
     ++	git -C grep-partial rev-list --quiet --objects \
     ++		--missing=print HEAD >missing &&
     ++	test_line_count = 1 missing &&
     ++
     ++	# A second grep without a pathspec must recurse into both
     ++	# subdirectories, but should request only the still-missing blob
     ++	# from the promisor.
     ++	GIT_TRACE2_EVENT="$(pwd)/grep-trace-all" \
      +		git -C grep-partial grep -c "needle" HEAD >result &&
      +
     -+	# Should find matches in two files.
      +	test_line_count = 2 result &&
     ++	test_trace2_data promisor fetch_count 1 <grep-trace-all &&
     ++	test_trace2_data pack-objects written 1 <grep-trace-all &&
      +
     -+	# Should have prefetched all 3 objects at once
     -+	test_trace2_data promisor fetch_count 3 <grep-trace
     ++	# Everything is local now.
     ++	git -C grep-partial rev-list --quiet --objects \
     ++		--missing=print HEAD >missing &&
     ++	test_line_count = 0 missing
      +'
      +
       test_done

-- 
gitgitgadget

^ permalink raw reply

* Re: [PATCH v3 3/4] approxidate: make "specials" respect fixed day-of-month
From: Junio C Hamano @ 2026-05-14 16:06 UTC (permalink / raw)
  To: Tuomas Ahola; +Cc: git, Jeff King
In-Reply-To: <20260514115520.6660-4-taahol@utu.fi>

Tuomas Ahola <taahol@utu.fi> writes:

> The special approxidate time formats, "noon" and "tea" differ from
> "12pm" and "5pm" by having the feature of wrapping to the previous day
> if the current time is before those hours:
>
> 	now  -> 2026-05-13 11:00:00 +0000
>
> 	12pm -> 2026-05-13 12:00:00 +0000
> 	5pm  -> 2026-05-13 17:00:00 +0000
>
> 	noon -> 2026-05-12 12:00:00 +0000
> 	tea  -> 2026-05-12 17:00:00 +0000
>
> However, that logic carries too far.  Even when the date is specified,
> the behavior of the "specials" depends on the current time.  Assuming
> the same time as above, we get:
>
> 	today at noon -> 2026-05-12 12:00:00 +0000 (should be 13 May)
> 	13 May at tea -> 2026-05-12 17:00:00 +0000
>
> or, using an example mentioned in date-formats.adoc:
>
> 	last Friday at noon -> 2026-05-07 12:00:00 +0000 (should be 8 May)
>
> The quirk seems to be rather old.  Already in 2006, Linus Torvalds
> remarked that the date yielded by "one year ago yesterday at tea-time"
> was "just silly and not even correct".  Indeed, even today it gives:
>
> 	One year ago yesterday at tea-time -> 2025-05-11 17:00:00 +0000
> 	  (should be 12 May)
>
> Let's fix all of those with a simple patch.  Check whether we already
> have a specified day-of-month in `tm->tm_mday` and make `date_time()`
> stick to it.  Ensure the correct behavior with relevant tests.


I find this vastly easier to follow the reasoning, compared to the
previous iteration.  Very nicely done.



>
> Links:
>   1. https://lore.kernel.org/git/Pine.LNX.4.64.0610101102560.3952@g5.osdl.org/
>
> Signed-off-by: Tuomas Ahola <taahol@utu.fi>
> ---
>
> Notes:
>     > Again, this may be just me, but I happen to find the version of
>     > comment in Peff's review on the earlier iteration of this series
>     > much easier to understand.
>     >
>     
>     Thanks, applied.
>
>  date.c          | 6 +++++-
>  t/t0006-date.sh | 4 ++++
>  2 files changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/date.c b/date.c
> index 412aca6dc4..73879d202c 100644
> --- a/date.c
> +++ b/date.c
> @@ -1132,7 +1132,11 @@ static void date_yesterday(struct tm *tm, struct tm *now, int *num)
>  
>  static void date_time(struct tm *tm, struct tm *now, int hour)
>  {
> -	if (tm->tm_hour < hour)
> +	/*
> +	 * If we do not yet have a specified day, we'll use the most recent
> +	 * version of "hour" relative to now.  But that may be yesterday.
> +	 */
> +	if (tm->tm_mday < 0 && tm->tm_hour < hour)
>  		update_tm(tm, now, 24*60*60);
>  	tm->tm_hour = hour;
>  	tm->tm_min = 0;
> diff --git a/t/t0006-date.sh b/t/t0006-date.sh
> index c7667bade2..d800cb30fe 100755
> --- a/t/t0006-date.sh
> +++ b/t/t0006-date.sh
> @@ -208,8 +208,12 @@ check_approxidate '6pm yesterday' '2009-08-29 18:00:00'
>  check_approxidate '3:00' '2009-08-30 03:00:00'
>  check_approxidate '15:00' '2009-08-30 15:00:00'
>  check_approxidate 'noon today' '2009-08-30 12:00:00'
> +check_approxidate 'today at noon' '2009-08-30 12:00:00' '-12 hours'
>  check_approxidate 'noon yesterday' '2009-08-29 12:00:00'
> +check_approxidate 'last Friday at noon' '2009-08-28 12:00:00'
> +check_approxidate 'last Friday at noon' '2009-08-28 12:00:00' '-12 hours'
>  check_approxidate 'January 5th noon pm' '2009-01-05 12:00:00'
> +check_approxidate 'January 5th noon pm' '2009-01-05 12:00:00' '-12 hours'
>  check_approxidate '10am noon' '2009-08-29 12:00:00'
>  check_approxidate 'January 5th yesterday' '2009-01-29 19:20:00'
>  check_approxidate 'January 5th yesterday' '2008-12-31 19:20:00' '+2 days'

^ permalink raw reply

* [PATCH 3/3] daemon: guard NULL REMOTE_PORT in execute() logging
From: Sebastien Tardif via GitGitGadget @ 2026-05-14 15:46 UTC (permalink / raw)
  To: git; +Cc: Sebastien Tardif, Sebastien Tardif
In-Reply-To: <pull.2300.git.git.1778773592.gitgitgadget@gmail.com>

From: Sebastien Tardif <sebtardif@ncf.ca>

The REMOTE_PORT environment variable is used in a format string
without a NULL check, while REMOTE_ADDR is checked. If REMOTE_PORT
is unset, NULL is passed to printf's %s, which is undefined behavior.

Add a fallback string for the NULL case.

Signed-off-by: Sebastien Tardif <sebtardif@ncf.ca>
---
 daemon.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/daemon.c b/daemon.c
index 103c08d868..78cca8673f 100644
--- a/daemon.c
+++ b/daemon.c
@@ -753,7 +753,7 @@ static int execute(void)
 	struct strvec env = STRVEC_INIT;
 
 	if (addr)
-		loginfo("Connection from %s:%s", addr, port);
+		loginfo("Connection from %s:%s", addr, port ? port : "?");
 
 	set_keep_alive(0);
 	alarm(init_timeout ? init_timeout : timeout);
-- 
gitgitgadget

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox