public inbox for git@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] upload-pack: reduce lock contention when writing packfile data
@ 2026-02-27 11:22 Patrick Steinhardt
  2026-02-27 11:23 ` [PATCH 1/2] upload-pack: fix debug statement when flushing " Patrick Steinhardt
                   ` (4 more replies)
  0 siblings, 5 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-02-27 11:22 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley

Hi,

this small patch series fixes some heavy lock contention when writing
data from git-upload-pack(1) into pipes. This lock contention can be
observed when having hundreds of git-upload-pack(1) processes active at
the same time that write data into pipes at dozens of gigabits per
second.

I have uploaded the flame graph that clearly shows the lock contention
at [1].

Thanks!

Patrick

[1]: https://gitlab.com/gitlab-org/git/-/work_items/675

---
Patrick Steinhardt (2):
      upload-pack: fix debug statement when flushing packfile data
      upload-pack: reduce lock contention when writing packfile data

 upload-pack.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)


---
base-commit: 7b2bccb0d58d4f24705bf985de1f4612e4cf06e5
change-id: 20260227-pks-upload-pack-write-contention-435ce01f5fe9


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 1/2] upload-pack: fix debug statement when flushing packfile data
  2026-02-27 11:22 [PATCH 0/2] upload-pack: reduce lock contention when writing packfile data Patrick Steinhardt
@ 2026-02-27 11:23 ` Patrick Steinhardt
  2026-02-27 11:23 ` [PATCH 2/2] upload-pack: reduce lock contention when writing " Patrick Steinhardt
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-02-27 11:23 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley

When git-upload-pack(1) writes packfile data to the client we have some
logic in place that buffers some partial lines. When that buffer still
contains data after git-pack-objects(1) has finished we flush the buffer
so that all remaining bytes are sent out.

Curiously, when we do so we also print the string "flushed." to stderr.
This statement has been introduced in b1c71b7281 (upload-pack: avoid
sending an incomplete pack upon failure, 2006-06-20), so quite a while
ago. What's interesting though is that stderr is typically spliced
through to the client side, and consequently the client would see this
message. Munging the way we do the buffering indeed confirms this:

  $ git clone file:///home/pks/Development/linux/
  Cloning into bare repository 'linux.git'...
  remote: Enumerating objects: 12980346, done.
  remote: Counting objects: 100% (131820/131820), done.
  remote: Compressing objects: 100% (50290/50290), done.
  remote: Total 12980346 (delta 96319), reused 104500 (delta 81217), pack-reused 12848526 (from 1)
  Receiving objects: 100% (12980346/12980346), 3.23 GiB | 57.44 MiB/s, done.
  flushed.
  Resolving deltas: 100% (10676718/10676718), done.

It's quite clear that this string shouldn't ever be visible to the
client, so it rather feels like this is a left-over debug statement. The
mentioned commit doesn't explain this line, either.

Remove the debug output to prepare for a change in how we do the
buffering in the next commit.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 upload-pack.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/upload-pack.c b/upload-pack.c
index 2d2b70cbf2..c2643c0295 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -457,11 +457,9 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 	}
 
 	/* flush the data */
-	if (output_state->used > 0) {
+	if (output_state->used > 0)
 		send_client_data(1, output_state->buffer, output_state->used,
 				 pack_data->use_sideband);
-		fprintf(stderr, "flushed.\n");
-	}
 	free(output_state);
 	if (pack_data->use_sideband)
 		packet_flush(1);

-- 
2.53.0.536.g309c995771.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 2/2] upload-pack: reduce lock contention when writing packfile data
  2026-02-27 11:22 [PATCH 0/2] upload-pack: reduce lock contention when writing packfile data Patrick Steinhardt
  2026-02-27 11:23 ` [PATCH 1/2] upload-pack: fix debug statement when flushing " Patrick Steinhardt
@ 2026-02-27 11:23 ` Patrick Steinhardt
  2026-02-27 13:04   ` brian m. carlson
                     ` (2 more replies)
  2026-03-03 15:00 ` [PATCH v2 00/10] " Patrick Steinhardt
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-02-27 11:23 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley

In our production systems we have recently observed write contention in
git-upload-pack(1). The system in question was consistently streaming
packfiles at a rate of dozens of gigabits per second, but curiously the
system was not bottlenecked on CPU, memory, or IOPS.

We eventually discovered that Git was spending 80% of its time in
`pipe_write()`, out of which almost all of the time was spent in the
`ep_poll_callback` function in the kernel. Quoting the reporter:

  This infrastructure is part of an event notification queue designed to
  allow for multiple producers to emit events, but that concurrency
  safety is guarded by 3 layers of locking. The layer we're hitting
  contention in uses a simple reader/writer lock mode (a.k.a. shared
  versus exclusive mode), where producers need shared-mode (read mode),
  and various other actions use exclusive (write) mode.

The system in question generates workloads where we have hundreds of
git-upload-pack(1) processes active at the same point in time. These
processes end up contending around those locks, and the consequence is
that the Git processes stall.

Now git-upload-pack(1) already has the infrastructure in place to buffer
some of the data it reads from git-pack-objects(1) before actually
sending it out. We only use this infrastructure in very limited ways
though, so we generally end up matching one read(3p) call with one
write(3p) call. Even worse, when the sideband is enabled we end up
matching one read with _two_ writes: one for the pkt-line length, and
one for the packfile data.

Extend our use of the buffering infrastructure so that we soak up bytes
until the buffer is filled up at least 2/3rds of its capacity. The
change is relatively simple to implement as we already know to flush the
buffer in `create_pack_file()` after git-pack-objects(1) has finished.

This significantly reduces the number of write(3p) syscalls we need to
do. Before this change, cloning the Linux repository resulted in around
400,000 write(3p) syscalls. With the buffering in place we only do
around 130,000 syscalls.

Now we could of course go even further and make sure that we always fill
up the whole buffer. But this might cause an increase in read(3p)
syscalls, and some tests show that this only reduces the number of
write(3p) syscalls from 130,000 to 100,000. So overall this doesn't seem
worth it.

Helped-by: Matt Smiley <msmiley@gitlab.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 upload-pack.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/upload-pack.c b/upload-pack.c
index c2643c0295..f8ba245616 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -270,6 +270,13 @@ static int relay_pack_data(int pack_objects_out, struct output_state *os,
 		}
 	}
 
+	/*
+	 * Make sure that we buffer some data before sending it to the client.
+	 * This significantly reduces the number of write(3p) syscalls.
+	 */
+	if (readsz && os->used < (LARGE_PACKET_DATA_MAX * 2 / 3))
+		return readsz;
+
 	if (os->used > 1) {
 		send_client_data(1, os->buffer, os->used - 1, use_sideband);
 		os->buffer[0] = os->buffer[os->used - 1];

-- 
2.53.0.536.g309c995771.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/2] upload-pack: reduce lock contention when writing packfile data
  2026-02-27 11:23 ` [PATCH 2/2] upload-pack: reduce lock contention when writing " Patrick Steinhardt
@ 2026-02-27 13:04   ` brian m. carlson
  2026-02-27 18:14     ` Patrick Steinhardt
  2026-02-27 17:29   ` Junio C Hamano
  2026-02-27 19:37   ` Jeff King
  2 siblings, 1 reply; 64+ messages in thread
From: brian m. carlson @ 2026-02-27 13:04 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Matt Smiley

[-- Attachment #1: Type: text/plain, Size: 1192 bytes --]

On 2026-02-27 at 11:23:01, Patrick Steinhardt wrote:
> diff --git a/upload-pack.c b/upload-pack.c
> index c2643c0295..f8ba245616 100644
> --- a/upload-pack.c
> +++ b/upload-pack.c
> @@ -270,6 +270,13 @@ static int relay_pack_data(int pack_objects_out, struct output_state *os,
>  		}
>  	}
>  
> +	/*
> +	 * Make sure that we buffer some data before sending it to the client.
> +	 * This significantly reduces the number of write(3p) syscalls.
> +	 */
> +	if (readsz && os->used < (LARGE_PACKET_DATA_MAX * 2 / 3))
> +		return readsz;
> +
>  	if (os->used > 1) {
>  		send_client_data(1, os->buffer, os->used - 1, use_sideband);
>  		os->buffer[0] = os->buffer[os->used - 1];

This seems mostly reasonable and well-explained.  The one question I
have is this: how does this work when packfile generation is actually
very slow (or when the connection is slow) and we need to send data
every so often to keep the connection alive?

I just want to make sure we're not breaking the keepalive sideband case
when that's necessary, but of course I have no objections to improving
performance and reducing overhead.
-- 
brian m. carlson (they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/2] upload-pack: reduce lock contention when writing packfile data
  2026-02-27 11:23 ` [PATCH 2/2] upload-pack: reduce lock contention when writing " Patrick Steinhardt
  2026-02-27 13:04   ` brian m. carlson
@ 2026-02-27 17:29   ` Junio C Hamano
  2026-02-27 19:37   ` Jeff King
  2 siblings, 0 replies; 64+ messages in thread
From: Junio C Hamano @ 2026-02-27 17:29 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Matt Smiley

Patrick Steinhardt <ps@pks.im> writes:

> ...
> write(3p) call. Even worse, when the sideband is enabled we end up
> matching one read with _two_ writes: one for the pkt-line length, and
> one for the packfile data.
>
> Extend our use of the buffering infrastructure so that we soak up bytes
> until the buffer is filled up at least 2/3rds of its capacity. The
> change is relatively simple to implement as we already know to flush the
> buffer in `create_pack_file()` after git-pack-objects(1) has finished.
>
> This significantly reduces the number of write(3p) syscalls we need to
> do. Before this change, cloning the Linux repository resulted in around
> 400,000 write(3p) syscalls. With the buffering in place we only do
> around 130,000 syscalls.
>
> Now we could of course go even further and make sure that we always fill
> up the whole buffer. But this might cause an increase in read(3p)
> syscalls, and some tests show that this only reduces the number of
> write(3p) syscalls from 130,000 to 100,000. So overall this doesn't seem
> worth it.

Very well reasoned.  I love this size ratio between explanation and
code change ;-)

>
> Helped-by: Matt Smiley <msmiley@gitlab.com>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  upload-pack.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/upload-pack.c b/upload-pack.c
> index c2643c0295..f8ba245616 100644
> --- a/upload-pack.c
> +++ b/upload-pack.c
> @@ -270,6 +270,13 @@ static int relay_pack_data(int pack_objects_out, struct output_state *os,
>  		}
>  	}
>  
> +	/*
> +	 * Make sure that we buffer some data before sending it to the client.
> +	 * This significantly reduces the number of write(3p) syscalls.
> +	 */
> +	if (readsz && os->used < (LARGE_PACKET_DATA_MAX * 2 / 3))
> +		return readsz;
> +
>  	if (os->used > 1) {
>  		send_client_data(1, os->buffer, os->used - 1, use_sideband);
>  		os->buffer[0] = os->buffer[os->used - 1];

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/2] upload-pack: reduce lock contention when writing packfile data
  2026-02-27 13:04   ` brian m. carlson
@ 2026-02-27 18:14     ` Patrick Steinhardt
  0 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-02-27 18:14 UTC (permalink / raw)
  To: brian m. carlson, git, Matt Smiley

On Fri, Feb 27, 2026 at 01:04:05PM +0000, brian m. carlson wrote:
> On 2026-02-27 at 11:23:01, Patrick Steinhardt wrote:
> > diff --git a/upload-pack.c b/upload-pack.c
> > index c2643c0295..f8ba245616 100644
> > --- a/upload-pack.c
> > +++ b/upload-pack.c
> > @@ -270,6 +270,13 @@ static int relay_pack_data(int pack_objects_out, struct output_state *os,
> >  		}
> >  	}
> >  
> > +	/*
> > +	 * Make sure that we buffer some data before sending it to the client.
> > +	 * This significantly reduces the number of write(3p) syscalls.
> > +	 */
> > +	if (readsz && os->used < (LARGE_PACKET_DATA_MAX * 2 / 3))
> > +		return readsz;
> > +
> >  	if (os->used > 1) {
> >  		send_client_data(1, os->buffer, os->used - 1, use_sideband);
> >  		os->buffer[0] = os->buffer[os->used - 1];
> 
> This seems mostly reasonable and well-explained.  The one question I
> have is this: how does this work when packfile generation is actually
> very slow (or when the connection is slow) and we need to send data
> every so often to keep the connection alive?
> 
> I just want to make sure we're not breaking the keepalive sideband case
> when that's necessary, but of course I have no objections to improving
> performance and reducing overhead.

Right. `relay_pack_data()` is handling the logic to soak up data from
git-pack-objects(1). The outer loop is in `create_pack_file()`, and
there we already have a timeout configured. If data is produced too
slow, then we'd eventually land in there and send the keepalive packet.

I guess there is an interesting edge case here: if git-pack-objects(1)
produces bytes fast enough to not hit the 1 second timeout, but slowly
enough to basically never fill the buffer, then we could potentially run
into a timeout eventually.

I'm not sure whether this is likely to happen, and whether it's
something we want to address. Maybe we should extend the logic so that
we also send the keepalive pkt-line in case we have only been buffering
data for the last 1 second?

Basically, we'd reset a timestamp every time data was sent. If we see
that we received data without writing it out, and if the current
time is one second beyond the last time we reset, then we might want to
send out the keepalive.

Patrick

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/2] upload-pack: reduce lock contention when writing packfile data
  2026-02-27 11:23 ` [PATCH 2/2] upload-pack: reduce lock contention when writing " Patrick Steinhardt
  2026-02-27 13:04   ` brian m. carlson
  2026-02-27 17:29   ` Junio C Hamano
@ 2026-02-27 19:37   ` Jeff King
  2026-03-02 12:12     ` Patrick Steinhardt
  2 siblings, 1 reply; 64+ messages in thread
From: Jeff King @ 2026-02-27 19:37 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Matt Smiley, brian m. carlson, Junio C Hamano

On Fri, Feb 27, 2026 at 12:23:01PM +0100, Patrick Steinhardt wrote:

> Extend our use of the buffering infrastructure so that we soak up bytes
> until the buffer is filled up at least 2/3rds of its capacity. The
> change is relatively simple to implement as we already know to flush the
> buffer in `create_pack_file()` after git-pack-objects(1) has finished.

We are relaying write() calls from pack-objects here, which is writing
to us in 8kb chunks (due to csum-file.c buffering). So most of our
writes will be 8k.

Rather than buffering in upload-pack, would it not be simpler to just
increase the write size from pack-objects? Then we do not have to worry
about disrupting upload-pack's keepalive timeouts. And as a bonus, if
you are worried about the system-wide number of calls, you will likewise
be reducing the number of read() and write() calls over the pipe between
pack-objects and upload-pack.

Something like this:

diff --git a/csum-file.c b/csum-file.c
index 6e21e3cac8..94798fa429 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -206,7 +206,7 @@ struct hashfile *hashfd_throughput(const struct git_hash_algo *algop,
 	 * size so the progress indicators arrive at a more
 	 * frequent rate.
 	 */
-	return hashfd_internal(algop, fd, name, tp, 8 * 1024);
+	return hashfd_internal(algop, fd, name, tp, 32 * 1024);
 }
 
 void hashfile_checkpoint_init(struct hashfile *f,

reduces the number of write calls reported by:

  git clone \
    --upload-pack='perf stat -e syscalls:sys_enter_write git-upload-pack' \
    --bare --no-local linux.git foo.git

from ~420k to ~160k. In theory we expect ~8x reduction in our target
area, 4x for each of pack-objects and upload-pack, but of course there
are other writes going on, too, including the extra sideband ones. And
obviously we could push it further towards LARGE_PACKET_MAX to save even
more.

> Now git-upload-pack(1) already has the infrastructure in place to buffer
> some of the data it reads from git-pack-objects(1) before actually
> sending it out. We only use this infrastructure in very limited ways
> though, so we generally end up matching one read(3p) call with one
> write(3p) call. Even worse, when the sideband is enabled we end up
> matching one read with _two_ writes: one for the pkt-line length, and
> one for the packfile data.

Using writev() would be an easy-ish fix here, modulo portability
concerns (though of course it is easy to implement a fallback writev()
in terms of write()). Doing this:

diff --git a/sideband.c b/sideband.c
index ea7c25211e..b5509fbaa2 100644
--- a/sideband.c
+++ b/sideband.c
@@ -266,19 +266,25 @@ void send_sideband(int fd, int band, const char *data, ssize_t sz, int packet_ma
 	while (sz) {
 		unsigned n;
 		char hdr[5];
+		struct iovec iov[2];
 
 		n = sz;
 		if (packet_max - 5 < n)
 			n = packet_max - 5;
 		if (0 <= band) {
 			xsnprintf(hdr, sizeof(hdr), "%04x", n + 5);
 			hdr[4] = band;
-			write_or_die(fd, hdr, 5);
+			iov[0].iov_base = hdr;
+			iov[0].iov_len = 5;
 		} else {
 			xsnprintf(hdr, sizeof(hdr), "%04x", n + 4);
-			write_or_die(fd, hdr, 4);
+			iov[0].iov_base = hdr;
+			iov[0].iov_len = 4;
 		}
-		write_or_die(fd, p, n);
+		iov[1].iov_base = p;
+		iov[1].iov_len = n;
+		/* obviously needs looping and error detection */
+		writev(fd, iov, 2);
 		p += n;
 		sz -= n;
 	}

drops my 160k write calls down to 82k.

Another option here is teaching the packet-forming code to reserve a few
bytes at the front of the packet. There's a little discussion here:

  https://lore.kernel.org/git/YBkeYSA5UfQP1m%2Fx@coredump.intra.peff.net/

In theory it's easy and elegant to do, but I'm not sure what the
refactoring fallout would be like.

> This significantly reduces the number of write(3p) syscalls we need to
> do. Before this change, cloning the Linux repository resulted in around
> 400,000 write(3p) syscalls. With the buffering in place we only do
> around 130,000 syscalls.

Out of curiosity, how did you end up measuring? I first tried with
strace (without "-f") on the upload-pack process, but strace slowed it
enough that it ended up collecting multiple of pack-object's 8k write()
calls in a single read() call. ;) The "perf stat" above seemed to work
OK, though of course it's counting child processes, too.

-Peff

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/2] upload-pack: reduce lock contention when writing packfile data
  2026-02-27 19:37   ` Jeff King
@ 2026-03-02 12:12     ` Patrick Steinhardt
  2026-03-02 18:20       ` Jeff King
  0 siblings, 1 reply; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-02 12:12 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Matt Smiley, brian m. carlson, Junio C Hamano

On Fri, Feb 27, 2026 at 02:37:58PM -0500, Jeff King wrote:
> On Fri, Feb 27, 2026 at 12:23:01PM +0100, Patrick Steinhardt wrote:
> 
> > Extend our use of the buffering infrastructure so that we soak up bytes
> > until the buffer is filled up at least 2/3rds of its capacity. The
> > change is relatively simple to implement as we already know to flush the
> > buffer in `create_pack_file()` after git-pack-objects(1) has finished.
> 
> We are relaying write() calls from pack-objects here, which is writing
> to us in 8kb chunks (due to csum-file.c buffering). So most of our
> writes will be 8k.
> 
> Rather than buffering in upload-pack, would it not be simpler to just
> increase the write size from pack-objects? Then we do not have to worry
> about disrupting upload-pack's keepalive timeouts. And as a bonus, if
> you are worried about the system-wide number of calls, you will likewise
> be reducing the number of read() and write() calls over the pipe between
> pack-objects and upload-pack.

We can do that. But we also have to keep in mind that downstream in the
pipe may be a process that's not even git-pack-objects(1) in the first
place because of "uploadpack.packObjectsHook". So maybe we should have a
look at doing both.

> > Now git-upload-pack(1) already has the infrastructure in place to buffer
> > some of the data it reads from git-pack-objects(1) before actually
> > sending it out. We only use this infrastructure in very limited ways
> > though, so we generally end up matching one read(3p) call with one
> > write(3p) call. Even worse, when the sideband is enabled we end up
> > matching one read with _two_ writes: one for the pkt-line length, and
> > one for the packfile data.
> 
> Using writev() would be an easy-ish fix here, modulo portability
> concerns (though of course it is easy to implement a fallback writev()
> in terms of write()). Doing this:

Right, I was also wondering about whether we might want to use writev(),
but I didn't have the time yet to have a deeper look. I'll have a look
at whether I can integrate your change in a platform-compatible way.

> Out of curiosity, how did you end up measuring? I first tried with
> strace (without "-f") on the upload-pack process, but strace slowed it
> enough that it ended up collecting multiple of pack-object's 8k write()
> calls in a single read() call. ;) The "perf stat" above seemed to work
> OK, though of course it's counting child processes, too.

I used strace for this. I didn't really dig into it too deeply as I was
rather busy handling the havoc that this issue caused :)

I'll send a v2 soonish, thanks!

Patrick

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/2] upload-pack: reduce lock contention when writing packfile data
  2026-03-02 12:12     ` Patrick Steinhardt
@ 2026-03-02 18:20       ` Jeff King
  2026-03-03  9:31         ` Patrick Steinhardt
  0 siblings, 1 reply; 64+ messages in thread
From: Jeff King @ 2026-03-02 18:20 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Matt Smiley, brian m. carlson, Junio C Hamano

On Mon, Mar 02, 2026 at 01:12:07PM +0100, Patrick Steinhardt wrote:

> > Rather than buffering in upload-pack, would it not be simpler to just
> > increase the write size from pack-objects? Then we do not have to worry
> > about disrupting upload-pack's keepalive timeouts. And as a bonus, if
> > you are worried about the system-wide number of calls, you will likewise
> > be reducing the number of read() and write() calls over the pipe between
> > pack-objects and upload-pack.
> 
> We can do that. But we also have to keep in mind that downstream in the
> pipe may be a process that's not even git-pack-objects(1) in the first
> place because of "uploadpack.packObjectsHook". So maybe we should have a
> look at doing both.

True, though I think that whatever is producing gobs of output from that
hook should consider using a buffer size close to a pktline. In many
cases it will be pack-objects itself (just wrapped with some extra
magic), but I guess you may have some kind of caching layer at GitLab
(we did at GitHub).

That is getting specialized enough that I don't feel too bad suggesting
that authors of those tools should consider buffer sizes.

As far as doing both, I'm not sure if it's worth it. My two concerns
are:

  1. It re-opens the question of whether upload-pack might stall waiting
     to fill its buffer and fail to produce keepalives correctly.

  2. I wonder if we could get some weird interactions between the two
     buffer sizes. E.g., if pack-objects sends 50k bytes at a time, but
     upload-pack wants to wait for 51k. So we read 50k then wait for the
     next chunk. Either:

       a. we read 50k again, pull off 13k of it to make a full pktline,
          and then memcpy around the other 37k to await more data.
	  There's a bunch of extra copying as our buffer sizes don't
	  line up.

       b. we read 13k (to fill up the pkt we're trying to send), send
          that, and then the next read gets a partial read(). So we end
	  up issuing more reads, although sometimes pack-objects might
	  catch up and fill the pipe buffer, and we'd get a full packet
	  in one go. But depending on the timing, I wonder if things
	  could get choppy and we'd end up issuing a bunch of extra
	  reads (and possibly extra writes on the pack-objects side if
	  it's waiting on us to create more space in the pipe).

-Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/2] upload-pack: reduce lock contention when writing packfile data
  2026-03-02 18:20       ` Jeff King
@ 2026-03-03  9:31         ` Patrick Steinhardt
  2026-03-03 13:35           ` Jeff King
  0 siblings, 1 reply; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-03  9:31 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Matt Smiley, brian m. carlson, Junio C Hamano

On Mon, Mar 02, 2026 at 01:20:23PM -0500, Jeff King wrote:
> On Mon, Mar 02, 2026 at 01:12:07PM +0100, Patrick Steinhardt wrote:
> 
> > > Rather than buffering in upload-pack, would it not be simpler to just
> > > increase the write size from pack-objects? Then we do not have to worry
> > > about disrupting upload-pack's keepalive timeouts. And as a bonus, if
> > > you are worried about the system-wide number of calls, you will likewise
> > > be reducing the number of read() and write() calls over the pipe between
> > > pack-objects and upload-pack.
> > 
> > We can do that. But we also have to keep in mind that downstream in the
> > pipe may be a process that's not even git-pack-objects(1) in the first
> > place because of "uploadpack.packObjectsHook". So maybe we should have a
> > look at doing both.
> 
> True, though I think that whatever is producing gobs of output from that
> hook should consider using a buffer size close to a pktline. In many
> cases it will be pack-objects itself (just wrapped with some extra
> magic), but I guess you may have some kind of caching layer at GitLab
> (we did at GitHub).
> 
> That is getting specialized enough that I don't feel too bad suggesting
> that authors of those tools should consider buffer sizes.
> 
> As far as doing both, I'm not sure if it's worth it. My two concerns
> are:
> 
>   1. It re-opens the question of whether upload-pack might stall waiting
>      to fill its buffer and fail to produce keepalives correctly.

I've got a patch for that. The problem can even trigger right now as we
already do buffer some of the data, and that may cause the keepalives to
be missed. But this only happens initially in our current
infrastructure, before we see the "PACK" signature, so it's unlikely to
be a problem in practice.

>   2. I wonder if we could get some weird interactions between the two
>      buffer sizes. E.g., if pack-objects sends 50k bytes at a time, but
>      upload-pack wants to wait for 51k. So we read 50k then wait for the
>      next chunk. Either:
> 
>        a. we read 50k again, pull off 13k of it to make a full pktline,
>           and then memcpy around the other 37k to await more data.
> 	  There's a bunch of extra copying as our buffer sizes don't
> 	  line up.
> 
>        b. we read 13k (to fill up the pkt we're trying to send), send
>           that, and then the next read gets a partial read(). So we end
> 	  up issuing more reads, although sometimes pack-objects might
> 	  catch up and fill the pipe buffer, and we'd get a full packet
> 	  in one go. But depending on the timing, I wonder if things
> 	  could get choppy and we'd end up issuing a bunch of extra
> 	  reads (and possibly extra writes on the pack-objects side if
> 	  it's waiting on us to create more space in the pipe).

We would likely hit this issue if we insist on the buffer being
completely filled before sending it out. But that's why I adapted the
logic to say that we send out once we've filled it at least 2/3rds of
the pktline limit. So in your case above we wouldn't face an issue as
we'd already send the first 50kB, as it is larger than 2/3rds of the
maximum length (~42kB).

That being said, you'll still be able to construct weird edge cases,
for example if you consistently send one byte less than 2/3rds of the
capacity.

Patrick

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/2] upload-pack: reduce lock contention when writing packfile data
  2026-03-03  9:31         ` Patrick Steinhardt
@ 2026-03-03 13:35           ` Jeff King
  2026-03-03 13:47             ` Patrick Steinhardt
  0 siblings, 1 reply; 64+ messages in thread
From: Jeff King @ 2026-03-03 13:35 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Matt Smiley, brian m. carlson, Junio C Hamano

On Tue, Mar 03, 2026 at 10:31:46AM +0100, Patrick Steinhardt wrote:

> > As far as doing both, I'm not sure if it's worth it. My two concerns
> > are:
> > 
> >   1. It re-opens the question of whether upload-pack might stall waiting
> >      to fill its buffer and fail to produce keepalives correctly.
> 
> I've got a patch for that. The problem can even trigger right now as we
> already do buffer some of the data, and that may cause the keepalives to
> be missed. But this only happens initially in our current
> infrastructure, before we see the "PACK" signature, so it's unlikely to
> be a problem in practice.

I'm not sure what you mean by "this only happens initially" here. If it
is: we can only miss keepalives in that time, then I think that is
probably a real problem. The time we _most_ need keepalives is before we
see the PACK signature, because that is when pack-objects is chewing on
the input, looking for deltas, etc, and not producing any output.

It is usually "solved" by pack-objects producing progress over stderr,
but for "--quiet" fetches, it could produce nothing for a long time.

But anyway, if you are fixing it either way, then I am happy. :)

> We would likely hit this issue if we insist on the buffer being
> completely filled before sending it out. But that's why I adapted the
> logic to say that we send out once we've filled it at least 2/3rds of
> the pktline limit. So in your case above we wouldn't face an issue as
> we'd already send the first 50kB, as it is smaller than 2/3rds of the
> maximum length (~42kB).
> 
> That being said, you'll still be able to construct cases where we have
> weird edge cases. For example if you consistently send one byte less
> than 2/3rds of the capacity.

Right, my numbers were just meant as examples. Whatever the values, it
means that whatever is generating the pack data (pack-objects or
otherwise) really wants to be in sync with how upload-pack is buffering.
Or vice versa. If we just pass back whole chunks of what we read() in
upload-pack, then that happens automatically.

-Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/2] upload-pack: reduce lock contention when writing packfile data
  2026-03-03 13:35           ` Jeff King
@ 2026-03-03 13:47             ` Patrick Steinhardt
  0 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-03 13:47 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Matt Smiley, brian m. carlson, Junio C Hamano

On Tue, Mar 03, 2026 at 08:35:40AM -0500, Jeff King wrote:
> On Tue, Mar 03, 2026 at 10:31:46AM +0100, Patrick Steinhardt wrote:
> > We would likely hit this issue if we insist on the buffer being
> > completely filled before sending it out. But that's why I adapted the
> > logic to say that we send out once we've filled it at least 2/3rds of
> > the pktline limit. So in your case above we wouldn't face an issue as
> > we'd already send the first 50kB, as it is larger than 2/3rds of the
> > maximum length (~42kB).
> > 
> > That being said, you'll still be able to construct cases where we have
> > weird edge cases. For example if you consistently send one byte less
> > than 2/3rds of the capacity.
> 
> Right, my numbers were just meant as examples. Whatever the values, it
> means that whatever is generating the pack data (pack-objects or
> otherwise) really wants to be in sync with how upload-pack is buffering.
> Or vice versa. If we just pass back whole chunks of what we read() in
> upload-pack, then that happens automatically.

Yeah, that's fair. I'll include this part in v2 for now so that we can
discuss once the bigger picture emerges :)

Thanks!

Patrick

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v2 00/10] upload-pack: reduce lock contention when writing packfile data
  2026-02-27 11:22 [PATCH 0/2] upload-pack: reduce lock contention when writing packfile data Patrick Steinhardt
  2026-02-27 11:23 ` [PATCH 1/2] upload-pack: fix debug statement when flushing " Patrick Steinhardt
  2026-02-27 11:23 ` [PATCH 2/2] upload-pack: reduce lock contention when writing " Patrick Steinhardt
@ 2026-03-03 15:00 ` Patrick Steinhardt
  2026-03-03 15:00   ` [PATCH v2 01/10] upload-pack: fix debug statement when flushing " Patrick Steinhardt
                     ` (9 more replies)
  2026-03-10 13:24 ` [PATCH v3 00/10] upload-pack: " Patrick Steinhardt
  2026-03-13  6:45 ` [PATCH v4 " Patrick Steinhardt
  4 siblings, 10 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-03 15:00 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King

Hi,

this small patch series fixes some heavy lock contention when writing
data from git-upload-pack(1) into pipes. This lock contention can be
observed when having hundreds of git-upload-pack(1) processes active at
the same time that write data into pipes at dozens of gigabits per
second.

I have uploaded the flame graph that clearly shows the lock contention
at [1].

Changes in v2:
  - Change the buffer size in git-pack-objects(1) to also reduce the
    number of write syscalls over there.
  - Introduce writev to halve the number of syscalls when writing
    pktlines.
  - Use `sizeof(os->buffer)` instead of open-coding its size.
  - Improve keepalive logic in git-upload-pack(1) to account for
    buffering.
  - Link to v1: https://lore.kernel.org/r/20260227-pks-upload-pack-write-contention-v1-0-7166fe255704@pks.im

Thanks!

Patrick

[1]: https://gitlab.com/gitlab-org/git/-/work_items/675

---
Patrick Steinhardt (10):
      upload-pack: fix debug statement when flushing packfile data
      upload-pack: adapt keepalives based on buffering
      upload-pack: reduce lock contention when writing packfile data
      git-compat-util: introduce `cast_size_t_to_ssize_t()`
      compat/posix: introduce writev(3p) wrapper
      wrapper: introduce writev(3p) wrappers
      sideband: use writev(3p) to send pktlines
      csum-file: introduce `hashfd_ext()`
      csum-file: drop `hashfd_throughput()`
      builtin/pack-objects: reduce lock contention when writing packfile data

 Makefile               |  4 ++++
 builtin/pack-objects.c | 23 +++++++++++++++----
 compat/posix.h         | 14 ++++++++++++
 compat/writev.c        | 29 ++++++++++++++++++++++++
 config.mak.uname       |  2 ++
 csum-file.c            | 28 +++++++----------------
 csum-file.h            | 16 ++++++++++++--
 git-compat-util.h      |  8 +++++++
 meson.build            |  1 +
 sideband.c             | 14 +++++++++---
 upload-pack.c          | 60 ++++++++++++++++++++++++++++++++++++++++----------
 wrapper.c              | 41 ++++++++++++++++++++++++++++++++++
 wrapper.h              |  9 ++++++++
 write-or-die.c         |  8 +++++++
 write-or-die.h         |  1 +
 15 files changed, 217 insertions(+), 41 deletions(-)

Range-diff versus v1:

 1:  368a984880 =  1:  81c00e3b52 upload-pack: fix debug statement when flushing packfile data
 -:  ---------- >  2:  01c86a9573 upload-pack: adapt keepalives based on buffering
 2:  f014e8005a !  3:  16bc5661d4 upload-pack: reduce lock contention when writing packfile data
    @@ Commit message
         write(3p) syscalls from 130,000 to 100,000. So overall this doesn't seem
         worth it.
     
    +    Note that the issue could also be fixed by adapting the write buffer
    +    that we use in the downstream git-pack-objects(1) command, and such a
    +    change would have roughly the same result. But the command that
    +    generates the packfile data may not always be git-pack-objects(1) as it
    +    can be changed via "uploadpack.packObjectsHook", so such a fix would
    +    only help in _some_ cases. Regardless of that, we'll also adapt the
    +    write buffer size of git-pack-objects(1) in a subsequent commit.
    +
         Helped-by: Matt Smiley <msmiley@gitlab.com>
         Signed-off-by: Patrick Steinhardt <ps@pks.im>
     
    @@ upload-pack.c: static int relay_pack_data(int pack_objects_out, struct output_st
     +	 * Make sure that we buffer some data before sending it to the client.
     +	 * This significantly reduces the number of write(3p) syscalls.
     +	 */
    -+	if (readsz && os->used < (LARGE_PACKET_DATA_MAX * 2 / 3))
    ++	if (readsz && os->used < (sizeof(os->buffer) * 2 / 3))
     +		return readsz;
     +
      	if (os->used > 1) {
 -:  ---------- >  4:  f42a8a0558 git-compat-util: introduce `cast_size_t_to_ssize_t()`
 -:  ---------- >  5:  830c72f588 compat/posix: introduce writev(3p) wrapper
 -:  ---------- >  6:  733ef129f1 wrapper: introduce writev(3p) wrappers
 -:  ---------- >  7:  18f7429e14 sideband: use writev(3p) to send pktlines
 -:  ---------- >  8:  8eb2aee74d csum-file: introduce `hashfd_ext()`
 -:  ---------- >  9:  a634506b81 csum-file: drop `hashfd_throughput()`
 -:  ---------- > 10:  ca6d52b012 builtin/pack-objects: reduce lock contention when writing packfile data

---
base-commit: 7b2bccb0d58d4f24705bf985de1f4612e4cf06e5
change-id: 20260227-pks-upload-pack-write-contention-435ce01f5fe9


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v2 01/10] upload-pack: fix debug statement when flushing packfile data
  2026-03-03 15:00 ` [PATCH v2 00/10] " Patrick Steinhardt
@ 2026-03-03 15:00   ` Patrick Steinhardt
  2026-03-03 15:00   ` [PATCH v2 02/10] upload-pack: adapt keepalives based on buffering Patrick Steinhardt
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-03 15:00 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King

When git-upload-pack(1) writes packfile data to the client we have some
logic in place that buffers some partial lines. When that buffer still
contains data after git-pack-objects(1) has finished we flush the buffer
so that all remaining bytes are sent out.

Curiously, when we do so we also print the string "flushed." to stderr.
This statement was introduced in b1c71b7281 (upload-pack: avoid
sending an incomplete pack upon failure, 2006-06-20), so quite a while
ago. What's interesting though is that stderr is typically spliced
through to the client-side, and consequently the client would see this
message. Munging the way we do the caching indeed confirms this:

  $ git clone file:///home/pks/Development/linux/
  Cloning into bare repository 'linux.git'...
  remote: Enumerating objects: 12980346, done.
  remote: Counting objects: 100% (131820/131820), done.
  remote: Compressing objects: 100% (50290/50290), done.
  remote: Total 12980346 (delta 96319), reused 104500 (delta 81217), pack-reused 12848526 (from 1)
  Receiving objects: 100% (12980346/12980346), 3.23 GiB | 57.44 MiB/s, done.
  flushed.
  Resolving deltas: 100% (10676718/10676718), done.

It's quite clear that this string shouldn't ever be visible to the
client, so it rather feels like this is a left-over debug statement. The
mentioned commit doesn't mention this line, either.

Remove the debug output to prepare for a change in how we do the
buffering in the next commit.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 upload-pack.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/upload-pack.c b/upload-pack.c
index 2d2b70cbf2..c2643c0295 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -457,11 +457,9 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 	}
 
 	/* flush the data */
-	if (output_state->used > 0) {
+	if (output_state->used > 0)
 		send_client_data(1, output_state->buffer, output_state->used,
 				 pack_data->use_sideband);
-		fprintf(stderr, "flushed.\n");
-	}
 	free(output_state);
 	if (pack_data->use_sideband)
 		packet_flush(1);

-- 
2.53.0.697.g625c4fb2da.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 02/10] upload-pack: adapt keepalives based on buffering
  2026-03-03 15:00 ` [PATCH v2 00/10] " Patrick Steinhardt
  2026-03-03 15:00   ` [PATCH v2 01/10] upload-pack: fix debug statement when flushing " Patrick Steinhardt
@ 2026-03-03 15:00   ` Patrick Steinhardt
  2026-03-05  0:56     ` Jeff King
  2026-03-03 15:00   ` [PATCH v2 03/10] upload-pack: reduce lock contention when writing packfile data Patrick Steinhardt
                     ` (7 subsequent siblings)
  9 siblings, 1 reply; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-03 15:00 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King

The function `create_pack_file()` is responsible for sending the
packfile data to the client of git-upload-pack(1). As generating the
bytes may take significant computing resources we also have a mechanism
in place that optionally sends keepalive pktlines in case we haven't
sent out any data.

The keepalive logic is purely based on poll(3p): we pass a timeout to that
syscall, and if the call times out we send out the keepalive pktline.
While reasonable, this logic isn't entirely sufficient: even if the call
to poll(3p) ends because we have received data on any of the file
descriptors we may not necessarily send data to the client.

The most important edge case here happens in `relay_pack_data()`. When
we haven't seen the initial "PACK" signature from git-pack-objects(1)
yet we buffer incoming data. So in the worst case, if each of the bytes
of that signature arrives shortly before the configured keepalive
timeout, then we may not send out any data for a time period that is
(almost) four times as long as the configured timeout.

This edge case is rather unlikely to matter in practice. But in a
subsequent commit we're going to adapt our buffering mechanism to become
more aggressive, which makes it more likely that we don't send any data
for an extended amount of time.

Adapt the logic so that instead of using a fixed timeout on every call
to poll(3p), we instead figure out how much time has passed since the
last-sent data.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 upload-pack.c | 49 ++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 40 insertions(+), 9 deletions(-)

diff --git a/upload-pack.c b/upload-pack.c
index c2643c0295..04521e57c9 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -29,6 +29,7 @@
 #include "commit-graph.h"
 #include "commit-reach.h"
 #include "shallow.h"
+#include "trace.h"
 #include "write-or-die.h"
 #include "json-writer.h"
 #include "strmap.h"
@@ -218,7 +219,8 @@ struct output_state {
 };
 
 static int relay_pack_data(int pack_objects_out, struct output_state *os,
-			   int use_sideband, int write_packfile_line)
+			   int use_sideband, int write_packfile_line,
+			   bool *did_send_data)
 {
 	/*
 	 * We keep the last byte to ourselves
@@ -232,6 +234,8 @@ static int relay_pack_data(int pack_objects_out, struct output_state *os,
 	 */
 	ssize_t readsz;
 
+	*did_send_data = false;
+
 	readsz = xread(pack_objects_out, os->buffer + os->used,
 		       sizeof(os->buffer) - os->used);
 	if (readsz < 0) {
@@ -247,6 +251,7 @@ static int relay_pack_data(int pack_objects_out, struct output_state *os,
 				if (os->packfile_uris_started)
 					packet_delim(1);
 				packet_write_fmt(1, "\1packfile\n");
+				*did_send_data = true;
 			}
 			break;
 		}
@@ -259,6 +264,7 @@ static int relay_pack_data(int pack_objects_out, struct output_state *os,
 			}
 			*p = '\0';
 			packet_write_fmt(1, "\1%s\n", os->buffer);
+			*did_send_data = true;
 
 			os->used -= p - os->buffer + 1;
 			memmove(os->buffer, p + 1, os->used);
@@ -279,6 +285,7 @@ static int relay_pack_data(int pack_objects_out, struct output_state *os,
 		os->used = 0;
 	}
 
+	*did_send_data = true;
 	return readsz;
 }
 
@@ -290,6 +297,7 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 	char progress[128];
 	char abort_msg[] = "aborting due to possible repository "
 		"corruption on the remote side.";
+	uint64_t last_sent_ms = 0;
 	ssize_t sz;
 	int i;
 	FILE *pipe_fd;
@@ -365,10 +373,14 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 	 */
 
 	while (1) {
+		uint64_t now_ms = getnanotime() / 1000000;
 		struct pollfd pfd[2];
-		int pe, pu, pollsize, polltimeout;
+		int pe, pu, pollsize, polltimeout_ms;
 		int ret;
 
+		if (!last_sent_ms)
+			last_sent_ms = now_ms;
+
 		reset_timeout(pack_data->timeout);
 
 		pollsize = 0;
@@ -390,11 +402,21 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 		if (!pollsize)
 			break;
 
-		polltimeout = pack_data->keepalive < 0
-			? -1
-			: 1000 * pack_data->keepalive;
+		if (pack_data->keepalive < 0) {
+			polltimeout_ms = -1;
+		} else {
+			/*
+			 * The polling timeout needs to be adjusted based on
+			 * the time at which we last sent any data. The longer
+			 * ago that was, the shorter the timeout becomes,
+			 * until we eventually don't block at all.
+			 */
+			polltimeout_ms = 1000 * pack_data->keepalive - (now_ms - last_sent_ms);
+			if (polltimeout_ms < 0)
+				polltimeout_ms = 0;
+		}
 
-		ret = poll(pfd, pollsize, polltimeout);
+		ret = poll(pfd, pollsize, polltimeout_ms);
 
 		if (ret < 0) {
 			if (errno != EINTR) {
@@ -403,16 +425,18 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 			}
 			continue;
 		}
+
 		if (0 <= pe && (pfd[pe].revents & (POLLIN|POLLHUP))) {
 			/* Status ready; we ship that in the side-band
 			 * or dump to the standard error.
 			 */
 			sz = xread(pack_objects.err, progress,
 				  sizeof(progress));
-			if (0 < sz)
+			if (0 < sz) {
 				send_client_data(2, progress, sz,
 						 pack_data->use_sideband);
-			else if (sz == 0) {
+				last_sent_ms = now_ms;
+			} else if (sz == 0) {
 				close(pack_objects.err);
 				pack_objects.err = -1;
 			}
@@ -421,11 +445,14 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 			/* give priority to status messages */
 			continue;
 		}
+
 		if (0 <= pu && (pfd[pu].revents & (POLLIN|POLLHUP))) {
+			bool did_send_data;
 			int result = relay_pack_data(pack_objects.out,
 						     output_state,
 						     pack_data->use_sideband,
-						     !!uri_protocols);
+						     !!uri_protocols,
+						     &did_send_data);
 
 			if (result == 0) {
 				close(pack_objects.out);
@@ -433,6 +460,9 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 			} else if (result < 0) {
 				goto fail;
 			}
+
+			if (did_send_data)
+				last_sent_ms = now_ms;
 		}
 
 		/*
@@ -448,6 +478,7 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 		if (!ret && pack_data->use_sideband) {
 			static const char buf[] = "0005\1";
 			write_or_die(1, buf, 5);
+			last_sent_ms = now_ms;
 		}
 	}
 

-- 
2.53.0.697.g625c4fb2da.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 03/10] upload-pack: reduce lock contention when writing packfile data
  2026-03-03 15:00 ` [PATCH v2 00/10] " Patrick Steinhardt
  2026-03-03 15:00   ` [PATCH v2 01/10] upload-pack: fix debug statement when flushing " Patrick Steinhardt
  2026-03-03 15:00   ` [PATCH v2 02/10] upload-pack: adapt keepalives based on buffering Patrick Steinhardt
@ 2026-03-03 15:00   ` Patrick Steinhardt
  2026-03-05  1:16     ` Jeff King
  2026-03-03 15:00   ` [PATCH v2 04/10] git-compat-util: introduce `cast_size_t_to_ssize_t()` Patrick Steinhardt
                     ` (6 subsequent siblings)
  9 siblings, 1 reply; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-03 15:00 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King

In our production systems we have recently observed write contention in
git-upload-pack(1). The system in question was consistently streaming
packfiles at a rate of dozens of gigabits per second, but curiously the
system was not bottlenecked on CPU, memory, or IOPS.

We eventually discovered that Git was spending 80% of its time in
`pipe_write()`, out of which almost all of the time was spent in the
`ep_poll_callback` function in the kernel. Quoting the reporter:

  This infrastructure is part of an event notification queue designed to
  allow for multiple producers to emit events, but that concurrency
  safety is guarded by 3 layers of locking. The layer we're hitting
  contention in uses a simple reader/writer lock mode (a.k.a. shared
  versus exclusive mode), where producers need shared-mode (read mode),
  and various other actions use exclusive (write) mode.

The system in question generates workloads where we have hundreds of
git-upload-pack(1) processes active at the same point in time. These
processes end up contending around those locks, and the consequence is
that the Git processes stall.

Now git-upload-pack(1) already has the infrastructure in place to buffer
some of the data it reads from git-pack-objects(1) before actually
sending it out. We only use this infrastructure in very limited ways
though, so we generally end up matching one read(3p) call with one
write(3p) call. Even worse, when the sideband is enabled we end up
matching one read with _two_ writes: one for the pkt-line length, and
one for the packfile data.

Extend our use of the buffering infrastructure so that we soak up bytes
until the buffer is filled up at least 2/3rds of its capacity. The
change is relatively simple to implement as we already know to flush the
buffer in `create_pack_file()` after git-pack-objects(1) has finished.

This significantly reduces the number of write(3p) syscalls we need to
do. Before this change, cloning the Linux repository resulted in around
400,000 write(3p) syscalls. With the buffering in place we only do
around 130,000 syscalls.

Now we could of course go even further and make sure that we always fill
up the whole buffer. But this might cause an increase in read(3p)
syscalls, and some tests show that this only reduces the number of
write(3p) syscalls from 130,000 to 100,000. So overall this doesn't seem
worth it.

Note that the issue could also be fixed by adapting the write buffer
that we use in the downstream git-pack-objects(1) command, and such a
change would have roughly the same result. But the command that
generates the packfile data may not always be git-pack-objects(1) as it
can be changed via "uploadpack.packObjectsHook", so such a fix would
only help in _some_ cases. Regardless of that, we'll also adapt the
write buffer size of git-pack-objects(1) in a subsequent commit.

Helped-by: Matt Smiley <msmiley@gitlab.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 upload-pack.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/upload-pack.c b/upload-pack.c
index 04521e57c9..1b1c81ea63 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -276,6 +276,13 @@ static int relay_pack_data(int pack_objects_out, struct output_state *os,
 		}
 	}
 
+	/*
+	 * Make sure that we buffer some data before sending it to the client.
+	 * This significantly reduces the number of write(3p) syscalls.
+	 */
+	if (readsz && os->used < (sizeof(os->buffer) * 2 / 3))
+		return readsz;
+
 	if (os->used > 1) {
 		send_client_data(1, os->buffer, os->used - 1, use_sideband);
 		os->buffer[0] = os->buffer[os->used - 1];

-- 
2.53.0.697.g625c4fb2da.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 04/10] git-compat-util: introduce `cast_size_t_to_ssize_t()`
  2026-03-03 15:00 ` [PATCH v2 00/10] " Patrick Steinhardt
                     ` (2 preceding siblings ...)
  2026-03-03 15:00   ` [PATCH v2 03/10] upload-pack: reduce lock contention when writing packfile data Patrick Steinhardt
@ 2026-03-03 15:00   ` Patrick Steinhardt
  2026-03-03 15:00   ` [PATCH v2 05/10] compat/posix: introduce writev(3p) wrapper Patrick Steinhardt
                     ` (5 subsequent siblings)
  9 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-03 15:00 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King

Introduce a new helper function `cast_size_t_to_ssize_t()`. This
function will be used in the next commit.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 git-compat-util.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/git-compat-util.h b/git-compat-util.h
index bebcf9f698..c6af04cd7a 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -665,6 +665,14 @@ static inline int cast_size_t_to_int(size_t a)
 	return (int)a;
 }
 
+static inline ssize_t cast_size_t_to_ssize_t(size_t a)
+{
+	if (a > maximum_signed_value_of_type(ssize_t))
+		die("number too large to represent as ssize_t on this platform: %"PRIuMAX,
+		    (uintmax_t)a);
+	return (ssize_t)a;
+}
+
 static inline uint64_t u64_mult(uint64_t a, uint64_t b)
 {
 	if (unsigned_mult_overflows(a, b))

-- 
2.53.0.697.g625c4fb2da.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 05/10] compat/posix: introduce writev(3p) wrapper
  2026-03-03 15:00 ` [PATCH v2 00/10] " Patrick Steinhardt
                     ` (3 preceding siblings ...)
  2026-03-03 15:00   ` [PATCH v2 04/10] git-compat-util: introduce `cast_size_t_to_ssize_t()` Patrick Steinhardt
@ 2026-03-03 15:00   ` Patrick Steinhardt
  2026-03-04 22:01     ` Junio C Hamano
  2026-03-03 15:00   ` [PATCH v2 06/10] wrapper: introduce writev(3p) wrappers Patrick Steinhardt
                     ` (4 subsequent siblings)
  9 siblings, 1 reply; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-03 15:00 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King

In a subsequent commit we're going to add the first caller to
writev(3p). Introduce a compatibility wrapper for this syscall that we
can use on systems that don't have this syscall.

The syscall exists on modern Unixes like Linux and macOS, and seemingly
even for NonStop according to [1]. It doesn't seem to exist on Windows
though, which matches gnulib's portability notes [2].

[1]: http://nonstoptools.com/manuals/OSS-SystemCalls.pdf
[2]: https://www.gnu.org/software/gnulib/manual/html_node/writev.html

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 Makefile         |  4 ++++
 compat/posix.h   | 14 ++++++++++++++
 compat/writev.c  | 29 +++++++++++++++++++++++++++++
 config.mak.uname |  2 ++
 meson.build      |  1 +
 5 files changed, 50 insertions(+)

diff --git a/Makefile b/Makefile
index 7f37ad8f58..cb95ff2daf 100644
--- a/Makefile
+++ b/Makefile
@@ -2021,6 +2021,10 @@ ifdef NO_PREAD
 	COMPAT_CFLAGS += -DNO_PREAD
 	COMPAT_OBJS += compat/pread.o
 endif
+ifdef NO_WRITEV
+	COMPAT_CFLAGS += -DNO_WRITEV
+	COMPAT_OBJS += compat/writev.o
+endif
 ifdef NO_FAST_WORKING_DIRECTORY
 	BASIC_CFLAGS += -DNO_FAST_WORKING_DIRECTORY
 endif
diff --git a/compat/posix.h b/compat/posix.h
index 245386fa4a..3c611d2736 100644
--- a/compat/posix.h
+++ b/compat/posix.h
@@ -137,6 +137,9 @@
 #include <sys/socket.h>
 #include <sys/ioctl.h>
 #include <sys/statvfs.h>
+#ifndef NO_WRITEV
+#include <sys/uio.h>
+#endif
 #include <termios.h>
 #ifndef NO_SYS_SELECT_H
 #include <sys/select.h>
@@ -323,6 +326,17 @@ int git_lstat(const char *, struct stat *);
 ssize_t git_pread(int fd, void *buf, size_t count, off_t offset);
 #endif
 
+#ifdef NO_WRITEV
+#define writev git_writev
+#define iovec git_iovec
+struct git_iovec {
+	void *iov_base;
+	size_t iov_len;
+};
+
+ssize_t git_writev(int fd, const struct iovec *iov, int iovcnt);
+#endif
+
 #ifdef NO_SETENV
 #define setenv gitsetenv
 int gitsetenv(const char *, const char *, int);
diff --git a/compat/writev.c b/compat/writev.c
new file mode 100644
index 0000000000..b77e534d5d
--- /dev/null
+++ b/compat/writev.c
@@ -0,0 +1,29 @@
+#include "../git-compat-util.h"
+#include "../wrapper.h"
+
+ssize_t git_writev(int fd, const struct iovec *iov, int iovcnt)
+{
+	size_t total_written = 0;
+
+	for (int i = 0; i < iovcnt; i++) {
+		const char *bytes = iov[i].iov_base;
+		size_t iovec_written = 0;
+
+		while (iovec_written < iov[i].iov_len) {
+			ssize_t bytes_written = xwrite(fd, bytes + iovec_written,
+						       iov[i].iov_len - iovec_written);
+			if (bytes_written < 0) {
+				if (total_written)
+					goto out;
+				return bytes_written;
+			}
+			if (!bytes_written)
+				goto out;
+			iovec_written += bytes_written;
+			total_written += bytes_written;
+		}
+	}
+
+out:
+	return cast_size_t_to_ssize_t(total_written);
+}
diff --git a/config.mak.uname b/config.mak.uname
index 5feb582558..ccb3f71881 100644
--- a/config.mak.uname
+++ b/config.mak.uname
@@ -459,6 +459,7 @@ ifeq ($(uname_S),Windows)
 	SANE_TOOL_PATH ?= $(msvc_bin_dir_msys)
 	HAVE_ALLOCA_H = YesPlease
 	NO_PREAD = YesPlease
+	NO_WRITEV = YesPlease
 	NEEDS_CRYPTO_WITH_SSL = YesPlease
 	NO_LIBGEN_H = YesPlease
 	NO_POLL = YesPlease
@@ -674,6 +675,7 @@ ifeq ($(uname_S),MINGW)
 	pathsep = ;
 	HAVE_ALLOCA_H = YesPlease
 	NO_PREAD = YesPlease
+	NO_WRITEV = YesPlease
 	NEEDS_CRYPTO_WITH_SSL = YesPlease
 	NO_LIBGEN_H = YesPlease
 	NO_POLL = YesPlease
diff --git a/meson.build b/meson.build
index 762e2d0fc0..63514b6b84 100644
--- a/meson.build
+++ b/meson.build
@@ -1409,6 +1409,7 @@ checkfuncs = {
   'initgroups' : [],
   'strtoumax' : ['strtoumax.c', 'strtoimax.c'],
   'pread' : ['pread.c'],
+  'writev' : ['writev.c'],
 }
 
 if host_machine.system() == 'windows'

-- 
2.53.0.697.g625c4fb2da.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 06/10] wrapper: introduce writev(3p) wrappers
  2026-03-03 15:00 ` [PATCH v2 00/10] " Patrick Steinhardt
                     ` (4 preceding siblings ...)
  2026-03-03 15:00   ` [PATCH v2 05/10] compat/posix: introduce writev(3p) wrapper Patrick Steinhardt
@ 2026-03-03 15:00   ` Patrick Steinhardt
  2026-03-03 15:00   ` [PATCH v2 07/10] sideband: use writev(3p) to send pktlines Patrick Steinhardt
                     ` (3 subsequent siblings)
  9 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-03 15:00 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King

In the preceding commit we have added a compatibility wrapper for the
writev(3p) syscall. Introduce some generic wrappers for this function
that we nowadays take for granted in the Git codebase.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 wrapper.c      | 41 +++++++++++++++++++++++++++++++++++++++++
 wrapper.h      |  9 +++++++++
 write-or-die.c |  8 ++++++++
 write-or-die.h |  1 +
 4 files changed, 59 insertions(+)

diff --git a/wrapper.c b/wrapper.c
index 16f5a63fbb..be8fa575e6 100644
--- a/wrapper.c
+++ b/wrapper.c
@@ -323,6 +323,47 @@ ssize_t write_in_full(int fd, const void *buf, size_t count)
 	return total;
 }
 
+ssize_t writev_in_full(int fd, struct iovec *iov, int iovcnt)
+{
+	ssize_t total_written = 0;
+
+	while (iovcnt) {
+		ssize_t bytes_written = writev(fd, iov, iovcnt);
+		if (bytes_written < 0) {
+			if (errno == EINTR || errno == EAGAIN)
+				continue;
+			return -1;
+		}
+		if (!bytes_written) {
+			errno = ENOSPC;
+			return -1;
+		}
+
+		total_written += bytes_written;
+
+		/*
+		 * We first need to discard any iovec entities that have been
+		 * fully written.
+		 */
+		while (iovcnt && (size_t)bytes_written >= iov->iov_len) {
+			bytes_written -= iov->iov_len;
+			iov++;
+			iovcnt--;
+		}
+
+		/*
+		 * Finally, we need to adjust the last iovec in case we have
+		 * performed a partial write.
+		 */
+		if (iovcnt && bytes_written) {
+			iov->iov_base = (char *) iov->iov_base + bytes_written;
+			iov->iov_len -= bytes_written;
+		}
+	}
+
+	return total_written;
+}
+
 ssize_t pread_in_full(int fd, void *buf, size_t count, off_t offset)
 {
 	char *p = buf;
diff --git a/wrapper.h b/wrapper.h
index 15ac3bab6e..27519b32d1 100644
--- a/wrapper.h
+++ b/wrapper.h
@@ -47,6 +47,15 @@ ssize_t read_in_full(int fd, void *buf, size_t count);
 ssize_t write_in_full(int fd, const void *buf, size_t count);
 ssize_t pread_in_full(int fd, void *buf, size_t count, off_t offset);
 
+/*
+ * Try to write all iovecs. Returns -1 in case an error occurred with a proper
+ * errno set, the number of bytes written otherwise.
+ *
+ * Note that the iovec will be modified as a result of this call to adjust for
+ * partial writes!
+ */
+ssize_t writev_in_full(int fd, struct iovec *iov, int iovcnt);
+
 static inline ssize_t write_str_in_full(int fd, const char *str)
 {
 	return write_in_full(fd, str, strlen(str));
diff --git a/write-or-die.c b/write-or-die.c
index 01a9a51fa2..5f522fb728 100644
--- a/write-or-die.c
+++ b/write-or-die.c
@@ -96,6 +96,14 @@ void write_or_die(int fd, const void *buf, size_t count)
 	}
 }
 
+void writev_or_die(int fd, struct iovec *iov, int iovlen)
+{
+	if (writev_in_full(fd, iov, iovlen) < 0) {
+		check_pipe(errno);
+		die_errno("writev error");
+	}
+}
+
 void fwrite_or_die(FILE *f, const void *buf, size_t count)
 {
 	if (fwrite(buf, 1, count, f) != count)
diff --git a/write-or-die.h b/write-or-die.h
index 65a5c42a47..ae3d7d88b8 100644
--- a/write-or-die.h
+++ b/write-or-die.h
@@ -7,6 +7,7 @@ void fprintf_or_die(FILE *, const char *fmt, ...);
 void fwrite_or_die(FILE *f, const void *buf, size_t count);
 void fflush_or_die(FILE *f);
 void write_or_die(int fd, const void *buf, size_t count);
+void writev_or_die(int fd, struct iovec *iov, int iovlen);
 
 /*
  * These values are used to help identify parts of a repository to fsync.

-- 
2.53.0.697.g625c4fb2da.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 07/10] sideband: use writev(3p) to send pktlines
  2026-03-03 15:00 ` [PATCH v2 00/10] " Patrick Steinhardt
                     ` (5 preceding siblings ...)
  2026-03-03 15:00   ` [PATCH v2 06/10] wrapper: introduce writev(3p) wrappers Patrick Steinhardt
@ 2026-03-03 15:00   ` Patrick Steinhardt
  2026-03-04 22:05     ` Junio C Hamano
  2026-03-03 15:00   ` [PATCH v2 08/10] csum-file: introduce `hashfd_ext()` Patrick Steinhardt
                     ` (2 subsequent siblings)
  9 siblings, 1 reply; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-03 15:00 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King

Every pktline that we send out via `send_sideband()` currently requires
two syscalls: one to write the pktline's length, and one to send its
data. This typically isn't all that much of a problem, but under extreme
load the syscalls may cause contention in the kernel.

Refactor the code to instead use the newly introduced writev(3p) infrastructure
so that we can send out the data with a single syscall. This reduces the
number of syscalls from around 133,000 calls to write(3p) to around
67,000 calls to writev(3p).

Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 sideband.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/sideband.c b/sideband.c
index ea7c25211e..1ed6614eaf 100644
--- a/sideband.c
+++ b/sideband.c
@@ -264,6 +264,7 @@ void send_sideband(int fd, int band, const char *data, ssize_t sz, int packet_ma
 	const char *p = data;
 
 	while (sz) {
+		struct iovec iov[2];
 		unsigned n;
 		char hdr[5];
 
@@ -273,12 +274,19 @@ void send_sideband(int fd, int band, const char *data, ssize_t sz, int packet_ma
 		if (0 <= band) {
 			xsnprintf(hdr, sizeof(hdr), "%04x", n + 5);
 			hdr[4] = band;
-			write_or_die(fd, hdr, 5);
+			iov[0].iov_base = hdr;
+			iov[0].iov_len = 5;
 		} else {
 			xsnprintf(hdr, sizeof(hdr), "%04x", n + 4);
-			write_or_die(fd, hdr, 4);
+			iov[0].iov_base = hdr;
+			iov[0].iov_len = 4;
 		}
-		write_or_die(fd, p, n);
+
+		iov[1].iov_base = (void *) p;
+		iov[1].iov_len = n;
+
+		writev_or_die(fd, iov, ARRAY_SIZE(iov));
+
 		p += n;
 		sz -= n;
 	}

-- 
2.53.0.697.g625c4fb2da.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 08/10] csum-file: introduce `hashfd_ext()`
  2026-03-03 15:00 ` [PATCH v2 00/10] " Patrick Steinhardt
                     ` (6 preceding siblings ...)
  2026-03-03 15:00   ` [PATCH v2 07/10] sideband: use writev(3p) to send pktlines Patrick Steinhardt
@ 2026-03-03 15:00   ` Patrick Steinhardt
  2026-03-04 22:11     ` Junio C Hamano
  2026-03-03 15:00   ` [PATCH v2 09/10] csum-file: drop `hashfd_throughput()` Patrick Steinhardt
  2026-03-03 15:00   ` [PATCH v2 10/10] builtin/pack-objects: reduce lock contention when writing packfile data Patrick Steinhardt
  9 siblings, 1 reply; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-03 15:00 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King

Introduce a new `hashfd_ext()` function that takes an options structure.
This function will replace `hashfd_throughput()` in the next commit.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 csum-file.c | 22 +++++++++++++---------
 csum-file.h | 14 ++++++++++++++
 2 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/csum-file.c b/csum-file.c
index 6e21e3cac8..a50416247e 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -161,17 +161,16 @@ struct hashfile *hashfd_check(const struct git_hash_algo *algop,
 	return f;
 }
 
-static struct hashfile *hashfd_internal(const struct git_hash_algo *algop,
-					int fd, const char *name,
-					struct progress *tp,
-					size_t buffer_len)
+struct hashfile *hashfd_ext(const struct git_hash_algo *algop,
+			    int fd, const char *name,
+			    const struct hashfd_options *opts)
 {
 	struct hashfile *f = xmalloc(sizeof(*f));
 	f->fd = fd;
 	f->check_fd = -1;
 	f->offset = 0;
 	f->total = 0;
-	f->tp = tp;
+	f->tp = opts->progress;
 	f->name = name;
 	f->do_crc = 0;
 	f->skip_hash = 0;
@@ -179,8 +178,8 @@ static struct hashfile *hashfd_internal(const struct git_hash_algo *algop,
 	f->algop = unsafe_hash_algo(algop);
 	f->algop->init_fn(&f->ctx);
 
-	f->buffer_len = buffer_len;
-	f->buffer = xmalloc(buffer_len);
+	f->buffer_len = opts->buffer_len ? opts->buffer_len : 128 * 1024;
+	f->buffer = xmalloc(f->buffer_len);
 	f->check_buffer = NULL;
 
 	return f;
@@ -194,7 +193,8 @@ struct hashfile *hashfd(const struct git_hash_algo *algop,
 	 * measure the rate of data passing through this hashfile,
 	 * use a larger buffer size to reduce fsync() calls.
 	 */
-	return hashfd_internal(algop, fd, name, NULL, 128 * 1024);
+	struct hashfd_options opts = { 0 };
+	return hashfd_ext(algop, fd, name, &opts);
 }
 
 struct hashfile *hashfd_throughput(const struct git_hash_algo *algop,
@@ -206,7 +206,11 @@ struct hashfile *hashfd_throughput(const struct git_hash_algo *algop,
 	 * size so the progress indicators arrive at a more
 	 * frequent rate.
 	 */
-	return hashfd_internal(algop, fd, name, tp, 8 * 1024);
+	struct hashfd_options opts = {
+		.progress = tp,
+		.buffer_len = 8 * 1024,
+	};
+	return hashfd_ext(algop, fd, name, &opts);
 }
 
 void hashfile_checkpoint_init(struct hashfile *f,
diff --git a/csum-file.h b/csum-file.h
index 07ae11024a..a03b60120d 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -45,6 +45,20 @@ int hashfile_truncate(struct hashfile *, struct hashfile_checkpoint *);
 #define CSUM_FSYNC		2
 #define CSUM_HASH_IN_STREAM	4
 
+struct hashfd_options {
+	/*
+	 * Throughput progress that counts the number of bytes that have been
+	 * hashed.
+	 */
+	struct progress *progress;
+
+	/* The length of the buffer that shall be used to read data. */
+	size_t buffer_len;
+};
+
+struct hashfile *hashfd_ext(const struct git_hash_algo *algop,
+			    int fd, const char *name,
+			    const struct hashfd_options *opts);
 struct hashfile *hashfd(const struct git_hash_algo *algop,
 			int fd, const char *name);
 struct hashfile *hashfd_check(const struct git_hash_algo *algop,

-- 
2.53.0.697.g625c4fb2da.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 09/10] csum-file: drop `hashfd_throughput()`
  2026-03-03 15:00 ` [PATCH v2 00/10] " Patrick Steinhardt
                     ` (7 preceding siblings ...)
  2026-03-03 15:00   ` [PATCH v2 08/10] csum-file: introduce `hashfd_ext()` Patrick Steinhardt
@ 2026-03-03 15:00   ` Patrick Steinhardt
  2026-03-03 15:00   ` [PATCH v2 10/10] builtin/pack-objects: reduce lock contention when writing packfile data Patrick Steinhardt
  9 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-03 15:00 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King

The `hashfd_throughput()` function is used by a single callsite in
git-pack-objects(1). In contrast to `hashfd()`, this function uses a
progress meter to measure throughput and a smaller buffer length so that
the progress meter can provide more granular metrics.

We're going to change that caller in the next commit to be a bit more
specific to packing objects. As such, `hashfd_throughput()` will be a
somewhat unfitting mechanism for any potential new callers.

Drop the function and replace it with a call to `hashfd_ext()`.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/pack-objects.c | 19 +++++++++++++++----
 csum-file.c            | 16 ----------------
 csum-file.h            |  2 --
 3 files changed, 15 insertions(+), 22 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index cfb03d4c09..db04e6dd0e 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1330,11 +1330,22 @@ static void write_pack_file(void)
 		unsigned char hash[GIT_MAX_RAWSZ];
 		char *pack_tmp_name = NULL;
 
-		if (pack_to_stdout)
-			f = hashfd_throughput(the_repository->hash_algo, 1,
-					      "<stdout>", progress_state);
-		else
+		if (pack_to_stdout) {
+			/*
+			 * Since we are expecting to report progress of the
+			 * write into this hashfile, use a smaller buffer
+			 * size so the progress indicators arrive at a more
+			 * frequent rate.
+			 */
+			struct hashfd_options opts = {
+				.progress = progress_state,
+				.buffer_len = 8 * 1024,
+			};
+			f = hashfd_ext(the_repository->hash_algo, 1,
+				       "<stdout>", &opts);
+		} else {
 			f = create_tmp_packfile(the_repository, &pack_tmp_name);
+		}
 
 		offset = write_pack_header(f, nr_remaining);
 
diff --git a/csum-file.c b/csum-file.c
index a50416247e..5dfaca5543 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -197,22 +197,6 @@ struct hashfile *hashfd(const struct git_hash_algo *algop,
 	return hashfd_ext(algop, fd, name, &opts);
 }
 
-struct hashfile *hashfd_throughput(const struct git_hash_algo *algop,
-				   int fd, const char *name, struct progress *tp)
-{
-	/*
-	 * Since we are expecting to report progress of the
-	 * write into this hashfile, use a smaller buffer
-	 * size so the progress indicators arrive at a more
-	 * frequent rate.
-	 */
-	struct hashfd_options opts = {
-		.progress = tp,
-		.buffer_len = 8 * 1024,
-	};
-	return hashfd_ext(algop, fd, name, &opts);
-}
-
 void hashfile_checkpoint_init(struct hashfile *f,
 			      struct hashfile_checkpoint *checkpoint)
 {
diff --git a/csum-file.h b/csum-file.h
index a03b60120d..01472555c8 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -63,8 +63,6 @@ struct hashfile *hashfd(const struct git_hash_algo *algop,
 			int fd, const char *name);
 struct hashfile *hashfd_check(const struct git_hash_algo *algop,
 			      const char *name);
-struct hashfile *hashfd_throughput(const struct git_hash_algo *algop,
-				   int fd, const char *name, struct progress *tp);
 
 /*
  * Free the hashfile without flushing its contents to disk. This only

-- 
2.53.0.697.g625c4fb2da.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 10/10] builtin/pack-objects: reduce lock contention when writing packfile data
  2026-03-03 15:00 ` [PATCH v2 00/10] " Patrick Steinhardt
                     ` (8 preceding siblings ...)
  2026-03-03 15:00   ` [PATCH v2 09/10] csum-file: drop `hashfd_throughput()` Patrick Steinhardt
@ 2026-03-03 15:00   ` Patrick Steinhardt
  9 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-03 15:00 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King

When running `git pack-objects --stdout` we feed the data through
`hashfd_ext()` with a progress meter and a smaller-than-usual buffer
length of 8kB so that we can track throughput more granularly. But as
packfiles tend to be on the larger side, this small buffer size may
cause a ton of write(3p) syscalls.

Originally, the buffer we used in `hashfd()` was 8kB for all use cases.
This was changed though in 2ca245f8be (csum-file.h: increase hashfile
buffer size, 2021-05-18) because we noticed that the number of writes
can have an impact on performance. So the buffer size was increased to
128kB, which improved performance a bit for some use cases.

But the commit didn't touch the buffer size for `hashfd_throughput()`.
The reasoning here was that callers expect the progress indicator to
update frequently, and a larger buffer size would of course reduce the
update frequency especially on slow networks.

While that is of course true, there was (and still is, even though it's
now a call to `hashfd_ext()`) only a single caller of this function in
git-pack-objects(1). This command is responsible for writing packfiles,
and those packfiles are often on the bigger side. So arguably:

  - The user won't care about increments of 8kB when packfiles tend to
    be megabytes or even gigabytes in size.

  - Reducing the number of syscalls would be even more valuable here
    than it would be for multi-pack indices, which was the benchmark
    done in the mentioned commit, as MIDXs are typically significantly
    smaller than packfiles.

  - Nowadays, many internet connections should be able to transfer data
    at a rate significantly higher than 8kB per second.

Update the buffer to instead have a size of `LARGE_PACKET_DATA_MAX - 1`,
which translates to ~64kB. This limit was chosen because `git
pack-objects --stdout` is most often used when sending packfiles via
git-upload-pack(1), where packfile data is chunked into pktlines when
using the sideband. Furthermore, most internet connections should have a
bandwidth significantly higher than 64kB/s, so we'd still be able to
observe progress updates at a rate of at least once per second.
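For reference, the sizing arithmetic can be sketched as follows. The constants mirror pkt-line.h as of this writing and should be treated as an assumption, as is the hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>

/* Values mirroring pkt-line.h (assumed, not taken from this series). */
#define LARGE_PACKET_MAX 65520
/* The 4-byte hex length header is part of the total pktline size. */
#define LARGE_PACKET_DATA_MAX (LARGE_PACKET_MAX - 4)

/*
 * Buffer length for `git pack-objects --stdout`, leaving room for the
 * one-byte sideband designator in each pktline.
 */
static size_t stdout_buffer_len(void)
{
	return LARGE_PACKET_DATA_MAX - 1;
}
```

With these values the buffer comes out at 65515 bytes, i.e. one full sideband pktline's worth of payload.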

This change significantly reduces the number of write(3p) syscalls from
355,000 to 44,000 when packing the Linux repository. While this results
in a small performance improvement on an otherwise-unused system, this
improvement is mostly negligible. More importantly though, it will
reduce lock contention in the kernel on an extremely busy system where
we have many processes writing data at once.

Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/pack-objects.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index db04e6dd0e..b8d684522d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -41,6 +41,7 @@
 #include "promisor-remote.h"
 #include "pack-mtimes.h"
 #include "parse-options.h"
+#include "pkt-line.h"
 #include "blob.h"
 #include "tree.h"
 #include "path-walk.h"
@@ -1332,14 +1333,17 @@ static void write_pack_file(void)
 
 		if (pack_to_stdout) {
 			/*
-			 * Since we are expecting to report progress of the
-			 * write into this hashfile, use a smaller buffer
-			 * size so the progress indicators arrive at a more
-			 * frequent rate.
+			 * This command is most often invoked via
+			 * git-upload-pack(1), which will typically chunk data
+			 * into pktlines. As such, we use the maximum data
+			 * length of them as buffer length.
+			 *
+			 * Note that we need to subtract one though to
+			 * accommodate the sideband byte.
 			 */
 			struct hashfd_options opts = {
 				.progress = progress_state,
-				.buffer_len = 8 * 1024,
+				.buffer_len = LARGE_PACKET_DATA_MAX - 1,
 			};
 			f = hashfd_ext(the_repository->hash_algo, 1,
 				       "<stdout>", &opts);

-- 
2.53.0.697.g625c4fb2da.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 05/10] compat/posix: introduce writev(3p) wrapper
  2026-03-03 15:00   ` [PATCH v2 05/10] compat/posix: introduce writev(3p) wrapper Patrick Steinhardt
@ 2026-03-04 22:01     ` Junio C Hamano
  2026-03-05  0:37       ` Jeff King
  0 siblings, 1 reply; 64+ messages in thread
From: Junio C Hamano @ 2026-03-04 22:01 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Matt Smiley, brian m. carlson, Jeff King

Patrick Steinhardt <ps@pks.im> writes:

> In a subsequent commit we're going to add the first caller to
> writev(3p). Introduce a compatibility wrapper for this syscall that we
> can use on systems that don't have this syscall.
>
> The syscall exists on modern Unixes like Linux and macOS, and seemingly
> even for NonStop according to [1]. It doesn't seem to exist on Windows
> though.



>
> [1]: http://nonstoptools.com/manuals/OSS-SystemCalls.pdf
> [2]: https://www.gnu.org/software/gnulib/manual/html_node/writev.html
>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  Makefile         |  4 ++++
>  compat/posix.h   | 14 ++++++++++++++
>  compat/writev.c  | 29 +++++++++++++++++++++++++++++
>  config.mak.uname |  2 ++
>  meson.build      |  1 +
>  5 files changed, 50 insertions(+)
>
> diff --git a/Makefile b/Makefile
> index 7f37ad8f58..cb95ff2daf 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -2021,6 +2021,10 @@ ifdef NO_PREAD
>  	COMPAT_CFLAGS += -DNO_PREAD
>  	COMPAT_OBJS += compat/pread.o
>  endif
> +ifdef NO_WRITEV
> +	COMPAT_CFLAGS += -DNO_WRITEV
> +	COMPAT_OBJS += compat/writev.o
> +endif
>  ifdef NO_FAST_WORKING_DIRECTORY
>  	BASIC_CFLAGS += -DNO_FAST_WORKING_DIRECTORY
>  endif
> diff --git a/compat/posix.h b/compat/posix.h
> index 245386fa4a..3c611d2736 100644
> --- a/compat/posix.h
> +++ b/compat/posix.h
> @@ -137,6 +137,9 @@
>  #include <sys/socket.h>
>  #include <sys/ioctl.h>
>  #include <sys/statvfs.h>
> +#ifndef NO_WRITEV
> +#include <sys/uio.h>
> +#endif
>  #include <termios.h>
>  #ifndef NO_SYS_SELECT_H
>  #include <sys/select.h>
> @@ -323,6 +326,17 @@ int git_lstat(const char *, struct stat *);
>  ssize_t git_pread(int fd, void *buf, size_t count, off_t offset);
>  #endif
>  
> +#ifdef NO_WRITEV
> +#define writev git_writev
> +#define iovec git_iovec
> +struct git_iovec {
> +	void *iov_base;
> +	size_t iov_len;
> +};
> +
> +ssize_t git_writev(int fd, const struct iovec *iov, int iovcnt);
> +#endif
> +
>  #ifdef NO_SETENV
>  #define setenv gitsetenv
>  int gitsetenv(const char *, const char *, int);
> diff --git a/compat/writev.c b/compat/writev.c
> new file mode 100644
> index 0000000000..b77e534d5d
> --- /dev/null
> +++ b/compat/writev.c
> @@ -0,0 +1,29 @@
> +#include "../git-compat-util.h"
> +#include "../wrapper.h"
> +
> +ssize_t git_writev(int fd, const struct iovec *iov, int iovcnt)
> +{
> +	size_t total_written = 0;
> +
> +	for (int i = 0; i < iovcnt; i++) {
> +		const char *bytes = iov[i].iov_base;
> +		size_t iovec_written = 0;
> +
> +		while (iovec_written < iov[i].iov_len) {
> +			ssize_t bytes_written = xwrite(fd, bytes + iovec_written,
> +						       iov[i].iov_len - iovec_written);
> +			if (bytes_written < 0) {
> +				if (total_written)
> +					goto out;
> +				return bytes_written;
> +			}
> +			if (!bytes_written)
> +				goto out;
> +			iovec_written += bytes_written;
> +			total_written += bytes_written;
> +		}
> +	}
> +
> +out:
> +	return cast_size_t_to_ssize_t(total_written);
> +}

Because we do not check the accumulation of bytes_written in the two
accumulator variables inside the inner loop, it is entirely possible for
total_written to wrap around size_t and end up below the largest value
that can be stored in the ssize_t type.

IOW, the cast_size_t_to_ssize_t() introduced in the previous step is
pointless, isn't it?

According to [*1*], the real

    ssize_t writev(int fd, const struct iovec *iov, int iovcnt)

is supposed to report an error with errno set to EINVAL when the sum of
the iov_len members of the iov[] array exceeds half of the maximum
size_t.

       EINVAL The sum of the iov_len values overflows an ssize_t value.

So instead of dying with cast_size_t_to_ssize_t(), we probably would
want the check done in a more stupid and straight-forward way?
Adding up iov[i].iov_len while the addition would not wraparound in
each and every step, and return error with EINVAL before attempting
even a single call to xwrite(), and then have the above double loop
that does not care about integer wraparound at all, and return
total_written with simple cast to (ssize_t)?


[Reference]

 *1* https://pubs.opengroup.org/onlinepubs/9799919799/functions/writev.html


> diff --git a/config.mak.uname b/config.mak.uname
> index 5feb582558..ccb3f71881 100644
> --- a/config.mak.uname
> +++ b/config.mak.uname
> @@ -459,6 +459,7 @@ ifeq ($(uname_S),Windows)
>  	SANE_TOOL_PATH ?= $(msvc_bin_dir_msys)
>  	HAVE_ALLOCA_H = YesPlease
>  	NO_PREAD = YesPlease
> +	NO_WRITEV = YesPlease
>  	NEEDS_CRYPTO_WITH_SSL = YesPlease
>  	NO_LIBGEN_H = YesPlease
>  	NO_POLL = YesPlease
> @@ -674,6 +675,7 @@ ifeq ($(uname_S),MINGW)
>  	pathsep = ;
>  	HAVE_ALLOCA_H = YesPlease
>  	NO_PREAD = YesPlease
> +	NO_WRITEV = YesPlease
>  	NEEDS_CRYPTO_WITH_SSL = YesPlease
>  	NO_LIBGEN_H = YesPlease
>  	NO_POLL = YesPlease
> diff --git a/meson.build b/meson.build
> index 762e2d0fc0..63514b6b84 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -1409,6 +1409,7 @@ checkfuncs = {
>    'initgroups' : [],
>    'strtoumax' : ['strtoumax.c', 'strtoimax.c'],
>    'pread' : ['pread.c'],
> +  'writev' : ['writev.c'],
>  }
>  
>  if host_machine.system() == 'windows'

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 07/10] sideband: use writev(3p) to send pktlines
  2026-03-03 15:00   ` [PATCH v2 07/10] sideband: use writev(3p) to send pktlines Patrick Steinhardt
@ 2026-03-04 22:05     ` Junio C Hamano
  0 siblings, 0 replies; 64+ messages in thread
From: Junio C Hamano @ 2026-03-04 22:05 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Matt Smiley, brian m. carlson, Jeff King

Patrick Steinhardt <ps@pks.im> writes:

> Every pktline that we send out via `send_sideband()` currently requires
> two syscalls: one to write the pktline's length, and one to send its
> data. This typically isn't all that much of a problem, but under extreme
> load the syscalls may cause contention in the kernel.
>
> Refactor the code to instead use the newly introduced writev(3p) infra
> so that we can send out the data with a single syscall. This reduces the
> number of syscalls from around 133,000 calls to write(3p) to around
> 67,000 calls to writev(3p).
>
> Suggested-by: Jeff King <peff@peff.net>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  sideband.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)

Quite straight-forward.  Looking good.

> diff --git a/sideband.c b/sideband.c
> index ea7c25211e..1ed6614eaf 100644
> --- a/sideband.c
> +++ b/sideband.c
> @@ -264,6 +264,7 @@ void send_sideband(int fd, int band, const char *data, ssize_t sz, int packet_ma
>  	const char *p = data;
>  
>  	while (sz) {
> +		struct iovec iov[2];
>  		unsigned n;
>  		char hdr[5];
>  
> @@ -273,12 +274,19 @@ void send_sideband(int fd, int band, const char *data, ssize_t sz, int packet_ma
>  		if (0 <= band) {
>  			xsnprintf(hdr, sizeof(hdr), "%04x", n + 5);
>  			hdr[4] = band;
> -			write_or_die(fd, hdr, 5);
> +			iov[0].iov_base = hdr;
> +			iov[0].iov_len = 5;
>  		} else {
>  			xsnprintf(hdr, sizeof(hdr), "%04x", n + 4);
> -			write_or_die(fd, hdr, 4);
> +			iov[0].iov_base = hdr;
> +			iov[0].iov_len = 4;
>  		}
> -		write_or_die(fd, p, n);
> +
> +		iov[1].iov_base = (void *) p;
> +		iov[1].iov_len = n;
> +
> +		writev_or_die(fd, iov, ARRAY_SIZE(iov));
> +
>  		p += n;
>  		sz -= n;
>  	}

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 08/10] csum-file: introduce `hashfd_ext()`
  2026-03-03 15:00   ` [PATCH v2 08/10] csum-file: introduce `hashfd_ext()` Patrick Steinhardt
@ 2026-03-04 22:11     ` Junio C Hamano
  2026-03-10 12:09       ` Patrick Steinhardt
  0 siblings, 1 reply; 64+ messages in thread
From: Junio C Hamano @ 2026-03-04 22:11 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Matt Smiley, brian m. carlson, Jeff King

Patrick Steinhardt <ps@pks.im> writes:

> -static struct hashfile *hashfd_internal(const struct git_hash_algo *algop,
> -					int fd, const char *name,
> -					struct progress *tp,
> -					size_t buffer_len)
> +struct hashfile *hashfd_ext(const struct git_hash_algo *algop,
> +			    int fd, const char *name,
> +			    const struct hashfd_options *opts)

What does _ext stand for?

More seriously, this essentially chooses to pick two parameters
hashfd_internal() takes and put them in a struct, which would give
us a clear upgrade path to add more to the structure without having
to change the function signature.  But what is the criteria used to
choose these two among 5 parameters the original function takes?

Specifically, I am wondering if fd and algop should be part of the
structure, as these would be exactly the same for repeated calls to
this function to write to a single stream.

> +struct hashfd_options {
> +	/*
> +	 * Throughput progress that counts the number of bytes that have been
> +	 * hashed.
> +	 */
> +	struct progress *progress;
> +
> +	/* The length of the buffer that shall be used to read data. */
> +	size_t buffer_len;
> +};
> +
> +struct hashfile *hashfd_ext(const struct git_hash_algo *algop,
> +			    int fd, const char *name,
> +			    const struct hashfd_options *opts);
>  struct hashfile *hashfd(const struct git_hash_algo *algop,
>  			int fd, const char *name);
>  struct hashfile *hashfd_check(const struct git_hash_algo *algop,

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 05/10] compat/posix: introduce writev(3p) wrapper
  2026-03-04 22:01     ` Junio C Hamano
@ 2026-03-05  0:37       ` Jeff King
  2026-03-05  2:16         ` brian m. carlson
  0 siblings, 1 reply; 64+ messages in thread
From: Jeff King @ 2026-03-05  0:37 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Patrick Steinhardt, git, Matt Smiley, brian m. carlson

On Wed, Mar 04, 2026 at 02:01:18PM -0800, Junio C Hamano wrote:

> Because we do not check the accumulation of bytes_written in the two
> accumulator variables inside the inner loop, it is very possible for
> total_written to wraparound size_t and end up below the largest
> value possible to be stored in ssize_t type.

Coverity complained about it, too.

> IOW, the cast_size_t_to_ssize_t() introduced in the previous step is
> pointless, isn't it?
> 
> According to [*1*], the real
> 
>     ssize_t writev(int fd, const struct iovec *iov, int iovcnt)
> 
> is supposed to report error with errno set to EINVAL when the sum of
> iov_len member of the iov[] array elements exceed half of the
> maximum size_t.
> 
>        EINVAL The sum of the iov_len values overflows an ssize_t value.
> 
> So instead of dying with cast_size_t_to_ssize_t(), we probably would
> want the check done in a more stupid and straight-forward way?
> Adding up iov[i].iov_len while the addition would not wraparound in
> each and every step, and return error with EINVAL before attempting
> even a single call to xwrite(), and then have the above double loop
> that does not care about integer wraparound at all, and return
> total_written with simple cast to (ssize_t)?

I like that writev() can work as a drop-in replacement for write() at
the lowest level. But given that our main use is likely to be pkt-lines,
I do kind of wonder if we should just try to be more clever in forming
our buffers. That makes all of the portability and compat questions go
away (and gives the benefit to platforms that don't even have writev).

E.g., the somewhat ugly patch below is enough to do single write()s for
the pack data in upload-pack. I think if the pkt-writing API were less
ad-hoc (and had an actual buffer in "struct packet_writer"), this could
be made a bit more universal.

diff --git a/upload-pack.c b/upload-pack.c
index 1b1c81ea63..d01b547b10 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -179,11 +179,32 @@ static void reset_timeout(unsigned int timeout)
 	alarm(timeout);
 }
 
-static void send_client_data(int fd, const char *data, ssize_t sz,
-			     int use_sideband)
+#define send_client_data(fd, data, sz, use_sideband) \
+	send_client_data_1((fd), (data), (sz), (use_sideband), NULL)
+static void send_client_data_1(int fd, const char *data, ssize_t sz,
+			       int use_sideband, char *hdr_buf)
 {
 	if (use_sideband) {
-		send_sideband(1, fd, data, sz, use_sideband);
+		/*
+		 * If we don't have a contiguous header buf, or if the data is
+		 * too big for a single packet (which I don't think happens
+		 * with any of our current callers), bail to the old-style
+		 * send_sideband.
+		 */
+		if (!hdr_buf || sz > LARGE_PACKET_DATA_MAX - 1) {
+			send_sideband(1, fd, data, sz, use_sideband);
+			return;
+		}
+
+		if (hdr_buf + 5 != data)
+			BUG("hdr_buf should be contiguous with pkt data");
+
+		if (use_sideband < 0)
+			BUG("negative sideband!?");
+
+		set_packet_header(hdr_buf, sz + 5);
+		hdr_buf[4] = 1; /* sideband 1 */
+		write_or_die(fd, hdr_buf, sz + 5);
 		return;
 	}
 	if (fd == 3)
@@ -211,7 +232,14 @@ struct output_state {
 	 * sideband-64k the band designator takes up 1 byte of space. Because
 	 * relay_pack_data keeps the last byte to itself, we make the buffer 1
 	 * byte bigger than the intended maximum write size.
+	 *
+	 * However, we also reserve 5 bytes before the buffer so we can format
+	 * a packet header and send it with a single write(). We may be pushing
+	 * our luck on assuming there will be no padding here (though we do
+	 * check it with a BUG(), but there are other ways to write it
+	 * (e.g., a single large buffer with "buffer" as a pointer into it).
 	 */
+	char hdr[5];
 	char buffer[(LARGE_PACKET_DATA_MAX - 1) + 1];
 	int used;
 	unsigned packfile_uris_started : 1;
@@ -284,7 +312,7 @@ static int relay_pack_data(int pack_objects_out, struct output_state *os,
 		return readsz;
 
 	if (os->used > 1) {
-		send_client_data(1, os->buffer, os->used - 1, use_sideband);
+		send_client_data_1(1, os->buffer, os->used - 1, use_sideband, os->hdr);
 		os->buffer[0] = os->buffer[os->used - 1];
 		os->used = 1;
 	} else {

-Peff

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 02/10] upload-pack: adapt keepalives based on buffering
  2026-03-03 15:00   ` [PATCH v2 02/10] upload-pack: adapt keepalives based on buffering Patrick Steinhardt
@ 2026-03-05  0:56     ` Jeff King
  2026-03-10 12:08       ` Patrick Steinhardt
  0 siblings, 1 reply; 64+ messages in thread
From: Jeff King @ 2026-03-05  0:56 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Matt Smiley, brian m. carlson

On Tue, Mar 03, 2026 at 04:00:17PM +0100, Patrick Steinhardt wrote:

> The most important edge case here happens in `relay_pack_data()`. When
> we haven't seen the initial "PACK" signature from git-pack-objects(1)
> yet we buffer incoming data. So in the worst case, if each of the bytes
> of that signature arrive shortly before the configured keepalive
> timeout, then we may not send out any data for a time period that is
> (almost) four times as long as the configured timeout.

Thanks for laying out this case. I think this is all-but-impossible in
practice, as anybody writing "PACK" is going to do so all at once. Even
4 separate write() calls would be fine, as long as it does not pause in
between!

I think there's another one, too. If we are getting packfile_uris, and
pack-objects writes half a line, we will pause waiting for the complete
line to show up. This also seems quite unlikely in practice.

> This edge case is rather unlikely to matter in practice. But in a
> subsequent commit we're going to adapt our buffering mechanism to become
> more aggressive, which makes it more likely that we don't send any data
> for an extended amount of time.
> 
> Adapt the logic so that instead of using a fixed timeout on every call
> to poll(3p), we instead figure out how much time has passed since the
> last-sent data.

OK. That should not be too bad to do, though...

> @@ -365,10 +373,14 @@ static void create_pack_file(struct upload_pack_data *pack_data,
>  	 */
>  
>  	while (1) {
> +		uint64_t now_ms = getnanotime() / 1000000;

...now we are talking about wall-clock time since the epoch. What
happens if time goes backwards due to a clock reset?

Then now_ms may be less than last_sent_ms, and then here:

> +		} else {
> +			/*
> +			 * The polling timeout needs to be adjusted based on
> +			 * the time we last sent data. The longer
> +			 * it's been in the past, the shorter the timeout
> +			 * becomes until we eventually don't block at all.
> +			 */
> +			polltimeout_ms = 1000 * pack_data->keepalive - (now_ms - last_sent_ms);
> +			if (polltimeout_ms < 0)
> +				polltimeout_ms = 0;
> +		}

...we end up with a value higher than the keepalive, and we wait too
long. That's probably an OK outcome for such an exceptional condition.
The worst case if you set your clock back is that we fail to send
keepalives until the next actual data chunk arrives.

>  		if (0 <= pe && (pfd[pe].revents & (POLLIN|POLLHUP))) {
>  			/* Status ready; we ship that in the side-band
>  			 * or dump to the standard error.
>  			 */
>  			sz = xread(pack_objects.err, progress,
>  				  sizeof(progress));
> -			if (0 < sz)
> +			if (0 < sz) {
>  				send_client_data(2, progress, sz,
>  						 pack_data->use_sideband);
> -			else if (sz == 0) {
> +				last_sent_ms = now_ms;
> +			} else if (sz == 0) {
>  				close(pack_objects.err);
>  				pack_objects.err = -1;
>  			}

OK, so here we know we sent data, because we pass along stderr as
quickly as possible. If we got EOF on stderr, that doesn't count as
sending data, because we don't pass along that EOF to the client. Makes
sense.

>  		if (0 <= pu && (pfd[pu].revents & (POLLIN|POLLHUP))) {
> +			bool did_send_data;
>  			int result = relay_pack_data(pack_objects.out,
>  						     output_state,
>  						     pack_data->use_sideband,
> -						     !!uri_protocols);
> +						     !!uri_protocols,
> +						     &did_send_data);

And then for stdout, we have to ask relay_pack_data() if it sent
anything. That makes sense.

-Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 03/10] upload-pack: reduce lock contention when writing packfile data
  2026-03-03 15:00   ` [PATCH v2 03/10] upload-pack: reduce lock contention when writing packfile data Patrick Steinhardt
@ 2026-03-05  1:16     ` Jeff King
  2026-03-10 12:14       ` Patrick Steinhardt
  0 siblings, 1 reply; 64+ messages in thread
From: Jeff King @ 2026-03-05  1:16 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Matt Smiley, brian m. carlson

On Tue, Mar 03, 2026 at 04:00:18PM +0100, Patrick Steinhardt wrote:

> Extend our use of the buffering infrastructure so that we soak up bytes
> until the buffer is filled up at least 2/3rds of its capacity. The
> change is relatively simple to implement as we already know to flush the
> buffer in `create_pack_file()` after git-pack-objects(1) has finished.

This 2/3rds feels kind of arbitrary. Isn't our best bet to try to fill
pkt-lines? Later you say:

> Now we could of course go even further and make sure that we always fill
> up the whole buffer. But this might cause an increase in read(3p)
> syscalls, and some tests show that this only reduces the number of
> write(3p) syscalls from 130,000 to 100,000. So overall this doesn't seem
> worth it.

But I am not clear how it increases the number of read() calls. I guess
you are concerned that we'll get 50k, and then do a read for the
remaining 14k, and then read 50k, and then 14k, and so on. But I'm
unconvinced that 2/3 is really any better here, as it depends on the
buffering patterns of the upstream writer. They could be writing 1 byte
less than 2/3, and we'd wait to buffer, then read half their next
packet, write it, then read the second half of their next packet, wait to
buffer, and so on.

Even just doing:

  git clone --upload-pack='strace -e write git-upload-pack' --bare --no-local . foo.git

with this patch (and not the later one to increase the buffer size of
pack-objects), I see an interesting flip-flop between packets of size
65515 and 61461. But we never send a single full-size one, even though
pack-objects should be outpacing us (because we're slowed by running
under strace). That's probably an OK loss of efficiency in practice, but
it's very dependent on pack-objects buffering.

I'm still a little bit negative on the whole concept of buffering in
upload-pack, just because the interactions between buffering layers can
be so subtle. But I guess I'm not really making any argument that I
didn't make in v1, and you kept this in v2, so I suppose you are not
swayed by it. ;)


If we are going to buffer in upload-pack, there is an obvious
optimization that I didn't see in your series. When we send a keepalive,
we should just send whatever we have in os->buffer (even if it is
nothing).  If we are wasting 5 bytes of pkt-line header and a write()
call to send the keepalive, we may as well send what data we do have.

I don't know how much it would help in practice, though. Most keepalives
will come before the pack data starts, as once pack-objects starts
producing data, it tends to do so pretty consistently. And of course we
can't send os->buffer before we see the PACK header, because the whole
point is to buffer the early bit waiting for packfile uris.

So it might not be worth adding.

-Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 05/10] compat/posix: introduce writev(3p) wrapper
  2026-03-05  0:37       ` Jeff King
@ 2026-03-05  2:16         ` brian m. carlson
  2026-03-05  6:39           ` Johannes Sixt
  0 siblings, 1 reply; 64+ messages in thread
From: brian m. carlson @ 2026-03-05  2:16 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Patrick Steinhardt, git, Matt Smiley

[-- Attachment #1: Type: text/plain, Size: 1689 bytes --]

On 2026-03-05 at 00:37:45, Jeff King wrote:
> I like that writev() can work as a drop-in replacement for write() at
> the lowest level. But given that our main use is likely to be pkt-lines,
> I do kind of wonder if we should just try to be more clever in forming
> our buffers. That makes all of the portability and compat questions go
> away (and gives the benefit to platforms that don't even have writev).

This does work and it is clever, but I think the writev is clearer and
more explicit.  In addition, this is literally the kind of use case that
it's designed for and the kernel will have a highly optimized
implementation handling it.  I could also see myself making use of
writev in my future work as well, although I don't have any concrete
code depending on it at the moment (but I could probably add it in the
interop code).

There is a Windows equivalent as well (according to the Rust
documentation[0]) which doesn't have to be added at this point, but
could in the future as a quality-of-implementation issue.  And almost
all Unix systems will support this, with the possible exception of
NonStop, since it's part of XSI and effectively everyone implements
those C functions.  (Also, writev was in 4.2BSD, released in 1983, so
you'd have to be really behind the times to not support it.)

[0] Rust defines the `std::io::IoSlice` type which is documented as
follows:

    A buffer type used with `Write::write_vectored`.

    It is semantically a wrapper around a `&[u8]`, but is guaranteed to
    be ABI compatible with the `iovec` type on Unix platforms and
    `WSABUF` on Windows.
-- 
brian m. carlson (they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 05/10] compat/posix: introduce writev(3p) wrapper
  2026-03-05  2:16         ` brian m. carlson
@ 2026-03-05  6:39           ` Johannes Sixt
  2026-03-05 22:22             ` brian m. carlson
  0 siblings, 1 reply; 64+ messages in thread
From: Johannes Sixt @ 2026-03-05  6:39 UTC (permalink / raw)
  To: brian m. carlson, Patrick Steinhardt
  Cc: Jeff King, Junio C Hamano, Matt Smiley, git

Am 05.03.26 um 03:16 schrieb brian m. carlson:
> On 2026-03-05 at 00:37:45, Jeff King wrote:
>> I like that writev() can work as a drop-in replacement for write() at
>> the lowest level. But given that our main use is likely to be pkt-lines,
>> I do kind of wonder if we should just try to be more clever in forming
>> our buffers. That makes all of the portability and compat questions go
>> away (and gives the benefit to platforms that don't even have writev).
> 
> This does work and it is clever, but I think the writev is clearer and
> more explicit.  In addition, this is literally the kind of use case that
> it's designed for and the kernel will have a highly optimized
> implementation handling it.  I could also see myself making use of
> writev in my future work as well,...

Please don't. The use of writev may mislead you to depend on guarantees
that a kernel implementation of writev can provide, but a compat/
implementation cannot. (For example, I read something about "the file
pointer shall be unchanged" in the case of errors.) Please use simple
and stupid functions (write). I highly doubt that you can squeeze out a
noticeable performance improvement with writev.

-- Hannes


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 05/10] compat/posix: introduce writev(3p) wrapper
  2026-03-05  6:39           ` Johannes Sixt
@ 2026-03-05 22:22             ` brian m. carlson
  2026-03-10 12:09               ` Patrick Steinhardt
  0 siblings, 1 reply; 64+ messages in thread
From: brian m. carlson @ 2026-03-05 22:22 UTC (permalink / raw)
  To: Johannes Sixt
  Cc: Patrick Steinhardt, Jeff King, Junio C Hamano, Matt Smiley, git

[-- Attachment #1: Type: text/plain, Size: 813 bytes --]

On 2026-03-05 at 06:39:13, Johannes Sixt wrote:
> Please don't. The use of writev may mislead you to depend on guarantees
> that a kernel implementation of writev can provide, but a compat/
> implementation cannot. (For example, I read something about "the file
> pointer shall be unchanged" in the case of errors.) Please use simple
> and stupid functions (write). I highly doubt that you can squeeze out a
> noticeable performance improvement with writev.

I have Rust code that does show a substantial performance improvement
with writev and I use it for a very similar purpose (printing the size
before a packet of data).

I am okay with slightly loosening the guarantees to support our compat
implementation if that's what we need to do.
-- 
brian m. carlson (they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 02/10] upload-pack: adapt keepalives based on buffering
  2026-03-05  0:56     ` Jeff King
@ 2026-03-10 12:08       ` Patrick Steinhardt
  0 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-10 12:08 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Matt Smiley, brian m. carlson

On Wed, Mar 04, 2026 at 07:56:11PM -0500, Jeff King wrote:
> On Tue, Mar 03, 2026 at 04:00:17PM +0100, Patrick Steinhardt wrote:
> 
> > The most important edge case here happens in `relay_pack_data()`. When
> > we haven't seen the initial "PACK" signature from git-pack-objects(1)
> > yet we buffer incoming data. So in the worst case, if each of the bytes
> > of that signature arrive shortly before the configured keepalive
> > timeout, then we may not send out any data for a time period that is
> > (almost) four times as long as the configured timeout.
> 
> Thanks for laying out this case. I think this is all-but-impossible in
> practice, as anybody writing "PACK" is going to do so all at once. Even
> 4 separate write() calls would be fine, as long as it does not pause in
> between!
> 
> I think there's another one, too. If we are getting packfile_uris, and
> pack-objects writes half a line, we will pause waiting for the complete
> line to show up. This also seems quite unlikely in practice.

It could happen in case the data was written by the pack-objects hook.
But I fully agree that this is a rather unlikely scenario.

> > This edge case is rather unlikely to matter in practice. But in a
> > subsequent commit we're going to adapt our buffering mechanism to become
> > more aggressive, which makes it more likely that we don't send any data
> > for an extended amount of time.
> > 
> > Adapt the logic so that instead of using a fixed timeout on every call
> > to poll(3p), we instead figure out how much time has passed since the
> > last-sent data.
> 
> OK. That should not be too bad to do, though...
> 
> > @@ -365,10 +373,14 @@ static void create_pack_file(struct upload_pack_data *pack_data,
> >  	 */
> >  
> >  	while (1) {
> > +		uint64_t now_ms = getnanotime() / 1000000;
> 
> ...now we are talking about wall-clock time since the epoch. What
> happens if time goes backwards due to a clock reset?
> 
> Then now_ms may be less than last_sent_ms, and then here:
> 
> > +		} else {
> > +			/*
> > +			 * The polling timeout needs to be adjusted based on
> > +			 * the time we have sent our last package. The longer
> > +			 * it's been in the past, the shorter the timeout
> > +			 * becomes until we eventually don't block at all.
> > +			 */
> > +			polltimeout_ms = 1000 * pack_data->keepalive - (now_ms - last_sent_ms);
> > +			if (polltimeout_ms < 0)
> > +				polltimeout_ms = 0;
> > +		}
> 
> ...we end up with a value higher than the keepalive, and we wait too
> long. That's probably an OK outcome for such an exceptional condition.
> The worst case if you set your clock back is that we fail to send
> keepalives until the next actual data chunk arrives.

Right. It's rather unlikely to happen, and in addition to that we use a
monotonic clock in `getnanotime()` on systems with `CLOCK_MONOTONIC` and
on Windows. Which probably covers almost all platforms out there.

Patrick

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 08/10] csum-file: introduce `hashfd_ext()`
  2026-03-04 22:11     ` Junio C Hamano
@ 2026-03-10 12:09       ` Patrick Steinhardt
  0 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-10 12:09 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Matt Smiley, brian m. carlson, Jeff King

On Wed, Mar 04, 2026 at 02:11:31PM -0800, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > -static struct hashfile *hashfd_internal(const struct git_hash_algo *algop,
> > -					int fd, const char *name,
> > -					struct progress *tp,
> > -					size_t buffer_len)
> > +struct hashfile *hashfd_ext(const struct git_hash_algo *algop,
> > +			    int fd, const char *name,
> > +			    const struct hashfd_options *opts)
> 
> What does _ext stand for?

"extended". It's a pattern that we use in lots of other places, too.

> More seriously, this essentially chooses to pick two parameters
> hashfd_internal() takes and put them in a struct, which would give
> us a clear upgrade path to add more to the structure without having
> to change the function signature.  But what is the criteria used to
> choose these two among 5 parameters the original function takes?

The criterion is mostly whether the parameter is optional or not. The
algorithm, file descriptor and name are all mandatory, whereas all the
other parts in the options structure are optional.

Patrick

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 05/10] compat/posix: introduce writev(3p) wrapper
  2026-03-05 22:22             ` brian m. carlson
@ 2026-03-10 12:09               ` Patrick Steinhardt
  0 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-10 12:09 UTC (permalink / raw)
  To: brian m. carlson, Johannes Sixt, Jeff King, Junio C Hamano,
	Matt Smiley, git

On Thu, Mar 05, 2026 at 10:22:45PM +0000, brian m. carlson wrote:
> On 2026-03-05 at 06:39:13, Johannes Sixt wrote:
> > Please don't. The use of writev may mislead you to depend on guarantees
> > that a kernel implementation of writev can provide, but a compat/
> > implementation cannot. (For example, I read something about "the file
> > pointer shall be unchanged" in the case of errors.) Please use simple
> > and stupid functions (write). I highly doubt that you can squeeze out a
> > noticeable performance improvement with writev.
> 
> I have Rust code that does show a substantial performance improvement
> with writev and I use it for a very similar purpose (printing the size
> before a packet of data).
> 
> I am okay with slightly loosening the guarantees to support our compat
> implementation if that's what we need to do.

Yeah, I think that having a syscall like writev(3p) in our codebase
could likely also lead to improvements in other areas. Nice to see that
it'll also help your use case.

So for now I'll stick with it, but will fix overflow handling as
suggested by Junio.

Thanks!

Patrick

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 03/10] upload-pack: reduce lock contention when writing packfile data
  2026-03-05  1:16     ` Jeff King
@ 2026-03-10 12:14       ` Patrick Steinhardt
  0 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-10 12:14 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Matt Smiley, brian m. carlson

On Wed, Mar 04, 2026 at 08:16:38PM -0500, Jeff King wrote:
> On Tue, Mar 03, 2026 at 04:00:18PM +0100, Patrick Steinhardt wrote:
> 
> > Extend our use of the buffering infrastructure so that we soak up bytes
> > until the buffer is filled up at least 2/3rds of its capacity. The
> > change is relatively simple to implement as we already know to flush the
> > buffer in `create_pack_file()` after git-pack-objects(1) has finished.
> 
> This 2/3rds feels kind of arbitrary. Isn't our best bet to try to fill
> pkt-lines? Later you say:
> 
> > Now we could of course go even further and make sure that we always fill
> > up the whole buffer. But this might cause an increase in read(3p)
> > syscalls, and some tests show that this only reduces the number of
> > write(3p) syscalls from 130,000 to 100,000. So overall this doesn't seem
> > worth it.
> 
> But I am not clear how it increases the number of read() calls. I guess
> you are concerned that we'll get 50k, and then do a read for the
> remaining 14k, and then read 50k, and then 14k, and so on. But I'm
> unconvinced that 2/3 is really any better here, as it depends on the
> buffering patterns of the upstream writer. They could be writing 1 byte
> less than 2/3, and we'd wait to buffer, then read half their next
> packet, write it, then read the second half of their next packet, wait to
> buffer, and so on.
> 
> Even just doing:
> 
>   git clone --upload-pack='strace -e write git-upload-pack' --bare --no-local . foo.git
> 
> with this patch (and not the later one to increase the buffer size of
> pack-objects), I see an interesting flip-flop between packets of size
> 65515 and 61461. But we never send a single full-size one, even though
> pack-objects should be outpacing us (because we're slowed by running
> under strace). That's probably an OK loss of efficiency in practice, but
> it's very dependent on pack-objects buffering.

The big question though is whether it's a loss of efficiency compared to
the previous state where we didn't buffer at all.

> I'm still a little bit negative on the whole concept of buffering in
> upload-pack, just because the interactions between buffering layers can
> be so subtle. But I guess I'm not really making any argument that I
> didn't make in v1, and you kept this in v2, so I suppose you are not
> swayed by it. ;)

Yeah. Its usefulness highly depends on the downstream command indeed, so
it may or may not help. What I'm not convinced of though is that there
are likely scenarios where it'll perform _worse_ than before. If we can
come up with such a scenario then I'll be happy to drop this patch.

For downstream git-pack-objects(1) it shouldn't really make much of a
difference at the end of this series anyway, as it starts to write data
at pktline length. But it may help users of the pack-objects hook.

> If we are going to buffer in upload-pack, there is an obvious
> optimization that I didn't see in your series. When we send a keepalive,
> we should just send whatever we have in os->buffer (even if it is
> nothing).  If we are wasting 5 bytes of pkt-line header and a write()
> call to send the keepalive, we may as well send what data we do have.
> 
> I don't know how much it would help in practice, though. Most keepalives
> will come before the pack data starts, as once pack-objects starts
> producing data, it tends to do so pretty consistently. And of course we
> can't send os->buffer before we see the PACK header, because the whole
> point is to buffer the early bit waiting for packfile uris.
> 
> So it might not be worth adding.

It's an interesting idea indeed. We also need to be mindful that we try
to hold one byte back so that we can "fail" downstream callers in case
git-pack-objects(1) fails.

So we need to repeat some conditions here, but overall it's not too bad:

diff --git a/upload-pack.c b/upload-pack.c
index 1b1c81ea63..199bbd1ad6 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -484,7 +484,16 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 		 */
 		if (!ret && pack_data->use_sideband) {
 			static const char buf[] = "0005\1";
-			write_or_die(1, buf, 5);
+
+			if (output_state->packfile_started && output_state->used > 1) {
+				send_client_data(1, output_state->buffer, output_state->used - 1,
+						 pack_data->use_sideband);
+				output_state->buffer[0] = output_state->buffer[output_state->used - 1];
+				output_state->used = 1;
+			} else {
+				write_or_die(1, buf, 5);
+			}
+
 			last_sent_ms = now_ms;
 		}
 	}

I'll include this in the next version, thanks!

Patrick

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v3 00/10] upload-pack: reduce lock contention when writing packfile data
  2026-02-27 11:22 [PATCH 0/2] upload-pack: reduce lock contention when writing packfile data Patrick Steinhardt
                   ` (2 preceding siblings ...)
  2026-03-03 15:00 ` [PATCH v2 00/10] " Patrick Steinhardt
@ 2026-03-10 13:24 ` Patrick Steinhardt
  2026-03-10 13:24   ` [PATCH v3 01/10] upload-pack: fix debug statement when flushing " Patrick Steinhardt
                     ` (11 more replies)
  2026-03-13  6:45 ` [PATCH v4 " Patrick Steinhardt
  4 siblings, 12 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-10 13:24 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King, Johannes Sixt

Hi,

this small patch series fixes some heavy lock contention when writing
data from git-upload-pack(1) into pipes. This lock contention can be
observed when having hundreds of git-upload-pack(1) processes active at
the same time that write data into pipes at dozens of gigabits per
second.

I have uploaded the flame graph that clearly shows the lock contention
at [1].

Changes in v3:
  - Fix handling of `iov_len` overflows in writev(3p) wrapper.
  - Add another patch that causes us to flush out data instead of
    sending a 0005 keepalive packet.
  - Link to v2: https://lore.kernel.org/r/20260303-pks-upload-pack-write-contention-v2-0-7321830f08fe@pks.im

Changes in v2:
  - Change the buffer size in git-pack-objects(1) to also reduce the
    number of write syscalls over there.
  - Introduce writev to halve the number of syscalls when writing
    pktlines.
  - Use `sizeof(os->buffer)` instead of open-coding its size.
  - Improve keepalive logic in git-upload-pack(1) to account for
    buffering.
  - Link to v1: https://lore.kernel.org/r/20260227-pks-upload-pack-write-contention-v1-0-7166fe255704@pks.im

Thanks!

Patrick

[1]: https://gitlab.com/gitlab-org/git/-/work_items/675

---
Patrick Steinhardt (10):
      upload-pack: fix debug statement when flushing packfile data
      upload-pack: adapt keepalives based on buffering
      upload-pack: prefer flushing data over sending keepalive
      upload-pack: reduce lock contention when writing packfile data
      compat/posix: introduce writev(3p) wrapper
      wrapper: introduce writev(3p) wrappers
      sideband: use writev(3p) to send pktlines
      csum-file: introduce `hashfd_ext()`
      csum-file: drop `hashfd_throughput()`
      builtin/pack-objects: reduce lock contention when writing packfile data

 Makefile               |  4 +++
 builtin/pack-objects.c | 23 +++++++++++---
 compat/posix.h         | 14 +++++++++
 compat/writev.c        | 44 +++++++++++++++++++++++++++
 config.mak.uname       |  2 ++
 csum-file.c            | 28 +++++------------
 csum-file.h            | 16 ++++++++--
 meson.build            |  1 +
 sideband.c             | 14 +++++++--
 upload-pack.c          | 81 +++++++++++++++++++++++++++++++++++++++-----------
 wrapper.c              | 41 +++++++++++++++++++++++++
 wrapper.h              |  9 ++++++
 write-or-die.c         |  8 +++++
 write-or-die.h         |  1 +
 14 files changed, 239 insertions(+), 47 deletions(-)

Range-diff versus v2:

 1:  ef3244ffcc =  1:  540b577540 upload-pack: fix debug statement when flushing packfile data
 2:  3297edd609 =  2:  253d119b10 upload-pack: adapt keepalives based on buffering
 -:  ---------- >  3:  960c650063 upload-pack: prefer flushing data over sending keepalive
 3:  84671fb222 =  4:  0e147c41cf upload-pack: reduce lock contention when writing packfile data
 4:  bd1f070cb4 <  -:  ---------- git-compat-util: introduce `cast_size_t_to_ssize_t()`
 5:  d5acbd1584 !  5:  b6fbd89b8e compat/posix: introduce writev(3p) wrapper
    @@ compat/writev.c (new)
     +ssize_t git_writev(int fd, const struct iovec *iov, int iovcnt)
     +{
     +	size_t total_written = 0;
    ++	size_t sum = 0;
    ++
    ++	/*
    ++	 * According to writev(3p), the syscall shall error with EINVAL in case
    ++	 * the sum of `iov_len` overflows `ssize_t`.
    ++	 */
    ++	 for (int i = 0; i < iovcnt; i++) {
    ++		if (iov[i].iov_len > maximum_signed_value_of_type(ssize_t) ||
    ++		    iov[i].iov_len + sum > maximum_signed_value_of_type(ssize_t)) {
    ++			errno = EINVAL;
    ++			return -1;
    ++		}
    ++
    ++		sum += iov[i].iov_len;
    ++	}
     +
     +	for (int i = 0; i < iovcnt; i++) {
     +		const char *bytes = iov[i].iov_base;
    @@ compat/writev.c (new)
     +	}
     +
     +out:
    -+	return cast_size_t_to_ssize_t(total_written);
    ++	return (ssize_t) total_written;
     +}
     
      ## config.mak.uname ##
 6:  e3525cd25e =  6:  0a90ab2f62 wrapper: introduce writev(3p) wrappers
 7:  fe07f2d331 =  7:  5e29cfbf10 sideband: use writev(3p) to send pktlines
 8:  4563ad923f =  8:  32b59d5f29 csum-file: introduce `hashfd_ext()`
 9:  549b3dbefc =  9:  a5b0eb3627 csum-file: drop `hashfd_throughput()`
10:  155f79b128 = 10:  a82959772e builtin/pack-objects: reduce lock contention when writing packfile data

---
base-commit: fb1b83bcddff60463f6e86bb021784c88d0b748c
change-id: 20260227-pks-upload-pack-write-contention-435ce01f5fe9


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v3 01/10] upload-pack: fix debug statement when flushing packfile data
  2026-03-10 13:24 ` [PATCH v3 00/10] upload-pack: " Patrick Steinhardt
@ 2026-03-10 13:24   ` Patrick Steinhardt
  2026-03-10 13:24   ` [PATCH v3 02/10] upload-pack: adapt keepalives based on buffering Patrick Steinhardt
                     ` (10 subsequent siblings)
  11 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-10 13:24 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King, Johannes Sixt

When git-upload-pack(1) writes packfile data to the client we have some
logic in place that buffers some partial lines. When that buffer still
contains data after git-pack-objects(1) has finished we flush the buffer
so that all remaining bytes are sent out.

Curiously, when we do so we also print the string "flushed." to stderr.
This statement has been introduced in b1c71b7281 (upload-pack: avoid
sending an incomplete pack upon failure, 2006-06-20), so quite a while
ago. What's interesting though is that stderr is typically spliced
through to the client-side, and consequently the client would see this
message. Munging the way how we do the caching indeed confirms this:

  $ git clone file:///home/pks/Development/linux/
  Cloning into bare repository 'linux.git'...
  remote: Enumerating objects: 12980346, done.
  remote: Counting objects: 100% (131820/131820), done.
  remote: Compressing objects: 100% (50290/50290), done.
  remote: Total 12980346 (delta 96319), reused 104500 (delta 81217), pack-reused 12848526 (from 1)
  Receiving objects: 100% (12980346/12980346), 3.23 GiB | 57.44 MiB/s, done.
  flushed.
  Resolving deltas: 100% (10676718/10676718), done.

It's quite clear that this string shouldn't ever be visible to the
client, so it rather feels like this is a left-over debug statement. The
mentioned commit doesn't mention this line, either.

Remove the debug output to prepare for a change in how we do the
buffering in the next commit.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 upload-pack.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/upload-pack.c b/upload-pack.c
index e8c5cce1c7..b3a8561ef5 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -457,11 +457,9 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 	}
 
 	/* flush the data */
-	if (output_state->used > 0) {
+	if (output_state->used > 0)
 		send_client_data(1, output_state->buffer, output_state->used,
 				 pack_data->use_sideband);
-		fprintf(stderr, "flushed.\n");
-	}
 	free(output_state);
 	if (pack_data->use_sideband)
 		packet_flush(1);

-- 
2.53.0.880.g73c4285caa.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v3 02/10] upload-pack: adapt keepalives based on buffering
  2026-03-10 13:24 ` [PATCH v3 00/10] upload-pack: " Patrick Steinhardt
  2026-03-10 13:24   ` [PATCH v3 01/10] upload-pack: fix debug statement when flushing " Patrick Steinhardt
@ 2026-03-10 13:24   ` Patrick Steinhardt
  2026-03-10 13:24   ` [PATCH v3 03/10] upload-pack: prefer flushing data over sending keepalive Patrick Steinhardt
                     ` (9 subsequent siblings)
  11 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-10 13:24 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King, Johannes Sixt

The function `create_pack_file()` is responsible for sending the
packfile data to the client of git-upload-pack(1). As generating the
bytes may take significant computing resources we also have a mechanism
in place that optionally sends keepalive pktlines in case we haven't
sent out any data.

The keepalive logic is purely based on poll(3p): we pass a timeout to that
syscall, and if the call times out we send out the keepalive pktline.
While reasonable, this logic isn't entirely sufficient: even if the call
to poll(3p) ends because we have received data on any of the file
descriptors we may not necessarily send data to the client.

The most important edge case here happens in `relay_pack_data()`. When
we haven't seen the initial "PACK" signature from git-pack-objects(1)
yet we buffer incoming data. So in the worst case, if each of the bytes
of that signature arrive shortly before the configured keepalive
timeout, then we may not send out any data for a time period that is
(almost) four times as long as the configured timeout.

This edge case is rather unlikely to matter in practice. But in a
subsequent commit we're going to adapt our buffering mechanism to become
more aggressive, which makes it more likely that we don't send any data
for an extended amount of time.

Adapt the logic so that instead of using a fixed timeout on every call
to poll(3p), we instead figure out how much time has passed since the
last-sent data.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 upload-pack.c | 49 ++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 40 insertions(+), 9 deletions(-)

diff --git a/upload-pack.c b/upload-pack.c
index b3a8561ef5..f6f380a601 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -29,6 +29,7 @@
 #include "commit-graph.h"
 #include "commit-reach.h"
 #include "shallow.h"
+#include "trace.h"
 #include "write-or-die.h"
 #include "json-writer.h"
 #include "strmap.h"
@@ -218,7 +219,8 @@ struct output_state {
 };
 
 static int relay_pack_data(int pack_objects_out, struct output_state *os,
-			   int use_sideband, int write_packfile_line)
+			   int use_sideband, int write_packfile_line,
+			   bool *did_send_data)
 {
 	/*
 	 * We keep the last byte to ourselves
@@ -232,6 +234,8 @@ static int relay_pack_data(int pack_objects_out, struct output_state *os,
 	 */
 	ssize_t readsz;
 
+	*did_send_data = false;
+
 	readsz = xread(pack_objects_out, os->buffer + os->used,
 		       sizeof(os->buffer) - os->used);
 	if (readsz < 0) {
@@ -247,6 +251,7 @@ static int relay_pack_data(int pack_objects_out, struct output_state *os,
 				if (os->packfile_uris_started)
 					packet_delim(1);
 				packet_write_fmt(1, "\1packfile\n");
+				*did_send_data = true;
 			}
 			break;
 		}
@@ -259,6 +264,7 @@ static int relay_pack_data(int pack_objects_out, struct output_state *os,
 			}
 			*p = '\0';
 			packet_write_fmt(1, "\1%s\n", os->buffer);
+			*did_send_data = true;
 
 			os->used -= p - os->buffer + 1;
 			memmove(os->buffer, p + 1, os->used);
@@ -279,6 +285,7 @@ static int relay_pack_data(int pack_objects_out, struct output_state *os,
 		os->used = 0;
 	}
 
+	*did_send_data = true;
 	return readsz;
 }
 
@@ -290,6 +297,7 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 	char progress[128];
 	char abort_msg[] = "aborting due to possible repository "
 		"corruption on the remote side.";
+	uint64_t last_sent_ms = 0;
 	ssize_t sz;
 	int i;
 	FILE *pipe_fd;
@@ -365,10 +373,14 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 	 */
 
 	while (1) {
+		uint64_t now_ms = getnanotime() / 1000000;
 		struct pollfd pfd[2];
-		int pe, pu, pollsize, polltimeout;
+		int pe, pu, pollsize, polltimeout_ms;
 		int ret;
 
+		if (!last_sent_ms)
+			last_sent_ms = now_ms;
+
 		reset_timeout(pack_data->timeout);
 
 		pollsize = 0;
@@ -390,11 +402,21 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 		if (!pollsize)
 			break;
 
-		polltimeout = pack_data->keepalive < 0
-			? -1
-			: 1000 * pack_data->keepalive;
+		if (pack_data->keepalive < 0) {
+			polltimeout_ms = -1;
+		} else {
+			/*
+			 * The polling timeout needs to be adjusted based on
+			 * the time we last sent a packet. The longer
+			 * it's been in the past, the shorter the timeout
+			 * becomes until we eventually don't block at all.
+			 */
+			polltimeout_ms = 1000 * pack_data->keepalive - (now_ms - last_sent_ms);
+			if (polltimeout_ms < 0)
+				polltimeout_ms = 0;
+		}
 
-		ret = poll(pfd, pollsize, polltimeout);
+		ret = poll(pfd, pollsize, polltimeout_ms);
 
 		if (ret < 0) {
 			if (errno != EINTR) {
@@ -403,16 +425,18 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 			}
 			continue;
 		}
+
 		if (0 <= pe && (pfd[pe].revents & (POLLIN|POLLHUP))) {
 			/* Status ready; we ship that in the side-band
 			 * or dump to the standard error.
 			 */
 			sz = xread(pack_objects.err, progress,
 				  sizeof(progress));
-			if (0 < sz)
+			if (0 < sz) {
 				send_client_data(2, progress, sz,
 						 pack_data->use_sideband);
-			else if (sz == 0) {
+				last_sent_ms = now_ms;
+			} else if (sz == 0) {
 				close(pack_objects.err);
 				pack_objects.err = -1;
 			}
@@ -421,11 +445,14 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 			/* give priority to status messages */
 			continue;
 		}
+
 		if (0 <= pu && (pfd[pu].revents & (POLLIN|POLLHUP))) {
+			bool did_send_data;
 			int result = relay_pack_data(pack_objects.out,
 						     output_state,
 						     pack_data->use_sideband,
-						     !!uri_protocols);
+						     !!uri_protocols,
+						     &did_send_data);
 
 			if (result == 0) {
 				close(pack_objects.out);
@@ -433,6 +460,9 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 			} else if (result < 0) {
 				goto fail;
 			}
+
+			if (did_send_data)
+				last_sent_ms = now_ms;
 		}
 
 		/*
@@ -448,6 +478,7 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 		if (!ret && pack_data->use_sideband) {
 			static const char buf[] = "0005\1";
 			write_or_die(1, buf, 5);
+			last_sent_ms = now_ms;
 		}
 	}
 

-- 
2.53.0.880.g73c4285caa.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread
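The adjusted poll timeout above shrinks the longer it has been since the last
send, clamping at zero. A minimal sketch of that arithmetic (the helper name is
hypothetical, not part of the patch):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the timeout math in create_pack_file():
 * the longer it has been since we last sent data, the shorter the poll
 * timeout, clamped at zero. A negative keepalive disables the timeout.
 */
static int compute_poll_timeout_ms(int keepalive_s, uint64_t now_ms,
				   uint64_t last_sent_ms)
{
	int timeout_ms;

	if (keepalive_s < 0)
		return -1; /* poll(2) blocks indefinitely */

	timeout_ms = 1000 * keepalive_s - (int)(now_ms - last_sent_ms);
	return timeout_ms < 0 ? 0 : timeout_ms;
}
```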

* [PATCH v3 03/10] upload-pack: prefer flushing data over sending keepalive
  2026-03-10 13:24 ` [PATCH v3 00/10] upload-pack: " Patrick Steinhardt
  2026-03-10 13:24   ` [PATCH v3 01/10] upload-pack: fix debug statement when flushing " Patrick Steinhardt
  2026-03-10 13:24   ` [PATCH v3 02/10] upload-pack: adapt keepalives based on buffering Patrick Steinhardt
@ 2026-03-10 13:24   ` Patrick Steinhardt
  2026-03-10 17:09     ` Junio C Hamano
  2026-03-10 13:25   ` [PATCH v3 04/10] upload-pack: reduce lock contention when writing packfile data Patrick Steinhardt
                     ` (8 subsequent siblings)
  11 siblings, 1 reply; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-10 13:24 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King, Johannes Sixt

When using the sideband in git-upload-pack(1) we send out keepalive
packets in case generating the pack takes too long. These
keepalives take the form of a simple empty pktline.

In the preceding commit we have adapted git-upload-pack(1) to buffer
data more aggressively before sending it to the client. This creates an
obvious optimization opportunity: when we hit the keepalive timeout
while we still hold on to some buffered data, then it makes more sense
to flush out the data instead of sending the empty keepalive packet.

This is overall not going to be a significant win. Most keepalives will
come before the pack data starts, and once pack-objects starts producing
data, it tends to do so pretty consistently. And of course we can't send
data before we see the PACK header, because the whole point is to buffer
the early bit waiting for packfile URIs. But the optimization is easy
enough to realize.

Do so and flush out data instead of sending an empty pktline. While at
it, drop the useless

Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 upload-pack.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/upload-pack.c b/upload-pack.c
index f6f380a601..7a165d226d 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -466,18 +466,27 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 		}
 
 		/*
-		 * We hit the keepalive timeout without saying anything; send
-		 * an empty message on the data sideband just to let the other
-		 * side know we're still working on it, but don't have any data
-		 * yet.
+		 * We hit the keepalive timeout without saying anything. If we
+		 * have pending data we flush it out to the caller now.
+		 * Otherwise, we send an empty message on the data sideband
+		 * just to let the other side know we're still working on it,
+		 * but don't have any data yet.
 		 *
 		 * If we don't have a sideband channel, there's no room in the
 		 * protocol to say anything, so those clients are just out of
 		 * luck.
 		 */
 		if (!ret && pack_data->use_sideband) {
-			static const char buf[] = "0005\1";
-			write_or_die(1, buf, 5);
+			if (output_state->packfile_started && output_state->used > 1) {
+				send_client_data(1, output_state->buffer, output_state->used - 1,
+						 pack_data->use_sideband);
+				output_state->buffer[0] = output_state->buffer[output_state->used - 1];
+				output_state->used = 1;
+			} else {
+				static const char buf[] = "0005\1";
+				write_or_die(1, buf, 5);
+			}
+
 			last_sent_ms = now_ms;
 		}
 	}

-- 
2.53.0.880.g73c4285caa.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread
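The flush in the keepalive branch reuses relay_pack_data()'s trick of holding
back the final buffered byte so that a later error can still be signaled over
the sideband. A simplified model of that bookkeeping (the function name and the
separate output buffer are illustrative assumptions):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Simplified model of the "keep the last byte" flush: everything except
 * the trailing byte is copied to `out`, and that byte is moved to the
 * front of the buffer. Returns the new number of buffered bytes.
 */
static size_t flush_keep_last(char *buf, size_t used, char *out)
{
	if (used <= 1)
		return used; /* nothing to flush yet */
	memcpy(out, buf, used - 1);
	buf[0] = buf[used - 1]; /* retain the trailing byte */
	return 1;
}
```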

* [PATCH v3 04/10] upload-pack: reduce lock contention when writing packfile data
  2026-03-10 13:24 ` [PATCH v3 00/10] upload-pack: " Patrick Steinhardt
                     ` (2 preceding siblings ...)
  2026-03-10 13:24   ` [PATCH v3 03/10] upload-pack: prefer flushing data over sending keepalive Patrick Steinhardt
@ 2026-03-10 13:25   ` Patrick Steinhardt
  2026-03-10 13:25   ` [PATCH v3 05/10] compat/posix: introduce writev(3p) wrapper Patrick Steinhardt
                     ` (7 subsequent siblings)
  11 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-10 13:25 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King, Johannes Sixt

In our production systems we have recently observed write contention in
git-upload-pack(1). The system in question was consistently streaming
packfiles at a rate of dozens of gigabits per second, but curiously the
system was not bottlenecked on CPU, memory, or IOPS.

We eventually discovered that Git was spending 80% of its time in
`pipe_write()`, out of which almost all of the time was spent in the
`ep_poll_callback` function in the kernel. Quoting the reporter:

  This infrastructure is part of an event notification queue designed to
  allow for multiple producers to emit events, but that concurrency
  safety is guarded by 3 layers of locking. The layer we're hitting
  contention in uses a simple reader/writer lock mode (a.k.a. shared
  versus exclusive mode), where producers need shared-mode (read mode),
  and various other actions use exclusive (write) mode.

The system in question generates workloads where we have hundreds of
git-upload-pack(1) processes active at the same point in time. These
processes end up contending around those locks, and the consequence is
that the Git processes stall.

Now git-upload-pack(1) already has the infrastructure in place to buffer
some of the data it reads from git-pack-objects(1) before actually
sending it out. We only use this infrastructure in very limited ways
though, so we generally end up matching one read(3p) call with one
write(3p) call. Even worse, when the sideband is enabled we end up
matching one read with _two_ writes: one for the pkt-line length, and
one for the packfile data.

Extend our use of the buffering infrastructure so that we soak up bytes
until the buffer is filled to at least two thirds of its capacity. The
change is relatively simple to implement as we already know to flush the
buffer in `create_pack_file()` after git-pack-objects(1) has finished.

This significantly reduces the number of write(3p) syscalls we need to
do. Before this change, cloning the Linux repository resulted in around
400,000 write(3p) syscalls. With the buffering in place we only do
around 130,000 syscalls.

Now we could of course go even further and make sure that we always fill
up the whole buffer. But this might cause an increase in read(3p)
syscalls, and some tests show that this only reduces the number of
write(3p) syscalls from 130,000 to 100,000. So overall this doesn't seem
worth it.

Note that the issue could also be fixed by adapting the write buffer
that we use in the downstream git-pack-objects(1) command, and such a
change would have roughly the same result. But the command that
generates the packfile data may not always be git-pack-objects(1) as it
can be changed via "uploadpack.packObjectsHook", so such a fix would
only help in _some_ cases. Regardless of that, we'll also adapt the
write buffer size of git-pack-objects(1) in a subsequent commit.

Helped-by: Matt Smiley <msmiley@gitlab.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 upload-pack.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/upload-pack.c b/upload-pack.c
index 7a165d226d..9f6d6fe48c 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -276,6 +276,13 @@ static int relay_pack_data(int pack_objects_out, struct output_state *os,
 		}
 	}
 
+	/*
+	 * Make sure that we buffer some data before sending it to the client.
+	 * This significantly reduces the number of write(3p) syscalls.
+	 */
+	if (readsz && os->used < (sizeof(os->buffer) * 2 / 3))
+		return readsz;
+
 	if (os->used > 1) {
 		send_client_data(1, os->buffer, os->used - 1, use_sideband);
 		os->buffer[0] = os->buffer[os->used - 1];

-- 
2.53.0.880.g73c4285caa.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread
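The two-thirds heuristic from this patch can be expressed as a small
predicate; this sketch (with a hypothetical name) only illustrates the flush
condition, not the surrounding I/O:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Flush when the producer hit EOF (readsz == 0) or the buffer is at
 * least two thirds full; otherwise keep soaking up data so that we can
 * batch write(3p) syscalls.
 */
static int should_flush(size_t used, size_t capacity, long readsz)
{
	return !readsz || used >= capacity * 2 / 3;
}
```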

* [PATCH v3 05/10] compat/posix: introduce writev(3p) wrapper
  2026-03-10 13:24 ` [PATCH v3 00/10] upload-pack: " Patrick Steinhardt
                     ` (3 preceding siblings ...)
  2026-03-10 13:25   ` [PATCH v3 04/10] upload-pack: reduce lock contention when writing packfile data Patrick Steinhardt
@ 2026-03-10 13:25   ` Patrick Steinhardt
  2026-03-10 16:59     ` Junio C Hamano
  2026-03-10 13:25   ` [PATCH v3 06/10] wrapper: introduce writev(3p) wrappers Patrick Steinhardt
                     ` (6 subsequent siblings)
  11 siblings, 1 reply; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-10 13:25 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King, Johannes Sixt

In a subsequent commit we're going to add the first caller to
writev(3p). Introduce a compatibility wrapper for this syscall that we
can use on systems that don't have this syscall.

The syscall exists on modern Unixes like Linux and macOS, and seemingly
even for NonStop according to [1]. It doesn't seem to exist on Windows
though.

[1]: http://nonstoptools.com/manuals/OSS-SystemCalls.pdf
[2]: https://www.gnu.org/software/gnulib/manual/html_node/writev.html

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 Makefile         |  4 ++++
 compat/posix.h   | 14 ++++++++++++++
 compat/writev.c  | 44 ++++++++++++++++++++++++++++++++++++++++++++
 config.mak.uname |  2 ++
 meson.build      |  1 +
 5 files changed, 65 insertions(+)

diff --git a/Makefile b/Makefile
index f3264d0a37..493851162d 100644
--- a/Makefile
+++ b/Makefile
@@ -2021,6 +2021,10 @@ ifdef NO_PREAD
 	COMPAT_CFLAGS += -DNO_PREAD
 	COMPAT_OBJS += compat/pread.o
 endif
+ifdef NO_WRITEV
+	COMPAT_CFLAGS += -DNO_WRITEV
+	COMPAT_OBJS += compat/writev.o
+endif
 ifdef NO_FAST_WORKING_DIRECTORY
 	BASIC_CFLAGS += -DNO_FAST_WORKING_DIRECTORY
 endif
diff --git a/compat/posix.h b/compat/posix.h
index 245386fa4a..3c611d2736 100644
--- a/compat/posix.h
+++ b/compat/posix.h
@@ -137,6 +137,9 @@
 #include <sys/socket.h>
 #include <sys/ioctl.h>
 #include <sys/statvfs.h>
+#ifndef NO_WRITEV
+#include <sys/uio.h>
+#endif
 #include <termios.h>
 #ifndef NO_SYS_SELECT_H
 #include <sys/select.h>
@@ -323,6 +326,17 @@ int git_lstat(const char *, struct stat *);
 ssize_t git_pread(int fd, void *buf, size_t count, off_t offset);
 #endif
 
+#ifdef NO_WRITEV
+#define writev git_writev
+#define iovec git_iovec
+struct git_iovec {
+	void *iov_base;
+	size_t iov_len;
+};
+
+ssize_t git_writev(int fd, const struct iovec *iov, int iovcnt);
+#endif
+
 #ifdef NO_SETENV
 #define setenv gitsetenv
 int gitsetenv(const char *, const char *, int);
diff --git a/compat/writev.c b/compat/writev.c
new file mode 100644
index 0000000000..3a94870a2f
--- /dev/null
+++ b/compat/writev.c
@@ -0,0 +1,44 @@
+#include "../git-compat-util.h"
+#include "../wrapper.h"
+
+ssize_t git_writev(int fd, const struct iovec *iov, int iovcnt)
+{
+	size_t total_written = 0;
+	size_t sum = 0;
+
+	/*
+	 * According to writev(3p), the syscall shall error with EINVAL in case
+	 * the sum of `iov_len` overflows `ssize_t`.
+	 */
+	for (int i = 0; i < iovcnt; i++) {
+		if (iov[i].iov_len > maximum_signed_value_of_type(ssize_t) ||
+		    iov[i].iov_len + sum > maximum_signed_value_of_type(ssize_t)) {
+			errno = EINVAL;
+			return -1;
+		}
+
+		sum += iov[i].iov_len;
+	}
+
+	for (int i = 0; i < iovcnt; i++) {
+		const char *bytes = iov[i].iov_base;
+		size_t iovec_written = 0;
+
+		while (iovec_written < iov[i].iov_len) {
+			ssize_t bytes_written = xwrite(fd, bytes + iovec_written,
+						       iov[i].iov_len - iovec_written);
+			if (bytes_written < 0) {
+				if (total_written)
+					goto out;
+				return bytes_written;
+			}
+			if (!bytes_written)
+				goto out;
+			iovec_written += bytes_written;
+			total_written += bytes_written;
+		}
+	}
+
+out:
+	return (ssize_t) total_written;
+}
diff --git a/config.mak.uname b/config.mak.uname
index 5feb582558..ccb3f71881 100644
--- a/config.mak.uname
+++ b/config.mak.uname
@@ -459,6 +459,7 @@ ifeq ($(uname_S),Windows)
 	SANE_TOOL_PATH ?= $(msvc_bin_dir_msys)
 	HAVE_ALLOCA_H = YesPlease
 	NO_PREAD = YesPlease
+	NO_WRITEV = YesPlease
 	NEEDS_CRYPTO_WITH_SSL = YesPlease
 	NO_LIBGEN_H = YesPlease
 	NO_POLL = YesPlease
@@ -674,6 +675,7 @@ ifeq ($(uname_S),MINGW)
 	pathsep = ;
 	HAVE_ALLOCA_H = YesPlease
 	NO_PREAD = YesPlease
+	NO_WRITEV = YesPlease
 	NEEDS_CRYPTO_WITH_SSL = YesPlease
 	NO_LIBGEN_H = YesPlease
 	NO_POLL = YesPlease
diff --git a/meson.build b/meson.build
index 4b536e0124..381974ab57 100644
--- a/meson.build
+++ b/meson.build
@@ -1414,6 +1414,7 @@ checkfuncs = {
   'initgroups' : [],
   'strtoumax' : ['strtoumax.c', 'strtoimax.c'],
   'pread' : ['pread.c'],
+  'writev' : ['writev.c'],
 }
 
 if host_machine.system() == 'windows'

-- 
2.53.0.880.g73c4285caa.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v3 06/10] wrapper: introduce writev(3p) wrappers
  2026-03-10 13:24 ` [PATCH v3 00/10] upload-pack: " Patrick Steinhardt
                     ` (4 preceding siblings ...)
  2026-03-10 13:25   ` [PATCH v3 05/10] compat/posix: introduce writev(3p) wrapper Patrick Steinhardt
@ 2026-03-10 13:25   ` Patrick Steinhardt
  2026-03-10 13:25   ` [PATCH v3 07/10] sideband: use writev(3p) to send pktlines Patrick Steinhardt
                     ` (5 subsequent siblings)
  11 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-10 13:25 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King, Johannes Sixt

In the preceding commit we have added a compatibility wrapper for the
writev(3p) syscall. Introduce some generic wrappers for this function
that we nowadays take for granted in the Git codebase.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 wrapper.c      | 41 +++++++++++++++++++++++++++++++++++++++++
 wrapper.h      |  9 +++++++++
 write-or-die.c |  8 ++++++++
 write-or-die.h |  1 +
 4 files changed, 59 insertions(+)

diff --git a/wrapper.c b/wrapper.c
index 16f5a63fbb..be8fa575e6 100644
--- a/wrapper.c
+++ b/wrapper.c
@@ -323,6 +323,47 @@ ssize_t write_in_full(int fd, const void *buf, size_t count)
 	return total;
 }
 
+ssize_t writev_in_full(int fd, struct iovec *iov, int iovcnt)
+{
+	ssize_t total_written = 0;
+
+	while (iovcnt) {
+		ssize_t bytes_written = writev(fd, iov, iovcnt);
+		if (bytes_written < 0) {
+			if (errno == EINTR || errno == EAGAIN)
+				continue;
+			return -1;
+		}
+		if (!bytes_written) {
+			errno = ENOSPC;
+			return -1;
+		}
+
+		total_written += bytes_written;
+
+		/*
+		 * We first need to discard any iovec entities that have been
+		 * fully written.
+		 */
+		while (iovcnt && (size_t)bytes_written >= iov->iov_len) {
+			bytes_written -= iov->iov_len;
+			iov++;
+			iovcnt--;
+		}
+
+		/*
+		 * Finally, we need to adjust the last iovec in case we have
+		 * performed a partial write.
+		 */
+		if (iovcnt && bytes_written) {
+			iov->iov_base = (char *) iov->iov_base + bytes_written;
+			iov->iov_len -= bytes_written;
+		}
+	}
+
+	return total_written;
+}
+
 ssize_t pread_in_full(int fd, void *buf, size_t count, off_t offset)
 {
 	char *p = buf;
diff --git a/wrapper.h b/wrapper.h
index 15ac3bab6e..27519b32d1 100644
--- a/wrapper.h
+++ b/wrapper.h
@@ -47,6 +47,15 @@ ssize_t read_in_full(int fd, void *buf, size_t count);
 ssize_t write_in_full(int fd, const void *buf, size_t count);
 ssize_t pread_in_full(int fd, void *buf, size_t count, off_t offset);
 
+/*
+ * Try to write all iovecs. Returns -1 in case an error occurred with a proper
+ * errno set, the number of bytes written otherwise.
+ *
+ * Note that the iovec will be modified as a result of this call to adjust for
+ * partial writes!
+ */
+ssize_t writev_in_full(int fd, struct iovec *iov, int iovcnt);
+
 static inline ssize_t write_str_in_full(int fd, const char *str)
 {
 	return write_in_full(fd, str, strlen(str));
diff --git a/write-or-die.c b/write-or-die.c
index 01a9a51fa2..5f522fb728 100644
--- a/write-or-die.c
+++ b/write-or-die.c
@@ -96,6 +96,14 @@ void write_or_die(int fd, const void *buf, size_t count)
 	}
 }
 
+void writev_or_die(int fd, struct iovec *iov, int iovlen)
+{
+	if (writev_in_full(fd, iov, iovlen) < 0) {
+		check_pipe(errno);
+		die_errno("writev error");
+	}
+}
+
 void fwrite_or_die(FILE *f, const void *buf, size_t count)
 {
 	if (fwrite(buf, 1, count, f) != count)
diff --git a/write-or-die.h b/write-or-die.h
index 65a5c42a47..ae3d7d88b8 100644
--- a/write-or-die.h
+++ b/write-or-die.h
@@ -7,6 +7,7 @@ void fprintf_or_die(FILE *, const char *fmt, ...);
 void fwrite_or_die(FILE *f, const void *buf, size_t count);
 void fflush_or_die(FILE *f);
 void write_or_die(int fd, const void *buf, size_t count);
+void writev_or_die(int fd, struct iovec *iov, int iovlen);
 
 /*
  * These values are used to help identify parts of a repository to fsync.

-- 
2.53.0.880.g73c4285caa.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread
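The trickiest part of `writev_in_full()` is advancing the iovec array after a
partial write. The same bookkeeping isolated into a standalone sketch, using a
simplified stand-in struct so it can be exercised without a file descriptor
(names are illustrative, not from the patch):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct iovec. */
struct span {
	char *base;
	size_t len;
};

/*
 * Skip past entries that were fully written, then trim the first
 * remaining entry by the leftover partial count, exactly as
 * writev_in_full() adjusts its iov/iovcnt pair.
 */
static void advance_spans(struct span **iovp, int *cntp, size_t written)
{
	struct span *iov = *iovp;
	int cnt = *cntp;

	while (cnt && written >= iov->len) {
		written -= iov->len;
		iov++;
		cnt--;
	}
	if (cnt && written) {
		iov->base += written;
		iov->len -= written;
	}

	*iovp = iov;
	*cntp = cnt;
}
```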

* [PATCH v3 07/10] sideband: use writev(3p) to send pktlines
  2026-03-10 13:24 ` [PATCH v3 00/10] upload-pack: " Patrick Steinhardt
                     ` (5 preceding siblings ...)
  2026-03-10 13:25   ` [PATCH v3 06/10] wrapper: introduce writev(3p) wrappers Patrick Steinhardt
@ 2026-03-10 13:25   ` Patrick Steinhardt
  2026-03-10 13:25   ` [PATCH v3 08/10] csum-file: introduce `hashfd_ext()` Patrick Steinhardt
                     ` (4 subsequent siblings)
  11 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-10 13:25 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King, Johannes Sixt

Every pktline that we send out via `send_sideband()` currently requires
two syscalls: one to write the pktline's length, and one to send its
data. This typically isn't all that much of a problem, but under extreme
load the syscalls may cause contention in the kernel.

Refactor the code to instead use the newly introduced writev(3p) infra
so that we can send out the data with a single syscall. This reduces the
number of syscalls from around 133,000 calls to write(3p) to around
67,000 calls to writev(3p).

Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 sideband.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/sideband.c b/sideband.c
index ea7c25211e..1ed6614eaf 100644
--- a/sideband.c
+++ b/sideband.c
@@ -264,6 +264,7 @@ void send_sideband(int fd, int band, const char *data, ssize_t sz, int packet_ma
 	const char *p = data;
 
 	while (sz) {
+		struct iovec iov[2];
 		unsigned n;
 		char hdr[5];
 
@@ -273,12 +274,19 @@ void send_sideband(int fd, int band, const char *data, ssize_t sz, int packet_ma
 		if (0 <= band) {
 			xsnprintf(hdr, sizeof(hdr), "%04x", n + 5);
 			hdr[4] = band;
-			write_or_die(fd, hdr, 5);
+			iov[0].iov_base = hdr;
+			iov[0].iov_len = 5;
 		} else {
 			xsnprintf(hdr, sizeof(hdr), "%04x", n + 4);
-			write_or_die(fd, hdr, 4);
+			iov[0].iov_base = hdr;
+			iov[0].iov_len = 4;
 		}
-		write_or_die(fd, p, n);
+
+		iov[1].iov_base = (void *) p;
+		iov[1].iov_len = n;
+
+		writev_or_die(fd, iov, ARRAY_SIZE(iov));
+
 		p += n;
 		sz -= n;
 	}

-- 
2.53.0.880.g73c4285caa.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread
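For reference, the pkt-line header that ends up in `iov[0]` is four hex digits
covering payload plus framing, followed by the band byte. A sketch of just the
header construction, mirroring the xsnprintf() calls above (the helper name is
an assumption for illustration):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the 5-byte sideband header for n payload bytes: the pkt-line
 * length counts the 4 length digits, the band byte, and the payload.
 */
static void sideband_header(char hdr[5], unsigned n, int band)
{
	snprintf(hdr, 5, "%04x", n + 5); /* writes 4 hex digits + NUL */
	hdr[4] = (char)band;             /* band byte replaces the NUL */
}
```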

* [PATCH v3 08/10] csum-file: introduce `hashfd_ext()`
  2026-03-10 13:24 ` [PATCH v3 00/10] upload-pack: " Patrick Steinhardt
                     ` (6 preceding siblings ...)
  2026-03-10 13:25   ` [PATCH v3 07/10] sideband: use writev(3p) to send pktlines Patrick Steinhardt
@ 2026-03-10 13:25   ` Patrick Steinhardt
  2026-03-10 13:25   ` [PATCH v3 09/10] csum-file: drop `hashfd_throughput()` Patrick Steinhardt
                     ` (3 subsequent siblings)
  11 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-10 13:25 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King, Johannes Sixt

Introduce a new `hashfd_ext()` function that takes an options structure.
This function will replace `hashfd_throughput()` in the next commit.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 csum-file.c | 22 +++++++++++++---------
 csum-file.h | 14 ++++++++++++++
 2 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/csum-file.c b/csum-file.c
index 6e21e3cac8..a50416247e 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -161,17 +161,16 @@ struct hashfile *hashfd_check(const struct git_hash_algo *algop,
 	return f;
 }
 
-static struct hashfile *hashfd_internal(const struct git_hash_algo *algop,
-					int fd, const char *name,
-					struct progress *tp,
-					size_t buffer_len)
+struct hashfile *hashfd_ext(const struct git_hash_algo *algop,
+			    int fd, const char *name,
+			    const struct hashfd_options *opts)
 {
 	struct hashfile *f = xmalloc(sizeof(*f));
 	f->fd = fd;
 	f->check_fd = -1;
 	f->offset = 0;
 	f->total = 0;
-	f->tp = tp;
+	f->tp = opts->progress;
 	f->name = name;
 	f->do_crc = 0;
 	f->skip_hash = 0;
@@ -179,8 +178,8 @@ static struct hashfile *hashfd_internal(const struct git_hash_algo *algop,
 	f->algop = unsafe_hash_algo(algop);
 	f->algop->init_fn(&f->ctx);
 
-	f->buffer_len = buffer_len;
-	f->buffer = xmalloc(buffer_len);
+	f->buffer_len = opts->buffer_len ? opts->buffer_len : 128 * 1024;
+	f->buffer = xmalloc(f->buffer_len);
 	f->check_buffer = NULL;
 
 	return f;
@@ -194,7 +193,8 @@ struct hashfile *hashfd(const struct git_hash_algo *algop,
 	 * measure the rate of data passing through this hashfile,
 	 * use a larger buffer size to reduce fsync() calls.
 	 */
-	return hashfd_internal(algop, fd, name, NULL, 128 * 1024);
+	struct hashfd_options opts = { 0 };
+	return hashfd_ext(algop, fd, name, &opts);
 }
 
 struct hashfile *hashfd_throughput(const struct git_hash_algo *algop,
@@ -206,7 +206,11 @@ struct hashfile *hashfd_throughput(const struct git_hash_algo *algop,
 	 * size so the progress indicators arrive at a more
 	 * frequent rate.
 	 */
-	return hashfd_internal(algop, fd, name, tp, 8 * 1024);
+	struct hashfd_options opts = {
+		.progress = tp,
+		.buffer_len = 8 * 1024,
+	};
+	return hashfd_ext(algop, fd, name, &opts);
 }
 
 void hashfile_checkpoint_init(struct hashfile *f,
diff --git a/csum-file.h b/csum-file.h
index 07ae11024a..a03b60120d 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -45,6 +45,20 @@ int hashfile_truncate(struct hashfile *, struct hashfile_checkpoint *);
 #define CSUM_FSYNC		2
 #define CSUM_HASH_IN_STREAM	4
 
+struct hashfd_options {
+	/*
+	 * Throughput progress that counts the number of bytes that have been
+	 * hashed.
+	 */
+	struct progress *progress;
+
+	/* The length of the buffer that shall be used to read data. */
+	size_t buffer_len;
+};
+
+struct hashfile *hashfd_ext(const struct git_hash_algo *algop,
+			    int fd, const char *name,
+			    const struct hashfd_options *opts);
 struct hashfile *hashfd(const struct git_hash_algo *algop,
 			int fd, const char *name);
 struct hashfile *hashfd_check(const struct git_hash_algo *algop,

-- 
2.53.0.880.g73c4285caa.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v3 09/10] csum-file: drop `hashfd_throughput()`
  2026-03-10 13:24 ` [PATCH v3 00/10] upload-pack: " Patrick Steinhardt
                     ` (7 preceding siblings ...)
  2026-03-10 13:25   ` [PATCH v3 08/10] csum-file: introduce `hashfd_ext()` Patrick Steinhardt
@ 2026-03-10 13:25   ` Patrick Steinhardt
  2026-03-10 13:25   ` [PATCH v3 10/10] builtin/pack-objects: reduce lock contention when writing packfile data Patrick Steinhardt
                     ` (2 subsequent siblings)
  11 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-10 13:25 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King, Johannes Sixt

The `hashfd_throughput()` function is used by a single callsite in
git-pack-objects(1). In contrast to `hashfd()`, this function uses a
progress meter to measure throughput and a smaller buffer length so that
the progress meter can provide more granular metrics.

We're going to change that caller in the next commit to be a bit more
specific to packing objects. As such, `hashfd_throughput()` will be a
somewhat unfitting mechanism for any potential new callers.

Drop the function and replace it with a call to `hashfd_ext()`.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/pack-objects.c | 19 +++++++++++++++----
 csum-file.c            | 16 ----------------
 csum-file.h            |  2 --
 3 files changed, 15 insertions(+), 22 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index c1ee4d5ed7..f5cb80e870 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1330,11 +1330,22 @@ static void write_pack_file(void)
 		unsigned char hash[GIT_MAX_RAWSZ];
 		char *pack_tmp_name = NULL;
 
-		if (pack_to_stdout)
-			f = hashfd_throughput(the_repository->hash_algo, 1,
-					      "<stdout>", progress_state);
-		else
+		if (pack_to_stdout) {
+			/*
+			 * Since we are expecting to report progress of the
+			 * write into this hashfile, use a smaller buffer
+			 * size so the progress indicators arrive at a more
+			 * frequent rate.
+			 */
+			struct hashfd_options opts = {
+				.progress = progress_state,
+				.buffer_len = 8 * 1024,
+			};
+			f = hashfd_ext(the_repository->hash_algo, 1,
+				       "<stdout>", &opts);
+		} else {
 			f = create_tmp_packfile(the_repository, &pack_tmp_name);
+		}
 
 		offset = write_pack_header(f, nr_remaining);
 
diff --git a/csum-file.c b/csum-file.c
index a50416247e..5dfaca5543 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -197,22 +197,6 @@ struct hashfile *hashfd(const struct git_hash_algo *algop,
 	return hashfd_ext(algop, fd, name, &opts);
 }
 
-struct hashfile *hashfd_throughput(const struct git_hash_algo *algop,
-				   int fd, const char *name, struct progress *tp)
-{
-	/*
-	 * Since we are expecting to report progress of the
-	 * write into this hashfile, use a smaller buffer
-	 * size so the progress indicators arrive at a more
-	 * frequent rate.
-	 */
-	struct hashfd_options opts = {
-		.progress = tp,
-		.buffer_len = 8 * 1024,
-	};
-	return hashfd_ext(algop, fd, name, &opts);
-}
-
 void hashfile_checkpoint_init(struct hashfile *f,
 			      struct hashfile_checkpoint *checkpoint)
 {
diff --git a/csum-file.h b/csum-file.h
index a03b60120d..01472555c8 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -63,8 +63,6 @@ struct hashfile *hashfd(const struct git_hash_algo *algop,
 			int fd, const char *name);
 struct hashfile *hashfd_check(const struct git_hash_algo *algop,
 			      const char *name);
-struct hashfile *hashfd_throughput(const struct git_hash_algo *algop,
-				   int fd, const char *name, struct progress *tp);
 
 /*
  * Free the hashfile without flushing its contents to disk. This only

-- 
2.53.0.880.g73c4285caa.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v3 10/10] builtin/pack-objects: reduce lock contention when writing packfile data
  2026-03-10 13:24 ` [PATCH v3 00/10] upload-pack: " Patrick Steinhardt
                     ` (8 preceding siblings ...)
  2026-03-10 13:25   ` [PATCH v3 09/10] csum-file: drop `hashfd_throughput()` Patrick Steinhardt
@ 2026-03-10 13:25   ` Patrick Steinhardt
  2026-03-10 17:11   ` [PATCH v3 00/10] upload-pack: " Junio C Hamano
  2026-03-10 20:56   ` Johannes Sixt
  11 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-10 13:25 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King, Johannes Sixt

When running `git pack-objects --stdout` we feed the data through
`hashfd_ext()` with a progress meter and a smaller-than-usual buffer
length of 8kB so that we can track throughput more granularly. But as
packfiles tend to be on the larger side, this small buffer size may
cause a ton of write(3p) syscalls.

Originally, the buffer we used in `hashfd()` was 8kB for all use cases.
This was changed though in 2ca245f8be (csum-file.h: increase hashfile
buffer size, 2021-05-18) because we noticed that the number of writes
can have an impact on performance. So the buffer size was increased to
128kB, which improved performance a bit for some use cases.

But the commit didn't touch the buffer size for `hashfd_throughput()`.
The reasoning here was that callers expect the progress indicator to
update frequently, and a larger buffer size would of course reduce the
update frequency especially on slow networks.

While that is of course true, there was (and still is, even though it's
now a call to `hashfd_ext()`) only a single caller of this function in
git-pack-objects(1). This command is responsible for writing packfiles,
and those packfiles are often on the bigger side. So arguably:

  - The user won't care about increments of 8kB when packfiles tend to
    be megabytes or even gigabytes in size.

  - Reducing the number of syscalls would be even more valuable here
    than it would be for multi-pack indices, which was the benchmark
    done in the mentioned commit, as MIDXs are typically significantly
    smaller than packfiles.

  - Nowadays, many internet connections should be able to transfer data
    at a rate significantly higher than 8kB per second.

Update the buffer to instead have a size of `LARGE_PACKET_DATA_MAX - 1`,
which translates to ~64kB. This limit was chosen because `git
pack-objects --stdout` is most often used when sending packfiles via
git-upload-pack(1), where packfile data is chunked into pktlines when
using the sideband. Furthermore, most internet connections should have a
bandwidth significantly higher than 64kB/s, so we'd still be able to
observe progress updates at a rate of at least once per second.

This change significantly reduces the number of write(3p) syscalls from
355,000 to 44,000 when packing the Linux repository. While this results
in a small performance improvement on an otherwise-unused system, this
improvement is mostly negligible. More importantly though, it will
reduce lock contention in the kernel on an extremely busy system where
we have many processes writing data at once.

Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/pack-objects.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index f5cb80e870..59876b024d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -41,6 +41,7 @@
 #include "promisor-remote.h"
 #include "pack-mtimes.h"
 #include "parse-options.h"
+#include "pkt-line.h"
 #include "blob.h"
 #include "tree.h"
 #include "path-walk.h"
@@ -1332,14 +1333,17 @@ static void write_pack_file(void)
 
 		if (pack_to_stdout) {
 			/*
-			 * Since we are expecting to report progress of the
-			 * write into this hashfile, use a smaller buffer
-			 * size so the progress indicators arrive at a more
-			 * frequent rate.
+			 * This command is most often invoked via
+			 * git-upload-pack(1), which will typically chunk data
+			 * into pktlines. As such, we use the maximum data
+			 * length of them as buffer length.
+			 *
+			 * Note that we need to subtract one though to
+			 * accommodate the sideband byte.
 			 */
 			struct hashfd_options opts = {
 				.progress = progress_state,
-				.buffer_len = 8 * 1024,
+				.buffer_len = LARGE_PACKET_DATA_MAX - 1,
 			};
 			f = hashfd_ext(the_repository->hash_algo, 1,
 				       "<stdout>", &opts);

-- 
2.53.0.880.g73c4285caa.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 05/10] compat/posix: introduce writev(3p) wrapper
  2026-03-10 13:25   ` [PATCH v3 05/10] compat/posix: introduce writev(3p) wrapper Patrick Steinhardt
@ 2026-03-10 16:59     ` Junio C Hamano
  0 siblings, 0 replies; 64+ messages in thread
From: Junio C Hamano @ 2026-03-10 16:59 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: git, Matt Smiley, brian m. carlson, Jeff King, Johannes Sixt

Patrick Steinhardt <ps@pks.im> writes:

> +ssize_t git_writev(int fd, const struct iovec *iov, int iovcnt)
> +{
> +	size_t total_written = 0;
> +	size_t sum = 0;
> +
> +	/*
> +	 * According to writev(3p), the syscall shall error with EINVAL in case
> +	 * the sum of `iov_len` overflows `ssize_t`.
> +	 */
> +	 for (int i = 0; i < iovcnt; i++) {
> +		if (iov[i].iov_len > maximum_signed_value_of_type(ssize_t) ||
> +		    iov[i].iov_len + sum > maximum_signed_value_of_type(ssize_t)) {

This made me pause, but I think you cannot wrap-around size_t and
end up with a small positive result that would fit in ssize_t by
adding two quantities that are smaller than the max_ssize_t, so this
check should be sufficient.

> +			errno = EINVAL;
> +			return -1;
> +		}

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 03/10] upload-pack: prefer flushing data over sending keepalive
  2026-03-10 13:24   ` [PATCH v3 03/10] upload-pack: prefer flushing data over sending keepalive Patrick Steinhardt
@ 2026-03-10 17:09     ` Junio C Hamano
  2026-03-10 17:43       ` Patrick Steinhardt
  0 siblings, 1 reply; 64+ messages in thread
From: Junio C Hamano @ 2026-03-10 17:09 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: git, Matt Smiley, brian m. carlson, Jeff King, Johannes Sixt

Patrick Steinhardt <ps@pks.im> writes:

> When using the sideband in git-upload-pack(1) we know to send out
> keepalive packets in case generating the pack takes too long. These
> keepalives take the form of a simple empty pktline.
>
> In the preceding commit we have adapted git-upload-pack(1) to buffer
> data more aggressively before sending it to the client. This creates an
> obvious optimization opportunity: when we hit the keepalive timeout
> while we still hold on to some buffered data, then it makes more sense
> to flush out the data instead of sending the empty keepalive packet.
>
> This is overall not going to be a significant win. Most keepalives will
> come before the pack data starts, and once pack-objects starts producing
> data, it tends to do so pretty consistently. And of course we can't send
> data before we see the PACK header, because the whole point is to buffer
> the early bit waiting for packfile URIs. But the optimization is easy
> enough to realize.
>
> Do so and flush out data instead of sending an empty pktline. While at
> it, drop the useless

Useless what?

> Suggested-by: Jeff King <peff@peff.net>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  upload-pack.c | 21 +++++++++++++++------
>  1 file changed, 15 insertions(+), 6 deletions(-)
>
> diff --git a/upload-pack.c b/upload-pack.c
> index f6f380a601..7a165d226d 100644
> --- a/upload-pack.c
> +++ b/upload-pack.c
> @@ -466,18 +466,27 @@ static void create_pack_file(struct upload_pack_data *pack_data,
>  		}
>  
>  		/*
> -		 * We hit the keepalive timeout without saying anything; send
> -		 * an empty message on the data sideband just to let the other
> -		 * side know we're still working on it, but don't have any data
> -		 * yet.
> +		 * We hit the keepalive timeout without saying anything. If we
> +		 * have pending data we flush it out to the caller now.
> +		 * Otherwise, we send an empty message on the data sideband
> +		 * just to let the other side know we're still working on it,
> +		 * but don't have any data yet.
>  		 *
>  		 * If we don't have a sideband channel, there's no room in the
>  		 * protocol to say anything, so those clients are just out of
>  		 * luck.
>  		 */
>  		if (!ret && pack_data->use_sideband) {
> -			static const char buf[] = "0005\1";
> -			write_or_die(1, buf, 5);
> +			if (output_state->packfile_started && output_state->used > 1) {
> +				send_client_data(1, output_state->buffer, output_state->used - 1,
> +						 pack_data->use_sideband);
> +				output_state->buffer[0] = output_state->buffer[output_state->used - 1];
> +				output_state->used = 1;
> +			} else {
> +				static const char buf[] = "0005\1";
> +				write_or_die(1, buf, 5);
> +			}
> +

OK.  The new part of the comment, "If we have pending data, flush
it", is what the new code is doing (i.e., flush all the bytes before
the final byte, and make the buffer hold only that final byte we
kept for ourselves).  Otherwise we give a keepalive packet (i.e., 0
byte thrown at the sideband #1) as before.

OK.

>  			last_sent_ms = now_ms;
>  		}
>  	}

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 00/10] upload-pack: reduce lock contention when writing packfile data
  2026-03-10 13:24 ` [PATCH v3 00/10] upload-pack: " Patrick Steinhardt
                     ` (9 preceding siblings ...)
  2026-03-10 13:25   ` [PATCH v3 10/10] builtin/pack-objects: reduce lock contention when writing packfile data Patrick Steinhardt
@ 2026-03-10 17:11   ` Junio C Hamano
  2026-03-10 20:56   ` Johannes Sixt
  11 siblings, 0 replies; 64+ messages in thread
From: Junio C Hamano @ 2026-03-10 17:11 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: git, Matt Smiley, brian m. carlson, Jeff King, Johannes Sixt

Patrick Steinhardt <ps@pks.im> writes:

> Changes in v3:
>   - Fix handling of `iov_len` overflows in writev(3p) wrapper.
>   - Add another patch that causes us to flush out data instead of
>     sending a 0005 keepalive packet.
>   - Link to v2: https://lore.kernel.org/r/20260303-pks-upload-pack-write-contention-v2-0-7321830f08fe@pks.im

Except for the tail part of the proposed log message of [PATCH 03/10]
that I couldn't understand, all the differences since v2 looked
sensible to me.

Thanks.  Will replace.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 03/10] upload-pack: prefer flushing data over sending keepalive
  2026-03-10 17:09     ` Junio C Hamano
@ 2026-03-10 17:43       ` Patrick Steinhardt
  0 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-10 17:43 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Matt Smiley, brian m. carlson, Jeff King, Johannes Sixt

On Tue, Mar 10, 2026 at 10:09:51AM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > When using the sideband in git-upload-pack(1) we know to send out
> > keepalive packets in case generating the pack takes too long. These
> > keepalives take the form of a simple empty pktline.
> >
> > In the preceding commit we have adapted git-upload-pack(1) to buffer
> > data more aggressively before sending it to the client. This creates an
> > obvious optimization opportunity: when we hit the keepalive timeout
> > while we still hold on to some buffered data, then it makes more sense
> > to flush out the data instead of sending the empty keepalive packet.
> >
> > This is overall not going to be a significant win. Most keepalives will
> > come before the pack data starts, and once pack-objects starts producing
> > data, it tends to do so pretty consistently. And of course we can't send
> > data before we see the PACK header, because the whole point is to buffer
> > the early bit waiting for packfile URIs. But the optimization is easy
> > enough to realize.
> >
> > Do so and flush out data instead of sending an empty pktline. While at
> > it, drop the useless
> 
> Useless what?

Ugh. I was initially turning the allocation of `output_state` into an
on-stack variable only to later realize that we explicitly allocate it
because it might otherwise blow the stack. Seems like I didn't manage to
fully drop the sentence that mentioned this change.

I'll remove this half-sentence.

Patrick

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 00/10] upload-pack: reduce lock contention when writing packfile data
  2026-03-10 13:24 ` [PATCH v3 00/10] upload-pack: " Patrick Steinhardt
                     ` (10 preceding siblings ...)
  2026-03-10 17:11   ` [PATCH v3 00/10] upload-pack: " Junio C Hamano
@ 2026-03-10 20:56   ` Johannes Sixt
  2026-03-11  6:27     ` Patrick Steinhardt
  11 siblings, 1 reply; 64+ messages in thread
From: Johannes Sixt @ 2026-03-10 20:56 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Matt Smiley, brian m. carlson, Jeff King, git

Am 10.03.26 um 14:24 schrieb Patrick Steinhardt:
>       compat/posix: introduce writev(3p) wrapper
>       wrapper: introduce writev(3p) wrappers

I looked at these two patches.

At first, I was thrown off by the early exits when bytes_written == 0.
But I convinced myself that it makes sense:

- In the emulation, if the underlying write(2) does return 0, it makes
sense to signal a partial successful write to the caller of writev. It's
the best we can do when earlier blocks have already been written. A
return value of 0 is strange and may be an indication of future
problems, so let's just stop the work we are doing and let the upper
layers deal with the problem.

- In writev_in_full we need to treat the case specially in order to
guarantee forward progress.  Treating it as ENOSPC is plucked out of
thin air, I guess, but not unreasonable.

In conclusion, these patches look good.

-- Hannes


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 00/10] upload-pack: reduce lock contention when writing packfile data
  2026-03-10 20:56   ` Johannes Sixt
@ 2026-03-11  6:27     ` Patrick Steinhardt
  0 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-11  6:27 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: Matt Smiley, brian m. carlson, Jeff King, git

On Tue, Mar 10, 2026 at 09:56:14PM +0100, Johannes Sixt wrote:
> Am 10.03.26 um 14:24 schrieb Patrick Steinhardt:
> >       compat/posix: introduce writev(3p) wrapper
> >       wrapper: introduce writev(3p) wrappers
> 
> I looked at these two patches.
> 
> At first, I was thrown off by the early exits when bytes_written == 0.
> But I convinced myself that it makes sense:
> 
> - In the emulation, if the underlying write(2) does return 0, it makes
> sense to signal a partial successful write to the caller of writev. It's
> the best we can do when earlier blocks have already been written. A
> return value of 0 is strange and may be an indication of future
> problems, so let's just stop the work we are doing and let the upper
> layers deal with the problem.
> 
> - In writev_in_full we need to treat the case specially in order to
> guarantee forward progress.  Treating it as ENOSPC is plucked out of
> thin air, I guess, but not unreasonable.

This part is basically copied from `write_in_full()`, which does the
same.

> In conclusion, these patches look good.

Thanks!

Patrick

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v4 00/10] upload-pack: reduce lock contention when writing packfile data
  2026-02-27 11:22 [PATCH 0/2] upload-pack: reduce lock contention when writing packfile data Patrick Steinhardt
                   ` (3 preceding siblings ...)
  2026-03-10 13:24 ` [PATCH v3 00/10] upload-pack: " Patrick Steinhardt
@ 2026-03-13  6:45 ` Patrick Steinhardt
  2026-03-13  6:45   ` [PATCH v4 01/10] upload-pack: fix debug statement when flushing " Patrick Steinhardt
                     ` (9 more replies)
  4 siblings, 10 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-13  6:45 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King, Johannes Sixt

Hi,

this small patch series fixes some heavy lock contention when writing
data from git-upload-pack(1) into pipes. This lock contention can be
observed when having hundreds of git-upload-pack(1) processes active at
the same time that write data into pipes at dozens of gigabits per
second.

I have uploaded the flame graph that clearly shows the lock contention
at [1].

Changes in v4:
  - Drop a stale half-sentence that I wanted to remove.
  - Link to v3: https://lore.kernel.org/r/20260310-pks-upload-pack-write-contention-v3-0-8bc97aa3e267@pks.im

Changes in v3:
  - Fix handling of `iov_len` overflows in writev(3p) wrapper.
  - Add another patch that causes us to flush out data instead of
    sending a 0005 keepalive packet.
  - Link to v2: https://lore.kernel.org/r/20260303-pks-upload-pack-write-contention-v2-0-7321830f08fe@pks.im

Changes in v2:
  - Change the buffer size in git-pack-objects(1) to also reduce the
    number of write syscalls over there.
  - Introduce writev to halve the number of syscalls when writing
    pktlines.
  - Use `sizeof(os->buffer)` instead of open-coding its size.
  - Improve keepalive logic in git-upload-pack(1) to account for
    buffering.
  - Link to v1: https://lore.kernel.org/r/20260227-pks-upload-pack-write-contention-v1-0-7166fe255704@pks.im

Thanks!

Patrick

[1]: https://gitlab.com/gitlab-org/git/-/work_items/675

---
Patrick Steinhardt (10):
      upload-pack: fix debug statement when flushing packfile data
      upload-pack: adapt keepalives based on buffering
      upload-pack: prefer flushing data over sending keepalive
      upload-pack: reduce lock contention when writing packfile data
      compat/posix: introduce writev(3p) wrapper
      wrapper: introduce writev(3p) wrappers
      sideband: use writev(3p) to send pktlines
      csum-file: introduce `hashfd_ext()`
      csum-file: drop `hashfd_throughput()`
      builtin/pack-objects: reduce lock contention when writing packfile data

 Makefile               |  4 +++
 builtin/pack-objects.c | 23 +++++++++++---
 compat/posix.h         | 14 +++++++++
 compat/writev.c        | 44 +++++++++++++++++++++++++++
 config.mak.uname       |  2 ++
 csum-file.c            | 28 +++++------------
 csum-file.h            | 16 ++++++++--
 meson.build            |  1 +
 sideband.c             | 14 +++++++--
 upload-pack.c          | 81 +++++++++++++++++++++++++++++++++++++++-----------
 wrapper.c              | 41 +++++++++++++++++++++++++
 wrapper.h              |  9 ++++++
 write-or-die.c         |  8 +++++
 write-or-die.h         |  1 +
 14 files changed, 239 insertions(+), 47 deletions(-)

Range-diff versus v3:

 1:  65471f969b =  1:  f9896a8451 upload-pack: fix debug statement when flushing packfile data
 2:  c5705c1cb1 =  2:  b1bc6f5749 upload-pack: adapt keepalives based on buffering
 3:  f34fa584f4 !  3:  fcf5f06375 upload-pack: prefer flushing data over sending keepalive
    @@ Commit message
         the early bit waiting for packfile URIs. But the optimization is easy
         enough to realize.
     
    -    Do so and flush out data instead of sending an empty pktline. While at
    -    it, drop the useless
    +    Do so and flush out data instead of sending an empty pktline.
     
         Suggested-by: Jeff King <peff@peff.net>
         Signed-off-by: Patrick Steinhardt <ps@pks.im>
 4:  8d08e93929 =  4:  f2bb7a38aa upload-pack: reduce lock contention when writing packfile data
 5:  aa42217e43 =  5:  78a1bbb810 compat/posix: introduce writev(3p) wrapper
 6:  c5d194ca2b =  6:  c199dd398a wrapper: introduce writev(3p) wrappers
 7:  e4df805840 =  7:  b7fe2b818f sideband: use writev(3p) to send pktlines
 8:  16f9968bcd =  8:  b9d33b93cd csum-file: introduce `hashfd_ext()`
 9:  47ae58440b =  9:  b2af62c4ff csum-file: drop `hashfd_throughput()`
10:  c92ccf9df2 = 10:  47f5090b66 builtin/pack-objects: reduce lock contention when writing packfile data

---
base-commit: fb1b83bcddff60463f6e86bb021784c88d0b748c
change-id: 20260227-pks-upload-pack-write-contention-435ce01f5fe9


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v4 01/10] upload-pack: fix debug statement when flushing packfile data
  2026-03-13  6:45 ` [PATCH v4 " Patrick Steinhardt
@ 2026-03-13  6:45   ` Patrick Steinhardt
  2026-03-13  6:45   ` [PATCH v4 02/10] upload-pack: adapt keepalives based on buffering Patrick Steinhardt
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-13  6:45 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King, Johannes Sixt

When git-upload-pack(1) writes packfile data to the client we have some
logic in place that buffers some partial lines. When that buffer still
contains data after git-pack-objects(1) has finished we flush the buffer
so that all remaining bytes are sent out.

Curiously, when we do so we also print the string "flushed." to stderr.
This statement has been introduced in b1c71b7281 (upload-pack: avoid
sending an incomplete pack upon failure, 2006-06-20), so quite a while
ago. What's interesting though is that stderr is typically spliced
through to the client-side, and consequently the client would see this
message. Munging the way we do the caching indeed confirms this:

  $ git clone file:///home/pks/Development/linux/
  Cloning into bare repository 'linux.git'...
  remote: Enumerating objects: 12980346, done.
  remote: Counting objects: 100% (131820/131820), done.
  remote: Compressing objects: 100% (50290/50290), done.
  remote: Total 12980346 (delta 96319), reused 104500 (delta 81217), pack-reused 12848526 (from 1)
  Receiving objects: 100% (12980346/12980346), 3.23 GiB | 57.44 MiB/s, done.
  flushed.
  Resolving deltas: 100% (10676718/10676718), done.

It's quite clear that this string shouldn't ever be visible to the
client, so it rather feels like this is a left-over debug statement. The
mentioned commit doesn't discuss this line, either.

Remove the debug output to prepare for a change in how we do the
buffering in the next commit.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 upload-pack.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/upload-pack.c b/upload-pack.c
index e8c5cce1c7..b3a8561ef5 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -457,11 +457,9 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 	}
 
 	/* flush the data */
-	if (output_state->used > 0) {
+	if (output_state->used > 0)
 		send_client_data(1, output_state->buffer, output_state->used,
 				 pack_data->use_sideband);
-		fprintf(stderr, "flushed.\n");
-	}
 	free(output_state);
 	if (pack_data->use_sideband)
 		packet_flush(1);

-- 
2.53.0.904.g2727be2e99.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v4 02/10] upload-pack: adapt keepalives based on buffering
  2026-03-13  6:45 ` [PATCH v4 " Patrick Steinhardt
  2026-03-13  6:45   ` [PATCH v4 01/10] upload-pack: fix debug statement when flushing " Patrick Steinhardt
@ 2026-03-13  6:45   ` Patrick Steinhardt
  2026-03-13  6:45   ` [PATCH v4 03/10] upload-pack: prefer flushing data over sending keepalive Patrick Steinhardt
                     ` (7 subsequent siblings)
  9 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-13  6:45 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King, Johannes Sixt

The function `create_pack_file()` is responsible for sending the
packfile data to the client of git-upload-pack(1). As generating the
bytes may take significant computing resources we also have a mechanism
in place that optionally sends keepalive pktlines in case we haven't
sent out any data.

The keepalive logic is purely based on poll(3p): we pass a timeout to that
syscall, and if the call times out we send out the keepalive pktline.
While reasonable, this logic isn't entirely sufficient: even if the call
to poll(3p) ends because we have received data on any of the file
descriptors we may not necessarily send data to the client.

The most important edge case here happens in `relay_pack_data()`. When
we haven't seen the initial "PACK" signature from git-pack-objects(1)
yet we buffer incoming data. So in the worst case, if each of the bytes
of that signature arrive shortly before the configured keepalive
timeout, then we may not send out any data for a time period that is
(almost) four times as long as the configured timeout.

This edge case is rather unlikely to matter in practice. But in a
subsequent commit we're going to adapt our buffering mechanism to become
more aggressive, which makes it more likely that we don't send any data
for an extended amount of time.

Adapt the logic so that instead of using a fixed timeout on every call
to poll(3p), we compute the remaining timeout from how much time has
passed since we last sent data.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 upload-pack.c | 49 ++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 40 insertions(+), 9 deletions(-)

diff --git a/upload-pack.c b/upload-pack.c
index b3a8561ef5..f6f380a601 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -29,6 +29,7 @@
 #include "commit-graph.h"
 #include "commit-reach.h"
 #include "shallow.h"
+#include "trace.h"
 #include "write-or-die.h"
 #include "json-writer.h"
 #include "strmap.h"
@@ -218,7 +219,8 @@ struct output_state {
 };
 
 static int relay_pack_data(int pack_objects_out, struct output_state *os,
-			   int use_sideband, int write_packfile_line)
+			   int use_sideband, int write_packfile_line,
+			   bool *did_send_data)
 {
 	/*
 	 * We keep the last byte to ourselves
@@ -232,6 +234,8 @@ static int relay_pack_data(int pack_objects_out, struct output_state *os,
 	 */
 	ssize_t readsz;
 
+	*did_send_data = false;
+
 	readsz = xread(pack_objects_out, os->buffer + os->used,
 		       sizeof(os->buffer) - os->used);
 	if (readsz < 0) {
@@ -247,6 +251,7 @@ static int relay_pack_data(int pack_objects_out, struct output_state *os,
 				if (os->packfile_uris_started)
 					packet_delim(1);
 				packet_write_fmt(1, "\1packfile\n");
+				*did_send_data = true;
 			}
 			break;
 		}
@@ -259,6 +264,7 @@ static int relay_pack_data(int pack_objects_out, struct output_state *os,
 			}
 			*p = '\0';
 			packet_write_fmt(1, "\1%s\n", os->buffer);
+			*did_send_data = true;
 
 			os->used -= p - os->buffer + 1;
 			memmove(os->buffer, p + 1, os->used);
@@ -279,6 +285,7 @@ static int relay_pack_data(int pack_objects_out, struct output_state *os,
 		os->used = 0;
 	}
 
+	*did_send_data = true;
 	return readsz;
 }
 
@@ -290,6 +297,7 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 	char progress[128];
 	char abort_msg[] = "aborting due to possible repository "
 		"corruption on the remote side.";
+	uint64_t last_sent_ms = 0;
 	ssize_t sz;
 	int i;
 	FILE *pipe_fd;
@@ -365,10 +373,14 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 	 */
 
 	while (1) {
+		uint64_t now_ms = getnanotime() / 1000000;
 		struct pollfd pfd[2];
-		int pe, pu, pollsize, polltimeout;
+		int pe, pu, pollsize, polltimeout_ms;
 		int ret;
 
+		if (!last_sent_ms)
+			last_sent_ms = now_ms;
+
 		reset_timeout(pack_data->timeout);
 
 		pollsize = 0;
@@ -390,11 +402,21 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 		if (!pollsize)
 			break;
 
-		polltimeout = pack_data->keepalive < 0
-			? -1
-			: 1000 * pack_data->keepalive;
+		if (pack_data->keepalive < 0) {
+			polltimeout_ms = -1;
+		} else {
+			/*
+			 * The polling timeout needs to be adjusted based on
+			 * the time at which we last sent a packet. The longer
+			 * it's been in the past, the shorter the timeout
+			 * becomes until we eventually don't block at all.
+			 */
+			polltimeout_ms = 1000 * pack_data->keepalive - (now_ms - last_sent_ms);
+			if (polltimeout_ms < 0)
+				polltimeout_ms = 0;
+		}
 
-		ret = poll(pfd, pollsize, polltimeout);
+		ret = poll(pfd, pollsize, polltimeout_ms);
 
 		if (ret < 0) {
 			if (errno != EINTR) {
@@ -403,16 +425,18 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 			}
 			continue;
 		}
+
 		if (0 <= pe && (pfd[pe].revents & (POLLIN|POLLHUP))) {
 			/* Status ready; we ship that in the side-band
 			 * or dump to the standard error.
 			 */
 			sz = xread(pack_objects.err, progress,
 				  sizeof(progress));
-			if (0 < sz)
+			if (0 < sz) {
 				send_client_data(2, progress, sz,
 						 pack_data->use_sideband);
-			else if (sz == 0) {
+				last_sent_ms = now_ms;
+			} else if (sz == 0) {
 				close(pack_objects.err);
 				pack_objects.err = -1;
 			}
@@ -421,11 +445,14 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 			/* give priority to status messages */
 			continue;
 		}
+
 		if (0 <= pu && (pfd[pu].revents & (POLLIN|POLLHUP))) {
+			bool did_send_data;
 			int result = relay_pack_data(pack_objects.out,
 						     output_state,
 						     pack_data->use_sideband,
-						     !!uri_protocols);
+						     !!uri_protocols,
+						     &did_send_data);
 
 			if (result == 0) {
 				close(pack_objects.out);
@@ -433,6 +460,9 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 			} else if (result < 0) {
 				goto fail;
 			}
+
+			if (did_send_data)
+				last_sent_ms = now_ms;
 		}
 
 		/*
@@ -448,6 +478,7 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 		if (!ret && pack_data->use_sideband) {
 			static const char buf[] = "0005\1";
 			write_or_die(1, buf, 5);
+			last_sent_ms = now_ms;
 		}
 	}
 

-- 
2.53.0.904.g2727be2e99.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v4 03/10] upload-pack: prefer flushing data over sending keepalive
  2026-03-13  6:45 ` [PATCH v4 " Patrick Steinhardt
  2026-03-13  6:45   ` [PATCH v4 01/10] upload-pack: fix debug statement when flushing " Patrick Steinhardt
  2026-03-13  6:45   ` [PATCH v4 02/10] upload-pack: adapt keepalives based on buffering Patrick Steinhardt
@ 2026-03-13  6:45   ` Patrick Steinhardt
  2026-03-13  6:45   ` [PATCH v4 04/10] upload-pack: reduce lock contention when writing packfile data Patrick Steinhardt
                     ` (6 subsequent siblings)
  9 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-13  6:45 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King, Johannes Sixt

When using the sideband in git-upload-pack(1) we know to send out
keepalive packets in case generating the pack takes too long. These
keepalives take the form of a simple empty pktline.

In the preceding commit we have adapted git-upload-pack(1) to buffer
data more aggressively before sending it to the client. This creates an
obvious optimization opportunity: when we hit the keepalive timeout
while we still hold on to some buffered data, then it makes more sense
to flush out the data instead of sending the empty keepalive packet.

This is overall not going to be a significant win. Most keepalives will
come before the pack data starts, and once pack-objects starts producing
data, it tends to do so pretty consistently. And of course we can't send
data before we see the PACK header, because the whole point is to buffer
the early bit waiting for packfile URIs. But the optimization is easy
enough to realize.

Do so and flush out data instead of sending an empty pktline.

Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 upload-pack.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/upload-pack.c b/upload-pack.c
index f6f380a601..7a165d226d 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -466,18 +466,27 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 		}
 
 		/*
-		 * We hit the keepalive timeout without saying anything; send
-		 * an empty message on the data sideband just to let the other
-		 * side know we're still working on it, but don't have any data
-		 * yet.
+		 * We hit the keepalive timeout without saying anything. If we
+		 * have pending data we flush it out to the caller now.
+		 * Otherwise, we send an empty message on the data sideband
+		 * just to let the other side know we're still working on it,
+		 * but don't have any data yet.
 		 *
 		 * If we don't have a sideband channel, there's no room in the
 		 * protocol to say anything, so those clients are just out of
 		 * luck.
 		 */
 		if (!ret && pack_data->use_sideband) {
-			static const char buf[] = "0005\1";
-			write_or_die(1, buf, 5);
+			if (output_state->packfile_started && output_state->used > 1) {
+				send_client_data(1, output_state->buffer, output_state->used - 1,
+						 pack_data->use_sideband);
+				output_state->buffer[0] = output_state->buffer[output_state->used - 1];
+				output_state->used = 1;
+			} else {
+				static const char buf[] = "0005\1";
+				write_or_die(1, buf, 5);
+			}
+
 			last_sent_ms = now_ms;
 		}
 	}

-- 
2.53.0.904.g2727be2e99.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v4 04/10] upload-pack: reduce lock contention when writing packfile data
  2026-03-13  6:45 ` [PATCH v4 " Patrick Steinhardt
                     ` (2 preceding siblings ...)
  2026-03-13  6:45   ` [PATCH v4 03/10] upload-pack: prefer flushing data over sending keepalive Patrick Steinhardt
@ 2026-03-13  6:45   ` Patrick Steinhardt
  2026-03-13  6:45   ` [PATCH v4 05/10] compat/posix: introduce writev(3p) wrapper Patrick Steinhardt
                     ` (5 subsequent siblings)
  9 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-13  6:45 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King, Johannes Sixt

In our production systems we have recently observed write contention in
git-upload-pack(1). The system in question was consistently streaming
packfiles at a rate of dozens of gigabits per second, but curiously the
system was neither bottlenecked on CPU, memory or IOPS.

We eventually discovered that Git was spending 80% of its time in
`pipe_write()`, out of which almost all of the time was spent in the
`ep_poll_callback` function in the kernel. Quoting the reporter:

  This infrastructure is part of an event notification queue designed to
  allow for multiple producers to emit events, but that concurrency
  safety is guarded by 3 layers of locking. The layer we're hitting
  contention in uses a simple reader/writer lock mode (a.k.a. shared
  versus exclusive mode), where producers need shared-mode (read mode),
  and various other actions use exclusive (write) mode.

The system in question generates workloads where we have hundreds of
git-upload-pack(1) processes active at the same point in time. These
processes end up contending around those locks, and the consequence is
that the Git processes stall.

Now git-upload-pack(1) already has the infrastructure in place to buffer
some of the data it reads from git-pack-objects(1) before actually
sending it out. We only use this infrastructure in very limited ways
though, so we generally end up matching one read(3p) call with one
write(3p) call. Even worse, when the sideband is enabled we end up
matching one read with _two_ writes: one for the pkt-line length, and
one for the packfile data.

Extend our use of the buffering infrastructure so that we soak up bytes
until the buffer is filled up at least 2/3rds of its capacity. The
change is relatively simple to implement as we already know to flush the
buffer in `create_pack_file()` after git-pack-objects(1) has finished.

This significantly reduces the number of write(3p) syscalls we need to
do. Before this change, cloning the Linux repository resulted in around
400,000 write(3p) syscalls. With the buffering in place we only do
around 130,000 syscalls.

Now we could of course go even further and make sure that we always fill
up the whole buffer. But this might cause an increase in read(3p)
syscalls, and some tests show that this only reduces the number of
write(3p) syscalls from 130,000 to 100,000. So overall this doesn't seem
worth it.
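The flush heuristic described above can be sketched as a small predicate. This is an illustrative stand-in, not the actual upload-pack code; `should_keep_buffering()` and its parameters are hypothetical names, with `buffer_size` standing in for `sizeof(os->buffer)`.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Keep soaking up bytes until the buffer is at least two thirds
 * full. We only defer flushing while the upstream read still
 * produced data; once reads stop, the caller flushes what's left.
 */
static int should_keep_buffering(size_t bytes_read, size_t used,
				 size_t buffer_size)
{
	return bytes_read > 0 && used < buffer_size * 2 / 3;
}
```

With an 8kB buffer this defers the write until roughly 5.4kB have accumulated, which is how one read(3p) stops translating into one write(3p).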

Note that the issue could also be fixed by adapting the write buffer
that we use in the downstream git-pack-objects(1) command, and such a
change would have roughly the same result. But the command that
generates the packfile data may not always be git-pack-objects(1) as it
can be changed via "uploadpack.packObjectsHook", so such a fix would
only help in _some_ cases. Regardless of that, we'll also adapt the
write buffer size of git-pack-objects(1) in a subsequent commit.

Helped-by: Matt Smiley <msmiley@gitlab.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 upload-pack.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/upload-pack.c b/upload-pack.c
index 7a165d226d..9f6d6fe48c 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -276,6 +276,13 @@ static int relay_pack_data(int pack_objects_out, struct output_state *os,
 		}
 	}
 
+	/*
+	 * Make sure that we buffer some data before sending it to the client.
+	 * This significantly reduces the number of write(3p) syscalls.
+	 */
+	if (readsz && os->used < (sizeof(os->buffer) * 2 / 3))
+		return readsz;
+
 	if (os->used > 1) {
 		send_client_data(1, os->buffer, os->used - 1, use_sideband);
 		os->buffer[0] = os->buffer[os->used - 1];

-- 
2.53.0.904.g2727be2e99.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v4 05/10] compat/posix: introduce writev(3p) wrapper
  2026-03-13  6:45 ` [PATCH v4 " Patrick Steinhardt
                     ` (3 preceding siblings ...)
  2026-03-13  6:45   ` [PATCH v4 04/10] upload-pack: reduce lock contention when writing packfile data Patrick Steinhardt
@ 2026-03-13  6:45   ` Patrick Steinhardt
  2026-03-13  6:45   ` [PATCH v4 06/10] wrapper: introduce writev(3p) wrappers Patrick Steinhardt
                     ` (4 subsequent siblings)
  9 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-13  6:45 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King, Johannes Sixt

In a subsequent commit we're going to add the first caller to
writev(3p). Introduce a compatibility wrapper for this syscall that we
can use on systems that don't have this syscall.

The syscall exists on modern Unixes like Linux and macOS, and seemingly
even for NonStop according to [1]. It doesn't seem to exist on Windows
though.

[1]: http://nonstoptools.com/manuals/OSS-SystemCalls.pdf
[2]: https://www.gnu.org/software/gnulib/manual/html_node/writev.html

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 Makefile         |  4 ++++
 compat/posix.h   | 14 ++++++++++++++
 compat/writev.c  | 44 ++++++++++++++++++++++++++++++++++++++++++++
 config.mak.uname |  2 ++
 meson.build      |  1 +
 5 files changed, 65 insertions(+)

diff --git a/Makefile b/Makefile
index f3264d0a37..493851162d 100644
--- a/Makefile
+++ b/Makefile
@@ -2021,6 +2021,10 @@ ifdef NO_PREAD
 	COMPAT_CFLAGS += -DNO_PREAD
 	COMPAT_OBJS += compat/pread.o
 endif
+ifdef NO_WRITEV
+	COMPAT_CFLAGS += -DNO_WRITEV
+	COMPAT_OBJS += compat/writev.o
+endif
 ifdef NO_FAST_WORKING_DIRECTORY
 	BASIC_CFLAGS += -DNO_FAST_WORKING_DIRECTORY
 endif
diff --git a/compat/posix.h b/compat/posix.h
index 245386fa4a..3c611d2736 100644
--- a/compat/posix.h
+++ b/compat/posix.h
@@ -137,6 +137,9 @@
 #include <sys/socket.h>
 #include <sys/ioctl.h>
 #include <sys/statvfs.h>
+#ifndef NO_WRITEV
+#include <sys/uio.h>
+#endif
 #include <termios.h>
 #ifndef NO_SYS_SELECT_H
 #include <sys/select.h>
@@ -323,6 +326,17 @@ int git_lstat(const char *, struct stat *);
 ssize_t git_pread(int fd, void *buf, size_t count, off_t offset);
 #endif
 
+#ifdef NO_WRITEV
+#define writev git_writev
+#define iovec git_iovec
+struct git_iovec {
+	void *iov_base;
+	size_t iov_len;
+};
+
+ssize_t git_writev(int fd, const struct iovec *iov, int iovcnt);
+#endif
+
 #ifdef NO_SETENV
 #define setenv gitsetenv
 int gitsetenv(const char *, const char *, int);
diff --git a/compat/writev.c b/compat/writev.c
new file mode 100644
index 0000000000..3a94870a2f
--- /dev/null
+++ b/compat/writev.c
@@ -0,0 +1,44 @@
+#include "../git-compat-util.h"
+#include "../wrapper.h"
+
+ssize_t git_writev(int fd, const struct iovec *iov, int iovcnt)
+{
+	size_t total_written = 0;
+	size_t sum = 0;
+
+	/*
+	 * According to writev(3p), the syscall shall error with EINVAL in case
+	 * the sum of `iov_len` overflows `ssize_t`.
+	 */
+	for (int i = 0; i < iovcnt; i++) {
+		if (iov[i].iov_len > maximum_signed_value_of_type(ssize_t) ||
+		    iov[i].iov_len + sum > maximum_signed_value_of_type(ssize_t)) {
+			errno = EINVAL;
+			return -1;
+		}
+
+		sum += iov[i].iov_len;
+	}
+
+	for (int i = 0; i < iovcnt; i++) {
+		const char *bytes = iov[i].iov_base;
+		size_t iovec_written = 0;
+
+		while (iovec_written < iov[i].iov_len) {
+			ssize_t bytes_written = xwrite(fd, bytes + iovec_written,
+						       iov[i].iov_len - iovec_written);
+			if (bytes_written < 0) {
+				if (total_written)
+					goto out;
+				return bytes_written;
+			}
+			if (!bytes_written)
+				goto out;
+			iovec_written += bytes_written;
+			total_written += bytes_written;
+		}
+	}
+
+out:
+	return (ssize_t) total_written;
+}
diff --git a/config.mak.uname b/config.mak.uname
index 5feb582558..ccb3f71881 100644
--- a/config.mak.uname
+++ b/config.mak.uname
@@ -459,6 +459,7 @@ ifeq ($(uname_S),Windows)
 	SANE_TOOL_PATH ?= $(msvc_bin_dir_msys)
 	HAVE_ALLOCA_H = YesPlease
 	NO_PREAD = YesPlease
+	NO_WRITEV = YesPlease
 	NEEDS_CRYPTO_WITH_SSL = YesPlease
 	NO_LIBGEN_H = YesPlease
 	NO_POLL = YesPlease
@@ -674,6 +675,7 @@ ifeq ($(uname_S),MINGW)
 	pathsep = ;
 	HAVE_ALLOCA_H = YesPlease
 	NO_PREAD = YesPlease
+	NO_WRITEV = YesPlease
 	NEEDS_CRYPTO_WITH_SSL = YesPlease
 	NO_LIBGEN_H = YesPlease
 	NO_POLL = YesPlease
diff --git a/meson.build b/meson.build
index 4b536e0124..381974ab57 100644
--- a/meson.build
+++ b/meson.build
@@ -1414,6 +1414,7 @@ checkfuncs = {
   'initgroups' : [],
   'strtoumax' : ['strtoumax.c', 'strtoimax.c'],
   'pread' : ['pread.c'],
+  'writev' : ['writev.c'],
 }
 
 if host_machine.system() == 'windows'

-- 
2.53.0.904.g2727be2e99.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v4 06/10] wrapper: introduce writev(3p) wrappers
  2026-03-13  6:45 ` [PATCH v4 " Patrick Steinhardt
                     ` (4 preceding siblings ...)
  2026-03-13  6:45   ` [PATCH v4 05/10] compat/posix: introduce writev(3p) wrapper Patrick Steinhardt
@ 2026-03-13  6:45   ` Patrick Steinhardt
  2026-03-13  6:45   ` [PATCH v4 07/10] sideband: use writev(3p) to send pktlines Patrick Steinhardt
                     ` (3 subsequent siblings)
  9 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-13  6:45 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King, Johannes Sixt

In the preceding commit we have added a compatibility wrapper for the
writev(3p) syscall. Introduce some generic wrappers for this function
that we nowadays take for granted in the Git codebase.
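The subtle part of such a wrapper is the bookkeeping after a short write: fully written iovec entries must be dropped and the first partially written entry trimmed before retrying. A minimal sketch of just that step, with an illustrative `struct io_vec` stand-in and a hypothetical `advance_iovecs()` helper (not the actual wrapper.c code):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Minimal stand-in for struct iovec, to keep the sketch self-contained. */
struct io_vec {
	char *iov_base;
	size_t iov_len;
};

/*
 * After a short write of `written` bytes, discard entries that were
 * written in full and trim the first partially written one. Returns
 * the number of entries still pending.
 */
static int advance_iovecs(struct io_vec **iovp, int iovcnt, size_t written)
{
	struct io_vec *iov = *iovp;

	while (iovcnt && written >= iov->iov_len) {
		written -= iov->iov_len;
		iov++;
		iovcnt--;
	}
	if (iovcnt && written) {
		iov->iov_base += written;
		iov->iov_len -= written;
	}
	*iovp = iov;
	return iovcnt;
}
```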

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 wrapper.c      | 41 +++++++++++++++++++++++++++++++++++++++++
 wrapper.h      |  9 +++++++++
 write-or-die.c |  8 ++++++++
 write-or-die.h |  1 +
 4 files changed, 59 insertions(+)

diff --git a/wrapper.c b/wrapper.c
index 16f5a63fbb..be8fa575e6 100644
--- a/wrapper.c
+++ b/wrapper.c
@@ -323,6 +323,47 @@ ssize_t write_in_full(int fd, const void *buf, size_t count)
 	return total;
 }
 
+ssize_t writev_in_full(int fd, struct iovec *iov, int iovcnt)
+{
+	ssize_t total_written = 0;
+
+	while (iovcnt) {
+		ssize_t bytes_written = writev(fd, iov, iovcnt);
+		if (bytes_written < 0) {
+			if (errno == EINTR || errno == EAGAIN)
+				continue;
+			return -1;
+		}
+		if (!bytes_written) {
+			errno = ENOSPC;
+			return -1;
+		}
+
+		total_written += bytes_written;
+
+		/*
+		 * We first need to discard any iovec entities that have been
+		 * fully written.
+		 */
+		while (iovcnt && (size_t)bytes_written >= iov->iov_len) {
+			bytes_written -= iov->iov_len;
+			iov++;
+			iovcnt--;
+		}
+
+		/*
+		 * Finally, we need to adjust the last iovec in case we have
+		 * performed a partial write.
+		 */
+		if (iovcnt && bytes_written) {
+			iov->iov_base = (char *) iov->iov_base + bytes_written;
+			iov->iov_len -= bytes_written;
+		}
+	}
+
+	return total_written;
+}
+
 ssize_t pread_in_full(int fd, void *buf, size_t count, off_t offset)
 {
 	char *p = buf;
diff --git a/wrapper.h b/wrapper.h
index 15ac3bab6e..27519b32d1 100644
--- a/wrapper.h
+++ b/wrapper.h
@@ -47,6 +47,15 @@ ssize_t read_in_full(int fd, void *buf, size_t count);
 ssize_t write_in_full(int fd, const void *buf, size_t count);
 ssize_t pread_in_full(int fd, void *buf, size_t count, off_t offset);
 
+/*
+ * Try to write all iovecs. Returns -1 in case an error occurred with a proper
+ * errno set, the number of bytes written otherwise.
+ *
+ * Note that the iovec will be modified as a result of this call to adjust for
+ * partial writes!
+ */
+ssize_t writev_in_full(int fd, struct iovec *iov, int iovcnt);
+
 static inline ssize_t write_str_in_full(int fd, const char *str)
 {
 	return write_in_full(fd, str, strlen(str));
diff --git a/write-or-die.c b/write-or-die.c
index 01a9a51fa2..5f522fb728 100644
--- a/write-or-die.c
+++ b/write-or-die.c
@@ -96,6 +96,14 @@ void write_or_die(int fd, const void *buf, size_t count)
 	}
 }
 
+void writev_or_die(int fd, struct iovec *iov, int iovlen)
+{
+	if (writev_in_full(fd, iov, iovlen) < 0) {
+		check_pipe(errno);
+		die_errno("writev error");
+	}
+}
+
 void fwrite_or_die(FILE *f, const void *buf, size_t count)
 {
 	if (fwrite(buf, 1, count, f) != count)
diff --git a/write-or-die.h b/write-or-die.h
index 65a5c42a47..ae3d7d88b8 100644
--- a/write-or-die.h
+++ b/write-or-die.h
@@ -7,6 +7,7 @@ void fprintf_or_die(FILE *, const char *fmt, ...);
 void fwrite_or_die(FILE *f, const void *buf, size_t count);
 void fflush_or_die(FILE *f);
 void write_or_die(int fd, const void *buf, size_t count);
+void writev_or_die(int fd, struct iovec *iov, int iovlen);
 
 /*
  * These values are used to help identify parts of a repository to fsync.

-- 
2.53.0.904.g2727be2e99.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v4 07/10] sideband: use writev(3p) to send pktlines
  2026-03-13  6:45 ` [PATCH v4 " Patrick Steinhardt
                     ` (5 preceding siblings ...)
  2026-03-13  6:45   ` [PATCH v4 06/10] wrapper: introduce writev(3p) wrappers Patrick Steinhardt
@ 2026-03-13  6:45   ` Patrick Steinhardt
  2026-03-13  6:45   ` [PATCH v4 08/10] csum-file: introduce `hashfd_ext()` Patrick Steinhardt
                     ` (2 subsequent siblings)
  9 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-13  6:45 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King, Johannes Sixt

Every pktline that we send out via `send_sideband()` currently requires
two syscalls: one to write the pktline's length, and one to send its
data. This typically isn't all that much of a problem, but under extreme
load the syscalls may cause contention in the kernel.

Refactor the code to instead use the newly introduced writev(3p) infra
so that we can send out the data with a single syscall. This reduces the
number of syscalls from around 133,000 calls to write(3p) to around
67,000 calls to writev(3p).
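For reference, the 5-byte sideband header that previously cost its own write(3p) looks like this. A hedged sketch with a hypothetical `sideband_header()` helper; the real logic lives in sideband.c:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Four hex digits encode the total pkt-line length (payload plus
 * the 5-byte header itself), followed by one band byte. snprintf()
 * writes the NUL into hdr[4], which we then overwrite with the band.
 */
static void sideband_header(char hdr[5], size_t payload_len, int band)
{
	snprintf(hdr, 5, "%04x", (unsigned)(payload_len + 5));
	hdr[4] = (char)band;
}
```

Packing this header and the payload into one `iov[2]` array is what lets writev(3p) emit both in a single syscall.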

Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 sideband.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/sideband.c b/sideband.c
index ea7c25211e..1ed6614eaf 100644
--- a/sideband.c
+++ b/sideband.c
@@ -264,6 +264,7 @@ void send_sideband(int fd, int band, const char *data, ssize_t sz, int packet_ma
 	const char *p = data;
 
 	while (sz) {
+		struct iovec iov[2];
 		unsigned n;
 		char hdr[5];
 
@@ -273,12 +274,19 @@ void send_sideband(int fd, int band, const char *data, ssize_t sz, int packet_ma
 		if (0 <= band) {
 			xsnprintf(hdr, sizeof(hdr), "%04x", n + 5);
 			hdr[4] = band;
-			write_or_die(fd, hdr, 5);
+			iov[0].iov_base = hdr;
+			iov[0].iov_len = 5;
 		} else {
 			xsnprintf(hdr, sizeof(hdr), "%04x", n + 4);
-			write_or_die(fd, hdr, 4);
+			iov[0].iov_base = hdr;
+			iov[0].iov_len = 4;
 		}
-		write_or_die(fd, p, n);
+
+		iov[1].iov_base = (void *) p;
+		iov[1].iov_len = n;
+
+		writev_or_die(fd, iov, ARRAY_SIZE(iov));
+
 		p += n;
 		sz -= n;
 	}

-- 
2.53.0.904.g2727be2e99.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v4 08/10] csum-file: introduce `hashfd_ext()`
  2026-03-13  6:45 ` [PATCH v4 " Patrick Steinhardt
                     ` (6 preceding siblings ...)
  2026-03-13  6:45   ` [PATCH v4 07/10] sideband: use writev(3p) to send pktlines Patrick Steinhardt
@ 2026-03-13  6:45   ` Patrick Steinhardt
  2026-03-13  6:45   ` [PATCH v4 09/10] csum-file: drop `hashfd_throughput()` Patrick Steinhardt
  2026-03-13  6:45   ` [PATCH v4 10/10] builtin/pack-objects: reduce lock contention when writing packfile data Patrick Steinhardt
  9 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-13  6:45 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King, Johannes Sixt

Introduce a new `hashfd_ext()` function that takes an options structure.
This function will replace `hashd_throughput()` in the next commit.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 csum-file.c | 22 +++++++++++++---------
 csum-file.h | 14 ++++++++++++++
 2 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/csum-file.c b/csum-file.c
index 6e21e3cac8..a50416247e 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -161,17 +161,16 @@ struct hashfile *hashfd_check(const struct git_hash_algo *algop,
 	return f;
 }
 
-static struct hashfile *hashfd_internal(const struct git_hash_algo *algop,
-					int fd, const char *name,
-					struct progress *tp,
-					size_t buffer_len)
+struct hashfile *hashfd_ext(const struct git_hash_algo *algop,
+			    int fd, const char *name,
+			    const struct hashfd_options *opts)
 {
 	struct hashfile *f = xmalloc(sizeof(*f));
 	f->fd = fd;
 	f->check_fd = -1;
 	f->offset = 0;
 	f->total = 0;
-	f->tp = tp;
+	f->tp = opts->progress;
 	f->name = name;
 	f->do_crc = 0;
 	f->skip_hash = 0;
@@ -179,8 +178,8 @@ static struct hashfile *hashfd_internal(const struct git_hash_algo *algop,
 	f->algop = unsafe_hash_algo(algop);
 	f->algop->init_fn(&f->ctx);
 
-	f->buffer_len = buffer_len;
-	f->buffer = xmalloc(buffer_len);
+	f->buffer_len = opts->buffer_len ? opts->buffer_len : 128 * 1024;
+	f->buffer = xmalloc(f->buffer_len);
 	f->check_buffer = NULL;
 
 	return f;
@@ -194,7 +193,8 @@ struct hashfile *hashfd(const struct git_hash_algo *algop,
 	 * measure the rate of data passing through this hashfile,
 	 * use a larger buffer size to reduce fsync() calls.
 	 */
-	return hashfd_internal(algop, fd, name, NULL, 128 * 1024);
+	struct hashfd_options opts = { 0 };
+	return hashfd_ext(algop, fd, name, &opts);
 }
 
 struct hashfile *hashfd_throughput(const struct git_hash_algo *algop,
@@ -206,7 +206,11 @@ struct hashfile *hashfd_throughput(const struct git_hash_algo *algop,
 	 * size so the progress indicators arrive at a more
 	 * frequent rate.
 	 */
-	return hashfd_internal(algop, fd, name, tp, 8 * 1024);
+	struct hashfd_options opts = {
+		.progress = tp,
+		.buffer_len = 8 * 1024,
+	};
+	return hashfd_ext(algop, fd, name, &opts);
 }
 
 void hashfile_checkpoint_init(struct hashfile *f,
diff --git a/csum-file.h b/csum-file.h
index 07ae11024a..a03b60120d 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -45,6 +45,20 @@ int hashfile_truncate(struct hashfile *, struct hashfile_checkpoint *);
 #define CSUM_FSYNC		2
 #define CSUM_HASH_IN_STREAM	4
 
+struct hashfd_options {
+	/*
+	 * Throughput progress that counts the number of bytes that have been
+	 * hashed.
+	 */
+	struct progress *progress;
+
+	/* The length of the buffer that shall be used to buffer data. */
+	size_t buffer_len;
+};
+
+struct hashfile *hashfd_ext(const struct git_hash_algo *algop,
+			    int fd, const char *name,
+			    const struct hashfd_options *opts);
 struct hashfile *hashfd(const struct git_hash_algo *algop,
 			int fd, const char *name);
 struct hashfile *hashfd_check(const struct git_hash_algo *algop,

-- 
2.53.0.904.g2727be2e99.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v4 09/10] csum-file: drop `hashfd_throughput()`
  2026-03-13  6:45 ` [PATCH v4 " Patrick Steinhardt
                     ` (7 preceding siblings ...)
  2026-03-13  6:45   ` [PATCH v4 08/10] csum-file: introduce `hashfd_ext()` Patrick Steinhardt
@ 2026-03-13  6:45   ` Patrick Steinhardt
  2026-03-13  6:45   ` [PATCH v4 10/10] builtin/pack-objects: reduce lock contention when writing packfile data Patrick Steinhardt
  9 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-13  6:45 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King, Johannes Sixt

The `hashfd_throughput()` function is used by a single callsite in
git-pack-objects(1). In contrast to `hashfd()`, this function uses a
progress meter to measure throughput and a smaller buffer length so that
the progress meter can provide more granular metrics.

We're going to change that caller in the next commit to be a bit more
specific to packing objects. As such, `hashfd_throughput()` will be a
somewhat unfitting mechanism for any potential new callers.

Drop the function and replace it with a call to `hashfd_ext()`.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/pack-objects.c | 19 +++++++++++++++----
 csum-file.c            | 16 ----------------
 csum-file.h            |  2 --
 3 files changed, 15 insertions(+), 22 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index c1ee4d5ed7..f5cb80e870 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1330,11 +1330,22 @@ static void write_pack_file(void)
 		unsigned char hash[GIT_MAX_RAWSZ];
 		char *pack_tmp_name = NULL;
 
-		if (pack_to_stdout)
-			f = hashfd_throughput(the_repository->hash_algo, 1,
-					      "<stdout>", progress_state);
-		else
+		if (pack_to_stdout) {
+			/*
+			 * Since we are expecting to report progress of the
+			 * write into this hashfile, use a smaller buffer
+			 * size so the progress indicators arrive at a more
+			 * frequent rate.
+			 */
+			struct hashfd_options opts = {
+				.progress = progress_state,
+				.buffer_len = 8 * 1024,
+			};
+			f = hashfd_ext(the_repository->hash_algo, 1,
+				       "<stdout>", &opts);
+		} else {
 			f = create_tmp_packfile(the_repository, &pack_tmp_name);
+		}
 
 		offset = write_pack_header(f, nr_remaining);
 
diff --git a/csum-file.c b/csum-file.c
index a50416247e..5dfaca5543 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -197,22 +197,6 @@ struct hashfile *hashfd(const struct git_hash_algo *algop,
 	return hashfd_ext(algop, fd, name, &opts);
 }
 
-struct hashfile *hashfd_throughput(const struct git_hash_algo *algop,
-				   int fd, const char *name, struct progress *tp)
-{
-	/*
-	 * Since we are expecting to report progress of the
-	 * write into this hashfile, use a smaller buffer
-	 * size so the progress indicators arrive at a more
-	 * frequent rate.
-	 */
-	struct hashfd_options opts = {
-		.progress = tp,
-		.buffer_len = 8 * 1024,
-	};
-	return hashfd_ext(algop, fd, name, &opts);
-}
-
 void hashfile_checkpoint_init(struct hashfile *f,
 			      struct hashfile_checkpoint *checkpoint)
 {
diff --git a/csum-file.h b/csum-file.h
index a03b60120d..01472555c8 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -63,8 +63,6 @@ struct hashfile *hashfd(const struct git_hash_algo *algop,
 			int fd, const char *name);
 struct hashfile *hashfd_check(const struct git_hash_algo *algop,
 			      const char *name);
-struct hashfile *hashfd_throughput(const struct git_hash_algo *algop,
-				   int fd, const char *name, struct progress *tp);
 
 /*
  * Free the hashfile without flushing its contents to disk. This only

-- 
2.53.0.904.g2727be2e99.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v4 10/10] builtin/pack-objects: reduce lock contention when writing packfile data
  2026-03-13  6:45 ` [PATCH v4 " Patrick Steinhardt
                     ` (8 preceding siblings ...)
  2026-03-13  6:45   ` [PATCH v4 09/10] csum-file: drop `hashfd_throughput()` Patrick Steinhardt
@ 2026-03-13  6:45   ` Patrick Steinhardt
  9 siblings, 0 replies; 64+ messages in thread
From: Patrick Steinhardt @ 2026-03-13  6:45 UTC (permalink / raw)
  To: git; +Cc: Matt Smiley, brian m. carlson, Jeff King, Johannes Sixt

When running `git pack-objects --stdout` we feed the data through
`hashfd_ext()` with a progress meter and a smaller-than-usual buffer
length of 8kB so that we can track throughput more granularly. But as
packfiles tend to be on the larger side, this small buffer size may
cause a ton of write(3p) syscalls.

Originally, the buffer we used in `hashfd()` was 8kB for all use cases.
This was changed though in 2ca245f8be (csum-file.h: increase hashfile
buffer size, 2021-05-18) because we noticed that the number of writes
can have an impact on performance. So the buffer size was increased to
128kB, which improved performance a bit for some use cases.

But the commit didn't touch the buffer size for `hashfd_throughput()`.
The reasoning here was that callers expect the progress indicator to
update frequently, and a larger buffer size would of course reduce the
update frequency especially on slow networks.

While that is of course true, there was (and still is, even though it's
now a call to `hashfd_ext()`) only a single caller of this function in
git-pack-objects(1). This command is responsible for writing packfiles,
and those packfiles are often on the bigger side. So arguably:

  - The user won't care about increments of 8kB when packfiles tend to
    be megabytes or even gigabytes in size.

  - Reducing the number of syscalls would be even more valuable here
    than it would be for multi-pack indices, which was the benchmark
    done in the mentioned commit, as MIDXs are typically significantly
    smaller than packfiles.

  - Nowadays, many internet connections should be able to transfer data
    at a rate significantly higher than 8kB per second.

Update the buffer to instead have a size of `LARGE_PACKET_DATA_MAX - 1`,
which translates to ~64kB. This limit was chosen because `git
pack-objects --stdout` is most often used when sending packfiles via
git-upload-pack(1), where packfile data is chunked into pktlines when
using the sideband. Furthermore, most internet connections should have a
bandwidth significantly higher than 64kB/s, so we'd still be able to
observe progress updates at a rate of at least once per second.
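The arithmetic behind the chosen length, assuming the usual pkt-line limits (the constants are restated here with illustrative `_SKETCH` names): the 65520-byte pkt-line maximum includes its 4-byte length prefix, and one further byte goes to the sideband designator.

```c
#include <assert.h>

/*
 * LARGE_PACKET_MAX includes the 4-byte hex length prefix, so the
 * payload maximum is 4 bytes less; subtracting the sideband byte
 * yields the amount of packfile data one pkt-line can carry.
 */
enum {
	LARGE_PACKET_MAX_SKETCH = 65520,
	LARGE_PACKET_DATA_MAX_SKETCH = LARGE_PACKET_MAX_SKETCH - 4,
	PACK_BUFFER_LEN_SKETCH = LARGE_PACKET_DATA_MAX_SKETCH - 1,
};
```

Sizing the pack-objects write buffer to exactly this value means each flush maps onto one maximally filled pkt-line downstream.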

This change significantly reduces the number of write(3p) syscalls from
355,000 to 44,000 when packing the Linux repository. While this results
in a small performance improvement on an otherwise-unused system, this
improvement is mostly negligible. More importantly though, it will
reduce lock contention in the kernel on an extremely busy system where
we have many processes writing data at once.

Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/pack-objects.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index f5cb80e870..59876b024d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -41,6 +41,7 @@
 #include "promisor-remote.h"
 #include "pack-mtimes.h"
 #include "parse-options.h"
+#include "pkt-line.h"
 #include "blob.h"
 #include "tree.h"
 #include "path-walk.h"
@@ -1332,14 +1333,17 @@ static void write_pack_file(void)
 
 		if (pack_to_stdout) {
 			/*
-			 * Since we are expecting to report progress of the
-			 * write into this hashfile, use a smaller buffer
-			 * size so the progress indicators arrive at a more
-			 * frequent rate.
+			 * This command is most often invoked via
+			 * git-upload-pack(1), which will typically chunk data
+			 * into pktlines. As such, we use their maximum data
+			 * length as the buffer length.
+			 *
+			 * Note that we need to subtract one though to
+			 * accommodate the sideband byte.
 			 */
 			struct hashfd_options opts = {
 				.progress = progress_state,
-				.buffer_len = 8 * 1024,
+				.buffer_len = LARGE_PACKET_DATA_MAX - 1,
 			};
 			f = hashfd_ext(the_repository->hash_algo, 1,
 				       "<stdout>", &opts);

-- 
2.53.0.904.g2727be2e99.dirty


^ permalink raw reply related	[flat|nested] 64+ messages in thread

end of thread, other threads:[~2026-03-13  6:45 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-27 11:22 [PATCH 0/2] upload-pack: reduce lock contention when writing packfile data Patrick Steinhardt
2026-02-27 11:23 ` [PATCH 1/2] upload-pack: fix debug statement when flushing " Patrick Steinhardt
2026-02-27 11:23 ` [PATCH 2/2] upload-pack: reduce lock contention when writing " Patrick Steinhardt
2026-02-27 13:04   ` brian m. carlson
2026-02-27 18:14     ` Patrick Steinhardt
2026-02-27 17:29   ` Junio C Hamano
2026-02-27 19:37   ` Jeff King
2026-03-02 12:12     ` Patrick Steinhardt
2026-03-02 18:20       ` Jeff King
2026-03-03  9:31         ` Patrick Steinhardt
2026-03-03 13:35           ` Jeff King
2026-03-03 13:47             ` Patrick Steinhardt
2026-03-03 15:00 ` [PATCH v2 00/10] " Patrick Steinhardt
2026-03-03 15:00   ` [PATCH v2 01/10] upload-pack: fix debug statement when flushing " Patrick Steinhardt
2026-03-03 15:00   ` [PATCH v2 02/10] upload-pack: adapt keepalives based on buffering Patrick Steinhardt
2026-03-05  0:56     ` Jeff King
2026-03-10 12:08       ` Patrick Steinhardt
2026-03-03 15:00   ` [PATCH v2 03/10] upload-pack: reduce lock contention when writing packfile data Patrick Steinhardt
2026-03-05  1:16     ` Jeff King
2026-03-10 12:14       ` Patrick Steinhardt
2026-03-03 15:00   ` [PATCH v2 04/10] git-compat-util: introduce `cast_size_t_to_ssize_t()` Patrick Steinhardt
2026-03-03 15:00   ` [PATCH v2 05/10] compat/posix: introduce writev(3p) wrapper Patrick Steinhardt
2026-03-04 22:01     ` Junio C Hamano
2026-03-05  0:37       ` Jeff King
2026-03-05  2:16         ` brian m. carlson
2026-03-05  6:39           ` Johannes Sixt
2026-03-05 22:22             ` brian m. carlson
2026-03-10 12:09               ` Patrick Steinhardt
2026-03-03 15:00   ` [PATCH v2 06/10] wrapper: introduce writev(3p) wrappers Patrick Steinhardt
2026-03-03 15:00   ` [PATCH v2 07/10] sideband: use writev(3p) to send pktlines Patrick Steinhardt
2026-03-04 22:05     ` Junio C Hamano
2026-03-03 15:00   ` [PATCH v2 08/10] csum-file: introduce `hashfd_ext()` Patrick Steinhardt
2026-03-04 22:11     ` Junio C Hamano
2026-03-10 12:09       ` Patrick Steinhardt
2026-03-03 15:00   ` [PATCH v2 09/10] csum-file: drop `hashfd_throughput()` Patrick Steinhardt
2026-03-03 15:00   ` [PATCH v2 10/10] builtin/pack-objects: reduce lock contention when writing packfile data Patrick Steinhardt
2026-03-10 13:24 ` [PATCH v3 00/10] upload-pack: " Patrick Steinhardt
2026-03-10 13:24   ` [PATCH v3 01/10] upload-pack: fix debug statement when flushing " Patrick Steinhardt
2026-03-10 13:24   ` [PATCH v3 02/10] upload-pack: adapt keepalives based on buffering Patrick Steinhardt
2026-03-10 13:24   ` [PATCH v3 03/10] upload-pack: prefer flushing data over sending keepalive Patrick Steinhardt
2026-03-10 17:09     ` Junio C Hamano
2026-03-10 17:43       ` Patrick Steinhardt
2026-03-10 13:25   ` [PATCH v3 04/10] upload-pack: reduce lock contention when writing packfile data Patrick Steinhardt
2026-03-10 13:25   ` [PATCH v3 05/10] compat/posix: introduce writev(3p) wrapper Patrick Steinhardt
2026-03-10 16:59     ` Junio C Hamano
2026-03-10 13:25   ` [PATCH v3 06/10] wrapper: introduce writev(3p) wrappers Patrick Steinhardt
2026-03-10 13:25   ` [PATCH v3 07/10] sideband: use writev(3p) to send pktlines Patrick Steinhardt
2026-03-10 13:25   ` [PATCH v3 08/10] csum-file: introduce `hashfd_ext()` Patrick Steinhardt
2026-03-10 13:25   ` [PATCH v3 09/10] csum-file: drop `hashfd_throughput()` Patrick Steinhardt
2026-03-10 13:25   ` [PATCH v3 10/10] builtin/pack-objects: reduce lock contention when writing packfile data Patrick Steinhardt
2026-03-10 17:11   ` [PATCH v3 00/10] upload-pack: " Junio C Hamano
2026-03-10 20:56   ` Johannes Sixt
2026-03-11  6:27     ` Patrick Steinhardt
2026-03-13  6:45 ` [PATCH v4 " Patrick Steinhardt
2026-03-13  6:45   ` [PATCH v4 01/10] upload-pack: fix debug statement when flushing " Patrick Steinhardt
2026-03-13  6:45   ` [PATCH v4 02/10] upload-pack: adapt keepalives based on buffering Patrick Steinhardt
2026-03-13  6:45   ` [PATCH v4 03/10] upload-pack: prefer flushing data over sending keepalive Patrick Steinhardt
2026-03-13  6:45   ` [PATCH v4 04/10] upload-pack: reduce lock contention when writing packfile data Patrick Steinhardt
2026-03-13  6:45   ` [PATCH v4 05/10] compat/posix: introduce writev(3p) wrapper Patrick Steinhardt
2026-03-13  6:45   ` [PATCH v4 06/10] wrapper: introduce writev(3p) wrappers Patrick Steinhardt
2026-03-13  6:45   ` [PATCH v4 07/10] sideband: use writev(3p) to send pktlines Patrick Steinhardt
2026-03-13  6:45   ` [PATCH v4 08/10] csum-file: introduce `hashfd_ext()` Patrick Steinhardt
2026-03-13  6:45   ` [PATCH v4 09/10] csum-file: drop `hashfd_throughput()` Patrick Steinhardt
2026-03-13  6:45   ` [PATCH v4 10/10] builtin/pack-objects: reduce lock contention when writing packfile data Patrick Steinhardt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox