Re: [PATCH] wrapper: properly handle MAX_IO_SIZE in `write_in_full()`

Git development
 help / color / mirror / Atom feed

From: Junio C Hamano <gitster@pobox.com>
To: Patrick Steinhardt <ps@pks.im>
Cc: git@vger.kernel.org,  Jeff King <peff@peff.net>,
	 "brian m. carlson" <sandals@crustytoothpaste.net>,
	 "Randall S. Becker" <randall.becker@nexbridge.ca>,
	 Phillip Wood <phillip.wood@dunelm.org.uk>
Subject: Re: [PATCH] wrapper: properly handle MAX_IO_SIZE in `write_in_full()`
Date: Thu, 09 Apr 2026 09:42:42 -0700	[thread overview]
Message-ID: <xmqqika0ultp.fsf@gitster.g> (raw)
In-Reply-To: <20260409-b4-pks-writev-max-io-size-v1-1-81730e8f35df@pks.im> (Patrick Steinhardt's message of "Thu, 09 Apr 2026 14:47:59 +0200")

Patrick Steinhardt <ps@pks.im> writes:

> Some systems like NonStop set a comparatively small `MAX_IO_SIZE`, which
> limits the maximum number of bytes we're allowed to write in a single
> call. We already handle this limit properly in `xwrite()`, but we have
> recently introduced wrappers for writev(3p) where we don't. This will
> cause the syscall to return EINVAL in case somebody passes an iovec
> entry to writev(3p) that is larger than `MAX_IO_SIZE`.
>
> Introduce a new function `xwritev()` that is similar to `xwrite()` in
> that it handles such platform-specific nuances. The logic is rather
> simple: we simply coalesce all iovecs that don't exceed `MAX_IO_SIZE`
> and pass those to writev(3p). If the first iovec already exceeds the
> limit, we'll instead pass it to `xwrite()`, which handles the limit for
> us.

OK, so the idea is just like xwrite(), whose original purpose was to
hide the details of having to restart write() system call, pretends
a short write on IO limited platforms, xwritev() pretends that the
underlying writev() gave a short write, instead of a failure with
EINVAL when ssize_t is unusually small.  That mirrors an established
pattern that is proven to work with write_in_full() code path, which
is a good thing.

By the way, I see that EINTR and EWOULDBLOCK are handled somewhat
differently from how xwrite() handles it.  As the original change
that introduced use of writev() were to replace write_in_full() that
eventually called into xwrite(), shouldn't we be closer to what
xwrite() used to do?

> this fixes the issue reported by Randall in [1].
>
> I mostly wanted to get this patch out there so that we can discuss a
> proposed fix, but as said in the thread I'm also happy to revise course
> and instead set NO_WRITEV on NonStop for now. I think we'll want to
> eventually land a fix like the one proposed here though, and at that
> point the workaround would not be required anymore.

It is too late to make this kind of "let's wrap writev()" effort
before the final, and Git 2.54 should ship without any such change.
If your platform does not have a writev() that works with write size
up to half of maximum of size_t (or at least 64kB), use NO_WRITEV to
build your Git.

But let's discuss to prepare for an update post release.

It would be nice to see minority platforms including NonStop test
and notice problems that appear only on their system much earlier in
the cycle next time.  A report at -rc0, while better than not seeing
any, is a bit too late.

> diff --git a/wrapper.c b/wrapper.c
> index be8fa575e6..d989c78b4b 100644
> --- a/wrapper.c
> +++ b/wrapper.c
> @@ -323,21 +323,60 @@ ssize_t write_in_full(int fd, const void *buf, size_t count)
>  	return total;
>  }
>  
> +ssize_t xwritev(int fd, struct iovec *iov, int iovcnt)
> +{
> +	ssize_t bytes_written;
> +	size_t total_length;
> +	int i;
> +
> +	/*
> +	 * We need to make sure that writev(3p) call does not write more than
> +	 * `MAX_IO_SIZE` many bytes. If we do exceed that limit, we only pass
> +	 * those iovecs to writev(3p) that sum up to less than the limit.
> +	 *
> +	 * If on the other hand the first iovec entry already exceeds this
> +	 * limit we'll instead use xwrite() to write it, which knows to handle
> +	 * `MAX_IO_SIZE` for us.
> +	 */
> +	for (i = 0, total_length = 0; i < iovcnt; i++) {
> +		if (unsigned_add_overflows(total_length, iov[i].iov_len))
> +			break;

We add .iov_len up in this first loop, because we do not want to
bust writev(3p)'s limit.

	EINVAL The sum of the iov_len values in the iov array would
              overflow an ssize_t.

cf. https://pubs.opengroup.org/onlinepubs/9799919799/functions/writev.html

As the width of ssize_t in bits can be a lot smaller than size_t,
the above "unsigned_add_overflows() triggers way too late for the
check to matter, no?

> +		total_length += iov[i].iov_len;

Then we have total_length computed.

> +		if (total_length > MAX_IO_SIZE)
> +			break;

And it is capped to MAX_IO_SIZE, which is set way lower than
SSIZE_MAX even on mainstream platforms (8MB, IIRC).  This does not
matter only because we are currently using writev() only for
sideband communication and its payload is limited to 64kB, but if a
caller gave us a list of buffers whose total size ranges in a few
hundred megabytes (e.g., packfiles to clone a small project like
git.git itself), on a not-so-I/O-limited platform, do we still want
to chomp the original request into multiple writev(3p) system calls?

I personally think it is an OK thing to do, simply because we also
chomp a large xwrite() request into chunks and make multiple
write(2) system calls, but we should explain the reason behind "We
need to make sure" in the beginning of the above comment well--the
current text has no explanation on the reason.

> +	}
> +
> +	if (i < iovcnt) {
> +		/*
> +		 * The first entry exceeds MAX_IO_SIZE, so we pass it to
> +		 * xwrite, which knows to handle this case.
> +		 */
> +		if (!i)
> +			return xwrite(fd, iov->iov_base, iov->iov_len);

Ben has a similar comment, but it would be easier to see the
correspondence if you rephrase the comment perhaps like

		/*
                 * If the first buffer is larger than MAX_IO_SIZE,
                 * let xwrite() deal with it.
                 */

xwrite() can return a short write, but the caller is counting how
many bytes among what it passed to xwritev() are consumed by each
call to this function, so the next call we may be seeing may have
iov->iov_base pointing into the buffer we threw at xwrite() with
reduced iov->iov_len, and that is expected and everything will even
out.  Quite nice.

> +		iovcnt = i;

And if our iov[0].iov_len is smaller than MAX_IO_SIZE, then i would
be at least 1 here and shows the index in iov[] array that busts the
limit.  IOW, we know iov[0]..iov[i-1] (inclusive) can be written
without busting MAX_IO_SIZE in one go.  So se reduce iovcnt down to
that number here, in preparation for the next call.

> +	}

By the way, if the total of iov[] is reaonably small, then the
initial loop runs to the end of it, the above if() statement will be
skipped, and iovcnt is not reduced.

Either way, we now pass the initial part (which may be the entirety)
of iov[] up to iovcnt to writev().

> +	bytes_written = writev(fd, iov, iovcnt);
> +	if (!bytes_written) {
> +		errno = ENOSPC;
> +		return -1;
> +	}
> +
> +	return bytes_written;
> +}

OK.  So except for "is unsigned_add_overflows() doing what we want?"
question, I think the above is reasonable.

If we used "let's make sure sum of iov[].iov_len does not overflow
an ssize_t" at the beginning of the loop, it does change the
contract between the callers and this function, as they can no
longer get EINVAL due to such overflow (instead this function will
chomp the request into smaller pieces, just like MAX_IO_SIZE is used
here).  And I think that is a reasonable semantics.

>  ssize_t writev_in_full(int fd, struct iovec *iov, int iovcnt)
>  {
>  	ssize_t total_written = 0;
>  
>  	while (iovcnt) {
> -		ssize_t bytes_written = writev(fd, iov, iovcnt);
> -		if (bytes_written < 0) {
> +		ssize_t bytes_written = xwritev(fd, iov, iovcnt);
> +		if (bytes_written <= 0) {
>  			if (errno == EINTR || errno == EAGAIN)
>  				continue;

I am not sure if this is how we want to handle EINTR, given
especially that xwritev() may have called xwrite() under the hood in
some case but not in others.  If we are doing anything to these
signals, I think it should be done where we call underlying writev(),
and we should be close to whatever is done in xwrite() where it
calls write().

>  			return -1;
>  		}
> -		if (!bytes_written) {
> -			errno = ENOSPC;
> -			return -1;
> -		}

On the other hand, I think moving this into xwritev() is a mistake.
We shoudl try to be as close to what these functions are replacing
(i.e. write_in_full and xwrite combo) in the code paths that are
rewritten to use them.

next prev parent reply	other threads:[~2026-04-09 16:42 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-09 12:47 [PATCH] wrapper: properly handle MAX_IO_SIZE in `write_in_full()` Patrick Steinhardt
2026-04-09 15:46 ` Ben Knoble
2026-04-09 16:42 ` Junio C Hamano [this message]
2026-04-09 20:23   ` Jeff King
2026-04-09 20:40     ` Junio C Hamano
2026-04-09 20:59       ` Jeff King
2026-04-09 21:09         ` Junio C Hamano
2026-04-10  5:19           ` Patrick Steinhardt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqqika0ultp.fsf@gitster.g \
    --to=gitster@pobox.com \
    --cc=git@vger.kernel.org \
    --cc=peff@peff.net \
    --cc=phillip.wood@dunelm.org.uk \
    --cc=ps@pks.im \
    --cc=randall.becker@nexbridge.ca \
    --cc=sandals@crustytoothpaste.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox