public inbox for qemu-devel@nongnu.org
 help / color / mirror / Atom feed
From: Hanna Czenczek <hreitz@redhat.com>
To: Kevin Wolf <kwolf@redhat.com>
Cc: qemu-block@nongnu.org, qemu-devel@nongnu.org,
	Aarushi Mehta <mehta.aaru20@gmail.com>,
	Stefan Hajnoczi <stefanha@redhat.com>,
	Stefano Garzarella <sgarzare@redhat.com>
Subject: Re: [PATCH for-11.0 2/3] linux-aio: Resubmit tails of short reads/writes
Date: Tue, 24 Mar 2026 11:04:19 +0100	[thread overview]
Message-ID: <190608ae-a537-44a6-a444-2d9667f80391@redhat.com> (raw)
In-Reply-To: <acJX8dXn4CUE4UxO@redhat.com>

On 24.03.26 10:22, Kevin Wolf wrote:
> Am 24.03.2026 um 09:22 hat Hanna Czenczek geschrieben:
>> On 24.03.26 09:12, Hanna Czenczek wrote:
>>> On 23.03.26 18:12, Kevin Wolf wrote:
>>>> Am 18.03.2026 um 16:32 hat Hanna Czenczek geschrieben:
>>>>> Short reads/writes can happen.  One way to reproduce them is via our
>>>>> FUSE export, with the following diff applied (%s/escaped // to apply --
>>>>> if you put plain diffs in commit messages, git-am will apply
>>>>> them, and I
>>>>> would rather avoid breaking FUSE accidentally via this patch):
>>>>>
>>>>> escaped diff --git a/block/export/fuse.c b/block/export/fuse.c
>>>>> escaped index a2a478d293..67dc50a412 100644
>>>>> escaped --- a/block/export/fuse.c
>>>>> escaped +++ b/block/export/fuse.c
>>>>> @@ -828,7 +828,7 @@ static ssize_t coroutine_fn GRAPH_RDLOCK
>>>>>    fuse_co_init(FuseExport *exp, struct fuse_init_out *out,
>>>>>                 const struct fuse_init_in_compat *in)
>>>>>    {
>>>>> -    const uint32_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO;
>>>>> +    const uint32_t supported_flags = FUSE_ASYNC_READ;
>>>>>
>>>>>        if (in->major != 7) {
>>>>>            error_report("FUSE major version mismatch: We have 7,
>>>>> but kernel has %"
>>>>> @@ -1060,6 +1060,8 @@ fuse_co_read(FuseExport *exp, void
>>>>> **bufptr, uint64_t offset, uint32_t size)
>>>>>        void *buf;
>>>>>        int ret;
>>>>>
>>>>> +    size = MIN(size, 4096);
>>>>> +
>>>>>        /* Limited by max_read, should not happen */
>>>>>        if (size > FUSE_MAX_READ_BYTES) {
>>>>>            return -EINVAL;
>>>>> @@ -1110,6 +1112,8 @@ fuse_co_write(FuseExport *exp, struct
>>>>> fuse_write_out *out,
>>>>>        int64_t blk_len;
>>>>>        int ret;
>>>>>
>>>>> +    size = MIN(size, 4096);
>>>>> +
>>>>>        QEMU_BUILD_BUG_ON(FUSE_MAX_WRITE_BYTES > BDRV_REQUEST_MAX_BYTES);
>>>>>        /* Limited by max_write, should not happen */
>>>>>        if (size > FUSE_MAX_WRITE_BYTES) {
>>>>>
>>>>> Then:
>>>>> $ ./qemu-img create -f raw test.raw 8k
>>>>> Formatting 'test.raw', fmt=raw size=8192
>>>>> $ ./qemu-io -f raw -c 'write -P 42 0 8k' test.raw
>>>>> wrote 8192/8192 bytes at offset 0
>>>>> 8 KiB, 1 ops; 00.00 sec (64.804 MiB/sec and 8294.9003 ops/sec)
>>>>> $ hexdump -C test.raw
>>>>> 00000000  2a 2a 2a 2a 2a 2a 2a 2a  2a 2a 2a 2a 2a 2a 2a 2a
>>>>> |****************|
>>>>> *
>>>>> 00002000
>>>>>
>>>>> With aio=threads, short I/O works:
>>>>> $ storage-daemon/qemu-storage-daemon \
>>>>>       --blockdev file,node-name=test,filename=test.raw \
>>>>>       --export
>>>>> fuse,id=exp,node-name=test,mountpoint=test.raw,writable=true
>>>>>
>>>>> Other shell:
>>>>> $ ./qemu-io --image-opts -c 'read -P 42 0 8k' \
>>>>>       driver=file,filename=test.raw,cache.direct=on,aio=threads
>>>>> read 8192/8192 bytes at offset 0
>>>>> 8 KiB, 1 ops; 00.00 sec (36.563 MiB/sec and 4680.0923 ops/sec)
>>>>> $ ./qemu-io --image-opts -c 'write -P 23 0 8k' \
>>>>>       driver=file,filename=test.raw,cache.direct=on,aio=threads
>>>>> wrote 8192/8192 bytes at offset 0
>>>>> 8 KiB, 1 ops; 00.00 sec (35.995 MiB/sec and 4607.2970 ops/sec)
>>>>> $ hexdump -C test.raw
>>>>> 00000000  17 17 17 17 17 17 17 17  17 17 17 17 17 17 17 17
>>>>> |................|
>>>>> *
>>>>> 00002000
>>>>>
>>>>> But with aio=native, it does not:
>>>>> $ ./qemu-io --image-opts -c 'read -P 23 0 8k' \
>>>>>       driver=file,filename=test.raw,cache.direct=on,aio=native
>>>>> Pattern verification failed at offset 0, 8192 bytes
>>>>> read 8192/8192 bytes at offset 0
>>>>> 8 KiB, 1 ops; 00.00 sec (86.155 MiB/sec and 11027.7900 ops/sec)
>>>>> $ ./qemu-io --image-opts -c 'write -P 42 0 8k' \
>>>>>       driver=file,filename=test.raw,cache.direct=on,aio=native
>>>>> write failed: No space left on device
>>>>> $ hexdump -C test.raw
>>>>> 00000000  2a 2a 2a 2a 2a 2a 2a 2a  2a 2a 2a 2a 2a 2a 2a 2a
>>>>> |****************|
>>>>> *
>>>>> 00001000  17 17 17 17 17 17 17 17  17 17 17 17 17 17 17 17
>>>>> |................|
>>>>> *
>>>>> 00002000
>>>>>
>>>>> This patch fixes that.
>>>>>
>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>>> ---
>>>>>    block/linux-aio.c | 61
>>>>> ++++++++++++++++++++++++++++++++++++++++-------
>>>>>    1 file changed, 53 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/block/linux-aio.c b/block/linux-aio.c
>>>>> index 1f25339dc9..01621d4794 100644
>>>>> --- a/block/linux-aio.c
>>>>> +++ b/block/linux-aio.c
>>>>> @@ -46,6 +46,10 @@ struct qemu_laiocb {
>>>>>        size_t nbytes;
>>>>>        QEMUIOVector *qiov;
>>>>>    +    /* For handling short reads/writes */
>>>>> +    size_t total_done;
>>>>> +    QEMUIOVector resubmit_qiov;
>>>>> +
>>>>>        int type;
>>>>>        BdrvRequestFlags flags;
>>>>>        uint64_t dev_max_batch;
>>>>> @@ -73,28 +77,61 @@ struct LinuxAioState {
>>>>>    };
>>>>>      static void ioq_submit(LinuxAioState *s);
>>>>> +static int laio_do_submit(struct qemu_laiocb *laiocb);
>>>>>      static inline ssize_t io_event_ret(struct io_event *ev)
>>>>>    {
>>>>>        return (ssize_t)(((uint64_t)ev->res2 << 32) | ev->res);
>>>>>    }
>>>>>    +/**
>>>>> + * Retry tail of short requests.
>>>>> + */
>>>>> +static int laio_resubmit_short_io(struct qemu_laiocb *laiocb,
>>>>> size_t done)
>>>>> +{
>>>>> +    QEMUIOVector *resubmit_qiov = &laiocb->resubmit_qiov;
>>>>> +
>>>>> +    laiocb->total_done += done;
>>>>> +
>>>>> +    if (!resubmit_qiov->iov) {
>>>>> +        qemu_iovec_init(resubmit_qiov, laiocb->qiov->niov);
>>>>> +    } else {
>>>>> +        qemu_iovec_reset(resubmit_qiov);
>>>>> +    }
>>>>> +    qemu_iovec_concat(resubmit_qiov, laiocb->qiov,
>>>>> +                      laiocb->total_done, laiocb->nbytes -
>>>>> laiocb->total_done);
>>>>> +
>>>>> +    return laio_do_submit(laiocb);
>>>>> +}
>>>>> +
>>>>>    /*
>>>>>     * Completes an AIO request.
>>>>>     */
>>>>>    static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
>>>>>    {
>>>>> -    int ret;
>>>>> +    ssize_t ret;
>>>>>          ret = laiocb->ret;
>>>>>        if (ret != -ECANCELED) {
>>>>> -        if (ret == laiocb->nbytes) {
>>>>> +        if (ret == laiocb->nbytes - laiocb->total_done) {
>>>>>                ret = 0;
>>>>> +        } else if (ret > 0 && (laiocb->type == QEMU_AIO_READ ||
>>>>> +                               laiocb->type == QEMU_AIO_WRITE)) {
>>>>> +            ret = laio_resubmit_short_io(laiocb, ret);
>>>>> +            if (!ret) {
>>>>> +                return;
>>>>> +            }
>>>>>            } else if (ret >= 0) {
>>>>> -            /* Short reads mean EOF, pad with zeros. */
>>>>> +            /*
>>>>> +             * For normal reads and writes, we only get here if
>>>>> ret == 0, which
>>>>> +             * means EOF for reads and ENOSPC for writes.
>>>>> +             * For zone-append, we get here with any ret >= 0,
>>>>> which we just
>>>>> +             * treat as ENOSPC, too (safer than resubmitting,
>>>>> probably, but not
>>>>> +             * 100 % clear).
>>>>> +             */
>>>>>                if (laiocb->type == QEMU_AIO_READ) {
>>>>> -                qemu_iovec_memset(laiocb->qiov, ret, 0,
>>>>> -                    laiocb->qiov->size - ret);
>>>>> +                qemu_iovec_memset(laiocb->qiov, laiocb->total_done, 0,
>>>>> +                                  laiocb->qiov->size -
>>>>> laiocb->total_done);
>>>>>                } else {
>>>>>                    ret = -ENOSPC;
>>>>>                }
>>>>> @@ -102,6 +139,7 @@ static void
>>>>> qemu_laio_process_completion(struct qemu_laiocb *laiocb)
>>>>>        }
>>>>>          laiocb->ret = ret;
>>>>> +    qemu_iovec_destroy(&laiocb->resubmit_qiov);
>>>> Calling qemu_iovec_destroy() for a qiov that has potentially never been
>>>> initialised feels a bit unsafe to me. It will work in practice for the
>>>> current implementation, but maybe make this one conditional on
>>>> laiocb->resubmit_qiov.iov, too? (Which is already making assumptions
>>>> about the internals of QEMUIOVector, but that we'll have an iov after
>>>> initialising the qiov with qemu_iovec_init() above will probably never
>>>> change.)
>>> Sure!
>>>
>>> (For reference, the io-uring code calls qemu_iovec_destroy() the same
>>> way, but I agree.  I’ll make that change there, too.)
>>>
>>>>>        /*
>>>>>         * If the coroutine is already entered it must be in
>>>>> ioq_submit() and
>>>>> @@ -380,23 +418,30 @@ static int laio_do_submit(struct
>>>>> qemu_laiocb *laiocb)
>>>>>        int fd = laiocb->fd;
>>>>>        off_t offset = laiocb->offset;
>>>> I wonder if making it laiocb->offset + laiocb->total_done here wouldn't
>>>> be more robust than having the addition in every call below.
>>> I had it that way originally, but then decided to put it into the calls
>>> below to show better which ones can actually be retried. Yes, the ones
>>> not retried (zone_append) will always have total_done == 0, but I found
>>> this clearer, personally.
>>>
>>> Maybe I’ll make it two variables, original_offset and offset.
>>>
>>> Hanna
>>>
>>>>> +    if (laiocb->resubmit_qiov.iov) {
>>>>> +        qiov = &laiocb->resubmit_qiov;
>>>>> +    }
>>>>> +
>> Now that I actually touch this, this here makes me wonder what my point of
>> “yes, sure, total_done will be 0 on no resubmission anyway, but!” even was.
>> If this piece of code unconditionally uses the resubmit_qiov (when set up)
>> for all requests, including zone_append, we might as well include total_done
>> in the offset, for all requests.
> I actually considered if I should suggest 'offset += laiocb->total_done'
> in this if block instead of doing it right at initialisation, but then
> thought what's the point, it's the same in practice anyway.
>
> If you want to make the zone_append case clearer, we could add an
> assertion that it's 0 there, if that feels better to you.

No, I think it’s ok as-is (with offset including total_done).  I’ve sent 
a v2.

Hanna



  reply	other threads:[~2026-03-24 10:04 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-18 15:32 [PATCH for-11.0 0/3] linux-aio/io-uring: Resubmit tails of short requests Hanna Czenczek
2026-03-18 15:32 ` [PATCH for-11.0 1/3] linux-aio: Put all parameters into qemu_laiocb Hanna Czenczek
2026-03-23 16:36   ` Kevin Wolf
2026-03-23 17:02     ` Hanna Czenczek
2026-03-23 17:04       ` Hanna Czenczek
2026-03-23 19:10         ` Kevin Wolf
2026-03-18 15:32 ` [PATCH for-11.0 2/3] linux-aio: Resubmit tails of short reads/writes Hanna Czenczek
2026-03-23 17:12   ` Kevin Wolf
2026-03-24  8:12     ` Hanna Czenczek
2026-03-24  8:22       ` Hanna Czenczek
2026-03-24  9:22         ` Kevin Wolf
2026-03-24 10:04           ` Hanna Czenczek [this message]
2026-03-18 15:32 ` [PATCH for-11.0 3/3] io-uring: Resubmit tails of short writes Hanna Czenczek
2026-03-23 19:05   ` Kevin Wolf
2026-03-23 16:28 ` [PATCH for-11.0 0/3] linux-aio/io-uring: Resubmit tails of short requests Kevin Wolf
2026-03-23 16:59   ` Hanna Czenczek

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=190608ae-a537-44a6-a444-2d9667f80391@redhat.com \
    --to=hreitz@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=mehta.aaru20@gmail.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=sgarzare@redhat.com \
    --cc=stefanha@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox