qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Peter Lieven <pl@kamp.de>
To: Hanna Reitz <hreitz@redhat.com>,
	Stefano Garzarella <sgarzare@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>, Ilya Dryomov <idryomov@gmail.com>,
	qemu-devel@nongnu.org, qemu-block@nongnu.org
Subject: Re: [PATCH] block/rbd: fix write zeroes with growing images
Date: Thu, 24 Mar 2022 12:34:55 +0100	[thread overview]
Message-ID: <15ae6594-2fdb-d374-81d6-9ed6ed19de2a@kamp.de> (raw)
In-Reply-To: <ecef6a0b-5a91-34ad-ee89-d86e9166c4a0@redhat.com>

Am 24.03.22 um 12:06 schrieb Hanna Reitz:
> On 24.03.22 11:42, Peter Lieven wrote:
>> Am 24.03.22 um 11:40 schrieb Stefano Garzarella:
>>> On Thu, Mar 24, 2022 at 10:52:04AM +0100, Peter Lieven wrote:
>>>> Am 22.03.22 um 10:38 schrieb Hanna Reitz:
>>>>> On 21.03.22 09:31, Stefano Garzarella wrote:
>>>>>> On Sat, Mar 19, 2022 at 04:15:33PM +0100, Peter Lieven wrote:
>>>>>>>
>>>>>>>
>>>>>>>> Am 18.03.2022 um 17:47 schrieb Stefano Garzarella <sgarzare@redhat.com>:
>>>>>>>>
>>>>>>>> On Fri, Mar 18, 2022 at 04:48:18PM +0100, Peter Lieven wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> Am 18.03.2022 um 09:25 schrieb Stefano Garzarella <sgarzare@redhat.com>:
>>>>>>>>>>
>>>>>>>>>> On Thu, Mar 17, 2022 at 07:27:05PM +0100, Peter Lieven wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>> Am 17.03.2022 um 17:26 schrieb Stefano Garzarella <sgarzare@redhat.com>:
>>>>>>>>>>>>
>>>>>>>>>>>> Commit d24f80234b ("block/rbd: increase dynamically the image size")
>>>>>>>>>>>> added a workaround to support growing images (eg. qcow2), resizing
>>>>>>>>>>>> the image before write operations that exceed the current size.
>>>>>>>>>>>>
>>>>>>>>>>>> We recently added support for write zeroes and without the
>>>>>>>>>>>> workaround we can have problems with qcow2.
>>>>>>>>>>>>
>>>>>>>>>>>> So let's move the resize into qemu_rbd_start_co() and do it when
>>>>>>>>>>>> the command is RBD_AIO_WRITE or RBD_AIO_WRITE_ZEROES.
>>>>>>>>>>>>
>>>>>>>>>>>> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2020993
>>>>>>>>>>>> Fixes: c56ac27d2a ("block/rbd: add write zeroes support")
>>>>>>>>>>>> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>> block/rbd.c | 26 ++++++++++++++------------
>>>>>>>>>>>> 1 file changed, 14 insertions(+), 12 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/block/rbd.c b/block/rbd.c
>>>>>>>>>>>> index 8f183eba2a..6caf35cbba 100644
>>>>>>>>>>>> --- a/block/rbd.c
>>>>>>>>>>>> +++ b/block/rbd.c
>>>>>>>>>>>> @@ -1107,6 +1107,20 @@ static int coroutine_fn qemu_rbd_start_co(BlockDriverState *bs,
>>>>>>>>>>>>
>>>>>>>>>>>>   assert(!qiov || qiov->size == bytes);
>>>>>>>>>>>>
>>>>>>>>>>>> +    if (cmd == RBD_AIO_WRITE || cmd == RBD_AIO_WRITE_ZEROES) {
>>>>>>>>>>>> +        /*
>>>>>>>>>>>> +         * RBD APIs don't allow us to write more than actual size, so in order
>>>>>>>>>>>> +         * to support growing images, we resize the image before write
>>>>>>>>>>>> +         * operations that exceed the current size.
>>>>>>>>>>>> +         */
>>>>>>>>>>>> +        if (offset + bytes > s->image_size) {
>>>>>>>>>>>> +            int r = qemu_rbd_resize(bs, offset + bytes);
>>>>>>>>>>>> +            if (r < 0) {
>>>>>>>>>>>> +                return r;
>>>>>>>>>>>> +            }
>>>>>>>>>>>> +        }
>>>>>>>>>>>> +    }
>>>>>>>>>>>> +
>>>>>>>>>>>>   r = rbd_aio_create_completion(&task,
>>>>>>>>>>>> (rbd_callback_t) qemu_rbd_completion_cb, &c);
>>>>>>>>>>>>   if (r < 0) {
>>>>>>>>>>>> @@ -1182,18 +1196,6 @@ coroutine_fn qemu_rbd_co_pwritev(BlockDriverState *bs, int64_t offset,
>>>>>>>>>>>>                                int64_t bytes, QEMUIOVector *qiov,
>>>>>>>>>>>> BdrvRequestFlags flags)
>>>>>>>>>>>> {
>>>>>>>>>>>> -    BDRVRBDState *s = bs->opaque;
>>>>>>>>>>>> -    /*
>>>>>>>>>>>> -     * RBD APIs don't allow us to write more than actual size, so in order
>>>>>>>>>>>> -     * to support growing images, we resize the image before write
>>>>>>>>>>>> -     * operations that exceed the current size.
>>>>>>>>>>>> -     */
>>>>>>>>>>>> -    if (offset + bytes > s->image_size) {
>>>>>>>>>>>> -        int r = qemu_rbd_resize(bs, offset + bytes);
>>>>>>>>>>>> -        if (r < 0) {
>>>>>>>>>>>> -            return r;
>>>>>>>>>>>> -        }
>>>>>>>>>>>> -    }
>>>>>>>>>>>>   return qemu_rbd_start_co(bs, offset, bytes, qiov, flags, RBD_AIO_WRITE);
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> -- 2.35.1
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Do we really have a use case for growing rbd images?
>>>>>>>>>>
>>>>>>>>>> The use case is to have a qcow2 image on rbd.
>>>>>>>>>> I don't think it's very common, but some people use it and here [1] we had a little discussion about features that could be interesting (e.g. persistent dirty bitmaps for incremental backup).
>>>>>>>>>>
>>>>>>>>>> In any case the support is quite simple and does not affect other use cases since we only increase the size when we go beyond the current size.
>>>>>>>>>>
>>>>>>>>>> IMHO we can have it in :-)
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The QCOW2 alone doesn’t make much sense, but additional metadata might be a use case.
>>>>>>>>
>>>>>>>> Yep.
>>>>>>>>
>>>>>>>>> Be aware that the current approach will serialize requests. If there is a real use case, we might think of a better solution.
>>>>>>>>
>>>>>>>> Good point, but it only happens when we have to resize, so maybe it's okay for now, but I agree we could do better ;-)
>>>>>>>
>>>>>>> There might also be a problem if a write for a higher offset past eof will be executed shortly before a write to a slightly lower offset past eof. The second resize will fail as it would shrink the image. We would need proper locking to avoid 
>>>>>>> this. Maybe we need to check if we write past eof. If yes, take a lock around the resize op and then check again if it’s still eof and only resize if true.
>>>>>>
>>>>>> I thought rbd_resize() was synchronous. Indeed when you said this could serialize writes it sounded like confirmation to me.
>>>>>>
>>>>>> Since we call rbd_resize() before rbd_aio_writev(), I thought this case could not occur.
>>>>>>
>>>>>> Can you please elaborate?
>>>>>
>>>>> Seconding this request, because if rbd_resize() is allowed to shrink data, it being asynchronous might cause data corruption.
>>>>>
>>>>> I’ll keep your patch because I find this highly unlikely, though: qemu_rbd_resize() itself is definitely synchronous, it can’t invoke qemu_coroutine_yield().
>>>>>
>>>>> The only other possibility that comes to my mind is that rbd_resize() might delay the actual resize operation, but I would still expect consecutive resize requests to be executed in order, and since we call rbd_aio_writev()/rbd_aio_write_zeroes() 
>>>>> immediately after the rbd_resize() (with no yielding in between), everything should be executed in the order that we expect.
>>>>
>>>>
>>>> Maybe my assumption of parallelism here was wrong. I was thinking of:
>>>>
>>>>
>>>> Request A: write at offset (EOL + 4k).
>>>>
>>>> Request A: rbd_resize is invoked (size EOL + 4k)
>>>
>>> IIUC Request B can't start until Request A calls qemu_coroutine_yield(), but I'm waiting for a confirmation from Hanna :-)
>
> That’s my impression at least.
>
>> Yes, and I would be interested if this is also true if coroutines are implemented as threads.
>
> Depends on what you mean by that.  Coroutines are a form of cooperative multitasking, i.e. they can’t be preempted unless they explicitly yield.  Threads are generally supposed to be preemptive, so those are just different things.


I believed there was a coroutine backend that used threads (at least in the past). Forget about that ;-)


>
> Of course you can use coroutines in threads, i.e. run multiple requests in parallel.  But then the coroutine part becomes largely irrelevant, and you’re just facing standard thread-safety questions, and then of course this won’t be safe.  I assume to 
> support such a model, all block drivers would need to be fully audited anyway, though.
>
> For example, theoretically, the guest could then issue two resize operations simultaneously, and qemu_rbd_co_truncate() would be called in two concurrent threads.  This would already cause problems, because setting s->image_size would race.  That’s 
> pre-existing regardless of this patch here (or d24f80234b39d2d5c0d91e63b5e4569d37b2399e).
>
> What this means is that of course we could just slap a lock around the qemu_rbd_resize() call in qemu_rbd_start_co() (and its surrounding condition), it wouldn’t cost anything, assuming that this area can’t be run in parallel anyway.  But the rest of 
> the block driver doesn’t contain a single lock yet, which to me signals that nothing in block/rbd.c is thread-safe anyway.


With that said I believe we can assume that the current implementation is safe (enough).


Peter





  reply	other threads:[~2022-03-24 11:36 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-17 16:26 [PATCH] block/rbd: fix write zeroes with growing images Stefano Garzarella
2022-03-17 18:27 ` Peter Lieven
2022-03-18  8:25   ` Stefano Garzarella
2022-03-18 15:48     ` Peter Lieven
2022-03-18 16:47       ` Stefano Garzarella
2022-03-19 15:15         ` Peter Lieven
2022-03-21  8:31           ` Stefano Garzarella
2022-03-22  9:38             ` Hanna Reitz
2022-03-24  9:52               ` Peter Lieven
2022-03-24 10:40                 ` Stefano Garzarella
2022-03-24 10:42                   ` Peter Lieven
2022-03-24 11:06                     ` Hanna Reitz
2022-03-24 11:34                       ` Peter Lieven [this message]
2022-03-19 12:40     ` Ilya Dryomov
2022-03-19 13:23       ` Ilya Dryomov
2022-03-21  8:17         ` Stefano Garzarella
2022-03-18 15:36 ` Hanna Reitz
2022-03-19 12:33 ` Ilya Dryomov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=15ae6594-2fdb-d374-81d6-9ed6ed19de2a@kamp.de \
    --to=pl@kamp.de \
    --cc=hreitz@redhat.com \
    --cc=idryomov@gmail.com \
    --cc=kwolf@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=sgarzare@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).