qemu-devel.nongnu.org archive mirror
From: Hanna Reitz <hreitz@redhat.com>
To: Stefano Garzarella <sgarzare@redhat.com>, Peter Lieven <pl@kamp.de>
Cc: Kevin Wolf <kwolf@redhat.com>, Ilya Dryomov <idryomov@gmail.com>,
	qemu-devel@nongnu.org, qemu-block@nongnu.org
Subject: Re: [PATCH] block/rbd: fix write zeroes with growing images
Date: Tue, 22 Mar 2022 10:38:44 +0100
Message-ID: <a12f9b05-1b10-02a6-111c-674d8b36df81@redhat.com>
In-Reply-To: <20220321083137.rtwh6gretloaipwk@sgarzare-redhat>

On 21.03.22 09:31, Stefano Garzarella wrote:
> On Sat, Mar 19, 2022 at 04:15:33PM +0100, Peter Lieven wrote:
>>
>>
>>> On 18.03.2022 at 17:47, Stefano Garzarella <sgarzare@redhat.com> wrote:
>>>
>>> On Fri, Mar 18, 2022 at 04:48:18PM +0100, Peter Lieven wrote:
>>>>
>>>>
>>>>>> On 18.03.2022 at 09:25, Stefano Garzarella <sgarzare@redhat.com> wrote:
>>>>>
>>>>> On Thu, Mar 17, 2022 at 07:27:05PM +0100, Peter Lieven wrote:
>>>>>>
>>>>>>
>>>>>>>> On 17.03.2022 at 17:26, Stefano Garzarella <sgarzare@redhat.com> wrote:
>>>>>>>
>>>>>>> Commit d24f80234b ("block/rbd: increase dynamically the image size")
>>>>>>> added a workaround to support growing images (e.g. qcow2), resizing
>>>>>>> the image before write operations that exceed the current size.
>>>>>>>
>>>>>>> We recently added support for write zeroes and without the
>>>>>>> workaround we can have problems with qcow2.
>>>>>>>
>>>>>>> So let's move the resize into qemu_rbd_start_co() and do it when
>>>>>>> the command is RBD_AIO_WRITE or RBD_AIO_WRITE_ZEROES.
>>>>>>>
>>>>>>> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2020993
>>>>>>> Fixes: c56ac27d2a ("block/rbd: add write zeroes support")
>>>>>>> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
>>>>>>> ---
>>>>>>> block/rbd.c | 26 ++++++++++++++------------
>>>>>>> 1 file changed, 14 insertions(+), 12 deletions(-)
>>>>>>>
>>>>>>> diff --git a/block/rbd.c b/block/rbd.c
>>>>>>> index 8f183eba2a..6caf35cbba 100644
>>>>>>> --- a/block/rbd.c
>>>>>>> +++ b/block/rbd.c
>>>>>>> @@ -1107,6 +1107,20 @@ static int coroutine_fn qemu_rbd_start_co(BlockDriverState *bs,
>>>>>>>
>>>>>>>   assert(!qiov || qiov->size == bytes);
>>>>>>>
>>>>>>> +    if (cmd == RBD_AIO_WRITE || cmd == RBD_AIO_WRITE_ZEROES) {
>>>>>>> +        /*
>>>>>>> +         * RBD APIs don't allow us to write more than actual size, so in order
>>>>>>> +         * to support growing images, we resize the image before write
>>>>>>> +         * operations that exceed the current size.
>>>>>>> +         */
>>>>>>> +        if (offset + bytes > s->image_size) {
>>>>>>> +            int r = qemu_rbd_resize(bs, offset + bytes);
>>>>>>> +            if (r < 0) {
>>>>>>> +                return r;
>>>>>>> +            }
>>>>>>> +        }
>>>>>>> +    }
>>>>>>> +
>>>>>>>   r = rbd_aio_create_completion(&task,
>>>>>>>                                 (rbd_callback_t) qemu_rbd_completion_cb, &c);
>>>>>>>   if (r < 0) {
>>>>>>> @@ -1182,18 +1196,6 @@ coroutine_fn qemu_rbd_co_pwritev(BlockDriverState *bs, int64_t offset,
>>>>>>>                                int64_t bytes, QEMUIOVector *qiov,
>>>>>>>                                BdrvRequestFlags flags)
>>>>>>> {
>>>>>>> -    BDRVRBDState *s = bs->opaque;
>>>>>>> -    /*
>>>>>>> -     * RBD APIs don't allow us to write more than actual size, so in order
>>>>>>> -     * to support growing images, we resize the image before write
>>>>>>> -     * operations that exceed the current size.
>>>>>>> -     */
>>>>>>> -    if (offset + bytes > s->image_size) {
>>>>>>> -        int r = qemu_rbd_resize(bs, offset + bytes);
>>>>>>> -        if (r < 0) {
>>>>>>> -            return r;
>>>>>>> -        }
>>>>>>> -    }
>>>>>>>   return qemu_rbd_start_co(bs, offset, bytes, qiov, flags, RBD_AIO_WRITE);
>>>>>>> }
>>>>>>>
>>>>>>> -- 
>>>>>>> 2.35.1
>>>>>>>
>>>>>>
>>>>>> Do we really have a use case for growing rbd images?
>>>>>
>>>>> The use case is to have a qcow2 image on rbd.
>>>>> I don't think it's very common, but some people use it and here 
>>>>> [1] we had a little discussion about features that could be 
>>>>> interesting (e.g.  persistent dirty bitmaps for incremental backup).
>>>>>
>>>>> In any case the support is quite simple and does not affect other 
>>>>> use cases since we only increase the size when we go beyond the 
>>>>> current size.
>>>>>
>>>>> IMHO we can have it in :-)
>>>>>
>>>>
>>>> The QCOW2 alone doesn’t make much sense, but additional metadata 
>>>> might be a use case.
>>>
>>> Yep.
>>>
>>>> Be aware that the current approach will serialize requests. If 
>>>> there is a real use case, we might think of a better solution.
>>>
>>> Good point, but it only happens when we have to resize, so maybe 
>>> it's okay for now, but I agree we could do better ;-)
>>
>> There might also be a problem if a write to a higher offset past EOF
>> is executed shortly before a write to a slightly lower offset past
>> EOF. The second resize will fail, as it would shrink the image.
>> We would need proper locking to avoid this. Maybe we need to check
>> whether we write past EOF. If yes, take a lock around the resize op,
>> then check again whether the write is still past EOF, and resize
>> only if it is.
>
> I thought rbd_resize() was synchronous. Indeed when you said this 
> could serialize writes it sounded like confirmation to me.
>
> Since we call rbd_resize() before rbd_aio_writev(), I thought this 
> case could not occur.
>
> Can you please elaborate?

Seconding this request: if rbd_resize() is allowed to shrink the
image, its being asynchronous might cause data corruption.

I’ll keep your patch because I find this highly unlikely, though:
qemu_rbd_resize() itself is definitely synchronous; it can’t invoke
qemu_coroutine_yield().

The only other possibility that comes to my mind is that rbd_resize() 
might delay the actual resize operation, but I would still expect 
consecutive resize requests to be executed in order, and since we call 
rbd_aio_writev()/rbd_aio_write_zeroes() immediately after the 
rbd_resize() (with no yielding in between), everything should be 
executed in the order that we expect.
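
To make the ordering argument concrete, here is a minimal toy model of
the check-then-resize step, assuming (as argued above) that the check
and the resize run back to back with no yield in between. ToyImage and
toy_grow_for_write() are hypothetical names for illustration only; this
is not the code in block/rbd.c and it says nothing about what librbd
does internally.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for the driver state; NOT the real BDRVRBDState. */
typedef struct {
    uint64_t image_size;    /* size as tracked on the QEMU side */
    unsigned resize_calls;  /* how often a resize would have been issued */
} ToyImage;

/*
 * Hypothetical helper mirroring the check in the patch: grow the image
 * so that [offset, offset + bytes) fits, and never shrink it.
 * Returns 0 on success, -1 if offset + bytes overflows.
 */
static int toy_grow_for_write(ToyImage *img, uint64_t offset, uint64_t bytes)
{
    uint64_t end = offset + bytes;

    if (end < offset) {
        return -1; /* unsigned wrap-around */
    }

    /*
     * Check and update happen without any yield point in between, so a
     * later request always sees the size set by an earlier one.
     */
    if (end > img->image_size) {
        img->image_size = end; /* stands in for the synchronous resize */
        img->resize_calls++;
    }
    return 0;
}
```

In this model, a write ending at a higher offset followed by a write
ending at a lower offset issues only one resize, because the second
request already sees the grown size and skips the resize branch.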

Hanna


