From: Peter Lieven <pl@kamp.de>
To: Hanna Reitz <hreitz@redhat.com>,
Stefano Garzarella <sgarzare@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>, Ilya Dryomov <idryomov@gmail.com>,
qemu-devel@nongnu.org, qemu-block@nongnu.org
Subject: Re: [PATCH] block/rbd: fix write zeroes with growing images
Date: Thu, 24 Mar 2022 12:34:55 +0100
Message-ID: <15ae6594-2fdb-d374-81d6-9ed6ed19de2a@kamp.de>
In-Reply-To: <ecef6a0b-5a91-34ad-ee89-d86e9166c4a0@redhat.com>
On 24.03.22 at 12:06, Hanna Reitz wrote:
> On 24.03.22 11:42, Peter Lieven wrote:
>> On 24.03.22 at 11:40, Stefano Garzarella wrote:
>>> On Thu, Mar 24, 2022 at 10:52:04AM +0100, Peter Lieven wrote:
>>>> On 22.03.22 at 10:38, Hanna Reitz wrote:
>>>>> On 21.03.22 09:31, Stefano Garzarella wrote:
>>>>>> On Sat, Mar 19, 2022 at 04:15:33PM +0100, Peter Lieven wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On 18.03.2022 at 17:47, Stefano Garzarella <sgarzare@redhat.com> wrote:
>>>>>>>>
>>>>>>>> On Fri, Mar 18, 2022 at 04:48:18PM +0100, Peter Lieven wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> On 18.03.2022 at 09:25, Stefano Garzarella <sgarzare@redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On Thu, Mar 17, 2022 at 07:27:05PM +0100, Peter Lieven wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>> On 17.03.2022 at 17:26, Stefano Garzarella <sgarzare@redhat.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Commit d24f80234b ("block/rbd: increase dynamically the image size")
>>>>>>>>>>>> added a workaround to support growing images (e.g. qcow2), resizing
>>>>>>>>>>>> the image before write operations that exceed the current size.
>>>>>>>>>>>>
>>>>>>>>>>>> We recently added support for write zeroes and without the
>>>>>>>>>>>> workaround we can have problems with qcow2.
>>>>>>>>>>>>
>>>>>>>>>>>> So let's move the resize into qemu_rbd_start_co() and do it when
>>>>>>>>>>>> the command is RBD_AIO_WRITE or RBD_AIO_WRITE_ZEROES.
>>>>>>>>>>>>
>>>>>>>>>>>> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2020993
>>>>>>>>>>>> Fixes: c56ac27d2a ("block/rbd: add write zeroes support")
>>>>>>>>>>>> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>> block/rbd.c | 26 ++++++++++++++------------
>>>>>>>>>>>> 1 file changed, 14 insertions(+), 12 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/block/rbd.c b/block/rbd.c
>>>>>>>>>>>> index 8f183eba2a..6caf35cbba 100644
>>>>>>>>>>>> --- a/block/rbd.c
>>>>>>>>>>>> +++ b/block/rbd.c
>>>>>>>>>>>> @@ -1107,6 +1107,20 @@ static int coroutine_fn qemu_rbd_start_co(BlockDriverState *bs,
>>>>>>>>>>>>
>>>>>>>>>>>>      assert(!qiov || qiov->size == bytes);
>>>>>>>>>>>>
>>>>>>>>>>>> +    if (cmd == RBD_AIO_WRITE || cmd == RBD_AIO_WRITE_ZEROES) {
>>>>>>>>>>>> +        /*
>>>>>>>>>>>> +         * RBD APIs don't allow us to write more than actual size, so in order
>>>>>>>>>>>> +         * to support growing images, we resize the image before write
>>>>>>>>>>>> +         * operations that exceed the current size.
>>>>>>>>>>>> +         */
>>>>>>>>>>>> +        if (offset + bytes > s->image_size) {
>>>>>>>>>>>> +            int r = qemu_rbd_resize(bs, offset + bytes);
>>>>>>>>>>>> +            if (r < 0) {
>>>>>>>>>>>> +                return r;
>>>>>>>>>>>> +            }
>>>>>>>>>>>> +        }
>>>>>>>>>>>> +    }
>>>>>>>>>>>> +
>>>>>>>>>>>>      r = rbd_aio_create_completion(&task,
>>>>>>>>>>>>                                    (rbd_callback_t) qemu_rbd_completion_cb, &c);
>>>>>>>>>>>>      if (r < 0) {
>>>>>>>>>>>> @@ -1182,18 +1196,6 @@ coroutine_fn qemu_rbd_co_pwritev(BlockDriverState *bs, int64_t offset,
>>>>>>>>>>>>                                   int64_t bytes, QEMUIOVector *qiov,
>>>>>>>>>>>>                                   BdrvRequestFlags flags)
>>>>>>>>>>>>  {
>>>>>>>>>>>> -    BDRVRBDState *s = bs->opaque;
>>>>>>>>>>>> -    /*
>>>>>>>>>>>> -     * RBD APIs don't allow us to write more than actual size, so in order
>>>>>>>>>>>> -     * to support growing images, we resize the image before write
>>>>>>>>>>>> -     * operations that exceed the current size.
>>>>>>>>>>>> -     */
>>>>>>>>>>>> -    if (offset + bytes > s->image_size) {
>>>>>>>>>>>> -        int r = qemu_rbd_resize(bs, offset + bytes);
>>>>>>>>>>>> -        if (r < 0) {
>>>>>>>>>>>> -            return r;
>>>>>>>>>>>> -        }
>>>>>>>>>>>> -    }
>>>>>>>>>>>>      return qemu_rbd_start_co(bs, offset, bytes, qiov, flags, RBD_AIO_WRITE);
>>>>>>>>>>>>  }
>>>>>>>>>>>>
>>>>>>>>>>>> -- 2.35.1
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Do we really have a use case for growing rbd images?
>>>>>>>>>>
>>>>>>>>>> The use case is to have a qcow2 image on rbd.
>>>>>>>>>> I don't think it's very common, but some people use it and here [1] we had a little discussion about features that could be interesting (e.g. persistent dirty bitmaps for incremental backup).
>>>>>>>>>>
>>>>>>>>>> In any case the support is quite simple and does not affect other use cases since we only increase the size when we go beyond the current size.
>>>>>>>>>>
>>>>>>>>>> IMHO we can have it in :-)
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The QCOW2 alone doesn’t make much sense, but additional metadata might be a use case.
>>>>>>>>
>>>>>>>> Yep.
>>>>>>>>
>>>>>>>>> Be aware that the current approach will serialize requests. If there is a real use case, we might think of a better solution.
>>>>>>>>
>>>>>>>> Good point, but it only happens when we have to resize, so maybe it's okay for now, but I agree we could do better ;-)
>>>>>>>
>>>>>>> There might also be a problem if a write to a higher offset past EOF is executed shortly before a write to a slightly lower offset past EOF. The second resize will fail as it would shrink the image. We would need proper locking to avoid
>>>>>>> this. Maybe we need to check whether we write past EOF. If yes, take a lock around the resize op, then check again whether the write is still past EOF, and only resize if true.
>>>>>>
>>>>>> I thought rbd_resize() was synchronous. Indeed when you said this could serialize writes it sounded like confirmation to me.
>>>>>>
>>>>>> Since we call rbd_resize() before rbd_aio_writev(), I thought this case could not occur.
>>>>>>
>>>>>> Can you please elaborate?
>>>>>
>>>>> Seconding this request, because if rbd_resize() is allowed to shrink data, it being asynchronous might cause data corruption.
>>>>>
>>>>> I’ll keep your patch because I find this highly unlikely, though: qemu_rbd_resize() itself is definitely synchronous, it can’t invoke qemu_coroutine_yield().
>>>>>
>>>>> The only other possibility that comes to my mind is that rbd_resize() might delay the actual resize operation, but I would still expect consecutive resize requests to be executed in order, and since we call rbd_aio_writev()/rbd_aio_write_zeroes()
>>>>> immediately after the rbd_resize() (with no yielding in between), everything should be executed in the order that we expect.
>>>>
>>>>
>>>> Maybe my assumption of parallelism here was wrong. I was thinking of:
>>>>
>>>>
>>>> Request A: write at offset (EOF + 4k).
>>>>
>>>> Request A: rbd_resize is invoked (size EOF + 4k)
>>>
>>> IIUC Request B can't start until Request A calls qemu_coroutine_yield(), but I'm waiting for a confirmation from Hanna :-)
>
> That’s my impression at least.
>
>> Yes, and I would be interested if this is also true if coroutines are implemented as threads.
>
> Depends on what you mean by that. Coroutines are a form of cooperative multitasking, i.e. they can’t be preempted unless they explicitly yield. Threads are generally supposed to be preemptive, so those are just different things.
I believed there was a coroutine backend that used threads (at least in the past). Forget about that ;-)
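To illustrate the cooperative scheduling argument, here is a toy sketch built on POSIX ucontext. It is not QEMU code: toy_request(), toy_yield(), toy_run_requests() and the globals are hypothetical names made up for this example, and the "resize" is just an assignment. It only shows that when the size check and the resize sit between two yield points, a second request cannot slip in between them.

```c
#include <stdint.h>
#include <ucontext.h>

#define TOY_STACK_SIZE (64 * 1024)

/* All names below are hypothetical; nothing here is taken from block/rbd.c. */
static ucontext_t toy_main_ctx, toy_co_ctx[2];
static char toy_stacks[2][TOY_STACK_SIZE];

static uint64_t toy_image_size = 4096;   /* current "EOF" */
static int toy_resize_calls;

/* Request 0 ends at EOF + 8k, request 1 ends at EOF + 4k. */
static const uint64_t toy_write_end[2] = { 4096 + 8192, 4096 + 4096 };

/* Hand control back to the scheduler; the only point where we can be paused. */
static void toy_yield(int self)
{
    swapcontext(&toy_co_ctx[self], &toy_main_ctx);
}

static void toy_request(int self)
{
    /* Check and resize run back to back, with no yield in between. */
    if (toy_write_end[self] > toy_image_size) {
        toy_image_size = toy_write_end[self];
        toy_resize_calls++;
    }
    /* Stand-in for waiting on the asynchronous write to complete. */
    toy_yield(self);
}

/* Start both requests, then resume each one so that it can finish. */
static void toy_run_requests(void)
{
    for (int i = 0; i < 2; i++) {
        getcontext(&toy_co_ctx[i]);
        toy_co_ctx[i].uc_stack.ss_sp = toy_stacks[i];
        toy_co_ctx[i].uc_stack.ss_size = TOY_STACK_SIZE;
        toy_co_ctx[i].uc_link = &toy_main_ctx;
        makecontext(&toy_co_ctx[i], (void (*)(void))toy_request, 1, i);
    }
    for (int pass = 0; pass < 2; pass++) {
        for (int i = 0; i < 2; i++) {
            swapcontext(&toy_main_ctx, &toy_co_ctx[i]);
        }
    }
}
```

Request 1 only gets to run after request 0 has finished its check and resize, so it sees the larger size and skips its own resize. With real preemptive threads this ordering guarantee is gone.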
>
> Of course you can use coroutines in threads, i.e. run multiple requests in parallel. But then the coroutine part becomes largely irrelevant, and you’re just facing standard thread-safety questions, and then of course this won’t be safe. I assume to
> support such a model, all block drivers would need to be fully audited anyway, though.
>
> For example, theoretically, the guest could then issue two resize operations simultaneously, and qemu_rbd_co_truncate() would be called in two concurrent threads. This would already cause problems, because setting s->image_size would race. That’s
> pre-existing regardless of this patch here (or d24f80234b39d2d5c0d91e63b5e4569d37b2399e).
>
> What this means is that of course we could just slap a lock around the qemu_rbd_resize() call in qemu_rbd_start_co() (and its surrounding condition), it wouldn’t cost anything, assuming that this area can’t be run in parallel anyway. But the rest of
> the block driver doesn’t contain a single lock yet, which to me signals that nothing in block/rbd.c is thread-safe anyway.
With that said, I believe we can assume that the current implementation is safe (enough).
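For the record, the "lock around the resize and its surrounding condition" idea could look roughly like the sketch below, should it ever be needed. This is a standalone toy and an assumption on my side, not a proposed change to block/rbd.c: toy_image, toy_resize() and toy_grow_for_write() are hypothetical names, and a pthread mutex stands in for whatever lock QEMU would actually use.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver state; not the real BDRVRBDState. */
typedef struct {
    pthread_mutex_t resize_lock;
    uint64_t image_size;
} toy_image;

/* Hypothetical stand-in for the resize call; it only records the new size. */
static int toy_resize(toy_image *img, uint64_t new_size)
{
    img->image_size = new_size;
    return 0;
}

/*
 * Grow the image if the write [offset, offset + bytes) ends past EOF.
 * The size check is made while holding the lock, so a resize to a larger
 * size that completed in another thread is never followed by a resize
 * to a smaller one.
 */
static int toy_grow_for_write(toy_image *img, uint64_t offset, uint64_t bytes)
{
    int r = 0;

    pthread_mutex_lock(&img->resize_lock);
    if (offset + bytes > img->image_size) {
        r = toy_resize(img, offset + bytes);
    }
    pthread_mutex_unlock(&img->resize_lock);

    return r;
}
```

A caller would initialise the state with something like `toy_image img = { PTHREAD_MUTEX_INITIALIZER, 4096 };` and call toy_grow_for_write() before issuing each write. A write ending at EOF + 8k followed by one ending at EOF + 4k then leaves the size at EOF + 8k.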
Peter