From: Kevin Wolf <kwolf@redhat.com>
To: "Denis V. Lunev" <den@virtuozzo.com>
Cc: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>,
qemu-block@nongnu.org, qemu-devel@nongnu.org, hreitz@redhat.com,
vsementsov@yandex-team.ru, pbonzini@redhat.com,
eesposit@redhat.com
Subject: Re: [PATCH v3 1/3] block: zero data data corruption using prealloc-filter
Date: Mon, 5 Aug 2024 13:59:18 +0200 [thread overview]
Message-ID: <ZrC-liyd3CL0LXlq@redhat.com> (raw)
In-Reply-To: <03492477-fe98-4a3d-a892-7c6a143ee048@virtuozzo.com>
Am 18.07.2024 um 21:46 hat Denis V. Lunev geschrieben:
> On 7/18/24 17:51, Kevin Wolf wrote:
> > Am 16.07.2024 um 16:41 hat Andrey Drobyshev geschrieben:
> > > From: "Denis V. Lunev" <den@openvz.org>
> > >
> > > We have observed that some clusters in the QCOW2 files are zeroed
> > > while preallocation filter is used.
> > >
> > > We are able to trace down the following sequence when prealloc-filter
> > > is used:
> > > co=0x55e7cbed7680 qcow2_co_pwritev_task()
> > > co=0x55e7cbed7680 preallocate_co_pwritev_part()
> > > co=0x55e7cbed7680 handle_write()
> > > co=0x55e7cbed7680 bdrv_co_do_pwrite_zeroes()
> > > co=0x55e7cbed7680 raw_do_pwrite_zeroes()
> > > co=0x7f9edb7fe500 do_fallocate()
> > >
> > > Here coroutine 0x55e7cbed7680 is being blocked waiting while coroutine
> > > 0x7f9edb7fe500 will finish with fallocate of the file area. OK. It is
> > > time to handle next coroutine, which
> > > co=0x55e7cbee91b0 qcow2_co_pwritev_task()
> > > co=0x55e7cbee91b0 preallocate_co_pwritev_part()
> > > co=0x55e7cbee91b0 handle_write()
> > > co=0x55e7cbee91b0 bdrv_co_do_pwrite_zeroes()
> > > co=0x55e7cbee91b0 raw_do_pwrite_zeroes()
> > > co=0x7f9edb7deb00 do_fallocate()
> > >
> > > The trouble comes here. Coroutine 0x55e7cbed7680 has not advanced
> > > file_end yet and coroutine 0x55e7cbee91b0 will start fallocate() for
> > > the same area. This means that if (once fallocate is started inside
> > > 0x7f9edb7deb00) original fallocate could end and the real write will
> > > be executed. In that case write() request is handled at the same time
> > > as fallocate().
> > >
> > > The patch moves s->file_lock assignment before fallocate and that is
> > s/file_lock/file_end/?
> >
> > > crucial. The idea is that all subsequent requests into the area
> > > being preallocation will be issued as just writes without fallocate
> > > to this area and they will not proceed thanks to overlapping
> > > requests mechanics. If preallocation will fail, we will just switch
> > > to the normal expand-by-write behavior and that is not a problem
> > > except performance.
> > >
> > > Signed-off-by: Denis V. Lunev <den@openvz.org>
> > > Tested-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
> > > ---
> > > block/preallocate.c | 8 +++++++-
> > > 1 file changed, 7 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/block/preallocate.c b/block/preallocate.c
> > > index d215bc5d6d..ecf0aa4baa 100644
> > > --- a/block/preallocate.c
> > > +++ b/block/preallocate.c
> > > @@ -383,6 +383,13 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
> > > want_merge_zero = want_merge_zero && (prealloc_start <= offset);
> > > + /*
> > > + * Assign file_end before making actual preallocation. This will ensure
> > > + * that next request performed while preallocation is in progress will
> > > + * be passed without preallocation.
> > > + */
> > > + s->file_end = prealloc_end;
> > > +
> > > ret = bdrv_co_pwrite_zeroes(
> > > bs->file, prealloc_start, prealloc_end - prealloc_start,
> > > BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | BDRV_REQ_NO_WAIT);
> > > @@ -391,7 +398,6 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
> > > return false;
> > > }
> > > - s->file_end = prealloc_end;
> > > return want_merge_zero;
> > > }
> > Until bdrv_co_pwrite_zeroes() completes successfully, the file size is
> > unchanged. In other words, if anything calls preallocate_co_getlength()
> > in the meantime, don't we run into...
> >
> > ret = bdrv_co_getlength(bs->file->bs);
> >
> > if (has_prealloc_perms(bs)) {
> > s->file_end = s->zero_start = s->data_end = ret;
> > }
> >
> > ...and reset s->file_end back to the old value, re-enabling the bug
> > you're trying to fix here?
> >
> > Or do we know that no such code path can be called concurrently for some
> > reason?
> >
> > Kevin
> >
> After more detailed thinking I tend to disagree.
> Normally we would not hit the problem. Though
> this was not obvious at the beginning :)
>
> The point in handle_write() where we move
> file_end assignment is reachable only if the
> following code has been executed
>
> if (s->data_end < 0) {
> s->data_end = bdrv_co_getlength(bs->file->bs);
> if (s->data_end < 0) {
> return false;
> }
>
> if (s->file_end < 0) {
> s->file_end = s->data_end;
> }
> }
>
> if (end <= s->data_end) {
> return false;
> }
>
> which means that s->data_end > 0.
>
> Thus
>
> static int64_t coroutine_fn GRAPH_RDLOCK
> preallocate_co_getlength(BlockDriverState *bs)
> {
> int64_t ret;
> BDRVPreallocateState *s = bs->opaque;
>
> if (s->data_end >= 0) {
> return s->data_end; <--- we will return here
> }
>
> ret = bdrv_co_getlength(bs->file->bs);
>
> if (has_prealloc_perms(bs)) {
> s->file_end = s->zero_start = s->data_end = ret;
> }
>
> return ret;
> }
I think you're right there. And the other places that update s->file_end
should probably be okay because of the request serialisation.
I'm okay with applying this patch as it seems to fix a corruption, but
the way this whole block driver operates doesn't feel very robust to me.
There seem to be a lot of implicit assumptions that make the code hard
to understand even though it's quite short.
Kevin
next prev parent reply other threads:[~2024-08-05 12:00 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-07-16 14:41 [PATCH v3 0/3] Fix data corruption within preallocation Andrey Drobyshev
2024-07-16 14:41 ` [PATCH v3 1/3] block: zero data data corruption using prealloc-filter Andrey Drobyshev
2024-07-18 15:51 ` Kevin Wolf
2024-07-18 15:52 ` Denis V. Lunev
2024-07-18 19:46 ` Denis V. Lunev
2024-08-05 11:59 ` Kevin Wolf [this message]
2024-08-05 12:13 ` Denis V. Lunev
2024-07-16 14:41 ` [PATCH v3 2/3] iotests/298: add testcase for async writes with preallocation filter Andrey Drobyshev
2024-08-05 12:04 ` Kevin Wolf
2024-08-05 12:56 ` Andrey Drobyshev
2024-08-05 13:50 ` Kevin Wolf
2024-07-16 14:41 ` [PATCH v3 3/3] scripts: add filev2p.py script for mapping virtual file offsets mapping Andrey Drobyshev
2024-08-05 12:29 ` Kevin Wolf
2024-08-05 13:02 ` Andrey Drobyshev
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZrC-liyd3CL0LXlq@redhat.com \
--to=kwolf@redhat.com \
--cc=andrey.drobyshev@virtuozzo.com \
--cc=den@virtuozzo.com \
--cc=eesposit@redhat.com \
--cc=hreitz@redhat.com \
--cc=pbonzini@redhat.com \
--cc=qemu-block@nongnu.org \
--cc=qemu-devel@nongnu.org \
--cc=vsementsov@yandex-team.ru \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).