qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Kevin Wolf <kwolf@redhat.com>
To: "Denis V. Lunev" <den@virtuozzo.com>
Cc: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>,
	qemu-block@nongnu.org, qemu-devel@nongnu.org, hreitz@redhat.com,
	vsementsov@yandex-team.ru, pbonzini@redhat.com,
	eesposit@redhat.com
Subject: Re: [PATCH v3 1/3] block: zero data data corruption using prealloc-filter
Date: Mon, 5 Aug 2024 13:59:18 +0200	[thread overview]
Message-ID: <ZrC-liyd3CL0LXlq@redhat.com> (raw)
In-Reply-To: <03492477-fe98-4a3d-a892-7c6a143ee048@virtuozzo.com>

Am 18.07.2024 um 21:46 hat Denis V. Lunev geschrieben:
> On 7/18/24 17:51, Kevin Wolf wrote:
> > Am 16.07.2024 um 16:41 hat Andrey Drobyshev geschrieben:
> > > From: "Denis V. Lunev" <den@openvz.org>
> > > 
> > > We have observed that some clusters in the QCOW2 files are zeroed
> > > while preallocation filter is used.
> > > 
> > > We are able to trace down the following sequence when prealloc-filter
> > > is used:
> > >      co=0x55e7cbed7680 qcow2_co_pwritev_task()
> > >      co=0x55e7cbed7680 preallocate_co_pwritev_part()
> > >      co=0x55e7cbed7680 handle_write()
> > >      co=0x55e7cbed7680 bdrv_co_do_pwrite_zeroes()
> > >      co=0x55e7cbed7680 raw_do_pwrite_zeroes()
> > >      co=0x7f9edb7fe500 do_fallocate()
> > > 
> > > Here coroutine 0x55e7cbed7680 is being blocked waiting while coroutine
> > > 0x7f9edb7fe500 will finish with fallocate of the file area. OK. It is
> > > time to handle next coroutine, which
> > >      co=0x55e7cbee91b0 qcow2_co_pwritev_task()
> > >      co=0x55e7cbee91b0 preallocate_co_pwritev_part()
> > >      co=0x55e7cbee91b0 handle_write()
> > >      co=0x55e7cbee91b0 bdrv_co_do_pwrite_zeroes()
> > >      co=0x55e7cbee91b0 raw_do_pwrite_zeroes()
> > >      co=0x7f9edb7deb00 do_fallocate()
> > > 
> > > The trouble comes here. Coroutine 0x55e7cbed7680 has not advanced
> > > file_end yet and coroutine 0x55e7cbee91b0 will start fallocate() for
> > > the same area. This means that if (once fallocate is started inside
> > > 0x7f9edb7deb00) original fallocate could end and the real write will
> > > be executed. In that case write() request is handled at the same time
> > > as fallocate().
> > > 
> > > The patch moves s->file_lock assignment before fallocate and that is
> > s/file_lock/file_end/?
> > 
> > > crucial. The idea is that all subsequent requests into the area
> > > being preallocation will be issued as just writes without fallocate
> > > to this area and they will not proceed thanks to overlapping
> > > requests mechanics. If preallocation will fail, we will just switch
> > > to the normal expand-by-write behavior and that is not a problem
> > > except performance.
> > > 
> > > Signed-off-by: Denis V. Lunev <den@openvz.org>
> > > Tested-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
> > > ---
> > >   block/preallocate.c | 8 +++++++-
> > >   1 file changed, 7 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/block/preallocate.c b/block/preallocate.c
> > > index d215bc5d6d..ecf0aa4baa 100644
> > > --- a/block/preallocate.c
> > > +++ b/block/preallocate.c
> > > @@ -383,6 +383,13 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
> > >       want_merge_zero = want_merge_zero && (prealloc_start <= offset);
> > > +    /*
> > > +     * Assign file_end before making actual preallocation. This will ensure
> > > +     * that next request performed while preallocation is in progress will
> > > +     * be passed without preallocation.
> > > +     */
> > > +    s->file_end = prealloc_end;
> > > +
> > >       ret = bdrv_co_pwrite_zeroes(
> > >               bs->file, prealloc_start, prealloc_end - prealloc_start,
> > >               BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | BDRV_REQ_NO_WAIT);
> > > @@ -391,7 +398,6 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
> > >           return false;
> > >       }
> > > -    s->file_end = prealloc_end;
> > >       return want_merge_zero;
> > >   }
> > Until bdrv_co_pwrite_zeroes() completes successfully, the file size is
> > unchanged. In other words, if anything calls preallocate_co_getlength()
> > in the meantime, don't we run into...
> > 
> >      ret = bdrv_co_getlength(bs->file->bs);
> > 
> >      if (has_prealloc_perms(bs)) {
> >          s->file_end = s->zero_start = s->data_end = ret;
> >      }
> > 
> > ...and reset s->file_end back to the old value, re-enabling the bug
> > you're trying to fix here?
> > 
> > Or do we know that no such code path can be called concurrently for some
> > reason?
> > 
> > Kevin
> > 
> After more detailed thinking I tend to disagree.
> Normally we would not hit the problem. Though
> this was not obvious at the beginning :)
> 
> The point in handle_write() where we move
> file_end assignment is reachable only if the
> following code has been executed
> 
>     if (s->data_end < 0) {
>         s->data_end = bdrv_co_getlength(bs->file->bs);
>         if (s->data_end < 0) {
>             return false;
>         }
> 
>         if (s->file_end < 0) {
>             s->file_end = s->data_end;
>         }
>     }
> 
>     if (end <= s->data_end) {
>         return false;
>     }
> 
> which means that s->data_end > 0.
> 
> Thus
> 
> static int64_t coroutine_fn GRAPH_RDLOCK
> preallocate_co_getlength(BlockDriverState *bs)
> {
>     int64_t ret;
>     BDRVPreallocateState *s = bs->opaque;
> 
>     if (s->data_end >= 0) {
>         return s->data_end; <--- we will return here
>     }
> 
>     ret = bdrv_co_getlength(bs->file->bs);
> 
>     if (has_prealloc_perms(bs)) {
>         s->file_end = s->zero_start = s->data_end = ret;
>     }
> 
>     return ret;
> }

I think you're right there. And the other places that update s->file_end
should probably be okay because of the request serialisation.

I'm okay with applying this patch as it seems to fix a corruption, but
the way this whole block driver operates doesn't feel very robust to me.
There seem to be a lot of implicit assumptions that make the code hard
to understand even though it's quite short.

Kevin



  reply	other threads:[~2024-08-05 12:00 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-07-16 14:41 [PATCH v3 0/3] Fix data corruption within preallocation Andrey Drobyshev
2024-07-16 14:41 ` [PATCH v3 1/3] block: zero data data corruption using prealloc-filter Andrey Drobyshev
2024-07-18 15:51   ` Kevin Wolf
2024-07-18 15:52     ` Denis V. Lunev
2024-07-18 19:46     ` Denis V. Lunev
2024-08-05 11:59       ` Kevin Wolf [this message]
2024-08-05 12:13         ` Denis V. Lunev
2024-07-16 14:41 ` [PATCH v3 2/3] iotests/298: add testcase for async writes with preallocation filter Andrey Drobyshev
2024-08-05 12:04   ` Kevin Wolf
2024-08-05 12:56     ` Andrey Drobyshev
2024-08-05 13:50       ` Kevin Wolf
2024-07-16 14:41 ` [PATCH v3 3/3] scripts: add filev2p.py script for mapping virtual file offsets mapping Andrey Drobyshev
2024-08-05 12:29   ` Kevin Wolf
2024-08-05 13:02     ` Andrey Drobyshev

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZrC-liyd3CL0LXlq@redhat.com \
    --to=kwolf@redhat.com \
    --cc=andrey.drobyshev@virtuozzo.com \
    --cc=den@virtuozzo.com \
    --cc=eesposit@redhat.com \
    --cc=hreitz@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=vsementsov@yandex-team.ru \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).