qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Kevin Wolf <kwolf@redhat.com>
To: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
Cc: qemu-block@nongnu.org, qemu-devel@nongnu.org, hreitz@redhat.com,
	vsementsov@yandex-team.ru, pbonzini@redhat.com,
	eesposit@redhat.com, den@virtuozzo.com
Subject: Re: [PATCH v3 1/3] block: zero data data corruption using prealloc-filter
Date: Thu, 18 Jul 2024 17:51:20 +0200	[thread overview]
Message-ID: <Zpk5-GDaqmD4c-EF@redhat.com> (raw)
In-Reply-To: <20240716144123.651476-2-andrey.drobyshev@virtuozzo.com>

Am 16.07.2024 um 16:41 hat Andrey Drobyshev geschrieben:
> From: "Denis V. Lunev" <den@openvz.org>
> 
> We have observed that some clusters in the QCOW2 files are zeroed
> while preallocation filter is used.
> 
> We are able to trace down the following sequence when prealloc-filter
> is used:
>     co=0x55e7cbed7680 qcow2_co_pwritev_task()
>     co=0x55e7cbed7680 preallocate_co_pwritev_part()
>     co=0x55e7cbed7680 handle_write()
>     co=0x55e7cbed7680 bdrv_co_do_pwrite_zeroes()
>     co=0x55e7cbed7680 raw_do_pwrite_zeroes()
>     co=0x7f9edb7fe500 do_fallocate()
> 
> Here coroutine 0x55e7cbed7680 is being blocked waiting while coroutine
> 0x7f9edb7fe500 will finish with fallocate of the file area. OK. It is
> time to handle next coroutine, which
>     co=0x55e7cbee91b0 qcow2_co_pwritev_task()
>     co=0x55e7cbee91b0 preallocate_co_pwritev_part()
>     co=0x55e7cbee91b0 handle_write()
>     co=0x55e7cbee91b0 bdrv_co_do_pwrite_zeroes()
>     co=0x55e7cbee91b0 raw_do_pwrite_zeroes()
>     co=0x7f9edb7deb00 do_fallocate()
> 
> The trouble comes here. Coroutine 0x55e7cbed7680 has not advanced
> file_end yet and coroutine 0x55e7cbee91b0 will start fallocate() for
> the same area. This means that if (once fallocate is started inside
> 0x7f9edb7deb00) original fallocate could end and the real write will
> be executed. In that case write() request is handled at the same time
> as fallocate().
> 
> The patch moves s->file_lock assignment before fallocate and that is

s/file_lock/file_end/?

> crucial. The idea is that all subsequent requests into the area
> being preallocation will be issued as just writes without fallocate
> to this area and they will not proceed thanks to overlapping
> requests mechanics. If preallocation will fail, we will just switch
> to the normal expand-by-write behavior and that is not a problem
> except performance.
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Tested-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
> ---
>  block/preallocate.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/block/preallocate.c b/block/preallocate.c
> index d215bc5d6d..ecf0aa4baa 100644
> --- a/block/preallocate.c
> +++ b/block/preallocate.c
> @@ -383,6 +383,13 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
>  
>      want_merge_zero = want_merge_zero && (prealloc_start <= offset);
>  
> +    /*
> +     * Assign file_end before making actual preallocation. This will ensure
> +     * that next request performed while preallocation is in progress will
> +     * be passed without preallocation.
> +     */
> +    s->file_end = prealloc_end;
> +
>      ret = bdrv_co_pwrite_zeroes(
>              bs->file, prealloc_start, prealloc_end - prealloc_start,
>              BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | BDRV_REQ_NO_WAIT);
> @@ -391,7 +398,6 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
>          return false;
>      }
>  
> -    s->file_end = prealloc_end;
>      return want_merge_zero;
>  }

Until bdrv_co_pwrite_zeroes() completes successfully, the file size is
unchanged. In other words, if anything calls preallocate_co_getlength()
in the meantime, don't we run into...

    ret = bdrv_co_getlength(bs->file->bs);

    if (has_prealloc_perms(bs)) {
        s->file_end = s->zero_start = s->data_end = ret;
    }

...and reset s->file_end back to the old value, re-enabling the bug
you're trying to fix here?

Or do we know that no such code path can be called concurrently for some
reason?

Kevin



  reply	other threads:[~2024-07-18 15:51 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-07-16 14:41 [PATCH v3 0/3] Fix data corruption within preallocation Andrey Drobyshev
2024-07-16 14:41 ` [PATCH v3 1/3] block: zero data data corruption using prealloc-filter Andrey Drobyshev
2024-07-18 15:51   ` Kevin Wolf [this message]
2024-07-18 15:52     ` Denis V. Lunev
2024-07-18 19:46     ` Denis V. Lunev
2024-08-05 11:59       ` Kevin Wolf
2024-08-05 12:13         ` Denis V. Lunev
2024-07-16 14:41 ` [PATCH v3 2/3] iotests/298: add testcase for async writes with preallocation filter Andrey Drobyshev
2024-08-05 12:04   ` Kevin Wolf
2024-08-05 12:56     ` Andrey Drobyshev
2024-08-05 13:50       ` Kevin Wolf
2024-07-16 14:41 ` [PATCH v3 3/3] scripts: add filev2p.py script for mapping virtual file offsets mapping Andrey Drobyshev
2024-08-05 12:29   ` Kevin Wolf
2024-08-05 13:02     ` Andrey Drobyshev

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zpk5-GDaqmD4c-EF@redhat.com \
    --to=kwolf@redhat.com \
    --cc=andrey.drobyshev@virtuozzo.com \
    --cc=den@virtuozzo.com \
    --cc=eesposit@redhat.com \
    --cc=hreitz@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=vsementsov@yandex-team.ru \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).