qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: "Denis V. Lunev" <den@virtuozzo.com>
To: Kevin Wolf <kwolf@redhat.com>
Cc: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>,
	qemu-block@nongnu.org,  qemu-devel@nongnu.org, hreitz@redhat.com,
	vsementsov@yandex-team.ru, pbonzini@redhat.com,
	eesposit@redhat.com
Subject: Re: [PATCH v3 1/3] block: zero data data corruption using prealloc-filter
Date: Mon, 5 Aug 2024 14:13:44 +0200	[thread overview]
Message-ID: <19c69ed8-7788-474f-93fc-8a4568d2c32f@virtuozzo.com> (raw)
In-Reply-To: <ZrC-liyd3CL0LXlq@redhat.com>

On 8/5/24 13:59, Kevin Wolf wrote:
> Am 18.07.2024 um 21:46 hat Denis V. Lunev geschrieben:
>> On 7/18/24 17:51, Kevin Wolf wrote:
>>> Am 16.07.2024 um 16:41 hat Andrey Drobyshev geschrieben:
>>>> From: "Denis V. Lunev" <den@openvz.org>
>>>>
>>>> We have observed that some clusters in the QCOW2 files are zeroed
>>>> while preallocation filter is used.
>>>>
>>>> We are able to trace down the following sequence when prealloc-filter
>>>> is used:
>>>>       co=0x55e7cbed7680 qcow2_co_pwritev_task()
>>>>       co=0x55e7cbed7680 preallocate_co_pwritev_part()
>>>>       co=0x55e7cbed7680 handle_write()
>>>>       co=0x55e7cbed7680 bdrv_co_do_pwrite_zeroes()
>>>>       co=0x55e7cbed7680 raw_do_pwrite_zeroes()
>>>>       co=0x7f9edb7fe500 do_fallocate()
>>>>
>>>> Here coroutine 0x55e7cbed7680 is being blocked waiting while coroutine
>>>> 0x7f9edb7fe500 will finish with fallocate of the file area. OK. It is
>>>> time to handle next coroutine, which
>>>>       co=0x55e7cbee91b0 qcow2_co_pwritev_task()
>>>>       co=0x55e7cbee91b0 preallocate_co_pwritev_part()
>>>>       co=0x55e7cbee91b0 handle_write()
>>>>       co=0x55e7cbee91b0 bdrv_co_do_pwrite_zeroes()
>>>>       co=0x55e7cbee91b0 raw_do_pwrite_zeroes()
>>>>       co=0x7f9edb7deb00 do_fallocate()
>>>>
>>>> The trouble comes here. Coroutine 0x55e7cbed7680 has not advanced
>>>> file_end yet and coroutine 0x55e7cbee91b0 will start fallocate() for
>>>> the same area. This means that if (once fallocate is started inside
>>>> 0x7f9edb7deb00) original fallocate could end and the real write will
>>>> be executed. In that case write() request is handled at the same time
>>>> as fallocate().
>>>>
>>>> The patch moves s->file_lock assignment before fallocate and that is
>>> s/file_lock/file_end/?
>>>
>>>> crucial. The idea is that all subsequent requests into the area
>>>> being preallocation will be issued as just writes without fallocate
>>>> to this area and they will not proceed thanks to overlapping
>>>> requests mechanics. If preallocation will fail, we will just switch
>>>> to the normal expand-by-write behavior and that is not a problem
>>>> except performance.
>>>>
>>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>>> Tested-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
>>>> ---
>>>>    block/preallocate.c | 8 +++++++-
>>>>    1 file changed, 7 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/block/preallocate.c b/block/preallocate.c
>>>> index d215bc5d6d..ecf0aa4baa 100644
>>>> --- a/block/preallocate.c
>>>> +++ b/block/preallocate.c
>>>> @@ -383,6 +383,13 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
>>>>        want_merge_zero = want_merge_zero && (prealloc_start <= offset);
>>>> +    /*
>>>> +     * Assign file_end before making actual preallocation. This will ensure
>>>> +     * that next request performed while preallocation is in progress will
>>>> +     * be passed without preallocation.
>>>> +     */
>>>> +    s->file_end = prealloc_end;
>>>> +
>>>>        ret = bdrv_co_pwrite_zeroes(
>>>>                bs->file, prealloc_start, prealloc_end - prealloc_start,
>>>>                BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | BDRV_REQ_NO_WAIT);
>>>> @@ -391,7 +398,6 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
>>>>            return false;
>>>>        }
>>>> -    s->file_end = prealloc_end;
>>>>        return want_merge_zero;
>>>>    }
>>> Until bdrv_co_pwrite_zeroes() completes successfully, the file size is
>>> unchanged. In other words, if anything calls preallocate_co_getlength()
>>> in the meantime, don't we run into...
>>>
>>>       ret = bdrv_co_getlength(bs->file->bs);
>>>
>>>       if (has_prealloc_perms(bs)) {
>>>           s->file_end = s->zero_start = s->data_end = ret;
>>>       }
>>>
>>> ...and reset s->file_end back to the old value, re-enabling the bug
>>> you're trying to fix here?
>>>
>>> Or do we know that no such code path can be called concurrently for some
>>> reason?
>>>
>>> Kevin
>>>
>> After more detailed thinking I tend to disagree.
>> Normally we would not hit the problem. Though
>> this was not obvious at the beginning :)
>>
>> The point in handle_write() where we move
>> file_end assignment is reachable only if the
>> following code has been executed
>>
>>      if (s->data_end < 0) {
>>          s->data_end = bdrv_co_getlength(bs->file->bs);
>>          if (s->data_end < 0) {
>>              return false;
>>          }
>>
>>          if (s->file_end < 0) {
>>              s->file_end = s->data_end;
>>          }
>>      }
>>
>>      if (end <= s->data_end) {
>>          return false;
>>      }
>>
>> which means that s->data_end > 0.
>>
>> Thus
>>
>> static int64_t coroutine_fn GRAPH_RDLOCK
>> preallocate_co_getlength(BlockDriverState *bs)
>> {
>>      int64_t ret;
>>      BDRVPreallocateState *s = bs->opaque;
>>
>>      if (s->data_end >= 0) {
>>          return s->data_end; <--- we will return here
>>      }
>>
>>      ret = bdrv_co_getlength(bs->file->bs);
>>
>>      if (has_prealloc_perms(bs)) {
>>          s->file_end = s->zero_start = s->data_end = ret;
>>      }
>>
>>      return ret;
>> }
> I think you're right there. And the other places that update s->file_end
> should probably be okay because of the request serialisation.
>
> I'm okay with applying this patch as it seems to fix a corruption, but
> the way this whole block driver operates doesn't feel very robust to me.
> There seem to be a lot of implicit assumptions that make the code hard
> to understand even though it's quite short.
>
> Kevin
>
There are a lot of things to make a change. I'll definitely
have to address this later. Working on that.

For now the patch is working in downstream and it seems OK.

Thank you in advance,
     Den


  reply	other threads:[~2024-08-05 12:14 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-07-16 14:41 [PATCH v3 0/3] Fix data corruption within preallocation Andrey Drobyshev
2024-07-16 14:41 ` [PATCH v3 1/3] block: zero data data corruption using prealloc-filter Andrey Drobyshev
2024-07-18 15:51   ` Kevin Wolf
2024-07-18 15:52     ` Denis V. Lunev
2024-07-18 19:46     ` Denis V. Lunev
2024-08-05 11:59       ` Kevin Wolf
2024-08-05 12:13         ` Denis V. Lunev [this message]
2024-07-16 14:41 ` [PATCH v3 2/3] iotests/298: add testcase for async writes with preallocation filter Andrey Drobyshev
2024-08-05 12:04   ` Kevin Wolf
2024-08-05 12:56     ` Andrey Drobyshev
2024-08-05 13:50       ` Kevin Wolf
2024-07-16 14:41 ` [PATCH v3 3/3] scripts: add filev2p.py script for mapping virtual file offsets mapping Andrey Drobyshev
2024-08-05 12:29   ` Kevin Wolf
2024-08-05 13:02     ` Andrey Drobyshev

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=19c69ed8-7788-474f-93fc-8a4568d2c32f@virtuozzo.com \
    --to=den@virtuozzo.com \
    --cc=andrey.drobyshev@virtuozzo.com \
    --cc=eesposit@redhat.com \
    --cc=hreitz@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=vsementsov@yandex-team.ru \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).