Re: [Qemu-devel] [PATCH for 2.7 v3 1/1] qcow2: improve qcow2_co_write_zeroes()

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Denis V. Lunev" <den@openvz.org>
To: Kevin Wolf <kwolf@redhat.com>
Cc: qemu-devel@nongnu.org, Max Reitz <mreitz@redhat.com>
Subject: Re: [Qemu-devel] [PATCH for 2.7 v3 1/1] qcow2: improve qcow2_co_write_zeroes()
Date: Fri, 13 May 2016 19:09:39 +0300	[thread overview]
Message-ID: <5735FC43.9000700@openvz.org> (raw)
In-Reply-To: <20160512103730.GD4794@noname.redhat.com>

On 05/12/2016 01:37 PM, Kevin Wolf wrote:
> Am 12.05.2016 um 11:00 hat Denis V. Lunev geschrieben:
>> On 05/11/2016 02:28 PM, Kevin Wolf wrote:
>>> Am 11.05.2016 um 09:00 hat Denis V. Lunev geschrieben:
>>>> There is a possibility that qcow2_co_write_zeroes() will be called
>>>> with the partial block. This could be synthetically triggered with
>>>>      qemu-io -c "write -z 32k 4k"
>>>> and can happen in the real life in qemu-nbd. The latter happens under
>>>> the following conditions:
>>>>      (1) qemu-nbd is started with --detect-zeroes=on and is connected to the
>>>>          kernel NBD client
>>>>      (2) third party program opens kernel NBD device with O_DIRECT
>>>>      (3) third party program performs write operation with memory buffer
>>>>          not aligned to the page
>>>> In this case qcow2_co_write_zeroes() is unable to perform the operation
>>>> and mark entire cluster as zeroed and returns ENOTSUP. Thus the caller
>>>> switches to non-optimized version and writes real zeroes to the disk.
>>>>
>>>> The patch creates a shortcut. If the block is read as zeroes, f.e. if
>>>> it is unallocated, the request is extended to cover full block.
>>>> User-visible situation with this block is not changed. Before the patch
>>>> the block is filled in the image with real zeroes. After that patch the
>>>> block is marked as zeroed in metadata. Thus any subsequent changes in
>>>> backing store chain are not affected.
>>>>
>>>> Kevin, thank you for a cool suggestion.
>>>>
>>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>>> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
>>>> CC: Kevin Wolf <kwolf@redhat.com>
>>>> CC: Max Reitz <mreitz@redhat.com>
>>>> ---
>>>> Changes from v2:
>>>> - checked head/tail clusters separately (one can be zeroed, one unallocated)
>>>> - fixed range calculations
>>>> - fixed race when the block can become used just after the check
>>>> - fixed zero cluster detection
>>>> - minor tweaks in the description
>>>>
>>>> Changes from v1:
>>>> - description rewritten completely
>>>> - new approach suggested by Kevin is implemented
>>>>
>>>>   block/qcow2.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
>>>>   1 file changed, 59 insertions(+), 6 deletions(-)
oops, the patch gets committed... that is unexpected but great ;)



>>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>>> index 470734b..c2474c1 100644
>>>> --- a/block/qcow2.c
>>>> +++ b/block/qcow2.c
>>>> @@ -2411,21 +2411,74 @@ finish:
>>>>       return ret;
>>>>   }
>>>> +
>>>> +static bool is_zero_cluster(BlockDriverState *bs, int64_t start)
>>>> +{
>>>> +    BDRVQcow2State *s = bs->opaque;
>>>> +    int nr;
>>>> +    BlockDriverState *file;
>>>> +    int64_t res = bdrv_get_block_status_above(bs, NULL, start,
>>>> +                                              s->cluster_sectors, &nr, &file);
>>>> +    return res >= 0 && ((res & BDRV_BLOCK_ZERO) || !(res & BDRV_BLOCK_DATA));
>>> Why did you add the !(res & BDRV_BLOCK_DATA) condition? This means that
>>> all unallocated clusters return true, even if the backing file contains
>>> non-zero data for them.
>> this is correct. From my POW this means that this area is unallocated
>> in the entire backing chain and thus it will be read as zeroes. Thus
>> we could cover it with zeroes.
> You're right that I made a mistake, I was thinking of the non-recursive
> bdrv_get_block_status().
>
> However, I still think that we may not assume that !BDRV_BLOCK_DATA
> means zero data, even though that affects only more obscure cases. We
> have bdrv_unallocated_blocks_are_zero() to check whether the assumption
> is true. However, bdrv_co_get_block_status() already checks this
> internally and sets BDRV_BLOCK_ZERO in this case, so just checking
> BDRV_BLOCK_ZERO in qcow2 should be good.
>
> Did you find a case where you got !DATA, but not ZERO, and assuming
> zeroes was valid? If so, we may need to fix bdrv_co_get_block_status().
actually we may have the following case (artificial)!:
- assuming we do not have bdrv_has_zero_init in backing store
- and qcow2 on top of this file
- reading from unallocated block should return 0 (no data in both 
places), qcow2
   layer will return 0

It looks like we will have this situation.

[skipped]

> Hm, I see:
>
>      if (bs->bl.write_zeroes_alignment
>          && num > bs->bl.write_zeroes_alignment) {
>
> Removing the second part should fix this, i.e. it would split a request
> into two unaligned halves even if there is no aligned "bulk" in the
> middle.
>
> I think it would match my expectations better, but maybe that's just me.
> What do you think?
actually the code here will not be significantly better (I presume),
but I'll make a try



>>>> +            cl_end = sector_num + nb_sectors - s->cluster_sectors;
>>>> +            if (!is_zero_cluster(bs, cl_end)) {
>>>> +                return -ENOTSUP;
>>>> +            }
>>>> +        }
>>>> +
>>>> +        qemu_co_mutex_lock(&s->lock);
>>>> +        /* We can have new write after previous check */
>>>> +        if (!is_zero_cluster_top_locked(bs, sector_num) ||
>>>> +                (cl_end > 0 && !is_zero_cluster_top_locked(bs, cl_end))) {
>>>> +            qemu_co_mutex_unlock(&s->lock);
>>>> +            return -ENOTSUP;
>>>> +        }
>>> Just lock the mutex before the check, the possible optimisation for the
>>> emulation case (which is slow anyway) isn't worth the additional code
>>> complexity.
>> bdrv_get_block_status_above(bs) takes s->lock inside. This lock is not
>> recursive thus the code will hang. This is the problem trying to be
>> addressed with this split of checks.
>>
>> May be we could make the lock recursive...
> Maybe your version is no far from the best we can do then. It deserves a
> comment, though, because it's not completely obvious.
>
> The other option that we have and that looks reasonable enough to me is
> checking is_zero_cluster_top_locked() first and only if that returns
> false, we check the block status of the backing chain, starting at
> bs->backing->bs. This way we would bypass the recursive call and could
> take the lock from the beginning. If we go that way, it deserves a
> comment as well.
>
> Kevin
OK. I'll send at least improved comments and (may be)
removal of "&& num > bs->bl.write_zeroes_alignment"
as follow up.

thank you for ideas ;)

Den

next prev parent reply	other threads:[~2016-05-13 16:25 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-05-11  7:00 [Qemu-devel] [PATCH for 2.7 v3 1/1] qcow2: improve qcow2_co_write_zeroes() Denis V. Lunev
2016-05-11 11:28 ` Kevin Wolf
2016-05-12  9:00   ` Denis V. Lunev
2016-05-12 10:37     ` Kevin Wolf
2016-05-13 16:09       ` Denis V. Lunev [this message]
2016-05-13 16:24         ` Kevin Wolf
2016-05-13 16:37           ` Denis V. Lunev

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5735FC43.9000700@openvz.org \
    --to=den@openvz.org \
    --cc=kwolf@redhat.com \
    --cc=mreitz@redhat.com \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.