Re: [Qemu-devel] [PATCH for 2.7 v3 1/1] qcow2: improve qcow2_co_write_zeroes()

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Kevin Wolf <kwolf@redhat.com>
To: "Denis V. Lunev" <den@openvz.org>
Cc: qemu-devel@nongnu.org, Max Reitz <mreitz@redhat.com>
Subject: Re: [Qemu-devel] [PATCH for 2.7 v3 1/1] qcow2: improve qcow2_co_write_zeroes()
Date: Wed, 11 May 2016 13:28:30 +0200	[thread overview]
Message-ID: <20160511112830.GB4524@noname.str.redhat.com> (raw)
In-Reply-To: <1462950014-1821-1-git-send-email-den@openvz.org>

Am 11.05.2016 um 09:00 hat Denis V. Lunev geschrieben:
> There is a possibility that qcow2_co_write_zeroes() will be called
> with the partial block. This could be synthetically triggered with
>     qemu-io -c "write -z 32k 4k"
> and can happen in the real life in qemu-nbd. The latter happens under
> the following conditions:
>     (1) qemu-nbd is started with --detect-zeroes=on and is connected to the
>         kernel NBD client
>     (2) third party program opens kernel NBD device with O_DIRECT
>     (3) third party program performs write operation with memory buffer
>         not aligned to the page
> In this case qcow2_co_write_zeroes() is unable to perform the operation
> and mark entire cluster as zeroed and returns ENOTSUP. Thus the caller
> switches to non-optimized version and writes real zeroes to the disk.
> 
> The patch creates a shortcut. If the block is read as zeroes, f.e. if
> it is unallocated, the request is extended to cover full block.
> User-visible situation with this block is not changed. Before the patch
> the block is filled in the image with real zeroes. After that patch the
> block is marked as zeroed in metadata. Thus any subsequent changes in
> backing store chain are not affected.
> 
> Kevin, thank you for a cool suggestion.
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> ---
> Changes from v2:
> - checked head/tail clusters separately (one can be zeroed, one unallocated)
> - fixed range calculations
> - fixed race when the block can become used just after the check
> - fixed zero cluster detection
> - minor tweaks in the description
> 
> Changes from v1:
> - description rewritten completely
> - new approach suggested by Kevin is implemented
> 
>  block/qcow2.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 59 insertions(+), 6 deletions(-)
> 
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 470734b..c2474c1 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -2411,21 +2411,74 @@ finish:
>      return ret;
>  }
>  
> +
> +static bool is_zero_cluster(BlockDriverState *bs, int64_t start)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    int nr;
> +    BlockDriverState *file;
> +    int64_t res = bdrv_get_block_status_above(bs, NULL, start,
> +                                              s->cluster_sectors, &nr, &file);
> +    return res >= 0 && ((res & BDRV_BLOCK_ZERO) || !(res & BDRV_BLOCK_DATA));

Why did you add the !(res & BDRV_BLOCK_DATA) condition? This means that
all unallocated clusters return true, even if the backing file contains
non-zero data for them.

> +}
> +
> +static bool is_zero_cluster_top_locked(BlockDriverState *bs, int64_t start)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    int nr = s->cluster_sectors;
> +    uint64_t off;
> +    int ret;
> +
> +    ret = qcow2_get_cluster_offset(bs, start << BDRV_SECTOR_BITS, &nr, &off);
> +    return ret == QCOW2_CLUSTER_UNALLOCATED || ret == QCOW2_CLUSTER_ZERO;
> +}

This part is new, but it also returns true for clusters that are
allocated in the backing file.

>  static coroutine_fn int qcow2_co_write_zeroes(BlockDriverState *bs,
>      int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
>  {
>      int ret;
>      BDRVQcow2State *s = bs->opaque;
>  
> -    /* Emulate misaligned zero writes */
> -    if (sector_num % s->cluster_sectors || nb_sectors % s->cluster_sectors) {
> -        return -ENOTSUP;
> +    int head = sector_num % s->cluster_sectors;
> +    int tail = (sector_num + nb_sectors) % s->cluster_sectors;
> +
> +    if (head != 0 || tail != 0) {
> +        int64_t cl_end = -1;
> +
> +        sector_num -= head;
> +        nb_sectors += head;
> +
> +        if (tail != 0) {
> +            nb_sectors += s->cluster_sectors - tail;
> +        }
> +
> +        if (!is_zero_cluster(bs, sector_num)) {
> +            return -ENOTSUP;
> +        }
> +
> +        if (nb_sectors > s->cluster_sectors) {
> +            /* Technically the request can cover 2 clusters, f.e. 4k write
> +               at s->cluster_sectors - 2k offset. One of these cluster can
> +               be zeroed, one unallocated */

This cannot happen. bdrv_co_write_zeroes() splits requests not only
based on the length, but so that they are actually aligned at cluster
boundaries.

> +            cl_end = sector_num + nb_sectors - s->cluster_sectors;
> +            if (!is_zero_cluster(bs, cl_end)) {
> +                return -ENOTSUP;
> +            }
> +        }
> +
> +        qemu_co_mutex_lock(&s->lock);
> +        /* We can have new write after previous check */
> +        if (!is_zero_cluster_top_locked(bs, sector_num) ||
> +                (cl_end > 0 && !is_zero_cluster_top_locked(bs, cl_end))) {
> +            qemu_co_mutex_unlock(&s->lock);
> +            return -ENOTSUP;
> +        }

Just lock the mutex before the check, the possible optimisation for the
emulation case (which is slow anyway) isn't worth the additional code
complexity.

> +    } else {
> +        qemu_co_mutex_lock(&s->lock);
>      }
>  
>      /* Whatever is left can use real zero clusters */
> -    qemu_co_mutex_lock(&s->lock);
> -    ret = qcow2_zero_clusters(bs, sector_num << BDRV_SECTOR_BITS,
> -        nb_sectors);
> +    ret = qcow2_zero_clusters(bs, sector_num << BDRV_SECTOR_BITS, nb_sectors);
>      qemu_co_mutex_unlock(&s->lock);

In the end, I think v2 with just the bugs fixed (those that I mentioned
in the v2 review, plus the locking problem that you found) would be
better than this version.

Kevin

next prev parent reply	other threads:[~2016-05-11 11:28 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-05-11  7:00 [Qemu-devel] [PATCH for 2.7 v3 1/1] qcow2: improve qcow2_co_write_zeroes() Denis V. Lunev
2016-05-11 11:28 ` Kevin Wolf [this message]
2016-05-12  9:00   ` Denis V. Lunev
2016-05-12 10:37     ` Kevin Wolf
2016-05-13 16:09       ` Denis V. Lunev
2016-05-13 16:24         ` Kevin Wolf
2016-05-13 16:37           ` Denis V. Lunev

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160511112830.GB4524@noname.str.redhat.com \
    --to=kwolf@redhat.com \
    --cc=den@openvz.org \
    --cc=mreitz@redhat.com \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).