qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Kevin Wolf <kwolf@redhat.com>
To: Eric Blake <eblake@redhat.com>
Cc: qemu-devel@nongnu.org, jsnow@redhat.com,
	vsementsov@virtuozzo.com, qemu-block@nongnu.org,
	Stefan Hajnoczi <stefanha@redhat.com>, Fam Zheng <fam@euphon.net>,
	Max Reitz <mreitz@redhat.com>
Subject: Re: [Qemu-devel] [PATCH 5/7] block: Fix BDRV_BLOCK_RAW status to honor alignment
Date: Wed, 3 Apr 2019 15:03:54 +0200	[thread overview]
Message-ID: <20190403130354.GA6108@localhost.localdomain> (raw)
In-Reply-To: <20190403030526.12258-6-eblake@redhat.com>

Am 03.04.2019 um 05:05 hat Eric Blake geschrieben:
> Previous patches mentioned how the blkdebug filter driver demonstrates
> a bug present in our NBD server; the bug is also present with the raw
> format driver when probing occurs. Basically, if we specify a
> particular alignment > 1, but defer the actual block status to the
> underlying file, and the underlying file has a smaller alignment, then
> the use of BDRV_BLOCK_RAW to defer to the underlying file can end up
> with status split at an alignment unacceptable to our layer.  Many
> callers don't care, but NBD does - it is a violation of the NBD
> protocol to send status or read results split on an unaligned boundary
> (we've taught our client to be tolerant of such violations because of
> qemu 3.1; but we still need to fix our server for the sake of other
> stricter clients).
> 
> This patch lays the groundwork - it adjusts bdrv_block_status to
> ensure that any time one layer defers to another via BDRV_BLOCK_RAW,
> the deferral is either truncated down to an aligned boundary, or
> multiple sub-aligned requests are coalesced into a single
> representative answer (using an implicit hole beyond EOF as
> needed). Iotest 241 exposes the effect (when format probing occurred,
> we don't want block status to subdivide the initial sector, and thus
> not any other sector either). Similarly, iotest 221 is a good
> candidate to amend to specifically track the effects; a change to a
> hole at EOF is not visible if an upper layer does not want alignment
> smaller than 512. However, this patch alone is not a complete fix - it
> is still possible to have an active layer with large alignment
> constraints backed by another layer with smaller constraints; so the
> next patch will complete the task.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  block/io.c                 | 108 +++++++++++++++++++++++++++++++++++--
>  tests/qemu-iotests/221     |  10 ++++
>  tests/qemu-iotests/221.out |   6 +++
>  tests/qemu-iotests/241.out |   3 +-
>  4 files changed, 122 insertions(+), 5 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index dfc153b8d82..936877d3745 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -2021,6 +2021,97 @@ int coroutine_fn bdrv_co_block_status_from_backing(BlockDriverState *bs,
>      return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID;
>  }
> 
> +static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
> +                                             bool want_zero,
> +                                             int64_t offset, int64_t bytes,
> +                                             int64_t *pnum, int64_t *map,
> +                                             BlockDriverState **file);
> +
> +/*
> + * Returns an aligned allocation status of the specified sectors.
> + * Wrapper around bdrv_co_block_status() which requires the initial
> + * offset and count to be aligned, and guarantees the result will also
> + * be aligned (even if it requires piecing together multiple
> + * sub-aligned queries into an appropriate coalesced answer).
> + */

I think this comment should specify the result of the operation when the
aligned region consists of multiple subregions with different status.
Probably something like this:

- BDRV_BLOCK_DATA is set if the flag is set for at least one subregion
- BDRV_BLOCK_ZERO is set if the flag is set for all subregions
- BDRV_BLOCK_OFFSET_VALID is set if the flag is set for all subregions,
  the provided offsets are contiguous and file is the same for all
  subregions.
- BDRV_BLOCK_ALLOCATED is never set here (the caller sets it)
- BDRV_BLOCK_EOF is set if the last subregion sets it; assert that it's
  not set for any other subregion
- BDRV_BLOCK_RAW is never set

- *map is only set if BDRV_BLOCK_OFFSET_VALID is set. It contains
  the offset of the first subregion then.

- *file is also only set if BDRV_BLOCK_OFFSET_VALID is set. It contains
  the *file of the subregions, which must be the same for all of them
  (otherwise, we wouldn't have set BDRV_BLOCK_OFFSET_VALID).

- *pnum: The sum of all subregions

> +static int coroutine_fn bdrv_co_block_status_aligned(BlockDriverState *bs,
> +                                                     uint32_t align,
> +                                                     bool want_zero,
> +                                                     int64_t offset,
> +                                                     int64_t bytes,
> +                                                     int64_t *pnum,
> +                                                     int64_t *map,
> +                                                     BlockDriverState **file)
> +{
> +    int ret;
> +
> +    assert(is_power_of_2(align) && QEMU_IS_ALIGNED(offset | bytes, align));
> +    ret = bdrv_co_block_status(bs, want_zero, offset, bytes, pnum, map, file);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +    if (!*pnum) {
> +        assert(!bytes || ret & BDRV_BLOCK_EOF);
> +        return ret;
> +    }
> +    if (*pnum >= align) {
> +        if (!QEMU_IS_ALIGNED(*pnum, align)) {
> +            ret &= ~BDRV_BLOCK_EOF;
> +            *pnum = QEMU_ALIGN_DOWN(*pnum, align);
> +        }
> +        return ret;
> +    }

So we decide to just shorten the function if we can return at least some
*pnum > 0. This means that is most cases, this function will have to be
called twice, once for the aligned part, and again for the misaligned
rest.

We do save a little if the caller isn't actually interested in mapping
the whole image, but just in a specific range before the misaligned
part.

So it depends on the use case whether this is an optimisation or a
pessimisation.

> +    /* Merge in status for the rest of align */
> +    while (*pnum < align) {

Okay, in any case, I can see it's a simplification. :-)

> +        int ret2;
> +        int64_t pnum2;
> +        int64_t map2;
> +        BlockDriverState *file2;
> +
> +        ret2 = bdrv_co_block_status(bs, want_zero, offset + *pnum,
> +                                    align - *pnum, &pnum2,
> +                                    map ? &map2 : NULL, file ? &file2 : NULL);
> +        if (ret2 < 0) {
> +            return ret2;
> +        }
> +        if (ret2 & BDRV_BLOCK_EOF) {
> +            ret |= BDRV_BLOCK_EOF;
> +            if (!pnum2) {
> +                pnum2 = align - *pnum;
> +                ret2 |= BDRV_BLOCK_ZERO;
> +            }
> +        } else {
> +            assert(pnum2);
> +        }
> +        if (ret2 & BDRV_BLOCK_DATA) {
> +            ret |= BDRV_BLOCK_DATA;
> +        }
> +        if (!(ret2 & BDRV_BLOCK_ZERO)) {
> +            ret &= ~BDRV_BLOCK_ZERO;
> +        }
> +        if ((ret & BDRV_BLOCK_OFFSET_VALID) &&
> +            (!(ret2 & BDRV_BLOCK_OFFSET_VALID) ||
> +             (map && *map + *pnum != map2) || (file && *file != file2))) {
> +            ret &= ~BDRV_BLOCK_OFFSET_VALID;
> +            if (map) {
> +                *map = 0;
> +            }
> +            if (file) {
> +                *file = NULL;
> +            }
> +        }
> +        if (ret2 & BDRV_BLOCK_ALLOCATED) {
> +            ret |= BDRV_BLOCK_ALLOCATED;
> +        }

Hm, so if we have a region that consists of two subregion, one is
unallocated and the other one is a zero cluster. The result will be
DATA == false, ZERO == false, ALLOCATED == true. This looks a bit odd.

Did you check this case and come to the conclusion that it's okay? If
so, I think a comment would be good.

> +        if (ret2 & BDRV_BLOCK_RAW) {
> +            assert(ret & BDRV_BLOCK_RAW);
> +            assert(ret & BDRV_BLOCK_OFFSET_VALID);
> +        }

Doesn't RAW mean that we need to recurse rather than returning the flag?
Or actually, bdrv_co_block_status() shouldn't ever return the flag, so
can't we assert that it's clear?

> +        *pnum += pnum2;
> +    }
> +    return ret;
> +}

Kevin

  reply	other threads:[~2019-04-03 13:04 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-04-03  3:05 [Qemu-devel] [PATCH for-4.0? 0/7] Final round of NBD alignment fixes Eric Blake
2019-04-03  3:05 ` [Qemu-devel] [PATCH 1/7] nbd/server: Fix blockstatus trace Eric Blake
2019-04-05 14:13   ` Vladimir Sementsov-Ogievskiy
2019-04-05 14:13     ` Vladimir Sementsov-Ogievskiy
2019-04-03  3:05 ` [Qemu-devel] [PATCH 2/7] nbd/server: Trace server noncompliance on unaligned requests Eric Blake
2019-04-05 14:39   ` Vladimir Sementsov-Ogievskiy
2019-04-05 14:39     ` Vladimir Sementsov-Ogievskiy
2019-04-05 20:04     ` Eric Blake
2019-04-05 20:04       ` Eric Blake
2019-04-08 12:14       ` Vladimir Sementsov-Ogievskiy
2019-04-08 12:14         ` Vladimir Sementsov-Ogievskiy
2019-04-08 14:32         ` Eric Blake
2019-04-08 14:32           ` Eric Blake
2019-04-03  3:05 ` [Qemu-devel] [PATCH 3/7] nbd/server: Don't fail NBD_OPT_INFO for byte-aligned sources Eric Blake
2019-04-05 15:34   ` Vladimir Sementsov-Ogievskiy
2019-04-05 15:34     ` Vladimir Sementsov-Ogievskiy
2019-04-03  3:05 ` [Qemu-devel] [PATCH 4/7] iotests: Update 241 to expose backing layer fragmentation Eric Blake
2019-04-08 13:51   ` Vladimir Sementsov-Ogievskiy
2019-04-08 13:51     ` Vladimir Sementsov-Ogievskiy
2019-04-10 20:44     ` Eric Blake
2019-04-10 20:44       ` Eric Blake
2019-04-03  3:05 ` [Qemu-devel] [PATCH 5/7] block: Fix BDRV_BLOCK_RAW status to honor alignment Eric Blake
2019-04-03 13:03   ` Kevin Wolf [this message]
2019-04-03 14:02     ` Eric Blake
2019-04-03  3:05 ` [Qemu-devel] [PATCH 6/7] nbd/server: Avoid unaligned read/block_status from backing Eric Blake
2019-04-03  3:05 ` [Qemu-devel] [PATCH for-4.1 7/7] nbd/server: Avoid unaligned dirty-bitmap status Eric Blake
2019-04-04 14:52 ` [Qemu-devel] [PATCH for-4.0? 8/7] nbd/client: Fix error message for server with unusable sizing Eric Blake
2019-04-04 15:22   ` [Qemu-devel] [Qemu-block] " Kevin Wolf

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190403130354.GA6108@localhost.localdomain \
    --to=kwolf@redhat.com \
    --cc=eblake@redhat.com \
    --cc=fam@euphon.net \
    --cc=jsnow@redhat.com \
    --cc=mreitz@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@redhat.com \
    --cc=vsementsov@virtuozzo.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).