From: Kevin Wolf <kwolf@redhat.com>
To: Eric Blake <eblake@redhat.com>
Cc: qemu-devel@nongnu.org, jsnow@redhat.com,
vsementsov@virtuozzo.com, qemu-block@nongnu.org,
Stefan Hajnoczi <stefanha@redhat.com>, Fam Zheng <fam@euphon.net>,
Max Reitz <mreitz@redhat.com>
Subject: Re: [Qemu-devel] [PATCH 5/7] block: Fix BDRV_BLOCK_RAW status to honor alignment
Date: Wed, 3 Apr 2019 15:03:54 +0200 [thread overview]
Message-ID: <20190403130354.GA6108@localhost.localdomain> (raw)
In-Reply-To: <20190403030526.12258-6-eblake@redhat.com>
Am 03.04.2019 um 05:05 hat Eric Blake geschrieben:
> Previous patches mentioned how the blkdebug filter driver demonstrates
> a bug present in our NBD server; the bug is also present with the raw
> format driver when probing occurs. Basically, if we specify a
> particular alignment > 1, but defer the actual block status to the
> underlying file, and the underlying file has a smaller alignment, then
> the use of BDRV_BLOCK_RAW to defer to the underlying file can end up
> with status split at an alignment unacceptable to our layer. Many
> callers don't care, but NBD does - it is a violation of the NBD
> protocol to send status or read results split on an unaligned boundary
> (we've taught our client to be tolerant of such violations because of
> qemu 3.1; but we still need to fix our server for the sake of other
> stricter clients).
>
> This patch lays the groundwork - it adjusts bdrv_block_status to
> ensure that any time one layer defers to another via BDRV_BLOCK_RAW,
> the deferral is either truncated down to an aligned boundary, or
> multiple sub-aligned requests are coalesced into a single
> representative answer (using an implicit hole beyond EOF as
> needed). Iotest 241 exposes the effect (when format probing occurred,
> we don't want block status to subdivide the initial sector, and thus
> not any other sector either). Similarly, iotest 221 is a good
> candidate to amend to specifically track the effects; a change to a
> hole at EOF is not visible if an upper layer does not want alignment
> smaller than 512. However, this patch alone is not a complete fix - it
> is still possible to have an active layer with large alignment
> constraints backed by another layer with smaller constraints; so the
> next patch will complete the task.
>
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
> block/io.c | 108 +++++++++++++++++++++++++++++++++++--
> tests/qemu-iotests/221 | 10 ++++
> tests/qemu-iotests/221.out | 6 +++
> tests/qemu-iotests/241.out | 3 +-
> 4 files changed, 122 insertions(+), 5 deletions(-)
>
> diff --git a/block/io.c b/block/io.c
> index dfc153b8d82..936877d3745 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -2021,6 +2021,97 @@ int coroutine_fn bdrv_co_block_status_from_backing(BlockDriverState *bs,
> return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID;
> }
>
> +static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
> + bool want_zero,
> + int64_t offset, int64_t bytes,
> + int64_t *pnum, int64_t *map,
> + BlockDriverState **file);
> +
> +/*
> + * Returns an aligned allocation status of the specified sectors.
> + * Wrapper around bdrv_co_block_status() which requires the initial
> + * offset and count to be aligned, and guarantees the result will also
> + * be aligned (even if it requires piecing together multiple
> + * sub-aligned queries into an appropriate coalesced answer).
> + */
I think this comment should specify the result of the operation when the
aligned region consists of multiple subregions with different status.
Probably something like this:
- BDRV_BLOCK_DATA is set if the flag is set for at least one subregion
- BDRV_BLOCK_ZERO is set if the flag is set for all subregions
- BDRV_BLOCK_OFFSET_VALID is set if the flag is set for all subregions,
the provided offsets are contiguous and file is the same for all
subregions.
- BDRV_BLOCK_ALLOCATED is never set here (the caller sets it)
- BDRV_BLOCK_EOF is set if the last subregion sets it; assert that it's
not set for any other subregion
- BDRV_BLOCK_RAW is never set
- *map is only set if BDRV_BLOCK_OFFSET_VALID is set. It contains
the offset of the first subregion then.
- *file is also only set if BDRV_BLOCK_OFFSET_VALID is set. It contains
the *file of the subregions, which must be the same for all of them
(otherwise, we wouldn't have set BDRV_BLOCK_OFFSET_VALID).
- *pnum: The sum of all subregions
> +static int coroutine_fn bdrv_co_block_status_aligned(BlockDriverState *bs,
> + uint32_t align,
> + bool want_zero,
> + int64_t offset,
> + int64_t bytes,
> + int64_t *pnum,
> + int64_t *map,
> + BlockDriverState **file)
> +{
> + int ret;
> +
> + assert(is_power_of_2(align) && QEMU_IS_ALIGNED(offset | bytes, align));
> + ret = bdrv_co_block_status(bs, want_zero, offset, bytes, pnum, map, file);
> + if (ret < 0) {
> + return ret;
> + }
> + if (!*pnum) {
> + assert(!bytes || ret & BDRV_BLOCK_EOF);
> + return ret;
> + }
> + if (*pnum >= align) {
> + if (!QEMU_IS_ALIGNED(*pnum, align)) {
> + ret &= ~BDRV_BLOCK_EOF;
> + *pnum = QEMU_ALIGN_DOWN(*pnum, align);
> + }
> + return ret;
> + }
So we decide to just shorten the function if we can return at least some
*pnum > 0. This means that is most cases, this function will have to be
called twice, once for the aligned part, and again for the misaligned
rest.
We do save a little if the caller isn't actually interested in mapping
the whole image, but just in a specific range before the misaligned
part.
So it depends on the use case whether this is an optimisation or a
pessimisation.
> + /* Merge in status for the rest of align */
> + while (*pnum < align) {
Okay, in any case, I can see it's a simplification. :-)
> + int ret2;
> + int64_t pnum2;
> + int64_t map2;
> + BlockDriverState *file2;
> +
> + ret2 = bdrv_co_block_status(bs, want_zero, offset + *pnum,
> + align - *pnum, &pnum2,
> + map ? &map2 : NULL, file ? &file2 : NULL);
> + if (ret2 < 0) {
> + return ret2;
> + }
> + if (ret2 & BDRV_BLOCK_EOF) {
> + ret |= BDRV_BLOCK_EOF;
> + if (!pnum2) {
> + pnum2 = align - *pnum;
> + ret2 |= BDRV_BLOCK_ZERO;
> + }
> + } else {
> + assert(pnum2);
> + }
> + if (ret2 & BDRV_BLOCK_DATA) {
> + ret |= BDRV_BLOCK_DATA;
> + }
> + if (!(ret2 & BDRV_BLOCK_ZERO)) {
> + ret &= ~BDRV_BLOCK_ZERO;
> + }
> + if ((ret & BDRV_BLOCK_OFFSET_VALID) &&
> + (!(ret2 & BDRV_BLOCK_OFFSET_VALID) ||
> + (map && *map + *pnum != map2) || (file && *file != file2))) {
> + ret &= ~BDRV_BLOCK_OFFSET_VALID;
> + if (map) {
> + *map = 0;
> + }
> + if (file) {
> + *file = NULL;
> + }
> + }
> + if (ret2 & BDRV_BLOCK_ALLOCATED) {
> + ret |= BDRV_BLOCK_ALLOCATED;
> + }
Hm, so if we have a region that consists of two subregion, one is
unallocated and the other one is a zero cluster. The result will be
DATA == false, ZERO == false, ALLOCATED == true. This looks a bit odd.
Did you check this case and come to the conclusion that it's okay? If
so, I think a comment would be good.
> + if (ret2 & BDRV_BLOCK_RAW) {
> + assert(ret & BDRV_BLOCK_RAW);
> + assert(ret & BDRV_BLOCK_OFFSET_VALID);
> + }
Doesn't RAW mean that we need to recurse rather than returning the flag?
Or actually, bdrv_co_block_status() shouldn't ever return the flag, so
can't we assert that it's clear?
> + *pnum += pnum2;
> + }
> + return ret;
> +}
Kevin
next prev parent reply other threads:[~2019-04-03 13:04 UTC|newest]
Thread overview: 28+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-04-03 3:05 [Qemu-devel] [PATCH for-4.0? 0/7] Final round of NBD alignment fixes Eric Blake
2019-04-03 3:05 ` [Qemu-devel] [PATCH 1/7] nbd/server: Fix blockstatus trace Eric Blake
2019-04-05 14:13 ` Vladimir Sementsov-Ogievskiy
2019-04-05 14:13 ` Vladimir Sementsov-Ogievskiy
2019-04-03 3:05 ` [Qemu-devel] [PATCH 2/7] nbd/server: Trace server noncompliance on unaligned requests Eric Blake
2019-04-05 14:39 ` Vladimir Sementsov-Ogievskiy
2019-04-05 14:39 ` Vladimir Sementsov-Ogievskiy
2019-04-05 20:04 ` Eric Blake
2019-04-05 20:04 ` Eric Blake
2019-04-08 12:14 ` Vladimir Sementsov-Ogievskiy
2019-04-08 12:14 ` Vladimir Sementsov-Ogievskiy
2019-04-08 14:32 ` Eric Blake
2019-04-08 14:32 ` Eric Blake
2019-04-03 3:05 ` [Qemu-devel] [PATCH 3/7] nbd/server: Don't fail NBD_OPT_INFO for byte-aligned sources Eric Blake
2019-04-05 15:34 ` Vladimir Sementsov-Ogievskiy
2019-04-05 15:34 ` Vladimir Sementsov-Ogievskiy
2019-04-03 3:05 ` [Qemu-devel] [PATCH 4/7] iotests: Update 241 to expose backing layer fragmentation Eric Blake
2019-04-08 13:51 ` Vladimir Sementsov-Ogievskiy
2019-04-08 13:51 ` Vladimir Sementsov-Ogievskiy
2019-04-10 20:44 ` Eric Blake
2019-04-10 20:44 ` Eric Blake
2019-04-03 3:05 ` [Qemu-devel] [PATCH 5/7] block: Fix BDRV_BLOCK_RAW status to honor alignment Eric Blake
2019-04-03 13:03 ` Kevin Wolf [this message]
2019-04-03 14:02 ` Eric Blake
2019-04-03 3:05 ` [Qemu-devel] [PATCH 6/7] nbd/server: Avoid unaligned read/block_status from backing Eric Blake
2019-04-03 3:05 ` [Qemu-devel] [PATCH for-4.1 7/7] nbd/server: Avoid unaligned dirty-bitmap status Eric Blake
2019-04-04 14:52 ` [Qemu-devel] [PATCH for-4.0? 8/7] nbd/client: Fix error message for server with unusable sizing Eric Blake
2019-04-04 15:22 ` [Qemu-devel] [Qemu-block] " Kevin Wolf
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190403130354.GA6108@localhost.localdomain \
--to=kwolf@redhat.com \
--cc=eblake@redhat.com \
--cc=fam@euphon.net \
--cc=jsnow@redhat.com \
--cc=mreitz@redhat.com \
--cc=qemu-block@nongnu.org \
--cc=qemu-devel@nongnu.org \
--cc=stefanha@redhat.com \
--cc=vsementsov@virtuozzo.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).