* [RFC PATCH 0/2] New NBD metacontext @ 2021-06-09 18:01 Eric Blake 2021-06-09 18:01 ` [PATCH 1/2] iotests: Improve and rename test 309 to nbd-qemu-allocation Eric Blake ` (2 more replies) 0 siblings, 3 replies; 19+ messages in thread From: Eric Blake @ 2021-06-09 18:01 UTC (permalink / raw) To: qemu-devel; +Cc: nsoffer, vsementsov, qemu-block, libguestfs This is my counter-proposal to Nir's request [1] to revert a 6.0 behavior change. It does not expose any new information over NBD, but does make it easier to collect necessary information from a single context rather than requiring the client to have to request two contexts in parallel, then cross-correlate what may be different extent lengths between those contexts. Furthermore, this is easy to backport to downstream based on qemu 6.0, at which point clients could use the existence or absence of qemu:joint-allocation as a witness of whether it can get away with trusting base:allocation when trying to recreate a qcow2 backing chain. [1] https://lists.gnu.org/archive/html/qemu-devel/2021-06/msg01796.html Things I still want to do: - a followup patch to libnbd to teach 'nbdinfo --map=qemu:joint-allocation' to decode the bits - teach 'nbdinfo --map' to read all available contexts, instead of having to manually type each map besides base:allocation - potential followup patches to qemu to automatically feed this information through qemu-img map: - add a new BDRV_BLOCK_BACKING bit for bdrv_block_status(), with opposite semantics from BDRV_BLOCK_ALLOCATED, but where the only thing known is that the data is not local (not how deep it is) - teach qemu to favor qemu:joint-allocation over base:allocation when available, and use it to drive BDRV_BLOCK_BACKING - teach qemu-img map to recognize BDRV_BLOCK_BACKING Eric Blake (2): iotests: Improve and rename test 309 to nbd-qemu-allocation nbd: Add new qemu:joint-allocation metadata context docs/interop/nbd.txt | 31 ++++++- docs/tools/qemu-nbd.rst | 4 +- qapi/block-export.json | 4 +- include/block/nbd.h | 10 ++- nbd/server.c | 87 +++++++++++++++++-- .../{309 => tests/nbd-qemu-allocation} | 5 +- .../nbd-qemu-allocation.out} | 13 ++- 7 files changed, 139 insertions(+), 15 deletions(-) rename tests/qemu-iotests/{309 => tests/nbd-qemu-allocation} (95%) rename tests/qemu-iotests/{309.out => tests/nbd-qemu-allocation.out} (79%) -- 2.31.1 ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 1/2] iotests: Improve and rename test 309 to nbd-qemu-allocation 2021-06-09 18:01 [RFC PATCH 0/2] New NBD metacontext Eric Blake @ 2021-06-09 18:01 ` Eric Blake 2021-06-10 12:14 ` Vladimir Sementsov-Ogievskiy 2021-06-09 18:01 ` [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context Eric Blake 2021-06-09 21:31 ` [RFC libnbd PATCH] info: Add support for new qemu:joint-allocation Eric Blake 2 siblings, 1 reply; 19+ messages in thread From: Eric Blake @ 2021-06-09 18:01 UTC (permalink / raw) To: qemu-devel Cc: Kevin Wolf, vsementsov, qemu-block, Max Reitz, nsoffer, libguestfs Enhance the test to inspect what qemu-nbd is advertising during handshake, and rename it now that we support useful iotest names. Signed-off-by: Eric Blake <eblake@redhat.com> --- .../qemu-iotests/{309 => tests/nbd-qemu-allocation} | 5 ++++- .../{309.out => tests/nbd-qemu-allocation.out} | 12 +++++++++++- 2 files changed, 15 insertions(+), 2 deletions(-) rename tests/qemu-iotests/{309 => tests/nbd-qemu-allocation} (95%) rename tests/qemu-iotests/{309.out => tests/nbd-qemu-allocation.out} (81%) diff --git a/tests/qemu-iotests/309 b/tests/qemu-iotests/tests/nbd-qemu-allocation similarity index 95% rename from tests/qemu-iotests/309 rename to tests/qemu-iotests/tests/nbd-qemu-allocation index b90b279994c9..4ee73db8033b 100755 --- a/tests/qemu-iotests/309 +++ b/tests/qemu-iotests/tests/nbd-qemu-allocation @@ -3,7 +3,7 @@ # # Test qemu-nbd -A # -# Copyright (C) 2018-2020 Red Hat, Inc. +# Copyright (C) 2018-2021 Red Hat, Inc. # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by @@ -32,6 +32,7 @@ _cleanup() trap "_cleanup; exit \$status" 0 1 2 3 15 # get standard environment, filters and checks +cd .. . ./common.rc . ./common.filter . ./common.nbd @@ -57,6 +58,8 @@ echo $QEMU_IMG map --output=json -f qcow2 "$TEST_IMG" IMG="driver=nbd,server.type=unix,server.path=$nbd_unix_socket" nbd_server_start_unix_socket -r -f qcow2 -A "$TEST_IMG" +# Inspect what the server is exposing +$QEMU_NBD --list -k $nbd_unix_socket # Normal -f raw NBD block status loses access to allocation information $QEMU_IMG map --output=json --image-opts \ "$IMG" | _filter_qemu_img_map diff --git a/tests/qemu-iotests/309.out b/tests/qemu-iotests/tests/nbd-qemu-allocation.out similarity index 81% rename from tests/qemu-iotests/309.out rename to tests/qemu-iotests/tests/nbd-qemu-allocation.out index db75bb6b0df9..c51022b2a38d 100644 --- a/tests/qemu-iotests/309.out +++ b/tests/qemu-iotests/tests/nbd-qemu-allocation.out @@ -1,4 +1,4 @@ -QA output created by 309 +QA output created by nbd-qemu-allocation === Initial image setup === @@ -14,6 +14,16 @@ wrote 2097152/2097152 bytes at offset 1048576 [{ "start": 0, "length": 1048576, "depth": 1, "zero": false, "data": true, "offset": 327680}, { "start": 1048576, "length": 2097152, "depth": 0, "zero": false, "data": true, "offset": 327680}, { "start": 3145728, "length": 1048576, "depth": 1, "zero": true, "data": false}] +exports available: 1 + export: '' + size: 4194304 + flags: 0x58f ( readonly flush fua df multi cache ) + min block: 1 + opt block: 4096 + max block: 33554432 + available meta contexts: 2 + base:allocation + qemu:allocation-depth [{ "start": 0, "length": 3145728, "depth": 0, "zero": false, "data": true, "offset": OFFSET}, { "start": 3145728, "length": 1048576, "depth": 0, "zero": true, "data": false, "offset": OFFSET}] [{ "start": 0, "length": 1048576, "depth": 0, "zero": true, "data": true, "offset": OFFSET}, -- 2.31.1 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 1/2] iotests: Improve and rename test 309 to nbd-qemu-allocation 2021-06-09 18:01 ` [PATCH 1/2] iotests: Improve and rename test 309 to nbd-qemu-allocation Eric Blake @ 2021-06-10 12:14 ` Vladimir Sementsov-Ogievskiy 0 siblings, 0 replies; 19+ messages in thread From: Vladimir Sementsov-Ogievskiy @ 2021-06-10 12:14 UTC (permalink / raw) To: Eric Blake, qemu-devel Cc: libguestfs, qemu-block, nsoffer, Kevin Wolf, Max Reitz 09.06.2021 21:01, Eric Blake wrote: > Enhance the test to inspect what qemu-nbd is advertising during > handshake, and rename it now that we support useful iotest names. > > Signed-off-by: Eric Blake<eblake@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> -- Best regards, Vladimir ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context 2021-06-09 18:01 [RFC PATCH 0/2] New NBD metacontext Eric Blake 2021-06-09 18:01 ` [PATCH 1/2] iotests: Improve and rename test 309 to nbd-qemu-allocation Eric Blake @ 2021-06-09 18:01 ` Eric Blake 2021-06-09 23:52 ` Nir Soffer 2021-06-11 23:39 ` Nir Soffer 2021-06-09 21:31 ` [RFC libnbd PATCH] info: Add support for new qemu:joint-allocation Eric Blake 2 siblings, 2 replies; 19+ messages in thread From: Eric Blake @ 2021-06-09 18:01 UTC (permalink / raw) To: qemu-devel Cc: Kevin Wolf, vsementsov, qemu-block, Markus Armbruster, Max Reitz, nsoffer, libguestfs When trying to reconstruct a qcow2 chain using information provided over NBD, ovirt had been relying on an unsafe assumption that any portion of the qcow2 file advertised as sparse would defer to the backing image; this worked with what qemu 5.2 reports for a qcow2 BSD loaded with "backing":null. However, in 6.0, commit 0da9856851 (nbd: server: Report holes for raw images) also had a side-effect of reporting unallocated zero clusters in qcow2 files as sparse. This change is correct from the NBD spec perspective (advertising bits has always been optional based on how much information the server has available, and should only be used to optimize behavior when a bit is set, while not assuming semantics merely because a bit is clear), but means that a qcow2 file that uses an unallocated zero cluster to override a backing file now shows up as sparse over NBD, and causes ovirt to fail to reproduce that cluster (ie. ovirt was assuming it only had to write clusters where the bit was clear, and the 6.0 behavior change shows the flaw in that assumption). The correct fix is for ovirt to additionally use the qemu:allocation-depth metadata context added in 5.2: after all, the actual determination for what is needed to recreate a qcow2 file is not whether a cluster is sparse, but whether the allocation-depth shows the cluster to be local. But reproducing an image is more efficient when handling known-zero clusters, which means that ovirt has to track both base:allocation and qemu:allocation-depth metadata contexts simultaneously. While NBD_CMD_BLOCK_STATUS is just fine sending back information for two contexts in parallel, it comes with some bookkeeping overhead at the client side: the two contexts need not report the same length of replies, and it involves more network traffic. So, as a convenience, we can provide yet another metadata context, "qemu:joint-allocation", which provides the bulk of the same information already available from using "base:allocation" and "qemu:allocation-depth" in parallel; the only difference is that an allocation depth larger than one is collapsed to a single bit, rather than remaining an integer representing actual depth. By connecting to just this context, a client has less work to perform while still getting at all pieces of information needed to recreate a qcow2 backing chain. With regards to exposing this new feature from qemu as NBD server, it is sufficient to reuse the existing 'qemu-nbd -A': since that already exposes allocation depth, it does not hurt to advertise two separate qemu:XXX metadata contexts at once for two different views of allocation depth. And just because the server supports multiple contexts does not mean a client will want or need to connect to everything available. On the other hand, the existing hack of using the qemu NBD client option of x-dirty-bitmap to select an alternative context from the client does NOT make it possible to read the extra information exposed by the new metadata context. For now, you MUST use something like libnbd's 'nbdinfo --map=qemu:joint-allocation' in order to properly see all four bits in action: # Create a qcow2 image with a raw backing file: $ qemu-img create base.raw $((4*64*1024)) $ qemu-img create -f qcow2 -b base.raw -F raw top.qcow2 # Write to first 3 clusters of base: $ qemu-io -f raw -c "w -P 65 0 64k" -c "w -P 66 64k 64k" \ -c "w -P 67 128k 64k" base.raw # Write to second and third clusters of top, hiding base: $ qemu-io -f qcow2 -c "w -P 69 64k 64k" -c "w -z 128k 64k" top.qcow2 # Expose top.qcow2 without backing file over NBD $ ./qemu-nbd -r -t -f qcow2 -A 'json:{"driver":"qcow2", "backing":null, \ "file":{"driver":"file", "filename":"top.qcow2"}}' $ nbdinfo --map=qemu:joint-allocation nbd://localhost 0 65536 3 65536 65536 4 131072 65536 7 196608 65536 3 [This was output from nbdinfo 1.8.0; a later version will also add a column to decode the bits into human-readable strings] Additionally, later qemu patches may try to improve qemu-img to automatically take advantage of additional NBD context information, without having to use x-dirty-bitmap. Reported-by: Nir Soffer <nsoffer@redhat.com> Resolves: https://bugzilla.redhat.com/1968693 Signed-off-by: Eric Blake <eblake@redhat.com> --- docs/interop/nbd.txt | 31 ++++++- docs/tools/qemu-nbd.rst | 4 +- qapi/block-export.json | 4 +- include/block/nbd.h | 10 ++- nbd/server.c | 87 +++++++++++++++++-- .../tests/nbd-qemu-allocation.out | 3 +- 6 files changed, 125 insertions(+), 14 deletions(-) diff --git a/docs/interop/nbd.txt b/docs/interop/nbd.txt index 10ce098a29bf..cc8ce2d5389f 100644 --- a/docs/interop/nbd.txt +++ b/docs/interop/nbd.txt @@ -17,7 +17,7 @@ namespace "qemu". == "qemu" namespace == -The "qemu" namespace currently contains two available metadata context +The "qemu" namespace currently contains three available metadata context types. The first is related to exposing the contents of a dirty bitmap alongside the associated disk contents. That metadata context is named with the following form: @@ -39,8 +39,32 @@ depth of which layer in a thin-provisioned backing chain provided the data (0 for unallocated, 1 for the active layer, 2 for the first backing layer, and so forth). -For NBD_OPT_LIST_META_CONTEXT the following queries are supported -in addition to the specific "qemu:allocation-depth" and +The third is for convenience in querying the results of +base:allocation and qemu:allocation-depth in one go, under the +metadata context named + + qemu:joint-allocation + +In this context, bits 0 and 1 reflect the same information as +base:allocation, and bits 2 and 3 provide a summary of +qemu:allocation-depth (although this context is not able to convey +how deep in the backing chain data comes from). + + bit 0: NBD_STATE_HOLE, set when the extent is sparse + bit 1: NBD_STATE_ZERO, set when the extent reads as zero + bit 2: NBD_STATE_LOCAL, set when the extent contents come from the + local layer + bit 3: NBD_STATE_BACKING, set when the extent contents come from a + backing layer of unspecified depth + +This context does not provide any unique information, but does make it +easier to reconstruct backing chains without having to piece together +what might other be disparate-length returns of two separate metadata +contexts. + +For NBD_OPT_LIST_META_CONTEXT the following queries are supported in +addition to the specific "qemu:allocation-depth", +"qemu:joint-allocation", and "qemu:dirty-bitmap:<dirty-bitmap-export-name>": * "qemu:" - returns list of all available metadata contexts in the @@ -68,3 +92,4 @@ NBD_CMD_BLOCK_STATUS for "qemu:dirty-bitmap:", NBD_CMD_CACHE * 4.2: NBD_FLAG_CAN_MULTI_CONN for shareable read-only exports, NBD_CMD_FLAG_FAST_ZERO * 5.2: NBD_CMD_BLOCK_STATUS for "qemu:allocation-depth" +* 6.1: NBD_CMD_BLOCK_STATUS for "qemu:joint-allocation" diff --git a/docs/tools/qemu-nbd.rst b/docs/tools/qemu-nbd.rst index ee862fa0bc02..975ef5cedd14 100644 --- a/docs/tools/qemu-nbd.rst +++ b/docs/tools/qemu-nbd.rst @@ -75,8 +75,8 @@ driver options if ``--image-opts`` is specified. .. option:: -A, --allocation-depth Expose allocation depth information via the - ``qemu:allocation-depth`` metadata context accessible through - NBD_OPT_SET_META_CONTEXT. + ``qemu:allocation-depth`` and ``qemu:joint-allocation`` metadata + contexts accessible through NBD_OPT_SET_META_CONTEXT. .. option:: -B, --bitmap=NAME diff --git a/qapi/block-export.json b/qapi/block-export.json index e819e70cac0d..1bd315b7958d 100644 --- a/qapi/block-export.json +++ b/qapi/block-export.json @@ -92,8 +92,8 @@ # # @allocation-depth: Also export the allocation depth map for @device, so # the NBD client can use NBD_OPT_SET_META_CONTEXT with -# the metadata context name "qemu:allocation-depth" to -# inspect allocation details. (since 5.2) +# the metadata context name "qemu:allocation-depth" or +# "qemu:joint-allocation" to inspect allocation details. # # Since: 5.2 ## diff --git a/include/block/nbd.h b/include/block/nbd.h index 5f34d23bb037..7f411dc59e57 100644 --- a/include/block/nbd.h +++ b/include/block/nbd.h @@ -1,5 +1,5 @@ /* - * Copyright (C) 2016-2020 Red Hat, Inc. + * Copyright (C) 2016-2021 Red Hat, Inc. * Copyright (C) 2005 Anthony Liguori <anthony@codemonkey.ws> * * Network Block Device @@ -261,6 +261,14 @@ enum { /* No flags needed for qemu:allocation-depth in NBD_REPLY_TYPE_BLOCK_STATUS */ +/* + * Extent flags for qemu:joint-allocation in NBD_REPLY_TYPE_BLOCK_STATUS, + * reusing the two flags already in base:allocation, and compressing the + * results of qemu:allocation-depth into two bits + */ +#define NBD_STATE_LOCAL (1 << 2) +#define NBD_STATE_BACKING (1 << 3) + static inline bool nbd_reply_type_is_error(int type) { return type & (1 << 15); diff --git a/nbd/server.c b/nbd/server.c index b60ebc3ab6ac..8b825ccaf2ac 100644 --- a/nbd/server.c +++ b/nbd/server.c @@ -28,8 +28,9 @@ #define NBD_META_ID_BASE_ALLOCATION 0 #define NBD_META_ID_ALLOCATION_DEPTH 1 +#define NBD_META_ID_JOINT_ALLOCATION 2 /* Dirty bitmaps use 'NBD_META_ID_DIRTY_BITMAP + i', so keep this id last. */ -#define NBD_META_ID_DIRTY_BITMAP 2 +#define NBD_META_ID_DIRTY_BITMAP 3 /* * NBD_MAX_BLOCK_STATUS_EXTENTS: 1 MiB of extents data. An empirical @@ -97,6 +98,7 @@ struct NBDExport { Notifier eject_notifier; bool allocation_depth; + bool joint_allocation; BdrvDirtyBitmap **export_bitmaps; size_t nr_export_bitmaps; }; @@ -111,6 +113,7 @@ typedef struct NBDExportMetaContexts { size_t count; /* number of negotiated contexts */ bool base_allocation; /* export base:allocation context (block status) */ bool allocation_depth; /* export qemu:allocation-depth */ + bool joint_allocation; /* export qemu:joint-allocation */ bool *bitmaps; /* * export qemu:dirty-bitmap:<export bitmap name>, * sized by exp->nr_export_bitmaps @@ -862,9 +865,9 @@ static bool nbd_meta_base_query(NBDClient *client, NBDExportMetaContexts *meta, /* nbd_meta_qemu_query * - * Handle queries to 'qemu' namespace. For now, only the qemu:dirty-bitmap: - * and qemu:allocation-depth contexts are available. Return true if @query - * has been handled. + * Handle queries to 'qemu' namespace. For now, only the qemu:dirty-bitmap:, + * qemu:joint-allocation, and qemu:allocation-depth contexts are available. + * Return true if @query has been handled. */ static bool nbd_meta_qemu_query(NBDClient *client, NBDExportMetaContexts *meta, const char *query) @@ -891,6 +894,12 @@ static bool nbd_meta_qemu_query(NBDClient *client, NBDExportMetaContexts *meta, return true; } + if (strcmp(query, "joint-allocation") == 0) { + trace_nbd_negotiate_meta_query_parse("joint-allocation"); + meta->joint_allocation = meta->exp->joint_allocation; + return true; + } + if (nbd_strshift(&query, "dirty-bitmap:")) { trace_nbd_negotiate_meta_query_parse("dirty-bitmap:"); if (!*query) { @@ -1023,6 +1032,7 @@ static int nbd_negotiate_meta_queries(NBDClient *client, /* enable all known contexts */ meta->base_allocation = true; meta->allocation_depth = meta->exp->allocation_depth; + meta->joint_allocation = meta->exp->joint_allocation; memset(meta->bitmaps, 1, meta->exp->nr_export_bitmaps); } else { for (i = 0; i < nb_queries; ++i) { @@ -1053,6 +1063,16 @@ static int nbd_negotiate_meta_queries(NBDClient *client, count++; } + if (meta->joint_allocation) { + ret = nbd_negotiate_send_meta_context(client, "qemu:joint-allocation", + NBD_META_ID_JOINT_ALLOCATION, + errp); + if (ret < 0) { + return ret; + } + count++; + } + for (i = 0; i < meta->exp->nr_export_bitmaps; i++) { const char *bm_name; g_autofree char *context = NULL; @@ -1743,7 +1763,8 @@ static int nbd_export_create(BlockExport *blk_exp, BlockExportOptions *exp_args, bdrv_dirty_bitmap_set_busy(exp->export_bitmaps[i], true); } - exp->allocation_depth = arg->allocation_depth; + /* QMP allocation_depth flag controls export of both metadata contexts. */ + exp->allocation_depth = exp->joint_allocation = arg->allocation_depth; /* * We need to inhibit request queuing in the block layer to ensure we can @@ -2164,6 +2185,47 @@ static int blockalloc_to_extents(BlockDriverState *bs, uint64_t offset, return 0; } +static int blockjoint_to_extents(BlockDriverState *bs, uint64_t offset, + uint64_t bytes, NBDExtentArray *ea) +{ + while (bytes) { + uint32_t flags = 0; + int64_t num, num1, num2; + int ret1 = bdrv_block_status_above(bs, NULL, offset, bytes, &num1, + NULL, NULL); + int ret2 = bdrv_is_allocated_above(bs, NULL, false, offset, bytes, + &num2); + + if (ret1 < 0) { + return ret1; + } + if (ret2 < 0) { + return ret2; + } + num = MIN(num1, num2); + + if (!(ret1 & BDRV_BLOCK_DATA)) { + flags |= NBD_STATE_HOLE; + } + if (ret1 & BDRV_BLOCK_ZERO) { + flags |= NBD_STATE_ZERO; + } + if (ret2 == 1) { + flags |= NBD_STATE_LOCAL; + } else if (ret2 > 1) { + flags |= NBD_STATE_BACKING; + } + if (nbd_extent_array_add(ea, num, flags) < 0) { + return 0; + } + + offset += num; + bytes -= num; + } + + return 0; +} + /* * nbd_co_send_extents * @@ -2205,6 +2267,8 @@ static int nbd_co_send_block_status(NBDClient *client, uint64_t handle, if (context_id == NBD_META_ID_BASE_ALLOCATION) { ret = blockstatus_to_extents(bs, offset, length, ea); + } else if (context_id == NBD_META_ID_JOINT_ALLOCATION) { + ret = blockjoint_to_extents(bs, offset, length, ea); } else { ret = blockalloc_to_extents(bs, offset, length, ea); } @@ -2574,6 +2638,19 @@ static coroutine_fn int nbd_handle_request(NBDClient *client, } } + if (client->export_meta.joint_allocation) { + ret = nbd_co_send_block_status(client, request->handle, + blk_bs(exp->common.blk), + request->from, request->len, + dont_fragment, + !--contexts_remaining, + NBD_META_ID_JOINT_ALLOCATION, + errp); + if (ret < 0) { + return ret; + } + } + for (i = 0; i < client->exp->nr_export_bitmaps; i++) { if (!client->export_meta.bitmaps[i]) { continue; diff --git a/tests/qemu-iotests/tests/nbd-qemu-allocation.out b/tests/qemu-iotests/tests/nbd-qemu-allocation.out index c51022b2a38d..b9e469efc58e 100644 --- a/tests/qemu-iotests/tests/nbd-qemu-allocation.out +++ b/tests/qemu-iotests/tests/nbd-qemu-allocation.out @@ -21,9 +21,10 @@ exports available: 1 min block: 1 opt block: 4096 max block: 33554432 - available meta contexts: 2 + available meta contexts: 3 base:allocation qemu:allocation-depth + qemu:joint-allocation [{ "start": 0, "length": 3145728, "depth": 0, "zero": false, "data": true, "offset": OFFSET}, { "start": 3145728, "length": 1048576, "depth": 0, "zero": true, "data": false, "offset": OFFSET}] [{ "start": 0, "length": 1048576, "depth": 0, "zero": true, "data": true, "offset": OFFSET}, -- 2.31.1 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context 2021-06-09 18:01 ` [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context Eric Blake @ 2021-06-09 23:52 ` Nir Soffer 2021-06-10 12:30 ` Vladimir Sementsov-Ogievskiy ` (2 more replies) 2021-06-11 23:39 ` Nir Soffer 1 sibling, 3 replies; 19+ messages in thread From: Nir Soffer @ 2021-06-09 23:52 UTC (permalink / raw) To: Eric Blake Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-block, Markus Armbruster, QEMU Developers, Max Reitz, libguestfs On Wed, Jun 9, 2021 at 9:01 PM Eric Blake <eblake@redhat.com> wrote: > > When trying to reconstruct a qcow2 chain using information provided > over NBD, ovirt had been relying on an unsafe assumption that any > portion of the qcow2 file advertised as sparse would defer to the > backing image; this worked with what qemu 5.2 reports for a qcow2 BSD > loaded with "backing":null. However, in 6.0, commit 0da9856851 (nbd: > server: Report holes for raw images) also had a side-effect of > reporting unallocated zero clusters in qcow2 files as sparse. This > change is correct from the NBD spec perspective (advertising bits has > always been optional based on how much information the server has > available, and should only be used to optimize behavior when a bit is > set, while not assuming semantics merely because a bit is clear), but > means that a qcow2 file that uses an unallocated zero cluster to > override a backing file now shows up as sparse over NBD, and causes > ovirt to fail to reproduce that cluster (ie. ovirt was assuming it > only had to write clusters where the bit was clear, and the 6.0 > behavior change shows the flaw in that assumption). > > The correct fix is for ovirt to additionally use the > qemu:allocation-depth metadata context added in 5.2: after all, the > actual determination for what is needed to recreate a qcow2 file is > not whether a cluster is sparse, but whether the allocation-depth > shows the cluster to be local. But reproducing an image is more > efficient when handling known-zero clusters, which means that ovirt > has to track both base:allocation and qemu:allocation-depth metadata > contexts simultaneously. While NBD_CMD_BLOCK_STATUS is just fine > sending back information for two contexts in parallel, it comes with > some bookkeeping overhead at the client side: the two contexts need > not report the same length of replies, and it involves more network > traffic. > > So, as a convenience, we can provide yet another metadata context, > "qemu:joint-allocation", which provides the bulk of the same > information already available from using "base:allocation" and > "qemu:allocation-depth" in parallel; the only difference is that an > allocation depth larger than one is collapsed to a single bit, rather > than remaining an integer representing actual depth. By connecting to > just this context, a client has less work to perform while still > getting at all pieces of information needed to recreate a qcow2 > backing chain. Providing extended allocation is awsome, and makes client life much easier. But I'm not sure about the name, that comes from "joining" "base:allocation" and "qemu:allocation-depth". This is correct when thinking about qemu internals, but this is not really getting both, since "qemu:allocation-depth" is reduced to local and backing. From a client point of view, I think this is best described as "qemu:allocation" which is an extension to NBD protocol, providing the same HOLE and ZERO bits, and qemu specific info LOCAL, BACKING. Using different "namespace" ("qemu" vs "base") makes it clear that this is not the same. We discussed in the past the option to expose also the dirty status of every block in the response. Again this info is available using "qemu:dirty-bitmap:xxx" but just like allocation depth and base allocation, merging the results is hard and if we could expose also the dirty bit, this can make clients life even better. In this case I'm not sure "qemu:allocation" is the best name, maybe something more generic like "qemu:extents" or "qemu:block-status" is even better. > With regards to exposing this new feature from qemu as NBD server, it > is sufficient to reuse the existing 'qemu-nbd -A': since that already > exposes allocation depth, it does not hurt to advertise two separate > qemu:XXX metadata contexts at once for two different views of > allocation depth. And just because the server supports multiple > contexts does not mean a client will want or need to connect to > everything available. On the other hand, the existing hack of using > the qemu NBD client option of x-dirty-bitmap to select an alternative > context from the client does NOT make it possible to read the extra > information exposed by the new metadata context. For now, you MUST > use something like libnbd's 'nbdinfo --map=qemu:joint-allocation' in > order to properly see all four bits in action: Makes sense. > # Create a qcow2 image with a raw backing file: > $ qemu-img create base.raw $((4*64*1024)) > $ qemu-img create -f qcow2 -b base.raw -F raw top.qcow2 > > # Write to first 3 clusters of base: > $ qemu-io -f raw -c "w -P 65 0 64k" -c "w -P 66 64k 64k" \ > -c "w -P 67 128k 64k" base.raw > > # Write to second and third clusters of top, hiding base: > $ qemu-io -f qcow2 -c "w -P 69 64k 64k" -c "w -z 128k 64k" top.qcow2 Looks familiar but nicer :-) > # Expose top.qcow2 without backing file over NBD > $ ./qemu-nbd -r -t -f qcow2 -A 'json:{"driver":"qcow2", "backing":null, \ > "file":{"driver":"file", "filename":"top.qcow2"}}' > $ nbdinfo --map=qemu:joint-allocation nbd://localhost > 0 65536 3 > 65536 65536 4 > 131072 65536 7 > 196608 65536 3 Using the libnbd patch this shows: $ ./nbdinfo --map="qemu:joint-allocation" nbd://localhost 0 65536 3 hole,zero,unallocated 65536 65536 4 allocated,local 131072 65536 7 hole,zero,local 196608 65536 3 hole,zero,unallocated Looks good. We need to convert this output to: {"start": 0, "length": 65536, "zero": true, "hole": true}, {"start": 65536, "length": 65536, "zero": false, "hole": false}, {"start": 131072, "length": 65536, "zero": true, "hole": false}, {"start": 196608, "length": 65536, "zero": true, "hole": true}, So it seems that we need to use this logic for holes when we inspect a single qcow2 image: hole = not (flags & NBD_STATE_LOCAL) And ignore the NBD_STATE_HOLE, which is about qcow2 internals. This patch fixes the critical issue for oVirt, but in a way it returns the previous state when you could not report holes in raw images. With this patch holes in raw image looks like: $ truncate -s 1g empty.raw $ ./qemu-nbd -r -t -f raw empty.raw --allocation-depth $ ./nbdinfo --map="qemu:joint-allocation" nbd://localhost 0 1073741824 7 hole,zero,local This is not a practical issue for oVirt, but it would be better to report: $ ./nbdinfo --map="qemu:joint-allocation" nbd://localhost 0 1073741824 3 hole,zero,unallocated This is the output for the empty qcow2 image. And this also affects --allocation-depth, for raw empty image we get: $ ./nbdinfo --map="qemu:allocation-depth" nbd://localhost 0 1073741824 1 local But for empty qcow2 we get: $ ./nbdinfo --map="qemu:allocation-depth" nbd://localhost 0 1073741824 0 unallocated I think we have a bug reporting BDRV_BLOCK_ALLOCATED when the BDRV_BLOCK_DATA bit is not set. > [This was output from nbdinfo 1.8.0; a later version will also add a > column to decode the bits into human-readable strings] > > Additionally, later qemu patches may try to improve qemu-img to > automatically take advantage of additional NBD context information, > without having to use x-dirty-bitmap. > > Reported-by: Nir Soffer <nsoffer@redhat.com> > Resolves: https://bugzilla.redhat.com/1968693 > Signed-off-by: Eric Blake <eblake@redhat.com> > --- > docs/interop/nbd.txt | 31 ++++++- > docs/tools/qemu-nbd.rst | 4 +- > qapi/block-export.json | 4 +- > include/block/nbd.h | 10 ++- > nbd/server.c | 87 +++++++++++++++++-- > .../tests/nbd-qemu-allocation.out | 3 +- > 6 files changed, 125 insertions(+), 14 deletions(-) > > diff --git a/docs/interop/nbd.txt b/docs/interop/nbd.txt > index 10ce098a29bf..cc8ce2d5389f 100644 > --- a/docs/interop/nbd.txt > +++ b/docs/interop/nbd.txt > @@ -17,7 +17,7 @@ namespace "qemu". > > == "qemu" namespace == > > -The "qemu" namespace currently contains two available metadata context > +The "qemu" namespace currently contains three available metadata context > types. The first is related to exposing the contents of a dirty > bitmap alongside the associated disk contents. That metadata context > is named with the following form: > @@ -39,8 +39,32 @@ depth of which layer in a thin-provisioned backing chain provided the > data (0 for unallocated, 1 for the active layer, 2 for the first > backing layer, and so forth). > > -For NBD_OPT_LIST_META_CONTEXT the following queries are supported > -in addition to the specific "qemu:allocation-depth" and > +The third is for convenience in querying the results of > +base:allocation and qemu:allocation-depth in one go, under the > +metadata context named > + > + qemu:joint-allocation > + > +In this context, bits 0 and 1 reflect the same information as > +base:allocation, and bits 2 and 3 provide a summary of > +qemu:allocation-depth (although this context is not able to convey > +how deep in the backing chain data comes from). > + > + bit 0: NBD_STATE_HOLE, set when the extent is sparse > + bit 1: NBD_STATE_ZERO, set when the extent reads as zero > + bit 2: NBD_STATE_LOCAL, set when the extent contents come from the > + local layer > + bit 3: NBD_STATE_BACKING, set when the extent contents come from a > + backing layer of unspecified depth > + > +This context does not provide any unique information, but does make it > +easier to reconstruct backing chains without having to piece together > +what might other be disparate-length returns of two separate metadat > +contexts. > + > +For NBD_OPT_LIST_META_CONTEXT the following queries are supported in > +addition to the specific "qemu:allocation-depth", > +"qemu:joint-allocation", and > "qemu:dirty-bitmap:<dirty-bitmap-export-name>": Sending NBD_OPT_LIST_META_CONTEXT with "qemu:joint-allocation" should tell if the feature is available, right? If we don't find it we will fallback to "base:allocation". > * "qemu:" - returns list of all available metadata contexts in the > @@ -68,3 +92,4 @@ NBD_CMD_BLOCK_STATUS for "qemu:dirty-bitmap:", NBD_CMD_CACHE > * 4.2: NBD_FLAG_CAN_MULTI_CONN for shareable read-only exports, > NBD_CMD_FLAG_FAST_ZERO > * 5.2: NBD_CMD_BLOCK_STATUS for "qemu:allocation-depth" > +* 6.1: NBD_CMD_BLOCK_STATUS for "qemu:joint-allocation" > diff --git a/docs/tools/qemu-nbd.rst b/docs/tools/qemu-nbd.rst > index ee862fa0bc02..975ef5cedd14 100644 > --- a/docs/tools/qemu-nbd.rst > +++ b/docs/tools/qemu-nbd.rst > @@ -75,8 +75,8 @@ driver options if ``--image-opts`` is specified. > .. option:: -A, --allocation-depth > > Expose allocation depth information via the > - ``qemu:allocation-depth`` metadata context accessible through > - NBD_OPT_SET_META_CONTEXT. > + ``qemu:allocation-depth`` and ``qemu:joint-allocation`` metadata > + contexts accessible through NBD_OPT_SET_META_CONTEXT. > > .. option:: -B, --bitmap=NAME It would be awesome if this would enable NBD_STATE_DIRTY bit in the response, in he same way that -A enables NBD_STATE_LOCAL/BACKING > diff --git a/qapi/block-export.json b/qapi/block-export.json > index e819e70cac0d..1bd315b7958d 100644 > --- a/qapi/block-export.json > +++ b/qapi/block-export.json > @@ -92,8 +92,8 @@ > # > # @allocation-depth: Also export the allocation depth map for @device, so > # the NBD client can use NBD_OPT_SET_META_CONTEXT with > -# the metadata context name "qemu:allocation-depth" to > -# inspect allocation details. (since 5.2) > +# the metadata context name "qemu:allocation-depth" or > +# "qemu:joint-allocation" to inspect allocation details. > # > # Since: 5.2 > ## > diff --git a/include/block/nbd.h b/include/block/nbd.h > index 5f34d23bb037..7f411dc59e57 100644 > --- a/include/block/nbd.h > +++ b/include/block/nbd.h > @@ -1,5 +1,5 @@ > /* > - * Copyright (C) 2016-2020 Red Hat, Inc. > + * Copyright (C) 2016-2021 Red Hat, Inc. > * Copyright (C) 2005 Anthony Liguori <anthony@codemonkey.ws> > * > * Network Block Device > @@ -261,6 +261,14 @@ enum { > > /* No flags needed for qemu:allocation-depth in NBD_REPLY_TYPE_BLOCK_STATUS */ > > +/* > + * Extent flags for qemu:joint-allocation in NBD_REPLY_TYPE_BLOCK_STATUS, > + * reusing the two flags already in base:allocation, and compressing the > + * results of qemu:allocation-depth into two bits > + */ > +#define NBD_STATE_LOCAL (1 << 2) > +#define NBD_STATE_BACKING (1 << 3) > + > static inline bool nbd_reply_type_is_error(int type) > { > return type & (1 << 15); > diff --git a/nbd/server.c b/nbd/server.c > index b60ebc3ab6ac..8b825ccaf2ac 100644 > --- a/nbd/server.c > +++ b/nbd/server.c > @@ -28,8 +28,9 @@ > > #define NBD_META_ID_BASE_ALLOCATION 0 > #define NBD_META_ID_ALLOCATION_DEPTH 1 > +#define NBD_META_ID_JOINT_ALLOCATION 2 > /* Dirty bitmaps use 'NBD_META_ID_DIRTY_BITMAP + i', so keep this id last. */ > -#define NBD_META_ID_DIRTY_BITMAP 2 > +#define NBD_META_ID_DIRTY_BITMAP 3 > > /* > * NBD_MAX_BLOCK_STATUS_EXTENTS: 1 MiB of extents data. An empirical > @@ -97,6 +98,7 @@ struct NBDExport { > Notifier eject_notifier; > > bool allocation_depth; > + bool joint_allocation; > BdrvDirtyBitmap **export_bitmaps; > size_t nr_export_bitmaps; > }; > @@ -111,6 +113,7 @@ typedef struct NBDExportMetaContexts { > size_t count; /* number of negotiated contexts */ > bool base_allocation; /* export base:allocation context (block status) */ > bool allocation_depth; /* export qemu:allocation-depth */ > + bool joint_allocation; /* export qemu:joint-allocation */ > bool *bitmaps; /* > * export qemu:dirty-bitmap:<export bitmap name>, > * sized by exp->nr_export_bitmaps > @@ -862,9 +865,9 @@ static bool nbd_meta_base_query(NBDClient *client, NBDExportMetaContexts *meta, > > /* nbd_meta_qemu_query > * > - * Handle queries to 'qemu' namespace. For now, only the qemu:dirty-bitmap: > - * and qemu:allocation-depth contexts are available. Return true if @query > - * has been handled. > + * Handle queries to 'qemu' namespace. For now, only the qemu:dirty-bitmap:, > + * qemu:joint-allocation, and qemu:allocation-depth contexts are available. > + * Return true if @query has been handled. > */ > static bool nbd_meta_qemu_query(NBDClient *client, NBDExportMetaContexts *meta, > const char *query) > @@ -891,6 +894,12 @@ static bool nbd_meta_qemu_query(NBDClient *client, NBDExportMetaContexts *meta, > return true; > } > > + if (strcmp(query, "joint-allocation") == 0) { > + trace_nbd_negotiate_meta_query_parse("joint-allocation"); > + meta->joint_allocation = meta->exp->joint_allocation; > + return true; > + } > + > if (nbd_strshift(&query, "dirty-bitmap:")) { > trace_nbd_negotiate_meta_query_parse("dirty-bitmap:"); > if (!*query) { > @@ -1023,6 +1032,7 @@ static int nbd_negotiate_meta_queries(NBDClient *client, > /* enable all known contexts */ > meta->base_allocation = true; > meta->allocation_depth = meta->exp->allocation_depth; > + meta->joint_allocation = meta->exp->joint_allocation; > memset(meta->bitmaps, 1, meta->exp->nr_export_bitmaps); > } else { > for (i = 0; i < nb_queries; ++i) { > @@ -1053,6 +1063,16 @@ static int nbd_negotiate_meta_queries(NBDClient *client, > count++; > } > > + if (meta->joint_allocation) { > + ret = nbd_negotiate_send_meta_context(client, "qemu:joint-allocation", > + NBD_META_ID_JOINT_ALLOCATION, > + errp); > + if (ret < 0) { > + return ret; > + } > + count++; > + } > + > for (i = 0; i < meta->exp->nr_export_bitmaps; i++) { > const char *bm_name; > g_autofree char *context = NULL; > @@ -1743,7 +1763,8 @@ static int nbd_export_create(BlockExport *blk_exp, BlockExportOptions *exp_args, > bdrv_dirty_bitmap_set_busy(exp->export_bitmaps[i], true); > } > > - exp->allocation_depth = arg->allocation_depth; > + /* QMP allocation_depth flag controls export of both metadata contexts. */ > + exp->allocation_depth = exp->joint_allocation = arg->allocation_depth; > > /* > * We need to inhibit request queuing in the block layer to ensure we can > @@ -2164,6 +2185,47 @@ static int blockalloc_to_extents(BlockDriverState *bs, uint64_t offset, > return 0; > } > > +static int blockjoint_to_extents(BlockDriverState *bs, uint64_t offset, > + uint64_t bytes, NBDExtentArray *ea) > +{ > + while (bytes) { > + uint32_t flags = 0; > + int64_t num, num1, num2; > + int ret1 = bdrv_block_status_above(bs, NULL, offset, bytes, &num1, > + NULL, NULL); > + int ret2 = bdrv_is_allocated_above(bs, NULL, false, offset, bytes, > + &num2); Do we really need to make 2 calls? > + > + if (ret1 < 0) { > + return ret1; > + } > + if (ret2 < 0) { > + return ret2; > + } > + num = MIN(num1, num2); > + > + if (!(ret1 & BDRV_BLOCK_DATA)) { > + flags |= NBD_STATE_HOLE; > + } > + if (ret1 & BDRV_BLOCK_ZERO) { > + flags |= NBD_STATE_ZERO; > + } > + if (ret2 == 1) { > + flags |= NBD_STATE_LOCAL; > + } else if (ret2 > 1) { > + flags |= NBD_STATE_BACKING; > + } > + if (nbd_extent_array_add(ea, num, flags) < 0) { > + return 0; > + } > + > + offset += num; > + bytes -= num; > + } > + > + return 0; > +} I tried this quick hack on top of your patch: diff --git a/block/io.c b/block/io.c index 1a05f320d3..aaf01c9119 100644 --- a/block/io.c +++ b/block/io.c @@ -2656,6 +2656,15 @@ int bdrv_block_status_above(BlockDriverState *bs, BlockDriverState *base, pnum, map, file, NULL); } +int bdrv_block_status_above_depth(BlockDriverState *bs, BlockDriverState *base, + int64_t offset, int64_t bytes, int64_t *pnum, + int64_t *map, BlockDriverState **file, + int *depth) +{ + return bdrv_common_block_status_above(bs, base, false, true, offset, bytes, + pnum, map, file, depth); +} + int bdrv_block_status(BlockDriverState *bs, int64_t offset, int64_t bytes, int64_t *pnum, int64_t *map, BlockDriverState **file) { diff --git a/include/block/block.h b/include/block/block.h index 8e707a83b7..e3bbe26695 100644 --- a/include/block/block.h +++ b/include/block/block.h @@ -515,6 +515,10 @@ int bdrv_block_status(BlockDriverState *bs, int64_t offset, int bdrv_block_status_above(BlockDriverState *bs, BlockDriverState *base, int64_t offset, int64_t bytes, int64_t *pnum, int64_t *map, BlockDriverState **file); +int bdrv_block_status_above_depth(BlockDriverState *bs, BlockDriverState *base, + int64_t offset, int64_t bytes, int64_t *pnum, + int64_t *map, BlockDriverState **file, + int *depth); int bdrv_is_allocated(BlockDriverState *bs, int64_t offset, int64_t bytes, int64_t *pnum); int bdrv_is_allocated_above(BlockDriverState *top, BlockDriverState *base, diff --git a/nbd/server.c b/nbd/server.c index 8b825ccaf2..16dbd8ed51 100644 --- a/nbd/server.c +++ b/nbd/server.c @@ -2190,31 +2190,31 @@ static int blockjoint_to_extents(BlockDriverState *bs, uint64_t offset, { while (bytes) { uint32_t flags = 0; - int64_t num, num1, num2; - int ret1 = bdrv_block_status_above(bs, NULL, offset, bytes, &num1, - NULL, NULL); - int ret2 = bdrv_is_allocated_above(bs, NULL, false, offset, bytes, - &num2); - - if (ret1 < 0) { - return ret1; - } - if (ret2 < 0) { - return ret2; + int depth = 0; + int64_t num; + int ret = bdrv_block_status_above_depth(bs, NULL, offset, bytes, &num, + NULL, NULL, &depth); + + if (ret < 0) { + return ret; } - num = MIN(num1, num2); - if (!(ret1 & BDRV_BLOCK_DATA)) { + if (!(ret & BDRV_BLOCK_DATA)) { flags |= NBD_STATE_HOLE; } - if (ret1 & BDRV_BLOCK_ZERO) { + + if (ret & BDRV_BLOCK_ZERO) { flags |= NBD_STATE_ZERO; } - if (ret2 == 1) { - flags |= NBD_STATE_LOCAL; - } else if (ret2 > 1) { - flags |= NBD_STATE_BACKING; + + if (ret & BDRV_BLOCK_ALLOCATED) { + if (depth == 1) { + flags |= NBD_STATE_LOCAL; + } else if (depth > 1) { + flags |= NBD_STATE_BACKING; + } } + if (nbd_extent_array_add(ea, num, flags) < 0) { return 0; } And seems to give the same results: top.qcow2 without backing file: $ ./nbdinfo --map="qemu:joint-allocation" nbd://localhost 0 65536 3 hole,zero,unallocated 65536 65536 4 allocated,local 131072 65536 7 hole,zero,local 196608 65536 3 hole,zero,unallocated empty qcow2 image: $ ./nbdinfo --map="qemu:joint-allocation" nbd://localhost 0 1073741824 3 hole,zero,unallocated empty raw image: $ ./nbdinfo --map="qemu:joint-allocation" nbd://localhost 0 1073741824 7 hole,zero,local > + > /* > * nbd_co_send_extents > * > @@ -2205,6 +2267,8 @@ static int nbd_co_send_block_status(NBDClient *client, uint64_t handle, > > if (context_id == NBD_META_ID_BASE_ALLOCATION) { > ret = blockstatus_to_extents(bs, offset, length, ea); > + } else if (context_id == NBD_META_ID_JOINT_ALLOCATION) { > + ret = blockjoint_to_extents(bs, offset, length, ea); > } else { > ret = blockalloc_to_extents(bs, offset, length, ea); > } > @@ -2574,6 +2638,19 @@ static coroutine_fn int nbd_handle_request(NBDClient *client, > } > } > > + if (client->export_meta.joint_allocation) { > + ret = nbd_co_send_block_status(client, request->handle, > + blk_bs(exp->common.blk), > + request->from, request->len, > + dont_fragment, > + !--contexts_remaining, > + NBD_META_ID_JOINT_ALLOCATION, > + errp); > + if (ret < 0) { > + return ret; > + } > + } > + > for (i = 0; i < client->exp->nr_export_bitmaps; i++) { > if (!client->export_meta.bitmaps[i]) { > continue; > diff --git a/tests/qemu-iotests/tests/nbd-qemu-allocation.out b/tests/qemu-iotests/tests/nbd-qemu-allocation.out > index c51022b2a38d..b9e469efc58e 100644 > --- a/tests/qemu-iotests/tests/nbd-qemu-allocation.out > +++ b/tests/qemu-iotests/tests/nbd-qemu-allocation.out > @@ -21,9 +21,10 @@ exports available: 1 > min block: 1 > opt block: 4096 > max block: 33554432 > - available meta contexts: 2 > + available meta contexts: 3 > base:allocation > qemu:allocation-depth > + qemu:joint-allocation > [{ "start": 0, "length": 3145728, "depth": 0, "zero": false, "data": true, "offset": OFFSET}, > { "start": 3145728, "length": 1048576, "depth": 0, "zero": true, "data": false, "offset": OFFSET}] > [{ "start": 0, "length": 1048576, "depth": 0, "zero": true, "data": true, "offset": OFFSET}, > -- > 2.31.1 > Thanks for working on this! Nir ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context 2021-06-09 23:52 ` Nir Soffer @ 2021-06-10 12:30 ` Vladimir Sementsov-Ogievskiy 2021-06-10 13:47 ` Eric Blake 2021-06-10 13:16 ` Nir Soffer 2021-06-10 13:31 ` Eric Blake 2 siblings, 1 reply; 19+ messages in thread From: Vladimir Sementsov-Ogievskiy @ 2021-06-10 12:30 UTC (permalink / raw) To: Nir Soffer, Eric Blake Cc: QEMU Developers, libguestfs, qemu-block, Kevin Wolf, Max Reitz, Markus Armbruster 10.06.2021 02:52, Nir Soffer wrote: > On Wed, Jun 9, 2021 at 9:01 PM Eric Blake<eblake@redhat.com> wrote: >> When trying to reconstruct a qcow2 chain using information provided >> over NBD, ovirt had been relying on an unsafe assumption that any >> portion of the qcow2 file advertised as sparse would defer to the >> backing image; this worked with what qemu 5.2 reports for a qcow2 BSD >> loaded with "backing":null. However, in 6.0, commit 0da9856851 (nbd: >> server: Report holes for raw images) also had a side-effect of >> reporting unallocated zero clusters in qcow2 files as sparse. This >> change is correct from the NBD spec perspective (advertising bits has >> always been optional based on how much information the server has >> available, and should only be used to optimize behavior when a bit is >> set, while not assuming semantics merely because a bit is clear), but >> means that a qcow2 file that uses an unallocated zero cluster to >> override a backing file now shows up as sparse over NBD, and causes >> ovirt to fail to reproduce that cluster (ie. ovirt was assuming it >> only had to write clusters where the bit was clear, and the 6.0 >> behavior change shows the flaw in that assumption). >> >> The correct fix is for ovirt to additionally use the >> qemu:allocation-depth metadata context added in 5.2: after all, the >> actual determination for what is needed to recreate a qcow2 file is >> not whether a cluster is sparse, but whether the allocation-depth >> shows the cluster to be local. But reproducing an image is more >> efficient when handling known-zero clusters, which means that ovirt >> has to track both base:allocation and qemu:allocation-depth metadata >> contexts simultaneously. While NBD_CMD_BLOCK_STATUS is just fine >> sending back information for two contexts in parallel, it comes with >> some bookkeeping overhead at the client side: the two contexts need >> not report the same length of replies, and it involves more network >> traffic. Aren't both context described in one reply? Or what do you mean by not the same length? >> >> So, as a convenience, we can provide yet another metadata context, >> "qemu:joint-allocation", which provides the bulk of the same >> information already available from using "base:allocation" and >> "qemu:allocation-depth" in parallel; the only difference is that an >> allocation depth larger than one is collapsed to a single bit, rather >> than remaining an integer representing actual depth. By connecting to >> just this context, a client has less work to perform while still >> getting at all pieces of information needed to recreate a qcow2 >> backing chain. > Providing extended allocation is awsome, and makes client life much > easier. But I'm not sure about the name, that comes from "joining" > "base:allocation" and "qemu:allocation-depth". This is correct when > thinking about qemu internals, but this is not really getting both, since > "qemu:allocation-depth" is reduced to local and backing. > > From a client point of view, I think this is best described as "qemu:allocation" > which is an extension to NBD protocol, providing the same HOLE and ZERO > bits, and qemu specific info LOCAL, BACKING. Using different "namespace" > ("qemu" vs "base") makes it clear that this is not the same. > > We discussed in the past the option to expose also the dirty status of every > block in the response. Again this info is available using > "qemu:dirty-bitmap:xxx" > but just like allocation depth and base allocation, merging the results is hard > and if we could expose also the dirty bit, this can make clients life > even better. > In this case I'm not sure "qemu:allocation" is the best name, maybe something > more generic like "qemu:extents" or "qemu:block-status" is even better. > Oops. Could you please describe, what's the problem with parsing several context simultaneously? This all sound to me as we are going to implement "joint" combined conexts for every useful combination of existing contexts that user wants. So, it's a kind of workaround of inconvenient protocol we have invented in the past. Doesn't it mean that we instead should rework, how we export several contexts? Maybe we can improve generic export of several contexts simultaneously, so that it will be convenient for the client? Than we don't need any additional combined contexts. -- Best regards, Vladimir ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context 2021-06-10 12:30 ` Vladimir Sementsov-Ogievskiy @ 2021-06-10 13:47 ` Eric Blake 2021-06-10 14:10 ` Vladimir Sementsov-Ogievskiy 0 siblings, 1 reply; 19+ messages in thread From: Eric Blake @ 2021-06-10 13:47 UTC (permalink / raw) To: Vladimir Sementsov-Ogievskiy Cc: Kevin Wolf, qemu-block, Markus Armbruster, QEMU Developers, Max Reitz, Nir Soffer, libguestfs On Thu, Jun 10, 2021 at 03:30:17PM +0300, Vladimir Sementsov-Ogievskiy wrote: > > > The correct fix is for ovirt to additionally use the > > > qemu:allocation-depth metadata context added in 5.2: after all, the > > > actual determination for what is needed to recreate a qcow2 file is > > > not whether a cluster is sparse, but whether the allocation-depth > > > shows the cluster to be local. But reproducing an image is more > > > efficient when handling known-zero clusters, which means that ovirt > > > has to track both base:allocation and qemu:allocation-depth metadata > > > contexts simultaneously. While NBD_CMD_BLOCK_STATUS is just fine > > > sending back information for two contexts in parallel, it comes with > > > some bookkeeping overhead at the client side: the two contexts need > > > not report the same length of replies, and it involves more network > > > traffic. > > Aren't both context described in one reply? Or what do you mean by not the same length? The example file demonstrates this. We have: base.raw ABC- top.qcow2 -D0- guest sees AD00 Querying base:allocation returns: 0 65536 3 hole,zero 65536 65536 0 allocated 131072 131072 3 hole,zero Querying qemu:allocation-depth returns: 0 65536 0 unallocated 65536 131072 1 local 196608 65536 0 unallocated That is, the query starting at 64k returns different lengths (64k for base:allocation, 128k for qemu:allocation-depth), and the client has to process the smaller of the two regions before moving on to the next query. But if the client then does a query starting at 128k, it either has to remember that it previously has information available from the earlier qemu:allocation-depth, or repeats efforts over the wire. The joy of having a single metadata context return both pieces of information at once is that the client no longer has to do this cross-correlation between the differences in extent lengths of the parallel contexts. > > We discussed in the past the option to expose also the dirty status of every > > block in the response. Again this info is available using > > "qemu:dirty-bitmap:xxx" > > but just like allocation depth and base allocation, merging the results is hard > > and if we could expose also the dirty bit, this can make clients life > > even better. > > In this case I'm not sure "qemu:allocation" is the best name, maybe something > > more generic like "qemu:extents" or "qemu:block-status" is even better. > > > > Oops. Could you please describe, what's the problem with parsing several context simultaneously? There is no inherent technical problem, just extra work. Joining the work at the server side is less coding effort than recoding the boilerplate to join the work at every single client side. And the information is already present. So we could just scrap this entire RFC by stating that the information is already available, and it is not worth qemu's effort to provide the convenience context. Joining base:allocation and qemu:allocation-depth was easy - in fact, since both use bdrv_block_status under the hood, we could (and probably should!) merge it into a single qemu query. But joining base:allocation and qemu:dirty-bitmap:FOO will be harder, at which point I question whether it is worth the complications. And if you argue that a joint context is not worthwhile without dirty bitmap(s) being part of that joint context, then maybe this RFC is too complex to worry about, and we should just leave the cross-correlation of parallel contexts to be client-side, after all. > > This all sound to me as we are going to implement "joint" combined conexts for every useful combination of existing contexts that user wants. So, it's a kind of workaround of inconvenient protocol we have invented in the past. > > Doesn't it mean that we instead should rework, how we export several contexts? Maybe we can improve generic export of several contexts simultaneously, so that it will be convenient for the client? Than we don't need any additional combined contexts. The NBD protocol intentionally left wiggle room for servers to report different extent lengths across different contexts. But other than qemu, I don't know of any other NBD servers advertising alternate contexts. If we think we can reasonbly restrict the NBD protocol to require that any server sending parallel contexts to a client MUST use the same extent lengths for all parallel contexts (clients still have to read multiple contexts, but the cross-correlation becomes easier because the client doesn't have to worry about length mismatches), and code that up in qemu, that's also something we can consider. Or maybe even have it be an opt-in, where a client requests NBD_OPT_ALIGN_META_CONTEXT; if the server acknowledges that option, the client knows that it can request parallel NBD_OPT_SET_META_CONTEXT and the extents replied to each NBD_OPT_BLOCK_STATUS will be aligned; if the server does not acknowledge the option, then the client has the choice of requesting at most one meta context, or else dealing with unmatched extent lengths itself. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context 2021-06-10 13:47 ` Eric Blake @ 2021-06-10 14:10 ` Vladimir Sementsov-Ogievskiy 0 siblings, 0 replies; 19+ messages in thread From: Vladimir Sementsov-Ogievskiy @ 2021-06-10 14:10 UTC (permalink / raw) To: Eric Blake Cc: Nir Soffer, QEMU Developers, libguestfs, qemu-block, Kevin Wolf, Max Reitz, Markus Armbruster 10.06.2021 16:47, Eric Blake wrote: > On Thu, Jun 10, 2021 at 03:30:17PM +0300, Vladimir Sementsov-Ogievskiy wrote: >>>> The correct fix is for ovirt to additionally use the >>>> qemu:allocation-depth metadata context added in 5.2: after all, the >>>> actual determination for what is needed to recreate a qcow2 file is >>>> not whether a cluster is sparse, but whether the allocation-depth >>>> shows the cluster to be local. But reproducing an image is more >>>> efficient when handling known-zero clusters, which means that ovirt >>>> has to track both base:allocation and qemu:allocation-depth metadata >>>> contexts simultaneously. While NBD_CMD_BLOCK_STATUS is just fine >>>> sending back information for two contexts in parallel, it comes with >>>> some bookkeeping overhead at the client side: the two contexts need >>>> not report the same length of replies, and it involves more network >>>> traffic. >> >> Aren't both context described in one reply? Or what do you mean by not the same length? > > The example file demonstrates this. We have: > > base.raw ABC- > top.qcow2 -D0- > guest sees AD00 > > Querying base:allocation returns: > 0 65536 3 hole,zero > 65536 65536 0 allocated > 131072 131072 3 hole,zero > > Querying qemu:allocation-depth returns: > 0 65536 0 unallocated > 65536 131072 1 local > 196608 65536 0 unallocated Hmm right. Sorry, I forget how BLOCK_STATUS works for several contexts. I thought they are combined. > > That is, the query starting at 64k returns different lengths (64k for > base:allocation, 128k for qemu:allocation-depth), and the client has > to process the smaller of the two regions before moving on to the next > query. But if the client then does a query starting at 128k, it > either has to remember that it previously has information available > from the earlier qemu:allocation-depth, or repeats efforts over the > wire. Hmm.. but if we are going to combine contexts in qemu, we face the same problem, as sources of the contexts may return information by different chunks, so we'll have to cache something, or query the same thing twice. But yes, at least we avoid doing it thought the net. > > The joy of having a single metadata context return both pieces of > information at once is that the client no longer has to do this > cross-correlation between the differences in extent lengths of the > parallel contexts. > >>> We discussed in the past the option to expose also the dirty status of every >>> block in the response. Again this info is available using >>> "qemu:dirty-bitmap:xxx" >>> but just like allocation depth and base allocation, merging the results is hard >>> and if we could expose also the dirty bit, this can make clients life >>> even better. >>> In this case I'm not sure "qemu:allocation" is the best name, maybe something >>> more generic like "qemu:extents" or "qemu:block-status" is even better. >>> >> >> Oops. Could you please describe, what's the problem with parsing several context simultaneously? > > There is no inherent technical problem, just extra work. Joining the > work at the server side is less coding effort than recoding the > boilerplate to join the work at every single client side. And the > information is already present. So we could just scrap this entire > RFC by stating that the information is already available, and it is > not worth qemu's effort to provide the convenience context. > > Joining base:allocation and qemu:allocation-depth was easy - in fact, > since both use bdrv_block_status under the hood, we could (and > probably should!) merge it into a single qemu query. But joining > base:allocation and qemu:dirty-bitmap:FOO will be harder, at which > point I question whether it is worth the complications. And if you > argue that a joint context is not worthwhile without dirty bitmap(s) > being part of that joint context, then maybe this RFC is too complex > to worry about, and we should just leave the cross-correlation of > parallel contexts to be client-side, after all. > > >> >> This all sound to me as we are going to implement "joint" combined conexts for every useful combination of existing contexts that user wants. So, it's a kind of workaround of inconvenient protocol we have invented in the past. >> >> Doesn't it mean that we instead should rework, how we export several contexts? Maybe we can improve generic export of several contexts simultaneously, so that it will be convenient for the client? Than we don't need any additional combined contexts. > > The NBD protocol intentionally left wiggle room for servers to report > different extent lengths across different contexts. But other than > qemu, I don't know of any other NBD servers advertising alternate > contexts. If we think we can reasonbly restrict the NBD protocol to > require that any server sending parallel contexts to a client MUST use > the same extent lengths for all parallel contexts (clients still have > to read multiple contexts, but the cross-correlation becomes easier > because the client doesn't have to worry about length mismatches), and > code that up in qemu, that's also something we can consider. > > Or maybe even have it be an opt-in, where a client requests > NBD_OPT_ALIGN_META_CONTEXT; if the server acknowledges that option, > the client knows that it can request parallel NBD_OPT_SET_META_CONTEXT > and the extents replied to each NBD_OPT_BLOCK_STATUS will be aligned; > if the server does not acknowledge the option, then the client has the > choice of requesting at most one meta context, or else dealing with > unmatched extent lengths itself. Yes, that sound good. And that will work for any combination of contexts. Actually, when server doesn't support _ALIGN_, client's behavior may be simple ignoring of longer lengths, shrinking all replies to the minimum of returned lengths. This leads to larger network traffic and probably some extra work on server side, but client logic remains simpler and all problems are fixed if server supports _ALIGN_. -- Best regards, Vladimir ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context 2021-06-09 23:52 ` Nir Soffer 2021-06-10 12:30 ` Vladimir Sementsov-Ogievskiy @ 2021-06-10 13:16 ` Nir Soffer 2021-06-10 14:04 ` Eric Blake 2021-06-10 13:31 ` Eric Blake 2 siblings, 1 reply; 19+ messages in thread From: Nir Soffer @ 2021-06-10 13:16 UTC (permalink / raw) To: Eric Blake Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-block, Markus Armbruster, QEMU Developers, Max Reitz, libguestfs On Thu, Jun 10, 2021 at 2:52 AM Nir Soffer <nsoffer@redhat.com> wrote: > > On Wed, Jun 9, 2021 at 9:01 PM Eric Blake <eblake@redhat.com> wrote: I posted a work in progress patch implementing support for qemu:joint-allocaition in oVirt: https://gerrit.ovirt.org/c/ovirt-imageio/+/115197 The most important part is the nbd client: https://gerrit.ovirt.org/c/ovirt-imageio/+/115197/1/daemon/ovirt_imageio/_internal/nbd.py With this our tests pass with qemu-nbd build with Eric patch: https://gerrit.ovirt.org/c/ovirt-imageio/+/115197/1/daemon/test/client_test.py We may need to use qemu:joint-allocation only for qcow2 images, and base:allocation for raw images, because allocation depth reporting is not correct for raw images. Since we control the qemu-nbd in both cases this should not be an issue. But it would be better if allocation depth would work for any kind of image, and we always use qemu:joint-allocation. Nir ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context 2021-06-10 13:16 ` Nir Soffer @ 2021-06-10 14:04 ` Eric Blake 2021-06-10 14:43 ` Vladimir Sementsov-Ogievskiy 0 siblings, 1 reply; 19+ messages in thread From: Eric Blake @ 2021-06-10 14:04 UTC (permalink / raw) To: Nir Soffer Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-block, QEMU Developers, Markus Armbruster, Max Reitz, libguestfs On Thu, Jun 10, 2021 at 04:16:27PM +0300, Nir Soffer wrote: > On Thu, Jun 10, 2021 at 2:52 AM Nir Soffer <nsoffer@redhat.com> wrote: > > > > On Wed, Jun 9, 2021 at 9:01 PM Eric Blake <eblake@redhat.com> wrote: > > I posted a work in progress patch implementing support for > qemu:joint-allocaition > in oVirt: > https://gerrit.ovirt.org/c/ovirt-imageio/+/115197 > > The most important part is the nbd client: > https://gerrit.ovirt.org/c/ovirt-imageio/+/115197/1/daemon/ovirt_imageio/_internal/nbd.py > > With this our tests pass with qemu-nbd build with Eric patch: > https://gerrit.ovirt.org/c/ovirt-imageio/+/115197/1/daemon/test/client_test.py Note that you _don't_ have to use qemu:joint-allocation; you could get the same information by using the already-existing qemu:allocation-depth. But your patch DOES make it obvious that it was a lot simpler to use a single context (the bulk of your tweak is trying an additional NBD_OPT_SET_META_CONTEXT), than it would be to handle parallel contexts (where you would have to tweak your NBD_CMD_BLOCK_STATUS handler to cross-correlate information). > > We may need to use qemu:joint-allocation only for qcow2 images, and > base:allocation > for raw images, because allocation depth reporting is not correct for > raw images. Since Allocation depth (aka how deep in the backing chain is data allocated) IS correct for raw images (the raw image IS the source of all data); it is just not useful. > we control the qemu-nbd in both cases this should not be an issue. But > it would be > better if allocation depth would work for any kind of image, and we always use > qemu:joint-allocation. Maybe the thing to do is improve the documentation and try to avoid ambiguous terminalogy; in qemu:allocation-depth, a return of depth 0 should be called "absent", not "unallocated". And in libnbd, a base:allocation of 0 should be "data" or "normal", not "allocated". But I don't see how qemu will ever advertise an allocation depth of 0 for a raw image, because raw images have no backing chain to defer to. If you are trying to recreate a sparse raw image, then you really do want to pay attention to NBD_STATE_HOLE and NBD_STATE_ZERO, as those are more accurate pictures of the properties of extents within the raw file. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context 2021-06-10 14:04 ` Eric Blake @ 2021-06-10 14:43 ` Vladimir Sementsov-Ogievskiy 0 siblings, 0 replies; 19+ messages in thread From: Vladimir Sementsov-Ogievskiy @ 2021-06-10 14:43 UTC (permalink / raw) To: Eric Blake, Nir Soffer Cc: Kevin Wolf, qemu-block, Markus Armbruster, QEMU Developers, Max Reitz, libguestfs 10.06.2021 17:04, Eric Blake wrote: > Maybe the thing to do is improve the documentation and try to avoid > ambiguous terminalogy; in qemu:allocation-depth, a return of depth 0 > should be called "absent", not "unallocated". And in libnbd, a > base:allocation of 0 should be "data" or "normal", not "allocated". Interesting, how many problems, misunderstanding and confusion we have for years because of that terminology :) Funny, that we try to imagine how to call these thing in general, but actually in 99.99% cases we are saying about only 5 simple things: file-posix data file-posix hole qcow2 DATA qcow2 ZERO qcow2 UNALLOCATED And all our problems comes from our trying to divide these thing into two categories: allocated/unallocated. But it never worked. I'd divide like this: DATA examples: - data cluster in qcow2 - data region in file-posix properties: - data actually occupies space on disk - io operations are handled by this layer, backing is shadowed - write should not increase disk occupation GO_TO_BACKING examples: - "unallocated" cluster in qcow2 properties - read from backing image (if no backing, read zeroes) - disk occupation status is known only by backing image (if no backing, disk is not occupied) - write will allocate new cluster in top image, which will increase disk occupation ZERO examples: - zero cluster in qcow2, no space is occupied (most probably), reads as zeroes - file-posix hole, no space is occupied (most probably), reads as zeroes properties: - read zeroes - io operations are handled by this layer, backing is shadowed - no space is occupied (most probably) - write should not increase disk occupation (most probably) We can consider qcow2 ALLOCATED_ZERO also, and maybe SCSI unallocated which means that nothing is occupied but read doesn't guarantee zeroes.. But that doesn't really matter. What does matter is that trying to describe qcow2 backing files in usual block terms allocated/unallocated zero/data never worked good. So in a good documentation (and good code) we should describe (and handle) qcow2 backing chains as qcow2 backing chains and don't try to shadow them under usual terminology. -- Best regards, Vladimir ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context 2021-06-09 23:52 ` Nir Soffer 2021-06-10 12:30 ` Vladimir Sementsov-Ogievskiy 2021-06-10 13:16 ` Nir Soffer @ 2021-06-10 13:31 ` Eric Blake 2 siblings, 0 replies; 19+ messages in thread From: Eric Blake @ 2021-06-10 13:31 UTC (permalink / raw) To: Nir Soffer Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-block, Markus Armbruster, QEMU Developers, Max Reitz, libguestfs On Thu, Jun 10, 2021 at 02:52:10AM +0300, Nir Soffer wrote: > > So, as a convenience, we can provide yet another metadata context, > > "qemu:joint-allocation", which provides the bulk of the same > > information already available from using "base:allocation" and > > "qemu:allocation-depth" in parallel; the only difference is that an > > allocation depth larger than one is collapsed to a single bit, rather > > than remaining an integer representing actual depth. By connecting to > > just this context, a client has less work to perform while still > > getting at all pieces of information needed to recreate a qcow2 > > backing chain. > > Providing extended allocation is awsome, and makes client life much > easier. But I'm not sure about the name, that comes from "joining" > "base:allocation" and "qemu:allocation-depth". This is correct when > thinking about qemu internals, but this is not really getting both, since > "qemu:allocation-depth" is reduced to local and backing. > > From a client point of view, I think this is best described as "qemu:allocation" > which is an extension to NBD protocol, providing the same HOLE and ZERO > bits, and qemu specific info LOCAL, BACKING. Using different "namespace" > ("qemu" vs "base") makes it clear that this is not the same. > > We discussed in the past the option to expose also the dirty status of every > block in the response. Again this info is available using > "qemu:dirty-bitmap:xxx" > but just like allocation depth and base allocation, merging the results is hard > and if we could expose also the dirty bit, this can make clients life > even better. > In this case I'm not sure "qemu:allocation" is the best name, maybe something > more generic like "qemu:extents" or "qemu:block-status" is even better. Yeah, I'm happy to bike-shed the name. Note that dirty-bitmap tracking is harder to merge in: qemu supports exposing more than one dirty bitmap context at once, but where each context uses the same bit (bit 0, which overlaps with NBD_STATE_HOLE). We'd have to decide whether we support merging only a single bitmap, or whether a user can merge in up to 28 bitmaps (where the order in which they specify dirty bitmaps as part of the qemu:mega:bitmap1:bitmap2:... context name matters). And if we decide that merging in even a single dirty bitmap is too difficult, then clients already have to correlate a dirty bitmap with everything else, at which point correlating 3 contexts (dirty bitmap with base:allocation and qemu:allocation-depth) is no harder than correlating 2 contexts (dirty bitmap with qemu:joint-allocation or whatever we name it). > > > With regards to exposing this new feature from qemu as NBD server, it > > is sufficient to reuse the existing 'qemu-nbd -A': since that already > > exposes allocation depth, it does not hurt to advertise two separate > > qemu:XXX metadata contexts at once for two different views of > > allocation depth. And just because the server supports multiple > > contexts does not mean a client will want or need to connect to > > everything available. On the other hand, the existing hack of using > > the qemu NBD client option of x-dirty-bitmap to select an alternative > > context from the client does NOT make it possible to read the extra > > information exposed by the new metadata context. For now, you MUST > > use something like libnbd's 'nbdinfo --map=qemu:joint-allocation' in > > order to properly see all four bits in action: > > Makes sense. > > > # Create a qcow2 image with a raw backing file: > > $ qemu-img create base.raw $((4*64*1024)) > > $ qemu-img create -f qcow2 -b base.raw -F raw top.qcow2 > > > > # Write to first 3 clusters of base: > > $ qemu-io -f raw -c "w -P 65 0 64k" -c "w -P 66 64k 64k" \ > > -c "w -P 67 128k 64k" base.raw > > > > # Write to second and third clusters of top, hiding base: > > $ qemu-io -f qcow2 -c "w -P 69 64k 64k" -c "w -z 128k 64k" top.qcow2 > > Looks familiar but nicer :-) Yeah, I just compressed your example. > > > # Expose top.qcow2 without backing file over NBD > > $ ./qemu-nbd -r -t -f qcow2 -A 'json:{"driver":"qcow2", "backing":null, \ > > "file":{"driver":"file", "filename":"top.qcow2"}}' > > $ nbdinfo --map=qemu:joint-allocation nbd://localhost > > 0 65536 3 > > 65536 65536 4 > > 131072 65536 7 > > 196608 65536 3 > > Using the libnbd patch this shows: > > $ ./nbdinfo --map="qemu:joint-allocation" nbd://localhost > 0 65536 3 hole,zero,unallocated > 65536 65536 4 allocated,local > 131072 65536 7 hole,zero,local > 196608 65536 3 hole,zero,unallocated > > Looks good. > > We need to convert this output to: > > {"start": 0, "length": 65536, "zero": true, "hole": true}, > {"start": 65536, "length": 65536, "zero": false, "hole": false}, > {"start": 131072, "length": 65536, "zero": true, "hole": false}, > {"start": 196608, "length": 65536, "zero": true, "hole": true}, > > So it seems that we need to use this logic for holes when we inspect a > single qcow2 image: > > hole = not (flags & NBD_STATE_LOCAL) > > And ignore the NBD_STATE_HOLE, which is about qcow2 internals. Correct. If all you care about is which portions of the image defer to the backing chain, NBD_STATE_LOCAL is sufficient. Looking at NBD_STATE_HOLE merely tells you which portions of the qcow2 file have not been pre-allocated, but that has no bearing on whether those clusters defer to the backing chain. > > This patch fixes the critical issue for oVirt, but in a way it returns > the previous > state when you could not report holes in raw images. With this patch holes > in raw image looks like: > > $ truncate -s 1g empty.raw > $ ./qemu-nbd -r -t -f raw empty.raw --allocation-depth > $ ./nbdinfo --map="qemu:joint-allocation" nbd://localhost > 0 1073741824 7 hole,zero,local > > This is not a practical issue for oVirt, but it would be better to report: > > $ ./nbdinfo --map="qemu:joint-allocation" nbd://localhost > 0 1073741824 3 hole,zero,unallocated > > This is the output for the empty qcow2 image. Unfortunately, that is not how qemu tracks things. Raw files have no backing file, and therefore 100% of the file reports as BDRV_BLOCK_ALLOCATED (which really means the raw file is responsible for 100% of the guest-visible content, and has no bearing to how much filesystem space is allocated in the host). Both a completely sparse and a completely allocated raw file will have the same result of showing the entire file as 'local'. Reporting 'hole,zero,unallocated' would be a lie - it would be claiming that 100% of the raw file defers to the backing chain but that in turn the backing chain has not been allocated. In reality, when bits 2+3 report as 0 (neither NBD_STATE_LOCAL nor NDD_STATE_BACKING was set), qemu is reporting that the cluster defaults to reading as zero because we exhausted the entire backing chain without finding any layer willing to provide those contents. Maybe "unallocated" isn't the best term, and I should use "absent" instead? That way, we don't perpetuate the confusion of HOLE (which is about host storage allocation) with backing chain presence (where a particular layer in a backing chain provides or defers data regardless of host allocation). > > And this also affects --allocation-depth, for raw empty image we get: > > $ ./nbdinfo --map="qemu:allocation-depth" nbd://localhost > 0 1073741824 1 local > > But for empty qcow2 we get: > > $ ./nbdinfo --map="qemu:allocation-depth" nbd://localhost > 0 1073741824 0 unallocated Again, that's what qemu already uses. A raw file has no backing file, so the entire guest contents come from the current layer in the backing chain (there are no deeper layers to look at). An empty qcow2 file (whether or not a backing file is specified) _does_ defer all clusters to whatever the backing chain would supply. Maybe the name "qemu:allocation-depth" is a misnomer, and we should have called it "qemu:backing-depth", but since that was chosen in 5.2, it's harder to change now. > > I think we have a bug reporting > BDRV_BLOCK_ALLOCATED when the BDRV_BLOCK_DATA bit is not set. No, that's not a bug, but intentional behavior, and changing it would break how qemu-img convert and friends operate. > > +For NBD_OPT_LIST_META_CONTEXT the following queries are supported in > > +addition to the specific "qemu:allocation-depth", > > +"qemu:joint-allocation", and > > "qemu:dirty-bitmap:<dirty-bitmap-export-name>": > > Sending NBD_OPT_LIST_META_CONTEXT with "qemu:joint-allocation" > should tell if the feature is available, right? > > If we don't find it we will fallback to "base:allocation". Correct - the shortest sequence would be first sending NBD_OPT_SET_META_CONTEXT with _just_ "qemu:joint-allocation". If the server answers with a context id, you're set; if it answers with an error, the server doesn't support contexts at all so there's no point trying anything else (your fallbacks would also fail). If it answers with 0 contexts, then the context id was unrecognized (you are not talking to a version of qemu that recognizes it), so your fallback would then be a second NBD_OPT_SET_META_CONTEXT, this time with both "base:allocation" and "qemu:allocation-depth", and see if you get 1 or 2 ids back. (If you can assume that all qemu versions in your downstream ecosystem have been patched such that qemu:joint-allocation is always present if base:allocation does not have 5.2 semantics, then you can simplify the fallback to using just "base:allocation", but that's only safe in a curated downstream ecosystem.) > > * "qemu:" - returns list of all available metadata contexts in the > > @@ -68,3 +92,4 @@ NBD_CMD_BLOCK_STATUS for "qemu:dirty-bitmap:", NBD_CMD_CACHE > > * 4.2: NBD_FLAG_CAN_MULTI_CONN for shareable read-only exports, > > NBD_CMD_FLAG_FAST_ZERO > > * 5.2: NBD_CMD_BLOCK_STATUS for "qemu:allocation-depth" > > +* 6.1: NBD_CMD_BLOCK_STATUS for "qemu:joint-allocation" > > diff --git a/docs/tools/qemu-nbd.rst b/docs/tools/qemu-nbd.rst > > index ee862fa0bc02..975ef5cedd14 100644 > > --- a/docs/tools/qemu-nbd.rst > > +++ b/docs/tools/qemu-nbd.rst > > @@ -75,8 +75,8 @@ driver options if ``--image-opts`` is specified. > > .. option:: -A, --allocation-depth > > > > Expose allocation depth information via the > > - ``qemu:allocation-depth`` metadata context accessible through > > - NBD_OPT_SET_META_CONTEXT. > > + ``qemu:allocation-depth`` and ``qemu:joint-allocation`` metadata > > + contexts accessible through NBD_OPT_SET_META_CONTEXT. > > > > .. option:: -B, --bitmap=NAME > > It would be awesome if this would enable NBD_STATE_DIRTY bit > in the response, in he same way that -A enables NBD_STATE_LOCAL/BACKING But remember that qemu can expose more than one dirty bitmap at a time, so then we start having to further complicate things by determining _which_ bit in the combined context would represent _which_ dirty bitmap. Merging things that only depend on bdrv_block_status() was easy, but merging multiple bitmaps starts to make the qemu implementation complicated enough that I'm leaning towards requiring the client to do the cross-context correlation rather than burdening the server with the task. > > +static int blockjoint_to_extents(BlockDriverState *bs, uint64_t offset, > > + uint64_t bytes, NBDExtentArray *ea) > > +{ > > + while (bytes) { > > + uint32_t flags = 0; > > + int64_t num, num1, num2; > > + int ret1 = bdrv_block_status_above(bs, NULL, offset, bytes, &num1, > > + NULL, NULL); > > + int ret2 = bdrv_is_allocated_above(bs, NULL, false, offset, bytes, > > + &num2); > > Do we really need to make 2 calls? Both calls boil down to bdrv_block_status at some point, so you are right that figuring out how to get the information consolidated in 1 call would be more efficient. > > I tried this quick hack on top of your patch: > > diff --git a/block/io.c b/block/io.c > index 1a05f320d3..aaf01c9119 100644 > --- a/block/io.c > +++ b/block/io.c > @@ -2656,6 +2656,15 @@ int bdrv_block_status_above(BlockDriverState > *bs, BlockDriverState *base, > pnum, map, file, NULL); > } > > +int bdrv_block_status_above_depth(BlockDriverState *bs, BlockDriverState *base, > + int64_t offset, int64_t bytes, int64_t *pnum, > + int64_t *map, BlockDriverState **file, > + int *depth) > +{ > + return bdrv_common_block_status_above(bs, base, false, true, offset, bytes, > + pnum, map, file, depth); > +} > + Indeed, adding yet another wrapper around the core functionality of bdrv_block_status to get what we want in one call is a nice idea. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context 2021-06-09 18:01 ` [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context Eric Blake 2021-06-09 23:52 ` Nir Soffer @ 2021-06-11 23:39 ` Nir Soffer 2021-06-14 13:56 ` Eric Blake 1 sibling, 1 reply; 19+ messages in thread From: Nir Soffer @ 2021-06-11 23:39 UTC (permalink / raw) To: Eric Blake Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-block, Markus Armbruster, QEMU Developers, Max Reitz, libguestfs On Wed, Jun 9, 2021 at 9:01 PM Eric Blake <eblake@redhat.com> wrote: > > When trying to reconstruct a qcow2 chain using information provided > over NBD, ovirt had been relying on an unsafe assumption that any > portion of the qcow2 file advertised as sparse would defer to the > backing image; this worked with what qemu 5.2 reports for a qcow2 BSD > loaded with "backing":null. However, in 6.0, commit 0da9856851 (nbd: > server: Report holes for raw images) also had a side-effect of > reporting unallocated zero clusters in qcow2 files as sparse. This > change is correct from the NBD spec perspective (advertising bits has > always been optional based on how much information the server has > available, and should only be used to optimize behavior when a bit is > set, while not assuming semantics merely because a bit is clear), but > means that a qcow2 file that uses an unallocated zero cluster to > override a backing file now shows up as sparse over NBD, and causes > ovirt to fail to reproduce that cluster (ie. ovirt was assuming it > only had to write clusters where the bit was clear, and the 6.0 > behavior change shows the flaw in that assumption). > > The correct fix is for ovirt to additionally use the > qemu:allocation-depth metadata context added in 5.2: after all, the > actual determination for what is needed to recreate a qcow2 file is > not whether a cluster is sparse, but whether the allocation-depth > shows the cluster to be local. But reproducing an image is more > efficient when handling known-zero clusters, which means that ovirt > has to track both base:allocation and qemu:allocation-depth metadata > contexts simultaneously. While NBD_CMD_BLOCK_STATUS is just fine > sending back information for two contexts in parallel, it comes with > some bookkeeping overhead at the client side: the two contexts need > not report the same length of replies, and it involves more network > traffic. Since this change is not simple, and the chance that we also get the dirty bitmap included in the result seems to be very low, I decided to check the direction of merging multiple extents. I started with merging "base:allocation" and "qemu:dirty-bitmap:xxx" since we already have both. It was not hard to do, although it is not completely tested yet. Here is the merging code: https://gerrit.ovirt.org/c/ovirt-imageio/+/115216/1/daemon/ovirt_imageio/_internal/nbdutil.py To make merging easy and safe, we map the NBD_STATE_DIRTY bit to a private bit so it cannot clash with the NBD_STATE_HOLE bit: https://gerrit.ovirt.org/c/ovirt-imageio/+/115215/1/daemon/ovirt_imageio/_internal/nbd.py Here is a functional test using qemu-nbd showing that it works: https://gerrit.ovirt.org/c/ovirt-imageio/+/115216/1/daemon/test/client_test.py I'll try to use "qemu:allocation-depth" in a similar way next week, probably mapping depth > 0 to EXTENT_EXISTS, to use when reporting holes in single qcow2 images. If this is successful, we can start using this in the next ovirt release, and we don't need "qemu:joint-allocation". Nir ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context 2021-06-11 23:39 ` Nir Soffer @ 2021-06-14 13:56 ` Eric Blake 2021-06-14 14:06 ` Nir Soffer 0 siblings, 1 reply; 19+ messages in thread From: Eric Blake @ 2021-06-14 13:56 UTC (permalink / raw) To: Nir Soffer Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-block, Markus Armbruster, QEMU Developers, Max Reitz, libguestfs On Sat, Jun 12, 2021 at 02:39:44AM +0300, Nir Soffer wrote: > Since this change is not simple, and the chance that we also get the dirty > bitmap included in the result seems to be very low, I decided to check the > direction of merging multiple extents. > > I started with merging "base:allocation" and "qemu:dirty-bitmap:xxx" since > we already have both. It was not hard to do, although it is not completely > tested yet. > > Here is the merging code: > https://gerrit.ovirt.org/c/ovirt-imageio/+/115216/1/daemon/ovirt_imageio/_internal/nbdutil.py > > To make merging easy and safe, we map the NBD_STATE_DIRTY bit to a private bit > so it cannot clash with the NBD_STATE_HOLE bit: > https://gerrit.ovirt.org/c/ovirt-imageio/+/115215/1/daemon/ovirt_imageio/_internal/nbd.py > > Here is a functional test using qemu-nbd showing that it works: > https://gerrit.ovirt.org/c/ovirt-imageio/+/115216/1/daemon/test/client_test.py > > I'll try to use "qemu:allocation-depth" in a similar way next week, probably > mapping depth > 0 to EXTENT_EXISTS, to use when reporting holes in > single qcow2 images. > > If this is successful, we can start using this in the next ovirt release, and we > don't need "qemu:joint-allocation". That's nice to know. So at this point, we'll drop the patch on qemu:joint-allocation, and instead focus on teh patch that improves qemu-img map output to make it easier to use in the same way that qemu:allocation-depth is. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context 2021-06-14 13:56 ` Eric Blake @ 2021-06-14 14:06 ` Nir Soffer 0 siblings, 0 replies; 19+ messages in thread From: Nir Soffer @ 2021-06-14 14:06 UTC (permalink / raw) To: Eric Blake Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-block, Markus Armbruster, QEMU Developers, Max Reitz, libguestfs On Mon, Jun 14, 2021 at 4:56 PM Eric Blake <eblake@redhat.com> wrote: > > On Sat, Jun 12, 2021 at 02:39:44AM +0300, Nir Soffer wrote: > > Since this change is not simple, and the chance that we also get the dirty > > bitmap included in the result seems to be very low, I decided to check the > > direction of merging multiple extents. > > > > I started with merging "base:allocation" and "qemu:dirty-bitmap:xxx" since > > we already have both. It was not hard to do, although it is not completely > > tested yet. > > > > Here is the merging code: > > https://gerrit.ovirt.org/c/ovirt-imageio/+/115216/1/daemon/ovirt_imageio/_internal/nbdutil.py > > > > To make merging easy and safe, we map the NBD_STATE_DIRTY bit to a private bit > > so it cannot clash with the NBD_STATE_HOLE bit: > > https://gerrit.ovirt.org/c/ovirt-imageio/+/115215/1/daemon/ovirt_imageio/_internal/nbd.py > > > > Here is a functional test using qemu-nbd showing that it works: > > https://gerrit.ovirt.org/c/ovirt-imageio/+/115216/1/daemon/test/client_test.py > > > > I'll try to use "qemu:allocation-depth" in a similar way next week, probably > > mapping depth > 0 to EXTENT_EXISTS, to use when reporting holes in > > single qcow2 images. > > > > If this is successful, we can start using this in the next ovirt release, and we > > don't need "qemu:joint-allocation". > > That's nice to know. So at this point, we'll drop the patch on > qemu:joint-allocation, and instead focus on teh patch that improves > qemu-img map output to make it easier to use in the same way that > qemu:allocation-depth is. I can update that everything looks good on our side so far, thanks! ^ permalink raw reply [flat|nested] 19+ messages in thread
* [RFC libnbd PATCH] info: Add support for new qemu:joint-allocation 2021-06-09 18:01 [RFC PATCH 0/2] New NBD metacontext Eric Blake 2021-06-09 18:01 ` [PATCH 1/2] iotests: Improve and rename test 309 to nbd-qemu-allocation Eric Blake 2021-06-09 18:01 ` [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context Eric Blake @ 2021-06-09 21:31 ` Eric Blake 2021-06-09 22:20 ` Nir Soffer 2 siblings, 1 reply; 19+ messages in thread From: Eric Blake @ 2021-06-09 21:31 UTC (permalink / raw) To: libguestfs; +Cc: nsoffer, vsementsov, qemu-devel, qemu-block Qemu is adding qemu:joint-allocation as a single context combining the two bits of base:allocation and a compression of qemu:allocation-depth into two bits [1]. Decoding the bits makes it easier for humans to see the result of that context. [1] https://lists.gnu.org/archive/html/qemu-devel/2021-06/msg02446.html --- Obviously, this libnbd patch should only go in if the qemu RFC is accepted favorably. With this patch applied, the example listed in my qemu patch 2/2 commit message [2] becomes $ ~/libnbd/run nbdinfo --map=qemu:joint-allocation nbd://localhost 0 65536 3 hole,zero,unallocated 65536 65536 4 allocated,local 131072 65536 7 hole,zero,local 196608 65536 3 hole,zero,unallocated [2] https://lists.gnu.org/archive/html/qemu-devel/2021-06/msg02448.html For what it's worth, you can also play with the qemu+libnbd patches at: https://repo.or.cz/qemu/ericb.git/ master https://repo.or.cz/libnbd/ericb.git/ master (I sometimes rewind those branches, but they'll be stable for at least a few days after this email) info/map.c | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) diff --git a/info/map.c b/info/map.c index ae6d4fe..21e8657 100644 --- a/info/map.c +++ b/info/map.c @@ -226,6 +226,27 @@ extent_description (const char *metacontext, uint32_t type) return ret; } } + else if (strcmp (metacontext, "qemu:joint-allocation") == 0) { + /* Combo of base:allocation and stripped-down qemu:allocation-depth */ + const char *base, *depth; + switch (type & 3) { + case 0: base = "allocated"; break; + case 1: base = "hole"; break; + case 2: base = "zero"; break; + case 3: base = "hole,zero"; break; + } + switch (type & 0xc) { + case 0: depth = "unallocated"; break; + case 4: depth = "local"; break; + case 8: depth = "backing"; break; + case 12: depth = "<unexpected depth>"; break; + } + if (asprintf (&ret, "%s,%s", base, depth) == -1) { + perror ("asprintf"); + exit (EXIT_FAILURE); + } + return ret; + } return NULL; /* Don't know - description field will be omitted. */ } -- 2.31.1 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [RFC libnbd PATCH] info: Add support for new qemu:joint-allocation 2021-06-09 21:31 ` [RFC libnbd PATCH] info: Add support for new qemu:joint-allocation Eric Blake @ 2021-06-09 22:20 ` Nir Soffer 2021-06-10 13:06 ` Eric Blake 0 siblings, 1 reply; 19+ messages in thread From: Nir Soffer @ 2021-06-09 22:20 UTC (permalink / raw) To: Eric Blake Cc: qemu-block, Vladimir Sementsov-Ogievskiy, QEMU Developers, libguestfs On Thu, Jun 10, 2021 at 12:32 AM Eric Blake <eblake@redhat.com> wrote: > > Qemu is adding qemu:joint-allocation as a single context combining the > two bits of base:allocation and a compression of qemu:allocation-depth > into two bits [1]. Decoding the bits makes it easier for humans to > see the result of that context. > > [1] https://lists.gnu.org/archive/html/qemu-devel/2021-06/msg02446.html > --- > > Obviously, this libnbd patch should only go in if the qemu RFC is > accepted favorably. With this patch applied, the example listed in my > qemu patch 2/2 commit message [2] becomes > > $ ~/libnbd/run nbdinfo --map=qemu:joint-allocation nbd://localhost > 0 65536 3 hole,zero,unallocated > 65536 65536 4 allocated,local > 131072 65536 7 hole,zero,local > 196608 65536 3 hole,zero,unallocated > > [2] https://lists.gnu.org/archive/html/qemu-devel/2021-06/msg02448.html > > For what it's worth, you can also play with the qemu+libnbd patches at: > https://repo.or.cz/qemu/ericb.git/ master > https://repo.or.cz/libnbd/ericb.git/ master > > (I sometimes rewind those branches, but they'll be stable for at least > a few days after this email) > > info/map.c | 21 +++++++++++++++++++++ > 1 file changed, 21 insertions(+) > > diff --git a/info/map.c b/info/map.c > index ae6d4fe..21e8657 100644 > --- a/info/map.c > +++ b/info/map.c > @@ -226,6 +226,27 @@ extent_description (const char *metacontext, uint32_t type) > return ret; > } > } > + else if (strcmp (metacontext, "qemu:joint-allocation") == 0) { > + /* Combo of base:allocation and stripped-down qemu:allocation-depth */ > + const char *base, *depth; > + switch (type & 3) { > + case 0: base = "allocated"; break; > + case 1: base = "hole"; break; > + case 2: base = "zero"; break; > + case 3: base = "hole,zero"; break; > + } > + switch (type & 0xc) { > + case 0: depth = "unallocated"; break; Is this possible? qemu reports BDRV_BLOCK_DATA but not BDRV_BLOCK_ALLOCATED? Anyway this seems like a valid way to present qemu response. > + case 4: depth = "local"; break; > + case 8: depth = "backing"; break; > + case 12: depth = "<unexpected depth>"; break; This should not be possible based on the qemu patch, but printing this seems like a good solution, and can help to debug such an issue. Thinking about client code trying to copy extents based on the flags, the client should abort the operation since qemu response is invalid. > + } > + if (asprintf (&ret, "%s,%s", base, depth) == -1) { > + perror ("asprintf"); > + exit (EXIT_FAILURE); > + } > + return ret; > + } > > return NULL; /* Don't know - description field will be omitted. */ > } > -- > 2.31.1 > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC libnbd PATCH] info: Add support for new qemu:joint-allocation 2021-06-09 22:20 ` Nir Soffer @ 2021-06-10 13:06 ` Eric Blake 2021-06-10 13:19 ` Nir Soffer 0 siblings, 1 reply; 19+ messages in thread From: Eric Blake @ 2021-06-10 13:06 UTC (permalink / raw) To: Nir Soffer Cc: qemu-block, Vladimir Sementsov-Ogievskiy, QEMU Developers, libguestfs On Thu, Jun 10, 2021 at 01:20:13AM +0300, Nir Soffer wrote: > > + else if (strcmp (metacontext, "qemu:joint-allocation") == 0) { > > + /* Combo of base:allocation and stripped-down qemu:allocation-depth */ > > + const char *base, *depth; > > + switch (type & 3) { > > + case 0: base = "allocated"; break; > > + case 1: base = "hole"; break; > > + case 2: base = "zero"; break; > > + case 3: base = "hole,zero"; break; > > + } > > + switch (type & 0xc) { > > + case 0: depth = "unallocated"; break; > > Is this possible? qemu reports BDRV_BLOCK_DATA but not BDRV_BLOCK_ALLOCATED? No, qemu should never report a status of 0 (which in this code would produce the string "allocated,unallocated", although a v2 may change to print "<unexpected value>"). Remember, BDRV_BLOCK_ALLOCATED is a bit of a misnomer - it has nothing to do with whether a cluster occupies allocated space, but rather whether the local image in the backing chain provides the contents of the cluster (rather than deferring to the backing chain). The code in block/io.c guarantees that if a block device reports BDRV_BLOCK_DATA, then the block layer also reports BDRV_BLOCK_ALLOCATED (that is, any cluster that provides guest-visible data by necessity implies that the current layer of the backing chain is important). However, it DOES point out that "allocated" might not be the best name in libnbd; perhaps "data" or "normal" would be better for the NBD base:allocation status of 0. > > Anyway this seems like a valid way to present qemu response. > > > + case 4: depth = "local"; break; > > + case 8: depth = "backing"; break; > > + case 12: depth = "<unexpected depth>"; break; > > This should not be possible based on the qemu patch, but printing this > seems like a good solution, and can help to debug such an issue. > > Thinking about client code trying to copy extents based on the flags, > the client should abort the operation since qemu response is invalid. > -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC libnbd PATCH] info: Add support for new qemu:joint-allocation 2021-06-10 13:06 ` Eric Blake @ 2021-06-10 13:19 ` Nir Soffer 0 siblings, 0 replies; 19+ messages in thread From: Nir Soffer @ 2021-06-10 13:19 UTC (permalink / raw) To: Eric Blake Cc: qemu-block, Vladimir Sementsov-Ogievskiy, QEMU Developers, libguestfs On Thu, Jun 10, 2021 at 4:06 PM Eric Blake <eblake@redhat.com> wrote: > > On Thu, Jun 10, 2021 at 01:20:13AM +0300, Nir Soffer wrote: > > > + else if (strcmp (metacontext, "qemu:joint-allocation") == 0) { > > > + /* Combo of base:allocation and stripped-down qemu:allocation-depth */ > > > + const char *base, *depth; > > > + switch (type & 3) { > > > + case 0: base = "allocated"; break; > > > + case 1: base = "hole"; break; > > > + case 2: base = "zero"; break; > > > + case 3: base = "hole,zero"; break; > > > + } > > > + switch (type & 0xc) { > > > + case 0: depth = "unallocated"; break; > > > > Is this possible? qemu reports BDRV_BLOCK_DATA but not BDRV_BLOCK_ALLOCATED? > > No, qemu should never report a status of 0 (which in this code would > produce the string "allocated,unallocated", although a v2 may change > to print "<unexpected value>"). > > Remember, BDRV_BLOCK_ALLOCATED is a bit of a misnomer - it has nothing > to do with whether a cluster occupies allocated space, but rather > whether the local image in the backing chain provides the contents of > the cluster (rather than deferring to the backing chain). The code in > block/io.c guarantees that if a block device reports BDRV_BLOCK_DATA, > then the block layer also reports BDRV_BLOCK_ALLOCATED (that is, any > cluster that provides guest-visible data by necessity implies that the > current layer of the backing chain is important). > > However, it DOES point out that "allocated" might not be the best name > in libnbd; perhaps "data" or "normal" would be better for the NBD > base:allocation status of 0. Yes! it also aligns better with zero, and the output is similar to qemu-img map. Hopefully the semantics of "data" in qemu-img map and libnbd is the same. ^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2021-06-14 14:08 UTC | newest] Thread overview: 19+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2021-06-09 18:01 [RFC PATCH 0/2] New NBD metacontext Eric Blake 2021-06-09 18:01 ` [PATCH 1/2] iotests: Improve and rename test 309 to nbd-qemu-allocation Eric Blake 2021-06-10 12:14 ` Vladimir Sementsov-Ogievskiy 2021-06-09 18:01 ` [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context Eric Blake 2021-06-09 23:52 ` Nir Soffer 2021-06-10 12:30 ` Vladimir Sementsov-Ogievskiy 2021-06-10 13:47 ` Eric Blake 2021-06-10 14:10 ` Vladimir Sementsov-Ogievskiy 2021-06-10 13:16 ` Nir Soffer 2021-06-10 14:04 ` Eric Blake 2021-06-10 14:43 ` Vladimir Sementsov-Ogievskiy 2021-06-10 13:31 ` Eric Blake 2021-06-11 23:39 ` Nir Soffer 2021-06-14 13:56 ` Eric Blake 2021-06-14 14:06 ` Nir Soffer 2021-06-09 21:31 ` [RFC libnbd PATCH] info: Add support for new qemu:joint-allocation Eric Blake 2021-06-09 22:20 ` Nir Soffer 2021-06-10 13:06 ` Eric Blake 2021-06-10 13:19 ` Nir Soffer
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).