From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:34787) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YNKTR-0000zk-5h for qemu-devel@nongnu.org; Mon, 16 Feb 2015 07:08:29 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1YNKTM-0000n5-7F for qemu-devel@nongnu.org; Mon, 16 Feb 2015 07:08:25 -0500 Received: from mx2.parallels.com ([199.115.105.18]:46920) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YNKTL-0000HY-Nu for qemu-devel@nongnu.org; Mon, 16 Feb 2015 07:08:20 -0500 Message-ID: <54E1DD49.3050102@parallels.com> Date: Mon, 16 Feb 2015 15:06:33 +0300 From: Vladimir Sementsov-Ogievskiy MIME-Version: 1.0 References: <1422356197-5285-1-git-send-email-vsementsov@parallels.com> <1422356197-5285-9-git-send-email-vsementsov@parallels.com> <54DA793C.9020707@redhat.com> <54DDB398.3010907@parallels.com> <54DE5D0F.5080304@redhat.com> In-Reply-To: <54DE5D0F.5080304@redhat.com> Content-Type: text/plain; charset="windows-1252"; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [PATCH RFC v2 8/8] migration: add migration/dirty-bitmap.c List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: John Snow , qemu-devel@nongnu.org Cc: kwolf@redhat.com, Peter Maydell , "Juan quin >> Juan Jose Quintela Carreira" , "Dr. David Alan Gilbert" , stefanha@redhat.com, den@openvz.org, amit Shah , pbonzini@redhat.com On 13.02.2015 23:22, John Snow wrote: > > > On 02/13/2015 03:19 AM, Vladimir Sementsov-Ogievskiy wrote: >> On 11.02.2015 00:33, John Snow wrote: >>> Peter Maydell: What's the right way to license a file as copied from a >>> previous version? See below, please; >>> >>> Max, Markus: ctrl+f "bdrv_get_device_name" and let me know what you >>> think, if you would. >>> >>> Juan, Amit, David: Copying migration maintainers. >>> >>> On 01/27/2015 05:56 AM, Vladimir Sementsov-Ogievskiy wrote: >>>> Live migration of dirty bitmaps. Only named dirty bitmaps are >>>> migrated. >>>> If destination qemu is already containing a dirty bitmap with the same >>>> name as a migrated bitmap, then their granularities should be the >>>> same, >>>> otherwise the error will be generated. If destination qemu doesn't >>>> contain such bitmap it will be created. >>>> >>>> format: >>>> >>>> 1 byte: flags >>>> >>>> [ 1 byte: node name size ] \ flags & DEVICE_NAME >>>> [ n bytes: node name ] / >>>> >>>> [ 1 byte: bitmap name size ] \ >>>> [ n bytes: bitmap name ] | flags & BITMAP_NAME >>>> [ [ be64: granularity ] ] flags & GRANULARITY >>>> >>>> [ 1 byte: bitmap enabled bit ] flags & ENABLED >>>> >>>> [ be64: start sector ] \ flags & (NORMAL_CHUNK | ZERO_CHUNK) >>>> [ be32: number of sectors ] / >>>> >>>> [ be64: buffer size ] \ flags & NORMAL_CHUNK >>>> [ n bytes: buffer ] / >>>> >>>> The last chunk should contain flags & EOS. The chunk may skip device >>>> and/or bitmap names, assuming them to be the same with the previous >>>> chunk. GRANULARITY is sent with the first chunk for the bitmap. >>>> ENABLED >>>> bit is sent in the end of "complete" stage of migration. So when >>>> destination gets ENABLED flag it should deserialize_finish the bitmap >>>> and set its enabled bit to corresponding value. >>>> >>>> Signed-off-by: Vladimir Sementsov-Ogievskiy >>>> --- >>>> include/migration/block.h | 1 + >>>> migration/Makefile.objs | 2 +- >>>> migration/dirty-bitmap.c | 606 >>>> ++++++++++++++++++++++++++++++++++++++++++++++ >>>> vl.c | 1 + >>>> 4 files changed, 609 insertions(+), 1 deletion(-) >>>> create mode 100644 migration/dirty-bitmap.c >>>> >>>> diff --git a/include/migration/block.h b/include/migration/block.h >>>> index ffa8ac0..566bb9f 100644 >>>> --- a/include/migration/block.h >>>> +++ b/include/migration/block.h >>>> @@ -14,6 +14,7 @@ >>>> #ifndef BLOCK_MIGRATION_H >>>> #define BLOCK_MIGRATION_H >>>> >>>> +void dirty_bitmap_mig_init(void); >>>> void blk_mig_init(void); >>>> int blk_mig_active(void); >>>> uint64_t blk_mig_bytes_transferred(void); >>> >>> OK. >>> >>>> diff --git a/migration/Makefile.objs b/migration/Makefile.objs >>>> index d929e96..9adfda9 100644 >>>> --- a/migration/Makefile.objs >>>> +++ b/migration/Makefile.objs >>>> @@ -6,5 +6,5 @@ common-obj-y += xbzrle.o >>>> common-obj-$(CONFIG_RDMA) += rdma.o >>>> common-obj-$(CONFIG_POSIX) += exec.o unix.o fd.o >>>> >>>> -common-obj-y += block.o >>>> +common-obj-y += block.o dirty-bitmap.o >>>> >>> >>> OK. >>> >>>> diff --git a/migration/dirty-bitmap.c b/migration/dirty-bitmap.c >>>> new file mode 100644 >>>> index 0000000..8621218 >>>> --- /dev/null >>>> +++ b/migration/dirty-bitmap.c >>>> @@ -0,0 +1,606 @@ >>>> +/* >>>> + * QEMU dirty bitmap migration >>>> + * >>>> + * derived from migration/block.c >>>> + * >>>> + * Author: >>>> + * Sementsov-Ogievskiy Vladimir >>>> + * >>>> + * original copyright message: >>>> + * >>>> ===================================================================== >>>> + * Copyright IBM, Corp. 2009 >>>> + * >>>> + * Authors: >>>> + * Liran Schour >>>> + * >>>> + * This work is licensed under the terms of the GNU GPL, version >>>> 2. See >>>> + * the COPYING file in the top-level directory. >>>> + * >>>> + * Contributions after 2012-01-13 are licensed under the terms of the >>>> + * GNU GPL, version 2 or (at your option) any later version. >>>> + * >>>> ===================================================================== >>>> + */ >>>> + >>> >>> Not super familiar with the right way to do licensing here; it's >>> possible you may not need to copy the original here, but I'm not sure. >>> You will want to make it clear what license applies to /your/ work, I >>> think. Maybe Peter Maydell can clue us in. >>> >>>> +#include "block/block.h" >>>> +#include "qemu/main-loop.h" >>>> +#include "qemu/error-report.h" >>>> +#include "migration/block.h" >>>> +#include "migration/migration.h" >>>> +#include "qemu/hbitmap.h" >>>> +#include >>>> + >>>> +#define CHUNK_SIZE (1 << 20) >>>> + >>>> +#define DIRTY_BITMAP_MIG_FLAG_EOS 0x01 >>>> +#define DIRTY_BITMAP_MIG_FLAG_NORMAL_CHUNK 0x02 >>>> +#define DIRTY_BITMAP_MIG_FLAG_ZERO_CHUNK 0x04 >>>> +#define DIRTY_BITMAP_MIG_FLAG_BITMAP_NAME 0x08 >>>> +#define DIRTY_BITMAP_MIG_FLAG_DEVICE_NAME 0x10 >>>> +#define DIRTY_BITMAP_MIG_FLAG_GRANULARITY 0x20 >>>> +#define DIRTY_BITMAP_MIG_FLAG_ENABLED 0x40 >>>> +/* flags should be <= 0xff */ >>>> + >>> >>> We should give ourselves a little breathing room with the flags, since >>> we've only got room for one more. >> Ok. Will one more byte be enough? > > I should hope so. If you do add a completion chunk and flag, that > fills up the first byte completely, so having a second byte is a good > idea. > > I might recommend reserving the last bit of the second byte to be a > flag such as DIRTY_BITMAP_EXTRA_FLAGS that indicates the presence of > additional byte(s) of flags, to be determined later, if we ever need > them, but two bytes for now should be sufficient. Ok. > >>> >>>> +/* #define DEBUG_DIRTY_BITMAP_MIGRATION */ >>>> + >>>> +#ifdef DEBUG_DIRTY_BITMAP_MIGRATION >>>> +#define DPRINTF(fmt, ...) \ >>>> + do { printf("dirty_migration: " fmt, ## __VA_ARGS__); } while (0) >>>> +#else >>>> +#define DPRINTF(fmt, ...) \ >>>> + do { } while (0) >>>> +#endif >>>> + >>>> +typedef struct DirtyBitmapMigBitmapState { >>>> + /* Written during setup phase. */ >>>> + BlockDriverState *bs; >>>> + BdrvDirtyBitmap *bitmap; >>>> + HBitmap *dirty_bitmap; >>> >>> For my own sanity, I'd really prefer "bitmap" and "meta_bitmap" here; >>> "dirty_bitmap" is often used as a synonym (outside of this file) to >>> refer to the BdrvDirtyBitmap in general, so it's usage here can be >>> somewhat confusing. >>> >>> I'd appreciate "dirty_dirty_bitmap" as in your previous patch for >>> consistency, or "meta_bitmap" as I recommend. >>> >> Ok >>>> + int64_t total_sectors; >>>> + uint64_t sectors_per_chunk; >>>> + QSIMPLEQ_ENTRY(DirtyBitmapMigBitmapState) entry; >>>> + >>>> + /* For bulk phase. */ >>>> + bool bulk_completed; >>>> + int64_t cur_sector; >>>> + bool granularity_sent; >>>> + >>>> + /* For dirty phase. */ >>>> + int64_t cur_dirty; >>>> +} DirtyBitmapMigBitmapState; >>>> + >>>> +typedef struct DirtyBitmapMigState { >>>> + int migration_enable; >>>> + QSIMPLEQ_HEAD(dbms_list, DirtyBitmapMigBitmapState) dbms_list; >>>> + >>>> + bool bulk_completed; >>>> + >>>> + /* for send_bitmap() */ >>>> + BlockDriverState *prev_bs; >>>> + BdrvDirtyBitmap *prev_bitmap; >>>> +} DirtyBitmapMigState; >>>> + >>>> +static DirtyBitmapMigState dirty_bitmap_mig_state; >>>> + >>>> +/* read name from qemu file: >>>> + * format: >>>> + * 1 byte : len = name length (<256) >>>> + * len bytes : name without last zero byte >>>> + * >>>> + * name should point to the buffer >= 256 bytes length >>>> + */ >>>> +static char *qemu_get_name(QEMUFile *f, char *name) >>>> +{ >>>> + int len = qemu_get_byte(f); >>>> + qemu_get_buffer(f, (uint8_t *)name, len); >>>> + name[len] = '\0'; >>>> + >>>> + DPRINTF("get name: %d %s\n", len, name); >>>> + >>>> + return name; >>>> +} >>>> + >>> >>> OK. Maybe these could be "qemu_put_string" or "qemu_get_string" and >>> added to qemu-file.c so others can use them. >> If no objections for sharing this format, I'll do it. >>> >>>> +/* write name to qemu file: >>>> + * format: >>>> + * same as for qemu_get_name >>>> + * >>>> + * maximum name length is 255 >>>> + */ >>>> +static void qemu_put_name(QEMUFile *f, const char *name) >>>> +{ >>>> + int len = strlen(name); >>>> + >>>> + DPRINTF("put name: %d %s\n", len, name); >>>> + >>>> + assert(len < 256); >>>> + qemu_put_byte(f, len); >>>> + qemu_put_buffer(f, (const uint8_t *)name, len); >>>> +} >>>> + >>> >>> OK. >>> >>>> +static void send_bitmap(QEMUFile *f, DirtyBitmapMigBitmapState *dbms, >>>> + uint64_t start_sector, uint32_t nr_sectors) >>>> +{ >>>> + BlockDriverState *bs = dbms->bs; >>>> + BdrvDirtyBitmap *bitmap = dbms->bitmap; >>>> + uint8_t flags = 0; >>>> + /* align for buffer_is_zero() */ >>>> + uint64_t align = 4 * sizeof(long); >>>> + uint64_t buf_size = >>>> + (bdrv_dirty_bitmap_data_size(bitmap, nr_sectors) + align - >>>> 1) & >>>> + ~(align - 1); >>>> + uint8_t *buf = g_malloc0(buf_size); >>>> + >>>> + bdrv_dirty_bitmap_serialize_part(bitmap, buf, start_sector, >>>> nr_sectors); >>>> + >>>> + if (buffer_is_zero(buf, buf_size)) { >>>> + g_free(buf); >>>> + buf = NULL; >>>> + flags |= DIRTY_BITMAP_MIG_FLAG_ZERO_CHUNK; >>>> + } else { >>>> + flags |= DIRTY_BITMAP_MIG_FLAG_NORMAL_CHUNK; >>>> + } >>>> + >>>> + if (bs != dirty_bitmap_mig_state.prev_bs) { >>>> + dirty_bitmap_mig_state.prev_bs = bs; >>>> + flags |= DIRTY_BITMAP_MIG_FLAG_DEVICE_NAME; >>>> + } >>>> + >>>> + if (bitmap != dirty_bitmap_mig_state.prev_bitmap) { >>>> + dirty_bitmap_mig_state.prev_bitmap = bitmap; >>>> + flags |= DIRTY_BITMAP_MIG_FLAG_BITMAP_NAME; >>>> + } >>> >>> OK, so we use the current bs/bitmap under consideration to detect if >>> we have switched context, and put the names in the stream when it >>> happens. OK. >>> >>>> + >>>> + if (dbms->granularity_sent == 0) { >>>> + dbms->granularity_sent = 1; >>>> + flags |= DIRTY_BITMAP_MIG_FLAG_GRANULARITY; >>>> + } >>>> + >>>> + DPRINTF("Enter send_bitmap" >>>> + "\n flags: %x" >>>> + "\n start_sector: %" PRIu64 >>>> + "\n nr_sectors: %" PRIu32 >>>> + "\n data_size: %" PRIu64 "\n", >>>> + flags, start_sector, nr_sectors, buf_size); >>>> + >>>> + qemu_put_byte(f, flags); >>>> + >>>> + if (flags & DIRTY_BITMAP_MIG_FLAG_DEVICE_NAME) { >>>> + qemu_put_name(f, bdrv_get_device_name(bs)); >>>> + } >>> >>> I am still not fully clear on this myself, but I think we are phasing >>> out bdrv_get_device_name. In the context of bitmaps, we are mostly >>> likely using them one-per-tree, but they /can/ be attached >>> one-per-node, so we shouldn't be trying to get the device name here, >>> but rather, the node-name. >>> >>> I /think/ we may want to be using bdrv_get_node_name, but I think it >>> is not currently true that all nodes WILL be named ... I am CC'ing >>> Markus Armbruster and Max Reitz for (A) A refresher course and (B) >>> Opinions on what function call might make sense here, given that we >>> wish to migrate bitmaps attached to arbitrary nodes. >> Hmm.. I'm not familiar with hierarchy of nodes and devices. As I >> understand, both command_line- and qmp- created drives are created >> through blockdev_init, which creates both blk(device) and bs(node) >> through blk_new_with_bs.. Am I right? Also, bdrv_get_device_name is used >> in migration/block.c. > > Now that I'm more awake, here's a better rundown of what's going on: > > It's something that is a little bit in flux right now, unfortunately. > We're trying to transition to a format where we have arbitrarily > complex Block trees, where the root of the tree is always a > BlockBackend (See the big series by Max Reitz) and the configuration > of the tree may become arbitrarily complex. > > Simple trees may consist of just one BlockBackend and one > BlockDriverState node, where I think we can refer to this BDS as the > "root node," not to be confused with the BlockBackend "root." The > BlockBackend is a relatively new invention, so it isn't actually used > consistently everywhere yet. > > In the future, we may have commands that make distinctions based on if > you want to work on the BlockBackend, the root node under the > blockbackend associated with a BDS, only the explicit node/BDS you > identify, or some combination of the above semantics. > > As of right now, bitmaps can be *attached* to any arbitrary node, > though they are currently only *useful* when attached to the first > child of the BlockBackend, the root node. It's only useful currently > in cases where it is attached to the root because I've only proposed > patches for adding bitmap support to produce incrementals for Drive > Backup, which operates only on drives/devices (the root node of a tree.) > > However, in the context of migrating, it could be that we want to > migrate any bitmaps attached to /any/ nodes, so we should be careful > about what names we are pulling - we don't necessarily want the name > of the root node or BlockBackend, we may want the BDS and accompanying > name of strictly the node the bitmap is attached to. > > I know other areas of the code don't provide a good example for this > distinction, yet, but the block layer people are actively working on > fixing that. (See also the back-and-forth reviews for what to name my > QMP parameters in the incremental backup patches for some overview of > this semantic transition.) > > That said, We should think carefully about *which* name we want to put > in the stream and what implications it has for migration. > > > (1) bdrv_get_node_name and bdrv_find_node > > This would migrate bitmaps as attached to their specific BDS. This > would mean that the node layout on the destination is either > identical, or similar enough such that no named bitmaps are attached > to a node not present on the destination. > > This gives us precision: bitmaps may be attached lower in the tree and > can provide more fine-grained detail for which layers have been > changed or modified during runtime. > > This also gives us fragility: In cases where we transfer, say, a > complex tree of nodes and collapse it to a single destination drive, > we'd be unable to migrate bitmaps not attached to the root along with > it, because they'd have nowhere meaningful to attach. > > It is perhaps somewhat unneccessary at this exact moment in time, as > well, because bitmaps are currently only useful on root nodes. > > (2) bdrv_get_device_name and bdrv_lookup_bs(device_name, NULL, errp) > > This would migrate any bitmaps in a tree and attach them to the entire > drive on the destination. > > This is simpler: You just need to make sure that the root nodes have > the same names, which is a lot easier to manage. > > This matches how drive migration currently appears to work: The entire > tree appears to be generally squashed into a single node and > transferred cluster-by-cluster, without general consideration as to > the layout of the local block tree. As we both know by now, none of > the metadata is transferred, just the data. > > It prevents migration of just bitmaps where you WANT the extra > complexity: If a bitmap is attached lower in the tree, re-affixing it > to the root of a destination tree might invalidate the semantics of > what that bitmap was meant to track, and it may become useless. > > > So in summary: > using device names is probably fine for now, as it matches the current > use case of bitmaps as well as drive migration; but using node names > may give us more power and precision later. > > I talked to Max about it, and he is leaning towards using device names > for now and switching to node names if we decide we want that power. > > (...I wonder if we could use a flag, for now, that says we're > including DEVICE names. Later, we could add a flag that says we're > using NODE names and add an option to toggle as the usage case sees fit.) > > > Are you confused yet? :D O, thanks for the explanation). Are we really need this flag? As Markus wrote, nodes and devices are sharing namespaces.. We can use bdrv_lookup_bs(name, name, errp).. Also, we can, for example, send bitmaps as follows: if node has name - send bitmap with this name if node is root, but hasn't name - send it with blk name otherwise - don't send the bitmap > > >>> >>>> + >>>> + if (flags & DIRTY_BITMAP_MIG_FLAG_BITMAP_NAME) { >>>> + qemu_put_name(f, bdrv_dirty_bitmap_name(bitmap)); >>>> + >>>> + if (flags & DIRTY_BITMAP_MIG_FLAG_GRANULARITY) { >>>> + qemu_put_be64(f, bdrv_dirty_bitmap_granularity(bs, >>>> bitmap)); >>>> + } >>>> + } else { >>>> + assert(!(flags & DIRTY_BITMAP_MIG_FLAG_GRANULARITY)); >>>> + } >>>> + >>> >>> I thought we were only migrating bitmaps with names? >>> I suppose the conditional can't hurt, but I am not clear on when we >>> won't have a bitmap name here. >> You are right, 'else' case is not possible.. Hmm. I've added it to be >> sure that format is not corrupted, when I decided to put granularity >> only with name. Wi won't have a bitmap name only when we send the same >> bitmap as on the previous send_bitmap() call. May be it will be better >> to use two separate if's without else and assert. > > It's okay if it is just "paranoia," but I was just checking. It would > make a decent assert(). > >>> >>>> + qemu_put_be64(f, start_sector); >>>> + qemu_put_be32(f, nr_sectors); >>>> + >>>> + /* if a block is zero we need to flush here since the network >>>> + * bandwidth is now a lot higher than the storage device >>>> bandwidth. >>>> + * thus if we queue zero blocks we slow down the migration. >>>> + * also, skip writing block when migrate only dirty bitmaps. */ >>>> + if (flags & DIRTY_BITMAP_MIG_FLAG_ZERO_CHUNK) { >>>> + qemu_fflush(f); >>>> + return; >>>> + } >>>> + >>>> + qemu_put_be64(f, buf_size); >>>> + qemu_put_buffer(f, buf, buf_size); >>>> + g_free(buf); >>>> +} >>>> + >>>> + >>>> +/* Called with iothread lock taken. */ >>>> + >>>> +static void set_dirty_tracking(void) >>>> +{ >>>> + DirtyBitmapMigBitmapState *dbms; >>>> + >>>> + QSIMPLEQ_FOREACH(dbms, &dirty_bitmap_mig_state.dbms_list, >>>> entry) { >>>> + dbms->dirty_bitmap = >>>> + bdrv_create_dirty_dirty_bitmap(dbms->bitmap, CHUNK_SIZE); >>>> + } >>>> +} >>>> + >>> >>> OK: so we only have these dirty-dirty bitmaps when migration is >>> starting, which makes sense. >>> >>>> +static void unset_dirty_tracking(void) >>>> +{ >>>> + DirtyBitmapMigBitmapState *dbms; >>>> + >>>> + QSIMPLEQ_FOREACH(dbms, &dirty_bitmap_mig_state.dbms_list, >>>> entry) { >>>> + bdrv_release_dirty_dirty_bitmap(dbms->bitmap); >>>> + } >>>> +} >>>> + >>> >>> OK. >>> >>>> +static void init_dirty_bitmap_migration(QEMUFile *f) >>>> +{ >>>> + BlockDriverState *bs; >>>> + BdrvDirtyBitmap *bitmap; >>>> + DirtyBitmapMigBitmapState *dbms; >>>> + >>>> + dirty_bitmap_mig_state.bulk_completed = false; >>>> + dirty_bitmap_mig_state.prev_bs = NULL; >>>> + dirty_bitmap_mig_state.prev_bitmap = NULL; >>>> + >>>> + for (bs = bdrv_next(NULL); bs; bs = bdrv_next(bs)) { >>>> + for (bitmap = bdrv_next_dirty_bitmap(bs, NULL); bitmap; >>>> + bitmap = bdrv_next_dirty_bitmap(bs, bitmap)) { >>>> + if (!bdrv_dirty_bitmap_name(bitmap)) { >>>> + continue; >>>> + } >>>> + >>>> + dbms = g_new0(DirtyBitmapMigBitmapState, 1); >>>> + dbms->bs = bs; >>>> + dbms->bitmap = bitmap; >>>> + dbms->total_sectors = bdrv_nb_sectors(bs); >>>> + dbms->sectors_per_chunk = CHUNK_SIZE * 8 * >>>> + bdrv_dirty_bitmap_granularity(dbms->bs, dbms->bitmap) >>>> + >> BDRV_SECTOR_BITS; >>>> + >>>> + QSIMPLEQ_INSERT_TAIL(&dirty_bitmap_mig_state.dbms_list, >>>> + dbms, entry); >>>> + } >>>> + } >>>> +} >>>> + >>> >>> OK, but see the note below for dirty_bitmap_mig_init. >> actually it is not 'init' but 'reinit' - called on every migration >> start.. Hmm. dbms_list should be cleared here before fill it again. >>> >>>> +/* Called with no lock taken. */ >>>> +static void bulk_phase_send_chunk(QEMUFile *f, >>>> DirtyBitmapMigBitmapState *dbms) >>>> +{ >>>> + uint32_t nr_sectors = MIN(dbms->total_sectors - dbms->cur_sector, >>>> + dbms->sectors_per_chunk); >>> >>> What about cases where nr_sectors will put us past the end of the >>> bitmap? The bitmap serialization implementation might need a touchup >>> with this in mind. >> I don't understand.. nr_sectors <= dbms->total_sectors - >> dbms->cur_sector and it can't put us past the end... > > Oh, because you take the minimum, so we don't have to worry about > sectors_per_chunk eclipsing what we have. > > Nevermind, I can't read... :( > >>> >>>> + >>>> + send_bitmap(f, dbms, dbms->cur_sector, nr_sectors); >>>> + >>>> + dbms->cur_sector += nr_sectors; >>>> + if (dbms->cur_sector >= dbms->total_sectors) { >>>> + dbms->bulk_completed = true; >>>> + } >>>> +} >>>> + >>>> +/* Called with no lock taken. */ >>>> +static void bulk_phase(QEMUFile *f, bool limit) >>>> +{ >>>> + DirtyBitmapMigBitmapState *dbms; >>>> + >>>> + QSIMPLEQ_FOREACH(dbms, &dirty_bitmap_mig_state.dbms_list, >>>> entry) { >>>> + while (!dbms->bulk_completed) { >>>> + bulk_phase_send_chunk(f, dbms); >>>> + if (limit && qemu_file_rate_limit(f)) { >>>> + return; >>>> + } >>>> + } >>>> + } >>>> + >>>> + dirty_bitmap_mig_state.bulk_completed = true; >>>> +} >>> >>> OK. >>> >>>> + >>>> +static void blk_mig_reset_dirty_cursor(void) >>>> +{ >>>> + DirtyBitmapMigBitmapState *dbms; >>>> + >>>> + QSIMPLEQ_FOREACH(dbms, &dirty_bitmap_mig_state.dbms_list, >>>> entry) { >>>> + dbms->cur_dirty = 0; >>>> + } >>>> +} >>>> + >>> >>> OK. >>> >>>> +/* Called with iothread lock taken. */ >>>> +static void dirty_phase_send_chunk(QEMUFile *f, >>>> DirtyBitmapMigBitmapState *dbms) >>>> +{ >>>> + uint32_t nr_sectors; >>>> + >>>> + while (dbms->cur_dirty < dbms->total_sectors && >>>> + !hbitmap_get(dbms->dirty_bitmap, dbms->cur_dirty)) { >>>> + dbms->cur_dirty += dbms->sectors_per_chunk; >>>> + } >>> >>> OK, so we fast forward the dirty cursor while the meta-bitmap is >>> empty. Is it not worth using the HBitmapIterator here? You can reset >>> them everywhere you reset the dirty cursor, and then just fast-seek to >>> the first dirty sector. >> Yes, I've thought about it, just used simpler way (copied from >> migration/block.c) for an early version of the patch set. I will do it. > > Only if it doesn't make things more complicated to look at. > >>> >>>> + >>>> + if (dbms->cur_dirty >= dbms->total_sectors) { >>>> + return; >>>> + } >>>> + >>>> + nr_sectors = MIN(dbms->total_sectors - dbms->cur_dirty, >>>> + dbms->sectors_per_chunk); >>> >>> What happens if nr_sectors goes past the end? > > Again, I misread. > >>> >>>> + send_bitmap(f, dbms, dbms->cur_dirty, nr_sectors); >>>> + hbitmap_reset(dbms->dirty_bitmap, dbms->cur_dirty, >>>> dbms->sectors_per_chunk); >>>> + dbms->cur_dirty += nr_sectors; >>>> +} >>>> + >>>> +/* Called with iothread lock taken. >>>> + * >>>> + * return value: >>>> + * 0: too much data for max_downtime >>>> + * 1: few enough data for max_downtime >>>> +*/ >>> >>> dirty_phase below doesn't have a return value. >> rudimentary comment.. thanks. >>> >>>> +static void dirty_phase(QEMUFile *f, bool limit) >>>> +{ >>>> + DirtyBitmapMigBitmapState *dbms; >>>> + >>>> + QSIMPLEQ_FOREACH(dbms, &dirty_bitmap_mig_state.dbms_list, >>>> entry) { >>>> + while (dbms->cur_dirty < dbms->total_sectors) { >>>> + dirty_phase_send_chunk(f, dbms); >>>> + if (limit && qemu_file_rate_limit(f)) { >>>> + return; >>>> + } >>>> + } >>>> + } >>>> +} >>>> + >>> >>> OK. >>> >>>> + >>>> +/* Called with iothread lock taken. */ >>>> +static void dirty_bitmap_mig_cleanup(void) >>>> +{ >>>> + DirtyBitmapMigBitmapState *dbms; >>>> + >>>> + unset_dirty_tracking(); >>>> + >>>> + while ((dbms = >>>> QSIMPLEQ_FIRST(&dirty_bitmap_mig_state.dbms_list)) != NULL) { >>>> + QSIMPLEQ_REMOVE_HEAD(&dirty_bitmap_mig_state.dbms_list, entry); >>>> + g_free(dbms); >>>> + } >>>> +} >>>> + >>> >>> OK. >>> >>>> +static void dirty_bitmap_migration_cancel(void *opaque) >>>> +{ >>>> + dirty_bitmap_mig_cleanup(); >>>> +} >>>> + >>> >>> OK. >>> >>>> +static int dirty_bitmap_save_iterate(QEMUFile *f, void *opaque) >>>> +{ >>>> + DPRINTF("Enter save live iterate\n"); >>>> + >>>> + blk_mig_reset_dirty_cursor(); >>> >>> I suppose this is because it's easier to check if we are finished by >>> starting from sector 0 every time. >>> >>> A harder, but faster method may be: Use HBitmapIterators, but don't >>> reset them every iteration: just iterate until the end, and check that >>> the bitmap is empty. If the meta bitmap is empty, the dirty phase is >>> complete. If the meta bitmap is NOT empty, reset the HBI and continue >>> allowing iterations over the dirty phase. >> Ok, will do. >>> >>>> + >>>> + if (dirty_bitmap_mig_state.bulk_completed) { >>>> + qemu_mutex_lock_iothread(); >>>> + dirty_phase(f, true); >>>> + qemu_mutex_unlock_iothread(); >>>> + } else { >>>> + bulk_phase(f, true); >>>> + } >>>> + >>>> + qemu_put_byte(f, DIRTY_BITMAP_MIG_FLAG_EOS); >>>> + >>>> + return dirty_bitmap_mig_state.bulk_completed; >>>> +} >>>> + >>>> +/* Called with iothread lock taken. */ >>>> + >>>> +static int dirty_bitmap_save_complete(QEMUFile *f, void *opaque) >>>> +{ >>>> + DirtyBitmapMigBitmapState *dbms; >>>> + DPRINTF("Enter save live complete\n"); >>>> + >>>> + if (!dirty_bitmap_mig_state.bulk_completed) { >>>> + bulk_phase(f, false); >>>> + } >>> >>> [Not expertly familiar with savevm:] Under what conditions can this >>> happen? >> This can happen. save_complete will happen when savevm decide that >> pending data size to send is small enough. It was the case for my bugfix >> for migration/block.c about pending. To prevent save_complete when bulk >> phase isn't completed, save_pending returns (in my bugfix for >> migration/block.c) big value. Here I decided to make more honest >> save_pending, so I need to complete (if it doesn't) bulk phase in >> save_complete. > > OK, Gotcha. > >>> >>>> + >>>> + blk_mig_reset_dirty_cursor(); >>>> + dirty_phase(f, false); >>>> + >>>> + QSIMPLEQ_FOREACH(dbms, &dirty_bitmap_mig_state.dbms_list, >>>> entry) { >>>> + uint8_t flags = DIRTY_BITMAP_MIG_FLAG_DEVICE_NAME | >>>> + DIRTY_BITMAP_MIG_FLAG_BITMAP_NAME | >>>> + DIRTY_BITMAP_MIG_FLAG_ENABLED; >>>> + >>>> + qemu_put_byte(f, flags); >>>> + qemu_put_name(f, bdrv_get_device_name(dbms->bs)); >>>> + qemu_put_name(f, bdrv_dirty_bitmap_name(dbms->bitmap)); >>>> + qemu_put_byte(f, bdrv_dirty_bitmap_enabled(dbms->bitmap)); >>>> + } >>>> + >>>> + qemu_put_byte(f, DIRTY_BITMAP_MIG_FLAG_EOS); >>>> + >>>> + DPRINTF("Dirty bitmaps migration completed\n"); >>>> + >>>> + dirty_bitmap_mig_cleanup(); >>>> + return 0; >>>> +} >>>> + >>> >>> I suppose we don't need a flag that distinctly SAYS this is the end >>> section, since we can tell by omission of >>> DIRTY_BITMAP_MIG_FLAG_NORMAL_CHUNK or ZERO_CHUNK. >> Hmm. I think it simplifies the logic (to use EOS after each section). >> And the same approach is in migration/block.c.. It's a question about >> which format is better: "Each section for dirty_bitmap_load ends with >> EOS" or "Each section for dirty_bitmap_load ends with EOS except the >> last one. The last one may be recognized by absent NORMAL_CHUNK and >> ZERO_CHUNK" >>> >>>> +static uint64_t dirty_bitmap_save_pending(QEMUFile *f, void *opaque, >>>> + uint64_t max_size) >>>> +{ >>>> + DirtyBitmapMigBitmapState *dbms; >>>> + uint64_t pending = 0; >>>> + >>>> + qemu_mutex_lock_iothread(); >>>> + >>>> + QSIMPLEQ_FOREACH(dbms, &dirty_bitmap_mig_state.dbms_list, >>>> entry) { >>>> + uint64_t sectors = hbitmap_count(dbms->dirty_bitmap); >>>> + if (!dbms->bulk_completed) { >>>> + sectors += dbms->total_sectors - dbms->cur_sector; >>>> + } >>>> + pending += bdrv_dirty_bitmap_data_size(dbms->bitmap, >>>> sectors); >>>> + } >>>> + >>>> + qemu_mutex_unlock_iothread(); >>>> + >>>> + DPRINTF("Enter save live pending %" PRIu64 ", max: %" PRIu64 >>>> "\n", >>>> + pending, max_size); >>>> + return pending; >>>> +} >>>> + >>> >>> OK. >>> >>>> +static int dirty_bitmap_load(QEMUFile *f, void *opaque, int >>>> version_id) >>>> +{ >>>> + int flags; >>>> + >>>> + static char device_name[256], bitmap_name[256]; >>>> + static BlockDriverState *bs; >>>> + static BdrvDirtyBitmap *bitmap; >>>> + >>>> + uint8_t *buf; >>>> + uint64_t first_sector; >>>> + uint32_t nr_sectors; >>>> + int ret; >>>> + >>>> + DPRINTF("load start\n"); >>>> + >>>> + do { >>>> + flags = qemu_get_byte(f); >>>> + DPRINTF("flags: %x\n", flags); >>>> + >>>> + if (flags & DIRTY_BITMAP_MIG_FLAG_DEVICE_NAME) { >>>> + qemu_get_name(f, device_name); >>>> + bs = bdrv_find(device_name); >>> >>> Similar to the above confusion, you may want bdrv_lookup_bs or >>> similar, since we're going to be looking for BDS nodes instead of >>> "devices." >> In this case, should it be changed in migration/block.c too? > > [See discussion above!] > >>> >>>> + if (!bs) { >>>> + fprintf(stderr, "Error: unknown block device '%s'\n", >>>> + device_name); >>>> + return -EINVAL; >>>> + } >>>> + } >>>> + >>>> + if (flags & DIRTY_BITMAP_MIG_FLAG_BITMAP_NAME) { >>>> + if (!bs) { >>>> + fprintf(stderr, "Error: block device name is not >>>> set\n"); >>>> + return -EINVAL; >>>> + } >>>> + >>>> + qemu_get_name(f, bitmap_name); >>>> + bitmap = bdrv_find_dirty_bitmap(bs, bitmap_name); >>>> + if (flags & DIRTY_BITMAP_MIG_FLAG_GRANULARITY) { >>>> + /* First chunk from this bitmap */ >>>> + uint64_t granularity = qemu_get_be64(f); >>>> + if (!bitmap) { >>>> + Error *local_err = NULL; >>>> + bitmap = bdrv_create_dirty_bitmap(bs, >>>> granularity, >>>> + bitmap_name, >>>> + &local_err); >>>> + if (!bitmap) { >>>> + error_report("%s", >>>> error_get_pretty(local_err)); >>>> + error_free(local_err); >>>> + return -EINVAL; >>>> + } >>>> + } else { >>>> + uint64_t dest_granularity = >>>> + bdrv_dirty_bitmap_granularity(bs, bitmap); >>>> + if (dest_granularity != granularity) { >>>> + fprintf(stderr, >>>> + "Error: " >>>> + "Migrated bitmap granularity (%" >>>> PRIu64 ") " >>>> + "is not match with destination >>>> bitmap '%s' " >>>> + "granularity (%" PRIu64 ")\n", >>>> + granularity, >>>> + bitmap_name, >>>> + dest_granularity); >>>> + return -EINVAL; >>>> + } >>>> + } >>>> + bdrv_disable_dirty_bitmap(bitmap); >>>> + } >>>> + if (!bitmap) { >>>> + fprintf(stderr, "Error: unknown dirty bitmap " >>>> + "'%s' for block device '%s'\n", >>>> + bitmap_name, device_name); >>>> + return -EINVAL; >>>> + } >>>> + } >>>> + >>>> + if (flags & DIRTY_BITMAP_MIG_FLAG_ENABLED) { >>>> + bool enabled; >>>> + if (!bitmap) { >>>> + fprintf(stderr, "Error: dirty bitmap name is not >>>> set\n"); >>>> + return -EINVAL; >>>> + } >>>> + bdrv_dirty_bitmap_deserialize_finish(bitmap); >>>> + /* complete migration */ >>>> + enabled = qemu_get_byte(f); >>>> + if (enabled) { >>>> + bdrv_enable_dirty_bitmap(bitmap); >>>> + } >>>> + } >>> >>> Oh, so you use the ENABLED flag to show that migration is over. >> Yes, it was bad idea.. >>> If we are going to commit to a stream format for bitmaps, though, >>> maybe it's best to actually create a "COMPLETION BLOCK" flag and then >>> split this function into two pieces: >>> >>> (1) The part that receives regular / zero blocks, and >>> (2) The part that receives completion data. >>> >>> That way, if we change the properties that bitmaps have down the line, >>> we aren't reliant on literally the "enabled" flag to decide what to do. >>> >>> Also, it might help make this fairly long function a little smaller >>> and more readable. >> Ok. >>> >>>> + >>>> + if (flags & (DIRTY_BITMAP_MIG_FLAG_NORMAL_CHUNK | >>>> + DIRTY_BITMAP_MIG_FLAG_ZERO_CHUNK)) { >>>> + if (!bs) { >>>> + fprintf(stderr, "Error: block device name is not >>>> set\n"); >>>> + return -EINVAL; >>>> + } >>>> + if (!bitmap) { >>>> + fprintf(stderr, "Error: dirty bitmap name is not >>>> set\n"); >>>> + return -EINVAL; >>>> + } >>>> + >>>> + first_sector = qemu_get_be64(f); >>>> + nr_sectors = qemu_get_be32(f); >>>> + DPRINTF("chunk: %lu %u\n", first_sector, nr_sectors); >>>> + >>>> + >>>> + if (flags & DIRTY_BITMAP_MIG_FLAG_ZERO_CHUNK) { >>>> + bdrv_dirty_bitmap_deserialize_part0(bitmap, >>>> first_sector, >>>> + nr_sectors); >>>> + } else { >>>> + uint64_t buf_size = qemu_get_be64(f); >>>> + uint64_t needed_size = >>>> + bdrv_dirty_bitmap_data_size(bitmap, nr_sectors); >>>> + >>>> + if (needed_size > buf_size) { >>>> + fprintf(stderr, >>>> + "Error: Migrated bitmap granularity is >>>> not " >>>> + "match with destination bitmap >>>> granularity\n"); >>>> + return -EINVAL; >>>> + } >>>> + >>> >>> "Migrated bitmap granularity doesn't match the destination bitmap >>> granularity" perhaps. >>> >>>> + buf = g_malloc(buf_size); >>>> + qemu_get_buffer(f, buf, buf_size); >>>> + bdrv_dirty_bitmap_deserialize_part(bitmap, buf, >>>> + first_sector, >>>> + nr_sectors); >>>> + g_free(buf); >>>> + } >>>> + } >>>> + >>>> + ret = qemu_file_get_error(f); >>>> + if (ret != 0) { >>>> + return ret; >>>> + } >>>> + } while (!(flags & DIRTY_BITMAP_MIG_FLAG_EOS)); >>>> + >>>> + DPRINTF("load finish\n"); >>>> + return 0; >>>> +} >>>> + >>>> +static void dirty_bitmap_set_params(const MigrationParams *params, >>>> void *opaque) >>>> +{ >>>> + dirty_bitmap_mig_state.migration_enable = params->dirty; >>>> +} >>>> + >>> >>> OK; though I am not immediately aware of what changes need to happen >>> to accommodate Eric's suggestions. >> This function will be dropped in v3. >>> >>>> +static bool dirty_bitmap_is_active(void *opaque) >>>> +{ >>>> + return dirty_bitmap_mig_state.migration_enable == 1; >>>> +} >>>> + >>> >>> OK. >>> >>>> +static int dirty_bitmap_save_setup(QEMUFile *f, void *opaque) >>>> +{ >>>> + init_dirty_bitmap_migration(f); >>>> + >>>> + qemu_mutex_lock_iothread(); >>>> + /* start track dirtyness of dirty bitmaps */ >>>> + set_dirty_tracking(); >>>> + qemu_mutex_unlock_iothread(); >>>> + >>>> + blk_mig_reset_dirty_cursor(); >>>> + qemu_put_byte(f, DIRTY_BITMAP_MIG_FLAG_EOS); >>>> + >>>> + return 0; >>>> +} >>>> + >>> >>> OK; see dirty_bitmap_mig_init below, though. >>> >>>> +static SaveVMHandlers savevm_block_handlers = { >>>> + .set_params = dirty_bitmap_set_params, >>>> + .save_live_setup = dirty_bitmap_save_setup, >>>> + .save_live_iterate = dirty_bitmap_save_iterate, >>>> + .save_live_complete = dirty_bitmap_save_complete, >>>> + .save_live_pending = dirty_bitmap_save_pending, >>>> + .load_state = dirty_bitmap_load, >>>> + .cancel = dirty_bitmap_migration_cancel, >>>> + .is_active = dirty_bitmap_is_active, >>>> +}; >>>> + >>>> +void dirty_bitmap_mig_init(void) >>>> +{ >>>> + QSIMPLEQ_INIT(&dirty_bitmap_mig_state.dbms_list); >>> >>> Maybe I haven't looked thoroughly enough yet, but it's weird that part >>> of the dirty_bitmap_mig_state is initialized here, and the rest of it >>> in init_dirty_bitmap_migration. I'd prefer to keep it all together, if >>> possible. >> dirty_bitmap_mig_init is called one time when qemu starts. QSIMPLEQ_INIT >> should be called once. dirty_bitmap_save_setup is called on every >> migration start, it's like 'reinitialize'. >>> >>>> + >>>> + register_savevm_live(NULL, "dirty-bitmap", 0, 1, >>>> &savevm_block_handlers, >>>> + &dirty_bitmap_mig_state); >>>> +} >>> >>> OK. >>> >>>> diff --git a/vl.c b/vl.c >>>> index a824a7d..dee7220 100644 >>>> --- a/vl.c >>>> +++ b/vl.c >>>> @@ -4184,6 +4184,7 @@ int main(int argc, char **argv, char **envp) >>>> >>>> blk_mig_init(); >>>> ram_mig_init(); >>>> + dirty_bitmap_mig_init(); >>>> >>>> /* If the currently selected machine wishes to override the >>>> units-per-bus >>>> * property of its default HBA interface type, do so now. */ >>>> >>> >>> Hm, since dirty bitmaps are a sub-component of the block layer, would >>> it not make sense to put this hook under blk_mig_init, perhaps? >> IMHO the reason to put it here is to keep all >> register_savevm_live-entities in one place. > > If you still feel that way I won't withhold my R-b, but there are > already other cases such as ppc_spapr_init which are not in this > general area of vl.c. > > Plus the dozens of devices that use register_savevm as a wrapper to > register_savevm_live, so maybe consolidating calls to this function > isn't that important. Hm, I've missed it, ppc_spapr_init is not here, yes.. Another thing here: dirty bitmaps migration are separate from blk migration. And it may be used without blk migration (nbd+mirrow migration may be used).. Is it ok to connect dirty bitmaps migration to blk_mig_init, which is located in migration/block.c, which may not be used at all when we bitmaps are migrated using migration/dirty-bitmap.c? In other words, yes, dirty bitmaps are a sub-component of the block layer, but dirty bitmap migration is not a sub-component of blk migration. > >>> >>> >>> Overall this looks very clean compared to the intermingled format in >>> V1, and the code is organized pretty well. Just a few minor comments, >>> and I'd like to get the opinion of the migration maintainers, but I am >>> happy. Sorry it took me so long to review, please feel free to let me >>> know if you disagree with any of my opinions :) >>> >>> Thank you, >>> --John >> >> Thank you for reviewing my series) >> > > Yup. Hopefully I didn't miss too much that will irritate the Migration > overlords. > > Once you respin on top of v12, I can run some thorough migration tests > on it (perhaps over a weekend) and verify that it survives a couple > hundred migrations without any kind of integrity loss. I hope I'll do it with all other things in about two days. > > This is what makes sense to me right now, anyway. > > Do you think you'll be including the bitmap checksum in the > BlockDirtyInfo command? That'd be convenient for iotests. Ok, will do. Good idea. Only two points: 1) Is it ok to include debug info into BlockDirtyInfo? Will users be happy with it? 2) When I was debugging my code, the information about dirty-regions was very useful. Now, all is working and checksums are enough for regression control. > > Thank you, > --John. -- Best regards, Vladimir