From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <54BE8F77.2030706@redhat.com>
Date: Tue, 20 Jan 2015 12:25:11 -0500
From: John Snow
MIME-Version: 1.0
References: <1418307457-25996-1-git-send-email-vsementsov@parallels.com> <1418307457-25996-10-git-send-email-vsementsov@parallels.com> <54AEFF44.5010805@redhat.com> <54BA9926.7010509@parallels.com>
In-Reply-To: <54BA9926.7010509@parallels.com>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH 9/9] block-migration: add named dirty bitmaps migration
To: Vladimir Sementsov-Ogievskiy, qemu-devel@nongnu.org
Cc: kwolf@redhat.com, Juan Jose Quintela Carreira, Dr. David Alan Gilbert, stefanha@redhat.com, Amit Shah, den@openvz.org

On 01/17/2015 12:17 PM, Vladimir Sementsov-Ogievskiy wrote: > > Best regards, > Vladimir > > On 09.01.2015 01:05, John Snow wrote: >> CCing migration maintainers, feedback otherwise in-line. >> >> On 12/11/2014 09:17 AM, Vladimir Sementsov-Ogievskiy wrote: >>> Just migrate parts of dirty bitmaps, corresponding to the block being >>> migrated. Also, skip block migration if it is disabled (blk parameter >>> of migrate command is false). >>> >>> Skipping shared sectors: bitmaps are migrated independently of this, >>> just send blk without block data.
>>> >>> Signed-off-by: Vladimir Sementsov-Ogievskiy >>> --- >> >> In terms of general approach, migrating the dirty bitmap alongside the >> blocks it describes makes sense when migrating both. >> >> Is this a lot of overhead when that's not the case, though? If we >> utilize the "bitmap only" pathways added here, don't we iterate >> through the savevm handlers a lot to only transfer very little data >> per section? >> >> If we really need migration of bitmaps apart from the data they >> describe, is it not worth developing a more condensed transfer >> mechanism to get more bitmap data per iteration, instead of just one >> block's worth? > The first stage of migration can be easily optimized in this case - just > take larger bitmap pieces. For the second stage, it's not as simple. > If we take larger pieces at each step, we will just increase the dirty > data transfer. One approach is to maintain a "dirty-region" - two numbers > representing the region of set bits in the migration dirty bitmap - and send > data only when this region is large enough. > >> >>> block-migration.c | 173 >>> +++++++++++++++++++++++++++++++++++++++++++++++------- >>> savevm.c | 1 + >>> 2 files changed, 154 insertions(+), 20 deletions(-) >>> >>> diff --git a/block-migration.c b/block-migration.c >>> index 908a66d..95d54a1 100644 >>> --- a/block-migration.c >>> +++ b/block-migration.c >>> @@ -32,6 +32,8 @@ >>> #define BLK_MIG_FLAG_EOS 0x02 >>> #define BLK_MIG_FLAG_PROGRESS 0x04 >>> #define BLK_MIG_FLAG_ZERO_BLOCK 0x08 >>> +#define BLK_MIG_FLAG_HAS_BLOCK 0x10 >>> +#define BLK_MIG_FLAG_HAS_BITMAPS 0x20 >>> >> >> OK: As a result of allowing bitmaps to be migrated without the block >> data itself, we now must acknowledge the concept that we can send >> either block data, bitmaps, both, or neither -- so new defines are >> added here to indicate what data can be found in the section following.
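To spell out the combinations those defines enable, the flag computation in the patched blk_send() can be modeled standalone like this (a sketch for illustration only; the real function also serializes the section itself):

```c
#include <stdbool.h>
#include <stdint.h>

#define BLK_MIG_FLAG_DEVICE_BLOCK 0x01
#define BLK_MIG_FLAG_ZERO_BLOCK   0x08
#define BLK_MIG_FLAG_HAS_BLOCK    0x10
#define BLK_MIG_FLAG_HAS_BITMAPS  0x20

/* Standalone model of the patched blk_send() flag logic: a section may
 * now carry block data, a zero-block marker, bitmap data, or a
 * combination of block and bitmap payloads. */
static uint64_t blk_send_flags(bool have_buf, bool buf_is_zero,
                               bool zero_blocks_enabled,
                               bool bitmaps_enabled)
{
    uint64_t flags = BLK_MIG_FLAG_DEVICE_BLOCK;

    if (zero_blocks_enabled && have_buf && buf_is_zero) {
        flags |= BLK_MIG_FLAG_ZERO_BLOCK;   /* data known to be all zeroes */
    } else if (have_buf) {
        flags |= BLK_MIG_FLAG_HAS_BLOCK;    /* real block payload follows */
    }
    if (bitmaps_enabled) {
        flags |= BLK_MIG_FLAG_HAS_BITMAPS;  /* bitmap payload follows */
    }
    return flags;
}
```

Note that with blk->buf == NULL (a bitmap-only section), neither ZERO_BLOCK nor HAS_BLOCK can be set, which is what the later "!(flags & BLK_MIG_FLAG_HAS_BLOCK)" early-flush test in the patch relies on.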
>> >>> #define MAX_IS_ALLOCATED_SEARCH 65536 >>> >>> @@ -51,6 +53,8 @@ typedef struct BlkMigDevState { >>> int shared_base; >>> int64_t total_sectors; >>> QSIMPLEQ_ENTRY(BlkMigDevState) entry; >>> + int nr_bitmaps; >>> + BdrvDirtyBitmap **dirty_bitmaps; >>> >>> /* Only used by migration thread. Does not need a lock. */ >>> int bulk_completed; >>> @@ -64,6 +68,11 @@ typedef struct BlkMigDevState { >>> Error *blocker; >>> } BlkMigDevState; >>> >>> +typedef struct BlkMigDirtyBitmap { >>> + BdrvDirtyBitmap *bitmap; >>> + uint8_t *buf; >>> +} BlkMigDirtyBitmap; >>> + >>> typedef struct BlkMigBlock { >>> /* Only used by migration thread. */ >>> uint8_t *buf; >>> @@ -74,6 +83,9 @@ typedef struct BlkMigBlock { >>> QEMUIOVector qiov; >>> BlockAIOCB *aiocb; >>> >>> + int nr_bitmaps; >>> + BlkMigDirtyBitmap *dirty_bitmaps; >>> + >>> /* Protected by block migration lock. */ >>> int ret; >>> QSIMPLEQ_ENTRY(BlkMigBlock) entry; >>> @@ -83,6 +95,7 @@ typedef struct BlkMigState { >>> /* Written during setup phase. Can be read without a lock. */ >>> int blk_enable; >>> int shared_base; >>> + int dbm_enable; >> >> Similar to the feedback in a previous patch, we may not want to use >> 'dbm' to mean "Dirty Bitmap" because we have not been applying the >> abbreviation consistently. >> >> For now, the recommendation from stefan is to use the full >> "bdrv_dirty_bitmap" or "dirty_bitmap" in function names. >> >> If we do want an acronym to refer to this particular type of dirty >> bitmap, we should apply it consistently to all functions that work >> with the BdrvDirtyBitmap type. >> >> For now, "bdrv_dirty_bitmap_enable" should suffice, even though it's a >> bit wordy. > It's a flag that enables the migration of dirty bitmaps, like > blk_enable enables the migration of blocks. Then, should it be > "dirty_bitmap_migration_enable"? Isn't it too long for a simple flag > variable? Yeah, I see your point.
I think the only difference is that "blk" is used very widely throughout the code, but this is the first occurrence of "dbm." Even just "bitmap_enable" would be sufficiently more self-evident than "dbm_enable," unless we rework all of the BdrvDirtyBitmap functions to use the "dbm" abbreviation. I'd also be happy with "migrate_bitmaps", which is fairly self-explanatory, too. >> >>> QSIMPLEQ_HEAD(bmds_list, BlkMigDevState) bmds_list; >>> int64_t total_sector_sum; >>> bool zero_blocks; >>> @@ -116,27 +129,63 @@ static void blk_mig_unlock(void) >>> /* Only allocating and initializing structure fields, not copying >>> any data. */ >>> >>> static BlkMigBlock *blk_create(BlkMigDevState *bmds, int64_t sector, >>> - int nr_sectors) >>> + int nr_sectors, bool only_bitmaps) >>> { >>> + int i; >>> BlkMigBlock *blk = g_new(BlkMigBlock, 1); >>> - blk->buf = g_malloc(BLOCK_SIZE); >>> + blk->buf = only_bitmaps ? NULL : g_malloc(BLOCK_SIZE); >>> blk->bmds = bmds; >>> blk->sector = sector; >>> blk->nr_sectors = nr_sectors; >>> + blk->dirty_bitmaps = NULL; >>> + blk->nr_bitmaps = 0; >>> >>> blk->iov.iov_base = blk->buf; >>> blk->iov.iov_len = nr_sectors * BDRV_SECTOR_SIZE; >>> qemu_iovec_init_external(&blk->qiov, &blk->iov, 1); >>> >> >> We can probably skip this block if only_bitmaps is true. > I agree. >> >>> + if (block_mig_state.dbm_enable) { >>> + BlkMigDirtyBitmap *mig_bm; >>> + >>> + blk->nr_bitmaps = bmds->nr_bitmaps; >>> + mig_bm = blk->dirty_bitmaps = g_new(BlkMigDirtyBitmap, >>> + bmds->nr_bitmaps); >>> + for (i = 0; i < bmds->nr_bitmaps; ++i) { >>> + BdrvDirtyBitmap *bm = bmds->dirty_bitmaps[i]; >>> + mig_bm->buf = g_malloc(bdrv_dbm_data_size(bm, nr_sectors)); >>> + mig_bm->bitmap = bm; >>> + mig_bm++; >>> + } >>> + } >>> + >> >> It's strange to me that we'd give it an "only bitmaps" boolean and >> then rely on a different condition to populate them. Is it not worthy >> of an error if we specify only_bitmaps when block_mig_state.dbm_enable >> is false?
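(Aside: bdrv_dbm_data_size() is not shown in this patch, but the buffer sizing it implies presumably amounts to the arithmetic below - one bit per granularity-sized chunk of the guest range, rounded up to whole bytes. The name and exact rounding here are my guesses for illustration, not QEMU code.)

```c
#include <stdint.h>

#define BDRV_SECTOR_BITS 9   /* 512-byte sectors */

/* Guessed model of a bdrv_dbm_data_size()-style helper: the number of
 * bytes needed to hold the bitmap bits covering nr_sectors, where one
 * bit covers (granularity >> BDRV_SECTOR_BITS) sectors.  Both
 * round-ups matter when nr_sectors is not a multiple of the per-bit
 * sector count. */
static int64_t dbm_data_size(int64_t nr_sectors, uint32_t granularity)
{
    int64_t sectors_per_bit = granularity >> BDRV_SECTOR_BITS;
    int64_t bits = (nr_sectors + sectors_per_bit - 1) / sectors_per_bit;

    return (bits + 7) / 8;   /* round up to whole bytes */
}
```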
> only_bitmaps means: no blocks but only bitmaps. But there is another > case: blocks with bitmaps, and in this case this "if ()" should execute. > > {only_bitmaps = true} when {block_mig_state.dbm_enable = false} - it's > of course an error. I'll add assert() for this. >> >>> return blk; >>> } >>> >>> static void blk_free(BlkMigBlock *blk) >>> { >>> + int i; >>> g_free(blk->buf); >>> + >>> + if (blk->dirty_bitmaps) { >>> + for (i = 0; i < blk->nr_bitmaps; ++i) { >>> + g_free(blk->dirty_bitmaps[i].buf); >>> + } >>> + g_free(blk->dirty_bitmaps); >>> + } >>> + >>> g_free(blk); >>> } >>> >>> +static void blk_store_bitmaps(BlkMigBlock *blk) >>> +{ >>> + int i; >>> + for (i = 0; i < blk->nr_bitmaps; ++i) { >>> + bdrv_dbm_store_data(blk->dirty_bitmaps[i].bitmap, >>> + blk->dirty_bitmaps[i].buf, >>> + blk->sector, blk->nr_sectors); >>> + } >>> +} >>> + >> >> I am worried about the case in which blk->nr_sectors is not an >> integral multiple of the bitmap sector granularity, for reasons >> discussed in the previous patches. >> >> I'm not positive it IS a problem either, but maybe you have already >> thought about this. > Only allocated memory should be extended in the previous patch, as I've > already proposed. As a general approach it's ok, because we always > calculate the real number of bits in the bitmap from the virtual one in > the same way, therefore the stored region will be restored in the same place. OK, just checking. >> >>> /* Must run outside of the iothread lock during the bulk phase, >>> * or the VM will stall.
>>> */ >>> @@ -146,9 +195,15 @@ static void blk_send(QEMUFile *f, BlkMigBlock * >>> blk) >>> int len; >>> uint64_t flags = BLK_MIG_FLAG_DEVICE_BLOCK; >>> >>> - if (block_mig_state.zero_blocks && >>> + if (block_mig_state.zero_blocks && blk->buf && >>> buffer_is_zero(blk->buf, BLOCK_SIZE)) { >>> flags |= BLK_MIG_FLAG_ZERO_BLOCK; >>> + } else if (blk->buf) { >>> + flags |= BLK_MIG_FLAG_HAS_BLOCK; >>> + } >>> + >>> + if (block_mig_state.dbm_enable) { >>> + flags |= BLK_MIG_FLAG_HAS_BITMAPS; >>> } >>> >>> /* sector number and flags */ >>> @@ -160,10 +215,27 @@ static void blk_send(QEMUFile *f, BlkMigBlock * >>> blk) >>> qemu_put_byte(f, len); >>> qemu_put_buffer(f, (uint8_t >>> *)bdrv_get_device_name(blk->bmds->bs), len); >>> >>> + /* dirty bitmaps */ >>> + if (flags & BLK_MIG_FLAG_HAS_BITMAPS) { >>> + int i; >>> + qemu_put_be32(f, blk->nr_bitmaps); >>> + for (i = 0; i < blk->nr_bitmaps; ++i) { >>> + BdrvDirtyBitmap *bm = blk->dirty_bitmaps[i].bitmap; >>> + int buf_size = bdrv_dbm_data_size(bm, blk->nr_sectors); >>> + const char *name = bdrv_dirty_bitmap_name(bm); >>> + int len = strlen(name); >>> + >>> + qemu_put_byte(f, len); >>> + qemu_put_buffer(f, (const uint8_t *)name, len); >> >> No length in the stream for the size of the dirty bitmap buffer? >> >>> + qemu_put_buffer(f, blk->dirty_bitmaps[i].buf, buf_size); >>> + } >>> + } >>> + >> >> Future patch idea: We can probably optimize this block by testing a >> range of sectors on the bitmap and skipping bitmaps that don't have >> data for this block. >> >> Since we populated the bitmaps previously, we should already have the >> chance to have counted how many of them are nonempty, so we can use >> that number instead of blk->nr_bitmaps. >> >> Not important for the series right now. >> >>> /* if a block is zero we need to flush here since the network >>> * bandwidth is now a lot higher than the storage device >>> bandwidth. 
>>> - * thus if we queue zero blocks we slow down the migration */ >>> - if (flags & BLK_MIG_FLAG_ZERO_BLOCK) { >>> + * thus if we queue zero blocks we slow down the migration. >>> + * also, skip writing block when migrate only dirty bitmaps. */ >>> + if (!(flags & BLK_MIG_FLAG_HAS_BLOCK)) { >> >> I'm not completely clear on the reasoning behind the original >> codeblock here, but changing it from "Only when zeroes" to "Not when >> we have data: Zeroes and Bitmaps" makes sense, since bitmap-only >> sections are going to be very sparse too. >> >>> qemu_fflush(f); >>> return; >>> } >> >> A little after this, there's a call to qemu_put_buffer(f, blk->buf, >> BLOCK_SIZE), but we have the case where blk->buf may be NULL in bitmap >> only cases. I think we need to guard against that. > in this case (flags & BLK_MIG_FLAG_HAS_BLOCK) should be zero. However an > assert here won't hurt. >> >>> @@ -282,13 +354,20 @@ static int mig_save_device_bulk(QEMUFile *f, >>> BlkMigDevState *bmds) >>> BlockDriverState *bs = bmds->bs; >>> BlkMigBlock *blk; >>> int nr_sectors; >>> + bool skip_chunk = false; >>> >>> if (bmds->shared_base) { >>> qemu_mutex_lock_iothread(); >>> - while (cur_sector < total_sectors && >>> - !bdrv_is_allocated(bs, cur_sector, >>> MAX_IS_ALLOCATED_SEARCH, >>> - &nr_sectors)) { >>> - cur_sector += nr_sectors; >>> + if (block_mig_state.dbm_enable) { >>> + bdrv_is_allocated(bs, cur_sector, >>> BDRV_SECTORS_PER_DIRTY_CHUNK, >>> + &nr_sectors); >>> + skip_chunk = nr_sectors >= BDRV_SECTORS_PER_DIRTY_CHUNK; >>> + } else { >>> + while (cur_sector < total_sectors && >>> + !bdrv_is_allocated(bs, cur_sector, >>> MAX_IS_ALLOCATED_SEARCH, >>> + &nr_sectors)) { >>> + cur_sector += nr_sectors; >>> + } >> >> OK, so the approach taken here is that if bitmaps are enabled, we >> check only to see if this current chunk is allocated or not. If it >> isn't, we declare this a "bitmap only" chunk and set skip_chunk to true. 
>> >> The else clause taken implies no bitmap data, because either >> dbm_enable or blk_enable (or both) MUST be set for us to be here. >> >> (1) Why are you not checking the return value of bdrv_is_allocated? It >> could return either true or false AND nr_sectors == >> BDRV_SECTORS_PER_DIRTY_CHUNK, provided it was entirely allocated or >> entirely unallocated, so I think this condition is not working as you >> intend it to. > Yes, it's my mistake. >> >> (2) This seems slightly hacky in a sparse or no-data situation. We >> will transmit many small data sections, each containing a chunk of >> bitmap data, one at a time, with no data interspersed -- which leaves >> a high metadata-to-data ratio in these cases. >> >> Would it be an involved process to alter the nr_sectors count to be >> something that isn't a single sector in the (dbm_enable && >> !blk_enable) case? That way we can spin until we find data worth >> transferring and transfer all of the bitmap metadata in a larger chunk >> all at once. >> >> Whatever approach you choose, it might be nice to have a comment >> in-line explaining the general approach.
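To make point (1) concrete: bdrv_is_allocated() returns the allocation status of the *first* sector and reports, through its out-parameter, the length of the run sharing that status, so nr_sectors == BDRV_SECTORS_PER_DIRTY_CHUNK holds for a fully-allocated chunk and a fully-unallocated one alike. A toy model (not QEMU code) of the difference:

```c
#include <stdbool.h>

#define BDRV_SECTORS_PER_DIRTY_CHUNK 128   /* illustrative value */

/* Toy model of the bdrv_is_allocated() contract, over a per-sector
 * allocation map: return the status of the first sector and report the
 * contiguous run sharing that status via *pnum (capped at n). */
static bool is_allocated(const bool *map, int start, int n, int *pnum)
{
    int i = 1;

    while (i < n && map[start + i] == map[start]) {
        i++;
    }
    *pnum = i;
    return map[start];
}

/* As written in the patch: the return value is dropped, so a chunk that
 * is entirely ALLOCATED also reports a full-length run and gets skipped. */
static bool skip_chunk_as_posted(const bool *map, int start)
{
    int nr;

    is_allocated(map, start, BDRV_SECTORS_PER_DIRTY_CHUNK, &nr);
    return nr >= BDRV_SECTORS_PER_DIRTY_CHUNK;
}

/* The presumed intent: skip only chunks that are entirely unallocated,
 * since their data lives in the shared base and need not be sent. */
static bool skip_chunk_intended(const bool *map, int start)
{
    int nr;
    bool alloc = is_allocated(map, start, BDRV_SECTORS_PER_DIRTY_CHUNK, &nr);

    return !alloc && nr >= BDRV_SECTORS_PER_DIRTY_CHUNK;
}
```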
>> >>> } >>> qemu_mutex_unlock_iothread(); >>> } >>> @@ -309,7 +388,8 @@ static int mig_save_device_bulk(QEMUFile *f, >>> BlkMigDevState *bmds) >>> nr_sectors = total_sectors - cur_sector; >>> } >>> >>> - blk = blk_create(bmds, cur_sector, nr_sectors); >>> + blk = blk_create(bmds, cur_sector, nr_sectors, >>> + skip_chunk || !block_mig_state.blk_enable); >>> >>> blk_mig_lock(); >>> block_mig_state.submitted++; >>> @@ -317,8 +397,16 @@ static int mig_save_device_bulk(QEMUFile *f, >>> BlkMigDevState *bmds) >>> >>> bdrv_reset_dirty_bitmap(bs, bmds->dirty_bitmap, cur_sector, >>> nr_sectors); >>> >>> - blk->aiocb = bdrv_aio_readv(bs, cur_sector, &blk->qiov, >>> - nr_sectors, blk_mig_read_cb, blk); >>> + if (block_mig_state.dbm_enable) { >>> + blk_store_bitmaps(blk); >>> + } >>> + >>> + if (block_mig_state.blk_enable && !skip_chunk) { >>> + blk->aiocb = bdrv_aio_readv(bs, cur_sector, &blk->qiov, >>> + nr_sectors, blk_mig_read_cb, blk); >>> + } else if (block_mig_state.dbm_enable) { >>> + blk_mig_read_cb(blk, 0); >>> + } >>> >>> bmds->cur_sector = cur_sector + nr_sectors; >>> return (bmds->cur_sector >= total_sectors); >>> @@ -403,6 +491,8 @@ static void init_blk_migration(QEMUFile *f) >>> DPRINTF("Start full migration for %s\n", >>> bdrv_get_device_name(bs)); >>> } >>> >>> + bmds->dirty_bitmaps = bdrv_dbm_find_all_named(bs, >>> &bmds->nr_bitmaps); >>> + >> >> At this point, we should have block_mig_state.dbm_enable set (or >> disabled), so we can skip this initialization if it is not set. > Agree. 
>> >>> QSIMPLEQ_INSERT_TAIL(&block_mig_state.bmds_list, bmds, entry); >>> } >>> } >>> @@ -481,20 +571,32 @@ static int mig_save_device_dirty(QEMUFile *f, >>> BlkMigDevState *bmds, >>> } else { >>> nr_sectors = BDRV_SECTORS_PER_DIRTY_CHUNK; >>> } >>> - blk = blk_create(bmds, sector, nr_sectors); >>> + blk = blk_create(bmds, sector, nr_sectors, >>> + !block_mig_state.blk_enable); >>> + >>> + if (block_mig_state.dbm_enable) { >>> + blk_store_bitmaps(blk); >>> + } >>> >>> if (is_async) { >>> - blk->aiocb = bdrv_aio_readv(bmds->bs, sector, >>> &blk->qiov, >>> - nr_sectors, >>> blk_mig_read_cb, blk); >>> + if (block_mig_state.blk_enable) { >>> + blk->aiocb = bdrv_aio_readv(bmds->bs, sector, >>> &blk->qiov, >>> + nr_sectors, >>> blk_mig_read_cb, >>> + blk); >>> + } else { >>> + blk_mig_read_cb(blk, 0); >>> + } >>> >>> blk_mig_lock(); >>> block_mig_state.submitted++; >>> bmds_set_aio_inflight(bmds, sector, nr_sectors, 1); >>> blk_mig_unlock(); >>> } else { >>> - ret = bdrv_read(bmds->bs, sector, blk->buf, >>> nr_sectors); >>> - if (ret < 0) { >>> - goto error; >>> + if (block_mig_state.blk_enable) { >>> + ret = bdrv_read(bmds->bs, sector, blk->buf, >>> nr_sectors); >>> + if (ret < 0) { >>> + goto error; >>> + } >>> } >>> blk_send(f, blk); >>> >>> @@ -817,10 +919,39 @@ static int block_load(QEMUFile *f, void >>> *opaque, int version_id) >>> nr_sectors = BDRV_SECTORS_PER_DIRTY_CHUNK; >>> } >>> >>> + /* load dirty bitmaps */ >>> + if (flags & BLK_MIG_FLAG_HAS_BITMAPS) { >>> + int i, nr_bitmaps = qemu_get_be32(f); >>> + >>> + for (i = 0; i < nr_bitmaps; ++i) { >>> + char bitmap_name[256]; >>> + BdrvDirtyBitmap *bitmap; >>> + int buf_size, len; >>> + >>> + len = qemu_get_byte(f); >>> + qemu_get_buffer(f, (uint8_t *)bitmap_name, len); >>> + bitmap_name[len] = '\0'; >>> + >>> + bitmap = bdrv_find_dirty_bitmap(bs, bitmap_name); >>> + if (!bitmap) { >>> + fprintf(stderr, "Error unknown dirty bitmap\ >>> + %s for block device %s\n", >>> + bitmap_name, device_name); >>> + return 
-EINVAL; >>> + } >>> + >>> + buf_size = bdrv_dbm_data_size(bitmap, nr_sectors); >> >> Oh, this is why you didn't store the length... Still, since it seems >> as if you expect the bitmap to already exist on the destination, what >> if the granularity or size is different? It might be nice to >> explicitly state the size of the buffer -- one user misconfiguration >> and you might blow the rest of the migration. >> >> This way, if the calculated size doesn't match the received size, you >> have the chance to throw a meaningful error. >> >>> + buf = g_malloc(buf_size); >>> + qemu_get_buffer(f, buf, buf_size); >>> + bdrv_dbm_restore_data(bitmap, buf, addr, >>> nr_sectors); >>> + g_free(buf); >>> + } >>> + } >>> + >> >> So with this approach, the user needs to have created the bitmaps >> already? Would it be possible to create a mode where they don't, or am >> I misreading something? >> >> We could re-create them on the fly whenever the offset is 0 on the >> receiving side, the only other piece of information we'd really need >> at that point is the granularity. > Agree. The only problem is if we re-send the offset @ 0 a second time, or if we skip sending zero sectors and we never receive that block. Still, there's probably an elegant way to achieve what we want. Maybe a pre-amble migration block, or arbitrarily the first sector we happen to transmit carries an extra flag with it that describes the granularity of the bitmap, or ... something. 
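For the explicit-size idea above, the per-bitmap record could carry the sender's computed buffer size so the receiver can cross-check it against its own calculation before reading the payload. A toy big-endian model of that framing (the helpers here are invented stand-ins for the qemu_put_byte/qemu_put_be32/qemu_put_buffer calls, not the actual stream code):

```c
#include <stdint.h>
#include <string.h>

/* Toy model of the per-bitmap record with the suggested size field:
 *   u8 name_len | name bytes | u32 buf_size (big-endian) | payload  */
static size_t put_bitmap(uint8_t *out, const char *name,
                         const uint8_t *buf, uint32_t buf_size)
{
    size_t len = strlen(name);
    uint8_t *p = out;

    *p++ = (uint8_t)len;
    memcpy(p, name, len);
    p += len;
    *p++ = (uint8_t)(buf_size >> 24);   /* big-endian, like qemu_put_be32 */
    *p++ = (uint8_t)(buf_size >> 16);
    *p++ = (uint8_t)(buf_size >> 8);
    *p++ = (uint8_t)buf_size;
    memcpy(p, buf, buf_size);
    p += buf_size;
    return p - out;
}

/* Returns the payload size, or -1 when the advertised size disagrees
 * with what the receiver computed from its own bitmap geometry, so a
 * granularity mismatch fails loudly instead of desyncing the stream. */
static int get_bitmap(const uint8_t *in, char *name_out,
                      uint8_t *buf_out, uint32_t expected_size)
{
    const uint8_t *p = in;
    uint8_t len = *p++;
    uint32_t size;

    memcpy(name_out, p, len);
    name_out[len] = '\0';
    p += len;
    size = (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
           (uint32_t)p[2] << 8 | (uint32_t)p[3];
    p += 4;
    if (size != expected_size) {
        return -1;
    }
    memcpy(buf_out, p, size);
    return (int)size;
}
```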
>> >>> if (flags & BLK_MIG_FLAG_ZERO_BLOCK) { >>> ret = bdrv_write_zeroes(bs, addr, nr_sectors, >>> BDRV_REQ_MAY_UNMAP); >>> - } else { >>> + } else if (flags & BLK_MIG_FLAG_HAS_BLOCK) { >>> buf = g_malloc(BLOCK_SIZE); >>> qemu_get_buffer(f, buf, BLOCK_SIZE); >>> ret = bdrv_write(bs, addr, buf, nr_sectors); >>> @@ -855,6 +986,7 @@ static void block_set_params(const >>> MigrationParams *params, void *opaque) >>> { >>> block_mig_state.blk_enable = params->blk; >>> block_mig_state.shared_base = params->shared; >>> + block_mig_state.dbm_enable = params->dirty; >>> >>> /* shared base means that blk_enable = 1 */ >>> block_mig_state.blk_enable |= params->shared; >>> @@ -862,7 +994,8 @@ static void block_set_params(const >>> MigrationParams *params, void *opaque) >>> >>> static bool block_is_active(void *opaque) >>> { >>> - return block_mig_state.blk_enable == 1; >>> + return block_mig_state.blk_enable == 1 || >>> + block_mig_state.dbm_enable == 1; >>> } >>> >>> static SaveVMHandlers savevm_block_handlers = { >>> diff --git a/savevm.c b/savevm.c >>> index a598d1d..1299faa 100644 >>> --- a/savevm.c >>> +++ b/savevm.c >>> @@ -983,6 +983,7 @@ int qemu_loadvm_state(QEMUFile *f) >>> } >>> } >>> >>> + bdrv_dbm_restore_finish(); >> >> Do we not have a cleaner way to call this? I guess we have no explicit >> callback mechanisms for when a particular BDS is completely done, >> because of the multiple passes over the data that we perform -- >> >> Still, it might be nicer to have a callback for the whole of block >> migration, and call bdrv_dbm_restore_finish() from there instead of >> putting this implementation detail all the way up inside of >> qemu_loadvm_state. >> >> Can we migrate an empty-data sentinel section from block migration >> upon completion that qemu_loadvm_state launches a post_load method for >> that can invoke bdrv_dbm_restore_finish for us? >> >> Alternatively, can this not be safely done inside of block_load? 
>> somewhere around here: >> >> printf("Completed %d %%%c", (int)addr, >> (addr == 100) ? '\n' : '\r'); >> fflush(stdout); >> >> The "100%" completion flag is only written in block_save_complete(), >> so once we read this flag we should have necessary and sufficient >> information to conclude that it's safe to do the BdrvDirtyBitmap >> migration post-load stuff. > Good point, I think, I'll choose this approach. >> >> >> Of course, maybe generic section post-load callbacks are useful to >> other devices, and I'd be fine with either approach. >> >> CCing migration people for opinions. >> >>> cpu_synchronize_all_post_init(); >>> >>> ret = 0; >>> >> >> >> Thank you, and sorry it took me so long to digest the series! >> --John Snow >> >> P.S.: Happy New Year! I hope 2015 treats you well :~) > > > With all this story about inefficient migration in case of no block in > blk, I think that maybe it would be better to refuse using the block > migration bitmap to migrate bitmaps, make additional dirty bitmaps for > the dirty bitmaps, and migrate bitmaps in a separate migration/dirty-bitmaps.c > file (sorry for the pun). This will additionally provide more precise > migration in case of resetting the bitmap while migration is in progress. > Yeah, I tried to review this patch assuming we'd use this technique, from a correctness standpoint. I am relying on other reviewers with more knowledge of the migration layer to critique the general approach. I think Paolo had some feedback on that front, but I haven't looked at the code much to see how much I agree with his suggestion yet :) Thanks! --John