qemu-devel.nongnu.org archive mirror
* [Qemu-devel] [PATCH v4 for 2.12 0/3] fix bitmaps migration through shared storage
@ 2018-03-20 17:05 Vladimir Sementsov-Ogievskiy
  2018-03-20 17:05 ` [Qemu-devel] [PATCH v4 1/3] qcow2-bitmap: add qcow2_reopen_bitmaps_rw_hint() Vladimir Sementsov-Ogievskiy
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-03-20 17:05 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: mreitz, kwolf, jsnow, den, vsementsov

Hi all.

This fixes bitmaps migration through shared storage. Look at 02 for
details.

The bug was introduced in 2.10 together with the whole qcow2 bitmaps feature,
so qemu-stable is in CC. However, I doubt that anyone has really suffered from it.

Do we need dirty bitmaps at all in the inactive case? That was a question in v2.
And, keeping in mind that we are going to use inactive mode not only for
incoming migration, I'm not sure that the answer is NO (though it may be "NO"
for 2.10 and 2.11), so let's fix it in the manner proposed here, at least for 2.12.

v4:

01: r-b: Max
02: rewritten to not load bitmaps if any bitmaps already exist; drop r-b
03: add the missing change to 169.out

v3: tiny context changes in 01,02
    03: instead of a separate test, enable the corresponding cases in the existing one

v2:
   John, thank you for reviewing v1.
   changes:
    add John's r-bs, change s/timeout=10/timeout=10.0/ in the last patch,
    and drop the old patch 03 related to this timeout fix.

Vladimir Sementsov-Ogievskiy (3):
  qcow2-bitmap: add qcow2_reopen_bitmaps_rw_hint()
  qcow2: fix bitmaps loading when bitmaps already exist
  iotests: enable shared migration cases in 169

 block/qcow2.h              |  2 ++
 block/qcow2-bitmap.c       | 15 ++++++++++++++-
 block/qcow2.c              | 17 ++++++++++++++++-
 tests/qemu-iotests/169     |  8 +++-----
 tests/qemu-iotests/169.out |  4 ++--
 5 files changed, 37 insertions(+), 9 deletions(-)

-- 
2.11.1

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Qemu-devel] [PATCH v4 1/3] qcow2-bitmap: add qcow2_reopen_bitmaps_rw_hint()
  2018-03-20 17:05 [Qemu-devel] [PATCH v4 for 2.12 0/3] fix bitmaps migration through shared storage Vladimir Sementsov-Ogievskiy
@ 2018-03-20 17:05 ` Vladimir Sementsov-Ogievskiy
  2018-03-20 17:05 ` [Qemu-devel] [PATCH v4 2/3] qcow2: fix bitmaps loading when bitmaps already exist Vladimir Sementsov-Ogievskiy
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-03-20 17:05 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: mreitz, kwolf, jsnow, den, vsementsov

Add a version of qcow2_reopen_bitmaps_rw() which does the same work but
also returns a hint about whether the header was updated or not. This will
be used in the following fix for reloading bitmaps after migration.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 block/qcow2.h        |  2 ++
 block/qcow2-bitmap.c | 15 ++++++++++++++-
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index ccb92a9696..d301f77cea 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -671,6 +671,8 @@ int qcow2_check_bitmaps_refcounts(BlockDriverState *bs, BdrvCheckResult *res,
                                   void **refcount_table,
                                   int64_t *refcount_table_size);
 bool qcow2_load_dirty_bitmaps(BlockDriverState *bs, Error **errp);
+int qcow2_reopen_bitmaps_rw_hint(BlockDriverState *bs, bool *header_updated,
+                                 Error **errp);
 int qcow2_reopen_bitmaps_rw(BlockDriverState *bs, Error **errp);
 void qcow2_store_persistent_dirty_bitmaps(BlockDriverState *bs, Error **errp);
 int qcow2_reopen_bitmaps_ro(BlockDriverState *bs, Error **errp);
diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
index 3010adb909..6e93ec43e1 100644
--- a/block/qcow2-bitmap.c
+++ b/block/qcow2-bitmap.c
@@ -1004,7 +1004,8 @@ fail:
     return false;
 }
 
-int qcow2_reopen_bitmaps_rw(BlockDriverState *bs, Error **errp)
+int qcow2_reopen_bitmaps_rw_hint(BlockDriverState *bs, bool *header_updated,
+                                 Error **errp)
 {
     BDRVQcow2State *s = bs->opaque;
     Qcow2BitmapList *bm_list;
@@ -1012,6 +1013,10 @@ int qcow2_reopen_bitmaps_rw(BlockDriverState *bs, Error **errp)
     GSList *ro_dirty_bitmaps = NULL;
     int ret = 0;
 
+    if (header_updated != NULL) {
+        *header_updated = false;
+    }
+
     if (s->nb_bitmaps == 0) {
         /* No bitmaps - nothing to do */
         return 0;
@@ -1055,6 +1060,9 @@ int qcow2_reopen_bitmaps_rw(BlockDriverState *bs, Error **errp)
             error_setg_errno(errp, -ret, "Can't update bitmap directory");
             goto out;
         }
+        if (header_updated != NULL) {
+            *header_updated = true;
+        }
         g_slist_foreach(ro_dirty_bitmaps, set_readonly_helper, false);
     }
 
@@ -1065,6 +1073,11 @@ out:
     return ret;
 }
 
+int qcow2_reopen_bitmaps_rw(BlockDriverState *bs, Error **errp)
+{
+    return qcow2_reopen_bitmaps_rw_hint(bs, NULL, errp);
+}
+
 /* store_bitmap_data()
  * Store bitmap to image, filling bitmap table accordingly.
  */
-- 
2.11.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [Qemu-devel] [PATCH v4 2/3] qcow2: fix bitmaps loading when bitmaps already exist
  2018-03-20 17:05 [Qemu-devel] [PATCH v4 for 2.12 0/3] fix bitmaps migration through shared storage Vladimir Sementsov-Ogievskiy
  2018-03-20 17:05 ` [Qemu-devel] [PATCH v4 1/3] qcow2-bitmap: add qcow2_reopen_bitmaps_rw_hint() Vladimir Sementsov-Ogievskiy
@ 2018-03-20 17:05 ` Vladimir Sementsov-Ogievskiy
  2018-03-20 17:05 ` [Qemu-devel] [PATCH v4 2/3] qcow2: handle reopening bitmaps on bdrv_invalidate_cache Vladimir Sementsov-Ogievskiy
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-03-20 17:05 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: mreitz, kwolf, jsnow, den, vsementsov

On reopen with existing bitmaps, instead of loading the bitmaps again,
let's reopen them if needed. This also fixes bitmap migration through
shared storage.
Consider the following case: persistent bitmaps are stored on
bdrv_inactivate. Then, on the destination, process_incoming_migration_bh()
calls bdrv_invalidate_cache_all(), which leads to
qcow2_load_autoloading_dirty_bitmaps(), which fails if bitmaps are
already loaded on destination start.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/qcow2.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 7472af6931..d2d129a2e6 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1480,7 +1480,22 @@ static int coroutine_fn qcow2_do_open(BlockDriverState *bs, QDict *options,
         s->autoclear_features &= QCOW2_AUTOCLEAR_MASK;
     }
 
-    if (qcow2_load_dirty_bitmaps(bs, &local_err)) {
+    if (bdrv_dirty_bitmap_next(bs, NULL)) {
+        /* It's some kind of reopen with already existing dirty bitmaps. There
+         * are no known cases where we need loading bitmaps in such situation,
+         * so it's safer don't load them.
+         *
+         * Moreover, if we have some readonly bitmaps and we are reopening for
+         * rw we should reopen bitmaps correspondingly.
+         */
+        if (bdrv_has_readonly_bitmaps(bs) &&
+            !bdrv_is_read_only(bs) && !(bdrv_get_flags(bs) & BDRV_O_INACTIVE))
+        {
+            bool header_updated = false;
+            qcow2_reopen_bitmaps_rw_hint(bs, &header_updated, &local_err);
+            update_header = update_header && !header_updated;
+        }
+    } else if (qcow2_load_dirty_bitmaps(bs, &local_err)) {
         update_header = false;
     }
     if (local_err != NULL) {
-- 
2.11.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [Qemu-devel] [PATCH v4 2/3] qcow2: handle reopening bitmaps on bdrv_invalidate_cache
  2018-03-20 17:05 [Qemu-devel] [PATCH v4 for 2.12 0/3] fix bitmaps migration through shared storage Vladimir Sementsov-Ogievskiy
  2018-03-20 17:05 ` [Qemu-devel] [PATCH v4 1/3] qcow2-bitmap: add qcow2_reopen_bitmaps_rw_hint() Vladimir Sementsov-Ogievskiy
  2018-03-20 17:05 ` [Qemu-devel] [PATCH v4 2/3] qcow2: fix bitmaps loading when bitmaps already exist Vladimir Sementsov-Ogievskiy
@ 2018-03-20 17:05 ` Vladimir Sementsov-Ogievskiy
  2018-03-20 17:07   ` [Qemu-devel] [PATCH v4 2/3] qcow2: handle reopening bitmaps on bdrv_invalidate_cache DROP IT Vladimir Sementsov-Ogievskiy
  2018-03-20 17:05 ` [Qemu-devel] [PATCH v4 3/3] iotests: enable shared migration cases in 169 Vladimir Sementsov-Ogievskiy
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-03-20 17:05 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: mreitz, kwolf, jsnow, den, vsementsov

On reopen with existing bitmaps, instead of loading the bitmaps again,
let's reopen them if needed. This also fixes bitmap migration through
shared storage.
Consider the following case: persistent bitmaps are stored on
bdrv_inactivate. Then, on the destination, process_incoming_migration_bh()
calls bdrv_invalidate_cache_all(), which leads to
qcow2_load_autoloading_dirty_bitmaps(), which fails if bitmaps are
already loaded on destination start.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/qcow2.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 7472af6931..d2d129a2e6 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1480,7 +1480,22 @@ static int coroutine_fn qcow2_do_open(BlockDriverState *bs, QDict *options,
         s->autoclear_features &= QCOW2_AUTOCLEAR_MASK;
     }
 
-    if (qcow2_load_dirty_bitmaps(bs, &local_err)) {
+    if (bdrv_dirty_bitmap_next(bs, NULL)) {
+        /* It's some kind of reopen with already existing dirty bitmaps. There
+         * are no known cases where we need loading bitmaps in such situation,
+         * so it's safer don't load them.
+         *
+         * Moreover, if we have some readonly bitmaps and we are reopening for
+         * rw we should reopen bitmaps correspondingly.
+         */
+        if (bdrv_has_readonly_bitmaps(bs) &&
+            !bdrv_is_read_only(bs) && !(bdrv_get_flags(bs) & BDRV_O_INACTIVE))
+        {
+            bool header_updated = false;
+            qcow2_reopen_bitmaps_rw_hint(bs, &header_updated, &local_err);
+            update_header = update_header && !header_updated;
+        }
+    } else if (qcow2_load_dirty_bitmaps(bs, &local_err)) {
         update_header = false;
     }
     if (local_err != NULL) {
-- 
2.11.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [Qemu-devel] [PATCH v4 3/3] iotests: enable shared migration cases in 169
  2018-03-20 17:05 [Qemu-devel] [PATCH v4 for 2.12 0/3] fix bitmaps migration through shared storage Vladimir Sementsov-Ogievskiy
                   ` (2 preceding siblings ...)
  2018-03-20 17:05 ` [Qemu-devel] [PATCH v4 2/3] qcow2: handle reopening bitmaps on bdrv_invalidate_cache Vladimir Sementsov-Ogievskiy
@ 2018-03-20 17:05 ` Vladimir Sementsov-Ogievskiy
  2018-03-21 13:20 ` [Qemu-devel] [PATCH v4 for 2.12 0/3] fix bitmaps migration through shared storage Max Reitz
  2018-03-26 18:06 ` Max Reitz
  5 siblings, 0 replies; 18+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-03-20 17:05 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: mreitz, kwolf, jsnow, den, vsementsov

Shared migration for dirty bitmaps is fixed by the previous patches,
so we can enable the corresponding test cases.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 tests/qemu-iotests/169     | 8 +++-----
 tests/qemu-iotests/169.out | 4 ++--
 2 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/tests/qemu-iotests/169 b/tests/qemu-iotests/169
index 3a8db91f6f..153b10b6e7 100755
--- a/tests/qemu-iotests/169
+++ b/tests/qemu-iotests/169
@@ -140,16 +140,14 @@ def inject_test_case(klass, name, method, *args, **kwargs):
     mc = operator.methodcaller(method, *args, **kwargs)
     setattr(klass, 'test_' + name, new.instancemethod(mc, None, klass))
 
-for cmb in list(itertools.product((True, False), repeat=3)):
+for cmb in list(itertools.product((True, False), repeat=4)):
     name = ('_' if cmb[0] else '_not_') + 'persistent_'
     name += ('_' if cmb[1] else '_not_') + 'migbitmap_'
     name += '_online' if cmb[2] else '_offline'
-
-    # TODO fix shared-storage bitmap migration and enable cases for it
-    args = list(cmb) + [False]
+    name += '_shared' if cmb[3] else '_nonshared'
 
     inject_test_case(TestDirtyBitmapMigration, name, 'do_test_migration',
-                     *args)
+                     *list(cmb))
 
 
 if __name__ == '__main__':
diff --git a/tests/qemu-iotests/169.out b/tests/qemu-iotests/169.out
index 594c16f49f..b6f257674e 100644
--- a/tests/qemu-iotests/169.out
+++ b/tests/qemu-iotests/169.out
@@ -1,5 +1,5 @@
-........
+................
 ----------------------------------------------------------------------
-Ran 8 tests
+Ran 16 tests
 
 OK
-- 
2.11.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v4 2/3] qcow2: handle reopening bitmaps on bdrv_invalidate_cache DROP IT
  2018-03-20 17:05 ` [Qemu-devel] [PATCH v4 2/3] qcow2: handle reopening bitmaps on bdrv_invalidate_cache Vladimir Sementsov-Ogievskiy
@ 2018-03-20 17:07   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 18+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-03-20 17:07 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: mreitz, kwolf, jsnow, den

Oh, sorry, just drop it.
(The only difference is the commit message subject.)

20.03.2018 20:05, Vladimir Sementsov-Ogievskiy wrote:
> On reopen with existing bitmaps, instead of loading bitmaps, lets
> reopen them if needed. This also fixes bitmaps migration through
> shared storage.
> Consider the case. Persistent bitmaps are stored on bdrv_inactivate.
> Then, on destination process_incoming_migration_bh() calls
> bdrv_invalidate_cache_all() which leads to
> qcow2_load_autoloading_dirty_bitmaps() which fails if bitmaps are
> already loaded on destination start.
>
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>   block/qcow2.c | 17 ++++++++++++++++-
>   1 file changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 7472af6931..d2d129a2e6 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -1480,7 +1480,22 @@ static int coroutine_fn qcow2_do_open(BlockDriverState *bs, QDict *options,
>           s->autoclear_features &= QCOW2_AUTOCLEAR_MASK;
>       }
>   
> -    if (qcow2_load_dirty_bitmaps(bs, &local_err)) {
> +    if (bdrv_dirty_bitmap_next(bs, NULL)) {
> +        /* It's some kind of reopen with already existing dirty bitmaps. There
> +         * are no known cases where we need loading bitmaps in such situation,
> +         * so it's safer don't load them.
> +         *
> +         * Moreover, if we have some readonly bitmaps and we are reopening for
> +         * rw we should reopen bitmaps correspondingly.
> +         */
> +        if (bdrv_has_readonly_bitmaps(bs) &&
> +            !bdrv_is_read_only(bs) && !(bdrv_get_flags(bs) & BDRV_O_INACTIVE))
> +        {
> +            bool header_updated = false;
> +            qcow2_reopen_bitmaps_rw_hint(bs, &header_updated, &local_err);
> +            update_header = update_header && !header_updated;
> +        }
> +    } else if (qcow2_load_dirty_bitmaps(bs, &local_err)) {
>           update_header = false;
>       }
>       if (local_err != NULL) {


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for 2.12 0/3] fix bitmaps migration through shared storage
  2018-03-20 17:05 [Qemu-devel] [PATCH v4 for 2.12 0/3] fix bitmaps migration through shared storage Vladimir Sementsov-Ogievskiy
                   ` (3 preceding siblings ...)
  2018-03-20 17:05 ` [Qemu-devel] [PATCH v4 3/3] iotests: enable shared migration cases in 169 Vladimir Sementsov-Ogievskiy
@ 2018-03-21 13:20 ` Max Reitz
  2018-03-26 18:06 ` Max Reitz
  5 siblings, 0 replies; 18+ messages in thread
From: Max Reitz @ 2018-03-21 13:20 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block; +Cc: kwolf, jsnow, den

On 2018-03-20 18:05, Vladimir Sementsov-Ogievskiy wrote:
> Hi all.
> 
> This fixes bitmaps migration through shared storage. Look at 02 for
> details.
> 
> The bug introduced in 2.10 with the whole qcow2 bitmaps feature, so
> qemu-stable in CC. However I doubt that someone really suffered from this.
> 
> Do we need dirty bitmaps at all in inactive case? - that was a question in v2.
> And, keeping in mind that we are going to use inactive mode not only for
> incoming migration, I'm not sure that answer is NO (but, it may be "NO" for 
> 2.10, 2.11), so let's fix it in proposed here manner at least for 2.12.

Thanks, applied to my block branch:

https://github.com/XanClic/qemu/commits/block

Max

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for 2.12 0/3] fix bitmaps migration through shared storage
  2018-03-20 17:05 [Qemu-devel] [PATCH v4 for 2.12 0/3] fix bitmaps migration through shared storage Vladimir Sementsov-Ogievskiy
                   ` (4 preceding siblings ...)
  2018-03-21 13:20 ` [Qemu-devel] [PATCH v4 for 2.12 0/3] fix bitmaps migration through shared storage Max Reitz
@ 2018-03-26 18:06 ` Max Reitz
  2018-03-27  9:28   ` Vladimir Sementsov-Ogievskiy
  5 siblings, 1 reply; 18+ messages in thread
From: Max Reitz @ 2018-03-26 18:06 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block; +Cc: kwolf, jsnow, den

On 2018-03-20 18:05, Vladimir Sementsov-Ogievskiy wrote:
> Hi all.
> 
> This fixes bitmaps migration through shared storage. Look at 02 for
> details.
> 
> The bug introduced in 2.10 with the whole qcow2 bitmaps feature, so
> qemu-stable in CC. However I doubt that someone really suffered from this.
> 
> Do we need dirty bitmaps at all in inactive case? - that was a question in v2.
> And, keeping in mind that we are going to use inactive mode not only for
> incoming migration, I'm not sure that answer is NO (but, it may be "NO" for 
> 2.10, 2.11), so let's fix it in proposed here manner at least for 2.12.

For some reason, I can't get 169 to work now at all[1].  What's more,
whenever I run it, two (on current master, maybe more after this series)
"cat $TEST_DIR/mig_file" processes stay around.  That doesn't seem right.

However, this series doesn't seem to make it worse[2]...  So I'm keeping
it.  I suppose it's just some issue with the test.

Max


[1] Sometimes there are migration event timeouts, sometimes just VM
launch timeouts (specifically when VM B is supposed to be re-launched
just after it has been shut down), and sometimes I get a dirty bitmap
hash mismatch.


[2] The whole timeline was:

- Apply this series, everything seems alright

(a couple of hours later)
- Test some other things, stumble over 169 once or so

- Focus on 169, fails a bit more often

(today)
- Can't get it to work at all

- Can't get it to work in any version, neither before nor after this patch

- Lose my sanity

- Write this email

O:-)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for 2.12 0/3] fix bitmaps migration through shared storage
  2018-03-26 18:06 ` Max Reitz
@ 2018-03-27  9:28   ` Vladimir Sementsov-Ogievskiy
  2018-03-27  9:53     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 18+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-03-27  9:28 UTC (permalink / raw)
  To: Max Reitz, qemu-devel, qemu-block; +Cc: kwolf, jsnow, den

26.03.2018 21:06, Max Reitz wrote:
> On 2018-03-20 18:05, Vladimir Sementsov-Ogievskiy wrote:
>> Hi all.
>>
>> This fixes bitmaps migration through shared storage. Look at 02 for
>> details.
>>
>> The bug introduced in 2.10 with the whole qcow2 bitmaps feature, so
>> qemu-stable in CC. However I doubt that someone really suffered from this.
>>
>> Do we need dirty bitmaps at all in inactive case? - that was a question in v2.
>> And, keeping in mind that we are going to use inactive mode not only for
>> incoming migration, I'm not sure that answer is NO (but, it may be "NO" for
>> 2.10, 2.11), so let's fix it in proposed here manner at least for 2.12.
> For some reason, I can't get 169 to work now at all[1].  What's more,
> whenever I run it, two (on current master, maybe more after this series)
> "cat $TEST_DIR/mig_file" processes stay around.  That doesn't seem right.
>
> However, this series doesn't seem to make it worse[2]...  So I'm keeping
> it.  I suppose it's just some issue with the test.
>
> Max
>
>
> [1] Sometimes there are migration even timeouts, sometimes just VM
> launch timeouts (specifically when VM B is supposed to be re-launched
> just after it has been shut down), and sometimes I get a dirty bitmap
> hash mismatch.
>
>
> [2] The whole timeline was:
>
> - Apply this series, everything seems alright
>
> (a couple of hours later)
> - Test some other things, stumble over 169 once or so
>
> - Focus on 169, fails a bit more often
>
> (today)
> - Can't get it to work at all
>
> - Can't get it to work in any version, neither before nor after this patch
>
> - Lose my sanity
>
> - Write this email
>
> O:-)
>

Hmm.. I checked on current master (7b93d78a04aa24), tried it a lot of times
in a loop, and it works for me. How can I help?

-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for 2.12 0/3] fix bitmaps migration through shared storage
  2018-03-27  9:28   ` Vladimir Sementsov-Ogievskiy
@ 2018-03-27  9:53     ` Vladimir Sementsov-Ogievskiy
  2018-03-27 10:11       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 18+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-03-27  9:53 UTC (permalink / raw)
  To: Max Reitz, qemu-devel, qemu-block; +Cc: kwolf, jsnow, den

27.03.2018 12:28, Vladimir Sementsov-Ogievskiy wrote:
> 26.03.2018 21:06, Max Reitz wrote:
>> On 2018-03-20 18:05, Vladimir Sementsov-Ogievskiy wrote:
>>> Hi all.
>>>
>>> This fixes bitmaps migration through shared storage. Look at 02 for
>>> details.
>>>
>>> The bug introduced in 2.10 with the whole qcow2 bitmaps feature, so
>>> qemu-stable in CC. However I doubt that someone really suffered from 
>>> this.
>>>
>>> Do we need dirty bitmaps at all in inactive case? - that was a 
>>> question in v2.
>>> And, keeping in mind that we are going to use inactive mode not only 
>>> for
>>> incoming migration, I'm not sure that answer is NO (but, it may be 
>>> "NO" for
>>> 2.10, 2.11), so let's fix it in proposed here manner at least for 2.12.
>> For some reason, I can't get 169 to work now at all[1].  What's more,
>> whenever I run it, two (on current master, maybe more after this series)
>> "cat $TEST_DIR/mig_file" processes stay around.  That doesn't seem 
>> right.
>>
>> However, this series doesn't seem to make it worse[2]...  So I'm keeping
>> it.  I suppose it's just some issue with the test.
>>
>> Max
>>
>>
>> [1] Sometimes there are migration even timeouts, sometimes just VM
>> launch timeouts (specifically when VM B is supposed to be re-launched
>> just after it has been shut down), and sometimes I get a dirty bitmap
>> hash mismatch.
>>
>>
>> [2] The whole timeline was:
>>
>> - Apply this series, everything seems alright
>>
>> (a couple of hours later)
>> - Test some other things, stumble over 169 once or so
>>
>> - Focus on 169, fails a bit more often
>>
>> (today)
>> - Can't get it to work at all
>>
>> - Can't get it to work in any version, neither before nor after this 
>> patch
>>
>> - Lose my sanity
>>
>> - Write this email
>>
>> O:-)
>>
>
> hmm.. checked on current master (7b93d78a04aa24), tried a lot of times 
> in a loop, works for me. How can I help?
>

Oh, the loop finally finished, with:

169 6s ... [failed, exit status 1] - output mismatch (see 169.out.bad)
--- /work/src/qemu/master/tests/qemu-iotests/169.out    2018-03-16 21:01:19.536765587 +0300
+++ /work/src/qemu/master/tests/qemu-iotests/169.out.bad 2018-03-27 12:33:03.804800350 +0300
@@ -1,5 +1,20 @@
-........
+......E.
+======================================================================
+ERROR: test__persistent__not_migbitmap__offline (__main__.TestDirtyBitmapMigration)
+methodcaller(name, ...) --> methodcaller object
+----------------------------------------------------------------------
+Traceback (most recent call last):
+  File "169", line 129, in do_test_migration
+    self.vm_b.event_wait("RESUME", timeout=10.0)
+  File "/work/src/qemu/master/tests/qemu-iotests/../../scripts/qemu.py", line 349, in event_wait
+    event = self._qmp.pull_event(wait=timeout)
+  File "/work/src/qemu/master/tests/qemu-iotests/../../scripts/qmp/qmp.py", line 216, in pull_event
+    self.__get_events(wait)
+  File "/work/src/qemu/master/tests/qemu-iotests/../../scripts/qmp/qmp.py", line 124, in __get_events
+    raise QMPTimeoutError("Timeout waiting for event")
+QMPTimeoutError: Timeout waiting for event
+
  ----------------------------------------------------------------------
  Ran 8 tests

-OK
+FAILED (errors=1)
Failures: 169
Failed 1 of 1 tests


and I have a lot of open pipes, like:

root       18685  0.0  0.0 107924   352 pts/0    S    12:19   0:00 cat /work/src/qemu/master/tests/qemu-iotests/scratch/mig_file

...

After restarting the testing loop, it continues to pass 169 again and again...

-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for 2.12 0/3] fix bitmaps migration through shared storage
  2018-03-27  9:53     ` Vladimir Sementsov-Ogievskiy
@ 2018-03-27 10:11       ` Vladimir Sementsov-Ogievskiy
  2018-03-28 14:53         ` Max Reitz
  0 siblings, 1 reply; 18+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-03-27 10:11 UTC (permalink / raw)
  To: Max Reitz, qemu-devel, qemu-block; +Cc: kwolf, jsnow, den

27.03.2018 12:53, Vladimir Sementsov-Ogievskiy wrote:
> 27.03.2018 12:28, Vladimir Sementsov-Ogievskiy wrote:
>> 26.03.2018 21:06, Max Reitz wrote:
>>> On 2018-03-20 18:05, Vladimir Sementsov-Ogievskiy wrote:
>>>> Hi all.
>>>>
>>>> This fixes bitmaps migration through shared storage. Look at 02 for
>>>> details.
>>>>
>>>> The bug introduced in 2.10 with the whole qcow2 bitmaps feature, so
>>>> qemu-stable in CC. However I doubt that someone really suffered 
>>>> from this.
>>>>
>>>> Do we need dirty bitmaps at all in inactive case? - that was a 
>>>> question in v2.
>>>> And, keeping in mind that we are going to use inactive mode not 
>>>> only for
>>>> incoming migration, I'm not sure that answer is NO (but, it may be 
>>>> "NO" for
>>>> 2.10, 2.11), so let's fix it in proposed here manner at least for 
>>>> 2.12.
>>> For some reason, I can't get 169 to work now at all[1]. What's more,
>>> whenever I run it, two (on current master, maybe more after this 
>>> series)
>>> "cat $TEST_DIR/mig_file" processes stay around.  That doesn't seem 
>>> right.
>>>
>>> However, this series doesn't seem to make it worse[2]...  So I'm 
>>> keeping
>>> it.  I suppose it's just some issue with the test.
>>>
>>> Max
>>>
>>>
>>> [1] Sometimes there are migration even timeouts, sometimes just VM
>>> launch timeouts (specifically when VM B is supposed to be re-launched
>>> just after it has been shut down), and sometimes I get a dirty bitmap
>>> hash mismatch.
>>>
>>>
>>> [2] The whole timeline was:
>>>
>>> - Apply this series, everything seems alright
>>>
>>> (a couple of hours later)
>>> - Test some other things, stumble over 169 once or so
>>>
>>> - Focus on 169, fails a bit more often
>>>
>>> (today)
>>> - Can't get it to work at all
>>>
>>> - Can't get it to work in any version, neither before nor after this 
>>> patch
>>>
>>> - Lose my sanity
>>>
>>> - Write this email
>>>
>>> O:-)
>>>
>>
>> hmm.. checked on current master (7b93d78a04aa24), tried a lot of 
>> times in a loop, works for me. How can I help?
>>
>
> O, loop finally finished, with:
>
> 169 6s ... [failed, exit status 1] - output mismatch (see 169.out.bad)
> --- /work/src/qemu/master/tests/qemu-iotests/169.out    2018-03-16 
> 21:01:19.536765587 +0300
> +++ /work/src/qemu/master/tests/qemu-iotests/169.out.bad 2018-03-27 
> 12:33:03.804800350 +0300
> @@ -1,5 +1,20 @@
> -........
> +......E.
> +======================================================================
> +ERROR: test__persistent__not_migbitmap__offline 
> (__main__.TestDirtyBitmapMigration)
> +methodcaller(name, ...) --> methodcaller object
> +----------------------------------------------------------------------
> +Traceback (most recent call last):
> +  File "169", line 129, in do_test_migration
> +    self.vm_b.event_wait("RESUME", timeout=10.0)
> +  File 
> "/work/src/qemu/master/tests/qemu-iotests/../../scripts/qemu.py", line 
> 349, in event_wait
> +    event = self._qmp.pull_event(wait=timeout)
> +  File 
> "/work/src/qemu/master/tests/qemu-iotests/../../scripts/qmp/qmp.py", 
> line 216, in pull_event
> +    self.__get_events(wait)
> +  File 
> "/work/src/qemu/master/tests/qemu-iotests/../../scripts/qmp/qmp.py", 
> line 124, in __get_events
> +    raise QMPTimeoutError("Timeout waiting for event")
> +QMPTimeoutError: Timeout waiting for event
> +
>  ----------------------------------------------------------------------
>  Ran 8 tests
>
> -OK
> +FAILED (errors=1)
> Failures: 169
> Failed 1 of 1 tests
>
>
> and I have a lot of opened pipes, like:
>
> root       18685  0.0  0.0 107924   352 pts/0    S    12:19   0:00 cat 
> /work/src/qemu/master/tests/qemu-iotests/scratch/mig_file
>
> ...
>
> restart testing loop, it continues to pass 169 again and again...
>

.... and,

--- /work/src/qemu/master/tests/qemu-iotests/169.out    2018-03-16 21:01:19.536765587 +0300
+++ /work/src/qemu/master/tests/qemu-iotests/169.out.bad 2018-03-27 12:58:44.804894014 +0300
@@ -1,5 +1,20 @@
-........
+F.......
+======================================================================
+FAIL: test__not_persistent__migbitmap__offline (__main__.TestDirtyBitmapMigration)
+methodcaller(name, ...) --> methodcaller object
+----------------------------------------------------------------------
+Traceback (most recent call last):
+  File "169", line 136, in do_test_migration
+    self.check_bitmap(self.vm_b, sha256 if persistent else False)
+  File "169", line 77, in check_bitmap
+    "Dirty bitmap 'bitmap0' not found");
+  File "/work/src/qemu/master/tests/qemu-iotests/iotests.py", line 422, in assert_qmp
+    result = self.dictpath(d, path)
+  File "/work/src/qemu/master/tests/qemu-iotests/iotests.py", line 381, in dictpath
+    self.fail('failed path traversal for "%s" in "%s"' % (path, str(d)))
+AssertionError: failed path traversal for "error/desc" in "{u'return': {u'sha256': u'01d2ebedcb8f549a2547dbf8e231c410e3e747a9479e98909fc936e0035cf8b1'}}"
+
  ----------------------------------------------------------------------
  Ran 8 tests

-OK
+FAILED (failures=1)
Failures: 169
Failed 1 of 1 tests


Isn't it because of all the cat processes? Will check; updating the loop to:
i=0; while check -qcow2 169; do ((i++)); echo $i OK; killall -9 cat; done

-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for 2.12 0/3] fix bitmaps migration through shared storage
  2018-03-27 10:11       ` Vladimir Sementsov-Ogievskiy
@ 2018-03-28 14:53         ` Max Reitz
  2018-03-29  8:08           ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 18+ messages in thread
From: Max Reitz @ 2018-03-28 14:53 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block; +Cc: kwolf, jsnow, den

On 2018-03-27 12:11, Vladimir Sementsov-Ogievskiy wrote:
> 27.03.2018 12:53, Vladimir Sementsov-Ogievskiy wrote:
>> 27.03.2018 12:28, Vladimir Sementsov-Ogievskiy wrote:
>>> 26.03.2018 21:06, Max Reitz wrote:
>>>> On 2018-03-20 18:05, Vladimir Sementsov-Ogievskiy wrote:
>>>>> Hi all.
>>>>>
>>>>> This fixes bitmaps migration through shared storage. Look at 02 for
>>>>> details.
>>>>>
>>>>> The bug introduced in 2.10 with the whole qcow2 bitmaps feature, so
>>>>> qemu-stable in CC. However I doubt that someone really suffered
>>>>> from this.
>>>>>
>>>>> Do we need dirty bitmaps at all in inactive case? - that was a
>>>>> question in v2.
>>>>> And, keeping in mind that we are going to use inactive mode not
>>>>> only for
>>>>> incoming migration, I'm not sure that answer is NO (but, it may be
>>>>> "NO" for
>>>>> 2.10, 2.11), so let's fix it in proposed here manner at least for
>>>>> 2.12.
>>>> For some reason, I can't get 169 to work now at all[1]. What's more,
>>>> whenever I run it, two (on current master, maybe more after this
>>>> series)
>>>> "cat $TEST_DIR/mig_file" processes stay around.  That doesn't seem
>>>> right.
>>>>
>>>> However, this series doesn't seem to make it worse[2]...  So I'm
>>>> keeping
>>>> it.  I suppose it's just some issue with the test.
>>>>
>>>> Max
>>>>
>>>>
>>>> [1] Sometimes there are migration even timeouts, sometimes just VM
>>>> launch timeouts (specifically when VM B is supposed to be re-launched
>>>> just after it has been shut down), and sometimes I get a dirty bitmap
>>>> hash mismatch.
>>>>
>>>>
>>>> [2] The whole timeline was:
>>>>
>>>> - Apply this series, everything seems alright
>>>>
>>>> (a couple of hours later)
>>>> - Test some other things, stumble over 169 once or so
>>>>
>>>> - Focus on 169, fails a bit more often
>>>>
>>>> (today)
>>>> - Can't get it to work at all
>>>>
>>>> - Can't get it to work in any version, neither before nor after this
>>>> patch
>>>>
>>>> - Lose my sanity
>>>>
>>>> - Write this email
>>>>
>>>> O:-)
>>>>
>>>
>>> hmm.. checked on current master (7b93d78a04aa24), tried a lot of
>>> times in a loop, works for me. How can I help?
>>>
>>
>> O, loop finally finished, with:
>>
>> 169 6s ... [failed, exit status 1] - output mismatch (see 169.out.bad)
>> --- /work/src/qemu/master/tests/qemu-iotests/169.out    2018-03-16
>> 21:01:19.536765587 +0300
>> +++ /work/src/qemu/master/tests/qemu-iotests/169.out.bad 2018-03-27
>> 12:33:03.804800350 +0300
>> @@ -1,5 +1,20 @@
>> -........
>> +......E.
>> +======================================================================
>> +ERROR: test__persistent__not_migbitmap__offline
>> (__main__.TestDirtyBitmapMigration)
>> +methodcaller(name, ...) --> methodcaller object
>> +----------------------------------------------------------------------
>> +Traceback (most recent call last):
>> +  File "169", line 129, in do_test_migration
>> +    self.vm_b.event_wait("RESUME", timeout=10.0)
>> +  File
>> "/work/src/qemu/master/tests/qemu-iotests/../../scripts/qemu.py", line
>> 349, in event_wait
>> +    event = self._qmp.pull_event(wait=timeout)
>> +  File
>> "/work/src/qemu/master/tests/qemu-iotests/../../scripts/qmp/qmp.py",
>> line 216, in pull_event
>> +    self.__get_events(wait)
>> +  File
>> "/work/src/qemu/master/tests/qemu-iotests/../../scripts/qmp/qmp.py",
>> line 124, in __get_events
>> +    raise QMPTimeoutError("Timeout waiting for event")
>> +QMPTimeoutError: Timeout waiting for event
>> +
>>  ----------------------------------------------------------------------
>>  Ran 8 tests
>>
>> -OK
>> +FAILED (errors=1)
>> Failures: 169
>> Failed 1 of 1 tests
>>
>>
>> and I have a lot of opened pipes, like:
>>
>> root       18685  0.0  0.0 107924   352 pts/0    S    12:19   0:00 cat
>> /work/src/qemu/master/tests/qemu-iotests/scratch/mig_file
>>
>> ...
>>
>> restart testing loop, it continues to pass 169 again and again...
>>
> 
> .... and,
> 
> --- /work/src/qemu/master/tests/qemu-iotests/169.out    2018-03-16
> 21:01:19.536765587 +0300
> +++ /work/src/qemu/master/tests/qemu-iotests/169.out.bad 2018-03-27
> 12:58:44.804894014 +0300
> @@ -1,5 +1,20 @@
> -........
> +F.......
> +======================================================================
> +FAIL: test__not_persistent__migbitmap__offline
> (__main__.TestDirtyBitmapMigration)
> +methodcaller(name, ...) --> methodcaller object
> +----------------------------------------------------------------------
> +Traceback (most recent call last):
> +  File "169", line 136, in do_test_migration
> +    self.check_bitmap(self.vm_b, sha256 if persistent else False)
> +  File "169", line 77, in check_bitmap
> +    "Dirty bitmap 'bitmap0' not found");
> +  File "/work/src/qemu/master/tests/qemu-iotests/iotests.py", line 422,
> in assert_qmp
> +    result = self.dictpath(d, path)
> +  File "/work/src/qemu/master/tests/qemu-iotests/iotests.py", line 381,
> in dictpath
> +    self.fail('failed path traversal for "%s" in "%s"' % (path, str(d)))
> +AssertionError: failed path traversal for "error/desc" in "{u'return':
> {u'sha256':
> u'01d2ebedcb8f549a2547dbf8e231c410e3e747a9479e98909fc936e0035cf8b1'}}"
> +
>  ----------------------------------------------------------------------
>  Ran 8 tests
> 
> -OK
> +FAILED (failures=1)
> Failures: 169
> Failed 1 of 1 tests
> 
> 
> isn't it because a lot of cat processes? will check, update loop to
> i=0; while check -qcow2 169; do ((i++)); echo $i OK; killall -9 cat; done

Hmm...  I know I tried to kill all of the cats, but for some reason that
didn't really help yesterday.  Seems to help now, for 2.12.0-rc0 at
least (that is, before this series).

After the whole series, I still get a lot of failures in 169
(mismatching bitmap hash, mostly).

And interestingly, if I add an abort():

diff --git a/block/qcow2.c b/block/qcow2.c
index 486f3e83b7..9204c1c0ac 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1481,6 +1481,7 @@ static int coroutine_fn qcow2_do_open(BlockDriverState *bs, QDict *options,
     }
 
     if (bdrv_dirty_bitmap_next(bs, NULL)) {
+        abort();
         /* It's some kind of reopen with already existing dirty bitmaps. There
          * are no known cases where we need loading bitmaps in such situation,
          * so it's safer don't load them.

Then this fires for a couple of test cases of 169 even without the third
patch of this series.

I guess bdrv_dirty_bitmap_next() reacts to some bitmaps that migration
adds or something?  Then this would be the wrong condition, because I
guess we still want to load the bitmaps that are in the qcow2 file.

I'm not sure whether bdrv_has_readonly_bitmaps() is the correct
condition then, either, though.  Maybe let's take a step back: We want
to load all the bitmaps from the file exactly once, and that is when it
is opened the first time.  Or that's what I would have thought...  Is
that even correct?

Why do we load the bitmaps when the device is inactive anyway?
Shouldn't we load them only once the device is activated?

Max

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for 2.12 0/3] fix bitmaps migration through shared storage
  2018-03-28 14:53         ` Max Reitz
@ 2018-03-29  8:08           ` Vladimir Sementsov-Ogievskiy
  2018-03-29 14:03             ` Max Reitz
  0 siblings, 1 reply; 18+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-03-29  8:08 UTC (permalink / raw)
  To: Max Reitz, qemu-devel, qemu-block; +Cc: kwolf, jsnow, den

28.03.2018 17:53, Max Reitz wrote:
> On 2018-03-27 12:11, Vladimir Sementsov-Ogievskiy wrote:
>> 27.03.2018 12:53, Vladimir Sementsov-Ogievskiy wrote:
>>> 27.03.2018 12:28, Vladimir Sementsov-Ogievskiy wrote:
>>>> 26.03.2018 21:06, Max Reitz wrote:
>>>>> On 2018-03-20 18:05, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> Hi all.
>>>>>>
>>>>>> This fixes bitmaps migration through shared storage. Look at 02 for
>>>>>> details.
>>>>>>
>>>>>> The bug introduced in 2.10 with the whole qcow2 bitmaps feature, so
>>>>>> qemu-stable in CC. However I doubt that someone really suffered
>>>>>> from this.
>>>>>>
>>>>>> Do we need dirty bitmaps at all in inactive case? - that was a
>>>>>> question in v2.
>>>>>> And, keeping in mind that we are going to use inactive mode not
>>>>>> only for
>>>>>> incoming migration, I'm not sure that answer is NO (but, it may be
>>>>>> "NO" for
>>>>>> 2.10, 2.11), so let's fix it in proposed here manner at least for
>>>>>> 2.12.
>>>>> For some reason, I can't get 169 to work now at all[1]. What's more,
>>>>> whenever I run it, two (on current master, maybe more after this
>>>>> series)
>>>>> "cat $TEST_DIR/mig_file" processes stay around.  That doesn't seem
>>>>> right.
>>>>>
>>>>> However, this series doesn't seem to make it worse[2]...  So I'm
>>>>> keeping
>>>>> it.  I suppose it's just some issue with the test.
>>>>>
>>>>> Max
>>>>>
>>>>>
>>>>> [1] Sometimes there are migration even timeouts, sometimes just VM
>>>>> launch timeouts (specifically when VM B is supposed to be re-launched
>>>>> just after it has been shut down), and sometimes I get a dirty bitmap
>>>>> hash mismatch.
>>>>>
>>>>>
>>>>> [2] The whole timeline was:
>>>>>
>>>>> - Apply this series, everything seems alright
>>>>>
>>>>> (a couple of hours later)
>>>>> - Test some other things, stumble over 169 once or so
>>>>>
>>>>> - Focus on 169, fails a bit more often
>>>>>
>>>>> (today)
>>>>> - Can't get it to work at all
>>>>>
>>>>> - Can't get it to work in any version, neither before nor after this
>>>>> patch
>>>>>
>>>>> - Lose my sanity
>>>>>
>>>>> - Write this email
>>>>>
>>>>> O:-)
>>>>>
>>>> hmm.. checked on current master (7b93d78a04aa24), tried a lot of
>>>> times in a loop, works for me. How can I help?
>>>>
>>> O, loop finally finished, with:
>>>
>>> 169 6s ... [failed, exit status 1] - output mismatch (see 169.out.bad)
>>> --- /work/src/qemu/master/tests/qemu-iotests/169.out    2018-03-16
>>> 21:01:19.536765587 +0300
>>> +++ /work/src/qemu/master/tests/qemu-iotests/169.out.bad 2018-03-27
>>> 12:33:03.804800350 +0300
>>> @@ -1,5 +1,20 @@
>>> -........
>>> +......E.
>>> +======================================================================
>>> +ERROR: test__persistent__not_migbitmap__offline
>>> (__main__.TestDirtyBitmapMigration)
>>> +methodcaller(name, ...) --> methodcaller object
>>> +----------------------------------------------------------------------
>>> +Traceback (most recent call last):
>>> +  File "169", line 129, in do_test_migration
>>> +    self.vm_b.event_wait("RESUME", timeout=10.0)
>>> +  File
>>> "/work/src/qemu/master/tests/qemu-iotests/../../scripts/qemu.py", line
>>> 349, in event_wait
>>> +    event = self._qmp.pull_event(wait=timeout)
>>> +  File
>>> "/work/src/qemu/master/tests/qemu-iotests/../../scripts/qmp/qmp.py",
>>> line 216, in pull_event
>>> +    self.__get_events(wait)
>>> +  File
>>> "/work/src/qemu/master/tests/qemu-iotests/../../scripts/qmp/qmp.py",
>>> line 124, in __get_events
>>> +    raise QMPTimeoutError("Timeout waiting for event")
>>> +QMPTimeoutError: Timeout waiting for event
>>> +
>>>   ----------------------------------------------------------------------
>>>   Ran 8 tests
>>>
>>> -OK
>>> +FAILED (errors=1)
>>> Failures: 169
>>> Failed 1 of 1 tests
>>>
>>>
>>> and I have a lot of opened pipes, like:
>>>
>>> root       18685  0.0  0.0 107924   352 pts/0    S    12:19   0:00 cat
>>> /work/src/qemu/master/tests/qemu-iotests/scratch/mig_file
>>>
>>> ...
>>>
>>> restart testing loop, it continues to pass 169 again and again...
>>>
>> .... and,
>>
>> --- /work/src/qemu/master/tests/qemu-iotests/169.out    2018-03-16
>> 21:01:19.536765587 +0300
>> +++ /work/src/qemu/master/tests/qemu-iotests/169.out.bad 2018-03-27
>> 12:58:44.804894014 +0300
>> @@ -1,5 +1,20 @@
>> -........
>> +F.......
>> +======================================================================
>> +FAIL: test__not_persistent__migbitmap__offline
>> (__main__.TestDirtyBitmapMigration)
>> +methodcaller(name, ...) --> methodcaller object
>> +----------------------------------------------------------------------
>> +Traceback (most recent call last):
>> +  File "169", line 136, in do_test_migration
>> +    self.check_bitmap(self.vm_b, sha256 if persistent else False)
>> +  File "169", line 77, in check_bitmap
>> +    "Dirty bitmap 'bitmap0' not found");
>> +  File "/work/src/qemu/master/tests/qemu-iotests/iotests.py", line 422,
>> in assert_qmp
>> +    result = self.dictpath(d, path)
>> +  File "/work/src/qemu/master/tests/qemu-iotests/iotests.py", line 381,
>> in dictpath
>> +    self.fail('failed path traversal for "%s" in "%s"' % (path, str(d)))
>> +AssertionError: failed path traversal for "error/desc" in "{u'return':
>> {u'sha256':
>> u'01d2ebedcb8f549a2547dbf8e231c410e3e747a9479e98909fc936e0035cf8b1'}}"
>> +
>>   ----------------------------------------------------------------------
>>   Ran 8 tests
>>
>> -OK
>> +FAILED (failures=1)
>> Failures: 169
>> Failed 1 of 1 tests
>>
>>
>> isn't it because a lot of cat processes? will check, update loop to
>> i=0; while check -qcow2 169; do ((i++)); echo $i OK; killall -9 cat; done
> Hmm...  I know I tried to kill all of the cats, but for some reason that
> didn't really help yesterday.  Seems to help now, for 2.12.0-rc0 at
> least (that is, before this series).

Reproduced with killing too... (without this series, just on master)

>
> After the whole series, I still get a lot of failures in 169
> (mismatching bitmap hash, mostly).
>
> And interestingly, if I add an abort():
>
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 486f3e83b7..9204c1c0ac 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -1481,6 +1481,7 @@ static int coroutine_fn
> qcow2_do_open(BlockDriverState *bs, QDict *options,     }
>
>       if (bdrv_dirty_bitmap_next(bs, NULL)) {
> +        abort();
>           /* It's some kind of reopen with already existing dirty
> bitmaps. There
>            * are no known cases where we need loading bitmaps in such
> situation,
>            * so it's safer don't load them.
>
> Then this fires for a couple of test cases of 169 even without the third
> patch of this series.
>
> I guess bdrv_dirty_bitmap_next() reacts to some bitmaps that migration
> adds or something?  Then this would be the wrong condition, because I
> guess we still want to load the bitmaps that are in the qcow2 file.
>
> I'm not sure whether bdrv_has_readonly_bitmaps() is the correct
> condition then, either, though.  Maybe let's take a step back: We want
> to load all the bitmaps from the file exactly once, and that is when it
> is opened the first time.  Or that's what I would have thought...  Is
> that even correct?
>
> Why do we load the bitmaps when the device is inactive anyway?
> Shouldn't we load them only once the device is activated?

Hmm, not sure. Maybe we don't need to. But we need to load them anyway
when opening read-only, and we should correspondingly reopen them in this case.

>
> Max
>


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for 2.12 0/3] fix bitmaps migration through shared storage
  2018-03-29  8:08           ` Vladimir Sementsov-Ogievskiy
@ 2018-03-29 14:03             ` Max Reitz
  2018-03-29 15:09               ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 18+ messages in thread
From: Max Reitz @ 2018-03-29 14:03 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block; +Cc: kwolf, jsnow, den

On 2018-03-29 10:08, Vladimir Sementsov-Ogievskiy wrote:
> 28.03.2018 17:53, Max Reitz wrote:
>> On 2018-03-27 12:11, Vladimir Sementsov-Ogievskiy wrote:

[...]

>>> isn't it because a lot of cat processes? will check, update loop to
>>> i=0; while check -qcow2 169; do ((i++)); echo $i OK; killall -9 cat;
>>> done
>> Hmm...  I know I tried to kill all of the cats, but for some reason that
>> didn't really help yesterday.  Seems to help now, for 2.12.0-rc0 at
>> least (that is, before this series).
> 
> reproduced with killing... (without these series, just on master)
> 
>>
>> After the whole series, I still get a lot of failures in 169
>> (mismatching bitmap hash, mostly).
>>
>> And interestingly, if I add an abort():
>>
>> diff --git a/block/qcow2.c b/block/qcow2.c
>> index 486f3e83b7..9204c1c0ac 100644
>> --- a/block/qcow2.c
>> +++ b/block/qcow2.c
>> @@ -1481,6 +1481,7 @@ static int coroutine_fn
>> qcow2_do_open(BlockDriverState *bs, QDict *options,     }
>>
>>       if (bdrv_dirty_bitmap_next(bs, NULL)) {
>> +        abort();
>>           /* It's some kind of reopen with already existing dirty
>> bitmaps. There
>>            * are no known cases where we need loading bitmaps in such
>> situation,
>>            * so it's safer don't load them.
>>
>> Then this fires for a couple of test cases of 169 even without the third
>> patch of this series.
>>
>> I guess bdrv_dirty_bitmap_next() reacts to some bitmaps that migration
>> adds or something?  Then this would be the wrong condition, because I
>> guess we still want to load the bitmaps that are in the qcow2 file.
>>
>> I'm not sure whether bdrv_has_readonly_bitmaps() is the correct
>> condition then, either, though.  Maybe let's take a step back: We want
>> to load all the bitmaps from the file exactly once, and that is when it
>> is opened the first time.  Or that's what I would have thought...  Is
>> that even correct?
>>
>> Why do we load the bitmaps when the device is inactive anyway?
>> Shouldn't we load them only once the device is activated?
> 
> Hmm, not sure. May be, we don't need. But we anyway need to load them,
> when opening read-only, and we should correspondingly reopen in this case.

Yeah, well, yeah, but the current state seems just wrong.  Apparently
there are cases where a qcow2 node may have bitmaps before we try to
load them from the file, so the current condition doesn't work.

Furthermore, it seems like the current "state machine" is too complex so
we don't know which cases are possible anymore and what to do when.

So the first thing we could do is add a field to the BDRVQCow2State that
tells us whether the bitmaps have been loaded already or not.  If not,
we invoke qcow2_load_dirty_bitmaps() and set the value to true.  If the
value was true already and the BDS is active and R/W now, we call
qcow2_reopen_bitmaps_rw_hint().  That should solve one problem.
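
For illustration, a minimal sketch of what such a flag-based approach could
look like (the dirty_bitmaps_loaded field and the helper name are only
assumptions for the sketch, not actual QEMU code):

/* Rough sketch only: a "load bitmaps exactly once" flag in BDRVQcow2State.
 * The field dirty_bitmaps_loaded and this helper are hypothetical names. */
static void qcow2_update_bitmaps_on_open(BlockDriverState *bs,
                                         bool *update_header, Error **errp)
{
    BDRVQcow2State *s = bs->opaque;

    if (!s->dirty_bitmaps_loaded) {
        /* First open of this node: load persistent bitmaps from the image
         * exactly once.  A 'true' return means the image header was already
         * updated while loading. */
        if (qcow2_load_dirty_bitmaps(bs, errp)) {
            *update_header = false;
        }
        s->dirty_bitmaps_loaded = true;
        return;
    }

    /* Bitmaps were loaded on an earlier open (e.g. read-only or inactive);
     * if the node is now writable and active, switch them back to r/w
     * instead of loading them a second time. */
    if (bdrv_has_readonly_bitmaps(bs) &&
        !bdrv_is_read_only(bs) && !(bdrv_get_flags(bs) & BDRV_O_INACTIVE))
    {
        bool header_updated = false;

        qcow2_reopen_bitmaps_rw_hint(bs, &header_updated, errp);
        *update_header = *update_header && !header_updated;
    }
}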

The other problem of course is the question whether we should call
qcow2_load_dirty_bitmaps() at all while the drive is still inactive.
You know the migration model better than me, so I'm asking this question
to you.  We can phrase it differently: Do we need to load the bitmaps
before the drive is activated?

Max

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for 2.12 0/3] fix bitmaps migration through shared storage
  2018-03-29 14:03             ` Max Reitz
@ 2018-03-29 15:09               ` Vladimir Sementsov-Ogievskiy
  2018-03-30 13:31                 ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 18+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-03-29 15:09 UTC (permalink / raw)
  To: Max Reitz, qemu-devel, qemu-block; +Cc: kwolf, jsnow, den

29.03.2018 17:03, Max Reitz wrote:
> On 2018-03-29 10:08, Vladimir Sementsov-Ogievskiy wrote:
>> 28.03.2018 17:53, Max Reitz wrote:
>>> On 2018-03-27 12:11, Vladimir Sementsov-Ogievskiy wrote:
> [...]
>
>>>> isn't it because a lot of cat processes? will check, update loop to
>>>> i=0; while check -qcow2 169; do ((i++)); echo $i OK; killall -9 cat;
>>>> done
>>> Hmm...  I know I tried to kill all of the cats, but for some reason that
>>> didn't really help yesterday.  Seems to help now, for 2.12.0-rc0 at
>>> least (that is, before this series).
>> reproduced with killing... (without these series, just on master)
>>
>>> After the whole series, I still get a lot of failures in 169
>>> (mismatching bitmap hash, mostly).
>>>
>>> And interestingly, if I add an abort():
>>>
>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>> index 486f3e83b7..9204c1c0ac 100644
>>> --- a/block/qcow2.c
>>> +++ b/block/qcow2.c
>>> @@ -1481,6 +1481,7 @@ static int coroutine_fn
>>> qcow2_do_open(BlockDriverState *bs, QDict *options,     }
>>>
>>>        if (bdrv_dirty_bitmap_next(bs, NULL)) {
>>> +        abort();
>>>            /* It's some kind of reopen with already existing dirty
>>> bitmaps. There
>>>             * are no known cases where we need loading bitmaps in such
>>> situation,
>>>             * so it's safer don't load them.
>>>
>>> Then this fires for a couple of test cases of 169 even without the third
>>> patch of this series.
>>>
>>> I guess bdrv_dirty_bitmap_next() reacts to some bitmaps that migration
>>> adds or something?  Then this would be the wrong condition, because I
>>> guess we still want to load the bitmaps that are in the qcow2 file.
>>>
>>> I'm not sure whether bdrv_has_readonly_bitmaps() is the correct
>>> condition then, either, though.  Maybe let's take a step back: We want
>>> to load all the bitmaps from the file exactly once, and that is when it
>>> is opened the first time.  Or that's what I would have thought...  Is
>>> that even correct?
>>>
>>> Why do we load the bitmaps when the device is inactive anyway?
>>> Shouldn't we load them only once the device is activated?
>> Hmm, not sure. May be, we don't need. But we anyway need to load them,
>> when opening read-only, and we should correspondingly reopen in this case.
> Yeah, well, yeah, but the current state seems just wrong.  Apparently
> there are cases where a qcow2 node may have bitmaps before we try to
> load them from the file, so the current condition doesn't work.
>
> Furthermore, it seems like the current "state machine" is too complex so
> we don't know which cases are possible anymore and what to do when.
>
> So the first thing we could do is add a field to the BDRVQCow2State that
> tells us whether the bitmaps have been loaded already or not.  If not,
> we invoke qcow2_load_dirty_bitmaps() and set the value to true.  If the
> value was true already and the BDS is active and R/W now, we call
> qcow2_reopen_bitmaps_rw_hint().  That should solve one problem.

good idea, will do.

>
> The other problem of course is the question whether we should call
> qcow2_load_dirty_bitmaps() at all while the drive is still inactive.
> You know the migration model better than me, so I'm asking this question
> to you.  We can phrase it differently: Do we need to load the bitmaps
> before the drive is activated?

Now I think that we don't need to. At least, we don't have such cases in
Virtuozzo (I hope :).

Why did I doubt:

1. We have cases when we want to start the VM as inactive, to be able to
export its drive as an NBD export, push some data to it and then start the
VM (which involves activating).
2. We have cases when we want to start the VM stopped and operate on dirty
bitmaps.

If we just open all images in inactive mode until VM start, it looks like
we need bitmaps in inactive mode (for 2.). But that looks like the wrong
approach anyway.
At first I tried to solve (1.) by simply calling inactivate_all() when
starting the VM in paused mode, but that breaks at least (2.), so finally I
solved (1.) by an approach similar to "-incoming defer". So, we have
inactive mode in two cases:
 - incoming migration
 - pushing data to the VM before start

and in these cases we don't need to load dirty bitmaps.

Also, an inconsistency: right now we remove persistent bitmaps on inactivate,
so it is inconsistent to load them in inactive mode.

OK, I'll try to respin.

>
> Max
>

About 169: how often is it reproducible for you?

-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for 2.12 0/3] fix bitmaps migration through shared storage
  2018-03-29 15:09               ` Vladimir Sementsov-Ogievskiy
@ 2018-03-30 13:31                 ` Vladimir Sementsov-Ogievskiy
  2018-03-30 15:32                   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 18+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-03-30 13:31 UTC (permalink / raw)
  To: Max Reitz, qemu-devel, qemu-block; +Cc: kwolf, jsnow, den

29.03.2018 18:09, Vladimir Sementsov-Ogievskiy wrote:
> 29.03.2018 17:03, Max Reitz wrote:
>> On 2018-03-29 10:08, Vladimir Sementsov-Ogievskiy wrote:
>>> 28.03.2018 17:53, Max Reitz wrote:
>>>> On 2018-03-27 12:11, Vladimir Sementsov-Ogievskiy wrote:
>> [...]
>>
>>>>> isn't it because a lot of cat processes? will check, update loop to
>>>>> i=0; while check -qcow2 169; do ((i++)); echo $i OK; killall -9 cat;
>>>>> done
>>>> Hmm...  I know I tried to kill all of the cats, but for some reason 
>>>> that
>>>> didn't really help yesterday.  Seems to help now, for 2.12.0-rc0 at
>>>> least (that is, before this series).
>>> reproduced with killing... (without these series, just on master)
>>>
>>>> After the whole series, I still get a lot of failures in 169
>>>> (mismatching bitmap hash, mostly).
>>>>
>>>> And interestingly, if I add an abort():
>>>>
>>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>>> index 486f3e83b7..9204c1c0ac 100644
>>>> --- a/block/qcow2.c
>>>> +++ b/block/qcow2.c
>>>> @@ -1481,6 +1481,7 @@ static int coroutine_fn
>>>> qcow2_do_open(BlockDriverState *bs, QDict *options,     }
>>>>
>>>>        if (bdrv_dirty_bitmap_next(bs, NULL)) {
>>>> +        abort();
>>>>            /* It's some kind of reopen with already existing dirty
>>>> bitmaps. There
>>>>             * are no known cases where we need loading bitmaps in such
>>>> situation,
>>>>             * so it's safer don't load them.
>>>>
>>>> Then this fires for a couple of test cases of 169 even without the 
>>>> third
>>>> patch of this series.
>>>>
>>>> I guess bdrv_dirty_bitmap_next() reacts to some bitmaps that migration
>>>> adds or something?  Then this would be the wrong condition, because I
>>>> guess we still want to load the bitmaps that are in the qcow2 file.
>>>>
>>>> I'm not sure whether bdrv_has_readonly_bitmaps() is the correct
>>>> condition then, either, though.  Maybe let's take a step back: We want
>>>> to load all the bitmaps from the file exactly once, and that is 
>>>> when it
>>>> is opened the first time.  Or that's what I would have thought...  Is
>>>> that even correct?
>>>>
>>>> Why do we load the bitmaps when the device is inactive anyway?
>>>> Shouldn't we load them only once the device is activated?
>>> Hmm, not sure. May be, we don't need. But we anyway need to load them,
>>> when opening read-only, and we should correspondingly reopen in this 
>>> case.
>> Yeah, well, yeah, but the current state seems just wrong. Apparently
>> there are cases where a qcow2 node may have bitmaps before we try to
>> load them from the file, so the current condition doesn't work.
>>
>> Furthermore, it seems like the current "state machine" is too complex so
>> we don't know which cases are possible anymore and what to do when.
>>
>> So the first thing we could do is add a field to the BDRVQCow2State that
>> tells us whether the bitmaps have been loaded already or not. If not,
>> we invoke qcow2_load_dirty_bitmaps() and set the value to true. If the
>> value was true already and the BDS is active and R/W now, we call
>> qcow2_reopen_bitmaps_rw_hint().  That should solve one problem.
>
> good idea, will do.
>
>>
>> The other problem of course is the question whether we should call
>> qcow2_load_dirty_bitmaps() at all while the drive is still inactive.
>> You know the migration model better than me, so I'm asking this question
>> to you.  We can phrase it differently: Do we need to load the bitmaps
>> before the drive is activated?
>
> Now I think that we don't need. At least, we don't have such cases in 
> Virtuozzo (I hope :).
>
> Why did I doubt:
>
> 1. We have cases, when we want to start vm as inactive, to be able to 
> export it's drive as NBD export, push some data to it and then start 
> the VM (which involves activating)
> 2. We have cases, when we want to start vm stopped and operate on 
> dirty bitmaps.
>
> If just open all images in inactive mode until vm start, it looks like 
> we need bitmaps in inactive mode (for 2.). But it looks like wrong 
> approach anyway.
> Firstly, I tried to solve (1.) by simply inactivate_all() in case of 
> start vm in paused mode, but it breaks at least (2.), so finally, I 
> solved (1.) by an approach similar with "-incoming defer". So, we have 
> inactive mode in two cases:
>  - incoming migration
>  - push data to vm before start
>
> and, in these cases, we don't need to load dirty-bitmaps.
>
> Also, inconsistency: now, we remove persistent bitmaps on inactivate. 
> So, it is inconsistent to load them in inactive mode.
>
> Ok, I'll try to respin.

Finally, which cases do we actually have for qcow2_do_open?

1. INACTIVE -> ACTIVE (through invalidate_cache; we obviously should
load bitmaps, if we decide that we have no persistent bitmaps in
INACTIVE mode)
2. creating a new bdrv state (first open of the image) in INACTIVE mode
(will not load bitmaps)
3. creating a new bdrv state (first open of the image) in ACTIVE mode
(will load bitmaps, maybe read-only if the disk is RO)

If these are the only three cases, it would be enough to just load
bitmaps if !INACTIVE and do nothing otherwise.
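
Something like this (only a sketch, and it assumes that no other reopen
path reaches qcow2_do_open()):

if (!(bdrv_get_flags(bs) & BDRV_O_INACTIVE)) {
    /* cases 1 and 3: activation or first open in ACTIVE mode */
    header_updated = qcow2_load_dirty_bitmaps(bs, &local_err);
}
/* case 2 (first open in INACTIVE mode): do nothing; the bitmaps will be
 * loaded later, when bdrv_invalidate_cache() calls qcow2_do_open() */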

Or do we have some of the following cases too?

1?. ACTIVE -> ACTIVE (through invalidate_cache, some kind of no-op; we
should not reload bitmaps)
2?. RO -> RW (we should reopen_bitmaps_rw) (or is that possible only
through bdrv_reopen, which will not touch qcow2_do_open?)
3?. RW -> RO (reopen_bitmaps_ro?)
? Something else??

>
>>
>> Max
>>
>
> about 169, how often is it reproducible for you?
>

It becomes very interesting.

The persistent-migbitmap-online case failed on line 136 with a
mismatched bitmap sha. That check is not related to migration; at that
line the bitmap has already been migrated successfully. But for some
reason it is corrupted after stopping/starting the VM. How that is
possible - I can't imagine. It doesn't look really related to
migration.. Could it be related to the case when postcopy was not
finished, or something like this?.. Maybe changing wait(RESUME) to
something more appropriate will help. But it is strange anyway.

The persistent-notmigbitmap-offline case failed on line 128, which is
"self.vm_b.event_wait("RESUME", timeout=10.0)", with a timeout. Can we
skip RESUME for some reason? Maybe just moving to the MIGRATION event
will help; will check.

-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for 2.12 0/3] fix bitmaps migration through shared storage
  2018-03-30 13:31                 ` Vladimir Sementsov-Ogievskiy
@ 2018-03-30 15:32                   ` Vladimir Sementsov-Ogievskiy
  2018-04-03 16:03                     ` Max Reitz
  0 siblings, 1 reply; 18+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-03-30 15:32 UTC (permalink / raw)
  To: Max Reitz, qemu-devel, qemu-block; +Cc: kwolf, jsnow, den

30.03.2018 16:31, Vladimir Sementsov-Ogievskiy wrote:
> 29.03.2018 18:09, Vladimir Sementsov-Ogievskiy wrote:
>> 29.03.2018 17:03, Max Reitz wrote:
>>> On 2018-03-29 10:08, Vladimir Sementsov-Ogievskiy wrote:
>>>> 28.03.2018 17:53, Max Reitz wrote:
>>>>> On 2018-03-27 12:11, Vladimir Sementsov-Ogievskiy wrote:
>>> [...]
>>>
>>>>>> isn't it because a lot of cat processes? will check, update loop to
>>>>>> i=0; while check -qcow2 169; do ((i++)); echo $i OK; killall -9 cat;
>>>>>> done
>>>>> Hmm...  I know I tried to kill all of the cats, but for some 
>>>>> reason that
>>>>> didn't really help yesterday.  Seems to help now, for 2.12.0-rc0 at
>>>>> least (that is, before this series).
>>>> reproduced with killing... (without these series, just on master)
>>>>
>>>>> After the whole series, I still get a lot of failures in 169
>>>>> (mismatching bitmap hash, mostly).
>>>>>
>>>>> And interestingly, if I add an abort():
>>>>>
>>>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>>>> index 486f3e83b7..9204c1c0ac 100644
>>>>> --- a/block/qcow2.c
>>>>> +++ b/block/qcow2.c
>>>>> @@ -1481,6 +1481,7 @@ static int coroutine_fn
>>>>> qcow2_do_open(BlockDriverState *bs, QDict *options,     }
>>>>>
>>>>>        if (bdrv_dirty_bitmap_next(bs, NULL)) {
>>>>> +        abort();
>>>>>            /* It's some kind of reopen with already existing dirty
>>>>> bitmaps. There
>>>>>             * are no known cases where we need loading bitmaps in 
>>>>> such
>>>>> situation,
>>>>>             * so it's safer don't load them.
>>>>>
>>>>> Then this fires for a couple of test cases of 169 even without the 
>>>>> third
>>>>> patch of this series.
>>>>>
>>>>> I guess bdrv_dirty_bitmap_next() reacts to some bitmaps that 
>>>>> migration
>>>>> adds or something?  Then this would be the wrong condition, because I
>>>>> guess we still want to load the bitmaps that are in the qcow2 file.
>>>>>
>>>>> I'm not sure whether bdrv_has_readonly_bitmaps() is the correct
>>>>> condition then, either, though.  Maybe let's take a step back: We 
>>>>> want
>>>>> to load all the bitmaps from the file exactly once, and that is 
>>>>> when it
>>>>> is opened the first time.  Or that's what I would have thought...  Is
>>>>> that even correct?
>>>>>
>>>>> Why do we load the bitmaps when the device is inactive anyway?
>>>>> Shouldn't we load them only once the device is activated?
>>>> Hmm, not sure. May be, we don't need. But we anyway need to load them,
>>>> when opening read-only, and we should correspondingly reopen in 
>>>> this case.
>>> Yeah, well, yeah, but the current state seems just wrong. Apparently
>>> there are cases where a qcow2 node may have bitmaps before we try to
>>> load them from the file, so the current condition doesn't work.
>>>
>>> Furthermore, it seems like the current "state machine" is too 
>>> complex so
>>> we don't know which cases are possible anymore and what to do when.
>>>
>>> So the first thing we could do is add a field to the BDRVQCow2State 
>>> that
>>> tells us whether the bitmaps have been loaded already or not. If not,
>>> we invoke qcow2_load_dirty_bitmaps() and set the value to true. If the
>>> value was true already and the BDS is active and R/W now, we call
>>> qcow2_reopen_bitmaps_rw_hint().  That should solve one problem.
>>
>> good idea, will do.
>>
>>>
>>> The other problem of course is the question whether we should call
>>> qcow2_load_dirty_bitmaps() at all while the drive is still inactive.
>>> You know the migration model better than me, so I'm asking this 
>>> question
>>> to you.  We can phrase it differently: Do we need to load the bitmaps
>>> before the drive is activated?
>>
>> Now I think that we don't need. At least, we don't have such cases in 
>> Virtuozzo (I hope :).
>>
>> Why did I doubt:
>>
>> 1. We have cases, when we want to start vm as inactive, to be able to 
>> export it's drive as NBD export, push some data to it and then start 
>> the VM (which involves activating)
>> 2. We have cases, when we want to start vm stopped and operate on 
>> dirty bitmaps.
>>
>> If just open all images in inactive mode until vm start, it looks 
>> like we need bitmaps in inactive mode (for 2.). But it looks like 
>> wrong approach anyway.
>> Firstly, I tried to solve (1.) by simply inactivate_all() in case of 
>> start vm in paused mode, but it breaks at least (2.), so finally, I 
>> solved (1.) by an approach similar with "-incoming defer". So, we 
>> have inactive mode in two cases:
>>  - incoming migration
>>  - push data to vm before start
>>
>> and, in these cases, we don't need to load dirty-bitmaps.
>>
>> Also, inconsistency: now, we remove persistent bitmaps on inactivate. 
>> So, it is inconsistent to load them in inactive mode.
>>
>> Ok, I'll try to respin.
>
> finally, what cases we actually have for qcow2_do_open?
>
> 1. INACTIVE -> ACTIVE (through invalidate_cache, we obviously should 
> load bitmaps, if we decided that we have no persistent bitmaps in 
> INACTIVE mode)
> 2. creating new bdrv state (first open of the image) in INACTIVE mode 
> (will not load bitmaps)
> 3. creating new bdrv state (first open of the image) in ACTIVE mode 
> (will load bitmaps, maybe read-only if disk is RO)
>
> If only these three cases, it would be enough to just load bitmaps if 
> !INACTIVE and do nothing otherwise.
>
> Or, we have some of the following cases too?
>
> 1?. ACTIVE -> ACTIVE (through invalidate_cache, some kind of no-op, we 
> should not reload bitmaps)
> 2?. RO -> RW (we should reopen_bitmaps_rw) (or it is possible only 
> through bdrv_reopen, which will not touch qcow2_do_open?
> 3?. RW -> RO (reopen_bitmaps_ro ?)
> ? something other??
>
>>
>>>
>>> Max
>>>
>>
>> about 169, how often is it reproducible for you?
>>
>
> it becomes very interesting.
>
> persistent-migbitmap-online case failed on line 136. with mismatched 
> bitmap sha. This check is not relate to migration, on this line, 
> bitmap is already successfully migrated. But for some reason it is 
> corrupted after stop/start the VM. How is it possible - I can't 
> imagine. But it looks not really related to migration.. May it relate 
> to case, when postcopy was not finished or something like this?.. 
> maybe fixing wait(RESUME) to something more appropriate will help. But 
> it is strange anyway.
>
> persistent-notmigbitmap-offline case failed on line 128, which is 
> "self.vm_b.event_wait("RESUME", timeout=10.0)" with timeout.. can we 
> skip RESUME for some reason? maybe just move to MIGRATION event will 
> help, will check
>

Looks like moving from "RESUME" to "MIGRATION" events fixes the whole
issue (I already have 740 successful iterations, which is good enough,
but I will not stop the loop until at least 2000). I'll submit a patch.
But I don't understand why it fixes the sha256 mismatch, and why the
sha256 check fails with "RESUME" (the check succeeds after migration and
fails after stop/start, o_O)...

-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for 2.12 0/3] fix bitmaps migration through shared storage
  2018-03-30 15:32                   ` Vladimir Sementsov-Ogievskiy
@ 2018-04-03 16:03                     ` Max Reitz
  0 siblings, 0 replies; 18+ messages in thread
From: Max Reitz @ 2018-04-03 16:03 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block; +Cc: kwolf, jsnow, den

On 2018-03-30 17:32, Vladimir Sementsov-Ogievskiy wrote:
> 30.03.2018 16:31, Vladimir Sementsov-Ogievskiy wrote:
>> 29.03.2018 18:09, Vladimir Sementsov-Ogievskiy wrote:
>>> 29.03.2018 17:03, Max Reitz wrote:
>>>> On 2018-03-29 10:08, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 28.03.2018 17:53, Max Reitz wrote:
>>>>>> On 2018-03-27 12:11, Vladimir Sementsov-Ogievskiy wrote:
>>>> [...]
>>>>
>>>>>>> isn't it because a lot of cat processes? will check, update loop to
>>>>>>> i=0; while check -qcow2 169; do ((i++)); echo $i OK; killall -9 cat;
>>>>>>> done
>>>>>> Hmm...  I know I tried to kill all of the cats, but for some
>>>>>> reason that
>>>>>> didn't really help yesterday.  Seems to help now, for 2.12.0-rc0 at
>>>>>> least (that is, before this series).
>>>>> reproduced with killing... (without these series, just on master)
>>>>>
>>>>>> After the whole series, I still get a lot of failures in 169
>>>>>> (mismatching bitmap hash, mostly).
>>>>>>
>>>>>> And interestingly, if I add an abort():
>>>>>>
>>>>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>>>>> index 486f3e83b7..9204c1c0ac 100644
>>>>>> --- a/block/qcow2.c
>>>>>> +++ b/block/qcow2.c
>>>>>> @@ -1481,6 +1481,7 @@ static int coroutine_fn
>>>>>> qcow2_do_open(BlockDriverState *bs, QDict *options,     }
>>>>>>
>>>>>>        if (bdrv_dirty_bitmap_next(bs, NULL)) {
>>>>>> +        abort();
>>>>>>            /* It's some kind of reopen with already existing dirty
>>>>>> bitmaps. There
>>>>>>             * are no known cases where we need loading bitmaps in
>>>>>> such
>>>>>> situation,
>>>>>>             * so it's safer don't load them.
>>>>>>
>>>>>> Then this fires for a couple of test cases of 169 even without the
>>>>>> third
>>>>>> patch of this series.
>>>>>>
>>>>>> I guess bdrv_dirty_bitmap_next() reacts to some bitmaps that
>>>>>> migration
>>>>>> adds or something?  Then this would be the wrong condition, because I
>>>>>> guess we still want to load the bitmaps that are in the qcow2 file.
>>>>>>
>>>>>> I'm not sure whether bdrv_has_readonly_bitmaps() is the correct
>>>>>> condition then, either, though.  Maybe let's take a step back: We
>>>>>> want
>>>>>> to load all the bitmaps from the file exactly once, and that is
>>>>>> when it
>>>>>> is opened the first time.  Or that's what I would have thought...  Is
>>>>>> that even correct?
>>>>>>
>>>>>> Why do we load the bitmaps when the device is inactive anyway?
>>>>>> Shouldn't we load them only once the device is activated?
>>>>> Hmm, not sure. May be, we don't need. But we anyway need to load them,
>>>>> when opening read-only, and we should correspondingly reopen in
>>>>> this case.
>>>> Yeah, well, yeah, but the current state seems just wrong. Apparently
>>>> there are cases where a qcow2 node may have bitmaps before we try to
>>>> load them from the file, so the current condition doesn't work.
>>>>
>>>> Furthermore, it seems like the current "state machine" is too
>>>> complex so
>>>> we don't know which cases are possible anymore and what to do when.
>>>>
>>>> So the first thing we could do is add a field to the BDRVQCow2State
>>>> that
>>>> tells us whether the bitmaps have been loaded already or not. If not,
>>>> we invoke qcow2_load_dirty_bitmaps() and set the value to true. If the
>>>> value was true already and the BDS is active and R/W now, we call
>>>> qcow2_reopen_bitmaps_rw_hint().  That should solve one problem.
>>>
>>> good idea, will do.
>>>
>>>>
>>>> The other problem of course is the question whether we should call
>>>> qcow2_load_dirty_bitmaps() at all while the drive is still inactive.
>>>> You know the migration model better than me, so I'm asking this
>>>> question
>>>> to you.  We can phrase it differently: Do we need to load the bitmaps
>>>> before the drive is activated?
>>>
>>> Now I think that we don't need. At least, we don't have such cases in
>>> Virtuozzo (I hope :).
>>>
>>> Why did I doubt:
>>>
>>> 1. We have cases, when we want to start vm as inactive, to be able to
>>> export it's drive as NBD export, push some data to it and then start
>>> the VM (which involves activating)
>>> 2. We have cases, when we want to start vm stopped and operate on
>>> dirty bitmaps.
>>>
>>> If just open all images in inactive mode until vm start, it looks
>>> like we need bitmaps in inactive mode (for 2.). But it looks like
>>> wrong approach anyway.
>>> Firstly, I tried to solve (1.) by simply inactivate_all() in case of
>>> start vm in paused mode, but it breaks at least (2.), so finally, I
>>> solved (1.) by an approach similar with "-incoming defer". So, we
>>> have inactive mode in two cases:
>>>  - incoming migration
>>>  - push data to vm before start
>>>
>>> and, in these cases, we don't need to load dirty-bitmaps.
>>>
>>> Also, inconsistency: now, we remove persistent bitmaps on inactivate.
>>> So, it is inconsistent to load them in inactive mode.
>>>
>>> Ok, I'll try to respin.
>>
>> finally, what cases we actually have for qcow2_do_open?
>>
>> 1. INACTIVE -> ACTIVE (through invalidate_cache, we obviously should
>> load bitmaps, if we decided that we have no persistent bitmaps in
>> INACTIVE mode)
>> 2. creating new bdrv state (first open of the image) in INACTIVE mode
>> (will not load bitmaps)
>> 3. creating new bdrv state (first open of the image) in ACTIVE mode
>> (will load bitmaps, maybe read-only if disk is RO)
>>
>> If only these three cases, it would be enough to just load bitmaps if
>> !INACTIVE and do nothing otherwise.

Maybe, but I'd prefer just adding a flag for now.  We can think about
whether we need bitmaps in inactive mode later, and even then managing
the state through a flag doesn't hurt.

>> Or, we have some of the following cases too?
>>
>> 1?. ACTIVE -> ACTIVE (through invalidate_cache, some kind of no-op, we
>> should not reload bitmaps)
>> 2?. RO -> RW (we should reopen_bitmaps_rw) (or it is possible only
>> through bdrv_reopen, which will not touch qcow2_do_open?
>> 3?. RW -> RO (reopen_bitmaps_ro ?)
>> ? something other??

I don't really know, but I just hope having a flag saves us from having
to think about all of this too much. :-)

>>> about 169, how often is it reproducible for you?

More errors than passes, I guess.

But it always leaves cats behind, and that's another sign that something
isn't right.

>> it becomes very interesting.
>>
>> persistent-migbitmap-online case failed on line 136. with mismatched
>> bitmap sha. This check is not relate to migration, on this line,
>> bitmap is already successfully migrated. But for some reason it is
>> corrupted after stop/start the VM. How is it possible - I can't
>> imagine. But it looks not really related to migration.. May it relate
>> to case, when postcopy was not finished or something like this?..
>> maybe fixing wait(RESUME) to something more appropriate will help. But
>> it is strange anyway.
>>
>> persistent-notmigbitmap-offline case failed on line 128, which is
>> "self.vm_b.event_wait("RESUME", timeout=10.0)" with timeout.. can we
>> skip RESUME for some reason? maybe just move to MIGRATION event will
>> help, will check

It would still be interesting to know where the RESUME ended up...

> looks like moving from "RESUME" to "MIGRATION" events fixes the whole
> issue (I've already 740 successful iterations, which is good enough, but
> I will not stop the loop until at least 2000). I'll submit a patch. But
> I don't understand, why it fix sha256 mismatch, and why sha256-check
> fails with "RESUME" (check success after migration and fail after
> stop/start, o_O)...

Well, if you have the time, it might be worth investigating.

Max

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2018-04-03 16:04 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-20 17:05 [Qemu-devel] [PATCH v4 for 2.12 0/3] fix bitmaps migration through shared storage Vladimir Sementsov-Ogievskiy
2018-03-20 17:05 ` [Qemu-devel] [PATCH v4 1/3] qcow2-bitmap: add qcow2_reopen_bitmaps_rw_hint() Vladimir Sementsov-Ogievskiy
2018-03-20 17:05 ` [Qemu-devel] [PATCH v4 2/3] qcow2: fix bitmaps loading when bitmaps already exist Vladimir Sementsov-Ogievskiy
2018-03-20 17:05 ` [Qemu-devel] [PATCH v4 2/3] qcow2: handle reopening bitmaps on bdrv_invalidate_cache Vladimir Sementsov-Ogievskiy
2018-03-20 17:07   ` [Qemu-devel] [PATCH v4 2/3] qcow2: handle reopening bitmaps on bdrv_invalidate_cache DROP IT Vladimir Sementsov-Ogievskiy
2018-03-20 17:05 ` [Qemu-devel] [PATCH v4 3/3] iotests: enable shared migration cases in 169 Vladimir Sementsov-Ogievskiy
2018-03-21 13:20 ` [Qemu-devel] [PATCH v4 for 2.12 0/3] fix bitmaps migration through shared storage Max Reitz
2018-03-26 18:06 ` Max Reitz
2018-03-27  9:28   ` Vladimir Sementsov-Ogievskiy
2018-03-27  9:53     ` Vladimir Sementsov-Ogievskiy
2018-03-27 10:11       ` Vladimir Sementsov-Ogievskiy
2018-03-28 14:53         ` Max Reitz
2018-03-29  8:08           ` Vladimir Sementsov-Ogievskiy
2018-03-29 14:03             ` Max Reitz
2018-03-29 15:09               ` Vladimir Sementsov-Ogievskiy
2018-03-30 13:31                 ` Vladimir Sementsov-Ogievskiy
2018-03-30 15:32                   ` Vladimir Sementsov-Ogievskiy
2018-04-03 16:03                     ` Max Reitz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).