* [PATCH v2 1/5] migration: Fix possible infinite loop of ram save process
2022-10-04 18:24 [PATCH v2 0/5] migration: Bug fixes (prepare for preempt-full) Peter Xu
@ 2022-10-04 18:24 ` Peter Xu
2022-11-14 14:02 ` Juan Quintela
2022-10-04 18:24 ` [PATCH v2 2/5] migration: Fix race on qemu_file_shutdown() Peter Xu
` (3 subsequent siblings)
4 siblings, 1 reply; 12+ messages in thread
From: Peter Xu @ 2022-10-04 18:24 UTC (permalink / raw)
To: qemu-devel
Cc: Daniel P. Berrange, Leonardo Bras Soares Passos,
	Dr. David Alan Gilbert, peterx, Juan Quintela
When starting the ram saving procedure (especially at the completion
phase), always set last_seen_block to non-NULL to make sure we can always
correctly detect the case where "we've migrated all the dirty pages".
This guarantees that both last_seen_block and pss.block are valid before
the loop starts.
See the comment in the code for details.
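For illustration, a minimal sketch of the invariant being restored; the
field names follow migration/ram.c, but the helper itself is hypothetical,
not the actual code:

    /*
     * Hypothetical sketch: find_dirty_block() declares the search round
     * complete when the scan position wraps back to where it started.
     * With rs->last_seen_block == NULL the equality below could never
     * hold, so the do/while loop in ram_find_and_save_block() might
     * never terminate -- hence seeding it with the first ramblock.
     */
    static bool search_wrapped(RAMState *rs, PageSearchStatus *pss)
    {
        return pss->block == rs->last_seen_block &&
               pss->page >= rs->last_page;
    }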
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/ram.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/migration/ram.c b/migration/ram.c
index dc1de9ddbc..1d42414ecc 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2546,14 +2546,22 @@ static int ram_find_and_save_block(RAMState *rs)
         return pages;
     }
 
+    /*
+     * Always keep last_seen_block/last_page valid during this procedure,
+     * because find_dirty_block() relies on these values (e.g., we compare
+     * last_seen_block with pss.block to see whether we searched all the
+     * ramblocks) to detect the completion of migration.  Having a NULL
+     * last_seen_block can cause the loop below to run forever.
+     */
+    if (!rs->last_seen_block) {
+        rs->last_seen_block = QLIST_FIRST_RCU(&ram_list.blocks);
+        rs->last_page = 0;
+    }
+
     pss.block = rs->last_seen_block;
     pss.page = rs->last_page;
     pss.complete_round = false;
 
-    if (!pss.block) {
-        pss.block = QLIST_FIRST_RCU(&ram_list.blocks);
-    }
-
     do {
         again = true;
         found = get_queued_page(rs, &pss);
--
2.37.3
* Re: [PATCH v2 1/5] migration: Fix possible infinite loop of ram save process
2022-10-04 18:24 ` [PATCH v2 1/5] migration: Fix possible infinite loop of ram save process Peter Xu
@ 2022-11-14 14:02 ` Juan Quintela
0 siblings, 0 replies; 12+ messages in thread
From: Juan Quintela @ 2022-11-14 14:02 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Daniel P. Berrange, Leonardo Bras Soares Passos,
	Dr. David Alan Gilbert
Peter Xu <peterx@redhat.com> wrote:
> When starting the ram saving procedure (especially at the completion
> phase), always set last_seen_block to non-NULL to make sure we can always
> correctly detect the case where "we've migrated all the dirty pages".
>
> This guarantees that both last_seen_block and pss.block are valid before
> the loop starts.
>
> See the comment in the code for details.
>
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
queued
* [PATCH v2 2/5] migration: Fix race on qemu_file_shutdown()
2022-10-04 18:24 [PATCH v2 0/5] migration: Bug fixes (prepare for preempt-full) Peter Xu
2022-10-04 18:24 ` [PATCH v2 1/5] migration: Fix possible infinite loop of ram save process Peter Xu
@ 2022-10-04 18:24 ` Peter Xu
2022-11-14 14:02 ` Juan Quintela
2022-10-04 18:24 ` [PATCH v2 3/5] migration: Disallow postcopy preempt to be used with compress Peter Xu
` (2 subsequent siblings)
4 siblings, 1 reply; 12+ messages in thread
From: Peter Xu @ 2022-10-04 18:24 UTC (permalink / raw)
To: qemu-devel
Cc: Daniel P. Berrange, Leonardo Bras Soares Passos,
	Dr. David Alan Gilbert, peterx, Juan Quintela
In qemu_file_shutdown() there's a possible race with the current order of
operations. There are two major things to do:

(1) Do the real shutdown() (e.g. the shutdown() syscall on a socket)
(2) Update the qemufile's last_error

We must do (2) before (1), otherwise there can be a race condition like:

      page receiver                     other thread
      -------------                     ------------
      qemu_get_buffer()
                                        do shutdown()
      returns 0 (buffer all zero)
      (meanwhile we didn't check this retcode)
      try to detect IO error
        last_error==NULL, IO okay
      install ALL-ZERO page
                                        set last_error
      --> guest crash!

To fix this we could also check the retval of qemu_get_buffer(), but not
all APIs can be checked that way and ultimately we still need to fall back
to qemu_file_get_error(); e.g. qemu_get_byte() does not return an error at
all.

Maybe some day a rework of the qemufile API is really needed, but for now
keep using qemu_file_get_error() and fix the bug by not allowing that race
condition to happen. shutdown() is special here because its last_error is
emulated: for real -EIO errors, last_error is always set when e.g. a
sendmsg() failure triggers, so those are never missed; only shutdown() is
tricky.
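For illustration, a hedged sketch of the receiver-side pattern that makes
the ordering matter; qemu_get_buffer() and qemu_file_get_error() are the
real APIs, but the function below, and host_addr standing in for the
destination guest page, are simplifying assumptions:

    /* Hedged sketch; "f" is the migration stream, "host_addr" stands in
     * for the destination guest page (both assumptions, simplified). */
    static int receive_one_page(QEMUFile *f, void *host_addr)
    {
        uint8_t buf[TARGET_PAGE_SIZE];

        /* A read racing with shutdown() may return a short/zeroed buffer. */
        qemu_get_buffer(f, buf, TARGET_PAGE_SIZE);

        /*
         * The only reliable error signal is last_error.  With the fix, the
         * error is latched before shutdown() tears the channel down, so
         * this check can no longer see "no error" on a dead stream.
         */
        if (qemu_file_get_error(f)) {
            return -EIO;
        }

        /* Stream healthy: only now is it safe to install the page. */
        memcpy(host_addr, buf, TARGET_PAGE_SIZE);
        return 0;
    }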
Cc: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/qemu-file.c | 27 ++++++++++++++++++++++++---
1 file changed, 24 insertions(+), 3 deletions(-)
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 4f400c2e52..2d5f74ffc2 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -79,6 +79,30 @@ int qemu_file_shutdown(QEMUFile *f)
     int ret = 0;
 
     f->shutdown = true;
+
+    /*
+     * We must set qemufile error before the real shutdown(), otherwise
+     * there can be a race window where we thought IO all went through
+     * (because last_error==NULL) but actually IO has already stopped.
+     *
+     * Without correct ordering, the race can happen like this:
+     *
+     *      page receiver                     other thread
+     *      -------------                     ------------
+     *      qemu_get_buffer()
+     *                                        do shutdown()
+     *      returns 0 (buffer all zero)
+     *      (we didn't check this retcode)
+     *      try to detect IO error
+     *        last_error==NULL, IO okay
+     *      install ALL-ZERO page
+     *                                        set last_error
+     *      --> guest crash!
+     */
+    if (!f->last_error) {
+        qemu_file_set_error(f, -EIO);
+    }
+
     if (!qio_channel_has_feature(f->ioc,
                                  QIO_CHANNEL_FEATURE_SHUTDOWN)) {
         return -ENOSYS;
@@ -88,9 +112,6 @@ int qemu_file_shutdown(QEMUFile *f)
         ret = -EIO;
     }
 
-    if (!f->last_error) {
-        qemu_file_set_error(f, -EIO);
-    }
 
     return ret;
 }
--
2.37.3
* Re: [PATCH v2 2/5] migration: Fix race on qemu_file_shutdown()
2022-10-04 18:24 ` [PATCH v2 2/5] migration: Fix race on qemu_file_shutdown() Peter Xu
@ 2022-11-14 14:02 ` Juan Quintela
0 siblings, 0 replies; 12+ messages in thread
From: Juan Quintela @ 2022-11-14 14:02 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Daniel P. Berrange, Leonardo Bras Soares Passos,
	Dr. David Alan Gilbert
Peter Xu <peterx@redhat.com> wrote:
> In qemu_file_shutdown() there's a possible race with the current order of
> operations. There are two major things to do:
>
> (1) Do the real shutdown() (e.g. the shutdown() syscall on a socket)
> (2) Update the qemufile's last_error
>
> We must do (2) before (1), otherwise there can be a race condition like:
>
>       page receiver                     other thread
>       -------------                     ------------
>       qemu_get_buffer()
>                                         do shutdown()
>       returns 0 (buffer all zero)
>       (meanwhile we didn't check this retcode)
>       try to detect IO error
>         last_error==NULL, IO okay
>       install ALL-ZERO page
>                                         set last_error
>       --> guest crash!
>
> To fix this we could also check the retval of qemu_get_buffer(), but not
> all APIs can be checked that way and ultimately we still need to fall back
> to qemu_file_get_error(); e.g. qemu_get_byte() does not return an error at
> all.
>
> Maybe some day a rework of the qemufile API is really needed, but for now
> keep using qemu_file_get_error() and fix the bug by not allowing that race
> condition to happen. shutdown() is special here because its last_error is
> emulated: for real -EIO errors, last_error is always set when e.g. a
> sendmsg() failure triggers, so those are never missed; only shutdown() is
> tricky.
>
> Cc: Daniel P. Berrange <berrange@redhat.com>
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
* [PATCH v2 3/5] migration: Disallow postcopy preempt to be used with compress
2022-10-04 18:24 [PATCH v2 0/5] migration: Bug fixes (prepare for preempt-full) Peter Xu
2022-10-04 18:24 ` [PATCH v2 1/5] migration: Fix possible infinite loop of ram save process Peter Xu
2022-10-04 18:24 ` [PATCH v2 2/5] migration: Fix race on qemu_file_shutdown() Peter Xu
@ 2022-10-04 18:24 ` Peter Xu
2022-11-14 14:03 ` Juan Quintela
2022-10-04 18:24 ` [PATCH v2 4/5] migration: Use non-atomic ops for clear log bitmap Peter Xu
2022-10-04 18:24 ` [PATCH v2 5/5] migration: Disable multifd explicitly with compression Peter Xu
4 siblings, 1 reply; 12+ messages in thread
From: Peter Xu @ 2022-10-04 18:24 UTC (permalink / raw)
To: qemu-devel
Cc: Daniel P. Berrange, Leonardo Bras Soares Passos,
	Dr. David Alan Gilbert, peterx, Juan Quintela
The preempt mode requires the capability to assign a channel to each page,
while the compression logic will currently assign pages to different
compress threads/local channels, so potentially the two are incompatible.
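To make the assumption concrete, a hypothetical sketch (names invented for
illustration, not QEMU code): preempt wants the wire channel to be a pure
function of the page, whereas compression hands pages to whichever compress
thread is free, so channel and ordering depend on thread timing.

    /* Hypothetical illustration only. */
    enum { CHANNEL_PRECOPY, CHANNEL_POSTCOPY_PREEMPT };

    /* Preempt mode: deterministic page -> channel mapping. */
    static int preempt_pick_channel(bool urgent_page)
    {
        return urgent_page ? CHANNEL_POSTCOPY_PREEMPT : CHANNEL_PRECOPY;
    }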
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/migration/migration.c b/migration/migration.c
index bb8bbddfe4..844bca1ff6 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1336,6 +1336,17 @@ static bool migrate_caps_check(bool *cap_list,
             error_setg(errp, "Postcopy preempt requires postcopy-ram");
             return false;
         }
+
+        /*
+         * Preempt mode requires urgent pages to be sent in a separate
+         * channel, OTOH the compression logic will disorder all pages into
+         * different compression channels, which is not compatible with the
+         * preempt assumptions on channel assignments.
+         */
+        if (cap_list[MIGRATION_CAPABILITY_COMPRESS]) {
+            error_setg(errp, "Postcopy preempt not compatible with compress");
+            return false;
+        }
     }
 
     return true;
--
2.37.3
* [PATCH v2 4/5] migration: Use non-atomic ops for clear log bitmap
2022-10-04 18:24 [PATCH v2 0/5] migration: Bug fixes (prepare for preempt-full) Peter Xu
` (2 preceding siblings ...)
2022-10-04 18:24 ` [PATCH v2 3/5] migration: Disallow postcopy preempt to be used with compress Peter Xu
@ 2022-10-04 18:24 ` Peter Xu
2022-11-14 14:04 ` Juan Quintela
2022-10-04 18:24 ` [PATCH v2 5/5] migration: Disable multifd explicitly with compression Peter Xu
4 siblings, 1 reply; 12+ messages in thread
From: Peter Xu @ 2022-10-04 18:24 UTC (permalink / raw)
To: qemu-devel
Cc: Daniel P. Berrange, Leonardo Bras Soares Passos,
	Dr. David Alan Gilbert, peterx, Juan Quintela
Since we already have bitmap_mutex to protect both the dirty bitmap and the
clear log bitmap, we don't need atomic operations to set/clear/test the
clear log bitmap. Switch all the ops from atomic to non-atomic versions,
and touch up the comments to note which lock is in charge.
Introduce a non-atomic version of bitmap_test_and_clear_atomic(); it is
mostly the same as the atomic version but simplified in a few places, e.g.
the "old_bits" variable and the explicit memory barriers are dropped.
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/ram_addr.h | 11 ++++++-----
 include/exec/ramblock.h |  3 +++
 include/qemu/bitmap.h   |  1 +
 util/bitmap.c           | 45 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 55 insertions(+), 5 deletions(-)
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index f3e0c78161..5092a2e0ff 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -42,7 +42,8 @@ static inline long clear_bmap_size(uint64_t pages, uint8_t shift)
 }
 
 /**
- * clear_bmap_set: set clear bitmap for the page range
+ * clear_bmap_set: set clear bitmap for the page range.  Must be with
+ * bitmap_mutex held.
  *
  * @rb: the ramblock to operate on
  * @start: the start page number
@@ -55,12 +56,12 @@ static inline void clear_bmap_set(RAMBlock *rb, uint64_t start,
 {
     uint8_t shift = rb->clear_bmap_shift;
 
-    bitmap_set_atomic(rb->clear_bmap, start >> shift,
-                      clear_bmap_size(npages, shift));
+    bitmap_set(rb->clear_bmap, start >> shift, clear_bmap_size(npages, shift));
 }
 
 /**
- * clear_bmap_test_and_clear: test clear bitmap for the page, clear if set
+ * clear_bmap_test_and_clear: test clear bitmap for the page, clear if set.
+ * Must be with bitmap_mutex held.
  *
  * @rb: the ramblock to operate on
  * @page: the page number to check
@@ -71,7 +72,7 @@ static inline bool clear_bmap_test_and_clear(RAMBlock *rb, uint64_t page)
 {
     uint8_t shift = rb->clear_bmap_shift;
 
-    return bitmap_test_and_clear_atomic(rb->clear_bmap, page >> shift, 1);
+    return bitmap_test_and_clear(rb->clear_bmap, page >> shift, 1);
 }
 
 static inline bool offset_in_ramblock(RAMBlock *b, ram_addr_t offset)
diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
index 6cbedf9e0c..adc03df59c 100644
--- a/include/exec/ramblock.h
+++ b/include/exec/ramblock.h
@@ -53,6 +53,9 @@ struct RAMBlock {
      * and split clearing of dirty bitmap on the remote node (e.g.,
      * KVM).  The bitmap will be set only when doing global sync.
      *
+     * It is only used during src side of ram migration, and it is
+     * protected by the global ram_state.bitmap_mutex.
+     *
      * NOTE: this bitmap is different comparing to the other bitmaps
      * in that one bit can represent multiple guest pages (which is
      * decided by the `clear_bmap_shift' variable below).  On
diff --git a/include/qemu/bitmap.h b/include/qemu/bitmap.h
index 82a1d2f41f..3ccb00865f 100644
--- a/include/qemu/bitmap.h
+++ b/include/qemu/bitmap.h
@@ -253,6 +253,7 @@ void bitmap_set(unsigned long *map, long i, long len);
 void bitmap_set_atomic(unsigned long *map, long i, long len);
 void bitmap_clear(unsigned long *map, long start, long nr);
 bool bitmap_test_and_clear_atomic(unsigned long *map, long start, long nr);
+bool bitmap_test_and_clear(unsigned long *map, long start, long nr);
 void bitmap_copy_and_clear_atomic(unsigned long *dst, unsigned long *src,
                                   long nr);
 unsigned long bitmap_find_next_zero_area(unsigned long *map,
diff --git a/util/bitmap.c b/util/bitmap.c
index f81d8057a7..8d12e90a5a 100644
--- a/util/bitmap.c
+++ b/util/bitmap.c
@@ -240,6 +240,51 @@ void bitmap_clear(unsigned long *map, long start, long nr)
     }
 }
 
+bool bitmap_test_and_clear(unsigned long *map, long start, long nr)
+{
+    unsigned long *p = map + BIT_WORD(start);
+    const long size = start + nr;
+    int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG);
+    unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);
+    bool dirty = false;
+
+    assert(start >= 0 && nr >= 0);
+
+    /* First word */
+    if (nr - bits_to_clear > 0) {
+        if ((*p) & mask_to_clear) {
+            dirty = true;
+        }
+        *p &= ~mask_to_clear;
+        nr -= bits_to_clear;
+        bits_to_clear = BITS_PER_LONG;
+        p++;
+    }
+
+    /* Full words */
+    if (bits_to_clear == BITS_PER_LONG) {
+        while (nr >= BITS_PER_LONG) {
+            if (*p) {
+                dirty = true;
+                *p = 0;
+            }
+            nr -= BITS_PER_LONG;
+            p++;
+        }
+    }
+
+    /* Last word */
+    if (nr) {
+        mask_to_clear &= BITMAP_LAST_WORD_MASK(size);
+        if ((*p) & mask_to_clear) {
+            dirty = true;
+        }
+        *p &= ~mask_to_clear;
+    }
+
+    return dirty;
+}
+
 bool bitmap_test_and_clear_atomic(unsigned long *map, long start, long nr)
 {
     unsigned long *p = map + BIT_WORD(start);
--
2.37.3
* Re: [PATCH v2 4/5] migration: Use non-atomic ops for clear log bitmap
2022-10-04 18:24 ` [PATCH v2 4/5] migration: Use non-atomic ops for clear log bitmap Peter Xu
@ 2022-11-14 14:04 ` Juan Quintela
0 siblings, 0 replies; 12+ messages in thread
From: Juan Quintela @ 2022-11-14 14:04 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Daniel P. Berrange, Leonardo Bras Soares Passos,
	Dr. David Alan Gilbert
Peter Xu <peterx@redhat.com> wrote:
> Since we already have bitmap_mutex to protect both the dirty bitmap and
> the clear log bitmap, we don't need atomic operations to set/clear/test
> the clear log bitmap. Switch all the ops from atomic to non-atomic
> versions, and touch up the comments to note which lock is in charge.
>
> Introduce a non-atomic version of bitmap_test_and_clear_atomic(); it is
> mostly the same as the atomic version but simplified in a few places,
> e.g. the "old_bits" variable and the explicit memory barriers are dropped.
>
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
* [PATCH v2 5/5] migration: Disable multifd explicitly with compression
2022-10-04 18:24 [PATCH v2 0/5] migration: Bug fixes (prepare for preempt-full) Peter Xu
` (3 preceding siblings ...)
2022-10-04 18:24 ` [PATCH v2 4/5] migration: Use non-atomic ops for clear log bitmap Peter Xu
@ 2022-10-04 18:24 ` Peter Xu
2022-10-05 10:49 ` Dr. David Alan Gilbert
2022-11-14 14:04 ` Juan Quintela
4 siblings, 2 replies; 12+ messages in thread
From: Peter Xu @ 2022-10-04 18:24 UTC (permalink / raw)
To: qemu-devel
Cc: Daniel P. Berrange, Leonardo Bras Soares Passos,
	Dr. David Alan Gilbert, peterx, Juan Quintela
The multifd thread model does not work with compression, so explicitly
disable the combination.
Note that previously, even though both features could be enabled, nothing
would go wrong: the compression code has higher priority, so the multifd
feature was simply ignored. Now we fail earlier, at config time, so that
the user is better aware of the consequence.
Note that there is a slight chance of breaking existing users, but let's
assume they are not the majority and not serious users, or they would
already have found that multifd was not working.
With that, we can safely drop the check in ram_save_target_page() for using
multifd, because when multifd=on then compression=off, so the removed check
on save_page_use_compression() would always have returned false anyway.
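Spelling out that last step as a hedged sketch (the function names are as
in migration/ram.c, the reasoning is condensed into a comment):

    /*
     * After this patch, multifd=on implies compress=off, so
     * save_page_use_compression(rs) is false whenever this branch is
     * reachable.  The old condition
     *
     *   !save_page_use_compression(rs) && migrate_use_multifd() &&
     *   !migration_in_postcopy()
     *
     * therefore reduces to:
     */
    if (migrate_use_multifd() && !migration_in_postcopy()) {
        return ram_save_multifd_page(rs, block, offset);
    }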
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c |  7 +++++++
 migration/ram.c       | 11 +++++------
 2 files changed, 12 insertions(+), 6 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index 844bca1ff6..ef00bff0b3 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1349,6 +1349,13 @@ static bool migrate_caps_check(bool *cap_list,
         }
     }
 
+    if (cap_list[MIGRATION_CAPABILITY_MULTIFD]) {
+        if (cap_list[MIGRATION_CAPABILITY_COMPRESS]) {
+            error_setg(errp, "Multifd is not compatible with compress");
+            return false;
+        }
+    }
+
     return true;
 }
diff --git a/migration/ram.c b/migration/ram.c
index 1d42414ecc..1338e47665 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2305,13 +2305,12 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss)
     }
 
     /*
-     * Do not use multifd for:
-     * 1. Compression as the first page in the new block should be posted out
-     *    before sending the compressed page
-     * 2. In postcopy as one whole host page should be placed
+     * Do not use multifd in postcopy as one whole host page should be
+     * placed.  Meanwhile postcopy requires atomic update of pages, so even
+     * if host page size == guest page size the dest guest during run may
+     * still see partially copied pages which is data corruption.
      */
-    if (!save_page_use_compression(rs) && migrate_use_multifd()
-        && !migration_in_postcopy()) {
+    if (migrate_use_multifd() && !migration_in_postcopy()) {
         return ram_save_multifd_page(rs, block, offset);
     }
--
2.37.3
* Re: [PATCH v2 5/5] migration: Disable multifd explicitly with compression
2022-10-04 18:24 ` [PATCH v2 5/5] migration: Disable multifd explicitly with compression Peter Xu
@ 2022-10-05 10:49 ` Dr. David Alan Gilbert
2022-11-14 14:04 ` Juan Quintela
1 sibling, 0 replies; 12+ messages in thread
From: Dr. David Alan Gilbert @ 2022-10-05 10:49 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Daniel P. Berrange, Leonardo Bras Soares Passos,
	Juan Quintela
* Peter Xu (peterx@redhat.com) wrote:
> The multifd thread model does not work with compression, so explicitly
> disable the combination.
>
> Note that previously, even though both features could be enabled, nothing
> would go wrong: the compression code has higher priority, so the multifd
> feature was simply ignored. Now we fail earlier, at config time, so that
> the user is better aware of the consequence.
>
> Note that there is a slight chance of breaking existing users, but let's
> assume they are not the majority and not serious users, or they would
> already have found that multifd was not working.
>
> With that, we can safely drop the check in ram_save_target_page() for
> using multifd, because when multifd=on then compression=off, so the
> removed check on save_page_use_compression() would always have returned
> false anyway.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
Yes, and of course if they were trying to use it, they should have been
trying to use the multifd-compression parameter instead, which is different
code.
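(For reference, that parameter compresses pages inside the multifd channels
themselves; in HMP, something like "migrate_set_parameter
multifd-compression zlib" would be the supported combination.)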
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  migration/migration.c |  7 +++++++
>  migration/ram.c       | 11 +++++------
>  2 files changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 844bca1ff6..ef00bff0b3 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1349,6 +1349,13 @@ static bool migrate_caps_check(bool *cap_list,
>          }
>      }
>  
> +    if (cap_list[MIGRATION_CAPABILITY_MULTIFD]) {
> +        if (cap_list[MIGRATION_CAPABILITY_COMPRESS]) {
> +            error_setg(errp, "Multifd is not compatible with compress");
> +            return false;
> +        }
> +    }
> +
>      return true;
>  }
>  
> diff --git a/migration/ram.c b/migration/ram.c
> index 1d42414ecc..1338e47665 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -2305,13 +2305,12 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss)
>      }
>  
>      /*
> -     * Do not use multifd for:
> -     * 1. Compression as the first page in the new block should be posted out
> -     *    before sending the compressed page
> -     * 2. In postcopy as one whole host page should be placed
> +     * Do not use multifd in postcopy as one whole host page should be
> +     * placed.  Meanwhile postcopy requires atomic update of pages, so even
> +     * if host page size == guest page size the dest guest during run may
> +     * still see partially copied pages which is data corruption.
>      */
> -    if (!save_page_use_compression(rs) && migrate_use_multifd()
> -        && !migration_in_postcopy()) {
> +    if (migrate_use_multifd() && !migration_in_postcopy()) {
>          return ram_save_multifd_page(rs, block, offset);
>      }
>
> --
> 2.37.3
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
* Re: [PATCH v2 5/5] migration: Disable multifd explicitly with compression
2022-10-04 18:24 ` [PATCH v2 5/5] migration: Disable multifd explicitly with compression Peter Xu
2022-10-05 10:49 ` Dr. David Alan Gilbert
@ 2022-11-14 14:04 ` Juan Quintela
1 sibling, 0 replies; 12+ messages in thread
From: Juan Quintela @ 2022-11-14 14:04 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Daniel P. Berrange, Leonardo Bras Soares Passos,
	Dr. David Alan Gilbert
Peter Xu <peterx@redhat.com> wrote:
> The multifd thread model does not work with compression, so explicitly
> disable the combination.
>
> Note that previously, even though both features could be enabled, nothing
> would go wrong: the compression code has higher priority, so the multifd
> feature was simply ignored. Now we fail earlier, at config time, so that
> the user is better aware of the consequence.
>
> Note that there is a slight chance of breaking existing users, but let's
> assume they are not the majority and not serious users, or they would
> already have found that multifd was not working.
>
> With that, we can safely drop the check in ram_save_target_page() for
> using multifd, because when multifd=on then compression=off, so the
> removed check on save_page_use_compression() would always have returned
> false anyway.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>