* [PATCH 1/2] migration: Unify reset of last_rb on destination node when recover
2020-11-02 15:30 [PATCH 0/2] migration: Two extra fixes Peter Xu
@ 2020-11-02 15:30 ` Peter Xu
2020-11-02 18:23 ` Dr. David Alan Gilbert
2020-11-02 15:30 ` [PATCH 2/2] migration: Postpone the kick of the fault thread after recover Peter Xu
` (2 subsequent siblings)
3 siblings, 1 reply; 7+ messages in thread
From: Peter Xu @ 2020-11-02 15:30 UTC (permalink / raw)
To: qemu-devel
Cc: Peter Maydell, Christian Schoenebeck, Dr . David Alan Gilbert,
peterx
When postcopy recovery happens, we need to reset last_rb after each return of
postcopy_pause_fault_thread(), because a return means the postcopy migration
has just been resumed.

Unify this reset into the single place right before we kick the fault thread
again, i.e. when we receive the MIG_CMD_POSTCOPY_RESUME command from the
source.

This is actually more than a cleanup: the main thread on the destination can
now call migrate_send_rp_req_pages_pending() too, so the fault thread is no
longer the only user of last_rb. Moving the reset earlier allows the first
call to migrate_send_rp_req_pages_pending() to see the reset value even when
it is made from the main thread.

(NOTE: this is not the real fix for 0c26781c09 mentioned below; the tag is
only a mark that whoever picks up 0c26781c09 had better take this one too.
The real fix will come later.)
Fixes: 0c26781c09 ("migration: Sync requested pages after postcopy recovery")
Tested-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/postcopy-ram.c | 2 --
migration/savevm.c | 6 ++++++
2 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index d3bb3a744b..d99842eb1b 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -903,7 +903,6 @@ static void *postcopy_ram_fault_thread(void *opaque)
* the channel is rebuilt.
*/
if (postcopy_pause_fault_thread(mis)) {
- mis->last_rb = NULL;
/* Continue to read the userfaultfd */
} else {
error_report("%s: paused but don't allow to continue",
@@ -985,7 +984,6 @@ retry:
/* May be network failure, try to wait for recovery */
if (ret == -EIO && postcopy_pause_fault_thread(mis)) {
/* We got reconnected somehow, try to continue */
- mis->last_rb = NULL;
goto retry;
} else {
/* This is a unavoidable fault */
diff --git a/migration/savevm.c b/migration/savevm.c
index 21ccba9fb3..e8834991ec 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2061,6 +2061,12 @@ static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
return 0;
}
+ /*
+ * Reset the last_rb before we resend any page req to source again, since
+ * the source should have it reset already.
+ */
+ mis->last_rb = NULL;
+
/*
* This means source VM is ready to resume the postcopy migration.
* It's time to switch state and release the fault thread to
--
2.26.2
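For readers unfamiliar with the last_rb mechanism the patch touches, here is a
minimal, self-contained C sketch of the idea (the struct, function, and
variable names below are invented for illustration; this is not the QEMU
code): page requests on the return path carry the RAMBlock name only when it
differs from the previously requested block, and the source keeps its own copy
of last_rb to fill in omitted names, so after the channel is rebuilt both
sides must forget the cached block or a nameless request could be attributed
to the wrong RAMBlock.

/*
 * Hypothetical sketch, not the QEMU implementation: illustrates the
 * last_rb name-elision scheme that makes the reset necessary.
 */
#include <stdio.h>

struct ram_block { const char *idstr; };

static const struct ram_block *dst_last_rb;   /* destination-side cache */

static void send_page_request(const struct ram_block *rb, unsigned long start)
{
    if (rb != dst_last_rb) {
        dst_last_rb = rb;
        /* Full request: carries the block name. */
        printf("REQ_PAGES_ID name=%s start=0x%lx\n", rb->idstr, start);
    } else {
        /* Compact request: name omitted, source uses its own cached block. */
        printf("REQ_PAGES start=0x%lx\n", start);
    }
}

/* On recovery, both sides must forget the cached block. */
static void on_postcopy_recover(void)
{
    dst_last_rb = NULL;
}

int main(void)
{
    struct ram_block pc_ram = { "pc.ram" };

    send_page_request(&pc_ram, 0x1000);   /* carries "pc.ram" */
    send_page_request(&pc_ram, 0x2000);   /* compact, name elided */
    on_postcopy_recover();
    send_page_request(&pc_ram, 0x3000);   /* carries "pc.ram" again */
    return 0;
}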
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH 1/2] migration: Unify reset of last_rb on destination node when recover
2020-11-02 15:30 ` [PATCH 1/2] migration: Unify reset of last_rb on destination node when recover Peter Xu
@ 2020-11-02 18:23 ` Dr. David Alan Gilbert
0 siblings, 0 replies; 7+ messages in thread
From: Dr. David Alan Gilbert @ 2020-11-02 18:23 UTC (permalink / raw)
To: Peter Xu; +Cc: Peter Maydell, Christian Schoenebeck, qemu-devel
* Peter Xu (peterx@redhat.com) wrote:
> When postcopy recovery happens, we need to reset last_rb after each return of
> postcopy_pause_fault_thread(), because a return means the postcopy migration
> has just been resumed.
>
> Unify this reset into the single place right before we kick the fault thread
> again, i.e. when we receive the MIG_CMD_POSTCOPY_RESUME command from the
> source.
>
> This is actually more than a cleanup: the main thread on the destination can
> now call migrate_send_rp_req_pages_pending() too, so the fault thread is no
> longer the only user of last_rb. Moving the reset earlier allows the first
> call to migrate_send_rp_req_pages_pending() to see the reset value even when
> it is made from the main thread.
>
> (NOTE: this is not the real fix for 0c26781c09 mentioned below; the tag is
> only a mark that whoever picks up 0c26781c09 had better take this one too.
> The real fix will come later.)
>
> Fixes: 0c26781c09 ("migration: Sync requested pages after postcopy recovery")
> Tested-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
> migration/postcopy-ram.c | 2 --
> migration/savevm.c | 6 ++++++
> 2 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index d3bb3a744b..d99842eb1b 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -903,7 +903,6 @@ static void *postcopy_ram_fault_thread(void *opaque)
> * the channel is rebuilt.
> */
> if (postcopy_pause_fault_thread(mis)) {
> - mis->last_rb = NULL;
> /* Continue to read the userfaultfd */
> } else {
> error_report("%s: paused but don't allow to continue",
> @@ -985,7 +984,6 @@ retry:
> /* May be network failure, try to wait for recovery */
> if (ret == -EIO && postcopy_pause_fault_thread(mis)) {
> /* We got reconnected somehow, try to continue */
> - mis->last_rb = NULL;
> goto retry;
> } else {
> /* This is a unavoidable fault */
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 21ccba9fb3..e8834991ec 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2061,6 +2061,12 @@ static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
> return 0;
> }
>
> + /*
> + * Reset the last_rb before we resend any page req to source again, since
> + * the source should have it reset already.
> + */
> + mis->last_rb = NULL;
> +
> /*
> * This means source VM is ready to resume the postcopy migration.
> * It's time to switch state and release the fault thread to
> --
> 2.26.2
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH 2/2] migration: Postpone the kick of the fault thread after recover
2020-11-02 15:30 [PATCH 0/2] migration: Two extra fixes Peter Xu
2020-11-02 15:30 ` [PATCH 1/2] migration: Unify reset of last_rb on destination node when recover Peter Xu
@ 2020-11-02 15:30 ` Peter Xu
2020-11-02 18:24 ` Dr. David Alan Gilbert
2020-11-02 18:26 ` [PATCH 0/2] migration: Two extra fixes Dr. David Alan Gilbert
2020-11-02 18:26 ` Dr. David Alan Gilbert
3 siblings, 1 reply; 7+ messages in thread
From: Peter Xu @ 2020-11-02 15:30 UTC (permalink / raw)
To: qemu-devel
Cc: Peter Maydell, Christian Schoenebeck, Dr . David Alan Gilbert,
peterx
The new migrate_send_rp_req_pages_pending() call should greatly improve
destination responsiveness, because it resyncs the faulted addresses after
postcopy recovery. However, it is also the first place a page request is
initiated from the main thread.

One thing was overlooked: migrate_send_rp_message_req_pages() is not designed
to be thread-safe. So if we wake the fault thread before the main thread has
synced all the faulted pages, the two threads can race on it.

Postpone the wake-up until after the sync of faulted addresses.
Fixes: 0c26781c09 ("migration: Sync requested pages after postcopy recovery")
Tested-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/savevm.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/migration/savevm.c b/migration/savevm.c
index e8834991ec..5f937a2762 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2069,12 +2069,9 @@ static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
/*
* This means source VM is ready to resume the postcopy migration.
- * It's time to switch state and release the fault thread to
- * continue service page faults.
*/
migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_RECOVER,
MIGRATION_STATUS_POSTCOPY_ACTIVE);
- qemu_sem_post(&mis->postcopy_pause_sem_fault);
trace_loadvm_postcopy_handle_resume();
@@ -2095,6 +2092,14 @@ static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
*/
migrate_send_rp_req_pages_pending(mis);
+ /*
+ * It's time to switch state and release the fault thread to continue
+ * service page faults. Note that this should be explicitly after the
+ * above call to migrate_send_rp_req_pages_pending(). In short:
+ * migrate_send_rp_message_req_pages() is not thread safe, yet.
+ */
+ qemu_sem_post(&mis->postcopy_pause_sem_fault);
+
return 0;
}
--
2.26.2
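To illustrate the ordering the patch enforces, here is a minimal,
self-contained C sketch (the names below are invented for illustration; this
is not the QEMU code): the resume path first flushes its own pending page
requests over the shared, unlocked return-path writer, and only then posts the
semaphore that lets the fault thread issue requests of its own, so the two
threads never write to the return path concurrently.

/*
 * Hypothetical sketch: the "kick" (semaphore post) is moved after the
 * main thread's own flush, mirroring the patch.  Build with -pthread.
 */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static sem_t resume_sem;            /* stands in for the fault-thread kick */

static void send_req_pages(const char *who, unsigned long addr)
{
    /* Unsynchronized multi-step write: interleaving would corrupt it. */
    printf("[%s] header\n", who);
    printf("[%s] payload addr=0x%lx\n", who, addr);
}

static void *fault_thread(void *arg)
{
    (void)arg;
    sem_wait(&resume_sem);          /* stays paused until kicked */
    send_req_pages("fault-thread", 0x4000);
    return NULL;
}

int main(void)
{
    pthread_t th;

    sem_init(&resume_sem, 0, 0);
    pthread_create(&th, NULL, fault_thread, NULL);

    /* Resume path: flush pending requests first ... */
    send_req_pages("main-thread", 0x1000);
    send_req_pages("main-thread", 0x2000);

    /* ... and only then release the fault thread. */
    sem_post(&resume_sem);

    pthread_join(th, NULL);
    sem_destroy(&resume_sem);
    return 0;
}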
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH 2/2] migration: Postpone the kick of the fault thread after recover
2020-11-02 15:30 ` [PATCH 2/2] migration: Postpone the kick of the fault thread after recover Peter Xu
@ 2020-11-02 18:24 ` Dr. David Alan Gilbert
0 siblings, 0 replies; 7+ messages in thread
From: Dr. David Alan Gilbert @ 2020-11-02 18:24 UTC (permalink / raw)
To: Peter Xu; +Cc: Peter Maydell, Christian Schoenebeck, qemu-devel
* Peter Xu (peterx@redhat.com) wrote:
> The new migrate_send_rp_req_pages_pending() call should greatly improve
> destination responsiveness, because it resyncs the faulted addresses after
> postcopy recovery. However, it is also the first place a page request is
> initiated from the main thread.
>
> One thing was overlooked: migrate_send_rp_message_req_pages() is not designed
> to be thread-safe. So if we wake the fault thread before the main thread has
> synced all the faulted pages, the two threads can race on it.
>
> Postpone the wake-up until after the sync of faulted addresses.
>
> Fixes: 0c26781c09 ("migration: Sync requested pages after postcopy recovery")
> Tested-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
> migration/savevm.c | 11 ++++++++---
> 1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/migration/savevm.c b/migration/savevm.c
> index e8834991ec..5f937a2762 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2069,12 +2069,9 @@ static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
>
> /*
> * This means source VM is ready to resume the postcopy migration.
> - * It's time to switch state and release the fault thread to
> - * continue service page faults.
> */
> migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_RECOVER,
> MIGRATION_STATUS_POSTCOPY_ACTIVE);
> - qemu_sem_post(&mis->postcopy_pause_sem_fault);
>
> trace_loadvm_postcopy_handle_resume();
>
> @@ -2095,6 +2092,14 @@ static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
> */
> migrate_send_rp_req_pages_pending(mis);
>
> + /*
> + * It's time to switch state and release the fault thread to continue
> + * service page faults. Note that this should be explicitly after the
> + * above call to migrate_send_rp_req_pages_pending(). In short:
> + * migrate_send_rp_message_req_pages() is not thread safe, yet.
> + */
> + qemu_sem_post(&mis->postcopy_pause_sem_fault);
> +
> return 0;
> }
>
> --
> 2.26.2
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 0/2] migration: Two extra fixes
2020-11-02 15:30 [PATCH 0/2] migration: Two extra fixes Peter Xu
2020-11-02 15:30 ` [PATCH 1/2] migration: Unify reset of last_rb on destination node when recover Peter Xu
2020-11-02 15:30 ` [PATCH 2/2] migration: Postpone the kick of the fault thread after recover Peter Xu
@ 2020-11-02 18:26 ` Dr. David Alan Gilbert
2020-11-02 18:26 ` Dr. David Alan Gilbert
3 siblings, 0 replies; 7+ messages in thread
From: Dr. David Alan Gilbert @ 2020-11-02 18:26 UTC (permalink / raw)
To: Peter Xu; +Cc: Peter Maydell, Christian Schoenebeck, qemu-devel
* Peter Xu (peterx@redhat.com) wrote:
> This should fix an intermittent hang of migration-test caused by the latest
> update to postcopy recovery.
>
> Thanks,
Queued
>
> Peter Xu (2):
> migration: Unify reset of last_rb on destination node when recover
> migration: Postpone the kick of the fault thread after recover
>
> migration/postcopy-ram.c | 2 --
> migration/savevm.c | 17 ++++++++++++++---
> 2 files changed, 14 insertions(+), 5 deletions(-)
>
> --
> 2.26.2
>
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 0/2] migration: Two extra fixes
2020-11-02 15:30 [PATCH 0/2] migration: Two extra fixes Peter Xu
` (2 preceding siblings ...)
2020-11-02 18:26 ` [PATCH 0/2] migration: Two extra fixes Dr. David Alan Gilbert
@ 2020-11-02 18:26 ` Dr. David Alan Gilbert
3 siblings, 0 replies; 7+ messages in thread
From: Dr. David Alan Gilbert @ 2020-11-02 18:26 UTC (permalink / raw)
To: Peter Xu; +Cc: Peter Maydell, Christian Schoenebeck, qemu-devel
* Peter Xu (peterx@redhat.com) wrote:
> This should fix an intermittent hang of migration-test caused by the latest
> update to postcopy recovery.
>
> Thanks,
Queued
>
> Peter Xu (2):
> migration: Unify reset of last_rb on destination node when recover
> migration: Postpone the kick of the fault thread after recover
>
> migration/postcopy-ram.c | 2 --
> migration/savevm.c | 17 ++++++++++++++---
> 2 files changed, 14 insertions(+), 5 deletions(-)
>
> --
> 2.26.2
>
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
^ permalink raw reply [flat|nested] 7+ messages in thread