* [PULL 0/6] migration queue
@ 2021-07-13 15:23 Dr. David Alan Gilbert (git)
2021-07-13 15:23 ` [PULL 1/6] migration/rdma: prevent from double free the same mr Dr. David Alan Gilbert (git)
` (6 more replies)
0 siblings, 7 replies; 8+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2021-07-13 15:23 UTC (permalink / raw)
To: qemu-devel, lizhijian, lvivier, peterx; +Cc: quintela
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
The following changes since commit 708f50199b59476ec4b45ebcdf171550086d6292:
Merge remote-tracking branch 'remotes/ericb/tags/pull-nbd-2021-07-09-v2' into staging (2021-07-13 14:32:20 +0100)
are available in the Git repository at:
https://gitlab.com/dagrh/qemu.git tags/pull-migration-20210713a
for you to fetch changes up to 63268c4970a5f126cc9af75f3ccb8057abef5ec0:
migration: Move bitmap_mutex out of migration_bitmap_clear_dirty() (2021-07-13 16:21:57 +0100)
----------------------------------------------------------------
Migration pull 2021-07-13
----------------------------------------------------------------
Laurent Vivier (1):
migration: failover: emit a warning when the card is not fully unplugged
Li Zhijian (1):
migration/rdma: prevent from double free the same mr
Peter Xu (4):
migration: Release return path early for paused postcopy
migration: Don't do migrate cleanup if during postcopy resume
migration: Clear error at entry of migrate_fd_connect()
migration: Move bitmap_mutex out of migration_bitmap_clear_dirty()
migration/migration.c | 41 ++++++++++++++++++++++++++++++++++++-----
migration/ram.c | 13 +++++++++++--
migration/rdma.c | 1 +
3 files changed, 48 insertions(+), 7 deletions(-)
* [PULL 1/6] migration/rdma: prevent from double free the same mr
From: Dr. David Alan Gilbert (git) @ 2021-07-13 15:23 UTC (permalink / raw)
To: qemu-devel, lizhijian, lvivier, peterx; +Cc: quintela
From: Li Zhijian <lizhijian@cn.fujitsu.com>
backtrace:
'0x00007ffff5f44ec2 in __ibv_dereg_mr_1_1 (mr=0x7fff1007d390) at /home/lizhijian/rdma-core/libibverbs/verbs.c:478
478 void *addr = mr->addr;
(gdb) bt
#0 0x00007ffff5f44ec2 in __ibv_dereg_mr_1_1 (mr=0x7fff1007d390) at /home/lizhijian/rdma-core/libibverbs/verbs.c:478
#1 0x0000555555891fcc in rdma_delete_block (block=<optimized out>, rdma=0x7fff38176010) at ../migration/rdma.c:691
#2 qemu_rdma_cleanup (rdma=0x7fff38176010) at ../migration/rdma.c:2365
#3 0x00005555558925b0 in qio_channel_rdma_close_rcu (rcu=0x555556b8b6c0) at ../migration/rdma.c:3073
#4 0x0000555555d652a3 in call_rcu_thread (opaque=opaque@entry=0x0) at ../util/rcu.c:281
#5 0x0000555555d5edf9 in qemu_thread_start (args=0x7fffe88bb4d0) at ../util/qemu-thread-posix.c:541
#6 0x00007ffff54c73f9 in start_thread () at /lib64/libpthread.so.0
#7 0x00007ffff53f3b03 in clone () at /lib64/libc.so.6 '
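The backtrace above shows ibv_dereg_mr() being reached a second time from the cleanup path for a region the registration error path had already released. A minimal sketch of the pattern the one-line fix relies on follows. It assumes the later cleanup skips NULL handles; `struct block`, `block_release()`, `rollback()`, `cleanup()` and `demo_double_release()` are hypothetical stand-ins, and free() stands in for ibv_dereg_mr(). None of this is QEMU or libibverbs code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a block holding a registered region handle. */
struct block {
    int *mr;                /* NULL means "not registered" */
};

static int release_calls;   /* counts real releases, for illustration only */

/* Release one handle; a NULL handle is skipped. */
static void block_release(struct block *b)
{
    if (b->mr) {
        free(b->mr);        /* stands in for ibv_dereg_mr() */
        b->mr = NULL;       /* forget the stale pointer after releasing it */
        release_calls++;
    }
}

/* Error-path rollback: undo the first n registrations, newest first. */
static void rollback(struct block *blocks, int n)
{
    for (int i = n - 1; i >= 0; i--) {
        block_release(&blocks[i]);
    }
}

/* Final cleanup: walks every block, including ones already rolled back. */
static void cleanup(struct block *blocks, int total)
{
    for (int i = 0; i < total; i++) {
        block_release(&blocks[i]);
    }
}

/* Register three blocks, roll back, then clean up; returns release count. */
static int demo_double_release(void)
{
    struct block blocks[3];

    release_calls = 0;
    for (int i = 0; i < 3; i++) {
        blocks[i].mr = malloc(sizeof(int));
        assert(blocks[i].mr);
    }
    rollback(blocks, 3);
    cleanup(blocks, 3);     /* safe only because rollback cleared mr */
    return release_calls;
}
```

Because the handle is set to NULL at the point of release, running both paths releases each region exactly once.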
Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
Message-Id: <20210708144521.1959614-1-lizhijian@cn.fujitsu.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
migration/rdma.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/migration/rdma.c b/migration/rdma.c
index 38a099f7ee..5c2d113aa9 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -1143,6 +1143,7 @@ static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma)
     for (i--; i >= 0; i--) {
         ibv_dereg_mr(local->block[i].mr);
+        local->block[i].mr = NULL;
         rdma->total_registrations--;
     }
--
2.31.1
* [PULL 2/6] migration: failover: emit a warning when the card is not fully unplugged
From: Dr. David Alan Gilbert (git) @ 2021-07-13 15:23 UTC (permalink / raw)
To: qemu-devel, lizhijian, lvivier, peterx; +Cc: quintela
From: Laurent Vivier <lvivier@redhat.com>
When the migration fails or is canceled, we wait for the unplug
operation to end so that the card can be plugged back in. But if the
unplug operation never finishes, we stop waiting and QEMU emits a
warning to inform the user.
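The bounded wait followed by a warning can be sketched in isolation as below. This is only an illustration under stated assumptions: `struct fake_device`, `unplug_pending()`, `wait_one_tick()` and `wait_unplug_bounded()` are hypothetical names, and a tick counter replaces the timed semaphore wait used in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical device: the unplug completes once ticks_left reaches 0. */
struct fake_device {
    int ticks_left;
};

static bool unplug_pending(const struct fake_device *d)
{
    return d->ticks_left > 0;
}

/* Stands in for one timed wait; the guest makes one step of progress. */
static void wait_one_tick(struct fake_device *d)
{
    if (d->ticks_left > 0) {
        d->ticks_left--;
    }
}

/* Wait at most `timeout` ticks; returns true if a warning was emitted. */
static bool wait_unplug_bounded(struct fake_device *d, int timeout)
{
    while (timeout-- && unplug_pending(d)) {
        wait_one_tick(d);
    }
    if (unplug_pending(d)) {
        /* Still pending after the budget ran out: tell the user. */
        fprintf(stderr, "warning: partially unplugged device on failure\n");
        return true;
    }
    return false;
}
```

A device that needs 2 ticks with a budget of 3 finishes silently, while one that needs 5 ticks with the same budget triggers the warning.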
Based-on: 20210629155007.629086-1-lvivier@redhat.com
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Message-Id: <20210701131458.112036-1-lvivier@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
migration/migration.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/migration/migration.c b/migration/migration.c
index 5ff7ba9d5c..d717cd089a 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3701,6 +3701,10 @@ static void qemu_savevm_wait_unplug(MigrationState *s, int old_state,
         while (timeout-- && qemu_savevm_state_guest_unplug_pending()) {
             qemu_sem_timedwait(&s->wait_unplug_sem, 250);
         }
+        if (qemu_savevm_state_guest_unplug_pending()) {
+            warn_report("migration: partially unplugged device on "
+                        "failure");
+        }
     }
     migrate_set_state(&s->state, MIGRATION_STATUS_WAIT_UNPLUG, new_state);
--
2.31.1
* [PULL 3/6] migration: Release return path early for paused postcopy
From: Dr. David Alan Gilbert (git) @ 2021-07-13 15:23 UTC (permalink / raw)
To: qemu-devel, lizhijian, lvivier, peterx; +Cc: quintela
From: Peter Xu <peterx@redhat.com>
When a postcopy pause is triggered, we rely on the migration thread to clean
up the to_dst_file handle, and on the return path thread to clean up the
from_dst_file handle (which is stored in the local variable "rp").
Within that process, the from_dst_file cleanup (qemu_fclose) is postponed
until the handle is set up again by a postcopy recovery.
This used to work before yank was born; since yank was introduced, we rely on
the refcount of the IOC to correctly unregister the yank function in
channel_close(). Without an early, timely release of the from_dst_file handle,
the yank function is left over during paused postcopy.
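A toy model of why the late close matters is sketched below. It is not the yank API and does not model IOC refcounts: `struct registry`, `struct handle` and the helper functions are hypothetical, and a simple counter stands in for the list of registered yank functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical registry: counts callbacks still registered. */
struct registry {
    int registered;
};

/* Hypothetical file handle that keeps one callback registered while open. */
struct handle {
    struct registry *reg;
    bool open;
};

static void handle_open(struct handle *h, struct registry *reg)
{
    h->reg = reg;
    h->open = true;
    reg->registered++;
}

static void handle_close(struct handle *h)
{
    if (h->open) {
        h->open = false;
        h->reg->registered--;
    }
}

/* Mirrors the idea of asserting that no callbacks are left over. */
static bool registry_is_empty(const struct registry *reg)
{
    return reg->registered == 0;
}

/*
 * Simulate a pause. With early_release the old handle is closed before
 * pausing; without it the close is postponed until after the pause.
 * Returns whether the registry is empty while paused.
 */
static bool paused_state_is_clean(bool early_release)
{
    struct registry reg = { 0 };
    struct handle rp;
    bool clean;

    handle_open(&rp, &reg);
    if (early_release) {
        handle_close(&rp);
    }
    clean = registry_is_empty(&reg);    /* observed during the pause */
    handle_close(&rp);                  /* late close; no-op if closed */
    return clean;
}
```

With the early release the registry is already empty during the pause, which is the state the assertion in the crash report expects.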
Without this patch, the steps below (quoted from Xiaohui) can crash qemu on
the src host:
1. Boot the vm on the src host.
2. Boot the vm on the dst host.
3. Enable postcopy on the src and dst hosts.
4. Load stressapptest in the vm and set the postcopy speed to 50M.
5. Start migration from the src to the dst host; switch to postcopy mode while migration is active.
6. While postcopy is active, take down the network card used for the migration on the dst host.
7. Wait until postcopy is paused on the src and dst hosts.
8. Before bringing the network card back up, recover migration on the dst host; this reports an error.
9. Ignore the error from step 8 and go on to recover migration on the src host.
After step 9, qemu on the src host dumps core after a few seconds:
qemu-kvm: ../util/yank.c:107: yank_unregister_instance: Assertion `QLIST_EMPTY(&entry->yankfns)' failed.
1.sh: line 38: 44662 Aborted (core dumped)
Reported-by: Li Xiaohui <xiaohli@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210708190653.252961-2-peterx@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
migration/migration.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index d717cd089a..38ebc6c1ab 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2818,12 +2818,12 @@ out:
              * Maybe there is something we can do: it looks like a
              * network down issue, and we pause for a recovery.
              */
+            qemu_fclose(rp);
+            ms->rp_state.from_dst_file = NULL;
+            rp = NULL;
             if (postcopy_pause_return_path_thread(ms)) {
                 /* Reload rp, reset the rest */
-                if (rp != ms->rp_state.from_dst_file) {
-                    qemu_fclose(rp);
-                    rp = ms->rp_state.from_dst_file;
-                }
+                rp = ms->rp_state.from_dst_file;
                 ms->rp_state.error = false;
                 goto retry;
             }
--
2.31.1
* [PULL 4/6] migration: Don't do migrate cleanup if during postcopy resume
From: Dr. David Alan Gilbert (git) @ 2021-07-13 15:23 UTC (permalink / raw)
To: qemu-devel, lizhijian, lvivier, peterx; +Cc: quintela
From: Peter Xu <peterx@redhat.com>
The process below could crash qemu with postcopy recovery:
1. (hmp) migrate -d ..
2. (hmp) migrate_start_postcopy
3. [network down, postcopy paused]
4. (hmp) migrate -r $WRONG_PORT
   when trying to recover on an invalid $WRONG_PORT, cleanup_bh will be cleared
5. (hmp) migrate -r $RIGHT_PORT
   [qemu crashes on assert(cleanup_bh)]
The point is that we shouldn't clean up if it's a postcopy resume; the error
is set mostly because the channel is wrong, so we return directly and wait for
the user to retry.
migrate_fd_cleanup() should only be called when migration is cancelled or
completed.
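The sequence above can be modelled with a small state struct. This is a simplified sketch, not the QEMU code: `struct fake_migration`, `fake_cleanup()`, `fake_connect()` and `retry_is_safe()` are hypothetical, and a boolean stands in for the cleanup_bh pointer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily reduced migration state. */
struct fake_migration {
    bool postcopy_paused;
    bool cleanup_bh;        /* stands in for a non-NULL cleanup_bh */
};

static void fake_cleanup(struct fake_migration *m)
{
    m->cleanup_bh = false;  /* cleanup tears the bottom half down */
}

/* One connect attempt; returns true if the channel was usable. */
static bool fake_connect(struct fake_migration *m, bool channel_error,
                         bool skip_cleanup_on_resume)
{
    bool resume = m->postcopy_paused;

    if (resume) {
        assert(m->cleanup_bh);      /* a resume expects existing state */
    } else {
        m->cleanup_bh = true;
    }
    if (channel_error) {
        if (resume && skip_cleanup_on_resume) {
            /* Only report the error; keep state for the user's retry. */
        } else {
            fake_cleanup(m);
        }
        return false;
    }
    m->postcopy_paused = false;
    return true;
}

/* Fail one resume on a bad channel; report whether a retry would be safe. */
static bool retry_is_safe(bool skip_cleanup_on_resume)
{
    struct fake_migration m = { .postcopy_paused = true, .cleanup_bh = true };

    fake_connect(&m, true, skip_cleanup_on_resume);     /* wrong port */
    return m.cleanup_bh;    /* what the next resume would assert on */
}
```

When cleanup is skipped for a failed resume, the state the next resume attempt asserts on is still present.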
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210708190653.252961-3-peterx@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
migration/migration.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/migration/migration.c b/migration/migration.c
index 38ebc6c1ab..20c48cfff1 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3979,7 +3979,18 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
     }
     if (error_in) {
         migrate_fd_error(s, error_in);
-        migrate_fd_cleanup(s);
+        if (resume) {
+            /*
+             * Don't do cleanup for resume if channel is invalid, but only dump
+             * the error.  We wait for another channel connect from the user.
+             * The error_report still gives HMP user a hint on what failed.
+             * It's normally done in migrate_fd_cleanup(), but call it here
+             * explicitly.
+             */
+            error_report_err(error_copy(s->error));
+        } else {
+            migrate_fd_cleanup(s);
+        }
         return;
     }
--
2.31.1
* [PULL 5/6] migration: Clear error at entry of migrate_fd_connect()
From: Dr. David Alan Gilbert (git) @ 2021-07-13 15:23 UTC (permalink / raw)
To: qemu-devel, lizhijian, lvivier, peterx; +Cc: quintela
From: Peter Xu <peterx@redhat.com>
For each "migrate" command, remember to clear s->error before going on.
One reason is that a new error can then still be recorded; see
migrate_set_error(), which only sets the error if s->error is NULL. Meanwhile,
if a previously failed migration completes (e.g., postcopy recovered and
finished), we shouldn't dump an error when migrate_fd_cleanup() is finally
called.
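The interaction between a "first error wins" setter and clearing at entry can be sketched as follows. The names `struct fake_state`, `fake_set_error()`, `fake_error_free()` and `fake_attempt()` are hypothetical, a plain C string replaces the Error object, and the sketch omits the error_mutex locking that the real helper performs.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical state holding at most one recorded error message. */
struct fake_state {
    char *error;
};

static char *dup_string(const char *s)
{
    size_t n = strlen(s) + 1;
    char *p = malloc(n);

    assert(p);
    memcpy(p, s, n);
    return p;
}

/* Keeps only the first error, as described for migrate_set_error(). */
static void fake_set_error(struct fake_state *s, const char *msg)
{
    if (!s->error) {
        s->error = dup_string(msg);
    }
}

/* Drop any recorded error so the next one can be stored. */
static void fake_error_free(struct fake_state *s)
{
    free(s->error);
    s->error = NULL;
}

/* One attempt: optionally clear at entry, then record msg if non-NULL. */
static void fake_attempt(struct fake_state *s, const char *msg,
                         int clear_at_entry)
{
    if (clear_at_entry) {
        fake_error_free(s);
    }
    if (msg) {
        fake_set_error(s, msg);
    }
}
```

With clearing at entry, the second failure replaces the first and a later successful attempt leaves no stale error behind; without it, the first message sticks.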
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210708190653.252961-4-peterx@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
migration/migration.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/migration/migration.c b/migration/migration.c
index 20c48cfff1..2d306582eb 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1855,6 +1855,15 @@ void migrate_set_error(MigrationState *s, const Error *error)
     }
 }
+static void migrate_error_free(MigrationState *s)
+{
+    QEMU_LOCK_GUARD(&s->error_mutex);
+    if (s->error) {
+        error_free(s->error);
+        s->error = NULL;
+    }
+}
+
 void migrate_fd_error(MigrationState *s, const Error *error)
 {
     trace_migrate_fd_error(error_get_pretty(error));
@@ -3970,6 +3979,13 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
     int64_t rate_limit;
     bool resume = s->state == MIGRATION_STATUS_POSTCOPY_PAUSED;
+    /*
+     * If there's a previous error, free it and prepare for another one.
+     * Meanwhile if migration completes successfully, there won't have an error
+     * dumped when calling migrate_fd_cleanup().
+     */
+    migrate_error_free(s);
+
     s->expected_downtime = s->parameters.downtime_limit;
     if (resume) {
         assert(s->cleanup_bh);
--
2.31.1
* [PULL 6/6] migration: Move bitmap_mutex out of migration_bitmap_clear_dirty()
From: Dr. David Alan Gilbert (git) @ 2021-07-13 15:23 UTC (permalink / raw)
To: qemu-devel, lizhijian, lvivier, peterx; +Cc: quintela
From: Peter Xu <peterx@redhat.com>
Taking the mutex for every dirty bit to clear is too slow, especially since
we take and release it even if the dirty bit is already clear. So far it's
only used to synchronize qemu_guest_free_page_hint() against the migration
thread in special cases, nothing really serious yet. Let's move the lock up
to the callers.
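The cost difference between locking per bit and locking once around the scan can be illustrated with a counting stand-in for the mutex. Everything here is hypothetical: `struct counted_lock` only counts acquisitions and provides no real mutual exclusion, and `clear_dirty()` is a plain test-and-clear on a byte array rather than QEMU's bitmap helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock that only counts how often it is taken. */
struct counted_lock {
    bool held;
    unsigned long acquisitions;
};

static void lock_acquire(struct counted_lock *l)
{
    assert(!l->held);
    l->held = true;
    l->acquisitions++;
}

static void lock_release(struct counted_lock *l)
{
    assert(l->held);
    l->held = false;
}

/* Test-and-clear one bit; the caller is expected to hold the lock. */
static bool clear_dirty(unsigned char *bitmap, size_t page)
{
    unsigned char mask = (unsigned char)(1u << (page % 8));
    bool was_dirty = (bitmap[page / 8] & mask) != 0;

    bitmap[page / 8] &= (unsigned char)~mask;
    return was_dirty;
}

/* Before: the lock is taken and dropped for every page. */
static size_t scan_lock_per_page(struct counted_lock *l,
                                 unsigned char *bitmap, size_t npages)
{
    size_t cleared = 0;

    for (size_t p = 0; p < npages; p++) {
        lock_acquire(l);
        cleared += clear_dirty(bitmap, p);
        lock_release(l);
    }
    return cleared;
}

/* After: the caller takes the lock once around the whole scan. */
static size_t scan_lock_once(struct counted_lock *l,
                             unsigned char *bitmap, size_t npages)
{
    size_t cleared = 0;

    lock_acquire(l);
    for (size_t p = 0; p < npages; p++) {
        cleared += clear_dirty(bitmap, p);
    }
    lock_release(l);
    return cleared;
}
```

Both variants clear the same bits, but the per-page variant pays one acquisition per page while the hoisted variant pays a single one, at the price of holding the lock for the whole scan.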
There are two callers of migration_bitmap_clear_dirty().
For migration, move it into ram_save_iterate(). With the help of the MAX_WAIT
logic, ram_save_iterate() runs for no more than about 50ms at a time, so we
take the lock once at its entry. It also means any call to
qemu_guest_free_page_hint() can be delayed; but it should be very rare, only
during migration, and I don't see a problem with it.
For COLO, move it up to colo_flush_ram_cache(). I think COLO forgot to take
that lock even when calling ramblock_sync_dirty_bitmap(), whereas
migration_bitmap_sync() is an example that takes it correctly. So let the
mutex cover both the ramblock_sync_dirty_bitmap() and
migration_bitmap_clear_dirty() calls.
It's even possible to drop the lock and use atomic operations on rb->bmap
and the variable migration_dirty_pages. I didn't do that, to stay on the safe
side; it's also not predictable whether frequent atomic ops would bring
overhead too, e.g. on huge VMs where they happen very often. When that really
comes, we can keep a local counter and periodically call atomic ops. Keep it
simple for now.
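The "local counter plus periodic atomic ops" idea mentioned as a possible future direction could look roughly like the sketch below. This patch does not implement it; `FLUSH_EVERY`, `shared_dirty_pages` and `clear_pages_batched()` are hypothetical names, and C11 <stdatomic.h> is used instead of QEMU's own atomic helpers.

```c
#include <assert.h>
#include <stdatomic.h>

#define FLUSH_EVERY 64UL    /* hypothetical batch size */

/* Hypothetical shared counter of pages that are still dirty. */
static atomic_ulong shared_dirty_pages;

/*
 * Account for npages cleared pages, touching the shared counter only
 * once per FLUSH_EVERY pages.  Returns how many atomic ops were issued.
 */
static unsigned long clear_pages_batched(unsigned long npages)
{
    unsigned long local = 0;
    unsigned long atomic_ops = 0;

    for (unsigned long p = 0; p < npages; p++) {
        local++;                        /* cheap, thread-local bookkeeping */
        if (local == FLUSH_EVERY) {
            atomic_fetch_sub(&shared_dirty_pages, local);
            atomic_ops++;
            local = 0;
        }
    }
    if (local) {                        /* flush the remainder */
        atomic_fetch_sub(&shared_dirty_pages, local);
        atomic_ops++;
    }
    return atomic_ops;
}
```

The trade-off is that the shared counter lags behind by up to one batch between flushes, which a real implementation would need to tolerate.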
Cc: Wei Wang <wei.w.wang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hailiang Zhang <zhang.zhanghailiang@huawei.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Juan Quintela <quintela@redhat.com>
Cc: Leonardo Bras Soares Passos <lsoaresp@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210630200805.280905-1-peterx@redhat.com>
Reviewed-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
migration/ram.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/migration/ram.c b/migration/ram.c
index 88ff34f574..b5fc454b2f 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -795,8 +795,6 @@ static inline bool migration_bitmap_clear_dirty(RAMState *rs,
 {
     bool ret;
-    QEMU_LOCK_GUARD(&rs->bitmap_mutex);
-
     /*
      * Clear dirty bitmap if needed. This _must_ be called before we
      * send any of the page in the chunk because we need to make sure
@@ -2834,6 +2832,14 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
         goto out;
     }
+    /*
+     * We'll take this lock a little bit long, but it's okay for two reasons.
+     * Firstly, the only possible other thread to take it is who calls
+     * qemu_guest_free_page_hint(), which should be rare; secondly, see
+     * MAX_WAIT (if curious, further see commit 4508bd9ed8053ce) below, which
+     * guarantees that we'll at least released it in a regular basis.
+     */
+    qemu_mutex_lock(&rs->bitmap_mutex);
     WITH_RCU_READ_LOCK_GUARD() {
         if (ram_list.version != rs->last_version) {
             ram_state_reset(rs);
@@ -2893,6 +2899,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
             i++;
         }
     }
+    qemu_mutex_unlock(&rs->bitmap_mutex);
     /*
      * Must occur before EOS (or any QEMUFile operation)
@@ -3682,6 +3689,7 @@ void colo_flush_ram_cache(void)
     unsigned long offset = 0;
     memory_global_dirty_log_sync();
+    qemu_mutex_lock(&ram_state->bitmap_mutex);
     WITH_RCU_READ_LOCK_GUARD() {
         RAMBLOCK_FOREACH_NOT_IGNORED(block) {
             ramblock_sync_dirty_bitmap(ram_state, block);
@@ -3710,6 +3718,7 @@ void colo_flush_ram_cache(void)
         }
     }
     trace_colo_flush_ram_cache_end();
+    qemu_mutex_unlock(&ram_state->bitmap_mutex);
 }
 /**
--
2.31.1
* Re: [PULL 0/6] migration queue
From: Peter Maydell @ 2021-07-14 11:00 UTC (permalink / raw)
To: Dr. David Alan Gilbert (git)
Cc: Laurent Vivier, Peter Xu, QEMU Developers, Li Zhijian,
Juan Quintela
On Tue, 13 Jul 2021 at 16:25, Dr. David Alan Gilbert (git)
<dgilbert@redhat.com> wrote:
>
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> The following changes since commit 708f50199b59476ec4b45ebcdf171550086d6292:
>
> Merge remote-tracking branch 'remotes/ericb/tags/pull-nbd-2021-07-09-v2' into staging (2021-07-13 14:32:20 +0100)
>
> are available in the Git repository at:
>
> https://gitlab.com/dagrh/qemu.git tags/pull-migration-20210713a
>
> for you to fetch changes up to 63268c4970a5f126cc9af75f3ccb8057abef5ec0:
>
> migration: Move bitmap_mutex out of migration_bitmap_clear_dirty() (2021-07-13 16:21:57 +0100)
>
> ----------------------------------------------------------------
> Migration pull 2021-07-13
>
Applied, thanks.
Please update the changelog at https://wiki.qemu.org/ChangeLog/6.1
for any user-visible changes.
-- PMM