qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH 0/3] migration: Fix crash by cleaning up before quit
@ 2017-09-15  5:44 Fam Zheng
  2017-09-15  5:44 ` [Qemu-devel] [PATCH 1/3] migration: Allow ram_save_cleanup to be called with empty state Fam Zheng
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Fam Zheng @ 2017-09-15  5:44 UTC (permalink / raw)
  To: qemu-devel; +Cc: Dr. David Alan Gilbert, Juan Quintela, qemu-stable, peterx

Quit command causes asssertion failure in block layer due to op blockers added
by BMDS, if there is an active block migration.

Fixing this by calling migration cleaning up functions at the end of main()
before bdrv_close_all() is called.

Fam Zheng (3):
  migration: Allow ram_save_cleanup to be called with empty state
  migration: Cancel migration at exit
  iotests: Add "quit during block migration" case 195

 include/migration/misc.h   |  1 +
 migration/migration.c      |  7 +++-
 migration/ram.c            |  3 ++
 tests/qemu-iotests/195     | 97 ++++++++++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/195.out | 19 +++++++++
 tests/qemu-iotests/group   |  1 +
 vl.c                       |  3 ++
 7 files changed, 130 insertions(+), 1 deletion(-)
 create mode 100755 tests/qemu-iotests/195
 create mode 100644 tests/qemu-iotests/195.out

-- 
2.13.5

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Qemu-devel] [PATCH 1/3] migration: Allow ram_save_cleanup to be called with empty state
  2017-09-15  5:44 [Qemu-devel] [PATCH 0/3] migration: Fix crash by cleaning up before quit Fam Zheng
@ 2017-09-15  5:44 ` Fam Zheng
  2017-09-15  6:41   ` Peter Xu
  2017-09-15  5:44 ` [Qemu-devel] [PATCH 2/3] migration: Cancel migration at exit Fam Zheng
  2017-09-15  5:44 ` [Qemu-devel] [PATCH 3/3] iotests: Add "quit during block migration" case 195 Fam Zheng
  2 siblings, 1 reply; 19+ messages in thread
From: Fam Zheng @ 2017-09-15  5:44 UTC (permalink / raw)
  To: qemu-devel; +Cc: Dr. David Alan Gilbert, Juan Quintela, qemu-stable, peterx

So that we can do cleanup unconditionally at the end of main().

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 migration/ram.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/migration/ram.c b/migration/ram.c
index e18b3e2d4f..37e6a71241 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1365,6 +1365,9 @@ static void ram_save_cleanup(void *opaque)
     RAMState **rsp = opaque;
     RAMBlock *block;
 
+    if (!rsp || !*rsp) {
+        return;
+    }
     /* caller have hold iothread lock or is in a bh, so there is
      * no writing race against this migration_bitmap
      */
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [Qemu-devel] [PATCH 2/3] migration: Cancel migration at exit
  2017-09-15  5:44 [Qemu-devel] [PATCH 0/3] migration: Fix crash by cleaning up before quit Fam Zheng
  2017-09-15  5:44 ` [Qemu-devel] [PATCH 1/3] migration: Allow ram_save_cleanup to be called with empty state Fam Zheng
@ 2017-09-15  5:44 ` Fam Zheng
  2017-09-15  8:03   ` Peter Xu
  2017-09-15  5:44 ` [Qemu-devel] [PATCH 3/3] iotests: Add "quit during block migration" case 195 Fam Zheng
  2 siblings, 1 reply; 19+ messages in thread
From: Fam Zheng @ 2017-09-15  5:44 UTC (permalink / raw)
  To: qemu-devel; +Cc: Dr. David Alan Gilbert, Juan Quintela, qemu-stable, peterx

bdrv_close_all() would abort() due to op blockers added by BMDS, clean
up migration states when main loop quits to avoid that.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 include/migration/misc.h | 1 +
 migration/migration.c    | 7 ++++++-
 vl.c                     | 3 +++
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/migration/misc.h b/include/migration/misc.h
index c079b7771b..b9a26b0898 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -54,5 +54,6 @@ bool migration_has_failed(MigrationState *);
 /* ...and after the device transmission */
 bool migration_in_postcopy_after_devices(MigrationState *);
 void migration_global_dump(Monitor *mon);
+void migrate_cancel(void);
 
 #endif
diff --git a/migration/migration.c b/migration/migration.c
index 959e8ec88e..2c844945c7 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1274,11 +1274,16 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
     }
 }
 
-void qmp_migrate_cancel(Error **errp)
+void migrate_cancel(void)
 {
     migrate_fd_cancel(migrate_get_current());
 }
 
+void qmp_migrate_cancel(Error **errp)
+{
+    migrate_cancel();
+}
+
 void qmp_migrate_set_cache_size(int64_t value, Error **errp)
 {
     MigrationState *s = migrate_get_current();
diff --git a/vl.c b/vl.c
index fb1f05b937..abbe61f40b 100644
--- a/vl.c
+++ b/vl.c
@@ -87,6 +87,7 @@ int main(int argc, char **argv)
 #include "sysemu/blockdev.h"
 #include "hw/block/block.h"
 #include "migration/misc.h"
+#include "migration/savevm.h"
 #include "migration/snapshot.h"
 #include "migration/global_state.h"
 #include "sysemu/tpm.h"
@@ -4799,6 +4800,8 @@ int main(int argc, char **argv, char **envp)
     iothread_stop_all();
 
     pause_all_vcpus();
+    migrate_cancel();
+    qemu_savevm_state_cleanup();
     bdrv_close_all();
     res_free();
 
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [Qemu-devel] [PATCH 3/3] iotests: Add "quit during block migration" case 195
  2017-09-15  5:44 [Qemu-devel] [PATCH 0/3] migration: Fix crash by cleaning up before quit Fam Zheng
  2017-09-15  5:44 ` [Qemu-devel] [PATCH 1/3] migration: Allow ram_save_cleanup to be called with empty state Fam Zheng
  2017-09-15  5:44 ` [Qemu-devel] [PATCH 2/3] migration: Cancel migration at exit Fam Zheng
@ 2017-09-15  5:44 ` Fam Zheng
  2017-09-15 17:29   ` Eric Blake
  2 siblings, 1 reply; 19+ messages in thread
From: Fam Zheng @ 2017-09-15  5:44 UTC (permalink / raw)
  To: qemu-devel; +Cc: Dr. David Alan Gilbert, Juan Quintela, qemu-stable, peterx

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 tests/qemu-iotests/195     | 97 ++++++++++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/195.out | 19 +++++++++
 tests/qemu-iotests/group   |  1 +
 3 files changed, 117 insertions(+)
 create mode 100755 tests/qemu-iotests/195
 create mode 100644 tests/qemu-iotests/195.out

diff --git a/tests/qemu-iotests/195 b/tests/qemu-iotests/195
new file mode 100755
index 0000000000..f732b56854
--- /dev/null
+++ b/tests/qemu-iotests/195
@@ -0,0 +1,97 @@
+#!/bin/bash
+#
+# Test quit during block migration (migrate -b) doesn't crash
+#
+# Copyright 2017 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+# creator
+owner=famz@redhat.com
+
+seq=`basename $0`
+echo "QA output created by $seq"
+
+here=`pwd`
+status=1 # failure is the default!
+
+MIG_SOCKET="${TEST_DIR}/migrate"
+
+_cleanup()
+{
+    rm -f "${MIG_SOCKET}"
+    rm -f "${TEST_IMG}.dest"
+    _cleanup_test_img
+    _cleanup_qemu
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+. ./common.qemu
+
+_supported_fmt qcow2 raw qed dmg quorum
+_supported_proto file
+_supported_os Linux
+
+size=64M
+_make_test_img $size
+TEST_IMG="${TEST_IMG}.dest" _make_test_img $size
+
+echo
+echo === Starting VMs ===
+echo
+
+qemu_comm_method="qmp"
+
+_launch_qemu \
+    -drive file="${TEST_IMG}",cache=$CACHEMODE,driver=$IMGFMT,id=disk
+src=$QEMU_HANDLE
+_send_qemu_cmd $src "{ 'execute': 'qmp_capabilities' }" 'return'
+
+_launch_qemu \
+    -drive file="${TEST_IMG}.dest",cache=$CACHEMODE,driver=$IMGFMT,id=disk \
+    -incoming "unix:${MIG_SOCKET}"
+dest=$QEMU_HANDLE
+_send_qemu_cmd $dest "{ 'execute': 'qmp_capabilities' }" 'return'
+
+echo
+echo === Do block migration to destination ===
+echo
+
+reply="$(_send_qemu_cmd $src \
+    "{ 'execute': 'migrate',
+       'arguments': { 'uri': 'unix:${MIG_SOCKET}', 'blk': true } }" \
+    'return\|error')"
+echo "$reply"
+if echo "$reply" | grep "compiled without old-style" > /dev/null; then
+    _notrun "migrate -b support not compiled in"
+fi
+
+echo
+echo === Shut down and check image ===
+echo
+
+_send_qemu_cmd $src '{"execute":"quit"}' 'return'
+_cleanup_qemu
+
+_check_test_img
+TEST_IMG="${TEST_IMG}.dest" _check_test_img
+
+# success, all done
+echo "*** done"
+rm -f $seq.full
+status=0
diff --git a/tests/qemu-iotests/195.out b/tests/qemu-iotests/195.out
new file mode 100644
index 0000000000..36b2fc5e83
--- /dev/null
+++ b/tests/qemu-iotests/195.out
@@ -0,0 +1,19 @@
+QA output created by 195
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
+Formatting 'TEST_DIR/t.IMGFMT.dest', fmt=IMGFMT size=67108864
+
+=== Starting VMs ===
+
+{"return": {}}
+{"return": {}}
+
+=== Do block migration to destination ===
+
+{"return": {}}
+
+=== Shut down and check image ===
+
+{"return": {}}
+No errors were found on the image.
+No errors were found on the image.
+*** done
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index 94e764865a..e7e8fcc722 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -189,3 +189,4 @@
 190 rw auto quick
 192 rw auto quick
 194 rw auto migration quick
+195 rw auto migration quick
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH 1/3] migration: Allow ram_save_cleanup to be called with empty state
  2017-09-15  5:44 ` [Qemu-devel] [PATCH 1/3] migration: Allow ram_save_cleanup to be called with empty state Fam Zheng
@ 2017-09-15  6:41   ` Peter Xu
  2017-09-15  6:49     ` Fam Zheng
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Xu @ 2017-09-15  6:41 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-devel, Dr. David Alan Gilbert, Juan Quintela, qemu-stable

On Fri, Sep 15, 2017 at 01:44:02PM +0800, Fam Zheng wrote:
> So that we can do cleanup unconditionally at the end of main().
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  migration/ram.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index e18b3e2d4f..37e6a71241 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1365,6 +1365,9 @@ static void ram_save_cleanup(void *opaque)
>      RAMState **rsp = opaque;
>      RAMBlock *block;
>  
> +    if (!rsp || !*rsp) {
> +        return;
> +    }
>      /* caller have hold iothread lock or is in a bh, so there is
>       * no writing race against this migration_bitmap
>       */
> -- 
> 2.13.5
> 

Instead of take special care on RAM, how about check in
migrate_fd_cancel(), and return directly if migration_is_idle()?

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH 1/3] migration: Allow ram_save_cleanup to be called with empty state
  2017-09-15  6:41   ` Peter Xu
@ 2017-09-15  6:49     ` Fam Zheng
  2017-09-15  6:56       ` Peter Xu
  0 siblings, 1 reply; 19+ messages in thread
From: Fam Zheng @ 2017-09-15  6:49 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, Dr. David Alan Gilbert, Juan Quintela, qemu-stable

On Fri, 09/15 14:41, Peter Xu wrote:
> On Fri, Sep 15, 2017 at 01:44:02PM +0800, Fam Zheng wrote:
> > So that we can do cleanup unconditionally at the end of main().
> > 
> > Signed-off-by: Fam Zheng <famz@redhat.com>
> > ---
> >  migration/ram.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/migration/ram.c b/migration/ram.c
> > index e18b3e2d4f..37e6a71241 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -1365,6 +1365,9 @@ static void ram_save_cleanup(void *opaque)
> >      RAMState **rsp = opaque;
> >      RAMBlock *block;
> >  
> > +    if (!rsp || !*rsp) {
> > +        return;
> > +    }
> >      /* caller have hold iothread lock or is in a bh, so there is
> >       * no writing race against this migration_bitmap
> >       */
> > -- 
> > 2.13.5
> > 
> 
> Instead of take special care on RAM, how about check in
> migrate_fd_cancel(), and return directly if migration_is_idle()?

This is not from migrate_fd_cancel(), but from qemu_savevm_state_cleanup(), so
that doesn't work.

Fam

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH 1/3] migration: Allow ram_save_cleanup to be called with empty state
  2017-09-15  6:49     ` Fam Zheng
@ 2017-09-15  6:56       ` Peter Xu
  2017-09-15  7:02         ` Fam Zheng
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Xu @ 2017-09-15  6:56 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-devel, Dr. David Alan Gilbert, Juan Quintela, qemu-stable

On Fri, Sep 15, 2017 at 02:49:07PM +0800, Fam Zheng wrote:
> On Fri, 09/15 14:41, Peter Xu wrote:
> > On Fri, Sep 15, 2017 at 01:44:02PM +0800, Fam Zheng wrote:
> > > So that we can do cleanup unconditionally at the end of main().
> > > 
> > > Signed-off-by: Fam Zheng <famz@redhat.com>
> > > ---
> > >  migration/ram.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> > > 
> > > diff --git a/migration/ram.c b/migration/ram.c
> > > index e18b3e2d4f..37e6a71241 100644
> > > --- a/migration/ram.c
> > > +++ b/migration/ram.c
> > > @@ -1365,6 +1365,9 @@ static void ram_save_cleanup(void *opaque)
> > >      RAMState **rsp = opaque;
> > >      RAMBlock *block;
> > >  
> > > +    if (!rsp || !*rsp) {
> > > +        return;
> > > +    }
> > >      /* caller have hold iothread lock or is in a bh, so there is
> > >       * no writing race against this migration_bitmap
> > >       */
> > > -- 
> > > 2.13.5
> > > 
> > 
> > Instead of take special care on RAM, how about check in
> > migrate_fd_cancel(), and return directly if migration_is_idle()?
> 
> This is not from migrate_fd_cancel(), but from qemu_savevm_state_cleanup(), so
> that doesn't work.

Yeh I see the point.  But my logic still stands - we don't need to
cleanup anything if the migration is not really there.

I'm thinking whether we can put qemu_savevm_state_cleanup() into
migrate_fd_cancel() in some way, though I am still not 100% sure on
the colo part.  Anyway, I feel like a bit confusing we have two
cleanup functions.

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH 1/3] migration: Allow ram_save_cleanup to be called with empty state
  2017-09-15  6:56       ` Peter Xu
@ 2017-09-15  7:02         ` Fam Zheng
  2017-09-15  7:58           ` Peter Xu
  0 siblings, 1 reply; 19+ messages in thread
From: Fam Zheng @ 2017-09-15  7:02 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, Dr. David Alan Gilbert, Juan Quintela, qemu-stable

On Fri, 09/15 14:56, Peter Xu wrote:
> On Fri, Sep 15, 2017 at 02:49:07PM +0800, Fam Zheng wrote:
> > On Fri, 09/15 14:41, Peter Xu wrote:
> > > On Fri, Sep 15, 2017 at 01:44:02PM +0800, Fam Zheng wrote:
> > > > So that we can do cleanup unconditionally at the end of main().
> > > > 
> > > > Signed-off-by: Fam Zheng <famz@redhat.com>
> > > > ---
> > > >  migration/ram.c | 3 +++
> > > >  1 file changed, 3 insertions(+)
> > > > 
> > > > diff --git a/migration/ram.c b/migration/ram.c
> > > > index e18b3e2d4f..37e6a71241 100644
> > > > --- a/migration/ram.c
> > > > +++ b/migration/ram.c
> > > > @@ -1365,6 +1365,9 @@ static void ram_save_cleanup(void *opaque)
> > > >      RAMState **rsp = opaque;
> > > >      RAMBlock *block;
> > > >  
> > > > +    if (!rsp || !*rsp) {
> > > > +        return;
> > > > +    }
> > > >      /* caller have hold iothread lock or is in a bh, so there is
> > > >       * no writing race against this migration_bitmap
> > > >       */
> > > > -- 
> > > > 2.13.5
> > > > 
> > > 
> > > Instead of take special care on RAM, how about check in
> > > migrate_fd_cancel(), and return directly if migration_is_idle()?
> > 
> > This is not from migrate_fd_cancel(), but from qemu_savevm_state_cleanup(), so
> > that doesn't work.
> 
> Yeh I see the point.  But my logic still stands - we don't need to
> cleanup anything if the migration is not really there.
> 
> I'm thinking whether we can put qemu_savevm_state_cleanup() into
> migrate_fd_cancel() in some way, though I am still not 100% sure on
> the colo part.  Anyway, I feel like a bit confusing we have two
> cleanup functions.

I agree, but I don't know what is the best way to clean this up: savevm and
migration seem a little independent from each other.

Fam

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH 1/3] migration: Allow ram_save_cleanup to be called with empty state
  2017-09-15  7:02         ` Fam Zheng
@ 2017-09-15  7:58           ` Peter Xu
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Xu @ 2017-09-15  7:58 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-devel, Dr. David Alan Gilbert, Juan Quintela, qemu-stable

On Fri, Sep 15, 2017 at 03:02:32PM +0800, Fam Zheng wrote:
> On Fri, 09/15 14:56, Peter Xu wrote:
> > On Fri, Sep 15, 2017 at 02:49:07PM +0800, Fam Zheng wrote:
> > > On Fri, 09/15 14:41, Peter Xu wrote:
> > > > On Fri, Sep 15, 2017 at 01:44:02PM +0800, Fam Zheng wrote:
> > > > > So that we can do cleanup unconditionally at the end of main().
> > > > > 
> > > > > Signed-off-by: Fam Zheng <famz@redhat.com>
> > > > > ---
> > > > >  migration/ram.c | 3 +++
> > > > >  1 file changed, 3 insertions(+)
> > > > > 
> > > > > diff --git a/migration/ram.c b/migration/ram.c
> > > > > index e18b3e2d4f..37e6a71241 100644
> > > > > --- a/migration/ram.c
> > > > > +++ b/migration/ram.c
> > > > > @@ -1365,6 +1365,9 @@ static void ram_save_cleanup(void *opaque)
> > > > >      RAMState **rsp = opaque;
> > > > >      RAMBlock *block;
> > > > >  
> > > > > +    if (!rsp || !*rsp) {
> > > > > +        return;
> > > > > +    }
> > > > >      /* caller have hold iothread lock or is in a bh, so there is
> > > > >       * no writing race against this migration_bitmap
> > > > >       */
> > > > > -- 
> > > > > 2.13.5
> > > > > 
> > > > 
> > > > Instead of take special care on RAM, how about check in
> > > > migrate_fd_cancel(), and return directly if migration_is_idle()?
> > > 
> > > This is not from migrate_fd_cancel(), but from qemu_savevm_state_cleanup(), so
> > > that doesn't work.
> > 
> > Yeh I see the point.  But my logic still stands - we don't need to
> > cleanup anything if the migration is not really there.
> > 
> > I'm thinking whether we can put qemu_savevm_state_cleanup() into
> > migrate_fd_cancel() in some way, though I am still not 100% sure on
> > the colo part.  Anyway, I feel like a bit confusing we have two
> > cleanup functions.
> 
> I agree, but I don't know what is the best way to clean this up: savevm and
> migration seem a little independent from each other.

After a 2nd thought I think this single patch is ok, at least it
allows qemu_savevm_state_cleanup() to be run without caring much about
migration state.  So:

Reviewed-by: Peter Xu <peterx@redhat.com>

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] migration: Cancel migration at exit
  2017-09-15  5:44 ` [Qemu-devel] [PATCH 2/3] migration: Cancel migration at exit Fam Zheng
@ 2017-09-15  8:03   ` Peter Xu
  2017-09-15  8:20     ` Fam Zheng
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Xu @ 2017-09-15  8:03 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-devel, Dr. David Alan Gilbert, Juan Quintela, qemu-stable

On Fri, Sep 15, 2017 at 01:44:03PM +0800, Fam Zheng wrote:
> bdrv_close_all() would abort() due to op blockers added by BMDS, clean
> up migration states when main loop quits to avoid that.
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  include/migration/misc.h | 1 +
>  migration/migration.c    | 7 ++++++-
>  vl.c                     | 3 +++
>  3 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/include/migration/misc.h b/include/migration/misc.h
> index c079b7771b..b9a26b0898 100644
> --- a/include/migration/misc.h
> +++ b/include/migration/misc.h
> @@ -54,5 +54,6 @@ bool migration_has_failed(MigrationState *);
>  /* ...and after the device transmission */
>  bool migration_in_postcopy_after_devices(MigrationState *);
>  void migration_global_dump(Monitor *mon);
> +void migrate_cancel(void);
>  
>  #endif
> diff --git a/migration/migration.c b/migration/migration.c
> index 959e8ec88e..2c844945c7 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1274,11 +1274,16 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
>      }
>  }
>  
> -void qmp_migrate_cancel(Error **errp)
> +void migrate_cancel(void)
>  {
>      migrate_fd_cancel(migrate_get_current());
>  }
>  
> +void qmp_migrate_cancel(Error **errp)
> +{
> +    migrate_cancel();
> +}
> +

Nit: I would prefer just call migrate_fd_cancel() below, since I don't
see much point to introduce migrate_cancel() which only calls
migrate_fd_cancel()...

>  void qmp_migrate_set_cache_size(int64_t value, Error **errp)
>  {
>      MigrationState *s = migrate_get_current();
> diff --git a/vl.c b/vl.c
> index fb1f05b937..abbe61f40b 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -87,6 +87,7 @@ int main(int argc, char **argv)
>  #include "sysemu/blockdev.h"
>  #include "hw/block/block.h"
>  #include "migration/misc.h"
> +#include "migration/savevm.h"
>  #include "migration/snapshot.h"
>  #include "migration/global_state.h"
>  #include "sysemu/tpm.h"
> @@ -4799,6 +4800,8 @@ int main(int argc, char **argv, char **envp)
>      iothread_stop_all();
>  
>      pause_all_vcpus();
> +    migrate_cancel();

IIUC this is an async cancel, so when reach here the migration thread
can still be alive.  Then...

> +    qemu_savevm_state_cleanup();

... Here calling qemu_savevm_state_cleanup() may be problematic if
migration thread has not yet quitted.

I'm thinking whether we should make migrate_fd_cancel() wait until the
migration thread finishes (state change to CANCELLED).  Then the
migration thread will do the cleanup, and here we can avoid calling
qemu_savevm_state_cleanup() as well.

>      bdrv_close_all();
>      res_free();
>  
> -- 
> 2.13.5
> 

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] migration: Cancel migration at exit
  2017-09-15  8:03   ` Peter Xu
@ 2017-09-15  8:20     ` Fam Zheng
  2017-09-15  8:37       ` Peter Xu
  2017-09-15  8:42       ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 19+ messages in thread
From: Fam Zheng @ 2017-09-15  8:20 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-stable, Juan Quintela, qemu-devel, Dr. David Alan Gilbert

On Fri, 09/15 16:03, Peter Xu wrote:
> On Fri, Sep 15, 2017 at 01:44:03PM +0800, Fam Zheng wrote:
> > bdrv_close_all() would abort() due to op blockers added by BMDS, clean
> > up migration states when main loop quits to avoid that.
> > 
> > Signed-off-by: Fam Zheng <famz@redhat.com>
> > ---
> >  include/migration/misc.h | 1 +
> >  migration/migration.c    | 7 ++++++-
> >  vl.c                     | 3 +++
> >  3 files changed, 10 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/migration/misc.h b/include/migration/misc.h
> > index c079b7771b..b9a26b0898 100644
> > --- a/include/migration/misc.h
> > +++ b/include/migration/misc.h
> > @@ -54,5 +54,6 @@ bool migration_has_failed(MigrationState *);
> >  /* ...and after the device transmission */
> >  bool migration_in_postcopy_after_devices(MigrationState *);
> >  void migration_global_dump(Monitor *mon);
> > +void migrate_cancel(void);
> >  
> >  #endif
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 959e8ec88e..2c844945c7 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -1274,11 +1274,16 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
> >      }
> >  }
> >  
> > -void qmp_migrate_cancel(Error **errp)
> > +void migrate_cancel(void)
> >  {
> >      migrate_fd_cancel(migrate_get_current());
> >  }
> >  
> > +void qmp_migrate_cancel(Error **errp)
> > +{
> > +    migrate_cancel();
> > +}
> > +
> 
> Nit: I would prefer just call migrate_fd_cancel() below, since I don't
> see much point to introduce migrate_cancel() which only calls
> migrate_fd_cancel()...

migrate_get_current() is a migration internal IMHO. But that can be moved to
migrate_fd_cancel() so the parameter is dropped.

> 
> >  void qmp_migrate_set_cache_size(int64_t value, Error **errp)
> >  {
> >      MigrationState *s = migrate_get_current();
> > diff --git a/vl.c b/vl.c
> > index fb1f05b937..abbe61f40b 100644
> > --- a/vl.c
> > +++ b/vl.c
> > @@ -87,6 +87,7 @@ int main(int argc, char **argv)
> >  #include "sysemu/blockdev.h"
> >  #include "hw/block/block.h"
> >  #include "migration/misc.h"
> > +#include "migration/savevm.h"
> >  #include "migration/snapshot.h"
> >  #include "migration/global_state.h"
> >  #include "sysemu/tpm.h"
> > @@ -4799,6 +4800,8 @@ int main(int argc, char **argv, char **envp)
> >      iothread_stop_all();
> >  
> >      pause_all_vcpus();
> > +    migrate_cancel();
> 
> IIUC this is an async cancel, so when reach here the migration thread
> can still be alive.  Then...
> 
> > +    qemu_savevm_state_cleanup();
> 
> ... Here calling qemu_savevm_state_cleanup() may be problematic if
> migration thread has not yet quitted.
> 
> I'm thinking whether we should make migrate_fd_cancel() wait until the
> migration thread finishes (state change to CANCELLED).  Then the
> migration thread will do the cleanup, and here we can avoid calling
> qemu_savevm_state_cleanup() as well.

But if the migration thread is stuck and CANCELLED is never reached, we'll hang
here?

Fam

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] migration: Cancel migration at exit
  2017-09-15  8:20     ` Fam Zheng
@ 2017-09-15  8:37       ` Peter Xu
  2017-09-15  8:42       ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 19+ messages in thread
From: Peter Xu @ 2017-09-15  8:37 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-stable, Juan Quintela, qemu-devel, Dr. David Alan Gilbert

On Fri, Sep 15, 2017 at 04:20:26PM +0800, Fam Zheng wrote:

[...]

> > >  void qmp_migrate_set_cache_size(int64_t value, Error **errp)
> > >  {
> > >      MigrationState *s = migrate_get_current();
> > > diff --git a/vl.c b/vl.c
> > > index fb1f05b937..abbe61f40b 100644
> > > --- a/vl.c
> > > +++ b/vl.c
> > > @@ -87,6 +87,7 @@ int main(int argc, char **argv)
> > >  #include "sysemu/blockdev.h"
> > >  #include "hw/block/block.h"
> > >  #include "migration/misc.h"
> > > +#include "migration/savevm.h"
> > >  #include "migration/snapshot.h"
> > >  #include "migration/global_state.h"
> > >  #include "sysemu/tpm.h"
> > > @@ -4799,6 +4800,8 @@ int main(int argc, char **argv, char **envp)
> > >      iothread_stop_all();
> > >  
> > >      pause_all_vcpus();
> > > +    migrate_cancel();
> > 
> > IIUC this is an async cancel, so when reach here the migration thread
> > can still be alive.  Then...
> > 
> > > +    qemu_savevm_state_cleanup();
> > 
> > ... Here calling qemu_savevm_state_cleanup() may be problematic if
> > migration thread has not yet quitted.
> > 
> > I'm thinking whether we should make migrate_fd_cancel() wait until the
> > migration thread finishes (state change to CANCELLED).  Then the
> > migration thread will do the cleanup, and here we can avoid calling
> > qemu_savevm_state_cleanup() as well.
> 
> But if the migration thread is stuck and CANCELLED is never reached, we'll hang
> here?

Maybe we can add timeout. Anyway, if it gets stuck, I do see it a
migration bug, and in that case I'm not sure force return after
timeout would be anything wiser.

Dave/Juan may have better idea though.

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] migration: Cancel migration at exit
  2017-09-15  8:20     ` Fam Zheng
  2017-09-15  8:37       ` Peter Xu
@ 2017-09-15  8:42       ` Dr. David Alan Gilbert
  2017-09-15  8:52         ` Fam Zheng
  1 sibling, 1 reply; 19+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-15  8:42 UTC (permalink / raw)
  To: Fam Zheng; +Cc: Peter Xu, qemu-stable, Juan Quintela, qemu-devel

* Fam Zheng (famz@redhat.com) wrote:
> On Fri, 09/15 16:03, Peter Xu wrote:
> > On Fri, Sep 15, 2017 at 01:44:03PM +0800, Fam Zheng wrote:
> > > bdrv_close_all() would abort() due to op blockers added by BMDS, clean
> > > up migration states when main loop quits to avoid that.
> > > 
> > > Signed-off-by: Fam Zheng <famz@redhat.com>
> > > ---
> > >  include/migration/misc.h | 1 +
> > >  migration/migration.c    | 7 ++++++-
> > >  vl.c                     | 3 +++
> > >  3 files changed, 10 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/include/migration/misc.h b/include/migration/misc.h
> > > index c079b7771b..b9a26b0898 100644
> > > --- a/include/migration/misc.h
> > > +++ b/include/migration/misc.h
> > > @@ -54,5 +54,6 @@ bool migration_has_failed(MigrationState *);
> > >  /* ...and after the device transmission */
> > >  bool migration_in_postcopy_after_devices(MigrationState *);
> > >  void migration_global_dump(Monitor *mon);
> > > +void migrate_cancel(void);
> > >  
> > >  #endif
> > > diff --git a/migration/migration.c b/migration/migration.c
> > > index 959e8ec88e..2c844945c7 100644
> > > --- a/migration/migration.c
> > > +++ b/migration/migration.c
> > > @@ -1274,11 +1274,16 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
> > >      }
> > >  }
> > >  
> > > -void qmp_migrate_cancel(Error **errp)
> > > +void migrate_cancel(void)
> > >  {
> > >      migrate_fd_cancel(migrate_get_current());
> > >  }
> > >  
> > > +void qmp_migrate_cancel(Error **errp)
> > > +{
> > > +    migrate_cancel();
> > > +}
> > > +
> > 
> > Nit: I would prefer just call migrate_fd_cancel() below, since I don't
> > see much point to introduce migrate_cancel() which only calls
> > migrate_fd_cancel()...
> 
> migrate_get_current() is a migration internal IMHO. But that can be moved to
> migrate_fd_cancel() so the parameter is dropped.
> 
> > 
> > >  void qmp_migrate_set_cache_size(int64_t value, Error **errp)
> > >  {
> > >      MigrationState *s = migrate_get_current();
> > > diff --git a/vl.c b/vl.c
> > > index fb1f05b937..abbe61f40b 100644
> > > --- a/vl.c
> > > +++ b/vl.c
> > > @@ -87,6 +87,7 @@ int main(int argc, char **argv)
> > >  #include "sysemu/blockdev.h"
> > >  #include "hw/block/block.h"
> > >  #include "migration/misc.h"
> > > +#include "migration/savevm.h"
> > >  #include "migration/snapshot.h"
> > >  #include "migration/global_state.h"
> > >  #include "sysemu/tpm.h"
> > > @@ -4799,6 +4800,8 @@ int main(int argc, char **argv, char **envp)
> > >      iothread_stop_all();
> > >  
> > >      pause_all_vcpus();
> > > +    migrate_cancel();
> > 
> > IIUC this is an async cancel, so when reach here the migration thread
> > can still be alive.  Then...
> > 
> > > +    qemu_savevm_state_cleanup();
> > 
> > ... Here calling qemu_savevm_state_cleanup() may be problematic if
> > migration thread has not yet quitted.
> > 
> > I'm thinking whether we should make migrate_fd_cancel() wait until the
> > migration thread finishes (state change to CANCELLED).  Then the
> > migration thread will do the cleanup, and here we can avoid calling
> > qemu_savevm_state_cleanup() as well.
> 
> But if the migration thread is stuck and CANCELLED is never reached, we'll hang
> here?

I'm not sure I see an easy fix; I agree with Peter that calling
migrate_cancel() followed by qemu_savevm_state_cleanup() is racy,
because the cancel just forces the state to CANCELLING before coming
back to you, and the migration thread asynchronously starts to
fail/cleanup.

migrate_cancel() can forcibly unblock some cases because it calls
shutdown(2) on the network fd, but there are other ways for a migration
to hang.

Having said that,  the migration thread does it's calls
to qemu_savevm_state_cleanup under the lock_iothread;
Do we have that lock at this point?

Dave

> Fam
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] migration: Cancel migration at exit
  2017-09-15  8:42       ` Dr. David Alan Gilbert
@ 2017-09-15  8:52         ` Fam Zheng
  2017-09-15  9:22           ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 19+ messages in thread
From: Fam Zheng @ 2017-09-15  8:52 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Peter Xu, qemu-stable, Juan Quintela, qemu-devel

On Fri, 09/15 09:42, Dr. David Alan Gilbert wrote:
> * Fam Zheng (famz@redhat.com) wrote:
> > On Fri, 09/15 16:03, Peter Xu wrote:
> > > On Fri, Sep 15, 2017 at 01:44:03PM +0800, Fam Zheng wrote:
> > > > bdrv_close_all() would abort() due to op blockers added by BMDS, clean
> > > > up migration states when main loop quits to avoid that.
> > > > 
> > > > Signed-off-by: Fam Zheng <famz@redhat.com>
> > > > ---
> > > >  include/migration/misc.h | 1 +
> > > >  migration/migration.c    | 7 ++++++-
> > > >  vl.c                     | 3 +++
> > > >  3 files changed, 10 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/include/migration/misc.h b/include/migration/misc.h
> > > > index c079b7771b..b9a26b0898 100644
> > > > --- a/include/migration/misc.h
> > > > +++ b/include/migration/misc.h
> > > > @@ -54,5 +54,6 @@ bool migration_has_failed(MigrationState *);
> > > >  /* ...and after the device transmission */
> > > >  bool migration_in_postcopy_after_devices(MigrationState *);
> > > >  void migration_global_dump(Monitor *mon);
> > > > +void migrate_cancel(void);
> > > >  
> > > >  #endif
> > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > index 959e8ec88e..2c844945c7 100644
> > > > --- a/migration/migration.c
> > > > +++ b/migration/migration.c
> > > > @@ -1274,11 +1274,16 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
> > > >      }
> > > >  }
> > > >  
> > > > -void qmp_migrate_cancel(Error **errp)
> > > > +void migrate_cancel(void)
> > > >  {
> > > >      migrate_fd_cancel(migrate_get_current());
> > > >  }
> > > >  
> > > > +void qmp_migrate_cancel(Error **errp)
> > > > +{
> > > > +    migrate_cancel();
> > > > +}
> > > > +
> > > 
> > > Nit: I would prefer just call migrate_fd_cancel() below, since I don't
> > > see much point to introduce migrate_cancel() which only calls
> > > migrate_fd_cancel()...
> > 
> > migrate_get_current() is a migration internal IMHO. But that can be moved to
> > migrate_fd_cancel() so the parameter is dropped.
> > 
> > > 
> > > >  void qmp_migrate_set_cache_size(int64_t value, Error **errp)
> > > >  {
> > > >      MigrationState *s = migrate_get_current();
> > > > diff --git a/vl.c b/vl.c
> > > > index fb1f05b937..abbe61f40b 100644
> > > > --- a/vl.c
> > > > +++ b/vl.c
> > > > @@ -87,6 +87,7 @@ int main(int argc, char **argv)
> > > >  #include "sysemu/blockdev.h"
> > > >  #include "hw/block/block.h"
> > > >  #include "migration/misc.h"
> > > > +#include "migration/savevm.h"
> > > >  #include "migration/snapshot.h"
> > > >  #include "migration/global_state.h"
> > > >  #include "sysemu/tpm.h"
> > > > @@ -4799,6 +4800,8 @@ int main(int argc, char **argv, char **envp)
> > > >      iothread_stop_all();
> > > >  
> > > >      pause_all_vcpus();
> > > > +    migrate_cancel();
> > > 
> > > IIUC this is an async cancel, so when reach here the migration thread
> > > can still be alive.  Then...
> > > 
> > > > +    qemu_savevm_state_cleanup();
> > > 
> > > ... Here calling qemu_savevm_state_cleanup() may be problematic if
> > > migration thread has not yet quitted.
> > > 
> > > I'm thinking whether we should make migrate_fd_cancel() wait until the
> > > migration thread finishes (state change to CANCELLED).  Then the
> > > migration thread will do the cleanup, and here we can avoid calling
> > > qemu_savevm_state_cleanup() as well.
> > 
> > But if the migration thread is stuck and CANCELLED is never reached, we'll hang
> > here?
> 
> I'm not sure I see an easy fix; I agree with Peter that calling
> migrate_cancel() followed by qemu_savevm_state_cleanup() is racy,
> because the cancel just forces the state to CANCELLING before coming
> back to you, and the migration thread asynchronously starts to
> fail/cleanup.
> 
> migrate_cancel() can forcibly unblock some cases because it calls
> shutdown(2) on the network fd, but there are other ways for a migration
> to hang.
> 
> Having said that,  the migration thread does it's calls
> to qemu_savevm_state_cleanup under the lock_iothread;
> Do we have that lock at this point?

Yes we do.  Main loop releases the lock only during poll(), other parts of
main() all have the lock.

Fam

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] migration: Cancel migration at exit
  2017-09-15  8:52         ` Fam Zheng
@ 2017-09-15  9:22           ` Dr. David Alan Gilbert
  2017-09-18  7:31             ` Peter Xu
  0 siblings, 1 reply; 19+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-15  9:22 UTC (permalink / raw)
  To: Fam Zheng; +Cc: Peter Xu, qemu-stable, Juan Quintela, qemu-devel

* Fam Zheng (famz@redhat.com) wrote:
> On Fri, 09/15 09:42, Dr. David Alan Gilbert wrote:
> > * Fam Zheng (famz@redhat.com) wrote:
> > > On Fri, 09/15 16:03, Peter Xu wrote:
> > > > On Fri, Sep 15, 2017 at 01:44:03PM +0800, Fam Zheng wrote:
> > > > > bdrv_close_all() would abort() due to op blockers added by BMDS, clean
> > > > > up migration states when main loop quits to avoid that.
> > > > > 
> > > > > Signed-off-by: Fam Zheng <famz@redhat.com>
> > > > > ---
> > > > >  include/migration/misc.h | 1 +
> > > > >  migration/migration.c    | 7 ++++++-
> > > > >  vl.c                     | 3 +++
> > > > >  3 files changed, 10 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/include/migration/misc.h b/include/migration/misc.h
> > > > > index c079b7771b..b9a26b0898 100644
> > > > > --- a/include/migration/misc.h
> > > > > +++ b/include/migration/misc.h
> > > > > @@ -54,5 +54,6 @@ bool migration_has_failed(MigrationState *);
> > > > >  /* ...and after the device transmission */
> > > > >  bool migration_in_postcopy_after_devices(MigrationState *);
> > > > >  void migration_global_dump(Monitor *mon);
> > > > > +void migrate_cancel(void);
> > > > >  
> > > > >  #endif
> > > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > > index 959e8ec88e..2c844945c7 100644
> > > > > --- a/migration/migration.c
> > > > > +++ b/migration/migration.c
> > > > > @@ -1274,11 +1274,16 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
> > > > >      }
> > > > >  }
> > > > >  
> > > > > -void qmp_migrate_cancel(Error **errp)
> > > > > +void migrate_cancel(void)
> > > > >  {
> > > > >      migrate_fd_cancel(migrate_get_current());
> > > > >  }
> > > > >  
> > > > > +void qmp_migrate_cancel(Error **errp)
> > > > > +{
> > > > > +    migrate_cancel();
> > > > > +}
> > > > > +
> > > > 
> > > > Nit: I would prefer just call migrate_fd_cancel() below, since I don't
> > > > see much point to introduce migrate_cancel() which only calls
> > > > migrate_fd_cancel()...
> > > 
> > > migrate_get_current() is a migration internal IMHO. But that can be moved to
> > > migrate_fd_cancel() so the parameter is dropped.
> > > 
> > > > 
> > > > >  void qmp_migrate_set_cache_size(int64_t value, Error **errp)
> > > > >  {
> > > > >      MigrationState *s = migrate_get_current();
> > > > > diff --git a/vl.c b/vl.c
> > > > > index fb1f05b937..abbe61f40b 100644
> > > > > --- a/vl.c
> > > > > +++ b/vl.c
> > > > > @@ -87,6 +87,7 @@ int main(int argc, char **argv)
> > > > >  #include "sysemu/blockdev.h"
> > > > >  #include "hw/block/block.h"
> > > > >  #include "migration/misc.h"
> > > > > +#include "migration/savevm.h"
> > > > >  #include "migration/snapshot.h"
> > > > >  #include "migration/global_state.h"
> > > > >  #include "sysemu/tpm.h"
> > > > > @@ -4799,6 +4800,8 @@ int main(int argc, char **argv, char **envp)
> > > > >      iothread_stop_all();
> > > > >  
> > > > >      pause_all_vcpus();
> > > > > +    migrate_cancel();
> > > > 
> > > > IIUC this is an async cancel, so when reach here the migration thread
> > > > can still be alive.  Then...
> > > > 
> > > > > +    qemu_savevm_state_cleanup();
> > > > 
> > > > ... Here calling qemu_savevm_state_cleanup() may be problematic if
> > > > migration thread has not yet quitted.
> > > > 
> > > > I'm thinking whether we should make migrate_fd_cancel() wait until the
> > > > migration thread finishes (state change to CANCELLED).  Then the
> > > > migration thread will do the cleanup, and here we can avoid calling
> > > > qemu_savevm_state_cleanup() as well.
> > > 
> > > But if the migration thread is stuck and CANCELLED is never reached, we'll hang
> > > here?
> > 
> > I'm not sure I see an easy fix; I agree with Peter that calling
> > migrate_cancel() followed by qemu_savevm_state_cleanup() is racy,
> > because the cancel just forces the state to CANCELLING before coming
> > back to you, and the migration thread asynchronously starts to
> > fail/cleanup.
> > 
> > migrate_cancel() can forcibly unblock some cases because it calls
> > shutdown(2) on the network fd, but there are other ways for a migration
> > to hang.
> > 
> > Having said that,  the migration thread does it's calls
> > to qemu_savevm_state_cleanup under the lock_iothread;
> > Do we have that lock at this point?
> 
> Yes we do.  Main loop releases the lock only during poll(), other parts of
> main() all have the lock.

Having said that though this is pretty confusing; because at this point
we're after main_loop has exited.
I don't think the migration thread will exit without having taken the
iothread lock; so if we've got it at this point then the migration
thread will never exit, and it will never call qemu_savevm_state_cleanup
itself - so that race might not exist?

However, assuming the migration is at an earlier point, it might be
calling one of the state handlers that you're cleaning up, and that's
racy in individual devices.

If we have the lock I don't think we can wait for the migration to
complete/cancel.   The transition from cancelling->cancelled happens
in migrate_fd_cleanup() - and that's run of a bh which I assume don't
work any more at this point.

Perhaps the answer is to move this to qemu_system_shutdown_request prior
to the point where shutdown_requested is set?  At that point we've
still got the main loop, although hmm I'm not convinced if it's
consistent whether that's called with or without the lock held.

Dave

> Fam
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH 3/3] iotests: Add "quit during block migration" case 195
  2017-09-15  5:44 ` [Qemu-devel] [PATCH 3/3] iotests: Add "quit during block migration" case 195 Fam Zheng
@ 2017-09-15 17:29   ` Eric Blake
  0 siblings, 0 replies; 19+ messages in thread
From: Eric Blake @ 2017-09-15 17:29 UTC (permalink / raw)
  To: Fam Zheng, qemu-devel
  Cc: qemu-stable, Dr. David Alan Gilbert, peterx, Juan Quintela,
	Jeff Cody

[-- Attachment #1: Type: text/plain, Size: 838 bytes --]

On 09/15/2017 12:44 AM, Fam Zheng wrote:
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  tests/qemu-iotests/195     | 97 ++++++++++++++++++++++++++++++++++++++++++++++
>  tests/qemu-iotests/195.out | 19 +++++++++
>  tests/qemu-iotests/group   |  1 +
>  3 files changed, 117 insertions(+)
>  create mode 100755 tests/qemu-iotests/195
>  create mode 100644 tests/qemu-iotests/195.out

> +_cleanup()
> +{
> +    rm -f "${MIG_SOCKET}"
> +    rm -f "${TEST_IMG}.dest"
> +    _cleanup_test_img
> +    _cleanup_qemu
> +}
> +trap "_cleanup; exit \$status" 0 1 2 3 15

Semantic conflict with Jeff's patch series that adds './check -s' that
preserves files in the per-test temp directory.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 619 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] migration: Cancel migration at exit
  2017-09-15  9:22           ` Dr. David Alan Gilbert
@ 2017-09-18  7:31             ` Peter Xu
  2017-09-18 10:00               ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Xu @ 2017-09-18  7:31 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Fam Zheng, qemu-stable, Juan Quintela, qemu-devel

On Fri, Sep 15, 2017 at 10:22:43AM +0100, Dr. David Alan Gilbert wrote:
> * Fam Zheng (famz@redhat.com) wrote:
> > On Fri, 09/15 09:42, Dr. David Alan Gilbert wrote:
> > > * Fam Zheng (famz@redhat.com) wrote:
> > > > On Fri, 09/15 16:03, Peter Xu wrote:
> > > > > On Fri, Sep 15, 2017 at 01:44:03PM +0800, Fam Zheng wrote:
> > > > > > bdrv_close_all() would abort() due to op blockers added by BMDS, clean
> > > > > > up migration states when main loop quits to avoid that.
> > > > > > 
> > > > > > Signed-off-by: Fam Zheng <famz@redhat.com>
> > > > > > ---
> > > > > >  include/migration/misc.h | 1 +
> > > > > >  migration/migration.c    | 7 ++++++-
> > > > > >  vl.c                     | 3 +++
> > > > > >  3 files changed, 10 insertions(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/include/migration/misc.h b/include/migration/misc.h
> > > > > > index c079b7771b..b9a26b0898 100644
> > > > > > --- a/include/migration/misc.h
> > > > > > +++ b/include/migration/misc.h
> > > > > > @@ -54,5 +54,6 @@ bool migration_has_failed(MigrationState *);
> > > > > >  /* ...and after the device transmission */
> > > > > >  bool migration_in_postcopy_after_devices(MigrationState *);
> > > > > >  void migration_global_dump(Monitor *mon);
> > > > > > +void migrate_cancel(void);
> > > > > >  
> > > > > >  #endif
> > > > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > > > index 959e8ec88e..2c844945c7 100644
> > > > > > --- a/migration/migration.c
> > > > > > +++ b/migration/migration.c
> > > > > > @@ -1274,11 +1274,16 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
> > > > > >      }
> > > > > >  }
> > > > > >  
> > > > > > -void qmp_migrate_cancel(Error **errp)
> > > > > > +void migrate_cancel(void)
> > > > > >  {
> > > > > >      migrate_fd_cancel(migrate_get_current());
> > > > > >  }
> > > > > >  
> > > > > > +void qmp_migrate_cancel(Error **errp)
> > > > > > +{
> > > > > > +    migrate_cancel();
> > > > > > +}
> > > > > > +
> > > > > 
> > > > > Nit: I would prefer just call migrate_fd_cancel() below, since I don't
> > > > > see much point to introduce migrate_cancel() which only calls
> > > > > migrate_fd_cancel()...
> > > > 
> > > > migrate_get_current() is a migration internal IMHO. But that can be moved to
> > > > migrate_fd_cancel() so the parameter is dropped.
> > > > 
> > > > > 
> > > > > >  void qmp_migrate_set_cache_size(int64_t value, Error **errp)
> > > > > >  {
> > > > > >      MigrationState *s = migrate_get_current();
> > > > > > diff --git a/vl.c b/vl.c
> > > > > > index fb1f05b937..abbe61f40b 100644
> > > > > > --- a/vl.c
> > > > > > +++ b/vl.c
> > > > > > @@ -87,6 +87,7 @@ int main(int argc, char **argv)
> > > > > >  #include "sysemu/blockdev.h"
> > > > > >  #include "hw/block/block.h"
> > > > > >  #include "migration/misc.h"
> > > > > > +#include "migration/savevm.h"
> > > > > >  #include "migration/snapshot.h"
> > > > > >  #include "migration/global_state.h"
> > > > > >  #include "sysemu/tpm.h"
> > > > > > @@ -4799,6 +4800,8 @@ int main(int argc, char **argv, char **envp)
> > > > > >      iothread_stop_all();
> > > > > >  
> > > > > >      pause_all_vcpus();
> > > > > > +    migrate_cancel();
> > > > > 
> > > > > IIUC this is an async cancel, so when reach here the migration thread
> > > > > can still be alive.  Then...
> > > > > 
> > > > > > +    qemu_savevm_state_cleanup();
> > > > > 
> > > > > ... Here calling qemu_savevm_state_cleanup() may be problematic if
> > > > > migration thread has not yet quitted.
> > > > > 
> > > > > I'm thinking whether we should make migrate_fd_cancel() wait until the
> > > > > migration thread finishes (state change to CANCELLED).  Then the
> > > > > migration thread will do the cleanup, and here we can avoid calling
> > > > > qemu_savevm_state_cleanup() as well.
> > > > 
> > > > But if the migration thread is stuck and CANCELLED is never reached, we'll hang
> > > > here?
> > > 
> > > I'm not sure I see an easy fix; I agree with Peter that calling
> > > migrate_cancel() followed by qemu_savevm_state_cleanup() is racy,
> > > because the cancel just forces the state to CANCELLING before coming
> > > back to you, and the migration thread asynchronously starts to
> > > fail/cleanup.
> > > 
> > > migrate_cancel() can forcibly unblock some cases because it calls
> > > shutdown(2) on the network fd, but there are other ways for a migration
> > > to hang.
> > > 
> > > Having said that,  the migration thread does it's calls
> > > to qemu_savevm_state_cleanup under the lock_iothread;
> > > Do we have that lock at this point?
> > 
> > Yes we do.  Main loop releases the lock only during poll(), other parts of
> > main() all have the lock.
> 
> Having said that though this is pretty confusing; because at this point
> we're after main_loop has exited.
> I don't think the migration thread will exit without having taken the
> iothread lock; so if we've got it at this point then the migration
> thread will never exit, and it will never call qemu_savevm_state_cleanup
> itself - so that race might not exist?

Is that because we are taking the BQL in migration_thread()?  Please
see my below questions...

> 
> However, assuming the migration is at an earlier point, it might be
> calling one of the state handlers that you're cleaning up, and that's
> racy in individual devices.
> 
> If we have the lock I don't think we can wait for the migration to
> complete/cancel.   The transition from cancelling->cancelled happens
> in migrate_fd_cleanup() - and that's run of a bh which I assume don't
> work any more at this point.
> 
> Perhaps the answer is to move this to qemu_system_shutdown_request prior
> to the point where shutdown_requested is set?  At that point we've
> still got the main loop, although hmm I'm not convinced if it's
> consistent whether that's called with or without the lock held.

I do think the migration cleanup part needs some cleanup itself... :)

Actually I have two questions here about BQL and migration:

(1) I see that we took BQL in migration_thread(), but if the migration
    thread is to be cancelled, could we avoid taking the lock for the
    cancelling case right after we break from the big migration loop?
    Since I don't see why we need it if we are after all going to
    cancel the migration...  (though this one may need some other
    cleanups to let it work I guess)

(2) I see that we took BQL in migrate_fd_cleanup().  Could I ask why
    we had that?  Can we remove it?

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] migration: Cancel migration at exit
  2017-09-18  7:31             ` Peter Xu
@ 2017-09-18 10:00               ` Dr. David Alan Gilbert
  2017-09-19  8:26                 ` Peter Xu
  0 siblings, 1 reply; 19+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-18 10:00 UTC (permalink / raw)
  To: Peter Xu; +Cc: Fam Zheng, qemu-stable, Juan Quintela, qemu-devel

* Peter Xu (peterx@redhat.com) wrote:
> On Fri, Sep 15, 2017 at 10:22:43AM +0100, Dr. David Alan Gilbert wrote:
> > * Fam Zheng (famz@redhat.com) wrote:
> > > On Fri, 09/15 09:42, Dr. David Alan Gilbert wrote:
> > > > * Fam Zheng (famz@redhat.com) wrote:
> > > > > On Fri, 09/15 16:03, Peter Xu wrote:
> > > > > > On Fri, Sep 15, 2017 at 01:44:03PM +0800, Fam Zheng wrote:
> > > > > > > bdrv_close_all() would abort() due to op blockers added by BMDS, clean
> > > > > > > up migration states when main loop quits to avoid that.
> > > > > > > 
> > > > > > > Signed-off-by: Fam Zheng <famz@redhat.com>
> > > > > > > ---
> > > > > > >  include/migration/misc.h | 1 +
> > > > > > >  migration/migration.c    | 7 ++++++-
> > > > > > >  vl.c                     | 3 +++
> > > > > > >  3 files changed, 10 insertions(+), 1 deletion(-)
> > > > > > > 
> > > > > > > diff --git a/include/migration/misc.h b/include/migration/misc.h
> > > > > > > index c079b7771b..b9a26b0898 100644
> > > > > > > --- a/include/migration/misc.h
> > > > > > > +++ b/include/migration/misc.h
> > > > > > > @@ -54,5 +54,6 @@ bool migration_has_failed(MigrationState *);
> > > > > > >  /* ...and after the device transmission */
> > > > > > >  bool migration_in_postcopy_after_devices(MigrationState *);
> > > > > > >  void migration_global_dump(Monitor *mon);
> > > > > > > +void migrate_cancel(void);
> > > > > > >  
> > > > > > >  #endif
> > > > > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > > > > index 959e8ec88e..2c844945c7 100644
> > > > > > > --- a/migration/migration.c
> > > > > > > +++ b/migration/migration.c
> > > > > > > @@ -1274,11 +1274,16 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
> > > > > > >      }
> > > > > > >  }
> > > > > > >  
> > > > > > > -void qmp_migrate_cancel(Error **errp)
> > > > > > > +void migrate_cancel(void)
> > > > > > >  {
> > > > > > >      migrate_fd_cancel(migrate_get_current());
> > > > > > >  }
> > > > > > >  
> > > > > > > +void qmp_migrate_cancel(Error **errp)
> > > > > > > +{
> > > > > > > +    migrate_cancel();
> > > > > > > +}
> > > > > > > +
> > > > > > 
> > > > > > Nit: I would prefer just call migrate_fd_cancel() below, since I don't
> > > > > > see much point to introduce migrate_cancel() which only calls
> > > > > > migrate_fd_cancel()...
> > > > > 
> > > > > migrate_get_current() is a migration internal IMHO. But that can be moved to
> > > > > migrate_fd_cancel() so the parameter is dropped.
> > > > > 
> > > > > > 
> > > > > > >  void qmp_migrate_set_cache_size(int64_t value, Error **errp)
> > > > > > >  {
> > > > > > >      MigrationState *s = migrate_get_current();
> > > > > > > diff --git a/vl.c b/vl.c
> > > > > > > index fb1f05b937..abbe61f40b 100644
> > > > > > > --- a/vl.c
> > > > > > > +++ b/vl.c
> > > > > > > @@ -87,6 +87,7 @@ int main(int argc, char **argv)
> > > > > > >  #include "sysemu/blockdev.h"
> > > > > > >  #include "hw/block/block.h"
> > > > > > >  #include "migration/misc.h"
> > > > > > > +#include "migration/savevm.h"
> > > > > > >  #include "migration/snapshot.h"
> > > > > > >  #include "migration/global_state.h"
> > > > > > >  #include "sysemu/tpm.h"
> > > > > > > @@ -4799,6 +4800,8 @@ int main(int argc, char **argv, char **envp)
> > > > > > >      iothread_stop_all();
> > > > > > >  
> > > > > > >      pause_all_vcpus();
> > > > > > > +    migrate_cancel();
> > > > > > 
> > > > > > IIUC this is an async cancel, so when reach here the migration thread
> > > > > > can still be alive.  Then...
> > > > > > 
> > > > > > > +    qemu_savevm_state_cleanup();
> > > > > > 
> > > > > > ... Here calling qemu_savevm_state_cleanup() may be problematic if
> > > > > > migration thread has not yet quitted.
> > > > > > 
> > > > > > I'm thinking whether we should make migrate_fd_cancel() wait until the
> > > > > > migration thread finishes (state change to CANCELLED).  Then the
> > > > > > migration thread will do the cleanup, and here we can avoid calling
> > > > > > qemu_savevm_state_cleanup() as well.
> > > > > 
> > > > > But if the migration thread is stuck and CANCELLED is never reached, we'll hang
> > > > > here?
> > > > 
> > > > I'm not sure I see an easy fix; I agree with Peter that calling
> > > > migrate_cancel() followed by qemu_savevm_state_cleanup() is racy,
> > > > because the cancel just forces the state to CANCELLING before coming
> > > > back to you, and the migration thread asynchronously starts to
> > > > fail/cleanup.
> > > > 
> > > > migrate_cancel() can forcibly unblock some cases because it calls
> > > > shutdown(2) on the network fd, but there are other ways for a migration
> > > > to hang.
> > > > 
> > > > Having said that,  the migration thread does it's calls
> > > > to qemu_savevm_state_cleanup under the lock_iothread;
> > > > Do we have that lock at this point?
> > > 
> > > Yes we do.  Main loop releases the lock only during poll(), other parts of
> > > main() all have the lock.
> > 
> > Having said that though this is pretty confusing; because at this point
> > we're after main_loop has exited.
> > I don't think the migration thread will exit without having taken the
> > iothread lock; so if we've got it at this point then the migration
> > thread will never exit, and it will never call qemu_savevm_state_cleanup
> > itself - so that race might not exist?
> 
> Is that because we are taking the BQL in migration_thread()?  Please
> see my below questions...

Yes; there are a few places we take it.

> > However, assuming the migration is at an earlier point, it might be
> > calling one of the state handlers that you're cleaning up, and that's
> > racy in individual devices.
> > 
> > If we have the lock I don't think we can wait for the migration to
> > complete/cancel.   The transition from cancelling->cancelled happens
> > in migrate_fd_cleanup() - and that's run of a bh which I assume don't
> > work any more at this point.
> > 
> > Perhaps the answer is to move this to qemu_system_shutdown_request prior
> > to the point where shutdown_requested is set?  At that point we've
> > still got the main loop, although hmm I'm not convinced if it's
> > consistent whether that's called with or without the lock held.
> 
> I do think the migration cleanup part needs some cleanup itself... :)
> 
> Actually I have two questions here about BQL and migration:
> 
> (1) I see that we took BQL in migration_thread(), but if the migration
>     thread is to be cancelled, could we avoid taking the lock for the
>     cancelling case right after we break from the big migration loop?
>     Since I don't see why we need it if we are after all going to
>     cancel the migration...  (though this one may need some other
>     cleanups to let it work I guess)

If you mean the lock just before 'The resource has been allocated...'
no I don't think so; I'm not sure - I worry about what would happen if
someone issued a migrate_cancel or another migrate command during that
time - but then that could happen just before the lock_iothread so I'm
not sure we're any worse off.


> (2) I see that we took BQL in migrate_fd_cleanup().  Could I ask why
>     we had that?  Can we remove it?

Note migrate_fd_cleanup is called in a bh, so I think that means it's
entered with the lock held and that just drops it while we wait
for the other thread.

Dave

> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] migration: Cancel migration at exit
  2017-09-18 10:00               ` Dr. David Alan Gilbert
@ 2017-09-19  8:26                 ` Peter Xu
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Xu @ 2017-09-19  8:26 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Fam Zheng, qemu-stable, Juan Quintela, qemu-devel

On Mon, Sep 18, 2017 at 11:00:15AM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Fri, Sep 15, 2017 at 10:22:43AM +0100, Dr. David Alan Gilbert wrote:
> > > * Fam Zheng (famz@redhat.com) wrote:
> > > > On Fri, 09/15 09:42, Dr. David Alan Gilbert wrote:
> > > > > * Fam Zheng (famz@redhat.com) wrote:
> > > > > > On Fri, 09/15 16:03, Peter Xu wrote:
> > > > > > > On Fri, Sep 15, 2017 at 01:44:03PM +0800, Fam Zheng wrote:
> > > > > > > > bdrv_close_all() would abort() due to op blockers added by BMDS, clean
> > > > > > > > up migration states when main loop quits to avoid that.
> > > > > > > > 
> > > > > > > > Signed-off-by: Fam Zheng <famz@redhat.com>
> > > > > > > > ---
> > > > > > > >  include/migration/misc.h | 1 +
> > > > > > > >  migration/migration.c    | 7 ++++++-
> > > > > > > >  vl.c                     | 3 +++
> > > > > > > >  3 files changed, 10 insertions(+), 1 deletion(-)
> > > > > > > > 
> > > > > > > > diff --git a/include/migration/misc.h b/include/migration/misc.h
> > > > > > > > index c079b7771b..b9a26b0898 100644
> > > > > > > > --- a/include/migration/misc.h
> > > > > > > > +++ b/include/migration/misc.h
> > > > > > > > @@ -54,5 +54,6 @@ bool migration_has_failed(MigrationState *);
> > > > > > > >  /* ...and after the device transmission */
> > > > > > > >  bool migration_in_postcopy_after_devices(MigrationState *);
> > > > > > > >  void migration_global_dump(Monitor *mon);
> > > > > > > > +void migrate_cancel(void);
> > > > > > > >  
> > > > > > > >  #endif
> > > > > > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > > > > > index 959e8ec88e..2c844945c7 100644
> > > > > > > > --- a/migration/migration.c
> > > > > > > > +++ b/migration/migration.c
> > > > > > > > @@ -1274,11 +1274,16 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
> > > > > > > >      }
> > > > > > > >  }
> > > > > > > >  
> > > > > > > > -void qmp_migrate_cancel(Error **errp)
> > > > > > > > +void migrate_cancel(void)
> > > > > > > >  {
> > > > > > > >      migrate_fd_cancel(migrate_get_current());
> > > > > > > >  }
> > > > > > > >  
> > > > > > > > +void qmp_migrate_cancel(Error **errp)
> > > > > > > > +{
> > > > > > > > +    migrate_cancel();
> > > > > > > > +}
> > > > > > > > +
> > > > > > > 
> > > > > > > Nit: I would prefer just call migrate_fd_cancel() below, since I don't
> > > > > > > see much point to introduce migrate_cancel() which only calls
> > > > > > > migrate_fd_cancel()...
> > > > > > 
> > > > > > migrate_get_current() is a migration internal IMHO. But that can be moved to
> > > > > > migrate_fd_cancel() so the parameter is dropped.
> > > > > > 
> > > > > > > 
> > > > > > > >  void qmp_migrate_set_cache_size(int64_t value, Error **errp)
> > > > > > > >  {
> > > > > > > >      MigrationState *s = migrate_get_current();
> > > > > > > > diff --git a/vl.c b/vl.c
> > > > > > > > index fb1f05b937..abbe61f40b 100644
> > > > > > > > --- a/vl.c
> > > > > > > > +++ b/vl.c
> > > > > > > > @@ -87,6 +87,7 @@ int main(int argc, char **argv)
> > > > > > > >  #include "sysemu/blockdev.h"
> > > > > > > >  #include "hw/block/block.h"
> > > > > > > >  #include "migration/misc.h"
> > > > > > > > +#include "migration/savevm.h"
> > > > > > > >  #include "migration/snapshot.h"
> > > > > > > >  #include "migration/global_state.h"
> > > > > > > >  #include "sysemu/tpm.h"
> > > > > > > > @@ -4799,6 +4800,8 @@ int main(int argc, char **argv, char **envp)
> > > > > > > >      iothread_stop_all();
> > > > > > > >  
> > > > > > > >      pause_all_vcpus();
> > > > > > > > +    migrate_cancel();
> > > > > > > 
> > > > > > > IIUC this is an async cancel, so when reach here the migration thread
> > > > > > > can still be alive.  Then...
> > > > > > > 
> > > > > > > > +    qemu_savevm_state_cleanup();
> > > > > > > 
> > > > > > > ... Here calling qemu_savevm_state_cleanup() may be problematic if
> > > > > > > migration thread has not yet quitted.
> > > > > > > 
> > > > > > > I'm thinking whether we should make migrate_fd_cancel() wait until the
> > > > > > > migration thread finishes (state change to CANCELLED).  Then the
> > > > > > > migration thread will do the cleanup, and here we can avoid calling
> > > > > > > qemu_savevm_state_cleanup() as well.
> > > > > > 
> > > > > > But if the migration thread is stuck and CANCELLED is never reached, we'll hang
> > > > > > here?
> > > > > 
> > > > > I'm not sure I see an easy fix; I agree with Peter that calling
> > > > > migrate_cancel() followed by qemu_savevm_state_cleanup() is racy,
> > > > > because the cancel just forces the state to CANCELLING before coming
> > > > > back to you, and the migration thread asynchronously starts to
> > > > > fail/cleanup.
> > > > > 
> > > > > migrate_cancel() can forcibly unblock some cases because it calls
> > > > > shutdown(2) on the network fd, but there are other ways for a migration
> > > > > to hang.
> > > > > 
> > > > > Having said that,  the migration thread does it's calls
> > > > > to qemu_savevm_state_cleanup under the lock_iothread;
> > > > > Do we have that lock at this point?
> > > > 
> > > > Yes we do.  Main loop releases the lock only during poll(), other parts of
> > > > main() all have the lock.
> > > 
> > > Having said that though this is pretty confusing; because at this point
> > > we're after main_loop has exited.
> > > I don't think the migration thread will exit without having taken the
> > > iothread lock; so if we've got it at this point then the migration
> > > thread will never exit, and it will never call qemu_savevm_state_cleanup
> > > itself - so that race might not exist?
> > 
> > Is that because we are taking the BQL in migration_thread()?  Please
> > see my below questions...
> 
> Yes; there are a few places we take it.
> 
> > > However, assuming the migration is at an earlier point, it might be
> > > calling one of the state handlers that you're cleaning up, and that's
> > > racy in individual devices.
> > > 
> > > If we have the lock I don't think we can wait for the migration to
> > > complete/cancel.   The transition from cancelling->cancelled happens
> > > in migrate_fd_cleanup() - and that's run of a bh which I assume don't
> > > work any more at this point.
> > > 
> > > Perhaps the answer is to move this to qemu_system_shutdown_request prior
> > > to the point where shutdown_requested is set?  At that point we've
> > > still got the main loop, although hmm I'm not convinced if it's
> > > consistent whether that's called with or without the lock held.
> > 
> > I do think the migration cleanup part needs some cleanup itself... :)
> > 
> > Actually I have two questions here about BQL and migration:
> > 
> > (1) I see that we took BQL in migration_thread(), but if the migration
> >     thread is to be cancelled, could we avoid taking the lock for the
> >     cancelling case right after we break from the big migration loop?
> >     Since I don't see why we need it if we are after all going to
> >     cancel the migration...  (though this one may need some other
> >     cleanups to let it work I guess)
> 
> If you mean the lock just before 'The resource has been allocated...'
> no I don't think so; I'm not sure - I worry about what would happen if
> someone issued a migrate_cancel or another migrate command during that
> time - but then that could happen just before the lock_iothread so I'm
> not sure we're any worse off.

For these cases, I think we should first check the migration status,
then we proceed.  I think that's what we have done in qmp_migrate(),
however something we missed in qmp_migrate_cancel() (so I think we
should add it soon).

Besides these cases, IMHO the BQL should be used for either vm_start()
or runstate_set() below.  However again, I'm thinking whether we can
avoid taking the lock for "cancelling" case.  If my understanding is
correct above, I would like to give it a shot.

> 
> 
> > (2) I see that we took BQL in migrate_fd_cleanup().  Could I ask why
> >     we had that?  Can we remove it?
> 
> Note migrate_fd_cleanup is called in a bh, so I think that means it's
> entered with the lock held and that just drops it while we wait
> for the other thread.

Ah, sorry I obviously misread the code on the ordering of
lock/unlock...  Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2017-09-19  8:27 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-09-15  5:44 [Qemu-devel] [PATCH 0/3] migration: Fix crash by cleaning up before quit Fam Zheng
2017-09-15  5:44 ` [Qemu-devel] [PATCH 1/3] migration: Allow ram_save_cleanup to be called with empty state Fam Zheng
2017-09-15  6:41   ` Peter Xu
2017-09-15  6:49     ` Fam Zheng
2017-09-15  6:56       ` Peter Xu
2017-09-15  7:02         ` Fam Zheng
2017-09-15  7:58           ` Peter Xu
2017-09-15  5:44 ` [Qemu-devel] [PATCH 2/3] migration: Cancel migration at exit Fam Zheng
2017-09-15  8:03   ` Peter Xu
2017-09-15  8:20     ` Fam Zheng
2017-09-15  8:37       ` Peter Xu
2017-09-15  8:42       ` Dr. David Alan Gilbert
2017-09-15  8:52         ` Fam Zheng
2017-09-15  9:22           ` Dr. David Alan Gilbert
2017-09-18  7:31             ` Peter Xu
2017-09-18 10:00               ` Dr. David Alan Gilbert
2017-09-19  8:26                 ` Peter Xu
2017-09-15  5:44 ` [Qemu-devel] [PATCH 3/3] iotests: Add "quit during block migration" case 195 Fam Zheng
2017-09-15 17:29   ` Eric Blake

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).