Date: Mon, 7 Mar 2016 12:49:11 +0000
From: "Dr. David Alan Gilbert"
Message-ID: <20160307124911.GB2253@work-vm>
References: <33f7c8c309e6625942e6b8548faa96606a6f99b1.1456212545.git.amit.shah@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <33f7c8c309e6625942e6b8548faa96606a6f99b1.1456212545.git.amit.shah@redhat.com>
Subject: Re: [Qemu-devel] [PULL 2/5] migration: move bdrv_invalidate_cache_all of of coroutine context
To: Amit Shah
Cc: "Denis V. Lunev", Peter Maydell, Paolo Bonzini, qemu list, Juan Quintela

* Amit Shah (amit.shah@redhat.com) wrote:
> From: "Denis V. Lunev"
> 
> There is a possibility to hit an assert in qcow2_get_specific_info that
> s->qcow_version is undefined.  This happens when a VM is starting from
> suspended state, i.e. it processes incoming migration, and at the same
> time 'info block' is called.
> 
> The problem is that qcow2_invalidate_cache() closes the image and
> memset()s BDRVQcowState in the middle.
> 
> The patch moves processing of bdrv_invalidate_cache_all out of
> coroutine context for postcopy migration to avoid that.  This function
> is called with the following stack:
>     process_incoming_migration_co
>     qemu_loadvm_state
>     qemu_loadvm_state_main
>     loadvm_process_command
>     loadvm_postcopy_handle_run
> 
> Signed-off-by: Denis V. Lunev
> Tested-by: Dr. David Alan Gilbert

hmm; actually - this segs in a variety of different ways; there are two
problems:

a)  +    bh = qemu_bh_new(loadvm_postcopy_handle_run_bh, NULL);

    That's the easy one; that NULL should be 'mis', because the bh is
    expecting to use it as a MigrationIncomingState, so it segs fairly
    reliably in the qemu_bh_delete(mis->bh).

b)  The harder problem is that there's a race where qemu_bh_delete segs,
    and I'm not 100% sure why yet - it only does it sometimes (i.e. run
    virt-test and leave it and it occasionally does it).  From the core it
    looks like mis->bh is corrupt (0x10101010...) so maybe mis has been
    freed at that point?  I'm suspecting this is the
    postcopy_ram_listen_thread freeing mis at the end of it, but I don't
    know yet.
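For (a), a minimal sketch of the kind of follow-up I'd expect - untested,
and assuming the QEMUBH ends up being kept in mis->bh so the callback can
find and delete it; not a real patch:

    static void loadvm_postcopy_handle_run_bh(void *opaque)
    {
        /* Sketch: opaque must actually carry the MigrationIncomingState */
        MigrationIncomingState *mis = opaque;

        /* ... existing body: bdrv_invalidate_cache_all(), cpu sync,
         * vm_start() / runstate_set(RUN_STATE_PAUSED) ... */

        qemu_bh_delete(mis->bh);   /* the dereference that segs when opaque is NULL */
    }

    static int loadvm_postcopy_handle_run(MigrationIncomingState *mis)
    {
        /* ... postcopy state check as in the patch ... */

        mis->bh = qemu_bh_new(loadvm_postcopy_handle_run_bh, mis);  /* 'mis', not NULL */
        qemu_bh_schedule(mis->bh);
        return 0;
    }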
Dave

> CC: Paolo Bonzini
> CC: Juan Quintela
> CC: Amit Shah
> Message-Id: <1455259174-3384-3-git-send-email-den@openvz.org>
> Signed-off-by: Amit Shah
> ---
>  migration/savevm.c | 27 +++++++++++++++++----------
>  1 file changed, 17 insertions(+), 10 deletions(-)
> 
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 94f2894..8415fd9 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1496,18 +1496,10 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
>      return 0;
>  }
>  
> -/* After all discards we can start running and asking for pages */
> -static int loadvm_postcopy_handle_run(MigrationIncomingState *mis)
> +static void loadvm_postcopy_handle_run_bh(void *opaque)
>  {
> -    PostcopyState ps = postcopy_state_set(POSTCOPY_INCOMING_RUNNING);
>      Error *local_err = NULL;
>  
> -    trace_loadvm_postcopy_handle_run();
> -    if (ps != POSTCOPY_INCOMING_LISTENING) {
> -        error_report("CMD_POSTCOPY_RUN in wrong postcopy state (%d)", ps);
> -        return -1;
> -    }
> -
>      /* TODO we should move all of this lot into postcopy_ram.c or a shared code
>       * in migration.c
>       */
> @@ -1519,7 +1511,6 @@ static int loadvm_postcopy_handle_run(MigrationIncomingState *mis)
>      bdrv_invalidate_cache_all(&local_err);
>      if (local_err) {
>          error_report_err(local_err);
> -        return -1;
>      }
>  
>      trace_loadvm_postcopy_handle_run_cpu_sync();
> @@ -1534,6 +1525,22 @@ static int loadvm_postcopy_handle_run(MigrationIncomingState *mis)
>          /* leave it paused and let management decide when to start the CPU */
>          runstate_set(RUN_STATE_PAUSED);
>      }
> +}
> +
> +/* After all discards we can start running and asking for pages */
> +static int loadvm_postcopy_handle_run(MigrationIncomingState *mis)
> +{
> +    PostcopyState ps = postcopy_state_set(POSTCOPY_INCOMING_RUNNING);
> +    QEMUBH *bh;
> +
> +    trace_loadvm_postcopy_handle_run();
> +    if (ps != POSTCOPY_INCOMING_LISTENING) {
> +        error_report("CMD_POSTCOPY_RUN in wrong postcopy state (%d)", ps);
> +        return -1;
> +    }
> +
> +    bh = qemu_bh_new(loadvm_postcopy_handle_run_bh, NULL);
> +    qemu_bh_schedule(bh);
>  
>      /* We need to finish reading the stream from the package
>       * and also stop reading anything more from the stream that loaded the
> --
> 2.5.0
> 
> 

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK