Date: Tue, 26 Jun 2018 20:11:51 +0100
From: "Dr. David Alan Gilbert"
Subject: Re: [Qemu-devel] [PATCH] migration: invalidate cache before source start
To: Vladimir Sementsov-Ogievskiy
Cc: John Snow, qemu-devel@nongnu.org, quintela@redhat.com
Message-ID: <20180626191151.GD2505@work-vm>

* Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
> 25.06.2018 21:03, John Snow wrote:
> >
> > On 06/25/2018 01:50 PM, Dr. David Alan Gilbert wrote:
> > > * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> > > > * Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
> > > > > 15.06.2018 15:06, Dr.
> > > > > David Alan Gilbert wrote:
> > > > > > * Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
> > > > > > > Invalidate cache before source start in case of failed migration.
> > > > > > >
> > > > > > > Signed-off-by: Vladimir Sementsov-Ogievskiy
> > > > > > Why doesn't the code at the bottom of migration_completion,
> > > > > > fail_invalidate: and the code in migrate_fd_cancel handle this?
> > > > > >
> > > > > > What case did you see it in that those didn't handle?
> > > > > > (Also I guess it probably should set s->block_inactive = false)
> > > > > on source I see:
> > > > >
> > > > > 81392@1529065750.766289:migrate_set_state new state 7
> > > > > 81392@1529065750.766330:migration_thread_file_err
> > > > > 81392@1529065750.766332:migration_thread_after_loop
> > > > >
> > > > > so we are leaving the loop on
> > > > >         if (qemu_file_get_error(s->to_dst_file)) {
> > > > >             migrate_set_state(&s->state, current_active_state,
> > > > >                               MIGRATION_STATUS_FAILED);
> > > > >             trace_migration_thread_file_err();
> > > > >             break;
> > > > >         }
> > > > >
> > > > > and skip migration_completion()
> > > > Yeh, OK; I'd seen something else a few days ago, where a cancellation
> > > > test that had previously ended with a 'cancelled' state has now ended
> > > > up in 'failed' (which is the state 7 you have above).
> > > > I suspect there's something else going on as well; I think what is
> > > > supposed to happen in the case of 'cancel' is that it spins in
> > > > 'cancelling' for a while in migrate_fd_cancel, and then at the bottom
> > > > of migrate_fd_cancel it does the recovery; but because it's going to
> > > > 'failed' instead, it's jumping over that recovery.
> > > Going back and actually looking at the patch again,
> > > can I ask for one small change:
> > > can you set s->block_inactive = false in the case where you
> > > don't get the local_err (like we do at the bottom of migrate_fd_cancel)?
> > >
> > > Does that make sense?
> > >
> > > Thanks,
> > >
> > > Dave
> > >
> > Vladimir, one more question for you, because I'm not as familiar with
> > this code:
> >
> > In the normal case we need to invalidate the qcow2 cache as a way to
> > re-engage the disk (yes?) when we have failed during the late-migration
> > steps.
> >
> > In this case, we seem to be observing a failure during the bulk-transfer
> > loop. Why is it important to invalidate the cache at this step -- would
> > the disk have been inactivated yet? It shouldn't have been, because it's
> > in the bulk-transfer phase -- or am I missing something?
> >
> > I feel like this code is behaving in a way that's fairly surprising for
> > a casual reader, so I was hoping you could elaborate for me.
> >
> > --js
>
> In my case the source is already in the postcopy state when the error
> occurs, so it is inactivated.

Ah, that explains why I couldn't understand the path that got you there;
I never think about restarting the source once we're in postcopy, because
once the destination is running all is lost.
But you might be in the gap before management has actually started the
destination, so it does need fixing.

Dave

> --
> Best regards,
> Vladimir
>

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK