From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:45269) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1a7Jqh-0006Lr-Co for qemu-devel@nongnu.org; Fri, 11 Dec 2015 04:18:51 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1a7Jqa-0007xo-Rb for qemu-devel@nongnu.org; Fri, 11 Dec 2015 04:18:47 -0500 Received: from mx1.redhat.com ([209.132.183.28]:55459) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1a7Jqa-0007xg-Kr for qemu-devel@nongnu.org; Fri, 11 Dec 2015 04:18:40 -0500 Date: Fri, 11 Dec 2015 09:18:34 +0000 From: "Dr. David Alan Gilbert" Message-ID: <20151211091834.GA2987@work-vm> References: <1448357149-17572-1-git-send-email-zhang.zhanghailiang@huawei.com> <1448357149-17572-28-git-send-email-zhang.zhanghailiang@huawei.com> <20151210200344.GN2570@work-vm> <566A8FDC.3070303@huawei.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <566A8FDC.3070303@huawei.com> Subject: Re: [Qemu-devel] [PATCH COLO-Frame v11 27/39] COLO failover: Shutdown related socket fd when do failover List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Hailiang Zhang Cc: lizhijian@cn.fujitsu.com, quintela@redhat.com, yunhong.jiang@intel.com, eddie.dong@intel.com, peter.huangpeng@huawei.com, qemu-devel@nongnu.org, arei.gonglei@huawei.com, stefanha@redhat.com, amit.shah@redhat.com, hongyang.yang@easystack.cn * Hailiang Zhang (zhang.zhanghailiang@huawei.com) wrote: > On 2015/12/11 4:03, Dr. David Alan Gilbert wrote: > >* zhanghailiang (zhang.zhanghailiang@huawei.com) wrote: > >>If the net connection between COLO's two sides is broken while colo/colo incoming > >>thread is blocked in 'read'/'write' socket fd. It will not detect this error until > >>connect timeout. It will be a long time. > >> > >>Here we shutdown all the related socket file descriptors to wake up the blocking > >>operation in failover BH. Besides, we should close the corresponding file descriptors > >>after failvoer BH shutdown them, or there will be an error. > >> > >>Signed-off-by: zhanghailiang > >>Signed-off-by: Li Zhijian > >>--- > >>v11: > >>- Only shutdown fd for once > >>--- > >> migration/colo.c | 31 +++++++++++++++++++++++++++++-- > >> 1 file changed, 29 insertions(+), 2 deletions(-) > >> > >>diff --git a/migration/colo.c b/migration/colo.c > >>index 4cd7b00..994b80d 100644 > >>--- a/migration/colo.c > >>+++ b/migration/colo.c > >>@@ -68,6 +68,14 @@ static void secondary_vm_do_failover(void) > >> /* recover runstate to normal migration finish state */ > >> autostart = true; > >> } > >>+ /* > >>+ * Make sure colo incoming thread not block in recv, > >>+ * mis->from_src_file and mis->to_src_file use the same fd, > >>+ * so here we only need to shutdown it for once. > >>+ */ > > > >That assumption is only valid for the current socket transport; > >for example I have a (partially-working) RDMA transport where the > >forward and reverse paths are quite separate. > > > > Yes, you are right, but we really catch the bad stuck case in test for many times, > and it is easy to reproduce. So i think it's necessary to keep this codes. Yes, I agree you must shutdown from_src_file, but I mean do you also need to shutdown to_src_file ? On sockets they share an fd, but that might not always be true, so you might have to shutdown both QEMUFile's - but only if you can get blocked on them. > > >>+ if (mis->from_src_file) { > >>+ qemu_file_shutdown(mis->from_src_file); > >>+ } > >> > >> old_state = failover_set_state(FAILOVER_STATUS_HANDLING, > >> FAILOVER_STATUS_COMPLETED); > >>@@ -92,6 +100,15 @@ static void primary_vm_do_failover(void) > >> MIGRATION_STATUS_COMPLETED); > >> } > >> > >>+ /* > >>+ * Make sure colo thread no block in recv, > >>+ * Besides, s->rp_state.from_dst_file and s->to_dst_file use the > >>+ * same fd, so here we only need to shutdown it for once. > >>+ */ > >>+ if (s->to_dst_file) { > >>+ qemu_file_shutdown(s->to_dst_file); > >>+ } > >>+ > >> old_state = failover_set_state(FAILOVER_STATUS_HANDLING, > >> FAILOVER_STATUS_COMPLETED); > >> if (old_state != FAILOVER_STATUS_HANDLING) { > >>@@ -333,7 +350,7 @@ static void colo_process_checkpoint(MigrationState *s) > >> > >> out: > >> current_time = error_time = qemu_clock_get_ms(QEMU_CLOCK_HOST); > >>- if (ret < 0) { > >>+ if (ret < 0 || (!ret && !failover_request_is_active())) { > > > >Why is this needed - when would you get a 0 return that is actually an error? > > > > If we shutdown the fd, it will return 0, and the ret will be 0, this only happen > when we do failover actively. OK > >> error_report("%s: %s", __func__, strerror(-ret)); > >> qapi_event_send_colo_exit(COLO_MODE_PRIMARY, COLO_EXIT_REASON_ERROR, > >> true, strerror(-ret), NULL); > >>@@ -362,6 +379,11 @@ out: > >> qsb_free(buffer); > >> buffer = NULL; > >> > >>+ /* Hope this not to be too long to loop here */ > >>+ while (failover_get_state() != FAILOVER_STATUS_COMPLETED) { > >>+ ; > >>+ } > >>+ /* Must be called after failover BH is completed */ > > > >Is this the right patch for this? > > > > Yes, we must ensure the close() called after the shutdown() in failover, > or there maybe an shutdown wrong 'fd' error, the 'fd' maybe just re-used by > other thread after we close here, but before shutdown() in failover. This > is a race case, which is rarely happening. I'll add more comments here. Thanks, that makes sense; closing the wrong fd can be really hard to debug :-) Dave > > Thanks, > Hailiang > > > >> if (s->rp_state.from_dst_file) { > >> qemu_fclose(s->rp_state.from_dst_file); > >> } > >>@@ -534,7 +556,7 @@ void *colo_process_incoming_thread(void *opaque) > >> > >> out: > >> current_time = error_time = qemu_clock_get_ms(QEMU_CLOCK_HOST); > >>- if (ret < 0) { > >>+ if (ret < 0 || (!ret && !failover_request_is_active())) { > >> error_report("colo incoming thread will exit, detect error: %s", > >> strerror(-ret)); > >> qapi_event_send_colo_exit(COLO_MODE_SECONDARY, COLO_EXIT_REASON_ERROR, > >>@@ -573,6 +595,11 @@ out: > >> */ > >> colo_release_ram_cache(); > >> > >>+ /* Hope this not to be too long to loop here */ > >>+ while (failover_get_state() != FAILOVER_STATUS_COMPLETED) { > >>+ ; > >>+ } > >>+ /* Must be called after failover BH is completed */ > >> if (mis->to_src_file) { > >> qemu_fclose(mis->to_src_file); > >> } > >>-- > >>1.8.3.1 > >> > >> > >-- > >Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK > > > >. > > > > -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK