From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:58168)
 by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
 id 1e2cT7-00038M-FX for qemu-devel@nongnu.org;
 Thu, 12 Oct 2017 08:20:06 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from ) id 1e2cT3-0001RN-7T for qemu-devel@nongnu.org;
 Thu, 12 Oct 2017 08:20:05 -0400
Received: from mx1.redhat.com ([209.132.183.28]:60128)
 by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32)
 (Exim 4.71) (envelope-from ) id 1e2cT2-0001Qp-UW
 for qemu-devel@nongnu.org; Thu, 12 Oct 2017 08:20:01 -0400
Date: Thu, 12 Oct 2017 13:19:52 +0100
From: "Dr. David Alan Gilbert"
Message-ID: <20171012121952.GD2343@work-vm>
References: <1504081950-2528-1-git-send-email-peterx@redhat.com>
 <1504081950-2528-11-git-send-email-peterx@redhat.com>
 <20170921192903.GH3401@work-vm>
 <20170927073441.GD25011@pxdev.xzpeter.org>
 <20171009185812.GR2374@work-vm>
 <20171010093801.GC20686@pxdev.xzpeter.org>
 <20171010123017.GE2132@work-vm>
 <20171011030058.GG20686@pxdev.xzpeter.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
In-Reply-To: <20171011030058.GG20686@pxdev.xzpeter.org>
Subject: Re: [Qemu-devel] [RFC v2 10/33] migration: allow dst vm pause on postcopy
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
To: Peter Xu
Cc: qemu-devel@nongnu.org, Laurent Vivier, "Daniel P . Berrange",
 Alexey Perevalov, Juan Quintela, Andrea Arcangeli

* Peter Xu (peterx@redhat.com) wrote:
> On Tue, Oct 10, 2017 at 01:30:18PM +0100, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > On Mon, Oct 09, 2017 at 07:58:13PM +0100, Dr. David Alan Gilbert wrote:
>
> [...]
>
> > > > We have to be careful about this; a network can fail in a way it
> > > > gets stuck rather than fails - this can get stuck until a full TCP
> > > > disconnection; and that takes about 30mins (from memory).
> > > > The nice thing about using 'shutdown' is that you can kill the existing
> > > > connection if it's hung. (Which then makes an interesting question;
> > > > the rules in your migrate-incoming command become different if you
> > > > want to declare it's failed!). Having said that, you're right that at
> > > > this point stuff has already failed - so do we need the shutdown?
> > > > (You might want to do the shutdown as part of the recovery earlier
> > > > or as a separate command to force the failure)
> > >
> > > I assume if I call shutdown before the lock then we'll be good then.
> >
> > The question is what happens if you only allow recovery if we're already
> > in postcopy-paused state; in the case of a hung socket, since no IO has
> > actually failed yet, you will still be in postcopy-active.
>
> Hmm, but isn't that a problem of kernel rather than QEMU? Since
> sockets are after all managed by kernel.

Kind of, but it comes down to what the right behaviour of a TCP socket
is, and the kernel is probably doing the right thing.

> I don't really know what is the best thing to do to detect whether a
> socket is stuck. Assume we can observe that (say, we see migration
> transferred bytes keep static for 30 seconds), IIRC you mentioned
> about iptable tricks to break an existing e.g. TCP connection, then we
> can trigger the -EIO path.

From the qemu level I'd prefer to make it a command; if we start adding
heuristics and timeouts etc then it's very difficult to actually get
them right.

> Or do you think we should provide a way to manually trigger the paused
> state? Then it goes back to something we discussed with Dan in the
> earlier post - I'd appreciate if we can postpone the manual trigger
> support a bit (to make this series small, which is already not...).

I think that manual trigger is probably necessary; it would just call
a shutdown() on the sockets and let the things fail into the paused
state. It'd be pretty simple. It would be another OOB command; the
tricky part is just making sure it's thread safe against the migration
finishing when you issue it.

I think it can wait until after this series if you want, but it would
be good if we can figure it out.

Dave

> Thanks,
>
> --
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
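
[Illustrative note on the shutdown() idea above. This is a minimal sketch, not QEMU code: it uses Python for brevity, a local socketpair stands in for the migration channel, and the names reader, force_fail and demo are hypothetical. It only shows the general socket behaviour the manual trigger would rely on: a thread blocked in a read is woken once another thread calls shutdown() on the same socket, so the I/O path sees an end-of-stream or error instead of hanging.]

```python
import socket
import threading


def reader(sock, result):
    # Stand-in for the migration thread: blocks in recv() until data
    # arrives, the peer closes, or the socket is shut down locally.
    try:
        result.append(sock.recv(4096))
    except OSError as exc:
        result.append(exc)


def force_fail(sock):
    # Hypothetical stand-in for the manual-trigger command: shut down
    # both directions so any pending or future I/O on sock fails fast.
    sock.shutdown(socket.SHUT_RDWR)


def demo():
    # The peer (b) never sends anything, mimicking a hung connection.
    a, b = socket.socketpair()
    result = []
    t = threading.Thread(target=reader, args=(a, result), daemon=True)
    t.start()
    force_fail(a)
    t.join(timeout=5)
    woke = not t.is_alive()
    a.close()
    b.close()
    return woke, result


if __name__ == "__main__":
    woke, result = demo()
    print("reader woke up:", woke, "result:", result)
```

[In this sketch the reader returns as soon as shutdown() has run, whichever of the two threads gets there first. The thread-safety concern raised in the mail, issuing the command while the migration is finishing and the socket is being torn down, is not modelled here.]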