Date: Mon, 15 Jun 2009 17:48:52 -0300
From: Glauber Costa
Subject: Re: [Qemu-devel] Live migration broken when under heavy IO
Message-ID: <20090615204852.GA6693@poweredge.glommer>
In-Reply-To: <4A36B025.2080602@us.ibm.com>
To: Anthony Liguori
Cc: "qemu-devel@nongnu.org", kvm-devel

On Mon, Jun 15, 2009 at 03:33:41PM -0500, Anthony Liguori wrote:
> The basic issue is that:
>
> migrate_fd_put_ready(): bdrv_flush_all();
>
> Does:
>
> block.c:
>
> foreach block driver:
>    drv->flush(bs);
>
> Which in the case of raw, is just fsync(s->fd).
>
> Any submitted request is not queued or flushed which will lead to the
> request being dropped after the live migration.

You mean any request submitted _after_ that is not queued, right?

>
> Is anyone working on fixing this? Does anyone have a clever idea how to
> fix this without just waiting for all IO requests to complete?

If I understood you correctly, we could do something along the lines of
dirty tracking for I/O devices.
Use register_savevm_live() instead of register_savevm() for those, and
keep doing passes until, by some criterion, we reach stage 3. We can
then just flush the remaining requests on that device and mark[1] it
somewhere. We can then either stop that device, so that new requests
never arrive, or stop the VM entirely.

[1] By mark, I mean the verb "to mark", not our dear friend Mark McLaughing.
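
To make the idea concrete, here is a rough toy model of the staged
approach. It is only a sketch under assumptions: ToyDevice, toy_submit(),
toy_drain(), toy_save_live() and toy_migrate() are hypothetical names made
up for illustration, the budget/threshold criterion is one arbitrary
choice, and none of this is the real QEMU savevm interface.

```c
#include <assert.h>

/* Hypothetical stand-in for a device with in-flight I/O; not a QEMU type. */
typedef struct ToyDevice {
    int pending;    /* requests submitted but not yet completed */
    int accepting;  /* nonzero while the guest may still submit requests */
    int migrated;   /* requests completed and accounted for in the stream */
} ToyDevice;

enum { STAGE_START = 1, STAGE_ITERATE = 2, STAGE_FINAL = 3 };

/* Guest submits n requests; refused (returns 0) once the device is stopped. */
static int toy_submit(ToyDevice *dev, int n)
{
    assert(dev != 0);
    if (!dev->accepting)
        return 0;
    dev->pending += n;
    return n;
}

/* Complete up to `budget` pending requests; returns how many were drained. */
static int toy_drain(ToyDevice *dev, int budget)
{
    int done = dev->pending < budget ? dev->pending : budget;
    dev->pending -= done;
    dev->migrated += done;
    return done;
}

/*
 * Staged live-save handler (assumed shape). Returns 1 when the backlog is
 * small enough that the caller may move on to the final stage.
 */
static int toy_save_live(ToyDevice *dev, int stage, int budget, int threshold)
{
    switch (stage) {
    case STAGE_START:
        return 0;                       /* nothing drained yet */
    case STAGE_ITERATE:
        toy_drain(dev, budget);         /* one pass over the backlog */
        return dev->pending <= threshold;
    case STAGE_FINAL:
        dev->accepting = 0;             /* stop the device: no new requests */
        toy_drain(dev, dev->pending);   /* flush whatever is left */
        return 1;
    }
    return 0;
}

/*
 * Drive the stages. The guest keeps submitting `guest_rate` requests between
 * passes. Returns the number of stage-2 passes that were needed.
 */
static int toy_migrate(ToyDevice *dev, int budget, int threshold,
                       int guest_rate, int max_passes)
{
    int passes = 0;

    toy_save_live(dev, STAGE_START, budget, threshold);
    while (passes < max_passes) {
        passes++;
        if (toy_save_live(dev, STAGE_ITERATE, budget, threshold))
            break;
        toy_submit(dev, guest_rate);    /* guest is still running */
    }
    toy_save_live(dev, STAGE_FINAL, budget, threshold);
    return passes;
}
```

The point of the model is the ordering in the final stage: the device
stops accepting requests before the last flush, so nothing can be
submitted after the flush and then dropped. The max_passes bound is there
because a guest that submits faster than we drain would never converge on
its own.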