Date: Tue, 7 Sep 2010 15:55:54 +0100
Subject: Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
From: Stefan Hajnoczi
To: Anthony Liguori
Cc: "libvir-list@redhat.com", qemu-devel, Stefan Hajnoczi
In-Reply-To: <4C865160.5030600@linux.vnet.ibm.com>
References: <4C864118.7070206@linux.vnet.ibm.com> <4C865160.5030600@linux.vnet.ibm.com>
List-Id: qemu-devel.nongnu.org

On Tue, Sep 7, 2010 at 3:51 PM, Anthony Liguori wrote:
> On 09/07/2010 09:33 AM, Stefan Hajnoczi wrote:
>>
>> On Tue, Sep 7, 2010 at 2:41 PM, Anthony Liguori wrote:
>>
>>>
>>> The interface for copy-on-read is just an option within qemu-img create.
>>> Streaming, on the other hand, requires a bit more thought.  Today, I
>>> have a monitor command that does the following:
>>>
>>> stream
>>>
>>> Which will try to stream the minimal amount of data for a single I/O
>>> operation and then return how many sectors were successfully streamed.
>>>
>>> The idea about how to drive this interface is a loop like:
>>>
>>> offset = 0;
>>> while offset < image_size:
>>>   wait_for_idle_time()
>>>   count = stream(device, offset)
>>>   offset += count
>>>
>>> Obviously, the "wait_for_idle_time()" requires wide system awareness.
>>> The thing I'm not sure about is 1) would libvirt want to expose a
>>> similar stream interface and let management software determine idle
>>> time 2) attempt to detect idle time on its own and provide a higher
>>> level interface.  If (2), the question then becomes whether we should
>>> try to do this within qemu and provide libvirt a higher level interface.
>>>
>>
>> A self-tuning solution is attractive because it reduces the need for
>> other components (management stack) or the user to get involved.  In
>> this case self-tuning should be possible.  We need to detect periods
>> of I/O inactivity, for example tracking the number of in-flight
>> requests and then setting a grace timer when it reaches zero.  When
>> the grace timer expires, we start streaming until the guest initiates
>> I/O again.
>>
>
> That detects idle I/O within a single QEMU guest, but you might have another
> guest running that's I/O bound which means that from an overall system
> throughput perspective, you really don't want to stream.
>
> I think libvirt might be able to do a better job here by looking at overall
> system I/O usage.  But I'm not sure hence this RFC :-)

Isn't this what block I/O controller cgroups are meant to solve?  If you
give vm-1 50% block bandwidth and vm-2 50% block bandwidth then vm-1
can do streaming without eating into vm-2's guaranteed bandwidth.

Also, I'm not sure we should worry about the priority of the I/O too
much: perhaps the user wants their vm to stream more than they want an
unimportant local vm that is currently I/O bound to have all resources
to itself.

So I think it makes sense to defer this and not try for system-wide
knowledge inside a QEMU process.

Stefan
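
For illustration, here is a rough sketch of the per-guest idle detection
described earlier in this thread: count in-flight requests, arm a grace
timer when the count reaches zero, and stream until the guest issues I/O
again.  It is written in Python for brevity and is not QEMU code.
IdleDetector, stream_while_idle and the stream callback are hypothetical
names, and the clock is injected as an assumption so the logic can be
exercised without real timers.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class IdleDetector:
    """Tracks in-flight guest requests and a grace timer (hypothetical helper)."""
    grace_seconds: float
    clock: Callable[[], float]          # returns the current time in seconds
    in_flight: int = 0
    idle_since: Optional[float] = None  # None while guest I/O is in flight

    def __post_init__(self) -> None:
        # Nothing is in flight at start-up, so the grace timer starts now.
        self.idle_since = self.clock()

    def request_started(self) -> None:
        self.in_flight += 1
        self.idle_since = None  # guest I/O cancels the grace timer

    def request_finished(self) -> None:
        if self.in_flight == 0:
            raise RuntimeError("request_finished() without request_started()")
        self.in_flight -= 1
        if self.in_flight == 0:
            self.idle_since = self.clock()  # arm the grace timer

    def may_stream(self) -> bool:
        """True once the guest has been idle for the whole grace period."""
        if self.idle_since is None:
            return False
        return self.clock() - self.idle_since >= self.grace_seconds


def stream_while_idle(detector: IdleDetector,
                      stream: Callable[[int], int],
                      image_size: int,
                      offset: int = 0) -> int:
    """Advance streaming while the guest stays idle; return the new offset.

    `stream(offset)` stands in for the monitor command and must return the
    number of sectors it streamed.
    """
    while offset < image_size and detector.may_stream():
        count = stream(offset)
        if count <= 0:
            break  # no progress; stop rather than spin
        offset += count
    return offset
```

The caller would invoke stream_while_idle() again with the returned offset
each time the grace timer expires.  Note that this only sees one guest's
I/O, which is exactly the limitation raised above; sharing bandwidth
between guests is left to the cgroup controller.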