From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from [140.186.70.92] (port=59738 helo=eggs.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1OuoPV-0006mP-Vl for qemu-devel@nongnu.org; Sun, 12 Sep 2010 11:24:07 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.69) (envelope-from ) id 1OuoPU-00068T-Gc for qemu-devel@nongnu.org; Sun, 12 Sep 2010 11:24:05 -0400 Received: from mail-yx0-f173.google.com ([209.85.213.173]:63675) by eggs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1OuoPU-00068I-EB for qemu-devel@nongnu.org; Sun, 12 Sep 2010 11:24:04 -0400 Received: by yxs7 with SMTP id 7so2063861yxs.4 for ; Sun, 12 Sep 2010 08:24:03 -0700 (PDT) Message-ID: <4C8CF07C.5040509@codemonkey.ws> Date: Sun, 12 Sep 2010 10:23:40 -0500 From: Anthony Liguori MIME-Version: 1.0 Subject: Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration References: <4C864118.7070206@linux.vnet.ibm.com> <4C864D65.6090004@redhat.com> <4C8652CB.9060801@linux.vnet.ibm.com> <4C8CCA91.4060001@redhat.com> <4C8CD4DB.9020905@codemonkey.ws> <4C8CD847.8030804@redhat.com> In-Reply-To: <4C8CD847.8030804@redhat.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit List-Id: qemu-devel.nongnu.org List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Avi Kivity Cc: Kevin Wolf , Stefan Hajnoczi , qemu-devel , "libvir-list@redhat.com" , Stefan Hajnoczi On 09/12/2010 08:40 AM, Avi Kivity wrote: > Why would it serialize all I/O operations? It's just like another > vcpu issuing reads. Because the block layer isn't re-entrant. >> What you basically do is: >> >> stream_step_three(): >> complete() >> >> stream_step_two(offset, length): >> bdrv_aio_readv(offset, length, buffer, stream_step_three) >> >> bdrv_aio_stream(): >> bdrv_aio_find_free_cluster(stream_step_two) > > Isn't there a write() missing somewhere? Streaming relies on copy-on-read to do the writing. >> >> And that's exactly what the current code looks like. The only change >> to the patch that this does is make some of qed's internals be block >> layer interfaces. > > Why do you need find_free_cluster()? That's a physical offset thing. > Just write to the same logical offset. > > IOW: > > bdrv_aio_stream(): > bdrv_aio_read(offset, stream_2) It's an optimization. If you've got a fully missing L1 entry, then you're going to memset() 2GB worth of zeros. That's just wasted work. With a 1TB image with a 1GB allocation, it's a huge amount of wasted work. > stream_2(): > if all zeros: > increment offset > if more: > bdrv_aio_stream() > bdrv_aio_write(offset, stream_3) > > stream_3(): > bdrv_aio_write(offset, stream_4) I don't understand why stream_3() is needed. > stream_4(): > increment offset > if more: > bdrv_aio_stream() > > > Of course, need to serialize wrt guest writes, which adds a bit more > complexity. I'll leave it to you to code the state machine for that. http://repo.or.cz/w/qemu/aliguori.git/commitdiff/d44ea43be084cc879cd1a33e1a04a105f4cb7637?hp=34ed425e7dd39c511bc247d1ab900e19b8c74a5d >> >>>> Third problem is that streaming really requires being able to do >>>> zero write detection in a meaningful way. You don't want to always >>>> do zero write detection so you need another interface to mark a >>>> specific write as a write that should be checked for zeros. >>> >>> You can do that in bdrv_stream(), above, before the actual write, >>> and call bdrv_unmap() if you detect zeros. >> >> My QED branch now does that FWIW. At the moment, it only detects >> zero reads to unallocated clusters and writes a special zero cluster >> marker. However, the detection code is in the generic path so once >> the fsck() logic is working, we can implement a free list in QED. >> >> In QED, the detection code needs to have a lot of knowledge about >> cluster boundaries and the format of the device. In principle, this >> should be common code but it's not for the same reason copy-on-write >> is not common code today. > > Parts of it are: commit. Of course, that's horribly synchronous. If you've got AIO internally, making commit work is pretty easy. Doing asynchronous commit at a generic layer is not easy though unless you expose lots of details. Generally, I think the block layer makes more sense if the interface to the formats are high level and code sharing is achieved not by mandating a world view but rather but making libraries of common functionality. This is more akin to how the FS layer works in Linux. So IMHO, we ought to add a bdrv_aio_commit function, turn the current code into a generic_aio_commit, implement a qed_aio_commit, then somehow do qcow2_aio_commit, and look at what we can refactor into common code. Regards, Anthony Liguori