From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from [140.186.70.92] (port=44111 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1Ou1vg-00034m-MU
	for qemu-devel@nongnu.org; Fri, 10 Sep 2010 07:38:05 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.69)
	(envelope-from <avi@redhat.com>) id 1Ou1vf-0003XP-Bt
	for qemu-devel@nongnu.org; Fri, 10 Sep 2010 07:38:04 -0400
Received: from mx1.redhat.com ([209.132.183.28]:47373)
	by eggs.gnu.org with esmtp (Exim 4.69)
	(envelope-from <avi@redhat.com>) id 1Ou1vf-0003XL-4K
	for qemu-devel@nongnu.org; Fri, 10 Sep 2010 07:38:03 -0400
Message-ID: <4C8A1893.7010100@redhat.com>
Date: Fri, 10 Sep 2010 14:37:55 +0300
From: Avi Kivity <avi@redhat.com>
MIME-Version: 1.0
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
References: <1283767478-16740-1-git-send-email-stefanha@linux.vnet.ibm.com>	<4C84E738.3020802@codemonkey.ws>	<4C865187.6090508@redhat.com>	<AANLkTikv+3mfCgAAB7X6VF5cHYWJ1qRGFOK96mDQx3LT@mail.gmail.com>	<4C8885BB.8020000@redhat.com>	<4C891CC0.1090108@codemonkey.ws>	<4C8A14F6.9040209@redhat.com>
	<AANLkTikaotc-+=Wcjj1oZt6mTQGo79Uk5KaYNXGTAYdn@mail.gmail.com>
In-Reply-To: <AANLkTikaotc-+=Wcjj1oZt6mTQGo79Uk5KaYNXGTAYdn@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Stefan Hajnoczi <stefanha@gmail.com>
Cc: Kevin Wolf <kwolf@redhat.com>, Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>, qemu-devel@nongnu.org

  On 09/10/2010 02:29 PM, Stefan Hajnoczi wrote:
>
>> They only guarantee that the filesystem is consistent.  A write() that
>> extends a file may be reordered with the L2 write() that references the new
>> cluster.  Requiring fsck on  unclean shutdown is very backwards for a 2010
>> format.
> I'm interested in understanding how preallocation will work in a way
> that does not introduce extra flushes in the common case or require
> fsck.
>
> It seems to me that you can either preallocate and then rely on an
> fsck on startup to figure out which clusters are now really in use, or
> you can keep an exact max_cluster but this requires an extra write
> operation for each allocating write (and perhaps a flush?).
>
> Can you go into more detail in how preallocation should work?

You simply leak the preallocated clusters.

That's not as bad as it sounds - if you never write() the clusters they 
don't occupy any space on disk, so you only leak address space, not 
actual storage.  If you copy the image, though, you do actually lose storage.
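
As a rough illustration of what "leaking" means here, the following is a toy in-memory model, not QED code. The class name, the `window` parameter and the idea of extending the file a whole window at a time are assumptions made for the sketch. It only tracks cluster numbers and touches no real file.

```python
class PreallocatingImage:
    """Toy model of an image file whose size is counted in clusters."""

    def __init__(self, window):
        self.window = window      # clusters reserved per file extension (hypothetical knob)
        self.file_clusters = 0    # durable file size, in clusters
        self.next_free = 0        # in-memory allocation cursor, lost on a crash
        self.written = set()      # clusters that actually received data

    def allocate(self):
        # Extend the file only when the reserved window is used up, so most
        # allocating writes need no extra metadata update.
        if self.next_free == self.file_clusters:
            self.file_clusters += self.window
        cluster = self.next_free
        self.next_free += 1
        self.written.add(cluster)
        return cluster

    def crash_and_reopen(self):
        # After an unclean shutdown only the durable file size is known, so
        # allocation resumes at end of file and the unused tail is leaked.
        leaked = self.file_clusters - self.next_free
        self.next_free = self.file_clusters
        return leaked


img = PreallocatingImage(window=8)
first_three = [img.allocate() for _ in range(3)]
leaked = img.crash_and_reopen()
after_crash = img.allocate()
```

In this model the leaked clusters were never written, which is the case where only address space is lost; no scan is needed before the image can be used again.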

If you really wanted to recover the lost storage you could start a 
thread in the background that looks for unallocated blocks.  Unlike 
fsck, you don't have to wait for it since data integrity does not depend 
on it.  I don't think it's worthwhile, though.
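
For what it's worth, such a background pass could look roughly like the sketch below. It assumes the set of clusters referenced by the tables is available as a plain Python set, and all function names are made up for the example. A real implementation would also have to recheck a cluster before reusing it, since anything allocated after the snapshot is taken would otherwise look leaked.

```python
import threading


def find_leaked_clusters(file_clusters, referenced):
    """Return clusters inside the file that no table entry points at."""
    return [c for c in range(file_clusters) if c not in referenced]


def start_background_scan(file_clusters, referenced, on_done):
    # Snapshot the references so the scan works on a stable view while the
    # caller keeps serving I/O; nothing waits for the result.
    snapshot = frozenset(referenced)
    worker = threading.Thread(
        target=lambda: on_done(find_leaked_clusters(file_clusters, snapshot)),
        daemon=True,
    )
    worker.start()
    return worker


found = []
scan = start_background_scan(16, {0, 1, 2, 8}, found.extend)
scan.join()  # joined here only so the example result can be inspected
```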

Another game you can play with preallocation is varying the preallocation 
window with the workload: start with no preallocation, and as the guest 
starts to allocate, increase the window.  When the guest goes idle again 
you can return the storage to the operating system and reduce the window 
back to zero.
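
A minimal sketch of such a policy, assuming the window doubles on each allocating write up to a cap; the doubling rule, the cap and the method names are all invented for illustration. Actually handing the storage back (for example by truncating the file) is left out.

```python
class AdaptiveWindow:
    """Preallocation window that grows under allocation and resets on idle."""

    def __init__(self, max_window=64):
        self.window = 0               # start with no preallocation
        self.max_window = max_window  # hypothetical upper bound, in clusters

    def on_allocation(self):
        # Grow 0 -> 1 -> 2 -> 4 ... until the cap is reached.
        self.window = min(self.max_window, max(1, self.window * 2))
        return self.window

    def on_idle(self):
        # Report how many clusters could be handed back, then drop to zero.
        released = self.window
        self.window = 0
        return released


policy = AdaptiveWindow(max_window=8)
growth = [policy.on_allocation() for _ in range(5)]
released = policy.on_idle()
```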


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.