From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from [140.186.70.92] (port=44111 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1Ou1vg-00034m-MU
	for qemu-devel@nongnu.org; Fri, 10 Sep 2010 07:38:05 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.69)
	(envelope-from ) id 1Ou1vf-0003XP-Bt
	for qemu-devel@nongnu.org; Fri, 10 Sep 2010 07:38:04 -0400
Received: from mx1.redhat.com ([209.132.183.28]:47373)
	by eggs.gnu.org with esmtp (Exim 4.69)
	(envelope-from ) id 1Ou1vf-0003XL-4K
	for qemu-devel@nongnu.org; Fri, 10 Sep 2010 07:38:03 -0400
Message-ID: <4C8A1893.7010100@redhat.com>
Date: Fri, 10 Sep 2010 14:37:55 +0300
From: Avi Kivity
MIME-Version: 1.0
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
References: <1283767478-16740-1-git-send-email-stefanha@linux.vnet.ibm.com>
	<4C84E738.3020802@codemonkey.ws> <4C865187.6090508@redhat.com>
	<4C8885BB.8020000@redhat.com> <4C891CC0.1090108@codemonkey.ws>
	<4C8A14F6.9040209@redhat.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
List-Id: qemu-devel.nongnu.org
To: Stefan Hajnoczi
Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org

On 09/10/2010 02:29 PM, Stefan Hajnoczi wrote:
>
>> They only guarantee that the filesystem is consistent.  A write() that
>> extends a file may be reordered with the L2 write() that references the new
>> cluster.  Requiring fsck on unclean shutdown is very backwards for a 2010
>> format.
> I'm interested in understanding how preallocation will work in a way
> that does not introduce extra flushes in the common case or require
> fsck.
>
> It seems to me that you can either preallocate and then rely on an
> fsck on startup to figure out which clusters are now really in use, or
> you can keep an exact max_cluster but this requires an extra write
> operation for each allocating write (and perhaps a flush?).
>
> Can you go into more detail in how preallocation should work?

You simply leak the preallocated clusters.  That's not as bad as it
sounds - if you never write() the clusters they don't occupy any space on
disk, so you only leak address space, not actual storage.  If you copy
the image then you actually do lose storage.

If you really wanted to recover the lost storage you could start a thread
in the background that looks for unallocated blocks.  Unlike fsck, you
don't have to wait for it since data integrity does not depend on it.  I
don't think it's worthwhile, though.

Another game you can play with preallocation is varying the
preallocation window with workload: start with no preallocation, and as
the guest starts to allocate, you increase the window.  When the guest
starts to idle again you can return the storage to the operating system
and reduce the window back to zero.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
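
The workload-dependent preallocation window described in the message could
be sketched roughly as follows.  This is only an illustration, not code from
QED or QEMU: the class name PreallocWindow, the doubling growth policy, and
the max_window cap are all assumptions made for the example.  The point it
shows is that a crash loses only the reserved-but-unused clusters (address
space), and that going idle hands them back and resets the window to zero.

```python
class PreallocWindow:
    """Hypothetical adaptive preallocation window, counted in clusters."""

    def __init__(self, max_window=64):
        self.max_window = max_window  # assumed cap, not from the thread
        self.window = 0               # start with no preallocation
        self.reserved = 0             # preallocated but not yet handed out
        self.next_cluster = 0         # next cluster index to hand out

    def allocate(self):
        """Hand out one cluster for an allocating write; return its index."""
        if self.reserved == 0:
            # The guest is allocating, so widen the window.  Doubling is an
            # assumed policy; the message only says "increase the window".
            self.window = min(max(1, self.window * 2), self.max_window)
            # A real format would extend the file and persist the new
            # end-of-image mark here, once per window rather than per write.
            self.reserved = self.window
        cluster = self.next_cluster
        self.next_cluster += 1
        self.reserved -= 1
        return cluster

    def idle(self):
        """Guest went idle: return unused clusters and reset the window."""
        returned = self.reserved
        self.reserved = 0
        self.window = 0
        return returned

    def leaked_after_crash(self):
        """Clusters that would be leaked (as address space) by a crash now."""
        return self.reserved
```

With this policy the number of metadata updates grows with the number of
window refills, not with the number of allocating writes, which is the
trade-off against the leak.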