From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from [140.186.70.92] (port=33262 helo=eggs.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1Ot0k1-0006HT-TE for qemu-devel@nongnu.org; Tue, 07 Sep 2010 12:09:51 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.69) (envelope-from ) id 1Ot0k0-00058C-Do for qemu-devel@nongnu.org; Tue, 07 Sep 2010 12:09:49 -0400
Received: from mx1.redhat.com ([209.132.183.28]:35048) by eggs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1Ot0k0-000580-3m for qemu-devel@nongnu.org; Tue, 07 Sep 2010 12:09:48 -0400
Message-ID: <4C8663C4.1090508@redhat.com>
Date: Tue, 07 Sep 2010 19:09:40 +0300
From: Avi Kivity
MIME-Version: 1.0
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
References: <1283767478-16740-1-git-send-email-stefanha@linux.vnet.ibm.com> <4C84E738.3020802@codemonkey.ws> <4C865187.6090508@redhat.com> <4C865CFE.7010508@codemonkey.ws>
In-Reply-To: <4C865CFE.7010508@codemonkey.ws>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
List-Id: qemu-devel.nongnu.org
To: Anthony Liguori
Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org

On 09/07/2010 06:40 PM, Anthony Liguori wrote:
>>
>> Need a checksum for the header.
>
> Is that not a bit overkill for what we're doing? What's the benefit?

Make sure we're not looking at a header write interrupted by a crash.

>>>
>>> The L2 link '''should''' be made after the data is in place on
>>> storage. However, when no ordering is enforced the worst case
>>> scenario is an L2 link to an unwritten cluster.
>>
>> Or it may cause corruption if the physical file size is not
>> committed, and L2 now points at a free cluster.
>
> An fsync() will make sure the physical file size is committed. The
> metadata does not carry an additional integrity guarantees over the
> actual disk data except that in order to avoid internal corruption, we
> have to order the L2 and L1 writes.

I was referring to "when no ordering is enforced, the worst case
scenario is an L2 link to an unwritten cluster". This isn't true -
worst case you point to an unallocated cluster which can then be claimed
by data or metadata.

>
> As part of the read process, it's important to validate that the L2
> entries don't point to blocks beyond EOF. This is an indication of a
> corrupted I/O operation and we need to treat that as an unallocated
> cluster.

Right, but what if the first operation referring to that cluster is an
allocation?

>> We can remove this requirement by copying-on-write any metadata
>> write, and keeping two copies of the header (with version numbers and
>> checksums).
>
> QED has a property today that all metadata or cluster locations have a
> single location on the disk format that is immutable. Defrag would
> relax this but defrag can be slow.
>
> Having an immutable on-disk location is a powerful property which
> eliminates a lot of complexity with respect to reference counting and
> dealing with free lists.

However, it exposes the format to "writes may corrupt overwritten data".

>
> For the initial design I would avoid introducing something like this.
> One of the nice things about features is that we can introduce
> multi-level trees as a future feature if we really think it's the
> right thing to do.
>
> But we should start at a simple design with high confidence and high
> performance, and then introduce features with the burden that we're
> absolutely sure that we don't regress integrity or performance.

For most things, yes. Metadata checksums should be designed in though
(since we need to double the pointer size).

Variable height trees have the nice property that you don't need multi
cluster allocation. It's nice to avoid large L2s for very large disks.
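
To make two of the points above concrete, here is a minimal Python sketch. It is not QED code and not a proposed on-disk layout: the header layout (version, length, payload, CRC32), the choice of CRC32 itself, and the names encode_header, decode_header, pick_header and l2_entry_is_allocated are all assumptions made for illustration. It shows how two header copies with version numbers and checksums let a reader reject a copy torn by a crash, and how an L2 entry pointing beyond EOF can be treated as unallocated on read.

```python
import struct
import zlib

# Hypothetical layout (an assumption, not the QED header):
# version (u64), payload length (u32), payload bytes, CRC32 (u32) of all before it.
_PREFIX = struct.Struct("<QI")
_CRC = struct.Struct("<I")


def encode_header(version: int, payload: bytes) -> bytes:
    """Serialize one header copy and append its checksum."""
    body = _PREFIX.pack(version, len(payload)) + payload
    return body + _CRC.pack(zlib.crc32(body))


def decode_header(raw: bytes):
    """Return (version, payload), or None if the copy is truncated or corrupt."""
    if len(raw) < _PREFIX.size + _CRC.size:
        return None
    version, length = _PREFIX.unpack_from(raw, 0)
    end = _PREFIX.size + length
    if end + _CRC.size > len(raw):
        return None
    (stored,) = _CRC.unpack_from(raw, end)
    if zlib.crc32(raw[:end]) != stored:
        return None  # e.g. a header write interrupted by a crash
    return version, raw[_PREFIX.size:end]


def pick_header(copy_a: bytes, copy_b: bytes):
    """Choose the valid copy with the highest version, or None if both fail."""
    decoded = (decode_header(copy_a), decode_header(copy_b))
    valid = [h for h in decoded if h is not None]
    return max(valid, key=lambda h: h[0]) if valid else None


def l2_entry_is_allocated(offset: int, cluster_size: int, file_size: int) -> bool:
    """Treat zero, misaligned, or beyond-EOF offsets as unallocated."""
    if offset == 0 or offset % cluster_size != 0:
        return False
    return offset + cluster_size <= file_size
```

With this scheme a writer would update the older copy with version + 1, so a crash in the middle of that write leaves the other copy intact and pick_header falls back to it. Note that the EOF check only catches entries that still point past the end of the file when they are read; it does not cover the case raised above, where the stale cluster is claimed by a later allocation first.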
-- 
error compiling committee.c: too many arguments to function