Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
From: Avi Kivity
Date: Tue, 07 Sep 2010 17:51:51 +0300
To: Anthony Liguori
Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org

On 09/06/2010 04:06 PM, Anthony Liguori wrote:
>
> Another point worth mentioning is that our intention is to have a
> formal specification of the format before merging. A start of that
> is located at http://wiki.qemu.org/Features/QED
>
> =Specification=
>
> The file format looks like this:
>
> +---------+---------+---------+-----+
> | extent0 | extent1 | extent2 | ... |
> +---------+---------+---------+-----+
>
> The first extent contains a header. The header contains information
> about the first data extent. A data extent may be a data cluster,
> an L2 table, or an L1 table. L1 and L2 tables are composed of one
> or more contiguous extents.
>
> ==Header==
> Header {
>     uint32_t magic;               /* QED\0 */

Endianness?

>     uint32_t cluster_size;        /* in bytes */

Does cluster == extent? If so, use the same terminology. If not,
explain the difference. Usually an extent is a variable-size
structure.

>     uint32_t table_size;          /* table size, in clusters */

Presumably the L1 table size? Or any table size?

Hm. It would be nicer not to require contiguous sectors anywhere.
How about a variable- or fixed-height tree?

>     uint32_t first_cluster;       /* in clusters */

First cluster of what?

>     uint64_t features;            /* format feature bits */
>     uint64_t compat_features;     /* compat feature bits */
>     uint64_t l1_table_offset;     /* L1 table offset, in clusters */
>     uint64_t image_size;          /* total image size, in clusters */

The logical size, yes? Is the physical image size always derived from
the host file metadata? Is this always safe?

>     /* if (features & QED_F_BACKING_FILE) */
>     uint32_t backing_file_offset; /* in bytes from start of header */
>     uint32_t backing_file_size;   /* in bytes */

It's really the filename size, not the file size. Also, make a note
that it is not zero-terminated.

>     /* if (compat_features & QED_CF_BACKING_FORMAT) */
>     uint32_t backing_fmt_offset;  /* in bytes from start of header */
>     uint32_t backing_fmt_size;    /* in bytes */

Why not make it mandatory?

> }

Need a checksum for the header.

>
> ==Extent table==
>
> #define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
>
> Table {
>     uint64_t offsets[TABLE_NOFFSETS];
> }

It's fashionable to put checksums here.

Do we want a real extent-based format like modern filesystems? So
that after defragmentation a full image has O(1) metadata?
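
To make the checksum suggestion concrete: verifying the header on
open could be as simple as the sketch below. This is illustrative
only; it assumes a little-endian CRC32 stored in the last four bytes
of the header area, which the current draft does not define:

  #include <stddef.h>
  #include <stdint.h>
  #include <zlib.h>   /* crc32() */

  /* Assumed layout: the header occupies header_len bytes; its last
   * 4 bytes hold a little-endian CRC32 of everything before them. */
  static int header_crc_ok(const uint8_t *raw, size_t header_len)
  {
      uint32_t stored = (uint32_t)raw[header_len - 4]
                      | (uint32_t)raw[header_len - 3] << 8
                      | (uint32_t)raw[header_len - 2] << 16
                      | (uint32_t)raw[header_len - 1] << 24;
      /* The CRC covers everything except the checksum field itself. */
      return stored == (uint32_t)crc32(0L, raw, (uInt)(header_len - 4));
  }

Writers would recompute the field over the same range before every
header update.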
>
> The extent tables are organized as follows:
>
>                    +----------+
>                    | L1 table |
>                    +----------+
>               ,------'  |  '------.
>      +----------+       |       +----------+
>      | L2 table |      ...      | L2 table |
>      +----------+               +----------+
>          ,------'  |  '------.
> +----------+       |       +----------+
> |   Data   |      ...      |   Data   |
> +----------+               +----------+
>
> The table_size field allows tables to be multiples of the cluster
> size. For example, cluster_size=64 KB and table_size=4 results in
> 256 KB tables.
>
> =Operations=
>
> ==Read==
> # If the L2 table is not present in L1, read from the backing image.
> # If the data cluster is not present in L2, read from the backing
>   image.
> # Otherwise read data from the cluster.

If the data is not in the backing image either, provide zeros.

> ==Write==
> # If the L2 table is not present in L1, allocate a new cluster and
>   L2 table. Perform the L2 and L1 links after writing the data.
> # If the data cluster is not present in L2, allocate a new cluster.
>   Perform the L2 link after writing the data.
> # Otherwise overwrite the data cluster.

Detail copy-on-write from the backing image.

On a partial write without a backing file, do we recommend
zero-filling the cluster (to avoid intra-cluster fragmentation)?

> The L2 link '''should''' be made after the data is in place on
> storage. However, when no ordering is enforced the worst case
> scenario is an L2 link to an unwritten cluster.

Or it may cause corruption if the physical file size is not committed
and L2 now points at a free cluster.

> The L1 link '''must''' be made after the L2 cluster is in place on
> storage. If the order is reversed then the L1 table may point to a
> bogus L2 table. (Is this a problem since clusters are allocated at
> the end of the file?)
>
> ==Grow==
> # If table_size * TABLE_NOFFSETS < new_image_size, fail -EOVERFLOW.
>   The L1 table is not big enough.

With a variable-height tree we could instead allocate a new root,
link its first entry to the old root, and write a new header with the
updated root and height.

> # Write the new image_size header field.
>
> =Data integrity=
> ==Write==
> Writes that complete before a flush must be stable when the flush
> completes.
>
> If storage is interrupted (e.g. power outage) then writes in
> progress may be lost, stable, or partially completed. The storage
> must not be otherwise corrupted or inaccessible after it is
> restarted.

We can remove this requirement by copying-on-write any metadata write
and keeping two copies of the header, with version numbers and
checksums. Enterprise storage will not corrupt on writes, but
commodity storage may.
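
To make the two-copy header idea concrete, recovery on open could
look like the sketch below. This is purely illustrative; the layout,
names, and sizes are made up and not part of the current QED draft:

  #include <stddef.h>
  #include <stdint.h>
  #include <zlib.h>            /* crc32() */

  /* Hypothetical on-disk layout: two copies at fixed offsets. */
  struct hdr_copy {
      uint64_t version;        /* monotonically increasing */
      uint8_t  payload[4080];  /* the real header fields */
      uint32_t crc;            /* CRC32 of version + payload */
      uint32_t pad;            /* keep each copy 4 KB */
  };

  static uint32_t hdr_crc(const struct hdr_copy *c)
  {
      return (uint32_t)crc32(0L, (const unsigned char *)c,
                             (uInt)offsetof(struct hdr_copy, crc));
  }

  /* On open, pick the newest copy whose checksum verifies;
   * return its index, or -1 if both are corrupt. */
  static int pick_header(const struct hdr_copy c[2])
  {
      int ok0 = hdr_crc(&c[0]) == c[0].crc;
      int ok1 = hdr_crc(&c[1]) == c[1].crc;
      if (ok0 && ok1)
          return c[0].version > c[1].version ? 0 : 1;
      return ok0 ? 0 : (ok1 ? 1 : -1);
  }

Updates bump the version, recompute the CRC, and overwrite only the
older slot, so the worst case after a crash is falling back to the
previous header, never an unreadable image.

--
error compiling committee.c: too many arguments to function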