From: Stefan Hajnoczi
Date: Tue, 24 May 2011 11:41:48 +0100
Subject: Re: [Qemu-devel] [RFC] Specification for qcow2 version 3
To: Kevin Wolf
Cc: qemu-devel@nongnu.org
In-Reply-To: <1304956314-7806-1-git-send-email-kwolf@redhat.com>

On Mon, May 9, 2011 at 4:51 PM, Kevin Wolf wrote:
> I hope the intentions of each change are clear, but feel free to ask if they
> aren't. Also when I wasn't if/how exactly to specify things, I left a TODO in
> some places.

Here is what I've picked up on and a summary for lazy readers who
don't want to reverse-engineer the rationale for proposed changes:

1. Feature bits

In order to support extending the format in the future a flexible
mechanism for specifying image features is required.  The single file
format version number isn't enough to express the various
compatibility strategies that could apply when introducing new
features.  Qcow2v3 adds feature bitfields for specifying individual
format features.

2. Sub-clusters

A 64-cluster region of the image file can be allocated at once in
order to reduce fragmentation.  The sub-cluster bitfield indicates
which sub-clusters are actually allocated, eliminating the need to
zero out (or read from the backing file) the entire 64-cluster region
at allocation time.

3. Zero clusters

Cluster descriptor bit 0 can mark clusters as zero.  This prevents
access to the backing file and instead reads zeroes.  This is not
really compatible with sub-clusters because it works at cluster
granularity?

Zero clusters enable efficient TRIM implementation even when a backing
file is in use.

> @@ -67,6 +67,42 @@ The first cluster of a qcow2 image contains the file header:
>                     Offset into the image file at which the snapshot table
>                     starts. Must be aligned to a cluster boundary.
>
> +If the version is 3 or higher, the header has the following additional fields.
> +For version 2, the values are assumed to be zero, unless specified otherwise
> +in the description of a field.
> +
> +         72 - 75:   incompatible_features

Is there a reason to use 32-bit instead of 64-bit?  I think virtio
recently learnt that wider feature bitfields are useful :).

> +                    Bitmask of incompatible features. An implementation must
> +                    fail to open an image if an unknown bit is set.
> +
> +                    Bit 0:      The reference counts in the image file may be
> +                                inaccurate. Implementations must check/rebuild
> +                                them if they rely on them.
> +
> +                    Bit 1:      Enable subclusters. This affects the L2 table
> +                                format.
> +
> +                    Bits 2-31:  Reserved (set to 0)
> +
> +         76 - 79:   compatible_features
> +                    Bitmask of compatible features. An implementation can
> +                    safely ignore any unknown bits that are set.
> +                    No compatible feature bits are defined yet. Reserved, set to 0.
> +
> +         80 - 83:   autoclear_features
> +                    Bitmask of auto-clear features. An implementation may only
> +                    write to an image with unknown auto-clear features if it
> +                    clears the respective bits from this field first.
> +                    No auto-clear feature bits are defined yet. Reserved, set to 0.
> +
> +         84 - 87:   refcount_bits
> +                    Size of a reference count block entry in bits. For version 2
> +                    images, the size is always 16 bits.

Version 2 does not have this field but always uses the default size of
16 bits?  I'm checking because earlier you wrote "For version 2, the
values are assumed to be zero, unless specified otherwise in the
description of a field".  But you don't expect v2 files to actually
store the value 16 here, right?

Valid ranges for this field?

> +                    [ TODO: Define order in sub-byte sizes ]
> +
> +        [ TODO: Add per-L2-table dirty flag to L1? ]
> +        [ TODO: Add per-refcount-block full flag to refcount table? ]
> +
>  Directly after the image header, optional sections called header extensions can
>  be stored. Each extension has a structure like the following:
>
> @@ -87,6 +123,8 @@ The remaining space between the end of the header extension area and the end of
>  the first cluster can be used for other data. Usually, the backing file name is
>  stored there.
>
> +[ TODO Feature name table? ]

There was discussion about using string names rather than feature
bits.  This would make failure on unknown feature bits much clearer to
end-users:

unable to open test.qcow3, feature "new_feature" not supported

The issue with feature names as strings is that it makes header
parsing more difficult - especially updating in place (delete or
insert).  For this reason I don't see string names as essential.

Perhaps there was another requirement for feature names that I forgot
about?

> +
>
>  == Host cluster management ==
>
> @@ -138,7 +176,8 @@ guest clusters to host clusters. They are called L1 and L2 table.
>
>  The L1 table has a variable size (stored in the header) and may use multiple
>  clusters, however it must be contiguous in the image file. L2 tables are
> -exactly one cluster in size.
> +exactly one cluster in size if subclusters are disabled, and two clusters if
> +they are enabled.
>
>  Given a offset into the virtual disk, the offset into the image file can be
>  obtained as follows:
> @@ -168,9 +207,32 @@ L1 table entry:
>                     refcount is exactly one. This information is only accurate
>                     in the active L1 table.
>
> -L2 table entry (for normal clusters):
> +L2 table entry:
>
> -    Bit  0 -  8:    Reserved (set to 0)
> +    Bit  0 -  61:   Cluster descriptor
> +
> +              62:   0 for standard clusters
> +                    1 for compressed clusters
> +
> +              63:   0 for a cluster that is unused or requires COW, 1 if its
> +                    refcount is exactly one. This information is only accurate
> +                    in L2 tables that are reachable from the the active L1
> +                    table.
> +
> +        64 - 127:   If subclusters are enabled, this contains a bitmask that
> +                    describes the allocation status of all 64 subclusters. The
> +                    first subcluster is represented by the LSB. A 0 bit means
> +                    that the subcluster is unallocated.
> +
> +Standard Cluster Descriptor:
> +
> +    Bit       0:    If set to 1, the cluster reads as all zeros instead of
> +                    referring to the backing file if the (sub-)cluster is
> +                    unallocated.
> +
> +                    With version 2, this is always 0.
> +
> +         1 -  8:    Reserved (set to 0)
>
>          9 - 55:    Bits 9-55 of host cluster offset. Must be aligned to a
>                     cluster boundary. If the offset is 0, the cluster is
> @@ -178,29 +240,17 @@ L2 table entry (for normal clusters):
>
>         56 - 61:    Reserved (set to 0)
>
> -             62:    0 (this cluster is not compressed)
> -
> -             63:    0 for a cluster that is unused or requires COW, 1 if its
> -                    refcount is exactly one. This information is only accurate
> -                    in L2 tables that are reachable from the the active L1
> -                    table.
>
> -L2 table entry (for compressed clusters; x = 62 - (cluster_size - 8)):
> +Compressed Clusters Descriptor (x = 62 - (cluster_size - 8)):
>
>     Bit  0 -  x:    Host cluster offset. This is usually _not_ aligned to a
>                     cluster boundary!
>
>        x+1 - 61:    Compressed size of the images in sectors of 512 bytes
>
> -             62:    1 (this cluster is compressed using zlib)
> -
> -             63:    0 for a cluster that is unused or requires COW, 1 if its
> -                    refcount is exactly one. This information is only accurate
> -                    in L2 tables that are reachable from the the active L1
> -                    table.
> -
> -If a cluster is unallocated, read requests shall read the data from the backing
> -file. If there is no backing file or the backing file is smaller than the image,
> +If a cluster or a subcluster is unallocated, read requests shall read the data
> +from the backing file (except if bit 0 in the Standard Cluster Descriptor is
> +set). If there is no backing file or the backing file is smaller than the image,
>  they shall read zeros for all parts that are not covered by the backing file.
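
For illustration, here is a rough Python sketch of how a reader might
decode a standard (uncompressed) L2 entry under the proposed layout and
decide where a subcluster read is served from.  This is only one reading
of the draft, not an implementation of it: the names decode_l2_entry and
subcluster_source are hypothetical, and it assumes the 128-bit entry is
stored as two big-endian 64-bit words with the descriptor word first,
which the draft does not spell out.

```python
import struct

# Bit positions taken from the proposed L2 table entry layout.
COMPRESSED_BIT = 1 << 62            # bit 62: 1 for compressed clusters
COPIED_BIT = 1 << 63                # bit 63: refcount is exactly one
ZERO_BIT = 1 << 0                   # standard descriptor bit 0: reads as zeros
HOST_OFFSET_MASK = ((1 << 56) - 1) & ~((1 << 9) - 1)   # bits 9-55


def decode_l2_entry(raw):
    """Decode a 16-byte L2 entry (hypothetical helper).

    Assumption: two big-endian 64-bit words, descriptor word first and
    subcluster bitmap second. The draft does not specify the word order.
    """
    descriptor, bitmap = struct.unpack(">QQ", raw)
    entry = {
        "compressed": bool(descriptor & COMPRESSED_BIT),
        "copied": bool(descriptor & COPIED_BIT),
        "subcluster_bitmap": bitmap,
    }
    if not entry["compressed"]:
        entry["reads_zero"] = bool(descriptor & ZERO_BIT)
        entry["host_offset"] = descriptor & HOST_OFFSET_MASK
    return entry


def subcluster_source(entry, index):
    """Return where a read of subcluster `index` (0-63) gets its data.

    Only meaningful for standard clusters in this sketch.
    """
    if entry["compressed"]:
        raise ValueError("compressed clusters are not handled in this sketch")
    if (entry["subcluster_bitmap"] >> index) & 1:   # LSB is the first subcluster
        return "image"
    if entry["reads_zero"]:
        return "zeros"
    return "backing"   # reads zeros where no backing file covers the range
```

In this reading the zero flag applies to every unallocated subcluster of
the entry at once, which is the cluster-granularity limitation raised in
the summary at the top.
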
>
>
> @@ -253,7 +303,13 @@ Snapshot table entry:
>         36 - 39:    Size of extra data in the table entry (used for future
>                     extensions of the format)
>
> -        variable:   Extra data for future extensions. Must be ignored.
> +        variable:   Extra data for future extensions. Unknown fields must be
> +                    ignored. Currently defined are (offset relative to snapshot
> +                    table entry):
> +
> +                    Byte 40 - 47:   Size of the VM state in bytes. 0 if no VM
> +                                    state is saved. If this field is present,
> +                                    the 32-bit value in bytes 32-35 is ignored.

This is because you want a 64-bit VM state offset?

Need to add a note that this is v3-specific?

This field now precedes the id_str and name variable length data?

Stefan