From: Zhi Yong Wu
Date: Mon, 14 May 2012 23:03:03 +0800
Subject: Re: [Qemu-devel] Some questions about qcow2 code
To: Stefan Hajnoczi
Cc: Kevin Wolf, QEMU Developers, 陳韋任

On Fri, May 11, 2012 at 9:27 AM, Stefan Hajnoczi wrote:
> On Fri, May 4, 2012 at 3:44 AM, Zhi Yong Wu wrote:
>> On Sun, Apr 29, 2012 at 1:35 AM, Stefan Hajnoczi wrote:
>>> On Sat, Apr 28, 2012 at 5:25 PM, Zhi Yong Wu wrote:
>>>
>>> Here are explanations for the quick ones that I know without looking
>>> at the code.  I think they will help you understand some of the other
>>> ones too.
>>>
>>> Please send questions like this to qemu-devel@nongnu.org so the thread
>>> is archived on the mailing lists and others can learn from it.
>> I feel more comfortable and free if it is private. :)
>
> No one else in the community can learn from the discussion or help
> out.  We need to be comfortable doing things in public.
OK, I am making this mail discussion public, and I hope it can help
others interested in qcow2.
>
>>>
>>>> 1.
>>>> static int get_refcount(BlockDriverState *bs, int64_t cluster_index)
>>>> {
>>>> ...
>>>>    refcount_table_index = cluster_index >> (s->cluster_bits - REFCOUNT_SHIFT);
>>>> ...
>>>>
>>>>    block_index = cluster_index &
>>>>        ((1 << (s->cluster_bits - REFCOUNT_SHIFT)) - 1);
>>>> How should I understand these two expressions?
>>>> ...
>>>> }
>>>
>>> See "Host cluster management" in the qcow2 spec.  Refcounts are stored
>>> in a 2-level table.  refcount_table_index is the index into the
>>> first-level table, where we store the offsets of refcount blocks.
>>> block_index is the index of the refcount block element that contains
>>> the actual reference count.
>> I know what refcount_table_index and block_index mean. What I actually
>> want to ask is what "cluster_index >> (s->cluster_bits -
>> REFCOUNT_SHIFT)" and "cluster_index & ((1 << (s->cluster_bits -
>> REFCOUNT_SHIFT)) - 1)" mean. Why are they refcount_table_index and
>> block_index? Why is REFCOUNT_SHIFT needed? Because a refcount block
>> entry is 16 bits?
>
> The cluster_index is the image file cluster number:
>
> | 0 | 1 | 2 | 3 | ...
> image file
>
> So these calculations are just dividing and finding the remainder for
> the 2-level refcount data structure.  For example, if the refcount
> block holds 2 entries, then we have:
>
> | 0 | 1 | 2 | 3 | ...
> image file
>
> | A     | B     | ...
> image file -> refcount tables
>
> A: | x | x |   B: | x | x |
> refcount tables
>
> So cluster_index = 1 means we need to look at refcount table A at index 1.
>
> In other words:
> refcount_table_index = cluster_index / 2
> block_index = cluster_index % 2
>
> The shifts and bitwise ands are just another way of expressing the
> division/modulus operation.  And instead of using a constant like "2",
> qcow2 supports variable cluster sizes (cluster_bits) and has a
> REFCOUNT_SHIFT constant.
Great, I can now understand the two expressions and other similar ones,
thanks.
>
>>>
>>>> 2.
>>>>
>>>> static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>>>>    int64_t offset, int64_t length, int addend)
>>>> {
>>>> ....
>>>>    if (addend < 0) {
>>>>        qcow2_cache_set_dependency(bs, s->refcount_block_cache,
>>>>            s->l2_table_cache);
>>>>    }
>>>> When addend = -1, why do we need to invoke qcow2_cache_set_dependency?
>>>
>>> The cache dependency is used to ensure that cached metadata is flushed
>>> in the correct order.  qcow2 must be careful to flush data to disk so
>>> that writes are ordered - otherwise a power failure could corrupt the
>>> image file when unordered writes are partially applied to the image
>>> file.
>> Great, thanks.
>>>
>>> For example, we want to allocate an image file cluster ("host
>>> cluster") *before* we reference it from a table.  Otherwise a power
>> But in this example, an allocated cluster will not be referenced by
>> any table entry if a power failure takes place.
>
> It's not possible to keep the image in a "clean" state at all times,
> but we need to keep it in a "consistent" state at all times.
> "Consistent" means:
> 1. All L1/L2 table entries should point to allocated clusters.
> 2. Data should be written to allocated clusters before they are
>    referenced by L1/L2 entries.
>
> "Consistent" is weaker than "clean".  A "consistent" image can have
> leaked clusters that are allocated (refcount > 0) but not referenced
> by any L1/L2 table.
>
> For multi-step metadata updates you will find there is no order that
> is "clean" at every step - it's simply not possible, no matter which
> order you choose.  But there is an order that is "consistent" at every
> step and that's good enough (it means we will not lose data or corrupt
> the image file).
Great, thanks.
>
> BTW this is why file systems use journals.
> Journals solve the
> consistent update problem in a different way - after a crash the
> journal is replayed to finish all in-flight updates, resulting in a
> consistent file system.  It's a different technique that none of the
> popular image formats use.
>
>>> failure could result in the table pointing to an unallocated cluster
>>> (the refcount update did not make it to disk but the L2 table update
>> Why do you say that the refcount update did not? The refcount block
>> entry counts the references to one cluster.
>
> Because disk writes may sit in a volatile disk write cache or host OS
> page cache.  The order in which cached data is really written to disk
> is not defined.  That means we don't know if the refcount update makes
> it to disk before the power fails.
>
>>>> 12.
>>>>
>>>> static coroutine_fn int qcow2_co_flush_to_os(BlockDriverState *bs)
>>>>
>>>> I thought that this function should only flush the data to the OS
>>>> cache, not to disk, right? But I checked it and found that it flushes
>>>> the data to the disk.
>>>
>>> Metadata updates must be handled very carefully - they need to be
>>> ordered so that a power failure never leaves the image file in an
>>> inconsistent (corrupt) state.  Therefore we *do* need to flush to disk
>> Then should we adjust this function name? Otherwise it will confuse us.
>
> Maybe.  The BlockDriver callback name makes sense.  It just happens
> that the qcow2 implementation really does flush so that metadata
> updates are ordered.
Kevin has explained this in another mail thread.
>
>>> when applying several different types of metadata updates in order
>>> (e.g. L2 table, refcount table).
>> Should we usually update the L2 table *before* the refcount table, or
>> update the refcount table *before* the L2 table?
>
> When allocating clusters we first need to update the refcount table,
> write the data into the cluster, and then perform the L2 update.
When freeing clusters, the order is reversed.
>
>>>> 14.
>>>>
>>>> qcow2_snapshot_create()  {
>>>> ....
>>>>
>>>>    ret = bdrv_flush(bs);
>>>> Why does it need to flush the data here?
>>>
>>> To understand the meaning of a flush, look at the operation that was
>>> performed before it.
>>>
>>> Here we just incremented the refcounts.  We need to make sure these
>>> metadata updates are on disk before we can add the snapshot entry into
>>> the qcow2 image file - otherwise a power failure could result in a
>>> snapshot entry without allocated clusters.
>> But it may also result in clusters that are allocated without a
>> corresponding snapshot entry.
>
> Yes.  The image will still be "consistent", just not "clean".  We
> leaked clusters but did not lose data or corrupt the image file.
Great, thanks.
>
> Stefan

-- 
Regards,

Zhi Yong Wu
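
For question 1, the divide/remainder reading of the two expressions can be checked with a small standalone C sketch. It is only an illustration: it assumes REFCOUNT_SHIFT is 1 (16-bit, i.e. 2-byte, refcount entries, as guessed in the thread), and the helper names refcount_table_index(), refcount_block_index() and entries_per_block() are made up here and are not QEMU functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: one refcount entry is 2 bytes (1 << 1). */
#define REFCOUNT_SHIFT 1

/* Number of refcount entries that fit in one refcount block (one cluster). */
static inline int64_t entries_per_block(int cluster_bits)
{
    assert(cluster_bits > REFCOUNT_SHIFT);
    return INT64_C(1) << (cluster_bits - REFCOUNT_SHIFT);
}

/* First level: which refcount block covers this cluster.
 * The right shift is a division by entries_per_block(). */
static inline int64_t refcount_table_index(int64_t cluster_index,
                                           int cluster_bits)
{
    assert(cluster_index >= 0 && cluster_bits > REFCOUNT_SHIFT);
    return cluster_index >> (cluster_bits - REFCOUNT_SHIFT);
}

/* Second level: which entry inside that refcount block.
 * The mask is the remainder of the same division. */
static inline int64_t refcount_block_index(int64_t cluster_index,
                                           int cluster_bits)
{
    assert(cluster_index >= 0 && cluster_bits > REFCOUNT_SHIFT);
    return cluster_index &
        ((INT64_C(1) << (cluster_bits - REFCOUNT_SHIFT)) - 1);
}
```

Under these assumptions, cluster_bits = 16 (64 KiB clusters) gives 32768 entries per refcount block, so cluster 100000 falls in refcount block 3 at index 1696, the same result as 100000 / 32768 and 100000 % 32768.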