From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([208.118.235.92]:56669)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <zwu.kernel@gmail.com>) id 1STwnq-0000YZ-UL
	for qemu-devel@nongnu.org; Mon, 14 May 2012 11:03:19 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <zwu.kernel@gmail.com>) id 1STwni-0008VI-Iw
	for qemu-devel@nongnu.org; Mon, 14 May 2012 11:03:14 -0400
Received: from mail-pb0-f45.google.com ([209.85.160.45]:58545)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <zwu.kernel@gmail.com>) id 1STwni-0008UH-6W
	for qemu-devel@nongnu.org; Mon, 14 May 2012 11:03:06 -0400
Received: by pbbro12 with SMTP id ro12so8752206pbb.4
	for <qemu-devel@nongnu.org>; Mon, 14 May 2012 08:03:04 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To: <CAJSP0QVD5q5TM8dGc03ufvNjhZcWMxY859D9AnCSXLfcryHLvQ@mail.gmail.com>
References: <CAEH94LjEnaBbFfMLx-KDjqGu_a6HTo36+0p6seibcLQ-ykoXuw@mail.gmail.com>
	<CAJSP0QWM7Y2tyrFqb-v_J6SBfTYk09qXG5vQ2u4-kyJudf1_5w@mail.gmail.com>
	<CAEH94LjeG0NY_o-kAvs-O4ccMwOS98zjoMAVyOc_CoHeJb5c3A@mail.gmail.com>
	<CAJSP0QVD5q5TM8dGc03ufvNjhZcWMxY859D9AnCSXLfcryHLvQ@mail.gmail.com>
Date: Mon, 14 May 2012 23:03:03 +0800
Message-ID: <CAEH94LhVY5KeTY4yPySmTXF7FhdFu5-c7UL62q=m=+HEXWJfww@mail.gmail.com>
From: Zhi Yong Wu <zwu.kernel@gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Qemu-devel] Some questions about qcow2 code
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Stefan Hajnoczi <stefanha@gmail.com>
Cc: Kevin Wolf <kwolf@redhat.com>, QEMU Developers <qemu-devel@nongnu.org>, =?UTF-8?B?6Zmz6Z+L5Lu7?= <chenwj@iis.sinica.edu.tw>

On Fri, May 11, 2012 at 9:27 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Fri, May 4, 2012 at 3:44 AM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
>> On Sun, Apr 29, 2012 at 1:35 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>> On Sat, Apr 28, 2012 at 5:25 PM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
>>>
>>> Here are explanations for the quick ones that I know without looking at
>>> the code. I think they will help you understand some of the other ones
>>> too.
>>>
>>> Please send questions like this to qemu-devel@nongnu.org so the thread
>>> is archived on the mailing lists and others can learn from it.
>> I feel more comfortable and free discussing this in private. :)
>
> No one else in the community can learn from the discussion or help
> out. We need to be comfortable doing things in public.
OK, I am making this mail discussion public, and I hope it can help others
interested in qcow2.
>
>>>
>>>> 1.
>>>> static int get_refcount(BlockDriverState *bs, int64_t cluster_index)
>>>> {
>>>> ...
>>>>     refcount_table_index = cluster_index >> (s->cluster_bits - REFCOUNT_SHIFT);
>>>> ...
>>>>
>>>>     block_index = cluster_index &
>>>>         ((1 << (s->cluster_bits - REFCOUNT_SHIFT)) - 1);
>>>> How should I understand these two expressions?
>>>> ...
>>>> }
>>>
>>> See "Host cluster management" in the qcow2 spec. Refcounts are stored
>>> in a 2-level table. refcount_table_index is the index into the
>>> first-level refcount table, where we store the offsets of the refcount
>>> blocks. block_index is the index of the refcount block element that
>>> contains the actual reference count.
>> I know what refcount_table_index and block_index mean. What I actually
>> want to ask is what "cluster_index >> (s->cluster_bits -
>> REFCOUNT_SHIFT)" and "cluster_index & ((1 << (s->cluster_bits -
>> REFCOUNT_SHIFT)) - 1)" mean. Why do they yield refcount_table_index and
>> block_index? Why is REFCOUNT_SHIFT needed? Is it because a refcount block
>> entry is 16 bits?
>
> The cluster_index is the image file cluster number:
>
> | 0 | 1 | 2 | 3 | ...
> image file
>
> So these calculations are just dividing and finding the remainder for
> the 2-level refcount data structure. For example, if the refcount
> block holds 2 entries, then we have:
>
> | 0 | 1 | 2 | 3 | ...
> image file
>
> | A     | B     | ...
> image file -> refcount blocks
>
> A: | x | x | B: | x | x |
> refcount blocks
>
> So cluster_index = 1 means we need to look at refcount block A at index 1.
>
> In other words:
> refcount_table_index = cluster_index / 2
> block_index = cluster_index % 2
>
> The shifts and bitwise ands are just another way of expressing the
> division/modulus operation. And instead of using a constant like "2",
> qcow2 supports variable cluster sizes (cluster_bits) and has a
> REFCOUNT_SHIFT constant.
Great, I can now understand these two expressions and other similar ones,
thanks.
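
To make the division/modulus equivalence concrete, here is a minimal sketch
in C. It is not the QEMU code: the helper names and entries_per_block() are
made up for illustration, and REFCOUNT_SHIFT = 1 is an assumption that each
refcount entry is 16 bits (2 bytes) wide.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: each refcount entry is 2 bytes (16 bits), so the shift is 1. */
#define REFCOUNT_SHIFT 1

/* Number of refcount entries that fit in one refcount block (one cluster). */
static int64_t entries_per_block(int cluster_bits)
{
    return (int64_t)1 << (cluster_bits - REFCOUNT_SHIFT);
}

/* Which refcount block covers this cluster: cluster_index / entries. */
static int64_t refcount_table_index(int64_t cluster_index, int cluster_bits)
{
    assert(cluster_index >= 0 && cluster_bits > REFCOUNT_SHIFT);
    return cluster_index >> (cluster_bits - REFCOUNT_SHIFT);
}

/* Position inside that refcount block: cluster_index % entries. */
static int64_t block_index(int64_t cluster_index, int cluster_bits)
{
    assert(cluster_index >= 0 && cluster_bits > REFCOUNT_SHIFT);
    return cluster_index & (entries_per_block(cluster_bits) - 1);
}
```

With cluster_bits = 16 a refcount block holds 32768 entries, so cluster
40000 maps to refcount table index 1 and block index 7232, the same as
40000 / 32768 and 40000 % 32768. The shift and mask only work because the
number of entries per block is a power of two.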
>
>>>
>>>> 2.
>>>>
>>>> static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>>>>     int64_t offset, int64_t length, int addend)
>>>> {
>>>> ....
>>>>     if (addend < 0) {
>>>>         qcow2_cache_set_dependency(bs, s->refcount_block_cache,
>>>>             s->l2_table_cache);
>>>>     }
>>>> When addend = -1, why do we need to invoke qcow2_cache_set_dependency?
>>>
>>> The cache dependency is used to ensure that cached metadata is flushed
>>> in the correct order. qcow2 must be careful to flush data to disk so
>>> that writes are ordered - otherwise a power failure could corrupt the
>>> image file when unordered writes are partially applied to the image
>>> file.
>> great, thanks
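
The dependency can be pictured with a small toy model in C. This is only a
sketch under my own assumptions, not the qcow2 cache code: struct toy_cache,
toy_cache_set_dependency() and toy_cache_flush() are hypothetical names, and
each cache has at most one dependency. It shows that once the refcount block
cache depends on the L2 table cache, flushing the refcount blocks writes the
L2 tables first, which is the order wanted when a refcount is decremented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a metadata cache; all names here are hypothetical. */
struct toy_cache {
    const char *name;
    bool dirty;                 /* has entries not yet written to disk */
    struct toy_cache *depends;  /* must reach disk before this cache */
};

/* Record that 'dependency' has to be flushed before 'c'. */
static void toy_cache_set_dependency(struct toy_cache *c,
                                     struct toy_cache *dependency)
{
    assert(c != NULL && c != dependency);
    c->depends = dependency;
}

/*
 * Flush 'c', flushing its dependency first. The names of the caches that
 * are written out are appended to 'order' so the sequence can be inspected.
 */
static void toy_cache_flush(struct toy_cache *c, const char **order,
                            size_t *count, size_t capacity)
{
    struct toy_cache *dep = c->depends;

    /* Clear the link first so a dependency cycle cannot recurse forever. */
    c->depends = NULL;
    if (dep != NULL) {
        toy_cache_flush(dep, order, count, capacity);
    }
    if (c->dirty) {
        assert(*count < capacity);
        order[(*count)++] = c->name;
        c->dirty = false;
    }
}
```

If both caches are dirty and the refcount block cache is made dependent on
the L2 table cache, a flush of the refcount block cache records the L2 table
cache first and the refcount block cache second.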
>>>
>>> For example, we want to allocate an image file cluster ("host
>>> cluster") *before* we reference it from a table. Otherwise a power
>> But in this example, an allocated cluster will not be referenced by any
>> table entry if a power failure takes place.
>
> It's not possible to keep the image in a "clean" state at all times,
> but we need to keep it in a "consistent" state at all times.
> "Consistent" means:
> 1. All L1/L2 table entries should point to allocated clusters.
> 2. Data should be written to allocated clusters before they are
> referenced by L1/L2 entries.
>
> "Consistent" is weaker than "clean". A "consistent" image can have
> leaked clusters that are allocated (refcount > 0) but not referenced
> by any L1/L2 table.
>
> For multi-step metadata updates you will find there is no order that
> is "clean" at every step - it's simply not possible, no matter which
> order you choose. But there is an order that is "consistent" at every
> step and that's good enough (it means we will not lose data or corrupt
> the image file).
Great, thanks.
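
The difference between "consistent" and "clean" can be written down as a
check over a toy image. This is a minimal sketch under simplifying
assumptions: struct toy_image and the three functions are hypothetical, the
mapping is one flat table instead of real L1/L2 tables, and metadata
clusters and snapshots are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define UNMAPPED (-1)

/* Toy image: one refcount per host cluster and a flat mapping table. */
struct toy_image {
    const uint16_t *refcount;  /* refcount[i] for host cluster i */
    size_t nb_clusters;
    const int64_t *l2;         /* l2[j] = host cluster index or UNMAPPED */
    size_t nb_l2_entries;
};

/* Consistent: every mapping entry points to an allocated cluster. */
static bool is_consistent(const struct toy_image *img)
{
    assert(img != NULL);
    for (size_t j = 0; j < img->nb_l2_entries; j++) {
        int64_t c = img->l2[j];
        if (c == UNMAPPED) {
            continue;
        }
        if (c < 0 || (size_t)c >= img->nb_clusters || img->refcount[c] == 0) {
            return false;
        }
    }
    return true;
}

/* Leaked: allocated (refcount > 0) but not referenced by any entry. */
static size_t count_leaked(const struct toy_image *img)
{
    size_t leaked = 0;

    assert(img != NULL);
    for (size_t i = 0; i < img->nb_clusters; i++) {
        bool referenced = false;
        if (img->refcount[i] == 0) {
            continue;
        }
        for (size_t j = 0; j < img->nb_l2_entries; j++) {
            if (img->l2[j] == (int64_t)i) {
                referenced = true;
                break;
            }
        }
        if (!referenced) {
            leaked++;
        }
    }
    return leaked;
}

/* Clean: consistent and without leaked clusters. */
static bool is_clean(const struct toy_image *img)
{
    return is_consistent(img) && count_leaked(img) == 0;
}
```

An image with refcounts {1, 1, 0, 1} and mapping entries {0, 1} is
consistent but not clean, because cluster 3 is leaked. Change the mapping
to {0, 2} and the image is no longer consistent, since cluster 2 is
referenced while its refcount is 0.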
>
> BTW this is why file systems use journals. Journals solve the
> consistent update problem in a different way - after a crash the
> journal is replayed to finish all in-flight updates, resulting in a
> consistent file system. It's a different technique that none of the
> popular image formats use.
>
>>> failure could result in the table pointing to an unallocated cluster
>>> (the refcount update did not make it to disk but the L2 table update
>> Why do you say that the refcount update did not make it to disk? The
>> refcount block entry counts the references to a cluster.
>
> Because disk writes may sit in a volatile disk write cache or host OS
> page cache. The order in which cached data is really written to disk
> is not defined. That means we don't know if the refcount update makes
> it to disk before the power fails.
>
>>>> 12.
>>>>
>>>> static coroutine_fn int qcow2_co_flush_to_os(BlockDriverState *bs)
>>>>
>>>> I thought that this function should only flush the data to the OS cache,
>>>> not to disk, right? But I checked it and found that it flushes the data
>>>> to the disk.
>>>
>>> Metadata updates must be handled very carefully - they need to be
>>> ordered so that a power failure never leaves the image file in an
>>> inconsistent (corrupt) state. Therefore we *do* need to flush to disk
>> Then should we adjust this function name? Otherwise it will confuse us.
>
> Maybe. The BlockDriver callback name makes sense. It just happens
> that the qcow2 implementation really does flush so that metadata
> updates are ordered.
Kevin has explained this in another mail thread.
>
>>> when applying several different types of metadata updates in order
>>> (e.g. L2 table, refcount table).
>> Should we usually update the L2 table *before* the refcount table, or
>> update the refcount table first and the L2 table afterwards?
>
> When allocating clusters we first need to update the refcount table,
> write the data into the cluster, and then perform the L2 update.
When freeing clusters, the order is reversed.
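
The two orders can be checked with a toy crash simulation in C. This is a
sketch under stated assumptions, not qcow2 code: struct toy_state, enum
toy_step and the functions are hypothetical, only a single cluster is
modelled, and each step is assumed to be on disk before the next one starts
(that is, there is a flush between steps). A crash can then happen after any
prefix of the steps, and every prefix has to leave a consistent state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy on-disk state for one data cluster; all names are hypothetical. */
struct toy_state {
    int refcount;        /* refcount of the cluster */
    bool data_written;   /* guest data is stored in the cluster */
    bool l2_references;  /* an L2 entry points at the cluster */
};

enum toy_step {
    STEP_REFCOUNT_INC,
    STEP_WRITE_DATA,
    STEP_L2_SET,
    STEP_L2_CLEAR,
    STEP_REFCOUNT_DEC
};

static void apply_step(struct toy_state *s, enum toy_step step)
{
    switch (step) {
    case STEP_REFCOUNT_INC: s->refcount++;            break;
    case STEP_WRITE_DATA:   s->data_written = true;   break;
    case STEP_L2_SET:       s->l2_references = true;  break;
    case STEP_L2_CLEAR:     s->l2_references = false; break;
    case STEP_REFCOUNT_DEC: s->refcount--;            break;
    }
}

/* An L2 reference is only valid for an allocated cluster that holds data. */
static bool state_consistent(const struct toy_state *s)
{
    return !s->l2_references || (s->refcount > 0 && s->data_written);
}

/* True if a crash after any prefix of 'steps' leaves a consistent state. */
static bool order_is_crash_safe(struct toy_state s,
                                const enum toy_step *steps, size_t n)
{
    assert(steps != NULL || n == 0);
    if (!state_consistent(&s)) {
        return false;
    }
    for (size_t i = 0; i < n; i++) {
        apply_step(&s, steps[i]);
        if (!state_consistent(&s)) {
            return false;
        }
    }
    return true;
}
```

Starting from an unallocated cluster, the order refcount increment, data
write, L2 set passes the check, while setting the L2 entry first fails
after the first step. Starting from a cluster in use, L2 clear followed by
refcount decrement passes, and the opposite order fails.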
>
>>>> 14.
>>>>
>>>> qcow2_snapshot_create() {
>>>> ....
>>>>
>>>>     ret = bdrv_flush(bs);
>>>> Why does it need to flush the data here?
>>>
>>> To understand the meaning of a flush, look at the operation that was
>>> performed before it.
>>>
>>> Here we just incremented the refcounts. We need to make sure these
>>> metadata updates are on disk before we can add the snapshot entry into
>>> the qcow2 image file - otherwise a power failure could result in a
>>> snapshot entry without allocated clusters.
>> But it may also result in clusters that are allocated but have no
>> corresponding snapshot entry.
>
> Yes. The image will still be "consistent", just not "clean". We
> leaked clusters but did not lose data or corrupt the image file.
Great, thanks.
>
> Stefan


-- 
Regards,

Zhi Yong Wu