Date: Fri, 6 Sep 2013 17:20:53 +0800
From: Fam Zheng
Reply-To: famz@redhat.com
To: Kevin Wolf
Cc: benoit.canet@irqsave.net, jcody@redhat.com, qemu-devel@nongnu.org,
 Stefan Hajnoczi, mreitz@redhat.com
Subject: Re: [Qemu-devel] [RFC] qcow2 journalling draft
Message-ID: <20130906092053.GA22552@T430s.nay.redhat.com>
References: <1378215952-7151-1-git-send-email-kwolf@redhat.com>
 <20130904080352.GA8031@stefanha-thinkpad.redhat.com>
 <20130904093950.GB3562@dhcp-200-207.str.redhat.com>
In-Reply-To: <20130904093950.GB3562@dhcp-200-207.str.redhat.com>

On Wed, 09/04 11:39, Kevin Wolf wrote:
> First of all, excuse any inconsistencies in the following mail. I wrote
> it from top to bottom, and there was some thought process involved in
> almost every paragraph...
>
> Am 04.09.2013 um 10:03 hat Stefan Hajnoczi geschrieben:
> > On Tue, Sep 03, 2013 at 03:45:52PM +0200, Kevin Wolf wrote:
> > > @@ -103,7 +107,11 @@ in the description of a field.
> > >                     write to an image with unknown auto-clear features if it
> > >                     clears the respective bits from this field first.
> > >
> > > -                   Bits 0-63:  Reserved (set to 0)
> > > +                   Bit 0:      Journal valid bit. This bit indicates that the
> > > +                               image contains a valid main journal starting at
> > > +                               journal_offset.
> >
> > Whether the journal is used can be determined from the journal_offset
> > value (header length must be large enough and journal offset must be
> > valid).
> >
> > Why do we need this autoclear bit?
>
> Hm, I introduced this one first and the journal dirty incompatible bit
> later, perhaps it's unnecessary now. Let's check...
>
> The obvious thing we need to protect against is applying stale journal
> data to an image that has been changed by an older version. As long as
> the journal is clean, this can't happen, and the journal dirty bit will
> ensure that the old version can only open the image if it is clean.
>
> However, what if we run 'qemu-img check -r leaks' with an old qemu-img
> version? It will reclaim the clusters used by the journal, and if we
> continue using the journal we'll corrupt whatever new data is there
> now.

Why can an old qemu-img version open the image with a dirty journal in the
first place? It's an incompatible bit.

> Can we protect against this without using an autoclear bit?
>
> > > +Journals are used to allow safe updates of metadata without impacting
> > > +performance by requiring flushes to order updates to different parts of the
> > > +metadata.
> >
> > This sentence is hard to parse. Maybe something shorter like this:
> >
> >   Journals allow safe metadata updates without the need for carefully
> >   ordering and flushing between update steps.
>
> Okay, I'll update the text with your proposal.
>
> > > +They consist of transactions, which in turn contain operations that
> > > +are effectively executed atomically. A qcow2 image can have a main image
> > > +journal that deals with cluster management operations, and additional specific
> > > +journals can be used by other features like data deduplication.
> >
> > I'm not sure if multiple journals will work in practice. Doesn't this
> > re-introduce the need to order update steps and flush between them?
>
> This is a question for Benoît, who made this requirement. I asked him
> the same a while ago and apparently his explanation made some sense to
> me, or I would have remembered that I don't want it. ;-)
>
> It might have something to do with the fact that deduplication uses the
> journal more as a kind of cache for hash values that can be dropped and
> rebuilt after a crash.
>
> > > +A journal is organised in journal blocks, all of which have a reference count
> > > +of exactly 1. It starts with a block containing the following journal header:
> > > +
> > > +    Byte  0 -  7:   Magic ("qjournal" ASCII string)
> > > +
> > > +          8 - 11:   Journal size in bytes, including the header
> > > +
> > > +         12 - 15:   Journal block size order (block size in bytes = 1 << order)
> > > +                    The block size must be at least 512 bytes and must not
> > > +                    exceed the cluster size.
> > > +
> > > +         16 - 19:   Journal block index of the descriptor for the last
> > > +                    transaction that has been synced, starting with 1 for the
> > > +                    journal block after the header. 0 is used for empty
> > > +                    journals.
> > > +
> > > +         20 - 23:   Sequence number of the last transaction that has been
> > > +                    synced. 0 is recommended as the initial value.
> > > +
> > > +         24 - 27:   Sequence number of the last transaction that has been
> > > +                    committed. When replaying a journal, all transactions
> > > +                    after the last synced one up to the last committed one
> > > +                    must be synced. Note that this may include a wraparound
> > > +                    of sequence numbers.
> > > +
> > > +         28 - 31:   Checksum (one's complement of the sum of all bytes in the
> > > +                    header journal block except those of the checksum field)
> > > +
> > > +         32 - 511:  Reserved (set to 0)
> >
> > I'm not sure if these fields are necessary. They require updates (and
> > maybe flush) after every commit and sync.
> >
> > The fewer metadata updates, the better, not just for performance but
> > also to reduce the risk of data loss. If any metadata required to
> > access the journal is corrupted, the image will be unavailable.
> >
> > It should be possible to determine this information by scanning the
> > journal transactions.
>
> This is rather handwavy. Can you elaborate how this would work in detail?
>
>
> For example, let's assume we get to read this journal (a journal can be
> rather large, too, so I'm not sure if we want to read it in completely):
>
> - Descriptor, seq 42, 2 data blocks
> - Data block
> - Data block
> - Data block starting with "qjbk"
> - Data block
> - Descriptor, seq 7, 0 data blocks
> - Descriptor, seq 8, 1 data block
> - Data block
>
> Which of these have already been synced? Which have been committed?
>
>
> I guess we could introduce an is_committed flag in the descriptor, but
> wouldn't correct operation look like this then:
>
> 1. Write out the descriptor with the commit flag clear, plus any data blocks
> 2. Flush
> 3. Rewrite the descriptor with the commit flag set
>
> This ensures that the commit flag is only set if all the required data
> is indeed stable on disk. What has changed compared to this proposal is
> just the offset at which you write in step 3 (header vs. descriptor).
>
> For sync I suppose the same option exists.
>
>
> One unrelated thing we need to take care of is this 'data block starting
> with "qjbk"' thing I mentioned in the example. I can see two solutions:
> The first is never creating such a journal, by always blanking the next
> block after a transaction that we write, so that the descriptor chain is
> terminated. The second one is never creating such a journal, by zeroing
> out an initial "qjbk" in data blocks and setting a flag in the descriptor
> instead.
>
> I was thinking of the first, but I guess it would be worth at least
> mentioning the problem in the spec.
>
> > > +A wraparound may not occur in the middle of a single transaction, but only
> > > +between two transactions. For the necessary padding an empty descriptor with
> > > +any number of data blocks can be used as the last entry of the ring.
> >
> > Why have this limitation?
>
> I thought it would make implementations easier if they didn't have to
> read/write data blocks in two parts in some cases.
>
> > > +All descriptors start with a common part:
> > > +
> > > +    Byte  0 -  1:   Descriptor type
> > > +                        0 - No-op descriptor
> > > +                        1 - Write data block
> > > +                        2 - Copy data
> > > +                        3 - Revoke
> > > +                        4 - Deduplication hash insertion
> > > +                        5 - Deduplication hash deletion
> > > +
> > > +          2 -  3:   Size of the descriptor in bytes
> >
> > Data blocks are not included in the descriptor size? I just want to
> > make sure that we won't be limited to 64 KB for the actual data.
>
> Right, this is only for the descriptors, without data. It implies a
> maximum journal block size of 64k, which I think is okay.
>
> > > +
> > > +          4 -  n:   Type-specific data
> > > +
> > > +The following section specifies the purpose (i.e. the action that is to be
> > > +performed when syncing) and type-specific data layout of each descriptor type:
> > > +
> > > + * No-op descriptor: No action is to be performed when syncing this descriptor
> > > +
> > > +          4 -  n:   Ignored
> > > +
> > > + * Write data block: Write literal data associated with this transaction from
> > > +   the journal to a given offset.
> > > +
> > > +          4 -  7:   Length of the data to write in bytes
> > > +
> > > +          8 - 15:   Offset in the image file to write the data to
> > > +
> > > +         16 - 19:   Index of the journal block at which the data to write
> > > +                    starts. The data must be stored sequentially and be fully
> > > +                    contained in the data blocks associated with the
> > > +                    transaction.
> > > +
> > > +   The type-specific data can be repeated, specifying multiple chunks of data
> > > +   to be written in one operation. This means the size of the descriptor must
> > > +   be 4 + 16 * n.
> >
> > Why is this necessary? Multiple data descriptors could be used; is it
> > worth the additional logic and testing?
>
> What does a typical journal transaction look like?
>
> My assumption is that initially it has lots of L2 table and refcount
> block updates and little else. All of these are data writes. Once we
> implement Delayed COW using the journal, we get some data copy
> operations as well. Assuming a stupid guest, we get two copies and two
> writes for each cluster allocation.
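
As a quick illustration of the descriptor layouts quoted above (together
with the copy descriptor quoted further down), they could map to packed
structs roughly like the following sketch. The struct and field names are
made up here and are not part of the draft; only the offsets and sizes
come from the quoted text:

    /* Illustrative only: offsets/sizes follow the draft text, names do not. */
    #include <stdint.h>

    struct qj_desc_header {        /* common part, 4 bytes */
        uint16_t type;             /* 0=no-op, 1=write, 2=copy, 3=revoke, ... */
        uint16_t size;             /* descriptor size; data blocks not included */
    } __attribute__((packed));

    struct qj_write_entry {        /* one chunk of a "write data block" descriptor */
        uint32_t length;           /* bytes to write */
        uint64_t file_offset;      /* offset in the image file to write to */
        uint32_t journal_block;    /* journal block index where the data starts */
    } __attribute__((packed));     /* 16 bytes, i.e. size = 4 + 16 * n */

    struct qj_copy_entry {         /* one chunk of a "copy data" descriptor */
        uint32_t length;           /* bytes to copy */
        uint64_t target_offset;    /* target offset in the image file */
        uint64_t source_offset;    /* source offset in the image file */
    } __attribute__((packed));     /* 20 bytes, i.e. size = 4 + 20 * n */

    _Static_assert(sizeof(struct qj_desc_header) == 4, "common part");
    _Static_assert(sizeof(struct qj_write_entry) == 16, "write entry");
    _Static_assert(sizeof(struct qj_copy_entry) == 20, "copy entry");

The 16 and 20 byte entries plus the 4 byte common part are where the
4 + 16 * n and 4 + 20 * n formulas (and the calculation below) come from.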
>
> The space required for these is 2*(4 + 16n) + 2*(4 + 20n) = 16 + 72n. In
> one 4k descriptor block you can fit 56 cluster allocations.
>
> If you used separate descriptors, you get 2*20n + 2*24n = 88n. In one 4k
> descriptor block you could fit 46 cluster allocations.
>
> Considering that in both cases the space used by descriptors is dwarfed
> by the updated tables to be written, I guess it's not worth it for the
> main journal. Things may look different for the deduplication journal,
> where no data blocks are used and the space taken is determined only by
> the descriptors.
>
> And in fact... All of the above sounded great, but is in fact wrong.
> Typically you'd get _one_ L2 update for all (sequential) allocations,
> so the COW entries without data should dominate in the main journal as
> well.
>
> > > +
> > > + * Copy data: Copy data from one offset in the image to another one. This can
> > > +   be used for journalling copy-on-write operations.
> >
> > This reminds me to ask what the plan is for journal scope: metadata only
> > or also data? For some operations like dedupe it seems that full data
> > journalling may be necessary. But for an image without dedupe it would
> > not be necessary to journal the rewrites to an already allocated
> > cluster, for example.
>
> Generally only metadata.
>
> The COW case is a bit tricky. Here we really store only metadata in the
> journal as well ("this COW is still pending"), but it directly affects
> data and the qemu read/write paths for data must take such pending COWs
> into consideration. This will require some ordering between the journal
> and data (e.g. only free the COWed cluster after the journal is
> committed), but I think that's okay.
>
> > > +          4 -  7:   Length of the data to write in bytes
> > > +
> > > +          8 - 15:   Target offset in the image file
> > > +
> > > +         16 - 23:   Source offset in the image file
> >
> > Source and target cannot overlap?
>
> I don't think there's a use case for it, so let's forbid it.
>
> > > +
> > > +   The type-specific data can be repeated, specifying multiple chunks of data
> > > +   to be copied in one operation. This means the size of the descriptor must
> > > +   be 4 + 20 * n.
> > > +
> > > + * Revoke: Marks operations on a given range in the imag file invalid for all
> >
> > s/imag/image/
> >
> > > +   earlier transactions (this does not include the transaction containing the
> > > +   revoke). They must not be executed on a sync operation (e.g. because the
> > > +   range in question has been freed and may have been reused for other, not
> > > +   journalled data structures that must not be overwritten with stale data).
> > > +   Note that this may mean that operations are to be executed partially.
> >
> > Example scenario?
>
> Once I had one... Well, let's try this:
>
> 1. Cluster allocation that allocates a new L2 table, i.e. the L1 update
>    gets journalled
>
> 2. Another cluster allocation, this time the L1 has to grow. The write
>    to the new L1 position is journalled, the old table is freed.
>
> 3. The next allocation reuses the clusters where the old L1 table was
>    stored
>
> 4. Journal sync. The L1 update from step 1 is written out to the
>    clusters that are now used for data. Oops.
>
> Kevin
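
One more illustration, for the checksum field of the journal header quoted
earlier ("one's complement of the sum of all bytes in the header journal
block except those of the checksum field"). A minimal sketch of that rule,
with invented function and constant names, could look like this:

    #include <stdint.h>
    #include <stddef.h>

    #define QJ_CSUM_OFFSET 28   /* bytes 28 - 31 hold the checksum itself */

    static uint32_t qj_header_checksum(const uint8_t *block, size_t block_size)
    {
        uint32_t sum = 0;

        for (size_t i = 0; i < block_size; i++) {
            if (i >= QJ_CSUM_OFFSET && i < QJ_CSUM_OFFSET + 4) {
                continue;       /* skip the checksum field itself */
            }
            sum += block[i];
        }
        return ~sum;            /* one's complement of the byte sum */
    }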