From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:40613)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <stefanha@redhat.com>) id 1UuHqX-0002MX-Io
	for qemu-devel@nongnu.org; Wed, 03 Jul 2013 03:51:30 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <stefanha@redhat.com>) id 1UuHqT-00045Q-0X
	for qemu-devel@nongnu.org; Wed, 03 Jul 2013 03:51:25 -0400
Received: from mx1.redhat.com ([209.132.183.28]:51577)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <stefanha@redhat.com>) id 1UuHqS-000455-P1
	for qemu-devel@nongnu.org; Wed, 03 Jul 2013 03:51:20 -0400
Date: Wed, 3 Jul 2013 09:51:14 +0200
From: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20130703075114.GB16585@stefanha-thinkpad.muc.redhat.com>
References: <1371738392-9594-1-git-send-email-benoit@irqsave.net>
	<1371738392-9594-2-git-send-email-benoit@irqsave.net>
	<20130702144224.GF9870@stefanha-thinkpad.redhat.com>
	<20130702145446.GG3031@dhcp-200-207.str.redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
In-Reply-To: <20130702145446.GG3031@dhcp-200-207.str.redhat.com>
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Qemu-devel] [RFC V8 01/24] qcow2: Add journal specification.
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Kevin Wolf <kwolf@redhat.com>
Cc: Benoît Canet <benoit@irqsave.net>, qemu-devel@nongnu.org

On Tue, Jul 02, 2013 at 04:54:46PM +0200, Kevin Wolf wrote:
> Am 02.07.2013 um 16:42 hat Stefan Hajnoczi geschrieben:
> > On Thu, Jun 20, 2013 at 04:26:09PM +0200, Benoît Canet wrote:
> > > ---
> > >  docs/specs/qcow2.txt |   42 ++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 42 insertions(+)
> > >
> > > diff --git a/docs/specs/qcow2.txt b/docs/specs/qcow2.txt
> > > index 36a559d..a4ffc85 100644
> > > --- a/docs/specs/qcow2.txt
> > > +++ b/docs/specs/qcow2.txt
> > > @@ -350,3 +350,45 @@ Snapshot table entry:
> > >          variable:   Unique ID string for the snapshot (not null terminated)
> > >
> > >          variable:   Name of the snapshot (not null terminated)
> > > +
> > > +== Journal ==
> > > +
> > > +QCOW2 can use one or more instance of a metadata journal.
> >
> > s/instance/instances/
> >
> > Is there a reason to use multiple journals rather than a single journal
> > for all entry types?  The single journal area avoids seeks.
> >
> > > +
> > > +A journal is a sequential log of journal entries appended on a previously
> > > +allocated and reseted area.
> >
> > I think you say "previously reset area" instead of "reseted".  Another
> > option is "initialized area".
> >
> > > +A journal is designed like a linked list with each entry pointing to the next
> > > +so it's easy to iterate over entries.
> > > +
> > > +A journal uses the following constants to denote the type of each entry
> > > +
> > > +TYPE_NONE = 0xFF      default value of any bytes in a reseted journal
> > > +TYPE_END  = 1         the entry ends a journal cluster and point to the next
> > > +                      cluster
> > > +TYPE_HASH = 2         the entry contains a deduplication hash
> > > +
> > > +QCOW2 journal entry:
> > > +
> > > +    Byte 0         :    Size of the entry: size = 2 + n with size <= 254
> >
> > This is not clear.  I'm wondering if the +2 is included in the byte
> > value or not.  I'm also wondering what a byte value of zero means and
> > what a byte value of 255 means.
> >
> > Please include an example to illustrate how this field works.
> >
> > > +
> > > +         1         :    Type of the entry
> > > +
> > > +         2 - size  :    The optional n bytes structure carried by entry
> > > +
> > > +A journal is divided into clusters and no journal entry can be spilled on two
> > > +clusters. This avoid having to read more than one cluster to get a single entry.
> > > +
> > > +For this purpose an entry with the end type is added at the end of a journal
> > > +cluster before starting to write in the next cluster.
> > > +The size of such an entry is set so the entry points to the next cluster.
> > > +
> > > +As any journal cluster must be ended with an end entry the size of regular
> > > +journal entries is limited to 254 bytes in order to always left room for an end
> > > +entry which mimimal size is two bytes.
> > > +
> > > +The only cases where size > 254 are none entries where size = 255.
> > > +
> > > +The replay of a journal stop when the first end none entry is reached.
> >
> > s/stop/stops/
> >
> > > +The journal cluster size is 4096 bytes.
> >
> > Questions about this layout:
> >
> > 1. Journal entries have no integrity mechanism, which is especially
> >    important if they span physical sectors where cheap disks may perform
> >    a partial write.  This would leave a corrupt journal.  If the last
> >    bytes are a checksum then you can get some confidence that the entry
> >    was fully written and is valid.
> >
> >    Did I miss something?
>
> Adding a checksum sounds like a good idea.
>
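
To make the checksum idea concrete, here is a minimal Python sketch (not
part of the patch).  It assumes a 4-byte big-endian CRC32 trailer that is
counted in the size byte.  The trailer layout and the names seal_entry()
and entry_is_valid() are made up for illustration; the spec would still
have to choose the actual algorithm and width.

```python
import struct
import zlib

CRC_LEN = 4  # ASSUMPTION: 4-byte CRC32 trailer, not defined by the patch


def seal_entry(entry_type, payload):
    """Return size byte, type byte, payload, then a CRC32 over all three."""
    size = 2 + len(payload) + CRC_LEN  # ASSUMPTION: size also counts the CRC
    if size > 254:
        raise ValueError("regular entries are limited to 254 bytes")
    body = bytes([size, entry_type]) + payload
    return body + struct.pack(">I", zlib.crc32(body) & 0xFFFFFFFF)


def entry_is_valid(raw):
    """True if raw is one whole entry whose trailing CRC32 matches."""
    if len(raw) < 2 + CRC_LEN or raw[0] != len(raw):
        return False
    (stored,) = struct.unpack(">I", raw[-CRC_LEN:])
    return stored == (zlib.crc32(raw[:-CRC_LEN]) & 0xFFFFFFFF)
```

With something like this, replay could stop at the first entry that fails
to verify, which covers an entry whose tail never reached the disk.
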
> > 2. Byte-granularity means that read-modify-write is necessary to append
> >    entries to the journal.  Therefore a failure could destroy previously
> >    committed entries.
> >
> >    Any ideas how existing journals handle this?
>
> You commit only whole blocks. So in this case we can consider a block
> only committed as soon as a TYPE_END entry has been written (and after
> that we won't touch it any more until the journalled changes have been
> flushed to disk).
>
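
A rough Python sketch of that rule, to check my understanding.  It assumes
the size byte counts the 2-byte header (only one possible reading of the
field, which is still unclear) and it does not interpret the size of the
TYPE_END entry at all.  make_entry() and committed_entries() are
hypothetical names, not anything from the patch.

```python
CLUSTER_SIZE = 4096  # fixed journal cluster size in the quoted spec
TYPE_NONE = 0xFF
TYPE_END = 1
TYPE_HASH = 2


def make_entry(entry_type, payload=b""):
    """Build one entry. ASSUMPTION: the size byte includes the 2 header bytes."""
    size = 2 + len(payload)
    if size > 254:
        raise ValueError("regular entries are limited to 254 bytes")
    return bytes([size, entry_type]) + payload


def committed_entries(cluster):
    """Return the entries of a block, or None if no TYPE_END was reached."""
    entries = []
    off = 0
    while off + 2 <= len(cluster):
        size, entry_type = cluster[off], cluster[off + 1]
        if entry_type == TYPE_NONE:
            return None  # ran into the reset (0xFF) area: not committed
        if size < 2 or off + size > len(cluster):
            raise ValueError("corrupt entry at offset %d" % off)
        if entry_type == TYPE_END:
            return entries  # block was closed, so it counts as committed
        entries.append((entry_type, bytes(cluster[off + 2:off + size])))
        off += size
    return None
```

Replay would then apply a block's entries only when committed_entries()
returns a list, and ignore a block that was still being appended to.
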
> There's one "interesting" case: cache=writethrough. I'm not entirely
> sure yet what to do with it, but it's slow anyway, so using one block
> per entry and therefore flushing the journal very often might actually
> be not totally unreasonable.
>
> Another thing I'm not sure about is whether a fixed 4k block is good or
> if we should leave it configurable. I don't think making it an option
> would hurt (not necessarily modifyable with qemu-img, but as a field
> in the file format).

Making block size configurable seems like a good idea so we can adapt to
disk performance and data integrity characteristics.

Stefan