From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:40613)
    by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
    id 1UuHqX-0002MX-Io for qemu-devel@nongnu.org;
    Wed, 03 Jul 2013 03:51:30 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
    (envelope-from ) id 1UuHqT-00045Q-0X for qemu-devel@nongnu.org;
    Wed, 03 Jul 2013 03:51:25 -0400
Received: from mx1.redhat.com ([209.132.183.28]:51577)
    by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from )
    id 1UuHqS-000455-P1 for qemu-devel@nongnu.org;
    Wed, 03 Jul 2013 03:51:20 -0400
Date: Wed, 3 Jul 2013 09:51:14 +0200
From: Stefan Hajnoczi
Message-ID: <20130703075114.GB16585@stefanha-thinkpad.muc.redhat.com>
References: <1371738392-9594-1-git-send-email-benoit@irqsave.net>
    <1371738392-9594-2-git-send-email-benoit@irqsave.net>
    <20130702144224.GF9870@stefanha-thinkpad.redhat.com>
    <20130702145446.GG3031@dhcp-200-207.str.redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
In-Reply-To: <20130702145446.GG3031@dhcp-200-207.str.redhat.com>
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Qemu-devel] [RFC V8 01/24] qcow2: Add journal specification.
To: Kevin Wolf
Cc: Benoît Canet, qemu-devel@nongnu.org

On Tue, Jul 02, 2013 at 04:54:46PM +0200, Kevin Wolf wrote:
> Am 02.07.2013 um 16:42 hat Stefan Hajnoczi geschrieben:
> > On Thu, Jun 20, 2013 at 04:26:09PM +0200, Benoît Canet wrote:
> > > ---
> > >  docs/specs/qcow2.txt | 42 ++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 42 insertions(+)
> > >
> > > diff --git a/docs/specs/qcow2.txt b/docs/specs/qcow2.txt
> > > index 36a559d..a4ffc85 100644
> > > --- a/docs/specs/qcow2.txt
> > > +++ b/docs/specs/qcow2.txt
> > > @@ -350,3 +350,45 @@ Snapshot table entry:
> > >          variable:   Unique ID string for the snapshot (not null terminated)
> > >
> > >          variable:   Name of the snapshot (not null terminated)
> > > +
> > > +== Journal ==
> > > +
> > > +QCOW2 can use one or more instance of a metadata journal.
> >
> > s/instance/instances/
> >
> > Is there a reason to use multiple journals rather than a single journal
> > for all entry types?  The single journal area avoids seeks.
> >
> > > +
> > > +A journal is a sequential log of journal entries appended on a previously
> > > +allocated and reseted area.
> >
> > I think you say "previously reset area" instead of "reseted".  Another
> > option is "initialized area".
> >
> > > +A journal is designed like a linked list with each entry pointing to the next
> > > +so it's easy to iterate over entries.
> > > +
> > > +A journal uses the following constants to denote the type of each entry
> > > +
> > > +TYPE_NONE = 0xFF      default value of any bytes in a reseted journal
> > > +TYPE_END  = 1         the entry ends a journal cluster and point to the next
> > > +                      cluster
> > > +TYPE_HASH = 2         the entry contains a deduplication hash
> > > +
> > > +QCOW2 journal entry:
> > > +
> > > +    Byte 0         :    Size of the entry: size = 2 + n with size <= 254
> >
> > This is not clear.  I'm wondering if the +2 is included in the byte
> > value or not.  I'm also wondering what a byte value of zero means and
> > what a byte value of 255 means.
> >
> > Please include an example to illustrate how this field works.
> >
> > > +
> > > +         1         :    Type of the entry
> > > +
> > > +    2 - size       :    The optional n bytes structure carried by entry
> > > +
> > > +A journal is divided into clusters and no journal entry can be spilled on two
> > > +clusters. This avoid having to read more than one cluster to get a single entry.
> > > +
> > > +For this purpose an entry with the end type is added at the end of a journal
> > > +cluster before starting to write in the next cluster.
> > > +The size of such an entry is set so the entry points to the next cluster.
> > > +
> > > +As any journal cluster must be ended with an end entry the size of regular
> > > +journal entries is limited to 254 bytes in order to always left room for an end
> > > +entry which mimimal size is two bytes.
> > > +
> > > +The only cases where size > 254 are none entries where size = 255.
> > > +
> > > +The replay of a journal stop when the first end none entry is reached.
> >
> > s/stop/stops/
> >
> > > +The journal cluster size is 4096 bytes.
> >
> > Questions about this layout:
> >
> > 1. Journal entries have no integrity mechanism, which is especially
> >    important if they span physical sectors where cheap disks may perform
> >    a partial write.  This would leave a corrupt journal.  If the last
> >    bytes are a checksum then you can get some confidence that the entry
> >    was fully written and is valid.
> >
> >    Did I miss something?
>
> Adding a checksum sounds like a good idea.
>
> > 2. Byte-granularity means that read-modify-write is necessary to append
> >    entries to the journal.  Therefore a failure could destroy previously
> >    committed entries.
> >
> >    Any ideas how existing journals handle this?
>
> You commit only whole blocks. So in this case we can consider a block
> only committed as soon as a TYPE_END entry has been written (and after
> that we won't touch it any more until the journalled changes have been
> flushed to disk).
>
> There's one "interesting" case: cache=writethrough. I'm not entirely
> sure yet what to do with it, but it's slow anyway, so using one block
> per entry and therefore flushing the journal very often might actually
> be not totally unreasonable.
>
> Another thing I'm not sure about is whether a fixed 4k block is good or
> if we should leave it configurable. I don't think making it an option
> would hurt (not necessarily modifyable with qemu-img, but as a field
> in the file format).

Making block size configurable seems like a good idea so we can adapt to
disk performance and data integrity characteristics.

Stefan
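
Illustration (not part of the patch or of the proposed spec): the review
above asks for an example of how the size byte works. The sketch below
shows one possible reading, under the loudly stated assumption that the
size byte stores 2 + n, i.e. that it counts the two header bytes as well
as the n payload bytes. Under that reading a byte value of 0 or 1 can
never be valid, and 255 only appears in a reset (all 0xFF) area. The
helper names make_entry and walk_cluster are hypothetical, and the
32-byte payloads are arbitrary placeholders, not a real hash format.

```python
TYPE_NONE = 0xFF   # value of every byte in a reset journal area
TYPE_END = 1       # entry that closes a journal cluster
TYPE_HASH = 2      # entry that carries a deduplication hash
CLUSTER_SIZE = 4096


def make_entry(entry_type, payload=b""):
    """Build one entry: size byte, type byte, then n payload bytes.

    Assumption: the size byte holds 2 + n (header included).
    """
    size = 2 + len(payload)
    if size > 254:
        raise ValueError("regular entries are limited to 254 bytes")
    return bytes([size, entry_type]) + payload


def walk_cluster(cluster):
    """Yield (offset, type, payload) for each entry in one cluster."""
    offset = 0
    while offset + 2 <= len(cluster):
        size = cluster[offset]
        entry_type = cluster[offset + 1]
        if size == 0xFF:
            return  # reset area reached: nothing was written past here
        if size < 2 or offset + size > len(cluster):
            raise ValueError(f"corrupt entry at offset {offset}")
        yield offset, entry_type, cluster[offset + 2:offset + size]
        if entry_type == TYPE_END:
            return  # this cluster is closed
        offset += size  # the size byte acts as the link to the next entry


# A reset cluster holding two hash entries with placeholder payloads.
cluster = bytearray([TYPE_NONE] * CLUSTER_SIZE)
data = make_entry(TYPE_HASH, b"\xaa" * 32) + make_entry(TYPE_HASH, b"\xbb" * 32)
cluster[:len(data)] = data
entries = list(walk_cluster(bytes(cluster)))
```

With this reading each 32-byte payload produces a size byte of 34, so
the second entry starts at offset 34 and the walk stops at offset 68,
where the first 0xFF byte is found. If the size byte were meant to hold
only n, the offsets would differ, which is exactly the ambiguity raised
in the review.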