Date: Fri, 6 Sep 2013 17:20:53 +0800
From: Fam Zheng
Reply-To: famz@redhat.com
To: Kevin Wolf
Cc: benoit.canet@irqsave.net, jcody@redhat.com, qemu-devel@nongnu.org,
 Stefan Hajnoczi, mreitz@redhat.com
Subject: Re: [Qemu-devel] [RFC] qcow2 journalling draft
Message-ID: <20130906092053.GA22552@T430s.nay.redhat.com>
References: <1378215952-7151-1-git-send-email-kwolf@redhat.com>
 <20130904080352.GA8031@stefanha-thinkpad.redhat.com>
 <20130904093950.GB3562@dhcp-200-207.str.redhat.com>
In-Reply-To: <20130904093950.GB3562@dhcp-200-207.str.redhat.com>

On Wed, 09/04 11:39, Kevin Wolf wrote:
> First of all, excuse any inconsistencies in the following mail. I wrote
> it from top to bottom, and there was some thought process involved in
> almost every paragraph...
>
> Am 04.09.2013 um 10:03 hat Stefan Hajnoczi geschrieben:
> > On Tue, Sep 03, 2013 at 03:45:52PM +0200, Kevin Wolf wrote:
> > > @@ -103,7 +107,11 @@ in the description of a field.
> > >                     write to an image with unknown auto-clear features if it
> > >                     clears the respective bits from this field first.
> > >
> > > -                   Bits 0-63:  Reserved (set to 0)
> > > +                   Bit 0:      Journal valid bit. This bit indicates that the
> > > +                               image contains a valid main journal starting at
> > > +                               journal_offset.
> >
> > Whether the journal is used can be determined from the journal_offset
> > value (header length must be large enough and journal offset must be
> > valid).
> >
> > Why do we need this autoclear bit?
>
> Hm, I introduced this one first and the journal dirty incompatible bit
> later, perhaps it's unnecessary now. Let's check...
>
> The obvious thing we need to protect against is applying stale journal
> data to an image that has been changed by an older version. As long as
> the journal is clean, this can't happen, and the journal dirty bit will
> ensure that the old version can only open the image if it is clean.
>
> However, what if we run 'qemu-img check -r leaks' with an old qemu-img
> version? It will reclaim the clusters used by the journal, and if we
> continue using the journal we'll corrupt whatever new data is there
> now.

Why can an old qemu-img version open the image with a dirty journal in the
first place? It's an incompatible bit.

> Can we protect against this without using an autoclear bit?
>
> > > +Journals are used to allow safe updates of metadata without impacting
> > > +performance by requiring flushes to order updates to different parts of the
> > > +metadata.
> >
> > This sentence is hard to parse. Maybe something shorter like this:
> >
> >   Journals allow safe metadata updates without the need for carefully
> >   ordering and flushing between update steps.
>
> Okay, I'll update the text with your proposal.
>
> > > +They consist of transactions, which in turn contain operations that
> > > +are effectively executed atomically. A qcow2 image can have a main image
> > > +journal that deals with cluster management operations, and additional specific
> > > +journals can be used by other features like data deduplication.
> >
> > I'm not sure if multiple journals will work in practice. Doesn't this
> > re-introduce the need to order update steps and flush between them?
>
> This is a question for Benoît, who made this requirement. I asked him
> the same a while ago and apparently his explanation made some sense to
> me, or I would have remembered that I don't want it. ;-)
>
> It might have something to do with the fact that deduplication uses the
> journal more as a kind of cache for hash values that can be dropped and
> rebuilt after a crash.
>
> > > +A journal is organised in journal blocks, all of which have a reference count
> > > +of exactly 1. It starts with a block containing the following journal header:
> > > +
> > > +    Byte  0 -  7:   Magic ("qjournal" ASCII string)
> > > +
> > > +          8 - 11:   Journal size in bytes, including the header
> > > +
> > > +         12 - 15:   Journal block size order (block size in bytes = 1 << order)
> > > +                    The block size must be at least 512 bytes and must not
> > > +                    exceed the cluster size.
> > > +
> > > +         16 - 19:   Journal block index of the descriptor for the last
> > > +                    transaction that has been synced, starting with 1 for the
> > > +                    journal block after the header. 0 is used for empty
> > > +                    journals.
> > > +
> > > +         20 - 23:   Sequence number of the last transaction that has been
> > > +                    synced. 0 is recommended as the initial value.
> > > +
> > > +         24 - 27:   Sequence number of the last transaction that has been
> > > +                    committed. When replaying a journal, all transactions
> > > +                    after the last synced one up to the last committed one
> > > +                    must be synced. Note that this may include a wraparound
> > > +                    of sequence numbers.
> > > +
> > > +         28 - 31:   Checksum (one's complement of the sum of all bytes in the
> > > +                    header journal block except those of the checksum field)
> > > +
> > > +         32 - 511:  Reserved (set to 0)
> >
> > I'm not sure if these fields are necessary. They require updates (and
> > maybe flush) after every commit and sync.
> >
> > The fewer metadata updates, the better, not just for performance but
> > also to reduce the risk of data loss. If any metadata required to
> > access the journal is corrupted, the image will be unavailable.
> >
> > It should be possible to determine this information by scanning the
> > journal transactions.
>
> This is rather handwavy. Can you elaborate how this would work in detail?
>
>
> For example, let's assume we get to read this journal (a journal can be
> rather large, too, so I'm not sure if we want to read it in completely):
>
> - Descriptor, seq 42, 2 data blocks
> - Data block
> - Data block
> - Data block starting with "qjbk"
> - Data block
> - Descriptor, seq 7, 0 data blocks
> - Descriptor, seq 8, 1 data block
> - Data block
>
> Which of these have already been synced? Which have been committed?
>
>
> I guess we could introduce an is_committed flag in the descriptor, but
> wouldn't correct operation look like this then:
>
> 1. Write out the descriptor with the commit flag clear, plus any data blocks
> 2. Flush
> 3. Rewrite the descriptor with the commit flag set
>
> This ensures that the commit flag is only set if all the required data
> is indeed stable on disk. What has changed compared to this proposal is
> just the offset at which you write in step 3 (header vs. descriptor).
>
> For sync I suppose the same option exists.
>
>
> One unrelated thing we need to take care of is this 'data block starting
> with "qjbk"' thing I mentioned in the example. I can see two solutions:
> The first is never creating such a journal, by always blanking the next
> block after a transaction that we write, so that the descriptor chain is
> terminated. The second one is never creating such a journal, by zeroing
> out an initial "qjbk" in data blocks and setting a flag in the descriptor
> instead.
>
> I was thinking of the first, but I guess it would be worth at least
> mentioning the problem in the spec.
>
> > > +A wraparound may not occur in the middle of a single transaction, but only
> > > +between two transactions. For the necessary padding an empty descriptor with
> > > +any number of data blocks can be used as the last entry of the ring.
> >
> > Why have this limitation?
>
> I thought it would make implementations easier if they didn't have to
> read/write data blocks in two parts in some cases.
>
> > > +All descriptors start with a common part:
> > > +
> > > +    Byte  0 -  1:   Descriptor type
> > > +                        0 - No-op descriptor
> > > +                        1 - Write data block
> > > +                        2 - Copy data
> > > +                        3 - Revoke
> > > +                        4 - Deduplication hash insertion
> > > +                        5 - Deduplication hash deletion
> > > +
> > > +          2 -  3:   Size of the descriptor in bytes
> >
> > Data blocks are not included in the descriptor size? I just want to
> > make sure that we won't be limited to 64 KB for the actual data.
>
> Right, this is only for the descriptors, without data. It implies a
> maximum journal block size of 64k, which I think is okay.
>
> > > +
> > > +          4 -  n:   Type-specific data
> > > +
> > > +The following section specifies the purpose (i.e. the action that is to be
> > > +performed when syncing) and type-specific data layout of each descriptor type:
> > > +
> > > + * No-op descriptor: No action is to be performed when syncing this descriptor
> > > +
> > > +          4 -  n:   Ignored
> > > +
> > > + * Write data block: Write literal data associated with this transaction from
> > > +   the journal to a given offset.
> > > +
> > > +          4 -  7:   Length of the data to write in bytes
> > > +
> > > +          8 - 15:   Offset in the image file to write the data to
> > > +
> > > +         16 - 19:   Index of the journal block at which the data to write
> > > +                    starts. The data must be stored sequentially and be fully
> > > +                    contained in the data blocks associated with the
> > > +                    transaction.
> > > +
> > > +   The type-specific data can be repeated, specifying multiple chunks of data
> > > +   to be written in one operation. This means the size of the descriptor must
> > > +   be 4 + 16 * n.
> >
> > Why is this necessary? Multiple data descriptors could be used; is it
> > worth the additional logic and testing?
>
> What does a typical journal transaction look like?
>
> My assumption is that initially it has lots of L2 table and refcount
> block updates and little else. All of these are data writes. Once we
> implement Delayed COW using the journal, we get some data copy
> operations as well. Assuming a stupid guest, we get two copies and two
> writes for each cluster allocation.
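
As a quick illustration of the descriptor layouts quoted above (together
with the copy descriptor quoted further down), they could map to packed
structs roughly like the following sketch. The struct and field names are
made up here and are not part of the draft; only the offsets and sizes
come from the quoted text:

    /* Illustrative only: offsets/sizes follow the draft text, names do not. */
    #include <stdint.h>

    struct qj_desc_header {        /* common part, 4 bytes */
        uint16_t type;             /* 0=no-op, 1=write, 2=copy, 3=revoke, ... */
        uint16_t size;             /* descriptor size; data blocks not included */
    } __attribute__((packed));

    struct qj_write_entry {        /* one chunk of a "write data block" descriptor */
        uint32_t length;           /* bytes to write */
        uint64_t file_offset;      /* offset in the image file to write to */
        uint32_t journal_block;    /* journal block index where the data starts */
    } __attribute__((packed));     /* 16 bytes, i.e. size = 4 + 16 * n */

    struct qj_copy_entry {         /* one chunk of a "copy data" descriptor */
        uint32_t length;           /* bytes to copy */
        uint64_t target_offset;    /* target offset in the image file */
        uint64_t source_offset;    /* source offset in the image file */
    } __attribute__((packed));     /* 20 bytes, i.e. size = 4 + 20 * n */

    _Static_assert(sizeof(struct qj_desc_header) == 4, "common part");
    _Static_assert(sizeof(struct qj_write_entry) == 16, "write entry");
    _Static_assert(sizeof(struct qj_copy_entry) == 20, "copy entry");

The 16 and 20 byte entries plus the 4 byte common part are where the
4 + 16 * n and 4 + 20 * n formulas (and the calculation below) come from.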
>
> The space required for these is 2*(4 + 16n) + 2*(4 + 20n) = 16 + 72n. In
> one 4k descriptor block you can fit 56 cluster allocations.
>
> If you used separate descriptors, you get 2*20n + 2*24n = 88n. In one 4k
> descriptor block you could fit 46 cluster allocations.
>
> Considering that in both cases the space used by descriptors is dwarfed
> by the updated tables to be written, I guess it's not worth it for the
> main journal. Things may look different for the deduplication journal,
> where no data blocks are used and the space taken is determined only by
> the descriptors.
>
> And in fact... All of the above sounded great, but is in fact wrong.
> Typically you'd get _one_ L2 update for all (sequential) allocations,
> so the COW entries without data should dominate in the main journal as
> well.
>
> > > +
> > > + * Copy data: Copy data from one offset in the image to another one. This can
> > > +   be used for journalling copy-on-write operations.
> >
> > This reminds me to ask what the plan is for journal scope: metadata only
> > or also data? For some operations like dedupe it seems that full data
> > journalling may be necessary. But for an image without dedupe it would
> > not be necessary to journal the rewrites to an already allocated
> > cluster, for example.
>
> Generally only metadata.
>
> The COW case is a bit tricky. Here we really store only metadata in the
> journal as well ("this COW is still pending"), but it directly affects
> data and the qemu read/write paths for data must take such pending COWs
> into consideration. This will require some ordering between the journal
> and data (e.g. only free the COWed cluster after the journal is
> committed), but I think that's okay.
>
> > > +          4 -  7:   Length of the data to write in bytes
> > > +
> > > +          8 - 15:   Target offset in the image file
> > > +
> > > +         16 - 23:   Source offset in the image file
> >
> > Source and target cannot overlap?
>
> I don't think there's a use case for it, so let's forbid it.
>
> > > +
> > > +   The type-specific data can be repeated, specifying multiple chunks of data
> > > +   to be copied in one operation. This means the size of the descriptor must
> > > +   be 4 + 20 * n.
> > > +
> > > + * Revoke: Marks operations on a given range in the imag file invalid for all
> >
> > s/imag/image/
> >
> > > +   earlier transactions (this does not include the transaction containing the
> > > +   revoke). They must not be executed on a sync operation (e.g. because the
> > > +   range in question has been freed and may have been reused for other, not
> > > +   journalled data structures that must not be overwritten with stale data).
> > > +   Note that this may mean that operations are to be executed partially.
> >
> > Example scenario?
>
> Once I had one... Well, let's try this:
>
> 1. Cluster allocation that allocates a new L2 table, i.e. the L1 update
>    gets journalled
>
> 2. Another cluster allocation, this time the L1 has to grow. The write
>    to the new L1 position is journalled, the old table is freed.
>
> 3. The next allocation reuses the clusters where the old L1 table was
>    stored
>
> 4. Journal sync. The L1 update from step 1 is written out to the
>    clusters that are now used for data. Oops.
>
> Kevin
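
One more illustration, for the checksum field of the journal header quoted
earlier ("one's complement of the sum of all bytes in the header journal
block except those of the checksum field"). A minimal sketch of that rule,
with invented function and constant names, could look like this:

    #include <stdint.h>
    #include <stddef.h>

    #define QJ_CSUM_OFFSET 28   /* bytes 28 - 31 hold the checksum itself */

    static uint32_t qj_header_checksum(const uint8_t *block, size_t block_size)
    {
        uint32_t sum = 0;

        for (size_t i = 0; i < block_size; i++) {
            if (i >= QJ_CSUM_OFFSET && i < QJ_CSUM_OFFSET + 4) {
                continue;       /* skip the checksum field itself */
            }
            sum += block[i];
        }
        return ~sum;            /* one's complement of the byte sum */
    }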