From: Stefan Hajnoczi <stefanha@redhat.com>
To: Kevin Wolf <kwolf@redhat.com>
Cc: benoit.canet@irqsave.net, jcody@redhat.com, famz@redhat.com,
	qemu-devel@nongnu.org, mreitz@redhat.com
Subject: Re: [Qemu-devel] [RFC] qcow2 journalling draft
Date: Wed, 4 Sep 2013 10:03:52 +0200
Message-ID: <20130904080352.GA8031@stefanha-thinkpad.redhat.com>
In-Reply-To: <1378215952-7151-1-git-send-email-kwolf@redhat.com>

On Tue, Sep 03, 2013 at 03:45:52PM +0200, Kevin Wolf wrote:
> @@ -103,7 +107,11 @@ in the description of a field.
>                      write to an image with unknown auto-clear features if it
>                      clears the respective bits from this field first.
>  
> -                    Bits 0-63:  Reserved (set to 0)
> +                    Bit 0:      Journal valid bit. This bit indicates that the
> +                                image contains a valid main journal starting at
> +                                journal_offset.

Whether the journal is used can be determined from the journal_offset
value (header length must be large enough and journal offset must be
valid).

Why do we need this autoclear bit?

> +Journals are used to allow safe updates of metadata without impacting
> +performance by requiring flushes to order updates to different parts of the
> +metadata.

This sentence is hard to parse.  Maybe something shorter like this:

Journals allow safe metadata updates without the need for carefully
ordering and flushing between update steps.

> +They consist of transactions, which in turn contain operations that
> +are effectively executed atomically. A qcow2 image can have a main image
> +journal that deals with cluster management operations, and additional specific
> +journals can be used by other features like data deduplication.

I'm not sure if multiple journals will work in practice.  Doesn't this
re-introduce the need to order update steps and flush between them?

> +A journal is organised in journal blocks, all of which have a reference count
> +of exactly 1. It starts with a block containing the following journal header:
> +
> +    Byte  0 -  7:   Magic ("qjournal" ASCII string)
> +
> +          8 - 11:   Journal size in bytes, including the header
> +
> +         12 - 15:   Journal block size order (block size in bytes = 1 << order)
> +                    The block size must be at least 512 bytes and must not
> +                    exceed the cluster size.
> +
> +         16 - 19:   Journal block index of the descriptor for the last
> +                    transaction that has been synced, starting with 1 for the
> +                    journal block after the header. 0 is used for empty
> +                    journals.
> +
> +         20 - 23:   Sequence number of the last transaction that has been
> +                    synced. 0 is recommended as the initial value.
> +
> +         24 - 27:   Sequence number of the last transaction that has been
> +                    committed. When replaying a journal, all transactions
> +                    after the last synced one up to the last commit one must be
> +                    synced. Note that this may include a wraparound of sequence
> +                    numbers.
> +
> +         28 -  31:  Checksum (one's complement of the sum of all bytes in the
> +                    header journal block except those of the checksum field)
> +
> +         32 - 511:  Reserved (set to 0)

I'm not sure if these fields are necessary.  They require updates (and
maybe a flush) after every commit and sync.

The fewer metadata updates, the better, not just for performance but
also to reduce the risk of data loss.  If any metadata required to
access the journal is corrupted, the image will be unavailable.

It should be possible to determine this information by scanning the
journal transactions.
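
For reference while discussing these fields, here is a minimal sketch of
how the quoted header layout could be parsed and checksummed.  It is not
taken from any implementation.  Assumptions: fields are big-endian (as
elsewhere in qcow2, the quoted text does not say), the checksum covers the
whole header journal block, and the function names and the cluster_bits
parameter are hypothetical.

```python
import struct

# magic, journal size, block size order, last synced block index,
# last synced seq, last committed seq, checksum (bytes 0-31 of the draft).
# Big-endian is an assumption; the quoted draft does not state byte order.
HEADER = struct.Struct(">8sIIIIII")


def header_checksum(block: bytes) -> int:
    """One's complement of the sum of all bytes except bytes 28-31."""
    total = sum(block[:28]) + sum(block[32:])
    return ~total & 0xFFFFFFFF


def build_journal_header(size, order, synced_idx, synced_seq, committed_seq):
    """Hypothetical helper: build a zero-padded header block of 1 << order bytes."""
    block = bytearray(1 << order)
    struct.pack_into(">8sIIIII", block, 0, b"qjournal", size, order,
                     synced_idx, synced_seq, committed_seq)
    struct.pack_into(">I", block, 28, header_checksum(bytes(block)))
    return bytes(block)


def parse_journal_header(block: bytes, cluster_bits: int = 16) -> dict:
    """Validate and decode a header block; cluster_bits is a caller-supplied assumption."""
    if len(block) < HEADER.size:
        raise ValueError("header block too short")
    magic, size, order, synced_idx, synced_seq, committed_seq, csum = \
        HEADER.unpack_from(block)
    if magic != b"qjournal":
        raise ValueError("bad magic")
    if not 9 <= order <= cluster_bits:
        raise ValueError("block size must be >= 512 and <= cluster size")
    if csum != header_checksum(block):
        raise ValueError("bad checksum")
    return {
        "size": size,
        "block_size": 1 << order,
        "synced_index": synced_idx,
        "synced_seq": synced_seq,
        "committed_seq": committed_seq,
        # Transactions to replay; modular arithmetic handles seq wraparound.
        "pending": (committed_seq - synced_seq) & 0xFFFFFFFF,
    }
```

The "pending" value shows why the three sync/commit fields are convenient,
and also that each of them has to be rewritten whenever a transaction is
committed or synced.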

> +A wraparound may not occur in the middle of a single transaction, but only
> +between two transactions. For the necessary padding an empty descriptor with
> +any number of data blocks can be used as the last entry of the ring.

Why have this limitation?

> +All descriptors start with a common part:
> +
> +    Byte  0 -  1:   Descriptor type
> +                        0 - No-op descriptor
> +                        1 - Write data block
> +                        2 - Copy data
> +                        3 - Revoke
> +                        4 - Deduplication hash insertion
> +                        5 - Deduplication hash deletion
> +
> +          2 -  3:   Size of the descriptor in bytes

Data blocks are not included in the descriptor size?  I just want to
make sure that we won't be limited to 64 KB for the actual data.
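
To make the concern concrete: a 16-bit size field caps a descriptor at
65535 bytes, while the write data descriptor further down carries its own
32-bit length field.  The sketch below decodes only the common part.  It
assumes big-endian fields and that the size counts the 4-byte common part
(which the "4 + 16 * n" wording suggests); parse_common is a hypothetical
name.

```python
import struct

# Type codes copied from the quoted draft.
DESCRIPTOR_TYPES = {
    0: "no-op",
    1: "write data block",
    2: "copy data",
    3: "revoke",
    4: "dedup hash insertion",
    5: "dedup hash deletion",
}

COMMON = struct.Struct(">HH")  # type, size; big-endian is an assumption
MAX_DESCRIPTOR_SIZE = 0xFFFF   # largest value a 16-bit size field can hold


def parse_common(buf: bytes, offset: int = 0):
    """Return (type, size, type-specific bytes) for the descriptor at offset."""
    dtype, size = COMMON.unpack_from(buf, offset)
    if size < COMMON.size:
        raise ValueError("descriptor smaller than its common part")
    if offset + size > len(buf):
        raise ValueError("descriptor runs past the buffer")
    return dtype, size, buf[offset + COMMON.size:offset + size]
```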

> +
> +          4 -  n:   Type-specific data
> +
> +The following section specifies the purpose (i.e. the action that is to be
> +performed when syncing) and type-specific data layout of each descriptor type:
> +
> +  * No-op descriptor: No action is to be performed when syncing this descriptor
> +
> +          4 -  n:   Ignored
> +
> +  * Write data block: Write literal data associated with this transaction from
> +    the journal to a given offset.
> +
> +          4 -  7:   Length of the data to write in bytes
> +
> +          8 - 15:   Offset in the image file to write the data to
> +
> +         16 - 19:   Index of the journal block at which the data to write
> +                    starts. The data must be stored sequentially and be fully
> +                    contained in the data blocks associated with the
> +                    transaction.
> +
> +    The type-specific data can be repeated, specifying multiple chunks of data
> +    to be written in one operation. This means the size of the descriptor must
> +    be 4 + 16 * n.

Why is this necessary?  Multiple data descriptors could be used instead.
Is it worth the additional logic and testing?
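
As a rough idea of the extra logic involved, here is a sketch of decoding
the repeated chunks of a write data descriptor.  It assumes big-endian
fields; parse_write_chunks is a hypothetical name, and whether n = 0 is
legal is not stated in the draft (the sketch accepts it).

```python
import struct

# One repeated unit of type-specific data, 16 bytes:
# length (4), target offset in the image (8), journal block index (4).
WRITE_CHUNK = struct.Struct(">IQI")  # big-endian is an assumption


def parse_write_chunks(descriptor: bytes):
    """Return [(length, image_offset, journal_block_index), ...]."""
    dtype, size = struct.unpack_from(">HH", descriptor)
    if dtype != 1:
        raise ValueError("not a write data block descriptor")
    if size < 4 or size > len(descriptor):
        raise ValueError("bad descriptor size")
    if (size - 4) % WRITE_CHUNK.size != 0:
        raise ValueError("size must be 4 + 16 * n")
    return [WRITE_CHUNK.unpack_from(descriptor, off)
            for off in range(4, size, WRITE_CHUNK.size)]
```

With one chunk per descriptor the loop and the "4 + 16 * n" check would
disappear, at the cost of 4 extra bytes per chunk for the repeated common
part.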

> +
> +  * Copy data: Copy data from one offset in the image to another one. This can
> +    be used for journalling copy-on-write operations.

This reminds me to ask what the plan is for journal scope: metadata only
or also data?  For some operations like dedupe it seems that full data
journalling may be necessary.  But for an image without dedupe it would
not be necessary to journal the rewrites to an already allocated
cluster, for example.

> +          4 -  7:   Length of the data to write in bytes
> +
> +          8 - 15:   Target offset in the image file
> +
> +         16 - 23:   Source offset in the image file

Source and target cannot overlap?
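
If overlap were to be forbidden, a replay implementation could reject it
with a check along these lines.  This is only a sketch: the rule itself is
not in the draft, big-endian fields are assumed, and the names are
hypothetical.

```python
import struct

# One repeated unit of type-specific data, 20 bytes:
# length (4), target offset (8), source offset (8).
COPY_CHUNK = struct.Struct(">IQQ")  # big-endian is an assumption


def ranges_overlap(a: int, b: int, length: int) -> bool:
    """True if [a, a + length) and [b, b + length) share at least one byte."""
    return length > 0 and a < b + length and b < a + length


def overlapping_copies(type_specific: bytes):
    """Return the (length, target, source) chunks whose ranges overlap."""
    if len(type_specific) % COPY_CHUNK.size != 0:
        raise ValueError("type-specific data must be 20 * n bytes")
    chunks = [COPY_CHUNK.unpack_from(type_specific, off)
              for off in range(0, len(type_specific), COPY_CHUNK.size)]
    return [c for c in chunks if ranges_overlap(c[1], c[2], c[0])]
```

Overlap matters for replay because a copy that was interrupted halfway and
is then re-executed may read source bytes it has already overwritten.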

> +
> +    The type-specific data can be repeated, specifying multiple chunks of data
> +    to be copied in one operation. This means the size of the descriptor must
> +    be 4 + 20 * n.
> +
> +  * Revoke: Marks operations on a given range in the imag file invalid for all

s/imag/image/

> +    earlier transactions (this does not include the transaction containing the
> +    revoke). They must not be executed on a sync operation (e.g. because the
> +    range in question has been freed and may have been reused for other, not
> +    journalled data structures that must not be overwritten with stale data).
> +    Note that this may mean that operations are to be executed partially.

Example scenario?
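
One scenario that would fit the "executed partially" wording, stated as an
assumption and not something the draft says: an earlier transaction writes
100 bytes at offset 0, and a later revoke covers bytes 40-59.  On sync
only bytes 0-39 and 60-99 of that write would be applied.  The sketch
below computes the surviving pieces; ranges are given as (start, length)
and surviving_ranges is a hypothetical name.

```python
def surviving_ranges(op_start, op_len, rev_start, rev_len):
    """Return the parts of an operation's range not covered by a revoke.

    Both ranges are half-open: [start, start + length).
    """
    op_end = op_start + op_len
    rev_end = rev_start + rev_len
    pieces = []
    if rev_start > op_start:
        # Part of the operation that lies before the revoked range.
        pieces.append((op_start, min(op_end, rev_start) - op_start))
    if rev_end < op_end:
        # Part of the operation that lies after the revoked range.
        start = max(op_start, rev_end)
        pieces.append((start, op_end - start))
    # Drop empty pieces (e.g. a zero-length operation).
    return [(s, n) for s, n in pieces if n > 0]
```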

