All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mike Snitzer <snitzer@redhat.com>
To: dm-devel@redhat.com
Cc: Ric Wheeler <rwheeler@redhat.com>, Mikulas Patocka <mpatocka@redhat.com>
Subject: Re: [PATCH 03/14] dm-multisnap-mikulas-headers
Date: Fri, 5 Mar 2010 20:54:35 -0500	[thread overview]
Message-ID: <20100306015435.GA8244@redhat.com> (raw)
In-Reply-To: <20100305224642.GA16425@redhat.com>

On Fri, Mar 05 2010 at  5:46pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Mon, Mar 01 2010 at  7:23pm -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
> > From: Mikulas Patocka <mpatocka@redhat.com>
> > 
> > Common header files for the exception store.
> > 
> > dm-multisnap-mikulas-struct.h contains on-disk structure definitions.
> > 
> > dm-multisnap-mikulas.h contains in-memory structures and kernel function
> > prototypes.
> > 
> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > ---
> >  drivers/md/dm-multisnap-mikulas-struct.h |  380 ++++++++++++++++++++++++++++++
> >  drivers/md/dm-multisnap-mikulas.h        |  247 +++++++++++++++++++
> >  2 files changed, 627 insertions(+), 0 deletions(-)
> >  create mode 100644 drivers/md/dm-multisnap-mikulas-struct.h
> >  create mode 100644 drivers/md/dm-multisnap-mikulas.h
> > 
> > diff --git a/drivers/md/dm-multisnap-mikulas-struct.h b/drivers/md/dm-multisnap-mikulas-struct.h
> > new file mode 100644
> > index 0000000..39eaa16
> > --- /dev/null
> > +++ b/drivers/md/dm-multisnap-mikulas-struct.h
> 
> <snip>
> 
> > +/*
> > + *	Description of on-disk format:
> > + *
> > + * The device is composed of blocks (also called chunks). The block size (also
> > + * called chunk size) is specified in the superblock.
> > + *
> > + * The chunk and block mean the same. "chunk" comes from old snapshots.
> > + * "block" comes from filesystems. We tend to use "chunk" in
> > + * exception-store-independent code to make it consistent with snapshot
> > + * terminology and "block" in exception-store code to make it consistent with
> > + * filesystem terminology.
> > + *
> > + * The minimum block size is 512, the maximum block size is not specified (it is
> > + * limited by 32-bit integer size and available memory). All on-disk pointers
> > + * are in the units of blocks. The pointers are 48-bit, making this format
> > + * capable of handling 2^48 blocks.
> 
> Shouldn't we require the chunk size be at least as big as
> (and a multiple of) physical_block_size?  E.g. 4096 on a 4K sector
> device.
> 
> This question applies to non-shared snapshots too.
> 
> > + *	Commit blocks
> > + *
> > + * Chunks 1, 1+cb_stride, 1+2*cb_stride, 1+3*cb_stride, etc. are commit blocks.
> > + * Chunks at these locations ((location % cb_stride) == 1) are only used for
> > + * commit blocks, they can't be used for anything else. A commit block is
> > + * written each time a new state is committed. The snapshot store transitions
> > + * from one consistent state to another consistent state by writing a commit
> > + * block.
> > + *
> > + * All commit blocks must be present and initialized (i.e. have valid signature
> > + * and sequence number). They are created when the device is initialized or
> > + * extended. It is not allowed to have random uninitialized data in any commit
> > + * block.
> > + *
> > + * For correctness, one commit block would be sufficient --- but to improve
> > + * performance and minimize seek times, there are multiple commit blocks and
> > + * we use the commit block that is near currently written data.
> > + *
> > + * The current commit block is stored in the super block. However, updates to
> > + * the super block would make excessive disk seeks too, so the updates to super
> > + * block are skipped if the commit block is written at the currently valid
> > + * commit block or at the next location following the currently valid commit
> > + * block. The following algorithm is used to find the commit block at mount:
> > + *	1. read the commit block multisnap_superblock->commit_block
> > + *	2. get its sequence number
> > + *	3. read the next commit block
> > + *	4. if the sequence number of the next commit block is higher than
> > + *	   the sequence number of the previous block, go to step 3. (i.e. read
> > + *	   another commit block)
> > + *	5. if the sequence number of the next commit block is lower than
> > + *	   the sequence number of the previous block, use the previous block
> > + *	   as the most recent valid commit block
> > + *
> > + * Note: because the disks only support atomic writes of 512 bytes, the commit
> > + * block has only 512 bytes of valid data. The remaining data in the commit
> > + * block up to the chunk size is unused.
> 
> Are there other places where you assume 512b is beneficial?  My concern
> is: what will happen on 4K devices?
> 
> Would making the commit block's size match the physical_block_size give
> us any multisnapshot benefit?  At a minimum I see a larger commit block
> would allow us to have more remap entries (larger remap
> array).. "Remaps" detailed below.  But what does that buy us?
> 
> However, and before I get ahead of myself, with blk_stack_limits() we
> could have a (DM) device that is composed of 4K and 512b devices; with a
> resulting physical_block_size of 4K.  But 4K wouldn't be atomic to the
> 512b disk.
> 
> But what if we were to add a checksum to the commit block?  This could
> give us the ability to have a larger commit block regardless of the
> physical_block_size. [NOTE: I saw dm_multisnap_commit() is just writing
> a fixed CB_SIGNATURE]
> 
> And in speaking with Ric Wheeler, using a checksum in the commit block
> opens up the possibility for optimizing (reducing) the barrier ops
> associated with:
> 1) before the commit block is written (flushes journal transaction) 
> 2) and after the commit block is written.
> 
> Leaving us with only needing to barrier after the commit block is
> written.  But this optimization apparently also requires having a
> checksummed journal.  Ext4 offers this (somewhat controversial yet fast)
> capability with the 'journal_async_commit' mount option. [NOTE: I'm
> largely parroting what I heard from Ric]
> 
> [NOTE: I couldn't immediately tell if dm_multisnap_commit() is doing
> multiple barriers when writing out the transaction and commit block]
> 
> Taking a step back, any reason you elected to not reuse existing kernel
> infrastructure (e.g. jbd2) for journaling?  Custom solution needed for
> the log-nature of the multisnapshot?  [Excuse my naive question(s), I
> see nilfs2 also has its own journaling... I'm just playing devil's
> advocate given how important it is that the multisnapshot journal code
> be correct]

Here is some additional detail on ext4's 'journal_async_commit':
http://marc.info/?l=linux-ext4&m=125263711211379&w=2
http://marc.info/?l=linux-ext4&m=125267485222449&w=2

Ted Tso acknowledged that the name 'journal_async_commit' is really a
misnomer here (I reference this post last because it contains an early
misunderstanding from Ted, that he corrects in the 1st url I referenced
above):
http://marc.info/?l=linux-ext4&m=125238515130681&w=2

  reply	other threads:[~2010-03-06  1:54 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-03-02  0:23 [PATCH 00/14] mikulas' shared snapshot patches Mike Snitzer
2010-03-02  0:23 ` [PATCH 01/14] dm-multisnap-common Mike Snitzer
2010-03-02  0:23 ` [PATCH 02/14] dm-bufio Mike Snitzer
2010-03-02  0:23 ` [PATCH 03/14] dm-multisnap-mikulas-headers Mike Snitzer
2010-03-05 22:46   ` Mike Snitzer
2010-03-06  1:54     ` Mike Snitzer [this message]
2010-03-09  3:08     ` Mikulas Patocka
2010-03-09  3:30       ` Mike Snitzer
2010-03-02  0:23 ` [PATCH 04/14] dm-multisnap-mikulas-alloc Mike Snitzer
2010-03-02  0:23 ` [PATCH 05/14] dm-multisnap-mikulas-blocks Mike Snitzer
2010-03-02  0:23 ` [PATCH 06/14] dm-multisnap-mikulas-btree Mike Snitzer
2010-03-02  0:23 ` [PATCH 07/14] dm-multisnap-mikulas-commit Mike Snitzer
2010-03-02  0:23 ` [PATCH 08/14] dm-multisnap-mikulas-delete Mike Snitzer
2010-03-02  0:23 ` [PATCH 09/14] dm-multisnap-mikulas-freelist Mike Snitzer
2010-03-02  0:23 ` [PATCH 10/14] dm-multisnap-mikulas-io Mike Snitzer
2010-03-02  0:23 ` [PATCH 11/14] dm-multisnap-mikulas-snaps Mike Snitzer
2010-03-02  0:23 ` [PATCH 12/14] dm-multisnap-mikulas-common Mike Snitzer
2010-03-02  0:23 ` [PATCH 13/14] dm-multisnap-mikulas-config Mike Snitzer
2010-03-02  0:23 ` [PATCH 14/14] dm-multisnap-daniel Mike Snitzer
2010-03-02  0:57   ` FUJITA Tomonori
2010-03-02  0:32 ` [PATCH 00/14] mikulas' shared snapshot patches Mike Snitzer
2010-03-02 14:56 ` Mike Snitzer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100306015435.GA8244@redhat.com \
    --to=snitzer@redhat.com \
    --cc=dm-devel@redhat.com \
    --cc=mpatocka@redhat.com \
    --cc=rwheeler@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.