Re: [PATCH 03/14] dm-multisnap-mikulas-headers

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mike Snitzer <snitzer@redhat.com>
To: Mikulas Patocka <mpatocka@redhat.com>
Cc: dm-devel@redhat.com
Subject: Re: [PATCH 03/14] dm-multisnap-mikulas-headers
Date: Mon, 8 Mar 2010 22:30:45 -0500	[thread overview]
Message-ID: <20100309033045.GB9531@redhat.com> (raw)
In-Reply-To: <Pine.LNX.4.64.1003082145210.20675@hs20-bc2-1.build.redhat.com>

On Mon, Mar 08 2010 at 10:08pm -0500,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> > > + * The minimum block size is 512, the maximum block size is not specified (it is
> > > + * limited by 32-bit integer size and available memory). All on-disk pointers
> > > + * are in the units of blocks. The pointers are 48-bit, making this format
> > > + * capable of handling 2^48 blocks.
> > 
> > Shouldn't we require the chunk size be at least as big as
> > (and a multiple of) physical_block_size?  E.g. 4096 on a 4K sector
> > device.
> > 
> > This question applies to non-shared snapshots too.
> 
> If the device has a larger physical block size, it will reject smaller 
> chunks. The same for non-shared snapshots.

Correct, but we don't prevent the user from trying to use less than
physical_block_size.  So I think we agree we should.  Hasn't been a
concern but native 4K devices change that.

> > > + * Note: because the disks only support atomic writes of 512 bytes, the commit
> > > + * block has only 512 bytes of valid data. The remaining data in the commit
> > > + * block up to the chunk size is unused.
> > 
> > Are there other places where you assume 512b is beneficial?  My concern
> > is: what will happen on 4K devices?
> 
> With 4K chunk size, it writes 4K, but assumes that only 512b write is 
> atomic. So, if the disk supports atomic write of 4K, it doesn't hurt.

Right, so long as we impose 4K on a native 4K device, etc.  But I was
more wondering if there were other places that assume 512b granularity.
Didn't see any but figured I'd ask.
 
> > Would making the commit block's size match the physical_block_size give
> > us any multisnapshot benefit?  At a minimum I see a larger commit block
> > would allow us to have more remap entries (larger remap
> > array)..
> >
> > "Remaps" detailed below.  But what does that buy us?
> 
> They reduce the number of blocks written. Without remaps, you'd have to 
> write the path from the root every time.

Sure, I wasn't saying we'd eliminate/reduce rempas.  I was saying we
could increase the number of remaps (think the current limit is 27 per
512b commit block).  Using a 4K commit block would give us 100+?

So I was asking if more remaps help at all.

> > However, and before I get ahead of myself, with blk_stack_limits() we
> > could have a (DM) device that is composed of 4K and 512b devices; with a
> > resulting physical_block_size of 4K.  But 4K wouldn't be atomic to the
> > 512b disk.
> 
> Yes, that's why I must assume only 512b atomic write.

OK, fine by me so long as we impose chunk_size >= physical_block_size.

> > But what if we were to add a checksum to the commit block?  This could
> > give us the ability to have a larger commit block regardless of the
> > physical_block_size. [NOTE: I saw dm_multisnap_commit() is just writing
> > a fixed CB_SIGNATURE]
> 
> That would have to be cryptographic hash --- simple checksum can be 
> fooled.
> 
> Even that wouldn't be correct, because if the hash fails, the commit block 
> is lost.  If you wanted to use full commit blocks, you'd have to:
> 
> - divide the commit block to two.
> - write these two alternatively (so that at least one is valid)
> - hash them or (which is simpler) copy sequence number to each 512b sector 
> (so that if some sectors get written and some not, you find it out by 
> having different sequence number).
> 
> That is possible to do.

Yes, Ric mentioned a comparable strategy was used for database
transactions.

> > And in speaking with Ric Wheeler, using a checksum in the commit block
> > opens up the possibility for optimizing (reducing) the barrier ops
> > associated with:
> > 1) before the commit block is written (flushes journal transaction) 
> > 2) and after the commit block is written.
> 
> No, you have to use barriers. If the data before the commit blocks is not 
> written and the commit block is written (with matching checksum), then the 
> data is corrupted.

Fair enough, though it is used by ext4.  But with ext4 it is used in
conjunction with a checksummed journal.

> Obviously, you can checksum all the data, but SHA1 is slow and it is being 
> phased out already and even slower SHA256 is being recommended...
> 
> > Leaving us with only needing to barrier after the commit block is
> > written.  But this optimization apparently also requires having a
> > checksummed journal.  Ext4 offers this (somewhat controversial yet fast)
> > capability with the 'journal_async_commit' mount option. [NOTE: I'm
> > largely parroting what I heard from Ric]
> > 
> > [NOTE: I couldn't immediately tell if dm_multisnap_commit() is doing
> > multiple barriers when writing out the transaction and commit block]
> 
> It calls dm_bufio_write_dirty_buffers twice and 
> dm_bufio_write_dirty_buffers submits a zero barrier. (there's no point in 
> submitting data-barrier, because that gets split into two zero barriers 
> and non-barrier write anyway)

OK.

> > Taking a step back, any reason you elected to not reuse existing kernel
> > infrastructure (e.g. jbd2) for journaling?  Custom solution needed for
> > the log-nature of the multisnapshot?  [Excuse my naive question(s), I
> > see nilfs2 also has its own journaling... I'm just playing devil's
> > advocate given how important it is that the multisnapshot journal code
> > be correct]
> 
> All the filesystems have their own journaling. jbd is used only by ext3, 
> jbd2 only by ext4. Reiserfs has its own, JFS has its own, XFS has its 
> own... etc.
> 
> I consider the idead of sharing journaling code as inefficient: arguing 
> about the interface would take more time than writing it from scratch.

Well, ocfs2 uses jbd2 too but I understand your point.

> > Above provided, for the benefit of others, to give more context on the
> > role of remap entries (and the commit block's remap array).
> 
> If there were no remaps, change in any B-tree node would require to 
> overwrite all the nodes from the root. Similarly, changing any bitmap 
> would require to overwrite the bitmap directory from the root.
> 
> With remaps, changes to B-tree nodes or bitmaps write just that one block 
> (and commit block, to store the remap). The full write from the root is 
> done later, when the remap table fills up.

Again, I was asking about adding more remap entries in the remaps array
if the commit block was increased from 512b to 4K.

Mike

next prev parent reply	other threads:[~2010-03-09  3:30 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-03-02  0:23 [PATCH 00/14] mikulas' shared snapshot patches Mike Snitzer
2010-03-02  0:23 ` [PATCH 01/14] dm-multisnap-common Mike Snitzer
2010-03-02  0:23 ` [PATCH 02/14] dm-bufio Mike Snitzer
2010-03-02  0:23 ` [PATCH 03/14] dm-multisnap-mikulas-headers Mike Snitzer
2010-03-05 22:46   ` Mike Snitzer
2010-03-06  1:54     ` Mike Snitzer
2010-03-09  3:08     ` Mikulas Patocka
2010-03-09  3:30       ` Mike Snitzer [this message]
2010-03-02  0:23 ` [PATCH 04/14] dm-multisnap-mikulas-alloc Mike Snitzer
2010-03-02  0:23 ` [PATCH 05/14] dm-multisnap-mikulas-blocks Mike Snitzer
2010-03-02  0:23 ` [PATCH 06/14] dm-multisnap-mikulas-btree Mike Snitzer
2010-03-02  0:23 ` [PATCH 07/14] dm-multisnap-mikulas-commit Mike Snitzer
2010-03-02  0:23 ` [PATCH 08/14] dm-multisnap-mikulas-delete Mike Snitzer
2010-03-02  0:23 ` [PATCH 09/14] dm-multisnap-mikulas-freelist Mike Snitzer
2010-03-02  0:23 ` [PATCH 10/14] dm-multisnap-mikulas-io Mike Snitzer
2010-03-02  0:23 ` [PATCH 11/14] dm-multisnap-mikulas-snaps Mike Snitzer
2010-03-02  0:23 ` [PATCH 12/14] dm-multisnap-mikulas-common Mike Snitzer
2010-03-02  0:23 ` [PATCH 13/14] dm-multisnap-mikulas-config Mike Snitzer
2010-03-02  0:23 ` [PATCH 14/14] dm-multisnap-daniel Mike Snitzer
2010-03-02  0:57   ` FUJITA Tomonori
2010-03-02  0:32 ` [PATCH 00/14] mikulas' shared snapshot patches Mike Snitzer
2010-03-02 14:56 ` Mike Snitzer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100309033045.GB9531@redhat.com \
    --to=snitzer@redhat.com \
    --cc=dm-devel@redhat.com \
    --cc=mpatocka@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.