From: NeilBrown <neilb@suse.com>
To: Shaohua Li <shli@fb.com>
Cc: linux-raid@vger.kernel.org, songliubraving@fb.com,
hch@infradead.org, dan.j.williams@intel.com, Kernel-team@fb.com
Subject: Re: [PATCH V4 00/13] MD: a caching layer for raid5/6
Date: Fri, 10 Jul 2015 14:36:56 +1000 [thread overview]
Message-ID: <20150710143656.4ee7e647@noble> (raw)
In-Reply-To: <20150710040847.GA1408097@devbig257.prn2.facebook.com>
On Thu, 9 Jul 2015 21:08:49 -0700 Shaohua Li <shli@fb.com> wrote:
> On Fri, Jul 10, 2015 at 09:21:19AM +1000, NeilBrown wrote:
> > On Tue, 7 Jul 2015 22:44:02 -0700 Shaohua Li <shli@fb.com> wrote:
> >
> > > On Wed, Jul 08, 2015 at 11:56:36AM +1000, NeilBrown wrote:
> > > >
> > > > Hi,
> > > > I made some time to look at these this morning - sorry for the delay.
> > > >
> > > > Having it all broken down with more complete comments helps a lot -
> > > > thanks.
> > > >
> > > > Unfortunately ... it helps confirm that I really don't like this. It
> > > > seems much more complex than it should be. There certainly are a lot
> > > > of details that need to be carefully considered, but the result needs
> > > > to be simpler.
> > > >
> > > > A big part of my concern is that you choose to avoid making proper use
> > > > of the current stripe cache. Your argument is that it is already
> > > > too complex. That is probably true but I don't think you decrease
> > > > total complexity by adding extra bits on the side that themselves are
> > > > complex in completely different ways.
> > > >
> > > > I would like to start small and keep it as simple as possible. I
> > > > think the minimum useful functionality is closing the write-hole.
> > > > Anything else could be done outside raid5 (e.g. bcache), but the
> > > > write-hole needs integration. Once that is closed, it may make sense
> > > > to add more functionality.
> > > >
> > > > To do that we need a log, and it was quite sensible for you to put
> > > > that first in the list of patches. So let's start with that.
> > > >
> > > > I would like to propose a single metadata block format. This block
> > > > includes the general RAID5 parameters (both 'old' and 'new' for when a
> > > > reshape is happening), it includes linkage to find nearby metadata.
> > > > It includes a list of the data written after the metadata. And it
> > > > includes a list of device-addresses of parity blocks which are now safe
> > > > on the RAID and so any log content is now ignored.
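A minimal sketch of the single metadata block Neil describes, as a C struct. All field names, sizes, and the exact layout here are illustrative assumptions for discussion, not the on-disk format from the patches:

```c
#include <stdint.h>
#include <assert.h>

/* Illustrative on-disk metadata block for a raid5 write-ahead log.
 * One block type serves every purpose: it links to neighbouring blocks,
 * indexes the data/parity written after it, and lists parity blocks
 * that are now safe on the RAID devices.  Multi-byte fields would be
 * stored little-endian on disk. */
struct r5log_meta_block {
    uint32_t magic;           /* identifies a log metadata block */
    uint32_t checksum;        /* CRC over this block */
    uint64_t seq;             /* monotonically increasing sequence number */
    uint64_t position;        /* this block's sector in the log device */
    uint64_t prev_position;   /* linkage: previous metadata block */
    uint64_t first_position;  /* start of the log at time of writing */
    /* array geometry, 'old' and 'new' for an in-progress reshape */
    uint32_t old_data_disks, new_data_disks;
    uint32_t old_chunk_sectors, new_chunk_sectors;
    uint16_t nr_data;         /* data/parity entries that follow */
    uint16_t nr_completed;    /* completed-stripe entries that follow */
    /* followed by nr_data 'struct r5log_payload' entries and
       nr_completed device addresses of now-safe parity blocks */
};

struct r5log_payload {
    uint64_t dev_sector;      /* device address of this data/parity block */
    uint32_t checksum;        /* checksum of the block's contents */
    uint16_t flags;           /* e.g. data vs parity */
    uint16_t pad;
};
```

With natural alignment the header is a fixed 64 bytes and each index entry 16 bytes, so hundreds of entries fit in a 4KB block.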
> > > >
> > > > The general mechanism for writing to the log is:
> > > > - collect a list of bios. As each arrives update the metadata
> > > > block with index information.
> > > > - when ready (there are various triggers) write the metadata block
> > > > with FLUSH/FUA and write all the data blocks normally. This
> > > > metadata block plus data/parity blocks is a 'transaction'.
> > > >
> > > > We never trigger a log transaction until all writes for the previous
> > > > transaction have completed. This means that once the metadata block is
> > > > safe, all previous data must also be safe. If this imposes too much
> > > > waiting we could allow a "double-buffer" approach where each metadata
> > > > block completes the previous-but-one transaction.
> > > >
> > > > When all of the previous writes complete we trigger a new log write
> > > > if there is an outstanding SYNC write, or maybe an outstanding
> > > > FLUSH/FUA, but otherwise we wait until the metadata block is full, or
> > > > some amount of time has passed.
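The trigger policy above can be condensed into one predicate. This is a sketch; the state fields, the 100ms timeout, and the function name are all assumptions, not code from the patch set:

```c
#include <stdbool.h>
#include <stdint.h>
#include <assert.h>

/* Illustrative trigger policy for writing the next metadata block.
 * A new transaction only starts once all writes belonging to the
 * previous one have completed, so each metadata block implicitly
 * commits everything before it. */
struct log_state {
    bool prev_writes_pending;    /* previous transaction not yet durable */
    bool have_sync_write;        /* an outstanding SYNC write is waiting */
    bool have_flush_fua;         /* an outstanding FLUSH/FUA is waiting */
    unsigned meta_entries;       /* entries queued in the open metadata block */
    unsigned meta_capacity;      /* entries that fit in one block */
    uint64_t ms_since_first_bio; /* age of the oldest queued bio */
};

static bool should_write_transaction(const struct log_state *s)
{
    if (s->prev_writes_pending)
        return false;            /* never overlap transactions */
    if (s->have_sync_write || s->have_flush_fua)
        return true;             /* latency-sensitive writers are waiting */
    if (s->meta_entries >= s->meta_capacity)
        return true;             /* metadata block is full */
    /* otherwise wait for a timeout (100ms is an assumed value) */
    return s->meta_entries > 0 && s->ms_since_first_bio >= 100;
}
```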
> > > >
> > > > This one metadata block serves all of the purposes that you
> > > > identified. It records where data is, it commits previous writes, and
> > > > records which stripes are being written to RAID and which have been
> > > > fully committed.
> > > >
> > > > With that log in place, we add the code to "hijack ops_run_io" to
> > > > write all dirty blocks from the stripe_head to the cache. This
> > > > includes the parity of course. Immediately that data is safe the
> > > > write requests can be returned and the data is written to the RAID
> > > > devices.
> > > >
> > > > "Recovery" at boot time simply involves:
> > > > 1. find and load first metadata block
> > > > 2. load the "next" metadata block. If it doesn't exist, stop.
> > > > 3. For each parity/data block listed in current block, get the
> > > > stripe_head and read the block in.
> > > > 4. For each completed stripe listed, find the stripe_head and
> > > > invalidate it.
> > > > 5. Make the "next" metadata block the current block and goto 2
> > > > 6. Flush out all dirty data in stripe cache.
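The six recovery steps amount to a single forward walk over the log. The toy model below simulates that walk over in-memory transactions and a toy "stripe cache"; every type and helper here is invented for illustration:

```c
#include <stdbool.h>
#include <assert.h>

/* Minimal in-memory simulation of the recovery walk: replay each
 * metadata block in order, marking listed data/parity stripes dirty in
 * a toy stripe cache and invalidating completed stripes, then count
 * what step 6 would flush back to the RAID devices. */
#define MAX_STRIPES 64

struct meta {                      /* one log transaction */
    int data[4]; int nr_data;      /* stripes with data/parity in the log */
    int done[4]; int nr_done;      /* stripes now safe on the RAID */
};

int recover(const struct meta *log, int nr_blocks)
{
    bool dirty[MAX_STRIPES] = { false };       /* toy stripe cache */

    for (int i = 0; i < nr_blocks; i++) {      /* steps 1-5: walk the log */
        for (int j = 0; j < log[i].nr_data; j++)
            dirty[log[i].data[j]] = true;      /* read the block in */
        for (int j = 0; j < log[i].nr_done; j++)
            dirty[log[i].done[j]] = false;     /* invalidate completed stripe */
    }

    int flush = 0;                             /* step 6: flush what remains */
    for (int s = 0; s < MAX_STRIPES; s++)
        if (dirty[s])
            flush++;
    return flush;
}
```

Because later blocks invalidate earlier entries, only stripes whose writes never reached the RAID are flushed at the end.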
> > > >
> > > >
> > > > I think that all of this *needs* to use the stripe_cache.
> > > > Intercepting normal processing between parity computation and writing
> > > > to the RAID devices must involve the stripe_cache.
> > > >
> > > > So this is a "simple" start. Once this is written and agreed and
> > > > tested and debugged, then we can look at the latency-hiding side of
> > > > the cache. I think this should still use the stripe_cache - just
> > > > schedule the writes to the log a bit earlier.
> > > > If a case can be made for a separate cache then I'm not completely
> > > > closed to the idea but I don't really want to think about it until the
> > > > write-hole closing code is done.
> > > >
> > > > So:
> > > > [PATCH V4 01/13] MD: add a new disk role to present cache device
> > > > probably OK
> > > > [PATCH V4 02/13] raid5: directly use mddev->queue
> > > > OK
> > > > [PATCH V4 03/13] raid5: cache log handling
> > > > strip this down to a bare minimum. It needs:
> > > > -on-disk metadata block format
> > > > -two or three active transactions which identify a list of bios
> > > > that are currently in the stripe cache. As the bio is added,
> > > > its address info is added to the metadata block.
> > > > -interface to "add a bio to a transaction" and "record a
> > > > completed stripe".
> > > > -timer flush the current transaction if previous one was not
> > > > empty.
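The bare-minimum interface described above might have roughly this shape. The names and the toy in-memory bodies are hypothetical, provided only to make the API concrete; they are not the functions from the patch set:

```c
#include <assert.h>

struct bio { unsigned long sector; };   /* stand-in for the block-layer bio */

struct r5l_log {
    int cur_entries;    /* bios indexed by the open metadata block */
    int completed;      /* completed-stripe records queued */
    int transactions;   /* metadata blocks written so far */
};

/* Queue a data/parity bio into the open transaction, recording its
 * address information in the in-memory metadata block. */
void r5l_add_bio(struct r5l_log *log, struct bio *bio)
{
    (void)bio->sector;  /* real code would copy the device address */
    log->cur_entries++;
}

/* Record that a stripe's parity is now safe on the RAID devices, so
 * log entries for it can be ignored after the next block commits. */
void r5l_stripe_completed(struct r5l_log *log, unsigned long sector)
{
    (void)sector;
    log->completed++;
}

/* Timer callback: flush the current transaction if it is not empty. */
void r5l_flush_timer(struct r5l_log *log)
{
    if (log->cur_entries || log->completed) {
        log->transactions++;
        log->cur_entries = log->completed = 0;
    }
}
```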
> > > >
> > > > I don't think there needs to be a separate superblock in the log.
> > > > Each device in the array already has an 'md' superblock. Use e.g
> > > > the 'recovery_offset' field in there (which is per-device and
> > > > meaningless for a log) to store the address of the most recent
> > > > metadata blocks at the time the superblock was written. Search
> > > > forward and backward from there to find whole log. Each metadata
> > > > block holds everything else that might be useful.
> > > >
> > > > [PATCH V4 05/13] raid5: cache reclaim support
> > > > Just want the "hijack ops_run_io" part of this so that when normal
> > > > RAID5 processing wants to write a stripe, blocks get diverted to
> > > > the log first and further processing of the stripe is delayed until
> > > > those writes complete and the transaction is safe.
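The "hijack ops_run_io" idea reduces to a small state machine per stripe: dirty blocks are diverted to the log first, and only once the transaction is durable are they released to the RAID devices. A toy model, with all names invented:

```c
#include <assert.h>

/* Toy model of diverting stripe writes through the log.  A real
 * implementation would operate on struct stripe_head inside raid5. */
enum stripe_state { STRIPE_DIRTY, STRIPE_IN_LOG, STRIPE_TO_RAID };

struct toy_stripe { enum stripe_state state; };

/* Hijacked I/O path: instead of writing dirty blocks (including
 * parity) to the RAID devices, divert them to the log and delay
 * further processing of the stripe. */
void ops_run_io(struct toy_stripe *sh)
{
    if (sh->state == STRIPE_DIRTY)
        sh->state = STRIPE_IN_LOG;
}

/* Called when the log transaction containing this stripe is safe:
 * write requests can be completed and the blocks written to the RAID. */
void log_write_done(struct toy_stripe *sh)
{
    if (sh->state == STRIPE_IN_LOG)
        sh->state = STRIPE_TO_RAID;
}
```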
> > > >
> > > > [PATCH V4 08/13] raid5: cache recovery support
> > > > Just load all the log into the stripe cache.
> > > >
> > > >
> > > > That should be much more manageable and would be a good start towards
> > > > getting all the functionality that you want.
> > > >
> > > > Feel free to ask questions if I haven't explained things, or if you
> > > > want me to look over the metadata format or whatever. I tend to answer
> > > > direct questions more quickly than I answer 100kB patchsets :-)
> > > Hi,
> > >
> > > So we are back to the original point (i.e., only fixing the write-hole issue) after
> > > 3 months of development, which is really disappointing. I don't object to just fixing
> > > the write-hole issue as a start; it's simple, which was my argument in the first
> > > post too, and probably a slight change to patch 3 meets that goal. The cache
> > > does significantly increase complexity, which was well known when we moved from
> > > 'just fix the write hole' to 'fix the write hole and add a cache'. But the code is right
> > > there, and I wonder how bad it really is. Just because of the stripe_cache? I don't see
> > > any significant technical merit that makes the stripe_cache a must-have here. You said the
> > > code is complex. Yes, it is, but not because of the new data cache. That is a
> > > simple radix tree and takes less than 20% of the total code. The remaining code
> > > does what any log device should do (write to the log, reclaim when the log is full,
> > > recover after a crash). Similar things must be done even if the stripe_cache is
> > > used. And if we don't use the stripe_cache, we aren't deeply coupled to the current
> > > state machine; the new data cache actually brings flexibility (for example, we
> > > can choose not to use the cache, or we can easily put the cache in NVRAM).
> >
> > I don't think we are completely back to the start. The product of
> > recent months isn't just code - it is also your understanding and
> > expertise. You've explored the problems and understand them. You
> > could quickly see issues because you have struggled with them already.
> >
> > I think the principle of "write one to throw away" is a good one - and
> > probably one we don't apply often enough. The first time you implement
> > something you don't really know what you are doing, so the result is
> > bound to be imperfect. The second time you have a much better idea of
> > the big picture.
> >
> > The real big problem that I have with the current proposal is that I
> > just don't understand it. Maybe that is my failing. But all the
> > evidence suggests that I'll end up needing to maintain it, so I *have*
> > to understand it.
> >
> > But let's not assume we are throwing everything away (but equally not
> > promise that we won't) and start with just the first (non trivial)
> > patch. The log.
> >
> > I think we do want a log. I think you have identified the key
> > functionality that the log must provide. But I think the current
> > layout is too complex. This is something that will be persistent on
> > storage devices, so we want to get it as "right" as we can, early.
> > It is fairly cheap to throw away a log and start a new one, so fixing
> > design mistakes isn't too hard. But we do have to support any format
> > we create indefinitely.
> > So simple is best. Simple is most extensible.
> >
> > As I said: A single metadata block format, with:
> > - forward/backward linkage - and probably a link to the "first" block
> > in the log at the time of writing
> > - array state data - briefly
> > - list of data/parity block addresses and maybe checksums
> > - list of "completed" stripes
> >
> > The log is simply these metadata blocks mixed with the data/parity
> > they describe.
> > Each metadata block commits all previous transactions (or maybe all but
> > the most recent).
> > The current md superblock contains the address of one of these metadata
> > blocks.
> >
> > Does that seem reasonable? Does that seem simpler than what you had?
> > Will that meet your perceived need? It meets mine.
> > If we can do that, I'll apply it and we can move on to the next step.
>
> My original layout has 3 types of blocks: a super block, metadata blocks for data/parity,
> and flush blocks listing completed stripes.
>
> Compared to mine, your single metadata block format
> - deletes the super block. I'm fine with this. The only useful bits in my
>   super block are the log tail and seq; we can overload the MD superblock's
>   recovery_offset/resync_offset to record the seq/log tail.
> - merges the metadata block and flush block into one metadata block. That is,
>   the new metadata block can record a mix of data/parity/'completed'
>   stripes. I don't object to this; it's doable, but I don't think it
>   makes things simpler.
>
> If you insist, we can take this approach. The new metadata block can record a
> mix of data/parity/completed stripes, but it can also record just data
> or just completed stripes. So I'll change the disk format to cater for this, but
> I'll always put the 'completed stripes' info into a separate metadata block
> (that is, not mixing it with data/parity, while still having only one
> metadata block type). That way I only need a slight change to the current code and
> it works with the new format. We can, if necessary, let the new metadata block
> record any mix of data/parity/completed stripes later. Sounds good?
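The compromise reached here - one metadata block type whose entries may describe data/parity or completed stripes, even if a writer chooses never to mix them - could be expressed with a per-entry type tag. This is an illustrative sketch, not the format from the patches:

```c
#include <stdint.h>
#include <assert.h>

/* One metadata block type; each index entry carries a tag saying
 * whether it describes a data/parity block in the log or records a
 * stripe now safe on the RAID devices.  A writer is free to emit
 * blocks containing only one kind of entry without changing the
 * on-disk format. */
enum r5log_entry_type {
    R5LOG_DATA      = 1,   /* data block follows in the log */
    R5LOG_PARITY    = 2,   /* parity block follows in the log */
    R5LOG_COMPLETED = 3,   /* stripe now safe on the RAID devices */
};

struct r5log_entry {
    uint16_t type;         /* enum r5log_entry_type */
    uint16_t flags;
    uint32_t checksum;     /* of the payload block; 0 for COMPLETED */
    uint64_t dev_sector;   /* device address */
};
```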
Yes it does. Having a single sort of metadata block is an important
part of the goal. How the code actually chooses to use these is a
separate issue that can change harmlessly.
There is also the issue of what action commits a previous transaction.
I'm not sure what you had. I'm suggesting that each metadata block
commits previous transactions. Is that a close-enough match to what
you had?
Thanks,
NeilBrown