Re: [LSF/MM TOPIC] De-clustered RAID with MD

From: David Brown <david.brown@hesbynett.no>
To: Wols Lists <antlists@youngman.org.uk>, NeilBrown <neilb@suse.com>,
	Johannes Thumshirn <jthumshirn@suse.de>,
	lsf-pc@lists.linux-foundation.org
Cc: linux-raid@vger.kernel.org, linux-block@vger.kernel.org,
	Hannes Reinecke <hare@suse.de>, Neil Brown <neilb@suse.de>
Subject: Re: [LSF/MM TOPIC] De-clustered RAID with MD
Date: Wed, 31 Jan 2018 15:41:11 +0100
Message-ID: <5A71D587.2070409@hesbynett.no>
In-Reply-To: <5A71D24F.9090604@youngman.org.uk>

On 31/01/18 15:27, Wols Lists wrote:
> On 31/01/18 09:58, David Brown wrote:
>> I would also be interested in how the data and parities are distributed
>> across cabinets and disk controllers.  When you manually build from
>> smaller raid sets, you can ensure that in each set the data disks and the
>> parity are all in different cabinets - that way if an entire cabinet
>> goes up in smoke, you have lost one drive from each set, and your data
>> is still there.  With a pseudo random layout, you have lost that.  (I
>> don't know how often entire cabinets of disks die, but I once lost both
>> disks of a raid1 mirror when the disk controller card died.)
> 
> The more I think about how I plan to spec raid-61, the more a modulo
> approach seems to make sense. That way, it'll be fairly easy to predict
> what ends up where, and make sure your disks are evenly scattered.
> 
> I think both your and my approach might have problems with losing an
> entire cabinet, however. Depends on how many drives per cabinet ...

Exactly.  I don't know how many cabinets are used on such systems.

> 
> Anyways, my second thoughts are ...
> 
> We have what I will call a stripe-block. The lowest common multiple of
> "disks needed" ie number of mirrors times number of drives in the
> raid-6, and the disks available.
> 
> Assuming my blocks are all stored sequentially I can then quickly
> calculate their position in this stripe-block. But this will fall foul
> of just hammering the drives nearest to the failed drive. But if I
> pseudo-randomise this position with "position * prime mod drives" where
> "prime" is not common to either the number of drives or the number of
> mirrors or the number of raid-drives, then this should achieve my aim of
> uniquely shuffling the location of all the blocks without collisions.
> 
> Pretty simple maths, for efficiency, that smears the data over all the
> drives. Does that sound feasible? All the heavy lifting, calculating the
> least common multiple, finding the prime, etc etc can be done at array
> set-up time.

Something like that should work, and be convenient to implement.  I am
not sure off the top of my head if such a simple modulo system is valid,
but it won't be difficult to check.
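
As a rough sketch of such a check (not a proposal for the actual MD
layout code): the map "position * multiplier mod drives" hits every
slot exactly once precisely when the multiplier shares no common factor
with the modulus.  The function names below are made up for
illustration, and I am assuming "drives" is whatever count the
positions are taken over.

```python
from math import gcd


def shuffled_position(position: int, multiplier: int, drives: int) -> int:
    """Map a sequential position to position * multiplier mod drives."""
    return (position * multiplier) % drives


def is_collision_free(multiplier: int, drives: int) -> bool:
    """Brute-force check: do positions 0..drives-1 land on distinct slots?"""
    targets = {shuffled_position(p, multiplier, drives) for p in range(drives)}
    return len(targets) == drives


def pick_multiplier(drives: int, start: int = 2) -> int:
    """Smallest multiplier >= start that is coprime to drives.

    Hypothetical set-up-time helper; a real layout would probably also
    want the multiplier coprime to the mirror and raid-6 counts.
    """
    m = start
    while gcd(m, drives) != 1:
        m += 1
    return m


if __name__ == "__main__":
    drives = 100
    m = pick_multiplier(drives)
    print(m, is_collision_free(m, drives))
    # A multiplier sharing a factor with the modulus collides:
    print(is_collision_free(5, drives))
```

Note this only shows that the shuffle is a permutation.  Neighbouring
positions still end up a fixed stride apart, so whether it spreads the
rebuild load well enough is a separate question.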

> 
> (If this then allows feasible 100-drive arrays, we won't just need an
> incremental assemble mode, we might need an incremental build mode :-)
> 

You really want to track which stripes are valid here, and which are not
yet made consistent.  A blank array will start with everything marked
invalid or inconsistent - build mode is just a matter of writing the
metadata.  You only need to make stripes consistent when you first write
to them.
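
A minimal sketch of that bookkeeping, assuming one flag per stripe held
in memory.  The class and method names are hypothetical, and a real
implementation would keep the map in the on-disk metadata rather than
in a Python list.

```python
class LazyStripeMap:
    """Tracks which stripes have been made consistent (illustrative only)."""

    def __init__(self, n_stripes: int) -> None:
        # A blank array starts with every stripe marked inconsistent,
        # so "building" it costs nothing beyond creating this map.
        self.consistent = [False] * n_stripes
        self.initialised = 0  # how many stripes had to be initialised

    def write(self, stripe: int) -> None:
        """Make the stripe consistent on first write, then update it."""
        if not self.consistent[stripe]:
            self._make_consistent(stripe)
        # The actual data and parity update would happen here.

    def _make_consistent(self, stripe: int) -> None:
        # Placeholder for computing and writing parity for the whole stripe.
        self.consistent[stripe] = True
        self.initialised += 1

    def pending(self) -> int:
        """Number of stripes not yet made consistent."""
        return self.consistent.count(False)


if __name__ == "__main__":
    stripes = LazyStripeMap(8)
    stripes.write(3)
    stripes.write(3)  # second write finds the stripe already consistent
    print(stripes.initialised, stripes.pending())
```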
