Re: [LSF/MM TOPIC] De-clustered RAID with MD

From: Johannes Thumshirn <jthumshirn@suse.de>
To: David Brown <david.brown@hesbynett.no>
Cc: NeilBrown <neilb@suse.com>, Wols Lists <antlists@youngman.org.uk>,
	lsf-pc@lists.linux-foundation.org, linux-raid@vger.kernel.org,
	linux-block@vger.kernel.org, Hannes Reinecke <hare@suse.de>,
	Neil Brown <neilb@suse.de>
Subject: Re: [LSF/MM TOPIC] De-clustered RAID with MD
Date: Wed, 31 Jan 2018 11:58:34 +0100
Message-ID: <mqdy3kewazp.fsf@linux-x5ow.site>
In-Reply-To: <5A71933B.1050908@hesbynett.no> (David Brown's message of "Wed, 31 Jan 2018 10:58:19 +0100")

David Brown <david.brown@hesbynett.no> writes:
> That sounds smart.  I don't see that you need anything particularly
> complicated for how you distribute your data and parity drives across
> the 100 disks - you just need a fairly even spread.

Exactly.

> I would be more concerned with how you could deal with resizing such an
> array.  In particular, I think it is not unlikely that someone with a
> 100 drive array will one day want to add another bank of 24 disks (or
> whatever fits in a cabinet).  Making that work nicely would, I believe,
> be more important than making sure the rebuild load distribution is
> balanced evenly across 99 drives.

I don't think rebuilding is such a big deal. Let's consider the following
hypothetical scenario:

6 disks with 4 data blocks (3 replicas per block; these could be
RAID1-like duplicates or RAID5-like data + parity, it doesn't matter
for this example):

D1  D2  D3  D4  D5  D6
[A] [B] [C] [ ] [ ] [ ]
[ ] [ ] [ ] [A] [D] [B] 
[ ] [A] [B] [ ] [C] [ ]
[C] [ ] [ ] [D] [ ] [D]

Now we add one disk and rebalance:

D1  D2  D3  D4  D5  D6  D7
[A] [B] [C] [ ] [ ] [ ] [A]
[ ] [ ] [ ] [ ] [D] [B] [ ]
[ ] [A] [B] [ ] [ ] [ ] [C]
[C] [ ] [ ] [D] [ ] [D] [ ]

This moved the "A" from D4 and the "C" from D5 to D7. The whole
rebalancing affected only 3 disks (read from D4 and D5, write to D7).
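
As a rough illustration of the example above, the following Python sketch
models both layouts as a mapping from block to the set of disks holding a
copy, then derives which copies moved and which disks were touched. The
names plan_moves and touched_disks are hypothetical helpers made up for
this sketch; nothing here reflects how MD would actually do it.

```python
# Layouts from the example: block -> disks holding one of its 3 copies.
before = {
    "A": {"D1", "D4", "D2"},
    "B": {"D2", "D6", "D3"},
    "C": {"D3", "D5", "D1"},
    "D": {"D5", "D4", "D6"},
}
after = {
    "A": {"D1", "D7", "D2"},
    "B": {"D2", "D6", "D3"},
    "C": {"D3", "D7", "D1"},
    "D": {"D5", "D4", "D6"},
}


def plan_moves(before, after):
    """Return (block, source_disk, target_disk) for every copy that moved."""
    moves = []
    for block in sorted(before):
        sources = sorted(before[block] - after[block])  # disks that lost a copy
        targets = sorted(after[block] - before[block])  # disks that gained one
        moves.extend((block, s, t) for s, t in zip(sources, targets))
    return moves


def touched_disks(moves):
    """Return every disk that is read from or written to during the moves."""
    return sorted({disk for _, src, dst in moves for disk in (src, dst)})


moves = plan_moves(before, after)
print(moves)                 # [('A', 'D4', 'D7'), ('C', 'D5', 'D7')]
print(touched_disks(moves))  # ['D4', 'D5', 'D7']
```

The other four disks are neither read nor written, which is the point of
the example: adding a disk only disturbs the copies that migrate to it.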

> I would also be interested in how the data and parities are distributed
> across cabinets and disk controllers.  When you manually build from
> smaller raid sets, you can ensure that in set the data disks and the
> parity are all in different cabinets - that way if an entire cabinet
> goes up in smoke, you have lost one drive from each set, and your data
> is still there.  With a pseudo random layout, you have lost that.  (I
> don't know how often entire cabinets of disks die, but I once lost both
> disks of a raid1 mirror when the disk controller card died.)

Well, this is something CRUSH takes care of. As I said earlier, it's a
weighted decision tree. One of the weights could be to evenly spread all
blocks across two cabinets.

Taking this into account would require a non-trivial user interface and
I'm not sure if the benefits of this outweigh the costs (at least for
an initial implementation).
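
To make the cabinet-aware placement idea concrete, here is a minimal
Python sketch. It is not CRUSH and it uses no weights: it assumes plain
rendezvous (highest-random-weight) hashing plus a one-copy-per-cabinet
rule. The two-cabinet topology, the two-copy mirror, and the names score
and place are all hypothetical and chosen only for illustration.

```python
import hashlib

# Hypothetical topology: two cabinets with three disks each.
CABINETS = {
    "cab0": ["D1", "D2", "D3"],
    "cab1": ["D4", "D5", "D6"],
}


def score(block, name):
    """Deterministic pseudo-random score for a (block, name) pair."""
    digest = hashlib.sha256(f"{block}:{name}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(block, cabinets, copies):
    """Pick `copies` disks for `block`, each from a different cabinet."""
    if copies > len(cabinets):
        raise ValueError("more copies than cabinets")
    # Rank the cabinets for this block, then take the best-scoring disk
    # inside each of the top `copies` cabinets.
    ranked = sorted(cabinets, key=lambda cab: score(block, cab), reverse=True)
    return [
        max(cabinets[cab], key=lambda disk: score(block, disk))
        for cab in ranked[:copies]
    ]


for block in "ABCD":
    print(block, place(block, CABINETS, copies=2))
```

Because every copy of a block lands in a different cabinet, losing one
cabinet costs each block at most one copy. The placement is computed from
the block name alone, so no lookup table has to be stored.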

Byte,
        Johannes
-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
