[LSF/MM TOPIC] De-clustered RAID with MD

From: Johannes Thumshirn <jthumshirn@suse.de>
To: <lsf-pc@lists.linux-foundation.org>
Cc: <linux-raid@vger.kernel.org>, <linux-block@vger.kernel.org>,
	Hannes Reinecke <hare@suse.de>, Neil Brown <neilb@suse.de>
Subject: [LSF/MM TOPIC] De-clustered RAID with MD
Date: Mon, 29 Jan 2018 16:23:07 +0100
Message-ID: <mqdvafkhep0.fsf@linux-x5ow.site>

Hi linux-raid, lsf-pc

(If you've received this mail multiple times, I'm sorry, I'm having
trouble with the mail setup).

As disks keep getting bigger, array rebuild times are skyrocketing.

In a paper from '92, Holland and Gibson [1] suggest a mapping algorithm
similar to RAID5 which, instead of utilizing all disks in an array for
every I/O operation, implements a per-I/O mapping function to only
use a subset of the available disks.
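
To make the idea concrete, here is a minimal sketch of one such layout, a
complete block design that simply enumerates every possible subset of
disks and assigns one subset per stripe. The function names and the
6-disk, 4-wide example are made up for illustration, and parity placement
within a stripe is left out.

```python
from itertools import combinations

def complete_block_layout(n_disks, stripe_width):
    """Return every stripe_width-sized subset of the disks, one per stripe."""
    return list(combinations(range(n_disks), stripe_width))

def disks_for_stripe(layout, stripe_no):
    """Map a stripe number onto its disk subset by cycling through the layout."""
    return layout[stripe_no % len(layout)]

# Hypothetical array: 6 disks, each stripe touches only 4 of them.
layout = complete_block_layout(n_disks=6, stripe_width=4)
for stripe_no in range(3):
    print(stripe_no, disks_for_stripe(layout, stripe_no))
```

Because every subset appears exactly once per cycle, each disk takes part
in the same number of stripes. The table grows combinatorially with the
number of disks, so this is only practical for small arrays.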

This has at least two advantages:
1) If one disk has to be replaced, there is no need to read the data from
   all disks to recover the one failed disk, so non-affected disks can be
   used for real user I/O and not just recovery, and
2) an efficient mapping function can improve parallel I/O submission, as
   two different I/Os are not necessarily going to the same disks in the
   array.
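
As a rough back-of-the-envelope sketch of the first point: if stripes are
spread evenly over the array, each stripe that lost a unit needs its
remaining stripe_width - 1 units read, and that work is shared by the
n_disks - 1 survivors. The helper name and the example sizes below are
assumptions for illustration only.

```python
def rebuild_read_fraction(n_disks, stripe_width):
    """Fraction of each surviving disk that is read to rebuild one failed
    disk, assuming stripes are distributed evenly over all disks."""
    if not 2 <= stripe_width <= n_disks:
        raise ValueError("need 2 <= stripe_width <= n_disks")
    return (stripe_width - 1) / (n_disks - 1)

# Classic RAID5 layout: every stripe spans all 5 disks.
print(rebuild_read_fraction(n_disks=5, stripe_width=5))
# Declustered: the same 5-wide stripes spread over 21 disks.
print(rebuild_read_fraction(n_disks=21, stripe_width=5))
```

In the non-declustered case every survivor is read in full, while in the
declustered case each survivor contributes only a fifth of its contents
and stays mostly available for user I/O.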

For the mapping function, a hashing algorithm like Ceph's CRUSH [2]
would be ideal, as it provides a pseudo-random but deterministic mapping
of the I/O onto the drives.
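
To illustrate what "pseudo-random but deterministic" means here, the
sketch below uses plain highest-random-weight (rendezvous) hashing. This
is not CRUSH, only a much simpler stand-in with the same basic property.
The function name, the choice of SHA-256 and the array sizes are
assumptions made for the example.

```python
import hashlib

def stripe_disks(stripe_no, n_disks, stripe_width):
    """Pick stripe_width distinct disks for a stripe using
    highest-random-weight hashing (a simple stand-in, not CRUSH)."""
    def weight(disk):
        # Hash the (stripe, disk) pair; the same inputs always give the same weight.
        key = f"{stripe_no}:{disk}".encode()
        return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")

    # Rank all disks by their weight for this stripe and keep the top ones.
    ranked = sorted(range(n_disks), key=weight, reverse=True)
    return ranked[:stripe_width]

# Hypothetical array: 16 disks, 5-wide stripes.
for stripe_no in range(4):
    print(stripe_no, stripe_disks(stripe_no, n_disks=16, stripe_width=5))
```

No mapping table has to be stored, since the disk set for any stripe can
be recomputed from the stripe number alone.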

Of course, this whole declustering only makes sense for more than (at
least) 4 drives, but we do have customers with several orders of
magnitude more drives in an MD array.

At LSF I'd like to discuss whether:
1) the wider MD audience is interested in de-clustered RAID with MD
2) de-clustered RAID should be implemented as a sublevel of RAID5 or
   as a new personality
3) CRUSH is a suitable algorithm for this (there's evidence in [3] that
   the NetApp E-Series Arrays do use CRUSH for parity declustering)

[1] http://www.pdl.cmu.edu/PDL-FTP/Declustering/ASPLOS.pdf 
[2] https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
[3]
https://www.snia.org/sites/default/files/files2/files2/SDC2013/presentations/DistributedStorage/Jibbe-Gwaltney_Method-to_Establish_High_Availability.pdf

Thanks,
        Johannes

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
