From: NeilBrown <neilb@suse.com>
To: Wol's lists <antlists@youngman.org.uk>,
mdraid <linux-raid@vger.kernel.org>
Subject: Re: RFC - de-clustered raid 60 or 61 algorithm
Date: Thu, 08 Feb 2018 14:14:29 +1100
Message-ID: <876078maui.fsf@notabene.neil.brown.name>
In-Reply-To: <9f9e737c-d6d1-5cce-8190-14d970320265@youngman.org.uk>
On Thu, Feb 08 2018, Wol's lists wrote:
> After the de-clustered thread, Neil said it would probably only take a
> small amount of coding to do something like that. It was also discussed
> about spreading the load over disks on a reconstruction if there were a
> lot of disks in an array. I've been trying to get my head round a simple
> algorithm to smear data over the disks along the lines of raid-10.
>
> Basically, the idea is to define a logical stripe which is a multiple of
> both the number of physical disks, and of the number of logical disks.
> Within this logical stripe the blocks are shuffled using prime numbers
> to make sure we don't get a pathological shuffle.
>
> At present, I've defined the logical stripe to be simply the product of
> the number of logical disks times the number of mirrors times the number
> of physical disks. We could shrink this by removing common factors, but
> we don't need to.
>
> Given a logical block within this stripe, its physical position is
> calculated by the simple equation "logical block * prime mod logical
> stripe size". So long as the "prime" does not share any factors with the
> logical stripe size, then (with one exception) you're not going to get
> hash collisions, and you're not going to get more than one block per
> stripe stored on each drive. The exception, of course, is if the
> number of physical disks is not greater than the number of logical
> disks. Having the two equal is especially dangerous, as users will
> not expect the pathological behaviour they will get - multiple
> blocks per stripe stored on the same disk.
> mdadm will need to detect and reject this layout. I think the best
> behaviour will be found where logical disks, mirrors, and physical disks
> don't share any prime factors.
>
> I've been playing with a mirror setup, and if we have two mirrors, we
> can rebuild any failed disk by copying from two other drives. I think
> also (I haven't looked at it) that you could do a fast rebuild without
> impacting other users of the system too much provided you don't swamp
> i/o bandwidth, as half of the requests for data on the three drives
> being used for rebuilding could actually be satisfied from other drives.
I think that ends up being much the same result as a current raid10
where the number of copies doesn't divide the number of devices.
Reconstruction reads come from 2 different devices, and half the reads
that would go to them now go elsewhere.
I think that if you take your solution and a selection of different
"prime" numbers, and rotate through the primes from stripe to stripe,
you can expect a more even distribution of load.
You hint at this below when you suggest that adding the "*prime" doesn't
add much. I think it adds a lot more when you start rotating the primes
across the stripes.
>
> Rebuilding a striped raid such as a raid-60 also looks like it would
> spread the load.
>
> The one thing that bothers me is that I don't know whether the "* prime
> mod" logic actually adds very much - whether we can just stripe stuff
> across like we do with raid-10. Where it will score is in a storage
> assembly that is a cluster of clusters of disks. Say you have a
> controller with ten disks, and ten controllers in a rack, a suitable
> choice of prime and logical stripe could ensure the rack would survive
> losing a controller. And given that dealing with massive arrays is what
> this algorithm is about, that seems worthwhile.
The case of multiple failures is my major concern with the whole idea.
If failures were truly independent, then losing 3 in 100 at the same
time is probably quite unlikely. But we all know there are various
factors that can cause failures to correlate - controller failure being
just one of them.
Maybe if you have dual-attached fibre channel to each drive, and you get
the gold-plated drives that have actually been exhaustively tested....
What do large installations do? I assumed they had lots of modest
raid5 arrays and then mirrored between pairs of those (for data they
actually care about). Maybe I'm wrong.
NeilBrown
>
> Anyways, here's my simple code that demonstrates the algorithm and
> prints out how the blocks will be laid out. Is it a good idea? I'd like
> to think so ...
>
> Cheers,
> Wol
>
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(void)
> {
> 	int disks, logdisks, mirrors, prime;
>
> 	printf("Enter the number of disks ");
> 	scanf("%d", &disks);
> 	printf("Enter the number of logical disks ");
> 	scanf("%d", &logdisks);
> 	printf("Enter the number of mirrors ");
> 	scanf("%d", &mirrors);
> 	printf("Enter the prime ");
> 	scanf("%d", &prime);
>
> 	/* One logical stripe covers this many blocks. */
> 	int blocks = logdisks * mirrors * disks;
> 	int *array = malloc(sizeof(int) * blocks);
>
> 	/* The shuffle: multiply by the prime, mod the stripe size. */
> 	for (int i = 0; i < blocks; i++)
> 		array[i] = (i * prime) % blocks;
>
> 	/* Print the layout: each column is a physical disk, each
> 	 * entry is the logical stripe number followed by the
> 	 * logical disk as a letter. */
> 	for (int i = 0; i < blocks; i++) {
> 		if ((i % disks) == 0)
> 			printf("\n");
>
> 		int logstripe = array[i] / (mirrors * logdisks);
> 		int logdisk = array[i] % logdisks;
> 		printf(" %02d%c", logstripe, 'A' + logdisk);
> 	}
> 	printf("\n");
>
> 	free(array);
> 	return 0;
> }