* [PATCH 1 of 2] MD RAID10: Improve redundancy for 'far' and 'offset' algorithms
From: Jonathan Brassow @ 2012-12-12 16:45 UTC (permalink / raw)
To: linux-raid; +Cc: neilb, agk, jbrassow
MD RAID10: Improve redundancy for 'far' and 'offset' algorithms
The MD RAID10 'far' and 'offset' algorithms make copies of entire stripe
widths - copying them to a different location on the same devices after
shifting the stripe. An example layout of each follows below:
"far" algorithm
dev1 dev2 dev3 dev4 dev5 dev6
==== ==== ==== ==== ==== ====
A B C D E F
G H I J K L
...
F A B C D E --> Copy of stripe0, but shifted by 1
L G H I J K
...
"offset" algorithm
dev1 dev2 dev3 dev4 dev5 dev6
==== ==== ==== ==== ==== ====
A B C D E F
F A B C D E --> Copy of stripe0, but shifted by 1
G H I J K L
L G H I J K
...
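
These placements are easy to reproduce; the following stand-alone
user-space sketch (illustrative only, not part of the patch) prints the
first stripe and its copy row for a given shift on 6 devices:

#include <stdio.h>

#define DISKS  6
#define COPIES 2

static void print_layout(int shift)
{
        /* one row per copy; row f shows the chunk whose copy f
         * lands on each device */
        for (int f = 0; f < COPIES; f++) {
                for (int d = 0; d < DISKS; d++)
                        printf("%c    ",
                               'A' + (d - (f * shift) % DISKS + DISKS) % DISKS);
                printf("\n");
        }
}

int main(void)
{
        print_layout(1);        /* current placement: copies shifted by 1 */
        print_layout(3);        /* proposed: shifted by raid_disks/far_copies */
        return 0;
}

print_layout(1) reproduces the rows above; print_layout(3) produces the
placement proposed below.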
Redundancy for these algorithms is gained by shifting the copied stripes
a certain number of devices - in this case, 1. This patch proposes that the
number of devices the copy is shifted by be changed from:
device# + near_copies
to
device# + raid_disks/far_copies
The above "far" algorithm example would now look like:
"far" algorithm
dev1 dev2 dev3 dev4 dev5 dev6
==== ==== ==== ==== ==== ====
A B C D E F
G H I J K L
...
D E F A B C --> Copy of stripe0, but shifted by 3
J K L G H I
...
This has the effect of improving the redundancy of the array. We can
always sustain at least one failure, but sometimes more than one can
be handled. In the first examples, the pairs of devices that CANNOT fail
together are:
(1,2) (2,3) (3,4) (4,5) (5,6) (1,6) [40% of possible pairs]
In the example where the copies are instead shifted by 3, the pairs of
devices that cannot fail together are:
(1,4) (2,5) (3,6) [20% of possible pairs]
Performing shifting in this way produces more redundancy and works especially
well when the number of devices is a multiple of the number of copies.
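
These pair counts can be checked mechanically. With two copies, the only
pairs that cannot fail together are {d, (d + shift) % disks}; a rough
user-space sketch (illustrative only) that enumerates them:

#include <stdio.h>

#define DISKS 6

static void unsafe_pairs(int shift)
{
        int seen[DISKS][DISKS] = { { 0 } };
        int n = 0;

        for (int d = 0; d < DISKS; d++) {
                int b = (d + shift) % DISKS;
                int lo = d < b ? d : b, hi = d < b ? b : d;

                if (!seen[lo][hi]++) {          /* count each pair once */
                        printf("(%d,%d) ", lo + 1, hi + 1);
                        n++;
                }
        }
        printf("-> %d of %d pairs\n", n, DISKS * (DISKS - 1) / 2);
}

int main(void)
{
        unsafe_pairs(1);        /* 6 of 15 pairs (40%) */
        unsafe_pairs(3);        /* 3 of 15 pairs (20%) */
        return 0;
}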
We cannot simply replace the old algorithms, so bit 17 of the 'layout'
variable is used to indicate whether the old or new method of computing the
shift is used. (This is similar to the way bit 16 indicates whether the
"far" algorithm or the "offset" algorithm is being used.)
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
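
For reference, a small sketch of how the layout word decodes with the new
bit (the masks for the existing fields match setup_geo(); the example
value is made up):

#include <stdio.h>

static void decode_layout(int layout)
{
        int nc = layout & 255;                  /* near_copies (low byte) */
        int fc = (layout >> 8) & 255;           /* far_copies (second byte) */
        int fo = !!(layout & (1 << 16));        /* far_offset */
        int ds = !!(layout & (1 << 17));        /* new: raid_disks/fc stride */

        printf("nc=%d fc=%d fo=%d new_stride=%d\n", nc, fc, fo, ds);
}

int main(void)
{
        decode_layout((1 << 17) | (2 << 8) | 1);  /* 'far' 2 copies, new bit */
        return 0;
}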
Index: linux-upstream/drivers/md/raid10.c
===================================================================
--- linux-upstream.orig/drivers/md/raid10.c
+++ linux-upstream/drivers/md/raid10.c
@@ -38,6 +38,7 @@
* near_copies (stored in low byte of layout)
* far_copies (stored in second byte of layout)
* far_offset (stored in bit 16 of layout )
+ * dev_stride (stored in bit 17 of layout )
*
* The data to be stored is divided into chunks using chunksize.
* Each device is divided into far_copies sections.
@@ -51,8 +52,14 @@
* raid_disks.
*
* If far_offset is true, then the far_copies are handled a bit differently.
- * The copies are still in different stripes, but instead of be very far apart
- * on disk, there are adjacent stripes.
+ * The copies are still in different stripes, but instead of being very far
+ * apart on disk, there are adjacent stripes.
+ *
+ * If dev_stride is true, then the device on which each copy
+ * is placed for the 'far' and 'offset' algorithms changes from
+ * 'device# + near_copies'
+ * to
+ * 'device# + raid_disks/far_copies'.
*/
/*
@@ -552,14 +559,13 @@ static void __raid10_find_phys(struct ge
for (n = 0; n < geo->near_copies; n++) {
int d = dev;
sector_t s = sector;
- r10bio->devs[slot].addr = sector;
r10bio->devs[slot].devnum = d;
+ r10bio->devs[slot].addr = s;
slot++;
for (f = 1; f < geo->far_copies; f++) {
- d += geo->near_copies;
- if (d >= geo->raid_disks)
- d -= geo->raid_disks;
+ d += geo->dev_stride;
+ d %= geo->raid_disks;
s += geo->stride;
r10bio->devs[slot].devnum = d;
r10bio->devs[slot].addr = s;
@@ -601,16 +607,16 @@ static sector_t raid10_find_virt(struct
int fc;
chunk = sector >> geo->chunk_shift;
fc = sector_div(chunk, geo->far_copies);
- dev -= fc * geo->near_copies;
+ dev -= fc * geo->dev_stride;
if (dev < 0)
dev += geo->raid_disks;
} else {
while (sector >= geo->stride) {
sector -= geo->stride;
- if (dev < geo->near_copies)
- dev += geo->raid_disks - geo->near_copies;
+ if (dev < geo->dev_stride)
+ dev += geo->raid_disks - geo->dev_stride;
else
- dev -= geo->near_copies;
+ dev -= geo->dev_stride;
}
chunk = sector >> geo->chunk_shift;
}
@@ -3437,7 +3443,7 @@ static int setup_geo(struct geom *geo, s
disks = mddev->raid_disks + mddev->delta_disks;
break;
}
- if (layout >> 17)
+ if (layout >> 18)
return -1;
if (chunk < (PAGE_SIZE >> 9) ||
!is_power_of_2(chunk))
@@ -3449,6 +3455,10 @@ static int setup_geo(struct geom *geo, s
geo->near_copies = nc;
geo->far_copies = fc;
geo->far_offset = fo;
+ if (layout & (1<<17))
+ geo->dev_stride = disks / fc;
+ else
+ geo->dev_stride = geo->near_copies;
geo->chunk_mask = chunk - 1;
geo->chunk_shift = ffz(~chunk);
return nc*fc;
Index: linux-upstream/drivers/md/raid10.h
===================================================================
--- linux-upstream.orig/drivers/md/raid10.h
+++ linux-upstream/drivers/md/raid10.h
@@ -33,6 +33,10 @@ struct r10conf {
* far_offset, in which case it is
* 1 stripe.
*/
+ int dev_stride; /* distance to the next device
+ * on which a "far" or "offset"
+ * copy will be placed.
+ */
int chunk_shift; /* shift from chunks to sectors */
sector_t chunk_mask;
} prev, geo;
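
As a sanity check that the new stride still inverts cleanly, here is a
stand-alone model of the two mappings (user-space sketch, simplified to
near_copies == 1; it mirrors the arithmetic in the patch, not the kernel
code itself):

#include <assert.h>

#define RAID_DISKS 6
#define FAR_COPIES 2
#define DEV_STRIDE (RAID_DISKS / FAR_COPIES)    /* new layout: 3 */

/* device holding far copy f of the chunk whose first copy is on d,
 * per the loop in __raid10_find_phys() */
static int phys_dev(int d, int f)
{
        return (d + f * DEV_STRIDE) % RAID_DISKS;
}

/* recover the first-copy device from copy f's device,
 * per the subtraction in raid10_find_virt() */
static int virt_dev(int d, int f)
{
        return (d - (f * DEV_STRIDE) % RAID_DISKS + RAID_DISKS) % RAID_DISKS;
}

int main(void)
{
        for (int d = 0; d < RAID_DISKS; d++)
                for (int f = 0; f < FAR_COPIES; f++)
                        assert(virt_dev(phys_dev(d, f), f) == d);
        return 0;
}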
* Re: [PATCH 1 of 2] MD RAID10: Improve redundancy for 'far' and 'offset' algorithms
From: David Brown @ 2012-12-12 21:59 UTC (permalink / raw)
To: Jonathan Brassow; +Cc: linux-raid, neilb, agk
On 12/12/12 17:45, Jonathan Brassow wrote:
> MD RAID10: Improve redundancy for 'far' and 'offset' algorithms
>
> The MD RAID10 'far' and 'offset' algorithms make copies of entire stripe
> widths - copying them to a different location on the same devices after
> shifting the stripe. An example layout of each follows below:
>
> "far" algorithm
> dev1 dev2 dev3 dev4 dev5 dev6
> ==== ==== ==== ==== ==== ====
> A B C D E F
> G H I J K L
> ...
> F A B C D E --> Copy of stripe0, but shifted by 1
> L G H I J K
> ...
>
> "offset" algorithm
> dev1 dev2 dev3 dev4 dev5 dev6
> ==== ==== ==== ==== ==== ====
> A B C D E F
> F A B C D E --> Copy of stripe0, but shifted by 1
> G H I J K L
> L G H I J K
> ...
>
> Redundancy for these algorithms is gained by shifting the copied stripes
> a certain number of devices - in this case, 1. This patch proposes that the
> number of devices the copy is shifted by be changed from:
> device# + near_copies
> to
> device# + raid_disks/far_copies
>
> The above "far" algorithm example would now look like:
> "far" algorithm
> dev1 dev2 dev3 dev4 dev5 dev6
> ==== ==== ==== ==== ==== ====
> A B C D E F
> G H I J K L
> ...
> D E F A B C --> Copy of stripe0, but shifted by 3
> J K L G H I
> ...
>
> This has the effect of improving the redundancy of the array. We can
> always sustain at least one failure, but sometimes more than one can
> be handled. In the first examples, the pairs of devices that CANNOT fail
> together are:
> (1,2) (2,3) (3,4) (4,5) (5,6) (1, 6) [40% of possible pairs]
> In the example where the copies are instead shifted by 3, the pairs of
> devices that cannot fail together are:
> (1,4) (2,5) (3,6) [20% of possible pairs]
>
> Performing shifting in this way produces more redundancy and works especially
> well when the number of devices is a multiple of the number of copies.
>
> We cannot simply replace the old algorithms, so bit 17 of the 'layout'
> variable is used to indicate whether the old or new method of computing the
> shift is used. (This is similar to the way bit 16 indicates whether the
> "far" algorithm or the "offset" algorithm is being used.)
>
As far as I can see, this new layout will also improve the speed of
small operations on the array. With the original layout, if you want to
write blocks A, B and C, then you are writing once to disks 1 and 4, and
twice to disks 2 and 3. With the new layout, you are writing once to
each disk - which is obviously going to be faster (especially for the
far layout). It might not be a big effect, but it's a nice bonus.
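
A rough way to check the write counts (user-space sketch, two copies,
purely illustrative):

#include <stdio.h>

#define DISKS 6

/* count writes per device for a write covering chunks 0..nchunks-1 */
static void write_counts(int shift, int nchunks)
{
        int writes[DISKS] = { 0 };

        for (int c = 0; c < nchunks; c++) {
                writes[c % DISKS]++;                    /* primary copy */
                writes[(c + shift) % DISKS]++;          /* far copy */
        }
        for (int d = 0; d < DISKS; d++)
                printf("dev%d:%d ", d + 1, writes[d]);
        printf("\n");
}

int main(void)
{
        write_counts(1, 3);     /* old: two writes each to devs 2 and 3 */
        write_counts(3, 3);     /* new: one write to each of devs 1-6 */
        return 0;
}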
* Re: [PATCH 1 of 2] MD RAID10: Improve redundancy for 'far' and 'offset' algorithms
From: NeilBrown @ 2012-12-13 1:23 UTC (permalink / raw)
To: Jonathan Brassow; +Cc: linux-raid, agk
On Wed, 12 Dec 2012 10:45:05 -0600 Jonathan Brassow <jbrassow@redhat.com>
wrote:
> MD RAID10: Improve redundancy for 'far' and 'offset' algorithms
>
> The MD RAID10 'far' and 'offset' algorithms make copies of entire stripe
> widths - copying them to a different location on the same devices after
> shifting the stripe. An example layout of each follows below:
>
> "far" algorithm
> dev1 dev2 dev3 dev4 dev5 dev6
> ==== ==== ==== ==== ==== ====
> A B C D E F
> G H I J K L
> ...
> F A B C D E --> Copy of stripe0, but shifted by 1
> L G H I J K
> ...
>
> "offset" algorithm
> dev1 dev2 dev3 dev4 dev5 dev6
> ==== ==== ==== ==== ==== ====
> A B C D E F
> F A B C D E --> Copy of stripe0, but shifted by 1
> G H I J K L
> L G H I J K
> ...
>
> Redundancy for these algorithms is gained by shifting the copied stripes
> a certain number of devices - in this case, 1. This patch proposes that the
> number of devices the copy is shifted by be changed from:
> device# + near_copies
> to
> device# + raid_disks/far_copies
>
> The above "far" algorithm example would now look like:
> "far" algorithm
> dev1 dev2 dev3 dev4 dev5 dev6
> ==== ==== ==== ==== ==== ====
> A B C D E F
> G H I J K L
> ...
> D E F A B C --> Copy of stripe0, but shifted by 3
> J K L G H I
> ...
>
> This has the effect of improving the redundancy of the array. We can
> always sustain at least one failure, but sometimes more than one can
> be handled. In the first examples, the pairs of devices that CANNOT fail
> together are:
> (1,2) (2,3) (3,4) (4,5) (5,6) (1, 6) [40% of possible pairs]
> In the example where the copies are instead shifted by 3, the pairs of
> devices that cannot fail together are:
> (1,4) (2,5) (3,6) [20% of possible pairs]
>
> Performing shifting in this way produces more redundancy and works especially
> well when the number of devices is a multiple of the number of copies.
Unfortunately it doesn't bring any benefit (I think) when the number of
devices is not a multiple of the number of copies. And if we are going to
make a change, we should do the best we can.
An approach that has previously been suggested is to divide the devices up
into sets which are ncopies in size, or (for the last set) a little more,
and rotate within those sets.
So with 5 devices and two copies there are 2 sets, one of 2, one of 3.
A B C D E
B A D E C
The only pairs where we cannot survive failure of both are pairs that are in
the same set. This is as good as your scheme when the number of copies
divides raid_disks, but better when it doesn't.
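
A rough user-space model of that placement (an illustrative sketch of
the suggestion, not existing code); it prints the second row above:

#include <stdio.h>

/* split raid_disks into sets of ncopies devices - the last set absorbs
 * any remainder - and rotate the copy within each set */
static int copy_dev(int d, int raid_disks, int ncopies)
{
        int nsets = raid_disks / ncopies;
        int set = d / ncopies;

        if (set >= nsets)
                set = nsets - 1;

        int base = set * ncopies;
        int size = (set == nsets - 1) ? raid_disks - base : ncopies;

        /* the chunk at in-set position p is copied to p - 1 (mod size) */
        return base + (d - base + size - 1) % size;
}

int main(void)
{
        char row[5];

        for (int d = 0; d < 5; d++)
                row[copy_dev(d, 5, 2)] = 'A' + d;
        for (int d = 0; d < 5; d++)
                printf("%c ", row[d]);
        printf("\n");           /* -> B A D E C */
        return 0;
}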
So unless there is a good reason not to, I would rather we go with the scheme
that gives the best in all cases.
NeilBrown
* Re: [PATCH 1 of 2] MD RAID10: Improve redundancy for 'far' and 'offset' algorithms
From: Brassow Jonathan @ 2012-12-14 0:10 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid, agk
On Dec 12, 2012, at 7:23 PM, NeilBrown wrote:
>>
>> This has the effect of improving the redundancy of the array. We can
>> always sustain at least one failure, but sometimes more than one can
>> be handled. In the first examples, the pairs of devices that CANNOT fail
>> together are:
>> (1,2) (2,3) (3,4) (4,5) (5,6) (1, 6) [40% of possible pairs]
>> In the example where the copies are instead shifted by 3, the pairs of
>> devices that cannot fail together are:
>> (1,4) (2,5) (3,6) [20% of possible pairs]
>>
>> Performing shifting in this way produces more redundancy and works especially
>> well when the number of devices is a multiple of the number of copies.
>
> Unfortunately it doesn't bring any benefit (I think) when the number of
> devices is not a multiple of the number of copies. And if we are going to
> make a change, we should do the best we can.
>
> An approach that has previously been suggested is to divide the devices up
> into sets which are ncopies in size, or (for the last set) a little more,
> and rotate within those sets.
> So with 5 devices and two copies there are 2 sets, one of 2, one of 3.
>
> A B C D E
> B A D E C
>
> The only pairs where we cannot survive failure of both are pairs that are in
> the same set. This is as good as your scheme when the number of copies
> divides raid_disks, but better when it doesn't.
>
> So unless there is a good reason not to, I would rather we go with the scheme
> that gives the best in all cases.
I think I've got something now that works like this. I've got to test and re-document. I'll repost tomorrow or the day after.
brassow