* [LSF/MM TOPIC] De-clustered RAID with MD
From: Johannes Thumshirn @ 2018-01-29 15:23 UTC
To: lsf-pc
Cc: linux-raid, linux-block, Hannes Reinecke, Neil Brown

Hi linux-raid, lsf-pc

(If you've received this mail multiple times, I'm sorry, I'm having
trouble with the mail setup.)

With the rise of bigger and bigger disks, array rebuilding times start
skyrocketing.

In a paper from '92, Holland and Gibson [1] suggest a mapping algorithm
similar to RAID5, but instead of utilizing all disks in an array for
every I/O operation, they implement a per-I/O mapping function that uses
only a subset of the available disks.

This has at least two advantages:
1) If one disk has to be replaced, the data does not need to be read from
   all disks to recover the failed disk, so non-affected disks can be
   used for real user I/O and not just recovery, and
2) an efficient mapping function can improve parallel I/O submission, as
   two different I/Os are not necessarily going to the same disks in the
   array.

For the mapping function, a hashing algorithm like Ceph's CRUSH [2]
would be ideal, as it provides a pseudo-random but deterministic mapping
of the I/O onto the drives.

This whole declustering of course only makes sense for more than (at
least) 4 drives, but we do have customers with several orders of
magnitude more drives in an MD array.

At LSF I'd like to discuss whether:
1) the wider MD audience is interested in de-clustered RAID with MD,
2) de-clustered RAID should be implemented as a sublevel of RAID5 or
   as a new personality, and
3) CRUSH is a suitable algorithm for this (there's evidence in [3] that
   the NetApp E-Series arrays do use CRUSH for parity declustering).

[1] http://www.pdl.cmu.edu/PDL-FTP/Declustering/ASPLOS.pdf
[2] https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
[3] https://www.snia.org/sites/default/files/files2/files2/SDC2013/presentations/DistributedStorage/Jibbe-Gwaltney_Method-to_Establish_High_Availability.pdf

Thanks,
	Johannes
--
Johannes Thumshirn                                            Storage
jthumshirn@suse.de                                  +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
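
A minimal sketch of the kind of mapping the proposal describes: a
pseudo-random but deterministic function that places each logical stripe
on a small subset of a large pool of drives. This is not CRUSH and not
MD code; the mixer, the 100-drive pool and the 6+2 chunk count are all
made-up parameters for illustration only.

#include <stdio.h>
#include <stdint.h>

/* cheap 64-bit mixer (splitmix64 finalizer), stands in for a real hash */
static uint64_t mix(uint64_t x)
{
	x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
	x ^= x >> 27; x *= 0x94d049bb133111ebULL;
	return x ^ (x >> 31);
}

/* Fill map[0..k-1] with k distinct drives out of n for logical stripe s. */
static void map_stripe(uint64_t s, int n, int k, int *map)
{
	for (int slot = 0; slot < k; slot++) {
		uint64_t r = 0;
		int d, dup;

		do {	/* re-hash until we find a drive not already used */
			d = mix(s * 1000003ULL + slot * 7919ULL + r++) % n;
			dup = 0;
			for (int i = 0; i < slot; i++)
				if (map[i] == d)
					dup = 1;
		} while (dup);
		map[slot] = d;
	}
}

int main(void)
{
	int map[8];	/* 6 data + 2 parity chunks per logical stripe */

	for (uint64_t s = 0; s < 4; s++) {
		map_stripe(s, 100, 8, map);
		printf("stripe %llu ->", (unsigned long long)s);
		for (int i = 0; i < 8; i++)
			printf(" %d", map[i]);
		printf("\n");
	}
	return 0;
}

The same stripe number always yields the same member set (so reads need
no lookup table), while different stripes land on different subsets.
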
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
From: Wols Lists @ 2018-01-29 16:32 UTC
To: Johannes Thumshirn, lsf-pc
Cc: linux-raid, linux-block, Hannes Reinecke, Neil Brown

On 29/01/18 15:23, Johannes Thumshirn wrote:
> Hi linux-raid, lsf-pc
>
> (If you've received this mail multiple times, I'm sorry, I'm having
> trouble with the mail setup.)

My immediate reactions as a lay person (I edit the raid wiki) ...

> With the rise of bigger and bigger disks, array rebuilding times start
> skyrocketing.

And? Yes, your data is at risk during a rebuild, but md-raid throttles
the i/o, so it doesn't hammer the system.

> In a paper from '92, Holland and Gibson [1] suggest a mapping algorithm
> similar to RAID5, but instead of utilizing all disks in an array for
> every I/O operation, they implement a per-I/O mapping function that uses
> only a subset of the available disks.
>
> This has at least two advantages:
> 1) If one disk has to be replaced, the data does not need to be read from
>    all disks to recover the failed disk, so non-affected disks can be
>    used for real user I/O and not just recovery, and

Again, that's throttling, so that's not a problem ...

> 2) an efficient mapping function can improve parallel I/O submission, as
>    two different I/Os are not necessarily going to the same disks in the
>    array.
>
> For the mapping function, a hashing algorithm like Ceph's CRUSH [2]
> would be ideal, as it provides a pseudo-random but deterministic mapping
> of the I/O onto the drives.
>
> This whole declustering of course only makes sense for more than (at
> least) 4 drives, but we do have customers with several orders of
> magnitude more drives in an MD array.

If you have four drives or more - especially if they are multi-terabyte
drives - you should NOT be using raid-5 ...

> At LSF I'd like to discuss whether:
> 1) the wider MD audience is interested in de-clustered RAID with MD

I haven't read the papers, so no comment, sorry.

> 2) de-clustered RAID should be implemented as a sublevel of RAID5 or
>    as a new personality

Neither! If you're going to do it, it should be raid-6.

> 3) CRUSH is a suitable algorithm for this (there's evidence in [3] that
>    the NetApp E-Series arrays do use CRUSH for parity declustering)
>
> [1] http://www.pdl.cmu.edu/PDL-FTP/Declustering/ASPLOS.pdf
> [2] https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
> [3] https://www.snia.org/sites/default/files/files2/files2/SDC2013/presentations/DistributedStorage/Jibbe-Gwaltney_Method-to_Establish_High_Availability.pdf

Okay - I've now skimmed the CRUSH paper [2]. Looks well interesting.
BUT. It feels more like btrfs than it does like raid.

Btrfs manages disks, and does raid; it tries to be the "everything
between the hard drive and the file". This CRUSH thingy reads to me like
it wants to be the same. There's nothing wrong with that, but md is a
unix-y "do one thing (raid) and do it well".

My knee-jerk reaction is if you want to go for it, it sounds like a good
idea. It just doesn't really feel a good fit for md.

Cheers,
Wol
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
From: NeilBrown @ 2018-01-29 21:50 UTC
To: Wols Lists, Johannes Thumshirn, lsf-pc
Cc: linux-raid, linux-block, Hannes Reinecke, Neil Brown

On Mon, Jan 29 2018, Wols Lists wrote:

> On 29/01/18 15:23, Johannes Thumshirn wrote:
>> With the rise of bigger and bigger disks, array rebuilding times start
>> skyrocketing.
>
> And? Yes, your data is at risk during a rebuild, but md-raid throttles
> the i/o, so it doesn't hammer the system.
>
>> This has at least two advantages:
>> 1) If one disk has to be replaced, the data does not need to be read from
>>    all disks to recover the failed disk, so non-affected disks can be
>>    used for real user I/O and not just recovery, and
>
> Again, that's throttling, so that's not a problem ...

Imagine an array with 100 drives on which we store data in sets of
(say) 6 data chunks and 2 parity chunks.
Each group of 8 chunks is distributed over the 100 drives in a
different way, so that (e.g.) 600 data chunks and 200 parity chunks are
distributed over 8 physical stripes using some clever distribution
function.
If (when) one drive fails, the 8 chunks in this set of 8 physical
stripes can be recovered by reading 6*8 == 48 chunks, which will each be
on a different drive. Half the drives deliver only one chunk (in an
ideal distribution) and the other half deliver none. Maybe they will
deliver some for the next set of 100 logical stripes.

You would probably say that even doing raid6 on 100 drives is crazy.
Better to make, e.g., 10 groups of 10 and do raid6 on each of the 10,
then LVM them together.

By doing declustered parity you can sanely do raid6 on 100 drives, using
a logical stripe size that is much smaller than 100.
When recovering a single drive, the 10-groups-of-10 would put heavy load
on 9 other drives, while the decluster approach puts light load on 99
other drives. No matter how clever md is at throttling recovery, I
would still rather distribute the load so that md has an easier job.

NeilBrown

[... remainder of quoted message trimmed ...]
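
A quick numerical sanity check of the fan-out argument above, under the
same assumed numbers (100 drives, 6 data + 2 parity chunks per logical
stripe) and a purely random placement; this is only a sketch, not how MD
would actually lay out or rebuild chunks.

#include <stdio.h>
#include <stdlib.h>

#define DRIVES  100
#define CHUNKS  8		/* 6 data + 2 parity */
#define NEEDED  6		/* chunks read to rebuild one missing chunk */
#define STRIPES 100000

int main(void)
{
	static long reads[DRIVES];
	int set[CHUNKS];

	srand(1);
	for (long s = 0; s < STRIPES; s++) {
		/* pick CHUNKS distinct drives for this logical stripe */
		for (int i = 0; i < CHUNKS; i++) {
			int d, dup;
			do {
				d = rand() % DRIVES;
				dup = 0;
				for (int j = 0; j < i; j++)
					if (set[j] == d)
						dup = 1;
			} while (dup);
			set[i] = d;
		}
		/* if drive 0 is a member, read any 6 surviving chunks */
		int hit = 0;
		for (int i = 0; i < CHUNKS; i++)
			if (set[i] == 0)
				hit = 1;
		if (hit) {
			int done = 0;
			for (int i = 0; i < CHUNKS && done < NEEDED; i++)
				if (set[i] != 0) {
					reads[set[i]]++;
					done++;
				}
		}
	}

	long tot = 0, max = 0;
	for (int d = 1; d < DRIVES; d++) {
		tot += reads[d];
		if (reads[d] > max)
			max = reads[d];
	}
	printf("recovery reads: total %ld, avg/drive %.1f, max/drive %ld\n",
	       tot, (double)tot / (DRIVES - 1), max);
	return 0;
}

With a reasonably even layout the per-drive maximum stays close to the
average over 99 survivors, instead of the whole rebuild landing on the 9
remaining members of one raid6 group.
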
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
From: Wols Lists @ 2018-01-30 10:43 UTC
To: NeilBrown, Johannes Thumshirn, lsf-pc
Cc: linux-raid, linux-block, Hannes Reinecke, Neil Brown

On 29/01/18 21:50, NeilBrown wrote:
> By doing declustered parity you can sanely do raid6 on 100 drives, using
> a logical stripe size that is much smaller than 100.
> When recovering a single drive, the 10-groups-of-10 would put heavy load
> on 9 other drives, while the decluster approach puts light load on 99
> other drives. No matter how clever md is at throttling recovery, I
> would still rather distribute the load so that md has an easier job.

Not offering to do it ... :-)

But that sounds a bit like linux raid-10. Could a simple approach be to
do something like "raid-6,11,100", ie raid-6 with 9 data chunks, two
parity, striped across 100 drives? Okay, it's not as good as the
decluster approach, but it would spread the stress of a rebuild across
20 drives, not 10. And probably be fairly easy to implement.

Cheers,
Wol
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
From: NeilBrown @ 2018-01-30 11:24 UTC
To: Wols Lists, Johannes Thumshirn, lsf-pc
Cc: linux-raid, linux-block, Hannes Reinecke, Neil Brown

On Tue, Jan 30 2018, Wols Lists wrote:

> On 29/01/18 21:50, NeilBrown wrote:
>> By doing declustered parity you can sanely do raid6 on 100 drives, using
>> a logical stripe size that is much smaller than 100.
>> When recovering a single drive, the 10-groups-of-10 would put heavy load
>> on 9 other drives, while the decluster approach puts light load on 99
>> other drives. No matter how clever md is at throttling recovery, I
>> would still rather distribute the load so that md has an easier job.
>
> Not offering to do it ... :-)
>
> But that sounds a bit like linux raid-10. Could a simple approach be to
> do something like "raid-6,11,100", ie raid-6 with 9 data chunks, two
> parity, striped across 100 drives? Okay, it's not as good as the
> decluster approach, but it would spread the stress of a rebuild across
> 20 drives, not 10. And probably be fairly easy to implement.

If you did that, I think you would be about 80% of the way to fully
declustered-parity RAID.
If you then tweak the math a bit so that one stripe was

  A1 A2 A3 A4 B1 B2 B3 B4 C1 C2 C3 C4 ....

and the next

  A1 C1 A2 C2 A3 C3 A4 C4 B1 D1 B2 D2 ....

and then

  A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3 ....
        XX

when Ax are a logical stripe and Bx are the next, then you have a
slightly better distribution. If device XX fails then the reads needed
for the first stripe mostly come from different drives than those for
the second stripe, which are mostly different again for the 3rd stripe.

Presumably the CRUSH algorithm (which I only skim-read once about a year
ago) formalizes how to do this, and does it better.
Once you have the data handling in place for your proposal, it should be
little more than replacing a couple of calculations to get the full
solution.

NeilBrown
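
A small sketch of "replacing a couple of calculations": these are not
Neil's exact layouts, just a per-row stride permutation (each stride
co-prime with the row width, both values chosen arbitrarily here) that
changes which drive holds which logical chunk from one physical stripe
row to the next.

#include <stdio.h>

#define WIDTH 12		/* physical slots (drives) per row */

int main(void)
{
	int strides[] = { 1, 5, 7 };	/* each co-prime with WIDTH */

	for (int r = 0; r < 3; r++) {
		for (int p = 0; p < WIDTH; p++) {
			int logical = (p * strides[r]) % WIDTH;
			/* logical chunks 0..3 are stripe A, 4..7 B, 8..11 C */
			printf("%c%d ", 'A' + logical / 4, logical % 4 + 1);
		}
		printf("\n");
	}
	return 0;
}

Because the strides differ per row, a failed drive's surviving stripe
members sit on mostly different drives in each row, which is the load
spreading being discussed.
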
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
From: Wol's lists @ 2018-01-30 17:40 UTC
To: NeilBrown, Johannes Thumshirn, lsf-pc
Cc: linux-raid, linux-block, Hannes Reinecke, Neil Brown

On 30/01/18 11:24, NeilBrown wrote:
[...]
> Presumably the CRUSH algorithm (which I only skim-read once about a year
> ago) formalizes how to do this, and does it better.
> Once you have the data handling in place for your proposal, it should be
> little more than replacing a couple of calculations to get the full
> solution.

Okay. I think I have it - a definition for raid-16 (or is it raid-61).
But I need a bit of help with the maths. And it might need a look-up
table :-(

Okay. Like raid-10, raid-16 would be spec'd as "--level 16,3,8", ie 3
mirrors, emulating an 8-drive raid-6.

What I'm looking for is a PRNG that has the "bug" that it repeats over a
short period, and over that period is guaranteed to repeat each number
the same number of times. I saw a wonderful video demonstration of this
years ago - if you plot the generated number against the number of times
it was generated, after a few rows it "filled up" a rectangle on the
graph.

At which point the maths becomes very simple. I just need at least as
many real drives as "mirrors times emulated".

If somebody can come up with the algorithm I want, I can spec it, and
then maybe someone can implement it? It'll be fun testing - I'll need my
new machine when I get it working :-)

Cheers,
Wol
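
One candidate for the "balanced PRNG" asked for here is simply an affine
map i -> (a*i + b) mod N: if gcd(a, N) == 1 it visits every value exactly
once per period of N, so over any whole number of periods every output
occurs equally often. The N, a and b below are arbitrary example values,
not a proposed parameterisation.

#include <stdio.h>

#define N 24			/* e.g. mirrors * emulated drives */

int main(void)
{
	const int a = 7, b = 3;	/* a must be co-prime with N */
	int count[N] = { 0 };

	for (int i = 0; i < 5 * N; i++)		/* five full periods */
		count[(a * i + b) % N]++;

	for (int v = 0; v < N; v++)
		printf("value %2d generated %d times\n", v, count[v]);
	return 0;
}

Plotting value against hit count gives exactly the "filled-up rectangle"
described above, and no look-up table is needed.
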
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
From: Wols Lists @ 2018-02-03 15:53 UTC
To: NeilBrown, Johannes Thumshirn, lsf-pc
Cc: linux-raid, linux-block, Hannes Reinecke, Neil Brown

On 30/01/18 11:24, NeilBrown wrote:
> When Ax are a logical stripe and Bx are the next, then you have a
> slightly better distribution. If device XX fails then the reads needed
> for the first stripe mostly come from different drives than those for
> the second stripe, which are mostly different again for the 3rd stripe.

Well, my maths has been well and truly proved wrong. But. Does streaming
from one drive directly onto another have value in reducing stress?

I've just done a worked example for four drives, mirrored three times
across eighteen drives. The result is that the drives have ended up in
six groups of three mirrored drives. So if any drive fails, I can just
copy an almost-identical drive from the mirror. I'll have to play with
eg 17 drives and see if this alters the dynamics - it'll be interesting.

 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
a1 b2 c3 a4 c1 a2 b3 c4 b1 c2 a3 b4 a1 b2 c3 a4 c1 a2
b3 c4 b1 c2 a3 b4 a1 b2 c3 a4 c1 a2 b3 c4 b1 c2 a3 b4

Note I'm rotating 5 drives at a time, so a1,a2,a3,a4 goes on drives
0,5,10,15 ... But pick any drive and you'll notice the drives six away
are almost mirrors.

Cheers,
Wol
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
From: Wols Lists @ 2018-02-03 17:16 UTC
To: NeilBrown, Johannes Thumshirn, lsf-pc
Cc: linux-raid, linux-block, Hannes Reinecke, Neil Brown

On 30/01/18 11:24, NeilBrown wrote:
> When Ax are a logical stripe and Bx are the next, then you have a
> slightly better distribution. If device XX fails then the reads needed
> for the first stripe mostly come from different drives than those for
> the second stripe, which are mostly different again for the 3rd stripe.

Having got myself thoroughly confused, I think I've got my head
straight. There's actually an incredibly simple algorithm for this, I
think.

First of all, let's state that we're defining a new complex raid here;
people get confused between linux-raid-10 and raid-1+0, so ... I was
originally thinking about raid-61, mirroring raid-6 across a random
number of drives. I think here we're actually talking about raid-60 -
doing a raid-6 across a random number of drives. Can we prefix whatever
raid level we call it with an "L" so it's not confused with standard
raid-6+0?

Okay. Let's say we have a five-disk raid-6 smeared across seven drives.
Let's call it Raid-L60. That gives us 35 blocks per disk stripe.

A1 A2 A3 A4 A5
B1 B2 B3 B4 B5
C1 C2 C3 C4 C5
D1 D2 D3 D4 D5
E1 E2 E3 E4 E5
F1 F2 F3 F4 F5
G1 G2 G3 G4 G5

That gives us a logical disk stripe 35 wide, that we just need to
scramble. Position * Prime mod Disks.

The problem, of course, is that this is going to scramble each logical
stripe the same. So as we've got 5 stripes, can we use a table of 5
different primes? The snag, of course, is that we mustn't use a prime
that is a factor of the number of disks, as this will blow up with a
vengeance. It makes sense also to avoid a prime that's a factor of the
logical disks, I suspect. So here, I'd use 2,3,11,13,17.

I'd like to add the logical stripe number in rather than use multiple
primes, but I don't think that'll work ...

Cheers,
Wol
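
One possible reading of the scramble sketched above, written out so it
can be checked: chunk c of logical stripe s is placed on drive
(s + c * prime[s]) mod drives. This is only an interpretation of the
"Position * Prime mod Disks" idea, not a worked-out spec; the placement
formula and the stripe/chunk counts are illustrative.

#include <stdio.h>

#define DRIVES  7
#define STRIPES 7		/* logical stripes A..G */
#define CHUNKS  5		/* 3 data + 2 parity per logical stripe */

int main(void)
{
	int primes[] = { 2, 3, 11, 13, 17 };

	for (int s = 0; s < STRIPES; s++) {
		int used[DRIVES] = { 0 };

		printf("stripe %c:", 'A' + s);
		for (int c = 0; c < CHUNKS; c++) {
			int d = (s + c * primes[s % 5]) % DRIVES;

			printf(" %c%d->d%d", 'A' + s, c + 1, d);
			if (used[d]++)
				printf(" (COLLISION!)");
		}
		printf("\n");
	}
	return 0;
}

Because every multiplier is co-prime with the 7 drives, no stripe ever
puts two of its chunks on the same drive; pick a multiplier sharing a
factor with the drive count and the COLLISION warning fires immediately.
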
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
From: David Brown @ 2018-01-31 9:58 UTC
To: NeilBrown, Wols Lists, Johannes Thumshirn, lsf-pc
Cc: linux-raid, linux-block, Hannes Reinecke, Neil Brown

On 29/01/18 22:50, NeilBrown wrote:
[...]
> Imagine an array with 100 drives on which we store data in sets of
> (say) 6 data chunks and 2 parity chunks.
> Each group of 8 chunks is distributed over the 100 drives in a
> different way, so that (e.g.) 600 data chunks and 200 parity chunks are
> distributed over 8 physical stripes using some clever distribution
> function.
> If (when) one drive fails, the 8 chunks in this set of 8 physical
> stripes can be recovered by reading 6*8 == 48 chunks, which will each be
> on a different drive.
>
> You would probably say that even doing raid6 on 100 drives is crazy.
> Better to make, e.g., 10 groups of 10 and do raid6 on each of the 10,
> then LVM them together.
>
> By doing declustered parity you can sanely do raid6 on 100 drives, using
> a logical stripe size that is much smaller than 100.
> When recovering a single drive, the 10-groups-of-10 would put heavy load
> on 9 other drives, while the decluster approach puts light load on 99
> other drives. No matter how clever md is at throttling recovery, I
> would still rather distribute the load so that md has an easier job.

That sounds smart. I don't see that you need anything particularly
complicated for how you distribute your data and parity drives across
the 100 disks - you just need a fairly even spread.

I would be more concerned with how you could deal with resizing such an
array. In particular, I think it is not unlikely that someone with a
100-drive array will one day want to add another bank of 24 disks (or
whatever fits in a cabinet). Making that work nicely would, I believe,
be more important than making sure the rebuild load distribution is
balanced evenly across 99 drives.

I would also be interested in how the data and parities are distributed
across cabinets and disk controllers. When you manually build from
smaller raid sets, you can ensure that in each set the data disks and
the parity are all in different cabinets - that way if an entire cabinet
goes up in smoke, you have lost one drive from each set, and your data
is still there. With a pseudo-random layout, you have lost that. (I
don't know how often entire cabinets of disks die, but I once lost both
disks of a raid1 mirror when the disk controller card died.)
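
A sketch of how a failure domain can be folded into a hash-based
placement: reject candidate drives whose cabinet already holds a chunk
of the stripe, so losing a whole cabinet costs at most one chunk per
stripe. The drive count, cabinet size and hash are all invented for the
example; it is not how CRUSH encodes its hierarchy.

#include <stdio.h>
#include <stdint.h>

#define DRIVES       96
#define PER_CABINET  12		/* 8 hypothetical cabinets */
#define CABINETS     (DRIVES / PER_CABINET)
#define CHUNKS       8		/* 6 data + 2 parity */

static uint64_t mix(uint64_t x)
{
	x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
	x ^= x >> 27; x *= 0x94d049bb133111ebULL;
	return x ^ (x >> 31);
}

int main(void)
{
	for (uint64_t s = 0; s < 3; s++) {	/* a few logical stripes */
		int cab_used[CABINETS] = { 0 };

		printf("stripe %llu:", (unsigned long long)s);
		for (int c = 0; c < CHUNKS; c++) {
			uint64_t r = 0;
			int d, cab;

			do {	/* re-hash until we hit an unused cabinet */
				d = mix(s * 982451653ULL + c * 37 + r++) % DRIVES;
				cab = d / PER_CABINET;
			} while (cab_used[cab]);
			cab_used[cab] = 1;
			printf(" d%d(cab%d)", d, cab);
		}
		printf("\n");
	}
	return 0;
}

The placement stays deterministic and pseudo-random, but every stripe
ends up with one chunk per cabinet, which restores the property the
hand-built layout gives you.
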
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
From: Johannes Thumshirn @ 2018-01-31 10:58 UTC
To: David Brown
Cc: NeilBrown, Wols Lists, lsf-pc, linux-raid, linux-block, Hannes Reinecke, Neil Brown

David Brown <david.brown@hesbynett.no> writes:
> That sounds smart. I don't see that you need anything particularly
> complicated for how you distribute your data and parity drives across
> the 100 disks - you just need a fairly even spread.

Exactly.

> I would be more concerned with how you could deal with resizing such an
> array. In particular, I think it is not unlikely that someone with a
> 100-drive array will one day want to add another bank of 24 disks (or
> whatever fits in a cabinet). Making that work nicely would, I believe,
> be more important than making sure the rebuild load distribution is
> balanced evenly across 99 drives.

I don't think rebuilding is such a big deal. Let's consider the
following hypothetical scenario: 6 disks with 4 data blocks, 3 replicas
per block (these could be RAID1-like duplicates or RAID5-like data +
parity; it doesn't matter at all for this example):

 D1  D2  D3  D4  D5  D6
[A] [B] [C] [ ] [ ] [ ]
[ ] [ ] [ ] [A] [D] [B]
[ ] [A] [B] [ ] [C] [ ]
[C] [ ] [ ] [D] [ ] [D]

Now we're adding one disk and rebalance:

 D1  D2  D3  D4  D5  D6  D7
[A] [B] [C] [ ] [ ] [ ] [A]
[ ] [ ] [ ] [ ] [D] [B] [ ]
[ ] [A] [B] [ ] [ ] [ ] [C]
[C] [ ] [ ] [D] [ ] [D] [ ]

This moved the "A" from D4 and the "C" from D5 to D7. The whole
rebalancing affected only 3 disks (read from D4 and D5, write to D7).

> I would also be interested in how the data and parities are distributed
> across cabinets and disk controllers. When you manually build from
> smaller raid sets, you can ensure that in each set the data disks and
> the parity are all in different cabinets - that way if an entire cabinet
> goes up in smoke, you have lost one drive from each set, and your data
> is still there. With a pseudo-random layout, you have lost that. (I
> don't know how often entire cabinets of disks die, but I once lost both
> disks of a raid1 mirror when the disk controller card died.)

Well, this is something CRUSH takes care of. As I said earlier, it's a
weighted decision tree. One of the weights could be to evenly spread all
blocks across two cabinets. Taking this into account would require a
non-trivial user interface and I'm not sure if the benefits of this
outweigh the costs (at least for an initial implementation).

Byte,
	Johannes
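
A sketch of why adding a drive need not reshuffle everything, using
rendezvous (highest-random-weight) hashing as a stand-in placement rule:
each block goes to the drive with the highest hash(block, drive), so
adding a drive only moves the blocks the new drive now "wins". This is
not what MD or CRUSH implements; it just demonstrates the minimal-
movement property the example above relies on.

#include <stdio.h>
#include <stdint.h>

static uint64_t mix(uint64_t x)
{
	x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
	x ^= x >> 27; x *= 0x94d049bb133111ebULL;
	return x ^ (x >> 31);
}

/* place 'block' on the drive with the highest per-(block,drive) weight */
static int place(uint64_t block, int drives)
{
	int best = 0;
	uint64_t best_w = 0;

	for (int d = 0; d < drives; d++) {
		uint64_t w = mix(block * 2654435761ULL + d);
		if (w > best_w) {
			best_w = w;
			best = d;
		}
	}
	return best;
}

int main(void)
{
	const int blocks = 100000;
	int moved = 0;

	for (int b = 0; b < blocks; b++)
		if (place(b, 6) != place(b, 7))
			moved++;
	/* expect roughly blocks/7 (~14%) to move, not ~86% */
	printf("%d of %d blocks moved when going from 6 to 7 drives\n",
	       moved, blocks);
	return 0;
}

A naive "hash(block) % drives" rule, by contrast, relocates almost
everything when the drive count changes.
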
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
From: Wols Lists @ 2018-01-31 14:27 UTC
To: David Brown, NeilBrown, Johannes Thumshirn, lsf-pc
Cc: linux-raid, linux-block, Hannes Reinecke, Neil Brown

On 31/01/18 09:58, David Brown wrote:
> I would also be interested in how the data and parities are distributed
> across cabinets and disk controllers. When you manually build from
> smaller raid sets, you can ensure that in each set the data disks and
> the parity are all in different cabinets - that way if an entire cabinet
> goes up in smoke, you have lost one drive from each set, and your data
> is still there. With a pseudo-random layout, you have lost that. (I
> don't know how often entire cabinets of disks die, but I once lost both
> disks of a raid1 mirror when the disk controller card died.)

The more I think about how I plan to spec raid-61, the more a modulo
approach seems to make sense. That way, it'll be fairly easy to predict
what ends up where, and make sure your disks are evenly scattered.

I think both your and my approach might have problems with losing an
entire cabinet, however. Depends on how many drives per cabinet ...

Anyways, my second thoughts are ...

We have what I will call a stripe-block: the lowest common multiple of
"disks needed" (i.e. number of mirrors times number of drives in the
raid-6) and the disks available.

Assuming my blocks are all stored sequentially, I can then quickly
calculate their position in this stripe-block. But this will fall foul
of just hammering the drives nearest to the failed drive. However, if I
pseudo-randomise this position with "position * prime mod drives", where
"prime" is not common to either the number of drives or the number of
mirrors or the number of raid-drives, then this should achieve my aim of
uniquely shuffling the location of all the blocks without collisions.

Pretty simple maths, for efficiency, that smears the data over all the
drives. Does that sound feasible? All the heavy lifting - calculating
the least common multiple, finding the prime, etc etc - can be done at
array set-up time.

(If this then allows feasible 100-drive arrays, we won't just need an
incremental assemble mode, we might need an incremental build mode :-)

Cheers,
Wol
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
From: David Brown @ 2018-01-31 14:41 UTC
To: Wols Lists, NeilBrown, Johannes Thumshirn, lsf-pc
Cc: linux-raid, linux-block, Hannes Reinecke, Neil Brown

On 31/01/18 15:27, Wols Lists wrote:
> On 31/01/18 09:58, David Brown wrote:
>> I would also be interested in how the data and parities are distributed
>> across cabinets and disk controllers. [...]
>
> The more I think about how I plan to spec raid-61, the more a modulo
> approach seems to make sense. That way, it'll be fairly easy to predict
> what ends up where, and make sure your disks are evenly scattered.
>
> I think both your and my approach might have problems with losing an
> entire cabinet, however. Depends on how many drives per cabinet ...

Exactly. I don't know how many cabinets are used on such systems.

> Anyways, my second thoughts are ...
>
> We have what I will call a stripe-block: the lowest common multiple of
> "disks needed" (i.e. number of mirrors times number of drives in the
> raid-6) and the disks available.
>
> Assuming my blocks are all stored sequentially, I can then quickly
> calculate their position in this stripe-block. But this will fall foul
> of just hammering the drives nearest to the failed drive. However, if I
> pseudo-randomise this position with "position * prime mod drives", where
> "prime" is not common to either the number of drives or the number of
> mirrors or the number of raid-drives, then this should achieve my aim of
> uniquely shuffling the location of all the blocks without collisions.
>
> Pretty simple maths, for efficiency, that smears the data over all the
> drives. Does that sound feasible? All the heavy lifting - calculating
> the least common multiple, finding the prime, etc etc - can be done at
> array set-up time.

Something like that should work, and be convenient to implement. I am
not sure off the top of my head if such a simple modulo system is valid,
but it won't be difficult to check.

> (If this then allows feasible 100-drive arrays, we won't just need an
> incremental assemble mode, we might need an incremental build mode :-)

You really want to track which stripes are valid here, and which are not
yet made consistent. A blank array will start with everything marked
invalid or inconsistent - build mode is just a matter of writing the
metadata. You only need to make stripes consistent when you first write
to them.
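
The check is indeed easy to do by brute force: i -> (i * p) mod n is a
bijection on 0..n-1 exactly when gcd(p, n) == 1, so the shuffle never
collides as long as the multiplier shares no factor with the drive
count. The (drives, multiplier) pairs below are arbitrary test values.

#include <stdio.h>

static int gcd(int a, int b)
{
	while (b) {
		int t = a % b;
		a = b;
		b = t;
	}
	return a;
}

static int is_permutation(int n, int p)
{
	int seen[4096] = { 0 };		/* assumes n <= 4096 */

	for (int i = 0; i < n; i++)
		if (seen[(i * p) % n]++)
			return 0;	/* collision */
	return 1;
}

int main(void)
{
	/* a few (drives, multiplier) pairs, including deliberately bad ones */
	int cases[][2] = { { 100, 7 }, { 100, 5 }, { 18, 5 }, { 18, 3 } };

	for (unsigned i = 0; i < sizeof(cases) / sizeof(cases[0]); i++) {
		int n = cases[i][0], p = cases[i][1];

		printf("n=%3d p=%2d gcd=%d -> %s\n", n, p, gcd(n, p),
		       is_permutation(n, p) ? "permutation" : "collides");
	}
	return 0;
}
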
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
From: Johannes Thumshirn @ 2018-01-30 9:40 UTC
To: Wols Lists
Cc: lsf-pc, linux-raid, linux-block, Hannes Reinecke, Neil Brown

Wols Lists <antlists@youngman.org.uk> writes:
> On 29/01/18 15:23, Johannes Thumshirn wrote:
>> With the rise of bigger and bigger disks, array rebuilding times start
>> skyrocketing.
>
> And? Yes, your data is at risk during a rebuild, but md-raid throttles
> the i/o, so it doesn't hammer the system.
>
>> This has at least two advantages:
>> 1) If one disk has to be replaced, the data does not need to be read from
>>    all disks to recover the failed disk, so non-affected disks can be
>>    used for real user I/O and not just recovery, and
>
> Again, that's throttling, so that's not a problem ...

And throttling in a production environment is not exactly desired.
Imagine a 500-disk array (and yes, this is something we've seen with MD)
and you have to replace disks. While the array is rebuilt you have to
throttle all I/O, because with raid-{1,5,6,10} all data is striped
across all disks.

With a parity-declustered RAID (or DDP, as Dell, NetApp or Huawei call
it) you don't have to, as the I/O is replicated in parity groups across
a subset of disks. All I/O targeting disks which aren't needed to
recover the data from the failed disks isn't affected by the throttling
at all.

>> This whole declustering of course only makes sense for more than (at
>> least) 4 drives, but we do have customers with several orders of
>> magnitude more drives in an MD array.
>
> If you have four drives or more - especially if they are multi-terabyte
> drives - you should NOT be using raid-5 ...

raid-6 won't help you much in the above scenario.

[...]
> Okay - I've now skimmed the CRUSH paper [2]. Looks well interesting.
> BUT. It feels more like btrfs than it does like raid.
>
> Btrfs manages disks, and does raid; it tries to be the "everything
> between the hard drive and the file". This CRUSH thingy reads to me like
> it wants to be the same. There's nothing wrong with that, but md is a
> unix-y "do one thing (raid) and do it well".

Well, CRUSH is (one of) the algorithms behind Ceph. It takes the
decisions where to place a block. It is just a hash (well, technically a
weighted decision tree) function that takes a block of I/O and some
configuration parameters and "calculates" the placement.

> My knee-jerk reaction is if you want to go for it, it sounds like a good
> idea. It just doesn't really feel a good fit for md.

Thanks for the input.

	Johannes
* Re: [LSF/MM TOPIC] De-clustered RAID with MD
From: David Brown @ 2018-01-31 8:03 UTC
To: Johannes Thumshirn, Wols Lists
Cc: lsf-pc, linux-raid, linux-block, Hannes Reinecke, Neil Brown

On 30/01/18 10:40, Johannes Thumshirn wrote:
[...]
> And throttling in a production environment is not exactly desired.
> Imagine a 500-disk array (and yes, this is something we've seen with MD)
> and you have to replace disks. While the array is rebuilt you have to
> throttle all I/O, because with raid-{1,5,6,10} all data is striped
> across all disks.

You definitely don't want a stripe across 500 disks! I'd be inclined to
have raid1 pairs as the basic block, or perhaps 6-8 drive raid6 if you
want higher space efficiency. Then you build your full array on top of
that, along with a file system that can take advantage of the layout.
If you have an XFS over a linear concat of these sets, then you have a
system that can quickly serve many parallel loads - but that could give
poor distribution if you are storing massive streaming data. And
rebuilds only delay data from the one block that is involved in the
rebuild. (I have no experience with anything bigger than about 6 disks -
this is just theory on my part.)

> With a parity-declustered RAID (or DDP, as Dell, NetApp or Huawei call
> it) you don't have to, as the I/O is replicated in parity groups across
> a subset of disks. All I/O targeting disks which aren't needed to
> recover the data from the failed disks isn't affected by the throttling
> at all.
>
>> If you have four drives or more - especially if they are multi-terabyte
>> drives - you should NOT be using raid-5 ...
>
> raid-6 won't help you much in the above scenario.

Raid-6 is still a great deal better than raid-5 :-) And for your
declustered raid or distributed parity, you can have two parities rather
than just one.