* possibly silly question (raid failover)
@ 2011-11-01 0:38 Miles Fidelman
2011-11-01 9:14 ` David Brown
` (2 more replies)
0 siblings, 3 replies; 27+ messages in thread
From: Miles Fidelman @ 2011-11-01 0:38 UTC (permalink / raw)
To: linux-raid@vger.kernel.org
Hi Folks,
I've been exploring various ways to build a "poor man's high
availability cluster." Currently I'm running two nodes, using raid on
each box, running DRBD across the boxes, and running Xen virtual
machines on top of that.
I now have two brand new servers - for a total of four nodes - each with
four large drives, and four gigE ports.
Between the configuration of the systems and rack space limitations,
I'm trying to use each server for both storage and processing - and have
been looking at various options for building a cluster file system across
all 16 drives, one that supports VM migration/failover across all four
nodes, that's resistant to single-drive failures and to losing an entire
server (and its 4 drives), and maybe even to losing two servers (8 drives).
The approach that looks most interesting is Sheepdog - but it's both
tied to KVM rather than Xen, and a bit immature.
But it led me to wonder if something like this might make sense:
- mount each drive using AoE
- run md RAID 10 across all 16 drives on one node
- mount the resulting md device using AoE
- if the node running the md device fails, use pacemaker/crm to
auto-start an md device on another node, re-assemble and republish the array
- resulting in a 16-drive raid10 array that's accessible from all nodes
Or is this just silly and/or wrongheaded?
Miles Fidelman
--
In theory, there is no difference between theory and practice.
In<fnord> practice, there is. .... Yogi Berra
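To make the proposal above concrete, here is a minimal sketch of the export/import plumbing, assuming the stock aoe kernel module plus the vblade/aoetools userspace; the shelf numbers, interface name and device paths are illustrative only, not taken from the original post:

  # on each storage node: export its four local drives over AoE
  # (give each node its own shelf number, 0-3)
  modprobe aoe
  vbladed 0 0 eth1 /dev/sda
  vbladed 0 1 eth1 /dev/sdb
  vbladed 0 2 eth1 /dev/sdc
  vbladed 0 3 eth1 /dev/sdd

  # on the node that will run the md array: pick up all 16 exports
  modprobe aoe
  aoe-discover
  ls /dev/etherd/        # e0.0 .. e3.3 once all four shelves are visible

Whether re-exporting the assembled /dev/md device over AoE, and failing that role over cleanly, actually works in practice is exactly the question the rest of the thread picks at.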
^ permalink raw reply [flat|nested] 27+ messages in thread* Re: possibly silly question (raid failover) 2011-11-01 0:38 possibly silly question (raid failover) Miles Fidelman @ 2011-11-01 9:14 ` David Brown 2011-11-01 13:05 ` Miles Fidelman 2011-11-01 9:26 ` Johannes Truschnigg 2011-11-02 6:41 ` Stan Hoeppner 2 siblings, 1 reply; 27+ messages in thread From: David Brown @ 2011-11-01 9:14 UTC (permalink / raw) To: linux-raid On 01/11/2011 01:38, Miles Fidelman wrote: > Hi Folks, > > I've been exploring various ways to build a "poor man's high > availability cluster." Currently I'm running two nodes, using raid on > each box, running DRBD across the boxes, and running Xen virtual > machines on top of that. > > I now have two brand new servers - for a total of four nodes - each with > four large drives, and four gigE ports. > > Between the configuration of the systems, and rack space limitations, > I'm trying to use each server for both storage and processing - and been > looking at various options for building a cluster file system across all > 16 drives, that supports VM migration/failover across all for nodes, and > that's resistant to both single-drive failures, and to losing an entire > server (and it's 4 drives), and maybe even losing two servers (8 drives). > > The approach that looks most interesting is Sheepdog - but it's both > tied to KVM rather than Xen, and a bit immature. > > But it lead me to wonder if something like this might make sense: > - mount each drive using AoE > - run md RAID 10 across all 16 drives one one node > - mount the resulting md device using AoE > - if the node running the md device fails, use pacemaker/crm to > auto-start an md device on another node, re-assemble and republish the > array > - resulting in a 16-drive raid10 array that's accessible from all nodes > > Or is this just silly and/or wrongheaded? > > Miles Fidelman > One thing to watch out for when making high-availability systems and using RAID1 (or RAID10), is that RAID1 only tolerates a single failure in the worst case. If you have built your disk image spread across different machines with two-copy RAID1, and a server goes down, then the rest then becomes vulnerable to a single disk failure (or a single unrecoverable read error). It's a different matter if you are building a 4-way mirror from the four servers, of course. Alternatively, each server could have its four disks set up as a 3+1 local raid5. Then you combine them all from different machines using raid10 (or possibly just raid1 - depending on your usage patterns, that may be faster). That gives you an extra safety margin on disk problems. But the key issue is to consider what might fail, and what the consequences of that failure are - including the consequences for additional failures. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover)
2011-11-01 9:14 ` David Brown
@ 2011-11-01 13:05 ` Miles Fidelman
2011-11-01 13:37 ` John Robinson
0 siblings, 1 reply; 27+ messages in thread
From: Miles Fidelman @ 2011-11-01 13:05 UTC (permalink / raw)
Cc: linux-raid

David Brown wrote:
>
> One thing to watch out for when making high-availability systems and
> using RAID1 (or RAID10), is that RAID1 only tolerates a single failure
> in the worst case. If you have built your disk image spread across
> different machines with two-copy RAID1, and a server goes down, then
> the rest then becomes vulnerable to a single disk failure (or a single
> unrecoverable read error).
>
> It's a different matter if you are building a 4-way mirror from the
> four servers, of course.
>

Just a nit here: I'm looking at "md RAID10", which behaves quite
differently than conventional RAID10. Rather than striping and raiding
as separate operations, it does both as a unitary operation -
essentially spreading n copies of each block across m disks. Rather
clever that way.

Hence my thought about a 16-disk md RAID10 array - which offers lots of
redundancy.

Miles

--
In theory, there is no difference between theory and practice.
In<fnord> practice, there is. .... Yogi Berra

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-01 13:05 ` Miles Fidelman @ 2011-11-01 13:37 ` John Robinson 2011-11-01 14:36 ` David Brown 0 siblings, 1 reply; 27+ messages in thread From: John Robinson @ 2011-11-01 13:37 UTC (permalink / raw) To: Miles Fidelman; +Cc: linux-raid On 01/11/2011 13:05, Miles Fidelman wrote: > David Brown wrote: >> >> One thing to watch out for when making high-availability systems and >> using RAID1 (or RAID10), is that RAID1 only tolerates a single failure >> in the worst case. If you have built your disk image spread across >> different machines with two-copy RAID1, and a server goes down, then >> the rest then becomes vulnerable to a single disk failure (or a single >> unrecoverable read error). >> >> It's a different matter if you are building a 4-way mirror from the >> four servers, of course. >> > > Just a nit here: I'm looking at "md RAID10" which behaves quite > differently that conventional RAID10. Rather than striping and raiding > as separate operations, it does both as a unitary operation - > essentially spreading n copies of each block across m disks. Rather > clever that way. > > Hence my thought about a 16-disk md RAID10 array - which offers lots of > redundancy. I'm pretty sure that a normal (near) md RAID10 on 16 disks will use the first two drives you specify as mirrors, and the next two, and so on, so when you specify the drive order when building the array you'd need to make sure all the mirrors are on another machine. Cheers, John. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-01 13:37 ` John Robinson @ 2011-11-01 14:36 ` David Brown 2011-11-01 20:13 ` Miles Fidelman 0 siblings, 1 reply; 27+ messages in thread From: David Brown @ 2011-11-01 14:36 UTC (permalink / raw) To: linux-raid On 01/11/2011 14:37, John Robinson wrote: > On 01/11/2011 13:05, Miles Fidelman wrote: >> David Brown wrote: >>> >>> One thing to watch out for when making high-availability systems and >>> using RAID1 (or RAID10), is that RAID1 only tolerates a single failure >>> in the worst case. If you have built your disk image spread across >>> different machines with two-copy RAID1, and a server goes down, then >>> the rest then becomes vulnerable to a single disk failure (or a single >>> unrecoverable read error). >>> >>> It's a different matter if you are building a 4-way mirror from the >>> four servers, of course. >>> >> >> Just a nit here: I'm looking at "md RAID10" which behaves quite >> differently that conventional RAID10. Rather than striping and raiding >> as separate operations, it does both as a unitary operation - >> essentially spreading n copies of each block across m disks. Rather >> clever that way. >> >> Hence my thought about a 16-disk md RAID10 array - which offers lots of >> redundancy. No, md RAID10 does /not/ offer more redundancy than RAID1. You are right that md RAID10 offers more than RAID1 (or traditional RAID0 over RAID1 sets) - but it is a convenience and performance benefit, not a redundancy benefit. In particular, it lets you build RAID10 from any number of disks, not just two. And it lets you stripe over all disks, improving performance for some loads (though not /all/ loads - if you have lots of concurrent small reads, you may be faster using plain RAID1). To get higher redundancy with RAID10 or RAID1, you need to use more "ways" in the mirror. For example, creating RAID10 with "--layout n3" will give you three copies of all data, rather than just two, and therefore better redundancy - at the cost of disk space. When you write "RAID10", the assumption is you mean a normal two-way mirror unless you specifically say otherwise, and such a mirror has only a worst-case redundancy of 1 disk. A second failure will kill the array if it happens to hit the second copy of the data. > > I'm pretty sure that a normal (near) md RAID10 on 16 disks will use the > first two drives you specify as mirrors, and the next two, and so on, so > when you specify the drive order when building the array you'd need to > make sure all the mirrors are on another machine. > Correct. If you have a multiple of 4 disks, a "normal" near two-way RAID10 is almost indistinguishable from a standard two-way RAID1. > Cheers, > > John. > ^ permalink raw reply [flat|nested] 27+ messages in thread
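To put numbers on the "--layout n3" remark above, a hedged sketch of creating a three-copy near-layout RAID10 over the 16 AoE imports; the device names and the interleaved ordering are illustrative only, the ordering is merely intended to push the copies of each block onto different shelves/servers and should be verified with mdadm --detail before being relied on:

  # three near copies of every block; usable space is roughly 5.3 drives' worth
  mdadm --create /dev/md0 --level=10 --layout=n3 --raid-devices=16 \
      /dev/etherd/e0.0 /dev/etherd/e1.0 /dev/etherd/e2.0 /dev/etherd/e3.0 \
      /dev/etherd/e0.1 /dev/etherd/e1.1 /dev/etherd/e2.1 /dev/etherd/e3.1 \
      /dev/etherd/e0.2 /dev/etherd/e1.2 /dev/etherd/e2.2 /dev/etherd/e3.2 \
      /dev/etherd/e0.3 /dev/etherd/e1.3 /dev/etherd/e2.3 /dev/etherd/e3.3
  mdadm --detail /dev/md0   # check which devices actually hold each copy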
* Re: possibly silly question (raid failover) 2011-11-01 14:36 ` David Brown @ 2011-11-01 20:13 ` Miles Fidelman 2011-11-01 21:20 ` Robin Hill 2011-11-01 22:15 ` keld 0 siblings, 2 replies; 27+ messages in thread From: Miles Fidelman @ 2011-11-01 20:13 UTC (permalink / raw) Cc: linux-raid David Brown wrote: > > No, md RAID10 does /not/ offer more redundancy than RAID1. You are > right that md RAID10 offers more than RAID1 (or traditional RAID0 over > RAID1 sets) - but it is a convenience and performance benefit, not a > redundancy benefit. In particular, it lets you build RAID10 from any > number of disks, not just two. And it lets you stripe over all disks, > improving performance for some loads (though not /all/ loads - if you > have lots of concurrent small reads, you may be faster using plain > RAID1). wasn't suggesting that it does - just that it does things differently than normal raid 1+0 - for example, by doing mirroring and striping as a unitary operation, it works across odd number of drives - it also (I think) allows for more than 2 copies of a block (not completely clear how many copies of a block would be made if you specified a 16 drive array) - sort of what I'm wondering here -- In theory, there is no difference between theory and practice. In<fnord> practice, there is. .... Yogi Berra ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-01 20:13 ` Miles Fidelman @ 2011-11-01 21:20 ` Robin Hill 2011-11-01 21:32 ` Miles Fidelman 2011-11-01 22:15 ` keld 1 sibling, 1 reply; 27+ messages in thread From: Robin Hill @ 2011-11-01 21:20 UTC (permalink / raw) To: Miles Fidelman; +Cc: linux-raid [-- Attachment #1: Type: text/plain, Size: 1514 bytes --] On Tue Nov 01, 2011 at 04:13:26 -0400, Miles Fidelman wrote: > David Brown wrote: > > > > No, md RAID10 does /not/ offer more redundancy than RAID1. You are > > right that md RAID10 offers more than RAID1 (or traditional RAID0 over > > RAID1 sets) - but it is a convenience and performance benefit, not a > > redundancy benefit. In particular, it lets you build RAID10 from any > > number of disks, not just two. And it lets you stripe over all disks, > > improving performance for some loads (though not /all/ loads - if you > > have lots of concurrent small reads, you may be faster using plain > > RAID1). > > wasn't suggesting that it does - just that it does things differently > than normal raid 1+0 - for example, by doing mirroring and striping as a > unitary operation, it works across odd number of drives - it also (I > think) allows for more than 2 copies of a block (not completely clear > how many copies of a block would be made if you specified a 16 drive > array) - sort of what I'm wondering here > By default it'll make 2 copies, regardless how many devices are in the array. You can specify how many copies you want though, so -n3 will give you a near configuration with 3 copies, -n4 for four copies, etc. Cheers, Robin -- ___ ( ' } | Robin Hill <robin@robinhill.me.uk> | / / ) | Little Jim says .... | // !! | "He fallen in de water !!" | [-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-01 21:20 ` Robin Hill @ 2011-11-01 21:32 ` Miles Fidelman 2011-11-01 21:50 ` Robin Hill 2011-11-01 22:00 ` David Brown 0 siblings, 2 replies; 27+ messages in thread From: Miles Fidelman @ 2011-11-01 21:32 UTC (permalink / raw) To: linux-raid Robin Hill wrote: > On Tue Nov 01, 2011 at 04:13:26 -0400, Miles Fidelman wrote: > >> David Brown wrote: >>> No, md RAID10 does /not/ offer more redundancy than RAID1. You are >>> right that md RAID10 offers more than RAID1 (or traditional RAID0 over >>> RAID1 sets) - but it is a convenience and performance benefit, not a >>> redundancy benefit. In particular, it lets you build RAID10 from any >>> number of disks, not just two. And it lets you stripe over all disks, >>> improving performance for some loads (though not /all/ loads - if you >>> have lots of concurrent small reads, you may be faster using plain >>> RAID1). >> wasn't suggesting that it does - just that it does things differently >> than normal raid 1+0 - for example, by doing mirroring and striping as a >> unitary operation, it works across odd number of drives - it also (I >> think) allows for more than 2 copies of a block (not completely clear >> how many copies of a block would be made if you specified a 16 drive >> array) - sort of what I'm wondering here >> > By default it'll make 2 copies, regardless how many devices are in the > array. You can specify how many copies you want though, so -n3 will give > you a near configuration with 3 copies, -n4 for four copies, etc. > > cool, so with 16 drives, and say -n6 or -n8, and a far configuration - that gives a pretty good level of resistance to multi-disk failures, as well as an entire node failure (taking out 4 drives) which then leaves the question of whether the md driver, itself, can be failed over from one node to another Thanks! Miles -- In theory, there is no difference between theory and practice. In<fnord> practice, there is. .... Yogi Berra ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-01 21:32 ` Miles Fidelman @ 2011-11-01 21:50 ` Robin Hill 2011-11-01 22:35 ` Miles Fidelman 2011-11-01 22:00 ` David Brown 1 sibling, 1 reply; 27+ messages in thread From: Robin Hill @ 2011-11-01 21:50 UTC (permalink / raw) To: Miles Fidelman; +Cc: linux-raid [-- Attachment #1: Type: text/plain, Size: 2316 bytes --] On Tue Nov 01, 2011 at 05:32:17 -0400, Miles Fidelman wrote: > Robin Hill wrote: > > On Tue Nov 01, 2011 at 04:13:26 -0400, Miles Fidelman wrote: > > > >> David Brown wrote: > >>> No, md RAID10 does /not/ offer more redundancy than RAID1. You are > >>> right that md RAID10 offers more than RAID1 (or traditional RAID0 over > >>> RAID1 sets) - but it is a convenience and performance benefit, not a > >>> redundancy benefit. In particular, it lets you build RAID10 from any > >>> number of disks, not just two. And it lets you stripe over all disks, > >>> improving performance for some loads (though not /all/ loads - if you > >>> have lots of concurrent small reads, you may be faster using plain > >>> RAID1). > >> wasn't suggesting that it does - just that it does things differently > >> than normal raid 1+0 - for example, by doing mirroring and striping as a > >> unitary operation, it works across odd number of drives - it also (I > >> think) allows for more than 2 copies of a block (not completely clear > >> how many copies of a block would be made if you specified a 16 drive > >> array) - sort of what I'm wondering here > >> > > By default it'll make 2 copies, regardless how many devices are in the > > array. You can specify how many copies you want though, so -n3 will give > > you a near configuration with 3 copies, -n4 for four copies, etc. > > > > > cool, so with 16 drives, and say -n6 or -n8, and a far configuration - > that gives a pretty good level of resistance to multi-disk failures, as > well as an entire node failure (taking out 4 drives) > Sorry, my mistake - it should be -p n3, or -p n4. You'll want -p f6/-p f8 to get a far configuration though, but yes, that should give good redundancy against a single node failure. > which then leaves the question of whether the md driver, itself, can be > failed over from one node to another > I don't see why not. You'll probably need to force assembly though, as it's likely the devices will be slightly out-of-synch after the node failure. Cheers, Robin -- ___ ( ' } | Robin Hill <robin@robinhill.me.uk> | / / ) | Little Jim says .... | // !! | "He fallen in de water !!" | [-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 27+ messages in thread
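A hedged sketch of that takeover step on a surviving node (device names hypothetical): --force re-assembles despite mismatched event counts, and --run starts the array even though the failed node's four members are missing:

  # on the node taking over, once the failed md host is confirmed down
  aoe-discover                                   # re-scan the remaining shelves
  mdadm --assemble --force --run /dev/md0 /dev/etherd/e*.*
  mdadm --detail /dev/md0                        # should show a degraded but running array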
* Re: possibly silly question (raid failover) 2011-11-01 21:50 ` Robin Hill @ 2011-11-01 22:35 ` Miles Fidelman 0 siblings, 0 replies; 27+ messages in thread From: Miles Fidelman @ 2011-11-01 22:35 UTC (permalink / raw) To: linux-raid Robin Hill wrote: > Sorry, my mistake - it should be -p n3, or -p n4. You'll want -p f6/-p > f8 to get a far configuration though, but yes, that should give good > redundancy against a single node failure. > >> which then leaves the question of whether the md driver, itself, can be >> failed over from one node to another >> > I don't see why not. You'll probably need to force assembly though, as > it's likely the devices will be slightly out-of-synch after the node > failure. > > sort of would expect to have to resynch has anybody out there actually tried this at some point? I've been trying to find OCF resource agents for handling a RAID failover, and only coming up with deprecated functions with little documentation - the only thing that even sounds remotely close is a heartbeat2 "md group take over" resource agent, but all I can find are references to it, no actual documentation Miles -- In theory, there is no difference between theory and practice. In<fnord> practice, there is. .... Yogi Berra ^ permalink raw reply [flat|nested] 27+ messages in thread
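On the resource-agent question above: the resource-agents package shipped with current Pacemaker includes an ocf:heartbeat:Raid1 agent that assembles and stops an md array from an mdadm.conf fragment, which is probably the closest starting point. Whether it copes with AoE-backed members is untested here, so treat the following crm snippet as an assumption-laden sketch (resource names and file paths are invented):

  # crm configure
  primitive p_md0 ocf:heartbeat:Raid1 \
      params raidconf="/etc/mdadm/mdadm-cluster.conf" raiddev="/dev/md0" \
      op monitor interval="30s"
  # colocate/order p_md0 with the AoE re-export and whatever consumes the array,
  # and rely on fencing so two nodes never assemble it at the same time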
* Re: possibly silly question (raid failover) 2011-11-01 21:32 ` Miles Fidelman 2011-11-01 21:50 ` Robin Hill @ 2011-11-01 22:00 ` David Brown 2011-11-01 22:58 ` Miles Fidelman 1 sibling, 1 reply; 27+ messages in thread From: David Brown @ 2011-11-01 22:00 UTC (permalink / raw) To: linux-raid On 01/11/11 22:32, Miles Fidelman wrote: > Robin Hill wrote: >> On Tue Nov 01, 2011 at 04:13:26 -0400, Miles Fidelman wrote: >> >>> David Brown wrote: >>>> No, md RAID10 does /not/ offer more redundancy than RAID1. You are >>>> right that md RAID10 offers more than RAID1 (or traditional RAID0 over >>>> RAID1 sets) - but it is a convenience and performance benefit, not a >>>> redundancy benefit. In particular, it lets you build RAID10 from any >>>> number of disks, not just two. And it lets you stripe over all disks, >>>> improving performance for some loads (though not /all/ loads - if you >>>> have lots of concurrent small reads, you may be faster using plain >>>> RAID1). >>> wasn't suggesting that it does - just that it does things differently >>> than normal raid 1+0 - for example, by doing mirroring and striping as a >>> unitary operation, it works across odd number of drives - it also (I >>> think) allows for more than 2 copies of a block (not completely clear >>> how many copies of a block would be made if you specified a 16 drive >>> array) - sort of what I'm wondering here >>> >> By default it'll make 2 copies, regardless how many devices are in the >> array. You can specify how many copies you want though, so -n3 will give >> you a near configuration with 3 copies, -n4 for four copies, etc. >> >> > cool, so with 16 drives, and say -n6 or -n8, and a far configuration - > that gives a pretty good level of resistance to multi-disk failures, as > well as an entire node failure (taking out 4 drives) You are aware, of course, that if you take your 16 drives and use "-n8", you will get a total disk space equivalent to two drives. It would be very resistant to drive failures, but /very/ poor space efficiency. It would also be very fast for reads, but very slow for writes (as everything must be written 8 times). It's your choice - md is very flexible. But I think an eight-way mirror would be considered somewhat unusual. > > which then leaves the question of whether the md driver, itself, can be > failed over from one node to another > > Thanks! > > Miles > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-01 22:00 ` David Brown @ 2011-11-01 22:58 ` Miles Fidelman 2011-11-02 10:36 ` David Brown 0 siblings, 1 reply; 27+ messages in thread From: Miles Fidelman @ 2011-11-01 22:58 UTC (permalink / raw) Cc: linux-raid David Brown wrote: > > You are aware, of course, that if you take your 16 drives and use > "-n8", you will get a total disk space equivalent to two drives. It > would be very resistant to drive failures, but /very/ poor space > efficiency. It would also be very fast for reads, but very slow for > writes (as everything must be written 8 times). > > It's your choice - md is very flexible. But I think an eight-way > mirror would be considered somewhat unusual. What would be particularly interesting is if I can do -n4 and configure things in a way that insures that each of those 4 is on a different one of my 4 boxes (4 boxes, 4 disks each). -- In theory, there is no difference between theory and practice. In<fnord> practice, there is. .... Yogi Berra ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover)
2011-11-01 22:58 ` Miles Fidelman
@ 2011-11-02 10:36 ` David Brown
0 siblings, 0 replies; 27+ messages in thread
From: David Brown @ 2011-11-02 10:36 UTC (permalink / raw)
To: linux-raid

On 01/11/2011 23:58, Miles Fidelman wrote:
> David Brown wrote:
>>
>> You are aware, of course, that if you take your 16 drives and use
>> "-n8", you will get a total disk space equivalent to two drives. It
>> would be very resistant to drive failures, but /very/ poor space
>> efficiency. It would also be very fast for reads, but very slow for
>> writes (as everything must be written 8 times).
>>
>> It's your choice - md is very flexible. But I think an eight-way
>> mirror would be considered somewhat unusual.
>
> What would be particularly interesting is if I can do -n4 and configure
> things in a way that insures that each of those 4 is on a different one
> of my 4 boxes (4 boxes, 4 disks each).
>

Theoretically, that's just a matter of getting the ordering right when
you are creating the array. However, it is a lot easier to get this
right if you separate the stages into setting up 4 raid1 sets, then
combine them.

Remember, md RAID10 is good - but it is not always the best choice. It
is particularly good for desktop use on two disks, or perhaps 3 disks,
with the "far2" layout - giving you excellent speed and safety. Its
advantages over standard RAID1+0 drop as the number of disks increases.
In particular, RAID10 in "near" format is identical to RAID1+0 if you
have a multiple of 4 disks - "far" format still has some speed
advantages since it can always read from the faster outer half of the
disk. Also note that the benefits of striping drop off for bigger disk
sets (unless you have very big files, they don't fill the stripes), and
if you have multiple concurrent accesses - typical for servers -
striping doesn't help much.

Finally, remember the main disadvantage of md RAID10 - once it is
established, you have very few re-shape possibilities. RAID0 and RAID1
sets can be easily re-shaped - you can change their size, and you can
add or remove drives. This means that if you build your system using
RAID1 mirrors and RAID0 striping on top, you can add new servers or
change the number of disks later.

You need to establish what your needs are here - what sort of files
will be accessed (big, small, etc.), what will access patterns be like
(large streamed accesses, lots of concurrent small accesses, many reads,
many writes, etc.), and what your storage needs are (what disk sizes
are you using, and what total usable disk space are you aiming for?).

One idea would be to set up 8 2-way mirrors, with the mirrors split
between different machines. These 8 pairs could then be combined with
RAID6. That gives you 6 disks worth of total space from your 16 disks,
and protects you against at least 5 concurrent disk failures, or two
complete server fails, if you arrange the pairs like this:

Server 1: 1a 3a 5a 7a
Server 2: 1b 4a 6a 7b
Server 3: 2a 3b 6b 8a
Server 4: 2b 4b 5b 8b

(Where 1a, 1b are the two halves of the same mirror.)

^ permalink raw reply [flat|nested] 27+ messages in thread
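For reference, a hedged mdadm sketch of that pairs-plus-RAID6 arrangement, using hypothetical AoE device names where the shelf number corresponds to the server and the slot order follows the table above:

  # eight two-way mirrors, each split across two different servers
  mdadm --create /dev/md1 -l1 -n2 /dev/etherd/e0.0 /dev/etherd/e1.0   # pair 1 (1a,1b)
  mdadm --create /dev/md2 -l1 -n2 /dev/etherd/e2.0 /dev/etherd/e3.0   # pair 2 (2a,2b)
  mdadm --create /dev/md3 -l1 -n2 /dev/etherd/e0.1 /dev/etherd/e2.1   # pair 3 (3a,3b)
  mdadm --create /dev/md4 -l1 -n2 /dev/etherd/e1.1 /dev/etherd/e3.1   # pair 4 (4a,4b)
  mdadm --create /dev/md5 -l1 -n2 /dev/etherd/e0.2 /dev/etherd/e3.2   # pair 5 (5a,5b)
  mdadm --create /dev/md6 -l1 -n2 /dev/etherd/e1.2 /dev/etherd/e2.2   # pair 6 (6a,6b)
  mdadm --create /dev/md7 -l1 -n2 /dev/etherd/e0.3 /dev/etherd/e1.3   # pair 7 (7a,7b)
  mdadm --create /dev/md8 -l1 -n2 /dev/etherd/e2.3 /dev/etherd/e3.3   # pair 8 (8a,8b)

  # RAID6 over the eight mirrors: 6 drives of usable space, and any two
  # whole pairs (so any two servers) can disappear without data loss
  mdadm --create /dev/md10 -l6 -n8 \
      /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5 /dev/md6 /dev/md7 /dev/md8

As the message above notes, the RAID1 legs and the striped/parity layer on top stay individually re-shapeable later, which a single big RAID10 would not.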
* Re: possibly silly question (raid failover)
2011-11-01 20:13 ` Miles Fidelman
2011-11-01 21:20 ` Robin Hill
@ 2011-11-01 22:15 ` keld
2011-11-01 22:25 ` NeilBrown
1 sibling, 1 reply; 27+ messages in thread
From: keld @ 2011-11-01 22:15 UTC (permalink / raw)
To: Miles Fidelman; +Cc: linux-raid

On Tue, Nov 01, 2011 at 04:13:26PM -0400, Miles Fidelman wrote:
> David Brown wrote:
> >
> > No, md RAID10 does /not/ offer more redundancy than RAID1. You are
> > right that md RAID10 offers more than RAID1 (or traditional RAID0 over
> > RAID1 sets) - but it is a convenience and performance benefit, not a
> > redundancy benefit. In particular, it lets you build RAID10 from any
> > number of disks, not just two. And it lets you stripe over all disks,
> > improving performance for some loads (though not /all/ loads - if you
> > have lots of concurrent small reads, you may be faster using plain
> > RAID1).

In fact raid10 has a bit less redundancy than raid1+0. It is, as far as
I know, built as raid0+1 with a disk layout where you can only lose
e.g. 1 out of 4 disks, while raid1+0 in some cases can lose 2 disks out
of 4.

Also, for lots of concurrent small reads raid10 can in some cases be
somewhat faster than raid1, and AFAIK never slower than raid1.

Best regards
keld

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-01 22:15 ` keld @ 2011-11-01 22:25 ` NeilBrown 2011-11-01 22:38 ` Miles Fidelman 2011-11-02 1:37 ` keld 0 siblings, 2 replies; 27+ messages in thread From: NeilBrown @ 2011-11-01 22:25 UTC (permalink / raw) To: keld; +Cc: Miles Fidelman, linux-raid [-- Attachment #1: Type: text/plain, Size: 1427 bytes --] On Tue, 1 Nov 2011 23:15:39 +0100 keld@keldix.com wrote: > On Tue, Nov 01, 2011 at 04:13:26PM -0400, Miles Fidelman wrote: > > David Brown wrote: > > > > > >No, md RAID10 does /not/ offer more redundancy than RAID1. You are > > >right that md RAID10 offers more than RAID1 (or traditional RAID0 over > > >RAID1 sets) - but it is a convenience and performance benefit, not a > > >redundancy benefit. In particular, it lets you build RAID10 from any > > >number of disks, not just two. And it lets you stripe over all disks, > > >improving performance for some loads (though not /all/ loads - if you > > >have lots of concurrent small reads, you may be faster using plain > > >RAID1). > > In fact raid10 mas a bit less redundancy than raid1+0. > It is as far as I know built as raid0+1 with a disk layout > where you can only loose eg 1 out of 4 disks, while raid1+0 > in some cases can lose 2 disks out of 4. With md/raid10 you can in some case lose 2 out of 4 disks and survive, just like raid1+0. NeilBrown > > Also for lots of concurrent small reads raid10 can in some cases be somewhat > faster than raid1, and AFAIK never slower than raid1. > > Best regards > keld > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-01 22:25 ` NeilBrown @ 2011-11-01 22:38 ` Miles Fidelman 2011-11-02 1:40 ` keld 2011-11-02 1:37 ` keld 1 sibling, 1 reply; 27+ messages in thread From: Miles Fidelman @ 2011-11-01 22:38 UTC (permalink / raw) Cc: linux-raid NeilBrown wrote: > On Tue, 1 Nov 2011 23:15:39 +0100 keld@keldix.com wrote: > >> On Tue, Nov 01, 2011 at 04:13:26PM -0400, Miles Fidelman wrote: >>> David Brown wrote: >>>> No, md RAID10 does /not/ offer more redundancy than RAID1. You are >>>> right that md RAID10 offers more than RAID1 (or traditional RAID0 over >>>> RAID1 sets) - but it is a convenience and performance benefit, not a >>>> redundancy benefit. In particular, it lets you build RAID10 from any >>>> number of disks, not just two. And it lets you stripe over all disks, >>>> improving performance for some loads (though not /all/ loads - if you >>>> have lots of concurrent small reads, you may be faster using plain >>>> RAID1). >> In fact raid10 mas a bit less redundancy than raid1+0. >> It is as far as I know built as raid0+1 with a disk layout >> where you can only loose eg 1 out of 4 disks, while raid1+0 >> in some cases can lose 2 disks out of 4. > With md/raid10 you can in some case lose 2 out of 4 disks and survive, just > like raid1+0. > it occurs to me that it's a real bummer that all the md documentation, that was on raid.wiki.kernel.org, has been inaccessible since the kernel.org hack a couple of months ago -- anybody know if that's going to be back soon, or if that documentation lives somewhere else as well? -- In theory, there is no difference between theory and practice. In<fnord> practice, there is. .... Yogi Berra ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-01 22:38 ` Miles Fidelman @ 2011-11-02 1:40 ` keld 0 siblings, 0 replies; 27+ messages in thread From: keld @ 2011-11-02 1:40 UTC (permalink / raw) To: Miles Fidelman; +Cc: no, To-header, on, "input <", linux-raid On Tue, Nov 01, 2011 at 06:38:38PM -0400, Miles Fidelman wrote: > it occurs to me that it's a real bummer that all the md documentation, > that was on raid.wiki.kernel.org, has been inaccessible since the > kernel.org hack a couple of months ago -- anybody know if that's going > to be back soon, or if that documentation lives somewhere else as well? What has happened there? Who is in contact with kernel.org to secure the wiki? There was only wiki text, so that info would most likely not be compromised. best regards keld ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-01 22:25 ` NeilBrown 2011-11-01 22:38 ` Miles Fidelman @ 2011-11-02 1:37 ` keld 2011-11-02 1:48 ` NeilBrown 1 sibling, 1 reply; 27+ messages in thread From: keld @ 2011-11-02 1:37 UTC (permalink / raw) To: NeilBrown; +Cc: Miles Fidelman, linux-raid On Wed, Nov 02, 2011 at 09:25:26AM +1100, NeilBrown wrote: > On Tue, 1 Nov 2011 23:15:39 +0100 keld@keldix.com wrote: > > > On Tue, Nov 01, 2011 at 04:13:26PM -0400, Miles Fidelman wrote: > > > David Brown wrote: > > > > > > > >No, md RAID10 does /not/ offer more redundancy than RAID1. You are > > > >right that md RAID10 offers more than RAID1 (or traditional RAID0 over > > > >RAID1 sets) - but it is a convenience and performance benefit, not a > > > >redundancy benefit. In particular, it lets you build RAID10 from any > > > >number of disks, not just two. And it lets you stripe over all disks, > > > >improving performance for some loads (though not /all/ loads - if you > > > >have lots of concurrent small reads, you may be faster using plain > > > >RAID1). > > > > In fact raid10 mas a bit less redundancy than raid1+0. > > It is as far as I know built as raid0+1 with a disk layout > > where you can only loose eg 1 out of 4 disks, while raid1+0 > > in some cases can loose 2 disks out of 4. > > With md/raid10 you can in some case lose 2 out of 4 disks and survive, just > like raid1+0. OK, in which cases, and when is this not the case? best regards keld ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-02 1:37 ` keld @ 2011-11-02 1:48 ` NeilBrown 2011-11-02 7:02 ` keld 0 siblings, 1 reply; 27+ messages in thread From: NeilBrown @ 2011-11-02 1:48 UTC (permalink / raw) To: keld; +Cc: Miles Fidelman, linux-raid [-- Attachment #1: Type: text/plain, Size: 1363 bytes --] On Wed, 2 Nov 2011 02:37:56 +0100 keld@keldix.com wrote: > On Wed, Nov 02, 2011 at 09:25:26AM +1100, NeilBrown wrote: > > On Tue, 1 Nov 2011 23:15:39 +0100 keld@keldix.com wrote: > > > > > On Tue, Nov 01, 2011 at 04:13:26PM -0400, Miles Fidelman wrote: > > > > David Brown wrote: > > > > > > > > > >No, md RAID10 does /not/ offer more redundancy than RAID1. You are > > > > >right that md RAID10 offers more than RAID1 (or traditional RAID0 over > > > > >RAID1 sets) - but it is a convenience and performance benefit, not a > > > > >redundancy benefit. In particular, it lets you build RAID10 from any > > > > >number of disks, not just two. And it lets you stripe over all disks, > > > > >improving performance for some loads (though not /all/ loads - if you > > > > >have lots of concurrent small reads, you may be faster using plain > > > > >RAID1). > > > > > > In fact raid10 mas a bit less redundancy than raid1+0. > > > It is as far as I know built as raid0+1 with a disk layout > > > where you can only loose eg 1 out of 4 disks, while raid1+0 > > > in some cases can loose 2 disks out of 4. > > > > With md/raid10 you can in some case lose 2 out of 4 disks and survive, just > > like raid1+0. > > OK, in which cases, and when is this not the case? > > best regards > keld "just like raid1+0" NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-02 1:48 ` NeilBrown @ 2011-11-02 7:02 ` keld 2011-11-02 9:20 ` Jonathan Tripathy 2011-11-02 11:27 ` David Brown 0 siblings, 2 replies; 27+ messages in thread From: keld @ 2011-11-02 7:02 UTC (permalink / raw) To: NeilBrown; +Cc: Miles Fidelman, linux-raid On Wed, Nov 02, 2011 at 12:48:16PM +1100, NeilBrown wrote: > On Wed, 2 Nov 2011 02:37:56 +0100 keld@keldix.com wrote: > > > On Wed, Nov 02, 2011 at 09:25:26AM +1100, NeilBrown wrote: > > > On Tue, 1 Nov 2011 23:15:39 +0100 keld@keldix.com wrote: > > > > > > > On Tue, Nov 01, 2011 at 04:13:26PM -0400, Miles Fidelman wrote: > > > > > David Brown wrote: > > > > > > > > > > > >No, md RAID10 does /not/ offer more redundancy than RAID1. You are > > > > > >right that md RAID10 offers more than RAID1 (or traditional RAID0 over > > > > > >RAID1 sets) - but it is a convenience and performance benefit, not a > > > > > >redundancy benefit. In particular, it lets you build RAID10 from any > > > > > >number of disks, not just two. And it lets you stripe over all disks, > > > > > >improving performance for some loads (though not /all/ loads - if you > > > > > >have lots of concurrent small reads, you may be faster using plain > > > > > >RAID1). > > > > > > > > In fact raid10 mas a bit less redundancy than raid1+0. > > > > It is as far as I know built as raid0+1 with a disk layout > > > > where you can only loose eg 1 out of 4 disks, while raid1+0 > > > > in some cases can loose 2 disks out of 4. > > > > > > With md/raid10 you can in some case lose 2 out of 4 disks and survive, just > > > like raid1+0. > > > > OK, in which cases, and when is this not the case? > > > > best regards > > keld > > "just like raid1+0" No, that is not the case AFAIK. Eg the layout of raid10,f2 with 4 disks is "just like raid0+1". best regards keld ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-02 7:02 ` keld @ 2011-11-02 9:20 ` Jonathan Tripathy 2011-11-02 11:27 ` David Brown 1 sibling, 0 replies; 27+ messages in thread From: Jonathan Tripathy @ 2011-11-02 9:20 UTC (permalink / raw) To: keld; +Cc: NeilBrown, Miles Fidelman, linux-raid On 02/11/2011 07:02, keld@keldix.com wrote: > On Wed, Nov 02, 2011 at 12:48:16PM +1100, NeilBrown wrote: >> On Wed, 2 Nov 2011 02:37:56 +0100 keld@keldix.com wrote: >> >>> On Wed, Nov 02, 2011 at 09:25:26AM +1100, NeilBrown wrote: >>>> On Tue, 1 Nov 2011 23:15:39 +0100 keld@keldix.com wrote: >>>> >>>>> On Tue, Nov 01, 2011 at 04:13:26PM -0400, Miles Fidelman wrote: >>>>>> David Brown wrote: >>>>>>> No, md RAID10 does /not/ offer more redundancy than RAID1. You are >>>>>>> right that md RAID10 offers more than RAID1 (or traditional RAID0 over >>>>>>> RAID1 sets) - but it is a convenience and performance benefit, not a >>>>>>> redundancy benefit. In particular, it lets you build RAID10 from any >>>>>>> number of disks, not just two. And it lets you stripe over all disks, >>>>>>> improving performance for some loads (though not /all/ loads - if you >>>>>>> have lots of concurrent small reads, you may be faster using plain >>>>>>> RAID1). >>>>> In fact raid10 mas a bit less redundancy than raid1+0. >>>>> It is as far as I know built as raid0+1 with a disk layout >>>>> where you can only loose eg 1 out of 4 disks, while raid1+0 >>>>> in some cases can loose 2 disks out of 4. >>>> With md/raid10 you can in some case lose 2 out of 4 disks and survive, just >>>> like raid1+0. >>> OK, in which cases, and when is this not the case? >>> >>> best regards >>> keld >> "just like raid1+0" > No, that is not the case AFAIK. Eg the layout of raid10,f2 with 4 disks is > "just like raid0+1". > Isn't md raid10 n2 exactly the same as RAID1+0?? ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover)
2011-11-02 7:02 ` keld
2011-11-02 9:20 ` Jonathan Tripathy
@ 2011-11-02 11:27 ` David Brown
1 sibling, 0 replies; 27+ messages in thread
From: David Brown @ 2011-11-02 11:27 UTC (permalink / raw)
To: linux-raid

On 02/11/2011 08:02, keld@keldix.com wrote:
> On Wed, Nov 02, 2011 at 12:48:16PM +1100, NeilBrown wrote:
>> On Wed, 2 Nov 2011 02:37:56 +0100 keld@keldix.com wrote:
>>
>>> On Wed, Nov 02, 2011 at 09:25:26AM +1100, NeilBrown wrote:
>>>> On Tue, 1 Nov 2011 23:15:39 +0100 keld@keldix.com wrote:
>>>>
>>>>> On Tue, Nov 01, 2011 at 04:13:26PM -0400, Miles Fidelman wrote:
>>>>>> David Brown wrote:
>>>>>>>
>>>>>>> No, md RAID10 does /not/ offer more redundancy than RAID1. You are
>>>>>>> right that md RAID10 offers more than RAID1 (or traditional RAID0 over
>>>>>>> RAID1 sets) - but it is a convenience and performance benefit, not a
>>>>>>> redundancy benefit. In particular, it lets you build RAID10 from any
>>>>>>> number of disks, not just two. And it lets you stripe over all disks,
>>>>>>> improving performance for some loads (though not /all/ loads - if you
>>>>>>> have lots of concurrent small reads, you may be faster using plain
>>>>>>> RAID1).
>>>>>
>>>>> In fact raid10 has a bit less redundancy than raid1+0.
>>>>> It is as far as I know built as raid0+1 with a disk layout
>>>>> where you can only lose eg 1 out of 4 disks, while raid1+0
>>>>> in some cases can lose 2 disks out of 4.
>>>>
>>>> With md/raid10 you can in some case lose 2 out of 4 disks and survive, just
>>>> like raid1+0.
>>>
>>> OK, in which cases, and when is this not the case?
>>>
>>> best regards
>>> keld
>>
>> "just like raid1+0"
>
> No, that is not the case AFAIK. Eg the layout of raid10,f2 with 4 disks is
> "just like raid0+1".
>

And raid0+1 can also survive two disk failures in some cases.

It boils down to this - if you have a two-way mirror (RAID1, RAID10,
RAID1+0, RAID0+1), then you can keep losing disks unless you lose both
copies of part of your data.

Look at the layout diagrams on
<http://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10>
for the four drive cases. You can lose disk 1 and 3, or disk 2 and 4, in
either the "near 2" or the "far 2" cases. But if you lose disks 1 and 2,
or disks 3 and 4, your data is gone:

RAID10,n2, or RAID1+0 (stripe of mirrors):

Good array     Lost 1+3 (OK)   Lost 1+2 (Dead)
A1 A1 A2 A2    x  A1 x  A2     x  x  A2 A2
A3 A3 A4 A4    x  A3 x  A4     x  x  A4 A4
A5 A5 A6 A6    x  A5 x  A6     x  x  A6 A6
A7 A7 A8 A8    x  A7 x  A8     x  x  A8 A8

RAID10,f2:

Good array     Lost 1+3 (OK)   Lost 1+2 (Dead)
A1 A2 A3 A4    x  A2 x  A4     x  x  A3 A4
A5 A6 A7 A8    x  A6 x  A8     x  x  A7 A8
....
A4 A1 A2 A3    x  A1 x  A3     x  x  A2 A3
A8 A5 A6 A7    x  A5 x  A7     x  x  A6 A7

RAID10,o2:

Good array     Lost 1+3 (OK)   Lost 1+2 (Dead)
A1 A2 A3 A4    x  A2 x  A4     x  x  A3 A4
A4 A1 A2 A3    x  A1 x  A3     x  x  A2 A3
A5 A6 A7 A8    x  A6 x  A8     x  x  A7 A8
A8 A5 A6 A7    x  A5 x  A7     x  x  A6 A7

RAID0+1 (mirror of stripes):

Good array     Lost 1+3 (Dead) Lost 1+2 (OK)
A1 A2 A1 A2    x  A2 x  A2     x  x  A1 A2
A3 A4 A3 A4    x  A4 x  A4     x  x  A3 A4
A5 A6 A5 A6    x  A6 x  A6     x  x  A5 A6
A7 A8 A7 A8    x  A8 x  A8     x  x  A7 A8

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-01 0:38 possibly silly question (raid failover) Miles Fidelman 2011-11-01 9:14 ` David Brown @ 2011-11-01 9:26 ` Johannes Truschnigg 2011-11-01 13:02 ` Miles Fidelman 2011-11-02 6:41 ` Stan Hoeppner 2 siblings, 1 reply; 27+ messages in thread From: Johannes Truschnigg @ 2011-11-01 9:26 UTC (permalink / raw) To: Miles Fidelman; +Cc: linux-raid@vger.kernel.org [-- Attachment #1: Type: text/plain, Size: 799 bytes --] Hi Miles, On Mon, Oct 31, 2011 at 08:38:16PM -0400, Miles Fidelman wrote: > Hi Folks, > > I've been exploring various ways to build a "poor man's high > availability cluster." Currently I'm running two nodes, using raid > on each box, running DRBD across the boxes, and running Xen virtual > machines on top of that. > [...] while I do note that I don't answer your question at hand, I'm still inclined to ask if you do know Ganeti (http://code.google.com/p/ganeti/) yet? It offers pretty much everything you seem to want to have. -- with best regards: - Johannes Truschnigg ( johannes@truschnigg.info ) www: http://johannes.truschnigg.info/ phone: +43 650 2 133337 xmpp: johannes@truschnigg.info Please do not bother me with HTML-eMail or attachments. Thank you. [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-01 9:26 ` Johannes Truschnigg @ 2011-11-01 13:02 ` Miles Fidelman 2011-11-01 13:33 ` John Robinson 0 siblings, 1 reply; 27+ messages in thread From: Miles Fidelman @ 2011-11-01 13:02 UTC (permalink / raw) Cc: linux-raid@vger.kernel.org Johannes Truschnigg wrote: > Hi Miles, > > On Mon, Oct 31, 2011 at 08:38:16PM -0400, Miles Fidelman wrote: >> Hi Folks, >> >> I've been exploring various ways to build a "poor man's high >> availability cluster." Currently I'm running two nodes, using raid >> on each box, running DRBD across the boxes, and running Xen virtual >> machines on top of that. >> [...] > while I do note that I don't answer your question at hand, I'm still inclined > to ask if you do know Ganeti (http://code.google.com/p/ganeti/) yet? It offers > pretty much everything you seem to want to have. Actually I do know Ganeti, and it does NOT come close to what I'm suggesting: - it supports migration but not auto-failover - DRBD is the only mechanism it provides for replicating data across nodes - which limits migration to a 2-node pair -- In theory, there is no difference between theory and practice. In<fnord> practice, there is. .... Yogi Berra ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-01 13:02 ` Miles Fidelman @ 2011-11-01 13:33 ` John Robinson 0 siblings, 0 replies; 27+ messages in thread From: John Robinson @ 2011-11-01 13:33 UTC (permalink / raw) To: Miles Fidelman; +Cc: linux-raid@vger.kernel.org On 01/11/2011 13:02, Miles Fidelman wrote: > Johannes Truschnigg wrote: >> Hi Miles, >> >> On Mon, Oct 31, 2011 at 08:38:16PM -0400, Miles Fidelman wrote: >>> Hi Folks, >>> >>> I've been exploring various ways to build a "poor man's high >>> availability cluster." Currently I'm running two nodes, using raid >>> on each box, running DRBD across the boxes, and running Xen virtual >>> machines on top of that. >>> [...] >> while I do note that I don't answer your question at hand, I'm still >> inclined >> to ask if you do know Ganeti (http://code.google.com/p/ganeti/) yet? >> It offers >> pretty much everything you seem to want to have. > > Actually I do know Ganeti, and it does NOT come close to what I'm > suggesting: > - it supports migration but not auto-failover > - DRBD is the only mechanism it provides for replicating data across > nodes - which limits migration to a 2-node pair It might still do what I think you want: think of each of the four servers running 3 VMs (or groups of VMs) normally, and three servers running 4 VMs when one of the servers fails. Then for each VM you replicate its storage to another server, as follows: Node A: VM A1->Node B; VM A2->Node C; VM A3->Node D Node B: VM B1->Node C; VM B2->Node D; VM B3->Node A Node C: VM C1->Node D; VM C2->Node A; VM C3->Node B Node D: VM D1->Node A; VM D2->Node B; VM D3->Node C So each node needs double the storage, because as well as its own VMs is has copies of one from each of the other nodes. When any node goes down, your cluster management makes all three of the others start up one more VM - isn't that what Ganeti means by "quick recovery in case of physical system failure" and "automated instance migration across clusters"? I'd probably do some kind of RAID over the 4 disks on each server as well, and do live migrations when a drive fails in any one machine, so that the VMs don't suffer from the degraded RAID and the machine's relatively quiet while you're replacing the failed drive, but now we're getting into having to have perhaps double the storage again, and it's not looking like it's a poor man's solution after all - can you buy 4 cheap commodity servers with double the storage and enough spare RAM for less than you could have bought 3 classy bulletproof ones? Cheers, John. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-01 0:38 possibly silly question (raid failover) Miles Fidelman 2011-11-01 9:14 ` David Brown 2011-11-01 9:26 ` Johannes Truschnigg @ 2011-11-02 6:41 ` Stan Hoeppner 2011-11-02 13:17 ` Miles Fidelman 2 siblings, 1 reply; 27+ messages in thread From: Stan Hoeppner @ 2011-11-02 6:41 UTC (permalink / raw) To: Miles Fidelman; +Cc: linux-raid@vger.kernel.org On 10/31/2011 7:38 PM, Miles Fidelman wrote: > Hi Folks, > > I've been exploring various ways to build a "poor man's high > availability cluster." Overall advice: Don't attempt to reinvent the wheel. Building such a thing is normally a means to end, not an end itself. If your goal is supporting an actual workload and not simply the above, there are a number of good options readily available. > Currently I'm running two nodes, using raid on > each box, running DRBD across the boxes, and running Xen virtual > machines on top of that. > > I now have two brand new servers - for a total of four nodes - each with > four large drives, and four gigE ports. A good option in this case would be to simply take the 8 new drives and add 4 each to the existing servers, expanding existing md RAID devices and filesystems where appropriate. Then setup NFS cluster services and export the appropriate filesystems to the two new servers. This keeps your overall complexity low, reliability and performance high, and yields a setup many are familiar with if you need troubleshooting assistance in the future. This is a widely used architecture and has been for many years. > Between the configuration of the systems, and rack space limitations, > I'm trying to use each server for both storage and processing - and been > looking at various options for building a cluster file system across all > 16 drives, that supports VM migration/failover across all for nodes, and > that's resistant to both single-drive failures, and to losing an entire > server (and it's 4 drives), and maybe even losing two servers (8 drives). The solution above gives you all of this, except the unlikely scenario of losing both storage servers simultaneously. If that is truly something you're willing to spend money to mitigate then slap a 3rd storage server in an off site location and use the DRBD option for such. > The approach that looks most interesting is Sheepdog - but it's both > tied to KVM rather than Xen, and a bit immature. Interesting disclaimer for an open source project, specifically the 2nd half of the statement: "There is no guarantee that this software will be included in future software releases, and it probably will not be included." > But it lead me to wonder if something like this might make sense: > - mount each drive using AoE > - run md RAID 10 across all 16 drives one one node > - mount the resulting md device using AoE > - if the node running the md device fails, use pacemaker/crm to > auto-start an md device on another node, re-assemble and republish the > array > - resulting in a 16-drive raid10 array that's accessible from all nodes The level of complexity here is too high for a production architecture. In addition, doing something like this puts you way out in uncharted waters, where you will have few, if any, peers to assist in time of need. When (not if) something breaks in an unexpected way, how quickly will you be able to troubleshoot and resolve a problem in such a complex architecture? > Or is this just silly and/or wrongheaded? I don't think it's silly. Maybe a little wrongheaded, to use your term. 
IBM has had GPFS on the market for a decade plus. It will do exactly what you want, but the price is likely well beyond your budget, assuming they'd even return your call WRT a 4 node cluster. (IBM GPFS customers are mostly government labs, aerospace giants, and pharma companies, with very large node count clusters, hundreds to thousands). If I were doing such a setup to fit your stated needs, I'd spend ~$10-15K USD on a low/midrange iSCSI SAN box with 2GB cache dual controllers/PSUs and 16 x 500GB SATA drives. I'd create a single RAID6 array of 14 drives with two standby spares, yielding 7TB of space for carving up LUNS. Carve and export the LUNS you need to each node's dual/quad NIC MACs with multipathing setup on each node, and format the LUNs with GFS2. All nodes now have access to all storage you assign. With such a setup you can easily add future nodes. It's not complex, it is a well understood architecture, and relatively straightforward to troubleshoot. Now, if that solution is out of your price range, I think the redundant cluster NFS server architecture is in your immediate future. It's in essence free, and it will give you everything you need, in spite of the fact that the "node symmetry" isn't what you apparently envision as "optimal" for a cluster. -- Stan ^ permalink raw reply [flat|nested] 27+ messages in thread
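For completeness, a hedged sketch of what the per-node side of that iSCSI + GFS2 suggestion typically looks like with open-iscsi and the stock cluster tools; the target address, cluster name, filesystem name and mount point below are invented for illustration:

  # discover and log in to the SAN's exported LUNs
  iscsiadm -m discovery -t sendtargets -p 192.168.10.50
  iscsiadm -m node --login
  multipath -ll                      # confirm the multipathed block devices

  # one-time: format a LUN with GFS2 for a 4-node cluster
  mkfs.gfs2 -p lock_dlm -t vmcluster:vmstore -j 4 /dev/mapper/mpatha

  # on each node (with the cluster stack and DLM already running)
  mount -t gfs2 /dev/mapper/mpatha /srv/vmstore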
* Re: possibly silly question (raid failover) 2011-11-02 6:41 ` Stan Hoeppner @ 2011-11-02 13:17 ` Miles Fidelman 0 siblings, 0 replies; 27+ messages in thread From: Miles Fidelman @ 2011-11-02 13:17 UTC (permalink / raw) Cc: linux-raid@vger.kernel.org Stan, Stan Hoeppner wrote: > On 10/31/2011 7:38 PM, Miles Fidelman wrote: >> Hi Folks, >> >> I've been exploring various ways to build a "poor man's high >> availability cluster." > Overall advice: Don't attempt to reinvent the wheel. > > Building such a thing is normally a means to end, not an end itself. If > your goal is supporting an actual workload and not simply the above, > there are a number of good options readily available. well, normally I'd agree with you, but... - we're both an R&D organization and a (small, but aspiring) provider of hosted services - so experimenting with infrastructure is part of the actual work -- and part of where I'd like to head is an environment that's built out of commodity boxes configured in a way that scales out (Sheepdog is really the model I have in mind) - I'd sure like to find something that does what we need: -- we're using DRBD/Pacemaker/etc. - but that's sort of brittle and only supports pair-wise migration/failover -- if Sheepdog was a little more mature, and supported Xen, it would be exactly what I'm looking for -- Xen over the newest release of GlustFS is starting to look attractive -- some of the single system image projects (OpenMosix, Kerrighed) would be attractive if the projects were alive >> Currently I'm running two nodes, using raid on >> each box, running DRBD across the boxes, and running Xen virtual >> machines on top of that. >> >> I now have two brand new servers - for a total of four nodes - each with >> four large drives, and four gigE ports. > A good option in this case would be to simply take the 8 new drives and > add 4 each to the existing servers, expanding existing md RAID devices > and filesystems where appropriate. Then setup NFS cluster services and > export the appropriate filesystems to the two new servers. This keeps > your overall complexity low, reliability and performance high, and > yields a setup many are familiar with if you need troubleshooting > assistance in the future. This is a widely used architecture and has > been for many years. unfortunately, we're currently trying to make do with 4U of rackspace, and 4 1U servers, each of which holds 4 drives, can't quite move the disks around the way you're talking about -- unfortunately, the older boxes don't have hardware virtualization support or I'd seriously consider migrating to KVM and Sheepdog -- if Sheepdog were just a bit more mature, I'd seriously consider simply replacing the older boxes >> The approach that looks most interesting is Sheepdog - but it's both >> tied to KVM rather than Xen, and a bit immature. > Interesting disclaimer for an open source project, specifically the 2nd > half of the statement: > > "There is no guarantee that this software will be included in future > software releases, and it probably will not be included." Yeah, but it seems to have some traction and support, and the OpenStack community seems to be looking at it seriously. Having said that, it's things like that that are pushing me toward GlusterFS (doesn't hurt that Red Hat just purchased Gluster and seems to be putting some serious resources into it). 
> >> But it lead me to wonder if something like this might make sense: >> - mount each drive using AoE >> - run md RAID 10 across all 16 drives one one node >> - mount the resulting md device using AoE >> - if the node running the md device fails, use pacemaker/crm to >> auto-start an md device on another node, re-assemble and republish the >> array >> - resulting in a 16-drive raid10 array that's accessible from all nodes > The level of complexity here is too high for a production architecture. > In addition, doing something like this puts you way out in uncharted > waters, where you will have few, if any, peers to assist in time of > need. When (not if) something breaks in an unexpected way, how quickly > will you be able to troubleshoot and resolve a problem in such a complex > architecture? Understood. This path is somewhat more of a matter of curiosity. AoE is pretty mature, and there does seem to be a RAID resource agent for CRM - so some of the pieces exist. Seems like the pieces would fit together - so I was wondering if anybody had actually tried it. > If I were doing such a setup to fit your stated needs, I'd spend > ~$10-15K USD on a low/midrange iSCSI SAN box with 2GB cache dual > controllers/PSUs and 16 x 500GB SATA drives. I'd create a single RAID6 > array of 14 drives with two standby spares, yielding 7TB of space for > carving up LUNS. Carve and export the LUNS you need to each node's > dual/quad NIC MACs with multipathing setup on each node, and format > the LUNs with GFS2. All nodes now have access to all storage you > assign. With such a setup you can easily add future nodes. It's not > complex, it is a well understood architecture, and relatively > straightforward to troubleshoot. Now, if that solution is out of your > price range, I think the redundant cluster NFS server architecture is > in your immediate future. It's in essence free, and it will give you > everything you need, in spite of the fact that the "node symmetry" > isn't what you apparently envision as "optimal" for a cluster. Hmm... if I were spending real money, and had more rack space to put things in, I'd probably do something more like a small OpenStack configuration, but that's me. Thanks for your comments. Lots of food for thought! Miles -- In theory, there is no difference between theory and practice. In<fnord> practice, there is. .... Yogi Berra ^ permalink raw reply [flat|nested] 27+ messages in thread