* possibly silly question (raid failover)
@ 2011-11-01 0:38 Miles Fidelman
2011-11-01 9:14 ` David Brown
` (2 more replies)
0 siblings, 3 replies; 27+ messages in thread
From: Miles Fidelman @ 2011-11-01 0:38 UTC (permalink / raw)
To: linux-raid@vger.kernel.org
Hi Folks,
I've been exploring various ways to build a "poor man's high
availability cluster." Currently I'm running two nodes, using raid on
each box, running DRBD across the boxes, and running Xen virtual
machines on top of that.
I now have two brand new servers - for a total of four nodes - each with
four large drives, and four gigE ports.
Between the configuration of the systems and rack space limitations,
I'm trying to use each server for both storage and processing - and have
been looking at various options for building a cluster file system across
all 16 drives, one that supports VM migration/failover across all four
nodes, that's resistant to single-drive failures and to losing an entire
server (and its 4 drives), and maybe even to losing two servers (8 drives).
The approach that looks most interesting is Sheepdog - but it's both
tied to KVM rather than Xen, and a bit immature.
But it led me to wonder if something like this might make sense:
- mount each drive using AoE
- run md RAID 10 across all 16 drives on one node
- mount the resulting md device using AoE
- if the node running the md device fails, use pacemaker/crm to
auto-start an md device on another node, re-assemble and republish the array
- resulting in a 16-drive raid10 array that's accessible from all nodes
Or is this just silly and/or wrongheaded?
Miles Fidelman
--
In theory, there is no difference between theory and practice.
In<fnord> practice, there is. .... Yogi Berra
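To make the proposal above concrete, here is a minimal sketch of the export/import plumbing, assuming the stock aoe kernel module plus the vblade/aoetools userspace; the shelf numbers, interface name and device paths are illustrative only, not taken from the original post:

  # on each storage node: export its four local drives over AoE
  # (give each node its own shelf number, 0-3)
  modprobe aoe
  vbladed 0 0 eth1 /dev/sda
  vbladed 0 1 eth1 /dev/sdb
  vbladed 0 2 eth1 /dev/sdc
  vbladed 0 3 eth1 /dev/sdd

  # on the node that will run the md array: pick up all 16 exports
  modprobe aoe
  aoe-discover
  ls /dev/etherd/        # e0.0 .. e3.3 once all four shelves are visible

Whether re-exporting the assembled /dev/md device over AoE, and failing that role over cleanly, actually works in practice is exactly the question the rest of the thread picks at.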
^ permalink raw reply [flat|nested] 27+ messages in thread* Re: possibly silly question (raid failover) 2011-11-01 0:38 possibly silly question (raid failover) Miles Fidelman @ 2011-11-01 9:14 ` David Brown 2011-11-01 13:05 ` Miles Fidelman 2011-11-01 9:26 ` Johannes Truschnigg 2011-11-02 6:41 ` Stan Hoeppner 2 siblings, 1 reply; 27+ messages in thread From: David Brown @ 2011-11-01 9:14 UTC (permalink / raw) To: linux-raid On 01/11/2011 01:38, Miles Fidelman wrote: > Hi Folks, > > I've been exploring various ways to build a "poor man's high > availability cluster." Currently I'm running two nodes, using raid on > each box, running DRBD across the boxes, and running Xen virtual > machines on top of that. > > I now have two brand new servers - for a total of four nodes - each with > four large drives, and four gigE ports. > > Between the configuration of the systems, and rack space limitations, > I'm trying to use each server for both storage and processing - and been > looking at various options for building a cluster file system across all > 16 drives, that supports VM migration/failover across all for nodes, and > that's resistant to both single-drive failures, and to losing an entire > server (and it's 4 drives), and maybe even losing two servers (8 drives). > > The approach that looks most interesting is Sheepdog - but it's both > tied to KVM rather than Xen, and a bit immature. > > But it lead me to wonder if something like this might make sense: > - mount each drive using AoE > - run md RAID 10 across all 16 drives one one node > - mount the resulting md device using AoE > - if the node running the md device fails, use pacemaker/crm to > auto-start an md device on another node, re-assemble and republish the > array > - resulting in a 16-drive raid10 array that's accessible from all nodes > > Or is this just silly and/or wrongheaded? > > Miles Fidelman > One thing to watch out for when making high-availability systems and using RAID1 (or RAID10), is that RAID1 only tolerates a single failure in the worst case. If you have built your disk image spread across different machines with two-copy RAID1, and a server goes down, then the rest then becomes vulnerable to a single disk failure (or a single unrecoverable read error). It's a different matter if you are building a 4-way mirror from the four servers, of course. Alternatively, each server could have its four disks set up as a 3+1 local raid5. Then you combine them all from different machines using raid10 (or possibly just raid1 - depending on your usage patterns, that may be faster). That gives you an extra safety margin on disk problems. But the key issue is to consider what might fail, and what the consequences of that failure are - including the consequences for additional failures. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover)
2011-11-01 9:14 ` David Brown
@ 2011-11-01 13:05 ` Miles Fidelman
2011-11-01 13:37 ` John Robinson
0 siblings, 1 reply; 27+ messages in thread
From: Miles Fidelman @ 2011-11-01 13:05 UTC (permalink / raw)
Cc: linux-raid

David Brown wrote:
>
> One thing to watch out for when making high-availability systems and
> using RAID1 (or RAID10), is that RAID1 only tolerates a single failure
> in the worst case. If you have built your disk image spread across
> different machines with two-copy RAID1, and a server goes down, then
> the rest then becomes vulnerable to a single disk failure (or a single
> unrecoverable read error).
>
> It's a different matter if you are building a 4-way mirror from the
> four servers, of course.
>

Just a nit here: I'm looking at "md RAID10", which behaves quite
differently than conventional RAID10. Rather than striping and raiding
as separate operations, it does both as a unitary operation -
essentially spreading n copies of each block across m disks. Rather
clever that way.

Hence my thought about a 16-disk md RAID10 array - which offers lots of
redundancy.

Miles

--
In theory, there is no difference between theory and practice.
In<fnord> practice, there is. .... Yogi Berra

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-01 13:05 ` Miles Fidelman @ 2011-11-01 13:37 ` John Robinson 2011-11-01 14:36 ` David Brown 0 siblings, 1 reply; 27+ messages in thread From: John Robinson @ 2011-11-01 13:37 UTC (permalink / raw) To: Miles Fidelman; +Cc: linux-raid On 01/11/2011 13:05, Miles Fidelman wrote: > David Brown wrote: >> >> One thing to watch out for when making high-availability systems and >> using RAID1 (or RAID10), is that RAID1 only tolerates a single failure >> in the worst case. If you have built your disk image spread across >> different machines with two-copy RAID1, and a server goes down, then >> the rest then becomes vulnerable to a single disk failure (or a single >> unrecoverable read error). >> >> It's a different matter if you are building a 4-way mirror from the >> four servers, of course. >> > > Just a nit here: I'm looking at "md RAID10" which behaves quite > differently that conventional RAID10. Rather than striping and raiding > as separate operations, it does both as a unitary operation - > essentially spreading n copies of each block across m disks. Rather > clever that way. > > Hence my thought about a 16-disk md RAID10 array - which offers lots of > redundancy. I'm pretty sure that a normal (near) md RAID10 on 16 disks will use the first two drives you specify as mirrors, and the next two, and so on, so when you specify the drive order when building the array you'd need to make sure all the mirrors are on another machine. Cheers, John. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-01 13:37 ` John Robinson @ 2011-11-01 14:36 ` David Brown 2011-11-01 20:13 ` Miles Fidelman 0 siblings, 1 reply; 27+ messages in thread From: David Brown @ 2011-11-01 14:36 UTC (permalink / raw) To: linux-raid On 01/11/2011 14:37, John Robinson wrote: > On 01/11/2011 13:05, Miles Fidelman wrote: >> David Brown wrote: >>> >>> One thing to watch out for when making high-availability systems and >>> using RAID1 (or RAID10), is that RAID1 only tolerates a single failure >>> in the worst case. If you have built your disk image spread across >>> different machines with two-copy RAID1, and a server goes down, then >>> the rest then becomes vulnerable to a single disk failure (or a single >>> unrecoverable read error). >>> >>> It's a different matter if you are building a 4-way mirror from the >>> four servers, of course. >>> >> >> Just a nit here: I'm looking at "md RAID10" which behaves quite >> differently that conventional RAID10. Rather than striping and raiding >> as separate operations, it does both as a unitary operation - >> essentially spreading n copies of each block across m disks. Rather >> clever that way. >> >> Hence my thought about a 16-disk md RAID10 array - which offers lots of >> redundancy. No, md RAID10 does /not/ offer more redundancy than RAID1. You are right that md RAID10 offers more than RAID1 (or traditional RAID0 over RAID1 sets) - but it is a convenience and performance benefit, not a redundancy benefit. In particular, it lets you build RAID10 from any number of disks, not just two. And it lets you stripe over all disks, improving performance for some loads (though not /all/ loads - if you have lots of concurrent small reads, you may be faster using plain RAID1). To get higher redundancy with RAID10 or RAID1, you need to use more "ways" in the mirror. For example, creating RAID10 with "--layout n3" will give you three copies of all data, rather than just two, and therefore better redundancy - at the cost of disk space. When you write "RAID10", the assumption is you mean a normal two-way mirror unless you specifically say otherwise, and such a mirror has only a worst-case redundancy of 1 disk. A second failure will kill the array if it happens to hit the second copy of the data. > > I'm pretty sure that a normal (near) md RAID10 on 16 disks will use the > first two drives you specify as mirrors, and the next two, and so on, so > when you specify the drive order when building the array you'd need to > make sure all the mirrors are on another machine. > Correct. If you have a multiple of 4 disks, a "normal" near two-way RAID10 is almost indistinguishable from a standard two-way RAID1. > Cheers, > > John. > ^ permalink raw reply [flat|nested] 27+ messages in thread
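To put numbers on the "--layout n3" remark above, a hedged sketch of creating a three-copy near-layout RAID10 over the 16 AoE imports; the device names and the interleaved ordering are illustrative only, the ordering is merely intended to push the copies of each block onto different shelves/servers and should be verified with mdadm --detail before being relied on:

  # three near copies of every block; usable space is roughly 5.3 drives' worth
  mdadm --create /dev/md0 --level=10 --layout=n3 --raid-devices=16 \
      /dev/etherd/e0.0 /dev/etherd/e1.0 /dev/etherd/e2.0 /dev/etherd/e3.0 \
      /dev/etherd/e0.1 /dev/etherd/e1.1 /dev/etherd/e2.1 /dev/etherd/e3.1 \
      /dev/etherd/e0.2 /dev/etherd/e1.2 /dev/etherd/e2.2 /dev/etherd/e3.2 \
      /dev/etherd/e0.3 /dev/etherd/e1.3 /dev/etherd/e2.3 /dev/etherd/e3.3
  mdadm --detail /dev/md0   # check which devices actually hold each copy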
* Re: possibly silly question (raid failover) 2011-11-01 14:36 ` David Brown @ 2011-11-01 20:13 ` Miles Fidelman 2011-11-01 21:20 ` Robin Hill 2011-11-01 22:15 ` keld 0 siblings, 2 replies; 27+ messages in thread From: Miles Fidelman @ 2011-11-01 20:13 UTC (permalink / raw) Cc: linux-raid David Brown wrote: > > No, md RAID10 does /not/ offer more redundancy than RAID1. You are > right that md RAID10 offers more than RAID1 (or traditional RAID0 over > RAID1 sets) - but it is a convenience and performance benefit, not a > redundancy benefit. In particular, it lets you build RAID10 from any > number of disks, not just two. And it lets you stripe over all disks, > improving performance for some loads (though not /all/ loads - if you > have lots of concurrent small reads, you may be faster using plain > RAID1). wasn't suggesting that it does - just that it does things differently than normal raid 1+0 - for example, by doing mirroring and striping as a unitary operation, it works across odd number of drives - it also (I think) allows for more than 2 copies of a block (not completely clear how many copies of a block would be made if you specified a 16 drive array) - sort of what I'm wondering here -- In theory, there is no difference between theory and practice. In<fnord> practice, there is. .... Yogi Berra ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-01 20:13 ` Miles Fidelman @ 2011-11-01 21:20 ` Robin Hill 2011-11-01 21:32 ` Miles Fidelman 2011-11-01 22:15 ` keld 1 sibling, 1 reply; 27+ messages in thread From: Robin Hill @ 2011-11-01 21:20 UTC (permalink / raw) To: Miles Fidelman; +Cc: linux-raid [-- Attachment #1: Type: text/plain, Size: 1514 bytes --] On Tue Nov 01, 2011 at 04:13:26 -0400, Miles Fidelman wrote: > David Brown wrote: > > > > No, md RAID10 does /not/ offer more redundancy than RAID1. You are > > right that md RAID10 offers more than RAID1 (or traditional RAID0 over > > RAID1 sets) - but it is a convenience and performance benefit, not a > > redundancy benefit. In particular, it lets you build RAID10 from any > > number of disks, not just two. And it lets you stripe over all disks, > > improving performance for some loads (though not /all/ loads - if you > > have lots of concurrent small reads, you may be faster using plain > > RAID1). > > wasn't suggesting that it does - just that it does things differently > than normal raid 1+0 - for example, by doing mirroring and striping as a > unitary operation, it works across odd number of drives - it also (I > think) allows for more than 2 copies of a block (not completely clear > how many copies of a block would be made if you specified a 16 drive > array) - sort of what I'm wondering here > By default it'll make 2 copies, regardless how many devices are in the array. You can specify how many copies you want though, so -n3 will give you a near configuration with 3 copies, -n4 for four copies, etc. Cheers, Robin -- ___ ( ' } | Robin Hill <robin@robinhill.me.uk> | / / ) | Little Jim says .... | // !! | "He fallen in de water !!" | [-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-01 21:20 ` Robin Hill @ 2011-11-01 21:32 ` Miles Fidelman 2011-11-01 21:50 ` Robin Hill 2011-11-01 22:00 ` David Brown 0 siblings, 2 replies; 27+ messages in thread From: Miles Fidelman @ 2011-11-01 21:32 UTC (permalink / raw) To: linux-raid Robin Hill wrote: > On Tue Nov 01, 2011 at 04:13:26 -0400, Miles Fidelman wrote: > >> David Brown wrote: >>> No, md RAID10 does /not/ offer more redundancy than RAID1. You are >>> right that md RAID10 offers more than RAID1 (or traditional RAID0 over >>> RAID1 sets) - but it is a convenience and performance benefit, not a >>> redundancy benefit. In particular, it lets you build RAID10 from any >>> number of disks, not just two. And it lets you stripe over all disks, >>> improving performance for some loads (though not /all/ loads - if you >>> have lots of concurrent small reads, you may be faster using plain >>> RAID1). >> wasn't suggesting that it does - just that it does things differently >> than normal raid 1+0 - for example, by doing mirroring and striping as a >> unitary operation, it works across odd number of drives - it also (I >> think) allows for more than 2 copies of a block (not completely clear >> how many copies of a block would be made if you specified a 16 drive >> array) - sort of what I'm wondering here >> > By default it'll make 2 copies, regardless how many devices are in the > array. You can specify how many copies you want though, so -n3 will give > you a near configuration with 3 copies, -n4 for four copies, etc. > > cool, so with 16 drives, and say -n6 or -n8, and a far configuration - that gives a pretty good level of resistance to multi-disk failures, as well as an entire node failure (taking out 4 drives) which then leaves the question of whether the md driver, itself, can be failed over from one node to another Thanks! Miles -- In theory, there is no difference between theory and practice. In<fnord> practice, there is. .... Yogi Berra ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-01 21:32 ` Miles Fidelman @ 2011-11-01 21:50 ` Robin Hill 2011-11-01 22:35 ` Miles Fidelman 2011-11-01 22:00 ` David Brown 1 sibling, 1 reply; 27+ messages in thread From: Robin Hill @ 2011-11-01 21:50 UTC (permalink / raw) To: Miles Fidelman; +Cc: linux-raid [-- Attachment #1: Type: text/plain, Size: 2316 bytes --] On Tue Nov 01, 2011 at 05:32:17 -0400, Miles Fidelman wrote: > Robin Hill wrote: > > On Tue Nov 01, 2011 at 04:13:26 -0400, Miles Fidelman wrote: > > > >> David Brown wrote: > >>> No, md RAID10 does /not/ offer more redundancy than RAID1. You are > >>> right that md RAID10 offers more than RAID1 (or traditional RAID0 over > >>> RAID1 sets) - but it is a convenience and performance benefit, not a > >>> redundancy benefit. In particular, it lets you build RAID10 from any > >>> number of disks, not just two. And it lets you stripe over all disks, > >>> improving performance for some loads (though not /all/ loads - if you > >>> have lots of concurrent small reads, you may be faster using plain > >>> RAID1). > >> wasn't suggesting that it does - just that it does things differently > >> than normal raid 1+0 - for example, by doing mirroring and striping as a > >> unitary operation, it works across odd number of drives - it also (I > >> think) allows for more than 2 copies of a block (not completely clear > >> how many copies of a block would be made if you specified a 16 drive > >> array) - sort of what I'm wondering here > >> > > By default it'll make 2 copies, regardless how many devices are in the > > array. You can specify how many copies you want though, so -n3 will give > > you a near configuration with 3 copies, -n4 for four copies, etc. > > > > > cool, so with 16 drives, and say -n6 or -n8, and a far configuration - > that gives a pretty good level of resistance to multi-disk failures, as > well as an entire node failure (taking out 4 drives) > Sorry, my mistake - it should be -p n3, or -p n4. You'll want -p f6/-p f8 to get a far configuration though, but yes, that should give good redundancy against a single node failure. > which then leaves the question of whether the md driver, itself, can be > failed over from one node to another > I don't see why not. You'll probably need to force assembly though, as it's likely the devices will be slightly out-of-synch after the node failure. Cheers, Robin -- ___ ( ' } | Robin Hill <robin@robinhill.me.uk> | / / ) | Little Jim says .... | // !! | "He fallen in de water !!" | [-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 27+ messages in thread
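A hedged sketch of that takeover step on a surviving node (device names hypothetical): --force re-assembles despite mismatched event counts, and --run starts the array even though the failed node's four members are missing:

  # on the node taking over, once the failed md host is confirmed down
  aoe-discover                                   # re-scan the remaining shelves
  mdadm --assemble --force --run /dev/md0 /dev/etherd/e*.*
  mdadm --detail /dev/md0                        # should show a degraded but running array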
* Re: possibly silly question (raid failover) 2011-11-01 21:50 ` Robin Hill @ 2011-11-01 22:35 ` Miles Fidelman 0 siblings, 0 replies; 27+ messages in thread From: Miles Fidelman @ 2011-11-01 22:35 UTC (permalink / raw) To: linux-raid Robin Hill wrote: > Sorry, my mistake - it should be -p n3, or -p n4. You'll want -p f6/-p > f8 to get a far configuration though, but yes, that should give good > redundancy against a single node failure. > >> which then leaves the question of whether the md driver, itself, can be >> failed over from one node to another >> > I don't see why not. You'll probably need to force assembly though, as > it's likely the devices will be slightly out-of-synch after the node > failure. > > sort of would expect to have to resynch has anybody out there actually tried this at some point? I've been trying to find OCF resource agents for handling a RAID failover, and only coming up with deprecated functions with little documentation - the only thing that even sounds remotely close is a heartbeat2 "md group take over" resource agent, but all I can find are references to it, no actual documentation Miles -- In theory, there is no difference between theory and practice. In<fnord> practice, there is. .... Yogi Berra ^ permalink raw reply [flat|nested] 27+ messages in thread
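On the resource-agent question above: the resource-agents package shipped with current Pacemaker includes an ocf:heartbeat:Raid1 agent that assembles and stops an md array from an mdadm.conf fragment, which is probably the closest starting point. Whether it copes with AoE-backed members is untested here, so treat the following crm snippet as an assumption-laden sketch (resource names and file paths are invented):

  # crm configure
  primitive p_md0 ocf:heartbeat:Raid1 \
      params raidconf="/etc/mdadm/mdadm-cluster.conf" raiddev="/dev/md0" \
      op monitor interval="30s"
  # colocate/order p_md0 with the AoE re-export and whatever consumes the array,
  # and rely on fencing so two nodes never assemble it at the same time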
* Re: possibly silly question (raid failover) 2011-11-01 21:32 ` Miles Fidelman 2011-11-01 21:50 ` Robin Hill @ 2011-11-01 22:00 ` David Brown 2011-11-01 22:58 ` Miles Fidelman 1 sibling, 1 reply; 27+ messages in thread From: David Brown @ 2011-11-01 22:00 UTC (permalink / raw) To: linux-raid On 01/11/11 22:32, Miles Fidelman wrote: > Robin Hill wrote: >> On Tue Nov 01, 2011 at 04:13:26 -0400, Miles Fidelman wrote: >> >>> David Brown wrote: >>>> No, md RAID10 does /not/ offer more redundancy than RAID1. You are >>>> right that md RAID10 offers more than RAID1 (or traditional RAID0 over >>>> RAID1 sets) - but it is a convenience and performance benefit, not a >>>> redundancy benefit. In particular, it lets you build RAID10 from any >>>> number of disks, not just two. And it lets you stripe over all disks, >>>> improving performance for some loads (though not /all/ loads - if you >>>> have lots of concurrent small reads, you may be faster using plain >>>> RAID1). >>> wasn't suggesting that it does - just that it does things differently >>> than normal raid 1+0 - for example, by doing mirroring and striping as a >>> unitary operation, it works across odd number of drives - it also (I >>> think) allows for more than 2 copies of a block (not completely clear >>> how many copies of a block would be made if you specified a 16 drive >>> array) - sort of what I'm wondering here >>> >> By default it'll make 2 copies, regardless how many devices are in the >> array. You can specify how many copies you want though, so -n3 will give >> you a near configuration with 3 copies, -n4 for four copies, etc. >> >> > cool, so with 16 drives, and say -n6 or -n8, and a far configuration - > that gives a pretty good level of resistance to multi-disk failures, as > well as an entire node failure (taking out 4 drives) You are aware, of course, that if you take your 16 drives and use "-n8", you will get a total disk space equivalent to two drives. It would be very resistant to drive failures, but /very/ poor space efficiency. It would also be very fast for reads, but very slow for writes (as everything must be written 8 times). It's your choice - md is very flexible. But I think an eight-way mirror would be considered somewhat unusual. > > which then leaves the question of whether the md driver, itself, can be > failed over from one node to another > > Thanks! > > Miles > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-01 22:00 ` David Brown @ 2011-11-01 22:58 ` Miles Fidelman 2011-11-02 10:36 ` David Brown 0 siblings, 1 reply; 27+ messages in thread From: Miles Fidelman @ 2011-11-01 22:58 UTC (permalink / raw) Cc: linux-raid David Brown wrote: > > You are aware, of course, that if you take your 16 drives and use > "-n8", you will get a total disk space equivalent to two drives. It > would be very resistant to drive failures, but /very/ poor space > efficiency. It would also be very fast for reads, but very slow for > writes (as everything must be written 8 times). > > It's your choice - md is very flexible. But I think an eight-way > mirror would be considered somewhat unusual. What would be particularly interesting is if I can do -n4 and configure things in a way that insures that each of those 4 is on a different one of my 4 boxes (4 boxes, 4 disks each). -- In theory, there is no difference between theory and practice. In<fnord> practice, there is. .... Yogi Berra ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover)
2011-11-01 22:58 ` Miles Fidelman
@ 2011-11-02 10:36 ` David Brown
0 siblings, 0 replies; 27+ messages in thread
From: David Brown @ 2011-11-02 10:36 UTC (permalink / raw)
To: linux-raid

On 01/11/2011 23:58, Miles Fidelman wrote:
> David Brown wrote:
>>
>> You are aware, of course, that if you take your 16 drives and use
>> "-n8", you will get a total disk space equivalent to two drives. It
>> would be very resistant to drive failures, but /very/ poor space
>> efficiency. It would also be very fast for reads, but very slow for
>> writes (as everything must be written 8 times).
>>
>> It's your choice - md is very flexible. But I think an eight-way
>> mirror would be considered somewhat unusual.
>
> What would be particularly interesting is if I can do -n4 and configure
> things in a way that insures that each of those 4 is on a different one
> of my 4 boxes (4 boxes, 4 disks each).
>

Theoretically, that's just a matter of getting the ordering right when
you are creating the array. However, it is a lot easier to get this
right if you separate the stages into setting up 4 raid1 sets, then
combine them.

Remember, md RAID10 is good - but it is not always the best choice. It
is particularly good for desktop use on two disks, or perhaps 3 disks,
with the "far2" layout - giving you excellent speed and safety. Its
advantages over standard RAID1+0 drop as the number of disks increases.
In particular, RAID10 in "near" format is identical to RAID1+0 if you
have a multiple of 4 disks - "far" format still has some speed
advantages since it can always read from the faster outer half of the
disk. Also note that the benefits of striping drop off for bigger disk
sets (unless you have very big files, they don't fill the stripes), and
if you have multiple concurrent accesses - typical for servers -
striping doesn't help much.

Finally, remember the main disadvantage of md RAID10 - once it is
established, you have very few re-shape possibilities. RAID0 and RAID1
sets can be easily re-shaped - you can change their size, and you can
add or remove drives. This means that if you build your system using
RAID1 mirrors and RAID0 striping on top, you can add new servers or
change the number of disks later.

You need to establish what your needs are here - what sort of files
will be accessed (big, small, etc.), what will access patterns be like
(large streamed accesses, lots of concurrent small accesses, many reads,
many writes, etc.), and what your storage needs are (what disk sizes
are you using, and what total usable disk space are you aiming for?).

One idea would be to set up 8 2-way mirrors, with the mirrors split
between different machines. These 8 pairs could then be combined with
RAID6. That gives you 6 disks worth of total space from your 16 disks,
and protects you against at least 5 concurrent disk failures, or two
complete server fails, if you arrange the pairs like this:

Server 1: 1a 3a 5a 7a
Server 2: 1b 4a 6a 7b
Server 3: 2a 3b 6b 8a
Server 4: 2b 4b 5b 8b

(Where 1a, 1b are the two halves of the same mirror.)

^ permalink raw reply [flat|nested] 27+ messages in thread
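For reference, a hedged mdadm sketch of that pairs-plus-RAID6 arrangement, using hypothetical AoE device names where the shelf number corresponds to the server and the slot order follows the table above:

  # eight two-way mirrors, each split across two different servers
  mdadm --create /dev/md1 -l1 -n2 /dev/etherd/e0.0 /dev/etherd/e1.0   # pair 1 (1a,1b)
  mdadm --create /dev/md2 -l1 -n2 /dev/etherd/e2.0 /dev/etherd/e3.0   # pair 2 (2a,2b)
  mdadm --create /dev/md3 -l1 -n2 /dev/etherd/e0.1 /dev/etherd/e2.1   # pair 3 (3a,3b)
  mdadm --create /dev/md4 -l1 -n2 /dev/etherd/e1.1 /dev/etherd/e3.1   # pair 4 (4a,4b)
  mdadm --create /dev/md5 -l1 -n2 /dev/etherd/e0.2 /dev/etherd/e3.2   # pair 5 (5a,5b)
  mdadm --create /dev/md6 -l1 -n2 /dev/etherd/e1.2 /dev/etherd/e2.2   # pair 6 (6a,6b)
  mdadm --create /dev/md7 -l1 -n2 /dev/etherd/e0.3 /dev/etherd/e1.3   # pair 7 (7a,7b)
  mdadm --create /dev/md8 -l1 -n2 /dev/etherd/e2.3 /dev/etherd/e3.3   # pair 8 (8a,8b)

  # RAID6 over the eight mirrors: 6 drives of usable space, and any two
  # whole pairs (so any two servers) can disappear without data loss
  mdadm --create /dev/md10 -l6 -n8 \
      /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5 /dev/md6 /dev/md7 /dev/md8

As the message above notes, the RAID1 legs and the striped/parity layer on top stay individually re-shapeable later, which a single big RAID10 would not.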
* Re: possibly silly question (raid failover)
2011-11-01 20:13 ` Miles Fidelman
2011-11-01 21:20 ` Robin Hill
@ 2011-11-01 22:15 ` keld
2011-11-01 22:25 ` NeilBrown
1 sibling, 1 reply; 27+ messages in thread
From: keld @ 2011-11-01 22:15 UTC (permalink / raw)
To: Miles Fidelman; +Cc: linux-raid

On Tue, Nov 01, 2011 at 04:13:26PM -0400, Miles Fidelman wrote:
> David Brown wrote:
> >
> > No, md RAID10 does /not/ offer more redundancy than RAID1. You are
> > right that md RAID10 offers more than RAID1 (or traditional RAID0 over
> > RAID1 sets) - but it is a convenience and performance benefit, not a
> > redundancy benefit. In particular, it lets you build RAID10 from any
> > number of disks, not just two. And it lets you stripe over all disks,
> > improving performance for some loads (though not /all/ loads - if you
> > have lots of concurrent small reads, you may be faster using plain
> > RAID1).

In fact raid10 has a bit less redundancy than raid1+0. It is, as far as
I know, built as raid0+1 with a disk layout where you can only lose
e.g. 1 out of 4 disks, while raid1+0 in some cases can lose 2 disks out
of 4.

Also, for lots of concurrent small reads raid10 can in some cases be
somewhat faster than raid1, and AFAIK never slower than raid1.

Best regards
keld

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-01 22:15 ` keld @ 2011-11-01 22:25 ` NeilBrown 2011-11-01 22:38 ` Miles Fidelman 2011-11-02 1:37 ` keld 0 siblings, 2 replies; 27+ messages in thread From: NeilBrown @ 2011-11-01 22:25 UTC (permalink / raw) To: keld; +Cc: Miles Fidelman, linux-raid [-- Attachment #1: Type: text/plain, Size: 1427 bytes --] On Tue, 1 Nov 2011 23:15:39 +0100 keld@keldix.com wrote: > On Tue, Nov 01, 2011 at 04:13:26PM -0400, Miles Fidelman wrote: > > David Brown wrote: > > > > > >No, md RAID10 does /not/ offer more redundancy than RAID1. You are > > >right that md RAID10 offers more than RAID1 (or traditional RAID0 over > > >RAID1 sets) - but it is a convenience and performance benefit, not a > > >redundancy benefit. In particular, it lets you build RAID10 from any > > >number of disks, not just two. And it lets you stripe over all disks, > > >improving performance for some loads (though not /all/ loads - if you > > >have lots of concurrent small reads, you may be faster using plain > > >RAID1). > > In fact raid10 mas a bit less redundancy than raid1+0. > It is as far as I know built as raid0+1 with a disk layout > where you can only loose eg 1 out of 4 disks, while raid1+0 > in some cases can lose 2 disks out of 4. With md/raid10 you can in some case lose 2 out of 4 disks and survive, just like raid1+0. NeilBrown > > Also for lots of concurrent small reads raid10 can in some cases be somewhat > faster than raid1, and AFAIK never slower than raid1. > > Best regards > keld > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-01 22:25 ` NeilBrown @ 2011-11-01 22:38 ` Miles Fidelman 2011-11-02 1:40 ` keld 2011-11-02 1:37 ` keld 1 sibling, 1 reply; 27+ messages in thread From: Miles Fidelman @ 2011-11-01 22:38 UTC (permalink / raw) Cc: linux-raid NeilBrown wrote: > On Tue, 1 Nov 2011 23:15:39 +0100 keld@keldix.com wrote: > >> On Tue, Nov 01, 2011 at 04:13:26PM -0400, Miles Fidelman wrote: >>> David Brown wrote: >>>> No, md RAID10 does /not/ offer more redundancy than RAID1. You are >>>> right that md RAID10 offers more than RAID1 (or traditional RAID0 over >>>> RAID1 sets) - but it is a convenience and performance benefit, not a >>>> redundancy benefit. In particular, it lets you build RAID10 from any >>>> number of disks, not just two. And it lets you stripe over all disks, >>>> improving performance for some loads (though not /all/ loads - if you >>>> have lots of concurrent small reads, you may be faster using plain >>>> RAID1). >> In fact raid10 mas a bit less redundancy than raid1+0. >> It is as far as I know built as raid0+1 with a disk layout >> where you can only loose eg 1 out of 4 disks, while raid1+0 >> in some cases can lose 2 disks out of 4. > With md/raid10 you can in some case lose 2 out of 4 disks and survive, just > like raid1+0. > it occurs to me that it's a real bummer that all the md documentation, that was on raid.wiki.kernel.org, has been inaccessible since the kernel.org hack a couple of months ago -- anybody know if that's going to be back soon, or if that documentation lives somewhere else as well? -- In theory, there is no difference between theory and practice. In<fnord> practice, there is. .... Yogi Berra ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-01 22:38 ` Miles Fidelman @ 2011-11-02 1:40 ` keld 0 siblings, 0 replies; 27+ messages in thread From: keld @ 2011-11-02 1:40 UTC (permalink / raw) To: Miles Fidelman; +Cc: no, To-header, on, "input <", linux-raid On Tue, Nov 01, 2011 at 06:38:38PM -0400, Miles Fidelman wrote: > it occurs to me that it's a real bummer that all the md documentation, > that was on raid.wiki.kernel.org, has been inaccessible since the > kernel.org hack a couple of months ago -- anybody know if that's going > to be back soon, or if that documentation lives somewhere else as well? What has happened there? Who is in contact with kernel.org to secure the wiki? There was only wiki text, so that info would most likely not be compromised. best regards keld ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-01 22:25 ` NeilBrown 2011-11-01 22:38 ` Miles Fidelman @ 2011-11-02 1:37 ` keld 2011-11-02 1:48 ` NeilBrown 1 sibling, 1 reply; 27+ messages in thread From: keld @ 2011-11-02 1:37 UTC (permalink / raw) To: NeilBrown; +Cc: Miles Fidelman, linux-raid On Wed, Nov 02, 2011 at 09:25:26AM +1100, NeilBrown wrote: > On Tue, 1 Nov 2011 23:15:39 +0100 keld@keldix.com wrote: > > > On Tue, Nov 01, 2011 at 04:13:26PM -0400, Miles Fidelman wrote: > > > David Brown wrote: > > > > > > > >No, md RAID10 does /not/ offer more redundancy than RAID1. You are > > > >right that md RAID10 offers more than RAID1 (or traditional RAID0 over > > > >RAID1 sets) - but it is a convenience and performance benefit, not a > > > >redundancy benefit. In particular, it lets you build RAID10 from any > > > >number of disks, not just two. And it lets you stripe over all disks, > > > >improving performance for some loads (though not /all/ loads - if you > > > >have lots of concurrent small reads, you may be faster using plain > > > >RAID1). > > > > In fact raid10 mas a bit less redundancy than raid1+0. > > It is as far as I know built as raid0+1 with a disk layout > > where you can only loose eg 1 out of 4 disks, while raid1+0 > > in some cases can loose 2 disks out of 4. > > With md/raid10 you can in some case lose 2 out of 4 disks and survive, just > like raid1+0. OK, in which cases, and when is this not the case? best regards keld ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-02 1:37 ` keld @ 2011-11-02 1:48 ` NeilBrown 2011-11-02 7:02 ` keld 0 siblings, 1 reply; 27+ messages in thread From: NeilBrown @ 2011-11-02 1:48 UTC (permalink / raw) To: keld; +Cc: Miles Fidelman, linux-raid [-- Attachment #1: Type: text/plain, Size: 1363 bytes --] On Wed, 2 Nov 2011 02:37:56 +0100 keld@keldix.com wrote: > On Wed, Nov 02, 2011 at 09:25:26AM +1100, NeilBrown wrote: > > On Tue, 1 Nov 2011 23:15:39 +0100 keld@keldix.com wrote: > > > > > On Tue, Nov 01, 2011 at 04:13:26PM -0400, Miles Fidelman wrote: > > > > David Brown wrote: > > > > > > > > > >No, md RAID10 does /not/ offer more redundancy than RAID1. You are > > > > >right that md RAID10 offers more than RAID1 (or traditional RAID0 over > > > > >RAID1 sets) - but it is a convenience and performance benefit, not a > > > > >redundancy benefit. In particular, it lets you build RAID10 from any > > > > >number of disks, not just two. And it lets you stripe over all disks, > > > > >improving performance for some loads (though not /all/ loads - if you > > > > >have lots of concurrent small reads, you may be faster using plain > > > > >RAID1). > > > > > > In fact raid10 mas a bit less redundancy than raid1+0. > > > It is as far as I know built as raid0+1 with a disk layout > > > where you can only loose eg 1 out of 4 disks, while raid1+0 > > > in some cases can loose 2 disks out of 4. > > > > With md/raid10 you can in some case lose 2 out of 4 disks and survive, just > > like raid1+0. > > OK, in which cases, and when is this not the case? > > best regards > keld "just like raid1+0" NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-02 1:48 ` NeilBrown @ 2011-11-02 7:02 ` keld 2011-11-02 9:20 ` Jonathan Tripathy 2011-11-02 11:27 ` David Brown 0 siblings, 2 replies; 27+ messages in thread From: keld @ 2011-11-02 7:02 UTC (permalink / raw) To: NeilBrown; +Cc: Miles Fidelman, linux-raid On Wed, Nov 02, 2011 at 12:48:16PM +1100, NeilBrown wrote: > On Wed, 2 Nov 2011 02:37:56 +0100 keld@keldix.com wrote: > > > On Wed, Nov 02, 2011 at 09:25:26AM +1100, NeilBrown wrote: > > > On Tue, 1 Nov 2011 23:15:39 +0100 keld@keldix.com wrote: > > > > > > > On Tue, Nov 01, 2011 at 04:13:26PM -0400, Miles Fidelman wrote: > > > > > David Brown wrote: > > > > > > > > > > > >No, md RAID10 does /not/ offer more redundancy than RAID1. You are > > > > > >right that md RAID10 offers more than RAID1 (or traditional RAID0 over > > > > > >RAID1 sets) - but it is a convenience and performance benefit, not a > > > > > >redundancy benefit. In particular, it lets you build RAID10 from any > > > > > >number of disks, not just two. And it lets you stripe over all disks, > > > > > >improving performance for some loads (though not /all/ loads - if you > > > > > >have lots of concurrent small reads, you may be faster using plain > > > > > >RAID1). > > > > > > > > In fact raid10 mas a bit less redundancy than raid1+0. > > > > It is as far as I know built as raid0+1 with a disk layout > > > > where you can only loose eg 1 out of 4 disks, while raid1+0 > > > > in some cases can loose 2 disks out of 4. > > > > > > With md/raid10 you can in some case lose 2 out of 4 disks and survive, just > > > like raid1+0. > > > > OK, in which cases, and when is this not the case? > > > > best regards > > keld > > "just like raid1+0" No, that is not the case AFAIK. Eg the layout of raid10,f2 with 4 disks is "just like raid0+1". best regards keld ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-02 7:02 ` keld @ 2011-11-02 9:20 ` Jonathan Tripathy 2011-11-02 11:27 ` David Brown 1 sibling, 0 replies; 27+ messages in thread From: Jonathan Tripathy @ 2011-11-02 9:20 UTC (permalink / raw) To: keld; +Cc: NeilBrown, Miles Fidelman, linux-raid On 02/11/2011 07:02, keld@keldix.com wrote: > On Wed, Nov 02, 2011 at 12:48:16PM +1100, NeilBrown wrote: >> On Wed, 2 Nov 2011 02:37:56 +0100 keld@keldix.com wrote: >> >>> On Wed, Nov 02, 2011 at 09:25:26AM +1100, NeilBrown wrote: >>>> On Tue, 1 Nov 2011 23:15:39 +0100 keld@keldix.com wrote: >>>> >>>>> On Tue, Nov 01, 2011 at 04:13:26PM -0400, Miles Fidelman wrote: >>>>>> David Brown wrote: >>>>>>> No, md RAID10 does /not/ offer more redundancy than RAID1. You are >>>>>>> right that md RAID10 offers more than RAID1 (or traditional RAID0 over >>>>>>> RAID1 sets) - but it is a convenience and performance benefit, not a >>>>>>> redundancy benefit. In particular, it lets you build RAID10 from any >>>>>>> number of disks, not just two. And it lets you stripe over all disks, >>>>>>> improving performance for some loads (though not /all/ loads - if you >>>>>>> have lots of concurrent small reads, you may be faster using plain >>>>>>> RAID1). >>>>> In fact raid10 mas a bit less redundancy than raid1+0. >>>>> It is as far as I know built as raid0+1 with a disk layout >>>>> where you can only loose eg 1 out of 4 disks, while raid1+0 >>>>> in some cases can loose 2 disks out of 4. >>>> With md/raid10 you can in some case lose 2 out of 4 disks and survive, just >>>> like raid1+0. >>> OK, in which cases, and when is this not the case? >>> >>> best regards >>> keld >> "just like raid1+0" > No, that is not the case AFAIK. Eg the layout of raid10,f2 with 4 disks is > "just like raid0+1". > Isn't md raid10 n2 exactly the same as RAID1+0?? ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover)
2011-11-02 7:02 ` keld
2011-11-02 9:20 ` Jonathan Tripathy
@ 2011-11-02 11:27 ` David Brown
1 sibling, 0 replies; 27+ messages in thread
From: David Brown @ 2011-11-02 11:27 UTC (permalink / raw)
To: linux-raid

On 02/11/2011 08:02, keld@keldix.com wrote:
> On Wed, Nov 02, 2011 at 12:48:16PM +1100, NeilBrown wrote:
>> On Wed, 2 Nov 2011 02:37:56 +0100 keld@keldix.com wrote:
>>
>>> On Wed, Nov 02, 2011 at 09:25:26AM +1100, NeilBrown wrote:
>>>> On Tue, 1 Nov 2011 23:15:39 +0100 keld@keldix.com wrote:
>>>>
>>>>> On Tue, Nov 01, 2011 at 04:13:26PM -0400, Miles Fidelman wrote:
>>>>>> David Brown wrote:
>>>>>>>
>>>>>>> No, md RAID10 does /not/ offer more redundancy than RAID1. You are
>>>>>>> right that md RAID10 offers more than RAID1 (or traditional RAID0 over
>>>>>>> RAID1 sets) - but it is a convenience and performance benefit, not a
>>>>>>> redundancy benefit. In particular, it lets you build RAID10 from any
>>>>>>> number of disks, not just two. And it lets you stripe over all disks,
>>>>>>> improving performance for some loads (though not /all/ loads - if you
>>>>>>> have lots of concurrent small reads, you may be faster using plain
>>>>>>> RAID1).
>>>>>
>>>>> In fact raid10 has a bit less redundancy than raid1+0.
>>>>> It is as far as I know built as raid0+1 with a disk layout
>>>>> where you can only lose eg 1 out of 4 disks, while raid1+0
>>>>> in some cases can lose 2 disks out of 4.
>>>>
>>>> With md/raid10 you can in some case lose 2 out of 4 disks and survive, just
>>>> like raid1+0.
>>>
>>> OK, in which cases, and when is this not the case?
>>>
>>> best regards
>>> keld
>>
>> "just like raid1+0"
>
> No, that is not the case AFAIK. Eg the layout of raid10,f2 with 4 disks is
> "just like raid0+1".
>

And raid0+1 can also survive two disk failures in some cases.

It boils down to this - if you have a two-way mirror (RAID1, RAID10,
RAID1+0, RAID0+1), then you can keep losing disks unless you lose both
copies of part of your data.

Look at the layout diagrams on
<http://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10>
for the four drive cases. You can lose disk 1 and 3, or disk 2 and 4, in
either the "near 2" or the "far 2" cases. But if you lose disks 1 and 2,
or disks 3 and 4, your data is gone:

RAID10,n2, or RAID1+0 (stripe of mirrors):

Good array     Lost 1+3 (OK)   Lost 1+2 (Dead)
A1 A1 A2 A2    x  A1 x  A2     x  x  A2 A2
A3 A3 A4 A4    x  A3 x  A4     x  x  A4 A4
A5 A5 A6 A6    x  A5 x  A6     x  x  A6 A6
A7 A7 A8 A8    x  A7 x  A8     x  x  A8 A8

RAID10,f2:

Good array     Lost 1+3 (OK)   Lost 1+2 (Dead)
A1 A2 A3 A4    x  A2 x  A4     x  x  A3 A4
A5 A6 A7 A8    x  A6 x  A8     x  x  A7 A8
....
A4 A1 A2 A3    x  A1 x  A3     x  x  A2 A3
A8 A5 A6 A7    x  A5 x  A7     x  x  A6 A7

RAID10,o2:

Good array     Lost 1+3 (OK)   Lost 1+2 (Dead)
A1 A2 A3 A4    x  A2 x  A4     x  x  A3 A4
A4 A1 A2 A3    x  A1 x  A3     x  x  A2 A3
A5 A6 A7 A8    x  A6 x  A8     x  x  A7 A8
A8 A5 A6 A7    x  A5 x  A7     x  x  A6 A7

RAID0+1 (mirror of stripes):

Good array     Lost 1+3 (Dead) Lost 1+2 (OK)
A1 A2 A1 A2    x  A2 x  A2     x  x  A1 A2
A3 A4 A3 A4    x  A4 x  A4     x  x  A3 A4
A5 A6 A5 A6    x  A6 x  A6     x  x  A5 A6
A7 A8 A7 A8    x  A8 x  A8     x  x  A7 A8

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-01 0:38 possibly silly question (raid failover) Miles Fidelman 2011-11-01 9:14 ` David Brown @ 2011-11-01 9:26 ` Johannes Truschnigg 2011-11-01 13:02 ` Miles Fidelman 2011-11-02 6:41 ` Stan Hoeppner 2 siblings, 1 reply; 27+ messages in thread From: Johannes Truschnigg @ 2011-11-01 9:26 UTC (permalink / raw) To: Miles Fidelman; +Cc: linux-raid@vger.kernel.org [-- Attachment #1: Type: text/plain, Size: 799 bytes --] Hi Miles, On Mon, Oct 31, 2011 at 08:38:16PM -0400, Miles Fidelman wrote: > Hi Folks, > > I've been exploring various ways to build a "poor man's high > availability cluster." Currently I'm running two nodes, using raid > on each box, running DRBD across the boxes, and running Xen virtual > machines on top of that. > [...] while I do note that I don't answer your question at hand, I'm still inclined to ask if you do know Ganeti (http://code.google.com/p/ganeti/) yet? It offers pretty much everything you seem to want to have. -- with best regards: - Johannes Truschnigg ( johannes@truschnigg.info ) www: http://johannes.truschnigg.info/ phone: +43 650 2 133337 xmpp: johannes@truschnigg.info Please do not bother me with HTML-eMail or attachments. Thank you. [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-01 9:26 ` Johannes Truschnigg @ 2011-11-01 13:02 ` Miles Fidelman 2011-11-01 13:33 ` John Robinson 0 siblings, 1 reply; 27+ messages in thread From: Miles Fidelman @ 2011-11-01 13:02 UTC (permalink / raw) Cc: linux-raid@vger.kernel.org Johannes Truschnigg wrote: > Hi Miles, > > On Mon, Oct 31, 2011 at 08:38:16PM -0400, Miles Fidelman wrote: >> Hi Folks, >> >> I've been exploring various ways to build a "poor man's high >> availability cluster." Currently I'm running two nodes, using raid >> on each box, running DRBD across the boxes, and running Xen virtual >> machines on top of that. >> [...] > while I do note that I don't answer your question at hand, I'm still inclined > to ask if you do know Ganeti (http://code.google.com/p/ganeti/) yet? It offers > pretty much everything you seem to want to have. Actually I do know Ganeti, and it does NOT come close to what I'm suggesting: - it supports migration but not auto-failover - DRBD is the only mechanism it provides for replicating data across nodes - which limits migration to a 2-node pair -- In theory, there is no difference between theory and practice. In<fnord> practice, there is. .... Yogi Berra ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-01 13:02 ` Miles Fidelman @ 2011-11-01 13:33 ` John Robinson 0 siblings, 0 replies; 27+ messages in thread From: John Robinson @ 2011-11-01 13:33 UTC (permalink / raw) To: Miles Fidelman; +Cc: linux-raid@vger.kernel.org On 01/11/2011 13:02, Miles Fidelman wrote: > Johannes Truschnigg wrote: >> Hi Miles, >> >> On Mon, Oct 31, 2011 at 08:38:16PM -0400, Miles Fidelman wrote: >>> Hi Folks, >>> >>> I've been exploring various ways to build a "poor man's high >>> availability cluster." Currently I'm running two nodes, using raid >>> on each box, running DRBD across the boxes, and running Xen virtual >>> machines on top of that. >>> [...] >> while I do note that I don't answer your question at hand, I'm still >> inclined >> to ask if you do know Ganeti (http://code.google.com/p/ganeti/) yet? >> It offers >> pretty much everything you seem to want to have. > > Actually I do know Ganeti, and it does NOT come close to what I'm > suggesting: > - it supports migration but not auto-failover > - DRBD is the only mechanism it provides for replicating data across > nodes - which limits migration to a 2-node pair It might still do what I think you want: think of each of the four servers running 3 VMs (or groups of VMs) normally, and three servers running 4 VMs when one of the servers fails. Then for each VM you replicate its storage to another server, as follows: Node A: VM A1->Node B; VM A2->Node C; VM A3->Node D Node B: VM B1->Node C; VM B2->Node D; VM B3->Node A Node C: VM C1->Node D; VM C2->Node A; VM C3->Node B Node D: VM D1->Node A; VM D2->Node B; VM D3->Node C So each node needs double the storage, because as well as its own VMs is has copies of one from each of the other nodes. When any node goes down, your cluster management makes all three of the others start up one more VM - isn't that what Ganeti means by "quick recovery in case of physical system failure" and "automated instance migration across clusters"? I'd probably do some kind of RAID over the 4 disks on each server as well, and do live migrations when a drive fails in any one machine, so that the VMs don't suffer from the degraded RAID and the machine's relatively quiet while you're replacing the failed drive, but now we're getting into having to have perhaps double the storage again, and it's not looking like it's a poor man's solution after all - can you buy 4 cheap commodity servers with double the storage and enough spare RAM for less than you could have bought 3 classy bulletproof ones? Cheers, John. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: possibly silly question (raid failover) 2011-11-01 0:38 possibly silly question (raid failover) Miles Fidelman 2011-11-01 9:14 ` David Brown 2011-11-01 9:26 ` Johannes Truschnigg @ 2011-11-02 6:41 ` Stan Hoeppner 2011-11-02 13:17 ` Miles Fidelman 2 siblings, 1 reply; 27+ messages in thread From: Stan Hoeppner @ 2011-11-02 6:41 UTC (permalink / raw) To: Miles Fidelman; +Cc: linux-raid@vger.kernel.org On 10/31/2011 7:38 PM, Miles Fidelman wrote: > Hi Folks, > > I've been exploring various ways to build a "poor man's high > availability cluster." Overall advice: Don't attempt to reinvent the wheel. Building such a thing is normally a means to end, not an end itself. If your goal is supporting an actual workload and not simply the above, there are a number of good options readily available. > Currently I'm running two nodes, using raid on > each box, running DRBD across the boxes, and running Xen virtual > machines on top of that. > > I now have two brand new servers - for a total of four nodes - each with > four large drives, and four gigE ports. A good option in this case would be to simply take the 8 new drives and add 4 each to the existing servers, expanding existing md RAID devices and filesystems where appropriate. Then setup NFS cluster services and export the appropriate filesystems to the two new servers. This keeps your overall complexity low, reliability and performance high, and yields a setup many are familiar with if you need troubleshooting assistance in the future. This is a widely used architecture and has been for many years. > Between the configuration of the systems, and rack space limitations, > I'm trying to use each server for both storage and processing - and been > looking at various options for building a cluster file system across all > 16 drives, that supports VM migration/failover across all for nodes, and > that's resistant to both single-drive failures, and to losing an entire > server (and it's 4 drives), and maybe even losing two servers (8 drives). The solution above gives you all of this, except the unlikely scenario of losing both storage servers simultaneously. If that is truly something you're willing to spend money to mitigate then slap a 3rd storage server in an off site location and use the DRBD option for such. > The approach that looks most interesting is Sheepdog - but it's both > tied to KVM rather than Xen, and a bit immature. Interesting disclaimer for an open source project, specifically the 2nd half of the statement: "There is no guarantee that this software will be included in future software releases, and it probably will not be included." > But it lead me to wonder if something like this might make sense: > - mount each drive using AoE > - run md RAID 10 across all 16 drives one one node > - mount the resulting md device using AoE > - if the node running the md device fails, use pacemaker/crm to > auto-start an md device on another node, re-assemble and republish the > array > - resulting in a 16-drive raid10 array that's accessible from all nodes The level of complexity here is too high for a production architecture. In addition, doing something like this puts you way out in uncharted waters, where you will have few, if any, peers to assist in time of need. When (not if) something breaks in an unexpected way, how quickly will you be able to troubleshoot and resolve a problem in such a complex architecture? > Or is this just silly and/or wrongheaded? I don't think it's silly. Maybe a little wrongheaded, to use your term. 
IBM has had GPFS on the market for a decade plus. It will do exactly what you want, but the price is likely well beyond your budget, assuming they'd even return your call WRT a 4 node cluster. (IBM GPFS customers are mostly government labs, aerospace giants, and pharma companies, with very large node count clusters, hundreds to thousands). If I were doing such a setup to fit your stated needs, I'd spend ~$10-15K USD on a low/midrange iSCSI SAN box with 2GB cache dual controllers/PSUs and 16 x 500GB SATA drives. I'd create a single RAID6 array of 14 drives with two standby spares, yielding 7TB of space for carving up LUNS. Carve and export the LUNS you need to each node's dual/quad NIC MACs with multipathing setup on each node, and format the LUNs with GFS2. All nodes now have access to all storage you assign. With such a setup you can easily add future nodes. It's not complex, it is a well understood architecture, and relatively straightforward to troubleshoot. Now, if that solution is out of your price range, I think the redundant cluster NFS server architecture is in your immediate future. It's in essence free, and it will give you everything you need, in spite of the fact that the "node symmetry" isn't what you apparently envision as "optimal" for a cluster. -- Stan ^ permalink raw reply [flat|nested] 27+ messages in thread
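For completeness, a hedged sketch of what the per-node side of that iSCSI + GFS2 suggestion typically looks like with open-iscsi and the stock cluster tools; the target address, cluster name, filesystem name and mount point below are invented for illustration:

  # discover and log in to the SAN's exported LUNs
  iscsiadm -m discovery -t sendtargets -p 192.168.10.50
  iscsiadm -m node --login
  multipath -ll                      # confirm the multipathed block devices

  # one-time: format a LUN with GFS2 for a 4-node cluster
  mkfs.gfs2 -p lock_dlm -t vmcluster:vmstore -j 4 /dev/mapper/mpatha

  # on each node (with the cluster stack and DLM already running)
  mount -t gfs2 /dev/mapper/mpatha /srv/vmstore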
* Re: possibly silly question (raid failover) 2011-11-02 6:41 ` Stan Hoeppner @ 2011-11-02 13:17 ` Miles Fidelman 0 siblings, 0 replies; 27+ messages in thread From: Miles Fidelman @ 2011-11-02 13:17 UTC (permalink / raw) Cc: linux-raid@vger.kernel.org Stan, Stan Hoeppner wrote: > On 10/31/2011 7:38 PM, Miles Fidelman wrote: >> Hi Folks, >> >> I've been exploring various ways to build a "poor man's high >> availability cluster." > Overall advice: Don't attempt to reinvent the wheel. > > Building such a thing is normally a means to end, not an end itself. If > your goal is supporting an actual workload and not simply the above, > there are a number of good options readily available. well, normally I'd agree with you, but... - we're both an R&D organization and a (small, but aspiring) provider of hosted services - so experimenting with infrastructure is part of the actual work -- and part of where I'd like to head is an environment that's built out of commodity boxes configured in a way that scales out (Sheepdog is really the model I have in mind) - I'd sure like to find something that does what we need: -- we're using DRBD/Pacemaker/etc. - but that's sort of brittle and only supports pair-wise migration/failover -- if Sheepdog was a little more mature, and supported Xen, it would be exactly what I'm looking for -- Xen over the newest release of GlustFS is starting to look attractive -- some of the single system image projects (OpenMosix, Kerrighed) would be attractive if the projects were alive >> Currently I'm running two nodes, using raid on >> each box, running DRBD across the boxes, and running Xen virtual >> machines on top of that. >> >> I now have two brand new servers - for a total of four nodes - each with >> four large drives, and four gigE ports. > A good option in this case would be to simply take the 8 new drives and > add 4 each to the existing servers, expanding existing md RAID devices > and filesystems where appropriate. Then setup NFS cluster services and > export the appropriate filesystems to the two new servers. This keeps > your overall complexity low, reliability and performance high, and > yields a setup many are familiar with if you need troubleshooting > assistance in the future. This is a widely used architecture and has > been for many years. unfortunately, we're currently trying to make do with 4U of rackspace, and 4 1U servers, each of which holds 4 drives, can't quite move the disks around the way you're talking about -- unfortunately, the older boxes don't have hardware virtualization support or I'd seriously consider migrating to KVM and Sheepdog -- if Sheepdog were just a bit more mature, I'd seriously consider simply replacing the older boxes >> The approach that looks most interesting is Sheepdog - but it's both >> tied to KVM rather than Xen, and a bit immature. > Interesting disclaimer for an open source project, specifically the 2nd > half of the statement: > > "There is no guarantee that this software will be included in future > software releases, and it probably will not be included." Yeah, but it seems to have some traction and support, and the OpenStack community seems to be looking at it seriously. Having said that, it's things like that that are pushing me toward GlusterFS (doesn't hurt that Red Hat just purchased Gluster and seems to be putting some serious resources into it). 
> >> But it lead me to wonder if something like this might make sense: >> - mount each drive using AoE >> - run md RAID 10 across all 16 drives one one node >> - mount the resulting md device using AoE >> - if the node running the md device fails, use pacemaker/crm to >> auto-start an md device on another node, re-assemble and republish the >> array >> - resulting in a 16-drive raid10 array that's accessible from all nodes > The level of complexity here is too high for a production architecture. > In addition, doing something like this puts you way out in uncharted > waters, where you will have few, if any, peers to assist in time of > need. When (not if) something breaks in an unexpected way, how quickly > will you be able to troubleshoot and resolve a problem in such a complex > architecture? Understood. This path is somewhat more of a matter of curiosity. AoE is pretty mature, and there does seem to be a RAID resource agent for CRM - so some of the pieces exist. Seems like the pieces would fit together - so I was wondering if anybody had actually tried it. > If I were doing such a setup to fit your stated needs, I'd spend > ~$10-15K USD on a low/midrange iSCSI SAN box with 2GB cache dual > controllers/PSUs and 16 x 500GB SATA drives. I'd create a single RAID6 > array of 14 drives with two standby spares, yielding 7TB of space for > carving up LUNS. Carve and export the LUNS you need to each node's > dual/quad NIC MACs with multipathing setup on each node, and format > the LUNs with GFS2. All nodes now have access to all storage you > assign. With such a setup you can easily add future nodes. It's not > complex, it is a well understood architecture, and relatively > straightforward to troubleshoot. Now, if that solution is out of your > price range, I think the redundant cluster NFS server architecture is > in your immediate future. It's in essence free, and it will give you > everything you need, in spite of the fact that the "node symmetry" > isn't what you apparently envision as "optimal" for a cluster. Hmm... if I were spending real money, and had more rack space to put things in, I'd probably do something more like a small OpenStack configuration, but that's me. Thanks for your comments. Lots of food for thought! Miles -- In theory, there is no difference between theory and practice. In<fnord> practice, there is. .... Yogi Berra ^ permalink raw reply [flat|nested] 27+ messages in thread