linux-raid.vger.kernel.org archive mirror
* possibly silly configuration question
@ 2012-12-27  4:16 Miles Fidelman
  2012-12-27  4:43 ` Adam Goryachev
  2012-12-27 21:11 ` Roy Sigurd Karlsbakk
  0 siblings, 2 replies; 6+ messages in thread
From: Miles Fidelman @ 2012-12-27  4:16 UTC (permalink / raw)
  To: linux-raid@vger.kernel.org

Hi Folks,

I find myself having four servers, each with 4 large disks, that I'm 
trying to assemble into a high-availability cluster.  (Note: I've got 4 
gigE ports on each box, 2 set aside for outside access, 2 for inter-node 
clustering)

Now it's easy enough to RAID disks on each server, and/or mirror disks 
pair-wise with DRBD, but DRBD doesn't work as well with >2 servers.

Now, what I really should do is separate storage nodes from compute nodes 
- but I'm limited by rack space and chassis configuration of the 
hardware I've got, and I've been thinking through various configurations 
to make use of the resources at hand.

One option is to put all the drives into one large pool managed by 
gluster - but I expect that would result in some serious performance 
hits (and gluster's replicated/distributed mode is fairly new).

It's late at night and a thought occurred to me that is probably 
wrongheaded (or at least silly) - but maybe I'm too tired to see any 
obvious problems.  So I'd welcome 2nd (and 3rd) opinions.

The basic notion:
- mount all 16 drives as network block devices via iSCSI or AoE
- build 4 RAID10 volumes - each volume consisting of one drive from each 
server
- run LVM on top of the RAID volumes
- then use NFS or maybe OCFS2 to make volumes available across nodes
- of course md would be running on only one node (for each array), so if 
a node goes down, use pacemaker to start up md on another node, 
reassemble the array, and remount everything
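
Just to make the idea concrete, one of the four arrays might be built
roughly like this (device names are purely illustrative, and assume the
three remote disks have already been imported over iSCSI/AoE and show up
as plain block devices on the assembling node):

  # one local disk plus three imported ones, mirrored/striped across servers
  mdadm --create /dev/md0 --level=10 --raid-devices=4 \
      /dev/sdb /dev/sdc /dev/sdd /dev/sde

  # LVM on top of the md array
  pvcreate /dev/md0
  vgcreate vg_cluster0 /dev/md0
  lvcreate -L 100G -n lv_shared vg_cluster0

  # on failover (this is what pacemaker would script):
  # re-assemble the same members on a surviving node
  mdadm --assemble /dev/md0 /dev/sdb /dev/sdc /dev/sdd /dev/sde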

Does this make sense, or is it totally crazy?

Thanks much,

Miles Fidelman

-- 
In theory, there is no difference between theory and practice.
In practice, there is.   .... Yogi Berra



* Re: possibly silly configuration question
  2012-12-27  4:16 possibly silly configuration question Miles Fidelman
@ 2012-12-27  4:43 ` Adam Goryachev
  2012-12-27 16:02   ` Miles Fidelman
  2012-12-27 21:11 ` Roy Sigurd Karlsbakk
  1 sibling, 1 reply; 6+ messages in thread
From: Adam Goryachev @ 2012-12-27  4:43 UTC (permalink / raw)
  To: Miles Fidelman; +Cc: linux-raid@vger.kernel.org

On 27/12/12 15:16, Miles Fidelman wrote:
> Hi Folks,
>
> I find myself having four servers, each with 4 large disks, that I'm
> trying to assemble into a high-availability cluster.  (Note: I've got
> 4 gigE ports on each box, 2 set aside for outside access, 2 for
> inter-node clustering)
>
> Now it's easy enough to RAID disks on each server, and/or mirror disks
> pair-wise with DRBD, but DRBD doesn't work as well with >2 servers.
>
> Now, what I really should do is separate storage nodes from compute
> nodes - but I'm limited by rack space and chassis configuration of the
> hardware I've got, and I've been thinking through various
> configurations to make use of the resources at hand.
>
> One option is to put all the drives into one large pool managed by
> gluster - but I expect that would result in some serious performance
> hits (and gluster's replicated/distributed mode is fairly new).
>
> It's late at night and a thought occurred to me that is probably
> wrongheaded (or at least silly) - but maybe I'm too tired to see any
> obvious problems.  So I'd welcome 2nd (and 3rd) opinions.
>
> The basic notion:
> - mount all 16 drives as network block devices via iSCSI or AoE
> - build 4 RAID10 volumes - each volume consisting of one drive from
> each server
> - run LVM on top of the RAID volumes
> - then use NFS or maybe OCFS2 to make volumes available across nodes
> - of course md would be running on only one node (for each array), so
> if a node goes down, use pacemaker to start up md on another node,
> reassemble the array, and remount everything
>
> Does this make sense, or is it totally crazy?
>
Not entirely crazy... but, how about another option:
On each node:
1) Partition each drive into two halves
2) Create two RAID arrays using each half of the 4 drives (ie, sd[abcd]1
in one RAID and sd[abcd]2 in the second RAID)
3) Create 4 x DRBD volumes where
drbd0 uses server1_raid1 and server2_raid1
drbd1 uses server2_raid2 and server3_raid2
drbd2 uses server3_raid1 and server4_raid1
drbd3 uses server4_raid2 and server1_raid2
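
Roughly, I mean something like this for steps 2 and 3 (hostnames, IPs
and the RAID level are invented for illustration; DRBD 8.3-style config
syntax assumed):

  # on each server: one array per half of the four drives
  mdadm --create /dev/md1 --level=10 --raid-devices=4 /dev/sd[abcd]1
  mdadm --create /dev/md2 --level=10 --raid-devices=4 /dev/sd[abcd]2

  # /etc/drbd.d/drbd0.res, shared by server1 and server2
  resource drbd0 {
      on server1 {
          device    /dev/drbd0;
          disk      /dev/md1;
          address   10.0.0.1:7789;
          meta-disk internal;
      }
      on server2 {
          device    /dev/drbd0;
          disk      /dev/md1;
          address   10.0.0.2:7789;
          meta-disk internal;
      }
  }

drbd1, drbd2 and drbd3 follow the same pattern, using the pairings above.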

Now you can run iscsi on all servers, where each server will export one
DRBD device:
iscsi server1 drbd0
iscsi server2 drbd1
iscsi server3 drbd2
iscsi server4 drbd3
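
With tgt, for example, the export on server1 might look something like
this (the IQN is made up; iet or LIO would be configured differently):

  # /etc/tgt/targets.conf on server1
  <target iqn.2012-12.au.com.example:server1.drbd0>
      backing-store /dev/drbd0
  </target>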

If a server goes down, you need to use pacemaker to start iscsi (and
steal the virtual IP) on the "partner" server.
In this way, you can lose any one server, or you can lose two servers
(if they are the right two).
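
A very rough crm shell sketch of that failover for drbd0 (IPs, IQNs and
resource names are invented, and the exact resource-agent parameters
depend on your resource-agents version, so treat it as an outline rather
than a working config):

  primitive p_drbd0 ocf:linbit:drbd \
      params drbd_resource="drbd0" \
      op monitor interval="15s" role="Master" \
      op monitor interval="30s" role="Slave"
  ms ms_drbd0 p_drbd0 \
      meta master-max="1" clone-max="2" notify="true"
  primitive p_ip_drbd0 ocf:heartbeat:IPaddr2 \
      params ip="10.0.0.100" cidr_netmask="24" op monitor interval="10s"
  primitive p_tgt_drbd0 ocf:heartbeat:iSCSITarget \
      params iqn="iqn.2012-12.au.com.example:server1.drbd0" \
      op monitor interval="30s"
  group g_san_drbd0 p_ip_drbd0 p_tgt_drbd0
  colocation c_san0_on_drbd0 inf: g_san_drbd0 ms_drbd0:Master
  order o_drbd0_before_san0 inf: ms_drbd0:promote g_san_drbd0:start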

You could adjust this further to have a third drbd host, and reduce the
total number of iscsi exported devices to 3.

Each VM config would use the specific virtual IP/iSCSI exported location.

Maybe that will provide some ideas.... It is slightly better than two
storage + two working nodes, and gives the added resilience of being
able to lose two servers without losing any services....

PS, I'd probably put LVM2 on top of each drbd device, to divide the
storage for each VM, and export each VM over iscsi individually.
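
That is, something along these lines on whichever node is currently
primary for drbd0 (VG and LV names invented):

  pvcreate /dev/drbd0
  vgcreate vg_drbd0 /dev/drbd0
  # one LV per VM, each exported as its own iSCSI LUN
  lvcreate -L 20G -n vm_mail vg_drbd0
  lvcreate -L 20G -n vm_web  vg_drbd0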

Regards,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au



* Re: possibly silly configuration question
  2012-12-27  4:43 ` Adam Goryachev
@ 2012-12-27 16:02   ` Miles Fidelman
  2012-12-27 16:21     ` Adam Goryachev
  0 siblings, 1 reply; 6+ messages in thread
From: Miles Fidelman @ 2012-12-27 16:02 UTC (permalink / raw)
  Cc: linux-raid@vger.kernel.org

Adam,

Thanks for the suggestions.  The thing I'm worried about is how much 
traffic gets generated as I start wiring together more complex 
configurations, and the kind of performance hits involved (particularly 
if a node goes down and things start getting re-synced).

Miles

Adam Goryachev wrote:
> On 27/12/12 15:16, Miles Fidelman wrote:
>> Hi Folks,
>>
>> I find myself having four servers, each with 4 large disks, that I'm
>> trying to assemble into a high-availability cluster.  (Note: I've got
>> 4 gigE ports on each box, 2 set aside for outside access, 2 for
>> inter-node clustering)
>>
>> Now it's easy enough to RAID disks on each server, and/or mirror disks
>> pair-wise with DRBD, but DRBD doesn't work as well with >2 servers.
>>
>> Now, what I really should do is separate storage nodes from compute
>> nodes - but I'm limited by rack space and chassis configuration of the
>> hardware I've got, and I've been thinking through various
>> configurations to make use of the resources at hand.
>>
>> One option is to put all the drives into one large pool managed by
>> gluster - but I expect that would result in some serious performance
>> hits (and gluster's replicated/distributed mode is fairly new).
>>
>> It's late at night and a thought occurred to me that is probably
>> wrongheaded (or at least silly) - but maybe I'm too tired to see any
>> obvious problems.  So I'd welcome 2nd (and 3rd) opinions.
>>
>> The basic notion:
>> - mount all 16 drives as network block devices via iSCSI or AoE
>> - build 4 RAID10 volumes - each volume consisting of one drive from
>> each server
>> - run LVM on top of the RAID volumes
>> - then use NFS or maybe OCFS2 to make volumes available across nodes
>> - of course md would be running on only one node (for each array), so
>> if a node goes down, use pacemaker to start up md on another node,
>> reassemble the array, and remount everything
>>
>> Does this make sense, or is it totally crazy?
>>
> Not entirely crazy... but, how about another option:
> On each node:
> 1) Partition each drive into two halves
> 2) Create two RAID arrays using each half of the 4 drives (ie, sd[abcd]1
> in one RAID and sd[abcd]2 in the second RAID)
> 3) Create 4 x DRBD volumes where
> drbd0 uses server1_raid1 and server2_raid1
> drbd1 uses server2_raid2 and server3_raid2
> drbd2 uses server3_raid1 and server4_raid1
> drbd3 uses server4_raid2 and server1_raid2
>
> Now you can run iscsi on all servers, where each server will export one
> DRBD device:
> iscsi server1 drbd0
> iscsi server2 drbd1
> iscsi server3 drbd2
> iscsi server4 drbd3
>
> If a server goes down, you need to use pacemaker to start iscsi (and
> steal the virtual IP) on the "partner" server.
> In this way, you can lose any one server, or you can lose two servers
> (if they are the right two).
>
> You could adjust this further to have a third drbd host, and reduce the
> total number of iscsi exported devices to 3.
>
> Each VM config would use the specific virtual IP/iSCSI exported location.
>
> Maybe that will provide some ideas.... It is slightly better than two
> storage + two working nodes, and gives the added resilience of being
> able to lose two servers without losing any services....
>
> PS, I'd probably put LVM2 on top of each drbd device, to divide the
> storage for each VM, and export each VM over iscsi individually.
>
> Regards,
> Adam
>


-- 
In theory, there is no difference between theory and practice.
In practice, there is.   .... Yogi Berra



* Re: possibly silly configuration question
  2012-12-27 16:02   ` Miles Fidelman
@ 2012-12-27 16:21     ` Adam Goryachev
  2012-12-27 16:44       ` Miles Fidelman
  0 siblings, 1 reply; 6+ messages in thread
From: Adam Goryachev @ 2012-12-27 16:21 UTC (permalink / raw)
  To: Miles Fidelman; +Cc: linux-raid@vger.kernel.org

On 28/12/12 03:02, Miles Fidelman wrote:
> Adam,
>
> Thanks for the suggestions.  The thing I'm worried about is how much
> traffic gets generated as I start wiring together more complex
> configurations, and the kind of performance hits involved
> (particularly if a node goes down and things start getting re-synced).
>
With my suggested config, I'd put 2 x Gb ethernet from each machine on
one VLAN for storage, and the other two from each onto the network.
(Actually, what is your bandwidth to the end user? If these are Internet
services, you probably don't need 2 x Gb connections, so use 3 x Gb for
the storage and 1 x Gb for the end-user-facing network.)

Anyway, under normal load you should get reasonable performance. You
are, I assume, using spinning disks for a small random read/write load,
so you won't get fantastic IO performance anyway (hopefully they are 15k
rpm enterprise disks). Ideally, as you said, putting two storage servers
with 8 disks each in RAID10 would provide much better performance, but
you don't always get what you want....

When a system fails, you will get degraded I/O performance, since you
now have additional load on the remaining machines, but I suppose
degraded performance is better than a total outage. When the machine
comes back into service, just ensure the resync speed is low enough to
not cause additional performance degradation. You could even delay the
resync until "off peak" times.
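
For what it's worth, both layers can be throttled; the numbers here are
arbitrary, and the drbdsetup syntax is the DRBD 8.3 form, so check the
man pages for your version:

  # cap md resync/rebuild bandwidth (value in KB/s)
  sysctl -w dev.raid.speed_limit_max=10000

  # DRBD 8.3: either put "syncer { rate 10M; }" in the resource
  # definition, or change it temporarily at runtime:
  drbdsetup /dev/drbd0 syncer -r 10M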

At the end of the day, you will need to examine your workload and
expected results. If they are not achievable with the existing hardware,
you either need to change the hardware, change the workload, or change
the expected results.

The biggest issue I've found with this type of setup is the lack of I/O
performance, which simply comes down to the fact that you have a small
number of disks trying to seek all over the place to satisfy all the
different VMs. Seeks really kill performance. The only solutions are:
1) Get lots of (8 or more) fast disks (15k rpm) and put them in RAID10,
then proceed from there...
2) Get enterprise-grade SSDs and use some RAID for data protection (no
need for RAID10; use RAID5 or RAID6).
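
For reference only (device names and counts invented), the two options
boil down to different mdadm creates:

  # option 1: 8 fast spindles in RAID10
  mdadm --create /dev/md0 --level=10 --raid-devices=8 /dev/sd[b-i]

  # option 2: 5 SSDs in RAID6 (survives two drive failures)
  mdadm --create /dev/md0 --level=6 --raid-devices=5 /dev/sd[b-f]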

Personally, I have a couple of systems in live operation: one uses 4
disks in RAID10, and it mostly works.... another uses 5 consumer-grade
SSDs in RAID5, with the secondary DRBD node using 4 disks in RAID10, and
it mostly works. I'd love to replace everything with consumer-grade
SSDs, but I just can't justify the dollars in these scenarios. If only I
had realised this issue before I started.... SSDs are amazing with lots
of small, random IO.

Do you actually know what the workload will be?

PS. I forgot to add last time: I could be wrong, the above could be a
bunch of nonsense, etc.... though hopefully it will help...

Regards,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au



* Re: possibly silly configuration question
  2012-12-27 16:21     ` Adam Goryachev
@ 2012-12-27 16:44       ` Miles Fidelman
  0 siblings, 0 replies; 6+ messages in thread
From: Miles Fidelman @ 2012-12-27 16:44 UTC (permalink / raw)
  Cc: linux-raid@vger.kernel.org

Adam Goryachev wrote:
> On 28/12/12 03:02, Miles Fidelman wrote:
>> Adam,
>>
>> Thanks for the suggestions.  The thing I'm worried about is how much
>> traffic gets generated as I start wiring together more complex
>> configurations, and the kind of performance hits involved
>> (particularly if a node goes down and things start getting re-synced).
>>
> With my suggested config, I'd put 2 x Gb ethernet from each machine on
> one VLAN for storage, and the other two from each onto the network.
> (Actually, what is your bandwidth to the end user? If these are
> Internet services, you probably don't need 2 x Gb connections, so use
> 3 x Gb for the storage and 1 x Gb for the end-user-facing network.)

Yup.  I have 4 gigE ports on each box - so I was thinking 2 for storage, 
2 for outside - giving me full redundancy (I have two separate outside 
connections for the cluster).
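
One way to get that redundancy on each pair would be an active-backup
bond; a minimal sketch, assuming Debian-style ifenslave configuration,
with addresses and interface names invented:

  # /etc/network/interfaces (storage-side pair)
  auto bond0
  iface bond0 inet static
      address 10.0.0.1
      netmask 255.255.255.0
      bond-slaves eth2 eth3
      bond-mode active-backup
      bond-miimon 100

with a similar bond1 for the outside-facing pair.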
>
> Do you actually know what the workload will be?

Not really.  I have a mix of production email/listserv/web/database on 
one VM (relatively low load), a backup server on a second VM, and the 
rest of the cluster is used for a mix of development and testing for 
some new services.  Short term, load will be low, but could start 
spiking quickly.  My other task is designing for rapid expansion - first 
through AWS, then through more hardware.

Thanks Again,

Miles


-- 
In theory, there is no difference between theory and practice.
In practice, there is.   .... Yogi Berra



* Re: possibly silly configuration question
  2012-12-27  4:16 possibly silly configuration question Miles Fidelman
  2012-12-27  4:43 ` Adam Goryachev
@ 2012-12-27 21:11 ` Roy Sigurd Karlsbakk
  1 sibling, 0 replies; 6+ messages in thread
From: Roy Sigurd Karlsbakk @ 2012-12-27 21:11 UTC (permalink / raw)
  To: Miles Fidelman; +Cc: linux-raid

> I find myself having four servers, each with 4 large disks, that I'm
> trying to assemble into a high-availability cluster. (Note: I've got 4
> gigE ports on each box, 2 set aside for outside access, 2 for
> inter-node clustering)

Why not glusterfs? It's designed for your needs…

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
roy@karlsbakk.net
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive application of idioms with xenotype etymology. In most cases, adequate and relevant synonyms exist in Norwegian.