linux-raid.vger.kernel.org archive mirror
* md with shared disks
@ 2014-11-09  8:30 Anton Ekermans
  2014-11-10 16:40 ` Ethan Wilson
  2014-11-10 22:14 ` Stan Hoeppner
  0 siblings, 2 replies; 7+ messages in thread
From: Anton Ekermans @ 2014-11-09  8:30 UTC (permalink / raw)
  To: linux-raid

Good day raiders,
I have a question about md to which I cannot find an up-to-date answer.
We use a SuperMicro server with 16 shared disks on a backplane shared
between two motherboards, running up-to-date CentOS 7.
If I create an array on one node, the other node can detect it. I put
GFS2 on top of the array so both systems can share the filesystem, but I
want to know whether md RAID is safe to use this way, with two
active/active nodes possibly changing the metadata at the same time. I've
disabled the raid-check cron job on one node so they don't both resync
the drives weekly, but I suspect there's a lot more to it than that.

If that's not possible, then some advice on a strategy for a large
active/active shared disk/filesystem would also be welcome.

Best regards


Anton Ekermans
Technical/R&D
E-mail: antone true co za
Tel: 042 293 4168 Fax: 042 293 1851
Web: www.true.co.za <http://www.true.co.za>



* Re: md with shared disks
  2014-11-09  8:30 md with shared disks Anton Ekermans
@ 2014-11-10 16:40 ` Ethan Wilson
  2014-11-10 22:14 ` Stan Hoeppner
  1 sibling, 0 replies; 7+ messages in thread
From: Ethan Wilson @ 2014-11-10 16:40 UTC (permalink / raw)
  To: linux-raid

On 09/11/2014 09:30, Anton Ekermans wrote:
> Good day raiders,
> I have a question about md to which I cannot find an up-to-date answer.
> We use a SuperMicro server with 16 shared disks on a backplane shared
> between two motherboards, running up-to-date CentOS 7.
> If I create an array on one node, the other node can detect it. I put
> GFS2 on top of the array so both systems can share the filesystem, but I
> want to know whether md RAID is safe to use this way, with two
> active/active nodes possibly changing the metadata at the same time. I've
> disabled the raid-check cron job on one node so they don't both resync
> the drives weekly, but I suspect there's a lot more to it than that.
>
> If that's not possible, then some advice on a strategy for a large
> active/active shared disk/filesystem would also be welcome.

Not possible, as far as I know: MD does not reload or exchange metadata
with other MD peers; each MD instance assumes it is the only user of
those disks.
If you share the arrays and one head then fails a disk and starts
reconstruction onto another disk while the other head still thinks the
array is fine, havoc will certainly arise.

Even without this worst-case scenario, data will probably still be lost
because the two MD instances are not cache coherent: writes on one head
will not invalidate the kernel cache for the same region on the other
head, so reads performed on the other head will not see the changes just
written if that area was already cached in the kernel.
GFS will actually attempt to invalidate such caches, but I am not sure to
what extent: with raid5/6 it is probably not enough, because the stripe
cache will hold stale data in a way that GFS probably does not know about
(it does not go away even with echo 3 > /proc/sys/vm/drop_caches). Maybe
raid0/1/10 is safer... does anybody know whether cache dropping works
well there?
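
To make the distinction between the two caches concrete, here is a
minimal sketch (just an illustration, "md0" is an assumed raid5/6 array
name): the raid456 stripe cache has its own sysfs knob and is not
reached by the VM drop_caches interface at all.

#!/usr/bin/env python3
# Illustrative sketch only: the raid4/5/6 stripe cache is internal to the
# raid456 driver and has its own sysfs knob, separate from the page cache
# that /proc/sys/vm/drop_caches flushes.  "md0" is an assumed array name.
from pathlib import Path

MD = "md0"  # assumption: a raid5/6 array called md0

def stripe_cache_size(md: str = MD) -> int:
    # Number of stripe-cache entries md allocates for this array
    # (this sysfs file exists only for raid4/5/6 arrays).
    return int(Path(f"/sys/block/{md}/md/stripe_cache_size").read_text())

def drop_page_cache() -> None:
    # Drops page cache, dentries and inodes -- it does not reach into
    # md's stripe cache at all.  Needs root.
    Path("/proc/sys/vm/drop_caches").write_text("3\n")

if __name__ == "__main__":
    print("stripe_cache_size before:", stripe_cache_size())
    drop_page_cache()
    print("stripe_cache_size after :", stripe_cache_size())
    # The knob (and the stripes cached behind it) belong to the raid456
    # driver, which is why dropping the VM caches does not help here.
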
But the problem of a consistent view of disk failures and RAID
reconstruction seems harder to overcome.

You can do an active/passive configuration, shutting down MD on one head 
and starting it on the other head.
Another option is crossed-active (or whatever it is called): some
arrays are active on one head node and other arrays on the other head
node, so as to share the computational and bandwidth burden.
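
In either variant, handing one array over between heads is basically
stop-on-one-side, assemble-on-the-other. A rough sketch, with made-up
device and mount-point names, ignoring fencing and the cluster manager
(Pacemaker or similar) that should really drive this:

#!/usr/bin/env python3
# Rough sketch of a manual array hand-over between the two heads.
# /dev/md0 and /srv/storage are made-up names; a real setup would let a
# cluster manager with fencing drive these steps instead of a script.
import subprocess

ARRAY = "/dev/md0"           # assumed shared-disk array
MOUNTPOINT = "/srv/storage"  # assumed mount point on top of it

def run(*cmd: str) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def release():
    """On the currently active head: stop using the array, release the disks."""
    run("umount", MOUNTPOINT)
    run("mdadm", "--stop", ARRAY)

def take_over():
    """On the other head, only after release() has completed on the first one."""
    run("mdadm", "--assemble", "--scan")   # re-assemble from the shared disks
    run("mount", ARRAY, MOUNTPOINT)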

If other people have better ideas I am all ears.

Regards
EW



* Re: md with shared disks
  2014-11-09  8:30 md with shared disks Anton Ekermans
  2014-11-10 16:40 ` Ethan Wilson
@ 2014-11-10 22:14 ` Stan Hoeppner
  2014-11-13 13:14   ` Anton Ekermans
  1 sibling, 1 reply; 7+ messages in thread
From: Stan Hoeppner @ 2014-11-10 22:14 UTC (permalink / raw)
  To: Anton Ekermans, linux-raid

On 11/09/2014 02:30 AM, Anton Ekermans wrote:
> Good day raiders,
> I have a question about md to which I cannot find an up-to-date answer.
> We use a SuperMicro server with 16 shared disks on a backplane shared
> between two motherboards, running up-to-date CentOS 7.
> If I create an array on one node, the other node can detect it. I put
> GFS2 on top of the array so both systems can share the filesystem, but I
> want to know whether md RAID is safe to use this way, with two
> active/active nodes possibly changing the metadata at the same time. I've
> disabled the raid-check cron job on one node so they don't both resync
> the drives weekly, but I suspect there's a lot more to it than that.
> 
> If that's not possible, then some advice on a strategy for a large
> active/active shared disk/filesystem would also be welcome.

It's not possible to do what you mention as md is not cluster aware.  It
will break, badly.  What most people do in such cases is to create two md
arrays, one controlled by each host, and mirror them with DRBD, then put
OCFS/GFS atop DRBD.  You lose half your capacity doing this, but it's
the only way to do it and have all disks active.  Of course you lose
half your bandwidth as well.  This is a high availability solution, not
high performance.
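
As a rough sketch of bringing that stack up (resource, device, cluster
and filesystem names below are placeholders, not a recipe; it assumes
each host already has its local md array as the DRBD backing device,
the resource is configured with allow-two-primaries, and corosync/dlm
are running):

#!/usr/bin/env python3
# Sketch of bringing up the md -> DRBD -> GFS2 stack described above.
# "r0", "mycluster", "vmstore", /dev/drbd0 and /srv/vmstore are placeholder
# names; adjust to taste.
import subprocess

def run(*cmd: str) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def first_node():
    run("drbdadm", "create-md", "r0")            # write DRBD metadata
    run("drbdadm", "up", "r0")                   # attach and connect
    run("drbdadm", "primary", "--force", "r0")   # start the initial sync
    # One GFS2 journal per node that will mount the filesystem (-j 2).
    run("mkfs.gfs2", "-p", "lock_dlm", "-t", "mycluster:vmstore",
        "-j", "2", "/dev/drbd0")
    run("mount", "-t", "gfs2", "/dev/drbd0", "/srv/vmstore")

def second_node():
    run("drbdadm", "create-md", "r0")
    run("drbdadm", "up", "r0")
    run("drbdadm", "primary", "r0")              # dual-primary: promote too
    run("mount", "-t", "gfs2", "/dev/drbd0", "/srv/vmstore")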

You bought this hardware to do something.  And that something wasn't
simply making two hosts in one box use all the disks in the box.  What
is the workload you plan to run on this hardware?  The workload dictates
the needed hardware architecture, not the other way around.  If you want
high availability this hardware will work using the stack architecture
above, and work well.  If you need high performance shared filesystem
access between both nodes you need an external SAS/FC RAID array and a
cluster FS.  In either case you're using a cluster FS which means high
file throughput but low metadata throughput.

If it's high performance you need, an option is to submit patches to
make md cluster aware.  Another is the LSI clustering RAID controller
kit for internal drives.  I don't know anything about it other than that
it is available and apparently works with RHEL and SUSE.  It seems
suitable for what you describe as your need.

http://www.lsi.com/products/shared-das/pages/syncro-cs-9271-8i.aspx#tab/tab2


Cheers,
Stan


* Re: md with shared disks
  2014-11-10 22:14 ` Stan Hoeppner
@ 2014-11-13 13:14   ` Anton Ekermans
  2014-11-13 20:56     ` Stan Hoeppner
  0 siblings, 1 reply; 7+ messages in thread
From: Anton Ekermans @ 2014-11-13 13:14 UTC (permalink / raw)
  To: Stan Hoeppner, linux-raid

Thank you very much for your clear response.
The purpose of this hardware is primarily to host ample VM storage for
the two nodes themselves and for three other i7 PCs/servers.
We hoped to achieve HA as active/active, with both nodes sharing the
same disks and the non-cluster (i7) servers having multipath to these
two nodes. This is advertised as HA active/active in storage software
such as Nexenta using RSF-1. On closer inspection, however, their
active/active means that each node serves part of the data and the
other can take over. So for me it is in essence "active/passive +
passive/active" and not truly "active/active". We will try to configure
it this way to get quasi active/active for the best performance with a
kind of high availability. It seems the shared disks are not the
problem; combining them in a cluster is.

Thank you again

Best regards

Anton Ekermans

> It's not possible to do what you mention as md is not cluster aware.  It
> will break, badly.  What most people do in such cases is to create two md
> arrays, one controlled by each host, and mirror them with DRBD, then put
> OCFS/GFS atop DRBD.  You lose half your capacity doing this, but it's
> the only way to do it and have all disks active.  Of course you lose
> half your bandwidth as well.  This is a high availability solution, not
> high performance.
>
> You bought this hardware to do something.  And that something wasn't
> simply making two hosts in one box use all the disks in the box.  What
> is the workload you plan to run on this hardware?  The workload dictates
> the needed hardware architecture, not the other way around.  If you want
> high availability this hardware will work using the stack architecture
> above, and work well.  If you need high performance shared filesystem
> access between both nodes you need an external SAS/FC RAID array and a
> cluster FS.  In either case you're using a cluster FS which means high
> file throughput but low metadata throughput.
>
> If it's high performance you need, an option is to submit patches to
> make md cluster aware.  Another is the LSI clustering RAID controller
> kit for internal drives.  I don't know anything about it other than that
> it is available and apparently works with RHEL and SUSE.  It seems
> suitable for what you describe as your need.
>
> http://www.lsi.com/products/shared-das/pages/syncro-cs-9271-8i.aspx#tab/tab2
>
>
> Cheers,
> Stan



* Re: md with shared disks
  2014-11-13 13:14   ` Anton Ekermans
@ 2014-11-13 20:56     ` Stan Hoeppner
  2014-11-13 22:53       ` Ethan Wilson
  0 siblings, 1 reply; 7+ messages in thread
From: Stan Hoeppner @ 2014-11-13 20:56 UTC (permalink / raw)
  To: Anton Ekermans, linux-raid

With DRBD and GFS2 it is true active/active at the block level.  You
just lose half your disk capacity due to the host-to-host mirroring.
Whether your upper layers are active/active is another story.  E.g.
getting NFS server/client to do seamless automatic path failover is
still a shaky proposition AIUI.

You mention multipath.  If you plan to use iSCSI multipath for the i7
servers, you need to make sure each LUN you export has the same WWID on
both cluster nodes.
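
For example, a quick sanity check on one of the i7 initiators could look
like the sketch below (the device names are placeholders for the two
iSCSI paths to one LUN, one through each cluster node):

#!/usr/bin/env python3
# Sketch of a WWID sanity check on an initiator: every path that is meant
# to lead to the same LUN must report the same WWID, otherwise multipath
# will treat them as different LUNs.  /dev/sdb and /dev/sdc are placeholders
# for the two iSCSI paths to one LUN, one through each cluster node.
import subprocess
from collections import defaultdict

PATHS = ["/dev/sdb", "/dev/sdc"]

def wwid(dev: str) -> str:
    out = subprocess.check_output(
        ["/usr/lib/udev/scsi_id", "--whitelisted", f"--device={dev}"])
    return out.decode().strip()

groups = defaultdict(list)
for dev in PATHS:
    groups[wwid(dev)].append(dev)

for w, devs in groups.items():
    print(w, devs)

if len(groups) != 1:
    print("WARNING: the paths report different WWIDs, so multipath will "
          "not merge them into one device.")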

Stan



On 11/13/2014 07:14 AM, Anton Ekermans wrote:
> Thank you very much for your clear response.
> The purpose of this hardware is primarily to host ample VM storage for
> the two nodes themselves and for three other i7 PCs/servers.
> We hoped to achieve HA as active/active, with both nodes sharing the
> same disks and the non-cluster (i7) servers having multipath to these
> two nodes. This is advertised as HA active/active in storage software
> such as Nexenta using RSF-1. On closer inspection, however, their
> active/active means that each node serves part of the data and the
> other can take over. So for me it is in essence "active/passive +
> passive/active" and not truly "active/active". We will try to configure
> it this way to get quasi active/active for the best performance with a
> kind of high availability. It seems the shared disks are not the
> problem; combining them in a cluster is.
> 
> Thank you again
> 
> Best regards
> 
> Anton Ekermans
> 
>> It's not possible to do what you mention as md is not cluster aware.  It
>> will break, badly.  What most people do in such cases is to create two md
>> arrays, one controlled by each host, and mirror them with DRBD, then put
>> OCFS/GFS atop DRBD.  You lose half your capacity doing this, but it's
>> the only way to do it and have all disks active.  Of course you lose
>> half your bandwidth as well.  This is a high availability solution, not
>> high performance.
>>
>> You bought this hardware to do something.  And that something wasn't
>> simply making two hosts in one box use all the disks in the box.  What
>> is the workload you plan to run on this hardware?  The workload dictates
>> the needed hardware architecture, not the other way around.  If you want
>> high availability this hardware will work using the stack architecture
>> above, and work well.  If you need high performance shared filesystem
>> access between both nodes you need an external SAS/FC RAID array and a
>> cluster FS.  In either case you're using a cluster FS which means high
>> file throughput but low metadata throughput.
>>
>> If it's high performance you need, an option is to submit patches to
>> make md cluster aware.  Another is the LSI clustering RAID controller
>> kit for internal drives.  I don't know anything about it other than that
>> it is available and apparently works with RHEL and SUSE.  It seems
>> suitable for what you describe as your need.
>>
>> http://www.lsi.com/products/shared-das/pages/syncro-cs-9271-8i.aspx#tab/tab2
>>
>>
>>
>> Cheers,
>> Stan
> 


* Re: md with shared disks
  2014-11-13 20:56     ` Stan Hoeppner
@ 2014-11-13 22:53       ` Ethan Wilson
  2014-11-14  0:07         ` Stan Hoeppner
  0 siblings, 1 reply; 7+ messages in thread
From: Ethan Wilson @ 2014-11-13 22:53 UTC (permalink / raw)
  To: linux-raid

On 13/11/2014 21:56, Stan Hoeppner wrote:
> With DRBD and GFS2 it is true active/active at the block level.  You
> just lose half your disk capacity due to the host-to-host mirroring.

Sorry but I don't share your definition of active/active.

Would you say that a raid1 is an active/active thing?

Doubling the number of disks and repeating the operation on both sides 
is not active/active in the sense that people usually want.

Active/active commonly means that you have twice the performance of 
active/passive.

In this sense DRBD is not only active/passive, it is even well below
the performance of an active/passive setup, because it has to transmit
the data to the peer in addition to writing it to the disks; this takes
CPU time for memcpy and interrupts, introduces latency, and requires
additional hardware (fast networking dedicated to DRBD). An
active/passive setup with shared disks is hence roughly twice as fast as
DRBD at the same price spent on the head nodes, and an active/active
setup with shared disks is roughly four times as fast as DRBD, again at
the same price for the head nodes.
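
To put very rough numbers on it (purely illustrative, arbitrary units,
all overheads hand-waved exactly as above):

#!/usr/bin/env python3
# Illustrative restatement of the rough factors argued above; the baseline
# figure is made up and only the ratios matter.
shared_active_passive = 1.0                        # one head using the shared disks
drbd_active_passive   = shared_active_passive / 2  # roughly half: every write is
                                                   # also shipped to the peer
shared_active_active  = 2 * shared_active_passive  # both heads active

print("relative to DRBD active/passive:")
print("  shared-disk active/passive ~",
      shared_active_passive / drbd_active_passive, "x")
print("  shared-disk active/active  ~",
      shared_active_active / drbd_active_passive, "x")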

On top of that, with DRBD you have to buy twice the number of disks,
which is a further expense; only a marginal one, though, because a
shared-disk infrastructure is far more expensive than a direct-attached
one, but it has to be planned that way in advance, not retrofitted as
you propose.

His current infrastructure cannot easily be converted to DRBD without
major losses: if he attempts to do so, he will pay almost double the
cost of a basic shared-nothing, direct-attached DRBD infrastructure, or
exactly double the cost of a shared-disk infrastructure, measured as
cost per TB of data. Unfortunately, even then he will still have half
the performance of an active/passive shared-disk clustered-MD solution.



* Re: md with shared disks
  2014-11-13 22:53       ` Ethan Wilson
@ 2014-11-14  0:07         ` Stan Hoeppner
  0 siblings, 0 replies; 7+ messages in thread
From: Stan Hoeppner @ 2014-11-14  0:07 UTC (permalink / raw)
  To: Ethan Wilson, linux-raid

On 11/13/2014 04:53 PM, Ethan Wilson wrote:
> On 13/11/2014 21:56, Stan Hoeppner wrote:
>> With DRBD and GFS2 it is true active/active at the block level.  You
>> just lose half your disk capacity due to the host-to-host mirroring.
> 
> Sorry but I don't share your definition of active/active.
> 
> Would you say that a raid1 is an active/active thing?
> 
> Doubling the number of disks and repeating the operation on both sides
> is not active/active in the sense that people usually want.
> 
> Active/active commonly means that you have twice the performance of
> active/passive.
> 
> In this sense DRBD is not only active/passive, it is even well below
> the performance of an active/passive setup, because it has to transmit
> the data to the peer in addition to writing it to the disks; this takes
> CPU time for memcpy and interrupts, introduces latency, and requires
> additional hardware (fast networking dedicated to DRBD). An
> active/passive setup with shared disks is hence roughly twice as fast as
> DRBD at the same price spent on the head nodes, and an active/active
> setup with shared disks is roughly four times as fast as DRBD, again at
> the same price for the head nodes.
> 
> On top of that, with DRBD you have to buy twice the number of disks,
> which is a further expense; only a marginal one, though, because a
> shared-disk infrastructure is far more expensive than a direct-attached
> one, but it has to be planned that way in advance, not retrofitted as
> you propose.
> 
> His current infrastructure cannot easily be converted to DRBD without
> major losses: if he attempts to do so, he will pay almost double the
> cost of a basic shared-nothing, direct-attached DRBD infrastructure, or
> exactly double the cost of a shared-disk infrastructure, measured as
> cost per TB of data. Unfortunately, even then he will still have half
> the performance of an active/passive shared-disk clustered-MD solution.

He doesn't have an infrastructure yet.  He's attempting to build one but
purchased the wrong gear for his requirements.  I presented him with
options to do it the right way, and to salvage what he has already
purchased.  The DRBD active/active option is the latter.  The SAN option
was the former.  You seem to have misunderstood my comments.

Cheers,
Stan



end of thread

Thread overview: 7 messages
2014-11-09  8:30 md with shared disks Anton Ekermans
2014-11-10 16:40 ` Ethan Wilson
2014-11-10 22:14 ` Stan Hoeppner
2014-11-13 13:14   ` Anton Ekermans
2014-11-13 20:56     ` Stan Hoeppner
2014-11-13 22:53       ` Ethan Wilson
2014-11-14  0:07         ` Stan Hoeppner
