* md with shared disks
@ 2014-11-09 8:30 Anton Ekermans
2014-11-10 16:40 ` Ethan Wilson
2014-11-10 22:14 ` Stan Hoeppner
0 siblings, 2 replies; 7+ messages in thread
From: Anton Ekermans @ 2014-11-09 8:30 UTC (permalink / raw)
To: linux-raid
Good day raiders,
I have a question about md that I cannot find an up-to-date answer to.
We use a SuperMicro server with 16 shared disks on a shared backplane
between two motherboards, running up-to-date CentOS 7.
If I create an array on one node, the other node can detect it. I put
GFS2 on top of the array so both systems can share the filesystem, but I
want to know if md RAID is safe to use in this way with possibly 2
active/active nodes changing the metadata at the same time. I've
disabled the raid-check cron job on one node so they don't both resync
the drives weekly, but I suspect there's a lot more to it than that.
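For reference, this is roughly what I changed on the second node. I'm
assuming the stock CentOS 7 mdadm layout here (weekly scrub driven by
/etc/cron.d/raid-check and configured in /etc/sysconfig/raid-check), so
the paths may differ on other setups:

  # look at the cron job the mdadm package ships
  cat /etc/cron.d/raid-check

  # on the node that should NOT scrub, switch the job off in its config;
  # the raid-check script does nothing unless ENABLED is "yes"
  sed -i 's/^ENABLED=.*/ENABLED=no/' /etc/sysconfig/raid-check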
If it's not possible, then alternatively some advice on a strategy for
a large active/active shared disk/filesystem would also be welcome.
Best regards
Anton Ekermans
Technical/R&D
E-mail: antone true co za
Tel: 042 293 4168 Fax: 042 293 1851
Web: www.true.co.za
* Re: md with shared disks
2014-11-09 8:30 md with shared disks Anton Ekermans
@ 2014-11-10 16:40 ` Ethan Wilson
2014-11-10 22:14 ` Stan Hoeppner
1 sibling, 0 replies; 7+ messages in thread
From: Ethan Wilson @ 2014-11-10 16:40 UTC (permalink / raw)
To: linux-raid
On 09/11/2014 09:30, Anton Ekermans wrote:
> Good day raiders,
> I have a question about md that I cannot find an up-to-date answer to.
> We use a SuperMicro server with 16 shared disks on a shared backplane
> between two motherboards, running up-to-date CentOS 7.
> If I create an array on one node, the other node can detect it. I put
> GFS2 on top of the array so both systems can share the filesystem, but
> I want to know if md RAID is safe to use in this way with possibly
> 2 active/active nodes changing the metadata at the same time. I've
> disabled the raid-check cron job on one node so they don't both resync
> the drives weekly, but I suspect there's a lot more to it than that.
>
> If it's not possible, then alternatively some advice on a strategy
> for a large active/active shared disk/filesystem would also be welcome.
Not possible, as far as I know: MD does not reload / exchange metadata
information with other MD peers. MD thinks it is the only user of those
disks.
If you attempt to share the arrays and then one head fails one disk and
starts reconstruction onto another disk while the other head thinks the
array is all right, havoc will certainly ensue.
Even without this worst-case scenario, data will probably still be lost
because the two MDs are not cache coherent: writes on one head will
not invalidate the kernel cache for the same region on the other head,
which is bad because reads performed on the other head will not see
the changes just written if that area was cached in the kernel.
GFS will actually attempt to invalidate such caches, but I am not sure
to what extent: if you use raid5/6 it is probably not enough, because
the stripe cache will hold stale data in a way that GFS probably does
not know about (it does not go away even with
echo 3 > /proc/sys/vm/drop_caches). Maybe raid0/1/10 can be safer...
does anybody know if cache dropping works well there?
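A rough sketch of what I mean, assuming an array named md0 (the
stripe-cache knobs below only exist for raid4/5/6 arrays):

  # drop the page/dentry/inode caches on this head
  sync
  echo 3 > /proc/sys/vm/drop_caches

  # the raid5/6 stripe cache is separate from the page cache
  cat /sys/block/md0/md/stripe_cache_size     # cache size, in pages per device
  cat /sys/block/md0/md/stripe_cache_active   # stripes currently held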
But the problem of a consistent view of disk failures and raid
reconstruction seems harder to overcome.
You can do an active/passive configuration, shutting down MD on one head
and starting it on the other head.
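Something like the following for a manual failover; the device names and
mount point are made up, and a real setup would drive this from pacemaker
or similar rather than by hand:

  # on the head giving up the array: unmount, then stop md
  umount /mnt/shared
  mdadm --stop /dev/md0

  # on the head taking over: assemble from the same shared disks and mount
  mdadm --assemble /dev/md0 /dev/sd[b-q]
  mount /dev/md0 /mnt/shared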
Another option is the crossed-active (or whatever it is called) setup:
some arrays are active on one head node, other arrays on the other head
node, so as to share the computational and bandwidth burden.
If other people have better ideas I am all ears.
Regards
EW
* Re: md with shared disks
2014-11-09 8:30 md with shared disks Anton Ekermans
2014-11-10 16:40 ` Ethan Wilson
@ 2014-11-10 22:14 ` Stan Hoeppner
2014-11-13 13:14 ` Anton Ekermans
1 sibling, 1 reply; 7+ messages in thread
From: Stan Hoeppner @ 2014-11-10 22:14 UTC (permalink / raw)
To: Anton Ekermans, linux-raid
On 11/09/2014 02:30 AM, Anton Ekermans wrote:
> Good day raiders,
> I have a question about md that I cannot find an up-to-date answer to.
> We use a SuperMicro server with 16 shared disks on a shared backplane
> between two motherboards, running up-to-date CentOS 7.
> If I create an array on one node, the other node can detect it. I put
> GFS2 on top of the array so both systems can share the filesystem, but I
> want to know if md RAID is safe to use in this way with possibly 2
> active/active nodes changing the metadata at the same time. I've
> disabled the raid-check cron job on one node so they don't both resync
> the drives weekly, but I suspect there's a lot more to it than that.
>
> If it's not possible, then alternatively some advice on a strategy for
> a large active/active shared disk/filesystem would also be welcome.
It's not possible to do what you mention as md is not cluster aware. It
will break, badly. What most people do in such cases is create two md
arrays, one controlled by each host, and mirror them with DRBD, then put
OCFS/GFS atop DRBD. You lose half your capacity doing this, but it's
the only way to do it and have all disks active. Of course you lose
half your bandwidth as well. This is a high availability solution, not
high performance.
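A rough sketch of the DRBD resource for that stack, in dual-primary mode
so the cluster FS can be mounted on both heads at once. Hostnames,
addresses and device paths below are made up, and the syntax is from
memory of DRBD 8.4, so check it against your version:

  resource r0 {
      net {
          protocol C;                 # synchronous replication, needed for dual-primary
          allow-two-primaries yes;    # both heads may hold the device Primary
      }
      startup {
          become-primary-on both;
      }
      on nodeA {
          device    /dev/drbd0;
          disk      /dev/md0;         # local md array on this head
          address   192.168.100.1:7789;
          meta-disk internal;
      }
      on nodeB {
          device    /dev/drbd0;
          disk      /dev/md0;
          address   192.168.100.2:7789;
          meta-disk internal;
      }
  }

GFS2 (or OCFS2) then sits on /dev/drbd0 on both heads, with fencing
handled by the cluster stack.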
You bought this hardware to do something. And that something wasn't
simply making two hosts in one box use all the disks in the box. What
is the workload you plan to run on this hardware? The workload dictates
the needed hardware architecture, not the other way around. If you want
high availability this hardware will work using the stack architecture
above, and work well. If you need high performance shared filesystem
access between both nodes you need an external SAS/FC RAID array and a
cluster FS. In either case you're using a cluster FS which means high
file throughput but low metadata throughput.
If it's high performance you need, an option is to submit patches to
make md cluster aware. Another is the LSI clustering RAID controller
kit for internal drives. Don't know anything about it other than it is
available and apparently works with RHEL and SUSE. Seems suitable for
what you express as your need.
http://www.lsi.com/products/shared-das/pages/syncro-cs-9271-8i.aspx#tab/tab2
Cheers,
Stan
* Re: md with shared disks
2014-11-10 22:14 ` Stan Hoeppner
@ 2014-11-13 13:14 ` Anton Ekermans
2014-11-13 20:56 ` Stan Hoeppner
0 siblings, 1 reply; 7+ messages in thread
From: Anton Ekermans @ 2014-11-13 13:14 UTC (permalink / raw)
To: Stan Hoeppner, linux-raid
Thank you very much for your clear response.
The purpose of this hardware is primarily to host ample VM storage for
the 2 nodes themselves and 3 other i7 PC/servers.
The HA was hoped to be achieved as active/active, with both nodes sharing
the same disks and the non-cluster servers (i7) having multipath to these
two nodes. This is advertised as HA active/active in storage software
such as Nexenta using RSF-1. However, upon closer inspection, their
active/active means each node serves some of the data and the other can
take over. So for me, in essence it is "active/passive + passive/active"
and not truly "active/active". We will try to configure it this way to
get quasi active/active for best performance with kind-of high
availability. It seems the shared disks are not the problem, but
combining them in a cluster is.
Thank you again
Best regards
Anton Ekermans
> It's not possible to do what you mention as md is not cluster aware. It
> will break, badly. What most people do in such cases is create two md
> arrays, one controlled by each host, and mirror them with DRBD, then put
> OCFS/GFS atop DRBD. You lose half your capacity doing this, but it's
> the only way to do it and have all disks active. Of course you lose
> half your bandwidth as well. This is a high availability solution, not
> high performance.
>
> You bought this hardware to do something. And that something wasn't
> simply making two hosts in one box use all the disks in the box. What
> is the workload you plan to run on this hardware? The workload dictates
> the needed hardware architecture, not the other way around. If you want
> high availability this hardware will work using the stack architecture
> above, and work well. If you need high performance shared filesystem
> access between both nodes you need an external SAS/FC RAID array and a
> cluster FS. In either case you're using a cluster FS which means high
> file throughput but low metadata throughput.
>
> If it's high performance you need, an option is to submit patches to
> make md cluster aware. Another is the LSI clustering RAID controller
> kit for internal drives. Don't know anything about it other than it is
> available and apparently works with RHEL and SUSE. Seems suitable for
> what you express as your need.
>
> http://www.lsi.com/products/shared-das/pages/syncro-cs-9271-8i.aspx#tab/tab2
>
>
> Cheers,
> Stan
* Re: md with shared disks
2014-11-13 13:14 ` Anton Ekermans
@ 2014-11-13 20:56 ` Stan Hoeppner
2014-11-13 22:53 ` Ethan Wilson
0 siblings, 1 reply; 7+ messages in thread
From: Stan Hoeppner @ 2014-11-13 20:56 UTC (permalink / raw)
To: Anton Ekermans, linux-raid
With DRBD and GFS2 it is true active/active at the block level. You
just lose half your disk capacity due to the host-to-host mirroring.
Whether your upper layers are active/active is another story. E.g.
getting NFS server/client to do seamless automatic path failover is
still a shaky proposition AIUI.
You mention multipath. If you plan to use iSCSI multipath for the i7
servers you need to make sure each LUN you export has the same WWID on
both cluster nodes.
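A quick way to sanity-check that, assuming the i7 boxes run
device-mapper-multipath; the device names here are made up:

  # on an initiator, the ID reported through each path to a LUN must match
  /usr/lib/udev/scsi_id --whitelisted --device=/dev/sdc
  /usr/lib/udev/scsi_id --whitelisted --device=/dev/sdd

  # multipathd only groups paths that report the same WWID
  multipath -ll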
Stan
On 11/13/2014 07:14 AM, Anton Ekermans wrote:
> Thank you very much for your clear response.
> The purpose of this hardware is primarily to host ample VM storage for
> the 2 nodes themselves and 3 other i7 PC/servers.
> The HA was hoped to be achieved as active/active, with both nodes sharing
> the same disks and the non-cluster servers (i7) having multipath to these
> two nodes. This is advertised as HA active/active in storage software
> such as Nexenta using RSF-1. However, upon closer inspection, their
> active/active means each node serves some of the data and the other can
> take over. So for me, in essence it is "active/passive + passive/active"
> and not truly "active/active". We will try to configure it this way to
> get quasi active/active for best performance with kind-of high
> availability. It seems the shared disks are not the problem, but
> combining them in a cluster is.
>
> Thank you again
>
> Best regards
>
> Anton Ekermans
>
>> It's not possible to do what you mention as md is not cluster aware. It
>> will break, badly. What most people do in such cases is create two md
>> arrays, one controlled by each host, and mirror them with DRBD, then put
>> OCFS/GFS atop DRBD. You lose half your capacity doing this, but it's
>> the only way to do it and have all disks active. Of course you lose
>> half your bandwidth as well. This is a high availability solution, not
>> high performance.
>>
>> You bought this hardware to do something. And that something wasn't
>> simply making two hosts in one box use all the disks in the box. What
>> is the workload you plan to run on this hardware? The workload dictates
>> the needed hardware architecture, not the other way around. If you want
>> high availability this hardware will work using the stack architecture
>> above, and work well. If you need high performance shared filesystem
>> access between both nodes you need an external SAS/FC RAID array and a
>> cluster FS. In either case you're using a cluster FS which means high
>> file throughput but low metadata throughput.
>>
>> If it's high performance you need, an option is to submit patches to
>> make md cluster aware. Another is the LSI clustering RAID controller
>> kit for internal drives. Don't know anything about it other than it is
>> available and apparently works with RHEL and SUSE. Seems suitable for
>> what you express as your need.
>>
>> http://www.lsi.com/products/shared-das/pages/syncro-cs-9271-8i.aspx#tab/tab2
>>
>>
>>
>> Cheers,
>> Stan
>
* Re: md with shared disks
2014-11-13 20:56 ` Stan Hoeppner
@ 2014-11-13 22:53 ` Ethan Wilson
2014-11-14 0:07 ` Stan Hoeppner
0 siblings, 1 reply; 7+ messages in thread
From: Ethan Wilson @ 2014-11-13 22:53 UTC (permalink / raw)
To: linux-raid
On 13/11/2014 21:56, Stan Hoeppner wrote:
> With DRBD and GFS2 it is true active/active at the block level. You
> just lose half your disk capacity due to the host-to-host mirroring.
Sorry but I don't share your definition of active/active.
Would you say that a raid1 is an active/active thing?
Doubling the number of disks and repeating the operation on both sides
is not active/active in the sense that people usually want.
Active/active commonly means that you have twice the performance of
active/passive.
In this sense DRBD is not only active/passive, it is even well below
the performance of an active/passive setup, because it has to transmit
the data to the peer in addition to writing it to the disks, and this
takes CPU time for memcpy and interrupts, introduces latency, and
requires additional hardware (= fast networking dedicated to DRBD). An
active/passive setup with shared disks is hence "twice" (very roughly)
as fast as DRBD at the same price spent on the head nodes. An
active/active setup with shared disks is hence 4 times (again very
roughly) as fast as DRBD, at the same price for the head nodes.
In addition to this, with DRBD you have to buy twice the number of
disks, which is also an additional expense. Marginal, though, because a
shared-disk infrastructure is much more expensive than a direct-attached
one, but it has to be planned like that in advance, not retrofitted as
you propose.
His current infrastructure cannot be easily converted to DRBD without
major losses: if he attempts to do so he will have almost double the
cost of a basic DRBD shared-nothing direct-attached infrastructure, or
exactly double the cost of a shared-disk infrastructure, in terms of
cost per TB of data. Unfortunately, after this he will still have half
the performance of an active/passive shared-disk clustered-MD solution.
* Re: md with shared disks
2014-11-13 22:53 ` Ethan Wilson
@ 2014-11-14 0:07 ` Stan Hoeppner
0 siblings, 0 replies; 7+ messages in thread
From: Stan Hoeppner @ 2014-11-14 0:07 UTC (permalink / raw)
To: Ethan Wilson, linux-raid
On 11/13/2014 04:53 PM, Ethan Wilson wrote:
> On 13/11/2014 21:56, Stan Hoeppner wrote:
>> With DRBD and GFS2 it is true active/active at the block level. You
>> just lose half your disk capacity due to the host-to-host mirroring.
>
> Sorry but I don't share your definition of active/active.
>
> Would you say that a raid1 is an active/active thing?
>
> Doubling the number of disks and repeating the operation on both sides
> is not active/active in the sense that people usually want.
>
> Active/active commonly means that you have twice the performance of
> active/passive.
>
> In this sense DRBD is not only active/passive, it is even well below
> the performance of an active/passive setup, because it has to transmit
> the data to the peer in addition to writing it to the disks, and this
> takes CPU time for memcpy and interrupts, introduces latency, and
> requires additional hardware (= fast networking dedicated to DRBD). An
> active/passive setup with shared disks is hence "twice" (very roughly)
> as fast as DRBD at the same price spent on the head nodes. An
> active/active setup with shared disks is hence 4 times (again very
> roughly) as fast as DRBD, at the same price for the head nodes.
>
> In addition to this, with DRBD you have to buy twice the number of
> disks, which is also an additional expense. Marginal, though, because a
> shared-disk infrastructure is much more expensive than a direct-attached
> one, but it has to be planned like that in advance, not retrofitted as
> you propose.
>
> His current infrastructure cannot be easily converted to DRBD without
> major losses: if he attempts to do so he will have almost double the
> cost of a basic DRBD shared-nothing direct-attached infrastructure, or
> exactly double the cost of a shared-disk infrastructure, in terms of
> cost per TB of data. Unfortunately, after this he will still have half
> the performance of an active/passive shared-disk clustered-MD solution.
He doesn't have an infrastructure yet. He's attempting to build one but
purchased the wrong gear for his requirements. I presented him with
options to do it the right way, and to salvage what he has already
purchased. The DRBD active/active option is the latter. The SAN option
was the former. You seem to have misunderstood my comments.
Cheers,
Stan