* Cluster Aware MD Driver
  From: Xinwei Hu @ 2007-06-13 19:16 UTC
  To: linux-raid

Hi all,

Steven Dake proposed a solution* to make the MD layer and tools cluster
aware in early 2003, but it seems no progress has been made since then.
I'd like to pick this up again. :)

As far as I understand, Steven's proposal still mostly applies to the
current MD implementation, except that we now have the bitmap, and the
bitmap can be worked around via set_bitmap_file.

The problem is that we seem to need a kernel<->userspace interface to
keep the mddev struct in sync across all nodes, but I can't find one.

I'm new to the MD driver, so correct me if I'm wrong. Your suggestions
are really appreciated.

Thanks.

* http://osdir.com/ml/raid/2003-01/msg00013.html
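For illustration, here is a minimal userspace sketch of the
set_bitmap_file path mentioned above: as far as I can tell this is the
SET_BITMAP_FILE ioctl behind mdadm's external-bitmap support. The device
and file paths are placeholders, error handling is trimmed, and the
ioctl is normally issued while the array is being assembled, before it
is started.

/*
 * Sketch: attach an external (file-backed) write-intent bitmap to an
 * MD array via the SET_BITMAP_FILE ioctl.  Paths are placeholders.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/major.h>        /* MD_MAJOR */
#include <linux/raid/md_u.h>    /* SET_BITMAP_FILE */

int attach_bitmap(const char *md_dev, const char *bitmap_path)
{
        int md_fd = open(md_dev, O_RDWR);
        int bm_fd = open(bitmap_path, O_RDWR);

        if (md_fd < 0 || bm_fd < 0) {
                perror("open");
                return -1;
        }

        /* The ioctl argument is the open fd of the bitmap file. */
        if (ioctl(md_fd, SET_BITMAP_FILE, bm_fd) < 0) {
                perror("SET_BITMAP_FILE");
                return -1;
        }
        return 0;
}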
* Re: Cluster Aware MD Driver
  From: Mike Snitzer @ 2007-06-13 19:50 UTC
  To: Xinwei Hu; +Cc: linux-raid

Is the goal to have the MD device be directly accessible from all nodes?

This strategy seems flawed in that it requires updating MD superblocks,
and then the in-memory Linux data structures, across a cluster. The
reality is that if we're talking about shared storage, the MD management
only needs to happen on one node.

Others can weigh in on this, but the current MD really doesn't want to
be cluster-aware. IMHO, this cluster awareness really doesn't belong in
MD/mdadm. A higher-level cluster management tool should be doing this MD
ownership/coordination work. The MD ownership can be transferred
accordingly if/when the current owner fails, etc. But this implies that
the MD array is only ever active on one node at any given point in time.

Mike

On 6/13/07, Xinwei Hu <hxinwei@gmail.com> wrote:
> Hi all,
>
> Steven Dake proposed a solution* to make the MD layer and tools cluster
> aware in early 2003, but it seems no progress has been made since then.
> I'd like to pick this up again. :)
>
> As far as I understand, Steven's proposal still mostly applies to the
> current MD implementation, except that we now have the bitmap, and the
> bitmap can be worked around via set_bitmap_file.
>
> The problem is that we seem to need a kernel<->userspace interface to
> keep the mddev struct in sync across all nodes, but I can't find one.
>
> I'm new to the MD driver, so correct me if I'm wrong. Your suggestions
> are really appreciated.
>
> Thanks.
>
> * http://osdir.com/ml/raid/2003-01/msg00013.html
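The single-owner model Mike describes can be prototyped almost entirely
outside MD. A rough sketch of what a cluster resource agent's hooks
might look like, assuming the member devices are listed in mdadm.conf;
the function names and the agent itself are hypothetical, and a real
agent would also handle fencing and reporting:

/*
 * Sketch of ownership transfer for a shared MD array: the cluster
 * manager starts the array only on the current owner and stops it
 * everywhere else, by shelling out to mdadm.
 */
#include <stdio.h>
#include <stdlib.h>

static int run(const char *cmd)
{
        int rc = system(cmd);
        if (rc != 0)
                fprintf(stderr, "command failed (%d): %s\n", rc, cmd);
        return rc;
}

/* Called by the cluster manager when this node becomes the MD owner. */
int md_owner_start(const char *md_dev)
{
        char cmd[256];
        snprintf(cmd, sizeof(cmd), "mdadm --assemble %s", md_dev);
        return run(cmd);
}

/* Called when ownership moves away (or before a planned failover). */
int md_owner_stop(const char *md_dev)
{
        char cmd[256];
        snprintf(cmd, sizeof(cmd), "mdadm --stop %s", md_dev);
        return run(cmd);
}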
* RE: Cluster Aware MD Driver
  From: Cress, Andrew R @ 2003-01-07 14:54 UTC
  To: 'Brian Jackson', Steven Dake; +Cc: opengfs-users, linux-raid

I also had a couple of comments below. My .02.

Andy

-----Original Message-----
From: Brian Jackson
Sent: Saturday, January 04, 2003 4:06 PM
To: Steven Dake
Cc: opengfs-users@lists.sourceforge.net; linux-raid@vger.kernel.org
Subject: Re: Cluster Aware MD Driver

[...]

One thing you might want to think about is that most people who are
looking at a cluster-capable RAID are already going to have some sort of
cluster management software. It might be useful to use the transports
available. Maybe a plugin system that uses what you were saying as the
default method, but can also use a plugin written to take advantage of
an existing cluster management system. Just an idea.

[Andy] I would agree, but state this more strongly: utilizing the
existing cluster management software would be a requirement for most of
these customers. Things like the heartbeat, network transport, election
process, and master/slave relationships would come with that software
and need to be administered from a common interface in the cluster, so
using a plugin or dynamic library for these functions sounds like a good
approach.

> The question you might be asking is: how do you protect against each
> server overwriting the same data, such as the superblock or resync
> data?
[...]
> The writes will default to on to ensure that non-clusters
> work properly even with autostart.

Maybe the ability to write or not could be an mdadm switch. Something
like:

  mdadm -A --non-master

That would keep the changes to the MD drivers to a minimum (I think, but
I may be thinking the wrong way), but require manual intervention if the
master were to die (or at least some sort of outside intervention).

[Andy] The outside intervention would have to come from the cluster
management software, since I don't think manual intervention would do.
If there were an API to toggle this, the cluster management software
would be able to change who the master was if the master node went down.

[...]
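A plugin or dynamic library layer like the one discussed above could be
as small as a table of callbacks loaded with dlopen(). Nothing below is
an existing API; the interface, symbol name, and callbacks are purely
hypothetical, a sketch of the shape such a transport/membership plugin
might take:

/*
 * Hypothetical plugin interface for hooking an MD cluster daemon into
 * existing cluster management software (heartbeat, transport, election).
 */
#include <stddef.h>
#include <stdio.h>
#include <dlfcn.h>

struct md_cluster_ops {
        /* Join the cluster; register a callback for node up/down events. */
        int  (*join)(void (*node_event)(int node_id, int is_up));
        /* Return the node id elected as MD master, or -1 if none yet. */
        int  (*current_master)(void);
        /* Broadcast an MD state change (fail, hot-add, start, stop, ...). */
        int  (*broadcast)(const void *msg, size_t len);
        void (*leave)(void);
};

/* Load a transport plugin; each plugin exports "get_md_cluster_ops". */
struct md_cluster_ops *load_cluster_plugin(const char *so_path)
{
        void *h = dlopen(so_path, RTLD_NOW);
        if (!h) {
                fprintf(stderr, "dlopen: %s\n", dlerror());
                return NULL;
        }
        struct md_cluster_ops *(*get_ops)(void) =
                (struct md_cluster_ops *(*)(void))dlsym(h, "get_md_cluster_ops");
        return get_ops ? get_ops() : NULL;
}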
[parent not found: <200301041016.10224.vittorio.ballestra@infogestnet.it>]
[parent not found: <20030104164314.26621.qmail@escalade.vistahp.com>]
* Cluster Aware MD Driver
  From: Steven Dake @ 2003-01-04 19:13 UTC
  To: Brian Jackson; +Cc: opengfs-users, linux-raid

Brian,

The way to fix this problem is to make the MD layer and MD tools cluster
aware. I hope to have this functionality implemented in 2.4 and mdadm
1.0.1 sometime before March '03.

While I have not spent a ton of time thinking about it, I'll share what
I'm thinking in case someone else wants to get a jump start, since I
won't have time until February to start working on it.

What I am thinking of doing is making one node in the cluster the master
and all other nodes non-masters. The master is responsible for
committing superblock and resync changes to disk. The other nodes are
responsible for updating their status whenever any node changes state.

During startup, a userland daemon runs which opens a TCP port. The
daemon then attempts to connect to a list of other servers in a
configuration file (/etc/clusterips). On each connection, a master of
the cluster is elected by an election algorithm. Once the daemon makes
connections, it culls its file descriptor list so that it has only one
connection to each other server. Whenever a state change occurs (raid
set faulty, raid array failed, raid hot added, raid hot removed, raid
start, raid stop, etc.), the state change is transmitted via the server
to all other server nodes. When a server receives a state change
message, it handles it by using the md ioctls to update its internal
state. There are more complications, such as heartbeating to detect
dead nodes, etc. All of this keeps the arrays in sync across the entire
cluster. The big changes here are to the mdadm package, to use the local
server to execute operations, and to create a server which processes the
state changes and executes the appropriate MD layer ioctls.

The question you might be asking is: how do you protect against each
server overwriting the same data, such as the superblock or resync data?
The trick is to add an ioctl to the MD layer that turns writes on or
off. During the master election above, the master will turn its writes
on. All non-masters will turn their writes off before any RAID arrays
are started. Also, resyncs have to be communicated across the cluster so
that /proc/mdstat displays correct information, so I believe an ioctl
will have to be added to indicate this and to allow resync from a
specific spot in case the current master dies during a resync. The
changes to the MD layer should be fairly minimal and noninvasive, and
should also work well for non-cluster configurations. The writes will
default to on to ensure that non-clusters work properly even with
autostart.

The only downside of this approach is that RAID autostart can no longer
be used. The only solution to supporting that feature for clusters is to
move all of the above userspace code into the kernel, which would be a
big pain and likely not accepted into the mainline.

Thanks
-steve

Brian Jackson wrote:

> At the moment if one node goes down, the rest of the nodes will
> continue to run as expected. The only single point of failure from a
> node point of view is the lock server. Dominik is working on
> integrating the OpenDLM to get rid of that single point of failure. If
> one disk in the shared storage dies, that will bring you down. I have
> tried to stack pool on top of software RAID, but the MD driver doesn't
> play well with clusters, so that won't work. We are trying to figure
> out a way to fix this, but when GFS was originally designed the pool
> driver (which does GFS's volume management) only had striping (RAID 0).
> I hope this answers your questions.
> --Brian Jackson
> P.S. I am not sure, but it sounds like you have some misconceptions
> about how OpenGFS works. It does not use the disks on your nodes. It
> uses some kind of shared storage.
> Vittorio Ballestra writes:
>
>> Excuse me if this is a stupid or repeated question; I've searched the
>> docs (?) and the mailing lists, but I can't figure out whether OpenGFS
>> has some sort of fault tolerance.
>> These are my doubts:
>> What happens if one host is down, or one disk on one host is down?
>> Will the entire OpenGFS filesystem be down? If one disk is broken and
>> its content corrupted, will the whole OpenGFS be corrupted?
>> If OpenGFS does not support any kind of fault tolerance, can one use
>> RAID disks on each node?
>> Thanks,
>> V
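As an illustration of the daemon Steven sketches, the state-change
traffic could be as simple as a fixed-size struct broadcast over the
per-node TCP connections and replayed locally through the existing md
ioctls. The message layout and event codes below are invented for
illustration only; htobe64() is the glibc byte-order helper from
<endian.h>:

/*
 * Hypothetical wire format for the MD cluster daemon: each state change
 * is sent to every connected peer as one small fixed-size message.
 */
#include <stdint.h>
#include <unistd.h>
#include <endian.h>
#include <arpa/inet.h>

enum md_cluster_event {
        MDC_SET_FAULTY = 1,
        MDC_HOT_ADD,
        MDC_HOT_REMOVE,
        MDC_ARRAY_START,
        MDC_ARRAY_STOP,
        MDC_RESYNC_POS,         /* master reports resync progress */
};

struct md_cluster_msg {
        uint32_t event;         /* enum md_cluster_event, network order */
        uint32_t array_minor;   /* which mdX device, network order */
        uint64_t arg;           /* component dev number or resync sector */
};

/* Send one state change to every peer fd the daemon keeps open. */
int broadcast_state_change(const int *peer_fds, int npeers,
                           uint32_t event, uint32_t minor, uint64_t arg)
{
        struct md_cluster_msg msg = {
                .event       = htonl(event),
                .array_minor = htonl(minor),
                .arg         = htobe64(arg),
        };
        for (int i = 0; i < npeers; i++)
                if (write(peer_fds[i], &msg, sizeof(msg)) != sizeof(msg))
                        return -1;
        return 0;
}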
* Re: Cluster Aware MD Driver
  From: Brian Jackson @ 2003-01-04 21:06 UTC
  To: Steven Dake; +Cc: opengfs-users, linux-raid

Steven Dake writes:

> Brian,
>
> The way to fix this problem is to make the MD layer and MD tools
> cluster aware. I hope to have this functionality implemented in 2.4
> and mdadm 1.0.1 sometime before March '03.
>
> While I have not spent a ton of time thinking about it, I'll share
> what I'm thinking in case someone else wants to get a jump start,
> since I won't have time until February to start working on it.

I haven't looked at their implementation in depth, but I know EVMS has
the beginnings of this. Might be worth a peek.

> What I am thinking of doing is making one node in the cluster the
> master and all other nodes non-masters. The master is responsible for
> committing superblock and resync changes to disk. The other nodes are
> responsible for updating their status whenever any node changes state.
>
> During startup, a userland daemon runs which opens a TCP port. The
> daemon then attempts to connect to a list of other servers in a
> configuration file (/etc/clusterips). On each connection, a master of
> the cluster is elected by an election algorithm. Once the daemon makes
> connections, it culls its file descriptor list so that it has only one
> connection to each other server. Whenever a state change occurs (raid
> set faulty, raid array failed, raid hot added, raid hot removed, raid
> start, raid stop, etc.), the state change is transmitted via the
> server to all other server nodes. When a server receives a state
> change message, it handles it by using the md ioctls to update its
> internal state. There are more complications, such as heartbeating to
> detect dead nodes, etc. All of this keeps the arrays in sync across
> the entire cluster. The big changes here are to the mdadm package, to
> use the local server to execute operations, and to create a server
> which processes the state changes and executes the appropriate MD
> layer ioctls.

One thing you might want to think about is that most people who are
looking at a cluster-capable RAID are already going to have some sort of
cluster management software. It might be useful to use the transports
available. Maybe a plugin system that uses what you were saying as the
default method, but can also use a plugin written to take advantage of
an existing cluster management system. Just an idea.

> The question you might be asking is: how do you protect against each
> server overwriting the same data, such as the superblock or resync
> data? The trick is to add an ioctl to the MD layer that turns writes
> on or off. During the master election above, the master will turn its
> writes on. All non-masters will turn their writes off before any RAID
> arrays are started. Also, resyncs have to be communicated across the
> cluster so that /proc/mdstat displays correct information, so I
> believe an ioctl will have to be added to indicate this and to allow
> resync from a specific spot in case the current master dies during a
> resync. The changes to the MD layer should be fairly minimal and
> noninvasive, and should also work well for non-cluster configurations.
> The writes will default to on to ensure that non-clusters work
> properly even with autostart.

Maybe the ability to write or not could be an mdadm switch. Something
like:

  mdadm -A --non-master

That would keep the changes to the MD drivers to a minimum (I think, but
I may be thinking the wrong way), but require manual intervention if the
master were to die (or at least some sort of outside intervention).

> The only downside of this approach is that RAID autostart can no
> longer be used. The only solution to supporting that feature for
> clusters is to move all of the above userspace code into the kernel,
> which would be a big pain and likely not accepted into the mainline.

That won't be a big deal for most HA/LB type clusters. I think it will
matter more to the HPC crowd, though (they might be using GFS as /).
Even then they can probably get most of the functionality they need with
early userspace.

> Thanks
> -steve

--Brian Jackson

P.S. I have some other ideas about this I might send later on. Also, LMB
from the linux-ha list might be interested too. I think he wrote some
patches that sort of made MD work in a cluster.

[...]
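Neither the --non-master switch suggested above nor the write-toggle
ioctl from Steven's proposal exists today. Purely as an illustration of
how small the userspace side could be, the sketch below uses
MD_SET_WRITE_POLICY as a stand-in name for whatever ioctl number such a
patch would actually define:

/*
 * Illustrative only: how "mdadm -A --non-master" might map onto the
 * hypothetical write-toggle ioctl.  MD_SET_WRITE_POLICY is not a real
 * kernel interface.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>

#define MD_SET_WRITE_POLICY  0x9399   /* hypothetical ioctl number */
#define MD_WRITES_ON   1              /* master: commit superblock/resync */
#define MD_WRITES_OFF  0              /* non-master: never touch the disks */

/* Assemble-time hook: called before the array is run when --non-master
 * is given (or when a new master is elected). */
int md_set_write_policy(const char *md_dev, int writes_on)
{
        int fd = open(md_dev, O_RDWR);
        if (fd < 0) {
                perror(md_dev);
                return -1;
        }
        int rc = ioctl(fd, MD_SET_WRITE_POLICY, writes_on);
        if (rc < 0)
                perror("MD_SET_WRITE_POLICY");
        close(fd);
        return rc;
}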
Thread overview: 5 messages
2007-06-13 19:16 Cluster Aware MD Driver Xinwei Hu
2007-06-13 19:50 ` Mike Snitzer
-- strict thread matches above, loose matches on Subject: below --
2003-01-07 14:54 Cress, Andrew R
[not found] <200301041016.10224.vittorio.ballestra@infogestnet.it>
[not found] ` <20030104164314.26621.qmail@escalade.vistahp.com>
2003-01-04 19:13 ` Steven Dake
2003-01-04 21:06 ` Brian Jackson