Cluster Aware MD Driver


From: Steven Dake <sdake@mvista.com>
To: Brian Jackson <brian-lists@mdrx.com>
Cc: opengfs-users@lists.sourceforge.net, linux-raid@vger.kernel.org
Subject: Cluster Aware MD Driver
Date: Sat, 04 Jan 2003 12:13:05 -0700
Message-ID: <3E173241.2040301@mvista.com>
In-Reply-To: 20030104164314.26621.qmail@escalade.vistahp.com

Brian,

The way to fix this problem is to enable the MD layer and MD tools to be 
cluster aware.  I hope sometime before March 03 to have this 
functionality implemented in 2.4 and mdadm 1.0.1.

While I have not spent a ton of time thinking about it, I'll share what
I'm thinking in case someone else wants to get a jump start, since I
won't have time until Feb to start working on it.

What I am thinking of doing is making one node in a cluster the master,
and all other nodes nonmasters.  The master is responsible for committing
superblock and resync changes to disk.  The other nodes are responsible
for updating their status if any one node changes state.
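
A minimal sketch of the master/nonmaster split, for illustration only.
The election rule used here (the numerically lowest IPv4 address among
the live nodes wins) is an assumption, not something this message
specifies, and the names elect_master and assign_roles are hypothetical.

```python
import ipaddress
from typing import Dict, Iterable


def elect_master(live_nodes: Iterable[str]) -> str:
    """Pick the master. Assumed rule: numerically lowest IPv4 address."""
    nodes = list(live_nodes)
    if not nodes:
        raise ValueError("cannot elect a master from an empty cluster")
    # Compare as integers so "10.0.0.3" sorts before "10.0.0.12".
    return min(nodes, key=lambda ip: int(ipaddress.IPv4Address(ip)))


def assign_roles(live_nodes: Iterable[str]) -> Dict[str, str]:
    """Map each live node to 'master' or 'nonmaster' (exactly one master)."""
    nodes = list(live_nodes)
    master = elect_master(nodes)
    return {ip: ("master" if ip == master else "nonmaster") for ip in nodes}
```

Any deterministic rule would do, as long as every node that sees the
same membership list arrives at the same master.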

During startup, a userland daemon runs which opens a TCP port.  Then the
daemon attempts to connect to a list of other servers in a configuration
file (/etc/clusterips).  On each connection, a master of the cluster is
elected by an election algorithm.  Once the daemon has made its
connections, it culls its file descriptor list so that it holds exactly
one connection to each other server.  Whenever a state change occurs
(raid set faulty, raid array failed, raid hot added, raid hot removed,
raid start, raid stop, etc) the state change is transmitted via the
local server to all other server nodes.  When a server receives a state
change message, it will handle it appropriately by using the md ioctls
to update its internal state.  There are more complications, such as
heartbeating to detect dead nodes, etc.  All of this keeps the arrays in
sync across the entire cluster.  The big changes here are to the mdadm
package to use the local server to execute operations and to create a
server which processes the state changes and executes the appropriate MD
layer ioctls.
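
The state-change flow above could be sketched roughly as follows.  This
is only an illustration: the JSON-per-line wire format, the event names
as spelled here, and the Daemon class are assumptions, and the
apply_locally callback merely stands in for the md ioctl calls.  No
sockets are opened; outgoing messages are queued per peer.

```python
import json
from typing import Callable, Dict, List, Tuple

# Events taken from the list in the message; the spelling is assumed.
EVENTS = ("set_faulty", "array_failed", "hot_add", "hot_remove", "start", "stop")


def encode(event: str, array: str, device: str = "") -> bytes:
    """Serialize one state change as a newline-terminated JSON object."""
    if event not in EVENTS:
        raise ValueError(f"unknown event: {event}")
    msg = {"event": event, "array": array, "device": device}
    return (json.dumps(msg) + "\n").encode()


def decode(line: bytes) -> Dict[str, str]:
    """Parse one received line and reject events we do not know."""
    msg = json.loads(line.decode())
    if msg.get("event") not in EVENTS:
        raise ValueError(f"unknown event: {msg.get('event')}")
    return msg


def cull_connections(conns: List[Tuple[str, int]]) -> Dict[str, int]:
    """Keep one descriptor per peer: the first one seen for each address."""
    kept: Dict[str, int] = {}
    for peer, fd in conns:
        kept.setdefault(peer, fd)
    return kept


class Daemon:
    def __init__(self, apply_locally: Callable[[Dict[str, str]], None]):
        # apply_locally is a placeholder for issuing md ioctls on this node.
        self.apply_locally = apply_locally
        self.outbox: Dict[str, List[bytes]] = {}  # peer address -> queued messages

    def add_peer(self, peer: str) -> None:
        self.outbox.setdefault(peer, [])

    def local_change(self, event: str, array: str, device: str = "") -> None:
        """A local change (e.g. reported by mdadm) is queued for every peer."""
        wire = encode(event, array, device)
        for queue in self.outbox.values():
            queue.append(wire)

    def on_message(self, line: bytes) -> None:
        """A change received from a peer is applied to the local node."""
        self.apply_locally(decode(line))
```

Note that keeping "the first descriptor seen" is a local decision; a
real daemon would need both ends of a duplicated connection to agree on
which one survives.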

The question you might be asking is, how do you protect against each
server overwriting similar data such as the superblock or resync data?
The trick is to add an ioctl to the MD layer that turns writes on or
off.  During master election above, the master will turn its writes on.
All non-masters will turn their writes off before any RAID arrays are
started.  Also, resyncs have to be communicated across the cluster such
that /proc/mdstat displays correct information, so I believe an ioctl
will have to be added to indicate this and allow resync from a specific
spot in case the current master dies in a resync.  The changes to the MD
layer should be fairly minimal and noninvasive and also work well for
non-cluster configurations.  The writes will default to on to ensure
that non-clusters work properly even with autostart.
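
To make the write gate and the resync hand-off concrete, here is a toy
userspace model.  It is not kernel code and does not use any existing
md ioctl: set_writes and note_resync are hypothetical stand-ins for the
two ioctls proposed above, and ArrayState and take_over are names
invented for this sketch.

```python
from dataclasses import dataclass


@dataclass
class ArrayState:
    """One node's view of an MD array in this toy model."""
    writes_enabled: bool = True   # defaults to on, so non-cluster setups still work
    resync_offset: int = 0        # furthest resync position this node has heard of
    superblock_writes: int = 0    # metadata commits actually issued by this node

    def set_writes(self, enabled: bool) -> None:
        """Stand-in for the proposed 'turn writes on or off' ioctl."""
        self.writes_enabled = enabled

    def commit_superblock(self) -> bool:
        """Commit metadata only if this node is allowed to write."""
        if not self.writes_enabled:
            return False
        self.superblock_writes += 1
        return True

    def note_resync(self, offset: int) -> None:
        """Stand-in for the proposed ioctl that records resync progress."""
        self.resync_offset = max(self.resync_offset, offset)


def take_over(node: ArrayState) -> int:
    """Promote a nonmaster after the master dies; return where resync resumes."""
    node.set_writes(True)
    return node.resync_offset
```

For example, a nonmaster that has called set_writes(False) refuses to
commit the superblock, and after take_over it resumes the resync from
the last offset the old master reported instead of starting over.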

The only downside of this approach is that RAID autostart can no longer 
be used.  The only solution to supporting that feature for clusters is 
to write all of the above userspace code into the kernel which would be 
a big pain and likely not accepted into the mainline.

Thanks
-steve

Brian Jackson wrote:

> At the moment if one node goes down, the rest of the nodes will 
> continue to run as expected. The only single point of failure from a 
> node point of view is the lock server. Dominik is working on 
> integrating the OpenDLM to get rid of that single point of failure. If 
> one disk in the shared storage dies, that will bring you down. I have 
> tried to stack pool on top of software raid, but the MD driver doesn't 
> play well with clusters so that won't work. We are trying to figure 
> out a way to fix this, but when GFS was originally designed the pool 
> driver (that does GFS's volume management) only had striping (raid0). I 
> hope this answers your questions.
> --Brian Jackson
> P.S. I am not sure, but it sounds like you have some misconception 
> about how OpenGFS works. It does not use the disks on your nodes. It 
> uses some kind of shared storage.
> Vittorio Ballestra writes:
>
>> Excuse me if this is a stupid or repeated question, I've searched 
>> the docs (?) and the mailing lists but I'm not able to understand 
>> if openGFS has some sort of fault tolerance.
>> These are my doubts:
>> What happens if one host is down or one disk on one host is down? 
>> Will the entire openGFS filesystem be down? If one disk is broken 
>> and its content corrupted, will the whole openGFS be corrupted?
>> If openGFS does not support any kind of fault tolerance, can one 
>> use raid disks on each node?
>> Thanks,
>>     V
>>
>> -------------------------------------------------------
>> This sf.net email is sponsored by:ThinkGeek
>> Welcome to geek heaven.
>> http://thinkgeek.com/sf
>> _______________________________________________
>> Opengfs-users mailing list
>> Opengfs-users@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/opengfs-users

