public inbox for linux-kernel@vger.kernel.org
From: Daniel Phillips <phillips@istop.com>
To: Lars Marowsky-Bree <lmb@suse.de>
Cc: David Teigland <teigland@redhat.com>,
	Steven Dake <sdake@mvista.com>,
	linux-kernel@vger.kernel.org, akpm@osdl.org,
	Patrick Caulfield <pcaulfie@redhat.com>
Subject: Re: [PATCH 1b/7] dlm: core locking
Date: Thu, 28 Apr 2005 20:26:35 -0400
Message-ID: <200504282026.36273.phillips@istop.com>
In-Reply-To: <20050428125512.GR21645@marowsky-bree.de>

On Thursday 28 April 2005 08:55, Lars Marowsky-Bree wrote:
> On 2005-04-28T02:49:04, Daniel Phillips <phillips@istop.com> wrote:
> > > Just some food for thought how this all fits together rather
> > > neatly.
> >
> > It's actually the membership system that glues it all together.  The
> > dlm is just another service.
>
> Membership is one of the lowest level and high privileged inputs to the
> whole picture, of course.
>
> However, "membership" is already a pretty broad term, and one must
> clearly state what one is talking about. So we're clearly focused on
> node membership here, which is a special case of group membership; the
> top-level, sort of.

Indeed, you caught me being imprecise.  By "membership system" I mean cman, 
which includes basic cluster membership, service groups, socket interface, 
event messages, PF_CLUSTER, and a few other odds and ends.  Really, it _is_ 
our cluster infrastructure.  And it has warts, some really giant ones.  At 
least it did the last time I used it.  There is apparently a new, 
much-improved version I haven't seen yet.  I have heard that the re-rolled 
cman is in cvs somewhere.  Patrick?  Dave?

> Then every node has its local view of node membership, constructed
> typically from observing node heartbeats.

Actually, it is constructed from observing cman events over the socket.

I see that some fantastical /sys/ filesystem has wormed itself into the 
machinery.  I need to check that this hasn't compromised the basic beauty of 
the event messaging model.

Fencing is a whole nuther issue.  It's sort of unclear how it is actually 
supposed to work, and judging from the number of complaints I see about it on 
mailing lists, it doesn't work very well.  We need to take a good look at 
that.

> Then the nodes communicate to reach consensus on the coordinated
> membership, which will usually be a set of nodes with full N:N
> connectivity (via the cluster messaging mechanism); and they'll also
> usually aim to identify the largest possible set.

Yes.  "Reaching consensus" is signalled to each node by cman sending a 
"finish" event, as in "finish recovering".  (To be sure, this is misleading 
terminology.  We should kill it before it has a chance to reproduce.)

> Eventually, there'll be a membership view which also implies certain
> shared data integrity guarantees if appropriate (ie, fencing in case a
> node didn't go down cleanly, and granting access on a clean join).

Each node's membership view is simply the cumulative state implied by the cman 
events.  Necessarily, this view will suffer some skew across the cluster.  
All cluster algorithms _must_ recognize and accommodate that.  This is where 
barriers come into play, though that mechanism is buried inside cman, and 
each node's view of barrier operations consists of cman events.  (The way 
this is actually implemented smells a little scary to me, but it seems to 
work ok for small numbers of nodes.)
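
To make the "cumulative state implied by the events" idea concrete, here is a 
minimal sketch, not cman's actual interface.  Only the "finish" event name 
comes from the discussion above; the "join" and "leave" event names, the 
MembershipView class and the node names are hypothetical, chosen purely for 
illustration.  It shows two nodes folding the same event stream into a local 
view, with one node lagging by a single event, which is the skew every 
cluster algorithm has to tolerate.

```python
from dataclasses import dataclass, field


@dataclass
class MembershipView:
    """One node's local view, built only from the events it has received."""
    members: set = field(default_factory=set)
    recovering: bool = False

    def apply(self, event: str, node: str = "") -> None:
        # "join" and "leave" are hypothetical event names; "finish" is the
        # event mentioned in the thread ("finish recovering").
        if event == "join":
            self.members.add(node)
            self.recovering = True
        elif event == "leave":
            self.members.discard(node)
            self.recovering = True
        elif event == "finish":
            self.recovering = False
        else:
            raise ValueError(f"unknown event: {event!r}")


# The same ordered event stream is delivered to every node.
events = [
    ("join", "a"),
    ("join", "b"),
    ("join", "c"),
    ("finish", ""),
    ("leave", "b"),
]

view_fast = MembershipView()
view_slow = MembershipView()

for ev in events:
    view_fast.apply(*ev)

# The slow node has not yet seen the last event, so its view is stale.
for ev in events[:-1]:
    view_slow.apply(*ev)

print(sorted(view_fast.members), view_fast.recovering)
print(sorted(view_slow.members), view_slow.recovering)
```

The slow node still believes "b" is a member and that recovery has finished, 
while the fast node already knows otherwise.  Neither view is wrong; they are 
just taken at different points in the same event stream.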

> These steps but the last one usually happen completely internal to the
> membership layer; the last one requires coordination already, because
> the fencing layer itself might need recovery before it can fence
> something after a node failure.

Right, we need to do a lot more work on the fencing interface.  For example, I 
haven't even begun to analyze it from the point of view of memory inversion 
deadlock.  My spider sense tells me there is some of that in there.  Fencing 
is currently done via bash scripts, which alone sucks nearly beyond belief.

> And then there's quorum computation.

Aha!  There is a beautiful solution in the case of ddraid, i.e., any cluster 
with (m of n) redundant shared disks resident on the nodes themselves:

   http://sourceware.org/cluster/ddraid/

For ddraid order 1 and higher, there is no quorum ambiguity at all, because 
you _require_ a quorum of data nodes in order for any node to access the 
cluster filesystem data.  For example, for a five node ddraid distributed 
data cluster, you need four data nodes active or the cluster will only be 
able to sit there stupidly doing nothing.  Four data nodes is therefore the 
quorum group ordained by God.  Non-data nodes can come and go as they please, 
without ever worrying about split brain or other nasty quorum-related 
diseases.
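
The arithmetic behind that claim can be sketched as follows.  This is an 
illustration only, not ddraid code: the function names are made up, and the 
sketch takes the "m of n" figures (four of five) from the example above as 
given rather than deriving them from the ddraid order.  The point is that 
when m is more than half of n, two disjoint partitions can never both hold m 
data nodes.

```python
def has_quorum(active_data_nodes: int, m: int) -> bool:
    """A partition may touch the data only if it holds at least m data nodes."""
    return active_data_nodes >= m


def split_brain_possible(n: int, m: int) -> bool:
    """Two disjoint partitions can both reach m data nodes only if 2*m <= n."""
    return 2 * m <= n


n, m = 5, 4  # the five-node example: four data nodes must be active

for active in range(n + 1):
    print(active, has_quorum(active, m))

print(split_brain_possible(n, m))
```

With n = 5 and m = 4, any partition holding three or fewer data nodes can do 
nothing, and since 4 + 4 > 5 there is no way to carve the cluster into two 
partitions that both qualify.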

> Certainly you could also try looking at it from a membership-centric
> angle, but the piece which coordinates the recovery of the various
> components which makes sure the right kind of membership events are
> delivered in the proper order, and errors during component recovery are
> appropriately handled, is, I think, pretty much distinct from the
> "membership" and a higher level component.

Sorry for the red herring.  Where I wrote "membership" I meant to write 
"cman", that is, cluster management.

> So I'm not sure I'd buy "the membership is what glues it all together"
> on eBay even for a low starting bid.

Though I'm not sure the concept is for sale, your buy-in will be appreciated 
nonetheless, no matter how many limp jokes we need to put up with on the way 
there.

Regards,

Daniel
