From mboxrd@z Thu Jan 1 00:00:00 1970
From: David Teigland
Date: Mon, 19 Oct 2009 13:49:37 -0500
Subject: [Cluster-devel] Re: gfs2-utils: master - gfs_controld: Remove three unused functions
In-Reply-To: <1255948517.6052.630.camel@localhost.localdomain>
References: <20091014145504.9ADCE1201DA@lists.fedorahosted.org>
	<20091014175350.GC28090@redhat.com>
	<1255704965.6052.589.camel@localhost.localdomain>
	<20091016155957.GC23459@redhat.com>
	<1255708878.6052.590.camel@localhost.localdomain>
	<20091016163333.GE23459@redhat.com>
	<1255948517.6052.630.camel@localhost.localdomain>
Message-ID: <20091019184936.GC20020@redhat.com>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

On Mon, Oct 19, 2009 at 11:35:17AM +0100, Steven Whitehouse wrote:
> The question really is why we have all these (apparently) different
> ideas of cluster membership. Looking at gfs_controld itself, it uses
> two CPGs (one for all gfs_controlds, which seems to only be used in
> negotiating the protocol, of which there seems to be only one anyway,
> and the other on a per mount group basis), each of which has its own
> idea of which cluster members exist.
>
> So I guess one question I have is, can we be certain that the "per
> mount group" CPG will always have a membership which is a subset of
> the "all gfs_controlds" CPG? Will the sequencing of delivery of
> membership events be synchronised between the two CPGs wrt other
> events (i.e. message delivery)? I guess that question might be more
> in Steve Dake's line, so I've cc'd him too.

No, agreed event order is per-cpg.

> Given all that, what is the relationship which the membership events
> reported in this new quorum_callback() have with the above?

This is a good point: they are technically independent, so referencing
the quorum events from the context of the cpg events is not perfect in
theory.  See below.

> From my earlier investigations, it appeared that dlm_controld was in
> charge of ensuring quorum was attained before fencing took place, so
> I'm not quite sure why that should affect gfs_controld directly.

libquorum is being used as a general cluster membership api, not for
the quorum info.  i.e. the membership from libquorum is the node-level
membership of the whole cluster, i.e. the up/down states of nodes as a
whole.  (The other basic, general, node-level "cluster membership" api
is CLM, which is an ais service and not as nice to use.  libquorum
works great for general membership if you don't get hung up on the
name "quorum", which you can choose to use or not.)

> The thing that I've not quite figured out yet is why we need to
> record the times at all. My understanding of corosync is that it
> gives us a guaranteed ordering of events, so that I'd expect to see a
> sequence number rather than an actual timestamp. That is always
> assuming that the timestamp isn't just being used as a monotonic
> sequence number, of course.

We don't really need all this libquorum code, or the cluster
add/remove times, or the referencing of libquorum events from cpg
events, or a bunch of other stuff.  Everything works great without any
of that, except for one tiny, complicated, painful exception.

Corosync/totem are built to provide "extended virtual synchrony", and
the one key difference between that and plain old "virtual synchrony"
is that EVS extends VS to do merges of cluster partitions.
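As an aside, since the libquorum usage keeps coming up, here is a
rough sketch of the pattern I mean, written from memory for this mail
rather than copied from the tree, so treat the exact names and
signatures from corosync/quorum.h as approximate and the add-time
handling as purely illustrative:

#include <stdint.h>
#include <time.h>
#include <corosync/corotypes.h>
#include <corosync/quorum.h>

#define MAX_NODES 256

struct node {
        uint32_t nodeid;
        int member;
        time_t added_time;      /* "cluster add time" */
};

static struct node nodes[MAX_NODES];
static quorum_handle_t qh;

static struct node *get_node(uint32_t nodeid)
{
        int i, empty = -1;

        for (i = 0; i < MAX_NODES; i++) {
                if (nodes[i].nodeid == nodeid)
                        return &nodes[i];
                if (!nodes[i].nodeid && empty < 0)
                        empty = i;
        }
        if (empty < 0)
                return NULL;
        nodes[empty].nodeid = nodeid;
        return &nodes[empty];
}

/* node-level membership callback; the quorate value is ignored */
static void quorum_callback(quorum_handle_t handle, uint32_t quorate,
                            uint64_t ring_seq,
                            uint32_t node_list_entries,
                            uint32_t *node_list)
{
        struct node *n;
        uint32_t i;
        int j, in_list;

        /* nodes missing from the new list have gone down */
        for (j = 0; j < MAX_NODES; j++) {
                if (!nodes[j].nodeid || !nodes[j].member)
                        continue;
                in_list = 0;
                for (i = 0; i < node_list_entries; i++) {
                        if (node_list[i] == nodes[j].nodeid)
                                in_list = 1;
                }
                if (!in_list)
                        nodes[j].member = 0;
        }

        /* nodes new to the list have come up; remember when */
        for (i = 0; i < node_list_entries; i++) {
                n = get_node(node_list[i]);
                if (!n || n->member)
                        continue;
                n->member = 1;
                n->added_time = time(NULL);
        }
}

static quorum_callbacks_t callbacks = {
        .quorum_notify_fn = quorum_callback,
};

int setup_cluster(void)
{
        cs_error_t err;
        int fd;

        /* note: quorum_initialize() grows a third quorum_type
           argument in later corosync versions */
        err = quorum_initialize(&qh, &callbacks);
        if (err != CS_OK)
                return -1;

        err = quorum_trackstart(qh, CS_TRACK_CHANGES);
        if (err != CS_OK) {
                quorum_finalize(qh);
                return -1;
        }

        quorum_fd_get(qh, &fd);
        return fd;  /* poll fd, then quorum_dispatch(qh, CS_DISPATCH_ALL) */
}

The quorate argument is simply ignored; node_list is treated as the
node-level membership of the whole cluster, and added_time is the sort
of "cluster add time" that the cpg code can later refer back to.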
That partition merging creates a problem for us, because we want VS,
not EVS; we don't *want* partitions to merge, and it creates a big
problem for us when corosync/totem goes and happily merges partitions
thinking it's doing us a favor :-/

It's a problem because when a node partitions, the only way for us to
merge it again is by resyncing/resetting its state, which requires
that all previous state first be cleared away [1].

In the past (cluster1), we would not let a node join the cluster if it
wasn't in a clean state, and that meant we could simply resync any
node that joined.  With aisexec or corosync, a node will rejoin the
cluster in a dirty state if a partition merges, so before syncing it
we need to figure out whether all of its old state has been cleared
away.  If not, we need to clear the old state before resyncing the new
(and keep the old stale state from mucking with the current state).

Figuring out that a new node has old state and then clearing it is
messy and involves various hacks.  I'm keenly interested in finding
nicer ways to do this, and spend a lot of time thinking about it and
working on it.  We're making progress... In cluster2 we dealt with it
using the cman dirty and disallowed flags; in cluster3 we're moving
away from putting magic in cman, so we're detecting and resolving the
problems closer to the source.  In cluster4 (gfs2-utils.git) I'm
hoping we can take another step or two.  At the end of the day we have
to release working code (these situations really happen), so you have
to settle for solutions that are not ideal but hopefully better than
before.

One of the things that makes this difficult to even work around is
that corosync/cpg don't provide a global generation number on events:
https://lists.linux-foundation.org/pipermail/openais/2009-August/012970.html
That's one feature that I hope may improve things a lot in cluster4.

The specific hacks in cluster3 to deal with this are what you've
stumbled upon.  They've come up largely from testing merges and fixing
what goes wrong.  Corosync and openais have a fair amount of undefined
behavior themselves in this area, which doesn't help:
https://lists.linux-foundation.org/pipermail/openais/2009-July/012705.html

One problem was getting a message from a node and not knowing which of
two identical events it referred to.  After many failed experiments, I
ended up using the cluster add time from libquorum... it's as ugly as
any of the other hacks, and may not be perfect, but that's the nature
of this problem.  There's still more work to do in this area, so it
may yet change:
https://bugzilla.redhat.com/show_bug.cgi?id=529424

[1] The only way to "reset" to a clean state is to "fail" and restart
the system.  The cluster state extends into dlm, gfs, the kernel, and
random apps using any of that.

Dave