From mboxrd@z Thu Jan 1 00:00:00 1970 From: David Teigland Date: Mon, 19 Oct 2009 13:49:37 -0500 Subject: [Cluster-devel] Re: gfs2-utils: master - gfs_controld: Remove three unused functions In-Reply-To: <1255948517.6052.630.camel@localhost.localdomain> References: <20091014145504.9ADCE1201DA@lists.fedorahosted.org> <20091014175350.GC28090@redhat.com> <1255704965.6052.589.camel@localhost.localdomain> <20091016155957.GC23459@redhat.com> <1255708878.6052.590.camel@localhost.localdomain> <20091016163333.GE23459@redhat.com> <1255948517.6052.630.camel@localhost.localdomain> Message-ID: <20091019184936.GC20020@redhat.com> List-Id: To: cluster-devel.redhat.com MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit On Mon, Oct 19, 2009 at 11:35:17AM +0100, Steven Whitehouse wrote: > The question really is why we have all these (apparently) different > ideas of cluster membership. Looking at gfs_controld itself, it uses two > CPGs (one for all gfs_controlds which seems to only be used in > negotiating the protocol, of which there seems to be only one anyway, > and the other on a per mount group basis), each of which have their own > idea of which cluster members exist. > > So I guess one question I have is, can we be certain the the "per mount > group" CPG will always have a membership which is a subset of the "all > gfs_controlds" CPG? Will the sequencing of delivery of membership events > by synchronised between the two CPGs wrt to other events (i.e. message > delivery) ? I guess that question might be more in Steve Dake's line, so > I've cc'd him too. No, agreed event order is per-cpg. > Given all that, what is the relationship which the membership events > reported in this new quorum_callback() have with the above? This is a good point, they are technically independent, so referencing the quorum events from the context of the cpg events is not perfect in theory. See below. > From my earlier investigations, it appeared that dlm_controld was in > charge of ensuring quorum was attained before fencing took place, so I'm > not quite sure why that should affect gfs_controld directly. libquorum is being used as a general cluster membership api, not for the quorum info. i.e. the membership from libquorum is the node-level membership of the whole cluster, i.e. the up/down states of nodes as a whole. (The other basic, general, node-level "cluster membership" api is CLM which is an ais service and not as nice to use. libquorum works great for general membership if you don't get hung up on the name "quorum" which you can choose to use or not.) > The thing that I've not quite figured out yet is why we need to record > the times at all. My understanding of corosync is that it gives us a > guaranteed ordering of events, so that I'd expect to see a sequence > number rather than an actual timestamp. That is always assuming that the > timestamp isn't just being used as a monotonic sequence number, of > course. We don't really need all this libquorum code, or the cluster add/remove times, or the referencing of libquorum events from cpg events, or a bunch of other stuff. Everything works great without any of that, except for one tiny, complicated, painful exception. Corosync/totem are built to provide "extended virtual synchrony", and the one key difference between that and plain old "virtual synchrony" is that EVS extends VS to do merges of cluster partitions. That creates a problem for us, because we want VS, not EVS; we don't *want* partitions to merge, and it creates a big problem for us when corosync/totem goes and happily merges partitions thinking it's doing us a favor :-/ It's a problem because when a node partitions, the only way for us to merge it again is by resyncing/resetting its state, which requires that all previous state first be cleared away [1]. In the past (cluster1), we would not let a node join the cluster if it wasn't in a clean state, and that meant we could simply resync any node that joined. With aisexec or corosync, a node will rejoin the cluster in a dirty state if a partition merges, so before syncing it we need to figure out if all old state has been cleared away. If not, we need to clear the old before resyncing the new (and keep the old stale state from mucking with the current state). Figuring out that a new node has old state and then clearing it is messy and involves various hacks. I'm keenly interested in finding nicer ways to do this, and spend a lot of time thinking about it and working on it. We're making progress... In cluster2 we dealt with it using cman dirty and disallowed flags, in cluster3 we're moving away from putting magic in cman, so we're detecting and resolving the problems closer to the source. In cluster4 (gfs2-utils.git) I'm hoping we can take another step or two. At the end of the day we have to release working code (these situations really happen), so you have to settle for solutions that are not ideal but hopefully better than before. One of the things that make this difficult to even work around is that corosync/cpg don't provide a global generation number on events, https://lists.linux-foundation.org/pipermail/openais/2009-August/012970.html That's one feature that I hope may improves things a lot in cluster4. The specific hacks in cluster3 to deal with this are what you've stumbled upon. They've come up largely from testing merges and fixing what goes wrong. Corosync and openais have a fair amount of undefined behavior themselves in this area which doesn't help. https://lists.linux-foundation.org/pipermail/openais/2009-July/012705.html One problem was getting a message from a node and not knowing which of two identical events it referred to. After many failed experiments, I ended up using the cluster add time from libquorum... it's as ugly as any of the other hacks, and may not be perfect, but that's the nature of this problem. There's still more work to do in this area, so it may yet change https://bugzilla.redhat.com/show_bug.cgi?id=529424 [1] The only way to "reset" to a clean state is to "fail" and restart the system. The cluster state extends into dlm, gfs, the kernel and random apps using any of that. Dave