From: Lon H. Hohberger
Date: Thu, 12 Nov 2009 12:50:11 -0500
Subject: [Cluster-devel] unfence during startup
In-Reply-To: <20091106172756.GA22183@redhat.com>
References: <20091106172756.GA22183@redhat.com>
Message-ID: <1258048211.2601.8.camel@localhost>
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

On Fri, 2009-11-06 at 11:27 -0600, David Teigland wrote:
> The current init.d/cman startup sequence is:
>
>     start_cman
>     unfence_self
>     start_qdiskd
>     wait_for_quorum
>     start_fenced
>     start_dlm_controld
>     start_gfs_controld
>     join_fence_domain
>
> I believe the reason we put unfence between cman and qdisk was in case
> the qdisk was on a fenced device.  But I'd forgotten about the more
> critical case where someone runs 'service cman start' on a node after
> it has been kicked out of the cluster and has been fenced (via
> fence_scsi).  This is not too uncommon for someone to try -- they
> think they can just restart the cluster on the node without first
> rebooting.  We go to a lot of trouble in fenced and other daemons to
> recognize when someone does that and to shut things down again before
> getting far enough to corrupt storage.
>
> Obviously, unfencing right at the beginning undercuts all those checks
> and precautions, and could easily lead to corrupt storage.  So we need
> to move unfence to just before the join_fence_domain step.  Requiring
> a qdisk to use a disk not subject to fencing shouldn't be too onerous?

It shouldn't matter -- it's what we require today with fence_scsi.

Alternatively, we can make qdiskd check for this sort of thing as well.
It might be more trouble than it's worth, but qdiskd already has a
'stop_cman' flag which will kill cman if qdiskd detects a critical
error (e.g. trying to rejoin a cluster...)

-- Lon
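
For concreteness, a minimal sketch of the reordered start path David is
proposing, with unfence moved to just before joining the fence domain.
The function names are the ones from the list quoted above; the error
handling is only assumed, not what the actual init script does:

    start()
    {
        start_cman || return 1
        start_qdiskd || return 1
        wait_for_quorum || return 1
        start_fenced || return 1
        start_dlm_controld || return 1
        start_gfs_controld || return 1
        # Unfence only after fenced and the other daemons have had a
        # chance to notice a previously-fenced node trying to rejoin
        # without a reboot.
        unfence_self || return 1
        join_fence_domain || return 1
    }

That way a node that was fenced out and then has 'service cman start'
run on it hits the daemons' sanity checks before it re-enables its own
access to shared storage.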
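
For anyone wondering where the stop_cman flag lives: it is a quorumd
attribute in cluster.conf (see qdisk(5)).  A minimal illustration --
the interval/tko/label values here are invented for the example, only
stop_cman="1" is the point:

    <quorumd interval="1" tko="10" label="qdisk" stop_cman="1"/>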