From mboxrd@z Thu Jan 1 00:00:00 1970
From: David Teigland
Date: Wed, 5 Aug 2009 11:12:39 -0500
Subject: [Cluster-devel] waiting in init.d/cman
Message-ID: <20090805161239.GA17292@redhat.com>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

Back in the busy days of cluster3 development, I spent a little time
looking at the issue of waiting for quorum (and other waiting/timeouts)
during init.d/cman startup. I wanted to clean up cluster2's somewhat
arbitrary approach and have explicit, intentional behavior around what
each init.d/cman step would wait for and what it wouldn't. Strangely, it
was fence_tool join where all sorts of odd waits/timeouts had been wedged
at various times.

In untangling and fixing, I'm not sure I got it quite right. Current
behavior is that init.d/cman runs through and completes successfully very
quickly without waiting for quorum. This seems nice, because it can be
annoying to have init.d/cman block. In general it works too, it just ends
up delaying the wait for quorum until some cluster-using service starts
later (clvmd, rgmanager, gfs mount).

But, I think it may be best for init.d/cman to wait explicitly for quorum.
It would be clearer what's happening (what's delaying startup), which was
one of the cluster2 problems. So, roughly, init.d/cman would do:

- cman_tool join, print "Joining cluster"
- qdiskd (if configured), print "Starting qdiskd"
- wait for quorum, print "Waiting for quorum"

Any reasons to not do this or do it differently?

Related to this is the broader issue of waiting and timeouts in
init.d/cman. It would be nice to not have timeouts... I think the main
reason for them is that cman has started before the ssh service, so people
could never log in if cman was stuck (we talked about this a while back
and I guess decided we couldn't move cman later in the startup.)

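A minimal sketch of the three-step ordering proposed above, to make the
idea concrete. It is not the real init.d/cman script. The cman_tool
invocations are the ones mentioned in this mail; the QUORUM_TIMEOUT
variable, its 60-second default, the USE_QDISK switch and the
qdisk_configured helper are all hypothetical placeholders, since the
actual timeout value and the way qdisk configuration is detected are left
open here.

```shell
#!/bin/sh
# Sketch only: proposed start ordering for init.d/cman.
# QUORUM_TIMEOUT and USE_QDISK are hypothetical knobs, not real settings.

QUORUM_TIMEOUT="${QUORUM_TIMEOUT:-60}"   # placeholder value, still undecided

qdisk_configured() {
    # Hypothetical check; a real script would look at the cluster config.
    [ "${USE_QDISK:-no}" = "yes" ]
}

start_cman() {
    echo "Joining cluster"
    cman_tool join -w -t 120 || return 1

    if qdisk_configured; then
        echo "Starting qdiskd"
        qdiskd || return 1
    fi

    # The explicit, visible wait: startup blocks here until quorate,
    # and the function fails if the timeout expires first.
    echo "Waiting for quorum"
    cman_tool wait -q -t "$QUORUM_TIMEOUT" || return 1
}
```

Returning non-zero when the quorum wait times out is one possible answer
to the exit-status question; the sketch picks it only because that is what
the other timeouts are believed to do.
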
Here's the startup with each wait/timeout mentioned (steps 3 and 4 apply
only if qdisk is configured).

1. cman_tool join -w -t 120
2. WAIT/120s for join to complete, in cman_tool from the -w -t 120 options
3. qdiskd
4. WAIT/20s for cman to recognize qdisk (?), in init script loop
5. WAIT/??s for quorum, new step probably via cman_tool wait -q -t ??
6. start other daemons
7. fence_tool join -w 20
8. WAIT/20s for fence domain join to complete, in fence_tool from the
   -w 20 option

step 2: there's been some doubt about what join -w actually gives us; at a
minimum, -w may be useful here to catch delayed startup errors from
corosync and to be sure it has started up enough that qdiskd can use it in
step 3. Otherwise, the wait in step 5 seems to obviate the need for
waiting at all in step 2.

step 5: this is the only wait that people will typically notice during
normal operation. Any suggestions on a timeout here? And if it expires,
should init.d/cman exit with a failure? (I believe that's what the other
timeouts cause.)

Dave