* [Cluster-devel] waiting in init.d/cman
@ 2009-08-05 16:12 David Teigland
2009-08-05 17:25 ` Fabio M. Di Nitto
0 siblings, 1 reply; 6+ messages in thread
From: David Teigland @ 2009-08-05 16:12 UTC
To: cluster-devel.redhat.com
Back in the busy days of cluster3 development, I spent a little time looking
at the issue of waiting for quorum (and other waiting/timeouts) during
init.d/cman startup.
I wanted to clean up cluster2's somewhat arbitrary approach and have explicit,
intentional behavior around what each init.d/cman step would wait for and what
it wouldn't. Strangely, it was fence_tool join where all sorts of odd
waits/timeouts had been wedged at various times.
In untangling and fixing, I'm not sure I got it quite right. Current behavior
is that init.d/cman runs through and completes successfully very quickly
without waiting for quorum. This seems nice, because it can be annoying to
have init.d/cman block. In general it works too; it just ends up delaying the
wait for quorum until some cluster-using service starts later (clvmd,
rgmanager, gfs mount).
But, I think it may be best for init.d/cman to wait explicitly for quorum. It
would be clearer what's happening (what's delaying startup), which was one of
the cluster2 problems. So, roughly, init.d/cman would do:
- cman_tool join, print "Joining cluster"
- qdiskd (if configured), print "Starting qdiskd"
- wait for quorum, print "Waiting for quorum"
Any reasons to not do this or do it differently?
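
A minimal sketch of the three steps listed above, assuming a POSIX shell. The
variables CMAN_TOOL, QDISKD, USE_QDISKD and QUORUM_TIMEOUT are hypothetical,
introduced only for illustration; they are not taken from the real
init.d/cman. The cman_tool options are the ones mentioned in this thread.

```shell
# Sketch only: not the shipped init.d/cman.
# All variable names below are hypothetical knobs for illustration.
CMAN_TOOL=${CMAN_TOOL:-cman_tool}
QDISKD=${QDISKD:-qdiskd}
QUORUM_TIMEOUT=${QUORUM_TIMEOUT:-20}

start_cman() {
    echo "Joining cluster"
    "$CMAN_TOOL" join -w -t 120 || return 1

    # USE_QDISKD stands in for "qdisk is configured".
    if [ -n "${USE_QDISKD:-}" ]; then
        echo "Starting qdiskd"
        "$QDISKD" || return 1
    fi

    # The message makes it visible what is delaying startup.
    echo "Waiting for quorum"
    "$CMAN_TOOL" wait -q -t "$QUORUM_TIMEOUT" || return 1
}
```

Each step returns non-zero on failure, so a caller can map any failed step to
a failed start.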
Related to this is the broader issue of waiting and timeouts in init.d/cman.
It would be nice to not have timeouts... I think the main reason for them is
that cman has started before the ssh service, so people could never log in if
cman was stuck (we talked about this a while back and I guess decided we
couldn't move cman later in the startup.)
Here's the startup with each wait/timeout mentioned (steps 3,4 only if qdisk
is configured.)
1. cman_tool join -w -t 120
2. WAIT/120s for join to complete, in cman_tool from the -w -t 120 options
3. qdiskd
4. WAIT/20s for cman to recognize qdisk (?), in init script loop
5. WAIT/??s for quorum, new step probably via cman_tool wait -q -t ??
6. start other daemons
7. fence_tool join -w 20
8. WAIT/20s for fence domain join to complete, in fence_tool from -w 20 option
step 2: there's been some doubt about what join -w actually gives us; at a
minimum -w may be useful here to catch delayed startup errors from corosync
and to be sure it's started up enough that qdiskd can use it in step 3.
Otherwise, the wait in step 5 seems to obviate the need for waiting at all in
step 2.
step 5: this is the only wait that people will typically notice during normal
operation. Any suggestions on a timeout here? And if it expires should
init.d/cman exit with a failure? (I believe that's what other timeouts
cause.)
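
The "init script loop" in step 4 is a bounded poll. A rough sketch of such a
loop follows, assuming a POSIX shell; wait_for is a hypothetical helper name,
and the command it polls is left to the caller because the actual check used
by init.d/cman is not shown in this thread.

```shell
# wait_for TRIES CMD [ARGS...]
# Run CMD up to TRIES times, one second apart, and return 0 as soon as it
# succeeds. Return 1 if every attempt fails (the timeout case).
wait_for() {
    tries=$1
    shift
    while [ "$tries" -gt 0 ]; do
        "$@" && return 0
        tries=$((tries - 1))
        # Do not sleep after the final failed attempt.
        [ "$tries" -gt 0 ] && sleep 1
    done
    return 1
}
```

For step 4 this might be called as "wait_for 20 qdisk_seen_by_cman || exit 1",
where qdisk_seen_by_cman is a placeholder for whatever check the script uses.
The non-zero return on expiry is what lets the init script exit with a failure.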
Dave
* [Cluster-devel] waiting in init.d/cman
2009-08-05 16:12 [Cluster-devel] waiting in init.d/cman David Teigland
@ 2009-08-05 17:25 ` Fabio M. Di Nitto
2009-08-05 18:20 ` David Teigland
0 siblings, 1 reply; 6+ messages in thread
From: Fabio M. Di Nitto @ 2009-08-05 17:25 UTC
To: cluster-devel.redhat.com
Hi David,
On Wed, 2009-08-05 at 11:12 -0500, David Teigland wrote:
> But, I think it may be best for init.d/cman to wait explicitly for quorum.
I agree but it has to be optional with default to not wait for quorum.
> It
> would be clearer what's happening (what's delaying startup), which was one of
> the cluster2 problems. So, roughly, init.d/cman would do:
>
> - cman_tool join, print "Joining cluster"
> - qdiskd (if configured), print "Starting qdiskd"
> - wait for quorum, print "Waiting for quorum"
>
> Any reasons to not do this or do it differently?
I can see the possibility to block the boot for quorum when quorum might
never be available. As above, I don't mind to add that to the init
script, but it will need yet another timeout.
> Related to this is the broader issue of waiting and timeouts in init.d/cman.
> It would be nice to not have timeouts... I think the main reason for them is
> that cman has started before the ssh service, so people could never log in if
> cman was stuck (we talked about this a while back and I guess decided we
> couldn't move cman later in the startup.)
>
> Here's the startup with each wait/timeout mentioned (steps 3,4 only if qdisk
> is configured.)
>
> 1. cman_tool join -w -t 120
> 2. WAIT/120s for join to complete, in cman_tool from the -w -t 120 options
This is configurable so we could probably lower it a bit, but it needs
to be there. The cman_tool -> corosync startup is complex and takes
time. There is no exact moment when it finishes.
> 3. qdiskd
> 4. WAIT/20s for cman to recognize qdisk (?), in init script loop
Yes, that is correct. cman will see qdisk only after qdisk has completed
its init on disk. We wait as it can also guarantee quorum for that
node.
> 5. WAIT/??s for quorum, new step probably via cman_tool wait -q -t ??
+1 from me if everybody else agrees. In general this timeout would be
hit only when the whole cluster is booting for the first time ever.
Otherwise a single-node reboot won't even see this.
> 6. start other daemons
> 7. fence_tool join -w 20
> 8. WAIT/20s for fence domain join to complete, in fence_tool from -w 20 option
>
> step 2: there's been some doubt about what join -w actually gives us; at a
> minimum -w may be useful here to catch delayed startup errors from corosync
> and to be sure it's started up enough that qdiskd can use it in step 3.
> Otherwise, the wait in step 5 seems to obviate the need for waiting at all in
> step 2.
qdisk doesn't care if cman is not there. It will run and wait for cman
to appear.
>
> step 5: this is the only wait that people will typically notice during normal
> operation. Any suggestions on a timeout here? And if it expires should
> init.d/cman exit with a failure? (I believe that's what other timeouts
> cause.)
I'd say 20 seconds again? It seems reasonable to me.
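
A sketch of how an optional quorum wait with a 20 second default could look,
assuming a POSIX shell. CMAN_QUORUM_WAIT and CMAN_QUORUM_TIMEOUT are
hypothetical setting names chosen for this example; the cman_tool options are
the ones proposed for step 5.

```shell
# Sketch only. Setting names are hypothetical.
CMAN_TOOL=${CMAN_TOOL:-cman_tool}

wait_for_quorum() {
    # Default: do not wait unless the admin opted in.
    [ -n "${CMAN_QUORUM_WAIT:-}" ] || return 0

    echo "Waiting for quorum"
    # Returns cman_tool's exit status, so an expired timeout is a failure.
    "$CMAN_TOOL" wait -q -t "${CMAN_QUORUM_TIMEOUT:-20}"
}
```

With the setting unset the function is a no-op, which keeps the current
non-blocking behavior as the default.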
Fabio
* [Cluster-devel] waiting in init.d/cman
2009-08-05 17:25 ` Fabio M. Di Nitto
@ 2009-08-05 18:20 ` David Teigland
2009-08-05 19:32 ` Fabio M. Di Nitto
0 siblings, 1 reply; 6+ messages in thread
From: David Teigland @ 2009-08-05 18:20 UTC
To: cluster-devel.redhat.com
On Wed, Aug 05, 2009 at 07:25:53PM +0200, Fabio M. Di Nitto wrote:
> I can see the possibility to block the boot for quorum when quorum might
> never be available. As above, I don't mind to add that to the init
> script, but it will need yet another timeout.
Sure, but as I mentioned, if cman doesn't wait for quorum, then clvmd,
rgmanager or gfs mount will... and those don't time out and sometimes can't be
cancelled, whereas cman_tool wait -q can be.
The no wait option is good, it could make sense in some cases, like when no
other init scripts follow cman that depend on quorum.
Dave
* [Cluster-devel] waiting in init.d/cman
2009-08-05 18:20 ` David Teigland
@ 2009-08-05 19:32 ` Fabio M. Di Nitto
2009-08-05 19:43 ` Bob Peterson
2009-08-06 16:05 ` David Teigland
0 siblings, 2 replies; 6+ messages in thread
From: Fabio M. Di Nitto @ 2009-08-05 19:32 UTC
To: cluster-devel.redhat.com
On Wed, 2009-08-05 at 13:20 -0500, David Teigland wrote:
> On Wed, Aug 05, 2009 at 07:25:53PM +0200, Fabio M. Di Nitto wrote:
> > I can see the possibility to block the boot for quorum when quorum might
> > never be available. As above, I don't mind to add that to the init
> > script, but it will need yet another timeout.
>
> Sure, but as I mentioned, if cman doesn't wait for quorum, then clvmd,
> rgmanager or gfs mount will... and those don't time out and sometimes can't be
> cancelled, whereas cman_tool wait -q can be.
During the boot process you can't issue ctrl+c no matter what. IIRC
somebody suggested to use boot options. Perhaps forcing a wait for
quorum in our init script is sensible if we allow a boot option to not
run cman at all. The other daemons will fail if cman is not there and
boot would be "unblocked".
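
A rough sketch of the boot-option idea, assuming a POSIX shell on Linux. The
option name "nocluster" and the function name are assumptions made for this
example; the thread does not name an option.

```shell
# Return 0 if the (hypothetical) "nocluster" word is on the kernel command
# line. The file is a parameter so the check can be tried against a test
# file instead of /proc/cmdline.
cluster_disabled_at_boot() {
    cmdline_file=${1:-/proc/cmdline}
    [ -r "$cmdline_file" ] || return 1
    # Split the command line into one word per line, then match exactly.
    tr -s ' \t' '\n\n' < "$cmdline_file" | grep -qx 'nocluster'
}
```

An init script could call this first and skip starting cman when it returns
0, so that a node stuck waiting for quorum can be rebooted past the wait.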
> The no wait option is good, it could make sense in some cases, like when no
> other init scripts follow cman that depend on quorum.
It's a rare case since our init script itself fires up different daemons
automatically. A bit offtopic but perhaps it might be the case to revive
the idea that mount.gfs2 should spawn gfs_controld if required and not
running and libdlm would spawn dlm_controld.
Fabio
* [Cluster-devel] waiting in init.d/cman
2009-08-05 19:32 ` Fabio M. Di Nitto
@ 2009-08-05 19:43 ` Bob Peterson
2009-08-06 16:05 ` David Teigland
1 sibling, 0 replies; 6+ messages in thread
From: Bob Peterson @ 2009-08-05 19:43 UTC
To: cluster-devel.redhat.com
----- "Fabio M. Di Nitto" <fdinitto@redhat.com> wrote:
| automatically. A bit offtopic but perhaps it might be the case to
| revive
| the idea that mount.gfs2 should spawn gfs_controld if required and
| not
| running and libdlm would spawn dlm_controld.
|
| Fabio
I seem to recall something about getting rid of the mount.gfs2 helper
altogether upstream.
Bob Peterson
* [Cluster-devel] waiting in init.d/cman
2009-08-05 19:32 ` Fabio M. Di Nitto
2009-08-05 19:43 ` Bob Peterson
@ 2009-08-06 16:05 ` David Teigland
1 sibling, 0 replies; 6+ messages in thread
From: David Teigland @ 2009-08-06 16:05 UTC
To: cluster-devel.redhat.com
On Wed, Aug 05, 2009 at 09:32:04PM +0200, Fabio M. Di Nitto wrote:
> On Wed, 2009-08-05 at 13:20 -0500, David Teigland wrote:
> > On Wed, Aug 05, 2009 at 07:25:53PM +0200, Fabio M. Di Nitto wrote:
> > > I can see the possibility to block the boot for quorum when quorum might
> > > never be available. As above, I don't mind to add that to the init
> > > script, but it will need yet another timeout.
> >
> > Sure, but as I mentioned, if cman doesn't wait for quorum, then clvmd,
> > rgmanager or gfs mount will... and those don't time out and sometimes
> > can't be cancelled, whereas cman_tool wait -q can be.
>
> During the boot process you can't issue ctrl+c no matter what. IIRC somebody
> suggested to use boot options. Perhaps forcing a wait for quorum in our init
> script is sensible if we allow a boot option to not run cman at all. The
> other daemons will fail if cman is not there and boot would be "unblocked".
During 'service cman start' you could cancel via ctrl+c if it's blocked on
quorum, but during 'service gfs start' you may not be able to cancel if it's
blocked on quorum.
During boot init.d/cman you can cancel via timeout if it's blocked on quorum,
but during boot init.d/gfs you may not be able to cancel via timeout if it's
blocked on quorum.
So, I think we end up with the best option being our current approach: do all
waiting with timeouts in cman. This gives us the best options for boot start
and manual start.
Dave