cluster-devel.redhat.com archive mirror
From: David Teigland <teigland@redhat.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination
Date: Mon, 19 Dec 2011 12:47:38 -0500	[thread overview]
Message-ID: <20111219174738.GA24652@redhat.com> (raw)
In-Reply-To: <1324300058.2723.47.camel@menhir>

On Mon, Dec 19, 2011 at 01:07:38PM +0000, Steven Whitehouse wrote:
> >  struct lm_lockstruct {
> >  	int ls_jid;
> >  	unsigned int ls_first;
> > -	unsigned int ls_first_done;
> >  	unsigned int ls_nodir;
> Since ls_flags and ls_first are also only boolean flags, they could
> potentially be moved into the flags, though we can always do that later.

yes, I can use a flag in place of ls_first.

> > +	int ls_recover_jid_done; /* read by gfs_controld */
> > +	int ls_recover_jid_status; /* read by gfs_controld */
>                                           ^^^^^^^^^^^ this isn't
> actually true any more. All recent gfs_controld versions take their cue
> from the uevents, so this is here only for backwards compatibility
> reasons and these two will be removed at some future date.

I'll add a longer comment saying something like that.

> > +	/*
> > +	 * Other nodes need to do some work in dlm recovery and gfs2_control
> > +	 * before the recover_done and control_lock will be ready for us below.
> > +	 * A delay here is not required but often avoids having to retry.
> > +	 */
> > +
> > +	msleep(500);
> Can we get rid of this then? I'd rather just wait for the lock, rather
> than adding delays of arbitrary time periods into the code.

I dislike arbitrary delays also, so I'm hesitant to add them.
The choices here are:
- removing NOQUEUE from the requests below, but with NOQUEUE you have a
  much better chance of killing a mount command, which is a fairly nice
  feature, I think.
- removing the delay, which results in nodes often doing fast+repeated
  lock attempts, which could get rather excessive.  I'd be worried about
  having that kind of unlimited loop sitting there.
- using some kind of delay.

While I don't like the look of the delay, I like the other options less.
Do you have a preference, or any other ideas?


> > +static int control_first_done(struct gfs2_sbd *sdp)
> > +{
> > +	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
> > +	char lvb_bits[GDLM_LVB_SIZE];
> > +	uint32_t start_gen, block_gen;
> > +	int error;
> > +
> > +restart:
> > +	spin_lock(&ls->ls_recover_spin);
> > +	start_gen = ls->ls_recover_start;
> > +	block_gen = ls->ls_recover_block;
> > +
> > +	if (test_bit(DFL_BLOCK_LOCKS, &ls->ls_recover_flags) ||
> > +	    !test_bit(DFL_MOUNT_DONE, &ls->ls_recover_flags) ||
> > +	    !test_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags)) {
> > +		/* sanity check, should not happen */
> > +		fs_err(sdp, "control_first_done start %u block %u flags %lx\n",
> > +		       start_gen, block_gen, ls->ls_recover_flags);
> > +		spin_unlock(&ls->ls_recover_spin);
> > +		control_unlock(sdp);
> > +		return -1;
> > +	}
> > +
> > +	if (start_gen == block_gen) {
> > +		/*
> > +		 * Wait for the end of a dlm recovery cycle to switch from
> > +		 * first mounter recovery.  We can ignore any recover_slot
> > +		 * callbacks between the recover_prep and next recover_done
> > +		 * because we are still the first mounter and any failed nodes
> > +		 * have not fully mounted, so they don't need recovery.
> > +		 */
> > +		spin_unlock(&ls->ls_recover_spin);
> > +		fs_info(sdp, "control_first_done wait gen %u\n", start_gen);
> > +		msleep(500);
> Again - I don't want to add arbitrary delays into the code. Why is this
> waiting for half a second? Why not some other length of time? We should
> figure out how to wait for the end of the first mounter recovery some
> other way if that is what is required.

This msleep only slows down a rare loop, so it wakes up a couple of times
rather than once as a proper wait mechanism would.  It's waiting for the
next recover_done() callback, which the dlm will call when it is done with
recovery.  We do have the option here of using a standard wait mechanism,
wait_on_bit() or something.  I'll see if any of those would work here
without adding too much to the code.


> > +static void gdlm_recovery_result(struct gfs2_sbd *sdp, unsigned int jid,
> > +				 unsigned int result)
> > +{
> > +	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
> > +
> > +	/* don't care about the recovery of own journal during mount */
> > +	if (jid == ls->ls_jid)
> > +		return;
> > +
> > +	/* another node is recovering the journal, give it a chance to
> > +	   finish before trying again */
> > +	if (result == LM_RD_GAVEUP)
> > +		msleep(1000);
> Again, lets put in a proper wait for this condition. If the issue is one
> of races between cluster nodes (thundering herd type problem), then we
> might need some kind of back off, but in that case, it should probably
> be for a random time period.

In this case, while one node is recovering a journal, the other nodes will
all try to recover the same journal (and fail), as quickly as they can.  I
looked at using queue_delayed_work here, but couldn't tell if that was ok
with zero delay... I now see others use 0, so I'll try it.


> > +	error = dlm_new_lockspace(fsname, cluster, flags, GDLM_LVB_SIZE,
> > +				  &ops, &ls->ls_dlm);
> > +
> > +	if (error == -EOPNOTSUPP) {
> > +		/*
> > +		 * dlm does not support ops callbacks,
> > +		 * old dlm_controld/gfs_controld are used, try without ops.
> > +		 */
> > +		fs_info(sdp, "dlm lockspace ops not used %d\n", error);
> > +		free_recover_size(ls);
> > +
> > +		error = dlm_new_lockspace(fsname, cluster, flags, GDLM_LVB_SIZE,
> > +					  NULL, &ls->ls_dlm);
> > +		if (error)
> > +			fs_err(sdp, "dlm_new_lockspace error %d\n", error);
> > +		return error;
> > +	}
> > +
> Hmm. This is a bit complicated. Can't we just make it return 0 anyway?
> If we do need to know whether the dlm supports the recovery ops, then
> lets just make it signal that somehow (e.g. returns 1 so that >= 0 means
> success and -ve means error). It doesn't matter if we don't call
> free_recover_size until umount time I think, even if the dlm doesn't
> support that since the data structures are fairly small.

I went with this because I thought it was simpler than adding a second
return value for the ops status.  It would also let us simply drop the
special case in the future.  The alternative is:

int dlm_new_lockspace(const char *name, const char *cluster,
                      uint32_t flags, int lvblen,
                      struct dlm_lockspace_ops *ops, void *ops_arg,
                      int *ops_error, dlm_lockspace_t **lockspace);

I'm willing to try that if you think it's clearer.

Dave



Thread overview: 24+ messages
2011-12-16 22:03 [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination David Teigland
2011-12-19 13:07 ` Steven Whitehouse
2011-12-19 17:47   ` David Teigland [this message]
2011-12-20 10:39     ` Steven Whitehouse
2011-12-20 19:16       ` David Teigland
2011-12-20 21:04         ` David Teigland
2011-12-21 10:45           ` Steven Whitehouse
2011-12-21 15:40             ` David Teigland
2011-12-22 21:23     ` David Teigland
2011-12-23  9:19       ` Steven Whitehouse
2011-12-19 15:17 ` Steven Whitehouse
2012-01-05 15:08 ` Bob Peterson
2012-01-05 15:21   ` David Teigland
2012-01-05 15:40     ` Steven Whitehouse
2012-01-05 16:16       ` David Teigland
2012-01-05 16:45 ` Bob Peterson
  -- strict thread matches above, loose matches on Subject: below --
2012-01-05 16:46 David Teigland
2012-01-05 16:58 ` Steven Whitehouse
2012-01-05 17:13   ` David Teigland
2012-01-09 16:36 ` Steven Whitehouse
2012-01-09 16:46   ` David Teigland
2012-01-09 17:00     ` David Teigland
2012-01-09 17:04       ` Steven Whitehouse
2012-01-09 17:02     ` Steven Whitehouse
