From: David Teigland
Date: Mon, 19 Dec 2011 12:47:38 -0500
Subject: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination
In-Reply-To: <1324300058.2723.47.camel@menhir>
References: <1324072991-30729-6-git-send-email-teigland@redhat.com> <1324300058.2723.47.camel@menhir>
Message-ID: <20111219174738.GA24652@redhat.com>
To: cluster-devel.redhat.com

On Mon, Dec 19, 2011 at 01:07:38PM +0000, Steven Whitehouse wrote:
> >  struct lm_lockstruct {
> >  	int ls_jid;
> >  	unsigned int ls_first;
> > -	unsigned int ls_first_done;
> >  	unsigned int ls_nodir;
> Since ls_flags and ls_first are also only boolean flags, they could
> potentially be moved into the flags, though we can always do that later.

Yes, I can use a flag in place of ls_first.

> > +	int ls_recover_jid_done; /* read by gfs_controld */
> > +	int ls_recover_jid_status; /* read by gfs_controld */
>                                                 ^^^^^^^^^^^ this isn't
> actually true any more. All recent gfs_controld versions take their cue
> from the uevents, so this is here only for backwards compatibility
> reasons and these two will be removed at some future date.

I'll add a longer comment saying something like that.

> > +	/*
> > +	 * Other nodes need to do some work in dlm recovery and gfs2_control
> > +	 * before the recover_done and control_lock will be ready for us below.
> > +	 * A delay here is not required but often avoids having to retry.
> > +	 */
> > +
> > +	msleep(500);
> Can we get rid of this then? I'd rather just wait for the lock, rather
> than adding delays of arbitrary time periods into the code.

I dislike arbitrary delays also, so I'm hesitant to add them.
The choices here are:

- removing NOQUEUE from the requests below, but with NOQUEUE you have a
  much better chance of killing a mount command, which is a fairly nice
  feature, I think.

- removing the delay, which results in nodes often doing fast+repeated
  lock attempts, which could get rather excessive.  I'd be worried about
  having that kind of unlimited loop sitting there.

- using some kind of delay.  While I don't like the look of the delay,
  I like the other options less.

Do you have a preference, or any other ideas?

> > +static int control_first_done(struct gfs2_sbd *sdp)
> > +{
> > +	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
> > +	char lvb_bits[GDLM_LVB_SIZE];
> > +	uint32_t start_gen, block_gen;
> > +	int error;
> > +
> > +restart:
> > +	spin_lock(&ls->ls_recover_spin);
> > +	start_gen = ls->ls_recover_start;
> > +	block_gen = ls->ls_recover_block;
> > +
> > +	if (test_bit(DFL_BLOCK_LOCKS, &ls->ls_recover_flags) ||
> > +	    !test_bit(DFL_MOUNT_DONE, &ls->ls_recover_flags) ||
> > +	    !test_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags)) {
> > +		/* sanity check, should not happen */
> > +		fs_err(sdp, "control_first_done start %u block %u flags %lx\n",
> > +		       start_gen, block_gen, ls->ls_recover_flags);
> > +		spin_unlock(&ls->ls_recover_spin);
> > +		control_unlock(sdp);
> > +		return -1;
> > +	}
> > +
> > +	if (start_gen == block_gen) {
> > +		/*
> > +		 * Wait for the end of a dlm recovery cycle to switch from
> > +		 * first mounter recovery.  We can ignore any recover_slot
> > +		 * callbacks between the recover_prep and next recover_done
> > +		 * because we are still the first mounter and any failed nodes
> > +		 * have not fully mounted, so they don't need recovery.
> > +		 */
> > +		spin_unlock(&ls->ls_recover_spin);
> > +		fs_info(sdp, "control_first_done wait gen %u\n", start_gen);
> > +		msleep(500);
> Again - I don't want to add arbitrary delays into the code. Why is this
> waiting for half a second? Why not some other length of time? We should
> figure out how to wait for the end of the first mounter recovery some
> other way if that is what is required.

This msleep slows down a rare loop so that it wakes up a couple of
times, vs once with a proper wait mechanism.  It's waiting for the next
recover_done() callback, which the dlm will call when it is done with
recovery.  We do have the option here of using a standard wait
mechanism, wait_on_bit() or something.  I'll see if any of those would
work here without adding too much to the code.
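Roughly this sort of thing, untested; the names ls_recover_wait and
DFL_DLM_RECOVERY below are made up here and not part of the patch.
recover_prep() would set a flag bit, recover_done() would clear it and
wake a waitqueue, and control_first_done() would sleep on that instead
of looping on msleep(500):

#include <linux/wait.h>
#include <linux/bitops.h>

/*
 * Sketch only.  The real lm_lockstruct would grow a wait_queue_head_t
 * (initialized with init_waitqueue_head() at mount time) and a new
 * flag bit in ls_recover_flags.
 */
struct recover_wait_sketch {
	unsigned long ls_recover_flags;
	wait_queue_head_t ls_recover_wait;
};

#define DFL_DLM_RECOVERY 5	/* arbitrary bit number for the sketch */

/* recover_prep(): a dlm recovery cycle is starting */
static void recovery_prep_sketch(struct recover_wait_sketch *ls)
{
	set_bit(DFL_DLM_RECOVERY, &ls->ls_recover_flags);
}

/* recover_done(): the cycle has finished, wake any waiter */
static void recovery_done_sketch(struct recover_wait_sketch *ls)
{
	clear_bit(DFL_DLM_RECOVERY, &ls->ls_recover_flags);
	wake_up(&ls->ls_recover_wait);
}

/* control_first_done(): would replace the msleep(500) + goto restart */
static void wait_recovery_done_sketch(struct recover_wait_sketch *ls)
{
	wait_event(ls->ls_recover_wait,
		   !test_bit(DFL_DLM_RECOVERY, &ls->ls_recover_flags));
}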
> > +static void gdlm_recovery_result(struct gfs2_sbd *sdp, unsigned int jid,
> > +				 unsigned int result)
> > +{
> > +	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
> > +
> > +	/* don't care about the recovery of own journal during mount */
> > +	if (jid == ls->ls_jid)
> > +		return;
> > +
> > +	/* another node is recovering the journal, give it a chance to
> > +	   finish before trying again */
> > +	if (result == LM_RD_GAVEUP)
> > +		msleep(1000);
> Again, let's put in a proper wait for this condition. If the issue is one
> of races between cluster nodes (thundering herd type problem), then we
> might need some kind of back off, but in that case, it should probably
> be for a random time period.

In this case, while one node is recovering a journal, the other nodes
will all try to recover the same journal (and fail), as quickly as they
can.  I looked at using queue_delayed_work here, but couldn't tell if
that was ok with zero delay... I now see others use 0, so I'll try it.

> > +	error = dlm_new_lockspace(fsname, cluster, flags, GDLM_LVB_SIZE,
> > +				  &ops, &ls->ls_dlm);
> > +
> > +	if (error == -EOPNOTSUPP) {
> > +		/*
> > +		 * dlm does not support ops callbacks,
> > +		 * old dlm_controld/gfs_controld are used, try without ops.
> > +		 */
> > +		fs_info(sdp, "dlm lockspace ops not used %d\n", error);
> > +		free_recover_size(ls);
> > +
> > +		error = dlm_new_lockspace(fsname, cluster, flags, GDLM_LVB_SIZE,
> > +					  NULL, &ls->ls_dlm);
> > +		if (error)
> > +			fs_err(sdp, "dlm_new_lockspace error %d\n", error);
> > +		return error;
> > +	}
> > +
> Hmm. This is a bit complicated. Can't we just make it return 0 anyway?
> If we do need to know whether the dlm supports the recovery ops, then
> let's just make it signal that somehow (e.g. returns 1 so that >= 0 means
> success and -ve means error). It doesn't matter if we don't call
> free_recover_size until umount time I think, even if the dlm doesn't
> support that since the data structures are fairly small.

I went with this because I thought it was simpler than adding a second
return value for the ops status.  It would also let us simply drop the
special case in the future.  The alternative is:

int dlm_new_lockspace(const char *name, const char *cluster,
		      uint32_t flags, int lvblen,
		      struct dlm_lockspace_ops *ops, void *ops_arg,
		      int *ops_error, dlm_lockspace_t **lockspace);

I'm willing to try that if you think it's easier to understand.

Dave
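For illustration only (not from the patch), a caller using that
ops_error variant might look roughly like the fragment below; passing
sdp as ops_arg is just an assumption here:

	int ops_error;

	error = dlm_new_lockspace(fsname, cluster, flags, GDLM_LVB_SIZE,
				  &ops, sdp, &ops_error, &ls->ls_dlm);
	if (error) {
		fs_err(sdp, "dlm_new_lockspace error %d\n", error);
		return error;
	}

	if (ops_error < 0) {
		/* old dlm_controld/gfs_controld, recovery ops not in effect */
		fs_info(sdp, "dlm lockspace ops not used %d\n", ops_error);
		free_recover_size(ls);
	}

The mount would still succeed against an old dlm; gfs2 would just note
that the recovery ops are not in effect and free the recover_size state.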