From: David Teigland <teigland@redhat.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination
Date: Mon, 19 Dec 2011 12:47:38 -0500
Message-ID: <20111219174738.GA24652@redhat.com>
In-Reply-To: <1324300058.2723.47.camel@menhir>
On Mon, Dec 19, 2011 at 01:07:38PM +0000, Steven Whitehouse wrote:
> > struct lm_lockstruct {
> > int ls_jid;
> > unsigned int ls_first;
> > - unsigned int ls_first_done;
> > unsigned int ls_nodir;
> Since ls_flags and ls_first are also only boolean flags, they could
> potentially be moved into the flags, though we can always do that later.
yes, I can use a flag in place of ls_first.
> > + int ls_recover_jid_done; /* read by gfs_controld */
> > + int ls_recover_jid_status; /* read by gfs_controld */
> ^^^^^^^^^^^ this isn't
> actually true any more. All recent gfs_controld versions take their cue
> from the uevents, so this is here only for backwards compatibility
> reasons and these two will be removed at some future date.
I'll add a longer comment saying something like that.
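Something along these lines, perhaps (the wording is just a sketch):

	/*
	 * Only used by old versions of gfs_controld that read these
	 * values through sysfs; recent versions take their cue from
	 * uevents.  Kept for backwards compatibility only; remove once
	 * old gfs_controld no longer needs to be supported.
	 */
	int ls_recover_jid_done;
	int ls_recover_jid_status;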
> > + /*
> > + * Other nodes need to do some work in dlm recovery and gfs2_control
> > + * before the recover_done and control_lock will be ready for us below.
> > + * A delay here is not required but often avoids having to retry.
> > + */
> > +
> > + msleep(500);
> Can we get rid of this then? I'd rather just wait for the lock, rather
> than adding delays of arbitrary time periods into the code.
I dislike arbitrary delays also, so I'm hesitant to add them.
The choices here are:
- removing NOQUEUE from the requests below; but with NOQUEUE you have a
much better chance of being able to kill a mount command, which is a
fairly nice feature, I think.
- removing the delay, which results in nodes often doing fast+repeated
lock attempts, which could get rather excessive. I'd be worried about
having that kind of unlimited loop sitting there.
- using some kind of delay.
While I don't like the look of the delay, I like the other options less.
Do you have a preference, or any other ideas?
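For concreteness, the NOQUEUE-plus-delay option is roughly the following
(control_lock() is the helper from lock_dlm.c in this patch; treat the
loop as a sketch, not final code):

	for (;;) {
		error = control_lock(sdp, DLM_LOCK_EX,
				     DLM_LKF_NOQUEUE | DLM_LKF_VALBLK);
		if (error != -EAGAIN)
			break;			/* granted, or a real error */
		if (fatal_signal_pending(current))
			return -EINTR;		/* mount can still be killed */
		msleep(500);			/* pace the retries */
	}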
> > +static int control_first_done(struct gfs2_sbd *sdp)
> > +{
> > + struct lm_lockstruct *ls = &sdp->sd_lockstruct;
> > + char lvb_bits[GDLM_LVB_SIZE];
> > + uint32_t start_gen, block_gen;
> > + int error;
> > +
> > +restart:
> > + spin_lock(&ls->ls_recover_spin);
> > + start_gen = ls->ls_recover_start;
> > + block_gen = ls->ls_recover_block;
> > +
> > + if (test_bit(DFL_BLOCK_LOCKS, &ls->ls_recover_flags) ||
> > + !test_bit(DFL_MOUNT_DONE, &ls->ls_recover_flags) ||
> > + !test_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags)) {
> > + /* sanity check, should not happen */
> > + fs_err(sdp, "control_first_done start %u block %u flags %lx\n",
> > + start_gen, block_gen, ls->ls_recover_flags);
> > + spin_unlock(&ls->ls_recover_spin);
> > + control_unlock(sdp);
> > + return -1;
> > + }
> > +
> > + if (start_gen == block_gen) {
> > + /*
> > + * Wait for the end of a dlm recovery cycle to switch from
> > + * first mounter recovery. We can ignore any recover_slot
> > + * callbacks between the recover_prep and next recover_done
> > + * because we are still the first mounter and any failed nodes
> > + * have not fully mounted, so they don't need recovery.
> > + */
> > + spin_unlock(&ls->ls_recover_spin);
> > + fs_info(sdp, "control_first_done wait gen %u\n", start_gen);
> > + msleep(500);
> Again - I don't want to add arbitrary delays into the code. Why is this
> waiting for half a second? Why not some other length of time? We should
> figure out how to wait for the end of the first mounter recovery some
> other way if that is what is required.
This msleep just slows down a rare loop, so it wakes up a couple of times
rather than once, as it would with a proper wait mechanism. It's waiting
for the next recover_done() callback, which the dlm will call when it is
done with recovery. We do have the option here of using a standard wait
mechanism, wait_on_bit() or something; I'll see if any of those would
work without adding too much to the code.
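If one of those works out, the poll above would become something like
this (ls_recover_wait is hypothetical, a wait_queue_head_t woken from
the recover_done() callback):

	/*
	 * Wait for the next recover_done() instead of polling.  The
	 * unlocked read of ls_recover_start is fine; restart rechecks
	 * everything under ls_recover_spin.
	 */
	wait_event(ls->ls_recover_wait,
		   ls->ls_recover_start != start_gen);
	goto restart;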
> > +static void gdlm_recovery_result(struct gfs2_sbd *sdp, unsigned int jid,
> > + unsigned int result)
> > +{
> > + struct lm_lockstruct *ls = &sdp->sd_lockstruct;
> > +
> > + /* don't care about the recovery of own journal during mount */
> > + if (jid == ls->ls_jid)
> > + return;
> > +
> > + /* another node is recovering the journal, give it a chance to
> > + finish before trying again */
> > + if (result == LM_RD_GAVEUP)
> > + msleep(1000);
> Again, lets put in a proper wait for this condition. If the issue is one
> of races between cluster nodes (thundering herd type problem), then we
> might need some kind of back off, but in that case, it should probably
> be for a random time period.
In this case, while one node is recovering a journal, the other nodes will
all try to recover the same journal (and fail), as quickly as they can. I
looked at using queue_delayed_work here, but couldn't tell if that was ok
with zero delay... I now see others use 0, so I'll try it.
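i.e. something like this in place of the msleep (the work item and
workqueue names are placeholders; it assumes the journal recovery work
becomes a struct delayed_work):

	if (result == LM_RD_GAVEUP) {
		/* give the other node a second before retrying */
		queue_delayed_work(gfs_recovery_wq, &jd->jd_dwork,
				   msecs_to_jiffies(1000));
		return;
	}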
> > + error = dlm_new_lockspace(fsname, cluster, flags, GDLM_LVB_SIZE,
> > + &ops, &ls->ls_dlm);
> > +
> > + if (error == -EOPNOTSUPP) {
> > + /*
> > + * dlm does not support ops callbacks,
> > + * old dlm_controld/gfs_controld are used, try without ops.
> > + */
> > + fs_info(sdp, "dlm lockspace ops not used %d\n", error);
> > + free_recover_size(ls);
> > +
> > + error = dlm_new_lockspace(fsname, cluster, flags, GDLM_LVB_SIZE,
> > + NULL, &ls->ls_dlm);
> > + if (error)
> > + fs_err(sdp, "dlm_new_lockspace error %d\n", error);
> > + return error;
> > + }
> > +
> Hmm. This is a bit complicated. Can't we just make it return 0 anyway?
> If we do need to know whether the dlm supports the recovery ops, then
> lets just make it signal that somehow (e.g. returns 1 so that >= 0 means
> success and -ve means error). It doesn't matter if we don't call
> free_recover_size until umount time I think, even if the dlm doesn't
> support that since the data structures are fairly small.
I went with this because I thought it was simpler than adding a second
return value for the ops status. It would also let us simply drop the
special case in the future. The alternative is:
	int dlm_new_lockspace(const char *name, const char *cluster,
			      uint32_t flags, int lvblen,
			      struct dlm_lockspace_ops *ops, void *ops_arg,
			      int *ops_error, dlm_lockspace_t **lockspace);
I'm willing to try that if you think it's clearer.
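If we went that way, the mount side would look roughly like this (a
sketch, assuming ops_error only reports whether the recovery ops were
wired up, so mount succeeds either way):

	int ops_error;

	error = dlm_new_lockspace(fsname, cluster, flags, GDLM_LVB_SIZE,
				  &ops, sdp, &ops_error, &ls->ls_dlm);
	if (error)
		return error;

	if (ops_error == -EOPNOTSUPP) {
		/* old dlm_controld/gfs_controld, run without recovery ops */
		fs_info(sdp, "dlm lockspace ops not used\n");
		free_recover_size(ls);
	}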
Dave