From: David Teigland
Date: Mon, 19 Dec 2011 12:47:38 -0500
Subject: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination
In-Reply-To: <1324300058.2723.47.camel@menhir>
References: <1324072991-30729-6-git-send-email-teigland@redhat.com> <1324300058.2723.47.camel@menhir>
Message-ID: <20111219174738.GA24652@redhat.com>
To: cluster-devel.redhat.com

On Mon, Dec 19, 2011 at 01:07:38PM +0000, Steven Whitehouse wrote:
> >  struct lm_lockstruct {
> >  	int ls_jid;
> >  	unsigned int ls_first;
> > -	unsigned int ls_first_done;
> >  	unsigned int ls_nodir;
> Since ls_flags and ls_first are also only boolean flags, they could
> potentially be moved into the flags, though we can always do that later.

Yes, I can use a flag in place of ls_first.

> > +	int ls_recover_jid_done; /* read by gfs_controld */
> > +	int ls_recover_jid_status; /* read by gfs_controld */
>                                                 ^^^^^^^^^^^ this isn't
> actually true any more. All recent gfs_controld versions take their cue
> from the uevents, so this is here only for backwards compatibility
> reasons and these two will be removed at some future date.

I'll add a longer comment saying something like that.

> > +	/*
> > +	 * Other nodes need to do some work in dlm recovery and gfs2_control
> > +	 * before the recover_done and control_lock will be ready for us below.
> > +	 * A delay here is not required but often avoids having to retry.
> > +	 */
> > +
> > +	msleep(500);
> Can we get rid of this then? I'd rather just wait for the lock, rather
> than adding delays of arbitrary time periods into the code.

I dislike arbitrary delays also, so I'm hesitant to add them.
The choices here are:

- removing NOQUEUE from the requests below, but with NOQUEUE you have a
  much better chance of killing a mount command, which is a fairly nice
  feature, I think.

- removing the delay, which results in nodes often doing fast+repeated
  lock attempts, which could get rather excessive.  I'd be worried about
  having that kind of unlimited loop sitting there.

- using some kind of delay.  While I don't like the look of the delay,
  I like the other options less.

Do you have a preference, or any other ideas?

> > +static int control_first_done(struct gfs2_sbd *sdp)
> > +{
> > +	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
> > +	char lvb_bits[GDLM_LVB_SIZE];
> > +	uint32_t start_gen, block_gen;
> > +	int error;
> > +
> > +restart:
> > +	spin_lock(&ls->ls_recover_spin);
> > +	start_gen = ls->ls_recover_start;
> > +	block_gen = ls->ls_recover_block;
> > +
> > +	if (test_bit(DFL_BLOCK_LOCKS, &ls->ls_recover_flags) ||
> > +	    !test_bit(DFL_MOUNT_DONE, &ls->ls_recover_flags) ||
> > +	    !test_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags)) {
> > +		/* sanity check, should not happen */
> > +		fs_err(sdp, "control_first_done start %u block %u flags %lx\n",
> > +		       start_gen, block_gen, ls->ls_recover_flags);
> > +		spin_unlock(&ls->ls_recover_spin);
> > +		control_unlock(sdp);
> > +		return -1;
> > +	}
> > +
> > +	if (start_gen == block_gen) {
> > +		/*
> > +		 * Wait for the end of a dlm recovery cycle to switch from
> > +		 * first mounter recovery.  We can ignore any recover_slot
> > +		 * callbacks between the recover_prep and next recover_done
> > +		 * because we are still the first mounter and any failed nodes
> > +		 * have not fully mounted, so they don't need recovery.
> > +		 */
> > +		spin_unlock(&ls->ls_recover_spin);
> > +		fs_info(sdp, "control_first_done wait gen %u\n", start_gen);
> > +		msleep(500);
> Again - I don't want to add arbitrary delays into the code. Why is this
> waiting for half a second? Why not some other length of time? We should
> figure out how to wait for the end of the first mounter recovery some
> other way if that is what is required.

This msleep slows down a rare loop so that it wakes up a couple of
times, vs once with a proper wait mechanism.  It's waiting for the next
recover_done() callback, which the dlm will call when it is done with
recovery.  We do have the option here of using a standard wait
mechanism, wait_on_bit() or something.  I'll see if any of those would
work here without adding too much to the code.
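Roughly this sort of thing, untested; the names ls_recover_wait and
DFL_DLM_RECOVERY below are made up here and not part of the patch.
recover_prep() would set a flag bit, recover_done() would clear it and
wake a waitqueue, and control_first_done() would sleep on that instead
of looping on msleep(500):

#include <linux/wait.h>
#include <linux/bitops.h>

/*
 * Sketch only.  The real lm_lockstruct would grow a wait_queue_head_t
 * (initialized with init_waitqueue_head() at mount time) and a new
 * flag bit in ls_recover_flags.
 */
struct recover_wait_sketch {
	unsigned long ls_recover_flags;
	wait_queue_head_t ls_recover_wait;
};

#define DFL_DLM_RECOVERY 5	/* arbitrary bit number for the sketch */

/* recover_prep(): a dlm recovery cycle is starting */
static void recovery_prep_sketch(struct recover_wait_sketch *ls)
{
	set_bit(DFL_DLM_RECOVERY, &ls->ls_recover_flags);
}

/* recover_done(): the cycle has finished, wake any waiter */
static void recovery_done_sketch(struct recover_wait_sketch *ls)
{
	clear_bit(DFL_DLM_RECOVERY, &ls->ls_recover_flags);
	wake_up(&ls->ls_recover_wait);
}

/* control_first_done(): would replace the msleep(500) + goto restart */
static void wait_recovery_done_sketch(struct recover_wait_sketch *ls)
{
	wait_event(ls->ls_recover_wait,
		   !test_bit(DFL_DLM_RECOVERY, &ls->ls_recover_flags));
}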
> > +static void gdlm_recovery_result(struct gfs2_sbd *sdp, unsigned int jid,
> > +				 unsigned int result)
> > +{
> > +	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
> > +
> > +	/* don't care about the recovery of own journal during mount */
> > +	if (jid == ls->ls_jid)
> > +		return;
> > +
> > +	/* another node is recovering the journal, give it a chance to
> > +	   finish before trying again */
> > +	if (result == LM_RD_GAVEUP)
> > +		msleep(1000);
> Again, let's put in a proper wait for this condition. If the issue is one
> of races between cluster nodes (thundering herd type problem), then we
> might need some kind of back off, but in that case, it should probably
> be for a random time period.

In this case, while one node is recovering a journal, the other nodes
will all try to recover the same journal (and fail), as quickly as they
can.  I looked at using queue_delayed_work here, but couldn't tell if
that was ok with zero delay... I now see others use 0, so I'll try it.

> > +	error = dlm_new_lockspace(fsname, cluster, flags, GDLM_LVB_SIZE,
> > +				  &ops, &ls->ls_dlm);
> > +
> > +	if (error == -EOPNOTSUPP) {
> > +		/*
> > +		 * dlm does not support ops callbacks,
> > +		 * old dlm_controld/gfs_controld are used, try without ops.
> > +		 */
> > +		fs_info(sdp, "dlm lockspace ops not used %d\n", error);
> > +		free_recover_size(ls);
> > +
> > +		error = dlm_new_lockspace(fsname, cluster, flags, GDLM_LVB_SIZE,
> > +					  NULL, &ls->ls_dlm);
> > +		if (error)
> > +			fs_err(sdp, "dlm_new_lockspace error %d\n", error);
> > +		return error;
> > +	}
> > +
> Hmm. This is a bit complicated. Can't we just make it return 0 anyway?
> If we do need to know whether the dlm supports the recovery ops, then
> let's just make it signal that somehow (e.g. returns 1 so that >= 0 means
> success and -ve means error). It doesn't matter if we don't call
> free_recover_size until umount time I think, even if the dlm doesn't
> support that since the data structures are fairly small.

I went with this because I thought it was simpler than adding a second
return value for the ops status.  It would also let us simply drop the
special case in the future.  The alternative is:

int dlm_new_lockspace(const char *name, const char *cluster,
		      uint32_t flags, int lvblen,
		      struct dlm_lockspace_ops *ops, void *ops_arg,
		      int *ops_error, dlm_lockspace_t **lockspace);

I'm willing to try that if you think it's easier to understand.

Dave
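For illustration only (not from the patch), a caller using that
ops_error variant might look roughly like the fragment below; passing
sdp as ops_arg is just an assumption here:

	int ops_error;

	error = dlm_new_lockspace(fsname, cluster, flags, GDLM_LVB_SIZE,
				  &ops, sdp, &ops_error, &ls->ls_dlm);
	if (error) {
		fs_err(sdp, "dlm_new_lockspace error %d\n", error);
		return error;
	}

	if (ops_error < 0) {
		/* old dlm_controld/gfs_controld, recovery ops not in effect */
		fs_info(sdp, "dlm lockspace ops not used %d\n", ops_error);
		free_recover_size(ls);
	}

The mount would still succeed against an old dlm; gfs2 would just note
that the recovery ops are not in effect and free the recover_size state.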