From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andreas Gruenbacher
Date: Fri, 17 Apr 2020 16:27:26 +0200
Subject: [Cluster-devel] [gfs2:gfs2-iopen 12/12] fs/gfs2/util.c:126:3: error: implicit declaration of function
In-Reply-To: <1706994570.22510265.1587128956368.JavaMail.zimbra@redhat.com>
References: <202004171521.Z9s8gKSV%lkp@intel.com>
 <1706994570.22510265.1587128956368.JavaMail.zimbra@redhat.com>
Message-ID: <20200417142726.188190-1-agruenba@redhat.com>
List-Id: <cluster-devel.redhat.com>
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

Hi Bob,

On Fri, Apr 17, 2020 at 3:09 PM Bob Peterson wrote:
> ----- Original Message -----
> > Bob,
> >
> > commit "gfs2: Force withdraw to replay journals and wait for it to
> > finish" adds three new users of gfs2_glock_dq_wait in function
> > signal_our_withdraw. Is the waiting there done for a reason, or can we
> > replace gfs2_glock_dq_wait with gfs2_glock_dq / gfs2_glock_dq_uninit
> > in that function?
> >
> > Thanks,
> > Andreas
>
> Hi Andreas,
>
> I remember debugging cases in which we needed to wait.
> When we didn't wait for glocks to be demoted, the node just reacquired
> the glocks again too quickly, before other nodes could attempt recovery.
> When the withdrawing node tried to reacquire them, DLM simply granted
> them on the same node, which is the only node that must not do recovery.
>
> Addressing each instance separately:
>
> (1) We would dequeue our journal glock, then dequeue the live glock.
>     The other nodes would see the callback for the "live" glock and
>     attempt recovery, however they were denied access to the journal
>     glock because it was still held on the recovering node. That's
>     because the glock state machine (but more importantly the dlm)
>     had not yet demoted the lock completely when this took place.
>     So none of the nodes performed recovery.

Hmm, sdp->sd_journal_gh has the GL_NOCACHE flag set, so the demote
should happen immediately. On the other hand, sdp->sd_live_gh doesn't
have that flag set, so that may be the actual problem here.

> (2) We might be able to get rid of the "wait" for the "live" glock.
>     I can't think of a case in which that would behave badly, but
>     bear in mind it's been more than a year since I originally wrote
>     that. It's probably closer to 2 years now.
>
> (3) We might be able to get rid of the third "wait" which is also for
>     the "live" glock. I don't remember the circumstances.
>
> TBH, I wouldn't feel comfortable getting rid of any of those waits
> until I do some heavy experimentation on my 5-node cluster with the
> recovery tests.

I agree that testing will be needed. What do you think of the below
patch?
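
For context: as far as I remember (please double-check against
fs/gfs2/glock.c), gfs2_glock_dq_uninit() is just the non-waiting dequeue
followed by tearing down the holder, roughly:

	/* Sketch of the helper the first util.c hunk switches to; the
	 * real definition lives in fs/gfs2/glock.c. */
	void gfs2_glock_dq_uninit(struct gfs2_holder *gh)
	{
		gfs2_glock_dq(gh);	/* dequeue without waiting for the demote */
		gfs2_holder_uninit(gh);	/* drop the holder's glock reference */
	}

so the first util.c hunk below amounts to dropping the wait and nothing
else; the GL_NOCACHE flag is what should make the demote happen right
away.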
Thanks,
Andreas

---
 fs/gfs2/ops_fstype.c |  2 +-
 fs/gfs2/util.c       | 16 ++++++++--------
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index e2b69ffcc6a8..98b2577b815b 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -405,7 +405,7 @@ static int init_locking(struct gfs2_sbd *sdp, struct gfs2_holder *mount_gh,
 	error = gfs2_glock_nq_num(sdp,
 				  GFS2_LIVE_LOCK, &gfs2_nondisk_glops,
 				  LM_ST_SHARED,
-				  LM_FLAG_NOEXP | GL_EXACT,
+				  LM_FLAG_NOEXP | GL_EXACT | GL_NOCACHE,
 				  &sdp->sd_live_gh);
 	if (error) {
 		fs_err(sdp, "can't acquire live glock: %d\n", error);
diff --git a/fs/gfs2/util.c b/fs/gfs2/util.c
index 9b64d40ab379..6e48cef79c53 100644
--- a/fs/gfs2/util.c
+++ b/fs/gfs2/util.c
@@ -122,10 +122,8 @@ static void signal_our_withdraw(struct gfs2_sbd *sdp)
 	/*
 	 * Drop the glock for our journal so another node can recover it.
 	 */
-	if (gfs2_holder_initialized(&sdp->sd_journal_gh)) {
-		gfs2_glock_dq_wait(&sdp->sd_journal_gh);
-		gfs2_holder_uninit(&sdp->sd_journal_gh);
-	}
+	if (gfs2_holder_initialized(&sdp->sd_journal_gh))
+		gfs2_glock_dq_uninit(&sdp->sd_journal_gh);
 	sdp->sd_jinode_gh.gh_flags |= GL_NOCACHE;
 	gfs2_glock_dq(&sdp->sd_jinode_gh);
 	if (test_bit(SDF_FS_FROZEN, &sdp->sd_flags)) {
@@ -167,7 +165,7 @@ static void signal_our_withdraw(struct gfs2_sbd *sdp)
 	 * Dequeue the "live" glock, but keep a reference so it's never freed.
 	 */
 	gfs2_glock_hold(gl);
-	gfs2_glock_dq_wait(&sdp->sd_live_gh);
+	gfs2_glock_dq(&sdp->sd_live_gh);
 	/*
 	 * We enqueue the "live" glock in EX so that all other nodes
 	 * get a demote request and act on it. We don't really want the
@@ -175,7 +173,8 @@ static void signal_our_withdraw(struct gfs2_sbd *sdp)
 	 */
 	fs_warn(sdp, "Requesting recovery of jid %d.\n",
 		sdp->sd_lockstruct.ls_jid);
-	gfs2_holder_reinit(LM_ST_EXCLUSIVE, LM_FLAG_TRY_1CB | LM_FLAG_NOEXP,
+	gfs2_holder_reinit(LM_ST_EXCLUSIVE,
+			   LM_FLAG_TRY_1CB | LM_FLAG_NOEXP | GL_NOCACHE,
 			   &sdp->sd_live_gh);
 	msleep(GL_GLOCK_MAX_HOLD);
 	/*
@@ -199,8 +198,9 @@ static void signal_our_withdraw(struct gfs2_sbd *sdp)
 		if (gfs2_recover_journal(sdp->sd_jdesc, 1))
 			fs_warn(sdp, "Unable to recover our journal jid %d.\n",
 				sdp->sd_lockstruct.ls_jid);
-		gfs2_glock_dq_wait(&sdp->sd_live_gh);
-		gfs2_holder_reinit(LM_ST_SHARED, LM_FLAG_NOEXP | GL_EXACT,
+		gfs2_glock_dq(&sdp->sd_live_gh);
+		gfs2_holder_reinit(LM_ST_SHARED,
+				   LM_FLAG_NOEXP | GL_EXACT | GL_NOCACHE,
 				   &sdp->sd_live_gh);
 		gfs2_glock_nq(&sdp->sd_live_gh);
 	}

base-commit: bd437e630fee648b79999e617d48bb07ae76f8eb
prerequisite-patch-id: e3b7fbfc1e67b4065cb6c928c29c7ed8bee0fc1c
prerequisite-patch-id: 35e9388872b2d6ffc70c4fc5497d3fdec97d6392
-- 
2.25.2