* [Cluster-devel] [gfs2:gfs2-iopen 12/12] fs/gfs2/util.c:126:3: error: implicit declaration of function 'gfs2_glock_dq_wait'; did you mean 'gfs2_glock_nq_init'?
From: kbuild test robot @ 2020-04-17 7:36 UTC
To: cluster-devel.redhat.com
tree: https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git gfs2-iopen
head: c748e1ec9bd20d71265a148042f6bc97ffc5f343
commit: c748e1ec9bd20d71265a148042f6bc97ffc5f343 [12/12] gfs2: Remove unused function gfs2_glock_dq_wait
config: s390-defconfig (attached as .config)
compiler: s390-linux-gcc (GCC) 9.3.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        git checkout c748e1ec9bd20d71265a148042f6bc97ffc5f343
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day GCC_VERSION=9.3.0 make.cross ARCH=s390
If you fix the issue, kindly add the following tag as appropriate:
Reported-by: kbuild test robot <lkp@intel.com>
All errors (new ones prefixed by >>):
   fs/gfs2/util.c: In function 'signal_our_withdraw':
>> fs/gfs2/util.c:126:3: error: implicit declaration of function 'gfs2_glock_dq_wait'; did you mean 'gfs2_glock_nq_init'? [-Werror=implicit-function-declaration]
     126 |   gfs2_glock_dq_wait(&sdp->sd_journal_gh);
         |   ^~~~~~~~~~~~~~~~~~
         |   gfs2_glock_nq_init
   cc1: some warnings being treated as errors
vim +126 fs/gfs2/util.c
0d91061a372671a Bob Peterson 2019-02-18   92  
601ef0d52e96175 Bob Peterson 2020-01-28   93  static void signal_our_withdraw(struct gfs2_sbd *sdp)
601ef0d52e96175 Bob Peterson 2020-01-28   94  {
601ef0d52e96175 Bob Peterson 2020-01-28   95          struct gfs2_glock *gl = sdp->sd_live_gh.gh_gl;
601ef0d52e96175 Bob Peterson 2020-01-28   96          struct inode *inode = sdp->sd_jdesc->jd_inode;
601ef0d52e96175 Bob Peterson 2020-01-28   97          struct gfs2_inode *ip = GFS2_I(inode);
601ef0d52e96175 Bob Peterson 2020-01-28   98          u64 no_formal_ino = ip->i_no_formal_ino;
601ef0d52e96175 Bob Peterson 2020-01-28   99          int ret = 0;
601ef0d52e96175 Bob Peterson 2020-01-28  100          int tries;
601ef0d52e96175 Bob Peterson 2020-01-28  101  
601ef0d52e96175 Bob Peterson 2020-01-28  102          if (test_bit(SDF_NORECOVERY, &sdp->sd_flags))
601ef0d52e96175 Bob Peterson 2020-01-28  103                  return;
601ef0d52e96175 Bob Peterson 2020-01-28  104  
601ef0d52e96175 Bob Peterson 2020-01-28  105          /* Prevent any glock dq until withdraw recovery is complete */
601ef0d52e96175 Bob Peterson 2020-01-28  106          set_bit(SDF_WITHDRAW_RECOVERY, &sdp->sd_flags);
601ef0d52e96175 Bob Peterson 2020-01-28  107          /*
601ef0d52e96175 Bob Peterson 2020-01-28  108           * Don't tell dlm we're bailing until we have no more buffers in the
601ef0d52e96175 Bob Peterson 2020-01-28  109           * wind. If journal had an IO error, the log code should just purge
601ef0d52e96175 Bob Peterson 2020-01-28  110           * the outstanding buffers rather than submitting new IO. Making the
601ef0d52e96175 Bob Peterson 2020-01-28  111           * file system read-only will flush the journal, etc.
601ef0d52e96175 Bob Peterson 2020-01-28  112           *
601ef0d52e96175 Bob Peterson 2020-01-28  113           * During a normal unmount, gfs2_make_fs_ro calls gfs2_log_shutdown
601ef0d52e96175 Bob Peterson 2020-01-28  114           * which clears SDF_JOURNAL_LIVE. In a withdraw, we must not write
601ef0d52e96175 Bob Peterson 2020-01-28  115           * any UNMOUNT log header, so we can't call gfs2_log_shutdown, and
601ef0d52e96175 Bob Peterson 2020-01-28  116           * therefore we need to clear SDF_JOURNAL_LIVE manually.
601ef0d52e96175 Bob Peterson 2020-01-28  117           */
601ef0d52e96175 Bob Peterson 2020-01-28  118          clear_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags);
601ef0d52e96175 Bob Peterson 2020-01-28  119          if (!sb_rdonly(sdp->sd_vfs))
601ef0d52e96175 Bob Peterson 2020-01-28  120                  ret = gfs2_make_fs_ro(sdp);
601ef0d52e96175 Bob Peterson 2020-01-28  121  
601ef0d52e96175 Bob Peterson 2020-01-28  122          /*
601ef0d52e96175 Bob Peterson 2020-01-28  123           * Drop the glock for our journal so another node can recover it.
601ef0d52e96175 Bob Peterson 2020-01-28  124           */
601ef0d52e96175 Bob Peterson 2020-01-28  125          if (gfs2_holder_initialized(&sdp->sd_journal_gh)) {
601ef0d52e96175 Bob Peterson 2020-01-28 @126                  gfs2_glock_dq_wait(&sdp->sd_journal_gh);
601ef0d52e96175 Bob Peterson 2020-01-28  127                  gfs2_holder_uninit(&sdp->sd_journal_gh);
601ef0d52e96175 Bob Peterson 2020-01-28  128          }
601ef0d52e96175 Bob Peterson 2020-01-28  129          sdp->sd_jinode_gh.gh_flags |= GL_NOCACHE;
601ef0d52e96175 Bob Peterson 2020-01-28  130          gfs2_glock_dq(&sdp->sd_jinode_gh);
601ef0d52e96175 Bob Peterson 2020-01-28  131          if (test_bit(SDF_FS_FROZEN, &sdp->sd_flags)) {
601ef0d52e96175 Bob Peterson 2020-01-28  132                  /* Make sure gfs2_unfreeze works if partially-frozen */
601ef0d52e96175 Bob Peterson 2020-01-28  133                  flush_workqueue(gfs2_freeze_wq);
601ef0d52e96175 Bob Peterson 2020-01-28  134                  atomic_set(&sdp->sd_freeze_state, SFS_FROZEN);
601ef0d52e96175 Bob Peterson 2020-01-28  135                  thaw_super(sdp->sd_vfs);
601ef0d52e96175 Bob Peterson 2020-01-28  136          } else {
601ef0d52e96175 Bob Peterson 2020-01-28  137                  wait_on_bit(&gl->gl_flags, GLF_DEMOTE, TASK_UNINTERRUPTIBLE);
601ef0d52e96175 Bob Peterson 2020-01-28  138          }
601ef0d52e96175 Bob Peterson 2020-01-28  139  
601ef0d52e96175 Bob Peterson 2020-01-28  140          /*
601ef0d52e96175 Bob Peterson 2020-01-28  141           * holder_uninit to force glock_put, to force dlm to let go
601ef0d52e96175 Bob Peterson 2020-01-28  142           */
601ef0d52e96175 Bob Peterson 2020-01-28  143          gfs2_holder_uninit(&sdp->sd_jinode_gh);
601ef0d52e96175 Bob Peterson 2020-01-28  144  
601ef0d52e96175 Bob Peterson 2020-01-28  145          /*
601ef0d52e96175 Bob Peterson 2020-01-28  146           * Note: We need to be careful here:
601ef0d52e96175 Bob Peterson 2020-01-28  147           * Our iput of jd_inode will evict it. The evict will dequeue its
601ef0d52e96175 Bob Peterson 2020-01-28  148           * glock, but the glock dq will wait for the withdraw unless we have
601ef0d52e96175 Bob Peterson 2020-01-28  149           * exception code in glock_dq.
601ef0d52e96175 Bob Peterson 2020-01-28  150           */
601ef0d52e96175 Bob Peterson 2020-01-28  151          iput(inode);
601ef0d52e96175 Bob Peterson 2020-01-28  152          /*
601ef0d52e96175 Bob Peterson 2020-01-28  153           * Wait until the journal inode's glock is freed. This allows try locks
601ef0d52e96175 Bob Peterson 2020-01-28  154           * on other nodes to be successful, otherwise we remain the owner of
601ef0d52e96175 Bob Peterson 2020-01-28  155           * the glock as far as dlm is concerned.
601ef0d52e96175 Bob Peterson 2020-01-28  156           */
601ef0d52e96175 Bob Peterson 2020-01-28  157          if (gl->gl_ops->go_free) {
601ef0d52e96175 Bob Peterson 2020-01-28  158                  set_bit(GLF_FREEING, &gl->gl_flags);
601ef0d52e96175 Bob Peterson 2020-01-28  159                  wait_on_bit(&gl->gl_flags, GLF_FREEING, TASK_UNINTERRUPTIBLE);
601ef0d52e96175 Bob Peterson 2020-01-28  160          }
601ef0d52e96175 Bob Peterson 2020-01-28  161  
601ef0d52e96175 Bob Peterson 2020-01-28  162          if (sdp->sd_lockstruct.ls_ops->lm_lock == NULL) { /* lock_nolock */
601ef0d52e96175 Bob Peterson 2020-01-28  163                  clear_bit(SDF_WITHDRAW_RECOVERY, &sdp->sd_flags);
601ef0d52e96175 Bob Peterson 2020-01-28  164                  goto skip_recovery;
601ef0d52e96175 Bob Peterson 2020-01-28  165          }
601ef0d52e96175 Bob Peterson 2020-01-28  166          /*
601ef0d52e96175 Bob Peterson 2020-01-28  167           * Dequeue the "live" glock, but keep a reference so it's never freed.
601ef0d52e96175 Bob Peterson 2020-01-28  168           */
601ef0d52e96175 Bob Peterson 2020-01-28  169          gfs2_glock_hold(gl);
601ef0d52e96175 Bob Peterson 2020-01-28  170          gfs2_glock_dq_wait(&sdp->sd_live_gh);
601ef0d52e96175 Bob Peterson 2020-01-28  171          /*
601ef0d52e96175 Bob Peterson 2020-01-28  172           * We enqueue the "live" glock in EX so that all other nodes
601ef0d52e96175 Bob Peterson 2020-01-28  173           * get a demote request and act on it. We don't really want the
601ef0d52e96175 Bob Peterson 2020-01-28  174           * lock in EX, so we send a "try" lock with 1CB to produce a callback.
601ef0d52e96175 Bob Peterson 2020-01-28  175           */
601ef0d52e96175 Bob Peterson 2020-01-28  176          fs_warn(sdp, "Requesting recovery of jid %d.\n",
601ef0d52e96175 Bob Peterson 2020-01-28  177                  sdp->sd_lockstruct.ls_jid);
601ef0d52e96175 Bob Peterson 2020-01-28  178          gfs2_holder_reinit(LM_ST_EXCLUSIVE, LM_FLAG_TRY_1CB | LM_FLAG_NOEXP,
601ef0d52e96175 Bob Peterson 2020-01-28  179                             &sdp->sd_live_gh);
601ef0d52e96175 Bob Peterson 2020-01-28  180          msleep(GL_GLOCK_MAX_HOLD);
601ef0d52e96175 Bob Peterson 2020-01-28  181          /*
601ef0d52e96175 Bob Peterson 2020-01-28  182           * This will likely fail in a cluster, but succeed standalone:
601ef0d52e96175 Bob Peterson 2020-01-28  183           */
601ef0d52e96175 Bob Peterson 2020-01-28  184          ret = gfs2_glock_nq(&sdp->sd_live_gh);
601ef0d52e96175 Bob Peterson 2020-01-28  185  
601ef0d52e96175 Bob Peterson 2020-01-28  186          /*
601ef0d52e96175 Bob Peterson 2020-01-28  187           * If we actually got the "live" lock in EX mode, there are no other
601ef0d52e96175 Bob Peterson 2020-01-28  188           * nodes available to replay our journal. So we try to replay it
601ef0d52e96175 Bob Peterson 2020-01-28  189           * ourselves. We hold the "live" glock to prevent other mounters
601ef0d52e96175 Bob Peterson 2020-01-28  190           * during recovery, then just dequeue it and reacquire it in our
601ef0d52e96175 Bob Peterson 2020-01-28  191           * normal SH mode. Just in case the problem that caused us to
601ef0d52e96175 Bob Peterson 2020-01-28  192           * withdraw prevents us from recovering our journal (e.g. io errors
601ef0d52e96175 Bob Peterson 2020-01-28  193           * and such) we still check if the journal is clean before proceeding
601ef0d52e96175 Bob Peterson 2020-01-28  194           * but we may wait forever until another mounter does the recovery.
601ef0d52e96175 Bob Peterson 2020-01-28  195           */
601ef0d52e96175 Bob Peterson 2020-01-28  196          if (ret == 0) {
601ef0d52e96175 Bob Peterson 2020-01-28  197                  fs_warn(sdp, "No other mounters found. Trying to recover our "
601ef0d52e96175 Bob Peterson 2020-01-28  198                          "own journal jid %d.\n", sdp->sd_lockstruct.ls_jid);
601ef0d52e96175 Bob Peterson 2020-01-28  199                  if (gfs2_recover_journal(sdp->sd_jdesc, 1))
601ef0d52e96175 Bob Peterson 2020-01-28  200                          fs_warn(sdp, "Unable to recover our journal jid %d.\n",
601ef0d52e96175 Bob Peterson 2020-01-28  201                                  sdp->sd_lockstruct.ls_jid);
601ef0d52e96175 Bob Peterson 2020-01-28  202                  gfs2_glock_dq_wait(&sdp->sd_live_gh);
601ef0d52e96175 Bob Peterson 2020-01-28  203                  gfs2_holder_reinit(LM_ST_SHARED, LM_FLAG_NOEXP | GL_EXACT,
601ef0d52e96175 Bob Peterson 2020-01-28  204                                     &sdp->sd_live_gh);
601ef0d52e96175 Bob Peterson 2020-01-28  205                  gfs2_glock_nq(&sdp->sd_live_gh);
601ef0d52e96175 Bob Peterson 2020-01-28  206          }
601ef0d52e96175 Bob Peterson 2020-01-28  207  
601ef0d52e96175 Bob Peterson 2020-01-28  208          gfs2_glock_queue_put(gl); /* drop the extra reference we acquired */
601ef0d52e96175 Bob Peterson 2020-01-28  209          clear_bit(SDF_WITHDRAW_RECOVERY, &sdp->sd_flags);
601ef0d52e96175 Bob Peterson 2020-01-28  210  
601ef0d52e96175 Bob Peterson 2020-01-28  211          /*
601ef0d52e96175 Bob Peterson 2020-01-28  212           * At this point our journal is evicted, so we need to get a new inode
601ef0d52e96175 Bob Peterson 2020-01-28  213           * for it. Once done, we need to call gfs2_find_jhead which
601ef0d52e96175 Bob Peterson 2020-01-28  214           * calls gfs2_map_journal_extents to map it for us again.
601ef0d52e96175 Bob Peterson 2020-01-28  215           *
601ef0d52e96175 Bob Peterson 2020-01-28  216           * Note that we don't really want it to look up a FREE block. The
601ef0d52e96175 Bob Peterson 2020-01-28  217           * GFS2_BLKST_FREE simply overrides a block check in gfs2_inode_lookup
601ef0d52e96175 Bob Peterson 2020-01-28  218           * which would otherwise fail because it requires grabbing an rgrp
601ef0d52e96175 Bob Peterson 2020-01-28  219           * glock, which would fail with -EIO because we're withdrawing.
601ef0d52e96175 Bob Peterson 2020-01-28  220           */
601ef0d52e96175 Bob Peterson 2020-01-28  221          inode = gfs2_inode_lookup(sdp->sd_vfs, DT_UNKNOWN,
601ef0d52e96175 Bob Peterson 2020-01-28  222                                    sdp->sd_jdesc->jd_no_addr, no_formal_ino,
601ef0d52e96175 Bob Peterson 2020-01-28  223                                    GFS2_BLKST_FREE);
601ef0d52e96175 Bob Peterson 2020-01-28  224          if (IS_ERR(inode)) {
601ef0d52e96175 Bob Peterson 2020-01-28  225                  fs_warn(sdp, "Reprocessing of jid %d failed with %ld.\n",
601ef0d52e96175 Bob Peterson 2020-01-28  226                          sdp->sd_lockstruct.ls_jid, PTR_ERR(inode));
601ef0d52e96175 Bob Peterson 2020-01-28  227                  goto skip_recovery;
601ef0d52e96175 Bob Peterson 2020-01-28  228          }
601ef0d52e96175 Bob Peterson 2020-01-28  229          sdp->sd_jdesc->jd_inode = inode;
601ef0d52e96175 Bob Peterson 2020-01-28  230  
601ef0d52e96175 Bob Peterson 2020-01-28  231          /*
601ef0d52e96175 Bob Peterson 2020-01-28  232           * Now wait until recovery is complete.
601ef0d52e96175 Bob Peterson 2020-01-28  233           */
601ef0d52e96175 Bob Peterson 2020-01-28  234          for (tries = 0; tries < 10; tries++) {
7d9f9249580e05a Bob Peterson 2019-02-18  235                  ret = check_journal_clean(sdp, sdp->sd_jdesc, false);
601ef0d52e96175 Bob Peterson 2020-01-28  236                  if (!ret)
601ef0d52e96175 Bob Peterson 2020-01-28  237                          break;
601ef0d52e96175 Bob Peterson 2020-01-28  238                  msleep(HZ);
601ef0d52e96175 Bob Peterson 2020-01-28  239                  fs_warn(sdp, "Waiting for journal recovery jid %d.\n",
601ef0d52e96175 Bob Peterson 2020-01-28  240                          sdp->sd_lockstruct.ls_jid);
601ef0d52e96175 Bob Peterson 2020-01-28  241          }
601ef0d52e96175 Bob Peterson 2020-01-28  242  skip_recovery:
601ef0d52e96175 Bob Peterson 2020-01-28  243          if (!ret)
601ef0d52e96175 Bob Peterson 2020-01-28  244                  fs_warn(sdp, "Journal recovery complete for jid %d.\n",
601ef0d52e96175 Bob Peterson 2020-01-28  245                          sdp->sd_lockstruct.ls_jid);
601ef0d52e96175 Bob Peterson 2020-01-28  246          else
601ef0d52e96175 Bob Peterson 2020-01-28  247                  fs_warn(sdp, "Journal recovery skipped for %d until next "
601ef0d52e96175 Bob Peterson 2020-01-28  248                          "mount.\n", sdp->sd_lockstruct.ls_jid);
601ef0d52e96175 Bob Peterson 2020-01-28  249          fs_warn(sdp, "Glock dequeues delayed: %lu\n", sdp->sd_glock_dqs_held);
601ef0d52e96175 Bob Peterson 2020-01-28  250          sdp->sd_glock_dqs_held = 0;
601ef0d52e96175 Bob Peterson 2020-01-28  251          wake_up_bit(&sdp->sd_flags, SDF_WITHDRAW_RECOVERY);
601ef0d52e96175 Bob Peterson 2020-01-28  252  }
601ef0d52e96175 Bob Peterson 2020-01-28  253  
:::::: The code at line 126 was first introduced by commit
:::::: 601ef0d52e9617588fcff3df26953592f2eb44ac gfs2: Force withdraw to replay journals and wait for it to finish
:::::: TO: Bob Peterson <rpeterso@redhat.com>
:::::: CC: Bob Peterson <rpeterso@redhat.com>
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
-------------- next part --------------
A non-text attachment was scrubbed: .config.gz (application/gzip, 19349 bytes)
URL: <http://listman.redhat.com/archives/cluster-devel/attachments/20200417/2336dfce/attachment.gz>
* [Cluster-devel] [gfs2:gfs2-iopen 12/12] fs/gfs2/util.c:126:3: error: implicit declaration of function 'gfs2_glock_dq_wait'; did you mean 'gfs2_glock_nq_init'?
From: Andreas Gruenbacher @ 2020-04-17 12:23 UTC
To: cluster-devel.redhat.com
Bob,
On Fri, Apr 17, 2020 at 9:37 AM kbuild test robot <lkp@intel.com> wrote:
> tree: https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git gfs2-iopen
> head: c748e1ec9bd20d71265a148042f6bc97ffc5f343
> commit: c748e1ec9bd20d71265a148042f6bc97ffc5f343 [12/12] gfs2: Remove unused function gfs2_glock_dq_wait
> config: s390-defconfig (attached as .config)
> compiler: s390-linux-gcc (GCC) 9.3.0
> reproduce:
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         git checkout c748e1ec9bd20d71265a148042f6bc97ffc5f343
>         # save the attached .config to linux build tree
>         COMPILER_INSTALL_PATH=$HOME/0day GCC_VERSION=9.3.0 make.cross ARCH=s390
>
> If you fix the issue, kindly add the following tag as appropriate:
> Reported-by: kbuild test robot <lkp@intel.com>
>
> All errors (new ones prefixed by >>):
>
>    fs/gfs2/util.c: In function 'signal_our_withdraw':
> >> fs/gfs2/util.c:126:3: error: implicit declaration of function 'gfs2_glock_dq_wait'; did you mean 'gfs2_glock_nq_init'? [-Werror=implicit-function-declaration]
>      126 |   gfs2_glock_dq_wait(&sdp->sd_journal_gh);
>          |   ^~~~~~~~~~~~~~~~~~
>          |   gfs2_glock_nq_init
>    cc1: some warnings being treated as errors
commit "gfs2: Force withdraw to replay journals and wait for it to
finish" adds three new users of gfs2_glock_dq_wait in function
signal_our_withdraw. Is the waiting there done for a reason, or can we
replace gfs2_glock_dq_wait with gfs2_glock_dq / gfs2_glock_dq_uninit
in that function?
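
For reference, gfs2_glock_dq_wait and gfs2_glock_dq_uninit are both thin
wrappers around gfs2_glock_dq. From memory (so treat this as a sketch
rather than the exact fs/gfs2/glock.c source), they look roughly like:

    void gfs2_glock_dq_wait(struct gfs2_holder *gh)
    {
            struct gfs2_glock *gl = gh->gh_gl;

            gfs2_glock_dq(gh);
            might_sleep();
            /* block until dlm has actually demoted the lock */
            wait_on_bit(&gl->gl_flags, GLF_DEMOTE, TASK_UNINTERRUPTIBLE);
    }

    void gfs2_glock_dq_uninit(struct gfs2_holder *gh)
    {
            gfs2_glock_dq(gh);
            gfs2_holder_uninit(gh);
    }

so the question is really whether signal_our_withdraw depends on
GLF_DEMOTE being clear before it proceeds.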
Thanks,
Andreas
* [Cluster-devel] [gfs2:gfs2-iopen 12/12] fs/gfs2/util.c:126:3: error: implicit declaration of function 'gfs2_glock_dq_wait'; did you mean 'gfs2_glock_nq_init'?
From: Bob Peterson @ 2020-04-17 13:09 UTC
To: cluster-devel.redhat.com
----- Original Message -----
> Bob,
>
> commit "gfs2: Force withdraw to replay journals and wait for it to
> finish" adds three new users of gfs2_glock_dq_wait in function
> signal_our_withdraw. Is the waiting there done for a reason, or can we
> replace gfs2_glock_dq_wait with gfs2_glock_dq / gfs2_glock_dq_uninit
> in that function?
>
> Thanks,
> Andreas
Hi Andreas,
I remember debugging cases in which we needed to wait.
When we didn't wait for the glocks to be demoted, the node reacquired
them too quickly, before the other nodes could attempt recovery.
When the withdrawing node tried to reacquire them, DLM simply granted
them on the same node, which is the only node that must not do recovery.
Addressing each instance separately:
(1) We would dequeue our journal glock, then dequeue the live glock.
    The other nodes would see the callback for the "live" glock and
    attempt recovery, but they were denied access to the journal glock
    because it was still held on the recovering node. That's because
    the glock state machine (but more importantly the dlm) had not yet
    completely demoted the lock when this took place. So none of the
    nodes performed recovery.
(2) We might be able to get rid of the "wait" for the "live" glock.
    I can't think of a case in which that would behave badly, but
    bear in mind that it's been more than a year, probably closer to
    two, since I originally wrote that.
(3) We might be able to get rid of the third "wait", which is also
    for the "live" glock. I don't remember the circumstances.
TBH, I wouldn't feel comfortable getting rid of any of those waits
until I do some heavy experimentation on my 5-node cluster with the
recovery tests.
Bob
* [Cluster-devel] [gfs2:gfs2-iopen 12/12] fs/gfs2/util.c:126:3: error: implicit declaration of function
From: Andreas Gruenbacher @ 2020-04-17 14:27 UTC
To: cluster-devel.redhat.com
Hi Bob,
On Fri, Apr 17, 2020 at 3:09 PM Bob Peterson <rpeterso@redhat.com> wrote:
> ----- Original Message -----
> > Bob,
> >
> > commit "gfs2: Force withdraw to replay journals and wait for it to
> > finish" adds three new users of gfs2_glock_dq_wait in function
> > signal_our_withdraw. Is the waiting there done for a reason, or can we
> > replace gfs2_glock_dq_wait with gfs2_glock_dq / gfs2_glock_dq_uninit
> > in that function?
> >
> > Thanks,
> > Andreas
>
> Hi Andreas,
>
> I remember debugging cases in which we needed to wait.
> When we didn't wait for the glocks to be demoted, the node reacquired
> them too quickly, before the other nodes could attempt recovery.
> When the withdrawing node tried to reacquire them, DLM simply granted
> them on the same node, which is the only node that must not do recovery.
>
> Addressing each instance separately:
>
> (1) We would dequeue our journal glock, then dequeue the live glock.
>     The other nodes would see the callback for the "live" glock and
>     attempt recovery, but they were denied access to the journal glock
>     because it was still held on the recovering node. That's because
>     the glock state machine (but more importantly the dlm) had not yet
>     completely demoted the lock when this took place. So none of the
>     nodes performed recovery.
Hmm, sdp->sd_journal_gh has the GL_NOCACHE flag set, so the demote
should happen immediately. On the other hand, sdp->sd_live_gh doesn't
have that flag set, so that may be the actual problem here.
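
If I'm reading gfs2_glock_dq right, the GL_NOCACHE handling is roughly
this (quoting from memory, so double-check against fs/gfs2/glock.c):

    /* In gfs2_glock_dq(): a GL_NOCACHE holder queues an immediate
     * demote to UNLOCKED, so dlm drops the lock instead of leaving
     * it cached on this node. */
    if (gh->gh_flags & GL_NOCACHE)
            handle_callback(gl, LM_ST_UNLOCKED, 0, false);

Without GL_NOCACHE, the glock stays cached here after the dq, and dlm
can keep granting it locally.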
> (2) We might be able to get rid of the "wait" for the "live" glock.
>     I can't think of a case in which that would behave badly, but
>     bear in mind that it's been more than a year, probably closer to
>     two, since I originally wrote that.
> (3) We might be able to get rid of the third "wait", which is also
>     for the "live" glock. I don't remember the circumstances.
>
> TBH, I wouldn't feel comfortable getting rid of any of those waits
> until I do some heavy experimentation on my 5-node cluster with the
> recovery tests.
I agree that testing will be needed. What do you think of the below
patch?
Thanks,
Andreas
---
 fs/gfs2/ops_fstype.c |  2 +-
 fs/gfs2/util.c       | 16 ++++++++--------
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index e2b69ffcc6a8..98b2577b815b 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -405,7 +405,7 @@ static int init_locking(struct gfs2_sbd *sdp, struct gfs2_holder *mount_gh,
 	error = gfs2_glock_nq_num(sdp,
 				  GFS2_LIVE_LOCK, &gfs2_nondisk_glops,
 				  LM_ST_SHARED,
-				  LM_FLAG_NOEXP | GL_EXACT,
+				  LM_FLAG_NOEXP | GL_EXACT | GL_NOCACHE,
 				  &sdp->sd_live_gh);
 	if (error) {
 		fs_err(sdp, "can't acquire live glock: %d\n", error);
diff --git a/fs/gfs2/util.c b/fs/gfs2/util.c
index 9b64d40ab379..6e48cef79c53 100644
--- a/fs/gfs2/util.c
+++ b/fs/gfs2/util.c
@@ -122,10 +122,8 @@ static void signal_our_withdraw(struct gfs2_sbd *sdp)
 	/*
 	 * Drop the glock for our journal so another node can recover it.
 	 */
-	if (gfs2_holder_initialized(&sdp->sd_journal_gh)) {
-		gfs2_glock_dq_wait(&sdp->sd_journal_gh);
-		gfs2_holder_uninit(&sdp->sd_journal_gh);
-	}
+	if (gfs2_holder_initialized(&sdp->sd_journal_gh))
+		gfs2_glock_dq_uninit(&sdp->sd_journal_gh);
 	sdp->sd_jinode_gh.gh_flags |= GL_NOCACHE;
 	gfs2_glock_dq(&sdp->sd_jinode_gh);
 	if (test_bit(SDF_FS_FROZEN, &sdp->sd_flags)) {
@@ -167,7 +165,7 @@ static void signal_our_withdraw(struct gfs2_sbd *sdp)
 	 * Dequeue the "live" glock, but keep a reference so it's never freed.
 	 */
 	gfs2_glock_hold(gl);
-	gfs2_glock_dq_wait(&sdp->sd_live_gh);
+	gfs2_glock_dq(&sdp->sd_live_gh);
 	/*
 	 * We enqueue the "live" glock in EX so that all other nodes
 	 * get a demote request and act on it. We don't really want the
@@ -175,7 +173,8 @@ static void signal_our_withdraw(struct gfs2_sbd *sdp)
 	 */
 	fs_warn(sdp, "Requesting recovery of jid %d.\n",
 		sdp->sd_lockstruct.ls_jid);
-	gfs2_holder_reinit(LM_ST_EXCLUSIVE, LM_FLAG_TRY_1CB | LM_FLAG_NOEXP,
+	gfs2_holder_reinit(LM_ST_EXCLUSIVE,
+			   LM_FLAG_TRY_1CB | LM_FLAG_NOEXP | GL_NOCACHE,
 			   &sdp->sd_live_gh);
 	msleep(GL_GLOCK_MAX_HOLD);
 	/*
@@ -199,8 +198,9 @@ static void signal_our_withdraw(struct gfs2_sbd *sdp)
 		if (gfs2_recover_journal(sdp->sd_jdesc, 1))
 			fs_warn(sdp, "Unable to recover our journal jid %d.\n",
 				sdp->sd_lockstruct.ls_jid);
-		gfs2_glock_dq_wait(&sdp->sd_live_gh);
-		gfs2_holder_reinit(LM_ST_SHARED, LM_FLAG_NOEXP | GL_EXACT,
+		gfs2_glock_dq(&sdp->sd_live_gh);
+		gfs2_holder_reinit(LM_ST_SHARED,
+				   LM_FLAG_NOEXP | GL_EXACT | GL_NOCACHE,
 				   &sdp->sd_live_gh);
 		gfs2_glock_nq(&sdp->sd_live_gh);
 	}
base-commit: bd437e630fee648b79999e617d48bb07ae76f8eb
prerequisite-patch-id: e3b7fbfc1e67b4065cb6c928c29c7ed8bee0fc1c
prerequisite-patch-id: 35e9388872b2d6ffc70c4fc5497d3fdec97d6392
--
2.25.2
* [Cluster-devel] [gfs2:gfs2-iopen 12/12] fs/gfs2/util.c:126:3: error: implicit declaration of function
From: Bob Peterson @ 2020-04-17 14:35 UTC
To: cluster-devel.redhat.com
----- Original Message -----
> Hi Bob,
>
> Hmm, sdp->sd_journal_gh has the GL_NOCACHE flag set, so the demote
> should happen immediately. On the other hand, sdp->sd_live_gh doesn't
> have that flag set, so that may be the actual problem here.
>
> I agree that testing will be needed. What do you think of the below
> patch?
>
> Thanks,
> Andreas
Hi Andreas,
I'll test it out on my cluster with some recovery tests and let you
know how it goes.
Bob