From mboxrd@z Thu Jan 1 00:00:00 1970
From: Steven Whitehouse
Date: Fri, 15 Feb 2019 11:55:03 +0000
Subject: [Cluster-devel] [GFS2 PATCH 4/9] gfs2: Force withdraw to replay journals and wait for it to finish
In-Reply-To: <20190213152130.8047-5-rpeterso@redhat.com>
References: <20190213152130.8047-1-rpeterso@redhat.com> <20190213152130.8047-5-rpeterso@redhat.com>
Message-ID: <0dfef7af-96b8-32a2-25a6-b8524cd294ca@redhat.com>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

Hi,

On 13/02/2019 15:21, Bob Peterson wrote:
> When a node withdraws from a file system, it often leaves its journal
> in an incomplete state. This is especially true when the withdraw is
> caused by io errors writing to the journal. Before this patch, a
> withdraw would try to write a "shutdown" record to the journal, tell
> dlm it's done with the file system, and none of the other nodes
> know about the problem. Later, when the problem is fixed and the
> withdrawn node is rebooted, it would then discover that its own
> journal was incomplete, and replay it. However, replaying it at this
> point is almost guaranteed to introduce corruption because the other
> nodes are likely to have used affected resource groups that appeared
> in the journal since the time of the withdraw. Replaying the journal
> later will overwrite any changes made, and not through any fault of
> dlm, which was instructed during the withdraw to release those
> resources.
>
> This patch makes file system withdraws seen by the entire cluster.
> Withdrawing nodes dequeue their journal glock to allow recovery.
>
> The remaining nodes check all the journals to see if they are
> clean or in need of replay. They try to replay dirty journals, but
> only the journals of withdrawn nodes will be "not busy" and
> therefore available for replay.
>
> Until the journal replay is complete, no i/o related glocks may be
> given out, to ensure that the replay does not cause the
> aforementioned corruption: We cannot allow any journal replay to
> overwrite blocks associated with a glock once it is held. The
> glocks not affected by a withdraw are permitted to be passed
> around as normal during a withdraw. A new glops flag, called
> GLOF_OK_AT_WITHDRAW, indicates glocks that may be passed around
> freely while a withdraw is taking place.
>
> One such glock is the "live" glock which is now used to signal when
> a withdraw occurs. When a withdraw occurs, the node signals its
> withdraw by dequeueing the "live" glock and trying to enqueue it
> in EX mode, thus forcing the other nodes to all see a demote
> request, by way of a "1CB" (one callback) try lock. The "live"
> glock is not granted in EX; the callback is only just used to
> indicate a withdraw has occurred.
>
> Note that all nodes in the cluster must wait for the recovering
> node to finish replaying the withdrawing node's journal before
> continuing. To this end, it checks that the journals are clean
> multiple times in a retry loop.
>
> Signed-off-by: Bob Peterson

This new algorithm seems rather complicated, so it will need a lot of
careful testing I think. It would be good if there was some way to
simplify things a bit here.

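For reference, the enqueue rule described in the commit message can be
modelled in a few lines of user-space C. This is only a sketch of the
condition the patch adds to gfs2_glock_nq(), not kernel code: the struct
and function names are hypothetical, the LM_FLAG_NOEXP value is an
illustrative assumption, and only the GLOF_OK_AT_WITHDRAW value (8) is
taken from the incore.h hunk.

```c
#include <errno.h>
#include <stdbool.h>

/* Flag values: GLOF_OK_AT_WITHDRAW matches the incore.h hunk in the patch;
 * the LM_FLAG_NOEXP value is an assumption made for this sketch only. */
#define GLOF_OK_AT_WITHDRAW 8u
#define LM_FLAG_NOEXP       0x4u

/* Hypothetical stand-in for the state gfs2_glock_nq() consults. */
struct toy_glock_req {
	bool withdrawn;        /* models withdrawn(sdp) */
	unsigned int go_flags; /* models gl->gl_ops->go_flags */
	unsigned int gh_flags; /* models gh->gh_flags */
};

/*
 * Returns 0 if the enqueue may proceed, or -EIO if it is refused.
 * After a withdraw, only holders flagged LM_FLAG_NOEXP, or glocks whose
 * operations carry GLOF_OK_AT_WITHDRAW, are still allowed through.
 */
int toy_glock_nq_gate(const struct toy_glock_req *req)
{
	if (req->withdrawn &&
	    !(req->gh_flags & LM_FLAG_NOEXP) &&
	    !(req->go_flags & GLOF_OK_AT_WITHDRAW))
		return -EIO;
	return 0;
}
```

Written this way, the rule has three independent inputs, so a small truth
table covers every case that needs testing.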
> ---
>  fs/gfs2/glock.c      |  35 ++++++++--
>  fs/gfs2/glock.h      |   1 +
>  fs/gfs2/glops.c      |  61 +++++++++++++++++-
>  fs/gfs2/incore.h     |   6 ++
>  fs/gfs2/lock_dlm.c   |  32 ++++++++++
>  fs/gfs2/log.c        |  22 +++++--
>  fs/gfs2/meta_io.c    |   2 +-
>  fs/gfs2/ops_fstype.c |  48 ++------------
>  fs/gfs2/super.c      |  24 ++++---
>  fs/gfs2/super.h      |   1 +
>  fs/gfs2/util.c       | 148 ++++++++++++++++++++++++++++++++++++++++++-
>  fs/gfs2/util.h       |   3 +
>  12 files changed, 315 insertions(+), 68 deletions(-)
>
> diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
> index c6d6e478f5e3..20fb6cdf7829 100644
> --- a/fs/gfs2/glock.c
> +++ b/fs/gfs2/glock.c
> @@ -242,7 +242,8 @@ static void __gfs2_glock_put(struct gfs2_glock *gl)
>  	gfs2_glock_remove_from_lru(gl);
>  	spin_unlock(&gl->gl_lockref.lock);
>  	GLOCK_BUG_ON(gl, !list_empty(&gl->gl_holders));
> -	GLOCK_BUG_ON(gl, mapping && mapping->nrpages);
> +	GLOCK_BUG_ON(gl, mapping && mapping->nrpages &&
> +		     !test_bit(SDF_SHUTDOWN, &sdp->sd_flags));
>  	trace_gfs2_glock_put(gl);
>  	sdp->sd_lockstruct.ls_ops->lm_put_lock(gl);
>  }
> @@ -543,6 +544,8 @@ __acquires(&gl->gl_lockref.lock)
>  	int ret;
>
>  	if (unlikely(withdrawn(sdp)) &&
> +	    !(glops->go_flags & GLOF_OK_AT_WITHDRAW) &&
> +	    (gh && !(LM_FLAG_NOEXP & gh->gh_flags)) &&
>  	    target != LM_ST_UNLOCKED)
>  		return;
>  	lck_flags &= (LM_FLAG_TRY | LM_FLAG_TRY_1CB | LM_FLAG_NOEXP |
> @@ -561,9 +564,10 @@ __acquires(&gl->gl_lockref.lock)
>  	    (lck_flags & (LM_FLAG_TRY|LM_FLAG_TRY_1CB)))
>  		clear_bit(GLF_BLOCKING, &gl->gl_flags);
>  	spin_unlock(&gl->gl_lockref.lock);
> -	if (glops->go_sync)
> +	if (glops->go_sync && !test_bit(SDF_SHUTDOWN, &sdp->sd_flags))
>  		glops->go_sync(gl);
> -	if (test_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags))
> +	if (test_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags) &&
> +	    !test_bit(SDF_SHUTDOWN, &sdp->sd_flags))
>  		glops->go_inval(gl, target == LM_ST_DEFERRED ? 0 : DIO_METADATA);
>  	clear_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags);
>
> @@ -1091,7 +1095,8 @@ int gfs2_glock_nq(struct gfs2_holder *gh)
>  	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
>  	int error = 0;
>
> -	if (unlikely(withdrawn(sdp)))
> +	if (unlikely(withdrawn(sdp) && !(LM_FLAG_NOEXP & gh->gh_flags) &&
> +		     !(gl->gl_ops->go_flags & GLOF_OK_AT_WITHDRAW)))
>  		return -EIO;
>
>  	if (test_bit(GLF_LRU, &gl->gl_flags))
> @@ -1135,11 +1140,28 @@ int gfs2_glock_poll(struct gfs2_holder *gh)
>  void gfs2_glock_dq(struct gfs2_holder *gh)
>  {
>  	struct gfs2_glock *gl = gh->gh_gl;
> +	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
>  	const struct gfs2_glock_operations *glops = gl->gl_ops;
>  	unsigned delay = 0;
>  	int fast_path = 0;
>
>  	spin_lock(&gl->gl_lockref.lock);
> +	/**
> +	 * If we're in the process of file system withdraw, we cannot just
> +	 * dequeue any glocks until our journal is recovered, lest we
> +	 * introduce file system corruption. We need to exceptions to this
> +	 * rule: (1) We need to allow unlocking of nondisk glocks and the
> +	 * glock for our own journal that needs recovery.
> +	 */
> +	if (test_bit(SDF_SHUTDOWN, &sdp->sd_flags) &&
> +	    test_bit(SDF_WITHDRAW_RECOVERY, &sdp->sd_flags) &&
> +	    !(gl->gl_ops->go_flags & GLOF_OK_AT_WITHDRAW) &&
> +	    gh != &sdp->sd_jinode_gh) {
> +		sdp->sd_glock_dqs_held++;
> +		might_sleep();
> +		wait_on_bit(&sdp->sd_flags, SDF_WITHDRAW_RECOVERY,
> +			    TASK_UNINTERRUPTIBLE);
> +	}
>  	if (gh->gh_flags & GL_NOCACHE)
>  		handle_callback(gl, LM_ST_UNLOCKED, 0, false);
>
> @@ -1619,6 +1641,11 @@ static void dump_glock_func(struct gfs2_glock *gl)
>  	dump_glock(NULL, gl);
>  }
>
> +void gfs2_gl_flushwork(struct gfs2_sbd *sdp)
> +{
> +	flush_workqueue(glock_workqueue);
> +}
> +
>  /**
>   * gfs2_gl_hash_clear - Empty out the glock hash table
>   * @sdp: the filesystem
> diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
> index 936b3295839c..c1c40e2dbd96 100644
> --- a/fs/gfs2/glock.h
> +++ b/fs/gfs2/glock.h
> @@ -202,6 +202,7 @@ extern int gfs2_glock_nq_num(struct gfs2_sbd *sdp, u64 number,
>  			     struct gfs2_holder *gh);
>  extern int gfs2_glock_nq_m(unsigned int num_gh, struct gfs2_holder *ghs);
>  extern void gfs2_glock_dq_m(unsigned int num_gh, struct gfs2_holder *ghs);
> +extern void gfs2_gl_flushwork(struct gfs2_sbd *sdp);
>  extern void gfs2_dump_glock(struct seq_file *seq, struct gfs2_glock *gl);
>  #define GLOCK_BUG_ON(gl,x) do { if (unlikely(x)) { gfs2_dump_glock(NULL, gl); BUG(); } } while(0)
>  extern __printf(2, 3)
> diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
> index 4b0e52bf5825..f372a6f169a2 100644
> --- a/fs/gfs2/glops.c
> +++ b/fs/gfs2/glops.c
> @@ -32,6 +32,8 @@
>
>  struct workqueue_struct *gfs2_freeze_wq;
>
> +extern struct workqueue_struct *gfs2_control_wq;
> +
>  static void gfs2_ail_error(struct gfs2_glock *gl, const struct buffer_head *bh)
>  {
>  	fs_err(gl->gl_name.ln_sbd,
> @@ -396,6 +398,7 @@ static int gfs2_dinode_in(struct gfs2_inode *ip, const void *buf)
>  	return 0;
>  corrupt:
>  	gfs2_consist_inode(ip);
> +	printk("gah2");
>  	return -EIO;
>  }
>
> @@ -584,8 +587,58 @@ static void iopen_go_callback(struct gfs2_glock *gl, bool remote)
>  	}
>  }
>
> +/**
> + * nondisk_go_callback - used to signal when a node did a withdraw
> + * @gl: the nondisk glock
> + * @remote: true if this came from a different cluster node
> + *
> + */
> +static void nondisk_go_callback(struct gfs2_glock *gl, bool remote)
> +{
> +	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
> +
> +	/* Ignore the callback unless it's from another node, and it's the
> +	   live lock. */
> +	if (!remote || gl->gl_name.ln_number != GFS2_LIVE_LOCK)
> +		return;
> +
> +	/* First order of business is to cancel the demote request. We don't
> +	 * really want to demote a nondisk glock. At best it's just to inform
> +	 * us of a another node's withdraw. We'll keep it in SH mode. */
> +	clear_bit(GLF_DEMOTE, &gl->gl_flags);
> +	clear_bit(GLF_PENDING_DEMOTE, &gl->gl_flags);
> +
> +	/* Ignore the unlock if we're withdrawn, unmounting, or in recovery. */
> +	if (test_bit(SDF_NORECOVERY, &sdp->sd_flags) ||
> +	    test_bit(SDF_SHUTDOWN, &sdp->sd_flags) ||
> +	    test_bit(SDF_REMOTE_WITHDRAW, &sdp->sd_flags))
> +		return;
> +
> +	/* We only care when a node wants us to unlock, because that means
> +	 * they want a journal recovered. */
> +	if (gl->gl_demote_state != LM_ST_UNLOCKED)
> +		return;
> +
> +	if (sdp->sd_args.ar_spectator) {
> +		fs_warn(sdp, "Spectator node cannot recover journals.\n");
> +		return;
> +	}
> +
> +	fs_warn(sdp, "Some node has withdrawn; checking for recovery.\n");
> +	set_bit(SDF_REMOTE_WITHDRAW, &sdp->sd_flags);
> +	/**
> +	 * We can't call remote_withdraw directly here or gfs2_recover_journal
> +	 * because this is called from the glock unlock function and the
> +	 * remote_withdraw needs to enqueue and dequeue the same "live" glock
> +	 * we were called from. So we queue it to the control work queue in
> +	 * lock_dlm.
> +	 */
> +	queue_delayed_work(gfs2_control_wq, &sdp->sd_control_work, 0);
> +}
> +
>  const struct gfs2_glock_operations gfs2_meta_glops = {
>  	.go_type = LM_TYPE_META,
> +	.go_flags = GLOF_OK_AT_WITHDRAW,
>  };
>
>  const struct gfs2_glock_operations gfs2_inode_glops = {
> @@ -613,6 +666,7 @@ const struct gfs2_glock_operations gfs2_freeze_glops = {
>  	.go_xmote_bh = freeze_go_xmote_bh,
>  	.go_demote_ok = freeze_go_demote_ok,
>  	.go_type = LM_TYPE_NONDISK,
> +	.go_flags = GLOF_OK_AT_WITHDRAW,
>  };
>
>  const struct gfs2_glock_operations gfs2_iopen_glops = {
> @@ -623,20 +677,23 @@ const struct gfs2_glock_operations gfs2_iopen_glops = {
>
>  const struct gfs2_glock_operations gfs2_flock_glops = {
>  	.go_type = LM_TYPE_FLOCK,
> -	.go_flags = GLOF_LRU,
> +	.go_flags = GLOF_LRU | GLOF_OK_AT_WITHDRAW,
>  };
>
>  const struct gfs2_glock_operations gfs2_nondisk_glops = {
>  	.go_type = LM_TYPE_NONDISK,
> +	.go_callback = nondisk_go_callback,
> +	.go_flags = GLOF_OK_AT_WITHDRAW,
>  };
>
>  const struct gfs2_glock_operations gfs2_quota_glops = {
>  	.go_type = LM_TYPE_QUOTA,
> -	.go_flags = GLOF_LVB | GLOF_LRU,
> +	.go_flags = GLOF_LVB | GLOF_LRU | GLOF_OK_AT_WITHDRAW,
>  };
>
>  const struct gfs2_glock_operations gfs2_journal_glops = {
>  	.go_type = LM_TYPE_JOURNAL,
> +	.go_flags = GLOF_OK_AT_WITHDRAW,
>  };
>
>  const struct gfs2_glock_operations *gfs2_glops_list[] = {
> diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
> index 8380d4db8be6..2ddae1326ce2 100644
> --- a/fs/gfs2/incore.h
> +++ b/fs/gfs2/incore.h
> @@ -250,6 +250,7 @@ struct gfs2_glock_operations {
>  #define GLOF_ASPACE 1
>  #define GLOF_LVB 2
>  #define GLOF_LRU 4
> +#define GLOF_OK_AT_WITHDRAW 8
>  };
>
>  enum {
> @@ -622,6 +623,9 @@ enum {
>  	SDF_FORCE_AIL_FLUSH = 9,
>  	SDF_AIL1_IO_ERROR = 10,
>  	SDF_PENDING_WITHDRAW = 11, /* Will withdraw eventually */
> +	SDF_REMOTE_WITHDRAW = 12, /* Performing remote recovery */
> +	SDF_WITHDRAW_RECOVERY = 13, /* Wait for journal recovery when we are
> +				       withdrawing */
>  };
>
>  enum gfs2_freeze_state {
> @@ -770,6 +774,7 @@ struct gfs2_sbd {
>  	struct gfs2_jdesc *sd_jdesc;
>  	struct gfs2_holder sd_journal_gh;
>  	struct gfs2_holder sd_jinode_gh;
> +	struct gfs2_glock *sd_jinode_gl;
>
>  	struct gfs2_holder sd_sc_gh;
>  	struct gfs2_holder sd_qc_gh;
> @@ -854,6 +859,7 @@ struct gfs2_sbd {
>
>  	unsigned long sd_last_warning;
>  	struct dentry *debugfs_dir; /* debugfs directory */
> +	unsigned long sd_glock_dqs_held;
>  };
>
>  static inline void gfs2_glstats_inc(struct gfs2_glock *gl, int which)
> diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
> index d2cb2fe1c3f3..619d7a0e8ac1 100644
> --- a/fs/gfs2/lock_dlm.c
> +++ b/fs/gfs2/lock_dlm.c
> @@ -19,6 +19,8 @@
>
>  #include "incore.h"
>  #include "glock.h"
> +#include "glops.h"
> +#include "recovery.h"
>  #include "util.h"
>  #include "sys.h"
>  #include "trace_gfs2.h"
> @@ -325,6 +327,7 @@ static void gdlm_cancel(struct gfs2_glock *gl)
>  /*
>   * dlm/gfs2 recovery coordination using dlm_recover callbacks
>   *
> + * 0. gfs2 checks for another cluster node withdraw, needing journal replay
>   * 1. dlm_controld sees lockspace members change
>   * 2. dlm_controld blocks dlm-kernel locking activity
>   * 3. dlm_controld within dlm-kernel notifies gfs2 (recover_prep)
> @@ -573,6 +576,28 @@ static int control_lock(struct gfs2_sbd *sdp, int mode, uint32_t flags)
>  			 &ls->ls_control_lksb, "control_lock");
>  }
>
> +/**
> + * remote_withdraw - react to a node withdrawing from the file system
> + * @sdp: The superblock
> + */
> +static void remote_withdraw(struct gfs2_sbd *sdp)
> +{
> +	struct gfs2_jdesc *jd;
> +	int ret = 0, count = 0;
> +
> +	list_for_each_entry(jd, &sdp->sd_jindex_list, jd_list) {
> +		if (jd->jd_jid == sdp->sd_lockstruct.ls_jid)
> +			continue;
> +		ret = gfs2_recover_journal(jd, true);
> +		if (ret)
> +			break;
> +		count++;
> +	}
> +
> +	/* Now drop the additional reference we acquired */
> +	fs_err(sdp, "Journals checked: %d, ret = %d.\n", count, ret);
> +}
> +
>  static void gfs2_control_func(struct work_struct *work)
>  {
>  	struct gfs2_sbd *sdp = container_of(work, struct gfs2_sbd, sd_control_work.work);
> @@ -583,6 +608,13 @@ static void gfs2_control_func(struct work_struct *work)
>  	int recover_size;
>  	int i, error;
>
> +	/* First check for other nodes that may have done a withdraw. */
> +	if (test_bit(SDF_REMOTE_WITHDRAW, &sdp->sd_flags)) {
> +		remote_withdraw(sdp);
> +		clear_bit(SDF_REMOTE_WITHDRAW, &sdp->sd_flags);
> +		return;
> +	}
> +
>  	spin_lock(&ls->ls_recover_spin);
>  	/*
>  	 * No MOUNT_DONE means we're still mounting; control_mount()
> diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
> index ec8675113b0d..81550038ace3 100644
> --- a/fs/gfs2/log.c
> +++ b/fs/gfs2/log.c
> @@ -107,7 +107,7 @@ __acquires(&sdp->sd_ail_lock)
>  		gfs2_assert(sdp, bd->bd_tr == tr);
>
>  		if (!buffer_busy(bh)) {
> -			if (!buffer_uptodate(bh) &&
> +			if (!buffer_uptodate(bh) && !withdrawn(sdp) &&
>  			    !test_and_set_bit(SDF_AIL1_IO_ERROR,
>  					      &sdp->sd_flags)) {
>  				gfs2_io_error_bh(sdp, bh);
> @@ -205,7 +205,7 @@ static void gfs2_ail1_empty_one(struct gfs2_sbd *sdp, struct gfs2_trans *tr)
>  		gfs2_assert(sdp, bd->bd_tr == tr);
>  		if (buffer_busy(bh))
>  			continue;
> -		if (!buffer_uptodate(bh) &&
> +		if (!buffer_uptodate(bh) && !withdrawn(sdp) &&
>  		    !test_and_set_bit(SDF_AIL1_IO_ERROR, &sdp->sd_flags)) {
>  			gfs2_io_error_bh(sdp, bh);
>  			set_bit(SDF_PENDING_WITHDRAW, &sdp->sd_flags);
> @@ -747,6 +747,10 @@ static void log_write_header(struct gfs2_sbd *sdp, u32 flags)
>  	int op_flags = REQ_PREFLUSH | REQ_FUA | REQ_META | REQ_SYNC;
>  	enum gfs2_freeze_state state = atomic_read(&sdp->sd_freeze_state);
>
> +	if (test_bit(SDF_SHUTDOWN, &sdp->sd_flags)) {
> +		log_flush_wait(sdp);
> +		return;
> +	}
>  	gfs2_assert_withdraw(sdp, (state != SFS_FROZEN));
>  	tail = current_tail(sdp);
>
> @@ -776,6 +780,8 @@ void gfs2_log_flush(struct gfs2_sbd *sdp, struct gfs2_glock *gl, u32 flags)
>  	struct gfs2_trans *tr;
>  	enum gfs2_freeze_state state = atomic_read(&sdp->sd_freeze_state);
>
> +	if (!test_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags))
> +		return;
>  	down_write(&sdp->sd_log_flush_lock);
>
>  	/* Log might have been flushed while we waited for the flush lock */
> @@ -1003,8 +1009,10 @@ int gfs2_logd(void *data)
>  		did_flush = false;
>  		if (gfs2_jrnl_flush_reqd(sdp) || t == 0) {
>  			gfs2_ail1_empty(sdp);
> -			gfs2_log_flush(sdp, NULL, GFS2_LOG_HEAD_FLUSH_NORMAL |
> -				       GFS2_LFC_LOGD_JFLUSH_REQD);
> +			if (test_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags))
> +				gfs2_log_flush(sdp, NULL,
> +					       GFS2_LOG_HEAD_FLUSH_NORMAL |
> +					       GFS2_LFC_LOGD_JFLUSH_REQD);
>  			did_flush = true;
>  		}
>
> @@ -1012,8 +1020,10 @@ int gfs2_logd(void *data)
>  			gfs2_ail1_start(sdp);
>  			gfs2_ail1_wait(sdp);
>  			gfs2_ail1_empty(sdp);
> -			gfs2_log_flush(sdp, NULL, GFS2_LOG_HEAD_FLUSH_NORMAL |
> -				       GFS2_LFC_LOGD_AIL_FLUSH_REQD);
> +			if (test_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags))
> +				gfs2_log_flush(sdp, NULL,
> +					       GFS2_LOG_HEAD_FLUSH_NORMAL |
> +					       GFS2_LFC_LOGD_AIL_FLUSH_REQD);
>  			did_flush = true;
>  		}
>
> diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
> index 97c161782763..39a6cc84a908 100644
> --- a/fs/gfs2/meta_io.c
> +++ b/fs/gfs2/meta_io.c
> @@ -254,7 +254,7 @@ int gfs2_meta_read(struct gfs2_glock *gl, u64 blkno, int flags,
>  	struct buffer_head *bh, *bhs[2];
>  	int num = 0;
>
> -	if (unlikely(withdrawn(sdp))) {
> +	if (unlikely(withdrawn(sdp)) && gl != sdp->sd_jinode_gl) {
>  		*bhp = NULL;
>  		return -EIO;
>  	}
> diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
> index 402201978312..650e841f2e44 100644
> --- a/fs/gfs2/ops_fstype.c
> +++ b/fs/gfs2/ops_fstype.c
> @@ -591,48 +591,6 @@ static int gfs2_jindex_hold(struct gfs2_sbd *sdp, struct gfs2_holder *ji_gh)
>  	return error;
>  }
>
> -/**
> - * check_journal_clean - Make sure a journal is clean for a spectator mount
> - * @sdp: The GFS2 superblock
> - * @jd: The journal descriptor
> - *
> - * Returns: 0 if the journal is clean or locked, else an error
> - */
> -static int check_journal_clean(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd)
> -{
> -	int error;
> -	struct gfs2_holder j_gh;
> -	struct gfs2_log_header_host head;
> -	struct gfs2_inode *ip;
> -
> -	ip = GFS2_I(jd->jd_inode);
> -	error = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_NOEXP |
> -				   GL_EXACT | GL_NOCACHE, &j_gh);
> -	if (error) {
> -		fs_err(sdp, "Error locking journal for spectator mount.\n");
> -		return -EPERM;
> -	}
> -	error = gfs2_jdesc_check(jd);
> -	if (error) {
> -		fs_err(sdp, "Error checking journal for spectator mount.\n");
> -		goto out_unlock;
> -	}
> -	error = gfs2_find_jhead(jd, &head);
> -	if (error) {
> -		fs_err(sdp, "Error parsing journal for spectator mount.\n");
> -		goto out_unlock;
> -	}
> -	if (!(head.lh_flags & GFS2_LOG_HEAD_UNMOUNT)) {
> -		error = -EPERM;
> -		fs_err(sdp, "jid=%u: Journal is dirty, so the first mounter "
> -		       "must not be a spectator.\n", jd->jd_jid);
> -	}
> -
> -out_unlock:
> -	gfs2_glock_dq_uninit(&j_gh);
> -	return error;
> -}
> -
>  static int init_journal(struct gfs2_sbd *sdp, int undo)
>  {
>  	struct inode *master = d_inode(sdp->sd_master_dir);
> @@ -685,7 +643,8 @@ static int init_journal(struct gfs2_sbd *sdp, int undo)
>
>  		error = gfs2_glock_nq_num(sdp, sdp->sd_lockstruct.ls_jid,
>  					  &gfs2_journal_glops,
> -					  LM_ST_EXCLUSIVE, LM_FLAG_NOEXP,
> +					  LM_ST_EXCLUSIVE,
> +					  LM_FLAG_NOEXP | GL_NOCACHE,
>  					  &sdp->sd_journal_gh);
>  		if (error) {
>  			fs_err(sdp, "can't acquire journal glock: %d\n", error);
> @@ -693,6 +652,7 @@ static int init_journal(struct gfs2_sbd *sdp, int undo)
>  		}
>
>  		ip = GFS2_I(sdp->sd_jdesc->jd_inode);
> +		sdp->sd_jinode_gl = ip->i_gl;
>  		error = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED,
>  					   LM_FLAG_NOEXP | GL_EXACT | GL_NOCACHE,
>  					   &sdp->sd_jinode_gh);
> @@ -723,7 +683,7 @@ static int init_journal(struct gfs2_sbd *sdp, int undo)
>  			struct gfs2_jdesc *jd = gfs2_jdesc_find(sdp, x);
>
>  			if (sdp->sd_args.ar_spectator) {
> -				error = check_journal_clean(sdp, jd);
> +				error = check_journal_clean(sdp, jd, true);
>  				if (error)
>  					goto fail_jinode_gh;
>  				continue;
> diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
> index 8033f24e0ad0..ebb11165a1b1 100644
> --- a/fs/gfs2/super.c
> +++ b/fs/gfs2/super.c
> @@ -841,11 +841,12 @@ static void gfs2_dirty_inode(struct inode *inode, int flags)
>  /**
>   * gfs2_make_fs_ro - Turn a Read-Write FS into a Read-Only one
>   * @sdp: the filesystem
> + * @withdrawing: if 1, we're withdrawing so only do what's necessary
>   *
>   * Returns: errno
>   */
>
> -static int gfs2_make_fs_ro(struct gfs2_sbd *sdp)
> +int gfs2_make_fs_ro(struct gfs2_sbd *sdp, int withdrawing)
>  {
>  	struct gfs2_holder freeze_gh;
>  	int error;
> @@ -859,11 +860,12 @@ static int gfs2_make_fs_ro(struct gfs2_sbd *sdp)
>  	kthread_stop(sdp->sd_quotad_process);
>  	kthread_stop(sdp->sd_logd_process);
>
> -	gfs2_quota_sync(sdp->sd_vfs, 0);
> -	gfs2_statfs_sync(sdp->sd_vfs, 0);
> -
> -	gfs2_log_flush(sdp, NULL, GFS2_LOG_HEAD_FLUSH_SHUTDOWN |
> -		       GFS2_LFC_MAKE_FS_RO);
> +	if (!withdrawing) {
> +		gfs2_quota_sync(sdp->sd_vfs, 0);
> +		gfs2_statfs_sync(sdp->sd_vfs, 0);
> +		gfs2_log_flush(sdp, NULL, GFS2_LOG_HEAD_FLUSH_SHUTDOWN |
> +			       GFS2_LFC_MAKE_FS_RO);
> +	}
>  	wait_event(sdp->sd_reserving_log_wait, atomic_read(&sdp->sd_reserving_log) == 0);
>  	gfs2_assert_warn(sdp, atomic_read(&sdp->sd_log_blks_free) == sdp->sd_jdesc->jd_blocks);
>
> @@ -905,7 +907,7 @@ static void gfs2_put_super(struct super_block *sb)
>  	spin_unlock(&sdp->sd_jindex_spin);
>
>  	if (!sb_rdonly(sb)) {
> -		error = gfs2_make_fs_ro(sdp);
> +		error = gfs2_make_fs_ro(sdp, 0);
>  		if (error)
>  			gfs2_io_error(sdp);
>  	}
> @@ -922,8 +924,10 @@ static void gfs2_put_super(struct super_block *sb)
>  	gfs2_glock_put(sdp->sd_freeze_gl);
>
>  	if (!sdp->sd_args.ar_spectator) {
> -		gfs2_glock_dq_uninit(&sdp->sd_journal_gh);
> -		gfs2_glock_dq_uninit(&sdp->sd_jinode_gh);
> +		if (gfs2_holder_initialized(&sdp->sd_journal_gh))
> +			gfs2_glock_dq_uninit(&sdp->sd_journal_gh);
> +		if (gfs2_holder_initialized(&sdp->sd_jinode_gh))
> +			gfs2_glock_dq_uninit(&sdp->sd_jinode_gh);
>  		gfs2_glock_dq_uninit(&sdp->sd_sc_gh);
>  		gfs2_glock_dq_uninit(&sdp->sd_qc_gh);
>  		iput(sdp->sd_sc_inode);
> @@ -1271,7 +1275,7 @@ static int gfs2_remount_fs(struct super_block *sb, int *flags, char *data)
>
>  	if ((sb->s_flags ^ *flags) & SB_RDONLY) {
>  		if (*flags & SB_RDONLY)
> -			error = gfs2_make_fs_ro(sdp);
> +			error = gfs2_make_fs_ro(sdp, 0);
>  		else
>  			error = gfs2_make_fs_rw(sdp);
>  		if (error)
> diff --git a/fs/gfs2/super.h b/fs/gfs2/super.h
> index 73c97dccae21..e859c6d5bb3e 100644
> --- a/fs/gfs2/super.h
> +++ b/fs/gfs2/super.h
> @@ -45,6 +45,7 @@ extern void gfs2_statfs_change_in(struct gfs2_statfs_change_host *sc,
>  extern void update_statfs(struct gfs2_sbd *sdp, struct buffer_head *m_bh,
>  			  struct buffer_head *l_bh);
>  extern int gfs2_statfs_sync(struct super_block *sb, int type);
> +extern int gfs2_make_fs_ro(struct gfs2_sbd *sdp, int withdrawing);
>  extern void gfs2_freeze_func(struct work_struct *work);
>
>  extern struct file_system_type gfs2_fs_type;
> diff --git a/fs/gfs2/util.c b/fs/gfs2/util.c
> index ca6de80b5e8b..75f67284bba8 100644
> --- a/fs/gfs2/util.c
> +++ b/fs/gfs2/util.c
> @@ -14,12 +14,17 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>
>  #include "gfs2.h"
>  #include "incore.h"
>  #include "glock.h"
> +#include "log.h"
> +#include "lops.h"
> +#include "recovery.h"
>  #include "rgrp.h"
> +#include "super.h"
>  #include "util.h"
>
>  struct kmem_cache *gfs2_glock_cachep __read_mostly;
> @@ -36,6 +41,145 @@ void gfs2_assert_i(struct gfs2_sbd *sdp)
>  	fs_emerg(sdp, "fatal assertion failed\n");
>  }
>
> +/**
> + * check_journal_clean - Make sure a journal is clean for a spectator mount
> + * @sdp: The GFS2 superblock
> + * @jd: The journal descriptor
> + *
> + * Returns: 0 if the journal is clean or locked, else an error
> + */
> +int check_journal_clean(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd,
> +			bool verbose)
> +{
> +	int error;
> +	struct gfs2_log_header_host head;
> +	struct gfs2_inode *ip;
> +
> +	ip = GFS2_I(jd->jd_inode);
> +	error = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_NOEXP |
> +				   GL_EXACT | GL_NOCACHE, &sdp->sd_jinode_gh);
> +	if (error) {
> +		if (verbose)
> +			fs_err(sdp, "Error %d locking journal for spectator "
> +			       "mount.\n", error);
> +		return -EPERM;
> +	}
> +	error = gfs2_jdesc_check(jd);
> +	if (error) {
> +		if (verbose)
> +			fs_err(sdp, "Error checking journal for spectator "
> +			       "mount.\n");
> +		goto out_unlock;
> +	}
> +	error = gfs2_find_jhead(jd, &head);
> +	if (error) {
> +		if (verbose)
> +			fs_err(sdp, "Error parsing journal for spectator "
> +			       "mount.\n");
> +		goto out_unlock;
> +	}
> +	if (!(head.lh_flags & GFS2_LOG_HEAD_UNMOUNT)) {
> +		error = -EPERM;
> +		if (verbose)
> +			fs_err(sdp, "jid=%u: Journal is dirty, so the first "
> +			       "mounter must not be a spectator.\n",
> +			       jd->jd_jid);
> +	}
> +
> +out_unlock:
> +	gfs2_glock_dq_uninit(&sdp->sd_jinode_gh);
> +	return error;
> +}
> +
> +static void signal_our_withdraw(struct gfs2_sbd *sdp)
> +{
> +	struct gfs2_glock *gl = sdp->sd_live_gh.gh_gl;
> +	int ret = 0;
> +	int tries;
> +
> +	/* Prevent any glock dq until withdraw recovery is complete */
> +	set_bit(SDF_WITHDRAW_RECOVERY, &sdp->sd_flags);
> +	/**
> +	 * Don't tell dlm we're bailing until we have no more buffers in the
> +	 * wind. If journal had an IO error, the log code should just purge
> +	 * the outstanding buffers rather than submitting new IO. Making the
> +	 * file system read-only will flush the journal, etc.
> +	 *
> +	 * During a normal unmount, gfs2_make_fs_ro calls gfs2_log_shutdown
> +	 * which clears SDF_JOURNAL_LIVE. In a withdraw, we cannot write
> +	 * any UNMOUNT log header, so we can't call gfs2_log_shutdown, and
> +	 * therefore we need to clear SDF_JOURNAL_LIVE manually.
> +	 */
> +	clear_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags);
> +	ret = gfs2_make_fs_ro(sdp, 1);
> +	sdp->sd_vfs->s_flags |= SB_RDONLY;
> +
> +	/* Drop the glock for our journal so another node can recover it. */
> +	gfs2_glock_dq_wait(&sdp->sd_journal_gh);
> +	gfs2_holder_uninit(&sdp->sd_journal_gh);
> +	sdp->sd_jinode_gh.gh_flags |= GL_NOCACHE;
> +	gfs2_glock_dq_wait(&sdp->sd_jinode_gh);
> +	/* holder_uninit to force glock_put, to force dlm to let go */
> +	gfs2_holder_uninit(&sdp->sd_jinode_gh);
> +	gfs2_jindex_free(sdp);
> +	/* Flush the glock work so the glock is freed. This allows try locks
> +	 * on other nodes to be successful, otherwise we remain the owner of
> +	 * the glock until the workqueue gets around to running. */
> +	gfs2_gl_flushwork(sdp);
> +
> +	if (sdp->sd_lockstruct.ls_ops->lm_lock == NULL) /* lock_nolock */
> +		goto skip_recovery;
> +	/**
> +	 * Dequeue the "live" glock, but keep a reference so it's never freed.
> +	 */
> +	gfs2_glock_hold(gl);
> +	gfs2_glock_dq_wait(&sdp->sd_live_gh);
> +	/**
> +	 * We enqueue the "live" glock in EX so that all other nodes
> +	 * get a demote request and act on it, demoting their glock
> +	 * from SHARED to UNLOCKED. Once we have the glock in EX, we
> +	 * know all other nodes have been informed of our departure.
> +	 * They cannot do anything more until our journal has been
> +	 * replayed and our locks released.
> +	 */
> +	fs_warn(sdp, "Requesting recovery of jid %d.\n",
> +		sdp->sd_lockstruct.ls_jid);
> +	gfs2_holder_reinit(LM_ST_EXCLUSIVE, LM_FLAG_TRY_1CB | LM_FLAG_NOEXP,
> +			   &sdp->sd_live_gh);
> +	msleep(GL_GLOCK_MAX_HOLD);

What is this delay for?

> +	/* This will likely fail in a cluster, but succeed stand-alone: */
> +	ret = gfs2_glock_nq(&sdp->sd_live_gh);
> +	if (ret == 0) {
> +		gfs2_glock_dq_wait(&sdp->sd_live_gh);
> +		gfs2_holder_reinit(LM_ST_SHARED, LM_FLAG_NOEXP | GL_EXACT,
> +				   &sdp->sd_live_gh);
> +		gfs2_glock_nq(&sdp->sd_live_gh);
> +	}
> +	/* Now drop the additional reference we acquired */
> +	gfs2_glock_queue_put(gl);
> +	clear_bit(SDF_WITHDRAW_RECOVERY, &sdp->sd_flags);
> +
> +	/* Now wait until recovery complete. */
> +	for (tries = 0; tries < 10; tries++) {
> +		ret = check_journal_clean(sdp, sdp->sd_jdesc, false);
> +		if (!ret)
> +			break;
> +		msleep(HZ);
> +		fs_warn(sdp, "Waiting for journal recovery jid %d.\n",
> +			sdp->sd_lockstruct.ls_jid);
> +	}
> +skip_recovery:
> +	if (!ret)
> +		fs_warn(sdp, "Journal recovery complete for jid %d.\n",
> +			sdp->sd_lockstruct.ls_jid);
> +	else
> +		fs_warn(sdp, "Journal recovery skipped for %d until next "
> +			"mount.\n", sdp->sd_lockstruct.ls_jid);
> +	fs_warn(sdp, "Glock dequeues delayed: %lu\n", sdp->sd_glock_dqs_held);
> +	sdp->sd_glock_dqs_held = 0;
> +	wake_up_bit(&sdp->sd_flags, SDF_WITHDRAW_RECOVERY);
> +}
> +
>  int gfs2_lm_withdraw(struct gfs2_sbd *sdp, const char *fmt, ...)
>  {
>  	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
> @@ -63,6 +207,8 @@ int gfs2_lm_withdraw(struct gfs2_sbd *sdp, const char *fmt, ...)
>  		fs_err(sdp, "about to withdraw this file system\n");
>  		BUG_ON(sdp->sd_args.ar_debug);
>
> +		signal_our_withdraw(sdp);
> +
>  		kobject_uevent(&sdp->sd_kobj, KOBJ_OFFLINE);
>
>  		if (!strcmp(sdp->sd_lockstruct.ls_ops->lm_proto_name, "lock_dlm"))
> @@ -73,7 +219,7 @@ int gfs2_lm_withdraw(struct gfs2_sbd *sdp, const char *fmt, ...)
>  			lm->lm_unmount(sdp);
>  		}
>  		set_bit(SDF_SKIP_DLM_UNLOCK, &sdp->sd_flags);
> -		fs_err(sdp, "withdrawn\n");
> +		fs_err(sdp, "File system withdrawn\n");
>  		dump_stack();
>  	}
>
> diff --git a/fs/gfs2/util.h b/fs/gfs2/util.h
> index 16e087da3bd3..036c7cfd856d 100644
> --- a/fs/gfs2/util.h
> +++ b/fs/gfs2/util.h
> @@ -132,6 +132,9 @@ static inline void gfs2_metatype_set(struct buffer_head *bh, u16 type,
>  int gfs2_io_error_i(struct gfs2_sbd *sdp, const char *function,
>  		    char *file, unsigned int line);
>
> +extern int check_journal_clean(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd,
> +			       bool verbose);
> +
>  #define gfs2_io_error(sdp) \
>  gfs2_io_error_i((sdp), __func__, __FILE__, __LINE__);
>
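
The wait-for-recovery loop at the end of signal_our_withdraw() is a
bounded poll: check the journal, stop on success, otherwise sleep and
try again up to a fixed number of times. Below is a minimal user-space
sketch of that pattern, assuming hypothetical callback types and test
doubles (none of these names exist in GFS2). It passes the delay
explicitly in milliseconds, which is worth keeping in mind because the
kernel's msleep() takes milliseconds, so msleep(HZ) is only a one-second
sleep when HZ happens to be 1000.

```c
#include <errno.h>

/* Hypothetical callback types for this sketch; not part of GFS2. */
typedef int (*journal_check_fn)(void *ctx);   /* returns 0 when clean */
typedef void (*delay_fn)(unsigned int msecs); /* sleeps between polls */

/*
 * Poll check() up to max_tries times, calling delay() after each failed
 * attempt. Returns 0 as soon as the journal is reported clean, otherwise
 * the error from the last check (or -ETIMEDOUT if max_tries is 0).
 */
int wait_for_journal_clean(journal_check_fn check, void *ctx,
			   delay_fn delay, unsigned int max_tries,
			   unsigned int delay_msecs)
{
	int ret = -ETIMEDOUT;
	unsigned int tries;

	for (tries = 0; tries < max_tries; tries++) {
		ret = check(ctx);
		if (!ret)
			break;
		delay(delay_msecs);
	}
	return ret;
}

/* Test double: a journal that stays dirty for a fixed number of polls. */
struct toy_journal {
	unsigned int dirty_polls; /* polls remaining that report "dirty" */
	unsigned int polls;       /* total number of polls observed */
};

int toy_check_journal_clean(void *ctx)
{
	struct toy_journal *j = ctx;

	j->polls++;
	if (j->dirty_polls) {
		j->dirty_polls--;
		return -EPERM; /* same error check_journal_clean() uses */
	}
	return 0;
}

/* Test double: skip the real sleep so the sketch runs instantly. */
void toy_no_delay(unsigned int msecs)
{
	(void)msecs;
}
```

Keeping the check and the delay as parameters makes the give-up case
(journal still dirty after the last try) easy to exercise without a
cluster.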