From: Steven Whitehouse <swhiteho@redhat.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] [GFS2 PATCH v3 09/19] gfs2: Ignore recovery attempts if gfs2 has io error or is withdrawn
Date: Wed, 1 May 2019 01:08:39 +0100
Message-ID: <550f4715-669a-5e58-43d2-46b84e08285f@redhat.com>
In-Reply-To: <20190430230319.10375-10-rpeterso@redhat.com>
Hi,
On 01/05/2019 00:03, Bob Peterson wrote:
> This patch addresses various problems with gfs2/dlm recovery.
>
> For example, suppose a node with a bunch of gfs2 mounts suddenly
> reboots due to a kernel panic, and dlm determines it should perform
> recovery. DLM does so from a pseudo-state machine calling various
> callbacks into lock_dlm to perform a sequence of steps. It uses
> generation numbers and recover bits in dlm "control" lock lvbs.
>
> Now suppose another node tries to recover the failed node's
> journal, but in so doing, encounters an IO error or withdraws
> due to unforeseen circumstances, such as an hba driver failure.
> In these cases, the recovery would eventually bail out, but it
> would still update its generation number in the lvb. The other
> nodes would all see the newer generation number and think they
> don't need to do recovery because the generation number is newer
> than the last one they saw, and therefore someone else has already
> taken care of it.
>
> If the file system has an io error or is withdrawn, it cannot
> safely replay any journals (its own or others') but someone else
> still needs to do it. Therefore we don't want it messing with
> the journal recovery generation numbers: the local generation
> numbers eventually get put into the lvb generation numbers to be
> seen by all nodes.
>
> This patch adds checks to many of the callbacks used by dlm
> in its recovery state machine so that the functions are ignored
> and skipped if an io error has occurred or if the file system
> was withdrawn.
>
> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
These should probably propagate the error back to the caller of the
recovery request. We do have a proper notification system for failed
recovery via uevents.
Steve.
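Steve's suggestion can be illustrated with a small user-space sketch. Everything in it is hypothetical: `struct fs_state`, `recover_journal()` and `report_recovery()` are invented for illustration, and the `JID=... RECOVERY=Done|Failed` string only mimics the environment that `gfs2_recovery_done()` in fs/gfs2/recovery.c attaches to its uevent (the kernel calls `kobject_uevent_env()` rather than formatting a buffer). What it shows is that a withdrawn node can leave its generation number untouched and still tell the requester that recovery failed, instead of returning silently.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for success / gave-up recovery results. */
enum recovery_status { RECOVERY_DONE, RECOVERY_FAILED };

/* Hypothetical, heavily simplified per-mount state. */
struct fs_state {
	bool withdrawn;		/* I/O error or withdraw seen on this mount */
	unsigned int gen;	/* local recovery generation */
	char event[32];		/* last "uevent" this mount would have sent */
};

/* Record what a recovery uevent would carry (JID=..., RECOVERY=...). */
static void report_recovery(struct fs_state *fs, unsigned int jid,
			    enum recovery_status st)
{
	snprintf(fs->event, sizeof(fs->event), "JID=%u RECOVERY=%s", jid,
		 st == RECOVERY_DONE ? "Done" : "Failed");
}

/* Return the outcome to the caller instead of silently ignoring it. */
static enum recovery_status recover_journal(struct fs_state *fs,
					    unsigned int jid)
{
	if (fs->withdrawn) {
		/* Leave the generation alone, but tell the requester. */
		report_recovery(fs, jid, RECOVERY_FAILED);
		return RECOVERY_FAILED;
	}
	/* ... journal replay would happen here ... */
	fs->gen++;
	report_recovery(fs, jid, RECOVERY_DONE);
	return RECOVERY_DONE;
}
```

On a withdrawn mount, `recover_journal(&fs, 1)` returns `RECOVERY_FAILED`, leaves `gen` unchanged and records `JID=1 RECOVERY=Failed`; on a healthy mount it bumps `gen` and records `RECOVERY=Done`. Skipping the generation update and reporting the failure are independent: the guards in the patch do the former, and the reply asks for the latter as well.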
> ---
> fs/gfs2/lock_dlm.c | 18 ++++++++++++++++++
> fs/gfs2/util.c | 15 +++++++--------
> 2 files changed, 25 insertions(+), 8 deletions(-)
>
> diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
> index 31df26ed7854..9329f86ffcbe 100644
> --- a/fs/gfs2/lock_dlm.c
> +++ b/fs/gfs2/lock_dlm.c
> @@ -1081,6 +1081,10 @@ static void gdlm_recover_prep(void *arg)
>  	struct gfs2_sbd *sdp = arg;
>  	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
>
> +	if (gfs2_withdrawn(sdp)) {
> +		fs_err(sdp, "recover_prep ignored due to withdraw.\n");
> +		return;
> +	}
>  	spin_lock(&ls->ls_recover_spin);
>  	ls->ls_recover_block = ls->ls_recover_start;
>  	set_bit(DFL_DLM_RECOVERY, &ls->ls_recover_flags);
> @@ -1103,6 +1107,11 @@ static void gdlm_recover_slot(void *arg, struct dlm_slot *slot)
>  	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
>  	int jid = slot->slot - 1;
>
> +	if (gfs2_withdrawn(sdp)) {
> +		fs_err(sdp, "recover_slot jid %d ignored due to withdraw.\n",
> +		       jid);
> +		return;
> +	}
>  	spin_lock(&ls->ls_recover_spin);
>  	if (ls->ls_recover_size < jid + 1) {
>  		fs_err(sdp, "recover_slot jid %d gen %u short size %d\n",
> @@ -1127,6 +1136,10 @@ static void gdlm_recover_done(void *arg, struct dlm_slot *slots, int num_slots,
>  	struct gfs2_sbd *sdp = arg;
>  	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
>
> +	if (gfs2_withdrawn(sdp)) {
> +		fs_err(sdp, "recover_done ignored due to withdraw.\n");
> +		return;
> +	}
>  	/* ensure the ls jid arrays are large enough */
>  	set_recover_size(sdp, slots, num_slots);
>
> @@ -1154,6 +1167,11 @@ static void gdlm_recovery_result(struct gfs2_sbd *sdp, unsigned int jid,
>  {
>  	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
>
> +	if (gfs2_withdrawn(sdp)) {
> +		fs_err(sdp, "recovery_result jid %d ignored due to withdraw.\n",
> +		       jid);
> +		return;
> +	}
>  	if (test_bit(DFL_NO_DLM_OPS, &ls->ls_recover_flags))
>  		return;
>
> diff --git a/fs/gfs2/util.c b/fs/gfs2/util.c
> index 0a814ccac41d..7eaea6dfe1cf 100644
> --- a/fs/gfs2/util.c
> +++ b/fs/gfs2/util.c
> @@ -259,14 +259,13 @@ void gfs2_io_error_bh_i(struct gfs2_sbd *sdp, struct buffer_head *bh,
>  			const char *function, char *file, unsigned int line,
>  			bool withdraw)
>  {
> -	if (!test_bit(SDF_SHUTDOWN, &sdp->sd_flags))
> -		fs_err(sdp,
> -		       "fatal: I/O error\n"
> -		       " block = %llu\n"
> -		       " function = %s, file = %s, line = %u\n",
> -		       (unsigned long long)bh->b_blocknr,
> -		       function, file, line);
> +	if (gfs2_withdrawn(sdp))
> +		return;
> +
> +	fs_err(sdp, "fatal: I/O error\n"
> +	       " block = %llu\n"
> +	       " function = %s, file = %s, line = %u\n",
> +	       (unsigned long long)bh->b_blocknr, function, file, line);
>  	if (withdraw)
>  		gfs2_lm_withdraw(sdp, NULL);
>  }
> -
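The generation-number problem described in the commit message, and the effect of the guards added in lock_dlm.c, can be modelled in a few lines of user-space C. This is a toy under stated assumptions: the single shared `recover_gen` counter stands in for the generation numbers and recover bits that lock_dlm actually keeps in the control lock's LVB, and `struct lvb`, `struct node`, `try_recover()` and `journal_replayed()` are hypothetical names. A withdrawn node attempts recovery first, then a healthy node.

```c
#include <stdbool.h>

/* Hypothetical: one shared counter standing in for the LVB generation. */
struct lvb {
	unsigned int recover_gen;
};

/* Hypothetical per-node state. */
struct node {
	bool withdrawn;		/* I/O error or withdraw on this node */
	unsigned int seen_gen;	/* last generation this node acted on */
};

/*
 * A node reacts to a failure with generation 'gen'.  Returns true only
 * if this node actually replayed the failed node's journal.
 * 'guard' selects the patched behaviour (skip when withdrawn).
 */
static bool try_recover(struct node *n, struct lvb *lvb, unsigned int gen,
			bool guard)
{
	if (lvb->recover_gen >= gen) {
		/* Someone already published this generation: assume done. */
		n->seen_gen = lvb->recover_gen;
		return false;
	}
	if (guard && n->withdrawn)
		return false;	/* patched: leave the generation alone */

	bool replayed = !n->withdrawn;	/* a withdrawn node bails out */

	/* Publish the generation; without the guard this also happens
	 * after a withdrawn node has bailed out of the replay. */
	n->seen_gen = gen;
	lvb->recover_gen = gen;
	return replayed;
}

/* Withdrawn node A tries first, healthy node B second. */
static bool journal_replayed(bool guard)
{
	struct lvb lvb = { .recover_gen = 0 };
	struct node a = { .withdrawn = true, .seen_gen = 0 };
	struct node b = { .withdrawn = false, .seen_gen = 0 };
	const unsigned int failure_gen = 1;

	bool done = try_recover(&a, &lvb, failure_gen, guard);
	done = try_recover(&b, &lvb, failure_gen, guard) || done;
	return done;
}
```

`journal_replayed(false)` returns false: the withdrawn node publishes the new generation after bailing out, so the healthy node assumes the work is done and nobody replays the journal. `journal_replayed(true)` returns true because the withdrawn node no longer touches the counter, leaving the healthy node to do the recovery.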