From: Bob Peterson <rpeterso@redhat.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] [PATCH 1/2] gfs2: Ignore recovery attempts if gfs2 has io error or is withdrawn
Date: Wed, 21 Nov 2018 12:52:56 -0600
Message-ID: <20181121185257.11206-2-rpeterso@redhat.com>
In-Reply-To: <20181121185257.11206-1-rpeterso@redhat.com>

This patch addresses various problems with gfs2/dlm recovery.

For example, suppose a node with a bunch of gfs2 mounts suddenly
reboots due to kernel panic, and dlm determines it should perform
recovery. DLM does so from a pseudo-state machine calling various
callbacks into lock_dlm to perform a sequence of steps. It uses
generation numbers and recover bits in dlm "control" lock lvbs.

Now suppose another node tries to recover the failed node's
journal, but in so doing, encounters an IO error or withdraws
due to unforeseen circumstances, such as an HBA driver failure.
In these cases, the recovery would eventually bail out, but it
would still update its generation number in the lvb. The other
nodes would all see the newer generation number and think they
don't need to do recovery because the generation number is newer
than the last one they saw, and therefore someone else has already
taken care of it.

If the file system has an io error or is withdrawn, it cannot
safely replay any journals (its own or others') but someone else
still needs to do it. Therefore we don't want it messing with
the journal recovery generation numbers: the local generation
numbers eventually get put into the lvb generation numbers to be
seen by all nodes.

This patch adds checks to many of the callbacks used by dlm
in its recovery state machine so that the functions are ignored
and skipped if an io error has occurred or if the file system
was withdrawn.

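The checks added by the diff below all follow the same shape: test
two flags and return early with a message. A minimal userspace
sketch of that shape is shown here. The flag values, struct
fake_sbd, and the helper recovery_blocked() are hypothetical and
are not part of the patch, which open-codes each test with
test_bit() and fs_err().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical bits standing in for SDF_AIL1_IO_ERROR and SDF_SHUTDOWN. */
enum {
	FAKE_AIL1_IO_ERROR = 1 << 0,
	FAKE_SHUTDOWN      = 1 << 1,
};

/* Hypothetical, heavily reduced stand-in for the superblock state. */
struct fake_sbd {
	unsigned long flags;
	unsigned int recover_block;
	unsigned int recover_start;
};

/* Return the reason recovery must be skipped, or NULL if it may run. */
const char *recovery_blocked(const struct fake_sbd *sdp)
{
	assert(sdp != NULL);
	if (sdp->flags & FAKE_AIL1_IO_ERROR)
		return "io error";
	if (sdp->flags & FAKE_SHUTDOWN)
		return "withdraw";
	return NULL;
}

/*
 * Modeled loosely on the recover_prep callback: returns 1 if the
 * recovery state was updated, 0 if the call was ignored.
 */
int fake_recover_prep(struct fake_sbd *sdp)
{
	const char *why = recovery_blocked(sdp);

	if (why) {
		fprintf(stderr, "recover_prep ignored due to %s.\n", why);
		return 0;
	}
	sdp->recover_block = sdp->recover_start;
	return 1;
}
```

Whether a shared helper like this would be preferable to the
open-coded checks is a design question for review; the sketch only
illustrates that a blocked callback must leave the recovery state
exactly as it found it.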
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
fs/gfs2/lock_dlm.c | 36 ++++++++++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)
diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
index 31df26ed7854..68ca648cf918 100644
--- a/fs/gfs2/lock_dlm.c
+++ b/fs/gfs2/lock_dlm.c
@@ -1081,6 +1081,14 @@ static void gdlm_recover_prep(void *arg)
 	struct gfs2_sbd *sdp = arg;
 	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
 
+	if (test_bit(SDF_AIL1_IO_ERROR, &sdp->sd_flags)) {
+		fs_err(sdp, "recover_prep ignored due to io error.\n");
+		return;
+	}
+	if (test_bit(SDF_SHUTDOWN, &sdp->sd_flags)) {
+		fs_err(sdp, "recover_prep ignored due to withdraw.\n");
+		return;
+	}
 	spin_lock(&ls->ls_recover_spin);
 	ls->ls_recover_block = ls->ls_recover_start;
 	set_bit(DFL_DLM_RECOVERY, &ls->ls_recover_flags);
@@ -1103,6 +1111,16 @@ static void gdlm_recover_slot(void *arg, struct dlm_slot *slot)
 	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
 	int jid = slot->slot - 1;
 
+	if (test_bit(SDF_AIL1_IO_ERROR, &sdp->sd_flags)) {
+		fs_err(sdp, "recover_slot jid %d ignored due to io error.\n",
+		       jid);
+		return;
+	}
+	if (test_bit(SDF_SHUTDOWN, &sdp->sd_flags)) {
+		fs_err(sdp, "recover_slot jid %d ignored due to withdraw.\n",
+		       jid);
+		return;
+	}
 	spin_lock(&ls->ls_recover_spin);
 	if (ls->ls_recover_size < jid + 1) {
 		fs_err(sdp, "recover_slot jid %d gen %u short size %d\n",
@@ -1127,6 +1145,14 @@ static void gdlm_recover_done(void *arg, struct dlm_slot *slots, int num_slots,
 	struct gfs2_sbd *sdp = arg;
 	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
 
+	if (test_bit(SDF_AIL1_IO_ERROR, &sdp->sd_flags)) {
+		fs_err(sdp, "recover_done ignored due to io error.\n");
+		return;
+	}
+	if (test_bit(SDF_SHUTDOWN, &sdp->sd_flags)) {
+		fs_err(sdp, "recover_done ignored due to withdraw.\n");
+		return;
+	}
 	/* ensure the ls jid arrays are large enough */
 	set_recover_size(sdp, slots, num_slots);
 
@@ -1154,6 +1180,16 @@ static void gdlm_recovery_result(struct gfs2_sbd *sdp, unsigned int jid,
 {
 	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
 
+	if (test_bit(SDF_AIL1_IO_ERROR, &sdp->sd_flags)) {
+		fs_err(sdp, "recovery_result jid %d ignored due to io error.\n",
+		       jid);
+		return;
+	}
+	if (test_bit(SDF_SHUTDOWN, &sdp->sd_flags)) {
+		fs_err(sdp, "recovery_result jid %d ignored due to withdraw.\n",
+		       jid);
+		return;
+	}
 	if (test_bit(DFL_NO_DLM_OPS, &ls->ls_recover_flags))
 		return;
 
--
2.19.1