From: Steven Whitehouse <swhiteho@redhat.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] [GFS2 PATCH v3 09/19] gfs2: Ignore recovery attempts if gfs2 has io error or is withdrawn
Date: Wed, 1 May 2019 01:08:39 +0100
Message-ID: <550f4715-669a-5e58-43d2-46b84e08285f@redhat.com>
In-Reply-To: <20190430230319.10375-10-rpeterso@redhat.com>

Hi,

On 01/05/2019 00:03, Bob Peterson wrote:
> This patch addresses various problems with gfs2/dlm recovery.
>
> For example, suppose a node with a bunch of gfs2 mounts suddenly
> reboots due to kernel panic, and dlm determines it should perform
> recovery. DLM does so from a pseudo-state machine calling various
> callbacks into lock_dlm to perform a sequence of steps. It uses
> generation numbers and recover bits in dlm "control" lock lvbs.
>
> Now suppose another node tries to recover the failed node's
> journal, but in so doing, encounters an IO error or withdraws
> due to unforeseen circumstances, such as an hba driver failure.
> In these cases, the recovery would eventually bail out, but it
> would still update its generation number in the lvb. The other
> nodes would all see the newer generation number and think they
> don't need to do recovery because the generation number is newer
> than the last one they saw, and therefore someone else has already
> taken care of it.
>
> If the file system has an io error or is withdrawn, it cannot
> safely replay any journals (its own or others) but someone else
> still needs to do it. Therefore we don't want it messing with
> the journal recovery generation numbers: the local generation
> numbers eventually get put into the lvb generation numbers to be
> seen by all nodes.
>
> This patch adds checks to many of the callbacks used by dlm
> in its recovery state machine so that the functions are ignored
> and skipped if an io error has occurred or if the file system
> was withdrawn.
>
> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
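
The failure mode described in the commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel code: `lvb_model`, `node_model`, `recover_journal()` and the `guard_withdraw` switch are hypothetical names invented for the illustration, and the single `recover_gen` field stands in for the per-journal generation numbers kept in the dlm "control" lock lvb.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the generation number published in the lvb. */
struct lvb_model {
	uint32_t recover_gen;
};

/* Hypothetical stand-in for one cluster node. */
struct node_model {
	bool withdrawn;	/* node hit an I/O error or has withdrawn */
};

enum outcome {
	REPLAYED,		/* journal replayed, generation published */
	SKIPPED_ALREADY_DONE,	/* lvb says someone else already did it */
	IGNORED_WITHDRAWN,	/* guarded: lvb left untouched */
	FAILED_BUT_PUBLISHED	/* unguarded: failed, yet generation bumped */
};

/*
 * Model one node reacting to recovery generation 'gen'.
 * With guard_withdraw == false the node publishes the generation even
 * though it could not replay the journal, which is the problem the
 * commit message describes. With guard_withdraw == true the request
 * is ignored and the lvb keeps its old generation.
 */
static enum outcome recover_journal(const struct node_model *node,
				    struct lvb_model *lvb, uint32_t gen,
				    bool guard_withdraw)
{
	if (lvb->recover_gen >= gen)
		return SKIPPED_ALREADY_DONE;

	if (node->withdrawn) {
		if (guard_withdraw)
			return IGNORED_WITHDRAWN;
		lvb->recover_gen = gen;	/* misleading: nothing was replayed */
		return FAILED_BUT_PUBLISHED;
	}

	lvb->recover_gen = gen;
	return REPLAYED;
}
```

In this model, an unguarded withdrawn node that runs first causes every healthy node to return SKIPPED_ALREADY_DONE, so the journal is never replayed. With the guard in place, the healthy node still sees the old generation and performs the replay itself.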

These should probably propagate the error back to the caller of the
recovery request. We do have a proper notification system for failed
recovery via uevents.

Steve.
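
One way to picture the suggestion above is a userspace sketch in which the withdraw check returns an error instead of returning silently, and the caller turns that error into a failure notification. Everything here is an assumption made for illustration: `fs_model`, `recovery_result_checked()` and `format_recovery_event()` are hypothetical names, and the "JID=" / "RECOVERY=" strings only model the kind of key/value pairs a uevent carries; the exact strings should be checked against the gfs2 sources.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the superblock state the patch tests. */
struct fs_model {
	bool withdrawn;
};

/*
 * Instead of logging and returning void, report the condition to the
 * caller. Returns 0 on success or -EIO if the file system is withdrawn.
 */
static int recovery_result_checked(const struct fs_model *fs,
				   unsigned int jid)
{
	(void)jid;	/* a real implementation would act on the journal id */
	if (fs->withdrawn)
		return -EIO;
	return 0;
}

/*
 * Format the notification the caller would send for journal 'jid'.
 * A non-zero 'err' is reported as a failed recovery. Returns the
 * value of snprintf().
 */
static int format_recovery_event(char *buf, size_t len, unsigned int jid,
				 int err)
{
	return snprintf(buf, len, "JID=%u RECOVERY=%s", jid,
			err ? "Failed" : "Done");
}
```

The point of the sketch is only the control flow: the requester of the recovery learns that the request was not carried out, rather than having to infer it from a log message.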

> ---
>   fs/gfs2/lock_dlm.c | 18 ++++++++++++++++++
>   fs/gfs2/util.c     | 15 +++++++--------
>   2 files changed, 25 insertions(+), 8 deletions(-)
>
> diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
> index 31df26ed7854..9329f86ffcbe 100644
> --- a/fs/gfs2/lock_dlm.c
> +++ b/fs/gfs2/lock_dlm.c
> @@ -1081,6 +1081,10 @@ static void gdlm_recover_prep(void *arg)
>   	struct gfs2_sbd *sdp = arg;
>   	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
>   
> +	if (gfs2_withdrawn(sdp)) {
> +		fs_err(sdp, "recover_prep ignored due to withdraw.\n");
> +		return;
> +	}
>   	spin_lock(&ls->ls_recover_spin);
>   	ls->ls_recover_block = ls->ls_recover_start;
>   	set_bit(DFL_DLM_RECOVERY, &ls->ls_recover_flags);
> @@ -1103,6 +1107,11 @@ static void gdlm_recover_slot(void *arg, struct dlm_slot *slot)
>   	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
>   	int jid = slot->slot - 1;
>   
> +	if (gfs2_withdrawn(sdp)) {
> +		fs_err(sdp, "recover_slot jid %d ignored due to withdraw.\n",
> +		       jid);
> +		return;
> +	}
>   	spin_lock(&ls->ls_recover_spin);
>   	if (ls->ls_recover_size < jid + 1) {
>   		fs_err(sdp, "recover_slot jid %d gen %u short size %d\n",
> @@ -1127,6 +1136,10 @@ static void gdlm_recover_done(void *arg, struct dlm_slot *slots, int num_slots,
>   	struct gfs2_sbd *sdp = arg;
>   	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
>   
> +	if (gfs2_withdrawn(sdp)) {
> +		fs_err(sdp, "recover_done ignored due to withdraw.\n");
> +		return;
> +	}
>   	/* ensure the ls jid arrays are large enough */
>   	set_recover_size(sdp, slots, num_slots);
>   
> @@ -1154,6 +1167,11 @@ static void gdlm_recovery_result(struct gfs2_sbd *sdp, unsigned int jid,
>   {
>   	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
>   
> +	if (gfs2_withdrawn(sdp)) {
> +		fs_err(sdp, "recovery_result jid %d ignored due to withdraw.\n",
> +		       jid);
> +		return;
> +	}
>   	if (test_bit(DFL_NO_DLM_OPS, &ls->ls_recover_flags))
>   		return;
>   
> diff --git a/fs/gfs2/util.c b/fs/gfs2/util.c
> index 0a814ccac41d..7eaea6dfe1cf 100644
> --- a/fs/gfs2/util.c
> +++ b/fs/gfs2/util.c
> @@ -259,14 +259,13 @@ void gfs2_io_error_bh_i(struct gfs2_sbd *sdp, struct buffer_head *bh,
>   			const char *function, char *file, unsigned int line,
>   			bool withdraw)
>   {
> -	if (!test_bit(SDF_SHUTDOWN, &sdp->sd_flags))
> -		fs_err(sdp,
> -		       "fatal: I/O error\n"
> -		       "  block = %llu\n"
> -		       "  function = %s, file = %s, line = %u\n",
> -		       (unsigned long long)bh->b_blocknr,
> -		       function, file, line);
> +	if (gfs2_withdrawn(sdp))
> +		return;
> +
> +	fs_err(sdp, "fatal: I/O error\n"
> +	       "  block = %llu\n"
> +	       "  function = %s, file = %s, line = %u\n",
> +	       (unsigned long long)bh->b_blocknr, function, file, line);
>   	if (withdraw)
>   		gfs2_lm_withdraw(sdp, NULL);
>   }
> -

