Re: [PATCH 09/10] xfs_repair: add a testing hook for NEEDSREPAIR

From: "Darrick J. Wong" <djwong@kernel.org>
To: Brian Foster <bfoster@redhat.com>
Cc: sandeen@sandeen.net, linux-xfs@vger.kernel.org
Subject: Re: [PATCH 09/10] xfs_repair: add a testing hook for NEEDSREPAIR
Date: Tue, 9 Feb 2021 11:59:20 -0800
Message-ID: <20210209195920.GZ7193@magnolia>
In-Reply-To: <20210209185939.GK14273@bfoster>

On Tue, Feb 09, 2021 at 01:59:39PM -0500, Brian Foster wrote:
> On Tue, Feb 09, 2021 at 10:17:38AM -0800, Darrick J. Wong wrote:
> > On Tue, Feb 09, 2021 at 12:21:31PM -0500, Brian Foster wrote:
> > > On Mon, Feb 08, 2021 at 08:10:55PM -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > Simulate a crash when anyone calls force_needsrepair.  This is a debug
> > > > knob so that we can test that the kernel won't mount after setting
> > > > needsrepair and that a re-run of xfs_repair will clear the flag.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > ---
> > > 
> > > Can't we just use db to manually set the bit on the superblock?
> > 
> > No, because the fstest uses this debug knob to simulate the following:
> > 
> > 1) sysadmin issues 'xfs_admin -O inobtcount /dev/sda1'
> > 2) xfs_repair flips on INOBTCOUNT and NEEDSREPAIR
> > 3) system goes down and repair never completes
> > 4) verify that we can't mount
> > 5) verify that repair clears NEEDSREPAIR and gives us a clean fs
> > 6) verify that mount works again
> > 
> 
> Ok, but that seems like circular reasoning.

I'm sorry, but I don't see how this is circular logic?

The test needs to show that NEEDSREPAIR is turned on during phase 1 (or
2) when we apply an upgrade, and it needs to induce some kind of early
exit so that the needsrepair clearing code after phase 7 does not run.

If we set NEEDSREPAIR with xfs_db before running repair then we have no
way to detect if the inobtcount upgrade doesn't set needsrepair.

If we don't have a debugging knob to stop repair before it reaches phase
7, we're not really testing a genuine early-repair-exit scenario.  Yes,
we can use xfs_db to manually set the flag after repair returns, but
that doesn't fill the testing gap above.
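
A rough sketch of the test flow described above, not the actual fstest.
The debug_force_needsrepair_abort option and exit status 55 come from
the patch quoted below, and "-c needsrepair=1" is the invocation shown
later in this message.  SCRATCH_DEV, SCRATCH_MNT, RUN and the expect()
helper are placeholder names assumed for illustration.  Unless RUN=1 is
set, the script only prints what it would do.

```shell
#!/bin/sh
# Sketch only.  The device and mount point are placeholders; nothing is
# executed against them unless RUN=1 is set in the environment.
SCRATCH_DEV=${SCRATCH_DEV:-/dev/sdX1}
SCRATCH_MNT=${SCRATCH_MNT:-/mnt/scratch}

# expect WANT CMD...: run CMD and compare its exit status against WANT,
# where WANT is "ok" (zero), "fail" (nonzero) or an exact status number.
expect() {
	want=$1; shift
	if [ "${RUN:-0}" != 1 ]; then
		echo "would run, expecting $want: $*"
		return 0
	fi
	got=0
	"$@" || got=$?
	case $want in
	ok)   [ "$got" -eq 0 ] ;;
	fail) [ "$got" -ne 0 ] ;;
	*)    [ "$got" -eq "$want" ] ;;
	esac || { echo "FAIL (status $got, wanted $want): $*" >&2; exit 1; }
}

# Set NEEDSREPAIR, then bail out early through the debug knob so the
# flag-clearing code after phase 7 never runs.
expect 55 xfs_repair -c needsrepair=1 -o debug_force_needsrepair_abort "$SCRATCH_DEV"
# The kernel should now refuse to mount the filesystem.
expect fail mount "$SCRATCH_DEV" "$SCRATCH_MNT"
# A full repair run should clear the flag ...
expect ok xfs_repair "$SCRATCH_DEV"
# ... after which mounting works again.
expect ok mount "$SCRATCH_DEV" "$SCRATCH_MNT"
expect ok umount "$SCRATCH_MNT"
```

Checking for the specific status 55, rather than any nonzero status,
separates the simulated crash from an ordinary repair failure.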

> It wouldn't be quite the
> same as a simulated repair failure, but ISTM that if we set the bit
> manually, we can still verify steps 4, 5 and 6 as is (with the caveat
> that the repair invocation performs a feature upgrade). I'm not sure how
> important it really is to verify that a feature upgrade sequence sets
> the bit if it happens to fail provided we have independent tests that 1.
> verify the needsrepair bit works as expected and 2. verify the feature
> upgrades work appropriately, since that is the primary functionality.
> 
> I wanted to think about that a little more before replying, but I also
> just realized something odd when digging into the debug code:
> 
> # ./repair/xfs_repair -c needsrepair=1 /dev/test/scratch 
> Phase 1 - find and verify superblock...
> Marking filesystem in need of repair.
> writing modified primary superblock
> Phase 2 - using internal log
>         - zero log...
> ERROR: The filesystem has valuable metadata changes in a log which needs to
> ...
> # mount /dev/test/scratch /mnt/
> mount: /mnt: wrong fs type, bad option, bad superblock on /dev/mapper/test-scratch, missing codepage or helper program, or other error.
> #
> 
> It looks like we can set a feature upgrade bit on the superblock before
> we've examined the log and potentially discovered that it's dirty (phase
> 2). If the log is recoverable, that puts the user in a bit of a bind..

Heh, funny that I was thinking that the upgrades shouldn't really be
happening in phase 1 anyway--

I've (separately) started working on a patch to make it so that you can
add reflink and finobt to a filesystem.  Those upgrades require somewhat
more intensive checks of the filesystem (such as checking free space in
each AG), so I ended up dumping them into phase 2, since the xfs_mount
and buffer cache aren't fully initialized until after phase 1.

So, yeah, the upgrade code should move to phase2() after log zeroing and
before the AG scan.

--D

> Brian
> 
> > and the other scenario is:
> > 
> > 1) fuzz a directory entry in such a way that repair will decide to
> >    blow out the dirent and rebuild the directory later
> > 2) sysadmin issues 'xfs_repair /dev/sda1'
> > 3) xfs_repair flips on NEEDSREPAIR at the same time it corrupts the
> >    dirent to trigger the rebuild later
> > 4) system goes down and repair never completes
> > 5) verify that we can't mount
> > 6) verify that repair clears NEEDSREPAIR and gives us a clean fs
> > 7) verify that mount works again
> > 
> > Both cases reflect what I think are the most likely failure scenarios,
> > hence the knob needs to be in xfs_repair to prevent it from running to
> > completion.
> > 
> > (And yes, I've been very bad at sending fstests out for review these
> > past few months; I will get that done by this afternoon.)
> > 
> > --D
> > 
> > > Brian
> > > 
> > > >  repair/globals.c    |    1 +
> > > >  repair/globals.h    |    2 ++
> > > >  repair/phase1.c     |    5 +++++
> > > >  repair/xfs_repair.c |    7 +++++++
> > > >  4 files changed, 15 insertions(+)
> > > > 
> > > > 
> > > > diff --git a/repair/globals.c b/repair/globals.c
> > > > index 699a96ee..b0e23864 100644
> > > > --- a/repair/globals.c
> > > > +++ b/repair/globals.c
> > > > @@ -40,6 +40,7 @@ int	dangerously;		/* live dangerously ... fix ro mount */
> > > >  int	isa_file;
> > > >  int	zap_log;
> > > >  int	dumpcore;		/* abort, not exit on fatal errs */
> > > > +bool	abort_after_force_needsrepair;
> > > >  int	force_geo;		/* can set geo on low confidence info */
> > > >  int	assume_xfs;		/* assume we have an xfs fs */
> > > >  char	*log_name;		/* Name of log device */
> > > > diff --git a/repair/globals.h b/repair/globals.h
> > > > index 043b3e8e..9fa73b2c 100644
> > > > --- a/repair/globals.h
> > > > +++ b/repair/globals.h
> > > > @@ -82,6 +82,8 @@ extern int	isa_file;
> > > >  extern int	zap_log;
> > > >  extern int	dumpcore;		/* abort, not exit on fatal errs */
> > > >  extern int	force_geo;		/* can set geo on low confidence info */
> > > > +/* Abort after forcing NEEDSREPAIR to test its functionality */
> > > > +extern bool	abort_after_force_needsrepair;
> > > >  extern int	assume_xfs;		/* assume we have an xfs fs */
> > > >  extern char	*log_name;		/* Name of log device */
> > > >  extern int	log_spec;		/* Log dev specified as option */
> > > > diff --git a/repair/phase1.c b/repair/phase1.c
> > > > index b26d25f8..57f72cd0 100644
> > > > --- a/repair/phase1.c
> > > > +++ b/repair/phase1.c
> > > > @@ -170,5 +170,10 @@ _("Cannot disable lazy-counters on V5 fs\n"));
> > > >  	 */
> > > >  	sb_ifree = sb_icount = sb_fdblocks = sb_frextents = 0;
> > > >  
> > > > +	/* Simulate a crash after setting needsrepair. */
> > > > +	if (primary_sb_modified && add_needsrepair &&
> > > > +	    abort_after_force_needsrepair)
> > > > +		exit(55);
> > > > +
> > > >  	free(sb);
> > > >  }
> > > > diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
> > > > index ee377e8a..ae7106a6 100644
> > > > --- a/repair/xfs_repair.c
> > > > +++ b/repair/xfs_repair.c
> > > > @@ -44,6 +44,7 @@ enum o_opt_nums {
> > > >  	BLOAD_LEAF_SLACK,
> > > >  	BLOAD_NODE_SLACK,
> > > >  	NOQUOTA,
> > > > +	FORCE_NEEDSREPAIR_ABORT,
> > > >  	O_MAX_OPTS,
> > > >  };
> > > >  
> > > > @@ -57,6 +58,7 @@ static char *o_opts[] = {
> > > >  	[BLOAD_LEAF_SLACK]	= "debug_bload_leaf_slack",
> > > >  	[BLOAD_NODE_SLACK]	= "debug_bload_node_slack",
> > > >  	[NOQUOTA]		= "noquota",
> > > > +	[FORCE_NEEDSREPAIR_ABORT] = "debug_force_needsrepair_abort",
> > > >  	[O_MAX_OPTS]		= NULL,
> > > >  };
> > > >  
> > > > @@ -282,6 +284,9 @@ process_args(int argc, char **argv)
> > > >  		_("-o debug_bload_node_slack requires a parameter\n"));
> > > >  					bload_node_slack = (int)strtol(val, NULL, 0);
> > > >  					break;
> > > > +				case FORCE_NEEDSREPAIR_ABORT:
> > > > +					abort_after_force_needsrepair = true;
> > > > +					break;
> > > >  				case NOQUOTA:
> > > >  					quotacheck_skip();
> > > >  					break;
> > > > @@ -795,6 +800,8 @@ force_needsrepair(
> > > >  		error = -libxfs_bwrite(bp);
> > > >  		if (error)
> > > >  			do_log(_("couldn't force needsrepair, err=%d\n"), error);
> > > > +		if (abort_after_force_needsrepair)
> > > > +			exit(55);
> > > >  	}
> > > >  	if (bp)
> > > >  		libxfs_buf_relse(bp);
> > > > 
> > > 
> > 
>
