linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 11/21] xfs: repair the rmapbt
Date: Thu, 5 Jul 2018 17:47:48 -0700	[thread overview]
Message-ID: <20180706004637.GI32415@magnolia> (raw)
In-Reply-To: <20180705070324.GQ2234@dastard>

On Thu, Jul 05, 2018 at 05:03:24PM +1000, Dave Chinner wrote:
> On Wed, Jul 04, 2018 at 08:48:58PM -0700, Darrick J. Wong wrote:
> > On Thu, Jul 05, 2018 at 09:21:15AM +1000, Dave Chinner wrote:
> > > On Tue, Jul 03, 2018 at 04:59:01PM -0700, Darrick J. Wong wrote:
> > > > On Tue, Jul 03, 2018 at 03:32:00PM +1000, Dave Chinner wrote:
> > > Keep in mind that online repair will never guarantee that it can fix
> > > all problems, so we're always going to have to offline repair. iWhat
> > > we want to achieve is minimising downtime for users when a repair is
> > > required. With the above IO limitations in mind, I've always
> > > considered that online repair would just be for all the simple,
> > > quick, easy to fix stuff, because complex stuff that required huge
> > > amounts of RAM and full filesystem scans to resolve would always be
> > > done faster offline.
> > >
> > > That's why I think that offline repair will be a better choice for
> > > users for the forseeable future if repairing the damage requires
> > > full filesystem metadata scans.
> > 
> > That might be, but how else can we determine that than to merge it under
> > an experimental banner, iterate it for a while, and try to persuade a
> > wider band of users than ourselves to try it on a non-production system
> > to see if it better solves their problems? :)
> 
> I think "deploy and idea to users and see if they have problems" is
> a horrible development model. That's where btrfs went off the rails
> almost 10 years ago....

I would only test-deploy to a very small and carefully selected group of
tightly-controlled internal users.  For everyone else, it hides behind a
default=N kconfig option and we disavow any support of EXPERIMENTAL
features. :)

> That aside, I'd like to make sure we're on the same page - we don't
> need online repair to fix /everything/ before it gets merged. Even

Indeed.  I'm fine with it merging the non-freezing repairers in their
current form.  I regularly run the repair tool fuzz xfstests to compare
the effectiveness of the xfs_scrub / xfs_repair.  They both have
deficiencies here and there, which will keep us busy for years to come.
:P

> without rmap rebuild, it will still be usable and useful to the vast
> majority of users wanting the filesystem to self repair the typical
> one-off corruption problems filesystems encounter during their
> typical production life.
> 
> What I'm worried about is that the online repair algorithms have a
> fundmental dependence on rmap lookups to avoid the need to build the
> global cross-references needed to isolate and repair corruptions.

Yes, the primary metadata repairers are absolutely dependent on the
secondary data to be faster than a full scan, and repairing secondary
metadata was never going to be easy to do online.

> This use of rmap, from one angle, can be seen as a performance
> optimisation, but as I've come understand more of the algorithms the
> online repair code uses, it looks more like an intractable problem
> than one of performance optimisation.
> 
> That is, if the rmap is corrupt, or we even suspect that it's
> corrupt, then we cannot use it to validate the state and contents of
> the other on-disk structures. We could propagate rmap corruptions to
> those other structures and it would go undetected. And because we
> can't determine the validity of all the other structures, the only
> way we can correctly repair the filesystem is to build a global
> cross-reference, resolve all the inconsistencies, and then rewrite
> all the structures. i.e. we need to do what xfs_repair does in phase
> 3, 4 and 5.
> 
> IOWs, ISTM that the online scrub/repair algorithms break down in the
> face of rmap corruption and it is correctable only by building a
> global cross-reference which we can then reconcile and use to
> rebuild all the per-AG metadata btrees in the filesystem. Trying to
> do it online via scanning of unverifiable structures risks making
> the problem worse and there is a good possibility that we can't
> repair the inconsistencies in a single pass.

Yeah.  I've pondered whether or not the primary repairers ought to
require at least a quick rmap scan before they touch anything, but up
till now I've preferred to keep that knowledge in xfs_scrub.

> So at this point, I'd prefer that we stop, discuss and work out a
> solution to the rmap rebuild problem rather than merge a rmap
> rebuild algorithm prematurely. This doesn't need to hold up any of
> the other online repair functionality - none of that is dependent on
> this particular piece of the puzzle being implemented.

I didn't think it would hold up the others repairers.  I always knew
that the rmapbt repair was going to be a tough project. :)

So, seeing as you're about to head off on vacation, let's set a rough
time to discuss the rmap rebuild problem in detail once you're back.
Does that sound good?

> > At worst, if future us decide that we will never figure out how to make
> > online rmapbt repair robust I'm fully prepared to withdraw it.
> 
> Merging new features before sorting out the fundamental principles,
> behaviours and algorithms of those features is how btrfs ended up
> with a pile of stuff that doesn't quite work which no-one
> understands quite well enough to fix.

Ok ok it can stay out of tree until we're all fairly confident it'll
survive on our bombardment tests.  I do have a test, xfs/422, to
simulate repairing an rmapbt under heavy load by running fsstress,
fsfreeze, and xfs_io injecting and force-repairing rmapbts in a loop.
So far it hasn't blown up or failed to regenerate the rmapbt...

> > > > > This seems like a likely thing to happen given the "no reclaim"
> > > > > state of the filesystem and the memory demand a rmapbt rebuild
> > > > > can have. If we've got GBs of rmap info in the AG that needs to be
> > > > > rebuilt, how much RAM are we going to need to index it all as we
> > > > > scan the filesystem?
> > > > 
> > > > More than I'd like -- at least 24 bytes per record (which at least is no
> > > > larger than the size of the on-disk btree) plus a list_head until I can
> > > > move the repairers away from creating huge lists.
> > > 
> > > Ok, it kinda sounds a bit like we need to be able to create the new
> > > btree on the fly, rather than as a single operation at the end. e.g.
> > > if the list builds up to, say, 100k records, we push them into the
> > > new tree and can free them. e.g. can we iteratively build the new
> > > tree on disk as we go, then do a root block swap at the end to
> > > switch from the old tree to the new tree?
> > 
> > Seeing as we have a bunch of storage at our disposal, I think we could
> > push the records out to disk (or just write them straight into a new
> > btree) when we detect low memory conditions or consume more than some
> > threshold of memory. i
> 
> *nod* That's kinda along the lines of what I was thinking of.
> 
> > For v0 I was focusing on getting it to work at all
> > even with a largeish memory cost. :)
> 
> Right, I'm not suggesting that this needs to be done before the
> initial merge - I just want to make sure we don't back ourselves
> into a corner we can't get out of.

Agreed.  I don't want to get stuck in a corner either, but I don't know
that I know where all those corners are. :)

> > > If that's a potential direction, then maybe we can look at this as a
> > > future direction? It also leads to the posibility of
> > > pausing/continuing repair from where the last chunk of records were
> > > processed, so if we do run out of memory we don't have to start from
> > > the beginning again?
> > 
> > That could be difficult to do in practice -- we'll continue to
> > accumulate deferred frees for that AG in the meantime.
> 
> I think it depends on the way we record deferred frees, but it's a
> bit premature to be discussing this right now :P

<nod>

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

  reply	other threads:[~2018-07-06  0:47 UTC|newest]

Thread overview: 77+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-06-24 19:23 [PATCH v16 00/21] xfs-4.19: online repair support Darrick J. Wong
2018-06-24 19:23 ` [PATCH 01/21] xfs: don't assume a left rmap when allocating a new rmap Darrick J. Wong
2018-06-27  0:54   ` Dave Chinner
2018-06-28 21:11   ` Allison Henderson
2018-06-29 14:39     ` Darrick J. Wong
2018-06-24 19:23 ` [PATCH 02/21] xfs: add helper to decide if an inode has allocated cow blocks Darrick J. Wong
2018-06-27  1:02   ` Dave Chinner
2018-06-28 21:12   ` Allison Henderson
2018-06-24 19:23 ` [PATCH 03/21] xfs: refactor part of xfs_free_eofblocks Darrick J. Wong
2018-06-28 21:13   ` Allison Henderson
2018-06-24 19:23 ` [PATCH 04/21] xfs: repair the AGF and AGFL Darrick J. Wong
2018-06-27  2:19   ` Dave Chinner
2018-06-27 16:44     ` Allison Henderson
2018-06-27 23:37       ` Dave Chinner
2018-06-29 15:14         ` Darrick J. Wong
2018-06-28 17:25     ` Allison Henderson
2018-06-29 15:08       ` Darrick J. Wong
2018-06-28 21:14   ` Allison Henderson
2018-06-28 23:21     ` Dave Chinner
2018-06-29  1:35       ` Allison Henderson
2018-06-29 14:55         ` Darrick J. Wong
2018-06-24 19:24 ` [PATCH 05/21] xfs: repair the AGI Darrick J. Wong
2018-06-27  2:22   ` Dave Chinner
2018-06-28 21:15   ` Allison Henderson
2018-06-24 19:24 ` [PATCH 06/21] xfs: repair free space btrees Darrick J. Wong
2018-06-27  3:21   ` Dave Chinner
2018-07-04  2:15     ` Darrick J. Wong
2018-07-04  2:25       ` Dave Chinner
2018-06-30 17:36   ` Allison Henderson
2018-06-24 19:24 ` [PATCH 07/21] xfs: repair inode btrees Darrick J. Wong
2018-06-28  0:55   ` Dave Chinner
2018-07-04  2:22     ` Darrick J. Wong
2018-06-30 17:36   ` Allison Henderson
2018-06-30 18:30     ` Darrick J. Wong
2018-07-01  0:45       ` Allison Henderson
2018-06-24 19:24 ` [PATCH 08/21] xfs: defer iput on certain inodes while scrub / repair are running Darrick J. Wong
2018-06-28 23:37   ` Dave Chinner
2018-06-29 14:49     ` Darrick J. Wong
2018-06-24 19:24 ` [PATCH 09/21] xfs: finish our set of inode get/put tracepoints for scrub Darrick J. Wong
2018-06-24 19:24 ` [PATCH 10/21] xfs: introduce online scrub freeze Darrick J. Wong
2018-06-24 19:24 ` [PATCH 11/21] xfs: repair the rmapbt Darrick J. Wong
2018-07-03  5:32   ` Dave Chinner
2018-07-03 23:59     ` Darrick J. Wong
2018-07-04  8:44       ` Carlos Maiolino
2018-07-04 18:40         ` Darrick J. Wong
2018-07-04 23:21       ` Dave Chinner
2018-07-05  3:48         ` Darrick J. Wong
2018-07-05  7:03           ` Dave Chinner
2018-07-06  0:47             ` Darrick J. Wong [this message]
2018-07-06  1:08               ` Dave Chinner
2018-06-24 19:24 ` [PATCH 12/21] xfs: repair refcount btrees Darrick J. Wong
2018-07-03  5:50   ` Dave Chinner
2018-07-04  2:23     ` Darrick J. Wong
2018-06-24 19:24 ` [PATCH 13/21] xfs: repair inode records Darrick J. Wong
2018-07-03  6:17   ` Dave Chinner
2018-07-04  0:16     ` Darrick J. Wong
2018-07-04  1:03       ` Dave Chinner
2018-07-04  1:30         ` Darrick J. Wong
2018-06-24 19:24 ` [PATCH 14/21] xfs: zap broken inode forks Darrick J. Wong
2018-07-04  2:07   ` Dave Chinner
2018-07-04  3:26     ` Darrick J. Wong
2018-06-24 19:25 ` [PATCH 15/21] xfs: repair inode block maps Darrick J. Wong
2018-07-04  3:00   ` Dave Chinner
2018-07-04  3:41     ` Darrick J. Wong
2018-06-24 19:25 ` [PATCH 16/21] xfs: repair damaged symlinks Darrick J. Wong
2018-07-04  5:45   ` Dave Chinner
2018-07-04 18:45     ` Darrick J. Wong
2018-06-24 19:25 ` [PATCH 17/21] xfs: repair extended attributes Darrick J. Wong
2018-07-06  1:03   ` Dave Chinner
2018-07-06  3:10     ` Darrick J. Wong
2018-06-24 19:25 ` [PATCH 18/21] xfs: scrub should set preen if attr leaf has holes Darrick J. Wong
2018-06-29  2:52   ` Dave Chinner
2018-06-24 19:25 ` [PATCH 19/21] xfs: repair quotas Darrick J. Wong
2018-07-06  1:50   ` Dave Chinner
2018-07-06  3:16     ` Darrick J. Wong
2018-06-24 19:25 ` [PATCH 20/21] xfs: implement live quotacheck as part of quota repair Darrick J. Wong
2018-06-24 19:25 ` [PATCH 21/21] xfs: add online scrub/repair for superblock counters Darrick J. Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180706004637.GI32415@magnolia \
    --to=darrick.wong@oracle.com \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).