All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 11/21] xfs: repair the rmapbt
Date: Thu, 5 Jul 2018 17:47:48 -0700	[thread overview]
Message-ID: <20180706004637.GI32415@magnolia> (raw)
In-Reply-To: <20180705070324.GQ2234@dastard>

On Thu, Jul 05, 2018 at 05:03:24PM +1000, Dave Chinner wrote:
> On Wed, Jul 04, 2018 at 08:48:58PM -0700, Darrick J. Wong wrote:
> > On Thu, Jul 05, 2018 at 09:21:15AM +1000, Dave Chinner wrote:
> > > On Tue, Jul 03, 2018 at 04:59:01PM -0700, Darrick J. Wong wrote:
> > > > On Tue, Jul 03, 2018 at 03:32:00PM +1000, Dave Chinner wrote:
> > > Keep in mind that online repair will never guarantee that it can fix
> > > all problems, so we're always going to have to offline repair. iWhat
> > > we want to achieve is minimising downtime for users when a repair is
> > > required. With the above IO limitations in mind, I've always
> > > considered that online repair would just be for all the simple,
> > > quick, easy to fix stuff, because complex stuff that required huge
> > > amounts of RAM and full filesystem scans to resolve would always be
> > > done faster offline.
> > >
> > > That's why I think that offline repair will be a better choice for
> > > users for the forseeable future if repairing the damage requires
> > > full filesystem metadata scans.
> > 
> > That might be, but how else can we determine that than to merge it under
> > an experimental banner, iterate it for a while, and try to persuade a
> > wider band of users than ourselves to try it on a non-production system
> > to see if it better solves their problems? :)
> 
> I think "deploy and idea to users and see if they have problems" is
> a horrible development model. That's where btrfs went off the rails
> almost 10 years ago....

I would only test-deploy to a very small and carefully selected group of
tightly-controlled internal users.  For everyone else, it hides behind a
default=N kconfig option and we disavow any support of EXPERIMENTAL
features. :)

> That aside, I'd like to make sure we're on the same page - we don't
> need online repair to fix /everything/ before it gets merged. Even

Indeed.  I'm fine with it merging the non-freezing repairers in their
current form.  I regularly run the repair tool fuzz xfstests to compare
the effectiveness of the xfs_scrub / xfs_repair.  They both have
deficiencies here and there, which will keep us busy for years to come.
:P

> without rmap rebuild, it will still be usable and useful to the vast
> majority of users wanting the filesystem to self repair the typical
> one-off corruption problems filesystems encounter during their
> typical production life.
> 
> What I'm worried about is that the online repair algorithms have a
> fundmental dependence on rmap lookups to avoid the need to build the
> global cross-references needed to isolate and repair corruptions.

Yes, the primary metadata repairers are absolutely dependent on the
secondary data to be faster than a full scan, and repairing secondary
metadata was never going to be easy to do online.

> This use of rmap, from one angle, can be seen as a performance
> optimisation, but as I've come understand more of the algorithms the
> online repair code uses, it looks more like an intractable problem
> than one of performance optimisation.
> 
> That is, if the rmap is corrupt, or we even suspect that it's
> corrupt, then we cannot use it to validate the state and contents of
> the other on-disk structures. We could propagate rmap corruptions to
> those other structures and it would go undetected. And because we
> can't determine the validity of all the other structures, the only
> way we can correctly repair the filesystem is to build a global
> cross-reference, resolve all the inconsistencies, and then rewrite
> all the structures. i.e. we need to do what xfs_repair does in phase
> 3, 4 and 5.
> 
> IOWs, ISTM that the online scrub/repair algorithms break down in the
> face of rmap corruption and it is correctable only by building a
> global cross-reference which we can then reconcile and use to
> rebuild all the per-AG metadata btrees in the filesystem. Trying to
> do it online via scanning of unverifiable structures risks making
> the problem worse and there is a good possibility that we can't
> repair the inconsistencies in a single pass.

Yeah.  I've pondered whether or not the primary repairers ought to
require at least a quick rmap scan before they touch anything, but up
till now I've preferred to keep that knowledge in xfs_scrub.

> So at this point, I'd prefer that we stop, discuss and work out a
> solution to the rmap rebuild problem rather than merge a rmap
> rebuild algorithm prematurely. This doesn't need to hold up any of
> the other online repair functionality - none of that is dependent on
> this particular piece of the puzzle being implemented.

I didn't think it would hold up the others repairers.  I always knew
that the rmapbt repair was going to be a tough project. :)

So, seeing as you're about to head off on vacation, let's set a rough
time to discuss the rmap rebuild problem in detail once you're back.
Does that sound good?

> > At worst, if future us decide that we will never figure out how to make
> > online rmapbt repair robust I'm fully prepared to withdraw it.
> 
> Merging new features before sorting out the fundamental principles,
> behaviours and algorithms of those features is how btrfs ended up
> with a pile of stuff that doesn't quite work which no-one
> understands quite well enough to fix.

Ok ok it can stay out of tree until we're all fairly confident it'll
survive on our bombardment tests.  I do have a test, xfs/422, to
simulate repairing an rmapbt under heavy load by running fsstress,
fsfreeze, and xfs_io injecting and force-repairing rmapbts in a loop.
So far it hasn't blown up or failed to regenerate the rmapbt...

> > > > > This seems like a likely thing to happen given the "no reclaim"
> > > > > state of the filesystem and the memory demand a rmapbt rebuild
> > > > > can have. If we've got GBs of rmap info in the AG that needs to be
> > > > > rebuilt, how much RAM are we going to need to index it all as we
> > > > > scan the filesystem?
> > > > 
> > > > More than I'd like -- at least 24 bytes per record (which at least is no
> > > > larger than the size of the on-disk btree) plus a list_head until I can
> > > > move the repairers away from creating huge lists.
> > > 
> > > Ok, it kinda sounds a bit like we need to be able to create the new
> > > btree on the fly, rather than as a single operation at the end. e.g.
> > > if the list builds up to, say, 100k records, we push them into the
> > > new tree and can free them. e.g. can we iteratively build the new
> > > tree on disk as we go, then do a root block swap at the end to
> > > switch from the old tree to the new tree?
> > 
> > Seeing as we have a bunch of storage at our disposal, I think we could
> > push the records out to disk (or just write them straight into a new
> > btree) when we detect low memory conditions or consume more than some
> > threshold of memory. i
> 
> *nod* That's kinda along the lines of what I was thinking of.
> 
> > For v0 I was focusing on getting it to work at all
> > even with a largeish memory cost. :)
> 
> Right, I'm not suggesting that this needs to be done before the
> initial merge - I just want to make sure we don't back ourselves
> into a corner we can't get out of.

Agreed.  I don't want to get stuck in a corner either, but I don't know
that I know where all those corners are. :)

> > > If that's a potential direction, then maybe we can look at this as a
> > > future direction? It also leads to the posibility of
> > > pausing/continuing repair from where the last chunk of records were
> > > processed, so if we do run out of memory we don't have to start from
> > > the beginning again?
> > 
> > That could be difficult to do in practice -- we'll continue to
> > accumulate deferred frees for that AG in the meantime.
> 
> I think it depends on the way we record deferred frees, but it's a
> bit premature to be discussing this right now :P

<nod>

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

  reply	other threads:[~2018-07-06  0:47 UTC|newest]

Thread overview: 77+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-06-24 19:23 [PATCH v16 00/21] xfs-4.19: online repair support Darrick J. Wong
2018-06-24 19:23 ` [PATCH 01/21] xfs: don't assume a left rmap when allocating a new rmap Darrick J. Wong
2018-06-27  0:54   ` Dave Chinner
2018-06-28 21:11   ` Allison Henderson
2018-06-29 14:39     ` Darrick J. Wong
2018-06-24 19:23 ` [PATCH 02/21] xfs: add helper to decide if an inode has allocated cow blocks Darrick J. Wong
2018-06-27  1:02   ` Dave Chinner
2018-06-28 21:12   ` Allison Henderson
2018-06-24 19:23 ` [PATCH 03/21] xfs: refactor part of xfs_free_eofblocks Darrick J. Wong
2018-06-28 21:13   ` Allison Henderson
2018-06-24 19:23 ` [PATCH 04/21] xfs: repair the AGF and AGFL Darrick J. Wong
2018-06-27  2:19   ` Dave Chinner
2018-06-27 16:44     ` Allison Henderson
2018-06-27 23:37       ` Dave Chinner
2018-06-29 15:14         ` Darrick J. Wong
2018-06-28 17:25     ` Allison Henderson
2018-06-29 15:08       ` Darrick J. Wong
2018-06-28 21:14   ` Allison Henderson
2018-06-28 23:21     ` Dave Chinner
2018-06-29  1:35       ` Allison Henderson
2018-06-29 14:55         ` Darrick J. Wong
2018-06-24 19:24 ` [PATCH 05/21] xfs: repair the AGI Darrick J. Wong
2018-06-27  2:22   ` Dave Chinner
2018-06-28 21:15   ` Allison Henderson
2018-06-24 19:24 ` [PATCH 06/21] xfs: repair free space btrees Darrick J. Wong
2018-06-27  3:21   ` Dave Chinner
2018-07-04  2:15     ` Darrick J. Wong
2018-07-04  2:25       ` Dave Chinner
2018-06-30 17:36   ` Allison Henderson
2018-06-24 19:24 ` [PATCH 07/21] xfs: repair inode btrees Darrick J. Wong
2018-06-28  0:55   ` Dave Chinner
2018-07-04  2:22     ` Darrick J. Wong
2018-06-30 17:36   ` Allison Henderson
2018-06-30 18:30     ` Darrick J. Wong
2018-07-01  0:45       ` Allison Henderson
2018-06-24 19:24 ` [PATCH 08/21] xfs: defer iput on certain inodes while scrub / repair are running Darrick J. Wong
2018-06-28 23:37   ` Dave Chinner
2018-06-29 14:49     ` Darrick J. Wong
2018-06-24 19:24 ` [PATCH 09/21] xfs: finish our set of inode get/put tracepoints for scrub Darrick J. Wong
2018-06-24 19:24 ` [PATCH 10/21] xfs: introduce online scrub freeze Darrick J. Wong
2018-06-24 19:24 ` [PATCH 11/21] xfs: repair the rmapbt Darrick J. Wong
2018-07-03  5:32   ` Dave Chinner
2018-07-03 23:59     ` Darrick J. Wong
2018-07-04  8:44       ` Carlos Maiolino
2018-07-04 18:40         ` Darrick J. Wong
2018-07-04 23:21       ` Dave Chinner
2018-07-05  3:48         ` Darrick J. Wong
2018-07-05  7:03           ` Dave Chinner
2018-07-06  0:47             ` Darrick J. Wong [this message]
2018-07-06  1:08               ` Dave Chinner
2018-06-24 19:24 ` [PATCH 12/21] xfs: repair refcount btrees Darrick J. Wong
2018-07-03  5:50   ` Dave Chinner
2018-07-04  2:23     ` Darrick J. Wong
2018-06-24 19:24 ` [PATCH 13/21] xfs: repair inode records Darrick J. Wong
2018-07-03  6:17   ` Dave Chinner
2018-07-04  0:16     ` Darrick J. Wong
2018-07-04  1:03       ` Dave Chinner
2018-07-04  1:30         ` Darrick J. Wong
2018-06-24 19:24 ` [PATCH 14/21] xfs: zap broken inode forks Darrick J. Wong
2018-07-04  2:07   ` Dave Chinner
2018-07-04  3:26     ` Darrick J. Wong
2018-06-24 19:25 ` [PATCH 15/21] xfs: repair inode block maps Darrick J. Wong
2018-07-04  3:00   ` Dave Chinner
2018-07-04  3:41     ` Darrick J. Wong
2018-06-24 19:25 ` [PATCH 16/21] xfs: repair damaged symlinks Darrick J. Wong
2018-07-04  5:45   ` Dave Chinner
2018-07-04 18:45     ` Darrick J. Wong
2018-06-24 19:25 ` [PATCH 17/21] xfs: repair extended attributes Darrick J. Wong
2018-07-06  1:03   ` Dave Chinner
2018-07-06  3:10     ` Darrick J. Wong
2018-06-24 19:25 ` [PATCH 18/21] xfs: scrub should set preen if attr leaf has holes Darrick J. Wong
2018-06-29  2:52   ` Dave Chinner
2018-06-24 19:25 ` [PATCH 19/21] xfs: repair quotas Darrick J. Wong
2018-07-06  1:50   ` Dave Chinner
2018-07-06  3:16     ` Darrick J. Wong
2018-06-24 19:25 ` [PATCH 20/21] xfs: implement live quotacheck as part of quota repair Darrick J. Wong
2018-06-24 19:25 ` [PATCH 21/21] xfs: add online scrub/repair for superblock counters Darrick J. Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180706004637.GI32415@magnolia \
    --to=darrick.wong@oracle.com \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.