From: Dave Chinner <david@fromorbit.com>
To: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH v19 00/18] xfs: online repair support
Date: Mon, 5 Aug 2019 17:20:31 +1000
Message-ID: <20190805072031.GW7777@dread.disaster.area>
In-Reply-To: <156496528310.804304.8105015456378794397.stgit@magnolia>
On Sun, Aug 04, 2019 at 05:34:43PM -0700, Darrick J. Wong wrote:
> Hi all,
>
> This is the first part of the nineteenth revision of a patchset that
> adds to XFS kernel support for online metadata scrubbing and repair.
> There aren't any on-disk format changes.
>
> New for this version is a rebase against 5.3-rc2, integration with the
> health reporting subsystem, and the explicit revalidation of all
> metadata structures that were rebuilt.
>
> Patch 1 lays the groundwork for scrub types specifying a revalidation
> function that will check everything that the repair function might have
> rebuilt. This will be necessary for the free space and inode btree
> repair functions, which rebuild both btrees at once.
>
> Patch 2 ensures that the health reporting query code doesn't get in the
> way of post-repair revalidation of all rebuilt metadata structures.
>
> Patch 3 creates a new data structure that provides an abstraction of a
> big memory array by using linked lists. This is where we store records
> for btree reconstruction. This first implementation is memory
> inefficient and consumes a /lot/ of kernel memory, but lays the
> groundwork for the last patch in the set to convert the implementation
> to use a (memfd) swap file, which enables us to use pageable memory
> without pounding the slab cache.
>
> Patches 4-10 implement reconstruction of the free space btrees, inode
> btrees, reference count btrees, inode records, inode forks, inode block
> maps, and symbolic links.
Darrick and I had a discussion on #xfs about the btree rebuilds
mainly centered around robustness. The biggest issue I saw with the
code as it stands is that we replace the existing btree as we build
it. As a result, we go from a complete tree with a single corruption
to an empty tree with lots of external dangling references (i.e.
massive corruption!) until the rebuild finishes. Hence if we crash
while the rebuild is in progress, we risk being in a state where:
  - log recovery will abort because it trips over partial tree
    state
  - mount fails because scanning the btree at mount time falls
    off the end of the btree unexpectedly, doesn't find enough
    free space for reservations, etc.
  - mounting succeeds but then the first operations fail
    because the tree is incomplete and the filesystem
    immediately shuts down.
So if we crash while there is a background repair taking place on
the root filesystem, then it is very likely the system will not boot
up after the crash. :(
We came to the conclusion - independently, at the same time :) -
that we should rebuild btrees in known free space with a dangling
root node and then, once the whole new tree has been built, we
atomically swap the btree root nodes. Hence if we crash during
rebuild, we just have some dangling, unreferenced used space that a
subsequent scrub/repair/rebuild cycle will release back to the free
space pool.
That leaves the original corrupt tree in place, and hence we don't
make things any worse than they already are by trying to repair the
tree. The atomic swap of the root nodes allows failsafe transition
between the old and new trees, and the rebuild can then free the
space the old tree used. If we crash at this point, then the old
tree is just dangling, unreferenced used space and a subsequent
scrub/repair/rebuild cycle will release it back to the free space
pool.
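
To make the crash cases above concrete, here is a rough toy model of the scheme. It is not XFS code: ToyAG, build_tree, rebuild, sweep_unreferenced and the crash_at switch are all made-up names for illustration, and a "tree" is just one root block listing its leaf blocks. The point is only that the root pointer update is the single commit step, and that a crash on either side of it leaves a readable tree plus some unreferenced blocks that a later pass can free.

```python
class Crash(Exception):
    """Simulated crash at a chosen point in the rebuild."""


class ToyAG:
    """Toy allocation group: a block store, a free list and one root pointer."""

    def __init__(self, nblocks):
        self.blocks = {}                  # block number -> contents
        self.free = set(range(nblocks))   # unallocated block numbers
        self.root = None                  # block number of the live tree root

    def alloc(self):
        b = min(self.free)                # deterministic choice for the demo
        self.free.remove(b)
        return b


def build_tree(ag, records, per_leaf=2):
    """Write a tree into free blocks and return its root; ag.root is untouched."""
    leaves = []
    for i in range(0, len(records), per_leaf):
        b = ag.alloc()
        ag.blocks[b] = list(records[i:i + per_leaf])
        leaves.append(b)
    root = ag.alloc()
    ag.blocks[root] = leaves
    return root


def tree_blocks(ag, root):
    """All blocks owned by the tree rooted at 'root'."""
    return [root] + list(ag.blocks[root])


def read_tree(ag):
    """Records reachable from the live root."""
    return [r for leaf in ag.blocks[ag.root] for r in ag.blocks[leaf]]


def reap(ag, blocks):
    """Return blocks to the free pool."""
    for b in blocks:
        del ag.blocks[b]
        ag.free.add(b)


def rebuild(ag, records, crash_at=None):
    old = tree_blocks(ag, ag.root)
    new_root = build_tree(ag, records)    # staged in free space, unreferenced
    if crash_at == "before_swap":
        raise Crash()
    ag.root = new_root                    # the single atomic commit point
    if crash_at == "after_swap":
        raise Crash()
    reap(ag, old)                         # old tree is now safe to free


def sweep_unreferenced(ag):
    """Stand-in for a later scrub/repair pass: free whatever the live tree does not own."""
    live = set(tree_blocks(ag, ag.root))
    orphans = [b for b in ag.blocks if b not in live]
    reap(ag, orphans)
    return len(orphans)
```

Calling rebuild() with crash_at="before_swap" leaves read_tree() returning the original records, and with crash_at="after_swap" it returns the new ones; in both cases sweep_unreferenced() reclaims the leftover blocks.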
This mechanism also works with xfs_repair - if we run xfs_repair
after a crash during online rebuild, it will still see the original
corrupt trees, find the dangling space as well, and clean
everything up with a new tree rebuild. Which means, again, an online
rebuild failure does not make anything worse than before the rebuild
started.
Darrick thinks that this can quite easily be done simply by skipping
the root node pointer update (->set_root, IIRC) until the new tree
has been fully rebuilt. Hopefully that is the case, because an
atomic swap mechanism like this will make the repair algorithms a
lot more robust. :)
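
A minimal sketch of what "skipping the root pointer update until the end" could look like, again as a toy and not the real btree ops: RootHolder, StagedRoot and rebuild_into are hypothetical names, and the set_root() method here only borrows the name of the hook mentioned above without claiming anything about its real signature. The builder reports a new root on every insertion; the staging wrapper swallows those reports and forwards only the final one on commit.

```python
class RootHolder:
    """Stand-in for the header field that points at the live tree root."""

    def __init__(self, root):
        self.root = root
        self.updates = 0      # how many times the visible root changed

    def set_root(self, new_root):
        self.root = new_root
        self.updates += 1


class StagedRoot:
    """Absorbs set_root calls during a rebuild; forwards only the last one on commit."""

    def __init__(self, holder):
        self.holder = holder
        self.pending = None

    def set_root(self, new_root):
        self.pending = new_root           # remembered, not yet visible to readers

    def commit(self):
        if self.pending is not None:
            self.holder.set_root(self.pending)   # the one visible update
            self.pending = None


def rebuild_into(root_ops, records):
    """Toy rebuild: every insertion yields a new root and reports it via set_root."""
    tree = ()
    for rec in records:
        tree = tree + (rec,)
        root_ops.set_root(tree)
    return tree
```

Passing a RootHolder directly to rebuild_into() exposes every intermediate root to readers, which is the in-place behaviour described earlier; passing a StagedRoot keeps the old root visible until commit() is called.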
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com