From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 04/22] xfs: add helpers to dispose of old btree blocks after a repair
Date: Wed, 16 May 2018 22:58:05 -0700
Message-ID: <20180517055805.GR23858@magnolia>
In-Reply-To: <20180516231820.GO23858@magnolia>

On Wed, May 16, 2018 at 04:18:20PM -0700, Darrick J. Wong wrote:
> On Thu, May 17, 2018 at 08:32:25AM +1000, Dave Chinner wrote:
> > On Wed, May 16, 2018 at 12:34:25PM -0700, Darrick J. Wong wrote:
> > > On Wed, May 16, 2018 at 06:32:32PM +1000, Dave Chinner wrote:
> > > > On Tue, May 15, 2018 at 03:34:04PM -0700, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > >
> > > > > Now that we've plumbed in the ability to construct a list of dead btree
> > > > > blocks following a repair, add more helpers to dispose of them. This is
> > > > > done by examining the rmapbt -- if the btree was the only owner we can
> > > > > free the block, otherwise it's crosslinked and we can only remove the
> > > > > rmapbt record.
> > > > >
> > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > ---
> >
> > [...]
> >
> > > > > + struct xfs_owner_info oinfo;
> > > > > + struct xfs_perag *pag;
> > > > > + int error;
> > > > > +
> > > > > + /* Make sure there's space on the freelist. */
> > > > > + error = xfs_repair_fix_freelist(sc, true);
> > > > > + if (error)
> > > > > + return error;
> > > > > + pag = xfs_perag_get(sc->mp, sc->sa.agno);
> > > >
> > > > Because this is how it quickly gets to silly numbers of
> > > > lookups. That's two now in this function.
> > > >
> > > > > + if (pag->pagf_flcount == 0) {
> > > > > + xfs_perag_put(pag);
> > > > > + return -EFSCORRUPTED;
> > > >
> > > > Why is having an empty freelist a problem here? It's an AG that is
> > > > completely out of space, but it isn't corruption? And I don't see
> > > > why an empty freelist prevents us from adding a block back onto the
> > > > AGFL?
> >
> > I think you missed a question :P
>
> Doh, sorry. I don't remember exactly why I put that in there; judging
> from my notes I think the idea was that if the AG is completely full
> we'd rather shut down with a corruption signal hoping that the admin
> will run xfs_repair.
>
> I also don't see why it's necessary now, I'll see what happens if I
> remove it.
>
> > > > > + /* Can we find any other rmappings? */
> > > > > + error = xfs_rmap_has_other_keys(cur, agbno, 1, oinfo, &has_other_rmap);
> > > > > + if (error)
> > > > > + goto out_cur;
> > > > > + xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> > > > > +
> > > > > + /*
> > > > > + * If there are other rmappings, this block is cross linked and must
> > > > > + * not be freed. Remove the reverse mapping and move on. Otherwise,
> > > >
> > > > Why do we just remove the reverse mapping if the block cannot be
> > > > freed? I have my suspicions that this is removing cross-links one by
> > > > one until there's only one reference left to the extent, but then I
> > > > ask "how do we know which one is the correct mapping"?
> > >
> > > Right. Prior to calling this function we built a totally new btree with
> > > blocks from the freespace, so now we need to remove the rmaps that
> > > covered the old btree and/or free the block. The goal is to rebuild
> > > /all/ the trees that think they own this block so that we can free the
> > > block and not have to care which one is correct.
> >
> > Ok, so we've already rebuilt the new btree, and this is removing
> > stale references to cross-linked blocks that have owners different
> > to the one we are currently scanning.
> >
> > What happens if the cross-linked block is cross-linked within the
> > same owner context?
>
> It won't end up on the reap list in the first place, because we scan every
> block of every object with the same rmap owner to construct sublist.
> Then we subtract sublist from exlist (which we got from rmap) and only
> reap the difference.
>
> > > > > + struct xfs_scrub_context *sc,
> > > > > + xfs_fsblock_t fsbno,
> > > > > + xfs_extlen_t len,
> > > > > + struct xfs_owner_info *oinfo,
> > > > > + enum xfs_ag_resv_type resv)
> > > > > +{
> > > > > + struct xfs_mount *mp = sc->mp;
> > > > > + int error = 0;
> > > > > +
> > > > > + ASSERT(xfs_sb_version_hasrmapbt(&mp->m_sb));
> > > > > + ASSERT(sc->ip != NULL || XFS_FSB_TO_AGNO(mp, fsbno) == sc->sa.agno);
> > > > > +
> > > > > + trace_xfs_repair_dispose_btree_extent(mp, XFS_FSB_TO_AGNO(mp, fsbno),
> > > > > + XFS_FSB_TO_AGBNO(mp, fsbno), len);
> > > > > +
> > > > > + for (; len > 0; len--, fsbno++) {
> > > > > + error = xfs_repair_dispose_btree_block(sc, fsbno, oinfo, resv);
> > > > > + if (error)
> > > > > + return error;
> > > >
> > > > So why do we do this one block at a time, rather than freeing it
> > > > as an entire extent in one go?
> > >
> > > At the moment the xfs_rmap_has_other_keys helper can only tell you if
> > > there are multiple rmap owners for any part of a given extent. For
> > > example, if the rmap records were:
> > >
> > > (start = 35, len = 3, owner = rmap)
> > > (start = 35, len = 1, owner = refcount)
> > > (start = 37, len = 1, owner = inobt)
> > >
> > > Notice how block 35 and 37 are crosslinked, but 36 isn't. A call to
> > > xfs_rmap_has_other_keys(35, 3) will say "yes" but doesn't have a way to
> > > signal back that the yes applies to 35 but that the caller should try
> > > again with block 36. Doing so would require _has_other_keys to maintain
> > > a refcount and to return to the caller any time the refcount changed,
> > > and the caller would still have to loop the extent. It's easier to have
> > > a dumb loop for the initial implementation and optimize it if we start
> > > taking more heat than we'd like on crosslinked filesystems.
> >
> > Well, I can see why you are doing this now, but the problems with
> > multi-block metadata make me think that we really need to know more
> > detail of the owner in the rmap. e.g. that it's directory or
> > attribute data, not user file data and hence we can infer things
> > about expected block sizes, do the correct sort of buffer lookups
> > for invalidation, etc.
>
> I'm not sure we can do that without causing a deadlocking problem, since
> we lock all the AG headers to rebuild a btree and in general we can't
> _iget an inode to find out if it's a dir or not. But I have more to say
> on this in a few paragraphs...
>
> > I'm tending towards "this needs a design doc to explain all
> > this stuff" right now. Code is great, but I'm struggling to understand
> > (reverse engineer!) all the algorithms and decisions that have been
> > made from the code...
>
> Working on it.

Nearly my bedtime, so here's the current draft:

/*
 * Reconstructing per-AG Btrees
 *
 * When a space btree is corrupt, we don't bother trying to fix it.
 * Instead, we scan secondary space metadata to derive the records that
 * should be in the damaged btree, initialize a fresh btree root, and
 * insert the records.  Note that for rebuilding the rmapbt we scan all
 * the primary data.
 *
 * However, that leaves the matter of removing all the metadata
 * describing the old broken structure.  For primary metadata we use the
 * rmap data to construct a first bitmap of every extent with a matching
 * rmap owner; we then iterate all other metadata structures with the
 * same rmap owner to construct a second bitmap of rmaps that cannot be
 * removed.  We then subtract the second bitmap from the first bitmap
 * (first & ~second) to derive the blocks that were used by the old
 * btree.  These blocks can be reaped.
 *
 * For rmapbt reconstructions we must use different tactics.  First we
 * iterate all primary metadata (this excludes the old rmapbt,
 * obviously) to generate new rmap records.  Then we iterate the new
 * rmap records to find the gaps, which should encompass the free
 * space and the old rmapbt blocks.  That corresponds to the 'first
 * bitmap' of the previous section.  The bnobt is iterated to generate
 * the second bitmap of the previous section.  We then reap the blocks
 * corresponding to the difference just like we do for primary data.
 *
 * The comment for xfs_repair_reap_btree_extents will describe the block
 * disposal process in more detail.
 */
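
To make the (first & ~second) step concrete, here's a toy userspace
sketch.  None of these names exist in the patch; the one-word bitmap and
the toy_* helpers are made up purely for illustration, and the real code
tracks extents in lists rather than in a single 64-bit word.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: the whole toy AG fits in one 64-bit word, one bit per block. */
#define TOY_AG_BLOCKS	64

/* Return @bitmap with blocks [start, start + len) marked. */
uint64_t
toy_bitmap_set(
	uint64_t	bitmap,
	unsigned int	start,
	unsigned int	len)
{
	unsigned int	i;

	assert(start + len <= TOY_AG_BLOCKS);
	for (i = 0; i < len; i++)
		bitmap |= UINT64_C(1) << (start + i);
	return bitmap;
}

/*
 * @first:  every block that the rmap data says has the matching owner.
 * @second: blocks with that owner that other live structures still use.
 * What's left over is what the old btree was using, i.e. what we can reap.
 */
uint64_t
toy_reap_candidates(
	uint64_t	first,
	uint64_t	second)
{
	return first & ~second;
}
```

For example, if the rmap data claims blocks 10-15 for the owner and
blocks 12-13 are still referenced by some other structure with the same
owner, the reap candidates come out as blocks 10, 11, 14 and 15.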

And later, down by xfs_repair_reap_btree_extents:

/*
 * Dispose of btree blocks from the old per-AG btree.
 *
 * Now that we've constructed a new btree to replace the damaged one, we
 * want to dispose of the blocks that (we think) the old btree was
 * using.  Previously, we used the rmapbt to construct a list of extents
 * (@exlist) with the rmap owner corresponding to the tree we rebuilt,
 * then subtracted out any other blocks with the same rmap owner that
 * are owned by another data structure.  In theory the extents in
 * @exlist are the old btree's blocks.
 *
 * Unfortunately, it's possible that the btree was crosslinked with
 * other blocks on disk.  The rmap data can tell us if there are
 * multiple owners, so if the rmapbt says there is an owner of this
 * block other than @oinfo, then the block is crosslinked.  Remove the
 * reverse mapping and continue.
 *
 * If there is one rmap record, we can free the block, which removes the
 * reverse mapping but doesn't add the block to the free space.  Our
 * repair strategy is to hope the other metadata objects crosslinked on
 * this block will be rebuilt (atop different blocks), thereby removing
 * all the cross links.
 *
 * If there are no rmap records at all, we also free the block.  If the
 * btree being rebuilt lives in the free space (bnobt/cntbt/rmapbt) then
 * there isn't supposed to be a rmap record and everything is ok.  For
 * other btrees there had to have been an rmap entry for the block to
 * have ended up on @exlist, so if it's gone now there's something wrong
 * and the fs will shut down.
 *
 * The caller is responsible for locking the AG headers for the entire
 * rebuild operation so that nothing else can sneak in and change the AG
 * state while we're not looking.  We also assume that the caller
 * already invalidated any buffers associated with @exlist.
 */
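
Roughly, the per-block decision described in that comment could be
sketched like this.  This is only my reading of the three cases written
as a standalone toy function; the enum, the function and its parameters
are hypothetical and are not what the patch actually contains.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes for a single block taken from @exlist. */
enum toy_reap_action {
	TOY_REAP_REMOVE_RMAP,	/* crosslinked: drop our rmap, leave the block */
	TOY_REAP_FREE_BLOCK,	/* nobody else claims it: free the block */
	TOY_REAP_CORRUPT,	/* the rmap we expected has gone missing */
};

/*
 * @nr_rmaps:            all rmap records covering this block
 * @nr_own_rmaps:        the subset of those that match @oinfo
 * @lives_in_free_space: true for bnobt/cntbt/rmapbt style btrees, where
 *                       the comment says no rmap record is expected
 */
enum toy_reap_action
toy_reap_decide(
	unsigned int	nr_rmaps,
	unsigned int	nr_own_rmaps,
	bool		lives_in_free_space)
{
	assert(nr_own_rmaps <= nr_rmaps);

	/* Some owner other than @oinfo also maps this block. */
	if (nr_rmaps > nr_own_rmaps)
		return TOY_REAP_REMOVE_RMAP;

	/* Only our own record is left. */
	if (nr_own_rmaps > 0)
		return TOY_REAP_FREE_BLOCK;

	/* No records at all: fine only if none was expected. */
	return lives_in_free_space ? TOY_REAP_FREE_BLOCK : TOY_REAP_CORRUPT;
}
```

The caller would run this once per block of each extent, which matches
the dumb one-block-at-a-time loop discussed earlier in the thread.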

Later, for the function that finds AG btree roots for agf/agi
reconstruction:

/*
 * Find the roots of the per-AG btrees described in btree_info.
 *
 * The caller provides information about the btrees to look for by
 * passing in an array (@btree_info) of xfs_repair_find_ag_btree with
 * the (rmap owner, buf_ops, magic) fields set.  The last element of the
 * array should have a NULL buf_ops, and the (root, height) fields will
 * be set on return if anything is found.
 *
 * For every rmapbt record matching any of the rmap owners in
 * @btree_info, read each block referenced by the rmap record.  If the
 * block is a btree block from this filesystem matching any of the magic
 * numbers and has a level higher than what we've already seen, remember
 * the block and the height of the tree required to have such a block.
 * When the call completes, we return the highest block we've found for
 * each btree description; those should be the roots.
 *
 * The caller must lock the applicable per-AG header buffers (AGF, AGI)
 * to prevent other threads from changing the shape of the btrees that
 * we are looking for.  It must maintain those locks until it's safe for
 * other threads to change the btrees' shapes.
 */
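
The "remember the highest block" part of that can be sketched in
userspace as below.  Everything here is a stand-in: the toy_* structs
are hypothetical, the array is terminated by a zero magic instead of a
NULL buf_ops, the blocks are handed in pre-read rather than found by
walking the rmapbt, and I'm assuming leaves are level 0 so that the
required height is level + 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cut-down btree description: what to match, what we found. */
struct toy_find_btree {
	uint32_t	magic;	/* magic to match; 0 terminates the array */
	uint32_t	root;	/* block number of the highest block seen */
	unsigned int	height;	/* stays 0 if nothing was found */
};

/* Hypothetical summary of one block that an rmap record pointed us at. */
struct toy_block {
	uint32_t	agbno;
	uint32_t	magic;
	unsigned int	level;
};

/* Keep, for each description, the block with the highest level. */
void
toy_find_roots(
	struct toy_find_btree	*info,
	const struct toy_block	*blocks,
	size_t			nr_blocks)
{
	struct toy_find_btree	*fab;
	size_t			i;

	assert(info != NULL);
	for (i = 0; i < nr_blocks; i++) {
		for (fab = info; fab->magic != 0; fab++) {
			if (blocks[i].magic != fab->magic)
				continue;
			if (blocks[i].level + 1 > fab->height) {
				fab->root = blocks[i].agbno;
				fab->height = blocks[i].level + 1;
			}
		}
	}
}
```

A description whose height is still zero afterwards means no block with
that magic turned up, so the caller would have to treat that btree as
not found.
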
--D
>
> > > > > +/*
> > > > > + * Invalidate buffers for per-AG btree blocks we're dumping. We assume that
> > > > > + * exlist points only to metadata blocks.
> > > > > + */
> > > > > +int
> > > > > +xfs_repair_invalidate_blocks(
> > > > > + struct xfs_scrub_context *sc,
> > > > > + struct xfs_repair_extent_list *exlist)
> > > > > +{
> > > > > + struct xfs_repair_extent *rex;
> > > > > + struct xfs_repair_extent *n;
> > > > > + struct xfs_buf *bp;
> > > > > + xfs_agnumber_t agno;
> > > > > + xfs_agblock_t agbno;
> > > > > + xfs_agblock_t i;
> > > > > +
> > > > > + for_each_xfs_repair_extent_safe(rex, n, exlist) {
> > > > > + agno = XFS_FSB_TO_AGNO(sc->mp, rex->fsbno);
> > > > > + agbno = XFS_FSB_TO_AGBNO(sc->mp, rex->fsbno);
> > > > > + for (i = 0; i < rex->len; i++) {
> > > > > + bp = xfs_btree_get_bufs(sc->mp, sc->tp, agno,
> > > > > + agbno + i, 0);
> > > > > + xfs_trans_binval(sc->tp, bp);
> > > > > + }
> > > >
> > > > Again, this is doing things by single blocks. We do have multi-block
> > > > metadata (inodes, directory blocks, remote attrs) that, if it
> > > > is already in memory, needs to be treated as multi-block extents. If
> > > > we don't do that, we'll cause aliasing problems in the buffer cache
> > > > (see _xfs_buf_obj_cmp()) and it's all downhill from there.
> > >
> > > I only recently started testing with filesystems containing multiblock
> > > dir/rmt metadata, and this is an unsolved problem. :(
> >
> > That needs documenting, too. Perhaps explicitly, by rejecting repair
> > requests on filesystems or types that have multi-block constructs
> > until we solve these problems.
>
> Trouble is, remote attr values can have an xfs_buf that spans however
> many blocks you need to store a full 64k value, and what happens if the
> rmapbt collides with that? It sorta implies that we can't do
> invalidation on /any/ filesystem, which is unfortunate....
>
> ...unless we have an easy way of finding /any/ buffer that points to a
> given block? Probably not, since iirc they're indexed by the first disk
> block number. Hm. I suppose we could use the rmap data to look for
> anything within 64k of the logical offset of an attr/data rmap
> overlapping the same block...
>
> ...but on second thought we only care about invalidating the buffer if
> the block belonged to the ag btree we've just killed, right? If there's
> a multi-block buffer because it's part of a directory or an rmt block
> then the buffer is clearly owned by someone else and we don't even have
> to look for that. Likewise, if it's a single-block buffer but the
> block has some other magic then we don't own it and we should leave it
> alone.
>
> > > I /think/ the solution is that we need to query the buffer cache to see
> > > if it has a buffer for the given disk blocks, and if it matches the
> > > btree we're discarding (correct magic/uuid/b_length) then we invalidate
> > > it,
> >
> > I don't think that provides any guarantees. Even ignoring all the
> > problems with invalidation while the buffer is dirty and tracked in
> > the AIL, there's nothing stopping the other code from attempting to
> > re-instantiate the buffer due to some other access. And then we
> > have aliasing problems again....
>
> Well, we /could/ just freeze the fs while we do repairs on any ag btree.
>
> --D
>
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html