From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 10/21] xfs: add helper routines for the repair code
Date: Mon, 9 Apr 2018 10:19:43 -0700 [thread overview]
Message-ID: <20180409171943.GO7500@magnolia> (raw)
In-Reply-To: <20180409002342.GS23861@dastard>
On Mon, Apr 09, 2018 at 10:23:42AM +1000, Dave Chinner wrote:
> On Sat, Apr 07, 2018 at 10:55:42AM -0700, Darrick J. Wong wrote:
> > On Fri, Apr 06, 2018 at 10:52:51AM +1000, Dave Chinner wrote:
> > > On Mon, Apr 02, 2018 at 12:57:18PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > >
> > > > Add some helper functions for repair functions that will help us to
> > > > allocate and initialize new metadata blocks for btrees that we're
> > > > rebuilding.
> > >
> > > This is more than "helper functions" - my replay is almost 700 lines
> > > long!
> > >
> > > i.e. This is a stack of extent invalidation, freeing and rmap/free
> > > space rebuild code. Needs a lot more description and context than
> > > "helper functions".
> >
> > I agree now; this patch has become overly long and incohesive. It could
> > be broken into the following pieces:
> >
> > - modifying transaction allocations to take resblks
> > or "ensuring sufficient free space to rebuild"
> >
> > - rolling transactions
> >
> > - allocation and reinit functions
> >
> > - fixing/putting on the freelist
> >
> > - dealing with leftover blocks after we rebuild a tree
> >
> > - finding btree roots
> >
> > - resetting superblock counters
> >
> > So I'll break this up into seven smaller pieces.
>
> Thanks!
>
> >
> > > .....
> > > > @@ -574,7 +575,10 @@ xfs_scrub_setup_fs(
> > > > struct xfs_scrub_context *sc,
> > > > struct xfs_inode *ip)
> > > > {
> > > > - return xfs_scrub_trans_alloc(sc->sm, sc->mp, &sc->tp);
> > > > + uint resblks;
> > > > +
> > > > + resblks = xfs_repair_calc_ag_resblks(sc);
> > > > + return xfs_scrub_trans_alloc(sc->sm, sc->mp, resblks, &sc->tp);
> > > > }
> > >
> > > What's the block reservation for?
> >
> > We rebuild a metadata structure by constructing a completely new btree
> > with blocks we got from the free space and switching the root when we're
> > done. This isn't treated any differently from any other btree shape
> > change, so we need to reserve blocks in the transaction.
>
> Can you put this in a comment?
Added to the comment above xfs_scrub_trans_alloc.
> > > > + args.tp = sc->tp;
> > > > + args.mp = sc->mp;
> > > > + args.oinfo = *oinfo;
> > > > + args.fsbno = XFS_AGB_TO_FSB(args.mp, sc->sa.agno, 0);
> > >
> > > So our target BNO is the start of the AG.
> > >
> > > > + args.minlen = 1;
> > > > + args.maxlen = 1;
> > > > + args.prod = 1;
> > > > + args.type = XFS_ALLOCTYPE_NEAR_BNO;
> > >
> > > And we do a "near" search" for a single block. i.e. we are looking
> > > for a single block near to the start of the AG. Basically, we are
> > > looking for the extent at the left edge of the by-bno tree, which
> > > may not be a single block.
> > >
> > > Would it be better to do a XFS_ALLOCTYPE_THIS_AG allocation here so
> > > we look up the by-size btree for a single block extent? The left
> > > edge of that tree will always be the smallest free extent closest to
> > > the start of the AG, and so using THIS_AG will tend to fill
> > > small freespace holes rather than tear up larger free spaces for
> > > single block allocations.....
> >
> > Makes sense. Originally I was thinking that we try to put the rebuilt
> > btree as close to the start of the AG as possible, but there's little
> > point in doing that so long as regular btree block allocation doesn't
> > bother.
> >
> > TBH I've been wondering on the side if it makes more sense for us to
> > detect things like dm-{thinp,cache} which have large(ish) underlying
> > blocks and try to consolidate metadata into a smaller number of those
> > underlying blocks at the start (or the end) of the AG? It could be
> > useful to pack the metadata into a smaller number of underlying blocks
> > so that if one piece of metadata gets hot enough the rest will end up in
> > the fast cache as well.
>
> Separate issue, probably can be done with an in-memory threshold
> for metadata allocation (e.g. search for BNO < 5% from start of AG).
> I've thought about doing somthing like that (reserved area for
> metadata) for some time, but I've never really seen signficant
> advantages to doing it...
<nod> I suppose on a spinny disk it might cut down scrub/repair time,
though maybe for pmem it'll prove advantageous to try to keep free space
as unfragmented as possible to increase the likelihood of large page
use. Oh well, that's a research question for another thread. :)
> > > > +/* Put a block back on the AGFL. */
> > > > +int
> > > > +xfs_repair_put_freelist(
> > > > + struct xfs_scrub_context *sc,
> > > > + xfs_agblock_t agbno)
> > > > +{
> > > > + struct xfs_owner_info oinfo;
> > > > + int error;
> > > > +
> > > > + /*
> > > > + * Since we're "freeing" a lost block onto the AGFL, we have to
> > > > + * create an rmap for the block prior to merging it or else other
> > > > + * parts will break.
> > > > + */
> > > > + xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> > > > + error = xfs_rmap_alloc(sc->tp, sc->sa.agf_bp, sc->sa.agno, agbno, 1,
> > > > + &oinfo);
> > > > + if (error)
> > > > + return error;
> > > > +
> > > > + /* Put the block on the AGFL. */
> > > > + error = xfs_alloc_put_freelist(sc->tp, sc->sa.agf_bp, sc->sa.agfl_bp,
> > > > + agbno, 0);
> > >
> > > At what point do we check there's actually room in the AGFL for this
> > > block?
> > >
> > > > + if (error)
> > > > + return error;
> > > > + xfs_extent_busy_insert(sc->tp, sc->sa.agno, agbno, 1,
> > > > + XFS_EXTENT_BUSY_SKIP_DISCARD);
> > > > +
> > > > + /* Make sure the AGFL doesn't overfill. */
> > > > + return xfs_repair_fix_freelist(sc, true);
> > >
> > > i.e. shouldn't this be done first?
> >
> >
> >
>
> ???
That was supposed to be "Yes".
>
> > > > + else if (resv == XFS_AG_RESV_AGFL)
> > > > + error = xfs_repair_put_freelist(sc, agbno);
> > > > + else
> > > > + error = xfs_free_extent(sc->tp, fsbno, 1, oinfo, resv);
> > > > + if (error)
> > > > + break;
> > > > +
> > > > + if (sc->ip)
> > > > + error = xfs_trans_roll_inode(&sc->tp, sc->ip);
> > > > + else
> > > > + error = xfs_repair_roll_ag_trans(sc);
> > >
> > > why don't we break on error here?
> >
> > We do, but it's up in the for loop.
>
> I missed that completely. I'd much prefer we don't hide "break on
> error" conditions inside the for loop machinery because it is so
> easy to miss...
I refactored the loop body into a separate helper, which decreased the
indentation and made the loop control clearer.
>
> > > > +
> > > > + trace_xfs_repair_collect_btree_extent(sc->mp,
> > > > + XFS_FSB_TO_AGNO(sc->mp, fsbno),
> > > > + XFS_FSB_TO_AGBNO(sc->mp, fsbno), len);
> > > > +
> > > > + rae = kmem_alloc(sizeof(struct xfs_repair_extent),
> > > > + KM_MAYFAIL | KM_NOFS);
> > >
> > > Why KM_NOFS?
> >
> > The caller holds a transaction and (potentially) has locked an entire
> > AG, so we want to avoid recursing into the filesystem to free up memory.
>
> If we're in transaction context, we are already under a KM_NOFS
> allocation contexts. SO it shouldn't be needed here - if we are
> called outside transaction context, then we need a comment
> explaining it (as opposed to using KM_NOFS to shut up lockdep).
Ok, I'll get rid of the KM_NOFS in this and all the other patches.
(I admit I probably /have/ been stuffing in KM_NOFS to shut up
lockdep...)
> > > > + for_each_xfs_repair_extent_safe(rae, n, exlist) {
> > > > + agno = XFS_FSB_TO_AGNO(sc->mp, rae->fsbno);
> > > > + agbno = XFS_FSB_TO_AGBNO(sc->mp, rae->fsbno);
> > > > + for (i = 0; i < rae->len; i++) {
> > > > + bp = xfs_btree_get_bufs(sc->mp, sc->tp, agno,
> > > > + agbno + i, 0);
> > > > + xfs_trans_binval(sc->tp, bp);
> > > > + }
> > > > + }
> > >
> > > What if they are data blocks we are dumping? xfs_trans_binval is
> > > fine for metadata blocks, but creating a stale metadata buffer and
> > > logging a buffer cancel for user data blocks doesn't make any sense
> > > to me....
> >
> > exlist is only supposed to contain references to metadata blocks that we
> > collected from the rmapbt. The current iteration of the repair code
> > shouldn't ever dump file data blocks.
>
> That needs explanation, then. I had no idea that exlist was only for
> metadata blocks...
Ok, comment changed to:
/*
* Invalidate buffers for per-AG btree blocks we're dumping. We assume
* that exlist points only to metadata blocks.
*/
> > > > +/* Remove all the blocks in sublist from exlist. */
> > > > +#define LEFT_CONTIG (1 << 0)
> > > > +#define RIGHT_CONTIG (1 << 1)
> > >
> > > Namespace, please.
> > >
> > > > +int
> > > > +xfs_repair_subtract_extents(
> > > > + struct xfs_scrub_context *sc,
> > > > + struct xfs_repair_extent_list *exlist,
> > > > + struct xfs_repair_extent_list *sublist)
> > >
> > > What is this function supposed to do? I can't really review the code
> > > (looks complex) because I don't know what it's function is supposed
> > > to be.
> >
> > The intent is that we iterate the rmapbt for all of its records for a
> > given owner to generate exlist. Then we iterate all blocks of metadata
> > structures that we are not rebuilding but have the same rmapbt owner to
> > generate sublist. This routine subtracts all the blocks mentioned in
> > sublist from all the extents linked in exlist, which leaves exlist as
> > the list of all blocks owned by the old version of btree that we're
> > rebuilding. The list is fed into _reap_btree_extents to free the old
> > btree blocks (or merely remove the rmap if the block is crosslinked).
>
> Comment, please :)
Added.
> > > > +
> > > > + if (delta_icount) {
> > > > + error = xfs_mod_icount(mp, delta_icount);
> > > > + if (error)
> > > > + goto out;
> > > > + }
> > > > + if (delta_ifree) {
> > > > + error = xfs_mod_ifree(mp, delta_ifree);
> > > > + if (error)
> > > > + goto out;
> > > > + }
> > > > + if (delta_fdblocks) {
> > > > + error = xfs_mod_fdblocks(mp, delta_fdblocks, false);
> > > > + if (error)
> > > > + goto out;
> > > > + }
> > > > +
> > > > +out:
> > > > + xfs_trans_cancel(tp);
> > > > + return error;
> > >
> > > Ummmm - why do we do all this superblock modification work in a
> > > transaction context, and then just cancel it?
> >
> > It's an empty transaction, so nothing gets dirty and nothing needs
> > committing. TBH I don't even think this is necessary, since we only use
> > it for reading the agi/agf, and for that we can certainly _buf_relse
> > instead of letting the _trans_cancel do it.
>
> Yeah, it'd be better to get rid of the transaction context if we
> can. It should also explain how it avoids races with ongoing
> superblock counter updates...
<nod> AFAICT it's a simple matter of taking m_sb_lock to update the
m_sb counters, after which we unlock and do the percpu counters.
> > > > + inobt_sz = xfs_iallocbt_calc_size(mp, icount /
> > > > + XFS_INODES_PER_CHUNK);
> > > > + if (xfs_sb_version_hasfinobt(&mp->m_sb))
> > > > + inobt_sz *= 2;
> > > > + if (xfs_sb_version_hasreflink(&mp->m_sb)) {
> > > > + rmapbt_sz = xfs_rmapbt_calc_size(mp, aglen);
> > > > + refcbt_sz = xfs_refcountbt_calc_size(mp, usedlen);
> > > > + } else {
> > > > + rmapbt_sz = xfs_rmapbt_calc_size(mp, usedlen);
> > > > + refcbt_sz = 0;
> > > > + }
> > > > + if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> > > > + rmapbt_sz = 0;
> > > > +
> > > > + trace_xfs_repair_calc_ag_resblks_btsize(mp, sm->sm_agno, bnobt_sz,
> > > > + inobt_sz, rmapbt_sz, refcbt_sz);
> > > > +
> > > > + return max(max(bnobt_sz, inobt_sz), max(rmapbt_sz, refcbt_sz));
> > >
> > > Why are we only returning the resblks for a single tree rebuild?
> > > Shouldn't it return the total blocks require to rebuild all all the
> > > trees?
> >
> > A repair call only ever rebuilds one btree in one AG, so I don't see why
> > we'd need to reserve space to rebuild all the btrees in the AG (or the
> > same btree in all AGs)... though it's possible that I don't understand
> > the question?
>
> It's not clear that we are only doing a reservation for a single
> tree in an AG. needs a comment to explain the context the
> reservation is being made for.
/*
* Figure out how many blocks to reserve for an AG repair. We calculate
* the worst case estimate for the number of blocks we'd need to rebuild
* one of any type of per-AG btree.
*/
--D
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
next prev parent reply other threads:[~2018-04-09 17:19 UTC|newest]
Thread overview: 39+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-04-02 19:56 [PATCH v14.1 00/21] xfs: online repair support Darrick J. Wong
2018-04-02 19:56 ` [PATCH 01/21] xfs: add helpers to calculate btree size Darrick J. Wong
2018-04-02 19:56 ` [PATCH 02/21] xfs: expose various functions to repair code Darrick J. Wong
2018-04-02 19:56 ` [PATCH 03/21] xfs: add repair helpers for the reverse mapping btree Darrick J. Wong
2018-04-05 23:02 ` Dave Chinner
2018-04-06 16:31 ` Darrick J. Wong
2018-04-02 19:56 ` [PATCH 04/21] xfs: add repair helpers for the reference count btree Darrick J. Wong
2018-04-02 19:56 ` [PATCH 05/21] xfs: add BMAPI_NORMAP flag to perform block remapping without updating rmapbt Darrick J. Wong
2018-04-05 23:07 ` Dave Chinner
2018-04-06 16:41 ` Darrick J. Wong
2018-04-02 19:56 ` [PATCH 06/21] xfs: make xfs_bmapi_remapi work with attribute forks Darrick J. Wong
2018-04-05 23:12 ` Dave Chinner
2018-04-06 17:31 ` Darrick J. Wong
2018-04-02 19:56 ` [PATCH 07/21] xfs: halt auto-reclamation activities while rebuilding rmap Darrick J. Wong
2018-04-05 23:14 ` Dave Chinner
2018-04-02 19:57 ` [PATCH 08/21] xfs: create tracepoints for online repair Darrick J. Wong
2018-04-05 23:15 ` Dave Chinner
2018-04-02 19:57 ` [PATCH 09/21] xfs: implement the metadata repair ioctl flag Darrick J. Wong
2018-04-05 23:24 ` Dave Chinner
2018-04-02 19:57 ` [PATCH 10/21] xfs: add helper routines for the repair code Darrick J. Wong
2018-04-06 0:52 ` Dave Chinner
2018-04-07 17:55 ` Darrick J. Wong
2018-04-09 0:23 ` Dave Chinner
2018-04-09 17:19 ` Darrick J. Wong [this message]
2018-04-02 19:57 ` [PATCH 11/21] xfs: repair superblocks Darrick J. Wong
2018-04-06 1:05 ` Dave Chinner
2018-04-07 18:04 ` Darrick J. Wong
2018-04-09 0:26 ` Dave Chinner
2018-04-09 17:36 ` Darrick J. Wong
2018-04-02 19:57 ` [PATCH 12/21] xfs: repair the AGF and AGFL Darrick J. Wong
2018-04-02 19:57 ` [PATCH 13/21] xfs: repair the AGI Darrick J. Wong
2018-04-02 19:57 ` [PATCH 14/21] xfs: repair free space btrees Darrick J. Wong
2018-04-02 19:57 ` [PATCH 15/21] xfs: repair inode btrees Darrick J. Wong
2018-04-02 19:57 ` [PATCH 16/21] xfs: repair the rmapbt Darrick J. Wong
2018-04-02 19:58 ` [PATCH 17/21] xfs: repair refcount btrees Darrick J. Wong
2018-04-02 19:58 ` [PATCH 18/21] xfs: repair inode records Darrick J. Wong
2018-04-02 19:58 ` [PATCH 19/21] xfs: zap broken inode forks Darrick J. Wong
2018-04-02 19:58 ` [PATCH 20/21] xfs: repair inode block maps Darrick J. Wong
2018-04-02 19:58 ` [PATCH 21/21] xfs: repair damaged symlinks Darrick J. Wong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20180409171943.GO7500@magnolia \
--to=darrick.wong@oracle.com \
--cc=david@fromorbit.com \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).