From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Gao Xiang <hsiangkao@redhat.com>
Cc: linux-xfs@vger.kernel.org, Dave Chinner <dchinner@redhat.com>,
Eric Sandeen <sandeen@sandeen.net>
Subject: Re: [RFC PATCH] xfs_repair: fix rebuilding btree node block less than minrecs
Date: Tue, 9 Jun 2020 13:14:23 -0700 [thread overview]
Message-ID: <20200609201423.GD11245@magnolia> (raw)
In-Reply-To: <20200609114053.31924-1-hsiangkao@redhat.com>
On Tue, Jun 09, 2020 at 07:40:53PM +0800, Gao Xiang wrote:
> In production, we found that sometimes xfs_repair phase 5
> rebuilds freespace node block with pointers less than minrecs
> and if we trigger xfs_repair again it would report such
> the following message:
>
> bad btree nrecs (39, min=40, max=80) in btbno block 0/7882
>
> The background is that xfs_repair starts to rebuild AGFL
> after the freespace btree is settled in phase 5 so we may
> need to leave necessary room in advance for each btree
> leaves in order to avoid freespace btree split and then
> result in AGFL rebuild fails. The old mathematics uses
> ceil(num_extents / maxrecs) to decide the number of node
> blocks. That would be fine without leaving extra space
> since minrecs = maxrecs / 2 but if some slack was decreased
> from maxrecs, the result would be larger than what is
> expected and cause num_recs_pb less than minrecs, i.e:
>
> num_extents = 79, adj_maxrecs = 80 - 2 (slack) = 78
>
> so we'd get
>
> num_blocks = ceil(79 / 78) = 2,
> num_recs_pb = 79 / 2 = 39, which is less than
> minrecs = 80 / 2 = 40
>
> OTOH, btree bulk loading code behaves in a different way.
> As in xfs_btree_bload_level_geometry it wrote
>
> num_blocks = floor(num_extents / maxrecs)
>
> which will never go below minrecs. And when it goes
> above maxrecs, just increment num_blocks and recalculate
> so we can get the reasonable results.
>
> In the long term, btree bulk loader will replace the current
> repair code as well as to resolve AGFL dependency issue.
> But we may still want to look for a backportable solution
> for stable versions. Hence, use the same logic to avoid the
> freespace btree minrecs underflow for now.
>
> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: Eric Sandeen <sandeen@sandeen.net>
> Fixes: 9851fd79bfb1 ("repair: AGFL rebuild fails if btree split required")
> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
> ---
> not heavy tested yet..
>
> repair/phase5.c | 101 +++++++++++++++++++++---------------------------
> 1 file changed, 45 insertions(+), 56 deletions(-)
>
> diff --git a/repair/phase5.c b/repair/phase5.c
> index abae8a08..997804a5 100644
> --- a/repair/phase5.c
> +++ b/repair/phase5.c
> @@ -348,11 +348,29 @@ finish_cursor(bt_status_t *curs)
> * failure at runtime. Hence leave a couple of records slack space in
> * each block to allow immediate modification of the tree without
> * requiring splits to be done.
> - *
> - * XXX(hch): any reason we don't just look at mp->m_alloc_mxr?
> */
> -#define XR_ALLOC_BLOCK_MAXRECS(mp, level) \
> - (libxfs_allocbt_maxrecs((mp), (mp)->m_sb.sb_blocksize, (level) == 0) - 2)
> +static void
> +compute_level_geometry(xfs_mount_t *mp, bt_stat_level_t *lptr,
> + uint64_t nr_this_level, bool leaf)
Please don't use structure typedefs here.
Also, please indent the argument list like the rest of xfs code, e.g.
static void
compute_bnobt_geometry(
struct xfs_mount *mp,
struct bt_stat_level *lptr,
(etc)
> +{
> + unsigned int maxrecs = mp->m_alloc_mxr[!leaf];
> + int slack = leaf ? 2 : 0;
> + unsigned int desired_npb;
> +
> + desired_npb = max(mp->m_alloc_mnr[!leaf], maxrecs - slack);
> + lptr->num_recs_tot = nr_this_level;
> + lptr->num_blocks = max(1ULL, nr_this_level / desired_npb);
> +
> + lptr->num_recs_pb = nr_this_level / lptr->num_blocks;
> + lptr->modulo = nr_this_level % lptr->num_blocks;
> + if (lptr->num_recs_pb > maxrecs || (lptr->num_recs_pb == maxrecs &&
> + lptr->modulo)) {
Indentation....
if (lptr->num_recs_pb > maxrecs ||
(lptr->num_recs_pb == maxrecs && lptr->modulo)) {
lptr->num_blocks++;
...
}
> + lptr->num_blocks++;
> +
> + lptr->num_recs_pb = nr_this_level / lptr->num_blocks;
> + lptr->modulo = nr_this_level % lptr->num_blocks;
> + }
> +}
>
> /*
> * this calculates a freespace cursor for an ag.
> @@ -370,6 +388,7 @@ calculate_freespace_cursor(xfs_mount_t *mp, xfs_agnumber_t agno,
> int i;
> int extents_used;
> int extra_blocks;
> + uint64_t old_blocks;
> bt_stat_level_t *lptr;
> bt_stat_level_t *p_lptr;
> extent_tree_node_t *ext_ptr;
> @@ -388,10 +407,7 @@ calculate_freespace_cursor(xfs_mount_t *mp, xfs_agnumber_t agno,
> * of the tree and set up the cursor for the leaf level
> * (note that the same code is duplicated further down)
> */
> - lptr->num_blocks = howmany(num_extents, XR_ALLOC_BLOCK_MAXRECS(mp, 0));
> - lptr->num_recs_pb = num_extents / lptr->num_blocks;
> - lptr->modulo = num_extents % lptr->num_blocks;
> - lptr->num_recs_tot = num_extents;
> + compute_level_geometry(mp, lptr, num_extents, true);
> level = 1;
>
> #ifdef XR_BLD_FREE_TRACE
> @@ -406,18 +422,12 @@ calculate_freespace_cursor(xfs_mount_t *mp, xfs_agnumber_t agno,
> * per level is the # of blocks in the level below it
> */
> if (lptr->num_blocks > 1) {
> - for (; btree_curs->level[level - 1].num_blocks > 1
> - && level < XFS_BTREE_MAXLEVELS;
> - level++) {
> + do {
> + p_lptr = lptr;
> lptr = &btree_curs->level[level];
> - p_lptr = &btree_curs->level[level - 1];
> - lptr->num_blocks = howmany(p_lptr->num_blocks,
> - XR_ALLOC_BLOCK_MAXRECS(mp, level));
> - lptr->modulo = p_lptr->num_blocks
> - % lptr->num_blocks;
> - lptr->num_recs_pb = p_lptr->num_blocks
> - / lptr->num_blocks;
> - lptr->num_recs_tot = p_lptr->num_blocks;
> +
> + compute_level_geometry(mp, lptr,
> + p_lptr->num_blocks, false);
> #ifdef XR_BLD_FREE_TRACE
> fprintf(stderr, "%s %d %d %d %d %d\n", __func__,
> level,
> @@ -426,7 +436,9 @@ calculate_freespace_cursor(xfs_mount_t *mp, xfs_agnumber_t agno,
> lptr->modulo,
> lptr->num_recs_tot);
> #endif
> - }
> + level++;
> + } while (lptr->num_blocks > 1);
> + ASSERT (level < XFS_BTREE_MAXLEVELS);
> }
>
> ASSERT(lptr->num_blocks == 1);
> @@ -496,8 +508,11 @@ calculate_freespace_cursor(xfs_mount_t *mp, xfs_agnumber_t agno,
> * see if the number of leaf blocks will change as a result
> * of the number of extents changing
> */
> - if (howmany(num_extents, XR_ALLOC_BLOCK_MAXRECS(mp, 0))
> - != btree_curs->level[0].num_blocks) {
> + old_blocks = btree_curs->level[0].num_blocks;
> + compute_level_geometry(mp, &btree_curs->level[0], num_extents, true);
> + extra_blocks = 0;
> +
> + if (old_blocks != btree_curs->level[0].num_blocks) {
> /*
> * yes -- recalculate the cursor. If the number of
> * excess (overallocated) blocks is < xfs_agfl_size/2, we're ok.
> @@ -553,30 +568,20 @@ calculate_freespace_cursor(xfs_mount_t *mp, xfs_agnumber_t agno,
> }
>
> lptr = &btree_curs->level[0];
> - lptr->num_blocks = howmany(num_extents,
> - XR_ALLOC_BLOCK_MAXRECS(mp, 0));
> - lptr->num_recs_pb = num_extents / lptr->num_blocks;
> - lptr->modulo = num_extents % lptr->num_blocks;
> - lptr->num_recs_tot = num_extents;
> level = 1;
>
> /*
> * if we need more levels, set them up
> */
> if (lptr->num_blocks > 1) {
> - for (level = 1; btree_curs->level[level-1].num_blocks
> - > 1 && level < XFS_BTREE_MAXLEVELS;
> - level++) {
> - lptr = &btree_curs->level[level];
> - p_lptr = &btree_curs->level[level-1];
> - lptr->num_blocks = howmany(p_lptr->num_blocks,
> - XR_ALLOC_BLOCK_MAXRECS(mp, level));
> - lptr->modulo = p_lptr->num_blocks
> - % lptr->num_blocks;
> - lptr->num_recs_pb = p_lptr->num_blocks
> - / lptr->num_blocks;
> - lptr->num_recs_tot = p_lptr->num_blocks;
> - }
> + do {
> + p_lptr = lptr;
> + lptr = &btree_curs->level[level++];
> +
> + compute_level_geometry(mp, lptr,
> + p_lptr->num_blocks, false);
> + } while (lptr->num_blocks > 1);
/me wonders why it's necessary to wrap the do...while in an if test,
as opposed to using a while loop with the if test up front?
--D
> + ASSERT (level < XFS_BTREE_MAXLEVELS);
> }
> ASSERT(lptr->num_blocks == 1);
> btree_curs->num_levels = level;
> @@ -591,22 +596,6 @@ calculate_freespace_cursor(xfs_mount_t *mp, xfs_agnumber_t agno,
>
> ASSERT(blocks_allocated_total >= blocks_needed);
> extra_blocks = blocks_allocated_total - blocks_needed;
> - } else {
> - if (extents_used > 0) {
> - /*
> - * reset the leaf level geometry to account
> - * for consumed extents. we can leave the
> - * rest of the cursor alone since the number
> - * of leaf blocks hasn't changed.
> - */
> - lptr = &btree_curs->level[0];
> -
> - lptr->num_recs_pb = num_extents / lptr->num_blocks;
> - lptr->modulo = num_extents % lptr->num_blocks;
> - lptr->num_recs_tot = num_extents;
> - }
> -
> - extra_blocks = 0;
> }
>
> btree_curs->num_tot_blocks = blocks_allocated_pt;
> --
> 2.18.1
>
next prev parent reply other threads:[~2020-06-09 20:14 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-06-09 11:40 [RFC PATCH] xfs_repair: fix rebuilding btree node block less than minrecs Gao Xiang
2020-06-09 14:35 ` Gao Xiang
2020-06-09 20:14 ` Darrick J. Wong [this message]
2020-06-09 22:12 ` Darrick J. Wong
2020-06-10 1:04 ` Gao Xiang
2020-06-10 3:58 ` [RFC PATCH v2] xfs_repair: fix rebuilding btree " Gao Xiang
2020-06-10 5:26 ` [RFC PATCH v3] " Gao Xiang
2020-06-16 16:11 ` Darrick J. Wong
2020-06-16 17:26 ` Gao Xiang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200609201423.GD11245@magnolia \
--to=darrick.wong@oracle.com \
--cc=dchinner@redhat.com \
--cc=hsiangkao@redhat.com \
--cc=linux-xfs@vger.kernel.org \
--cc=sandeen@sandeen.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox