From: Dave Chinner <david@fromorbit.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 2/2] xfs: use a separate frextents counter for rt extent reservations
Date: Fri, 8 Apr 2022 09:17:25 +1000 [thread overview]
Message-ID: <20220407231725.GM1544202@dread.disaster.area> (raw)
In-Reply-To: <164936442248.457511.4389675360381809144.stgit@magnolia>
On Thu, Apr 07, 2022 at 01:47:02PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> As mentioned in the previous commit, the kernel misuses sb_frextents in
> the incore mount to reflect both incore reservations made by running
> transactions as well as the actual count of free rt extents on disk.
> This results in the superblock being written to the log with an
> underestimate of the number of rt extents that are marked free in the
> rtbitmap.
>
> Teaching XFS to recompute frextents after log recovery avoids
> operational problems in the current mount, but it doesn't solve the
> problem of us writing undercounted frextents which are then recovered by
> an older kernel that doesn't have that fix.
>
> Create an incore percpu counter to mirror the ondisk frextents. This
> new counter will track transaction reservations and the only time we
> will touch the incore super counter (i.e the one that gets logged) is
> when those transactions commit updates to the rt bitmap. This is in
> contrast to the lazysbcount counters (e.g. fdblocks), where we know that
> log recovery will always fix any incorrect counter that we log.
> As a bonus, we only take m_sb_lock at transaction commit time.
Again, the concept looks fine as does most of the code. Some
comments on the implementation below.
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index c5f153c3693f..d5463728c305 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -1183,17 +1183,37 @@ xfs_mod_frextents(
> struct xfs_mount *mp,
> int64_t delta)
> {
> - int64_t lcounter;
> - int ret = 0;
> + int batch;
>
> - spin_lock(&mp->m_sb_lock);
> - lcounter = mp->m_sb.sb_frextents + delta;
> - if (lcounter < 0)
> - ret = -ENOSPC;
> + if (delta > 0) {
> + percpu_counter_add(&mp->m_frextents, delta);
> + return 0;
> + }
> +
> + /*
> + * Taking blocks away, need to be more accurate the closer we
> + * are to zero.
> + *
> + * If the counter has a value of less than 2 * max batch size,
> + * then make everything serialise as we are real close to
> + * ENOSPC.
> + */
> + if (__percpu_counter_compare(&mp->m_frextents, 2 * XFS_FDBLOCKS_BATCH,
> + XFS_FDBLOCKS_BATCH) < 0)
> + batch = 1;
> else
> - mp->m_sb.sb_frextents = lcounter;
> - spin_unlock(&mp->m_sb_lock);
> - return ret;
> + batch = XFS_FDBLOCKS_BATCH;
> +
> + percpu_counter_add_batch(&mp->m_frextents, delta, batch);
> + if (__percpu_counter_compare(&mp->m_frextents, 0,
> + XFS_FDBLOCKS_BATCH) >= 0) {
> + /* we had space! */
> + return 0;
> + }
> +
> + /* oops, negative free space, put that back! */
> + percpu_counter_add(&mp->m_frextents, -delta);
> + return -ENOSPC;
> }
Ok, so this looks like a copy-pasta of xfs_mod_fdblocks() with the
reservation pool stuff stripped. I'd kinda prefer to factor
xfs_mod_fdblocks() so that we aren't blowing out the instruction
cache footprint on this hot path - we're frequently modifying
both RT and fd block counts in the same transaction, so having them
run the same code would be good.
Something like:
int
xfs_mod_blocks(
struct xfs_mount *mp,
struct pcp_counter *pccnt,
int64_t delta,
bool use_resv_pool,
bool rsvd)
{
int64_t lcounter;
long long res_used;
s32 batch;
uint64_t set_aside = 0;
if (delta > 0) {
/*
* If the reserve pool is depleted, put blocks back into it
* first. Most of the time the pool is full.
*/
if (likely(!use_resv_pool || mp->m_resblks == mp->m_resblks_avail)) {
percpu_counter_add(pccnt, delta);
return 0;
}
spin_lock(&mp->m_sb_lock);
res_used = (long long)(mp->m_resblks - mp->m_resblks_avail);
if (res_used > delta) {
mp->m_resblks_avail += delta;
} else {
delta -= res_used;
mp->m_resblks_avail = mp->m_resblks;
percpu_counter_add(&mp->m_fdblocks, delta);
}
spin_unlock(&mp->m_sb_lock);
return 0;
}
/*
* Taking blocks away, need to be more accurate the closer we
* are to zero.
*
* If the counter has a value of less than 2 * max batch size,
* then make everything serialise as we are real close to
* ENOSPC.
*/
if (__percpu_counter_compare(pccnt, 2 * XFS_FDBLOCKS_BATCH,
XFS_FDBLOCKS_BATCH) < 0)
batch = 1;
else
batch = XFS_FDBLOCKS_BATCH;
/*
* Set aside allocbt blocks because these blocks are tracked as free
* space but not available for allocation. Technically this means that a
* single reservation cannot consume all remaining free space, but the
* ratio of allocbt blocks to usable free blocks should be rather small.
* The tradeoff without this is that filesystems that maintain high
* perag block reservations can over reserve physical block availability
* and fail physical allocation, which leads to much more serious
* problems (i.e. transaction abort, pagecache discards, etc.) than
* slightly premature -ENOSPC.
*/
if (use_resv_pool)
set_aside = xfs_fdblocks_unavailable(mp);
percpu_counter_add_batch(pccnt, delta, batch);
if (__percpu_counter_compare(&pccnt, set_aside,
XFS_FDBLOCKS_BATCH) >= 0) {
/* we had space! */
return 0;
}
/*
* lock up the sb for dipping into reserves before releasing the space
* that took us to ENOSPC.
*/
spin_lock(&mp->m_sb_lock);
percpu_counter_add(pccnt, -delta);
if (!use_resv_pool || !rsvd)
goto fdblocks_enospc;
lcounter = (long long)mp->m_resblks_avail + delta;
if (lcounter >= 0) {
mp->m_resblks_avail = lcounter;
spin_unlock(&mp->m_sb_lock);
return 0;
}
xfs_warn_once(mp,
"Reserve blocks depleted! Consider increasing reserve pool size.");
fdblocks_enospc:
spin_unlock(&mp->m_sb_lock);
return -ENOSPC;
}
And in the relevant header file:
int xfs_mod_blocks(struct xfs_mount *mp, struct pcp_counter *pccnt,
int64_t delta, bool use_resv_pool, bool rsvd);
static inline int
xfs_mod_fdblocks(struct xfs_mount *mp, int64_t delta, bool rsvd)
{
return xfs_mod_blocks(mp, &mp->m_fdblocks, delta, true, resvd);
}
static inline int
xfs_mod_frextents(struct xfs_mount *mp, int64_t delta)
{
return xfs_mod_blocks(mp, &mp->m_frextents, delta, false, false);
}
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 54be9d64093e..cc95768eb8e1 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -843,9 +843,11 @@ xfs_fs_statfs(
>
> if (XFS_IS_REALTIME_MOUNT(mp) &&
> (ip->i_diflags & (XFS_DIFLAG_RTINHERIT | XFS_DIFLAG_REALTIME))) {
> + s64 freertx;
> +
> statp->f_blocks = sbp->sb_rblocks;
> - statp->f_bavail = statp->f_bfree =
> - sbp->sb_frextents * sbp->sb_rextsize;
> + freertx = max_t(s64, 0, percpu_counter_sum(&mp->m_frextents));
percpu_counter_sum_positive()
> if (error)
> goto free_fdblocks;
>
> + error = percpu_counter_init(&mp->m_frextents, 0, GFP_KERNEL);
> + if (error)
> + goto free_delalloc;
> +
> return 0;
>
> +free_delalloc:
> + percpu_counter_destroy(&mp->m_delalloc_blks);
> free_fdblocks:
> percpu_counter_destroy(&mp->m_fdblocks);
> free_ifree:
> @@ -1033,6 +1041,7 @@ xfs_reinit_percpu_counters(
> percpu_counter_set(&mp->m_icount, mp->m_sb.sb_icount);
> percpu_counter_set(&mp->m_ifree, mp->m_sb.sb_ifree);
> percpu_counter_set(&mp->m_fdblocks, mp->m_sb.sb_fdblocks);
> + percpu_counter_set(&mp->m_frextents, mp->m_sb.sb_frextents);
> }
>
> static void
> @@ -1045,6 +1054,7 @@ xfs_destroy_percpu_counters(
> ASSERT(xfs_is_shutdown(mp) ||
> percpu_counter_sum(&mp->m_delalloc_blks) == 0);
> percpu_counter_destroy(&mp->m_delalloc_blks);
> + percpu_counter_destroy(&mp->m_frextents);
> }
>
> static int
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 0ac717aad380..63a4d3a24340 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -498,10 +498,31 @@ xfs_trans_apply_sb_deltas(
> be64_add_cpu(&sbp->sb_fdblocks, tp->t_res_fdblocks_delta);
> }
>
> - if (tp->t_frextents_delta)
> - be64_add_cpu(&sbp->sb_frextents, tp->t_frextents_delta);
> - if (tp->t_res_frextents_delta)
> - be64_add_cpu(&sbp->sb_frextents, tp->t_res_frextents_delta);
> + /*
> + * Updating frextents requires careful handling because it does not
> + * behave like the lazysb counters because we cannot rely on log
> + * recovery in older kenels to recompute the value from the rtbitmap.
> + * This means that the ondisk frextents must be consistent with the
> + * rtbitmap.
> + *
> + * Therefore, log the frextents change to the ondisk superblock and
> + * update the incore superblock so that future calls to xfs_log_sb
> + * write the correct value ondisk.
> + *
> + * Don't touch m_frextents because it includes incore reservations,
> + * and those are handled by the unreserve function.
> + */
> + if (tp->t_frextents_delta || tp->t_res_frextents_delta) {
> + struct xfs_mount *mp = tp->t_mountp;
> + int64_t rtxdelta;
> +
> + rtxdelta = tp->t_frextents_delta + tp->t_res_frextents_delta;
> +
> + spin_lock(&mp->m_sb_lock);
> + be64_add_cpu(&sbp->sb_frextents, rtxdelta);
> + mp->m_sb.sb_frextents += rtxdelta;
> + spin_unlock(&mp->m_sb_lock);
> + }
Hmmmm. This wants a comment in xfs_log_sb() to explain why we
aren't updating mp->m_sb.sb_frextents from mp->m_frextents like we
do with all the other per-cpu counters tracking resource usage.
>
> if (tp->t_dblocks_delta) {
> be64_add_cpu(&sbp->sb_dblocks, tp->t_dblocks_delta);
> @@ -614,7 +635,12 @@ xfs_trans_unreserve_and_mod_sb(
> if (ifreedelta)
> percpu_counter_add(&mp->m_ifree, ifreedelta);
>
> - if (rtxdelta == 0 && !(tp->t_flags & XFS_TRANS_SB_DIRTY))
> + if (rtxdelta) {
> + error = xfs_mod_frextents(mp, rtxdelta);
> + ASSERT(!error);
> + }
> +
> + if (!(tp->t_flags & XFS_TRANS_SB_DIRTY))
> return;
>
> /* apply remaining deltas */
> @@ -622,7 +648,6 @@ xfs_trans_unreserve_and_mod_sb(
> mp->m_sb.sb_fdblocks += tp->t_fdblocks_delta + tp->t_res_fdblocks_delta;
> mp->m_sb.sb_icount += idelta;
> mp->m_sb.sb_ifree += ifreedelta;
> - mp->m_sb.sb_frextents += rtxdelta;
This makes my head hurt trying to work out if this is necessary or
not. (the lazy sb stuff in these functions has always strained my
cognitive abilities, even though I wrote it in the first place!)
A comment explaining why we don't need to update
mp->m_sb.sb_frextents when XFS_TRANS_SB_DIRTY is set would be useful
in the above if (rtxdelta) update above.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
next prev parent reply other threads:[~2022-04-07 23:17 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-04-07 20:46 [PATCHSET 0/2] xfs: fix corruption of free rt extent count Darrick J. Wong
2022-04-07 20:46 ` [PATCH 1/2] xfs: recalculate free rt extents after log recovery Darrick J. Wong
2022-04-07 21:56 ` Dave Chinner
2022-04-07 23:39 ` Darrick J. Wong
2022-04-08 0:06 ` Dave Chinner
2022-04-08 17:42 ` Darrick J. Wong
2022-04-07 20:47 ` [PATCH 2/2] xfs: use a separate frextents counter for rt extent reservations Darrick J. Wong
2022-04-07 23:17 ` Dave Chinner [this message]
2022-04-07 23:45 ` Darrick J. Wong
2022-04-08 0:12 ` Dave Chinner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20220407231725.GM1544202@dread.disaster.area \
--to=david@fromorbit.com \
--cc=djwong@kernel.org \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox