From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, zhangshaokun@hisilicon.com
Subject: Re: [PATCH] xfs: gut error handling in xfs_trans_unreserve_and_mod_sb()
Date: Wed, 20 Nov 2019 18:38:36 -0800 [thread overview]
Message-ID: <20191121023836.GV6219@magnolia> (raw)
In-Reply-To: <20191121004437.9633-1-david@fromorbit.com>
On Thu, Nov 21, 2019 at 11:44:37AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Shaokun Zhang reported that XFs was using substantial CPU time in
> percpu_count_sum() when running a single threaded benchmark on
> a high CPU count (128p) machine from xfs_mod_ifree(). The issue
> is that the filesystem is empty when the benchmark runs, so inode
> allocation is running with a very low inode free count.
>
> With the percpu counter batching, this means comparisons when the
> counter is less that 128 * 256 = 32768 use the slow path of adding
> up all the counters across the CPUs, and this is expensive on high
> CPU count machines.
>
> The summing in xfs_mod_ifree() is only used to fire an assert if an
> underrun occurs. The error is ignored by the higher level code.
> Hence this is really just debug code. Hence we don't need to run it
> on production kernels, nor do we need such debug checks to return
> error values just to trigger an assert.
>
> Further, the error handling in xfs_trans_unreserve_and_mod_sb() is
> largely incorrect - Rolling back the changes in the transaction if
> only one counter underruns makes all the other counters
> incorrect.
Separate change, separate patch...
> We still allow the change to proceed and committing
> the transaction, except now we have multiple incorrect counters
> instead of a single underflow. Hence we should remove all this
> counter unwinding, too.
>
> Finally, xfs_mod_icount/xfs_mod_ifree are only called from
> xfs_trans_unreserve_and_mod_sb(), so get rid of them and just
> directly call the percpu_counter_add/percpu_counter_compare
> functions. The compare functions are now run only on debug builds as
> they are internal to ASSERT() checks and so only compiled in when
> ASSERTs are active (CONFIG_XFS_DEBUG=y or CONFIG_XFS_WARN=y).
>
> Difference in binary size for a production kernel:
>
> Before:
> text data bss dec hex filename
> 9486 184 8 9678 25ce fs/xfs/xfs_trans.o
> 10858 89 12 10959 2acf fs/xfs/xfs_mount.o
>
> After:
> text data bss dec hex filename
> 8462 184 8 8654 21ce fs/xfs/xfs_trans.o
> 10510 89 12 10611 2973 fs/xfs/xfs_mount.o
>
> So not only does this cleanup chop out a lot of source code, it also
> results in a binary size reduction of ~1.3kB in a very frequently
> used fast path of the filesystem.
>
> Reported-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> fs/xfs/xfs_mount.c | 33 ----------
> fs/xfs/xfs_mount.h | 2 -
> fs/xfs/xfs_trans.c | 153 +++++++++++----------------------------------
> 3 files changed, 37 insertions(+), 151 deletions(-)
>
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index fca65109cf24..205a83f33abc 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -1125,39 +1125,6 @@ xfs_log_sbcount(xfs_mount_t *mp)
> return xfs_sync_sb(mp, true);
> }
>
> -/*
> - * Deltas for the inode count are +/-64, hence we use a large batch size
> - * of 128 so we don't need to take the counter lock on every update.
> - */
> -#define XFS_ICOUNT_BATCH 128
> -int
> -xfs_mod_icount(
> - struct xfs_mount *mp,
> - int64_t delta)
> -{
> - percpu_counter_add_batch(&mp->m_icount, delta, XFS_ICOUNT_BATCH);
> - if (__percpu_counter_compare(&mp->m_icount, 0, XFS_ICOUNT_BATCH) < 0) {
> - ASSERT(0);
> - percpu_counter_add(&mp->m_icount, -delta);
> - return -EINVAL;
> - }
> - return 0;
> -}
> -
> -int
> -xfs_mod_ifree(
> - struct xfs_mount *mp,
> - int64_t delta)
> -{
> - percpu_counter_add(&mp->m_ifree, delta);
> - if (percpu_counter_compare(&mp->m_ifree, 0) < 0) {
> - ASSERT(0);
> - percpu_counter_add(&mp->m_ifree, -delta);
> - return -EINVAL;
> - }
> - return 0;
> -}
> -
> /*
> * Deltas for the block count can vary from 1 to very large, but lock contention
> * only occurs on frequent small block count updates such as in the delayed
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 88ab09ed29e7..0c9524660100 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -389,8 +389,6 @@ extern int xfs_initialize_perag(xfs_mount_t *mp, xfs_agnumber_t agcount,
> xfs_agnumber_t *maxagi);
> extern void xfs_unmountfs(xfs_mount_t *);
>
> -extern int xfs_mod_icount(struct xfs_mount *mp, int64_t delta);
> -extern int xfs_mod_ifree(struct xfs_mount *mp, int64_t delta);
> extern int xfs_mod_fdblocks(struct xfs_mount *mp, int64_t delta,
> bool reserved);
> extern int xfs_mod_frextents(struct xfs_mount *mp, int64_t delta);
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 3b208f9a865c..93e9a5154cdb 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -527,57 +527,51 @@ xfs_trans_apply_sb_deltas(
> sizeof(sbp->sb_frextents) - 1);
> }
>
> -STATIC int
> +static void
> xfs_sb_mod8(
> uint8_t *field,
> int8_t delta)
> {
> int8_t counter = *field;
>
> + if (!delta)
> + return;
> counter += delta;
> - if (counter < 0) {
> - ASSERT(0);
> - return -EINVAL;
> - }
> + ASSERT(counter >= 0);
> *field = counter;
> - return 0;
> }
>
> -STATIC int
> +static void
> xfs_sb_mod32(
> uint32_t *field,
> int32_t delta)
> {
> int32_t counter = *field;
>
> + if (!delta)
> + return;
> counter += delta;
> - if (counter < 0) {
> - ASSERT(0);
> - return -EINVAL;
> - }
> + ASSERT(counter >= 0);
> *field = counter;
> - return 0;
> }
>
> -STATIC int
> +static void
> xfs_sb_mod64(
> uint64_t *field,
> int64_t delta)
> {
> int64_t counter = *field;
>
> + if (!delta)
> + return;
> counter += delta;
> - if (counter < 0) {
> - ASSERT(0);
> - return -EINVAL;
> - }
> + ASSERT(counter >= 0);
> *field = counter;
> - return 0;
> }
>
> /*
> - * xfs_trans_unreserve_and_mod_sb() is called to release unused reservations
> - * and apply superblock counter changes to the in-core superblock. The
> + * xfs_trans_unreserve_and_mod_sb() is called to release unused reservations and
> + * apply superblock counter changes to the in-core superblock. The
> * t_res_fdblocks_delta and t_res_frextents_delta fields are explicitly NOT
> * applied to the in-core superblock. The idea is that that has already been
> * done.
> @@ -586,7 +580,12 @@ xfs_sb_mod64(
> * used block counts are not updated in the on disk superblock. In this case,
> * XFS_TRANS_SB_DIRTY will not be set when the transaction is updated but we
> * still need to update the incore superblock with the changes.
> + *
> + * Deltas for the inode count are +/-64, hence we use a large batch size of 128
> + * so we don't need to take the counter lock on every update.
> */
> +#define XFS_ICOUNT_BATCH 128
> +
> void
> xfs_trans_unreserve_and_mod_sb(
> struct xfs_trans *tp)
> @@ -622,20 +621,21 @@ xfs_trans_unreserve_and_mod_sb(
> /* apply the per-cpu counters */
> if (blkdelta) {
> error = xfs_mod_fdblocks(mp, blkdelta, rsvd);
> - if (error)
> - goto out;
> + ASSERT(!error);
> }
>
> if (idelta) {
> - error = xfs_mod_icount(mp, idelta);
> - if (error)
> - goto out_undo_fdblocks;
> + percpu_counter_add_batch(&mp->m_icount, idelta,
> + XFS_ICOUNT_BATCH);
> + if (idelta < 0)
> + ASSERT(__percpu_counter_compare(&mp->m_icount, 0,
> + XFS_ICOUNT_BATCH) >= 0);
> }
>
> if (ifreedelta) {
> - error = xfs_mod_ifree(mp, ifreedelta);
> - if (error)
> - goto out_undo_icount;
> + percpu_counter_add(&mp->m_ifree, ifreedelta);
> + if (ifreedelta < 0)
> + ASSERT(percpu_counter_compare(&mp->m_ifree, 0) >= 0);
Since the whole thing is a debug statement, why not shove everything
into a single assert?
ASSERT(ifreedelta >= 0 || percpu_computer_compare() >= 0); ?
Don't really care that much, just wondering... overall this part seems
reasonable.
> }
>
> if (rtxdelta == 0 && !(tp->t_flags & XFS_TRANS_SB_DIRTY))
> @@ -643,95 +643,16 @@ xfs_trans_unreserve_and_mod_sb(
>
> /* apply remaining deltas */
> spin_lock(&mp->m_sb_lock);
> - if (rtxdelta) {
> - error = xfs_sb_mod64(&mp->m_sb.sb_frextents, rtxdelta);
> - if (error)
> - goto out_undo_ifree;
> - }
> -
> - if (tp->t_dblocks_delta != 0) {
> - error = xfs_sb_mod64(&mp->m_sb.sb_dblocks, tp->t_dblocks_delta);
> - if (error)
> - goto out_undo_frextents;
> - }
> - if (tp->t_agcount_delta != 0) {
> - error = xfs_sb_mod32(&mp->m_sb.sb_agcount, tp->t_agcount_delta);
> - if (error)
> - goto out_undo_dblocks;
> - }
> - if (tp->t_imaxpct_delta != 0) {
> - error = xfs_sb_mod8(&mp->m_sb.sb_imax_pct, tp->t_imaxpct_delta);
> - if (error)
> - goto out_undo_agcount;
> - }
> - if (tp->t_rextsize_delta != 0) {
> - error = xfs_sb_mod32(&mp->m_sb.sb_rextsize,
> - tp->t_rextsize_delta);
> - if (error)
> - goto out_undo_imaxpct;
> - }
> - if (tp->t_rbmblocks_delta != 0) {
> - error = xfs_sb_mod32(&mp->m_sb.sb_rbmblocks,
> - tp->t_rbmblocks_delta);
> - if (error)
> - goto out_undo_rextsize;
> - }
> - if (tp->t_rblocks_delta != 0) {
> - error = xfs_sb_mod64(&mp->m_sb.sb_rblocks, tp->t_rblocks_delta);
> - if (error)
> - goto out_undo_rbmblocks;
> - }
> - if (tp->t_rextents_delta != 0) {
> - error = xfs_sb_mod64(&mp->m_sb.sb_rextents,
> - tp->t_rextents_delta);
> - if (error)
> - goto out_undo_rblocks;
> - }
> - if (tp->t_rextslog_delta != 0) {
> - error = xfs_sb_mod8(&mp->m_sb.sb_rextslog,
> - tp->t_rextslog_delta);
> - if (error)
> - goto out_undo_rextents;
> - }
> - spin_unlock(&mp->m_sb_lock);
> - return;
> -
> -out_undo_rextents:
> - if (tp->t_rextents_delta)
> - xfs_sb_mod64(&mp->m_sb.sb_rextents, -tp->t_rextents_delta);
> -out_undo_rblocks:
> - if (tp->t_rblocks_delta)
> - xfs_sb_mod64(&mp->m_sb.sb_rblocks, -tp->t_rblocks_delta);
> -out_undo_rbmblocks:
> - if (tp->t_rbmblocks_delta)
> - xfs_sb_mod32(&mp->m_sb.sb_rbmblocks, -tp->t_rbmblocks_delta);
> -out_undo_rextsize:
> - if (tp->t_rextsize_delta)
> - xfs_sb_mod32(&mp->m_sb.sb_rextsize, -tp->t_rextsize_delta);
> -out_undo_imaxpct:
> - if (tp->t_rextsize_delta)
> - xfs_sb_mod8(&mp->m_sb.sb_imax_pct, -tp->t_imaxpct_delta);
> -out_undo_agcount:
> - if (tp->t_agcount_delta)
> - xfs_sb_mod32(&mp->m_sb.sb_agcount, -tp->t_agcount_delta);
> -out_undo_dblocks:
> - if (tp->t_dblocks_delta)
> - xfs_sb_mod64(&mp->m_sb.sb_dblocks, -tp->t_dblocks_delta);
> -out_undo_frextents:
> - if (rtxdelta)
> - xfs_sb_mod64(&mp->m_sb.sb_frextents, -rtxdelta);
> -out_undo_ifree:
> + xfs_sb_mod64(&mp->m_sb.sb_frextents, rtxdelta);
As for these bits... why even bother with a three line helper? I think
this is clearer about what's going on:
mp->m_sb.sb_frextents += rtxdelta;
mp->m_sb.sb_dblocks += tp->t_dblocks_delta;
...
ASSERT(!rtxdelta || mp->m_sb.sb_frextents >= 0);
ASSERT(!tp->t_dblocks_delta || mp->m_sb.sb.dblocks >= 0);
and since we hold m_sb_lock it's not like we're trying to do anything
fancy with memory accesses...?
I also wonder if we should be shutting down the fs here if the counts
go negative, but <shrug> that would be yet a different patch. :)
--D
> + xfs_sb_mod64(&mp->m_sb.sb_dblocks, tp->t_dblocks_delta);
> + xfs_sb_mod32(&mp->m_sb.sb_agcount, tp->t_agcount_delta);
> + xfs_sb_mod8(&mp->m_sb.sb_imax_pct, tp->t_imaxpct_delta);
> + xfs_sb_mod32(&mp->m_sb.sb_rextsize, tp->t_rextsize_delta);
> + xfs_sb_mod32(&mp->m_sb.sb_rbmblocks, tp->t_rbmblocks_delta);
> + xfs_sb_mod64(&mp->m_sb.sb_rblocks, tp->t_rblocks_delta);
> + xfs_sb_mod64(&mp->m_sb.sb_rextents, tp->t_rextents_delta);
> + xfs_sb_mod8(&mp->m_sb.sb_rextslog, tp->t_rextslog_delta);
> spin_unlock(&mp->m_sb_lock);
> - if (ifreedelta)
> - xfs_mod_ifree(mp, -ifreedelta);
> -out_undo_icount:
> - if (idelta)
> - xfs_mod_icount(mp, -idelta);
> -out_undo_fdblocks:
> - if (blkdelta)
> - xfs_mod_fdblocks(mp, -blkdelta, rsvd);
> -out:
> - ASSERT(error == 0);
> return;
> }
>
> --
> 2.24.0.rc0
>
next prev parent reply other threads:[~2019-11-21 2:38 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-11-21 0:44 [PATCH] xfs: gut error handling in xfs_trans_unreserve_and_mod_sb() Dave Chinner
2019-11-21 2:38 ` Darrick J. Wong [this message]
2019-11-21 4:00 ` Dave Chinner
2019-11-21 4:50 ` Darrick J. Wong
2019-11-21 6:44 ` Dave Chinner
2019-11-21 7:47 ` Shaokun Zhang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20191121023836.GV6219@magnolia \
--to=darrick.wong@oracle.com \
--cc=david@fromorbit.com \
--cc=linux-xfs@vger.kernel.org \
--cc=zhangshaokun@hisilicon.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.