From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 16/30] xfs: pin inode backing buffer to the inode log item
Date: Thu, 4 Jun 2020 10:05:34 -0400 [thread overview]
Message-ID: <20200604140534.GE17815@bfoster> (raw)
In-Reply-To: <20200604074606.266213-17-david@fromorbit.com>
On Thu, Jun 04, 2020 at 05:45:52PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> When we dirty an inode, we are going to have to write it to disk at
> some point in the near future. This requires the inode cluster
> backing buffer to be present in memory. Unfortunately, under severe
> memory pressure we can reclaim the inode backing buffer while the
> inode is dirty in memory, resulting in stalling the AIL pushing
> because it has to do a read-modify-write cycle on the cluster
> buffer.
>
> When we have no memory available, the read of the cluster buffer
> blocks the AIL pushing process, and this causes all sorts of issues
> for memory reclaim as it requires inode writeback to make forwards
> progress. Allocating a cluster buffer causes more memory pressure,
> and causes more cluster buffers to be reclaimed, resulting in
> more RMW cycles being done in the AIL context; everything then
> backs up on AIL progress. Only the synchronous inode cluster
> writeback in the inode reclaim code provides some level of
> forwards progress guarantees that prevent OOM-killer rampages in
> this situation.
>
> Fix this by pinning the inode backing buffer to the inode log item
> when the inode is first dirtied (i.e. in xfs_trans_log_inode()).
> This means the first modification of an inode that has been held
> in cache for a long time may block on a cluster buffer read, but
> we can do that in transaction context and block safely until the
> buffer has been allocated and read.
>
> Once we have the cluster buffer, the inode log item takes a
> reference to it, pinning it in memory, and attaches it to the log
> item for future reference. This means we can always grab the cluster
> buffer from the inode log item when we need it.
>
> When the inode is finally cleaned and removed from the AIL, we can
> drop the reference the inode log item holds on the cluster buffer.
> Once all inodes on the cluster buffer are clean, the cluster buffer
> will be unpinned and it will be available for memory reclaim to
> reclaim again.
>
> This avoids the issues with needing to do RMW cycles in the AIL
> pushing context, and hence allows complete non-blocking inode
> flushing to be performed by the AIL pushing context.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
Clearer with the spurious bits removed, thanks. I'm still not clear
on the ili_flush_lsn reset, though. It seems a little strange that we'd
clear and reset it in cases where flush completions race with relogs
(and thus the AIL moving), but it shouldn't cause any issues from what
I can tell:
Reviewed-by: Brian Foster <bfoster@redhat.com>
> fs/xfs/libxfs/xfs_inode_buf.c | 3 +-
> fs/xfs/libxfs/xfs_trans_inode.c | 53 ++++++++++++++++++++++++----
> fs/xfs/xfs_buf_item.c | 4 +--
> fs/xfs/xfs_inode_item.c | 61 ++++++++++++++++++++++++++-------
> fs/xfs/xfs_trans_ail.c | 8 +++--
> 5 files changed, 105 insertions(+), 24 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> index 6f84ea85fdd83..1af97235785c8 100644
> --- a/fs/xfs/libxfs/xfs_inode_buf.c
> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> @@ -176,7 +176,8 @@ xfs_imap_to_bp(
> }
>
> *bpp = bp;
> - *dipp = xfs_buf_offset(bp, imap->im_boffset);
> + if (dipp)
> + *dipp = xfs_buf_offset(bp, imap->im_boffset);
> return 0;
> }
>
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index c66d9d1dd58b9..ad5974365c589 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> @@ -8,6 +8,8 @@
> #include "xfs_shared.h"
> #include "xfs_format.h"
> #include "xfs_log_format.h"
> +#include "xfs_trans_resv.h"
> +#include "xfs_mount.h"
> #include "xfs_inode.h"
> #include "xfs_trans.h"
> #include "xfs_trans_priv.h"
> @@ -72,13 +74,19 @@ xfs_trans_ichgtime(
> }
>
> /*
> - * This is called to mark the fields indicated in fieldmask as needing
> - * to be logged when the transaction is committed. The inode must
> - * already be associated with the given transaction.
> + * This is called to mark the fields indicated in fieldmask as needing to be
> + * logged when the transaction is committed. The inode must already be
> + * associated with the given transaction.
> *
> - * The values for fieldmask are defined in xfs_inode_item.h. We always
> - * log all of the core inode if any of it has changed, and we always log
> - * all of the inline data/extents/b-tree root if any of them has changed.
> + * The values for fieldmask are defined in xfs_inode_item.h. We always log all
> + * of the core inode if any of it has changed, and we always log all of the
> + * inline data/extents/b-tree root if any of them has changed.
> + *
> + * Grab and pin the cluster buffer associated with this inode to avoid RMW
> + * cycles at inode writeback time. Avoid the need to add error handling to every
> + * xfs_trans_log_inode() call by shutting down on read error. This will cause
> + * transactions to fail and everything to error out, just like if we return a
> + * read error in a dirty transaction and cancel it.
> */
> void
> xfs_trans_log_inode(
> @@ -131,6 +139,39 @@ xfs_trans_log_inode(
> spin_lock(&iip->ili_lock);
> iip->ili_fsync_fields |= flags;
>
> + if (!iip->ili_item.li_buf) {
> + struct xfs_buf *bp;
> + int error;
> +
> + /*
> + * We hold the ILOCK here, so this inode is not going to be
> + * flushed while we are here. Further, because there is no
> + * buffer attached to the item, we know that there is no IO in
> + * progress, so nothing will clear the ili_fields while we read
> + * in the buffer. Hence we can safely drop the spin lock and
> + * read the buffer knowing that the state will not change from
> + * here.
> + */
> + spin_unlock(&iip->ili_lock);
> + error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, NULL,
> + &bp, 0);
> + if (error) {
> + xfs_force_shutdown(ip->i_mount, SHUTDOWN_META_IO_ERROR);
> + return;
> + }
> +
> + /*
> + * We need an explicit buffer reference for the log item but
> + * don't want the buffer to remain attached to the transaction.
> + * Hold the buffer but release the transaction reference.
> + */
> + xfs_buf_hold(bp);
> + xfs_trans_brelse(tp, bp);
> +
> + spin_lock(&iip->ili_lock);
> + iip->ili_item.li_buf = bp;
> + }
> +
> /*
> * Always OR in the bits from the ili_last_fields field. This is to
> * coordinate with the xfs_iflush() and xfs_iflush_done() routines in
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index bc8f8df6cc4c6..a8c5070376b21 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -1144,11 +1144,9 @@ xfs_buf_inode_iodone(
> if (ret == XBF_IOERROR_DONE)
> return;
> ASSERT(ret == XBF_IOERROR_FAIL);
> - spin_lock(&bp->b_mount->m_ail->ail_lock);
> list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
> - xfs_set_li_failed(lip, bp);
> + set_bit(XFS_LI_FAILED, &lip->li_flags);
> }
> - spin_unlock(&bp->b_mount->m_ail->ail_lock);
> xfs_buf_ioerror(bp, 0);
> xfs_buf_relse(bp);
> return;
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 0ba75764a8dc5..64bdda72f7b27 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -439,6 +439,7 @@ xfs_inode_item_pin(
> struct xfs_inode *ip = INODE_ITEM(lip)->ili_inode;
>
> ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> + ASSERT(lip->li_buf);
>
> trace_xfs_inode_pin(ip, _RET_IP_);
> atomic_inc(&ip->i_pincount);
> @@ -450,6 +451,12 @@ xfs_inode_item_pin(
> * item which was previously pinned with a call to xfs_inode_item_pin().
> *
> * Also wake up anyone in xfs_iunpin_wait() if the count goes to 0.
> + *
> + * Note that unpin can race with inode cluster buffer freeing marking the buffer
> + * stale. In that case, flush completions are run from the buffer unpin call,
> + * which may happen before the inode is unpinned. If we lose the race, there
> + * will be no buffer attached to the log item, but the inode will be marked
> + * XFS_ISTALE.
> */
> STATIC void
> xfs_inode_item_unpin(
> @@ -459,6 +466,7 @@ xfs_inode_item_unpin(
> struct xfs_inode *ip = INODE_ITEM(lip)->ili_inode;
>
> trace_xfs_inode_unpin(ip, _RET_IP_);
> + ASSERT(lip->li_buf || xfs_iflags_test(ip, XFS_ISTALE));
> ASSERT(atomic_read(&ip->i_pincount) > 0);
> if (atomic_dec_and_test(&ip->i_pincount))
> wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
> @@ -629,10 +637,15 @@ xfs_inode_item_init(
> */
> void
> xfs_inode_item_destroy(
> - xfs_inode_t *ip)
> + struct xfs_inode *ip)
> {
> - kmem_free(ip->i_itemp->ili_item.li_lv_shadow);
> - kmem_cache_free(xfs_ili_zone, ip->i_itemp);
> + struct xfs_inode_log_item *iip = ip->i_itemp;
> +
> + ASSERT(iip->ili_item.li_buf == NULL);
> +
> + ip->i_itemp = NULL;
> + kmem_free(iip->ili_item.li_lv_shadow);
> + kmem_cache_free(xfs_ili_zone, iip);
> }
>
>
> @@ -673,11 +686,10 @@ xfs_iflush_done(
> list_move_tail(&lip->li_bio_list, &tmp);
>
> /* Do an unlocked check for needing the AIL lock. */
> - if (lip->li_lsn == iip->ili_flush_lsn ||
> + if (iip->ili_flush_lsn == lip->li_lsn ||
> test_bit(XFS_LI_FAILED, &lip->li_flags))
> need_ail++;
> }
> - ASSERT(list_empty(&bp->b_li_list));
>
> /*
> * We only want to pull the item from the AIL if it is actually there
> @@ -690,7 +702,7 @@ xfs_iflush_done(
> /* this is an opencoded batch version of xfs_trans_ail_delete */
> spin_lock(&ailp->ail_lock);
> list_for_each_entry(lip, &tmp, li_bio_list) {
> - xfs_clear_li_failed(lip);
> + clear_bit(XFS_LI_FAILED, &lip->li_flags);
> if (lip->li_lsn == INODE_ITEM(lip)->ili_flush_lsn) {
> xfs_lsn_t lsn = xfs_ail_delete_one(ailp, lip);
> if (!tail_lsn && lsn)
> @@ -706,14 +718,29 @@ xfs_iflush_done(
> * them is safely on disk.
> */
> list_for_each_entry_safe(lip, n, &tmp, li_bio_list) {
> + bool drop_buffer = false;
> +
> list_del_init(&lip->li_bio_list);
> iip = INODE_ITEM(lip);
>
> spin_lock(&iip->ili_lock);
> +
> + /*
> + * Remove the reference to the cluster buffer if the inode is
> + * clean in memory. Drop the buffer reference once we've dropped
> + * the locks we hold.
> + */
> + ASSERT(iip->ili_item.li_buf == bp);
> + if (!iip->ili_fields) {
> + iip->ili_item.li_buf = NULL;
> + drop_buffer = true;
> + }
> iip->ili_last_fields = 0;
> + iip->ili_flush_lsn = 0;
> spin_unlock(&iip->ili_lock);
> -
> xfs_ifunlock(iip->ili_inode);
> + if (drop_buffer)
> + xfs_buf_rele(bp);
> }
> }
>
> @@ -725,12 +752,20 @@ xfs_iflush_done(
> */
> void
> xfs_iflush_abort(
> - struct xfs_inode *ip)
> + struct xfs_inode *ip)
> {
> - struct xfs_inode_log_item *iip = ip->i_itemp;
> + struct xfs_inode_log_item *iip = ip->i_itemp;
> + struct xfs_buf *bp = NULL;
>
> if (iip) {
> + /*
> + * Clear the failed bit before removing the item from the AIL so
> + * xfs_trans_ail_delete() doesn't try to clear and release the
> + * buffer attached to the log item before we are done with it.
> + */
> + clear_bit(XFS_LI_FAILED, &iip->ili_item.li_flags);
> xfs_trans_ail_delete(&iip->ili_item, 0);
> +
> /*
> * Clear the inode logging fields so no more flushes are
> * attempted.
> @@ -739,12 +774,14 @@ xfs_iflush_abort(
> iip->ili_last_fields = 0;
> iip->ili_fields = 0;
> iip->ili_fsync_fields = 0;
> + iip->ili_flush_lsn = 0;
> + bp = iip->ili_item.li_buf;
> + iip->ili_item.li_buf = NULL;
> spin_unlock(&iip->ili_lock);
> }
> - /*
> - * Release the inode's flush lock since we're done with it.
> - */
> xfs_ifunlock(ip);
> + if (bp)
> + xfs_buf_rele(bp);
> }
>
> /*
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index ac33f6393f99c..c3be6e4401343 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -377,8 +377,12 @@ xfsaild_resubmit_item(
> }
>
> /* protected by ail_lock */
> - list_for_each_entry(lip, &bp->b_li_list, li_bio_list)
> - xfs_clear_li_failed(lip);
> + list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
> + if (bp->b_flags & _XBF_INODES)
> + clear_bit(XFS_LI_FAILED, &lip->li_flags);
> + else
> + xfs_clear_li_failed(lip);
> + }
>
> xfs_buf_unlock(bp);
> return XFS_ITEM_SUCCESS;
> --
> 2.26.2.761.g0e0b3e54be
>
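[Editor's note: the buffer-pinning lifecycle described in the commit message above — take a reference on the cluster buffer the first time the inode is dirtied, relog without taking further references, and drop the reference only once the inode is clean — can be sketched as a small user-space model. The struct and helper names below are simplified stand-ins for illustration, not the kernel's actual types or locking.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for xfs_buf and xfs_inode_log_item. */
struct buf { int ref; };
struct ili { struct buf *li_buf; bool dirty; };

static void buf_hold(struct buf *bp) { bp->ref++; }            /* xfs_buf_hold() */
static void buf_rele(struct buf *bp) { assert(bp->ref > 0); bp->ref--; }

/*
 * First modification pins the backing buffer to the log item;
 * relogging a dirty inode takes no additional reference.
 */
static void ili_dirty(struct ili *iip, struct buf *bp)
{
	if (!iip->li_buf) {
		buf_hold(bp);
		iip->li_buf = bp;
	}
	iip->dirty = true;
}

/*
 * Flush completion: detach the buffer only if the inode is clean
 * in memory; the caller releases the reference outside the locks,
 * as xfs_iflush_done() does with drop_buffer/xfs_buf_rele().
 */
static struct buf *ili_clean(struct ili *iip)
{
	struct buf *bp = NULL;

	if (!iip->dirty && iip->li_buf) {
		bp = iip->li_buf;
		iip->li_buf = NULL;
	}
	return bp;
}
```

Once every inode log item attached to the buffer has dropped its reference this way, the buffer's count falls back to its base value and memory reclaim can free it again.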