Re: [PATCH 24/28] xfs: reclaim inodes from the LRU

From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 24/28] xfs: reclaim inodes from the LRU
Date: Wed, 6 Nov 2019 12:21:04 -0500
Message-ID: <20191106172104.GB37080@bfoster>
In-Reply-To: <20191031234618.15403-25-david@fromorbit.com>

On Fri, Nov 01, 2019 at 10:46:14AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Replace the AG radix tree walking reclaim code with a list_lru
> walker, giving us both node-aware and memcg-aware inode reclaim
> at the XFS level. This requires adding an inode isolation function to
> determine if the inode can be reclaimed, and a list walker to
> dispose of the inodes that were isolated.
> 
> We want the isolation function to be non-blocking. If we can't
> grab an inode then we either skip it or rotate it. If it's clean
> then we skip it, if it's dirty then we rotate to give it time to be
> cleaned before it is scanned again.
> 
> This congregates the dirty inodes at the tail of the LRU, which
> means that if we start hitting a majority of dirty inodes either
> there are lots of unlinked inodes in the reclaim list or we've
> reclaimed all the clean inodes and we're looped back on the dirty
> inodes. Either way, this is an indication we should tell kswapd to
> back off.
> 
> The non-blocking isolation function introduces a complexity for the
> filesystem shutdown case. When the filesystem is shut down, we want
> to free the inode even if it is dirty, and this may require
> blocking. We already hold the locks needed to do this blocking, so
> what we do is that we leave inodes locked - both the ILOCK and the
> flush lock - while they are sitting on the dispose list to be freed
> after the LRU walk completes.  This allows us to process the
> shutdown state outside the LRU walk where we can block safely.
> 
> Because we now are reclaiming inodes from the context that it needs
> memory in (memcg and/or node), direct reclaim throttling within the
> high level reclaim code is now much more effective. Hence we don't
> wait on IO for either kswapd or direct reclaim. However, we have to
> tell kswapd to back off if we start hitting too many dirty inodes.
> This implies we've wrapped around the LRU and don't have many clean
> inodes left to reclaim, so it needs to wait a while for the AIL
> pushing to clean some of the remaining reclaimable inodes.
> 
> Keep in mind we don't have to care about inode lock order or
> blocking with inode locks held here because a) we are using
> trylocks, and b) once marked with XFS_IRECLAIM they can't be found
> via the LRU and inode cache lookups will abort and retry. Hence
> nobody will try to lock them in any other context that might also be
> holding other inode locks.
> 
> Also convert xfs_reclaim_all_inodes() to use a LRU walk to free all
> the reclaimable inodes in the filesystem.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Looks fundamentally sane. Some logic quibbles..

>  fs/xfs/xfs_icache.c | 404 +++++++++++++-------------------------------
>  fs/xfs/xfs_icache.h |  18 +-
>  fs/xfs/xfs_inode.h  |  18 ++
>  fs/xfs/xfs_super.c  |  46 ++++-
>  4 files changed, 190 insertions(+), 296 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 350f42e7730b..05dd292bfdb6 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -968,160 +968,110 @@ xfs_inode_ag_iterator_tag(
>  	return last_error;
>  }
>  
> -/*
> - * Grab the inode for reclaim.
> - *
> - * Return false if we aren't going to reclaim it, true if it is a reclaim
> - * candidate.
> - *
> - * If the inode is clean or unreclaimable, return 0 to tell the caller it does
> - * not require flushing. Otherwise return the log item lsn of the inode so the
> - * caller can determine it's inode flush target.  If we get the clean/dirty
> - * state wrong then it will be sorted in xfs_reclaim_inode() once we have locks
> - * held.
> - */
> -STATIC bool
> -xfs_reclaim_inode_grab(
> -	struct xfs_inode	*ip,
> -	int			flags,
> -	xfs_lsn_t		*lsn)
> +enum lru_status
> +xfs_inode_reclaim_isolate(
> +	struct list_head	*item,
> +	struct list_lru_one	*lru,
> +	spinlock_t		*lru_lock,

Did we ever establish whether we should cycle the lru_lock during long
running scans?

> +	void			*arg)
>  {
> -	ASSERT(rcu_read_lock_held());
> -	*lsn = 0;
> +        struct xfs_ireclaim_args *ra = arg;
> +        struct inode		*inode = container_of(item, struct inode,
> +						      i_lru);
> +        struct xfs_inode	*ip = XFS_I(inode);

Whitespace damage on the above lines (space indentation vs tabs).

> +	enum lru_status		ret;
> +	xfs_lsn_t		lsn = 0;
> +
> +	/* Careful: inversion of iflags_lock and everything else here */
> +	if (!spin_trylock(&ip->i_flags_lock))
> +		return LRU_SKIP;
> +
> +	/* if we are in shutdown, we'll reclaim it even if dirty */
> +	ret = LRU_ROTATE;
> +	if (!xfs_inode_clean(ip) && !__xfs_iflags_test(ip, XFS_ISTALE) &&
> +	    !XFS_FORCED_SHUTDOWN(ip->i_mount)) {
> +		lsn = ip->i_itemp->ili_item.li_lsn;
> +		ra->dirty_skipped++;
> +		goto out_unlock_flags;
> +	}
>  
> -	/* quick check for stale RCU freed inode */
> -	if (!ip->i_ino)
> -		return false;
> +	ret = LRU_SKIP;
> +	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
> +		goto out_unlock_flags;
>  
> -	/*
> -	 * Do unlocked checks to see if the inode already is being flushed or in
> -	 * reclaim to avoid lock traffic. If the inode is not clean, return the
> -	 * position in the AIL for the caller to push to.
> -	 */
> -	if (!xfs_inode_clean(ip)) {
> -		*lsn = ip->i_itemp->ili_item.li_lsn;
> -		return false;
> +	if (!__xfs_iflock_nowait(ip)) {
> +		lsn = ip->i_itemp->ili_item.li_lsn;

This looks like a potential crash vector if we ever got here with a
clean inode.
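
A minimal userspace sketch of the kind of guard that would avoid this,
assuming the concern is a NULL ->i_itemp on an inode that has never been
logged. The types and the helper inode_lsn_or_zero() are hypothetical
stand-ins, not XFS code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t lsn_t;

/* Hypothetical stand-ins for the inode log item and the inode. */
struct log_item {
	lsn_t			li_lsn;
};

struct inode_model {
	struct log_item		*itemp;	/* assumed NULL if never logged */
};

/* Return the inode's LSN, or 0 when no log item is attached. */
static lsn_t
inode_lsn_or_zero(const struct inode_model *ip)
{
	assert(ip != NULL);
	return ip->itemp ? ip->itemp->li_lsn : 0;
}
```

With a helper like this, the flush lock failure path could record an LSN
only when one exists rather than dereferencing unconditionally.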

> +		ra->dirty_skipped++;
> +		goto out_unlock_inode;
>  	}
>  
> -	if (__xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
> -		return false;
> +	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
> +		goto reclaim;
>  
>  	/*
> -	 * The radix tree lock here protects a thread in xfs_iget from racing
> -	 * with us starting reclaim on the inode.  Once we have the
> -	 * XFS_IRECLAIM flag set it will not touch us.
> -	 *
> -	 * Due to RCU lookup, we may find inodes that have been freed and only
> -	 * have XFS_IRECLAIM set.  Indeed, we may see reallocated inodes that
> -	 * aren't candidates for reclaim at all, so we must check the
> -	 * XFS_IRECLAIMABLE is set first before proceeding to reclaim.
> +	 * Now the inode is locked, we can actually determine if it is dirty
> +	 * without racing with anything.
>  	 */
> -	spin_lock(&ip->i_flags_lock);
> -	if (!__xfs_iflags_test(ip, XFS_IRECLAIMABLE) ||
> -	    __xfs_iflags_test(ip, XFS_IRECLAIM)) {
> -		/* not a reclaim candidate. */
> -		spin_unlock(&ip->i_flags_lock);
> -		return false;
> +	ret = LRU_ROTATE;
> +	if (xfs_ipincount(ip)) {
> +		ra->dirty_skipped++;

Hmm.. didn't we have an LSN check here?

Altogether, I think the logic in this function would be a lot simpler
if we had something like the following:

	...
	/* ret == LRU_SKIP */
	if (!xfs_inode_clean(ip)) {
		ret = LRU_ROTATE;
		lsn = ip->i_itemp->ili_item.li_lsn;
		ra->dirty_skipped++;
	}
	if (lsn && XFS_LSN_CMP(lsn, ra->lowest_lsn) < 0)
		ra->lowest_lsn = lsn;
	return ret;

... as the non-reclaim exit path. Then the earlier logic simply dictates
how we process the inode instead of conflating lru processing with
lsn/dirty checks. Otherwise for example (based on the current logic),
it's not really clear to me whether ->dirty_skipped cares about dirty
inodes or just the fact that we skipped an inode.

> +		goto out_ifunlock;
> +	}
> +	if (!xfs_inode_clean(ip) && !__xfs_iflags_test(ip, XFS_ISTALE)) {
> +		lsn = ip->i_itemp->ili_item.li_lsn;
> +		ra->dirty_skipped++;
> +		goto out_ifunlock;
>  	}
> +
...
> @@ -1165,167 +1108,52 @@ xfs_reclaim_inode(
...
>  void
>  xfs_reclaim_all_inodes(
>  	struct xfs_mount	*mp)
>  {
...
> +	while (list_lru_count(&mp->m_inode_lru)) {

It seems unnecessary to call this twice per-iter:

	while ((to_free = list_lru_count(&mp->m_inode_lru))) {
		...
	}

Hm?

Brian

> +		struct xfs_ireclaim_args ra;
> +		long freed, to_free;
> +
> +		xfs_ireclaim_args_init(&ra);
> +
> +		to_free = list_lru_count(&mp->m_inode_lru);
> +		freed = list_lru_walk(&mp->m_inode_lru,
> +				      xfs_inode_reclaim_isolate, &ra, to_free);
> +		xfs_dispose_inodes(&ra.freeable);
> +
> +		if (freed == 0) {
> +			xfs_log_force(mp, XFS_LOG_SYNC);
> +			xfs_ail_push_all(mp->m_ail);
> +		} else if (ra.lowest_lsn != NULLCOMMITLSN) {
> +			xfs_ail_push_sync(mp->m_ail, ra.lowest_lsn);
> +		}
> +		cond_resched();
> +	}
>  }
>  
>  STATIC int
> diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
> index afd692b06c13..86e858e4a281 100644
> --- a/fs/xfs/xfs_icache.h
> +++ b/fs/xfs/xfs_icache.h
> @@ -49,8 +49,24 @@ int xfs_iget(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t ino,
>  struct xfs_inode * xfs_inode_alloc(struct xfs_mount *mp, xfs_ino_t ino);
>  void xfs_inode_free(struct xfs_inode *ip);
>  
> +struct xfs_ireclaim_args {
> +	struct list_head	freeable;
> +	xfs_lsn_t		lowest_lsn;
> +	unsigned long		dirty_skipped;
> +};
> +
> +static inline void
> +xfs_ireclaim_args_init(struct xfs_ireclaim_args *ra)
> +{
> +	INIT_LIST_HEAD(&ra->freeable);
> +	ra->lowest_lsn = NULLCOMMITLSN;
> +	ra->dirty_skipped = 0;
> +}
> +
> +enum lru_status xfs_inode_reclaim_isolate(struct list_head *item,
> +		struct list_lru_one *lru, spinlock_t *lru_lock, void *arg);
> +void xfs_dispose_inodes(struct list_head *freeable);
>  void xfs_reclaim_all_inodes(struct xfs_mount *mp);
> -long xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan);
>  
>  void xfs_inode_set_reclaim_tag(struct xfs_inode *ip);
>  
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index bcfb35a9c5ca..00145debf820 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -270,6 +270,15 @@ static inline int xfs_isiflocked(struct xfs_inode *ip)
>  
>  extern void __xfs_iflock(struct xfs_inode *ip);
>  
> +static inline int __xfs_iflock_nowait(struct xfs_inode *ip)
> +{
> +	lockdep_assert_held(&ip->i_flags_lock);
> +	if (ip->i_flags & XFS_IFLOCK)
> +		return false;
> +	ip->i_flags |= XFS_IFLOCK;
> +	return true;
> +}
> +
>  static inline int xfs_iflock_nowait(struct xfs_inode *ip)
>  {
>  	return !xfs_iflags_test_and_set(ip, XFS_IFLOCK);
> @@ -281,6 +290,15 @@ static inline void xfs_iflock(struct xfs_inode *ip)
>  		__xfs_iflock(ip);
>  }
>  
> +static inline void __xfs_ifunlock(struct xfs_inode *ip)
> +{
> +	lockdep_assert_held(&ip->i_flags_lock);
> +	ASSERT(ip->i_flags & XFS_IFLOCK);
> +	ip->i_flags &= ~XFS_IFLOCK;
> +	smp_mb();
> +	wake_up_bit(&ip->i_flags, __XFS_IFLOCK_BIT);
> +}
> +
>  static inline void xfs_ifunlock(struct xfs_inode *ip)
>  {
>  	ASSERT(xfs_isiflocked(ip));
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 98ffbe42f8ae..096ae31b5436 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -17,6 +17,7 @@
>  #include "xfs_alloc.h"
>  #include "xfs_fsops.h"
>  #include "xfs_trans.h"
> +#include "xfs_trans_priv.h"
>  #include "xfs_buf_item.h"
>  #include "xfs_log.h"
>  #include "xfs_log_priv.h"
> @@ -1772,23 +1773,54 @@ xfs_fs_mount(
>  }
>  
>  static long
> -xfs_fs_nr_cached_objects(
> +xfs_fs_free_cached_objects(
>  	struct super_block	*sb,
>  	struct shrink_control	*sc)
>  {
> -	/* Paranoia: catch incorrect calls during mount setup or teardown */
> -	if (WARN_ON_ONCE(!sb->s_fs_info))
> -		return 0;
> +	struct xfs_mount	*mp = XFS_M(sb);
> +	struct xfs_ireclaim_args ra;
> +	long			freed;
>  
> -	return list_lru_shrink_count(&XFS_M(sb)->m_inode_lru, sc);
> +	xfs_ireclaim_args_init(&ra);
> +
> +	freed = list_lru_shrink_walk(&mp->m_inode_lru, sc,
> +					xfs_inode_reclaim_isolate, &ra);
> +	xfs_dispose_inodes(&ra.freeable);
> +
> +	/*
> +	 * Deal with dirty inodes. We will have the LSN of
> +	 * the oldest dirty inode in our reclaim args if we skipped any.
> +	 *
> +	 * For kswapd, if we skipped too many dirty inodes (i.e. more dirty than
> +	 * we freed) then we need kswapd to back off once its scan has been
> +	 * completed. That way it will have some clean inodes once it comes back
> +	 * and can make progress, but make sure we have inode cleaning in
> +	 * progress.
> +	 *
> +	 * Direct reclaim will be throttled by the caller as it winds the
> +	 * priority up. All we need to do is keep pushing on dirty inodes
> +	 * in the background so when we come back progress will be made.
> +	 */
> +	if (current_is_kswapd() && ra.dirty_skipped >= freed) {
> +		if (current->reclaim_state)
> +			current->reclaim_state->need_backoff = true;
> +	}
> +	if (ra.lowest_lsn != NULLCOMMITLSN)
> +		xfs_ail_push(mp->m_ail, ra.lowest_lsn);
> +
> +	return freed;
>  }
>  
>  static long
> -xfs_fs_free_cached_objects(
> +xfs_fs_nr_cached_objects(
>  	struct super_block	*sb,
>  	struct shrink_control	*sc)
>  {
> -	return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
> +	/* Paranoia: catch incorrect calls during mount setup or teardown */
> +	if (WARN_ON_ONCE(!sb->s_fs_info))
> +		return 0;
> +
> +	return list_lru_shrink_count(&XFS_M(sb)->m_inode_lru, sc);
>  }
>  
>  static const struct super_operations xfs_super_operations = {
> -- 
> 2.24.0.rc0
>
