All of lore.kernel.org
 help / color / mirror / Atom feed
From: Johannes Weiner <hannes@cmpxchg.org>
To: Matthew Wilcox <willy@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH] mm: fix page cache convergence regression
Date: Thu, 30 May 2019 12:15:48 -0400	[thread overview]
Message-ID: <20190530161548.GA8415@cmpxchg.org> (raw)
In-Reply-To: <20190524173900.GA11702@cmpxchg.org>

Are there any objections or feedback on the proposed fix below? This
is kind of a serious regression.

On Fri, May 24, 2019 at 01:39:00PM -0400, Johannes Weiner wrote:
> From 63a0dbc571ff38f7c072c62d6bc28192debe37ac Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Fri, 24 May 2019 10:12:46 -0400
> Subject: [PATCH] mm: fix page cache convergence regression
> 
> Since a28334862993 ("page cache: Finish XArray conversion"), on most
> major Linux distributions, the page cache doesn't correctly transition
> when the hot data set is changing, and leaves the new pages thrashing
> indefinitely instead of kicking out the cold ones.
> 
> On a freshly booted, freshly ssh'd into virtual machine with 1G RAM
> running stock Arch Linux:
> 
> [root@ham ~]# ./reclaimtest.sh
> + dd of=workingset-a bs=1M count=0 seek=600
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + ./mincore workingset-a
> 153600/153600 workingset-a
> + dd of=workingset-b bs=1M count=0 seek=600
> + cat workingset-b
> + cat workingset-b
> + cat workingset-b
> + cat workingset-b
> + ./mincore workingset-a workingset-b
> 104029/153600 workingset-a
> 120086/153600 workingset-b
> + cat workingset-b
> + cat workingset-b
> + cat workingset-b
> + cat workingset-b
> + ./mincore workingset-a workingset-b
> 104029/153600 workingset-a
> 120268/153600 workingset-b
> 
> workingset-b is a 600M file on a 1G host that is otherwise entirely
> idle. No matter how often it's being accessed, it won't get cached.
> 
> While investigating, I noticed that the non-resident information gets
> aggressively reclaimed - /proc/vmstat::workingset_nodereclaim. This is
> a problem because a workingset transition like this relies on the
> non-resident information tracked in the page cache tree of evicted
> file ranges: when the cache faults are refaults of recently evicted
> cache, we challenge the existing active set, and that allows a new
> workingset to establish itself.
> 
> Tracing the shrinker that maintains this memory revealed that all page
> cache tree nodes were allocated to the root cgroup. This is a problem,
> because 1) the shrinker sizes the amount of non-resident information
> it keeps to the size of the cgroup's other memory and 2) on most major
> Linux distributions, only kernel threads live in the root cgroup and
> everything else gets put into services or session groups:
> 
> [root@ham ~]# cat /proc/self/cgroup
> 0::/user.slice/user-0.slice/session-c1.scope
> 
> As a result, we basically maintain no non-resident information for the
> workloads running on the system, thus breaking the caching algorithm.
> 
> Looking through the code, I found the culprit in the above-mentioned
> patch: when switching from the radix tree to xarray, it dropped the
> __GFP_ACCOUNT flag from the tree node allocations - the flag that
> makes sure the allocated memory gets charged to and tracked by the
> cgroup of the calling process - in this case, the one doing the fault.
> 
> To fix this, allow xarray users to specify per-tree flag that makes
> xarray allocate nodes using __GFP_ACCOUNT. Then restore the page cache
> tree annotation to request such cgroup tracking for the cache nodes.
> 
> With this patch applied, the page cache correctly converges on new
> workingsets again after just a few iterations:
> 
> [root@ham ~]# ./reclaimtest.sh
> + dd of=workingset-a bs=1M count=0 seek=600
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + ./mincore workingset-a
> 153600/153600 workingset-a
> + dd of=workingset-b bs=1M count=0 seek=600
> + cat workingset-b
> + ./mincore workingset-a workingset-b
> 124607/153600 workingset-a
> 87876/153600 workingset-b
> + cat workingset-b
> + ./mincore workingset-a workingset-b
> 81313/153600 workingset-a
> 133321/153600 workingset-b
> + cat workingset-b
> + ./mincore workingset-a workingset-b
> 63036/153600 workingset-a
> 153600/153600 workingset-b
> 
> Cc: stable@vger.kernel.org # 4.20+
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  fs/inode.c             |  2 +-
>  include/linux/xarray.h |  1 +
>  lib/xarray.c           | 12 ++++++++++--
>  3 files changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index e9d18b2c3f91..cd67859dbaf1 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -361,7 +361,7 @@ EXPORT_SYMBOL(inc_nlink);
>  
>  static void __address_space_init_once(struct address_space *mapping)
>  {
> -	xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ);
> +	xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT);
>  	init_rwsem(&mapping->i_mmap_rwsem);
>  	INIT_LIST_HEAD(&mapping->private_list);
>  	spin_lock_init(&mapping->private_lock);
> diff --git a/include/linux/xarray.h b/include/linux/xarray.h
> index 0e01e6129145..5921599b6dc4 100644
> --- a/include/linux/xarray.h
> +++ b/include/linux/xarray.h
> @@ -265,6 +265,7 @@ enum xa_lock_type {
>  #define XA_FLAGS_TRACK_FREE	((__force gfp_t)4U)
>  #define XA_FLAGS_ZERO_BUSY	((__force gfp_t)8U)
>  #define XA_FLAGS_ALLOC_WRAPPED	((__force gfp_t)16U)
> +#define XA_FLAGS_ACCOUNT	((__force gfp_t)32U)
>  #define XA_FLAGS_MARK(mark)	((__force gfp_t)((1U << __GFP_BITS_SHIFT) << \
>  						(__force unsigned)(mark)))
>  
> diff --git a/lib/xarray.c b/lib/xarray.c
> index 6be3acbb861f..446b956c9188 100644
> --- a/lib/xarray.c
> +++ b/lib/xarray.c
> @@ -298,6 +298,8 @@ bool xas_nomem(struct xa_state *xas, gfp_t gfp)
>  		xas_destroy(xas);
>  		return false;
>  	}
> +	if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
> +		gfp |= __GFP_ACCOUNT;
>  	xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
>  	if (!xas->xa_alloc)
>  		return false;
> @@ -325,6 +327,8 @@ static bool __xas_nomem(struct xa_state *xas, gfp_t gfp)
>  		xas_destroy(xas);
>  		return false;
>  	}
> +	if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
> +		gfp |= __GFP_ACCOUNT;
>  	if (gfpflags_allow_blocking(gfp)) {
>  		xas_unlock_type(xas, lock_type);
>  		xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
> @@ -358,8 +362,12 @@ static void *xas_alloc(struct xa_state *xas, unsigned int shift)
>  	if (node) {
>  		xas->xa_alloc = NULL;
>  	} else {
> -		node = kmem_cache_alloc(radix_tree_node_cachep,
> -					GFP_NOWAIT | __GFP_NOWARN);
> +		gfp_t gfp = GFP_NOWAIT | __GFP_NOWARN;
> +
> +		if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
> +			gfp |= __GFP_ACCOUNT;
> +
> +		node = kmem_cache_alloc(radix_tree_node_cachep, gfp);
>  		if (!node) {
>  			xas_set_err(xas, -ENOMEM);
>  			return NULL;
> -- 
> 2.21.0
> 

  parent reply	other threads:[~2019-05-30 16:15 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-05-24 15:31 [PATCH] mm: fix page cache convergence regression Johannes Weiner
2019-05-24 16:04 ` Matthew Wilcox
2019-05-24 17:39   ` Johannes Weiner
2019-05-24 18:04     ` Shakeel Butt
2019-05-30 16:15     ` Johannes Weiner [this message]
2019-05-30 17:13       ` Matthew Wilcox
2019-05-30 17:58         ` Johannes Weiner
2019-06-24 15:19         ` Johannes Weiner
2019-06-24 21:52           ` Linus Torvalds

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190530161548.GA8415@cmpxchg.org \
    --to=hannes@cmpxchg.org \
    --cc=akpm@linux-foundation.org \
    --cc=cgroups@vger.kernel.org \
    --cc=kernel-team@fb.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=torvalds@linux-foundation.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.