From: Minchan Kim <minchan@kernel.org>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>,
Andrew Morton <akpm@linux-foundation.org>,
Andi Kleen <andi@firstfloor.org>,
Andrea Arcangeli <aarcange@redhat.com>,
Bob Liu <bob.liu@oracle.com>,
Christoph Hellwig <hch@infradead.org>,
Greg Thelen <gthelen@google.com>, Hugh Dickins <hughd@google.com>,
Jan Kara <jack@suse.cz>,
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
Luigi Semenzato <semenzato@google.com>,
Mel Gorman <mgorman@suse.de>, Metin Doslu <metin@citusdata.com>,
Michel Lespinasse <walken@google.com>,
Ozgun Erdogan <ozgun@citusdata.com>,
Peter Zijlstra <peterz@infradead.org>,
Rik van Riel <riel@redhat.com>,
Roman Gushchin <klamm@yandex-team.ru>,
Ryan Mallon <rmallon@gmail.com>, Tejun Heo <tj@kernel.org>,
Vlastimil Babka <vbabka@suse.cz>,
linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: Re: [patch 9/9] mm: keep page cache radix tree nodes in check
Date: Thu, 23 Jan 2014 14:57:40 +0900 [thread overview]
Message-ID: <20140123055740.GA14369@bbox> (raw)
In-Reply-To: <20140120231737.GS6963@cmpxchg.org>
On Mon, Jan 20, 2014 at 06:17:37PM -0500, Johannes Weiner wrote:
> On Fri, Jan 17, 2014 at 11:05:17AM +1100, Dave Chinner wrote:
> > On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> > > Previously, page cache radix tree nodes were freed after reclaim
> > > emptied out their page pointers. But now reclaim stores shadow
> > > entries in their place, which are only reclaimed when the inodes
> > > themselves are reclaimed. This is problematic for bigger files that
> > > are still in use after they have a significant amount of their cache
> > > reclaimed, without any of those pages actually refaulting. The shadow
> > > entries will just sit there and waste memory. In the worst case, the
> > > shadow entries will accumulate until the machine runs out of memory.
> > >
> > > To get this under control, the VM will track radix tree nodes
> > > exclusively containing shadow entries on a per-NUMA node list.
> > > Per-NUMA rather than global because we expect the radix tree nodes
> > > themselves to be allocated node-locally and we want to reduce
> > > cross-node references of otherwise independent cache workloads. A
> > > simple shrinker will then reclaim these nodes on memory pressure.
> > >
> > > A few things need to be stored in the radix tree node to implement the
> > > shadow node LRU and allow tree deletions coming from the list:
> >
> > Just a couple of things with the list_lru interfaces.
> >
> > ....
> > > @@ -123,9 +129,39 @@ static void page_cache_tree_delete(struct address_space *mapping,
> > > * same time and miss a shadow entry.
> > > */
> > > smp_wmb();
> > > - } else
> > > - radix_tree_delete(&mapping->page_tree, page->index);
> > > + }
> > > mapping->nrpages--;
> > > +
> > > + if (!node) {
> > > + /* Clear direct pointer tags in root node */
> > > + mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
> > > + radix_tree_replace_slot(slot, shadow);
> > > + return;
> > > + }
> > > +
> > > + /* Clear tree tags for the removed page */
> > > + index = page->index;
> > > + offset = index & RADIX_TREE_MAP_MASK;
> > > + for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
> > > + if (test_bit(offset, node->tags[tag]))
> > > + radix_tree_tag_clear(&mapping->page_tree, index, tag);
> > > + }
> > > +
> > > + /* Delete page, swap shadow entry */
> > > + radix_tree_replace_slot(slot, shadow);
> > > + node->count--;
> > > + if (shadow)
> > > + node->count += 1U << RADIX_TREE_COUNT_SHIFT;
> > > + else
> > > + if (__radix_tree_delete_node(&mapping->page_tree, node))
> > > + return;
> > > +
> > > + /* Only shadow entries in there, keep track of this node */
> > > + if (!(node->count & RADIX_TREE_COUNT_MASK) &&
> > > + list_empty(&node->private_list)) {
> > > + node->private_data = mapping;
> > > + list_lru_add(&workingset_shadow_nodes, &node->private_list);
> > > + }
> >
> > You can't do this list_empty(&node->private_list) check safely
> > externally to the list_lru code - only time that entry can be
> > checked safely is under the LRU list locks. This is the reason that
> > list_lru_add/list_lru_del return a boolean to indicate is the object
> > was added/removed from the list - they do this list_empty() check
> > internally. i.e. the correct, safe way to do conditionally update
> > state iff the object was added to the LRU is:
> >
> > if (!(node->count & RADIX_TREE_COUNT_MASK)) {
> > if (list_lru_add(&workingset_shadow_nodes, &node->private_list))
> > node->private_data = mapping;
> > }
> >
> > > + radix_tree_replace_slot(slot, page);
> > > + mapping->nrpages++;
> > > + if (node) {
> > > + node->count++;
> > > + /* Installed page, can't be shadow-only anymore */
> > > + if (!list_empty(&node->private_list))
> > > + list_lru_del(&workingset_shadow_nodes,
> > > + &node->private_list);
> > > + }
> >
> > Same issue here:
> >
> > if (node) {
> > node->count++;
> > list_lru_del(&workingset_shadow_nodes, &node->private_list);
> > }
>
> All modifications to node->private_list happen under
> mapping->tree_lock, and modifications of a neighboring link should not
> affect the outcome of the list_empty(), so I don't think the lru lock
> is necessary.
>
> It would be cleaner to take it of course, but that would mean adding
> an unconditional NUMAnode-wide lock to every page cache population.
>
> > > static int __add_to_page_cache_locked(struct page *page,
> > > diff --git a/mm/list_lru.c b/mm/list_lru.c
> > > index 72f9decb0104..47a9faf4070b 100644
> > > --- a/mm/list_lru.c
> > > +++ b/mm/list_lru.c
> > > @@ -88,10 +88,18 @@ restart:
> > > ret = isolate(item, &nlru->lock, cb_arg);
> > > switch (ret) {
> > > case LRU_REMOVED:
> > > + case LRU_REMOVED_RETRY:
> > > if (--nlru->nr_items == 0)
> > > node_clear(nid, lru->active_nodes);
> > > WARN_ON_ONCE(nlru->nr_items < 0);
> > > isolated++;
> > > + /*
> > > + * If the lru lock has been dropped, our list
> > > + * traversal is now invalid and so we have to
> > > + * restart from scratch.
> > > + */
> > > + if (ret == LRU_REMOVED_RETRY)
> > > + goto restart;
> > > break;
> > > case LRU_ROTATE:
> > > list_move_tail(item, &nlru->list);
> >
> > I think that we need to assert that the list lru lock is correctly
> > held here on return with LRU_REMOVED_RETRY. i.e.
> >
> > case LRU_REMOVED_RETRY:
> > assert_spin_locked(&nlru->lock);
> > case LRU_REMOVED:
>
> Ah, good idea. How about adding it to LRU_RETRY as well?
>
> > > +/*
> > > + * Page cache radix tree nodes containing only shadow entries can grow
> > > + * excessively on certain workloads. That's why they are tracked on
> > > + * per-(NUMA)node lists and pushed back by a shrinker, but with a
> > > + * slightly higher threshold than regular shrinkers so we don't
> > > + * discard the entries too eagerly - after all, during light memory
> > > + * pressure is exactly when we need them.
> > > + */
> > > +
> > > +struct list_lru workingset_shadow_nodes;
> > > +
> > > +static unsigned long count_shadow_nodes(struct shrinker *shrinker,
> > > + struct shrink_control *sc)
> > > +{
> > > + return list_lru_count_node(&workingset_shadow_nodes, sc->nid);
> > > +}
> > > +
> > > +static enum lru_status shadow_lru_isolate(struct list_head *item,
> > > + spinlock_t *lru_lock,
> > > + void *arg)
> > > +{
> > > + unsigned long *nr_reclaimed = arg;
> > > + struct address_space *mapping;
> > > + struct radix_tree_node *node;
> > > + unsigned int i;
> > > + int ret;
> > > +
> > > + /*
> > > + * Page cache insertions and deletions synchroneously maintain
> > > + * the shadow node LRU under the mapping->tree_lock and the
> > > + * lru_lock. Because the page cache tree is emptied before
> > > + * the inode can be destroyed, holding the lru_lock pins any
> > > + * address_space that has radix tree nodes on the LRU.
> > > + *
> > > + * We can then safely transition to the mapping->tree_lock to
> > > + * pin only the address_space of the particular node we want
> > > + * to reclaim, take the node off-LRU, and drop the lru_lock.
> > > + */
> > > +
> > > + node = container_of(item, struct radix_tree_node, private_list);
> > > + mapping = node->private_data;
> > > +
> > > + /* Coming from the list, invert the lock order */
> > > + if (!spin_trylock_irq(&mapping->tree_lock)) {
> > > + spin_unlock(lru_lock);
> > > + ret = LRU_RETRY;
> > > + goto out;
> > > + }
> > > +
> > > + list_del_init(item);
> > > + spin_unlock(lru_lock);
> > > +
> > > + /*
> > > + * The nodes should only contain one or more shadow entries,
> > > + * no pages, so we expect to be able to remove them all and
> > > + * delete and free the empty node afterwards.
> > > + */
> > > +
> > > + BUG_ON(!node->count);
> > > + BUG_ON(node->count & RADIX_TREE_COUNT_MASK);
> > > +
> > > + for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
> > > + if (node->slots[i]) {
> > > + BUG_ON(!radix_tree_exceptional_entry(node->slots[i]));
> > > + node->slots[i] = NULL;
> > > + BUG_ON(node->count < (1U << RADIX_TREE_COUNT_SHIFT));
> > > + node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
> > > + BUG_ON(!mapping->nrshadows);
> > > + mapping->nrshadows--;
> > > + }
> > > + }
> > > + BUG_ON(node->count);
> > > + inc_zone_state(page_zone(virt_to_page(node)), WORKINGSET_NODERECLAIM);
> > > + if (!__radix_tree_delete_node(&mapping->page_tree, node))
> > > + BUG();
> > > + (*nr_reclaimed)++;
> > > +
> > > + spin_unlock_irq(&mapping->tree_lock);
> > > + ret = LRU_REMOVED_RETRY;
> > > +out:
> > > + cond_resched();
> > > + spin_lock(lru_lock);
> > > + return ret;
> > > +}
> > > +
> > > +static unsigned long scan_shadow_nodes(struct shrinker *shrinker,
> > > + struct shrink_control *sc)
> > > +{
> > > + unsigned long nr_reclaimed = 0;
> > > +
> > > + list_lru_walk_node(&workingset_shadow_nodes, sc->nid,
> > > + shadow_lru_isolate, &nr_reclaimed, &sc->nr_to_scan);
> > > +
> > > + return nr_reclaimed;
> >
> > list_lru_walk_node() returns the number of reclaimed objects (i.e.
> > the number of objects that returned LRU_REMOVED/LRU_REMOVED_RETRY
> > from the ->isolate callback). You don't need to count nr_reclaimed
> > yourself.
>
> Good catch, this is a leftover from before LRU_REMOVED_RETRY. Removed
> the ad-hoc counter altogether.
>
> > > +static struct shrinker workingset_shadow_shrinker = {
> > > + .count_objects = count_shadow_nodes,
> > > + .scan_objects = scan_shadow_nodes,
> > > + .seeks = DEFAULT_SEEKS * 4,
> > > + .flags = SHRINKER_NUMA_AWARE,
> > > +};
> >
> > Can you add a comment explaining how you calculated the .seeks
> > value? It's important to document the weighings/importance
> > we give to slab reclaim so we can determine if it's actually
> > acheiving the desired balance under different loads...
>
> This is not an exact science, to say the least.
>
> The shadow entries are mostly self-regulated, so I don't want the
> shrinker to interfere while the machine is just regularly trimming
> caches during normal operation.
>
> It should only kick in when either a) reclaim is picking up and the
> scan-to-reclaim ratio increases due to mapped pages, dirty cache,
> swapping etc. or b) the number of objects compared to LRU pages
> becomes excessive.
>
> I think that is what most shrinkers with an elevated seeks value want,
> but this translates very awkwardly (and not completely) to the current
> cost model, and we should probably rework that interface.
>
> "Seeks" currently encodes 3 ratios:
>
> 1. the cost of creating an object vs. a page
>
> 2. the expected number of objects vs. pages
>
> 3. the cost of reclaiming an object vs. a page
>
> but they are not necessarily correlated. How I would like to
> configure the shadow shrinker instead is:
>
> o scan objects when reclaim efficiency is down to 75%, because they
> are more valuable than use-once cache but less than workingset
>
Sorry if it is another topic.
Just out of curiosity. Why do you set it to 75%?
The why I ask is it's really needed thing for volatile range, which
want to discard volatile pages more than use-once but less than
working set.
In recent version, I just used (priority < DEF_PRIORITY - 2) to catch up
reclaim efficiency because we already have used it for noticing reclaim
trouble several places but not hard tested so I'm not sure it's better
than #reclaimed_pages/#scanned_page.
--
Kind regards,
Minchan Kim
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2014-01-23 5:56 UTC|newest]
Thread overview: 58+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-01-10 18:10 [patch 0/9] mm: thrash detection-based file cache sizing v8 Johannes Weiner
2014-01-10 18:10 ` [patch 1/9] fs: cachefiles: use add_to_page_cache_lru() Johannes Weiner
2014-01-13 1:17 ` Minchan Kim
2014-01-10 18:10 ` [patch 2/9] lib: radix-tree: radix_tree_delete_item() Johannes Weiner
2014-01-10 18:10 ` [patch 3/9] mm: shmem: save one radix tree lookup when truncating swapped pages Johannes Weiner
2014-01-10 18:25 ` Rik van Riel
2014-01-10 18:10 ` [patch 4/9] mm: filemap: move radix tree hole searching here Johannes Weiner
2014-01-10 19:22 ` Rik van Riel
2014-01-13 1:25 ` Minchan Kim
2014-01-10 18:10 ` [patch 5/9] mm + fs: prepare for non-page entries in page cache radix trees Johannes Weiner
2014-01-10 19:39 ` Rik van Riel
2014-01-13 2:01 ` Minchan Kim
2014-01-22 17:47 ` Johannes Weiner
2014-01-23 5:07 ` Minchan Kim
2014-02-12 14:00 ` Mel Gorman
2014-03-12 1:15 ` Johannes Weiner
2014-01-10 18:10 ` [patch 6/9] mm + fs: store shadow entries in page cache Johannes Weiner
2014-01-10 22:30 ` Rik van Riel
2014-01-13 2:18 ` Minchan Kim
2014-01-10 18:10 ` [patch 7/9] mm: thrash detection-based file cache sizing Johannes Weiner
2014-01-10 22:51 ` Rik van Riel
2014-01-13 2:42 ` Minchan Kim
2014-01-14 1:01 ` Bob Liu
2014-01-14 19:16 ` Johannes Weiner
2014-01-15 2:57 ` Bob Liu
2014-01-15 3:52 ` Zhang Yanfei
2014-01-16 21:17 ` Johannes Weiner
2014-01-10 18:10 ` [patch 8/9] lib: radix_tree: tree node interface Johannes Weiner
2014-01-10 22:57 ` Rik van Riel
2014-01-10 18:10 ` [patch 9/9] mm: keep page cache radix tree nodes in check Johannes Weiner
2014-01-10 23:09 ` Rik van Riel
2014-01-13 7:39 ` Minchan Kim
2014-01-14 5:40 ` Minchan Kim
2014-01-22 18:42 ` Johannes Weiner
2014-01-23 5:20 ` Minchan Kim
2014-01-23 19:22 ` Johannes Weiner
2014-01-27 2:31 ` Minchan Kim
2014-01-15 5:55 ` Bob Liu
2014-01-16 22:09 ` Johannes Weiner
2014-01-17 0:05 ` Dave Chinner
2014-01-20 23:17 ` Johannes Weiner
2014-01-21 3:03 ` Dave Chinner
2014-01-21 5:50 ` Johannes Weiner
2014-01-22 3:06 ` Dave Chinner
2014-01-22 6:57 ` Johannes Weiner
2014-01-22 18:48 ` Johannes Weiner
2014-01-23 5:57 ` Minchan Kim [this message]
-- strict thread matches above, loose matches on Subject: below --
2013-12-02 19:21 [patch 0/9] mm: thrash detection-based file cache sizing v7 Johannes Weiner
2013-12-02 19:21 ` [patch 9/9] mm: keep page cache radix tree nodes in check Johannes Weiner
2013-12-02 22:10 ` Dave Chinner
2013-12-02 22:46 ` Johannes Weiner
2013-11-24 23:38 [patch 0/9] mm: thrash detection-based file cache sizing v6 Johannes Weiner
2013-11-24 23:38 ` [patch 9/9] mm: keep page cache radix tree nodes in check Johannes Weiner
2013-11-25 23:49 ` Dave Chinner
2013-11-26 21:27 ` Johannes Weiner
2013-11-26 22:29 ` Dave Chinner
2013-11-26 23:00 ` Johannes Weiner
2013-11-27 0:59 ` Dave Chinner
2013-11-26 0:13 ` Andrew Morton
2013-11-26 22:05 ` Johannes Weiner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20140123055740.GA14369@bbox \
--to=minchan@kernel.org \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=andi@firstfloor.org \
--cc=bob.liu@oracle.com \
--cc=david@fromorbit.com \
--cc=gthelen@google.com \
--cc=hannes@cmpxchg.org \
--cc=hch@infradead.org \
--cc=hughd@google.com \
--cc=jack@suse.cz \
--cc=klamm@yandex-team.ru \
--cc=kosaki.motohiro@jp.fujitsu.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=metin@citusdata.com \
--cc=mgorman@suse.de \
--cc=ozgun@citusdata.com \
--cc=peterz@infradead.org \
--cc=riel@redhat.com \
--cc=rmallon@gmail.com \
--cc=semenzato@google.com \
--cc=tj@kernel.org \
--cc=vbabka@suse.cz \
--cc=walken@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).