From: Minchan Kim <minchan@kernel.org>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Andi Kleen <andi@firstfloor.org>,
Andrea Arcangeli <aarcange@redhat.com>,
Bob Liu <bob.liu@oracle.com>,
Christoph Hellwig <hch@infradead.org>,
Dave Chinner <david@fromorbit.com>,
Greg Thelen <gthelen@google.com>, Hugh Dickins <hughd@google.com>,
Jan Kara <jack@suse.cz>,
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
Luigi Semenzato <semenzato@google.com>,
Mel Gorman <mgorman@suse.de>, Metin Doslu <metin@citusdata.com>,
Michel Lespinasse <walken@google.com>,
Ozgun Erdogan <ozgun@citusdata.com>,
Peter Zijlstra <peterz@infradead.org>,
Rik van Riel <riel@redhat.com>,
Roman Gushchin <klamm@yandex-team.ru>,
Ryan Mallon <rmallon@gmail.com>, Tejun Heo <tj@kernel.org>,
Vlastimil Babka <vbabka@suse.cz>,
linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: Re: [patch 9/9] mm: keep page cache radix tree nodes in check
Date: Thu, 23 Jan 2014 14:20:14 +0900
Message-ID: <20140123052014.GC28732@bbox>
In-Reply-To: <20140122184217.GD4407@cmpxchg.org>
On Wed, Jan 22, 2014 at 01:42:17PM -0500, Johannes Weiner wrote:
> On Mon, Jan 13, 2014 at 04:39:47PM +0900, Minchan Kim wrote:
> > On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> > > Previously, page cache radix tree nodes were freed after reclaim
> > > emptied out their page pointers. But now reclaim stores shadow
> > > entries in their place, which are only reclaimed when the inodes
> > > themselves are reclaimed. This is problematic for bigger files that
> > > are still in use after they have a significant amount of their cache
> > > reclaimed, without any of those pages actually refaulting. The shadow
> > > entries will just sit there and waste memory. In the worst case, the
> > > shadow entries will accumulate until the machine runs out of memory.
> > >
> > > To get this under control, the VM will track radix tree nodes
> > > exclusively containing shadow entries on a per-NUMA node list.
> > > Per-NUMA rather than global because we expect the radix tree nodes
> > > themselves to be allocated node-locally and we want to reduce
> > > cross-node references of otherwise independent cache workloads. A
> > > simple shrinker will then reclaim these nodes on memory pressure.
> > >
> > > A few things need to be stored in the radix tree node to implement the
> > > shadow node LRU and allow tree deletions coming from the list:
> > >
> > > 1. There is no index available that would describe the reverse path
> > > from the node up to the tree root, which is needed to perform a
> > > deletion. To solve this, encode in each node its offset inside the
> > > parent. This can be stored in the unused upper bits of the same
> > > member that stores the node's height at no extra space cost.
> > >
> > > 2. The number of shadow entries needs to be counted in addition to the
> > > regular entries, to quickly detect when the node is ready to go to
> > > the shadow node LRU list. The current entry count is an unsigned
> > > int but the maximum number of entries is 64, so a shadow counter
> > > can easily be stored in the unused upper bits.
> > >
> > > 3. Tree modification needs tree lock and tree root, which are located
> > > in the address space, so store an address_space backpointer in the
> > > node. The parent pointer of the node is in a union with the 2-word
> > > rcu_head, so the backpointer comes at no extra cost as well.
> > >
> > > 4. The node needs to be linked to an LRU list, which requires a list
> > > head inside the node. This does increase the size of the node, but
> > > it does not change the number of objects that fit into a slab page.
> > >
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > ---
> > > include/linux/list_lru.h | 2 +
> > > include/linux/mmzone.h | 1 +
> > > include/linux/radix-tree.h | 32 +++++++++---
> > > include/linux/swap.h | 1 +
> > > lib/radix-tree.c | 36 ++++++++------
> > > mm/filemap.c | 77 +++++++++++++++++++++++------
> > > mm/list_lru.c | 8 +++
> > > mm/truncate.c | 20 +++++++-
> > > mm/vmstat.c | 1 +
> > > mm/workingset.c | 121 +++++++++++++++++++++++++++++++++++++++++++++
> > > 10 files changed, 259 insertions(+), 40 deletions(-)
> > >
> > > diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
> > > index 3ce541753c88..b02fc233eadd 100644
> > > --- a/include/linux/list_lru.h
> > > +++ b/include/linux/list_lru.h
> > > @@ -13,6 +13,8 @@
> > > /* list_lru_walk_cb has to always return one of those */
> > > enum lru_status {
> > > LRU_REMOVED, /* item removed from list */
> > > + LRU_REMOVED_RETRY, /* item removed, but lock has been
> > > + dropped and reacquired */
> > > LRU_ROTATE, /* item referenced, give another pass */
> > > LRU_SKIP, /* item cannot be locked, skip */
> > > LRU_RETRY, /* item not freeable. May drop the lock
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > index 118ba9f51e86..8cac5a7ef7a7 100644
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -144,6 +144,7 @@ enum zone_stat_item {
> > > #endif
> > > WORKINGSET_REFAULT,
> > > WORKINGSET_ACTIVATE,
> > > + WORKINGSET_NODERECLAIM,
> > > NR_ANON_TRANSPARENT_HUGEPAGES,
> > > NR_FREE_CMA_PAGES,
> > > NR_VM_ZONE_STAT_ITEMS };
> > > diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> > > index 13636c40bc42..33170dbd9db4 100644
> > > --- a/include/linux/radix-tree.h
> > > +++ b/include/linux/radix-tree.h
> > > @@ -72,21 +72,37 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
> > > #define RADIX_TREE_TAG_LONGS \
> > > ((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
> > >
> > > +#define RADIX_TREE_INDEX_BITS (8 /* CHAR_BIT */ * sizeof(unsigned long))
> > > +#define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
> > > + RADIX_TREE_MAP_SHIFT))
> > > +
> > > +/* Height component in node->path */
> > > +#define RADIX_TREE_HEIGHT_SHIFT (RADIX_TREE_MAX_PATH + 1)
> > > +#define RADIX_TREE_HEIGHT_MASK ((1UL << RADIX_TREE_HEIGHT_SHIFT) - 1)
> > > +
> > > +/* Internally used bits of node->count */
> > > +#define RADIX_TREE_COUNT_SHIFT (RADIX_TREE_MAP_SHIFT + 1)
> > > +#define RADIX_TREE_COUNT_MASK ((1UL << RADIX_TREE_COUNT_SHIFT) - 1)
> > > +
> > > struct radix_tree_node {
> > > - unsigned int height; /* Height from the bottom */
> > > + unsigned int path; /* Offset in parent & height from the bottom */
> > > unsigned int count;
> > > union {
> > > - struct radix_tree_node *parent; /* Used when ascending tree */
> > > - struct rcu_head rcu_head; /* Used when freeing node */
> > > + struct {
> > > + /* Used when ascending tree */
> > > + struct radix_tree_node *parent;
> > > + /* For tree user */
> > > + void *private_data;
> > > + };
> > > + /* Used when freeing node */
> > > + struct rcu_head rcu_head;
> > > };
> > > + /* For tree user */
> > > + struct list_head private_list;
> > > void __rcu *slots[RADIX_TREE_MAP_SIZE];
> > > unsigned long tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
> > > };
> > >
> > > -#define RADIX_TREE_INDEX_BITS (8 /* CHAR_BIT */ * sizeof(unsigned long))
> > > -#define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
> > > - RADIX_TREE_MAP_SHIFT))
> > > -
> > > /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
> > > struct radix_tree_root {
> > > unsigned int height;
> > > @@ -251,7 +267,7 @@ void *__radix_tree_lookup(struct radix_tree_root *root, unsigned long index,
> > > struct radix_tree_node **nodep, void ***slotp);
> > > void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
> > > void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
> > > -bool __radix_tree_delete_node(struct radix_tree_root *root, unsigned long index,
> > > +bool __radix_tree_delete_node(struct radix_tree_root *root,
> > > struct radix_tree_node *node);
> > > void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
> > > void *radix_tree_delete(struct radix_tree_root *, unsigned long);
> > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > index b83cf61403ed..102e37bc82d5 100644
> > > --- a/include/linux/swap.h
> > > +++ b/include/linux/swap.h
> > > @@ -264,6 +264,7 @@ struct swap_list_t {
> > > void *workingset_eviction(struct address_space *mapping, struct page *page);
> > > bool workingset_refault(void *shadow);
> > > void workingset_activation(struct page *page);
> > > +extern struct list_lru workingset_shadow_nodes;
> > >
> > > /* linux/mm/page_alloc.c */
> > > extern unsigned long totalram_pages;
> > > diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> > > index e601c56a43d0..0a0895371447 100644
> > > --- a/lib/radix-tree.c
> > > +++ b/lib/radix-tree.c
> > > @@ -342,7 +342,8 @@ static int radix_tree_extend(struct radix_tree_root *root, unsigned long index)
> > >
> > > /* Increase the height. */
> > > newheight = root->height+1;
> > > - node->height = newheight;
> > > + BUG_ON(newheight & ~RADIX_TREE_HEIGHT_MASK);
> > > + node->path = newheight;
> >
> > Nitpick:
> > Would it be better to add accessors for path and offset, for
> > readability and future enhancement?
>
> Nodes are instantiated in one central place, I can't see the value in
> obscuring a straight-forward bitop with a radix_tree_node_set_offset()
> call.
>
> And height = node->path & RADIX_TREE_HEIGHT_MASK should be fairly
> descriptive, I think.
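To make the encoding concrete, here is a minimal user-space sketch of the
path packing discussed above. It assumes RADIX_TREE_MAP_SHIFT is 6 and that
the parent offset is stored shifted left by RADIX_TREE_HEIGHT_SHIFT (the hunk
quoted here only shows the height being stored). path_pack(), path_height()
and path_offset() are hypothetical names for illustration, not part of the
patch.

```c
#include <assert.h>

/* Assumption: 64 slots per node, i.e. RADIX_TREE_MAP_SHIFT == 6. */
#define RADIX_TREE_MAP_SHIFT	6
#define RADIX_TREE_MAP_SIZE	(1UL << RADIX_TREE_MAP_SHIFT)

/* Local copy of the kernel helper so the sketch builds in user space. */
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

#define RADIX_TREE_INDEX_BITS	(8 /* CHAR_BIT */ * sizeof(unsigned long))
#define RADIX_TREE_MAX_PATH	DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
					     RADIX_TREE_MAP_SHIFT)

/* Height component in node->path, as defined in the quoted patch */
#define RADIX_TREE_HEIGHT_SHIFT	(RADIX_TREE_MAX_PATH + 1)
#define RADIX_TREE_HEIGHT_MASK	((1UL << RADIX_TREE_HEIGHT_SHIFT) - 1)

/* Hypothetical helper: height in the low bits, parent offset above them. */
static inline unsigned int path_pack(unsigned int height, unsigned int offset)
{
	assert(!(height & ~RADIX_TREE_HEIGHT_MASK));
	assert(offset < RADIX_TREE_MAP_SIZE);
	return (offset << RADIX_TREE_HEIGHT_SHIFT) | height;
}

/* Hypothetical helper: same expression as in the reply above. */
static inline unsigned int path_height(unsigned int path)
{
	return path & RADIX_TREE_HEIGHT_MASK;
}

/* Hypothetical helper: recover the slot offset inside the parent. */
static inline unsigned int path_offset(unsigned int path)
{
	return path >> RADIX_TREE_HEIGHT_SHIFT;
}
```

With these, path_height(path_pack(2, 37)) gives back 2 and
path_offset(path_pack(2, 37)) gives back 37, so both values share one
unsigned int without growing the node.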
>
> > > @@ -123,9 +129,39 @@ static void page_cache_tree_delete(struct address_space *mapping,
> > > * same time and miss a shadow entry.
> > > */
> > > smp_wmb();
> > > - } else
> > > - radix_tree_delete(&mapping->page_tree, page->index);
> > > + }
> > > mapping->nrpages--;
> > > +
> > > + if (!node) {
> > > + /* Clear direct pointer tags in root node */
> > > + mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
> > > + radix_tree_replace_slot(slot, shadow);
> > > + return;
> > > + }
> > > +
> > > + /* Clear tree tags for the removed page */
> > > + index = page->index;
> > > + offset = index & RADIX_TREE_MAP_MASK;
> > > + for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
> > > + if (test_bit(offset, node->tags[tag]))
> > > + radix_tree_tag_clear(&mapping->page_tree, index, tag);
> > > + }
> > > +
> > > + /* Delete page, swap shadow entry */
> > > + radix_tree_replace_slot(slot, shadow);
> > > + node->count--;
> > > + if (shadow)
> > > + node->count += 1U << RADIX_TREE_COUNT_SHIFT;
> >
> > Nitpick2:
> > It should be a function of workingset.c rather than exposing
> > RADIX_TREE_COUNT_SHIFT?
> >
> > IMO, It would be better to provide some accessor functions here, too.
>
> The shadow maintenance and node lifetime management are pretty
> interwoven to share branches and reduce instructions as these are
> common paths. I don't see how this could result in cleaner code while
> keeping these advantages.
What I want is just to put an inline accessor somewhere like workingset.h:
static inline void inc_shadow_entry(struct radix_tree_node *node)
{
	node->count += 1U << RADIX_TREE_COUNT_SHIFT;
}
So nobody needs to know that the upper bits of node->count hold the
count of shadow entries.
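A slightly fuller user-space sketch of what such accessors could look like,
only to illustrate the idea. It assumes RADIX_TREE_MAP_SHIFT is 6 and uses a
cut-down stand-in for struct radix_tree_node. node_pages(), node_shadows(),
dec_shadow_entry() and node_replace_page() are hypothetical names, not
something the patch provides.

```c
#include <assert.h>

/* Assumption: 64 slots per node, i.e. RADIX_TREE_MAP_SHIFT == 6. */
#define RADIX_TREE_MAP_SHIFT	6

/* Internally used bits of node->count, as defined in the quoted patch */
#define RADIX_TREE_COUNT_SHIFT	(RADIX_TREE_MAP_SHIFT + 1)
#define RADIX_TREE_COUNT_MASK	((1UL << RADIX_TREE_COUNT_SHIFT) - 1)

/* Stand-in for the kernel structure; only the field used here. */
struct radix_tree_node {
	unsigned int count;
};

/* Low bits: the part of the count that the quoted hunk decrements. */
static inline unsigned int node_pages(const struct radix_tree_node *node)
{
	return node->count & RADIX_TREE_COUNT_MASK;
}

/* Upper bits: number of shadow entries. */
static inline unsigned int node_shadows(const struct radix_tree_node *node)
{
	return node->count >> RADIX_TREE_COUNT_SHIFT;
}

static inline void inc_shadow_entry(struct radix_tree_node *node)
{
	node->count += 1U << RADIX_TREE_COUNT_SHIFT;
}

static inline void dec_shadow_entry(struct radix_tree_node *node)
{
	assert(node_shadows(node) > 0);
	node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
}

/* Mirrors the quoted hunk: drop one page, optionally account a shadow. */
static inline void node_replace_page(struct radix_tree_node *node, int shadow)
{
	assert(node_pages(node) > 0);
	node->count--;
	if (shadow)
		inc_shadow_entry(node);
}
```

Callers like page_cache_tree_delete() would then only deal with
node_replace_page() and never see the shift.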
--
Kind regards,
Minchan Kim
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .