From: Dave Chinner <david@fromorbit.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Andi Kleen <andi@firstfloor.org>,
Andrea Arcangeli <aarcange@redhat.com>,
Bob Liu <bob.liu@oracle.com>,
Christoph Hellwig <hch@infradead.org>,
Greg Thelen <gthelen@google.com>, Hugh Dickins <hughd@google.com>,
Jan Kara <jack@suse.cz>,
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
Luigi Semenzato <semenzato@google.com>,
Mel Gorman <mgorman@suse.de>, Metin Doslu <metin@citusdata.com>,
Michel Lespinasse <walken@google.com>,
Minchan Kim <minchan.kim@gmail.com>,
Ozgun Erdogan <ozgun@citusdata.com>,
Peter Zijlstra <peterz@infradead.org>,
Rik van Riel <riel@redhat.com>,
Roman Gushchin <klamm@yandex-team.ru>,
Ryan Mallon <rmallon@gmail.com>, Tejun Heo <tj@kernel.org>,
Vlastimil Babka <vbabka@suse.cz>,
linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: Re: [patch 9/9] mm: keep page cache radix tree nodes in check
Date: Tue, 21 Jan 2014 14:03:58 +1100 [thread overview]
Message-ID: <20140121030358.GN18112@dastard> (raw)
In-Reply-To: <20140120231737.GS6963@cmpxchg.org>

On Mon, Jan 20, 2014 at 06:17:37PM -0500, Johannes Weiner wrote:
> On Fri, Jan 17, 2014 at 11:05:17AM +1100, Dave Chinner wrote:
> > On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> > > +	/* Only shadow entries in there, keep track of this node */
> > > +	if (!(node->count & RADIX_TREE_COUNT_MASK) &&
> > > +	    list_empty(&node->private_list)) {
> > > +		node->private_data = mapping;
> > > +		list_lru_add(&workingset_shadow_nodes, &node->private_list);
> > > +	}
> >
> > You can't do this list_empty(&node->private_list) check safely
> > externally to the list_lru code - the only time that entry can be
> > checked safely is under the LRU list locks. This is the reason that
> > list_lru_add/list_lru_del return a boolean to indicate if the object
> > was added/removed from the list - they do this list_empty() check
> > internally. i.e. the correct, safe way to conditionally update
> > state iff the object was added to the LRU is:
> >
> > 	if (!(node->count & RADIX_TREE_COUNT_MASK)) {
> > 		if (list_lru_add(&workingset_shadow_nodes, &node->private_list))
> > 			node->private_data = mapping;
> > 	}
> >
> > > +	radix_tree_replace_slot(slot, page);
> > > +	mapping->nrpages++;
> > > +	if (node) {
> > > +		node->count++;
> > > +		/* Installed page, can't be shadow-only anymore */
> > > +		if (!list_empty(&node->private_list))
> > > +			list_lru_del(&workingset_shadow_nodes,
> > > +				     &node->private_list);
> > > +	}
> >
> > Same issue here:
> >
> > 	if (node) {
> > 		node->count++;
> > 		list_lru_del(&workingset_shadow_nodes, &node->private_list);
> > 	}
>
> All modifications to node->private_list happen under
> mapping->tree_lock, and modifications of a neighboring link should not
> affect the outcome of the list_empty(), so I don't think the lru lock
> is necessary.

Can you please add that as a comment somewhere explaining why it is
safe to do this?
> > > +	case LRU_REMOVED_RETRY:
> > > 		if (--nlru->nr_items == 0)
> > > 			node_clear(nid, lru->active_nodes);
> > > 		WARN_ON_ONCE(nlru->nr_items < 0);
> > > 		isolated++;
> > > +		/*
> > > +		 * If the lru lock has been dropped, our list
> > > +		 * traversal is now invalid and so we have to
> > > +		 * restart from scratch.
> > > +		 */
> > > +		if (ret == LRU_REMOVED_RETRY)
> > > +			goto restart;
> > > 		break;
> > > 	case LRU_ROTATE:
> > > 		list_move_tail(item, &nlru->list);
> >
> > I think that we need to assert that the list lru lock is correctly
> > held here on return with LRU_REMOVED_RETRY. i.e.
> >
> > 	case LRU_REMOVED_RETRY:
> > 		assert_spin_locked(&nlru->lock);
> > 	case LRU_REMOVED:
>
> Ah, good idea. How about adding it to LRU_RETRY as well?

Yup, good idea.
> > > +static struct shrinker workingset_shadow_shrinker = {
> > > +	.count_objects = count_shadow_nodes,
> > > +	.scan_objects = scan_shadow_nodes,
> > > +	.seeks = DEFAULT_SEEKS * 4,
> > > +	.flags = SHRINKER_NUMA_AWARE,
> > > +};
> >
> > Can you add a comment explaining how you calculated the .seeks
> > value? It's important to document the weightings/importance
> > we give to slab reclaim so we can determine if it's actually
> > achieving the desired balance under different loads...
>
> This is not an exact science, to say the least.

I know, that's why I asked that it be documented rather than kept in
your head.
> The shadow entries are mostly self-regulated, so I don't want the
> shrinker to interfere while the machine is just regularly trimming
> caches during normal operation.
>
> It should only kick in when either a) reclaim is picking up and the
> scan-to-reclaim ratio increases due to mapped pages, dirty cache,
> swapping etc. or b) the number of objects compared to LRU pages
> becomes excessive.
>
> I think that is what most shrinkers with an elevated seeks value want,
> but this translates very awkwardly (and not completely) to the current
> cost model, and we should probably rework that interface.
>
> "Seeks" currently encodes 3 ratios:
>
> 1. the cost of creating an object vs. a page
>
> 2. the expected number of objects vs. pages

It doesn't encode that at all. If it did, then the default value
wouldn't be "2".
> 3. the cost of reclaiming an object vs. a page

When you consider #3 in conjunction with #1, the actual intended
meaning of .seeks is "the cost of replacing this object in the cache
compared to the cost of replacing a page cache page."
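That replacement-cost reading can be sketched as a toy model of how a
shrinker's .seeks divides down scan pressure. This is a simplified,
illustrative version of the shrink_slab scan-count arithmetic, not the
kernel's actual code; all numbers are made up:

```python
def objects_to_scan(pages_scanned, lru_pages, freeable, seeks):
    # Higher .seeks means the cache is costlier to refill, so the
    # shrinker is asked to scan proportionally fewer of its objects.
    delta = (4 * pages_scanned) // seeks
    return delta * freeable // (lru_pages + 1)

# DEFAULT_SEEKS is 2; the shadow shrinker uses DEFAULT_SEEKS * 4 = 8,
# i.e. it asks for a quarter of the default pressure.
default_scan = objects_to_scan(1024, 1 << 20, 10**6, seeks=2)
shadow_scan = objects_to_scan(1024, 1 << 20, 10**6, seeks=8)
assert default_scan // shadow_scan == 4
```

The point of the model is only the inverse proportionality: quadrupling
.seeks quarters the number of objects the shrinker is asked to scan.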
> but they are not necessarily correlated. How I would like to
> configure the shadow shrinker instead is:
>
> o scan objects when reclaim efficiency is down to 75%, because they
> are more valuable than use-once cache but less than workingset
>
> o scan objects when the ratio between them and the number of pages
> exceeds 1/32 (one shadow entry for each resident page, up to 64
> entries per shrinkable object, assume 50% packing for robustness)
>
> o as the expected balance between objects and lru pages is 1:32,
> reclaim one object for every 32 reclaimed LRU pages, instead of
> assuming that number of scanned pages corresponds meaningfully to
> number of objects to scan.

You're assuming that every radix tree node has a full population of
pages. This only occurs on sequential read and write workloads, and
so isn't going to be true for things like mapped executables or any
semi-randomly accessed data set...
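The packing assumption behind the 1/32 figure, and the objection to it,
reduce to simple arithmetic. A sketch, assuming 64 slots per radix tree
node (as on 64-bit); the sparse-node numbers are purely illustrative:

```python
SLOTS_PER_NODE = 64  # radix tree node fanout on 64-bit

def node_to_page_ratio(entries_per_node):
    # One shadow entry per reclaimed page, with entries_per_node of
    # them packed into each shrinkable radix tree node.
    return 1 / entries_per_node

# The 1/32 target assumes ~50% packing: 32 entries per node.
assert node_to_page_ratio(SLOTS_PER_NODE // 2) == 1 / 32

# But a sparsely populated node (mapped executable, random access)
# might hold only a handful of entries, so the real node:page ratio
# can be several times higher than the assumed 1/32.
assert node_to_page_ratio(4) == 1 / 4
```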
> "4" just doesn't have the same ring to it.

Right, but you still haven't explained how you came to the value of
"4"....
> It would be great if we could eliminate the reclaim cost assumption by
> turning the nr_to_scan into a nr_to_reclaim, and then set the other
> two ratios independently.

That doesn't work for caches that are full of objects that can't (or
won't) be reclaimed immediately. The CPU cost of repeatedly scanning
to find N reclaimable objects when you have millions of objects in
the cache is prohibitive.
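That scanning-cost objection can be captured in a toy expected-cost
model. This is illustrative only; "reclaimable_fraction" is a made-up
parameter for the sketch, not a kernel variable:

```python
def expected_scans(nr_to_reclaim, reclaimable_fraction):
    # If only this fraction of the cached objects is reclaimable right
    # now, finding nr_to_reclaim of them means scanning roughly
    # 1/fraction times that many objects.
    return round(nr_to_reclaim / reclaimable_fraction)

# 10% reclaimable: reclaiming 1000 objects costs ~10,000 scans.
assert expected_scans(1000, 0.10) == 10_000

# A cache of millions where almost nothing is currently reclaimable:
# the same nr_to_reclaim target costs ~1,000,000 scans of CPU time.
assert expected_scans(1000, 0.001) == 1_000_000
```

A nr_to_scan interface bounds the work done per shrinker call; a
nr_to_reclaim interface would leave that work unbounded as the
reclaimable fraction shrinks.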
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com