From: Johannes Weiner <hannes@cmpxchg.org>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Andi Kleen <andi@firstfloor.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Bob Liu <bob.liu@oracle.com>,
	Christoph Hellwig <hch@infradead.org>,
	Dave Chinner <david@fromorbit.com>,
	Greg Thelen <gthelen@google.com>, Hugh Dickins <hughd@google.com>,
	Jan Kara <jack@suse.cz>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Luigi Semenzato <semenzato@google.com>,
	Mel Gorman <mgorman@suse.de>, Metin Doslu <metin@citusdata.com>,
	Michel Lespinasse <walken@google.com>,
	Minchan Kim <minchan.kim@gmail.com>,
	Ozgun Erdogan <ozgun@citusdata.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Rik van Riel <riel@redhat.com>,
	Roman Gushchin <klamm@yandex-team.ru>,
	Ryan Mallon <rmallon@gmail.com>, Tejun Heo <tj@kernel.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [patch 10/10] mm: keep page cache radix tree nodes in check
Date: Tue, 4 Feb 2014 20:53:52 -0500
Message-ID: <20140205015352.GW6963@cmpxchg.org>
In-Reply-To: <20140204150756.d7f46af4385026ce61c89c55@linux-foundation.org>

On Tue, Feb 04, 2014 at 03:07:56PM -0800, Andrew Morton wrote:
> On Mon,  3 Feb 2014 19:53:42 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > Previously, page cache radix tree nodes were freed after reclaim
> > emptied out their page pointers.  But now reclaim stores shadow
> > entries in their place, which are only reclaimed when the inodes
> > themselves are reclaimed.  This is problematic for bigger files that
> > are still in use after they have a significant amount of their cache
> > reclaimed, without any of those pages actually refaulting.  The shadow
> > entries will just sit there and waste memory.  In the worst case, the
> > shadow entries will accumulate until the machine runs out of memory.
> > 
> > To get this under control, the VM will track radix tree nodes
> > exclusively containing shadow entries on a per-NUMA node list.
> > Per-NUMA rather than global because we expect the radix tree nodes
> > themselves to be allocated node-locally and we want to reduce
> > cross-node references of otherwise independent cache workloads.  A
> > simple shrinker will then reclaim these nodes on memory pressure.

    ^^^^^^^^^^^^^^^
> > A few things need to be stored in the radix tree node to implement the
> > shadow node LRU and allow tree deletions coming from the list:
> > 
> > 1. There is no index available that would describe the reverse path
> >    from the node up to the tree root, which is needed to perform a
> >    deletion.  To solve this, encode in each node its offset inside the
> >    parent.  This can be stored in the unused upper bits of the same
> >    member that stores the node's height at no extra space cost.
> > 
> > 2. The number of shadow entries needs to be counted in addition to the
> >    regular entries, to quickly detect when the node is ready to go to
> >    the shadow node LRU list.  The current entry count is an unsigned
> >    int but the maximum number of entries is 64, so a shadow counter
> >    can easily be stored in the unused upper bits.
> > 
> > 3. Tree modification needs tree lock and tree root, which are located
> >    in the address space, so store an address_space backpointer in the
> >    node.  The parent pointer of the node is in a union with the 2-word
> >    rcu_head, so the backpointer comes at no extra cost as well.
> > 
> > 4. The node needs to be linked to an LRU list, which requires a list
> >    head inside the node.  This does increase the size of the node, but
> >    it does not change the number of objects that fit into a slab page.
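
For illustration, points 1 and 2 of the quoted changelog could be sketched
in user-space C roughly as follows. This is only a sketch under stated
assumptions, not the code from the patch: the struct, the helper names, and
the HEIGHT_SHIFT value are hypothetical, and only the 64-slot node size is
taken from the changelog.

```c
#include <assert.h>

#define MAP_SHIFT     6                          /* 64 slots per node */
#define MAP_SIZE      (1U << MAP_SHIFT)
#define COUNT_SHIFT   (MAP_SHIFT + 1)            /* 7 bits hold 0..64 */
#define COUNT_MASK    ((1U << COUNT_SHIFT) - 1)
#define HEIGHT_SHIFT  4                          /* assumption: height < 16 */
#define HEIGHT_MASK   ((1U << HEIGHT_SHIFT) - 1)

/* Hypothetical stand-in for the two packed members of a tree node. */
struct node_sketch {
	unsigned int path;   /* low bits: height, upper bits: offset in parent */
	unsigned int count;  /* low bits: regular entries, upper bits: shadows */
};

static inline void node_set_path(struct node_sketch *n,
				 unsigned int height, unsigned int offset)
{
	assert(height <= HEIGHT_MASK && offset < MAP_SIZE);
	n->path = (offset << HEIGHT_SHIFT) | height;
}

static inline unsigned int node_height(const struct node_sketch *n)
{
	return n->path & HEIGHT_MASK;
}

static inline unsigned int node_offset(const struct node_sketch *n)
{
	return n->path >> HEIGHT_SHIFT;
}

static inline unsigned int node_entries(const struct node_sketch *n)
{
	return n->count & COUNT_MASK;
}

static inline unsigned int node_shadows(const struct node_sketch *n)
{
	return n->count >> COUNT_SHIFT;
}

static inline void node_entries_inc(struct node_sketch *n) { n->count++; }
static inline void node_entries_dec(struct node_sketch *n) { n->count--; }

static inline void node_shadows_inc(struct node_sketch *n)
{
	n->count += 1U << COUNT_SHIFT;
}

/* A node qualifies for the shadow list once only shadow entries remain. */
static inline int node_is_shadow_only(const struct node_sketch *n)
{
	return node_entries(n) == 0 && node_shadows(n) > 0;
}
```

In this sketch, replacing a regular entry with a shadow entry is a
node_entries_dec() followed by a node_shadows_inc(), after which
node_is_shadow_only() tells the caller whether to put the node on the list.
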
> 
> changelog forgot to mention that this reclaim is performed via a
> shrinker...

Uhm...  see above? :)

> How expensive is that list walk in scan_shadow_nodes()?  I assume in
> the best case it will bale out after nr_to_scan iterations?

Yes, it scans sc->nr_to_scan radix tree nodes, cleans their pointers,
and frees them.
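
A minimal user-space sketch of such a bounded walk, for illustration only:
the list type and the function names below are hypothetical, and the real
scan_shadow_nodes() also has to deal with locking and the tree deletion
itself, which this leaves out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SLOTS 64

/* Hypothetical stand-in for a radix tree node sitting on the shadow list. */
struct shadow_node {
	struct shadow_node *next;
	void *slots[SLOTS];
};

/* Push a zeroed node onto the front of the list; returns NULL on failure. */
static inline struct shadow_node *push_node(struct shadow_node **head)
{
	struct shadow_node *node = calloc(1, sizeof(*node));

	if (!node)
		return NULL;
	node->next = *head;
	*head = node;
	return node;
}

/*
 * Take at most nr_to_scan nodes off the list, clear their slots and
 * free them.  Returns the number of nodes freed, so the cost of one
 * call is bounded by nr_to_scan no matter how long the list is.
 */
static inline unsigned long scan_shadow_list(struct shadow_node **head,
					     unsigned long nr_to_scan)
{
	unsigned long freed = 0;

	while (*head && freed < nr_to_scan) {
		struct shadow_node *node = *head;

		*head = node->next;
		memset(node->slots, 0, sizeof(node->slots));
		free(node);
		freed++;
	}
	return freed;
}
```
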

I ran a worst-case scenario on an 8G machine that creates one 8T
sparse file and faults one page per 64-page radix tree node, i.e. one
node per sparse file fault at CPU speed.  The profile:

     1       9.21%     radixblow  [kernel.kallsyms]   [k] memset
     2       7.23%     radixblow  [kernel.kallsyms]   [k] do_mpage_readpage
     3       4.76%     radixblow  [kernel.kallsyms]   [k] copy_user_generic_string
     4       3.85%     radixblow  [kernel.kallsyms]   [k] __radix_tree_lookup
     5       3.32%       kswapd0  [kernel.kallsyms]   [k] shadow_lru_isolate
     6       2.92%     radixblow  [kernel.kallsyms]   [k] get_page_from_freelist
     7       2.81%       kswapd0  [kernel.kallsyms]   [k] __delete_from_page_cache
     8       2.50%     radixblow  [kernel.kallsyms]   [k] radix_tree_node_ctor
     9       1.79%     radixblow  [kernel.kallsyms]   [k] _raw_spin_lock_irq
    10       1.70%       kswapd0  [kernel.kallsyms]   [k] __mem_cgroup_uncharge_common

Same scenario with 4 pages per 64-page radix tree node:

    13       1.39%       kswapd0  [kernel.kallsyms]   [k] shadow_lru_isolate

16 pages per 64-page node:

    75       0.20%       kswapd0  [kernel.kallsyms]   [k] shadow_lru_isolate

So I doubt this will bother anyone, especially since most use-once
streamers should have a better population density and populate cache
at disk speed, not CPU speed.
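
For anyone who wants to approximate the workload, here is a rough sketch.
It is not the radixblow program used for the numbers above; the function
name, the use of pread(), and the 64-pages-per-node constant are
assumptions made for illustration.

```c
#define _FILE_OFFSET_BITS 64
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGES_PER_NODE 64	/* assumption: 64 page slots per tree node */

/*
 * Create a sparse file of file_size bytes and read one byte from
 * `density` evenly spaced pages in every 64-page window, so each
 * window gets `density` pages of cache.  Returns the number of pages
 * touched, or -1 on error.
 */
static inline long touch_sparse(const char *path, off_t file_size, int density)
{
	long page = sysconf(_SC_PAGESIZE);
	long touched = 0;
	off_t stride;
	char byte;
	int fd;

	if (page <= 0 || density < 1 || density > PAGES_PER_NODE)
		return -1;
	stride = (off_t)page * (PAGES_PER_NODE / density);

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, file_size) < 0) {
		close(fd);
		return -1;
	}
	/* Reading a hole still instantiates a zeroed page cache page. */
	for (off_t off = 0; off < file_size; off += stride) {
		if (pread(fd, &byte, 1, off) != 1) {
			close(fd);
			return -1;
		}
		touched++;
	}
	close(fd);
	return touched;
}
```

A density of 1 corresponds to the first profile, 4 and 16 to the other two.
The file size is the caller's choice; the 8T figure would be passed as
file_size on a filesystem that supports sparse files of that size.
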
