linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Andi Kleen <andi@firstfloor.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Greg Thelen <gthelen@google.com>,
	Christoph Hellwig <hch@infradead.org>,
	Hugh Dickins <hughd@google.com>, Jan Kara <jack@suse.cz>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Mel Gorman <mgorman@suse.de>, Minchan Kim <minchan.kim@gmail.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Rik van Riel <riel@redhat.com>,
	Michel Lespinasse <walken@google.com>,
	Seth Jennings <sjenning@linux.vnet.ibm.com>,
	Roman Gushchin <klamm@yandex-team.ru>,
	Ozgun Erdogan <ozgun@citusdata.com>,
	Metin Doslu <metin@citusdata.com>,
	Vlastimil Babka <vbabka@suse.cz>, Tejun Heo <tj@kernel.org>,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [patch 0/8] mm: thrash detection-based file cache sizing v5
Date: Tue, 15 Oct 2013 12:27:12 +0200	[thread overview]
Message-ID: <20131015102712.GA12428@quack.suse.cz> (raw)
In-Reply-To: <20131015014123.GQ4446@dastard>

On Tue 15-10-13 12:41:23, Dave Chinner wrote:
> Then you have a serious design flaw if you are relying on a shrinker
> to control memory consumed by page cache radix trees as a result of
> page cache reclaim inserting exceptional entries into the radix
> tree and then forgetting about them.
> 
> To work around this, you keep a global count of exceptional entries
> and a global list of inodes with such exceptional radix tree
> entries. The count doesn't really tell you how much memory is used
> by the radix trees - the same count can mean an order of
> magnitude difference in actual memory consumption (one shadow entry
> per radix tree node vs 64) so it's not a very good measure to base
> memory reclaim behaviour on but it is an inferred (rather than
> actual) object count.
> 
> And even if you do free some entries, there is no guarantee that any
> memory will be freed because only empty radix tree nodes will get
> freed, and then memory will only get freed when the entire slab of
> radix tree nodes are freed.
> 
> This reclaim behaviour has potential to cause internal
> fragmentation of the radix tree node slab, which means that we'll
> simply spend time scanning and freeing entries but not free any
> memory.
> 
> You walk the inode list by a shrinker and scan radix trees for
> shadow entries that can be removed. It's expensive to scan radix
> trees, especially for inodes with large amounts of cached data, so
> this could do a lot of work to find very little in way of entries to
> free.
> 
> The shrinker doesn't rotate inodes on the list, so it will always
> scan the same inodes on the list in the same order and so if memory
> reclaim removes a few pages from an inode with a large amount of
> cached pages between each shrinker call, then those radix trees will
> be repeatedly scanned in it's entirety on each call to the shrinker.
> 
> Also, the shrinker only decrements nr_to_scan when it finds an entry
> to reclaim. nr_to_scan is the number of objects to scan for reclaim,
> not the number of objects to reclaim. hence the shrinker will be
> doing a lot of scanning if there's inodes at the head of the list
> with large radix trees....
> 
> Do I need to go on pointing out how unscalable this approach is?
  Just to add some real world experience to what Dave points out - ext4 has
a thing called extent cache. It essentially caches logical->physical
mapping of blocks together with some state flags together with inode. And
currently the cache is maintained similarly as you do it with shadow
entries - we have LRU list of inodes and we have a shrinker to scan extents
in an inode to find extents to free (we cannot reclaim arbitrary cached
extent because some state cannot be simply lost). And it is a pain. We burn
lots of CPU when scanning for extents to free under some loads, sometimes we
get RCU lockups and similar stuff. So we will have to rewrite the code to
use something more clever sooner rather than later. I don't think you
should repeat our mistake ;)

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2013-10-15 10:27 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-10-10 21:46 [patch 0/8] mm: thrash detection-based file cache sizing v5 Johannes Weiner
2013-10-10 21:46 ` [patch 1/8] fs: cachefiles: use add_to_page_cache_lru() Johannes Weiner
2013-10-10 21:46 ` [patch 2/8] lib: radix-tree: radix_tree_delete_item() Johannes Weiner
2013-10-10 21:46 ` [patch 3/8] mm: shmem: save one radix tree lookup when truncating swapped pages Johannes Weiner
2013-10-10 21:46 ` [patch 4/8] mm: filemap: move radix tree hole searching here Johannes Weiner
2013-10-10 21:46 ` [patch 5/8] mm + fs: prepare for non-page entries in page cache radix trees Johannes Weiner
2013-10-10 21:47 ` [patch 6/8] mm + fs: store shadow entries in page cache Johannes Weiner
2013-10-10 21:47 ` [patch 7/8] mm: thrash detection-based file cache sizing Johannes Weiner
2013-10-10 21:47 ` [patch 8/8] mm: workingset: keep shadow entries in check Johannes Weiner
2013-10-11  0:39 ` [patch 0/8] mm: thrash detection-based file cache sizing v5 Dave Chinner
2013-10-14 21:42   ` Johannes Weiner
2013-10-15  1:41     ` Dave Chinner
2013-10-15 10:27       ` Jan Kara [this message]
2013-10-15 17:41       ` Johannes Weiner
2013-10-15 23:41         ` Dave Chinner
2013-10-16  2:05           ` Rik van Riel
2013-10-16  2:26             ` Dave Chinner
2013-10-16 22:31               ` Johannes Weiner
2013-10-21 12:10                 ` Dave Chinner
2013-10-21  9:26 ` Vlastimil Babka
2013-10-22  9:35   ` Johannes Weiner
2013-11-14 15:56   ` Rik van Riel
2013-11-12 10:30 ` Bob Liu
2013-11-14 15:53   ` Rik van Riel

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20131015102712.GA12428@quack.suse.cz \
    --to=jack@suse.cz \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=andi@firstfloor.org \
    --cc=david@fromorbit.com \
    --cc=gthelen@google.com \
    --cc=hannes@cmpxchg.org \
    --cc=hch@infradead.org \
    --cc=hughd@google.com \
    --cc=klamm@yandex-team.ru \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=metin@citusdata.com \
    --cc=mgorman@suse.de \
    --cc=minchan.kim@gmail.com \
    --cc=ozgun@citusdata.com \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    --cc=sjenning@linux.vnet.ibm.com \
    --cc=tj@kernel.org \
    --cc=vbabka@suse.cz \
    --cc=walken@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).