[ 14/40] vmscan: reduce wind up shrinker->nr when shrinker cant do work

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Greg KH <gregkh@linuxfoundation.org>,
	torvalds@linux-foundation.org, akpm@linux-foundation.org,
	alan@lxorguk.ukuu.org.uk, Dave Chinner <dchinner@redhat.com>,
	Al Viro <viro@zeniv.linux.org.uk>, Mel Gorman <mgorman@suse.de>
Subject: [ 14/40] vmscan: reduce wind up shrinker->nr when shrinker cant do work
Date: Thu, 26 Jul 2012 14:29:32 -0700	[thread overview]
Message-ID: <20120726211412.376866987@linuxfoundation.org> (raw)
In-Reply-To: <20120726211411.164006056@linuxfoundation.org>

From: Greg KH <gregkh@linuxfoundation.org>

3.0-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Dave Chinner <dchinner@redhat.com>

commit 3567b59aa80ac4417002bf58e35dce5c777d4164 upstream.

Stable note: Not tracked in Bugzilla. This patch reduces excessive
	reclaim of slab objects reducing the amount of information that
	has to be brought back in from disk. The third and fourth paragram
	in the series describes the impact.

When a shrinker returns -1 to shrink_slab() to indicate it cannot do
any work given the current memory reclaim requirements, it adds the
entire total_scan count to shrinker->nr. The idea ehind this is that
whenteh shrinker is next called and can do work, it will do the work
of the previously aborted shrinker call as well.

However, if a filesystem is doing lots of allocation with GFP_NOFS
set, then we get many, many more aborts from the shrinkers than we
do successful calls. The result is that shrinker->nr winds up to
it's maximum permissible value (twice the current cache size) and
then when the next shrinker call that can do work is issued, it
has enough scan count built up to free the entire cache twice over.

This manifests itself in the cache going from full to empty in a
matter of seconds, even when only a small part of the cache is
needed to be emptied to free sufficient memory.

Under metadata intensive workloads on ext4 and XFS, I'm seeing the
VFS caches increase memory consumption up to 75% of memory (no page
cache pressure) over a period of 30-60s, and then the shrinker
empties them down to zero in the space of 2-3s. This cycle repeats
over and over again, with the shrinker completely trashing the inode
and dentry caches every minute or so the workload continues.

This behaviour was made obvious by the shrink_slab tracepoints added
earlier in the series, and made worse by the patch that corrected
the concurrent accounting of shrinker->nr.

To avoid this problem, stop repeated small increments of the total
scan value from winding shrinker->nr up to a value that can cause
the entire cache to be freed. We still need to allow it to wind up,
so use the delta as the "large scan" threshold check - if the delta
is more than a quarter of the entire cache size, then it is a large
scan and allowed to cause lots of windup because we are clearly
needing to free lots of memory.

If it isn't a large scan then limit the total scan to half the size
of the cache so that windup never increases to consume the whole
cache. Reducing the total scan limit further does not allow enough
wind-up to maintain the current levels of performance, whilst a
higher threshold does not prevent the windup from freeing the entire
cache under sustained workloads.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 mm/vmscan.c |   15 +++++++++++++++
 1 file changed, 15 insertions(+)

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -277,6 +277,21 @@ unsigned long shrink_slab(struct shrink_
 		}

 		/*
+		 * We need to avoid excessive windup on filesystem shrinkers
+		 * due to large numbers of GFP_NOFS allocations causing the
+		 * shrinkers to return -1 all the time. This results in a large
+		 * nr being built up so when a shrink that can do some work
+		 * comes along it empties the entire cache due to nr >>>
+		 * max_pass.  This is bad for sustaining a working set in
+		 * memory.
+		 *
+		 * Hence only allow the shrinker to scan the entire cache when
+		 * a large delta change is calculated directly.
+		 */
+		if (delta < max_pass / 4)
+			total_scan = min(total_scan, max_pass / 2);
+
+		/*
 		 * Avoid risking looping forever due to too large nr value:
 		 * never try to free more than twice the estimate number of
 		 * freeable entries.

next prev parent reply	other threads:[~2012-07-26 21:38 UTC|newest]

Thread overview: 54+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-07-26 21:14 [ 00/40] 3.0.39-stable review Greg KH
2012-07-26 21:29 ` [ 01/40] cifs: always update the inode cache with the results from a FIND_* Greg Kroah-Hartman
2012-07-26 21:29   ` [ 02/40] ntp: Fix STA_INS/DEL clearing bug Greg Kroah-Hartman
2012-07-26 21:29   ` [ 03/40] mm: fix lost kswapd wakeup in kswapd_stop() Greg Kroah-Hartman
2012-07-26 21:29   ` [ 04/40] MIPS: Properly align the .data..init_task section Greg Kroah-Hartman
2012-07-26 21:29   ` [ 05/40] UBIFS: fix a bug in empty space fix-up Greg Kroah-Hartman
2012-07-26 21:29   ` [ 06/40] dm raid1: fix crash with mirror recovery and discard Greg Kroah-Hartman
2012-07-26 21:29   ` [ 07/40] mm/vmstat.c: cache align vm_stat Greg Kroah-Hartman
2012-07-26 21:29   ` [ 08/40] mm: memory hotplug: Check if pages are correctly reserved on a per-section basis Greg Kroah-Hartman
2012-07-26 21:29   ` [ 09/40] mm: reduce the amount of work done when updating min_free_kbytes Greg Kroah-Hartman
2012-07-26 21:29   ` [ 10/40] mm: vmscan: fix force-scanning small targets without swap Greg Kroah-Hartman
2012-07-26 21:29   ` [ 11/40] vmscan: clear ZONE_CONGESTED for zone with good watermark Greg Kroah-Hartman
2012-07-26 21:29   ` [ 12/40] vmscan: add shrink_slab tracepoints Greg Kroah-Hartman
2012-07-26 21:29   ` [ 13/40] vmscan: shrinker->nr updates race and go wrong Greg Kroah-Hartman
2012-07-29 20:29     ` Ben Hutchings
2012-07-30  9:06       ` Mel Gorman
2012-07-30 15:41         ` Greg Kroah-Hartman
2012-07-26 21:29   ` Greg Kroah-Hartman [this message]
2012-07-26 21:29   ` [ 15/40] vmscan: limit direct reclaim for higher order allocations Greg Kroah-Hartman
2012-07-26 21:29   ` [ 16/40] vmscan: abort reclaim/compaction if compaction can proceed Greg Kroah-Hartman
2012-07-26 21:29   ` [ 17/40] mm: compaction: trivial clean up in acct_isolated() Greg Kroah-Hartman
2012-07-26 21:29   ` [ 18/40] mm: change isolate mode from #define to bitwise type Greg Kroah-Hartman
2012-07-26 21:29   ` [ 19/40] mm: compaction: make isolate_lru_page() filter-aware Greg Kroah-Hartman
2012-07-26 21:29   ` [ 20/40] mm: zone_reclaim: " Greg Kroah-Hartman
2012-07-26 21:29   ` [ 21/40] mm: migration: clean up unmap_and_move() Greg Kroah-Hartman
2012-07-26 21:29   ` [ 22/40] mm: compaction: allow compaction to isolate dirty pages Greg Kroah-Hartman
2012-07-26 21:29   ` [ 23/40] mm: compaction: determine if dirty pages can be migrated without blocking within ->migratepage Greg Kroah-Hartman
2012-07-26 21:29   ` [ 24/40] mm: page allocator: do not call direct reclaim for THP allocations while compaction is deferred Greg Kroah-Hartman
2012-07-26 21:29   ` [ 25/40] mm: compaction: make isolate_lru_page() filter-aware again Greg Kroah-Hartman
2012-07-26 21:29   ` [ 26/40] kswapd: avoid unnecessary rebalance after an unsuccessful balancing Greg Kroah-Hartman
2012-07-26 21:29   ` [ 27/40] kswapd: assign new_order and new_classzone_idx after wakeup in sleeping Greg Kroah-Hartman
2012-07-26 21:29   ` [ 28/40] mm: compaction: introduce sync-light migration for use by compaction Greg Kroah-Hartman
2012-07-26 21:29   ` [ 29/40] mm: vmscan: when reclaiming for compaction, ensure there are sufficient free pages available Greg Kroah-Hartman
2012-07-26 21:29   ` [ 30/40] mm: vmscan: do not OOM if aborting reclaim to start compaction Greg Kroah-Hartman
2012-07-26 21:29   ` [ 31/40] mm: vmscan: check if reclaim should really abort even if compaction_ready() is true for one zone Greg Kroah-Hartman
2012-07-26 21:29   ` [ 32/40] vmscan: promote shared file mapped pages Greg Kroah-Hartman
2012-07-26 21:29   ` [ 33/40] vmscan: activate executable pages after first usage Greg Kroah-Hartman
2012-07-26 21:29   ` [ 34/40] mm/vmscan.c: consider swap space when deciding whether to continue reclaim Greg Kroah-Hartman
2012-07-26 21:29   ` [ 35/40] mm: test PageSwapBacked in lumpy reclaim Greg Kroah-Hartman
2012-07-26 21:29   ` [ 36/40] mm: vmscan: convert global reclaim to per-memcg LRU lists Greg Kroah-Hartman
2012-07-30  0:25     ` Ben Hutchings
2012-07-30 15:29       ` Greg Kroah-Hartman
2012-07-26 21:29   ` [ 37/40] cpusets: avoid looping when storing to mems_allowed if one node remains set Greg Kroah-Hartman
2012-07-26 21:29   ` [ 38/40] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask Greg Kroah-Hartman
2012-07-26 21:29   ` [ 39/40] cpuset: mm: reduce large amounts of memory barrier related damage v3 Greg Kroah-Hartman
2012-07-27 15:08     ` Herton Ronaldo Krzesinski
2012-07-27 15:23       ` Mel Gorman
2012-07-27 19:01         ` Greg Kroah-Hartman
2012-07-28  5:02           ` Herton Ronaldo Krzesinski
2012-07-28 10:26             ` Mel Gorman
2012-07-30 15:39               ` Greg Kroah-Hartman
2012-07-30 15:37             ` Greg Kroah-Hartman
2012-07-30 15:38               ` Greg Kroah-Hartman
2012-07-26 21:29   ` [ 40/40] mm/hugetlb: fix warning in alloc_huge_page/dequeue_huge_page_vma Greg Kroah-Hartman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120726211412.376866987@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=akpm@linux-foundation.org \
    --cc=alan@lxorguk.ukuu.org.uk \
    --cc=dchinner@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=stable@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).