Re: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Larry Woodman <lwoodman@redhat.com>
To: Rik van Riel <riel@redhat.com>
Cc: akpm@linux-foundation.org,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	minchan.kim@gmail.com
Subject: Re: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone
Date: Mon, 14 Dec 2009 12:08:01 -0500	[thread overview]
Message-ID: <1260810481.6666.13.camel@dhcp-100-19-198.bos.redhat.com> (raw)
In-Reply-To: <20091211164651.036f5340@annuminas.surriel.com>

On Fri, 2009-12-11 at 16:46 -0500, Rik van Riel wrote:

Rik, the latest patch appears to have a problem although I dont know
what the problem is yet.  When the system ran out of memory we see
thousands of runnable processes and 100% system time:


 9420  2  29824  79856  62676  19564    0    0     0     0 8054  379  0 
100  0  0  0
9420  2  29824  79368  62292  19564    0    0     0     0 8691  413  0 
100  0  0  0
9421  1  29824  79780  61780  19820    0    0     0     0 8928  408  0 
100  0  0  0

The system would not respond so I dont know whats going on yet.  I'll
add debug code to figure out why its in that state as soon as I get
access to the hardware.

Larry


> Under very heavy multi-process workloads, like AIM7, the VM can
> get into trouble in a variety of ways.  The trouble start when
> there are hundreds, or even thousands of processes active in the
> page reclaim code.
> 
> Not only can the system suffer enormous slowdowns because of
> lock contention (and conditional reschedules) between thousands
> of processes in the page reclaim code, but each process will try
> to free up to SWAP_CLUSTER_MAX pages, even when the system already
> has lots of memory free.
> 
> It should be possible to avoid both of those issues at once, by
> simply limiting how many processes are active in the page reclaim
> code simultaneously.
> 
> If too many processes are active doing page reclaim in one zone,
> simply go to sleep in shrink_zone().
> 
> On wakeup, check whether enough memory has been freed already
> before jumping into the page reclaim code ourselves.  We want
> to use the same threshold here that is used in the page allocator
> for deciding whether or not to call the page reclaim code in the
> first place, otherwise some unlucky processes could end up freeing
> memory for the rest of the system.
> 
> Reported-by: Larry Woodman <lwoodman@redhat.com>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> 
> --- 
> v2:
> - fix typos in sysctl.c and vm.txt
> - move the code in sysctl.c out from under the ifdef
> - only __GFP_FS|__GFP_IO tasks can wait
> 
>  Documentation/sysctl/vm.txt |   18 ++++++++++++++
>  include/linux/mmzone.h      |    4 +++
>  include/linux/swap.h        |    1 +
>  kernel/sysctl.c             |    7 +++++
>  mm/page_alloc.c             |    3 ++
>  mm/vmscan.c                 |   40 +++++++++++++++++++++++++++++++++
>  6 files changed, 73 insertions(+), 0 deletions(-)
> 
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index fc5790d..8bd1a96 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -32,6 +32,7 @@ Currently, these files are in /proc/sys/vm:
>  - legacy_va_layout
>  - lowmem_reserve_ratio
>  - max_map_count
> +- max_zone_concurrent_reclaimers
>  - memory_failure_early_kill
>  - memory_failure_recovery
>  - min_free_kbytes
> @@ -278,6 +279,23 @@ The default value is 65536.
>  
>  =============================================================
>  
> +max_zone_concurrent_reclaimers:
> +
> +The number of processes that are allowed to simultaneously reclaim
> +memory from a particular memory zone.
> +
> +With certain workloads, hundreds of processes end up in the page
> +reclaim code simultaneously.  This can cause large slowdowns due
> +to lock contention, freeing of way too much memory and occasionally
> +false OOM kills.
> +
> +To avoid these problems, only allow a smaller number of processes
> +to reclaim pages from each memory zone simultaneously.
> +
> +The default value is 8.
> +
> +=============================================================
> +
>  memory_failure_early_kill:
>  
>  Control how to kill processes when uncorrected memory error (typically
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 30fe668..ed614b8 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -345,6 +345,10 @@ struct zone {
>  	/* Zone statistics */
>  	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
>  
> +	/* Number of processes running page reclaim code on this zone. */
> +	atomic_t		concurrent_reclaimers;
> +	wait_queue_head_t	reclaim_wait;
> +
>  	/*
>  	 * prev_priority holds the scanning priority for this zone.  It is
>  	 * defined as the scanning priority at which we achieved our reclaim
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a2602a8..661eec7 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -254,6 +254,7 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages);
>  extern int vm_swappiness;
>  extern int remove_mapping(struct address_space *mapping, struct page *page);
>  extern long vm_total_pages;
> +extern int max_zone_concurrent_reclaimers;
>  
>  #ifdef CONFIG_NUMA
>  extern int zone_reclaim_mode;
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 6ff0ae6..4ec17ed 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1271,6 +1271,13 @@ static struct ctl_table vm_table[] = {
>  		.extra2		= &one,
>  	},
>  #endif
> +	{
> +		.procname	= "max_zone_concurrent_reclaimers",
> +		.data		= &max_zone_concurrent_reclaimers,
> +		.maxlen		= sizeof(max_zone_concurrent_reclaimers),
> +		.mode		= 0644,
> +		.proc_handler	= &proc_dointvec,
> +	},
>  
>  /*
>   * NOTE: do not add new entries to this table unless you have read
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 11ae66e..ca9cae1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3852,6 +3852,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>  
>  		zone->prev_priority = DEF_PRIORITY;
>  
> +		atomic_set(&zone->concurrent_reclaimers, 0);
> +		init_waitqueue_head(&zone->reclaim_wait);
> +
>  		zone_pcp_init(zone);
>  		for_each_lru(l) {
>  			INIT_LIST_HEAD(&zone->lru[l].list);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2bbee91..ecfe28c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -40,6 +40,7 @@
>  #include <linux/memcontrol.h>
>  #include <linux/delayacct.h>
>  #include <linux/sysctl.h>
> +#include <linux/wait.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -129,6 +130,17 @@ struct scan_control {
>  int vm_swappiness = 60;
>  long vm_total_pages;	/* The total number of pages which the VM controls */
>  
> +/*
> + * Maximum number of processes concurrently running the page
> + * reclaim code in a memory zone.  Having too many processes
> + * just results in them burning CPU time waiting for locks,
> + * so we're better off limiting page reclaim to a sane number
> + * of processes at a time.  We do this per zone so local node
> + * reclaim on one NUMA node will not block other nodes from
> + * making progress.
> + */
> +int max_zone_concurrent_reclaimers = 8;
> +
>  static LIST_HEAD(shrinker_list);
>  static DECLARE_RWSEM(shrinker_rwsem);
>  
> @@ -1600,6 +1612,31 @@ static void shrink_zone(int priority, struct zone *zone,
>  	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
>  	int noswap = 0;
>  
> +	if (!current_is_kswapd() && atomic_read(&zone->concurrent_reclaimers) >
> +				max_zone_concurrent_reclaimers &&
> +				(sc->gfp_mask & (__GFP_IO|__GFP_FS)) ==
> +				(__GFP_IO|__GFP_FS)) {
> +		/*
> +		 * Do not add to the lock contention if this zone has
> +		 * enough processes doing page reclaim already, since
> +		 * we would just make things slower.
> +		 */
> +		sleep_on(&zone->reclaim_wait);
> +
> +		/*
> +		 * If other processes freed enough memory while we waited,
> +		 * break out of the loop and go back to the allocator.
> +		 */
> +		if (zone_watermark_ok(zone, sc->order, low_wmark_pages(zone),
> +					0, 0)) {
> +			wake_up(&zone->reclaim_wait);
> +			sc->nr_reclaimed += nr_to_reclaim;
> +			return;
> +		}
> +	}
> +
> +	atomic_inc(&zone->concurrent_reclaimers);
> +
>  	/* If we have no swap space, do not bother scanning anon pages. */
>  	if (!sc->may_swap || (nr_swap_pages <= 0)) {
>  		noswap = 1;
> @@ -1655,6 +1692,9 @@ static void shrink_zone(int priority, struct zone *zone,
>  		shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
>  
>  	throttle_vm_writeout(sc->gfp_mask);
> +
> +	atomic_dec(&zone->concurrent_reclaimers);
> +	wake_up(&zone->reclaim_wait);
>  }
>  
>  /*
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2009-12-14 17:05 UTC|newest]

Thread overview: 66+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-12-11 21:46 [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone Rik van Riel
2009-12-14  0:14 ` Minchan Kim
2009-12-14  4:09   ` Rik van Riel
2009-12-14  4:19     ` Minchan Kim
2009-12-14  4:29       ` Rik van Riel
2009-12-14  5:00         ` Minchan Kim
2009-12-14 12:22 ` KOSAKI Motohiro
2009-12-14 12:23   ` [cleanup][PATCH 1/8] vmscan: Make shrink_zone_begin/end helper function KOSAKI Motohiro
2009-12-14 14:34     ` Rik van Riel
2009-12-14 22:39     ` Minchan Kim
2009-12-14 12:24   ` [PATCH 2/8] Mark sleep_on as deprecated KOSAKI Motohiro
2009-12-14 13:03     ` Christoph Hellwig
2009-12-14 16:04       ` Arjan van de Ven
2009-12-14 14:34     ` Rik van Riel
2009-12-14 22:44     ` Minchan Kim
2009-12-14 12:29   ` [PATCH 3/8] Don't use sleep_on() KOSAKI Motohiro
2009-12-14 14:35     ` Rik van Riel
2009-12-14 22:46     ` Minchan Kim
2009-12-14 12:30   ` [PATCH 4/8] Use prepare_to_wait_exclusive() instead prepare_to_wait() KOSAKI Motohiro
2009-12-14 14:33     ` Rik van Riel
2009-12-15  0:45       ` KOSAKI Motohiro
2009-12-15  5:32         ` Mike Galbraith
2009-12-15  8:28           ` Mike Galbraith
2009-12-15 14:36             ` Mike Galbraith
2009-12-15 14:58           ` Rik van Riel
2009-12-15 18:17             ` Mike Galbraith
2009-12-15 18:43             ` Mike Galbraith
2009-12-15 19:33               ` Rik van Riel
2009-12-16  0:48             ` KOSAKI Motohiro
2009-12-16  2:44               ` Rik van Riel
2009-12-16  5:43               ` Mike Galbraith
2009-12-14 23:03     ` Minchan Kim
2009-12-14 12:30   ` [PATCH 5/8] Use io_schedule() instead schedule() KOSAKI Motohiro
2009-12-14 14:37     ` Rik van Riel
2009-12-14 23:46     ` Minchan Kim
2009-12-15  0:56       ` KOSAKI Motohiro
2009-12-15  1:13         ` Minchan Kim
2009-12-14 12:31   ` [PATCH 6/8] Stop reclaim quickly when the task reclaimed enough lots pages KOSAKI Motohiro
2009-12-14 14:45     ` Rik van Riel
2009-12-14 23:51       ` KOSAKI Motohiro
2009-12-15  0:11     ` Minchan Kim
2009-12-15  0:35       ` KOSAKI Motohiro
2009-12-14 12:32   ` [PATCH 7/8] Use TASK_KILLABLE instead TASK_UNINTERRUPTIBLE KOSAKI Motohiro
2009-12-14 14:47     ` Rik van Riel
2009-12-14 23:52     ` Minchan Kim
2009-12-14 12:32   ` [PATCH 8/8] mm: Give up allocation if the task have fatal signal KOSAKI Motohiro
2009-12-14 14:48     ` Rik van Riel
2009-12-14 23:54     ` Minchan Kim
2009-12-15  0:50       ` KOSAKI Motohiro
2009-12-15  1:03         ` Minchan Kim
2009-12-15  1:16           ` KOSAKI Motohiro
2009-12-14 12:40   ` [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone KOSAKI Motohiro
2009-12-14 17:08 ` Larry Woodman [this message]
2009-12-15  0:49   ` KOSAKI Motohiro
     [not found]   ` <20091217193818.9FA9.A69D9226@jp.fujitsu.com>
2009-12-17 12:23     ` FWD: " Larry Woodman
2009-12-17 14:43       ` Rik van Riel
2009-12-17 19:55       ` Rik van Riel
2009-12-17 21:05         ` Hugh Dickins
2009-12-17 22:52           ` Rik van Riel
2009-12-18 16:23           ` Andrea Arcangeli
2009-12-18 17:43             ` Rik van Riel
2009-12-18 10:27       ` KOSAKI Motohiro
2009-12-18 14:09         ` Rik van Riel
2009-12-18 13:38 ` Avi Kivity
2009-12-18 14:12   ` Rik van Riel
2009-12-18 14:13     ` Avi Kivity

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1260810481.6666.13.camel@dhcp-100-19-198.bos.redhat.com \
    --to=lwoodman@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=minchan.kim@gmail.com \
    --cc=riel@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).