From: Larry Woodman <lwoodman@redhat.com>
To: Rik van Riel <riel@redhat.com>
Cc: akpm@linux-foundation.org,
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
minchan.kim@gmail.com
Subject: Re: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone
Date: Mon, 14 Dec 2009 12:08:01 -0500 [thread overview]
Message-ID: <1260810481.6666.13.camel@dhcp-100-19-198.bos.redhat.com> (raw)
In-Reply-To: <20091211164651.036f5340@annuminas.surriel.com>
On Fri, 2009-12-11 at 16:46 -0500, Rik van Riel wrote:
Rik, the latest patch appears to have a problem although I dont know
what the problem is yet. When the system ran out of memory we see
thousands of runnable processes and 100% system time:
9420 2 29824 79856 62676 19564 0 0 0 0 8054 379 0
100 0 0 0
9420 2 29824 79368 62292 19564 0 0 0 0 8691 413 0
100 0 0 0
9421 1 29824 79780 61780 19820 0 0 0 0 8928 408 0
100 0 0 0
The system would not respond so I dont know whats going on yet. I'll
add debug code to figure out why its in that state as soon as I get
access to the hardware.
Larry
> Under very heavy multi-process workloads, like AIM7, the VM can
> get into trouble in a variety of ways. The trouble start when
> there are hundreds, or even thousands of processes active in the
> page reclaim code.
>
> Not only can the system suffer enormous slowdowns because of
> lock contention (and conditional reschedules) between thousands
> of processes in the page reclaim code, but each process will try
> to free up to SWAP_CLUSTER_MAX pages, even when the system already
> has lots of memory free.
>
> It should be possible to avoid both of those issues at once, by
> simply limiting how many processes are active in the page reclaim
> code simultaneously.
>
> If too many processes are active doing page reclaim in one zone,
> simply go to sleep in shrink_zone().
>
> On wakeup, check whether enough memory has been freed already
> before jumping into the page reclaim code ourselves. We want
> to use the same threshold here that is used in the page allocator
> for deciding whether or not to call the page reclaim code in the
> first place, otherwise some unlucky processes could end up freeing
> memory for the rest of the system.
>
> Reported-by: Larry Woodman <lwoodman@redhat.com>
> Signed-off-by: Rik van Riel <riel@redhat.com>
>
> ---
> v2:
> - fix typos in sysctl.c and vm.txt
> - move the code in sysctl.c out from under the ifdef
> - only __GFP_FS|__GFP_IO tasks can wait
>
> Documentation/sysctl/vm.txt | 18 ++++++++++++++
> include/linux/mmzone.h | 4 +++
> include/linux/swap.h | 1 +
> kernel/sysctl.c | 7 +++++
> mm/page_alloc.c | 3 ++
> mm/vmscan.c | 40 +++++++++++++++++++++++++++++++++
> 6 files changed, 73 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index fc5790d..8bd1a96 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -32,6 +32,7 @@ Currently, these files are in /proc/sys/vm:
> - legacy_va_layout
> - lowmem_reserve_ratio
> - max_map_count
> +- max_zone_concurrent_reclaimers
> - memory_failure_early_kill
> - memory_failure_recovery
> - min_free_kbytes
> @@ -278,6 +279,23 @@ The default value is 65536.
>
> =============================================================
>
> +max_zone_concurrent_reclaimers:
> +
> +The number of processes that are allowed to simultaneously reclaim
> +memory from a particular memory zone.
> +
> +With certain workloads, hundreds of processes end up in the page
> +reclaim code simultaneously. This can cause large slowdowns due
> +to lock contention, freeing of way too much memory and occasionally
> +false OOM kills.
> +
> +To avoid these problems, only allow a smaller number of processes
> +to reclaim pages from each memory zone simultaneously.
> +
> +The default value is 8.
> +
> +=============================================================
> +
> memory_failure_early_kill:
>
> Control how to kill processes when uncorrected memory error (typically
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 30fe668..ed614b8 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -345,6 +345,10 @@ struct zone {
> /* Zone statistics */
> atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
>
> + /* Number of processes running page reclaim code on this zone. */
> + atomic_t concurrent_reclaimers;
> + wait_queue_head_t reclaim_wait;
> +
> /*
> * prev_priority holds the scanning priority for this zone. It is
> * defined as the scanning priority at which we achieved our reclaim
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a2602a8..661eec7 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -254,6 +254,7 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages);
> extern int vm_swappiness;
> extern int remove_mapping(struct address_space *mapping, struct page *page);
> extern long vm_total_pages;
> +extern int max_zone_concurrent_reclaimers;
>
> #ifdef CONFIG_NUMA
> extern int zone_reclaim_mode;
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 6ff0ae6..4ec17ed 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1271,6 +1271,13 @@ static struct ctl_table vm_table[] = {
> .extra2 = &one,
> },
> #endif
> + {
> + .procname = "max_zone_concurrent_reclaimers",
> + .data = &max_zone_concurrent_reclaimers,
> + .maxlen = sizeof(max_zone_concurrent_reclaimers),
> + .mode = 0644,
> + .proc_handler = &proc_dointvec,
> + },
>
> /*
> * NOTE: do not add new entries to this table unless you have read
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 11ae66e..ca9cae1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3852,6 +3852,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>
> zone->prev_priority = DEF_PRIORITY;
>
> + atomic_set(&zone->concurrent_reclaimers, 0);
> + init_waitqueue_head(&zone->reclaim_wait);
> +
> zone_pcp_init(zone);
> for_each_lru(l) {
> INIT_LIST_HEAD(&zone->lru[l].list);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2bbee91..ecfe28c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -40,6 +40,7 @@
> #include <linux/memcontrol.h>
> #include <linux/delayacct.h>
> #include <linux/sysctl.h>
> +#include <linux/wait.h>
>
> #include <asm/tlbflush.h>
> #include <asm/div64.h>
> @@ -129,6 +130,17 @@ struct scan_control {
> int vm_swappiness = 60;
> long vm_total_pages; /* The total number of pages which the VM controls */
>
> +/*
> + * Maximum number of processes concurrently running the page
> + * reclaim code in a memory zone. Having too many processes
> + * just results in them burning CPU time waiting for locks,
> + * so we're better off limiting page reclaim to a sane number
> + * of processes at a time. We do this per zone so local node
> + * reclaim on one NUMA node will not block other nodes from
> + * making progress.
> + */
> +int max_zone_concurrent_reclaimers = 8;
> +
> static LIST_HEAD(shrinker_list);
> static DECLARE_RWSEM(shrinker_rwsem);
>
> @@ -1600,6 +1612,31 @@ static void shrink_zone(int priority, struct zone *zone,
> struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> int noswap = 0;
>
> + if (!current_is_kswapd() && atomic_read(&zone->concurrent_reclaimers) >
> + max_zone_concurrent_reclaimers &&
> + (sc->gfp_mask & (__GFP_IO|__GFP_FS)) ==
> + (__GFP_IO|__GFP_FS)) {
> + /*
> + * Do not add to the lock contention if this zone has
> + * enough processes doing page reclaim already, since
> + * we would just make things slower.
> + */
> + sleep_on(&zone->reclaim_wait);
> +
> + /*
> + * If other processes freed enough memory while we waited,
> + * break out of the loop and go back to the allocator.
> + */
> + if (zone_watermark_ok(zone, sc->order, low_wmark_pages(zone),
> + 0, 0)) {
> + wake_up(&zone->reclaim_wait);
> + sc->nr_reclaimed += nr_to_reclaim;
> + return;
> + }
> + }
> +
> + atomic_inc(&zone->concurrent_reclaimers);
> +
> /* If we have no swap space, do not bother scanning anon pages. */
> if (!sc->may_swap || (nr_swap_pages <= 0)) {
> noswap = 1;
> @@ -1655,6 +1692,9 @@ static void shrink_zone(int priority, struct zone *zone,
> shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
>
> throttle_vm_writeout(sc->gfp_mask);
> +
> + atomic_dec(&zone->concurrent_reclaimers);
> + wake_up(&zone->reclaim_wait);
> }
>
> /*
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2009-12-14 17:05 UTC|newest]
Thread overview: 66+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-12-11 21:46 [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone Rik van Riel
2009-12-14 0:14 ` Minchan Kim
2009-12-14 4:09 ` Rik van Riel
2009-12-14 4:19 ` Minchan Kim
2009-12-14 4:29 ` Rik van Riel
2009-12-14 5:00 ` Minchan Kim
2009-12-14 12:22 ` KOSAKI Motohiro
2009-12-14 12:23 ` [cleanup][PATCH 1/8] vmscan: Make shrink_zone_begin/end helper function KOSAKI Motohiro
2009-12-14 14:34 ` Rik van Riel
2009-12-14 22:39 ` Minchan Kim
2009-12-14 12:24 ` [PATCH 2/8] Mark sleep_on as deprecated KOSAKI Motohiro
2009-12-14 13:03 ` Christoph Hellwig
2009-12-14 16:04 ` Arjan van de Ven
2009-12-14 14:34 ` Rik van Riel
2009-12-14 22:44 ` Minchan Kim
2009-12-14 12:29 ` [PATCH 3/8] Don't use sleep_on() KOSAKI Motohiro
2009-12-14 14:35 ` Rik van Riel
2009-12-14 22:46 ` Minchan Kim
2009-12-14 12:30 ` [PATCH 4/8] Use prepare_to_wait_exclusive() instead prepare_to_wait() KOSAKI Motohiro
2009-12-14 14:33 ` Rik van Riel
2009-12-15 0:45 ` KOSAKI Motohiro
2009-12-15 5:32 ` Mike Galbraith
2009-12-15 8:28 ` Mike Galbraith
2009-12-15 14:36 ` Mike Galbraith
2009-12-15 14:58 ` Rik van Riel
2009-12-15 18:17 ` Mike Galbraith
2009-12-15 18:43 ` Mike Galbraith
2009-12-15 19:33 ` Rik van Riel
2009-12-16 0:48 ` KOSAKI Motohiro
2009-12-16 2:44 ` Rik van Riel
2009-12-16 5:43 ` Mike Galbraith
2009-12-14 23:03 ` Minchan Kim
2009-12-14 12:30 ` [PATCH 5/8] Use io_schedule() instead schedule() KOSAKI Motohiro
2009-12-14 14:37 ` Rik van Riel
2009-12-14 23:46 ` Minchan Kim
2009-12-15 0:56 ` KOSAKI Motohiro
2009-12-15 1:13 ` Minchan Kim
2009-12-14 12:31 ` [PATCH 6/8] Stop reclaim quickly when the task reclaimed enough lots pages KOSAKI Motohiro
2009-12-14 14:45 ` Rik van Riel
2009-12-14 23:51 ` KOSAKI Motohiro
2009-12-15 0:11 ` Minchan Kim
2009-12-15 0:35 ` KOSAKI Motohiro
2009-12-14 12:32 ` [PATCH 7/8] Use TASK_KILLABLE instead TASK_UNINTERRUPTIBLE KOSAKI Motohiro
2009-12-14 14:47 ` Rik van Riel
2009-12-14 23:52 ` Minchan Kim
2009-12-14 12:32 ` [PATCH 8/8] mm: Give up allocation if the task have fatal signal KOSAKI Motohiro
2009-12-14 14:48 ` Rik van Riel
2009-12-14 23:54 ` Minchan Kim
2009-12-15 0:50 ` KOSAKI Motohiro
2009-12-15 1:03 ` Minchan Kim
2009-12-15 1:16 ` KOSAKI Motohiro
2009-12-14 12:40 ` [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone KOSAKI Motohiro
2009-12-14 17:08 ` Larry Woodman [this message]
2009-12-15 0:49 ` KOSAKI Motohiro
[not found] ` <20091217193818.9FA9.A69D9226@jp.fujitsu.com>
2009-12-17 12:23 ` FWD: " Larry Woodman
2009-12-17 14:43 ` Rik van Riel
2009-12-17 19:55 ` Rik van Riel
2009-12-17 21:05 ` Hugh Dickins
2009-12-17 22:52 ` Rik van Riel
2009-12-18 16:23 ` Andrea Arcangeli
2009-12-18 17:43 ` Rik van Riel
2009-12-18 10:27 ` KOSAKI Motohiro
2009-12-18 14:09 ` Rik van Riel
2009-12-18 13:38 ` Avi Kivity
2009-12-18 14:12 ` Rik van Riel
2009-12-18 14:13 ` Avi Kivity
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1260810481.6666.13.camel@dhcp-100-19-198.bos.redhat.com \
--to=lwoodman@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=kosaki.motohiro@jp.fujitsu.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=minchan.kim@gmail.com \
--cc=riel@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).