From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wg0-f50.google.com (mail-wg0-f50.google.com [74.125.82.50]) by kanga.kvack.org (Postfix) with ESMTP id 355B56B0031 for ; Mon, 7 Apr 2014 18:34:33 -0400 (EDT)
Received: by mail-wg0-f50.google.com with SMTP id x13so95523wgg.33 for ; Mon, 07 Apr 2014 15:34:32 -0700 (PDT)
Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id w48si108584eel.326.2014.04.07.15.34.31 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Mon, 07 Apr 2014 15:34:31 -0700 (PDT)
From: Mel Gorman
Subject: [PATCH 0/2] Disable zone_reclaim_mode by default
Date: Mon, 7 Apr 2014 23:34:26 +0100
Message-Id: <1396910068-11637-1-git-send-email-mgorman@suse.de>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton
Cc: Robert Haas, Josh Berkus, Andres Freund, Christoph Lameter, Linux-MM, LKML, Mel Gorman

When it was introduced, zone_reclaim_mode made sense as NUMA distances
punished and workloads were generally partitioned to fit into a NUMA
node. NUMA machines are now common but few of the workloads are NUMA-aware
and it's routine to see major performance due to zone_reclaim_mode being
disabled but relatively few can identify the problem.

Those that require zone_reclaim_mode are likely to be able to detect when
it needs to be enabled and tune appropriately so lets have a sensible
default for the bulk of users.

 Documentation/sysctl/vm.txt | 17 +++++++++--------
 include/linux/mmzone.h      |  1 -
 mm/page_alloc.c             | 17 +----------------
 3 files changed, 10 insertions(+), 25 deletions(-)

--
1.8.4.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-we0-f169.google.com (mail-we0-f169.google.com [74.125.82.169]) by kanga.kvack.org (Postfix) with ESMTP id 75BFF6B0036 for ; Mon, 7 Apr 2014 18:34:33 -0400 (EDT)
Received: by mail-we0-f169.google.com with SMTP id w62so107820wes.0 for ; Mon, 07 Apr 2014 15:34:32 -0700 (PDT)
Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id x47si123914eel.253.2014.04.07.15.34.31 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Mon, 07 Apr 2014 15:34:31 -0700 (PDT)
From: Mel Gorman
Subject: [PATCH 1/2] mm: Disable zone_reclaim_mode by default
Date: Mon, 7 Apr 2014 23:34:27 +0100
Message-Id: <1396910068-11637-2-git-send-email-mgorman@suse.de>
In-Reply-To: <1396910068-11637-1-git-send-email-mgorman@suse.de>
References: <1396910068-11637-1-git-send-email-mgorman@suse.de>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton
Cc: Robert Haas, Josh Berkus, Andres Freund, Christoph Lameter, Linux-MM, LKML, Mel Gorman

zone_reclaim_mode causes processes to prefer reclaiming memory from local
node instead of spilling over to other nodes. This made sense initially when
NUMA machines were almost exclusively HPC and the workload was partitioned
into nodes. The NUMA penalties were sufficiently high to justify reclaiming
the memory. On current machines and workloads it is often the case that
zone_reclaim_mode destroys performance but not all users know how to detect
this. Favour the common case and disable it by default. Users that are
sophisticated enough to know they need zone_reclaim_mode will detect it.
Signed-off-by: Mel Gorman
---
 Documentation/sysctl/vm.txt | 17 +++++++++--------
 mm/page_alloc.c             |  2 --
 2 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index d614a9b..ff5da70 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -751,16 +751,17 @@ This is value ORed together of
 2 = Zone reclaim writes dirty pages out
 4 = Zone reclaim swaps pages

-zone_reclaim_mode is set during bootup to 1 if it is determined that pages
-from remote zones will cause a measurable performance reduction. The
-page allocator will then reclaim easily reusable pages (those page
-cache pages that are currently not used) before allocating off node pages.
-
-It may be beneficial to switch off zone reclaim if the system is
-used for a file server and all of memory should be used for caching files
-from disk. In that case the caching effect is more important than
+zone_reclaim_mode is disabled by default. For file servers or workloads
+that benefit from having their data cached, zone_reclaim_mode should be
+left disabled as the caching effect is likely to be more important than
 data locality.

+zone_reclaim may be enabled if it's known that the workload is partitioned
+such that each partition fits within a NUMA node and that accessing remote
+memory would cause a measurable performance reduction. The page allocator
+will then reclaim easily reusable pages (those page cache pages that are
+currently not used) before allocating off node pages.
+
 Allowing zone reclaim to write out pages stops processes that are
 writing large amounts of data from dirtying pages on other nodes.
 Zone reclaim will write out dirty pages if a zone fills up and so effectively
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3bac76a..a256f85 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1873,8 +1873,6 @@ static void __paginginit init_zone_allows_reclaim(int nid)
 	for_each_online_node(i)
 		if (node_distance(nid, i) <= RECLAIM_DISTANCE)
 			node_set(i, NODE_DATA(nid)->reclaim_nodes);
-		else
-			zone_reclaim_mode = 1;
 }

 #else /* CONFIG_NUMA */
--
1.8.4.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wi0-f180.google.com (mail-wi0-f180.google.com [209.85.212.180]) by kanga.kvack.org (Postfix) with ESMTP id AD5E56B0036 for ; Mon, 7 Apr 2014 18:34:34 -0400 (EDT)
Received: by mail-wi0-f180.google.com with SMTP id q5so357307wiv.1 for ; Mon, 07 Apr 2014 15:34:33 -0700 (PDT)
Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id q5si146627eem.141.2014.04.07.15.34.32 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Mon, 07 Apr 2014 15:34:33 -0700 (PDT)
From: Mel Gorman
Subject: [PATCH 2/2] mm: page_alloc: Do not cache reclaim distances
Date: Mon, 7 Apr 2014 23:34:28 +0100
Message-Id: <1396910068-11637-3-git-send-email-mgorman@suse.de>
In-Reply-To: <1396910068-11637-1-git-send-email-mgorman@suse.de>
References: <1396910068-11637-1-git-send-email-mgorman@suse.de>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton
Cc: Robert Haas, Josh Berkus, Andres Freund, Christoph Lameter, Linux-MM, LKML, Mel Gorman

pgdat->reclaim_nodes tracks if a remote node is allowed to be reclaimed by
zone_reclaim due to its distance. As it is expected that zone_reclaim_mode
will be rarely enabled it is unreasonable for all machines to take a penalty.
Fortunately, the zone_reclaim_mode() path is already slow and it is the path
that takes the hit.

Signed-off-by: Mel Gorman
---
 include/linux/mmzone.h |  1 -
 mm/page_alloc.c        | 15 +--------------
 2 files changed, 1 insertion(+), 15 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9b61b9b..564b169 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -757,7 +757,6 @@ typedef struct pglist_data {
 	unsigned long node_spanned_pages; /* total size of physical page
 					     range, including holes */
 	int node_id;
-	nodemask_t reclaim_nodes; /* Nodes allowed to reclaim from */
 	wait_queue_head_t kswapd_wait;
 	wait_queue_head_t pfmemalloc_wait;
 	struct task_struct *kswapd; /* Protected by lock_memory_hotplug() */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a256f85..574928e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1863,16 +1863,7 @@ static bool zone_local(struct zone *local_zone, struct zone *zone)

 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 {
-	return node_isset(local_zone->node, zone->zone_pgdat->reclaim_nodes);
-}
-
-static void __paginginit init_zone_allows_reclaim(int nid)
-{
-	int i;
-
-	for_each_online_node(i)
-		if (node_distance(nid, i) <= RECLAIM_DISTANCE)
-			node_set(i, NODE_DATA(nid)->reclaim_nodes);
+	return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) < RECLAIM_DISTANCE;
 }

 #else /* CONFIG_NUMA */
@@ -1906,9 +1897,6 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 	return true;
 }

-static inline void init_zone_allows_reclaim(int nid)
-{
-}
 #endif /* CONFIG_NUMA */

 /*
@@ -4917,7 +4905,6 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,

 	pgdat->node_id = nid;
 	pgdat->node_start_pfn = node_start_pfn;
-	init_zone_allows_reclaim(nid);
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 	get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
 #endif
--
1.8.4.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-bk0-f54.google.com (mail-bk0-f54.google.com [209.85.214.54]) by kanga.kvack.org (Postfix) with ESMTP id A9F996B0031 for ; Mon, 7 Apr 2014 19:35:23 -0400 (EDT)
Received: by mail-bk0-f54.google.com with SMTP id 6so37653bkj.27 for ; Mon, 07 Apr 2014 16:35:22 -0700 (PDT)
Received: from zene.cmpxchg.org (zene.cmpxchg.org. [2a01:238:4224:fa00:ca1f:9ef3:caee:a2bd]) by mx.google.com with ESMTPS id ur10si264361bkb.242.2014.04.07.16.35.21 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Mon, 07 Apr 2014 16:35:22 -0700 (PDT)
Date: Mon, 7 Apr 2014 19:35:11 -0400
From: Johannes Weiner
Subject: Re: [PATCH 1/2] mm: Disable zone_reclaim_mode by default
Message-ID: <20140407233511.GO4407@cmpxchg.org>
References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <1396910068-11637-2-git-send-email-mgorman@suse.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1396910068-11637-2-git-send-email-mgorman@suse.de>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Mel Gorman
Cc: Andrew Morton, Robert Haas, Josh Berkus, Andres Freund, Christoph Lameter, Linux-MM, LKML

On Mon, Apr 07, 2014 at 11:34:27PM +0100, Mel Gorman wrote:
> zone_reclaim_mode causes processes to prefer reclaiming memory from local
> node instead of spilling over to other nodes. This made sense initially when
> NUMA machines were almost exclusively HPC and the workload was partitioned
> into nodes. The NUMA penalties were sufficiently high to justify reclaiming
> the memory. On current machines and workloads it is often the case that
> zone_reclaim_mode destroys performance but not all users know how to detect
> this. Favour the common case and disable it by default. Users that are
> sophisticated enough to know they need zone_reclaim_mode will detect it.
>
> Signed-off-by: Mel Gorman

Acked-by: Johannes Weiner

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-bk0-f47.google.com (mail-bk0-f47.google.com [209.85.214.47]) by kanga.kvack.org (Postfix) with ESMTP id 6A2F46B0031 for ; Mon, 7 Apr 2014 19:37:04 -0400 (EDT)
Received: by mail-bk0-f47.google.com with SMTP id w10so37669bkz.34 for ; Mon, 07 Apr 2014 16:37:03 -0700 (PDT)
Received: from zene.cmpxchg.org (zene.cmpxchg.org. [2a01:238:4224:fa00:ca1f:9ef3:caee:a2bd]) by mx.google.com with ESMTPS id yp6si274165bkb.162.2014.04.07.16.37.02 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Mon, 07 Apr 2014 16:37:03 -0700 (PDT)
Date: Mon, 7 Apr 2014 19:36:57 -0400
From: Johannes Weiner
Subject: Re: [PATCH 2/2] mm: page_alloc: Do not cache reclaim distances
Message-ID: <20140407233657.GP4407@cmpxchg.org>
References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <1396910068-11637-3-git-send-email-mgorman@suse.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1396910068-11637-3-git-send-email-mgorman@suse.de>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Mel Gorman
Cc: Andrew Morton, Robert Haas, Josh Berkus, Andres Freund, Christoph Lameter, Linux-MM, LKML

On Mon, Apr 07, 2014 at 11:34:28PM +0100, Mel Gorman wrote:
> pgdat->reclaim_nodes tracks if a remote node is allowed to be reclaimed by
> zone_reclaim due to its distance. As it is expected that zone_reclaim_mode
> will be rarely enabled it is unreasonable for all machines to take a penalty.
> Fortunately, the zone_reclaim_mode() path is already slow and it is the path
> that takes the hit.
>
> Signed-off-by: Mel Gorman

Acked-by: Johannes Weiner

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pb0-f54.google.com (mail-pb0-f54.google.com [209.85.160.54]) by kanga.kvack.org (Postfix) with ESMTP id 8A9476B0055 for ; Mon, 7 Apr 2014 23:55:10 -0400 (EDT)
Received: by mail-pb0-f54.google.com with SMTP id ma3so416300pbc.13 for ; Mon, 07 Apr 2014 20:55:10 -0700 (PDT)
Received: from heian.cn.fujitsu.com ([59.151.112.132]) by mx.google.com with ESMTP id ub3si215811pac.153.2014.04.07.18.19.04 for ; Mon, 07 Apr 2014 18:19:06 -0700 (PDT)
Message-ID: <53434E41.1010306@cn.fujitsu.com>
Date: Tue, 8 Apr 2014 09:17:53 +0800
From: Zhang Yanfei
MIME-Version: 1.0
Subject: Re: [PATCH 2/2] mm: page_alloc: Do not cache reclaim distances
References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <1396910068-11637-3-git-send-email-mgorman@suse.de>
In-Reply-To: <1396910068-11637-3-git-send-email-mgorman@suse.de>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Mel Gorman
Cc: Andrew Morton, Robert Haas, Josh Berkus, Andres Freund, Christoph Lameter, Linux-MM, LKML

On 04/08/2014 06:34 AM, Mel Gorman wrote:
> pgdat->reclaim_nodes tracks if a remote node is allowed to be reclaimed by
> zone_reclaim due to its distance. As it is expected that zone_reclaim_mode
> will be rarely enabled it is unreasonable for all machines to take a penalty.
> Fortunately, the zone_reclaim_mode() path is already slow and it is the path
> that takes the hit.
>
> Signed-off-by: Mel Gorman

Reviewed-by: Zhang Yanfei

> ---
>  include/linux/mmzone.h |  1 -
>  mm/page_alloc.c        | 15 +--------------
>  2 files changed, 1 insertion(+), 15 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 9b61b9b..564b169 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -757,7 +757,6 @@ typedef struct pglist_data {
>  	unsigned long node_spanned_pages; /* total size of physical page
>  					     range, including holes */
>  	int node_id;
> -	nodemask_t reclaim_nodes; /* Nodes allowed to reclaim from */
>  	wait_queue_head_t kswapd_wait;
>  	wait_queue_head_t pfmemalloc_wait;
>  	struct task_struct *kswapd; /* Protected by lock_memory_hotplug() */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a256f85..574928e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1863,16 +1863,7 @@ static bool zone_local(struct zone *local_zone, struct zone *zone)
>
>  static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
>  {
> -	return node_isset(local_zone->node, zone->zone_pgdat->reclaim_nodes);
> -}
> -
> -static void __paginginit init_zone_allows_reclaim(int nid)
> -{
> -	int i;
> -
> -	for_each_online_node(i)
> -		if (node_distance(nid, i) <= RECLAIM_DISTANCE)
> -			node_set(i, NODE_DATA(nid)->reclaim_nodes);
> +	return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) < RECLAIM_DISTANCE;
>  }
>
>  #else /* CONFIG_NUMA */
> @@ -1906,9 +1897,6 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
>  	return true;
>  }
>
> -static inline void init_zone_allows_reclaim(int nid)
> -{
> -}
>  #endif /* CONFIG_NUMA */
>
>  /*
> @@ -4917,7 +4905,6 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
>
>  	pgdat->node_id = nid;
>  	pgdat->node_start_pfn = node_start_pfn;
> -	init_zone_allows_reclaim(nid);
>  #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>  	get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
>  #endif
>

--
Thanks.
Zhang Yanfei

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pb0-f44.google.com (mail-pb0-f44.google.com [209.85.160.44]) by kanga.kvack.org (Postfix) with ESMTP id 10A356B0069 for ; Tue, 8 Apr 2014 00:12:45 -0400 (EDT)
Received: by mail-pb0-f44.google.com with SMTP id rp16so432835pbb.17 for ; Mon, 07 Apr 2014 21:12:44 -0700 (PDT)
Received: from heian.cn.fujitsu.com ([59.151.112.132]) by mx.google.com with ESMTP id gg7si220084pac.106.2014.04.07.18.18.49 for ; Mon, 07 Apr 2014 18:18:51 -0700 (PDT)
Message-ID: <53434E28.4040304@cn.fujitsu.com>
Date: Tue, 8 Apr 2014 09:17:28 +0800
From: Zhang Yanfei
MIME-Version: 1.0
Subject: Re: [PATCH 1/2] mm: Disable zone_reclaim_mode by default
References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <1396910068-11637-2-git-send-email-mgorman@suse.de>
In-Reply-To: <1396910068-11637-2-git-send-email-mgorman@suse.de>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Mel Gorman
Cc: Andrew Morton, Robert Haas, Josh Berkus, Andres Freund, Christoph Lameter, Linux-MM, LKML

On 04/08/2014 06:34 AM, Mel Gorman wrote:
> zone_reclaim_mode causes processes to prefer reclaiming memory from local
> node instead of spilling over to other nodes. This made sense initially when
> NUMA machines were almost exclusively HPC and the workload was partitioned
> into nodes. The NUMA penalties were sufficiently high to justify reclaiming
> the memory. On current machines and workloads it is often the case that
> zone_reclaim_mode destroys performance but not all users know how to detect
> this. Favour the common case and disable it by default. Users that are
> sophisticated enough to know they need zone_reclaim_mode will detect it.
>
> Signed-off-by: Mel Gorman

Reviewed-by: Zhang Yanfei

> ---
>  Documentation/sysctl/vm.txt | 17 +++++++++--------
>  mm/page_alloc.c             |  2 --
>  2 files changed, 9 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index d614a9b..ff5da70 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -751,16 +751,17 @@ This is value ORed together of
>  2 = Zone reclaim writes dirty pages out
>  4 = Zone reclaim swaps pages
>
> -zone_reclaim_mode is set during bootup to 1 if it is determined that pages
> -from remote zones will cause a measurable performance reduction. The
> -page allocator will then reclaim easily reusable pages (those page
> -cache pages that are currently not used) before allocating off node pages.
> -
> -It may be beneficial to switch off zone reclaim if the system is
> -used for a file server and all of memory should be used for caching files
> -from disk. In that case the caching effect is more important than
> +zone_reclaim_mode is disabled by default. For file servers or workloads
> +that benefit from having their data cached, zone_reclaim_mode should be
> +left disabled as the caching effect is likely to be more important than
>  data locality.
>
> +zone_reclaim may be enabled if it's known that the workload is partitioned
> +such that each partition fits within a NUMA node and that accessing remote
> +memory would cause a measurable performance reduction. The page allocator
> +will then reclaim easily reusable pages (those page cache pages that are
> +currently not used) before allocating off node pages.
> +
>  Allowing zone reclaim to write out pages stops processes that are
>  writing large amounts of data from dirtying pages on other nodes.
>  Zone reclaim will write out dirty pages if a zone fills up and so effectively
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3bac76a..a256f85 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1873,8 +1873,6 @@ static void __paginginit init_zone_allows_reclaim(int nid)
>  	for_each_online_node(i)
>  		if (node_distance(nid, i) <= RECLAIM_DISTANCE)
>  			node_set(i, NODE_DATA(nid)->reclaim_nodes);
> -		else
> -			zone_reclaim_mode = 1;
>  }
>
>  #else /* CONFIG_NUMA */
>

--
Thanks.
Zhang Yanfei

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-ee0-f48.google.com (mail-ee0-f48.google.com [74.125.83.48]) by kanga.kvack.org (Postfix) with ESMTP id 654C16B0073 for ; Tue, 8 Apr 2014 03:14:55 -0400 (EDT)
Received: by mail-ee0-f48.google.com with SMTP id b57so295759eek.7 for ; Tue, 08 Apr 2014 00:14:53 -0700 (PDT)
Received: from moutng.kundenserver.de (moutng.kundenserver.de.
 [212.227.126.131]) by mx.google.com with ESMTPS id z2si1452784eeo.124.2014.04.08.00.14.51 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 08 Apr 2014 00:14:52 -0700 (PDT)
Date: Tue, 8 Apr 2014 09:14:43 +0200
From: Andres Freund
Subject: Re: [PATCH 1/2] mm: Disable zone_reclaim_mode by default
Message-ID: <20140408071443.GQ4161@awork2.anarazel.de>
References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <1396910068-11637-2-git-send-email-mgorman@suse.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1396910068-11637-2-git-send-email-mgorman@suse.de>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Mel Gorman
Cc: Andrew Morton, Robert Haas, Josh Berkus, Christoph Lameter, Linux-MM, LKML

Hi,

On 2014-04-07 23:34:27 +0100, Mel Gorman wrote:
> zone_reclaim_mode causes processes to prefer reclaiming memory from local
> node instead of spilling over to other nodes. This made sense initially when
> NUMA machines were almost exclusively HPC and the workload was partitioned
> into nodes. The NUMA penalties were sufficiently high to justify reclaiming
> the memory. On current machines and workloads it is often the case that
> zone_reclaim_mode destroys performance but not all users know how to detect
> this. Favour the common case and disable it by default. Users that are
> sophisticated enough to know they need zone_reclaim_mode will detect it.

Unsurprisingly I am in favor of this.

>  Documentation/sysctl/vm.txt | 17 +++++++++--------
>  mm/page_alloc.c             |  2 --
>  2 files changed, 9 insertions(+), 10 deletions(-)

But I think linux/topology.h's comment about RECLAIM_DISTANCE should be
adapted as well.

Thanks,

Andres

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-ee0-f46.google.com (mail-ee0-f46.google.com [74.125.83.46]) by kanga.kvack.org (Postfix) with ESMTP id 0B6CD6B007B for ; Tue, 8 Apr 2014 03:26:16 -0400 (EDT)
Received: by mail-ee0-f46.google.com with SMTP id t10so310146eei.5 for ; Tue, 08 Apr 2014 00:26:15 -0700 (PDT)
Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id 45si1506281eeh.63.2014.04.08.00.26.14 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 08 Apr 2014 00:26:14 -0700 (PDT)
Message-ID: <5343A494.9070707@suse.cz>
Date: Tue, 08 Apr 2014 09:26:12 +0200
From: Vlastimil Babka
MIME-Version: 1.0
Subject: Re: [PATCH 0/2] Disable zone_reclaim_mode by default
References: <1396910068-11637-1-git-send-email-mgorman@suse.de>
In-Reply-To: <1396910068-11637-1-git-send-email-mgorman@suse.de>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Mel Gorman, Andrew Morton
Cc: Robert Haas, Josh Berkus, Andres Freund, Christoph Lameter, Linux-MM, LKML

On 04/08/2014 12:34 AM, Mel Gorman wrote:
> When it was introduced, zone_reclaim_mode made sense as NUMA distances
> punished and workloads were generally partitioned to fit into a NUMA
> node. NUMA machines are now common but few of the workloads are NUMA-aware
> and it's routine to see major performance due to zone_reclaim_mode being
> disabled but relatively few can identify the problem.
  ^ I think you meant "enabled" here?

Just in case the cover letter goes to the changelog...

Vlastimil

> Those that require zone_reclaim_mode are likely to be able to detect when
> it needs to be enabled and tune appropriately so lets have a sensible
> default for the bulk of users.
>
>  Documentation/sysctl/vm.txt | 17 +++++++++--------
>  include/linux/mmzone.h      |  1 -
>  mm/page_alloc.c             | 17 +----------------
>  3 files changed, 10 insertions(+), 25 deletions(-)
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pa0-f45.google.com (mail-pa0-f45.google.com [209.85.220.45]) by kanga.kvack.org (Postfix) with ESMTP id 1CF4A6B00AA for ; Tue, 8 Apr 2014 10:14:10 -0400 (EDT)
Received: by mail-pa0-f45.google.com with SMTP id kl14so1090922pab.18 for ; Tue, 08 Apr 2014 07:14:09 -0700 (PDT)
Received: from qmta08.emeryville.ca.mail.comcast.net (qmta08.emeryville.ca.mail.comcast.net. [2001:558:fe2d:43:76:96:30:80]) by mx.google.com with ESMTP id ep2si1078020pbb.160.2014.04.08.07.14.08 for ; Tue, 08 Apr 2014 07:14:09 -0700 (PDT)
Date: Tue, 8 Apr 2014 09:14:05 -0500 (CDT)
From: Christoph Lameter
Subject: Re: [PATCH 1/2] mm: Disable zone_reclaim_mode by default
In-Reply-To: <1396910068-11637-2-git-send-email-mgorman@suse.de>
Message-ID:
References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <1396910068-11637-2-git-send-email-mgorman@suse.de>
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
List-ID:
To: sivanich@sgi.com
Cc: Mel Gorman, Andrew Morton, Robert Haas, Josh Berkus, Andres Freund, Linux-MM, LKML

On Mon, 7 Apr 2014, Mel Gorman wrote:

> zone_reclaim_mode causes processes to prefer reclaiming memory from local
> node instead of spilling over to other nodes. This made sense initially when
> NUMA machines were almost exclusively HPC and the workload was partitioned
> into nodes. The NUMA penalties were sufficiently high to justify reclaiming
> the memory. On current machines and workloads it is often the case that
> zone_reclaim_mode destroys performance but not all users know how to detect
> this.
> Favour the common case and disable it by default. Users that are
> sophisticated enough to know they need zone_reclaim_mode will detect it.

Ok that is going to require SGI machines to deal with zone_reclaim
configurations on bootup. Dimitri? Any comments?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pd0-f182.google.com (mail-pd0-f182.google.com [209.85.192.182]) by kanga.kvack.org (Postfix) with ESMTP id 54E696B00AE for ; Tue, 8 Apr 2014 10:17:08 -0400 (EDT)
Received: by mail-pd0-f182.google.com with SMTP id y10so1068855pdj.27 for ; Tue, 08 Apr 2014 07:17:07 -0700 (PDT)
Received: from qmta14.emeryville.ca.mail.comcast.net (qmta14.emeryville.ca.mail.comcast.net. [2001:558:fe2d:44:76:96:27:212]) by mx.google.com with ESMTP id m9si1091628pab.372.2014.04.08.07.17.07 for ; Tue, 08 Apr 2014 07:17:07 -0700 (PDT)
Date: Tue, 8 Apr 2014 09:17:04 -0500 (CDT)
From: Christoph Lameter
Subject: Re: [PATCH 0/2] Disable zone_reclaim_mode by default
In-Reply-To: <5343A494.9070707@suse.cz>
Message-ID:
References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <5343A494.9070707@suse.cz>
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
List-ID:
To: Vlastimil Babka
Cc: Mel Gorman, Andrew Morton, Robert Haas, Josh Berkus, Andres Freund, Linux-MM, LKML, sivanich@sgi.com

On Tue, 8 Apr 2014, Vlastimil Babka wrote:

> On 04/08/2014 12:34 AM, Mel Gorman wrote:
> > When it was introduced, zone_reclaim_mode made sense as NUMA distances
> > punished and workloads were generally partitioned to fit into a NUMA
> > node. NUMA machines are now common but few of the workloads are NUMA-aware
> > and it's routine to see major performance due to zone_reclaim_mode being
> > disabled but relatively few can identify the problem.
>   ^ I think you meant "enabled" here?
>
> Just in case the cover letter goes to the changelog...

Correct.

Another solution here would be to increase the threshhold so that
4 socket machines do not enable zone reclaim by default. The larger the
NUMA system is the more memory is off node from the perspective of a
processor and the larger the hit from remote memory.

On the other hand: The more expensive we make reclaim the less it
makes sense to allow zone reclaim to occur.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-ee0-f54.google.com (mail-ee0-f54.google.com [74.125.83.54]) by kanga.kvack.org (Postfix) with ESMTP id 7AF2C6B003D for ; Tue, 8 Apr 2014 10:26:53 -0400 (EDT)
Received: by mail-ee0-f54.google.com with SMTP id d49so773034eek.27 for ; Tue, 08 Apr 2014 07:26:51 -0700 (PDT)
Received: from moutng.kundenserver.de (moutng.kundenserver.de.
 [212.227.17.10]) by mx.google.com with ESMTPS id 43si2981755eei.325.2014.04.08.07.26.50 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 08 Apr 2014 07:26:50 -0700 (PDT)
Date: Tue, 8 Apr 2014 16:26:42 +0200
From: Andres Freund
Subject: Re: [PATCH 0/2] Disable zone_reclaim_mode by default
Message-ID: <20140408142642.GU4161@awork2.anarazel.de>
References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <5343A494.9070707@suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
Sender: owner-linux-mm@kvack.org
List-ID:
To: Christoph Lameter
Cc: Vlastimil Babka, Mel Gorman, Andrew Morton, Robert Haas, Josh Berkus, Linux-MM, LKML, sivanich@sgi.com

On 2014-04-08 09:17:04 -0500, Christoph Lameter wrote:
> On Tue, 8 Apr 2014, Vlastimil Babka wrote:
>
> > On 04/08/2014 12:34 AM, Mel Gorman wrote:
> > > When it was introduced, zone_reclaim_mode made sense as NUMA distances
> > > punished and workloads were generally partitioned to fit into a NUMA
> > > node. NUMA machines are now common but few of the workloads are NUMA-aware
> > > and it's routine to see major performance due to zone_reclaim_mode being
> > > disabled but relatively few can identify the problem.
> >     ^ I think you meant "enabled" here?
> >
> > Just in case the cover letter goes to the changelog...
>
> Correct.
>
> Another solution here would be to increase the threshhold so that
> 4 socket machines do not enable zone reclaim by default. The larger the
> NUMA system is the more memory is off node from the perspective of a
> processor and the larger the hit from remote memory.

FWIW, I've the problem hit majorly on 8 socket machines. Those are the
largest I have seen so far in postgres scenarios. Everything larger is
far less likely to be used as single node database server, so that's
possibly a sensible cutoff.
But then, I'd think that special many-socket machines are set up by specialists who'd know to enable it if it makes sense... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ob0-f178.google.com (mail-ob0-f178.google.com [209.85.214.178]) by kanga.kvack.org (Postfix) with ESMTP id 0ECF86B0038 for ; Tue, 8 Apr 2014 10:46:51 -0400 (EDT) Received: by mail-ob0-f178.google.com with SMTP id wp18so1098777obc.37 for ; Tue, 08 Apr 2014 07:46:50 -0700 (PDT) Received: from smtp.01.com (smtp.01.com. [199.36.142.181]) by mx.google.com with ESMTP id e3si1807156obp.178.2014.04.08.07.46.49 for ; Tue, 08 Apr 2014 07:46:49 -0700 (PDT) Message-ID: <53440BD6.5030008@agliodbs.com> Date: Tue, 08 Apr 2014 10:46:46 -0400 From: Josh Berkus MIME-Version: 1.0 Subject: Re: [PATCH 0/2] Disable zone_reclaim_mode by default References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <5343A494.9070707@suse.cz> In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Christoph Lameter , Vlastimil Babka Cc: Mel Gorman , Andrew Morton , Robert Haas , Andres Freund , Linux-MM , LKML , sivanich@sgi.com On 04/08/2014 10:17 AM, Christoph Lameter wrote: > Another solution here would be to increase the threshhold so that > 4 socket machines do not enable zone reclaim by default. The larger the > NUMA system is the more memory is off node from the perspective of a > processor and the larger the hit from remote memory. 8 and 16 socket machines aren't common for nonspecialist workloads *now*, but by the time these changes make it into supported distribution kernels, they may very well be. 
So having zone_reclaim_mode automatically turn itself on if you have more than 8 sockets would still be a booby-trap ("Boss, I dunno. I installed the additional processors and memory performance went to hell!") For zone_reclaim_mode=1 to be useful on standard servers, both of the following need to be true: 1. the user has to have set CPU affinity for their applications; 2. the applications can't need more than one memory bank worth of cache. The thing is, there is *no way* for Linux to know if the above is true. Now, I can certainly imagine non-HPC workloads for which both of the above would be true; for example, I've set up VMware ESX servers where each VM has one socket and one memory bank. However, if the user knows enough to set up socket affinity, they know enough to set zone_reclaim_mode = 1. The default should cover the know-nothing case, not the experienced specialist case. I'd also argue that there's a fundamental false assumption in the entire algorithm of zone_reclaim_mode, because there is no memory bank which is as distant as disk is, ever. However, if it's off by default, then I don't care. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f42.google.com (mail-ee0-f42.google.com [74.125.83.42]) by kanga.kvack.org (Postfix) with ESMTP id 17F756B003D for ; Tue, 8 Apr 2014 10:47:43 -0400 (EDT) Received: by mail-ee0-f42.google.com with SMTP id d17so770999eek.15 for ; Tue, 08 Apr 2014 07:47:41 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTPS id u5si3065707een.263.2014.04.08.07.47.40 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 08 Apr 2014 07:47:40 -0700 (PDT) Date: Tue, 8 Apr 2014 15:47:35 +0100 From: Mel Gorman Subject: Re: [PATCH 1/2] mm: Disable zone_reclaim_mode by default Message-ID: <20140408144735.GK7292@suse.de> References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <1396910068-11637-2-git-send-email-mgorman@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Christoph Lameter Cc: sivanich@sgi.com, Andrew Morton , Robert Haas , Josh Berkus , Andres Freund , Linux-MM , LKML On Tue, Apr 08, 2014 at 09:14:05AM -0500, Christoph Lameter wrote: > On Mon, 7 Apr 2014, Mel Gorman wrote: > > > zone_reclaim_mode causes processes to prefer reclaiming memory from local > > node instead of spilling over to other nodes. This made sense initially when > > NUMA machines were almost exclusively HPC and the workload was partitioned > > into nodes. The NUMA penalties were sufficiently high to justify reclaiming > > the memory. On current machines and workloads it is often the case that > > zone_reclaim_mode destroys performance but not all users know how to detect > > this. Favour the common case and disable it by default. Users that are > > sophisticated enough to know they need zone_reclaim_mode will detect it. > > Ok that is going to require SGI machines to deal with zone_reclaim > configurations on bootup. Dimitri? Any comments? > The SGI machines are also likely to be managed by system administrators who are both aware of zone_reclaim_mode and know how to evaluate if it should be enabled or not. The pair of patches is really aimed at the common case of 2-8 socket machines running workloads that are not NUMA aware. 
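For anyone who needs to evaluate the setting on their own machine, here is a minimal sketch (not from the patches) of reading the current value and decoding its bits. The bit meanings follow Documentation/sysctl/vm.txt (1 = zone reclaim on, 2 = zone reclaim writes dirty pages out, 4 = zone reclaim swaps pages). The helper names read_zone_reclaim_mode() and describe_zone_reclaim_mode() are hypothetical and chosen only for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Bit values of vm.zone_reclaim_mode as listed in Documentation/sysctl/vm.txt. */
#define ZR_MODE_RECLAIM 1 /* zone reclaim on */
#define ZR_MODE_WRITE   2 /* zone reclaim writes dirty pages out */
#define ZR_MODE_SWAP    4 /* zone reclaim swaps pages */

/*
 * Hypothetical helper: render which bits are set in a zone_reclaim_mode
 * value into buf, e.g. "reclaim=1 write=0 swap=0". Returns buf.
 */
const char *describe_zone_reclaim_mode(int mode, char *buf, size_t len)
{
	snprintf(buf, len, "reclaim=%d write=%d swap=%d",
		 !!(mode & ZR_MODE_RECLAIM),
		 !!(mode & ZR_MODE_WRITE),
		 !!(mode & ZR_MODE_SWAP));
	return buf;
}

/*
 * Hypothetical helper: read the current setting from procfs.
 * Returns -1 if the file is missing (for example on a kernel built
 * without CONFIG_NUMA) or cannot be parsed.
 */
int read_zone_reclaim_mode(void)
{
	int mode = -1;
	FILE *f = fopen("/proc/sys/vm/zone_reclaim_mode", "r");

	if (!f)
		return -1;
	if (fscanf(f, "%d", &mode) != 1)
		mode = -1;
	fclose(f);
	return mode;
}
```

A caller would pass the result of read_zone_reclaim_mode() to describe_zone_reclaim_mode() and print the string. Changing the value is done by writing to the same procfs file as root, or through the vm.zone_reclaim_mode sysctl.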
-- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f45.google.com (mail-wg0-f45.google.com [74.125.82.45]) by kanga.kvack.org (Postfix) with ESMTP id 708376B0035 for ; Tue, 8 Apr 2014 15:53:05 -0400 (EDT) Received: by mail-wg0-f45.google.com with SMTP id l18so1472185wgh.4 for ; Tue, 08 Apr 2014 12:53:04 -0700 (PDT) Received: from mail-we0-x232.google.com (mail-we0-x232.google.com [2a00:1450:400c:c03::232]) by mx.google.com with ESMTPS id eh10si1504177wib.58.2014.04.08.12.53.03 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 08 Apr 2014 12:53:03 -0700 (PDT) Received: by mail-we0-f178.google.com with SMTP id u56so1485738wes.9 for ; Tue, 08 Apr 2014 12:53:03 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <5343A494.9070707@suse.cz> Date: Tue, 8 Apr 2014 15:53:02 -0400 Message-ID: Subject: Re: [PATCH 0/2] Disable zone_reclaim_mode by default From: Robert Haas Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Christoph Lameter Cc: Vlastimil Babka , Mel Gorman , Andrew Morton , Josh Berkus , Andres Freund , Linux-MM , LKML , sivanich@sgi.com On Tue, Apr 8, 2014 at 10:17 AM, Christoph Lameter wrote: > Another solution here would be to increase the threshhold so that > 4 socket machines do not enable zone reclaim by default. The larger the > NUMA system is the more memory is off node from the perspective of a > processor and the larger the hit from remote memory. Well, as Josh quite rightly said, the hit from accessing remote memory is never going to be as large as the hit from disk. If and when there is a machine where remote memory is more expensive to access than disk, that's a good argument for zone_reclaim_mode. 
But I don't believe that's anywhere close to being true today, even on an 8-socket machine with an SSD. Now, perhaps the fear is that if we access that remote memory *repeatedly* the aggregate cost will exceed what it would have cost to fault that page into the local node just once. But it takes a lot of accesses for that to be true, and most of the time you won't get them. Even if you do, I bet many workloads will prefer even performance across all the accesses over a very slow first access followed by slightly faster subsequent accesses. In an ideal world, the kernel would put the hottest pages on the local node and the less-hot pages on remote nodes, moving pages around as the workload shifts. In practice, that's probably pretty hard. Fortunately, it's not nearly as important as making sure we don't unnecessarily hit the disk, which is infinitely slower than any memory bank. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oa0-f41.google.com (mail-oa0-f41.google.com [209.85.219.41]) by kanga.kvack.org (Postfix) with ESMTP id 19E896B0031 for ; Tue, 8 Apr 2014 15:56:55 -0400 (EDT) Received: by mail-oa0-f41.google.com with SMTP id j17so1621167oag.14 for ; Tue, 08 Apr 2014 12:56:53 -0700 (PDT) Received: from smtp.01.com (smtp.01.com. 
[199.36.142.181]) by mx.google.com with ESMTP id pu6si2560229oeb.178.2014.04.08.12.56.53 for ; Tue, 08 Apr 2014 12:56:53 -0700 (PDT) Message-ID: <53445481.3030202@agliodbs.com> Date: Tue, 08 Apr 2014 15:56:49 -0400 From: Josh Berkus MIME-Version: 1.0 Subject: Re: [PATCH 0/2] Disable zone_reclaim_mode by default References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <5343A494.9070707@suse.cz> In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Robert Haas , Christoph Lameter Cc: Vlastimil Babka , Mel Gorman , Andrew Morton , Andres Freund , Linux-MM , LKML , sivanich@sgi.com On 04/08/2014 03:53 PM, Robert Haas wrote: > In an ideal world, the kernel would put the hottest pages on the local > node and the less-hot pages on remote nodes, moving pages around as > the workload shifts. In practice, that's probably pretty hard. > Fortunately, it's not nearly as important as making sure we don't > unnecessarily hit the disk, which is infinitely slower than any memory > bank. Even if the kernel could do this, we would *still* have to disable it for PostgreSQL, since our double-buffering makes our pages look "cold" to the kernel ... as discussed. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f42.google.com (mail-pa0-f42.google.com [209.85.220.42]) by kanga.kvack.org (Postfix) with ESMTP id 53E4F6B0031 for ; Tue, 8 Apr 2014 18:58:26 -0400 (EDT) Received: by mail-pa0-f42.google.com with SMTP id fb1so1684558pad.1 for ; Tue, 08 Apr 2014 15:58:25 -0700 (PDT) Received: from qmta10.emeryville.ca.mail.comcast.net (qmta10.emeryville.ca.mail.comcast.net. 
[2001:558:fe2d:43:76:96:30:17]) by mx.google.com with ESMTP id w4si1742779paa.34.2014.04.08.15.58.25 for ; Tue, 08 Apr 2014 15:58:25 -0700 (PDT) Date: Tue, 8 Apr 2014 17:58:21 -0500 (CDT) From: Christoph Lameter Subject: Re: [PATCH 0/2] Disable zone_reclaim_mode by default In-Reply-To: Message-ID: References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <5343A494.9070707@suse.cz> Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Robert Haas Cc: Vlastimil Babka , Mel Gorman , Andrew Morton , Josh Berkus , Andres Freund , Linux-MM , LKML , sivanich@sgi.com On Tue, 8 Apr 2014, Robert Haas wrote: > Well, as Josh quite rightly said, the hit from accessing remote memory > is never going to be as large as the hit from disk. If and when there > is a machine where remote memory is more expensive to access than > disk, that's a good argument for zone_reclaim_mode. But I don't > believe that's anywhere close to being true today, even on an 8-socket > machine with an SSD. I am not sure how disk figures into this? The tradeoff is zone reclaim vs. the aggregate performance degradation of the remote memory accesses. That depends on the cacheability of the app and the scale of memory accesses. The reason that zone reclaim is on by default is that off node accesses are a big performance hit on large scale NUMA systems (like ScaleMP and SGI). Zone reclaim was written *because* those systems experienced severe performance degradation. On the tightly coupled 4 and 8 node systems there does not seem to be a benefit from what I hear. > Now, perhaps the fear is that if we access that remote memory > *repeatedly* the aggregate cost will exceed what it would have cost to > fault that page into the local node just once. But it takes a lot of > accesses for that to be true, and most of the time you won't get them. 
> Even if you do, I bet many workloads will prefer even performance > across all the accesses over a very slow first access followed by > slightly faster subsequent accesses. Many HPC workloads prefer the opposite. > In an ideal world, the kernel would put the hottest pages on the local > node and the less-hot pages on remote nodes, moving pages around as > the workload shifts. In practice, that's probably pretty hard. > Fortunately, it's not nearly as important as making sure we don't > unnecessarily hit the disk, which is infinitely slower than any memory > bank. Shifting pages involves similar tradeoffs as zone reclaim vs. remote allocations. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f49.google.com (mail-wg0-f49.google.com [74.125.82.49]) by kanga.kvack.org (Postfix) with ESMTP id EBAE86B0035 for ; Tue, 8 Apr 2014 19:26:48 -0400 (EDT) Received: by mail-wg0-f49.google.com with SMTP id a1so1703110wgh.20 for ; Tue, 08 Apr 2014 16:26:47 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTPS id v2si1852882wix.2.2014.04.08.16.26.45 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 08 Apr 2014 16:26:46 -0700 (PDT) Date: Wed, 9 Apr 2014 00:26:42 +0100 From: Mel Gorman Subject: Re: [PATCH 0/2] Disable zone_reclaim_mode by default Message-ID: <20140408232642.GR7292@suse.de> References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <5343A494.9070707@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Christoph Lameter Cc: Robert Haas , Vlastimil Babka , Andrew Morton , Josh Berkus , Andres Freund , Linux-MM , LKML , sivanich@sgi.com On Tue, Apr 08, 2014 at 05:58:21PM -0500, Christoph Lameter wrote: > On Tue, 8 Apr 2014, Robert Haas wrote: > > > Well, as Josh quite rightly said, the hit from accessing remote memory > > is never going to be as large as the hit from disk. If and when there > > is a machine where remote memory is more expensive to access than > > disk, that's a good argument for zone_reclaim_mode. But I don't > > believe that's anywhere close to being true today, even on an 8-socket > > machine with an SSD. > > I am nost sure how disk figures into this? > It's a matter of perspective. For those that are running file servers, databases and the like they don't see the remote accesses, they see their page cache getting reclaimed but not all of those users understand why because they are not NUMA aware. This is why they are seeing the cost of zone_reclaim_mode to be IO-related. I think pretty much 100% of the bug reports I've seen related to zone_reclaim_mode were due to IO-intensive workloads and the user not recognising why page cache was getting reclaimed aggressively. > The tradeoff is zone reclaim vs. the aggregate performance > degradation of the remote memory accesses. That depends on the > cacheability of the app and the scale of memory accesses. > For HPC, yes. 
> The reason that zone reclaim is on by default is that off node accesses > are a big performance hit on large scale NUMA systems (like ScaleMP and > SGI). Zone reclaim was written *because* those system experienced severe > performance degradation. > Yes, this is understood. However, those same people already know how to use cpusets, NUMA bindings and how to tune their workload to partition it into the nodes. From a NUMA perspective they are relatively sophisticated and know how and when to set zone_reclaim_mode. At least on any bug report I've seen related to these really large machines, they were already using cpusets. This is why I think the default for zone_reclaim should now be off because it helps the common case. > On the tightly coupled 4 and 8 node systems there does not seem to > be a benefit from what I hear. > > > Now, perhaps the fear is that if we access that remote memory > > *repeatedly* the aggregate cost will exceed what it would have cost to > > fault that page into the local node just once. But it takes a lot of > > accesses for that to be true, and most of the time you won't get them. > > Even if you do, I bet many workloads will prefer even performance > > across all the accesses over a very slow first access followed by > > slightly faster subsequent accesses. > > Many HPC workloads prefer the opposite. > And they know how to tune accordingly. > > In an ideal world, the kernel would put the hottest pages on the local > > node and the less-hot pages on remote nodes, moving pages around as > > the workload shifts. In practice, that's probably pretty hard. > > Fortunately, it's not nearly as important as making sure we don't > > unnecessarily hit the disk, which is infinitely slower than any memory > > bank. > > Shifting pages involves similar tradeoffs as zone reclaim vs. remote > allocations. In practice it really is hard for the kernel to do this automatically. 
Automatic NUMA balancing will help if the data is mapped but not if it's buffered read/writes because there is no hinting information available right now. At some point we may need to tackle IO locality but it'll take time for users to get experience with automatic balancing as it is before taking further steps. That's an aside to the current discussion. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f173.google.com (mail-wi0-f173.google.com [209.85.212.173]) by kanga.kvack.org (Postfix) with ESMTP id 0A8DF6B0031 for ; Wed, 9 Apr 2014 09:08:29 -0400 (EDT) Received: by mail-wi0-f173.google.com with SMTP id z2so8846064wiv.6 for ; Wed, 09 Apr 2014 06:08:27 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id g5si405586wjw.110.2014.04.09.06.08.24 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 09 Apr 2014 06:08:24 -0700 (PDT) Date: Wed, 9 Apr 2014 14:08:19 +0100 From: Mel Gorman Subject: Re: [PATCH 0/2] Disable zone_reclaim_mode by default Message-ID: <20140409130819.GS7292@suse.de> References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <5343A494.9070707@suse.cz> <53445481.3030202@agliodbs.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <53445481.3030202@agliodbs.com> Sender: owner-linux-mm@kvack.org List-ID: To: Josh Berkus Cc: Robert Haas , Christoph Lameter , Vlastimil Babka , Andrew Morton , Andres Freund , Linux-MM , LKML , sivanich@sgi.com On Tue, Apr 08, 2014 at 03:56:49PM -0400, Josh Berkus wrote: > On 04/08/2014 03:53 PM, Robert Haas wrote: > > In an ideal world, the kernel would put the hottest pages on the local > > node and the less-hot pages on remote nodes, moving pages around as > > 
the workload shifts. In practice, that's probably pretty hard. > > Fortunately, it's not nearly as important as making sure we don't > > unnecessarily hit the disk, which is infinitely slower than any memory > > bank. > > Even if the kernel could do this, we would *still* have to disable it > for PostgreSQL, since our double-buffering makes our pages look "cold" > to the kernel ... as discussed. > If it's the shared mapping that is being used then automatic NUMA balancing should migrate those pages to a node local to the CPU accessing it but how well it works will partially depend on how much those accesses move around. It's independent of the zone_reclaim_mode issue. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qa0-f44.google.com (mail-qa0-f44.google.com [209.85.216.44]) by kanga.kvack.org (Postfix) with ESMTP id A2A7E6B0031 for ; Thu, 10 Apr 2014 06:26:39 -0400 (EDT) Received: by mail-qa0-f44.google.com with SMTP id hw13so3632944qab.17 for ; Thu, 10 Apr 2014 03:26:39 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. 
[209.132.183.28]) by mx.google.com with ESMTP id b6si1666823qae.168.2014.04.10.03.26.38 for ; Thu, 10 Apr 2014 03:26:38 -0700 (PDT) Message-ID: <534671DB.50802@redhat.com> Date: Thu, 10 Apr 2014 11:26:35 +0100 From: Jeremy Harris MIME-Version: 1.0 Subject: Re: [PATCH 0/2] Disable zone_reclaim_mode by default References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <5343A494.9070707@suse.cz> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: Cc: Linux-MM , LKML On 08/04/14 23:58, Christoph Lameter wrote: > The reason that zone reclaim is on by default is that off node accesses > are a big performance hit on large scale NUMA systems (like ScaleMP and > SGI). Zone reclaim was written *because* those system experienced severe > performance degradation. > > On the tightly coupled 4 and 8 node systems there does not seem to > be a benefit from what I hear. In principle, is this difference in distance something the kernel could measure? -- Cheers, Jeremy -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ee0-f41.google.com (mail-ee0-f41.google.com [74.125.83.41]) by kanga.kvack.org (Postfix) with ESMTP id 049096B003A for ; Fri, 18 Apr 2014 11:49:21 -0400 (EDT) Received: by mail-ee0-f41.google.com with SMTP id t10so1765516eei.28 for ; Fri, 18 Apr 2014 08:49:21 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTPS id g45si40741409eev.340.2014.04.18.08.49.20 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Fri, 18 Apr 2014 08:49:20 -0700 (PDT) Date: Fri, 18 Apr 2014 17:49:18 +0200 From: Michal Hocko Subject: Re: [PATCH 0/2] Disable zone_reclaim_mode by default Message-ID: <20140418154918.GD4523@dhcp22.suse.cz> References: <1396910068-11637-1-git-send-email-mgorman@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1396910068-11637-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Andrew Morton , Robert Haas , Josh Berkus , Andres Freund , Christoph Lameter , Linux-MM , LKML On Mon 07-04-14 23:34:26, Mel Gorman wrote: > When it was introduced, zone_reclaim_mode made sense as NUMA distances > punished and workloads were generally partitioned to fit into a NUMA > node. NUMA machines are now common but few of the workloads are NUMA-aware > and it's routine to see major performance due to zone_reclaim_mode being > disabled but relatively few can identify the problem. > > Those that require zone_reclaim_mode are likely to be able to detect when > it needs to be enabled and tune appropriately so lets have a sensible > default for the bulk of users. > > Documentation/sysctl/vm.txt | 17 +++++++++-------- > include/linux/mmzone.h | 1 - > mm/page_alloc.c | 17 +---------------- > 3 files changed, 10 insertions(+), 25 deletions(-) Auto-enabling caused so many reports in the past that it is definitely much better to not be clever and let admins enable zone_reclaim where it is appropriate instead. For both patches. Acked-by: Michal Hocko -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qc0-f169.google.com (mail-qc0-f169.google.com [209.85.216.169]) by kanga.kvack.org (Postfix) with ESMTP id E1B896B0031 for ; Fri, 18 Apr 2014 12:44:37 -0400 (EDT) Received: by mail-qc0-f169.google.com with SMTP id i17so1902107qcy.0 for ; Fri, 18 Apr 2014 09:44:37 -0700 (PDT) Received: from qmta04.emeryville.ca.mail.comcast.net (qmta04.emeryville.ca.mail.comcast.net. [2001:558:fe2d:43:76:96:30:40]) by mx.google.com with ESMTP id r70si12189313qga.92.2014.04.18.09.44.36 for ; Fri, 18 Apr 2014 09:44:37 -0700 (PDT) Date: Fri, 18 Apr 2014 11:44:34 -0500 (CDT) From: Christoph Lameter Subject: Re: [PATCH 0/2] Disable zone_reclaim_mode by default In-Reply-To: <20140418154918.GD4523@dhcp22.suse.cz> Message-ID: References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <20140418154918.GD4523@dhcp22.suse.cz> Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: Mel Gorman , Andrew Morton , Robert Haas , Josh Berkus , Andres Freund , Linux-MM , LKML On Fri, 18 Apr 2014, Michal Hocko wrote: > Auto-enabling caused so many reports in the past that it is definitely > much better to not be clever and let admins enable zone_reclaim where it > is appropriate instead. > > For both patches. > Acked-by: Michal Hocko I did not get any objections from SGI either. Reviewed-by: Christoph Lameter -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756020AbaDGWef (ORCPT ); Mon, 7 Apr 2014 18:34:35 -0400 Received: from cantor2.suse.de ([195.135.220.15]:49503 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753771AbaDGWed (ORCPT ); Mon, 7 Apr 2014 18:34:33 -0400 From: Mel Gorman To: Andrew Morton Cc: Robert Haas , Josh Berkus , Andres Freund , Christoph Lameter , Linux-MM , LKML , Mel Gorman Subject: [PATCH 2/2] mm: page_alloc: Do not cache reclaim distances Date: Mon, 7 Apr 2014 23:34:28 +0100 Message-Id: <1396910068-11637-3-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 1.8.4.5 In-Reply-To: <1396910068-11637-1-git-send-email-mgorman@suse.de> References: <1396910068-11637-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org pgdat->reclaim_nodes tracks if a remote node is allowed to be reclaimed by zone_reclaim due to its distance. As it is expected that zone_reclaim_mode will be rarely enabled it is unreasonable for all machines to take a penalty. Fortunately, the zone_reclaim_mode() path is already slow and it is the path that takes the hit. 
Signed-off-by: Mel Gorman --- include/linux/mmzone.h | 1 - mm/page_alloc.c | 15 +-------------- 2 files changed, 1 insertion(+), 15 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 9b61b9b..564b169 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -757,7 +757,6 @@ typedef struct pglist_data { unsigned long node_spanned_pages; /* total size of physical page range, including holes */ int node_id; - nodemask_t reclaim_nodes; /* Nodes allowed to reclaim from */ wait_queue_head_t kswapd_wait; wait_queue_head_t pfmemalloc_wait; struct task_struct *kswapd; /* Protected by lock_memory_hotplug() */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index a256f85..574928e 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1863,16 +1863,7 @@ static bool zone_local(struct zone *local_zone, struct zone *zone) static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) { - return node_isset(local_zone->node, zone->zone_pgdat->reclaim_nodes); -} - -static void __paginginit init_zone_allows_reclaim(int nid) -{ - int i; - - for_each_online_node(i) - if (node_distance(nid, i) <= RECLAIM_DISTANCE) - node_set(i, NODE_DATA(nid)->reclaim_nodes); + return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) < RECLAIM_DISTANCE; } #else /* CONFIG_NUMA */ @@ -1906,9 +1897,6 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) return true; } -static inline void init_zone_allows_reclaim(int nid) -{ -} #endif /* CONFIG_NUMA */ /* @@ -4917,7 +4905,6 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size, pgdat->node_id = nid; pgdat->node_start_pfn = node_start_pfn; - init_zone_allows_reclaim(nid); #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP get_pfn_range_for_nid(nid, &start_pfn, &end_pfn); #endif -- 1.8.4.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755960AbaDGWed (ORCPT ); Mon, 7 Apr 2014 18:34:33 
-0400 Received: from cantor2.suse.de ([195.135.220.15]:49485 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753771AbaDGWeb (ORCPT ); Mon, 7 Apr 2014 18:34:31 -0400 From: Mel Gorman To: Andrew Morton Cc: Robert Haas , Josh Berkus , Andres Freund , Christoph Lameter , Linux-MM , LKML , Mel Gorman Subject: [PATCH 0/2] Disable zone_reclaim_mode by default Date: Mon, 7 Apr 2014 23:34:26 +0100 Message-Id: <1396910068-11637-1-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 1.8.4.5 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org When it was introduced, zone_reclaim_mode made sense as NUMA distances punished and workloads were generally partitioned to fit into a NUMA node. NUMA machines are now common but few of the workloads are NUMA-aware and it's routine to see major performance due to zone_reclaim_mode being disabled but relatively few can identify the problem. Those that require zone_reclaim_mode are likely to be able to detect when it needs to be enabled and tune appropriately so lets have a sensible default for the bulk of users. 
Documentation/sysctl/vm.txt | 17 +++++++++-------- include/linux/mmzone.h | 1 - mm/page_alloc.c | 17 +---------------- 3 files changed, 10 insertions(+), 25 deletions(-) -- 1.8.4.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756086AbaDGWfE (ORCPT ); Mon, 7 Apr 2014 18:35:04 -0400 Received: from cantor2.suse.de ([195.135.220.15]:49496 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755855AbaDGWec (ORCPT ); Mon, 7 Apr 2014 18:34:32 -0400 From: Mel Gorman To: Andrew Morton Cc: Robert Haas , Josh Berkus , Andres Freund , Christoph Lameter , Linux-MM , LKML , Mel Gorman Subject: [PATCH 1/2] mm: Disable zone_reclaim_mode by default Date: Mon, 7 Apr 2014 23:34:27 +0100 Message-Id: <1396910068-11637-2-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 1.8.4.5 In-Reply-To: <1396910068-11637-1-git-send-email-mgorman@suse.de> References: <1396910068-11637-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org zone_reclaim_mode causes processes to prefer reclaiming memory from local node instead of spilling over to other nodes. This made sense initially when NUMA machines were almost exclusively HPC and the workload was partitioned into nodes. The NUMA penalties were sufficiently high to justify reclaiming the memory. On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default. Users that are sophisticated enough to know they need zone_reclaim_mode will detect it. 
Signed-off-by: Mel Gorman --- Documentation/sysctl/vm.txt | 17 +++++++++-------- mm/page_alloc.c | 2 -- 2 files changed, 9 insertions(+), 10 deletions(-) diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index d614a9b..ff5da70 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -751,16 +751,17 @@ This is value ORed together of 2 = Zone reclaim writes dirty pages out 4 = Zone reclaim swaps pages -zone_reclaim_mode is set during bootup to 1 if it is determined that pages -from remote zones will cause a measurable performance reduction. The -page allocator will then reclaim easily reusable pages (those page -cache pages that are currently not used) before allocating off node pages. - -It may be beneficial to switch off zone reclaim if the system is -used for a file server and all of memory should be used for caching files -from disk. In that case the caching effect is more important than +zone_reclaim_mode is disabled by default. For file servers or workloads +that benefit from having their data cached, zone_reclaim_mode should be +left disabled as the caching effect is likely to be more important than data locality. +zone_reclaim may be enabled if it's known that the workload is partitioned +such that each partition fits within a NUMA node and that accessing remote +memory would cause a measurable performance reduction. The page allocator +will then reclaim easily reusable pages (those page cache pages that are +currently not used) before allocating off node pages. + Allowing zone reclaim to write out pages stops processes that are writing large amounts of data from dirtying pages on other nodes. 
Zone reclaim will write out dirty pages if a zone fills up and so effectively diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3bac76a..a256f85 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1873,8 +1873,6 @@ static void __paginginit init_zone_allows_reclaim(int nid) for_each_online_node(i) if (node_distance(nid, i) <= RECLAIM_DISTANCE) node_set(i, NODE_DATA(nid)->reclaim_nodes); - else - zone_reclaim_mode = 1; } #else /* CONFIG_NUMA */ -- 1.8.4.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755624AbaDGXfY (ORCPT ); Mon, 7 Apr 2014 19:35:24 -0400 Received: from zene.cmpxchg.org ([85.214.230.12]:58471 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754637AbaDGXfX (ORCPT ); Mon, 7 Apr 2014 19:35:23 -0400 Date: Mon, 7 Apr 2014 19:35:11 -0400 From: Johannes Weiner To: Mel Gorman Cc: Andrew Morton , Robert Haas , Josh Berkus , Andres Freund , Christoph Lameter , Linux-MM , LKML Subject: Re: [PATCH 1/2] mm: Disable zone_reclaim_mode by default Message-ID: <20140407233511.GO4407@cmpxchg.org> References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <1396910068-11637-2-git-send-email-mgorman@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1396910068-11637-2-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Apr 07, 2014 at 11:34:27PM +0100, Mel Gorman wrote: > zone_reclaim_mode causes processes to prefer reclaiming memory from local > node instead of spilling over to other nodes. This made sense initially when > NUMA machines were almost exclusively HPC and the workload was partitioned > into nodes. The NUMA penalties were sufficiently high to justify reclaiming > the memory. 
On current machines and workloads it is often the case that > zone_reclaim_mode destroys performance but not all users know how to detect > this. Favour the common case and disable it by default. Users that are > sophisticated enough to know they need zone_reclaim_mode will detect it. > > Signed-off-by: Mel Gorman Acked-by: Johannes Weiner From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755984AbaDGXhG (ORCPT ); Mon, 7 Apr 2014 19:37:06 -0400 Received: from zene.cmpxchg.org ([85.214.230.12]:58479 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755330AbaDGXhE (ORCPT ); Mon, 7 Apr 2014 19:37:04 -0400 Date: Mon, 7 Apr 2014 19:36:57 -0400 From: Johannes Weiner To: Mel Gorman Cc: Andrew Morton , Robert Haas , Josh Berkus , Andres Freund , Christoph Lameter , Linux-MM , LKML Subject: Re: [PATCH 2/2] mm: page_alloc: Do not cache reclaim distances Message-ID: <20140407233657.GP4407@cmpxchg.org> References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <1396910068-11637-3-git-send-email-mgorman@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1396910068-11637-3-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Apr 07, 2014 at 11:34:28PM +0100, Mel Gorman wrote: > pgdat->reclaim_nodes tracks if a remote node is allowed to be reclaimed by > zone_reclaim due to its distance. As it is expected that zone_reclaim_mode > will be rarely enabled it is unreasonable for all machines to take a penalty. > Fortunately, the zone_reclaim_mode() path is already slow and it is the path > that takes the hit. 
> > Signed-off-by: Mel Gorman Acked-by: Johannes Weiner From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756084AbaDHBSq (ORCPT ); Mon, 7 Apr 2014 21:18:46 -0400 Received: from cn.fujitsu.com ([59.151.112.132]:21027 "EHLO heian.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-FAIL) by vger.kernel.org with ESMTP id S1754690AbaDHBSo (ORCPT ); Mon, 7 Apr 2014 21:18:44 -0400 X-IronPort-AV: E=Sophos;i="4.97,814,1389715200"; d="scan'208";a="28974175" Message-ID: <53434E28.4040304@cn.fujitsu.com> Date: Tue, 8 Apr 2014 09:17:28 +0800 From: Zhang Yanfei User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20131030 Thunderbird/17.0.10 MIME-Version: 1.0 To: Mel Gorman CC: Andrew Morton , Robert Haas , Josh Berkus , Andres Freund , Christoph Lameter , Linux-MM , LKML Subject: Re: [PATCH 1/2] mm: Disable zone_reclaim_mode by default References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <1396910068-11637-2-git-send-email-mgorman@suse.de> In-Reply-To: <1396910068-11637-2-git-send-email-mgorman@suse.de> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 7bit X-Originating-IP: [10.167.226.197] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 04/08/2014 06:34 AM, Mel Gorman wrote: > zone_reclaim_mode causes processes to prefer reclaiming memory from local > node instead of spilling over to other nodes. This made sense initially when > NUMA machines were almost exclusively HPC and the workload was partitioned > into nodes. The NUMA penalties were sufficiently high to justify reclaiming > the memory. On current machines and workloads it is often the case that > zone_reclaim_mode destroys performance but not all users know how to detect > this. Favour the common case and disable it by default. Users that are > sophisticated enough to know they need zone_reclaim_mode will detect it. 
> > Signed-off-by: Mel Gorman Reviewed-by: Zhang Yanfei > --- > Documentation/sysctl/vm.txt | 17 +++++++++-------- > mm/page_alloc.c | 2 -- > 2 files changed, 9 insertions(+), 10 deletions(-) > > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt > index d614a9b..ff5da70 100644 > --- a/Documentation/sysctl/vm.txt > +++ b/Documentation/sysctl/vm.txt > @@ -751,16 +751,17 @@ This is value ORed together of > 2 = Zone reclaim writes dirty pages out > 4 = Zone reclaim swaps pages > > -zone_reclaim_mode is set during bootup to 1 if it is determined that pages > -from remote zones will cause a measurable performance reduction. The > -page allocator will then reclaim easily reusable pages (those page > -cache pages that are currently not used) before allocating off node pages. > - > -It may be beneficial to switch off zone reclaim if the system is > -used for a file server and all of memory should be used for caching files > -from disk. In that case the caching effect is more important than > +zone_reclaim_mode is disabled by default. For file servers or workloads > +that benefit from having their data cached, zone_reclaim_mode should be > +left disabled as the caching effect is likely to be more important than > data locality. > > +zone_reclaim may be enabled if it's known that the workload is partitioned > +such that each partition fits within a NUMA node and that accessing remote > +memory would cause a measurable performance reduction. The page allocator > +will then reclaim easily reusable pages (those page cache pages that are > +currently not used) before allocating off node pages. > + > Allowing zone reclaim to write out pages stops processes that are > writing large amounts of data from dirtying pages on other nodes. 
Zone > reclaim will write out dirty pages if a zone fills up and so effectively > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 3bac76a..a256f85 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1873,8 +1873,6 @@ static void __paginginit init_zone_allows_reclaim(int nid) > for_each_online_node(i) > if (node_distance(nid, i) <= RECLAIM_DISTANCE) > node_set(i, NODE_DATA(nid)->reclaim_nodes); > - else > - zone_reclaim_mode = 1; > } > > #else /* CONFIG_NUMA */ > -- Thanks. Zhang Yanfei From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756123AbaDHBTI (ORCPT ); Mon, 7 Apr 2014 21:19:08 -0400 Received: from cn.fujitsu.com ([59.151.112.132]:35083 "EHLO heian.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-FAIL) by vger.kernel.org with ESMTP id S1754690AbaDHBTD (ORCPT ); Mon, 7 Apr 2014 21:19:03 -0400 X-IronPort-AV: E=Sophos;i="4.97,814,1389715200"; d="scan'208";a="28974183" Message-ID: <53434E41.1010306@cn.fujitsu.com> Date: Tue, 8 Apr 2014 09:17:53 +0800 From: Zhang Yanfei User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20131030 Thunderbird/17.0.10 MIME-Version: 1.0 To: Mel Gorman CC: Andrew Morton , Robert Haas , Josh Berkus , Andres Freund , Christoph Lameter , Linux-MM , LKML Subject: Re: [PATCH 2/2] mm: page_alloc: Do not cache reclaim distances References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <1396910068-11637-3-git-send-email-mgorman@suse.de> In-Reply-To: <1396910068-11637-3-git-send-email-mgorman@suse.de> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 7bit X-Originating-IP: [10.167.226.197] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 04/08/2014 06:34 AM, Mel Gorman wrote: > pgdat->reclaim_nodes tracks if a remote node is allowed to be reclaimed by > zone_reclaim due to its distance. 
As it is expected that zone_reclaim_mode > will be rarely enabled it is unreasonable for all machines to take a penalty. > Fortunately, the zone_reclaim_mode() path is already slow and it is the path > that takes the hit. > > Signed-off-by: Mel Gorman Reviewed-by: Zhang Yanfei > --- > include/linux/mmzone.h | 1 - > mm/page_alloc.c | 15 +-------------- > 2 files changed, 1 insertion(+), 15 deletions(-) > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 9b61b9b..564b169 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -757,7 +757,6 @@ typedef struct pglist_data { > unsigned long node_spanned_pages; /* total size of physical page > range, including holes */ > int node_id; > - nodemask_t reclaim_nodes; /* Nodes allowed to reclaim from */ > wait_queue_head_t kswapd_wait; > wait_queue_head_t pfmemalloc_wait; > struct task_struct *kswapd; /* Protected by lock_memory_hotplug() */ > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index a256f85..574928e 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1863,16 +1863,7 @@ static bool zone_local(struct zone *local_zone, struct zone *zone) > > static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) > { > - return node_isset(local_zone->node, zone->zone_pgdat->reclaim_nodes); > -} > - > -static void __paginginit init_zone_allows_reclaim(int nid) > -{ > - int i; > - > - for_each_online_node(i) > - if (node_distance(nid, i) <= RECLAIM_DISTANCE) > - node_set(i, NODE_DATA(nid)->reclaim_nodes); > + return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) < RECLAIM_DISTANCE; > } > > #else /* CONFIG_NUMA */ > @@ -1906,9 +1897,6 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) > return true; > } > > -static inline void init_zone_allows_reclaim(int nid) > -{ > -} > #endif /* CONFIG_NUMA */ > > /* > @@ -4917,7 +4905,6 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size, > > pgdat->node_id = nid; > 
pgdat->node_start_pfn = node_start_pfn; > - init_zone_allows_reclaim(nid); > #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP > get_pfn_range_for_nid(nid, &start_pfn, &end_pfn); > #endif > -- Thanks. Zhang Yanfei From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756360AbaDHHOw (ORCPT ); Tue, 8 Apr 2014 03:14:52 -0400 Received: from moutng.kundenserver.de ([212.227.126.131]:55315 "EHLO moutng.kundenserver.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750789AbaDHHOv (ORCPT ); Tue, 8 Apr 2014 03:14:51 -0400 Date: Tue, 8 Apr 2014 09:14:43 +0200 From: Andres Freund To: Mel Gorman Cc: Andrew Morton , Robert Haas , Josh Berkus , Christoph Lameter , Linux-MM , LKML Subject: Re: [PATCH 1/2] mm: Disable zone_reclaim_mode by default Message-ID: <20140408071443.GQ4161@awork2.anarazel.de> References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <1396910068-11637-2-git-send-email-mgorman@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1396910068-11637-2-git-send-email-mgorman@suse.de> X-Provags-ID: V02:K0:ukBdsTxA/Y/+VkRDXQ4RLVJZIKyk02ikFdMqNz2rbP5 lowcK9eOrJ6GMK6m75n7Hgm1mXczs4MibWItrNzspraCTILe6p EXpwWM6R1ymuBhY7uePo1mOuBHc7RP0MNJvf5V81gX+hZ7TWXE T2IC/RmnUSYJhWflHxPxRawK/QEwsc/102v/EBj8pcaL54O2mc C/qiQm0ZJmjsiOQUxUZp8CtS+kg7355Bc2iBLWpQ4mMhS8QcF7 x5u2RrhYl2+oijUV2lus6xqLvx1Kdfp++cf7th1kDJUnR9zhh9 r3jcoIGWilzVJAqW4ZV+aJlzYSK6pTFfJE3qMxaOMATNuXe6LR Y5SvhVZ6G7O52Oddn91Hq62RRz6bRXZm0w4wYHxUs Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi, On 2014-04-07 23:34:27 +0100, Mel Gorman wrote: > zone_reclaim_mode causes processes to prefer reclaiming memory from local > node instead of spilling over to other nodes. This made sense initially when > NUMA machines were almost exclusively HPC and the workload was partitioned > into nodes. 
The NUMA penalties were sufficiently high to justify reclaiming > the memory. On current machines and workloads it is often the case that > zone_reclaim_mode destroys performance but not all users know how to detect > this. Favour the common case and disable it by default. Users that are > sophisticated enough to know they need zone_reclaim_mode will detect it. Unsurprisingly I am in favor of this. > Documentation/sysctl/vm.txt | 17 +++++++++-------- > mm/page_alloc.c | 2 -- > 2 files changed, 9 insertions(+), 10 deletions(-) But I think linux/topology.h's comment about RECLAIM_DISTANCE should be adapted as well. Thanks, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756154AbaDHH0Q (ORCPT ); Tue, 8 Apr 2014 03:26:16 -0400 Received: from cantor2.suse.de ([195.135.220.15]:53282 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750785AbaDHH0P (ORCPT ); Tue, 8 Apr 2014 03:26:15 -0400 Message-ID: <5343A494.9070707@suse.cz> Date: Tue, 08 Apr 2014 09:26:12 +0200 From: Vlastimil Babka User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.3.0 MIME-Version: 1.0 To: Mel Gorman , Andrew Morton CC: Robert Haas , Josh Berkus , Andres Freund , Christoph Lameter , Linux-MM , LKML Subject: Re: [PATCH 0/2] Disable zone_reclaim_mode by default References: <1396910068-11637-1-git-send-email-mgorman@suse.de> In-Reply-To: <1396910068-11637-1-git-send-email-mgorman@suse.de> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 04/08/2014 12:34 AM, Mel Gorman wrote: > When it was introduced, zone_reclaim_mode made sense as NUMA distances > punished and workloads were generally partitioned to fit into a NUMA > 
node. NUMA machines are now common but few of the workloads are NUMA-aware > and it's routine to see major performance due to zone_reclaim_mode being > disabled but relatively few can identify the problem. ^ I think you meant "enabled" here? Just in case the cover letter goes to the changelog... Vlastimil > Those that require zone_reclaim_mode are likely to be able to detect when > it needs to be enabled and tune appropriately so lets have a sensible > default for the bulk of users. > > Documentation/sysctl/vm.txt | 17 +++++++++-------- > include/linux/mmzone.h | 1 - > mm/page_alloc.c | 17 +---------------- > 3 files changed, 10 insertions(+), 25 deletions(-) > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757080AbaDHOOK (ORCPT ); Tue, 8 Apr 2014 10:14:10 -0400 Received: from qmta14.emeryville.ca.mail.comcast.net ([76.96.27.212]:44639 "EHLO qmta14.emeryville.ca.mail.comcast.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756782AbaDHOOJ (ORCPT ); Tue, 8 Apr 2014 10:14:09 -0400 Date: Tue, 8 Apr 2014 09:14:05 -0500 (CDT) From: Christoph Lameter X-X-Sender: cl@nuc To: sivanich@sgi.com cc: Mel Gorman , Andrew Morton , Robert Haas , Josh Berkus , Andres Freund , Linux-MM , LKML Subject: Re: [PATCH 1/2] mm: Disable zone_reclaim_mode by default In-Reply-To: <1396910068-11637-2-git-send-email-mgorman@suse.de> Message-ID: References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <1396910068-11637-2-git-send-email-mgorman@suse.de> Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 7 Apr 2014, Mel Gorman wrote: > zone_reclaim_mode causes processes to prefer reclaiming memory from local > node instead of spilling over to other nodes. This made sense initially when > NUMA machines were almost exclusively HPC and the workload was partitioned > into nodes. 
The NUMA penalties were sufficiently high to justify reclaiming > the memory. On current machines and workloads it is often the case that > zone_reclaim_mode destroys performance but not all users know how to detect > this. Favour the common case and disable it by default. Users that are > sophisticated enough to know they need zone_reclaim_mode will detect it. Ok that is going to require SGI machines to deal with zone_reclaim configurations on bootup. Dimitri? Any comments? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757100AbaDHORJ (ORCPT ); Tue, 8 Apr 2014 10:17:09 -0400 Received: from qmta15.emeryville.ca.mail.comcast.net ([76.96.27.228]:44057 "EHLO qmta15.emeryville.ca.mail.comcast.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756869AbaDHORH (ORCPT ); Tue, 8 Apr 2014 10:17:07 -0400 Date: Tue, 8 Apr 2014 09:17:04 -0500 (CDT) From: Christoph Lameter X-X-Sender: cl@nuc To: Vlastimil Babka cc: Mel Gorman , Andrew Morton , Robert Haas , Josh Berkus , Andres Freund , Linux-MM , LKML , sivanich@sgi.com Subject: Re: [PATCH 0/2] Disable zone_reclaim_mode by default In-Reply-To: <5343A494.9070707@suse.cz> Message-ID: References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <5343A494.9070707@suse.cz> Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, 8 Apr 2014, Vlastimil Babka wrote: > On 04/08/2014 12:34 AM, Mel Gorman wrote: > > When it was introduced, zone_reclaim_mode made sense as NUMA distances > > punished and workloads were generally partitioned to fit into a NUMA > > node. NUMA machines are now common but few of the workloads are NUMA-aware > > and it's routine to see major performance due to zone_reclaim_mode being > > disabled but relatively few can identify the problem. > ^ I think you meant "enabled" here? 
> > Just in case the cover letter goes to the changelog... Correct. Another solution here would be to increase the threshold so that 4 socket machines do not enable zone reclaim by default. The larger the NUMA system is the more memory is off node from the perspective of a processor and the larger the hit from remote memory. On the other hand: The more expensive we make reclaim the less it makes sense to allow zone reclaim to occur. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757185AbaDHO0z (ORCPT ); Tue, 8 Apr 2014 10:26:55 -0400 Received: from moutng.kundenserver.de ([212.227.17.10]:64472 "EHLO moutng.kundenserver.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756605AbaDHO0v (ORCPT ); Tue, 8 Apr 2014 10:26:51 -0400 Date: Tue, 8 Apr 2014 16:26:42 +0200 From: Andres Freund To: Christoph Lameter Cc: Vlastimil Babka , Mel Gorman , Andrew Morton , Robert Haas , Josh Berkus , Linux-MM , LKML , sivanich@sgi.com Subject: Re: [PATCH 0/2] Disable zone_reclaim_mode by default Message-ID: <20140408142642.GU4161@awork2.anarazel.de> References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <5343A494.9070707@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Provags-ID: V02:K0:oWQZg+uCGw3khjqNFVtzcZeCbforjZDOia/Ij3/TJHr 8BL249u78BmFc37Uxz3P+TPPgViu4TpUq/5EIKayPoDxMLzhLl oNUBdwg0zCHhnCLGa4fVJVXs3mPodj9nk2wEoC82Exi0MaW7h4 gXSQpwzdiITOpsjBAP03aW40lb6L24j8bC9JezXzjBfNl9JKjS wjl5evhVfckhGKRihw5PfxwkrFb729jZUS1Af0YoGQ6b6tVz6W OrTnigo9d/C4wupZWEdiLdn1DWQ6RZ9YbdCeniN77kyDnPGtbO 4y1/7IXCdA1hHZNtmYyeIz89d6t6JSczHRHaf/FS37LfBmfRq3 Su4dk3tke/+bsyhgtRu+s0ZRHGcd2fSp5KoYhKs21 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 2014-04-08 09:17:04 -0500, Christoph Lameter wrote: > On Tue, 8 Apr 2014, Vlastimil Babka wrote: > > > On 04/08/2014 12:34 AM, Mel Gorman wrote: > > > 
When it was introduced, zone_reclaim_mode made sense as NUMA distances > > > punished and workloads were generally partitioned to fit into a NUMA > > > node. NUMA machines are now common but few of the workloads are NUMA-aware > > > and it's routine to see major performance due to zone_reclaim_mode being > > > disabled but relatively few can identify the problem. > > ^ I think you meant "enabled" here? > > > > Just in case the cover letter goes to the changelog... > > Correct. > > Another solution here would be to increase the threshold so that > 4 socket machines do not enable zone reclaim by default. The larger the > NUMA system is the more memory is off node from the perspective of a > processor and the larger the hit from remote memory. FWIW, I've hit the problem majorly on 8 socket machines. Those are the largest I have seen so far in postgres scenarios. Everything larger is far less likely to be used as a single node database server, so that's possibly a sensible cutoff. But then, I'd think that special many-socket machines are set up by specialists, who'd know to enable it if it makes sense... 
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932476AbaDHOrm (ORCPT ); Tue, 8 Apr 2014 10:47:42 -0400 Received: from cantor2.suse.de ([195.135.220.15]:35234 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932294AbaDHOrk (ORCPT ); Tue, 8 Apr 2014 10:47:40 -0400 Date: Tue, 8 Apr 2014 15:47:35 +0100 From: Mel Gorman To: Christoph Lameter Cc: sivanich@sgi.com, Andrew Morton , Robert Haas , Josh Berkus , Andres Freund , Linux-MM , LKML Subject: Re: [PATCH 1/2] mm: Disable zone_reclaim_mode by default Message-ID: <20140408144735.GK7292@suse.de> References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <1396910068-11637-2-git-send-email-mgorman@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Apr 08, 2014 at 09:14:05AM -0500, Christoph Lameter wrote: > On Mon, 7 Apr 2014, Mel Gorman wrote: > > > zone_reclaim_mode causes processes to prefer reclaiming memory from local > > node instead of spilling over to other nodes. This made sense initially when > > NUMA machines were almost exclusively HPC and the workload was partitioned > > into nodes. The NUMA penalties were sufficiently high to justify reclaiming > > the memory. On current machines and workloads it is often the case that > > zone_reclaim_mode destroys performance but not all users know how to detect > > this. Favour the common case and disable it by default. Users that are > > sophisticated enough to know they need zone_reclaim_mode will detect it. > > Ok that is going to require SGI machines to deal with zone_reclaim > configurations on bootup. Dimitri? Any comments? 
> The SGI machines are also likely to be managed by system administrators who are both aware of zone_reclaim_mode and know how to evaluate if it should be enabled or not. The pair of patches is really aimed at the common case of 2-8 socket machines running workloads that are not NUMA aware. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757210AbaDHOz1 (ORCPT ); Tue, 8 Apr 2014 10:55:27 -0400 Received: from smtp.01.com ([199.36.142.181]:46749 "EHLO smtp.01.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756780AbaDHOzZ (ORCPT ); Tue, 8 Apr 2014 10:55:25 -0400 X-Greylist: delayed 515 seconds by postgrey-1.27 at vger.kernel.org; Tue, 08 Apr 2014 10:55:25 EDT Message-ID: <53440BD6.5030008@agliodbs.com> Date: Tue, 08 Apr 2014 10:46:46 -0400 From: Josh Berkus Organization: PostgreSQL Experts Inc. User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.4.0 MIME-Version: 1.0 To: Christoph Lameter , Vlastimil Babka CC: Mel Gorman , Andrew Morton , Robert Haas , Andres Freund , Linux-MM , LKML , sivanich@sgi.com Subject: Re: [PATCH 0/2] Disable zone_reclaim_mode by default References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <5343A494.9070707@suse.cz> In-Reply-To: X-Enigmail-Version: 1.6 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 04/08/2014 10:17 AM, Christoph Lameter wrote: > Another solution here would be to increase the threshold so that > 4 socket machines do not enable zone reclaim by default. The larger the > NUMA system is the more memory is off node from the perspective of a > processor and the larger the hit from remote memory. 
8 and 16 socket machines aren't common for nonspecialist workloads *now*, but by the time these changes make it into supported distribution kernels, they may very well be. So having zone_reclaim_mode automatically turn itself on if you have more than 8 sockets would still be a booby-trap ("Boss, I dunno. I installed the additional processors and memory performance went to hell!") For zone_reclaim_mode=1 to be useful on standard servers, both of the following need to be true: 1. the user has to have set CPU affinity for their applications; 2. the applications can't need more than one memory bank worth of cache. The thing is, there is *no way* for Linux to know if the above is true. Now, I can certainly imagine non-HPC workloads for which both of the above would be true; for example, I've set up VMware ESX servers where each VM has one socket and one memory bank. However, if the user knows enough to set up socket affinity, they know enough to set zone_reclaim_mode = 1. The default should cover the know-nothing case, not the experienced specialist case. I'd also argue that there's a fundamental false assumption in the entire algorithm of zone_reclaim_mode, because there is no memory bank which is as distant as disk is, ever. However, if it's off by default, then I don't care. -- Josh Berkus PostgreSQL Experts Inc. 
http://pgexperts.com From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932654AbaDHTxG (ORCPT ); Tue, 8 Apr 2014 15:53:06 -0400 Received: from mail-we0-f170.google.com ([74.125.82.170]:52294 "EHLO mail-we0-f170.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932233AbaDHTxE (ORCPT ); Tue, 8 Apr 2014 15:53:04 -0400 MIME-Version: 1.0 In-Reply-To: References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <5343A494.9070707@suse.cz> Date: Tue, 8 Apr 2014 15:53:02 -0400 Message-ID: Subject: Re: [PATCH 0/2] Disable zone_reclaim_mode by default From: Robert Haas To: Christoph Lameter Cc: Vlastimil Babka , Mel Gorman , Andrew Morton , Josh Berkus , Andres Freund , Linux-MM , LKML , sivanich@sgi.com Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Apr 8, 2014 at 10:17 AM, Christoph Lameter wrote: > Another solution here would be to increase the threshold so that > 4 socket machines do not enable zone reclaim by default. The larger the > NUMA system is the more memory is off node from the perspective of a > processor and the larger the hit from remote memory. Well, as Josh quite rightly said, the hit from accessing remote memory is never going to be as large as the hit from disk. If and when there is a machine where remote memory is more expensive to access than disk, that's a good argument for zone_reclaim_mode. But I don't believe that's anywhere close to being true today, even on an 8-socket machine with an SSD. Now, perhaps the fear is that if we access that remote memory *repeatedly* the aggregate cost will exceed what it would have cost to fault that page into the local node just once. But it takes a lot of accesses for that to be true, and most of the time you won't get them. 
Even if you do, I bet many workloads will prefer even performance across all the accesses over a very slow first access followed by slightly faster subsequent accesses. In an ideal world, the kernel would put the hottest pages on the local node and the less-hot pages on remote nodes, moving pages around as the workload shifts. In practice, that's probably pretty hard. Fortunately, it's not nearly as important as making sure we don't unnecessarily hit the disk, which is infinitely slower than any memory bank. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757713AbaDHT4z (ORCPT ); Tue, 8 Apr 2014 15:56:55 -0400 Received: from smtp.01.com ([199.36.142.181]:37288 "EHLO smtp.01.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757633AbaDHT4x (ORCPT ); Tue, 8 Apr 2014 15:56:53 -0400 Message-ID: <53445481.3030202@agliodbs.com> Date: Tue, 08 Apr 2014 15:56:49 -0400 From: Josh Berkus Organization: PostgreSQL Experts Inc. User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.4.0 MIME-Version: 1.0 To: Robert Haas , Christoph Lameter CC: Vlastimil Babka , Mel Gorman , Andrew Morton , Andres Freund , Linux-MM , LKML , sivanich@sgi.com Subject: Re: [PATCH 0/2] Disable zone_reclaim_mode by default References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <5343A494.9070707@suse.cz> In-Reply-To: X-Enigmail-Version: 1.6 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 04/08/2014 03:53 PM, Robert Haas wrote: > In an ideal world, the kernel would put the hottest pages on the local > node and the less-hot pages on remote nodes, moving pages around as > the workload shifts. In practice, that's probably pretty hard. 
> Fortunately, it's not nearly as important as making sure we don't > unnecessarily hit the disk, which is infinitely slower than any memory > bank. Even if the kernel could do this, we would *still* have to disable it for PostgreSQL, since our double-buffering makes our pages look "cold" to the kernel ... as discussed. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757689AbaDHW61 (ORCPT ); Tue, 8 Apr 2014 18:58:27 -0400 Received: from qmta13.emeryville.ca.mail.comcast.net ([76.96.27.243]:53296 "EHLO qmta13.emeryville.ca.mail.comcast.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756767AbaDHW6Z (ORCPT ); Tue, 8 Apr 2014 18:58:25 -0400 Date: Tue, 8 Apr 2014 17:58:21 -0500 (CDT) From: Christoph Lameter X-X-Sender: cl@nuc To: Robert Haas cc: Vlastimil Babka , Mel Gorman , Andrew Morton , Josh Berkus , Andres Freund , Linux-MM , LKML , sivanich@sgi.com Subject: Re: [PATCH 0/2] Disable zone_reclaim_mode by default In-Reply-To: Message-ID: References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <5343A494.9070707@suse.cz> Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, 8 Apr 2014, Robert Haas wrote: > Well, as Josh quite rightly said, the hit from accessing remote memory > is never going to be as large as the hit from disk. If and when there > is a machine where remote memory is more expensive to access than > disk, that's a good argument for zone_reclaim_mode. But I don't > believe that's anywhere close to being true today, even on an 8-socket > machine with an SSD. I am not sure how disk figures into this? The tradeoff is zone reclaim vs. the aggregate performance degradation of the remote memory accesses. That depends on the cacheability of the app and the scale of memory accesses. 
The reason that zone reclaim is on by default is that off node accesses are a big performance hit on large scale NUMA systems (like ScaleMP and SGI). Zone reclaim was written *because* those systems experienced severe performance degradation. On the tightly coupled 4 and 8 node systems there does not seem to be a benefit from what I hear. > Now, perhaps the fear is that if we access that remote memory > *repeatedly* the aggregate cost will exceed what it would have cost to > fault that page into the local node just once. But it takes a lot of > accesses for that to be true, and most of the time you won't get them. > Even if you do, I bet many workloads will prefer even performance > across all the accesses over a very slow first access followed by > slightly faster subsequent accesses. Many HPC workloads prefer the opposite. > In an ideal world, the kernel would put the hottest pages on the local > node and the less-hot pages on remote nodes, moving pages around as > the workload shifts. In practice, that's probably pretty hard. > Fortunately, it's not nearly as important as making sure we don't > unnecessarily hit the disk, which is infinitely slower than any memory > bank. Shifting pages involves similar tradeoffs as zone reclaim vs. remote allocations. 
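The tradeoff argued over in the messages above, one refault from storage versus a small penalty on every remote access, can be put as a break-even count. The sketch below is illustrative only: the function name `break_even_accesses` and every latency figure are assumptions chosen to make the arithmetic concrete, not measurements reported in this thread.

```python
def break_even_accesses(local_ns, remote_ns, refault_ns):
    """Return how many remote accesses it takes for their accumulated
    extra latency to match or exceed one refault from storage."""
    penalty_ns = remote_ns - local_ns
    if penalty_ns <= 0:
        raise ValueError("remote access must be slower than local access")
    # Integer ceiling division, avoiding floating point.
    return -(-refault_ns // penalty_ns)


# Hypothetical figures: 100 ns local, 150 ns remote, 100 us for an SSD read.
ssd = break_even_accesses(100, 150, 100_000)
# Hypothetical figure: 10 ms for a rotating-disk read.
hdd = break_even_accesses(100, 150, 10_000_000)
print(ssd, hdd)
```

Under these assumed numbers a page would need thousands of remote accesses before reclaiming locally pays for a single storage read; with a much larger remote penalty, as described for the large-scale systems, the count shrinks accordingly.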
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758109AbaDHX0s (ORCPT ); Tue, 8 Apr 2014 19:26:48 -0400 Received: from cantor2.suse.de ([195.135.220.15]:44553 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758067AbaDHX0r (ORCPT ); Tue, 8 Apr 2014 19:26:47 -0400 Date: Wed, 9 Apr 2014 00:26:42 +0100 From: Mel Gorman To: Christoph Lameter Cc: Robert Haas , Vlastimil Babka , Andrew Morton , Josh Berkus , Andres Freund , Linux-MM , LKML , sivanich@sgi.com Subject: Re: [PATCH 0/2] Disable zone_reclaim_mode by default Message-ID: <20140408232642.GR7292@suse.de> References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <5343A494.9070707@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Apr 08, 2014 at 05:58:21PM -0500, Christoph Lameter wrote: > On Tue, 8 Apr 2014, Robert Haas wrote: > > > Well, as Josh quite rightly said, the hit from accessing remote memory > > is never going to be as large as the hit from disk. If and when there > > is a machine where remote memory is more expensive to access than > > disk, that's a good argument for zone_reclaim_mode. But I don't > > believe that's anywhere close to being true today, even on an 8-socket > > machine with an SSD. > > I am not sure how disk figures into this? > It's a matter of perspective. Those running file servers, databases and the like don't see the remote accesses; they see their page cache getting reclaimed, but not all of those users understand why because they are not NUMA-aware. This is why they perceive the cost of zone_reclaim_mode as IO-related. 
I think pretty much 100% of the bug reports I've seen related to zone_reclaim_mode were due to IO-intensive workloads and the user not recognising why page cache was getting reclaimed aggressively. > The tradeoff is zone reclaim vs. the aggregate performance > degradation of the remote memory accesses. That depends on the > cacheability of the app and the scale of memory accesses. > For HPC, yes. > The reason that zone reclaim is on by default is that off node accesses > are a big performance hit on large scale NUMA systems (like ScaleMP and > SGI). Zone reclaim was written *because* those systems experienced severe > performance degradation. > Yes, this is understood. However, those same people already know how to use cpusets, NUMA bindings and how to tune their workload to partition it into the nodes. From a NUMA perspective they are relatively sophisticated and know how and when to set zone_reclaim_mode. At least on any bug report I've seen related to these really large machines, they were already using cpusets. This is why I think the default for zone_reclaim should now be off because it helps the common case. > On the tightly coupled 4 and 8 node systems there does not seem to > be a benefit from what I hear. > > > Now, perhaps the fear is that if we access that remote memory > > *repeatedly* the aggregate cost will exceed what it would have cost to > > fault that page into the local node just once. But it takes a lot of > > accesses for that to be true, and most of the time you won't get them. > > Even if you do, I bet many workloads will prefer even performance > > across all the accesses over a very slow first access followed by > > slightly faster subsequent accesses. > > Many HPC workloads prefer the opposite. > And they know how to tune accordingly. > > In an ideal world, the kernel would put the hottest pages on the local > > node and the less-hot pages on remote nodes, moving pages around as > > the workload shifts. 
In practice, that's probably pretty hard. > > Fortunately, it's not nearly as important as making sure we don't > > unnecessarily hit the disk, which is infinitely slower than any memory > > bank. > > Shifting pages involves similar tradeoffs as zone reclaim vs. remote > allocations. In practice it really is hard for the kernel to do this automatically. Automatic NUMA balancing will help if the data is mapped but not if it's buffered read/writes because there is no hinting information available right now. At some point we may need to tackle IO locality but it'll take time for users to get experience with automatic balancing as it is before taking further steps. That's an aside to the current discussion. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933170AbaDINI0 (ORCPT ); Wed, 9 Apr 2014 09:08:26 -0400 Received: from cantor2.suse.de ([195.135.220.15]:55474 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932674AbaDINIZ (ORCPT ); Wed, 9 Apr 2014 09:08:25 -0400 Date: Wed, 9 Apr 2014 14:08:19 +0100 From: Mel Gorman To: Josh Berkus Cc: Robert Haas , Christoph Lameter , Vlastimil Babka , Andrew Morton , Andres Freund , Linux-MM , LKML , sivanich@sgi.com Subject: Re: [PATCH 0/2] Disable zone_reclaim_mode by default Message-ID: <20140409130819.GS7292@suse.de> References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <5343A494.9070707@suse.cz> <53445481.3030202@agliodbs.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <53445481.3030202@agliodbs.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Apr 08, 2014 at 03:56:49PM -0400, Josh Berkus wrote: > On 04/08/2014 03:53 PM, Robert Haas wrote: > > In an ideal world, the kernel would put the hottest pages on the local > > node and the 
less-hot pages on remote nodes, moving pages around as > > the workload shifts. In practice, that's probably pretty hard. > > Fortunately, it's not nearly as important as making sure we don't > > unnecessarily hit the disk, which is infinitely slower than any memory > > bank. > > Even if the kernel could do this, we would *still* have to disable it > for PostgreSQL, since our double-buffering makes our pages look "cold" > to the kernel ... as discussed. > If it's the shared mapping that is being used then automatic NUMA balancing should migrate those pages to a node local to the CPU accessing it but how well it works will partially depend on how much those accesses move around. It's independent of the zone_reclaim_mode issue. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S935351AbaDJK0k (ORCPT ); Thu, 10 Apr 2014 06:26:40 -0400 Received: from mx1.redhat.com ([209.132.183.28]:44200 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933885AbaDJK0j (ORCPT ); Thu, 10 Apr 2014 06:26:39 -0400 Message-ID: <534671DB.50802@redhat.com> Date: Thu, 10 Apr 2014 11:26:35 +0100 From: Jeremy Harris User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.4.0 MIME-Version: 1.0 CC: Linux-MM , LKML Subject: Re: [PATCH 0/2] Disable zone_reclaim_mode by default References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <5343A494.9070707@suse.cz> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit To: unlisted-recipients:; (no To-header on input) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 08/04/14 23:58, Christoph Lameter wrote: > The reason that zone reclaim is on by default is that off node accesses > are a big performance hit on large scale NUMA systems (like ScaleMP and > SGI). 
Zone reclaim was written *because* those systems experienced severe > performance degradation. > > On the tightly coupled 4 and 8 node systems there does not seem to > be a benefit from what I hear. In principle, is this difference in distance something the kernel could measure? -- Cheers, Jeremy From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753655AbaDRPtX (ORCPT ); Fri, 18 Apr 2014 11:49:23 -0400 Received: from cantor2.suse.de ([195.135.220.15]:52831 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751459AbaDRPtV (ORCPT ); Fri, 18 Apr 2014 11:49:21 -0400 Date: Fri, 18 Apr 2014 17:49:18 +0200 From: Michal Hocko To: Mel Gorman Cc: Andrew Morton , Robert Haas , Josh Berkus , Andres Freund , Christoph Lameter , Linux-MM , LKML Subject: Re: [PATCH 0/2] Disable zone_reclaim_mode by default Message-ID: <20140418154918.GD4523@dhcp22.suse.cz> References: <1396910068-11637-1-git-send-email-mgorman@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1396910068-11637-1-git-send-email-mgorman@suse.de> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon 07-04-14 23:34:26, Mel Gorman wrote: > When it was introduced, zone_reclaim_mode made sense as NUMA distances > punished and workloads were generally partitioned to fit into a NUMA > node. NUMA machines are now common but few of the workloads are NUMA-aware > and it's routine to see major performance degradation due to zone_reclaim_mode being > enabled but relatively few can identify the problem. > > Those that require zone_reclaim_mode are likely to be able to detect when > it needs to be enabled and tune appropriately so let's have a sensible > default for the bulk of users. 
> > Documentation/sysctl/vm.txt | 17 +++++++++-------- > include/linux/mmzone.h | 1 - > mm/page_alloc.c | 17 +---------------- > 3 files changed, 10 insertions(+), 25 deletions(-) Auto-enabling caused so many reports in the past that it is definitely much better to not be clever and let admins enable zone_reclaim where it is appropriate instead. For both patches. Acked-by: Michal Hocko -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753123AbaDRQol (ORCPT ); Fri, 18 Apr 2014 12:44:41 -0400 Received: from qmta08.emeryville.ca.mail.comcast.net ([76.96.30.80]:34245 "EHLO qmta08.emeryville.ca.mail.comcast.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751352AbaDRQoh (ORCPT ); Fri, 18 Apr 2014 12:44:37 -0400 Date: Fri, 18 Apr 2014 11:44:34 -0500 (CDT) From: Christoph Lameter X-X-Sender: cl@gentwo.org To: Michal Hocko cc: Mel Gorman , Andrew Morton , Robert Haas , Josh Berkus , Andres Freund , Linux-MM , LKML Subject: Re: [PATCH 0/2] Disable zone_reclaim_mode by default In-Reply-To: <20140418154918.GD4523@dhcp22.suse.cz> Message-ID: References: <1396910068-11637-1-git-send-email-mgorman@suse.de> <20140418154918.GD4523@dhcp22.suse.cz> Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 18 Apr 2014, Michal Hocko wrote: > Auto-enabling caused so many reports in the past that it is definitely > much better to not be clever and let admins enable zone_reclaim where it > is appropriate instead. > > For both patches. > Acked-by: Michal Hocko I did not get any objections from SGI either. Reviewed-by: Christoph Lameter
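The auto-enabling heuristic debated in this thread (turn zone reclaim on when some node is "far enough" away, with a threshold that could be raised) can be sketched against the node distances Linux exposes in `/sys/devices/system/node/nodeN/distance`. This is a rough illustration, not the kernel's code: the function names, the default threshold of 30, the strict greater-than comparison, and the sample distance matrices are all assumptions made for the example.

```python
import glob


def parse_distance_row(text):
    """Parse one nodeN/distance line such as "10 21 21 31" into integers."""
    return [int(field) for field in text.split()]


def exceeds_reclaim_distance(rows, threshold=30):
    """Return True if any node-to-node distance is above the threshold.

    The threshold default is an assumed value for illustration.
    """
    return any(distance > threshold for row in rows for distance in row)


def read_distance_rows(pattern="/sys/devices/system/node/node*/distance"):
    """Read the distance rows from sysfs; returns [] on non-NUMA-sysfs systems."""
    rows = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as handle:
            rows.append(parse_distance_row(handle.read()))
    return rows


# Hypothetical matrices: a tightly coupled box and a large, loosely coupled one.
tightly_coupled = [parse_distance_row("10 21"), parse_distance_row("21 10")]
large_scale = [parse_distance_row("10 40"), parse_distance_row("40 10")]
print(exceeds_reclaim_distance(tightly_coupled))
print(exceeds_reclaim_distance(large_scale))
```

With the patches in this series the decision is left to the administrator instead, who can inspect and set the value through `/proc/sys/vm/zone_reclaim_mode`.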